# Using Active Learning to Improve Ethnic Group Classification

March 21, 2022

**Introduction**

Although Active Learning is not yet broadly used, some of the best-known companies, such as Facebook, Google, and Twitter, already use it in their projects. Because of its benefits, we also decided to use it in the project *"Using Computer Vision to Detect Ethnicity in Videos and Improve Ethnicity Awareness"*, where impact-driven startup Ceretai partnered with Omdena to develop an ethnic group classification system based on computer vision.

This article will introduce the concept of Active Learning, show how it can be applied in an Artificial Intelligence project, and present an example of its results.

**What is Active Learning?**

*Active Learning, or Human-in-the-Loop (HITL) Machine Learning, is a branch of Machine Learning in which a classification algorithm interacts with an information source. This source is generally a human, but it can be any other entity able to provide the required information. In the literature, this information source is usually called the oracle or teacher.*

The main objective of Active Learning is to support the labeling or annotation process. In a traditional labeling process, you would take the whole set or a random sample of unlabeled data, annotate it and use it to train and validate your model. Instead, the innovative strength of Active Learning lies in how the data is selected to be labeled.

Annotating can be a tedious process that demands a heavy investment of time and people. If we could choose the most informative data points, the ones that would improve the model the most, instead of selecting them randomly, we would massively reduce the number of data points that need labeling to reach a specific model performance target. Therefore, including Active Learning in our project increases both the efficiency and the efficacy of building the data set.

But Human-in-the-Loop Machine Learning is not always the best solution. For example, introducing it in our project would be costly if we outsourced the process, or expensive in time and workload if we developed an in-house application. Therefore, we need to be sure that we meet the prerequisites to apply it and that the efficiency gains will pay back the resources invested.

**Why did we use it in our project?**

Our project was, by definition, a computer vision task, and it could be divided into two stages: first, a model that detects the faces in any image or video frame, and second, a model that classifies the extracted faces according to their ethnic group. It was in this second stage that we applied *Active Learning* to build the data set to train the model.

In general, computer vision projects use models that require vast amounts of labeled images to be trained. Our case was no different: we wanted to classify a person's ethnic group just by looking at their face. A task that is complex for any human, even for trained specialists, requires even more data points to train on than a regular computer vision model.

Here is where we found the challenge's biggest problem: we needed an extensive labeled data set, but we had none. The obvious initial solution was to find one or more on the internet, but we could only find one small data set that we could use. In the end, we realized the situation was an exact fit for Active Learning, and we already met the minimum requirements to use it:

- We didn’t have enough resources (people and time) to build the required data set ourselves by randomly labeling a significant quantity of images.
- We had an initial seed of labeled data to start the process.
- We also had a model to be trained.

The plan was to build our own data set, labeled by ourselves, but smaller: a size we could achieve with our limited resources, yet effective enough to reach a performance similar to using a much bigger data set. The tool for that was Active Learning.

**Overview of the process**

The whole Active Learning process can be structured as two different loops, one embedded into the other, the Outer Loop and the Inner Loop. In other words, each Outer Loop iteration requires a complete Inner Loop before continuing to the next one. Eventually, the process finishes once the Outer Loop achieves the performance target.

It is pretty simple to think of the process as a “black box”. It requires two inputs to start: the model we want to train, and an initial seed of labeled data to minimally train the model in the first iteration. This initial seed can be as tiny as we want, but the smaller it is, the more noise we will have in the first iterations, because if the model does not have a minimum of information, it will struggle to find the best data points to label. Soon enough, though, it will stabilize and improve the performance metrics faster than random labeling would.

There is another input to the system, but in this case it is neither unique nor provided only at the beginning of the process: we need to source a new set of unlabeled data points at each *Outer Loop* iteration. During our challenge, we learnt that it was essential to control this data input, not only in terms of quality but also in terms of bias and balance, because if we feed the system with unbalanced or biased data, the data set we are building will eventually become biased or unbalanced too. At the same time, we should still provide enough variety of new points, e.g. different angles of the faces in our challenge, for the model to learn further.

Finally, we get two outputs from the system once the process is completed: a trained model that meets the target performance and, most importantly, the high-quality data set used to train it. Although the data set built is optimized for the model used, it should perform well with other algorithms of similar characteristics.

**Outer Loop**

First, we will get into the first layer of our “black box”, the Outer Loop. The main objective of this layer is to feed the Inner Loop with new unlabeled data points and, afterwards, evaluate whether the required criteria for the classification model are met or further iterations are needed.

The inputs to the loop were explained in the previous section: the classification model, the initial data seed, and a new unlabeled data set for each iteration. Once we have them, we extract the over-fitting set from the training/validation set. The purpose of this data set will be discussed further in the following sections; in short, it is used to confirm that each Inner Loop iteration is not over-fitting towards the new data we are adding and thereby deteriorating the model's performance on the previous training/validation data.

The next step of the process is the Inner Loop, followed by the decision of whether to continue with another Outer Loop iteration or stop the process. Again, the question is simple: is the model performance good enough? Does it meet the criteria established before starting the Active Learning process? If the answer is yes, we can bring the loop to an end. If it is not, we move on to prepare the next iteration.

Before moving to a new iteration, we first return the over-fitting set to where it belongs, the training/validation set. Secondly, we scrap the data points of the unlabeled set that we did not use in the Inner Loop. And lastly, we source a new batch of unlabeled data points to start the next iteration.
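The Outer Loop above can be summarized in a short skeleton. This is only a structural sketch: `source_unlabeled_batch`, `run_inner_loop`, and `performance` are hypothetical stubs marking where the real data sourcing, the Inner Loop, and the evaluation criteria would plug in.

```python
# Hypothetical stubs marking where the real project code plugs in.
def source_unlabeled_batch(iteration):
    """Source a fresh batch of unlabeled point ids for this iteration."""
    return list(range(iteration * 10, iteration * 10 + 10))

def run_inner_loop(model, train_val, overfit_set, unlabeled):
    """Stand-in for the Inner Loop: label informative points and re-train.
    The stub simply moves the whole batch into the labeled set."""
    return model, train_val + unlabeled

def performance(model, train_val):
    """Stub metric: grows with the size of the labeled set."""
    return min(1.0, len(train_val) / 50)

model, train_val = "classifier", list(range(100, 120))  # initial labeled seed
target, iteration = 0.95, 0
while performance(model, train_val) < target:           # Outer Loop
    overfit_set, train_val = train_val[:5], train_val[5:]  # extract over-fitting set
    unlabeled = source_unlabeled_batch(iteration)          # fresh unlabeled data
    model, train_val = run_inner_loop(model, train_val, overfit_set, unlabeled)
    train_val = train_val + overfit_set                 # return the over-fitting set
    iteration += 1
```

With these stubs, the loop terminates after three iterations, once the labeled set reaches the size the stub metric treats as good enough; in the real process the exit condition is the model's target metric instead.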

**Inner Loop**

As explained earlier, the next and deepest layer of the Active Learning process is the Inner Loop, embedded in the Outer Loop. Therefore, from the Outer Loop, it receives four different inputs:

- The training/validation set, built over the previous Outer Loop iterations.
- The brand new unlabeled set.
- The over-fitting set, which represents the training/validation set distribution across all Inner Loop iterations.
- The latest classification model.

Once the inputs are fixed, the first step is to extract an iteration validation set from the unlabeled set. This small sample of the unlabeled set will be manually labeled and used later to validate the iteration's performance at the end of the loop. Hence, it should be statistically representative of the different data points we want to classify. We will get into the details of this crucial point in a later section.

Following the iteration validation set extraction, we move on to extracting the most informative points for our model. Initially, we need to obtain, for each data point, the coefficients, which are the probabilities of that data point being classified as each label, and the features, which are usually a numeric vector representing the data point in a broader feature space. At this point, the model has given us enough information to sample the most informative data points.

The data sampling can follow many different strategies, from simple random selection to very complex sampling combinations. We chose to apply two different methods in series for our challenge: uncertainty sampling using entropy as a metric and, afterwards, diversity sampling with a clustering algorithm. We chose this strategy because it struck the right balance between complexity and performance. Then, we manually label the sampled data points, just as we did with the iteration validation set earlier.

Now, we combine these new labeled data points with the training/validation data set and re-train the model. Once the model is trained using the latest data set, we need to evaluate how successful the iteration was and if we should run another one or stop here. The decision will be based on two factors:

- Are we over-fitting towards the new data? Using the over-fitting set, passed as input to the Inner Loop, we must ensure that the model's performance on the old data is at least maintained.
- Is the model's classification performance still improving? Using the iteration validation set, we need to confirm that the unlabeled data set still gives helpful information to the model; otherwise, it is meaningless to keep using that data source.

If we determine that the model is not over-fitting and is still improving when adding labeled points from the current unlabeled set, we run another iteration, going back to the initial step of extracting a new iteration validation set. We repeat the process until at least one of the conditions fails; then we bring the Inner Loop to an end and go back to the Outer Loop.
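To make the loop concrete, here is a minimal, runnable sketch of one Inner Loop using scikit-learn on synthetic data. It is an illustration under assumptions, not our challenge code: a logistic regression stands in for the face classifier, the human oracle is simulated by the known labels `y`, and the batch sizes and the 0.05 over-fitting tolerance are arbitrary example values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the face-embedding data; the human "oracle"
# is simulated by the known true labels y.
X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
seed_idx = np.arange(80)                                # initial labeled seed
pool_idx = np.arange(80, 2000)                          # this iteration's unlabeled set
overfit_idx, train_idx = seed_idx[:20], seed_idx[20:]   # static over-fitting set

model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
prev_overfit_acc = accuracy_score(y[overfit_idx], model.predict(X[overfit_idx]))

for step in range(5):                                   # Inner Loop iterations
    # 1. Extract and "annotate" an iteration validation set from the pool.
    val_idx, pool_idx = pool_idx[:30], pool_idx[30:]
    # 2. Uncertainty sampling: pick the highest-entropy pool points.
    probs = model.predict_proba(X[pool_idx])
    entropy = -np.sum(probs * np.log2(probs + 1e-12), axis=1)
    picked = pool_idx[np.argsort(entropy)[::-1][:50]]
    pool_idx = np.setdiff1d(pool_idx, picked)
    # 3. "Annotate" the picked points and re-train on the grown set.
    train_idx = np.concatenate([train_idx, picked])
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # 4. Stop if we over-fit towards the new data (old-data accuracy drops);
    #    val_acc is the metric the improvement check would compare across steps.
    overfit_acc = accuracy_score(y[overfit_idx], model.predict(X[overfit_idx]))
    val_acc = accuracy_score(y[val_idx], model.predict(X[val_idx]))
    if overfit_acc < prev_overfit_acc - 0.05:
        break
    prev_overfit_acc = max(prev_overfit_acc, overfit_acc)
```

Each pass mirrors the four steps above: extract a validation sample, select informative points, label and re-train, then decide whether to continue.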

**Validating Active Learning**

As you have probably already imagined, validating our steps in the Active Learning process is key to achieving the desired efficiency and efficacy. If we fail to validate, we can start adding bias to our data set from human labeling errors. We could also lose efficiency by annotating meaningless data points or, even worse, give too much importance to a few new data points and worsen the classification power of our model.

To avoid all those problems, we used three additional validations, apart from the usual validation when training a model:

- Validation on new data.
- Validation on old data.
- Labeling validation.

**Model Validation on New Data**

One of the key aspects and benefits of Active Learning is its superior efficiency. The differentiating factor is that instead of randomly labeling all the data points, we only label the most informative ones, the ones that will improve our classification model the most. Therefore, we should confirm that the data points added at each iteration improve the model's target metric.

The most critical factor in any validation data set is its source: it should represent whichever data set or distribution we want to validate against. In this case, we want to evaluate whether the model is generalizing and performing well on new data. Hence, the best source of new data, data that the model has never seen, is the unlabeled data set. Although it adds more work to the process, the benefit amply pays off the additional burden of more manual labeling.

In addition, we need to ensure that the sample is balanced and all the possible labels are well represented. This problem needs to be tackled from two sides. First, the population we are sampling from should have those characteristics. Even though the unlabeled set is not annotated, we must find a way to confirm it is balanced across all our possible labels and other factors. For instance, in our challenge, we searched for a balanced representation of each ethnic group as well as gender and age. Then, the sample needs to be big enough to show the same distribution as the original population, but not so big that the tedious task of manual annotation grows out of hand.
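As a small illustration of drawing a balanced sample from the pool, assuming the unlabeled data comes with auxiliary metadata (the `gender` column below is hypothetical), a grouped sample keeps every group equally represented:

```python
import pandas as pd

# Hypothetical metadata for the unlabeled pool: the target labels are
# unknown, but auxiliary attributes let us balance the drawn sample.
pool = pd.DataFrame({
    "image_id": range(12),
    "gender": ["f", "m"] * 6,
})

# Draw the same number of points per group, giving a balanced
# iteration validation sample to annotate.
val = pool.groupby("gender", group_keys=False).sample(n=3, random_state=0)
```

In practice the grouping would combine several attributes (ethnic group proxy, gender, age band), but the mechanism is the same.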

**Model Validation on Old Data**

The other control we need to conduct on the model is validating its performance on the old data. The objective is to monitor whether the model starts to over-fit, in this case towards the new data points we are adding. In other words, we want to avoid losing predictive power on the data we already had because the model is getting too complex and is not generalizing well. As soon as we find out that the model is over-fitting, we should stop the iteration, scrap the current unlabeled set, and start with a new one, because the new data points were taking on too much importance in the overall data set.

Again, it is crucial to determine where we should extract this over-fitting set from. We will use the training/validation set, where we store the newly labeled data at each iteration, and we will extract it before starting each Inner Loop, as we don’t want to contaminate the original distribution with new data points. Therefore, this data set stays static while we run the Inner Loop, but it should be renewed at each Outer Loop iteration to reflect the updated base data set.

In contrast to the model validation on new data, we don’t need to be careful about the population we sample the over-fitting set from: the base data set is fixed, so we can’t change or re-balance it directly. But we have to be sure that the sample size is large enough to represent our training/validation set population. Otherwise, we would incur bias errors when validating on old data, and we could be over-fitting without noticing it.

**Labeling validation**

The labeling validation is the most flexible of the three and probably the most expensive. The objective is to review the human side of the Human-in-the-Loop framework, the manual annotation of data points. As a human activity, it is vulnerable to typical human errors, such as bias, lacking or partial knowledge, annotation slips, etc. For example, in our challenge, classifying faces by ethnic group, we faced some of these errors:

- Perception bias: each person has a different perception of ethnic groups. It might be challenging for a person from Southeast Asia to differentiate between a white person from Southern Europe and a Middle Eastern person.
- Traditional stereotypes: some facial features have traditionally been assigned to an ethnic group. But in reality, not all of them are inherent to just one ethnic group, and others are not shared across the whole group.
- Annotation errors: while labeling thousands of face images, human mistakes such as annotating the wrong image were not unusual.

These errors are tough to detect because there is no automatic way to do so; it takes another human to review and find them. And their effects on the model's predictive power are devastating: wrongly labeled data points lead the model to misleading learnings. Therefore, having a clear and robust strategy to avoid them is crucial. However, there is no “one size fits all” solution, and no infallible strategy either, just good practices such as pair annotation, peer review, guideline definition, user-friendly annotation software, etc.

Using these kinds of solutions will help mitigate the errors we could introduce in our data set. Once inside, such errors are almost impossible to revert, reducing the quality of the data set and the efficacy of our *Active Learning* process.

**Sampling strategies**

The other key aspect of the Active Learning process, which defines its efficiency, is the sampling strategy used to select the data points to label. At the end of the day, the primary purpose of Human-in-the-Loop is to annotate just the essential and most informative data points. The better we choose them, the better the result will be.

The literature shows a wide range of sampling strategies, specific sampling methods combined in various ways. We used a serial combination of uncertainty sampling (using entropy) followed by diversity sampling (using a clustering algorithm), because of its simplicity to code, the ease of obtaining the required information (the coefficients and features were easy to extract and almost independent of the model used), and its theoretical robustness.

How does the serial combination work? First, we apply uncertainty sampling to the whole data set, extracting a certain amount or percentage of the total size. As we need to conduct a subsequent selection, the size of this first sample needs to sit between the whole population and the final sample size we want to obtain. Then, once we have this intermediate sample, we apply diversity sampling to extract the final set of data points that we will annotate.

Why first uncertainty and then diversity sampling? Beginning with uncertainty sampling will leave us just with the data points where the model struggles to classify and scrap the ones the model is confident about. Afterwards, we do diversity sampling because we want variety, not just passing similar data points because they were the most difficult to classify for the model. Otherwise, we would end up selecting very similar and repetitive data points, missing loads of essential points and losing efficiency.

**Uncertainty Sampling**

The objective of the uncertainty sampling method is almost self-explanatory: it extracts the data points where the model was the most uncertain or, in other words, where the probabilities of selecting each label were the most similar to each other. In this way, we remove all the points that wouldn’t add any value to our model if they were to be labeled.

There are many algorithms available to identify the most uncertain unlabeled data points, for example least confidence, margin of confidence, ratio of confidence, and classification entropy. Each of them has different pros and cons: some are more complex, while others are simpler but less robust. With that in mind, for our challenge we decided to go for the classification entropy algorithm, as it is one of the most robust and also not too complex to implement.

We can understand the concept of entropy as how surprised we would be by each possible outcome, given their probabilities. The highest-entropy scenario is when all the possible outcomes have exactly the same probability: you don’t know what the label will be, and you would be equally surprised whichever label is chosen. To put the algorithm into practice, we just need the coefficients, the probabilities of each label being selected by the model, as input and apply the entropy formula:

H(x) = − Σ P(y|x) · log₂ P(y|x), summing over all *n* possible labels *y*

where P(y|x) is the probability of a data point *x* being labeled as *y* and *n* is the total number of different labels. The reason for using the logarithm with base two is out of the scope of this article, but it is covered, together with a deeper explanation of this formula, in the book mentioned in the reference section.

Once we have obtained the entropy of each data point, we just need to take the points with the highest entropy and move on to the next stage, diversity sampling.
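A minimal sketch of this step in Python, assuming the model exposes per-label probabilities (for instance via scikit-learn's `predict_proba`):

```python
import numpy as np

def entropy_sample(probs, k):
    """Return the indices of the k most uncertain points by classification
    entropy. probs has shape (n_points, n_labels); rows sum to 1."""
    eps = 1e-12                          # guard against log2(0)
    entropy = -np.sum(probs * np.log2(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> highest entropy
    [0.70, 0.20, 0.10],   # in between
])
print(entropy_sample(probs, 2))   # -> [1 2]
```

The near-uniform row is selected first, exactly the "equally surprised by any label" case described above, while the confident prediction is scrapped.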

**Diversity Sampling**

The second step in the sampling process is diversity sampling. Now that we have only the data points where the model is most uncertain, we want to optimize the annotation effort further. How does diversity sampling help with this? Knowing that all the remaining data points are informative, we should select the most different ones. We want a “diverse” selection: we don’t want to label repeated data points that, however informative, give the same information to the model.

We will group similar data points to ensure that we take a varied sample from the most uncertain data points. In other words, we will make clusters. We are not looking for a specific number of clusters or extracting meaning from them; we only need to separate the data points. A simple clustering algorithm like KMeans will suffice for this purpose. However, there is an added difficulty if you work with high dimensionality data points, like computer vision or NLP (Natural Language Processing) projects.

Euclidean distance-based clustering algorithms, such as KMeans, do not work well in high-dimensionality situations. Therefore, we need an alternative: either use a distance measure that works well in such problems, like cosine similarity, or reduce the dimensions before applying the clustering algorithm. In our case, we applied the latter option, using the PCA (Principal Component Analysis) algorithm, because we found it simpler to implement with good performance. It had a drawback, though: it added another hyperparameter to tune, the number of dimensions to reduce to.

Once we have fixed how we separate the data points into clusters, we need to define which points of each cluster we want to extract. In the Human-in-the-Loop Machine Learning book, the author proposes using three different types of data points to maximize diversity:

- Closest to the cluster centroid. These are the cluster’s most representative items and therefore stand for many similar points, so labeling them helps annotate many data points at once. They should represent around 40% of the total.
- Furthest from the cluster centroid. These are the outliers, data points with little similarity to the others that the model still needs to learn to annotate. Furthermore, they might represent a new large group of similar data points that the model has not yet been able to discover. Their quantity should be similar to the first type.
- Random points in the cluster. We also want some randomness to discover new information that the model is unaware of. These should represent around 20% of the total.

Now that we have divided our data space into clusters and selected a diverse sample from each of them, we can say that we have a varied and informative sample of data points to annotate and train our model.
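Putting this section together, here is a sketch of the diversity-sampling step: PCA for dimensionality reduction, KMeans for clustering, and the near/far/random split per cluster. The proportions and hyperparameters are illustrative assumptions, not our challenge's tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def diversity_sample(features, n_clusters=4, n_components=10, per_cluster=5, seed=0):
    """Cluster the uncertain points and draw ~40% nearest the centroid,
    ~40% furthest from it, and the rest at random from each cluster."""
    rng = np.random.default_rng(seed)
    # Reduce dimensionality first: Euclidean KMeans degrades in high dimensions.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(reduced)

    selected = set()
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
        order = members[np.argsort(dist)]             # nearest -> furthest
        n_near = n_far = max(1, int(0.4 * per_cluster))
        picks = set(order[:n_near]) | set(order[-n_far:])
        rest = [i for i in members if i not in picks]
        n_rand = max(0, per_cluster - len(picks))
        if rest and n_rand:
            picks |= set(rng.choice(rest, size=min(n_rand, len(rest)), replace=False))
        selected |= picks
    return sorted(selected)

# Toy high-dimensional features standing in for face embeddings,
# i.e. the output of the earlier uncertainty-sampling stage.
X = np.random.default_rng(1).normal(size=(200, 64))
sample = diversity_sample(X)
```

In the full pipeline, `features` would be the subset already filtered by entropy, so the final annotation batch is both uncertain and diverse.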

**Conclusions**

The graph above shows the model performance improvement when comparing random annotation with Active Learning annotation. In this case, we applied it during our challenge using a basic classification model; a more complex model would show much larger differences. We can see that the model reaches its best performance after randomly labeling 21,000 pictures, whereas with Active Learning a very similar level is reached with just 13,000 annotated pictures.

In this article, we have discussed the Active Learning framework and how we applied it in our Omdena challenge, “Using Computer Vision to Detect Ethnicity in News and Videos and Improve Ethnicity Awareness”. We can summarize the main takeaways in the following points:

- Active Learning is a branch of Machine Learning that works through the interaction of an algorithm and a human source of information.
- The most significant advantage of Active Learning is that it drastically reduces the number of data points to be labeled and, therefore, the required resources.
- Active Learning annotation requires a specific framework to be set up, several hyperparameters to be tuned, the classification model and an initial seed of labeled data points, whereas random annotation needs none of those, just the data points to be labeled.
- Because of the required initial investment, active learning is not a methodology applicable to every project. It best suits complex projects that require a large amount of training data, such as computer vision or NLP (Natural Language Processing).
- A strict and continuous validation at each iteration is essential for the final performance of the Active Learning process.
- The final performance of Active Learning mainly relies on which unlabeled data we select to label manually. There are many sampling strategies and combinations of them, so again there is no “one size fits all” solution; the choice depends on the problem and the available resources.

**References**

- Monarch, R. M. 2021, Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI, Manning Publications, Shelter Island.

—

*This article is written by Diego Quintana Esteve.*

**Ready to test your skills?**

If you’re interested in collaborating, apply to join an Omdena project at: https://www.omdena.com/projects
