Emily Langhorne September 14, 2022

An Introduction to Active Learning in Machine Learning

When machine learning engineers want to improve model performance, their first port of call is their training data. They must ensure both that they have large volumes of annotated data and that the data contains useful information from which the model can learn. Unfortunately, data annotation can be a costly endeavour. Many teams don’t have the time, money, or manpower to label and review each piece of data in these vast datasets.

Fortunately, active learning pipelines can help.

What is Active Learning?

Active learning (AL) is a procedure for training models in which the model queries annotators for information that can help improve its performance. 

The model trains on an initial subset of labelled data from a large dataset. Then, it tries to make predictions on the rest of the unlabelled data based on what it has learned from its training. ML engineers evaluate how certain the model is in its predictions and, by using a variety of acquisition functions, they can quantify how much benefit annotating each of the unlabelled samples will yield.

The model is deciding for itself what additional data will be the most useful for its learning by expressing uncertainty in its predictions. In doing so, it asks annotators to annotate more examples of only that type of data. By telling them which data it needs clarification on, the model saves annotators from wasting time labelling other data unnecessarily. The more uncertain a model is about a sample, the more that sample needs to be annotated.

A common sampling method involves looking at the entropy of the probability distribution of a prediction. Entropy quantifies the average uncertainty in a set of outcomes. For example, in image classification, the model reports a probability for each class it considers every time it makes a prediction. ML engineers then use these numbers to decide how certain the model is in its prediction. For instance, if a model is categorising images of vehicles and reports a 99 percent probability that an image is of a motorcycle, then that’s a prediction with a high level of certainty. If the same model encounters an image of a truck and reports only a 55 percent probability that the image is a truck, then that’s a low level of certainty. ML teams now know that, to improve its performance, the model needs to train on more labelled images of trucks but not motorcycles.
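To make the entropy idea concrete, here is a minimal sketch in Python. The function name `prediction_entropy` and the two example probability lists are illustrative assumptions, not the output of a real model. It computes the Shannon entropy of a single predicted class distribution, so a near-certain prediction scores close to zero and a hesitant one scores higher.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    # Classes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical class probabilities for two images (three classes each).
confident = [0.99, 0.005, 0.005]  # e.g. the motorcycle prediction
uncertain = [0.55, 0.30, 0.15]    # e.g. the truck prediction

print(prediction_entropy(confident))
print(prediction_entropy(uncertain))
```

The second value is the larger of the two, so under this acquisition function the truck image would be sent to annotators before the motorcycle image.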

Let’s consider another simple example. Machine learning engineers are training a model to classify images of animals. After initially training on a subset of labelled data, the model can identify images of cats as “cat” and images of birds and fish as “not cat” with high certainty; however, when it encounters images of a dog, the model reports a 51 percent probability that it’s “not cat” and a 49 percent probability that it is a cat. The model is saying, “I’m uncertain about how to identify a dog. I need more labelled images of this animal.” The ML engineers now know that it needs to be fed more labelled images of dogs, so they won’t waste annotation resources labelling images of other animals.
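The cat example can be sketched with a different acquisition function, margin sampling, which ranks samples by the gap between the two most probable classes. The sample ids, the probabilities, and the helper names `margin` and `select_for_annotation` below are all hypothetical and chosen only to mirror the example.

```python
def margin(probs):
    """Gap between the two highest class probabilities (small = uncertain)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` samples with the smallest margin."""
    ranked = sorted(predictions, key=lambda sample_id: margin(predictions[sample_id]))
    return ranked[:budget]

# Hypothetical [P(cat), P(not cat)] outputs for four unlabelled images.
predictions = {
    "cat_01": [0.97, 0.03],
    "bird_07": [0.04, 0.96],
    "dog_12": [0.49, 0.51],
    "fish_03": [0.10, 0.90],
}

print(select_for_annotation(predictions, budget=1))  # ['dog_12']
```

With an annotation budget of one image, only the dog is queued for labelling; the confidently classified cat, bird, and fish are left alone.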

The samples with a high level of uncertainty go back to the annotators who review and label them. ML engineers then use this newly labelled data to retrain the model before again testing it on more unlabelled data. 


Because the active learning pipeline begins with a small labelled dataset, the initial predictions that the model makes on the unlabelled data are unlikely to be accurate. However, this loop (training, testing, identifying uncertainty, annotating, retraining) continues until the model reaches an acceptable performance threshold. At that point, the model’s predictions with a high level of certainty can be sent downstream for use in production while the others are sent back to the annotators, keeping the loop active and improving.
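The loop described above can be sketched end to end with a deliberately tiny stand-in for every component. Everything here is an assumption made for illustration: the "model" is a single threshold on a one-dimensional feature, `true_label` plays the role of the human annotator (and, for brevity, of a held-out labelled evaluation set), and the performance target of 0.95 is arbitrary. A real pipeline would swap in an actual model, an annotation tool, and a proper validation set.

```python
def true_label(x):
    # Stand-in for a human annotator (hypothetical ground truth: boundary at 62).
    return 1 if x >= 62 else 0

def train(labelled):
    # Toy model: a threshold halfway between the largest labelled negative
    # and the smallest labelled positive.
    largest_negative = max(x for x, y in labelled if y == 0)
    smallest_positive = min(x for x, y in labelled if y == 1)
    return (largest_negative + smallest_positive) / 2

def accuracy(threshold, eval_points):
    # Fraction of evaluation points the threshold classifies correctly.
    correct = sum((x >= threshold) == bool(true_label(x)) for x in eval_points)
    return correct / len(eval_points)

def active_learning_loop(labelled, pool, eval_points, batch_size, target, max_rounds):
    labelled, pool = list(labelled), list(pool)
    threshold = train(labelled)
    for _ in range(max_rounds):
        # Stop once performance is acceptable or nothing is left to label.
        if accuracy(threshold, eval_points) >= target or not pool:
            break
        # Identify uncertainty: the points nearest the decision threshold.
        pool.sort(key=lambda x: abs(x - threshold))
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Annotate the queried batch, then retrain on the enlarged labelled set.
        labelled += [(x, true_label(x)) for x in batch]
        threshold = train(labelled)
    return threshold, labelled

threshold, labelled = active_learning_loop(
    labelled=[(0, 0), (100, 1)],
    pool=range(5, 100, 5),
    eval_points=range(1, 100, 2),
    batch_size=2,
    target=0.95,
    max_rounds=10,
)
print(threshold, len(labelled))
```

In this toy run the loop stops after two rounds of querying, having requested labels for only four of the nineteen pool samples, because each batch is drawn from the region where the model is least sure.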

Active learning pipelines also help ML engineers identify edge cases. If a model makes a prediction with a high level of uncertainty, the data may not fit neatly into any of the categories that the model has been designed to detect. Catching an outlier in a vast annotated training dataset is difficult, but with active learning methods, the model flags a potential outlier through its uncertainty. Because the ML engineers retrain the model with the newly labelled sample, it learns to identify the edge case.

Why Use Active Learning in Machine Learning?

Data labelling is a laborious and expensive process, especially if the labelling requires subject-matter expertise, and training a model on large amounts of data can be time-consuming, resulting in high computing costs. A well-implemented active learning pipeline can make model training faster and cheaper while making data labelling less painful for annotators.

In many cases, annotating all the data in a massive dataset doesn’t create much additional value when it comes to model performance. Intelligently selecting and labelling a portion of the dataset and then using that portion for active learning can result in increased model performance and decreased costs. Rather than invest in labelling all the data, organisations can invest their money in having annotators review and label samples of data specifically targeted to improve performance.

With an AL pipeline, ML teams can prioritise labelling data that will be most useful in training the model. Better yet, this process is managed on a continuous basis so that as new data is annotated and used for training, the model can adjust and determine what remaining data should be annotated next. This type of dynamic active learning, in which you iterate between annotating useful samples and training the model with new data, keeps self-correcting.

Somewhat surprisingly, active learning is also useful when ML engineers have a large amount of already labelled data. When it comes to training models, annotation isn’t the only expense, and companies have to think about the computational costs incurred during training. Training the model on every piece of labelled data available in a dataset might be a poor allocation of resources. ML engineers can use active learning to select a subset of data that’s most useful for the model’s learning from the already labelled dataset. They do so by performing computations to intelligently select a sample of annotated data before the training begins.

In this type of static active learning, the data is already labelled and the selection only occurs once, so ML teams don’t continuously manage the AL process. Static active learning makes the model training faster and thus cheaper. With a well-implemented AL pipeline, machine learning teams can get the best bang (model performance) for their buck (training budget).
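The article does not prescribe a particular computation for this one-off selection. As one possible illustration, the sketch below uses a greedy k-center (farthest-point) heuristic, which picks a subset that is spread across the feature space. The function name `k_center_greedy`, the two-dimensional feature vectors, and the budget are assumptions for the example, and other selection criteria would be equally valid.

```python
import math

def k_center_greedy(points, budget):
    """Pick `budget` indices so the chosen points are spread across the dataset."""
    selected = [0]  # arbitrary starting point (an assumption of this sketch)
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < budget:
        # Take the point that is farthest from everything selected so far.
        next_index = max(range(len(points)), key=nearest.__getitem__)
        selected.append(next_index)
        nearest = [
            min(d, math.dist(p, points[next_index]))
            for d, p in zip(nearest, points)
        ]
    return selected

# Hypothetical feature vectors for six already-labelled samples.
features = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (10, 0)]

print(k_center_greedy(features, budget=3))  # [0, 5, 3]
```

Here three of the six samples are kept, one from each visible cluster, and the near-duplicates are skipped. Because the selection happens only once, there is no later round in which a poor choice could be corrected.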

Considerations For Building An Active Learning Pipeline

Implementing active learning pipelines requires specific knowledge about how to set up pipelines and intelligently select samples based on the uncertainty of the model’s predictions. Choosing the metric used to quantify a model’s certainty without knowing what information the dataset contains is a challenging task that requires a specific skill set. If an ML team doesn’t already have this expertise on hand, hiring a new team member with the right skills will increase operational costs.

When it comes to dynamic active learning, model development and training teams need to have seamless integration with annotation teams. They have to coordinate schedules and communicate frequently as they train the model, evaluate predictions, review and label data, and retrain the model. If ML teams have to wait on the annotated data, training stalls until the labels arrive, so data samples must be annotated “on the go.” An excellent task management system must be in place to minimise delays; otherwise, any gains in efficiency made by using dynamic active learning could be lost to waiting on annotations.

Static active learning doesn’t require this coordination, but it comes with its own challenges. Because the model trains only on the subset chosen up front, the pipeline is rigid in its dependency on this initial selection. ML engineers have to trust that their first computation is correct, and getting it wrong comes at a very high cost. There’s no self-correction loop: if that first computation fails to account for something, then the selected data won’t be as useful as the ML engineers originally thought.

To overcome this challenge, ML teams can use a combination of the two methods. First, they can use a static active learning pipeline to select some of the annotated data. Once they have a relatively robust model, they can allow the model to query for more annotations, either from existing annotated data or by working seamlessly with an annotation team. Because a robust model has encountered more data, its designers will be more comfortable that it is correct in its uncertainty, and the model will perform the dynamic active learning loop more efficiently.

Active Learning and Data-Centric AI

The future of AI is data-centric, and the future of model performance depends on getting smarter about how we label and manage training data. We need to query data in an intelligent way. We need to evaluate label and data quality while reducing the number of humans in the loop and the amount of time that they spend there.

Embracing active learning techniques is the logical next step for the industry: their ability to increase efficiency and improve model performance is simply too valuable to ignore.