How to Find and Fix Label Errors

Nikolaj Buhl • December 15, 2022 • 12 min read

Contents

Introduction
The Three Types of Label Errors in Computer Vision
How to Fix Label Errors?
Conclusion

Introduction

As Machine Learning teams continue to push the boundaries of computer vision, the quality of our training data becomes increasingly important. A single data or label error can impact the performance of our models, making it critical to evaluate and improve our training datasets continuously.

In this first part of the blog post series “Data errors in Computer Vision”, we'll explore the most common label errors in computer vision and show you how you can quickly and efficiently mitigate or fix them. Whether you're a seasoned ML expert or just starting out, this series has something for everyone. Join us as we dive into the world of data errors in computer vision and learn how to ensure your models are trained on the highest quality data possible.

The Data Error Problem in Computer Vision

At Encord, we work with pioneering machine learning teams across a variety of different use cases. As they transition their models into production, we have noticed an increasing interest in continuously improving the quality of their training data.

Most data scientists and ML practitioners spend a lot of time debugging data to improve model performance. Even with automation and AI-supported breakthroughs in labeling, data debugging is still a tedious and time-consuming process based on manual inspection and one-off scripts in Jupyter notebooks.

Before we dive into the data errors and how to fix them, let us quickly recall what makes a good training dataset.

What is a good training dataset?

Data that is consistently and correctly labeled.
Data that is not missing any labels.
Data that covers critical edge cases.
Data that covers outliers in your data distribution.
Data that is balanced and mimics the data distribution faced in a deployed environment (for example, in terms of different times of the day, seasons, lighting conditions, etc.).
Data that is continuously updated based on feedback from your production model to mitigate data drift issues.

Note! This post will not dive into domain-specific requirements such as data volume and modality. If you are interested in reading more about healthcare-specific practices, click here.

Today we will explain how to achieve points one and two for a good training dataset. The upcoming posts in the series will cover the rest.

Let’s dive into it!

The Three Types of Label Errors in Computer Vision

Label errors directly impact a model’s performance. Having incorrect ground truth labels in your training dataset can have major downstream consequences in the production pipeline for your computer vision models. Identifying label errors in a dataset containing hundreds of images or frames can be done manually, but when working with large datasets containing hundreds of thousands or millions of images, the manual process becomes practically impossible.

The three types of labeling errors in computer vision are 1) inaccurate labels, 2) mislabeled images, and 3) missing labels.

Inaccurate Labels

When a label is inaccurate, your algorithm will struggle to identify objects correctly. The consequences of inaccurate labels in object detection have been studied previously, so we will not dive into that today.

The precise definition of an inaccurate label depends on the purpose of the model you are training, but common examples are:

Loose bounding boxes/polygons
Labels that do not cover the entire object
Labels that overlap with other objects

Note! In certain cases, such as ultrasound labeling, you would label the neighboring region of the object of interest to capture any changes around it. Thus the definition of inaccurate labels depends on the specific case.

For example, if you’re building an object detection model for computer vision to detect tigers in the wild, you want your labels to include the entire visible area of the tiger, no more and no less.

Figure: Inaccurate labels
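
One way to put a number on how loose or incomplete a bounding box is, assuming you have a trusted reference box to compare against (for example from an expert reviewer), is intersection-over-union (IoU). The sketch below is illustrative only; the box coordinates and the `(x_min, y_min, x_max, y_max)` convention are assumptions, not something prescribed by this post.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    intersection = box_area(
        (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    )
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes for the tiger example.
reference = (10.0, 10.0, 110.0, 60.0)  # tight box around the visible tiger
loose = (0.0, 0.0, 120.0, 70.0)        # includes too much background
partial = (10.0, 10.0, 60.0, 60.0)     # covers only half of the tiger

print(f"loose box IoU:   {iou(reference, loose):.2f}")
print(f"partial box IoU: {iou(reference, partial):.2f}")
```

Both the loose and the partial box score well below 1.0 against the reference, so a low IoU signals an inaccurate label without telling you which kind of inaccuracy it is.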

Mislabeled Images

When the class attached to an object is wrong, the model learns from that mistake and can make wrong predictions once it is deployed in the real world. Mislabeled images are common in training datasets. Research from MIT has shown that, on average, 3.4% of labels are wrong in commonly used benchmark datasets.

Figure: Mislabeled images
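
As a rough illustration of how mislabeled images can be surfaced in a classification setting, the sketch below flags samples where a trained model confidently disagrees with the given label. This is a simple confidence-threshold heuristic, not the method used in the research mentioned above; the function name, the data, and the 0.9 threshold are all hypothetical.

```python
from typing import Dict, List


def flag_suspect_labels(
    given_labels: List[str],
    predicted_probs: List[Dict[str, float]],
    min_confidence: float = 0.9,
) -> List[int]:
    """Return indices where the model's top class differs from the given
    label and the model's confidence is at least `min_confidence`."""
    suspects = []
    for index, (label, probs) in enumerate(zip(given_labels, predicted_probs)):
        top_class = max(probs, key=probs.get)
        if top_class != label and probs[top_class] >= min_confidence:
            suspects.append(index)
    return suspects


# Hypothetical labels and model outputs for three images.
given_labels = ["tiger", "tiger", "leopard"]
predicted_probs = [
    {"tiger": 0.97, "leopard": 0.03},  # model agrees with the label
    {"tiger": 0.04, "leopard": 0.96},  # confident disagreement: review this
    {"tiger": 0.55, "leopard": 0.45},  # disagreement, but not confident
]

suspect_indices = flag_suspect_labels(given_labels, predicted_probs)
print(suspect_indices)
```

The flagged samples are candidates for human review rather than confirmed errors, since the model itself can be confidently wrong.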

Missing Labels

The last common type of label error is missing labels. If a training dataset contains missing labels, the computer vision algorithm will not learn from the samples that lack labels.

Figure: Missing label
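
A cheap first check for missing labels is to list images that have no annotations at all. The sketch below assumes a COCO-style annotation dictionary with `images` and `annotations` lists; the sample data is made up.

```python
from typing import Dict, List


def images_without_labels(dataset: Dict[str, List[dict]]) -> List[int]:
    """Return the ids of images that no annotation refers to."""
    labeled_ids = {annotation["image_id"] for annotation in dataset["annotations"]}
    return [image["id"] for image in dataset["images"] if image["id"] not in labeled_ids]


# Hypothetical COCO-style dataset with three images and two annotations.
dataset = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 27},
    ],
}

unlabeled = images_without_labels(dataset)
print(unlabeled)
```

This check has two limits: an image with no annotations may legitimately contain no objects of interest, and it cannot detect an image where only some of the objects were labeled. The model-based approach described later in this post addresses the second case.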

What Causes Label Errors?

Erroneous labels are prevalent in many datasets, both open-source and proprietary. They happen for a variety of reasons, mainly:

Unclear ontology or label instructions: When labelers lack a clear definition of the objects and concepts to be labeled, they can struggle to interpret the images, and what is required of them, accurately and consistently.
Annotator fatigue: Labelers who perform repetitive labeling tasks can burn out. The process can be tedious and time-consuming, and it takes a toll on the energy of the person doing it.
Hard-to-annotate images: A labeling task can be difficult for various reasons. For example, it can require a high level of skill or knowledge to identify the objects in question, the image quality can be low, or the images can contain many different objects that confuse the annotator.

Next, we will show you a series of actions to prevent label errors in your labeling operations going forward!

How to Fix Label Errors?

To find label errors, you historically had to sift through your dataset manually, a time-consuming process at best and impossible at worst. With a large dataset, it is like finding a needle in a haystack.

Luckily, label errors can be mitigated today before deploying your model into production. In this section, we propose three strategies that can help you mitigate label errors during the labeling process or fix them later on:

1. Provide Clear Labeling Instructions

If you are not going to label the images yourself, providing your data annotation team with clear and concise instructions is essential. Good label instructions contain descriptions of the labeling ontology (taxonomy) and reference screenshots of high-quality labels. A good way to test the instructions is to have a non-technical colleague on your team review them and see if they make sense on a conceptual level. Even better, dogfood the label instructions with your team to find potential pitfalls.

2. Implement a Quality Assurance System

In today's computer vision data pipelines, reviewing a subset, or all, of the created labels is best practice. This can be done using a standard review module where you set one sampling rate for all labels or define different sampling rates for specific hard classes.

Figure: Sampling rate
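
To make the idea of per-class sampling rates concrete, here is a minimal sketch of how labels could be drawn for review. It is not how any particular review module is implemented; the function name, the label dictionaries, and the rates are assumptions for illustration.

```python
import random
from typing import Dict, List


def sample_for_review(
    labels: List[dict],
    default_rate: float,
    class_rates: Dict[str, float],
    seed: int = 0,
) -> List[dict]:
    """Select labels for review. Each label is picked with the sampling rate
    of its class, falling back to `default_rate` for classes not listed."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = []
    for label in labels:
        rate = class_rates.get(label["class"], default_rate)
        if rng.random() < rate:
            selected.append(label)
    return selected


# Hypothetical labels: 100 easy "person" labels and 10 hard "handbag" labels.
labels = [{"id": i, "class": "person"} for i in range(100)]
labels += [{"id": 100 + i, "class": "handbag"} for i in range(10)]

# Review roughly 20% of labels overall, but every label of the hard class.
review_queue = sample_for_review(labels, default_rate=0.2, class_rates={"handbag": 1.0})
print(len(review_queue), "labels queued for review")
```

Raising the rate for classes that annotators find difficult concentrates review effort where errors are most likely, while keeping the overall review cost bounded.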

In special cases, such as medical use cases, it is frequently required to use an expert review workflow. This entails sending labels initially rejected in the review stage for an extra expert opinion in the expert review stage. Depending on the complexity of the use case, this can be tailored to the situation.

Figure: Expert review workflow
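
The routing in an expert review workflow can be sketched as a small state machine. The escalation from review to expert review follows the description above; what happens after an expert rejects a label (here, sending it back for re-labeling) is an assumption, as are the stage names.

```python
from enum import Enum


class Stage(Enum):
    REVIEW = "review"
    EXPERT_REVIEW = "expert_review"
    COMPLETE = "complete"
    RELABEL = "relabel"  # assumed outcome of an expert rejection


def next_stage(stage: Stage, approved: bool) -> Stage:
    """Return where a label goes after a decision in the given stage."""
    if stage is Stage.REVIEW:
        # Rejected labels are escalated for an extra expert opinion.
        return Stage.COMPLETE if approved else Stage.EXPERT_REVIEW
    if stage is Stage.EXPERT_REVIEW:
        return Stage.COMPLETE if approved else Stage.RELABEL
    raise ValueError(f"No decision is taken in stage {stage.value!r}")


# A label rejected by the reviewer and then approved by the expert.
stage = next_stage(Stage.REVIEW, approved=False)
stage = next_stage(stage, approved=True)
print(stage.value)
```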

Check out this post to learn how to structure a quality assurance workflow for medical use cases.

3. Use a Trained Model to Find Label Errors

As you progress on your computer vision journey, you can use a trained model to identify mistakes in your data labeling. This is done by running the model on your annotated images and using a platform that supports label debugging and model predictions. The platform should be able to compare high-confidence false positive model predictions with the ground truth labels and flag any errors for re-labeling.

In the following example, we use Encord Active, an open-source active learning framework, to show how a trained model can be used to find label errors.

The dataset used in this example is the COCO validation dataset combined with model predictions from a pre-trained Mask R-CNN ResNet50 FPN V2 model. The sandbox dataset with labels and predictions can be downloaded directly from Encord Active's GitHub repo.

Note! Check out the full guide on how to use Encord Active to find and fix label errors in the COCO validation dataset.

Using the UI, we sort by the highest-confidence false positives to find images with possible label errors.

Figure: Encord Active - false positive

In the example below, we can see that the model has predicted four missing labels on the selected image. The objects missing are a backpack, a handbag, and two people. The predictions are marked in purple with a box around them.

Figure: False positive predictions

As all four predictions are correct, the label errors can automatically be sent back to the label editor to be corrected immediately.

Figure: Encord Label Editor

This operation is repeated with the rest of the dataset to find and fix the remaining erroneous labels.
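
Outside of a dedicated platform, the core comparison behind this workflow can be sketched in a few lines: keep the high-confidence predictions that do not overlap any ground truth label of the same class, and treat them as candidates for missing labels. This is a simplified illustration, not Encord Active's implementation. The matching is not one-to-one as in a full COCO-style evaluation, and the function names, thresholds, and sample data are assumptions.

```python
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def candidate_missing_labels(
    predictions: List[dict],
    ground_truth: List[dict],
    min_confidence: float = 0.8,
    iou_threshold: float = 0.5,
) -> List[dict]:
    """Return high-confidence predictions with no matching ground truth label
    of the same class, sorted by descending confidence."""
    candidates = []
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        if pred["score"] < min_confidence:
            continue
        matched = any(
            gt["class"] == pred["class"] and iou(pred["box"], gt["box"]) >= iou_threshold
            for gt in ground_truth
        )
        if not matched:
            candidates.append(pred)
    return candidates


# Hypothetical image: one labeled person, plus an unlabeled backpack.
ground_truth = [{"class": "person", "box": (50.0, 40.0, 120.0, 200.0)}]
predictions = [
    {"class": "person", "box": (52.0, 42.0, 118.0, 198.0), "score": 0.99},
    {"class": "backpack", "box": (130.0, 90.0, 170.0, 150.0), "score": 0.95},
    {"class": "handbag", "box": (10.0, 150.0, 40.0, 190.0), "score": 0.40},
]

flagged = candidate_missing_labels(predictions, ground_truth)
print([candidate["class"] for candidate in flagged])
```

Each flagged prediction still needs a human decision: it is either a genuine missing label to be added or a model mistake to be ignored.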

If you’re interested in finding label errors in your training dataset today, you can download the open-source active learning framework, upload your own data, labels, and model predictions, and start finding label errors.

Conclusion

In summary, to mitigate label errors and missing labels, you can follow three best practice strategies:

Provide clear labeling instructions that contain descriptions of the labeling ontology (taxonomy) and reference screenshots of high-quality labels.
Implement a Quality Assurance system using a standard review workflow or expert review workflow.
Use a trained model to spot label errors in your training dataset by running it on your newly annotated samples to get model predictions and using a platform that supports model-driven label debugging.

Want to test your own models?

"I want to get started right away" - You can find Encord Active on GitHub here.

"Can you show me an example first?" - Check out this Colab Notebook.

"I am new, and want a step-by-step guide" - Try out the getting started tutorial.

If you want to support the project you can help us out by giving a Star on GitHub :)

Written by Nikolaj Buhl

Nikolaj is a Product Manager at Encord and a computer vision enthusiast. At Encord he oversees the development of Encord Active. Nikolaj holds an M.Sc. in Management from London Business School and Copenhagen Business School. In a previous life, he lived in China working at the Danish Embas...
