As Machine Learning teams continue to push the boundaries of computer vision, the quality of our training data becomes increasingly important. A single data or label error can impact the performance of our models, making it critical to evaluate and improve our training datasets continuously.
In this first part of the blog post series “Data errors in Computer Vision”, we'll explore the most common label errors in computer vision and show you how you can quickly and efficiently mitigate or fix them. Whether you're a seasoned ML expert or just starting out, this series has something for everyone. Join us as we dive into the world of data errors in computer vision and learn how to ensure your models are trained on the highest quality data possible.
At Encord, we work with pioneering machine learning teams across a variety of different use cases. As they transition their models into production, we have noticed an increasing interest in continuously improving the quality of their training data.
Most Data Scientists and ML practitioners spend a significant amount of time debugging data to improve their model performance. Even with automation and AI-supported breakthroughs in labeling, data debugging is still a tedious and time-consuming process based on manual inspection and one-off scripts in Jupyter notebooks.
Before we dive into the data errors and how to fix them, let us quickly recap what makes a good training dataset.
Note! This post will not dive into domain-specific requirements such as data volume and modality. If you are interested in reading more about healthcare-specific practices, click here.
Today we will explain how to achieve points one and two for a good training dataset, and in the upcoming posts in the series, we will cover the rest.
Let’s dive into it!
Label errors directly impact a model’s performance. Having incorrect ground truth labels in your training dataset can have major downstream consequences in the production pipeline for your computer vision models. Identifying label errors in a dataset containing hundreds of images or frames can be done manually, but when working with large datasets containing hundreds of thousands or millions of images, manual inspection becomes impossible.
The three types of labeling errors in computer vision are 1) inaccurate labels, 2) mislabeled images, and 3) missing labels.
When a label is inaccurate, your algorithm will struggle to identify objects correctly. The actual consequences of inaccurate labels in object detection have been studied previously, so we will not dive into that today.
The precise definition of an inaccurate label depends on the purpose of the model you are training, but generally, the common examples of inaccurate labels are:
Note! In certain cases, such as ultrasound labeling, you would label the neighboring region of the object of interest to capture any changes around it. Thus the definition of inaccurate labels depends on the specific case.
For example, if you’re building an object detection model for computer vision to detect tigers in the wild, you want your labels to include the entire visible area of the tiger, no more and no less.
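To make the “no more and no less” criterion concrete, you can compare an annotated box against a reference box with Intersection over Union (IoU). The sketch below is a minimal illustration with made-up coordinates and a made-up 0.9 tolerance; it is not part of any tool's API:

```python
def iou(box_a, box_b):
    """Intersection over Union for two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical example: an annotated box vs. the tiger's true visible extent.
annotated = [120, 80, 360, 300]
reference = [100, 70, 380, 320]
if iou(annotated, reference) < 0.9:  # the tolerance is use-case specific
    print("Label may be too loose or too tight - flag for review")
```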
When an object is assigned the wrong class label, it can lead to wrong predictions when you deploy your model in the real world. Mislabeled images are common in training datasets. Research from MIT has shown that, on average, 3.4% of labels are mislabeled in commonly used benchmark datasets.
The last common type of label error is missing labels. If a training dataset contains missing labels, the computer vision algorithm will not learn from the unlabeled samples, and unlabeled objects can even end up being treated as background during training.
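A quick first check for this type of error is to list images that have no annotations at all. The sketch below assumes COCO-format annotations (an "images" list and an "annotations" list keyed by image_id) and a placeholder file path; adapt it to however your labels are stored:

```python
import json

# Placeholder path to a COCO-style annotation file.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

# Collect the ids of images that have at least one annotation.
labeled_ids = {ann["image_id"] for ann in coco["annotations"]}
unlabeled = [img["file_name"] for img in coco["images"] if img["id"] not in labeled_ids]

print(f"{len(unlabeled)} images have no labels at all")
```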
Erroneous labels are prevalent in many datasets, both open-source and proprietary. They happen for a variety of reasons, mainly:
Next, we will show you a series of actions to prevent label errors in your labeling operations going forward!
To find label errors, you historically had to manually sift through your dataset, a process that is time-consuming at best and impossible at worst. If you have a large dataset, it would be like finding a needle in a haystack.
Luckily, label errors can be mitigated today before deploying your model into production. In this section, we propose three strategies that can help you mitigate label errors during the labeling process or fix them later on:
If you are not going to label the images yourself, providing your data annotation team with clear and concise instructions is essential. Good label instructions contain descriptions of the labeling ontology (taxonomy) and reference screenshots of high-quality labels. A good way to test the instructions is to have a non-technical colleague on your team review them and see if they make sense on a conceptual level. Even better, dogfood the label instructions with your own team to find potential pitfalls.
In today's computer vision data pipelines, reviewing a subset, or all, of the created labels is the best practice. This can be done using a standard review module where you can decide upon the sampling rate of all labels or define different sampling rates for specific hard classes.
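As a rough sketch of the sampling idea (the class names and rates are hypothetical, and this is not Encord's review API), you could route every label from a known hard class to review and only spot-check the rest:

```python
import random

# Hypothetical per-class sampling rates: review every label of a hard class,
# and spot-check 20% of everything else.
sampling_rates = {"tiger_partially_occluded": 1.0, "default": 0.2}

def needs_review(label_class: str) -> bool:
    rate = sampling_rates.get(label_class, sampling_rates["default"])
    return random.random() < rate

labels = [("img_001", "tiger"), ("img_002", "tiger_partially_occluded")]
review_queue = [item for item in labels if needs_review(item[1])]
```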
In special cases, such as medical use cases, it is frequently required to use an expert review workflow. This entails sending labels initially rejected in the review stage for an extra expert opinion in the expert review stage. Depending on the complexity of the use case, this can be tailored to the situation.
Check out this post to learn how to structure quality assurance workflow for medical use cases.
As you progress on your computer vision journey, you can use a trained model to identify mistakes in your data labeling. This is done by running the model on your annotated images and using a platform that supports label debugging and model predictions. The platform should be able to compare high-confidence false positive model predictions with the ground truth labels and flag any errors for re-labeling.
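In outline, that comparison can look like the sketch below: any prediction above a high confidence threshold that overlaps no ground-truth box of the same class is flagged as a candidate label error. This is a generic illustration with made-up thresholds, reusing the iou helper from the earlier sketch; it is not Encord Active's internal implementation:

```python
CONF_THRESHOLD = 0.8  # "high confidence" cut-off, tune per model
IOU_THRESHOLD = 0.5   # overlap needed to count as a match

def flag_possible_label_errors(predictions, ground_truths):
    """Return high-confidence predictions with no matching ground-truth label.

    predictions:   list of dicts {"box": [...], "label": str, "score": float}
    ground_truths: list of dicts {"box": [...], "label": str}
    """
    flagged = []
    for pred in predictions:
        if pred["score"] < CONF_THRESHOLD:
            continue
        matched = any(
            gt["label"] == pred["label"] and iou(pred["box"], gt["box"]) >= IOU_THRESHOLD
            for gt in ground_truths
        )
        if not matched:
            flagged.append(pred)  # likely a missing or wrong ground-truth label
    # Reviewing the highest-confidence candidates first pays off fastest.
    return sorted(flagged, key=lambda p: p["score"], reverse=True)
```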
In this example, we will use Encord Active, an open-source active learning framework, to show how a trained model can be used to find label errors.
The dataset used in this example is the COCO validation dataset combined with model predictions from a pre-trained Mask R-CNN ResNet-50 FPN v2 model. The sandbox dataset with labels and predictions can be downloaded directly from Encord Active's GitHub repo.
Note! Check out the full guide on how to use Encord Active to find and fix label errors in the COCO validation dataset.
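If you would rather generate the predictions yourself instead of downloading the sandbox project, torchvision ships this architecture as maskrcnn_resnet50_fpn_v2. A minimal sketch, with a placeholder image path:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn_v2,
    MaskRCNN_ResNet50_FPN_V2_Weights,
)

weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]  # COCO class names

img = read_image("coco/val2017/your_image.jpg")  # placeholder path
with torch.no_grad():
    pred = model([preprocess(img)])[0]  # dict with "boxes", "labels", "scores", "masks"

keep = pred["scores"] > 0.8
print(pred["boxes"][keep])
print([categories[i] for i in pred["labels"][keep]])
```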
Using the UI, we sort by the highest-confidence false positives to find images with possible label errors.
In the example below, we can see that the model has predicted four objects that are missing labels in the selected image: a backpack, a handbag, and two people. The predictions are marked in purple with a box around them.
As all four predictions are correct, the label errors can automatically be sent back to the label editor to be corrected immediately.
This operation is repeated with the rest of the dataset to find and fix the remaining erroneous labels.
If you’re interested in finding label errors in your training dataset today, you can download the open-source active learning framework, upload your own data, labels, and model predictions, and start finding label errors.
In summary, to mitigate label errors and missing labels, you can follow three best practice strategies:
"I want to get started right away" - You can find Encord Active on Github here.
"Can you show me an example first?" - Check out this Colab Notebook.
"I am new, and want a step-by-step guide" - Try out the getting started tutorial.
If you want to support the project, you can help us out by giving it a star on GitHub :)
Want to stay updated?
Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI.