Data, Label, & Model Quality Metrics in Encord
Written by
Akruti Acharya
When you’re working with datasets or developing a machine learning model, you often find yourself looking for or hypothesizing about subsets of data, labels, or model predictions with certain properties.
Quality metrics form the foundation for finding such data and testing the hypotheses.
What is a Quality Metric?
The core concept is to use quality metrics to index, slice, and analyze the subject in question in a structured way, so that you can take informed actions each time you iterate through the active learning cycle.
Concrete example: You hypothesize that object "redness" influences the mAP score of your object detection model. To test this hypothesis, you define a quality metric that captures the redness of each object in the dataset. From the quality metric, you slice the data to compare your model performance on red vs. not red objects.
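To make the example concrete, here is a minimal sketch of what such a "redness" metric could look like, assuming RGB images and a hypothetical `objects` list that holds each object's source image, bounding box, and per-object evaluation score. It is illustrative only, not Encord Active's implementation, and the 0.45 threshold is an arbitrary choice.

```python
import numpy as np
from PIL import Image


def redness(image: Image.Image, bbox: tuple) -> float:
    """Mean share of the red channel in the object crop (~0.33 is neutral, higher is redder)."""
    crop = np.asarray(image.crop(bbox), dtype=np.float32)  # bbox = (x0, y0, x1, y1)
    r, g, b = crop[..., 0], crop[..., 1], crop[..., 2]
    return float((r / (r + g + b + 1e-6)).mean())


# `objects` is a hypothetical list: each entry holds the source image, the object's
# bounding box, and a per-object score (e.g. AP) from your evaluation pipeline.
red_ap, other_ap = [], []
for obj in objects:
    if redness(obj["image"], obj["bbox"]) > 0.45:  # illustrative threshold
        red_ap.append(obj["ap"])
    else:
        other_ap.append(obj["ap"])

print("mean AP on red objects:", np.mean(red_ap))
print("mean AP on other objects:", np.mean(other_ap))
```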
Quality Metric Defined
The best way to think of a quality metric in computer vision is:

A quality metric is any function that assigns a value to individual data points, labels, or model predictions in a dataset.
By design, quality metrics are a very abstract class of functions because the accompanying methodologies are agnostic to the specific properties that the quality metrics express. No matter the specific quality metric, you can:
- sort your data according to the metric
- slice your data to inspect specific subsets
- find outliers
- compare training data to production data to detect data drifts
- evaluate your model performance as a function of the metric
- define model test-cases
- and much more
All of these are possible with Encord Active.
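To illustrate how a single per-item score unlocks these workflows, here is a small sketch using synthetic "brightness" scores in pandas. The values, thresholds, and the two-sample KS drift test are assumptions for illustration, not Encord Active's internals.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical per-item scores for one metric (e.g. brightness), one row per image.
rng = np.random.default_rng(0)
train = pd.DataFrame({"brightness": rng.normal(0.55, 0.10, 5000)})
prod = pd.DataFrame({"brightness": rng.normal(0.45, 0.15, 2000)})

# Sort your data according to the metric.
darkest = train.sort_values("brightness").head(20)

# Slice your data to inspect specific subsets.
dim_slice = train[train["brightness"] < 0.3]

# Find outliers with a simple inter-quartile-range rule.
q1, q3 = train["brightness"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = train[(train["brightness"] < q1 - 1.5 * iqr) | (train["brightness"] > q3 + 1.5 * iqr)]

# Compare training data to production data to detect drift.
stat, p_value = ks_2samp(train["brightness"], prod["brightness"])
print(f"{len(dim_slice)} dim images, {len(outliers)} outliers, drift p-value={p_value:.4f}")
```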
Data Quality Metrics
Data quality metrics are those metrics that require only information about the data itself. Within the computer vision domain, this means the raw images or video frames without any labels. This subset of quality metrics is used most often at the beginning of a machine learning project, when labels are scarce or perhaps nonexistent.
Below are some examples of data quality metrics ranging from simple to more complex:
Image Brightness as a data quality metric on MS COCO validation dataset on Encord.
Image Singularity as a data quality metric on MS COCO validation dataset on Encord.
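As a rough idea of the simple end of that spectrum, the sketch below approximates image brightness as the mean luminance of each image, normalized to [0, 1]. The `images` folder and file pattern are hypothetical, and Encord Active's built-in metric may be computed differently.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def brightness(path: Path) -> float:
    """Mean luminance of an image, normalized to [0, 1]."""
    grey = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    return float(grey.mean() / 255.0)


# Rank a (hypothetical) folder of images from darkest to brightest.
scores = sorted((brightness(p), p.name) for p in Path("images").glob("*.jpg"))
for score, name in scores[:10]:
    print(f"{score:.3f}  {name}")
```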
Label Quality Metrics
Label quality metrics apply to labels. Some metrics use image content, while others rely only on the label information. Label quality metrics serve many purposes; among the most frequent are surfacing label errors, exposing model failure modes, and assessing annotator performance.
Here are some concrete examples of label quality metrics ranging from simple to more complex:
Object count as a label quality metric on MS COCO validation dataset on Encord.
Annotation Duplicate as a label quality metric on MS COCO validation dataset on Encord.
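The sketch below illustrates both ideas in plain Python: an object-count metric and a duplicate-annotation check based on IoU between boxes of the same class. The label format and the 0.95 threshold are assumptions for illustration, not Encord Active's implementation.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def object_count(labels):
    """Simple label quality metric: number of annotated objects in a frame."""
    return len(labels)


def duplicate_pairs(labels, threshold=0.95):
    """Flag near-identical annotations of the same class within one frame."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(labels), 2)
        if a["class"] == b["class"] and iou(a["bbox"], b["bbox"]) > threshold
    ]


# `labels` is a hypothetical per-frame list of {"class": str, "bbox": [x0, y0, x1, y1]}.
labels = [
    {"class": "car", "bbox": [10, 10, 50, 40]},
    {"class": "car", "bbox": [10, 10, 50, 41]},  # near-duplicate of the first box
    {"class": "person", "bbox": [60, 20, 80, 90]},
]
print("object count:", object_count(labels))
print("duplicate annotation pairs:", duplicate_pairs(labels))
```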
Model Quality Metrics
Model quality metrics also take the model's predictions into account. The most obvious use case for these metrics is acquisition functions, which answer the question, "What should I label next?" There are many intelligent ways to leverage model predictions to answer this question. Here are some of the most common ones:
Using Model Confidence as a model quality metric on MS COCO validation dataset on Encord. It shows the predictions where the confidence is between 50% to 80%.
Using Polygon Shape Similarity as a model quality metric on MS COCO validation dataset on Encord. It ranks objects by how similar they are to their instances in previous frames based on Hu moments. The more an object’s shape changes, the lower its score will be.
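As a concrete example of an acquisition function, the sketch below scores a pool of unlabeled items by least confidence (one minus the top class probability) and also slices out a 50-80% confidence band like the one shown above. The synthetic probabilities, batch size, and band limits are assumptions for illustration.

```python
import numpy as np


def least_confidence(probs: np.ndarray) -> np.ndarray:
    """Acquisition score per item: 1 - max class probability (higher = label sooner)."""
    return 1.0 - probs.max(axis=1)


# Hypothetical class probabilities for a pool of unlabeled images (rows sum to 1).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

scores = least_confidence(probs)

# Queue the items the model is least sure about for the next annotation batch,
# or slice out a band of mid-range confidence (e.g. 50% to 80%).
next_batch = np.argsort(scores)[::-1][:100]
mid_confidence = np.where((probs.max(axis=1) >= 0.5) & (probs.max(axis=1) <= 0.8))[0]
print(len(next_batch), "items queued;", len(mid_confidence), "items in the 50-80% band")
```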
Custom Quality Metrics
We have now reviewed some examples of common quality metrics already available in Encord Active.
However, every machine learning project is different, and you most likely have your own idea of what to compute to surface the data you want to evaluate or analyze.
With Encord Active, you only need to define the per-data-point computation. The tool will handle everything from executing the computation to visualizing your data based on your new metric.
Perhaps you want to know when your skeleton predictions are occluded, or in which frames of a video specific annotations are missing.
You could also get even smarter and compare your labels with results from foundation models like SAM.
These are the situations where you would build your own custom metrics.
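For intuition, here is a conceptual sketch of a custom per-data-point metric that measures how occluded a skeleton prediction is and writes one score per item to a CSV. The prediction format and file names are hypothetical, and this is not Encord Active's actual metric interface; see the documentation linked below for that.

```python
import csv
from pathlib import Path


def keypoint_occlusion(prediction: dict) -> float:
    """Per-data-point computation: fraction of skeleton keypoints marked
    invisible/occluded in one prediction (higher = more occluded)."""
    flags = [kp["visible"] for kp in prediction["keypoints"]]
    return 1.0 - sum(flags) / max(len(flags), 1)


# Hypothetical predictions: one dict per frame with keypoint visibility flags.
predictions = {
    "frame_000.jpg": {"keypoints": [{"visible": True}, {"visible": False}, {"visible": False}]},
    "frame_001.jpg": {"keypoints": [{"visible": True}, {"visible": True}, {"visible": True}]},
}

# Write one score per data point; a tool like Encord Active can then index,
# sort, and visualize your data by this new metric.
with Path("keypoint_occlusion.csv").open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["item", "score"])
    for item, pred in predictions.items():
        writer.writerow([item, f"{keypoint_occlusion(pred):.3f}"])
```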
You can find the documentation for writing custom metrics here, or you can follow any of the links above to specific quality metrics and find their implementations on GitHub.
Conclusion
Quality Metrics constitute the foundation of systematically exploring, evaluating, and iterating on machine learning datasets and models.
With Encord Active, it’s easy to define, execute, and utilize quality metrics to get the most out of your data, models, and annotators. We use them for slicing data, comparing data, tagging data, finding label errors, and much more. The true power of these metrics is that they can be arbitrarily specific to a problem at hand.
Ready to improve the performance and quality metrics of your CV models?
Sign up for an Encord Free Trial: the Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams.
AI-assisted labeling, model training and diagnostics, and tools to find and fix dataset errors and biases, all in one collaborative active learning platform that gets you to production AI faster. Try Encord for free today.
Want to stay updated?
- Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.
- Join the Slack community to chat and connect.
Build better ML models with Encord
Get started todayWritten by
Akruti Acharya
View more postsRelated blogs
From Big Data to Smart Data: How to Manage, Clean and Curate Your Visual Datasets for AI Development
Webinar Recording Acquiring a dataset is just the beginning; the real challenge lies in refining it for training a Computer Vision model. Bloated, low-quality datasets waste resources and hamper model performance. The key to effective curation? Active Learning pipelines. By employing Active Learning, teams can intelligently select data that significantly impacts the model's performance. This method focuses on the model's current needs, ensuring each data point is impactful. The result is a streamlined annotation process and a more accurate, efficient Computer Vision model. Here are the key resources from the webinar: [Guide] How to curate your data [Case Study] See how one customer increased mAP by 20% through reducing their dataset size by 35% with visual data curation
Feb 01 2024
60 M
Model Robustness: Building Reliable AI Models
Today, organizations are increasingly deploying artificial intelligence (AI) systems in highly sensitive and critical domains, such as medical diagnosis, autonomous driving, and cybersecurity. Reliance on AI models to perform vital tasks has opened up the possibility of large-scale failure with damaging consequences, such as in the event of malicious attacks or compromised infrastructure. AI incidents are growing significantly, reportedly averaging 79 incidents yearly from 2020 to 2023. For instance, Tessa, a healthcare chatbot, reportedly gave harmful advice to people with eating disorders; Tesla’s autonomous car did not recognize a pedestrian on the crosswalk; and Clearview AI’s security system wrongly identified an innocent person as a criminal. These disasters question the efficacy of AI systems and call for developing robust models resistant to vulnerabilities. So, what is model robustness in AI? And how can AI practitioners ensure that a model is robust? In this article, you will: Understand the significance of robustness in AI applications, Learn about the challenges of building robustness into AI systems, Learn how Encord Active can help improve the robustness of your ML models. What is Model Robustness? Model robustness is a machine-learning (ML) model’s ability to withstand uncertainties and perform accurately in different contexts. A model is robust if it performs strongly on datasets that differ from the training data. For instance, in advanced computer vision (CV) and large language models (LLMs), robustness ensures reliable predictions on unseen textual and image data generated from diverse sources. Real-world images can be blurry, distorted, noisy, etc., interfering with a CV model’s prediction performance and causing fatal accidents in safety-critical applications such as self-driving cars and medical diagnosis. Achieving robustness in such models will help mitigate these issues. However, robustness may not always lead to high accuracy, as accuracy is usually calculated based on how well the model fits on a validation dataset. This means a highly accurate model may not generalize well to entirely new data that was not present in the validation set. The diagram below illustrates the point. Robustness vs Accuracy Optimizing a model for robustness may imply lower accuracy and model complexity than required in the case of optimizing for low variance. That’s because robustness aims to create a model that can perform well on novel data distributions that significantly differ from test data. Significance of Model Robustness Ensuring model robustness is necessary as we increase our reliance on AI models to perform critical jobs. Below are a few reasons why model robustness is crucial in today’s highly digitalized world. Reduces sensitivity to outliers: Outliers can adversely affect the performance of algorithms like regression, decision trees, k-nearest neighbors, etc. Ensuring model robustness will make these models less sensitive to outliers and improve generalization performance. Protects models against malicious attacks: Adversarial attacks distort input data, forcing the model to make wrong predictions. For instance, an attacker can change specific images to trick the model into making a classification error. Robustness allows you to build models that can resist such attacks. Fairness: Robustness requires training models on representative datasets without bias. This means robust models generate fairer predictions and perform well on data that may contain inherent biases. 
Increases trust: Multiple domains, such as self-driving cars, security, medical diagnosis, business decision-making, etc., rely on AI to perform mission and safety-critical tasks. Robustness is essential in these areas to maintain high model performance by eliminating the chance of harmful errors. Reduces cost of retraining models: In robust models, data variations (distribution shifts) have minimal effect on performance. Hence, retraining is less frequent, reducing the computational resource load required to collect, preprocess, and train new data. Improves regulatory compliance: As data security and AI fairness laws become more stringent, data science teams must ensure regulatory compliance to avoid costly fines. Robust models are helpful as they mitigate the effects of adversarial attacks by maintaining stable performance when faced with attempts to exploit model vulnerabilities and perform optimally on new data, reducing data collection needs and the chances of a data breach. Now that we understand the importance of model robustness, let’s explore how you can achieve it in your ML pipelines. How to Achieve Model Robustness? Making machine learning models robust involves several techniques to ensure strong performance on unseen data for diverse use cases. The following section discusses the factors that contribute significantly to achieving model robustness. Data Quality High data quality enables efficient model training by ensuring the data is clean, diverse, consistent, and accurate. As such, models can quickly learn underlying data patterns and perform well on unseen samples without exhibiting bias, leading to higher robustness. Automated data pipelines are necessary to improve data quality as they help with data preprocessing to bring raw data into a usable format. The pipelines can include statistical checks to assess diversity and ensure the training data’s representativeness of the real-world population. Moreover, data augmentation, which artificially increases the training set by modifying input samples in a particular way, can also help reduce model overfitting. The illustration below shows how augmentation works in CV. Examples of Data Augmentation Lastly, the pipeline must include a vigorous data annotation process, as model performance relies heavily on label quality. Labeling errors can cause the model to generate incorrect predictions and become vulnerable to adversarial attacks. A clear annotation strategy with detailed guidelines and a thorough review process by domain experts can help improve the labeling workflow. Using active learning and consensus-based approaches such as majority voting can also boost quality by ensuring consistent labels across samples. Want to know how to increase data quality? Have a look at Mastering Data Cleaning and Data Preprocessing. Adversarial Training Adversarial robustness makes a model resistant to adversarial attacks. Such attacks often involve small perturbations to input data, causing the model to generate incorrect output. The attacker aims to steal or copy the model by understanding its inner workings. Types of Adversarial Attacks Adversarial attacks consist of multiple methodologies, such as: Evasion attacks involve perturbing inputs to cause incorrect model predictions. For instance, the fast gradient sign method (FGSM) is a popular perturbation technique that adds the sign of the loss function’s gradient to modify an input instance. 
Poisoning attacks occur when an adversary directly manipulates the input by changing labels or injecting harmful data into the training set. Model inversion attacks aim to reconstruct the training data samples using a target classifier. Such attacks can cause serious privacy breaches as attackers can discover sensitive data samples for training a particular model. Model extraction attacks occur when adversaries query a model’s Application Programming Interface (API) to collect output samples to create a synthetic dataset. The adversary can use the fake dataset to train another model that copies the functionality of the original learning algorithms. Let’s explore some prominent techniques to prevent these adversarial attacks. Robustness and Model Security AI practitioners can use various techniques to prevent adversarial attacks and make models more robust. The following are a few options. Adversarial training: This method involves training models on adversarial examples to prevent evasion attacks. Gradient masking: Building ML models that do not rely on gradients, such as k-nearest neighbors, can prevent attacks that use gradients to perturb inputs. Data cleaning: This simple technique helps prevent poisoning attacks by ensuring that training data does not contain malicious examples or samples with incorrect labels. Outlier detection: Identifying and removing outliers can also help make models robust to poisoning attacks. Differential privacy: The techniques involved in differential privacy add noise to data during model training, making it challenging for an attacker to extract information regarding a specific individual. Data encryption: Techniques like homomorphic encryption allow you to train models on encrypted data and prevent breaches. Output perturbation: You can avoid data leakage by adding noise to a deep learning model’s output. Watermarking: You can add outliers to your data by including watermarks in your input data. The model overfits these outliers, allowing you to identify your model’s replica. Domain Adaptation With domain adaptation, you can tailor a model to perform well on a target domain with limited labeled data, using knowledge from another source domain with sufficient data. For instance, you can have a classifier model that correctly classifies land animal images (source domain). However, you can use domain adaptation techniques to fine-tune the model, so it also classifies marine animals (target domain). This way, you can improve the model’s generalization performance for new classes to increase its robustness. Domain Adaptation Illustration Moreover, domain adaptation techniques make your model robust to domain shifts that occur when underlying data distributions change. For instance, differences between training and validation sets indicate a domain shift. You can broadly categorize domain adaptation as follows: Supervised, semi-supervised, and unsupervised domain adaptation: In supervised domain adaptation, the data in the target domain is completely labeled. In semi-supervised domain adaptation, only a few data samples have labels, while in unsupervised domain adaptation, no labels exist in the target domain. Heterogenous and homogenous domain adaptation: In heterogeneous domain adaptation, the target and source feature spaces are different, while they are the same in homogeneous domain adaptation. 
One-step and multi-step domain adaptation: In one-step domain adaptation, you can directly transfer the knowledge from the source to the target domain due to the similarity between the two. However, you introduce additional knowledge transfer steps in multi-step adaptation to smoothen the transition process. Multi-step techniques help when target and source domains differ significantly. Lastly, domain adaptation techniques include feature-based learning, where deep learning models learn invariable underlying domain features and use the knowledge to make predictions on the target domain. Other methods involve mapping the source domain to the target domain using generative adversarial networks (GANs). The technique works by learning to map a source image to another domain using a target domain label. Regularization Regularization helps prevent your model from overfitting and makes it more robust by reducing the generalization error. The Effect of Regularization on the Model Common regularization techniques include: Ridge regression: In ridge regression, you add a penalty to the loss function that equals the sum of the squares of the weights. Lasso regression: In lasso regression, the penalty term is the sum of the absolute value of all the weights. Entropy: The penalty term equals the entropy of the output distribution. Dropout: You can use the dropout technique in neural networks to randomly turn off or drop layers and nodes to reduce model complexity and improve generalization. Explainability Explainable AI (XAI) is a recent concept that allows you to understand how a machine learning system behaves and enhances model interpretability. Explainable Model vs. Black Box Model Illustration XAI techniques help make a model robust by allowing you to see the inner workings of a model and identify and fix any biases in the model’s decision-making process. XAI includes the following techniques: SHAP: Shapley Additive Explanations (SHAP) is a technique that computes Shapley values for features to represent their importance in a particular prediction. LIME: Local interpretable model-agnostic explanation (LIME) perturbs input data and analyzes the effects on output to compute feature importance. Integrated gradients: This technique establishes feature importance by computing gradients of features with respect to input data. Permutation importance: You can evaluate a feature’s importance by removing it and observing the effect on a particular performance metric, such as F1-score, precision, recall, etc. Partial dependence plot: This plot shows the marginal effect of features on a model’s output. It helps interpret whether the feature and the output have a simple or more complex relationship. Evaluation Strategies Model evaluation techniques help increase a model’s robustness by allowing you to assess performance and quickly identify issues during model development. While traditional evaluation metrics, such as the F1-score, precision, recall, etc., let you evaluate the performance of simple models against established benchmarks, more complex methods are necessary for modern LLMs and other foundation models. For instance, you can evaluate an LLM’s output using various automated scores, such as BLEU, ROUGE, CIDEr, etc. You can complement LLM evaluation with human feedback for a more robust assessment. In contrast, intersection-over-union (IoU), panoptic quality, mean average precision (mAP), etc., are some common methods for evaluating CV models. 
Learn more about model evaluation by reading our comprehensive guide on Model Test Cases: A Practical Approach to Evaluating ML Models. Challenges of Model Robustness While model robustness is essential for high performance, maintaining it is significantly challenging. The list below mentions some issues you can encounter when building robust models: Data volume and variety: Modern data comes from multiple sources in high volumes. Preprocessing these extensive datasets demands robust data pipelines and expert staff to identify issues during the collection phase. Increased model complexity: Recent advancements in natural language processing and computer vision modeling call for more sophisticated explainability techniques to understand how they process input data. Feature volatility: Model decay is a recurrent issue in dynamic domains with frequent changes in feature distribution. Keeping track of these distributional shifts calls for complex monitoring infrastructure. Evaluation methods: Developing the perfect evaluation strategy is tedious as you must consider several factors, such as the nature of a model’s output, ground-truth availability, the need for domain experts, etc. Achieving Model Robustness with Encord Active You can mitigate the above challenges by using an appropriate ML platform like Encord Active that helps you increase model robustness through automated evaluation features and development tools. Encord Active Encord Active automatically identifies labeling errors and boosts data quality through relevant quality metrics and vector embeddings. It also helps you debug models through comprehensive explainability reports, robustness tests, and model error analysis. In addition, the platform features active learning pipelines to help you identify data samples that are crucial for your model and streamline the data curation process. Evaluate the Quality of the Data You can use Encord Active to improve the quality of your data and, subsequently, enhance the robustness of vision models through several key features. Encord Active offers various features like data exploration, label exploration, similarity search, quality metrics (both off-the-shelf and custom), data and label tagging, image duplication detection, label error detection, and outlier detection. It supports various data types and labels and integrates seamlessly with Encord Annotate. Data curation workflow The platform supports curating images using embeddings and quality metrics to find data of bad quality for your model to learn from or low-quality samples you might want to test your model on. Here is an example using the Embeddings View within Encord Active to surface images that are too bright from the COCO 2017 dataset: You can also explore the embedding plots and filter the images by a quality metric like "Area" for instances where you might want to find the largest or smallest images from your set, among other off-the-shelf or custom quality metrics. Finding and Flagging Label Errors Within Encord Active, you can surface duplicate labels that could be overfitting or lead to misleading high-performance metrics during training and validation. Because the model may recognize repeated instances rather than learn generalizable patterns. After identifying such images, you can add them to a “Collection” and send them to Encord Annotate for re-labeling or removing the duplicates. 💡Recommended: Exploring the Quality of Hugging Face Image Datasets with Encord Active. 
Evaluating Model Quality Encord Active also allows you to determine which metrics influence your model's performance the most. You can import your model’s prediction to get a 360° view of the quality of your model across performance metrics and data slices. You can also inspect the metric impact on your model's performance. This can help you better understand how the model performs across metrics like the diversity of the data, label duplicates, brightness, and so on. These features collectively ensure that data quality is significantly improved, contributing to the development of more robust and accurate vision models. The focus on active learning and the ability to handle various stages of the data and model lifecycle make Encord Active a comprehensive tool for improving data quality in computer vision applications. Interested in learning more about Encord Active? Check out the documentation. Model Robustness: Key Takeaways Building robust models is the only way to leverage AI’s full potential to boost profitability. A few important things to remember about model robustness are: A robust model can maneuver uncertain real-world scenarios appropriately and increase trust in the AI system. Achieving model robustness can imply slightly compromising accuracy to reduce generalization errors. Ensuring model robustness helps you prevent adversaries from stealing your model or data. Improved data quality, domain adaptation techniques, and regularization's reduction of generalization error can all contribute to model robustness. Model explainability is essential for building robust models as it helps you understand a model’s behavior in detail. A specialized ML platform can help you overcome model robustness challenges such as increased model complexity and feature volatility.
Dec 06 2023
8 M
From Data to Diamonds: Unearth the True Value of Quality Data
Bridging the chasm between ‘Just AI’ and ‘Useful AI’ can be challenging, however it’s apparent that leveraging valuable data is crucial to this. As access to data increases, computer vision teams need to produce informative and reliable training data as a priority, one approach is through developing active learning pipelines. From data curation to annotation and beyond, this webinar will provide you with the tools to implement active learning pipelines and level up your computer vision models Here are the key resources from the webinar: [Guide] How to curate your data [Case Study] How one customer improved per-class performance by 67%
Nov 17 2023
60 M
Exploring the Quality of Hugging Face Image Datasets with Encord Active
Computer vision engineers, data scientists, and machine learning engineers face a pervasive issue: the prevalence of low-quality images within datasets. You have likely encountered this problem through incorrect labels, varied image resolutions, noise, and other distortions. Poor data quality can lead to models learning incorrect features, misclassifications, and unreliable or incorrect outputs. In a domain where accuracy and reliability are paramount, this issue can significantly impede the progress and success of projects. This could result in wasted resources and extended project timelines. Take a look at the following image collage of Chihuahuas or muffins, for example: Chihuahua or muffin? My search for the best computer vision API How fast could you tell which images are Chihuahuas vs. muffins? Fast? Slow? Were you correct in 100% of the images? I passed the collage to GPT-4V because, why not? 😂 And as you can see, even the best-in-class foundation model misclassified some muffins as Chihuahuas! (I pointed out a few.) So, how do you make your models perform better? The sauce lies in the systematic approach of exploring, evaluating, and fixing the quality of images. Enter Encord Active! It provides a platform to identify, tag problematic images, and use features to improve the dataset's quality. This article will show you how to use Encord Active to explore images, visualize potential issues, and take next steps to rectify low-quality images. In particular, you will: Use a dog-food dataset from the Hugging Face Datasets library. Delve into the steps of creating an Encord Active project. Define and run quality metrics on the dataset. Visualize the quality metrics. Indicate strategies to fix the issues you identified. Ready? Let’s delve right in! 🚀 Using Encord Active to Explore the Quality of Your Images Encord Active toolkit helps you find and fix wrong labels through data exploration, model-assisted quality metrics, and one-click labeling integration. It takes a data-centric approach to improving model performance. With Encord Active, you can: Slice your visual data across metrics functions to identify data slices with low performance. Flag poor-performing slices and send them for review. Export your new data set and labels. Visually explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations. Check out the project on GitHub, and hey, if you like it, leave a 🌟🫡. Demo: Explore the quality of 'dog' and 'food' images for ML models In this article, you will use Encord Active to explore the quality of the `sashs/dog-food` images. You’ll access the dataset through the Hugging Face Datasets library. You can use this dataset to build a binary classifier that categorizes images into the "dog" and "food" classes. The 'dog' class has images of canines that resemble fried chicken and some that resemble images of muffins, and the 'food' class has images of, you guessed it, fried chicken and muffins. The complete code is hosted on Colab. Open the Colab notebook side-to-side with this blog post. Interested in more computer vision, visual foundation models, active learning, and data quality notebooks? 
Check out the Encord Notebook repository Use Hugging Face Datasets to Download and Generate the Dataset Whatever machine learning, deep learning, or AI tasks you are working on, the Hugging Face Datasets library provides easy access to, sharing, and processing datasets, particularly those catering to audio, computer vision, and natural language processing (NLP) domains. The 🤗 datasets library enables an on-disk cache that is memory-mapped for quick lookups to back the datasets. Explore the Hugging Face Hub for the datasets directory You can browse and explore over 20,000 datasets housed in the library on the Hugging Face Hub. The Hub is a centralized platform for discovering and choosing datasets pertinent to your projects. In the search bar at the top, enter keywords related to the dataset you're interested in, e.g., "sentiment analysis," "image classification," etc. You should be able to: Filter datasets by domain, license, language, and so on. Find information such as the size, download number, and download link on the dataset card. Engage with the community by contributing to discussions, providing feedback, or suggesting improvements to the dataset. Load the ‘sashs/dog-food’ dataset Loading the `sashs/dog-food` dataset is pretty straightforward: Install the 🤗 Datasets library and download the dataset. To install Hugging Face Datasets, run the following command: pip install datasets Use the `load_dataset` function to load the 'sasha/dog-food' dataset from Hugging Face: dataset_dict=load_dataset('sasha/dog-food') `load_dataset` returns a dictionary object (`DatasetDict`). You can iterate through the train and test dataset split keys in the `DatasetDict` object. The keys map to a `Dataset` object containing the images for that particular split. You will explore the entire dataset rather than in separate splits. This should provide a comprehensive understanding of the data distribution, characteristics, and potential issues. To do that, merge the different splits into a single dataset using the `concatenate_datasets` function: dataset=concatenate_datasets([dfordindataset_dict.values()]) Perfect! Now, you have an entire dataset to explore with Encord Active in the subsequent sections. If you have not done that already, create a dataset directory to store the downloaded images. # Create a new directory "hugging_face_dataset" in the current working dir huggingface_dataset_path = Path.cwd() / "huggingface_dataset" # Delete dir if it already exists and recreate if huggingface_dataset_path.exists(): shutil.rmtree(huggingface_dataset_path) huggingface_dataset_path.mkdir() Use a loop to iterate through images from the ‘sashs/dog-food’ dataset and save them to the directory you created: for counter, item in tqdm(enumerate(dataset)): image = item['image'] image.save(f'./Hugging Face_dataset/{counter}.{image.format}') If your code throws errors, run the cell in the Colab notebook in the correct order. Super! You have prepared the groundwork for exploring your dataset with Encord Active. Create an Encord Active Project You must specify the directory containing your datasets when using Encord Active for exploration. You will initialize a local project with the image files—there are different ways to import and work with projects in Encord. Encord Active provides functions and utilities to load all your images, compute embeddings, and, based on that, evaluate the embeddings using pre-defined metrics. The metrics will help you search and find images with errors or quality issues. 
Before initializing the Encord Active project, define a function, `collect_all_images`, that obtains a list of all the image files from the `huggingface_dataset_path` directory, takes a root folder path as input, and returns a list of `Path` objects representing image files within the root folder: def collect_all_images(root_folder: Path) -> list[Path]: image_extensions = {".jpg", ".jpeg", ".png", ".bmp"} image_paths = [] for file_path in root_folder.glob("**/*"): if file_path.suffix.lower() in image_extensions: image_paths.append(file_path) return image_paths Remember to access and run the complete code in this cell. Initialize Encord Active project Next, initialize a local project using Encord Active's `init_local_project` function. This function provides the same functionality as running the `init` command in the CLI. If you prefer using the CLI, please refer to the “Quick import data & labels” guide. try: project_path: Path = init_local_project( files = image_files, target = projects_dir, project_name = "sample_ea_project", symlinks = False, ) except ProjectExistsError as e: project_path = Path("./sample_ea_project") print(e)# A project already exists with that name at the given path. Compute image embeddings and analyze them with metrics Analyzing raw image data directly in computer vision can often be impractical due to the high dimensionality of images. A common practice is to compute embeddings for the images to compress the dimensions, then run metrics on these embeddings to glean insights and evaluate the images. Ideally, you compute the embeddings using pre-trained (convolutional neural network) models. The pre-trained models capture the essential features of the images while reducing the data dimensionality. Once you obtain the embeddings, run similarity, clustering, and classification metrics to analyze different aspects of the dataset. Computing embeddings and running metrics on them can take quite a bit of manual effort. Enter Encord Active! Encord Active provides utility functions to run predefined subsets of metrics, or you can import your own sets of metrics. It computes the image embeddings and runs the metrics by the type of embeddings. Encord Active has three different types of embeddings: Image embeddings - general for each image or frame in the dataset Classification embeddings - associated with specific frame-level classifications Object embeddings - associated with specific objects, like polygons or bounding boxes Use the `run_metrics_by_embedding_type` function to execute quality metrics on the images, specifying the embedding type as `IMAGE`: run_metrics_by_embedding_type( EmbeddingType.IMAGE, data_dir=project_path, use_cache_only=True ) The `use_cache_only=True` parameter cached data only when executing the metrics rather than recomputing values or fetching fresh data. This can be a useful feature for saving computational resources and time, especially when working with large datasets or expensive computations. Create a `Project` object using the `project_path` - you will use this for further interactions with the project: ea_project=Project(project_path) Exploring the Quality Of Images From the Hugging Face Datasets Library Now that you have set up your project, it’s time to explore the images! There are typically two ways you could visualize images with Encord Active (EA): Through the web application (Encord Active UI) Combining EA with visualization libraries to display those embeddings based on the metrics We’ll use the latter in this article. 
You will import helper functions and modules from Encord Active with visualization libraries (`matplotlib` and `plotly`). This code cell contains a list of the modules and helper functions. Pre-defined subset of metrics in Encord Active Next, iterate through the data quality metrics in Encord Active to see the list of available metrics, access the name attribute of each metric object within that iterable, and construct a list of these names: [metric.nameformetricinavailable_metrics] You should get a similar output: There are several quality metrics to explore, so let’s define and use the helper functions to enable you to visualize the embeddings. Helper functions for displaying images and visualizing the metrics Define the `plot_top_k_images` function to plot the top k images for a metric: def plot_top_k_images(metric_name: str, metrics_data_summary: MetricsSeverity, project: Project, k: int, show_description: bool = False, ascending: bool = True): metric_df = metrics_data_summary.metrics[metric_name].df metric_df.sort_values(by='score', ascending=ascending, inplace=True) for _, row in metric_df.head(k).iterrows(): image = load_or_fill_image(row, project.file_structure) plt.imshow(image) plt.show() print(f"{metric_name} score: {row['score']}") if show_description: print(f"{row['description']}") The function sorts the DataFrame of metric scores, iterates through the top `k` images in your dataset, loads each image, and plots it using Matplotlib. It also prints the metric score and, optionally, the description of each image. You will use this function to plot all the images based on the metrics you define. Next, define a `plot_metric_distribution` function that creates a histogram of the specified metric scores using Plotly: def plot_metric_distribution(metric_name: str, metric_data_summary: MetricsSeverity): fig = px.histogram(metrics_data_summary.metrics[metric_name].df, x="score", nbins=50) fig.update_layout(title=f"{metric_name} score distribution", bargap=0.2) fig.show() Run the function to visualize the score distribution based on the “Aspect Ratio” metric: plot_metric_distribution(“AspectRatio”,metrics_data_summary) Most images in the dataset have aspect ratios close to 1.5, a normal distribution. The set has only a few extremely small or enormous image proportions. Use EA’s `create_image_size_distribution_chart` function to plot the size distribution of your images: image_sizes = get_all_image_sizes(ea_project.file_structure) fig = create_image_size_distribution_chart(image_sizes) fig.show() As you probably expected for an open-source dataset for computer vision applications, there is a dense cluster of points in the lower-left corner of the graph, indicating that many images have smaller resolutions, mostly below 2000 pixels in width and height. A few points are scattered further to the right, indicating images with a much larger width but not necessarily a proportional increase in height. This could represent panoramic images or images with unique aspect ratios. You’ll identify such images in subsequent sections. Inspect the Problematic Images What are the severe and moderate outliers in the image set? You might also need insights into the distribution and severity of outliers across various imaging attributes. The attributes include metrics such as green values, blue values, area, etc. Use the `create_outlier_distribution_chart` utility to plot image outliers based on all the available metrics in EA. 
The outliers are categorized into two levels: "severe outliers" (represented in red “tomato”) and "moderate outliers" (represented in orange): available_metrics = load_available_metrics(ea_project.file_structure.metrics) metrics_data_summary = get_metric_summary(available_metrics) all_metrics_outliers = get_all_metrics_outliers(metrics_data_summary) fig = create_outlier_distribution_chart(all_metrics_outliers, "tomato", 'orange') fig.show() Here’s the result: "Green Values," "Blue Values," and "Area" appear to be the most susceptible to outliers, while attributes like "Random Values on Images" have the least in the ‘sashs/dog-food’ dataset. This primarily means there are lots of images that have abnormally high values of green and blue tints. This could be due to the white balance settings in the camera for the images or low-quality sensors. If your model trains on this set, it’s likely that more balanced images may perturb the performance. What are the blurry images in the image set? Depending on your use case, you might discover that blurry images can sometimes deter your model. A model trained on clear images and then tested or used on blurry ones may not perform well. If the blur could lead to misinterpretations and errors, which can have significant consequences, you might want to explore the blurry images to remove or enhance them. plot_top_k_images('Blur',metrics_data_summary,ea_project,k=5,ascending=False) Based on a "Blur score" of -9.473 calculated by Encord Active, here is the output with one of the five blurriest images: What are the darkest images in the image set? Next, surface images with poor lighting or low visibility. Dark images can indicate issues with the quality. These could result from poor lighting during capture, incorrect exposure settings, or equipment malfunctions. Also, a model might struggle to recognize patterns in such images, which could reduce accuracy. Identify and correct these images to improve the overall training data quality. plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=True) The resulting image reflects a low brightness score of 0.164: What are the duplicate or nearly similar images in the set? Image singularity in the context of image quality is when images have unique or atypical characteristics compared to most images in a dataset. Duplicate images can highlight potential issues in the data collection or processing pipeline. For instance, they could result from artifacts from a malfunctioning sensor or a flawed image processing step. In computer vision tasks, duplicate images can disproportionately influence the trained model, especially if the dataset is small. Identify and address these images to improve the robustness of your model. Use the “Image Singularity” metric to determine the score and the images that are near duplicates: plot_top_k_images('Image Singularity', metrics_data_summary, ea_project, k=15, show_description=True) Here, you can see two nearly identical images with similar “Image Singularity” scores: The tiny difference between the singularity scores of the two images—0.01299857 for the left and 0.012998693 for the right—shows how similar they are. Check out other similar or duplicate images by running this code cell. Awesome! You have played with a few pre-defined quality metrics. See the complete code to run other data quality metrics on the images. Next Steps: Fixing Data Quality Issues Identifying problematic images is half the battle. 
Ideally, the next step would be for you to take action on those insights and fix the issues. Encord Active (EA) can help you tag problematic images, which may skew model performance downstream. Post-identification, various strategies can be employed to rectify these issues. Below, I have listed some ways to fix problematic image issues. Tagging and annotation Once you identify the problematic images, you can tag them within EA. One of the most common workflows we see from our users at Encord is identifying image quality issues at scale with Encord Active, tagging problematic images, and sending them upstream for annotation with Annotate. Re-labeling Incorrect labels can significantly hamper model performance. EA facilitates the re-labeling process by exporting the incorrectly labeled images to an annotation platform like Encord Annotate, where you can correct the labels. Active learning Use active learning techniques to improve the quality of the dataset iteratively. You can establish a continuous improvement cycle by training the model on good-quality datasets and then evaluating the model on low-quality datasets to suggest datasets to improve. Active learning (encord.com) Check out our practical guide to active learning for computer vision to learn more about active learning, its tradeoffs, alternatives, and a comprehensive explanation of active learning pipelines. Image augmentation and correction Image augmentation techniques enhance the diversity and size of the dataset to improve model robustness. Consider augmenting the data using techniques like rotation, scaling, cropping, and flipping. Some images may require corrections like brightness adjustment, noise reduction, or other image processing techniques to meet the desired quality standards. Image quality is not a one-time task but a continuous process. Regularly monitoring and evaluating your image quality will help maintain a high-quality dataset pivotal for achieving superior model performance. Key Takeaways In this article, you defined the objective of training a binary classification model for your use case. Technically, you “gathered” human labels since the open 'sashs/dog-food' dataset was already labeled on Hugging Face. Finally, using Encord Active, you computed image embeddings and ran metrics on the embeddings. Inspect the problematic images by exploring the datasets based on the objective quality metrics. Identifying and fixing the errors in the dataset will set up your downstream model training and ML application for success. If you are interested in exploring this topic further, there’s an excellent article from Aliaksei Mikhailiuk that perfectly describes the task of image quality assessment in three stages: Define an objective Gather the human labels for your dataset Train objective quality metrics on the data
Oct 19 2023
14 M
Introduction to Quality Metrics
What is a Quality Metric? When you are working with datasets or developing a machine learning model, you often find yourself looking for or hypothesizing about subsets of data, labels, or model predictions with certain properties. Quality metrics form the foundation for finding such data and testing the hypotheses. The core concept is to use quality metrics to index, slice, and analyze the subject in question in a structured way to continuously perform informed actions when cranking the active learning cycle. Concrete example: You hypothesize that object "redness" influences the mAP score of your object detection model. To test this hypothesis, you define a quality metric that captures the redness of each object in the dataset. From the quality metric, you slice the data to compare your model performance on red vs. not red objects. 💡 Tip: Find an example notebook for this use-case here. Quality Metric Defined We like to think of a quality metric as: [A quality metric is] any function that assigns a value to individual data points, labels, or model predictions in a dataset. By design, quality metrics is a very abstract class of functions because the accompanying methodologies are agnostic to the specific properties that the quality metrics express. No matter the specific quality metric, you can: sort your data according to the metric slice your data to inspect specific subsets find outliers compare training data to production data to detect data drifts evaluate your model performance as a function of the metric define model test-cases and much more all of which are possible with Encord Active. Tip: Try to read the remainder of this post with the idea of "indexing" your data, labels, and model prediction based on quality metrics in mind. The metrics mentioned below are just the tip of the iceberg in terms of what quality metrics can capture -- only imagination limits the space. Data Quality Metric Data quality metrics are those metrics that require only information about the data itself. Within the computer vision domain, this means the raw images or video frames without any labels. This subset of quality metrics is typically used frequently at the beginning of a machine learning project where labels are scarce or perhaps not even existing. Below are some examples of data quality metrics ranging from simple to more complex: Image Brightness as a data quality metric on MS COCO validation dataset on Encord. Source: Author Image Singularity as a data quality metric on MS COCO validation dataset on Encord. Source: Author 💡 Tip: See the list of all pre-built data quality metrics here. Label Quality Metric Label quality metrics apply to labels. Some metrics use image content while others apply only to the label information. Label quality metrics serve many purposes but some of the more frequent ones are surfacing label errors, model failure modes, and assessing annotator performance. Here are some concrete examples of label quality metrics ranging from simple to more complex: Object count as a label quality metric on MS COCO validation dataset on Encord. Source: Author Annotation Duplicate as a label quality metric on MS COCO validation dataset on Encord. Source: Author 💡 Tip: See the list of all pre-built label quality metrics here. Model Quality Metric Model quality metrics also take into account the model predictions. The most obvious use-case for these metrics is acquisition functions; answering the question "What should I label next?". 
There are many intelligent ways to leverage model predictions to answer this question. Here is a list of some of the most common ones: Using Model Confidence as model quality metric on MS COCO validation dataset on Encord. It shows the predictions where the confidence is between 50% to 80%. Source: Author Using Polygon Shape Similarity as model quality metric on MS COCO validation dataset on Encord. It ranks objects by how similar they are to their instances in previous frames based on Hu moments. The more an object’s shape changes, the lower its score will be. 💡 Tip: To utilize acquisition functions with Encord Active, have a look here. Custom Quality Metrics We have now gone over some examples of common quality metrics that already exist in Encord Active. However, every machine learning project is different, and most likely, you have just the idea of what to compute in order to surface the data that you want to evaluate or analyze. With Encord Active, you only need to define the per-data-point computation and the tool will take care of everything from executing the computation to visualizing your data based on your new metric. Perhaps, you want to know when your skeleton predictions are occluded or in which frames of video-specific annotations are missing. You could also get even smarter and compare your labels with results from foundational models like SAM. These different use-cases are situations in which you would be building your own custom metrics. You can find the documentation for writing custom metrics here or you can follow any of the links provided above to specific quality metrics and find their implementation on GitHub. If you need assistance developing your custom metric, the [slack channel][ea-slack] is also always open. Conclusion Quality Metrics constitute the foundation of systematically exploring, evaluating and iterating on machine learning datasets and models. We use them for slicing data, comparing data, tagging data, finding label errors, and much more. The true power of these metrics is that they can be arbitrarily specific to a problem at hand. With Encord Active, it is super easy to define, execute, and utilize quality metrics to get the most out of your data, your models, and your annotators. Footnotes [^1]: The difficulty metric is inspired by this paper. [^2]: With COCO, model performance is already evaluated against multiple different subsets of labels. For example, scores like $AP^{\text{small}}$ and $AR^{\text{max=10}}$ from COCO can be expressed as label quality metrics and evaluated with Encord Active.
Apr 19 2023
3 M
Model Test Cases: A Practical Approach to Evaluating ML Models
As machine learning models become increasingly complex and ubiquitous, it's crucial to have a practical and methodical approach to evaluating their performance. But what's the best way to evaluate your models? Traditionally, average accuracy scores like Mean Average Precision (mAP) have been used that are computed over the entire dataset. While these scores are useful during the proof-of-concept phase, they often fall short when models are deployed to production on real-world data. In those cases, you need to know how your models perform under specific scenarios, not just overall. At Encord, we approach model evaluation using with a data-centric approach using model test cases. Think of them as the "unit tests" of the machine learning world. By running your models through a set of predefined test cases before continuing model deployment or prior to deployment, you can identify any issues or weaknesses and improve your model's accuracy. Even after deployment, model test cases can be used to continuously monitor and optimize your model's performance, ensuring it meets your expectations. In this article, we will explore the importance of model test cases and how you can define them using quality metrics. We will use a practical example to put this framework into context. Imagine you’re building a model for a car parking management system that identifies car throughput, measures capacity at different times of the day, and analyzes the distribution of different car types. You've successfully trained a model that works well on Parking Lot A in Boston with the cameras you've set up to track the parking lot. Your proof of concept is complete, investors are happy, and they ask you to scale it out to different parking lots. Car parking photos are taken under various weather and daytime conditions. However, when you deploy the same model in a new parking house in Boston and in another town (e.g., Minnesota), you find that there are a lot of new scenarios you haven't accounted for: In the parking lot in Boston, the cameras have slightly blurrier images, different contrast levels, and the cars are closer to the cameras. In Minnesota, there is snow on the ground, different types of lines painted on the parking lot, and new types of cars that weren't in your training data. This is where a practical and methodical approach to testing these scenarios is important. Let's explore the concept of defining model test cases in detail through five steps: Identify Failure Mode Scenarios Define Model Test Cases Evaluate Granular Performance Mitigate Failure Modes Automate Model Test Cases Identify Failure Mode Scenarios Thoroughly testing a machine learning model requires considering potential failure modes, such as edge cases and outliers, that may impact its performance in real-world scenarios. Identifying these scenarios is a critical first step in the testing process of any model. Failure mode scenarios may include a wide range of factors that could impact the model's performance, such as changing lighting conditions, unique perspectives, or variations in the environment. Let's consider our car parking management system. 
In this case, some of the potential edge cases and outliers could include: Snow on the parking lot Different types of lines painted on the parking lot New types of cars that weren't in your training data Different lighting conditions at different times of day Different camera angles, perspectives, or distance to cars Different weather conditions, such as rain or fog By identifying scenarios where your model might fail, you can begin to develop model test cases that evaluate the model's ability to handle these scenarios effectively. It's important to note that identifying model failure modes is not a one-time process and should be revisited throughout the development and deployment of your model. As new scenarios arise, it may be necessary to add new test cases to ensure that your model continues to perform effectively in all possible scenarios. Furthermore, some scenarios might require specialized attention, such as the addition of new classes to the model's training data or the implementation of more sophisticated algorithms to handle complex scenarios. For example, in the case of adding new types of cars to the model's training data, it may be necessary to gather additional data to train the model effectively on these new classes. Define Model Test Cases Defining model test cases is an important step in the machine learning development process as it enables the evaluation of model performance and the identification of areas for improvement. As mentioned earlier, this involves specifying classes of new inputs beyond those in the original dataset for which the model is supposed to work well, and defining the expected model behavior on these new inputs. Defining test cases begins by building hypotheses based on the different scenarios the model is likely to encounter in the real world. This can involve considering different environmental conditions, lighting conditions, camera angles, or any other factors that could affect the model's performance. Hereafter you define the expected model behavior under the scenario. My model should achieve X in the scenario where Y It is crucial that the test case is quantifiable. That is, you need to be able to measure whether the test case passes or not. In the next section, we’ll get back to how to do this in practice. For the car parking management system, you could define your model test cases as follows: The model should achieve an mAP of 0.75 for car detection when cars are partially covered in or surrounded by snow. The model should have an accuracy of 98% on parking spaces when the parking lines are partially covered in snow. The model should achieve an mAP of 0.75 for car detection in parking houses under poor light conditions. Evaluate Granular Performance Once the model test cases have been defined, the performance can be evaluated using appropriate performance metrics for each model test case. This might involve measuring the model's mAP, precision, and recall of data slices related to specified test cases. To find the specific data slices relevant to your model test case we recommend using Quality metrics. Quality metrics are useful to evaluate your model's performance based on specific criteria, such as object size, blurry images, or time of day. In practice, they are additional parametrizations added on top of your data, labels, and model predictions and they allow you to index your data, labels, and model predictions in semantically relevant ways. Read more here. 
Quality metrics can then be used to identify data slices related to your model test cases. To evaluate a specific model test case, you identify a slice of data that has the properties the test case defines and evaluate your model's performance on that slice.
Mitigate Failure Modes
If your model test case fails and the model is not performing according to your expectations in the defined scenario, you need to take action to improve performance. This is where targeted data quality improvements come in. These improvements can take various shapes and forms, including:
- Data collection campaigns: Collect new data samples that cover the identified scenarios. Remember to ensure data diversity by obtaining samples from different locations and parking lot types. You should also regularly update the dataset to account for new scenarios and maintain model performance.
- Relabeling campaigns: If your failure modes are due to label errors in the existing dataset, it would be beneficial to correct any inaccuracies or inconsistencies in the labels before collecting new data. If your use case is complex, we recommend collaborating with domain experts to ensure high-quality annotations.
- Data augmentation: By applying methods such as rotation, color adjustment, and cropping, you can increase the diversity of your dataset. Additionally, you can utilize techniques to simulate various lighting conditions, camera angles, or environmental factors that the model might encounter in real-world scenarios. Implementing domain-specific augmentation techniques, such as adding snow or rain to images, can further enhance the model's ability to generalize to various situations.
- Synthetic data generation: Creating artificial data samples can help expand the dataset, but it is essential to ensure that the generated data closely resembles real-world scenarios to maintain model performance. Combining synthetic data with real data can increase the dataset size and diversity, potentially leading to more robust models.
Automated Model Test Cases
Once you've defined your model test cases, you need a way to select data slices and test them in practice. This is where quality metrics and Encord Active come in. Encord Active is an open-source, data-centric toolkit that allows you to investigate and analyze your data distribution and model performance against these quality metrics in an easy and convenient way.
The chart above is automatically generated by Encord Active using uploaded model predictions. The chart shows the dependency between model performance and each metric, i.e., how much model performance is affected by each metric.
With quality metrics, you can identify areas where the model is underperforming, even if it's still achieving high overall accuracy. This makes them perfect for testing your model test cases in practice. For example, the quality metric that specifically measures the model's performance in low-light conditions (see "Brightness" among the quality metrics in the figure above) will help you understand whether your car parking management system will struggle to detect cars in low-light conditions. You could also use the "Object Area" quality metric to create a model test case that checks if your model has issues with different sizes of objects (different distances to cars result in different object areas).
One of the benefits of Encord Active is that it is open source, and it enables you to write your own custom quality metrics to test your hypotheses around different scenarios.
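As an illustration of what such a custom quality metric can look like, here is a minimal, standalone sketch of a crude "dark and blurry" score using OpenCV. It is not the Encord Active metric interface, and the file name is a placeholder:

# Minimal sketch of a custom quality metric: a crude "dark and blurry" score
# that could serve as a proxy for heavy rain or poor lighting.
import cv2
import numpy as np

def brightness(image_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(gray.mean() / 255.0)  # 0 (dark) .. 1 (bright)

def sharpness(image_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())  # low variance ~ blurry

def dark_and_blurry_score(image_bgr: np.ndarray) -> float:
    # Higher score = darker and blurrier frame.
    return (1.0 - brightness(image_bgr)) / (1.0 + sharpness(image_bgr))

# img = cv2.imread("frame_0001.jpg")  # placeholder path
# print(dark_and_blurry_score(img))

Scoring every image with a function like this lets you rank and slice the dataset exactly as you would with the built-in metrics.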
Tip: If you have any specific things you'd like to test, please get in touch with us and we would gladly help you get started.
This means that you can define quality metrics that are specific to your use case and evaluate your model's performance against them. For example, you might define a quality metric that measures the model's performance in heavy rain conditions (a combination of low Brightness and Blur). Finally, if you would like to visually inspect the slices your model is struggling with, you can also visualize the model predictions themselves (TPs, FPs, and FNs).
Tip: You can use Encord Annotate to directly correct labels if you spot any outright label errors.
Back to the car parking management system example: once you have defined your model test cases and evaluated your model's performance against them using quality metrics, you can find low-performing "slices" of data. If you've defined a model test case for the scenario where there is snow on the ground in Minnesota, you can:
- Compute the quality metric that measures the model's performance in snowy conditions.
- Investigate how much this metric affects the overall performance.
- Filter the slice of images where your model's performance is low.
- Set in motion a data collection campaign for images in similar conditions.
- Set up an automated model test that always tests for performance on snowy images in your future models.
Tip: If you already have a database of unlabeled data, you can leverage similarity search to find images of interest for your data collection campaigns.
Benefits of The Model Test Case Framework
As machine learning models continue to evolve, evaluating them is becoming more important than ever. By using a model test case framework, you can gain a more comprehensive understanding of your model's performance and identify areas for improvement. This approach is far more effective and safer than relying solely on high-level accuracy metrics, which can be insufficient for evaluating your model's performance in real-world scenarios. So to summarize, the benefits of using model test cases instead of only high-level accuracy metrics are:
- Enhanced understanding of your model: You gain a thorough understanding of your model by evaluating it in detail (rather than depending on one overall metric). Systematically analyzing its performance improves your (and your team's) confidence in its effectiveness during deployment and strengthens the model's credibility.
- Allows you to concentrate on addressing model failure modes: Armed with an in-depth evaluation from Encord Active, efforts to improve a model can be directed toward its weak areas. Focusing on the weaker aspects of your model accelerates its development, optimizes engineering time, and minimizes data collection and labeling expenses.
- Fully customizable to your specific case: One of the benefits of using open-source tools like Encord Active is that they enable you to write your own custom quality metrics and set up automated triggers without having to rely on proprietary software.
If you're interested in incorporating model test cases into your data annotation and model development workflow, don't hesitate to reach out.
Conclusion
In this article, we started by understanding why defining model test cases and using quality metrics to evaluate model performance against them is essential. It is a practical and methodical approach to identifying data-centric failure modes in machine learning models.
By defining model test cases, evaluating model performance against quality metrics, and setting up automated triggers to test them, you can identify areas where the model needs improvement, prioritize data labeling efforts accordingly, and improve the model's credibility with your team. Furthermore, this approach shifts the development cycle from reactive to proactive: you can find and fix potential issues before they occur, rather than deploying your model in a new scenario, discovering poor performance, and then scrambling to fix it. Open-source tools like Encord Active enable users to write their own quality metrics and set up automated triggers without having to rely on proprietary software. This can lead to more collaboration and knowledge sharing across the machine-learning community, ultimately leading to more robust and effective machine-learning models in the long run.
Mar 22 2023
3 Ways To Add More Classes To Computer Vision Models
Adding new classes to a production computer vision model may be necessary for a number of reasons, which we’ve explored in more detail below: improved accuracy increased versatility increased robustness When adding new classes, it is important to have enough high-quality data, use robust evaluation methods, and monitor the performance of the model over time to ensure its continued effectiveness. Adding new classes to a computer vision model can lead to improved accuracy, increased versatility, and the ability to handle a wider range of inputs, but only when it is done well. How do I know if I need to add new classes to my computer vision model? When developing a computer vision model and putting it into production, it is essential to continually benchmark its performance and consider adding new classes where performance is lacking. Several signs may indicate the need for adding new classes, including: Decreased accuracy on new data Changes in business requirements Changes in the environment Insufficient data for existing classes Overfitting Decreased accuracy on new data If you are observing a drop in accuracy when applying your model to new data, it may be due to the fact that the model has not encountered examples of the new classes present in the data or there are some errors in your dataset that you need to fix. To improve accuracy, you can add these classes to your training set and retrain your model. Changes in business requirements As business needs evolve, it may be necessary to add new classes to your model to account for the new objects or scenes that are now relevant to your application. For example, if you previously developed a model to recognize objects in a warehouse, but now need to extend it to recognize objects in a retail store, you may need to add new classes to account for the different types of products and displays. Changes in the environment Changes in the environment in which your model is being used can also impact its performance. For example, if the lighting conditions have changed or if there is a new background in the real-world images the detection model is analyzing, it may be necessary to add new classes to account for these changes. Insufficient data for existing classes If the data you have collected for your existing classes is not sufficient to train a high-quality model, adding new classes can help to improve overall performance. This is because the model will have access to more data to learn from. Overfitting Overfitting occurs when a model memorizes specific examples in the training data instead of learning general patterns. If you suspect that your model is overfitting, it may be because you have not provided it with enough variability in the training data. In this case, adding new classes can help to reduce overfitting by providing more diverse examples for the model to learn from. Quality and Quantity of Data It is important to consider the quality and quantity of data when adding new classes. A good rule of thumb is to have at least 100-1000 examples per class, but the number may vary depending on the complexity of the classes and the size of your model. The data should also be diverse and representative of the real-world scenarios in which the model will be used. To evaluate the effectiveness of the model with the added classes, it is important to use robust evaluation methods such as cross-validation. This will provide a reliable estimate of the model's performance on unseen data and help to ensure that it is not overfitting to the new data. 
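As a minimal illustration of such a cross-validation check, here is a sketch using scikit-learn; the feature matrix, labels, and classifier are random placeholders standing in for your own embeddings and model:

# Quick k-fold cross-validation sanity check after adding new classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 128)           # e.g. image embeddings
y = np.random.randint(0, 5, size=500)  # 5 classes, including the newly added ones

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")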
Additionally, it is important to monitor the performance of your model over time and to be proactive in adding new classes if needed. Regular evaluation and monitoring can help you quickly identify when new classes are needed and ensure that your model remains up-to-date and effective. What are the Benefits of Adding New Classes to a Computer Vision Model? Adding new classes to a computer vision model can have several benefits, including: improved accuracy increased versatility the ability to handle a wider range of inputs. Improved Accuracy One of the main benefits of adding new classes to a computer vision model is improved accuracy. By adding new classes, the model can learn to recognize a wider range of objects, scenes, and patterns, leading to better performance in real-world applications, such as facial recognition or self-driving cars. This can be particularly important for tasks like image classification, object detection, and semantic segmentation, where the goal is to accurately identify and classify the elements in an image or video. With a larger number of classes, the model can learn to distinguish between similar objects, such as different breeds of dogs or species of flowers, and better generalize to unseen examples. Results from the paper "an image is worth 16x16 words: transformers for image recognition at scale" show that by using more data (JFT-300M dataset has ~375 million annotated images) you can significantly improve the model's performance. Increased Versatility Another benefit of adding new classes is increased versatility. By expanding the range of objects and scenes the model can recognize, it can be applied to a wider range of use cases and problems. For example, a model trained on a large image dataset of natural images can be adapted to a specific domain, such as medical imaging, by adding classes relevant to that domain. This can help the model to perform well in more specialized applications, such as disease diagnosis or surgical planning. Increased Robustness Adding new classes can also help the model handle a wider range of inputs. For example, a model trained on a diverse set of images can be more robust to variations in lighting, viewpoint, and other factors that can affect image quality. This can be especially important for real-world applications, where the images used to test the artificial intelligence model may be different from those used during training. How To Add New Classes To Your Computer Vision Model Adding new classes to a computer vision model is a crucial step in improving its accuracy and functionality– There are several steps involved in this process, including data collection, model training, debugging, and deployment. To make this task easier, various tools and software libraries have been developed. There are three main ways to prepare a new class for your computer vision model: Manual collection and annotation Generating synthetic data Active learning Let’s take a look at them one by one. Manual Dataset Collection and Annotation Annotation refers to the process of identifying and labeling specific regions in video or image data. This technique is mainly used for image classification through supervised learning. The annotated data serves as the input for training machine learning or deep learning models. 
With a large number of annotations and diverse image variations, the model can identify the unique characteristics of the images and videos and learn to perform tasks such as object detection, object tracking, or image classification, depending on the type of model being trained. There are various types of annotations, including 2D bounding boxes, instance segmentation, 3D cuboid boxes, and keypoint annotations:
- Bounding boxes or 3D cuboid boxes can be drawn around objects and assigned labels for object detection.
- Polygonal outlines can be traced around objects for semantic and instance segmentation, where instance segmentation involves identifying and outlining multiple objects in an image.
- Keypoints and landmarks can be identified and labeled for object landmark detection, and straight lines can be marked for lane detection.
- For image classification, images are grouped based on labels.
To prepare and annotate your own dataset, you can either record videos, take photos, or search for freely available open-source datasets online. If your company has already collected a dataset, you can connect it to a platform via a cloud bucket integration (S3, Azure, GCP, etc.). However, before you can use these images for training, you need to annotate them, as opposed to using data from an already annotated dataset. When collecting data, make sure to keep it as close to your intended inference environment as possible, considering all aspects of the input images, including lighting, angle, objects, etc. For example, if you want to build machine learning models that detect license plates, you must take into account different lighting and weather conditions.
There are many tools available for annotating images for computer vision datasets. Each tool has its own set of features and is designed for a specific type of project.
- Encord Annotate: An annotation platform for AI-assisted image and video annotation and dataset management. It's the best option for teams that are looking to use AI automation to make the video and image annotation process more efficient.
- CVAT (Computer Vision Annotation Tool): A free, open-source, web-based annotation toolkit built by Intel. CVAT supports four types of annotations (points, polygons, bounding boxes, and polylines).
- Labelbox: A US-based data annotation platform.
- Appen: A data labeling tool founded in 1996, making it one of the first and oldest solutions in the market.
These are just a few of the tools available for adding new classes to a computer vision dataset. The best tool for you will depend on the specific needs of your project, such as:
- The size of your dataset.
- The types of annotations you need to make.
- The platform you are using.
In an ideal scenario, the annotation tool should seamlessly integrate into your machine learning workflow. It should be efficient, user-friendly, and allow for quick and accurate annotation, enabling you to focus on training your models and improving their performance. The tool should also have the necessary functionalities and features to meet your specific annotation requirements, making the overall process of annotating data smoother and more efficient.
Synthetic Datasets
Another way to create a dataset is to generate synthetic data. This method can be especially useful for training in unusual circumstances, as it allows you to create a much larger dataset than you could otherwise obtain from real-world sources. As a result, your model is likely to perform better and achieve better results.
However, it is not recommended to use only synthetic data or put synthetic data into validation/test data. Generating synthetic computer vision datasets is another option for adding new classes to your model. There are several tools available for this purpose: Unity3D/Unreal Engine: Popular game engines that can be used to generate synthetic computer visiondatasets by creating virtual environments and simulating camera movements. Blender: A free and open-source 3D creation software that can be used to generate synthetic computer visiondatasets by creating 3D models and rendering them as images. AirSim: an open-source, cross-platform simulation environment for autonomous systems and robotics, developed by Microsoft. It uses Unreal Engine for physically and visually realistic simulations and allows for testing and developing autonomous systems such as drones, ground vehicles, and robotic systems. CARLA: an open-source, autonomous driving simulator. It provides a platform for researchers and developers to test and validate their autonomous vehicle algorithms in a simulated environment. CARLA simulates a variety of real-world conditions, such as weather, traffic, and road layouts, allowing users to test their algorithms in a range of scenarios. It also provides a number of pre-built maps, vehicles, and sensors, as well as an API for creating custom components. Generative adversarial networks (GANs) allow you to generate synthetic data by setting two neural networks to compete against each other. One generates the data and the other identifies whether it's real or synthetic. Through a process of iteration, the models adjust their parameters to improve their performance, with the discriminator becoming better at distinguishing real from synthetic data and the generator becoming more effective at creating accurate synthetic data. GANs can be used to supplement training datasets that lack sufficient real-world data, but there are also challenges to using synthetic data that need to be considered. These tools can be used to generate synthetic data for various computer vision tasks, such as object detection, segmentation, and scene understanding. The choice of tool will depend on the specific requirements of your project, such as the type of data you need to generate, the complexity of the scene, and the resources available. Annotation with Active Learning Active learning is a machine learning technique that trains models by allowing them to actively query annotators for information that will help improve their performance. The process starts with a small initial subset of labeled data from a large dataset. The model uses this labeled data to make predictions on the remaining unlabeled data. ML engineers and data scientists then evaluate the model's predictions to determine its level of certainty. A common method for determining uncertainty is by looking at the entropy of the probability distribution of the prediction. For example, in image classification, the model reports a probability of confidence for each class considered for each prediction made. If the model is highly confident in its prediction, such as a 99 percent probability that an image is a motorcycle, then it has a high level of certainty. If the model has low certainty in its prediction, such as a 55 percent probability that an image is a truck, then the model needs to be trained on more labeled images of trucks. Another example is the classification of images of animals. 
After the model is initially trained on a subset of labeled data, it can identify cats with high certainty but is uncertain about how to identify a dog, reporting a 51 percent probability that it is not a cat and a 49 percent probability that it is a cat. In this case, the model needs to be fed more labeled images of dogs so that the ML engineers can retrain the model and improve its performance. The samples with high uncertainty are sent back to the annotators, who label them and provide the newly labeled data to the ML engineers. The engineers then use this data to retrain the model, and the process continues until the model reaches an acceptable performance threshold. This loop of training, testing, identifying uncertainty, annotating, and retraining allows the model to continually improve its performance. Active learning pipelines also help ML engineers identify failure modes such as edge cases, where the model makes a prediction with high uncertainty, indicating that the data does not fit into one of the categories that the model has been designed to detect. The model flags these outliers, and the ML engineers can retrain the model with the labeled sample to help the model learn to identify these edge cases. Using active learning in machine learning can make model training faster and cheaper while reducing the burden of data labeling for annotators. Instead of labeling all the data in a massive dataset, organizations can intelligently select and label a portion of the data to increase model performance and reduce costs. With an AL pipeline, ML teams can prioritize labeling the data that is most useful for training the model and continuously adjust their training as new data is annotated and used for training. Surprisingly, active learning is also useful even when ML engineers have a large amount of already labeled data. Training the model on every piece of labeled data in a dataset can be a poor allocation of resources, and active learning can help select a subset of data that is most useful for training the model, reducing computational costs. Active learning is a powerful ML technique that allows models to actively seek information that will help improve their performance. By reducing the burden of data labeling and optimizing the use of computational resources, active learning can help organizations achieve better results more efficiently and cost-effectively. However, an active learning pipeline can be hard to implement. Encord Active is an open-source active learning tool that includes visualizations, workflows, and a set of data and labels quality metrics and model performance analysis based on the model's predictions. It allows you to add the model's predictions, filter them by model's confidence and export them into your annotation tool (for example Encord Annotate). What Do You Do Once You’ve Added New Classes? Once you’ve added new classes to your computer vision model, there are several steps you can take to optimize its performance: Evaluate the model Fine-tune the model Data augmentation Monitor performance Evaluate The Model The first step after adding new classes to your model is to evaluate its performance. This involves using a dataset of images or videos to test the model and see how well it can recognize the new classes. You can use metrics like accuracy, precision, recall, and F1 score to quantify the model's performance and compare it with baseline models. 
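A minimal sketch of such an evaluation with scikit-learn; the labels, predictions, and class names below are placeholders for your own test set:

# Per-class precision, recall, and F1 make it obvious whether the newly
# added classes lag behind the original ones.
import numpy as np
from sklearn.metrics import classification_report

y_true = np.random.randint(0, 5, size=200)  # ground-truth class ids
y_pred = np.random.randint(0, 5, size=200)  # model predictions
class_names = ["sedan", "suv", "truck", "van", "pickup"]  # hypothetical classes

print(classification_report(y_true, y_pred, target_names=class_names))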
You can also visualize the results and check model performance using confusion matrices, precision-recall curves, and ROC curves. These evaluations will help you identify areas where the model is performing well and where it needs improvement. Fine-Tune The Model Based on the evaluation results, you may need to fine-tune the model to optimize its performance for the new classes. Fine-tuning can involve adjusting the model's hyperparameters, such as learning rate or weight decay, or adjusting the architecture of the model itself. You can also use techniques like transfer learning to leverage pre-trained models and fine-tune them for your specific task. Data Augmentation Another approach to improving the model's performance is to use data augmentation. This involves transforming the existing training data to create new, synthetic examples. For example, you can use techniques like random cropping, flipping, or rotation to create new training samples. By increasing the size of the training dataset, data augmentation can help to prevent overfitting and improve the model's generalization ability. Monitor Performance Once you’ve fine-tuned the model, it’s important to monitor its performance over time. This can involve tracking the model's behavior on a test set or in a real-world deployment and adjusting the model as needed to keep it up-to-date. Monitoring performance can help to ensure that the model continues to function well when new classes are added and as the underlying data distribution changes. Adding new classes to a computer vision model is just the first step in optimizing its performance. By evaluating the model, fine-tuning its parameters, using data augmentation and regularization, and monitoring its performance, you can make the model more accurate, versatile, and robust to new classes. These steps are crucial for ensuring that your model remains effective and up-to-date over time and for achieving the best possible performance in real-world applications. Want To Start Adding More Classes To Your Model? “I want to start annotating” - Get a free trial of Encord here. "I want to get started right away" - You can find Encord Active on Github here or try the quickstart Python command from our documentation. "Can you show me an example first?" - Check out this Colab Notebook. If you want to support the project you can help us out by giving a Star on GitHub ⭐ Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning. Join the Slack community to chat and connect.
Mar 07 2023
How to Choose the Right Data for Your Computer Vision Project
For machine learning practitioners developing commercial applications, finding the best data for training their computer vision model is critical for achieving the required performance. But this kind of data-centric approach is only as good as your ability to get the right data to put into your model. To help you get that data, this article will explore: The importance of finding the best training data How to prioritize what to label How to decide which subset of data to start training your model on How to use open-source tools to select data for your computer vision application Why Focus on Data for Training Your Computer Vision Model? There are several reasons why you should focus largely on building training data while working on your computer vision model. We will look briefly into three main aspects of your training data: Data Quality Firstly, the quality of the data used for training directly impacts the model's performance. An artificial intelligence model’s ability to learn and generalize to new, unseen data depends on the quality of the data it is trained on. For your computer vision model, the data type can be images, videos, or DICOMs. No matter the type, the data quality can be considered good when it covers a wide range of possible scenarios and edge cases (i.e. it is representative of the problem space). If the data used for training is biased or unrepresentative, the model may perform poorly on real-world data that falls outside the training data range. This may be because the model is overfitting or underfitting the training data and not generalizing well to new data. Secondly, the data used for training should be relevant to the task at hand. This means that the features used in the data should be meaningful and predictive of the target variable. If irrelevant or redundant features are included in the training data, the model may learn patterns that are not useful for making accurate predictions. For example, let's say you want to check the quality of the dataset you have built for the classification task. In the case of classification, a good measure of the dataset's quality is how separable the data points are. If the data points can be grouped into separable clusters, then the model will learn to classify those clusters easily. If the datapoints aren’t separable, the model cannot learn to classify. To define the separability of the dataset, decomposition algorithms are used (Such as Principal Component Analysis, autoencoders, variational autoencoders, etc.) By plotting the features in 2D using a decomposition model, you can see in this dataset, the data points cannot be separated into clusters. Hence, data quality would be considered low for classification tasks. By plotting the features in 2D using a decomposition model, you can see in this dataset, the data points can easily be separated into clusters. Hence, data quality would be considered high for classification tasks. Of course, different computer vision tasks would each need a different measure to define the overall quality of the data. Data Quantity Sufficient data is also important for computer vision models. The amount of data needed for training varies depending on the complexity of the task and the model used (e.g., complex deep learning model segmentation tasks have one requirement, whereas a simple classification task might have a different one). 
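Returning briefly to the separability check described above, here is a minimal sketch using PCA from scikit-learn; the feature matrix and labels are random placeholders for your own image embeddings:

# Project image features to 2D and inspect whether classes form distinct clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

features = np.random.rand(1000, 512)         # e.g. CNN embeddings
labels = np.random.randint(0, 4, size=1000)  # class ids

coords = PCA(n_components=2).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("2D PCA projection of image features")
plt.show()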
For most use cases, a large amount of training data is better, as it allows the model to learn more robust representations of the computer vision task (read our blog on data augmentation to understand how to efficiently add more training images to your dataset). However, it bears repeating that data collection, data processing, and data annotation are time-consuming and resource intensive, so you have to find the right balance. Label Quality In addition to the quality of the data sources, the way the data is labeled also makes a huge difference. Label quality refers to the accuracy and consistency of the annotations or labels assigned to the data in computer vision tasks. It is a crucial factor that determines the performance of machine learning algorithms in tasks such as object detection, image classification, and semantic segmentation. High-quality labels provide the model with accurate information, allowing it to learn and generalize well to new data. In contrast, low-quality or inconsistent labels can lead to overfitting, poor performance, and incorrect predictions. Therefore, it is important to ensure that the labels used for training computer vision models are of high quality and accurately reflect the visual content of the images. There are several ways you can mitigate label errors in your available datasets, such as: Clear label instructions: A clear set of guidelines that are shared amongst the team is essential before starting the annotation process. It will ensure uniformity among the labels tagged and also provide information about the labels to the machine learning engineer who would use these annotated data. Quality assurance system: The best practice while building a computer vision model is to have a quality assurance system in the pipeline. Here, you have an additional step to audit a subset of the labeled data. This is particularly helpful when you are dealing with a massive dataset. The addition of this step also makes your model more production friendly. Use trained models to detect label errors: There are tools available (such as Encord Active) that have features to help you detect label errors. By being able to easily find label errors you can ensure they get rectified before they start reducing the accuracy of your model. To learn more about different label errors and how you can efficiently handle them, read our three-part guide Data errors in Computer Vision: Find and Fix Label Errors(part 1). Once you’ve collected the image or video data you need and you’re confident with the quality of the dataset, you need to start labeling it. But labeling data is a time-consuming and labor-intensive process. So it is vital you’re first able to identify which data should be labeled, rather than trying to label all of it. How do you decide what data you should label? The decision of which data to label is usually guided by the specific task and the resources at your availability. Below are a few factors you should consider while selecting data for your next computer vision project: Problem definition: The problem you are trying to solve should guide the selection of data to label. For example, if you are building a binary classifier, you should label data that contains examples of both classes. Data quality: High-quality data is essential for building accurate machine learning models, so it’s important to prioritize labeling high-quality data. 
For example, if you are building an object detection model, you need the object of interest to be represented from a different point of view in the dataset. The annotations should also specify which is the object of interest and which isn’t (even if they “look” similar). Data diversity: To ensure that your model generalizes well, it’s important to label data that is diverse and representative of the problem you are trying to solve. Annotation cost: Video or image annotation can be time-consuming and expensive, so it’s important to prioritize the most important data to label based on the available resources. Model requirements: The data should be selected according to the machine learning algorithm you want to use. These algorithms are selected based on the problem you want to solve. Some algorithms require more labeled data than others, so it’s important to consider this when selecting data to label. Available resources: The amount of data that can be annotated is often limited by the available resources, such as time and budget. The selection of data to be annotated should take these constraints into account. These factors help you to understand the properties of the data. Understanding your data is essential to curate a training dataset that will help the computer vision algorithm solve the task at hand better. One approach is to start by labeling a small subset of the data to get a feel for the types of patterns that need to be learned. This initial labeling phase can then be used to identify which data is particularly difficult to classify or identify data that is critical for the success of the application. Such experimentation is easier when using an annotation tool. This can also provide valuable insights into the types of patterns that need to be learned. Another approach is to leverage existing annotated data or use expert annotators or domain experts. This data can provide a starting point for labeling. This method can be more expensive but does require less experimentation. How to Decide Which Subset of Data for Model Training Once you have labeled your data, the next step is to split the data into training, validation, and test subsets. The training set is used to train the model, the validation set is used to tune the model's hyperparameters, and the test set is used to evaluate the model's performance on unseen data. Now, when deciding which subset of data to start training your ML model on, the following factors can be considered: Size of the dataset: A smaller subset can be used initially to quickly train and evaluate your model, while larger subsets are used to build a robust and fine-tuned model. Representativeness of the data: It is important to choose a subset that is representative of the entire dataset, to avoid biases in the model’s predictions. Annotation quality: If the data has been manually annotated (with bounding boxes, polygons, or anything else), it is important to ensure that the annotations are of high quality to avoid training a model on incorrect or inconsistent information. Computational resources: The volume of computational resources available for training can influence the size of the subset that can be used. The ultimate objective is to select a subset that balances between being small enough to train quickly and being large enough to represent the entire dataset accurately. There are data annotation tools that provide features that allow you to understand, annotate and analyze your data with ease. 
Let's look at one such open-source tool and how to use it to understand your data.
Using Open-Source Tools to Select Data For Training
Here we will be using the open-source tool Encord Active on the RarePlanes dataset (which contains real and synthetically generated satellite imagery). For that, let's install Encord Active in Python (or check out the Encord Active repository on GitHub):
python3.9 -m venv ea-venv
source ea-venv/bin/activate
# within venv
pip install encord-active
RarePlanes is an open-source dataset that is available as a sandbox project on Encord Active and can be downloaded through the CLI itself. If you want to download it on your own, you need an AWS account and the AWS CLI installed and configured. Let's download the RarePlanes image dataset:
# within venv
encord-active download
# select the rareplanes dataset
# to visualize
cd rareplanes
encord-active visualize
Now you can see that the dataset has been imported to the platform, and you can visualize different properties of both the data and the labels. First, let's discuss the data properties. Upon selecting Data Quality→Summary, you can visualize an overview of the data properties.
Data quality summary in Encord Active
From the summary, you can select the properties which show severe outliers, such as the green values. This allows you to analyze the quality of the dataset. The tool also provides options to either distribute the outliers evenly when splitting the data into training, validation, and test sets, or to eliminate the outliers entirely. For example, using Action→Filter&Export, we can see that the dataset initially has 9,524 data points. Previously, we saw that the data contains 35 blur outliers. After evaluating those blur outliers, we can easily eliminate them while creating the new dataset by simply selecting from the drop-down menu. The blur threshold can also be varied.
Now let's see how the tool can help in checking label quality. By selecting Label Quality→Summary, you can visualize the different properties of the labels. This helps in checking how well the data has been labeled and whether it is suitable for your machine learning project.
Label quality summary page in Encord Active
It allows you to see the major issues to fix in your dataset, so you can directly filter out the outliers and easily address these issues. For example, in the summary above, you can see the RarePlanes dataset has 483 annotation duplicates.
Example of annotation duplicates
Encord Active's explorer page showing the distribution of annotation duplicates
In Label Quality→Explore, upon selecting Annotation duplicates, you can analyze the outliers with ease. These outliers can be dealt with differently depending on their quantity and type. The dataset above has almost 7% duplicates. These annotations would not provide any additional information for the model to learn, so it would be better to remove or merge them.
Next Steps After Choosing Your Data
After data curation and annotation, the project workflow can head in different directions based on where you are in your ML model lifecycle. If you have just finished curating your dataset, then checking whether the amount of data is enough should be your next step. If you are happy with your curated dataset, then building a baseline model would be a good start.
This can help you judge if you need to make changes to the architecture of your machine learning algorithm or do feature extraction to improve the input to your algorithm. Create a Baseline Model It is a simple model built on the dataset curated without feature engineering. The goal of creating a baseline model is to create a starting point for evaluating model performance and to provide a standard to compare other, more complex models. It is important to keep in mind that the performance of a baseline model is often limited, but it provides a useful starting point for building more complex, trained models. Check if the Amount of Data is Enough The amount of data needed for a machine learning project depends on several factors, including the complexity of the model, the complexity of the learning algorithm, and the level of accuracy desired. In general, more data can lead to more accurate models, but there is a point of diminishing returns where increasing the data size does not significantly improve performance. As a rough rule of thumb, in computer vision, 1000 images per class is enough. This number can go down significantly if pre-trained models are used (Source). According to a study, the performance on vision tasks increases logarithmically based on the volume of training data size. So a large dataset is always desirable. But nowadays, it isn’t that common to build neural networks from scratch as there are many high-performance pre-trained models available. Using transfer learning on these pre-trained models, you can improve performance on many computer vision tasks with a significantly smaller dataset. Feature Extraction A feature in computer vision is a quantifiable aspect of your image that is specific to the object you are trying to recognize. It could be a standout hue in an image or a particular shape like a line, an edge, or a section of an image. Feature extraction is an important step as the input image has too much extra information which isn’t necessary for the model to learn. So, the first step after the preparation of the dataset is to simplify the images by extracting the important information and throwing away non-essential information. By extracting important colors or image segments, we can transform complex and large image data into smaller sets of features. This allows the computer vision model to learn faster. Feature extraction can be done using traditional methods or deep neural networks. In traditional machine learning problems, you spend a good amount of time in manual feature selection and engineering. In this process, we rely on our domain knowledge (or partnering with domain experts) to create features that make the machine learning algorithms work better. Some of the traditional methods used for feature extraction are: Harris Corner Detection: Uses a Gaussian window function to detect corners. Scale-Invariant Feature Transform (SIFT): Scale-invariant technique for corner detection. Features from Accelerated Segment Test (FAST): A faster version of SIFT for corner detection and mainly used in real-time applications. In a deep learning model, we don’t need to extract features from the image manually. The network automatically extracts features and learns their importance on the output by applying weights to its connections. You feed the raw image to the network and, as it passes through the network layers, it identifies patterns within the image to create features. 
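Before turning to learned feature extractors, here is a quick example of the classical route using OpenCV. ORB is used here simply because it ships with the default OpenCV build, and the image path is a placeholder:

# Classical keypoint detection and description with ORB.
import cv2

img = cv2.imread("parking_lot.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")

vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("parking_lot_keypoints.jpg", vis)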
A few examples of neural networks that are used as feature extractors are: Superpoint: Self-Supervised Interest Point Detection and Description: Proposes a fully convolutional neural network that computes SIFT-like interest point locations and descriptors in a single forward pass. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features: The paper proposes a single CNN that is both a dense feature descriptor and a feature detector. LF-Net: Learning Local Features from Images: The paper suggests using a sparse-matching deep architecture and uses an end-to-end training approach on image pairs having relative pose and depth maps. They run their detector on the first image, find the maxima and then optimize the weights so that when run on the second image, they produce a clean response map with sharp maxima at the right locations. Conclusion We went through the steps you can follow while curating your dataset. Our next focus was on choosing which data you should label and the subset of data you should use for training your computer vision model. We also looked at how you can use open-source tools to help us with data processing. Lastly, we discussed the next steps you can take once you’ve processed your data and what needs doing from there. Take a look at the links below for some further ideas for what to do once you’ve created your training dataset! The computer vision pipeline, Part 4: feature extraction Baselines: The Complete Guide A practical guide to active learning for computer vision Data-Centric Case Study: Improving Model Performance
Feb 13 2023
A Practical Guide to Active Learning for Computer Vision
Why do we need active learning? Data is the fuel that powers modern deep learning models, and copious amounts of it are often required to train these neural networks to learn complex functions from high-dimensional data such as images and videos. When building new datasets or even repurposing existing data for a new task, the examples have to be annotated with labels in order to be used for training and testing. However, this annotation process can sometimes be extensively time-consuming and expensive. Images and videos can often be scraped or even taken automatically, however labeling for tasks like segmentation and motion detection is laborious. Some domains, such as medical imaging, require domain knowledge from experts with limited accessibility. Wouldn’t it be nice if we could pick out the 5% of examples most useful to our model, rather than labeling large swathes of redundant data points? This is the idea behind active learning. What is active learning? Active learning is an iterative process where a machine learning model is used to select the best examples to be labeled next. After annotation, the model is retrained on the new, larger dataset, then selects more data to be labeled until reaching a stopping criterion. This process is illustrated in the figure below. When examples are iteratively selected from an existing dataset, this corresponds to the pool-based active learning paradigm. We will focus on this problem setup as it reflects the overwhelming majority of real-world use cases. Read more: For a basic overview of active learning look here. In this article, we will be going more in-depth into the following categories: Active learning methods Acquisition function How to design effective pipelines Common pitfalls to avoid when applying active learning to your own projects. MNIST test accuracy from different active learning methods (source) When to use active learning? “Wow, reducing my annotation workload by a factor of 10x without sacrificing performance - sign me up!”. While active learning is an appealing idea, there are tradeoffs involved and it’s not the right move for every computer vision project. As we said, the goal of active learning is to 1) identify gaps in your training data, 2) reduce the cost of the annotation process, and 3) maximize the performance of your model relative to the number of annotated training samples. Your first consideration should be whether or not it’s worth the cost tradeoff: Annotation cost savings vs. Cost of implementation plus extra computation Annotation costs: The comparison we need to make here will not be based on the cost of annotating the entire dataset, but the difference between annotating a sufficiently large enough random sample to train an accurate model vs the expected size of the sample chosen via active learning. Cost of implementation: Building an active learning pipeline can be a particularly ops-heavy task. Data annotation is being coupled into the training process, so these two need to be tightly synchronized. Training and data selection for annotation should ideally be fully automated, with proper monitoring also put into place. The iterations will only be as efficient as their weakest link, requiring an annotation process capable of consistently fast turnaround times throughout sequential batches. Designing and implementing such a system requires MLOps skills and experience. 
Computation costs: Active learning requires repeating the training process in each iteration, introducing a multiplier on the computation cost, which could be significant in cases involving very large, compute-heavy models, though usually this won’t be a limiting factor. For many applications, this will still be worth the tradeoff. However, there are four other considerations you need to make before you begin: Biased data: Since we aren’t sampling randomly, the chosen training set will be biased by the model. This is done to maximize discriminative performance but can have major detrimental effects on things like density estimation, calibration, and explainability. Coupling of model and data: By performing active learning, you are coupling your annotation process with a specific model. The examples chosen for annotation by the particular selected model may not necessarily be the most useful for other models. Furthermore, since data tends to be more valuable than models and outlive models, you should ask yourself how much influence you want a particular model to have over the dataset. On the contrary, it may be reasonable to believe that particular chosen training subsets wouldn’t have large specific benefits between similar models. Difficulty of validation: In order to compare the performance of different active learning strategies, we need to annotate all examples chosen from all acquisition functions. This necessitates far more annotation cost than simply performing active learning, likely eliminating the desired cost savings. Even just comparing to a random sampling baseline could double the annotation cost (or more). Given sufficient training data at the current step, acquisition functions can be compared by simulating active learning within the current training dataset, however, this can add a noticeable amount of complexity to an already complex system. Often, different acquisition functions, models for data selection, and sizes of annotation batches are chosen upfront sans data. Brittleness: Finally, the performance of different acquisition functions can vary unpredictably among different models and tasks, and it’s not even uncommon for active learning to fail to outperform random sampling. Additionally, from 3, even just knowing how your active learning strategy compares to random sampling requires performing random sampling anyway. “Ok, you’re really convincing me against this active learning thing - what are my alternatives?” Active Learning Alternatives Random Subsampling: So it’s not in the budget to annotate your entire dataset, but you still need training data for your model. The simplest approach here would be to annotate a random subset of your data. Random sampling has the major benefit of keeping your training set unbiased, which is expected by nearly all ML models anyway. In unbalanced data scenarios, you can also simultaneously balance your dataset via subsampling and get two birds with one stone here. Clustering-based sampling: Another option is to choose a subsample with the goal of best representing the unlabeled dataset or maximizing diversity. For example, if you want to subsample k examples, you could apply the k-means algorithm (with k-clusters) to the hidden representations of your images from a pre-trained computer vision model. Then select one example from each resulting cluster. There are many alternative unsupervised and self-supervised methods here, such as VAE, SimCLR, or SCAN, but they are outside the scope of this article. 
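To make the clustering-based option concrete, here is a minimal sketch; the embeddings are a random placeholder for features taken from a pre-trained vision model:

# Cluster hidden representations with k-means and pick the example closest
# to each centroid as a diverse subsample for labeling.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def diverse_subsample(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
    # Index of the sample nearest to each cluster center.
    return pairwise_distances_argmin(km.cluster_centers_, embeddings)

embeddings = np.random.rand(5000, 256)
chosen = diverse_subsample(embeddings, k=100)
print(chosen[:10])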
Manual Selection: In some instances, it may even make sense to curate a training subset manually. For tasks like segmentation, actually labeling the image takes far more work than determining whether it's a challenging or interesting image to segment. For example, consider the case where we have an unbalanced dataset with certain rare objects or scenarios that we want to ensure make it into the training set. Those images could then be specifically selected manually (or automatically) for segmentation.
Should I use active learning?
Despite all of the potential issues and shortcomings outlined above, active learning has been demonstrated to be effective in scenarios across a wide variety of data modalities, domains, tasks, and models. For certain problems, there are few viable alternatives. For example, the field of autonomous driving often deals with enormous datasets containing thousands of hours of video footage. Labeling each image for object detection or related tasks becomes infeasible at scale. Additionally, driving video footage can be highly redundant; sections containing unique conditions, precarious scenarios, or difficult decision-making are rare. This presents a great opportunity for active learning to sort out the relevant data for labeling. For those of you still interested in applying this idea to your own problem, we'll now give you an overview of common methods and how to build your own active learning pipeline.
Active Learning Methods
So, how do we actually choose which images to label? We want to select the examples that will be the most informative to our model, so a natural approach would be to score each example based on its predicted usefulness for training. Since we are labeling samples in batches, we could take the top k-scoring samples for annotation. This function that takes an unlabeled example and outputs its score is called an acquisition function. Here we give a brief overview of different acquisition functions and active learning strategies applicable to deep learning models. We will focus on the multiclass classification task; however, most of these approaches are easily modified to handle other computer vision tasks as well.
Uncertainty-based acquisition functions
The most common approach to scoring examples is uncertainty sampling, where we score data points based on the uncertainty of the model predictions. The assumption is that examples the model is uncertain about are likely to be more informative than examples for which the model is already very confident about the label. Common, straightforward acquisition functions for classification problems include the least confident score, the margin score, and the entropy of the predicted class distribution (written out in the short code sketch below). These acquisition functions are very simple and intuitive measures of uncertainty; however, they have one major drawback for our applications: softmax outputs from deep networks are not calibrated and tend to be quite overconfident. For convolutional neural networks, small, seemingly meaningless perturbations in the input space can completely change predictions.
Query by Committee: Another popular approach to uncertainty estimation is using ensembles. With an ensemble of models, disagreement can be used as a measure of uncertainty. There are various ways to train ensembles and measure disagreement. Explicit ensembles of deep networks can perform quite well for active learning. However, this approach is usually very computationally expensive.
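For reference, here is a minimal sketch of the three simple uncertainty scores mentioned above, written for a batch of softmax outputs; higher values mean more uncertain:

# p has shape (n_samples, n_classes) and contains softmax probabilities.
import numpy as np

def least_confidence(p: np.ndarray) -> np.ndarray:
    return 1.0 - p.max(axis=1)

def margin_score(p: np.ndarray) -> np.ndarray:
    part = np.sort(p, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])  # small gap between top-2 classes -> high uncertainty

def entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return -np.sum(p * np.log(p + eps), axis=1)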
Bayesian Active Learning (Monte Carlo Dropout): When the topic of uncertainty estimation arises, one of your first thoughts is probably “Can we take a Bayesian approach?”. The most popular way to do this in deep active learning is Monte Carlo Dropout. Essentially, we can use dropout to approximate a deep Gaussian process with Bernoulli weights. The posterior of our predictions is then approximated by passing the input through the model multiple times with dropout randomly applied. Here is a pioneering paper on this technique.

Bayesian Active Learning by Disagreement (BALD): On the subject of Bayesian active learning, BALD uses Monte Carlo Dropout to estimate the mutual information between the model’s output and its parameters.

Loss Prediction: Loss prediction methods augment the neural network architecture with a module that outputs predictions of the loss for a given input. Unlabeled images with high predicted loss are then selected for labeling. It’s important to note that as the model is trained, the loss on all examples will decrease. If a loss function like MSE is simply used to train the loss prediction head, this loss will contribute less and less to the overall training loss as the model becomes more accurate. For this reason, a ranking-based loss is used to train the loss prediction head.

Diversity and Hybrid Acquisition Functions

So far we’ve looked at a handful of methods to score samples based on uncertainty. However, this approach has several potential issues. If the model is very uncertain about a particular class or type of example, we can end up choosing an entire batch of very similar examples. If our batch size is large, this not only causes a lot of redundant labeling but can also leave us with an unbalanced and biased training set for the next iteration, especially early in the active learning process. Uncertainty-based methods are also prone to picking out outliers, examples with measurement errors, and other dirty data points that we don’t want overrepresented in our training set. These methods can result in quite biased training datasets in general.

An alternative approach is to select examples for labeling by diversity. Instead of greedily grabbing the most uncertain examples, we choose a set of examples that, when added to the training set, best represents the underlying distribution. This results in less biased training datasets and can usually be used with larger batch sizes and fewer iterations. We will also look at hybrid techniques that prioritize uncertainty but choose examples in a batch-aware manner to avoid issues like mode collapse within batches.

Core Sets: In computational geometry, a core set is a small set of points that best approximates a larger point set. In active learning, this corresponds to the strategy of selecting a small subset of data that best represents the larger unlabeled dataset, such that a model trained on this subset is similar to a model trained after labeling all available data. This is done by choosing data points such that the distance between the latent feature representation of each sample in the unlabeled pool and its nearest neighbor in the labeled training set is minimized. Since this problem is NP-hard, a greedy algorithm is used.

Core sets (source).
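As an illustration of the core-set idea, here is a minimal sketch of the greedy k-center heuristic on latent features. It is a simplified version of the strategy described above, with assumed inputs (`features`, `labeled_idx`, `budget`), not the exact algorithm from the original core-set paper.

```python
# Minimal sketch of greedy k-center selection for the core-set strategy.
# `features` is an (N, D) array of latent representations for the whole pool;
# `labeled_idx` are indices already in the training set; `budget` is how many to add.
import numpy as np

def kcenter_greedy(features, labeled_idx, budget):
    min_dist = np.full(len(features), np.inf)
    for i in labeled_idx:  # distance from every point to its nearest labeled point
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[i], axis=1))
    selected = []
    for _ in range(budget):
        new_idx = int(np.argmax(min_dist))  # farthest point from the current centers
        selected.append(new_idx)
        d = np.linalg.norm(features - features[new_idx], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected  # indices to send for labeling
```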
BatchBALD: BatchBALD is a batch-aware version of the BALD acquisition function: instead of maximizing the mutual information between a single model output and the model parameters, we maximize the mutual information between the model outputs of all of the examples in the batch and the model parameters.

Batch Active learning by Diverse Gradient Embeddings (BADGE): BADGE is another batch-aware method that, instead of using model outputs and hidden features, uses the gradient space to compare samples. Specifically, it uses the gradient of the loss with respect to the last layer of the network, computed with the predicted class as the label. The norm of this gradient is used to score the uncertainty of each sample (inputs with larger gradient norms have more effect during training and tend to be lower confidence). The gradient space is then also used to create a diverse batch using k-means++.

Adversarial Approaches: Variational Adversarial Active Learning (VAAL) uses a β-VAE to encode inputs into a latent space and a jointly trained discriminator to distinguish between labeled and unlabeled examples. The examples most confidently predicted to be unlabeled are chosen for labeling to maximize the diversity of the labeled set. Wasserstein Adversarial Active Learning (WAAL) further adopts the Wasserstein distance for distribution matching.

Which acquisition function should I use?

“Ok, I have this list of acquisition functions now, but which one is the best? How do I choose?” This isn’t an easy question to answer, and it depends heavily on your problem, your data, your model, your labeling budget, your goals, and more. The choice can be crucial to your results, and comparing multiple acquisition functions during the active learning process is not always feasible. We can’t simply hand you the right answer, but simple uncertainty measures like the least-confidence score, margin score, entropy, and BALD make good first considerations, and these can be implemented with Monte Carlo Dropout. This survey paper compared a variety of active learning techniques on a set of image classification problems and found WAAL and loss prediction to perform well on many datasets.

Tip! If you’d like to talk to an expert on the topic, the Encord ML team can be found in the #help channel of our Active Learning Slack.

Dataset Initialization

In the active learning paradigm, our model selects the examples to be labeled. To make these selections, however, we need a model from which we can get useful representations or uncertainty estimates - a model that already “knows” something about our data. This is typically accomplished by training an initial model on a random subset of the training data. Here, we want to use just enough data to get a model that makes our acquisition function useful and kickstarts the active learning process. Transfer learning with pre-trained models can further reduce the required size of the seed dataset and accelerate the whole process. It’s also important to note that the test and validation datasets still need to be selected randomly and annotated in order to obtain unbiased performance estimates.

Stopping Criterion

The active learning procedure needs to be stopped at some point (otherwise we undermine ourselves and label all of the data anyway). Common stopping criteria include:

Fixed annotation budget: Examples are labeled until a target number of labeled examples or a spending limit is reached.

Model performance target: Examples are labeled until a particular accuracy, precision, MSE, AUROC, or other performance metric is reached.
Performance gain: Continue until the model’s performance gain falls below a specified threshold for one or more iterations.

Manual review: You usually won’t be performing so many iterations that a manual review of performance is infeasible - just make sure this is done asynchronously, so you don’t bleed efficiency by having annotators waiting for confirmation to proceed.

Model Selection

Once you have a final labeled dataset, model selection can be performed as usual. However, selecting a model for active learning itself is not a straightforward task, and it is often done primarily with domain knowledge rather than by validating models on data. For example, we could search over architectures and hyperparameters using our initial seed training set, but the models that perform best in this limited-data setting are not likely to be the best performers once we’ve labeled 10x as many examples. Do we really want to use those models to select our data? No! We should select data that optimizes the performance of our final model, so we want to use the type of model that we expect to perform best on our task in general. Keep in mind that as the training set grows over each iteration, the size of each training epoch grows, and the hyperparameters and stopping criteria that maximize the performance of the model will change. We could perform hyperparameter optimization each time we retrain the model; however, this strategy has the same issues discussed in the previous paragraph and can dramatically increase the computation and time costs of the model fitting step. Yet again, the exact strategy you use to handle this issue will depend on the details of your individual problem and situation.

Active Learning Pipelines

Example outline of an active learning pipeline

Efficiency

Ordinarily, data labeling is performed in one enormous batch. With active learning, we split this process into multiple sequential batches of labeling. This introduces several possible efficiency leaks and potential slowdowns that need to be addressed, primarily:

Automation: Multiple cycles of labeling and model training can make the active learning process quite lengthy, so we want to make these iterations as efficient as possible. Firstly, we want as little waiting downtime as possible. The only component of the active learning cycle that must be performed manually is the actual annotation; all model training and data selection processes should be completely automated, for example as DAG-based workflows. What we don’t want is new labels sitting in a file somewhere, waiting for someone to manually run a script that integrates them into the training dataset and begins fitting a new model.

Efficient Labeling: The last point was about reducing downtime where components sit idle waiting for manual intervention. Time spent waiting for annotators should be minimized as well. Most data annotation services are set up to process large datasets as a single entity, not for fast turnaround on smaller sequential batches, and an inefficient labeling setup can dramatically slow down your system.

Labeling Batch Size: The size of the labeling “batches” and the total number of iterations are important hyperparameters to consider. Larger batches and fewer iterations will reduce the time requirements of active learning.
More frequent model updates can improve the quality of data selection; however, batches should be large enough to affect the model sufficiently so that the selected data points actually change from one iteration to the next.

Errors: It’s vital to have principled MLOps practices in place to ensure that active learning runs smoothly and bug-free. We address this in the monitoring subsection below.

Data Ops

Each active learning iteration introduces changes to both the data and the model, requiring additional model and data versioning to keep things organized. Additionally, data storage and transfer need to be handled appropriately. For example, you don’t want to re-download all of the data to your model training server at each iteration; you only need the labeled examples and should be caching data appropriately.

Monitoring, Testing, and Debugging

The absolute last thing you want when performing active learning is to realize that you’ve been sending the same examples to get re-labeled at every iteration for the past two weeks because of a bug. Proper testing is crucial: check the entire active learning workflow and training data platform thoroughly prior to large-scale implementation, and verify the outputs of each component so that you can have confidence in your system before labeling begins.

Monitoring is also an essential component of an active learning pipeline. It’s important to be able to thoroughly inspect the model’s performance at each iteration to quickly identify potential issues involving training, data selection, or labeling. Here are some ideas to get started:

1. Monitor all of your usual model performance metrics after each training iteration. How is your model improving as the training set grows? If your model is not improving, you have a problem. Some potential culprits are: your labeling batch size is too small to make a noticeable difference; your hyperparameter settings no longer suit your model due to the increased size of your training dataset; your training dataset is already huge and you’ve saturated the performance gains from more data; or there is an issue or bug in your data selection process causing you to select bad examples.

2. Record the difference in acquisition function scores and rankings after the training step in each iteration. You are looking for two things. First, scores for already-labeled examples should, on average, be much lower than scores for unlabeled examples. If not, you are selecting data points similar to already-labeled examples; if you’re using uncertainty sampling, this could mean that seeing the labels of these examples is not improving your model’s certainty (the uncertainty is coming from the problem itself rather than from a shortcoming of your model). Second, does the set of highest-ranked data points change after training your model on newly labeled samples? If not, updating the model is not changing the data selection, which could have the same potential causes outlined in point 1.

Conclusion

In this article, we walked you through the steps to decide whether or not active learning is the right choice for your situation, presented an overview of important active learning methods, and enumerated important considerations for implementing active learning systems. Now you have the tools you need to begin formulating active learning solutions for your own projects!

Want to get started with Active Learning?

“I want to start annotating” - Get a free trial of Encord here.
"I want to get started right away" - You can find Encord Active on Github here or try the quickstart Python command from our documentation. "Can you show me an example first?" - Check out this Colab Notebook. If you want to support the project you can help us out by giving a Star on GitHub ⭐ Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning. Join the Slack community to chat and connect.
Feb 01 2023
How to Find and Fix Label Errors
Introduction

As machine learning teams continue to push the boundaries of computer vision, the quality of our training data becomes increasingly important. A single data or label error can impact the performance of our models, making it critical to continuously evaluate and improve our training datasets. In this first part of the blog post series “Data errors in Computer Vision”, we’ll explore the most common label errors in computer vision and show you how you can quickly and efficiently mitigate or fix them. Whether you’re a seasoned ML expert or just starting out, this series has something for everyone. Join us as we dive into the world of data errors in computer vision and learn how to ensure your models are trained on the highest quality data possible.

The Data Error Problem in Computer Vision

At Encord, we work with pioneering machine learning teams across a variety of different use cases. As they transition their models into production, we have noticed an increasing interest in continuously improving the quality of their training data. Most data scientists and ML practitioners spend much of their time debugging data to enhance their model performance. Even with automation and AI-supported breakthroughs in labeling, data debugging is still a tedious and time-consuming process based on manual inspection and one-off scripts in Jupyter notebooks. Before we dive into the data errors and how to fix them, let us quickly recap what makes a good training dataset.

What is a good training dataset?

- Data that is consistently and correctly labeled.
- Data that is not missing any labels.
- Data that covers critical edge cases.
- Data that covers the outliers in your data distribution.
- Data that is balanced and mimics the data distribution faced in a deployed environment (for example, in terms of different times of the day, seasons, lighting conditions, etc.).
- Data that is continuously updated based on feedback from your production model to mitigate data drift issues.

Note! This post will not dive into domain-specific requirements such as data volume and modality. If you are interested in reading more about healthcare-specific practices, click here.

Today we will explain how to achieve points one and two for a good training dataset, and in the upcoming posts in the series, we will cover the rest. Let’s dive into it!

The Three Types of Label Errors in Computer Vision

Label errors directly impact a model’s performance. Having incorrect ground truth labels in your training dataset can have major downstream consequences in the production pipeline for your computer vision models. Identifying label errors in a dataset containing hundreds of images or frames can be done manually, but when working with large datasets containing hundreds of thousands or millions of images, the manual process becomes impossible. The three types of labeling errors in computer vision are 1) inaccurate labels, 2) mislabeled images, and 3) missing labels.

Inaccurate Labels

When a label is inaccurate, your algorithm will struggle to identify objects correctly. The actual consequences of inaccurate labels in object detection have been studied previously, so we will not dive into that today. The precise definition of an inaccurate label depends on the purpose of the model you are training, but common examples of inaccurate labels are:

- Loose bounding boxes/polygons
- Labels that do not cover the entire object
- Labels that overlap with other objects

Note!
In certain cases, such as ultrasound labeling, you would label the neighboring region of the object of interest to capture any changes around it. Thus, the definition of an inaccurate label depends on the specific case. For example, if you’re building an object detection model to detect tigers in the wild, you want your labels to include the entire visible area of the tiger, no more and no less.

Mislabeled Images

When the label attached to an object is wrong, it can lead to incorrect predictions when you deploy your model into the real world. Mislabeled images are common in training datasets. Research from MIT has shown that, on average, 3.4% of labels are mislabeled in commonly used benchmark datasets.

Missing Labels

The last common type of label error is missing labels. If a training dataset contains missing labels, the computer vision algorithm will not learn from the samples without labels.

What Causes Label Errors?

Erroneous labels are prevalent in many datasets, both open-source and proprietary. They happen for a variety of reasons, mainly:

Unclear ontology or label instructions: When labelers lack a clear definition of the objects and concepts to be labeled, it can confuse the person performing the task and make it difficult to interpret the images accurately and consistently.

Annotator fatigue: Burnout can occur for labelers who are performing repetitive labeling tasks. The process can at times be tedious and time-consuming, and it can take a toll on the energy of the person doing it.

Hard-to-annotate images: A labeling task can be difficult for various reasons. For example, it can require a high level of skill or knowledge to identify the objects in question, the image quality can be low, or the images can contain many different objects, confusing the annotator.

Next, we will show you a series of actions to prevent label errors in your labeling operations going forward!

How to Fix Label Errors?

To find label errors, you historically had to sift through your dataset manually, a process that is time-consuming at best and impossible at worst. If you have a large dataset, it is like finding a needle in a haystack. Luckily, label errors can be mitigated today, before deploying your model into production. In this section, we propose three strategies that can help you mitigate label errors during the labeling process or fix them later on:

1. Provide Clear Labeling Instructions

If you are not going to label the images yourself, providing your data annotation team with clear and concise instructions is essential. Good label instructions contain descriptions of the labeling ontology (taxonomy) and reference screenshots of high-quality labels. A good way to test the instructions is to have a non-technical colleague on your team review them and see if they make sense on a conceptual level. Even better, dogfood the label instructions with your own team to find potential pitfalls.

2. Implement a Quality Assurance System

In today’s computer vision data pipelines, reviewing a subset, or all, of the created labels is best practice. This can be done using a standard review module, where you decide on a sampling rate for all labels or define different sampling rates for specific hard classes. In special cases, such as medical use cases, an expert review workflow is frequently required.
This entails sending labels that were initially rejected in the review stage for an extra expert opinion in the expert review stage. Depending on the complexity of the use case, this can be tailored to the situation. Check out this post to learn how to structure a quality assurance workflow for medical use cases.

3. Use a Trained Model to Find Label Errors

As you progress on your computer vision journey, you can use a trained model to identify mistakes in your data labeling. This is done by running the model on your annotated images and using a platform that supports label debugging and model predictions. The platform should be able to compare high-confidence false positive model predictions with the ground truth labels and flag any errors for re-labeling. In this example, we will use Encord Active, an open-source active learning framework, to show how a trained model can be used to find label errors. The dataset used in this example is the COCO validation dataset combined with model predictions from a pre-trained Mask R-CNN ResNet50 FPN v2 model. The sandbox dataset with labels and predictions can be downloaded directly from Encord Active’s GitHub repo.

Note! Check out the full guide on how to use Encord Active to find and fix label errors in the COCO validation dataset.

Using the UI, we sort by the highest-confidence false positives to find images with possible label errors. In the example below, we can see that the model has predicted four missing labels in the selected image. The missing objects are a backpack, a handbag, and two people. The predictions are marked in purple with a box around them. As all four predictions are correct, the label errors can automatically be sent back to the label editor to be corrected immediately. This operation is repeated on the rest of the dataset to find and fix the remaining erroneous labels. If you’re interested in finding label errors in your own training dataset today, you can download the open-source active learning framework, upload your own data, labels, and model predictions, and get started.

Conclusion

In summary, to mitigate label errors and missing labels, you can follow three best-practice strategies:

- Provide clear labeling instructions that contain descriptions of the labeling ontology (taxonomy) and reference screenshots of high-quality labels.
- Implement a quality assurance system using a standard review workflow or an expert review workflow.
- Use a trained model to spot label errors in your training dataset by running a model on your newly annotated samples to get model predictions and using a platform that supports model-driven label debugging.

Want to test your own models?

"I want to get started right away" - You can find Encord Active on GitHub here.

"Can you show me an example first?" - Check out this Colab Notebook.

"I am new, and want a step-by-step guide" - Try out the getting started tutorial.

If you want to support the project, you can help us out by giving us a star on GitHub :)

Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning. Join our Discord channel to chat and connect.
Dec 15 2022
How to Clean Data for Computer Vision
How clean is your data? Data cleaning for computer vision (CV) and machine learning (ML) projects is an integral part of the process, and something that must be done before annotation and model training can start.

Unclean data costs time and money. According to an IBM estimate published in the Harvard Business Review (HBR), the cost of unclean, duplicate, and poor-quality data is $3.1 trillion. It can cost tens of thousands of dollars to clean 10,000+ lines of database entries before those datasets can be fed into a text-only ML, Deep Learning (DL), or Artificial Intelligence (AI) model. When it comes to image and video-based datasets, the work involved, and therefore the cost, is even higher. It’s essential for the integrity of a machine learning model’s algorithms, and for the outputs and results you want to generate, that the datasets you’re feeding it are clean. In this article, we review the importance of data cleaning for image and video datasets in computer vision, and how data ops and annotation teams can clean data before a project starts.

What is Data Cleaning For Machine Learning Models?

Machine learning models are more effective, produce better outcomes, and train more efficiently when data preparation has been done and the data they’re supplied with is clean. As the data science saying goes, “Garbage in, garbage out.” If you put unclean data into a machine learning model, especially after the images or videos have been annotated, it will produce poor-quality and inaccurate results. Data cleaning has to happen before annotation work can begin; as a project leader, you need to allocate time and budget to this part of the process. Otherwise, you risk wasting crucial time training a model on unclean data, having the annotation team spend too much of its time trying to clean the datasets before they can apply the relevant annotations and labels, or, if the annotation team doesn’t spot the “dirty data”, forcing a data ops team to fix the mistakes themselves or send the images or videos back to be re-annotated.

In the HBR article where the $3.1 trillion cost was mentioned, this is known as having an in-house “hidden data factory.” Far too many knowledge economy workers spend time they shouldn’t have to on cleaning data, especially when teams are up against tight deadlines. Hidden data factories are expensive and time-consuming productivity black holes. This cleanup is non-value-added work, and time is wasted whenever datasets aren’t cleaned and processed before annotators start applying annotations and labels to the images and videos.

How Does Image and Video Data Cleaning Apply to Computer Vision?

As we’ve mentioned, clean data is needed before you can train computer vision model algorithms. Annotators should be spending all of their time working on images and videos they can annotate and label without worrying about dataset “janitor work”, as The New York Times once eloquently put it. With image-based datasets, numerous data-cleaning issues can occur. You could have images that are too bright or too dull, and there can be duplicates, corrupted files, and other problems with the images provided. All of these issues can influence the outcome of computer vision models. Different file types, incompatible files, or too many greyscale images across a large volume of the dataset can also cause problems. Training a computer vision model involves ensuring the highest quality and volume of annotated images are fed into it.
That’s not possible when a significant percentage of the images in a dataset have problems. Medical images are even more complex, especially when the file formats (such as DICOM) involve numerous layers of images and patient data. Also, given the stringent requirements and regulatory hurdles medical datasets and ML models have to overcome, it’s even more crucial to ensure that the datasets clinical operations teams use are as clean as possible. With video-based datasets, the challenges can be even more difficult to overcome. Unclean data in the context of videos includes corrupted files, duplicate frames, ghost frames, variable frame rates, and other sometimes unknown and unexpected problems. All of these data-cleaning challenges can be overcome. For the overall success of the machine learning or computer vision model, crucial time and effort should be invested in cleaning raw datasets before the annotation team starts work.

How Do You Know Your Data is Clean?

There are a couple of ways you can test the cleanliness of your data before it goes into a machine learning model. One way, and by far the most time-consuming, would be for someone — ideally a quality assurance/quality control data professional — to go through every image and video manually, checking every image, video, and frame to make sure it is “clean”: without errors, duplicates, corrupted files, or any of the brightness/dullness problems we’ve previously mentioned. Whether images or videos are too bright or too dull is a potentially serious problem; too much of one or the other could unintentionally change the outcome of a machine learning model. So, a concerted effort must be made in the early stages of a project to ensure the datasets going to the annotation teams are clean.

Manually checking and correcting every image and video is very time-consuming, and often not practical when there could be thousands of images and videos in a dataset. One way to make this easier is to automate the process, and there are automation tools that will help accelerate and simplify data-cleaning tasks. One such tool is Encord Active, an open-source active learning framework for computer vision projects. Encord Active is a “test suite for your labels, data, and models. Encord Active helps you find failure modes in your models, prioritize high-value data for labeling, and find outliers in your data to improve model accuracy.”

Model assertions in Encord Active

Debug your models and your data before and during the annotation phase of a project, before you start feeding data into your machine learning project or computer vision model for training and validation. Compared to manual data cleaning, this saves a huge amount of time, and if you need to create or source more images or videos, an automation tool can inform data ops leaders about the type and volume of data required to train the model.
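As a rough illustration of the kind of automated checks discussed above, here is a minimal sketch that flags exact duplicates, unreadable files, and brightness outliers in a folder of images. It is a hypothetical example, not the Encord Active implementation; the function name and the brightness thresholds are assumptions.

```python
# Minimal sketch of basic cleanliness checks on an image folder:
# exact duplicates (byte-level hashes), corrupted/unreadable files, brightness outliers.
import hashlib
from pathlib import Path

import numpy as np
from PIL import Image

def check_images(folder, dark=30, bright=225):
    hashes, duplicates, corrupted, outliers = {}, [], [], []
    for path in Path(folder).glob("*"):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in hashes:
            duplicates.append((path, hashes[digest]))  # byte-for-byte duplicate
            continue
        hashes[digest] = path
        try:
            img = np.asarray(Image.open(path).convert("L"))  # greyscale for brightness
        except Exception:
            corrupted.append(path)  # unreadable or corrupted file
            continue
        mean = float(img.mean())
        if mean < dark or mean > bright:
            outliers.append((path, mean))  # too dark or too bright on average
    return duplicates, corrupted, outliers
```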
What Happens if You Put Unclean Data Into A Machine Learning Model?

When “unclean data” (datasets with corrupted and duplicate files, images that are too bright or too dull, and videos with ghost and variable frames) is fed into machine learning algorithms, it corrupts the outcome. In some cases, a model simply won’t train. A data operations manager might go home for the night, having put dozens of datasets into models to train, only to find the next day that the models haven’t trained, or that the accuracy is so poor that the results are meaningless. A team then needs to investigate the cause of the poor results or untrained models. According to NYT interviews: “far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists . . . spend from 50 percent to 80 percent of their time mired in this more mundane labor of [data cleaning].”

Starting over, having annotation teams re-annotate images and videos, or manually cleaning datasets takes time and causes project costs to spiral. Target outcomes and deadlines get pushed further into the future because the data quality isn’t yet good enough to be introduced to the model. Senior leaders and sponsors could lose confidence in the project. It might also be necessary to artificially generate more images and videos to balance out ones that are too bright, too dull, or have other issues that could impact the outcome, and videos might need to be re-formatted before they can be re-annotated. The cost and work involved in fixing unclean data after it goes into a machine learning model can be high. It’s better, whenever possible, to ensure the data going into the model is clean before annotation work starts; then you can be more confident in the outcomes of the training part of the process.

How Do You Do Image and Video Dataset Cleaning For Machine Learning Models?

In extreme cases, when data is corrupted, it needs to be removed from the dataset. However, that’s always a last resort, because for ML and CV models, more data is generally better. Before that becomes necessary, images and videos can be reformatted and manually cleaned up, and automation tools can be used to fix brightness levels and other formatting issues. Once as much data cleaning, or sourcing of new clean data, as possible has been done, tools such as Encord Active can be used to debug the data. It’s also worth pointing out that the labels and annotations applied by the annotation team need to be “clean” before datasets can be fed into ML models. Again, this is where a tool like Encord Active can prove invaluable, ensuring computer vision models are fed the most effective and efficient training data to produce the outcomes your team and project leaders need.

Ready to clean your data and improve the performance of your computer vision model?

Sign-up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today.

Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning. Join our Discord channel to chat and connect.
Dec 02 2022
How to Improve Datasets for Computer Vision
Machine learning algorithms need vast datasets to train, improve performance, and produce the results the organization needs. Datasets are the fuel that computer vision applications and models run on. The more data the better, and this data should be high-quality to ensure the best possible outcomes and outputs from artificial intelligence projects. One of the best ways to get the data you need to train ML models is to use open-source datasets. Fortunately, there are hundreds of high-quality, free, and large-volume open-source datasets that data scientists can use to train algorithmically-generated models. In a previous article, we took a deep dive into where to find the best open-source datasets for machine learning models, including where you can find them depending on your sector/use cases.

Quick recap: What Are The Best Datasets For Machine Learning?

Depending on your sector and use cases, some of the most popular machine learning and computer vision datasets include:

- Insurance: Car Damage Assessment Dataset
- Sports: Multiview Football Dataset I & II (from KTH), and the OpenTTGames Dataset
- SAR (Synthetic Aperture Radar) machine learning datasets: xView and xView3, both from the Copernicus Sentinel-1 mission of the European Space Agency (ESA), and the EU Copernicus Constellation
- Smart cities and autonomous vehicles (self-driving cars): BDD100K and The KITTI Vision Benchmark Suite
- Retailers and manufacturers: RPC-Dataset Project
- Medical and healthcare: The Cancer Imaging Archive (TCIA) and the NIH Chest X-Rays (on Kaggle)
- Open dataset aggregators: Kaggle and OpenML

In this tutorial, we’ll take a closer look at the steps you need to take to improve the open-source dataset you’re using to train your model.

Training Your Model And Assessing Performance

If you’re using a public, open-source dataset, there’s a good chance the images and videos have already been annotated and labeled. For all intents and purposes, these datasets are as close to being model-ready as possible. Using a public dataset is something of a useful shortcut to getting a proof of concept (POC) training model up and running, getting you one step closer to being able to run a fully-tested production model. However, before simply feeding any of these datasets into a machine learning model, you need to:

- Make sure the data aligns with your project goals and objectives.
- Ensure that the annotations (bounding boxes, image segmentation, etc.) and metadata are high quality, with sufficient modalities and object types.
- Check that there are enough images or videos to reduce bias (for example, in the case of medical imaging datasets, is there a wide enough spread of races, genders, age groups, and patients with or without the diseases being studied?).
- Check that there are enough images or videos taken under different conditions (e.g. light vs. dark, day vs. night, shadows vs. no shadows).

Reviewing label quality in Encord Active

Now you’ve got the data and you’ve checked that it’s suitable, you can start training a machine learning model for carrying out computer vision tasks. With each training task, you can set a machine learning model a specific goal. For example, “identify black Ford cars manufactured between 2000 and 2010.” To train that model, you might need to show it tens of thousands of images or videos of cars. The training data should contain sufficient examples for the model to positively identify the target object(s); in this case, black Ford cars, and only those manufactured between the specified years.
It’s equally important that the training data contains thousands of examples of cars that are not Fords, i.e. not the object(s) in question. To train a machine learning or CV model, you need to show the model enough examples of objects that are the opposite of what it’s being trained to identify. So, in this example, the dataset should include plenty of images of cars of different colors, makes, and models. ML and computer vision algorithms only train effectively when they’re shown a wide enough range of images and videos that contrast with the target object(s) in question. It’s also useful to ensure any public, open-source datasets that you use cover a wide range of environmental factors, such as light and dark, day and night, shadows, and other variables as required.

Once you start training a model, your team can start assessing its performance. Don’t expect high-performance outputs from day one. There’s a good chance you could run 100 training tests and only 30% will score high enough to give any valuable insights into how to turn one or two into working production models. Training model failure is a natural and normal part of computer vision projects.

Identify Why and Where the Dataset Needs Improvement

Now that you’ve started to train the machine learning model (or models) you are using on this dataset, results will start to come in. These results will show you why and where the model is failing. Don’t worry: as data scientists, data operations, and machine learning specialists know, failure is an inevitable and normal part of the process. Without failure, there can be no advancement in machine learning models. At first, expect a high failure rate, or at the very least a relatively low accuracy rate, such as 70%. Use this data to create a feedback loop so you can more clearly identify what’s needed to improve the success rate. For example, you might need:

- More images or videos;
- Specific types of images or videos to increase the efficiency, outputs, and accuracy of the model (for more information, check out our blog on data augmentation);
- An increase in the number of images or videos to produce more balanced results, e.g. to reduce bias.

Next, these results often generate another question: if we need more data, where can we get it from?

Collect or Create New Images or Video Data

Solving problems at scale often involves using large volumes of high-quality data to train machine learning models. If you’ve got access to the volumes you need from an open-source or proprietary dataset, then keep feeding the model — and adjusting the dataset’s labels and annotations accordingly — until it starts generating the results you need. However, if that isn’t possible and you can’t get the data you need from other open-source, real-world datasets, there is another solution. For example, what if you need thousands of images or videos of car crashes, or train derailments? How many images and videos do you think exist of things that don’t happen very often? And even when they do happen, these edge cases aren’t always captured clearly in images or videos. In these scenarios, you need synthetic data. For more information, here’s An Introduction to Synthetic Training Data.

Computer-generated imagery (CGI), 3D games engines — such as Unity and Unreal — and generative adversarial networks (GANs) are the best sources for creating realistic synthetic images or videos in the volumes your team will need to train a computer vision model. There’s also the option of buying synthetic datasets.
Buying can make sense if you haven’t got the time or budget to wait for custom-made images and videos to be generated. Either way, when your ML team is trying to solve a difficult edge case and there isn’t enough raw data, it’s always possible to create or buy synthetic data that should improve the accuracy of a training model.

Retrain the Machine Learning Model and Reassess Until The Desired Performance Standards Are Achieved

Assessing model performance in Encord

Now that you’ve got enough data to keep training and retraining the model, it should be possible to start achieving the performance and accuracy standards you need. Assuming initial results started out around 70%, once results are in the 90-95%+ range your model is moving in the right direction! Keep testing and experimenting until you can start benchmarking the model for accuracy. Once accuracy outcomes are high enough, bias ratings are low enough, and the results are on target with the aim of your model, you can put the working model into production.

Ready to find failure modes in your computer vision model and improve its accuracy?

Sign-up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today.

Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.
Nov 28 2022
How to Automate the Assessment of Training Data Quality
When building AI models, machine learning engineers run into two problems with regard to labeling training data: the quantity problem and the quality problem. For a long time, machine learning engineers were stuck on the quantity problem. Supervised machine learning models need a lot of labeled data, and a model’s performance depends on having enough labeled training data to cover all the different scenarios and edge cases the model might run into in the real world. As they gained access to more and more data, machine learning teams had to find ways to label it efficiently. In the past few years, these teams have started to find solutions to this quantity problem, either by hiring large groups of people to annotate the data or by using new tools that automate the process and generate a lot of labels in more systematic ways.

Unfortunately, the quality problem only truly began to reveal itself once solutions to the quantity problem emerged. Solving the quantity problem first made sense; after all, the first thing you need to train a model is a lot of labeled training data. However, once you train a model on that data, it becomes apparent pretty quickly that the quality of the model’s performance is not only a function of the amount of training data but also of the quality of that data’s annotations.

The Training Data Quality Problem

Data quality issues arise for a number of reasons. The quality of the training data itself depends on having a strong pipeline for sourcing, cleaning, and organizing the data to make sure that your model isn’t trained on duplicate, corrupt, or irrelevant data. After putting together a strong pipeline for sourcing and managing data, machine learning teams must be certain that the labels identifying features in the data are error-free. That’s no easy task, because mistakes in data annotations arise from human error, and the reasons for these errors are as varied as the human annotators themselves. All annotators can make mistakes, especially if they’re labeling for eight hours a day. Sometimes annotators don’t have the domain expertise required to label the data accurately. Sometimes they haven’t been trained appropriately for the task at hand. Other times, they aren’t conscientious or consistent: they either aren’t careful or haven’t been taught best practices in data annotation.

A misplaced box around a polyp (from the Hyper Kvasir dataset)

Regardless of the cause, poor data labeling can result in all types of model errors. For example, if trained on inaccurately labeled data, models might make miscategorization errors, such as mistaking a horse for a cow. Or, if trained on data where the bounding boxes haven’t been drawn tightly around an object, models might make geometric errors, such as failing to distinguish the target object from the background or other objects in the frame. A recent study revealed that 10 of the most cited AI datasets have serious labeling errors: the famous ImageNet test set has an estimated label error rate of 5.8 percent. When you have errors in your labels, your model suffers because it's learning from incorrect information. For use cases where the consequences of a model’s mistake are severe, such as autonomous vehicles and medical diagnosis, the labels must be specific and accurate; there’s no room for these types of labeling errors or poor-quality data.
In situations where a model must operate at 99.99 percent accuracy, small margins in its performance really matter. The breakdown in model performance from poor data quality is an insidious problem because machine learning engineers often don’t know whether the problem is in the model or in the data. They can spin their wheels trying to improve a model only to realize that it will never improve because the problem was in the labels themselves. Taking a data-centric rather than a model-centric approach to AI can relieve some of the headaches; after all, these sorts of problems are best addressed by first improving the quality of the training data itself before looking to improve the quality of the model. However, data-centric AI can’t reach its potential until we solve the data quality problem. Currently, assuring data quality depends on manually intensive review processes. This approach is problematic and unscalable because the volume of data that needs to be checked is far greater than the number of human reviewers available. And reviewers also make mistakes, so there’s human inconsistency throughout the labeling chain. To correct these errors, a company can have multiple reviewers look at the same data, but then the cost and workload have doubled, so it’s not an efficient or economical solution.

Encord’s Fully Automated Data Quality and Label Assessment Tool

When we began Encord, we were focused on the quantity problem. We wanted to solve the human bottleneck in data labeling by automating the process. However, after talking to many AI practitioners, and in particular those at more sophisticated companies, we quickly realized that they were stuck on the quality problem. From these conversations, we decided to turn our attention to solving the data quality problem, too. We realized that the quantity problem would only truly be solved if we got smarter about ensuring that the data going into the pot was also high-quality data. Encord has created and launched the first fully automated label and data quality assessment tool for machine learning. This tool replaces the manual process that makes AI development expensive, time-consuming, and difficult to scale.

A Quick Tour of the Data Quality Assessment Tool

Within Encord’s platform, we have developed a quality feature that detects likely errors within a client's project using a semi-supervised learning algorithm. The client chooses all the labels and objects that they want to inspect from the project, runs the algorithm, and then receives an automated ranking of the labels by their probability of error. Each label receives a score, so rather than having a human review every individual label for quality, the algorithm can be used to curate the data for human review in an intelligent way. The score reflects whether the label is likely to be high or low quality. The client can set a threshold to send everything above a certain score to the model and anything below that score for manual review. A human can then accept or reject the label based on its quality. The humans are still in the loop, but the data quality assessment tool saves them as much time as possible, using their time efficiently and when it matters the most. In the example below, the client has annotated different objects in a room.
The bounding box in the image should be identifying a chair, but it isn’t tight to the chair and misses part of the object. That’s a label that a reviewer might want to inspect to see if it could be improved. Its score is .873, so if the threshold were set to .90 or above, this label would automatically be sent for review; it would never make it to the model unless a human passed it on. The tool also aggregates statistics on the human rejection rate of different items, so machine learning teams can get a better understanding of how often humans reject certain labels. With this information, they can focus on improving labeling for the more difficult objects. In the example below, beds and chairs have the highest rejection rates. The tool currently works with object detection because that is the greatest need among our clients, but we’re working on ground-breaking research to make it work for other computer vision tasks, like segmentation, too.

Increased Efficiency: Combining Automated Data Labeling with the Quality Data Assessment Tool

Encord’s platform allows you to create labels manually and through automation (e.g. interpolation, tracking, or using our native model-assisted labeling). It also allows you to import model predictions via our API and Python SDK. Labels or imported model predictions are often subjected to manual review to ensure that they are of the highest possible quality or to validate results. Now, however, using our automated quality assessment tool, our clients can perform an automated review of the labels generated by these different labeling agents, at scale and without changing any of their workflows. The quality feature reassures customers about the quality of machine-generated labels. In fact, our platform aggregates information to show which label-generating agents (human annotators, imported labels, or automatically produced labels) are doing the best job. In other words, the tool doesn’t distinguish between human- and model-produced labels when ranking the labels within a dataset. As a result, this feature helps build confidence in using several different label-generating methods to produce high-quality training data. With both automated label generation using micro-models and the automated data quality assessment tool, Encord optimizes the human-in-the-loop’s time as much as possible. In doing so, we can cherish people’s time by using it only for the most necessary and meaningful contributions to machine learning.

Ready to automate the assessment of your training data?

Sign-up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today.

Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.
Nov 11 2022
Software To Help You Turn Your Data Into AI
Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.