Contents

The growth of self-supervision models
Comparing DINOv2 to DINO
DINOv2 Dataset
DINOv2’s Network Architecture and Design: How it works
DINOv2 in Action
Conclusion
DINOv2 Frequently Asked Questions (FAQs)
Resources

Encord Blog

DINOv2: Self-supervised Learning Model Explained

April 21, 2023

5 mins

Written by

Stephen Oladele

LLaMA (Large Language Model Meta AI), SAM (Segment Anything Model), and now DINOv2! Meta AI has been at the forefront of breakthroughs in NLP (natural language processing) and computer vision research over the past few months.

This week they released DINOv2, an advanced self-supervised learning technique to train models, enhancing computer vision by accurately identifying individual objects within images and video frames.

The DINOv2 family of models boasts a wide range of applications, including image classification, object detection, and video understanding, among others. Unlike other models like CLIP and OpenCLIP, DINOv2 does not require fine-tuning for specific tasks. It is pretrained to handle many tasks out-of-the-box, simplifying the implementation process.

In this article, you will take an in-depth look at:

DINOv2, the underlying techniques and datasets.
How DINOv2 compares to other foundation models.
How DINOv2 works, including the network design and architecture.
Potential applications and the frequently asked questions (FAQs) by adopters.

The growth of self-supervision models

In the past decade or two, the most dominant technique for developing models has been supervised learning, which is very data-intensive and requires careful labeling. Acquiring labels from human annotators is very difficult and expensive to achieve at scale, and this slows down the progress of ML in lots of areas, more specifically computer vision.

Over the past few years, many researchers and institutions have focused their efforts on self-supervision models, which obtain their training signal through a “semi-automatic” technique: the model observes unlabeled data and learns to predict one part of the input from the rest, based on the features. The learned representations can then be combined with a smaller labeled dataset to build the training data or help with other downstream tasks.

One of those “self-supervision” techniques is self-supervised learning (SSL). SSL has gained traction in computer vision for training deep learning models without extensive labeled data. It involves a two-step process of pretraining and fine-tuning, where models learn representations from unlabeled data through auxiliary tasks and adapt to specific tasks using smaller amounts of labeled data.

[Figure: DINOv2 self-supervised learning]

Self-supervised models have, over the past few years, shown promise in applications such as image classification, object detection, and semantic segmentation, often achieving competitive or state-of-the-art performance. The advantages include a reduced reliance on labeled data, scalability to large datasets, and the potential for transfer learning.

The challenges with SSL remain in designing effective pretext tasks, handling domain shifts, and understanding model interpretability and robustness. Some SSL systems address these challenges with techniques such as self-DIstillation with NO labels (DINO), which combines SSL and knowledge (or model) distillation methods.

Understanding knowledge (model) distillation

Knowledge distillation is the process of training a smaller model to mimic the larger model. In this case, you transfer the knowledge from the larger model (often called the “teacher”) to the smaller model (often called the “student”).

The first step involves training the teacher model, classically with labeled data. You then run inputs through the teacher, record its outputs, and train the smaller model to reproduce those outputs, which makes the student more efficient in model size and computational requirements.

The second step uses a large dataset of unlabeled data to train the student models to perform as well as, or better than, the teacher model. The idea is to train one large model with your technique of choice and distill a set of smaller models from it. This technique saves considerable computing cost, and DINOv2 is built with it.
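To make the teacher-student idea concrete, here is a minimal sketch of a distillation loss in NumPy. It is an illustration, not DINOv2's training code: the function names, the example logits, and the temperature value of 2.0 are all assumptions chosen for this example. The loss is the cross-entropy between the teacher's softened output distribution and the student's.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with a temperature that softens the distribution.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the student's predictions.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Hypothetical logits for two inputs and three classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
good_student = teacher * 0.9   # roughly agrees with the teacher
bad_student = -teacher         # disagrees with the teacher

loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
loss_self = distillation_loss(teacher, teacher)
print(loss_good, loss_bad, loss_self)
```

A student that agrees with the teacher gets a lower loss than one that disagrees, which is the signal the student is trained on.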

Comparing DINOv2 to DINO

Understanding the first generation DINO

DINO (self-DIstillation with NO labels), referred to here as DINOv1, is Meta AI’s system for unsupervised pre-training of vision transformers (ViTs). The idea is that self-supervised learning pairs well with vision transformers for tasks such as object detection because the attention maps contain explicit information about the semantic segmentation of an image.

DINO can visualize attention maps without supervision for random, unlabeled images and videos. This can be useful in cases like:

Image retrieval, where similar images are clustered together.
Image segmentation and proto-object detection.
Zero-shot classification using a k-nearest neighbor (kNN) classifier in the feature space (the features are extracted from the Vision Transformers trained by DINO).
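The kNN use case in the list above can be sketched in a few lines of NumPy. This is a hedged illustration: it assumes the features have already been extracted by a frozen backbone and that a small labeled reference set is available, and the synthetic features, `knn_predict`, and `k=3` are hypothetical choices for this example.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def knn_predict(train_feats, train_labels, query_feats, k=3):
    # Cosine similarity between every query and every reference feature.
    sims = l2_normalize(query_feats) @ l2_normalize(train_feats).T
    # Indices of the k most similar reference features per query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    votes = train_labels[topk]
    # Majority vote among the k neighbors.
    return np.array([np.bincount(v).argmax() for v in votes])

# Synthetic stand-ins for frozen backbone features: two well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
train_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + 0.1 * rng.normal(size=(40, 3))
query_feats = centers + 0.1 * rng.normal(size=(2, 3))

pred = knn_predict(train_feats, train_labels, query_feats)
print(pred)
```

No weights are trained here; the quality of the predictions depends entirely on how well the features separate the classes.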

What does DINOv2 do better than DINO?

[Figure: DINOv2 self-supervised learning: Comparing DINO with DINOv2]

After testing both DINO and DINOv2 out, we have found the latter version to provide more accurate attention maps for objects in ambiguous and unambiguous scenes. According to Meta AI in the blog post explainer for DINOv2:

“DINOv2 is able to take a video and generate a higher-quality segmentation than the original DINO method. DINOv2 allows remarkable properties to emerge, such as a robust understanding of object parts, and robust semantic and low-level understanding of images.”

DINOv2 works better than DINO for the following reasons:

A larger curated training dataset.
Improvements to the training algorithm and implementation.
A functional distillation pipeline.

Larger curated training dataset

Training more complex architectures for DINOv2 necessitates more data for optimal results. Since accessing more data is not always possible, the team used a publicly available repository of web data and built a pipeline to select useful data.

That entailed removing unnecessary images and balancing the dataset across concepts. Because manual curation was not feasible, they developed a method for curating a set of seed images from multiple public datasets (e.g., ImageNet) and expanding it by retrieving similar images from crawled web data. This resulted in a pretraining dataset of 142 million images from a source of 1.2 billion images.

Improvements to the training algorithm and implementation

DINOv2 tackles the challenges of training larger models with more data by improving stability through regularization methods inspired by the similarity search and classification literature and implementing efficient techniques from PyTorch 2 and xFormers. This results in faster, more memory-efficient training with potential for scalability in data, model size, and hardware.

Functional distillation pipeline

In an earlier section, you learned that knowledge distillation trains a student model on a teacher model's outputs. In the classic supervised setting, the student uses both the original labeled data and the teacher's output probabilities as soft targets. By leveraging the knowledge learned by the teacher model, such as class relationships, decision boundaries, and model generalization, the student model can achieve similar performance to the teacher model with a smaller footprint.

This makes knowledge distillation particularly useful with DINOv2. The training algorithm for DINOv2 uses self-distillation to compress large models into smaller ones, enabling efficient inference with minimal loss in accuracy for ViT-Small, ViT-Base, and ViT-Large models.

Compared to the state-of-the-art in areas like self-supervised learning (SSL) and weakly-supervised learning (WSL), DINOv2 models perform very well across tasks such as segmentation, video understanding, fine-grained classification, and so on:

[Figure: DINOv2 performance compared with SSL and WSL methods across tasks]

The DINOv2 models seem to be able to achieve general, multipurpose backbones for many types of computer vision tasks and applications. The models generalize well across domains without fine-tuning, unlike other models like CLIP and OpenCLIP.

DINOv2 Dataset

Meta AI researchers curated a large dataset to train DINOv2. They call it LVD-142M; it contains 142 million images, assembled largely through a self-supervised image retrieval pipeline. The model extracts essential features directly from images, rather than relying on text descriptions.

The main components of the data pipeline are:

The data sources.
Processing technique.
Self-supervised image retrieval.

[Figure: DINOv2 self-supervised learning]

Data sources

The database consists of curated and uncurated datasets of 1.2 billion unique images. According to the researchers, the curated datasets contain ImageNet-22k, the train split of ImageNet-1k, Google Landmarks, and several fine-grained datasets.

Additionally, they sourced the uncurated dataset from a publicly available repository of crawled web data. They filtered the images to remove unsafe or restricted URLs and used post-processing techniques such as PCA hash deduplication, NSFW filtering, and blurring identifiable faces.

Image processing technique (de-duplication)

The researchers fed the curated and uncurated datasets into a feature encoder to produce embeddings. They used the similarities between the embeddings (computed with Faiss) to find and compare images, which is helpful for both de-duplication and retrieval.

De-duplication is the process of removing images whose embeddings are nearly identical to reduce redundancy, while image retrieval is the process of finding images whose embeddings are similar to those of a query image.
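As a rough illustration of embedding-based de-duplication, the sketch below greedily drops any image whose embedding is too similar to one already kept. It is a brute-force NumPy stand-in, not the researchers' Faiss-based pipeline; the `deduplicate` helper, the toy embeddings, and the 0.95 cosine-similarity threshold are assumptions made for this example.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    # Normalize rows so that dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        # Keep an image only if it is not too similar to any image already kept.
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: row 1 is a near-duplicate of row 0.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

kept = deduplicate(embeddings)
print(kept)
```

At the scale described in the article, pairwise comparison like this is not practical, which is why an approximate nearest-neighbor library is used instead.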

Self-supervised image retrieval

The self-supervised image retrieval system uses a ViT-H/16 network pre-trained on ImageNet-22k to compute the image embeddings. The researchers computed the embeddings on a distributed cluster of nodes, each with 8 V100-32GB GPUs.

Using the embeddings, the system retrieves images from the uncurated dataset that are similar to images in the curated datasets, with cosine similarity as the distance measure between embeddings and k-means clustering of the uncurated data.

It then stores the embeddings of those similar images in a database after processing them. This way, the pipeline assembles its own training data by measuring the similarities between curated and uncurated images.
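The retrieval step can be sketched as a top-k cosine-similarity search. This is a simplified illustration under stated assumptions: it skips the k-means clustering mentioned above, uses brute-force NumPy instead of a similarity-search library, and the `retrieve_similar` helper, the toy 2-D embeddings, and `k=2` are hypothetical.

```python
import numpy as np

def retrieve_similar(curated, uncurated, k=2):
    # Normalize both sets so that dot products are cosine similarities.
    c = curated / np.linalg.norm(curated, axis=1, keepdims=True)
    u = uncurated / np.linalg.norm(uncurated, axis=1, keepdims=True)
    sims = c @ u.T
    # For each curated (query) image, take the k most similar uncurated images.
    topk = np.argsort(-sims, axis=1)[:, :k]
    # Union of retrieved indices across all queries, without repeats.
    return sorted(set(topk.ravel().tolist()))

curated = np.array([[1.0, 0.0], [0.0, 1.0]])
uncurated = np.array([
    [0.9, 0.1],    # close to curated image 0
    [0.1, 0.9],    # close to curated image 1
    [-1.0, -1.0],  # unrelated to both
    [0.8, 0.3],    # fairly close to curated image 0
    [0.2, 0.7],    # close to curated image 1
])

selected = retrieve_similar(curated, uncurated, k=2)
print(selected)
```

The unrelated image is never retrieved, which is how the curated seed set steers what gets pulled in from the uncurated pool.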

DINOv2’s Network Architecture and Design: How it works

DINOv2 builds upon the network architecture and design of DINO and iBOT, with several adjustments to “improve both the quality of the features as well as the efficiency of pretraining.” It uses self-supervised learning to extract essential features directly from images, rather than relying on text descriptions. This enables DINOv2 to better understand local information (in the image patches) and provide a multipurpose backbone for various computer vision tasks.

The researchers used the following approaches for DINOv2’s network architecture and design:

Image-level objective

In the initial DINO architecture, there’s a student-teacher setup with two networks: the teacher and the student. Different crops of the same image are fed to the feature encoder (a ViT) to produce embeddings, and the student network is trained so that its output for one crop matches the teacher’s output for another. The teacher network is not trained directly; its weights are an exponential moving average of the student’s weights over past training iterations.
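A minimal sketch of this image-level objective, following the original DINO formulation, might look like the code below. The temperatures (0.1 and 0.04), the momentum (0.996), the centering vector, and the toy outputs are assumptions taken as typical illustrative values, not DINOv2's exact configuration.

```python
import numpy as np

def softmax(x, temperature):
    # Numerically stable softmax with temperature.
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    # Teacher output is centered and sharpened; the student is trained to match it.
    p_teacher = softmax(teacher_out - center, t_teacher)
    log_p_student = np.log(softmax(student_out, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    # The teacher's weights slowly track the student's weights.
    return momentum * teacher_w + (1.0 - momentum) * student_w

# Toy head outputs for two crops of the same image.
teacher_out = np.array([[2.0, 0.0, 0.0]])
center = np.zeros(3)
loss_same = dino_loss(np.array([[2.0, 0.0, 0.0]]), teacher_out, center)
loss_diff = dino_loss(np.array([[0.0, 0.0, 2.0]]), teacher_out, center)

# One EMA step on toy weight vectors.
teacher_w = ema_update(np.zeros(4), np.ones(4))
print(loss_same, loss_diff, teacher_w)
```

Gradients flow only through the student; the teacher changes solely through the moving average.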

Patch-level objective

The ViT divides each image into a grid of fixed-size patches, flattens them into a sequence, and randomly masks some of the input patches that the student network trains on. Principal component analysis (PCA) is computed on the patch features extracted by the network.

[Figure: DINOv2 self-supervised learning]

The iBOT Masked Image Modeling (MIM) loss term DINOv2 adopted improves patch-level tasks (e.g., segmentation).
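The patching and masking step can be illustrated with NumPy. This is a sketch under assumptions: the 56x56 image, the 14-pixel patch size, and the 50% mask ratio are example values, and masked patches are zeroed out here as a stand-in for the learnable mask token a real implementation would use.

```python
import numpy as np

def patchify(image, patch_size):
    # Split an (H, W, C) image into a sequence of flattened square patches.
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    cropped = image[:gh * patch_size, :gw * patch_size]
    grid = cropped.reshape(gh, patch_size, gw, patch_size, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio, rng):
    # Boolean mask with a fixed fraction of randomly chosen patches set to True.
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((56, 56, 3))
patches = patchify(image, patch_size=14)          # 4 x 4 grid of patches
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)

# The student sees masked patches blanked out; the teacher would see all patches.
student_input = np.where(mask[:, None], 0.0, patches)
print(patches.shape, int(mask.sum()))
```

The patch-level loss is then computed only on the masked positions, comparing the student's predictions there with the teacher's outputs for the same patches.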

Untying head weights between both objectives

When the image-level and patch-level objectives share the same head weights, the model tends to overfit at the image level while underfitting at the patch level, particularly with a large ViT network. The researchers therefore use separate (untied) head weights for the two objectives.

Adapting the resolution of the images

According to the researchers, increasing image resolution is important for pixel-level downstream tasks like segmentation or detection, but training at high resolution can be time and memory intensive. So they proposed a “curriculum learning” strategy to train the models in a meaningful order from low to high resolution images. In this case, they increased the resolution of images to 518 x 518 during a brief period at the end of pretraining.

In addition to the approaches we explained above, the researchers also applied techniques such as softmax normalization, the KoLeo regularizer (which spreads features apart and improves the nearest-neighbor search task), and the L2-norm for normalizing the embeddings.
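To give a feel for the KoLeo regularizer, here is a small NumPy sketch. It L2-normalizes the embeddings and penalizes small nearest-neighbor distances, so features that clump together get a higher loss. The `koleo_loss` helper, the `eps` constants, and the toy batches are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project each embedding onto the unit sphere.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def koleo_loss(embeddings, eps=1e-8):
    x = l2_normalize(embeddings)
    # Pairwise Euclidean distances between all embeddings in the batch.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Ignore the zero distance from each point to itself.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)
    # Negative mean log of nearest-neighbor distances: lower when points are spread out.
    return float(-np.log(nearest + eps).mean())

spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
clumped = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.97, 0.15]])

loss_spread = koleo_loss(spread)
loss_clumped = koleo_loss(clumped)
print(loss_spread, loss_clumped)
```

Minimizing this term encourages a more uniform spread of features within a batch, which is what helps nearest-neighbor retrieval.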

DINOv2 in Action

Along with segmentation, DINOv2 allows the semantic understanding of object parts in ambiguous and unambiguous images. The model can generalize across domains, including:

Depth estimation.
Semantic segmentation.
Instance retrieval.

Depth estimation

DINOv2 can estimate the depth or distance information of objects in a scene from ambiguous two-dimensional images. At the high level, DINOv2 tries to determine the relative distances of different points or regions in the image, such as the distance from the camera to the objects or the distance between objects in the scene.

This makes the DINOv2 family of models useful for applications in various areas, such as:

Augmented reality,
Robotics,
Autonomous vehicles,
Medical imaging,
Human-computer interaction,
Gaming and entertainment,
and virtual reality.

Semantic segmentation

DINOv2 can accurately classify and segment different objects or regions within an image, such as buildings, cars, pedestrians, trees, and roads. This fine-grained, object-level understanding of the visual content in an image is useful in a wide range of applications, including agriculture, microbiology, video surveillance, and environmental monitoring.

Instance retrieval

DINOv2 also works well in scenarios where fine-grained object or scene understanding is required and where it is necessary to identify and locate specific instances of objects or scenes with high precision. This is the perfect scenario for instance retrieval. The model directly uses frozen features from the SSL technique to find images similar to a given image from a large image database.

It compares the visual features of the query instance with the features of instances in the image database and finds the closest matches based on visual similarity. This has wide applications in areas like image-based search, visual recommendation systems, image retrieval, and video analytics.

Conclusion

DINOv2 from Meta AI is a game changer in computer vision. Its self-supervised learning approach and multipurpose backbone make it a versatile tool that can be easily adopted across various industries. With no need for fine-tuning and minimal labeled data requirements, DINOv2 paves the way for a more accessible and efficient future in computer vision applications.

In the future, the team aims to incorporate this versatile model as a foundational component into a larger, more intricate AI system capable of interacting with expansive language models.

By leveraging a robust visual backbone that provides detailed image information, these advanced AI systems can analyze and interpret images with greater depth, surpassing the limitations of single-text sentence descriptions commonly found in text-supervised models. DINOv2 eliminates this inherent constraint, opening up new possibilities for enhanced image understanding and reasoning.

DINOv2 Frequently Asked Questions (FAQs)

Is DINOv2 Open-Source? Can I use it for commercial use-cases?

Meta AI released the pre-trained model and project assets on GitHub, but they are not exactly open source. Rather, they are source-available because they are under the Creative Commons Attribution-NonCommercial 4.0 International Public License. Essentially, that means they are only available for noncommercial use. Find the license file in the repository.

What tasks can DINOv2 be used for?

DINOv2 generalizes to a lot of tasks, including semantic segmentation, depth estimation, instance retrieval, video understanding, and fine-grained classification.

Do I need to fine-tune DINOv2?

Fine-tuning is optional. You can fine-tune the encoder to squeeze out more performance on your task, but the DINOv2 family of models was trained to work out-of-the-box for most tasks. Unless a 2-5% performance improvement is significant for your task, it’s likely not worth the effort.

Does it perform better than other methods?

DINOv2 is competitive on classification tasks with weakly supervised learning methods like CLIP, OpenCLIP, SWAG, and EVA-CLIP across various pre-trained vision transformer (ViT) architectures. It also compares well against other self-supervised learning methods like DINO, EsViT, iBOT, and Mugs. See the table below for the full comparison with kNN and linear classifiers:

[Figure: comparison table of DINOv2 and other methods with kNN and linear classifiers]

DINOv2's ability to train a large ViT model and distill it into smaller models that outperform OpenCLIP is an impressive feat in the unsupervised learning of visual features. The results speak for themselves, with superior performance on image and pixel benchmarks.

How can DINOv2 and SAM be used together?

SAM is a state-of-the-art segmentation model and cannot be directly used for classification. Since it’s great at providing segmentation masks, you can apply DINOv2’s fine-grained classification on top of SAM’s masks for end-to-end applications.

How can I start using DINOv2?

There’s a demo lab for you to get started using the model on this page. At Encord, we are also working on a tutorial walkthrough with a step-by-step guide so you can implement the DINOv2 family of models on your dataset.

Can I incorporate prompting with the DINOv2 family of models?

For now, no. The DINOv2 models are pure image foundation models and were not trained with text. This is, of course, in contrast to text-image foundation models like CLIP. The authors of the paper argue that using images alone helped DINOv2 models outperform CLIP and other text-image foundation models.

What you might see in the next few months (I dare to say weeks, with the speed of innovation) is that there’d likely be a self-supervised pipeline that captions model outputs, uses them to train and caption more images, and produces a flywheel that iteratively labels and captions images, improving over time.

Will DINOv2 be great for labeling?

Yes, but compared to other solutions, the improvement might be minimal. Looking at the results from the evaluation table, comparisons with other methods like CLIP and OpenCLIP do not show significantly improved performance.

Resources

Meta AI’s announcement blog post.
GitHub repository for pre-trained models and other project assets.
DINOv2 website.
DINOv2 YouTube video.

Previous blog

Grounding-DINO + Segment Anything Model (SAM) vs Mask-RCNN: A comparison

Next blog

The Full Guide to Foundation Models

May 25 2023

5 M

sampleImage_data-refinement-guide-computer-vision

machine learning

Data Refinement Strategies for Computer Vision

Data refinement strategies for computer vision are integral for improving the data quality used to train machine learning-based models. Computer vision is becoming increasingly mission-critical across dozens of sectors, from facial recognition on social media platforms to self-driving cars. However, developing effective computer vision models is a challenging task. One of the key challenges in computer vision is dealing with large amounts of data and ensuring that the data is of high quality. This is where data refinement strategies come in. In this blog post, we will explore the different data refinement strategies in computer vision and how they can be used to improve the performance of machine learning models. We will also discuss the tools and techniques for creating effective data refinement strategies. Model-centric vs. Data-centric Computer Vision In computer vision, there are two paradigms: model-centric and data-centric. Both of these paradigms share a common goal of improving the performance of machine learning models, but they differ in their approach to achieving this objective. Model-centric computer vision relies on developing, experimenting, and improving complex machine learning (ML) models to accomplish tasks like object detection, image recognition, and semantic segmentation. Here the datasets are considered static, and changes are added to the model (architecture, hyperparameters, loss functions, etc.) to improve performance. Data-centric computer vision: In recent years, there has been a growing focus on data-centric computer vision as researchers and engineers recognize the significance of high-quality data in building effective ML, AI, and CV models. The models, their hyperparameters, etc., play only a minor role in the machine learning pipeline. On the other hand, data-centric computer vision prioritizes the quality and quantity of data used to train these models. Why Do We Need Data Refinement Strategies? 
Data refinement strategies are crucial in improving the quality of data and labels used to train machine learning models, as the quality of data and labels directly impacts the model's performance. Here are some ways data refinement strategies can help: Identifying Outliers Outliers are data points that do not follow the typical distribution of the dataset. Outliers can cause the model to learn incorrect patterns, leading to poor performance. By removing outliers, the model can focus on learning the correct patterns, leading to better performance. Identifying and Removing Noisy Data Noisy data refers to data that contains irrelevant or misleading information, such as duplicates or low-quality images. These data points can cause models to learn incorrect patterns, leading to inaccurate predictions. Identifying and Correcting Label Errors Label errors occur when data points are incorrectly labeled or labeled inconsistently, leading to misclassifying objects in images or videos. Correcting label errors ensures that the model receives accurate information during training, improving its ability to predict and classify objects accurately. 💡Read how to find and fix label errors Assisting in Model Performance Optimization and Debugging Data refinement strategies help preserve and debug the best-performing model by correcting incorrect labels that could affect the model’s performance evaluation metrics. You can get a more accurate and effective model by improving the data quality used to train the model. 💡You can try the different refinement strategies with Encord Active on GitHub today Common Data Refinement Strategies in Computer Vision Computer vision has made great strides in recent years, with applications across industries from healthcare to autonomous vehicles. However, the data quality used to train machine learning models is critical to their success. There are several common data refinement strategies used in computer vision. 
These strategies are designed to improve the data quality used to train machine learning models. The once we will cover today are: Smart data sampling Improving data quality Improving label quality Finding model failure modes Active learning Semi-supervised learning (SSL) 💡With Encord Active, you can visualize image embeddings, show images from a particular cluster, and export them for relabeling. Smart Data Sampling It involves identifying relevant data and removing irrelevant data. Rather than selecting data randomly or without regard to specific characteristics, smart data sampling involves using a systematic approach to choose data points most representative of the entire dataset. For example, if you train a model to recognize cars, we would want to select the cars in the street-view data. The goal of smart data sampling is to reduce the amount of data needed for training without sacrificing model accuracy. This technique can be advantageous when dealing with large datasets requiring significant computational resources and processing time. For example, image embeddings can be used for smart data sampling by clustering similar images based on their embeddings. These clusters can be used to filter out the images which have duplicates in the dataset and eliminate them. This reduces the amount of data needed for training while ensuring that the dataset is representative of the overall dataset. K-means and hierarchical clustering are two approaches to using image embeddings for smart data sampling. Improving Data Quality Improving the quality of data in the data refinement stage of machine learning, various techniques can be used, including data cleaning, data augmentation, balancing the dataset, and data normalization. These techniques help to ensure that the model is accurate, generalizes well on unseen data, and is not biased towards a particular class or category. 
💡Read this post next if you want to find out how to improve the quality of labeled data Improving Label Quality Label errors can occur when the data is mislabeled or when the labels are inconsistent. You also need to ensure that all classes in the dataset are adequately represented to avoid biases and improve the model's performance in classifying minority classes. Improving label quality ensures that computer vision algorithms accurately identify and classify objects in images and videos. To improve label quality, data annotation teams can use complex ontological structures that clearly define objects within images and videos. You can also use AI-assisted labeling tools to increase efficiency and reduce errors, identify and correct poorly labeled data through expert review workflows and quality assurance systems, and improve annotator management to ensure consistent and high-quality work. Organizations can achieve higher accuracy scores and produce more reliable outcomes for their computer vision projects by continually assessing and improving label quality. Finding Model Failure Modes Machine learning models can fail in different ways. For example, the model may struggle to recognize certain types of objects or may have difficulty with images taken from certain angles. Finding model failure modes is a critical first step in the testing process of any machine learning model. Thoroughly testing a model requires considering potential failure modes, such as edge cases and outliers, that may impact its performance in real-world scenarios. These scenarios may include factors that could impact the model's performance, such as changing lighting conditions, unique perspectives, or environmental variations. By identifying scenarios where a model might fail, one can develop test cases that evaluate the model's ability to handle these scenarios effectively. 
It's important to note that identifying model failure modes is not a one-time process and should be revisited throughout the development and deployment of a model. As new scenarios arise, it may be necessary to add new test cases to ensure that a model continues to perform effectively in all possible scenarios. 💡Read more to find out how to evaluate ML models using model test cases. Active Learning Active learning is another strategy to improve your data and your model performance. Active learning involves iteratively selecting the most informative data samples for annotation by human annotators, thereby reducing the annotation effort and cost. This strategy is advantageous when large datasets need to be annotated, as it allows for more efficient use of resources. By selecting the most valuable samples for annotation, active learning can help improve the quality of the dataset and the accuracy of the resulting machine learning models. To implement active learning in computer vision, you first train a model on a small subset of the available data. The model then selects the most informative data points for annotation by a human annotator, who labels the data and adds it to the training set. This process continues iteratively, with the model becoming more accurate as it learns from the newly annotated data. There are several benefits to using active learning in computer vision: It reduces the amount of data that needs to be annotated, saving time and reducing costs. It improves the accuracy of the machine learning model by focusing on the most informative data points. Active learning enables the model to adapt to changes in the data distribution over time, ensuring that it remains accurate and up-to-date. 💡Read this post to learn about the role of active learning in computer vision Semi-Supervised Learning In semi-supervised learning (SSL), a combination of labeled and unlabeled data is used to train the model. 
The model leverages a large amount of unlabeled data to learn the underlying distribution and subsequently utilizes the labeled data to refine its hyperparameters and enhance the model’s overall performance. SSL can be particularly useful when obtaining labeled data is expensive or time-consuming (see the figure below). If you want to learn more about the current state of semi-supervised learning, you can read A Cookbook for Self-supervised Learning co-authored by Yann LeCun. 💡Note: The data refinement strategies are inclusive of one another. They can be used in combination. For example, active learning and semi-supervised learning can be used together. What Do You Need to Create Data Refinement Strategies? To create effective data refinement strategies, you will need: data, labels, model predictions, intuition, and the right tooling: Data and Labels A large amount of high-quality data is required to train the machine learning model accurately. The data must be clean, relevant, and representative of the target population. Labels are used to identify the objects in the images, and these labels must be accurate and consistent. It is critical to develop a clear labeling schema that is comprehensive and allows for the identification of all relevant features. Model Predictions Evaluating the performance of a machine learning model is necessary to identify areas that require improvement. Model predictions provide valuable insights into the accuracy and robustness of the model. Moreover, your model predictions combined with your model embedding are very useful when detecting detect data and labeling outliers and errors. Intuition Developing effective data refinement strategies requires a deep understanding of the data and the machine learning model. This understanding comes from experience and familiarity with the data and the technology. Expertise in your problem is critical for identifying relevant features and ensuring that the model effectively solves the problem. 
Tools A range of tools can be used to create effective data refinement strategies. For example, the Encord Active platform provides a range of metrics-based data and label quality improvement tools. These include labeling, evaluation, active learning, and experiment-tracking tools. Primary Methods for Data Refinement There are three primary methodologies for data refinement in computer vision. Refinement by Image This approach involves a meticulous manual review and selection of individual images to be included in the training dataset for the machine learning model. Each image is carefully analyzed for its suitability and relevance before being incorporated into the dataset. Although this method can yield highly accurate and well-curated data, it is often labor-intensive and costly, making it less feasible for large-scale projects or when resources are limited. Refinement by Class In this method, data refinement is based on the class or category of objects present in the images. The process involves selecting and refining data labels associated with specific classes or categories, ensuring that the machine learning model is trained with accurate and relevant information. This approach allows for a more targeted refinement process, focusing on the specific object classes that are of interest in the computer vision task. This method can be more efficient than the image-by-image approach, as it narrows the refinement process to relevant object classes. Refinement by Quality Metrics This methodology focuses on selecting and enhancing data according to predefined quality metrics. These metrics may include factors such as image resolution, clarity of labels, or the perspective from which the images are taken. By establishing and adhering to specific quality criteria, this approach ensures that only high-quality images are included in the training dataset, thus reducing the influence of low-quality images on the model's performance. 
This method can help streamline the refinement process and improve the overall effectiveness of the machine learning model. Alternatively, this process can be automated with active learning tools and pre-trained models. We will cover this in the next section.

Practical Example of Data Refinement

Encord Active is an open-source active learning toolkit designed to help identify and correct label errors in computer vision datasets. With its user-friendly interface and a variety of visualization options, Encord Active streamlines the process of investigating and understanding failure modes in computer vision models, allowing users to optimize their datasets efficiently.

To install Encord Active using pip, run:

pip install encord-active

For more installation information, please read the documentation.

You can import a COCO project to Encord Active with a single-line command:

encord-active import project --coco -i ./images_folder -a ./annotations.json

Executing this command creates a local Encord Active project and pre-calculates all associated quality metrics for your data. Quality metrics are supplementary parameters for your data, labels, and models, providing semantically meaningful and relevant indices for these elements.

Encord Active offers a variety of pre-computed metrics that can be incorporated into your projects. Additionally, the platform allows you to create custom metrics tailored to your specific needs.

Once the project has been imported successfully, a local Encord Active instance is available for examining and analyzing your dataset thoroughly. To open the imported project, run:

encord-active visualize

Filtering Data and Label Outliers

For data analysis, begin by navigating to the Data Quality → Summary tab. Here, you can examine the distribution of samples across various image-level metrics, which provides valuable insight into the dataset's characteristics.
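To give a feel for what an image-level metric and its outliers look like, the sketch below computes a simple brightness score per image and flags values that fall outside the interquartile-range fences. This is an illustrative assumption about how such a check could be done, not Encord Active's implementation. The helpers `brightness` and `iqr_outliers`, the factor `k=1.5`, and the synthetic images are hypothetical.

```python
import numpy as np


def brightness(image):
    """Mean pixel intensity of an 8-bit image array, scaled to [0, 1]."""
    return float(np.asarray(image, dtype=float).mean() / 255.0)


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return np.flatnonzero((values < low) | (values > high))


# Synthetic stand-ins for real images: uniform 8x8 RGB arrays of varying intensity.
intensities = [100, 110, 105, 95, 102, 250]
images = [np.full((8, 8, 3), v, dtype=np.uint8) for v in intensities]

scores = [brightness(img) for img in images]
outlier_ids = iqr_outliers(scores)
print(outlier_ids)  # indices of images with unusually high or low brightness
```

In a real project the images would be loaded from disk, and the flagged indices could be reviewed, tagged, or excluded, in the same spirit as the workflow described in this section.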
Data Quality Summary page of Encord Active

Using the summary provided by Encord Active, you can identify properties exhibiting significant outliers, such as extreme green values, which can be crucial for assessing the dataset's quality. The platform offers features that enable you to either distribute the outliers evenly when partitioning the data into training, validation, and test datasets or to remove the outliers entirely if desired.

For label analysis, navigate to the Label Quality → Summary tab. Here, you can examine the quality of your dataset's labels through label-level metrics. Below the label quality summary, you can find the label distribution of your dataset.

On both Summary tabs, you can scroll down to see detailed information and visualizations of the detected outliers, along with the metrics to focus on.

Image outliers detected based on brightness quality metric

Once the outlier type has been identified, you will want to fix it. Go to Data Quality → Explorer and filter images by the chosen metric value (brightness). Next, tag these images with the tag "high brightness" and either download the data labels with the added tags or send the data directly for relabeling in Encord Annotate.

Images filtered by brightness values and tagged "high brightness"

Finding Label Errors with a Pre-Trained Model and Encord Active

As your computer vision projects advance, you can use a trained model to detect label errors in your data annotation pipeline. To achieve this, follow a straightforward process:

Use a pre-trained model on newly annotated samples to generate model predictions.

Import the model predictions into Encord Active. Click here to find further instructions.

encord-active import predictions --coco results.json

Overlay model predictions and ground truth labels for visualization within the Encord Active platform.

Sort by high-confidence false-positive predictions and compare them against the ground truth labels.
Once discrepancies are identified, flag the incorrect or missing labels and forward them for re-labeling using Encord Annotate.

Model Quality page of Encord Active

To ensure the integrity of the process, it is crucial that the computer vision model used to generate predictions has not been trained on the newly annotated samples under investigation.

Conclusion

In summary, the success of machine learning models in computer vision relies heavily on the quality of the data used to train them. Data refinement strategies, such as active learning, smart data sampling, improving data and label quality, and finding model failure modes, are crucial in ensuring that the models produce reliable and accurate results.

These strategies require high-quality data, accurate and consistent labels, and a deep understanding of the data and the technology. Using effective data refinement strategies, you can achieve higher model accuracy and produce more reliable outcomes for your computer vision model.

It is essential to continually assess and refine data quality throughout the development and deployment of machine learning models to ensure that they remain accurate and up-to-date in real-world scenarios.

Ready to improve the data refinement of your CV models? Sign up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today.

Want to stay updated? Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.

May 11 2023
