Data Curation in Computer Vision

Alexandre Bonnet
August 24, 2023

In 2022, the global data volume was around 97 zettabytes. This figure is projected to nearly double, to more than 181 zettabytes, by 2025. That’s good news for the fields of artificial intelligence and machine learning, as they require large datasets to generate more accurate results. Extracting value from such vast amounts of data, however, is often challenging.

In particular, AI systems that are powered by sophisticated computer vision (CV) algorithms require high data quality to achieve good results. Since CV models typically process unstructured data consisting of thousands of images, effective data management becomes necessary.

One essential aspect of data management is data curation. It helps computer vision and machine learning models achieve higher accuracy by checking data for common errors, missing values, and inconsistencies, and by checking how well it represents real-world data.

Also, data curation helps you select edge cases more efficiently. An edge case is a situation with a very low probability of occurrence. AI experts may have high-quality data, but that data may only cover specific scenarios and disregard extreme conditions. 

Computer vision models are particularly susceptible to these edge cases as they may have to identify rare instances within a split second to avoid disasters. For instance, a model for self-driving cars trained on images of roads and alleys under normal weather conditions may fail to recognize critical objects under extreme weather. 

As such, the training data must include images captured in all types of weather to avoid accidents. The data curation process helps ensure that all or most edge cases are covered.

In this article, you will learn about data curation in detail, its challenges, and how it helps improve CV model performance. We’ll also specifically discuss the role of data annotation in curating computer vision datasets.


What is Data Curation?

Data curation is the process of collecting, cleaning, selecting, and organizing data for AI models so data scientists can get complete, accurate, relevant, and unbiased data for model training, validation, and testing. It’s an iterative process that companies must follow even after model deployment to ensure the curated data keeps matching the incoming data in the production environment.

Data curation differs from data management as the latter is a much broader concept involving the development of policies and standards to maintain data integrity throughout the data lifecycle.

Data curation

As illustrated above, a complete data management lifecycle determines how an organization generates, collects, processes, stores, analyzes, visualizes, and interprets data assets. It also involves implementing robust data governance frameworks, which consist of protocols for data sharing across teams within an organization while ensuring data security and compliance with regulations.

Data curation, however, falls under the processing stage and involves the following essential steps to produce quality datasets.

Data Collection

The primary step in data curation is collecting data from disparate sources. These can include public or proprietary databases, data warehouses, and data scraped from the web.

Data Validation

After collection, data is validated using automated pipelines to check for data accuracy, completeness, relevance, and consistency.
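As a minimal sketch, automated checks of this kind might look like the following for image metadata. The record fields, the label set, and the function name are assumptions made for illustration; they do not come from any specific tool.

```python
from typing import Any

# Hypothetical label taxonomy and required fields, chosen only for illustration.
ALLOWED_LABELS = {"car", "pedestrian", "bus"}
REQUIRED_FIELDS = ("path", "width", "height", "label")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one metadata record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Accuracy: image dimensions must be positive integers.
    for dim in ("width", "height"):
        value = record.get(dim)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"invalid {dim}: {value!r}")
    # Consistency: labels must come from the agreed taxonomy.
    label = record.get("label")
    if label not in (None, "") and label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label!r}")
    return problems


records = [
    {"path": "a.jpg", "width": 640, "height": 480, "label": "car"},
    {"path": "b.jpg", "width": 0, "height": 480, "label": "truck"},
]
for record in records:
    print(record["path"], validate_record(record))
```

Records that return an empty list pass; the rest can be routed to the cleaning step or to a human reviewer.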

Data Cleaning

Then, data cleaning involves removing corrupted data points, outliers, incorrect formats, duplicates, and other redundancies.
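One simple cleaning step, removing byte-identical duplicates and empty files, can be sketched with a content hash. This is an illustration under stated assumptions: the function name is hypothetical, files are passed in as an in-memory mapping of name to bytes, and zero-byte files are treated as corrupted.

```python
import hashlib


def clean_files(files: dict[str, bytes]) -> dict[str, bytes]:
    """Drop zero-byte files and keep the first file seen for each distinct content."""
    seen_digests = set()
    cleaned = {}
    for name, content in files.items():
        if not content:
            continue  # assumption: an empty file is a corrupted data point
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_digests:
            continue  # same bytes as an earlier file: treat as a duplicate
        seen_digests.add(digest)
        cleaned[name] = content
    return cleaned


files = {
    "a.png": b"\x89PNG-first",
    "copy_of_a.png": b"\x89PNG-first",
    "b.png": b"\x89PNG-second",
    "empty.png": b"",
}
print(list(clean_files(files)))
```

An exact hash only catches identical bytes. Near-duplicates, such as the same photo resized or re-encoded, need a similarity measure instead.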

Want to learn more about data cleaning? Read our comprehensive guide on Mastering Data Cleaning and Data Preprocessing.


Data Normalization

Next is normalization, which involves re-scaling data values so they’re within the same range. It usually applies to structured data and benefits machine learning algorithms by preventing features with large numeric ranges from skewing the learned weights and coefficients.
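Min-max scaling is one common way to bring values into the same range; it maps each value v to (v - min) / (max - min), so the result lies in [0, 1]. The sketch below is illustrative, and the function name is an assumption.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Re-scale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([10, 20, 30]))
```

Other schemes, such as standardizing to zero mean and unit variance, follow the same pattern with a different formula.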


Data De-identification

De-identification is a standard method of removing personally identifiable information (PII) from datasets, such as names, social security numbers (SSNs), and contact info.
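For text fields such as captions or file notes, a first pass at removing this kind of information can be sketched with regular expressions. The patterns below are simplified assumptions (US-style SSNs and basic email addresses) and will miss many real-world formats; names in particular need more than pattern matching.

```python
import re

# Simplified, illustrative patterns; not a complete PII detector.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text: str) -> str:
    """Replace SSN-like and email-like substrings with placeholder tokens."""
    text = SSN_PATTERN.sub("[SSN]", text)
    return EMAIL_PATTERN.sub("[EMAIL]", text)


print(redact("Patient SSN 123-45-6789, contact jane.doe@example.com."))
```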

Data Transformation

Automated pipelines then transform data into meaningful features for better model training. Feature engineering is a crucial element in this process. It allows data science teams to find relevant relationships between different columns and turn them into features that help explain the target variable.

Data Augmentation

Data augmentation introduces slight dataset variations to increase data volume and cover different scenarios. Data engineers use image operations like crop, flip, zoom, rotate, pan, and scale to enhance CV datasets.


Data Augmentation Example
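A few of the image operations named above can be sketched on a tiny grayscale image stored as nested lists. This is only an illustration: the function names are assumptions, and real pipelines typically rely on an image library rather than plain lists.

```python
Image = list[list[int]]  # grayscale image as rows of pixel values


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
augmented = [flip_horizontal(image), rotate_90_clockwise(image), crop(image, 0, 1, 2, 2)]
for variant in augmented:
    print(variant)
```

Each variant keeps the original label, so one labeled image yields several training samples.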

Note that augmented data differs from synthetic data. Synthetic data is computer-generated data that resembles real-world data and is typically produced by generative algorithms. Augmented data, on the other hand, consists of variations derived from existing training samples through transformations like the ones listed above.

Data Sampling

Data sampling refers to the process of using a subset of data to train AI models. However, this may introduce bias during model training since we select only a specific part of the dataset. Such issues can be avoided through probabilistic sampling techniques like random, stratified, weighted, and importance sampling.
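Stratified sampling, for example, draws the same fraction from every class so that rare classes are not lost. The sketch below is a minimal version under stated assumptions: items are (id, label) pairs, the function name is hypothetical, and at least one item is kept per class.

```python
import random
from collections import defaultdict


def stratified_sample(items: list[tuple[str, str]], fraction: float, seed: int = 0) -> list[tuple[str, str]]:
    """Sample the same fraction of items from every label group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for item_id, label in items:
        by_label[label].append(item_id)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Keep at least one item so that a rare class never disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend((item_id, label) for item_id in rng.sample(group, k))
    return sample


items = [(f"img_{i}", "day") for i in range(90)] + [(f"img_{i}", "night") for i in range(90, 100)]
subset = stratified_sample(items, fraction=0.2)
print(len(subset))
```

With 90 "day" and 10 "night" images, a 20% sample keeps the 9:1 ratio instead of leaving the night images to chance.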

You can read our complete guide on How to Mitigate Bias in Machine Learning Models to learn how to reduce bias in models efficiently.

Data Partitioning

The final step in data curation is data partitioning. This involves dividing data into training, validation, and test sets. The model uses the training set to learn patterns and compute coefficients or weights. During training, the model’s performance is evaluated on the validation set. If the model performs poorly during validation, it can be adjusted by tuning its hyperparameters. Once you have satisfactory performance on the validation set, the test set is used to assess critical performance metrics, such as accuracy, precision, and F1 score, to see if the model is ready for deployment.
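For a binary task, the metrics named above can be computed from counts of true and false positives and negatives. The sketch below is illustrative; the function name and the 0/1 label encoding are assumptions.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard every division so empty or one-sided inputs return 0.0.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```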

While there’s no one fixed way of splitting data into training, validation, and test sets, you can use the sampling methods described above to ensure that each set represents the population in a balanced manner. Doing so makes validation and test scores a more reliable signal of whether your model suffers from underfitting or overfitting.
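A simple random split can be sketched as follows. The 70/15/15 proportions, the function name, and the fixed seed are assumptions for illustration, not a recommendation; a stratified variant would apply the same split within each class.

```python
import random


def split_dataset(items: list, train: float = 0.7, val: float = 0.15, seed: int = 0):
    """Shuffle items and cut them into training, validation, and test lists."""
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset([f"img_{i}" for i in range(100)])
print(len(train_set), len(val_set), len(test_set))
```

Because the three lists are slices of one shuffled copy, no image can land in more than one set.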

Get a deeper understanding of the training, validation and test set by reading the article on Training, Validation, Test Split for Machine Learning Datasets.

Data Curation in Computer Vision

While the above data curation steps generally apply to machine learning, the curation process involves more complexity when preparing data for computer vision tasks. 

First, let’s list the common types of computer vision tasks and then discuss annotation - a critical data curation element in computer vision.

Common Types of Computer Vision Tasks

  • Object Detection: Object detection identifies and locates specific objects within a given image. For example, in a street scene, the model tries to distinguish between different objects (like people, vans, and buses) on the road.
  • Image Classification: Image classification models assign a class label to an entire image based on the patterns they learn from the training data. For instance, an animal classifier would label an image of a dog as “Dog” if the classifier has been trained on a good sample of dog images.
  • Face Recognition: Facial recognition tasks typically involve complex convolutional neural nets (CNNs) that learn intricate facial patterns and recognize faces in images.
  • Semantic Segmentation: Semantic segmentation assigns a class label to every pixel in an image, which lets AI practitioners distinguish between several elements in a given image on a pixel level.
  • Text-to-Image Generative Models: Generating images from text is a new development in the generative AI space that involves writing text-based input prompts to describe the type of image you want. The generative model processes the prompt and produces suitable images that match the textual description. Several proprietary and open-source models, such as Midjourney, Stable Diffusion, Craiyon, and DeepFloyd, are recent examples that can create realistic photos and artwork in seconds.

Role of Data Annotation In Curating Computer Vision Data

Computer vision tasks require careful data annotation as part of the data curation process to ensure that models work as expected. 

Data annotation refers to labeling images (typically in the training data) so the model knows the ground truth for accurate predictions.

Listed below are a few annotation techniques.

  • Bounding Box: The technique involves drawing a rectangular box around each object of interest. It’s mainly used for object detection and object localization tasks.
  • Landmarking: In landmarking, the objective is to label individual features within an image. It’s suitable for facial recognition tasks.
  • Tracking: Tracking is useful for labeling moving objects across consecutive video frames.

General Considerations for Annotating Image Data

Data annotation can be time-consuming as it requires considerable effort to label each image or object within an image. It’s advisable to clearly define standard naming conventions for labeling to ensure consistency across all images.

AI practitioners can use labeled data from large datasets, such as ImageNet, whose widely used ILSVRC subset contains over a million training images across 1,000 object classes. It is well suited to building a general-purpose image classification model.

Also, AI practitioners must develop a robust review process to identify annotation mistakes before feeding the data to a CV model. In addition, leveraging automation in the annotation workflow can reduce the model development time since the manual process is error-prone and costly.
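Part of such a review process can be automated with basic sanity checks that run before human review. The sketch below assumes bounding boxes stored as pixel coordinates and a fixed label set; the class name, field names, and checks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxAnnotation:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def find_annotation_issues(box: BoxAnnotation, image_width: int, image_height: int,
                           allowed_labels: set[str]) -> list[str]:
    """Return the sanity checks that a bounding-box annotation fails."""
    issues = []
    # Naming convention: the label must match the agreed taxonomy exactly.
    if box.label not in allowed_labels:
        issues.append("unknown label")
    # Geometry: the box must have positive width and height.
    if box.x_min >= box.x_max or box.y_min >= box.y_max:
        issues.append("empty or inverted box")
    # Bounds: the box must lie inside the image.
    if box.x_min < 0 or box.y_min < 0 or box.x_max > image_width or box.y_max > image_height:
        issues.append("box outside image")
    return issues


labels = {"car", "bus"}
print(find_annotation_issues(BoxAnnotation("Car", 10, 20, 110, 220), 640, 480, labels))
```

Checks like these catch mechanical mistakes cheaply, leaving reviewers to focus on judgment calls such as ambiguous or occluded objects.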

Moreover, practitioners can employ innovative methods like active learning and image embeddings to improve data annotation accuracy. Let’s look at them briefly below.

Active Learning

Instead of labeling all the images in a dataset, active learning algorithms allow practitioners to annotate only a small number of highly useful images and use them for training. These algorithms use an informativeness score to decide which images will be the most beneficial in improving performance.

For example, given a dataset containing 1,500 images, the active learning method identifies the most useful data samples (let’s say 100 images) for annotation. Practitioners train the model on this labeled subset and then run it on the remaining 1,400 unlabeled images.

The model assigns confidence scores to the 1,400 images, allowing data annotators to review the images with the lowest scores and label them manually. As a result, active learning reduces the data annotation time and can boost model performance for a given labeling budget.
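The selection step can be sketched with least-confidence sampling, one of several possible informativeness scores. The function names, the image IDs, and the probability values below are made up for illustration.

```python
def least_confidence_score(class_probabilities: list[float]) -> float:
    """Higher score means the model is less sure about its top prediction."""
    return 1.0 - max(class_probabilities)


def select_for_annotation(predictions: dict[str, list[float]], budget: int) -> list[str]:
    """Pick the `budget` images the model is least confident about."""
    ranked = sorted(
        predictions,
        key=lambda image_id: least_confidence_score(predictions[image_id]),
        reverse=True,  # most informative (least confident) first
    )
    return ranked[:budget]


# Hypothetical model outputs: per-image probabilities over two classes.
predictions = {
    "img_a": [0.98, 0.02],
    "img_b": [0.51, 0.49],
    "img_c": [0.87, 0.13],
    "img_d": [0.45, 0.55],
}
print(select_for_annotation(predictions, budget=2))
```

The selected images go to annotators, the model is retrained on the enlarged labeled set, and the cycle repeats until performance or budget targets are met.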

Interested in learning more about active learning? Read our detailed Practical Guide to Active Learning for Computer Vision.

Image Embeddings

Image embeddings are vectorized versions of image data where similar images have similar numerical vector representations.


Vector Embeddings of Images

As the diagram above illustrates, image embedding models transform images into a multi-dimensional vector space, with each number in the vector representing a specific image property or feature. As a simplified example, 1.3 in the first vector may represent an image’s color, 2.5 the image’s height, and so on. In practice, the dimensions of a learned embedding rarely map to such human-readable properties.

Typically, image embeddings are helpful for semantic segmentation tasks as they break down an image into relevant vectors, allowing CV models to classify pixels more accurately.

Embeddings also help with facial recognition by representing facial features as numbers in the vector. The model can then use the vectorized form to distinguish between different facial structures.

Embeddings make it easier for algorithms to numerically compute how similar two or more images are, which helps practitioners annotate images more accurately.
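Cosine similarity is one common way to compute that similarity: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses tiny made-up vectors; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 is a convention
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
dog_a = [1.3, 2.5, 0.1]
dog_b = [1.2, 2.4, 0.2]
car = [0.1, 0.3, 2.9]

print(cosine_similarity(dog_a, dog_b) > cosine_similarity(dog_a, car))
```

Ranking unlabeled images by similarity to an already labeled one is a simple way to surface likely duplicates or candidates for the same label.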

Lastly, image embeddings are the backbone of generative text-to-image models, where practitioners can convert text-image pairs into embeddings. For example, you can have the text “image of a dog”, and an actual dog’s image paired together and converted into an embedding. Such embeddings can be given as input to a generative model, so it learns to create a dog’s image when it identifies the word “Dog” in a textual prompt.

Challenges in Data Curation

Data provides the foundation for building high-quality ML models. However, collecting relevant data comes with several challenges.

  • Data is Ever-evolving: With the rapid rise of big data, maintaining consistent and accurate data across time and platforms is challenging. Data distributions can change quickly as more data comes in, making data curation more difficult.
  • Data Security Issues: Edge computing is giving rise to security issues as organizations must ensure data collection from several sources is secure. It calls for robust encryption and de-identification strategies to protect private information and maintain data integrity throughout curation.
  • Data Infrastructure & Scale: It’s difficult for organizations to develop infrastructure for handling the ever-increasing scale of data and ML applications. The exponential rise in data volume is causing experts to shift from code-based strategies to data-centric AI, which focuses on improving the data itself and on building tools that help with data exploration and analysis.
  • Data Scarcity: Mission-critical domains like healthcare often lack high-quality data sources, making it difficult for practitioners to curate data and build accurate models. Models built using low-quality data are more likely to give false positives, which is why expert human supervision is required to monitor the outcomes of such models.


Using Encord Active for Data Curation

Encord Active provides a robust solution for efficient data curation throughout all stages of your computer vision project. By harnessing the active learning toolkit, you can enhance both data quality and model performance by identifying and addressing failure modes within models.

A notable feature of Encord Active is its data visualization capabilities. Its quality metrics empower users to gain deeper insights into the data and its label quality, facilitating the identification of errors and outliers.


Encord Active's model evaluation functionalities play a pivotal role in identifying model failure modes. This allows users to pinpoint precisely where their models face challenges, enabling them to direct their curation efforts with greater precision. By doing so, they can identify high-value data instances that require relabeling, allowing for a targeted approach to enhancing model accuracy.


Active learning within Encord Active enables users to intelligently select informative and valuable data samples for annotation and model improvement. This approach contrasts with randomly labeling extensive datasets, as the tool actively identifies instances where the current model struggles, thereby maximizing the impact of human annotation efforts. This targeted approach ensures that labeled data aligns with specific areas where model refinement is needed, optimizing the curation process.

Encord Active streamlines data curation workflows by using a systematic process to identify and address model failure modes. The tool helps to identify labeling errors and data inconsistencies in training data. Users can visualize areas where the model encounters difficulties, allowing them to focus on the most promising areas for improvement.

Incorporating Encord Active's active learning and workflows enhances overall dataset quality and subsequently elevates model performance. Whether during labeling processes or in production, Encord Active is indispensable for effective data quality assessment. Its comprehensive feature set empowers users to refine data, labels, and models, contributing to a robust and sophisticated computer vision pipeline.

Data Curation: Key Takeaways

As companies gravitate more toward AI to solve business problems using complex data, the importance of data curation will increase significantly. Here are some key things organizations must consider for building a successful data curation workflow.

  • Data curation is a part of data management. As such, data curation in isolation may only solve a part of the data problems. Companies must have a holistic management policy to ensure curation yields value.
  • The curation workflow must suit specific requirements. A workflow that works for a particular task may fail to produce results for another. For instance, object detection annotation techniques differ from facial recognition methods.
  • A successful data curation process requires specialized skills. For example, data curators must ensure that data scientists use the right annotation, normalization, sampling, and de-identification techniques for superior results. Curators must recognize the deficiencies of one method over another and give valuable suggestions.
  • Investing in purpose-built tools for managing data curation is beneficial as it can help automate the curation process for faster results and increased ROI.

Data curation is an ongoing process. The organization must commit to robust data curation practices throughout the model building, deployment, and monitoring stages and keep improving the curation workflow as data evolves.
