In 2022, the global data volume was around 97 zettabytes. This figure is projected to nearly double to more than 181 zettabytes by 2025. That’s good news for the fields of artificial intelligence and machine learning as they require large datasets to generate more accurate results. Extracting value from such vast amounts of data, however, is often challenging.
In particular, AI systems that are powered by sophisticated computer vision (CV) algorithms require high data quality to achieve good results. Since CV models typically process unstructured data consisting of thousands of images, effective data management becomes necessary.
One essential aspect of data management is data curation. It helps computer vision and machine learning models achieve higher accuracy by checking data for common errors, missing values, and inconsistencies, and by comparing it against real-world data.
Also, data curation helps you select edge cases more efficiently. An edge case is a situation with a very low probability of occurrence. AI experts may have high-quality data, but that data may only cover specific scenarios and disregard extreme conditions.
Computer vision models are particularly susceptible to these edge cases as they may have to identify rare instances within a split second to avoid disasters. For instance, a model for self-driving cars trained on images of roads and alleys under normal weather conditions may fail to recognize critical objects under extreme weather.
As such, the training data must include images captured in all types of weather to avoid accidents. The data curation process helps ensure that all or most edge cases are handled properly.
In this article, you will learn about data curation in detail, its challenges, and how it helps improve CV model performance. We’ll also specifically discuss the role of data annotation in curating computer vision datasets.
Data curation is the process of collecting, cleaning, selecting, and organizing data for AI models so that data scientists have complete, accurate, relevant, and unbiased data for model training, validation, and testing. It’s an iterative process that companies must continue even after model deployment to ensure that the curated data still matches the incoming data in the production environment.
Data curation differs from data management as the latter is a much broader concept involving the development of policies and standards to maintain data integrity throughout the data lifecycle.
As illustrated above, a complete data management lifecycle determines how an organization generates, collects, processes, stores, analyzes, visualizes, and interprets data assets. It also involves implementing robust data governance frameworks, which consist of protocols for data sharing across teams within an organization while ensuring data security and compliance with regulations.
Data curation, however, falls under the processing stage and involves the following essential steps to produce quality datasets.
The first step in data curation is collecting data from disparate sources. These can include public or proprietary databases, data warehouses, or data scraped from the web.
After collection, data is validated using automated pipelines to check for data accuracy, completeness, relevance, and consistency.
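The validation checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the record schema (`path`, `width`, `height`, `label`), the label set, and the `validate_record` helper are all assumptions made for the example.

```python
# Hypothetical schema: every image record is expected to carry these fields.
REQUIRED_FIELDS = ("path", "width", "height", "label")


def validate_record(record, allowed_labels):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Completeness: every required field is present and not None.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    # Accuracy: image dimensions must be positive integers.
    for field in ("width", "height"):
        value = record.get(field)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"invalid {field}: {value!r}")
    # Consistency: labels must come from the agreed label set.
    label = record.get("label")
    if label is not None and label not in allowed_labels:
        problems.append(f"unknown label: {label!r}")
    return problems


records = [
    {"path": "a.jpg", "width": 640, "height": 480, "label": "car"},
    {"path": "b.jpg", "width": 0, "height": 480, "label": "car"},
    {"path": "c.jpg", "width": 640, "height": 480, "label": "Car "},
    {"path": "d.jpg", "width": 640, "height": 480},
]
report = {r["path"]: validate_record(r, {"car", "pedestrian"}) for r in records}
```

Records with an empty problem list pass validation. The others can be routed to cleaning or manual review.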
Then, data cleaning involves removing corrupted data points, outliers, incorrect formats, duplicates, and other redundancies.
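As a rough sketch of this step, the snippet below drops unreadable files and exact duplicates from an in-memory collection of image files. The `clean` and `looks_like_image` helpers are hypothetical names, the corruption check only inspects the leading file signature, and only byte-for-byte duplicates are caught (resized or re-encoded copies are not).

```python
import hashlib

# Leading "magic" bytes of two common image formats.
MAGIC_BYTES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def looks_like_image(data):
    """Cheap corruption check: the data starts with a known signature."""
    return any(data.startswith(magic) for magic in MAGIC_BYTES.values())


def clean(files):
    """Drop unreadable files and exact duplicates from {name: bytes}.

    Returns the names to keep, in their original order.
    """
    seen_hashes = set()
    kept = []
    for name, data in files.items():
        if not looks_like_image(data):
            continue  # empty, truncated at the header, or wrong format
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # byte-for-byte duplicate of an earlier file
        seen_hashes.add(digest)
        kept.append(name)
    return kept


files = {
    "a.jpg": b"\xff\xd8\xff" + b"road",
    "a_copy.jpg": b"\xff\xd8\xff" + b"road",
    "b.png": b"\x89PNG\r\n\x1a\n" + b"alley",
    "broken.jpg": b"",
}
kept = clean(files)
```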
Next is normalization, which involves rescaling data values so they fall within the same range. It usually applies to structured data and benefits machine learning algorithms by preventing features with large numeric ranges from skewing the learned weights and coefficients.
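One common way to do this is min-max scaling, which maps each column onto the range 0 to 1. The sketch below assumes that technique (the text above does not name a specific method), and the `ages` and `incomes` columns are invented for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two columns on very different scales.
ages = [20, 30, 40, 60]
incomes = [20_000, 50_000, 80_000, 120_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

After scaling, both columns span the same range, so neither dominates purely because of its units.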
Data anonymization is a standard method of removing personally identifiable information (PII) from datasets, such as names, social security numbers (SSNs), and contact info.
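A minimal sketch of removing PII from a record might look like the following. The field names, the `pseudonymize` helper, and the secret key are assumptions for the example. Note that replacing an identifier with a keyed hash is strictly pseudonymization rather than full anonymization, since rows belonging to the same person remain linkable.

```python
import hashlib
import hmac

PII_FIELDS = ("name", "ssn", "phone")  # hypothetical column names


def pseudonymize(record, secret_key):
    """Strip PII fields and add a keyed-hash ID so rows stay linkable."""
    # The same SSN always maps to the same ID, but the ID cannot be
    # recomputed by anyone who does not hold the secret key.
    token = hmac.new(secret_key, record["ssn"].encode(), hashlib.sha256).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["subject_id"] = token[:16]
    return cleaned


row = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "phone": "555-0100",
    "scan": "xray_01.png",
}
safe_row = pseudonymize(row, secret_key=b"replace-with-a-real-secret")
```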
Automated pipelines then transform data into meaningful features for better model training. Feature engineering is a crucial element in this process. It allows data science teams to find relevant relationships between different columns and turn them into features that help explain the target variable.
Data augmentation introduces slight dataset variations to increase data volume and cover different scenarios. Data engineers use image operations like crop, flip, zoom, rotate, pan, and scale to enhance CV datasets.
Note that augmented data differs from synthetic data. Synthetic data is computer-generated data that resembles real-world data. Typically, it is generated using state-of-the-art generative algorithms. Augmented data, on the other hand, refers to variations of existing training data, regardless of how those variations are produced.
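To make the image operations mentioned above concrete, here is a small sketch of flip, rotate, and crop. It represents an image as a nested list of pixel values purely for illustration. Real pipelines would typically use an image library, and the function names here are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    # Reversing the rows and then transposing gives a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose corner is at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# A tiny 2 x 3 "image" whose pixel values make the transforms easy to follow.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

augmented = [
    flip_horizontal(image),
    rotate_90_clockwise(image),
    crop(image, top=0, left=1, height=2, width=2),
]
```

Each transformed copy keeps the label of the original image, which is how augmentation adds volume without extra annotation work.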
Data sampling refers to the process of using a subset of data to train AI models. However, this may introduce bias during model training since we select only a specific part of the dataset. Such issues can be avoided through probabilistic sampling techniques like random, stratified, weighted, and importance sampling.
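Stratified sampling, for example, draws the same fraction from every class so that rare conditions are not lost in the subset. The sketch below is one simple way to do it. The `stratified_sample` helper and the weather-labeled toy dataset are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_sample(items, get_label, fraction, seed=0):
    """Sample the same fraction from every class, keeping at least one item."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for item in items:
        by_label[get_label(item)].append(item)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Imbalanced toy dataset: 90 sunny images and 10 snowy ones.
dataset = [("sunny", i) for i in range(90)] + [("snow", i) for i in range(10)]
subset = stratified_sample(dataset, get_label=lambda item: item[0], fraction=0.2)

counts = defaultdict(int)
for label, _ in subset:
    counts[label] += 1
```

The subset keeps the original 9:1 class ratio (18 sunny images, 2 snowy ones), whereas a plain random draw of 20 images could easily contain no snowy images at all.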
The final step in data curation is data partitioning. This involves dividing data into training, validation, and test sets. The model uses the training set to learn patterns and compute coefficients or weights. During training, the model’s performance is tested on the validation set. If the model performs poorly during validation, it can be adjusted by fine-tuning its hyperparameters. Once you have satisfactory performance on the validation set, the test set is used to assess critical performance metrics, such as accuracy, precision, and F1 score, to see if the model is ready for deployment.
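For a binary classifier, the metrics named above can be computed directly from the true and predicted labels. The following is a small sketch with made-up labels. The `binary_metrics` helper is a hypothetical name, and in practice a library implementation would normally be used.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up test-set labels (1 = object present, 0 = absent).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```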
While there’s no one fixed way of splitting data into training, validation, and test sets, you can use the sampling methods described above to ensure that each set represents the population in a balanced manner. Doing so makes validation and test results more trustworthy, which helps you detect underfitting or overfitting before deployment.
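A simple random partition can be sketched as follows. The 70/15/15 ratio and the `split_dataset` helper are assumptions for the example rather than a recommendation. For imbalanced data, the same slicing could be applied per class to obtain a stratified split.

```python
import random


def split_dataset(items, train=0.7, val=0.15, seed=0):
    """Shuffle once, then slice into train, validation, and test lists."""
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


image_ids = list(range(100))
train_ids, val_ids, test_ids = split_dataset(image_ids)
```

Because the three slices come from a single shuffled list, no image can end up in more than one set.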
While the above data curation steps generally apply to machine learning, the curation process involves more complexity when preparing data for computer vision tasks.
First, let’s list the common types of computer vision tasks and then discuss annotation, a critical data curation element in computer vision.
Computer vision tasks require careful data annotation as part of the data curation process to ensure that models work as expected.
Data annotation refers to labeling images (typically in the training data) so the model knows the ground truth for accurate predictions.
Listed below are a few annotation techniques.
Data annotation can be time-consuming as it requires considerable effort to label each image or object within an image. It’s advisable to clearly define standard naming conventions for labeling to ensure consistency across all images.
AI practitioners can use labeled data from large datasets, such as ImageNet, which contains over a million training images across 1,000 object classes. It is ideal for building a general-purpose image classification model.
Also, AI practitioners must develop a robust review process to identify annotation mistakes before feeding the data to a CV model. In addition, leveraging automation in the annotation workflow can reduce the model development time since the manual process is error-prone and costly.
Moreover, practitioners can employ innovative methods like active learning and image embeddings to improve data annotation accuracy. Let’s look at them briefly below.
Instead of labeling all the images in a dataset, active learning algorithms allow practitioners to annotate only a few highly useful images and use them for training. These algorithms use an informativeness score that helps decide which images will be the most beneficial in improving performance.
For example, given a dataset of 1,500 images, an active learning method identifies the most useful samples (say, 100 images) for annotation. Practitioners train the model on this labeled subset and then run it on the remaining 1,400 unlabeled images.
The model assigns a confidence score to each of the 1,400 images, allowing data annotators to review the images with the lowest scores and label them manually. As a result, active learning reduces data annotation time and can boost model performance by focusing labeling effort where the model is least certain.
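The selection step can be sketched with least-confidence sampling, one common informativeness score (the text above does not commit to a specific one). The `least_confident` helper, the image IDs, and the probabilities below are invented for illustration.

```python
def least_confident(predictions, budget):
    """Pick the `budget` images whose top-class probability is lowest."""
    # predictions maps an image ID to the model's class probabilities.
    scored = [(max(probs), image_id) for image_id, probs in predictions.items()]
    scored.sort()  # lowest confidence first
    return [image_id for _, image_id in scored[:budget]]


# Made-up model outputs for four unlabeled images and three classes.
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.55, 0.44, 0.01],
    "img_004": [0.90, 0.05, 0.05],
}
to_review = least_confident(predictions, budget=2)
```

Here `img_002` and `img_003` are sent to annotators first, since the model is least sure about them. After they are labeled, the model is retrained and the loop repeats.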
Image embeddings are vectorized versions of image data where similar images have similar numerical vector representations.
As the diagram above illustrates, image embedding models transform images into a multi-dimensional vector space, with each number in the vector representing a specific image property or feature. For example, 1.3 in the first vector may represent an image’s color, 2.5 the image’s height, and so on.
Typically, image embeddings are helpful for semantic segmentation tasks as they break down an image into relevant vectors, allowing CV models to classify pixels more accurately.
Embeddings also help with facial recognition by representing each facial feature as a number in the vector. The model can then use the vectorized form to better distinguish between facial structures.
Embeddings make it easier for algorithms to numerically compute how similar two or more images are. This helps practitioners annotate images more accurately.
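Cosine similarity is one widely used way to compare embeddings numerically. The sketch below uses made-up three-dimensional vectors. Real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: two dog photos and one truck photo.
dog_1 = [1.3, 2.5, 0.2]
dog_2 = [1.2, 2.4, 0.3]
truck = [-0.8, 0.1, 2.9]

sim_dogs = cosine_similarity(dog_1, dog_2)
sim_dog_truck = cosine_similarity(dog_1, truck)
```

The two dog vectors score close to 1, while the dog and truck vectors score near 0. An annotation tool could use such scores to group similar images or to flag an image whose label disagrees with its nearest neighbors.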
Lastly, image embeddings are the backbone of generative text-to-image models, where practitioners can convert text-image pairs into embeddings. For example, you can have the text “image of a dog”, and an actual dog’s image paired together and converted into an embedding. Such embeddings can be given as input to a generative model, so it learns to create a dog’s image when it identifies the word “Dog” in a textual prompt.
Data provides the foundation for building high-quality ML models. However, collecting relevant data comes with several challenges.
Encord Active provides a robust solution for efficient data curation throughout all stages of your computer vision project. By harnessing the active learning toolkit, you can enhance both data quality and model performance by identifying and addressing failure modes within models.
A notable feature of Encord Active is its data visualization capabilities. Its quality metrics empower users to gain deeper insights into the data and its label quality, facilitating the identification of errors and outliers.
Encord Active's model evaluation functionalities play a pivotal role in identifying model failure modes. This allows users to pinpoint precisely where their models face challenges, enabling them to direct their curation efforts with greater precision. By doing so, they can identify high-value data instances that require relabeling, allowing for a targeted approach to enhancing model accuracy.
Active learning within Encord Active enables users to intelligently select informative and valuable data samples for annotation and model improvement. This approach contrasts with randomly labeling extensive datasets, as the tool actively identifies instances where the current model struggles, thereby maximizing the impact of human annotation efforts. This targeted approach ensures that labeled data aligns with specific areas where model refinement is needed, optimizing the curation process.
Encord Active streamlines data curation workflows by using a systematic process to identify and address model failure modes. The tool helps to identify labeling errors and data inconsistencies in training data. Users can visualize areas where the model encounters difficulties, allowing them to focus on the most promising areas for improvement.
Incorporating Encord Active's active learning and workflows enhances overall dataset quality and subsequently elevates model performance. Whether during labeling processes or in production, Encord Active is indispensable for effective data quality assessment. Its comprehensive feature set empowers users to refine data, labels, and models, contributing to a robust and sophisticated computer vision pipeline.
As companies gravitate more toward AI to solve business problems using complex data, the importance of data curation will increase significantly. Here are some key things organizations must consider for building a successful data curation workflow.
Data curation is an ongoing process. The organization must commit to robust data curation practices throughout the model building, deployment, and monitoring stages and keep improving the curation workflow as data evolves.