5 Steps to Build Data-Centric AI Pipelines

November 10, 2022

Data-centric AI is a positive emerging trend in the machine learning (ML) and computer vision (CV) community.

Simply put, data-centric AI is the notion that the most relevant component of an AI system is the data that it was trained on rather than the model or sets of models that it uses.

The data-centric AI concept recommends shifting attention from improving model architectures and hyper-parameters to improving the data, on the idea that better data will produce more accurate model outcomes.

While this is fine in the abstract, it leaves a little to be desired concerning the actions necessary for a real-world AI practitioner. Data scientists and data ops teams are right to wonder: How exactly do you transition your workload from iterating over models to iterating over data?

Model accuracy on ImageNet is leveling off over time

In this article we will go over a few of the practical steps for how to properly think about and implement data-centric AI. Specifically, we will investigate how data-centric AI differs from model-centric AI with respect to creating and handling training data.

For more information, here's our article on 5 Strategies To Build Successful Data Labeling Operations

What is a Data-centric approach to AI (artificial intelligence)? 

Data-centric shifts the focus when training computer vision models, or any algorithmically-generated model, from the model to the data. Unleashing the true potential of AI means sourcing, annotating, labeling, and building better datasets. 

The accuracy and output quality can and will improve dramatically with higher-quality data going into a model. 

Any data-centric approach is only as good as your ability to source, annotate, and label the right data to put into your model. 

In a previous article, we explore:

  • The importance of finding the best training data
  • How to prioritize what to label
  • How to decide which subset of data to start training your model on
  • How to use open-source tools to select data for your computer vision application

With that in mind, we can now turn to the benefits of a data-centric approach and five steps for implementing a data-centric strategy.

What are the benefits of a data-centric approach to AI? 

Adopting a data-centric approach for AI, ML, and computer vision models gives organizations numerous advantages when training and implementing production-ready models. 

As we’ve seen from working with companies in dozens of sectors, a data-centric approach, when supported by an AI-driven active learning platform for labeling and model training, produces the following advantages: 

  • Build and train computer vision models faster; 
  • Improve the quality of the data and, therefore, the accuracy and outputs of the model; 
  • Reduce the time it takes to get a model from training to deployment; 
  • Enhance iterative learning cycles, improving the production-ready model's accuracy and outputs. 

5 Steps for implementing a data-centric approach to AI, ML, and Computer Vision: Sourcing, Managing, Annotating, Reviewing, and Training (SMART)

Here are the five steps you need to take to develop a data-centric approach to AI, using the SMART model.

Sourcing the right data


Includes: Finding data, collecting it, cleaning it, sanitizing (for regulatory/compliance purposes)

Model-centric approach: Use ImageNet or an open-source dataset, that’ll be fine!

Data-centric AI model approach: Make every effort to source proprietary datasets that align with the goals and use case of the computer vision project.

Although a seemingly unimportant concern, the first and most crucial step for data-centric AI is securing a high-quality source of data or access to a proprietary data pipeline that aligns with the project goals and use case. 

In our experience, the best predictor of whether a computer vision project will succeed is the team's ability to source the best datasets possible (best in terms of both quantity and quality). Sometimes this happens through partnerships, and sometimes through more creative methods, such as sophisticated data scraping, structural advantages (e.g., access to Google datasets), or sheer force of will. 

From the clients Encord has worked with, we’ve seen that the investment in sourcing the best dataset was always worth the outcome. Sourcing high-quality data also creates positive externalities because better data attracts more skilled data scientists, data engineers, and ML engineers.

Once you’ve got the datasets, whether image- or video-based, they need to be cleaned so they’re ready for the annotation and labeling part of the process. Raw, unprocessed data often violates legal, privacy, or other regulatory restrictions. 

Most data operations leaders are prepared to handle these challenges. A team is assembled, either internally or externally, to clean the data and prepare it for annotation and labeling. 
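One small piece of that cleaning work can be sketched in code. The example below is a minimal illustration, not a full sanitization pipeline: it drops empty files and exact byte-for-byte duplicates using a content hash. The function names are hypothetical, and it does nothing for privacy or compliance, which need their own dedicated steps.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """Keep the first file (in sorted order) for each distinct content hash."""
    seen = {}
    for path in sorted(paths):
        if path.stat().st_size == 0:
            continue  # assumption: an empty file is unusable, so skip it
        seen.setdefault(file_digest(path), path)
    return sorted(seen.values())
```

Exact hashing only catches identical files. Near-duplicates, such as re-encoded or resized copies, would need a perceptual comparison instead.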

Managing image and video-based datasets


Includes: Storage, querying, sampling, augmenting, and curating datasets. 

Model-centric approach: Querying and slicing data in efficient ways is not necessary, I will use a fixed set of data and labels for everything because my focus will be on improving my model parameters.

Data-centric AI model strategy: Data retrieval and manipulation need to occur frequently and efficiently as we will be iterating through many permutations and transformations of the data.

Once you’ve sourced the right datasets, the next step is finding a way to manage them effectively. 

Data management is an undervalued part of computer vision because it’s a messy engineering task rather than a matter of mathematical formulations and algorithms. We find that data scientists, not data engineers, often design data systems.

More times than we would like, we’ve seen annotations in text files dumped into random Amazon S3 folders alongside an unstructured assortment of images or videos. This is mainly due to the philosophy that if the data is accessible somehow, it should be fine. Unfortunately, this inflexibility slows down the data-centric development process because of inefficient data access.

A data-centric approach maps out management solutions from the beginning of the projects and ensures all valuable utilities are included. Sometimes, that might be finding ways to create more data through augmentations and synthetic data creation. Other times, it will involve removing data (images, videos, and other data as needed) through sampling and pruning. 
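As a rough sketch of what a queryable alternative to loose text files can look like, the example below loads item metadata into an in-memory SQLite table and draws a reproducible per-label sample from the reviewed items. The table layout, the column names, and the function names are all assumptions made for illustration; a real system would pick its own schema and storage.

```python
import random
import sqlite3


def build_index(records):
    """Load (uri, label, source, reviewed) tuples into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "uri TEXT PRIMARY KEY, label TEXT, source TEXT, reviewed INTEGER)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", records)
    return conn


def sample_per_label(conn, per_label, seed=0):
    """Draw up to `per_label` reviewed items for each label, reproducibly."""
    rng = random.Random(seed)
    rows = conn.execute(
        "SELECT label, uri FROM items WHERE reviewed = 1 ORDER BY label, uri"
    ).fetchall()
    by_label = {}
    for label, uri in rows:
        by_label.setdefault(label, []).append(uri)
    return {
        label: sorted(rng.sample(uris, min(per_label, len(uris))))
        for label, uris in by_label.items()
    }
```

The point is less the specific database and more that slicing by label, source, or review status becomes a one-line query rather than a custom script per experiment.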

Within the Large Hadron Collider (probably the most sophisticated data collection device on the planet), for instance, over 99.99% of the data is thrown away and never analyzed. This is not a random decision, of course, but it is part of the careful management of a system that produces around 100 petabytes yearly.

From a practical perspective, this means investing in data engineering early. This can be in talent or in external solutions; just make sure to future-proof your data systems, and don’t leave it to the hands of a mathematics Ph.D. (said by a former physics Ph.D.).

Open-source Large Hadron Collider data from CERN


Annotating and Reviewing Datasets Using Artificial Intelligence

(This is effectively two stages: Annotating and reviewing; however, we've grouped them together as they usually move swiftly from one to the next in the SMART data-centric pipeline)

Includes: Schema specification, pipeline design, manual and automated labeling, and label and model evaluation

Model-centric approach: Get to model development quicker by using an open source labeled dataset, or, if one is not available for your problem, pay a bunch of people to label stuff, and now you have labels you can use forever.

Data-centric AI model approach: Annotation is a continuous iterative workflow process and should be informed by model performance.

One of the biggest misconceptions about annotation is that it’s a one-off process. The model-centric view is that you create a static set of labels for a project and then build a production model by optimizing parameters and hyper-parameters through permutations of training, testing, and validating on these labels and annotations.

Example of annotations of people

It’s clear where this perception originates. This is the standard operating procedure for academic AI work. Academics tend to lean on benchmark datasets to compare their results against a body of existing work run on the same datasets. For practical applications and business use cases, this approach doesn’t work. The real world, unfortunately, doesn’t look like ImageNet. It’s a mess of dynamic and imperfect datasets that have to be tailored to individual projects and use cases. 

The solution to the messiness of real-world datasets is maintenance. Continuous annotation is the maintenance layer of AI. 

Robust data annotation pipelines and workflows are iterative. They combine annotation and labeling with quality control and assurance steps that protect ground truth quality, and they take input from existing models and intelligence. This ensures that AI models can adapt to the flow of new labels and data. The most maintainable AI systems are designed to accommodate these continuous processes and make the most of these active learning pipelines.
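To make "input from existing models" concrete, here is a minimal sketch of one common active learning heuristic, entropy-based uncertainty sampling: the items the current model is least sure about go to annotators first. This is one option among many and is not a description of any particular platform's method. The function names and the shape of `predictions` (item id mapped to class probabilities) are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_to_label(predictions, budget):
    """Return the `budget` item ids whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]
```

After the selected items are labeled and reviewed, the model is retrained and the ranking is recomputed, which is what makes the loop continuous rather than one-off.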

An important point for industrial AI, and for any computer vision model that an organization designs and builds, is that intellectual property can be developed during the labeling process itself. In the world of data-centric AI, the label structures you use are in themselves architectural design choices that may give your system competitive advantages. Using common ontologies or open-source labels removes this potential advantage. These choices often require some empirical analysis to get right. 

Similar to how data annotation pipelines should be iterative, converging on the right label structure should itself also be an iterative process guided by experimentation.
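One lightweight way to experiment with label structure, sketched below under illustrative assumptions, is to keep fine-grained labels as the source of truth and express each candidate ontology as a mapping on top of them. You can then compare candidates (here, only by class counts) without relabeling anything. The label names and the mapping are made up for the example.

```python
from collections import Counter


def remap(labels, mapping):
    """Apply a fine-to-coarse label mapping; unmapped labels are kept as-is."""
    return [mapping.get(label, label) for label in labels]


def class_counts(labels):
    """Count how many items fall under each label."""
    return dict(Counter(labels))


# Hypothetical fine-grained labels and one candidate coarser ontology.
fine_labels = ["sedan", "suv", "truck", "pedestrian", "cyclist", "sedan"]
candidate = {"sedan": "car", "suv": "car"}
counts = class_counts(remap(fine_labels, candidate))
```

Class balance is only one signal. The real test of a candidate structure is how a model trained on it performs for your use case.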

Training Computer Vision Models with a data-centric approach 

Includes: Data splitting, efficient data loading, training and re-training, and active learning pipelines.

Model-centric AI: I trained my model and see the results in Weights & Biases! Hmm, they don’t look good, let me write some code to fix it.

Data-centric AI & CV models: I trained my model and see the results in Weights & Biases! Hmm, they don’t look good, let me check my dataset to see what’s wrong.

The model training and validation processes look very similar for both model-centric and data-centric approaches. The major difference is the first place a data scientist looks when they go to improve performance. A model-centric view will unsurprisingly check the model. Is there a bug in the model code? Did I use a wide enough scope of hyperparameters? Should I turn on batch normalization? 

When (model) training goes wrong

A data-centric view will (also unsurprisingly) focus on the data. Did I train on the right data? Is this failing for specific subsets of the data? Are there errors in my annotations and labels?
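Two of those questions can be asked directly of your evaluation results. The sketch below assumes each evaluated example is a dict with hypothetical `id`, `slice`, `label`, `pred`, and `confidence` fields. It reports accuracy per data slice and flags items where the model confidently disagrees with the annotation, which are candidates for a label review rather than proof of a label error.

```python
from collections import defaultdict


def accuracy_by_slice(examples):
    """Compute accuracy separately for each value of the 'slice' field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        totals[ex["slice"]] += 1
        hits[ex["slice"]] += ex["label"] == ex["pred"]
    return {name: hits[name] / totals[name] for name in totals}


def suspect_labels(examples, threshold=0.95):
    """Return ids where the model disagrees with the label at high confidence."""
    return [
        ex["id"]
        for ex in examples
        if ex["pred"] != ex["label"] and ex["confidence"] >= threshold
    ]
```

A slice that scores far below the rest usually points to missing or poorly labeled data for that condition, which feeds straight back into sourcing and annotation.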

Using the data-centric approach, start with the datasets when looking for performance improvements post-training. 

Poor performance and accuracy outputs can originate from a wide range of potential issues, but the strategy behind taking a data-centric AI approach is that to build high-performance AI systems, much more care needs to go into getting the data layer right. 

Failure modes in this domain can be quite subtle, so careful thought is often required and can lead to deeper insight and understanding of the problems a model is encountering. Because it’s subtle, debugging your data after training also requires lining up all of the above steps of the SMART pipeline correctly. 

And like most of the other steps, training is not a one-off process in the pipeline. It is dynamic and iterative, and it feeds back into the other steps. Training is not the end of a linear pipeline, only the middle of a circular one.
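One practical consequence of a circular pipeline is that the dataset keeps growing between training runs. A minimal sketch of one way to keep evaluation comparable across runs, assuming every item has a stable string id, is to derive the split from a hash of that id so an item never moves between train, validation, and test. The function name and default fractions are illustrative.

```python
import hashlib


def assign_split(item_id, val_fraction=0.1, test_fraction=0.1):
    """Map an item id to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```

This does not handle grouped data, such as frames from the same video, where the group id rather than the frame id should be hashed to avoid leakage between splits.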

Key Takeaways: Advantages of the data-centric approach to AI 

For those wanting to take a more effective data-centric AI approach, here are the steps you need to follow:

  • Find clever ways to source your high-quality proprietary datasets
  • Invest in good data engineering resources for dataset management
  • Set up pipelines for continuous annotation generation and monitoring
  • Think about debugging your data first, before your models

While these points may seem obvious, we have seen no shortage of companies that fail to think about many of them. They don’t realize that they don’t necessarily need smarter or more sophisticated models than their competitors. They just need better data than their competitors have. 

While probably not as ostensibly fun as reading a paper about the latest model that improved on an open-source benchmark, a data-centric approach is our best bet to make AI a practical reality for the everyday world.

Ready to accelerate and automate your data annotation and labeling? 

Sign up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. 

AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today

Written by

Eric Landau