Eric Landau April 8, 2022

Data-centric AI: Practical implications with the SMART Pipeline


The latest trend in the ML community is the rise of what is termed data-centric AI. Simply put, data-centric AI is the notion that the most relevant component of an AI system is the data it was trained on, rather than the model or sets of models it uses. It recommends an attentional shift from trying to find improvements to model architectures and hyper-parameters to finding ways to improve the data itself. While this is fine in the abstract, it leaves a little to be desired with respect to the actions necessary for a real-world AI practitioner. How exactly do you transition your workload from iterating over models to iterating over data?


Model accuracy on ImageNet is leveling off over time

In this article we will go over a few of the practical steps for how to properly think about and implement data-centric AI. Specifically, we will investigate how data-centric AI differs from model-centric AI with respect to creating and handling training data.


Sourcing

Includes: finding data, collecting it, cleaning it, sanitizing it (for regulatory/compliance purposes)

Model-centric perspective: Use ImageNet or an open source dataset, that’ll be fine!

Data-centric perspective: Use every bone in your body to find good proprietary datasets

While seemingly mundane, the single most important step for data-centric AI is securing a good data source. In fact, from our experience, the number one predictor of success for an AI company is its ability to source the best data (best in the sense of combining both quantity and quality). This is sometimes achieved through partnerships, sometimes through creative methods such as sophisticated scraping, sometimes through structural advantages (Google will collect the best data on searches, for example), and sometimes through the sheer force of will of the company. From the cases we've seen, the investment in collecting and identifying the best data was almost always a worthy one, at least within the context of business applicability. Sourcing good data also creates positive externalities, because better data attracts better data scientists, data engineers, and ML engineers.

Even once you have secured a source, however, the work is not close to done. Raw globs of unstructured data will not be useful to downstream processes without cleaning and structuring. In many cases, unprocessed data also does not comply with legal, privacy, or other regulatory restrictions. Unfortunately, the problems here are often idiosyncratic to the individual company. Despite this, it is often possible to address them systematically, so that data can be extracted continuously from a secured source without worrying about downstream issues in the pipeline. The best AI companies have a data source flow that requires minimal intervention before the next stage.
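As a concrete sketch of what "cleaning and sanitizing" can look like in practice, the snippet below drops malformed records and redacts email addresses before storage. The record schema and the redaction rule are illustrative assumptions; a real pipeline would encode your own schema and regulatory requirements.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_record(record):
    """Drop malformed records and redact obvious PII before storage.

    `record` is a hypothetical dict with 'id' and 'text' fields; adapt
    the checks and redaction rules to your own schema and compliance needs.
    """
    # Reject records missing required fields or carrying an empty payload.
    if not record.get("id") or not record.get("text", "").strip():
        return None
    # Redact email addresses so raw PII never reaches downstream storage.
    record["text"] = EMAIL_RE.sub("[REDACTED]", record["text"])
    return record

raw = [
    {"id": "a1", "text": "Contact me at jane@example.com for the files."},
    {"id": "a2", "text": "   "},  # empty payload: dropped
    {"text": "no id"},            # missing id: dropped
]
cleaned = [r for r in (clean_record(dict(r)) for r in raw) if r is not None]
```

The point is less the specific rules and more that they run automatically on every record, which is what lets the source flow into the next stage with minimal intervention.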


Managing

Includes: storage, querying, sampling, augmenting, curating

Model-centric perspective: Querying and slicing data in efficient ways is not necessary; I will use a fixed set of data and labels for everything because my focus will be on improving my model parameters.

Data-centric perspective: Data retrieval and manipulation needs to occur frequently and efficiently as we will be iterating through many permutations and transformations of the data.

Once you have found a supply of data, the next natural step is finding a way to effectively manage it. Data management is an undervalued component in the AI development stack because it is a task of messy engineering rather than idealized mathematical formulations. The issue we find is that data systems are often designed by data scientists, not by data engineers.

The number of times we have seen annotations in text files dumped in random S3 folders is larger than we would like to admit. This is largely due to a philosophy that if data is accessible by some method, it should be fine. The resulting lack of flexibility slows down the data-centric development process by making experimentation impractical due to inefficient data access.
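To make the contrast with text-file dumps concrete, here is a minimal sketch of keeping labels in a queryable store, using an in-memory SQLite database. The table and column names are purely illustrative; the point is that slicing by class, annotator, or review status becomes a one-line query rather than a scripting exercise over S3 folders.

```python
import sqlite3

# Illustrative schema: one row per label, with queryable metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotations (
        image_id  TEXT,
        label     TEXT,
        annotator TEXT,
        reviewed  INTEGER
    )
""")
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?, ?)",
    [
        ("img_001", "pedestrian", "alice", 1),
        ("img_002", "cyclist",    "bob",   0),
        ("img_003", "pedestrian", "bob",   0),
    ],
)

# Example slice: all unreviewed pedestrian labels, ready for a QA pass.
rows = conn.execute(
    "SELECT image_id FROM annotations WHERE label = ? AND reviewed = 0",
    ("pedestrian",),
).fetchall()
```

Whether the backing store is SQLite, a warehouse, or a dedicated data platform matters less than the property it buys you: experiments over permutations of the data become cheap.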

A data-centric approach maps out management solutions from the beginning of a project and makes sure all valuable utilities are included. Sometimes that might mean finding ways to create more data through augmentations; sometimes it might mean finding ways to get rid of data through sampling and pruning. Within the Large Hadron Collider (probably the most sophisticated data collection device on the planet), for instance, over 99.99% of the data is thrown away and never analyzed. This is not done randomly, of course, but is part of the careful management of a system that produces around 100 petabytes a year.
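A toy analogue of that kind of trigger-based pruning is sketched below: score each incoming event and keep only the small fraction worth storing. The scoring function here is a stand-in; in practice it might be model uncertainty, class rarity, or a physics trigger.

```python
import random

random.seed(0)

def interestingness(event):
    """Stand-in scoring function; replace with a real trigger criterion."""
    return event["score"]

# Simulate a stream of 10,000 events with random scores in [0, 1).
stream = [{"id": i, "score": random.random()} for i in range(10_000)]

# Retain only events above a high threshold -- roughly 0.1% of the stream,
# analogous to a trigger system discarding the vast majority of collisions.
THRESHOLD = 0.999
kept = [e for e in stream if interestingness(e) > THRESHOLD]
```

The design choice worth noting is that the discard rule is explicit and versioned code, not an ad-hoc manual cull, so the retained sample stays reproducible.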

From a practical perspective, this means investing in data engineering early. That investment can be in talent or in external solutions; either way, make sure to future-proof your data systems, and don't leave them in the hands of a math PhD (says a former physics PhD).

Annotating and Reviewing

Includes: schema specification, pipeline design, manual and automated labeling, label and model evaluation

Model-centric perspective: Get to model development quicker by using an open source labelled dataset, or, if one is not available for your problem, pay a bunch of people to label stuff and now you have labels you can use forever.

Data-centric perspective: Annotation is a continuous process and should be informed by model performance.

The biggest misconception about annotation is that it is a one-off process. The model-centric view is you create a static set of labels for a project and then build a production model by optimizing parameters and hyper-parameters through permutations of train, test, and validation sets of these labels.

Example of annotations of people

It is clear where this perception originates. This is the standard operating procedure for much of academic AI work, as academics lean on benchmark datasets to compare their results against the corpus of existing work run on the same datasets. This falls dangerously short, however, for practical applications. The real world unfortunately does not look like ImageNet. It is a mess of dynamic and imperfect data.

The only way to sort out the messiness of the real world is maintenance. Continuous annotation is the maintenance layer of AI. Robust annotation pipelines are iterative and contain flows that include not just one-off annotation, but strong review processes to ensure ground truth quality and input from existing models and intelligence. This facilitates continuous learning on your system, where models can adapt with the flow of new labels and data. The most maintainable AI systems are designed such that they can accommodate these continuous processes and make the most of these active learning pipelines.
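One core step in such an active learning pipeline can be sketched in a few lines: after each training round, route the model's least-confident predictions back to annotators. The sample ids and confidence values below are hypothetical, and confidence-based selection is just one of several possible acquisition strategies.

```python
# Hypothetical (sample_id, model_confidence) pairs from the latest round.
predictions = [
    ("frame_17", 0.98),
    ("frame_42", 0.51),  # near the decision boundary
    ("frame_63", 0.87),
    ("frame_88", 0.55),
]

def select_for_annotation(preds, budget=2):
    """Return the `budget` samples the model is least confident about.

    These are the labels most likely to improve the next training round,
    so they go to the front of the annotation and review queue.
    """
    return [sid for sid, conf in sorted(preds, key=lambda p: p[1])[:budget]]

queue = select_for_annotation(predictions)
```

Each pass through this loop feeds new ground truth back into training, which is exactly what makes annotation a continuous maintenance layer rather than a one-off step.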

The other consideration with respect to industrial AI is that intellectual property can be developed at the labelling step itself. In the world of data-centric AI, the label structures you use are in themselves architectural design choices that may give your system competitive advantages. Using common ontologies or open source labels nullifies this potential advantage. These choices are difficult to make a priori and often require some empirical analysis to get right. Thus, similar to how annotation pipelines should be iterative, converging on the right label structure should itself also be an iterative process guided by experimentation.


Training

Includes: data splitting, efficient data loading, training and re-training, active learning

Model-centric perspective: I trained my model and can see the results in Weights & Biases! Hmm, they don't look good; let me write some code to fix it.

Data-centric perspective: I trained my model and can see the results in Weights & Biases! Hmm, they don't look good; let me check my dataset to see what's wrong.

The model training and validation processes look very similar for both model-centric and data-centric approaches. The major difference is the first place a data scientist looks when they go to improve performance. A model-centric view will unsurprisingly check the model. Is there a bug in the model code? Did I try a wide enough range of hyper-parameters? Should I turn on batch normalization? A data-centric view will (also unsurprisingly) focus on the data. Did I train on the right data? Is this failing for specific subsets of the data? Are there errors in my annotations?
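A simple data-first debugging move is to break validation error down by data subset instead of reaching for the model code. The sketch below uses hypothetical (subset, correct) records; the subset tags would come from whatever metadata your management layer stores.

```python
from collections import defaultdict

# Hypothetical per-example validation results, tagged by data subset.
results = [
    ("daytime", True), ("daytime", True), ("daytime", True), ("daytime", False),
    ("night", False), ("night", False), ("night", True),
]

totals, errors = defaultdict(int), defaultdict(int)
for subset, correct in results:
    totals[subset] += 1
    if not correct:
        errors[subset] += 1

# Per-subset error rates: a large gap between subsets points at a data
# problem -- coverage, labeling quality, or distribution shift -- rather
# than a model bug.
error_rate = {s: errors[s] / totals[s] for s in totals}
```

Here the night subset fails far more often than the daytime one, which would send you back to sourcing and annotation for night-time data before touching a single hyper-parameter.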

When (model) training goes wrong

The actionable insight here is that when ideating on potential performance improvements post training, start with the data. Sub-optimality can of course originate from a wide menu of causes, but the thesis of data-centric AI is that building sustainable, high-performing AI systems requires much more care in getting the data layer right. Failure modes in this domain can be quite subtle, so careful thought is often required and can lead to deeper insight into and understanding of the problem space. Because of this subtlety, debugging your data after training also requires lining up all of the above steps of the SMART pipeline correctly. And like most of the other steps, training is not a one-off process in the pipeline, but dynamic and iterative, feeding back into the other steps. Training is not the end of a linear pipeline, only the middle of a circular one.


To summarize, for more effective data-centric AI:

  • Find clever ways to source your data
  • Invest in good data engineering resources for dataset management
  • Set up continuous annotation-generation and monitoring pipelines
  • Think about debugging your data first, before your models

While seemingly obvious, there is no shortage of companies we have seen fail to think about many of the points above. They don't realize that they don't necessarily need smarter or more sophisticated models than their competitors; they just need better data than their competitors have. While probably not as fun as reading a paper about the latest model to improve on an open-source benchmark, a data-centric approach is our best bet for making AI a practical reality in the everyday world.