
Eric Landau January 13, 2023

Closing the AI production gap: Encord Active, a new open source toolkit for active learning


Recently there was a tweet by Paul Graham showing an old cover from Maclean’s magazine highlighting the “future of the internet”.

Graham poses the accompanying question: “What do 36 million people use now that eventually 5 billion people will?” At Encord we have long believed the answer to this question to be artificial intelligence. AI feels to be at an inflection point similar to the one the internet reached in the 90s, poised for widespread adoption.

Paul Graham Twitter post: “What do 36 million people use now that eventually 5 billion will?”

While the view that AI will be the next ubiquitous technology is not an uncommon one, its plausibility has never been as palpable as it has become recently, with the imagination-grabbing advancements in generative AI over the last year. These advancements have seen the rise of “foundational models”, high-capacity unsupervised AI systems that train over enormous swaths of data and consume millions of dollars of GPU power doing it.

TLDR

Problem: There is an “AI production gap” between proof-of-concept and production models due to issues with AI model robustness, reliability, and explainability, caused by a lack of high-quality labels, model edge cases, and cumbersome iteration cycles for model re-training.
Solution: We are releasing a free open-source active learning toolkit, designed to help people building computer vision models improve their data, labels, and model performance.

Introduction

The success of foundational models is creating a dynamic of duality in the AI world. On one side are foundational models, built by well-funded institutions with the GPU muscle to train over an internet’s worth of data. On the other are application-layer AI models, normally built with the traditional supervised learning paradigm, which requires labeled training data.

While the feats of these foundational models have been quite impressive, it is quite clear we are still in the very early days of the proliferation and value of application-layer AI, with numerous bottlenecks holding back wider adoption.

We started Encord a few years ago, originally to tackle one of the major blockers to this adoption: the data labeling problem. Over the years, working with many exciting AI companies, we have broadened our view of the blockers in later-stage AI development and in deploying models to production environments. In this post, we will discuss what we have learned from thousands of conversations with ML engineers, data operations specialists, business stakeholders, researchers, and software engineers, and how those lessons have culminated in the release of our newest open-source toolkit, Encord Active.

The Problem: Elon’s promise

There is a famous YouTube sequence of Elon Musk promising Tesla’s delivery of self-driving cars, every “next year”, since 2014.

Elon's promise

We are now at the start of 2023, and that promise still seems “one year” away. This demonstrates how the broader dream of production AI models with transformative effects on the real world (self-driving cars, robot chefs, etc.) has been slower to materialize than expected, given the early promise and floods of investment.

This is a multi-faceted and complex issue. Part of it lies in societal structures outside the tech industry (regulators, governments, industry, etc.) that have been slow to appreciate the second-order effects of adopting this technology. The more pernicious problem, however, lies within the technology itself. Promising proof-of-concept models that perform well on benchmark datasets in research environments have often struggled on contact with real-world applications. This is the infamous AI production gap. Where does this gap come from?

The AI Production Gap

One of the main issues is that the set of requirements asked of AI applications rises precipitously when in contact with a production environment: robustness, reliability, explainability, maintainability, and much more stringent performance thresholds. An AI barista making a coffee is impressive in a cherry-picked demo video, but frustrating when spilling your Frappuccino 5 times in a row.


As such, the gap between “that’s neat” and “that’s useful” is much larger and more formidable than ML engineers had anticipated. The production gap can be attributed to a few high-level sub-components. Among others:

Slow shift to data-centricity:

Working with many AI practitioners, we have noticed a significant disconnect between academia and industry. Academics often focus on model improvements, working with fixed benchmark datasets and labels. They optimize the parts of the system that they have the most control over. Unfortunately, in practical use cases, these interventions have less leverage over the success of the AI application than taking a data-centric view.

Insufficient attention has been paid to data-centric problems such as data selection, data quality improvement, and label error reduction. While not important from a model-centric view with fixed training and validation datasets, these elements are crucial for the success of production models.

Lack of decomposability:

A disadvantage of deep learning methods compared to traditional software is that they cannot be taken apart into pieces for examination. Normal (but highly complex) software systems have composable parts that can be examined and tested independently. Stress testing individual components of a system is a powerful strategy for fortifying the whole. Benefits include the interpretability of system behavior and the ability to quickly isolate and debug errant pieces. Deep neural networks, for all their benefits, are billion-parameter meshes of opacity; you take them as they are and have little luck inspecting and isolating pieces component-wise.

Insufficient evaluation criteria:

Exacerbating the lack of decomposability are the insufficient criteria we have to evaluate AI systems. Normal approaches just take global averages of a handful of metrics. Complex high-dimensional systems need sophisticated evaluation systems to meet the complexity of their intended domain. The tools to probe and measure performance are still nascent for models and almost completely non-existent for data and label quality, leaving a lack of visibility into the true quality of an AI system.

The above problems (and the lack of human-friendly tools to deal with them) have all contributed in their own way (again among others) to the AI production gap.

At Encord, we have been lucky to see how ML engineers across a multifaceted set of use cases have tackled these issues. The interesting observation was that they used very similar strategies even in very varied use cases. We have been helping these companies now for years, and based on that experience we have released Encord Active, a tool that is data-centric, decomposable, human-interaction focused, and improves evaluation.

How It Should Be Done

Before going into Encord Active, let’s go over how we’ve seen it done by the best AI companies. The gold standard for active learning is a stack built as a fully iterative pipeline, where every component is run with the aim of optimizing the performance of the downstream model: data selection, annotation, review, training, and validation are done with an integrated logic rather than as disconnected units.

Counterintuitively, the best systems also have the most human interaction. They fully embrace the human-in-the-loop nature of iterative model improvement by opening up entry points for human supervision within each sub-process while also maintaining optionality for completely automated flows when things are working. The best stacks are thus iterative, granular, inspectable, automatable, and coherent.

Last year, Andrej Karpathy presented Tesla’s Data Engine as their solution to bridge the gap, but where does that leave other start-ups and AI companies without the resources to build expensive in-house tooling?

Data Engine

Source: Tesla 2022 AI Day

Introducing Encord Active

Encountering the above problems and seeing the systems of more sophisticated players led us through a long winding path of creating various tools for our customers. We have decided to release them open source as Encord Active.


Loosely speaking, Encord Active is an active learning toolkit with visualizations, workflows, and, importantly, a library of what we call “quality metrics”. While not the only value-add of the toolkit, the quality metric library is one of our key contributions, so we will focus on it for the remainder of the post.

Quality Metrics

Quality metrics are additional parametrizations added onto your data, labels, and models; they are ways of indexing your data, labels, and models in semantically interesting and relevant ways. They come in three flavors, one for each of these: data, labels, and models.


It was also very important that Encord Active gives practical and actionable workflows for ML engineers, data scientists, and data operations people. We did not want to build merely an insight-generating mechanism; we wanted to build a tool that could act as the command center for closing the full loop on concrete problems practitioners were encountering in their day-to-day model development cycle.

The way it works

Encord Active (EA) is designed to compute, store, inspect, manipulate, and utilize quality metrics for a wide array of functionality. It hosts a library of these quality metrics and, importantly, allows you to customize it by writing your own “quality metric functions” to compute quality metrics (QMs) across your dataset.
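To make the idea of a “quality metric function” concrete, here is a minimal sketch in plain Python. It does not use the Encord Active API; the names `brightness`, `contrast`, and `compute_metrics`, and the nested-list image representation, are all hypothetical stand-ins chosen for illustration. The only point is that a quality metric can be thought of as a function from one data unit to a number, computed across the whole dataset to build an index.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: rows of grayscale pixels in 0..255.
Image = List[List[int]]


def brightness(image: Image) -> float:
    """Mean pixel intensity, scaled to [0, 1]."""
    pixels = [p for row in image for p in row]
    return mean(pixels) / 255.0


def contrast(image: Image) -> float:
    """Gap between the darkest and brightest pixel, scaled to [0, 1]."""
    pixels = [p for row in image for p in row]
    return (max(pixels) - min(pixels)) / 255.0


def compute_metrics(
    dataset: Dict[str, Image],
    metrics: Dict[str, Callable[[Image], float]],
) -> Dict[str, Dict[str, float]]:
    """Apply every metric function to every image, keyed by image id."""
    return {
        image_id: {name: fn(image) for name, fn in metrics.items()}
        for image_id, image in dataset.items()
    }


# Toy dataset (made-up values, for illustration only).
dataset = {
    "dark.png": [[0, 0], [0, 51]],
    "bright.png": [[255, 255], [204, 255]],
}
index = compute_metrics(dataset, {"brightness": brightness, "contrast": contrast})
```

Once such an index exists, the same values can be sorted, filtered, or plotted, which is the role the quality metrics play in the workflows described in this post.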

Upload data, labels, and/or model predictions and it will automatically compute quality metrics across the library. These metrics are then returned in visualizations with the additional ability to incorporate them into programmatic workflows. We have adopted a dual approach such that you can interact with the metrics via a UI with visualizations, but also set them up in scripts for automated processes in your AI stack.


With this approach, let’s return to the problems we listed earlier:

Slow shift to data-centricity:

EA is designed to help improve model performance along several different dimensions. The data-centric approaches it facilitates include, among others:

  • Selecting the right data to use for data labeling, model training, and validation
  • Reducing label error rates and label inconsistencies
  • Evaluating model performance with respect to different subsets within your data


EA is designed to be useful across the entire model development cycle. The quality metric approach covers everything from prioritizing data during data collection, to debugging your labels, to evaluating your models.

This is best demonstrated by the example use cases later in this post.

Decomposability:

Until we have better tools to inspect the inner workings of neural networks, EA treats the decomposability problem by shifting decomposability both up and down the AI stack. Rather than factoring a model itself, quality metrics allow you to very granularly decompose your data, labels, and model performance. This kind of factorization is critical for identifying potential problems and then properly debugging them.


Insufficient evaluation criteria:

As a corollary to the above, EA allows for arbitrarily many and arbitrarily complex quality metrics to evaluate the performance of your model. Importantly, it breaks down the model performance as a function of the quality metrics automatically, guiding users to the metrics that are likely to be most impactful for model improvement.
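As a rough illustration of breaking performance down as a function of a quality metric, the sketch below buckets samples by a metric value and reports accuracy per bucket. This is an assumption about what such a breakdown could look like, not Encord Active's implementation; `accuracy_by_bucket` and the toy `records` (imagined as a brightness value paired with whether the prediction was correct) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_bucket(
    records: List[Tuple[float, bool]], n_buckets: int = 4
) -> Dict[int, float]:
    """Group (metric value in [0, 1], prediction correct?) pairs into
    equal-width buckets and return the accuracy inside each bucket."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for value, correct in records:
        # Clamp so that a value of exactly 1.0 lands in the last bucket.
        bucket = min(int(value * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Toy data (made up): low-metric samples fail far more often.
records = [
    (0.05, False), (0.10, False), (0.20, True),
    (0.55, True), (0.60, True), (0.70, True),
    (0.80, True), (0.90, True),
]

global_accuracy = sum(correct for _, correct in records) / len(records)
per_bucket = accuracy_by_bucket(records)
```

On this toy data the global accuracy looks respectable, while the lowest bucket performs far worse than the rest. That is the kind of subset weakness a single global average hides.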


Until we have both AGI AND the AI alignment problem solved, it remains critically important to keep humans in the loop for monitoring, improvement, development, and maintenance of AI systems. EA is designed with this in mind. The UI allows for quick visual inspection and tagging of data, while the programmatic interface allows for systematization and automation of workflows discovered by ML engineers and data operations people.

Choose Your Own Adventure: Example Use Cases

Data selection:

With EA you can run your previous model over a new dataset and set the inverse confidence score of the model as a quality metric. Sample the data weighted by this quality metric for entry into an annotation queue. Furthermore, you can use the pre-computed quality metrics to identify subsets of outliers to exclude before training or subsets to oversample. You can see a tutorial on data selection using the TACO dataset here.
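The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Encord Active code: `select_for_annotation` is a hypothetical helper, “inverse confidence” is taken to mean `1 - confidence`, and the confidence values are invented.

```python
import random
from typing import Dict, List


def select_for_annotation(
    confidences: Dict[str, float], k: int, seed: int = 0
) -> List[str]:
    """Draw up to k distinct items, each draw weighted by 1 - confidence,
    so that uncertain predictions are more likely to be queued."""
    rng = random.Random(seed)
    remaining = dict(confidences)
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        ids = list(remaining)
        # Small epsilon keeps every weight strictly positive.
        weights = [1.0 - remaining[i] + 1e-9 for i in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Hypothetical model confidences on unlabeled images.
confidences = {"a.jpg": 0.98, "b.jpg": 0.40, "c.jpg": 0.10, "d.jpg": 0.95}
annotation_queue = select_for_annotation(confidences, k=2)
```

Weighted sampling, rather than simply taking the k least confident items, keeps some confident samples in the queue, which can help catch cases where the model is confidently wrong.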

Label error improvement:

You can use the “annotation quality” metric, which calculates the deviation of the class of a label from its nearest neighbors in an embedding space, to identify which labels potentially contain errors. This additionally breaks down label error with respect to who annotated it, to help find annotators who need more training. If you upload your model predictions, you can find high-confidence false positive predictions to identify label errors or missing labels.
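One plausible reading of a nearest-neighbor label check is sketched below: for each label, measure what fraction of its k nearest neighbors in embedding space share its class. This is an assumption made for illustration and not necessarily how the “annotation quality” metric is computed; `neighbor_agreement` and the 2-D toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def neighbor_agreement(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 3,
) -> List[float]:
    """For each item, return the fraction of its k nearest neighbors
    (Euclidean distance) that carry the same label. Low scores flag
    labels that disagree with their neighborhood."""
    if len(labels) < 2:
        raise ValueError("need at least two labeled items")
    scores: List[float] = []
    for i, (vec, label) in enumerate(zip(embeddings, labels)):
        others = [
            (math.dist(vec, embeddings[j]), labels[j])
            for j in range(len(labels))
            if j != i
        ]
        nearest = sorted(others, key=lambda pair: pair[0])[:k]
        agree = sum(1 for _, other_label in nearest if other_label == label)
        scores.append(agree / len(nearest))
    return scores


# Toy 2-D embeddings: two tight clusters, with one "dog" label sitting
# inside the "cat" cluster (index 3).
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]

scores = neighbor_agreement(embeddings, labels)
most_suspicious = min(range(len(scores)), key=scores.__getitem__)
```

Sorting labels by this score gives a review queue with the most suspicious labels first. The brute-force neighbor search is quadratic in dataset size, so a real pipeline would likely use an approximate nearest-neighbor index instead.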

Model performance analysis:

EA automatically breaks down your global model performance metrics and correlates them to each quality metric. This surfaces which quality metrics are important drivers of your model performance, and which potential subsets of the data your model is likely to perform worst in going forward.
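A simple way to picture “correlating performance to each quality metric” is to compute the correlation between each metric and per-sample correctness, then rank the metrics by the strength of that correlation. The sketch below does this with a hand-written Pearson correlation. It is an illustrative assumption, not a description of how EA computes its rankings; `pearson`, `rank_metrics`, and the metric names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_metrics(
    metric_values: Dict[str, List[float]], correct: List[int]
) -> List[Tuple[str, float]]:
    """Rank metrics by absolute correlation with per-sample correctness."""
    correlations = {
        name: pearson(values, correct) for name, values in metric_values.items()
    }
    return sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy per-sample values (made up): 1 means the prediction was correct.
metric_values = {
    "brightness": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
    "object_area": [0.5, 0.1, 0.9, 0.4, 0.6, 0.2],
}
correct = [0, 0, 0, 1, 1, 1]

ranking = rank_metrics(metric_values, correct)
```

In this toy data, `brightness` ranks first because errors concentrate in dark images, which would point a practitioner toward collecting or relabeling more low-brightness samples. A linear correlation is only a first pass; it will miss metrics whose effect is non-monotonic.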


Why Open Source

An observation we made while working with late-stage AI companies prompted us to release Encord Active as open source: many of the metrics companies use are common across teams, even for completely different vertical applications.

One of the strategies of a startup is to reduce the amount of redundant work that is being done in the world. Before Git’s common adoption, many software companies were developing their own internal version control tooling. This amounted to tens of thousands of hours of developer time that could have been allocated to more productive activity. We believe the same is happening now with quality metrics for ML engineers.

Open sourcing Encord Active will spare people the pain of writing redundant notebook code and one-off scripts that many others are also developing. It will free up time for ML engineers to focus on improving their data and models in more interesting ways.

Encord Active is a new open source tool, so please be patient with us. We have some of the leading AI companies in the world using Encord Active, but it is still very much a work in progress. We want to make it the best tool it can be, and we want it out in the world so that it can help as many AI companies as possible move forward.

If it works, we can in a small way contribute to one of the hardest things in the world: making an Elon Musk promise come true. Because AI delayed is not AI denied.

Want to test your own data & models?

“I want to get started right away”: You can find Encord Active on GitHub.

“Can you show me an example first?”: Check out this Colab Notebook.

“I am new, and want a step-by-step guide”: Try out the getting started tutorial.