Data Quality Is the Next Frontier in AI

The end of an era: scaling laws
In a few months' time OpenAI will celebrate its tenth birthday as one of the world's most influential and valuable private companies. But in January 2020 it was far more obscure: a research lab familiar to few outside the tech industry. That month, a team of OpenAI researchers published a paper on “Scaling Laws”, identifying a precise mathematical relationship: bigger models, trained on more data with more compute, reliably perform better.
These scaling laws came to define the recent era of AI and held true for longer than almost anyone expected. You could keep making models bigger and run ever-longer training runs on more data, and the formula would reliably predict how well your model performed. Since then, OpenAI, Anthropic, and others have spent billions on gigantic datacenters, on AI researchers, and on finding and digitizing as much text as possible to feed their models.
Can this era of hyper-growth go on forever? Many skeptics say no. They argue that while big tech can buy more GPUs and hire more PhDs, there is only so much data out there. Of course, more text is written and more video is recorded every year, but not at the exponential rate that would be required to keep up with the trends of the last few years. So will AI progress stall when we run out of text to train on?
A new frontier
Five years after the publication of the Scaling Laws paper, a little-known Chinese research lab called DeepSeek launched a model of its own. Its performance was roughly on par with models already available from American labs like OpenAI and Anthropic. What shocked the world was that it had reportedly been trained for just $6m, roughly 100x less than comparable models, and with far less data and compute.
The news wiped a trillion dollars off the US stock market. Nvidia lost more market value in a single day than any company ever had.
How could this happen? The model in question, DeepSeek R1, was part of a new generation of reasoning models, trained not simply on huge amounts of text but on deliberately constructed chain-of-thought data. There is now a new frontier: much smaller but expertly curated datasets. Quality over quantity.
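To make "deliberately constructed" concrete, here is a minimal illustration of the shape a chain-of-thought training record can take: a prompt, an expert-written reasoning trace, and a verified answer. The field names are assumptions for illustration, not DeepSeek's actual schema.

```python
# Illustrative shape of a curated chain-of-thought training record.
# Field names are assumptions, not DeepSeek's actual schema.
cot_example = {
    "prompt": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "reasoning": (
        "Average speed is distance divided by time. "
        "120 km / 1.5 h = 80 km/h."
    ),
    "answer": "80 km/h",
    "verified": True,  # expert-reviewed before entering the training set
}
```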
A shift in the market
Nvidia's stock price has since recovered. But the priorities of the frontier AI companies Encord works with have permanently changed. Conversations that used to be about access to GPUs are now about data quality metrics, annotation precision, and access to expert reviewers.
These teams are taking inspiration from how humans learn complex tasks. To drive a car safely, we don't need billions of examples of “normal” driving. Instead we are carefully taught how to handle tricky scenarios - merging onto a motorway, braking in heavy rain, anticipating hazards. AI engineers are applying the same principle: carefully annotating edge cases and failure modes teaches a model far more than millions of routine examples would.
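As a minimal sketch of that principle (the `scenario` tag and the cap are illustrative assumptions, not a production pipeline), curation can be as simple as keeping every annotated edge case while downsampling the routine bulk:

```python
import random

ROUTINE_CAP = 1_000  # illustrative cap on "normal" examples

def curate(examples: list[dict]) -> list[dict]:
    """Keep every annotated edge case; downsample routine examples to a cap."""
    edge_cases = [e for e in examples if e["scenario"] != "routine"]
    routine = [e for e in examples if e["scenario"] == "routine"]
    random.shuffle(routine)  # unbiased downsampling of the routine bulk
    return edge_cases + routine[:ROUTINE_CAP]
```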
The evidence is in how they spend their time and their money. Companies that used to allocate almost all of their AI budget to compute are now shifting some of that spend to data curation and quality control. Their data engineers worry about consensus mechanisms and quality control instead of scraping data from the web.
The pattern is unmistakable. Whether it's automotive companies working on autonomous vehicles, or healthcare companies building diagnostic models, the same realization is hitting everyone: data quality is paramount.
What this actually means for your AI team
This shift towards quality will change everything about how you operate:
Expert human feedback replaces simple annotation: Creating your dataset used to involve basic classification or drawing boxes around items in an image. Now it is a core part of your R&D. Building the right ontologies and rubrics defines what your model will learn, so you need a broader concept of human feedback. For example, robots no longer need much human feedback to interpret the objects in their field of view, but they do still need it to decide which action to take in a complex, real-world environment.
You can finally measure what matters: Instead of just measuring dataset size, you'll track annotation consistency, edge case coverage, and performance on specific failure modes (a sketch of one such metric follows this list).
Your experts become your advantage: The radiologist you hired and the mathematician on your team are now giving your AI their unique expertise, one careful example at a time. Your data annotators become partners in model development, not just an outsourced workforce.
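What does tracking those metrics look like in practice? Here is a minimal sketch of one of them, annotation consistency, measured as inter-annotator agreement with Cohen's kappa (a chance-corrected agreement score). The labels and the 0.8 review threshold are illustrative assumptions, not Encord tooling.

```python
# A minimal sketch of one quality metric: inter-annotator agreement.
# Labels and the 0.8 threshold are illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["car", "pedestrian", "car", "cyclist", "car"]
annotator_2 = ["car", "pedestrian", "car", "car", "car"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.8:  # a common rule of thumb for "strong" agreement
    print("Low agreement: route these items for expert review.")
```

In this toy batch the annotators disagree on one item in five, and kappa lands around 0.58 - low enough to hold the batch for review rather than let it into the training set.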
What you need for a "data quality" focus
Making the shift to quality isn't just a change of mindset. It requires a fundamental overhaul of infrastructure and process:
Enable precise annotation and feedback
Your annotation tools need to be precision instruments. If your annotators are struggling with a clunky interface, then you're injecting errors at source. Every imprecision compounds, and just a few errors can poison even a huge dataset.
Build automated QA mechanisms
- Consensus systems: Multiple annotators need to agree before data moves forward (see the sketch after this list)
- Smart routing: Agents that send the right data to the right expert annotator
- Analytics: Deep visibility into annotation patterns and quality metrics
- Auditing and logs: Complete traceability of every annotation decision
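As a minimal sketch of the first item, assuming three independent annotators per item: a label only moves forward when enough annotators agree, and everything else is routed onward. The names and the agreement threshold here are hypothetical, not Encord's implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ConsensusResult:
    label: str | None  # agreed label, or None if no consensus
    accepted: bool     # True only when enough annotators agree

def consensus_gate(labels: list[str], min_agreement: int = 2) -> ConsensusResult:
    """Accept an item only when at least `min_agreement` annotators agree."""
    top_label, count = Counter(labels).most_common(1)[0]
    if count >= min_agreement:
        return ConsensusResult(label=top_label, accepted=True)
    return ConsensusResult(label=None, accepted=False)  # route to expert review

print(consensus_gate(["cyclist", "cyclist", "pedestrian"]))
# ConsensusResult(label='cyclist', accepted=True)
```

Items that fail the gate are exactly the ones your smart routing should send to the most experienced expert annotators.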
The spending conversation you need to have
If you're a Head of ML or VP of AI or Engineering, here's the argument for your next budget meeting:
Some of the best AI companies now spend more on data than on compute.
Your CEO is worried about GPU access and hiring AI talent. But you're feeding those expensive resources suboptimal data and wondering why you're not getting results.
The bottom line
The quantity era of AI is ending, but the quality era has just begun. The companies that understand this are already acting on it and advancing fast. While everyone else is still trying to scrape together more data, these leaders are crafting precise, high-quality datasets that outperform billion-example corpora.
Ironically, this transition makes AI more, not less, accessible. You don't need exclusive access to massive datasets or the most powerful GPUs. You need thoughtfulness, expertise, and the right approach to data quality.
The skeptics are right that the theoretical scaling laws are hitting practical limits. They're wrong about what happens next. We're not approaching the end of AI progress - we're approaching the beginning of AI craftsmanship.
The question isn't whether the shift to quality will happen; it's already happening. The only question is whether you'll be part of it, or whether you'll be left behind, still throwing quantity at problems while your competitors have moved on to the next frontier.