Contents
3 Key Stats:
The Challenge: Foundation Models Can't Navigate the Real Web
How Encord Built a Scalable Data Pipeline for Yutori
The Result: Yutori Navigator Sets a New Standard
What This Means for Agentic AI


How Yutori Built the Best Web Agent in the World with High-Quality Human Data
3 Key Stats:
- Thousands of weekly trajectory evaluations with 20+ evolving error categories
- Systematic head-to-head model comparisons across hundreds of trajectory pairs weekly to quantify performance gains
- Yutori Navigator outperforms Gemini 2.5 and Claude 4.5 by 10-20 percentage points on real-world web tasks
Yutori is reimagining how people interact with the web. Founded in 2024 by former Meta AI leaders Devi Parikh, Dhruv Batra and Abhishek Das, the San Francisco startup builds autonomous web agents that reliably execute complex tasks – from booking reservations to researching products and making purchases.
In November 2025, Yutori launched Navigator, a web agent that operates a real browser to autonomously complete web tasks. Navigator outperforms Gemini 2.5, Claude 4.0 and 4.5, and OpenAI's Operator by 10-20 percentage points in accuracy while being 2-3x faster.
The Challenge: Foundation Models Can't Navigate the Real Web
Foundation models struggle with dynamic interfaces, multi-step workflows, ambiguous user intent, and the need to reason about both what to do and how to do it.
"Models aren't ready yet to deliver this out of the box," Devi explained. "In sequential decision making problems, errors compound at each step. We need to push on modeling capabilities to complete complex workflows on the web with high reliability. To this end, it is critical for data that models are trained on to be high quality. It's the starting point for where model capabilities will come from."
Yutori faced critical data challenges. Supervised fine-tuning requires high quality human-annotated trajectories. Every action an annotator takes on the web while completing a specified task – including any exploration and backtracking – needs to be accurately recorded while keeping the annotation interface simple. Trajectory evaluation can’t be fully automated yet across the entire diversity of tasks. Incorrect actions or subgoals along the way may still lead to successful outcomes. That needs to be appropriately accounted for. While quality is most important, throughput of data annotation also needs to be reasonable to make fast progress.
How Encord Built a Scalable Data Pipeline for Yutori
We partnered with Yutori to develop data infrastructure spanning the agent development lifecycle.
High-Quality Training Data
We created a tailored pipeline for supervised fine-tuning:
- Human-annotated trajectories
- Iterative quality control with strict standards enforced by both teams
Large-Scale Evaluation
As Yutori's customer base grew, evaluating agent performance at scale became critical. We worked together to build a comprehensive evaluation framework:
- Custom error taxonomy: We iteratively developed 20+ error types capturing action prediction, model reasoning and infra errors. Error categories evolved weekly as Yutori identified new model behaviors. We trained our team to adapt quickly while maintaining quality.
- Massive scale: We conducted thousands of trajectory evaluations per week, evolving from task outcome evaluation to detailed error categorization of both the process and outcome of the agentic trace.
- Model evaluations: Beyond single trajectory and Online Mind2Web evaluations, we conduct systematic head-to-head comparisons for hundreds of trajectory pairs weekly where different models attempt identical prompts. Our framework has captured signals across multiple performance dimensions - identifying improvement areas and quantifying Navigator's gains over Claude 4.5 and Gemini 2.5.
- Model-in-the-loop: We balanced human expertise with AI efficiency. Over time, human annotators provided the correct action labels only when models made mistakes.
The Result: Yutori Navigator Sets a New Standard
The partnership enabled Yutori to launch Navigator as the highest-performing web agent available:
- 10-20 percentage points higher in accuracy than Gemini 2.5, Claude 4.0 and 4.5, and OpenAI's Operator
- 2-3x faster task completion with lower inference costs
- Uniformly preferred by users in head-to-head evaluations

Yutori agents completing autonomous tasks (Videos courtesy of Yutori)
You can try Yutori’s product Scouts or build on their state-of-the-art web agentic tech via their API.
What This Means for Agentic AI
As agentic AI expands into web browsing, coding assistants, and customer service, evaluating performance in production becomes critical. Our work with Yutori demonstrates what frontier AI teams need:
- Flexible data infrastructure that adapts to key experimental needs as model capabilities evolve
- Custom workflows for each development phase
- Quality and quantity of training data
- Human-model collaboration balancing automation with nuanced understanding
- Structured evaluation frameworks capturing fine-grained signals
If you're building agentic AI that needs to work reliably in the real world, your data pipeline quality will determine whether your agents deliver on their promise.
Think Encord could be a good fit for your team?
Explore product
Explore our products
Yes. In addition to being able to train models & run inference using our platform, you can either import model predictions via our APIs & Python SDK, integrate your model in the Encord annotation interface if it is deployed via API, or upload your own model weights.
At Encord, we take our security commitments very seriously. When working with us and using our services, you can ensure your and your customer's data is safe and secure. You always own labels, data & models, and Encord never shares any of your data with any third party. Encord is hosted securely on the Google Cloud Platform (GCP). Encord native integrations with private cloud buckets, ensuring that data never has to leave your own storage facility.
Any data passing through the Encord platform is encrypted both in-transit using TLS and at rest.
Encord is HIPAA&GDPR compliant, and maintains SOC2 Type II certification. Learn more about data security at Encord here.Yes. If you believe you’ve discovered a bug in Encord’s security, please get in touch at security@encord.com. Our security team promptly investigates all reported issues. Learn more about data security at Encord here.
Yes - we offer managed on-demand premium labeling-as-a-service designed to meet your specific business objectives and offer our expert support to help you meet your goals. Our active learning platform and suite of tools are designed to automate the annotation process and maximise the ROI of each human input. The purpose of our software is to help you label less data.
The best way to spend less on labeling is using purpose-built annotation software, automation features, and active learning techniques. Encord's platform provides several automation techniques, including model-assisted labeling & auto-segmentation. High-complexity use cases have seen 60-80% reduction in labeling costs.
Encord offers three different support plans: standard, premium, and enterprise support. Note that custom service agreements and uptime SLAs require an enterprise support plan. Learn more about our support plans here.
Take control of your ML pipeline with Encord
Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord's Data Development platform accelerates every step of taking your model into production.
