Will World Models Eat Physical AI?: What We Learned from Our Physical AI Panel

Co-Founder & CEO at Encord
Physical AI is moving fast. Faster than most roadmaps anticipated. And if you want to understand where it is going, you need to speak to the people actually shipping robots into the real world.
That's exactly why we brought together Jason Ma (co-founder, Dyna Robotics) and Chris Paxton (AI Research, Agility Robotics) for a panel on world models and the future of Physical AI.
Here are the key takeaways from that conversation:
Setting the Scene: Why World Models, Why Now
Last year felt like the year of VLAs, or Vision-Language-Action models. Everyone was attaching action heads to foundation models and seeing what stuck. That work hasn't stopped. But the conversation has shifted meaningfully toward world models, and the panel opened by establishing why.
The short answer from Jason: prediction paradigm shift.
The core difference between a VLA and a world model comes down to what the model is actually trying to predict. A VLA predicts what the robot does. A world model predicts what happens when the robot does it. That sounds like a subtle distinction, but it changes everything about what data you can use.
"Once you change the prediction paradigm, like predicting the future, you're no longer actually constrained by just robotics data," Jason explained. "You can actually use all the other data that's on the internet, video, egocentric data or not, to help the model learn some aspects of the world."
Chris's definition added a useful framing: a world model is anything that predicts how the world will evolve, ideally conditioned on the robot's own actions. That broad definition matters because it covers a lot of different architectures, from explicit 3D world models to latent space predictions, all of which share that core mathematical property.
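The distinction sounds small on paper, but it changes what supervision you can use. A minimal sketch of the two objectives (all names and the toy models here are our own illustration, not any panelist's system):

```python
# Contrasting the two prediction objectives. The key point: the world-model
# loss only needs a "before" and "after" observation, so ordinary video
# (no action labels) can still supply training signal for parts of it.

def vla_loss(policy, obs, expert_action):
    """VLA objective: predict what the robot should do (needs action labels)."""
    predicted = policy(obs)
    return sum((p - e) ** 2 for p, e in zip(predicted, expert_action))

def world_model_loss(model, obs, action, next_obs):
    """World-model objective: predict what happens when an action is taken."""
    predicted = model(obs, action)
    return sum((p - n) ** 2 for p, n in zip(predicted, next_obs))

# Toy linear stand-ins, just to make the losses computable.
policy = lambda obs: [0.5 * o for o in obs]
dynamics = lambda obs, act: [o + a for o, a in zip(obs, act)]

v = vla_loss(policy, [1.0, 2.0], expert_action=[0.5, 1.0])
w = world_model_loss(dynamics, [1.0], action=[0.2], next_obs=[1.2])
print(v, w)  # both 0.0 for these toy inputs
```

Same robot, same data stream; the only thing that changed is what sits on the right-hand side of the loss.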
Eric added that recent video generation models like Cosmos and Genie 3 have become genuinely useful here. The quality of AI-generated video is now good enough to serve as training data and inspiration for broader use cases, which creates a cycle where better generation and better world models reinforce each other.
The Data Story: It's Not Just Egocentric Video
One of the most interesting threads of the conversation was what kinds of data matter for world models, and the answer is more nuanced than "more data is better."
The egocentric data boom is real. Lots of companies are now collecting first-person video at scale, and world models are uniquely positioned to extract value from it compared to VLAs.
But Chris pushed back on the idea that egocentric is the only angle that matters:
"Before all the egocentric data took off, there's a lot of video data that's from third person, right? Third person video provides you a lot of information as well."
Simulation data also got a meaningful shoutout. Jason made the case that the sim-to-real gap is primarily a physics gap and that with world models, you can get a lot of value out of simulation-rendered video even if the physics isn't perfect. Better rendering engines shrink that gap further.
But the underrated one? Failure data. This came up multiple times, and it's genuinely a differentiator for world models over VLAs.
"I think one thing I think is really nice is being able to use data that didn't work as well," Chris said. "Like that you would have to clean up and throw out for teleop data."
Jason expanded:
"With world model, it's kind of just as good as other data, teleop data, right? Because they have actions, it teaches you some aspects of the world."
This matters a lot at the early flywheel stage, when you have more failures than successes and traditional training pipelines would throw most of it out.
The Robotics Data Flywheel
The robotics data flywheel is a well-understood concept: deploy robots, collect data, train better models, deploy better robots, repeat. In practice, it's much harder than that sentence makes it sound.
The problem Chris identified: "You've got to start turning a lot more of that flywheel yourself." Unlike LLMs, which bootstrapped off decades of existing internet text, robotics data doesn't exist yet in the required form. You can't scrape the web for robot manipulation demonstrations.
Jason articulated the chicken-and-egg problem directly:
"The robot itself without any autonomy or being good is already something that provides a lot of utility to humans."
But in most robotics contexts, you need some baseline performance to deploy, which requires data you don't have yet.
World models help break this loop in two ways. First, they dramatically improve sample efficiency on robot-specific data, so you need less of the expensive stuff. Second, they absorb the failure data that a pure VLA pipeline wastes, which means your early deployments, even when they go wrong, are still contributing to the flywheel.
Eric added a dimension that often gets overlooked: the angular frequency of the flywheel matters as much as the flywheel itself.
"The single most important factor that determines the success of whether a whole AI project or initiative is going to work is the speed of iteration of that flywheel; how fast is that flywheel actually turning."
He also noted a paradigm shift in how the flywheel is structured. The old active learning loop was: annotate a pool of data → train a model → model tells you what to annotate next. The new loop is: run policy evaluations → model tells you what new data to collect. The emphasis has shifted from enriching an existing dataset to deciding what to go capture in the real world. That structural change is driving a lot of the activity in physical AI data infrastructure right now.
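As a rough sketch of that new loop (the function names and the toy "model" below are entirely hypothetical placeholders), the evaluation step, not the annotation step, now decides what data to gather:

```python
# Eval-driven data loop: run the policy, find where it's weak, go collect
# real-world episodes for exactly those gaps. Toy implementation where the
# "model" is just the set of scenario tags it has seen data for.

def train(dataset):
    return set(tag for tag, _ in dataset)          # toy "model"

def evaluate_policy(model, scenarios):
    return [s for s in scenarios if s not in model]  # surface failure modes

def collect_episodes(scenarios):
    return [(s, "episode") for s in scenarios]       # go capture new data

def eval_driven_loop(scenarios, iterations):
    dataset = []
    for _ in range(iterations):
        model = train(dataset)
        gaps = evaluate_policy(model, scenarios)
        if not gaps:
            break
        dataset += collect_episodes(gaps[:1])  # prioritize one gap per cycle
    return train(dataset)

covered = eval_driven_loop(["dim light", "cluttered bin", "new gripper"], 10)
print(sorted(covered))
```

The contrast with the old loop is where the selection pressure comes from: the pool of existing data versus the behavior of the deployed policy.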
Where Do World Models Actually Fit in the Stack?
Chris walked through two different ways world models show up in the stack:
1. As policies (World Action Models). At this end of the spectrum, a world action model is essentially doing the same job as a VLA, just trained with a richer prediction objective that lets it absorb more kinds of data. In this framing, it's a drop-in improvement rather than a replacement.
2. As infrastructure. World models also show up as evaluation tools, data collection guides, and synthetic environment generators. This is where their value goes well beyond what any VLA can do.
Jason made the case for the latter being underappreciated: "World action model is that you can also use it for data generation. You can use it to evaluate policies, right? Because world models are action conditioned, you can run a policy inside a world model to help you do eval."
Think about what that means for deployment cycles. Right now, to get a meaningful signal on how well your model performs, you have to run robots in the real world. That's slow and expensive. If you can run your policy inside a world model and get a reasonable proxy evaluation, you can close the loop much faster.
"In the limit, if we can just bootstrap everything in simulation, that world model imagined simulation. I think that can be a huge boost," Jason said.
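Here is one way that evaluation-in-imagination could look in code. This is a toy 1-D sketch of our own, assuming only that an action-conditioned world model returns a predicted next state plus some reward proxy:

```python
# Rolling a policy forward inside a learned world model instead of on a
# real robot. Everything here (the dynamics, the reward) is illustrative.

def rollout_in_world_model(world_model, policy, init_state, horizon):
    """Score a policy by letting the world model imagine the consequences."""
    state, total_reward = init_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = world_model(state, action)  # predicted next state
        total_reward += reward
    return total_reward

# Toy 1-D task: the policy tries to drive the state toward zero.
policy = lambda s: -0.5 * s
world_model = lambda s, a: (s + a, -abs(s + a))  # reward = closeness to 0

score = rollout_in_world_model(world_model, policy, init_state=4.0, horizon=5)
print(score)  # -3.875
```

Swap the lambda for a learned video or latent dynamics model and the same loop becomes a cheap, parallelizable proxy for a fleet of evaluation robots.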
VLAs don't go away in this picture. Something still has to make decisions at inference time. But the world model becomes the scaffolding around everything else: the training signal, the evaluation environment, the data generation engine.
The Generalist vs Specialist Debate
The final substantive thread was one the physical AI community keeps circling back to: do you train a generalist model or a specialist?
Chris came down on the side of generalist for anything that has to operate in open-ended environments: "As soon as more things could go wrong and you can't perfectly control everything, it seems like, to get to those five nines of reliability, you need the more general model."
Jason's take was more nuanced: "I think general model provides a base model that can be adapted to different tasks, but I think you still get the best performance by some sort of adaptation."
His analogy: the best people doing their jobs are specialists, but they started as generalists. Pre-training on diverse data + post-training specialization is already the established playbook in LLMs, and there's no reason to expect it won't apply in physical AI too.
The framing that felt most useful: think of a generalist model as a high-quality initializer. Fine-tuning from a good general base beats training from scratch for specific tasks, even if the final performance target is specialist-level.
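A toy illustration of the initializer argument (the numbers are made up; both runs take identical gradient steps on the same quadratic loss, differing only in where they start):

```python
# "Generalist as initializer": fine-tuning from a pretrained base that is
# already near the task optimum converges in far fewer steps than training
# from scratch, even though the final target is the same.

def steps_to_converge(w, target, lr=0.1, tol=0.01):
    steps = 0
    while abs(w - target) > tol:
        w -= lr * (w - target)  # gradient step on 0.5 * (w - target)^2
        steps += 1
    return steps

task_optimum = 1.0
from_scratch = steps_to_converge(w=10.0, target=task_optimum)    # far-off init
from_generalist = steps_to_converge(w=1.5, target=task_optimum)  # pretrained

print(from_scratch, from_generalist)
```

Same destination, much shorter trip; in robotics terms, the expensive task-specific teleop budget shrinks when the base model already knows most of the world.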
The Hard Part: Getting to Deployment
The conversation wasn't just about research directions. It spent real time on the messy realities of deployment.
Jason was direct about what he sees as underappreciated: the gap between a long demo and a reliable deployed system. "Doing a demo of 24 hours once versus deployment settings where you're expected to do that every day of the week for a month, there's actually still a huge gap."
The problems are systems problems, not just model problems. Observability. Performance monitoring at runtime. Handling stochasticity in model outputs. Being able to triage when a model's behavior degrades in a new environment. These aren't the things that make it into the research papers.
Chris flagged reliability and ease of deployment as the other side: "The robot's gotta work. It's gotta be easy to deploy and gotta be easy to teach it new skills."
What This Means for Teams Building in Physical AI
A few things crystallized for us across this conversation that are directly relevant if you're building in this space:
Data composition matters more than data volume alone. The recipe is shifting to less emphasis on perfectly curated teleop demonstrations, and more emphasis on diverse, high-volume data including failures, third-person video, and egocentric data from many sources.
World models are infrastructure, not just models. The teams that will win at scale are those treating world models as the scaffolding around their whole data and evaluation loop, not just as a better policy architecture.
The flywheel speed is the real variable. Eric's point about angular frequency is one we believe deeply at Encord. The companies that close the loop fastest, from deployment to data to model improvement, will compound their lead faster than anyone who's just focused on a single state-of-the-art benchmark.
Deployment is a systems problem, not just a model problem. If you're thinking about getting to reliable production, start thinking about observability, monitoring, and continuous evaluation infrastructure now, not after the model is "good enough."
And if you're working on any of the problems we discussed today (data collection, curation, and annotation at scale for robotics or autonomous systems), this is exactly what Encord is built for.
