What Does It Actually Take to Trust a Robot in Production? Warehouse Robotics Roundtable Recap

Co-Founder & CEO at Encord
There's a gap between a robot doing something impressive in a demo and a robot doing something reliably in a live warehouse, especially with edge cases, like when the inventory has completely changed and someone's walked a torch past its camera. Closing that gap is what the entire warehouse robotics industry is quietly obsessed with, and it's exactly why Encord brought three engineering leaders together.
Near, Head of Engineering at Encord, sat down with Anshul from Addverb Robotics, where he's spent a decade building autonomous mobile robots and full warehouse automation systems; and Wojtek Jivurski, Senior Engineering Manager at Nomagic, who's been equipping robotic arms with proprietary AI to automate e-commerce fulfilment processes for the better part of nine years. What followed was an honest, technically grounded conversation about the real state of warehouse robotics.
Here's what they covered.
The Most Under-Hyped Thing Happening in Robotics Right Now
Wojtek's take: it's the data. Specifically, the proprietary operational data that companies like Nomagic have quietly been amassing for years. They're picking millions of diverse SKUs every month across multiple warehouses and conditions. That data, what he called their "library of chaos", is now emerging as one of the most valuable assets in the industry as teams try to train foundational robotic models.
The parallel to LLMs is exact: internet-scale data won the language model race. Production-scale manipulation data could win the physical AI race. And the companies who've been running robots in warehouses for years already have it.
Anshul's angle was different: it's the system of systems problem. Building a robot that works is, as he put it, "more or less solved in a lot of places." What isn't solved is the full-stack orchestration problem: fleet management, warehouse execution systems, integration with inventory and WMS layers, ROI modelling before deployment, change management after deployment.
He describes pre-deployment material flow simulation as dramatically underappreciated: the ability to model what a full automation installation will actually deliver in throughput before a single robot is bolted down. That's where the real business value gets unlocked or lost.
How Do You Actually Know When to Trust the Model?
This is the question at the heart of every robotics deployment, and both guests gave surprisingly candid answers about how imperfect the process still is.
Anshul's framing: assume the model isn't good enough, even when the numbers look good. Validation set performance and test set accuracy don't tell you what happens when a warehouse operator walks past the sensor with a torch, or when the lighting configuration in a new facility is subtly different from training data. His team's response to this is layered:
- First, extensive data logging: capturing edge cases in production, pushing them back to the model server, retraining continuously.
- Second, heavy use of photorealistic synthetic data with deliberately perturbed sensor simulations: reflections, different camera mounting angles, varied environmental conditions.
- Third, ensemble model strategies: running a large fallback model alongside the primary model so the system can context-switch under uncertainty rather than fail catastrophically.
- Fourth, smart edge deployment optimisation: TensorRT for Nvidia platforms, OpenVINO for Intel, because off-the-shelf models can't go straight to the robot without being compressed and optimised, and every compression step trades off accuracy.
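The ensemble strategy above, context-switching to a fallback model under uncertainty rather than failing hard, can be sketched in a few lines. This is an illustrative Python sketch, not Addverb's actual implementation; the `Prediction` interface, model stubs, and confidence threshold are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float

def fast_model(frame) -> Prediction:
    # Stand-in for a compressed, edge-optimised primary model
    # (e.g. a TensorRT- or OpenVINO-compiled network).
    return Prediction(label="pallet", confidence=0.62)

def fallback_model(frame) -> Prediction:
    # Stand-in for a larger, slower model run only under uncertainty.
    return Prediction(label="pallet", confidence=0.91)

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for "trust the fast model"

def classify(frame) -> Prediction:
    """Context-switch to the fallback model instead of failing outright."""
    primary = fast_model(frame)
    if primary.confidence >= CONFIDENCE_THRESHOLD:
        return primary
    return fallback_model(frame)
```

The design choice this illustrates: the expensive model only runs on the fraction of frames where the compressed model is unsure, so the latency cost of the ensemble stays bounded in the common case.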
And the data management challenge that comes with all of this at scale is genuinely hard. Hundreds of robots deployed across a site, each logging LiDAR and camera data, pushing to the cloud; the Wi-Fi infrastructure alone can become the bottleneck. Designing the end-to-end pipeline to handle this without choking the network is an engineering problem in its own right.
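A back-of-envelope calculation shows why the network so easily becomes the bottleneck. All rates here are illustrative assumptions, not figures from the discussion:

```python
# Rough check on whether site Wi-Fi can absorb fleet-wide logging.
# Every number below is an assumed, illustrative value.

ROBOTS = 200
LIDAR_MBPS = 4.0             # assumed per-robot LiDAR log rate, megabits/s
CAMERA_MBPS = 8.0            # assumed per-robot compressed camera rate
WIFI_CAPACITY_MBPS = 1000.0  # assumed usable uplink across the whole site

def aggregate_uplink(robots: int, per_robot_mbps: float) -> float:
    """Total sustained upload the fleet would push through the network."""
    return robots * per_robot_mbps

total = aggregate_uplink(ROBOTS, LIDAR_MBPS + CAMERA_MBPS)
print(f"Fleet needs ~{total:.0f} Mb/s against ~{WIFI_CAPACITY_MBPS:.0f} Mb/s of capacity")
print("Bottleneck" if total > WIFI_CAPACITY_MBPS else "OK")
```

Even with these modest per-robot rates, two hundred robots demand more than twice the assumed capacity, which is why pipelines typically filter, compress, or batch-upload off-peak rather than streaming everything live.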
Wojtek's framing at Nomagic is about feedback loops at every level. Developer-level iteration happens on local machines and digital twins. Pull requests trigger full simulation CI pipelines. Pre-release, they run statistically significant pick samples in their Warsaw lab. Post-deployment, they run a full analytics stack segmenting performance by item category (fashion vs. general merchandise, large vs. small items) so they can identify where models are underperforming rather than watching the top-level throughput number. When something goes wrong and the primary model can't handle an edge case, they have a human-in-the-loop system that historically has let remote operators intervene within a 15-second average window. And on top of that, VLAs (vision-language-action models) now serve as a flexible fallback layer for cases the traditional automation can't crack.
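The segment-level analytics idea, watching per-category success rather than one top-line number, reduces to a simple aggregation. The categories and records below are made up purely for illustration:

```python
from collections import defaultdict

# Hypothetical pick log: (item_category, pick_succeeded) pairs.
picks = [
    ("fashion", True), ("fashion", False), ("fashion", True),
    ("general", True), ("general", True), ("general", True),
]

def success_by_segment(records):
    """Per-category success rates, instead of one top-level number."""
    totals, wins = defaultdict(int), defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        wins[category] += ok
    return {c: wins[c] / totals[c] for c in totals}

rates = success_by_segment(picks)
underperforming = [c for c, r in rates.items() if r < 0.9]
```

A fleet can look healthy in aggregate while one segment (here, "fashion") quietly drags; the segmentation is what tells you where to point the next retraining cycle.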
The summary from both: you never fully trust the model until it's running in production. The goal isn't certainty; it's building systems that learn from production, fail gracefully, and let you iterate quickly when they fail.
Synthetic Data: What's Actually Working, and Where It Still Falls Short
The synthetic data conversation got technical fast, and it's one of the most useful segments in the whole discussion.
What's working now that wasn't two years ago: Anshul's team has started using generative AI workflows to augment datasets across visual domains, taking real-world data collected in one condition (wooden pallets) and using GenAI to transform it into other conditions (steel pallets, blue pallets, different lighting) without re-collecting from scratch. They've also built workflows to reconstruct real environments from LiDAR and camera data into full 3D scenes (.obj files) that can be imported into simulation, effectively doing real-to-sim and then sim-to-real. This genuinely bridges some of the domain gap.
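As a shape-of-the-pipeline sketch only: the generative transforms described above are far richer than this, but a condition-augmentation loop looks roughly like the following, with simple brightness and tint perturbations standing in for the GenAI re-rendering (all parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_conditions(image: np.ndarray, n_variants: int = 3):
    """Generate lighting/colour variants of one captured frame.

    A crude stand-in for generative transforms (e.g. wooden pallets
    re-rendered as steel or blue pallets under different lighting).
    """
    variants = []
    for _ in range(n_variants):
        brightness = rng.uniform(0.6, 1.4)    # simulated lighting change
        tint = rng.uniform(0.9, 1.1, size=3)  # simulated material/colour shift
        out = np.clip(image * brightness * tint, 0, 255)
        variants.append(out.astype(np.uint8))
    return variants

frame = rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)
augmented = augment_conditions(frame)
```

The point is the workflow shape: one real capture fans out into several training conditions, which is far cheaper than re-collecting in every facility configuration.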
What still doesn't work: you can never do a 1:1 mapping of a specific facility in simulation. Every factory and warehouse is different in its exact layout, lighting, movement patterns, and dense machine configurations. The DWG files you get during solution design give you a bare-bones 2D layout; converting that to a fully accurate, photorealistic 3D simulation environment that accounts for all the real-world variables is a whole project unto itself, and rarely practical within deployment timelines.
Wojtek added the broader point: synthetic data is genuinely useful for benchmarking experiments and getting intuition about where to invest training effort. But the sim-to-real gap is real, and if you're deploying to the real world, you need production data.
Good old real data remains the gold standard. The most exciting synthetic data developments are moving the needle, but they're not replacing ground-truth production data collection. They're complementing it.
What Actually Breaks Good Deployments
Both guests spoke candidly about what goes wrong after a robot has been running well for a while.
The classic case at Nomagic: seasonal inventory shifts. Winter arrives, the warehouse starts handling big puffer jackets, and the model performance drops. Their continuous data integration handles most of this, but some distribution shifts require more deliberate intervention. One of the most striking examples Wojtek shared: they developed what they call a "shoebox picker", a dedicated hardware-plus-AI solution, specifically because open shoeboxes kept appearing in customer inventory and traditional suction grippers couldn't reliably handle them.
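A minimal version of the kind of drift check that catches seasonal shifts like this might look as follows; the baseline, window, and tolerance values are hypothetical:

```python
def drift_alarm(success_history, baseline=0.95, window=100, tolerance=0.05):
    """Flag when recent pick success drops well below the baseline,
    e.g. when new seasonal inventory (puffer jackets) starts failing."""
    recent = success_history[-window:]
    rate = sum(recent) / len(recent)
    return rate < baseline - tolerance, rate

# A run of failed picks on new seasonal items trips the alarm.
history = [True] * 85 + [False] * 15
alarmed, rate = drift_alarm(history)
```

Continuous retraining absorbs gradual shifts; an explicit alarm like this is what surfaces the abrupt ones that need deliberate intervention (or, as with the shoebox case, new hardware).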
Anshul's perspective: it's often not the model at all. Workflows change. A customer adds a new production line. The task the robot was trained for now needs to integrate with additional systems. And beyond that, mechanical breakdowns are real, and customers often don't know how to handle them, which creates production interruptions.
Both guests also highlighted the urgency of empowering operators themselves. At the scale of hundreds of deployments across multiple geographies, it's not feasible for engineering teams to manage all retraining centrally. You need operators who understand how to capture and annotate edge cases, and tools that make that accessible without deep ML expertise.
What They're Most Excited About
Wojtek is watching lights-out warehouses move from aspiration to reality. Nomagic already supports fully autonomous 12-hour operation windows for customers in Switzerland. As robotic foundational models improve, the scope of what those lights-out periods can cover will expand.
Anshul is watching humanoids, though perhaps not in the form factor that makes headlines. His vision is a combination of specialised robots (AMRs, articulated arms, conveyors) working together in orchestrated workflows with increasing agility and throughput. The "dark factory" of fully automated manufacturing is maybe five years away, he estimates, but the direction is clear.
The consistent thread underneath both answers: the data and the systems to manage it are what will determine who wins. The labelling pipelines, feedback loops, foundational model libraries, and deployment infrastructure are where the competitive advantage lives.
Encord exists at exactly this intersection. If your team is building, deploying, or iterating on vision models for robotics, whether that's AMRs, inspection bots, manipulation arms, or anything else that needs to see and understand the world, the data layer is the leverage point.
Book a demo to see how Encord handles multimodal robotics data at scale.
Explore the platform to see the Agent Catalog and annotation tooling your team can deploy today.