
Webinar Recap: Building Physical AI

Written by Justin Sharps
Head of Forward Deployed Engineering at Encord
April 30, 2025 | 5 min read


AI systems are increasingly moving beyond text and images into the physical world, powering robots, autonomous vehicles, and embodied agents. This makes robust multimodal reasoning at scale critical. Last week’s webinar, Building Physical AI: How to Enable Multimodal Reasoning at Scale, brought together experts from Encord and Archetype AI.

The session explored what it really takes to build AI systems that can perceive, reason, and act in complex real-world environments. Here’s a recap, in case you missed it.

What is Physical AI?

Physical AI refers to AI systems that can operate in and reason about the physical world. Rather than processing only images or pre-processed text, these systems interpret sensory inputs from multiple sources, such as video, audio, and LiDAR, and make decisions in real time in order to navigate real environments.

For more information, read the blog What is Physical AI.

At the core of Physical AI is the ability to combine visual, auditory, spatial, and temporal data to understand what’s happening and what actions should follow. That is multimodal reasoning. 

Multimodal reasoning requires more than just robust AI models. It needs new data infrastructure, scalable annotation workflows, and evaluation pipelines that mimic real-world environments, not just benchmark accuracy.
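
To make that concrete, here is a minimal sketch of late fusion, one common way to combine modalities: each sensor stream is encoded separately, and the time-aligned embeddings are concatenated for a downstream reasoning head. The encoders below are placeholder stand-ins, not any particular model.

```python
import numpy as np

# Hypothetical stand-ins for pretrained per-modality encoders.
# A real system would use, e.g., a video backbone, an audio model,
# and a point-cloud network; here each just pools its input.
def encode_video(frames: np.ndarray) -> np.ndarray:    # (T, H, W, C)
    return frames.mean(axis=(0, 1, 2))                 # -> (C,)

def encode_audio(waveform: np.ndarray) -> np.ndarray:  # (samples,)
    return np.array([waveform.mean(), waveform.std()])

def encode_lidar(points: np.ndarray) -> np.ndarray:    # (N, 3)
    return points.mean(axis=0)                         # -> (3,)

def fuse(frames, waveform, points) -> np.ndarray:
    """Late fusion: concatenate time-aligned per-modality embeddings.

    A downstream reasoning head (a classifier, a policy, a planner)
    consumes one fused vector per synchronized time window.
    """
    return np.concatenate([encode_video(frames),
                           encode_audio(waveform),
                           encode_lidar(points)])

# One synchronized time window from each sensor.
fused = fuse(np.zeros((8, 32, 32, 3)),   # 8 RGB frames
             np.zeros(16000),            # 1 s of 16 kHz audio
             np.zeros((1024, 3)))        # one LiDAR point cloud
print(fused.shape)                       # (3 + 2 + 3,) = (8,)
```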

Why It’s Hard to Build Physical AI Systems

Building AI systems deployed in the real world adds a new layer of complexity. Here are a few of the gaps:

  • High-dimensional, unstructured input: You’re not dealing with curated datasets anymore. You’re working with noisy sensor streams, long video sequences, and time-synced data from multiple sources (see the alignment sketch after this list).
  • No clear ground truth: Especially for downstream tasks like navigation, the “correct” label is often contextual, ambiguous, or learned through feedback.
  • Fragmented workflows: Most annotation tools, model training frameworks, and evaluation platforms aren’t built to handle multimodal input in a unified way.
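
Even the basic step of time-aligning those streams is non-trivial, since sensors tick at different rates and drift. Here is a minimal sketch, with hypothetical timestamps, of nearest-neighbor alignment between a 30 Hz video stream and a 10 Hz LiDAR stream:

```python
import numpy as np

def align_nearest(ref_ts: np.ndarray, other_ts: np.ndarray) -> np.ndarray:
    """For each reference timestamp, return the index of the closest
    timestamp in the other stream (simple nearest-neighbor sync)."""
    idx = np.searchsorted(other_ts, ref_ts)
    idx = np.clip(idx, 1, len(other_ts) - 1)
    left, right = other_ts[idx - 1], other_ts[idx]
    # Step back one index wherever the left neighbor is closer.
    idx -= (ref_ts - left) < (right - ref_ts)
    return idx

video_ts = np.arange(0, 2, 1 / 30)          # 30 Hz video frames
lidar_ts = np.arange(0, 2, 1 / 10) + 0.013  # 10 Hz LiDAR, slight offset

pairs = align_nearest(video_ts, lidar_ts)
# pairs[i] is the LiDAR sweep to pair with video frame i.
print(pairs[:6])
```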

Key Takeaways from the Webinar

Encord provides tools built for multimodal annotation and model development. Our latest webinar, “Building Physical AI: How to Enable Multimodal Reasoning at Scale,” was held in collaboration with Archetype AI, “a physical AI company helping humanity make sense of the world.” Archetype AI has developed Newton, an AI model that understands the physical world using multimodal sensor data.

In this webinar, we outlined key challenges and showed how to build Physical AI systems that use rich multimodal context for multi-step reasoning. The aim is to build AI models that are safer and perform reliably in the real world. Here are the key takeaways for teams scaling multimodal AI.

Multimodal Annotation

The first challenge in Physical AI is aligning and labeling data across modalities. Annotation pipelines must handle:

  • Video-native interfaces: Unlike static images, video requires temporal awareness. Annotators need to reason about motion, persistence, and cause-effect relationships between frames.
  • Ontologies that reflect the real world: Events, interactions, and object properties need structured representation. For example, knowing that “Person A hands object to Person B” involves spatial and temporal coordination, not just bounding boxes (see the sketch after this list).
  • Multifile support: A key Encord feature highlighted was the ability to annotate and synchronize multiple data streams such as RGB video, depth maps, and sensor logs within a single interface. This enables richer context without switching tools.
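
As an illustration of the second point, here is a minimal sketch of how a handover event could be represented as structured data. The schema is hypothetical, not Encord’s ontology format:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """An object or person tracked across frames."""
    track_id: str
    category: str          # e.g. "person", "box"

@dataclass
class Event:
    """A temporally extended interaction between tracks."""
    event_type: str        # e.g. "handover"
    start_frame: int
    end_frame: int
    roles: dict = field(default_factory=dict)  # role name -> Track

# "Person A hands object to Person B" as structured data:
person_a = Track("t1", "person")
person_b = Track("t2", "person")
box      = Track("t3", "box")

handover = Event(
    event_type="handover",
    start_frame=120,
    end_frame=148,
    roles={"giver": person_a, "receiver": person_b, "object": box},
)
```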

Scalable Automated Labeling

Once a strong ontology is in place, it becomes possible to automate large parts of the annotation process. The Encord team outlined a two-step loop:

  • Use ontologies for consistent labeling: The structure enforces what can be labeled, how relationships are defined, and what types of attributes are expected. This reduces ambiguity and improves inter-annotator agreement.
  • Add model-in-the-loop tools: After initial manual labeling, models can be trained to pre-label incoming data, which cuts annotation time dramatically. Annotators then shift to verification and correction, speeding up throughput (sketched below).
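
Here is a minimal sketch of that loop. The model, confidence threshold, and reviewer functions are hypothetical stand-ins for a real model and annotation tool:

```python
import random
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float

# Hypothetical stand-ins: a real pipeline would wrap an actual model
# and an annotation interface; these stubs just show the loop shape.
def model_predict(item) -> list[Prediction]:
    return [Prediction("box", random.random())]

def human_verify(item, preds):   # annotator accepts or corrects
    return [p.label for p in preds]

def human_annotate(item):        # full manual labeling
    return ["box"]

def prelabel_and_verify(items, threshold=0.8):
    """Pre-label with the model; route each item to verification or
    manual annotation based on model confidence."""
    labeled = []
    for item in items:
        preds = model_predict(item)
        if all(p.confidence >= threshold for p in preds):
            labels = human_verify(item, preds)    # fast path: verify
        else:
            labels = human_annotate(item)         # slow path: annotate
        labeled.append((item, labels))
    return labeled

print(prelabel_and_verify(range(5)))
```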

This hybrid approach balances the need for quality via human input with the need for scale via automation. It’s particularly useful in domains with long-tail edge cases like industrial robotics or medical imaging, where manual annotation alone doesn’t scale.

Agentic Workflows

Traditional ML workflows are often rigid: collect data, train a model, evaluate, repeat. But real-world environments are dynamic. In the webinar, the speakers introduced agentic workflows, modular pipelines that can:

  • React to new data in real time
  • Orchestrate multiple model components
  • Include humans in the loop during key stages
  • Be reused across tasks, domains, or hardware setups

Encord’s agentic workflow builder is designed for this kind of modularity. It lets teams compose workflows from building blocks such as models, data sources, annotation tools, and evaluation criteria, and run them in structured loops.

The result is AI systems that not only perceive and act but also evaluate their own performance, triggering re-labeling or retraining when needed.
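
One way to picture this modularity is as composable stages, where each stage is a small, swappable component and the workflow is an ordered composition that can re-run itself when evaluation flags a problem. This is a generic sketch, not Encord’s workflow builder API:

```python
from typing import Callable

Stage = Callable[[dict], dict]  # each stage reads and extends a context dict

def make_workflow(stages: list[Stage]) -> Stage:
    """Compose stages into a single reusable workflow."""
    def run(context: dict) -> dict:
        for stage in stages:
            context = stage(context)
        return context
    return run

# Hypothetical stages; real ones would call models, annotation
# queues, and evaluation jobs.
def ingest(ctx):   return {**ctx, "items": ["clip_001", "clip_002"]}
def prelabel(ctx): return {**ctx, "labels": {i: "draft" for i in ctx["items"]}}
def review(ctx):   return {**ctx, "approved": list(ctx["labels"])}
def evaluate(ctx): return {**ctx, "metrics": {"task_success": 0.92}}

workflow = make_workflow([ingest, prelabel, review, evaluate])
result = workflow({})
if result["metrics"]["task_success"] < 0.95:
    # The agentic twist: the workflow can trigger itself, e.g. to
    # re-label or retrain, when evaluation flags a problem.
    result = workflow({"retry": True})
```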

Evaluation Metrics

Most existing ML benchmarks fall short when applied to Physical AI. Accuracy, mAP, and F1 scores don’t always correlate with real-world performance, especially when the task is “did the robot successfully hand over the object?” or “did the system respond to the sound cue correctly?”

 “Evaluation is no longer just a number. It’s whether your robot does what it’s supposed to do.” - Frederik H.

What’s needed instead are:

  • Behavioral benchmarks: To measure whether the system can accomplish end-to-end tasks in physical environments.
  • Continuous evaluation: Instead of one-time test sets, build systems that constantly monitor themselves and flag drift or errors.
  • Task-aware success criteria: For example, a model that misses 3% of bounding boxes might still succeed in object tracking but fail miserably in manipulation if it loses track of a key object for just a moment (see the sketch after this list).
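
That last point is easy to show in code. The sketch below, using hypothetical track data, contrasts frame-level recall with a task-aware criterion that fails the run if the key object is lost for even a single frame of the manipulation window:

```python
def frame_recall(detected: list[bool]) -> float:
    """Classic metric: fraction of frames where the object was detected."""
    return sum(detected) / len(detected)

def manipulation_success(detected: list[bool], window: range) -> bool:
    """Task-aware metric: the key object must be tracked on *every*
    frame of the manipulation window; a single gap fails the task."""
    return all(detected[i] for i in window)

# 100 frames, object missed only on frame 57 (99% recall)...
detected = [i != 57 for i in range(100)]

print(frame_recall(detected))                         # 0.99
print(manipulation_success(detected, range(50, 70)))  # False: task fails
```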

Teams building Physical AI systems should think beyond classic metrics and focus on holistic evaluation that spans perception, reasoning, and action.

The webinar also included a product walkthrough showing how Encord supports:

  • Video-native annotation tools with time-based tracking, object permanence, and multimodal synchronization
  • Multifile annotation interfaces where audio, LiDAR, or sensor data can be aligned with video
  • Reusable workflows that integrate models into the annotation process and kick off retraining pipelines
  • Agentic workflows for tasks like active learning, error correction, and feedback loops

For teams used to stitching together open-source tools, spreadsheets, and Python scripts, this kind of unified, GUI-driven interface could dramatically simplify experimentation and iteration.

What This Means for Teams Building Physical AI

If you’re building AI systems for the physical world, whether that’s drones, manufacturing bots, self-driving cars, or AR agents, here’s what you should take away:

  • Start with a strong data ontology that reflects the semantics of your task
  • Choose annotation tools that support multimodal data natively and avoid fragmented workflows
  • Use model-in-the-loop setups to scale annotation cost-effectively
  • Design workflows that are modular, composable, and agent-driven
  • Define success using task-based or behavioral metrics, not just classification scores

Physical AI isn’t just a new application domain. It’s a fundamental shift in how we collect, train, and evaluate AI systems.

Want to go deeper? Watch the full webinar:
Building Physical AI: How to Enable Multimodal Reasoning at Scale

Conclusion

Physical AI represents the next frontier in machine learning, where models aren’t just answering questions, but interacting with the world. But to reach that future, we need more than smarter models. We need better data workflows, richer annotations, and tools that can keep up with the complexity of real-world signals.

The teams that win in this space won’t just be good at training models. They’ll be the ones who know how to structure, label, and scale the messy multimodal data that Physical AI depends on.

