
Gemini Robotics: Advancing Physical AI with Vision-Language-Action Models

March 20, 2025

Google DeepMind’s latest work on Gemini 2.0 for robotics shows a remarkable shift in how large multimodal AI models are used to drive real-world automation. Instead of training robots in isolation for specific tasks, DeepMind introduced two specialized models:

  • Gemini Robotics: a vision-language-action (VLA) model built on Gemini 2.0. It adds physical actions as a new output modality for directly controlling robots.
  • Gemini Robotics-ER: a version of Gemini that incorporates embodied reasoning (ER) and spatial understanding. It allows roboticists to run their own programs on top of Gemini’s spatial reasoning capabilities.

This is monumental because Google demonstrates how a multimodal AI model can be fine-tuned and applied to robotics. Because the model is multimodal, robots built on it generalize better rather than becoming proficient at only one task, and they do not need massive amounts of data to add a new ability.

In this blog we will go through the key findings behind Gemini Robotics, walk through its architecture and training pipeline, and discuss the new capabilities it unlocks.

Why Does Traditional Robotics Struggle?

Training robots has always been an expensive and complex task. Most robots are trained with supervised datasets, reinforcement learning, or imitation learning, but each approach has significant limitations.

  • Supervised learning: needs massive annotated datasets, which makes scaling difficult.
  • Reinforcement learning (RL): has only been proven effective in controlled environments. It needs millions of trial-and-error interactions and still fails to generalize to real-world applications.
  • Imitation learning (IL): is data-efficient but needs large-scale expert demonstrations, and it can be difficult to collect demonstrations for every scenario.

These challenges lead to narrowly specialized models that work well in training environments but break down in real-world settings. A warehouse robot trained to move predefined objects might struggle if an unexpected item appears. A navigation system trained in simulated environments might fail in new locations with different lighting, obstacles, or floor textures. 

Hence, the core issue of traditional robots is the lack of true generalization.

However, DeepMind’s Gemini Robotics presents a solution to this problem by rethinking how robots are trained and how they interact with their environments. 

What Makes Gemini Robotics Different?

Gemini Robotics is a general-purpose model capable of solving dexterous tasks in different environments and supporting different robot embodiments.

It uses Gemini 2.0 as a foundation and extends its multimodal capabilities so that the model not only understands tasks through vision and language but also acts autonomously in the physical world. Integrating physical actions as a new output modality, alongside vision and language processing, allows the model to control robots directly and helps them adapt and perform complex tasks with minimal human intervention.

Gemini Robotics architecture flow chart. Source

Architecture Overview

Gemini Robotics is built around an advanced vision-language-action (VLA) model, where vision and language inputs are integrated with robotic control outputs. The core idea is to let the model perceive its environment, understand natural language instructions, and act on real-world tasks by controlling the robot’s movements.

It is a transformer-based architecture. The key components, illustrated with a code sketch after the list, include:

  • Vision Encoder: This module processes visual inputs from cameras or sensors, extracting spatial and object-related information. The encoder is capable of recognizing objects, detecting their positions, and understanding environmental contexts in dynamic settings.
  • Language Encoder: The language model interprets natural language instructions. It converts user commands into an internal representation that can be translated into actions by the robot. The strength of Gemini Robotics lies in its ability to comprehend ambiguous language, contextual nuances, and even tasks with incomplete information.
  • Action Decoder: The action decoder translates the multimodal understanding of the environment into actionable robotic movements. These include tasks like navigation, object manipulation, and interaction with external tools.
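
To make this data flow concrete, here is a minimal, hypothetical PyTorch sketch of a VLA-style model with the three components above. The layer sizes, vocabulary, and 7-DoF action output are illustrative assumptions, not details published by DeepMind.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Illustrative vision-language-action model: encode an image and an
    instruction, then decode a continuous action vector (an assumption:
    e.g. end-effector deltas plus a gripper command)."""

    def __init__(self, vocab_size=32_000, d_model=256, action_dim=7):
        super().__init__()
        # Vision encoder: a small CNN standing in for a pretrained ViT.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Language encoder: token embeddings + a small transformer encoder.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Action decoder: fuse both modalities and regress an action.
        self.action_decoder = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, action_dim),
        )

    def forward(self, image, instruction_tokens):
        vis = self.vision_encoder(image)                              # (B, d_model)
        lang = self.language_encoder(self.token_embed(instruction_tokens))
        lang = lang.mean(dim=1)                                       # pooled (B, d_model)
        return self.action_decoder(torch.cat([vis, lang], dim=-1))    # (B, action_dim)

# Example: one 224x224 RGB frame and a tokenized command -> a 7-DoF action.
model = TinyVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32_000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

The real model is far larger and autoregressive, but the same three-stage structure (perceive, understand, act) applies.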

Training Pipeline

The training of these models is also distinctive: it combines multiple data sources and tasks to ensure that the model generalizes well across different settings.

Data Collection

The training process begins with collecting a diverse range of data from robotic simulations and real-world environments. This includes visual data such as images, videos, depth maps, and sensor readings, and linguistic data such as task descriptions, commands, and natural language instructions. To create a robust dataset, DeepMind combines synthetic data from controlled simulations with real-world data captured from physical robots performing tasks.
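
As a rough illustration of what one training example might look like when synthetic and real data are mixed, here is a hypothetical record structure. The field names and shapes are assumptions for the sketch, not DeepMind’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class RobotEpisodeStep:
    """One timestep of a (hypothetical) robot demonstration episode."""
    rgb: np.ndarray        # (H, W, 3) camera frame
    depth: np.ndarray      # (H, W) depth map in metres
    proprio: np.ndarray    # joint positions / gripper state
    action: np.ndarray     # action commanded at this step

@dataclass
class RobotEpisode:
    instruction: str                           # e.g. "put the apple in the bowl"
    source: str                                # "sim" or "real"
    steps: List[RobotEpisodeStep] = field(default_factory=list)

# A dataset is then a mixture of simulated and real episodes, sampled so the
# model sees both kinds of data during training.
episode = RobotEpisode(instruction="stack the red block on the blue block", source="sim")
episode.steps.append(RobotEpisodeStep(
    rgb=np.zeros((224, 224, 3), np.uint8),
    depth=np.zeros((224, 224), np.float32),
    proprio=np.zeros(7, np.float32),
    action=np.zeros(7, np.float32),
))
```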

Pretraining

The model is first pretrained on multimodal datasets, where it learns to associate vision and language patterns with tasks. This phase is designed to give the model an understanding of fundamental object recognition, navigation, and task execution in various contexts.

Pretraining helps the model learn generalizable representations of tasks without having to start from scratch for each new environment.

Fine-tuning on Robotic Tasks

After pretraining, the model undergoes fine-tuning using real-world robotic data to improve its task-specific capabilities. Here, the model is exposed to a wide range of tasks from simple object manipulation to complex multi-step actions in dynamic environments.

Fine-tuning combines supervised learning on labeled task demonstrations with reinforcement learning that optimizes robotic behaviors through trial and error, as sketched below.
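
A minimal sketch of the supervised part is behavior cloning: nudge the model’s predicted action toward the demonstrated action for each (image, instruction) pair. This reuses the hypothetical TinyVLA from the architecture sketch; the batch shapes and loss choice are assumptions.

```python
import torch
import torch.nn.functional as F

# Assumes the TinyVLA sketch from the architecture section is in scope.
model = TinyVLA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def behavior_cloning_step(images, instructions, expert_actions):
    """One supervised fine-tuning step on demonstration data."""
    predicted = model(images, instructions)
    loss = F.mse_loss(predicted, expert_actions)   # imitate the demonstrated action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch: 8 frames, 12-token instructions, 7-DoF expert actions.
loss = behavior_cloning_step(
    torch.randn(8, 3, 224, 224),
    torch.randint(0, 32_000, (8, 12)),
    torch.randn(8, 7),
)
```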

Reinforcement Learning for Real-World Adaptation

A key component of the Gemini Robotics pipeline is the use of reinforcement learning (RL), especially in the fine-tuning stage. Through RL, the robot learns by performing actions and receiving feedback based on the success or failure of the task. This allows the model to improve over time and develop an efficient policy for action selection.

RL also helps the robot generalize its learned actions to different real-world environments.
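
The RL idea can be pictured as a reward-weighted update: actions that led to task success are reinforced. Below is a heavily simplified REINFORCE-style sketch that treats the hypothetical VLA output as the mean of a Gaussian policy; DeepMind does not document its RL setup at this level of detail.

```python
import torch

# Sketch only: a fixed exploration noise is an assumption for this example.
fixed_std = 0.1 * torch.ones(7)

def select_action(policy, image, instruction):
    """Sample an exploratory action and keep its log-probability."""
    mean = policy(image, instruction)                  # (1, action_dim)
    dist = torch.distributions.Normal(mean, fixed_std)
    action = dist.sample()
    return action, dist.log_prob(action).sum()         # scalar log-prob

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """Reinforce actions in proportion to the discounted return they earned."""
    returns, g = [], 0.0
    for r in reversed(rewards):                        # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()   # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the rewards would come from simulated or real rollouts, and the production method is certainly more sophisticated, but the principle of adjusting the policy based on task success is the same.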

Embodied Reasoning and Continuous Learning

The model is also designed for embodied reasoning, which allows it to adjust its actions based on ongoing environmental feedback. This means that Gemini Robotics is not limited to a static training phase but is capable of learning from new experiences as it interacts with its environment. This continuous learning process is crucial for ensuring that the robot remains adaptable, capable of refining its understanding and improving its behavior after deployment.

Gemini Robotics-ER

Built on the same Gemini 2.0 foundation, this model focuses on embodied reasoning (ER).

What is Embodied Reasoning?

Embodied reasoning refers to the ability of the model to understand and plan based on the physical space it occupies. Unlike traditional models that react to sensory input or follow pre-programmed actions, Gemini Robotics-ER has a built-in capability to understand spatial relationships and reason about movement. 

Gemini Robotics embodied reasoning. Source

This enables the robot to assess its environment more holistically, allowing for smarter decisions about how it should approach tasks like navigation, object manipulation, or avoidance of obstacles.

For example, a robot with embodied reasoning wouldn’t just move toward an object based on visual recognition. Instead, it would take into account factors like:

  • Spatial context: Is the object within reach, or is there an obstacle blocking the way?
  • Task context: Does the object need to be lifted, moved to another location, or simply avoided?
  • Environmental context: What other objects are nearby, and how do they affect the task at hand?

Gemini Robotics benchmarks. Source

Gemini 2.0’s Embodied Reasoning Capabilities

The Gemini 2.0 model already provides embodied reasoning capabilities, and these are further improved in the Gemini Robotics-ER model. It needs no additional robot-specific data or training to use them.

Some of these capabilities, with a small prompting sketch below, include:

  • Object Detection: It can perform open-world 2D object detection, and generate accurate bounding boxes for objects based on explicit and implicit queries.
  • Pointing: The model can point to objects, object parts, and spatial concepts like where to grasp or place items based on natural language descriptions.
  • Trajectory Prediction: Using its pointing capabilities, Gemini 2.0 predicts 2D motion trajectories grounded in physical observations, enabling the robot to plan movement.
  • Grasp Prediction: Gemini Robotics-ER extends this by predicting top-down grasps for objects, enhancing interaction with the environment.
  • Multi-View Correspondence: Gemini 2.0 processes stereo images to understand 3D scenes and predict 2D point correspondences across multiple views.

Example of 2D trajectory prediction. Source
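
Gemini Robotics-ER itself was not broadly available through a public API at the time of writing, but the style of query these capabilities describe can be sketched with the google-genai Python SDK against a generally available Gemini 2.0 model. The model name, image file, prompt wording, and output convention below are assumptions for illustration.

```python
# pip install google-genai pillow
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")     # assumes you have a Gemini API key
scene = Image.open("tabletop.jpg")                # hypothetical camera frame

prompt = (
    "Point to the handle of the mug in this image. "
    "Reply as JSON: [{'label': str, 'point': [y, x]}] "
    "with coordinates normalized to a 0-1000 range."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",                     # stand-in model name; swap in a
    contents=[scene, prompt],                     # robotics model when available
)
print(response.text)  # e.g. [{'label': 'mug handle', 'point': [412, 637]}]
```

The returned 2D point (or bounding box, or trajectory of points) can then be handed to a downstream controller.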


How Does Gemini Robotics-ER Work?

Gemini Robotics-ER incorporates several key innovations in its architecture to facilitate embodied reasoning.

Spatial mapping and modeling

This helps the robot to build and continuously update a 3D model of its surroundings. This spatial model allows the system to track both static and dynamic objects, as well as the robot's own position within the environment.

Multimodal fusion

It combines inputs from vision sensors, depth cameras, and possibly other sensors (e.g., LiDAR) into a single representation of the scene, as in the sketch below.
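
One common form of such fusion, shown here purely as an illustrative sketch (the internals of Gemini Robotics-ER are not described at this level), is back-projecting a depth map into a 3D point cloud with the camera intrinsics, so that detections from the RGB image can be placed in metric space.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into camera-frame 3D points.

    depth: (H, W) array; fx, fy, cx, cy: pinhole camera intrinsics.
    Returns an (H*W, 3) array of [X, Y, Z] points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: place a detected pixel (say, a grasp point found in the RGB image)
# into 3D so a planner can reason about reachability and obstacles.
depth = np.full((480, 640), 0.8, dtype=np.float32)   # fake flat scene 0.8 m away
points = depth_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3)
```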

Spatial reasoning algorithms

These algorithms help the model predict interactions with environmental elements. Gemini Robotics-ER’s task planner integrates spatial understanding, allowing it to plan actions based on real-world complexities. Unlike traditional models, which follow predefined actions, Gemini Robotics-ER can plan ahead for tasks like navigating crowded areas, manipulating objects, or managing task sequences (e.g., stacking objects).

ERQA (Embodied Reasoning Question Answering)

ERQA is an open-source benchmark for evaluating the embodied reasoning capabilities of multimodal models. For the Gemini models it acts as a feedback signal on the quality and accuracy of their spatial reasoning, decision-making, and action planning.
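
Conceptually, scoring a model on a multiple-choice benchmark like ERQA reduces to asking each question and checking the answer. The loop below is a generic sketch; the field names and the ask_model callable are placeholders, not the benchmark’s actual interface.

```python
def evaluate_erqa_style(questions, ask_model):
    """Score a model on multiple-choice embodied-reasoning questions.

    questions: list of dicts with 'image', 'question', 'options', 'answer'.
    ask_model: callable (image, question, options) -> chosen option string.
    """
    correct = 0
    for q in questions:
        prediction = ask_model(q["image"], q["question"], q["options"])
        correct += int(prediction == q["answer"])
    return correct / max(len(questions), 1)

# Hypothetical usage with a trivial model that always picks the first option.
dummy_questions = [
    {"image": None,
     "question": "Which object is closest to the robot gripper?",
     "options": ["red block", "blue bowl"],
     "answer": "red block"},
]
print(evaluate_erqa_style(dummy_questions, lambda img, q, opts: opts[0]))  # 1.0
```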

ERQA question categories. Source

At its core, ERQA evaluates whether a model’s understanding and planned actions are consistent with the environment’s current state and the expected outcomes.

In practice, ERQA checks that the model can:

  • Accurately interpret spatial relationships between objects and obstacles in its environment.
  • Reason about changes in the environment, such as moving obstacles or shifts in spatial layout.
  • Plan complex actions like object manipulation or navigation without violating physical constraints or failing to complete the task.

The benchmark results serve as feedback on where spatial understanding or action planning falls short, so these weaknesses can be identified and addressed in further training.

Why Do These Models Matter for Robotics?

One of the biggest breakthroughs in Gemini Robotics is its ability to unify perception, reasoning, and control into a single AI system. Instead of relying solely on robotic experience, Gemini leverages vast external knowledge from videos, images, and text, enabling robots to make more informed decisions.

For example, if a household robot encounters a new appliance it has never seen before, a traditional model would likely fail unless it had been explicitly trained on that device. In contrast, Gemini can infer the appliance's function based on prior knowledge from images and instructional text it encountered during pretraining. This ability to extrapolate and reason about unseen scenarios is what makes multimodal AI so powerful for robotics.

Through this approach, DeepMind is laying the foundation for more intelligent and adaptable humanoid robots capable of operating across a wide range of industries from warehouse automation to household assistance and beyond.

Conclusion

In short, Google introduces new models and benchmarks and shows how robots can do more and adapt to a wider range of situations. By being general, interactive, and dexterous, these models can handle a variety of tasks, respond quickly to changes, and perform actions with precision, much like humans.

📘 Download our newest e-book, The rise of intelligent machines to learn more about implementing physical AI models.

Written by Akruti Acharya
Frequently asked questions

  • What is Gemini Robotics? Gemini Robotics is a vision-language-action (VLA) model built on DeepMind’s Gemini 2.0. It integrates physical actions as an output modality, allowing robots to understand tasks through vision and language while autonomously interacting with the real world.
  • How does it differ from traditional robotics approaches? Unlike traditional robotics, which relies on supervised learning, reinforcement learning, or imitation learning (all of which have scalability issues), Gemini Robotics leverages multimodal AI to enable generalization across different tasks and environments without extensive retraining.
  • What is Gemini Robotics-ER? Gemini Robotics-ER is an advanced version of Gemini Robotics that incorporates embodied reasoning (ER) and spatial understanding. This allows robots to assess their environment holistically, improving decision-making for navigation, object manipulation, and obstacle avoidance.
  • What does embodied reasoning enable? Embodied reasoning enables robots to plan and adapt based on real-time environmental feedback. Instead of simply reacting to stimuli, they understand spatial relationships and task contexts, allowing for more intelligent and flexible interactions.
  • Where could these models be applied? Potential applications include warehouse automation, home assistance robots, industrial manufacturing, and autonomous navigation in unpredictable environments.
