Encord Integrates NVIDIA Cosmos Reason and Embed

Co-Founder & CEO at Encord
We're excited to announce the integration of two NVIDIA Cosmos models - Cosmos Reason 2 and Cosmos Embed - directly into the Encord platform. Both run natively on Encord's own infrastructure, so customers can use them in production from day one.
The Cosmos Reason agent generates prelabels automatically for physical AI video. Cosmos Embed makes that video searchable by behaviour, not just by scene. Together, they turn raw video into structured, searchable training data.

Cosmos Reason: Automated Captioning And Pre-Labelling
Cosmos Reason is a vision-language world model. Given a video clip as input, it returns natural-language descriptions of the actions, objects, and scene context in the footage. Inside Encord, those descriptions arrive as prelabels attached to the right video segments. Annotators review and refine instead of starting from a blank canvas. Approved labels stream straight to the customer's training pipeline.
What this means in practice:
- A robotics team labelling dexterous manipulation no longer hand-types every grasp or release.
- An AV team captioning camera footage gets a usable starting point on every clip instead of writing each one from scratch.
- Industrial inspection teams get descriptions of anomalies, object states, and conditions without anyone watching the full reel first.
The result is less time spent describing video and more time spent improving the model.
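
To make the review flow described above concrete, here is a minimal Python sketch. It is illustrative only: the Prelabel schema, the review helper, and the export format are hypothetical names chosen for this example, not Encord's SDK or the Cosmos Reason output format, and the captions are made up.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Prelabel:
    """A model-generated caption attached to one video segment (hypothetical schema)."""
    start_s: float
    end_s: float
    caption: str
    status: ReviewStatus = ReviewStatus.PENDING


def review(prelabel, approve, corrected_caption=None):
    """Record an annotator's decision, optionally refining the caption first."""
    if corrected_caption is not None:
        prelabel.caption = corrected_caption
    prelabel.status = ReviewStatus.APPROVED if approve else ReviewStatus.REJECTED
    return prelabel


def approved_labels(prelabels):
    """Yield only approved labels, i.e. the ones that would move on to training."""
    for p in prelabels:
        if p.status is ReviewStatus.APPROVED:
            yield {"start_s": p.start_s, "end_s": p.end_s, "caption": p.caption}


# Made-up prelabels for a short manipulation clip.
prelabels = [
    Prelabel(0.0, 2.5, "gripper approaches red block"),
    Prelabel(2.5, 4.0, "gripper grasp block"),
    Prelabel(4.0, 6.0, "gripper releases block into bin"),
]

review(prelabels[0], approve=True)  # accepted as-is
review(prelabels[1], approve=True, corrected_caption="gripper grasps red block")
# The third prelabel is left pending, so it is not exported.

exported = list(approved_labels(prelabels))
```

The point of the sketch is the shape of the workflow: the annotator edits or accepts an existing caption rather than writing one, and only approved segments leave the review step.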

Cosmos Embed: Action-Aware Embeddings And Behaviour Search
Cosmos Embed is a video embedding model. Each embedding is calculated over an eight-frame window, capturing the action that happens across those frames. That makes a video dataset searchable by behaviour, using natural-language queries.
In practice, you can query for what's actually happening in a clip - not "car on a snowy road" but "car turning in a snowstorm," not "highway scene" but "car overtaking another vehicle." Edge cases, the rare scenarios that decide whether a model ships safely, become easier to find.
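
As a rough illustration of the idea, the Python sketch below splits a clip into eight-frame windows and ranks stored window embeddings against a query vector by cosine similarity. Everything beyond the eight-frame window is an assumption made for the example: the non-overlapping stride, the function names, and the toy three-dimensional vectors, which stand in for real model output. In a real system the query vector would come from encoding the natural-language query with the embedding model.

```python
import math

WINDOW = 8  # frames per embedding, as described above


def frame_windows(num_frames, window=WINDOW, stride=WINDOW):
    """Return (start, end) frame index pairs, with end exclusive.

    The stride is an assumption: whether windows overlap is not specified here.
    Trailing frames that do not fill a whole window are dropped.
    """
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Rank indexed windows by similarity to the query vector, best first."""
    ranked = sorted(index, key=lambda key: cosine(query_vec, index[key]), reverse=True)
    return ranked[:top_k]


windows = frame_windows(20)  # a 20-frame clip yields two full windows

# Toy index: (clip id, start frame, end frame) -> made-up embedding.
index = {
    ("clip_a", 0, 8): [0.9, 0.1, 0.0],
    ("clip_a", 8, 16): [0.1, 0.9, 0.0],
    ("clip_b", 0, 8): [0.8, 0.3, 0.1],
}

# Stand-in for the embedding of a text query such as "car overtaking another vehicle".
query = [1.0, 0.0, 0.0]
hits = search(query, index)
```

Because each result is a window rather than a whole clip, a search hit points at the frames where the behaviour occurs, not just at the video that contains it.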

How They Work Together
- Robotic manipulation. Reason generates action and state descriptions for dexterous manipulation footage. Embed lets teams find the specific behaviours that need more training examples.
- Autonomous vehicles. Reason captions camera data at scale. Embed lets teams pull the exact driving scenarios that matter for their perception models - lane changes in rain, unprotected lefts, occluded pedestrians.
- Industrial inspection. Reason describes anomalies and object states in operational footage. Embed surfaces every clip where a defect occurs in a specific context.
- Vision-Language-Action models. Reason produces the grounded captions VLA training depends on. Embed makes the underlying dataset queryable by the behaviours those models need to learn.
More information on Encord's Physical AI suite is available from Encord.