Phase Annotation
Encord Computer Vision Glossary
A robot picking up an object doesn't do it in one continuous motion; it reaches, positions, grasps, lifts, and repositions. These are distinct phases, and knowing where one ends and the next begins is what phase annotation captures. It's a layer of structure that turns raw demonstration data into something a model can actually learn task logic from.
What Is Phase Annotation?
Phase annotation is the process of segmenting a robot demonstration into its component stages, labeling the boundaries between distinct actions or sub-tasks within a longer sequence.
Where trajectory annotation captures what the robot does at every frame, phase annotation captures the higher-level structure of how a task is organised.
A single pick-and-place task, for example, might be broken into: approach, grasp, lift, transport, place, retract. Each phase has a clear start and end point, and different success criteria. Phase annotation makes that structure explicit in the training data.
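That structure is easy to picture as data. The sketch below is a minimal illustration, not Encord's data model: the `Phase` class, `phase_at` helper, and all frame numbers are hypothetical, chosen only to show phases as labeled frame spans with explicit start and end points.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    """One labeled segment of a demonstration (frame indices are inclusive)."""
    label: str
    start_frame: int
    end_frame: int

# A single pick-and-place demonstration split into its phases.
demo_phases = [
    Phase("approach",  0,   89),
    Phase("grasp",     90,  129),
    Phase("lift",      130, 169),
    Phase("transport", 170, 259),
    Phase("place",     260, 299),
    Phase("retract",   300, 349),
]

def phase_at(phases: list, frame: int) -> str:
    """Look up which phase a given frame belongs to."""
    for p in phases:
        if p.start_frame <= frame <= p.end_frame:
            return p.label
    raise ValueError(f"frame {frame} is outside the annotated demonstration")

print(phase_at(demo_phases, 150))  # → lift
```

Because every frame falls inside exactly one span, downstream code can ask "which stage of the task is this frame?" without re-deriving it from pixels.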
Why It Matters for Model Training
Models trained without phase structure learn to predict the next action from the current frame, which works for simple, repetitive tasks but breaks down on anything more complex. When a model understands that a task has phases, it can track where it is in a sequence, recover from errors mid-task, and generalise the same phase structure to new objects or environments.
For VLA training specifically, phase boundaries also anchor natural language captions. A caption like "grasp the handle" only makes sense in context; phase annotation provides that context, tying language descriptions to the right segment of the demonstration.
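Concretely, a caption keyed to a phase label resolves to a frame span rather than the whole clip. This is a toy sketch under assumed data: the `phases` dictionary, the caption text, and `caption_frames` are all illustrative names, not part of any real pipeline.

```python
# Hypothetical phase spans for one demonstration: label -> (start_frame, end_frame)
phases = {
    "approach": (0, 89),
    "grasp":    (90, 129),
    "lift":     (130, 169),
}

# Captions anchored to phases rather than to the full demonstration.
captions = {"grasp": "grasp the handle"}

def caption_frames(phase_label: str) -> tuple:
    """Return the frame span a phase-level caption refers to."""
    return phases[phase_label]

print(caption_frames("grasp"))  # → (90, 129)
```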
What Gets Labeled
Phase annotation typically involves:
- Phase boundaries: the exact frames where one phase ends and the next begins
- Phase labels: a consistent taxonomy of action categories applied across all demonstrations
- Success/failure flags: whether a given phase was completed correctly
- Transition conditions: what triggered the move from one phase to the next
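Taken together, those four fields suggest a per-phase record like the one below. This is a hedged sketch of one possible schema, not Encord's: `PhaseAnnotation`, `TAXONOMY`, and `validate` are assumed names, and the validation rules (contiguous frame ranges, labels drawn from a fixed taxonomy) are illustrative conventions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhaseAnnotation:
    label: str                        # from a fixed taxonomy, e.g. "grasp"
    start_frame: int                  # first frame of the phase (inclusive)
    end_frame: int                    # last frame of the phase (inclusive)
    success: bool                     # was this phase completed correctly?
    transition: Optional[str] = None  # what triggered the move to the next phase

# The shared taxonomy every annotator draws labels from.
TAXONOMY = {"approach", "grasp", "lift", "transport", "place", "retract"}

def validate(phases: list) -> None:
    """Require known labels, non-empty phases, and contiguous boundaries."""
    for p in phases:
        if p.label not in TAXONOMY:
            raise ValueError(f"unknown phase label {p.label!r}")
        if p.end_frame < p.start_frame:
            raise ValueError(f"empty phase {p.label!r}")
    for prev, cur in zip(phases, phases[1:]):
        if cur.start_frame != prev.end_frame + 1:
            raise ValueError(f"gap or overlap between {prev.label!r} and {cur.label!r}")

demo = [
    PhaseAnnotation("approach", 0, 89, True, "gripper aligned over object"),
    PhaseAnnotation("grasp", 90, 129, True, "contact force threshold reached"),
]
validate(demo)  # passes: labels are known and boundaries are contiguous
```

A check like this catches gaps and overlaps at annotation time, before contradictory boundaries ever reach a training run.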
Consistency across annotators matters enormously here. If one annotator marks the grasp phase as starting when the gripper begins to close, and another marks it at the moment of contact, the model sees contradictory structure and learns that noise instead of the task's actual logic.
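One common way to quantify that consistency is to compare two annotators' boundary frames for the same demonstration under a small tolerance. The function and frame numbers below are illustrative assumptions, not a standard metric from any particular toolkit.

```python
def boundaries_agree(a: list, b: list, tolerance_frames: int = 3) -> float:
    """Fraction of corresponding boundaries within `tolerance_frames` of each other."""
    if len(a) != len(b):
        return 0.0  # annotators disagree even on how many phases there are
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance_frames)
    return hits / len(a)

ann_1 = [90, 130, 170, 260, 300]  # frames where annotator 1 placed phase boundaries
ann_2 = [92, 131, 188, 259, 301]  # annotator 2: third boundary off by 18 frames

print(boundaries_agree(ann_1, ann_2))  # → 0.8
```

A low agreement score on a demonstration flags it for review, which is usually cheaper than discovering the disagreement as degraded model behaviour later.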
Phase Annotation vs. Action Segmentation
The two are closely related but operate at different levels. Action segmentation divides continuous video into labeled action clips; it's primarily a computer vision task.
Phase annotation is task-aware; it segments a demonstration according to the logical structure of what the robot is trying to accomplish, not just what's visually distinct.
In practice, physical AI annotation pipelines use both, with phase annotation providing the task-level scaffold and action segmentation filling in the frame-level detail.
Encord for Phase Annotation
Encord's timeline editor lets annotators mark phase boundaries with frame-level precision across multi-view video streams, apply consistent label taxonomies, and flag transition conditions, all within the same workspace used for trajectory and object annotation.
This keeps phase labels in sync with the rest of the demonstration data, so models train on a complete, coherent picture of each task rather than labels assembled from separate tools.
→ Explore Encord for Physical AI
→ Explore Annotation & Labeling
Related Terms
See also: Trajectory Annotation · Action Segmentation · Vision-Language-Action Model (VLA) · Behaviour Cloning · Imitation Learning · Teleoperation
Related Resources
Informational Guides:
- Gemini Robotics — Advancing Physical AI with VLA Models
- Accelerating Robotics VLA Segmentation with SAM 3