
Segment Anything Model 3 (SAM 3): What to Expect from the Next Generation of Foundation Segmentation Models

October 31, 2025 | 5 min read


The Segment Anything Model 3 (SAM 3) paper, currently under review for ICLR 2026, signals another major leap in computer vision. Where SAM 1 and 2 introduced the idea of promptable segmentation through clicks, boxes, and masks, SAM 3 extends this to concepts, uniting image, video, and text in a single architecture.

Rather than simply segmenting “this object,” SAM 3 can now segment all instances of a concept (say, “yellow school buses” or “striped cats”) across an image or a full video sequence.

Below, we break down what this evolution means, what’s under the hood, and how it could reshape the next wave of multimodal visual models.

From “Visual Prompts” to “Concept Prompts”

Earlier SAM versions were grounded in geometry: a user click or bounding box pointed to a single instance. SAM 3 expands this through Promptable Concept Segmentation (PCS) - the ability to detect, segment, and track every instance of a user-specified concept across images or short videos.

These concept prompts can come from:

  • A noun phrase, such as “red apple” or “construction crane”
  • An image exemplar that visually defines what to look for
  • A combination of the two, pairing a textual description with visual guidance

Crucially, PCS is open-vocabulary. The model isn’t limited to a fixed taxonomy; any noun phrase that can be visually grounded becomes a valid query. This makes SAM 3 capable of “finding all X” across arbitrary data, a key capability for fields like robotics, scientific imaging, or auto-labeling in AI data pipelines.
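
To make the idea concrete, here is a minimal sketch of what concept prompting could look like in code. The sam3 package, build_sam3 factory, and segment_concept method are hypothetical placeholders (the real API has not been released); the three calls mirror the three prompt types listed above.

```python
# Hypothetical sketch of Promptable Concept Segmentation (PCS) usage.
# `sam3`, `build_sam3`, and `segment_concept` are illustrative names only;
# they do not correspond to a released interface.
from PIL import Image
from sam3 import build_sam3  # hypothetical package

model = build_sam3(checkpoint="sam3.pt")          # assumed factory function
image = Image.open("street_scene.jpg")

# 1. Prompt with a noun phrase: return ALL matching instances.
buses = model.segment_concept(image, text="yellow school bus")

# 2. Prompt with an image exemplar (a crop containing one example object).
exemplar = image.crop((120, 80, 260, 210))
buses = model.segment_concept(image, exemplars=[exemplar])

# 3. Combine text and exemplars for tighter grounding.
buses = model.segment_concept(
    image,
    text="yellow school bus",
    exemplars=[exemplar],
)

for instance in buses:
    print(instance.score, instance.box)           # one mask + box per instance
```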

Architectural Overview: Detection, Tracking, and a New “Presence” Head

The SAM 3 architecture (diagram on page 3 of the paper) is a hybrid between a DETR-style object detector and the SAM 2 transformer tracker. It’s built around a shared Perception Encoder (PE) - a vision-language backbone that jointly processes image, text, and exemplar prompts.

The system splits into two coordinated modules:

A DETR-based Detector:

  • Encodes the image and text/exemplar prompts.
  • Produces bounding boxes and masks for each detected instance.
  • Uses dual supervision for alignment between visual and linguistic features.
  • Employs a presence head - a learned “global token” that determines whether a concept is present at all before localizing it. This decouples recognition (what) from localization (where), improving accuracy on ambiguous or fine-grained categories.

A Memory-Based Tracker:

  • Extends the SAM 2 transformer for spatio-temporal segmentation.
  • Propagates “masklets” (object-specific spatial-temporal masks) across frames.
  • Uses temporal disambiguation to handle occlusions by matching detections and periodically re-prompting the tracker with high-confidence masks.

Together, these allow SAM 3 to both detect new instances and maintain consistent object IDs through time, effectively merging open-vocabulary detection, segmentation, and multi-object tracking into a unified pipeline.
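
As a rough mental model of that pipeline (not the paper's actual code), the detector and tracker might be wired together as below. Every class and method name here is invented for illustration; the point is the control flow: the presence head gates localization, and the memory tracker matches fresh detections against existing masklets.

```python
# Illustrative pseudo-implementation of the SAM 3 pipeline described above.
# All components passed in (encoder, detector, tracker) are assumed, not real code.
import torch

class SAM3Pipeline(torch.nn.Module):
    def __init__(self, encoder, detector, tracker, presence_threshold=0.5):
        super().__init__()
        self.encoder = encoder          # shared Perception Encoder (image + text + exemplars)
        self.detector = detector        # DETR-style detector with a presence head
        self.tracker = tracker          # SAM 2-style memory tracker
        self.presence_threshold = presence_threshold

    def forward(self, frames, text=None, exemplars=None):
        masklets = {}                   # object_id -> list of per-frame masks
        for frame in frames:
            feats = self.encoder(frame, text=text, exemplars=exemplars)

            # Presence head: decide *whether* the concept appears at all
            # before asking *where* it is (recognition decoupled from localization).
            presence = self.detector.presence_head(feats)       # score in [0, 1]
            detections = []
            if presence > self.presence_threshold:
                detections = self.detector.localize(feats)      # boxes + masks

            # Memory tracker: propagate existing masklets and match them
            # against fresh detections to keep object IDs stable over time.
            masklets = self.tracker.propagate_and_match(
                frame, feats, detections, masklets
            )
        return masklets
```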

Interactivity and Refinement

Like its predecessors, SAM 3 remains fully interactive. Users can:

  • Add positive or negative exemplars to refine detections (e.g., “these are apples, this isn’t”).
  • Adjust individual masks through clicks or boxes as in SAM 2.
  • Iterate across frames, with refinements propagating through the video automatically.

This enables collaborative human-in-the-loop annotation - ideal for labeling complex datasets or curating domain-specific corpora in fields such as medical imaging or autonomous driving.
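
A hedged sketch of what that interaction loop could look like, continuing the hypothetical model object from the earlier example; start_video_session, add_exemplar, refine_mask, and propagate are illustrative names rather than a documented interface:

```python
# Hypothetical interactive refinement loop on a short video.
session = model.start_video_session("warehouse_clip.mp4", text="cardboard box")

# Concept-level corrections: mark a false positive as a negative exemplar
# and a missed object as a positive one.
session.add_exemplar(frame_idx=0, box=(410, 220, 480, 300), positive=False)
session.add_exemplar(frame_idx=0, box=(95, 140, 180, 260), positive=True)

# Instance-level correction: a SAM 2-style click on one imperfect mask.
session.refine_mask(frame_idx=12, object_id=3, point=(330, 415), positive=True)

# Refinements propagate forward through the remaining frames automatically.
masklets = session.propagate()
```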

A Scalable Human + AI Data Engine

Training SAM 3 required an enormous expansion of data diversity and quality. The authors built a new data engine that combines human experts with fine-tuned multimodal LLMs to annotate and verify segmentation data at scale.

Key components of this pipeline:

  1. Media and Label Curation: Images and videos are mined from diverse domains using a curated ontology of 22 million concepts drawn from Wikidata.
  2. AI Annotators: Models propose candidate noun phrases and segmentation masks.
  3. AI Verifiers: LLM-based systems automatically validate whether the masks are correct and exhaustive, doubling throughput compared to human-only verification.
  4. Human Correction: Human annotators focus on difficult cases flagged by AI verifiers, manually fixing or refining masks.

This semi-automated feedback loop produces:

  • SA-Co/HQ: 5.2M images with 4M unique noun phrases
  • SA-Co/SYN: 38M synthetic phrases with 1.4B masks
  • SA-Co/VIDEO: 52.5K videos with 467K masklets

The resulting Segment Anything with Concepts (SA-Co) benchmark includes 214K unique concepts - over 50× more than prior open-vocabulary segmentation datasets - and is open-sourced alongside the model.
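
The following sketch shows the shape of one annotate-verify-correct pass, under the assumption that the AI annotators, AI verifiers, and human correction step can be treated as callables; the function names are placeholders, not the paper's implementation:

```python
# Simplified sketch of the human + AI data engine loop described above.
# The callables stand in for the fine-tuned MLLM annotators/verifiers and
# the human annotators; none of these names come from the paper's code.
def data_engine_pass(media_batch, ontology,
                     propose_phrases, propose_masks, ai_verify, human_fix):
    verified_labels = []
    for item in media_batch:
        # 1. AI annotators propose candidate noun phrases and masks.
        phrases = propose_phrases(item, ontology)
        candidates = [(phrase, propose_masks(item, phrase)) for phrase in phrases]

        # 2. AI verifiers check that masks are correct and exhaustive.
        for phrase, masks in candidates:
            verdict = ai_verify(item, phrase, masks)
            if verdict.ok:
                verified_labels.append((item, phrase, masks))
            else:
                # 3. Humans only touch the hard cases the verifier flags.
                verified_labels.append((item, phrase, human_fix(item, phrase, masks)))
    return verified_labels
```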

Performance and Scaling

In experiments, SAM 3 sets a new state of the art in open-vocabulary segmentation and tracking:

  • Image segmentation: Zero-shot mask AP of 47.0 on LVIS, versus a previous best of 38.5.
  • PCS benchmark: 2× higher scores than OWLv2 and LLMDet baselines.
  • Video tracking: Reaches roughly 80% of human pHOTA accuracy on SA-Co/VEval videos.
  • Counting tasks: Outperforms Gemini 2.5 Pro and Qwen 2 VL on CountBench.
  • Interactive refinement: Three exemplar clicks boost accuracy by +18.6 points (CG F1) over text-only prompts.

Ablations reveal several drivers of this performance:

  • The presence head contributes +5.7 CG F1 by decoupling recognition from localization.
  • Adding hard negative phrases (e.g., near-miss concepts) boosts precision.
  • Mixing synthetic and human-verified data yields the best scaling behaviour.
  • Using AI verifiers for pseudo-label cleanup adds +4–5 points on key metrics.

On an NVIDIA H200, SAM 3 processes an image containing 100 objects in roughly 30 ms and sustains near real-time tracking for around five concurrent objects.

Limitations and Outlook

Despite its scale and generality, SAM 3 isn’t without constraints:

  • It struggles on fine-grained or domain-specific categories (e.g., medical terms, aircraft models) without fine-tuning.
  • It’s limited to short noun phrases, not long referring expressions requiring reasoning.
  • Video inference scales linearly with object count, making dense tracking costly.
  • Switching between “concept-level” and “instance-level” edits still requires explicit mode changes.

The authors show, however, that combining SAM 3 with a multimodal LLM (“SAM 3 Agent”) extends its reasoning ability, handling queries like “people sitting down but not holding a gift box” by iteratively calling SAM 3 with refined concept prompts.
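
The control flow of such an agent might look like the sketch below; llm.propose_prompt, llm.filter, and model.segment_concept are hypothetical interfaces used purely to show how an LLM could iterate over SAM 3 calls:

```python
# Illustrative "SAM 3 Agent" loop: an LLM decomposes a complex referring
# query into simple noun-phrase prompts for SAM 3, then filters the results.
def sam3_agent(image, query, llm, model, max_rounds=3):
    kept = []
    prompt = llm.propose_prompt(query)                 # e.g. "person sitting down"
    for _ in range(max_rounds):
        instances = model.segment_concept(image, text=prompt)
        # The LLM inspects the returned instances and keeps only those that
        # satisfy the full query (e.g. "... but not holding a gift box"),
        # optionally proposing a refined prompt for another round.
        kept, prompt = llm.filter(image, query, instances)
        if prompt is None:                             # LLM is satisfied; stop iterating
            break
    return kept
```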

What to Expect When SAM 3 Launches

If open-sourced as promised, SAM 3 will likely mark a turning point in vision-language segmentation:

  1. Unified Image-Video Segmentation:
    One model handles both static and temporal domains through a shared backbone.
    Expect significant gains in video annotation tools and robot perception stacks.
  2. Concept-Level Search and Segmentation:
    Instead of prompting for a single object, users can ask for all instances of a category - crucial for data labeling and content moderation.
  3. Interactive AI-Assisted Annotation:
    The refinement loop of AI verifiers + human feedback could cut annotation time by half or more on complex datasets.
  4. Bridge to LLM-Driven Vision Agents:
    SAM 3 becomes a “vision tool” that language models can call to ground their reasoning in pixel-level perception.

For AI infrastructure companies like us at Encord, SAM 3 sets the stage for more intelligent, efficient, and scalable annotation workflows, and we're already planning for the next generation of data pipelines.

Its blend of multimodal reasoning, interactive refinement, and automation hints at a future where the boundary between human instruction and pixel-level understanding grows thinner with every generation.

