Contents
Why SAM 2 is so exciting
Promptable Visual Segmentation (PVS)
SAM 2 Model Architecture
SA-V Dataset Overview
Performance of SAM 2
Real-World Applications of SAM 2
SAM 2: Related Work
Limitations of SAM 2
Try SAM 2
Segment Anything Model 2 (SAM 2) & SA-V Dataset from Meta AI
Meta AI has released Segment Anything Model 2 (SAM 2), a groundbreaking new foundation model designed for segmenting objects in both images and videos. Alongside SAM 2, Meta AI released the SA-V dataset, the large-scale dataset used to train the model. To demonstrate these advancements, an interactive web demo has also been made available.
This release comes 14 months after its predecessor, the Segment Anything Model (SAM), which took the field of computer vision by storm. Learn how to fine-tune the first version of SAM.
Why SAM 2 is so exciting
SAM 2 represents a significant leap forward in the field of computer vision, bringing state-of-the-art segmentation and tracking capabilities for both video and images into a unified model. This unification eliminates the need to combine SAM with separate video object segmentation models, streamlining video segmentation into a single, efficient tool. It maintains a simple design and fast inference speed, making it accessible and efficient for users.
The AI model can track objects consistently across video frames in real-time, which opens up numerous possibilities for applications in video editing and mixed reality experiences. This new model builds upon the success of the original Segment Anything Model, offering improved performance and efficiency.
SAM 2 can also be used to annotate visual data for training computer vision systems. It opens up creative ways to select and interact with objects in real-time or live videos.
Key Features of SAM 2
- Object Selection and Adjustment: SAM 2 extends the prompt-based object segmentation abilities of SAM to also work for object tracks across video frames.
- Robust Segmentation of Unfamiliar Videos: The model is capable of zero-shot generalization. This means it can segment objects, images, and videos from domains not seen during training, allowing for versatility in real-world use cases.
- Real-Time Interactivity: SAM 2 utilizes a streaming memory architecture that processes video frames one at a time, allowing for real-time, interactive applications.
SAM 2 Performance Enhancements
SAM 2 sets new standards for object segmentation of both videos and images. During testing, it demonstrated superior accuracy in video segmentation with three times fewer interactions compared to previous models and an 8x speedup for video annotations. For image segmentation, it is not only more accurate but also six times faster than its predecessor, SAM.
Meta AI emphasizes the broader implications of this release: "We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks."
Meta AI’s Segment Anything Model 2
Promptable Visual Segmentation (PVS)
The Promptable Visual Segmentation (PVS) task extends the Segment Anything (SA) task from static images to videos, allowing interactive prompts such as clicks, bounding boxes, or masks on any frame to define or refine object segmentation. Upon receiving a prompt, the model immediately generates a segmentation mask for that frame and propagates this information to segment the object across the entire video, creating a spatio-temporal mask, or masklet, that includes segmentation for every frame. The model can also accommodate additional prompts for further refinement throughout the video.
PVS emphasizes real-time, interactive segmentation with a focus on clearly defined objects, excluding regions without visual boundaries. It combines aspects of static image and video segmentation tasks, aiming to provide an enhanced user experience with minimal interaction. This task improves on traditional methods by allowing more flexible and immediate adjustments to segmentation through various prompt types, enabling accurate and efficient tracking of objects throughout video content.
SAM 2 Model Architecture
SAM 2 significantly advances the promptable capabilities of SAM by extending its functionality to the video domain. This extension incorporates a sophisticated per-session memory module, enabling the model to maintain and track selected objects across all frames of a video, even when these objects temporarily disappear from view. SAM 2 also supports dynamic corrections of mask predictions based on supplementary prompts throughout the video sequence.
Meta AI’s Segment Anything Model 2
In contrast, SAM 1 lacks this memory mechanism and does not offer the same level of continuous tracking and dynamic correction for video sequences. SAM 1’s functionality is limited to static images without the advanced memory management capabilities introduced in SAM 2.
Here are the key architectural innovations introduced in SAM 2:
Frame Embeddings and Memory Conditioning
Unlike SAM, where frame embeddings are directly obtained from an image encoder, SAM 2 employs a more intricate approach. In SAM 2, the frame embeddings are conditioned on both past predictions and frames that have been prompted. This conditioning allows the model to leverage historical context and anticipate future object movements. Notably, prompted frames can even be from the future relative to the current frame, facilitating a more comprehensive understanding of object dynamics across the video.
Memory Encoder and Memory Bank
The memory encoder in SAM 2 creates and manages frame memories based on current predictions, storing them in a memory bank. This bank operates as a FIFO (First-In, First-Out) queue, retaining information about up to N recent frames and M prompted frames. To capture short-term object motion, temporal position information is embedded into the memories of the N recent frames. However, this temporal information is not included in the memories of prompted frames due to their sparser training signal and the difficulty in generalizing to new temporal ranges not encountered during training. This design allows the model to refer to recent context and adjust predictions based on new information while managing the challenge of varying temporal contexts.
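To make the bookkeeping concrete, below is a minimal Python sketch of a FIFO memory bank, assuming one memory entry per frame; the class name, default sizes for N and M, and the dictionary layout are illustrative and not Meta AI's implementation.

```python
from collections import deque

class MemoryBank:
    """Minimal sketch of SAM 2's memory-bank bookkeeping (illustrative only).

    Keeps a FIFO queue of the N most recent frame memories plus a separate
    store of up to M prompted-frame memories. Temporal position is recorded
    only for recent frames, mirroring the design choice described above.
    """

    def __init__(self, num_recent: int = 6, num_prompted: int = 6):
        self.recent = deque(maxlen=num_recent)      # oldest entry is evicted automatically
        self.prompted = deque(maxlen=num_prompted)

    def add_recent(self, frame_idx: int, memory_features) -> None:
        # Recent memories carry temporal position info to capture short-term motion.
        self.recent.append({"t": frame_idx, "features": memory_features})

    def add_prompted(self, memory_features) -> None:
        # Prompted-frame memories omit temporal position information.
        self.prompted.append({"features": memory_features})

    def context(self):
        """Return every memory the memory-attention layer can cross-attend to."""
        return list(self.prompted) + list(self.recent)
```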
Image Encoder
For efficient processing of videos, SAM 2 utilizes a streaming approach where the image encoder processes video frames sequentially. The encoder, based on a pre-trained masked autoencoder (MAE) Hiera, produces multi-scale feature embeddings for each frame. This hierarchical encoding facilitates the use of various scales during decoding, contributing to the model's ability to handle detailed segmentation tasks effectively.
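In practice, streaming simply means the video is read and encoded frame by frame rather than loaded as a whole clip. The sketch below illustrates such a loop using OpenCV; the `hiera_encoder` referenced in the final comment is a stand-in for the MAE-pretrained Hiera backbone and is not implemented here.

```python
import cv2
import torch

def stream_frames(video_path: str, size: int = 1024):
    """Yield video frames one at a time, resized to the model's input resolution."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (size, size))
        # BGR -> RGB, HWC -> CHW, scale to [0, 1]
        yield torch.from_numpy(frame[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
    cap.release()

# `hiera_encoder` is a placeholder for the MAE-pretrained Hiera backbone; it is assumed
# to return multi-scale feature maps for each frame.
# for frame in stream_frames("video.mp4"):
#     multi_scale_feats = hiera_encoder(frame.unsqueeze(0))
```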
Memory Attention Mechanism
SAM 2 incorporates a memory attention mechanism that conditions the current frame features on both past frame features and predictions, as well as any new prompts. This mechanism is implemented using a stack of L transformer blocks. The first block takes the image encoding of the current frame as input, performing self-attention and cross-attention with memories stored in the memory bank. This process helps integrate historical and contextual information into the current frame's feature representation.
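The following simplified PyTorch sketch shows what one such block could look like: self-attention over the current frame's tokens, followed by cross-attention into the memory tokens and a position-wise MLP. The dimensions, normalization placement, and omission of positional encodings are simplifications, not SAM 2's exact block design.

```python
import torch
import torch.nn as nn

class MemoryAttentionBlock(nn.Module):
    """Simplified sketch of one memory-attention block (illustrative, not SAM 2's exact design)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, memory_tokens: torch.Tensor) -> torch.Tensor:
        # Self-attention over the current frame's tokens.
        q = self.norm1(frame_tokens)
        frame_tokens = frame_tokens + self.self_attn(q, q, q)[0]
        # Cross-attention: queries come from the current frame, keys/values from the memory bank.
        frame_tokens = frame_tokens + self.cross_attn(self.norm2(frame_tokens), memory_tokens, memory_tokens)[0]
        # Position-wise MLP.
        return frame_tokens + self.mlp(self.norm3(frame_tokens))

# Example: 16x16 frame tokens of width 256 attending to three stored frame memories.
block = MemoryAttentionBlock()
frame = torch.randn(1, 16 * 16, 256)
memories = torch.randn(1, 3 * 16 * 16, 256)
out = block(frame, memories)  # shape: (1, 256, 256)
```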
Mask decoder architecture. SAM 2: Segment Anything in Images and Videos
Prompt Encoder and Mask Decoder
The prompt encoder in SAM 2 mirrors the design of SAM's prompt encoder, capable of handling clicks (positive or negative), bounding boxes, or masks to define the object of interest. Sparse prompts are represented through positional encodings combined with learned embeddings for each prompt type. The mask decoder, while largely following SAM’s design, is enhanced to include high-resolution features from the image encoder, ensuring precise segmentation details. The decoder also predicts multiple masks to address potential ambiguities in video frames, selecting the most accurate mask based on predicted IoU (Intersection over Union).
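The ambiguity handling amounts to keeping the candidate with the highest predicted IoU. Below is a minimal sketch, assuming the decoder returns a stack of candidate mask logits and matching IoU scores.

```python
import torch

def select_best_mask(masks: torch.Tensor, iou_predictions: torch.Tensor):
    """Keep the candidate mask with the highest predicted IoU.

    masks:            (num_candidates, H, W) candidate mask logits
    iou_predictions:  (num_candidates,) decoder confidence per candidate
    """
    best = torch.argmax(iou_predictions)
    return masks[best], iou_predictions[best]

# Example: three candidate masks produced for an ambiguous prompt.
masks = torch.randn(3, 1024, 1024)
iou_preds = torch.tensor([0.71, 0.88, 0.54])
best_mask, best_score = select_best_mask(masks, iou_preds)  # picks candidate 1
```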
Occlusion Prediction
SAM 2 introduces an occlusion prediction head that assesses whether the object of interest is visible in the current frame. This addition is crucial for handling scenarios where objects may be occluded or absent, allowing the model to indicate the visibility status alongside the segmentation masks.
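Conceptually, this is a small classifier on top of the decoder's object-level output that scores whether the object is visible in the current frame. The sketch below is illustrative; the layer sizes and input token are assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class OcclusionHead(nn.Module):
    """Illustrative occlusion head: scores whether the object is visible in the frame."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, object_token: torch.Tensor) -> torch.Tensor:
        # Returns a logit; sigmoid(logit) > 0.5 means "object visible in this frame".
        return self.mlp(object_token).squeeze(-1)

head = OcclusionHead()
visible_logit = head(torch.randn(1, 256))
is_visible = torch.sigmoid(visible_logit) > 0.5
```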
Training SAM 2
- Pre-training: SAM 2 is pre-trained on the SA-1B dataset using an MAE pre-trained Hiera image encoder. It filters out masks covering over 90% of the image and trains with 64 randomly sampled masks per image. ℓ1 loss and sigmoid activation are used for IoU predictions. The model also applies horizontal flip augmentation and resizes images to 1024×1024 pixels.
- Full Training: Post pre-training, SAM 2 is trained on a mix of SA-V + Internal, SA-1B, and open-source AI video datasets like DAVIS, MOSE, and YouTubeVOS. It alternates between video and image data to optimize training. The model processes sequences of 8 frames, with up to 2 frames receiving corrective clicks. Initial prompts are sampled from ground-truth masks, positive clicks, or bounding boxes.
- Losses and Optimization: The model uses focal and dice losses for mask prediction, a mean absolute error (MAE) loss for IoU prediction, and cross-entropy loss for object presence. Only the mask with the lowest segmentation loss is supervised in multi-mask cases. An additional occlusion prediction head handles frames without valid masks.
This training strategy ensures SAM 2 effectively segments and tracks objects across video frames with interactive refinements.
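To make the loss combination concrete, here is a schematic per-object sketch using torchvision's focal loss; the loss weights and shape conventions are illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss computed on sigmoid probabilities; shapes (B, H, W)."""
    pred, target = pred_logits.sigmoid().flatten(1), target.flatten(1)
    inter = (pred * target).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)).mean()

def sam2_style_loss(mask_logits, gt_mask, iou_pred, occlusion_logit, is_visible,
                    w_focal=20.0, w_dice=1.0, w_iou=1.0, w_occ=1.0):
    """Schematic combination of the loss terms described above.

    mask_logits, gt_mask: (B, H, W); iou_pred, occlusion_logit, is_visible: (B,).
    The weights are illustrative, not the paper's exact values.
    """
    # Focal + Dice losses supervise the mask prediction.
    l_focal = sigmoid_focal_loss(mask_logits, gt_mask, reduction="mean")
    l_dice = dice_loss(mask_logits, gt_mask)
    # L1 (mean absolute error) between predicted IoU and the IoU actually achieved.
    with torch.no_grad():
        pred_bin = (mask_logits.sigmoid() > 0.5).float().flatten(1)
        gt_flat = gt_mask.flatten(1)
        inter = (pred_bin * gt_flat).sum(-1)
        union = pred_bin.sum(-1) + gt_flat.sum(-1) - inter
        actual_iou = inter / union.clamp(min=1.0)
    l_iou = F.l1_loss(iou_pred, actual_iou)
    # Cross-entropy on the object-presence / occlusion prediction.
    l_occ = F.binary_cross_entropy_with_logits(occlusion_logit, is_visible)
    return w_focal * l_focal + w_dice * l_dice + w_iou * l_iou + w_occ * l_occ
```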
SA-V Dataset Overview
Example of SA-V dataset with masklets overlaid. SAM 2: Segment Anything in Images and Videos
SAM 2 was trained on the SA-V dataset, which was open-sourced by Meta AI. SA-V includes approximately 600,000+ masklets collected from around 51,000 videos across 47 countries. These numbers stand out: that is >15x more videos and >50x more masks than the previously largest open-source video segmentation datasets (BURST and UVO-dense). This diverse dataset covers various real-world scenarios, providing comprehensive training data for robust model performance.
- Primary data type: Video, mask annotations
- Data function: Training, test
- Videos: 50.9K videos, with 54% indoor and 46% outdoor scenes. Average duration is 14 seconds, spanning diverse environments and capturing footage from 47 countries. Video resolutions range from 240p to 4K, with an average resolution of 1,401 × 1,037 pixels.
- Masklets: Total of 642.6K masklets, including 190.9K manually annotated and 451.7K automatic. This dataset offers approximately 53 times more masks than the largest existing VOS datasets. Automatic masklets are created using multi-tier grid prompting and post-processed for accuracy.
- Annotation Details: Videos have an average length of 13.8 seconds and cover 4.2 million frames in total. Masklets are enhanced by removing tiny disconnected components and filling small holes.
- Dataset Splits: Training, validation, and test sets are carefully curated to minimize overlap of similar objects. The validation set has 293 masklets across 155 videos, and the test set includes 278 masklets in 150 videos. Internal datasets add further training and testing data.
- License: CC BY 4.0 (Creative Commons Attribution 4.0 International).
- Access cost: Open
- Data Collection: Videos were collected via a contracted third-party company.
- Labeling Methods: Masks generated by Meta Segment Anything Model 2 (SAM 2) and human annotators.
- Label Types: Masks are provided in COCO run-length encoding (RLE) format for the training set and in PNG format for validation and test sets.
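Since the training-set masks ship as COCO run-length encodings, they can be decoded into binary arrays with pycocotools. The sketch below assumes a JSON annotation file with a masklet field holding per-frame RLE dicts; the file name and schema are assumptions for illustration, so check the dataset README for the authoritative format.

```python
import json
from pycocotools import mask as mask_utils

# The file name and field names below are assumptions for illustration;
# consult the SA-V README for the authoritative annotation schema.
with open("sav_train/sav_000001_manual.json") as f:
    annotation = json.load(f)

# A masklet is assumed to be a list of per-frame RLE dicts with "size" and "counts" keys.
rle_per_frame = annotation["masklet"][0]

# Decode every frame's RLE into a binary H x W numpy mask.
binary_masks = [mask_utils.decode(rle) for rle in rle_per_frame]
print(len(binary_masks), binary_masks[0].shape, binary_masks[0].dtype)
```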
Performance of SAM 2
Video Tasks
- Interactive Segmentation: SAM 2 excels in zero-shot video segmentation, outperforming previous methods like SAM+XMem++ and SAM+Cutie, and requires about three times fewer interactions.
- Semi-Supervised VOS: Outperforms XMem++ and Cutie across 17 datasets, showing strong performance with click, box, or mask prompts on the first frame.
- Fairness: Shows minimal performance discrepancy across gender and age in the EgoExo4D dataset, with reduced variability in correctly segmented clips.
SAM 2: Segment Anything in Images and Videos
Image Tasks
- Segment Anything Task: Achieves higher accuracy (58.9 mIoU with 1 click) than SAM (58.1 mIoU) and is six times faster. Training on a mix of SA-1B and video data boosts accuracy to 61.4 mIoU. It also performs exceptionally well on new video datasets.
SAM 2: Segment Anything in Images and Videos
State-of-the-Art Comparison
- VOS Benchmark: Outperforms existing methods in accuracy and speed, with significant gains from larger image encoders. Shows a notable gap in "segment anything in videos" compared to prior methods.
- Outperforms Existing Benchmarks: SAM 2 excels in video object segmentation benchmarks such as DAVIS, MOSE, LVOS, and YouTube-VOS compared to prior state-of-the-art models.
- Long-Term VOS: Demonstrates improvements on the LVOS benchmark.
- Speed: SAM 2 operates at around 44 frames per second, providing real-time inference.
- Annotation Efficiency: 8.4 times faster than manual per-frame annotation with SAM.
VOS comparison to previous work. SAM 2: Segment Anything in Images and Videos
Overall, SAM 2 offers superior accuracy, efficiency, and speed in both video and image segmentation tasks.
Real-World Applications of SAM 2
SAM 2 is poised to revolutionize fields such as augmented reality (AR), virtual reality (VR), robotics, autonomous vehicles, and video editing. These applications often require temporal localization beyond image-level segmentation, making SAM 2 an ideal solution.
The model can be integrated into larger systems to create novel experiences. Its video object segmentation outputs can be used as inputs for modern video generation models, enabling precise editing capabilities. SAM 2's extensibility allows for future enhancements with new types of input prompts, facilitating creative interactions in real-time or live video contexts.
SAM 2: Related Work
Segment Anything
Segment Anything (SAM) introduced a promptable image segmentation framework allowing for zero-shot segmentation based on flexible prompts like bounding boxes or points. SAM's capabilities were extended with HQ-SAM for higher-quality outputs and EfficientSAM, MobileSAM, and FastSAM for improved efficiency. SAM's applications have spanned diverse fields, including medical imaging, remote sensing, motion segmentation, and camouflaged object detection.
Interactive Video Object Segmentation (iVOS)
Interactive video segmentation has advanced with methods focusing on user-guided annotations like clicks or scribbles. Early methods used graph-based optimizations, while more recent approaches have employed modular designs for mask propagation. The DAVIS interactive benchmark inspired similar interactive evaluation settings. Approaches integrating SAM with video trackers face challenges such as tracker limitations and re-annotation requirements.
Semi-Supervised Video Object Segmentation (VOS)
Semi-supervised VOS typically involves tracking objects throughout a video using an initial mask provided for the first frame. Early neural network-based methods adapted models with online fine-tuning or offline training. Recent advancements have introduced RNNs and cross-attention mechanisms or employed vision transformers. Although semi-supervised VOS is a specific instance of SAM's Promptable Visual Segmentation, the task of annotating high-quality masks remains challenging.
Data Engine for Creating Video Segmentation Datasets
Early VOS datasets such as DAVIS and YouTube-VOS provided high-quality annotations but were limited in scale for training deep learning models. Recent datasets have addressed this by increasing the difficulty of the VOS task, focusing on aspects like occlusions, long videos, extreme transformations, and object diversity. The SA-V dataset, developed using a multi-phase data engine, significantly expands annotation coverage to include both whole objects and parts, surpassing the scale and diversity of existing datasets. Notably, the iterative data engine, with SAM 2 in the loop, made annotation roughly 8x faster.
Data Engine Phases:
- Phase 1: SAM Per Frame - In this initial phase, SAM was used for per-frame annotation with high spatial precision, but the process was slow (37.8 seconds per frame). This phase collected 16K masklets.
- Phase 2: SAM + SAM 2 Mask - This phase introduced SAM 2 for temporal mask propagation, improving speed (7.4 seconds per frame) while maintaining high quality. It collected 63.5K masklets.
- Phase 3: SAM 2 - The final phase utilized SAM 2’s full capabilities, achieving the highest speed (4.5 seconds per frame) and quality. It collected 197.0K masklets.
Quality Verification and Auto Masklet Generation: A verification step was implemented to ensure high-quality annotations. Automatic masklets were also generated to enhance diversity and identify model limitations. Auto-generated masklets were reviewed and refined, contributing to the dataset’s extensive coverage.
Limitations of SAM 2
- Segmenting Across Shot Changes: SAM 2 may struggle to maintain object segmentation across changes in video shots.
- Crowded Scenes: The model can lose track of or confuse objects in densely populated scenes.
- Long Occlusions: Prolonged occlusions can lead to errors or loss of object tracking.
- Fine Details: Challenges arise with objects that have very thin or fine details, especially if they move quickly.
- Inter-Object Communication: SAM 2 processes multiple objects separately without shared object-level contextual information.
- Quality Verification: The current reliance on human annotators for verifying masklet quality and correcting errors could be improved with automation.
Try SAM 2
SAM 2 is available under an Apache 2.0 license, allowing developers and researchers to install it on a GPU machine. Meta has also released the SA-V dataset, a web-based demo, and the research paper, facilitating further innovation and development in computer vision systems for the AI community. Access the GitHub repo here.
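For reference, here is a minimal image-prediction sketch that follows the usage pattern shown in the repository README; the module paths, config name, and checkpoint file are assumptions that may differ across releases, so verify them against the repo.

```python
import numpy as np
import torch
from PIL import Image

# Module paths, config, and checkpoint names follow the pattern in the SAM 2 repository
# README at the time of writing; verify them against the repo before running.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # downloaded via the repo's checkpoint script
model_cfg = "sam2_hiera_l.yaml"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single positive click at pixel (x=500, y=375).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```

The repository also provides a video predictor for propagating masklets across the frames of a clip, which mirrors the interactive refinement workflow described earlier.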
Written by
Akruti Acharya
Frequently Asked Questions

What is SAM 2?
SAM 2 is an advanced version of the Segment Anything Model, developed by Meta. It unifies image and video segmentation capabilities into a single model, allowing for real-time, promptable object segmentation in both static images and dynamic video content. Unlike the original SAM, SAM 2 can process video frames sequentially thanks to its innovative streaming memory design.

What does SAM 2's architecture look like?
SAM 2's architecture includes a streaming memory design, a transformer-based image encoder, a prompt encoder for various input types, and a mask decoder for accurate segmentation. It is optimized for real-time performance, enabling interactive prompting in web browsers in approximately 50 milliseconds.

How does SAM 2 compare to previous models?
SAM 2 outperforms previous models in both accuracy and speed. It achieves better video segmentation accuracy with 3x fewer interactions and is 6x faster at image segmentation than the original SAM, making it more efficient and effective for real-time applications.

What are SAM 2's real-world applications?
SAM 2 has diverse applications, including video editing, autonomous vehicles, robotics, scientific research, environmental monitoring, augmented reality, fashion retail, and data annotation. Its ability to segment objects in real time across both images and videos makes it valuable in various industries.

What are SAM 2's limitations?
SAM 2 faces challenges in tracking objects across drastic viewpoint changes, long occlusions, and crowded scenes. It struggles with segmenting thin or fast-moving objects and is less effective in low-contrast scenarios. Additionally, it processes multiple objects separately, which can impact efficiency in complex scenes.

Is SAM 2 open source?
Yes, SAM 2 is available under an Apache 2.0 license, allowing developers and researchers to use and build upon it. Meta has also released the SA-V dataset, a web-based demo, and the research paper, facilitating further innovation and development in computer vision.