Nikolaj Buhl
Published April 6, 2023 · Edited May 18, 2023 · 6 min read

Meta AI's New Breakthrough: Segment Anything Model (SAM) Explained


If you thought the AI space was already moving fast with ChatGPT, GPT-4, and Stable Diffusion, then strap in and get ready for the next groundbreaking innovation in AI.

Meta’s FAIR lab has just released the Segment Anything Model (SAM), a state-of-the-art image segmentation model that aims to change the field of computer vision. 

SAM draws inspiration from the foundation models that have had a significant impact on natural language processing (NLP). It focuses on promptable segmentation tasks, using prompt engineering to adapt to diverse downstream segmentation problems.

Why are we so excited about SAM?

 Having tested it out for a day now, we can see the following incredible advances:

  • SAM can segment objects with a single click, or by interactively selecting points to include in or exclude from the object. You can also prompt it with a bounding box, or segment a region with a polygon tool and let the mask snap to the object.
  • When encountering uncertainty in identifying the object to be segmented, SAM is capable of producing multiple valid masks.
  • SAM has the ability to identify and generate masks for all objects present in an image automatically.
  • After precomputing the image embedding once, SAM can return a segmentation mask for any prompt almost instantly, enabling real-time interaction with the model (see the short usage sketch below).
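
To make this concrete, here is a minimal sketch using Meta's open-source segment-anything package (pip install segment-anything). The checkpoint and image paths are placeholders you would swap for your own files.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pre-trained SAM checkpoint (here the ViT-H variant).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

# The heavy image embedding is computed once here...
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# ...so each subsequent prompt only runs the lightweight decoder.
# A single foreground click (label 1 = include, 0 = exclude).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several valid masks for ambiguous prompts
)
best_mask = masks[np.argmax(scores)]
```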

In this blog post, you will: 

  • Understand what SAM is and why it’s a game-changer.
  • Learn how it fares compared to previous models.
  • See what’s inside SAM: its network architecture, design, and implementation.
  • Learn potential uses of SAM for AI-assisted labeling.

Update: Do you want to eliminate manual segmentation? Learn how to use the Segment Anything Model (SAM) to reduce labeling costs with Encord! Read the product announcement, or go straight to a free trial! You can also check out our tutorial on how to fine-tune Segment Anything here.

A Brief History of Meta's AI & Computer Vision

As one of the leading companies in the field of artificial intelligence (AI), Meta has been pushing the boundaries of what's possible with machine learning models, from recently released open-source models such as LLaMA to PyTorch, one of the most widely used Python libraries for ML and AI.

The following sections delve into advances in computer vision and the growth of foundation models.

Advances in Computer Vision

Computer vision has also experienced considerable advancements, with models like CLIP bridging the gap between text and image understanding.

These models use contrastive learning to map text and images into a shared embedding space. This allows them to generalize to new visual concepts and data distributions through prompt engineering.
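
As a rough illustration of that contrastive idea, here is a sketch using OpenAI's open-source clip package; the image path and candidate labels are placeholders, and "ViT-B/32" is just one of the published model variants.

```python
import clip
import torch
from PIL import Image

# Load a published CLIP variant and its matching image preprocessor.
model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)        # placeholder image
text = clip.tokenize(["a photo of a dog", "a photo of a cat"])  # text "prompts"

with torch.no_grad():
    # Both modalities are mapped into the same embedding space...
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# ...so cosine similarity ranks the text prompts against the image (zero-shot).
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```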

FAIR’s Segment Anything Model (SAM) is the latest breakthrough in this field. Their goal was to create a foundation model for image segmentation that can adapt to various downstream tasks using prompt engineering.

Let’s briefly explore some of the key developments in computer vision that have contributed to the growth of AI systems like Meta's.

Convolutional Neural Networks (CNNs)

CNNs, first introduced by Yann LeCun (now VP & Chief AI scientist at Meta) in 1989, have emerged as the backbone of modern computer vision systems, enabling machines to automatically learn and recognize complex patterns in images. 

By employing convolutional layers, CNNs can capture local and global features in images, allowing them to effectively recognize objects, scenes, and actions. This has led to significant improvements in tasks such as image classification, object detection, and semantic segmentation.
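
As a minimal PyTorch sketch of that idea (layer sizes are arbitrary and purely illustrative): stacked convolutions capture increasingly global patterns, and a small linear head turns them into class scores.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy CNN for illustration: conv layers extract features, a linear head classifies."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local edges and textures
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # larger-scale patterns
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # global summary of the image
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 224, 224))  # -> shape (1, 10)
```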

Generative Adversarial Networks (GANs)

GANs are a type of deep learning model introduced by Ian Goodfellow and his team in 2014. They consist of two neural networks, a generator and a discriminator, that compete with each other.

The generator aims to create realistic outputs, while the discriminator tries to distinguish between real and generated outputs. The competition between these networks has resulted in the creation of increasingly realistic synthetic images and has led to advances in tasks such as image synthesis, data augmentation, and style transfer.
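
A bare-bones sketch of that two-player setup (shapes are arbitrary and chosen only for illustration; a real GAN also needs a dataset and an alternating training loop):

```python
import torch
import torch.nn as nn

generator = nn.Sequential(           # noise vector -> fake "image"
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)
discriminator = nn.Sequential(       # image -> probability that it is real
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

noise = torch.randn(8, 64)
fake_images = generator(noise)
# Training pushes this score towards "real" (1) for the generator
# and towards "fake" (0) for the discriminator.
fake_score = discriminator(fake_images)
```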

Transfer learning and pre-trained models

Similar to NLP, computer vision has benefited from the development of pre-trained models that can be fine-tuned for specific tasks. Models such as ResNet, VGG, and EfficientNet have been trained on large-scale image datasets, allowing researchers to use these models as a starting point for their own projects.
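
A common pattern with torchvision, for example, is to start from ImageNet-pre-trained weights and swap in a task-specific classification head (the five classes below are just a placeholder):

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 backbone with ImageNet-pre-trained weights.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pre-trained backbone and fine-tune only the new head.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)  # e.g. 5 task-specific classes
```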

The Growth of Foundation Models

Foundation models in natural language processing (NLP) have made significant strides in recent years, with models like Meta’s own LLaMA or OpenAI’s GPT-4 demonstrating remarkable capabilities in zero-shot and few-shot learning.

These models are pre-trained on vast amounts of data and have the ability to generalize to new tasks and data distributions by using prompt engineering. Meta AI has been instrumental in advancing this field, fostering research and the development of large-scale NLP models that have a wide range of applications.

Here, we explore the factors contributing to the growth of foundation models.

Large-scale language models

The advent of large-scale language models like GPT-4 has been a driving force behind the development of foundation models in NLP. These models employ deep learning architectures with billions of parameters, allowing them to capture complex patterns and structures in the training data. 

Transfer learning

A key feature of foundation models in NLP is their capacity for transfer learning. Once trained on a large corpus of data, they can be fine-tuned on smaller, task-specific datasets to achieve state-of-the-art performance across a variety of tasks.

Zero-shot and few-shot learning

Foundation models have also shown promise in zero-shot and few-shot learning, where they can perform tasks without any fine-tuning or with minimal task-specific training data. This capability is largely attributed to the models' ability to understand and generate human-like responses based on the context provided by prompts.

Multi-modal learning

Another growing area of interest is multi-modal learning, where foundation models are trained to understand and generate content across different modalities, such as text and images. 

Models like CLIP and ALIGN show how NLP and computer vision could be used together to make multi-modal models that can translate actions from one domain to another.

Ethical Considerations and Safety

The growth of foundation models in NLP has also raised concerns about their ethical implications and safety. Researchers are actively exploring ways to mitigate potential biases, address content generation concerns, and develop safe and controllable AI systems. One sign of this was the recent open letter calling for a six-month pause on the development of cutting-edge models.

Comparing Segment Anything Model to Previous Models

SAM is a big step forward for AI because it builds on the foundations set by earlier models. It can take input prompts from other systems; in the future, for example, a user's gaze from an AR/VR headset could be used to select an object. Its output masks can also feed downstream uses such as video editing, abstracting 2D objects into 3D models, and even popular Google Photos tasks like creating collages.

It can handle tricky situations by generating multiple valid masks where the prompt is unclear. Take, for instance, a user’s prompt for finding Waldo:

Image displaying semantic segmentations produced by the Segment Anything Model (SAM)

Source

One reason SAM's results are groundbreaking is the quality of its segmentation masks compared with other techniques such as ViTDet. The illustration below compares the two techniques:

Image displaying examples of segmentation masks produced by humans, ViTDet, and the Segment Anything Model (SAM)

 Source

The research paper compares the results of both techniques in more detail.

Dive into SAM's Network Architecture and Design

SAM’s design hinges on three main components:

  1. The promptable segmentation task to enable zero-shot generalization.
  2. The model architecture.
  3. The dataset that powers the task and model.

Image displaying the foundation model architecture for the Segment Anything (SA) model

Source

Task

SAM was trained on 11 million images and over 1.1 billion masks to return a valid segmentation mask for any prompt. The prompt, in this case, defines what to segment and can be foreground/background points, a rough box or mask, clicks, text, or, in general, any information indicating what to segment in an image. The promptable segmentation task is also used as the pre-training objective for the model.
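
In the open-source package, switching between prompt types is just a matter of passing different arguments to the same predictor. As a sketch (paths and coordinates are placeholders), here is the same model driven by a rough bounding box instead of clicks:

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Smaller ViT-B variant; the checkpoint path is a placeholder.
predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth"))
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))

# A box prompt is given as (x0, y0, x1, y1) pixel coordinates.
masks, _, _ = predictor.predict(
    box=np.array([100, 150, 400, 500]),  # hypothetical box around the object
    multimask_output=False,              # a box prompt is usually unambiguous
)
```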

Model

SAM’s architecture comprises three components that work together to return a valid segmentation mask:

  • An image encoder to generate one-time image embeddings.
  • A prompt encoder that embeds the prompts.
  • A lightweight mask decoder that combines the embeddings from the prompt and image encoders.

Image displaying the components of the Segment Anything (SA) model

Segment Anything Model (SAM) components. | Source

We will dig deeper into the architecture in the next section, but for now, let’s take a look at the dataset.

Data Engine and Dataset

A data engine is needed to power the tasks and improve the dataset and model. The data engine has three stages:

  • Assisted-manual, where SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup.
  • Semi-automatic, where SAM can automatically generate masks for a subset of objects by prompting it with likely object locations, and annotators focus on annotating the remaining objects, helping increase mask diversity.
  • Fully automatic, where SAM is prompted with a regular grid of foreground points, yielding on average roughly 100 high-quality masks per image (a usage sketch of this mode follows below).

This data engine was used to build the large Segment Anything 1-Billion mask dataset (SA-1B) that Meta AI released.
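
The open-source package exposes this fully automatic mode through SamAutomaticMaskGenerator, which prompts SAM with a grid of points. A quick sketch (checkpoint and image paths are placeholders):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)  # 32x32 point grid

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict with the binary mask plus quality scores, e.g.
# masks[0]["segmentation"], masks[0]["predicted_iou"], masks[0]["stability_score"]
```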

A Quick Guide to the Segment Anything Model’s Structure

Image displaying the architecture of the Segment Anything (SA) universal segmentation model

Source

Image encoder

At the highest level, an image encoder (a Vision Transformer, ViT, pre-trained as a masked autoencoder, MAE) generates a one-time image embedding and can be run before the model is prompted.

Prompt encoder

The prompt encoder encodes points (foreground or background), bounding boxes, masks, or free-form text into embedding vectors in real time. The research considers two sets of prompts: sparse (points, boxes, text) and dense (masks).

Points and boxes are represented by positional encodings and added with learned embeddings for each prompt type. Free-form text prompts are represented with an off-the-shelf text encoder from CLIP. Dense prompts, like masks, are embedded with convolutions and summed element-wise with the image embedding.

Mask decoder

A lightweight mask decoder predicts the segmentation masks based on the embeddings from both the image and prompt encoders. It maps the image embedding, prompt embeddings, and an output token to a mask. All of the embeddings are updated by the decoder block, which uses prompt self-attention and cross-attention in two directions (from prompt to image embedding and back).

In the data engine, the predicted masks are reviewed, annotated, and used to update the model weights. This loop improves both the dataset and the model over time, making the setup efficient and flexible.
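
Based on our reading of the open-source code, the three modules fit together roughly as follows at inference time. This is a simplified sketch: the packaged SamPredictor wraps all of it, including image pre- and post-processing, and the snippet assumes sam, preprocessed_image, and box_tensor have already been prepared.

```python
import torch

# Assumptions (placeholders): `sam` is the model loaded from sam_model_registry,
# `preprocessed_image` is a (1, 3, 1024, 1024) float tensor, and
# `box_tensor` is a (1, 4) box prompt in the resized image's coordinates.
with torch.no_grad():
    # 1. Heavy image encoder: run once per image.
    image_embedding = sam.image_encoder(preprocessed_image)

    # 2. Lightweight prompt encoder: sparse (points/boxes/text) and dense (mask) prompts.
    sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box_tensor, masks=None)

    # 3. Lightweight mask decoder: combines both embeddings into low-resolution masks.
    low_res_masks, iou_predictions = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=True,
    )
```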

Segment Anything 1-Billion Mask Dataset

The Segment Anything 1 Billion Mask (SA-1B) dataset is the largest labeled segmentation dataset to date. It is specifically designed for the development and evaluation of advanced segmentation models.

We think the dataset will be an important part of training and fine-tuning future general-purpose models, allowing them to achieve remarkable performance across diverse segmentation tasks. For now, the dataset is available under a research-only license.

The SA-1B dataset is unique due to its:

  • Diversity
  • Size
  • High-quality annotations

Diversity

The dataset is carefully curated to cover a wide range of domains, objects, and scenarios, ensuring that the model can generalize well to different tasks. It includes images from various sources, such as natural scenes, urban environments, medical imagery, satellite images, and more. 

This diversity helps the model learn to segment objects and scenes with varying complexity, scale, and context.

Image displaying the distribution of images and masks used to train the Segment Anything (SA) model

Source

Size

The SA-1B dataset contains 11 million images annotated with over 1.1 billion high-quality masks, providing ample training data for the model. The sheer volume of data helps the model learn complex patterns and representations, enabling it to achieve state-of-the-art performance on different segmentation tasks.

Image displaying the relative size of SA-1B used to train the Segment Anything (SA) model versus previous datasets

Source

High-quality annotations

The dataset has been carefully annotated with high-quality masks, leading to more accurate and detailed segmentation results. In the Responsible AI (RAI) analysis of the SA-1B dataset, potential fairness concerns and biases in geographic and income distribution were investigated.

The research paper showed that SA-1B has a substantially higher percentage of images from Europe, Asia, and Oceania, as well as middle-income countries, compared to other open-source datasets. It's important to note that the SA-1B dataset features at least 28 million masks for all regions, including Africa. This is 10 times more than the total number of masks in any previous dataset.

Image displaying the distribution of the images used to train the Segment Anything (SA) model

Source

At Encord, we think the SA-1B dataset will enter the computer vision hall of fame (together with famous datasets such as COCO, ImageNet, and MNIST) as a resource for the development of future computer vision segmentation models.

Is Segment Anything Model Open Source?

The short answer is yes! The SA-1B dataset has been released for research purposes, and Meta AI has released the pre-trained models (~2.4 GB in size) and code under Apache 2.0 (a permissive license), following FAIR’s commitment to open research. The code and models are freely accessible on GitHub, the training dataset is available to download, and there is an interactive demo web UI.

All linked from the project webpage:

Image displaying a visual from the FAIR Segment Anything (SA) paper

Source

How to Fine-Tune Segment Anything Model (SAM)

Now that you know the dataset SAM was trained on, you can assess whether data representative of your tasks is covered in the SA-1B dataset. If it is not, or is underrepresented, you can consider fine-tuning SAM’s weights on your own dataset. Yes, that is possible with the pre-trained models, which are also open-sourced.

As of the time of this writing, the Meta AI team has not added a way to fine-tune SAM for specific applications, but Alex Bonnet, our Machine Learning Solutions Engineer, has prepared a step-by-step guide on how to do this.

The steps include:

  • Create a custom dataset by extracting the bounding box coordinates (prompts for the model), and extracting the ground truth segmentation masks.
  • Prepare for fine-tuning by converting the input images to PyTorch tensors, a format SAM's internal functions expect.
  • Run the fine-tuning step by instantiating a training loop that iterates over the data items, generates masks with the lightweight mask decoder, and compares them to your ground-truth masks. The mask decoder is easier, faster, and more memory-efficient to fine-tune (a condensed sketch follows this list).
  • Compare your fine-tuned model to the original model to make sure there is indeed a significant improvement.
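
Below is a condensed, hedged sketch of what such a loop can look like. It assumes you have already built a train_loader that yields precomputed image embeddings, box prompts, and ground-truth masks resized to the decoder's 256x256 output resolution; the exact preprocessing is covered in the full tutorial.

```python
import torch

# `sam` is the model loaded from sam_model_registry; only the mask decoder is trained.
optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-5)
loss_fn = torch.nn.MSELoss()

for image_embedding, box_torch, gt_mask in train_loader:  # assumed dataloader
    with torch.no_grad():  # keep the image and prompt encoders frozen
        sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box_torch, masks=None)

    low_res_masks, _ = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )
    pred_mask = torch.sigmoid(low_res_masks)  # compare against binary ground truth

    loss = loss_fn(pred_mask, gt_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```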

To see the fine-tuning process in practice, check out our detailed blog post that also includes a Colab notebook as a walkthrough.

How to Use the Segment Anything Model for AI-Assisted Labeling

At Encord, we see the Segment Anything Model (SAM) as a game changer in AI-assisted labeling. It basically eliminates the need to go through the pain of segmenting images with polygon drawing tools and allows you to focus on the data tasks that are more important for your model. 

These other data tasks include mapping the relationships between different objects, giving them attributes that describe how they act, and evaluating the training data to make sure it is balanced, diverse, and free of bias.

Image displaying brain MRI with and without segmentation masks produced by Segment Anything (SA)

Enhancing Manual Labeling with AI

SAM can be used to create AI-assisted workflow enhancements and boost productivity for annotators. Here are just a few improvements we think SAM can contribute:

Image displaying valid masks generated by SAM from a single ambiguous point prompt

Source

  • Improved accuracy: Annotators can achieve more precise and accurate labels, reducing errors and improving the overall quality of the annotated data.
  • Faster annotation: No doubt that SAM will speed up the labeling process, enabling annotators to complete tasks more quickly and efficiently when combined with a suitable image annotation tool.
  • Consistency: Having all annotators use a version of SAM would ensure consistency across annotations, which is particularly important when multiple annotators are working on the same project.
  • Reduced workload: By automating the segmentation of complex and intricate structures, SAM significantly reduces the manual workload for annotators, allowing them to focus on more challenging and intricate tasks.
  • Continuous learning: As annotators refine and correct SAM's assisted labeling, we could implement it such that the model continually learns and improves, leading to better performance over time and further streamlining the annotation process.

So integrating SAM into the annotation workflow is a no-brainer from our side, and it would allow our current and future customers to accelerate the development of cutting-edge computer vision applications.

How SAM Contributes to AI-Assisted Labeling

To give an example of how SAM can contribute to AI-assisted labeling, consider the medical image example from before. We uploaded the DICOM image to the demo web UI, and spent 10 seconds clicking the image to segment the different areas of interest.

Afterward, we did the same exercise with manual labeling using polygon annotations, which took 2.5 minutes. A 15x improvement in the labeling speed!

Image displaying manually produced vs. SAM-produced segmentation masks on a brain MRI scan

We’re excited to start building this capability into Encord’s platform. Do reach out if you want to hear more.

Real-World Use Cases and Applications

SAM can be used in almost every single segmentation task, from instance segmentation to panoptic segmentation. We’re excited about how quickly SAM can help you pre-label objects with almost pixel-perfect segmentation masks before your expert reviewer adds the ontology on top. 

From agriculture and retail to medical and geospatial imagery, the possibilities for AI-assisted labeling with SAM are endless. It is hard to imagine a world where SAM does not become a default feature in all major annotation tools. This is why we at Encord are very excited about this new technology.

Find other applications that could leverage SAM below. 

Image and video editors

SAM’s outstanding ability to produce accurate segmentation masks for even the most complex images and videos can give image and video editing applications automatic object selection capabilities. Prompts such as point coordinates and bounding boxes are represented with positional encodings, and point prompts carry a foreground/background label to indicate whether they belong to the object being selected.

Generating synthetic datasets for low-resource industries

One challenge that has plagued computer vision applications in industries like manufacturing is the lack of datasets. For example, a company building car parts and planning to detect defects along the production line may not be able to gather a large dataset for that use case.

You can use SAM to generate synthetic datasets for your applications. If you realize SAM does not work particularly well for your applications, an option is to fine-tune it on existing datasets.

Gaze-based Segmentation

AR applications can use SAM’s zero-shot ability to produce a valid mask from a single point to segment objects through devices like AR glasses based on where the user is gazing. This can help AR technologies give users a more realistic sense of the world as they interact with those objects.

Where Does This Leave Us?

The Segment Anything Model (SAM) truly represents a groundbreaking development in the field of computer vision. By leveraging promptable segmentation tasks, SAM can adapt to a wide variety of downstream segmentation problems using prompt engineering. 

This innovative approach, combined with the largest labeled segmentation dataset to date (SA-1B), allows SAM to achieve state-of-the-art performance in various segmentation tasks.

With the potential to significantly enhance AI-assisted labeling and reduce manual labor in image segmentation tasks, SAM can pave the way in industries such as agriculture, retail, medical imagery, and geospatial imagery. 

At Encord, we recognize the immense potential of SAM, and we are soon bringing the model to the Encord Platform to support AI-assisted labeling, further streamlining the data annotation process for users.

As an open-source model, SAM will inspire further research and development in computer vision, encouraging the AI community to push the boundaries of what is possible in this rapidly evolving field. 

Ultimately, SAM marks a new chapter in the story of computer vision, demonstrating the power of foundation models in transforming the way we perceive and understand the world around us.

Make foundation models your own.

Frequently Asked Questions on Segment Anything Model (SAM)

How do I fine-tune SAM for my tasks?

We have provided a step-by-step walkthrough you can follow to fine-tune SAM for your tasks. Check out the tutorial in this blog post.

What datasets were used to train SAM?

The Segment Anything 1 Billion Mask (SA-1B) dataset has been touted as the “ImageNet of segmentation tasks.” The images vary across subject matter. Scenes, objects, and places frequently appear throughout the dataset. Masks range from large-scale objects such as buildings to fine-grained details like door handles.

See the data card and dataset viewer to learn more about the composition of the dataset.

Does SAM work well for all tasks?

In most cases, yes. You can automatically select individual objects, and it works very well even on complex images. SAM is a foundation model that brings multi-task segmentation capabilities to any application you plug it into, although fine-tuning can still help for specialized or underrepresented domains.

Does SAM work well for ambiguous images?

Yes, it does. When a prompt is ambiguous, SAM returns multiple valid masks, so you might find overlapping or near-duplicate masks when you run it over your dataset. This allows you to select the most appropriate mask for your task; for the rest, add a post-processing step that keeps only the most suitable masks (a small sketch follows below).
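
As a hypothetical example of such a post-processing step (the function name and the 0.9 threshold are our own choices, not part of SAM), you could keep the highest-scoring mask among any group of heavily overlapping ones:

```python
import numpy as np

def deduplicate_masks(masks: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.9):
    """masks: (N, H, W) boolean array; scores: (N,) predicted quality scores."""
    order = np.argsort(-scores)  # best-scoring masks first
    kept = []
    for idx in order:
        is_duplicate = False
        for kept_idx in kept:
            intersection = np.logical_and(masks[idx], masks[kept_idx]).sum()
            union = np.logical_or(masks[idx], masks[kept_idx]).sum()
            if union > 0 and intersection / union > iou_thresh:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(idx)
    return masks[kept], scores[kept]
```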

SAM model output

Source

How long does it take SAM to generate segmentation masks?

Once the image embedding has been precomputed, SAM can generate a segmentation mask for a prompt in as little as 50 milliseconds, which is practically real time!

Do I need a GPU to run SAM?

Although it is possible to run SAM on a CPU, a GPU gives significantly faster results, especially for the heavy image encoder.
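
As a minimal sketch (assuming sam has been loaded from sam_model_registry as in the earlier snippets), moving the model to a GPU is a one-liner, and the predictor will use whatever device the model lives on:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
sam.to(device=device)  # `sam` loaded from sam_model_registry, as in the earlier snippets
```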
