Akruti Acharya October 31, 2022

Image Segmentation for Computer Vision: Best Practice Guide


Image segmentation is a computer vision task that divides a digital image into several components.

In an era where cameras and other devices are increasingly used to analyze and interpret the world, image segmentation has become an indispensable technique for teaching machines to understand their surroundings.

In this guide, I will start by answering what image segmentation is, how it works, and what the different approaches to implementing it are. We will then look at how it is useful in the annotation process and at its use cases in the real world. Finally, we will discuss some tips to consider when building a trained model for image segmentation, along with resources you may need to delve further into the subject.

This tutorial is for you if you are a data scientist or machine learning engineer, or if your team is considering using image segmentation as part of an artificial intelligence computer vision project.

What is Image Segmentation?

Image segmentation is the process of splitting a digital image into subgroups known as image segments, which reduces the complexity of the entire image and allows each segment to be processed or analyzed separately. Each pixel is labeled, and all pixels belonging to the same category share a common label. Segmentation can be done in two ways:

  • Similarity: As the name suggests, the segments are formed by detecting similarity between image pixels. It is often done by thresholding (see below for more on thresholding). Machine learning algorithms (such as clustering) are based on this type of approach for image segmentation.
  • Discontinuity: Here, the segments are formed based on the change of pixel intensity values within the image. This strategy is used by line, point, and edge detection techniques to obtain intermediate segmentation results that may be processed to obtain the final segmented image. 

Image Annotation and Segmentation Techniques

Now that we have a basic idea of what image segmentation is, let’s look at how it works.

Image segmentation can be viewed as a black box: it accepts an image as input and outputs a matrix, usually called a mask, whose elements indicate which class or instance each pixel belongs to.

If you were thinking of top-level image features that might be useful for segmenting images, a variety of heuristics would come to mind, for example color or contrast. A dark red house can be segmented from a bright blue sky by looking for pixel boundaries with high contrast values (the discontinuity approach).

Now, heuristics like these form the basis for traditional image segmentation algorithms based on image features. For example:


Thresholding is one of the simplest image segmentation methods. Pixels are divided into classes based on their intensity relative to a fixed threshold value. This method is suitable when the difference in pixel values between the two target classes is significant. In low-noise images the threshold can be kept constant, while in noisy images dynamic thresholding performs better. In threshold-based segmentation, the greyscale image is divided into two segments based on each pixel's relationship to the threshold value, producing a binary image. Algorithms like contour detection and identification work on these binarized images.


Fig 1: Left shows the original image and the right shows the thresholded image. Source: author
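As a minimal sketch, fixed-value thresholding can be written in a few lines of NumPy (the threshold value and the toy image below are illustrative):

```python
import numpy as np

def threshold_segment(image, threshold=128):
    """Split a greyscale image into two segments: pixels at or above
    the threshold become foreground (1), the rest background (0)."""
    return (image >= threshold).astype(np.uint8)

# A toy 4x4 "image": a bright square on a dark background.
img = np.array([
    [10,  20,  15,  12],
    [18, 200, 210,  14],
    [11, 220, 205,  16],
    [13,  17,  19,  10],
], dtype=np.uint8)

mask = threshold_segment(img, threshold=128)  # binary segmentation mask
```

For noisy images, a dynamic method such as Otsu's thresholding (which picks the threshold from the histogram) is typically used instead of a fixed value.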


Edge-based segmentation, as the name suggests, relies on edges found in an image using various edge detection operators. These edges mark discontinuities in gray level, color, texture, etc. Edge detection is carried out by choosing a specific filter and convolving it with the image; these filters are designed to detect edges based on variations in contrast, texture, color, and saturation.

The output of the edge detector is an intermediate segmentation result, as it is just chunks of small borders. This intermediate result should then be processed further by a region-filling algorithm to create fully formed edges, after which the image is classified into pixels that are edges and pixels that are not.


Fig 2: Edge segmented image of a turtle. Source: author
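To make the filtering step concrete, here is a hedged, pure-NumPy sketch of Sobel edge detection on a toy image (real pipelines would use an optimized library routine):

```python
import numpy as np

def filter2d(image, kernel):
    """Slide a filter over the image (valid region, no padding) and
    record its response at every position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel operators respond to intensity changes in x and y directions.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Toy image: dark left half, bright right half, i.e. one vertical edge.
img = np.zeros((5, 6))
img[:, 3:] = 255.0

gx = filter2d(img, sobel_x)
gy = filter2d(img, sobel_y)
magnitude = np.hypot(gx, gy)              # gradient strength per pixel
edges = (magnitude > 0).astype(np.uint8)  # intermediate edge mask
```

The `edges` mask here is exactly the intermediate result described above: a thin band of edge pixels that still needs region filling or joining before it becomes a full segmentation.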


Clustering-based segmentation is the most common traditional method of segmentation, as it provides reasonably good segments and is faster than other algorithms. One of the most widely used clustering algorithms is k-means. K-means takes all the pixels into consideration and clusters them into k classes based on each pixel's distance from the cluster means as they form.


Fig 3: Showing the result of segmenting the image at k=2,4,10. Source
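A minimal k-means sketch over pixel intensities (pure NumPy with toy data; production code would typically use `cv2.kmeans` or scikit-learn's `KMeans` instead):

```python
import numpy as np

def kmeans_pixels(pixels, k=2, iters=20, seed=0):
    """Cluster 1-D pixel intensities into k groups by repeatedly
    assigning each pixel to the nearest cluster mean, then updating
    each mean from its assigned pixels."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(pixels.astype(float), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean()
    return labels, centers

# Toy image with two obvious intensity groups (dark vs bright).
img = np.array([[10, 12, 240], [11, 238, 242]], dtype=float)
labels, centers = kmeans_pixels(img.ravel(), k=2)
segmented = labels.reshape(img.shape)  # per-pixel cluster id, i.e. a mask
```

Raising k, as in Fig 3, simply splits the intensity range into more segments.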

To learn more about the mathematical implementations of these traditional image segmentation techniques, take a look at this blog on Towards Data Science. There are many more traditional algorithms available, but here I have covered the ones most commonly in use.

These techniques are simple, fast, and memory-efficient, but they are best suited to simpler segmentation tasks: they often require tuning to customize the algorithm to the use case, and they provide limited accuracy on complex scenes.

Neural networks also provide solutions for image segmentation: instead of relying on hand-crafted functions like traditional algorithms, they are trained to identify which features in an image are important. Neural nets that perform segmentation typically use an encoder-decoder structure. The encoder extracts features of the image through progressively deeper layers with smaller spatial resolution. If the encoder is pre-trained on a task like image or face recognition, it can reuse that knowledge to extract features for segmentation (transfer learning). The decoder then, over a series of layers, inflates the encoder's output into a segmentation mask matching the pixel resolution of the input image.


Fig 4: Basic architecture of a neural network model for image segmentation. Source
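As a shape-only illustration of this encoder-decoder flow (no learned weights; 2x2 max-pooling stands in for the encoder and nearest-neighbour repetition for the decoder):

```python
import numpy as np

def max_pool_2x2(x):
    """Encoder step: halve spatial resolution by taking 2x2 maxima."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x(x):
    """Decoder step: double spatial resolution by repeating pixels."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(16, dtype=float).reshape(4, 4)
features = max_pool_2x2(img)   # (2, 2): compressed representation
mask = upsample_2x(features)   # (4, 4): back to the input resolution
```

A real network replaces both steps with learned convolutional layers, but the shape contract is the same: the decoder's output mask matches the input image resolution.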

There are many deep learning models that are quite adept at performing segmentation reliably. Let's have a look at a few of them:


U-Net is a modified fully convolutional network. It was primarily proposed for medical purposes, i.e., to detect tumors in the lungs and brain. It has a symmetric encoder and decoder. Unlike a plain fully convolutional network, which recovers resolution through upsampling alone, U-Net adds shortcut (skip) connections designed to tackle the problem of information loss: high-level decoder features are concatenated with the corresponding low-level encoder features, so the network captures finer detail and retains more information. This allows the network to yield more accurate results.


Fig 5: U-Net Architecture. Source


SegNet is also a deep fully convolutional network, designed especially for semantic pixel-wise segmentation. Like U-Net, SegNet's architecture consists of encoder and decoder blocks. SegNet differs from other neural networks in the way its decoder upsamples features: the decoder reuses the pooling indices computed in the encoder's max-pooling layers to perform non-linear upsampling, which eliminates the need to learn upsampling. SegNet is primarily designed for scene-understanding applications.


Fig 6: SegNet Architecture. Source


DeepLab is primarily a convolutional neural network (CNN) architecture. Like the fully convolutional network (FCN), it uses the features from the last convolutional block and upsamples them to the input resolution. Its key difference is the use of atrous (dilated) convolutions, which enlarge the receptive field and capture more information while keeping the computation cost down.


Fig 7: Encoder-Decoder architecture of DeepLab v3. Source

Modes of Image Segmentation

Image segmentation modes are divided into three categories based on the amount and type of information that should be extracted from the image: Instance, semantic, and panoptic. Let’s look at these various modes of image segmentation methods.

To understand the three modes of image segmentation, it helps to first distinguish objects from background.

Objects are the countable things in an image, each of which can be assigned an ID. The background comprises the categories that cannot be counted, like the sky, water bodies, etc.


Fig 8: Original Image on which the three modes of segmentation will be shown. Source: author

Instance Segmentation

Instance segmentation is a technique for detecting, segmenting, and classifying each individual object in an image. Pixels are categorized on the basis of object boundaries, so overlapping objects are kept separate rather than merged into one region, even when the algorithm has no idea of the class of each region.


Fig 9: Instance segmentation on Fig 8. The dog and the human have been separated from the background. Source: author

For example, the image above has been processed by an instance segmentation algorithm. It separates the individual objects in the scene (here, the dog and the person) from each other as well as from the background. It provides a polygon boundary and an object segmentation map for each instance, thereby allowing you to count the instances in the image.

Semantic Segmentation

Semantic segmentation classifies each pixel into particular classes with no other information or context taken into consideration. The algorithm takes an image as input and generates a segmentation map where the pixel value (0,1,...255) of the image is transformed into class labels (0,1,...n).


Fig 10: Semantic segmentation on Fig 8. Source: author

For example, using the same image as input to a semantic segmentation algorithm, we can see that it classifies pixels into background and pedestrian. Unlike with instance segmentation, there is no way to recover further information, such as an object count, from the result.

Panoptic Segmentation

Panoptic segmentation is a combination of semantic and instance segmentation. Here each instance of an object in the image is separated and the object's identity is predicted. This mode of image segmentation provides the maximum amount of high-quality granular information from machine learning algorithms.


Fig 11 : Panoptic Segmentation on Fig 8. Source: author
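One common way to store a panoptic result is a single map that encodes both the class label and the instance id for every pixel, for example the `class_id * 1000 + instance_id` convention used by Cityscapes (the toy labels below are illustrative):

```python
import numpy as np

# Illustrative semantic map (0 = background, 1 = "person" class) and
# instance map (0 = no instance, 1 and 2 = two separate people).
semantic = np.array([[0, 1, 1],
                     [0, 1, 1]])
instance = np.array([[0, 1, 2],
                     [0, 1, 2]])

# Cityscapes-style panoptic encoding: class_id * 1000 + instance_id.
panoptic = semantic * 1000 + instance

# Both pieces of information are recoverable from the single map.
decoded_class = panoptic // 1000
decoded_instance = panoptic % 1000
```

This is why panoptic segmentation carries the most information: one map answers both "what class is this pixel?" and "which object does it belong to?".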

Importance of Image Segmentation For Image Annotation

The major advantage of image segmentation can be understood by comparing the annotation method of the three most popular computer vision tasks.

Image classification: In this task's annotation, the object's identity is the only information extracted for the machine learning model to use.

Image object detection: Here, during annotation, the object's identity and its location (using bounding boxes) are extracted from the image for the machine learning model to use.

Image segmentation: Here, the image is classified at the pixel level. It therefore extracts the most information and helps you understand what is in the image pixel by pixel.

To understand better, let's take one image and annotate it by five separate methods: classification, object detection, instance segmentation, semantic segmentation and panoptic segmentation, i.e., semantic segmentation with instances.


Fig 12: Image annotated five ways. Source: author

In the image above you can clearly see that annotations like classification and object detection are faster to label, as you don't have to classify each pixel. But if you wish to extract the most information, annotations generated by image segmentation are more reliable: they cover all the contents of an image, which provides flexibility in choosing machine learning models.

The annotation for image and video segmentation is made so much easier now with the help of annotation tools. These tools are equipped with features to reduce the time-consuming nature of creating consistent pixel-perfect image segmentation annotations. Let’s see some of the features which help in generating better annotations using image segmentation:

Automated Annotation for Image Segmentation

Automated annotation accelerates image segmentation without diminishing the quality of annotations. Annotation tools like Encord contain auto-segmentation features to help annotators with image segmentation tasks specifically. They also feature model-assisted labeling, where a machine learning model outputs pre-labels, so annotators only need to review and edit the labeled training data rather than labeling everything from scratch. This lets annotators focus on manually labeling edge cases or areas where the model is not performing well.

Ontology-Based Customization for Image Segmentation

A customizable ontology allows annotators to configure datasets to match their data structure requirements, and provides the ability to classify the instances you have segmented into as many classes as you want. A customizable ontology is important for defining your data structure and generating accurate image segmentation labels.

Real-World Use-Cases of Image Segmentation for Computer Vision Projects

We have discussed a great deal about image segmentation up to this point. It is an essential component of computer vision. To perform segment-specific processing, machine learning models must split image and video data into segments. As a result, image segmentation is widely used in domains such as robotics, medical imaging, and other technologies that rely on intelligent image and video analysis.

Here are a few of the most popular real-world image segmentation implementations.


Robotics

Image segmentation is important in building algorithms for robot vision, aiding machine perception and locomotion. It is used to build models that detect obstacles in the path of motion and enable the robot to interpret the scene and change its path effectively. Beyond locomotion, segmentation also helps the robot classify and understand each object in its environment, enabling it to interact with real-world objects using vision-based sensors alone as reference, which makes the machine more efficient. Other areas where image segmentation is used in robot vision include instance segmentation for grasping objects, autonomous navigation, and simultaneous localisation and mapping (SLAM).

Medical Imaging and Diagnostics in Healthcare

Image segmentation is used early in the diagnostics pipeline for medical images such as CT scans. In medical image processing it identifies and separates the areas of an image containing the important pixels of organs, lesions, etc. This allows doctors to identify malignant features faster and more accurately, and makes diagnostics more effective because the images pass through two levels of review. A few areas of medical diagnostics where image segmentation is used are:

  • X-ray segmentation
  • CT scan organ segmentation
  • Dental instance segmentation
  • Digital pathology cell segmentation
  • Surgical video annotation

Self-Driving Cars

Autonomous vehicles rely on sensors capturing images and videos to visualize the environment as a driver would. A person learning to drive learns to be attentive and to notice and react to the environment quickly and effectively, ensuring safety at all times; you would expect the same from a machine learning model for self-driving cars. The vehicle needs to see, interpret, and respond to the scene in real time and with the highest possible accuracy. Using image segmentation to generate pixel-level maps of the world helps build robust machine learning models that allow vehicles to navigate efficiently and safely.

If you want to visualize how image segmentation is helpful in scene understanding for self-driving cars, please watch this video.

Creativity Tools

As we saw above, image segmentation is crucial in powering transformational technologies such as automated medical imaging and diagnostics, autonomous cars, etc. It can also enhance customer-facing creativity tools such as image and video editors and content creation platforms, which see a wide variety of uses, from marketing to hobbies.

These tools need techniques to creatively augment visual content. They use image segmentation to teach devices to understand and interpret images so that the tools can manipulate them better. Generating pixel-level maps and masks of the objects in an image through segmentation lets you isolate regions of the image, creating opportunities to apply targeted effects to specific areas. Here are some of the ways segmentation helps in developing features for creativity tools:

  • Creating a green-screen effect to replace the background in real time, as in Zoom video calls
  • Developing “try-on yourself” experience to allow the users to sample products virtually
  • Adding augmented reality objects to an image
  • Blurring the background to sharpen the object of interest’s focus
  • Artistic filters for content creation like in Instagram or Snapchat

Best Practice for Image Segmentation for Computer Vision

Now that we have discussed image segmentation and its many aspects, here are some tips to keep in mind when you are working with image segmentation or thinking of ways to improve your segmentation network:

Be Aware Of Class Imbalance

As with any machine learning model, data is important. Image segmentation classifies pixels into groups, and if there is class imbalance within the dataset, the model will not learn efficiently. For example, in medical imaging, pixels of malignant cells may be rare in the dataset, which makes it hard for the neural network to learn them. Data augmentation that preserves the image properties can increase the training data and improve the robustness of the model.
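One simple way to quantify and counteract the imbalance is to compute per-class pixel frequencies and weight rare classes more heavily. The inverse-frequency scheme below is one common heuristic, and the toy mask is illustrative:

```python
import numpy as np

def inverse_frequency_weights(mask, num_classes):
    """Per-class loss weights proportional to the inverse of how often
    each class appears in the ground-truth mask (rare classes weigh more)."""
    counts = np.bincount(mask.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    weights = 1.0 / np.maximum(freq, 1e-8)  # guard against absent classes
    return weights / weights.sum()          # normalise so weights sum to 1

# Toy mask: class 1 ("malignant") covers only 2 of 16 pixels.
mask = np.zeros((4, 4), dtype=int)
mask[0, 0] = 1
mask[0, 1] = 1
w = inverse_frequency_weights(mask, num_classes=2)  # w[1] >> w[0]
```

These weights can then be passed to a weighted pixel-wise loss so the rare class contributes proportionally more to training.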

Choosing The Right Architecture

The right architecture is again crucial. If training a neural network from scratch on the right dataset is not producing the desired results, changing the network architecture can help. For example, swapping the encoder for a pre-trained, specialized image classification network and then training end-to-end can be useful, as can finding the optimal cut-off layer when you are using transfer learning.

Loss Functions

Loss functions are used to optimize the neural network. Choosing the right loss function is essential for accurate segmentation, especially on unbalanced datasets: an inadequate loss function can result in high false positive or false negative rates.
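As an illustration, the soft Dice loss is a popular choice for unbalanced segmentation because it scores region overlap rather than per-pixel accuracy; here is a minimal NumPy sketch with toy predictions:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|). Because it is computed
    over the overlap with the foreground rather than averaged per pixel,
    it is less dominated by a large background class than plain
    pixel-wise cross-entropy."""
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

target = np.array([[0, 0], [0, 1]], dtype=float)  # 1 foreground pixel
good = np.array([[0.1, 0.0], [0.0, 0.9]])         # mostly correct prediction
bad = np.array([[0.9, 0.8], [0.7, 0.1]])          # mostly wrong prediction
```

A good prediction yields a loss near 0 and a bad one a loss near 1, regardless of how small the foreground region is.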

Post Processing

Post-processing of the segments is essential, mainly for the traditional segmentation methods. For example, in edge-based segmentation, the filter separates out the pixels that lie on edges, forming an intermediate segmentation. Post-processing this further, by dilating and joining these pixels, yields the edge segments present in the image.
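A minimal sketch of one such post-processing step, binary dilation with a 3x3 structuring element (pure NumPy; libraries like scipy.ndimage or OpenCV provide optimized versions):

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element: grow each
    foreground pixel into its 8-neighbourhood, joining nearby fragments."""
    out = mask.astype(bool)
    for _ in range(iterations):
        padded = np.pad(out, 1)          # zero (False) border
        grown = np.zeros_like(out)
        for di in (0, 1, 2):
            for dj in (0, 1, 2):
                grown |= padded[di:di + out.shape[0], dj:dj + out.shape[1]]
        out = grown
    return out.astype(np.uint8)

# Two edge fragments with a one-pixel gap between them.
edges = np.array([[0, 1, 0, 1, 0]])
joined = dilate(edges)  # the gap is bridged after one dilation
```

Dilation followed by erosion (morphological closing) is the usual way to join fragments without permanently thickening the edges.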


I hope this overview of image segmentation was helpful for understanding the basics and how the technique is used in the real world. The best practices above are suggestions to look into if your image segmentation model is not performing as desired. In the final section, I have organized resources to help you take the next step in learning all there is to know about image segmentation.

Further Reading on Image Segmentation

Useful Datasets for Image Segmentation

  • COCO - Dataset of common objects found in the environment
  • The Berkeley Segmentation dataset and benchmark - variety of images ranging from natural images to object-specific such as plants, people, food, etc.
  • Pascal VOC2010 - contains images from 20 object classes divided over person, animal, vehicle and indoor
  • ADE20K - contains images for scene parsing
  • MARIDA - contains satellite data to distinguish marine debris from other marine objects like ships
  • HuTu 80 - contains high-resolution color microscopic images of human small intestine tissue, including cancerous samples.