
How to Detect Data Quality Issues in Torchvision Dataset using Encord Active

December 19, 2023 | 10 mins

If you have built computer vision applications for any length of time, you have likely grappled with bad-quality images in your data. The problem manifests in various forms: mislabeled images, inconsistent image resolutions, noise, and other types of distortion. Such flaws can lead models to learn the wrong features, producing incorrect or untrustworthy classifications and outputs. In areas where accuracy and reliability are paramount, these issues can cause significant setbacks, derail initiatives, and waste project resources.

How do you enhance your models' effectiveness? The key is a systematic process of investigating, assessing, and improving the quality of your image data. This is where Encord Active steps in: it offers a robust framework to pinpoint and tag problematic images, along with tools to refine the overall quality of your dataset. This article will show you how to use Encord Active to explore images, identify and visualize potential issues, and take the next steps to rectify low-quality images. In particular, you will:

  • Use the popular Caltech101 dataset from the Torchvision Datasets library.
  • Delve into the steps of creating an Encord Active project.
  • Define and run quality metrics on the dataset.
  • Visualize the quality metrics.
  • Indicate strategies to fix the issues you identified.

Ready? Let’s delve right in! 🚀

Using Encord Active to Explore the Quality of Your Images

Encord Active enables you to find and fix label errors through data exploration, model-assisted quality metrics, and a one-click labeling integration. It takes a data-centric approach to improving model performance.

With Encord Active, you can:

  • Explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations.
  • Slice your visual data across metric functions to identify data slices with low performance.
  • Surface and prioritize valuable data for labeling - crucial for training the model with high-quality data.
  • Flag poor-performing slices and send them for review.
  • Export your new dataset and labels.

Check out the project on GitHub.

Demo: Explore the Quality of Caltech 101 Images to Train ML Models

In this article, you will use Encord Active to explore the quality of the “Caltech 101” images, accessed through the Torchvision Datasets library. The dataset consists of 101 object categories spanning animals, vehicles, household objects, and more, with a varying number of images per category; examples include "airplanes," "faces," "motorbikes," "helicopters," "chairs," and "laptops." It is renowned for its diversity in object appearance, viewpoint, lighting, and scale, which makes it a challenging benchmark for object recognition algorithms. Once you have good-quality data, you can train a downstream computer vision model for multi-class image classification or object recognition. We hosted the code for this walkthrough on GitHub. Open the Notebook side-by-side with this blog post.


Interested in more computer vision, visual foundation models, active learning, and data quality notebooks? Check out the Encord Notebook repository.

Install Torchvision Datasets to Download and Generate the Dataset

Torchvision Datasets, part of the Torchvision package in the popular PyTorch ecosystem (alongside sub-packages like Torchvision Models), provides pre-configured datasets for various computer vision tasks. Install the latest version of Torchvision following the documentation page. With a few lines of code, you can access and download datasets like CIFAR-10, ImageNet, MNIST, and many others, saving time and effort in data preparation. These well-structured datasets come with predefined classes and labels, making them easy to adapt for training and evaluating machine learning models.

Load the ‘Caltech101’ Dataset

Loading a dataset from Torchvision is straightforward: install the library and download the dataset. Torchvision comes pre-installed on your Colab instance, but if you are running this walkthrough locally, you can install the library from PyPI. The version installed for this article is `0.16.0+cu118`.
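For a local setup, a plain pip install is usually enough (the `+cu118` suffix above comes from Colab's preinstalled CUDA wheel; you do not need to match it exactly):

# Install torchvision from PyPI in a notebook environment
!pip install torchvision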

To download ‘Caltech101’, run the following command:

from pathlib import Path
from torchvision import datasets
datasets.Caltech101(Path.cwd(), target_type="category", download=True)

`target_type="category"` specifies that each image is associated with a category label (e.g., "airplanes," "faces," etc.).

If you are running this walkthrough on Google Colab, run the following utility code, which forcibly assigns file descriptor numbers 1 (stdout) and 2 (stderr) to the `sys.stdout` and `sys.stderr` objects, respectively, so the instance handles output and errors from the Python script correctly:

import sys

# Colab replaces stdout/stderr with objects that lack a real file descriptor;
# patching fileno keeps libraries that call sys.stdout.fileno() from crashing
sys.stdout.fileno = lambda: 1
sys.stderr.fileno = lambda: 2

Create an Encord Active Project

You must specify the directory containing your datasets when using Encord Active for exploration. You will initialize a local project with the image files; there are different ways to import and work with projects in Encord. Encord Active provides functions and utilities to load all your images, compute embeddings, and evaluate the embeddings against pre-defined metrics. The metrics will help you search for and find images with errors or quality issues.

Before initializing the Encord Active project, define a function, `collect_all_images`, that takes a root folder path as input and returns a list of `Path` objects representing the image files within that folder:

def collect_all_images(root_folder: Path) -> list[Path]:
    """Recursively collect the paths of all image files under root_folder."""
    image_extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    image_paths = []

    for file_path in root_folder.glob("**/*"):
        if file_path.suffix.lower() in image_extensions:
            image_paths.append(file_path)

    return image_paths

Set up some configuration parameters, including specifying the `root_folder` where the image data is located (in this case, a directory named "./caltech101") and a `projects_dir` where the project-related data will be stored (in this case, a directory named "./ea/"):

root_folder = Path("./caltech101")
projects_dir = Path("./ea/")

Remember to access and run the complete code in the accompanying notebook.

Initialize Encord Active Project

Next, initialize a local project using Encord Active's `init_local_project` function. This function provides the same functionality as running the `init` command in the CLI. If you prefer using the CLI, please refer to the “Quick import data & labels” guide.

# Assumed imports; exact module paths may vary across encord-active versions:
# from encord_active.lib.project.local import ProjectExistsError, init_local_project

if not projects_dir.exists():
    projects_dir.mkdir()

image_files = collect_all_images(root_folder)

try:
    project_path: Path = init_local_project(
        files=image_files,
        target=projects_dir,
        project_name="sample_ea_project",
        symlinks=False,
    )
except ProjectExistsError as e:
    project_path = Path("./ea/sample_ea_project")
    print(e)  # A project already exists with that name at the given path

Compute Image Embeddings and Analyze Them With Metrics

When dealing with raw image data in computer vision, directly analyzing it can often be challenging due to its high dimensionality. A typical approach involves generating embeddings for the images, effectively compressing their dimensions, and applying various metrics to these embeddings to derive valuable insights and assess the image quality.

Generating these embeddings with pre-trained models, specifically convolutional neural networks (CNNs), is preferable: such models are adept at extracting vital features from images while simultaneously reducing the complexity of the data. Upon acquiring the embeddings, you can apply similarity analysis, clustering, and classification metrics to examine the dataset's characteristics thoroughly.

Computing embeddings and applying these metrics can require considerable manual labor. This is where Encord Active comes into play!
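To make the manual approach concrete before handing it off to Encord Active, here is a minimal sketch of how you could generate image embeddings yourself with a pre-trained CNN from Torchvision. This is only an illustration of the general technique, not how Encord Active computes its embeddings internally, and the `resnet18` backbone is an assumption chosen for brevity:

import torch
from torchvision import models
from PIL import Image

# Load a pre-trained ResNet-18 and drop its classification head so it
# outputs a 512-dimensional feature vector per image
weights = models.ResNet18_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(models.resnet18(weights=weights).children())[:-1])
backbone.eval()

# Use the resizing/normalization pipeline the model was trained with
preprocess = weights.transforms()

def embed_image(path):
    """Return a 512-dim embedding for a single image file."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(image).flatten().numpy()

embedding = embed_image(image_files[0])  # `image_files` from collect_all_images
print(embedding.shape)  # (512,)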

Encord Active (open source) provides utility functions to run predefined subsets of metrics, or you can execute custom metrics. It computes the image embeddings and runs the metrics appropriate to each embedding type. Encord Active has three different types of embeddings:

  • Image embeddings - general for each image or frame in the dataset
  • Classification embeddings - associated with specific frame-level classifications
  • Object embeddings - associated with specific objects, like polygons or bounding boxes

Use the `run_metrics_by_embedding_type` function to execute quality metrics on the images, specifying the embedding type as `IMAGE`:

# Assumed imports; exact module paths may vary across encord-active versions:
# from encord_active.lib.metrics.execute import run_metrics_by_embedding_type
# from encord_active.lib.metrics.metric import EmbeddingType

run_metrics_by_embedding_type(
    EmbeddingType.IMAGE,
    data_dir=project_path
)

Create a `Project` object using the `project_path` - you will use this for further interactions with the project:

# Assumed import; exact module path may vary across encord-active versions:
# from encord_active.lib.project import Project

ea_project = Project(project_path)

Got any questions about Encord Active? Connect with us in the Encord Developer community; we are happy to chat. You can also check out the FAQ page.

Exploring the Quality of Images from the Torchvision Dataset Library

Now that you have set up your project, it’s time to explore the images! There are typically two ways you could visualize images with Encord Active (EA):

  • Through the web application (Encord Active UI)
  • Combining EA with visualization libraries to display image embeddings based on the metrics

In this article, we will explore the images within the Caltech101 dataset through the web application UI.

📈 We explored Hugging Face image datasets by combining Encord Active with visualization libraries in the article Exploring the Quality of Hugging Face Image Datasets with Encord Active.

Start the web app from the directory containing your project:

%cd ea
!encord-active start

Your browser should open a new window with Encord Active OS, launching the following web page with all your projects:

Encord Active OS - Project

⚠️ If the terminal seems stuck and nothing happens in your browser, try visiting http://localhost:8000.

Visualizing Aspect Ratio Score Distributions of the Caltech101 Images

Visualizing the aspect ratio score distribution of an image dataset lets you spot unusual patterns that might indicate issues like incorrect data collection or labeling. It also helps you understand data diversity and identify biases or gaps, particularly in aspect ratios. This understanding is vital for optimizing model performance, as different models may respond differently to various aspect ratios, and unhandled variations can introduce inconsistencies that degrade performance.

Within the web application, click on “sample_ea_project” and navigate to “Summary” >> “Data” >> “Metric Distribution”:

Metric distribution - Aspect Ratio - Encord Active

Let’s take a closer look at the aspect ratio distribution:

Aspect Ratio Distribution - Encord Active

The distribution is not uniform and has a long tail, meaning there is significant variation in aspect ratio within the dataset. Knowing the distribution of aspect ratios can guide your decisions on preprocessing, for example, whether to crop, scale, or use letterboxing to handle different aspect ratios. The long tail to the right indicates that some aspect ratios are much larger than the median; these could be considered outliers (unusually wide images). The tall bars at certain aspect ratios indicate that those ratios are very common in the dataset. There is a notably tall bar at an aspect ratio of around 1.22, meaning many images in the dataset have widths approximately 22% larger than their heights.
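If you want to reproduce a distribution like this outside the UI, a minimal sketch with PIL and Matplotlib (both assumed to be installed) could look like the following, reusing the `image_files` list from earlier:

import matplotlib.pyplot as plt
from PIL import Image

# Compute the width/height ratio of every image collected earlier
aspect_ratios = []
for path in image_files:
    with Image.open(path) as img:
        aspect_ratios.append(img.width / img.height)

plt.hist(aspect_ratios, bins=50)
plt.xlabel("Aspect ratio (width / height)")
plt.ylabel("Number of images")
plt.title("Caltech 101 aspect ratio distribution")
plt.show()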

Visualizing Image Resolution Distributions of the Caltech101 Images

Visualizing the image resolution score distributions provides insights into the variation and quality of image details, which are critical for accurate feature extraction for your CV tasks. It can help detect anomalies or outliers in the dataset, such as images with resolutions that deviate significantly from the majority, which may indicate data collection or labeling errors.

Under the “Data” tab, go to “2D Metrics view” and set the “X Metric” to “Width” and the “Y Metric” to “Height”:

Detects outliers in the dataset with image resolutions - Encord Active

Most of the data points are clustered at the lower left of the chart, which should tell you that a large number of images have relatively small dimensions in both width and height, which is typical for most open source datasets. The points are concentrated mostly below 900 pixels in width and 1,000 pixels in height. Meanwhile, outliers towards the right side of the chart show images that may need special consideration in processing or could indicate anomalies in the data pipeline.
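A similar scatter plot can be reproduced in code (again assuming PIL and Matplotlib, and reusing `image_files`):

import matplotlib.pyplot as plt
from PIL import Image

# Collect widths and heights, mirroring the 2D Metrics view
widths, heights = [], []
for path in image_files:
    with Image.open(path) as img:
        widths.append(img.width)
        heights.append(img.height)

plt.scatter(widths, heights, s=4, alpha=0.3)
plt.xlabel("Width (px)")
plt.ylabel("Height (px)")
plt.title("Image resolution scatter")
plt.show()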

Inspect the Problematic Images

What are the Severe and Moderate Outliers in the Image Set?

It may also be beneficial to understand the range and intensity of anomalies across different imaging characteristics, such as green channel intensity, blue channel intensity, area, image sharpness, and uniqueness (singularity). Severe outliers may indicate corrupted images, anomalies, or errors in data collection or processing, while moderate outliers might still be valid data points with unusual characteristics. Understanding the extent and nature of these outliers can help you design models that are robust to such anomalies, or decide whether to include or exclude the outliers during training.

Still under the “Data” tab, check the plot of image outliers based on all the metrics EA detects in your dataset. EA categorizes the outliers into two levels: "severe outliers" (shown in red) and "moderate outliers" (shown in orange).

Image outlier detection plot - Encord Active

From the plot above, the width and the green color channel have the highest numbers of severe outliers, suggesting significant deviations from the norm in these measures for the images in the Caltech101 dataset. For almost all metrics, severe outliers outnumber moderate outliers, indicating extreme variations in the image properties. This distribution suggests investigating the causes of these extreme values; the investigation could lead to data cleaning and additional preprocessing, or you may need to account for these atypical images during model training.
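As a rough illustration of the idea behind severe versus moderate outliers, here is a sketch using the common IQR-fence convention (1.5× IQR for moderate, 2.5× IQR for severe). These multipliers are assumptions chosen for illustration; Encord Active's internal thresholds may differ:

import numpy as np

def classify_outliers(values, moderate_k=1.5, severe_k=2.5):
    """Split values into moderate and severe outliers using IQR fences.

    The 1.5x / 2.5x multipliers are a common convention, assumed here
    for illustration; EA's internal thresholds may differ.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    severe = (values < q1 - severe_k * iqr) | (values > q3 + severe_k * iqr)
    moderate = ((values < q1 - moderate_k * iqr) | (values > q3 + moderate_k * iqr)) & ~severe
    return moderate, severe

# Example: flag outliers among the image widths computed in the earlier sketch
moderate, severe = classify_outliers(widths)
print(f"{severe.sum()} severe and {moderate.sum()} moderate width outliers")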

Let’s take a look at some of the images with severe outliers in green, red, and blue channel intensities:

Go to “Explorer” and, under “Data Metrics,” select “Green Values” and sort in descending order:

Data Metrics - Outlier detection - Encord

The image with the highest green-value intensity is not of bad quality. Sift through the rest of the images in the “Explorer” to find the ones that are low quality given your model training or data objectives. You can do the same for the red and blue channels. The Caltech101 dataset is relatively clean, so you will find that most images in the set are high quality for CV tasks.
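If you want to reproduce a ranking like this in code, here is a minimal sketch that sorts images by their mean green-channel intensity (assuming NumPy and PIL, and reusing `image_files`):

import numpy as np
from PIL import Image

def mean_channel_intensities(path):
    """Return the mean (R, G, B) intensities of an image in [0, 255]."""
    with Image.open(path) as img:
        arr = np.asarray(img.convert("RGB"), dtype=np.float64)
    return arr.reshape(-1, 3).mean(axis=0)

# Rank images by mean green intensity, highest first (index 1 = green)
by_green = sorted(image_files, key=lambda p: mean_channel_intensities(p)[1], reverse=True)
print(by_green[:5])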

What are the Blurry Images in the Image Set?

Depending on the specific application, you may find that images with blur can adversely affect your model's accuracy. A model trained on high-resolution, sharp images may struggle to interpret and make correct predictions on images that are not as clear. Blurry images can result in misinterpretations and errors in the model's output, which could be critical. Examining such images within your dataset to determine whether to exclude or improve their quality is important.

Examining blurry images - Encord Active

You can view one of the blurry images in the Explorer to get more insights:

Explorer - View blurry images - Encord Active

You can also click on “SIMILAR” to visualize similar images - this could help you surface more images you may need to inspect.
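If you would rather rank images by blur programmatically, a common heuristic is the variance of the Laplacian: sharp images have strong edges and therefore high variance. This is only an assumption-driven sketch using OpenCV, not necessarily the metric Encord Active computes:

import cv2

def sharpness_score(path):
    """Variance of the Laplacian; lower values suggest blurrier images."""
    gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Surface the blurriest images first
blurriest = sorted(image_files, key=sharpness_score)
print(blurriest[:5])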

What are the Brightest and Darkest Images in the Dataset?

Next, surface images with poor lighting or low visibility. Dark images can indicate quality issues; they could result from poor lighting during capture, incorrect exposure settings, or equipment malfunctions. A model might also struggle to recognize patterns in such images, which could reduce accuracy. Identify and correct these images to improve the overall training data quality. Change the “Data Metrics” dropdown selection to “Brightness” and sort in descending order:

Identify image brightness through explorer on Encord Active

The brightest images look to be of reasonably good quality, but you can look through the Explorer to spot images that may not meet your training or data requirements. You can also observe that the distribution of images along this metric is reasonably normal. To get the darkest images, sort “Brightness” in ascending order:

Sorting the brightness in ascending order for darkest image - Encord Active

There does not appear to be anything wrong with the dark images either; you may need to explore further to determine which images to sift out. Awesome! You have now worked with a few pre-defined quality metrics. See the complete code to run other data quality metrics on the images.
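As with the other metrics, you can approximate the brightness ranking outside the UI with a short sketch (assuming NumPy and PIL, and reusing `image_files`); how EA scores brightness internally may differ:

import numpy as np
from PIL import Image

def brightness(path):
    """Mean grayscale pixel intensity in [0, 255]."""
    with Image.open(path) as img:
        return float(np.asarray(img.convert("L")).mean())

# Ascending order puts the darkest images first; the brightest are at the end
by_brightness = sorted(image_files, key=brightness)
darkest, brightest = by_brightness[:5], by_brightness[-5:]
print(darkest, brightest)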

Next Steps: Fixing Data Quality Issues

Identifying problematic images is half the battle; ideally, the next step is to act on those insights and fix the issues. Encord Active (EA) can help you tag problematic images that may skew model performance downstream. Post-identification, you can use various strategies to rectify these issues. Below are some approaches to fixing problematic images.

Tagging and Annotation

Once you identify the problematic images, you can tag them within EA with whatever action you deem appropriate. Would you want to remove them from your dataset, label them, or process them? Your model training and data objectives will help you determine the ideal next step. Here is how you can do it for the blurry images we saw earlier:

Tagging and Annotation in Encord Active

One of the most common workflows we see from our users at Encord is identifying image quality issues at scale with Encord Active, tagging problematic images, and sending them upstream for annotation with Annotate.

Re-Labeling

Incorrect labels can significantly hamper model performance. EA facilitates the re-labeling process by exporting the incorrectly labeled images to an annotation platform like Encord Annotate, where you can correct the labels.

Check out our practical guide to active learning for computer vision to learn more about active learning, its tradeoffs, alternatives, and a comprehensive explanation of active learning pipelines.

Image Augmentation and Correction

Image augmentation techniques enhance the diversity and size of the dataset to improve model robustness. Consider augmenting the data with techniques like rotation, scaling, cropping, and flipping, as shown in the sketch below. Some images may also require corrections like brightness adjustment, noise reduction, or other image processing to meet the desired quality standards.

Maintaining image quality is not a one-time task but a continuous process. Regularly monitoring and evaluating your image quality will help maintain the high-quality dataset that is pivotal for achieving superior model performance.
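As a sketch of what such an augmentation pipeline could look like with Torchvision's own transforms (the specific parameter values are illustrative assumptions, not recommendations):

from pathlib import Path
from torchvision import datasets, transforms

# A typical augmentation pipeline covering rotation, scaling/cropping,
# flipping, and mild color correction
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Pass the pipeline to the dataset so augmentations apply on every load
augmented_dataset = datasets.Caltech101(
    Path.cwd(), target_type="category", transform=augment, download=True
)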

Torchvision: Key Takeaways

In this article, you set out to train a classification model on a dataset with 101 classes of images. Technically, you “gathered” human labels, since the open 'Caltech101' dataset comes pre-labeled in Torchvision. Using Encord Active, you computed image embeddings, ran metrics on the images and embeddings, and inspected the problematic images by exploring the dataset against objective quality metrics. Identifying and fixing the errors in your dataset will set up your downstream model training and ML application for success.

Written by Stephen Oladele
Frequently asked questions
  • What is Torchvision? Torchvision is a package from the PyTorch ecosystem that provides tools and utilities for computer vision. It is widely used for several purposes: loading pre-built computer vision datasets and pre-trained models, applying common image transformations for data augmentation and preprocessing, and providing exploration and debugging utilities and custom components.

  • What are Torchvision datasets? Torchvision datasets are a set of ready-to-use dataset classes in the Torchvision library that cover a wide range of computer vision tasks and benchmarks. They are used for training and testing machine learning models in tasks such as image classification, object detection, segmentation, and more.

  • What is the MNIST dataset in Torchvision? The MNIST dataset in Torchvision is a large database of handwritten digits that is widely used for training and testing in the fields of machine learning and computer vision. MNIST stands for Modified National Institute of Standards and Technology, and it is considered one of the classic datasets in the field.

  • How do you load the MNIST dataset from Torchvision? Follow these steps:
    - Install PyTorch and Torchvision: if you haven't already, install them using pip or conda. You can find the instructions on the official PyTorch website.
    - Import the necessary packages in your Python script or Jupyter notebook.
    - Load the dataset: use the torchvision.datasets.MNIST class to download and load the MNIST dataset. Here's an example of how you might write the code:

import torch
from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Download and load the test data
testset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)

  • How do you load a custom dataset? Here's a general approach:
    - Define a Dataset class: inherit from torch.utils.data.Dataset and implement the __init__, __len__, and __getitem__ methods to customize how your data is loaded.
    - Create Dataset and DataLoader objects: once the Dataset class is defined, create an instance of it and wrap it with a DataLoader to handle batching and shuffling.