When you are working with datasets or developing a machine learning model, you often find yourself looking for, or hypothesizing about, subsets of data, labels, or model predictions with certain properties. Quality metrics form the foundation for finding such data and testing those hypotheses. The core concept is to use quality metrics to index, slice, and analyze the subject in question in a structured way, so that you can keep taking informed actions as you iterate through the active learning cycle.
Concrete example: You hypothesize that object "redness" influences the mAP score of your object detection model. To test this hypothesis, you define a quality metric that captures the redness of each object in the dataset. From the quality metric, you slice the data to compare your model performance on red vs. not red objects.
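As a hedged illustration of what such a metric could look like (a sketch under simple assumptions, not Encord Active's built-in implementation), object "redness" can be approximated by how much the red channel dominates each object's crop:

import numpy as np

def object_redness(crop: np.ndarray) -> float:
    """Score how dominant the red channel is in an RGB object crop (H x W x 3, uint8).

    Returns a value in [0, 1]: 0 means no red dominance, 1 means pure red.
    """
    pixels = crop.reshape(-1, 3).astype(np.float32)
    red = pixels[:, 0]
    others = pixels[:, 1:].mean(axis=1)
    # Per-pixel amount by which red exceeds the average of green and blue.
    dominance = np.clip(red - others, 0.0, 255.0) / 255.0
    return float(dominance.mean())

# Score every object crop, then split the dataset into "red" vs. "not red" slices
# (e.g., at a threshold of 0.2) and compare mAP on the two slices.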
We like to think of a quality metric as:
Any function that assigns a value to individual data points, labels, or model predictions in a dataset.
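In code, this definition is nothing more than a per-item scoring function. As a minimal, illustrative sketch (the names below are assumptions, not a library interface):

from typing import Any, Callable

# A quality metric maps one data point, label, or prediction to a single score.
QualityMetric = Callable[[Any], float]

def image_area(image: Any) -> float:
    """A deliberately simple data quality metric: the number of pixels in an image."""
    width, height = image.size  # assumes a PIL.Image-like object
    return float(width * height)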
By design, quality metrics are a very abstract class of functions because the accompanying methodologies are agnostic to the specific properties that the quality metrics express. No matter the specific quality metric, you can:
sort your data according to the metric
slice your data to inspect specific subsets
find outliers
compare training data to production data to detect data drifts
evaluate your model performance as a function of the metric
define model test-cases
and much more
All of which are possible with Encord Active.
Tip: Try to read the remainder of this post with the idea of "indexing" your data, labels, and model predictions based on quality metrics in mind. The metrics mentioned below are just the tip of the iceberg in terms of what quality metrics can capture -- only your imagination limits the space.
Data quality metrics are those that require only information about the data itself. Within the computer vision domain, this means the raw images or video frames without any labels. This subset of quality metrics is typically used most heavily at the beginning of a machine learning project, where labels are scarce or perhaps nonexistent.
Below are some examples of data quality metrics ranging from simple to more complex:
Image Brightness as a data quality metric on MS COCO validation dataset on Encord. Source: Author
Image Singularity as a data quality metric on MS COCO validation dataset on Encord. Source: Author
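To make the "Image Brightness" example above concrete, here is a minimal sketch of how such a per-image score could be computed (an illustration, not necessarily the exact formula Encord Active uses):

from PIL import Image
import numpy as np

def image_brightness(path: str) -> float:
    """Mean perceived brightness of an image, normalized to [0, 1]."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    # Standard luminance weighting of the R, G, and B channels.
    luminance = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return float(luminance.mean())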
Label quality metrics apply to labels. Some metrics use the image content, while others rely only on the label information. Label quality metrics serve many purposes, but some of the most frequent ones are surfacing label errors, revealing model failure modes, and assessing annotator performance.
Here are some concrete examples of label quality metrics ranging from simple to more complex:
Object count as a label quality metric on MS COCO validation dataset on Encord. Source: Author
Annotation Duplicate as a label quality metric on MS COCO validation dataset on Encord. Source: Author
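As a hedged sketch of the "Annotation Duplicate" idea shown above (illustrative only, not Encord Active's implementation), a label quality metric can flag bounding boxes that overlap almost completely with another annotation in the same frame:

def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def duplicate_score(box, other_boxes) -> float:
    """Highest IoU with any other annotation in the frame; values near 1.0 suggest a duplicate label."""
    return max((iou(box, other) for other in other_boxes), default=0.0)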
Model quality metrics also take the model predictions into account. The most obvious use-case for these metrics is acquisition functions, answering the question: "What should I label next?" There are many intelligent ways to leverage model predictions to answer it. Here is a list of some of the most common ones (a small sketch of one such acquisition function follows the examples below):
Using Model Confidence as model quality metric on MS COCO validation dataset on Encord. It shows the predictions where the confidence is between 50% to 80%. Source: Author
Using Polygon Shape Similarity as model quality metric on MS COCO validation dataset on Encord. It ranks objects by how similar they are to their instances in previous frames based on Hu moments. The more an object’s shape changes, the lower its score will be.
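Here is a hedged sketch of one simple uncertainty-based acquisition function (illustrative, not a specific Encord Active metric): rank unlabeled images by the entropy of the model's predicted class probabilities and label the most uncertain ones first.

import numpy as np

def prediction_entropy(probabilities: np.ndarray) -> float:
    """Entropy of a predicted class distribution; higher means the model is less certain."""
    p = np.clip(probabilities, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def rank_for_labeling(unlabeled_probs: dict[str, np.ndarray]) -> list[str]:
    """Return image ids sorted so the most uncertain predictions come first."""
    return sorted(unlabeled_probs, key=lambda k: prediction_entropy(unlabeled_probs[k]), reverse=True)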
We have now gone over some examples of common quality metrics that already exist in Encord Active.
However, every machine learning project is different, and most likely, you already have an idea of exactly what to compute in order to surface the data that you want to evaluate or analyze.
With Encord Active, you only need to define the per-data-point computation and the tool will take care of everything from executing the computation to visualizing your data based on your new metric.
Perhaps you want to know when your skeleton predictions are occluded, or in which frames of a video specific annotations are missing.
You could also get even smarter and compare your labels with results from foundational models like SAM.
These different use-cases are situations in which you would be building your own custom metrics.
You can find the documentation for writing custom metrics here or you can follow any of the links provided above to specific quality metrics and find their implementation on GitHub.
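The exact interface is described in the documentation linked above; as a rough, library-agnostic sketch (the class and method names below are assumptions, not Encord Active's actual API), a custom metric boils down to the per-data-point function you define plus a little bookkeeping:

from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class CustomMetric:
    """Illustrative container for a custom quality metric (not the Encord Active interface)."""
    name: str
    score_fn: Callable[[Any], float]  # the per-data-point computation you define

    def run(self, data_points: Iterable[Any]) -> dict[int, float]:
        # The toolkit would handle execution, storage, and visualization of these scores.
        return {i: self.score_fn(dp) for i, dp in enumerate(data_points)}

# Example: flag frames in which an expected annotation is missing.
missing_annotation = CustomMetric(
    name="Missing annotation",
    score_fn=lambda frame: 0.0 if frame.get("labels") else 1.0,  # hypothetical frame dict
)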
If you need assistance developing your custom metric, the Encord Active Slack channel is also always open.
Quality metrics constitute the foundation of systematically exploring, evaluating, and iterating on machine learning datasets and models.
We use them for slicing data, comparing data, tagging data, finding label errors, and much more. The true power of these metrics is that they can be arbitrarily specific to the problem at hand. With Encord Active, it is super easy to define, execute, and utilize quality metrics to get the most out of your data, your models, and your annotators.
Related Blogs
Computer vision engineers, data scientists, and machine learning engineers face a pervasive issue: the prevalence of low-quality images within datasets. You have likely encountered this problem through incorrect labels, varied image resolutions, noise, and other distortions. Poor data quality can lead to models learning incorrect features, misclassifications, and unreliable or incorrect outputs. In a domain where accuracy and reliability are paramount, this issue can significantly impede the progress and success of projects, resulting in wasted resources and extended project timelines.

Take a look at the following image collage of Chihuahuas or muffins, for example:

Chihuahua or muffin? My search for the best computer vision API

How fast could you tell which images are Chihuahuas vs. muffins? Fast? Slow? Were you correct in 100% of the images? I passed the collage to GPT-4V because, why not? 😂 And as you can see, even the best-in-class foundation model misclassified some muffins as Chihuahuas! (I pointed out a few.)

So, how do you make your models perform better? The sauce lies in a systematic approach to exploring, evaluating, and fixing the quality of images. Enter Encord Active! It provides a platform to identify and tag problematic images and offers features to improve the dataset's quality. This article will show you how to use Encord Active to explore images, visualize potential issues, and take next steps to rectify low-quality images. In particular, you will:

Use a dog-food dataset from the Hugging Face Datasets library.
Delve into the steps of creating an Encord Active project.
Define and run quality metrics on the dataset.
Visualize the quality metrics.
Indicate strategies to fix the issues you identified.

Ready? Let's delve right in! 🚀

Using Encord Active to Explore the Quality of Your Images

The Encord Active toolkit helps you find and fix wrong labels through data exploration, model-assisted quality metrics, and one-click labeling integration. It takes a data-centric approach to improving model performance. With Encord Active, you can:

Slice your visual data across metrics functions to identify data slices with low performance.
Flag poor-performing slices and send them for review.
Export your new data set and labels.
Visually explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations.

Check out the project on GitHub, and hey, if you like it, leave a 🌟🫡.

Demo: Explore the quality of 'dog' and 'food' images for ML models

In this article, you will use Encord Active to explore the quality of the `sasha/dog-food` images. You'll access the dataset through the Hugging Face Datasets library. You can use this dataset to build a binary classifier that categorizes images into the "dog" and "food" classes. The 'dog' class has images of canines that resemble fried chicken and some that resemble muffins, and the 'food' class has images of, you guessed it, fried chicken and muffins.

The complete code is hosted on Colab. Open the Colab notebook side-by-side with this blog post.

Tip: Interested in more computer vision, visual foundation models, active learning, and data quality notebooks?
Check out the Encord Notebook repository.

Use Hugging Face Datasets to Download and Generate the Dataset

Whatever machine learning, deep learning, or AI tasks you are working on, the Hugging Face Datasets library provides easy access to, sharing of, and processing of datasets, particularly those catering to the audio, computer vision, and natural language processing (NLP) domains. The 🤗 Datasets library backs datasets with an on-disk cache that is memory-mapped for quick lookups.

Explore the Hugging Face Hub for the datasets directory

You can browse and explore over 20,000 datasets housed in the library on the Hugging Face Hub. The Hub is a centralized platform for discovering and choosing datasets pertinent to your projects. In the search bar at the top, enter keywords related to the dataset you're interested in, e.g., "sentiment analysis," "image classification," etc. You should be able to:

Filter datasets by domain, license, language, and so on.
Find information such as the size, download count, and download link on the dataset card.
Engage with the community by contributing to discussions, providing feedback, or suggesting improvements to the dataset.

Load the 'sasha/dog-food' dataset

Loading the `sasha/dog-food` dataset is pretty straightforward: install the 🤗 Datasets library and download the dataset. To install Hugging Face Datasets, run the following command:

pip install datasets

Use the `load_dataset` function to load the 'sasha/dog-food' dataset from Hugging Face:

from datasets import load_dataset, concatenate_datasets

dataset_dict = load_dataset('sasha/dog-food')

`load_dataset` returns a dictionary object (`DatasetDict`). You can iterate through the train and test dataset split keys in the `DatasetDict` object. The keys map to a `Dataset` object containing the images for that particular split.

You will explore the entire dataset rather than the separate splits. This should provide a comprehensive understanding of the data distribution, characteristics, and potential issues. To do that, merge the different splits into a single dataset using the `concatenate_datasets` function:

dataset = concatenate_datasets([d for d in dataset_dict.values()])

Perfect! Now you have an entire dataset to explore with Encord Active in the subsequent sections. If you have not done so already, create a dataset directory to store the downloaded images:

from pathlib import Path
import shutil

# Create a new directory "huggingface_dataset" in the current working dir
huggingface_dataset_path = Path.cwd() / "huggingface_dataset"

# Delete the dir if it already exists and recreate it
if huggingface_dataset_path.exists():
    shutil.rmtree(huggingface_dataset_path)
huggingface_dataset_path.mkdir()

Use a loop to iterate through the images in the 'sasha/dog-food' dataset and save them to the directory you created:

from tqdm import tqdm

for counter, item in tqdm(enumerate(dataset)):
    image = item['image']
    image.save(f'./huggingface_dataset/{counter}.{image.format}')

If your code throws errors, run the cells in the Colab notebook in the correct order. Super! You have prepared the groundwork for exploring your dataset with Encord Active.

Create an Encord Active Project

You must specify the directory containing your dataset when using Encord Active for exploration. You will initialize a local project with the image files; there are different ways to import and work with projects in Encord. Encord Active provides functions and utilities to load all your images, compute embeddings, and, based on that, evaluate the embeddings using pre-defined metrics. The metrics will help you search for and find images with errors or quality issues.
Before initializing the Encord Active project, define a function, `collect_all_images`, that takes a root folder path as input and returns a list of `Path` objects representing the image files within the `huggingface_dataset_path` directory:

def collect_all_images(root_folder: Path) -> list[Path]:
    image_extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    image_paths = []
    for file_path in root_folder.glob("**/*"):
        if file_path.suffix.lower() in image_extensions:
            image_paths.append(file_path)
    return image_paths

Remember to access and run the complete code in this cell.

Initialize Encord Active project

Next, initialize a local project using Encord Active's `init_local_project` function. This function provides the same functionality as running the `init` command in the CLI. If you prefer using the CLI, please refer to the "Quick import data & labels" guide.

try:
    project_path: Path = init_local_project(
        files=image_files,
        target=projects_dir,
        project_name="sample_ea_project",
        symlinks=False,
    )
except ProjectExistsError as e:
    project_path = Path("./sample_ea_project")
    print(e)  # A project already exists with that name at the given path.

Compute image embeddings and analyze them with metrics

Analyzing raw image data directly in computer vision can often be impractical due to the high dimensionality of images. A common practice is to compute embeddings for the images to compress the dimensions, then run metrics on these embeddings to glean insights and evaluate the images. Ideally, you compute the embeddings using pre-trained (convolutional neural network) models. The pre-trained models capture the essential features of the images while reducing the data dimensionality. Once you obtain the embeddings, run similarity, clustering, and classification metrics to analyze different aspects of the dataset.

Computing embeddings and running metrics on them can take quite a bit of manual effort. Enter Encord Active! Encord Active provides utility functions to run predefined subsets of metrics, or you can import your own sets of metrics. It computes the image embeddings and runs the metrics by the type of embeddings. Encord Active has three different types of embeddings:

Image embeddings - general, for each image or frame in the dataset
Classification embeddings - associated with specific frame-level classifications
Object embeddings - associated with specific objects, like polygons or bounding boxes

Use the `run_metrics_by_embedding_type` function to execute quality metrics on the images, specifying the embedding type as `IMAGE`:

run_metrics_by_embedding_type(
    EmbeddingType.IMAGE,
    data_dir=project_path,
    use_cache_only=True
)

The `use_cache_only=True` parameter makes the metrics run on cached data only, rather than recomputing values or fetching fresh data. This can be a useful feature for saving computational resources and time, especially when working with large datasets or expensive computations.

Create a `Project` object using the `project_path`; you will use this for further interactions with the project:

ea_project = Project(project_path)

Exploring the Quality of Images From the Hugging Face Datasets Library

Now that you have set up your project, it's time to explore the images! There are typically two ways you could visualize images with Encord Active (EA):

Through the web application (Encord Active UI)
Combining EA with visualization libraries to display embeddings based on the metrics

We'll use the latter in this article.
You will import helper functions and modules from Encord Active along with visualization libraries (`matplotlib` and `plotly`). This code cell contains the list of modules and helper functions.

Pre-defined subset of metrics in Encord Active

Next, iterate through the data quality metrics in Encord Active to see the list of available metrics: access the name attribute of each metric object within that iterable and construct a list of these names:

[metric.name for metric in available_metrics]

You should get a similar output:

There are several quality metrics to explore, so let's define and use helper functions that enable you to visualize the embeddings.

Helper functions for displaying images and visualizing the metrics

Define the `plot_top_k_images` function to plot the top k images for a metric:

def plot_top_k_images(metric_name: str, metrics_data_summary: MetricsSeverity, project: Project, k: int, show_description: bool = False, ascending: bool = True):
    metric_df = metrics_data_summary.metrics[metric_name].df
    metric_df.sort_values(by='score', ascending=ascending, inplace=True)
    for _, row in metric_df.head(k).iterrows():
        image = load_or_fill_image(row, project.file_structure)
        plt.imshow(image)
        plt.show()
        print(f"{metric_name} score: {row['score']}")
        if show_description:
            print(f"{row['description']}")

The function sorts the DataFrame of metric scores, iterates through the top `k` images in your dataset, loads each image, and plots it using Matplotlib. It also prints the metric score and, optionally, the description of each image. You will use this function to plot all the images based on the metrics you define.

Next, define a `plot_metric_distribution` function that creates a histogram of the specified metric scores using Plotly:

def plot_metric_distribution(metric_name: str, metrics_data_summary: MetricsSeverity):
    fig = px.histogram(metrics_data_summary.metrics[metric_name].df, x="score", nbins=50)
    fig.update_layout(title=f"{metric_name} score distribution", bargap=0.2)
    fig.show()

Run the function to visualize the score distribution based on the "Aspect Ratio" metric:

plot_metric_distribution("Aspect Ratio", metrics_data_summary)

Most images in the dataset have aspect ratios close to 1.5, roughly following a normal distribution. The set has only a few extremely small or extremely large image proportions.

Use EA's `create_image_size_distribution_chart` function to plot the size distribution of your images:

image_sizes = get_all_image_sizes(ea_project.file_structure)
fig = create_image_size_distribution_chart(image_sizes)
fig.show()

As you probably expected for an open-source dataset for computer vision applications, there is a dense cluster of points in the lower-left corner of the graph, indicating that many images have smaller resolutions, mostly below 2000 pixels in width and height. A few points are scattered further to the right, indicating images with a much larger width but not necessarily a proportional increase in height. These could represent panoramic images or images with unique aspect ratios. You'll identify such images in subsequent sections.

Inspect the Problematic Images

What are the severe and moderate outliers in the image set? You might also need insights into the distribution and severity of outliers across various imaging attributes. The attributes include metrics such as green values, blue values, area, etc. Use the `create_outlier_distribution_chart` utility to plot image outliers based on all the available metrics in EA.
The outliers are categorized into two levels: "severe outliers" (shown in red, "tomato") and "moderate outliers" (shown in orange):

available_metrics = load_available_metrics(ea_project.file_structure.metrics)
metrics_data_summary = get_metric_summary(available_metrics)
all_metrics_outliers = get_all_metrics_outliers(metrics_data_summary)

fig = create_outlier_distribution_chart(all_metrics_outliers, "tomato", "orange")
fig.show()

Here's the result: "Green Values," "Blue Values," and "Area" appear to be the most susceptible to outliers, while attributes like "Random Values on Images" have the fewest in the 'sasha/dog-food' dataset. This primarily means there are lots of images that have abnormally high values of green and blue tints. This could be due to the white balance settings of the cameras that captured the images or low-quality sensors. If your model trains on this set, more color-balanced images may well perturb its performance.

What are the blurry images in the image set?

Depending on your use case, you might discover that blurry images can sometimes deter your model. A model trained on clear images and then tested or used on blurry ones may not perform well. If the blur could lead to misinterpretations and errors with significant consequences, you might want to surface the blurry images to remove or enhance them.

plot_top_k_images('Blur', metrics_data_summary, ea_project, k=5, ascending=False)

Based on a "Blur" score of -9.473 calculated by Encord Active, here is the output with one of the five blurriest images:

What are the darkest images in the image set?

Next, surface images with poor lighting or low visibility. Dark images can indicate issues with quality. These could result from poor lighting during capture, incorrect exposure settings, or equipment malfunctions. Also, a model might struggle to recognize patterns in such images, which could reduce accuracy. Identify and correct these images to improve the overall training data quality.

plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=True)

The resulting image reflects a low brightness score of 0.164:

What are the duplicate or nearly similar images in the set?

Image singularity, in the context of image quality, is when images have unique or atypical characteristics compared to most images in a dataset. Duplicate images can highlight potential issues in the data collection or processing pipeline. For instance, they could result from artifacts from a malfunctioning sensor or a flawed image processing step. In computer vision tasks, duplicate images can disproportionately influence the trained model, especially if the dataset is small. Identify and address these images to improve the robustness of your model. Use the "Image Singularity" metric to determine the score and the images that are near duplicates:

plot_top_k_images('Image Singularity', metrics_data_summary, ea_project, k=15, show_description=True)

Here, you can see two nearly identical images with similar "Image Singularity" scores: the tiny difference between the singularity scores of the two images (0.01299857 for the left and 0.012998693 for the right) shows how similar they are. Check out other similar or duplicate images by running this code cell.

Awesome! You have played with a few pre-defined quality metrics. See the complete code to run other data quality metrics on the images.
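For context on the blur scores shown above, a common proxy for blur is the variance of the Laplacian over the grayscale image (the sign convention and scaling here are illustrative and may differ from Encord Active's implementation):

import cv2

def sharpness(path: str) -> float:
    """Variance of the Laplacian: low values mean little edge detail, i.e. a blurrier image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())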
Next Steps: Fixing Data Quality Issues

Identifying problematic images is half the battle. Ideally, the next step is to act on those insights and fix the issues. Encord Active (EA) can help you tag problematic images, which may otherwise skew model performance downstream. Post-identification, various strategies can be employed to rectify these issues. Below, I have listed some ways to fix problematic image issues.

Tagging and annotation

Once you identify the problematic images, you can tag them within EA. One of the most common workflows we see from our users at Encord is identifying image quality issues at scale with Encord Active, tagging problematic images, and sending them upstream for annotation with Annotate.

Re-labeling

Incorrect labels can significantly hamper model performance. EA facilitates the re-labeling process by exporting the incorrectly labeled images to an annotation platform like Encord Annotate, where you can correct the labels.

Active learning

Use active learning techniques to improve the quality of the dataset iteratively. You can establish a continuous improvement cycle by training the model on good-quality data and then evaluating it on lower-quality data to suggest which subsets to improve.

Active learning (encord.com)

Check out our practical guide to active learning for computer vision to learn more about active learning, its tradeoffs, alternatives, and a comprehensive explanation of active learning pipelines.

Image augmentation and correction

Image augmentation techniques enhance the diversity and size of the dataset to improve model robustness. Consider augmenting the data using techniques like rotation, scaling, cropping, and flipping. Some images may require corrections like brightness adjustment, noise reduction, or other image processing techniques to meet the desired quality standards.

Image quality is not a one-time task but a continuous process. Regularly monitoring and evaluating your image quality will help maintain a high-quality dataset, which is pivotal for achieving superior model performance.

Key Takeaways

There's an excellent article from Aliaksei Mikhailiuk that neatly describes the task of image quality assessment in three stages:

Define an objective
Gather the human labels for your dataset
Train objective quality metrics on the data

In this article, you defined the objective of training a binary classification model for your use case. Technically, you "gathered" human labels, since the open 'sasha/dog-food' dataset was already labeled on Hugging Face. Finally, using Encord Active, you computed image embeddings, ran metrics on the embeddings, and inspected the problematic images by exploring the dataset based on objective quality metrics. Identifying and fixing the errors in the dataset will set up your downstream model training and ML application for success.
As machine learning models become increasingly complex and ubiquitous, it's crucial to have a practical and methodical approach to evaluating their performance. But what's the best way to evaluate your models? Traditionally, aggregate accuracy scores like mean Average Precision (mAP), computed over the entire dataset, have been used. While these scores are useful during the proof-of-concept phase, they often fall short when models are deployed to production on real-world data. In those cases, you need to know how your models perform under specific scenarios, not just overall.

At Encord, we approach model evaluation with a data-centric approach built around model test cases. Think of them as the "unit tests" of the machine learning world. By running your models through a set of predefined test cases prior to deployment, you can identify any issues or weaknesses and improve your model's accuracy. Even after deployment, model test cases can be used to continuously monitor and optimize your model's performance, ensuring it meets your expectations.

In this article, we will explore the importance of model test cases and how you can define them using quality metrics. We will use a practical example to put this framework into context. Imagine you're building a model for a car parking management system that identifies car throughput, measures capacity at different times of the day, and analyzes the distribution of different car types. You've successfully trained a model that works well on Parking Lot A in Boston with the cameras you've set up to track the parking lot. Your proof of concept is complete, investors are happy, and they ask you to scale it out to different parking lots.

Car parking photos are taken under various weather and daytime conditions.

However, when you deploy the same model in a new parking house in Boston and in another state (e.g., Minnesota), you find that there are a lot of new scenarios you haven't accounted for: in the new parking lot in Boston, the cameras produce slightly blurrier images with different contrast levels, and the cars are closer to the cameras. In Minnesota, there is snow on the ground, different types of lines painted on the parking lot, and new types of cars that weren't in your training data. This is where a practical and methodical approach to testing these scenarios is important.

Let's explore the concept of defining model test cases in detail through five steps:

Identify Failure Mode Scenarios
Define Model Test Cases
Evaluate Granular Performance
Mitigate Failure Modes
Automate Model Test Cases

Identify Failure Mode Scenarios

Thoroughly testing a machine learning model requires considering potential failure modes, such as edge cases and outliers, that may impact its performance in real-world scenarios. Identifying these scenarios is a critical first step in the testing process of any model. Failure mode scenarios may include a wide range of factors that could impact the model's performance, such as changing lighting conditions, unique perspectives, or variations in the environment. Let's consider our car parking management system.
In this case, some of the potential edge cases and outliers could include:

Snow on the parking lot
Different types of lines painted on the parking lot
New types of cars that weren't in your training data
Different lighting conditions at different times of day
Different camera angles, perspectives, or distances to cars
Different weather conditions, such as rain or fog

By identifying scenarios where your model might fail, you can begin to develop model test cases that evaluate the model's ability to handle these scenarios effectively. It's important to note that identifying model failure modes is not a one-time process; it should be revisited throughout the development and deployment of your model. As new scenarios arise, it may be necessary to add new test cases to ensure that your model continues to perform effectively in all possible scenarios.

Furthermore, some scenarios might require specialized attention, such as adding new classes to the model's training data or implementing more sophisticated algorithms to handle complex scenarios. For example, when adding new types of cars to the model's training data, it may be necessary to gather additional data to train the model effectively on these new classes.

Define Model Test Cases

Defining model test cases is an important step in the machine learning development process, as it enables the evaluation of model performance and the identification of areas for improvement. As mentioned earlier, this involves specifying classes of new inputs beyond those in the original dataset for which the model is supposed to work well, and defining the expected model behavior on these new inputs.

Defining test cases begins with building hypotheses based on the different scenarios the model is likely to encounter in the real world. This can involve considering different environmental conditions, lighting conditions, camera angles, or any other factors that could affect the model's performance. You then define the expected model behavior under each scenario:

My model should achieve X in the scenario where Y.

It is crucial that the test case is quantifiable. That is, you need to be able to measure whether the test case passes or not. In the next section, we'll get back to how to do this in practice. For the car parking management system, you could define your model test cases as follows:

The model should achieve an mAP of 0.75 for car detection when cars are partially covered in or surrounded by snow.
The model should have an accuracy of 98% on parking spaces when the parking lines are partially covered in snow.
The model should achieve an mAP of 0.75 for car detection in parking houses under poor light conditions.

Evaluate Granular Performance

Once the model test cases have been defined, performance can be evaluated using appropriate performance metrics for each model test case. This might involve measuring the model's mAP, precision, and recall on data slices related to the specified test cases. To find the specific data slices relevant to your model test case, we recommend using quality metrics. Quality metrics are useful for evaluating your model's performance based on specific criteria, such as object size, blurry images, or time of day. In practice, they are additional parametrizations added on top of your data, labels, and model predictions, and they allow you to index all three in semantically relevant ways. Read more here.
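Putting the two previous sections together, here is a hedged sketch of how such a test case could be written down in code (the class and field names are illustrative assumptions, not an Encord Active construct):

from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelTestCase:
    """The model should score at least `threshold` on `metric` for the samples selected by `slice_fn`."""
    name: str
    slice_fn: Callable[[dict], bool]       # picks the samples that belong to the scenario
    metric: Callable[[list, list], float]  # e.g., your mAP or accuracy implementation
    threshold: float

    def passes(self, samples: list, predictions: list, labels: list) -> bool:
        idx = [i for i, s in enumerate(samples) if self.slice_fn(s)]
        score = self.metric([predictions[i] for i in idx], [labels[i] for i in idx])
        return score >= self.threshold

# Example: cars partially covered in snow should still be detected with mAP >= 0.75.
snow_case = ModelTestCase(
    name="car detection in snow",
    slice_fn=lambda s: s["snow_coverage"] > 0.3,  # hypothetical per-image attribute
    metric=lambda preds, gts: 0.0,                # replace with your mAP implementation
    threshold=0.75,
)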
Quality metrics can then be used to identify data slices related to your model test cases. To evaluate a specific model test case, you identify a slice of data that has the properties the test case defines and evaluate your model performance on that slice of data.

Mitigate Failure Modes

If a model test case fails and the model is not performing according to your expectations in the defined scenario, you need to take action to improve performance. This is where targeted data quality improvements come in. These improvements can take various shapes and forms, including:

Data collection campaigns: Collect new data samples that cover the identified scenarios. Remember to ensure data diversity by obtaining samples from different locations and parking lot types, and regularly update the dataset to account for new scenarios and maintain model performance.

Relabeling campaigns: If your failure modes are due to label errors in the existing dataset, it is worth correcting any inaccuracies or inconsistencies in the labels before collecting new data. If your use case is complex, we recommend collaborating with domain experts to ensure high-quality annotations.

Data augmentation: By applying methods such as rotation, color adjustment, and cropping, you can increase the diversity of your dataset. Additionally, you can utilize techniques to simulate various lighting conditions, camera angles, or environmental factors that the model might encounter in real-world scenarios. Implementing domain-specific augmentation techniques, such as adding snow or rain to images, can further enhance the model's ability to generalize to various situations.

Synthetic data generation: Creating artificial data samples can help expand the dataset, but it is essential to ensure that the generated data closely resembles real-world scenarios to maintain model performance. Combining synthetic data with real data can increase the dataset size and diversity, potentially leading to more robust models.

Automated Model Test Cases

Once you've defined your model test cases, you need a way to select data slices and test them in practice. This is where quality metrics and Encord Active come in. Encord Active is an open-source, data-centric toolkit that allows you to investigate and analyze your data distribution and model performance against these quality metrics in an easy and convenient way.

The chart above is automatically generated by Encord Active using uploaded model predictions. It shows the dependency between model performance and each metric: how much model performance is affected by each metric. With quality metrics, you identify areas where the model is underperforming, even if it's still achieving high overall accuracy. Thus, they are perfect for testing your model test cases in practice.

For example, the quality metric that specifically measures the model's performance in low-light conditions (see "Brightness" among the quality metrics in the figure above) will help you understand whether your car parking management system model will struggle to detect cars in low light. You could also use the "Object Area" quality metric to create a model test case that checks if your model has issues with different sizes of objects (different distances to cars result in different object areas). One of the benefits of Encord Active is that it is open-source, and it enables you to write your own custom quality metrics to test your hypotheses around different scenarios.
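To make the per-slice evaluation step concrete, here is a minimal sketch (the 'correct' column and the metric names are assumptions for illustration, not a fixed schema):

import pandas as pd

def slice_accuracy(df: pd.DataFrame, metric_col: str, low: float, high: float) -> float:
    """Accuracy on the rows whose quality-metric value falls within [low, high].

    Assumes a boolean 'correct' column per sample (hypothetical column name for this sketch).
    """
    data_slice = df[df[metric_col].between(low, high)]
    return float(data_slice["correct"].mean()) if len(data_slice) else float("nan")

# e.g., compare dark vs. bright frames:
# slice_accuracy(results_df, "Brightness", 0.0, 0.25) vs. slice_accuracy(results_df, "Brightness", 0.25, 1.0)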
Tip: If you have any specific things you'd like to test, please get in touch with us and we would gladly help you get started.

This means that you can define quality metrics that are specific to your use case and evaluate your model's performance against them. For example, you might define a quality metric that measures the model's performance in heavy rain conditions (a combination of low Brightness and Blur; see the short sketch at the end of this section). Finally, if you would like to visually inspect the slices that your model is struggling with, you can visualize the model predictions (TPs, FPs, and FNs).

Tip: You can use Encord Annotate to directly correct labels if you spot any outright label errors.

Back to the car parking management system example: once you have defined your model test cases and evaluated your model's performance against them using quality metrics, you can find low-performing "slices" of data. If you've defined a model test case for the scenario where there is snow on the ground in Minnesota, you can:

Compute the quality metric that measures its performance in snowy conditions.
Investigate how much this metric affects the overall performance.
Filter the slice of images where your model performance is low.
Set in motion a data collection campaign for images in similar conditions.
Set up an automated model test that always checks performance on snowy images for your future models.

Tip: If you already have a database of unlabeled data, you can leverage similarity search to find images of interest for your data collection campaigns.

Benefits of The Model Test Case Framework

As machine learning models continue to evolve, evaluating them is becoming more important than ever. By using a model test case framework, you can gain a more comprehensive understanding of your model's performance and identify areas for improvement. This approach is far more effective and safe than relying solely on high-level accuracy metrics, which can be insufficient for evaluating your model's performance in real-world scenarios. So, to summarize, the benefits of using model test cases instead of only high-level accuracy metrics are:

Enhanced understanding of your model: You gain a thorough understanding of your model by evaluating it in detail (rather than depending on one overall metric). Systematically analyzing its performance improves your (and your team's) confidence in its effectiveness during deployment and augments the model's credibility.

Allows you to concentrate on addressing model failure modes: Armed with an in-depth evaluation from Encord Active, efforts to improve a model can be directed toward its weak areas. Focusing on the weaker aspects of your model accelerates its development, optimizes engineering time, and minimizes data collection and labeling expenses.

Fully customizable to your specific case: One of the benefits of using open-source tools like Encord Active is that they enable you to write your own custom quality metrics and set up automated triggers without having to rely on proprietary software.

If you're interested in incorporating model test cases into your data annotation and model development workflow, don't hesitate to reach out.
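Before wrapping up, here is the hedged sketch of the "heavy rain" custom metric mentioned above (the thresholds and the combination rule are illustrative assumptions, not a recommended recipe):

def heavy_rain_score(brightness: float, blur: float) -> float:
    """Flag images that are both dark and blurry, a rough proxy for heavy-rain conditions.

    `brightness` is assumed to be normalized to [0, 1]; `blur` is a sharpness score where
    lower means blurrier (e.g., variance of the Laplacian). Returns 1.0 for likely-rainy images.
    """
    is_dark = brightness < 0.3
    is_blurry = blur < 100.0
    return 1.0 if (is_dark and is_blurry) else 0.0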
Conclusion

In this article, we started by understanding why defining model test cases, and using quality metrics to evaluate model performance against them, is essential. It is a practical and methodical approach for identifying data-centric failure modes in machine learning models. By defining model test cases, evaluating model performance against quality metrics, and setting up automated triggers to test them, you can identify areas where the model needs improvement, prioritize data labeling efforts accordingly, and improve the model's credibility with your team. Furthermore, it changes the development cycle from reactive to proactive: you can find and fix potential issues before they occur, instead of deploying your model in a new scenario, discovering poor performance, and then scrambling to fix it.

Open-source tools like Encord Active enable users to write their own quality metrics and set up automated triggers without having to rely on proprietary software. This can lead to more collaboration and knowledge sharing across the machine learning community, ultimately leading to more robust and effective machine learning models in the long run.