Validating Model Performance Using Encord Active

Stephen Oladele
March 1, 2024

Model validation is a key machine learning (ML) lifecycle stage, ensuring models generalize well to new, unseen data. This process is critical for evaluating a model's predictions independently from its training dataset, thus testing its ability to perform reliably in the real world. 

Model validation helps identify overfitting—where a model learns noise rather than the signal in its training data—and underfitting, where a model is too simplistic to capture complex data patterns. Both are detrimental to model performance.

Techniques like the holdout method, cross-validation, and bootstrapping are pivotal in validating model performance, offering insights into how models might perform on unseen data. These methods are integral to deploying AI and machine learning models that are both reliable and accurate.
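To make bootstrapping concrete, here is a minimal sketch that resamples per-sample validation results with replacement to estimate how much an accuracy figure might vary. The function name, the 1,000 resamples, and the toy results are illustrative assumptions, not part of any particular library.

```python
import random


def bootstrap_accuracy_interval(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    correct: list of 0/1 flags, one per validation sample (1 = correct prediction).
    """
    rng = random.Random(seed)
    n = len(correct)
    scores = []
    for _ in range(n_resamples):
        # Resample the validation results with replacement.
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy validation results: 85 correct predictions out of 100.
correct = [1] * 85 + [0] * 15
low, high = bootstrap_accuracy_interval(correct)
print(f"accuracy 0.85, approximate 95% interval: [{low:.2f}, {high:.2f}]")
```

A wide interval suggests the validation set is too small, or too varied, to support a confident performance claim.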

This article has two parts:

  • Key model validation techniques, the advantages of a data-centric approach, and how to select the most appropriate validation method for your project.
  • How to validate a Mask R-CNN pre-trained model that segments instances in COVID-19 scans using Encord Active, a data-centric platform for evaluating and validating computer vision (CV) models.

Ready to dig deeper into model validation and discover how Encord Active can enhance your ML projects? Let’s dive in!

The Vital Role of a Data-Centric Approach in Model Validation

A data-centric approach to model validation places importance on the quality of data in training and deploying computer vision (CV) and artificial intelligence (AI) models. The approach recognizes that the foundation of any robust AI system lies not in the complexity of its algorithms but in the quality of the data it learns from.

High-quality, accurately labeled data (with ground truth) ensures that models can truly understand and interpret the nuances of the tasks they are designed to perform, from predictive analytics to real-time decision-making processes.

Why Data Quality is Paramount

The quality of training data is directly proportional to a model's ability to generalize from training to real-world applications. Poor data quality—including inaccuracies, biases, label errors, and incompleteness—leads to models that are unreliable, biased, or incapable of making accurate predictions.

A data-centric approach prioritizes meticulous data preparation, including thorough data annotation, cleaning, and validation. This ensures the data distribution truly reflects the real world it aims to model and reduces label errors. 

Improving Your Model’s Reliability Through Data Quality

In critical applications such as healthcare imaging and autonomous driving, the importance of reliable CV models (and, more recently, foundation models) cannot be overstated.

A data-centric approach mitigates the risks associated with model failure by ensuring the data has high fidelity. It involves rigorous validation checks and balances, using your expertise and automated data quality tools to continually improve your label quality and datasets.


Key Computer Vision Model Validation Techniques

Validating computer vision models after training calls for a data-centric approach that looks at more than just performance and generalizability. It also needs to account for the unique problems of visual data, like how image quality, lighting, and perspectives can vary.

Tailoring the common validation techniques specifically for computer vision is about robustly evaluating the model's ability to analyze visual information and embeddings across diverse scenarios:

  • Out-of-Sample Validation: Essential for verifying that a CV model can generalize from its training data to new, unseen images or video streams. This approach tests the model's ability to handle variations in image quality, lighting, and subject positioning that it hasn't encountered during training.
  • Cross-Validation and Stratified K-Fold: Particularly valuable in computer vision for ensuring that every aspect of the visual data is represented in both training and validation sets. Stratified K-Fold is beneficial when dealing with imbalanced datasets, common in computer vision tasks, because it keeps the class proportions roughly the same in every fold.
  • Leave-One-Out Cross-Validation (LOOCV): While computationally intensive, LOOCV can be particularly insightful for small image datasets where every data point's inclusion is crucial for assessing the model's performance on highly nuanced visual tasks.
  • Bootstrapping: Offers insights into the stability of model predictions across different visual contexts. This method helps understand how training data subset changes can affect the model's performance, which is particularly relevant for models expected to operate in highly variable visual environments.
  • Adversarial Testing: Tests the model's resilience against slight, often invisible, image changes. This technique is critical to ensuring models are not easily perturbed by minor alterations that would not affect human perception.
  • Domain-Specific Benchmarks: Participating in domain-specific challenges offered by ImageNet, COCO, or PASCAL VOC can be a reliable validation technique. These benchmarks provide standardized datasets and metrics, allowing for evaluating a model's performance against a wide range of visual tasks and conditions, ensuring it meets industry standards.
  • Human-in-the-Loop: Involving domain experts in the validation process is invaluable, especially for tasks requiring fine-grained visual distinctions (e.g., medical imaging or facial recognition). This approach helps ensure that the model's interpretations align with human expertise and can handle the subtleties of real-world visual data.
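As an illustration of the stratified K-fold idea above, the sketch below splits sample indices so that each fold keeps roughly the same class mix as the full dataset. It is a plain-Python approximation written for this article; the function name and the toy "covid"/"normal" labels are assumptions, and in practice you would likely reach for a library implementation.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k lists of sample indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy imbalanced dataset: 10 positive scans, 40 negative scans.
labels = ["covid"] * 10 + ["normal"] * 40
folds = stratified_k_fold(labels, k=5)

for i, val_idx in enumerate(folds):
    held_out = set(val_idx)
    train_idx = [j for j in range(len(labels)) if j not in held_out]
    n_pos = sum(labels[j] == "covid" for j in val_idx)
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} val, {n_pos} covid in val")
```

Each fold serves once as the validation set while the remaining folds form the training set, so the rare class appears in every validation round.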

Ensuring a model can reliably interpret and analyze visual information across various conditions requires a careful balance between automated validation methods and human expertise. 

Choosing the right validation techniques for CV models involves considering the dataset's diversity, the computational resources available, and the application's specific requirements.

Luckily, model validation tools can do the heavy lifting of providing the insights you need to validate your CV model’s performance, including AI-assisted evaluation features, so you can focus on validating the model.

But before walking through Encord Active, let’s understand the factors you need to consider for choosing the right tool.

How to Choose the Right Computer Vision Model Validation Tool

When choosing the right model validation tool for computer vision projects, several key factors come into play, each addressing the unique challenges and requirements of working with image data. 

These considerations ensure that the selected tool accurately evaluates the model's performance and aligns with the project's specific demands. Here's a streamlined guide to making an informed choice:

  • Data Specificity and Complexity: Opt for tools that cater to the variability and complexity inherent in image data. This means that support for image-specific metrics, such as Intersection over Union (IoU) for object detection and Mean Absolute Error (MAE) for tasks like classification and segmentation, is crucial.
  • Robust Data Validation: The tool should adeptly manage image data peculiarities, including potential discrepancies between image annotations and the actual images. Look for features that support comprehensive data validation across various stages of the model development cycle, including pre-training checks and ongoing training validations.
  • Comprehensive Evaluation Metrics: Essential for thoroughly assessing a computer vision model's performance. The tool should offer a wide array of metrics, including precision-recall curves, ROC curves, and confusion matrices for classification, alongside task-specific metrics like IoU for object detection. It should also support quality metrics for a more holistic, real-world evaluation.
  • Versatile Performance Evaluation: It should support a broad spectrum of evaluation techniques for deep insights into accuracy, the balance between precision and recall, and the model’s ability to distinguish between different classes.
  • Dataset Management: The validation tool should help with efficient dataset handling for proper training-validation splits. For the sake of performance and scale, it should be able to manage large datasets.
  • Flexibility and Customization: The fast-paced nature of computer vision demands tools that allow for customization and flexibility. This includes introducing custom metrics, supporting various data types and model architectures, and adapting to specific preprocessing and integration needs.
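Since IoU comes up repeatedly in the criteria above, here is a small sketch of how it can be computed for bounding boxes and for binary masks. The box format (x_min, y_min, x_max, y_max) and the nested-list mask representation are assumptions chosen to keep the example dependency-free.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two binary masks given as same-shaped nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            inter += 1 if (pixel_a and pixel_b) else 0
            union += 1 if (pixel_a or pixel_b) else 0
    return inter / union if union else 0.0


# Two 2x2 boxes that overlap in a 1x1 square: intersection 1, union 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))

# Two small masks sharing one pixel out of three covered pixels.
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```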

Considering those factors, you can select a validation tool (open-source toolkits, platforms, etc.) that meets your project's requirements and contributes to developing reliable models.

Using Encord Active to Validate the Performance of Your Computer Vision Model

Encord Active (EA) is a data-centric model validation solution that enables you to curate valuable data that can truly validate your model’s real-world generalizability through quality metrics.

In this section, you will see how to analyze the performance of a pre-trained Mask R-CNN instance segmentation model on COVID-19 scans with Encord Active. From the analysis results, you will be able to validate and, if necessary, debug your model's performance.

This walkthrough uses Encord Annotate to create a project and import the dataset. We use Encord Active Cloud to analyze the model’s failure modes. We recommend you sign up for an Encord account to follow this guide.

Import Predictions

Import your predictions onto the platform. Learn how to import Predictions in the documentation.

Import prediction set to Encord

Select the Prediction Set you just uploaded, and Encord Active will use data, label, and model quality metrics to evaluate the performance of your model:

Selecting the prediction set - Encord

Visualize Model Performance Summary on the Validation Set

Evaluate the model’s performance by inspecting the Model Summary dashboard. It gives an overview of your model’s performance on the validation set with a detailed error categorization (true positive vs. false positive vs. false negative), the F1 score, and mean average precision/recall based on an IoU threshold:

Model Summary Dashboard - Encord
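The summary figures build on counts of true positives, false positives, and false negatives. As a rough sketch of the standard definitions (not Encord Active's internal code), precision, recall, and F1 can be derived from those counts like this; the example counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from outcome counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 correct detections, 20 spurious ones, 40 missed objects.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```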

Manually Inspect the Model Results

Beyond visualizing a summary of the model’s performance, it helps to use a tool that lets you manually dig in and inspect how your model works on real-world samples. Encord Active provides an Explorer tab that enables you to filter predictions by metrics to observe the impact of those metrics on real-world samples.

EA’s data-centric build also lets you see how your model correctly or incorrectly makes predictions (detects, classifies, or segments) on the training, validation, and production samples. 

Let’s see how you can achieve this. On the Model Summary dashboard → Click the True Positive Count metric to inspect the predictions your model got right:

Model Summary - Encord

Click on one of the images using the expansion icon to see how well the model detects the class, the confidence score with which it predicts the object, other scores on performance metrics, and metadata.

Still under the Explorer tab → Click on Overview (the tab on the right) → Click on False Positive Count to inspect predictions that did not correctly match a labeled object:

False Positive Count - Encord

It seems most predictions flagged as False Positives are due to poor object classification quality (the annotations are not 100% accurate). Let’s look closely at an instance:

Broken Object Track - Quality Metrics in Encord

In that instance, the model correctly predicts that the object is ‘Cardiomediastinum’. Still, the second overlapping annotation has a broken track for some reason, so Encord Active classifies the prediction as a false positive using a combination of Broken Object Track and other relevant quality metrics.

Under Filter → Add filter, you will see parameters and attributes to filter your model’s performance. For example, if you added your validation set to Active through Annotate, you can validate your model’s performance on that set and, likewise, on the production set.

Adding Filters on Encord

Visualize the Impact of Metrics on Model Performance

Evaluate the model outcome count to understand the distribution of the correct and incorrect results for each class. Under the Model Evaluation tab → Click on Outcome to see the distribution chart:

Model Evaluation Encord

Now, you should see how many predictions the model gets right and wrong for each class. Using this chart, you can get a high-level perspective on the issues with your model. In this case, the model fails to segment the ‘Airways’ object in the instances correctly. The Intersection over Union (IoU) Threshold is 0.5: a prediction counts as correct only if it overlaps its matching ground-truth label with an IoU of at least 0.5.

Use the IoU Threshold slider under the Overview tab to see the outcome count based on a higher or lower threshold. You can also select specific classes you want to inspect under the Classes option.
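To see why moving the threshold changes the outcome count, consider the sketch below. It greedily matches predictions to ground-truth objects in order of confidence, which is a common convention but only an assumption about how any given tool, Encord Active included, assigns outcomes. The input format and the toy IoU values are hypothetical.

```python
def count_outcomes(predictions, num_ground_truths, iou_threshold=0.5):
    """Count TP/FP/FN for one class.

    predictions: list of (confidence, ious), where ious[j] is the IoU between
    that prediction and ground-truth object j.
    """
    matched = set()
    tp = fp = 0
    # Higher-confidence predictions get first pick of the ground-truth objects.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    for _confidence, ious in ranked:
        candidates = [
            (iou, j)
            for j, iou in enumerate(ious)
            if iou >= iou_threshold and j not in matched
        ]
        if candidates:
            _, best = max(candidates)
            matched.add(best)
            tp += 1
        else:
            # No unmatched object overlaps enough: a false positive.
            fp += 1
    fn = num_ground_truths - len(matched)
    return {"tp": tp, "fp": fp, "fn": fn}


predictions = [
    (0.95, [0.82, 0.05]),
    (0.80, [0.60, 0.10]),  # second detection of object 0
    (0.60, [0.00, 0.55]),
]
for threshold in (0.5, 0.7):
    print(threshold, count_outcomes(predictions, 2, iou_threshold=threshold))
```

Raising the threshold from 0.5 to 0.7 turns the loosely overlapping third prediction from a true positive into a false positive, and its object into a false negative.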

Dig Deeper into the Metrics

Once you understand the model outcome count, you can dig deeper into specific metrics like precision, recall, and F1 scores if they are relevant to your targets.

Model evaluation outcome - Encord

Notice the low precision, recall, and F1 scores per class! Also, group the scores by the model outcome count to understand how the model performs in each class.

You could also use the precision-recall curve to analyze and highlight the classes that are harder for the model to detect with high confidence.

Precision Recall Curve on Encord
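For intuition about what a precision-recall curve summarizes, this sketch builds the curve points from predictions ranked by confidence and computes average precision with all-point interpolation. It illustrates the general technique under assumed inputs (each prediction already labeled as a true or false positive); it does not claim to reproduce the platform's exact calculation.

```python
def precision_recall_points(scored_outcomes, num_ground_truths):
    """Return (recall, precision) after each prediction, ranked by confidence.

    scored_outcomes: list of (confidence, is_true_positive).
    """
    points = []
    tp = fp = 0
    ranked = sorted(scored_outcomes, key=lambda s: s[0], reverse=True)
    for _confidence, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truths, tp / (tp + fp)))
    return points


def average_precision(points):
    """Area under the PR curve using all-point interpolation."""
    ap = 0.0
    prev_recall = 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this or any higher recall.
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


# Toy class with 4 ground-truth objects and 5 predictions.
outcomes = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.5, False)]
points = precision_recall_points(outcomes, num_ground_truths=4)
ap = average_precision(points)
print(points)
print(f"average precision: {ap:.3f}")
```

A class whose precision drops quickly as recall grows is one the model only detects reliably at its highest confidence scores.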

You can also break down the model’s precision and recall values for the predictions of each object over the metrics you want to investigate. For example, to see precision and recall by the Object Classification Quality metric, go to Metric Performance → Select the Metric dropdown menu → Choose the metric you want to break down the model’s precision by:

Metric Performance by Encord Active
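Conceptually, a breakdown like this groups predictions by the value of a quality metric and computes performance per group. The sketch below does that for precision, using a hypothetical relative-area value per prediction; the bucket edges, data, and function name are assumptions for illustration only.

```python
def precision_by_metric_bucket(predictions, edges):
    """Group predictions into metric-value buckets and compute precision per bucket.

    predictions: list of (metric_value, is_true_positive).
    edges: ascending bucket boundaries, e.g. [0.0, 0.01, 0.05, 1.0].
    """
    buckets = {(low, high): [0, 0] for low, high in zip(edges, edges[1:])}
    for value, is_tp in predictions:
        for (low, high), counts in buckets.items():
            # Buckets are half-open [low, high); the last one includes its upper edge.
            if low <= value < high or (high == edges[-1] and value == high):
                counts[0] += int(is_tp)  # true positives in this bucket
                counts[1] += 1           # all predictions in this bucket
                break
    return {
        bucket: (tp / total if total else None)
        for bucket, (tp, total) in buckets.items()
    }


# Hypothetical predictions: (relative area of the object, was it a true positive?)
predictions = [
    (0.005, False), (0.008, False), (0.009, True),
    (0.03, True), (0.04, False),
    (0.2, True), (0.4, True),
]
result = precision_by_metric_bucket(predictions, edges=[0.0, 0.01, 0.05, 1.0])
for bucket, value in result.items():
    print(bucket, value)
```

In this toy data, precision rises with object size, which is the kind of pattern the Metric Performance view is meant to surface.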

Validate the Model’s Performance on Business Criteria

Now it’s time to see the metrics impacting the model’s performance the most and determine, based on your information, if it’s good or bad (needs debugging) for business. 

For instance, if Confidence is among the lowest-performing metrics, you might worry that your vision model makes naive predictions, given the false positive and false negative counts you saw earlier in the outcome chart.

Here is the case for this model under the Metric Performance dashboard (remember, you can use the IoU Threshold slider to check the metric impact at different thresholds):

Performance metric for object classification in Encord

The Relative Area (the object's size) significantly influences our model’s performance. Considering the business environment in which you want to deploy the model, is this acceptable? That is up to you to decide based on your technical and business requirements. If the model does not meet them, you can run more experiments and train more models until you find one that does.

Awesome! You have seen how Encord Active plays a key role in providing features for validating your model’s performance with built-in metrics. In addition, it natively integrates with Encord Annotate, an annotation tool, to facilitate data quality improvement that can enhance the performance of your models. 



Selecting the right model validation tools ensures that models perform accurately and efficiently. It involves assessing a model's performance quantitatively, through metrics such as IoU, mAP (mean Average Precision), and MAE, or qualitatively, through review by subject matter experts.

The choice of evaluation metric should align with the business objectives the model aims to achieve. Furthermore, model selection hinges on comparing various models using these metrics within a carefully chosen evaluation schema, emphasizing the importance of a proper validation strategy to ensure robust model performance before deployment.

Validating model performance is particularly vital in sectors where inaccurate predictions could compromise safety. Check out our customer stories to learn from large and small teams that have improved their data quality and model performance with the help of Encord.

Platforms like Encord, which specialize in improving data and model quality, are instrumental in this context. Encord Active, among others, provides features designed to refine data quality and bolster model accuracy, mitigating the risks associated with erroneous predictions or data analysis.

Written by Stephen Oladele
Stephen Oladele is a Developer Advocate and an MLOps Technical Content Creator at Encord. He has significant experience building and managing data communities, and you will find him learning and discussing machine learning topics across Discord, Slack and Twitter.