As machine learning models become increasingly complex and ubiquitous, it's crucial to have a practical and methodical approach to evaluating their performance. But what's the best way to evaluate your models?
Traditionally, aggregate scores like Mean Average Precision (mAP), computed over the entire dataset, have been used. While these scores are useful during the proof-of-concept phase, they often fall short once models are deployed to production on real-world data. In those cases, you need to know how your model performs in specific scenarios, not just overall.
At Encord, we take a data-centric approach to model evaluation built around model test cases. Think of them as the "unit tests" of the machine learning world. By running your model through a set of predefined test cases prior to deployment, you can identify issues or weaknesses and improve your model's accuracy. Even after deployment, model test cases can be used to continuously monitor and optimize your model's performance, ensuring it keeps meeting your expectations.
In this article, we will explore the importance of model test cases and how you can define them using quality metrics.
We will use a practical example to put this framework into context. Imagine you’re building a model for a car parking management system that identifies car throughput, measures capacity at different times of the day, and analyzes the distribution of different car types.
You've successfully trained a model that works well on Parking Lot A in Boston with the cameras you've set up to track the parking lot. Your proof of concept is complete, investors are happy, and they ask you to scale it out to different parking lots.
Car parking photos are taken under various weather and daytime conditions.
However, when you deploy the same model in a new parking garage in Boston and in another state (e.g., Minnesota), you find plenty of new scenarios you haven't accounted for: snow on the ground, poor lighting at night, cars at very different distances from the camera, and car types that rarely appeared in your original data.
This is where a practical and methodical approach to testing these scenarios is important.
Let's explore the concept of defining model test cases in detail through five steps:

1. Identify potential model failure modes
2. Define model test cases for those scenarios
3. Evaluate model performance on each test case using quality metrics
4. Test the model test cases in practice
5. Make targeted improvements where the test cases fail
Thoroughly testing a machine learning model requires considering potential failure modes, such as edge cases and outliers, that may impact its performance in real-world scenarios. Identifying these scenarios is a critical first step in the testing process of any model.
Failure mode scenarios may include a wide range of factors that could impact the model's performance, such as changing lighting conditions, unique perspectives, or variations in the environment.
Let's consider our car parking management system. In this case, some of the potential edge cases and outliers could include:

Snow covering the parking lot and the cars during Minnesota winters
Low-light or nighttime conditions in which cars are harder to detect
Heavy rain that darkens and blurs the camera feed
Cars far away from the camera that appear as very small objects
New or unusual car types that were not present in the original training data
By identifying scenarios where your model might fail, you can begin to develop model test cases that evaluate the model's ability to handle these scenarios effectively.
It's important to note that identifying model failure modes is not a one-time process and should be revisited throughout the development and deployment of your model. As new scenarios arise, it may be necessary to add new test cases to ensure that your model continues to perform effectively in all possible scenarios.
Furthermore, some scenarios might require specialized attention, such as the addition of new classes to the model's training data or the implementation of more sophisticated algorithms to handle complex scenarios.
For example, in the case of adding new types of cars to the model's training data, it may be necessary to gather additional data to train the model effectively on these new classes.
Defining model test cases is an important step in the machine learning development process as it enables the evaluation of model performance and the identification of areas for improvement. As mentioned earlier, this involves specifying classes of new inputs beyond those in the original dataset for which the model is supposed to work well, and defining the expected model behavior on these new inputs.
Defining test cases begins with building hypotheses about the different scenarios the model is likely to encounter in the real world. This can involve considering environmental conditions, lighting conditions, camera angles, or any other factors that could affect the model's performance. You then define the expected model behavior under each scenario:
My model should achieve X in the scenario where Y
It is crucial that the test case is quantifiable. That is, you need to be able to measure whether the test case passes or not. In the next section, we’ll get back to how to do this in practice.
For the car parking management system, you could define your model test cases as follows:

My model should achieve at least X mAP on images where there is snow on the ground
My model should achieve at least X recall on cars in low-light (nighttime) images
My model should achieve at least X precision on cars far from the camera (small object areas)
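To make test cases like these executable, it helps to encode them as data. The sketch below is a minimal, hypothetical example (the class name, field names, metadata keys, and thresholds are illustrative and not part of any library) of how such test cases could be expressed in Python so they can be checked automatically:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelTestCase:
    """A quantifiable "my model should achieve X in scenario Y" statement."""
    name: str                             # human-readable scenario description
    slice_filter: Callable[[dict], bool]  # selects samples belonging to the scenario
    metric: str                           # e.g. "mAP", "precision", "recall"
    threshold: float                      # minimum acceptable value for the metric

# Hypothetical test cases for the car parking management system;
# the metadata keys and thresholds are placeholders.
test_cases = [
    ModelTestCase(
        name="snow on the ground",
        slice_filter=lambda s: s["weather"] == "snow",
        metric="mAP",
        threshold=0.70,
    ),
    ModelTestCase(
        name="low-light / nighttime images",
        slice_filter=lambda s: s["brightness"] < 0.25,
        metric="recall",
        threshold=0.80,
    ),
]
```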
Once the model test cases have been defined, the performance can be evaluated using appropriate performance metrics for each model test case.
This might involve measuring the model's mAP, precision, and recall on the data slices related to each specified test case.
To find the specific data slices relevant to your model test cases, we recommend using quality metrics.
Quality metrics let you evaluate your model's performance against specific criteria, such as object size, image blur, or time of day. In practice, they are additional parametrizations computed on top of your data, labels, and model predictions, and they allow you to index all three in semantically relevant ways.
Quality metrics can then be used to identify the data slices related to your model test cases. To evaluate a specific model test case, you select the slice of data that has the properties the test case defines and measure your model's performance on that slice.
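As a rough sketch of that workflow, the function below (building on the hypothetical ModelTestCase above) filters a set of samples down to the slice a test case defines and compares the chosen metric against its threshold. The sample structure and the compute_metric helper are stand-ins for whatever evaluation code you already have:

```python
def evaluate_test_case(test_case, samples, compute_metric):
    """Evaluate one model test case on the slice of data it defines.

    samples: list of dicts holding per-image metadata, labels, and predictions.
    compute_metric: function (metric_name, data_slice) -> float, e.g. your own
                    mAP / precision / recall implementation.
    """
    data_slice = [s for s in samples if test_case.slice_filter(s)]
    if not data_slice:
        raise ValueError(f"No samples match test case '{test_case.name}'")

    score = compute_metric(test_case.metric, data_slice)
    passed = score >= test_case.threshold
    print(f"{test_case.name}: {test_case.metric}={score:.3f} "
          f"(threshold {test_case.threshold}) -> {'PASS' if passed else 'FAIL'}")
    return passed
```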
If a model test case fails and the model is not performing according to your expectations in the defined scenario, you need to take action to improve performance. This is where targeted data quality improvements come in. These improvements can take various shapes and forms, including:

Collecting and labeling more data from the failing scenario (e.g., snowy or nighttime images)
Correcting label errors in the existing training data
Adding new classes (e.g., new car types) to the training data
Retraining the model on the expanded or corrected dataset
Once you've defined your model test cases, you need a way to select the relevant data slices and test them in practice. This is where quality metrics and Encord Active come in.
Encord Active is an open-source data-centric toolkit that allows you to investigate and analyze your data distribution and model performance against these quality metrics, in an easy and convenient way.
The chart above is automatically generated by Encord Active from uploaded model predictions. It shows the dependency between model performance and each quality metric, i.e., how much model performance is affected by that metric.
Quality metrics let you identify areas where the model is underperforming, even if it is still achieving high overall accuracy. This makes them well suited to evaluating your model test cases in practice.
For example, the "Brightness" quality metric (see the figure above) will help you understand whether your car parking management model struggles to detect cars in low-light conditions.
You could also use the "Object Area" quality metric to create a model test case that checks whether your model has issues with objects of different sizes (cars at different distances from the camera cover different object areas).
One of the benefits of Encord Active is that it is open source, which lets you write your own custom quality metrics to test your hypotheses about different scenarios.
Tip: If you have any specific things you’d like to test please get in touch with us and we would gladly help you get started.
This means that you can define quality metrics that are specific to your use case and evaluate your model's performance against them. For example, you might define a quality metric that flags heavy rain conditions (a combination of low Brightness and blurriness).
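As a standalone illustration (not the actual Encord Active metric interface, which is documented in the project itself), such a "heavy rain" quality metric could combine a brightness score with a blur score computed with OpenCV; the thresholds below are arbitrary placeholders:

```python
import cv2
import numpy as np

def brightness_score(image_bgr: np.ndarray) -> float:
    """Mean pixel intensity of the image, normalised to [0, 1]."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(gray.mean() / 255.0)

def blur_score(image_bgr: np.ndarray) -> float:
    """Variance of the Laplacian; low values indicate blurry images."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def looks_like_heavy_rain(image_bgr: np.ndarray,
                          max_brightness: float = 0.35,
                          max_sharpness: float = 100.0) -> bool:
    """Flag images that are both dark and blurry, a rough proxy for heavy rain."""
    return (brightness_score(image_bgr) < max_brightness
            and blur_score(image_bgr) < max_sharpness)
```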
Finally, if you would like to visually inspect the slices your model is struggling with, you can visualize the model's predictions (TPs, FPs, and FNs) once you have uploaded them.
Tip: You can use Encord Annotate to directly correct labels if you spot any outright label errors.
Back to the car parking management system example:
Once you have defined your model test cases and evaluated your model's performance against them using quality metrics, you can find the low-performing "slices" of data.
If you've defined a model test case for the scenario where there is snow on the ground in Minnesota, you can:

Collect more images of snow-covered parking lots
Label the new images and add them to your training data
Retrain the model and re-run the test case to confirm that performance now meets your threshold
Tip: If you already have a database of unlabeled data you can leverage similarity search to find images of interest for your data collection campaigns.
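As a rough sketch of what such a similarity search could look like, assuming you have already computed an embedding vector per image with a model of your choice (the embedding source is an assumption here, not something prescribed by Encord), you can rank unlabeled images by cosine similarity to a query image of the scenario you are missing:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_similar_images(query_embedding, candidate_embeddings, top_k=50):
    """Rank unlabeled images by similarity to a query image.

    candidate_embeddings: dict mapping image path -> embedding vector,
    produced by whatever embedding model you use for your images.
    """
    scored = [
        (path, cosine_similarity(query_embedding, embedding))
        for path, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```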
As machine learning models continue to evolve, evaluating them is becoming more important than ever. By using a model test case framework, you gain a more comprehensive understanding of your model's performance and identify areas for improvement. This approach is far more effective and safer than relying solely on high-level accuracy metrics, which can be insufficient for evaluating model performance in real-world scenarios.
So, to summarize, the benefits of using model test cases instead of only high-level accuracy metrics are:
Enhanced understanding of your model: You gain a thorough understanding of your model by evaluating it in detail rather than depending on one overall metric. Systematically analyzing its performance improves your (and your team's) confidence in its effectiveness during deployment and strengthens the model's credibility.
Allows you to concentrate on addressing model failure modes: Armed with an in-depth evaluation from Encord Active, efforts to improve a model can be directed toward its weak areas. Focusing on the weaker aspects of your model accelerates its development, optimizes engineering time, and minimizes data collection and labeling expenses.
Fully customizable to your specific case: One of the benefits of using open-source tools like Encord Active is that it enables you to write your own custom quality metrics and set up automated triggers without having to rely on proprietary software.
If you're interested in incorporating model test cases into your data annotation and model development workflow, don't hesitate to reach out.
In this article, we started by covering why defining model test cases, and using quality metrics to evaluate model performance against them, is essential: it is a practical and methodical approach to identifying data-centric failure modes in machine learning models.
By defining model test cases, evaluating model performance against quality metrics, and setting up automated triggers to test them, you can identify areas where the model needs improvement, prioritize data labeling efforts accordingly, and improve the model's credibility with your team.
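One way to set up such an automated trigger, sketched here under the assumption that the hypothetical helpers from earlier in this article live in a module of your own (called model_tests below), is to run the test cases as a pytest suite in CI so every new model version is checked before deployment:

```python
# test_model_cases.py -- run with `pytest` in CI before promoting a model.
import pytest

# Hypothetical module containing the ModelTestCase definitions, the
# evaluate_test_case helper, and your data/metric loading code.
from model_tests import test_cases, evaluate_test_case, load_eval_samples, compute_metric

@pytest.mark.parametrize("case", test_cases, ids=lambda c: c.name)
def test_model_case(case):
    samples = load_eval_samples()
    assert evaluate_test_case(case, samples, compute_metric), (
        f"Model test case failed: {case.name}"
    )
```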
Furthermore, it changes the development cycle from reactive to proactive: you can find and fix potential issues before they occur, rather than deploying your model in a new scenario, discovering poor performance, and only then trying to fix it.
Open-source tools like Encord Active enable users to write their own quality metrics and set up automated triggers without having to rely on proprietary software. This can lead to more collaboration and knowledge sharing across the machine-learning community, ultimately leading to more robust and effective machine-learning models in the long run.
Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI.