3 Key Considerations for Regulatory Compliance in AI Systems
There’s nothing worse than putting time, effort, and resources into building an artificial intelligence (AI), machine learning (ML), or computer vision (CV) model only to find out you can’t use it. Regulatory compliance is one of those mission-critical factors, especially in sectors such as healthcare, that you can’t afford to overlook.
It’s even worse if what you’re missing is operationally crucial, such as ensuring that the whole process of data management, labeling, annotation, model training, and production is geared to align with regulatory compliance practices.
When it comes to building AI systems, you’ve got to take data compliance considerations into account from day one; otherwise, your project will be finished before it even begins.
What is the importance of regulatory compliance?
Compliance regulations exist for good reason, especially when it comes to handling any kind of potentially sensitive data, including images and videos.
Data compliance regulations exist to ensure that companies, governments, and researchers handle data responsibly and ethically. However, developing machine learning models and emerging technologies that derive meaningful information from imagery is a challenging task. Compliance regulations can create additional headaches when designing these systems for AI application use cases, including computer vision models in healthcare and clinical operations.
Production models run in the real world on out-of-sample data. They evaluate never-before-seen data to make predictions and generate outcomes, and they can only make those predictions based on the datasets they were trained on. Even the smartest ML or CV models can’t reason and infer as a human can when encountering new data without a frame of reference.
To ensure the highest performance possible, algorithmic models must train on a vast amount and variety of data.
However, different legal frameworks govern data in different ways. When building and training a model, the data used must be compliant with the regulatory framework where the data originated, even if the model is being built or deployed elsewhere.
For example, some jurisdictions have stricter laws protecting citizens' identifiable information than others. Models trained on data collected in these jurisdictions might not be able to be shipped elsewhere. Similarly, healthcare AI systems trained on US data must often meet HIPAA regulations with unique criteria for patients’ medical data, creating constraints around where the model can be deployed.
Machine learning engineers must successfully navigate the inherent tension between acquiring as much data as possible and abiding by compliance regulations. With that in mind, here are three compliance considerations to take into account when building production AI technologies.
What are the three key considerations for regulatory compliance?
In this article, we cover the following three key considerations for regulatory compliance:
- Partitioning Training Data For Data Privacy
- Auditability for Data Annotations
- Data Compliance Throughout The Release Lifecycle: From Annotation to CV Model Deployment
Partitioning Training Data For Data Privacy
To follow best practices for data-centric AI, you should train a model on large volumes of diverse and high-quality labeled datasets. However, you can’t just mix and match data as needed to fill out your training dataset.
Data operations teams must be sure that the data they’re using complies with the regulatory requirements of its country, state, or region of origin. Within each jurisdiction, different institutions and governing bodies will have different requirements for handling data, achieving regulatory compliance, and broader risk management.
For instance, let’s say you’re building a computer vision model for medical imaging. You’ve obtained a million images from various hospitals to train the model. However, one-third of the images originated in the US, so that data is subject to HIPAA regulations. In contrast, another third originated in Europe (specifically within the European Union), so it’s subject to GDPR. Meanwhile, the last third is open-source and, therefore, freely licensed.
Unfortunately, training one model on all these images would be difficult while ensuring the outputs remain compliant. For regulatory compliance reasons, it would be better to partition the data into separate buckets and build three distinct models so that each one is compliant with the appropriate regulatory framework as determined by the data’s origins.
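The partitioning step can be sketched with a few lines of Python. This is a minimal illustration, not Encord’s implementation: the record format and the `jurisdiction` tag are assumptions, standing in for whatever provenance metadata your ingestion pipeline attaches to each image.

```python
from collections import defaultdict

# Hypothetical record format: each image carries a "jurisdiction" tag
# assigned at ingestion time (e.g. "US" for HIPAA-governed data,
# "EU" for GDPR-governed data, "OPEN" for freely licensed data).
images = [
    {"id": "img-001", "jurisdiction": "US"},
    {"id": "img-002", "jurisdiction": "EU"},
    {"id": "img-003", "jurisdiction": "OPEN"},
    {"id": "img-004", "jurisdiction": "US"},
]

def partition_by_jurisdiction(records):
    """Group records into per-jurisdiction buckets so that each
    model trains only on data governed by one regulatory framework."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["jurisdiction"]].append(record)
    return dict(buckets)

buckets = partition_by_jurisdiction(images)
# Each bucket then feeds a separate training run, e.g.
# train_model(buckets["US"]), train_model(buckets["EU"]), ...
```

The design point is simply that the split happens before training, driven by metadata recorded at ingestion, so no model ever sees data from a framework it wasn’t built for.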
Documenting and showing your workflows and processes will also be important to prove that you followed the respective compliance rules from the start. So, keep a clear record of the training data used for each computer vision model.
Traceability can create a significant challenge from an engineering perspective. It’s a cumbersome and difficult task but also a serious consideration when building production AI. If you spend resources building a model only to realize later that one piece of data in the training dataset wasn’t compliant, you’ll have to scrap that model. You’d then have to go through the entire building process again, retraining the model without the non-compliant data.
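One practical mitigation is a fail-fast compliance gate that validates every record before training starts, so a single non-compliant item is caught up front rather than discovered after an expensive training run. The sketch below assumes each record carries a `framework` tag (a hypothetical field, named for illustration):

```python
# Frameworks this pipeline is set up to handle (assumed labels).
ALLOWED_FRAMEWORKS = {"HIPAA", "GDPR", "OPEN"}

def validate_training_set(records):
    """Fail fast: reject the whole training set if any record lacks
    a recognised compliance framework, instead of discovering the
    problem after the model has already been trained."""
    violations = [r["id"] for r in records
                  if r.get("framework") not in ALLOWED_FRAMEWORKS]
    if violations:
        raise ValueError(f"Non-compliant records: {violations}")
    return True
```

Running this check (and logging its result) at the start of every training job also doubles as part of the documentation trail described above.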
In that sense, it’s similar to a judge throwing out an entire court case because a crucial piece of evidence was obtained illegally. It happens, and data scientists must meet exacting requirements, especially in sectors with strict compliance rules.
Auditability for Data Annotations
When putting an AI model into production, you’ve got to consider the auditability of the data, not just the models.
Make sure there’s an exact audit trail of how each piece of training data and its label was generated, because both the labels and the underlying data must comply with the regulations governing the process you’re trying to optimize.
For example, when it comes to developing medical AI, some regulatory bodies have implemented an approval process for algorithms, which requires independent expert reviews. These procedures are in place to ensure that the model learns to make predictions from training data that has either been labeled or reviewed by a certified professional.
As such, when medical companies build production AI, a designated number of medical specialists must review the labeled training data before the company can use it in downstream model-building applications. They must also keep a record of how each piece of data was labeled, who it was reviewed by, and how many times it was reviewed.
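A minimal audit record for the requirements above (who labeled an item, who reviewed it, and how many review passes it has had) might look like the following. The class and field names are illustrative assumptions, not a real platform API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnnotationAudit:
    """Audit trail for one labeled item: who labeled it, who
    reviewed it, and how many review passes it has been through."""
    item_id: str
    labeled_by: str
    reviews: list = field(default_factory=list)

    def add_review(self, reviewer: str, approved: bool) -> None:
        # Timestamp every review so the trail is reconstructible later.
        self.reviews.append({
            "reviewer": reviewer,
            "approved": approved,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    @property
    def review_count(self) -> int:
        return len(self.reviews)

# Example: a medical scan labeled by one annotator, then reviewed
# by a certified specialist (names are placeholders).
audit = AnnotationAudit(item_id="scan-104", labeled_by="annotator-7")
audit.add_review("dr-smith", approved=True)
```

In practice these records would be persisted alongside the dataset, so that for any model you can answer “which labels went in, and who signed off on them?”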
With Encord, you can do all of this, thanks to our regulatory-compliant and auditable dashboard, so you’ve got a record of the entire flow of data, from raw images or videos, through to a production-ready model.
Encord's DICOM labeling tool in action
Data Compliance Throughout The Release Lifecycle: From Annotation to CV Model Deployment
Before building the model, it’s wise to consider the localities that will be involved in each stage of the production cycle.
- Where is the model being trained?
- Is it being trained in the same jurisdiction as where the labels and training data were generated?
- Where is the model being deployed after training?
From a production and model deployment viewpoint, the answers to these questions are important for preventing issues down the road.
For instance, if your training data is in the US, but your model training infrastructure is established in the UK, you need to know whether you’re allowed to process that data by sending it to the UK. Even if you have no intention of storing data in the UK, you still have to establish whether you’re allowed to process it there, e.g., train the model and run various experiments on it. It gets even more complex if you’ve got an outsourced data annotation team elsewhere in the world, such as Southeast Asia.
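These jurisdiction questions can be made mechanical with a simple policy check. To be clear, the policy table below is a made-up example: which origin–location pairs are actually permitted is a legal question settled by counsel and data processing agreements, not by code. The code only enforces whatever the lawyers decide.

```python
# Hypothetical policy table: where data from each origin may be
# processed. The entries are illustrative assumptions only; the
# real table must come from legal review, not from engineering.
PROCESSING_ALLOWED = {
    "US": {"US"},
    "EU": {"EU", "UK"},
    "OPEN": {"US", "EU", "UK"},
}

def can_process(data_origin: str, compute_location: str) -> bool:
    """Check whether data originating in `data_origin` may be
    processed (trained on, experimented with) in `compute_location`."""
    return compute_location in PROCESSING_ALLOWED.get(data_origin, set())
```

Gating every training or annotation job on a check like this turns an easy-to-miss legal question into an explicit, auditable step in the pipeline.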
Data operations leaders need to know they can store, send, and share datasets with outsourced collaboration teams without compromising the entire project's regulatory compliance.
The practical implication for AI companies and organizations using computer vision models is that they either need model infrastructure deployed in different jurisdictions so they can process data locally, or they need data processing agreements in place with customers that clearly state whether and where they intend to process the data.
Some jurisdictions have much more stringent rules around data processing and storage than others, and it’s important to know the regulations around data collection, usage, processing, and storage, for all the relevant jurisdictions.
Compliance regulations can create headaches for building production AI by adding operational overhead when making the model work in practice. However, it’s best to know the rules from the start and reduce the high-stakes risk of having to abandon a model for falling afoul of AI regulations.
At Encord, we’ve worked with multiple customers from different jurisdictions and different data requirements. With our user-friendly, computer-vision-first platform and in-house expertise, we help companies develop their training data pipeline while removing their compliance headaches.
Encord is a comprehensive AI-assisted platform for collaboratively annotating data, orchestrating active learning pipelines, fixing dataset errors, and diagnosing model errors & biases.
Sign up for an Encord Free Trial: The Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams.
AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster. Try Encord for Free Today.
Want to stay updated?
Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.
Join our Discord channel to chat and connect.