An Introduction to Synthetic Training Data

November 11, 2022

Despite the vast amounts of data civilization is generating (2.5 quintillion bytes of new data every day, according to recent studies), computer vision and machine learning data scientists still encounter many challenges when sourcing enough data to train and make a computer vision model production-ready. 

Machine learning models need to train on vast amounts of data, but sometimes that data isn’t readily available.

Machine learning engineers designing high-stakes production models face difficulty when curating training data because most models must handle numerous edge cases when they go into production.

An artificial intelligence model that makes only a few errors can still have disastrous results. Consider an autonomous vehicle company seeking to put its cars on the road. The AI models running in those cars need accurate, fast, real-time predictive ability for every edge case, such as distinguishing between a pedestrian and a reflection of a pedestrian, so that the vehicle can either take evasive action or continue to drive as normal.

Unfortunately, high-quality images of reflections of pedestrians aren’t as easy to come by as photos of pedestrians.

A large enough volume of training data is relatively hard to find in some fields where machine learning could have the most significant potential impact. 

Consider a medical AI company attempting to build a model that diagnoses a rare disease. The model might need to train on hundreds of thousands of images to perform accurately, but there might only be a few thousand images for this edge case. Other medical imaging data might be locked away in private patient records, which might not be accessible to data science teams building these models.

The image or video datasets you need might not be available, even with numerous open-source datasets to choose from.

What can you do in this scenario?

The answer is to generate synthetic data: synthetic images, videos, and datasets.

Open-source synthetic brain images

What is Synthetic Training Data?

Put simply, synthetic data is data, such as images and videos, that has been artificially manufactured rather than captured from real-world events, the way MRI scans or satellite images are.

Synthetic data can significantly increase the size of these difficult-to-find datasets. As a result, augmenting real-world datasets with synthetic data can mean the difference between a viable production-ready computer vision model and one that isn’t viable because it doesn’t have enough data to train on. 
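As a rough illustration of augmenting a small real-world dataset with synthetic samples, here is a minimal sketch in Python. The function name `mix_datasets`, the `synthetic_per_real` parameter, and the idea of tagging each sample with its source are assumptions made for this example, not part of any particular tool.

```python
import random


def mix_datasets(real, synthetic, synthetic_per_real=1.0, seed=0):
    """Combine real and synthetic samples into one shuffled training list.

    Hypothetical helper: keeps every real sample and adds at most
    `synthetic_per_real` synthetic samples per real one. Each item is
    returned as (sample, source) so the origin stays traceable.
    """
    rng = random.Random(seed)
    # Cap the synthetic share so it cannot swamp the real data.
    wanted = int(round(len(real) * synthetic_per_real))
    k = min(len(synthetic), wanted)
    chosen = rng.sample(list(synthetic), k)

    mixed = [(x, "real") for x in real] + [(x, "synthetic") for x in chosen]
    rng.shuffle(mixed)
    return mixed
```

Keeping the source tag alongside each sample is a design choice for the sketch: it makes it possible to check later how much of a training set is synthetic.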

Remember, any kind of data-centric approach is only as good as your ability to get the right data into your model. Here’s our take on choosing the best data for your computer vision model.

And where finding data isn’t possible, creating and using synthetic datasets for machine learning models is the most effective approach. 


Two Methods for Creating Synthetic Data

For years now, game engines such as Unity and Unreal have enabled game engineers to build virtual environments. These 3D physical models integrate well with writing code, so they’re useful when it comes to generating certain types of synthetic data.

Because humans now have a strong understanding of physics and interactions in the physical world, digital engineers can design these models to replicate how light interacts with different materials and surfaces. That means they can continue to change the 3D environment and generate more data that encompasses a variety of situations and edge cases. 

For instance, if a machine learning engineer is training an autonomous vehicle model, a data engineer could simulate different lighting scenarios to create reflections of pedestrians. The ML engineer would then have enough data to train a model to distinguish between reflections of pedestrians and actual pedestrians. Likewise, the data engineer could generate data that represents different weather situations (sunny, cloudy, hazy, snowy) so that the ML engineer can train the model to behave appropriately in a variety of weather conditions.
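The idea of varying the simulated environment can be sketched as randomly sampling scene parameters and handing each configuration to a render script. The Python sketch below is not Unity or Unreal code; `SceneConfig`, its fields, the value ranges, and the 0.6 reflectivity threshold used for the label are all hypothetical choices for illustration.

```python
import random
from dataclasses import dataclass

WEATHER = ("sunny", "cloudy", "hazy", "snowy")  # conditions named in the text


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical parameters a render script might pass to a game engine."""
    weather: str
    sun_elevation_deg: float      # angle of the light source above the horizon
    surface_reflectivity: float   # 0 = matte, 1 = mirror-like (e.g. a shop window)
    num_pedestrians: int
    has_reflection: bool          # label known for free because we built the scene


def sample_scene(rng):
    """Draw one random scene configuration."""
    reflectivity = rng.uniform(0.0, 1.0)
    pedestrians = rng.randint(0, 5)
    return SceneConfig(
        weather=rng.choice(WEATHER),
        sun_elevation_deg=rng.uniform(5.0, 85.0),
        surface_reflectivity=reflectivity,
        num_pedestrians=pedestrians,
        # Assumed rule: a reflection appears only if someone is present
        # and the surface is shiny enough (threshold is arbitrary).
        has_reflection=pedestrians > 0 and reflectivity > 0.6,
    )


def sample_scenes(n, seed=0):
    """Draw n scene configurations reproducibly from a seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]
```

Because the scene is constructed rather than observed, labels such as `has_reflection` come from the configuration itself instead of from manual annotation.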

The Unity game engine in action

Unfortunately, game engines have certain limitations when generating synthetic data. Sometimes, there isn’t enough information or understanding of how things work to create a 3D version of the edge cases a data science team needs. For instance, when it comes to medical imaging, many factors (camera models and software, image file formats, gut health, patient diet, and so on) make simulating data challenging.

In these scenarios, rather than build 3D representations, data engineers can use deep learning to synthetically generate more data from real-world data.

Machine learning enables them to generate artificial data not from a set of parameters programmed by a scientist or game engineer but from a neural network trained on real-world datasets. 

Generative adversarial networks (GANs) are a relatively recent development that allows us to create synthetic data by setting two neural networks against each other. One of the models, the generative model, takes random inputs and generates data, and the other, the discriminative model, is tasked with determining whether the data it is fed is a real-world example or an example made by the generative model.

As the GAN iterates, these two “opposing models” will train against and learn from each other. If the generator fails at its task of creating believable/realistic synthetic data, it adjusts its parameters while the discriminator remains as is. If the discriminator fails at its task of identifying synthetic data as “fake” data, it adjusts its parameters while the generator remains as is. 
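The alternating updates described above can be sketched in a few lines of standard-library Python. This is a toy illustration rather than a production GAN: the 1D Gaussian "real" data, the linear generator, the logistic discriminator, and every name and hyperparameter here are assumptions made for the example. The generator uses the common non-saturating objective (maximize log D(G(z))).

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def generate(gen, z):
    """Generator: map a random input z to a synthetic sample."""
    return gen["a"] * z + gen["b"]


def discriminate(disc, x):
    """Discriminator: estimated probability that x is a real sample."""
    return sigmoid(disc["w"] * x + disc["c"])


def discriminator_step(disc, gen, real_batch, noise_batch, lr):
    """Update only the discriminator; the generator is left as is."""
    grad_w = grad_c = 0.0
    for x in real_batch:                 # push D(real) toward 1
        err = 1.0 - discriminate(disc, x)
        grad_w += err * x
        grad_c += err
    for z in noise_batch:                # push D(fake) toward 0
        fake = generate(gen, z)
        err = -discriminate(disc, fake)
        grad_w += err * fake
        grad_c += err
    n = len(real_batch) + len(noise_batch)
    return {"w": disc["w"] + lr * grad_w / n, "c": disc["c"] + lr * grad_c / n}


def generator_step(disc, gen, noise_batch, lr):
    """Update only the generator; the discriminator is left as is."""
    grad_a = grad_b = 0.0
    for z in noise_batch:                # push D(fake) toward 1
        err = 1.0 - discriminate(disc, generate(gen, z))
        grad_a += err * disc["w"] * z
        grad_b += err * disc["w"]
    n = len(noise_batch)
    return {"a": gen["a"] + lr * grad_a / n, "b": gen["b"] + lr * grad_b / n}


def train(steps=2000, batch=64, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on toy 1D data."""
    rng = random.Random(seed)
    disc = {"w": 0.1, "c": 0.0}
    gen = {"a": 1.0, "b": 0.0}
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # assumed "real" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        disc = discriminator_step(disc, gen, real, noise, lr)
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        gen = generator_step(disc, gen, noise, lr)
    return disc, gen
```

Calling `train()` returns the final discriminator and generator parameters. In this toy setup the discriminator can only compare where samples sit on the number line, so at best the generator's offset `b` drifts toward the real mean; the spread of the real data is not learned, and, as with GANs in general, training may oscillate rather than settle. Real image GANs replace both linear models with deep networks and rely on automatic differentiation instead of hand-written gradients.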

GAN architecture

Over many iterations, this interplay will improve the discriminative model’s accuracy in distinguishing between real and synthetic data. Meanwhile, the generative model will incorporate feedback each time it fails to “fool” the discriminator, improving its effectiveness at creating realistic synthetic data over time. When this training has finished, the generator can produce high-quality synthetic data that supplements training datasets that would otherwise lack enough real-world data to train a model.

Of course, using synthetic data comes with pros and cons. In my next post, I’ll discuss some of the benefits of using GAN-generated synthetic data as well as some of the challenges that come with this approach.

Written by

Frederik Hvilshøj
