Frederik Hvilshøj August 9, 2022

An Introduction to Synthetic Training Data


Over the past five years, the world has witnessed an explosion in the amount of data generated daily. Despite this growth, machine learning engineers still encounter many challenges when curating the data needed to train models to run “in the wild”. To make accurate predictions across a variety of scenarios, a model needs to train on a vast amount of data, but sometimes that data isn’t readily available.

Machine learning engineers designing high-stakes production models face particular difficulty when curating training data because the models must be able to handle edge cases once they go into production. A high-stakes model with only a small percentage of errors can still produce disastrous results. Think about an autonomous vehicle company seeking to put its cars on the road. Those cars need to handle edge cases, such as distinguishing between a pedestrian and a reflection of a pedestrian, so that the vehicle can either take evasive action or continue to drive as normal. Unfortunately, high-quality images of reflections of pedestrians aren’t as easy to come by as photos of pedestrians.

In some of the fields where machine learning could have the greatest potential impact, training data is relatively hard to come by. Consider a medical AI company attempting to build a model that diagnoses a rare disease. The model might need to train on hundreds of thousands of images to perform accurately, but perhaps doctors only have data from a few thousand diagnosed patients. Other medical imaging data might be locked away in private patient records.

What can these companies do?

They can generate synthetic data.


Open source synthetic brain images

What is Synthetic Training Data?

Put simply, synthetic data is data that has been artificially manufactured rather than captured from real-world events. Synthetic data has the potential to significantly increase the size of these difficult-to-find datasets. As a result, augmenting real-world datasets with synthetic data can mean the difference between a viable and an unviable high-stakes production model.

Two Methods for Creating Synthetic Data

For years now, game engines such as Unity and Unreal have enabled game engineers to build virtual environments. These 3D physical models integrate well with code, so they’re useful for generating certain types of synthetic data. Because the physics of light and materials is well understood, engineers can design these models to replicate how light interacts with different materials and surfaces. That means they can keep changing the 3D environment and generate more data that encompasses a variety of situations and edge cases.

For instance, if a machine learning engineer is training an autonomous vehicle model, a data engineer could simulate different lighting scenarios to create reflections of pedestrians. The ML engineer would then have enough data to train a model to distinguish between reflections of pedestrians and actual pedestrians. Likewise, the data engineer could generate data that represents different weather conditions (sunny, cloudy, hazy, snowy) so that the ML engineer can train the model to behave appropriately in each of them.
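A full game-engine render pipeline is beyond a short snippet, but the core idea, programmatically varying conditions to multiply one capture into many labeled training examples, can be sketched with simple image transforms. Everything here (the `vary_conditions` helper and its parameters) is hypothetical, a simplified stand-in for a real renderer’s lighting and weather controls:

```python
import numpy as np

def vary_conditions(image, brightness=1.0, haze=0.0, rng=None):
    """Return a copy of `image` re-rendered under different conditions.

    brightness: 1.0 keeps the exposure, <1 darkens (overcast), >1 brightens.
    haze: 0..1 blends the scene toward a flat white sky, mimicking fog.
    """
    out = image.astype(np.float64) * brightness
    out = (1.0 - haze) * out + haze * 255.0      # blend toward white haze
    if rng is not None:                          # optional sensor noise
        out += rng.normal(0.0, 2.0, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

# One "real" capture becomes several training variants.
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
variants = {
    "sunny":    vary_conditions(base, brightness=1.3),
    "overcast": vary_conditions(base, brightness=0.7),
    "hazy":     vary_conditions(base, haze=0.5),
}
```

Each variant inherits the original image’s annotations for free, which is part of what makes simulated data attractive.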


The Unity game engine in action

Unfortunately, game engines have limitations when it comes to generating synthetic data. In some domains, we don’t understand the underlying processes well enough to build a 3D model or vary it to obtain diverse training data. In medical imaging, for instance, many factors, from camera types to gut health to patient diet, make simulating data challenging.

In such cases, rather than build 3D representations, data engineers can take real-world data and use it to synthetically generate more data via deep learning. Machine learning enables them to generate artificial data not from a set of parameters programmed by a scientist or game engineer, but from a neural network that has trained on real-world datasets.

Generative adversarial networks (GANs) are a relatively recent development that allows us to create synthetic data by setting two neural networks against each other. One model, the generative model, takes random inputs and generates data; the other, the discriminative model, is tasked with determining whether the data it is fed is a real-world example or an example made by the generator.

As the GAN iterates, these two “opposing” models train against and learn from each other. If the generator fails at its task of creating realistic synthetic data, it adjusts its parameters while the discriminator remains as is. If the discriminator fails at its task of identifying synthetic data as “fake”, it adjusts its parameters while the generator remains as is.


Over many iterations, this interplay improves the discriminative model’s accuracy in distinguishing between real and synthetic data. Meanwhile, the generative model incorporates feedback each time it fails to “fool” the discriminator, improving its ability to create realistic synthetic data over time. When training has finished, the GAN can produce high-quality synthetic data to supplement training datasets that would otherwise lack enough real-world data to train a model.
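To make the alternating loop concrete, here is a toy GAN written in plain NumPy, a deliberately simplified sketch rather than anything from a real pipeline: the “real” data is a one-dimensional Gaussian centered at 4, the generator is a linear map of noise, the discriminator is a logistic classifier, and both gradient updates are derived by hand. Giving the discriminator a few extra steps per generator step keeps this toy stable:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

# Generator G(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0            # generator parameters
w, c = 0.0, 0.0            # discriminator parameters
lr, batch = 0.05, 128

for _ in range(2000):
    # --- Discriminator turn: learn to tell real samples from generated ones ---
    for _ in range(5):     # a few extra steps keep the discriminator sharp
        x_real = rng.normal(4.0, 0.5, batch)          # the "real-world" data
        x_fake = a * rng.normal(0.0, 1.0, batch) + b  # generator output, held fixed
        s_real = sigmoid(w * x_real + c)
        s_fake = sigmoid(w * x_fake + c)
        # Hand-derived gradients of the binary cross-entropy loss
        w -= lr * (np.mean(-(1 - s_real) * x_real) + np.mean(s_fake * x_fake))
        c -= lr * (np.mean(-(1 - s_real)) + np.mean(s_fake))

    # --- Generator turn: adjust a, b to fool the (now fixed) discriminator ---
    z = rng.normal(0.0, 1.0, batch)
    s_fake = sigmoid(w * (a * z + b) + c)
    dx = -(1 - s_fake) * w                            # non-saturating generator loss
    a -= lr * np.mean(dx * z)
    b -= lr * np.mean(dx)

samples = a * rng.normal(0.0, 1.0, 10_000) + b        # the "synthetic dataset"
print(round(float(samples.mean()), 1))                # ends up near the real mean of 4
```

Real GANs replace the linear generator and logistic discriminator with deep networks and rely on a framework’s autodiff for the gradients, but the turn-taking structure is exactly the one described above.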

Of course, using synthetic data comes with pros and cons. In my next post, I’ll discuss some of the benefits of using GAN-generated synthetic data as well as some of the challenges that come with this approach.

Machine learning and data operations teams of all sizes use Encord’s collaborative applications, automation features, and APIs to build models & annotate, manage, and evaluate their datasets. Check us out here.