Frederik Hvilshøj August 16, 2022

The Advantages and Disadvantages of Synthetic Training Data For Machine Learning


The most obvious advantage of using synthetic training data is that it can supplement datasets that would otherwise lack sufficient examples to train a model. As a general rule, more and higher-quality training data equals better performance, so synthetic data can play a hugely important role for machine learning engineers working in fields that suffer from a scarcity of data.

However, using synthetic data comes with pros and cons. Let’s look at some advantages and disadvantages of using synthetic training data to train machine learning algorithms.  


When high-stakes models, such as those used to run self-driving cars or diagnose patients, operate in the real world, they need to be able to deal with edge cases. However, because edge cases are outliers by nature, finding enough real-world examples to build a dataset and train a model can feel like searching for a needle in a haystack. With a synthetic dataset, that search can be cut short.

For instance, think of machine learning engineers training a model to diagnose a rare genetic condition. Because the condition is rare, the real-world sample of patients from which to collect data is likely too small to train a model effectively. Generating synthetic data can circumvent the challenge of creating and labeling a large dataset from a small sample.

This same logic applies to augmenting datasets that suffer from systemic data bias. For instance, historically, much of medical research has been based on studies of white men, which has resulted in racial and gender bias in medicine and medical artificial intelligence. Additionally, an overreliance on major research institutions for datasets has perpetuated racial bias. Similarly, an overreliance on existing open-source datasets, combined with societal inequities, has resulted in facial recognition software performing more accurately on white faces.

To avoid perpetuating these biases, machine learning engineers need to train models on data that contains underrepresented groups. Augmenting existing datasets with synthetic data that represents these groups can accelerate the correction of these biases.


For research areas where data is hard to come by, using synthetic data to augment training datasets can be an effective solution that saves time and money. For example, researchers and clinicians working with MRI face a myriad of challenges when collecting diverse datasets. For starters, MRI machines are extremely expensive, with a typical machine costing $1 million and cutting-edge models running up to $3 million. In addition to the hardware, these machines require shielded rooms that eliminate outside interference and protect those outside the room. Because of these installation costs, a hospital usually spends between $3 million and $5 million on a suite with one machine. As of 2016, only about 36,000 machines existed globally. Operating these machines safely and effectively also requires specific technical expertise.

Given these constraints, it’s no wonder that these machines are in demand, and hospitals have wait lists full of patients needing them for diagnosis. At the same time, however, researchers and machine learning engineers might need additional MRI data on rare diseases before they can train a model to accurately diagnose patients. Supplementing these datasets with synthetic data means that they don't have to book as much expensive MRI time or wait for the machines to become available, ultimately lowering the cost and shortening the timeline for putting a model into production.

Synthetic data generation can also save costs and time in other ways. For rare or regionally specific diseases, synthetic data can reduce the challenges associated with locating enough patients around the world and depending on their participation in the time-consuming process of getting an MRI. The ability to augment datasets can especially help level the playing field for medical AI aimed at patients in developing nations, where limited medical access and resources create additional challenges in collecting data. For context, in 2016, the entire West Africa region had only 84 MRI machines to serve a population of more than 350 million.



In an ideal world, researchers would collaborate to build larger, shared datasets, but for the fields in which data scarcity is most prevalent, privacy regulations and data protection laws (such as GDPR) often make collaboration difficult.  

Because synthetic data isn’t real, it doesn’t technically belong to anyone and, in principle, doesn’t contain sensitive personal data. That opens up the possibility of sharing datasets among researchers to enhance and accelerate scientific discovery. Earlier this year, for example, King's College released 100,000 artificially generated brain images to researchers to help accelerate research into dementia. Collaboration on this scale in medical science is made far more practical by the use of synthetic data.

Unfortunately, at the moment, using synthetic data comes with a tradeoff between achieving differential privacy (a formal guarantee that limits what released data can reveal about any single individual in the original dataset) and the accuracy of the synthetic data generated. However, in the future, sharing artificial images publicly without breaching privacy may become increasingly possible.
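To make the privacy versus accuracy tradeoff concrete, here is a minimal sketch of the classic Laplace mechanism applied to a simple mean. It is not the method used for the brain images mentioned above; it only illustrates the general principle that a stronger privacy guarantee (a smaller epsilon) requires more noise and therefore gives a less accurate result. The function names `private_mean` and `mean_abs_error`, the value bounds, and the epsilon values are all illustrative assumptions, and the sketch assumes the dataset size is public.

```python
import random


def private_mean(values, lower, upper, epsilon, rng):
    """Return a differentially private mean using the Laplace mechanism.

    Assumes every record is clipped to [lower, upper] and that the
    number of records is public knowledge.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record can move the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two exponential draws is a Laplace(0, scale) draw.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average absolute error of the private mean on a toy dataset."""
    rng = random.Random(seed)
    values = [rng.uniform(20, 80) for _ in range(100)]  # hypothetical data
    exact = sum(values) / len(values)
    errors = [
        abs(private_mean(values, 0, 100, epsilon, rng) - exact)
        for _ in range(trials)
    ]
    return sum(errors) / trials


if __name__ == "__main__":
    # Smaller epsilon = stronger privacy = larger error.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error {mean_abs_error(eps):.3f}")
```

In this toy setup the expected error scales as 1/epsilon, so tightening the privacy budget tenfold costs roughly tenfold in accuracy. Generative models trained with differential privacy face an analogous, though far more complex, version of the same tension.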


While synthetic data enables researchers to avoid the expense of building datasets solely from real-world data, generating synthetic data comes with its own costs. The compute time and financial cost of training a Generative Adversarial Network (GAN), or any other deep-learning-based generative model, to generate realistic artificial data vary with the complexity of the data in question. Training a GAN to produce realistic medical imaging data could take weeks, even on expensive specialist hardware and under the supervision of highly skilled engineers.

Even for those organisations that have access to the required hardware and know-how, synthetic data may not be the panacea for their dataset difficulties. GANs are a relatively recent development, so predicting whether a GAN will produce useful synthetic data is difficult to do except by using trial and error. To pursue a trial and error strategy, organisations need to have time and money to spare.  

As a result, generating synthetic data is somewhat limited to institutions and companies that have access to capital, large amounts of computing power, and highly skilled machine learning engineers. In the short term, synthetic data could inhibit the democratisation of AI by separating those who have the resources to generate artificial data from those who don’t.


Even when a GAN is working properly, ML engineers need to remain vigilant that it isn’t “overtrained.” As strange as it sounds, while using synthetic data can help researchers navigate privacy concerns, it can also place them at risk of committing unexpected privacy violations if a GAN overtrains.

If training goes on for too long, the generative model can start reproducing data from its original data pool. Not only is this result counterproductive, but it also has the potential to undermine the privacy of the people whose data was used during training by recreating their information in what is supposed to be an artificial dataset.
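One simple, hedged way to look for this kind of memorisation is to compare every synthetic sample with its nearest training sample and flag any that are suspiciously close. The sketch below assumes samples are fixed-length numeric vectors for which Euclidean distance is meaningful; the function names and the threshold are hypothetical, and for images a raw pixel distance is only a crude proxy (a distance in a learned feature space would usually be more informative).

```python
import math


def nearest_distance(sample, training_set):
    """Euclidean distance from `sample` to its closest training record."""
    return min(math.dist(sample, record) for record in training_set)


def flag_near_copies(synthetic, training_set, threshold):
    """Return indices of synthetic samples within `threshold` of a training record."""
    return [
        i
        for i, sample in enumerate(synthetic)
        if nearest_distance(sample, training_set) <= threshold
    ]


if __name__ == "__main__":
    # Toy 2-D data: the first and last synthetic points are (near) copies.
    training = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    synthetic = [(0.0, 0.01), (3.0, 3.0), (5.0, 5.0)]
    print(flag_near_copies(synthetic, training, threshold=0.05))  # [0, 2]
```

A check like this is brute force (every synthetic sample against every training record), so for large datasets it would need an approximate nearest-neighbour index. Passing it does not prove privacy; it only catches the most obvious copies.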

For example, is this a real image of Drake?


No, it’s not. It is a synthetic image from this paper where a deep learning model is used to increase the size of Drake’s nose. So does this image preserve Drake’s privacy or not? It’s technically not Drake, but it is still clearly Drake, right? 


Because artificial data is often used in areas where real-world data is scarce, there’s a chance that the data generated might not accurately reflect real-world populations or scenarios. Researchers must create more data (because there wasn’t enough to train a model to begin with), but then they must find a way to trust that the data they’re creating reflects what’s happening in the real world.

After the data is produced, data and ML engineers must add a layer of quality control that tests whether the new artificial data accurately reflects the sample of real-world data it aims to mimic.
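As one possible form this quality control could take, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distributions of a single numeric feature in the real and synthetic data. This is an illustrative choice rather than a prescribed method: it checks one feature at a time, so it says nothing about relationships between features, and the function name `ks_statistic` and the toy Gaussian data are assumptions.

```python
import bisect
import random


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs.

    Returns 0.0 when the samples have identical distributions and
    1.0 when they do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


if __name__ == "__main__":
    rng = random.Random(0)
    real = [rng.gauss(0, 1) for _ in range(1000)]
    faithful = [rng.gauss(0, 1) for _ in range(1000)]  # same distribution
    shifted = [rng.gauss(1, 1) for _ in range(1000)]   # mean is off by 1
    print("faithful generator:", round(ks_statistic(real, faithful), 3))
    print("shifted generator: ", round(ks_statistic(real, shifted), 3))
```

A small statistic on every feature is a necessary sign of fidelity, not a sufficient one. In practice teams often pair checks like this with a downstream test, such as training on synthetic data and evaluating on held-out real data.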


While synthetic training data brings its own difficulties and disadvantages, its development is promising, and it has the potential to revolutionise fields where the scarcity of real-world datasets has slowed the application of machine learning models. As with any new technology, generating synthetic data will have its growing pains, but its promise for accelerating technology and positively impacting the lives of many people far outweighs its current disadvantages.
