Ulrik Stig Hansen April 8, 2022

Training Datasets for Machine Learning: The Complete Guide


Simply put, training data is the initial training dataset used to teach a machine learning algorithm to process information.

Algorithms build models by running on data, and models continue to refine their performance, improving their decision-making and confidence, as they encounter new data and build upon what they learned from previous data.

High-quality training data is thus the foundation of successful machine learning because the quality of the training data has a profound impact on the model’s subsequent development and its accuracy. Training data is as important to the success of the model as the algorithms themselves because training data directly influences the accuracy with which the model learns to identify the outcome it was designed to detect.

Training data guides the model: it’s the textbook from which the model gains its foundational knowledge. It shows the model patterns and tells it what to look for. After data scientists train the model, it should be able to identify patterns in never-before-seen datasets based on the patterns it learned from the training data.

Think of humans as the teachers of these machine students. Just like human students, machines perform better when they have well-curated and relevant examples to practise with and learn from. If trained on unreliable or irrelevant data, well-designed models can become functionally useless. As the old artificial intelligence adage goes: “garbage in, garbage out”.

How do we use a training dataset to teach machines to learn?

There are two common approaches to machine learning: supervised and unsupervised learning. Unsupervised learning is when a human feeds data into a model without providing it specific instructions or feedback on its progress. The training data is raw, meaning humans haven’t annotated it with identifying labels, so the model trains without human guidance and discovers patterns on its own. Unsupervised models can cluster and identify patterns in data, but they can’t be steered towards a specific desired outcome. For instance, a data scientist can’t feed an unsupervised model images of animals and expect the model to group them by species: the model might identify a different pattern and group them by colour instead.

When the desired outcomes are predetermined, such as identifying a tumour or changes in weather patterns, machine learning engineers build supervised learning models. In supervised learning, a human provides the model with labelled data and then supervises the machine learning process, providing feedback on the model’s performance.

Human-in-the-loop (HITL) is the process of humans continuing to work with the machine and help improve its performance. A human’s first step in the loop is to curate and label the training data.

Labelling data allows humans to structure the data in a way that makes it readable to the model. Within the training data, humans identify a target (the outcome that a machine learning model is designed to predict) and annotate it by giving it a label. By labelling data, humans can point out important features in the data and ensure that the model focuses on those features rather than drawing its own conclusions about the data. Applying well-chosen labels is critical for guiding the model’s learning. For instance, if humans want a computer vision model to learn to identify different types of birds, then every bird that appears in the image training data needs to be labelled with an appropriate, descriptive label.

After data scientists begin training the model to predict the desired outcomes by feeding it the labelled data, the “humans in the loop” check its outputs to determine whether the model is working successfully and to a high degree of accuracy. Just as a teacher would help students prepare for an exam, the “humans in the loop” make corrections and feed the data back to the model so that it can learn from its mistakes. By constantly validating the model’s predictions, humans can ensure that its learning is moving in the correct direction. Through this continuous loop of feedback and practice, the model improves its performance.

Once the machine has been sufficiently trained, data scientists will test the model’s performance at returning real-world predictions by feeding it never-before-seen “test data”. The model never sees the correct answers for the test data, and data scientists don’t use it to tune the model: they compare the model’s predictions against held-back ground-truth labels to confirm that the model is working accurately. If the model fails to produce the right outputs from the test data, then data scientists know it needs more training before it can predict the desired outcome.
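The train-then-test workflow can be sketched in a few lines of Python. This is a minimal illustration rather than a production recipe: the toy dataset, the `split_dataset` and `accuracy` helpers, and the threshold "model" are all hypothetical. In this sketch the held-out examples keep their ground-truth labels so that predictions can be scored, but the model never sees those labels while predicting.

```python
import random

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle labelled examples and hold out a fraction as test data."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold_model(train):
    """Toy 'model': learn a cut-off halfway between the two class means."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: 1 if x > threshold else 0

def accuracy(model, examples):
    """Fraction of examples whose prediction matches the held-back label."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

# Hypothetical data: (feature value, label)
data = [(float(v), 0) for v in range(0, 10)] + [(float(v), 1) for v in range(20, 30)]

train, test = split_dataset(data)
model = train_threshold_model(train)   # the model only ever sees `train`
print(f"test accuracy: {accuracy(model, test):.2f}")
```

A low score on `test` would be the signal that the model needs more (or better) training data before it can be trusted on unseen inputs.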

What makes a good machine learning training dataset?

Because machine learning is an iterative process, it’s vital that the training data is applicable and appropriately labelled for the task at hand.

The data curated must be relevant to the problem that the model is trying to solve. For instance, if a computer vision model is trying to learn to identify bicycles, then the data must contain images of bicycles and, ideally, a variety of different types of bicycles. The cleanliness of the data also impacts the performance of a model. If trained on corrupt or broken data or datasets with duplicate images, the model will make incorrect predictions. Lastly, as already discussed, the quality of the annotations has a tremendous effect on the quality of the training data.
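One of the cleanliness checks mentioned above, removing duplicate images, can be approximated by hashing file contents. The sketch below is an assumption about how such a check might look, not a description of any particular tool: the `deduplicate` helper and the file names are hypothetical, and the approach only catches byte-for-byte copies (a resized or re-encoded duplicate would slip through).

```python
import hashlib

def deduplicate(files):
    """Keep the first file seen for each distinct byte content.

    `files` maps a file name to its raw bytes. Returns the names to keep
    and a list of (duplicate_name, original_name) pairs.
    """
    first_seen = {}   # content hash -> name of the first file with that content
    duplicates = []
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            duplicates.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
    return list(first_seen.values()), duplicates

# Hypothetical stand-ins for image files read from disk
files = {
    "bike_001.jpg": b"\xff\xd8 road bike",
    "bike_002.jpg": b"\xff\xd8 mountain bike",
    "bike_003.jpg": b"\xff\xd8 road bike",   # exact copy of bike_001.jpg
}

unique, duplicates = deduplicate(files)
print("keep:", unique)
print("drop:", duplicates)
```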

Encord specialises in creating high-quality training data for downstream computer vision models. When companies train their models on high-quality data, they increase the performance of their models in solving real-world business problems. Our platform has flexible ontology and easy-to-use annotation tools, so computer vision companies can create high-quality training data customised for their models without having to spend the time and money building these tools in-house.

What’s the best way to get a dataset for machine learning?

Creating, evaluating, and managing training data depends on having the right tools. Encord’s computer vision-first toolkit lets customers label any computer vision modality all in one platform. We offer fast and intuitive collaboration tools to enrich your data so that you can build cutting-edge AI applications. Our platform automatically classifies objects, detects segments, and tracks objects in images and video.

Computer vision models must learn to distinguish between different aspects of pictures and videos, which requires them to process labelled data, and the types of annotations which they need to learn vary depending on the task they’re performing.

Let’s take a look at some common annotation tools for computer vision tasks.

Image Classification: For single-label classification, each image in a dataset has one label, and the model outputs a single prediction for each image it encounters. In multi-label classification, each image has multiple labels, and the labels are not mutually exclusive.

Bounding boxes: When performing object detection, computer vision models detect an object and its location. The object’s shape doesn’t need to be detailed to achieve this outcome, which makes bounding boxes the ideal tool for this task. With a bounding box, the target object in the image is enclosed in a tight rectangular box accompanied by a descriptive label.

Polygons/Segments: When performing image segmentation, computer vision models use algorithms to separate objects in the image from both their backgrounds and other objects. Mapping labels to the pixels that belong to the same object helps the model break down the image into subgroups called segments. The shape of these segments matters, so annotators need a tool that doesn’t restrict them to rectangles. With polygons, an annotator can create a tight outline around the target object by plotting points along its edges, which become the vertices of the polygon.
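To make the three annotation types above concrete, here is one way they might be represented in code. This is an illustrative sketch only: the `BoundingBox` and `Polygon` classes and the dictionary layout are hypothetical and do not reflect Encord's data format. It also shows why polygons matter for segmentation: the bounding box around a shape can cover far more area than the shape itself.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self):
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)

@dataclass
class Polygon:
    label: str
    points: list  # [(x, y), ...] listed in order around the outline

    def area(self):
        """Shoelace formula for a simple (non-self-intersecting) polygon."""
        total = 0.0
        n = len(self.points)
        for i in range(n):
            x1, y1 = self.points[i]
            x2, y2 = self.points[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2

    def to_bounding_box(self):
        """Smallest axis-aligned rectangle that contains every vertex."""
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return BoundingBox(self.label, min(xs), min(ys), max(xs), max(ys))

# Image classification: one label per image, or several non-exclusive labels
single_label = {"image": "img_01.jpg", "label": "bird"}
multi_label = {"image": "img_02.jpg", "labels": ["bird", "tree", "sky"]}

# Segmentation outline (a triangle here) and the box that encloses it
outline = Polygon("bird", [(0, 0), (4, 0), (0, 3)])
box = outline.to_bounding_box()
print(outline.area(), box.area())
```

In this toy case the box covers twice the area of the outlined shape, so half of the pixels inside the box would be background.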

Encord’s platform provides annotation tools for a variety of computer vision tasks, and our tools are embedded in the platform, so users don’t have to jump through any hoops before accessing model-assisted labelling.

Because the platform supports a variety of data formats including images, videos, SAR, satellite, thermal imaging, and DICOM images (X-ray, CT, MRI, etc), it works for a wide range of computer vision applications.


Labelling training data for machine learning in Encord

How to create better training datasets for your machine learning models

While there’s no shortage of data in the world, most of it is unlabelled and thus can’t actually be used in supervised machine learning models. Computer vision models, such as those designed for medical imaging or self-driving cars, need to be incredibly confident in their predictions, so they need to train on vast amounts of data. Acquiring large quantities of labelled data remains a serious obstacle for the advancement of AI.

Because every incorrect label has a negative impact on a model’s performance, data annotators play a vital role in the process of creating high-quality training data. Unfortunately, manual data labelling is still largely the norm, making it a slow process prone to human error.

Ideally, data annotators would be subject-matter experts in the domain for which the model is answering questions. In this scenario, the data annotators, because of their domain expertise, understand the connection between the data and the problem the machine is trying to solve, so their labels are informative and accurate.

Unfortunately, we don’t live in an ideal world. Data labelling is a time-intensive and tedious process. For perspective, one hour of video data can take humans up to 800 hours to annotate. That creates a problem for industry experts who have other demands on their time. Should a doctor spend hundreds of hours labelling scans of tumours to teach a machine how to identify them? Or should a doctor prioritise doctor-human interaction and spend those hours providing care to the patients whose scans clearly showed malignancies?

Data labelling can be outsourced, but doing so means losing the input of subject-matter experts, which could result in low-quality training data if the labelling requires any industry-specific knowledge. Another issue with outsourcing is that the labelling work is often carried out in developing economies, and sending data offshore isn’t viable for any domain in which data security and privacy are important. When outsourcing isn’t possible, teams often build internal tools and use their in-house workforces to manually label their data, which leads to cumbersome data infrastructure and annotation tools that are expensive to maintain and challenging to scale.

The current practice of manually labelling training data isn’t sufficient or sustainable. Using a unique technology called micro-models, Encord solves this problem and makes computer vision practical by reducing the burden of manual annotation and label review. Our platform automates data labelling, increasing its efficiency without sacrificing quality.

Using micro-models to automate data labelling for machine learning

Encord uses a novel technology called micro-models to build its automation features. Micro-models allow for quick annotation in a “semi-supervised fashion”. In semi-supervised learning, data scientists feed machines a small amount of labelled data in combination with a large amount of unlabelled data during training.
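For readers who want to see semi-supervised learning in action, the sketch below uses self-training (also called pseudo-labelling), one common semi-supervised technique. It is not a description of how Encord's micro-models work internally: the nearest-centroid classifier, the `self_train` function, the `min_margin` threshold, and the one-dimensional toy data are all assumptions made for illustration.

```python
def centroids(labelled):
    """Mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Return (label, margin). Margin is how much closer the best centroid
    is than the runner-up, so it acts as a crude confidence score.
    Requires at least two classes."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - cents[runner_up]) - abs(x - cents[best])
    return best, margin

def self_train(labelled, unlabelled, min_margin=2.0, max_rounds=5):
    """Repeatedly pseudo-label confident points and retrain on them."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(labelled)
        scored = [(x, predict(cents, x)) for x in pool]
        accepted = [(x, label) for x, (label, margin) in scored if margin >= min_margin]
        if not accepted:
            break  # nothing left that the model is confident about
        labelled.extend(accepted)
        accepted_values = {x for x, _ in accepted}
        pool = [x for x in pool if x not in accepted_values]
    return labelled, pool

# A small amount of labelled data plus a larger unlabelled pool
seed_labels = [(1.0, "cat"), (9.0, "dog")]
unlabelled = [0.5, 2.0, 3.0, 8.0, 10.0, 5.0]

final_labels, needs_human = self_train(seed_labels, unlabelled)
print(final_labels)
print("left for a human to label:", needs_human)
```

The ambiguous point that sits between the two classes never clears the confidence threshold, so it stays in the pool for a human to review.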

The micro-model methodology comes from the idea that a model can produce strong results when trained on a small set of purposefully selected and well-labelled data. Micro-models don’t differ from traditional models in terms of their architecture or parameters, but they have different domains of applications and use cases.

A knee-jerk reaction from many data scientists might be that this goes against “good” data science because a micro-model is an overfit model. In an overfit model, the algorithm can’t separate the “signal” (the true underlying pattern data scientists wish to learn from the data) from the “noise” (irrelevant information or randomness in a dataset). An overfit model unintentionally memorises the noise instead of finding the signal, meaning that it usually makes poor predictions when it encounters unseen data.

Overfitting a production model is problematic because if a production model doesn't train on a lot of data that resembles real-world scenarios, then it won’t be able to generalise. For instance, if data scientists train a computer vision model on images of sedans alone, then the model might not be able to identify a truck as a vehicle.
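The sedan example can be turned into a toy demonstration of overfitting. The sketch below uses a deliberately extreme, hypothetical "model" that simply memorises its training examples; the `MemorisingModel` class and the feature tuples are invented for illustration. It scores perfectly on data it has seen and fails on anything new.

```python
class MemorisingModel:
    """Deliberately overfit: stores the training examples verbatim."""

    def __init__(self):
        self.table = {}

    def fit(self, examples):
        for features, label in examples:
            self.table[features] = label

    def predict(self, features):
        # No generalisation: anything not seen in training is unrecognised.
        return self.table.get(features, "unknown")

# Hypothetical features: (wheels, doors, length in metres), sedans only
training = [
    ((4, 4, 4.5), "vehicle"),
    ((4, 4, 4.7), "vehicle"),
    ((4, 2, 4.4), "vehicle"),
]

model = MemorisingModel()
model.fit(training)

train_accuracy = sum(model.predict(f) == y for f, y in training) / len(training)
truck = (6, 2, 8.0)   # never seen during training

print("accuracy on training data:", train_accuracy)
print("prediction for a truck:", model.predict(truck))
```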

However, Encord’s micro-models are purposefully overfitted. They are annotation-specific models intentionally designed to look at one piece of data, identify one thing, and overtrain on that specific task. They wouldn’t perform well on general problems, but we didn’t design them to apply to real-world production use cases. We designed them only for the specific purpose of automating data annotation. Micro-models can solve many different problems, but those problems must relate to the training data layer of model development.


Comparing traditional and micro models for creating machine learning training data

Because micro-models don’t take much time to build, require huge datasets, or need weeks to train, the humans in the loop can start training the micro-models after annotating only a handful of examples. Micro-models then automate the annotation process. The model begins training itself on a small set of labels and removes the human from much of the validation process. The human reviews a few examples, providing light-touch supervision, but mostly the model validates itself each time it retrains, getting better and better results.

With automated data labelling, the number of labels that require human annotation decreases over time because the system gets more intelligent each time the model runs.

When automating a comprehensive annotation process, Encord strings together multiple micro-models. It breaks each specific labelling task into a separate micro-model and then combines these models together. For instance, to classify both aeroplanes and clouds in a dataset, a human would train one micro-model to identify planes, create and train another to identify clouds, and then chain them together to label both clouds and planes in the training data.
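The chaining idea in the aeroplanes-and-clouds example can be pictured as simple function composition. The sketch below is purely conceptual and does not reflect Encord's API: `make_micro_model`, `chain`, and the stand-in detectors (which just look up precomputed regions instead of running a trained model) are hypothetical.

```python
def make_micro_model(target, detector):
    """Wrap a single-purpose detector so it emits annotations for one class."""
    def run(image):
        return [{"label": target, "region": region} for region in detector(image)]
    return run

def chain(*micro_models):
    """Combine micro-models: each one labels its own class on the same image."""
    def run(image):
        annotations = []
        for micro_model in micro_models:
            annotations.extend(micro_model(image))
        return annotations
    return run

# Stand-in detectors. A real micro-model would be a trained network;
# these simply read regions stored alongside the fake image.
def find_planes(image):
    return image.get("plane_regions", [])

def find_clouds(image):
    return image.get("cloud_regions", [])

plane_model = make_micro_model("aeroplane", find_planes)
cloud_model = make_micro_model("cloud", find_clouds)
label_sky_scene = chain(plane_model, cloud_model)

# Hypothetical image with regions given as (x_min, y_min, x_max, y_max)
image = {
    "plane_regions": [(10, 20, 60, 40)],
    "cloud_regions": [(0, 0, 100, 15), (120, 5, 200, 30)],
}

annotations = label_sky_scene(image)
print([a["label"] for a in annotations])
```

Keeping each micro-model responsible for a single class means one can be retrained or replaced without touching the others.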

Production models need massive amounts of labelled data, and the reliance on human annotation has limited their ability to go into production and “run in the wild.”

Micro-models can change that.

With micro-models, users can quickly create training data to feed into downstream computer vision models.

Encord recently worked with King’s College London. The institution wanted to find a way to reduce the amount of time that highly skilled clinicians spent annotating videos of precancerous polyps for training data to develop AI-aided medical diagnostic tools. Using Encord’s micro-models, clinicians increased their annotation speed, completing the task 6.4 times faster than with manual labelling. In fact, only three percent of the datasets required manual labelling from clinicians. Encord’s technology not only saved the clinicians a lot of valuable time, but it also gave King’s College access to training data much more quickly than if the institution had relied on a manual annotation process. This increased efficiency allowed King’s College to move the AI into production faster, cutting model development time from one year to two months.

Interested in learning more? Schedule a demo to better understand how Encord can help your company unlock the power of AI.