Jonathon Byrd November 24, 2022

The Guide to Data Augmentation for Computer Vision Projects


Why should I do data augmentation?

Perhaps the most common point of failure that machine learning practitioners face is overfitting. This happens when a model memorizes its training examples but fails to generalize to unseen images. Overfitting is especially pertinent in computer vision, where we deal with high-dimensional image inputs and large, over-parameterized deep networks. There are many modern modeling techniques to deal with this problem, including dropout-based methods, label smoothing, and architectures that reduce the number of parameters needed while still maintaining the power to fit complex data. But one of the most effective places to combat overfitting is the data itself.

Deep learning models can be incredibly data-hungry, and one of the most effective ways to improve your model’s performance is to give it more data - the fuel of deep learning. This can be done in two ways:

  • Increasing the raw quantity of data. This combats overfitting by “filling out” the underlying distribution from which the images in your dataset are drawn, thus refining your model’s decision boundaries. The more examples you have - for example, from a particular class in a classification problem - the more accurately you can recover the support of that class.
  • Increasing diversity in the dataset. It’s worth mentioning that failure to generalize to new data can also be caused by dataset/distribution shift. Imagine classifying dog breeds using a training set of images of dogs in parks, but seeing dogs in other locations in production. Widening your training dataset to include these types of images will likely have a dramatic effect on your model’s ability to generalize. (Most of the time, though, image augmentation will not be able to address this issue.)

However, data collection can often be very expensive and time-consuming. For example, in healthcare applications, collecting more data usually requires access to patients with specific conditions, considerable time and effort from skilled medical professionals to collect and annotate the data, and often the use of expensive imaging and diagnostic equipment. In many situations, the “just get more data” solution will be very impractical. Furthermore, public datasets aren’t usually usable for custom CV problems, aside from transfer learning. Wouldn’t it be great if there were some way to increase the size of our dataset without returning to the data collection phase? This is data augmentation.

Wait, what is data augmentation?

Data augmentation is generating new training examples from existing ones through various transformations. It is a very effective regularization tool and is used by experts in virtually all CV problems and models. Data augmentation can increase the size of just about any image training set by 10x, 100x or even infinitely, in a very easy and efficient way. Mathematically speaking:

More data = better model. Data augmentation = more data. Therefore, data augmentation = better machine learning models.

Data augmentation techniques


Common image transformations for data augmentation.

The list of methods demonstrated by the figure above is by no means exhaustive. There are countless other ways to manipulate images and create augmented data. You are only limited by your own creativity!

Don’t feel limited to only using each technique in isolation either. You can (and should) chain them together like so:


Multiple transformations


More examples of augmentation combinations from a single source image. (source)
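As a rough illustration of chaining, here is a minimal NumPy sketch. The three transforms, their parameter values, and the `compose` helper are assumptions chosen for the example, not a recommended recipe; images are assumed to be float arrays of shape H x W x C with values in [0, 1].

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    return image[:, ::-1] if rng.random() < p else image

def random_brightness(image, rng, max_delta=0.2):
    """Shift all pixel values by one random offset in [-max_delta, max_delta]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

def gaussian_noise(image, rng, sigma=0.02):
    """Add independent Gaussian noise to every pixel."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def compose(*transforms):
    """Chain transforms so each one receives the previous one's output."""
    def apply(image, rng):
        for transform in transforms:
            image = transform(image, rng)
        return image
    return apply

augment = compose(random_horizontal_flip, random_brightness, gaussian_noise)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a real image
augmented = augment(image, rng)
```

Because every transform takes and returns an array of the same shape, the order and number of steps in `compose` can be changed freely while experimenting.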

Considerations and potential pitfalls of data augmentation

  • It should go without saying that any data augmentation should occur after splitting your dataset into training, validation, and testing subsets. Otherwise, you will be creating a major data leak: augmented copies of the same source image can land in both the training and test sets, and your test results will be useless.
  • If you’re performing localization-based tasks like object detection or segmentation, your labels will change when applying geometric transformations like reflection (flipping), rotation, translation, and cropping. Hence, you will need to apply the same transformations to your label annotations as well.
  • When you crop images, you are changing the size and possibly shape of the inputs to your model. For convolutional neural network models, you’ll need all of your inputs, including your test set, to have the same dimensionality. A common way of handling this is applying crop transformations to the test and validation sets as well. Resizing the images is another option.
  • Some of these transformations like translation, rotation, or scaling can result in “blank spaces” where the transformed image doesn’t fully cover the gridspace that’s fed into your model. In these cases you can fill the extra pixels with constant black/white/gray pixels, random noise, or interpolations that extend the original image content.
  • Be careful not to crop or translate so much that you completely remove relevant objects from an image. This is easily detectable when you know bounding boxes in object detection tasks, but could be an issue for image classification if you completely crop out the object corresponding to the label.
  • Data augmentation is typically only performed on the training set. While it can also be used as a strategy to decrease variance in very small validation or even test sets, you should always be very careful about making any changes to test sets. You want test data to be an unbiased estimate of performance on unseen examples from the inference-time distribution, and an augmented dataset may not resemble that distribution.
  • Don’t chain too many augmentations together. You might be tempted to throw in everything but the kitchen sink by combining all of the listed transforms, but this can quickly make the resulting image extremely unrealistic and unidentifiable to humans, and it can also cause the problem outlined in the point above. There’s nothing wrong with using all of these transformations; just don’t combine all of them at once.
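To make the point about label annotations concrete, here is a small sketch of a horizontal flip that updates bounding boxes along with the image. The function name and the box format (pixel coordinates as x_min, y_min, x_max, y_max, with x measured from the left edge) are assumptions for this example; your annotation format may differ.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Flip an H x W (x C) image left-to-right and update its boxes.

    boxes: array-like of shape (N, 4) holding (x_min, y_min, x_max, y_max).
    """
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    boxes = np.asarray(boxes, dtype=float)
    flipped_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes

image = np.zeros((100, 200, 3))   # height 100, width 200
boxes = [[10, 20, 50, 80]]
flipped_image, flipped_boxes = hflip_with_boxes(image, boxes)
# flipped_boxes is [[150., 20., 190., 80.]]
```

The y coordinates are untouched because a horizontal flip only moves content along the x axis. Rotations, translations, and crops need their own coordinate updates in the same spirit.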

That being said, transformed images don’t need to be perfect to be useful. Quantity of data will often beat quality of data. The more examples you have, the less detrimental effect one outlier/mistake image will have on your model, and the more diverse your dataset will be.

While it will almost always have a positive effect on your model’s performance, data augmentation isn’t a silver bullet for problems related to dataset size. You can’t expect to take a tiny dataset of 50 images, blow it up to 50,000 with the above techniques and get all the benefits of a dataset of size 50,000. Data augmentation can help make models more robust to things like rotations, translations, lighting, and camera artifacts, but not to other changes such as different backgrounds, perspectives, variations in the appearance of objects, relative positioning in scenes, etc.

When should I do data augmentation?

You might be wondering “When should I use data augmentation? When is it beneficial?” The answer is: always! Data augmentation is usually going to help regularize and improve your model, and there are unlikely to be any downsides if you apply it in a reasonable way. The only instance where you might skip it is if your dataset is so incredibly large and diverse that augmentation does not add any meaningful diversity to it. But most of us will not have the luxury of working with such fairytale datasets 🙂.

Data Augmentation for Class Imbalance

Augmentation can also be used to deal with class imbalance problems. Instead of using sampling or weighting-based approaches, you can simply augment the smaller classes more to make all classes the same size. 
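One way to plan this, sketched with the standard library only: count the classes, then pick enough source images from each smaller class (sampling with replacement) to bring it up to the size of the largest one. The function name `plan_balancing` and the example class counts are made up for illustration; each selected index would then be passed through your random augmentation pipeline.

```python
import random
from collections import Counter

def plan_balancing(labels, seed=0):
    """Return indices of images to augment so every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    plan = []
    for label, indices in by_class.items():
        # Sample source images with replacement; each pick should receive
        # a fresh random augmentation so the copies are not identical.
        plan.extend(rng.choices(indices, k=target - len(indices)))
    return plan

labels = ["cat"] * 100 + ["dog"] * 25 + ["bird"] * 10   # illustrative counts
plan = plan_balancing(labels)
balanced = Counter(labels) + Counter(labels[i] for i in plan)
# balanced now has 100 examples of every class
```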

So Which Transformations Should I Choose?

There is no one exact answer, but you should start by thinking about your problem. Does the transformation only generate images that are completely outside the support that you’d ever expect in the real world? Even if an inverted image of a tree in a park isn’t something you’d see in real life, you might see a fallen tree in a similar orientation. However, some transformations might need to be reconsidered, such as:

  • Vertically reflected (upside-down) stop signs at an intersection for object recognition in self-driving.
  • Upside-down body parts or blurred/colored images for radiology images where there will always be consistent orientation, lighting, and sharpness.
  • Grid distortion on satellite images of roads and neighborhoods (though this might be one of the best places to apply rotations).
  • 180-degree rotation on digit classification (MNIST). This transformation will make your 6’s look like 9’s and vice versa while keeping the original label.

Your transformations don’t have to be exclusively realistic, but you should definitely be using any transformations that mimic variations likely to occur in practice.

In addition to knowledge of your task and domain, knowledge of your dataset is also important to consider. Better knowledge of the distribution of images in your dataset will allow you to better choose which augmentations will give you sensible results or possibly even which augmentations can help you fill in gaps in your dataset. A great tool to help you explore your dataset, visualize distributions of image attributes, and examine the quality of your image data is Encord Active.

However, we are engineers and data scientists. We don’t just make decisions based on conjecture; we try things out and run experiments. We have the tried-and-true techniques of model validation and hyperparameter tuning. We can simply experiment with different techniques and choose the combination that maximizes performance on our validation set.
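The experiment loop described above can be sketched as a small search over subsets of candidate augmentations. Everything here is an assumption for illustration: `validation_score` is a hypothetical callable you would supply (it trains a model with the given augmentations and returns a validation metric), and `toy_score` is a made-up stand-in with invented numbers so the sketch runs without training anything.

```python
from itertools import combinations

def best_augmentation_subset(candidates, validation_score, max_size=3):
    """Try every subset of up to max_size candidates and keep the best scorer.

    validation_score: hypothetical user-supplied callable that trains with the
    given augmentations and returns a validation metric (higher is better).
    """
    best_subset, best_score = (), validation_score(())   # baseline: no augmentation
    for size in range(1, max_size + 1):
        for subset in combinations(candidates, size):
            score = validation_score(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

def toy_score(subset):
    """Stand-in for real training; the gains are invented for the demo."""
    gains = {"flip": 0.03, "crop": 0.02, "blur": 0.01, "vertical_flip": -0.05}
    return 0.80 + sum(gains[name] for name in subset)

best, score = best_augmentation_subset(
    ["flip", "crop", "blur", "vertical_flip"], toy_score
)
```

An exhaustive search like this grows quickly with the number of candidates, since every subset means a full training run, so in practice you would keep the candidate list short or search greedily.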

If you need a good starting point: horizontal reflection (horizontal flip), cropping, blur, noise, and an image erasing method (like cutout or random erasing) are a good base to begin with. Then you can experiment with combining them together and adding brightness and coloring changes.

Data Augmentation for Video

Augmentation techniques for video data are very similar to image data, with a few differences. Generally, the chosen transformation will be applied identically to each frame in the video (with the exception of noise). Trimming videos to create shorter segments is also a popular technique (temporal cropping). 
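A minimal sketch of these ideas, assuming a clip stored as a float array of shape (T, H, W, C) with values in [0, 1]. The function name and default parameters are illustrative. The flip decision is sampled once and applied to every frame, while the noise is drawn independently per frame.

```python
import numpy as np

def augment_clip(frames, rng, clip_length=8, noise_sigma=0.01):
    """Temporal crop + consistent flip + per-frame noise for one video."""
    # Temporal crop: keep a random contiguous run of clip_length frames.
    start = rng.integers(0, frames.shape[0] - clip_length + 1)
    clip = frames[start:start + clip_length]
    # Sample the geometric transform once so every frame gets the same one.
    if rng.random() < 0.5:
        clip = clip[:, :, ::-1]          # flip the width axis of all frames
    # Noise is the exception: it is drawn independently for every frame.
    noise = rng.normal(0.0, noise_sigma, clip.shape)
    return np.clip(clip + noise, 0.0, 1.0)

rng = np.random.default_rng(0)
video = rng.random((30, 16, 16, 3))      # stand-in for a 30-frame video
clip = augment_clip(video, rng)
```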

How to implement data augmentation

The exact specifics of your implementation will depend on your hardware, chosen deep learning library, chosen transformations etc. But there are generally two strategies to implement data augmentation: offline and online.

Offline augmentation: Performing data augmentation offline means you will compute a new dataset that includes all of your original and transformed images, and save it to disk. Then you’ll train your model as usual using the augmented dataset instead of the original one. This can drastically increase the disk storage required, so we don’t recommend it unless you have a specific reason to do so (such as verifying the quality of the augmented images or controlling for the exact images that are shown during training).

Online augmentation: This is the most common method of implementing data augmentation. In online augmentation, you will transform the images at each epoch or batch when loading them. In this scenario, the model sees a different transformation of the image at each epoch, and the transformations are never saved to disk. Typically, transformations are randomly applied to an image each epoch. For example, you will randomly decide whether or not to flip an image at each epoch, perform a random crop, sample a blur/sharpening amount, etc.


Online and offline data augmentation processes.
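Online augmentation can be sketched as a data loader that applies fresh random transformations every time an image is loaded. The generator below is a simplified NumPy illustration with assumed names and parameters (`random_augment`, `augmented_batches`, a 28-pixel crop); in a real project you would normally rely on the data loading utilities of your deep learning library instead.

```python
import numpy as np

def random_augment(image, rng, crop=28):
    """Random horizontal flip followed by a random crop of size crop x crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    top = rng.integers(0, image.shape[0] - crop + 1)
    left = rng.integers(0, image.shape[1] - crop + 1)
    return image[top:top + crop, left:left + crop]

def augmented_batches(images, labels, batch_size, rng):
    """Yield one epoch of shuffled batches, augmenting each image on the fly."""
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = np.stack([random_augment(images[i], rng) for i in idx])
        yield batch, labels[idx]

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32, 3))     # stand-in dataset
labels = np.arange(10)
for epoch in range(2):
    for batch, batch_labels in augmented_batches(images, labels, 4, rng):
        pass  # a training step would go here
```

Nothing is written to disk, and because the random draws differ on every pass, the model sees a different version of each image in each epoch.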

TensorFlow and PyTorch both contain a variety of modules and functions to help you with augmentation. For even more options, check out the imgaug Python library.

Which techniques do the pros use?

You may still be wondering, “How do people who train state-of-the-art models use image augmentation?” Let’s take a look:

Paper | Data Augmentation Techniques


Translate, Scale, Squeeze, Shear


Translate, Flip, Intensity Changing


Crop, Flip


Flip, Crop, Translate


Crop, Elastic distortion


Cutout, Crop, Flip


AutoAugment, Mixup, Crop


AutoAugment, RandAugment, Random erasing, Mixup, CutMix

Swin Transformer

RandAugment, Mixup, CutMix, Random erasing


Translate, Rotate, Gray value variation, Elastic deformation

Faster R-CNN



Scale, Translate, Color space


Crop, Resize, Flip, Color Space, Distortion


Mosaic, Distortion, Scale, Color space, Crop, Flip, Rotate, Random erase, Cutout, Hide and Seek, GridMask, Mixup, CutMix, StyleGAN

Erasing/Cutout: Wait, what is all this cut-mix-rand-aug stuff? Some of these, like Cutout, Random Erasing, and GridMask, are image erasing methods. When performing erasing, you can cut out a square, rectangles of different shapes, or even multiple separate cuts/masks within the image. There are also various ways to randomize this process. Erasing is a popular strategy. In image classification, for example, erasing the most distinctive part of an object can force the model to learn to identify objects from each individual part rather than just the most distinctive one (for example, learning to recognize dogs by paws and tails, not just faces). Erasing can be thought of as a sort of “dropout in the input space”.
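A minimal sketch of the simplest erasing variant: one randomly placed square filled with a constant value. The patch size and fill value are arbitrary choices for the example, and the square is clipped when it overlaps the image border.

```python
import numpy as np

def cutout(image, rng, size=8, fill=0.0):
    """Erase one random size x size square (clipped at the image borders)."""
    height, width = image.shape[:2]
    center_y = rng.integers(0, height)
    center_x = rng.integers(0, width)
    top, bottom = max(center_y - size // 2, 0), min(center_y + size // 2, height)
    left, right = max(center_x - size // 2, 0), min(center_x + size // 2, width)
    erased = image.copy()                # leave the original image untouched
    erased[top:bottom, left:right] = fill
    return erased

rng = np.random.default_rng(0)
image = np.ones((32, 32, 3))             # stand-in image
erased = cutout(image, rng)
```

Variants such as rectangles of random aspect ratio, several patches per image, or noise fill instead of a constant follow the same pattern with different sampling of the region.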

Mixing: Another popular technique in data augmentation is mixing. Mixing involves combining separate examples (usually of different classes) to create a new training image. Mixing is less intuitive than the other methods we have seen, because the resulting images do not look realistic. Let’s look at a couple of popular techniques for doing this:

Mixup: Mixup combines two images by linear interpolation (weighted average) of the two images. The same interpolation is then applied to the class label.


An example of a mixup image. The corresponding image label in a binary image classification problem with labels (dog, cat) would then be (0.52, 0.48).

What? This looks like hazy nonsense! And what are those label values? Why does this work?

Essentially, the goal here is to encourage our model to learn smoother, linear transitions between different classes, rather than oscillate or behave erratically. This helps stabilize model behavior on unseen examples at inference time.
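The interpolation itself is only a few lines. In this sketch the mixing weight `lam` is fixed at 0.52 to reproduce the (0.52, 0.48) label from the example above; the images are random stand-ins and the labels are assumed to be one-hot vectors ordered as (dog, cat).

```python
import numpy as np

def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with the same weight lam."""
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_image, mixed_label

rng = np.random.default_rng(0)
dog, cat = rng.random((32, 32, 3)), rng.random((32, 32, 3))
dog_label, cat_label = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mixed_image, mixed_label = mixup(dog, dog_label, cat, cat_label, lam=0.52)
# mixed_label is approximately [0.52, 0.48]
```

During training the weight is usually random rather than fixed. The Mixup paper samples it from a Beta(alpha, alpha) distribution, which in NumPy would be `lam = rng.beta(alpha, alpha)` for a chosen `alpha`.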

CutMix: CutMix is a combination of the Cutout and Mixup approaches. As mentioned before, Mixup images look very unnatural and can be confusing to the model when performing localization. Rather than interpolating between two images, CutMix simply takes a crop of one image and pastes it onto a second image. Its advantage over Cutout is that the cut-out region is not thrown away and replaced with garbage, but is filled with actual information. The label weighting is similar: for a classification problem, each class’s label weight is the fraction of pixels in the augmented image that come from that class’s image. For localization, we keep the same bounding boxes or segmentation from the original images in their respective parts of the composite image.


An example of a CutMix image.
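A sketch of CutMix for classification, under a few assumptions: both images have the same size, the patch is taken from the same location in the second image, and `lam` is the intended share of the first image. Since the patch can be clipped at the border, the label weights are recomputed from the area that was actually pasted.

```python
import numpy as np

def cutmix(image_a, label_a, image_b, label_b, rng, lam):
    """Paste a random patch of image_b onto image_a and mix labels by area."""
    height, width = image_a.shape[:2]
    cut_h = int(height * np.sqrt(1.0 - lam))
    cut_w = int(width * np.sqrt(1.0 - lam))
    center_y, center_x = rng.integers(0, height), rng.integers(0, width)
    top, bottom = max(center_y - cut_h // 2, 0), min(center_y + cut_h // 2, height)
    left, right = max(center_x - cut_w // 2, 0), min(center_x + cut_w // 2, width)
    mixed = image_a.copy()
    mixed[top:bottom, left:right] = image_b[top:bottom, left:right]
    # Label weight of image_b = fraction of pixels that now come from it.
    share_b = (bottom - top) * (right - left) / (height * width)
    mixed_label = (1.0 - share_b) * label_a + share_b * label_b
    return mixed, mixed_label

rng = np.random.default_rng(0)
image_a, image_b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))   # stand-ins
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed, mixed_label = cutmix(image_a, label_a, image_b, label_b, rng, lam=0.7)
```

With an all-zeros and an all-ones image as stand-ins, the mean of `mixed` equals the label weight of the second class, which is a quick way to check that the pixel share and the label agree.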

AugMix: AugMix is a little different from the above examples, but is also worth mentioning here. AugMix doesn’t mix different training images together - instead it mixes different transformations of the same image. This retains some of the benefits of mixing by exploring the input space between images, and reduces the degradation caused by applying many transformations to the same image. The mixes are computed as follows:

  1. Multiple (3 by default) augmented images are created. Each augmented image is created using 1-3 different transformations.
  2. The 3 augmented images are mixed by weighted averaging.
  3. The resulting image is mixed with the original image by weighted averaging.


AugMix augmentation process. The entire method involves other parts as well such as a specific loss function. (source)
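The three steps can be sketched as follows. This covers only the image mixing, not the loss function mentioned in the caption. The chain weights are drawn from a Dirichlet distribution and the final mixing weight from a Beta distribution, following the AugMix paper; the three toy operations and the parameter defaults are assumptions for this example.

```python
import numpy as np

def augmix(image, operations, rng, width=3, max_depth=3, alpha=1.0):
    """Mix several augmented versions of one image, then mix with the original.

    operations: list of functions image -> image that preserve shape.
    """
    chain_weights = rng.dirichlet([alpha] * width)    # weights sum to 1
    mixed = np.zeros_like(image, dtype=float)
    for weight in chain_weights:
        augmented = image
        depth = rng.integers(1, max_depth + 1)        # 1 to max_depth operations
        for op_index in rng.integers(0, len(operations), size=depth):
            augmented = operations[op_index](augmented)
        mixed += weight * augmented                   # step 2: weighted average
    m = rng.beta(alpha, alpha)
    return m * image + (1.0 - m) * mixed              # step 3: mix with original

operations = [
    lambda img: img[:, ::-1],                         # horizontal flip
    lambda img: np.clip(img * 1.2, 0.0, 1.0),         # brighten
    lambda img: np.roll(img, shift=2, axis=1),        # small shift (wraps around)
]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                       # stand-in image
result = augmix(image, operations, rng)
```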

Advanced techniques, i.e. “Can’t I just generate completely unique data using GANs?”

Image augmentation is still an active research area, and there are a few more advanced methods to be aware of. The following techniques are more complex (particularly the last two) and will not always be the most practical or efficient-to-implement strategies. We list these for the sake of completeness.

  • Feature Space Augmentation: Feature space augmentation consists of performing augmentation on the hidden layer representations during training instead of on the original images. The idea is that you are more likely to encounter realistic examples when moving through the feature space than when moving through the input space. This can be done by adding noise to the hidden representations, performing mixing on them, or other methods.
  • GANs: Another approach is to generate new synthetic images using a generative model (typically GAN) that learns the underlying data-generating distribution. This can be done unconditionally, or by starting from existing training examples, for example, by using style transfer. For video data, simulation techniques can be used to synthesize sequences of frames.
  • Automatic Augmentation: Finding the best combinations of augmentation techniques for a given problem can be time-consuming and require expertise in both the domain and specific dataset. Wouldn’t it be great if the computer could do all of that work for us? Automatic Augmentation algorithms search over the space of possible augmentations to find the best performing ones. Reinforcement learning methods like AutoAugment find high-performing augmentation policies. Adversarial learning-based methods that generate difficult-to-classify transformations also exist.
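As a toy illustration of the first item, feature space augmentation by noise, here is a two-layer network in NumPy where Gaussian noise is added to the hidden activations rather than to the input. The network, its random weights, and the noise level are all assumptions for the sketch; in practice this would live inside the training-time forward pass of your model.

```python
import numpy as np

def forward(x, w_hidden, w_out, rng=None, noise_sigma=0.0):
    """Two-layer forward pass with optional noise on the hidden representation."""
    hidden = np.maximum(x @ w_hidden, 0.0)            # ReLU hidden features
    if rng is not None and noise_sigma > 0.0:
        # Augment in feature space: perturb the activations, not the pixels.
        hidden = hidden + rng.normal(0.0, noise_sigma, hidden.shape)
    return hidden @ w_out

rng = np.random.default_rng(0)
x = rng.random((4, 64))                               # batch of flattened inputs
w_hidden = rng.normal(size=(64, 16))
w_out = rng.normal(size=(16, 2))

clean_logits = forward(x, w_hidden, w_out)                         # inference
noisy_logits = forward(x, w_hidden, w_out, rng, noise_sigma=0.1)   # training
```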


Now you know what data augmentation is and how it helps address overfitting by filling out your dataset. You know that you should be using data augmentation for all of your computer vision tasks. You have a good overview of the most essential data augmentation transformations and techniques, you know what to be mindful of, and you’re ready to add data augmentation to your own preprocessing and training pipelines. Good luck!