Cristian Duguet, Adrian Hutter, Nandika Kalra, Arshak Navruzyan
Overview
Data augmentation of images is a widely adopted technique to improve model generalization [3] by applying transformations (or sets of combined transformations) during training and, optionally, at test time. Practitioners often search for the best set of transformations manually, either by relying on domain expertise or by making assumptions about generally useful transformations. This manual search for optimal augmentations can be time-consuming and compute-intensive.
The fast.ai library does a great job of providing smart defaults in areas like hyperparameter settings and model architecture (e.g. a custom head), and augmentations are no exception. Through extensive experimentation, the fast.ai team has found a standard set of augmentations which is applied indiscriminately to every dataset and has proven effective most of the time. However, there may be room for improvement if the appropriate augmentations can be found quickly for a given dataset.
Over the last months, our team at fellowship.ai has been working on a method to automate the search for the best augmentation set, in a computationally efficient manner and with as little domain-specific input as possible. This research aligns with the goals of platform.ai, which seeks to offer deep learning to a wider non-engineer audience.
Current approaches
There are a few approaches to automatically finding an optimal set of augmentations, but the most common paradigm is to generate various augmented datasets and fully or partially train "child models" with each set to determine its impact on performance. The central focus of these approaches is to make the search as efficient as possible through Gaussian processes, reinforcement learning and other search methods.
One obvious disadvantage of this formulation is that it is not computationally cheap: to achieve even a modest improvement, thousands or tens of thousands of child models have to be trained and evaluated.
A more novel approach comes from Ratner et al. [5], who use a GAN to find the set of transformation parameters that creates augmented data lying within a defined distribution of interest, representative of the training set. By treating the augmentation search as a sequence modeling problem, this approach attempts to find not only the right augmentations but also their parameterization, which can be non-differentiable and hence requires a reinforcement learning loop in addition to the generative and discriminative model training.
We are looking for a more computationally efficient way (potentially sacrificing some of the performance gains that come from composing transformations) to automatically find a set of parameters for the image transform functions provided by the fast.ai library.
Methodology
The fastai library provides different types of image transformations for data augmentation. These can be grouped into affine, pixel, cropping, lighting and coordinate transformations. The default augmentation set, obtained by calling get_transforms(), has the following parameters:
Table 1: List of transformations available in fast.ai. More transforms are available in the documentation, but we focused on the ones listed here.

| transform | type | parameters | prob |
|---|---|---|---|
| crop_pad | TfmCrop | 'row_pct': (0, 1), 'col_pct': (0, 1) | 1.0 |
| flip_affine | TfmAffine | | 0.5 |
| symmetric_warp | TfmCoord | 'magnitude': (-0.2, 0.2) | 0.75 |
| rotate | TfmAffine | 'degrees': (-10.0, 10.0) | 0.75 |
| zoom | TfmAffine | 'row_pct': (0, 1), 'col_pct': (0, 1), 'scale': (1.0, 1.1) | 0.75 |
| brightness | TfmLighting | 'change': (0.4, 0.6) | 0.75 |
| contrast | TfmLighting | 'scale': (0.8, 1.25) | 0.75 |
Method Validation
The performance of a network trained with the chosen transformations is compared to that of a network trained with the default fast.ai augmentation set, using the same training routine for both. This is done for several datasets. The performance metric is the error rate on the validation set.
Datasets
A variety of datasets and domains were selected to assess the robustness of this method. A group of datasets representative of very different use cases was gathered, and the following six were selected:

- Oxford-IIIT Pet Dataset
- Stanford Dogs Dataset
- Planet: Understanding the Amazon from Space
- CIFAR-10
- Kuzushiji-MNIST
- Food-101
TTA Search
Test Time Augmentation (TTA) is a technique that leverages augmentation at prediction time rather than at training time. With TTA, the prediction for an image combines two different predictions:

p = β * p_orig + (1 - β) * p_aug

where p_orig and p_aug are the predicted values for the original and the augmented image, respectively, and β is a weight factor.
We have observed that Test Time Augmentation can be a good indicator of whether a certain augmentation is a good candidate for training: if the average of TTA predictions using a certain augmentation has a higher accuracy than the average of plain predictions, then training the network with that augmentation will most likely improve its performance.
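A minimal sketch of this weighted blending (the function name and the example weight are our own illustrative choices, not the exact implementation used in the experiments):

```python
import numpy as np

def tta_blend(probs_orig, probs_aug, beta=0.4):
    """Blend predictions on the original and the augmented image.

    beta is the weight given to the untransformed image; 0.4 is an
    illustrative default, not the value used in the experiments.
    """
    return beta * probs_orig + (1.0 - beta) * probs_aug

# Example: two-class predictions for one image
p_orig = np.array([1.0, 0.0])  # prediction on the original image
p_aug = np.array([0.0, 1.0])   # prediction on the augmented image
print(tta_blend(p_orig, p_aug, beta=0.4))  # [0.4 0.6]
```

If the original and augmented predictions disagree, the blended prediction interpolates between them according to beta.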
Figure 1: Normalized error rate of a network trained with a certain augmentation vs. the normalized TTA error for that augmentation. The normalization is the relative change with respect to the baseline case. Each point is a different augmentation tried on the network; each color is an experiment on a different dataset.
We identified transformations that are very harmful for training, characterized by high err/err_none and TTA_err/TTA_err_none values. For the Pets and Dogs datasets, for example, it was the dihedral transform, while for Planet it was resize_crop. Most transforms in our set were detrimental for CIFAR-10.
Based on the observed behaviour, we search for an augmentation set using TTA, with a procedure that works as follows.

1. Split the training set into two subsets of size 80% and 20%, respectively.
2. Train the last layer group on the 80% subset for EPOCHS_HEAD epochs, without any data augmentation.
3. Calculate the error rate ERR_NONE on the remaining 20% of the training set.
4. For each kind of transformation and each possible magnitude, calculate the TTA error rate on the remaining 20% of the training set. For TTA, we base predictions on WEIGHT_UNTRANSFORMED * LOGITS_UNTRANSFORMED + (1 - WEIGHT_UNTRANSFORMED) * LOGITS_TRANSFORMED, where WEIGHT_UNTRANSFORMED describes how much the untransformed image influences the prediction.
5. For each kind of transformation, choose the magnitude which leads to the lowest TTA error rate, provided that error rate is lower than THRESHOLD * ERR_NONE; otherwise, don't include that kind of transformation in the final set of augmentations.
6. With the chosen set of augmentations, train the head for EPOCHS_HEAD epochs and then the full network for EPOCHS_FULL epochs.
7. As a baseline, train the network for the same number of epochs using the transforms provided by get_transforms().
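The magnitude selection and thresholding described above can be sketched in plain Python. This is a sketch of the procedure, not the code we ran: the names select_augmentations and tta_error are hypothetical, while threshold and err_none mirror THRESHOLD and ERR_NONE from the text.

```python
def select_augmentations(candidates, tta_error, err_none, threshold=1.0):
    """Pick, per transform, the magnitude with the lowest TTA error.

    candidates: dict mapping transform name -> list of magnitudes to try.
    tta_error:  callable (transform, magnitude) -> TTA error rate on the
                held-out 20% split.
    A transform is kept only if its best TTA error beats
    threshold * err_none; otherwise it is left out of the final set.
    """
    chosen = {}
    for tfm, magnitudes in candidates.items():
        best = min(magnitudes, key=lambda m: tta_error(tfm, m))
        if tta_error(tfm, best) < threshold * err_none:
            chosen[tfm] = best
    return chosen

# Toy example with precomputed TTA error rates
errors = {
    ("rotate", 10): 0.048, ("rotate", 20): 0.045,
    ("dihedral", 1): 0.090,  # clearly harmful, stays out
}
picked = select_augmentations(
    {"rotate": [10, 20], "dihedral": [1]},
    lambda t, m: errors[(t, m)],
    err_none=0.050,
)
print(picked)  # {'rotate': 20}
```

In the toy example, rotation at magnitude 20 beats the no-augmentation error rate and is kept, while the dihedral transform is rejected because its TTA error exceeds the threshold.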
Out of the transformations available in the fast.ai library, we have tested our method with the following transforms and parameters:
Table 2: List of tested augmentations using our search method, including the tested parameters.

| Transform | Type | Tested parameters | Prob* |
|---|---|---|---|
| flip | TfmAffine | either flip_affine or dihedral_affine | 0.5 |
| symmetric_warp | TfmCoord | 'magnitude': (-m, m) for m in [0.1, 0.2, 0.3, 0.4] | 0.75 |
| rotate | TfmAffine | 'degrees': (-m, m) for m in [10., 20., 30., 40.] | 0.75 |
| zoom | TfmAffine | 'scale': (1., max_zoom) for max_zoom in [1.1, 1.2, 1.3, 1.4] | 0.75 |
| brightness | TfmLighting | 'change': (0.5*(1-max_lighting), 0.5*(1+max_lighting)) for max_lighting in [0.1, 0.2, 0.3, 0.4] | 0.75 |
| contrast | TfmLighting | 'scale': (1-max_lighting, 1/(1-max_lighting)) for max_lighting in [0.1, 0.2, 0.3, 0.4] | 0.75 |
| skew | TfmCoord | 'direction': (0, 7), 'magnitude': max_skew for max_skew in [0.2, 0.4, 0.6, 0.8] | 0.75 |
| squish | TfmAffine | 'scale': (1/max_scale, max_scale) for max_scale in [1.2, 1.8, 2.4, 3.] | 0.75 |
| rand_pad+crop | TfmPixel | 'padding': p for p in [size/16, size/8, size/4] | 0.75 |
* the probability with which each transform was applied during final training. During the TTA search, the probability was 1.0.
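As a sanity check, the brightness and contrast expressions in Table 2 reduce to the get_transforms() defaults of Table 1 at max_lighting = 0.2. A small sketch (the helper names are ours, for illustration only):

```python
def brightness_change(max_lighting):
    # 'change' range tested for brightness in Table 2
    return (0.5 * (1 - max_lighting), 0.5 * (1 + max_lighting))

def contrast_scale(max_lighting):
    # 'scale' range tested for contrast in Table 2
    return (1 - max_lighting, 1 / (1 - max_lighting))

# max_lighting = 0.2 reproduces the defaults of Table 1
print(brightness_change(0.2))  # (0.4, 0.6)
print(contrast_scale(0.2))     # (0.8, 1.25)
```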
Findings
We have found that Test Time Augmentation delivers information about the performance improvement to be expected from a particular augmentation, and helps to rapidly decide which image transformations are constructive for better network generalization.
Table 3 shows the performance improvement achieved by this method in comparison to the baseline case. The augmentations picked for each dataset are detailed in Table 4.
Table 3: Top-1 error rates for the found augmentation sets (datasets listed in the order given above; negative values indicate the baseline performed better).

| Dataset | Baseline error rate | TTA Search | rel. improvement |
|---|---|---|---|
| Oxford-IIIT Pet | 5.07% | 4.74% | 7.14% |
| Stanford Dogs | 15.41% | 15.17% | 1.61% |
| Planet | 3.29% | 3.30% | -0.13% |
| CIFAR-10 | 2.99% | 3.11% | -3.86% |
| Kuzushiji-MNIST | 1.55% | 1.48% | 4.73% |
| Food-101 | 19.39% | 18.65% | 3.93% |
UPDATE: The platform.ai team has just reached SOTA performance on the Food-101 dataset using our augmentation search technique.
Table 4: List of selected augmentations and their parameters for each dataset. The values between brackets represent the range of the uniform distribution used by the RandTransform class.

| transform | pets | dogs | planet | cifar | kmnist | food |
|---|---|---|---|---|---|---|
| flip | flip_lr | flip_lr | dihedral | flip_affine | | flip_affine |
| symmetric_warp (magnitude) | [-0.1, 0.1] | [-0.1, 0.1] | [-0.2, 0.2] | | | [-0.2, 0.2] |
| rotate (degrees) | [-20, 20] | [-20, 20] | [-10, 10] | | [-10, 10] | [-40, 40] |
| zoom (scale) | [1.0, 1.3] | [1.0, 1.2] | [1.0, 1.2] | | [1.0, 1.2] | [1.0, 1.4] |
| brightness (change) | | [0.45, 0.55] | | [0.3, 0.7] | [0.45, 0.55] | [0.35, 0.65] |
| contrast (scale) | | [0.7, 1.43] | [0.9, 1.11] | [0.9, 1.11] | [0.8, 1.25] | [0.7, 1.43] |
| skew (direction, magnitude) | [0, 7], 0.4 | [0, 7], 0.2 | [0, 7], 0.4 | | [0, 7], 0.2 | [0, 7], 0.2 |
| squish (scale) | | [0.83, 1.2] | [0.83, 1.2] | | [0.83, 1.2] | [0.42, 2.4] |
It is worth noting that the augmentations picked by our method seemed qualitatively reasonable. For example, for the Planet dataset it chooses dihedral flips (which include upside-down flips), while for Kuzushiji-MNIST it chooses neither left-right nor dihedral flips, since any of these flips would be damaging for character recognition.
With more time, it would be worth investigating how consistent the differences between the error rates are: the differences between the TTA-based selection of augmentations and get_transforms() might be smaller than the differences between different runs with the same set of augmentations.
Acknowledgements
We would like to thank Jeremy Howard for his insightful guidance and David Tedaldi for mentorship during this project.
References
1. Cubuk, Ekin D., et al. "AutoAugment: Learning Augmentation Policies from Data." arXiv preprint arXiv:1805.09501 (2018).
2. Geng, Mingyang, et al. "Learning Data Augmentation Policies Using Augmented Random Search." arXiv preprint arXiv:1811.04768 (2018).
3. Perez, Luis, et al. "The Effectiveness of Data Augmentation in Image Classification using Deep Learning." arXiv preprint arXiv:1712.04621 (2017).
4. Krizhevsky, Alex, et al. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25 (2012).
5. Ratner, Alexander J., et al. "Learning to Compose Domain-Specific Transformations for Data Augmentation." Advances in Neural Information Processing Systems. 2017.
Also published on Medium.