HOW MUCH DATA ARE AUGMENTATIONS WORTH? AN INVESTIGATION INTO SCALING LAWS, INVARIANCE, AND IMPLICIT REGULARIZATION

Abstract

Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield diverse samples, even samples inconsistent with the data distribution, can be even more valuable than additional training data. Moreover, we find that data augmentations which encourage invariances can be more valuable than invariance alone, especially on small- and medium-sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.

1. INTRODUCTION

Even with the proliferation of large-scale image datasets, deep neural networks for computer vision represent highly flexible model families and often contain orders of magnitude more parameters than the size of their training sets. As a result, large models trained on limited datasets still have the capacity for improvement. To make up for this data shortage, standard operating procedure involves diversifying training data by augmenting samples with randomly applied transformations that preserve semantic content. These augmented samples expand the volume of data available for training, resulting in downstream performance benefits that one might expect from a larger dataset. However, the now profound significance of data augmentation (DA) for boosting performance suggests that its benefits may be more nuanced than previously believed. In addition to adding extra samples, augmentation promotes invariance by encouraging models to make consistent predictions across augmented views of each sample. The need to incorporate invariances in neural networks has motivated the development of architectures that are explicitly constrained to be equivariant to transformations (Weiler & Cesa, 2019; Finzi et al., 2020). If the downstream effects of data augmentations were attributable solely to invariance, then we could replace DA with explicit model constraints. However, if explicit constraints cannot replicate the benefits of augmentation, then augmentations may affect training dynamics beyond imposing constraints. Finally, augmentation may improve training by serving as an extra source of stochasticity. Under DA, randomization during training comes not only from randomly selecting samples from the dataset to form batches but also from sampling transformations with which to augment data (Fort et al., 2022).
Stochastic optimization is associated with benefits in non-convex problems wherein the optimizer can bias parameters towards flatter minima (Jastrzębski et al., 2018; Geiping et al., 2021; Liu et al., 2021a). In this paper, we re-examine the role of data augmentation. In particular, we quantify the effects of data augmentation in expanding available training data, promoting invariance, and acting as a source of stochasticity during training. In summary:

• We quantify the relationship between augmented views of training samples and extra data, evaluating the benefits of augmentations as the number of samples rises. We find that augmentations can confer comparable benefits to independently drawn samples on in-domain test sets and even stronger benefits on out-of-distribution testing.

• We observe that models that learn invariances via data augmentation enjoy additional regularization compared to explicitly invariant architectures, and we show that invariances that are uncharacteristic of the data distribution can still benefit performance.

• We then clarify the regularization benefits gained from augmentations through measurements of flatness and gradient noise, showing how DA exhibits flatness-seeking behavior.
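The first contribution above hinges on an "exchange rate" between augmented and real data: the accuracy of an augmented model at a fixed training-set size is matched against a scaling curve measured for models trained on real data alone. A minimal sketch of this matching step, assuming a measured (and monotonically increasing) baseline accuracy curve; the function name and the toy numbers are illustrative, not the paper's exact protocol:

```python
import bisect

def exchange_rate(baseline_curve, n_train, aug_accuracy):
    """Estimate how many extra real samples an augmentation is 'worth'.

    baseline_curve: (num_samples, accuracy) pairs for models trained
    WITHOUT augmentation, sorted by num_samples (accuracy monotone).
    Returns the number of additional real samples needed to match the
    accuracy of the augmented model trained on n_train samples.
    """
    sizes = [s for s, _ in baseline_curve]
    accs = [a for _, a in baseline_curve]
    # Invert the scaling curve with linear interpolation to find the
    # dataset size whose baseline accuracy equals aug_accuracy.
    i = bisect.bisect_left(accs, aug_accuracy)
    if i == 0:
        n_equiv = sizes[0]       # below the measured range: clamp
    elif i == len(accs):
        n_equiv = sizes[-1]      # above the measured range: clamp
    else:
        a0, a1 = accs[i - 1], accs[i]
        s0, s1 = sizes[i - 1], sizes[i]
        n_equiv = s0 + (aug_accuracy - a0) / (a1 - a0) * (s1 - s0)
    return n_equiv - n_train

# e.g. augmentation lifts a 1,000-sample model to 75% accuracy, which the
# baseline curve only reaches at ~3,000 samples: worth ~2,000 extra samples.
curve = [(1000, 0.60), (2000, 0.70), (4000, 0.80)]
extra = exchange_rate(curve, 1000, 0.75)
```

Linear interpolation on the raw curve is the simplest choice here; interpolating in log-sample space would better match the power-law scaling typically observed in practice.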

2. RELATED WORK

Data Augmentations in Computer Vision. Data augmentations have been a staple of deep learning, used to deform handwritten digits as early as Yaeger et al. (1996) and LeCun et al. (1998), or to improve oversampling on class-imbalanced datasets (Chawla et al., 2002). These early works hypothesize that data augmentations are necessary to prevent overfitting when training neural networks, since they typically contain many more parameters than training data points (LeCun et al., 1998). For a broad and thorough discussion of image augmentations, their categorization, and applications to computer vision, see Shorten & Khoshgoftaar (2019) and Xu et al. (2022). We consider basic geometric (random crops, flips, perspective) and photometric (jitter, blur, contrast) transformations, as well as common augmentation policies such as AutoAug (Cubuk et al., 2019), RandAug (Cubuk et al., 2020), AugMix (Hendrycks et al., 2020), and TrivialAug (Müller & Hutter, 2021), which combine basic augmentations. We restrict our study to augmentations which act on a single sample and do not modify labels. Namely, we study augmentations which can be written as (T(x), y), where (x, y) denotes an input-label pair, and T ∼ T is a random transformation sampled from a distribution of such transformations.

Understanding the Role of Augmentation and Invariance. Works such as Hernández-García & König (2018) propose that data augmentations (DA) induce implicit regularization. Empirical evaluations describe useful augmentations as "label preserving": they do not significantly change the conditional probability over labels (Taylor & Nitschke, 2018). Gontijo-Lopes et al. (2020b;a) investigate empirical notions of consistency and diversity, and variations in dataset scales (Steiner et al., 2022). They measure consistency (referred to as affinity) as the performance of models trained without augmentation on augmented validation sets. They also measure diversity as the ratio of the training loss of a model trained with and without augmentations, and conclude that strong data augmentations should be both consistent and diverse, an effect also seen in Kim et al. (2021). In contrast to Gontijo-Lopes et al. (2020b), Marcu & Prugel-Bennett (2022) find that the value of data augmentations cannot be measured by how much they deform the data distribution. Other work proposes to learn invariances parameterized as augmentations from the data (Benton et al., 2020), investigates the number of samples required to learn an invariance (Balestriero et al., 2022b), uncovers the tendency of augmentations to sacrifice performance on some classes in exchange for gains on others (Balestriero et al., 2022a), or argues that data augmentations cause models to misrepresent uncertainty (Kapoor et al., 2022). Theoretical investigations in Chen et al. (2020a) formalize data augmentations as label-preserving group actions and discuss an inherent invariance-variance trade-off. Variance regularization also arises when modeling augmentations for kernel classifiers (Dao et al., 2019). For a binary classifier with finite VC dimension, the bound on expected risk can be reduced through additional data generated via augmentations until inconsistency between augmented and real data distributions overwhelms would-be gains (He et al., 2019b). The regularizing effect of data augmentations is investigated in LeJeune et al. (2019), who propose a model under which continuous augmentations increase the smoothness of neural network decision boundaries. Rajput et al. (2019) similarly find that linear classifiers trained with sufficient augmentations can approximate the maximum margin solution. Hanin & Sun (2021) relate data augmentations to stochastic optimization. A different angle towards understanding invariances through data augmentations is presented in Zhu et al. (2021), where the effect of DA in increasing the theoretical sample cover of the distribution is investigated, and augmentations can reduce the amount of data required if they "cover" the real distribution.

Stochastic Optimization and Neural Network Training. The implicit regularization of SGD is regarded as an essential component for neural network generalization (An, 1996; Neyshabur et al., 2017). Stochastic training which randomizes gradients can drive parameters into flatter minima.
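The affinity and diversity measures of Gontijo-Lopes et al. (2020b) discussed above can be made concrete with a toy sketch. Below, a nearest-centroid classifier on two Gaussian blobs stands in for a trained model, and Gaussian jitter stands in for an augmentation; affinity is the accuracy change of a clean-trained model when evaluated on an augmented validation set, and diversity is the ratio of training loss with and without augmentation. All names and the toy setup are illustrative, not the original evaluation pipeline:

```python
import math
import random

random.seed(0)

def blob(cx, cy, n, spread=0.5):
    return [(random.gauss(cx, spread), random.gauss(cy, spread)) for _ in range(n)]

def augment(p, strength=1.0):
    # Toy "augmentation": isotropic Gaussian jitter of the input point.
    return (p[0] + random.gauss(0, strength), p[1] + random.gauss(0, strength))

def centroid(pts):
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

def fit(data):
    # "Training" = computing one centroid per class label.
    return [centroid([p for l, p in data if l == k]) for k in (0, 1)]

def accuracy(cents, data):
    hits = sum(min(range(2), key=lambda k: math.dist(p, cents[k])) == l
               for l, p in data)
    return hits / len(data)

def train_loss(data):
    # Mean squared distance to own class centroid as a stand-in loss.
    cents = fit(data)
    return sum(math.dist(p, cents[l]) ** 2 for l, p in data) / len(data)

train = [(0, p) for p in blob(-2, 0, 100)] + [(1, p) for p in blob(2, 0, 100)]
val   = [(0, p) for p in blob(-2, 0, 200)] + [(1, p) for p in blob(2, 0, 200)]

cents = fit(train)                            # model trained WITHOUT augmentation
aug_val = [(l, augment(p)) for l, p in val]
affinity = accuracy(cents, aug_val) - accuracy(cents, val)

aug_train = [(l, augment(p)) for l, p in train]
diversity = train_loss(aug_train) / train_loss(train)
```

Under this setup, a stronger jitter lowers affinity (augmented points look less like the clean distribution to the clean-trained model) and raises diversity (the augmented training set is harder to fit), mirroring the consistency-diversity trade-off described above.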

