AUTOMATIC DATA AUGMENTATION VIA INVARIANCE-CONSTRAINED LEARNING

Abstract

Underlying data structures, such as symmetries or invariances to transformations, are often exploited to improve the solution of learning tasks. However, embedding these properties in models or learning algorithms can be challenging and computationally intensive. Data augmentation, on the other hand, induces these symmetries during training by applying multiple transformations to the input data. Despite its ubiquity, its effectiveness depends on the choices of which transformations to apply, when to do so, and how often. In fact, there is both empirical and theoretical evidence that the indiscriminate use of data augmentation can introduce biases that outweigh its benefits. This work tackles these issues by automatically adapting the data augmentation while solving the learning task. To do so, it formulates data augmentation as an invariance-constrained learning problem and leverages Markov chain Monte Carlo (MCMC) sampling to solve it. The result is a practical algorithm that not only does away with a priori searches for augmentation distributions, but also dynamically controls if and when data augmentation is applied. Our experiments illustrate the performance of this method, which achieves state-of-the-art results on automatic data augmentation benchmarks for the CIFAR datasets. Furthermore, this approach can be used to gather insights on the actual symmetries underlying a learning task.

1. INTRODUCTION

Exploiting the underlying structure of data has always been a key principle in data analysis. Its use has been fundamental to the success of machine learning solutions, from the translational equivariance of convolutional neural networks (Fukushima and Miyake, 1982) to the invariant attention mechanism in AlphaFold (Jumper et al., 2021). However, embedding invariances and symmetries in model architectures is hard in general and, when possible, often incurs a high computational cost. This is the case for rotation-invariant neural network architectures that rely on group convolutions, which are feasible only for small, discrete transformation spaces or require coarse undersampling due to their high computational complexity (Cohen and Welling, 2016; Finzi et al., 2020). A widely used alternative consists of modifying the data rather than the model, that is, augmenting the dataset by applying transformations to samples in order to induce the desired symmetries or invariances during training. Data augmentation, as it is commonly known, is used to train virtually all state-of-the-art models in a variety of domains (Shorten and Khoshgoftaar, 2019). This empirical success is supported by theoretical results showing that, when the underlying data distribution is invariant to the applied transformations, data augmentation provides a better estimate of the statistical risk (Chen et al., 2019; Sannai et al., 2019; Lyle et al., 2020; Shao et al., 2022). On the other hand, applying the wrong transformations can introduce biases that may outweigh these benefits (Chen et al., 2019; Shao et al., 2022). Choosing which transformations to apply, when to do so, and how often is thus paramount to achieving good results. However, it requires knowledge about the underlying distribution of the data that is typically unavailable.
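To make the basic mechanism concrete, standard data augmentation draws a random transformation per sample at each training step. The sketch below is a minimal illustration only, not the method proposed in this work; the flip and shift transformations, their sampling probabilities, and the toy input are all arbitrary placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    """Apply a randomly sampled transformation to an image x (H x W array)."""
    if rng.random() < 0.5:        # horizontal flip with probability 1/2
        x = x[:, ::-1]
    shift = rng.integers(-2, 3)   # small random horizontal translation
    x = np.roll(x, shift, axis=1)
    return x

x = np.arange(16.0).reshape(4, 4)
x_aug = augment(x, rng)
print(x_aug.shape)  # same shape as the input: (4, 4)
```

During training, each minibatch would be passed through `augment` before computing the loss, so the model sees a different transformed version of each sample at every epoch.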
Several approaches exist for learning an augmentation policy or a distribution over a fixed set of transformations, including reinforcement learning (Cubuk et al., 2018), genetic algorithms (Ho et al., 2019), density matching (Lim et al., 2019; Cubuk et al., 2020; Hataya et al., 2020), gradient matching (Zheng et al., 2022), bi-level optimization (Li et al., 2020b; Liu et al., 2021), joint optimization over transformations using regularized objectives (Benton et al., 2020), variational Bayesian inference (Chatzipantazis et al., 2021), Bayesian model selection (Immer et al., 2022), and alignment regularization (Wang et al., 2022). Optimization-based methods often require computing gradients with respect to the transformations (Chatzipantazis et al., 2021; Li et al., 2020b). Moreover, several methods resort to computationally intensive search phases, the optimization of auxiliary models, or additional data, while failing to outperform fixed, user-defined augmentation distributions (Müller and Hutter, 2021).

In this work, we formulate data augmentation as an invariance-constrained learning problem. That is, we specify a set of transformations and a desired level of invariance, and recover an augmentation distribution that enables imposing this requirement on the learned model, without explicitly parametrizing the distribution over transformations. In addition, the constrained learning formulation mitigates the potential biases introduced by data augmentation without doing away with its potential benefits. More specifically, we rely on an approximate notion of invariance that is weighted by the probability of each data point. Hence, we require the output of our model to be stable only on the support of the underlying data distribution, and more so on common samples. By imposing this requirement as a constraint on the learning task and leveraging recent duality results, the amount of data augmentation can be automatically adjusted during training.
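The general principle behind adjusting a requirement via duality can be illustrated on a toy scalar problem. The following is a hedged sketch of a stochastic primal-dual scheme for a generic constrained problem min_θ R(θ) s.t. C(θ) ≤ ε, where the dual variable grows while the constraint is violated and shrinks otherwise; the objective, constraint, and step sizes are all illustrative placeholders, not the paper's actual invariance constraint:

```python
import numpy as np

def primal_dual(grad_R, grad_C, C, eps, theta, steps=500,
                lr_theta=0.1, lr_lam=0.1):
    """Alternate gradient descent on the Lagrangian L = R + lam * (C - eps)
    with projected gradient ascent on the dual variable lam >= 0."""
    lam = 0.0
    for _ in range(steps):
        # primal step: descend the Lagrangian in theta
        theta = theta - lr_theta * (grad_R(theta) + lam * grad_C(theta))
        # dual step: lam increases while the constraint C(theta) <= eps is violated
        lam = max(0.0, lam + lr_lam * (C(theta) - eps))
    return theta, lam

# toy problem: minimize theta^2 subject to (theta - 1)^2 <= 0.25,
# whose solution sits on the constraint boundary at theta = 0.5
theta, lam = primal_dual(
    grad_R=lambda t: 2 * t,
    grad_C=lambda t: 2 * (t - 1),
    C=lambda t: (t - 1) ** 2,
    eps=0.25,
    theta=0.0,
)
print(theta, lam)
```

In the paper's setting, the dual variable plays the role of the automatically tuned weight on the augmented (invariance) term, which is how the amount of data augmentation adapts during training.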
We propose an algorithm that combines stochastic primal-dual methods with MCMC sampling, doing away with the need for transformations to be differentiable. Our experiments show that it achieves state-of-the-art results on automatic data augmentation benchmarks for the CIFAR datasets.
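The reason MCMC removes the differentiability requirement is that Metropolis-Hastings only needs to evaluate an unnormalized target density, never its gradient. Below is a generic random-walk Metropolis-Hastings sketch over a scalar transformation parameter (e.g., a rotation angle); the Gaussian-shaped target density is a placeholder, not the loss-dependent distribution used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, n_samples, step=0.5):
    """Sample from an unnormalized density exp(log_p) using a symmetric
    Gaussian random-walk proposal; log_p need not be differentiable."""
    samples, x = [], x0
    lp = log_p(x)
    for _ in range(n_samples):
        x_new = x + step * rng.normal()
        lp_new = log_p(x_new)
        # accept with probability min(1, p(x_new) / p(x))
        if np.log(rng.random()) < lp_new - lp:
            x, lp = x_new, lp_new
        samples.append(x)
    return np.array(samples)

# placeholder target: transformation parameters concentrated around 0
samples = metropolis_hastings(lambda a: -0.5 * (a / 0.3) ** 2, 0.0, 5000)
print(samples.mean(), samples.std())
```

Only evaluations of `log_p` are required, so the same routine applies when the target density involves non-differentiable transformations such as discrete crops or flips.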

2. DATA AUGMENTATION IN SUPERVISED LEARNING

As in the standard supervised learning setting, let x ∈ X ⊆ ℝ^d denote a feature vector and y ∈ Y ⊆ ℝ its associated label or measurement. For classification tasks, we take Y ⊆ ℕ. Let D denote a probability distribution over the data pairs (x, y) and ℓ : Y × Y → ℝ₊ be a non-negative, convex loss function, e.g., the cross-entropy loss. Our goal is to learn a predictor f_θ : X → Y in some hypothesis class H_θ = {f_θ | θ ∈ Θ ⊆ ℝ^p} that minimizes the expected loss, namely

    minimize_{θ ∈ Θ}  R(f_θ) := E_{(x,y)∼D}[ ℓ(f_θ(x), y) ].                         (SRM)

We consider the distribution D to be unknown, except for the dataset {(x_i, y_i), i = 1, …, n} of n i.i.d. samples from D. Therefore, we rely on the empirical approximation of the objective of (SRM), explicitly

    R̂(f_θ) := (1/n) Σ_{i=1}^{n} ℓ(f_θ(x_i), y_i).                                   (1)

One of the aims of data augmentation is to improve the approximation R̂ of the statistical risk R when dealing with a dataset that is not sufficiently representative of the data distribution. To do so, we consider transformations of the feature vector g : X → X, taken from the (possibly infinite) transformation set G. Common examples include rotations and translations in images. Data augmentation leverages these transformations to generate new data pairs (gx, y) by sampling transformations according to a probability distribution 𝒢 over G, leading to the learning problem

    minimize_{θ ∈ Θ}  R̂_aug(f_θ) := (1/n) Σ_{i=1}^{n} E_{g∼𝒢}[ ℓ(f_θ(gx_i), y_i) ].  (2)

Note that the empirical risk approximation R̂ in (1) can be interpreted as an approximation of the data distribution D by a discrete distribution that places atoms on each data point. In that sense, R̂_aug in (2) can be thought of as the Vicinal Risk Minimization (Chapelle et al., 2000) counterpart of (1), in which the atoms on x_i are replaced by a local distribution over the transformed samples gx_i, i.e.,

    R̂_aug(f_θ) = (1/n) Σ_{i=1}^{n} ∫ ℓ(f_θ(gx_i), y_i) dP(gx_i),                    (3)

where the distribution P over X is induced by the distribution 𝒢 over G. As can be seen from (3), if 𝒢 is not chosen adequately, R̂_aug can be a poor estimate of R, introducing biases that outweigh its benefits.
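To make the empirical risk (1) and the augmented risk (2) concrete, the following sketch estimates both for a toy one-parameter predictor, approximating the inner expectation over transformations by Monte Carlo sampling. The squared loss, linear predictor, and the shift-based transformation distribution are all placeholder choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(pred, y):
    return (pred - y) ** 2   # squared loss as a stand-in for ell

def f(theta, x):
    return theta * x         # toy linear predictor f_theta

X = rng.normal(size=20)      # n = 20 i.i.d. feature samples
Y = 2.0 * X                  # labels from a noiseless linear model
theta = 1.5

# empirical risk, eq. (1): average loss over the n data points
R_emp = np.mean([loss(f(theta, x), y) for x, y in zip(X, Y)])

# augmented risk, eq. (2): for each sample, also average over random
# transformations g ~ G, here small additive shifts g x = x + s
def R_aug(theta, n_mc=200):
    total = 0.0
    for x, y in zip(X, Y):
        shifts = rng.uniform(-0.1, 0.1, size=n_mc)
        total += np.mean(loss(f(theta, x + shifts), y))
    return total / len(X)

r_aug_val = R_aug(theta)
print(R_emp, r_aug_val)
```

With mild transformations like these small shifts, the two risks are close; picking an ill-suited transformation distribution would drive them apart, which is precisely the bias that an inadequate 𝒢 introduces in (3).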

