INSTANCE-SPECIFIC AUGMENTATION: CAPTURING LOCAL INVARIANCES

Abstract

We introduce InstaAug, a method for automatically learning input-specific augmentations from data. Previous data augmentation methods have generally assumed independence between the original input and the transformation applied to that input. This can be highly restrictive, as the invariances that the augmentations are based on are themselves often highly input dependent; e.g., we can change a leaf from green to yellow while maintaining its label, but not a lime. InstaAug instead allows for input dependency by introducing an invariance module that maps inputs to tailored transformation distributions. It can be simultaneously trained alongside the downstream model in a fully end-to-end manner, or separately learned for a pre-trained model. We empirically demonstrate that InstaAug learns meaningful input-dependent augmentations for a wide range of transformation classes, which in turn provides better performance on both supervised and self-supervised tasks.

1. INTRODUCTION

Data augmentation is an important tool in deep learning (Shorten & Khoshgoftaar, 2019). It allows one to incorporate inductive biases and invariances into models (Chen et al., 2019; Lyle et al., 2020), providing a highly effective regularization technique that aids generalization (Goodfellow et al., 2016). It has proved particularly successful for computer vision tasks, forming an essential component of many modern supervised (Perez & Wang, 2017; Krizhevsky et al., 2012; Cubuk et al., 2020; Mikołajczyk & Grochowski, 2018) and self-supervised (Bachman et al., 2019; Chen et al., 2020; Tian et al., 2020; Foster et al., 2021) approaches.

Algorithmically, data augmentations apply a random transformation τ : X → X, τ ∼ p(τ), to each input data point x ∈ X before feeding this augmented data into the downstream model. These transformations are resampled each time the data point is used (e.g., at each training epoch), effectively populating the training set with additional samples. Augmentation is also sometimes used at test time by ensembling predictions from multiple transformations of the input.

A particular augmentation is defined by the choice of the transformation distribution p(τ), whose construction thus forms the key design choice. Good transformation distributions induce substantial and wide-ranging changes to the input, while preserving the information relevant to the task at hand. Data augmentation necessarily relies on exploiting problem-specific expertise: though aspects of p(τ) can be learned from data (Benton et al., 2020), trying to learn p(τ) from the set of all possible transformation distributions is not only unrealistic, but actively at odds with the core motivations of introducing inductive biases and capturing invariances. One therefore restricts τ to transformations that reflect how we desire our model to generalize, such as cropping and color jitter for image data.
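To make this wrapper view concrete, the following minimal sketch applies a global augmentation (a rotation angle sampled independently of the input, resampled on every use) before a toy downstream model. The setup and all names (`rotate`, `model`, `augmented_prediction`) are illustrative, not part of any described method.

```python
import math
import random

# Toy setup: inputs are 2-D points, and the "model" classifies by radius.
# A global augmentation samples a rotation angle tau ~ Uniform(-pi, pi),
# independently of the input, and applies it before the model sees the point.

def rotate(point, angle):
    """Apply the transformation tau (here a rotation) to an input x."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def model(point):
    """Downstream model f: label 1 if the point lies outside the unit circle."""
    x, y = point
    return 1 if x * x + y * y > 1.0 else 0

def augmented_prediction(point, rng):
    """Compute f(tau(x)), with tau resampled on every call, as during training."""
    tau = rng.uniform(-math.pi, math.pi)  # tau ~ p(tau), independent of x
    return model(rotate(point, tau))

rng = random.Random(0)
# Rotation preserves radius, so the augmented label never changes here:
preds = [augmented_prediction((2.0, 0.0), rng) for _ in range(5)]
print(preds)
```

Because this toy model is exactly rotation-invariant, every resampled transformation leaves its prediction unchanged; the interesting cases in the paper are precisely those where invariance holds only for some inputs or some transformation ranges.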
Current approaches (Cubuk et al., 2018; Lim et al., 2019; Benton et al., 2020) are generally limited to learning augmentations where the transformation is sampled independently of the input it is applied to, such that p(τ) has no dependence on x. This means that they are only able to learn global invariances, severely limiting their flexibility and potential impact. For example, when using color jittering, changing the color of a leaf from yellow to green would likely preserve its label, but the same transformation would change a lemon to a lime (see Figure 1b). This transformation cannot be usefully applied as a global augmentation, even though it is a useful invariance for the specific input instance of a leaf. Similar examples regularly occur for other transformations, as shown in Figure 1.

To address this shortfall, we introduce InstaAug, a new approach that allows one to learn instance-specific augmentations that encapsulate local invariances of the underlying data-generating process, that is, invariances specific to a particular region of the input space. InstaAug is based on using a transformation distribution of the form p(τ; ϕ(x)), where ϕ is a deep neural network that maps inputs to transformation distribution parameters. We refer to ϕ as an invariance module. It can be trained simultaneously with the downstream model in a fully end-to-end manner, or using a fixed pre-trained model. Both cases only require access to training data and a single objective function that minimizes the training error while maintaining augmentation diversity. As such, InstaAug allows one to directly learn powerful and general augmentations, without requiring access to additional annotations. We evaluate InstaAug in both supervised and self-supervised settings, focusing on image classification and contrastive learning respectively.
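The core idea of an input-dependent transformation distribution p(τ; ϕ(x)) can be sketched as follows. Here a hand-set linear map stands in for the deep network ϕ, and it outputs a per-input maximum rotation angle; the names `phi` and `sample_tau`, the weights, and the squashing choice are all illustrative assumptions, not the paper's actual architecture.

```python
import math
import random

# Minimal sketch of an invariance module phi: it maps an input feature vector
# to the parameter of an input-specific transformation distribution
# p(tau; phi(x)), here the half-width of a uniform rotation range.

def phi(x):
    """Map input features to a per-input max rotation angle in (0, pi)."""
    w = [0.5, -0.3]  # stand-in for learned network weights
    s = sum(wi * xi for wi, xi in zip(w, x))
    return math.pi / (1.0 + math.exp(-s))  # squash to (0, pi)

def sample_tau(x, rng):
    """Reparameterized sample tau = phi(x) * eps, with eps ~ Uniform(-1, 1).

    Keeping the noise eps separate from phi(x) is what lets gradients flow
    through the invariance module when it is trained end-to-end alongside
    the downstream model.
    """
    eps = rng.uniform(-1.0, 1.0)
    return phi(x) * eps

rng = random.Random(0)
# A '0'-like input could be assigned a wide rotation range ...
wide = phi([4.0, 0.0])
# ... while a '6'-like input gets a narrow one, since large rotations
# would change its label (cf. Figure 1a).
narrow = phi([-4.0, 0.0])
tau = sample_tau([4.0, 0.0], rng)
print(wide > narrow, abs(tau) <= wide)
```

The key contrast with the previous sketch is that the sampled transformation now depends on x through ϕ(x), so different inputs receive different augmentation distributions.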
Our experimental results show that InstaAug is able to uncover meaningful invariances that are consistent with human cognition, and improve model performance for various tasks compared with global augmentations. While we primarily focus on the case where the invariance module is trained alongside the downstream model (to allow data augmentation during training), we find that InstaAug can also provide substantial performance gains when used as a mechanism for learning test-time augmentations for large pre-trained models.

2. BACKGROUND

Data augmentation methods operate as a wrapper algorithm around some downstream model, f, randomly transforming the inputs x ∈ X before they are passed to the model. The outputs of the augmented model are given by f(τ(x)), where τ : X → X represents the transformation, sampled from some transformation distribution p(τ). The aim of this augmentation is to instil inductive biases into the learned model, leading to improved generalization by capturing invariances of the problem. It can be used both during training to provide additional synthetic training data, and/or at test time, where ensembling the predictions from multiple transformations can provide a useful regularization that often improves performance (Shanmugam et al., 2021).

Some approaches look to learn aspects of the augmentation (Cubuk et al., 2018; 2020; Lim et al., 2019; Ho et al., 2019; Hataya et al., 2020; Li et al., 2020; Zheng et al., 2022). These approaches can be viewed as learning parameters of p(τ), helping to automate its construction and tuning. Of particular relevance, Augerino (Benton et al., 2020) provides a mechanism for learning augmentations using a simple end-to-end training scheme, where the parameters of the downstream model and transformation distribution are learned simultaneously using the (empirical) risk minimization

min_{f,θ} E_{(x,y)∼p_data} E_{τ∼p_θ(τ)} [L(f(τ(x)), y)] + λR(θ),

where L is a loss function and λR(θ) is a regularization term that encourages large transformations.

All of these approaches can be thought of as global augmentation schemes, in that transformations are sampled independently of the input. For an unrestricted, universal class of transformations, this assumption can be justified through the noise outsourcing lemma (Kallenberg, 1997): any conditional distribution Y | X = x can be expressed as a deterministic function g : X × R^n → Y of the input and some independent noise ε ∼ N(0, I).
Thus, using reparameterization, the dependency on x can, in principle, be entirely dealt with by the transformation itself. However, in practice, the transformation class must be restricted to provide the desired inductive biases; this result then no longer holds, and the independence assumption can cause severe restrictions. For example, sampling rotations independently of the input is equivalent to the unrealistic assumption that the labels of all images x are invariant to the same range of angles (cf. Figure 1a).
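The Augerino-style objective above can be estimated by simple Monte Carlo, as in the following sketch. The transformation distribution p_θ is Uniform(-θ, θ) over rotation angles, and R(θ) = -θ is one simple illustrative choice of regularizer rewarding wider transformation ranges; the toy model, data, and all names are assumptions for illustration only.

```python
import math
import random

# Monte Carlo estimate of the global objective
#   E_{x,y} E_{tau ~ p_theta} [L(f(tau(x)), y)] + lam * R(theta),
# with p_theta = Uniform(-theta, theta) over rotation angles.

def rotate(point, angle):
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def loss(pred, label):
    return (pred - label) ** 2

def model(point):
    # Toy downstream model f: predicts the squared radius,
    # which happens to be exactly rotation-invariant.
    x, y = point
    return x * x + y * y

def objective(theta, data, lam, rng, n_samples=100):
    """Estimate the risk term by sampling, then add the regularizer."""
    risk = 0.0
    for x, y in data:
        for _ in range(n_samples):
            tau = rng.uniform(-theta, theta)  # tau sampled independently of x
            risk += loss(model(rotate(x, tau)), y)
    risk /= len(data) * n_samples
    return risk + lam * (-theta)  # R(theta) = -theta favours large ranges

rng = random.Random(0)
data = [((1.0, 0.0), 1.0), ((0.0, 2.0), 4.0)]  # labels = squared radius
# Because this f is rotation-invariant, the risk term stays near zero for any
# theta, so the objective strictly prefers the wider transformation range:
narrow = objective(0.1, data, lam=0.01, rng=rng)
wide = objective(math.pi, data, lam=0.01, rng=rng)
print(wide < narrow)
```

When the model is only invariant for some inputs, this global θ must shrink to the angles safe for every input at once, which is exactly the restriction that motivates the input-dependent parameters of InstaAug.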



Figure 1: Different inputs require different augmentations. In (a), the digit '0' is invariant to any rotation, but rotating the digit '6' by more than 90° makes it a '9'. In (b), a similar phenomenon is observed for color jittering applied to a leaf and a lemon/lime. The red dashed lines in (a) and (b) are boundaries between different classes. In (c), the same effect is shown for cropping. Solid rectangles represent patches that preserve the labels of the original images ([left] grass, [right] cattle), while dashed rectangles represent patches with labels different from the original images.

