INSTANCE-SPECIFIC AUGMENTATION: CAPTURING LOCAL INVARIANCES

Abstract

We introduce InstaAug, a method for automatically learning input-specific augmentations from data. Previous data augmentation methods have generally assumed independence between the original input and the transformation applied to that input. This can be highly restrictive, as the invariances that the augmentations are based on are themselves often highly input dependent; e.g., we can change a leaf from green to yellow while maintaining its label, but not a lime. InstaAug instead allows for input dependency by introducing an invariance module that maps inputs to tailored transformation distributions. It can be simultaneously trained alongside the downstream model in a fully end-to-end manner, or separately learned for a pre-trained model. We empirically demonstrate that InstaAug learns meaningful input-dependent augmentations for a wide range of transformation classes, which in turn provides better performance on both supervised and self-supervised tasks.

1. INTRODUCTION

Data augmentation is an important tool in deep learning (Shorten & Khoshgoftaar, 2019) . It allows one to incorporate inductive biases and invariances into models (Chen et al., 2019; Lyle et al., 2020) , providing a highly effective regularization technique that aids generalization (Goodfellow et al., 2016) . It has proved particularly successful for computer vision tasks, forming an essential component of many modern supervised (Perez & Wang, 2017; Krizhevsky et al., 2012; Cubuk et al., 2020; Mikołajczyk & Grochowski, 2018) and self-supervised (Bachman et al., 2019; Chen et al., 2020; Tian et al., 2020; Foster et al., 2021) approaches. Algorithmically, data augmentations apply a random transformation τ : X → X , τ ∼ p(τ ), to each input data point x ∈ X , before feeding this augmented data into the downstream model. These transformations are resampled each time the data point is used (e.g. at each training epoch), effectively populating the training set with additional samples. Augmentation is also sometimes used at test time by ensembling predictions from multiple transformations of the input. A particular augmentation is defined by the choice of the transformation distribution p(τ ), whose construction thus forms the key design choice. Good transformation distributions induce substantial and wide-ranging changes to the input, while preserving the information relevant to the task at hand. Data augmentation necessarily relies on exploiting problem-specific expertise: though aspects of p(τ ) can be learned from data (Benton et al., 2020) , trying to learn p(τ ) from the set of all possible transformation distributions is not only unrealistic, but actively at odds with the core motivations of introducing inductive biases and capturing invariances. One, therefore, restricts τ to transformations that reflect how we desire our model to generalize, such as cropping and color jitter for image data. 
Current approaches (Cubuk et al., 2018; Lim et al., 2019; Benton et al., 2020) are generally limited to learning augmentations where the transformation is sampled independently of the input it is applied to, such that p(τ) has no dependence on x. This means that they are only able to learn global invariances, severely limiting their flexibility and potential impact. For example, when using color jittering, changing the color of a leaf from yellow to green would likely preserve its label, but the same transformation would change a lemon to a lime (see Figure 1b). This transformation cannot be usefully applied as a global augmentation, even though it is a useful invariance for the specific input instance of a leaf. Similar examples regularly occur for other transformations, as shown in Figure 1. To address this shortfall, we introduce InstaAug, a new approach that allows one to learn instance-specific augmentations that encapsulate local invariances of the underlying data generating process, that is, invariances specific to a particular region of the input space. InstaAug is based on using a transformation distribution of the form p(τ; ϕ(x)), where ϕ is a deep neural network that maps inputs to transformation distribution parameters. We refer to ϕ as an invariance module. It can be trained simultaneously with the downstream model in a fully end-to-end manner, or using a fixed pre-trained model. Both cases only require access to training data and a single objective function that minimizes the training error while maintaining augmentation diversity. As such, InstaAug allows one to directly learn powerful and general augmentations, without requiring access to additional annotations. We evaluate InstaAug in both supervised and self-supervised settings, focusing on image classification and contrastive learning respectively.
Our experimental results show that InstaAug is able to uncover meaningful invariances that are consistent with human cognition, and improve model performance for various tasks compared with global augmentations. While we primarily focus on the case where the invariance module is trained alongside the downstream model (to allow data augmentation during training), we find that InstaAug can also provide substantial performance gains when used as a mechanism for learning test-time augmentations for large pre-trained models.

2. BACKGROUND

Data augmentation methods operate as a wrapper algorithm around some downstream model, f, randomly transforming the inputs x ∈ X before they are passed to the model. The outputs of the augmented model are given by f(τ(x)), where τ : X → X represents the transformation, sampled from some transformation distribution p(τ). The aim of this augmentation is to instil inductive biases into the learned model, leading to improved generalization by capturing invariances of the problem. It can be used during training to provide additional synthetic training data, and/or at test time, where ensembling the predictions from multiple transformations can provide a useful regularization that often improves performance (Shanmugam et al., 2021). Some approaches look to learn aspects of the augmentation (Cubuk et al., 2018; 2020; Lim et al., 2019; Ho et al., 2019; Hataya et al., 2020; Li et al., 2020; Zheng et al., 2022). These approaches can be viewed as learning parameters of p(τ), helping to automate its construction and tuning. Of particular relevance, Augerino (Benton et al., 2020) provides a mechanism for learning augmentations using a simple end-to-end training scheme, where the parameters of the downstream model and transformation distribution are learned simultaneously using the (empirical) risk minimization

$$\min_{f,\theta} \; \mathbb{E}_{x,y \sim p_{\text{data}}} \, \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ L(f(\tau(x)), y) \right] + \lambda R(\theta),$$

where L is a loss function and λR(θ) is a regularization term that encourages large transformations. All of these approaches can be thought of as global augmentation schemes, in that transformations are sampled independently of the input. For an unrestricted, universal, class of transformations, this assumption can be justified through the noise outsourcing lemma (Kallenberg & Kallenberg, 1997): any conditional distribution Y | X = x can be expressed as a deterministic function g : X × R^n → Y of the input and some independent noise ε ∼ N(0, I).
Thus, using reparameterization, the dependency on x can, in principle, be entirely dealt with by the transformation itself. However, in practice, the transformation class must be restricted to provide the desired inductive biases, meaning this result no longer holds and so the independence assumption can cause severe restrictions. For example, sampling rotations independently of the input is equivalent to the unrealistic assumption that the labels of all images x are invariant to the same range of angles (cf. Figure 1a).

3. INSTANCE-SPECIFIC AUGMENTATION: CAPTURING LOCAL INVARIANCES

In order to remedy the problems of global augmentations, we propose InstaAug. InstaAug learns an input-dependent distribution p(τ; ϕ(x)) of information-preserving transformations that actively makes use of the input x via the invariance module ϕ, as opposed to learning a global transformation distribution $p_\theta(\tau)$. This generalizes the hypothesis class of transformation distributions, and significantly increases the flexibility and expressivity of the augmentations we can learn, without undermining our ability to carefully control the inductive biases that are imparted. It can also informally be viewed as a mechanism for learning invariances which are local to the specific input. We argue that a good augmentation strategy needs to fulfill two properties. First, the transformations should preserve the information in x that is necessary for the task at hand. For example, in classification, transformations must preserve sufficient information to correctly classify τ(x). Second, the set of transformations needs to have sufficient 'diversity' to effectively augment the data; we quantify this as the entropy of the transformation distribution p(τ; ϕ(x)). In addition to their intuitive nature, in Appendix A we provide theoretical analysis that shows these requirements naturally originate from a decomposition of the generalization error that results from using ϕ when training f. For simplicity, we describe InstaAug on the task of classification in the remainder of this section. InstaAug is based around using a simple plug-in invariance module, ϕ, between the input x and the classifier f, as shown in Figure 2. We assume a parametric family of distributions p(τ; •) over some transformation space, then use ϕ, which is a trainable neural network, to predict its parameters for a given input. During training, we sample a transformation τ ∼ p(τ; ϕ(x)), which is applied to x to generate an augmented sample τ(x), before feeding this into the classifier f.
Good augmentations should induce substantial changes to the input x while providing all necessary information of the task at hand, thereby capturing the maximum possible invariance. Figure 3a illustrates the tension between these two objectives experienced by global augmentation schemes. Wider-ranging transformations are generally beneficial for generalization, but 'excessive' transformations will generate samples that will be incorrectly classified. In Figure 3a we see this in the red area, where the augmentations for a pair of data points have started to overlap, creating ambiguity and inevitably misclassifications. Using instance-specific augmentations (Figure 3b ) allows for a better trade-off of these needs. However, to achieve this we need our objective to encourage diversity in augmentations, not just low training error. It should also let the level of diversity vary between inputs, as some points will be able to support larger transformations than others.

3.1. MODEL STRUCTURE

Based on these needs, training is done by simultaneously minimizing a conventional expected loss with respect to both ϕ and f (or just ϕ if f is a fixed pre-trained classifier as per Section 5.3), while placing a hard constraint on the average entropy of the transformations, $\mathbb{E}_{x \sim p_{\text{data}}}[H[p(\tau; \phi(x))]]$. The core motivation for this setup is that minimizing the expected loss will naturally encourage the information needed for prediction to be preserved, but the constraints on the entropy are needed to enforce diversity. Further motivation is provided by the theoretical analysis of Appendix A. By appropriately parameterizing p(τ; ϕ(x)) (see Section 3.3), we can write down its entropy in closed form. We can then formulate the problem as the following constrained optimization problem:

$$\min_{f,\phi} \; \mathbb{E}_{x,y \sim p_{\text{data}}} \, \mathbb{E}_{\tau \sim p(\tau;\phi(x))} \left[ L(f(\tau(x)), y) \right], \quad \text{s.t.} \quad \mathbb{E}_{x,y \sim p_{\text{data}}} \left[ H[p(\tau;\phi(x))] \right] \in [H_{\min}, H_{\max}],$$

where L is the loss, for which we will generally use the cross-entropy. Here the lower bound $H_{\min}$ enforces the desired diversity. We typically expect this constraint to be active at the true optimal solution, so $H_{\min}$ can be thought of as a hyperparameter that controls the desired level of diversity. The upper bound $H_{\max}$ prevents the entropy of p(τ; ϕ(x)) from exploding at the start of training when the classifier is weak: without this we empirically find that the augmented samples from different classes tend to overlap in the initial phase of training, hindering the training of f. The Lagrangian function,

$$\mathbb{E}_{x,y \sim p_{\text{data}}} \, \mathbb{E}_{\tau \sim p(\tau;\phi(x))} \left[ L(f(\tau(x)), y) \right] - \lambda \, \mathbb{E}_{x,y \sim p_{\text{data}}} \left[ H[p(\tau;\phi(x))] \right],$$

can be used to solve this constrained optimization, where λ is the Lagrangian multiplier. In practice, we initialize λ with a small positive value, and increase (decrease) λ when the average entropy drops below $H_{\min}$ (exceeds $H_{\max}$).
The invariance module and downstream model can thus be trained simultaneously using end-to-end gradient descent, utilizing the reparameterization trick to deal with the stochasticity of τ when possible (Kingma & Welling, 2014), and the REINFORCE estimator (Williams, 1992) otherwise. The approach can also be extended to regression or self-supervised learning by substituting the loss function L (cf. Appendix C).
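The multiplier schedule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the additive update, the step size, and the toy entropy values are all assumptions introduced here. The text only states that λ starts small and positive and is increased (decreased) when the average entropy falls below $H_{\min}$ (exceeds $H_{\max}$).

```python
def lagrangian(mean_loss, avg_entropy, lam):
    """Training objective: expected loss minus lam times the average entropy."""
    return mean_loss - lam * avg_entropy


def update_multiplier(lam, avg_entropy, h_min, h_max, step=0.01):
    """Adjust lam after measuring the average entropy on a batch.

    The additive rule and the step size are assumptions for illustration.
    """
    if avg_entropy < h_min:
        return lam + step  # too little diversity: reward entropy more strongly
    if avg_entropy > h_max:
        return lam - step  # too much diversity: reward entropy less strongly
    return lam             # inside [h_min, h_max]: leave lam unchanged


if __name__ == "__main__":
    lam = 0.05  # small positive initial value
    for avg_entropy in [0.5, 0.8, 2.0, 3.5]:  # hypothetical per-batch averages
        lam = update_multiplier(lam, avg_entropy, h_min=1.0, h_max=3.0)
        print(round(lam, 4), round(lagrangian(1.2, avg_entropy, lam), 4))
```

In a full training loop, the value returned by `lagrangian` would be the quantity differentiated with respect to f and ϕ, while `update_multiplier` would run between gradient steps.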

3.3. PARAMETERIZATION OF AUGMENTATIONS

We focus on parameterizing transformations that are frequently used in computer vision, though our framework can easily be extended to other domains. Due to the varied characteristics of different image transformations, we design two different parameterization methods for p(τ; ϕ(x)).

Uniform parameterization. For rotation and color jittering, we find that a uniform distribution is suitable for parameterizing p(τ; ϕ(x)), such that ϕ(x) returns a pair $(\theta_{\min}, \theta_{\max})$ representing extrema of the possible transformations. For example, for rotations these represent the minimum and maximum rotation angles, such that τ(x) = R(θ) • x, where $\theta \sim U(\theta_{\min}, \theta_{\max})$. To compose multiple transformations (such as hue, saturation and brightness in color jittering), we simply sample them independently, such that

$$p(\tau_1, \ldots, \tau_K; \phi(x)) = \prod_{k=1}^{K} p(\tau_k; \phi_k(x)).$$

This provides a similar parameterization to Benton et al. (2020), but where $(\theta_{\min}, \theta_{\max})$ now critically varies with the input x and there is no symmetry assumption on the transformation ranges.

Location-related parameterization. Using this uniform parameterization is unfortunately not appropriate for cropping. Firstly, the distribution on crop centers may be multi-modal, since important information may exist in different parts of an image. Secondly, the desired crop size and center are often highly correlated so cannot be sampled independently. Finally, we encountered significant practical training issues when using the uniform parameterization for cropping, with ϕ often becoming trapped in local optima with little transformation diversity. We therefore propose an alternative location-related parameterization (LRP) for cropping, which is based on defining a large fixed set of allowable crops then constructing ϕ to map from inputs to a vector of probabilities over this set. As shown in Figure 4, this is achieved using a CNN where each hidden unit corresponds to one possible crop.
The units from all layers are utilized, with those of earlier layers representing smaller crops. This parametrization proved more effective than simply outputting the probabilities from a conventional network, due to the greater parameter sharing between related crops. We note that it can also be directly extended to other transformations, such as masking, local blurring, pixel-wise perturbation, and local color jittering.
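Both parameterizations admit simple closed-form entropies, which the following sketch makes concrete. It is an illustration under stated assumptions, not the paper's code: the function names, the example ranges and scores, and the use of a softmax to obtain crop probabilities are hypothetical choices made here. The entropy formulas themselves are the standard ones for a product of independent uniforms and for a categorical distribution.

```python
import math
import random


def sample_uniform(ranges, rng):
    """Draw one parameter per transformation from U(low, high), independently."""
    return [rng.uniform(low, high) for low, high in ranges]


def uniform_entropy(ranges):
    """Entropy of independent uniforms: the sum of log(high - low)."""
    return sum(math.log(high - low) for low, high in ranges)


def softmax(scores):
    """Turn one score per candidate crop into a probability vector (assumed)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_entropy(probs):
    """Entropy of a distribution over a fixed, finite set of crops."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical phi(x) output: ranges for hue, saturation and brightness.
    ranges = [(-0.1, 0.3), (0.8, 1.2), (0.9, 1.5)]
    print(sample_uniform(ranges, rng), uniform_entropy(ranges))
    # Hypothetical scores for four candidate crops.
    probs = softmax([2.0, 0.5, 0.5, -1.0])
    print(probs, categorical_entropy(probs))
```

Widening any interval raises the uniform entropy, and spreading probability over more crops raises the categorical entropy, which is how an entropy constraint translates into augmentation diversity in either case.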

3.4. TEST-TIME AUGMENTATION

Besides augmenting data during training, the learned invariance can also be applied to test-time augmentation. Given a test image x, we sample n different transformations $\tau_i$ from p(τ; ϕ(x)) and apply them to x to generate n different views $\tau_i(x)$. After feeding these views to the classifier, f, we use the mean logit $\frac{1}{n} \sum_{i=1}^{n} f(\tau_i(x))$ to predict x's label. When only learning invariance for test-time augmentation, InstaAug can be trained with a fixed pre-trained classifier at a lower computation cost.
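The test-time procedure amounts to averaging logits over sampled views, as in the sketch below. The scalar "images", the offset transformations, and the two-class toy classifier are stand-ins invented for illustration; only the averaging step mirrors the description above.

```python
import random


def predict_with_tta(x, classifier, sample_transform, n, rng):
    """Average the classifier's logits over n sampled views of x."""
    totals = None
    for _ in range(n):
        tau = sample_transform(x, rng)  # stands in for tau ~ p(tau; phi(x))
        logits = classifier(tau(x))
        if totals is None:
            totals = [0.0] * len(logits)
        totals = [t + v for t, v in zip(totals, logits)]
    mean_logits = [t / n for t in totals]
    label = max(range(len(mean_logits)), key=lambda k: mean_logits[k])
    return label, mean_logits


def toy_sample_transform(x, rng):
    """Hypothetical input-dependent sampler: larger inputs allow larger shifts."""
    half_width = 0.1 * abs(x)
    offset = rng.uniform(-half_width, half_width)
    return lambda v: v + offset


def toy_classifier(v):
    """Hypothetical two-class model returning one logit per class."""
    return [-v, v]


if __name__ == "__main__":
    rng = random.Random(0)
    print(predict_with_tta(2.0, toy_classifier, toy_sample_transform, 50, rng))
```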

4. RELATED WORK

Hard-coded invariance. Much recent work has been devoted to hard-coding global invariance in neural networks. For example, various models have been designed to be invariant to translation (Chaman & Dokmanic, 2021; Zhang, 2019), rotation (Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017), scaling (Worrall & Welling, 2019; Sosnovik et al., 2019) or other group actions (Cohen & Welling, 2016; Xu et al., 2021). Unfortunately, they require the set of invariant transformations to be closed under composition, leaving out many practical transformations that do not form a group.

Learning augmentations. There have been numerous prior works that automatically learn global augmentations and invariance from data. As discussed in Section 2, Augerino (Benton et al., 2020) is perhaps the most closely linked such approach to InstaAug as it also relies on end-to-end training (see Appendix B for further discussion on its similarities and differences to InstaAug). AutoAugment (Cubuk et al., 2018) instead uses reinforcement learning to find augmentation strategies that increase accuracy on a separate validation set. Various follow-up works have improved its efficiency and/or performance (Lim et al., 2019; Ho et al., 2019; Hataya et al., 2020; Li et al., 2020; Cubuk et al., 2020; Tang et al., 2019; Zheng et al., 2022). A small number of works have further looked to learn augmentation policies that have some dependency on the input or just the class label (Zhou et al., 2021; Cheung & Yeung, 2022). These approaches focus on choosing which type(s) of transformation to apply from a fixed list (e.g. choosing from crop, blur, or color jitter), which may include a small number of discrete options for transformation strength. By comparison, InstaAug keeps the type of transformation fixed and learns instance-specific parameters for the transformation distribution, such as the positions and sizes of patches for cropping.
In principle, it should be possible to combine these complementary approaches with InstaAug, though we note they require a separate validation dataset and cannot be used in unsupervised settings, unlike InstaAug.

Other related work. The spatial transformer (Jaderberg et al., 2015) aims to learn instance-specific transformations, but only applies a single transformation to each input rather than a distribution of transformations, making it distinct from data augmentation. Luo et al. (2020) and Kim et al. (2020) both also learn instance-specific augmentations. However, the latter consider only test-time augmentation, while the former introduce an approach that is highly specialized to text recognition and cannot be applied in the more general settings we consider. Tamkin et al. (2020) and Chen et al. (2021) both utilize adversarial augmentations to increase robustness. Zhou et al. (2020) learn symmetries shared across several datasets through a meta-learning scheme.

5.1. ROTATED 2D IMAGES

We first consider a toy synthetic dataset proposed in Benton et al. (2020). The dataset contains four categories: (1) upright Mario; (2) upside-down Mario; (3) upright Iggy; and (4) upside-down Iggy. Each of the four base images is randomly rotated by an angle in the interval [-π/4, π/4] to form the training dataset. The task is to predict the correct character (Mario vs Iggy) and the orientation (up vs down). We assess whether InstaAug is able to learn the 'best' rotation range for each sample, i.e. the maximum range that prevents the 'up' and 'down' classes from overlapping. Figure 5 shows that InstaAug effectively recovers the broadest range of rotations for each image while preserving labels, while Augerino only learns a subset of these ranges. This can be most easily seen by the fact that the transformation distributions (shown in green) always extend to very close to the true class boundary for InstaAug, but not for Augerino. These gains arise because Augerino learns a single global augmentation distribution shared across all images (note the shared transformation distribution arcs), which is inevitably limited for any given input.
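A small calculation illustrates why a single shared rotation range is necessarily narrower than the per-sample ranges. The sketch assumes, purely for illustration, that an upright image keeps its 'up' label as long as its absolute orientation stays within ±π/2 of upright; the text refers only to "the true class boundary", so this threshold and the function names are assumptions.

```python
import math

BOUNDARY = math.pi / 2  # assumed orientation at which 'up' turns into 'down'


def instance_range(angle):
    """Largest rotation interval that keeps an upright sample at `angle` upright."""
    return (-BOUNDARY - angle, BOUNDARY - angle)


def global_range(angles):
    """Largest single interval that is label-preserving for every sample at once."""
    lows, highs = zip(*(instance_range(a) for a in angles))
    return (max(lows), min(highs))


if __name__ == "__main__":
    # Training samples are rotated somewhere in [-pi/4, pi/4].
    angles = [-math.pi / 4, 0.0, math.pi / 4]
    for a in angles:
        low, high = instance_range(a)
        print("instance width:", high - low)
    low, high = global_range(angles)
    print("global width:", high - low)
```

Under this assumption every sample supports an interval of width π, whereas the interval that is safe for all samples simultaneously has width π/2, which matches the qualitative gap between InstaAug and Augerino described above.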

5.2. CROPPING

We now move to more realistic images and to the most common and effective form of image augmentation: cropping. We first evaluate the performance of jointly training InstaAug and the classifier on Tiny-ImageNet (TinyIN, 64 × 64), as it inherits the image complexity of ImageNet whilst being within our computational budget. TinyIN is a standard testbed for data augmentations that contains 100k images divided into 200 classes. Full experiment details are given in Appendix D.1. We benchmark InstaAug alongside several augmentation baselines, including Augerino, no augmentation, and random crops (random augmentation). The latter uniformly samples patch sizes and then randomly selects a patch inside the image. Since the effect of cropping crucially relies on scales of patches, we carefully tune this baseline by sweeping over all possible scale intervals between [0, 1] with a stride of 0.1. We further compare to other prior works that have obtained competitive results on TinyIN (Ramé et al., 2021; Yun et al., 2019; Zhang et al., 2018). Other learnable augmentation methods are actually learning the size ranges of cropping. We leave their results out because we are already performing this learning in the random cropping results through our hyperparameter tuning.

Table 1: InstaAug improves generalization on Tiny-ImageNet by learning instance-specific cropping. 'Instance' and 'LRP' refer respectively to 'instance-specific' and 'location-related parameterization'. Statistics are computed over 10 runs, except for MixMo, CutMix and Mixup, whose results are from Ramé et al. (2021).
In order to ablate the effects of input dependency and location-related parameterization on InstaAug, we additionally assess the performance of InstaAug (without LRP), which relies on the same uniform parameterization as Augerino rather than our location-related parameterization (LRP, described in Figure 4); InstaAug (without input), which uses the LRP and general InstaAug setup, but shares the transformation distribution across all inputs rather than learning an input-specific augmentation; and InstaAug (class specific), which takes training labels instead of images as inputs. Test-time augmentation using 50 transformation samples is deployed for all variants of InstaAug, along with the Augerino and random augmentation baselines. For InstaAug (class specific), this test-time augmentation is based on random cropping, because class information is not available at test time and this approach performs better than simply omitting test-time augmentation. Following prior works, we choose the PreActResNet-18 architecture (He et al., 2016b) with width = 1 as the classifier for all methods. Table 1 shows the top-1 accuracy for each method. In agreement with prior works, we find that random cropping increases top-1 accuracy by 9.4% over no augmentation, which is achieved with a cropping scale of [0.1, 1]. InstaAug outperforms random cropping and its own global version without input by 1.5% and 2.8% respectively, highlighting the effect of learning instance-specific augmentation. Allowing only for class dependence actually produces even worse performance than just ignoring the input completely, presumably because of the inevitable resulting mismatch in the augmentations used in training and testing. Methods with mean-field uniform parameterization (including Augerino and InstaAug without LRP) performed extremely poorly, noticeably worse than just random cropping.
This is because they were found to become easily stuck at local minima with low cropping diversity, leading to similar performance as no augmentation. Note that the potentially unexpectedly good performance of the random cropping baseline compared to the other global baselines stems from the careful hyperparameter sweep used to tune its crop size, which proved more effective than these more direct training mechanisms. See Appendix E.3 for more discussion. Figure 6 shows example crops and learned transformation distributions for InstaAug and a global augmentation scheme (InstaAug without input). We see that InstaAug is able to learn a cropping scheme that focuses on the key aspect of the input image, while the baselines cannot.

5.3. APPLYING INSTAAUG TO A FIXED CLASSIFIER

InstaAug can also be used to learn suitable augmentations for a fixed pre-trained classifier. This can most notably be useful as a means to learn test-time augmentations. As the invariance module is itself only a small network, it can be done relatively cheaply, even when the dataset and downstream model are very large. We exploit this on the larger ImageNet dataset (224 × 224) (Deng et al., 2009), again focusing on cropping augmentations and utilizing the LRP parameterization from Section 3.3. Training the invariance module in this setting is done in exactly the same way as elsewhere, using the training procedure of Section 3.2 with the normal training data. The only change is that f is now fixed to a pre-trained classifier (specifically, the ResNet-50 (He et al., 2016a) from Wightman (2019), which did not use an invariance module during training) rather than being simultaneously learned. We are thus simply learning invariances, without affecting the training of f. In Table 2 we show the effect of using the learned invariance module for test-time augmentation, finding that it is able to noticeably improve accuracy, unlike the baseline test-time augmentations of random cropping, AutoAugment (Cubuk et al., 2018), and Fast AutoAugment (Lim et al., 2019). In order to evaluate the generalization performance of our learned augmentation module, we further apply the augmentation trained on ResNet-50 to two different models with zero finetuning: ResNet-18 (He et al., 2016a) and XCiT (Ali et al., 2021). We find that the learned augmentation transfers very effectively to these different models, which implies that the local invariances InstaAug learns reflect the natural invariances of the underlying classification problem, rather than being specific to the model that was used to train the augmentation module.

5.4. COLOR JITTERING ON TEXTURES

Color jittering is another important type of data augmentation, which can help models generalize to different lighting conditions. We benchmark on the texture classification dataset RawFooT (Bianco et al., 2017). We first train on a single lighting condition D45 (4500K, daylight) resembling natural light. Table 3 shows that InstaAug outperforms all baselines with and without test-time augmentation. In this task, we find that Augerino (with relaxed symmetry restrictions on learned intervals) underperforms random augmentation because its parameters ϕ are often stuck in a neighborhood around their initial values. We believe this is due to the conservative nature of using global augmentations (cf. Figure 3), where even a small change in the parameters may greatly increase the training loss, which prohibits wide-ranging augmentations. We also compare in-distribution and out-of-distribution generalization by splitting the 46 test sets into two groups, according to the similarity of their lighting conditions to D45; see Appendix D.2 for the details on the splitting method. In Figure 7 we can see that above a certain in-distribution performance, there exists a trade-off for random augmentation between in-distribution accuracy and out-of-distribution generalization, controlled through the hyperparameter settings. InstaAug, meanwhile, delivers higher out-of-distribution performance than any of the hyperparameter configurations, while simultaneously giving better in-distribution accuracy than the vast majority of them. We can further vary the difficulty of the classification task by using different numbers of lighting conditions in the training data. In Table 4, we randomly select a set number of lighting conditions to use as the training set for each baseline. As expected, the accuracy increases with the number of lighting conditions for all methods.
However, the effect of random augmentation saturates: it performs similarly to no augmentation with 8 lighting conditions. By contrast, InstaAug always provides improvements. In Appendix D, we show that these gains come at very little computational overhead at both train and test time.

6. INSTAAUG FOR CONTRASTIVE LEARNING

Contrastive learning aims to learn features that are approximately invariant to certain augmentations. Typical contrastive learning methods, such as SimCLR (Chen et al., 2020; Ermolov et al., 2021) , first sample two independent transformations, τ 1 , τ 2 ∼ p(τ ), and apply them to an input image x, generating two views x 1 and x 2 . They then feed the transformed images to a neural encoder f , which is trained to maximize the similarity between f (x 1 ) and f (x 2 ), measured with a contrastive loss. As the choice of augmentations directly influences the learned invariance of the encoder, it is a crucial ingredient of contrastive learning (Bachman et al., 2019; Chen et al., 2020; Tian et al., 2020) . However, existing schemes use global augmentations which often introduce unrealistic assumptions. For example, if there are multiple entities in an image, such as grass and cattle in Figure 1c , random cropping will pull features for different entities closer to each other. Consequently, we propose InstaAug as a more flexible instance-specific augmentation method for contrastive learning. Applying InstaAug to contrastive learning is similar to the supervised case shown in Section 3. The main difference is, given an input x, we sample two τ independently from the input-specific distribution p(τ ; ϕ(x)), before they are applied to x. The training objective is correspondingly changed to minimizing the contrastive loss while keeping the diversity in a reasonable range. We again consider TinyIN and evaluate three methods: InstaAug, InstaAug (without input), and random crop. We exclude methods with uniform parameterization, because of their poor performance in the simpler supervised setting. All experiments are based on the SimCLR framework and use the PreActResNet-18 network as the encoder. We train each model with a batch size of 512 for 500 epochs. We then train a linear classifier to evaluate feature quality. 
We use test-time augmentation, with 10 sampled crops, as it has been shown to improve performance (Foster et al., 2021). From Table 5, we see that InstaAug outperforms the random and global augmentation schemes as well as Un-Mix (Shen et al., 2022), which is a recent variant of MixUp methods for contrastive learning. We observe from the examples shown in Figure 8 that InstaAug focuses on the salient features containing important information. We also notice that the sizes of learned patches are correlated with the sizes of the main objects in images. Thus, InstaAug is able to learn sensible instance-specific augmentations in a fully unsupervised setting.
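The view-generation step for the contrastive setting, where two transformations are drawn independently from the same input-specific distribution, can be sketched as below. The crop tuple layout, the nested-list image, and the probability vector are hypothetical stand-ins for the output of ϕ(x); the encoder and contrastive loss are omitted.

```python
import random


def apply_crop(image, crop):
    """Cut a (top, left, height, width) patch out of a nested-list image."""
    top, left, height, width = crop
    return [row[left:left + width] for row in image[top:top + height]]


def sample_two_views(image, crops, probs, rng):
    """Draw two crops independently from p(tau; phi(x)) and apply both to x."""
    first, second = rng.choices(crops, weights=probs, k=2)  # with replacement
    return apply_crop(image, first), apply_crop(image, second)


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
    crops = [(0, 0, 2, 2), (2, 2, 2, 2), (0, 0, 4, 4)]         # fixed crop set
    probs = [0.6, 0.3, 0.1]                                    # hypothetical phi(x)
    view_1, view_2 = sample_two_views(image, crops, probs, rng)
    print(view_1, view_2)
```

Because both views come from a distribution conditioned on the same input, crops that cover unrelated entities in an image can be given low probability for that image without being ruled out for others.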

7. DISCUSSION

In this paper we introduced InstaAug, a method for learning instance-specific data augmentations that capture local invariances of the underlying data generating process. This is achieved by training an augmentation module that parametrizes an input-dependent distribution over transformations, whose samples are used to augment the training data on the fly and/or for test-time augmentation. The main benefits of InstaAug stem from its applicability to a wide range of settings, its ease of use, and crucially its capacity to learn meaningful augmentations that in turn improve performance. Empirically, we demonstrated these benefits for both classification and contrastive learning problems, considering several classes of transformations: rotation, color jittering, and cropping.

where $f(\tau(x_i))$ is deterministic given τ and i, we have that $\tilde{Y} \mid i, \tau \overset{d}{=} \hat{Y} \mid i, \ \forall i, \tau$ is a sufficient (but not necessary) condition to ensure (A) = 0 for all f. That is, it is zero for all f if the conditional distribution on the labels is the same for both the original and transformed inputs for all possible pairs (i, τ), i.e. all possible original inputs and sampled transformations. One simple way to ensure this is to have τ always be equal to the identity mapping, so this term prefers limited transformations. By contrast, if the transformation destroys information about the label, $\hat{Y} \mid i$ and $\tilde{Y} \mid i, \tau$ will now differ, such that, in general, (A) ≠ 0 and, moreover, it will vary with f. Here we typically expect that (A) ≥ 0, as we are making predictions using the transformed inputs, so the expected loss under the true label distribution for the transformed inputs will tend to be less than that when labels are generated using the untransformed input. To keep the magnitude of (A) low, we need to ensure that transformations maintain the conditional label distribution as well as possible, i.e. that transformations preserve all input information that is salient for predicting labels.
Conveniently, minimizing R(f, ϕ) with respect to ϕ, as done by the InstaAug training setup of Section 3.2, will naturally try to reduce (A). Given we expect the term to typically be positive, this provides an explanation for why InstaAug can be effective without any separate consideration in the objective for the need for transformations to maintain the class label distribution.

(B) represents how well our transformation captures the true input distribution. Here we can utilize the fact that, by the definition of Ỹ,

$$\mathbb{E}\left[L(f(\tau(x_i)), \tilde{Y}) \mid \tau(x_i) = x\right] = \mathbb{E}\left[L(f(X), Y) \mid X = x\right] =: r(x) \tag{A.7}$$

to write it as

$$(B) = \mathbb{E}\left[r(\tau(x_i))\right] - \mathbb{E}\left[r(X)\right], \tag{A.8}$$

where $r : \mathcal{X} \to \mathbb{R}^{+}$ maps inputs to their true expected loss. We thus see that $\tau(x_i) \overset{d}{=} X$ is a sufficient (but not necessary) condition to ensure that (B) = 0 for all f. That is, (B) is always 0 if the process of choosing one of the training inputs at random followed by applying a sampled transformation to that input produces samples distributed exactly according to the true input distribution. Unlike for (A), there is no simple scenario in which we can ensure this is true, with the use of the identity transformation now likely to give significant discrepancies by failing to provide sufficient coverage of the input space: though the x_i may originally have been sampled from p_true(X), there is only a finite set of them, such that repeated sampling from this finite set represents a substantially different distribution to p_true(X). In fact, (B) nicely encapsulates the desire to perform augmentation in the first place, by showing how it can be used to increase the coverage of the input space.

How to best manage Term (B) will vary depending on the type of model used and the form of our transformations. In some situations, it may be that no matter how diverse our transformations are within the class of those allowable, τ(x_i) will still only cover a subset of the support of X.
Here the most important factor for keeping (B) small will be to maximize the diversity of the transformations, e.g. by maximizing their entropy, to ensure the best possible coverage of the true input space. In other cases, it might also be possible to "over-diversify" the inputs, such that τ(x_i) can become more diffuse than X for some choices of ϕ, potentially causing training to lack focus on the particular test-time input distribution we care about. Here we may need to ensure that the entropy of the transformation does not become so large as to cause such over-diversification, creating a more complex trade-off with the need to ensure sufficient coverage. These two scenarios respectively motivate the lower and upper bounds on the transformation distribution entropy used when training the augmentation module.³

For augmentation of high-dimensional data, the former, coverage-limited, scenario is expected to be significantly more likely, as our original training data will generally provide quite poor coverage

the classifier (see Figure 2) is replaced by a regressor and the loss function L in Equation (2a) is changed accordingly to absolute or squared error. For self-supervised contrastive learning, we replace the classifier and cross-entropy loss with the feature extractor and contrastive loss (such as the SimCLR loss (Chen et al., 2020)), respectively. In addition, the sampler samples two transformations rather than one to generate multiple views for an input x.

C.2 IMPLEMENTATION OF LOCATION-RELATED PARAMETERIZATION

As an example, we show how to implement location-related parameterization with a basic CNN structure in the following algorithm.

darker images (rows 1 and 3) and decrease the brightness of brighter images (row 4). Also, InstaAug is more likely to change saturation than hue and brightness, which is consistent with the common belief that saturation contains less information than hue and brightness. InstaAug's behavior differs substantially across samples: it even decides not to augment the H and V channels of the image in the second row. In comparison, Augerino adds or multiplies noise to each channel with the same distribution across all samples, which is harmful in many cases. For example, the input image in the last row is already very bright, but Augerino allows its brightness to be increased further. The brightness values of many pixels will then be capped at 1.0, which leads to loss of information.

E.2 HYPERPARAMETER ABLATION

The two hyperparameters of InstaAug are H min and H max, which reflect human preference on augmentation diversity. To investigate how H min and H max influence model performance and to provide a guide on how to choose them, we perform an ablation study for the experiment of Section 5.2, wherein we sweep over possible intervals of length 0.5 and 1.0. From Table E.1, we find that the best accuracy is achieved when [H min, H max] is set to [3, 3.5], while any sub-interval of [2, 4] produces significantly better results than random augmentation.
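To make the role of the interval [H min, H max] concrete, the following minimal sketch computes the entropy of a categorical distribution over candidate patches and a penalty that vanishes inside the interval. The hinge-style penalty, the use of nats as the unit, and the function names `entropy` and `entropy_bound_penalty` are illustrative assumptions; the way the constraint of Equation (2b) is actually enforced may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats, an assumption) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_bound_penalty(probs, h_min, h_max):
    """Hypothetical hinge penalty: zero whenever h_min <= H(probs) <= h_max."""
    h = entropy(probs)
    return max(0.0, h_min - h) + max(0.0, h - h_max)

# Uniform over 64 candidate patches: H = ln(64), about 4.16 nats, above H max = 3.5.
too_diverse = [1.0 / 64] * 64
# Uniform over 25 candidate patches: H = ln(25), about 3.22 nats, inside [3, 3.5].
in_range = [1.0 / 25] * 25

penalty_high = entropy_bound_penalty(too_diverse, 3.0, 3.5)
penalty_ok = entropy_bound_penalty(in_range, 3.0, 3.5)
```

Under these assumptions, a distribution that is too diffuse is pushed back towards H max, while one inside the interval is left to be shaped by the task loss alone.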

E.3 WHY IS THE RANDOM AUGMENTATION BASELINE SO STRONG?

It is perhaps initially surprising that the Random Augmentation baseline in Section 5.2 is so strong compared to the other global augmentation schemes. In short, this occurs because the extensive hyperparameter sweep used for it turns out to be a more effective tuning mechanism than directly training global parameters jointly with the model. To be more precise, for any global cropping scheme (which includes random crop, Augerino, and InstaAug without input), there is little to be gained from using a non-uniform distribution on the position of the crops. As such, the only thing that can be usefully learned is the distribution on the size of the crops themselves. For the random crop baseline, we do an exhaustive sweep to establish the best distribution on crop sizes, meaning that this baseline represents a near-optimal global cropping augmentation. By comparison, InstaAug (without input) must still learn the optimal cropping size distribution during training, and the results suggest that it does not always manage to do this perfectly, tending to prefer under-diverse transformations. This is perhaps not surprising, as it does not have access to a validation set, unlike the hyperparameter sweep implicitly being deployed for the random crop baseline. The problem is seen even more starkly for Augerino, where the lack of LRP causes training to become stuck in highly sub-optimal local optima that yield very little transformation diversity.



¹ Note that $\hat{Y} \overset{d}{=} \tilde{Y}$ alone is not generally sufficient, as matching in marginal distribution does not ensure that the joint distributions with i and τ also match, in turn yielding different expectations.

² Note, though, that this is not formally guaranteed, even for the cross-entropy loss and an f that exactly captures the true distribution. This is because, while Gibbs' inequality ensures that the optimal q given p for a cross-entropy expected loss $\mathbb{E}_{p(Y)}[-\log q(Y)]$ is q = p, in general, the optimal p given q is not p = q.

³ Note here that the bounds in Equation (2b) are on the entropy of the parameters of τ, rather than of τ(x_i) itself. This is because it is difficult to directly control the latter during training, with the former providing a more practical proxy that is expected to generally be representative.

⁴ https://github.com/alexrame/mixmo-pytorch.git, under Apache License v2.0.

⁵ https://github.com/htdt/self-supervised.git, under Apache License v2.0.



Figure 1: Different inputs require different augmentations. In (a), the digit '0' is invariant to any rotation, but rotating the digit '6' by more than 90° makes it a '9'. In (b), a similar phenomenon is observed for color jittering applied to a leaf and a lemon/lime. The red dashed lines in (a) and (b) are boundaries between different classes. In (c), the same effect is shown for cropping. Solid rectangles represent the patches that preserve the labels of the original images ([left] grass, [right] cattle), while dashed rectangles represent patches with different labels to the original images.

Figure 2: Summary of InstaAug.

Figure 3: InstaAug learns more diverse augmentations that also preserve labels compared to global augmentations. ⋆ and are samples from two different classes. Blue and green shades represent label-preserving augmentations for each class. In (a), the upper ⋆ should be further augmented, but some of the augmented samples for the lower ⋆ are already over-augmented and indistinguishable from another class (see the red intersection). InstaAug solves this problem by learning a different augmentation for each instance, as shown in (b).

Figure 4: Location-related parameterization of crops by a CNN. The shaded area (bottom right) shows a simplified 3-layer CNN, and squares represent units at different convolutional layers. Each unit defines a patch in the input image (shown in the same color) through its receptive field. The value of the activation then gives the corresponding unnormalized log probability for that patch.

Figure 5: Learned invariances for the Mario and Iggy dataset. The blue arcs show the training data range, while the green arcs show the learned transformation distributions for some examples.

Figure 6: InstaAug (B) learns more sensible crops compared to random and learned global (A) augmentations. Columns (a, d) show examples of sampled crops, with red edges indicating higher probability. Columns (b, e) show density maps for the crop centres, with brighter color meaning higher probability. Columns (c, f) give the proportion of crops (red) above a particular size threshold, showing that InstaAug produces fewer large crops.

Figure 7: In-distribution and out-of-distribution test accuracy for models trained on RawFooT D45. The round dots are random augmentation with different hyperparameter settings. The colors of dots change from yellow to red as hue jittering increases; more saturated dots indicate higher saturation jittering; larger dots mean higher brightness jittering. Each thick line connects dots with the same hue and brightness jitter and thin lines link dots with the same hue and saturation jitter.

Figure 8: Some examples (bird houses) of learned cropping in contrastive learning.

Figure E.1: Examples of learned color jittering. (a) Original image; (b, f) Average hue (H) of original image (blue dot) and learned hue jittering (red arc) for InstaAug and Augerino; (c,g) learned saturation (S) and brightness value (V) of original image (blue dot) and learned hue jittering (red line segment) for InstaAug and Augerino; (d,e) examples of images transformed by InstaAug.

InstaAug boosts the test accuracy (%) with test-time augmentation on ImageNet. Invariance modules learned on ResNet-50 can also be directly applied to other models such as ResNet-18 and XCiT to improve generalization without fine-tuning. By contrast, we see that global augmentation schemes are actually detrimental to test-time augmentation.

RawFooT includes 68 different samples of raw food and each sample has an image taken under each of 46 different lighting conditions (see Figure D.1 for some examples). We crop the original images to create the train set and test set. For each original image with a resolution of 800 × 800, we randomly sample 200 different 200 × 200 patches in the upper half as training images. The same procedure is taken on the lower half to produce test images, giving a train set and a test set for each different lighting condition. To evaluate the generalization ability of each method to a broader range of lighting conditions, we evenly mix test images from all lighting conditions to form a general test set, while controlling the lighting conditions present during training.
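The patch-sampling procedure can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: it assumes that patches lie entirely within the chosen half, that top-left corners are drawn uniformly over integer pixel positions, and that "different" means distinct corners. The function name `sample_patch_corners` and its arguments are hypothetical.

```python
import random

def sample_patch_corners(n_patches=200, image_size=800, patch_size=200,
                         half="upper", seed=0):
    """Sample distinct top-left corners of square patches inside one image half."""
    rng = random.Random(seed)
    half_height = image_size // 2
    # Rows 0..399 form the upper half, rows 400..799 the lower half.
    row_offset = 0 if half == "upper" else half_height
    corners = set()
    while len(corners) < n_patches:
        top = row_offset + rng.randint(0, half_height - patch_size)
        left = rng.randint(0, image_size - patch_size)
        corners.add((top, left))  # the set keeps corners distinct
    return sorted(corners)

train_corners = sample_patch_corners(half="upper")  # training patches
test_corners = sample_patch_corners(half="lower")   # test patches
```

Because training and test patches come from disjoint halves of each image, no test patch overlaps a training patch under these assumptions.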

InstaAug achieves higher general accuracy than baseline methods when trained on D45 (Daylight, 4500K).

InstaAug significantly outperforms baseline methods in general test accuracy (%) on different difficulty levels. For each difficulty level, we randomly sample lighting conditions used for training and repeat each experiment 10 times. Test-time augmentation is included for random and InstaAug.

Representations learned by InstaAug perform better in the downstream linear classification task than baselines.

Table E.1: Model performance with different choices of H min and H max on supervised cropping.

H min   H max   Accuracy (%)

APPENDIX A THEORETICAL ANALYSIS OF GENERALIZATION ERROR

We now provide a decomposition of the generalization error (i.e. the difference between the true risk and the training risk) when using ϕ during training of the downstream classifier f. Here we can view the objective of augmentation as adjusting the training objective to encourage the learned model to have a low true risk. As such, the generalization error provides a measure of the effectiveness of the augmentation for the training of f; by analysing the behaviour of the generalization error as a function of the augmentation module, we can derive a characterization of the desirable properties of the latter.

To start our analysis, we first define the true risk of the downstream model, f, as

$$R(f) := \mathbb{E}\left[L(f(X), Y)\right], \tag{A.1}$$

where (X, Y) ∼ p_true(X, Y) are drawn from the true data generating distribution. In practice, one might also perform test-time augmentation, implying a different predictive function and thus different true risk, but for the purposes of our analysis, we will assume that this is not done, as this allows us to focus on the impact the invariance module has on f during training.

On the other hand, the implied training risk (i.e. our objective for training f) when using an invariance module is the augmented empirical risk

$$R(f, \phi) := \mathbb{E}\left[L(f(\tau(x_i)), y_i)\right], \tag{A.2}$$

where i ∼ Uniform{1, . . . , N} is a uniformly sampled index for a point in the original training dataset {x_n, y_n}, n = 1, . . . , N, and τ | i ∼ p(τ; ϕ(x_i)) is the sampled transformation. Note that the expectation in Equation (A.2) is only over i and τ, with the datapoints themselves not considered random variables for our purposes, because we are only provided with a single fixed training dataset.

The generalization error can now be defined as R(f, ϕ) − R(f). At a high level, we are interested in finding a ϕ that ensures this has a low magnitude. More precisely, we want ϕ to ensure that the minimizer of the training risk, f* := arg min_f R(f, ϕ), gives as low a true risk, R(f*), as possible.
Therefore, we want to keep the generalization error magnitude small across different f (relative to the corresponding variations in R(f, ϕ) itself), so that the optima of the training and true risks are as similar as possible. In other words, we want a ϕ that ensures R(f, ϕ) − R(f) is small for all f, especially those close to f*. If we do hypothetically drive the generalization error to zero for all f, we will have a mechanism for directly training to the true risk using a finite original training dataset.

To aid with decomposing the generalization error, it is convenient to further define the following random variables through their conditional distributions:

$$\hat{Y} \mid i \sim p_{\mathrm{true}}(Y \mid X = x_i), \tag{A.3}$$

$$\tilde{Y} \mid i, \tau \sim p_{\mathrm{true}}(Y \mid X = \tau(x_i)). \tag{A.4}$$

We can now write down our decomposition as follows:

$$R(f, \phi) - R(f) = \underbrace{\mathbb{E}\left[L(f(\tau(x_i)), \hat{Y})\right] - \mathbb{E}\left[L(f(\tau(x_i)), \tilde{Y})\right]}_{(A)} + \underbrace{\mathbb{E}\left[L(f(\tau(x_i)), \tilde{Y})\right] - \mathbb{E}\left[L(f(X), Y)\right]}_{(B)} + \underbrace{\mathbb{E}\left[L(f(\tau(x_i)), y_i)\right] - \mathbb{E}\left[L(f(\tau(x_i)), \hat{Y})\right]}_{(C)}. \tag{A.5}$$

From this, we see that if the magnitudes of (A), (B), and (C) are all small, then our generalization error magnitude will be small as well. Moreover, if we can construct a ϕ such that these terms are small for all f, then we can ensure effective generalization performance. We will now look at each term individually.

(A) provides a precise characterisation of how well our transformation preserves the label distribution; it is the difference between the expected loss under the true label distribution of the untransformed inputs and the expected loss under the true label distribution of the transformed inputs, making predictions using the transformed inputs in both cases. In particular, by noting that we have

of the true input distribution, while our transformations will not generally be sufficiently powerful to produce unrepresentative inputs. Moreover, when working with large deep learning models, prediction in one region of the input space is rarely harmed by the addition of data in another input region. Thus, for the typical scenarios in which we expect InstaAug to be deployed, increasing the entropy of the transformations will directly relate to reducing the magnitude of (B).
Note here that it will typically be the case that (B) < 0 provided that the transformations maintain the label distribution, as the accuracy of the downstream model will typically be higher for the transformations of the original training data than for the test data.

Term (C) is the error from the fact that we only have one sample of the label for each original training input, rather than the full label distribution. As Ŷ ⊥⊥ τ, we have limited ability to reduce it through controlling ϕ; it essentially represents the irreducible noise in R(f, ϕ) from only having a finite number of true labels. Note that it is not related to the model's ability to generalize to unseen inputs, as it is based on variability in other possible labels we might have seen for our training inputs themselves; if Y | X is actually deterministic, it is exactly zero. As such, it is of limited interest for our analysis, while it will thankfully generally be much smaller than the other terms for practical problems unless we have both a very small dataset and a very noisy true label distribution.

Putting everything together, we see that (A) and (B) respectively encapsulate the competing needs of the invariance module to maintain the conditional label distribution (i.e. preserve the label information) and maximize coverage of the input space. We have also seen that the former is typically naturally taken care of by minimizing R(f, ϕ) with respect to ϕ, motivating the objective used by InstaAug in Equation (2a), but the latter requires separate consideration, which we deal with through our constraints on the entropy in Equation (2b).

APPENDIX B DETAILS OF AUGERINO

As a method to learn invariance, Augerino (Benton et al., 2020) is quite different from the previous approaches, which usually require an extra validation set. The basic idea behind Augerino is to use a few parameters (θ) to control the transformation distribution on input images and learn these parameters with the training loss of the classifier. Specifically, it minimizes a loss that combines L(x; y), the cross-entropy loss, with R(θ), a regularization function on the volume of the support of the distribution, weighted by the hyper-parameter λ.

Comparison with InstaAug. InstaAug shares with Augerino the ideas of tuning augmentation parameters by the classifier loss and using test-time augmentation to boost performance, but they differ in the following aspects. The most significant difference is that InstaAug is instance-specific, while Augerino learns global augmentations. Besides, Augerino uses a single scalar θ to parameterize a symmetric uniform distribution (U[-θ, θ]) over each type of transformation, which lacks the flexibility to model more complex augmentations, such as cropping.

In addition, Augerino uses a fixed weight λ to balance the training loss and augmentation diversity. However, we find that, in more complicated settings, this is quite impractical. Specifically, we need different λ in different stages of training. If we use a large λ from the start of training, the diversity will quickly diverge to its maximum, because the classifier is very weak and the loss is consequently dominated by the diversity term. This will block the training of the classifier because transformed samples from different classes are quite mixed with each other. Otherwise, if we choose a small λ, the diversity will converge to zero after a few epochs, yielding similar results to the vanilla model without augmentation. In neither case can we learn a useful augmentation. Consequently, InstaAug directly constrains the diversity to keep it stable during training.
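The contrast between a global range and an instance-specific one can be illustrated with a small sampling sketch for rotations. Everything here is hypothetical: the function names, the use of degrees, and in particular `toy_range_fn`, which is a lookup keyed on a string standing in for the input. In InstaAug the range would come from the learned module ϕ applied to the input itself. The two ranges echo the example of Figure 1(a).

```python
import random

def sample_global_angle(theta, rng):
    """Augerino-style sampling: one shared range U[-theta, theta] for every input."""
    return rng.uniform(-theta, theta)

def sample_instance_angle(x, range_fn, rng):
    """Instance-specific sampling: the range depends on the input via range_fn."""
    low, high = range_fn(x)
    return rng.uniform(low, high)

def toy_range_fn(x):
    """Hypothetical stand-in for phi(x): '0' tolerates any rotation, '6' only up to 90 degrees."""
    return {"0": (-180.0, 180.0), "6": (-90.0, 90.0)}[x]

rng = random.Random(0)
angles_for_six = [sample_instance_angle("6", toy_range_fn, rng) for _ in range(1000)]
angles_for_zero = [sample_instance_angle("0", toy_range_fn, rng) for _ in range(1000)]
angles_global = [sample_global_angle(90.0, rng) for _ in range(1000)]
```

With a single global θ, one must choose between under-augmenting the '0' and over-augmenting the '6'; the instance-specific sampler avoids that compromise.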

APPENDIX C METHOD DETAILS

C.1 REGRESSION AND SELF-SUPERVISED LEARNING

In Section 3, we use classification as an example to introduce InstaAug. However, InstaAug can be easily applied to other tasks including regression and self-supervised learning. For regression,

Algorithm 1: Location-related parameterization
Input: Image x, channel numbers M_i, and layer number n_layer
Output: Probability of patches p
F // Logit vector at each level
logits = Concat([logit_i]); // Logit vector at all levels
p = Normalize(Exp(logits)); // Probability after normalization
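The surviving steps of Algorithm 1 (one logit per unit at each level, concatenation across levels, then exponentiation and normalization) can be sketched in plain Python as below. This is an illustrative toy rather than the authors' implementation: it assumes a single channel (so the channel numbers M_i are ignored), random 3 × 3 kernels in place of learned weights, a tanh nonlinearity, and that each unit's activation is used directly as the logit of the patch given by its receptive field, as described in the caption of Figure 4. The names `conv_valid` and `patch_probabilities` are hypothetical.

```python
import math
import random

def conv_valid(feature, kernel, stride):
    """Naive single-channel 'valid' convolution followed by tanh."""
    k = len(kernel)
    out_h = (len(feature) - k) // stride + 1
    out_w = (len(feature[0]) - k) // stride + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += feature[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(math.tanh(acc))
        out.append(row)
    return out

def patch_probabilities(image, kernels, stride=2):
    """Return one probability per unit across all levels (one unit = one patch)."""
    feature = image
    logits = []
    for kernel in kernels:  # one kernel per level
        feature = conv_valid(feature, kernel, stride)
        logits.extend(v for row in feature for v in row)  # logits at this level
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]  # normalize over units at all levels

rng = random.Random(0)
image = [[rng.random() for _ in range(16)] for _ in range(16)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(3)]
p = patch_probabilities(image, kernels)  # 7*7 + 3*3 + 1*1 = 59 patches
```

Deeper levels have fewer units with larger receptive fields, so a single normalized distribution covers crops of several sizes and positions.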

C.3 OTHER PARAMETRIZATION METHODS

Besides the uniform and location-related parameterization, we also tried VAE-like methods to parameterize augmentations, such as cropping. The main idea is to have a Gaussian latent variable and a neural decoder to map the latent Gaussian distributions to a continuous distribution on transformation parameters (in this case, the centers and sizes of crops). However, similar to the uniform parameterization, we find the VAE-like parameterization unstable and easily stuck at local minima.

APPENDIX D EXPERIMENTAL DETAILS

D.1 CROPPING

Supervised training. Based on the Mixmo codebase⁴ (Ramé et al., 2021), we use the stochastic gradient descent (SGD) optimizer to train the baselines and InstaAug. For the classifier, the initial learning rate is set to 0.2 (with momentum 0.9 and weight decay 1e-4). A scheduler is used to decrease the learning rate by a factor of 0.9 once validation accuracy does not increase for 10 epochs. The learning rate of the augmentation module ϕ is fixed at 1e-5. The batch size is set to 100 and we pre-train InstaAug for 10 epochs without augmentation. We train the model until convergence, with the maximum number of epochs set to 150.
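The learning-rate schedule for the classifier can be sketched as a small plateau-based rule. This is an assumption-laden illustration, not the scheduler from the Mixmo codebase: it reads "decrease by a factor of 0.9" as multiplying the rate by 0.9, and it triggers the decay on the tenth consecutive epoch without a new best validation accuracy. The class name `PlateauDecay` is hypothetical.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.2, factor=0.9, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Update with this epoch's validation accuracy and return the current rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # start counting again after a decay
        return self.lr

scheduler = PlateauDecay()
# Validation accuracy stalls at 0.5: the first epoch sets the best value,
# and the next ten epochs bring no improvement.
rates = [scheduler.step(0.5) for _ in range(11)]
```

In this toy run the rate stays at 0.2 for the first ten epochs and drops to roughly 0.18 on the eleventh.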

Contrastive training. We directly apply InstaAug on the codebase⁵ from Ermolov et al. (2021). Because of the characteristics of contrastive learning, we set the batch size to 512. As in the supervised case, we use the SGD optimizer to train the augmentation module ϕ. In contrast, we use the Adam optimizer (Kingma & Ba, 2015) (with learning rate 1e-3 and weight decay 1e-6) to train the base model. We train each model for 500 epochs and decrease the learning rate by a factor of 0.8 at steps 450 and 475.

D.2 COLOR JITTERING ON TEXTURES

Training. We use PreActResNet-18 (width = 1) for the texture recognition task on RawFooT and train it with the SGD optimizer. The learning rate is 0.02 (with momentum 0.9 and weight decay 1e-4) for the classifier and 1e-5 for the augmentation module ϕ. We train each model for 50 epochs; learning rate schedulers are not necessary in this task.

Random augmentation baseline. We sweep over the variation range on each channel to find the best hyperparameters for the random augmentation baseline. For hue (h-jittering), we sweep between [0, 0.5] with stride 0.1, and for saturation (s-jittering) as well as brightness value (v-jittering), we sweep between [0, 1.0] with stride 0.2, which yields 216 different settings in total. The best accuracy shown in Table 3 is achieved at (h, s, v) = (0.0, 0.2, 0.8).
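The size of the sweep can be checked by enumerating the grid: six hue values times six saturation values times six brightness values gives 216 settings. The short sketch below does this; the function name `jitter_grid` is hypothetical and the rounding is only there to keep the floating-point grid values clean.

```python
from itertools import product

def jitter_grid():
    """Enumerate every (h, s, v) jitter setting in the random-augmentation sweep."""
    hue = [round(0.1 * i, 1) for i in range(6)]         # 0.0, 0.1, ..., 0.5
    saturation = [round(0.2 * i, 1) for i in range(6)]  # 0.0, 0.2, ..., 1.0
    value = [round(0.2 * i, 1) for i in range(6)]       # 0.0, 0.2, ..., 1.0
    return list(product(hue, saturation, value))

grid = jitter_grid()
n_settings = len(grid)  # 6 * 6 * 6 = 216
```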

Group      Lighting id
Easy (1)   1-4, 10, 14-31
Hard (2)   5-9, 11-13, 32-46

In-distribution vs. out-of-distribution generalization. To further investigate the effect of each augmentation method, we additionally split the 46 test sets into two equally-sized groups. The first group contains lighting conditions similar to D45, such as daylight with different temperatures, for which the vanilla model without augmentation trained on D45 has high test accuracy. The second group contains lighting conditions that are dramatically different from D45, for example, pure red light, which are more difficult for the vanilla method. Then the average accuracy on the first group can be regarded as a measure of in-distribution generalization, while the accuracy on the second group reflects out-of-distribution generalization.

D.3 TIME COMPLEXITY

We notice that InstaAug on color jittering has a similar training speed (0.37s/iter) to random augmentation (0.40s/iter) on a single Nvidia 1080Ti GPU, though it takes more epochs to converge (about 40) than random augmentation, which usually converges after 25 epochs. We also find that evaluation is very fast even with test-time augmentation (sample number = 10), at about 0.004s/sample. However, the training speed of InstaAug on cropping (0.25s/iter) is slower than random augmentation (0.15s/iter) due to optimization issues with the more complex parameterization method. Training InstaAug alone takes a similar amount of time per epoch to joint training, but it requires fewer epochs (fewer than 30) to converge, and we can cache the outputs of the classifier for faster training. The evaluation speed is 0.011s/sample when the sample number is set to 50 for test-time augmentation.
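The "sample number" above is the number of transformed copies of each input whose predictions are ensembled at test time. A minimal sketch of that averaging is given below, assuming predictions are combined by averaging softmax probabilities; the function names, the toy model, and the toy transformation are hypothetical stand-ins for the trained classifier and the learned augmentation sampler.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tta_predict(x, model, sample_transform, n_samples=10, seed=0):
    """Average class probabilities over n_samples randomly transformed inputs."""
    rng = random.Random(seed)
    summed = None
    for _ in range(n_samples):
        probs = softmax(model(sample_transform(x, rng)))
        summed = probs if summed is None else [a + b for a, b in zip(summed, probs)]
    return [s / n_samples for s in summed]

# Toy stand-ins: a two-class "model" on a scalar input and a small additive jitter.
toy_model = lambda v: [v, -v]
toy_transform = lambda v, rng: v + rng.uniform(-0.1, 0.1)
prediction = tta_predict(1.0, toy_model, toy_transform, n_samples=10)
```

The cost grows linearly with the sample number, which matches the higher per-sample evaluation time reported for 50 samples compared with 10.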

APPENDIX E ADDITIONAL RESULTS AND DISCUSSION

E.1 RAWFOOT

Figure E.1 shows some examples of learned color jittering. Though it's not easy to fully understand them, we can still find some patterns. For example, InstaAug tends to increase the brightness of

