FEATURE DROPOUT: REVISITING THE ROLE OF AUGMENTATIONS IN CONTRASTIVE LEARNING

Abstract

What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.

1. INTRODUCTION

In recent years, foundation models (Bommasani et al., 2021) have exhibited remarkable progress on a range of AI tasks (Devlin et al., 2019; Liu et al., 2019; Ramesh et al., 2021; Radford et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; Alayrac et al., 2022; Reed et al., 2022). A crucial characteristic of foundation models is that they can be adapted for a range of downstream tasks. For example, a foundation model trained on ImageNet should ideally not only perform well at object classification, but should also have learned general features useful for localization, segmentation, and other visual tasks. Indeed, this is borne out by recent work showing the high accuracy of foundation models on a range of downstream tasks (Chen et al., 2020b), as well as a range of analysis work showing models learn high-level semantic features including texture, color, pose, and style (Goh et al., 2021).

One popular strategy for training foundation models involves training models to match transformed versions (known as views or augmentations) of the same input. For example, image views might include common data augmentations such as cropping or color jitter (Chen et al., 2020b), while views for speech might include pitch modulation or spectrogram masking (Kharitonov et al., 2021; Park et al., 2019). This family of objectives includes contrastive approaches such as SimCLR and MoCo, as well as non-contrastive approaches such as BYOL and SwAV (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020). Given the central importance of these views for defining the self-supervised task, much work has focused on the question of what views lead to high-quality representations. The prevailing consensus, exemplified by Tian et al. (2020), holds that views should be label-preserving with respect to a downstream task.
In other words, because the contrastive loss will produce representations which are invariant to features that vary across views, any information we wish to preserve in the representations should not be altered by such views. As Tian et al. (2020) write: "A good set of views are those that share the minimal information necessary to perform well at the downstream task."

Here, we question whether this assumption, and in particular its focus on a single task, is enough to explain why contrastive foundation models succeed on a range of downstream tasks. In Section 2, we observe that the actual choice and application of views in practice do not align with this prevailing consensus. For example, complete invariance to several common data augmentations (e.g. shifts in brightness or cropping) is undesirable since augmentations of inputs from different classes can collide. Furthermore, in many cases there are explicit ways to specify invariances (e.g. converting images to grayscale) that researchers avoid in favor of specifying them indirectly via augmentations (e.g. hue shifts). These observations suggest that specifying invariances is not the sole role of these views. Instead, we suspect that augmentations serve as a form of feature dropout, preventing any one feature from becoming a shortcut feature and suppressing the learning of other features.

We study this idea empirically in Viewmaker Networks, a recently proposed method that appears to learn to drop out different features in the input via adversarial training. We apply viewmaker and expert views to datasets with two associated downstream tasks, one involving classifying the main input (e.g., an image or audio recording) and one involving a simple overlaid element (e.g., a digit, shape, letter, or speech snippet). We observe that the viewmaker augmentations selectively obscure these overlaid features.
Despite this, the viewmaker representations still learn both downstream tasks well, while expert views often struggle on one or the other. This further suggests that being label-preserving is not a necessary property of good views, as long as the label information is still sometimes accessible. Finally, we formalize the intuition that feature dropout can aid learning with a theoretical analysis of a simple linear contrastive setting. In this setting, we characterize how the noisiness of each feature directly determines how quickly features are learned, and uncover an interaction between features governing how fast they are learned. In particular, we show how learning one feature quickly can suppress the learning of other features, and show that adding noise to the "easiest" feature can increase the rate at which other features are learned. This further indicates that label-destroying augmentations may have a direct role in ensuring that contrastive models learn a broad range of features for downstream tasks. Overall, these findings suggest the need to revisit common assumptions about the role of augmentations for contrastive learning in the foundation model setting, and move towards a better understanding of how to train generalist models that learn diverse features from unlabeled data.

2. COMMON PRACTICES ARE AT ODDS WITH THE "INVARIANCE" EXPLANATION

We begin by briefly examining several common augmentations used in contrastive learning for natural images, and how they come into conflict with the common assumption described above.

First, we observe that many common augmentations can affect the label of the input, depending on the downstream task. For example, many downstream image recognition tasks require color information (e.g. identifying bird species) or brightness (e.g. scene or time-of-day classification), implying that invariance to these characteristics would be undesirable. Yet hue shifts, greyscaling, and brightness shifts are common augmentations used in contrastive learning (Chen et al., 2020b; He et al., 2020).

Second, repeated application of some augmentations causes challenges for all downstream tasks. For example, applying brightness shifts repeatedly results in any image turning completely black or completely white. Thus the class label cannot be truly invariant to this augmentation, since inputs from different classes can experience an "augmentation collision" at this black or white image (this is formalized in Appendix B). This argument also applies to other augmentations, including shifts in contrast and random masking.

Third, some augmentations are commonly used despite ways of explicitly encoding invariance to them. For example, two common image augmentations are hue shifts and greyscaling. Invariance to both of these augmentations can be explicitly encoded by always converting an image to greyscale. Yet doing so is not common practice because color information is still desirable for many downstream tasks.

The contradictions between the invariance rationale for augmentations in contrastive learning and these common practices suggest the need for additional explanations for the role of augmentations.
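The augmentation collision described in the second observation can be illustrated with a toy sketch. It assumes pixel values live in [0, 1] and that a brightness shift adds a constant and clips to that range; the function names, the shift size, and the two four-pixel "images" are hypothetical choices for illustration only.

```python
def brightness_shift(image, delta):
    """Add delta to every pixel and clip to the assumed valid range [0, 1]."""
    return [min(1.0, max(0.0, pixel + delta)) for pixel in image]


def repeat_shift(image, delta, times):
    """Apply the same brightness shift several times in a row."""
    for _ in range(times):
        image = brightness_shift(image, delta)
    return image


# Two toy flattened "images" standing in for inputs from different classes.
image_a = [0.10, 0.80, 0.35, 0.60]
image_b = [0.95, 0.05, 0.50, 0.20]

bright_a = repeat_shift(image_a, 0.25, 4)
bright_b = repeat_shift(image_b, 0.25, 4)

# Both inputs saturate at the all-white image, so they collide.
print(bright_a == bright_b)  # True
```

An encoder that is exactly invariant to this shift would have to assign both inputs the same representation, which is the situation formalized in Appendix B.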

3. VIEWMAKER NETWORKS SUCCEED DESPITE DESTROYING LABEL INFORMATION

As another point of evidence that good views need not be label-preserving, we consider the behavior of viewmaker networks (Tamkin et al., 2021b), a generative model which produces augmentations for contrastive learning. Intuitively, viewmakers learn a stochastic augmentation policy that makes the contrastive task as hard as possible for the encoder. The stochastic augmentations are parameterized as additive perturbations bounded by an $L_1$ norm, meaning the viewmaker can alter but not completely destroy the original image. Formally, given an input $x \in \mathbb{R}^N$, a viewmaker network $V_\psi$ is trained jointly with an encoder $E_\theta$ to optimize the minimax expression:

$$\max_\psi \min_\theta \; \mathcal{L}\left( E_\theta\left( x + \epsilon \frac{V_\psi(x, \delta_1)}{\|V_\psi(x, \delta_1)\|_1} \right),\; E_\theta\left( x + \epsilon \frac{V_\psi(x, \delta_2)}{\|V_\psi(x, \delta_2)\|_1} \right) \right)$$

Here $\mathcal{L}$ is a multiview loss function (e.g. Chen et al., 2020b; He et al., 2020), $x$ is a minibatch of inputs, $\|\cdot\|_1$ is the $L_1$ norm, $\epsilon$ is the distortion budget controlling the strength of the views, and $\delta_1, \delta_2 \sim \mathcal{N}(0, 1)$ are random inputs that enable the viewmaker to learn a stochastic augmentation policy. We clamp the output of the viewmaker for images to $[0, 1]$ as in Tamkin et al. (2021b).

Viewmaker networks learn to stochastically alter different parts of the input, including task-relevant features, meaning that these augmentations are not label-preserving. Nevertheless, as we will see shortly, viewmaker networks enable strong performance on multiple downstream tasks, often outperforming expert-designed augmentations. Moreover, this feature dropout capability of viewmaker networks may help them to learn many features well rather than focusing on the easiest ones.
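The perturbation step inside the minimax expression can be sketched as follows. This is only an illustration of the $L_1$ budget and the clamp to [0, 1]: the list `raw` is a random stand-in for the viewmaker output $V_\psi(x, \delta)$ (the real viewmaker is a trained network), the function name `apply_view` is hypothetical, and each list entry is treated as one pixel when computing the budget.

```python
import random


def apply_view(x, raw_perturbation, epsilon):
    """Return clamp(x + epsilon * V / ||V||_1, 0, 1) for a flattened input x."""
    l1_norm = sum(abs(p) for p in raw_perturbation)
    if l1_norm == 0.0:
        return list(x)  # nothing to scale; leave the input unchanged
    return [
        min(1.0, max(0.0, x_i + epsilon * p / l1_norm))
        for x_i, p in zip(x, raw_perturbation)
    ]


rng = random.Random(0)
x = [0.5] * 8                           # toy flattened "image" with P = 8 entries
epsilon = 0.05 * len(x)                 # budget of the form epsilon = 0.05 * P
raw = [rng.gauss(0.0, 1.0) for _ in x]  # stand-in for V_psi(x, delta)
view = apply_view(x, raw, epsilon)

# When no pixel is clamped, the view differs from x by exactly epsilon in L1 norm.
distortion = sum(abs(v_i - x_i) for v_i, x_i in zip(view, x))
```

Because the budget only constrains the total $L_1$ mass, the perturbation is free to concentrate on a few pixels, which is what allows a viewmaker to obscure a small overlaid feature while leaving most of the input intact.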

3.1. DATASETS

We consider the behavior of viewmaker networks on four datasets, including three image and one audio dataset. Each dataset is constructed in such a way as to support two distinct downstream classification tasks, enabling us to examine how well each downstream task is learned. The presence of two downstream tasks enables us to analyze the foundation model setting where we wish to learn features relevant for multiple downstream tasks, as opposed to one set or the other.

Image datasets

The three image datasets are based on the canonical CIFAR-10 image-recognition dataset (Krizhevsky, 2009) (MIT-License). One task is always to predict the CIFAR-10 object label (e.g. airplane or bird). The other task is dependent on an additional feature overlaid on the image:

C+Shapes: The CIFAR-10 image is overlaid with one of three randomly-colored shapes: a square, a triangle, or a circle. The second task is to predict what shape was overlaid (N=3 classes).

C+Digits: The CIFAR-10 images are overlaid with four copies of a randomly-sampled digit from the MNIST dataset. The second task is to predict the digit class (N=10 classes).

C+Letters: The CIFAR-10 images are overlaid with four copies of a randomly-colored English letter. The second task is to predict the class of the letter (N=26 classes).

Audio dataset

The audio dataset is created by overlaying the audio of a spoken digit (from the AudioMNIST dataset (Becker et al., 2018), MIT License) with a random background sound (collected from one of three possible classes: cafe, machinery, and traffic) (Saki et al., 2016; Saki & Kehtarnavaz, 2016). The tasks are to predict the digit class (N=10 classes) and to predict the sound class (N=3 classes). Inputs are presented to the network as log mel spectrograms.

3.2. EXPERIMENTS

Pretraining. We pretrain with the SimCLR algorithm for 200 epochs with a batch size of 256 and a temperature of 0.1. We use a ResNet-18 model with standard modifications for smaller inputs (including a smaller stride and no initial maxpool) as used in Tamkin et al. (2021b). For the expert augmentations, we use the standard SimCLR augmentations for the image datasets (Chen et al., 2020b), and the SpecAug (Park et al., 2019) augmentations for the audio datasets, which randomly mask out different frequency and time bands, as well as the WaveAug (Kharitonov et al., 2021) augmentations, which alter various properties of the waveform such as the pitch and speed. For the viewmaker augmentations, we use a budget of ϵ = 0.05P for the image datasets, and ϵ = 0.125P for the audio datasets, where P is the number of pixels in the input.

Linear Evaluation

We evaluate the quality of the learned representations by training a linear softmax classifier on top of the prepool representations. We train for 100 epochs, using the same parameters as Viewmaker (Tamkin et al., 2021b), training separate linear classifiers using the same pretrained network for each downstream task (Chen et al., 2020b). Augmentations are applied during training but not evaluation.

3.3. RESULTS

Qualitative evidence of feature dropout. Visually, the viewmaker augmentations seem to stochastically alter different aspects of the input, as shown in Figure 1. In addition to modifying the background of each input, the viewmaker also selectively modifies the additional synthetic features added to each domain:

C+Digits: The viewmaker augmentations selectively add pixels to the MNIST digits, making it difficult to distinguish which number is present.

C+Shapes: The viewmaker augmentations sometimes draw squares around the shape in the center, making it difficult to determine the shape class.

C+Letters: The viewmaker draws letter-like markings on top of the letters, obscuring the letter identity and color.

Audio: The viewmaker identifies the narrow band corresponding to the speech and applies perturbations to it.

As can be seen in Figure 1, these label-destroying augmentations are quite common, occurring in a sizeable fraction of the sampled views.

Quantitative evidence of feature dropout. We also measure this selectivity of features quantitatively in Appendix C.2 and Figure 4. We augment images 1,200 times and observe the impact on the predictive probability of the correct object class. Two clear modes appear for the viewmaker augmentations, but not expert augmentations. These correspond to the fraction of time the viewmaker destroys the overlaid feature information (low P(correct object class)) and preserves it (high P(correct object class)).

Viewmaker succeeds despite destroying label information. As shown in Table 1 and Table 2, viewmaker networks are able to achieve good accuracy on both tasks, while expert augmentations frequently achieve lower performance on one or both tasks. For example, on the image tasks, while expert views achieve slightly higher performance on the image only, they experience a large drop in accuracy when the synthetic feature is added.
In two of these cases (Shape and Digit) the viewmaker models are able to achieve a higher accuracy on both the image and the synthetic feature, while on the third (Letters) viewmakers achieve slightly lower accuracy on the images but achieve half the error on the synthetic object. For the audio experiments the picture is similar: the viewmaker is able to avoid catastrophic drops in performance when learning both features together, achieving the highest accuracy on both, while the expert views experience larger drops and worse overall performance. Note that the high performance of expert views for our control tasks (CIFAR-10/Speech/Sound Only) indicates that the viewmaker views are not merely better all-around views, but that they specifically help the model learn multiple features. These results provide additional evidence that label-preserving views are not necessary for learning good representations, and that the ability to perform feature dropout may improve learning of multiple tasks.

4. THEORETICAL ANALYSIS OF FEATURE INTERACTIONS IN A LINEAR CONTRASTIVE SETTING

In this section, we theoretically analyze a simple linear model that captures the essence of how label-destroying augmentations can improve downstream accuracy. We study a setting where the data contains many underlying features that are relevant to downstream classification tasks, and where these features are preserved to varying degrees across augmentations. We will show that a linear model trained with a contrastive objective learns these features, and that adding noise to one feature can speed the learning of other features during gradient descent. One difference between the linear setting we theoretically analyze and Section 3 is that in this section we add stochastic Gaussian noise to destroy features across augmentations, as opposed to the more bimodal feature dropout behavior seen in Figure 1.

[Figure 2: (a) Augmented pairs $(u_k^{(i)}, v_k^{(i)})$ shown relative to the feature direction $\pm\mu_k$. (b) The model maps an input $w$ with rows $w_1, \dots, w_K$ to the representation $f_\Theta(w)$ with entries $\theta_k^T w_k$. As each feature $\mu_k$ is learned ($\theta_k \to \mu_k$), the representations of the two views $f_\Theta(u^{(i)}), f_\Theta(v^{(i)})$ become more similar, decreasing the contrastive loss.]

4.1. DATA MODEL AND SETTING

We study a model which consists of data with $K$ distinct features, each corresponding to a ground truth unit-vector direction $\mu_1, \dots, \mu_K \in \mathbb{R}^d$. We sample each data point $u \in \mathbb{R}^{K \times d}$ and its augmentation (a.k.a. its positive pair or its view) $v \in \mathbb{R}^{K \times d}$ as follows. For $k \in \{1, \dots, K\}$, the $k$th row of $u$, which we denote $u_k$, is drawn from the Gaussian distribution $\mathcal{N}(0, I_d)$. The $k$th row of the augmentation, $v_k$, is drawn from the same distribution, but is correlated with $u_k$ in the $\mu_k$-direction (and is otherwise independent in the other directions). The strength of the correlation is governed by the parameter $\alpha_k \in [0, 1]$ in the following sense: $v_k^T \mu_k = \alpha_k u_k^T \mu_k + \sqrt{1 - \alpha_k^2}\, \xi$, where $\xi \sim \mathcal{N}(0, 1)$. Thus the larger $\alpha_k$, the stronger the correlation in that feature across the two views. Figure 2 (a) visualizes the correlation of $(u_k, v_k)$ in an augmented pair. Formally, we can write that

$$(u_k, v_k) \sim \mathcal{N}\left(0, \begin{pmatrix} I_d & \alpha_k \mu_k \mu_k^T \\ \alpha_k \mu_k \mu_k^T & I_d \end{pmatrix}\right),$$

for a vector $\alpha \in [0, 1]^K$.

We will learn a model $\Theta \in \mathbb{R}^{K \times d}$, which represents a collection of $K$ feature extractors, as pictured in Figure 2 (b). The model $\Theta$, with rows $\{\theta_k\}_{k \in [K]}$, maps a data point $w \in \mathbb{R}^{K \times d}$ to a representation $f_\Theta(w) \in \mathbb{R}^K$ by computing a score $w_k^T \theta_k$ for each element in the representation. That is, $(f_\Theta(w))_k = w_k^T \theta_k$. Our goal is that the model $\Theta$ will be useful for a downstream classification task which depends on the ground truth features. A good representation will capture ground truth features that are correlated across augmentations, such that $\theta_k$ is aligned with $\mu_k$ or $-\mu_k$.

Training. We will study the evolution of $\Theta$ as we optimize a standard contrastive learning objective using gradient descent (Dosovitskiy et al., 2014; Chen et al., 2020b). At each round of gradient descent, we sample a fresh batch of $m$ data points and their augmentations, $(U, V) := \{(u^{(i)}, v^{(i)})\}_{i \in [m]}$.
For each $i, j \in [m]$, we compute a similarity score $z_{ij} := \langle f_\Theta(u^{(i)}), f_\Theta(v^{(j)}) \rangle = \sum_k (\theta_k^T u_k^{(i)})(\theta_k^T v_k^{(j)})$ using the dot product of their $K$-dimensional representations. We then compute the probabilities $p_{ij} := \frac{\exp(z_{ij})}{\sum_{j'} \exp(z_{ij'})}$ using the softmax function, and use the classwise cross entropy loss function $\mathcal{L}(\Theta; U, V) := -\log(p_{ii})$.
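The data model and loss above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy sizes ($K = 2$, $d = 3$, $m = 16$), and the correlation values are hypothetical, and the loss averages $-\log p_{ii}$ over the batch index $i$, which is an assumption about how the per-example terms are aggregated. The sampler relies on each $\mu_k$ being a unit vector.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_pair(mus, alphas, rng):
    """Sample one (u, v); row k of v is correlated with row k of u along mus[k]."""
    u, v = [], []
    for mu, alpha in zip(mus, alphas):
        u_k = [rng.gauss(0.0, 1.0) for _ in mu]
        v_k = [rng.gauss(0.0, 1.0) for _ in mu]
        # Overwrite the mu-component of v_k with alpha * u_k^T mu + sqrt(1 - alpha^2) * xi.
        target = alpha * dot(u_k, mu) + math.sqrt(1.0 - alpha ** 2) * rng.gauss(0.0, 1.0)
        shift = target - dot(v_k, mu)
        u.append(u_k)
        v.append([v_i + shift * mu_i for v_i, mu_i in zip(v_k, mu)])
    return u, v


def representation(theta, w):
    """(f_Theta(w))_k = w_k^T theta_k."""
    return [dot(w_k, theta_k) for w_k, theta_k in zip(w, theta)]


def contrastive_loss(theta, U, V):
    """Average of -log p_ii over the batch, with p_ij a softmax over j of z_ij."""
    f_u = [representation(theta, u) for u in U]
    f_v = [representation(theta, v) for v in V]
    total = 0.0
    for i, f_ui in enumerate(f_u):
        z = [dot(f_ui, f_vj) for f_vj in f_v]
        z_max = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = z_max + math.log(sum(math.exp(z_j - z_max) for z_j in z))
        total += log_norm - z[i]
    return total / len(U)


rng = random.Random(0)
mus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # K = 2 unit-vector features in d = 3
alphas = [0.9, 0.3]                       # feature 1 is strongly preserved across views
batch = [sample_pair(mus, alphas, rng) for _ in range(16)]
U = [u for u, _ in batch]
V = [v for _, v in batch]
loss_at_truth = contrastive_loss(mus, U, V)  # loss when theta_k = mu_k
```

With all-zero weights every similarity score is zero, so each $p_{ij} = 1/m$ and the loss equals $\log m$; this gives a convenient sanity check for any implementation of the objective.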

4.2. MAIN RESULT

We will study gradient descent (GD) on the cross entropy loss, and consider how adding noise to one feature affects the learning of the other features. As suggested earlier, we can measure how well we learn the $k$th feature by measuring the alignment of $\theta_k$ with $\mu_k$ or $-\mu_k$. A natural way to measure this alignment is the acute angle between $\pm\mu_k$ and $\theta_k$, given by $\arccos\left(\frac{|\mu_k^T \theta_k|}{\|\theta_k\|_2}\right)$. Lemma E.1 in Appendix E proves that this quantity directly determines the test accuracy on a natural downstream linear classification task.

Formally, we say we add noise to some feature $k'$ of a data point $v$ if, for some $\beta \in [0, 1)$, we let $\tilde{v}_{k'} = \beta v_{k'} + \sqrt{1 - \beta^2}\, \xi$, where $\xi \sim \mathcal{N}(0, I_d)$, and $\tilde{v}_k = v_k$ for $k \neq k'$. Thus if $(u, v)$ were a pair generated with the correlation coefficients $\{\alpha_k\}_{k \in [K]}$, then the distribution of $(u, \tilde{v})$ comes from the modified correlation coefficients $\{\tilde{\alpha}_k\}_{k \in [K]}$ with the single modification $\tilde{\alpha}_{k'} = \beta \alpha_{k'}$. We now present our main theorem:

Theorem 4.1 (Noise improves feature learning). There exists a universal constant $C$ such that the following holds. Let $\Theta^{(t+1)} = \Theta^{(t)} - \eta(\nabla \mathcal{L}(U, V; \Theta^{(t)}) + \lambda \Theta^{(t)})$, and $\tilde{\Theta}^{(t+1)} = \Theta^{(t)} - \eta(\nabla \mathcal{L}(U, \tilde{V}; \Theta^{(t)}) + \lambda \Theta^{(t)})$, where $\tilde{V}$ is $V$ with any amount of added noise in the $k'$th feature. This has the effect of changing $\alpha_{k'}$ to $\tilde{\alpha}_{k'}$ for any $\tilde{\alpha}_{k'} < \alpha_{k'}$. Then for any $k \neq k'$, if $|\theta_k^T \mu_k| \le \frac{1 - \alpha_{k'}^2}{C} \|\theta_k\|$, $\|\theta_{k'}\|^3 \le |\theta_{k'}^T \mu_k|$, and $\|\theta_k\|^2 \le \frac{\alpha_k (1 - \alpha_{k'}^2)}{C}$, then for a small enough step size $\eta$,

$$\mathbb{E}_{U, V}\left[\arccos\left(\frac{|\mu_k^T \theta_k^{(t+1)}|}{\|\theta_k^{(t+1)}\|_2}\right)\right] > \mathbb{E}_{U, \tilde{V}}\left[\arccos\left(\frac{|\mu_k^T \tilde{\theta}_k^{(t+1)}|}{\|\tilde{\theta}_k^{(t+1)}\|_2}\right)\right].$$

We briefly comment on the three assumptions on $\Theta$ in the theorem. The first assumption, $|\theta_k^T \mu_k| \le \frac{1 - \alpha_{k'}^2}{C} \|\theta_k\|$, requires that $\theta_k$ is not too aligned with $\mu_k$; that is, the result applies to all features $k$ that aren't already learned too well.
The second two assumptions are satisfied if the $k'$th feature has been learned to some extent, and the norms of $\theta_k$ and $\theta_{k'}$ are small, which can be enforced throughout training with $\ell_2$ regularization. The theorem guarantees that at any point in training, if we add noise to the $k'$th feature, the next step of GD learns all other features better than if we didn't add noise. To validate the implication of this result for the complete trajectory of GD, we include simulations in Appendix D. Our experiments show that introducing noise part-way through training to dominant features can significantly speed the alignment of weak features, with only a small cost to the alignment of the dominant features. We prove our result in Appendix E, including intuition and a technical overview of the steps in Section E.3.
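The claim that adding noise to feature $k'$ scales its correlation coefficient by $\beta$ can be checked with a short calculation. The derivation below is a sketch under the stated data model; it assumes that the added noise $\xi$ is independent of $(u, v)$ and that $\mu_{k'}$ is a unit vector, and the shorthand $a$, $b$ is introduced here for illustration.

```latex
% Assumptions: \xi \sim \mathcal{N}(0, I_d) is independent of (u, v), and \|\mu_{k'}\|_2 = 1.
\begin{align*}
a &:= u_{k'}^T \mu_{k'}, \qquad b := v_{k'}^T \mu_{k'}, \qquad
  \mathbb{E}[ab] = \alpha_{k'}\, \mathbb{E}[a^2] = \alpha_{k'}, \\
\tilde{v}_{k'}^T \mu_{k'} &= \beta b + \sqrt{1 - \beta^2}\; \xi^T \mu_{k'},
  \qquad \xi^T \mu_{k'} \sim \mathcal{N}(0, 1), \\
\mathbb{E}\big[a \cdot \tilde{v}_{k'}^T \mu_{k'}\big]
  &= \beta\, \mathbb{E}[ab] + \sqrt{1 - \beta^2}\; \mathbb{E}[a]\, \mathbb{E}[\xi^T \mu_{k'}]
   = \beta \alpha_{k'}, \\
\operatorname{Var}\big(\tilde{v}_{k'}^T \mu_{k'}\big)
  &= \beta^2 \operatorname{Var}(b) + (1 - \beta^2) \operatorname{Var}(\xi^T \mu_{k'})
   = \beta^2 + (1 - \beta^2) = 1.
\end{align*}
```

So the noised view keeps unit variance along $\mu_{k'}$ while its correlation with $u_{k'}$ shrinks from $\alpha_{k'}$ to $\beta \alpha_{k'}$. The same variance argument applies in every direction, so $\tilde{v}_{k'}$ remains $\mathcal{N}(0, I_d)$ and the pair $(u, \tilde{v})$ stays within the data model.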

5. RELATED WORK

Understanding contrastive and multiview learning. Many prior works have laid the foundations for current contrastive and multiview learning algorithms (Becker & Hinton, 1992; Hadsell et al., 2006; Dosovitskiy et al., 2014; Wu et al., 2018; Bachman et al., 2019; Misra & van der Maaten, 2020; He et al., 2020; Chen et al., 2020b). Several works perform analysis studies of contrastive learning to identify important factors (Cole et al., 2021; Zhao et al., 2021) or how contrastive models differ from supervised learning (Yang et al., 2020; Ericsson et al., 2021a; Karthik et al., 2021). HaoChen et al. (2021) study contrastive learning using the concept of an augmentation graph. This model assumes the fraction of non-label-preserving augmentations is "extremely small"; interestingly, we show that in practice it can be quite large and still yield good performance. Wang et al. (2022) theoretically study contrastive learning under an assumption of label-preserving augmentations, though they show that such an assumption alone does not suffice to learn. Most relevant to our work, Tian et al. (2020) and Ericsson et al. (2021b) study how the information shared between different views impacts learning of downstream tasks. We complicate this picture by analyzing the foundation model setting where a single model must learn features that are useful for multiple tasks that are not known in advance. In this setting, we find that label-destroying perturbations, thought to be harmful by Tian et al. (2020), are useful for preventing one feature from suppressing others.

Feature suppression. Our work is closely connected to the notion of feature suppression (Hermann & Lampinen, 2020), where the presence of one feature can crowd out or suppress the learning of other features. Several works have explored the relevance of this concept in contrastive learning.
For example, the original SimCLR paper (Chen et al., 2020b) noted that color jitter augmentation was necessary to prevent the network from using only the color profile of the input to solve the contrastive task. Followup work (Chen et al., 2021) explores this phenomenon in more detail, characterizing how different hyperparameters and dataset features affect feature suppression. Other works have attempted to address feature suppression in contrastive learning, either via auxiliary losses (Li et al., 2020) or by modifying representations in the latent space (Robinson et al., 2021). Our work relates to these in two ways. First, we empirically and theoretically investigate feature suppression as an alternate rationale for the role of augmentations, as opposed to invariance. Second, we show that an existing method, viewmaker networks (Tamkin et al., 2021b), can identify and potentially neutralize suppressing features in an interpretable way, resulting in better performance than expert augmentations.

Spurious correlations and shortcut features. Outside the framing of feature suppression, several other works explore how classifiers can learn or make use of unwanted features. Shortcut features (Geirhos et al., 2020) describe often-simple features (e.g. the average color of an input) which are learned by networks at the expense of more salient features (e.g. the object class). This notion is connected to spurious correlations (Simon, 1954) in deep learning, which have been explored extensively (Sagawa et al., 2019; 2020; Srivastava et al., 2020; Tu et al., 2020; Xiao et al., 2021), including in the context of self-supervised learning (Minderer et al., 2020). Other works have also performed theoretical analysis of how related dynamics affect learning in the supervised setting (Li et al., 2019; Shah et al., 2020).
Our work suggests that viewmaker networks may be a useful tool here as well, both as an interpretability tool to visualize the different features a network relies on, and as a way to reduce reliance on particular features without completely destroying the information.

6. DISCUSSION AND CONCLUSION

We have presented several different arguments against the commonly-articulated belief that the main role of augmentations is to specify invariances for a contrastive learning model. First, common augmentations such as brightness shifts would result in useless representations if networks became truly invariant to them. Second, viewmaker networks succeed at contrastive learning despite learning label-destroying perturbations which drop out different features in the input. Finally, we present an analysis of a linear contrastive setting where we prove that label-destroying views actually have a positive effect on contrastive learning if the goal is to avoid learning one feature at the expense of others.

Our work has limitations. For example, our empirical analysis is limited to four synthetic datasets spanning vision and audio, whereas self-supervised learning may be applied to naturalistic data spanning a much wider range of modalities (Tamkin et al., 2021a; 2022). In addition, our theoretical analysis considers a linear contrastive setting, whereas current neural networks are highly nonlinear. Improving upon both of these fronts is an exciting area for future work.

On the other hand, understanding augmentations as dropping out easy features suggests possible ways of improving the performance of self-supervised learning. For instance, viewmaker networks cap the extent to which views can differ from the underlying image. Our analysis here suggests that this cap indirectly sets the dropout rate of different features in the input; some way of directly encoding this objective may yield more flexible and performant viewmaker approaches. The challenge of learning a broad range of useful features lies at the heart of self-supervised learning. We hope our work sheds light on this challenge in contrastive learning, especially as these objectives continue to develop and are applied more broadly and at larger scale.

A CODE RELEASE

We are preparing our code and will release it on a public GitHub repo.

B FORMALIZATION OF OBSERVATION IN SECTION 2

Definition B.1 (Invariance). A function $f : \mathbb{R}^m \to \mathbb{R}^n$ is invariant to a set of transformations $G$ if and only if $f \circ g(x) = f(x)$ for all $x \in \mathbb{R}^m$ and for all $g \in G$.

Definition B.2 (Augmentation collision). An augmentation collision occurs if, for two inputs $x_a, x_b$ and set of transformations $G$, there exist $g_a^{(1)}, \dots, g_a^{(n_a)}, g_b^{(1)}, \dots, g_b^{(n_b)} \in G$ for some $n_a, n_b \in \mathbb{N}$ such that $g_a^{(1)} \circ \dots \circ g_a^{(n_a)}(x_a) = g_b^{(1)} \circ \dots \circ g_b^{(n_b)}(x_b)$.

Observation B.3. If there exists an augmentation collision for inputs $x_a, x_b$ and transformation set $G$, and $f$ is invariant to $G$, then $f(x_a) = f(x_b)$.

Proof. By the definition of an augmentation collision, $g_a^{(1)} \circ \dots \circ g_a^{(n_a)}(x_a) = g_b^{(1)} \circ \dots \circ g_b^{(n_b)}(x_b)$. Applying $f$ to both sides, we have $f \circ g_a^{(1)} \circ \dots \circ g_a^{(n_a)}(x_a) = f \circ g_b^{(1)} \circ \dots \circ g_b^{(n_b)}(x_b)$. Applying invariance once for each transformation, we obtain $f(x_a) = f(x_b)$.

It follows from this observation that if the downstream labeling function $f$ is invariant to a class of augmentations, then there cannot be an augmentation collision for inputs with different labels. However, common augmentations such as brightness shifts can reduce any image to a black or white image, resulting in an augmentation collision between any two inputs.

C ADDITIONAL FEATURE DROPOUT EXPERIMENTS

C.1 QUANTIFYING THE IMPORTANCE OF FEATURE DROPOUT

To assess the importance of label-destroying augmentations to the success of the viewmaker, we experiment with a setup where the viewmaker cannot destroy the information in the object class. To do this, we compute a mask around the object and zero out any perturbation from the viewmaker within that mask. We then perform pretraining and transfer as usual. As we report in Table 3, the accuracy of the CIFAR-10 class label drops precipitously, as expected. At the same time, the accuracy of two of the other objects remains mostly constant (shape and digits), while the accuracy for letters declines modestly (perhaps because the color of the letter is now able to suppress the learning of the letter class).

Table 3: Experiments with a masked viewmaker which is unable to destroy the object class. Transfer accuracy on CIFAR-10 (C-10) and the object task (Shape, Digit, or Letter). The Mask-Viewmaker has its perturbation masked such that it cannot destroy the label of the object. This results in the features in the object suppressing the CIFAR-10 accuracy, while leaving the object accuracy relatively unscathed.

C.2 QUANTIFYING THE DEGREE OF FEATURE DROPOUT

We perform an exploratory analysis to test how well different views drop out the features in an input. We augment 1,200 examples (CIFAR-10 image plus an overlaid object) using a given augmentation policy (either the expert or viewmaker augmentations). We then classify the augmented inputs with a model trained with the other augmentation policy (i.e. expert for viewmaker augmentations or the reverse) in order to test how well the augmentations drop out the features. We use a different encoder to see the effects of the augmentations prior to the encoder having a chance to adapt to them. We observe a bimodal behavior for the viewmaker views, shown in Figure 4, suggesting that the viewmaker is adapting to the semantics of the input and has learned to stochastically drop out the simple feature some fraction of the time. By contrast, the expert views display no such structure. Using the corresponding encoder and views leads to models performing uniformly well, as shown in Figure 5.

D END-TO-END SIMULATIONS OF LINEAR SETTING

We empirically test the performance of the full trajectory of gradient descent when we add noise to the data. We study a setting with one weak feature with correlation coefficient $\alpha_1 \le 0.5$, and 50 dominant features with $\alpha_k = 1$ for $k = 2, \ldots, 51$. We compare two approaches run on the same data: in the first approach, we run 150 iterations of GD without adding noise. In the second, we run 50 iterations of GD without noise, and then add noise to the dominant features for the remaining 100 iterations. In Figure 6 (top), we compare the alignment of Feature 1 (the weak feature) and Feature 2 (one of the dominant features) to the ground truth in the two approaches. We observe that adding noise consistently accelerates the learning of the weak feature (blue), with little cost to the dominant features (red). The effect is consistent across many choices of $\alpha_1$, the correlation coefficient of the weak feature. We also plot in Figure 6 (bottom) the probability of predicting the correct class (pair) of the view under both approaches. We observe that this probability drops sharply when we add noise, which we believe is the mechanism for faster learning with noise. We remark that we chose to add noise to all the dominant features (instead of a single $k'$ as in our theorem) to accentuate the effect of adding noise. We observed a similar but smaller effect when we added noise to fewer features, or when there were fewer than 50 dominant features.
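A simulation of this kind can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code used for Figure 6: the learning rate, the initialization scale, the use of a fresh batch per iteration, the mean (rather than summed) loss, and modelling "adding noise" as lowering the dominant features' correlation to a hypothetical `alpha_noisy` are all choices made here for the example.

```python
import numpy as np

def sample_pairs(rng, mu, alpha, m):
    """Draw m pairs (u, v); feature k of v has correlation alpha[k] with u along mu[k]."""
    u = rng.standard_normal((m,) + mu.shape)
    v = rng.standard_normal((m,) + mu.shape)
    pu = np.einsum("mkd,kd->mk", u, mu)
    pv = np.einsum("mkd,kd->mk", v, mu)
    target = alpha * pu + np.sqrt(1.0 - alpha ** 2) * pv
    return u, v + (target - pv)[..., None] * mu

def loss_and_grad(theta, u, v):
    """Contrastive cross-entropy with z_ij = sum_k (theta_k . u_k^(i)) (theta_k . v_k^(j))."""
    m = u.shape[0]
    a = np.einsum("mkd,kd->mk", u, theta)
    b = np.einsum("mkd,kd->mk", v, theta)
    z = a @ b.T
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    p = np.exp(log_p)
    g = (p - np.eye(m)) / m                      # d loss / d z_ij, cf. Lemma E.2
    grad = (np.einsum("ij,jk,ikd->kd", g, b, u)
            + np.einsum("ij,ik,jkd->kd", g, a, v))
    return -float(np.mean(np.diag(log_p))), grad, float(np.mean(np.diag(p)))

def alignment(theta, mu):
    """|mu_k^T theta_k| / ||theta_k|| for every feature k."""
    return np.abs(np.sum(theta * mu, axis=1)) / np.linalg.norm(theta, axis=1)

def run(alpha_weak=0.25, n_dominant=50, d=5, m=25, n_iters=150,
        noise_start=None, alpha_noisy=0.5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    K = 1 + n_dominant
    mu = rng.standard_normal((K, d))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    theta = 0.1 * rng.standard_normal((K, d))    # assumed small initialization
    alpha = np.ones(K)
    alpha[0] = alpha_weak                        # feature 0 is the weak feature
    history = []
    for t in range(n_iters):
        if noise_start is not None and t == noise_start:
            alpha[1:] = alpha_noisy              # assumed form of "adding noise"
        u, v = sample_pairs(rng, mu, alpha, m)
        _, grad, p_correct = loss_and_grad(theta, u, v)
        theta = theta - lr * grad
        al = alignment(theta, mu)
        history.append((al[0], al[1], p_correct))
    return np.array(history)                     # columns: weak, dominant, p_ii

no_noise = run(noise_start=None)
with_noise = run(noise_start=50)
```

Because both runs share a seed and draw the same number of random values per step, they see the same underlying Gaussian draws, which mirrors the "same data" comparison described above. The columns of each returned array can be plotted against the iteration index to compare the two approaches.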

E FULL PROOFS OF PROPOSITIONS AND THEOREMS

We begin by stating and proving Lemma E.1 on the downstream classification accuracy.

Lemma E.1 (Downstream classification accuracy). Suppose we draw labeled data points $(u, y) \in \mathbb{R}^{K \times d} \times \{+1, -1\}$, where as before, $u_k \sim \mathcal{N}(0, I_d)$ for $k \in [K]$, and the label is given by $\mathrm{sign}(u_k^T \mu_k)$. Then the best linear classifier $a \in \mathbb{R}^K$ on the representations $f_\Theta(u) \in \mathbb{R}^K$ achieves a test error of $\frac{1}{\pi}\arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)$. That is,

$$\min_{a \in \mathbb{R}^K} \Pr_u\left[\mathrm{sign}(a^T f_\Theta(u)) \neq \mathrm{sign}(\mu_k^T u_k)\right] = \frac{1}{\pi}\arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right).$$

Thus if $\theta_k$ and $\mu_k$ are orthogonal, then the test error is 50%. If the angle between $\theta_k$ and $\pm\mu_k$ is zero, then we achieve perfect classification accuracy.

Proof. It is easy to see that the best linear classifier $a$ will (up to scaling) be equal to the vector $\mathrm{sign}(\mu_k^T\theta_k)e_k$. Such a classifier predicts the correct sign whenever $\mathrm{sign}(a^T f_\Theta(u)) = \mathrm{sign}(\mu_k^T\theta_k)\,\mathrm{sign}(\theta_k^T u_k)$ equals $\mathrm{sign}(\mu_k^T u_k)$, which occurs exactly a $1 - \frac{1}{\pi}\arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)$ fraction of the time.

In the rest of this section, we prove our main theoretical result, Theorem 4.1, which shows that $\arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)$ decreases faster in expectation during gradient descent if we add noise to the $k'$-th feature.
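As a numerical sanity check of the error formula in Lemma E.1, the sketch below compares the predicted error with a Monte Carlo estimate for a single feature. The particular vectors `mu` and `theta`, the dimension, and the sample size are arbitrary choices made for illustration.

```python
import numpy as np

def predicted_error(theta, mu):
    """Test error from Lemma E.1: arccos(|mu^T theta| / ||theta||) / pi."""
    cos = abs(mu @ theta) / np.linalg.norm(theta)
    return float(np.arccos(min(cos, 1.0)) / np.pi)

def empirical_error(theta, mu, n_samples, rng):
    """Fraction of Gaussian inputs on which the best classifier errs."""
    u = rng.standard_normal((n_samples, theta.size))
    prediction = np.sign(mu @ theta) * np.sign(u @ theta)
    return float(np.mean(prediction != np.sign(u @ mu)))

rng = np.random.default_rng(0)
mu = np.zeros(5)
mu[0] = 1.0                                   # unit-norm ground-truth direction
theta = np.array([0.6, 0.8, 0.0, -0.3, 0.1])  # arbitrary learned direction

err_formula = predicted_error(theta, mu)
err_sampled = empirical_error(theta, mu, 200_000, rng)
```

The two limiting cases in the lemma can be checked the same way: a `theta` orthogonal to `mu` gives a predicted error of one half, and a `theta` parallel to `mu` gives zero.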

E.1 NOTATION

We let $\delta_{ij}$ denote the $\delta$-function which equals 1 if $i = j$ and 0 otherwise. For a parameter $\Theta = \{\theta_k\}_{k\in[K]}$, we let $\theta_k^{\parallel} := \mu_k\mu_k^T\theta_k$ be the projection of $\theta_k$ in the $\mu_k$ direction. We let $\theta_k^{\perp} = \theta_k - \theta_k^{\parallel}$ be the projection of $\theta_k$ orthogonal to the feature $\mu_k$.

Throughout this section, we consider the ground truth directions to be fixed, and we fix some initial correlation vector $\alpha$. We let $P_\alpha$ denote the distribution of the pair $(u, v)$, that is, the Gaussian distribution described in Section 4 with correlation coefficients $\alpha$. When unspecified, the variables $U, V$ are drawn from the distribution $P_\alpha^m$. Since we study what happens when we vary $\alpha_{k'}$, for $x \in [0, 1]$, we use the shorthand $P_x$ to denote the distribution $P^m_{\alpha(x)}$, where $\alpha(x)_{k'} = x$, and $\alpha(x)_k = \alpha_k$ for all other $k$.

We denote $L_i(\Theta; U, V) = \mathrm{CE}(\{p_{ij}\}_{j\in[m]}, e_i) = -\log(p_{ii})$, which we abbreviate by $L_i$. When it is clear that we are considering $L_i$ for some fixed $i$, we omit the superscripts on the $i$th data point or its pair. That is, we denote $u_k := u_k^{(i)}$ and $v_k := v_k^{(i)}$.
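For concreteness, the decomposition of $\theta_k$ into its components along and orthogonal to $\mu_k$ can be computed as in the short sketch below. The function name `decompose` and the random test vectors are illustrative assumptions; the only requirement taken from the text is that $\mu_k$ has unit norm.

```python
import numpy as np

def decompose(theta_k, mu_k):
    """Split theta_k into its projection on mu_k and the orthogonal remainder."""
    parallel = mu_k * (mu_k @ theta_k)     # mu_k mu_k^T theta_k
    return parallel, theta_k - parallel

rng = np.random.default_rng(0)
mu_k = rng.standard_normal(5)
mu_k /= np.linalg.norm(mu_k)               # ground-truth direction, unit norm
theta_k = rng.standard_normal(5)

theta_par, theta_perp = decompose(theta_k, mu_k)
```

The norm of the parallel part equals $|\mu_k^T\theta_k|$, which is the quantity that appears in the alignment $|\mu_k^T\theta_k| / \|\theta_k\|_2$ tracked throughout this appendix.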

E.2 PRELIMINARIES

The following facts about the derivative of the cross-entropy loss are easily derived.

Lemma E.2.
$$\frac{\partial L_i}{\partial\Theta} = \sum_j (p_{ij} - \delta_{ij})\frac{\partial z_{ij}}{\partial\Theta} = \sum_{j\neq i} p_{ij}\left(\frac{\partial z_{ij}}{\partial\Theta} - \frac{\partial z_{ii}}{\partial\Theta}\right),$$
where $\frac{\partial z_{ij}}{\partial\theta_k} = \left(u_k^{(i)} {v_k^{(j)}}^T + v_k^{(j)} {u_k^{(i)}}^T\right)\theta_k$.

We will also need the following facts on Gaussian random variables. The first, Stein's Lemma, is well known.

Lemma E.3 (Stein's Lemma). $E_{X\sim N(0,\sigma^2)}[Xf(X)] = \sigma^2 E_{X\sim N(0,\sigma^2)}[f'(X)]$.

The next two lemmas are proved in Section E.4.

Lemma E.4. There exists some constant $C$ such that the following holds. If $\sigma \le \frac{1}{C}$ and $0 \le t \le \frac{1}{\sigma}$, then for any $c \in \{0, 1, 2, 3\}$ and $X \sim N(0, \sigma^2)$ we have
$$E_X\left[|X|^c \exp(t|X|)\exp(tX^2)\right] \le C\sigma^c. \tag{5}$$
If additionally $d \in \{0, 1, 2, 3\}$, $\rho \le \frac{1}{C}$ and $Y \sim N(0, \rho^2)$, then
$$E_{X,Y}\left[|X|^c |Y|^d \exp(t|X|)\exp(|XY|)\right] \le C\sigma^c\rho^d. \tag{6}$$

Lemma E.5. For some universal constant $C$, for any $\sigma \in [0, 1]$, $t \ge 0$, $c \in \{0, 1, 2, 3, 4\}$, we have
$$E_{X\sim N(0,\sigma^2)}\left[(\exp(t|X|) - 1)\,|X|^c\right] \le Ct\sigma^c.$$
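Stein's Lemma (Lemma E.3) is used repeatedly in the proofs, so a quick numerical check may be a helpful reference. The sketch below evaluates both sides with Gauss-Hermite quadrature for the test function $f(x) = \sin(x)$; the choice of $f$, of $\sigma$, and of the quadrature degree are assumptions made purely for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(g, sigma, degree=60):
    """E[g(X)] for X ~ N(0, sigma^2), using Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(degree)      # weight function exp(-x^2 / 2)
    return float(np.sum(weights * g(sigma * nodes)) / np.sqrt(2.0 * np.pi))

sigma = 0.7
lhs = gaussian_expectation(lambda x: x * np.sin(x), sigma)   # E[X f(X)]
rhs = sigma ** 2 * gaussian_expectation(np.cos, sigma)       # sigma^2 E[f'(X)]
closed_form = sigma ** 2 * np.exp(-sigma ** 2 / 2.0)         # since E[cos X] = exp(-sigma^2 / 2)
```

All three quantities should agree up to quadrature error, which illustrates how the lemma trades a factor of $X$ for a derivative and a factor of $\sigma^2$.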

E.3 APPROACH AND LEMMAS

Intuition for the proof of Theorem 4.1. Our proof involves comparing the gradient of the loss in the $\theta_k$ direction, $\nabla_k := \frac{\partial}{\partial\theta_k}L$, in the setting with noise to the setting without noise. Loosely, our goal is to show that for any $k$, the projection of this gradient onto the ground truth direction, $\mu_k^T\nabla_k\,\mathrm{sign}(\mu_k^T\theta_k)$, increases when we increase the noise. The main intuition comes from an expansion of this gradient in Lemma E.7, which shows that $E\left[\mu_k^T\nabla_k\,\mathrm{sign}(\mu_k^T\theta_k)\right]$ approximately scales with $\sum_i (1 - p_{ii})$. Now observe that $p_{ii}$, the probability of correctly matching the $i$th view to its pair, decreases when we add noise to feature $k'$. Thus adding noise will increase $\mu_k^T\nabla_k\,\mathrm{sign}(\mu_k^T\theta_k)$, thereby improving the alignment.

In the remainder of this section, we outline our proof of Theorem 4.1. We prove all the lemmas below in Section E.4. To understand $E_{U,V}\arccos\left(\frac{|\mu_k^T\theta_k^{(t+1)}|}{\|\theta_k^{(t+1)}\|_2}\right)$ for a small enough step size, we first claim that it suffices to understand the expected projection of the gradient with respect to $\theta_k$ in the $\mu_k$ direction and in the $\theta_k$ direction. We use the notation $\nabla_k = \frac{\partial L(\Theta; U, V)}{\partial\theta_k}$.

Lemma E.6. Let $\theta_k^+ = \theta_k - \eta(\nabla_k + \lambda\theta_k)$. Then
$$\lim_{\eta\to 0}\frac{1}{\eta}E_{U,V}\left[\arccos\left(\frac{|\mu_k^T\theta_k^+|}{\|\theta_k^+\|_2}\right) - \arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\right] = N\,E_{U,V}\left[-(\mu_k^T\theta_k)(\mu_k^T\nabla_k) + \theta_k^T\nabla_k\frac{(\mu_k^T\theta_k)^2}{\|\theta_k\|_2^2}\right], \tag{7}$$
where $N$ is some negative value that depends only on $\theta_k$.

Now, since we care about the quantity
$$E_{U,V}\arccos\left(\frac{|\mu_k^T\theta_k^{(t+1)}|}{\|\theta_k^{(t+1)}\|_2}\right) - E_{U,\tilde V}\arccos\left(\frac{|\mu_k^T\tilde\theta_k^{(t+1)}|}{\|\tilde\theta_k^{(t+1)}\|_2}\right)$$
being positive, it suffices to show that the derivative
$$\frac{d}{dx}E_{U,V\sim P_x}\left[-(\mu_k^T\theta_k)(\mu_k^T\nabla_k) + \theta_k^T\nabla_k\frac{(\mu_k^T\theta_k)^2}{\|\theta_k\|_2^2}\right]$$
is negative for all $x \in [\tilde\alpha_{k'}, \alpha_{k'}]$.
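Indeed, from Lemma E.6, we have that
$$\lim_{\eta\to 0}\frac{1}{\eta}\left(E_{U,V\sim P_{\alpha_{k'}}}\arccos\left(\frac{|\mu_k^T\theta_k^+|}{\|\theta_k^+\|_2}\right) - E_{U,V\sim P_{\tilde\alpha_{k'}}}\arccos\left(\frac{|\mu_k^T\theta_k^+|}{\|\theta_k^+\|_2}\right)\right) \tag{8}$$
$$= N\int_{\tilde\alpha_{k'}}^{\alpha_{k'}}\frac{d}{dx}E_{U,V\sim P_x}\left[-(\mu_k^T\theta_k)(\mu_k^T\nabla_k) + \theta_k^T\nabla_k\frac{(\mu_k^T\theta_k)^2}{\|\theta_k\|_2^2}\right]dx,$$
so if the derivative is negative for the full range, then the difference in arccosines is positive.

In the following lemma we compute the derivative of $E[\nabla_k]$ with respect to $x$.

Lemma E.7.
$$\frac{d}{dx}E_{U,V\sim P_x}[\nabla_k] = m\frac{d}{dx}E_{U,V\sim P_x}\left[\frac{\partial L_i}{\partial\theta_k}\right] = \frac{-m}{1-x^2}\,\theta_{k'}^T\mu_{k'}\sum_{j\neq i}E_{U,V\sim P_x}\left[p_{ij}p_{ii}\,\theta_{k'}^Tu_{k'}\left(\mu_{k'}^Tu_{k'}^{(i)} - x\mu_{k'}^Tv_{k'}^{(i)}\right)\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k}\right].$$

We will analyze this quantity by explicitly taking the expectation with respect to some set of random variables. Let $S = \{U_k, V_k, U_{k'}, V_{k'}\}$ consist of the random variables $u_{k'}^{(i)}$, $u_k^{(i)}$, $v_{k'}^{(i)}$, and $v_k^{(i)}$ for all $i \in [m]$. Define $q_{ij}$ to be the value of $p_{ij}$ when all variables in $S$ are set to 0. Thus explicitly,
$$q_{ij} = \frac{\exp\left(\sum_{\tilde k\neq k,k'}\theta_{\tilde k}^Tu_{\tilde k}^{(i)}\,\theta_{\tilde k}^Tv_{\tilde k}^{(j)}\right)}{\sum_{j'}\exp\left(\sum_{\tilde k\neq k,k'}\theta_{\tilde k}^Tu_{\tilde k}^{(i)}\,\theta_{\tilde k}^Tv_{\tilde k}^{(j')}\right)}.$$
We will use the notation $j \sim q$ to denote the distribution on $[m]$ with mass $q_{ij}$ on $j$. Let
$$h(S) := \theta_{k'}^Tu_{k'}\left(\mu_{k'}^Tu_{k'}^{(i)} - x\mu_{k'}^Tv_{k'}^{(i)}\right)\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k},$$
and
$$h_1(S) = \theta_{k'}^Tu_{k'}\,(1-x^2)\,\mu_{k'}^Tu_{k'}^{(i)}\,2\alpha_k(\mu_k^Tu_k)\left((\theta_k^{\parallel})^Tu_k\right)\mu_k^T,$$
which are the terms that appear in the right hand side of Lemma E.7 after $p_{ii}p_{ij}$. Observe that $E_S[h(S) - h_1(S)] = 0$.

The following four lemmas serve to bound $\frac{d}{dx}E_S\,\mu_k^T\nabla_k$ and $\frac{d}{dx}E_S\,\theta_k^T\nabla_k$. We call the terms of the form $E\,p_{ii}p_{ij}(h(S) - h_1(S))$ "junk" terms, and our goal will be to show that these terms are small. We will control more closely the terms of the form $E\,p_{ii}p_{ij}\,h_1(S)$.

Lemma E.8 (Junk terms for the $\mu_k$ term). If $\|\theta_k\| \le 1$ and $\|\theta_{k'}\| \le 1$, then for some universal constant $C$,
$$\left|E_S\left[p_{ii}p_{ij}\,\mu_k^T(h(S) - h_1(S))\right]\right| \le Cq_{ii}q_{ij}\left(\|\theta_{k'}\|^3\|\theta_k\|^3 + \|\theta_{k'}^{\parallel}\|\|\theta_k\|^3 + \alpha_k\|\theta_{k'}\|^3\|\theta_k^{\parallel}\|\right).$$

Lemma E.9 (Good term for the $\mu_k$ term). If $\|\theta_k\| \le 1$ and $\|\theta_{k'}\| \le 1$, then for some universal constant $C$,
$$E_S\left[p_{ii}p_{ij}\,\mu_k^Th_1(S)\right] \ge 2\alpha_k(1-x^2)\,q_{ii}q_{ij}\,\|\theta_{k'}^{\parallel}\|\|\theta_k^{\parallel}\|\left(1 - C(\|\theta_{k'}\|^2 + \|\theta_k\|^2)\right).$$

Plugging these two lemmas into Lemma E.7 yields the following corollary.

Corollary E.9.1 (Total $\mu_k$ term). If for a sufficiently large constant $C$, $|\theta_k^T\mu_k| \le \frac{1-\alpha_{k'}^2}{C}\|\theta_k\|$, $\|\theta_{k'}\|^3 \le |\theta_{k'}^T\mu_{k'}|$, and $\|\theta_k\|^2 \le \frac{\alpha_k(1-\alpha_{k'}^2)}{C}$, then
$$(\mu_k^T\theta_k)\frac{d}{dx}E_{P_x}\,\mu_k^T\nabla_k \ge \frac{m}{2}E_{U,V\setminus S}\left[\sum_{i,j}q_{ii}q_{ij}\,2\alpha_k\|\theta_{k'}^{\parallel}\|^2\|\theta_k^{\parallel}\|^2\right].$$

Lemma E.10 (Junk terms for the $\theta_k$ term). If $\|\theta_k\| \le 1$ and $\|\theta_{k'}\| \le 1$, then for some universal constant $C$,
$$\left|E_S\left[p_{ii}p_{ij}\,\theta_k^T(h(S) - h_1(S))\right]\right| \le Cq_{ii}q_{ij}\left(\|\theta_{k'}\|^3\|\theta_k\|^4 + \|\theta_{k'}^{\parallel}\|\|\theta_k\|^4 + \alpha_k\|\theta_{k'}\|^3\|\theta_k\|\|\theta_k^{\parallel}\| + \|\theta_{k'}^{\parallel}\|\|\theta_k\|^3\|\theta_k^{\parallel}\|\right).$$

Lemma E.11 (Good term for the $\theta_k$ term). If $\|\theta_k\| \le 1$ and $\|\theta_{k'}\| \le 1$, then for some universal constant $C$,
$$E_S\left[p_{ii}p_{ij}\,\theta_k^Th_1(S)\right] \le (1-x^2)\,2\alpha_k\,q_{ii}q_{ij}\,\|\theta_{k'}^{\parallel}\|\|\theta_k^{\parallel}\|^2\left(1 + C(\|\theta_{k'}\|^2 + \|\theta_k\|^2)\right).$$

Plugging these two lemmas into Lemma E.7 yields the following corollary.

Corollary E.11.1 (Total $\theta_k$ term). If for a sufficiently large constant $C$, $\|\theta_k^{\parallel}\| \le \frac{1-x^2}{C}\|\theta_k\|$, $\|\theta_{k'}\|^3 \le \|\theta_{k'}^{\parallel}\|$, and $\|\theta_k\|^2 \le \frac{\alpha_k(1-x^2)}{C}$, then
$$\frac{(\mu_k^T\theta_k)^2}{\|\theta_k\|^2}\frac{d}{dx}E_{P_x}\,\theta_k^T\nabla_k \le \frac{m}{2}E_{U,V\setminus S}\left[\sum_{i,j}q_{ii}q_{ij}\,\alpha_k\|\theta_{k'}^{\parallel}\|^2\|\theta_k^{\parallel}\|^2\right].$$

Combining Corollaries E.9.1 and E.11.1, we obtain the following lemma.

Lemma E.12. If for a sufficiently large constant $C$, $\|\theta_k^{\parallel}\| \le \frac{1-x^2}{C}\|\theta_k\|$, $\|\theta_{k'}\|^3 \le \|\theta_{k'}^{\parallel}\|$, and $\|\theta_k\|^2 \le \frac{\alpha_k(1-x^2)}{C}$, then
$$\frac{d}{dx}E_{U,V\sim P_x}\left[-(\mu_k^T\theta_k)(\mu_k^T\nabla_k) + \theta_k^T\nabla_k\frac{(\mu_k^T\theta_k)^2}{\|\theta_k\|_2^2}\right] < 0.$$

Theorem 4.1 now follows.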
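The intuition behind this argument, that the matching probability $p_{ii}$ drops when noise is added to feature $k'$, can be illustrated with a small Monte Carlo sketch. It assumes every $\theta_k$ is perfectly aligned with $\mu_k$ at a fixed scale and models noise on one feature by setting its correlation coefficient to zero; the dimensions, batch size, and number of batches are arbitrary choices for the example.

```python
import numpy as np

def mean_match_probability(alpha, scale=1.0, d=5, m=25, n_batches=400, seed=0):
    """Monte Carlo estimate of E[p_ii] when theta_k = scale * mu_k for every k."""
    rng = np.random.default_rng(seed)
    K = alpha.size
    mu = rng.standard_normal((K, d))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    theta = scale * mu
    total = 0.0
    for _ in range(n_batches):
        u = rng.standard_normal((m, K, d))
        v = rng.standard_normal((m, K, d))
        pu = np.einsum("mkd,kd->mk", u, mu)
        pv = np.einsum("mkd,kd->mk", v, mu)
        # Set the mu-component of v to alpha * (mu . u) + sqrt(1 - alpha^2) * noise.
        target = alpha * pu + np.sqrt(1.0 - alpha ** 2) * pv
        v = v + (target - pv)[..., None] * mu
        a = np.einsum("mkd,kd->mk", u, theta)
        b = np.einsum("mkd,kd->mk", v, theta)
        z = a @ b.T                                  # similarity scores z_ij
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        total += float(np.mean(np.diag(p)))
    return total / n_batches

p_clean = mean_match_probability(np.array([1.0, 1.0, 1.0]))
p_noisy = mean_match_probability(np.array([0.0, 1.0, 1.0]))  # noise on feature 0
```

Both calls share a seed, so they use the same underlying Gaussian draws and differ only in the correlation of the first feature. A lower `p_noisy` than `p_clean` corresponds to a larger $\sum_i (1 - p_{ii})$, which is the factor that scales the useful gradient component in the intuition above.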

E.4 PROOFS OF LEMMAS

To prove the Lemmas E.4 and E.5, we will use the following well-known formula for the moment generating function (MGF) of the half-normal distribution. Lemma E.13 (MGF of half-normal distribution). The MGF of the half-normal distribution is E X∼N (0,1)|X>0 [e t|X| ] = 2e t 2 /2 Φ(t) , where Φ(t) is the cumulative distribution of a normal random variable. Proof of Lemma E.4. E X |X| c exp(t|X|) exp(tX 2 ) = 1 σ √ 2π ∞ -∞ |x| c exp(t|x|) exp(tx 2 ) exp - x 2 2σ 2 dx = √ 1 -2σ 2 t σ √ 1-2σ 2 t √ 2π ∞ -∞ |x| c exp(t|x|) exp   - x 2 2 σ √ 1-2σ 2 t 2    dx = 1 -2σ 2 tE Z∼N (0,r)|Z≥0 [Z c exp(tZ)], where r = σ √ 1-2σ 2 t . To evaluate this, we use the MGF of the half-normal distribution in Lemma E.13. Thus for some constant C, for all c ∈ {1, 2, 3, 4}, E X∼N (0,1)|X>0 c!|X| c e t|X| ≤ E X∼N (0,1)|X>0 d c dt c e t|X| ≤ C (1 + t c ) e t 2 /2 . So for some constant C (whose value changes throughout this equation), so long as σ ≤ 1 C , 1 -2σ 2 tE Z∼N (0,r)|Z≥0 [Z c exp(tZ)] = 1 -2σ 2 tE X∼N (0,1)|Z≥0 [r c Z c exp(rtZ)] ≤ 1 -2σ 2 tCr c (1 + (tr) c ) e (tr) 2 /2 ≤ Cσ c . This proves the first statement in the lemma. To prove the second, we first take the expectation over X, and using the half-Gaussian MGF as before, we obtain E X E Y |X| c |Y | d exp(t|X|) exp(|XY |) ≤ CE Y |Y | d σ c (1 + (t + |Y |) c )e (t+|Y |) 2 /2 Now applying the first statement to take the expectation over Y , we obtain E Y |Y | d (1 + (t + |Y |) c )e (t+|Y |) 2 /2 ≤ Cσ c ρ d . Proof of Lemma E.5. We prove the lemma by induction on c. Suppose c = 0. Then by plugging in the MGF for the half-normal distribution from Lemma E.13, for some constant C, we have E X∼N (0,1)|X>0 [(e t|X| -1)] = 2e t 2 /2 Φ(t) -1 (13) ≤ 2e t 2 /2 1 + t 2 -1 (14) ≤ e t 2 /2 -1 + te t 2 /2 (15) ≤ Ct, thus E X∼N (0,σ 2 ) [(e t|X| -1)] = E X∼N (0,σ 2 )|X>0 [(e σt|X| -1)] ≤ Ctσ. 
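Now for $c \ge 1$, by Stein's Lemma, we have (for a new constant $C$),
$$E_{X\sim N(0,\sigma^2)}\left[|X|^c(e^{t|X|} - 1)\right] = E_{X\sim N(0,\sigma^2)}\left[X|X|^{c-1}\mathrm{sign}(X)(e^{t|X|} - 1)\right] \tag{17}$$
$$= \sigma^2E_{X\sim N(0,\sigma^2)}\left[\frac{d}{dX}\left(|X|^{c-1}\mathrm{sign}(X)(e^{t|X|} - 1)\right)\right] \tag{18}$$
$$= \sigma^2E_{X\sim N(0,\sigma^2)}\left[(c-2)|X|^{c-2}(e^{t|X|} - 1) + |X|^{c-1}(te^{t|X|})\right] \le Ct\sigma^{c+1},$$
where in the last step we used the inductive hypothesis and Lemma E.4.

Proof of Lemma E.6. First observe that
$$\lim_{\eta\to 0}\frac{1}{\eta}E_{U,V}\left[\arccos\left(\frac{|\mu_k^T\theta_k^+|}{\|\theta_k^+\|_2}\right) - \arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\right]$$
$$= \lim_{\eta\to 0}\frac{1}{\eta}E_{U,V}\left[\arccos\left(\frac{|\mu_k^T(\theta_k(1-\eta\lambda) - \eta\nabla_k)|}{\|\theta_k(1-\eta\lambda) - \eta\nabla_k\|_2}\right) - \arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\right]$$
$$= \lim_{\eta\to 0}\frac{1}{\eta}E_{U,V}\left[\arccos\left(\frac{|\mu_k^T(\theta_k - \frac{\eta}{1-\eta\lambda}\nabla_k)|}{\|\theta_k - \frac{\eta}{1-\eta\lambda}\nabla_k\|_2}\right) - \arccos\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\right]$$
$$= E_{U,V}\left[\frac{d}{d\eta}\arccos\left(\frac{|\mu_k^T(\theta_k - \eta\nabla_k)|}{\|\theta_k - \eta\nabla_k\|_2}\right)(0)\right],$$
since $\lim_{\eta\to 0}\frac{\eta}{1-\eta\lambda} = 0$. Now
$$\frac{d}{d\eta}\arccos\left(\frac{|\mu_k^T(\theta_k - \eta\nabla_k)|}{\|\theta_k - \eta\nabla_k\|_2}\right)(0) = \arccos'\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\frac{d}{d\eta}\left(\frac{|\mu_k^T(\theta_k - \eta\nabla_k)|}{\|\theta_k - \eta\nabla_k\|_2}\right)(0)$$
$$= \arccos'\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\left(\frac{-\mathrm{sign}(\mu_k^T\theta_k)\,\mu_k^T\nabla_k\,\|\theta_k\| + |\mu_k^T\theta_k|\frac{\theta_k^T\nabla_k}{\|\theta_k\|}}{\|\theta_k\|_2^2}\right)$$
$$= N\left(-\mu_k^T\theta_k\,\mu_k^T\nabla_k + \frac{(\mu_k^T\theta_k)^2\,\theta_k^T\nabla_k}{\|\theta_k\|^2}\right),$$
where $N = \arccos'\left(\frac{|\mu_k^T\theta_k|}{\|\theta_k\|_2}\right)\frac{1}{\|\theta_k\||\mu_k^T\theta_k|}$. The lemma follows by taking the expectation over $U, V$, and observing that the derivative of $\arccos(x)$ is negative whenever $x$ is positive.

Proof of Lemma E.7. First observe that by symmetry, we have
$$\frac{d}{dx}E_{U,V\sim P_x}[\nabla_k] = m\frac{d}{dx}E_{U,V\sim P_x}\left[\frac{\partial L_i}{\partial\theta_k}\right].$$
To make this expectation easier to analyze, we express the random variable $(U(x), V(x)) \sim P_x$ as an interpolation of Gaussians in the coordinate $\mu_{k'}^Tv_{k'}^{(i)}$. Let $\xi \sim N(0, 1)$, and define $(U, V) \sim P_1$, such that $\mu_{k'}^Tv_{k'}^{(i)} = \mu_{k'}^Tu_{k'}^{(i)}$. For $x \in [0, 1)$, define $(U(x), V(x))$ to have
$$\mu_{k'}^Tv_{k'}^{(i)}(x) = x\mu_{k'}^Tu_{k'}^{(i)} + \sqrt{1-x^2}\,\xi, \tag{21}$$
and otherwise be the same as $(U, V)$. It is easy to check that $(U(x), V(x)) \sim P_x$. Now
$$\frac{d}{dx}E_{U,V\sim P_x}\left[\frac{\partial L_i(\Theta; U, V)}{\partial\theta_k}\right] = E_{U,V\sim P_1,\xi}\left[\frac{d}{dx}\frac{\partial L_i(\Theta; U(x), V(x))}{\partial\theta_k}\right].$$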
Taking the derivative of the cross-entropy loss, we have
$$\frac{d}{dx}\frac{\partial L_i(\Theta; U(x), V(x))}{\partial\theta_k} = \frac{d}{dx}\left(\sum_{j\neq i}p_{ij}\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k}\right)$$
$$= \sum_{j\neq i}\frac{dp_{ij}}{d\mu_{k'}^Tv_{k'}^{(i)}(x)}\,\frac{d\mu_{k'}^Tv_{k'}^{(i)}(x)}{dx}\,\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k}$$
$$= \sum_{j\neq i}-p_{ij}p_{ii}\frac{dz_{ii}}{d\mu_{k'}^Tv_{k'}^{(i)}(x)}\left(\mu_{k'}^Tu_{k'}^{(i)} - \frac{x}{\sqrt{1-x^2}}\xi\right)\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k},$$
where the variables $z_{ij}$ and $p_{ij}$ are the similarity scores and the softmaxes from the data $(U(x), V(x))$. Here the first line is by Lemma E.2, and the second line holds by the chain rule since $\frac{\partial z_{ij}}{\partial\theta_k} - \frac{\partial z_{ii}}{\partial\theta_k}$ does not depend on $v_{k'}^{(i)}$. The third line uses the proof of Claim E.14 to take the derivative of $p_{ij}$, and Equation 21 to take the derivative of $\mu_{k'}^Tv_{k'}^{(i)}(x)$. Now we reparameterize $\mu_{k'}^Tu_{k'}^{(i)} - \frac{x}{\sqrt{1-x^2}}\xi$ as follows:
$$\mu_{k'}^Tu_{k'}^{(i)} - \frac{x}{\sqrt{1-x^2}}\xi = \frac{1}{1-x^2}\mu_{k'}^Tu_{k'}^{(i)} - \frac{x}{1-x^2}\mu_{k'}^Tv_{k'}^{(i)}(x).$$
Plugging in this reparameterization and $\frac{dz_{ii}}{d\mu_{k'}^Tv_{k'}^{(i)}(x)} = \theta_{k'}^T\mu_{k'}\,\theta_{k'}^Tu_{k'}$, we obtain
$$\frac{d}{dx}E_{U,V\sim P_x}\left[\frac{\partial L_i(\Theta; U, V)}{\partial\theta_k}\right] = \frac{-1}{1-x^2}\sum_{j\neq i}E_{U,V\sim P_x}\left[p_{ij}p_{ii}\,\theta_{k'}^T\mu_{k'}\,\theta_{k'}^Tu_{k'}\left(\mu_{k'}^Tu_{k'}^{(i)} - x\mu_{k'}^Tv_{k'}^{(i)}\right)\frac{\partial(z_{ij} - z_{ii})}{\partial\theta_k}\right].$$

We now prove Lemmas E.8, E.9, E.10, and E.11.

Notation. Since $i$ is fixed throughout, we drop the $(i)$ superscripts and let $u_k = u_k^{(i)}$ and $v_k = v_k^{(i)}$. We will introduce the following random variables, which are all independent, to simplify the exposition:
• $\xi_j := \theta_k^Tv_k^{(j)}$ for $j \neq i$. Thus $\xi_j \sim N(0, \|\theta_k\|^2)$.
• $\xi'_j := \theta_{k'}^Tv_{k'}^{(j)}$ for $j \neq i$. Thus $\xi'_j \sim N(0, \|\theta_{k'}\|^2)$.
• $\xi_i := (\theta_k^{\perp})^Tv_k + (\theta_k^{\parallel})^T(v_k - \alpha_ku_k)$. Thus $\xi_i \sim N(0, \|\theta_k^{\perp}\|^2 + (1-\alpha_k^2)\|\theta_k^{\parallel}\|^2)$.
• $\xi'_i := (\theta_{k'}^{\perp})^Tv_{k'}$. Thus $\xi'_i \sim N(0, \|\theta_{k'}^{\perp}\|^2)$.
• $\zeta'_i := (\theta_{k'}^{\parallel})^T(v_{k'} - \alpha_{k'}u_{k'})$. Thus $\zeta'_i \sim N(0, (1-\alpha_{k'}^2)\|\theta_{k'}^{\parallel}\|^2)$.
• $y = (\theta_k^{\parallel})^Tu_k$. Thus $y \sim N(0, \|\theta_k^{\parallel}\|^2)$.
• $y' = (\theta_{k'}^{\parallel})^Tu_{k'}$. Thus $y' \sim N(0, \|\theta_{k'}^{\parallel}\|^2)$.
• $\eta_i := (\theta_k^{\perp})^Tu_k$. Thus $\eta_i \sim N(0, \|\theta_k^{\perp}\|^2)$.
• η ′ i := (θ ⊥ k ′ ) T u k ′ . Thus η ′ i ∼ N (0, ∥θ ⊥ k ′ ∥ 2 ). For any such random variable X, we use σ 2 X to denote its variance. Observe that p ii p ij q ii q ij = exp θ T k u k θ T k v k exp θ T k ′ u k ′ θ T k ′ v k ′ E j ′ ∼q exp θ T k u k θ T k v (j ′ ) k exp θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ exp θ T k u k θ T k v (j) k exp θ T k ′ u k ′ θ T k ′ v (j) k ′ E j ′ ∼q exp θ T k u k θ T k v (j ′ ) k exp θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ . We will use the following two claims in the proofs of all four lemmas. Claim E.14. For β ∈ {ξ j , ξ ′ j , ξ i , ξ ′ i , ζ ′ i , η i , η ′ i , x, x ′ }, let βj ′ := ∂ ∂β θ T k u k θ T k v (j ′ ) k + θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ . Then ∂p ii p ij ∂β ≤ p ii p ij | βj | + | βi | + 2E j ′ ∼q | βj ′ | . If additionally γ ∈ {ξ j , ξ ′ j , ξ i , ξ ′ i , ζ ′ i , η i , η ′ i } and γ ⊥ { βj ′ } j ′ ∈[m] , then ∂ ∂γ ∂p ii p ij ∂β ≤ p ii p ij | βj | + | βi | + 2E j ′ ∼q | βj ′ | (|γ j | + |γ i | + 2E j ′ ∼q |γ j ′ |) + 2E j ′ ∼q | βj ′ γj ′ | + 2(E j ′ ∼q | βj ′ |)(E j ′ ∼q |γ j ′ |) . Proof. By a straightforward quotient-rule computation of the derivative of pij qij , recalling that q ij is independent of S, we obtain ∂p ij ∂β = p ij βj -E j ′ ∼q βj ′ p ij ′ . By applying product to the expression above, we obtain ∂p ii p ij ∂β = p ii p ij βj + βi -2E j ′ ∼q βj ′ p ij ′ . Taking absolute values and using the fact that p ij ′ ≤ 1, we obtain the first result. Next we take the derivative of p ij with respect to both β and γ. Using the expression above for ∂pij ∂β , we obtain ∂ ∂γ ∂p ij ∂β = p ij βj -E j ′ ∼q βj ′ p ij ′ (γ j -E j ′ ∼q γj ′ p ij ′ ) -E j ′ ∼q βj ′ γj ′ p ij ′ + (E j ′ ∼q βj ′ p ij ′ )(E j ′ ∼q γj ′ p ij ′ ) , and ∂ ∂γ ∂p ii p ij ∂β = p ii p ij βj + βi -2E j ′ ∼q βj ′ p ij ′ (γ j + γi -2E j ′ ∼q γj ′ p ij ′ ) -2E j ′ ∼q βj ′ γj ′ p ij ′ + 2(E j ′ ∼q βj ′ p ij ′ )(E j ′ ∼q γj ′ p ij ′ ) The second result follows by taking absolute values and the fact that p ij ′ ≤ 1. Claim E.15. 
p ij q ij ≤ exp |θ T k u k θ T k v (j) k | exp |θ T k ′ u k ′ θ T k ′ v (j) k ′ | E j ′ ∼q exp |θ T k u k θ T k v (j ′ ) k | exp |θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ | . Proof. This follows directly from using Jenson's inequality on the distribution j ′ ∼ q to show that 1 E j ′ ∼q exp θ T k u k θ T k v (j ′ ) k exp θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ ≤ E j ′ ∼q exp -θ T k u k θ T k v (j ′ ) k exp -θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ ≤ E j ′ ∼q exp |θ T k u k θ T k v (j ′ ) k | exp |θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ | . Claim E.16. 1 - p ij q ij ≤ Z j -1, where Z j := exp |θ T k u k θ T k v (j) k | exp |θ T k ′ u k ′ θ T k ′ v (j) k ′ | E j ′ ∼q exp |θ T k u k θ T k v (j ′ ) k | exp |θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ | . Proof. Note that for any x ≥ 0, we have |1 -x| ≤ max x -1, 1 x -1 . By Claim E.15, pij qij -1 is at most the desired value given in this claim. Now q ij p ij = E j ′ ∼q exp θ T k u k θ T k v (j ′ ) k exp θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ exp θ T k u k θ T k v (j) k exp θ T k ′ u k ′ θ T k ′ v (j) k ′ ≤ exp |θ T k u k θ T k v (j) k | exp |θ T k ′ u k ′ θ T k ′ v (j) k ′ | E j ′ ∼q exp |θ T k u k θ T k v (j ′ ) k | exp |θ T k ′ u k ′ θ T k ′ v (j ′ ) k ′ | . This yields the claim. Proof of Lemma E.8. Expanding h(S) -h 1 (S), we see that we need to control the following terms: 1 . (a) E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ j , (b) E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ j 2. (a) E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ i , (b) E S p ii p ij (y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ i 3. (a) α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k y , (b) α k E S p ii p ij (y ′ (-xξ ′ i )) µ T k u k y 4. (a) E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ ξ i (v k -v (j) k ) T µ k (b) E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ ξ i (v k -v (j) k ) T µ k 5. 
(a) α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ y(v k -v (j) k ) T µ k (b) E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ y(v k -α k u k -v (j) k ) T µ k We begin by bounding the terms where the expression after p ii p ij has two independent mean-0 terms, mainly (1a), (2a), (4a). The first step is to apply Stein's Lemma (Lemma E.3) twice to these two terms, which we will call β and γ. Let βγg(S \ {β, γ}) be the terms after p ii p ij . Then we have |E S [p ii p ij βγg(S \ {β, γ})]| ≤ σ 2 β σ 2 γ E S ∂ ∂γ ∂p ii p ij ∂β |g(S \ {β, γ})| . Next we apply the final result in Claim E.14 to bound the absolute value of ∂ ∂γ ∂piipij ∂β . Once we do this, we achieve |E S [p ii p ij βγg(S \ {β, γ})]| ≤ σ 2 β σ 2 γ q ii q ij E S   Z|g(S \ {β, γ})| j ′ ,ℓ∈[m] c j ′ ,ℓ | βj ′ || γℓ |   , where j ′ ,ℓ∈[m] c j ′ ,ℓ ≤ C for some constant C, and Z := piipij qiiqij . Finally, we use the bound on Z from Claim E.15, and then Lemma E.4 to take the expectation over S, iteratively applying Lemma E.4 to each variable in S. Thus we have, for some (different) constant C, 1. E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ j ≤ Cq ii q ij σ 2 η ′ i σ 2 ξj ∥θ k ′ ∥∥θ k ∥ = Cq ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥ 3 ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 3 . 2. E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ i ≤ Cq ii q ij σ 2 η ′ i σ 2 ξi ∥θ k ′ ∥∥θ k ∥ ≤ Cq ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥ 3 ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 3 . 3. E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ ξ i (v k -v (j) k ) T µ k ≤ Cq ii q ij σ 2 η ′ i σ 2 ξi ∥θ k ′ ∥∥θ k ∥ ≤ Cq ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥ 3 ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 3 . Now we consider the remaining 7 terms. Here we decompose the expression inside the expectation as p ii p ij βg(S \ β), where β ∈ S. We proceed as before, but we only apply Stein's Lemma once, to β. 
Applying Steins, the expression for ∂piipij ∂β given in the first result of Claim E.14, we obtain |E S [p ii p ij βg(S \ β)]| ≤ σ 2 β E S ∂p ii p ij ∂β |g(S \ β)| ≤ σ 2 β q ii q ij E S   Z|g(S \ β)| j ′ ∈[m] c j ′ | βj ′ |   , where j ′ ∈[m] c j ′ ≤ C for some constant C, and Z := piipij qiiqij . Finally, we plug in a bound for Z in Claim E.15, an use Lemma E.4 to take the expectation over S, again iteratively over each variable. Thus we have, for some (different) constant C, 1. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ j ≤ Cq ii q ij σ 2 ξj ∥θ k ∥∥θ ∥ k ′ ∥ = Cq ii q ij ∥θ k ∥ 3 ∥θ ∥ k ′ ∥. 2. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k ξ i ≤ Cq ii q ij σ 2 ξi ∥θ k ∥∥θ ∥ k ′ ∥ ≤ Cq ii q ij ∥θ k ∥ 3 ∥θ ∥ k ′ ∥. 3. α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ µ T k u k y ≤ Cα k q ii q ij σ 2 η ′ i ∥θ k ′ ∥∥θ ∥ k ∥ = Cα k q ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ ∥ k ∥ ≤ Cα k q ii q ij ∥θ k ′ ∥ 3 ∥θ ∥ k ∥. 4. α k E S p ii p ij (y ′ (-xζ ′ i )) µ T k u k y ≤ Cα k q ii q ij σ 2 ζ ′ i ∥θ k ′ ∥∥θ ∥ k ∥ = Cα k q ii q ij ∥θ ∥ k ′ ∥ 2 ∥θ k ′ ∥∥θ ∥ k ∥. 5. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ ξ i (v k -v (j) k ) T µ k ≤ Cq ii q ij σ 2 ξi ∥θ k ∥∥θ ∥ k ′ ∥ ≤ Cq ii q ij ∥θ k ∥ 3 ∥θ ∥ k ′ ∥. 6. α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ x(v k -v (j) k ) T µ k ≤ Cα k q ii q ij σ 2 η ′ i ∥θ k ′ ∥∥θ ∥ k ∥ = Cα k q ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ ∥ k ∥ ≤ Cα k q ii q ij ∥θ k ′ ∥ 3 ∥θ ∥ k ∥. 7. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ x(v k -α k u k -v (j) k ) T µ k ≤ Cq ii q ij σ 2 x ∥θ k ∥∥θ ∥ k ′ ∥ = Cq ii q ij ∥θ ∥ k ∥ 2 ∥θ k ∥∥θ ∥ k ′ ∥. Combining the bounds on these 10 terms proves the lemma: E S p ii p ij µ T k (h(S) -h 1 (S)) ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 3 + ∥θ ∥ k ′ ∥∥θ k ∥ 3 + α k ∥θ k ′ ∥ 3 ∥θ ∥ k ∥ . Proof of Lemma E.10. The proof of Lemma E.10 is nearly identical, besides some differences in the terms we need to bound. We list them below: 1. 
(a) E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ j (b) E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ j 2. (a) E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ i (b) E S p ii p ij (y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ i 3. (a) α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k y (b) α k E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ (η i y) 4. α k E S p ii p ij (y ′ (-xζ ′ i )) θ T k u k y We use the same approach as before. For the terms (1a) and (2a) we apply Stein's Lemma to (η ′ i , ξ j ) and (η ′ i , ξ i ) respectively. For (1b), (2b), (3a) and (3b) and (4), we apply Stein's Lemma to ξ j , ξ i , η ′ i , η i , and ξ ′ i respectively. Using Claim E.15 and then Lemma E.4 as before, we obtain the following result: 1. E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ j ≤ Cq ii q ij σ 2 η ′ i σ 2 ξj ∥θ k ′ ∥∥θ k ∥∥θ k ∥ = Cq ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥ 4 ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 4 . 2. E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ i ≤ Cq ii q ij σ 2 η ′ i σ 2 ξi ∥θ k ′ ∥∥θ k ∥∥θ k ∥ ≤ Cq ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥ 4 ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 4 . 3. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ j ≤ Cq ii q ij σ 2 ξj ∥θ k ∥∥θ k ∥∥θ ∥ k ′ ∥ = Cq ii q ij ∥θ k ∥ 4 ∥θ ∥ k ′ ∥ 4. E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k ξ i ≤ Cq ii q ij σ 2 ξi ∥θ k ∥∥θ k ∥∥θ ∥ k ′ ∥ ≤ Cq ii q ij ∥θ k ∥ 4 ∥θ ∥ k ′ ∥ 5. α k E S p ii p ij η ′ i µ T k ′ u k ′ -xµ T k ′ v k ′ θ T k u k y ≤ Cα k q ii q ij σ 2 η ′ i ∥θ k ′ ∥∥θ k ∥∥θ ∥ k ∥ = Cα k q ii q ij ∥θ ⊥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥∥θ ∥ k ∥ 6. α k E S p ii p ij y ′ µ T k ′ u k ′ -xµ T k ′ v k ′ (η i y) ≤ Cα k q ii q ij σ 2 ηi ∥θ k ∥∥θ ∥ k ′ ∥∥θ ∥ k ∥ = Cα k q ii q ij ∥θ ⊥ k ∥ 2 ∥θ k ∥∥θ ∥ k ′ ∥∥θ ∥ k ∥. 7. α k E S p ii p ij (y ′ (-xζ ′ i )) θ T k u k y ≤ Cα k q ii q ij σ 2 ζ ′ i ∥θ k ′ ∥∥θ k ∥∥θ ∥ k ∥ ≤ Cα k q ii q ij ∥θ ∥ k ′ ∥ 2 ∥θ k ′ ∥∥θ k ∥∥θ ∥ k ∥. 
Combining the bounds on these 7 terms, proves the lemma: E S p ii p ij θ T k (h(S) -h 1 (S)) ≤ Cq ii q ij ∥θ k ′ ∥ 3 ∥θ k ∥ 4 + ∥θ ∥ k ′ ∥∥θ k ∥ 4 + α k ∥θ k ′ ∥ 3 ∥θ k ∥∥θ ∥ k ∥ + ∥θ ∥ k ′ ∥∥θ k ∥ 3 ∥θ ∥ k ∥ . We now prove the lemmas on the non-junk terms. Proof of Lemma E.9. E S p ii p ij (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2µ T k u k α k (θ ∥ k ) T u k = E S q ii q ij (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2µ T k u k α k (θ ∥ k ) T u k + E S (p ii p ij -q ii q ij ) (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2µ T k u k α k (θ ∥ k ) T u k = 2α k q ii q ij θ T k ′ µ k ′ θ T k µ k + 2α k q ii q ij E S p ii p ij q ii q ij -1 (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ µ T k u k (θ ∥ k ) T u k . Now by Claim E.16, we have piipij qiiqij -1 ≤ Z i Z j -1 (where the variable's Z i , Z j defined in the Claim E.16) so E S p ii p ij q ii q ij -1 (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ µ T k u k (θ ∥ k ) T u k ≤ E S (Z i Z j -1) (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ µ T k u k (θ ∥ k ) T u k ≤ C ∥θ k ∥ 2 + ∥θ k ′ ∥ 2 ∥θ ∥ k ′ ∥∥θ ∥ k ∥. Here the second inequality follows from applying Lemma E.5 first, and then Lemma E.4 repeatedly for the remainder of the variables in S. This proves the lemma. Note that we need to apply Lemma E.5 several times to a single variable X ∈ S. Indeed we can write (Z i Z j -1) (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ µ T k u k (θ ∥ k ) T u k = (E ℓ exp(|t ℓ X|)S ℓ -1) B|X| c = (E ℓ S ℓ (exp(|t ℓ X|) -1)) B|X| c + (E ℓ S ℓ -1)) B|X| c for some distribution on ℓ, and for some terms S ℓ , t ℓ , and B that are independent of X, and c ∈ {0, 1, 2}. Then to take the expectation of this term over X, we first apply Lemma E.5 to on X to the first term, and iteratively apply Lemma E.5 to the random variables appearing in the next terms. Proof of Lemma E.11. 
1 1 -x 2 E S p ii p ij θ T k h 1 (S) = E S p ii p ij (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2(θ ∥ k ) T u k α k (θ ∥ k ) T u k = E S q ii q ij (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2α k ((θ ∥ k ) T u k ) 2 + E S (p ii p ij -q ii q ij ) (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ 2α k ((θ ∥ k ) T u k ) 2 = 2α k q ii q ij θ T k ′ µ k ′ ∥θ ∥ k ∥ 2 + 2α k q ii q ij E S p ii p ij q ii q ij -1 (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ (θ ∥ k ) T u k 2 . Now by Claim E.16, we have piipij qiiqij -1 ≤ Z i Z j -1, so E S p ii p ij q ii q ij -1 (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ (θ ∥ k ) T u k 2 ≤ E S (Z i Z j -1) (θ ∥ k ′ ) T u k ′ u T k ′ µ k ′ (θ ∥ k ) T u k 2 ≤ C ∥θ k ∥ 2 + ∥θ k ′ ∥ 2 ∥θ ∥ k ′ ∥θ ∥ k ∥ 2 , Again the second inequality follows from applying Lemma E.5 first (several times as described in the previous lemma), and then Lemma E.4 repeatedly for the remainder of the variables in S. Taking absolute values proves the lemma.
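The Gaussian interpolation used in the proof of Lemma E.7, and the reparameterization of its derivative, can be checked numerically with the following sketch. It works with the scalar projections $\mu_{k'}^Tu_{k'}$ and $\mu_{k'}^Tv_{k'}(x)$ only; the value of $x$ and the sample size are arbitrary choices for illustration.

```python
import numpy as np

def interpolate(u_proj, xi, x):
    """mu^T v(x) = x * mu^T u + sqrt(1 - x^2) * xi, with xi ~ N(0, 1)."""
    return x * u_proj + np.sqrt(1.0 - x ** 2) * xi

rng = np.random.default_rng(0)
n = 200_000
u_proj = rng.standard_normal(n)        # samples of mu^T u
xi = rng.standard_normal(n)            # independent noise
x = 0.4                                # correlation coefficient of feature k'

v_proj = interpolate(u_proj, xi, x)

# Derivative of mu^T v(x) with respect to x, in its direct form ...
direct = u_proj - x / np.sqrt(1.0 - x ** 2) * xi
# ... and after substituting xi = (v - x u) / sqrt(1 - x^2).
reparam = (u_proj - x * v_proj) / (1.0 - x ** 2)

correlation = float(np.corrcoef(u_proj, v_proj)[0, 1])
variance = float(v_proj.var())
```

The two derivative expressions agree sample by sample, and the interpolated coordinate keeps unit variance while its correlation with $\mu_{k'}^Tu_{k'}$ is close to $x$, which is what makes $(U(x), V(x))$ distributed as $P_x$.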



Note that invariance is not to be confused with the related but distinct property of equivariance, often discussed as a desirable property of network architectures (e.g. see Fukushima & Miyake (1982); Chen et al. (2020a)).

Continuous reduction in contrast eventually produces single-color images, given finite-precision images.



Figure 1: Comparison of viewmaker and expert augmentations on datasets with multiple features. The viewmaker augmentations adapt to the particular semantics of the input data, and make targeted perturbations which remove the class-relevant information of the synthetic features (e.g. occluding the digit, shape, letter, or speech). Despite this, the encoder network is still able to learn strong representations. Rows (from top): Digits, Shapes, Letters, Audio. Columns (from left): Expert augmentations, viewmaker augmentations, difference between original and viewmaker augmentation, rescaled to [0,1]. Center image in each grid is the original. Audio Expert views shown are Spectral views.

Figure 2: We show how label-destroying augmentations can aid learning of other features in a linear contrastive setting: (a) The correlation of the kth feature of an augmentation pair, shown for d = 2. Each pair u (i) k

Figure 3: Non-label destroying Viewmaker perturbation examples.

Figure 4: Viewmaker augmentations stochastically drop out simple features added to the input. Probability of the correct answer for different augmentations (Viewmaker or Expert) and different examples from different datasets (Shapes, Letters, Digits). Each histogram shows a single example from each dataset randomly augmented 1200 times, and the corresponding probabilities of the correct answer. The viewmaker augmentations display a bimodal structure, indicating that the simple feature is selectively either destroyed or preserved. The expert augmentations by contrast lack such structure, reflecting their lack of adaptation to the structure of each input.

Figure 6: Alignment of features with versus without added noise. From left to right: $\alpha_1$ = 0.125, 0.25, 0.375, 0.5. The top plots show the alignment of Features 1 (weak) and 2 (dominant) to the ground truth; the bottom plots show the probability of predicting the correct augmentation pair from the batch. Standard deviation bars are shown for the mean alignment over 200 runs. We used dimension d = 5, and a batch size of m = 25.

Transfer accuracy on different features. Viewmaker networks are able to achieve good performance across multiple downstream tasks, while expert views sometimes falter. Networks are pretrained on the datasets on the left, and transfer accuracy is reported for the different conditions on the columns. Runs are averages of three seeds (with the exception of CIFAR-10 Only, which is taken from (Tamkin et al., 2021b)).

Audio transfer accuracies. Viewmaker networks achieve good performance across multiple tasks, while expert views sometimes suffer catastrophic drops as another feature is added. Networks are pretrained on the datasets on the left, and transfer accuracy is reported for the different conditions on the columns. Runs are averages of three seeds.

Under review as a conference paper at ICLR 2023

Kai Y. Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. ArXiv, abs/2006.09994, 2021.

Xingyi Yang, Xuehai He, Yuxiao Liang, Yue Yang, Shanghang Zhang, and Pengtao Xie. Transfer learning or self-supervised learning? A tale of two pretraining paradigms. ArXiv, abs/2007.04234, 2020.

Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau, and Stephen Lin. What makes instance discrimination good for transfer learning? ArXiv, abs/2006.06606, 2021.


Ethics Statement

Our work is centered on conceptual understanding, making it challenging to confidently predict societal impacts. Better conceptual understanding of existing methods may help us understand the failure modes and successes of current models better, which may have positive impacts. However, if this understanding enables the development of more powerful methods, the work may indirectly accentuate whatever social impacts (positive or negative) those applications have.

Reproducibility Statement

We include hyperparameters and experimental settings for our experiments in Section 3, complete statements of our theoretical results in Appendix E, and will release our codebase to reproduce our results.

