ARCL: ENHANCING CONTRASTIVE LEARNING WITH AUGMENTATION-ROBUST REPRESENTATIONS

Abstract

Self-Supervised Learning (SSL) is a paradigm that leverages unlabeled data for model training. Empirical studies show that SSL can achieve promising performance under distribution shift, where the downstream and training distributions differ. However, the theoretical understanding of its transferability remains limited. In this paper, we develop a theoretical framework to analyze the transferability of self-supervised contrastive learning by investigating the impact of data augmentation. Our results reveal that the downstream performance of contrastive learning depends largely on the choice of data augmentation. Moreover, we show that contrastive learning fails to learn domain-invariant features, which limits its transferability. Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL), which provably learns domain-invariant features and can be easily integrated with existing contrastive learning algorithms. Experiments on several datasets show that ArCL significantly improves the transferability of contrastive learning.

1. INTRODUCTION

A common assumption in designing machine learning algorithms is that training and test samples are drawn from the same distribution. However, this assumption may not hold in real-world applications, and algorithms may suffer from distribution shift, where the training and test distributions differ. This issue has motivated a plethora of research in various settings, such as transfer learning, domain adaptation and domain generalization (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021). Different ways of characterizing the relationship between test and training distributions lead to different algorithms. Most of the literature studies this problem in the supervised learning scenario, aiming to find features that capture some invariance across different distributions and assuming that such invariance also applies to test distributions (Peters et al., 2016; Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Mahajan et al., 2021; Jin et al., 2020; Ye et al., 2021).

Self-Supervised Learning (SSL) has attracted great attention in many fields (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). It first learns a representation from a large amount of unlabeled training data, and then fine-tunes the learned encoder to obtain a final model on the downstream task. Due to this two-step nature, SSL is more likely to encounter distribution shift, and exploring its transferability under distribution shift has become an important topic. Some recent works study this issue empirically (Liu et al., 2021; Goyal et al., 2021; von Kügelgen et al., 2021; Wang et al., 2021b; Shi et al., 2022), but the theoretical understanding is still limited, which also hinders the development of algorithms. In this paper, we study the transferability of self-supervised contrastive learning under distribution shift from a theoretical perspective.
In particular, we investigate which downstream distributions will result in good performance for the representation obtained by contrastive learning. We study this problem by deriving a connection between the contrastive loss and the downstream risk. Our main finding is that data augmentation is essential: contrastive learning provably performs well on downstream tasks whose distributions are close to the augmented training distribution. Moreover, the idea behind contrastive learning is to find representations that are invariant under data augmentation. This is similar to domain-invariance based supervised learning methods, since applying each kind of augmentation to the training data can be viewed as inducing a specific domain. Unfortunately, from this perspective, we discover that contrastive learning fails to produce a domain-invariant representation, limiting its transferability. To address this issue, we propose a new method called Augmentation-robust Contrastive Learning (ArCL), which can be integrated with various widely used contrastive learning algorithms, such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). In contrast to standard contrastive learning, ArCL forces the representation to align the two farthest positive samples, and thus provably learns domain-invariant representations. We conducted experiments on various downstream tasks to test the transferability of representations learned by ArCL on CIFAR10 and ImageNet. Our experiments demonstrate that ArCL significantly improves standard contrastive learning algorithms.

RELATED WORK

Distribution shift in supervised learning. The distribution shift problem has been studied in a large body of literature (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021). Most works aim to learn a representation that performs well on different source domains simultaneously (Rojas-Carulla et al., 2018; Mahajan et al., 2021; Jin et al., 2020), following the idea of causal invariance (Peters et al., 2016; Arjovsky et al., 2019). Structural equation models are often assumed for theoretical analysis (von Kügelgen et al., 2021; Liu et al., 2020; Mahajan et al., 2021). Distributionally robust optimization directly optimizes a model's worst-case performance over some uncertainty set (Krueger et al., 2021; Sagawa et al., 2019; Duchi & Namkoong, 2021; Duchi et al., 2021). Stable learning (Shen et al., 2020; Kuang et al., 2020) learns a set of global sample weights that remove the confounding bias for all potential treatments. Disentangled representation learning (Bengio et al., 2013; Träuble et al., 2021; Kim & Mnih, 2018) aims to learn representations in which distinct and informative factors of variation are separated.

Theoretical understanding of contrastive learning. A number of recent works aim to theoretically explain the success of contrastive learning in IID settings. One line of work explains it through the mutual information between positive samples (Tian et al., 2020; Hjelm et al., 2018; Tschannen et al., 2019). Arora et al. (2019) directly analyze the generalization of the InfoNCE loss under the assumption that positive samples are drawn from the same latent class. In the same setting, Bao et al. (2022) establish an equivalence between InfoNCE and supervised loss and give sharper upper and lower bounds. Huang et al. (2021) take data augmentation into account and provide generalization bounds based on the nearest-neighbor classifier.

Contrastive learning under distribution shift. Shen et al. (2022) and HaoChen et al. (2022) study contrastive learning in unsupervised domain adaptation, where unlabeled target data are available. Shi et al. (2022) show that SSL is the most robust on distribution shift datasets compared to autoencoders and supervised learning. Hu et al. (2022) improve the out-of-distribution performance of SSL from an SNE perspective. Other robust contrastive learning methods (Kim et al., 2020; Jiang et al., 2020) focus on adversarial robustness, while this paper focuses on distributional robustness.

2. PROBLEM FORMULATION

Given a set of unlabeled data where each sample $X$ is i.i.d. sampled from a training distribution $D$ on $\mathcal{X} \subseteq \mathbb{R}^d$, the goal of Self-Supervised Learning (SSL) is to learn an encoder $f: \mathcal{X} \to \mathbb{R}^m$ for different downstream tasks. Contrastive learning is a popular approach to SSL: it augments each sample $X$ twice to obtain a positive pair $(X_1, X_2)$, and then learns the encoder $f$ by pulling the pair close and pushing random samples (also called negative samples) away in the embedding space. Data augmentation is done by applying a transformation $A$ to the original data, where $A$ is randomly selected from a transformation set $\mathcal{A}$ according to some distribution $\pi$. We use $D_A$ to denote the distribution of data augmented by a specific transformation $A$, and $D_\pi$ to denote the distribution of augmented data, i.e., $D_\pi = \int_{\mathcal{A}} D_A \, d\pi(A)$. The contrastive loss typically consists of two terms:
$$L_{\mathrm{con}}(f; D, \pi) := L_{\mathrm{align}}(f; D, \pi) + \lambda L_{\mathrm{reg}}(f; D, \pi), \qquad (1)$$
where
$$L_{\mathrm{align}}(f; D, \pi) := \mathbb{E}_{X \sim D} \, \mathbb{E}_{(A_1, A_2) \sim \pi^2} \| f(A_1(X)) - f(A_2(X)) \|^2$$
measures the alignment of positive samples, and $L_{\mathrm{reg}}(f; D, \pi)$ is a regularization term that prevents feature collapse. For example, the regularization term in the InfoNCE loss is
$$L_{\mathrm{reg}}(f; D, \pi) := \mathbb{E}_{(X, X') \sim D^2} \, \mathbb{E}_{(A_1, A_2, A') \sim \pi^3} \log \left( e^{f(A_1(X))^\top f(A_2(X))} + e^{f(A_1(X))^\top f(A'(X'))} \right).$$
In this paper, we consider multi-class classification problems $\mathcal{X} \subseteq \mathbb{R}^d \to \mathcal{Y} = \{1, \dots, K\}$ as downstream tasks, and focus on the covariate shift setting, i.e., the target distribution $D_{\mathrm{tar}}$ differs from $D$ but $P(Y \mid X)$ is fixed. We use the best linear classifier on top of $f$ for downstream classification, and evaluate the performance of $f$ on $D_{\mathrm{tar}}$ by its risk (also used by Shi et al. (2022)):
$$R(f; D_{\mathrm{tar}}) := \min_{h \in \mathbb{R}^{K \times m}} \mathbb{E}_{X \sim D_{\mathrm{tar}}} \, \ell(h \circ f(X), Y), \qquad (2)$$
where $\ell$ is the loss function. We focus on the case where $\ell$ is the square loss. With a slight abuse of notation, for the classifier $h \circ f: \mathbb{R}^m \to \mathbb{R}^K$, we also use $R(h \circ f; D_{\mathrm{tar}})$ to denote its risk.
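To make these definitions concrete, the following is a minimal NumPy sketch that computes Monte Carlo estimates of the alignment term and the InfoNCE-style regularizer. The toy encoder (unit-sphere projection) and the toy augmentation (rescaling one coordinate) are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy encoder: project onto the unit sphere (the paper assumes ||f(x)|| = 1).
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def augment(x, rng):
    # Hypothetical augmentation A ~ pi: rescale the second coordinate by a
    # random factor; each drawn factor yields one deterministic transformation.
    out = x.copy()
    out[1] *= rng.normal()
    return out

X = rng.normal(size=(512, 2))                      # unlabeled samples from D
v1 = f(np.stack([augment(x, rng) for x in X]))     # f(A1(X))
v2 = f(np.stack([augment(x, rng) for x in X]))     # f(A2(X))

# Alignment term: E || f(A1(X)) - f(A2(X)) ||^2
L_align = np.mean(np.sum((v1 - v2) ** 2, axis=1))

# InfoNCE-style regularizer: pair each X with an independent X'
# (here, the batch shifted by one position).
neg = np.roll(v2, 1, axis=0)
L_reg = np.mean(np.log(np.exp(np.sum(v1 * v2, axis=1))
                       + np.exp(np.sum(v1 * neg, axis=1))))

lam = 1.0
L_con = L_align + lam * L_reg                      # total contrastive loss
print(L_align, L_reg, L_con)
```

Since the embeddings lie on the unit sphere, the alignment term is always between 0 and 4, and the dot products inside the log-sum-exp are bounded in [-1, 1].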
Our goal is to study the transferability of representations learned by contrastive learning across downstream distributions. For simplicity, we assume throughout this paper that $\|f(x)\| = 1$ for every $x \in \mathcal{X}$ and that $f$ is $L$-Lipschitz continuous. We end this section by remarking on the difference between transformations and augmentations: transformations are deterministic mappings from $\mathcal{X}$ to $\mathcal{X}$, while augmentations are random applications of them. The following examples show that many augmentation methods used in practice fit this model.

Example 2.1. The augmentations used in SimCLR (Chen et al., 2020) are compositions of random transformations, such as RandomCrop, HorizontalFlip and color distortion; each $A$ here denotes a specific composition. Users determine the probability that each transformation is applied, and if it is applied, its parameters (such as the crop range) are also selected randomly. Once the parameters of each applied transformation are set, the composed transformation is deterministic. Therefore, we can view these augmentation procedures as randomly selecting deterministic transformations according to the user-specified distribution $\pi$.

Example 2.2. Some recent works (Jahanian et al., 2021) propose to augment data by transformations in a latent space. Suppose we have a pre-trained generative model with an encoder $T: \mathcal{X} \to \mathcal{Z}$ and a generator $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{Z}$ is a latent space. Since $\mathcal{Z}$ is usually simple, one can augment data in $\mathcal{X}$ by randomly shifting $T(X)$. In this setting, the augmentation can be parameterized as $A_\theta(X) = G(T(X) + \theta)$ for $\theta \in \mathcal{Z}$, and the distribution $\pi$ of $\theta$ is usually chosen to be normal.
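The "sample parameters once, then apply deterministically" view of Example 2.1 can be sketched as follows. The specific crop size, offsets and color scales below are made-up illustrative choices, not SimCLR's actual parameters.

```python
import numpy as np

def sample_transformation(rng):
    # Sample A ~ pi: once the crop offset, flip bit and channel scales are
    # drawn, the returned function is a deterministic map X -> X.
    top, left = rng.integers(0, 5, size=2)        # crop offsets (assumed range)
    flip = rng.random() < 0.5                     # horizontal flip applied or not
    scale = rng.uniform(0.8, 1.2, size=3)         # per-channel color scaling

    def A(img):                                   # img: (H, W, 3) array in [0, 1]
        out = img[top:top + 28, left:left + 28]   # crop a fixed 28x28 window
        if flip:
            out = out[:, ::-1]
        return np.clip(out * scale, 0.0, 1.0)
    return A

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))
A1, A2 = sample_transformation(rng), sample_transformation(rng)
pos_pair = (A1(x), A2(x))     # a positive pair (X1, X2) for contrastive learning
print(pos_pair[0].shape, pos_pair[1].shape)
```

Each call to `sample_transformation` draws one deterministic $A$; applying it twice to the same input always gives the same output, which is exactly the transformation/augmentation distinction above.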

3. TRANSFERABILITY ANALYSIS: THE CRUCIAL ROLE OF DATA AUGMENTATION

In this section, we theoretically demonstrate that data augmentation plays a crucial role in the transferability of contrastive learning. The first step is to establish a connection between the contrastive loss $L_{\mathrm{con}}(f; D, \pi)$ and the downstream risk $R(f; D_{\mathrm{tar}})$. For this purpose, we adopt the $(\sigma, \delta)$-augmentation notion proposed by Huang et al. (2021), which was used there to study the KNN error of contrastive learning.

Definition 3.1 ((σ, δ)-transformation). Let $C_k \subseteq \mathcal{X}$ be the set of all points in class $k$. A data transformation set $\mathcal{A}$ is called a $(\sigma, \delta)$-transformation on $D$, for some $\sigma \in (0, 1]$ and $\delta > 0$, if for every $k \in [K]$ there exists $C_k^0 \subseteq C_k$ such that
$$\mathbb{P}_{X \sim D}(X \in C_k^0) \ge \sigma \, \mathbb{P}_{X \sim D}(X \in C_k) \quad \text{and} \quad \sup_{x_1, x_2 \in C_k^0} d_{\mathcal{A}}(x_1, x_2) \le \delta,$$
where $d_{\mathcal{A}}(x_1, x_2) := \inf_{A_1, A_2 \in \mathcal{A}} d(A_1(x_1), A_2(x_2))$ for some distance $d(\cdot, \cdot)$.

This definition measures the concentration of augmented data. A transformation set with smaller $\delta$ and larger $\sigma$ clusters the original data more, i.e., samples from the same class are closer after augmentation. Therefore, one can expect the learned representation $f$ to cluster better, which was measured by the KNN error in Huang et al. (2021). We remark that most contrastive learning theories (Arora et al., 2019; Ash et al., 2021; Bao et al., 2022) assume that, given class $k$, positive pairs are obtained by i.i.d. sampling from $\mathbb{P}_{X \sim D}(\cdot \mid X \in C_k)$. Compared with this data generation model, our $(\sigma, \delta)$ notion is more practical. By taking into account the linear classifier on top of $f$, we extend the KNN result of Huang et al. (2021) to the downstream risk and obtain the following lemma.

Lemma 3.1. For any distribution $D$ and encoder $f$, define $\mu_k(f; D) := \mathbb{E}_{X \sim D}[f(X) \mid X \in C_k]$ for $k \in [K]$. Suppose that $\mathcal{A}$ is a $(\sigma, \delta)$-transformation.
For any representation $f: \mathcal{X} \to \mathbb{R}^{d_1}$ and linear layer $h \in \mathbb{R}^{K \times d_1}$, we have
$$R(h \circ f; D_\pi) \le c \|h\| \sqrt{K} \, (L_{\mathrm{align}}(f; D, \pi))^{1/4} + \|h\| \tau(\sigma, \delta) + \sum_{k=1}^K D_\pi(C_k) \| e_k - h \circ \mu_k(f; D_\pi) \|, \qquad (3)$$
where $c$ is an absolute constant and $\tau(\sigma, \delta)$ is a constant depending only on $(\sigma, \delta)$; its detailed form is given in the appendix.

The first term in equation 3 is the alignment of $f$, which is optimized during contrastive pre-training on $D$. The second term is determined solely by the $(\sigma, \delta)$ quantity of the data augmentation: larger $\sigma$ and smaller $\delta$ induce smaller $\tau(\sigma, \delta)$. These two terms together quantify how well the representation $f$ learned by contrastive learning clusters. The third term involves the linear layer $h$ on top of $f$ and is minimized during downstream training, where each class center is mapped to the corresponding ground-truth label. If the regularization term $L_{\mathrm{reg}}$ is appropriately chosen, the class centers can be distinguished from each other, and the third term can be reduced to 0 by $h$. Taking the linear classifier into account, we obtain the following main result.

Theorem 3.2. Suppose that $\mathcal{A}$ is a $(\sigma, \delta)$-transformation on $D$ and $\pi$ is the augmentation distribution on $\mathcal{A}$. For an encoder $f$, define
$$\gamma_{\mathrm{reg}}(f; D, \pi) = 1 - c_1 (L_{\mathrm{align}}(f; D, \pi))^{1/4} - \tau(\sigma, \delta) - c_2 \max_{k \ne k'} | \mu_k(f; D_\pi)^\top \mu_{k'}(f; D_\pi) |$$
for some positive constants $c_1$ and $c_2$. Then for any $f$ such that $\{\mu_k(f; D_\pi)\}_{k=1}^K$ are linearly independent and $\gamma_{\mathrm{reg}}(f; D, \pi) > 0$, we have
$$R(f; D_\pi) \le c \left[ \gamma_{\mathrm{reg}}^{-1}(f; D, \pi) \, (L_{\mathrm{align}}(f; D, \pi))^{1/4} + \gamma_{\mathrm{reg}}^{-1}(f; D, \pi) \, \tau(\sigma, \delta) \right],$$
where $c$ is an absolute constant.

Remark 1. (a) The linear independence condition on $\{\mu_k(f; D_\pi)\}_{k=1}^K$ is weak, since its complement is a null set in Euclidean space under the Lebesgue measure and has probability 0 under any distribution that is absolutely continuous w.r.t. the Lebesgue measure.
(b) The quantity $\gamma_{\mathrm{reg}}^{-1}$ is essentially a bound on the $L_2$ norm of the linear classifier $h$ that reduces the third term $\sum_{k=1}^K D_\pi(C_k) \| e_k - h \circ \mu_k(f; D_\pi) \|$ in equation 3 to 0, i.e., that maps the class centers to their corresponding ground-truth labels. The more orthogonal the class centers, the larger $\gamma_{\mathrm{reg}}$ and the better the bound. The theory developed in Huang et al. (2021) shows that popular algorithms such as SimCLR and Barlow Twins can indeed push the class centers apart, so $\gamma_{\mathrm{reg}}$ can be ensured to stay a constant gap away from 0. Therefore, for simplicity we directly impose this requirement on $f$.

This theorem shows that contrastive learning on distribution $D$ with augmentation $\pi$ essentially optimizes the supervised risk on the augmented distribution $D_\pi$, rather than on the original distribution $D$. Therefore, the augmentation $\pi$ is crucial to the transferability of contrastive learning. If the downstream distribution $D_{\mathrm{tar}}$ is close to $D_\pi$, the encoder obtained by contrastive learning performs well on it. The more diverse $\pi$ is, the more accurately we can expect $D_\pi$ to approximate downstream distributions, and the better the transferability of contrastive learning. Moreover, if we have prior knowledge of the downstream distribution, we can specify an augmentation that brings $D_\pi$ close to $D_{\mathrm{tar}}$ in order to improve downstream performance. For example, suppose the original data $D$ are gray digits and the downstream distribution $D_{\mathrm{tar}}$ is obtained by dyeing them red. If we use random coloring as data augmentation and conduct contrastive learning on $D$, the learned features will generalize well to $D_{\mathrm{tar}}$, even though $D_{\mathrm{tar}}$ is far from $D$.

4. TRANSFERABILITY LIMITATION: A DOMAIN-INVARIANCE VIEW

Our theoretical results in the last section suggest that data augmentation is crucial to the transferability of contrastive learning. However, there exists a fundamental limitation: the learned representation is not domain-invariant, a property that is important for a model to generalize to more downstream datasets (Peters et al., 2016; Arjovsky et al., 2019). Roughly speaking, domain-invariance means that the learned model extracts intrinsic features that are invariant across different domains. In supervised domain generalization, a common setting is to have a training domain set consisting of multiple source domains. Correspondingly, in contrastive learning we can consider $D_A$ as a transformation-induced domain for each data transformation $A$, and $\{D_A\}_{A \in \mathcal{A}}$ as the training domain set. Since the goal of contrastive learning is also to align different $D_A$, we might naturally expect the features it learns to be domain-invariant. However, since the alignment loss $L_{\mathrm{align}}$ in contrastive learning is obtained by averaging over transformations (i.e., taking an expectation), the learned features are not invariant even across $\{D_A\}_{A \in \mathcal{A}}$. In other words, encoders learned through contrastive learning can behave very differently on different $D_A$. The following toy model gives an illustrative example.

Proposition 4.1. Consider a two-dimensional classification problem with data $(X_1, X_2) \sim N(0, I_2)$. The label satisfies $Y = \mathbb{1}(X_1 \ge 0)$, and the data augmentation multiplies $X_2$ by standard normal noise, i.e., $A_\theta(X) = (X_1, \theta \cdot X_2)$ with $\theta \sim N(0, 1)$. The corresponding transformation-induced domain set is $\mathcal{P} = \{ D_c : D_c \text{ is the distribution of } (X_1, c \cdot X_2), \ c \in \mathbb{R} \}$. We consider the 0-1 loss in equation 2.
Then for every $\varepsilon > 0$, there exist a representation $f$ and two domains $D_c$ and $D_{c'}$ such that $L_{\mathrm{align}}(f; D, \pi) < \varepsilon$, but $| R(f; D_c) - R(f; D_{c'}) | \ge \frac{1}{4}$.

This example shows that there exists a representation with arbitrarily small contrastive loss but very different performance over different transformation-induced domains. The intuition is that to make $L_{\mathrm{align}}$ small, it suffices to align the transformation-induced domains in an average rather than uniform sense; the representation may therefore still suffer a large alignment loss on some rarely selected augmented domains.
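The construction behind Proposition 4.1 (given explicitly in the appendix proof: $f(x_1, x_2) = x_1 + t x_2$ with $t = \sqrt{\varepsilon}/2$) can be checked numerically. The following Monte Carlo sketch estimates both the alignment loss and the risk gap between the domains $D_0$ and $D_{1/t}$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-2
t = np.sqrt(eps) / 2          # f(x1, x2) = x1 + t * x2, as in the proof
n = 200_000

X = rng.normal(size=(n, 2))   # (X1, X2) ~ N(0, I_2)

# Alignment loss: a positive pair multiplies X2 by two independent N(0,1) draws.
th1, th2 = rng.normal(size=n), rng.normal(size=n)
f1 = X[:, 0] + t * th1 * X[:, 1]
f2 = X[:, 0] + t * th2 * X[:, 1]
L_align = np.mean((f1 - f2) ** 2)        # analytically 2 t^2 = eps / 2 < eps

# 0-1 risk of thresholding f at 0 on the domain D_c: X -> (X1, c * X2).
def risk(c):
    pred = (X[:, 0] + t * c * X[:, 1]) >= 0
    return np.mean(pred != (X[:, 0] >= 0))

gap = abs(risk(1.0 / t) - risk(0.0))     # analytically 1/4
print(L_align, gap)
```

On $D_0$ the predictor agrees with the label everywhere, while on $D_{1/t}$ it thresholds $X_1 + X_2$, which disagrees with the sign of $X_1$ with probability $\arccos(1/\sqrt{2})/\pi = 1/4$.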

5. ARCL: AUGMENTATION-ROBUST CONTRASTIVE LEARNING

To further improve the transfer performance of contrastive learning, we consider how to learn a representation that is domain-invariant across $\{D_A\}_{A \in \mathcal{A}}$. First, we need a formal definition that mathematically characterizes domain-invariance, so that we can design algorithms based on it. Here, we borrow the notion proposed by Arjovsky et al. (2019).

Definition 5.1. We say that a representation $f$ elicits an invariant predictor $h_0 \circ f$ across a domain set $\mathcal{P}$ if there is a classifier $h_0$ simultaneously optimal for all domains in $\mathcal{P}$, that is, $h_0 \in \arg\min_h R(h \circ f; D)$ for all $D \in \mathcal{P}$.

This definition is equivalent to learning features whose correlations with the target variable are stable, and it has been shown to improve transferability under distribution shift both practically and theoretically in supervised learning. Interestingly, by setting $\mathcal{P}$ to be $\{D_A\}_{A \in \mathcal{A}}$, this definition is well suited to contrastive learning. Specifically, define the augmentation-robust loss as
$$L_{\mathrm{AR}}(f; D) := \mathbb{E}_{X \sim D} \sup_{A, A' \in \mathcal{A}} \| f(A(X)) - f(A'(X)) \|^2,$$
a uniform version of the original alignment loss $L_{\mathrm{align}}$. Then we have the following results.

Theorem 5.1. For any linear predictor $h$ and representation $f$, we have
$$\sup_{A, A' \in \mathcal{A}} | R(h \circ f; D_A) - R(h \circ f; D_{A'}) | \le c \, \|h\| \, L_{\mathrm{AR}}(f; D).$$
Moreover, fix $f$ and let $h_A \in \arg\min_h R(h \circ f; D_A)$. Then for any two transformations $A$ and $A'$,
$$| R(h_A \circ f; D_{A'}) - R(h_{A'} \circ f; D_{A'}) | \le 2c \, (\|h_A\| + \|h_{A'}\|) \, L_{\mathrm{AR}}(f; D).$$

Compared with $L_{\mathrm{align}}$, $L_{\mathrm{AR}}$ replaces the expectation over $\mathcal{A}$ by a supremum, hence $L_{\mathrm{align}}(f; D, \pi) \le L_{\mathrm{AR}}(f; D)$ for all $f$ and $\pi$. If $L_{\mathrm{AR}}(f; D)$ is small, the theorem shows that $R(h \circ f; D_A)$ varies only slightly with $A$, so the optimal $h$ for $D_A$ is close to that for $D_{A'}$. In other words, a representation with smaller $L_{\mathrm{AR}}$ tends to elicit the same optimal linear predictor across different domains, a property that does not hold for the original alignment loss.
A small $L_{\mathrm{AR}}$ not only requires $f$ to pull positive pairs closer, but also forces this alignment to be uniform across all transformations in $\mathcal{A}$. Therefore, we propose to train the encoder using the augmentation-robust loss for better transfer performance under distribution shift, i.e., to replace $L_{\mathrm{align}}$ by $L_{\mathrm{AR}}$ in the contrastive loss in equation 1. In practice, $\sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1(X)) - f(A_2(X)) \|_2^2$ cannot be computed exactly unless $|\mathcal{A}|$ is finite and small, so we approximate it as follows. For each sample $X$, we first randomly select $m$ augmentations to obtain a set of augmented samples denoted by $\mathcal{A}_m(X)$. Then, rather than averaging as in standard contrastive learning, we use $\sup_{A_1, A_2 \in \mathcal{A}_m(X)} \| f(A_1(X)) - f(A_2(X)) \|_2^2$ as the alignment term, which approximates $\sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1(X)) - f(A_2(X)) \|_2^2$. The integer $m$ is called the number of views and controls the approximation accuracy. Given $n$ unlabeled samples $(X_1, \dots, X_n)$, the empirical augmentation-robust loss is
$$\hat{L}_{\mathrm{AR}}(f) := \frac{1}{n} \sum_{i=1}^n \sup_{A_1, A_2 \in \mathcal{A}_m(X_i)} \| f(A_1(X_i)) - f(A_2(X_i)) \|_2^2. \qquad (4)$$
We remark that our proposed method can be applied to any contrastive learning method whose objective can be formulated as equation 1, such as SimCLR, MoCo, Barlow Twins (Zbontar et al., 2021), and so on. We give the detailed form of SimCLR + ArCL as an example in Algorithm 1, and provide MoCo + ArCL in the appendix. The only difference between Algorithm 1 and SimCLR is the construction of positive pairs. In each epoch $t$, for every sample $X$, we first randomly select $m$ transformations denoted by $\mathcal{A}(X) = \{A_1, \dots, A_m\}$. Then, based on the current encoder $f$ and projector $g$, we select the two transformations with the worst alignment (minimal inner product) among all pairs in $\mathcal{A}(X)$, and use them to construct the positive pair of $X$ actually used. The construction of negative pairs and the update rule are the same as in SimCLR.
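The worst-pair alignment term above can be sketched in a few lines of NumPy (random unit vectors stand in for projector outputs):

```python
import numpy as np

def worst_pair_alignment(views):
    # views: (m, dim) embeddings f(A_1(X)), ..., f(A_m(X)) of one sample X.
    # Returns the worst-pair squared distance, the finite-m surrogate for the
    # sup over the full transformation set A.
    diffs = views[:, None, :] - views[None, :, :]      # (m, m, dim)
    return np.sum(diffs ** 2, axis=-1).max()

def empirical_L_AR(batch_views):
    # batch_views: (n, m, dim). Empirical augmentation-robust loss: average of
    # the per-sample worst-pair distances.
    return float(np.mean([worst_pair_alignment(v) for v in batch_views]))

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4, 16))                  # n=8 samples, m=4 views
Z /= np.linalg.norm(Z, axis=-1, keepdims=True)   # unit-norm embeddings
ar = empirical_L_AR(Z)
print(ar)
```

For unit-norm embeddings, $\|u - v\|^2 = 2 - 2 u^\top v$, so selecting the pair with maximal squared distance is exactly the choice of the pair with minimal cosine similarity made in Algorithm 1; replacing the `max` with a mean over all pairs recovers the average alignment loss compared against in Section 6.3.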

5.1. THEORETICAL JUSTIFICATION

We now give theoretical guarantees for our proposed approximation, i.e., bound the gap between $\hat{L}_{\mathrm{AR}}(f)$ and $L_{\mathrm{AR}}(f)$. For each representation $f \in \mathcal{F}$, where $\mathcal{F}$ is some hypothesis space, define $z_f(x) := \sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1 x) - f(A_2 x) \|^2$ and $Z_{\mathcal{F}} := \{ z_f : f \in \mathcal{F} \}$. We then use the Rademacher complexity of $Z_{\mathcal{F}}$ to define the alignment complexity of $\mathcal{F}$ as
$$\mathcal{R}_n(\mathcal{F}) := \mathcal{R}_n(Z_{\mathcal{F}}) = \mathbb{E} \sup_{z \in Z_{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^n \sigma_i z(X_i).$$
This quantity characterizes the complexity of $\mathcal{F}$ in terms of its alignment: if most functions in $\mathcal{F}$ align well under $\mathcal{A}$, then the alignment complexity $\mathcal{R}_n(Z_{\mathcal{F}})$ is small even though the standard Rademacher complexity of $\mathcal{F}$ could be large. Using this definition, we have the following result.

Algorithm 1: SimCLR + ArCL
Input: batch size N, temperature τ, augmentation distribution π, number of views m, epochs T, encoder f, projector g.
for t = 1, ..., T do
    sample a minibatch {X_i}_{i=1}^N
    for i = 1, ..., N do
        draw m augmentations A = {A_1, ..., A_m} ~ π
        z_{i,j} = g(f(A_j X_i)) for j ∈ [m]
        # select the worst positive pair
        s_i^+ = min_{j,k ∈ [m]} z_{i,j}^⊤ z_{i,k} / (∥z_{i,j}∥ ∥z_{i,k}∥)
        # compute the negative similarities
        for j = 1, ..., N do
            s_{i,j}^- = z_{i,1}^⊤ z_{j,1} / (∥z_{i,1}∥ ∥z_{j,1}∥)
            s_{i,j+N}^- = z_{i,1}^⊤ z_{j,2} / (∥z_{i,1}∥ ∥z_{j,2}∥)
    compute L = -(1/N) Σ_{i=1}^N log [ exp(s_i^+ / τ) / Σ_{j=1, j≠i}^{2N} exp(s_{i,j}^- / τ) ]
    update f and g to minimize L
return f

Theorem 5.2. Suppose that $\Theta = B_{d_{\mathcal{A}}}(1)$, the $d_{\mathcal{A}}$-dimensional unit ball, and that every $A \in \mathcal{A}$ is $L_{\mathcal{A}}$-Lipschitz continuous. Let $\pi$ be a distribution on $\Theta$ with density $p_\pi$ such that $\inf_{\theta \in \Theta} p_\pi(\theta) \ge c_\pi$. Then with probability at least $1 - 2\varepsilon$, for every $f \in \mathcal{F}$,
$$L_{\mathrm{AR}}(f) \le \hat{L}_{\mathrm{AR}}(f) + 2 \mathcal{R}_n(\mathcal{F}) + \left( \frac{\log(1/\varepsilon)}{n} \right)^{1/2} + L L_{\mathcal{A}} \left( \frac{\log(2n/\varepsilon)}{c_\pi m} \right)^{1/d_{\mathcal{A}}},$$
where $n$ is the training sample size and $m$ is the number of views.

The bound in Theorem 5.2 consists of two terms, $O(n^{-1/2})$ and $O(m^{-1/d_{\mathcal{A}}})$: the first is due to the finite sample size, and the second is caused by approximating the supremum with a finite maximum.
Here $1/c_\pi$ acts like the "volume" of the transformation set and is of constant order. As $m$ increases, the second term decreases and the bound becomes tighter. Note that we need the augmentation distribution $\pi$ to behave roughly uniformly, i.e., $\inf_{\theta \in \Theta} p_\pi(\theta) \ge c$ for some positive constant $c$; otherwise we cannot ensure that the finite maximization is an acceptable approximation.
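The $O(m^{-1/d_{\mathcal{A}}})$ term is a covering rate: the best of $m$ i.i.d. draws from the index space approaches the supremum at that speed. The toy Monte Carlo below illustrates this with an assumed function $g(\theta) = -\|\theta\|$ on the unit ball (not the paper's loss), whose supremum 0 is attained at $\theta = 0$; the gap to the finite maximum is the distance from 0 to the nearest draw.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                          # hypothetical dimension d_A of the index space Theta

def sup_gap(m, trials=200):
    # Gap between sup g = 0 and the max of g over m draws uniform on the unit
    # ball, i.e., the distance from the maximizer 0 to the nearest draw; this
    # shrinks at the covering rate m^{-1/d}.
    gaps = []
    for _ in range(trials):
        theta = rng.normal(size=(m, d))          # random directions ...
        theta *= rng.random((m, 1)) ** (1 / d) / np.linalg.norm(theta, axis=1, keepdims=True)
        gaps.append(np.linalg.norm(theta, axis=1).min())
    return float(np.mean(gaps))

g4, g64, g1024 = sup_gap(4), sup_gap(64), sup_gap(1024)
print(g4, g64, g1024)          # decreases roughly like m^{-1/2} for d = 2
```

Scaling $m$ by 16 roughly quarters the gap when $d = 2$, matching the $m^{-1/d}$ rate in the theorem.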

6. EXPERIMENTS

We conduct experiments on several distribution shift settings to evaluate our proposed method. To show its adaptability, we apply ArCL to SimCLR (Chen et al., 2020) and MoCo v2 (He et al., 2020), respectively. Following the setups of SimCLR and MoCo v2, we add a projection head after the backbone in the pre-training stage and remove it for the downstream task. For linear evaluation, we follow the protocol in which the encoder is frozen and a linear classifier on top of it is trained on the downstream data. For full fine-tuning, we also add a linear classifier on top of the encoder, but we do not freeze the encoder.
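The linear-evaluation step can be sketched with the square-loss risk of Section 2: with the encoder frozen, the best linear head is a least-squares fit on one-hot labels. The random features below are stand-ins for actual encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim = 10, 64

# Stand-ins for features extracted by the frozen pre-trained encoder f.
feats = rng.normal(size=(500, dim))
labels = rng.integers(0, K, size=500)

# Linear evaluation with the square loss: fit the best linear head h by least
# squares on one-hot targets, leaving the encoder untouched.
Y = np.eye(K)[labels]                            # one-hot targets
h, *_ = np.linalg.lstsq(feats, Y, rcond=None)    # h: (dim, K)
pred = np.argmax(feats @ h, axis=1)
acc = float(np.mean(pred == labels))
print(acc)
```

For full fine-tuning, the same head would be attached but the encoder parameters would also be updated by gradient descent instead of being held fixed.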

6.1. EXPERIMENTS ON MODIFIED CIFAR10 AND CIFAR100

Setup. Our representations are trained on CIFAR-10 (Krizhevsky, 2009) using SimCLR modified with our proposed ArCL. Different batch sizes (256, 512) and numbers of views (m = 4, 6, 8) are considered. We use ResNet-18 as the encoder and a 2-layer MLP as the projection head. We train the representation for 500 epochs with temperature 0.5; warm-up is not used. We conduct linear evaluation for 100 epochs on different downstream datasets to test the OOD performance of the learned representations.

Table 1: The five downstream augmentation settings (Aug 1 to Aug 5), each a combination of four base augmentations.

Table 2: Linear evaluation results (%) of pretrained CIFAR10 models on CIFAR10, CIFAR100 and their modified versions.

Datasets. We use CIFAR10 to train the representation. The data augmentation follows the setting in SimCLR, i.e., a composition of RandomCrop, RandomHorizontalFlip, ColorJitter and RandomGrayscale. To evaluate the transferability of the learned representation, we train the linear classifier on different downstream datasets and test its accuracy. Downstream datasets are generated by modifying the original CIFAR10 and CIFAR100 datasets, i.e., applying different data augmentations to them. We choose 5 different augmentation settings, as shown in Table 1. The results are shown in Table 2.

Results. In all settings, our proposed approach improves the transferability of SimCLR significantly. As the number of views m increases, the accuracy also increases, which matches our theoretical results. Moreover, the improvement brought by increasing m tends to saturate, suggesting that m does not need to be very large.

6.2. EXPERIMENTS ON IMAGENET TRANSFERRING TO SMALL DATASETS

Setup. In this part, we first train an encoder on ImageNet (Deng et al., 2009) using MoCo v2 modified with our proposed ArCL, selecting the number of ArCL views from {2, 3, 4}. Note that, since MoCo has an asymmetric network architecture, MoCo and MoCo + ArCL (views = 2) differ. We use ResNet50 as the backbone and follow all other architecture settings of MoCo. The memory bank size K is set to 65536 and the batch size to 256. For the training epochs, we consider three schemes: 1) 200 epochs of training from scratch; 2) 50 epochs of training from a model pretrained by MoCo for 800 epochs; 3) 100 epochs of training from a model pretrained by MoCo for 800 epochs. For each setting, we search for the best initial learning rate in {0.03, 0.0075, 0.0015} and the best temperature in {0.20, 0.15, 0.12, 0.10, 0.05}. We use the SGD optimizer with cosine learning rate decay. After contrastive pre-training, we test transferability with linear evaluation and full fine-tuning on various small downstream datasets, following the settings in Ericsson et al. (2021). For the target datasets, we adopt FGVC Aircraft (Maji et al., 2013), Caltech-101 (Fei-Fei et al., 2004), Stanford Cars (Krause et al., 2013), CIFAR10, CIFAR100, DTD (Cimpoi et al., 2014), Oxford 102 Flowers (Nilsback & Zisserman, 2008), Food-101 (Bossard et al., 2014) and Oxford-IIIT Pets (Parkhi et al., 2012). For linear evaluation, multinomial logistic regression is fit on the extracted features. For full fine-tuning, the full models are trained for 5,000 steps using SGD with Nesterov momentum. We report the average results over the nine downstream datasets.

Results. We compare our methods with the MoCo baseline. On average, ArCL improves linear evaluation by about 2% and full fine-tuning by about 3%.
In particular, we gain a large improvement on CIFAR100 fine-tuning, where our proposed method outperforms the baseline by up to 10%. It is also worth noting that our method matches the supervised results on nearly all datasets. These facts indicate the effectiveness of our proposed method. Moreover, under the same epoch setting, the average performance grows as the number of views increases, which fits our theoretical analysis. One can also notice that there is only a small improvement in the 800 + 100 epoch setting compared to the 800 + 50 epoch setting, which implies that our method converges rather quickly and is effective within a few epochs of training.

6.3. COMPARISON WITH AVERAGE ALIGNMENT LOSS

Compared to standard contrastive learning (CL) methods, ArCL sees more augmentations in total, since we construct m augmented versions of each original sample. To make the comparison between ArCL and CL fairer, we also construct m views for the original CL methods, but use the average similarity over all positive pairs among the m views, namely the average alignment loss (AAL), as the learning objective. Results for SimCLR on CIFAR10 and CIFAR100 and for MoCo on ImageNet are shown in Tables 5 and 6 in the appendix; detailed experimental settings are also deferred there. As the experiments show, contrastive learning with AAL performs similarly to the vanilla methods, which is still much worse than ArCL on distribution shift downstream tasks.

Algorithm 2: MoCo + ArCL (key steps, with query encoder f_q, key encoder f_k and memory bank M)
    z^1_{i,j} = g(f_q(A_j X_i)) for j ∈ [m]
    z^2_{i,j} = g(f_k(A_j X_i)) for j ∈ [m]
    # select the worst positive pair
    s_i^+ = min_{j,l ∈ [m], j ≠ l} (z^1_{i,j})^⊤ z^2_{i,l} / (∥z^1_{i,j}∥ ∥z^2_{i,l}∥)
    # compute the negative similarities against the memory bank
    for m' ∈ M do
        s^-_{i,m'} = z_{i,1}^⊤ m' / (∥z_{i,1}∥ ∥m'∥)
    compute L = -(1/N) Σ_{i=1}^N log [ exp(s_i^+ / τ) / Σ_{m' ∈ M} exp(s^-_{i,m'} / τ) ]

B PROOFS IN SECTION 3

Proof of Lemma 3.1. For simplicity we omit the notations $D$ and $\pi$ in this proof, use $x_1$ to denote the positive sample (i.e., the augmented data), $p_k$ to denote $D_\pi(C_k)$, and $\mu_k$ to denote $\mu_k(f; D_\pi)$. From Theorem 2 and Lemma B.1 in Huang et al. (2021), we have
$$\mathbb{E}_{x \in C_k} \| f(x_1) - \mu_k \| \le \frac{c_1}{\sqrt{p_k}} \, L_{\mathrm{align}}^{1/4}(f) + \tau(\sigma, \delta) \qquad (8)$$
for some constant $c_1$, where $\tau(\sigma, \delta) := 4 \left( 1 - \sigma \left( 1 - \frac{L \delta}{4} \right) \right)$. Note that $\tau$ is decreasing in $\sigma$ and increasing in $\delta$. Then, for every linear layer $h \in \mathbb{R}^{K \times d_1}$,
$$c_1 \sum_{k=1}^K \sqrt{p_k} \, L_{\mathrm{align}}^{1/4}(f) + \tau(\sigma, \delta) \ge \sum_{k=1}^K p_k \, \mathbb{E}_{x \in C_k} \| f(x_1) - \mu_k \| \ge \sum_{k=1}^K \frac{p_k}{\|h\|} \, \mathbb{E}_{x \in C_k} \mathbb{E}_{x_1 \in \mathcal{A}x} \| h \circ f(x_1) - h \circ \mu_k \|$$
$$\ge \sum_{k=1}^K \frac{p_k}{\|h\|} \, \mathbb{E}_{x \in C_k} \mathbb{E}_{x_1 \in \mathcal{A}x} \| h \circ f(x_1) - e_k \| - \frac{1}{\|h\|} \sum_{k=1}^K p_k \| e_k - h \circ \mu_k \| = \frac{1}{\|h\|} R(h \circ f) - \frac{1}{\|h\|} \sum_{k=1}^K p_k \| e_k - h \circ \mu_k \|. \qquad (9)$$
Therefore, we obtain
$$R(h \circ f) \le c_1 \|h\| \, L_{\mathrm{align}}^{1/4}(f) \sum_{k=1}^K \sqrt{p_k} + \|h\| \tau(\sigma, \delta) + \sum_{k=1}^K p_k \| e_k - h \circ \mu_k \| \le c_1 \|h\| \sqrt{K} \, L_{\mathrm{align}}^{1/4}(f) + \|h\| \tau(\sigma, \delta) + \sum_{k=1}^K p_k \| e_k - h \circ \mu_k \|. \qquad (10)$$

Proof of Theorem 3.2. By the triangle inequality and equation 8, we have
$$\| \mu_k \| \ge 1 - \mathbb{E}_{x \in C_k} \| f(x_1) - \mu_k \| \ge 1 - \frac{c_1}{\sqrt{p_k}} \, L_{\mathrm{align}}^{1/4}(f) - \tau(\sigma, \delta).$$
Therefore, with $p_0 := \min_k p_k$,
$$\min_k \| \mu_k \|^2 \ge 1 - \frac{c_1}{\sqrt{p_0}} \, L_{\mathrm{align}}^{1/4}(f) - \tau(\sigma, \delta).$$
Define $U = (\mu_1, \dots, \mu_K) \in \mathbb{R}^{d_1 \times K}$ and let $h_0 = U^+$ be the Moore-Penrose inverse of $U$, so that $h_0 U = I_K$. Then $\|h_0\| = \sup_{x \ne 0} \|x\| / \|Ux\|$. For every $x = (x_1, \dots, x_K) \in \mathbb{R}^K$ with $\|x\|_2 = 1$, we have
$$\|Ux\|_2^2 = \sum_{i=1}^K x_i^2 \|\mu_i\|^2 + \sum_{i \ne j} x_i x_j \mu_i^\top \mu_j \ge \min_i \|\mu_i\|^2 - \max_{i \ne j} |\mu_i^\top \mu_j| \sum_{i \ne j} |x_i x_j| = \min_i \|\mu_i\|^2 - \max_{i \ne j} |\mu_i^\top \mu_j| \left( \Big( \sum_i |x_i| \Big)^2 - 1 \right)$$
$$\ge \min_i \|\mu_i\|^2 - (K - 1) \max_{i \ne j} |\mu_i^\top \mu_j| \ge 1 - \frac{c_1}{\sqrt{p_0}} \, L_{\mathrm{align}}^{1/4}(f) - \tau(\sigma, \delta) - K \max_{i \ne j} |\mu_i^\top \mu_j| =: \gamma_{\mathrm{reg}}(f; D, \pi).$$
Therefore $\|h_0\| \le \gamma_{\mathrm{reg}}^{-1/2}(f; D, \pi) \le \gamma_{\mathrm{reg}}^{-1}(f; D, \pi)$, since $\gamma_{\mathrm{reg}} \le 1$. Then, by Lemma 3.1, we complete the proof:
$$R(f) = \min_h R(h \circ f) \le R(h_0 \circ f) \le c \, \gamma_{\mathrm{reg}}^{-1}(f; D, \pi) \, L_{\mathrm{align}}^{1/4}(f; D, \pi) + \gamma_{\mathrm{reg}}^{-1}(f; D, \pi) \, \tau(\sigma, \delta).$$

C PROOF IN SECTION 4

Proof of Proposition 4.1. For any $\varepsilon > 0$, let $t = \sqrt{\varepsilon}/2$ and $f(x_1,x_2) = x_1 + tx_2$. Then the alignment loss of $f$ satisfies
$$L_{\mathrm{align}}(f;D,\pi) = t^2\,\mathbb{E}X_2^2\;\mathbb{E}_{(\theta_1,\theta_2)\sim N(0,1)^2}(\theta_1-\theta_2)^2 = 2t^2 < \varepsilon.$$
Let $c = 0$ and $c' = 1/t$. Then obviously $R(f;D_c) = 0$, but
$$R(f;D_{c'}) = P(X_1 < 0,\; X_1+X_2 \ge 0) + P(X_1 \ge 0,\; X_1+X_2 < 0) = \frac{1}{4}.$$
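The value $1/4$ can be verified by simulation in a concrete instance. The check below assumes, for illustration only, that $X=(X_1,X_2)$ has independent standard normal coordinates:

```python
import numpy as np

def quarter_probability(n=1_000_000, seed=0):
    """Monte Carlo estimate of P(X1<0, X1+X2>=0) + P(X1>=0, X1+X2<0)
    for independent standard normal X1, X2 (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal((2, n))
    # the two events together say that sign(X1) and sign(X1+X2) disagree
    disagree = (x1 < 0) != (x1 + x2 < 0)
    return disagree.mean()
```

Geometrically, each event is a planar wedge of angle $\pi/4$, so each has probability $1/8$ under a rotation-invariant distribution, giving $1/4$ in total.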

D PROOFS IN SECTION 5

Proof of Theorem 5.1. For any $f$ and $h$ with $\|h\| \le c_h$, we have
\begin{align}
R(h\circ f;D_A) - R(h\circ f;D_{A'}) &= \mathbb{E}_{(X,Y)\sim D}\left[|h\circ f(AX)-Y|^2 - |h\circ f(A'X)-Y|^2\right] \\
&= \mathbb{E}_{(X,Y)\sim D}\left[\big(h\circ f(AX)-h\circ f(A'X)\big)\big(h\circ f(AX)+h\circ f(A'X)-2Y\big)\right] \\
&\le c\,\mathbb{E}_{(X,Y)\sim D}\|h\circ f(AX)-h\circ f(A'X)\| \\
&\le c\|h\|\,\mathbb{E}_{(X,Y)\sim D}\|f(AX)-f(A'X)\| \le c\|h\|\,L_{\mathrm{AR}}(f)
\end{align}
for some constant $c$ that depends on $c_h$.

Proof of Theorem 5.2. We first fix the sample $(x_1,\dots,x_n)$ and consider the randomness of $A_m(x_i)$. Define $A_1^*(x_i)$ and $A_2^*(x_i)$ as
$$\big(A_1^*(x_i),\,A_2^*(x_i)\big) = \arg\sup_{A_1,A_2\in\mathcal{A}}\|f(A_1x_i)-f(A_2x_i)\|_2^2;$$
note that their order does not matter. Write $r := \left(\frac{\log(2n/\varepsilon)}{c_\vartheta m}\right)^{1/d}$. For every $\varepsilon > 0$, we have
\begin{align}
&P\left(\frac{1}{n}\sum_{i=1}^n\sup_{A_1,A_2\in\mathcal{A}}\|f(A_1x_i)-f(A_2x_i)\|_2^2 \le \frac{1}{n}\sum_{i=1}^n\sup_{A_1,A_2\in A_m(x_i)}\|f(A_1x_i)-f(A_2x_i)\|_2^2 + L_f^2L_A^2\left(\frac{\log(2n/\varepsilon)}{c_\vartheta m}\right)^{2/d}\right) \\
&\overset{(i)}{\ge} P\left(\max_{i\in[n]}\max\left\{\inf_{A\in A_m(x_i)}\|A-A_1^*(x_i)\|_2,\;\inf_{A\in A_m(x_i)}\|A-A_2^*(x_i)\|_2\right\} \le r\right) \\
&\overset{(ii)}{\ge} 1 - \sum_{i=1}^n P\left(\inf_{A\in A_m(x_i)}\|A-A_1^*(x_i)\|_2 > r\right) - \sum_{i=1}^n P\left(\inf_{A\in A_m(x_i)}\|A-A_2^*(x_i)\|_2 > r\right) \\
&= 1 - \sum_{i=1}^n P\left(\forall A\in A_m(x_i):\|A-A_1^*(x_i)\|_2 > r\right) - \sum_{i=1}^n P\left(\forall A\in A_m(x_i):\|A-A_2^*(x_i)\|_2 > r\right) \\
&\overset{(iii)}{\ge} 1 - 2n\left(1-\frac{\log(2n/\varepsilon)}{m}\right)^m \overset{(iv)}{\ge} 1-\varepsilon.
\end{align}
Here (i) comes from the Lipschitz continuity of $f$ and $A$; (ii) is a simple application of set operations (a union bound over the complements); (iii) is derived from the fact that the volume of a $d$-dimensional ball of radius $r$ is proportional to $r^d$; (iv) comes from the inequality $(1-1/a)^a \le e^{-1}$ for $a > 1$. Therefore, with probability at least $1-\varepsilon$, we have
$$\frac{1}{n}\sum_{i=1}^n\sup_{A_1,A_2\in\mathcal{A}}\|f(A_1x_i)-f(A_2x_i)\|_2^2 \le \widehat{L}_{\mathrm{AR}}(f) + L_f^2L_A^2\left(\frac{\log(2n/\varepsilon)}{c_\vartheta m}\right)^{2/d},$$
where $\widehat{L}_{\mathrm{AR}}(f) := \frac{1}{n}\sum_{i=1}^n\sup_{A_1,A_2\in A_m(x_i)}\|f(A_1x_i)-f(A_2x_i)\|_2^2$ denotes the empirical loss over the sampled augmentations. Besides, for every $f$ we have
$$L_{\mathrm{AR}}(f) \le \frac{1}{n}\sum_{i=1}^n\sup_{A_1,A_2\in\mathcal{A}}\|f(A_1x_i)-f(A_2x_i)\|_2^2 + 2\mathcal{R}_n(\mathcal{F}) + \left(\frac{\log(1/\varepsilon)}{n}\right)^{1/2}$$
with probability at least $1-\varepsilon$. Then we finally obtain that with probability at least $1-2\varepsilon$, for every $f\in\mathcal{F}$,
$$L_{\mathrm{AR}}(f) \le \widehat{L}_{\mathrm{AR}}(f) + 2\mathcal{R}_n(\mathcal{F}) + \left(\frac{\log(1/\varepsilon)}{n}\right)^{1/2} + L_f^2L_A^2\left(\frac{\log(2n/\varepsilon)}{c_\vartheta m}\right)^{2/d}.$$
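Steps (iii)–(iv) of the proof of Theorem 5.2 reduce to the scalar inequality $2n\big(1-\log(2n/\varepsilon)/m\big)^m \le \varepsilon$ whenever $m > \log(2n/\varepsilon)$. A quick numerical check (our own, with illustrative values of $n$, $\varepsilon$, $m$):

```python
import math

def tail_bound_holds(n, eps, m):
    """Check 2n * (1 - log(2n/eps)/m)**m <= eps, which follows from
    (1 - 1/a)**a <= 1/e applied with a = m / log(2n/eps)."""
    a = math.log(2 * n / eps)
    assert m > a, "needs m > log(2n/eps) so the base is positive"
    return 2 * n * (1 - a / m) ** m <= eps
```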

E ADDITIONAL EXPERIMENTS

In order to further verify the generality of our method, we conduct additional experiments in different settings and compare our objective against alternative losses.

E.1 MOCO ON CIFAR10

Setup. In this part, we conduct experiments with MoCo v2 pretrained on CIFAR10, following the setup of MoCo v2. We use ResNet-18 as the encoder and train the representation for 400 epochs. The temperature is set to 0.2, the initial learning rate to 0.15, and the memory bank size to 4096. We use the SGD optimizer with a cosine learning rate schedule and no warm-up. We conduct linear evaluation for 100 epochs on the modified CIFAR10 and CIFAR100 datasets used in Section 6.1.

Results. Results can be seen in Table 4. The conclusions drawn for SimCLR carry over to MoCo: our proposed approach improves the transferability of MoCo, and the accuracy increases as the number of views grows.
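The learning rate schedule above can be sketched as follows; a minimal illustration of cosine decay without warm-up, using the listed hyperparameters (the function name is ours, not from the MoCo v2 code):

```python
import math

# hyperparameters listed in the setup above
BASE_LR, EPOCHS = 0.15, 400

def cosine_lr(epoch, base_lr=BASE_LR, total=EPOCHS):
    """Cosine schedule without warm-up: decays from base_lr at epoch 0
    down to 0 at the final epoch."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total))
```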

E.2 COMPARISON WITH AVERAGE ALIGNMENT LOSS

To make the comparison between ArCL and CL fairer, we add new experiments for original CL. For each sample, we also construct $m$ views and use the average of the similarities between positive pairs, namely the average alignment loss (AAL), as the learning objective. The settings and results are as follows.

Loss Function. For image $x_i$, let the normalized features of its $m$ views from the online branch be $z_{i1}, z_{i2}, \dots, z_{im}$ and those from the target branch be $z'_{i1}, z'_{i2}, \dots, z'_{im}$. Our ArCL alignment loss is
$$\mathcal{L}^{\mathrm{align}}_{\mathrm{ArCL}} = -\sum_i \min_{j\ne k}\, z_{ij}^\top z'_{ik}/\tau,$$
and the average alignment loss is
$$\mathcal{L}^{\mathrm{align}}_{\mathrm{Average}} = -\sum_i \operatorname*{avg}_{j\ne k}\, z_{ij}^\top z'_{ik}/\tau = -\sum_i \frac{\sum_{j\ne k} z_{ij}^\top z'_{ik}/\tau}{m^2-m}.$$
The uniformity loss is kept the same, and the total loss is the sum of the alignment loss and the uniformity loss.

Hyperparameters and Datasets. For SimCLR, we train the models on CIFAR10 and evaluate them on modified CIFAR10 and CIFAR100, just as in our original paper. We set the number of augmentation views to 4 and the batch size to 512. For MoCo, we conduct experiments on ImageNet, initializing from the 800-epoch pretrained model. We train the model for 50 epochs and use the same settings as in Section 6.2.

Results. Results for the SimCLR experiments can be seen in Table 5, where we compare linear evaluation results on modified CIFAR10 and CIFAR100. Results for the MoCo experiments can be seen in Table 6, where we compare both linear evaluation and finetuning results on small datasets. As the experiments show, when contrastive learning methods with the average alignment loss use the same amount of data as ArCL, they can achieve higher accuracy than the vanilla methods (which use only two views) on the ID sets, but they still perform similarly to the vanilla methods on the OOD sets, which is much worse than ArCL. This further verifies the superiority of our method.
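The two alignment objectives differ only in how the cross-branch pair similarities are aggregated. A minimal numpy sketch (our own naming, assuming normalized features of shape (N, m, d)):

```python
import numpy as np

def alignment_losses(z, z_prime, tau=0.5):
    """Return (ArCL alignment loss, average alignment loss AAL).

    z, z_prime: (N, m, d) L2-normalized features of the m views from the
    online and target branches.
    """
    N, m, _ = z.shape
    sim = np.einsum('njd,nkd->njk', z, z_prime) / tau  # z_{ij}^T z'_{ik} / tau
    pairs = sim[:, ~np.eye(m, dtype=bool)]             # keep only j != k pairs
    arcl = -pairs.min(axis=1).sum()                    # worst positive pair
    aal = -pairs.mean(axis=1).sum()                    # average over all pairs
    return arcl, aal
```

Since the minimum of the pair similarities never exceeds their average, the ArCL loss upper-bounds the AAL on every batch, which is the sense in which it targets the worst-case view pair.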

E.3 EXPERIMENTS ON MNIST-CIFAR

MNIST-CIFAR is a synthetic dataset proposed by Shah et al. It consists of $3\times64\times32$ synthetic images, each of which is a vertical concatenation of a $3\times32\times32$ MNIST image from class 0 or class 1 and a $3\times32\times32$ CIFAR10 image from class 0 or class 1. We follow the distribution settings proposed by Shi et al. (2022):

• ID train: 1.0 correlation between MNIST and CIFAR10 labels. Contains two classes: class 0 with MNIST "0" and CIFAR10 "automobile", and class 1 with MNIST "1" and CIFAR10 "plane".
• OOD train: no correlation between MNIST and CIFAR10 labels; images from the two classes of MNIST and CIFAR10 are randomly paired. We take the CIFAR10 label to be the label of the concatenated image.
• OOD test: generated in the same way as the OOD train set, using the test sets of MNIST and CIFAR10.

In this experiment, we train the model on the ID train set, conduct linear evaluation on the OOD train set, and compute the accuracy on the OOD test set. We adopt SimCLR as the CL framework and use a 4-layer CNN as the backbone, just as Shi et al. (2022) do. We fix the base feature size of the CNN to C = 32 and the latent dimension to L = 128. The batch size is set to 128, the optimizer is SGD, and the learning rate scheduler is warmup-cosine. The learning rate and weight decay are searched over the ranges given in Table 2 of Shi et al. (2022). Since training converges quickly, we set both the training and linear evaluation epochs to 5. We also compare with the average alignment loss (AAL for short) described in Appendix E.2.

Results. We can see that by using ArCL, SimCLR enjoys improved performance. The accuracy also grows as the number of views grows, which fits our theory. We also notice that the accuracy of SimCLR with AAL does not vary much as the number of views grows, which indicates that the key point is how the multiple views are used (adopting the minimum instead of the average).
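The pairing scheme above can be sketched as follows; a hedged numpy illustration (array shapes and the function name are ours) of building the correlated (ID) versus randomly paired (OOD) splits:

```python
import numpy as np

def make_mnist_cifar(mnist0, mnist1, cifar0, cifar1, correlated, seed=0):
    """Concatenate MNIST and CIFAR10 images vertically into (3, 64, 32) samples.

    mnist*, cifar*: arrays of shape (n, 3, 32, 32) holding the two classes.
    correlated=True keeps the 1.0 label correlation (ID split);
    correlated=False pairs the MNIST halves at random (OOD split).
    The CIFAR10 label is taken as the label of the concatenated image.
    """
    rng = np.random.default_rng(seed)
    n = len(mnist0)
    mnist = np.concatenate([mnist0, mnist1])           # (2n, 3, 32, 32)
    cifar = np.concatenate([cifar0, cifar1])
    labels = np.array([0] * n + [1] * n)               # CIFAR10 labels
    if not correlated:
        mnist = mnist[rng.permutation(2 * n)]          # break the correlation
    images = np.concatenate([mnist, cifar], axis=2)    # stack along height
    return images, labels
```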



In this paper, the notation $\|\cdot\|$ stands for the $L_2$-norm for vectors and the Frobenius norm for matrices, respectively.




Results comparison on linear evaluation (top) and finetuning (bottom) of pretrained ImageNet models on popular recognition datasets. Additionally, we provide a supervised baseline for comparison: a standard pretrained ResNet50 available from the PyTorch library, with results taken from (Ericsson et al., 2021). Results style: best in the same epoch setting. Avg takes the average of the results over all nine small datasets.

Linear evaluation results of pretrained CIFAR10 models using MoCo v2 on CIFAR10, CIFAR100 and their modified versions.

Linear evaluation results of pretrained CIFAR10 models using SimCLR with two different alignment losses on CIFAR10, CIFAR100 and their modified versions.

Results comparison on linear evaluation (top) and finetuning (bottom) of pretrained ImageNet models with the ArCL alignment loss and the average alignment loss on popular recognition datasets. Results style: best in the same augmentation-view setting. Avg takes the average of the results over all nine small datasets.

Linear evaluation results of pretrained models using SimCLR with two different alignment losses on the MNIST-CIFAR dataset. Average results over three different random seeds are reported.

ACKNOWLEDGEMENT

Yisen Wang is partially supported by the National Key R&D Program of China (2022ZD0160304), the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc. We would like to express our sincere gratitude to the reviewers of ICLR 2023 for their insightful and constructive feedback; their valuable comments have greatly contributed to improving the quality of our work.

FUNDING

This work was partially done when Xuyang was visiting Qing Yuan Research Institute.

