THE TRADE-OFF BETWEEN UNIVERSALITY AND LABEL EFFICIENCY OF REPRESENTATIONS FROM CONTRASTIVE LEARNING

Abstract

Pre-training representations (a.k.a. foundation models) has recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data, and then learns simple predictors on top of the representation using small labeled data from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks). In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata, so that one may not be able to achieve both simultaneously. Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data yield more diverse features for different tasks (improving universality), they put less emphasis on task-specific features, giving rise to larger sample complexity for downstream supervised tasks, and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.

1. INTRODUCTION

Representation pre-training is a recent successful approach that utilizes large-scale unlabeled data to address the challenges of scarcity of labeled data and distribution shift. Different from the traditional supervised learning approach using a large labeled dataset, representation learning first pre-trains a representation function using large-scale diverse unlabeled datasets by self-supervised learning (e.g., contrastive learning), and then learns predictors on the representation using small labeled datasets for downstream target tasks. The pre-trained model is commonly referred to as a foundation model (Bommasani et al., 2021), and has achieved remarkable performance in many applications, e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021), and Flamingo (Alayrac et al., 2022). We note that two properties are key to their success: (1) label efficiency: with the pre-trained representation, only a small amount of labeled data is needed to learn accurate predictors for downstream target tasks; (2) universality: the pre-trained representation can be used across various downstream tasks.

In this work, we focus on contrastive learning with linear probing, which learns a linear predictor on the representation pre-trained by contrastive learning and is an exemplary pre-training approach (e.g., Arora et al., 2019; Chen et al., 2020). We highlight and study a fundamental trade-off between label efficiency and universality, though ideally one would like to have these two key properties simultaneously. Since pre-training with large-scale diverse unlabeled data is widely used in practice, such a trade-off merits deeper investigation.

Theoretically, we provide an analysis of the features learned by contrastive learning, and of how the learned features determine the downstream prediction performance and lead to the trade-off.
We propose a hidden representation data model, which first generates a hidden representation containing various features, and then uses it to generate the label and the input. We first show that contrastive learning is essentially generalized nonlinear PCA that can learn hidden features invariant to the transformations used to generate positive pairs. We also point out that additional assumptions on the data and representations are needed to obtain non-vacuous guarantees for prediction performance. We thus consider a setting where the data are generated by linear functions of the hidden representation, and formally prove that the difference in the learned features leads to the trade-off. In particular, pre-training on more diverse data learns more diverse features and is thus useful for prediction on more tasks. But it also down-weights task-specific features, implying larger sample complexity for predictors and thus worse prediction performance on a specific task. This analysis inspires us to propose a general method, contrastive regularization, that adds a contrastive loss to the training of predictors to improve the accuracy on downstream tasks.

[Figure caption fragment: (The variance of the blue line is too small to be seen.) Please refer to Section 3.1 for details.]

Empirically, we first perform controlled experiments to reveal the trade-off. Specifically, we first pre-train on a specific dataset similar to that of the target task, and then incrementally add more datasets into pre-training. In the end, the pre-training data includes both datasets similar to the target task and those not so similar, which mimics the practical scenario in which foundation models are pre-trained on diverse data to be widely applicable to various downstream tasks. Fig. 1 gives an example of this experiment: as we increase task diversity for contrastive learning, the average accuracy on all tasks increases from 18.3% to 20.1%, while the label efficiency of an individual task is harmed: on CIFAR-10, the accuracy drops from 88.5% to 76.4%. We also perform experiments on contrastive regularization, and demonstrate that it can consistently improve over the typical fine-tuning method across multiple datasets. In several cases, the improvement is significant: 1.3% test accuracy improvement for CLIP on ImageNet, and 4.8% for MoCo v3 on GTSRB (see Tables 1 and 2 for details). With these results, we believe that it is important to bring the community's attention to this trade-off and the forward path of foundation models.

Our main contributions are summarized as follows:
• We propose a hidden representation data model and prove that contrastive learning is essentially generalized nonlinear PCA, and can encode hidden features invariant to the transformations used in positive pairs (Section 2.1).
• We formally prove the trade-off in a simplified setting with linear data (Section 2.2).
• We empirically demonstrate the trade-off across different methods and datasets for contrastive learning with linear probing (Sections 3.1 and 3.2).
• We propose a contrastive regularization method for training the predictor on a target task (Section 2.2), which achieves consistent improvement in our experiments (Section 3.3).

Related Work on Representation Pre-training. This paradigm pre-trains a representation function on a large dataset and then uses it for prediction on various downstream tasks (Devlin et al., 2019; Kolesnikov et al., 2020; Brown et al., 2020; Newell & Deng, 2020). The representations are also called foundation models (Bommasani et al., 2021).
There are mainly two kinds of approaches: (1) supervised approaches (e.g., Kolesnikov et al., 2020) that pre-train on large labeled datasets; (2) self-supervised approaches (e.g., Newell & Deng, 2020) that pre-train on large and diverse unlabeled datasets. Recent self-supervised pre-training can compete with or outperform supervised pre-training on the downstream prediction performance (Ericsson et al., 2021). Practical examples like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021),

2. THEORETICAL ANALYSIS

Our experiments in Section 3.1 demonstrate a trade-off between the universality and label efficiency of contrastively pre-trained representations when used for prediction on a distribution different from the pre-training data distribution. See Fig. 1 for an example. Intuitively, from the unlabeled data, pre-training can learn semantic features that are useful for prediction even on different data distributions. To analyze this, we need to formalize the notion of useful semantic features. We therefore introduce a hidden representation data model where a hidden representation (i.e., a set of semantic features) is sampled and then used for generating the data. Similar models have been used in some studies (HaoChen et al., 2021; Zimmermann et al., 2021), while we introduce the notion of spurious and invariant features and obtain a novel analysis for contrastive learning.

Using this theoretical model of data, Section 2.1 investigates what features are learned by contrastive learning. We show that contrastive learning can be viewed as a generalization of Principal Components Analysis, and that it encodes the invariant features not affected by the transformations but removes the others. We also show that further assumptions on the data and the representations are necessary for any non-vacuous bounds for downstream prediction. So Section 2.2 considers a simplified setting with linear data. We show that when the representation is pre-trained on diverse datasets (modeled as a mixture of unlabeled data from different tasks), it encodes all invariant features from the different tasks and thus is useful for all tasks. On the other hand, it essentially emphasizes the features that are shared among the tasks, but down-weights those that are specific to a single task. Compared to pre-training only on unlabeled data from the target task, this leads to a larger sample complexity and thus worse generalization for prediction on the target task.
Therefore, we show that the trade-off between universality and label efficiency occurs due to the fact that when many useful features from diverse data are packed into the representation, those for a specific target task can be down-weighted and thus worsen the prediction performance on it. Based on this insight, we propose a contrastive regularization method for using representations in downstream prediction tasks, which achieves consistent improvement over the typical fine-tuning method in our experiments in Section 3.3.

Contrastive Learning. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the input space, $\mathcal{Y}$ the label space, and $\bar{\mathcal{Z}} \subseteq \mathbb{R}^k$ the output vector space of the learned representation function. Let $\Phi$ denote the hypothesis class of representations $\phi : \mathcal{X} \to \bar{\mathcal{Z}}$, and $\mathcal{F}_\phi$ the hypothesis class of predictors on $\phi$. A task is simply a data distribution over $\mathcal{X} \times \mathcal{Y}$. In pre-training, using transformations on unlabeled data from the tasks, we have some pre-train distribution $\mathcal{D}_{pre}$ over positive pairs $(x, x^+)$ and negative examples $x^-$, where $x, x^+$ are obtained by applying random transformations on the same input (e.g., cropping or color jitter for images), and $x^-$ is an independent example. The contrastive loss is $\ell\left(\phi(x)^\top (\phi(x^+) - \phi(x^-))\right)$ where $\ell(t)$ is a suitable loss function. Typically, the logistic loss $\ell(t) = \log(1 + \exp(-t))$ is used, while our analysis also holds for other loss functions. A representation $\phi$ is learned by:

$$\min_{\phi \in \Phi} \; \mathbb{E}_{(x, x^+, x^-) \sim \mathcal{D}_{pre}} \left[ \ell\left( \phi(x)^\top (\phi(x^+) - \phi(x^-)) \right) \right]. \tag{1}$$

(We simply consider the population loss since pre-training data are large-scale.) Then a predictor $f$ is learned on top of $\phi$ using $m$ labeled points $\{(x_i, y_i)\}_{i=1}^m$ from a specific target task $\mathcal{D}$:

$$\min_{f \in \mathcal{F}_\phi} \; \frac{1}{m} \sum_{i=1}^m \ell_c(f(\phi(x_i)), y_i) \tag{2}$$

where $\ell_c$ is a prediction loss (e.g., cross-entropy). Usually, $f$ is a linear classifier (Linear Probing) with a bounded norm: $\mathcal{F}_\phi = \{f(z) = u^\top z : u \in \mathbb{R}^k, \|u\| \le B\}$, where $\|\cdot\|$ denotes the $\ell_2$ norm.

Hidden Representation Data Model. We now consider the pre-train distribution $\mathcal{D}_{pre}$ over $(x, x^+, x^-)$. To capture that pre-training can learn useful features, we assume a hidden representation for generating the data: first sample a hidden representation $z \in \mathcal{Z}$ from a distribution $\mathcal{D}_z$ over some hidden representation space $\mathcal{Z} \subseteq \mathbb{R}^d$, and then generate the input $x$ and the label $y$ from $z$. (The space $\mathcal{Z}$ models semantic features, and can be different from the learned representation space $\bar{\mathcal{Z}}$.) The dimensions of $z$ are partitioned into two disjoint subsets of $[d] := \{1, \dots, d\}$: spurious features $U$ that are affected by the transformations, and invariant features $R$ that are not. Specifically, let $\mathcal{D}_U, \mathcal{D}_R$ denote the distributions of $z_U$ and $z_R$, respectively, and let $x = g(z)$ denote the generative function for $x$. Then the positive pairs $(x, x^+)$ are generated as follows:

$$z = [z_R; z_U] \sim \mathcal{D}_z, \quad z_U^+ \sim \mathcal{D}_U, \quad z^+ = [z_R; z_U^+], \quad x = g(z), \quad x^+ = g(z^+).$$

That is, $x, x^+$ are generated from the same $z_R$ but from two random copies of $z_U$ that model the random transformations. Finally, $x^-$ is an i.i.d. sample from the same distribution as $x$:

$$z^- \sim \mathcal{D}_z, \quad x^- = g(z^-).$$
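The sampling procedure and objective (1) above can be sketched in a few lines. This is a minimal illustration under assumptions that are not part of the general model: the hidden coordinates are independent standard Gaussians, the generative function is linear with an orthonormal matrix `M` (as later assumed in Section 2.2), and the representation is a linear map `W`. The names `sample_triples`, `contrastive_loss`, `W_inv`, and `W_spu` are hypothetical.

```python
import numpy as np

def logistic_loss(t):
    """ell(t) = log(1 + exp(-t)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -t)

def sample_triples(n, U, M, rng):
    """Draw n triples (x, x+, x-) from the hidden representation model.

    Assumption: z has i.i.d. N(0, 1) coordinates and x = M z.
    x and x+ share the invariant part z_R; only z_U is resampled.
    """
    d = M.shape[0]
    z = rng.normal(size=(n, d))
    z_pos = z.copy()
    z_pos[:, U] = rng.normal(size=(n, len(U)))  # fresh copy of spurious features
    z_neg = rng.normal(size=(n, d))             # independent negative example
    return z @ M.T, z_pos @ M.T, z_neg @ M.T

def contrastive_loss(W, x, x_pos, x_neg, ell=logistic_loss):
    """Monte Carlo estimate of objective (1) for the linear map phi(x) = W x."""
    phi, phi_pos, phi_neg = x @ W.T, x_pos @ W.T, x_neg @ W.T
    t = np.sum(phi * (phi_pos - phi_neg), axis=1)
    return float(np.mean(ell(t)))

rng = np.random.default_rng(0)
d = 6
R, U = np.arange(0, 3), np.arange(3, 6)        # invariant / spurious indices
M, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal generative matrix
x, x_pos, x_neg = sample_triples(20000, U, M, rng)

W_inv = M[:, R].T   # reads off the invariant features: W_inv x = z_R
W_spu = M[:, U].T   # reads off the spurious features:  W_spu x = z_U
loss_inv = contrastive_loss(W_inv, x, x_pos, x_neg)
loss_spu = contrastive_loss(W_spu, x, x_pos, x_neg)
```

In this toy setting the encoder that reads off $z_R$ attains a lower contrastive loss than the one that reads off $z_U$, which is the behavior analyzed formally in Section 2.1.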

2.1. WHAT FEATURES ARE LEARNED BY CONTRASTIVE LEARNING?

To analyze prediction performance, we first need to analyze what features are learned in pre-training.

Contrastive Learning is Generalized Nonlinear PCA. Recall that given data $x$ from a distribution $\mathcal{D}$, Principal Components Analysis (PCA) (Pearson, 1901; Hotelling, 1933) aims to find a linear projection function $\phi$ on some subspace such that the variance of the projected data $\phi(x)$ is maximized, i.e., it minimizes the following PCA objective:

$$-\mathbb{E}_{x \sim \mathcal{D}}\left[\left\|\phi(x) - \mathbb{E}_{x' \sim \mathcal{D}}[\phi(x')]\right\|^2\right] = -\mathbb{E}_{x \sim \mathcal{D}}\left[\|\phi(x) - \phi_0\|^2\right]$$

where $\phi_0 := \mathbb{E}[\phi(x')]$ is the mean of the projected data. Nonlinear PCA replaces linear representation functions $\phi$ with nonlinear ones. We next show that contrastive learning is a generalization of nonlinear PCA on the smoothed representation after smoothing out the transformations.

Theorem 2.1. If $\ell(t) = -t$, then the contrastive loss is equivalent to the PCA objective on $\phi_{z_R}$:

$$\mathbb{E}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] = -\mathbb{E}\left[\|\phi_{z_R} - \phi_0\|^2\right] \tag{5}$$

where $\phi_{z_R} := \mathbb{E}[\phi(x) \mid z_R] = \mathbb{E}[\phi(g(z)) \mid z_R]$. If additionally $\phi(x)$ is linear in $x$, then it is equivalent to the linear PCA objective $-\mathbb{E}\left[\|\phi(\bar{x}) - \phi_0\|^2\right]$ on data $\bar{x} := \mathbb{E}[x \mid z_R] = \mathbb{E}[g(z) \mid z_R]$.

So contrastive learning is essentially nonlinear PCA when $\ell(t) = -t$, and further specializes to linear PCA when the representation is linear. As PCA finds directions with large variances, the analogue is that contrastive learning encodes important invariant features but not spurious ones.

Contrastive Learning Encodes Invariant Features and Removes Spurious Features. For a formal statement we need some weak assumptions on the data, the representations, and the loss:
(A1) $z_R$ can be recovered from $x$, i.e., the inputs $x = g(z)$ from different $z_R$'s are disjoint.
(A2) The representation functions are the regular functions with $\|\phi(x)\| = B_r$ ($\forall x$) for some $B_r > 0$. Being regular means there are a finite $L$ and a partition of $\mathcal{Z}$ into a finite number of subsets, such that in each subset all $\phi \circ g$ have Lipschitz constants bounded by $L$.
(A3) The loss $\ell(t)$ is convex, decreasing, and lower-bounded.

The first condition means the invariant features $z_R$ can be extracted from $x$ (note that $g$ need not be invertible). The regularity condition on the representation excludes pathological cases like the Dirichlet function; essentially all reasonable functions relevant for practice satisfy this condition, e.g., when $g$ is Lipschitz and the $\phi$ are neural networks with the ReLU activation. Also, note that the logistic loss typically used in practice satisfies the last condition.

We say a function $f(z)$ is independent of a subset of input dimensions $z_S$ if there exists a function $f'$ such that $f(z) = f'(z_{-S})$ with probability 1, where $z_{-S}$ denotes the set of all $z_j$ with $j \notin S$. We say the representation $\phi$ encodes a feature $z_i$ if $\phi \circ g : \mathcal{Z} \to \bar{\mathcal{Z}}$ is not independent of $z_i$ as long as the generative function $g(z)$ is not independent of $z_i$.

Theorem 2.2. Under Assumptions (A1)(A2)(A3), the optimal representation $\phi^*$ satisfies:
(1) $\phi^*$ does not encode the spurious features $z_U$: $\phi^* \circ g(z)$ is independent of $z_U$.
(2) For any invariant feature $i \in R$, there exists $B_i > 0$ such that as long as the representations' norm $B_r \ge B_i$, then $\phi^*$ encodes $z_i$. Furthermore, if $\mathcal{Z}$ is finite, then $B_i$ is monotonically decreasing in $\Pr[z_{R \setminus \{i\}} = z^-_{R \setminus \{i\}}, z_i \neq z^-_i]$, the probability that in $z_R$ and $z^-_R$, the $i$-th feature varies while the others remain the same.

So contrastive learning aims to remove the spurious features and preserve the invariant features. The transformations should therefore be chosen such that they do not affect the useful semantic features, but change those irrelevant to the label. Interestingly, the theorem further suggests that contrastive learning tends to favor the more "spread-out" invariant features $z_i$, as measured by $\Pr[z_{R \setminus \{i\}} = z^-_{R \setminus \{i\}}, z_i \neq z^-_i]$.
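The identity in Theorem 2.1 can be checked numerically. The following is a minimal sketch, assuming a small finite hidden space with independent $z_R$ and $z_U$ and an arbitrary table `h[r, u]` standing in for $\phi(g(z))$; these are illustrative choices, and the function names are hypothetical. Both sides are computed by exact enumeration, so no sampling error is involved.

```python
import itertools
import numpy as np

def contrastive_loss_linear_ell(h, p_r, p_u):
    """Exact E[ ell( h(z)^T (h(z+) - h(z-)) ) ] with ell(t) = -t.

    h[r, u] is the representation of the hidden state (z_R = r, z_U = u).
    z, z+ share z_R; z_U, z_U^+, z_R^-, z_U^- are drawn independently.
    """
    n_r, n_u, _ = h.shape
    total = 0.0
    for r, u, u_pos, r_neg, u_neg in itertools.product(
        range(n_r), range(n_u), range(n_u), range(n_r), range(n_u)
    ):
        prob = p_r[r] * p_u[u] * p_u[u_pos] * p_r[r_neg] * p_u[u_neg]
        t = h[r, u] @ (h[r, u_pos] - h[r_neg, u_neg])
        total += prob * (-t)
    return total

def pca_objective(h, p_r, p_u):
    """Exact -E || phi_{z_R} - phi_0 ||^2 on the smoothed representation."""
    phi_zr = np.einsum("u,ruk->rk", p_u, h)   # E[h(z) | z_R = r]
    phi_0 = p_r @ phi_zr                      # overall mean representation
    return -float(np.sum(p_r * np.sum((phi_zr - phi_0) ** 2, axis=1)))

rng = np.random.default_rng(0)
n_r, n_u, k = 3, 4, 5
p_r = rng.dirichlet(np.ones(n_r))    # distribution of z_R
p_u = rng.dirichlet(np.ones(n_u))    # distribution of z_U
h = rng.normal(size=(n_r, n_u, k))   # arbitrary (nonlinear) representation table

lhs = contrastive_loss_linear_ell(h, p_r, p_u)
rhs = pca_objective(h, p_r, p_u)
```

The two quantities `lhs` and `rhs` agree up to floating-point error for any choice of the table `h`, which reflects that the result does not require linearity of the representation.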
As we increase the representation capacity $B_r$, $B_r$ passes the threshold $B_i$ for more features $z_i$, so $\phi^*$ first encodes the more spread-out invariant features and then the others.

This further suggests the following intuition for the trade-off. When pre-trained on diverse data modeled as a mixture from multiple tasks with different invariant features, the representation encodes all the invariant features and thus is useful for prediction on all the tasks. When pre-trained on only a specific task, features specific to this task are favored over those that only show up in other tasks, which leads to smaller sample complexity for learning the predictor and thus better prediction. However, to formalize this, some inductive bias assumptions about the data and the representation are necessary to get any non-vacuous guarantee for the prediction (see discussion in Appendix A.1). Therefore, Section 2.2 introduces additional assumptions and formalizes the trade-off.

To analyze the prediction performance, we first need to model the relation between the pre-training data and the target task. We model the diverse pre-training data as a mixture of data from $T$ different tasks $\mathcal{D}_t$, while the target task is one of the tasks. All tasks share a public feature set $S$ of size $s$, and each task $\mathcal{D}_t$ additionally owns a private disjoint feature set $P_t$ of size $r - s$, i.e., $P_t \cap S = \emptyset$ and $P_{t_1} \cap P_{t_2} = \emptyset$ for $t_1 \neq t_2$ (Fig. 2). The invariant features for $\mathcal{D}_t$ are then $R_t = S \cup P_t$. All invariant features are $R = \cup_{t=1}^T R_t$, and spurious features are $U = [d] \setminus R$. In task $\mathcal{D}_t$, the $(x, x^+)$ are generated as follows:

2.2. ANALYZING

$$z_{R_t} \sim \mathcal{N}(0, I), \quad z_{R \setminus R_t} = 0, \quad z_U \sim \mathcal{N}(0, I), \quad z = [z_R; z_U], \quad x = g(z),$$
$$z_U^+ \sim \mathcal{N}(0, I), \quad z^+ = [z_R; z_U^+], \quad x^+ = g(z^+),$$

and $x^-$ is simply an i.i.d. copy from the same distribution as $x$. In practice, multiple independent negative examples are used, and thus we consider the following contrastive loss

$$\min_{\phi \in \Phi} \; \mathbb{E}_{(x, x^+)} \left[ \ell\left( \phi(x)^\top \left(\phi(x^+) - \mathbb{E}_{x^-} \phi(x^-)\right) \right) \right]$$

for a convex and decreasing $\ell(t)$ to pre-train a representation $\phi$. Then, when using $\phi$ for prediction in the target task $\mathcal{D}_t$, the predictor class should contain a predictor matching the ground-truth label: $\mathcal{F}_{\phi,t} = \{f(z) = u^\top z : u \in \mathbb{R}^k, \|u\| \le B_{\phi,t}\}$, where $B_{\phi,t}$ is the minimum value such that there exists $u_t$ with $\|u_t\| \le B_{\phi,t}$ and $y = u_t^\top \phi(x)$ on $\mathcal{D}_t$.

Now, given the necessity of inductive biases for non-vacuous guarantees (see Appendix A.1), and inspired by classic dictionary learning and recent analysis on such data (e.g., Olshausen & Field (1997); Wen & Li (2021); Shi et al. (2022)), we assume linear data and linear representations:
• $x$ is linear in $z$: $x = g(z) = Mz$ where $M \in \mathbb{R}^{d \times d}$ is an orthonormal dictionary. Since linear probing has strong performance on pre-trained representations, we assume that the label in each task $t$ is linear in its invariant features: $y = (u_t^*)^\top z_{R_t}$ for some $u_t^* \in \mathbb{R}^r$.
• The representations are linear functions with weights of bounded spectral/Frobenius norms: $\Phi = \{\phi(x) = Wx : W \in \mathbb{R}^{k \times d}, \|W\| \le 1, \|W\|_F \le \sqrt{r}\}$. Here the norm bounds are chosen to be the minimum values that allow recovering the invariant features in the target task, i.e., there exists $\phi \in \Phi$ such that $\phi(x) = [z_{R_t}; 0]$.

We compare two representations: a specific one pre-trained on unlabeled data from the target task $\mathcal{D}_t$, and a universal one pre-trained on an even mixture of data from $T$ tasks. (Appendix B provides analysis for more general cases like uneven mixtures.) This captures the situation in which the pre-training data contains some data similar to the target task and also other less similar data.
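The multi-task linear data model above can be simulated directly. The sketch below is illustrative only: the index layout (shared features first, then each task's private block, then spurious features), the helper names `make_feature_sets` and `sample_task`, and the toy sizes are assumptions made for the example.

```python
import numpy as np

def make_feature_sets(T, r, s):
    """Shared set S (size s) and disjoint private sets P_t (size r - s each)."""
    S = np.arange(s)
    P = [np.arange(s + t * (r - s), s + (t + 1) * (r - s)) for t in range(T)]
    n_inv = s + T * (r - s)          # |R|, the number of invariant features
    return S, P, n_inv

def sample_task(t, n, S, P, n_inv, M, u_star_t, rng):
    """Draw n labeled positive pairs from task D_t with x = M z."""
    d = M.shape[0]
    R_t = np.concatenate([S, P[t]])  # invariant features of task t
    U = np.arange(n_inv, d)          # spurious features
    z = np.zeros((n, d))             # z_{R \ R_t} stays zero
    z[:, R_t] = rng.normal(size=(n, len(R_t)))
    z[:, U] = rng.normal(size=(n, len(U)))
    z_pos = z.copy()
    z_pos[:, U] = rng.normal(size=(n, len(U)))  # resample spurious part only
    y = z[:, R_t] @ u_star_t         # label is linear in z_{R_t}
    return z @ M.T, z_pos @ M.T, y, z

rng = np.random.default_rng(0)
T, r, s, d = 3, 4, 2, 12
S, P, n_inv = make_feature_sets(T, r, s)
M, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthonormal dictionary
u_star = rng.normal(size=r)
x, x_pos, y, z = sample_task(1, 100, S, P, n_inv, M, u_star, rng)
```

An even mixture over tasks, as used for the universal representation, would correspond to drawing the task index `t` uniformly from `range(T)` before each call to `sample_task`.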
Let $v_{t,1} = \sum_{j \in S} (u_t^*)_j^2$ and $v_{t,2} = \sum_{j \in P_t} (u_t^*)_j^2$ be the weights on the shared and task-specific invariant features, respectively. Also, assume the prediction loss $\ell_c$ is $L$-Lipschitz.

Proposition 2.3. The representation $\phi^*$ obtained on an even mixture of data from all the tasks $\{\mathcal{D}_t : 1 \le t \le T\}$ satisfies

$$\phi^* \circ g(z) = Q\left(\sum_{j \in S} \sqrt{\alpha}\, z_j e_j + \sum_{j \in R \setminus S} \sqrt{\beta}\, z_j e_j\right)$$

for some $\alpha \in [0, 1]$, $\beta = \min\left\{1, \frac{r - \alpha s}{T(r - s)}\right\}$, where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. The Empirical Risk Minimizer $\hat{u} \in \mathcal{F}_{\phi^*,t}$ on $\phi^*$ using $m$ labeled data points from $\mathcal{D}_t$ has risk

$$\mathbb{E}_{(x,y) \sim \mathcal{D}_t}\left[\ell_c(\hat{u}^\top \phi^*(x), y)\right] \le 4L \sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\beta}\right)\left(s\alpha + (r - s)\beta + O\left(\sqrt{\frac{r}{s\alpha + (r - s)\beta}}\right)\right)} + 8\sqrt{\frac{2\ln(4/\delta)}{m}}.$$

Proposition 2.4. [...] encodes all invariant features but down-weights the task-specific features $P_t$. The difference in the learned features then determines the prediction performance and results in a trade-off between universality and label efficiency: compared to $\phi_t^*$, $\phi^*$ is useful for more tasks but has worse performance on the specific task $\mathcal{D}_t$.

For illustration, suppose $r = 2s$, and the shared and task-specific features are equally important for the labels on the target task: $v_{t,1} = v_{t,2} = \|u_t^*\|^2 / 2$. In Appendix B.3 we show that $\phi^*$ has $\alpha = 1$, $\beta = \frac{1}{T}$ and the error is $O\left(L\sqrt{\frac{Tr}{m}}\, \|u_t^*\|\right)$, while the error using $\phi_t^*$ is $O\left(L\sqrt{\frac{r}{m}}\, \|u_t^*\|\right)$. Therefore, the error when using representations pre-trained on data from $T$ tasks is $O(\sqrt{T})$ worse than that when just pre-training on data from the target task. On the other hand, the former can be used in all $T$ tasks and the prediction error diminishes with the number $m$ of labeled data points. The latter, in contrast, only encodes $R_t$, so the only useful features on the other tasks are $z_S$; then even with infinite labeled data the error can be large ($\ge \min_u \mathbb{E}[\ell_c(u^\top z_S, y)]$, the approximation error using only the common features $z_S$ for prediction).

Improving the Trade-off via Contrastive Regularization.
The above analysis provides some guidance on improving the trade-off, in particular, improving the target prediction accuracy when given a pre-trained representation $\phi^*$. It suggests that when $\phi^*$ is pre-trained on diverse data, one can update it by contrastive learning on some unlabeled data from the target task, which can yield better features and better predictions. This is indeed the case for the illustrative example above. We can show that updating $\phi^*$ by contrastive learning on $\mathcal{D}_t$ can increase the weights $\beta$ on the task-specific features $z_{P_t}$, and thus improve the generalization error (formal analysis in Appendix B.4).

In practice, one typically learns the classifier and also fine-tunes the representation with a labeled dataset $\{(x_i, y_i)\}_{i=1}^m$ from the target task. We thus propose contrastive regularization for fine-tuning: for each data point $(x, y)$, generate contrastive pairs $R = \{(\tilde{x}, \tilde{x}^+, \tilde{x}^-)\}$ by applying transformations, and add the contrastive loss on these pairs as a regularization term to the classification loss:

$$\ell_c(f(\phi(x)), y) + \frac{\lambda}{|R|} \sum_{(\tilde{x}, \tilde{x}^+, \tilde{x}^-) \in R} \ell\left(\phi(\tilde{x})^\top \left(\phi(\tilde{x}^+) - \phi(\tilde{x}^-)\right)\right).$$

This method is simple and generally applicable to different models and algorithms. Similar ideas have been used in graph learning (Ma et al., 2021), domain generalization (Kim et al., 2021), and semi-supervised learning (Lee et al., 2022), while we use it in fine-tuning for learning predictors. Our experiments in Section 3.3 show that it can consistently improve the prediction performance compared to the typical fine-tuning approach.
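The regularized per-example objective above can be written out as a short function. This is only a sketch of the loss computation, not the training procedure used in the experiments: the linear maps `W` and `V`, the noise-based stand-in for data transformations, and the use of random vectors as negatives are all assumptions made for illustration, and `regularized_loss` is a hypothetical name.

```python
import numpy as np

def logistic_loss(t):
    """ell(t) = log(1 + exp(-t)), the contrastive loss function."""
    return np.logaddexp(0.0, -t)

def cross_entropy(logits, label):
    """Prediction loss ell_c for a single example with integer label."""
    shifted = logits - np.max(logits)
    return float(np.log(np.sum(np.exp(shifted))) - shifted[label])

def regularized_loss(phi, f, x, y, pairs, lam):
    """Classification loss plus lam times the mean contrastive loss on pairs.

    pairs is a list of triples (x_tilde, x_tilde_pos, x_tilde_neg).
    """
    cls = cross_entropy(f(phi(x)), y)
    reg = np.mean([logistic_loss(phi(a) @ (phi(p) - phi(n))) for a, p, n in pairs])
    return cls + lam * float(reg)

rng = np.random.default_rng(0)
d, k, c = 8, 4, 3
W = rng.normal(size=(k, d))           # stand-in representation phi(x) = W x
V = rng.normal(size=(c, k))           # stand-in linear classifier f(z) = V z
phi = lambda v: W @ v
f = lambda z: V @ z

x, y = rng.normal(size=d), 1
# Assumption: small additive noise plays the role of a data transformation,
# and an independent random vector plays the role of a negative example.
pairs = [(x, x + 0.1 * rng.normal(size=d), rng.normal(size=d)) for _ in range(4)]

plain = regularized_loss(phi, f, x, y, pairs, lam=0.0)
regularized = regularized_loss(phi, f, x, y, pairs, lam=0.1)
```

Setting `lam=0.0` recovers the plain classification loss, so the regularizer can be toggled without changing the rest of a fine-tuning loop.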

3. EXPERIMENTS

We conduct experiments to answer the following questions. Fer2013. Note that training does not follow the online learning fashion, e.g., the model will pre-train from scratch (random initialization) on the CSG datasets, rather than using the model pre-trained on the CS datasets. Evaluation & Methods. We first pre-train a ResNet18 backbone (He et al., 2016) extractor) with the labeled data from the target task. We report the test accuracy on a specific target task and the average test accuracy on all pre-training datasets (i.e., using them as the downstream tasks). Appendix C.2 presents full details and additional results, while Fig. 3 shows the results for the method MoCo v2. The size and diversity of pre-training data are increased on the x-axis by incrementally adding unlabeled training data from: (a) CINIC-10, SVHN, GTSRB, ImageNet32 (using only a 500k subset); (b) EMNIST-Digits&Letters, Fashion-MNIST, GTSRB, ImageNet32; (c) FaceScrub, CIFAR-10, SVHN, ImageNet32. We further perform larger-scale experiments: (1) on ImageNet (see Fig. 4 ); (2) on ImageNet22k and GCC-15M (see Appendix C.2.1). Results. The results show that when the pre-training data becomes more diverse, the average test accuracy on all pre-training datasets increases (i.e., universality improves), while the test accuracy on the specific target task decreases (i.e., label efficiency drops). This shows a clear trade-off between universality and label efficiency. It supports our claim that diverse pre-training data allow learning diverse features for better universality, but can down-weight the features for a specific task resulting in worse prediction. Additional results in the appendix show similar trends (e.g., for methods NNCLR and SimSiam). This validates our theoretical analysis of the trade-off. Here we compute the similarity of the features learned from different pre-training datasets for a target task. 
For each pre-trained model, we extract a set of features for the target task Fer2013 using the pre-trained representation function. Then we compute the similarities between the extracted features based on different pre-training dataset pairs using linear Centered Kernel Alignment (CKA) (Kornblith et al., 2019) , a widely used tool for high-dimensional feature comparison.
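For reference, linear CKA between two feature matrices extracted on the same inputs can be computed as follows. This is a minimal sketch of the standard linear CKA formula from Kornblith et al. (2019), not the exact evaluation script used here; the function name `linear_cka` and the toy feature matrices are illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, p) and Y (n, q).

    Rows are the same n examples; columns are feature dimensions.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 16))             # features from model A
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
feats_b = 3.0 * feats_a @ Q                      # rotated and rescaled copy
feats_c = rng.normal(size=(200, 16))             # unrelated features

sim_self = linear_cka(feats_a, feats_a)
sim_rot = linear_cka(feats_a, feats_b)
sim_rand = linear_cka(feats_a, feats_c)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so `sim_rot` equals `sim_self` up to floating-point error, while unrelated features give a lower value.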

3.2. INSPECTING THE TRADE-OFF: FEATURE SIMILARITY

Figure 5 reports the results (rows/columns are pretraining data; numbers/colors show the similarity). The left figure shows that the features from different pre-training datasets have low similarities. This is consistent with our setup in Section 2.2 that different tasks only share some features and each owns many private ones. The right figure shows a decreasing trend of similarity along each row. This indicates that when gradually adding more diverse pre-training data, the learned representation will encode more downstream-task-irrelevant features, and become less similar to that prior to adding more pre-training data. Additional results with similar observations, finer-grained investigation into the trade-off, and some ablation studies are provided in Appendix C.3. Evaluation & Methods. We pre-train ResNet18 by MoCo v2 as in Section 3.1 and report the test accuracy on CIFAR-10 when the predictor is learned by: Linear Probing (LP), Finetune (FT), and Finetune with Contrastive Regularization (Ours). LP follows the training protocol in Section 3.1. FT and Ours learn a linear predictor and update the representation, and use the same data augmentation for a fair comparison. FT follows MAE (He et al., 2022) , while Ours uses MoCo v2 contrastive loss and regularization coefficient λ = 0.1. More details and results are given in Appendix C.4.

Results. Table 1 shows that our method can consistently outperform the other baselines. In particular, it outperforms the typical fine-tuning method by about 0.7%-1%, even when the latter also uses the same amount of data augmentation. This confirms the benefit of contrastive regularization. To further support our claim, Fig. 13

Larger Foundation Models. We further evaluate our method on several popular real-world large representation models (foundation models). On some of these models, the user may be able to fine-tune the representation when learning predictors. On very large foundation models, the user typically extracts feature embeddings of their data from the models and then trains a small predictor, called an adapter (Hu et al., 2021; Sung et al., 2022), on these embeddings. We evaluate CLIP (ViT-L (Dosovitskiy et al., 2020) as the representation backbone), MoCo v3 (ViT-B backbone), and SimCSE (Gao et al., 2021) (BERT backbone). They are trained on (image, text), (image, image), and (text, text) pairs, respectively, and so cover a good spectrum of methods. For CLIP and MoCo v3, the backbone is fixed. LP uses a linear classifier, while FT and Ours insert a two-layer ReLU network as an adapter between the backbone and the linear classification layer. Ours uses the SimCLR contrastive loss on the output of the adapter. For SimCSE, all methods use linear classifiers. LP fixes the backbone, while FT and Ours train the classifier and fine-tune the backbone simultaneously. Ours uses the SimCSE contrastive loss on the backbone feature. We set the regularization coefficient λ = 1.0. Table 2 again shows that our method can consistently improve the downstream prediction performance for all three models by about 0.4%-4.8%, and quite significantly in some cases (e.g., 1.3% for CLIP on ImageNet, 4.8% for MoCo v3 on GTSRB).
This shows that our method is also useful for large foundation models, even when the foundation models cannot be fine-tuned and only the extracted embeddings can be adapted. Full details and more results are provided in Appendix C.4.1.

4. CONCLUSION AND FUTURE WORK

In this work, we have shown and analyzed the trade-off between universality and label efficiency of representations in contrastive learning. There are many interesting open questions for future work. (1) What features does the model learn from specific pre-training and diverse pre-training datasets beyond linear data? (2) Do the other self-supervised learning methods have a similar trade-off? (3) Can we address the trade-off better to gain both properties at the same time?

Appendix

A PROOFS FOR SECTION 2.1

Theorem A.1 (Restatement of Theorem 2.1). If $\ell(t) = -t$, then the contrastive loss is equivalent to the PCA objective on $\phi_{z_R}$:

$$\mathbb{E}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] = -\mathbb{E}\left[\|\phi_{z_R} - \phi_0\|^2\right].$$

If additionally $\phi(x)$ is linear in $x$, then the contrastive loss is equivalent to the linear PCA objective on data from the distribution $p_{\bar{x}}$ of $\bar{x} = \mathbb{E}_{z_U}[x]$:

$$\mathbb{E}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] = -\mathbb{E}\left[\|\phi(\bar{x}) - \phi_0\|^2\right].$$

Proof. We first present some preliminaries for the proof. Recall that in our hidden representation data model $x = g(z)$. The learned representation is $\phi(x) = \phi(g(z)) = \phi \circ g(z)$. For brevity, let us define $h(z) := \phi \circ g(z) = \phi(x)$. Also, the hidden representations corresponding to $(x, x^+, x^-)$ are given by $(z, z^+, z^-)$, where $z = [z_R; z_U]$, $z^+ = [z_R; z_U^+]$, $z^- = [z_R^-; z_U^-]$; here $z_R$ and $z_R^-$ are sampled independently from the distribution $\mathcal{D}_R$, and $z_U$, $z_U^+$, and $z_U^-$ are sampled independently from the distribution $\mathcal{D}_U$. The expectation of an arbitrary function $f(z, z^+, z^-)$ can be simplified as follows:

$$\mathbb{E}_{(z, z^+, z^-)}\left[f(z, z^+, z^-)\right] = \mathbb{E}_{(z_R, z_R^-, z_U, z_U^+, z_U^-)}\left[f(z, z^+, z^-)\right] = \mathbb{E}_{(z_R, z_R^-)}\left[\mathbb{E}_{(z_U, z_U^+, z_U^-)}\left[f(z, z^+, z^-) \mid z_R, z_R^-\right]\right].$$

The second step follows from the law of iterated expectations. The negative expected contrastive loss is

\begin{align}
& -\mathbb{E}_{(x, x^+, x^-)}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] \tag{12} \\
&= -\mathbb{E}_{(z, z^+, z^-)}\left[\ell\left(\phi(g(z))^\top [\phi(g(z^+)) - \phi(g(z^-))]\right)\right] \tag{13} \\
&= \mathbb{E}_{(z, z^+, z^-)}\left[h(z)^\top [h(z^+) - h(z^-)]\right] \tag{14} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\mathbb{E}\left[h(z)^\top [h(z^+) - h(z^-)] \mid z_R, z_R^-\right]\right] \tag{15} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\mathbb{E}[h(z) \mid z_R]^\top \left(\mathbb{E}[h(z^+) \mid z_R] - \mathbb{E}[h(z^-) \mid z_R^-]\right)\right] \tag{16} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\mathbb{E}[\phi(x) \mid z_R]^\top \left(\mathbb{E}[\phi(x^+) \mid z_R] - \mathbb{E}[\phi(x^-) \mid z_R^-]\right)\right] \tag{17} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\phi_{z_R}^\top \left(\phi_{z_R} - \phi_{z_R^-}\right)\right]. \notag
\end{align}

The second equality follows from the choice of loss $\ell(t) = -t$, and the fourth equality follows from the fact that $z_U$, $z_U^+$, and $z_U^-$ are sampled independently from the distribution $\mathcal{D}_U$. Also, we have defined $\phi_{z_R} := \mathbb{E}[\phi(x) \mid z_R]$.

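Denote the centered representation as $\tilde{\phi}_{z_R} = \phi_{z_R} - \phi_0$. Then we have

\begin{align}
& -\mathbb{E}_{(x, x^+, x^-)}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] \tag{19} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\phi_{z_R}^\top \left(\phi_{z_R} - \phi_{z_R^-}\right)\right] \tag{20} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[(\tilde{\phi}_{z_R} + \phi_0)^\top \left(\tilde{\phi}_{z_R} + \phi_0 - \tilde{\phi}_{z_R^-} - \phi_0\right)\right] \tag{21} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[(\tilde{\phi}_{z_R} + \phi_0)^\top \left(\tilde{\phi}_{z_R} - \tilde{\phi}_{z_R^-}\right)\right] \tag{22} \\
&= \mathbb{E}_{(z_R, z_R^-)}\left[\tilde{\phi}_{z_R}^\top \tilde{\phi}_{z_R} - \tilde{\phi}_{z_R}^\top \tilde{\phi}_{z_R^-}\right] + \mathbb{E}_{(z_R, z_R^-)}\left[\phi_0^\top \left(\tilde{\phi}_{z_R} - \tilde{\phi}_{z_R^-}\right)\right]. \notag
\end{align}

Since $\tilde{\phi}_{z_R}$ and $\tilde{\phi}_{z_R^-}$ are independent with mean 0, we have $\mathbb{E}_{(z_R, z_R^-)}[\tilde{\phi}_{z_R}^\top \tilde{\phi}_{z_R^-}] = 0$, $\mathbb{E}_{(z_R, z_R^-)}[\phi_0^\top \tilde{\phi}_{z_R}] = 0$, and $\mathbb{E}_{(z_R, z_R^-)}[\phi_0^\top \tilde{\phi}_{z_R^-}] = 0$. Therefore,

\begin{align}
& -\mathbb{E}_{(x, x^+, x^-)}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] \tag{24} \\
&= \mathbb{E}_{z_R}\left[\tilde{\phi}_{z_R}^\top \tilde{\phi}_{z_R}\right] \tag{25} \\
&= \mathbb{E}_{z_R}\left[\|\tilde{\phi}_{z_R}\|^2\right] \tag{26} \\
&= \mathbb{E}_{z_R}\left[\|\phi_{z_R} - \phi_0\|^2\right], \notag
\end{align}

which is the PCA objective on the mean representation $\phi_{z_R}$. If additionally $\phi(x)$ is linear in $x$, then

\begin{align}
& -\mathbb{E}_{(x, x^+, x^-)}\left[\ell\left(\phi(x)^\top [\phi(x^+) - \phi(x^-)]\right)\right] \tag{28} \\
&= \mathbb{E}_{z_R}\left[\|\phi_{z_R} - \phi_0\|^2\right] \tag{29} \\
&= \mathbb{E}_{\bar{x}}\left[\|\phi(\bar{x}) - \phi(\bar{x}_0)\|^2\right] \tag{30}
\end{align}

which is the linear PCA objective on the data from the distribution of $\bar{x} = \mathbb{E}[x \mid z_R]$, where $\bar{x}_0 := \mathbb{E}[\bar{x}]$ so that $\phi(\bar{x}_0) = \phi_0$ by linearity.

Theorem A.2 (Restatement of Theorem 2.2). Under Assumptions (A1)(A2)(A3):
(1) The optimal representation $\phi^*$ does not encode $z_U$: $\phi^* \circ g(z)$ is independent of $z_U$.
(2) For any invariant feature $i \in R$, there exists $B_i > 0$ such that as long as the representations' norm $B_r \ge B_i$, the optimal representation encodes $z_i$. Furthermore, if $z_R$ is discrete, then $B_i$ is monotonically decreasing in $\Pr[z_{R \setminus \{i\}} = z^-_{R \setminus \{i\}}, z_i \neq z^-_i]$, the probability that in $z_R$ and $z^-_R$, the $i$-th feature varies while the others remain the same.

Proof. (1) Recall that $\phi_{z_R} = \mathbb{E}[\phi \circ g(z) \mid z_R]$ and $\phi_0 = \mathbb{E}_z[\phi \circ g(z)] = \mathbb{E}_{z_R}[\phi_{z_R}]$.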
Then the contrastive loss at pre-training is:

\begin{align}
& \mathbb{E}_{(x,x^+,x^-)}\left[\ell\left(\phi(x)^\top[\phi(x^+) - \phi(x^-)]\right)\right] \tag{32} \\
&= \mathbb{E}_{(z,z^+,z^-)}\left[\ell\left((\phi\circ g(z))^\top(\phi\circ g(z^+) - \phi\circ g(z^-))\right)\right] \tag{33} \\
&= \mathbb{E}_{(z_R,z_R^-)}\left[\mathbb{E}\left[\ell\left((\phi\circ g(z))^\top(\phi\circ g(z^+) - \phi\circ g(z^-))\right) \mid z_R, z_R^-\right]\right] \tag{34} \\
&\ge \mathbb{E}_{(z_R,z_R^-)}\left[\ell\left(\mathbb{E}\left[(\phi\circ g(z))^\top(\phi\circ g(z^+) - \phi\circ g(z^-)) \mid z_R, z_R^-\right]\right)\right] \tag{35} \\
&= \mathbb{E}_{(z_R,z_R^-)}\left[\ell\left(\mathbb{E}[\phi\circ g(z) \mid z_R]^\top\left(\mathbb{E}[\phi\circ g(z^+) \mid z_R] - \mathbb{E}[\phi\circ g(z^-) \mid z_R^-]\right)\right)\right] \tag{36} \\
&= \mathbb{E}_{(z_R,z_R^-)}\left[\ell\left(\phi_{z_R}^\top\phi_{z_R} - \phi_{z_R}^\top\phi_{z_R^-}\right)\right], \tag{37}
\end{align}

where the inequality comes from the convexity of $\ell(t)$ and Jensen's inequality applied to the inner expectation. The inequality becomes an equality when the representation function $\phi$ is invariant to the spurious features $z_U$, i.e., with probability 1 over the distribution, $\phi \circ g(z) = \phi_{z_R}$. Therefore, the spurious features $z_U$ are not encoded in the optimal representation, proving the first part.

(2) First consider the case when $z$ takes discrete values from a finite set. When the generative function $g(z)$ is not independent of $z_i$, we assume for contradiction that the optimal representation $\phi$ is independent of $z_i$. From (1), we know that it is independent of $z_U$. So there exists an $f$ such that $\phi \circ g(z) = f(z_{R\setminus\{i\}})$. Without loss of generality, suppose $U = \emptyset$; then $\phi \circ g(z) = f(z_{-i})$. Since the generative function $g(z)$ is not independent of $z_i$, there exist $z$ and $z^-$ such that $z_{-i} = z^-_{-i}$, $z_i \ne z_i^-$, $g(z) \ne g(z^-)$, and $z, z^-$ have non-zero probabilities. So $\Pr[z_{-i} = z^-_{-i}, z_i \ne z_i^-] > 0$.

Now construct a new representation function $\tilde{\phi} \in \mathbb{R}^{k+n}$, $n = |Z|$, such that $\tilde{\phi} \circ g(z) = h(z)$ as follows:

$$h(z) = \left[\sqrt{1-\alpha^2}\, f(z_{-i});\ \alpha\|f(z_{-i})\|\, I_z\right], \tag{38}$$

where $I_z$ is the one-hot encoding of the value $z$. Note that $\tilde{\phi}$ still satisfies the norm bound since $\|\tilde{\phi}(x)\| = \|h(z)\| = \|f(z_{-i})\|$. We next show that the contrastive loss of $\tilde{\phi}$ can be smaller than that of $\phi$, leading to a contradiction and finishing the proof.

The contrastive loss of $\tilde{\phi}$ (using the fact that $z^+ = z$ when $U = \emptyset$) is

\begin{align}
& \mathbb{E}_{(z,z^-)}\left[\ell\left(h(z)^\top h(z) - h(z)^\top h(z^-)\right)\right] \tag{39} \\
&= \mathbb{E}_{(z,z^-)}\left[\ell\left(h(z)^\top h(z) - h(z)^\top h(z^-)\right) \mid z \ne z^-\right]\Pr[z \ne z^-] + \mathbb{E}_{(z,z^-)}[\ell(0)]\Pr[z = z^-]. \tag{40}
\end{align}

We only need to consider the first term:

\begin{align}
& \mathbb{E}_{(z,z^-)}\left[\ell\left(h(z)^\top h(z) - h(z)^\top h(z^-)\right) \mid z \ne z^-\right]\Pr[z \ne z^-] \tag{41} \\
&= \mathbb{E}_{(z,z^-)}\Big[\ell\Big(\underbrace{\|f(z_{-i})\|^2 - (1-\alpha^2) f(z_{-i})^\top f(z^-_{-i})}_{T_1}\Big) \mid z_{-i} \ne z^-_{-i}\Big]\Pr[z_{-i} \ne z^-_{-i}] \tag{42} \\
&\quad + \mathbb{E}_{(z,z^-)}\Big[\ell\Big(\underbrace{\alpha^2\|f(z_{-i})\|^2}_{T_2}\Big) \mid z_{-i} = z^-_{-i}, z_i \ne z_i^-\Big]\Pr[z_{-i} = z^-_{-i}, z_i \ne z_i^-]. \tag{43}
\end{align}

When $\alpha = 0$, the above reduces to the corresponding terms for $\phi$, so we would like to show that there exists a non-zero $\alpha$ that leads to smaller loss values. Recall that $\ell(\cdot)$ is decreasing by property (A3). Let $\alpha = \sqrt{1/2}/B_r$, where $B_r = \|f(z_{-i})\|$. Then when switching from $\phi$ to $\tilde{\phi}$, the $T_2$ term goes from $\ell(0)$ to $\ell(1/2)$, a constant reduction. For the $T_1$ term, if $f(z_{-i})^\top f(z^-_{-i})$ is positive, then the term decreases; if $f(z_{-i})^\top f(z^-_{-i})$ is negative, then the term increases from $\ell(B_r^2 - f(z_{-i})^\top f(z^-_{-i}))$ to $\ell(B_r^2 - f(z_{-i})^\top f(z^-_{-i}) + \alpha^2 f(z_{-i})^\top f(z^-_{-i}))$. Note that $|\alpha^2 f(z_{-i})^\top f(z^-_{-i})| \le 1$ (by the Cauchy-Schwarz inequality), so the increase in the $T_1$ term diminishes when $B_r$ grows, by property (A3) of $\ell$. Then when $B_r$ is large enough, the increase in the $T_1$ term is smaller than the decrease in the $T_2$ term. So from $\phi$ to $\tilde{\phi}$, the contrastive loss decreases, contradicting that $\phi$ is optimal.

Finally, since the reduction in (43) is smaller when $\Pr[z_{-i} = z^-_{-i}, z_i \ne z_i^-]$ is smaller, $B_i$ then needs to be larger. So $B_i$ is monotonically decreasing in $\Pr[z_{-i} = z^-_{-i}, z_i \ne z_i^-]$.

Now consider the general case when $z$ may not be from a finite set. For any $\epsilon_0 > 0$, there exists an $\ell_2$ ball $\mathcal{B}$ of bounded radius such that the probability of $z$ falling outside the ball is at most $\epsilon_0$. Since the $\phi \circ g$'s are regular by assumption, there exists a partition of $Z \cap \mathcal{B}$ into finitely many subsets such that in each subset and for each $\phi \circ g$, the function value varies by at most $\epsilon_0$. Construct a new distribution $D'_z$ for $z$: select a representative point in each subset, assign it a probability mass equal to that of the original distribution $D_z$ on this subset, and normalize the probabilities over the subsets. The new distribution is over a finite set, so the above argument holds. Furthermore, the difference in the $T_1$ term between $D'_z$ and $D_z$ can be made arbitrarily small by choosing a sufficiently small $\epsilon_0$; similarly for $T_2$. Then the argument also holds for $D_z$, which completes the proof for the general case.
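As a sanity check on Theorem A.1, the identity between the contrastive loss with $\ell(t) = -t$ and the PCA objective can be verified by exact enumeration on a small finite data model. The sketch below is illustrative only: the finite supports for $z_R$ and $z_U$, the table `h` standing in for $\phi \circ g$, and the sizes `n_R`, `n_U`, `k` are all assumptions made for the example and are not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_R, n_U, k = 3, 4, 2  # hypothetical: |supp(z_R)|, |supp(z_U)|, representation dim
# h[a, b] plays the role of phi(g([z_R = a; z_U = b])); values are arbitrary.
h = rng.normal(size=(n_R, n_U, k))

def contrastive_loss(h):
    """Exact E[l(phi(x)^T (phi(x+) - phi(x-)))] with l(t) = -t, uniform z_R and z_U."""
    n_R, n_U, _ = h.shape
    total = 0.0
    for a in range(n_R):                  # z_R shared by x and x+
        for a_neg in range(n_R):          # z_R^- of the negative example
            for b in range(n_U):          # z_U
                for b_pos in range(n_U):  # z_U^+
                    for b_neg in range(n_U):  # z_U^-
                        total += -(h[a, b] @ (h[a, b_pos] - h[a_neg, b_neg]))
    return total / (n_R ** 2 * n_U ** 3)

def pca_objective(h):
    """-E ||phi_{z_R} - phi_0||^2, where phi_{z_R} averages out z_U."""
    phi_zR = h.mean(axis=1)
    phi_0 = phi_zR.mean(axis=0)
    return -np.mean(np.sum((phi_zR - phi_0) ** 2, axis=1))

loss_value = contrastive_loss(h)
pca_value = pca_objective(h)
```

Under these assumptions the two quantities agree up to floating-point error for any table `h`, which mirrors steps (12) to (27) of the proof.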

A.1 INDUCTIVE BIASES ARE NEEDED FOR ANALYZING PREDICTION SUCCESS

We have analyzed what features are encoded in the representation. However, encoding the information does not equate to good prediction performance, in particular with linear predictors. Recently, Saunshi et al. (2022b) demonstrated that existing analyses that ignore the inductive biases of the model and algorithm cannot adequately explain the prediction success, and provided examples where such analysis can lead to vacuous bounds. One may wonder if our hidden representation data model can provide inductive biases that avoid such vacuous bounds. Unfortunately, issues similar to those in Saunshi et al. (2022b) remain.

To illustrate that inductive biases are still needed in our data model, consider the following simple example. Suppose $z_R \in \{-1, 1\}^2$ and can be recovered from $x$; the label $y$ is simply the first coordinate of $z_R$. Suppose the representation satisfies $\phi(x) \in \mathbb{R}^2$, $\|\phi(x)\| = 1$, and contrastive learning uses the logistic loss $\ell$. Let $\phi(x)$ be such that $\phi \circ g(z) = h(z_R)$, with $h((-1,-1)) = (-1, 0)$, $h((-1,1)) = (1, 0)$, $h((1,-1)) = (0,-1)$, and $h((1,1)) = (0, 1)$. It can be verified that this $\phi$ is optimal for the contrastive loss. However, on the representation $\phi$, the classification is an XOR problem (Fig. 6), for which there is no non-trivial error bound for linear predictors. This contradicts the success of linear probing in practice.

Furthermore, some restrictions on the data distributions are also needed. Suppose all optimal representations are linearly separable with certain inductive biases on the representation function class. Suppose the label $y$ depends on $z_R$. Without restrictions on the labeling function, one can consider a random $y \in \{-1, +1\}$ over any $z_R$. Then for any linear predictor on any optimal representation, the error is $1/2$ in expectation, so there is always a labeling function for which no non-trivial error can be achieved. Our analysis thus requires restrictions on the dependence of the label on $z_R$ (in particular, we will assume linear dependence).
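The XOR-type example above can be checked with a small brute-force search. The following is a minimal sketch, not part of the original analysis: it scans a grid of unit-norm weight vectors and biases (the grid resolution and the helper name `best_linear_accuracy` are choices made for this illustration) and reports the best accuracy any such linear classifier reaches on the four representation points.

```python
import numpy as np

# Representations h(z_R) for z_R = (-1,-1), (-1,1), (1,-1), (1,1), as in the example.
H = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
# Label y is the first coordinate of z_R.
y = np.array([-1, -1, 1, 1])

def best_linear_accuracy(H, y, n_angles=360, n_bias=81):
    """Best accuracy of sign(w^T h + b) over a grid of unit vectors w and biases b."""
    best = 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        for b in np.linspace(-2.0, 2.0, n_bias):
            pred = np.where(H @ w + b > 0, 1, -1)
            best = max(best, float(np.mean(pred == y)))
    return best

best_acc = best_linear_accuracy(H, y)
```

No grid point classifies all four points correctly: the two points with $y = -1$ force $b \le 0$, while the two points with $y = +1$ force $b > 0$, so at most three of the four points can be fit by any linear predictor.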

B PROOFS AND MORE ANALYSIS FOR SECTION 2.2

B.1 LEMMAS FOR A MORE GENERAL SETTING

We will prove the results in a more general setting, where the mixture can be uneven and the variances of different types of features can be different. The results in Section 2.2 then follow from these lemmas.

In the more general setting, the diverse pre-training data is a mixture of data from $T$ different tasks $D_t$, while the target task is one of the tasks. In the mixture, the task $D_t$ has weight $w_t > 0$ and $\sum_{t=1}^T w_t = 1$. All tasks share a public feature set $S$ of size $s$, and each task $D_t$ additionally owns a private disjoint feature set $P_t$ of size $r - s$, i.e., $P_t \cap S = \emptyset$ for $t \in [T]$ and $P_{t_1} \cap P_{t_2} = \emptyset$ for $t_1 \ne t_2$. The invariant features for $D_t$ are then $R_t = S \cup P_t$. All invariant features are $\cup_{t=1}^T R_t \subseteq R$, $k := |R|$, and the spurious features are $U = [d] \setminus R$. In task $D_t$, the positive pairs $(x, x^+)$ are generated as follows:

$$z_S \sim \mathcal{N}(0, \sigma^2_{S,t} I), \quad z_{P_t} \sim \mathcal{N}(0, \sigma^2_{R,t} I), \quad z_{R\setminus R_t} = 0, \quad z_U \sim \mathcal{N}(0, \sigma^2_{U,t} I), \quad z = [z_R; z_U], \quad x = g(z),$$

$$z_U^+ \sim \mathcal{N}(0, \sigma^2_{U,t} I), \quad z^+ = [z_R; z_U^+], \quad x^+ = g(z^+),$$

and $x^-$ is simply an i.i.d. copy from the same distribution as $x$. In practice, multiple independent negative examples are used, and thus we consider the following contrastive loss

$$\min_{\phi \in \Phi} \mathbb{E}_{(x,x^+)}\left[\ell\left(\phi(x)^\top\left(\phi(x^+) - \mathbb{E}_{x^-}\phi(x^-)\right)\right)\right] \tag{47}$$

to pre-train a representation $\phi$. Then, when using $\phi$ for prediction in the target task $D_t$, the predictor class should contain a predictor matching the ground-truth label, so consider the class

$$\mathcal{F}_{\phi,t} = \{f(z) = u_t^\top z : u_t \in \mathbb{R}^k, \|u_t\| \le B_{\phi,t}\},$$

where $B_{\phi,t}$ is the minimum value such that there exists $u_t \in \mathcal{F}_{\phi,t}$ with $y = u_t^\top \phi(x)$ on $D_t$.

Recall that we assume a linear data model and linear representation functions $\phi$:

• $x$ is linear in $z$: $x = g(z) = Mz$, where $M \in \mathbb{R}^{d\times d}$ is an orthonormal dictionary. The label in task $D_t$ is linear in its invariant features: $y = (u_t^*)^\top z_{R_t}$ for some $u_t^* \in \mathbb{R}^r$.

• The representations are linear functions with weights of bounded spectral/Frobenius norms: $\Phi = \{\phi(x) = Wx : W \in \mathbb{R}^{k\times d}, \|W\| \le 1, \|W\|_F \le \sqrt{r}\}$. Here the norm bounds are chosen to be the minimum values that allow recovering the invariant features in the target task, i.e., there exists $\phi \in \Phi$ such that $\phi(x) = [z_{R_t}; 0]$.
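The generative process for positive pairs described above can be sketched in a few lines. This is a minimal illustration under assumptions of our own choosing: the dimensions, the index layout of $S$, $P_t$, and $U$, the random orthonormal dictionary `M`, and the helper name `sample_positive_pair` are all hypothetical and serve only to make the data model concrete.

```python
import numpy as np

def sample_positive_pair(rng, M, S, P_t, U, sigma_S=1.0, sigma_R=1.0, sigma_U=1.0):
    """Draw (x, x+) from task D_t: shared invariant features, resampled spurious ones."""
    d = M.shape[0]
    z = np.zeros(d)                      # features of other tasks (R \ R_t) stay 0
    z[S] = rng.normal(0.0, sigma_S, size=len(S))
    z[P_t] = rng.normal(0.0, sigma_R, size=len(P_t))
    z[U] = rng.normal(0.0, sigma_U, size=len(U))
    z_pos = z.copy()                     # z+ keeps z_R ...
    z_pos[U] = rng.normal(0.0, sigma_U, size=len(U))  # ... and redraws z_U
    return M @ z, M @ z_pos, z, z_pos

rng = np.random.default_rng(0)
d = 12                                   # hypothetical sizes: s = 2, r = 4, T = 3
S = [0, 1]
P = [[2, 3], [4, 5], [6, 7]]
U = [8, 9, 10, 11]
M, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random orthonormal dictionary
x, x_pos, z, z_pos = sample_positive_pair(rng, M, S, P[0], U)
```

Because `M` is orthonormal, `M.T @ x` recovers `z`, and the pair differs only in the spurious coordinates `U`.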

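Lemma B.1. Let $\alpha$ and $\alpha_t$ ($t \in [T]$) be an optimal solution of

$$\min_{\alpha, \alpha_t} \sum_{t=1}^T w_t\, \mathbb{E}\left[\ell\left(\alpha\sigma^2_{S,t} Z + \alpha_t\sigma^2_{R,t} Z_t\right)\right], \quad \text{subject to } \alpha s + \sum_{t=1}^T \alpha_t (r-s) \le r, \quad \alpha, \alpha_t \in [0,1],$$

where $Z \sim \chi^2_s$ and $Z_t \sim \chi^2_{r-s}$. Then the optimal representation $\phi^*(x)$ for the loss (47) in contrastive learning satisfies $\phi^*(x) = W^*x$ with any $W^*$ of the form

$$W^* = [QA^*, 0]M^{-1}, \tag{52}$$

where $Q \in \mathbb{R}^{k\times k}$ is any orthonormal matrix, $A^*$ is a $k \times k$ diagonal matrix with

$$A^*_{jj} = \begin{cases} \sqrt{\alpha} & \text{if } j \in S, \\ \sqrt{\alpha_t} & \text{if } j \in P_t, \\ 0 & \text{otherwise,} \end{cases} \tag{53}$$

and the matrix of zeros has size $k \times (d-k)$.

Proof. For each $D_t$,

\begin{align}
& \mathbb{E}_{(x,x^+)}\left[\ell\left(\phi(x)^\top[\phi(x^+) - \mathbb{E}_{x^-}\phi(x^-)]\right)\right] \tag{54} \\
&= \mathbb{E}_{(z,z^+)}\left[\ell\left((WMz)^\top(WMz^+ - \mathbb{E}_{z^-}[WMz^-])\right)\right] \tag{55} \\
&= \mathbb{E}_{(z,z^+)}\left[\ell\left(z^\top(M^\top W^\top WM)(z^+ - \mathbb{E}_{z^-}[z^-])\right)\right] \tag{56} \\
&\ge \mathbb{E}_{z_R}\left[\ell\left((\mathbb{E}_{z_U}[z])^\top M^\top W^\top WM\,(\mathbb{E}_{z_U^+}[z^+] - \mathbb{E}_{z^-}[z^-])\right)\right] \tag{57} \\
&= \mathbb{E}_{z_R}\left[\ell\left([z_R; 0]^\top M^\top W^\top WM\,([z_R; 0] - 0)\right)\right] \tag{58} \\
&= \mathbb{E}_{z_R}\left[\ell\left(\|WM[z_R; 0]\|^2\right)\right], \tag{59}
\end{align}

where the inequality comes from the convexity of $\ell(t)$ and Jensen's inequality. Similar to Theorem 2.2, the equality holds if and only if $WMz$ does not depend on $z_U$ and $WMz^+$ does not depend on $z_U^+$, so the optimal solution should satisfy this condition.

Let $WM = [A_R, A_U]$ where $A_R \in \mathbb{R}^{k\times k}$ and $A_U \in \mathbb{R}^{k\times(d-k)}$. By rotational invariance of $z_S$ and $z_{P_t}$, without loss of generality we can assume $A_R = QA$, where $A$ is a diagonal matrix with diagonal entries $a_{jj}$ and $Q$ is any orthonormal matrix. Furthermore, $A_U = 0$ in the optimal solution since it does not affect the loss but only decreases the norm bound on $A_R$. So on data from the task $D_t$,

$$\mathbb{E}_{D_t}\left[\ell\left(\|WM[z_R; 0]\|^2\right)\right] = \mathbb{E}_{z_{R_t}}\Big[\ell\Big(\sum_{j\in R_t} a_{jj}^2 z_j^2\Big)\Big]. \tag{60}$$

Then on the mixture,

\begin{align}
& \mathbb{E}_{(x,x^+)}\left[\ell\left(\phi(x)^\top[\phi(x^+) - \mathbb{E}_{x^-}\phi(x^-)]\right)\right] \tag{61} \\
&\ge \sum_{t=1}^T w_t\, \mathbb{E}_{\{z_j\}}\Big[\ell\Big(\sum_{j\in R_t} a_{jj}^2 z_j^2\Big)\Big] \tag{62} \\
&= \sum_{t=1}^T w_t\, \mathbb{E}_{\{\bar{z}_j \sim \mathcal{N}(0,1)\}}\Big[\ell\Big(\sum_{j\in S} a_{jj}^2\sigma^2_{S,t}\bar{z}_j^2 + \sum_{j\in P_t} a_{jj}^2\sigma^2_{R,t}\bar{z}_j^2\Big)\Big] \tag{63} \\
&:= g(\{a_{jj}\}), \tag{64}
\end{align}

where each $\bar{z}_j$ is a random variable drawn from the standard Gaussian. Now consider the minimum of the function $g(\{a_{jj}\})$ on the right-hand side, under the constraints that $|a_{jj}| \le 1$ and $\sum_j a_{jj}^2 \le r$. Before finishing the proof of Lemma B.1, we have the following claim for this optimization.

Claim 1. There exist $\alpha, \alpha_t$ satisfying $0 \le \alpha, \alpha_t \le 1$ and $\alpha s + \sum_{t=1}^T \alpha_t(r-s) = \sum_j a_{jj}^2 \le r$, such that the minimum of the above optimization (64) is achieved when $a_{jj}^2 = \alpha$ for any $j \in S$, and $a_{jj}^2 = \alpha_t$ for any $j \in P_t$ and $t \in [T]$.

Proof. We need to prove that to achieve the minimum, (1) $a_{ll}^2 = a_{l'l'}^2$ for any $l \ne l' \in S$; (2) $a_{ll}^2 = a_{l'l'}^2$ for any $l \ne l' \in P_t$ and any $t \in [T]$.

For (1): By symmetry of the $z_j$'s and the convexity of $\ell(\cdot)$, for any $t \in [T]$,

\begin{align}
& \mathbb{E}\Big[\ell\Big(\sum_{j\in R_t} a_{jj}^2 z_j^2\Big)\Big] \tag{65} \\
&= \frac{1}{2}\mathbb{E}\Big[\ell\Big(\sum_{j\in S, j\ne l, j\ne l'} a_{jj}^2 z_j^2 + a_{ll}^2 z_l^2 + a_{l'l'}^2 z_{l'}^2 + \sum_{j\in P_t} a_{jj}^2 z_j^2\Big)\Big] \tag{66} \\
&\quad + \frac{1}{2}\mathbb{E}\Big[\ell\Big(\sum_{j\in S, j\ne l, j\ne l'} a_{jj}^2 z_j^2 + a_{ll}^2 z_{l'}^2 + a_{l'l'}^2 z_l^2 + \sum_{j\in P_t} a_{jj}^2 z_j^2\Big)\Big] \tag{67} \\
&\ge \mathbb{E}\Big[\ell\Big(\sum_{j\in S, j\ne l, j\ne l'} a_{jj}^2 z_j^2 + \frac{a_{ll}^2 + a_{l'l'}^2}{2} z_{l'}^2 + \frac{a_{ll}^2 + a_{l'l'}^2}{2} z_l^2 + \sum_{j\in P_t} a_{jj}^2 z_j^2\Big)\Big]. \tag{68}
\end{align}

Then

$$g(\{a_{jj}\}) \ge \sum_{t=1}^T w_t\, \mathbb{E}\Big[\ell\Big(\sum_{j\in S, j\ne l, j\ne l'} a_{jj}^2 z_j^2 + \frac{a_{ll}^2 + a_{l'l'}^2}{2} z_{l'}^2 + \frac{a_{ll}^2 + a_{l'l'}^2}{2} z_l^2 + \sum_{j\in P_t} a_{jj}^2 z_j^2\Big)\Big]. \tag{69}$$

Therefore, the minimum is achieved when $a_{ll}^2 = a_{l'l'}^2$. A similar argument proves statement (2).

These statements mean that, for any $t \in [T]$, the minimum is achieved when $a_{jj}^2 = \alpha$ for $j \in S$, and $a_{jj}^2 = \alpha_t$ for $j \in P_t$, for some values $\alpha, \alpha_t \ge 0$. Let $Z = \sum_{j\in S}\bar{z}_j^2$ and $Z_t = \sum_{j\in P_t}\bar{z}_j^2$. Then $Z \sim \chi^2_s$ and $Z_t \sim \chi^2_{r-s}$, and we have:

\begin{align}
g(\{a_{jj}\}) &= \sum_{t=1}^T w_t\, \mathbb{E}\Big[\ell\Big(\sum_{j\in S}\alpha\sigma^2_{S,t}\bar{z}_j^2 + \sum_{j\in P_t}\alpha_t\sigma^2_{R,t}\bar{z}_j^2\Big)\Big] \tag{70} \\
&= \sum_{t=1}^T w_t\, \mathbb{E}\left[\ell\left(\alpha\sigma^2_{S,t} Z + \alpha_t\sigma^2_{R,t} Z_t\right)\right]. \tag{71}
\end{align}

Given the constraints $\alpha s + \sum_{t=1}^T \alpha_t(r-s) = \sum_j a_{jj}^2 \le r$ and $0 \le \alpha, \alpha_t \le 1$, we complete the proof of Lemma B.1.

Given this result, we can now analyze the generalization error when predicting on the target task $D_t$.

Lemma B.2. Consider any $t \in [T]$. Let $v_{t,1} = \sum_{j\in S}(u_t^*)_j^2$ and $v_{t,2} = \sum_{j\in P_t}(u_t^*)_j^2$.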
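Suppose in $\phi^*$ (calculated in Lemma B.1), $\alpha, \alpha_t > 0$. Suppose the prediction loss $\ell_c$ is $L$-Lipschitz. Then the Empirical Risk Minimizer $\hat{u}_t \in \mathcal{F}_{\phi^*,t}$ on $\phi^*$ using $m$ labeled data points from $D_t$ has risk

$$\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}_t^\top\phi^*(x), y)\right] \le 8\sqrt{\frac{2\ln(4/\delta)}{m}} + 4L\sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\alpha_t}\right)}\left(\sqrt{s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t}} + O\left(\sqrt{\frac{\max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}^2\, r}{s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t}}}\right)\right).$$

Proof. For any $t \in [T]$, we only need to bound the Rademacher complexity $R_m(\mathcal{F}_{\phi^*,t})$ of $\mathcal{F}_{\phi^*,t}$; the statement then follows from standard generalization bounds,

$$\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}_t^\top\phi^*(x), y)\right] \le 4L R_m(\mathcal{F}_{\phi^*,t}) + 8\sqrt{\frac{2\ln(4/\delta)}{m}}.$$

Given the representation $\phi^*$ in Lemma B.1, to ensure there exists a predictor in $\mathcal{F}_{\phi^*,t}$ matching the ground-truth label, $f(\phi^*(x)) = u_t^\top\phi^*(x) = y = (u_t^*)^\top z_{R_t}$, the predictor $u_t$ should satisfy

\begin{align}
\mathbb{E}_{D_t}[(\hat{y} - y)^2] = 0 &\Leftrightarrow \forall z_{R_t},\ u_t^\top[QA^*, 0]M^{-1}M[z_{R_t}; 0; z_U] = {u_t^*}^\top z_{R_t} \tag{72} \\
&\Leftrightarrow \forall z_{R_t},\ u_t^\top QA^*[z_{R_t}; 0] = {u_t^*}^\top z_{R_t} \tag{73} \\
&\overset{(*)}{\Leftrightarrow} A^*_{1:r,1:r}(Q^\top)_{1:r,1:k}\, u_t = u_t^* \tag{74} \\
&\Leftrightarrow u_t = Q_{1:k,1:r}(A^*_{1:r,1:r})^{-1}u_t^* + Q_{1:k,r+1:k}\, v \ \text{ for some } v \in \mathbb{R}^{k-r}. \tag{75}
\end{align}

The step $(*)$ follows from the non-zero variance of $z_{R_t}$. The choice $u_t = Q_{1:k,1:r}(A^*_{1:r,1:r})^{-1}u_t^*$ is the least-norm optimal solution, so we have $B_{\phi^*,t} = \|Q_{1:k,1:r}(A^*_{1:r,1:r})^{-1}u_t^*\| = \sqrt{\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\alpha_t}}$. So the predictor class should be

$$\mathcal{F}_{\phi^*,t} = \left\{f(\phi^*) = u_t^\top\phi^* : u_t \in \mathbb{R}^k,\ \|u_t\| \le B_{\phi^*,t} = \sqrt{\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\alpha_t}}\right\}. \tag{76}$$

The empirical Rademacher complexity and the Rademacher complexity of $\mathcal{F}_{\phi^*,t}$ with $m$ samples are

\begin{align}
\hat{R}_m(\mathcal{F}_{\phi^*,t}) &= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{f_{u,\phi}\in\mathcal{F}_{\phi^*,t}} \sum_{i=1}^m \sigma_i f_{u,\phi}(x^{(i)})\Big] \tag{77} \\
&= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\|u\|\le B_{\phi^*,t}} \sum_{i=1}^m \sigma_i u_t^\top QA^*[z^{(i)}_{R_t}; 0]\Big] \tag{78} \\
&= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\|u\|\le B_{\phi^*,t}} u_t^\top \sum_{i=1}^m \sigma_i Q_{1:k,1:r}A^*_{1:r,1:r}z^{(i)}_{R_t}\Big] \tag{79} \\
&= \frac{B_{\phi^*,t}}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i Q_{1:k,1:r}A^*_{1:r,1:r}z^{(i)}_{R_t}\Big\|\Big], \tag{80} \\
R_m(\mathcal{F}_{\phi^*,t}) &= \mathbb{E}_{z_R,z_U}\left[\hat{R}_m(\mathcal{F}_{\phi^*,t})\right] \tag{81} \\
&= \frac{B_{\phi^*,t}}{m}\mathbb{E}_{z^{(i)}_{R_t}}\Big[\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i Q_{1:k,1:r}A^*_{1:r,1:r}z^{(i)}_{R_t}\Big\|\Big]\Big] \tag{82} \\
&= \frac{B_{\phi^*,t}}{m}\mathbb{E}_{z^{(i)}_{R_t}}\Big[\Big\|A^*_{1:r,1:r}\sum_{i=1}^m z^{(i)}_{R_t}\Big\|\Big]. \tag{83}
\end{align}

For any $t \in [T]$, define $X_t := A^*_{1:r,1:r}\sum_{i=1}^m z^{(i)}_{R_t}$. Note that for $j \in S$, $X_{t,j} = \sqrt{\alpha}\sum_{i=1}^m z^{(i)}_j$ is a Gaussian of mean zero and variance $\mathbb{E}[X_{t,j}^2] = \alpha\,\mathbb{E}\big[\big(\sum_{i=1}^m z^{(i)}_j\big)^2\big] = \alpha\,\mathbb{E}\big[\sum_{i=1}^m \big(z^{(i)}_j\big)^2\big] = m\alpha\sigma^2_{S,t}$. Similarly, for $j \in P_t$, $X_{t,j} = \sqrt{\alpha_t}\sum_{i=1}^m z^{(i)}_j$ is a Gaussian of mean zero and variance $\mathbb{E}[X_{t,j}^2] = m\alpha_t\sigma^2_{R,t}$. Since $X_{t,j}$ is sub-gaussian, $X_{t,j}^2 - m\alpha\sigma^2_{S,t}$ for $j \in S$ and $X_{t,j}^2 - m\alpha_t\sigma^2_{R,t}$ for $j \in P_t$ are sub-exponential, and more precisely

$$\|X_{t,j}^2 - m\alpha\sigma^2_{S,t}\|_{\psi_1} \le C_1\|X_{t,j}^2\|_{\psi_1} = C_1\|X_{t,j}\|^2_{\psi_2} \le C_2 m\alpha\sigma^2_{S,t}, \quad j \in S,$$

$$\|X_{t,j}^2 - m\alpha_t\sigma^2_{R,t}\|_{\psi_1} \le C_1\|X_{t,j}^2\|_{\psi_1} = C_1\|X_{t,j}\|^2_{\psi_2} \le C_2 m\alpha_t\sigma^2_{R,t}, \quad j \in P_t,$$

where $C_1, C_2$ are absolute constants and $C_2 > 1$. Let $K = \max(C_2 m\alpha\sigma^2_{S,t}, C_2 m\alpha_t\sigma^2_{R,t}) = C_2 m\max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}$ and $\mu := m(s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t})$. By Bernstein's inequality, we have for every $\gamma \ge 0$ that

\begin{align}
& \mathbb{P}\left\{\left|\frac{1}{r}(\|X_t\|^2 - \mu)\right| \ge \gamma\right\} \le 2\exp\left(-c\min\left(\frac{\gamma^2}{K^2}, \frac{\gamma}{K}\right) r\right) \notag \\
\Rightarrow\ & \mathbb{P}\left\{\left|\frac{\|X_t\|^2}{\mu} - 1\right| \ge \frac{r\gamma}{\mu}\right\} \tag{87} \\
&\le 2\exp\left(-\frac{c}{C_2^2}\min\left(\frac{\gamma^2}{m^2\max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}^2}, \frac{\gamma}{m\max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}}\right) r\right), \tag{88}
\end{align}

where $c$ is an absolute constant. For all numbers $z \ge 0$, we have $|z - 1| \ge \delta \Rightarrow |z^2 - 1| \ge \max(\delta, \delta^2)$. Thus, for any $\delta \ge 0$, writing $\kappa := \max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}$ for brevity, we have

\begin{align}
& \mathbb{P}\left\{\left|\frac{\|X_t\|}{\sqrt{\mu}} - 1\right| \ge \delta\right\} \tag{89} \\
&\le \mathbb{P}\left\{\left|\frac{\|X_t\|^2}{\mu} - 1\right| \ge \max(\delta, \delta^2)\right\} \tag{90} \\
&\le 2\exp\left(-\frac{c}{C_2^2}\min\left(\left(\frac{\mu\max(\delta,\delta^2)}{m\kappa r}\right)^2, \frac{\mu\max(\delta,\delta^2)}{m\kappa r}\right) r\right) \tag{91} \\
&\le 2\exp\left(-\frac{c}{C_2^2}\left(\frac{\mu}{m\kappa r}\right)^2\min\left(\max(\delta,\delta^2)^2, \max(\delta,\delta^2)\right) r\right) \tag{92} \\
&= 2\exp\left(-\frac{c}{C_2^2}\frac{\mu^2}{m^2\kappa^2 r}\delta^2\right), \tag{93}
\end{align}

where the last inequality is from $\mu = m(s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t}) \le m\kappa r$. Changing variables to $\theta = \delta\sqrt{\mu}$, we obtain the desired sub-gaussian tail

$$\mathbb{P}\left\{\left|\|X_t\| - \sqrt{\mu}\right| \ge \theta\right\} \le 2\exp\left(-\frac{c}{C_2^2}\frac{\mu}{m^2\kappa^2 r}\theta^2\right). \tag{94}$$

By the generalization of the integral identity, we have

\begin{align}
\left|\mathbb{E}\left[\|X_t\| - \sqrt{\mu}\right]\right| &= \left|\int_0^\infty \mathbb{P}\{\|X_t\| - \sqrt{\mu} > \theta\}\,d\theta - \int_{-\infty}^0 \mathbb{P}\{\|X_t\| - \sqrt{\mu} < \theta\}\,d\theta\right| \tag{95} \\
&\le 2\int_0^\infty \mathbb{P}\{|\|X_t\| - \sqrt{\mu}| > \theta\}\,d\theta \tag{96} \\
&\le 4\int_0^\infty \exp\left(-\frac{c}{C_2^2}\frac{\mu}{m^2\kappa^2 r}\theta^2\right)d\theta \tag{97} \\
&\le C_3\frac{m\kappa\sqrt{r}}{\sqrt{\mu}}, \tag{98}
\end{align}

where $C_3$ is an absolute constant. Thus, we have

\begin{align}
& \left|R_m(\mathcal{F}_{\phi^*,t}) - \sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\alpha_t}\right)\left(s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t}\right)}\right| \tag{99} \\
&= \frac{B_{\phi^*,t}}{m}\left|\mathbb{E}\left[\|X_t\| - \sqrt{\mu}\right]\right| \tag{100} \\
&\le O\left(\sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\alpha_t}\right)\frac{\max\{\alpha\sigma^2_{S,t}, \alpha_t\sigma^2_{R,t}\}^2\, r}{s\alpha\sigma^2_{S,t} + (r-s)\alpha_t\sigma^2_{R,t}}}\right). \tag{101}
\end{align}

Proposition B.3 (Restatement of Proposition 2.3). The representation $\phi^*$ obtained on an even mixture of data from all the tasks $\{D_t : 1 \le t \le T\}$ satisfies

$$\phi^* \circ g(z) = Q\Big(\sum_{j\in S}\sqrt{\alpha}\, z_j e_j + \sum_{j\in R\setminus S}\sqrt{\beta}\, z_j e_j\Big)$$

for some $\alpha \in [0,1]$, $\beta = \min\left\{1, \frac{r - \alpha s}{T(r-s)}\right\}$, where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. The Empirical Risk Minimizer $\hat{u} \in \mathcal{F}_{\phi^*,t}$ on $\phi^*$ using $m$ labeled data points from $D_t$ has risk

$$\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}^\top\phi^*(x), y)\right] \le 4L\sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\beta}\right)}\left(\sqrt{s\alpha + (r-s)\beta} + O\left(\sqrt{\frac{r}{s\alpha + (r-s)\beta}}\right)\right) + 8\sqrt{\frac{2\ln(4/\delta)}{m}}.$$

Proof. This follows from Lemma B.1, and considering the optimal $\alpha, \alpha_t$ for the following:

\begin{align}
g(\{\alpha, \alpha_t\}) &= \sum_{t=1}^T w_t\, \mathbb{E}\left[\ell\left(\alpha\sigma^2_{S,t}Z + \alpha_t\sigma^2_{R,t}Z_t\right)\right] \tag{102} \\
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\left[\ell\left(\alpha Z + \alpha_t Z_1\right)\right] \ge \mathbb{E}\Big[\ell\Big(\alpha Z + Z_1\sum_{t=1}^T\frac{1}{T}\alpha_t\Big)\Big]. \tag{103}
\end{align}

The second equality holds because the $Z_t$'s follow the same distribution by the symmetry of the $\bar{z}_j$'s. The inequality comes from the convexity of $\ell(t)$ and Jensen's inequality. So the minimum is achieved when $\alpha_t := \beta$ for any $t \in [T]$, leading to $g(\{\alpha, \alpha_t\}) = \mathbb{E}[\ell(\alpha Z + \beta Z_1)]$ subject to the constraints $\alpha s + T\beta(r-s) \le r$ and $0 \le \alpha, \beta \le 1$. Then we get $\phi^* \circ g(z) = W^*Mz = Q\big(\sum_{j\in S}\sqrt{\alpha}\, z_j e_j + \sum_{j\in R\setminus S}\sqrt{\beta}\, z_j e_j\big)$ for some $\alpha \in [0,1]$, $\beta = \min\left\{1, \frac{r-\alpha s}{T(r-s)}\right\}$, where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. Finally, the generalization bound follows from Lemma B.2, together with

$$O\left(\sqrt{\frac{\max\{\alpha, \beta\}^2\, r}{s\alpha + (r-s)\beta}}\right) = O\left(\sqrt{\frac{r}{s\alpha + (r-s)\beta}}\right).$$

This completes the proof.

Proposition B.4 (Restatement of Proposition 2.4). Suppose $\sigma_{S,t} = \sigma_{R,t} = \sigma_{U,t} = 1$.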
The representation $\phi^*_t$ obtained on data from $D_t$ satisfies $\phi^*_t \circ g(z) = Q\big(\sum_{j\in R_t} z_j e_j\big)$, where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. The Empirical Risk Minimizer $\hat{u} \in \mathcal{F}_{\phi^*_t,t}$ on $\phi^*_t$ using $m$ labeled data points from $D_t$ has risk

$$\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}^\top\phi^*_t(x), y)\right] \le 4L\sqrt{\frac{r}{m}}\|u_t^*\| + 8\sqrt{\frac{2\ln(4/\delta)}{m}},$$

while on task $D_i$ ($i \ne t$), any linear predictor on $\phi^*_t$ has error at least $\min_u \mathbb{E}_{D_i}[\ell_c(u^\top z_S, y)]$.

Proof. Following Lemma B.1 (with $r = s$), we get $\phi^*_t \circ g(z) = Q\big(\sum_{j\in R_t} z_j e_j\big)$, where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. Following the same argument as in the proof of Lemma B.2, we get

\begin{align}
R_m(\mathcal{F}_{\phi^*,t}) &= \frac{\|u_t^*\|}{m}\mathbb{E}_{z^{(i)}_{R_t}}\Big[\Big\|\sum_{i=1}^m z^{(i)}_{R_t}\Big\|\Big] \tag{107} \\
&\le \sqrt{\frac{r}{m}}\|u_t^*\|, \tag{108}
\end{align}

where the last inequality comes from the expectation of the chi-squared distribution.
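The last inequality in the proof of Proposition B.4 can be made concrete. With unit variances, $\sum_{i=1}^m z^{(i)}_{R_t} \sim \mathcal{N}(0, mI_r)$, so its norm is $\sqrt{m}$ times a chi-distributed variable with $r$ degrees of freedom, whose mean has the closed form $\sqrt{2}\,\Gamma((r+1)/2)/\Gamma(r/2) \le \sqrt{r}$. The sketch below evaluates both sides; the function names and the example values of `u_norm`, `r`, and `m` are hypothetical and chosen only for illustration.

```python
import math

def expected_chi(r):
    """Mean of a chi distribution with r degrees of freedom (closed form)."""
    return math.sqrt(2.0) * math.exp(math.lgamma((r + 1) / 2) - math.lgamma(r / 2))

def rademacher_terms(u_norm, r, m):
    """Return (exact value of (107), upper bound (108)) under unit variances."""
    exact = u_norm / m * math.sqrt(m) * expected_chi(r)
    bound = math.sqrt(r / m) * u_norm
    return exact, bound

# Hypothetical example values: ||u_t^*|| = 1, r = 10 invariant features, m = 100 labels.
exact_value, bound_value = rademacher_terms(1.0, 10, 100)
```

The gap between the two values is small for moderate $r$, so under these assumptions little is lost by using the simpler bound $\sqrt{r/m}\,\|u_t^*\|$.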

B.3 IMPLICATION FOR THE TRADE-OFF

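The propositions then imply the trade-off between universality and label efficiency. Below we formalize the example discussed in Section 2.2.

Proposition B.5 (A specific version of Proposition 2.3). Suppose $\sigma_{S,t} = \sigma_{R,t} = \sigma_{U,t} = 1$ for any $t \in [T]$ and $r = 2s$. The representation $\phi^*$ obtained on an even mixture of data from all the tasks $\{D_t : 1 \le t \le T\}$ satisfies

$$\phi^* \circ g(z) = Q\Big(\sum_{j\in S} z_j e_j + \sum_{j\in R\setminus S}\sqrt{\frac{1}{T}}\, z_j e_j\Big),$$

where the $e_j$'s are the basis vectors and $Q$ is any orthonormal matrix. The Empirical Risk Minimizer $\hat{u} \in \mathcal{F}_{\phi^*,t}$ on $\phi^*$ using $m$ labeled data points from $D_t$ has risk

$$\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}^\top\phi^*(x), y)\right] \le 4L\sqrt{\frac{1}{m}\left(v_{t,1} + Tv_{t,2}\right)}\left(\sqrt{r\,\frac{T+1}{2T}} + O(1)\right) + 8\sqrt{\frac{2\ln(4/\delta)}{m}}.$$

Proof. This follows from Proposition 2.3, and noting that when $r = 2s$, $\alpha = 1$ and $\beta = 1/T$ are optimal for

\begin{align}
g(\{\alpha, \beta\}) &= \mathbb{E}\left[\ell\left(\alpha Z + \beta Z_1\right)\right] \tag{109} \\
&= \mathbb{E}\left[\ell\left(\alpha Z + \frac{r - \alpha s}{T(r-s)}Z_1\right)\right] \tag{110} \\
&= \mathbb{E}\left[\ell\left(\alpha Z + \frac{2-\alpha}{T}Z_1\right)\right] \tag{111}
\end{align}

subject to the constraints $\alpha s + T\beta(r-s) \le r$ and $0 \le \alpha, \beta \le 1$. To see this, note that $Z \sim \chi^2_s$ and $Z_1 \sim \chi^2_{r-s} = \chi^2_s$ follow the same distribution, so $\alpha Z + \frac{2-\alpha}{T}Z_1$ for $\alpha = 1$ stochastically dominates its value for any other $\alpha \in [0, 1)$. The optimum is then achieved when $\alpha = 1$ and $\beta = \frac{2-\alpha}{T} = \frac{1}{T}$.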
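To see the trade-off numerically, one can evaluate the leading term of the risk bound in Proposition B.5 as the number of tasks $T$ in the pre-training mixture grows. The following is a rough sketch under the proposition's assumptions ($\alpha = 1$, $\beta = 1/T$, $r = 2s$); the function names and the default values of `v1`, `v2`, `s`, and `m` are hypothetical, and the $O(\cdot)$ term and the confidence term of the bound are omitted.

```python
import math

def leading_term(v1, v2, alpha, beta, s, r, m):
    """sqrt((v1/alpha + v2/beta)/m) * sqrt(s*alpha + (r - s)*beta), without 4L."""
    return math.sqrt((v1 / alpha + v2 / beta) / m) * math.sqrt(s * alpha + (r - s) * beta)

def mixture_term(T, v1=0.5, v2=0.5, s=5, m=100):
    """Leading term for an even mixture of T tasks with r = 2s, alpha = 1, beta = 1/T."""
    r = 2 * s
    return leading_term(v1, v2, 1.0, 1.0 / T, s, r, m)

# Leading term for T = 1 (target-only pre-training) up to T = 20 tasks.
terms = [mixture_term(T) for T in range(1, 21)]
```

With $T = 1$ the term reduces to $\sqrt{r/m}\,\|u_t^*\|$, matching Proposition B.4, and it grows with $T$ in this setting, which reflects the down-weighting of the private features $z_{P_t}$.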

B.4 IMPROVING THE TRADE-OFF BY CONTRASTIVE REGULARIZATION

The above analysis shows that contrastive learning of a representation on unlabeled data from the target task can help in prediction on this target task. This suggests that given a representation $\phi^*$ pre-trained on diverse data, one can fine-tune it by contrastive learning on some unlabeled data from the target task to get a representation that can lead to better prediction on the target task. In the following, we will formally show that this is indeed the case for the illustrative example in Section 2.2.

Recall that in this example, $\sigma_{S,t} = \sigma_{R,t} = \sigma_{U,t} = 1$, $r = 2s$, and $v_{t,1} = v_{t,2}$. The representation $\phi^*$ obtained on an even mixture of data from all the tasks $\{D_t : 1 \le t \le T\}$ satisfies $\phi^* \circ g(z) = Q\big(\sum_{j\in S}\sqrt{\alpha}\, z_j e_j + \sum_{j\in R\setminus S}\sqrt{\beta}\, z_j e_j\big)$, where the $e_j$'s are the basis vectors, $Q$ is any orthonormal matrix, and $\alpha = 1$, $\beta = \frac{1}{T}$.

Now, suppose we are given unlabeled data from $D_t$, and we use them to fine-tune $\phi^*(x) = W^*x$ by contrastive learning on these unlabeled data. That is, we find $W$ near $W^*$ to minimize the contrastive loss on the unlabeled data from $D_t$:

\begin{align}
& \min_{\phi(x) = Wx}\ \mathbb{E}\left[\ell\left(\phi(x)^\top\left(\phi(x^+) - \mathbb{E}_{x^-}\phi(x^-)\right)\right)\right] \tag{112} \\
& \text{subject to } \|W - W^*\|_F \le \gamma, \quad \|W\| \le 1, \tag{113}
\end{align}

for some small $\gamma > 0$.

Proposition B.6. For (112), $\phi^*_{CR,t}$ satisfying the following on $x$ from $D_t$ is an optimal representation:

$$\phi^*_{CR,t} \circ g(z) = Q\Big(\sum_{j\in S}\sqrt{\alpha}\, z_j e_j + \sum_{j\in P_t}\sqrt{\beta}\, z_j e_j\Big),$$

where $\sqrt{\alpha} = 1$ and $\sqrt{\beta} = \min\left\{1, \sqrt{\frac{1}{T}} + \frac{\gamma}{\sqrt{s}}\right\}$.

Proof. Following the argument in Lemma B.1, we still have that $\phi^*_{CR,t}(x) = Wx$ where $W = Q_2[A_2, 0]M^{-1}$ for any orthonormal matrix $Q_2$ and some diagonal matrix $A_2 = \mathrm{diag}(a_{jj})$, with $a_{jj} = \sqrt{\alpha}$ for $j \in S$ and $a_{jj} = \sqrt{\beta}$ for $j \in P_t$ for some $\alpha, \beta \in [0, 1]$. The contrastive loss is:

\begin{align}
\mathbb{E}\left[\ell\left(\phi(x)^\top\left(\phi(x^+) - \mathbb{E}_{x^-}\phi(x^-)\right)\right)\right] &= \mathbb{E}\Big[\ell\Big(\sum_{j\in R_t} a_{jj}^2 z_j^2\Big)\Big] \tag{114} \\
&= \mathbb{E}\Big[\ell\Big(\alpha\sum_{j\in S} z_j^2 + \beta\sum_{j\in P_t} z_j^2\Big)\Big]. \tag{115}
\end{align}

Recall that $\phi^*(x) = W^*x$ with $W^* = Q[A, 0]M^{-1}$ for any orthonormal matrix $Q$ and some diagonal matrix $A$, with $A_{jj} = 1$ for $j \in S$ and $A_{jj} = \sqrt{1/T}$ for $j \in R_i \setminus S$ for any $i \in [T]$. Then

\begin{align}
\|W - W^*\|_F &= \|Q_2[A_2, 0]M^{-1} - Q[A, 0]M^{-1}\|_F \tag{116} \\
&= \|Q_2A_2 - QA\|_F \tag{117} \\
&= \|A_2 - Q_2^{-1}QA\|_F. \tag{118}
\end{align}

Since $Q_2^{-1}Q$ is a rotation and $A, A_2$ are diagonal, we can always set $Q_2 = Q$ without increasing $\|W - W^*\|_F$. Then

\begin{align}
\|W - W^*\|_F^2 &= \|A_2 - A\|_F^2 \tag{119} \\
&= s(\sqrt{\alpha} - 1)^2 + s\big(\sqrt{\beta} - \sqrt{1/T}\big)^2 + \sum_{j\in P_i, i\ne t}\big((A_2)_{jj} - \sqrt{1/T}\big)^2. \tag{120}
\end{align}

To minimize the contrastive loss, we need $\alpha, \beta$ to be as large as possible, subject to $\|W - W^*\|_F^2 \le \gamma^2$ and $\alpha, \beta, (A_2)_{jj}^2 \in [0, 1]$. The optimum is then achieved when $\alpha = 1$, $\sqrt{\beta} = \min\left\{1, \sqrt{\frac{1}{T}} + \frac{\gamma}{\sqrt{s}}\right\}$, and $(A_2)_{jj} = \sqrt{1/T}$ for $j \in P_i$, $i \ne t$.

Now, recall that by Proposition 2.3, the ERM has risk:

\begin{align}
\mathbb{E}_{(x,y)\sim D_t}\left[\ell_c(\hat{u}^\top\phi^*(x), y)\right] &\le 4L\sqrt{\frac{1}{m}\left(\frac{v_{t,1}}{\alpha} + \frac{v_{t,2}}{\beta}\right)}\left(\sqrt{s\alpha + (r-s)\beta} + O\left(\sqrt{\frac{r}{s\alpha + (r-s)\beta}}\right)\right) + 8\sqrt{\frac{2\ln(4/\delta)}{m}} \notag \\
&= 4L\sqrt{\frac{1}{m}\left(\frac{\|u_t^*\|^2}{2\alpha} + \frac{\|u_t^*\|^2}{2\beta}\right)}\left(\sqrt{s\alpha + s\beta} + O(1)\right) + 8\sqrt{\frac{2\ln(4/\delta)}{m}} \notag \\
&= O\left(L\sqrt{\frac{r\|u_t^*\|^2}{m}\left(2 + \frac{\alpha}{\beta} + \frac{\beta}{\alpha}\right)}\right). \notag
\end{align}

After fine-tuning with contrastive learning, $\alpha$ remains 1 in the learned representation, while $\beta$ increases from $1/T$ to $\big(\sqrt{1/T} + \gamma/\sqrt{s}\big)^2$. The error bound then decreases. This shows that fine-tuning with contrastive learning on unlabeled data from the target task can emphasize the task-specific features $z_{P_t}$, which then leads to better prediction performance.
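The effect of contrastive regularization on the bound can be illustrated by plugging numbers into the expressions above. This is a small sketch only: the helper names and the example values of `T`, `gamma`, and `s` are hypothetical, and the quantity computed is the factor $2 + \alpha/\beta + \beta/\alpha$ inside the final $O(\cdot)$ expression, not the full risk bound.

```python
import math

def sqrt_beta_after(T, gamma, s):
    """sqrt(beta) after fine-tuning: min{1, sqrt(1/T) + gamma / sqrt(s)}."""
    return min(1.0, math.sqrt(1.0 / T) + gamma / math.sqrt(s))

def bound_factor(alpha, beta):
    """The factor 2 + alpha/beta + beta/alpha appearing in the risk bound."""
    return 2.0 + alpha / beta + beta / alpha

# Hypothetical example: T = 10 tasks, s = 4 shared features, radius gamma = 0.5.
T, gamma, s = 10, 0.5, 4
beta_before = 1.0 / T
beta_after = sqrt_beta_after(T, gamma, s) ** 2
factor_before = bound_factor(1.0, beta_before)
factor_after = bound_factor(1.0, beta_after)
```

Since $2 + 1/\beta + \beta$ is decreasing in $\beta$ on $(0, 1]$, any increase in $\beta$ from fine-tuning lowers this factor, and it reaches its minimum value 4 when $\beta = 1$.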

C MORE EXPERIMENTAL DETAILS AND RESULTS

C.1 DATASETS

CIFAR-10. The CIFAR-10 (Krizhevsky et al., 2009) dataset consists of 60,000 32 × 32 color images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Each class has 6,000 images. There are 50,000 training images and 10,000 test images.

CINIC-10. CINIC-10 (Darlow et al., 2018) consists of 32 × 32 color images from both CIFAR and ImageNet and has 90,000 training images with ten classes identical to CIFAR-10.

ImageNet. ImageNet (Deng et al., 2009) is a large visual dataset composed of 1,281,167 training images and 50,000 test images with 1,000 classes. We crop each color image to 224 × 224, following the standard setting.

ImageNet32. ImageNet32 (Deng et al., 2009) is the down-sampled version of ImageNet, a large dataset of small color images. ImageNet32 comprises 1,281,167 training images and 50,000 test images with 1,000 classes. Each color image is down-sampled to 32 × 32.

ImageNet22k. ImageNet22k (Deng et al., 2009; Ridnik et al., 2021)

GCC-15M. GCC-15M denotes the merged version of GCC-3M (Sharma et al., 2018) and GCC-12M (Changpinyo et al., 2021). It is a diverse dataset of visual concepts with image-text pairs meant to be used for vision-and-language pre-training. GCC-15M contains 15M training data points and more than 600k concepts.

SVHN. The Street View House Numbers (Netzer et al., 2011) dataset contains 32 × 32 color images of the 10 digits in natural scenes. It has 73,257 digits for training and 26,032 digits for testing.

MNIST. The Modified National Institute of Standards and Technology (LeCun et al., 1998) dataset is a database of handwritten gray-scale digits of size 28 × 28. It contains 60,000 training images and 10,000 testing images.

EMNIST. Extended MNIST (Cohen et al., 2017) includes gray-scale images of handwritten letters and digits. The images in EMNIST were converted to the same size, 28 × 28, by the same process as MNIST. EMNIST-Letters has 145,600 lowercase characters with 26 balanced classes, and EMNIST-Digits has 280,000 characters with ten balanced classes.

Fashion-MNIST. Fashion-MNIST (Xiao et al., 2017) is a dataset of 28 × 28 gray-scale images with ten classes: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot. The training set size is 60,000, and the test set size is 10,000.

Fer2013. Fer2013 (Goodfellow et al., 2013) is a dataset of 48 × 48 gray-scale images with 7 facial expression classes: angry, disgust, fear, happy, sad, surprise, neutral. The training set size is 28,709, and the test set size is 3,589.

FaceScrub. FaceScrub (Ng & Winkler, 2014) is a dataset with 141,130 color face images of 695 public figures.

GTSRB. The German Traffic Sign Recognition Benchmark (Stallkamp et al., 2012) is a dataset of color images depicting 43 different traffic signs. The images are not of fixed dimensions and have rich backgrounds and varying light conditions, as expected of photographed images of traffic signs. The original training set contains 34,799 images, and the original test set contains 12,630 images. We resize each image to 32 × 32. The dataset has a significant imbalance in the number of samples across classes. We use data augmentation techniques to enlarge the training data and balance the number of samples in each class. We construct a class-preserving data augmentation pipeline consisting of rotation, translation, and projection transforms, and apply this pipeline to the training images until each class contains 2,500 examples. This yields a new training set containing 107,500 images in total. We also construct a new test set by randomly selecting 10,000 images from the original test set for evaluation.

IMDB. IMDB (Maas et al., 2011) is a large movie review text dataset for binary sentiment classification (positive or negative review). The dataset contains 25,000 movie reviews for training and 25,000 for testing.

AGNews. AGNews (Zhang et al., 2015) is a sub-dataset of AG's corpus of news articles for text topic classification. It covers the 4 largest classes: world, sports, business, sci/tech. AGNews contains 30,000 training and 1,900 test samples per class.

C.2 VERIFYING THE EXISTENCE OF THE TRADE-OFF

Model. We evaluate three popular contrastive learning frameworks, MoCo v2 (He et al., 2020) , NNCLR (Dwibedi et al., 2021) and SimSiam (Chen & He, 2021) . MoCo v2 can be regarded as SimCLR (Chen et al., 2020) equipped with a memory bank, while NNCLR uses nearest-neighbor as the positive pairs. SimSiam can be regarded as a modification from BYOL (Grill et al., 2020) similar to Barlow Twins (Zbontar et al., 2021) , which does not need negative pairs. We follow the same data augmentation methods as SimSiam (Chen & He, 2021 ) for all datasets. Datasets. We consider three sets of data. In the first set, our downstream task is CIFAR-10, and the pre-training datasets may include CINIC-10, SVHN, GTSRB, and ImageNet32. CINIC-10 has classes identical to CIFAR-10 and is the most target-relevant, while the others are less similar to CIFAR-10. This then provides more and more diverse pre-training data w.r.t. the target task. In the second set, our downstream task is MNIST, and the pre-training datasets may include EMNIST-Digits&Letters, Fashion-MNIST, GTSRB, and ImageNet32. Here, the handwritten dataset EMNIST-Digits&Letters is the most target-relevant. In the last set, our downstream task is Fer2013, a face expression classification dataset. The pre-training datasets may include FaceScrub, CIFAR-10, SVHN, and ImageNet32, where the face dataset Facescrub is the most target-relevant. Evaluation & Methods. We pre-train a ResNet18 network (He et al., 2016) as a feature extractor under different contrastive learning methods using SGD for 800 epochs with a cosine learning-rate scheduler, the base learning rate of 0.06, weight decay 5e-4, momentum 0.9 and batch size 512. Then we fix the pre-trained feature extractor and train a linear classifier (Linear Probing, LP) on 1%, 5%, 10%, 20%, 100% of the labeled data from the downstream task. 
For LP we use SGD for 200 epochs with a cosine learning-rate scheduler, the base learning rate of 5.0, no weight decay, momentum 0.9, and batch size 256. We finally report the test accuracy on a specific target task and the weighted average test accuracy on all pre-training datasets (i.e., using them as the downstream tasks). We use the class number of each pre-training dataset as the weight, which is consistent with random guessing as a baseline. We have three types of models pre-trained on three sets of datasets. Thus, we have nine tasks in total. For each task, we have two pre-trained models initialized by different random seeds. We evaluated each model three times and we report the average test accuracy with standard deviation based on multiple runs (six evaluations). In Figs. 7(a)(b)(c), we report results for MoCo v2, NNCLR, SimSiam (respectively) on CIFAR-10 as the downstream task. The size and diversity of unlabeled data for pre-training are increased on the x-axis by incrementally adding datasets in the following order: CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (only use a 500k subset)(I). Then, we do LP on CIFAR-10 using different proportions of labeled samples. Similarly, in Figs. 7(d )(e)(f), we report results for three models on MNIST. We incrementally add pretraining datasets in the following order: EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). In Figs. 7(g)(h)(i), the downstream task is Fer2013 and we incrementally add datasets in the following order: FaceScrub (I), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Results. In Figs. 7, when the pre-training data becomes more diverse, the average test accuracy on all pre-training datasets increases (i.e., universality improves), while the test accuracy on the specific target task decreases (i.e., label efficiency drops). This shows a clear trade-off between universality and label efficiency. 
Moreover, with fewer labeled data (from the green line to the red line), the trade-off becomes more pronounced. This supports our claim that diverse pre-training data allow learning diverse features for better universality, but can down-weight the features for a specific task, resulting in worse prediction. As more diverse unlabeled data are included, more labeled data from the target task are needed to achieve comparably good prediction accuracy. This validates our analysis of the trade-off in Section 2.2. In Figs. 7(a)(b)(d)(e)(f)(h), the average test accuracy (x-axis) decreases in the end because adding one pre-training dataset may hurt the test accuracy on all other pre-training datasets. In Figs. 7(g)(h), the downstream task test accuracy (y-axis) increases in the beginning because, when the pre-training unlabeled data from relevant tasks is not sufficiently large, introducing other pre-training datasets helps the model learn features relevant to the downstream task. However, the downstream task test accuracy drops later, as in the other figures. Please refer to Appendix C.5 for more figures.
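The reporting protocol described above (class-count-weighted average accuracy over the pre-training datasets, and mean with standard deviation over six evaluations) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the accuracy values are placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def weighted_average_accuracy(accuracies, num_classes):
    """Average per-dataset accuracies, weighting each dataset by its class count."""
    if accuracies.keys() != num_classes.keys():
        raise ValueError("accuracies and num_classes must cover the same datasets")
    total_weight = sum(num_classes.values())
    weighted_sum = sum(accuracies[name] * num_classes[name] for name in accuracies)
    return weighted_sum / total_weight


def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated evaluations."""
    return mean(values), stdev(values)


# Class counts of the first set of pre-training datasets.
num_classes = {"CINIC-10": 10, "SVHN": 10, "GTSRB": 43, "ImageNet32": 1000}
# Hypothetical LP accuracies, placeholders only (NOT results from the paper).
accuracies = {"CINIC-10": 0.70, "SVHN": 0.80, "GTSRB": 0.90, "ImageNet32": 0.20}
universality_score = weighted_average_accuracy(accuracies, num_classes)

# Two seeds, three evaluations each: six hypothetical accuracy values.
run_mean, run_std = summarize_runs([0.801, 0.803, 0.799, 0.805, 0.800, 0.802])
```

With class counts as weights, datasets with many classes (here ImageNet32) dominate the average, which matches the weighting choice stated in the text.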

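The cosine learning-rate schedule mentioned in the evaluation setup above can be sketched as below. This is an assumption-laden illustration: the text only says "cosine learning-rate scheduler", so the absence of warmup, the per-epoch update, and annealing to a minimum of zero are all assumptions (they correspond to the common cosine-annealing formula). The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (assumed: no warmup)."""
    progress = epoch / total_epochs  # goes from 0.0 to 1.0 over training
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Pre-training setting from the text: 800 epochs, base learning rate 0.06.
pretrain_schedule = [cosine_lr(e, 800, 0.06) for e in range(800)]
# Linear-probing setting from the text: 200 epochs, base learning rate 5.0.
probe_schedule = [cosine_lr(e, 200, 5.0) for e in range(200)]
```

The schedule starts at the base learning rate and decreases monotonically, reaching half the base rate at the midpoint of training.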
C.2.1 LARGER SCALE EXPERIMENTS

The datasets involved are ImageNet (1.2M data points, 1k classes), ImageNet22k (14M, 22k classes), and GCC-15M (15M). We compare two UniCL representations (Yang et al., 2022): the more specific representation pre-trained on ImageNet, and the one pre-trained on the more diverse dataset ImageNet+GCC-15M. We compare their performance in two tasks: classification on ImageNet (using 2k labeled data) and classification on ImageNet22k (using 44k labeled data). The results are reported in Table 3. From the specific representation to the diverse one, we observe that the test accuracy on ImageNet decreases (i.e., label efficiency drops), while the test accuracy on ImageNet22k increases (i.e., universality improves). This again confirms the trade-off.

For each set of pre-training data, we extract a set of features for the target task (CIFAR-10, MNIST, and Fer2013, respectively) using the pre-trained representation function. In the first row of Fig. 8 (rows/columns are pre-training data; numbers/colors show the similarity), the features from different pre-training datasets have low similarities. This is consistent with our setup in Section 2.2 that different tasks only share some features and each owns many private ones. In the second row, we can see a decreasing trend of similarity in each row of each sub-figure. This indicates that, when gradually adding more diverse pre-training data, the obtained representation encodes more downstream task-irrelevant information and becomes less similar to the representation before the additional pre-training data was added, which hurts the downstream task performance. This result is consistent with our Propositions 2.3 and 2.4. Finally, we would like to verify Theorem 2.2 via CKA similarities. The theorem says that when we increase the norm bound, the representation can encode more and more features. To verify this, our

the results for a similar setting with ImageNet-Vehicle.
Table 6 shows the results of UniCL with Swin-T backbone using different pre-training datasets (ImageNet, and ImageNet+GCC-15M) and different labeled datasets (2k samples from ImageNet, 44k samples from ImageNet22k, the whole ImageNet, and the whole ImageNet22k).

First, we find that the trade-off is hidden when a small amount of data from the specific task is used for pre-training. The results show that when the specific pre-training data is small, the representation learned is noisy and the downstream prediction performance is poor. This is not surprising: in the extreme case with only 1 pre-training image from the bird task or vehicle task, there is essentially no information for pre-training. In this case, the results are well inside the Pareto front of the trade-off curve and thus cannot demonstrate the trade-off.

Second, we find that the trade-off is hidden when a large amount of labeled data is available for learning predictors on the representation in the specific task. If a large amount of labeled data is available for training the predictor, the prediction performance is similar when using the specific or universal representations. This is consistent with the insights from our analysis: when pre-training on diverse data, the features specific to the target task are down-weighted but can still be present in the representation; with a large amount of labeled data from the specific task, the sample complexity issue is then not significant, and thus the trade-off is hidden.

The above experimental studies show that the trade-off is revealed when we have large-scale pre-training data and a limited amount of labeled data from the target task, which is indeed the typical case of interest for using pre-trained representations (especially the large foundation models). The trade-off is significant in this case and thus crucial for further development of pre-training representations.

Varying the Class Number of ImageNet32.
To further support A1, we show that the trade-off between universality and label efficiency also exists under a fixed dataset size. In Fig. 10, we pre-train MoCo v2 and SimSiam on CIFAR-10 + ImageNet32(200k) and keep the same setting as Fig. 7 except that we vary the class number of ImageNet32(200k). In previous experiments, we randomly pick 500,000 images from ImageNet32 without considering labels. Here, we fix the number of classes to 50, 100, 200, 500, or 1000, and then randomly sample 200,000 images from that subset of classes. The downstream task is CIFAR-10. In Fig. 10, we observe that with a fixed pre-training dataset size, e.g., 250,000, when the data is more diverse, the pre-training learns more irrelevant features, and the performance drops on the downstream task. This supports our analysis as well.

Varying Target-Relevant Pre-training Data Percentage. In Fig. 11, we use (a)(d) 100%, (b)(e) 50%, and (c)(f) 20% of CINIC-10 to pre-train MoCo v2 and SimSiam, and keep the same setting as Fig. 7. For Fig. 11(b) with 50% CINIC-10, test accuracy drops, e.g., the test accuracy with 1% CIFAR-10 is 80.63% in Fig. 11(a) vs. 76.45% in (b). We can also see the decreasing curve in Fig. 11(b). On the other hand, test accuracy also drops in Fig. 11(c)(e)(f). However, we see a U-curve rather than a strictly decreasing curve in Fig. 11(c)(e)(f). ImageNet32 is more relevant to CIFAR-10 than SVHN and GTSRB, consistent with human intuition. When we have a small portion of CINIC-10 that does not cover all target-relevant features, the feature extractor can learn these missing features from ImageNet32. Although there are many irrelevant features in ImageNet32, the positive effect is larger than the negative effect, which produces a U-curve. This is consistent with our statement that we need a large and target-relevant pre-training dataset rather than a diverse irrelevant one.

Replacing CINIC-10 With CIFAR-10. In Fig.
12, we keep the same setting as Fig. 7 except that we replace CINIC-10 with CIFAR-10. Note that our downstream task is still CIFAR-10. In Fig. 12, we can see the same phenomena and similar performance as in Fig. 7. Thus, if we have a good choice of a task-relevant pre-training dataset, we can get a similar performance as pre-training on the downstream task domain directly.

The pre-training unlabeled data from diverse tasks may have a positive effect when the pre-training unlabeled data from similar (relevant) tasks is not sufficiently large. Moreover, if we choose a good task-relevant pre-training dataset, we can directly get a similar performance as pre-training on the downstream task. However, when the task-relevant dataset is sufficient, the performance will drop if we introduce task-irrelevant data in the pre-training dataset (Fig. 11

Results. In Table 7, the trade-off phenomenon also exists for FT evaluation, where the FT test accuracy drops when the pre-training dataset contains more diverse data points. Table 7 shows that Ours can achieve better performance than the other baselines. In particular, it outperforms the typical fine-tuning method consistently, even when the latter also uses the same amount of data augmentation. This confirms the benefit of contrastive regularization. Fig. 13 visualizes the features of different methods by t-SNE. It shows that contrastive regularization can down-weight the downstream task-irrelevant invariant features, and so it can improve the model's generalization ability, which is consistent with the discussion in Section 2.2.

et al., 2022)). The pricing for GPT-3 to get feature embeddings is $0.20 / 1k tokens. If users would like to use a foundation model on their downstream task, the most efficient way is to get feature embeddings of their data from the API and train a small model, called an adapter (Hu et al., 2021; Sung et al., 2022), on these embeddings rather than on raw input data.
We evaluate CLIP (with ViT-L as the representation backbone), MoCo v3 (ViT-B backbone), and SimCSE (Gao et al., 2021) (BERT backbone). They are trained on (image, text), (image, image), and (text, text) contrastive pairs, respectively, and thus cover a good spectrum of methods. For CLIP and MoCo v3, we fix the backbone for all evaluation methods. For LP, we add a linear FC layer on top of the backbone. For FT and Ours, we insert a two-layer ReLU neural network (1024 dimensions for the hidden layer) as an adapter between the backbone and the linear classification layer. For Ours, we apply the SimCLR contrastive loss on the adapted feature (output of the adapter) and set λ = 1.0. We use the same training strategy as Section 3.1 for LP and as Table 7 for FT and Ours. In line with the practical situation, we control the number of augmentations used in FT and Ours (more data augmentation leads to higher prices in practice). We conduct experiments on NLP tasks as well. SimCSE proposes a contrastive learning method for sentence embeddings, which uses the dropout feature from BERT as data augmentation. The max sequence embedding length is 512. For SimCSE, all three evaluation methods use a linear classifier. For all evaluation methods, we use AdamW (Loshchilov & Hutter, 2018) with weight decay = 0.01 and train for 3 epochs with batch size 16. For LP, we fix the backbone and set the learning rate to 5e-3. For FT and Ours, we train the backbone and linear classification layer simultaneously using a unique data augmentation in each training step and set the learning rate to 5e-5. For Ours, we apply the SimCSE contrastive loss on the feature (output of the backbone) and set λ = 1.0. As in Appendix C.2, we report LP, FT, and Ours on 1%, 5%, 10%, 20%, 100% of the labeled data from the downstream task in Table 8, Table 9, and Table 10 for CLIP, MoCo v3, and SimCSE, respectively. For CLIP and MoCo v3 in Table 8 and Table 9, we consider different numbers of data augmentations, e.g., 10 and 100.
For SimCSE, we use the standard data augmentation, i.e., generating a unique augmentation for each gradient step. These tables again show that our method consistently improves the downstream prediction performance in all three real-world scenarios, and quite significantly in some cases (e.g., MoCo v3 on GTSRB). This shows that the method is also useful for large foundation models, including the case when the foundation models cannot be fine-tuned and only the extracted embeddings can be adapted.
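The training objective used for Ours above (a supervised loss plus a contrastive loss on the adapted features, weighted by λ = 1.0) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the form "cross-entropy + λ × contrastive loss" is inferred from the description, the contrastive term is a SimCLR-style NT-Xent loss over two augmented views, the temperature of 0.5 is an arbitrary choice, and all function names are hypothetical. Feature vectors are assumed to be nonzero.

```python
import math


def _normalize(vector):
    """Scale a nonzero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """SimCLR-style contrastive loss; view_a[i] and view_b[i] form a positive pair."""
    n = len(view_a)
    feats = [_normalize(v) for v in view_a] + [_normalize(v) for v in view_b]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same example
        logits = [_dot(feats[i], feats[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive_logit = _dot(feats[i], feats[positive]) / temperature
        total += _log_sum_exp(logits) - positive_logit
    return total / (2 * n)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example."""
    return _log_sum_exp(logits) - logits[label]


def regularized_objective(class_logits, labels, view_a, view_b,
                          lam=1.0, temperature=0.5):
    """Assumed objective: mean cross-entropy + lam * contrastive regularizer."""
    supervised = sum(cross_entropy(l, y)
                     for l, y in zip(class_logits, labels)) / len(labels)
    return supervised + lam * nt_xent_loss(view_a, view_b, temperature)
```

The contrastive term is small when the two views of each example map to nearby adapted features and large when they are closer to other examples, which is the sense in which it regularizes the adapter toward augmentation-invariant features.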

C.4.2 THE EFFECT OF CONTRASTIVE REGULARIZATION ON LINEAR PROBING

We show that contrastive regularization can also improve over linear probing. Note that the contrastive regularization loss term is only relevant to the backbone (see the definition in Equation (9)), while linear probing fixes the backbone. Thus, we cannot do contrastive regularization and linear probing at the same time. To show its effect, we first apply contrastive regularization to update the backbone, and after that use linear probing. The results in Table 11 show that contrastive regularization leads to better prediction accuracy. Furthermore, the improvement is more significant on diverse pre-training data, consistent with our analysis.

We show that contrastive regularization can reduce the target task performance gap between pre-training on the specific dataset (the same as or similar to the target task) and pre-training on diverse datasets. We pre-train with MoCo v3 (backbone ViT-S) on ImageNet-Bird or the whole ImageNet and then perform linear probing (LP) on the target task of ImageNet-Bird with 1k labeled samples. The results in Table 12 show that pre-training on the diverse data leads to worse performance. Then we pre-train on ImageNet, apply contrastive regularization on ImageNet-Bird, and perform linear probing (LP) on the target task. This leads to better performance than without contrastive regularization, closing the gap between pre-training on diverse and specific data. Similar results are observed in Table 13 when ImageNet-Vehicle is used. This confirms the benefits of contrastive regularization for highlighting task-specific features.



https://github.com/zhmeishi/trade-off_contrastive_learning



Figure 1: Illustration of the trade-off between universality and label efficiency. x-axis: from left to right, incrementally add CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I) for pre-training MoCo v2. For example, "CS" means CINIC-10+SVHN. The average test accuracy of prediction on all 4 datasets (red line) increases with more diverse pre-training data, while that on the target task CIFAR-10 (blue line) decreases. (The variance of the blue line is too small to be seen.) Please refer to Section 3.1 for details.

Figure 2: Illustration of the features in our data distributions.

Figure 3: Trade-off between universality and label efficiency for MoCo v2. Appendix C.5 shows similar results for more methods and datasets. x-axis: incrementally add datasets for pre-training MoCo v2. (a) Pre-training data: CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). E.g., "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (b) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (c) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. Note that training is not done in an online fashion, e.g., the model is pre-trained from scratch (random initialization) on the CSG datasets, rather than starting from the model pre-trained on the CS datasets.

Figure 5: Linear CKA similarity among Fer2013 features from MoCo v2 pre-trained on different datasets. Left: each representation in the first four columns/rows is pre-trained on a single dataset. "Union" indicates the model pre-trained on the union of the four disjoint datasets. Right: from left column to right, from top row to bottom, we incrementally add datasets for pre-training.

Figure 6: A two-dim example of XOR structure in the space of ϕ.

is a superset of ImageNet which contains 14.2M training data and 522,500 test data with 21,841 classes. We crop each color image to 224 × 224 as the standard setting.

ImageNet-Bird. ImageNet-Bird is a subset of ImageNet and contains all bird-related images from ImageNet, with 59 classes and 76k training images.

ImageNet-Vehicle. ImageNet-Vehicle is a subset of ImageNet and contains all vehicle-related images from ImageNet, with 43 classes and 55k training images.

ImageNet-Cat/Ball/Shop/Clothing/Fruit. ImageNet-Cat/Ball/Shop/Clothing/Fruit is a subset of ImageNet and contains all cat/ball/shop/clothing/fruit-related images from ImageNet, with 76 classes and 100k training images.


Figure 7: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. "1, 2, 3, 4" means incrementally adding datasets for pre-training. The x-axis is the average test accuracy of Linear Probing on all 4 datasets. The y-axis is the test accuracy on the target task. Pre-training data: (a)(b)(c) CINIC-10, SVHN, GTSRB, and ImageNet32. Target task: CIFAR-10. (d)(e)(f) EMNIST-Digits&Letters, Fashion-MNIST, GTSRB, ImageNet32. Target: MNIST. (g)(h)(i) FaceScrub, CIFAR-10, SVHN, ImageNet32. Target: Fer2013.

Figure 10: Pre-train MoCo v2 and SimSiam on CIFAR-10 + ImageNet32(200k) with varying number of classes of ImageNet32 from 50 to 1000 (x-axis) under fixed size of pre-training data. The y-axis is LP test accuracy on CIFAR-10.

Figure 12: Trade-off on CIFAR-10 LP test accuracy (y-axis) for MoCo v2 and SimSiam pre-trained on datasets including CIFAR-10.

Figure 13: The t-SNE visualization (Van der Maaten & Hinton, 2008) for CIFAR-10 training data normalized features from different evaluation methods, where the model is pre-trained on (CSGI) defined in Fig. 3. FT and Ours are trained on the 20% CIFAR-10 training dataset. Different colors correspond to different classes.

Figure 14: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. x-axis: incrementally add datasets for pre-training. Pre-training data: (a)(b)(c) CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). For example, "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (d)(e)(f) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (g)(h)(i) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. All evaluations are trained with 1% labeled data.

Figure 15: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. x-axis: incrementally add datasets for pre-training. Pre-training data: (a)(b)(c) CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). For example, "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (d)(e)(f) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (g)(h)(i) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. All evaluations are trained with 5% labeled data.

Figure 16: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. x-axis: incrementally add datasets for pre-training. Pre-training data: (a)(b)(c) CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). For example, "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (d)(e)(f) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (g)(h)(i) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. All evaluations are trained with 10% labeled data.

Figure 17: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. x-axis: incrementally add datasets for pre-training. Pre-training data: (a)(b)(c) CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). For example, "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (d)(e)(f) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (g)(h)(i) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. All evaluations are trained with 20% labeled data.

Figure 18: Trade-off of universality and label efficiency for MoCo v2, NNCLR, SimSiam on downstream tasks CIFAR-10, MNIST, Fer2013. x-axis: incrementally add datasets for pre-training. Pre-training data: (a)(b)(c) CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I). For example, "CS" on the x-axis means CINIC-10+SVHN. Target task: CIFAR-10. Red line: average test accuracy of Linear Probing on all 4 datasets. Blue line: test accuracy on the target task. (d)(e)(f) EMNIST-Digits&Letters (E), Fashion-MNIST (F), GTSRB (G), ImageNet32 (I). Target: MNIST. (g)(h)(i) FaceScrub (F), CIFAR-10 (C), SVHN (S), ImageNet32 (I). Target: Fer2013. All evaluations are trained with 100% labeled data.

While on task $\mathcal{D}_i$ ($i \neq t$), any linear predictor on $\phi^*_t$ has error at least $\min_u \mathbb{E}_{\mathcal{D}_i}[\ell_c(u^\top z_S, y)]$.

Test accuracy on CIFAR-10 with different evaluation methods on MoCo v2 by using all CIFAR-10 training data. From left to right: incrementally add datasets for pre-training.


B.1. Consider the above setting. Let $\alpha, \alpha_t$ ($t \in [T]$) be the optimizer for

LP test accuracy on ImageNet and ImageNet22k with UniCL (Swin-T) pre-trained for 500 epochs on ImageNet and ImageNet+GCC-15M.

LP test accuracy on ImageNet-Bird and ImageNet with MoCo v3 (ViT-S) pre-trained on ImageNet-Bird and ImageNet.

LP test accuracy on ImageNet-Vehicle and ImageNet with MoCo v3 (ViT-S) pre-trained on ImageNet-Vehicle and ImageNet.



Test accuracy on CIFAR-10 with different evaluation methods on MoCo v2 under different percentages of labeled data. From top to bottom: incrementally add datasets for pre-training.

Test accuracy for different evaluation methods on different datasets using foundation model CLIP (backbone ViT-L). We do not use data augmentation for LP. We evaluate FT without data augmentation, with 10 augmentations, and with 100 augmentations for each training image. For Ours, we use 10 and 100 augmentations.

Test accuracy for different evaluation methods on different datasets using foundation model MoCo v3 (backbone ViT-B). We do not use data augmentation for LP. We evaluate FT without data augmentation, with 10 augmentations, and with 100 augmentations for each training image. For Ours, we use 10 and 100 augmentations.

Test accuracy for different evaluation methods on different datasets using foundation model SimCSE (backbone BERT). We do not use data augmentation for LP. We evaluate FT and Ours with the same data augmentation as SimCSE.

Test accuracy on CIFAR-10 with different evaluation methods on MoCo v2 with ResNet18 backbone. From left to right: incrementally add datasets for pre-training.

C.4.3 THE EFFECT OF CONTRASTIVE REGULARIZATION ON CLOSING THE GAP


ACKNOWLEDGMENTS

The work is partially supported by Air Force Grant FA9550-18-1-0166, the National Science Foundation (NSF) Grants CCF-FMitF-1836978, IIS-2008559, SaTC-Frontiers-1804648, CCF-2046710 and CCF-1652140, and ARO grant number W911NF-17-1-0405. Jiefeng Chen and Somesh Jha are partially supported by the DARPA-GARD problem under agreement number 885000. Jayaram Raghuram was partially supported through the National Science Foundation's grants CNS-2112562 and CNS-2003129. 


experiments will vary the weight decay regularization coefficient (larger weight decay corresponds to a smaller norm bound and fewer features learned in the representation). Fig. 9 shows that the linear CKA similarity decreases as the weight decay increases, which provides some support for the theorem.
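The linear CKA similarity used in these comparisons can be computed from two feature matrices extracted on the same inputs. The sketch below is a plain-Python illustration of the standard linear CKA formula, $\|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered features; it is not the authors' code, the function names are hypothetical, and it assumes both matrices have the same number of rows (one row per input) and non-constant features.

```python
import math


def _center_columns(matrix):
    """Subtract the per-column mean so every feature has zero mean."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in matrix]


def _cross_frobenius_sq(x, y):
    """Squared Frobenius norm of x^T y for row-aligned matrices x and y."""
    total = 0.0
    for a in range(len(x[0])):
        for b in range(len(y[0])):
            entry = sum(row_x[a] * row_y[b] for row_x, row_y in zip(x, y))
            total += entry * entry
    return total


def linear_cka(features_x, features_y):
    """Linear CKA between two representations of the same inputs."""
    if len(features_x) != len(features_y):
        raise ValueError("both feature matrices need one row per input")
    x = _center_columns(features_x)
    y = _center_columns(features_y)
    cross = _cross_frobenius_sq(x, y)
    norm_x = math.sqrt(_cross_frobenius_sq(x, x))
    norm_y = math.sqrt(_cross_frobenius_sq(y, y))
    return cross / (norm_x * norm_y)
```

The value is 1 when one representation is a rescaled or rotated copy of the other and 0 when their centered features are uncorrelated, so a decreasing value indicates that the two representations encode increasingly different information.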

C.3.2 THE EFFECT OF PRE-TRAINING AND LABELED DATA SIZES

We have also conducted finer-grained investigations into the trade-off by varying the pre-training dataset size and the downstream labeled data on the specific task. Table 4 shows the results for different pre-training datasets (ImageNet-Bird, ImageNet, and 10% of ImageNet data) and different labeled datasets (500 to 8k samples from ImageNet-Bird, and the whole ImageNet). Table 5 shows 

