THE TRADE-OFF BETWEEN UNIVERSALITY AND LABEL EFFICIENCY OF REPRESENTATIONS FROM CONTRASTIVE LEARNING

Abstract

Pre-training representations (a.k.a. foundation models) has recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data, and then learns simple predictors on top of the representation using small labeled datasets from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks). In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata, so that one may not be able to achieve both simultaneously. Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data yield more diverse features for different tasks (improving universality), they put less emphasis on task-specific features, giving rise to larger sample complexity for downstream supervised tasks, and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.

1. INTRODUCTION

Representation pre-training is a recent successful approach that utilizes large-scale unlabeled data to address the challenges of scarce labeled data and distribution shift. Different from the traditional supervised learning approach using a large labeled dataset, representation learning first pre-trains a representation function using large-scale diverse unlabeled datasets by self-supervised learning (e.g., contrastive learning), and then learns predictors on the representation using small labeled datasets for downstream target tasks. The pre-trained model is commonly referred to as a foundation model (Bommasani et al., 2021), and has achieved remarkable performance in many applications, e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021), and Flamingo (Alayrac et al., 2022). We note that two properties are key to their success: (1) label efficiency: with the pre-trained representation, only a small amount of labeled data is needed to learn accurate predictors for downstream target tasks; (2) universality: the pre-trained representation can be used across various downstream tasks. In this work, we focus on contrastive learning with linear probing, which learns a linear predictor on the representation pre-trained by contrastive learning and is an exemplary pre-training approach (e.g., Arora et al., 2019; Chen et al., 2020). Ideally, one would like to have both key properties simultaneously; however, we highlight and study a fundamental trade-off between label efficiency and universality. Since pre-training with large-scale diverse unlabeled data is widely used in practice, such a trade-off merits deeper investigation. Theoretically, we provide an analysis of the features learned by contrastive learning, of how the learned features determine the downstream prediction performance, and of how they lead to the trade-off.
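As a concrete illustration of the linear-probing setup described above, the following toy sketch trains a linear classifier on top of a frozen representation. The "encoder" here is only a stand-in (a fixed random projection with a ReLU), not a network actually pre-trained by contrastive learning; all names and the synthetic data are illustrative.

```python
# Minimal sketch of linear probing: the encoder is frozen, and only a
# linear (logistic-regression) head is trained on a small labeled set.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: x -> ReLU(W x). In practice this would be
# a representation pre-trained with contrastive learning.
W = rng.normal(size=(16, 8))
def encode(x):
    return np.maximum(W @ x, 0.0)

# Synthetic downstream task: the label depends on one direction in input space.
direction = rng.normal(size=8)
X = rng.normal(size=(200, 8))
y = (X @ direction > 0).astype(float)

# Linear probe: logistic regression on frozen features via gradient descent.
Z = np.array([encode(x) for x in X])   # features are computed once and frozen
w, b = np.zeros(16), 0.0
for _ in range(500):
    logits = np.clip(Z @ w + b, -30, 30)       # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-logits))          # predicted probabilities
    grad_w = Z.T @ (p - y) / len(y)            # gradient of the logistic loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

acc = np.mean(((Z @ w + b) > 0) == (y > 0.5))
print(f"linear-probe train accuracy: {acc:.2f}")
```

Only `w` and `b` are updated here; the encoder stays fixed, which is exactly what distinguishes linear probing from fine-tuning.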
We propose a hidden representation data model, which first generates a hidden representation containing various features, and then uses it to generate the label and the input. We first show that contrastive learning is essentially generalized nonlinear PCA that can learn hidden features invariant to the transformations used to generate positive pairs. We also point out that additional assumptions on the data and representations are needed to obtain non-vacuous guarantees for prediction performance. We thus consider a setting where the data are generated by linear functions of the hidden representation, and formally prove that the difference in the learned features leads to the trade-off. In particular, pre-training on more diverse data learns more diverse features and is thus useful for prediction on more tasks; but it also down-weights task-specific features, implying larger sample complexity for predictors and thus worse prediction performance on a specific task. This analysis inspires us to propose a general method, contrastive regularization, which adds a contrastive loss to the training of predictors to improve the accuracy on downstream tasks.
Empirically, we first perform controlled experiments to reveal the trade-off. Specifically, we first pre-train on a dataset similar to that of the target task, and then incrementally add more datasets into pre-training. In the end, the pre-training data include both datasets similar to the target task and others not so similar, which mimics the practical scenario where foundation models are pre-trained on diverse data so as to be widely applicable to various downstream tasks. Figure 1 gives an example of this experiment: as we increase task diversity for contrastive learning, the average accuracy on all tasks increases from 18.3% to 20.1%, while the label efficiency on an individual task suffers; on CIFAR-10, the accuracy drops from 88.5% to 76.4%. We also perform experiments on contrastive regularization, and demonstrate that it consistently improves over the typical fine-tuning method across multiple datasets. In several cases, the improvement is significant: 1.3% test accuracy improvement for CLIP on ImageNet, and 4.8% for MoCo v3 on GTSRB (see Tables 1 and 2 for details). With these results, we believe it is important to bring the community's attention to this trade-off and the path forward for foundation models. Our main contributions are summarized as follows:
• We propose a hidden representation data model and prove that contrastive learning is essentially generalized nonlinear PCA, and can encode hidden features invariant to the transformations used in positive pairs (Section 2.1).
• We formally prove the trade-off in a simplified setting with linear data (Section 2.2).
• We empirically demonstrate the trade-off across different methods and datasets for contrastive learning with linear probing (Sections 3.1 and 3.2).
• We propose a contrastive regularization method for training the predictor on a target task (Section 2.2), which achieves consistent improvement in our experiments (Section 3.3).
Related Work on Representation Pre-training. This paradigm pre-trains a representation function on a large dataset and then uses it for prediction on various downstream tasks (Devlin et al., 2019; Kolesnikov et al., 2020; Brown et al., 2020; Newell & Deng, 2020). The representations are also called foundation models (Bommasani et al., 2021).
There are mainly two kinds of approaches: (1) supervised approaches (e.g., Kolesnikov et al., 2020) that pre-train on large labeled datasets; (2) self-supervised approaches (e.g., Newell & Deng, 2020) that pre-train on large and diverse unlabeled datasets. Recent self-supervised pre-training can compete with or outperform supervised pre-training on downstream prediction performance (Ericsson et al., 2021). Practical examples like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021),



Figure 1: Illustration of the trade-off between universality and label efficiency. x-axis: from left to right, incrementally add CINIC-10 (C), SVHN (S), GTSRB (G), and ImageNet32 (I) for pre-training MoCo v2. For example, "CS" means CINIC-10+SVHN. The average test accuracy of prediction on all 4 datasets (red line) increases with more diverse pre-training data, while that on the target task CIFAR-10 (blue line) decreases. (The variance of the blue line is too small to be seen.) Please refer to Section 3.1 for details.
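To make the contrastive regularization idea from the introduction concrete, the following toy sketch combines a supervised cross-entropy loss with an InfoNCE-style contrastive loss computed on the representations of two augmented views of each input. This is an illustrative reconstruction under our own assumptions (random features, a generic InfoNCE form, a hypothetical weight `lam`), not the paper's actual implementation.

```python
# Sketch of contrastive regularization: supervised loss + lambda * InfoNCE.
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss for a batch of paired views, z1[i] <-> z2[i]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives on the diagonal

def cross_entropy(logits, y):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y])

rng = np.random.default_rng(0)
n, d, k = 32, 16, 4
feats_view1 = rng.normal(size=(n, d))                     # representation of view 1
feats_view2 = feats_view1 + 0.1 * rng.normal(size=(n, d)) # augmented view 2
class_logits = rng.normal(size=(n, k))                    # predictor outputs
labels = rng.integers(0, k, size=n)

lam = 0.1  # hypothetical regularization weight
total_loss = (cross_entropy(class_logits, labels)
              + lam * info_nce(feats_view1, feats_view2))
print(f"regularized training loss: {total_loss:.3f}")
```

Minimizing the combined objective fine-tunes the predictor for the target task while the contrastive term keeps the representation aligned across augmentations of the same input.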

