THE ROLE OF PRE-TRAINING DATA IN TRANSFER LEARNING

Abstract

The transfer learning paradigm of model pre-training and subsequent fine-tuning produces high-accuracy models. However, a question remains: what data and method should be used for pre-training? We study the effect of the pre-training data distribution on transfer learning in the context of image classification, investigating to what extent different pre-training datasets differ in downstream task performance. Through controlled experiments, we find that the pre-training dataset is initially important for low-shot transfer. However, the differences between distributions diminish as more data is made available for fine-tuning. We also investigate how much expensive labeling is worth compared to noisier but larger pre-training data. Our results show that, to match the performance of supervised pre-training on ImageNet, we need 15x to 2000x more pre-training data from LAION, depending on the downstream task. We also investigate the effect of pre-training dataset size and observe that larger pre-training datasets lead to better accuracy; however, the absolute accuracy difference is largest in the few-shot regime. Beyond data, we study the effect of the pre-training method, comparing language-image contrastive and image-image contrastive pre-training, and find that the latter usually leads to better transfer accuracy.

1. INTRODUCTION

The best-performing computer vision models are produced by the transfer learning paradigm. In this two-step procedure, a model is first pre-trained on a large heterogeneous dataset. Next, the model is fine-tuned on application-specific data, which adapts it to the problem of interest. While transfer learning is not new, it has become increasingly important with drastic improvements in the quality of pre-trained models (e.g., CLIP (Radford et al., 2021), BASIC (Pham et al., 2021), and Flamingo (Alayrac et al., 2022)). These improvements are driven by new datasets for pre-training as well as better pre-training algorithms. This naturally leads to a question: how do the dataset and algorithm used for pre-training affect downstream transfer performance? While related work has tried to relate pre-training and transfer performance by exploring scaling laws (Kornblith et al., 2019; Abnar et al., 2021) or by predicting transferability without actual fine-tuning (You et al., 2021; Nguyen et al., 2020; Deshpande et al., 2021; Bolya et al., 2021), to the best of our knowledge the role of the pre-training data distribution has not been investigated so far. We therefore define specific research questions, detailed below, and set up systematic experiments focusing on each question while carefully ablating the other factors.

To what extent do different pre-training datasets differ in downstream task performance? Do we expect different distributions to perform differently in the transfer setting, and how does that compare to training from scratch? When controlling for size but changing the pre-training dataset, we observe noticeable differences in downstream transfer accuracy. These differences are larger in the few-shot setting, when only a few examples per class are available for fine-tuning. When many images are available for fine-tuning, the difference in absolute accuracy when varying the pre-training dataset largely vanishes.
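The two-step procedure can be sketched minimally as follows. This is an illustrative NumPy toy, not our experimental setup: the frozen "encoder" is just a random projection standing in for a pre-trained network, and the blob data, learning rate, and step count are assumptions chosen only to make the example self-contained.

```python
# Minimal sketch of the transfer learning paradigm:
# step one produces a feature extractor; step two adapts it to a task.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_pre):
    """Stand-in for a frozen pre-trained feature extractor."""
    return np.tanh(x @ W_pre)

def fine_tune(feats, labels, n_classes, lr=0.1, steps=200):
    """Step two: fit a linear head on downstream features (softmax regression)."""
    W = np.zeros((feats.shape[1], n_classes))
    y = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (p - y) / len(feats)  # cross-entropy gradient step
    return W

# Toy downstream task: two well-separated Gaussian blobs.
x = np.concatenate([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W_pre = rng.normal(size=(8, 16))   # "pre-trained" weights, kept frozen
W_head = fine_tune(encoder(x, W_pre), y, n_classes=2)
acc = (np.argmax(encoder(x, W_pre) @ W_head, axis=1) == y).mean()
```

In our actual experiments the encoder weights are updated as well (full fine-tuning) or only a few labeled examples per class are available (few-shot fine-tuning); the sketch only fixes the vocabulary of pre-trained features versus a downstream head.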
Across many downstream tasks, certain pre-training datasets (e.g., Shutterstock) consistently lead to better transfer accuracy than others (e.g., WiT). However, there are still ordering differences from one downstream task to another. Moreover, even the pre-training dataset that leads to the worst transfer accuracy still outperforms training from scratch (see Figure 1).

How much is expensive labeling worth compared to noisier but larger pre-training data? We compare different pre-training strategies: supervised pre-training on the small but labeled ImageNet, and semi-supervised pre-training on image-language pairs from larger but noisier datasets. We find that pre-training on a well-curated dataset leads to better transfer accuracy than pre-training on a noisy dataset of similar size, and that only pre-training on a 15x to 2000x larger noisy dataset can close the gap (see Figure 2).

How much does increasing the pre-training dataset size contribute to transfer learning performance? When controlling for the pre-training dataset and instead varying its size, we observe that models pre-trained on more images usually have better transfer performance. However, similarly to the results above, the absolute difference in transfer accuracy between models pre-trained on small-scale and medium-scale datasets diminishes when fine-tuning on more data. We also observe that increasing the pre-training dataset size saturates differently across target tasks: while even 100x more data does not help transfer to some downstream tasks, adding more data improves the downstream performance of others (see Figure 3).

What is the role of the pre-training method in transfer performance? We also examine the difference between two popular contrastive pre-training algorithms: CLIP, which contrasts images with language, and SimCLR, which contrasts augmented views of images.
Overall, we find that SimCLR pre-training leads to better transfer than CLIP pre-training in the low-shot regime, but that there are only small differences when many images are used for fine-tuning (see Figure 5).

To answer these questions, we conduct an extensive empirical investigation (over 4000 experiments) in the context of computer vision. Our study covers 7 image-text pre-training datasets (YFCC (Thomee et al., 2016), LAION (Schuhmann et al., 2021), RedCaps (Desai et al., 2021), Conceptual Captions-3M (Sharma et al., 2018), Conceptual Captions-12M (Changpinyo et al., 2021), WiT (Srinivasan et al., 2021), and Shutterstock) plus the labeled ImageNet (Deng et al., 2009); 9 fine-tuning datasets (CIFAR-100 (Krizhevsky et al., 2009), DTD (Cimpoi et al., 2014), Caltech-101 (Fei-Fei et al., 2004), PETS (Parkhi et al., 2012), REAL and CLIPART from DomainNet (Peng et al., 2019), EuroSAT (Helber et al., 2019), Cassava Leaf Disease Classification (Cas), and Caltech Camera Traps-20 (Beery et al., 2018)); and two pre-training methods, CLIP (Radford et al., 2021) and SimCLR (Chen et al., 2020). To evaluate transfer performance, we examine both few-shot fine-tuning and full fine-tuning. The paper is structured as follows: we review related work and provide relevant background on transfer learning in Section 2, followed by our experimental setup in Section 3. Section 4 details our observations on each research question by measuring the downstream transfer accuracy of models pre-trained on various data sources and dataset sizes, and with different pre-training losses. We discuss our findings and conclude with future research directions in Section 5.
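The two contrastive objectives being compared can be sketched as follows. This is a hedged NumPy illustration of the standard CLIP-style symmetric loss and the SimCLR NT-Xent loss; the batch size, embedding dimension, and temperatures are illustrative stand-ins, not the settings used in our experiments or the original papers.

```python
# Language-image contrast (CLIP-style) vs. image-image contrast (SimCLR-style).
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix;
    matched image-caption pairs sit on the diagonal."""
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / tau
    labels = np.arange(len(sim))
    def xent(logits):
        m = logits.max(axis=1, keepdims=True)
        logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (xent(sim) + xent(sim.T))

def simclr_loss(view1, view2, tau=0.5):
    """NT-Xent: each embedding's positive is the other augmented view
    of the same image; all other batch members are negatives."""
    n = len(view1)
    z = l2_normalize(np.concatenate([view1, view2]))
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    m = sim.max(axis=1, keepdims=True)
    logp = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    return -logp[np.arange(2 * n), pos].mean()

# Sanity check: correctly matched pairs should incur lower loss than shuffled ones.
rng = np.random.default_rng(0)
img = rng.normal(size=(6, 8))
aligned = clip_loss(img, img)                       # captions match their images
shuffled = clip_loss(img, np.roll(img, 1, axis=0))  # captions misaligned
```

The structural difference is the source of the positive pair: a paired caption embedding for CLIP versus a second augmentation of the same image for SimCLR, which is why the two methods make such different use of noisy web data.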

2. RELATED WORK

Transfer learning is widely used in deep learning research and practice and has become a cornerstone of both computer vision and natural language processing. Over the years, many questions have been raised about why transfer helps and how to choose a good pre-trained model to transfer from. Neyshabur et al. (2020) separated the effect of feature reuse from that of learning low-level pre-training data statistics. Their study involved models pre-trained on ImageNet with supervision. Another important question is whether transfer learning is always helpful on any downstream dataset. Raghu et al. (2019) experimented with downstream medical datasets (with images coming from a distribution very different from that of ImageNet or other natural-image datasets) and found that transfer learning from ImageNet pre-trained models shows little benefit in performance. This shows that the downstream dataset is an important factor to consider when evaluating the transfer performance of upstream models. To make it possible to more generally evaluate the visual representations of upstream models, Zhai et al. (2019) introduced the Visual Task Adaptation Benchmark (VTAB). VTAB aims to measure the adaptability of representations to diverse, unseen tasks, given only a few examples from the downstream dataset. Scaling up dataset and model size is a well-known trend for improving accuracy in both natural language processing (Kaplan et al., 2020) and computer vision (Kolesnikov et al., 2020). For instance, Kolesnikov et al. (2020) use the weakly labeled JFT-300M dataset for pre-training. An even larger noisy dataset with 3.5B images from Instagram was used in (Mahajan et al., 2018). To make use of large

