THE ROLE OF PRE-TRAINING DATA IN TRANSFER LEARNING

Abstract

The transfer learning paradigm of model pre-training and subsequent fine-tuning produces high-accuracy models. However, a question remains: what data and method should be used for pre-training? We study the effect of the pre-training data distribution on transfer learning in the context of image classification, investigating to what extent different pre-training datasets differ in downstream task performance. Through controlled experiments, we find that the pre-training dataset is initially important for low-shot transfer. However, the differences between distributions diminish as more data is made available for fine-tuning. We also investigate how much labeling is worth compared to noisier but larger pre-training data: our results show that, to match the performance of supervised pre-training on ImageNet, we need 15x to 2000x more pre-training data from LAION, depending on the downstream task. We also investigate the effect of dataset size and observe that larger pre-training datasets lead to better accuracy; however, the absolute accuracy difference is largest in the few-shot regime. Beyond data, we study the effect of the pre-training method, comparing language-image contrastive with image-image contrastive learning, and find that the latter usually leads to better transfer accuracy.

1. INTRODUCTION

The best-performing computer vision models are produced by the transfer learning paradigm. In this two-step procedure, a model is first pre-trained on a large heterogeneous dataset. Next, the model is fine-tuned on application-specific data, which adapts it to a problem of interest. While transfer learning is not new, it has become increasingly important with drastic improvements in the quality of pre-trained models (e.g., CLIP (Radford et al., 2021), BASIC (Pham et al., 2021), and Flamingo (Alayrac et al., 2022)). These improvements are driven by new datasets for pre-training as well as better pre-training algorithms.

This naturally leads to a question: How do the dataset and algorithm used for pre-training affect downstream transfer performance? While related works try to relate pre-training to transfer performance by exploring scaling laws (Kornblith et al., 2019; Abnar et al., 2021) or by predicting transferability without actual fine-tuning (You et al., 2021; Nguyen et al., 2020; Deshpande et al., 2021; Bolya et al., 2021), we highlight that, to the best of our knowledge, the role of the pre-training data distribution has not been investigated so far. We therefore define specific research questions, detailed below, and set up systematic experiments focusing on each question while carefully ablating the other factors.

To what extent do different pre-training datasets differ in downstream task performance? Do we expect different distributions to perform differently in the transfer setting, and how does that compare to training from scratch? When controlling for size but changing the pre-training dataset, we observe noticeable differences in downstream transfer accuracy. These differences are larger in the few-shot setting, when only a few examples per class are available for fine-tuning. When many images are available for fine-tuning, the difference in absolute accuracy when varying the pre-training dataset largely disappears.
Across many downstream tasks, certain pre-training datasets (e.g., Shutterstock) consistently lead to better transfer accuracy than others (e.g., WiT). However, the ordering still differs from one downstream task to another. Moreover, even the pre-training dataset that leads to the worst transfer accuracy still outperforms training from scratch (see Figure 1).
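The two-step paradigm discussed above can be made concrete with a minimal sketch. This is not the paper's experimental setup; it is a toy NumPy illustration in which a frozen random-feature encoder stands in for a pre-trained backbone, and "fine-tuning" trains only a linear classification head on a small downstream task. All names, dimensions, and the toy task are illustrative assumptions.

```python
# Toy sketch of the pre-train / fine-tune paradigm (illustrative only).
# Step 1 ("pre-training") is stubbed out: in practice the encoder weights
# would come from training on a large heterogeneous dataset.
# Step 2 (fine-tuning) trains a linear head on frozen encoder features.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Frozen feature extractor: stands in for a pre-trained backbone.
    return np.maximum(x @ W, 0.0)  # ReLU features

def finetune_head(feats, labels, n_classes, lr=0.1, steps=200):
    # Fit a linear head with softmax cross-entropy via gradient descent.
    n, d = feats.shape
    Wh = np.zeros((d, n_classes))
    for _ in range(steps):
        logits = feats @ Wh
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(n), labels] -= 1.0                # softmax CE gradient
        Wh -= lr * feats.T @ p / n
    return Wh

# Toy downstream task: 2 classes, a few dozen labeled examples.
d_in, d_feat = 16, 32
W_pre = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)  # "pre-trained" weights
x = rng.normal(size=(32, d_in))
y = (x[:, 0] > 0).astype(int)

feats = encoder(x, W_pre)
Wh = finetune_head(feats, y, n_classes=2)
acc = (np.argmax(feats @ Wh, axis=1) == y).mean()
```

In a realistic setting the encoder would be a deep network and the downstream data would be held-out images, but the structure is the same: the pre-training stage fixes the representation, and the fine-tuning stage adapts only what the downstream task can support, which is why few-shot performance is where the choice of pre-training data matters most.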

