BI-TUNING OF PRE-TRAINED REPRESENTATIONS

Abstract

It is common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Recently, both supervised and unsupervised pre-training approaches to learning representations have achieved remarkable advances, exploiting the discriminative knowledge of labels and the intrinsic structure of data, respectively. It is natural to expect that both the discriminative knowledge and the intrinsic structure of the downstream task are useful for fine-tuning; however, existing fine-tuning methods mainly leverage the former and discard the latter. A question arises: how can we fully explore the intrinsic structure of data to boost fine-tuning? In this paper, we propose Bi-tuning, a general learning approach for fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations: a classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, and a projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way. Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g., a 10.7% absolute rise in accuracy on CUB in the low-data regime).
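The two-head design described above can be sketched in a few lines of numpy. This is a minimal illustration under our own assumptions (random linear heads on fixed backbone features, standard cross-entropy for the classifier head, and a SupCon-style supervised contrastive surrogate standing in for the paper's categorical contrastive loss), not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-9):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class BiTuningHeads:
    """Two heads on top of a shared backbone feature vector:
    a linear classifier head and a linear projector head."""
    def __init__(self, feat_dim, num_classes, proj_dim):
        # Hypothetical random initialization for illustration only.
        self.W_cls = rng.normal(scale=0.01, size=(feat_dim, num_classes))
        self.W_proj = rng.normal(scale=0.01, size=(feat_dim, proj_dim))

    def forward(self, feats):
        logits = feats @ self.W_cls             # classifier head
        z = l2_normalize(feats @ self.W_proj)   # projector head (unit sphere)
        return logits, z

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def supervised_contrastive(z, labels, tau=0.1):
    """Category-consistent contrastive surrogate: for each anchor, pull
    same-label samples together and push different-label samples apart."""
    sim = z @ z.T / tau
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    loss = 0.0
    for i in range(n):
        pos = (labels == labels[i]) & ~self_mask[i]
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sim[i][~self_mask[i]]).sum())
        loss += -(sim[i][pos] - log_denom).mean()
    return loss / n

# Toy usage on random "backbone" features.
feats = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
heads = BiTuningHeads(feat_dim=16, num_classes=3, proj_dim=8)
logits, z = heads.forward(feats)
total_loss = cross_entropy(logits, labels) + supervised_contrastive(z, labels)
```

The total loss combines the two heads' objectives; in actual fine-tuning both heads and the backbone would be trained jointly by gradient descent.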

1. INTRODUCTION

In the last decade, remarkable advances in deep learning have been witnessed in diverse applications across many fields, such as computer vision, robotic control, and natural language processing, in the presence of large-scale labeled datasets. However, in many practical scenarios, we only have access to a small labeled dataset, making it impossible to train deep neural networks from scratch. Therefore, it has become increasingly common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Fine-tuning requires fewer labeled data, enables faster training, and usually achieves better performance than training from scratch (He et al., 2019). This two-stage paradigm of pre-training and fine-tuning lays the foundation of various transfer learning applications.

In the pre-training stage, there are mainly two approaches to pre-training a deep model: supervised pre-training and unsupervised pre-training. Recent years have witnessed the success of numerous supervised pre-trained models, e.g. ResNet (He et al., 2016), which exploit the discriminative knowledge of labels on a large-scale dataset such as ImageNet (Deng et al., 2009). Meanwhile, unsupervised representation learning is reshaping the field of NLP through models pre-trained on large-scale corpora, e.g. BERT (Devlin et al., 2018) and GPT (Radford & Sutskever, 2018). In computer vision, remarkable advances in unsupervised representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020), which exploit the intrinsic structure of data by contrastive learning (Hadsell et al., 2006), are also beginning to change a field long dominated by supervised pre-trained representations. In the fine-tuning stage, transferring from supervised pre-trained models has been empirically studied in Kornblith et al. (2019).
During the past years, several sophisticated fine-tuning methods have been proposed, including L2-SP (Li et al., 2018), DELTA (Li et al., 2019), and BSS (Chen et al., 2019). These methods focus on leveraging the discriminative knowledge of labels via a cross-entropy loss and the implicit bias of pre-trained models via a regularization term. However, the intrinsic structure of data in the downstream task is generally discarded during fine-tuning. Further, we empirically observed that unsupervised pre-trained representations focus more on the intrinsic structure, while supervised

