BI-TUNING OF PRE-TRAINED REPRESENTATIONS

Abstract

It is common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Recently, both supervised and unsupervised pre-training approaches to learning representations have achieved remarkable advances, exploiting the discriminative knowledge of labels and the intrinsic structure of data, respectively. It follows natural intuition that both the discriminative knowledge and the intrinsic structure of the downstream task can be useful for fine-tuning; however, existing fine-tuning methods mainly leverage the former and discard the latter. A question arises: how to fully explore the intrinsic structure of data for boosting fine-tuning? In this paper, we propose Bi-tuning, a general learning approach to fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes the vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations: a classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, and a projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way. Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g., a 10.7% absolute rise in accuracy on CUB in the low-data regime).

1. INTRODUCTION

In the last decade, remarkable advances in deep learning have been witnessed in diverse applications across many fields, such as computer vision, robotic control, and natural language processing, in the presence of large-scale labeled datasets. However, in many practical scenarios, we may only have access to a small labeled dataset, making it impossible to train deep neural networks from scratch. Therefore, it has become increasingly common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Fine-tuning requires fewer labeled data, enables faster training, and usually achieves better performance than training from scratch (He et al., 2019). This two-stage style of pre-training and fine-tuning serves as the foundation of various transfer learning applications.

In the pre-training stage, there are mainly two approaches to pre-train a deep model: supervised pre-training and unsupervised pre-training. Recent years have witnessed the success of numerous supervised pre-trained models, e.g. ResNet (He et al., 2016), which exploit the discriminative knowledge of labels on a large-scale dataset like ImageNet (Deng et al., 2009). Meanwhile, unsupervised representation learning has recently been changing the field of NLP through models pre-trained on a large-scale corpus, e.g. BERT (Devlin et al., 2018) and GPT (Radford & Sutskever, 2018). In computer vision, remarkable advances in unsupervised representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020), which exploit the intrinsic structure of data by contrastive learning (Hadsell et al., 2006), are also beginning to change a field long dominated by supervised pre-trained representations. In the fine-tuning stage, transferring supervised pre-trained models has been empirically studied by Kornblith et al. (2019).
During the past years, several sophisticated fine-tuning methods were proposed, including L2-SP (Li et al., 2018), DELTA (Li et al., 2019) and BSS (Chen et al., 2019). These methods focus on leveraging the discriminative knowledge of labels through a cross-entropy loss and the implicit bias of pre-trained models through a regularization term. However, the intrinsic structure of data in the downstream task is generally discarded during fine-tuning. Further, we empirically observed that unsupervised pre-trained representations focus more on the intrinsic structure, while supervised pre-trained representations better capture the label information (Figure 3). This possibly implies that fine-tuning unsupervised pre-trained representations may be more difficult (He et al., 2020). Given the success of supervised and unsupervised pre-training approaches, it follows natural intuition that both the discriminative knowledge and the intrinsic structure of the downstream task can be useful for fine-tuning. A question arises: How to fully explore the intrinsic structure of data for boosting fine-tuning? To tackle this major challenge of deep learning, we propose Bi-tuning, a general learning approach to fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes the vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations:

• A classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, which is the dual view of the vanilla cross-entropy loss and is expected to achieve a more compact intra-class structure.

• A projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way, resulting in a more harmonious cooperation between the supervised and unsupervised fine-tuning mechanisms.
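The two-head design above can be sketched in a few lines of NumPy. This is a hypothetical, minimal sketch, not the paper's implementation: the weight matrices `W_cls`/`W_proj`, all dimensions, the temperature, and the equal loss weighting are illustrative assumptions; the classifier head is shown with the vanilla cross-entropy rather than the paper's contrastive cross-entropy, and the projector term is a generic supervised-contrastive stand-in for the categorical contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper).
feat_dim, num_classes, proj_dim, batch = 8, 3, 4, 6

# Stand-in backbone features for a mini-batch, and their labels.
features = rng.normal(size=(batch, feat_dim))
labels = rng.integers(0, num_classes, size=batch)

# Two heads on top of the shared pre-trained backbone.
W_cls = rng.normal(size=(feat_dim, num_classes))   # classifier head
W_proj = rng.normal(size=(feat_dim, proj_dim))     # projector head

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def category_contrastive(z, labels, tau=0.07):
    """Pull projections of same-class samples together and push the rest
    apart (a generic supervised-contrastive stand-in for the paper's loss)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    terms = []
    for i in range(len(labels)):
        others = [k for k in range(len(labels)) if k != i]
        m = sim[i, others].max()
        lse = m + np.log(np.exp(sim[i, others] - m).sum())  # log-sum-exp
        for j in others:
            if labels[j] == labels[i]:          # positive key: same category
                terms.append(-(sim[i, j] - lse))
    return np.mean(terms)

loss_ce = cross_entropy(features @ W_cls, labels)           # classifier head
loss_ccl = category_contrastive(features @ W_proj, labels)  # projector head
bi_tuning_loss = loss_ce + loss_ccl   # equal weights, purely illustrative
```

In a real training loop both heads would be optimized jointly with the backbone, so the label supervision and the structure-preserving contrastive signal shape the same representation.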
As a general fine-tuning approach, Bi-tuning can be applied with a variety of backbones without any additional assumptions. Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g., a 10.7% absolute rise in accuracy on CUB in the low-data regime). We justify through ablation studies the effectiveness of the proposed two-head fine-tuning architecture with its novel loss functions.

2. RELATED WORK

2.1. PRE-TRAINING

During the past years, supervised pre-trained models have achieved impressive advances by exploiting the inductive bias of label information on a large-scale dataset like ImageNet (Deng et al., 2009), such as GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017), to name a few. Meanwhile, unsupervised representation learning has recently been shining in the field of NLP through models pre-trained on a large-scale corpus, including GPT (Radford & Sutskever, 2018), BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019). Even in computer vision, recent impressive advances in unsupervised representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020), which exploit the inductive bias of data structure, are shaking the long-standing dominance of representations learned in a supervised way. Further, a wide range of handcrafted pretext tasks have been proposed for unsupervised representation learning, such as relative patch prediction (Doersch et al., 2015), solving jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), etc.

2.2. CONTRASTIVE LEARNING

Specifically, various unsupervised pretext tasks are based on some form of contrastive learning, among which the instance discrimination approach (Wu et al., 2018; He et al., 2020; Chen et al., 2020) is one of the most general. It is noteworthy that the spirit of contrastive learning actually dates back very far (Becker & Hinton, 1992; Hadsell et al., 2006; Gutmann & Hyvärinen, 2010). The key idea is to maximize the likelihood of the data distribution p(x|D) contrasted against an artificial noise distribution p_n(x), also known as noise-contrastive estimation (NCE). Later, Goodfellow et al. (2014) pointed out the relations between generative adversarial networks and noise-contrastive estimation. Meanwhile, van den Oord et al. (2018) revealed that contrastive learning is related to the mutual information between a query and the corresponding positive key, which is known as InfoNCE. Other variants of contrastive learning methods include contrastive predictive coding (CPC) (van den Oord et al., 2018) and colorization contrasting (Tian et al., 2019). Recent advances in deep contrastive learning benefit from contrasting positive keys against a very large number of negative keys. Therefore, how to efficiently generate keys becomes a fundamental problem in contrastive learning. To achieve this goal, Doersch & Zisserman (2017) explored the effectiveness of in-batch samples, Wu et al. (2018) proposed to use a memory bank storing the representations of the whole dataset, and He et al. (2020) proposed to maintain a queue of keys encoded by a momentum-updated encoder.
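To make the objective concrete, here is a minimal NumPy sketch of InfoNCE with a queue of negative keys in the spirit of the memory-bank and momentum-queue approaches above; the dimensions, temperature, and the way keys are generated are illustrative assumptions, not a reproduction of any particular system.

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(query, pos_key, neg_keys, tau=0.2):
    """InfoNCE: -log( exp(q.k+ / tau) / sum_k exp(q.k / tau) ),
    where the sum runs over the positive key and all negative keys."""
    q = query / np.linalg.norm(query)
    keys = np.vstack([pos_key, neg_keys])
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / tau
    logits -= logits.max()                      # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

dim, queue_size = 16, 128
q = rng.normal(size=dim)                        # encoded query
k_pos = q + 0.1 * rng.normal(size=dim)          # a nearby "augmented view"
queue = rng.normal(size=(queue_size, dim))      # negatives from a key queue

loss_aligned = info_nce(q, k_pos, queue)        # positive key matches query
loss_random = info_nce(q, rng.normal(size=dim), queue)  # mismatched positive
```

A well-aligned positive key yields a much lower loss than a mismatched one, which is exactly the signal contrastive pre-training optimizes; a memory bank (Wu et al., 2018) or a momentum-updated queue (He et al., 2020) only changes where `queue` comes from.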

