BI-TUNING OF PRE-TRAINED REPRESENTATIONS

Abstract

It is common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Recently, both supervised and unsupervised pre-training approaches to learning representations have achieved remarkable advances, exploiting the discriminative knowledge of labels and the intrinsic structure of data, respectively. It is natural to expect that both the discriminative knowledge and the intrinsic structure of the downstream task can be useful for fine-tuning; however, existing fine-tuning methods mainly leverage the former and discard the latter. A question arises: how can we fully explore the intrinsic structure of data to boost fine-tuning? In this paper, we propose Bi-tuning, a general learning approach to fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations: a classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, and a projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way. Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g. a 10.7% absolute rise in accuracy on CUB in the low-data regime).

1. INTRODUCTION

In the last decade, remarkable advances in deep learning have been witnessed in diverse applications across many fields, such as computer vision, robotic control, and natural language processing, in the presence of large-scale labeled datasets. However, in many practical scenarios, we may only have access to a small labeled dataset, making it impossible to train deep neural networks from scratch. Therefore, it has become increasingly common within the deep learning community to first pre-train a deep neural network on a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Fine-tuning requires fewer labeled data, enables faster training, and usually achieves better performance than training from scratch (He et al., 2019). This two-stage style of pre-training and fine-tuning lays the foundation of various transfer learning applications. In the pre-training stage, there are mainly two approaches: supervised pre-training and unsupervised pre-training. Recent years have witnessed the success of numerous supervised pre-trained models, e.g. ResNet (He et al., 2016), which exploit the discriminative knowledge of labels on a large-scale dataset like ImageNet (Deng et al., 2009). Meanwhile, unsupervised representation learning has recently been changing the field of NLP through models pre-trained on large-scale corpora, e.g. BERT (Devlin et al., 2018) and GPT (Radford & Sutskever, 2018). In computer vision, remarkable advances in unsupervised representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020), which exploit the intrinsic structure of data by contrastive learning (Hadsell et al., 2006), are also beginning to change a field long dominated by supervised pre-trained representations. In the fine-tuning stage, transferring a model from supervised pre-trained models has been empirically studied by Kornblith et al. (2019).
During the past years, several sophisticated fine-tuning methods were proposed, including L2-SP (Li et al., 2018), DELTA (Li et al., 2019), and BSS (Chen et al., 2019). These methods focus on leveraging the discriminative knowledge of labels via a cross-entropy loss and the implicit bias of pre-trained models via a regularization term. However, the intrinsic structure of data in the downstream task is generally discarded during fine-tuning. Further, we empirically observe that unsupervised pre-trained representations focus more on the intrinsic structure, while supervised pre-trained representations better capture the label information (Figure 3). This possibly implies that fine-tuning unsupervised pre-trained representations may be more difficult (He et al., 2020). Given the success of both supervised and unsupervised pre-training approaches, it is natural to expect that both the discriminative knowledge and the intrinsic structure of the downstream task can be useful for fine-tuning. A question arises: how can we fully explore the intrinsic structure of data to boost fine-tuning? To tackle this major challenge of deep learning, we propose Bi-tuning, a general learning approach to fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations:

• A classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, which is the dual view of the vanilla cross-entropy loss and is expected to achieve a more compact intra-class structure.

• A projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way, resulting in a more harmonious cooperation between the supervised and unsupervised fine-tuning mechanisms.
As a general fine-tuning approach, Bi-tuning can be applied with a variety of backbones without any additional assumptions. Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g. 10.7% absolute rise in accuracy on CUB in low-data regime). We justify through ablation studies the effectiveness of the proposed two-heads fine-tuning architecture with their novel loss functions.

2. RELATED WORK

2.1. PRE-TRAINING

During the past years, supervised pre-trained models have achieved impressive advances by exploiting the inductive bias of label information on a large-scale dataset like ImageNet (Deng et al., 2009); examples include GoogleNet (Szegedy et al., 2015), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017), to name a few. Meanwhile, unsupervised representation learning has recently been shining in the field of NLP through models pre-trained on large-scale corpora, including GPT (Radford & Sutskever, 2018), BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019). Even in computer vision, recent impressive advances in unsupervised representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020), which exploit the inductive bias of data structure, are challenging the long-standing dominance of representations learned in a supervised way. Further, a wide range of handcrafted pretext tasks have been proposed for unsupervised representation learning, such as relative patch prediction (Doersch et al., 2015), solving jigsaw puzzles (Noroozi & Favaro, 2016), and colorization (Zhang et al., 2016).

2.2. CONTRASTIVE LEARNING

Specifically, various unsupervised pretext tasks are based on some form of contrastive learning, among which the instance discrimination approach (Wu et al., 2018; He et al., 2020; Chen et al., 2020) is one of the most general. It is noteworthy that the spirit of contrastive learning dates back much further (Becker & Hinton, 1992; Hadsell et al., 2006; Gutmann & Hyvärinen, 2010). The key idea is to maximize the likelihood of the data distribution p(x|D) in contrast to an artificial noise distribution p_n(x), also known as noise-contrastive estimation (NCE). Later, Goodfellow et al. (2014) pointed out the relations between generative adversarial networks and noise-contrastive estimation. Meanwhile, van den Oord et al. (2018) revealed that contrastive learning is related to the mutual information between a query and the corresponding positive key, which is known as InfoNCE. Other variants of contrastive learning methods include contrastive predictive coding (CPC) (van den Oord et al., 2018) and colorization contrasting (Tian et al., 2019). Recent advances of deep contrastive learning benefit from contrasting positive keys against a very large number of negative keys. Therefore, how to efficiently generate keys becomes a fundamental problem in contrastive learning. To this end, Doersch & Zisserman (2017) explored the effectiveness of in-batch samples, Wu et al. (2018) proposed a memory bank to store the representations of the whole dataset, He et al. (2020) replaced the memory bank with momentum contrast (MoCo) to be memory-efficient, and Chen et al. (2020) showed that a brute-force huge batch of keys works well.

2.3. FINE-TUNING

Fine-tuning a model from supervised pre-trained models has been empirically explored by Kornblith et al. (2019) through a systematic investigation with grid search over the hyper-parameters. During the past years, a few fine-tuning methods have been proposed to exploit the inductive bias of pre-trained models: L2-SP (Li et al., 2018) drives the weight parameters of the target task toward the pre-trained values by imposing an L2 constraint, based on the inductive bias of parameters; DELTA (Li et al., 2019) computes channel-wise discriminative knowledge to reweight the feature-map regularization with an attention mechanism, based on the inductive bias of behavior; BSS (Chen et al., 2019) penalizes smaller singular values to suppress untransferable spectral components. Other fine-tuning methods, including learning with similarity preserving (Kang et al., 2019) and learning without forgetting (Li & Hoiem, 2017), also work well on some downstream classification tasks. However, the existing fine-tuning methods mainly focus on leveraging the knowledge of the target labels with a cross-entropy loss. Intuitively, encouraging a model to capture the label information and the intrinsic structure simultaneously may help it bridge upstream unsupervised models and downstream classification tasks. In natural language processing, GPT (Radford & Sutskever, 2018; Radford et al., 2019) has employed a strategy that jointly optimizes an unsupervised training criterion while fine-tuning with supervision. However, we empirically found that trivially following this forced combination of a supervised learning loss and an unsupervised contrastive learning loss is beneficial but limited. A plausible reason is that the two losses contradict each other and result in a feature structure that is extremely different from, but not as discriminative as, that of the supervised cross-entropy loss (see Figure 3).

3. CONTRASTIVE LEARNING WITH MULTIPLE POSITIVE KEYS

The instance discrimination approach (van den Oord et al., 2018; Wu et al., 2018), a.k.a. InfoNCE, is one of the most general forms of standard contrastive learning. Given a query q with a large key pool {k_0, k_1, k_2, ..., k_K}, where K is the number of keys, this non-parametric contrastive loss is

\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},  (1)

where τ is the hyper-parameter for temperature scaling. Intuitively, contrastive learning can be cast as a query-key pair matching problem, in which the contrastive loss is a K-way cross-entropy loss that distinguishes the positive key k_+ from the large key pool. From this perspective, the contrastive loss maximizes the similarity between the query and the corresponding positive key k_+. When fine-tuning a pre-trained model on a labeled downstream dataset, an intuitive approach is to combine InfoNCE and cross-entropy as \mathcal{L} = \mathcal{L}_{\mathrm{InfoNCE}} + \mathcal{L}_{\mathrm{CE}}. However, InfoNCE tends to generate a feature structure that is extremely different from, but not as discriminative as, that of the supervised cross-entropy loss, making the classifier struggle. To this end, we propose the idea of contrastive learning with multiple positive keys to tailor contrast into the cross-entropy loss in both the projector head and the classifier head. Before describing the losses defined on these two heads in detail, let us briefly introduce contrastive learning with multiple positive keys.

Definition 1 (Contrastive Learning with Multiple Positive Keys). In the context of tailoring contrast into fine-tuning a pre-trained model on a labeled downstream task, this mechanism expands the scope of positive keys to a set of instances instead of a single one. Formally, a query q is given with a large key pool {k_0, k_1, k_2, ..., k_K}, where K is the number of keys and k_0 is the positive key k_+.
Suppose the number of instances per class in the key pool is equal, i.e., K = k · C holds, where C is the size of the label space and k is the number of instances per class. The straightforward way to expand the standard contrastive loss into a form with multiple positives (denoted by a positive key set K_p) is

\mathcal{L}'_{\mathrm{InfoNCE}} = -\frac{1}{|K_p|} \sum_{k_+ \in K_p} \log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{k_- \in K_n} \exp(q \cdot k_- / \tau)},  (2)

where K_n denotes the negative key set. This loss essentially performs multiple individual contrasts with different positive keys for each query q. However, this form of multi-positive-key contrast can be seen as a simple generalization of the standard contrastive loss that repeats the positive keys several times, without fully exploiting the intra-class structure of the dataset. To this end, the losses we propose (L_CCE and L_CCL, detailed in Sections 4.3 and 4.4 respectively) are based on the following formula:

\mathcal{L}_{\mathrm{proposed}} = -\frac{1}{|K_p|} \sum_{k_+ \in K_p} \log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{k'_+ \in K_p} \exp(q \cdot k'_+ / \tau) + \sum_{k_- \in K_n} \exp(q \cdot k_- / \tau)}.  (3)

Different from L'_InfoNCE, the denominator of L_proposed contains both the positive keys in the same class as the query and the negative keys from other classes. In each contrast with multiple positive keys, the query thus needs to balance all positive keys simultaneously. From another view, L_proposed can be regarded as performing cross-entropy on soft labels with uniform probability 1/|K_p| for each positive key: it introduces an uninformative prior that the positive keys are uniformly distributed around the query. Table 6 compares the accuracies of Bi-tuning implemented with both forms of multiple positives and reveals that the proposed form is the better choice.
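The distinction between Eq. (2) and Eq. (3) can be sketched in plain Python. This is a minimal illustration with made-up similarity values, not the paper's implementation; the function names are ours:

```python
import math


def info_nce_multi(sims_pos, sims_neg, tau=0.07):
    """Naive extension L'_InfoNCE (Eq. 2): one independent contrast per
    positive key; only that key's score joins the negatives in the
    denominator."""
    loss = 0.0
    for sp in sims_pos:
        denom = math.exp(sp / tau) + sum(math.exp(sn / tau) for sn in sims_neg)
        loss += -math.log(math.exp(sp / tau) / denom)
    return loss / len(sims_pos)


def proposed_multi(sims_pos, sims_neg, tau=0.07):
    """Proposed form (Eq. 3): all positive keys share one denominator,
    so the query balances every positive simultaneously (the uniform
    soft-label view described above)."""
    denom = (sum(math.exp(sp / tau) for sp in sims_pos)
             + sum(math.exp(sn / tau) for sn in sims_neg))
    return sum(-math.log(math.exp(sp / tau) / denom) for sp in sims_pos) / len(sims_pos)
```

With a single positive key the two forms coincide with standard InfoNCE; with several positives the shared denominator of the proposed form is strictly larger, which is exactly the effect of the uniform prior over positive keys.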

4. BI-TUNING

4.1. PRE-TRAINED REPRESENTATIONS

Bi-tuning is a general learning approach to fine-tuning both supervised and unsupervised representations. Without any additional assumptions, the pre-trained feature encoder f (•) can be various network backbones according to the downstream tasks, including ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) for supervised pre-trained models, and MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) for unsupervised pre-trained models. Given a query sample x q i , we can first utilize a pre-trained feature encoder f (•) to extract its pre-trained representation as h q i = f (x q i ).

4.2. VANILLA FINE-TUNING

Given a pre-trained representation h^q_i, the fundamental step of vanilla fine-tuning is to feed h^q_i forward into a C-way classifier g(·), where C is the number of categories of the downstream classification task. Denote the parameters of the classifier g(·) by W = [w_1, w_2, ..., w_C], where w_j corresponds to the parameter for the j-th class, and denote the training dataset of the downstream task by {(x^q_i, y^q_i)}_{i=1}^{N}. W can be updated by optimizing the standard cross-entropy (CE) loss

\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{N} \log \frac{\exp(w_{y_i} \cdot h^q_i)}{\sum_{j=1}^{C} \exp(w_j \cdot h^q_i)}.  (4)
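For a single sample, Eq. (4) reduces to a negative log-softmax over the class logits w_j · h. A minimal sketch (toy weights and dimensions are ours, for illustration only):

```python
import math


def cross_entropy(W, h, y):
    """Vanilla fine-tuning loss L_CE for one sample: negative
    log-softmax of the true class y, where each logit is the inner
    product w_j . h of a class weight vector with the representation."""
    logits = [sum(w_d * h_d for w_d, h_d in zip(w_j, h)) for w_j in W]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[y]
```

The loss is smaller when the true class weight w_y is best aligned with h, which is what drives the classifier head during vanilla fine-tuning.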

4.3. CONTRASTIVE CROSS-ENTROPY LOSS ON CLASSIFIER HEAD

From another perspective, the cross-entropy loss of vanilla fine-tuning on a dataset of N instance-label pairs (x^q_i, y^q_i) can be regarded as a class-wise championship: for each instance, the prediction matching its ground-truth class is expected to win. To further exploit the label information of the downstream task, we propose a novel contrastive cross-entropy loss L_CCE on the classifier head via the dual view of the cross-entropy loss. L_CCE can be seen as an instance-wise championship: for each class, the instance nearest to the class prototype is expected to win. Similar to the CE loss, L_CCE can be formulated as

\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{|K_p|} \sum_{h_+ \in K_p} \log \frac{\exp(w_y \cdot h_+ / \tau)}{\sum_{h'_+ \in K_p} \exp(w_y \cdot h'_+ / \tau) + \sum_{h_- \in K_n} \exp(w_y \cdot h_- / \tau)},  (5)

where K_p is the positive key set (the current query example h^q together with keys sharing the same label y^q), and K_n is the negative key set (examples from other classes). Note that the h's (except h^q) are sampled from the hidden key pool produced by the key generating mechanism. Although Bi-tuning is agnostic to this mechanism, we adopt the key generating approach of Momentum Contrast (MoCo) (He et al., 2020) as our default due to its simplicity, high efficacy, and memory-efficient implementation.

Figure 1: The architecture of the proposed Bi-tuning approach, which includes an encoder for pre-trained representations, a classifier head, and a projector head. Bi-tuning enables a dual fine-tuning mechanism: a contrastive cross-entropy loss (CCE) on the classifier head to exploit label information and a categorical contrastive learning loss (CCL) on the projector head to model the intrinsic structure.

As intuitively illustrated in Figure 1, the column-wise L_CCE computes its loss along the key-pool dimension of size K + 1, while the row-wise L_CE operates along the class dimension of size C. By encouraging instances in the training dataset to move toward their corresponding class prototypes, L_CCE tends to achieve a more compact intra-class structure than vanilla fine-tuning.
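Under the notation above, Eq. (5) for one class prototype can be sketched as follows. The toy vectors are illustrative only; in the paper the keys would come from the MoCo queue:

```python
import math


def cce_loss(w_y, pos_keys, neg_keys, tau=0.07):
    """Contrastive cross-entropy L_CCE on the classifier head: the class
    prototype w_y plays the query role and contrasts same-class keys
    (including the current query h^q) against keys from other classes,
    along the key-pool dimension."""
    sim = lambda h: math.exp(sum(a * b for a, b in zip(w_y, h)) / tau)
    denom = sum(sim(h) for h in pos_keys) + sum(sim(h) for h in neg_keys)
    return -sum(math.log(sim(h) / denom) for h in pos_keys) / len(pos_keys)
```

Pulling same-class keys toward w_y while pushing other-class keys away is what yields the more compact intra-class structure described above.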

4.4. CATEGORICAL CONTRASTIVE LEARNING LOSS ON PROJECTOR HEAD

Previously, we proposed an improved version of vanilla fine-tuning on the classifier head to fully exploit label information. However, this loss design may still fall short in capturing the intrinsic structure. Inspired by the remarkable success of unsupervised pre-training, which also aims at modeling the intrinsic structure of data, we first introduce a projector φ(·), often available off the shelf, to embed a pre-trained representation h^q_i into a latent metric space as z^q_i. However, the standard contrastive learning loss (InfoNCE) defined in Eq. (1) assumes that a single key k_+ in the dictionary matches the given query q, which implicitly requires every instance to belong to an individual class. If we simply apply the InfoNCE loss to the labeled downstream dataset, it results in a feature structure that is extremely different from, but not as discriminative as, that of the supervised cross-entropy loss, making the classifier struggle. This dilemma reveals that the naive combination of the supervised cross-entropy loss and the unsupervised contrastive loss is not an optimal solution for fine-tuning, which is also backed by our experiments in Table 3. To capture the label information and the intrinsic structure simultaneously, we propose a novel categorical contrastive loss L_CCL on the projector head, based on the following hypothesis: when fine-tuning a pre-trained model to a downstream task, it is reasonable to regard other keys in the same class as positive keys that the query matches. In this way, L_CCL expands the scope of positive keys to a set of instances instead of a single one, resulting in a more harmonious cooperation between the supervised and unsupervised learning mechanisms. Similar in format to the InfoNCE loss, L_CCL is defined as

\mathcal{L}_{\mathrm{CCL}} = -\frac{1}{|K_p|} \sum_{z_+ \in K_p} \log \frac{\exp(z^q \cdot z_+ / \tau)}{\sum_{z'_+ \in K_p} \exp(z^q \cdot z'_+ / \tau) + \sum_{z_- \in K_n} \exp(z^q \cdot z_- / \tau)},  (6)

where the notations, including the positive key set, are identical to those in Eq. (5). Note that the outer sum runs over all positive keys, so there may be more than one positive key for a single query, i.e., |K_p| ≥ 1 holds.

4.5. BI-TUNING

Finally, we reach a novel approach to fine-tuning both supervised and unsupervised representations, i.e., Bi-tuning, which jointly optimizes the standard cross-entropy loss, the contrastive cross-entropy loss on the classifier head, and the categorical contrastive learning loss on the projector head in an end-to-end deep architecture. Note that Bi-tuning refers to the proposed two heads (a.k.a. bi-heads) with two novel losses. The overall loss function of Bi-tuning can be formulated as

\min_{\Theta} \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{CCE}} + \mathcal{L}_{\mathrm{CCL}},  (7)

where Θ denotes the set of all parameters of the backbone, the classifier head, and the projector head. Since the magnitudes of the above loss terms are comparable, we empirically find no need to introduce extra hyper-parameters to trade them off. This simplicity makes Bi-tuning easy to apply to different datasets and tasks. The full portrait of Bi-tuning is shown in Figure 1.
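Putting the three terms together, Eq. (7) for a single query can be sketched in a self-contained way. All helper names and toy inputs below are ours, not the paper's; this is an illustration under the stated notation, not the official implementation:

```python
import math


def _nll(scores, idx):
    """Negative log-softmax of entry idx of a list of scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[idx]


def _multi_pos(query, pos, neg, tau):
    """Shared multi-positive contrast (Eq. 3) used by both heads."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(query, k) / tau for k in pos + neg]
    return sum(_nll(scores, i) for i in range(len(pos))) / len(pos)


def bi_tuning_loss(W, h_q, y, h_pos, h_neg, z_q, z_pos, z_neg, tau=0.07):
    """L_CE + L_CCE + L_CCL with unit trade-off weights, as in Eq. (7).
    CCE uses the class prototype W[y] as query over hidden keys h
    (the query h_q itself counts as a positive key); CCL uses the
    projected query z_q over projected keys z."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    l_ce = _nll([dot(w, h_q) for w in W], y)
    l_cce = _multi_pos(W[y], [h_q] + h_pos, h_neg, tau)
    l_ccl = _multi_pos(z_q, z_pos, z_neg, tau)
    return l_ce + l_cce + l_ccl
```

The sum of the three terms with unit weights mirrors the paper's finding that no extra trade-off hyper-parameters are needed when the loss magnitudes are comparable.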

5. EXPERIMENTS

We follow the common fine-tuning principle described in Yosinski et al. (2014), replacing the last task-specific layer in the classifier head with a randomly initialized fully connected layer whose learning rate is 10 times that of the pre-trained parameters. Meanwhile, the projector head is another randomly initialized fully connected layer. For the key generating mechanism, we follow the style of He et al. (2020), employing a momentum contrast branch with a default momentum coefficient m = 0.999 and two cached queues, both normalized by their L2-norm (Wu et al., 2018), with dimensions of 2048 and 128 respectively. For each task, the best learning rate is selected by cross-validation under the 100% sampling rate and applied to all four sampling rates. The queue size K is set to 8, 16, 24, or 32 per category according to the dataset scale. Other hyper-parameters in Bi-tuning are fixed for all experiments. The temperature τ in Eq. (5) and Eq. (6) is set to 0.07 (Wu et al., 2018). The trade-off coefficients among the three losses are kept at 1 since the magnitudes of the loss terms are comparable. All tasks are optimized using SGD with momentum 0.9. All results in this section are averaged over 5 trials, and standard deviations are provided.

5.1. BI-TUNING SUPERVISED PRE-TRAINED REPRESENTATIONS

Standard benchmarks. We first verify our approach on three fine-grained classification benchmarks: CUB, Car, and Aircraft. We create four configurations which randomly sample 25%, 50%, 75%, and 100% of the training data for each class, to reveal the detailed effect of fine-tuning at different data scales. We choose recent fine-tuning technologies, L2-SP (Li et al., 2018), DELTA (Li et al., 2019), and the state-of-the-art method BSS (Chen et al., 2019), as competitors of Bi-tuning, regarding vanilla fine-tuning as the baseline. Note that vanilla fine-tuning is a strong baseline when sufficient data is provided. Results are averaged over 5 trials. As shown in Table 1, Bi-tuning significantly outperforms all competitors across all three benchmarks by large margins (e.g. a 10.7% absolute rise on CUB at a sampling rate of 25%). Note that even at the 100% sampling rate, Bi-tuning still outperforms the others.

Large-scale benchmarks. Previous fine-tuning methods mainly focus on improving performance in low-data regimes. We further extend Bi-tuning to large-scale paradigms. We use the annotations of the COCO dataset (Lin et al., 2014) to construct a large-scale classification dataset, cropping each object with padding and removing minimal items (with height or width less than 50 pixels), resulting in a large-scale dataset containing 70 classes with more than 1000 images per category. The scale is comparable to ImageNet in terms of the number of samples per class. On this constructed large-scale dataset, named COCO-70, Bi-tuning is also evaluated under the four sampling-rate configurations. Since even the 25% sampling rate of COCO-70 provides much more data than each benchmark in Section 5.1, previous fine-tuning competitors contribute only marginally in these paradigms. Results in Table 2 reveal that Bi-tuning brings general gains for all tasks, beyond the low-data regime. We hypothesize that the intrinsic structure introduced by Bi-tuning contributes substantially.

5.2. BI-TUNING UNSUPERVISED PRE-TRAINED REPRESENTATIONS

Bi-tuning representations of MoCo (He et al., 2020). In this round, we use ResNet-50 pre-trained unsupervisedly via MoCo on ImageNet as the backbone. Since unsupervised pre-trained representations suffer from a large discrepancy with downstream classification tasks, as demonstrated in Figure 3, previous fine-tuning competitors usually perform very poorly. Hence we only compare Bi-tuning to the state-of-the-art method BSS (Chen et al., 2019) and vanilla fine-tuning as baselines. Besides, we add two intuitively related baselines: (1) GPT*, which follows the GPT (Radford & Sutskever, 2018; Radford et al., 2019) fine-tuning style but replaces its predictive loss with the contrastive loss; (2) Center loss, which introduces compactness of intra-class variations (Wen et al., 2016) and is effective in recognition tasks. As reported in Table 3, trivially borrowing the fine-tuning strategy of GPT (Radford & Sutskever, 2018) or the center loss brings tiny benefits, and is even harmful on some datasets, e.g. CUB. Bi-tuning yields consistent gains on all fine-tuning tasks of unsupervised representations, indicating that Bi-tuning benefits substantially from exploring the intrinsic structure.

Bi-tuning other unsupervised pre-trained representations. To justify Bi-tuning's general efficacy, we extend our method to unsupervised representations from other pre-training methods. Bi-tuning is applied to MoCo (version 2) (He et al., 2020), SimCLR (Chen et al., 2020), InsDisc (Wu et al., 2018), Deep Cluster (Caron et al., 2018), and CMC (Tian et al., 2019) on the Car dataset with 100% training data. Table 4 is a strong signal that Bi-tuning is not bound to specific pre-training pretext tasks.

Table 4: Top-1 Accuracy on Car dataset with different unsupervisedly pre-trained representations.

Pre-training Method | Fine-tuning (100% data) | Bi-tuning (100% data)
InsDisc (Wu et al., 2018) | 86.59±0.22 | 89.54±0.25
CMC (Tian et al., 2019) | 86.71±0.62 | 88.35±0.44
MoCov2 (He et al., 2020) | 90.15±0.48 | 90.79±0.34
SimCLR(1×) (Chen et al., 2020) | 89.30±0.18 | 90.84±0.22
SimCLR(2×) (Chen et al., 2020) | 91.22±0.19 | 91.93±0.19

Analysis on components of contrastive learning.
Recent advances in contrastive learning, i.e., momentum contrast (He et al., 2020) and the memory bank (Wu et al., 2018), can be plugged into Bi-tuning smoothly with similar performance; detailed discussions are deferred to the Appendix. Previous works (He et al., 2020; Chen et al., 2020) reveal that a large number of contrasts is crucial to contrastive learning. In Figure 2(a), we report the sensitivity to the number of sampled keys in Bi-tuning (MoCo) under the 25% and 100% sampling-rate configurations. Figure 2(a) shows that although a larger key pool is beneficial, we cannot expand the key pool indefinitely due to the limited training data, which may lose sampling stochasticity during training. This result suggests a trade-off between stochasticity and a large number of keys. Chen et al. (2020) pointed out that the dimension of the projector also has a big impact. The sensitivity to the dimension of the projector head is presented in Figure 2(b). Note that an unsupervised pre-trained model (e.g., MoCo) may provide an off-the-shelf projector; fine-tuning it or re-initializing it performs almost the same (90.88 vs. 90.78 on Car when L is 128).

Interpretable visualization of learned representations. Following the visualization method of Fong & Vedaldi (2017), Figure 3 shows that 3(a) is the original image, while 3(b), 3(c), and 3(d) are respectively obtained from a randomly initialized model, a supervised pre-trained model on ImageNet, and an unsupervised pre-trained model via MoCov1 (He et al., 2020). We infer that supervised pre-training obtains representations focusing on the discriminative parts and ignoring the background. In contrast, unsupervised pre-training pays uninformative attention to every location of an input. Bi-tuning in 3(e) captures both local details and global category structures.

5.3. COLLABORATIVE EFFECT OF LOSS FUNCTIONS

Using either the contrastive cross-entropy loss (CCE) or the categorical contrastive loss (CCL) together with the vanilla cross-entropy loss (CE) already achieves relatively good results, as shown in Table 5. These experiments empirically confirm a collaborative effect between the CCE and CCL losses. It is worth mentioning that CCE and CCL can work independently of CE (see the fourth row in Table 5), while optimizing all three losses simultaneously yields the best result. As discussed in prior sections, we hypothesize that Bi-tuning helps fine-tuned models characterize the intrinsic structure of the training data when CCE and CCL are used simultaneously.

6. CONCLUSION

In this paper, we propose the Bi-tuning approach to fine-tuning both supervised and unsupervised representations. Bi-tuning generalizes standard fine-tuning with an encoder for pre-trained representations, a classifier head, and a projector head to explore both the discriminative knowledge of labels and the intrinsic structure of data, trained end-to-end with two novel loss functions. Bi-tuning yields state-of-the-art results for fine-tuning tasks on both supervised and unsupervised pre-trained models by large margins. Code will be released upon publication at http://github.com.

A VISUALIZATION BY T-SNE

We train the t-SNE (Maaten & Hinton, 2008) visualization model on the MoCo representations fine-tuned on the Pets dataset (Parkhi et al., 2012). Visualization of the validation set is shown in Figure 4. Note that the representations in Figure 4(a) do not present good classification structures. Figure 4(c) suggests that forcefully combining the unsupervised contrastive learning loss, as in GPT (Radford et al., 2019), may conflict with CE and clutter the classification boundaries. Figure 4(d) suggests that Bi-tuning encourages the fine-tuned model to learn better intrinsic structure besides the label information. Therefore, Bi-tuning presents the best classification boundaries as well as intrinsic structures.

B KEY GENERATING MECHANISMS

Formally, denote the momentum-updated encoder as f_k with parameters θ_k, and the backbone encoder as f_q with parameters θ_q. θ_k is updated by

θ_k ← m θ_k + (1 − m) θ_q,  (8)

where we set the momentum coefficient m = 0.999. To fit the Bi-tuning approach, we reorganize the queues in MoCo to hold the items of each category separately. Moreover, the two contrastive mechanisms in Bi-tuning are performed on the instance level and the category level respectively, so we maintain two groups of queues correspondingly.

When the key generating mechanism is implemented with a memory bank instead of the momentum encoder in Eq. (8), the stored snapshots are updated by

z^k_i ← m z^k_i + (1 − m) z^q_i,  h^k_i ← m h^k_i + (1 − m) h^q_i,  (9)

where the notations follow Section 3 and we set the momentum coefficient m = 0.5 (Wu et al., 2018). Other hyper-parameters are the same as in Section 5. We evaluate Bi-tuning with a memory bank on CUB (Welinder et al., 2010) with the same configurations as in Section 4. The results in Table 7 show that the performance of the two variants is close: the key generating mechanism in Bi-tuning has only a limited effect on the final performance in the supervised paradigm. This suggests that the key generating mechanism can be implemented by several variants with similar performance. MoCo is recommended for its scalability and simplicity.
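Both updates above follow the same exponential moving average, differing only in the momentum coefficient. A minimal sketch, with parameter vectors flattened to lists purely for illustration:

```python
def momentum_update(old, new, m):
    """EMA update used both for the key encoder parameters
    (theta_k, with m = 0.999) and for memory-bank snapshots
    (z_i and h_i, with m = 0.5): old <- m * old + (1 - m) * new,
    applied element-wise."""
    return [m * o + (1 - m) * n for o, n in zip(old, new)]
```

A large m (0.999) makes the key encoder drift slowly behind the backbone, keeping queued keys consistent, while the smaller m (0.5) lets memory-bank snapshots track the current features more quickly.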

C ADDITIONAL EXPERIMENTAL RESULTS

C.1 BI-TUNING SUPERVISED PRE-TRAINED REPRESENTATIONS ON MORE BENCHMARKS

In addition to Table 1 in the main paper, we further conduct experiments on Stanford Dogs (Khosla et al., 2011), Oxford-IIIT Pets (Parkhi et al., 2012), Oxford 102 Flowers (Nilsback & Zisserman, 2008), and NABirds (Van Horn et al., 2015). Results are shown in Table 8.

C.2 BI-TUNING UNSUPERVISED PRE-TRAINED REPRESENTATIONS ON MORE BENCHMARKS

Similarly, we also conduct experiments with ResNet-50 unsupervisedly pre-trained by MoCo on Stanford Dogs (Khosla et al., 2011), Oxford-IIIT Pets (Parkhi et al., 2012), Oxford 102 Flowers (Nilsback & Zisserman, 2008), and NABirds (Van Horn et al., 2015). Results are shown in Table 9 as a supplementary to Table 3, reporting results on more datasets. Further, to clarify that Bi-tuning does not overfit to particular pre-training methods, Table 10 and Table 11 provide more results with different unsupervisedly pre-trained representations on CUB and Aircraft respectively, revealing that Bi-tuning consistently outperforms vanilla fine-tuning.

Table 10: Top-1 Accuracy on CUB dataset with different unsupervisedly pre-trained representations.

Pre-training Method | Fine-tuning (100% data) | Bi-tuning (100% data)
Deep Cluster (Caron et al., 2018) | 74.63 | 78.24
InsDisc (Wu et al., 2018) | 71.35 | 74.92
CMC (Tian et al., 2019) | 68.71 | 77.21
MoCov2 (He et al., 2020) | 75.75 | 77.24
SimCLR(1×) (Chen et al., 2020) | 72.21 | 77.44
SimCLR(2×) (Chen et al., 2020) | 74.35 | 78.43

Table 11: Top-1 Accuracy on Aircraft dataset with different unsupervisedly pre-trained representations.

Pre-training Method | Fine-tuning (100% data) | Bi-tuning (100% data)
Deep Cluster (Caron et al., 2018) | 77.68 | 81.70
InsDisc (Wu et al., 2018) | 84.16 | 86.08
CMC (Tian et al., 2019) | 83.98 | 85.84
MoCov2 (He et al., 2020) | 88.04 | 89.85
SimCLR(1×) (Chen et al., 2020) | 86.52 | 89.10
SimCLR(2×) (Chen et al., 2020) | 88.39 | 89.98



Figure 2: Sensitivity analysis of hyper-parameters K and L for Bi-tuning.


Figure 3: Interpretable visualization of learned representations via various training methods.

Figure 4: t-SNE (Maaten & Hinton, 2008) visualization of baselines on Pets (Parkhi et al., 2012).

Table 1: Top-1 accuracy on various datasets using ResNet-50 by supervised pre-training.

Table 2: Top-1 accuracy on COCO-70 dataset using DenseNet-121 by supervised pre-training.

Table 3: Top-1 accuracy on various datasets using ResNet-50 unsupervisedly pre-trained by MoCo.


Table 5: Collaborative effect in Bi-tuning on CUB-200-2011 using ResNet-50 pre-trained by MoCo.

Table 6: Comparison of different multi-positive contrastive losses on CUB with 100% data.

Table 7: Top-1 accuracy (%) of Bi-tuning on CUB with a memory bank as the key generating mechanism (backbone: ResNet-50 pre-trained via MoCo).

Table 8: Top-1 accuracy on more benchmarks using ResNet-50 by supervised pre-training.

Table 9: Top-1 accuracy on more benchmarks using ResNet-50 unsupervisedly pre-trained by MoCo.


