SEED: SELF-SUPERVISED DISTILLATION FOR VISUAL REPRESENTATION

Abstract

This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical finding that, while widely used contrastive self-supervised learning methods have shown great progress on large model training, they do not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), in which we leverage a larger network (as the Teacher) to transfer its representational knowledge into a smaller architecture (as the Student) in a self-supervised fashion. Instead of directly learning from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that SEED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-V3-Large on the ImageNet-1k dataset.

1. INTRODUCTION

The burgeoning studies and success of self-supervised learning (SSL) for visual representation are mainly marked by its extraordinary potency for learning from unlabeled data at scale. Accompanying SSL is its phenomenal benefit of obtaining task-agnostic representations while allowing the training to dispense with prohibitively expensive data labeling. Major ramifications of visual SSL include pretext tasks (Noroozi & Favaro, 2016; Zhang et al., 2016; Gidaris et al., 2018; Zhang et al., 2019; Feng et al., 2019), contrastive representation learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020a), online/offline clustering (Yang et al., 2016; Caron et al., 2018; Li et al., 2020; Caron et al., 2020; Grill et al., 2020), etc. Among them, several recent works (He et al., 2020; Chen et al., 2020a; Caron et al., 2020) have achieved comparable or even better accuracy than supervised pre-training when transferring to downstream tasks, e.g., semi-supervised classification and object detection.

The aforementioned top-performing SSL algorithms all involve large networks (e.g., ResNet-50 (He et al., 2016) or larger), with little attention paid to small networks. Empirically, we find that existing techniques like contrastive learning do not work well on small networks. For instance, the linear probe top-1 accuracy on ImageNet using MoCo-V2 (Chen et al., 2020c) is only 36.3% with MobileNet-V3-Large (see Figure 1), which is much lower than its supervised training accuracy of 75.2% (Howard et al., 2019).
For EfficientNet-B0, the accuracy is 42.2%, compared with its supervised training accuracy of 77.1% (Tan & Le, 2019). We conjecture that this is because smaller models with fewer parameters cannot effectively learn instance-level discriminative representations from a large amount of data.

To address this challenge, we inject knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) into self-supervised learning and propose self-supervised distillation (dubbed SEED) as a new learning paradigm: train the larger model, then distill to the smaller one, both in a self-supervised manner. Instead of directly conducting self-supervised training on a smaller model, SEED first trains a large model (as the teacher) in a self-supervised way, and then distills the knowledge to the smaller model (as the student). Note that conventional distillation is for supervised learning, while the distillation here is in the self-supervised setting without any labeled data.

Supervised distillation can be formulated as training a student to mimic the probability mass function over classes predicted by a teacher model. In the unsupervised knowledge distillation setting, however, the distribution over classes is not directly attainable. Therefore, we propose a simple yet effective self-supervised distillation method. Similar to (He et al., 2020; Wu et al., 2018), we maintain a queue of data samples. Given an instance, we first use the teacher network to obtain its similarity scores with all the data samples in the queue as well as with the instance itself. Then the student encoder is trained to mimic the similarity score distribution inferred by the teacher over these data samples.

The simplicity and flexibility that SEED brings are self-evident: 1) it does not require any clustering/prototypical computing procedure to retrieve pseudo-labels or latent classes; 2) the teacher model can be pre-trained with any advanced SSL approach, e.g., MoCo-V2 (Chen et al., 2020c), SimCLR (Chen et al., 2020a), or SWAV (Caron et al., 2020); 3) the knowledge can be distilled to any target small network (shallower, thinner, or of a totally different architecture).

To demonstrate the effectiveness, we comprehensively evaluate the learned representations on a series of downstream tasks, e.g., fully/semi-supervised classification and object detection, and also assess the transferability to other domains. For example, on the ImageNet-1k dataset, SEED improves the linear probe accuracy of EfficientNet-B0 from 42.2% to 67.6% (a gain of over 25%), and of MobileNet-V3 from 36.3% to 68.2% (a gain of over 31%), compared to MoCo-V2 baselines, as shown in Figure 1 and Section 4.

Our contributions can be summarized as follows:
• We are the first to address the problem of self-supervised visual representation learning for small models.
• We propose a self-supervised distillation (SEED) technique to transfer knowledge from a large model to a small model without any labeled data.
• With the proposed distillation technique (SEED), we significantly improve the state-of-the-art SSL performance on small models.
• We exhaustively compare a variety of distillation strategies to show the validity of SEED under multiple settings.
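The similarity-distribution distillation described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the array shapes, the temperature values, and the per-sample loop are assumptions made for clarity. The key idea is that, for each instance, both encoders score the queue plus the instance itself, and the student's softmax distribution is trained toward the teacher's via cross-entropy.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z):
    # Numerically stable softmax along the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def seed_loss(student_emb, teacher_emb, queue, temp_s=0.2, temp_t=0.07):
    """Cross-entropy between teacher and student similarity distributions.

    student_emb, teacher_emb: (B, D) embeddings of the same B images.
    queue: (K, D) teacher embeddings of previously seen instances.
    The instance's own teacher embedding is appended to the queue, so the
    distributions are taken over K+1 keys, as the paper describes.
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    q = l2_normalize(queue)

    losses = []
    for i in range(s.shape[0]):
        keys = np.vstack([q, t[i:i + 1]])        # (K+1, D): queue + instance itself
        p_t = softmax(keys @ t[i] / temp_t)      # teacher similarity distribution
        p_s = softmax(keys @ s[i] / temp_s)      # student similarity distribution
        losses.append(-(p_t * np.log(p_s + 1e-12)).sum())
    return float(np.mean(losses))
```

Note that when the student reproduces the teacher's embeddings (with matching temperatures), the cross-entropy reduces to the entropy of the teacher distribution, its minimum; any mismatch strictly increases the loss.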

2. RELATED WORK

Among the recent literature on self-supervised learning, contrastive approaches show prominent results on downstream tasks. The majority of techniques along this direction stem from noise-contrastive estimation (Gutmann & Hyvärinen, 2010), where the latent distribution is estimated by contrasting against randomly or artificially generated noise.
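The noise-contrastive objective underlying this family of methods can be sketched as follows. This is a generic illustration of an InfoNCE-style loss, not code from any of the cited works; the embedding dimension and temperature are assumptions. A query is classified as matching its positive key against a set of noise (negative) keys using cosine similarity.

```python
import numpy as np

def info_nce(query, positive, negatives, temp=0.1):
    """InfoNCE-style loss for a single query.

    query, positive: (D,) embedding vectors of two views of one instance.
    negatives: (K, D) embeddings of noise/negative instances.
    Returns the cross-entropy of picking the positive among K+1 keys.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, pos, neg = norm(query), norm(positive), norm(negatives)
    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.concatenate([[q @ pos], neg @ q]) / temp
    logits = logits - logits.max()               # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))
```

Lowering the loss requires pulling the positive pair together while pushing the query away from the noise samples, which is exactly the contrast against generated noise described above.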



Figure 1: SEED vs. MoCo-V2 (Chen et al., 2020c) on ImageNet-1K linear probe accuracy. The vertical axis is the top-1 accuracy and the horizontal axis is the number of learnable parameters for different network architectures. Directly applying self-supervised contrastive learning (MoCo-V2) does not work well for smaller architectures, while our method (SEED) leads to a dramatic performance boost. Details of the setting can be found in Section 4.

Oord et al. (2018) first proposed Info-NCE to learn image representations by predicting the future with an auto-regressive model for unsupervised learning. Follow-up works include improving the efficiency (Hénaff et al., 2019) and using multiple views as positive samples (Tian et al., 2019b). As these approaches have access to only a limited number of negative instances, Wu et al. (2018) designed a memory bank to store previously seen random representations as negative samples, treating each of them as an independent category (instance discrimination). However, this approach also comes with a deficiency: during the earlier stage of pre-training, the previously stored vectors are inconsistent with the recently computed representations. Chen et al. (2020a) mitigate this issue by sampling negative samples from a large batch. Concurrently, He et al. (2020) improve the memory-bank based method and propose to use

