CONTRASTIVE CONSISTENT REPRESENTATION DISTILLATION

Abstract

The combination of knowledge distillation and contrastive learning has great potential for distilling structural knowledge. Most contrastive-learning-based distillation methods treat the entire training dataset as the memory bank and maintain two such banks, one for the student and one for the teacher. Moreover, the representations in the two memory banks are updated in a momentum manner, leading to representation inconsistency. In this work, we propose Contrastive Consistent Representation Distillation (CoCoRD) to provide consistent representations for efficient contrastive-learning-based distillation. Instead of momentum-updating the cached representations, CoCoRD updates the encoders in a momentum manner. Specifically, the teacher is equipped with a momentum-updated projection head to generate consistent representations. These teacher representations are cached in a fixed-size queue, which serves as the only memory bank in CoCoRD and is significantly smaller than the entire training dataset. Additionally, a slow-moving student, implemented as a momentum-based moving average of the student, is built to facilitate contrastive learning. With only one memory bank and far fewer negative keys, CoCoRD provides highly competitive results under typical teacher-student settings. On ImageNet, CoCoRD-distilled ResNet50 outperforms the teacher ResNet101 by 0.2% top-1 accuracy. Furthermore, on PASCAL VOC and COCO detection, detectors whose backbones are initialized with CoCoRD-distilled models exhibit considerable performance improvements.

1. INTRODUCTION

The remarkable performance of convolutional neural networks (CNNs) in various computer vision tasks, such as image recognition (He et al., 2016; Huang et al., 2017) and object detection (Girshick, 2015; Ren et al., 2015; Redmon & Farhadi, 2017), has triggered interest in employing these powerful models beyond benchmark datasets. However, the cutting-edge performance of CNNs comes with substantial computational costs and storage consumption. Early work suggested that even shallow feedforward networks can approximate arbitrary functions (Hornik et al., 1989). Numerous endeavors have been made to reduce computational overheads and storage burdens. Among them, knowledge distillation, a widely discussed topic, presents a potential solution by training a compact student model with knowledge provided by a cumbersome but well-trained teacher model. The majority of distillation methods induce the student to imitate the teacher representations (Zagoruyko & Komodakis, 2017; Park et al., 2019; Tian et al., 2020; Hinton et al., 2015; Chen et al., 2021b; c; Yim et al., 2017; Tung & Mori, 2019; Ahn et al., 2019). Although representations carry richer learning signals, the difficulty of defining appropriate metrics to align student representations with teacher ones limits distillation performance. Moreover, failing to capture the dependencies between representation dimensions degrades performance further. To enhance performance, researchers attempt to distill structural knowledge by establishing connections between knowledge distillation and contrastive learning (Tian et al., 2020; Chen et al., 2021b). To efficiently retrieve representations of negative samples for contrastive learning, memory banks cache representations that are updated in a momentum manner, as shown in Fig. 1. However, the student itself is optimized sharply by the training optimizer.
The student representations in the memory bank are therefore inconsistent: the representations updated in a given iteration differ from those left untouched. As a result, the student can easily tell positive samples from negative ones, which prevents it from learning good features. The storage size of the memory bank is another concern when applying contrastive-learning-based distillation methods. As in (Tian et al., 2020; Chen et al., 2021b), two memory banks are maintained, each containing representations of all training images, leading to massive GPU memory usage on large-scale datasets. Motivated by the discussion above, we propose Contrastive Consistent Representation Distillation (CoCoRD) as a novel way of distilling consistent representations with a single fixed-size memory bank. Specifically, CoCoRD is composed of four major components, as shown in Fig. 2: (1) a fixed-size queue referred to as the teacher dictionary, (2) a teacher, (3) a student, and (4) a slow-moving student. Viewing contrastive learning as a dictionary look-up task, the teacher dictionary serves as the memory bank, and all the representations in it act as negative keys. The encoded representations of the current batch from the teacher are enqueued; once the queue is full, the oldest representations are dequeued. By introducing a queue, the size of the memory bank is decoupled from both the dataset size and the batch size, allowing it to be considerably smaller than the dataset and larger than the commonly used batch size. The student is followed by a projection head, which maps the student features into a representation space.
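The queue mechanism described above can be sketched in a few lines; the class name `TeacherDictionary` and the list-of-vectors representation format are our illustrative assumptions, not the paper's implementation:

```python
from collections import deque

class TeacherDictionary:
    """A fixed-size FIFO memory bank of teacher representations (a sketch).

    New batch representations are enqueued; once the queue is full, the
    oldest entries are dequeued automatically, so the bank's size K is
    decoupled from both the dataset size and the batch size.
    """

    def __init__(self, size):
        self.queue = deque(maxlen=size)  # oldest entries drop out automatically

    def enqueue(self, batch_representations):
        # batch_representations: an iterable of encoded vectors from the teacher
        self.queue.extend(batch_representations)

    def negative_keys(self):
        # every cached representation serves as a negative key
        return list(self.queue)

# Usage: a bank of K = 8 keys fed by batches of 4 dummy 2-d representations.
bank = TeacherDictionary(size=8)
for step in range(3):
    batch = [[float(step)] * 2 for _ in range(4)]
    bank.enqueue(batch)
print(len(bank.negative_keys()))  # 8 — the oldest batch has been dequeued
```

Using `deque(maxlen=K)` makes the dequeue of stale representations implicit: enqueueing the newest batch evicts the oldest keys in the same call.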
The teacher projection head is initialized identically to the student one and is a momentum moving average of the student projection head if the teacher and the student share the same feature dimension; otherwise, the teacher projection head is randomly initialized and not updated. Since the contrast against the teacher dictionary draws distinctions at the instance level, cached teacher representations that share the same class label as the student ones are mistakenly treated as negative keys, introducing noise into the dictionary. To alleviate the impact of this noise, a slow-moving student, implemented as a momentum moving average of the student, is proposed to pull together anchor representations and class-positive ones. As shown in Fig. 2, with a momentum-updated projection head, the slow-moving student projects a data-augmented version of the anchor image into the representation space, yielding an instance-negative but class-positive key. The main contributions are as follows:

• We utilize only one lightweight memory bank (the teacher dictionary), in which all representations are treated as negative keys. We experimentally demonstrate that a miniature teacher dictionary with far fewer negative keys can be sufficient for contrastive learning in knowledge distillation.

• We equip the well-trained teacher with a momentum-updated projection head to provide consistent representations for the teacher dictionary. In addition, a slow-moving student provides class-positive representations to alleviate the impact of the potential noise in the teacher dictionary.

• We verify the effectiveness of CoCoRD by achieving state-of-the-art performance in 11 out of 13 student-teacher combinations for model compression. On ImageNet, the CoCoRD-distilled ResNet50 outperforms the teacher ResNet101 by 0.2% top-1 accuracy.
Moreover, we initialize the backbones of object detectors with CoCoRD-distilled weights and observe considerable performance improvements over counterparts initialized with the vanilla students.
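The momentum (exponential-moving-average) parameter update behind both the teacher projection head and the slow-moving student can be sketched as follows. This is a minimal illustration assuming parameters stored as plain Python lists; the function name and the momentum value are our choices, not the paper's:

```python
def momentum_update(slow_params, student_params, m=0.999):
    """EMA update: slow <- m * slow + (1 - m) * student (a sketch).

    Because m is close to 1, the momentum-updated parameters evolve slowly
    and smoothly, so the representations they produce stay consistent
    across iterations, unlike parameters updated sharply by the optimizer.
    """
    return [m * p + (1.0 - m) * s for p, s in zip(slow_params, student_params)]

# Usage: after each optimizer step on the student, refresh the slow copy.
# m = 0.5 here only to make the effect visible in one step.
slow = [0.0, 2.0]
student = [2.0, 2.0]
slow = momentum_update(slow, student, m=0.5)
print(slow)  # [1.0, 2.0]
```

In practice the update is applied per tensor without gradient tracking; the key property is that a large m keeps the momentum-updated encoder close to its recent past, which is what makes the cached keys consistent.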



Figure 1: The general pipelines of contrastive-learning-based knowledge distillation methods and of CoCoRD. Instead of momentum-updating the representations, CoCoRD updates the encoder in a momentum manner; the teacher dictionary, which contains representations from preceding batches, is implemented as a queue.

