CONTRASTIVE CONSISTENT REPRESENTATION DISTILLATION

Abstract

The combination of knowledge distillation with contrastive learning has great potential to distill structural knowledge. Most contrastive-learning-based distillation methods treat the entire training dataset as the memory bank and maintain two such banks, one for the student and one for the teacher. Moreover, the representations in the two memory banks are updated in a momentum manner, leading to representation inconsistency. In this work, we propose Contrastive Consistent Representation Distillation (CoCoRD) to provide consistent representations for efficient contrastive-learning-based distillation. Instead of momentum-updating the cached representations, CoCoRD updates the encoders in a momentum manner. Specifically, the teacher is equipped with a momentum-updated projection head to generate consistent representations. These teacher representations are cached in a fixed-size queue, which serves as the only memory bank in CoCoRD and is significantly smaller than the entire training dataset. Additionally, a slow-moving student, implemented as a momentum-based moving average of the student, is built to facilitate contrastive learning. With only one memory bank and far fewer negative keys, CoCoRD provides highly competitive results under typical teacher-student settings. On ImageNet, CoCoRD-distilled ResNet50 outperforms the teacher ResNet101 by 0.2% top-1 accuracy. Furthermore, on PASCAL VOC and COCO detection, detectors whose backbones are initialized with CoCoRD-distilled models exhibit considerable performance improvements.
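The two momentum mechanisms summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: the function and class names, the momentum coefficient `m`, the queue size, and the feature dimension are all illustrative assumptions. The point is the contrast with prior work: the *encoder parameters* (here, flat parameter arrays) are updated by an exponential moving average, while the teacher keys are simply enqueued into a fixed-size FIFO buffer.

```python
import numpy as np

def ema_update(slow_params, fast_params, m=0.999):
    """Momentum-update the slow (moving-average) parameters toward the
    fast ones. In CoCoRD's framing, it is the encoder that moves slowly,
    not the cached representations. All names here are illustrative."""
    return [m * s + (1.0 - m) * f for s, f in zip(slow_params, fast_params)]

class KeyQueue:
    """Fixed-size FIFO queue of teacher keys: the only memory bank,
    significantly smaller than the full training set."""

    def __init__(self, dim=128, size=4096):
        self.keys = np.zeros((size, dim), dtype=np.float32)
        self.ptr = 0  # next slot to overwrite

    def enqueue(self, new_keys):
        """Overwrite the oldest slots with a new batch of keys."""
        n = len(new_keys)
        idx = np.arange(self.ptr, self.ptr + n) % len(self.keys)
        self.keys[idx] = new_keys
        self.ptr = (self.ptr + n) % len(self.keys)
```

Because the queue holds a fixed number of recent keys rather than one slot per training sample, its storage cost is decoupled from the dataset size.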

1. INTRODUCTION

The remarkable performance of convolutional neural networks (CNNs) in various computer vision tasks, such as image recognition (He et al., 2016; Huang et al., 2017) and object detection (Girshick, 2015; Ren et al., 2015; Redmon & Farhadi, 2017), has triggered interest in employing these powerful models beyond benchmark datasets. However, the cutting-edge performance of CNNs is always accompanied by substantial computational costs and storage consumption. An early study suggested that shallow feedforward networks can approximate arbitrary functions (Hornik et al., 1989), and numerous endeavors have since been made to reduce computational overheads and storage burdens. Among these endeavors, knowledge distillation presents a potential solution by training a compact student model with knowledge provided by a cumbersome but well-trained teacher model.

The majority of distillation methods induce the student to imitate the teacher representations (Zagoruyko & Komodakis, 2017; Park et al., 2019; Tian et al., 2020; Hinton et al., 2015; Chen et al., 2021b; c; Yim et al., 2017; Tung & Mori, 2019; Ahn et al., 2019). Although representations carry richer learning information, defining an appropriate metric for aligning the student representations with the teacher ones is difficult and limits distillation performance. Moreover, failing to capture the dependencies between representation dimensions results in poor performance. To address this, researchers attempt to distill structural knowledge by establishing connections between knowledge distillation and contrastive learning (Tian et al., 2020; Chen et al., 2021b). To efficiently retrieve representations of negative samples for contrastive learning, memory banks cache representations that are updated in a momentum manner, as shown in Fig. 1. However, the student is optimized sharply by the training optimizer.
The student representations in the memory bank are therefore inconsistent, because the representations updated in a given iteration differ from those left untouched. As a result, the student can easily discriminate the positive and negative samples, which keeps it from learning good features. The storage size of the memory bank is another concern when applying contrastive-learning-based distillation methods. As in (Tian et al., 2020; Chen et al.,
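The inconsistency described above can be made concrete with a minimal sketch of the conventional memory-bank update that CoCoRD avoids. The function name, momentum coefficient, and shapes below are illustrative assumptions, not any cited method's exact code: only the slots of the samples visited in the current iteration are blended with fresh features, so entries in the bank drift at different rates and become mutually inconsistent while the student itself changes sharply every step.

```python
import numpy as np

def momentum_bank_update(bank, indices, features, m=0.5):
    """Conventional per-sample memory-bank update (illustrative sketch):
    only the visited slots `indices` are blended with the freshly computed
    `features`; all other cached representations stay stale."""
    bank[indices] = m * bank[indices] + (1.0 - m) * features
    # Re-normalize only the updated entries to unit length.
    norms = np.linalg.norm(bank[indices], axis=1, keepdims=True)
    bank[indices] = bank[indices] / norms
    return bank
```

Since every training iteration touches only a mini-batch worth of slots, two cached representations may reflect student parameters that are many optimizer steps apart, which is precisely the inconsistency that updating the encoders (rather than the cache) is meant to remove.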

