BETTER TEACHER BETTER STUDENT: DYNAMIC PRIOR KNOWLEDGE FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation (KD) has shown promising capability in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers grows, existing KD methods fail to achieve better results. Our work shows that 'prior knowledge' is vital to KD, especially when applying large teachers. Specifically, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation. This means that our method takes the teacher's features not only as the 'target' but also as the 'input'. Besides, we dynamically adjust the ratio of the prior knowledge during training according to the feature gap, thus guiding the student at an appropriate level of difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR-100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method under varying settings. Besides, DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection for any given student. Our code will be released at https://github.com/Cuibaby/DPK.
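The mechanism described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's actual implementation): the function name `dpk_mix`, the MSE-based gap measure, and the linear mapping from gap to mixing ratio are all assumptions made for clarity. It shows the two key ideas only: part of the teacher's features is injected into the student's features as a "prior" before distillation, and the injected fraction is adjusted dynamically according to the current feature gap.

```python
import numpy as np

def dpk_mix(student_feat, teacher_feat,
            min_ratio=0.0, max_ratio=0.5, gap_scale=1.0, rng=None):
    """Hypothetical sketch of dynamic prior knowledge (DPK).

    Replaces a fraction of the student's feature elements with the
    teacher's values; the fraction grows with the current feature gap.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Feature gap: mean squared distance between the two feature maps.
    gap = float(np.mean((student_feat - teacher_feat) ** 2))
    # Map the gap to a mixing ratio in [min_ratio, max_ratio]:
    # a larger gap means more teacher features are given as the prior.
    ratio = min(max_ratio, min_ratio + gap_scale * gap)
    # Randomly choose which elements take the teacher's values (the "prior").
    mask = rng.random(student_feat.shape) < ratio
    mixed = np.where(mask, teacher_feat, student_feat)
    return mixed, ratio
```

Under this sketch, a well-trained student (small gap) receives little prior and is distilled almost on its own features, while a weak student facing a much stronger teacher receives more of the teacher's features, keeping the task at a manageable difficulty.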

1. INTRODUCTION

Tremendous efforts have been made in crafting lightweight deep neural networks applicable to real-world scenarios. Representative methods include network pruning (He et al., 2017), model quantization (Habi et al., 2020), neural architecture search (NAS) (Wan et al., 2020), and knowledge distillation (KD) (Bucilua et al., 2006; Hinton et al., 2015). Among them, KD has recently emerged as one of the most flourishing topics due to its effectiveness (Liu et al., 2021a; Zhao et al., 2022; Chen et al., 2021; Heo et al., 2019a) and wide applications (Yang et al., 2022a; Chong et al., 2022; Liu et al., 2019a; Yim et al., 2017b; Zhang & Ma, 2020). The core idea of KD is to transfer distilled knowledge from a well-performing but cumbersome teacher to a compact and lightweight student. Based on this idea, numerous methods have been proposed and achieved great success. However, as research has deepened, several related issues have been raised. In particular, several works (Cho & Hariharan, 2019; Mirzadeh et al., 2020; Hinton et al., 2015; Liu et al., 2021a) report that as the teacher's performance increases, the student's accuracy saturates (which might be unsurprising). To make matters worse, large models serving as teachers can lead to significantly worse student performance than relatively smaller ones. For example, as shown in Fig. 1, ICKD (Liu et al., 2021a), a strong baseline that also points out this issue, performs better under the guidance of small teacher models, whereas applying a large model as the teacher considerably degrades the student's performance. Like (Cho & Hariharan, 2019; Mirzadeh et al., 2020), we attribute this issue to the capacity gap between the teachers and the students. More specifically, the small student is

