BETTER TEACHER BETTER STUDENT: DYNAMIC PRIOR KNOWLEDGE FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation (KD) has shown promising capabilities in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers grows, existing KD methods fail to deliver better results. Our work shows that 'prior knowledge' is vital to KD, especially when applying large teachers. Specifically, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. This means our method takes the teacher's features as 'input', not just as 'target'. In addition, we dynamically adjust the ratio of prior knowledge during training according to the feature gap, thus guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR-100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method under varying settings. Moreover, DPK makes the student's performance positively correlated with that of the teacher, which means we can further boost student accuracy by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection given any student model. Our code will be released at https://github.com/Cuibaby/DPK.

1. INTRODUCTION

Tremendous efforts have been made in crafting lightweight deep neural networks applicable to real-world scenarios. Representative methods include network pruning (He et al., 2017), model quantization (Habi et al., 2020), neural architecture search (NAS) (Wan et al., 2020), and knowledge distillation (KD) (Bucilua et al., 2006; Hinton et al., 2015). Among them, KD has recently emerged as one of the most flourishing topics due to its effectiveness (Liu et al., 2021a; Zhao et al., 2022; Chen et al., 2021; Heo et al., 2019a) and wide applications (Yang et al., 2022a; Chong et al., 2022; Liu et al., 2019a; Yim et al., 2017b; Zhang & Ma, 2020). The core idea of KD is to transfer distilled knowledge from a well-performing but cumbersome teacher to a compact and lightweight student. Numerous methods built on this idea have achieved great success. However, as research has deepened, several related issues have been raised. In particular, several works (Cho & Hariharan, 2019; Mirzadeh et al., 2020; Hinton et al., 2015; Liu et al., 2021a) report that as the teacher's performance increases, the student's accuracy saturates (which might be unsurprising). Worse, large models acting as teachers can yield significantly lower student performance than relatively small ones. For example, as shown in Fig. 1, ICKD (Liu et al., 2021a), a strong baseline that also points out this issue, performs better under the guidance of small teacher models, whereas applying a large model as the teacher considerably degrades the student's performance. Following (Cho & Hariharan, 2019; Mirzadeh et al., 2020), we attribute this issue to the capacity gap between teachers and students. More specifically, a small student struggles to 'understand' the high-order semantics extracted by a large model.
This problem is exacerbated when applying larger teachers, and it makes the student's accuracy inversely correlated with the capacity of the teacher model¹. Note that this problem also exists for humans: human teachers often give students some prior knowledge to facilitate their learning in such cases. Moreover, experienced teachers can adjust the amount of prior knowledge provided to different students so as to fully develop their potential. Inspired by these observations from human teachers, we propose the dynamic prior knowledge (DPK) framework for feature distillation. Specifically, to provide prior knowledge to the student, we replace the student's features at some random spatial positions with the corresponding teacher features at the same positions. We further design a ViT (Dosovitskiy et al., 2020)-style module to fully integrate this 'prior knowledge' with the student's features. In addition, our method dynamically adjusts the amount of prior knowledge, reflected in the proportion of teacher features in the hybrid feature maps. Particularly, DPK computes the feature discrepancy between the student and the teacher during training, and updates the feature mixing ratio accordingly. In this way, the student always learns from the teacher at an appropriate difficulty, which alleviates the performance degradation issue.
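The two mechanisms described above can be illustrated with a short NumPy sketch: one function mixes teacher features into the student's feature map at random spatial positions, and another sets the mixing ratio from the measured feature gap. The function names, the (C, H, W) layout, and the gap-to-ratio heuristic here are our own illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mix_features(student_feat, teacher_feat, ratio, rng=None):
    """Replace a fraction `ratio` of spatial positions in the student's
    feature map with the teacher's features at the same positions.
    Both inputs are assumed to have shape (C, H, W)."""
    if rng is None:
        rng = np.random.default_rng(0)
    c, h, w = student_feat.shape
    n_replace = int(round(ratio * h * w))
    # pick random spatial positions whose features come from the teacher
    idx = rng.choice(h * w, size=n_replace, replace=False)
    mixed = student_feat.reshape(c, -1).copy()
    mixed[:, idx] = teacher_feat.reshape(c, -1)[:, idx]
    return mixed.reshape(c, h, w)

def update_ratio(student_feat, teacher_feat, max_ratio=0.5):
    """Set the prior-knowledge ratio from the normalized feature gap:
    the larger the student-teacher discrepancy, the more teacher
    features are exposed (a plausible heuristic, not necessarily the
    paper's exact schedule)."""
    gap = np.mean((student_feat - teacher_feat) ** 2)
    scale = np.mean(teacher_feat ** 2) + 1e-8
    return max_ratio * min(1.0, gap / scale)
```

Note that with `ratio = 1.0` the mixed map degenerates to the teacher's features, and with `ratio = 0.0` it is the unmodified student map; the dynamic schedule keeps training between these extremes.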


We evaluate DPK on two image classification benchmarks (i.e., CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009)) and an object detection benchmark (i.e., MS COCO (Lin et al., 2014)). Experimental results indicate that DPK outperforms other baseline models under several settings. More importantly, our method can be further improved by applying larger teachers (see Fig. 1 for an example). We argue that this characteristic of DPK not only further boosts student performance, but also provides a quick solution for finding the best teacher for a given student. In addition, we conduct extensive ablations to validate each design choice of DPK in detail. In summary, the main contributions of this work are:

• We propose the prior knowledge mechanism for feature distillation, which can fully exploit the distillation potential of large models. To the best of our knowledge, our method is the first to take the teacher's features as 'input', not just 'target', in knowledge distillation.

• Building on the first contribution, we further propose dynamic prior knowledge (DPK). DPK provides a solution to the 'larger models are not always better teachers' issue, and it also gives better (or comparable) results under several settings.

2. METHODOLOGY

In this section, we first provide the background of KD, and then introduce the framework and details of the proposed DPK.

2.1. PRELIMINARY

Existing KD methods can be grouped into two categories. In particular, logits-based KD methods distill the dark knowledge from the teacher by aligning the soft targets between the student and the teacher.
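The soft-target alignment underlying logits-based KD (Hinton et al., 2015) can be sketched as a temperature-scaled KL divergence between the teacher's and the student's softened class distributions. The following NumPy version is a minimal illustration; the temperature value and function names are arbitrary choices for this sketch.

```python
import numpy as np

def softmax(z, t=1.0):
    """Numerically stable softmax with temperature t."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions; the T^2 factor keeps gradient magnitudes comparable
    across temperatures, as suggested by Hinton et al. (2015)."""
    p = softmax(teacher_logits, temperature)  # soft targets (fixed)
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

A higher temperature flattens both distributions, exposing the relative probabilities the teacher assigns to incorrect classes, which is precisely the 'dark knowledge' being transferred.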



¹ The student's performance is positively correlated with the accuracy of the teacher if the capacity gap is fixed; see Appendix A.6 for more details.



Figure 1: Top-1 accuracy of ResNet-18 w.r.t. various teachers on ImageNet. Different from the baseline model (ICKD (Liu et al., 2021a)), our method shows better performance and makes the performance of the student positively correlated with that of the teacher.

