CAN STUDENTS OUTPERFORM TEACHERS IN KNOWLEDGE DISTILLATION BASED MODEL COMPRESSION?

Abstract

Knowledge distillation (KD) is an effective technique for compressing a large model (teacher) into a compact one (student) via knowledge transfer. Ideally, the teacher is compressed into the small student without any performance drop. However, even for state-of-the-art (SOTA) distillation approaches, there is still an obvious performance gap between the student and the teacher. The existing literature usually attributes this gap to the model capacity difference between them, yet capacity differences are unavoidable in model compression. In this work, we systematically study this question. Through exploratory experiments, we find that the capacity difference is not necessarily the root reason, and that the distillation data matters once the student capacity exceeds a threshold. In light of this, we propose to go beyond in-distribution distillation and accordingly develop KD+. KD+ is superior to the original KD: it substantially outperforms KD and the other SOTA approaches, and it is more compatible with existing approaches, further improving their performance significantly.¹

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performance in various domains, but they require large amounts of computation and memory. This seriously limits their deployment in settings with limited resources or strict latency requirements. One solution to this problem is knowledge distillation, which transfers the knowledge from a large network (teacher) to a small one (student). Hinton et al. (2015) proposed the original knowledge distillation² (KD), which uses the softened logits of a teacher as supervision to train a student. To help the student better capture the knowledge of the teacher, existing studies focus on aligning their representations under different criteria. However, there is still a significant performance gap between the teacher and the student. Figuring out the reason for this gap is essential for further improving student performance. Mirzadeh et al. (2020) argue that the model capacity difference causes the failure to transfer knowledge from a large teacher to a small student, thus leading to a large performance gap. Similarly, Cho & Hariharan (2019) point out that as the teacher grows in capacity and accuracy, it becomes difficult for the student to emulate the teacher.

In this paper, we systematically study why students underperform teachers and how students can match or outperform them. We find that in most experimental settings of the existing literature, the root reason for the performance gap is not necessarily the capacity difference, as the student is powerful enough to memorize the teacher's outputs. The reason instead lies in the distillation dataset on which the knowledge is transferred. As an old proverb says, indigo comes from blue, but it is bluer than blue. In reality, it is not rare for human students to do better than their teachers: these excellent students not only capture the knowledge of their teachers well but also learn more related knowledge on their own.
This gives an insight into how students in KD may match or outperform their teachers. We find that current students in KD have not fully captured the knowledge of their teachers, as they only mimic the teachers' behavior at sparse training data points. In light of this, we propose KD+, which goes beyond in-distribution distillation to substantially reduce the performance gap between students and teachers. Our main contributions are summarized as follows:

• Different from the common belief that model capacity differences cause the performance gap between students and teachers, we find that capacity differences are not necessarily the root reason; instead, the distillation data matters once students' capacities exceed a threshold. To the best of our knowledge, this is the first work that systematically explores why small students underperform teachers and how students can outperform large teachers.

• Through exploratory experiments, we find the following: (1) fitting teachers' outputs only at sparse training data points does not let students capture the local, in-distribution shapes of the teacher functions well; (2) unlike in standard supervised learning, out-of-distribution data (though not all of it) can be beneficial to knowledge distillation.

• Different from existing work that focuses on criteria for aligning representations or logits between teachers and students, we address knowledge distillation from a novel (data) perspective by going beyond in-distribution distillation, and accordingly develop KD+.

• Extensive experiments demonstrate that KD+ largely reduces the performance gap between students and teachers, and even enables students to match or outperform their teachers. KD+ is superior to KD: it substantially outperforms KD and more than 10 SOTA methods, shows better compatibility with existing methods, and is superior in few-shot scenarios.
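To make the "beyond in-distribution distillation" idea concrete, below is a minimal sketch of generating between-sample distillation inputs by linearly interpolating random pairs of training samples and labeling them with the teacher's soft predictions instead of interpolated one-hot labels. The function names, the Beta-sampling choice, and the `teacher_fn` interface are illustrative assumptions, not the paper's exact KD+ formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolated_distill_batch(x_batch, teacher_fn, beta_a=1.0):
    """Hypothetical sketch: build distillation inputs between training points.

    Convexly combines random pairs of samples and asks the teacher for soft
    targets at the interpolated points, so the supervision reflects the
    teacher's local shape rather than mixed one-hot labels.
    """
    n = len(x_batch)
    perm = rng.permutation(n)                     # random pairing of samples
    lam = rng.beta(beta_a, beta_a, size=(n, 1))   # mixing coefficients in [0, 1]
    # Points that lie between two in-distribution training samples
    x_mix = lam * x_batch + (1.0 - lam) * x_batch[perm]
    # The teacher (not interpolated labels) supervises these points
    soft_targets = teacher_fn(x_mix)
    return x_mix, soft_targets
```

Any callable that maps a batch of inputs to class probabilities (e.g., a trained network followed by softmax) can serve as `teacher_fn`; the student would then be trained to match `soft_targets` at `x_mix` in addition to the usual objective on the original data.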

2. RELATED WORK

The objective function of knowledge distillation can be expressed simply as a combination of the regular cross-entropy objective and a distillation objective. According to the distillation objective, the existing literature can be divided into logit-based approaches (Hinton et al., 2015) and representation-based approaches (Romero et al., 2015). Logit-based approaches construct the distillation objective from output logits. Hinton et al. (2015) propose KD, which penalizes the softened-logit differences between a teacher and a student. Park et al. (2019) propose to transfer data-sample relations from a teacher to a student by aligning their logit-based structures. Representation-based approaches, on the other hand, design the distillation objective based on feature maps. FitNet (Romero et al., 2015) aligns the features of a teacher and a student through regressions. AT (Zagoruyko & Komodakis, 2017) distills feature attention from a teacher into a student. CRD (Tian et al., 2020) maximizes the mutual information between student and teacher representations. Other representation-based methods (Yim et al., 2017; Huang & Wang, 2017; Kim et al., 2018; Liu et al., 2019; Srinivas & Fleuret, 2018; Wang et al., 2018; Heo et al., 2019a; Cho & Hariharan, 2019; Ahn et al., 2019; Koratana et al., 2019; Aguilar et al., 2019; Shen & Savvides, 2020) use different criteria to align feature representations. SSKD (Xu et al., 2020) introduces extra self-supervision tasks to assist KD. Online knowledge distillation (Zhang et al., 2018b; Chen et al., 2020; Anil et al., 2018; Chung et al., 2020; Zhu et al., 2018) trains multiple students simultaneously. Self-distillation approaches (Furlanello et al., 2018; Yuan et al., 2020) train a DNN by using itself as the teacher. Overall, existing studies focus on designing different criteria to align teacher-student representations or logits on in-distribution data.
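The combined objective described above can be sketched as follows. This is a generic NumPy version of the standard Hinton-style KD loss (cross-entropy on hard labels plus a temperature-scaled KL term); the weighting scheme and the `T`/`alpha` defaults are illustrative conventions, not values taken from this paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """(1 - alpha) * cross-entropy on ground-truth labels
       + alpha * T^2 * KL(teacher || student) on temperature-softened logits."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # KL divergence between softened teacher and student predictions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    # Standard cross-entropy with the hard labels (at T = 1)
    p = softmax(student_logits, 1.0)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

The `T**2` factor follows the convention of Hinton et al. (2015), keeping the gradient magnitudes of the distillation term comparable across temperatures; logit-based and representation-based methods differ mainly in what replaces the KL term.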
In this work, we address knowledge distillation from a data perspective by embedding out-of-distribution distillation into a regularizer. Mirzadeh et al. (2020) observe that the model capacity gap results in a failure to transfer knowledge from a large teacher to a small student, thus causing a performance gap. To reduce this gap, they propose a multi-step knowledge distillation framework using several intermediate-size networks (teacher assistants). However, the students still underperform the teachers substantially. Cho & Hariharan (2019) argue that as the teacher grows in capacity and accuracy, it becomes difficult for the student to emulate the teacher. To reduce the influence of the large capacity gap, they regularize both the teacher and the knowledge distillation via early stopping. We find that capacity differences are not necessarily the root reason when student capacities are greater than a threshold. On the other hand, KD+ goes beyond in-distribution distillation by exploring the knowledge between two training samples. Similar techniques have been used in many applications with different goals and mechanisms. Mixup (Zhang et al., 2018a) enforces local linearity of a DNN by linearly interpolating a random pair of training samples and their one-hot labels simultaneously. However, simply interpolating two labels may not match the generated sample, as pointed out in (Guo et al., 2019). KD+ does not have this issue, as it teaches a student to mimic the local shape of a powerful teacher. MixMatch (Berthelot et al., 2019b) linearly interpolates labeled and unlabeled data to improve semi-supervised learning performance. ReMixMatch (Berthelot et al., 2019a) improves MixMatch

¹ The code will be released online.
² In this paper, we use KD to denote the original knowledge distillation algorithm of Hinton et al. (2015).

