CAN STUDENTS OUTPERFORM TEACHERS IN KNOWLEDGE DISTILLATION BASED MODEL COMPRESSION?

Abstract

Knowledge distillation (KD) is an effective technique for compressing a large model (teacher) into a compact one (student) via knowledge transfer. In the ideal case, the teacher is compressed into the small student without any performance drop. However, even for state-of-the-art (SOTA) distillation approaches, there is still an obvious performance gap between the student and the teacher. The existing literature usually attributes this gap to the model capacity difference between the two. However, capacity differences are unavoidable in model compression. In this work, we study this question systematically. Through exploratory experiments, we find that the capacity difference is not necessarily the root cause, and that the distillation data matters once the student capacity exceeds a threshold. In light of this, we propose to go beyond in-distribution distillation and accordingly develop KD+. KD+ is superior to the original KD: it outperforms KD and other SOTA approaches substantially, and it is compatible with existing approaches, further improving their performance significantly¹.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performance in various domains, but they require large amounts of computation and memory. This seriously limits their deployment under limited resources or strict latency requirements. One solution to this problem is knowledge distillation, which transfers knowledge from a large network (teacher) to a small one (student). Hinton et al. (2015) proposed the original knowledge distillation² (KD), which uses the softened logits of a teacher as supervision to train a student. To help the student better capture the knowledge of the teacher, existing studies focus on aligning their representations using different criteria. However, there is still a significant performance gap between the teacher and the student. Figuring out the reason for this gap is essential for further improving student performance. Mirzadeh et al. (2020) argue that the model capacity difference causes the failure to transfer knowledge from a large teacher to a small student, thus leading to a large performance gap. Similarly, Cho & Hariharan (2019) point out that as the teacher grows in capacity and accuracy, it becomes difficult for the student to emulate it. In this paper, we systematically study why students underperform teachers and how students can match or outperform them. We find that in most experimental settings of the existing literature, the root reason for the performance gap is not necessarily the capacity difference, as the student is powerful enough to memorize the teacher's outputs. Instead, the reason lies in the distillation dataset on which the knowledge is transferred.
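As a concrete illustration of the softened-logit objective of Hinton et al. (2015) (a minimal NumPy sketch, not the paper's released code; the temperature T and blending weight alpha are the usual hyperparameters, set here to common illustrative values):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Original KD objective: hard-label cross-entropy blended with the
    KL divergence between temperature-softened teacher and student outputs."""
    p_t = softmax(teacher_logits, T)  # softened teacher distribution
    p_s = softmax(student_logits, T)  # softened student distribution
    # KL(teacher || student); the T^2 factor keeps soft-target gradients
    # on the same scale as the hard-label term
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    ce = -float(np.log(softmax(student_logits)[label] + 1e-12))
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the (weighted) hard-label loss remains; the distillation term thus measures precisely how far the student's behavior departs from the teacher's on the distillation data.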

As an old proverb says, indigo comes from blue, yet it is bluer than blue. In reality, it is not rare for human students to do better than their teachers. These excellent students not only capture the knowledge of their teachers well but also learn more related knowledge on their own. This offers an insight into how students in KD might match or outperform their teachers. We find that current students in KD have not fully captured the knowledge of their teachers, as they only mimic the teachers' behavior on sparse training data points. In light of this, we propose KD+, which goes beyond in-distribution distillation to substantially reduce the performance gap between students and teachers. Our main contributions are summarized as follows:

¹ The code will be released online.
² In this paper, we use KD to denote the original knowledge distillation algorithm (Hinton et al., 2015).

