ON STUDENT-TEACHER DEVIATIONS IN DISTILLATION: DOES IT PAY TO DISOBEY?

Abstract

Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must deviate from the teacher in some manner (Stanton et al., 2021). What is the nature of these deviations, and how do they relate to the generalization gains of distillation? To investigate these questions, we first conduct a variety of experiments across image and language classification datasets. One of our key observations is that in a majority of our settings, the student underfits points that the teacher finds hard. We also find that student-teacher deviations during the initial phase of training are not crucial to the benefits of distillation: simply switching to distillation in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives on these deviations, one casting distillation as a regularizer in eigenspace, and the other as a denoiser of gradients. In both views, we argue how student-teacher deviations emerge, and how they relate to generalization in the context of our experiments. Our analysis also bridges fundamental gaps between existing theory and practice by focusing on gradient descent and avoiding label noise assumptions.

1. INTRODUCTION

Distillation (Bucilǎ et al., 2006; Hinton et al., 2015) has emerged as a highly effective model compression technique, wherein one trains a small "student" model to match the predicted soft label distribution of a large "teacher" model, rather than one-hot labels. An actively developing literature has sought to explore applications of this technique to various settings (Radosavovic et al., 2018; Furlanello et al., 2018; Xie et al., 2019), design more effective variants of the above recipe (Romero et al., 2015; Anil et al., 2018; Park et al., 2019; Beyer et al., 2022), and better understand theoretically when and why distillation is effective (Lopez-Paz et al., 2016; Phuong & Lampert, 2019; Mobahi et al., 2020; Allen-Zhu & Li, 2020; Menon et al., 2021; Dao et al., 2021). On paper, distillation aims to transfer the teacher's soft probabilities to the student. However, Stanton et al. (2021) challenge this premise: they show that there is often a mismatch between student and teacher probabilities, and, in fact, that a greater mismatch is correlated with better student performance. Indeed, in the self-distillation setting (Furlanello et al., 2018; Zhang et al., 2019), where the student and teacher architectures are identical, some form of deviation (in the representation, if not in the probabilities) is necessary for the student's generalization to surpass the teacher's. In this work, we are interested in better characterizing these deviations in probabilities, and in understanding the role they play in the student outperforming the teacher. In the first half of the paper, we conduct experiments characterizing what kinds of deviations exist between the teacher and the student, and which deviations are relevant for better generalization. In the second half, we provide two complementary theoretical perspectives on how distillation can induce such deviations, and why that can subsequently aid generalization.
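To fix notation, the standard distillation recipe described above can be sketched as follows. This is a minimal NumPy sketch of the objective of Hinton et al. (2015), with temperature-softened probabilities and the usual T² gradient-rescaling factor; the function names and the default temperature are illustrative assumptions, not the exact configuration used in this paper's experiments.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-softened softmax; subtracting the max is for numerical stability.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened probabilities.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures, as suggested by Hinton et al. (2015).
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return T**2 * kl.mean()
```

In self-distillation, the teacher shares the student's architecture and is itself trained on one-hot labels; the student then minimizes this loss (possibly mixed with the one-hot cross-entropy) against the frozen teacher's logits.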
More concretely, our key contributions are as follows:

(i) What deviations exist? Across various architectures (ResNet56, ResNet20, MobileNet, and RoBERTa) and image/language classification datasets (CIFAR100, TinyImageNet, CIFAR10, GLUE), we empirically demonstrate (§3.1) that the student tends to underfit points that are "hard" for the teacher (Fig 1a), in terms of the final probabilities learned by both models.

(ii) Which deviations matter? We find (§3.2) that it is possible to switch from the one-hot loss to the distillation loss in the middle of training and (a) still recover a considerable fraction of distillation's gains (Fig 1b), and (b) also recover the final-epoch underfitting behavior on TinyImageNet and CIFAR100. Thus, we conclude that student-teacher deviations unique to the early phase of training, such as those proposed in Allen-Zhu & Li (2020) and Jha et al. (2020), are not by themselves adequate to explain the success of distillation, but the underfitting may be. Next, we ask how deviations arise and why they help.

(iii) Eigenspace view. We provide a counterpart, for the gradient descent setting in linear regression (Theorem 4.1), of the seminal result of Mobahi et al. (2020), which demonstrates that distillation acts as a regularizer in a non-gradient-descent setting. We propose this view as a way to understand the empirically observed underfitting in distillation. Besides providing a much simpler proof and a more practically relevant version of Mobahi et al. (2020), our result also formalizes existing empirical intuition about the importance of early stopping in distillation (Dong et al., 2019; Cho & Hariharan, 2019; Ji & Zhu, 2020; Wang et al., 2022).

(iv) Gradient space view. As a complementary viewpoint, we formalize distillation as a gradient denoiser in the presence of class similarities (Theorem 4.2). We propose this view as a way to understand our empirical observations on loss-switching. Importantly, unlike prior work (Menon et al., 2021), we show how denoising can occur even when the data is perfectly classifiable and has no inherent label noise.

(v) A unified view. We informally unify these two views, painting a more coherent picture of two disjoint lines of existing theory (Mobahi et al. (2020) vs. Menon et al. (2021)).

Overall, we hope that our discussion helps bridge the gap between existing theoretical understanding and empirics in distillation by (a) making more practical assumptions than existing theories, and (b) making connections to various empirical observations. Our findings also suggest that not matching the teacher probabilities exactly can be a good thing, which future empirical work on distillation may want to be mindful of.

(a) Teacher-student logit plots for self-distillation. (b) Effect of loss-switching.

Figure 1: Left: Deviation in probabilities of the (one-hot trained) teacher vs. the (self-distilled) student. For each training sample (x, y), we plot φ(p^{te}_{y^{te}}(x)) versus φ(p^{st}_{y^{te}}(x)) for the logit transformation φ(u) = log[u/(1 - u)] and the teacher-predicted label y^{te}. We consistently find that the distilled student's predictions deviate from the X = Y line (dashed), with the teacher's "hard" points (small X) being underfit by the student (Y ≤ X). This hints at distillation acting as a regularizer. Right: Effect of late loss-switching. In CIFAR-ResNet56 self-distillation, we switch the loss (gradually, over the course of a few steps) late during training and find that switching to distillation (the "OneHot to KD" line) recovers nearly all the gains of distillation (KD). This suggests that the initial phase of training is not critical for distillation to help.
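The transformation φ underlying the Figure 1 scatter plots can be computed directly. Below is a minimal sketch with made-up, purely illustrative probabilities (the arrays `p_teacher` and `p_student` are not data from the paper); it flags the points the paper would call underfit, namely those falling below the X = Y line.

```python
import numpy as np

def logit(u):
    """phi(u) = log(u / (1 - u)), the transformation applied in Figure 1."""
    u = np.asarray(u, dtype=float)
    return np.log(u / (1.0 - u))

# Hypothetical probabilities assigned to the teacher-predicted label,
# ordered from an "easy" point (high teacher confidence) to "hard" ones:
p_teacher = np.array([0.99, 0.90, 0.60, 0.55])
p_student = np.array([0.99, 0.88, 0.50, 0.45])

x, y = logit(p_teacher), logit(p_student)
# Points with y < x lie below the X = Y line and are underfit by the
# student; per the paper, this happens mostly on the teacher's "hard"
# points (small x).
underfit = y < x
```

The logit transform stretches the interval near 0 and 1, which is what makes small deviations at high teacher confidence visible in such plots.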

