ON STUDENT-TEACHER DEVIATIONS IN DISTILLATION: DOES IT PAY TO DISOBEY?

Abstract

Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must deviate from the teacher in some manner (Stanton et al., 2021). What is the nature of these deviations, and how do they relate to the generalization gains of distillation? To investigate these questions, we first conduct a variety of experiments across image and language classification datasets. One of our key observations is that, in a majority of our settings, the student underfits points that the teacher finds hard. We also find that student-teacher deviations during the initial phase of training are not crucial to the benefits of distillation: simply switching to distillation in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives on these deviations, one casting distillation as a regularizer in eigenspace, and another as a denoiser of gradients. In both views, we argue how student-teacher deviations emerge and how they relate to generalization in the context of our experiments. Our analysis also bridges fundamental gaps between existing theory and practice by focusing on gradient descent and avoiding label noise assumptions.

1. INTRODUCTION

Distillation (Bucilǎ et al., 2006; Hinton et al., 2015) has emerged as a highly effective model compression technique, wherein one trains a small "student" model to match the predicted soft label distribution of a large "teacher" model, rather than one-hot labels. An actively developing literature has sought to explore applications of this technique to various settings (Radosavovic et al., 2018; Furlanello et al., 2018; Xie et al., 2019), design more effective variants of the above recipe (Romero et al., 2015; Anil et al., 2018; Park et al., 2019; Beyer et al., 2022), and better understand theoretically when and why distillation is effective (Lopez-Paz et al., 2016; Phuong & Lampert, 2019; Mobahi et al., 2020; Allen-Zhu & Li, 2020; Menon et al., 2021; Dao et al., 2021). On paper, distillation intends to transfer the teacher's soft probabilities to the student. However, Stanton et al. (2021) challenge this premise: they show that there is often a mismatch between student and teacher probabilities, and in fact, that a greater mismatch is correlated with better student performance. Indeed, in the self-distillation setting (Furlanello et al., 2018; Zhang et al., 2019), where the student and teacher architectures are identical, some form of deviation (in the representation, if not in the probabilities) is necessary for the student's generalization to supersede the teacher's.

In this work, we are interested in better characterizing these deviations in probabilities, and in understanding how they play a role in the student outperforming the teacher. In the first half of the paper, we conduct experiments characterizing what kinds of deviations exist between the teacher and the student, and which deviations are relevant for better generalization. In the second half, we provide two complementary theoretical perspectives on how distillation can induce such deviations, and why that can subsequently aid generalization. More concretely, our key contributions are as follows:

(i) What deviations exist? Across various architectures (ResNet56, ResNet20, MobileNet, and RoBERTa) and image/language classification datasets (CIFAR100, TinyImageNet, CIFAR10, GLUE), we empirically demonstrate (§3.1) that the student tends to underfit points that are "hard" for the teacher (Fig 1a), in terms of the final probabilities learned by both models.

(ii) Which deviations matter? We find (§3.2) that it is possible to switch from the one-hot loss to the distillation loss in the middle of training and (a) still recover a considerable fraction of distillation's
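For concreteness, the standard distillation recipe referenced above (Hinton et al., 2015) can be sketched as follows. This is a minimal PyTorch-style illustration, not the exact objective or hyperparameters used in our experiments; the temperature and the mixing weight alpha are illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend one-hot cross-entropy with a KL term matching the
    teacher's temperature-softened probabilities."""
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 so gradient magnitudes stay
    # comparable to the one-hot term as the temperature changes.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the one-hot labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example usage (hypothetical models and batch):
#   loss = distillation_loss(student(x), teacher(x).detach(), y)

Setting alpha to 1 recovers pure distillation on the teacher's soft labels, while alpha of 0 recovers ordinary one-hot training; the mid-training "switch" experiments described in contribution (ii) amount to moving between these two regimes partway through optimization.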

