SUPERVISION COMPLEXITY AND ITS ROLE IN KNOWLEDGE DISTILLATION

Abstract

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between the teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers at different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.

1. INTRODUCTION

Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) is a popular method of compressing a large "teacher" model into a more compact "student" model. In its most basic form, this involves training the student to fit the teacher's predicted label distribution, or soft labels, for each sample. There is strong empirical evidence that distilled students usually perform better than students trained on raw dataset labels (Hinton et al., 2015; Furlanello et al., 2018; Stanton et al., 2021; Gou et al., 2021). Multiple works have devised novel KD procedures that further improve the student model's performance (see Gou et al. (2021) and references therein). Simultaneously, several works have aimed to rigorously formalize why KD can improve the student model's performance. Some prominent observations from this line of work are that (self-)distillation induces certain favorable optimization biases in the training objective (Phuong & Lampert, 2019; Ji & Zhu, 2020), lowers variance of the objective (Menon et al., 2021; Dao et al., 2021; Ren et al., 2022), increases regularization towards learning "simpler" functions (Mobahi et al., 2020), transfers information from different data views (Allen-Zhu & Li, 2020), and scales per-example gradients based on the teacher's confidence (Furlanello et al., 2018; Tang et al., 2020).

Despite this remarkable progress, there are still many open problems and unexplained phenomena around knowledge distillation; to name a few:

- Why do soft labels (sometimes) help? It is agreed that the teacher's soft predictions carry information about class similarities (Hinton et al., 2015; Furlanello et al., 2018), and that this softness of predictions has a regularization effect similar to label smoothing (Yuan et al., 2020). Nevertheless, KD also works in binary classification settings with limited class similarity information (Müller et al., 2020).
How exactly the softness of teacher predictions (controlled by a temperature parameter) affects student learning remains far from well understood.
- The role of the capacity gap. There is evidence that when there is a significant capacity gap between the teacher and the student, the distilled model usually falls behind its teacher (Mirzadeh et al., 2020).
- What makes a good teacher? Sometimes less accurate models are better teachers (Cho & Hariharan, 2019; Mirzadeh et al., 2020). Moreover, early stopped or exponentially averaged models are often better teachers (Ren et al., 2022). A comprehensive explanation of this remains elusive.

The aforementioned wide range of phenomena suggests that there is a complex interplay between teacher accuracy, softness of teacher-provided targets, and complexity of the distillation objective. This paper provides a new theoretically grounded perspective on KD through the lens of supervision complexity. In a nutshell, this quantifies why certain targets (e.g., temperature-scaled teacher probabilities) may be "easier" for a student model to learn than others (e.g., raw one-hot labels), owing to better alignment with the student's neural tangent kernel (NTK) (Jacot et al., 2018; Lee et al., 2019). In particular, we provide a novel theoretical analysis (§2, Thm. 3 and 4) of the role of supervision complexity in kernel classifier generalization, and use this to derive a new generalization bound for distillation (Prop. 5). The latter highlights how student generalization is controlled by a balance of the teacher's generalization, the student's margin with respect to the teacher predictions, and the complexity of the teacher's predictions. Based on the preceding analysis, we establish the conceptual and practical efficacy of a simple online distillation approach (§4), wherein the student is fit to progressively more complex targets, in the form of teacher predictions at various checkpoints during its training.
This method can be seen as guiding the student in the function space (see Fig. 1), and leads to better generalization compared to offline distillation. We provide empirical results on a range of image classification benchmarks confirming the value of online distillation, particularly for students with weak inductive biases. Beyond its practical benefits, the supervision complexity view yields new insights into distillation:

- The role of temperature scaling and early stopping. Temperature scaling and early stopping of the teacher have proven effective for KD. We show that both of these techniques reduce the supervision complexity, at the expense of also lowering the classification margin. Online distillation manages to smoothly increase teacher complexity without degrading the margin.
- Teaching a weak student. We show that for students with weak inductive biases, and/or with much less capacity than the teacher, the final teacher predictions are often as complex as the dataset labels, particularly during the early stages of training. In contrast, online distillation allows the supervision complexity to increase progressively, thus allowing even a weak student to learn.
- NTK and relational transfer. We show that online distillation is highly effective at matching the teacher and student NTK matrices. This transfers relational knowledge in the form of example-pair similarity, as opposed to standard distillation, which only transfers per-example knowledge.

Problem setting. We focus on classification problems from an input domain X to d classes. We are given a training set of n labeled examples {(x_1, y_1), ..., (x_n, y_n)}, with one-hot encoded labels y_i ∈ {0, 1}^d. Typically, a model f_θ : X → R^d is trained with the softmax cross-entropy loss:

L_ce(f_θ) = -(1/n) ∑_{i=1}^{n} y_i^⊤ log σ(f_θ(x_i)),

where σ denotes the softmax function.
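As a concrete illustration, the cross-entropy objective above and its temperature-scaled soft-label distillation analogue (cf. Hinton et al., 2015) can be sketched in NumPy as follows; the function names and default temperature are ours, chosen for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_loss(logits, onehot_labels):
    """L_ce: mean over the batch of -y_i^T log softmax(f(x_i))."""
    log_p = np.log(softmax(logits) + 1e-12)
    return -np.mean(np.sum(onehot_labels * log_p, axis=-1))

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label KD loss: cross-entropy between temperature-scaled
    teacher and student distributions. A higher temperature yields
    softer, lower-complexity targets."""
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))
```

As the temperature grows, the teacher targets approach the uniform distribution (minimal complexity, but no signal); as it shrinks toward zero, they approach one-hot labels.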

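The supervision-complexity quantity analyzed in §2 is, up to normalization, the kernel-weighted norm y^⊤ K^{-1} y of the targets under the student's NTK Gram matrix K. The following toy sketch uses our own normalization and regularizer choices, with an RBF kernel standing in for the NTK:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Toy stand-in for the student's NTK Gram matrix on inputs X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def supervision_complexity(K, Y, reg=1e-6):
    """sqrt(Y^T (K + reg*I)^{-1} Y / n): small when the targets Y are
    aligned with the kernel's top eigendirections, large when they put
    mass on near-null directions."""
    n = K.shape[0]
    sol = np.linalg.solve(K + reg * np.eye(n), Y)
    return float(np.sqrt(np.sum(Y * sol) / n))
```

Targets aligned with the top eigenvectors of K (e.g., smooth teacher probabilities) yield low complexity; targets concentrated on near-null directions of K (e.g., hard one-hot labels for a weak student) yield high complexity.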


Figure 1: Offline vs. online distillation. Figures (a) and (b) illustrate possible teacher and student function trajectories in offline and online KD, respectively. The yellow dotted lines indicate KD. Figure (c) plots the adjusted supervision complexity of various targets with respect to NTKs at different stages of training (see §4 for more details).
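The online-distillation procedure sketched in Figure 1 amounts to a training loop in which the student's targets come from successive teacher checkpoints rather than from the final teacher alone. Below is a minimal sketch with a linear student fit by gradient descent; the checkpoint construction, squared-error objective, and all hyperparameters are illustrative stand-ins, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: inputs X and the final teacher's logits (2 classes).
X = rng.normal(size=(200, 10))
teacher_final = X @ rng.normal(size=(10, 2))

# "Checkpoints": progressively sharper teacher predictions, mimicking a
# teacher whose outputs grow more complex over its own training.
checkpoints = [alpha * teacher_final for alpha in (0.1, 0.3, 1.0)]

# Online distillation: the student tracks each checkpoint in turn, so the
# supervision complexity it faces increases gradually.
W = np.zeros((10, 2))  # linear student
lr, steps_per_phase = 0.1, 300
for target in checkpoints:
    for _ in range(steps_per_phase):
        grad = X.T @ (X @ W - target) / len(X)  # least-squares gradient
        W -= lr * grad

final_err = np.mean((X @ W - teacher_final) ** 2)
```

In practice the student would minimize a temperature-scaled KD loss against each checkpoint's soft predictions; squared error on logits simply keeps the sketch short.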

