SUPERVISION COMPLEXITY AND ITS ROLE IN KNOWLEDGE DISTILLATION

Abstract

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between the teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where the student receives increasingly complex supervision from teachers at different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.

1. INTRODUCTION

Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) is a popular method of compressing a large "teacher" model into a more compact "student" model. In its most basic form, this involves training the student to fit the teacher's predicted label distribution, or soft labels, for each sample. There is strong empirical evidence that distilled students usually perform better than students trained on raw dataset labels (Hinton et al., 2015; Furlanello et al., 2018; Stanton et al., 2021; Gou et al., 2021). Multiple works have devised novel KD procedures that further improve the student model performance (see Gou et al. (2021) and references therein). Simultaneously, several works have aimed to rigorously formalize why KD can improve the student model performance. Some prominent observations from this line of work are that (self-)distillation induces certain favorable optimization biases in the training objective (Phuong & Lampert, 2019; Ji & Zhu, 2020), lowers variance of the objective (Menon et al., 2021; Dao et al., 2021; Ren et al., 2022), increases regularization towards learning "simpler" functions (Mobahi et al., 2020), transfers information from different data views (Allen-Zhu & Li, 2020), and scales per-example gradients based on the teacher's confidence (Furlanello et al., 2018; Tang et al., 2020). Despite this remarkable progress, there are still many open problems and unexplained phenomena around knowledge distillation; to name a few:

- Why do soft labels (sometimes) help? It is agreed that the teacher's soft predictions carry information about class similarities (Hinton et al., 2015; Furlanello et al., 2018), and that this softness of predictions has a regularization effect similar to label smoothing (Yuan et al., 2020). Nevertheless, KD also works in binary classification settings with limited class similarity information (Müller et al., 2020). How exactly the softness of teacher predictions (controlled by a temperature parameter) affects the student's learning remains far from well understood.

- The role of the capacity gap. There is evidence that when there is a significant capacity gap between the teacher and the student, the distilled model usually falls behind its teacher (Mirzadeh
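To make the basic form of KD concrete, the following sketch shows a standard distillation objective: a convex combination of the usual cross-entropy with the dataset labels and a cross-entropy against the teacher's temperature-softened predictions. This is a minimal illustration, not the framework developed in this paper; the function names, the mixing weight `alpha`, and the default temperature are illustrative choices. The `tau**2` factor is the gradient-rescaling convention suggested by Hinton et al. (2015).

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax; larger tau yields a softer
    # (higher-entropy) distribution over classes.
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      tau=4.0, alpha=0.5):
    """Basic KD objective (illustrative):
         alpha * CE(student, one-hot labels)
       + (1 - alpha) * tau^2 * CE(student, teacher soft labels at temperature tau).
    """
    eps = 1e-12  # guard against log(0)
    # Hard loss: ordinary cross-entropy with the dataset labels (tau = 1).
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + eps).mean()
    # Soft loss: cross-entropy between the teacher's and the student's
    # temperature-softened distributions.
    p_teacher = softmax(teacher_logits, tau)
    log_q = np.log(softmax(student_logits, tau) + eps)
    soft = -(p_teacher * log_q).sum(axis=-1).mean()
    # tau^2 keeps the soft-loss gradient magnitude comparable across temperatures.
    return alpha * hard + (1 - alpha) * (tau ** 2) * soft
```

At `tau = 1` the soft term reduces to ordinary cross-entropy against the teacher's predictions; raising `tau` exposes more of the teacher's class-similarity structure, which is exactly the temperature-scaling knob whose effect the framework in this paper analyzes.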

