SUPERVISION COMPLEXITY AND ITS ROLE IN KNOWLEDGE DISTILLATION

Abstract

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between the teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers at different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.

1. INTRODUCTION

Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) is a popular method of compressing a large "teacher" model into a more compact "student" model. In its most basic form, this involves training the student to fit the teacher's predicted label distribution or soft labels for each sample. There is strong empirical evidence that distilled students usually perform better than students trained on raw dataset labels (Hinton et al., 2015; Furlanello et al., 2018; Stanton et al., 2021; Gou et al., 2021). Multiple works have devised novel KD procedures that further improve the student model performance (see Gou et al. (2021) and references therein). Simultaneously, several works have aimed to rigorously formalize why KD can improve the student model performance. Some prominent observations from this line of work are that (self-)distillation induces certain favorable optimization biases in the training objective (Phuong & Lampert, 2019; Ji & Zhu, 2020), lowers variance of the objective (Menon et al., 2021; Dao et al., 2021; Ren et al., 2022), increases regularization towards learning "simpler" functions (Mobahi et al., 2020), transfers information from different data views (Allen-Zhu & Li, 2020), and scales per-example gradients based on the teacher's confidence (Furlanello et al., 2018; Tang et al., 2020). Despite this remarkable progress, there are still many open problems and unexplained phenomena around knowledge distillation; to name a few:

- Why do soft labels (sometimes) help? It is agreed that the teacher's soft predictions carry information about class similarities (Hinton et al., 2015; Furlanello et al., 2018), and that this softness of predictions has a regularization effect similar to label smoothing (Yuan et al., 2020). Nevertheless, KD also works in binary classification settings with limited class similarity information (Müller et al., 2020). How exactly the softness of teacher predictions (controlled by a temperature parameter) affects student learning remains far from well understood.
- The role of the capacity gap. There is evidence that when there is a significant capacity gap between the teacher and the student, the distilled model usually falls behind its teacher (Mirzadeh et al., 2020; Cho & Hariharan, 2019; Stanton et al., 2021). It is unclear whether this is due to difficulties in optimization or due to insufficient student capacity.
- What makes a good teacher? Sometimes less accurate models are better teachers (Cho & Hariharan, 2019; Mirzadeh et al., 2020). Moreover, early stopped or exponentially averaged models are often better teachers (Ren et al., 2022). A comprehensive explanation of this remains elusive.

The aforementioned wide range of phenomena suggests that there is a complex interplay between teacher accuracy, softness of teacher-provided targets, and complexity of the distillation objective. This paper provides a new theoretically grounded perspective on KD through the lens of supervision complexity. In a nutshell, this quantifies why certain targets (e.g., temperature-scaled teacher probabilities) may be "easier" for a student model to learn compared to others (e.g., raw one-hot labels), owing to better alignment with the student's neural tangent kernel (NTK) (Jacot et al., 2018; Lee et al., 2019). In particular, we provide a novel theoretical analysis (§2, Thm. 3 and 4) of the role of supervision complexity in kernel classifier generalization, and use this to derive a new generalization bound for distillation (Prop. 5).
The latter highlights how student generalization is controlled by a balance of the teacher's generalization, the student's margin with respect to the teacher predictions, and the complexity of the teacher's predictions. Based on the preceding analysis, we establish the conceptual and practical efficacy of a simple online distillation approach (§4), wherein the student is fit to progressively more complex targets, in the form of teacher predictions at various checkpoints during its training. This method can be seen as guiding the student in the function space (see Fig. 1), and leads to better generalization compared to offline distillation. We provide empirical results on a range of image classification benchmarks confirming the value of online distillation, particularly for students with weak inductive biases. Beyond practical benefits, the supervision complexity view yields new insights into distillation:

- The role of temperature scaling and early stopping. Temperature scaling and early stopping of the teacher have proven effective for KD. We show that both of these techniques reduce the supervision complexity, at the expense of also lowering the classification margin. Online distillation manages to smoothly increase teacher complexity without degrading the margin.
- Teaching a weak student. We show that for students with weak inductive biases, and/or with much less capacity than the teacher, the final teacher predictions are often as complex as the dataset labels, particularly during the early stages of training. In contrast, online distillation allows the supervision complexity to progressively increase, thus allowing even a weak student to learn.
- NTK and relational transfer. We show that online distillation is highly effective at matching the teacher and student NTK matrices. This transfers relational knowledge in the form of example-pair similarity, as opposed to standard distillation, which only transfers per-example knowledge.

Problem setting. We focus on classification problems from an input domain $X$ to $d$ classes. We are given a training set of $n$ labeled examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, with one-hot encoded labels $y_i \in \{0, 1\}^d$. Typically, a model $f_\theta : X \to \mathbb{R}^d$ is trained with the softmax cross-entropy loss

$$L_{\mathrm{ce}}(f_\theta) = -\frac{1}{n}\sum_{i=1}^{n} y_i^\top \log \sigma(f_\theta(x_i)), \qquad (1)$$

where $\sigma(\cdot)$ is the softmax function. In standard KD, given a trained teacher model $g : X \to \mathbb{R}^d$ that outputs logits, one trains a student model $f_\theta : X \to \mathbb{R}^d$ to fit the teacher predictions. Hinton et al. (2015) propose the following KD loss:

$$L_{\mathrm{kd\text{-}ce}}(f_\theta; g, \tau) = -\frac{\tau^2}{n}\sum_{i=1}^{n} \sigma(g(x_i)/\tau)^\top \log \sigma(f_\theta(x_i)/\tau), \qquad (2)$$

where the temperature $\tau > 0$ controls the softness of the teacher predictions. To highlight the effect of KD and simplify exposition, we assume that the student is not trained with the dataset labels.
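To make the losses in Eqs. (1) and (2) concrete, the following is a minimal sketch of the temperature-scaled distillation loss written in JAX; the function and argument names are illustrative and are not taken from the paper's code.

```python
import jax.numpy as jnp
from jax.nn import log_softmax, softmax

def kd_ce_loss(student_logits, teacher_logits, tau=4.0):
    """Distillation loss of Eq. (2): temperature-scaled softmax cross-entropy.

    student_logits, teacher_logits: arrays of shape (n, d).
    The tau**2 factor mirrors Eq. (2) and keeps gradient magnitudes roughly
    comparable across temperatures."""
    teacher_probs = softmax(teacher_logits / tau, axis=-1)     # sigma(g(x)/tau)
    student_logp = log_softmax(student_logits / tau, axis=-1)  # log sigma(f(x)/tau)
    return -tau**2 * jnp.mean(jnp.sum(teacher_probs * student_logp, axis=-1))
```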

2. SUPERVISION COMPLEXITY AND GENERALIZATION

One apparent difference between standard training and KD (Eqs. 1 and 2) is that the latter modifies the targets that the student attempts to fit. The targets used during distillation often lead to better generalization for the student; what is the reason for this? Towards answering this question, we present a new perspective on KD in terms of supervision complexity. To begin, we show how the generalization of a kernel-based classifier is controlled by a measure of alignment between the target labels and the kernel matrix. We first treat binary kernel-based classifiers (Thm. 3), and later extend our analysis to multiclass kernel-based classifiers (Thm. 4). Finally, by leveraging the neural tangent kernel machinery, we discuss the implications of our analysis for neural classifiers in §2.2.

2.1. SUPERVISION COMPLEXITY CONTROLS KERNEL MACHINE GENERALIZATION

The notion of supervision complexity is easiest to introduce and study for kernel-based classifiers. We briefly review some necessary background (Scholkopf & Smola, 2001). Let $k : X \times X \to \mathbb{R}$ be a positive semidefinite kernel defined over an input space $X$. Any such kernel uniquely determines a reproducing kernel Hilbert space (RKHS) $H$ of functions from $X$ to $\mathbb{R}$. This RKHS is the completion of the set of functions of the form $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$, with $x_i \in X$, $\alpha_i \in \mathbb{R}$. Any $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x) \in H$ has (RKHS) norm

$$\|f\|_H^2 = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha,$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ and $K_{i,j} = k(x_i, x_j)$. Intuitively, $\|f\|_H^2$ measures the smoothness of $f$; e.g., for a Gaussian kernel it measures the decay of the Fourier spectrum of $f$ (Scholkopf & Smola, 2001).

For simplicity, we start with the case of binary classification. Suppose $\{(X_i, Y_i)\}_{i \in [n]}$ are $n$ i.i.d. examples sampled from some probability distribution on $X \times Y$, with $Y \subset \mathbb{R}$, where positive and negative labels correspond to distinct classes. Let $K_{i,j} = k(X_i, X_j)$ denote the kernel matrix, and $Y = (Y_1, \ldots, Y_n)^\top$ be the concatenation of all training labels.

Definition 1 (Supervision complexity). The supervision complexity of targets $Y_1, \ldots, Y_n$ with respect to a kernel $k$ is defined to be $Y^\top K^{-1} Y$ when $K$ is invertible, and $+\infty$ otherwise.

We now establish how supervision complexity controls the smoothness of the optimal kernel classifier. Consider a classifier obtained by solving a regularized kernel classification problem:

$$f^* \in \operatorname{argmin}_{f \in H} \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) + \frac{\lambda}{2}\|f\|_H^2, \qquad (3)$$

where $\ell$ is a loss function and $\lambda > 0$. The following proposition shows that whenever the supervision complexity is small, the RKHS norm of any optimal solution $f^*$ will also be small (see Appendix B for a proof). This is an important learning bias that shall help us explain certain aspects of KD.

Proposition 2. Assume that $K$ is full rank almost surely; $\ell(y, y') \ge 0$ for all $y, y' \in Y$; and $\ell(y, y) = 0$ for all $y \in Y$. Then, with probability 1, for any solution $f^*$ of (3), we have $\|f^*\|_H^2 \le Y^\top K^{-1} Y$.

Equipped with the above result, we now show how supervision complexity controls generalization. In the following, let $\phi_\gamma : \mathbb{R} \to [0, 1]$ be the margin loss (Mohri et al., 2018) with scale $\gamma > 0$:

$$\phi_\gamma(\alpha) = \min\{1, \max\{1 - \alpha/\gamma, 0\}\}. \qquad (4)$$

Theorem 3. Assume that $\kappa = \sup_{x \in X} k(x, x) < \infty$ and $K$ is full rank almost surely. Further, assume that $\ell(y, y') \ge 0$ for all $y, y' \in Y$ and $\ell(y, y) = 0$ for all $y \in Y$. Let $M_0 = \lceil \gamma\sqrt{n}/(2\sqrt{\kappa}) \rceil$. Then, with probability at least $1 - \delta$, for any solution $f^*$ of the problem in Eq. (3), we have

$$P_{X,Y}(Y f^*(X) \le 0) \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f^*(X_i)) + \frac{2\sqrt{Y^\top K^{-1} Y} + 2}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\ln(2 M_0/\delta)}{2n}}. \qquad (5)$$

The proof is available in Appendix B. One can compare Thm. 3 with the standard Rademacher bound for kernel classifiers (Bartlett & Mendelson, 2002). The latter typically considers learning over functions with RKHS norm bounded by a constant $M > 0$. The corresponding complexity term then decays as $O(M\sqrt{\operatorname{Tr}(K)}/n)$, which is data-independent. Consequently, such a bound cannot adapt to the intrinsic "difficulty" of the targets $Y$. In contrast, Thm. 3 considers functions with RKHS norm bounded by the data-dependent supervision complexity term. This results in a more informative generalization bound, which captures the "difficulty" of the targets. Here, we note that Arora et al.
(2019) characterized the generalization of an overparameterized two-layer neural network via a term closely related to the supervision complexity (see §5 for additional discussion).

The supervision complexity $Y^\top K^{-1} Y$ is small whenever $Y$ is aligned with the top eigenvectors of $K$ and/or $Y$ has small scale. Furthermore, one cannot make the bound close to zero by simply reducing the scale of the targets: one would then need a small $\gamma$ to control the margin loss, which would otherwise increase as the student predictions approach zero (since the student aims to match $Y_i$). To better understand the role of supervision complexity, it is instructive to consider two special cases that lead to a poor generalization bound: (1) uninformative features, and (2) uninformative labels.

Complexity under uninformative features. Suppose the kernel matrix $K$ is diagonal, so that the kernel provides no information on example-pair similarity; i.e., the kernel is "uninformative". An application of Cauchy-Schwarz reveals that the key expression in the second term of (5) satisfies

$$\frac{1}{n}\sqrt{Y^\top K^{-1} Y}\,\sqrt{\operatorname{Tr}(K)} = \frac{1}{n}\sqrt{\sum_{i=1}^n Y_i^2\, k(X_i, X_i)^{-1}}\,\sqrt{\sum_{i=1}^n k(X_i, X_i)} \ge \frac{1}{n}\sum_{i=1}^n |Y_i|.$$

Consequently, this term is at least constant in order, and does not vanish as $n \to \infty$.

Complexity under uninformative labels. Suppose the labels $Y_i$ are purely random and independent of the inputs $X_i$. Conditioned on $\{X_i\}$, $Y^\top K^{-1} Y$ concentrates around its mean by the Hanson-Wright inequality (Vershynin, 2018). Hence, there exists $\epsilon(K, \delta, n)$ such that with probability at least $1 - \delta$,

$$Y^\top K^{-1} Y \ge \mathbb{E}_{\{Y_i\}}\left[Y^\top K^{-1} Y\right] - \epsilon = \mathbb{E}\left[Y_1^2\right]\operatorname{Tr}(K^{-1}) - \epsilon.$$

Thus, with the same probability,

$$\frac{1}{n}\sqrt{Y^\top K^{-1} Y}\,\sqrt{\operatorname{Tr}(K)} \ge \frac{1}{n}\sqrt{\left(\mathbb{E}[Y_1^2]\operatorname{Tr}(K^{-1}) - \epsilon\right)\operatorname{Tr}(K)} \ge \frac{1}{n}\sqrt{\mathbb{E}[Y_1^2]\, n^2 - \epsilon\operatorname{Tr}(K)},$$

where the last inequality is by Cauchy-Schwarz. For sufficiently large $n$, the quantity $\mathbb{E}[Y_1^2]\, n^2$ dominates $\epsilon\operatorname{Tr}(K)$, rendering the bound of Thm. 3 close to a constant.
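The following sketch illustrates how Definition 1 and the complexity term of Eq. (5) can be computed for a given kernel matrix and target vector. It is a minimal illustration for the binary case; the small ridge added to K is only a numerical safeguard and is not part of the definition.

```python
import jax.numpy as jnp

def supervision_complexity(K, Y, ridge=1e-8):
    """Supervision complexity Y^T K^{-1} Y of Definition 1.

    K: (n, n) kernel matrix of the training inputs; Y: (n,) vector of targets.
    The ridge guards against a near-singular K."""
    K_reg = K + ridge * jnp.eye(K.shape[0])
    return Y @ jnp.linalg.solve(K_reg, Y)

def complexity_term(K, Y, gamma):
    """Margin-normalized complexity term appearing in the bound of Eq. (5)."""
    n = K.shape[0]
    return (2.0 * jnp.sqrt(supervision_complexity(K, Y)) + 2.0) \
        * jnp.sqrt(jnp.trace(K)) / (gamma * n)
```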

2.2. EXTENSIONS: MULTICLASS CLASSIFICATION AND NEURAL NETWORKS

We now show that a result similar to Thm. 3 holds for multiclass classification as well. In addition, we discuss how our results are instructive about the behavior of neural networks.

Extension to multiclass classification. Let $\{(X_i, Y_i)\}_{i \in [n]}$ be drawn i.i.d. from a distribution over $X \times Y$, where $Y \subset \mathbb{R}^d$. Let $k : X \times X \to \mathbb{R}^{d \times d}$ be a matrix-valued positive definite kernel and $H$ be the corresponding vector-valued RKHS. As in the binary classification case, we consider the kernel problem in Eq. (3). Let $Y^\top = (Y_1^\top, \ldots, Y_n^\top)$ and let $K$ be the kernel matrix of the training examples:

$$K = \begin{pmatrix} k(X_1, X_1) & \cdots & k(X_1, X_n) \\ \vdots & \ddots & \vdots \\ k(X_n, X_1) & \cdots & k(X_n, X_n) \end{pmatrix} \in \mathbb{R}^{nd \times nd}. \qquad (6)$$

For $f : X \to \mathbb{R}^d$ and a labeled example $(x, y)$, let $\rho_f(x, y) = f(x)_y - \max_{y' \ne y} f(x)_{y'}$ be the prediction margin. Then, the following analogue of Thm. 3 holds (see Appendix C for the proof).

Theorem 4. Assume that $\kappa = \sup_{x \in X, y \in [d]} k(x, x)_{y,y} < \infty$, and $K$ is full rank almost surely. Further, assume that $\ell(y, y') \ge 0$ for all $y, y' \in Y$ and $\ell(y, y) = 0$ for all $y \in Y$. Let $M_0 = \lceil \gamma\sqrt{n}/(4d\sqrt{\kappa}) \rceil$. Then, with probability at least $1 - \delta$, for any solution $f^*$ of the problem in Eq. (3),

$$P_{X,Y}(\rho_{f^*}(X, Y) \le 0) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\rho_{f^*}(X_i, Y_i) \le \gamma\} + \frac{4d\left(\sqrt{Y^\top K^{-1} Y} + 1\right)}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\log(2M_0/\delta)}{2n}}. \qquad (7)$$

Implications for neural classifiers. Our analysis has so far focused on kernel-based classifiers. While neural networks are not exactly kernel methods, many aspects of their performance can be understood via a corresponding linearized neural network (see Ortiz-Jiménez et al. (2021) and references therein). We follow this approach: given a neural network $f_\theta$ with current weights $\theta_0$, we consider the corresponding linearized neural network comprising the linear terms of the Taylor expansion of $f_\theta(x)$ around $\theta_0$ (Jacot et al., 2018; Lee et al., 2019):

$$f^{\mathrm{lin}}_\theta(x) \triangleq f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0).$$

Let $\omega \triangleq \theta - \theta_0$. The network $f^{\mathrm{lin}}_\omega(x)$ is a linear function of the parameters $\omega$, but is generally non-linear in the input $x$. Note that $\nabla_\theta f_{\theta_0}(x)$ acts as a feature representation, and induces the neural tangent kernel (NTK) $k_0(x, x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x') \in \mathbb{R}^{d \times d}$. Given a labeled dataset $S = \{(x_i, y_i)\}_{i \in [n]}$ and a loss function $L(f; S)$, the dynamics of gradient flow with learning rate $\eta > 0$ for $f^{\mathrm{lin}}_\omega$ can be fully characterized in the function space, and depend only on the predictions at $\theta_0$ and the NTK $k_0$:

$$\dot f^{\mathrm{lin}}_t(x') = -\eta\, K_0(x', x_{1:n})\, \nabla_f L(f^{\mathrm{lin}}_t(x_{1:n}); S), \qquad (8)$$

where $f(x_{1:n}) \in \mathbb{R}^{nd}$ denotes the concatenation of predictions on the training examples and $K_0(x', x_{1:n}) = \nabla_\theta f_{\theta_0}(x')^\top \nabla_\theta f_{\theta_0}(x_{1:n})$. Lee et al. (2019) show that as one increases the width of the network, or when $\theta - \theta_0$ does not change much during training, the dynamics of the linearized and original neural networks become close. When $f_\theta$ is sufficiently overparameterized and $L$ is convex with respect to $\omega$, then $\omega_t$ converges to an interpolating solution. Furthermore, for the mean squared error objective, the solution has the minimum Euclidean norm (Gunasekar et al., 2017). As the Euclidean norm of $\omega$ corresponds to the norm of $f^{\mathrm{lin}}_\omega(x) - f_{\theta_0}(x)$ in the vector-valued RKHS $H$ corresponding to $k_0$, training a linearized network to interpolation is equivalent to solving the following problem with a small $\lambda > 0$:

$$h^* = \operatorname{argmin}_{h \in H} \frac{1}{n}\sum_{i=1}^n \left(f_{\theta_0}(x_i) + h(x_i) - y_i\right)^2 + \frac{\lambda}{2}\|h\|_H^2. \qquad (9)$$

Therefore, the generalization bounds of Thm.
3 and 4 apply to $h^*$ with the supervision complexity of the residual targets $y_i - f_{\theta_0}(x_i)$. However, we are interested in the performance of $f_{\theta_0} + h^*$. As the proofs of these results rely on bounding the Rademacher complexity of hypothesis sets of the form $\{h \in H : \|h\| \le M\}$, and shifting a hypothesis set by a constant function does not change the Rademacher complexity (see Remark 7), these proofs can be easily modified to handle hypotheses shifted by the constant function $f_{\theta_0}$.
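As a concrete illustration of the NTK machinery above, the sketch below computes the empirical NTK block $k_0(x, x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')$ for a toy two-layer network in JAX. The model, its sizes, and all names are illustrative assumptions; the experiments in the paper use much larger architectures.

```python
import jax, jax.numpy as jnp

def empirical_ntk(f, params, x1, x2):
    """Empirical NTK block k_0(x1, x2) = J(x1) J(x2)^T in R^{d x d},
    where J(x) is the Jacobian of the d-dimensional output w.r.t. all parameters."""
    jac = lambda x: jax.jacobian(lambda p: f(p, x))(params)
    flatten = lambda j: jnp.concatenate(
        [jnp.reshape(leaf, (leaf.shape[0], -1)) for leaf in jax.tree_util.tree_leaves(j)],
        axis=1)
    return flatten(jac(x1)) @ flatten(jac(x2)).T  # shape (d, d)

# Toy two-layer network; weights and inputs are purely illustrative.
def net(params, x):
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2  # logits of shape (d,)

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
params = (jax.random.normal(key1, (16, 32)) / 4.0,
          jax.random.normal(key2, (32, 3)) / jnp.sqrt(32.0))
x1, x2 = jnp.ones(16), jnp.linspace(-1.0, 1.0, 16)
print(empirical_ntk(net, params, x1, x2).shape)  # (3, 3)
```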

3. KNOWLEDGE DISTILLATION: A SUPERVISION COMPLEXITY LENS

We now turn to KD, and explore how supervision complexity affects the student's generalization. We show that the student's generalization depends on three terms: the teacher's generalization, the student's margin with respect to the teacher predictions, and the complexity of the teacher's predictions.

3.1. TRADE-OFF BETWEEN TEACHER ACCURACY, MARGIN, AND COMPLEXITY

Consider the binary classification setting of §2, and a fixed teacher $g : X \to \mathbb{R}$ that outputs a logit. Let $\{(X_i, Y_i^*)\}_{i \in [n]}$ be $n$ i.i.d. labeled examples, where $Y_i^* \in \{-1, 1\}$ denotes the ground truth labels. For temperature $\tau > 0$, let $Y_i \triangleq 2\sigma(g(X_i)/\tau) - 1 \in [-1, +1]$ denote the teacher's soft predictions, where $\sigma : z \mapsto (1 + \exp(-z))^{-1}$ is the sigmoid function. Our key observation is the following: if the teacher predictions $Y_i$ are accurate enough and have significantly lower complexity than the ground truth labels $Y_i^*$, then a student kernel method (cf. Eq. (3)) trained with $Y_i$ can generalize better than one trained with $Y_i^*$. The following result quantifies the trade-off between teacher accuracy, student prediction margin, and teacher prediction complexity (see Appendix B.3 for a proof).

Proposition 5. Assume that $\kappa = \sup_{x \in X} k(x, x) < \infty$ and $K$ is full rank almost surely, $\ell(y, y') \ge 0$ for all $y, y' \in Y$, and $\ell(y, y) = 0$ for all $y \in Y$. Let $Y_i$ and $Y_i^*$ be defined as above, and let $M_0 = \lceil \gamma\sqrt{n}/(2\sqrt{\kappa}) \rceil$. Then, with probability at least $1 - \delta$, any solution $f^*$ of problem (3) satisfies

$$\underbrace{P_{X,Y^*}(Y^* f^*(X) \le 0)}_{\text{student risk}} \le \underbrace{P_{X,Y^*}(Y^* g(X) \le 0)}_{\text{teacher risk}} + \underbrace{\frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f^*(X_i))}_{\text{student's empirical margin loss w.r.t. teacher predictions}} + \underbrace{\frac{2\sqrt{Y^\top K^{-1} Y} + 2}{\gamma n}\sqrt{\operatorname{Tr}(K)}}_{\text{complexity of teacher's predictions}} + 3\sqrt{\frac{\ln(2M_0/\delta)}{2n}}.$$

Note that a similar result is easy to establish for multiclass classification using Thm. 4. The first term above accounts for the misclassification rate of the teacher. While this term is not irreducible (it is possible for a student to perform better than its teacher), generally a student performs worse than its teacher, especially when there is a significant teacher-student capacity gap. The second term is the student's empirical margin loss with respect to the teacher predictions. This captures the price of making teacher predictions too soft: intuitively, the softer (i.e., closer to zero) the teacher predictions are, the harder it is for the student to learn the classification rule. The third term accounts for the supervision complexity and the margin parameter $\gamma$. Thus, one has to choose $\gamma$ carefully to achieve a good balance between the empirical margin loss and the margin-normalized supervision complexity.

The effect of temperature. For a fixed margin parameter $\gamma > 0$, increasing the temperature $\tau$ makes the teacher's predictions $Y_i$ softer. On the one hand, the reduced scale decreases the supervision complexity $Y^\top K^{-1} Y$. Moreover, we shall see that in the case of neural networks the complexity decreases even further due to $Y$ becoming more aligned with the top eigenvectors of $K$. On the other hand, the scale of the predictions of the (possibly interpolating) student $f^*$ will decrease too, increasing the empirical margin loss. This suggests that setting the value of $\tau$ is not trivial: the optimal value can differ based on the kernel $k$ and the teacher logits $g(X_i)$.
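The following toy sketch illustrates the temperature effect described above: as $\tau$ grows, the soft targets $Y_i = 2\sigma(g(X_i)/\tau) - 1$ shrink toward zero and the supervision complexity $Y^\top K^{-1} Y$ decreases, while the margin of a student that matches these targets shrinks as well. The RBF kernel and the synthetic teacher logits are assumptions made purely for illustration.

```python
import jax, jax.numpy as jnp

def rbf_kernel(X, sigma=1.0):
    sq_dists = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq_dists / (2.0 * sigma ** 2))

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (200, 5))
teacher_logits = 3.0 * jnp.tanh(X[:, 0])          # assumed teacher logits g(X_i)
K = rbf_kernel(X) + 1e-6 * jnp.eye(200)           # kernel of a hypothetical student

for tau in [1.0, 2.0, 4.0, 8.0]:
    Y = 2.0 * jax.nn.sigmoid(teacher_logits / tau) - 1.0  # soft targets in [-1, 1]
    complexity = Y @ jnp.linalg.solve(K, Y)               # Y^T K^{-1} Y
    print(f"tau={tau}: mean |Y_i| = {jnp.mean(jnp.abs(Y)):.3f}, "
          f"supervision complexity = {complexity:.2f}")
```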

3.2. FROM OFFLINE TO ONLINE KNOWLEDGE DISTILLATION

We identified that supervision complexity plays a key role in determining the efficacy of a distillation procedure. The supervision from a fully trained teacher model can prove to be very complex for a student model in an early stage of its training (Fig. 1c). This raises the question: is there value in providing progressively difficult supervision to the student? In this section, we describe a simple online distillation method, where the teacher is updated during the student training.

Over the course of their training, neural models learn functions of increasing complexity (Kalimeris et al., 2019). This provides a natural way to construct a set of teachers with varying prediction complexities. Similar to Jin et al. (2019), for practical considerations of not training the teacher and the student simultaneously, we assume the availability of teacher checkpoints over the course of its training. Given $m$ teacher checkpoints at times $T = \{t_i\}_{i \in [m]}$, during the $t$-th step of distillation the student receives supervision from the teacher checkpoint at time $\min\{t' \in T : t' > t\}$. Note that the student is trained for the same number of epochs in total as in offline distillation. Throughout the paper we use the term "online distillation" for this approach (cf. Algorithm 1 of Appendix A).

Online distillation can be seen as guiding the student network to follow the teacher's trajectory in function space (see Fig. 1). Given that the NTK can be interpreted as a principled notion of example similarity and controls which examples affect each other during training (Charpiat et al., 2019), it is desirable for the student to have an NTK similar to that of its teacher at each time step. To test whether online distillation also transfers the NTK, we propose to measure the similarity between the final student and final teacher NTKs. For computational efficiency we work with NTK matrices corresponding to a batch of $b$ examples ($bd \times bd$ matrices). Explicit computation of even batch NTK matrices can be costly, especially when the number of classes $d$ is large. We propose to view the student and teacher batch NTK matrices (denoted by $K_f$ and $K_g$, respectively) as operators and to measure their similarity by comparing their behavior on random vectors:

$$\operatorname{sim}(K_f, K_g) = \mathbb{E}_{v \sim \mathcal{N}(0, I)}\left[\frac{\langle K_f v, K_g v\rangle}{\|K_f v\|\,\|K_g v\|}\right].$$

Note that the cosine distance is used to account for scale differences between $K_g$ and $K_f$. The kernel-vector products appearing in this similarity measure can be computed efficiently without explicitly constructing the kernel matrices. For example, $K_f v = \nabla_\theta f_\theta(x_{1:b})^\top \left(\nabla_\theta f_\theta(x_{1:b})\, v\right)$ can be computed with one vector-Jacobian product followed by a Jacobian-vector product. The former can be computed efficiently using backpropagation, while the latter can be computed efficiently using forward-mode differentiation.
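Below is a minimal sketch of the NTK similarity estimator described above, using one vector-Jacobian product (reverse mode) followed by one Jacobian-vector product (forward mode) per kernel-vector product, so that the $bd \times bd$ NTK matrices are never materialized. It assumes JAX-style apply functions mapping (params, inputs) to logits; all names are illustrative.

```python
import jax, jax.numpy as jnp

def ntk_vector_product(apply_fn, params, x_batch, v):
    """Computes K v for the batch NTK K = J J^T without forming K, where J is the
    Jacobian of the flattened batch outputs w.r.t. the parameters.
    One vector-Jacobian product (reverse mode) gives J^T v; a Jacobian-vector
    product (forward mode) then gives J (J^T v)."""
    flat_out = lambda p: jnp.ravel(apply_fn(p, x_batch))       # shape (b*d,)
    _, vjp_fn = jax.vjp(flat_out, params)
    (jt_v,) = vjp_fn(v)                                        # J^T v, params-shaped pytree
    _, k_v = jax.jvp(flat_out, (params,), (jt_v,))             # J (J^T v), shape (b*d,)
    return k_v

def ntk_similarity(apply_f, params_f, apply_g, params_g, x_batch, key, num_probes=8):
    """Monte Carlo estimate of sim(K_f, K_g) = E_v[<K_f v, K_g v> / (|K_f v| |K_g v|)]."""
    out_dim = jnp.ravel(apply_f(params_f, x_batch)).shape[0]
    sims = []
    for probe_key in jax.random.split(key, num_probes):
        v = jax.random.normal(probe_key, (out_dim,))
        kf_v = ntk_vector_product(apply_f, params_f, x_batch, v)
        kg_v = ntk_vector_product(apply_g, params_g, x_batch, v)
        sims.append(jnp.vdot(kf_v, kg_v) / (jnp.linalg.norm(kf_v) * jnp.linalg.norm(kg_v)))
    return jnp.mean(jnp.stack(sims))
```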

4. EXPERIMENTAL RESULTS

We now present experimental results to showcase the importance of supervision complexity in distillation, and to establish the efficacy of online distillation.

4.1. EXPERIMENTAL SETUP

We consider standard image classification benchmarks: CIFAR-10, CIFAR-100, and Tiny-ImageNet. Additionally, we derive a binary classification task from CIFAR-100 by grouping the first and last 50 classes into two meta-classes. We consider teacher and student architectures that are ResNets (He et al., 2016), VGGs (Simonyan & Zisserman, 2015), and MobileNets (Howard et al., 2019) of various depths. As a student architecture with relatively weaker inductive biases, we also consider LeNet-5 (LeCun et al., 1998) with 8 times wider hidden layers. We use standard hyperparameters (Appendix A) to train these models. We compare (1) regular one-hot training (without any distillation), (2) regular offline distillation using the temperature-scaled softmax cross-entropy, and (3) online distillation using the same loss. For CIFAR-10 and binary CIFAR-100, we also consider training with the mean squared error (MSE) loss and its corresponding KD loss:

$$L_{\mathrm{mse}}(f_\theta) = \frac{1}{2n}\sum_{i=1}^n \|y_i - f_\theta(x_i)\|_2^2, \qquad L_{\mathrm{kd\text{-}mse}}(f_\theta; g, \tau) = \frac{\tau^2}{2n}\sum_{i=1}^n \|\sigma(g(x_i)/\tau) - f_\theta(x_i)\|_2^2. \qquad (10)$$

The MSE loss allows for interpolation in the case of one-hot labels $y_i$, making it amenable to the analysis in §2 and §3. Moreover, Hui & Belkin (2021) show that under standard training, the CE and MSE losses perform similarly; as we shall see, the same is true for distillation as well. As mentioned in Sec. 1, in all KD experiments student networks receive supervision only through a knowledge distillation loss (i.e., dataset labels are not used). This choice helps us decrease differences between the theory and the experiments. Furthermore, in our preliminary experiments we observed that this choice does not result in student performance degradation (see Appendix D).
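For completeness, a minimal sketch of the MSE distillation loss is given below; the $\tau^2/2$ prefactor follows the reconstruction of Eq. (10) above, and the names are illustrative.

```python
import jax.numpy as jnp
from jax.nn import softmax

def kd_mse_loss(student_outputs, teacher_logits, tau=4.0):
    """MSE distillation loss of Eq. (10): the student regresses onto the
    temperature-scaled teacher probabilities. Shapes: (n, d)."""
    targets = softmax(teacher_logits / tau, axis=-1)
    return 0.5 * tau**2 * jnp.mean(jnp.sum((targets - student_outputs) ** 2, axis=-1))
```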

4.2. RESULTS AND DISCUSSION

Tables 1, 2, 3 and Table 5 (Appendix D) present the results (mean and standard deviation of test accuracy over 3 random trials). First, we see that online distillation with proper temperature scaling typically yields the most accurate student. The gains over regular distillation are particularly pronounced when there is a large teacher-student gap. For example, on CIFAR-100, ResNet to LeNet distillation with temperature scaling appears to hit a limit of ∼60% accuracy. Online distillation, however, manages to further increase accuracy by +6%, which is a ∼20% increase compared to standard training. Second, the similar results on binary CIFAR-100 show that "dark knowledge" in the form of membership information in multiple classes is not necessary for distillation to succeed. The results also demonstrate that knowledge distillation with the MSE loss of (10) behaves qualitatively similarly to KD with the CE objective.

We use these MSE models to highlight the role of supervision complexity. For a given network $f$ with NTK $k$ and a set of labeled examples $\{(X_i, Y_i)\}_{i \in [m]}$, we compute the adjusted supervision complexity

$$\frac{1}{n}\sqrt{(Y - f(X_{1:m}))^\top K^{-1} (Y - f(X_{1:m}))}\,\sqrt{\operatorname{Tr}(K)},$$

where $f$ denotes the current prediction function, and $K$ is derived from the current NTK. Note that the subtraction of the initial predictions is the appropriate way to measure complexity given the form of the optimization problem (9). As the training NTK matrix becomes aligned with the dataset labels during training (see Baratin et al. (2021) and Fig. 7 of Appendix D), we pick $\{X_i\}_{i \in [m]}$ to be a set of $2^{12}$ test examples.

Comparison of supervision complexities. We compare the adjusted supervision complexities of random labels, dataset labels, and the predictions of an offline and an online ResNet-56 teacher with respect to various checkpoints of the LeNet-5x8 network. The results presented in Fig. 1c indicate that the dataset labels and offline teacher predictions are as complex as random labels in the beginning. After some initial decline, the complexity of these targets increases as the network starts to overfit. Given the lower bound on the supervision complexity of random labels (see §2), this increase means that the NTK spectrum becomes less uniform (see Fig. 6 of Appendix D). In contrast to these static targets, the complexity of the online teacher predictions smoothly increases, and is significantly smaller for most of the epochs. To account for softness differences of the various targets, we also consider the adjusted supervision complexity normalized by the target norm $\sqrt{m}\,\|Y\|_2$. As shown in Fig. 2a, the normalized complexity of offline and online teacher predictions is smaller compared to the dataset labels, indicating a better alignment with the top eigenvectors of the LeNet NTK. Importantly, we see that the predictions of an online teacher have significantly lower normalized complexity in the critical early stages of training. Similar observations hold when complexities are measured with respect to a ResNet-20 network (see Appendix D).

Average teacher complexity. Ren et al. (2022) observed that teacher predictions fluctuate over time, and showed that using exponentially averaged teachers improves knowledge distillation. Fig. 2c demonstrates that the supervision complexity of an online teacher's predictions is always slightly larger than that of the average of the predictions of the teachers from the 10 preceding epochs.

Effect of temperature scaling. As discussed earlier, a higher temperature makes the teacher predictions softer, decreasing their norm. This has a large effect on supervision complexity (Fig. 2b).
Even when one controls for the norm of the predictions, the complexity still decreases (Fig. 2b).

NTK similarity. Remarkably, we observe that across all of our experiments, the final test accuracy of the student is strongly correlated with the similarity of the final teacher and student NTKs (see Figures 3, 9, and 10). This cannot be explained by better matching of the teacher predictions: the final fidelity (the rate of classification agreement of a teacher-student pair) measured on the training set has no clear relationship with test accuracy. Furthermore, we see that online KD results in better NTK transfer without an explicit regularization loss enforcing such transfer.
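A sketch of the adjusted supervision complexity used in Figs. 1c and 2a is given below. It assumes access to the (flattened) empirical NTK matrix of the probe examples and the network's current predictions on them; the normalizing constant follows the 1/n prefactor in the definition above (here taken as the number of rows of K), and the ridge term is only a numerical safeguard. All names are illustrative.

```python
import jax.numpy as jnp

def adjusted_supervision_complexity(K, Y, current_preds, ridge=1e-8):
    """Adjusted supervision complexity of targets Y with respect to a network whose
    current predictions on the probe examples are current_preds and whose empirical
    NTK matrix on those examples is K (shape (m*d, m*d)).

    Returns sqrt(residual^T K^{-1} residual) * sqrt(Tr(K)) / n,
    with residual = Y - current_preds (both flattened to shape (m*d,))."""
    residual = jnp.ravel(Y) - jnp.ravel(current_preds)
    K_reg = K + ridge * jnp.eye(K.shape[0])
    quad = residual @ jnp.linalg.solve(K_reg, residual)
    return jnp.sqrt(quad) * jnp.sqrt(jnp.trace(K)) / K.shape[0]
```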

5. RELATED WORK

Due to space constraints, we briefly discuss the existing works that are most closely related to our exploration in this paper (see Appendix E for a more comprehensive account of related work).

Supervision complexity. The key quantity in our work is the supervision complexity $Y^\top K^{-1} Y$. Cristianini et al. (2001) introduced a related quantity $Y^\top K Y$ called kernel-target alignment and derived a generalization bound with it for expected Parzen window classifiers. As an easy-to-compute proxy to supervision complexity, Deshpande et al. (2021) use kernel-target alignment for model selection in transfer learning. Ortiz-Jiménez et al. (2021) demonstrate that when NTK-target alignment is high, learning is faster and generalizes better. Arora et al. (2019) prove a generalization bound for overparameterized two-layer neural networks with NTK parameterization trained with gradient flow. Their bound is approximately $\left(Y^\top (K^\infty)^{-1} Y\right)^{1/2}/\sqrt{n}$, where $K^\infty$ is the expected NTK matrix at a random initialization. Our bound of Thm. 3 can be seen as a generalization of this result to all kernel methods, including linearized neural networks of any depth and sufficient width, with the only difference of using the empirical NTK matrix. Belkin et al. (2018) warn that bounds based on the RKHS complexity of the learned function can fail to explain the good generalization of kernel methods under label noise. Mobahi et al. (2020) prove that for kernel methods with RKHS norm regularization, self-distillation increases regularization strength. Phuong & Lampert (2019), while studying self-distillation of deep linear networks, derive a bound on the transfer risk that depends on the distribution of the acute angle between teacher parameters and data points. This is in spirit related to supervision complexity, as it measures an "alignment" between the distillation objective and the data. Ji & Zhu (2020) extend these results to linearized neural networks, showing that $\Delta_z^\top K^{-1} \Delta_z$, where $\Delta_z$ is the logit change during training, plays a key role in estimating the bound. Their bound is qualitatively different from ours, and $\Delta_z^\top K^{-1} \Delta_z$ becomes ill-defined for hard labels.

Non-static teachers. Multiple works consider various approaches that train multiple students simultaneously to distill either from each other or from an ensemble (see, e.g., Zhang et al., 2018; Anil et al., 2018; Guo et al., 2020). Zhou et al. (2018) and Shi et al. (2021) train the teacher and the student together, the former sharing a common architecture trunk and the latter regularizing the teacher's predictions to stay close to those of the student. Jin et al. (2019) study route constrained optimization, which is closest to the online distillation in §3.2: they employ a few teacher checkpoints to perform a multi-round KD. We complement this line of work by highlighting the role of supervision complexity and by demonstrating that online distillation can be very powerful for students with weak inductive biases.

B PROOFS

In this appendix we present the deferred proofs. We use the following definition of Rademacher complexity (Mohri et al., 2018).

Definition 6 (Rademacher complexity). Let $G$ be a family of functions from $Z$ to $\mathbb{R}$, and $Z_1, \ldots, Z_n$ be $n$ i.i.d. examples from a distribution $P$ on $Z$. Then, the empirical Rademacher complexity of $G$ with respect to $(Z_1, \ldots, Z_n)$ is defined as

$$\hat{R}_n(G) = \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{g \in G} \frac{1}{n}\sum_{i=1}^n \sigma_i g(Z_i)\right],$$

where the $\sigma_i$ are independent Rademacher random variables (i.e., uniform random variables taking values in $\{-1, 1\}$). The Rademacher complexity of $G$ is then defined as

$$R_n(G) = \mathbb{E}_{Z_1,\ldots,Z_n}\left[\hat{R}_n(G)\right]. \qquad (12)$$

Remark 7. Shifting the hypothesis class $G$ by a constant function $f$ does not change the empirical Rademacher complexity:

$$\hat{R}_n(\{f + g : g \in G\}) = \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{g \in G}\frac{1}{n}\sum_{i=1}^n \sigma_i\left(f(Z_i) + g(Z_i)\right)\right] \qquad (13)$$

$$= \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i) + \sup_{g \in G}\frac{1}{n}\sum_{i=1}^n \sigma_i g(Z_i)\right] \qquad (14)$$

$$= \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{g \in G}\frac{1}{n}\sum_{i=1}^n \sigma_i g(Z_i)\right] = \hat{R}_n(G), \qquad (15)$$

where the last equality uses $\mathbb{E}[\sigma_i] = 0$.

Given the kernel classification setting described in Sec. 2.1, we first prove a slightly more general variant of a classical generalization gap bound of Bartlett & Mendelson (2002, Theorem 21).

Theorem 8. Assume $\sup_{x \in X} k(x, x) < \infty$. Fix any constant $M > 0$. Then with probability at least $1 - \delta$, every function $f \in H$ with $\|f\|_H \le M$ satisfies

$$P_{X,Y}(Y f(X) \le 0) \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma\left(\operatorname{sign}(Y_i) f(X_i)\right) + \frac{2M}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (16)$$

Proof. Let $F = \{f \in H : \|f\| \le M\}$ and consider the following class of functions:

$$G = \{(x, y) \mapsto \phi_\gamma(\operatorname{sign}(y) f(x)) : f \in F\}. \qquad (17)$$

By the standard Rademacher complexity classification generalization bound (Mohri et al., 2018, Theorem 3.3), for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f \in F$:

$$\mathbb{E}_{X,Y}\left[\phi_\gamma(\operatorname{sign}(Y) f(X))\right] \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f(X_i)) + 2\hat{R}_n(G) + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (18)$$

Therefore, with probability at least $1 - \delta$, for all $f \in F$,

$$P_{X,Y}(Y f(X) \le 0) \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f(X_i)) + 2\hat{R}_n(G) + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (19)$$

To finish the proof, we upper bound $\hat{R}_n(G)$:

$$\hat{R}_n(G) = \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{g \in G}\frac{1}{n}\sum_{i=1}^n \sigma_i g(X_i, Y_i)\right] \qquad (20)$$

$$= \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n \sigma_i \phi_\gamma(\operatorname{sign}(Y_i) f(X_i))\right] \qquad (21)$$

$$\le \frac{1}{\gamma}\,\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n \sigma_i \operatorname{sign}(Y_i) f(X_i)\right] \qquad (22)$$

$$= \frac{1}{\gamma}\,\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i)\right] \qquad (23)$$

$$= \frac{1}{\gamma}\,\hat{R}_n(F),$$

where the third line (22) is due to Ledoux & Talagrand (1991), since $\phi_\gamma$ is $(1/\gamma)$-Lipschitz. By Lemma 22 of Bartlett & Mendelson (2002), we thus conclude that $\hat{R}_n(F) \le \frac{M}{n}\sqrt{\operatorname{Tr}(K)}$.

B.1 PROOF OF PROPOSITION 2

Proof of Proposition 2. As $K$ is a full rank matrix almost surely, with probability 1 there exists a vector $\alpha \in \mathbb{R}^n$ such that $K\alpha = Y$. Consider the function $f(x) = \sum_{i=1}^n \alpha_i k(X_i, x) \in H$. Clearly, $f(X_i) = Y_i$ for all $i \in [n]$. Furthermore, $\|f\|_H^2 = \alpha^\top K \alpha = Y^\top K^{-1} Y$. The existence of such an $f \in H$ with zero empirical loss, together with the assumptions on the loss function, implies that any optimal solution of problem (3) has norm at most $\sqrt{Y^\top K^{-1} Y}$.¹

B.2 PROOF OF THM. 3

Proof. To get a generalization bound for $f^*$, it is tempting to use Thm. 8 with $M = \|f^*\|$. However, $\|f^*\|$ is a random variable depending on the training data and is an invalid choice for the constant $M$. This issue can be resolved by paying a small logarithmic penalty. For any $M \ge M_0 = \lceil\gamma\sqrt{n}/(2\sqrt{\kappa})\rceil$ the bound of Thm. 8 is vacuous. Let us consider the set of integers $\mathcal{M} = \{1, 2, \ldots, M_0\}$ and write Thm. 8 for each element of $\mathcal{M}$ with failure probability $\delta/M_0$. By a union bound, with probability at least $1 - \delta$, all instances of Thm. 8 with $M$ chosen from $\mathcal{M}$ hold simultaneously. If $\sqrt{Y^\top K^{-1} Y} \ge M_0$, then the desired bound holds trivially, as the right-hand side becomes at least 1. Otherwise, we set $M = \lceil\sqrt{Y^\top K^{-1} Y}\rceil \in \mathcal{M}$ and consider the corresponding part of the union bound. We thus have that with probability at least $1 - \delta$, every function $f \in F$ with $\|f\| \le M$ satisfies

$$P_{X,Y}(Y f(X) \le 0) \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f(X_i)) + \frac{2M}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\ln(2M_0/\delta)}{2n}}.$$

As by Prop. 2 any optimal solution $f^*$ has norm at most $\sqrt{Y^\top K^{-1} Y}$ and $M \le \sqrt{Y^\top K^{-1} Y} + 1$, we have with probability at least $1 - \delta$,

$$P_{X,Y}(Y f^*(X) \le 0) \le \frac{1}{n}\sum_{i=1}^n \phi_\gamma(\operatorname{sign}(Y_i) f^*(X_i)) + \frac{2\sqrt{Y^\top K^{-1} Y} + 2}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\ln(2M_0/\delta)}{2n}}.$$

B.3 PROOF OF PROP. 5

Prop. 5 is a simple corollary of Thm. 3.

Proof. We have that

$$P_{X,Y^*}(Y^* f^*(X) \le 0) = P_{X,Y^*}(Y^* f^*(X) \le 0 \wedge Y^* g(X) \le 0) + P_{X,Y^*}(Y^* f^*(X) \le 0 \wedge Y^* g(X) > 0)$$

$$\le P_{X,Y^*}(Y^* g(X) \le 0) + P_X(g(X) f^*(X) \le 0).$$

The rest follows from bounding $P_X(g(X) f^*(X) \le 0)$ using Thm. 3.

C EXTENSION TO MULTICLASS CLASSIFICATION

Let us now consider the case of multiclass classification with $d$ classes. Let $k : X \times X \to \mathbb{R}^{d\times d}$ be a matrix-valued positive definite kernel. For every $x \in X$ and $a \in \mathbb{R}^d$, let $k_x a = k(\cdot, x)a$ be the function from $X$ to $\mathbb{R}^d$ defined by $(k_x a)(x') = k(x', x)a$ for all $x' \in X$. With any such kernel $k$ there is a unique vector-valued RKHS $H$ of functions from $X$ to $\mathbb{R}^d$. This RKHS is the completion of $\operatorname{span}\{k_x a : x \in X, a \in \mathbb{R}^d\}$, with the following inner product:

$$\left\langle \sum_{i=1}^n k_{x_i} a_i,\ \sum_{j=1}^m k_{x'_j} a'_j \right\rangle_H = \sum_{i=1}^n \sum_{j=1}^m a_i^\top k(x_i, x'_j)\, a'_j.$$

For any $f \in H$, the norm $\|f\|_H$ is defined as $\sqrt{\langle f, f\rangle_H}$. Therefore, if $f(x) = \sum_{i=1}^n k_{x_i} a_i$, then

$$\langle f, f\rangle_H = \sum_{i,j=1}^n a_i^\top k(x_i, x_j)\, a_j = a^\top K a, \qquad (28)$$

where $a^\top = (a_1^\top, \ldots, a_n^\top) \in \mathbb{R}^{nd}$ and

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} \in \mathbb{R}^{nd \times nd}.$$

Suppose $\{(X_i, Y_i)\}_{i\in[n]}$ are $n$ i.i.d. examples sampled from some probability distribution on $X \times Y$, where $Y \subset \mathbb{R}^d$. As in the binary classification case, we consider the regularized kernel problem (3). Let $Y^\top = (Y_1^\top, \ldots, Y_n^\top)$ be the concatenation of targets. The following proposition is the analog of Prop. 2 in this vector-valued setting.

Proposition 9. Assume $K$ is full rank almost surely. Assume also $\ell(y, y') \ge 0$ for all $y, y' \in Y$, and $\ell(y, y) = 0$ for all $y \in Y$. Then, with probability 1, for any solution $f^*$ of (3), we have $\|f^*\|_H^2 \le Y^\top K^{-1} Y$.

Proof. With probability 1, the kernel matrix $K$ is full rank. Therefore, there exists a vector $a^\top = (a_1^\top, \ldots, a_n^\top) \in \mathbb{R}^{nd}$, with $a_i \in \mathbb{R}^d$, such that $Ka = Y$. Consider the function $f(x) = \sum_{i=1}^n k_{X_i} a_i \in H$. Clearly, $f(X_i) = Y_i$ for all $i \in [n]$. Furthermore,

$$\|f\|_H^2 = a^\top K a = Y^\top K^{-1} Y. \qquad (32)\text{-}(33)$$

The existence of such an $f \in H$ with zero empirical loss and the assumptions on the loss function imply that any optimal solution of problem (3) has norm at most $\sqrt{Y^\top K^{-1} Y}$.

As in the main text, for a hypothesis $f : X \to \mathbb{R}^d$ and a labeled example $(x, y)$, let $\rho_f(x, y) = f(x)_y - \max_{y' \ne y} f(x)_{y'}$ be the prediction margin. We now restate Thm. 4 and present a proof.

Theorem 10 (Thm. 4 restated). Assume that $\kappa = \sup_{x \in X, y \in [d]} k(x, x)_{y,y} < \infty$, and $K$ is full rank almost surely. Further, assume that $\ell(y, y') \ge 0$ for all $y, y' \in Y$ and $\ell(y, y) = 0$ for all $y \in Y$. Let $M_0 = \lceil\gamma\sqrt{n}/(4d\sqrt{\kappa})\rceil$. Then, with probability at least $1 - \delta$, for any solution $f^*$ of the problem in Eq. (3),

$$P_{X,Y}(\rho_{f^*}(X, Y) \le 0) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\rho_{f^*}(X_i, Y_i) \le \gamma\} + \frac{4d\left(\sqrt{Y^\top K^{-1} Y} + 1\right)}{\gamma n}\sqrt{\operatorname{Tr}(K)} + 3\sqrt{\frac{\log(2M_0/\delta)}{2n}}. \qquad (34)$$

Proof. Consider the class of functions $F = \{f \in H : \|f\| \le M\}$ for some $M > 0$. By Theorem 2 of Kuznetsov et al. (2015), for any $\gamma > 0$ and $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for all $f \in F$:²

$$P_{X,Y}(\rho_f(X, Y) \le 0) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\rho_f(X_i, Y_i) \le \gamma\} + \frac{4d}{\gamma}\hat{R}_n(\tilde{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}},$$

where $\tilde{F} = \{(x, y) \mapsto f(x)_y : f \in F,\ y \in [d]\}$. Next, we upper bound the empirical Rademacher complexity
of $\tilde{F}$:

$$\hat{R}_n(\tilde{F}) = \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{y\in[d],\, h\in H,\, \|h\|\le M} \frac{1}{n}\sum_{i=1}^n \sigma_i\, h(X_i)_y\right] \qquad (36)$$

$$= \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{y\in[d],\, h\in H,\, \|h\|\le M} \frac{1}{n}\sum_{i=1}^n \sigma_i\, h(X_i)^\top \mathbf{y}\right] \qquad (\mathbf{y}\text{ is the one-hot encoding of } y) \qquad (37)$$

$$= \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{y\in[d],\, h\in H,\, \|h\|\le M} \left\langle h,\ \frac{1}{n}\sum_{i=1}^n \sigma_i\, k_{X_i}\mathbf{y}\right\rangle_H\right] \qquad \text{(reproducing property)} \qquad (38)$$

$$\le \frac{M}{n}\,\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{y\in[d]} \left\|\sum_{i=1}^n \sigma_i\, k_{X_i}\mathbf{y}\right\|_H\right] \qquad \text{(Cauchy-Schwarz)} \qquad (39)$$

$$\le \frac{M}{n}\sqrt{\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{y\in[d]} \left\|\sum_{i=1}^n \sigma_i\, k_{X_i}\mathbf{y}\right\|_H^2\right]} \qquad \text{(Jensen's inequality)} \qquad (40)$$

$$\le \frac{M}{n}\sqrt{\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sum_{y=1}^d \left\|\sum_{i=1}^n \sigma_i\, k_{X_i}\mathbf{y}\right\|_H^2\right]} \qquad (41)$$

$$= \frac{M}{n}\sqrt{\sum_{y=1}^d \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sum_{i=1}^n \left\|\sigma_i\, k_{X_i}\mathbf{y}\right\|_H^2 + \sum_{i\ne j}\left\langle \sigma_i\, k_{X_i}\mathbf{y},\ \sigma_j\, k_{X_j}\mathbf{y}\right\rangle_H\right]} \qquad (42)$$

$$= \frac{M}{n}\sqrt{\sum_{y=1}^d \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sum_{i=1}^n \left\|\sigma_i\, k_{X_i}\mathbf{y}\right\|_H^2\right]} \qquad \text{(independence of } \sigma_i) \qquad (43)$$

$$= \frac{M}{n}\sqrt{\sum_{y=1}^d \sum_{i=1}^n \mathbf{y}^\top k(X_i, X_i)\,\mathbf{y}} \qquad (44)$$

$$= \frac{M}{n}\sqrt{\operatorname{Tr}(K)}. \qquad (45)$$

The proof is concluded with the same reasoning as in the proof of Thm. 3.

D ADDITIONAL RESULTS AND DISCUSSION

In this appendix we present additional results and discussion to support the main findings of this work.

On early stopped teachers. Cho & Hariharan (2019) observe that offline KD sometimes works better with early stopped teachers. Such teachers have worse accuracy, and perhaps result in a smaller student margin, but they also have a significantly smaller supervision complexity (see Fig. 2a), which provides a possible explanation for this phenomenon.

Teaching students with weak inductive biases. As we saw earlier, a fully trained teacher can have predictions as complex as random labels for a weak student at initialization. This low alignment between the student NTK and the teacher predictions can result in memorization. In contrast, an early stopped teacher captures simple patterns and has a better alignment with the student NTK, allowing the student to learn these patterns in a generalizable fashion. This feature learning improves the student NTK and allows learning more complex patterns in future iterations. We hypothesize that this is the mechanism that allows online distillation to outperform offline distillation in some cases.

CIFAR-10 results. Table 5 presents the comparison of standard training, offline distillation, and online distillation on CIFAR-10. We see that the results are qualitatively similar to the CIFAR-100 results.

Comparison of supervision complexities. In the main text, we introduced the notion of adjusted supervision complexity, which for a given neural network $f(x)$ with NTK $k$ and a set of labeled examples $\{(X_i, Y_i)\}_{i\in[m]}$ is defined as

$$\frac{1}{n}\sqrt{(Y - f(X_{1:m}))^\top K^{-1}(Y - f(X_{1:m}))}\,\sqrt{\operatorname{Tr}(K)}. \qquad (46)$$

As discussed earlier, the subtraction of the initial predictions is the appropriate way to measure complexity given the form of the optimization problem (9). Nevertheless, it is also meaningful to consider the quantity

$$\frac{1}{n}\sqrt{Y^\top K^{-1} Y}\,\sqrt{\operatorname{Tr}(K)},$$

in order to measure the "alignment" of targets $Y$ with the NTK $k$. We call this quantity adjusted supervision complexity*. We compare the adjusted supervision complexities* of random labels, dataset labels, and the predictions of an offline and an online ResNet-56 teacher with respect to various checkpoints of the LeNet-5x8 network. The results presented in Fig. 4 are remarkably similar to the results with the adjusted supervision complexity (Fig. 1c and Fig. 2a). We therefore focus only on the adjusted supervision complexity of (46) when comparing various targets. The only other experiment where we compute adjusted supervision complexities* (i.e., without subtracting the current predictions from the labels) is presented in Fig. 7, where the goal is to demonstrate that the training labels become aligned with the training NTK matrix over the course of training.

Fig. 5 presents the comparison of adjusted supervision complexities, but with respect to a ResNet-20 network instead of a LeNet-5x8 network. Again, we see that in the early epochs dataset labels and offline teacher predictions are almost as complex as random labels. Unlike the case of the LeNet-5x8 network, random labels, dataset labels, and offline teacher predictions do not exhibit a U-shaped behavior. As for the LeNet-5x8 network, the shape of these curves is in agreement with the behavior of the condition number of the NTK (Fig. 6). Importantly, we still observe that the complexity of the online teacher predictions is significantly smaller compared to the other targets, even when we account for the norm of the predictions.

The effect of frequency of teacher checkpoints.
As mentioned earlier, throughout this paper we used one teacher checkpoint per epoch. While this served our goal of establishing the efficacy of online distillation, this choice is prohibitive for large teacher networks. To understand the effect of the frequency of teacher checkpoints, we conduct an experiment on CIFAR-100 with a ResNet-56 teacher and a LeNet-5x8 student with varying frequency of teacher checkpoints. In particular, we consider checkpointing the teacher once every {1, 2, 4, 8, 16, 32, 64, 128} epochs. The results presented in Fig. 8 show that reducing the teacher checkpointing frequency to once every 16 epochs results in only a minor performance drop for online distillation with τ = 4.

On label supervision in KD. So far, in all distillation methods, dataset labels were not used as an additional source of supervision for students. However, in practice it is common to train a student with a convex combination of the knowledge distillation and standard losses: $(1 - \alpha)L_{\mathrm{ce}} + \alpha L_{\mathrm{kd\text{-}ce}}$. To verify that the choice of $\alpha = 1$ does not produce unique conclusions regarding the efficacy of online distillation, we run experiments on CIFAR-100 with varying values of $\alpha$. The results presented in Table 6 confirm our main conclusions on online distillation. Furthermore, we observe that picking $\alpha = 1$ does not result in a significant degradation of student performance.

The effect of τ. To confirm that our main conclusions regarding online distillation do not depend on the temperature value, we present additional experiments on CIFAR-100 with τ = 2 in Table 7.

NTK similarity. Figures 9 and 10 present additional evidence that (a) the training NTK similarity of the final student and the teacher is correlated with the final student test accuracy; and (b) online distillation manages to transfer the teacher NTK better.

E EXPANDED DISCUSSION ON RELATED WORK

The key contributions of this work are the demonstration of the role of supervision complexity in student generalization, and the establishment of online knowledge distillation as a theoretically grounded and effective method. Both supervision complexity and online distillation have a number of relevant precedents in the literature that are worth comment.

Transferring knowledge beyond logits. In the seminal works of Buciluǎ et al. (2006) and Hinton et al. (2015), the transferred "knowledge" is in the form of output probabilities. Later works suggest other notions of "knowledge" and other ways of transferring knowledge (Gou et al., 2021). These include activations of intermediate layers (Romero et al., 2015), attention maps (Zagoruyko & Komodakis, 2017), classifier head parameters (Chen et al., 2022), and various notions of example similarity (Passalis & Tefas, 2018; Park et al., 2019; Tung & Mori, 2019; Tian et al., 2020; He & Ozay, 2021). Transferring the teacher NTK matrix belongs to this latter category of methods. Zhang et al. (2022) propose to transfer a low-rank approximation of a feature map corresponding to the teacher NTK.

Non-static teachers. Some works on KD consider non-static teachers. In order to bridge the teacher-student capacity gap, Mirzadeh et al. (2020) propose to perform a few rounds of distillation with teachers of increasing capacity. In deep mutual learning (Zhang et al., 2018; Chen et al., 2020), codistillation (Anil et al., 2018), and collaborative learning (Guo et al., 2020), multiple students are trained simultaneously, distilling from each other or from an ensemble. In Zhou et al. (2018) and Shi et al. (2021), the teacher and the student are trained together; in the former they share a common architecture trunk, while in the latter the teacher is penalized to keep its predictions close to the student's predictions. The closest method to the online distillation method of this work is route constrained optimization (Jin et al., 2019), where a few teacher checkpoints are selected for a multi-round distillation. Rezagholizadeh et al. (2022) employ a similar procedure but with an annealed temperature that decreases linearly with training time, followed by a phase of training with dataset labels only. The idea of distilling from checkpoints also appears in Yang et al. (2019), where a network is trained with a cosine learning rate schedule, simultaneously distilling from the checkpoint of the previous learning rate cycle.

Fundamental understanding of distillation. The effects of temperature, teacher-student capacity gap, optimization time, data augmentations, and other training details are non-trivial (Cho & Hariharan, 2019; Beyer et al., 2022; Stanton et al., 2021). It has been hypothesized, and shown to some extent, that teacher soft predictions capture class similarities, which is beneficial for the student (Hinton et al., 2015; Furlanello et al., 2018; Tang et al., 2020). Yuan et al. (2020) demonstrate that this softness of teacher predictions also has a regularization effect, similar to label smoothing. Menon et al. (2021) argue that teacher predictions are sometimes closer to the Bayes classifier than the hard labels of the dataset, reducing the variance of the training objective. The vanilla knowledge distillation loss also introduces some optimization biases. Mobahi et al. (2020) prove that for kernel methods with RKHS norm regularization, self-distillation increases regularization strength, resulting in solutions with smaller RKHS norm.
Phuong & Lampert (2019) prove that in a self-distillation setting, deep linear networks trained with gradient flow converge to the projection of the teacher parameters onto the data span, effectively recovering the teacher parameters when the number of training points is larger than the number of parameters. They derive a bound on the transfer risk that depends on the distribution of the acute angle between teacher parameters and data points. This is in spirit related to supervision complexity, as it measures an "alignment" between the distillation objective and the data. Ji & Zhu (2020) extend these results to linearized neural networks, showing that the quantity $\Delta_z^\top K^{-1} \Delta_z$, where $\Delta_z$ is the logit change during training, plays a key role in estimating the bound. The resulting bound is qualitatively different compared to ours, and $\Delta_z^\top K^{-1} \Delta_z$ becomes ill-defined for hard labels.

Supervision complexity. The key quantity in our work is $Y^\top K^{-1} Y$. Cristianini et al. (2001) introduced a related quantity $Y^\top K Y$ called kernel-target alignment and derived a generalization bound with it for expected Parzen window classifiers. As an easy-to-compute proxy to supervision complexity, Deshpande et al. (2021) use kernel-target alignment for model selection in transfer learning. Ortiz-Jiménez et al. (2021) demonstrate that when NTK-target alignment is high, learning is faster and generalizes better. Arora et al. (2019) prove a generalization bound for overparameterized two-layer neural networks with NTK parameterization trained with gradient flow. Their bound is approximately $\sqrt{Y^\top (K^\infty)^{-1} Y}/\sqrt{n}$, where $K^\infty$ is the expected NTK matrix at a random initialization. Our bound of Thm. 3 can be seen as a generalization of this result to all kernel methods, including linearized neural networks of any depth and sufficient width, with the only difference of using the empirical NTK matrix. Belkin et al. (2018) warn that bounds based on the RKHS complexity of the learned function can fail to explain the good generalization capabilities of kernel methods in the presence of label noise.

Future work. There are several potential directions for future work. Adaptive temperature scaling for online distillation, where the teacher predictions are smoothed so as to ensure low target complexity, is one such direction. Another avenue is to explore alternative ways of smoothing teacher predictions besides temperature scaling; e.g., can one perform sample-dependent scaling? There is large potential for improving online KD by making more informed choices for the frequency and positions of teacher checkpoints, and by controlling how much the student is trained between teacher updates. Finally, while we demonstrated that online distillation results in a better alignment with the teacher's NTK matrix, understanding why this happens is an open and interesting problem.



¹ For [0, 1]-bounded loss functions, it also holds that $\|f^*\|_H \le \sqrt{2/\lambda}$. This is not of direct relevance to us, as we will be interested in cases with small $\lambda > 0$.
² Note that their result is in terms of Rademacher complexity rather than empirical Rademacher complexity. The variant we use can be proved with the same proof, with a single modification of bounding $R_n(G)$ by the empirical Rademacher complexity of $G$ using Theorem 3.3 of Mohri et al. (2018).



Figure 1: Offline vs. online distillation. Figures (a) and (b) illustrate possible teacher and student function trajectories in offline and online KD, respectively. The yellow dotted lines indicate KD. Figure (c) plots the adjusted supervision complexity of various targets with respect to NTKs at different stages of training (see §4 for more details).

Figure 2: Supervision complexity for various targets. On the left: normalized adjusted supervision complexities of various targets with respect to a LeNet-5x8 network at different stages of its training. In the middle: the effect of temperature on the supervision complexity of an offline teacher for a LeNet-5x8 after training for 25 epochs. On the right: the effect of averaging teacher predictions.

Figure 3: Relationship between test accuracy, train NTK similarity, and train fidelity for CIFAR-100 students trained with either a ResNet-56 teacher (panels (a) and (c)) or a ResNet-110 teacher (panel (b)).


Algorithm 1: Online knowledge distillation.
Require: Training sample S; teacher checkpoints {g^(t_1), ..., g^(t_m)}; temperature τ > 0; number of training steps T; minibatch size b
1: for t = 1, ..., T do
2:   Draw a random b-sized minibatch S' from S
3:   Compute the nearest teacher checkpoint t* = min{i ∈ [m] : t_i > t}
4:   Update the student: θ ← θ − η_t ∇_θ L_kd-ce(f_θ; g^(t*), τ), evaluated on S'
5: end for
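A minimal sketch of Algorithm 1 is given below. It assumes a list of teacher checkpoint parameters aligned with step indices, a plain SGD update, and an iterator over input minibatches; learning rate schedules, momentum, and other details used in the actual experiments are omitted, and all names are illustrative.

```python
import jax, jax.numpy as jnp
from jax.nn import log_softmax, softmax

def kd_ce_loss(params, apply_student, teacher_logits, x, tau):
    """Eq. (2) on a minibatch: temperature-scaled cross-entropy to teacher probabilities."""
    t_probs = softmax(teacher_logits / tau, axis=-1)
    s_logp = log_softmax(apply_student(params, x) / tau, axis=-1)
    return -tau**2 * jnp.mean(jnp.sum(t_probs * s_logp, axis=-1))

def online_distillation(params, apply_student, apply_teacher, teacher_ckpts,
                        ckpt_steps, data_iter, tau=4.0, lr=0.1, num_steps=1000):
    """Algorithm 1 sketch: at step t, distill from the nearest upcoming teacher checkpoint."""
    grad_fn = jax.grad(kd_ce_loss)
    for t, x in zip(range(1, num_steps + 1), data_iter):
        # nearest checkpoint with t_i > t; fall back to the final teacher afterwards
        idx = next((i for i, s in enumerate(ckpt_steps) if s > t), len(ckpt_steps) - 1)
        teacher_logits = apply_teacher(teacher_ckpts[idx], x)
        grads = grad_fn(params, apply_student, teacher_logits, x, tau)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params
```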

Figure 4: Adjusted supervision complexities* of various targets with respect to a LeNet-5x8 network at different stages of its training. The experimental setup of the left and right plots matches that of Fig. 1c and Fig. 2a respectively.

Figure 5: Adjusted supervision complexities of various targets with respect to a ResNet-20 network at different stages of its training. Besides the network choice, the experimental setup of the left and right plots matches that of Fig. 1c and Fig. 2a respectively. Note that the y-axes are in logarithmic scale.



Figure 9: Relationship between test accuracy, train NTK similarity, and train fidelity for various teacher, student, and dataset configurations.

Results on CIFAR-100.

Results on Tiny ImageNet.

Results on binary CIFAR-100. Every second line is an MSE student.

Initial learning rates for different dataset and model pairs. In all experiments we use a stochastic gradient descent optimizer with batch size 128 and Nesterov momentum 0.9. The starting learning rates are presented in Table 4. All models for the CIFAR datasets are trained for 256 epochs, with a learning rate schedule that divides the learning rate by 10 at epochs 96, 192, and 224. All models for Tiny ImageNet are trained for 200 epochs, with a learning rate schedule that divides the learning rate by 10 at epochs 75 and 135. The learning rate is warmed up linearly to its initial value in the first 10 and 5 epochs for CIFAR and Tiny ImageNet models, respectively. All VGG and ResNet models use a weight decay of 2e-4, while MobileNet models use 1e-5.

Results on CIFAR-10. Every second line is an MSE student.

