SCAFFOLDING A STUDENT TO INSTILL KNOWLEDGE

Abstract

We propose a novel knowledge distillation (KD) method to selectively instill teacher knowledge into a student model, motivated by situations where the student's capacity is significantly smaller than the teacher's. In vanilla KD, the teacher primarily sets a predictive target for the student to follow, and we posit that this target is overly optimistic given the student's limited capacity. We develop a novel scaffolding scheme in which the teacher, in addition to setting a predictive target, also scaffolds the student's prediction by censoring hard-to-learn examples. The student model uses the same information as the teacher's softmax predictions as inputs, and in this sense our proposal can be viewed as a natural variant of vanilla KD. We show on synthetic examples that censoring hard examples smooths the student's loss landscape, so that the student encounters fewer local minima and, as a result, generalizes well. Against vanilla KD, we achieve improved performance and are comparable to more intrusive techniques that leverage feature matching on benchmark datasets.

1. INTRODUCTION

A fundamental problem in machine learning is to design efficient and compact models with near state-of-the-art (SOTA) performance. Knowledge Distillation (KD) (Zeng & Martinez, 2000; Bucila et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015) is a widely used strategy for solving this problem, wherein the knowledge from a large pre-trained teacher model with SOTA performance is distilled onto a small student network.

Vanilla KD. Hinton et al. (2015) proposed the popular variant of KD that matches the student's soft predictions, s(x), with those of the pre-trained teacher, t(x), on inputs x. Informally, during student training, an additional loss term, D_KL(t(x), s(x)), is introduced that penalizes the difference between the student and teacher predictive distributions using the Kullback-Leibler (KL) divergence. This promotes the inter-class knowledge learned by the teacher. We will henceforth refer to vanilla KD as KD.

Capacity mismatch between student and teacher. One of the primary issues in KD is that the loss function is somewhat blind to the student's capacity to interpolate. In particular, when the student's capacity is significantly lower than the teacher's, we expect the student to follow the teacher only on those inputs realizable by the student. We are led to the following question: What can the teacher provide by way of predictive hints for each input, so that the student can leverage this information to learn to its full capacity?

Our Proposal: Scaffolding a Student to Distill Knowledge (DiSK). To address this question, we propose that the teacher, during training, not only set a predictive target, t(x), but also provide hints on hard-to-learn inputs. Specifically, the teacher utilizes its model to output a guide function, g(x), such that the student can selectively focus only on those examples that it can learn:
• if g(x) ≈ 1, the teacher discounts the loss incurred by the student on the input x;
• if g(x) ≈ 0, the teacher signals the input x as learnable by the student.
With this in mind, we modify the KL term in the KD objective and consider D_KL(t(x), ϕ(s(x), g(x))), where ϕ(s, g), to be defined later, satisfies ϕ(s, 0) = s, i.e., the student's prediction is unchanged when g offers no scaffolding. We must impose constraints on the guide function g to ensure that only hard-to-learn examples are scaffolded; in the absence of such constraints, the guide could declare all examples to be hard, and the student would no longer learn. We propose to do so by means of a budget constraint B(s, g) ≤ δ, which ensures that the guide can only help on a small fraction of examples. While more details are described in Sec. 3, in summary, our proposed problem is to minimize the empirical linear combination of the aforementioned KL term and a cross-entropy term under the empirical budget constraint. We emphasize that g(x) is used only during training; the inference logic for the student remains unchanged, as there is no need for g(x) during inference.

Guide-function-supported student training has three principal benefits, which are explored in Sec. 2.
• Censoring Mechanism. Our guide function censors examples that are hard for the student to learn. In particular, when there is a large capacity gap, the student clearly cannot fully follow the teacher. For this reason, the teacher must not only set an expectation for the student to predict, but also selectively gather the examples that the student has the capacity to predict.
• Smoothing the Loss Landscape. We also observe in our synthetic experiments that whenever scaffolding is powerful enough to correct the student's mistakes, the loss landscape undergoes a dramatic transformation. In particular, the loss seen by the guided student has fewer local minima.
• Good Generalization. The solution to our constrained optimization problem, in cases where the guide function is powerful, can ensure good student generalization.
Specifically, we can bound the statistical error in terms of the student's complexity and do not suffer additional complexity due to the teacher.

Contributions. We summarize our main results.
• We develop a novel approach to KD that exploits teacher representations to adjust the predictive target of the student by scaffolding hard-to-learn points. This scaffolding principle has wider applicability across other KD variants and is of independent interest.
• We design a novel response-matching KD method (Gou et al., 2021) that is particularly relevant in the challenging regime of large student-teacher capacity mismatch. We propose an efficient constrained optimization approach that produces powerful training scaffolds to learn guide functions.
• Using synthetic experiments, we explicitly illustrate the structural benefits of scaffolding. In particular, we show that under our approach, guides learn to censor difficult input points, thus smoothing the student's loss landscape and often eliminating suboptimal local minima in it.
• Through extensive empirical evaluation, we demonstrate that the proposed DiSK method: yields large and consistent accuracy gains over vanilla KD under large student-teacher capacity mismatch (up to 5% and 2% on CIFAR-100 and Tiny-ImageNet, respectively); produces student models that approach teacher accuracy with significantly smaller model complexity (e.g., an 8× computation reduction with ∼2% accuracy loss on CIFAR-100); and improves upon KD even under small student-teacher capacity mismatch, while remaining competitive with modern feature-matching approaches.
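To make the shape of the training objective concrete, the following minimal NumPy sketch shows one hypothetical instantiation of the guided objective and the budget. The scaffold ϕ(s, g) = (1 − g)·s + g·t, the budget B(s, g) = mean(g), and the greedy top-k guide below are illustrative assumptions only; the paper's exact definitions of ϕ and B, and the constrained optimization used to learn g, are deferred to Sec. 3.

```python
# Minimal sketch of a DiSK-style guided objective. The scaffold
# phi(s, g) = (1 - g) * s + g * t and the budget B(s, g) = mean(g)
# are HYPOTHETICAL illustrative choices, not the paper's definitions.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # D_KL(p || q), averaged over the batch.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def phi(s, g, t):
    # phi(s, 0) = s: no scaffolding. phi(s, 1) = t: the teacher fully
    # discounts the student's error on this (censored) example.
    return (1.0 - g[:, None]) * s + g[:, None] * t

def disk_loss(t_logits, s_logits, g, y_onehot, alpha=0.5, eps=1e-12):
    # Linear combination of the guided KL term and a cross-entropy term;
    # with g = 0 everywhere it reduces to a standard KD combination.
    t, s = softmax(t_logits), softmax(s_logits)
    scaffolded = phi(s, g, t)
    ce = -np.mean(np.sum(y_onehot * np.log(scaffolded + eps), axis=-1))
    return alpha * kl(t, scaffolded) + (1.0 - alpha) * ce

def budget_guide(per_example_loss, delta):
    # Greedy stand-in for the learned guide: censor (g = 1) only the
    # delta-fraction of points with the largest student loss, so the
    # budget constraint mean(g) <= delta holds by construction.
    n = len(per_example_loss)
    k = int(np.floor(delta * n))
    g = np.zeros(n)
    if k > 0:
        g[np.argsort(per_example_loss)[-k:]] = 1.0
    return g
```

In this toy instantiation, censoring a hard example (setting g = 1 on a point where the student disagrees with the teacher) reduces the guided loss relative to g = 0, which is the "discounting" behavior described above, while the budget caps how many points may be censored.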

2. ILLUSTRATIVE EXAMPLES

We present two synthetic examples to illustrate the structural phenomena enabled by the scaffolding approach DiSK, namely the censoring mechanism and the smoothing of the student's loss landscape, which lead to globally optimal test errors. We defer the exact specification of the algorithm to Sec. 3.

Example 1 (1D Intervals). Consider a toy dataset with one-dimensional features x ∈ [0, 9] and binary class labels y ∈ {Red, Blue}, as shown in Figure 1. There are two Blue-labelled clusters, in [2, 3] and in [5, 7]; the remaining points are labelled Red. We sample 1000 i.i.d. data points as the training set and 100 data points as the test set, with balanced data from both classes. We describe the details of the experimental setup, such as the models and learning procedure, in Appx. A.1. The teacher T belongs to the 2-interval function class, and the capacity-constrained student S belongs to the 1-interval function class. Since the teacher's capacity is sufficient to separate the two classes without error, it learns the correct classifier (see Figure 1). In contrast, the best possible student hypothesis cannot correctly separate the two classes. Hence, the student will have to settle onto one of the many

