SCAFFOLDING A STUDENT TO INSTILL KNOWLEDGE

Abstract

We propose a novel knowledge distillation (KD) method to selectively instill teacher knowledge into a student model, motivated by situations where the student's capacity is significantly smaller than the teacher's. In vanilla KD, the teacher primarily sets a predictive target for the student to follow, and we posit that this target is overly optimistic due to the student's lack of capacity. We develop a novel scaffolding scheme where the teacher, in addition to setting a predictive target, also scaffolds the student's prediction by censoring hard-to-learn examples. The student model utilizes the same information as the teacher's softmax predictions as inputs, and in this sense our proposal can be viewed as a natural variant of vanilla KD. We show on synthetic examples that censoring hard examples smooths the student's loss landscape, so that the student encounters fewer local minima and, as a result, has good generalization properties. Against vanilla KD, we achieve improved performance on benchmark datasets and are comparable to more intrusive techniques that leverage feature matching.

1. INTRODUCTION

A fundamental problem in machine learning is to design efficient and compact models with near state-of-the-art (SOTA) performance. Knowledge Distillation (KD) (Zeng & Martinez, 2000; Bucila et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015) is a widely used strategy for solving this problem, wherein the knowledge of a large pre-trained teacher model with SOTA performance is distilled onto a small student network.

Vanilla KD. Hinton et al. (2015) proposed the popular variant of KD that matches the student's soft predictions, s(x), with those of the pre-trained teacher, t(x), on inputs x. Informally, during student training an additional loss term, D_KL(t(x), s(x)), is introduced that penalizes the difference between the student and teacher predictive distributions using the Kullback-Leibler (KL) divergence. This promotes the inter-class knowledge learned by the teacher. We will henceforth refer to vanilla KD simply as KD.

Capacity mismatch between student and teacher. One of the primary issues in KD is that the loss function is somewhat blind to the student's capacity to interpolate. In particular, when the student's capacity is significantly lower than the teacher's, we expect the student to follow the teacher only on those inputs realizable by the student. We are led to the following question: What can the teacher provide by way of predictive hints for each input, so that the student can leverage this information to learn to its full capacity?

Our Proposal: Scaffolding a Student to Distill Knowledge (DiSK). To address this question, we propose that the teacher, during training, not only set a predictive target, t(x), but also provide hints on hard-to-learn inputs. Specifically, the teacher utilizes its model to output a guide function, g(x), such that the student can selectively focus only on those examples that it can learn:
• if g(x) ≈ 1, the teacher discounts the loss incurred by the student on the input x;
• if g(x) ≈ 0, the teacher signals the input x as learnable by the student.
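To make the contrast concrete, the following is a minimal sketch of the two loss functions described above. The vanilla KD term matches student and teacher softmax predictions via the KL divergence; the scaffolded variant down-weights each example's loss by 1 − g(x), so that examples flagged by the guide (g(x) ≈ 1) are censored. All function names, the temperature parameter, and the use of a precomputed per-example guide are illustrative assumptions, not the paper's exact implementation (which learns g jointly with the teacher's model).

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the class axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Per-example D_KL(p || q), summed over classes.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def vanilla_kd_loss(student_logits, teacher_logits, T=2.0):
    # Vanilla KD: penalize mismatch between teacher t(x) and student s(x).
    t = softmax(teacher_logits, T)
    s = softmax(student_logits, T)
    return kl_divergence(t, s).mean()

def scaffolded_kd_loss(student_logits, teacher_logits, guide, T=2.0):
    # Scaffolded variant: guide g(x) in [0, 1] censors hard examples.
    # g(x) ~ 1 discounts the example's loss; g(x) ~ 0 keeps it.
    t = softmax(teacher_logits, T)
    s = softmax(student_logits, T)
    per_example = kl_divergence(t, s)
    return ((1.0 - guide) * per_example).mean()
```

Note that setting g(x) = 0 everywhere recovers the vanilla KD loss, while g(x) = 1 everywhere censors the whole batch and yields zero loss, which is consistent with the guide acting as an interpolation between full distillation and no distillation.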

