DO NOT BLINDLY IMITATE THE TEACHER: LOSS PERTURBATION FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation (KD) is a popular model compression technique that transfers knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence between its output distribution and the teacher's output distribution. We argue that this learning objective is sub-optimal because a discrepancy exists between the teacher's output distribution and the ground-truth label distribution, and forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective, PTLoss, obtained by first representing the vanilla KL-based distillation loss as a Maclaurin series and then perturbing the leading-order terms of this series. The perturbed loss improves the student's generalizability by effectively distilling knowledge from a shifted distribution closer to the ground-truth data. We also propose a method, named Proxy Teacher, to compute this shifted teacher distribution, which enables us to select the perturbation coefficients in PTLoss. We theoretically show that the perturbed loss reduces the deviation from the true population risk compared to vanilla KL-based distillation losses. Experiments on three tasks with teachers of different scales show that our method significantly outperforms vanilla distillation losses and other perturbation methods.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved enormous success owing to their massive sizes, expressive power, and the availability of large-scale training data. Accompanying this success is a growing need to deploy such large-scale DNNs on resource-limited devices. Knowledge distillation (KD) is a widely used model compression technique that distills knowledge from large teacher models into a much smaller student model so as to preserve the teachers' predictive power (Buciluǎ et al., 2006; Hinton et al., 2015). In the teacher-student learning paradigm of KD, the student model is encouraged to imitate the teacher models' outputs on a distillation dataset. The typical KD training objective, such as the KL loss (Hinton et al., 2015; Menon et al., 2021; Stanton et al., 2021), pushes the student's outputs to be as close to the teacher's as possible. This implicitly assumes that the teacher's outputs on the distillation data are perfect. However, the teacher's predictive distributions can be biased away from the ground truth due to various factors, such as the inductive bias encoded in the teacher's model architecture, miscalibration in the training procedure (Menon et al., 2021), or bias in the source dataset used to train the teacher (Liu et al., 2021; Lukasik et al., 2021). Forcing the student to blindly imitate the teacher's outputs can make the student inherit such biases and produce suboptimal predictions. To overcome this challenge, one commonly used approach is to scale the teacher's logits via a temperature parameter (Hinton et al., 2015). Menon et al. (2021) show that a proper temperature can improve the quality of the teacher's predictive distribution, but the shifting space offered by temperature scaling is limited, and the optimal temperature value relies on expensive grid search.
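To make the vanilla objective and the temperature-scaling baseline concrete, the following is a minimal plain-Python sketch of the temperature-scaled KL distillation loss of Hinton et al. (2015); the function names are illustrative, not from any particular codebase, and a real implementation would operate on batched tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(p_teacher || p_student) on temperature-scaled output distributions.

    This is the vanilla distillation objective: it drives the student's
    distribution toward the teacher's, regardless of how far the teacher's
    distribution deviates from the ground-truth label distribution.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Note that the loss is zero exactly when the student matches the teacher, which is why a biased teacher distribution is inherited wholesale; temperature scaling can only reshape the teacher's distribution along the one-parameter family produced by rescaling its logits.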
Along a separate line, label smoothing (Szegedy et al., 2016) has been proposed as a general technique for regularizing neural networks, and modulated loss functions (Lin et al., 2017; Leng et al., 2022) have been designed to address several statistical issues in model training (e.g., overfitting and data imbalance). However, little work has explored tailoring such techniques for more robust knowledge distillation.
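For reference, label smoothing perturbs the target distribution rather than the loss itself: it mixes the one-hot label with the uniform distribution over the K classes. A minimal sketch (the function name is ours, for illustration only):

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing (Szegedy et al., 2016).

    Returns (1 - epsilon) * one_hot + epsilon * uniform, i.e. each class
    receives at least epsilon / K probability mass, which regularizes
    the network against over-confident predictions.
    """
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]
```

Unlike such a fixed, class-uniform shift, a distillation-specific perturbation must account for where the teacher's distribution actually deviates from the ground truth, which is the gap this work targets.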

