DO NOT BLINDLY IMITATE THE TEACHER: LOSS PERTURBATION FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation (KD) is a popular model compression technique to transfer knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. We argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution, and forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss improves the student generalizability by effectively distilling knowledge from a shifted distribution closer to the ground truth data. We also propose a method to compute this shifted teacher distribution, named Proxy Teacher, which enables us to select the perturbation coefficients in PTLoss. We theoretically show the perturbed loss reduces the deviation from the true population risk compared to the vanilla KL-based distillation loss functions. Experiments on three tasks with teachers of different scales show that our method significantly outperforms vanilla distillation loss functions and other perturbation methods.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved enormous success due to their massive sizes, expressive power, and the availability of large-scale data for model training. Accompanying this success is the increasing need to deploy such large-scale DNNs on resource-limited devices. Knowledge distillation (KD) is a widely used model compression technique that distills knowledge from large teacher models into a much smaller student model to preserve the teachers' predictive power (Buciluǎ et al., 2006; Hinton et al., 2015). In the teacher-student learning paradigm of KD, the student model is encouraged to imitate the teacher's outputs on a distillation dataset. Typical KD training objectives such as the KL loss (Hinton et al., 2015; Menon et al., 2021; Stanton et al., 2021) push the student's outputs as close to the teacher's as possible, which implicitly assumes the teacher's outputs on the distillation data are perfect. However, the teacher's predictive distributions can deviate from the ground truth due to various factors, such as the inductive bias encoded in the teacher's model architecture, miscalibration in the training procedure (Menon et al., 2021), or bias in the source dataset used to train the teacher (Liu et al., 2021; Lukasik et al., 2021). Forcing the student to blindly imitate the teacher's outputs can make the student inherit such biases and produce suboptimal predictions. A commonly used remedy is to scale the teacher's logits via a temperature parameter (Hinton et al., 2015). Menon et al. (2021) show that a proper temperature can improve the quality of the teacher's predictive distribution, but the shifting space offered by temperature scaling is limited, and finding the optimal temperature relies on expensive grid search.
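To make the setup above concrete, the following is a minimal sketch (not the paper's implementation) of the vanilla temperature-scaled KL distillation objective from Hinton et al. (2015); the function names and the example logits are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature yields a smoother
    # distribution, exposing the teacher's "dark knowledge" on wrong labels.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in Hinton et al. (2015) to keep gradient
    # magnitudes comparable across temperatures.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return temperature ** 2 * kl
```

The loss is zero when the student exactly matches the teacher and positive otherwise, which is precisely the "blind imitation" behavior the paper argues is sub-optimal when the teacher itself is biased.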
Along a separate line, label smoothing (Szegedy et al., 2016) was proposed as a general technique to regularize neural networks, and modulated loss functions (Lin et al., 2017; Leng et al., 2022) were designed to address several statistical issues in model training (e.g., overfitting and data imbalance). However, little work has explored tailoring such techniques for more robust knowledge distillation. We propose PTLoss for knowledge distillation, which revises the vanilla distillation loss to implicitly create a debiased teacher distribution closer to the ground truth. Instead of forcing an out-and-out imitation of the original teacher model, we relax the KL loss and add perturbations to the distillation objective. Specifically, we approximate the standard KL loss using its Maclaurin series, which allows us to construct a more flexible objective and to perturb its leading-order terms. To determine the extent of perturbation, we design a method to compute the distribution of the teacher implicitly shifted by the perturbation, named Proxy Teacher. With the computed proxy teacher distribution, we measure the empirical deviation between the perturbed teacher and the ground truth data. This leads to a systematic search strategy for the perturbation coefficients: the near-optimal coefficients should minimize the deviation between the distilled risk and the population risk on the validation set. Theoretically, we justify the effectiveness of PTLoss by proving that it reduces the deviation of the distilled empirical risk compared to the KL loss. We also draw a connection between PTLoss and other perturbation methods (e.g., label smoothing (Szegedy et al., 2016)), showing that PTLoss can debias the teacher to produce higher-fidelity outputs via finer-grained perturbation while subsuming existing perturbation techniques as special cases.
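The series view can be sketched as follows. For a fixed teacher, minimizing KL(p_t || p_s) is equivalent (up to a constant) to minimizing the cross-entropy -Σ_j p_t[j] log p_s[j], and the Maclaurin series -log(x) = Σ_{k≥1} (1-x)^k / k lets one rewrite the loss as a weighted sum of terms whose coefficients can then be perturbed. The code below is an illustrative sketch under these assumptions; the truncation order, coefficient values, and example distributions are ours, not the paper's:

```python
import math

def series_ce(p_teacher, p_student, coeffs):
    # Cross-entropy term of the KL loss, with -log(x) replaced by the
    # truncated series sum_k c_k * (1 - x)^k.  With c_k = 1/k this
    # recovers -log(x) as the truncation order grows; perturbing the
    # leading-order c_k is the spirit of PTLoss (values illustrative).
    loss = 0.0
    for pt, ps in zip(p_teacher, p_student):
        loss += pt * sum(c * (1.0 - ps) ** k
                         for k, c in enumerate(coeffs, start=1))
    return loss

K = 50                                       # truncation order
vanilla = [1.0 / k for k in range(1, K + 1)]  # approximates -log(x)
perturbed = vanilla[:]
perturbed[0] += 0.1                           # perturb the leading-order term

p_t = [0.7, 0.2, 0.1]  # example teacher distribution
p_s = [0.6, 0.3, 0.1]  # example student distribution
```

With the vanilla coefficients the series closely matches the exact cross-entropy, while the perturbed coefficients yield a different objective — equivalent, as the paper argues, to distilling from an implicitly shifted teacher distribution.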
Experiments on three datasets with different-sized teacher models demonstrate the empirical advantage of PTLoss. Moreover, PTLoss with coefficients found by the Proxy Teacher method significantly outperforms PTLoss with randomly searched coefficients, which shows the superiority of this systematic parameter search. In summary, our key contributions are:
• A perturbed loss function, PTLoss, which formulates the vanilla knowledge distillation loss as a Maclaurin series and perturbs it to improve the fidelity of teacher models;
• A Proxy Teacher method to solve for the implicitly shifted teacher and to determine the perturbation coefficients in PTLoss;
• A theoretical analysis proving that PTLoss lowers the bound on the distilled empirical risk and establishing its connection with other perturbation methods;
• Comprehensive experiments on three public datasets with different-sized teacher models demonstrating the advantage of PTLoss and the Proxy Teacher method.

2. RELATED WORK

2.1. KNOWLEDGE DISTILLATION

Knowledge distillation was first proposed in (Buciluǎ et al., 2006) to compress large model ensembles into smaller, faster models without a significant performance drop. The technique was generalized by (Hinton et al., 2015), where a temperature parameter is introduced to smooth the predictions and the student loss and distillation loss are integrated. With the prevalence of pre-trained language models (Devlin et al., 2018), distilling such large models for deployment on resource-limited edge devices has become increasingly urgent. For example, DistilBERT (Sanh et al., 2019) uses the teacher's soft prediction probabilities to train the student model; TinyBERT (Jiao et al., 2019) aligns the student's layer outputs (including attention outputs and hidden states) with the teacher's; MobileBERT (Sun et al., 2020) also adopts a layer-wise training objective and adds a bottleneck structure to distill from BERT-Large.

2.2. DISTILLATION THEORY

In parallel with the empirical success of knowledge distillation, many works have sought to explain its mechanism. Hinton et al. (2015) propose that the teacher's soft labels provide "dark knowledge" via the weights placed on wrong labels. Menon et al. (2021) present a statistical perspective on distillation: they observe that a good teacher model should be Bayesian so as to lower the variance of the student objective via the teacher's prediction distribution. Stanton et al. (2021) show the discrepancy between the teacher's and the student's output distributions and identify optimization difficulties in knowledge distillation. Ji & Zhu (2020); Zhou et al. (2021); Hsu et al. (2021) study distillation from several other aspects, but a gap remains between such theoretical analyses and better distillation techniques.

