SoTeacher: TOWARD STUDENT-ORIENTED TEACHER NETWORK TRAINING FOR KNOWLEDGE DISTILLATION

Abstract

How to train an ideal teacher for knowledge distillation is still an open problem. It has been widely observed that the best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current practice in teacher training and the distillation objective. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance via empirical risk minimization. Our analyses are inspired by recent findings that the effectiveness of knowledge distillation hinges on the teacher's capability to approximate the true label distribution of the training inputs. We theoretically establish that (1) the empirical risk minimizer can provably approximate the true label distribution of the training data if the loss function is a proper scoring rule and the hypothesis function is locally Lipschitz continuous around the training inputs; and (2) when data augmentation is employed for training, an additional constraint is required: the minimizer must produce consistent predictions across augmented views of the same training input. In light of our theory, we propose a teacher training method, SoTeacher, which renovates empirical risk minimization by incorporating Lipschitz regularization and consistency regularization. Experiments on two benchmark datasets confirm that SoTeacher can improve student performance significantly and consistently across various knowledge distillation algorithms and teacher-student pairs.

1. INTRODUCTION

Knowledge distillation aims to train a small yet effective student neural network following the guidance of a large teacher neural network (Hinton et al., 2015). It dates back to the pioneering idea of model compression (Buciluǎ et al., 2006) and has a wide spectrum of real-world applications, such as recommender systems (Tang & Wang, 2018; Zhang et al., 2020), question answering systems (Yang et al., 2020; Wang et al., 2020), and machine translation (Liu et al., 2020).

Despite the prosperous research interest in knowledge distillation, one of its crucial components, teacher training, is largely neglected. Existing training practice for teacher networks directly targets maximizing the performance of the teacher, which does not necessarily transfer to the performance of the student. Empirical evidence shows that a teacher trained toward convergence will yield an inferior student (Cho & Hariharan, 2019) and that regularization methods benefiting the teacher may, counterintuitively, degrade student performance (Müller et al., 2019). As also shown in Figure 1, a teacher trained toward convergence will consistently reduce the performance of the student after a certain point. This suggests a fundamental discrepancy between the common practice in neural network training and the learning objective of knowledge distillation.

In this work, we explore both the theoretical feasibility and the practical methodology of training the teacher toward student performance. Our analyses are built upon the recent understanding of knowledge distillation from a statistical perspective. Specifically, Menon et al. (2021) show that the soft prediction provided by the teacher is essentially an approximation to the true label distribution, and that using the true label distribution as supervision for the student improves the generalization bound compared to one-hot labels. Dao et al. (2021) show that the accuracy of the student is directly bounded by the distance between the teacher's prediction and the true label distribution through a Rademacher complexity analysis.

Figure 1: On CIFAR-100, we store a teacher checkpoint every 10 epochs and distill a student from it. We can observe that (1) a standard teacher trained toward better teacher performance may consistently deteriorate student performance upon knowledge distillation; and (2) a teacher trained using SoTeacher can achieve better student performance even with lower teacher performance.

Based on the above understanding, a teacher benefiting the student should be able to learn the true label distribution of the distillation data¹. Since in practice the distillation data is often reused from the teacher's training data, the teacher will have to learn the true label distribution of its own training data. This might appear infeasible using standard empirical risk minimization, as the teacher network often has enough capacity to fit all one-hot training labels, in which case distilling from teacher predictions should not outperform direct training on one-hot labels. Previous theoretical analyses tend to evade this dilemma by distilling from teacher predictions only on data that is not used in teacher training (Menon et al., 2021; Dao et al., 2021). Instead, we directly prove the feasibility of training the teacher to learn the true label distribution of its training data. We show that the standard empirical risk minimizer can approach the true label distribution of the training data as long as the loss function is a proper scoring rule and the hypothesis function is locally Lipschitz continuous around the training samples. We further show that when data augmentation is employed for training, our argument still holds under an additional constraint, i.e., predictions on the same training input under different augmentations have to be consistent.
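To make the proper-scoring-rule argument concrete, the following sketch (hypothetical numbers, numpy only) verifies numerically that the expected cross-entropy risk under a true label distribution p* is minimized exactly when the prediction equals p*. This is what licenses reading the empirical risk minimizer's soft output as an estimate of the true label distribution; the specific distribution and sampling here are illustrative, not taken from the paper.

```python
import numpy as np

def expected_cross_entropy(p_true, q):
    """E_{y ~ p_true}[-log q_y]: the risk of predicting q when labels follow p_true."""
    return -np.sum(p_true * np.log(q))

# Hypothetical true label distribution over K = 3 classes.
p_true = np.array([0.7, 0.2, 0.1])

# Risk when predicting the true distribution itself (equals the entropy of p_true).
risk_at_truth = expected_cross_entropy(p_true, p_true)

# Any other prediction incurs at least as much risk (Gibbs' inequality),
# which is the defining property of a proper scoring rule.
rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_cross_entropy(p_true, q) >= risk_at_truth - 1e-12
```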
In light of our theory, we show that explicitly imposing Lipschitz and consistency constraints in teacher training can facilitate the learning of the true label distribution and thus improve student performance. We conduct extensive experiments on two benchmark datasets using various knowledge distillation algorithms and different teacher-student architecture pairs. The results confirm that our method can improve student performance consistently and significantly.

To summarize, our main contributions are as follows.

• We show that it is theoretically feasible to train the teacher to learn the true label distribution of the distillation data even with data reuse, which explains why the current knowledge distillation practice works.

• We show that explicitly imposing Lipschitz and consistency regularization in teacher training can better learn the true label distribution and improve the effectiveness of knowledge distillation.

We believe our work is among the first attempts to explore the theory and practice of training a student-oriented teacher in knowledge distillation. We hope our exploration can serve as a stepping stone toward rethinking teacher training and unleashing the full potential of knowledge distillation.
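Although this excerpt does not specify the exact form of the two regularizers, a minimal numpy sketch of the kind of student-oriented objective described above might look as follows. The toy linear-softmax model, the finite-difference Lipschitz penalty, the symmetric-KL consistency penalty, and the weights `lam_lip`/`lam_con` are all illustrative assumptions, not SoTeacher's actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y):
    # Mean negative log-likelihood of one-hot labels given as integer indices.
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def lipschitz_penalty(f, x, eps=1e-2, rng=None):
    """Finite-difference estimate of local Lipschitz-ness of f around x:
    penalize how much the output moves under a small input perturbation."""
    rng = rng or np.random.default_rng(0)
    delta = rng.normal(size=x.shape)
    delta = eps * delta / (np.linalg.norm(delta, axis=-1, keepdims=True) + 1e-12)
    return np.mean(np.linalg.norm(f(x + delta) - f(x), axis=-1) / eps)

def consistency_penalty(p_view1, p_view2):
    """Symmetric KL between predictions on two augmented views of one input."""
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)), axis=-1)
    return np.mean(0.5 * (kl(p_view1, p_view2) + kl(p_view2, p_view1)))

# Toy linear-softmax "teacher" on random data (all shapes are arbitrary).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))        # 8 input dims, 4 classes
f = lambda x: softmax(x @ W)
x = rng.normal(size=(16, 8))                  # a batch of inputs
y = rng.integers(0, 4, size=16)               # labels as class indices
x_aug = x + 0.05 * rng.normal(size=x.shape)   # stand-in for data augmentation

lam_lip, lam_con = 0.1, 1.0                   # regularization weights (assumed)
loss = (cross_entropy(f(x), y)
        + lam_lip * lipschitz_penalty(f, x, rng=rng)
        + lam_con * consistency_penalty(f(x), f(x_aug)))
```

In a real training loop this scalar would be minimized by gradient descent over the teacher's parameters; the point of the sketch is only how the two regularizers attach to the standard empirical risk.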



¹ For simplicity, we refer to the student's training data in knowledge distillation as the distillation data (Stanton et al., 2021).



[Figure 1 panels: (left) test accuracy and training loss of the teacher at different epochs of training; (right) student performance when distilled from different teacher checkpoints.]

Standard practice of teacher network training. We study knowledge distillation in the context of multi-class classification. Specifically, we are given a set of training samples D = {(x_i, y_i)}_{i ∈ [N]}, where [N] ≡ {1, 2, …, N}. D is drawn from a probability distribution p_{X,Y} that is defined jointly over the input space X ⊂ R^d and the label space Y = [K]. In common practice, the teacher network

