CAN STUDENTS OUTPERFORM TEACHERS IN KNOWLEDGE DISTILLATION BASED MODEL COMPRESSION?

Abstract

Knowledge distillation (KD) is an effective technique for compressing a large model (teacher) into a compact one (student) by knowledge transfer. The ideal case is that the teacher is compressed to the small student without any performance drop. However, even for the state-of-the-art (SOTA) distillation approaches, there is still an obvious performance gap between the student and the teacher. The existing literature usually attributes this to model capacity differences between them. However, model capacity differences are unavoidable in model compression. In this work, we systematically study this question. By designing exploratory experiments, we find that model capacity differences are not necessarily the root reason, and that the distillation data matters when the student capacity is greater than a threshold. In light of this, we propose to go beyond in-distribution distillation and accordingly develop KD+. KD+ is superior to the original KD as it outperforms KD and the other SOTA approaches substantially, and it is more compatible with the existing approaches, further improving their performances significantly.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performances in various domains, but they require large amounts of computation and memory. This seriously limits their deployment under limited resources or strict latency requirements. One solution to this problem is knowledge distillation, which transfers the knowledge from a large network (teacher) to a small one (student). Hinton et al. (2015) proposed the original knowledge distillation (KD), which uses the softened logits of a teacher as supervision to train a student. To make the student better capture the knowledge from the teacher, the existing studies focus on aligning their representations using different criteria. However, there is still a significant performance gap between the teacher and the student. Figuring out the reason for this gap is essential for further improving the student performance. Mirzadeh et al. (2020) argue that the model capacity difference causes the failure to transfer the knowledge from a large teacher to a small student, thus leading to a large performance gap. Similarly, Cho & Hariharan (2019) point out that as the teacher grows in capacity and accuracy, it becomes difficult for the student to emulate the teacher. In this paper, we systematically study why students underperform teachers and how students can match or outperform teachers. We find that in most experimental settings of the existing literature, the root reason for the performance gap is not necessarily the capacity difference, as the student is powerful enough to memorize the teacher's outputs. The reason lies in the distillation dataset on which the knowledge is transferred. As an old proverb says, indigo comes from blue, but it is bluer than blue. In reality, it is not rare for human students to do better than their teachers. These excellent human students not only capture the knowledge from their teachers well but also learn more related knowledge on their own.
This gives an insight into how students in KD can match or outperform their teachers. We find that currently the students in KD have not well captured the knowledge in their teachers, as they only mimic the behavior of the teachers at sparse training data points. In light of this, we propose KD+, which goes beyond in-distribution distillation to substantially reduce the performance gap between students and teachers. Our main contributions are summarized as follows:

• Different from the common belief that model capacity differences result in the performance gap between students and teachers, we find that capacity differences are not necessarily the root reason; instead, the distillation data matters when students' capacities are greater than a threshold. To the best of our knowledge, this is the first work that systematically explores why small students underperform teachers and how students can outperform large teachers.

• By designing exploratory experiments, we find the following: (1) only fitting teachers' outputs at sparse training data points cannot make students well capture the local, in-distribution shapes of the teacher functions; (2) different from the case of standard supervised learning, out-of-distribution data (though not all of it) can be beneficial to knowledge distillation.

• Different from the existing work focusing on using different criteria to align representations or logits between teachers and students, we address knowledge distillation from a novel (data) perspective by going beyond in-distribution distillation and accordingly develop KD+.

• Extensive experiments demonstrate that KD+ largely reduces the performance gap between students and teachers, and even enables students to match or outperform their teachers. KD+ is superior to KD as it outperforms KD and more than 10 SOTA methods substantially, shows better compatibility with the existing methods, and is superior in the few-shot scenario.

2. RELATED WORK

The objective function of knowledge distillation can be simply expressed as a combination of the regular cross-entropy objective and a distillation objective. According to the distillation objective, the existing literature can be divided into logit-based approaches (Hinton et al., 2015) and representation-based approaches (Romero et al., 2015). Logit-based approaches construct the distillation objective based on output logits. Hinton et al. (2015) propose KD, which penalizes the softened logit differences between a teacher and a student. Park et al. (2019) propose to transfer data sample relations from a teacher to a student by aligning their logit-based structures. On the other hand, representation-based approaches design the distillation objective based on feature maps. FitNet (Romero et al., 2015) aligns the features of a teacher and a student through regressions. AT (Zagoruyko & Komodakis, 2017) distills feature attention from a teacher into a student. CRD (Tian et al., 2020) maximizes the mutual information between student and teacher representations. Other representation-based methods (Yim et al., 2017; Huang & Wang, 2017; Kim et al., 2018; Liu et al., 2019; Srinivas & Fleuret, 2018; Wang et al., 2018; Heo et al., 2019a; Cho & Hariharan, 2019; Ahn et al., 2019; Koratana et al., 2019; Aguilar et al., 2019; Shen & Savvides, 2020) use different criteria to align feature representations. SSKD (Xu et al., 2020) introduces extra self-supervision tasks to assist KD. Online knowledge distillation (Zhang et al., 2018b; Chen et al., 2020; Anil et al., 2018; Chung et al., 2020; Zhu et al., 2018) trains multiple students simultaneously. Self-distillation approaches (Furlanello et al., 2018; Yuan et al., 2020) train a DNN by using itself as the teacher. We observe that the existing studies focus on designing different criteria to align teacher-student representations or logits on in-distribution data.
In this work, we address knowledge distillation from a data perspective by embedding out-of-distribution distillation into a regularizer. Mirzadeh et al. (2020) observe that the model capacity gap results in the failure for transferring knowledge from a large teacher to a small student, thus causing a performance gap. To reduce this gap, they propose a multi-step knowledge distillation framework by using several intermediate-size networks (teacher assistants). However, the students still underperform the teachers substantially. Cho & Hariharan (2019) argue that as the teacher grows in capacity and accuracy, it is difficult for the student to emulate the teacher. To reduce the influence of the large capacity gap, they regularize both the teacher and the knowledge distillation by early stopping. We find that capacity differences are not necessarily the root reason when student capacities are greater than a threshold. On the other hand, KD+ goes beyond in-distribution distillation by exploring the knowledge between two training samples. Similar techniques have been used in many applications with different goals and mechanisms. Mixup (Zhang et al., 2018a) enforces local linearity of a DNN by linearly interpolating a random pair of training samples and their one-hot labels simultaneously. However, simply interpolating two labels may not match the generated sample as pointed out in (Guo et al., 2019) . KD+ does not have the above issue as it teaches a student to mimic the local shape of a powerful teacher. MixMatch (Berthelot et al., 2019b) linearly interpolates labeled and unlabeled data to improve the semi-supervised learning performances. ReMixMatch (Berthelot et al., 2019a) improves MixMatch by introducing distribution alignment and augmentation anchoring. DivideMix (Li et al., 2020) aims to learn with noisy labels by modifying MixMatch with label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. 
AugMix (Hendrycks et al., 2019) linearly interpolates original training samples and augmented training samples to improve the robustness and uncertainty estimates of DNNs.

3. REFORMULATING KD

Hinton et al. (2015) propose KD, which minimizes the softened logit differences between a student and a teacher over training data D_t = (X_t, Y_t), where X_t and Y_t are the training samples and the ground truth, respectively. The complete objective is:

L_KD = Σ_{(x_t, y_t) ∈ (X_t, Y_t)} [α L_CE(f_S, x_t, y_t) + β L_KL(f_S, f_T, x_t)]   (1)

where α and β are balancing weights and L_CE is the regular cross-entropy objective:

L_CE(f_S, x_t, y_t) = H(y_t, σ(f_S(x_t)))   (2)

where H(·) is the cross-entropy and σ is the softmax. L_KL in (1) is the distillation objective:

L_KL(f_S, f_T, x_t) = τ² KL(σ(f_T(x_t)/τ), σ(f_S(x_t)/τ))   (3)

where τ is a temperature used to generate soft labels and KL denotes the KL-divergence. KD can be considered as using one function (f_S) to fit the outputs of another function (f_T). We notice that in (1), L_CE requires both the data samples X_t and the corresponding ground truth Y_t, while L_KL only needs the data samples X_t for distilling the teacher knowledge. In light of this difference, we consider KD from a semi-supervised perspective and reformulate (1) in a more general form:

L = Σ_{(x_t, y_t) ∈ (X_t, Y_t)} α L_CE(f_S, x_t, y_t) + Σ_{x_d ∈ X_d} β L_KL(f_S, f_T, x_d)   (4)

where we introduce a new concept: the distillation dataset X_d, a set of samples on which the knowledge is transferred from a teacher to a student. The first term on the right-hand side of (4) is supervised while the second term is unsupervised. The widely used objective (1) is obviously a special case of (4) when X_d is set to X_t.
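The objective in (1) can be sketched numerically. Below is a minimal NumPy illustration (not the authors' implementation; the example logits are hypothetical) of the per-sample KD loss, combining the cross-entropy of (2) with the temperature-scaled KL term of (3):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax sigma(z / tau)."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, onehot_y, alpha=0.1, beta=0.9, tau=4.0):
    """Per-sample KD objective: alpha * H(y, sigma(f_S(x)))
    + beta * tau^2 * KL(sigma(f_T(x)/tau), sigma(f_S(x)/tau))."""
    onehot_y = np.asarray(onehot_y, dtype=float)
    p_s = softmax(student_logits)                       # student predictive distribution
    ce = -np.sum(onehot_y * np.log(p_s + 1e-12))        # cross-entropy with the ground truth
    p_t_tau = softmax(teacher_logits, tau)              # softened teacher labels
    p_s_tau = softmax(student_logits, tau)              # softened student predictions
    kl = np.sum(p_t_tau * np.log((p_t_tau + 1e-12) / (p_s_tau + 1e-12)))
    return alpha * ce + beta * tau**2 * kl

# hypothetical 3-class logits for one training sample
loss = kd_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], [1, 0, 0])
```

In the generalized form (4), the KL term is simply summed over a distillation dataset X_d instead of the training samples, with no ground-truth label required for that term.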

4. WHY DO SMALL STUDENTS UNDERPERFORM LARGE TEACHERS?

In this part, we systematically analyze the reason for the performance gap between students and teachers in KD-based model compression. We first introduce several definitions.

Definition 4.1 (Memorization Error, ME): For a given task with data distribution P(X, Y), the ME measures the degree to which a student f_S fits the outputs of a teacher f_T over the data distribution:

E(f_S, f_T, P) = E_{x ∼ P(X)} M(f_T(x), f_S(x))   (5)

where M denotes a distance metric such as the KL-divergence or the mean squared error. When the ME is 0, the student can completely memorize the outputs of the teacher over the data distribution. In this paper, we take the KL-divergence as M.

Definition 4.2 (Capable Students, CSTs, and Incapable Students, ISTs): A network f_S with parameters Θ_S is a CST of teacher f_T if there exists Θ_S such that E(f_S, f_T, P) = 0; otherwise, it is an IST.

Obviously, a CST is able to fully fit the teacher outputs over the data distribution P(X, Y). In contrast, an IST does not have the capacity to fit the teacher. For ISTs, the common belief holds: the student-teacher capacity gap causes the performance gap. For example, we cannot expect a two-layer neural network with 1,000 parameters to fit the outputs of a ResNet-101 with 1.7M parameters. However, the widely used students in the existing literature are far more powerful than this; we empirically show that these models are CSTs on commonly used benchmark datasets. To check whether a student f_S is a CST of teacher f_T on a task, we minimize the ME and check whether E(f_S, f_T, P) can reach 0. In practice, however, it is impossible to calculate E(f_S, f_T, P) as the data distribution P is typically unknown. Fortunately, we have access to a set of training data (X_t, Y_t), with which we approximate E(f_S, f_T, P) by the empirical error:

E_em(f_S, f_T, X_t) = (1 / |X_t|) Σ_{x_t ∈ X_t} M(f_T(x_t), f_S(x_t))   (6)

For comparison, we also evaluate two small neural networks that are expected to be ISTs, i.e., SN-2 and SN-3 with two and three layers, respectively.
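The empirical memorization error is just an average KL-divergence between teacher and student outputs over the training points. A minimal sketch (the "models" here are hypothetical logit tables, not the paper's networks):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def empirical_me(teacher_logits, student_logits):
    """E_em(f_S, f_T, X_t): mean KL(sigma(f_T(x_t)), sigma(f_S(x_t)))
    over the training data points."""
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p, q = softmax(t), softmax(s)
        total += np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))
    return total / len(teacher_logits)

# A CST can drive the empirical ME to 0 by matching the teacher exactly;
# a student stuck far from the teacher retains a large ME.
teacher_out = [[2.0, -1.0, 0.3], [0.1, 1.5, -0.5]]
me_cst = empirical_me(teacher_out, teacher_out)            # perfect memorization
me_ist = empirical_me(teacher_out, [[0.0, 0.0, 0.0]] * 2)  # weak, uniform student
```

Here me_cst is 0 while me_ist is clearly positive, mirroring the CST/IST distinction measured in Table 1.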
We report the ME in Table 1, where we adopt students and teachers that share the same architecture (e.g., WRN-40-2 and WRN-16-2) or use different architectures. As expected, the widely used students achieve an ME of 0.0 on these benchmark datasets, i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet, while the small networks (i.e., SN-2 and SN-3) have large MEs (e.g., 2.4 and 4.2), which demonstrates that the widely used students are CSTs. However, as observed in the existing literature, these CSTs underperform the teachers by a significant margin on the test data. This suggests that these students have well captured the knowledge at sparse training data points but have not well captured the local shapes of the teachers within the data distribution.

Corollary 4.1 In KD, for CSTs, only fitting the outputs of teachers at sparse training data points cannot enable them to well capture the local, in-distribution shapes of the teachers, thus leading to a performance gap. For ISTs, capacity differences cause the performance gap.

Proof: We empirically show this by comparing the student performances in the following two settings: (a) setting the distillation dataset to the training data points; (b) setting the distillation dataset to the real data distribution P(X). As P(X) is typically unknown in practice, we conduct a simulation experiment on CIFAR-100. We suppose that the union of the training dataset and the test dataset of CIFAR-100 can accurately represent the real data distribution for this task. Then we randomly draw data samples from the vicinity around the training data and the test data as the distillation dataset, i.e., X_d in (4). Consequently, the distillation dataset can sufficiently represent the real data sample distribution. Note that in these experiments, we never spy on the ground truth of the test samples, since the distillation dataset does not use ground truth, as shown in (4).
This means that the students are trained without any additional supervision compared with the teachers, as the training dataset (X_t, Y_t) in (4) does not change. As CSTs are able to fully memorize the outputs of the teachers, we expect them to achieve the same accuracies as, or higher accuracies than, those of the teachers. In contrast, we expect ISTs to achieve lower accuracies than those of the teachers. Table 2 shows the simulation results. As expected, all the CSTs outperform the teachers in the simulation experiments (i.e., Simulation KD). This is due to the following facts: first, by using the simulated distillation dataset, the distillation objective in (4) makes the CSTs fully capture the knowledge of the teachers within the data distribution; second, the cross-entropy objective in (4) enables the CSTs to learn their own knowledge. Consequently, the CSTs contain both the teacher knowledge and the knowledge learned on their own, which results in better performances than those of the teachers. SN-2 and SN-3 still underperform the teachers in the simulation experiments due to their limited capacities. These results empirically prove Corollary 4.1. The simulation experiments also suggest a way for CSTs to outperform teachers: sufficiently distill the knowledge in the teachers with a well-representative distillation dataset. Unfortunately, it is impossible to have such a distillation dataset, as the real data sample distribution P(X) is typically unknown in practice. Motivated by this, we propose to go beyond in-distribution distillation.

5.1. DIFFERENCES BETWEEN SUPERVISED LEARNING AND KNOWLEDGE DISTILLATION

As analyzed above, the reason for the performance gap lies in distillation datasets. Distillation datasets are different from training datasets. As shown in (4), training datasets contain both samples and their ground truth, which are used for standard supervised learning. In contrast, distillation datasets only contain samples without ground truth, which are used for knowledge distillation. Knowledge distillation and standard supervised learning differ substantially. Standard supervised learning learns a function f (e.g., a DNN) mapping x to y, where (x, y) follows the real data distribution P(X, Y). The quality of f is constrained by the training data (X_t, Y_t) that we have. In contrast, knowledge distillation learns a function (i.e., a student f_S) mapping x to z, where (x, z) follows a teacher-defined distribution Q(X, Z) with Z = σ(f_T(X)/τ). Q(X, Z) is different from P(X, Y) unless the teacher f_T is perfect. The advantage of knowledge distillation is that Q(X, Z) is more tractable than P(X, Y): for any given sample x, f_T can always produce the output f_T(x), and (x, σ(f_T(x)/τ)) follows Q(X, Z). Even for out-of-distribution samples, f_T can still output soft labels, although these soft labels are semantically meaningless. However, the existing literature ignores this advantage, as it only distills the knowledge at sparse training data points.

5.2. IMPROVING KD BY FURTHER EXPLORING TEACHER-DEFINED DISTRIBUTION

The simulation experiments demonstrate that only fitting sparse individual data points cannot necessarily enable students to well capture the local shapes of the teacher functions. As shown in Figure 1, where the yellow regions denote the real data sample distribution P(X), even if the student f_S perfectly fits the teacher f_T at each training data point, i.e., x_1, x_2, and x_3, their local shapes near these samples can still be highly different. To mitigate this issue, the typically used strategy is data augmentation (Simard et al., 1998), formalized by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001). In VRM, human knowledge is needed to define a vicinity or neighborhood around each training data point. Additional new data points can then be drawn from the vicinity distribution of the training data. For example, in image classification, it is common to define the vicinity of an image as the set of its random crops after mild padding and flipping. Nevertheless, data augmentation has its own limitation: a newly generated data point is very close to the original training data point, since they contain almost the identical object and differ only in backgrounds caused by padding or cropping. Due to this limitation, as shown in Figure 2, even if the student f_S fits the teacher f_T at all the training data points (i.e., x_1, x_2, and x_3) and the augmented data points (i.e., x_aug_1, x_aug_2, and x_aug_3), their local shapes can still differ substantially. To address the above issue, we propose KD+, which regularizes KD by enforcing students to mimic the behavior of teachers on the region between two training (or augmented) samples. As shown in Figure 3, KD+ first defines p − 1 points (i.e., p_1, p_2, ..., p_{p−1}) that evenly divide the region between two training samples (i.e., x_1 and x_2) into p pieces. We denote the set of p_1, p_2, ..., p_{p−1} by P. P contains both in-distribution points and out-of-distribution points.
KD+ enforces the students to mimic the behavior of the teachers on P, which serves as a data-driven regularizer. KD+ goes beyond in-distribution distillation as it also uses out-of-distribution points in the regularizer. As seen from Figure 3, the regularizer can make the student better explore and capture the local shape of the teacher. Consequently, the complete objective of KD+ is written as:

L_KD+ = L_KD + λ Σ_{p_i ∈ P} L_KL(f_S, f_T, p_i)   (7)

where λ is a balancing weight; simply setting λ to 1 works well. KD+ is a concise approach that requires no complex hyperparameter tuning and can sufficiently explore the knowledge in the teacher by using freely obtained in-distribution and out-of-distribution points as a regularizer.
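The construction of P in (7) can be sketched as follows. This is our reading of Figure 3, with helper names of our own choosing: for a pair of samples, generate the p − 1 evenly spaced points dividing the segment between them, then sum the distillation loss over those points.

```python
import numpy as np

def interpolation_points(x1, x2, p=3):
    """Return the p-1 points p_1, ..., p_{p-1} that evenly divide
    the segment between samples x1 and x2 into p pieces."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return [x1 + (i / p) * (x2 - x1) for i in range(1, p)]

def kdplus_regularizer(points, teacher_fn, student_fn, kl_fn, lam=1.0):
    """lambda * sum over p_i in P of L_KL(f_S, f_T, p_i):
    distill on the in-between points as a data-driven regularizer."""
    return lam * sum(kl_fn(student_fn(q), teacher_fn(q)) for q in points)

# two toy 4-dimensional "images"; p=3 yields two points at 1/3 and 2/3 of the segment
pts = interpolation_points([0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0], p=3)
```

With p = 2 this produces only the midpoint of each pair, matching the ablation setting discussed in Section 6.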

6. EXPERIMENTS FOR EVALUATING KD+

In this section, we first conduct an ablation study. Then we show that KD+ is superior to KD by (1) comparing KD+ with KD and other SOTA approaches, (2) showing that KD+ is more compatible with these approaches, and (3) showing the superiority of KD+ under the few-shot setting.

6.1. DATASETS, ARCHITECTURES, COMPETITORS, AND HYPER-PARAMETERS

Experiments are conducted on three benchmark datasets. CIFAR-100 (Krizhevsky & Hinton, 2009) has 100 classes with 50k training images and 10k test images. Tiny ImageNet has 200 classes with 100k training images and 10k test images. ImageNet (Deng et al., 2009) has 1000 classes with 1.28M training images and 50k validation images. We use the standard data augmentation strategy for each dataset. We adopt various modern architectures, i.e., ResNet (He et al., 2016), WRN (Zagoruyko & Komodakis, 2016), VGG (Simonyan & Zisserman, 2015), MobileNet (Sandler et al., 2018), and ShuffleNet (Ma et al., 2018). We compare KD+ with KD and several SOTA methods, i.e., FitNet (Romero et al., 2015), AT (Zagoruyko & Komodakis, 2017), SP (Tung & Mori, 2019), FT (Kim et al., 2018), NST (Huang & Wang, 2017), CC (Peng et al., 2019), FSP (Yim et al., 2017), PKT (Passalis & Tefas, 2018), AB (Heo et al., 2019b), VID (Ahn et al., 2019), RKD (Park et al., 2019), CRD (Tian et al., 2020), and SSKD (Xu et al., 2020). For these SOTA approaches, we report the author-reported results, or use the author-provided codes and the optimal hyper-parameters if they are publicly available. Otherwise, we use the implementation of Tian et al. (2020). We follow KD and set α, β, λ, and τ to 0.1, 0.9, 1, and 4, respectively, on all the datasets except ImageNet, where we follow the existing literature and set α = 1 and τ = 3. We train all the networks for 240, 100, and 120 epochs with SGD with momentum 0.9 on CIFAR-100, Tiny ImageNet, and ImageNet, respectively. We set the initial learning rate to 0.05 for ResNet, WRN, and VGG, and 0.01 for MobileNet and ShuffleNet. On CIFAR-100, the learning rate is divided by 10 every 30 epochs after the first 150 epochs. On Tiny ImageNet, the learning rate is divided by 5 every 30 epochs. On ImageNet, the learning rate is initialized to 0.1 and is divided by 10 every 30 epochs. More implementation details are reported in Appendix A.
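The CIFAR-100 schedule described above can be written out explicitly. This is a sketch of our reading of the text, not the authors' training script; in particular, the exact epochs at which the decay fires (here 150, 180, 210) are our assumption:

```python
def cifar100_lr(epoch, base_lr=0.05):
    """Learning rate at a given (0-indexed) epoch for the 240-epoch
    CIFAR-100 schedule: constant for the first 150 epochs, then
    divided by 10 every 30 epochs (assumed decay epochs: 150, 180, 210)."""
    if epoch < 150:
        return base_lr
    drops = (epoch - 150) // 30 + 1  # number of /10 decays applied so far
    return base_lr / (10 ** drops)

# learning rate at a few representative epochs of a 240-epoch run
schedule = [cifar100_lr(e) for e in (0, 149, 150, 180, 210, 239)]
```

For MobileNet and ShuffleNet the same shape applies with base_lr=0.01, per the text above.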
6.2. ABLATION STUDY

We investigate how the performance varies with the value of p. We also check how the performance varies with the number of points used in the regularizer of KD+, as P contains many more samples than the training dataset. We use r to denote the ratio of the number of training samples to the number of samples used in the regularizer. Table 3 reports the results of KD+ with different p and r. The value of p determines which points are included in the regularizer of KD+. When p=2, only the middle points between two training (or augmented) samples are used in the regularizer. These middle points have a high probability of being out-of-distribution, as they do not belong to any predefined class. As seen from Table 3, by distilling on these middle points (i.e., p=2) as a regularizer, KD+ outperforms KD significantly (e.g., from 73.54 to 75.21 on WRN-40-1), which demonstrates that out-of-distribution samples can be beneficial to knowledge distillation. Note that not all out-of-distribution samples are useful (e.g., samples randomly generated from a normal distribution are harmful, as shown in Appendix D). The reason for the usefulness of these middle points may be that they are not far from the real data distribution, as they share some statistics with the training data (e.g., the mean, the variance, and the relation among data dimensions). We also notice that the performance of KD+ is not sensitive to the values of r and p. The best performance is achieved when r=1:1 and p=3. Thus, we simply set r=1:1 and p=3 in the rest of the experiments.

6.3. COMPARISON WITH KD AND SOTA APPROACHES

Table 4 summarizes the comparison results on CIFAR-100. We have the following observations. First, there is an obvious performance gap between the students and the teachers for the existing approaches (e.g., KD, FitNet, and AT). Second, with a simple regularizer, KD+ substantially reduces the performance gap on all six teacher-student pairs, and even matches or outperforms the teachers on four pairs (denoted by underlines). On the other two teacher-student pairs, although the students still underperform the teachers, the performance gap is largely reduced by KD+. Note that there is no guarantee that KD+ makes students match or outperform teachers, as the regularizer in KD+ cannot fully compensate for the unknown data sample distribution. Third, KD+ consistently outperforms KD and the other SOTA approaches by a large margin across different architectures, which demonstrates the superiority of KD+. Fourth, on the pair of WRN-40-2 and VGG-8, almost all the representation-based approaches (e.g., FitNet and AT) fail to transfer knowledge from the teacher to the student, and even underperform the vanilla student. The reason is that WRN-40-2 and VGG-8 have extremely different architectures, so aligning their feature maps hurts the student performance. In contrast, KD+ shows its robustness and superiority in this case, and even enables the student VGG-8 to match the performance of the teacher WRN-40-2. Table 5 reports the comparison results on Tiny ImageNet. KD+ beats KD and the other approaches significantly, and even outperforms the teacher VGG-13, which demonstrates the effectiveness of KD+. We further evaluate KD+ on the large-scale ImageNet dataset. Limited by computation resources, we only adopt one teacher-student pair on ImageNet. We follow CRD and use ResNet-34 and ResNet-18 as the teacher and the student, respectively.
As shown in Table 6, KD+ improves the accuracy over KD and the other approaches significantly, which demonstrates the applicability and usefulness of KD+ on large-scale datasets. We also notice that there is still an obvious performance gap between the teacher and the student on ImageNet. The reason can be the model capacity difference, as we find that ResNet-18 is an IST of ResNet-34 on the large and complex ImageNet dataset.

6.4. COMPATIBILITY WITH SOTA APPROACHES

The existing SOTA approaches can be combined with KD to obtain further performance gain. We show that these approaches combined with KD+ are able to obtain more performance gain. As shown in Table 7 , the existing approaches when combined with KD+ consistently achieve much better performances than when combined with KD in all the settings where the teachers and the students use similar or different architectures. This demonstrates that KD+ has a better compatibility than KD and the regularizer of going beyond in-distribution distillation also benefits the existing approaches. 

7. CONCLUSION

In this paper, we systematically study why students underperform teachers and how students can outperform teachers under KD-based model compression. Through exploratory experiments, we find that model capacity differences are not necessarily the root reason and that the distillation data matters when the student capacity is greater than a threshold. Inspired by this, we propose KD+, which goes beyond in-distribution distillation. Extensive experiments demonstrate that KD+ is superior to KD as it outperforms KD and the other SOTA approaches substantially, is more compatible with the existing approaches, and shows an obvious superiority in the few-shot scenario.

A IMPLEMENTATION DETAILS

When the state-of-the-art approaches are combined with KD, the objective is:

L_CompKD = L_KD + c · L_distill

where c is a balancing weight. For all the experiments, we report the last-epoch test accuracy over 3 runs.

B TEACHER-STUDENT SHAPE DIFFERENCES

As stated in Corollary 4.1, only fitting the teacher outputs at sparse data points cannot enable students to well capture the local, in-distribution shapes of teachers. In this part, we show that the students trained with KD+ capture the local shapes of the teachers better than those trained with KD.

C COMPARISON WITH THE REGULARIZER OF INJECTING NOISE TO INPUTS

KD+ goes beyond in-distribution distillation by using a data-driven regularizer. We compare the regularizer in KD+ with the regularizer of injecting small noise into the inputs. Intuitively, distilling on noise-injected samples can also explore more knowledge in the teacher; we call this method NoiseKD. We grid search the best hyperparameter for NoiseKD by using different levels of Gaussian noise, i.e., N(0, 0.1), N(0, 0.05), N(0, 0.01), and N(0, 0.005). Table ?? reports the comparison results. We observe that when the noise in NoiseKD is large (e.g., N(0, 0.1) and N(0, 0.05)), NoiseKD even underperforms KD, which indicates that large noise is harmful for knowledge distillation. When the noise is relatively small (e.g., N(0, 0.01)), NoiseKD slightly improves the performance over KD, which indicates that small noise is useful for knowledge distillation. We also see that KD+ consistently outperforms NoiseKD with different levels of noise as regularizers, which demonstrates the superiority of the proposed regularizer.
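The NoiseKD baseline can be sketched as below. This is a minimal illustration with a toy batch, not the paper's implementation; the function names are ours, and whether the second parameter of N(0, ·) denotes a variance or a standard deviation is not specified in the text, so the sketch treats it as a standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_injected_samples(x_batch, sigma=0.01):
    """Perturb each input with small Gaussian noise and return the
    noisy samples, on which the teacher's outputs are then distilled."""
    x_batch = np.asarray(x_batch, dtype=float)
    return x_batch + rng.normal(0.0, sigma, size=x_batch.shape)

batch = np.zeros((4, 8))                         # a toy batch of 4 flattened inputs
noisy = noise_injected_samples(batch, sigma=0.01)
```

The noisy samples would then feed the same L_KL term as the training samples, i.e., the regularizer distills on x + ε rather than on interpolation points as KD+ does.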

D NOT ALL OUT-OF-DISTRIBUTION SAMPLES ARE USEFUL

In KD+, when p = 2, the regularizer almost exclusively uses out-of-distribution samples, as the middle points of two samples do not belong to any predefined class. The experimental results in Table 3 have shown that by distilling on these out-of-distribution samples as a regularizer, the student performance is improved substantially. Here, we show that not all out-of-distribution samples are useful for knowledge distillation. We randomly draw image-sized noise from a normal distribution. We then distill on these randomly generated noisy samples as a regularizer for KD (we denote this method by NoiseRegKD). The results are reported in Table 10. It is not surprising that the performances drop significantly, e.g., from 72.98 to 6.59 on VGG-8. This indicates that out-of-distribution samples far from the real data distribution are harmful for knowledge distillation.

E LEARNING DATA DISTRIBUTION WITH GENERATIVE MODELS

As stated in Corollary 4.1, only fitting the teacher outputs at sparse data points cannot enable students to well capture the local, in-distribution shapes of teachers. One natural idea is to use generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017; Liu et al., 2018) to learn the data distribution and then use the generator to produce fake data for knowledge distillation. However, training a GAN is computationally expensive, especially for large and complex datasets. We conduct an exploratory experiment by using a GAN to learn the data sample distribution on CIFAR-10 (Krizhevsky & Hinton, 2009), as a GAN can easily converge on CIFAR-10. We then distill on the generated fake data as a regularizer for KD. The results are reported in Table 11. It is observed that the GAN regularizer (i.e., KD-GAN) improves the performance over KD but underperforms KD+ substantially. This indicates that a GAN can generate some useful fake samples for knowledge distillation, but the diversity and usefulness of these samples are highly constrained by the training data. As it is almost impossible to learn the real data sample distribution from sparse training data points, KD+ compensates for this by going beyond in-distribution distillation and thus beats KD and the other approaches by a large margin.

F COMPATIBILITY WITH SOTA APPROACHES UNDER FEW-SHOT SCENARIO

In this part, we report the compatibility of KD+ with the existing approaches under the few-shot scenario, as this case can happen in reality where only a few samples are available due to privacy or confidentiality issues. The comparison results are reported in Table 12. It is observed that the existing approaches obtain much better performances when combined with KD+ than when combined with KD. Moreover, the overall accuracy improvement becomes larger when fewer training samples are available. The reason is that when the training data become extremely sparse, Corollary 4.1 holds strongly: only fitting sparse data points cannot enable the students to well capture the local shapes of the teachers. KD+ substantially mitigates this issue by using a regularizer to go beyond the sparse in-distribution distillation.
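A few-shot split of the kind evaluated here can be constructed by keeping a fixed fraction of each class. This is a sketch of one reasonable protocol (class-balanced subsampling with a fixed seed); the paper's exact split construction may differ.

```python
import random
from collections import defaultdict


def few_shot_indices(labels, fraction, seed=0):
    """Return indices of a class-balanced few-shot subset.

    `labels` lists the integer class label of every training sample; a
    `fraction` (e.g. 0.1 for the 10% setting) of each class is kept, with
    at least one sample per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    kept = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        kept.extend(idxs[: max(1, int(len(idxs) * fraction))])
    return sorted(kept)
```

The returned index list can be passed to a dataset wrapper (e.g. a subset sampler) so that both the distillation loss and the KD+ regularizer see only the few available samples.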

G TRAINING TIME OF KD AND KD+

As KD+ explores more knowledge in the teacher by going beyond in-distribution distillation, it is more computationally expensive than KD. We report the training time on CIFAR-100 with an RTX 2080Ti GPU. Both KD and KD+ are trained for 240 epochs. The training time is reported in Table 13.



Footnotes: (1) The code will be released online. (2) In this paper, we use KD to denote the original knowledge distillation algorithm (Hinton et al., 2015). (3) The ME values in Table 1 are accurate to 1 decimal place. (4) Tiny ImageNet: https://tiny-imagenet.herokuapp.com



Figure 1: KD with training data. Figure 2: KD with augmentation. Figure 3: KD+.

We expect the CSTs to achieve accuracies as high as or higher than those of the teachers. In contrast, we expect the ISTs to achieve lower accuracies than those of the teachers. Table 2 shows the simulation results. As expected, all the CSTs outperform the teachers in the simulation experiments (i.e., Simulation KD). This is due to the following facts: first, by using the simulated distillation dataset, the distillation objective in (4) makes the CSTs fully capture the knowledge of the teachers within the data distribution; second, the cross-entropy objective in (4) enables the CSTs to learn their own knowledge. Consequently, the CSTs contain both the teacher knowledge and the knowledge learned on their own, which results in better performances than those of the teachers. SN2 and SN3 still underperform the teachers in the simulation experiments due to their limited capacities. These results empirically prove Corollary 4.1.

ME of different networks on CIFAR-10, CIFAR-100, and Tiny ImageNet

Simulation results on CIFAR-100 in terms of test accuracy (%)

Ablation study on CIFAR-100 in terms of test accuracy (%)

Test accuracy on CIFAR-100. Underline denotes that students match or outperform teachers.

Test accuracies (%) on Tiny ImageNet.

Compatibility performances on CIFAR-100

Test accuracies on CIFAR-100 under the few-shot scenario.

In reality, it can happen that a powerful model is released, but only a few data samples are publicly accessible due to privacy or confidentiality issues. We evaluate KD+ under the few-shot scenario, where knowledge is transferred from a powerful teacher to a student with limited data. Table 8 presents the comparison results. We observe that KD+ outperforms KD and the other approaches by a large margin in all the cases with 60%, 40%, 20%, and 10% of the training data available. The superiority of KD+ becomes more obvious under the few-shot scenario, e.g., a 9.05% accuracy improvement over KD on ResNet8×4 with 10% training data. The reason is that under the few-shot scenario, the training data become extremely sparse. Corollary 4.1 holds strongly: only fitting sparse data points cannot enable the students to well capture the local shapes of the teachers. KD+ substantially mitigates this issue by using a regularizer to go beyond the sparse in-distribution distillation.

S-T DIFs (Shape differences) on CIFAR-100

The S-T DIFs of KD+ are consistently smaller than those of KD, which demonstrates that the student shapes of KD+ are closer to the teacher shapes and indicates that the regularizer benefits the students in capturing the local shapes of the teachers.

Training time of KD and KD+.

just like we cannot obtain 100% test accuracy by training a deep neural network on the training data samples and their ground truth of CIFAR-100.

A MORE IMPLEMENTATION DETAILS

Besides the hyper-parameters reported in the paper, below we report more implementation details.

We adopt the standard preprocessing and data augmentation strategies for each dataset. Each image is preprocessed by subtracting the mean of the whole training set and dividing by the standard deviation. We use the standard data augmentation strategy, i.e., random horizontal flipping, padding of 4 pixels for CIFAR (8 pixels for Tiny ImageNet), and then cropping to 32×32 for CIFAR (64×64 for Tiny ImageNet). On ImageNet, we use the widely used scale and aspect ratio augmentation strategy.

In the exploratory experiments, the architectures of SN2 and SN3 are Conv(128)-BN-AvgPooling(32)-FC and Conv(128)-BN-ReLU-Conv(256)-BN-ReLU-AvgPooling(16)-FC, respectively.

Following KD, we set α, β, λ, and τ to 0.1, 0.9, 1, and 4, respectively. On CIFAR-100, all the networks are trained for 240 epochs with SGD with momentum 0.9 and batch size 64. On Tiny ImageNet, all the networks are trained with SGD with momentum 0.9 for 100 epochs with batch size 64. On ImageNet, we train the network for 120 epochs with SGD with momentum 0.9 and batch size 256.

For the SOTA approaches, their objective is a combination of the regular cross-entropy loss and a distillation loss, L = L_CE + c · L_distill, where c is a weight for balancing the two terms. We report the author-reported results, or use the author-provided code and the optimal hyper-parameters from the original papers if they are publicly available. Otherwise, we use the implementation of Tian et al. (2020). Specifically, the hyper-parameters for each method are:

For the compatibility experiments, KD+ is combined with the existing SOTA approaches. The objective is written as L = L_KD+ + c · L_distill. The values of c have been reported above.
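With the hyper-parameters reported above (α = 0.1, β = 0.9, τ = 4), the standard KD objective can be sketched as follows. This follows the common formulation of Hinton et al. (2015); the exact weighting convention in the released code may differ.

```python
import torch
import torch.nn.functional as F


def kd_objective(s_logits, t_logits, targets, alpha=0.1, beta=0.9, tau=4.0):
    """Standard KD objective: a weighted sum of the cross-entropy loss on
    ground-truth labels and the softened-logit KL term (scaled by tau^2 so
    that gradient magnitudes are comparable across temperatures).
    """
    ce = F.cross_entropy(s_logits, targets)
    kl = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
    return alpha * ce + beta * kl
```

In training, `t_logits` are computed with the teacher in eval mode and without gradients, and the returned loss is minimized with the SGD settings listed above.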

H COMPARISON RESULTS BETWEEN BASELINES AND BASELINES+

To further explore the performances of the proposed approaches with different distillation methods, we compare the baselines with the baselines+. The baselines+ are obtained by using the P points to assist the baselines (note that the KD+ objective is not included). The comparison results are reported in Table 14. We observe that the baselines+ consistently outperform the baselines by a large margin (e.g., a 3.13% accuracy improvement from FitNet to FitNet+), which demonstrates the generalization and effectiveness of the proposed strategy across different distillation approaches.

