RETHINKING SOFT LABELS FOR KNOWLEDGE DISTILLATION: A BIAS-VARIANCE TRADEOFF PERSPECTIVE

Abstract

Knowledge distillation is an effective approach to leverage a well-trained network, or an ensemble of them, referred to as the teacher, to guide the training of a student network. The outputs of the teacher network are used as soft labels to supervise the training of the new network. Recent studies (Müller et al., 2019; Yuan et al., 2020) revealed an intriguing property of soft labels: making labels soft serves as a good regularizer for the student network. From the perspective of statistical learning, regularization aims to reduce variance; however, it is unclear how bias and variance change when training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wise. Further, under the same distillation temperature setting, we observe that distillation performance is negatively associated with the number of certain samples, which we name regularization samples since they lead to an increase in bias and a decrease in variance. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. These discoveries inspire us to propose novel weighted soft labels that help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method.

1. INTRODUCTION

For deep neural networks (Goodfellow et al., 2016), knowledge distillation (KD) (Ba & Caruana, 2014; Hinton et al., 2015) refers to the technique of using well-trained networks to guide the training of another network. Typically, the well-trained network is called the teacher network, while the network to be trained is called the student network. For distillation, the predictions of the teacher network are leveraged and referred to as soft labels (Balan et al., 2015; Müller et al., 2019). Soft labels generated by the teacher network have proven effective in large-scale empirical studies (Liang et al., 2019; Tian et al., 2020; Zagoruyko & Komodakis, 2017; Romero et al., 2015) as well as in recent theoretical studies (Phuong & Lampert, 2019). However, the reason why soft labels benefit the student network is still not well explained. Giving a clear theoretical explanation is challenging: the optimization details of a deep network trained with common one-hot labels are still not well understood (Nagarajan & Kolter, 2019), let alone training with soft labels. Nevertheless, two recent studies (Müller et al., 2019; Yuan et al., 2020) shed light on intuitions about how soft labels work. Specifically, label smoothing, a special case of training with soft labels, is shown to regularize the activations of the penultimate layer of the network (Müller et al., 2019). The regularization property of soft labels is further explored by Yuan et al. (2020), who hypothesize that one main reason soft labels work in KD is the regularization they introduce. Based on this assumption, the authors design a teacher-free distillation method by turning the predictions of the student network itself into soft labels. Considering that soft labels are targets for distillation, the evidence of the regularization brought by soft labels drives us to rethink soft labels for KD: soft labels are both supervisory signals and regularizers.
Meanwhile, it is known that there is a tradeoff between fitting the data and imposing regularization, i.e., the bias-variance dilemma (Kohavi & Wolpert, 1996; Bishop, 2006), but it is unclear how bias and variance change for distillation with soft labels. Since the bias-variance tradeoff is a central issue in statistical learning, we investigate whether such a tradeoff exists for soft labels and how it affects distillation performance. We first compare the bias-variance decomposition of direct training with that of distillation with soft labels, and observe that distillation results in a larger bias and a smaller variance. Then, we rewrite the distillation loss as a regularization loss added to the direct training loss. By inspecting the gradients of the two terms during training, we notice that for soft labels the bias-variance tradeoff varies sample-wise. Moreover, by looking into a conclusion from Müller et al. (2019), we observe that under the same temperature setting, distillation performance is negatively associated with the number of certain samples. These samples increase bias and decrease variance, and we name them regularization samples. To investigate how regularization samples affect distillation, we first examine whether ad hoc filters on soft labels can avoid training with regularization samples. However, completely filtering out regularization samples also deteriorates distillation performance, leading us to speculate that regularization samples are not well handled by standard KD. In light of these findings, we propose weighted soft labels for distillation to handle the sample-wise bias-variance tradeoff, by adaptively assigning lower weights to regularization samples and larger weights to the others. To sum up, our contributions are:

• For knowledge distillation, we analyze how soft labels work from a bias-variance tradeoff perspective.
• We discover that the bias-variance tradeoff varies sample-wise. Also, we discover that under a fixed distillation temperature, the number of regularization samples is negatively associated with distillation performance.

• We design straightforward schemes to alleviate the negative impacts of regularization samples and then propose the novel weighted soft labels for distillation. Experiments on large-scale datasets validate the effectiveness of the proposed weighted soft labels.

2. RELATED WORKS

Knowledge distillation. Hinton et al. (2015) proposed to distill the outputs of large and cumbersome models into smaller and faster models, a technique named knowledge distillation. The outputs of the large networks are averaged and formulated as soft labels. Other kinds of soft labels have also been widely used for training deep neural networks (Szegedy et al., 2016; Pereyra et al., 2017). Treating soft labels as regularizers was already pointed out in Hinton et al. (2015), since soft labels can carry a lot of helpful information. More recently, Müller et al. (2019) showed the adverse effect of label smoothing on distillation. This is a thought-provoking discovery because both label smoothing and distillation exploit the regularization property of soft labels. Yuan et al. (2020) further investigated the regularization property of soft labels and proposed a teacher-free distillation scheme.

Distillation loss. One of our main contributions is an improved distillation loss. For adaptively adjusting the distillation loss, Tang et al. (2019) pay attention to hard-to-learn and hard-to-mimic samples, where the latter are weighted based on the prediction gap between teacher and student. However, this does not account for the teacher giving incorrect guidance to the student, in which case the prediction gap remains large and such a method may hurt performance. Saputra et al. (2019) transfer the teacher's guidance only on samples where the teacher outperforms the student, while Wen et al. (2019) deal with incorrect guidance via a probability-shifting strategy. Our approach differs from the above methods in both motivation and the proposed solution.

Bias-variance tradeoff.
Bias-variance tradeoff is a well-studied topic in machine learning (Kohavi & Wolpert, 1996; Domingos, 2000; Valentini & Dietterich, 2004; Bishop, 2006) and for neural networks (Geman et al., 1992; Neal et al., 2018; Belkin et al., 2019; Yang et al., 2020). We now present the bias-variance decomposition for $L_{ce}$ and $L_{kd}$, following the definitions and notation of Heskes (1998). First, we denote the training dataset as $D$ and the output distribution on a sample $x$ of the network trained without distillation as $\hat{y}_{ce} = f_{ce}(x; D)$. The network trained with distillation also depends on the teacher network, so we define its output on $x$ as $\hat{y}_{kd} = f_{kd}(x; D, T)$, where $T$ is the selected teacher network. Then, let the averaged outputs of $\hat{y}_{ce}$ and $\hat{y}_{kd}$ be

$$\bar{y}_{ce} = \frac{1}{Z_{ce}} \exp\!\big(\mathbb{E}_D[\log \hat{y}_{ce}]\big), \qquad \bar{y}_{kd} = \frac{1}{Z_{kd}} \exp\!\big(\mathbb{E}_{D,T}[\log \hat{y}_{kd}]\big),$$

where $Z_{ce}$ and $Z_{kd}$ are normalization constants. According to Heskes (1998), the expected error on a sample $x$ with ground-truth label $y = t(x)$ decomposes as

$$\begin{aligned}
\mathrm{error}_{ce} &= \mathbb{E}_{x,D}[-y \log \hat{y}_{ce}] \\
&= \mathbb{E}_{x,D}\!\left[-y \log y + y \log \frac{y}{\bar{y}_{ce}} + y \log \frac{\bar{y}_{ce}}{\hat{y}_{ce}}\right] \\
&= \mathbb{E}_x[-y \log y] + \mathbb{E}_x\!\left[y \log \frac{y}{\bar{y}_{ce}}\right] + \mathbb{E}_D \mathbb{E}_x\!\left[y \log \frac{\bar{y}_{ce}}{\hat{y}_{ce}}\right] \\
&= \mathbb{E}_x[-y \log y] + D_{KL}(y, \bar{y}_{ce}) + \mathbb{E}_D[D_{KL}(\bar{y}_{ce}, \hat{y}_{ce})] \\
&= \text{intrinsic noise} + \text{bias} + \text{variance},
\end{aligned} \tag{2}$$

where $D_{KL}$ is the Kullback-Leibler divergence. The derivation of the variance term uses the facts that $\log \bar{y}_{ce} - \mathbb{E}_D[\log \hat{y}_{ce}]$ is a constant and $\mathbb{E}_x[y] = \mathbb{E}_x[\bar{y}_{ce}] = 1$. Detailed derivations can be found in Eq. (4) of Heskes (1998). Next, we analyze the bias-variance decomposition of $L_{kd}$. As mentioned above, training with soft labels introduces extra randomness through the selection of a teacher network. In Fig. 1, we illustrate the corresponding bias and variance for the selection process of a set of soft labels generated by a teacher network.
In this case, a high-variance model (grey point) is closer to the one-hot trained model (black point), while a low-variance model is closer to the other possible models trained with soft labels (red points). Although KD has more sources of randomness, the overall variance brought by $L_{kd}$ is not necessarily higher than that of $L_{ce}$. In fact, existing empirical results strongly suggest that the overall variance is smaller with KD. For example, students trained with soft labels are better calibrated than one-hot baselines (Müller et al., 2019), and KD makes the predictions of students more consistent under adversarial noise (Papernot et al., 2016). Here, we summarize this empirical evidence as an assumption:

Assumption 1. The variance brought by KD is smaller than that of direct training, that is, $\mathbb{E}_{D,T}[D_{KL}(\bar{y}_{kd}, \hat{y}_{kd})] \le \mathbb{E}_D[D_{KL}(\bar{y}_{ce}, \hat{y}_{ce})]$.

Similar to Eq. (2), we write the decomposition for $L_{kd}$ as

$$\mathrm{error}_{kd} = \mathbb{E}_x[-y \log y] + D_{KL}(y, \bar{y}_{ce}) + \mathbb{E}_x\!\left[y \log \frac{\bar{y}_{ce}}{\bar{y}_{kd}}\right] + \mathbb{E}_{D,T}[D_{KL}(\bar{y}_{kd}, \hat{y}_{kd})]. \tag{3}$$

An observation here is that $\bar{y}_{ce}$ converges to the one-hot labels while $\bar{y}_{kd}$ converges to the soft labels, so $\bar{y}_{ce}$ is closer to the one-hot ground-truth distribution $y$ than $\bar{y}_{kd}$, i.e., $\mathbb{E}_x[y \log (\bar{y}_{ce}/\bar{y}_{kd})] \ge 0$. If we rewrite $L_{kd}$ as $L_{kd} = (L_{kd} - L_{ce}) + L_{ce}$, then $L_{kd} - L_{ce}$ increases the bias by $\mathbb{E}_x[y \log(\bar{y}_{ce}/\bar{y}_{kd})]$ and decreases the variance by $\mathbb{E}_D[D_{KL}(\bar{y}_{ce}, \hat{y}_{ce})] - \mathbb{E}_{D,T}[D_{KL}(\bar{y}_{kd}, \hat{y}_{kd})]$. From the above analysis, we separate $L_{kd}$ into two terms: $L_{kd} - L_{ce}$ leads to variance reduction, and $L_{ce}$ leads to bias reduction. In the following sections, we first analyze how $L_{kd} - L_{ce}$ links to the bias-variance tradeoff during training. Then we analyze how the relative importance of bias reduction and variance reduction changes during training with soft labels.
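To make Eq. (2) concrete, the following sketch (plain Python; all helper names are ours, not from the paper) simulates the predictions of several models trained on different draws of D and checks numerically that the Heskes (1998) decomposition is exact: the expected cross-entropy error equals intrinsic noise plus bias plus variance, with the "average model" defined as the normalized geometric mean of the predictions.

```python
import math
import random

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(0)
K = 5  # number of classes

# Simulate predictions of 20 models, each "trained" on a different draw of D.
preds = []
for _ in range(20):
    raw = [random.random() + 0.1 for _ in range(K)]
    s = sum(raw)
    preds.append([r / s for r in raw])

y = [1.0 if k == 2 else 0.0 for k in range(K)]  # one-hot ground truth

# "Average model" ybar = (1/Z) * exp(E_D[log yhat]): normalized geometric mean.
mean_log = [sum(math.log(p[k]) for p in preds) / len(preds) for k in range(K)]
Z = sum(math.exp(m) for m in mean_log)
ybar = [math.exp(m) / Z for m in mean_log]

total = sum(-math.log(p[2]) for p in preds) / len(preds)  # E_D[-y . log yhat]
noise = 0.0                                               # -y log y is 0 for one-hot y
bias = kl(y, ybar)                                        # D_KL(y, ybar)
variance = sum(kl(ybar, p) for p in preds) / len(preds)   # E_D[D_KL(ybar, yhat)]

# The decomposition is exact: total error = noise + bias + variance.
assert abs(total - (noise + bias + variance)) < 1e-9
```

As a side effect of the geometric-mean definition, the variance term equals −log Z, which is non-negative by Jensen's inequality; this is why Z acts as a normalization constant smaller than 1.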

3.1. THE BIAS-VARIANCE TRADEOFF DURING TRAINING

It is known that bias reduction and variance reduction are often in conflict, so we cannot minimize bias and variance together. However, if we consider how bias and variance change during the training process, the importance of tuning the tradeoff also changes over time. Specifically, shortly after training starts, the bias error dominates the total error and the variance is less important. As training goes on, the gradients for reducing the bias error (induced by $L_{ce}$) and for reducing the variance (induced by $L_{kd} - L_{ce}$) can reach the same scale for some samples; we then need to balance the tradeoff, because reducing one term is likely to increase the other. Therefore, for soft labels we need to handle the bias-variance tradeoff in a sample-wise manner and take the training process into consideration. To study the tradeoff during training, we consider the gradients of bias and variance reduction. Let $z$ be the logits of the student on input $x$ and $z_i$ its $i$-th element; we are interested in $\partial(L_{kd} - L_{ce})/\partial z_i$. To simplify the analysis, we focus on the gradient on the ground-truth-related logit, i.e., the sample $x$ is labeled as the $i$-th class. Mathematically, for the gradient of variance reduction, we have

$$\frac{\partial(L_{kd} - L_{ce})}{\partial z_i} = \tau\,(\hat{y}^s_{i,\tau} - \hat{y}^t_{i,\tau}) - (\hat{y}^s_{i,1} - y_i) = \tau\left(\frac{e^{z_i/\tau}}{\sum_k e^{z_k/\tau}} - \hat{y}^t_{i,\tau}\right) - \left(\frac{e^{z_i}}{\sum_k e^{z_k}} - y_i\right),$$

where $\hat{y}^t_{i,\tau}$ denotes the $i$-th element of the teacher's prediction $\hat{y}^t_\tau$. The term $L_{kd} - L_{ce}$ is easy to understand when $\tau = 1$, since the gradient then becomes $y_i - \hat{y}^t_{i,1}$. Meanwhile, for bias reduction we have $\partial L_{ce}/\partial z_i = \hat{y}^s_{i,1} - y_i$, so $\partial L_{ce}/\partial z_i$ and $\partial(L_{kd} - L_{ce})/\partial z_i$ always have different signs, leading to a tradeoff. If $|\partial L_{ce}/\partial z_i|$ is much larger than $|\partial(L_{kd} - L_{ce})/\partial z_i|$, bias reduction dominates the overall optimization direction. Conversely, if $|\partial(L_{kd} - L_{ce})/\partial z_i|$ is larger, the sample is used for variance reduction.
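The closed-form gradient above can be checked numerically. The sketch below (plain Python; the logits, label, and temperature are arbitrary illustrative values, and helper names are ours) compares the analytic expression for ∂(L_kd − L_ce)/∂z_i, using the standard τ²-scaled KD loss, against a central finite-difference estimate.

```python
import math

def softmax(z, tau=1.0):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def l_ce(z, label):
    """Cross-entropy with a one-hot label."""
    return -math.log(softmax(z)[label])

def l_kd(z, zt, tau):
    """Hinton-style KD loss: tau^2 * KL(teacher_tau || student_tau)."""
    ps, pt = softmax(z, tau), softmax(zt, tau)
    return tau * tau * sum(q * (math.log(q) - math.log(p)) for q, p in zip(pt, ps))

z, zt, label, tau = [2.0, 0.5, -1.0], [3.0, 0.2, -0.5], 0, 4.0

# Closed form: d(L_kd - L_ce)/dz_i = tau*(s_tau_i - t_tau_i) - (s_1_i - y_i).
ps_tau, pt_tau, ps1 = softmax(z, tau), softmax(zt, tau), softmax(z)
analytic = tau * (ps_tau[label] - pt_tau[label]) - (ps1[label] - 1.0)

# Finite-difference check on the ground-truth logit.
eps = 1e-6
def f(zi):
    zz = z[:]
    zz[label] = zi
    return l_kd(zz, zt, tau) - l_ce(zz, label)
numeric = (f(z[label] + eps) - f(z[label] - eps)) / (2 * eps)

assert abs(analytic - numeric) < 1e-5
```

At τ = 1 the same quantities reduce to a = ŷ^s − 1 < 0 and b = 1 − ŷ^t > 0 on the ground-truth logit, which is the sign conflict described in the text.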
Interestingly, we discover that under a fixed distillation temperature, the final performance is worse when more training samples are used for variance reduction, which will be introduced in the next section.

3.2. REGULARIZATION SAMPLES

Our analysis starts from a conclusion of Müller et al. (2019): if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. Inspired by this phenomenon, we measure the impact of bias and variance during training under different distillation settings. Let $a = \partial L_{ce}/\partial z_i$ and $b = \partial(L_{kd} - L_{ce})/\partial z_i$; as introduced before, $a$ and $b$ represent the impact of bias and variance, respectively. If $|b| > |a|$ for a sample, we name it a regularization sample, since the variance term dominates its optimization direction. From the collected data, we find that the number of regularization samples is closely related to distillation performance. In Tab. 1, we present the count of regularization samples for a student network trained by knowledge distillation. For distillation with a temperature higher than 1, the common setting, we observe that if the teacher network is trained with label smoothing, more samples are involved in variance reduction. Also, distillation from a teacher trained with label smoothing performs worse, consistent with Müller et al. (2019). Therefore, we conclude that for distillation with soft labels, the regularization samples encountered during training affect the final distillation performance. Moreover, we plot the number of regularization samples over training epochs in Fig. 2. As shown in the plots, the number of such samples increases much faster when distilling from the teacher trained with label smoothing, and the gap between the two settings widens with more training epochs. These observations support our motivation: the bias-variance tradeoff varies sample-wise and evolves during training.
From the above results, we conclude that the bias-variance tradeoff for soft labels varies sample-wise, so the strategy for tuning the tradeoff should also be sample-wise. In the next section, we set up ad hoc filters for soft labels and further investigate how regularization samples affect distillation.
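Under these definitions, deciding whether a given sample is a regularization sample only requires comparing the two scalars a and b on the ground-truth logit. A minimal sketch (plain Python; helper names and logits are ours, for illustration):

```python
import math

def softmax(z, tau=1.0):
    m = max(z)
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def is_regularization_sample(z_s, z_t, label, tau):
    """True when |b| > |a| on the ground-truth logit (a, b follow the paper)."""
    a = softmax(z_s)[label] - 1.0  # dL_ce / dz_i
    b = tau * (softmax(z_s, tau)[label] - softmax(z_t, tau)[label]) - a
    return abs(b) > abs(a)

# Student already confident on the sample -> the variance term dominates.
assert is_regularization_sample([5.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0, 1.0)
# Student still far from the label -> bias reduction dominates.
assert not is_regularization_sample([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], 0, 1.0)
```

Counting how many training samples satisfy this predicate per epoch gives the kind of curve shown in Fig. 2.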

3.3. HOW REGULARIZATION SAMPLES AFFECT DISTILLATION

The results presented in the last section suggest that we should avoid training with regularization samples. Hence, we design two straightforward solutions and then find that totally filtering out regularization samples deteriorates distillation performance. The first experiment manually resolves the conflicting gradient on the label-related logit, as defined in Sec. 3.2. Specifically, we apply a mask to the distillation loss $L_{kd}$ such that $\partial L_{kd}/\partial z_i = 0$, where $i$ is the label. Consequently, the loss for such a sample becomes $L^*_{kd} = -\sum_{k \neq i} \hat{y}^t_{k,\tau} \log \hat{y}^s_{k,\tau}$. The motivation behind the masked distillation loss is to transfer only the knowledge of resemblances among the non-target labels. The second experiment investigates what role regularization samples play in distillation. To this end, we carry out knowledge distillation on two subsets of samples: 1) $L_{kd}$ is disabled on regularization samples, and 2) $L_{kd}$ is enabled only on regularization samples. The results of these experiments are presented in Tab. 2. All three approaches fall short of the baseline knowledge distillation performance but beat the direct training baseline. First, since masking the $L_{kd}$ loss on the label-related logit performs worse than standard KD, we cannot resolve the tradeoff by masking the ground-truth-related logit. Second, filtering out regularization samples deteriorates distillation performance. Third, distilling only on regularization samples is still better than the direct training baseline, indicating that regularization samples remain valuable for distillation. These results suggest that regularization samples are not fully exploited by standard KD, and that we can tune the tradeoff to fulfill their potential.
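One way to realize the gradient mask from the first experiment is to compute the KD gradient over the logits and zero the label-related entry. A minimal sketch under that reading (plain Python; the τ²-scaled KD loss gives the per-logit gradient τ(ŷ^s_τ − ŷ^t_τ); helper names and logits are ours):

```python
import math

def softmax(z, tau=1.0):
    m = max(z)
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def masked_kd_grad(z_s, z_t, label, tau):
    """KD gradient tau*(s_tau - t_tau) with the label-related entry zeroed."""
    ps, pt = softmax(z_s, tau), softmax(z_t, tau)
    g = [tau * (p - q) for p, q in zip(ps, pt)]
    g[label] = 0.0  # the mask: no conflicting KD signal on the ground-truth logit
    return g

g = masked_kd_grad([2.0, 0.5, -1.0], [3.0, 0.2, -0.5], 0, 4.0)
assert g[0] == 0.0 and any(abs(v) > 0 for v in g)
```

The resemblance information on the non-target logits is preserved, while the ground-truth logit is updated by the cross-entropy term alone.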

4. WEIGHTED SOFT LABELS

From the last section, we know that the bias-variance tradeoff varies sample-wise during training and that, under fixed distillation settings, the number of regularization samples is negatively associated with the final distillation performance. Yet discarding regularization samples deteriorates distillation performance, and distilling knowledge from these samples alone is still better than the direct training baseline. This evidence inspires us to lower the weight of regularization samples. Recall that regularization samples are defined by the relative values of $a$ and $b$; we therefore propose to assign an importance weight to each sample according to $a$ and $b$. However, since $L_{kd}$ is computed with the temperature hyperparameter, $a$ and $b$ are correlated with the temperature, which makes the hyperparameter harder to tune. To make the weighting scheme independent of the temperature, we compare $a$ and $b$ at temperature $\tau = 1$. Note that when $\tau = 1$, $a = \hat{y}^s_{i,1} - y_i$ and $b = y_i - \hat{y}^t_{i,1}$, so we can compare $\hat{y}^s_{i,1}$ and $\hat{y}^t_{i,1}$ instead. Finally, in light of previous works that assign sample-wise weights (Lin et al., 2017; Tang et al., 2019), we propose weighted soft labels for distillation:

$$L_{wsl} = \left(1 - \exp\!\left(-\frac{\log \hat{y}^s_{i,1}}{\log \hat{y}^t_{i,1}}\right)\right) L_{kd} = \left(1 - \exp\!\left(-\frac{L^s_{ce}}{L^t_{ce}}\right)\right) L_{kd},$$

where $i$ is the ground-truth class of the sample. The equation assigns a weighting factor to each sample's $L_{kd}$ according to the predictions of the teacher and the student. In this way, if a student network is relatively better trained on a sample than the teacher, i.e., $\hat{y}^s_{i,1} > \hat{y}^t_{i,1}$, a smaller weight is assigned to this sample. Fig. 3 shows the whole computational graph of knowledge distillation with the proposed weighted soft labels. Finally, we add $L_{wsl}$ and $L_{ce}$ together to supervise the network, i.e., $L_{total} = L_{ce} + \alpha L_{wsl}$, where $\alpha$ is a balancing hyperparameter.
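A sketch of the weighted soft label in plain Python (helper names and example logits are ours; a framework version would compute the same per-sample weight inside the batch loss, detaching it from the gradient):

```python
import math

def softmax(z, tau=1.0):
    m = max(z)
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def wsl_weight(z_s, z_t, label):
    """Per-sample weight 1 - exp(-L_ce^s / L_ce^t), computed at temperature 1."""
    l_s = -math.log(softmax(z_s)[label])  # student cross-entropy on the sample
    l_t = -math.log(softmax(z_t)[label])  # teacher cross-entropy on the sample
    return 1.0 - math.exp(-l_s / l_t)

def kd_loss(z_s, z_t, tau):
    """Hinton-style KD loss: tau^2 * KL(teacher_tau || student_tau)."""
    ps, pt = softmax(z_s, tau), softmax(z_t, tau)
    return tau * tau * sum(q * (math.log(q) - math.log(p)) for q, p in zip(pt, ps))

def wsl_loss(z_s, z_t, label, tau):
    return wsl_weight(z_s, z_t, label) * kd_loss(z_s, z_t, tau)

teacher = [3.0, 0.2, -0.5]
w_strong = wsl_weight([5.0, 0.0, 0.0], teacher, 0)  # student better than teacher
w_weak = wsl_weight([0.0, 1.0, 0.0], teacher, 0)    # student worse than teacher
assert 0.0 < w_strong < w_weak < 1.0  # down-weights the well-fitted sample
```

Following the paper, the final objective would be L_total = L_ce + α · L_wsl, with α the balancing hyperparameter.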

5. EXPERIMENTS

To evaluate our weighted soft labels comprehensively, we first conduct experiments with various teacher-student pair settings on CIFAR-100 (Krizhevsky et al., 2009) . Next, we compare our method with current state-of-the-art distillation methods on ImageNet (Deng et al., 2009) . To validate the effectiveness of our method in terms of handling the bias-variance tradeoff, we conduct ablation experiments by applying weighted soft labels on different subsets.

5.1. DATASET AND HYPERPARAMETER SETTINGS

The datasets used in our experiments are CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). For comparison, the following recent state-of-the-art methods are chosen: FitNet (Romero et al., 2015), AT (Zagoruyko & Komodakis, 2017), SP (Tung & Mori, 2019), CC (Peng et al., 2019), VID (Ahn et al., 2019), RKD (Park et al., 2019), PKT (Passalis & Tefas, 2018), AB (Heo et al., 2019), FT (Kim et al., 2018), FSP (Yim et al., 2017), NST (Huang & Wang, 2017), Overhaul (Heo et al., 2019), and CRD (Tian et al., 2020).

5.2. MODEL COMPRESSION

Results on CIFAR-100. In Tab. 3, we present the Top-1 classification accuracy of our method and the comparison methods. The results of the comparison methods are quoted from Tian et al. (2020). Teacher-student pairs of the same and of different architecture styles are considered. For pairs of the same architecture style, we use wide residual networks (Zagoruyko & Komodakis, 2017) and residual networks (He et al., 2016). For pairs of different architecture styles, residual network and ShuffleNet (Zhang et al., 2018) pairs are chosen. As shown in the table, for distillation with both same and different architecture styles, our method reaches new state-of-the-art results. Specifically, our method outperforms standard KD by a large margin, which verifies its effectiveness.

Results on ImageNet. In Tab. 4, we compare our method with current state-of-the-art methods on ImageNet. Note that for the ResNet-34 → ResNet-18 distillation setting, CRD was trained for 10 extra epochs, while our training schedule is the same as the other methods. For the ResNet-50 → MobileNet-v1 setting, NST and FSP are not chosen for comparison because the two methods require too much GPU memory; instead, we include the accuracy of FT and AB reported in Heo et al. (2019). Our results outperform all existing methods, verifying the practical value of our method.

Weighted soft labels on different subsets. Recall that we propose weighted soft labels for tuning the sample-wise bias-variance tradeoff; it is still unclear whether the improvements come from a well-handled sample-wise tradeoff. To investigate this, we compare the performance gain of weighted soft labels on different training subsets. Similar to the settings used in Tab. 2, we apply weighted soft labels on two different subsets: only the regularization samples, and all samples excluding the regularization samples. In Tab. 5, we show the results on both subsets.
From the significant improvements, we can see that our method improves performance not only on the RS subset; the improvements on the subset excluding RS are also significant. We conclude that weighted soft labels tune the sample-wise bias-variance tradeoff globally and lead to improved distillation performance.

6. CONCLUSION

Recent studies (Müller et al., 2019; Yuan et al., 2020) point out that one important reason behind the effectiveness of distillation is the regularization effect brought by the softness of the labels. In this paper, we rethink soft labels for distillation from a bias-variance tradeoff perspective. The tradeoff varies sample-wise, and we propose weighted soft labels to handle it; the effectiveness of our method is verified by experiments on standard evaluation benchmarks.

A.1 VISUALIZATION OF THE RESEMBLANCES INTRODUCED BY SOFT LABEL REGULARIZERS

As shown in Sec. 3.1, when $\tau = 1$, $\partial(L_{kd} - L_{ce})/\partial z_i$ equals $y_i - \hat{y}^t_{i,1}$. Since $\hat{y}^t_{i,\tau}$ is the output of the teacher network, computed by a linear mapping of the activations in the teacher's penultimate layer, this regularization indicates that the student should follow the learned resemblances between classes (Hinton et al., 2015; Müller et al., 2019). Still, two questions remain: 1) what the resemblances are, and 2) whether the regularization still encodes resemblances when $\tau$ is set to 4, a widely adopted hyperparameter value (Tian et al., 2020). To answer these questions, we visualize the gradient vector $\partial(L_{kd} - L_{ce})/\partial z$ with respect to each class. Specifically, on the ImageNet (Deng et al., 2009) training set with $\tau = 4$, we compute the average value of $\partial(L_{kd} - L_{ce})/\partial z$ for each class. Let $M$ be the matrix whose $ij$-th entry $M_{ij}$ is the averaged $\partial(L_{kd} - L_{ce})/\partial z_i$ over samples of class $j$. Since $\sum_i M_{ij} = 0$, diagonal elements are ignored for visualization. The results are visualized in Fig. 4. We find that the raw values share a common correlation pattern but exhibit large variance. By treating each entry $M_{ij}$ as a vertex and constructing a mesh over the matrix, we apply subdivision (Loop, 1987) to smooth the extreme points and finally render the mesh with the ray-tracing package PlotOptiX. Several facts can be observed from the figures: 1) Comparing sub-figures (a) and (b), the resemblances implied by the regularizers are similar across different teacher-student pairs.
2) Comparing (a, b) with (c, d), the resemblances are consistent with the semantic similarity of the image class names. In short, for $\tau = 4$, the variance reduction brought by soft labels still encodes resemblances among labels, and these resemblances are consistent with the semantic distance between class names. In the next section, we analyze how the bias-variance tradeoff changes when training with soft labels.
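The property Σ_i M_ij = 0 used above follows from the fact that the gradient of any softmax-based loss sums to zero over the logits. The sketch below (plain Python; random logits stand in for real network outputs, and helper names are ours) builds a small version of M and checks the column sums:

```python
import math
import random

def softmax(z, tau=1.0):
    m = max(z)
    e = [math.exp((v - m) / tau) for v in z]
    s = sum(e)
    return [v / s for v in e]

def grad_variance_term(z_s, z_t, label, tau):
    """d(L_kd - L_ce)/dz over all logits (see Sec. 3.1)."""
    ps_t, pt_t, ps1 = softmax(z_s, tau), softmax(z_t, tau), softmax(z_s)
    y = [1.0 if k == label else 0.0 for k in range(len(z_s))]
    return [tau * (ps_t[k] - pt_t[k]) - (ps1[k] - y[k]) for k in range(len(z_s))]

random.seed(1)
K, tau, n_per_class = 4, 4.0, 50

# M[i][j]: gradient on logit i, averaged over samples labeled as class j.
M = [[0.0] * K for _ in range(K)]
for j in range(K):
    for _ in range(n_per_class):
        zs = [random.gauss(0, 1) for _ in range(K)]
        zt = [random.gauss(0, 1) for _ in range(K)]
        g = grad_variance_term(zs, zt, j, tau)
        for i in range(K):
            M[i][j] += g[i] / n_per_class

# Each column of M sums to zero, since softmax gradients sum to zero.
for j in range(K):
    assert abs(sum(M[i][j] for i in range(K))) < 1e-9
```

Because the columns sum to zero, the diagonal entries are determined by the off-diagonal ones, which is why they can be dropped from the visualization without losing information.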

A.2 INTERMEDIATE STATES BETWEEN EXCLUDING AND ONLY ON REGULARIZATION SAMPLES

To further investigate the phenomenon of regularization samples, we conduct experiments that show the intermediate states between excluding regularization samples and training on them only. Two settings are considered: First, we gradually exclude regularization samples during training, from excluding all regularization samples to excluding 25% of them; second, we keep all regularization samples and then gradually add non-regularization samples. Since whether a sample is a regularization sample is judged from the training loss, we cannot pre-define a sample set such that a certain percentage of samples is kept or dropped. Therefore, we conduct these experiments by assigning a probability of backpropagating the loss computed on regularization samples. For example, if during training a sample is marked as a regularization sample according to the values of $a$ and $b$, we backpropagate its loss with probability $p = 0.25$, which yields the performance of excluding 75% of regularization samples. In Tab. 7, we first present the results with KD in (a) and then the results with weighted soft labels applied in (b). We can observe that weighted soft labels indeed balance bias and variance at the sample level, not at the dataset level.

A.3 COMBINING WITH RKD (PARK ET AL., 2019)

To investigate how weighted soft labels can be applied to variants of KD, we conduct an experiment combining RKD (Park et al., 2019) with our weighted soft labels. Relational knowledge distillation transfers either the L2 distance of features between two samples or the angle formed by three samples as knowledge. In other words, the knowledge in RKD is measured by the relations between sample pairs; it is no longer sample-independent, unlike weighted soft labels applied to KD, which assign weights sample-wise. We therefore take the averaged weighting factors of the involved samples when calculating the distance/angle matrix.
The results on CIFAR-100 are presented in Tab. 8 (averaged over 5 runs). As can be observed from the table, weighted soft labels applied to RKD still bring improvements, though smaller than those of WSL applied to KD. We also believe that exploring applications to more variants of KD is an important future direction.

A.4 OTHER VARIANTS OF WEIGHTING

In this work, the weighting scheme is defined as $1 - \exp(-L^s_{ce}/L^t_{ce})$. In Tab. 9, we present a comparison between the adopted weighting form and a Sigmoid baseline. We can see that as long as we adaptively tune the sample-wise bias-variance tradeoff, the performance is better than KD without weighted soft labels. Therefore, although the proposed weighting form is not mathematically optimal, not-too-big and not-too-small weights for the regularization samples are not hard to find. These results support our main contribution: there is a sample-wise bias-variance tradeoff, and we need to assign weights to the regularization samples accordingly.

A.5 ABLATION ON α

In Tab. 10 (a), we first tune the value of α on CIFAR-100, testing the four values {1, 2, 3, 4}. Then we test three values in [2, 3] in (b). Finally, we tune α on ImageNet in (c). In conclusion, the results are not very sensitive to α, and the cost of searching for α in our work is low.

A.6 RESULTS ON MULTINLI

To further validate our method, we conduct experiments on the NLP dataset MultiNLI (Williams et al., 2018). In this setting, the teacher is BERT-base-cased (12 layers, hidden size 768, 108M parameters) and the student is T3 (3 layers, hidden size 768, 44M parameters). We follow the training settings of Sun et al. (2019). In Tab. 11, we present the comparison between standard KD and our weighted soft labels.



Figure 2: The number of regularization samples with respect to training epochs. The distillation settings are the same as the settings in Tab. 1.

Figure 3: Computational graph of knowledge distillation with our proposed weighted soft labels.

CIFAR-100 contains 50K training and 10K test images of size 32 × 32. ImageNet contains 1.2 million training and 50K validation images. Except for the loss function, training settings such as the learning rate and the number of epochs are the same as Tian et al. (2020) for CIFAR-100 and Heo et al. (2019) for ImageNet. For distillation, we set the temperature τ = 4 for CIFAR and τ = 2 for ImageNet. For the loss function, we set α = 2.25 for distillation on CIFAR and α = 2.5 for ImageNet via grid search. The teacher network is trained beforehand and fixed during distillation.

Figure 4: Visualization of the resemblances introduced by soft label regularizers: (a) VGG-19 (Teacher) → VGG-16 (Student), (b) ResNet-50 (Teacher) → ResNet-18 (Student). Semantic similarity between label names: (c) LCH similarity (Pedersen et al., 2004), (d) WUP similarity (Pedersen et al., 2004). Darker areas denote larger values.

in [0, 1], so that the weights of regularization samples are lower than those of non-regularization samples. A straightforward baseline is to use the Sigmoid function for this conversion.

Existing methods are mainly concerned with the variance brought by the choice of network models. Our perspective differs from theirs in that we focus on the behavior of samples during training. In our work, based on the results of Heskes (1998), we present the decomposition of the distillation loss, which is defined via the Kullback-Leibler divergence.

We count the number of regularization samples under different distillation settings on CIFAR-100. The teacher-student pair is WRN-40-2 (Zagoruyko & Komodakis, 2017) → WRN-16-2. Results are averaged over 5 repeated runs. The temperature column gives the distillation temperature, and the label smoothing column indicates whether the teacher network was trained with the label smoothing trick.

Study of the impact of regularization samples on distillation. The loss function shown is the per-sample loss. Results are Top-1 classification accuracy. We follow the settings of Tab. 1 and set τ = 4. Results are averaged over 5 runs.

Top-1 classification accuracy results on CIFAR-100. Comparison results are quoted from Tian et al. (2020). We report our results over 5 repeated runs.

Top-1 and Top-5 classification accuracy results on the ImageNet validation set. All training hyperparameters, such as the learning rate and the number of epochs, are in accordance with Heo et al. (2019).

Performance on different subsets with soft labels and our weighted soft labels. RS means regularization samples. Results are averaged over 5 runs.

Distillation using weighted soft labels and a teacher trained with label smoothing (denoted as LS?). Results are averaged over 5 runs. Distillation with a label-smoothing-trained teacher: our exploration of the bias-variance tradeoff starts from the conclusion of Müller et al. (2019) that a teacher network trained with the label smoothing trick is less effective for distillation. It is worthwhile to study whether this conclusion remains true for distillation with our weighted soft labels. As discussed before, we hold that too many regularization samples make distillation less effective. Since our weighted soft labels are proposed to mitigate the negative effects of regularization samples, under the same settings as Tab. 1, we conduct comparison experiments in Tab. 6 to see whether the negative effects persist. It is evident that weighted soft labels significantly improve distillation performance, especially for distillation from the teacher trained with label smoothing. Besides, the teacher trained with label smoothing still performs worse than the one without, which again verifies the conclusion of Müller et al. (2019).

Intermediate states between excluding and only on regularization samples

Combining weighted soft labels with RKD (Park et al., 2019). Distillation settings: WRN-40-2 → WRN-16-2, WRN-40-2 → WRN-40-1, resnet56 → resnet20.

ACKNOWLEDGEMENTS

This work is supported in part by a gift grant from Horizon Robotics and National Science Foundation Grant CNS-1951952. We thank Yichen Gong, Chuan Tian, Jiemin Fang and Yuzhu Sun for the discussion and assistance.


Published as a conference paper at ICLR 2021 

