ADAPTIVE BLOCK-WISE LEARNING FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation allows a student network to improve its performance under the supervision of transferred knowledge. Existing knowledge distillation methods are implemented under the implicit hypothesis that knowledge from the teacher and from the student contributes to each layer of the student network to the same extent. In this work, we argue that the knowledge from the teacher and from the student should contribute differently to each layer during training, and our experimental results support this argument. To this end, we propose a novel Adaptive Block-wise Learning (ABL) for Knowledge Distillation, which automatically balances teacher-guided knowledge and self-knowledge in each block. Specifically, since the error backpropagation algorithm cannot assign weights to each block of the student network independently, we leverage local error signals to approximate the global error signals of the student objective. Moreover, we utilize a set of meta variables to control the contribution of student knowledge and teacher knowledge to each block during training. Finally, extensive experiments demonstrate the effectiveness of our method. Meanwhile, ABL provides an insightful view: in the shallow blocks, the weight of teacher guidance is greater, while in the deep blocks, student knowledge has more influence.

1. INTRODUCTION

Knowledge distillation (KD) in deep learning imitates the pattern of human learning. Hinton et al. (2015) proposed the original concept of KD, which minimizes the KL divergence between the logits of the teacher (soft labels) and the student. This casts KD as a mode in which a complex pre-trained model serves as a teacher to guide the learning of a lightweight student model. Following this teacher-student framework, a series of KD methods have been developed mainly along three directions: what, where, and how to distill. Regardless of the direction, these existing KD methods rest on the same implicit hypothesis: during distillation training, the contribution of the student's and the teacher's knowledge to each layer of the student network is fixed, whether it is the last layer or the first. This is because, under error backpropagation (BP) (Rumelhart et al., 1986), the weight of the error signals on every layer is determined by the same hyper-parameters, as shown in Figure 1(a). Intuitively, this limits the flexibility of balancing the knowledge of teacher and student, which hinders excavating the potential of the student model. Therefore, we argue that when the representation learning of the student network is guided by teacher knowledge, different layers of the student network place different emphases on the knowledge learned through the one-hot labels and the knowledge distilled from the teacher. That is, some layers are more inclined to learn from student knowledge, while others tend to leverage teacher knowledge. Furthermore, we argue that the contribution of student and teacher knowledge to representation learning should be adaptive at each layer. However, existing KD methods obtain the global error signal from the last layer, which makes it hard to allocate layer-wise weights.
To explore the student network hierarchically during training, we modify the backward computation graph and leverage local error signals produced by the family of local objectives (Jaderberg et al., 2017; Nøkland & Eidnes, 2019; Belilovsky et al., 2020; Pyeon et al., 2021) to approximate the global error signals generated by the last layer. These local loss functions focus on local error signals and decoupled learning. Here, by leveraging auxiliary networks, we adopt the above local strategies to approximate the global error signal created by the student loss objective corresponding to the one-hot labels. This makes it possible to independently assign different weights to teacher knowledge and student knowledge at different layers. Once the student error signals of each layer can be obtained independently, the remaining issue is to explore the balance between student knowledge and teacher knowledge at each layer. We model this issue as a bilevel optimization problem (Anandalingam & Friesz, 1992) by attaching a set of meta variables to the error signals corresponding to the two types of knowledge. These meta variables represent which kind of knowledge the update of the corresponding layer prefers. In addition, after solving the bilevel optimization with gradient descent, we obtain the optimal meta variables of the target network under the target KD method and utilize them for the final evaluation. To this end, we propose a novel paradigm dubbed Adaptive Block-wise Learning (ABL) for Knowledge Distillation, which allows the conventional teacher-student architecture to explore the block-wise influence of knowledge from the teacher and the student. As shown in Figure 1(b), the proposed method changes the acquisition path of the error signals of the student objective from global to local and adds a group of meta variables to measure the contributions of knowledge from the student and the teacher.
Furthermore, we acquire the balance between knowledge from the student and the teacher on the validation set and leverage the optimized meta variables to train the corresponding distillation method. Our main contributions are as follows: 1. We propose a novel paradigm named Adaptive Block-wise Learning for knowledge distillation, which automatically balances the contribution of knowledge from the student and the teacher for each block. 2. We discover that the deep, more abstract representations incline to learn from student knowledge, while the shallow, less abstract representations tend to be guided by teacher knowledge. We hope this discovery provides another learning view for KD. 3. We conduct extensive experiments on eleven recent distillation benchmarks. Experimental results demonstrate the effectiveness of the proposed framework in improving the performance of existing distillation methods.

2. RELATED WORK

Knowledge distillation. Knowledge distillation usually transfers knowledge from large models to small models under the teacher-student framework. Vanilla KD was first proposed by Hinton et al. (2015), which lets the student model mimic the final prediction of the teacher model. Zhao et al. (2022) decouple the logit knowledge into target knowledge and non-target knowledge. In addition to these two logits-based KD methods, there are also many feature-based methods (Romero et al., 2015; Komodakis & Zagoruyko, 2017; Tung & Mori, 2019; Peng et al., 2019; Park et al., 2019; Ahn et al., 2019; Tian et al., 2020), which mainly aim at transferring the knowledge of intermediate features.

3. METHODS

We begin by describing knowledge distillation from the perspective of the gradient (Section 3.1). We then introduce our proposed adaptive block-wise learning for knowledge distillation (Section 3.2), where the global error signals are approximated with local signals to realize an adjustable weight allocation of the knowledge from teacher and student. Finally, we simultaneously optimize the global error and the local error signals via bilevel optimization (Section 3.3).

3.1. KNOWLEDGE DISTILLATION FROM THE PERSPECTIVE OF GRADIENT

Existing knowledge distillation methods can be divided into two parts: 1) the student learns by itself, and 2) the student learns by distilling the teacher's knowledge, which can be expressed as:

L = αL_S + βL_KD, (1)

where L_S is usually the cross-entropy loss (De Boer et al., 2005) between the label and the probabilities predicted by the student model alone, and L_KD can be the Kullback-Leibler divergence (Hershey & Olsen, 2007) in vanilla knowledge distillation, the ℓ2-norm distance between intermediate representations in Fitnets (Romero et al., 2015), or any feature-based distillation loss. α and β balance the contributions of L_S and L_KD. With the above analysis, we rethink knowledge distillation from the perspective of the gradient during the training of the student model. Consider a student network consisting of L layers f^(l), l ∈ {1, 2, ..., L}, each outputting h^(l) through the parameters θ^(l). The gradient-based update rule for the parameters can be formulated as:

θ^(l) ← θ^(l) − η (∂L/∂h^(l)) (∂h^(l)/∂θ^(l)), (2)

where η is the learning rate and ∂L/∂h^(l) is the backpropagated error gradient, usually denoted δ^(l). By Eq. (1), the training process of KD, reflected in the error gradients of each layer of the student network, can be written as:

δ^(l) = αδ_S^(l) + βδ_KD^(l), (3)

where δ_S^(l) and δ_KD^(l) denote the contributions of the student's knowledge and the teacher's knowledge in the l-th layer of the student network, respectively. To facilitate our statements, we call δ_S^(l) and δ_KD^(l) the student error signals and the teacher error signals, respectively. Intuitively, no matter how α and β are set, the error gradient of each layer can be expressed by Eq. (3). This means that the contribution allocation of the two parts is fixed for every layer, as shown in Figure 1(a).
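As a concrete illustration of Eq. (1), the following minimal NumPy sketch combines a cross-entropy term with a temperature-softened KL term in the way vanilla KD does; the function names and default values are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.1, beta=0.9, T=4.0):
    """L = alpha * L_S + beta * L_KD (Eq. 1): L_S is cross-entropy with the
    one-hot labels, L_KD the KL divergence between temperature-softened logits."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    ps_T = softmax(student_logits, T)
    pt_T = softmax(teacher_logits, T)
    kl = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(axis=-1).mean() * T * T
    return alpha * ce + beta * kl
```

When the teacher and student logits coincide, the KL term vanishes and the loss reduces to α times the cross-entropy, which makes the roles of the two hyper-parameters easy to see.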
Algorithm 1 Bilevel optimization for Adaptive Block-wise Learning for Knowledge Distillation
Initialize meta variables γ as 0
Warm-start input network weights (θ, ϕ) with Eq. (7) and Eq. (8)
1: while not converged do
2:   Calculate (θ*, ϕ*) on a training mini-batch with Eq. (7) and Eq. (8)
3:   Update γ on a validation mini-batch with Eq. (9)
4:   Update (θ, ϕ) on a training mini-batch with Eq. (7) and Eq. (8)
5: end while
Obtain the weight allocation variables γ of the global teacher error and the local student error

3.2. APPROXIMATING WITH LOCAL ERROR SIGNALS

Our goal is to realize adaptive block-wise weight allocation of the knowledge from the student model and the teacher model in KD. However, it is impossible to assign block-wise weights with the error backpropagation algorithm in the conventional teacher-student framework, as shown in Eq. (3). Thus, we approximate the global error signal of the student objective generated by backpropagation with block-wise local error signals. Based on this transformation of the gradient flow, our main task is equivalent to the following two steps: (1) generating a local gradient flow that approximates the gradient flow created by the last layer, and (2) simultaneously optimizing the error signals generated by the local gradient flow and by backward propagation, achieving adaptive weight assignment for each block. These two steps are explored in Section 3.2 and Section 3.3, respectively. For step (1), we imitate supervised greedy learning to construct the local error signals used to approximate the error obtained by backpropagation. Following previous works (Nøkland & Eidnes, 2019; Belilovsky et al., 2020; Pyeon et al., 2021), we utilize auxiliary networks and target vectors to create the student error (local error signals) independently for each block, as shown in Figure 1(b). We usually identify a single block of the input network as consisting of convolutional, normalization, and pooling layers. For input networks with residual connections, we regard each residual connection as a block. Each block corresponds to an auxiliary network that predicts the target and computes the local objective; this is how the local student error of each block is obtained. Note that, for the last block, the auxiliary network is the last fully-connected (FC) layer of the student. Let X^(l) and Y denote the output representation of block l and the labels, l ∈ {0, 1, ..., L−1}, where X^(0) is the input data.
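To make the decoupling concrete, here is a minimal NumPy sketch (an assumed structure, not the paper's code) of a block paired with a linear auxiliary head: the head turns the block's output into a local cross-entropy whose gradient stops at the block's input, so each block can be updated without a global backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

class LocalBlock:
    """One student block (a tanh layer here) plus an auxiliary linear head.
    The head provides the local student error; no gradient crosses blocks."""
    def __init__(self, d_in, d_hid, n_cls):
        self.W = rng.normal(0.0, 0.1, (d_in, d_hid))   # block parameters (theta)
        self.A = rng.normal(0.0, 0.1, (d_hid, n_cls))  # auxiliary head (phi)

    def forward(self, x):
        self.x, self.h = x, np.tanh(x @ self.W)
        return self.h  # passed (detached) to the next block

    def local_step(self, y_onehot, lr=0.1):
        # local cross-entropy through the auxiliary head
        logits = self.h @ self.A
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        d_logits = (p - y_onehot) / len(y_onehot)
        d_h = (d_logits @ self.A.T) * (1.0 - self.h ** 2)  # local error signal
        self.A -= lr * self.h.T @ d_logits
        self.W -= lr * self.x.T @ d_h
        return -np.log((p * y_onehot).sum(axis=1) + 1e-12).mean()
```

Stacking such blocks and calling `local_step` after each forward trains every block greedily against the labels, mimicking the local-objective family cited above.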
ϕ^(l) represents the parameters of the auxiliary network corresponding to the l-th block. We define the local objective function as L̃^(l)(θ^(l), ϕ^(l); X^(l), Y). To apply our paradigm to the training process of the existing KD framework shown in Eq. (3), we keep both sensitive hyper-parameters α and β. In addition, we propose to adopt a set of meta variables γ^(l) ∈ R² in a two-dimensional continuous domain, which modifies the contributions of the student error and the teacher error to the total error of different network layers. Specifically, we apply a softmax over the two entries, namely (γ_1^(l), γ_2^(l)) = softmax(γ^(l)), as the weight allocation variables. Finally, we can formulate the loss objective for block l as follows:

L^(l)(θ, ϕ, γ; X^(0), Y, Y_T) = γ_1^(l) α L̃^(l)(θ^(l), ϕ^(l); X^(l), Y) + γ_2^(l) β L_KD^(l)(θ^(l); X^(0), Y_T), (4)

where Y_T is the distilled signal of the teacher model. The error gradient of each non-last block l can then be re-defined as:

δ̃^(l) = γ_1^(l) α δ̃_S^(l) + γ_2^(l) β δ_KD^(l), (5)

where the local student error is δ̃_S^(l) = ∂L̃^(l)/∂h^(l) and the teacher error is δ_KD^(l) = ∂L_KD^(l)/∂h^(l). For the last block, we use δ̃^(L) = ∂L/∂h^(L).
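The weight allocation of Eq. (4) can be sketched as follows (a hypothetical helper; γ is the raw two-dimensional meta variable, softmaxed into the pair (γ_1, γ_2)):

```python
import numpy as np

def gamma_weights(gamma):
    """Softmax over the two entries of a block's meta variable gamma."""
    g = np.asarray(gamma, dtype=float)
    e = np.exp(g - g.max())
    return e / e.sum()

def block_objective(local_student_loss, block_kd_loss, gamma, alpha=1.0, beta=1.0):
    # Eq. (4): gamma_1 * alpha * L_tilde^(l)  +  gamma_2 * beta * L_KD^(l)
    g1, g2 = gamma_weights(gamma)
    return g1 * alpha * local_student_loss + g2 * beta * block_kd_loss
```

With γ initialized to zero the weights are (0.5, 0.5), matching the fixed-distillation baseline; as γ drifts during the outer loop, a block can lean toward its own error signal or toward the teacher's.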

3.3. OPTIMIZING META VARIABLES WITH BILEVEL OPTIMIZATION

As mentioned in Section 3.2, we need to simultaneously optimize both the global teacher error and the local student error to ensure a competitive performance of the student network. The objective, given by Eq. (4) and Eq. (5), is influenced not only by the network parameters (θ, ϕ) but also by the meta variables γ. This optimization can therefore be solved as a bilevel optimization problem (Anandalingam & Friesz, 1992; Liu et al., 2019; Pyeon et al., 2021) with (θ, ϕ) as the inner-loop variables and γ as the outer-loop variables, defined as:

min_γ L_val(θ*, ϕ*, γ)  s.t.  (θ*, ϕ*) = argmin_{θ,ϕ} L_train(θ, ϕ, γ), (6)

where L_train and L_val denote the loss functions on the training and validation sets. Following Eq. (6), we adopt a nested method that uses the training set to update the model parameters (θ, ϕ) with the global and local error signals of Eq. (5), and the validation set to optimize the meta parameters γ. Specifically, we fix γ and update (θ, ϕ) with a mini-batch of training data in the inner loop. The updates can be written as:

θ^(l)(γ) ← θ^(l)(γ) − η δ̃^(l)(γ) ∂h^(l)/∂θ^(l), (7)

ϕ^(l)(γ) ← ϕ^(l)(γ) − η ∇_ϕ L̃_train(θ^(l), ϕ^(l)), (8)

where L̃_train is the local objective on the training set and l ∈ {0, 1, ..., L}. In the outer loop, we optimize the meta variables γ with a mini-batch of the validation set based on (θ, ϕ) updated by Eq. (7) and Eq. (8):

γ ← γ − λ ∇_γ L_val((θ, ϕ) − η∇_{θ,ϕ} L_train(θ, ϕ, γ), γ), (9)

where λ is the learning rate of the meta variables γ. We then apply the chain rule and the finite difference approximation (Liu et al., 2019) to the last term of Eq. (9):

∇_γ L_val(·, γ) ≈ ∇_γ L_val(θ*, ϕ*, γ) − η [∇_γ L_train(θ+, ϕ+, γ) − ∇_γ L_train(θ−, ϕ−, γ)] / (2ξ), (10)

where L_val is the validation loss, (θ*, ϕ*) are the results of the updates in Eq. (7) and Eq. (8), (θ±, ϕ±) = (θ, ϕ) ± ξ∇_(θ*,ϕ*) L_val(θ*, ϕ*, γ), and ξ is a scale¹.
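To illustrate the nested updates of Eqs. (6)-(10) end to end, the toy problem below runs one inner gradient step on a scalar "training" loss and updates a scalar meta variable with the same finite-difference hypergradient; everything here, including the quadratic losses, is an illustrative stand-in for the real networks:

```python
eta, lam = 0.1, 0.05          # inner and outer learning rates (eta, lambda)

def grad_train_w(w, g): return 2.0 * (w - g)     # d/dw of L_train = (w - gamma)^2
def grad_train_g(w, g): return -2.0 * (w - g)    # d/dgamma of L_train
def grad_val_w(w):      return 2.0 * (w - 1.0)   # d/dw of L_val = (w - 1)^2

w, g = 0.0, 0.0
for _ in range(300):
    w_star = w - eta * grad_train_w(w, g)        # inner step, analogue of Eq. (7)
    gv = grad_val_w(w_star)
    xi = 0.01 / (abs(gv) + 1e-12)                # scale, as in the footnote
    w_p, w_m = w + xi * gv, w - xi * gv          # perturbed weights (theta +/-)
    # finite-difference hypergradient, analogue of Eq. (10); the direct
    # dL_val/dgamma term is zero for this toy validation loss
    hyper = -eta * (grad_train_g(w_p, g) - grad_train_g(w_m, g)) / (2.0 * xi)
    g -= lam * hyper                             # outer step, analogue of Eq. (9)
    w = w_star
```

The meta variable is driven toward the value (here 1.0) that makes the inner solution minimize the validation loss, which is exactly the role γ plays for the block-wise weights.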
Finally, we fix the optimized γ, train the network parameters (θ, ϕ) on the training set, and use the updated network parameters as the initial values of the next iteration. The bilevel optimization training process is shown in Algorithm 1.

Warm start

We adopt the warm start technique (Pyeon et al., 2021) on the network weights (θ, ϕ) with the meta variables fixed at γ = 0, which stabilizes the bilevel optimization and avoids poor values of γ caused by an inappropriate initialization of (θ, ϕ). In our experiments, we pre-train the student for 40,000 iterations to obtain a satisfactory initial value for the optimization of γ.

Two-stage training Our proposed method adopts a two-stage training strategy, consisting of meta variable optimization and final evaluation. With the optimized meta variables, we train the network weights (θ, ϕ) with the training loss, following Eq. (7) and Eq. (8), to obtain the final comparable results.

4. EXPERIMENTS

We first describe the main experimental settings (Section 4.1) that are necessary to understand our work. Then, we provide the results of baseline comparisons on several datasets (Section 4.2). Finally, we conduct ablation experiments (Section 4.3) and further discussions (Section 4.4) to explore the necessity and effectiveness of the components of our framework.

4.1. EXPERIMENTAL SETTINGS

Datasets. We conduct our experiments on CIFAR-100 (Krizhevsky et al., 2009) and the large-scale ImageNet dataset.

Backbone and Auxiliary Networks

We adopt several backbone architectures as our main networks, including ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2015), Wide ResNet (Zagoruyko & Komodakis, 2016), ShuffleNet (Zhang et al., 2018), and MobileNet (Howard et al., 2017). Moreover, the auxiliary networks should be lightweight while still allowing parallel training with the main networks. Thus, we build an auxiliary block with a point-wise convolutional layer, a depth-wise convolutional layer, an average pooling layer, and a fully-connected layer.

Baselines We compare the performance of existing KD benchmarks with and without ABL. We divide these baselines into two categories: logits-based and feature-based. Logits-based distillation methods include KD (Hinton et al., 2015) and DKD (Zhao et al., 2022). The feature-based distillation methods are Fitnets (Romero et al., 2015), AT (Komodakis & Zagoruyko, 2017), SP (Tung & Mori, 2019), CC (Peng et al., 2019), and RKD (Park et al., 2019). More details of these experimental settings are shown in Appendix A.1.
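The choice of a depth-wise plus point-wise pair for the auxiliary block keeps its parameter count far below that of a standard convolution; the quick count below (plain arithmetic, with illustrative channel sizes) shows the gap:

```python
def conv2d_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """A depth-wise k x k convolution per input channel followed by a
    1 x 1 point-wise projection, as in the auxiliary block."""
    return c_in * k * k + c_in * c_out

standard = conv2d_params(64, 128, 3)                 # 73728 weights
separable = depthwise_separable_params(64, 128, 3)   # 8768 weights
```

At 64 to 128 channels with 3x3 kernels the separable pair uses roughly 8x fewer weights, which is what keeps the auxiliary heads cheap enough to train in parallel with the main network.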

Results on CIFAR-100

We evaluate our framework on CIFAR-100. Table 1 and Table 2 show the performance of the standard KD baselines with and without our proposed framework. In Table 1, we adopt various student-teacher combinations with same-style architectures. Table 2 shows the results of heterogeneous distillation for several student-teacher pairs. For both types of distillation, we evaluate five teacher-student combinations and eleven existing (including state-of-the-art) knowledge distillation methods. With our framework, despite no per-task tuning, the performance of most methods improves by 0.5%-2%, which proves the effectiveness and wide applicability of our framework. In particular, AT, a feature-based method that plays a negative role in heterogeneous distillation, gains a significant 8.1% improvement after our adaptive block-wise training. These extensive results on both homogeneous and heterogeneous distillation demonstrate the effectiveness and portability of the proposed adaptive block-wise learning for KD.

Results on ImageNet

We also evaluate the performance of several methods within our proposed framework on the large-scale ImageNet dataset, as shown in Table 3 and Table 4. These methods include the logits-based KD and DKD and the feature-based AT and ReviewKD. As both tables show, the performance of the four selected methods improves steadily in top-1 and top-5 accuracy, by an average of 0.3-0.5%, after using our adaptive framework. These results on ImageNet verify the scalability of our proposed method.

4.3. ABLATION STUDIES

In this section, we conduct several ablation experiments to verify the effectiveness of our proposed framework. Specifically, we explore the framework by answering the following questions. 1) Are the local error signals good approximations of the global error signals? If the local error signals were not representative of the global ones, there would be a significant gap between the final results and the comparison results. To explore the effectiveness of the local error signals from the local loss objective, we set the meta variables to zero. In this case, the student error and the teacher error affect all parameters of the input network with the same weights, which is similar to the conventional backpropagation KD framework. We run experiments with several distillation methods under both same-style and different-style architecture settings; the results on CIFAR-100 are shown in Table 5. From the results, for KD, AT, RKD, and ReviewKD, the model updated with local error signals performs better, while for DKD and CRD, the model updated with global error signals performs better. Overall, the gap between the two is less than 1%. Thus, most of the compared teacher-student architectures can be updated with local error signals and still match the performance of conventional training strategies. 2) Are the adaptive meta variables reasonable? To validate the effectiveness of the meta variables used for balancing the student error and the teacher error, we compare three alternative strategies of ABL for KD: Fixed-distillation, Random-distillation, and No-distillation. Fixed-distillation assumes that the contribution of the student's knowledge equals that of the teacher's knowledge, with both corresponding meta variables set to 0.5. Random-distillation randomly decides the balance between the knowledge from teacher and student.
The comparison results are reported in Table 6, which demonstrates that our adaptive-distillation strategy achieves a consistent improvement over the other comparable baselines. We also visualize the final evaluation process corresponding to the meta variables, with ResNet20 as student and ResNet56 as teacher based on DKD, in Figure 2. We compare the training curves of three strategies: Standard-type, Fixed-type, and Adaptive-type distillation. The three processes are similar in the initial training stage, while in the final stage, our proposed ABL shows a relatively stable improvement. This result shows the effectiveness of the adaptive-distillation strategy with reasonable meta variables.

4.4. FURTHER INVESTIGATION ON ADAPTIVE BLOCK-WISE LEARNING

To further analyze ABL, we explore the optimization process of the meta variables and the results from a block-wise view. Optimization process of meta variables We plot the optimization process of the meta variables in Figures 3(a) and 3(b). As the block index increases, the network parameters increasingly prefer their own error signals (student error), while the influence of the teacher error on the total gradient error is reduced. In particular, in the shallow layers (the first three blocks), the parameter updates of ResNet20 are more affected by the teacher error signals. This means that, from the perspective of gradient flow, the shallow representations of the student model are more likely to be guided by the distilled teacher knowledge, while the deeper, more abstract representations are more inclined to update from self-knowledge.

Block-wise results

The results in Figure 4 demonstrate that shallow representations lack the ability to classify with high accuracy and to fit the teacher's deep representations. Among the specific approaches in our framework, DKD achieves better per-block accuracy than the other methods, which is also supported by Figures 4(a) and 4(b): the difference between the correlation matrices of the student's and teacher's logits after DKD distillation is smaller than that of ReviewKD. In Figure 4(c), the results show that DKD with ABL achieves better performance on each block. Interestingly, although the overall test accuracy of the Fitnets distillation method is worse than the other results, it leads in the first six blocks. The reason is that, in our experimental setting, the hint layer in Fitnets is set to the second residual block, i.e., the sixth block as we define it, so only the first six blocks receive teacher knowledge guidance. These observations present a block-wise perspective, which provides a more detailed view of the distillation procedures.

5. CONCLUSION

In this paper, we provide a novel viewpoint on knowledge distillation, which discusses the balance between the teacher's and the student's knowledge at different levels. Existing distillation methods are built on the implicit hypothesis that teacher knowledge and student knowledge make the same contribution to the learning of shallow and deep representations. However, we consider that shallow representations of the student network are easy to train under the guidance of teacher representations, while deep representations are difficult to imitate directly from the teacher's representation knowledge. Thus, we propose Adaptive Block-wise Learning for Knowledge Distillation, which leverages a set of meta variables to control the balance between the student's local error signals and the teacher's global error signals. The experimental results prove the effectiveness of the proposed method. More importantly, we confirm the hypothesis that the guidance of the teacher's knowledge to the student network does decrease as the block index increases.

Following Zhao et al. (2022) and Chen et al. (2021), we train the student model with the objective L = αL_CE + βL_KD. For the hyper-parameter α, all methods use 1 except KD, which uses 0.1. The settings of the distillation loss factor β are shown in Table 7. For the logits-based distillation methods, including KD and DKD, we set the temperature T = 4. All experiments are conducted on 2 NVIDIA TESLA V100S GPU cards.

A.2 STANDARD DEVIATION FOR RESULTS ON CIFAR-100 We run all our experiments on CIFAR-100 over 5 trials. In Table 8 and Table 9, we provide the standard deviations of the 5 trials on the 11 benchmarks.

A.3 MORE ABLATION STUDIES

1) Is there a problem of gradient vanishing? Gradient vanishing usually exists in networks with too many layers; we verify below whether such a problem exists in the standard KD scheme or in ours. We also examine two additional supervision strategies: only using teacher knowledge as supervision in the standard KD scheme (Stan.(tea.)), and the proposed ABL strategy under only teacher guidance (Adap.(tea.)). In Table 10, the standard KD scheme supervised only by teacher knowledge performs better than that supervised by both teacher knowledge and the ground-truth knowledge. This suggests that the standard KD framework limits the potential of teacher knowledge. However, the performance of the ABL framework with only teacher knowledge is poorer than that under both teacher's and student's knowledge. This shows that the meta variables effectively help the model learn under the guidance of both student knowledge and teacher knowledge. 3) How to design an appropriate auxiliary network? The only goal of the auxiliary network in ABL is to enable local errors instead of the global error, so as to support the subsequent meta weight allocation between the student and teacher knowledge contributions. Thus, an appropriate auxiliary network should make the performance approach that of the standard KD scheme (supervised only by the last layer). This means that the auxiliary network should not be too simple (shallow) to substitute local errors for the global error, nor so complex as to greatly increase the difficulty and duration of training. Inspired by greedy learning (Jaderberg et al., 2017; Nøkland & Eidnes, 2019; Belilovsky et al., 2020; Pyeon et al., 2021), we evaluate several types of auxiliary networks (Aux1-Aux4).

To explore the two types of knowledge in different blocks, we plot the test accuracy of each block of different student networks under the same DKD distillation setting, as shown in Figure 6.
In Figure 6, the results demonstrate that each block of the student networks trained by the proposed ABL scheme outperforms that under the fixed ABL setting. In the shallow blocks, the teacher's knowledge contributes more than the student's knowledge, and the accuracy is higher than the fixed one; in the deep blocks, the student's knowledge contributes more, and the accuracy is again higher than the fixed one. Thus, teacher knowledge is more suitable for guiding shallow representations, while student knowledge is more appropriate for guiding deep representations.



¹ Following the same setting as Liu et al. (2019), ξ = 0.01 / ||∇_(θ*,ϕ*) L_val(θ*, ϕ*, γ)||₂.



Figure 1: Comparison between the Backpropagation process of KD and adaptive block-wise learning for KD, from the perspective of gradient. The distilled knowledge can be based on logits or based on features. (a) The contributions of the knowledge from student and teacher are fixed and equal for different blocks. (b) The gradient flows are different for different blocks, and can be adaptively modified by meta variables γ.

Additional feature-based baselines: VID (Ahn et al., 2019), PKT (Passalis & Tefas, 2018), CRD (Tian et al., 2020), and ReviewKD (Chen et al., 2021).

Figure 2: Comparison of learning curves

Figure 3: All experiments are conducted on CIFAR-100 based on DKD with ResNet20 as student and ResNet56 as teacher. Figure (a) and (b) show the meta variables γ optimization process for student error and teacher error, respectively. The x-axis and y-axis represent the number of iterations and the softmax values (range from 0 to 1) of the meta variables, respectively. Note that the values of the meta variables are set to 0 at iteration 0, thus all the initial points in the figure are 0.5.

Figure 4: All the experiments are implemented under the setting of ResNet20 as student and ResNet56 as teacher. (a) and (b) show the correlation matrices between the class logits of student and teacher in different blocks. For ResNet20, a residual block consists of three blocks as we define them. (c) plots the test accuracy of the nine blocks of ResNet20 based on ABL for six types of distillation methods.

Figure 6: All the experiments are implemented under the same DKD setting. Fixed represents the model trained with the fixed contribution of student's knowledge and teacher's knowledge. Adaptive means the model under ABL training.

In addition to the above consideration of what knowledge to transfer, Song et al. (2022) consider where to transfer knowledge, while Mirzadeh et al. (2020), Son et al. (2021), and Chen et al. (2021) explore how to transfer knowledge. However, none of these focuses on an important question: how to balance the transferred knowledge and the knowledge from the student model, which more directly determines the performance of the target student model.

Test accuracy (%) of homogeneous distillation on the CIFAR-100. Stan., Adap., ∆ denote the standard KD method, the standard KD method within our adaptive block-wise framework and the performance improvement over the corresponding standard KD method, respectively.

Test accuracy (%) of heterogeneous distillation on the CIFAR-100. Stan., Adap., ∆ denote the standard KD method, the standard KD method within our adaptive block-wise framework and the performance improvement over the corresponding standard KD method, respectively.

Test accuracy (%) of KD on the ImageNet between the different-style architecture.

Test accuracy (%) of KD on the ImageNet between the same-style architecture.

Ablation studies on local error signals. Test accuracy (%) of knowledge distillation on the CIFAR-100. Global and Local denote student error from the global signals and the local signals.

Ablation studies on the block-wise distillation strategy. Test accuracy (%) of distillation methods on the CIFAR-100. Adaptive, Fixed, Random and No represent the adaptive-type, fixedtype, random-type and no-type distillation, respectively.

Test accuracy (%) of homogeneous distillation on the CIFAR-100 (mean ± std over 5 trials). Stan., Adap. denote the standard KD and the standard KD within our adaptive block-wise framework, respectively; each row lists Stan. / Adap. pairs for five teacher-student combinations.

…   …55±0.27 / 71.04±0.45 | 72.31±0.08 / 73.20±0.01 | 73.44±0.19 / 74.38±0.29 | 72.77±0.10 / 73.53±0.25 | 71.43±0.09 / 73.13±0.28
SP  69.67±0.20 / 71.06±0.42 | 72.69±0.41 / 73.07±0.05 | 72.94±0.23 / 74.20±0.22 | 72.43±0.27 / 73.27±0.01 | 72.68±0.19 / 73.52±0.45
CC  69.63±0.32 / 69.91±0.19 | 71.48±0.21 / 72.19±0.18 | 72.97±0.17 / 73.46±0.45 | 72.21±0.25 / 72.04±0.41 | 70.71±0.24 / 71.67±0.27
RKD 69.61±0.06 / 69.83±0.46 | 71.82±0.34 / 72.16±0.36 | 71.90±0.11 / 73.22±0.09 | 72.22±0.20 / 72.45±0.34 | 71.48±0.05 / 71.69±0.23
VID 70.38±0.14 / 70.75±0.23 | 72.61±0.17 / 72.85±0.43 | 73.09±0.21 / 73.59±0.28 | 73.30±0.13 / 72.96±0.45 | 71.23±0.23 / 71.74±0.05
PKT 70.34±0.04 / 70.96±0.21 | 72.61±0.17 / 73.02±0.19 | 73.64±0.18 / 74.64±0.37 | 73.45±0.19 / 73.40±0.30 | 72.88±0.09 / 73.06±0.07

Test accuracy (%) of heterogeneous distillation on the CIFAR-100.

Aux1: AvgPool + FC layer; Aux2: AvgPool + 3 point-wise convolutional layers + AvgPool + 3-layer MLP; Aux3: a point-wise convolutional layer + a depth-wise convolutional layer + AvgPool + FC layer; Aux4: a point-wise convolutional layer + a depth-wise convolutional layer + an inverted residual block + AvgPool + FC layer. Table 11 reports the results in testing accuracy (%) and training time (GPU hours).

Ablation studies on the auxiliary networks. All experiments are conducted under the same KD distillation setting.

A.4 TRAINING EFFICIENCY Due to the auxiliary networks for computing local errors and the bilevel optimization for the meta variables, ABL requires more training cost. The results on training efficiency are shown in Table 12. The total training time of ABL is about 1.25-1.35 times that of the standard KD, but ABL improves performance by 0.5%-2% over most distillation methods, which makes these costs worthwhile.

Training costs (GPU hours) of different models evaluated under the KD distillation method on the CIFAR-100. The total training process consists of a bilevel optimization and a final evaluation.

A APPENDIX

A.1 IMPLEMENTATION DETAILS We train our models on CIFAR-100 for 240 epochs with SGD, using a weight decay of 5e-4 and a momentum of 0.9. The batch size is 64; the initial learning rate is set to 0.01 for ShuffleNet and MobileNet and 0.05 for the others, and the learning rate is divided by 10 at epochs 150, 180, and 210. For ImageNet, we train the models for 100 epochs with a batch size of 128 and an initial learning rate of 0.1, which is divided by 10 at epochs 30, 60, and 90.
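The step schedule described above can be written as a small helper (illustrative code, matching the stated milestones for CIFAR-100):

```python
def step_lr(epoch, base_lr=0.05, milestones=(150, 180, 210), factor=0.1):
    """Divide the learning rate by 10 at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

For ImageNet the same helper applies with base_lr=0.1 and milestones (30, 60, 90).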

Table 7 (excerpt): settings of the distillation loss factor β.

Method: KD (Hinton et al., 2015), DKD (Zhao et al., 2022), Fitnets (Romero et al., 2015), AT (Komodakis & Zagoruyko, 2017)
β: 0.9, 1.0, 100, 1000
Method: SP (Tung & Mori, 2019), CC (Peng et al., 2019), RKD (Park et al., 2019)

To verify whether such a problem exists, and whether the proposed local objectives avoid it, we plot the training process of the standard KD framework and of our proposed framework, as well as the mean gradient of the first and last layers under the same setting, as shown in Figure 5. Figure 5(a) shows that the training of both the standard KD framework and our proposed framework is very stable: the loss keeps changing, which proves that neither framework stalls because of gradient vanishing. In Figures 5(b) and 5(c), there is no difference in magnitude between the mean gradients of the first layer and the last layer, and training never stagnates. This proves that there is no gradient vanishing in either our ABL scheme or the standard KD scheme. More importantly, the similarity of the curves of the two schemes verifies that the local error signals are good approximations of the global error signals. 2) Are the adaptive meta variables effective? We construct further experiments to verify the effectiveness of our proposed adaptive meta variables; the results are shown in Table 10, with the two additional strategies (Stan.(tea.) and Adap.(tea.)) described above.

