ADAPTIVE BLOCK-WISE LEARNING FOR KNOWLEDGE DISTILLATION

Abstract

Knowledge distillation allows a student network to improve its performance under the supervision of transferred knowledge. Existing knowledge distillation methods are built on the implicit hypothesis that knowledge from the teacher and the student contributes to every layer of the student network to the same extent. In this work, we argue that the contributions of teacher and student knowledge should differ across layers during training, and our experimental results support this argument. To this end, we propose a novel Adaptive Block-wise Learning (ABL) framework for knowledge distillation that automatically balances teacher-guided knowledge and self-knowledge in each block. Specifically, since the error backpropagation algorithm cannot assign weights to each block of the student network independently, we leverage local error signals to approximate the global error signals of the student objective. Moreover, we utilize a set of meta variables to control the contributions of student knowledge and teacher knowledge to each block during training. Finally, extensive experiments demonstrate the effectiveness of our method. Meanwhile, ABL provides an insightful view: in the shallow blocks, teacher guidance carries greater weight, while in the deep blocks, student knowledge has more influence.

1. INTRODUCTION

Knowledge distillation (KD) in deep learning imitates the pattern of human learning. Hinton et al. (2015) propose the original concept of KD, which minimizes the KL divergence between the logits of the teacher (soft labels) and the student. This casts KD as a mode in which a complex pre-trained model serves as a teacher to guide the learning of a lightweight student model. Following this teacher-student framework, subsequent KD methods have mainly developed along three directions: what, where, and how to distill. Regardless of direction, these existing KD methods rest on the same implicit hypothesis: during distillation training, the contributions of the student's and the teacher's knowledge are fixed for every layer of the student network, whether the last layer or the first. This is because under error backpropagation (BP) (Rumelhart et al., 1986), the weight of the error signals on each layer is determined by the same hyper-parameters, as shown in Figure 1 (a). Intuitively, this limits the flexibility of balancing the knowledge of teacher and student, which hinders realizing the full potential of the student model. Therefore, we argue that when the student network's representations are learned under teacher guidance, different layers of the student network place different emphases on the knowledge learned from the one-hot labels and the knowledge distilled from the teacher: some layers are more inclined to learn from student knowledge, while others tend to rely on teacher knowledge. Furthermore, we argue that the contributions of student and teacher knowledge to representation learning should be adaptive at each layer. However, existing KD methods obtain a global error signal from the last layer, which makes it hard to allocate layer-wise weights.
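To make the fixed-weight hypothesis concrete, the vanilla KD objective of Hinton et al. (2015) can be sketched as follows. A single global hyper-parameter alpha weights the one-hot cross-entropy term against the KL term between softened teacher and student distributions, so every layer receives error signals balanced in the same proportion (a minimal numpy sketch; the function and variable names are ours, not from any particular codebase):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """Vanilla KD objective: one global alpha balances the one-hot CE term
    and the KL term between softened teacher/student distributions."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[label])                         # cross-entropy with one-hot label
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    kl = np.sum(q_t * (np.log(q_t) - np.log(q_s)))   # KL(teacher || student)
    return alpha * ce + (1.0 - alpha) * T * T * kl   # T^2 restores the gradient scale

logits_s = np.array([1.0, 2.0, 0.5])
logits_t = np.array([1.2, 2.5, 0.3])
loss = kd_loss(logits_s, logits_t, label=1)
```

Because both terms enter one scalar loss before backpropagation, the ratio alpha : (1 - alpha) is inherited by the gradients at every layer; this is exactly the rigidity the present work aims to remove.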
To explore the student network hierarchically during training, we modify the backward computation graph and leverage local error signals from the family of local objectives (Jaderberg et al., 2017; Nøkland & Eidnes, 2019; Belilovsky et al., 2020; Pyeon et al., 2021), with each local objective corresponding to the one-hot labels. This makes it possible to independently assign different weights to teacher knowledge and student knowledge at different layers. Once the student error signals of each layer can be obtained independently, the remaining issue is how to balance student knowledge and teacher knowledge at each layer. We model this issue as a bilevel optimization problem (Anandalingam & Friesz, 1992) by attaching a set of meta variables to the error signals corresponding to the two types of knowledge. These meta variables represent which kind of knowledge the update of the corresponding layer prefers. After solving the bilevel optimization with a gradient-descent method, we obtain the optimal meta variables of the target network under the target KD method and use them for the final evaluation. To this end, we propose a novel paradigm dubbed Adaptive Block-wise Learning (ABL) for Knowledge Distillation, which allows the conventional teacher-student architecture to explore the influence of teacher and student knowledge block by block. As shown in Figure 1 (b), the proposed method changes the acquisition path of the error signals of the student objective from global to local and adds a group of meta variables to measure the contributions of student and teacher knowledge. Furthermore, we learn the balance between student and teacher knowledge on the validation set, and we use the optimized meta variables to train the corresponding distillation method. Our main contributions are as follows:
1. We propose a novel paradigm named Adaptive Block-wise Learning for knowledge distillation, which automatically balances the contributions of student and teacher knowledge for each block.
2. We discover that deep, abstract representations incline toward learning from student knowledge, while shallow, less abstract representations tend to be guided by teacher knowledge. We hope this discovery provides another view of learning in KD.
3. We conduct extensive experiments on eleven recent distillation benchmarks. Experimental results demonstrate the effectiveness of the proposed framework in improving the performance of existing distillation methods.
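The per-block weighting described above can be illustrated with a deliberately tiny toy example. This is not the paper's formulation (the real method uses local objectives on network blocks and learns gamma via bilevel optimization on a validation set); it only shows the mechanism of mixing two gradient signals with a per-block meta variable gamma_b, where the local losses, the three scalar "blocks", and the fixed gamma values are all illustrative assumptions:

```python
import numpy as np

# Toy "network": three blocks, each reduced to a single scalar parameter.
params = np.zeros(3)

def local_student_grad(w):
    """Gradient of a block's local one-hot objective
    (toy quadratic pulling toward a student-optimal value, 1.0)."""
    return 2.0 * (w - 1.0)

def teacher_grad(w):
    """Gradient of a block's distillation objective
    (toy quadratic pulling toward a teacher-preferred value, 2.0)."""
    return 2.0 * (w - 2.0)

# Meta variables gamma[b] in [0, 1]: the weight on student knowledge.
# Shallow blocks (small index) lean on the teacher, deep blocks on the
# student, mirroring the trend reported in the paper.
gamma = np.array([0.2, 0.5, 0.8])

lr = 0.1
for step in range(200):
    for b in range(3):
        # Per-block mix of the two error signals, controlled by gamma[b].
        g = gamma[b] * local_student_grad(params[b]) \
            + (1.0 - gamma[b]) * teacher_grad(params[b])
        params[b] -= lr * g
```

Each toy block converges to the gamma-weighted compromise `2 - gamma[b]`, so the shallow block ends up closest to the teacher's preferred value and the deep block closest to the student's; the point is only that per-block gammas yield per-block trade-offs, which a single global weight cannot.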

2. RELATED WORK

Knowledge distillation. Knowledge distillation typically transfers knowledge from a large model to a small model under the teacher-student framework. Vanilla KD was first proposed by Hinton et al. (2015), which lets the student model mimic the final prediction of the teacher model. Zhao et al. (2022) decouple the logit knowledge into a target-class (binary-probability) part and a non-target-class part. In addition to the above two logits-based KD methods, there are
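The decoupling of Zhao et al. (2022) can be sketched roughly as follows: the teacher-student KL is split into a binary (target vs. everything else) part and a part over the non-target classes renormalised to sum to one. This is a sketch in our own notation, not the authors' code, and omits their loss weighting:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoupled_terms(student_logits, teacher_logits, target):
    """Split the teacher->student KL into a target/non-target binary term
    and a term over the non-target classes, in the spirit of Zhao et al. (2022)."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)

    # Binary probabilities: target class vs. all other classes pooled.
    b_s = np.array([p_s[target], 1.0 - p_s[target]])
    b_t = np.array([p_t[target], 1.0 - p_t[target]])
    tckd = np.sum(b_t * (np.log(b_t) - np.log(b_s)))  # target-class KL

    # Distributions renormalised over the non-target classes only.
    mask = np.arange(len(p_s)) != target
    n_s = p_s[mask] / p_s[mask].sum()
    n_t = p_t[mask] / p_t[mask].sum()
    nckd = np.sum(n_t * (np.log(n_t) - np.log(n_s)))  # non-target KL

    return tckd, nckd
```

Both terms are KL divergences and hence non-negative, and both vanish when the student's predictions match the teacher's exactly.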



Figure 1: Comparison between the backpropagation process of standard KD and of adaptive block-wise learning for KD, from the perspective of gradients. The distilled knowledge can be logits-based or feature-based. (a) The contributions of student and teacher knowledge are fixed and equal across blocks. (b) The gradient flows differ across blocks and can be adaptively modified by the meta variables γ.

