AUTOSKDBERT: LEARN TO STOCHASTICALLY DISTILL BERT

Abstract

In this paper, we propose AutoSKDBERT, a new knowledge distillation paradigm for BERT compression which, at each step, stochastically samples a teacher from a predefined teacher team following a categorical distribution, to transfer knowledge into the student. AutoSKDBERT aims to discover the optimal categorical distribution, which plays an important role in achieving high performance. Its optimization procedure is divided into two phases: 1) phase-1 optimization distinguishes effective teachers from ineffective ones, and 2) phase-2 optimization further tunes the sampling weights of the effective teachers to obtain a satisfactory categorical distribution. Moreover, after phase-1 optimization completes, AutoSKDBERT adopts a teacher selection strategy that discards the ineffective teachers and reassigns their sampling weights to the effective ones. In particular, to alleviate the gap between categorical distribution optimization and evaluation, we also propose a stochastic single-weight optimization strategy which updates only the weight of the sampled teacher at each step. Extensive experiments on the GLUE benchmark show that the proposed AutoSKDBERT achieves state-of-the-art scores compared to previous compression approaches on several downstream tasks, including pushing MRPC F1 and accuracy to 93.2 (0.6 point absolute improvement) and 90.7 (1.2 point absolute improvement), and RTE accuracy to 76.9 (2.9 point absolute improvement).

1. INTRODUCTION

BERT (Devlin et al., 2019) has brought about a sea change in the field of Natural Language Processing (NLP). Following BERT, numerous subsequent works focus on various perspectives to further improve its performance, e.g., hyper-parameters (Liu et al., 2019b), pre-training corpora (Liu et al., 2019b; Raffel et al., 2020), learnable embedding paradigms (Raffel et al., 2020), pre-training tasks (Clark et al., 2020), architecture (Gao et al., 2022) and self-attention (Shi et al., 2021). However, there are massive redundancies in the above BERT-style models w.r.t. attention heads (Michel et al., 2019; Dong et al., 2021), weights (Gordon et al., 2020), and layers (Fan et al., 2020). Consequently, many compact BERT-style language models have been proposed via pruning (Fan et al., 2020; Guo et al., 2019), quantization (Shen et al., 2020), parameter sharing (Lan et al., 2020) and Knowledge Distillation (KD) (Iandola et al., 2020; Pan et al., 2021). In this paper, we focus on KD-based compression approaches. From the point of view of the learning procedure, KD is used in the pre-training (Turc et al., 2019; Sanh et al., 2019; Sun et al., 2020; Jiao et al., 2020) and fine-tuning phases (Sun et al., 2019; Jiao et al., 2020; Wu et al., 2021). From the point of view of the distillation objective, KD is applied to the outputs of hidden layers (Sun et al., 2020), the final layer (Wu et al., 2021), embeddings (Sanh et al., 2019) and self-attention (Wang et al., 2020). Wu et al. (2021) employ multiple teachers to achieve better performance than single-teacher KD approaches on several downstream tasks of the GLUE benchmark (Wang et al., 2019). Nevertheless, as shown in Table 1, the ensemble of multiple teachers is not always more effective than a single teacher for student distillation. There are two possible reasons: 1) loss of diversity (Tran et al., 2020) and 2) the capacity gap (Mirzadeh et al., 2020).
On the one hand, the ensemble prediction of multi-teacher KD loses the diversity of the individual teachers. On the other hand, there is a capacity gap between the large-capacity teacher ensemble and the small-capacity student, which tends to cause unsatisfactory distillation performance. To address these issues, we propose AutoSKDBERT, which at each step stochastically samples a teacher from a predefined teacher team following a categorical distribution, to transfer knowledge into the student. The task of AutoSKDBERT is to learn the optimal categorical distribution for high performance. 1) Given a teacher team consisting of multiple teachers with multi-level capacities, AutoSKDBERT optimizes an initialized categorical distribution to distinguish effective teachers from ineffective ones in phase-1 optimization. 2) After phase-1 optimization completes, the sampling weights of the ineffective teachers are reassigned to the effective teachers via the teacher selection strategy. 3) In phase-2 optimization, AutoSKDBERT further optimizes the weights of the effective teachers only. We conduct extensive experiments on the GLUE benchmark (Wang et al., 2019) to verify the effectiveness of the proposed AutoSKDBERT. Moreover, to demonstrate its generalization capacity, we also distill deep convolutional neural networks (e.g., ResNet (He et al., 2016), Wide ResNet (Zagoruyko & Komodakis, 2016)) with AutoSKDBERT for image classification on CIFAR-100 (Krizhevsky et al., 2009), as shown in Appendix B. Our contributions are summarized as follows¹:

• We propose AutoSKDBERT, which at each step stochastically samples a teacher from the predefined teacher team following a categorical distribution, to transfer knowledge into a BERT-style student language model.

• We propose a two-phase optimization framework with a teacher selection strategy to select effective teachers and learn the optimal categorical distribution in a differentiable way.
• We propose a Stochastic Single-Weight Optimization (SSWO) strategy to alleviate the consistency gap between categorical distribution optimization and evaluation, which improves performance.
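The teacher selection step and SSWO described above can be illustrated with a minimal pure-Python sketch. The softmax parameterization of the sampling weights via logits `alpha`, the function names, and the plain gradient-step update rule are our illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(alpha):
    """Recover the categorical distribution theta from logits alpha."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    s = sum(exps)
    return [e / s for e in exps]

def select_teachers(theta, keep):
    """Teacher selection after phase-1: discard ineffective teachers and
    reassign their sampling weights to the effective ones (index set `keep`)
    by renormalizing theta over the kept teachers."""
    total = sum(theta[i] for i in keep)
    return [theta[i] / total if i in keep else 0.0 for i in range(len(theta))]

def sswo_update(alpha, i, grad_i, lr=0.1):
    """Stochastic Single-Weight Optimization (SSWO): in each step, update
    only the logit of the sampled teacher i; all other entries stay frozen,
    mirroring the fact that only that teacher was used in the step."""
    alpha = list(alpha)
    alpha[i] -= lr * grad_i
    return alpha
```

For example, `select_teachers([0.1, 0.2, 0.3, 0.4], {2, 3})` zeroes the first two weights and renormalizes the remaining mass so the kept teachers' weights again sum to one.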

2. THE PROPOSED AUTOSKDBERT

2.1 OVERVIEW

At each step, AutoSKDBERT samples a teacher $T$ from a teacher team consisting of $n$ multi-level BERT-style teachers $T_{1:n}$, to transfer knowledge into the student $S$. The objective function of AutoSKDBERT can be expressed as
$$\mathcal{L}(w) = \sum_{x \in \mathcal{X}} \mathcal{L}_d\big(f_{T \in T_{1:n}}(x),\, f_S(x; w)\big),$$
where $\mathcal{L}_d$ denotes the distillation loss measuring the difference between the student $S$ with learnable parameters $w$ and the sampled teacher $T$, $\mathcal{X}$ denotes the training data, and $f_{T \in T_{1:n}}(\cdot)$ and $f_S(\cdot)$ denote the logits of $T$ and $S$, respectively. In AutoSKDBERT, a categorical distribution $\mathrm{Cat}(\theta)$, where $\theta = \{\theta_{1:n}\}$ and $\sum_{i=1}^{n} \theta_i = 1$, is employed to sample the teacher from the teacher team. In particular, the probability $p(T_i)$ of $T_i$ being sampled is $\theta_i$. We observe that $\mathrm{Cat}(\theta)$ plays an important role in obtaining high performance. As a result, the task of AutoSKDBERT turns into learning the optimal categorical distribution $\mathrm{Cat}(\theta^{*})$, as illustrated in Figure 1.
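A single stochastic distillation step can be sketched in plain Python. The helper names and the temperature-scaled soft-label KL loss standing in for $\mathcal{L}_d$ are our illustrative assumptions; the paper's concrete distillation loss may differ:

```python
import math
import random

def softmax(logits, t=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(x / t for x in logits)
    exps = [math.exp(x / t - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_teacher(theta, rng=random):
    """Draw an index i with probability theta[i], i.e. a sample from Cat(theta)."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(theta):
        acc += p
        if r < acc:
            return i
    return len(theta) - 1

def skd_step(theta, teacher_logits, student_logits, t=1.0, rng=random):
    """One stochastic KD step: sample a teacher from Cat(theta) and return
    its index plus the KL divergence between the sampled teacher's softened
    prediction and the student's (per-example soft-label distillation loss)."""
    i = sample_teacher(theta, rng)
    p = softmax(teacher_logits[i], t)   # sampled teacher's soft labels
    q = softmax(student_logits, t)      # student's prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return i, kl * t * t
```

Because only one teacher is sampled per step, each step's loss retains that teacher's individual prediction rather than an ensemble average, which is exactly the diversity-preserving property motivating AutoSKDBERT.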



¹ The code will be made publicly available upon publication of the paper.



Table 1: Performance of knowledge distillation using a single teacher and multiple teachers for a 6-layer BERT-style language model on the development set of the GLUE benchmark. In this experiment, we employ five teachers (T10 to T14, shown in Appendix C.1) for both single-teacher and multi-teacher distillation. Implementation details are given in Appendix H. The best teacher for student distillation on each downstream task is shown in Table 15. ‡: pre-training with whole word masking.

