AUTOSKDBERT: LEARN TO STOCHASTICALLY DISTILL BERT

Abstract

In this paper, we propose AutoSKDBERT, a new knowledge distillation paradigm for BERT compression that, in each step, stochastically samples a teacher from a predefined teacher team following a categorical distribution to transfer knowledge into the student. AutoSKDBERT aims to discover the optimal categorical distribution, which plays an important role in achieving high performance. The optimization procedure of AutoSKDBERT is divided into two phases: 1) phase-1 optimization distinguishes effective teachers from ineffective teachers, and 2) phase-2 optimization further optimizes the sampling weights of the effective teachers to obtain a satisfactory categorical distribution. Moreover, after phase-1 optimization, AutoSKDBERT adopts a teacher selection strategy to discard the ineffective teachers, whose sampling weights are reassigned to the effective teachers. In particular, to alleviate the gap between categorical distribution optimization and evaluation, we also propose a stochastic single-weight optimization strategy which only updates the weight of the sampled teacher in each step. Extensive experiments on the GLUE benchmark show that the proposed AutoSKDBERT achieves state-of-the-art scores compared to previous compression approaches on several downstream tasks, including pushing MRPC F1 and accuracy to 93.2 (0.6 point absolute improvement) and 90.7 (1.2 point absolute improvement), and RTE accuracy to 76.9 (2.9 point absolute improvement).

1. INTRODUCTION

BERT (Devlin et al., 2019) has brought about a sea change in the field of Natural Language Processing (NLP). Following BERT, numerous subsequent works improve its performance from various perspectives, e.g., hyper-parameters (Liu et al., 2019b), pre-training corpora (Liu et al., 2019b; Raffel et al., 2020), learnable embedding paradigms (Raffel et al., 2020), pre-training tasks (Clark et al., 2020), architecture (Gao et al., 2022) and self-attention (Shi et al., 2021). However, there are massive redundancies in the above BERT-style models w.r.t. attention heads (Michel et al., 2019; Dong et al., 2021), weights (Gordon et al., 2020) and layers (Fan et al., 2020). Consequently, many compact BERT-style language models have been proposed via pruning (Fan et al., 2020; Guo et al., 2019), quantization (Shen et al., 2020), parameter sharing (Lan et al., 2020) and Knowledge Distillation (KD) (Iandola et al., 2020; Pan et al., 2021). In this paper, we focus on KD-based compression approaches. From the point of view of the learning procedure, KD is used in the pre-training (Turc et al., 2019; Sanh et al., 2019; Sun et al., 2020; Jiao et al., 2020) and fine-tuning phases (Sun et al., 2019; Jiao et al., 2020; Wu et al., 2021). From the point of view of the distillation objective, KD is applied to the outputs of hidden layers (Sun et al., 2020), the final layer (Wu et al., 2021), embeddings (Sanh et al., 2019) and self-attention (Wang et al., 2020).

Wu et al. (2021) employ multiple teachers to achieve better performance than single-teacher KD approaches on several downstream tasks of the GLUE benchmark (Wang et al., 2019). Nevertheless, as shown in Table 1, the ensemble of multiple teachers is not always more effective than a single teacher for student distillation. There are two possible reasons: 1) diversity loss (Tran et al., 2020) and 2) the capacity gap (Mirzadeh et al., 2020).
On the one hand, the ensemble prediction of multi-teacher KD loses the diversity of the individual teachers. On the other hand, between the large-capacity teacher ensemble and the small-capacity student there is a capacity gap which is prone to cause unsatisfactory distillation performance.

Table 1: Performance of knowledge distillation using single and multiple teachers for a 6-layer BERT-style language model on the development set of the GLUE benchmark. In this experiment, we employ five teachers, i.e., T10 to T14 shown in Appendix C.1, for single-teacher and multi-teacher distillation. Implementation details are given in Appendix H.

Our contributions are summarized as follows:
• We propose AutoSKDBERT, which stochastically samples a teacher from a predefined teacher team following a categorical distribution in each step, to transfer knowledge into a BERT-style student language model.
• We propose a two-phase optimization framework with a teacher selection strategy to select effective teachers and learn the optimal categorical distribution in a differentiable way.
• We propose a Stochastic Single-Weight Optimization (SSWO) strategy to alleviate the consistency gap between categorical distribution optimization and evaluation for performance improvement.

2. THE PROPOSED AUTOSKDBERT

2.1. OVERVIEW

In each step, AutoSKDBERT samples a teacher T from a teacher team consisting of n multi-level BERT-style teachers T_{1:n} to transfer knowledge into the student S. The objective function of AutoSKDBERT can be expressed as

$$\mathcal{L}(\mathbf{w}) = \sum_{x \in \mathcal{X}} \mathcal{L}_d\big(f_{T \in T_{1:n}}(x),\, f_S(x; \mathbf{w})\big), \quad (1)$$

where $\mathcal{L}_d$ is the distillation loss measuring the difference between the student S with learnable parameters w and the sampled teacher T, $\mathcal{X}$ denotes the training data, and $f_{T \in T_{1:n}}(\cdot)$ and $f_S(\cdot)$ denote the logits of T and S, respectively. In AutoSKDBERT, a categorical distribution Cat(θ), where θ = {θ_{1:n}} and $\sum_{i=1}^{n} \theta_i = 1$, is employed to sample the teacher from the teacher team; in particular, the probability p(T_i) of T_i being sampled is θ_i. We observe that Cat(θ) plays an important role in obtaining high performance. As a result, the task of AutoSKDBERT turns into learning the optimal categorical distribution Cat(θ*), as illustrated in Figure 1.

Figure 1: Two-phase optimization framework with teacher selection strategy for AutoSKDBERT. 1) For a predefined teacher team, AutoSKDBERT optimizes an initialized categorical distribution to distinguish effective teachers from ineffective teachers. 2) After phase-1 optimization, the sampling weights of the ineffective teachers are assigned to the effective teachers via the teacher selection strategy. 3) AutoSKDBERT further optimizes the weights of the effective teachers rather than the ineffective teachers in phase-2 optimization. Best viewed in color.
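The per-step teacher sampling described above can be sketched in a few lines of NumPy. The helper names (`sample_teacher`, `distill_loss`) are ours, and the soft cross-entropy form of $\mathcal{L}_d$ is an illustrative assumption rather than the paper's exact loss:

```python
import numpy as np

def sample_teacher(theta, rng):
    """Draw a teacher index i with probability theta[i], i.e. T ~ Cat(theta)."""
    return rng.choice(len(theta), p=theta)

def distill_loss(teacher_logits, student_logits):
    """Soft cross-entropy between teacher and student distributions; a common
    choice for the distilled loss L_d (the paper's exact loss may differ)."""
    t = np.exp(teacher_logits - teacher_logits.max())
    t /= t.sum()                              # teacher soft targets
    s = student_logits - student_logits.max()
    log_s = s - np.log(np.exp(s).sum())       # student log-probabilities
    return float(-(t * log_s).sum())

rng = np.random.default_rng(0)
theta = np.array([0.1, 0.2, 0.7])             # sampling weights, sum to 1
counts = np.bincount([sample_teacher(theta, rng) for _ in range(10000)],
                     minlength=3)
freq = counts / 10000                         # empirical frequencies approach theta
```

Over many steps the empirical sampling frequencies converge to θ, so high-weight teachers dominate the distillation signal while low-weight teachers still contribute occasionally.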

2.2. PROBLEM FORMULATION

AutoSKDBERT has two groups of learnable parameters: 1) w of the student and 2) θ of the categorical distribution. We split the original training data into training and validation subsets, and denote by L_train and L_val the losses on the training and validation subsets, respectively. Both L_train and L_val are determined not only by Cat(θ) but also by w. In particular, AutoSKDBERT aims to learn the best categorical distribution Cat(θ*) that minimizes the validation loss L_val(w*, Cat(θ)), where the weights w* associated with the categorical distribution Cat(θ) are obtained by argmin_w L_train(w, Cat(θ)). Consequently, AutoSKDBERT can be cast as a bilevel optimization problem (Colson et al., 2007) with upper-level variable Cat(θ) and lower-level variable w:

$$\min_{\mathrm{Cat}(\theta)} \; \mathcal{L}_{val}\big(\mathbf{w}^*(\mathrm{Cat}(\theta)),\, \mathrm{Cat}(\theta)\big), \quad \text{s.t.} \quad \mathbf{w}^*(\mathrm{Cat}(\theta)) = \operatorname*{argmin}_{\mathbf{w}} \mathcal{L}_{train}\big(\mathbf{w},\, \mathrm{Cat}(\theta)\big). \quad (2)$$

We optimize w of the student (see Section 2.3) and θ of the categorical distribution (see Section 2.4) in an alternating, iterative way, and show the optimization procedure in Algorithm 1.

2.3. STUDENT DISTILLATION

For student distillation, Cat(θ) is frozen. Similar to Eq. 1, we utilize the following objective function:

$$\mathcal{L}(\mathbf{w}) = \sum_{x \in \mathcal{X}} \mathcal{L}_d\big(\tilde{\theta} f_{T \in T_{1:n}}(x),\, f_S(x; \mathbf{w})\big), \quad (3)$$

where $\tilde{\theta}$ indicates the probability of the teacher T being sampled from T_{1:n} according to Cat(θ).

Algorithm 1: Two-phase Optimization for AutoSKDBERT
  Initialize categorical distribution Cat(θ^(1)) for phase-1 optimization, student weights w, maximum step N; set current step n = 0;
  while n < N/2 do
    Update Cat(θ^(1)) by descending Eq. 7;   // phase-1 categorical distribution optimization
    Update w by descending Eq. 3;            // phase-1 student distillation
    n = n + 1;
  end
  Select effective teachers to generate Cat(θ^(2)) by Eq. 8;   // teacher selection
  while N/2 ≤ n < N do
    Update Cat(θ^(2)) by descending Eq. 7;   // phase-2 categorical distribution optimization
    Update w by descending Eq. 3;            // phase-2 student distillation
    n = n + 1;
  end
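Algorithm 1 can be summarized as a short, framework-agnostic Python skeleton. Here `update_theta`, `update_student` and `select` are placeholders standing in for the gradient steps of Eq. 7, Eq. 3 and the teacher selection of Eq. 8, and the toy run below is purely illustrative:

```python
import numpy as np

def two_phase_optimization(theta, update_theta, update_student, select, N):
    """Skeleton of Algorithm 1: alternate categorical-distribution and student
    updates, with teacher selection at the halfway point."""
    n = 0
    while n < N // 2:              # phase-1 optimization
        theta = update_theta(theta)
        update_student(theta)
        n += 1
    theta = select(theta)          # teacher selection (Eq. 8)
    while n < N:                   # phase-2 optimization
        theta = update_theta(theta)
        update_student(theta)
        n += 1
    return theta

# toy run: no-op updates, and a selection that zeros the single smallest
# weight and renormalizes the survivors
drop_min = lambda t: (lambda m: m / m.sum())(np.where(t == t.min(), 0.0, t))
theta = two_phase_optimization(np.array([0.1, 0.3, 0.6]),
                               update_theta=lambda t: t,
                               update_student=lambda t: None,
                               select=drop_min, N=10)
# theta -> [0.0, 1/3, 2/3]
```

The key structural point is that the two phases share the same update rules; only the set of teachers carrying non-zero sampling weight changes after selection.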

2.4. CATEGORICAL DISTRIBUTION OPTIMIZATION

For categorical distribution optimization, w is frozen. We propose a two-phase optimization framework with a teacher selection strategy to learn an appropriate categorical distribution:
1. Phase-1 optimization distinguishes effective teachers from ineffective teachers in the teacher team according to Cat(θ);
2. Teacher selection discards the ineffective teachers, whose weights are assigned to the effective teachers;
3. Phase-2 optimization further optimizes the weights of the effective teachers rather than the ineffective teachers;
where a Stochastic Single-Weight Optimization (SSWO) strategy is proposed for categorical distribution optimization. Below, categorical distribution optimization and the teacher selection strategy are introduced in detail.

2.4.1. CATEGORICAL DISTRIBUTION OPTIMIZATION VIA SSWO

To optimize Cat(θ) in a differentiable way, Continuous Relaxation (CR) (Liu et al., 2019a) is a common technique that forms a mixture of the teachers' logits:

$$f_{T_{1:n}}(x; \mathrm{Cat}(\theta)) = \sum_{i=1}^{n} \theta_i f_{T_i}(x). \quad (4)$$

Subsequently, Cat(θ) can be optimized by an approximation scheme:

$$\nabla_{\mathrm{Cat}(\theta)} \mathcal{L}_{val}\big(\mathbf{w}^*(\mathrm{Cat}(\theta)),\, \mathrm{Cat}(\theta)\big) \approx \nabla_{\mathrm{Cat}(\theta)} \mathcal{L}_{val}\big(\mathbf{w} - \alpha \nabla_{\mathbf{w}} \mathcal{L}_{train}(\mathbf{w}, \mathrm{Cat}(\theta)),\, \mathrm{Cat}(\theta)\big), \quad (5)$$

where w and α denote the current student weights and the learning rate of the categorical distribution, respectively. In particular, we use a single-step adaptation of w, i.e., $\mathbf{w} - \alpha \nabla_{\mathbf{w}} \mathcal{L}_{train}(\mathbf{w}, \mathrm{Cat}(\theta))$, to approximate $\mathbf{w}^*(\mathrm{Cat}(\theta))$ and avoid solving the inner optimization in Eq. 2. This approximation scheme has been widely used in meta-learning (Finn et al., 2017) and neural architecture search (Liu et al., 2019a).

However, with CR there is a consistency gap between categorical distribution optimization and evaluation in terms of the teacher's logits. During optimization, the mixture $f_{T_{1:n}}(x; \mathrm{Cat}(\theta))$ is compared to the student's logits as $\sum_{x \in \mathcal{X}} \mathcal{L}_d(f_{T_{1:n}}(x; \mathrm{Cat}(\theta)), f_S(x; \mathbf{w}))$. During evaluation, however, only the logits of the sampled teacher $f_{T \in T_{1:n}}(x)$ are compared to the student's logits as $\sum_{x \in \mathcal{X}} \mathcal{L}_d(f_{T \in T_{1:n}}(x), f_S(x; \mathbf{w}))$. To alleviate this consistency gap, we propose SSWO, whose objective function can be written as

$$\mathcal{L}(\mathbf{w}; \tilde{\theta}) = \sum_{x \in \mathcal{X}} \mathcal{L}_d\big(\tilde{\theta} f_{T \in T_{1:n}}(x),\, f_S(x; \mathbf{w})\big), \quad (6)$$

where $\tilde{\theta}$, the sampling weight of the sampled teacher, also plays a role like label smoothing (Szegedy et al., 2016): it reduces the confidence coefficient of the sampled teacher and avoids overfitting (Müller et al., 2019) of the categorical distribution, and the smaller the sampling weight, the stronger the reduction. The sampled single weight $\tilde{\theta}$ can then be optimized by

$$\nabla_{\tilde{\theta} \sim \mathrm{Cat}(\theta)} \mathcal{L}_{val}\big(\mathbf{w}^*(\tilde{\theta}),\, \tilde{\theta}\big) \approx \nabla_{\tilde{\theta} \sim \mathrm{Cat}(\theta)} \mathcal{L}_{val}\big(\mathbf{w} - \alpha \nabla_{\mathbf{w}} \mathcal{L}_{train}(\mathbf{w}, \tilde{\theta}),\, \tilde{\theta}\big). \quad (7)$$
In practice, the proposed SSWO achieves better performance than CR, as shown in Section 4.2.
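The difference between CR and SSWO on the teacher side can be made concrete with a small NumPy sketch; the function names are ours and the logits are toy values:

```python
import numpy as np

def cr_teacher_logits(theta, teacher_logits):
    """Continuous relaxation: a mixture of ALL teachers' logits,
    sum_i theta_i * f_Ti(x)."""
    return np.tensordot(theta, teacher_logits, axes=1)

def sswo_teacher_logits(theta, teacher_logits, rng):
    """SSWO: sample ONE teacher from Cat(theta) and scale only its logits by
    its own sampling weight, matching how the distribution is evaluated."""
    i = rng.choice(len(theta), p=theta)
    return theta[i] * teacher_logits[i], i

rng = np.random.default_rng(1)
theta = np.array([0.25, 0.75])
logits = np.array([[4.0, 0.0],    # teacher 1
                   [0.0, 4.0]])   # teacher 2
mix = cr_teacher_logits(theta, logits)                # blends both teachers
single, i = sswo_teacher_logits(theta, logits, rng)   # only teacher i survives
# mix -> [1.0, 3.0]
```

Under CR the student always sees the blend [1.0, 3.0], losing each teacher's individual prediction; under SSWO it sees one teacher per step, scaled by its weight, so only that single weight receives a gradient.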

2.4.2. TEACHER SELECTION

After phase-1 optimization, m ineffective teachers are separated from the teacher team according to the current categorical distribution Cat(θ^(1)), where the smaller the weight, the more ineffective the teacher. To avoid optimizing the categorical distribution from scratch, we present a teacher selection strategy that assigns the weights of the m ineffective teachers to the n − m effective teachers, delivering the categorical distribution Cat(θ^(2)) for phase-2 optimization:

$$\mathrm{Cat}(\theta^{(2)}) = \frac{\mathrm{Cat}(\theta^{(1)}) \odot \mathrm{mask}\big(\mathrm{m\_smallest}(\mathrm{Cat}(\theta^{(1)}), m)\big)}{\max\big(\big\|\mathrm{Cat}(\theta^{(1)}) \odot \mathrm{mask}\big(\mathrm{m\_smallest}(\mathrm{Cat}(\theta^{(1)}), m)\big)\big\|_p,\; \epsilon\big)}, \quad (8)$$

where p (1 in this paper) denotes the exponent of the norm, ε is a small value (1e-12 in this paper) to avoid division by zero, m_smallest(Cat(θ^(1)), m) returns the indexes of the m ineffective teachers according to Cat(θ^(1)), and mask(·) generates a mask whose values are 0 for the m ineffective teachers and 1 for the n − m effective teachers.
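The selection step amounts to masking the m smallest weights and renormalizing. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def teacher_selection(theta, m, p=1, eps=1e-12):
    """Zero the m smallest sampling weights, then renormalize by the L_p norm
    (p=1 in the paper, so the surviving weights again sum to 1)."""
    drop = np.argsort(theta)[:m]          # indexes of the m ineffective teachers
    mask = np.ones_like(theta)
    mask[drop] = 0.0                      # 0 for ineffective, 1 for effective
    masked = theta * mask
    return masked / max(np.linalg.norm(masked, ord=p), eps)

theta1 = np.array([0.05, 0.10, 0.25, 0.60])
theta2 = teacher_selection(theta1, m=2)
# theta2 -> [0.0, 0.0, 0.25/0.85, 0.60/0.85] ≈ [0.0, 0.0, 0.294, 0.706]
```

Because the renormalization scales the survivors proportionally, the discarded mass is redistributed to the effective teachers in proportion to their phase-1 weights.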

3. EXPERIMENTS

3.1. DATASETS AND SETTINGS

Datasets. We evaluate the proposed AutoSKDBERT on the GLUE benchmark (Wang et al., 2019), including MRPC (Dolan & Brockett, 2005), RTE (Bentivogli et al., 2009), CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), QQP (Chen et al., 2018), QNLI (Rajpurkar et al., 2016) and MNLI (Williams et al., 2017). STS-B (Cer et al., 2017) is not selected.

Settings. We use the development set of the GLUE benchmark, dubbed GLUE-dev, for categorical distribution evaluation of AutoSKDBERT. We employ a teacher team consisting of 14 BERT-style teachers to distill a 6-layer BERT-style student, dubbed AutoSKDBERT. The architectures of the student and the teachers can be found in Appendix C.1. On the one hand, we include the weak teachers T01 to T09 (refer to Table 12) to test whether the diversity of weak teachers contributes to distillation performance. On the other hand, since an extremely strong teacher (i.e., T13 or T14) does not always improve distillation performance (see Appendices B.4 and F), we include the strong teachers T13 and T14 to verify the effectiveness of the proposed distillation paradigm for capacity-gap alleviation. A general way to design the teacher team and determine the value of m is given in Appendix A.

3.2. TWO-PHASE OPTIMIZATION

We employ identical experimental settings for student distillation and categorical distribution optimization in both phase-1 and phase-2 optimization. The original training set is split fifty-fifty into two subsets, i.e., training subset for student distillation (see Section 2.3) and validation subset for categorical distribution optimization (see Section 2.4).

3.2.1. STUDENT DISTILLATION

We choose Adam with a weight decay of 1e-4 as the optimizer for student distillation. For the various downstream tasks, we employ different batch sizes, learning rates and epoch numbers, as shown in Table 2. Other hyper-parameters can be found in Appendix C.2.

For categorical distribution optimization, we employ a separate Adam optimizer with a weight decay of 1e-3. There are two important hyper-parameters: 1) the number of ineffective teachers and 2) the learning rate for categorical distribution optimization. These two hyper-parameters vary across downstream tasks, as shown in Table 9; other hyper-parameters are identical to student distillation. The impact of each hyper-parameter is discussed in Appendix E.

AutoSKDBERT delivers 25 categorical distribution candidates in phase-2 optimization, and trains the student with each of the 25 candidates from scratch to choose the optimal categorical distribution. Apart from the epoch number, other hyper-parameters (e.g., batch size, learning rate) are identical to student distillation on the various downstream tasks, as shown in Table 2. The epoch number is set to 15 on the MRPC, RTE and CoLA tasks, and 5 on the SST-2, QQP, QNLI and MNLI tasks.

3.3.2. LEARNED CATEGORICAL DISTRIBUTION

We show the categorical distributions learned by AutoSKDBERT on the GLUE benchmark in Figure 2. Each teacher model shows a different importance on different downstream tasks. 1) The strongest teacher T14 plays a dominant role on the CoLA, SST-2 and QNLI tasks. 2) Low-capacity teachers, e.g., T02 to T06, can also provide useful knowledge for student distillation on the MRPC, CoLA and QNLI tasks. 3) The capacity of an effective teacher is not always larger than that of the discarded teachers on the RTE, SST-2 and QQP tasks. Moreover, the search and evaluation costs for each downstream task are shown in Appendix F.

3.4. RESULTS AND ANALYSIS ON GLUE BENCHMARK

Table 4 summarizes the performance of AutoSKDBERT and the comparative approaches on GLUE-dev. The proposed AutoSKDBERT achieves state-of-the-art performance on four out of seven tasks, and is especially effective on tasks with small data size, e.g., MRPC and RTE. On MRPC, AutoSKDBERT achieves a 93.2 F1 score and a 90.7 accuracy score, which are 0.6 and 1.2 points higher than the previous state-of-the-art MoEBERT (Zuo et al., 2022), respectively. On RTE, AutoSKDBERT achieves 3.5 and 2.9 points of absolute improvement over TinyBERT (Jiao et al., 2020) and MoEBERT (Zuo et al., 2022), respectively. However, on CoLA, TinyBERT and MoEBERT achieve 2.2 and 3.6 points of absolute improvement over AutoSKDBERT, respectively. On the one hand, TinyBERT employs data augmentation and transformer layer distillation to achieve high performance. On the other hand, MoEBERT employs 1) a more complex student whose architecture is an ensemble of multiple experts, and 2) an extra distillation procedure, i.e., transformer layer distillation. The proposed approach is a general KD paradigm for BERT compression; consequently, we also conduct extensive experiments to verify its effectiveness for image classification on CIFAR-100 (see Appendix B) and its orthogonality with other approaches (see Appendix D).

4. ABLATION STUDIES

4.1. TWO-PHASE OPTIMIZATION: PHASE-1 VERSUS PHASE-2

In this section, AutoSKDBERT also delivers 25 categorical distribution candidates in phase-1 optimization. Each candidate is then trained from scratch using the settings described in Section 3.3, and the best-performing one on each task is shown in Table 5. Phase-2 optimization achieves better performance than phase-1 optimization, e.g., the absolute improvement exceeds 1.3 points on RTE, CoLA and MNLI, where teachers weaker than the student are prone to providing useless knowledge or even noise. However, low-capacity teachers do contribute to improving the performance of AutoSKDBERT on MRPC, SST-2 and QNLI.

4.2. CATEGORICAL DISTRIBUTION UPDATE STRATEGY: CR-BASED VERSUS SSWO-BASED

In Section 2.4.1, we propose SSWO, which stochastically samples a single weight to optimize the categorical distribution, alleviating CR's consistency gap between categorical distribution optimization and evaluation in terms of the teachers' logits. For AutoSKDBERT with CR, the hyper-parameters of categorical distribution optimization and evaluation are identical to AutoSKDBERT with SSWO, as described in Sections 3.2 and 3.3. The proposed SSWO achieves better performance than CR on all tasks, as shown in Table 6; in particular, the absolute improvement exceeds 1.6 points on the RTE and CoLA tasks. Compared to Table 5, AutoSKDBERT with CR achieves higher performance than phase-1 AutoSKDBERT on six out of seven tasks; consequently, useless teachers lead to more performance degradation than the consistency gap issue. In addition to SSWO, heuristic optimization algorithms such as evolutionary algorithms and reinforcement learning could also be used to determine the categorical distribution. In this paper, we choose the most efficient option, i.e., gradient-based SSWO.

4.3. CATEGORICAL DISTRIBUTION GENERATION: RANDOM VERSUS LEARNING

We compare two implementations of AutoSKDBERT with different algorithms for categorical distribution generation, i.e., random and learning, on GLUE-dev. For the random algorithm, 200 categorical distributions are randomly generated over all teacher candidates. For the learning algorithm, we employ learning rates from 3e-4 to 1e-3 with an interval of 1e-4 for categorical distribution optimization; the ineffective teacher number is identical to Section 3.3.2 on the various tasks. Each learning rate delivers 25 categorical distributions in phase-2 optimization, so 200 categorical distributions are obtained in total. The comparison between the random and learning algorithms is shown in Figure 3. A categorical distribution generation algorithm aims to produce more high-performance AutoSKDBERTs. As shown in Figure 3, the proposed learning algorithm obtains better categorical distributions than the randomly generated ones on each downstream task; in particular, the learning algorithm plays a dominant role on the MRPC, CoLA, QQP, QNLI and MNLI tasks. The best accuracy scores of the random algorithm are 89.29 on QQP and 83.37 on MNLI, respectively. For the proposed learning algorithm, the worst accuracy scores are 89.11 on QQP and 83.44 on MNLI, which would rank in the top 10 and first among the random algorithm's results, respectively.
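The paper does not specify how the 200 random categorical distributions are drawn; one natural scheme (an assumption on our part) is a flat Dirichlet over the probability simplex:

```python
import numpy as np

def random_categorical_distributions(n_teachers, n_candidates, rng):
    """Draw random points on the probability simplex via a flat Dirichlet.
    This is one plausible random baseline, not the paper's stated scheme."""
    return rng.dirichlet(np.ones(n_teachers), size=n_candidates)

rng = np.random.default_rng(0)
candidates = random_categorical_distributions(14, 200, rng)  # 200 candidates over 14 teachers
```

Each row is a valid categorical distribution (non-negative, summing to 1), so any row could directly replace a learned Cat(θ) in the sampling step.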

5. RELATED WORK

5.1. PRE-TRAINED LANGUAGE MODEL

Based on the transformer architecture (Vaswani et al., 2017), BERT (Devlin et al., 2019) achieves state-of-the-art performance on various natural language understanding benchmarks, e.g., GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016; 2018). Subsequently, a great number of BERT variants have been proposed, e.g., XLNet (Yang et al., 2019) and ELECTRA (Clark et al., 2020) with new pre-training objectives, RoBERTa (Liu et al., 2019b) and T5 (Raffel et al., 2020) with larger pre-training corpora, ConvBERT (Jiang et al., 2020) with a modified architecture, and Synthesizer (Tay et al., 2020) with a redesigned transformer-like block w.r.t. the dot-product self-attention mechanism. Besides, pre-trained language models often have several hundred million parameters (e.g., 335 million for BERT-LARGE (Devlin et al., 2019), or even 175 billion for GPT-3 (Brown et al., 2020)), which deliver remarkable performance on downstream tasks while greatly increasing the difficulty of deployment on resource-constrained devices. ALBERT (Lan et al., 2020) adopts a parameter sharing strategy to reduce the parameter count and achieves competitive performance.

5.2. KNOWLEDGE DISTILLATION FOR BERT-STYLE LANGUAGE MODEL COMPRESSION

In order to obtain device-friendly BERT-style language models, many KD-based compression approaches have been proposed. DistilBERT (Sanh et al., 2019) produces a smaller, faster, cheaper and lighter 6-layer BERT-style language model by learning the soft target probabilities of the teacher in the pre-training stage. Sun et al. (2019) propose patient knowledge distillation, which transfers knowledge from the last or every l layers to compress a BERT-style language model in the fine-tuning phase. In MobileBERT (Sun et al., 2020), an inverted-bottleneck BERT-style language model is pre-trained to transfer knowledge to a task-agnostic MobileBERT in a layer-to-layer way. The student in MiniLM (Wang et al., 2020) imitates not only the attention distribution of the teacher but also the deep self-attention knowledge which reflects the relations between values. In both the pre-training and fine-tuning phases, TinyBERT (Jiao et al., 2020) learns various kinds of knowledge from hidden layers, the final layer, embeddings and self-attention to achieve high performance; moreover, a GloVe word embedding (Pennington et al., 2014) based data augmentation technique is employed to further improve its performance. MT-BERT (Wu et al., 2021) employs multiple teachers to achieve better performance than single-teacher KD approaches on several downstream tasks.

6. CONCLUSION

This work proposes AutoSKDBERT, a new knowledge distillation paradigm for BERT model compression. In each step, a teacher is stochastically sampled from a predefined multi-level teacher team following a categorical distribution to distill the student. We observe that the categorical distribution plays an important role in obtaining a high-performance AutoSKDBERT. Consequently, we propose a two-phase optimization framework to learn the best categorical distribution via SSWO: the first phase distinguishes effective teachers from ineffective teachers, and the second phase further optimizes the weights of the effective teachers. Moreover, before phase-2 optimization begins, the ineffective teachers are discarded and their weights are assigned to the effective teachers via the teacher selection strategy. Extensive experiments on the GLUE benchmark show that the proposed AutoSKDBERT achieves state-of-the-art performance compared to popular compression approaches on several downstream tasks.

A INEFFECTIVE TEACHER NUMBER DETERMINATION AND TEACHER TEAM DESIGN

A.1 INEFFECTIVE TEACHER NUMBER DETERMINATION

A simple way to choose the ineffective teacher number m is to elaborately design the teacher team and set m to the number of weak teachers whose capacities are weaker than the student's. Moreover, the student itself can be treated as such a weak teacher.

A.2 TEACHER TEAM DESIGN

First, we determine the strongest teacher and the student. Next, we select several teacher assistants whose capacities are stronger than the student but weaker than the strongest teacher. Finally, we also choose several weak teachers whose capacities are weaker than the student. Altogether, the predefined teacher team consists of several weak teachers, several teacher assistants and the strongest teacher.
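The heuristic in A.1 is simple enough to state as code. The capacity scores below are hypothetical stand-ins (e.g., dev-set accuracies), not values from the paper:

```python
def choose_m(teacher_capacities, student_capacity):
    """A.1 heuristic: m = number of teachers whose capacity falls below the
    student's (the student itself may also be counted as such a weak teacher)."""
    return sum(1 for c in teacher_capacities if c < student_capacity)

# hypothetical dev-set accuracies for a 6-teacher team; student at 82.0
m = choose_m([70.1, 75.3, 80.2, 83.5, 85.0, 88.1], student_capacity=82.0)
# -> m = 3: three teachers are weaker than the student
```

Any monotone capacity proxy (dev score, parameter count) works here; only the ordering relative to the student matters.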

B AUTOSKD FOR IMAGE CLASSIFICATION

To verify the effectiveness of the proposed distillation paradigm in computer vision, we conduct three groups of experiments on CIFAR-100 (Krizhevsky et al., 2009).

B.2 DETAILS OF TEACHER TEAM

For the various student models, we select different teacher teams according to Appendix A.2, as shown in Table 7.

Table 7: Details of the teacher team for each student model.

Student → teacher team:
- WRN-16-2: WRN-16-2, WRN-22-2, WRN-28-2, WRN-34-2, WRN-40-2
- WRN-40-1: WRN-40-1, WRN-16-2, WRN-22-2, WRN-28-2, WRN-34-2, WRN-40-2
- ResNet-8×4: ResNet-8×4, ResNet-14×4, ResNet-20×4, ResNet-26×4, ResNet-32×4

Moreover, the performance of each teacher model on CIFAR-100 is shown in Table 8.

Results of the comparative distillation approaches for the three students (WRN-16-2 / WRN-40-1 / ResNet-8×4):
KD (Hinton et al., 2015): 74.92 / 73.54 / 73.33
FitNet (Romero et al., 2015): 73.58 / 72.24 / 73.50
AT (Zagoruyko & Komodakis, 2017): 74.08 / 72.77 / 73.44
SP (Tung & Mori, 2019): 73.83 / 72.43 / 72.94
CC (Peng et al., 2019): 73.56 / 72.21 / 72.97
VID (Ahn et al., 2019): 74.11 / 73.30 / 73.09
RKD (Park et al., 2019): 73.35 / 72.22 / 71.90
PKT (Passalis & Tefas, 2018): 74.54 / 73.45 / 73.64
AB (Heo et al., 2019): 72.50 / 72.38 / 73.17
FT (Kim et al., 2018): 73.25 / 71.59 / 72.86
FSP (Yim et al., 2017): 72.91 / - / 72.62
NST (Huang & Wang, 2017): 73.68 / 72.24 / 73.30
CRD (Tian et al., 2020): 75…

Fine-tuning learning rates: for T13 and T14, {6e-6, 7e-6, 8e-6, 9e-6} on the MRPC and RTE tasks and {2e-5, 3e-5, 4e-5, 5e-5} on the other tasks; for the student and the other teachers, {2e-5, 3e-5, 4e-5, 5e-5} and {1e-5, 2e-5, 3e-5} on all tasks, respectively. Fine-tuning epochs: 15 on the MRPC, RTE and CoLA tasks, 5 on the other tasks.

D ORTHOGONALITY WITH OTHER APPROACHES

The prediction layer distillation of TinyBERT (Jiao et al., 2020) and MoEBERT (Zuo et al., 2022) can be replaced with the stochastic KD paradigm proposed in this paper. Moreover, each teacher in the teacher team should have the same hidden size as the student when distilling the transformer layer; consequently, we cannot distill the student with the teacher team used in this paper. In this section, we run a series of orthogonal experiments to examine the effectiveness of combining AutoSKDBERT and TinyBERT, and show the results in Table 16. Due to the difference in hidden size between the strongest teacher T14 and the student, we employ BERT_BASE, i.e., T12, as the teacher for transformer layer distillation, as in TinyBERT. TinyBERT employs random search to choose the best batch size and learning rate from {16, 32} and {1e-5, 2e-5, 3e-5}, respectively.
Differently, AutoSKDBERT also uses the categorical distributions learned on the vanilla datasets, with the batch size and learning rate shown in Table 2 for each downstream task; the epoch number for each task can be found in Table 13. As shown in Table 16, the combination of AutoSKDBERT and DA achieves better performance than vanilla AutoSKDBERT on the CoLA and QNLI tasks. Furthermore, the combination of AutoSKDBERT, TD and DA achieves better performance than vanilla AutoSKDBERT on QNLI. We believe the main reason is that the categorical distributions are learned on the vanilla dataset instead of the augmented data; in the future, we will learn the categorical distribution directly on the augmented data. However, the combination of AutoSKDBERT and TD tends to obtain worse performance on each downstream task. We consider the main cause to be the knowledge transfer gap between transformer layer distillation and prediction layer distillation, i.e., only T12 is used for transformer layer distillation while T01 to T14 are used for prediction layer distillation. In the future, we will select an appropriate teacher team to distill the transformer layers of BERT.

E IMPACT OF HYPER-PARAMETERS FOR CATEGORICAL DISTRIBUTION OPTIMIZATION

As mentioned above, there are two important hyper-parameters for categorical distribution optimization: the ineffective teacher number m and the learning rate. In this section, we discuss the impact of these two hyper-parameters on the best performance of the learned categorical distributions. On the MRPC, RTE and CoLA tasks, we run AutoSKDBERT with m from 1 to 10 and learning rates from 3e-4 to 1e-3 with an interval of 1e-4, and show the results in Table 17.

F SEARCH AND EVALUATION COST

In this section, we report the cost of AutoSKDBERT in terms of categorical distribution optimization and evaluation, and compare our approach to TinyBERT with respect to algorithm cost. Experimental results are shown in Table 18: on five downstream tasks, the cost of AutoSKDBERT is 38.72 hours, which is 8.4× less than TinyBERT. The costs are measured on an NVIDIA A100 GPU with an AMD EPYC 7642 48-core processor. For TinyBERT, the cost covers 6 groups of experiments with various hyper-parameters (i.e., batch sizes of {16, 32} and learning rates of {1e-5, 2e-5, 3e-5}) on augmented data. For AutoSKDBERT, the cost covers 25 groups of experiments on vanilla data with the different categorical distributions learned during categorical distribution optimization. The distillation process of TinyBERT is divided into two phases: 1) transformer layer distillation on augmented data and 2) prediction layer distillation on augmented data. The transformer layer distillation of TinyBERT is time-consuming, e.g., about 62 hours on QNLI; the prediction layer distillation is also time-consuming due to the large-scale augmented data. In contrast, AutoSKDBERT consists of categorical distribution optimization and evaluation (i.e., prediction layer distillation). On the one hand, categorical distribution optimization is efficient, e.g., 2.45 hours on QNLI, thanks to the gradient-based SSWO.
On the other hand, categorical distribution evaluation is also efficient, even though the best categorical distribution is chosen from 25 candidates.

G TINYBERT WITH STRONGER TEACHER MODEL

AutoSKDBERT employs two stronger teacher models, i.e., T13 and T14, compared to most of the comparative methods shown in Table 4. To verify the impact of a strong teacher on the distillation performance of other paradigms, we employ T12 and T14 as the teachers to distill TinyBERT on five downstream tasks for a fair comparison. Following TinyBERT (Jiao et al., 2020), we run the experiments with batch sizes of {16, 32} and learning rates of {1e-5, 2e-5, 3e-5}, and report the best result in Table 19. We observe that the strong teacher T14 only improves performance on RTE. The main reason is the capacity gap (Mirzadeh et al., 2020) between T14 and the student, which is prone to cause unsatisfactory performance. As a result, we can conclude that the stronger teacher T14 does not always improve the performance of other distillation paradigms.



The code will be made publicly available upon publication of the paper.
https://huggingface.co/huawei-noah/TinyBERT_General_6L_768D
https://github.com/google-research/bert



Figure 2: Categorical distributions learned on GLUE benchmark.

Figure 3: Comparison of random and learning-based algorithms for categorical distribution generation on GLUE-dev. Each type of algorithm is evaluated with 200 categorical distributions. MRPC and QQP are evaluated by the average of F1 score and accuracy, CoLA by Matthews correlation coefficient, and the other tasks by accuracy. Best viewed in color.

image classification dataset. Following CRD (Tian et al., 2020), we choose three student models: 1) WRN-16-2, 2) WRN-40-1 and 3) ResNet-8×4. WRN-d-w represents a Wide ResNet with depth d and width factor w. ResNet-d×4 indicates a 4-times-wider network (namely, with 64, 128, and 256 channels for each block) with depth d. Moreover, in CRD, the above student models are distilled by WRN-40-2, WRN-40-2 and ResNet-32×4, respectively.

B.1 DATASET

As a popular dataset for image classification, CIFAR-100 consists of 60,000 images (50,000 for training and 10,000 for test) with 32×32 pixels. Similar to the experiment for BERT compression, the original training set is split fifty-fifty into two subsets, i.e., a training subset for student distillation and a validation subset for categorical distribution optimization.
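The fifty-fifty split described above can be sketched as follows. This is an illustration only; the seeded shuffle and the function name are our assumptions:

```python
import random

def fifty_fifty_split(num_examples, seed=0):
    """Split training indices in half: one subset for student
    distillation, one for categorical distribution optimization."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle
    half = num_examples // 2
    return indices[:half], indices[half:]

# For CIFAR-100: 50,000 training images -> two disjoint 25,000-image subsets.
train_idx, val_idx = fifty_fifty_split(50000)
```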

The hyper-parameters for student distillation.

The hyper-parameters for categorical distribution optimization.

Results of AutoSKDBERT and other popular approaches on GLUE-dev. All comparative approaches have an identical architecture, i.e., a 6-layer BERT-style language model with 66 million parameters. † and ‡ indicate that the results are cited from Xu et al. (2020) and Zuo et al. (2022), respectively. * means that the comparison between TinyBERT6 and AutoSKDBERT may not be fair, since the former employs GloVe word embedding (Pennington et al., 2014) based data augmentation as well as transformer-layer and embedding-layer distillation. § indicates that the result is obtained under our settings with the distillation loss described in Wu et al. (2021); the experimental details can be found in Appendix H. Moreover, the stronger teacher cannot always improve the distillation performance of other approaches due to the capacity gap (Mirzadeh et al., 2020), as shown in Appendix G. Besides, we also show the performances of multi-teacher AvgKD and TAKD with T01 to T14 in Appendix C.4.

The performance of AutoSKDBERT with the best categorical distribution learned in phase-1 and phase-2 optimization on GLUE-dev.

The performance of AutoSKDBERT with the best categorical distribution learned by CR and SSWO for categorical distribution optimization on GLUE-dev.

Performance of each teacher model on CIFAR-100.

Test accuracy (%) of the proposed AutoSKD and other popular distillation approaches on CIFAR-100. All experimental results are cited from Tian et al. (2020). Average of the last epoch over 5 runs.

The architecture of each student and teacher.

Hyper-parameters for fine-tuning of student and teacher team.

Distillation performance of student with various distillation paradigms on GLUE-dev.

Results of AutoSKDBERT with Data Augmentation (DA) and Transformer-layer Distillation (TD) on GLUE-dev.

Results of AutoSKDBERT with various hyper-parameters for categorical distribution optimization.

The cost (hours) comparison of TinyBERT and AutoSKDBERT on five downstream tasks. The TinyBERT results are obtained by following the experimental settings described in Jiao et al. (2020) with the code publicly released by the authors at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT.

Results of TinyBERT with the strongest teacher T14 on GLUE-dev. These results are obtained by TinyBERT with the fine-tuned teacher model of AutoSKDBERT using the code publicly released by the authors at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT.



B.3 CATEGORICAL DISTRIBUTION OPTIMIZATION

Similar to the experiment for BERT compression, there are two hyper-parameters for categorical distribution optimization, i.e., the ineffective teacher number and the learning rate. According to Appendix A.1, we fix the ineffective teacher number to 1 and choose the categorical distribution learning rate from 3e-4 to 1e-3 with an interval of 1e-4 for three groups of experiments. Different from BERT compression, we choose SGD as the optimizer with a cosine-annealing learning rate scheduler, an initial learning rate of 0.05 and a batch size of 64 for student model training. Moreover, the number of epochs is set to 50, and the last 25 epochs deliver 25 categorical distribution candidates.
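The cosine-annealing schedule used for student training can be written down explicitly. The sketch below assumes the standard formulation with a minimum learning rate of 0 (our assumption; the text only states the initial rate of 0.05 and 50 epochs):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=50, lr_init=0.05, lr_min=0.0):
    """Standard cosine-annealing schedule: decays lr_init to lr_min
    over total_epochs following half a cosine period."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cos)

# Epoch 0 starts at 0.05, epoch 25 reaches 0.025, epoch 50 decays to 0.
```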

C DETAILS OF STUDENT AND TEACHER TEAM FOR AUTOSKDBERT C.1 ARCHITECTURE INFORMATION

The architecture information of student and teachers is shown in Table 12 .

C.2 HYPER-PARAMETERS FOR FINE-TUNING AND DISTILLATION

We utilize the hyper-parameters shown in Table 13 for fine-tuning and distillation.

C.3 FINE-TUNING PERFORMANCE

On the one hand, we directly treat the pre-trained model of TinyBERT6 as the student of AutoSKDBERT. On the other hand, we choose 14 BERT-style language models with various capacities as candidates for the teacher team. Moreover, each pre-trained teacher can be downloaded from the official implementation of BERT. Furthermore, the results of the student and the teachers on GLUE-dev are shown in Table 14.

C.4 PERFORMANCE OF STUDENT WITH VARIOUS DISTILLATION PARADIGMS

Table 15 summarizes the performance of the student using different distillation paradigms with the teacher models described in Appendix C.1; the experimental settings can be found in Table 13. On the one hand, the student performance using single-teacher distillation with respect to each teacher model is given. On the other hand, two popular multi-teacher KD paradigms, i.e., AvgKD (Hinton et al., 2015) and TAKD (Mirzadeh et al., 2020), are employed to distill the student with two groups of teacher teams, i.e., T01 to T14 and T10 to T14. According to Table 15, we can draw several conclusions:

1. For the single-teacher KD paradigm, the strongest teacher may not be the best teacher for student distillation. The capacity gap (Mirzadeh et al., 2020) between a strong-capacity teacher and a weak-capacity student plays an important role in this phenomenon.

2. For multi-teacher AvgKD, increasing the number of teachers cannot always improve the distillation performance. In AvgKD, the loss of diversity caused by using the ensemble of teacher outputs leads to unsatisfactory performance.

3. For multi-teacher TAKD, weak-capacity teachers dramatically reduce the distillation performance of the student. In TAKD, the weakest teacher assistant (e.g., T01 for the teacher team T01-T14, T10 for the teacher team T10-T14) transfers into the student a mixture of knowledge learned from the previous, stronger teacher assistants (e.g., T02 to T14 for the teacher team T01-T14, T11 to T14 for the teacher team T10-T14). As a result, the performance of TAKD is very sensitive to the capacity of the weakest teacher assistant.

In order to verify the effectiveness of weak-capacity teachers for performance improvement, we choose several weak-capacity BERT-style models as teachers, e.g., T01 to T09. Besides, we also choose two strong-capacity teachers, i.e., T13 and T14 in Table 12, to verify the effectiveness of the proposed distillation paradigm for capacity gap alleviation.
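For concreteness, the ensemble step behind AvgKD's loss of diversity can be sketched as follows. This is illustrative only; averaging raw logits is our assumption (averaging probabilities is an equally common variant):

```python
def avgkd_targets(teacher_logits):
    """AvgKD: average the logits of all teachers into a single
    ensemble target. Disagreement between teachers is collapsed
    into the mean, which is where diversity is lost."""
    n = len(teacher_logits)
    dim = len(teacher_logits[0])
    return [sum(t[c] for t in teacher_logits) / n for c in range(dim)]

# Two teachers with opposite preferences collapse to a flat target:
# avgkd_targets([[2.0, 0.0], [0.0, 2.0]]) -> [1.0, 1.0]
```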

H MT-BERT FOR BERT COMPRESSION

For BERT-style language model compression, we verify the performance of MT-BERT (Wu et al., 2021), whose objective function can be expressed as:

L = Σ_{i=1}^{N} w_i · CE(σ(y_i / T), σ(y_s / T)),  where  w_i = (1 / CE(y, y_i)) / Σ_{j=1}^{N} (1 / CE(y, y_j)),   (9)

where N indicates the number of teachers, CE(·, ·) is the cross-entropy loss, σ(·) is the softmax function, T denotes the temperature, y represents the ground-truth label, and y_i and y_s refer to the outputs of the i-th teacher and the student, respectively. We employ T10 to T14 as the teacher team to distill the student via Eq. 9. Particularly, we only use the weighted multi-teacher distillation loss, without the multi-teacher hidden loss and the task-specific loss of MT-BERT (Wu et al., 2021). The hyper-parameters are given as follows:

• Learning Rate: {1e-5, 2e-5, 3e-5} for all tasks.
• Batch Size: {16, 32, 64}.
• Epochs: 10 for the MRPC, RTE and CoLA tasks, 3 for the other tasks.

Other settings follow AutoSKDBERT.

I DETAILS OF GLUE BENCHMARK

GLUE consists of 9 NLP tasks: Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005), Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009), Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017), Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), Quora Question Pairs (QQP) (Chen et al., 2018), Question NLI (QNLI) (Rajpurkar et al., 2016), Multi-Genre NLI (MNLI) (Williams et al., 2017), and Winograd NLI (WNLI) (Levesque et al., 2012).

MRPC is a sentence similarity task where the system aims to identify the paraphrase/semantic equivalence relationship between two sentences. RTE is a natural language inference task where the system aims to recognize the entailment relationship between two given text fragments. CoLA is a single-sentence task where the system aims to predict the grammatical correctness of an English sentence. STS-B is a sentence similarity task where the system aims to evaluate the similarity of two pieces of text with a score from 1 to 5. SST-2 is a single-sentence task where the system aims to predict the sentiment of movie reviews. QQP is a sentence similarity task where the system aims to identify the semantic equivalence of two questions from the website Quora. QNLI is a natural language inference task where the system aims to recognize whether, for a given pair <question, context>, the answer to the question is contained in the context. MNLI is a natural language inference task where the system aims to predict the relationship (i.e., entailment, contradiction or neutral) of a hypothesis w.r.t. a premise for a given pair <premise, hypothesis>. WNLI is a natural language inference task where the system aims to determine the referent of a sentence's pronoun from a list of choices.
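The weighted multi-teacher distillation loss of Appendix H can be sketched in plain Python. This is a minimal illustration assuming teacher weights inversely proportional to each teacher's cross-entropy with the ground truth; the exact weighting used by MT-BERT may differ:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target_probs, pred_probs):
    """CE(p, q) = -sum_c p_c * log q_c, clipped for stability."""
    return -sum(t * math.log(max(p, 1e-12))
                for t, p in zip(target_probs, pred_probs))

def mt_bert_loss(teacher_logits, student_logits, label, temperature=2.0):
    """Weighted multi-teacher distillation: each teacher's KD term is
    weighted inversely to its own cross-entropy with the ground truth
    (an assumption about the weighting, per the sketch above)."""
    num_classes = len(student_logits)
    onehot = [1.0 if c == label else 0.0 for c in range(num_classes)]
    # Reliability weights: teachers with lower CE get larger weight.
    inv = [1.0 / max(cross_entropy(onehot, softmax(t)), 1e-12)
           for t in teacher_logits]
    total = sum(inv)
    weights = [v / total for v in inv]
    student_p = softmax(student_logits, temperature)
    return sum(w * cross_entropy(softmax(t, temperature), student_p)
               for w, t in zip(weights, teacher_logits))
```

With a single teacher, the weight collapses to 1 and the loss reduces to ordinary temperature-scaled KD.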

