AUTOSKDBERT: LEARN TO STOCHASTICALLY DISTILL BERT

Abstract

In this paper, we propose AutoSKDBERT, a new knowledge distillation paradigm for BERT compression that, at each step, stochastically samples a teacher from a predefined teacher team following a categorical distribution and transfers its knowledge into the student. AutoSKDBERT aims to discover the optimal categorical distribution, which plays an important role in achieving high performance. The optimization procedure of AutoSKDBERT can be divided into two phases: 1) phase-1 optimization distinguishes effective teachers from ineffective ones, and 2) phase-2 optimization further tunes the sampling weights of the effective teachers to obtain a satisfactory categorical distribution. Moreover, after phase-1 optimization is complete, AutoSKDBERT adopts a teacher selection strategy that discards the ineffective teachers and reassigns their sampling weights to the effective teachers. In particular, to alleviate the gap between categorical distribution optimization and evaluation, we also propose a stochastic single-weight optimization strategy which updates only the weight of the sampled teacher in each step. Extensive experiments on the GLUE benchmark show that the proposed AutoSKDBERT achieves state-of-the-art scores compared to previous compression approaches on several downstream tasks, including pushing MRPC F1 and accuracy to 93.2 (0.6 point absolute improvement) and 90.7 (1.2 point absolute improvement), and RTE accuracy to 76.9 (2.9 point absolute improvement).
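To make the per-step sampling mechanism concrete, the following is a minimal PyTorch-style sketch of one stochastic-KD training step under our own assumptions: the function names (`skd_step`, `student_logits_fn`, `teacher_logits_fns`) and the soft-label KL objective are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def skd_step(student_logits_fn, teacher_logits_fns, sampling_weights, batch, temperature=1.0):
    """One stochastic-KD step: sample a single teacher from the categorical
    distribution defined by `sampling_weights`, then distill from it.
    All names here are illustrative, not from the paper's code."""
    # Categorical distribution over the teacher team (Categorical normalizes the weights).
    dist = torch.distributions.Categorical(probs=sampling_weights)
    k = int(dist.sample())                  # index of the teacher sampled for this step

    student_logits = student_logits_fn(batch)
    with torch.no_grad():                   # teachers are frozen
        teacher_logits = teacher_logits_fns[k](batch)

    # Soft-label KD loss between the student and the single sampled teacher (assumed objective).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return kd_loss, k
```

In this sketch, only the sampled teacher contributes to the loss of the current step, which is what allows the single-weight update described above to touch just that teacher's sampling weight.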

1. INTRODUCTION

BERT (Devlin et al., 2019) has brought about a sea change in the field of Natural Language Processing (NLP). Following BERT, numerous subsequent works focus on various perspectives to further improve its performance, e.g., hyper-parameters (Liu et al., 2019b), pre-training corpora (Liu et al., 2019b; Raffel et al., 2020), learnable embedding paradigms (Raffel et al., 2020), pre-training tasks (Clark et al., 2020), architecture (Gao et al., 2022) and self-attention (Shi et al., 2021). However, these BERT-style models contain massive redundancies w.r.t. attention heads (Michel et al., 2019; Dong et al., 2021), weights (Gordon et al., 2020), and layers (Fan et al., 2020). Consequently, many compact BERT-style language models have been proposed via pruning (Fan et al., 2020; Guo et al., 2019), quantization (Shen et al., 2020), parameter sharing (Lan et al., 2020) and Knowledge Distillation (KD) (Iandola et al., 2020; Pan et al., 2021).

In this paper, we focus on KD-based compression approaches. From the point of view of the learning procedure, KD is used in the pre-training (Turc et al., 2019; Sanh et al., 2019; Sun et al., 2020; Jiao et al., 2020) and fine-tuning phases (Sun et al., 2019; Jiao et al., 2020; Wu et al., 2021). From the point of view of the distillation objective, KD is applied to the outputs of hidden layers (Sun et al., 2020), the final layer (Wu et al., 2021), the embeddings (Sanh et al., 2019) and self-attention (Wang et al., 2020).

Wu et al. (2021) employ multiple teachers to achieve better performance than single-teacher KD approaches on several downstream tasks of the GLUE benchmark (Wang et al., 2019). As shown in Table 1, however, the ensemble of multiple teachers is not always more effective than a single teacher for student distillation. There are two possible reasons: 1) loss of diversity (Tran et al., 2020) and 2) capacity gap (Mirzadeh et al., 2020). On the one hand, the ensemble prediction of multi-teacher KD loses the diversity of the individual teachers. On the other hand, there is a capacity gap between the large-capacity teacher ensemble and the small-capacity student, which tends to result in unsatisfactory distillation performance.
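The diversity argument can be illustrated with a small sketch contrasting the two ways of forming the distillation target; this is our own simplified illustration with made-up function names, not the formulation of any cited method.

```python
import torch

def ensemble_soft_labels(teacher_logits_list, weights):
    """Ensemble-style multi-teacher KD target: a weighted average of all
    teachers' predictions (weights assumed to sum to one). Averaging smooths
    away the teachers' disagreements, i.e., their diversity."""
    probs = [w * torch.softmax(t, dim=-1) for t, w in zip(teacher_logits_list, weights)]
    return torch.stack(probs).sum(dim=0)

def sampled_soft_labels(teacher_logits_list, sampling_weights):
    """Stochastic alternative: keep each teacher's full prediction intact,
    but use only one sampled teacher per step."""
    k = int(torch.distributions.Categorical(probs=sampling_weights).sample())
    return torch.softmax(teacher_logits_list[k], dim=-1)
```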

