HOMODISTIL: HOMOTOPIC TASK-AGNOSTIC DISTILLATION OF PRE-TRAINED TRANSFORMERS

Abstract

Knowledge distillation has been shown to be a powerful model compression approach that facilitates the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model, and iteratively prune the student's neurons until the target width is reached. Such an approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines.

1. INTRODUCTION

Pre-trained language models have demonstrated powerful generalizability in various downstream applications (Wang et al., 2018; Rajpurkar et al., 2016a). However, the number of parameters in such models has grown to hundreds of millions (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). This poses a significant challenge to deploying such models in applications with latency and storage requirements. Knowledge distillation (Hinton et al., 2015) has been shown to be a powerful technique to compress a large model (i.e., the teacher model) into a small one (i.e., the student model) with acceptable performance degradation. It transfers knowledge from the teacher model to the student model by regularizing the consistency between their output predictions. In language models, many efforts have been devoted to task-specific knowledge distillation (Tang et al., 2019; Turc et al., 2019; Sun et al., 2019; Aguilar et al., 2020). In this case, a large pre-trained model is first fine-tuned on a downstream task, and then serves as the teacher to distill a student during fine-tuning. However, task-specific distillation is computationally costly because switching to a new task always requires training a new task-specific teacher. Therefore, recent research has started to pay more attention to task-agnostic distillation (Sanh et al., 2019; Sun et al., 2020; Jiao et al., 2019; Wang et al., 2020b; Khanuja et al., 2021; Chen et al., 2021), where a student is distilled from a teacher pre-trained on open-domain data and can be efficiently fine-tuned on various downstream tasks. Despite the practical benefits, task-agnostic distillation is challenging. The teacher model has a significantly larger capacity and a much stronger representation power than the student model.
As a result, it is very difficult for the student model to produce predictions that match the teacher's over a massive amount of open-domain training data, especially when the student model is not well-initialized. Such a large prediction discrepancy eventually diminishes the benefits of distillation (Jin et al., 2019; Cho & Hariharan, 2019; Mirzadeh et al., 2020; Guo et al., 2020; Li et al., 2021). To reduce this discrepancy, recent research has proposed to better initialize the student model from a subset of the teacher's layers (Sanh et al., 2019; Jiao et al., 2019; Wang et al., 2020b). However, selecting such a subset requires extensive tuning. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. As illustrated in Figure 1, we initialize the student model from the teacher model. This ensures a small prediction discrepancy in the early stage of distillation. At each training iteration, we prune from the remaining neurons a set of least important neurons, whose removal leads to the smallest increase in loss.

Figure 1: Left: In HomoDistil, the student is initialized from the teacher and is iteratively pruned through the distillation process. The widths of rectangles represent the widths of layers. The depth of color represents the sufficiency of training. Right: An illustrative comparison of the student's optimization trajectory in HomoDistil and in standard distillation. We define the region where the prediction discrepancy is sufficiently small for distillation to be effective as the Effective Distillation Region. In HomoDistil, as the student is initialized with the teacher and is able to maintain this small discrepancy, the trajectory consistently lies in the region. In standard distillation, as the student is initialized with a much smaller capacity than the teacher's, the distillation is ineffective at the early stage of training.
This ensures the prediction discrepancy only increases by a small amount. Simultaneously, we distill the pruned student, such that the small discrepancy can be further reduced. We repeat this procedure at each iteration to maintain the small discrepancy throughout training, which encourages effective knowledge transfer. We conduct extensive experiments to demonstrate the effectiveness of HomoDistil in task-agnostic distillation on BERT models. In particular, HomoBERT distilled from a BERT-base teacher (109M) achieves state-of-the-art fine-tuning performance on the GLUE benchmark (Wang et al., 2018) and SQuAD v1.1/2.0 (Rajpurkar et al., 2016a; 2018) at multiple parameter scales (e.g., 65M and 10 ∼ 20M). Extensive analysis corroborates that HomoDistil maintains a small prediction discrepancy through training and produces a better-generalized student model.
2. BACKGROUND

2.1. TRANSFORMER-BASED LANGUAGE MODELS

The Transformer architecture has been widely adopted to train large neural language models (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019; He et al., 2021). It contains multiple identically constructed layers, each consisting of a multi-head self-attention mechanism and a two-layer feed-forward neural network. We use f(·; θ) to denote a Transformer-based model f parameterized by θ, where f is a mapping from the input sample space X to the output prediction space. We define the loss function L(θ) = E_{x∼X}[ℓ(f(x; θ))], where ℓ is the task loss.

2.2. TRANSFORMER DISTILLATION

Knowledge Distillation trains a small model (i.e., the student model) to match the output predictions of a large and well-trained model (i.e., the teacher model) by penalizing their output discrepancy. Specifically, we denote the teacher model as f_t(θ_t) and the student model as f_s(θ_s), and consider the following optimization problem: min_{θ_s} L(θ_s) + D_KL(θ_s, θ_t), where D_KL(θ_s, θ_t) is the KL-divergence between the probability distributions over their output predictions, i.e., KL(f_s(θ_s) || f_t(θ_t)).

Transformer Distillation. In large Transformer-based models, distilling knowledge from only the output predictions neglects the rich semantic and syntactic knowledge in the intermediate layers. To leverage such knowledge, researchers have further matched the hidden representations, attention scores and attention value relations at all layers of the teacher and the student (Romero et al., 2014; Sun et al., 2019; 2020; Jiao et al., 2019; Hou et al., 2020; Wang et al., 2020b;a).
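As an illustrative sketch of the output-level objective (NumPy; the inputs and the epsilon smoothing are assumptions, not details from the paper), the KL term can be computed from the two models' logits, with the direction following the paper's notation KL(f_s || f_t):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(p_s || p_t) averaged over the batch, following the paper's
    notation KL(f_s(theta_s) || f_t(theta_t))."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    kl = np.sum(p_s * (np.log(p_s + eps) - np.log(p_t + eps)), axis=-1)
    return float(kl.mean())
```

When the student's predictions exactly match the teacher's, the loss is zero; any discrepancy makes it strictly positive, which is what the distillation term penalizes.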

2.3. TRANSFORMER PRUNING

Pruning is a powerful compression approach that removes redundant parameters without significantly deteriorating the full model's performance.

Importance Score. To identify the redundant parameters, researchers estimate the importance of each parameter based on some scoring metric. A commonly used metric is the sensitivity of parameters (Molchanov et al., 2017; 2019; Theis et al., 2018; Lee et al., 2019; Ding et al., 2019; Xiao et al., 2019). It approximates the change in the loss magnitude when a parameter is completely zeroed out (LeCun et al., 1990; Mozer & Smolensky, 1989). Specifically, we denote θ = [θ_1, ..., θ_J] ∈ R^J, where θ_j ∈ R for j = 1, ..., J denotes each parameter. We further define θ_{j,-j} = [0, ..., 0, θ_j, 0, ..., 0] ∈ R^J, i.e., the vector that keeps θ_j and zeroes out all other entries. Then we define the sensitivity score as S ∈ R^J, where S_j ∈ R computes the score of θ_j as S_j = |θ_{j,-j}^⊤ ∇_θ L(θ)|. This definition is derived from the first-order Taylor expansion of L(·) with respect to θ_j at θ. Specifically, S_j approximates the absolute change of the loss given the removal of θ_j: θ_{j,-j}^⊤ ∇_θ L(θ) ≈ L(θ) − L(θ − θ_{j,-j}). Parameters with high sensitivity are of high importance and should be kept (Lubana & Dick, 2020). Parameters with low sensitivity are considered redundant and can be safely pruned with only marginal influence on the model loss. Other importance scoring metrics include the magnitude of parameters (Han et al., 2015b) and variants of sensitivity, e.g., the movement score (Sanh et al., 2020), the sensitivity score with uncertainty (Zhang et al., 2022), and the second-order expansion of Eq 3 (LeCun et al., 1990).

Iterative Pruning gradually zeroes out the least important parameters throughout the training process.
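As a quick illustration of the sensitivity score (NumPy; the quadratic loss is a toy assumption): for L(θ) = ½‖θ‖², the gradient is θ itself, so S_j = θ_j², and ranking by S directly identifies the parameter whose removal changes the loss least.

```python
import numpy as np

def sensitivity_scores(theta, grad):
    """First-order sensitivity: S_j = |theta_j * dL/dtheta_j|,
    an estimate of |L(theta) - L(theta with theta_j zeroed out)|."""
    return np.abs(theta * grad)

# Toy example: L(theta) = 0.5 * ||theta||^2, so grad = theta.
theta = np.array([3.0, -0.1, 2.0])
scores = sensitivity_scores(theta, theta)
# scores = [9.0, 0.01, 4.0]; index 1 is the least important parameter.
```

In this toy case the exact loss change from zeroing θ_j is ½θ_j², so the first-order estimate θ_j² preserves the correct pruning order even though its magnitude is off by a factor of two.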
Specifically, given a gradient-updated model θ^(t) at the t-th training iteration, iterative pruning methods first compute the importance score S^(t) following Eq 2, then compute a binary mask M^(t) ∈ R^J as

M^(t)_j = 1 if S^(t)_j is in the top r^(t) of S^(t), and 0 otherwise, for all j = 1, ..., J,

where r^(t) ∈ (0, 1) is the scheduled sparsity at the t-th iteration, determined by a monotonically decreasing function of t. The model is then pruned as M^(t) ⊙ θ^(t), where ⊙ denotes the Hadamard product. Such a procedure is repeated throughout training.

Structured Pruning. Pruning the model in units of single parameters leads to a highly sparse subnetwork. However, the storage and computation of sparse matrices are often not optimized on commonly used computational hardware. Structured pruning resolves this issue by pruning the model in units of structures, e.g., a neuron, an attention head, or a feed-forward layer (Wang et al., 2019; Michel et al., 2019; Liang et al., 2021; Hou et al., 2020; Lagunas et al., 2021). To estimate the importance score of a structure, existing works compute the expected sensitivity with respect to the structure's output (Michel et al., 2019; Liang et al., 2021; Kim & Awadalla, 2020).

3. HOMODISTIL

We introduce Homotopic Distillation, as illustrated in Figure 1. Specifically, we initialize the student model from the teacher model. At each iteration, we prune the least important neurons from the student and distill the pruned student. We repeat such a procedure throughout the training process.

Task-Agnostic Distillation. We consider the following losses to optimize the student model: 1) The knowledge distillation loss as defined in Eq 1. In task-agnostic distillation, L is the loss for continual pre-training of the student model on the open-domain data, e.g., the masked language modeling loss L_MLM for BERT. 2) The Transformer distillation losses.
Specifically, we penalize the discrepancy between the teacher's and the student's hidden representations at both the intermediate and embedding layers, and the attention scores at the intermediate layers. We denote the hidden representations at the k-th intermediate layer of the teacher and the student as H_t^k ∈ R^{|x|×d_t} and H_s^k ∈ R^{|x|×d_s}, where d_t and d_s denote the hidden dimensions and |x| denotes the sequence length. The distillation loss over the hidden representations at the intermediate layers is defined as

L_hidn(θ_s, θ_t) = Σ_{k=1}^{K} MSE(H_t^k, H_s^k W_hidn^k),

where MSE(·, ·) is the mean-squared error and W_hidn^k ∈ R^{d_s×d_t} is a randomly initialized and learnable linear projection that projects H_s^k into the same space as H_t^k. Similarly, the distillation loss over the hidden representations at the embedding layer is defined as L_emb(θ_s, θ_t) = MSE(E_t, E_s W_emb), where E_t ∈ R^{|x|×d_t} and E_s ∈ R^{|x|×d_s} are the hidden representations at the embedding layer and W_emb ∈ R^{d_s×d_t} is for dimension matching. Finally, the attention distillation loss is defined as

L_attn(θ_s, θ_t) = Σ_{k=1}^{K} MSE(A_t^k, A_s^k),

where A_t^k, A_s^k ∈ R^{|x|×|x|} are the attention score matrices averaged over the number of heads at the k-th layer. These Transformer distillation losses aim to capture the rich semantic and syntactic knowledge from the teacher's layers and improve the generalization performance of the student. In summary, the student is optimized based on the weighted sum of all losses, i.e.,

L_total = L_MLM + α_1 D_KL + α_2 L_hidn + α_3 L_emb + α_4 L_attn,

where α_1, α_2, α_3, α_4 ≥ 0 are hyper-parameters.

Iterative Neuron Pruning. We initialize the student model from a pre-trained teacher model as θ_s^(0) = θ_t. At the t-th training iteration, we update the student model based on L_total defined in Eq 5 using an SGD-type algorithm, e.g., θ_s^(t) ← θ_s^(t-1) − η ∇_{θ_s} L_total(θ_s^(t-1), θ_t), where η is the step size.
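Stepping back, the Transformer distillation losses above can be sketched as follows (NumPy; all shapes and α values are toy assumptions, and the teacher is constructed to match the student exactly after projection so the loss terms are checkable):

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
seq_len, d_t, d_s, K = 4, 8, 6, 2   # |x|, teacher width, student width, layers

# Per-layer hidden and attention losses.
W_hidn = [rng.normal(size=(d_s, d_t)) for _ in range(K)]
H_s = [rng.normal(size=(seq_len, d_s)) for _ in range(K)]
H_t = [h @ w for h, w in zip(H_s, W_hidn)]   # teacher matches after projection
A_s = [rng.normal(size=(seq_len, seq_len)) for _ in range(K)]
A_t = [a.copy() for a in A_s]

l_hidn = sum(mse(ht, hs @ w) for ht, hs, w in zip(H_t, H_s, W_hidn))
l_attn = sum(mse(at, a) for at, a in zip(A_t, A_s))

# Weighted sum of all losses (placeholder scalars for the other terms):
l_mlm, d_kl, l_emb = 2.3, 0.05, 0.01
alpha = (1.0, 1.0, 1.0, 1.0)
l_total = (l_mlm + alpha[0] * d_kl + alpha[1] * l_hidn
           + alpha[2] * l_emb + alpha[3] * l_attn)
```

In this constructed case both L_hidn and L_attn are zero, so L_total reduces to the remaining placeholder terms; in training, W_hidn^k is learned jointly with the student.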
Then we compute the importance score for all parameters following Eq 2:

S_j^(t) = |θ_{s,j,-j}^{(t)⊤} ∇_{θ_s} L_total(θ_s^(t), θ_t)|, for all j = 1, ..., J. (6)

For any weight matrix W^(t) ∈ R^{d_s^in × d_s} in the student model, we denote its corresponding importance score as S_W^(t) ∈ R^{d_s^in × d_s}. We then define the importance score of individual columns as N_W^(t) ∈ R^{d_s}, where

N_{W,i}^(t) = ∥S_W^(t)[:, i]∥_1, for all i = 1, ..., d_s. (7)

Notice that the score is computed based on L_total, which consists of both the distillation and the training losses. This ensures that we only prune the columns whose removal leads to the least increase in both the prediction discrepancy and the training loss. We then compute the binary mask M_W^(t) ∈ R^{d_s^in × d_s} associated with the weight matrix following Eq 4 as

M_W^(t)[:, i] = 1 if N_{W,i}^(t) is in the top r^(t) of N_W^(t), and 0 otherwise, for all i = 1, ..., d_s,

where r^(t) is the scheduled sparsity determined by a commonly used cubically decreasing function (Zhu & Gupta, 2017; Sanh et al., 2020; Zafrir et al., 2021):

r^(t) = 1,                                             for 0 ≤ t < t_i,
r^(t) = r_f + (1 − r_f)(1 − (t − t_i)/(t_f − t_i))^3,  for t_i ≤ t < t_f,
r^(t) = r_f,                                           for t_f ≤ t ≤ T.

Here r_f is the final sparsity, T is the total number of training iterations, and 0 ≤ t_i < t_f ≤ T are hyper-parameters. Such a schedule ensures that the sparsity increases slowly and the columns are pruned gradually. This prevents a sudden drop in the student's prediction performance, which effectively controls the growth of the prediction discrepancy. Finally, we prune the weight matrix as W^(t) ⊙ M_W^(t). We also prune the corresponding rows of the next weight matrix in the forward computation, including {W_hidn^k}_{k=1}^K and W_emb. The same pruning procedure is applied to all weight matrices in the model. The complete algorithm is shown in Alg. 1.

Algorithm 1 HomoDistil: Homotopic Distillation
1: Input: θ_t: the teacher model. T, t_i, t_f, r_f, α_1, α_2, α_3, α_4: hyper-parameters.
2: Output: θ_s^(T).
3: θ_s^(0) = θ_t.
4: for t = 1, ..., T do
5:   Compute the loss L_total following Eq 5.
6:   θ_s^(t) ← θ_s^(t-1) − η ∇_{θ_s} L_total.
7:   Compute the importance score S^(t) following Eq 6.
8:   for all W^(t) ∈ θ_s^(t) do
9:     Compute the column importance score N_W^(t) following Eq 7.
10:    Compute the binary mask M_W^(t).
11:    W^(t) ← W^(t) ⊙ M_W^(t).
12:  end for
13: end for

Why do we impose sparsity requirements on individual weight matrices? Traditional pruning imposes requirements on the global sparsity of the model rather than the local sparsity of individual matrices. As a result, some matrices end up with much larger widths than others. These wide matrices can become memory bottlenecks on commonly used computational hardware. Furthermore, achieving the desired inference speedup then requires re-configuring the pre-defined model architectures in deep learning software packages. In contrast, controlling the local sparsity is friendlier to both hardware and software.
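The column-pruning step and the cubic sparsity schedule described above can be sketched as follows (NumPy; the matrix sizes are toy values, and r(t) here denotes the fraction of columns kept, decaying from 1 to r_f as in the text):

```python
import numpy as np

def sparsity_schedule(t, t_i, t_f, r_f):
    """Cubic schedule: r(t) = 1 before t_i, decays cubically to r_f
    between t_i and t_f, then stays at r_f."""
    if t < t_i:
        return 1.0
    if t < t_f:
        frac = (t - t_i) / (t_f - t_i)
        return r_f + (1.0 - r_f) * (1.0 - frac) ** 3
    return r_f

def prune_columns(W, S_W, r):
    """Zero out the columns of W with the smallest importance
    N_i = ||S_W[:, i]||_1, keeping the top fraction r."""
    n = np.abs(S_W).sum(axis=0)                    # column scores, shape (d_s,)
    keep = np.argsort(-n)[: max(1, round(r * W.shape[1]))]
    mask = np.zeros_like(W)
    mask[:, keep] = 1.0
    return W * mask
```

Calling `prune_columns` once per training step with `r = sparsity_schedule(t, ...)` reproduces the gradual width reduction: early steps keep every column, and the keep ratio shrinks cubically toward r_f.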

4. EXPERIMENTS

We evaluate HomoDistil on BERT-base (Devlin et al., 2018) on natural language understanding (NLU) and question answering tasks.

4.1. DATA

Continual Pre-training. We distill the student using the open-domain corpus for BERT pre-training (Devlin et al., 2018): English Wikipedia, containing 2500M words, and Toronto BookCorpus (Zhu et al., 2015), containing 800M words. We clean the corpus by removing tables, lists and references following BERT. We then pre-process the cleaned corpus by concatenating all sentences in a paragraph and truncating the concatenated passage to length 128 following TinyBERT (Jiao et al., 2019). We tokenize the corpus with the vocabulary of BERT (30k).

Fine-tuning. We fine-tune the student model on both NLU and question answering tasks. For NLU tasks, we adopt the commonly used General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), which contains nine tasks, e.g., textual entailment, semantic similarity, etc. For question answering, we adopt SQuAD v1.1 and v2.0 (Rajpurkar et al., 2016a; 2018). Details about the datasets are deferred to Appendix A.1.
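For illustration, the concatenate-and-truncate preprocessing can be sketched as below (the whitespace tokenizer is a stand-in for BERT's WordPiece tokenizer, and splitting the stream into fixed-length chunks is an assumption about how truncation is applied):

```python
def make_passages(sentences, tokenize, max_len=128):
    """Concatenate all sentences in a paragraph and split the token
    stream into fixed-length passages of at most max_len tokens."""
    tokens = []
    for sentence in sentences:
        tokens.extend(tokenize(sentence))
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Toy usage with a whitespace "tokenizer":
passages = make_passages(["a b c", "d e"], str.split, max_len=2)
# -> [['a', 'b'], ['c', 'd'], ['e']]
```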

4.2. MODEL

We evaluate HomoDistil on pre-trained BERT-base (Devlin et al., 2018), which contains 12 Transformer layers with hidden dimension 768. BERT-base is pre-trained with masked language modeling and next sentence prediction tasks on Wikipedia and Toronto BookCorpus (16GB). We use BERT-base as the teacher model and as the initialization of the student model. We produce multiple student models at several sparsity ratios. Table 1 lists the architectures of the teacher and the student models.

4.3. BASELINES

We compare HomoDistil with the state-of-the-art task-agnostic distillation baselines. These methods initialize the student directly at the target size and fix its architecture during distillation. For example, to obtain a shallow model, the student is often initialized from a subset of the teacher's layers. DistilBERT (Sanh et al., 2019) considers vanilla distillation by penalizing the final-layer prediction discrepancy using Eq 1. TinyBERT-GD (General Distillation) (Jiao et al., 2019) extends DistilBERT by exploiting the knowledge in the intermediate Transformer layers using Eq 5. MiniLM (Wang et al., 2020b) penalizes the discrepancy between the queries-keys scaled dot products and the values-values scaled dot products in the final-layer self-attention module. MiniLMv2 (Wang et al., 2020a) extends MiniLM by encouraging the student to mimic the attention head relations of the teacher.

4.4. IMPLEMENTATIONS DETAILS

Continual Pre-training. For all experiments, we use a maximum sequence length of 128 and a batch size of 4k. We train the student model for T = 28k steps (3 epochs). We use Adam (Kingma & Ba, 2014) as the optimizer with β = (0.9, 0.999) and ϵ = 1 × 10^-6. We use a learning rate of 3 × 10^-4 for HomoBERT-base and 6 × 10^-4 for HomoBERT-small/xsmall/tiny. We adopt a linear-decay learning rate schedule with a warmup ratio of 0.1. For distillation, we share all weights of {W_hidn^k}_{k=1}^K and W_emb. We set α_1, α_2, α_3, α_4 to 1 for all experiments. For the importance score computation, we select neurons based on the exponential moving average of the importance score for stability. For the pruning schedule, we set the initial iteration t_i to 0 and select the final iteration t_f from {0.5, 0.7, 0.9} × T. Full implementation details are deferred to Appendix A.2.

Fine-tuning. We drop the masked language modeling prediction head, {W_hidn^k}_{k=1}^K and W_emb from the continual pre-training stage, and randomly initialize a task-specific classification head for the student model. For NLU tasks, we select the number of training epochs from {3, 6}, the batch size from {16, 32} and the learning rate from {2, 3, 4, 5, 6, 7} × 10^-5. For RTE, MRPC and STS-B, we initialize the student from an MNLI-fine-tuned student to further improve the performance; this is done for all baselines. For question answering tasks, we fine-tune the student for 2 epochs with a batch size of 12 and a learning rate of 1 × 10^-4. For all tasks, we use Adam as the optimizer with β = (0.9, 0.999) and ϵ = 1 × 10^-6. Full implementation details are deferred to Appendix A.3.
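The exponential moving average used to stabilize the importance scores can be sketched as follows (the decay value beta is an assumption; the paper does not state it here):

```python
def ema_update(ema, score, beta=0.9):
    """Smooth the per-step importance score: neurons are ranked by the
    running average rather than a single noisy mini-batch estimate."""
    return score if ema is None else beta * ema + (1.0 - beta) * score

# Noisy per-step scores for one neuron:
ema = None
for s in [1.0, 0.0, 1.0, 0.0]:
    ema = ema_update(ema, s)
# ema stays near the running mean instead of oscillating between 0 and 1.
```

Ranking neurons by this smoothed score avoids pruning decisions that flip from batch to batch.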

4.5. MAIN RESULTS

Table 2 shows the fine-tuning results of HomoDistil on the GLUE development set. We report the median over five random seeds for all experiments in this paper. HomoBERT-base consistently outperforms existing state-of-the-art baselines on six out of eight tasks, and achieves significant gains on MNLI, SST-2 and CoLA. The margins become much more prominent for students with 10 ∼ 20M parameters: HomoBERT-tiny (14.1M) significantly outperforms TinyBERT 4×312 (14.5M) by 3.3 points in terms of task-average score, and outperforms BERT-small, which is twice its size, by 1.0 point.

Table 3: The accuracy of fine-tuning distilled models on the SQuAD v1.1/2.0 validation sets. The results of TinyBERT 3-GD and MiniLM 3 are reported from (Wang et al., 2020b). The rest are fine-tuned from the officially released checkpoints (Devlin et al., 2018; Jiao et al., 2019).

5. ANALYSIS

We verify that HomoDistil maintains a small prediction discrepancy throughout the distillation process, leading to a better-generalized student model.

5.1. HOMODISTIL MAINTAINS A SMALL PREDICTION DISCREPANCY

Figure 2 shows the prediction discrepancy, D_KL, under different sparsity schedules throughout the distillation process. When the student is directly initialized as a single-shot-pruned subnetwork at the target sparsity (i.e., t_f = 0), the initial prediction discrepancy is large. In contrast, when the student is initialized with the full model and is iteratively pruned over longer horizons (i.e., t_f = 0.5T, 0.7T and 0.9T), the initial discrepancy is small. The discrepancy then gradually increases due to pruning, but the increase remains small due to distillation. Figure 3 shows the task-specific fine-tuning accuracy of students distilled with different sparsity schedules. The student that is initialized with the full model and pruned iteratively achieves significantly better generalization performance on the downstream tasks than the one initialized as a target-size subnetwork.

5.2. DISTILLATION BENEFITS ITERATIVE PRUNING

Table 4 compares the student trained with and without the distillation losses (i.e., with L_total as defined in Eq 5 versus with L_MLM only). The task-specific fine-tuning performance of the student trained with the distillation losses consistently exceeds that of the one trained without, across multiple model scales. This suggests that the teacher's knowledge is essential for recovering the performance degradation due to pruning, and that minimizing the distillation loss is an important criterion for selecting important neurons.

5.3. IMPORTANCE METRIC MATTERS

Table 5 investigates the student performance under different importance metrics: 1) magnitude pruning (Han et al., 2015b), where S_j = |θ_j|; 2) movement pruning (Sanh et al., 2020), where S_j = θ_{j,-j}^⊤ ∇_θ L(θ); 3) PLATON (Zhang et al., 2022), where S_j = I_j · U_j, with I_j the sensitivity score as defined in Eq 2 and U_j the uncertainty estimate of I_j. For all methods, we use the exponential moving average of the score for stability. Using sensitivity or PLATON as the importance score significantly outperforms the baseline. In contrast, the weight magnitude, which may not correctly quantify a neuron's contribution to the loss in large and complex models, achieves only comparable performance to the baseline. Movement pruning, which is mainly designed for task-specific fine-tuning, diverges.
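The three per-parameter metrics can be compared side by side (NumPy sketch; the toy gradient is illustrative, and the movement-score sign convention follows the formula in the text, noting that conventions vary across papers):

```python
import numpy as np

theta = np.array([2.0, -3.0, 0.5])
grad = np.array([0.1, -1.0, 2.0])   # hypothetical gradient of L

magnitude = np.abs(theta)           # Han et al. (2015b): |theta_j|
movement = theta * grad             # Sanh et al. (2020): theta_j * dL/dtheta_j
sensitivity = np.abs(theta * grad)  # Eq 2: |theta_j * dL/dtheta_j|

# The metrics can disagree on which parameter matters least:
# magnitude would prune index 2 first, sensitivity index 0.
```

This disagreement is exactly why the choice of metric matters: a large weight with a near-zero gradient contributes little to the loss, so sensitivity deprioritizes it while magnitude keeps it.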

6. DISCUSSION

Combining pruning and distillation. While we are the first to combine pruning with distillation in the task-agnostic setting, there have been similar explorations in the task-specific setting. One stream of work first prunes the model to the target size and then distills the subnetwork (Hou et al., 2020; Lagunas et al., 2021). In this case, pruning solely serves as an architecture selection strategy independent of distillation. Another stream simultaneously prunes and distills the model (Xu et al., 2021; Xia et al., 2022), which is more comparable to ours. The main differences are that these methods do not initialize the student with the teacher and often prune at a large granularity, e.g., a Transformer layer. In the task-agnostic setting, however, an undesirable initialization and a large pruning granularity induce a huge discrepancy, which is difficult to minimize on a large amount of open-domain data. Furthermore, after each layer is pruned, the remaining layers need to match a different set of teacher layers to ensure the learning of comprehensive knowledge. Suddenly switching the layer to learn from can be difficult on a large amount of open-domain data. How to prune the student's depth in the task-agnostic setting remains an interesting open problem. A comprehensive comparison of these methods is deferred to Appendix A.5.

Resolving prediction discrepancy. Recent research has shown that distillation from a large teacher to a small student has only marginal benefits (Jin et al., 2019; Cho & Hariharan, 2019), mainly due to the large prediction discrepancy (Guo et al., 2020). Traditional solutions have resorted to introducing auxiliary teacher assistant models (Mirzadeh et al., 2020; Rezagholizadeh et al., 2021; Li et al., 2021), but training and storing auxiliary models can be memory- and computation-intensive.

Table 9: The evaluation performance of HomoDistil and the commonly used task-specific distillation baseline methods on the GLUE development set.
The table below summarizes how HomoDistil differs from the existing "Pruning+Distillation" methods:

Method                      | Setting       | Teacher weights     | Student initialization
DynaBERT (Hou et al., 2020) | Task-specific | Fine-tuned weights  | Pruned, pre-trained weights
CoFi (Xia et al., 2022)     | Task-specific | Fine-tuned weights  | Pre-trained weights
SparseBERT (Xu et al., 2021)| Task-specific | Fine-tuned weights  | Pre-trained weights
HomoDistil                  | Task-agnostic | Pre-trained weights | Pre-trained weights

HomoDistil prunes in the task-agnostic setting. This allows us to prune the word embeddings and produce a smaller model more suitable for edge devices (e.g., around 15 million parameters). Furthermore, a task-specific model needs to be pruned separately for each individual task, while a task-agnostic model can be fine-tuned for any task at a low cost.

HomoDistil initializes the student with the teacher. To maintain a small discrepancy in the early stage, HomoDistil initializes the student with the teacher. In contrast, DynaBERT initializes the student with a target-size subnetwork. SparseBERT and CoFi initialize the student with pre-trained weights while initializing the teacher with fine-tuned weights.

HomoDistil simultaneously prunes and distills and allows interactions between the two. To maintain a small discrepancy throughout distillation, HomoDistil prunes based on the sensitivity score to make the pruning operation "distillation-aware". Specifically, HomoDistil selects the columns and rows to prune based on their contributions to the distillation loss. In contrast, DynaBERT treats pruning and distillation as two independent operations, first pruning and then distilling the subnetwork. SparseBERT prunes based on the weight magnitude without considering the influence on the distillation loss.

HomoDistil prunes rows and columns. The granularity of rows and columns is sufficiently small to control the increase in discrepancy while maintaining the practical benefits of structured pruning.

HomoDistil can control the layer width. HomoDistil enforces a local sparsity constraint on each matrix, producing a model with a consistent width in each layer.
In contrast, SparseBERT and CoFi have no control over the layer width, which might result in wide matrices that become memory bottlenecks. Table 12 shows the evaluation performance of HomoDistil, CoFi and SparseBERT on the GLUE benchmark (DynaBERT results are presented in Table 9). We can see that HomoDistil achieves a noticeable gain over CoFi and comparable performance to SparseBERT at nearly half of their sizes.

A.6 COMPUTATIONAL COSTS

Table 13 compares the computational costs of HomoDistil and the baseline methods during inference. We profile the inference time and the number of FLOPs (embedding excluded) during the forward pass using the profiler package released by PyTorch. We conduct the measurements on the GLUE development set with a batch size of 128 and a maximum sequence length of 128 on one Nvidia A100 GPU. We compute the average time and FLOPs over all batches. The speedup is computed with respect to BERT-base. For a fair comparison, we only compare with compact models. As can be observed, HomoDistil achieves some inference speedup and FLOPs reduction, but not as much as the other models under a similar parameter budget. This is because HomoDistil allocates a higher budget to the backbone parameters and a lower budget to the embedding parameters. However, we remark that HomoDistil achieves better accuracy and enjoys the same (or greater) storage benefits as the distilled (or structurally pruned) models.

A.7 STANDARD DEVIATIONS

All experimental results of HomoDistil presented in this paper are the median of five random seeds. Table 14 and Table 15 show the standard deviations of the experimental results on the GLUE benchmark (Table 2) and on the SQuAD v1.1/2.0 datasets (Table 3), respectively.



7. CONCLUSION

We propose HomoDistil, a novel task-agnostic distillation approach equipped with iterative pruning. We demonstrate that HomoDistil maintains a small prediction discrepancy and achieves promising benefits over existing task-agnostic distillation baselines.

Footnotes:
1. For notational simplicity, we will omit x throughout the rest of the paper.
2. https://dumps.wikimedia.org/enwiki/
3. We mainly compare with baselines that use BERT-base as the teacher model for a fair comparison. We also present a comprehensive comparison with task-specific distillation baselines in Appendix A.4.
4. The standard deviations are reported in Appendix A.7.
5. https://github.com/yinmingjun/TinyBERT/blob/master/pregenerate_training_data.py
6. https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html



Han et al. (2015b;a); Paganini & Forde (2020); Zhu & Gupta (2017); Renda et al. (2020); Zafrir et al. (2021); Liang et al. (2021).

Figure 2: The prediction discrepancy during the distillation of HomoBERT models under different schedules of sparsity.

Figure 3: The accuracy of fine-tuning HomoBERT-small distilled with different schedules of sparsity on the development set of GLUE benchmark.

Table 1: Architectures of the teacher and the student models.

Table 3 shows the fine-tuning results of HomoDistil on SQuAD v1.1/v2.0. All HomoBERT students outperform the best baseline, MiniLM 3 (17.3M), by a margin of over 3 points on SQuAD v2.0. In particular, HomoBERT-xsmall (15.6M) obtains a gain of 3.8 points.

Table 2: The accuracy of fine-tuning distilled BERT models on the GLUE development set. The results of MiniLM 3/6 are reported from (Wang et al., 2020b).

Table 4: The accuracy of fine-tuning HomoBERT models trained with and without distillation losses ("L_total" and "L_MLM") on the development set of the GLUE benchmark.

Table 5: The accuracy of fine-tuning HomoBERT-small pruned under different importance metrics on the development set of the GLUE benchmark.

The results of PKD (Sun et al., 2019) and ProKT (Shi et al., 2021) are from our implementation; the rest of the results are from the original papers.

Comparison of HomoDistil and the existing "Pruning+Distillation" methods from the distillation perspective.

A comparison of HomoDistil and the existing "Pruning+Distillation" methods from the pruning perspective.

Table 12: The performance comparison with the existing "Pruning+Distillation" methods. All results are as reported in their original papers.

The inference speedup and the number of FLOPs (embedding excluded) of HomoDistil and the baseline methods. The speedup is computed with respect to BERT-base.

The standard deviations of the experimental results on the GLUE development set in Table 2.

The standard deviations of the experimental results on SQuAD v1.1/2.0 in Table 3.

Summary of the GLUE benchmark.

A APPENDIX

A.1 DATA

Continual Pre-training. We use the same pre-training data as BERT: Wikipedia (the English Wikipedia dump; 12GB) and BookCorpus (Zhu et al., 2015) (6GB). We clean the corpus by removing tables, lists and references, following BERT. We then pre-process the cleaned corpus by concatenating all sentences in a paragraph and truncating the concatenated passage to a length of 128, following TinyBERT (Jiao et al., 2019). We tokenize the corpus with the vocabulary of BERT (30k).

Fine-tuning. GLUE is a commonly used natural language understanding benchmark containing nine tasks. The benchmark includes question answering (Rajpurkar et al., 2016b), linguistic acceptability (CoLA, Warstadt et al. 2019), sentiment analysis (SST, Socher et al. 2013), text similarity (STS-B, Cer et al. 2017), paraphrase detection (MRPC, Dolan & Brockett 2005), and natural language inference (RTE & MNLI, Dagan et al. 2006; Bar-Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009; Williams et al. 2018) tasks. Details of the GLUE benchmark, including tasks, statistics, and evaluation metrics, are summarized in Table 16. SQuAD v1.1/v2.0 are the Stanford Question Answering Datasets (Rajpurkar et al., 2016a; 2018), two popular machine reading comprehension benchmarks built from approximately 500 Wikipedia articles, with questions and answers obtained through crowdsourcing. SQuAD v2.0 additionally includes unanswerable questions about the same paragraphs.
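The pre-processing pipeline described above can be sketched as follows. This is a minimal illustration only: whitespace tokenization stands in for BERT's 30k-WordPiece vocabulary, and the function name and data layout are our own assumptions, not part of the released code.

```python
# Sketch of the TinyBERT-style pre-processing: sentences within a paragraph
# are concatenated, and the resulting passage is truncated to 128 tokens.
# Whitespace splitting is a stand-in for WordPiece tokenization (assumption).

MAX_LEN = 128

def build_passages(paragraphs, max_len=MAX_LEN):
    """Concatenate the sentences of each paragraph and truncate to max_len tokens."""
    passages = []
    for sentences in paragraphs:
        tokens = " ".join(sentences).split()  # placeholder for WordPiece tokenization
        passages.append(tokens[:max_len])
    return passages

# Example: a tiny two-paragraph "corpus".
corpus = [
    ["The cat sat on the mat.", "It purred."],
    ["Distillation transfers knowledge."],
]
passages = build_passages(corpus)
print([len(p) for p in passages])  # prints [8, 3]
```

In the real pipeline, passages longer than 128 WordPiece tokens would be truncated exactly as the slice above does, while the tokenizer itself would come from BERT's vocabulary.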

A.2 CONTINUAL PRE-TRAINING IMPLEMENTATIONS

Table 6 presents the hyper-parameter configurations for continually pre-training HomoBERT models on the open-domain data. We set the distillation temperature to 2. We empirically observe that setting t_f within the range of 0.5T ~ 0.9T achieves similarly good downstream performance. Furthermore, we observe that different weight modules may prefer different values of t_f: 1) it is better to finish pruning the output projection matrices in the attention module, the feed-forward module and the embedding module early, because pruning them late induces a large increase in the distillation loss, from which the student performance is difficult to recover; 2) the student performance is less sensitive to pruning the key and query projection matrices in the attention module and the input projection matrix in the feed-forward module, and can often recover easily. Based on this observation, we set t_f = 0.5T for the output projection matrices in the attention module, the feed-forward module and the embedding module. For the key and query projection matrices in the attention module and the input projection matrix in the feed-forward module, we set t_f = 0.9T. For all other matrices, we set t_f = 0.7T. This configuration brings a small but consistent gain of 0.05 ~ 0.08 on GLUE. The continual pre-training experiment runs for around 13 hours on 8 Nvidia A100 GPUs.

A.4 COMPARISON WITH TASK-SPECIFIC DISTILLATION METHODS

Table 9 compares HomoDistil with commonly used task-specific distillation baseline methods: PKD (Sun et al., 2019), BERT-of-Theseus (Xu et al., 2020), MixKD (Liang et al., 2020), DynaBERT (Hou et al., 2020), ProKT (Shi et al., 2021) and MetaDistil (Zhou et al., 2022). All baseline methods use a BERT-base fine-tuned on the target task as the teacher model, and a 6-layer pre-trained BERT-base as the initialization of the student model. The student model is then distilled on the target task data.
As shown in Table 9, HomoDistil demonstrates a prominent margin over these commonly used task-specific methods.
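The module-specific pruning horizons t_f described in Appendix A.2 can be sketched as a simple lookup, expressed as fractions of the total number of training steps T. The parameter-name patterns below follow HuggingFace-style BERT naming and are an assumption made for illustration; the paper only specifies the module categories.

```python
# Sketch of the module-specific pruning horizons t_f (Appendix A.2),
# as fractions of the total training steps T. Parameter-name patterns
# are assumed to follow HuggingFace-style BERT naming (an assumption).

EARLY, MID, LATE = 0.5, 0.7, 0.9

def pruning_horizon(name: str) -> float:
    """Return t_f / T for a weight matrix, given its (assumed) parameter name."""
    # Output projections (attention output, feed-forward output) and the
    # embedding module are pruned early: late pruning sharply increases the
    # distillation loss and the student struggles to recover.
    if any(k in name for k in ("attention.output.dense",  # attention output projection
                               "output.dense",            # feed-forward output projection
                               "embeddings")):            # embedding module
        return EARLY
    # Key/query projections and the feed-forward input projection recover
    # easily, so their pruning can continue until late in training.
    if any(k in name for k in ("query", "key", "intermediate.dense")):
        return LATE
    # All remaining matrices use the intermediate horizon.
    return MID

print(pruning_horizon("encoder.layer.0.attention.self.query.weight"))  # prints 0.9
```

With this configuration, a pruning schedule would drive each matrix's sparsity to its target by step t_f * T and keep it fixed afterwards.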

A.5 A COMPARISON WITH "PRUNING+DISTILLATION" METHODS

We elaborate on our discussion in Section 6 by comparing HomoDistil with the existing methods that combine pruning and distillation in more detail. Table 10 and Table 11 present the detailed comparison among HomoDistil, DynaBERT (Hou et al., 2020), SparseBERT (Xu et al., 2021) and CoFi (Xia et al., 2022). We also list the major differences below:

HomoDistil focuses on the task-agnostic setting. In the task-specific setting, pruning incurs an inevitable loss of task-relevant pre-training knowledge that may not be present in the fine-tuning data. Therefore, existing works leave the word embeddings untouched (these take up around 20 million parameters in BERT-base). In contrast, this problem does not exist in the task-agnostic setting.
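As a rough back-of-envelope check of the embedding-size figure above, the word-embedding matrix of BERT-base holds vocab_size * hidden_size parameters; with BERT's standard ~30k WordPiece vocabulary and hidden dimension 768, this comes to roughly 23M, consistent in order of magnitude with the "around 20 million" cited.

```python
# Back-of-envelope check: size of BERT-base's word-embedding matrix.
vocab_size = 30522   # BERT's WordPiece vocabulary size
hidden_size = 768    # BERT-base hidden dimension
word_emb_params = vocab_size * hidden_size
print(f"{word_emb_params / 1e6:.1f}M")  # prints 23.4M
```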

