HOMODISTIL: HOMOTOPIC TASK-AGNOSTIC DISTILLATION OF PRE-TRAINED TRANSFORMERS

Abstract

Knowledge distillation has been shown to be a powerful model compression approach that facilitates the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model, and iteratively prune the student's neurons until the target width is reached. This approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines.

1. INTRODUCTION

Pre-trained language models have demonstrated powerful generalizability in various downstream applications (Wang et al., 2018; Rajpurkar et al., 2016a). However, the number of parameters in such models has grown to hundreds of millions (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). This poses a significant challenge to deploying such models in applications with latency and storage requirements. Knowledge distillation (Hinton et al., 2015) has been shown to be a powerful technique to compress a large model (i.e., the teacher model) into a small one (i.e., the student model) with acceptable performance degradation. It transfers knowledge from the teacher to the student by regularizing the consistency between their output predictions. For language models, many efforts have been devoted to task-specific knowledge distillation (Tang et al., 2019; Turc et al., 2019; Sun et al., 2019; Aguilar et al., 2020), where a large pre-trained model is first fine-tuned on a downstream task and then serves as the teacher to distill a student during fine-tuning. However, task-specific distillation is computationally costly, because switching to a new task always requires training a new task-specific teacher. Therefore, recent research has paid more attention to task-agnostic distillation (Sanh et al., 2019; Sun et al., 2020; Jiao et al., 2019; Wang et al., 2020b; Khanuja et al., 2021; Chen et al., 2021), where a student is distilled from a teacher pre-trained on open-domain data and can then be efficiently fine-tuned on various downstream tasks. Despite the practical benefits, task-agnostic distillation is challenging: the teacher model has a significantly larger capacity and a much stronger representation power than the student model.
Figure 1: Left: In HomoDistil, the student is initialized from the teacher and is iteratively pruned through the distillation process. The widths of the rectangles represent the widths of the layers; the depth of color represents the sufficiency of training. Right: An illustrative comparison of the student's optimization trajectory in HomoDistil and in standard distillation. We define the Effective Distillation Region as the region where the prediction discrepancy is small enough for distillation to be effective. In HomoDistil, since the student is initialized from the teacher and maintains this small discrepancy, the trajectory consistently lies in the region. In standard distillation, since the student is initialized with a much smaller capacity than the teacher's, distillation is ineffective at the early stage of training.

As a result, it is very difficult for the student model to produce predictions that match the teacher's over a massive amount of open-domain training data, especially when the student model is not well-initialized. Such a large prediction discrepancy eventually diminishes the benefits of distillation (Jin et al., 2019; Cho & Hariharan, 2019; Mirzadeh et al., 2020; Guo et al., 2020; Li et al., 2021). To reduce this discrepancy, recent research has proposed to better initialize the student model from a subset of the teacher's layers (Sanh et al., 2019; Jiao et al., 2019; Wang et al., 2020b). However, selecting such a subset requires extensive tuning. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. As illustrated in Figure 1, we initialize the student model from the teacher model, which ensures a small prediction discrepancy in the early stage of distillation. At each training iteration, we prune from the remaining neurons a set of least important ones, i.e., those whose removal leads to the smallest increase in loss.
This ensures that the prediction discrepancy increases only by a small amount. Simultaneously, we distill the pruned student, so that this small discrepancy is further reduced. We repeat this procedure at each iteration to maintain a small discrepancy throughout training, which encourages effective knowledge transfer. We conduct extensive experiments to demonstrate the effectiveness of HomoDistil in task-agnostic distillation of BERT models. In particular, HomoBERT, distilled from a BERT-base teacher (109M parameters), achieves state-of-the-art fine-tuning performance on the GLUE benchmark (Wang et al., 2018) and SQuAD v1.1/2.0 (Rajpurkar et al., 2016a; 2018) at multiple parameter scales (e.g., 65M and 10 ∼ 20M). Extensive analysis corroborates that HomoDistil maintains a small prediction discrepancy throughout training and produces a student model that generalizes better.
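The iterative prune-and-distill schedule described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the importance criterion (a first-order Taylor proxy based on |weight × gradient|), the linear pruning schedule, and the simulated gradients are all assumptions made for brevity, and the distillation update itself is left as a comment.

```python
import numpy as np

def importance_scores(weights, grads):
    # First-order Taylor proxy for the loss increase caused by removing
    # each neuron (column): sum of |w * g| over the column. This exact
    # criterion is an assumption made for illustration.
    return np.abs(weights * grads).sum(axis=0)

def homodistil_sketch(teacher_w, target_width, num_iters, rng):
    # Student starts as a full copy of the teacher's weight matrix.
    student_w = teacher_w.copy()
    active = np.arange(student_w.shape[1])      # surviving neuron indices
    prune_per_iter = (len(active) - target_width) // num_iters
    for _ in range(num_iters):
        # Stand-in for gradients of the distillation loss.
        grads = rng.standard_normal(student_w.shape)
        scores = importance_scores(student_w, grads)
        # Drop the least important neurons; keep column order stable.
        keep = np.sort(np.argsort(scores)[prune_per_iter:])
        student_w = student_w[:, keep]
        active = active[keep]
        # ... a distillation update on the pruned student would go here ...
    return student_w, active
```

Because only a few neurons are removed per iteration, each pruned student stays close to its predecessor, which is what keeps the prediction discrepancy small between distillation updates.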

2. PRELIMINARY

2.1. TRANSFORMER-BASED LANGUAGE MODELS

The Transformer architecture has been widely adopted to train large neural language models (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019; He et al., 2021). It contains multiple identically constructed layers. Each layer has a multi-head self-attention mechanism and a two-layer feed-forward neural network. We use f(·; θ) to denote a Transformer-based model f parameterized by θ, where f is a mapping from the input sample space X to the output prediction space. We define the loss function L(θ) = E_{x∼X}[ℓ(f(x; θ))], where ℓ is the task loss.¹
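For concreteness, one such layer can be sketched in a few lines of numpy. This is an illustrative simplification rather than the architecture used in the paper: residual connections, layer normalization, biases, and the attention output projection are omitted, and all weight shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv, num_heads):
    # Multi-head scaled dot-product self-attention on x of shape (seq, d).
    seq_len, d = x.shape
    dh = d // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    out = np.empty_like(x)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = attn @ v[:, s]
    return out

def transformer_layer(x, p, num_heads=2):
    # Attention sublayer followed by a two-layer ReLU feed-forward network.
    # Residuals, layer norm, biases, and the output projection are omitted.
    h = self_attention(x, p["wq"], p["wk"], p["wv"], num_heads)
    return np.maximum(h @ p["w1"], 0.0) @ p["w2"]
```

In this sketch, the columns of the weight matrices (e.g., of p["w1"]) correspond to the neurons that the pruning procedure in later sections would remove.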

2.2. TRANSFORMER DISTILLATION

Knowledge distillation trains a small model (i.e., the student model) to match the output predictions of a large and well-trained model (i.e., the teacher model) by penalizing their output discrepancy. Specifically, the student is trained to minimize a divergence between its output distribution and the teacher's (Hinton et al., 2015).
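A common instantiation of this penalty is the soft-label objective of Hinton et al. (2015): the KL divergence between temperature-softened teacher and student output distributions. The sketch below is a minimal numpy version; the default temperature and the T² scaling follow that recipe, but the exact objective used by any particular method may differ.

```python
import numpy as np

def softened(logits, temperature):
    # Temperature-softened probability distribution over the last axis.
    z = logits / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(p_teacher || p_student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures
    # (Hinton et al., 2015). The small epsilon guards against log(0).
    p_t = softened(teacher_logits, temperature)
    p_s = softened(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return temperature ** 2 * kl.mean()
```

The loss is zero exactly when the student reproduces the teacher's logits, and grows with the prediction discrepancy, which is the quantity HomoDistil seeks to keep small throughout training.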



¹For notational simplicity, we will omit x throughout the rest of the paper.

