EARLYBERT: EFFICIENT BERT TRAINING VIA EARLY-BIRD LOTTERY TICKETS

Abstract

Deep, heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many natural language processing (NLP) tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models, but they focus only on reducing inference time while still requiring an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of higher computational resource demands. In this paper, inspired by the Early-Bird Lottery Tickets recently studied for computer vision tasks, we propose EarlyBERT, a general computationally efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. By slimming the self-attention and fully-connected sub-layers inside a transformer, we are the first to identify structured winning tickets in the early stage of BERT training. We apply those tickets towards efficient BERT training, and conduct comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks. Our results show that EarlyBERT achieves comparable performance to standard BERT, with 35∼45% less training time.

1. INTRODUCTION

Large-scale pre-trained language models (e.g., BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), T5 (Raffel et al., 2019)) have significantly advanced the state of the art in the NLP field. Despite impressive empirical success, their computational inefficiency has become an acute drawback in practice. As more and more transformer layers are stacked with larger self-attention blocks, model complexity increases rapidly. For example, compared to the BERT-Large model with 340 million parameters, T5 has more than 10 billion parameters to learn. Such high model complexity calls for expensive computational resources and extremely long training time. Model compression is one approach to alleviating this issue. Recently, many methods have been proposed to encode large NLP models compactly (Sun et al., 2019; Sanh et al., 2019; Sun et al., 2020). However, the focus is solely on reducing computational resources or inference time, leaving the process of searching for the right compact model ever more costly. Furthermore, almost all model compression methods start with a large pre-trained model, which in practice may not exist. Recent work (You et al., 2020b) proposes to use large training batches, which significantly shortens pre-training time of the BERT-Large model but demands daunting computing resources (1,024 TPUv3 chips). In contrast, our quest is to find a general resource-efficient training algorithm for large NLP models, which can be applied to both pre-training and fine-tuning stages. Our goal is to trim down training time while also avoiding higher total training resource costs (e.g., resorting to large-batch or distributed training). To meet this challenging demand, we draw inspiration from a recent work (You et al., 2020a) that explores the use of the Lottery Ticket Hypothesis (LTH) for efficient training of computer vision models. LTH was first proposed in Frankle & Carbin (2019) as an exploration to understand the training process of deep networks.
The original LTH demonstrates the existence of a trainable sparse sub-network at initialization, but it cannot be directly utilized for efficient training, since the subnetwork itself has to be identified through a tedious iterative search process. In addition, most LTH works consider only unstructured sparsity. The study of You et al. (2020a) presents new discoveries that structured lottery tickets can emerge in the early stage of training (i.e., Early-Bird Tickets), and therefore a structurally sparse sub-network can be identified at much lower cost, leading to practical efficient training algorithms. Inspired by the success of LTH and Early-Bird Tickets, we propose EarlyBERT, a general efficient training algorithm based on structured Early-Bird Tickets. Due to the vast differences between the architectures and building blocks of computer vision models and BERT, the method of You et al. (2020a) cannot be directly applied to our setting. By instead using network slimming (Liu et al., 2017) on the self-attention and fully-connected sub-layers inside a transformer, we are the first to introduce an effective approach that identifies structured winning tickets in the early stage of BERT training, which we then successfully apply to efficient language-model pre-training and fine-tuning. Extensive experiments on BERT demonstrate that EarlyBERT can save 35∼45% training time without sacrificing accuracy, when evaluated on GLUE and SQuAD benchmarks.
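The network-slimming idea underlying this approach can be sketched in a few lines: attach a learnable coefficient to each structural unit (e.g., each attention head), train briefly with an L1 penalty that drives unimportant coefficients toward zero, then keep only the units with the largest coefficients. The following is a minimal NumPy illustration of that mechanism, not the paper's actual implementation; the learning rate, L1 strength, and simulated gradients are all stand-in assumptions.

```python
import numpy as np

def slimming_step(coeffs, grads, lr=0.1, l1=0.01):
    """One update of the learnable slimming coefficients.

    The L1 penalty (subgradient: l1 * sign(c)) pushes coefficients of
    unimportant units toward zero; `grads` stands in for the task-loss
    gradients w.r.t. the coefficients.
    """
    return coeffs - lr * (grads + l1 * np.sign(coeffs))

def prune_mask(coeffs, keep_ratio=0.5):
    """Structured pruning: keep the units with the largest |coefficient|."""
    k = max(1, int(len(coeffs) * keep_ratio))
    kept = np.argsort(-np.abs(coeffs))[:k]
    mask = np.zeros_like(coeffs, dtype=bool)
    mask[kept] = True
    return mask

# Toy example: 8 attention heads, a few simulated slimming steps.
rng = np.random.default_rng(0)
c = np.ones(8)
for _ in range(50):
    fake_grads = rng.normal(0, 0.05, size=8)  # stand-in for real gradients
    c = slimming_step(c, fake_grads)
mask = prune_mask(c, keep_ratio=0.5)
print(mask.sum())  # 4 heads kept
```

Because the pruning acts on whole heads (or whole intermediate neurons in the fully-connected sub-layers) rather than individual weights, the resulting subnetwork is dense in its remaining dimensions and therefore actually trains faster on commodity hardware.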

2. RELATED WORK

Efficient NLP Models It is widely believed that BERT and other large NLP models are considerably overparameterized (McCarley, 2019; Sun et al., 2019). This explains the emergence of many model compression works, which can be roughly categorized into quantization (Shen et al., 2020; Zafrir et al., 2019), knowledge distillation (Sun et al., 2019; Jiao et al., 2019; Sanh et al., 2019; Sun et al., 2020), dynamic routing (Fan et al., 2019; Xin et al., 2020), and pruning (Li et al., 2020; Wang et al., 2019; McCarley, 2019; Michel et al., 2019). Almost all model compression methods focus on reducing inference time, while their common drawback is the reliance on fully-trained and heavily-engineered dense models before proceeding to their compact, sparse versions -- which essentially transplants the resource burden from the inference to the training stage. Pruning is the mainstream approach for compressing BERT so far. McCarley (2019) proposed to greedily and iteratively prune away attention heads contributing less to the model. Wang et al. (2019) proposed to structurally prune BERT models using low-rank factorization and augmented Lagrangian ℓ0-norm regularization. McCarley (2019) pruned less important self-attention heads and slices of MLP layers by applying ℓ0 regularization to the coefficient corresponding to each head/MLP layer. Another line of work aims to reduce the training time of transformer-based models via large-batch training and GPU model parallelism (You et al., 2020b; Shoeybi et al., 2019). Our work is orthogonal to those works, and can be readily combined with them for a further efficiency boost.
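The greedy head-pruning strategy described above boils down to scoring each attention head's contribution and removing the lowest-scoring ones. The sketch below illustrates the idea with a simple output-norm proxy for importance; real methods use loss-sensitivity scores, so the `head_importance` function here is an illustrative assumption rather than any cited paper's metric.

```python
import numpy as np

def head_importance(attn_outputs):
    """Proxy importance score: L2 norm of each head's output.

    `attn_outputs` has shape (num_heads, seq_len, head_dim). Published
    methods score heads by loss sensitivity; the norm is only a proxy.
    """
    return np.linalg.norm(attn_outputs, axis=(1, 2))

def greedy_prune(importance, num_to_prune):
    """Return indices of the least important heads, to be removed."""
    return np.argsort(importance)[:num_to_prune].tolist()

rng = np.random.default_rng(1)
outputs = rng.normal(size=(12, 16, 64))  # 12 heads, toy activations
outputs[3] *= 0.01                       # make head 3 clearly weakest
scores = head_importance(outputs)
print(greedy_prune(scores, 1))  # [3]
```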

Lottery Ticket Hypothesis in Computer Vision

The Lottery Ticket Hypothesis (LTH) was first proposed in Frankle & Carbin (2019), which shed light on the existence of sparse sub-networks (i.e., winning tickets) at initialization, with non-trivial sparsity ratios, that can achieve almost the same performance as the full model when trained alone. The winning tickets are identified by pruning fully trained networks using so-called Iterative Magnitude-based Pruning (IMP). However, IMP is expensive due to its iterative nature. Moreover, IMP leads to unstructured sparsity, which is known to be insufficient for reducing training cost or accelerating training speed in practice. These barriers prevent LTH from becoming immediately helpful towards efficient training. Morcos et al. (2019) study the transferability of winning tickets between datasets and optimizers. Zhou et al. (2019) investigate different components of LTH and observe the existence of super-masks in winning tickets. Lately, You et al. (2020a) pioneered the identification of Early-Bird Tickets, which emerge at the early stage of the training process and contain structured sparsity when pruned with Network Slimming (Liu et al., 2017). Early-Bird Tickets mitigate the two aforementioned limitations of IMP, making it possible to train deep models efficiently by drawing such tickets early in training and then focusing on training this compact subnetwork only.

Lottery Ticket Hypothesis in NLP

All the above works evaluate their methods on computer vision models. For NLP models, previous work has also found that matching subnetworks exist when training Transformers and LSTMs (Yu et al., 2019; Renda et al., 2020). Evci et al. (2020) derived an algorithm for training sparse neural networks according to LTH and applied it to character-level language modeling on WikiText-103. For BERT models, a recent work (Chen et al., 2020) found that pre-trained BERT models contain sparse subnetworks, found by unstructured IMP at 40% to 90% sparsity, that are independently trainable and transferable to a range of downstream tasks with no performance degradation. Another concurrent work (Prasanna et al., 2020) aims to find structurally sparse lottery tickets for BERT by pruning entire attention heads and MLP layers. Their experiments show that all subnetworks ("good" and "bad") have "comparable performance" when fine-tuned on downstream tasks, leading to their "all tickets are winning" conclusion.
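The Early-Bird criterion of You et al. (2020a), drawing a ticket once the pruning masks from successive epochs stop changing, can be sketched as follows. The mask-distance threshold, window size, and fixed coefficient vector here are illustrative assumptions, not the original hyperparameters.

```python
import numpy as np

def slimming_mask(coeffs, keep_ratio=0.5):
    """Structured binary mask keeping the largest-|coefficient| channels."""
    k = max(1, int(len(coeffs) * keep_ratio))
    mask = np.zeros(len(coeffs), dtype=bool)
    mask[np.argsort(-np.abs(coeffs))[:k]] = True
    return mask

def mask_distance(m1, m2):
    """Normalized Hamming distance between two pruning masks."""
    return np.mean(m1 != m2)

def ticket_emerged(mask_history, eps=0.1, window=3):
    """Early-Bird criterion: the last `window` masks differ by < eps."""
    recent = mask_history[-window:]
    return len(recent) == window and all(
        mask_distance(recent[i], recent[-1]) < eps for i in range(window - 1)
    )

# Toy run: coefficient ranking is stable, so masks stop changing.
history = []
coeffs = np.linspace(1.0, 0.1, 10)  # already-separated importance values
for _ in range(3):
    history.append(slimming_mask(coeffs))
print(ticket_emerged(history))  # True
```

Once the criterion fires, training switches to the compact subnetwork defined by the stabilized mask, which is where the end-to-end training-time savings come from.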

