TASK-AGNOSTIC AND ADAPTIVE-SIZE BERT COMPRESSION

Abstract

While pre-trained language models such as BERT and RoBERTa have achieved impressive results on various natural language processing tasks, they contain a huge number of parameters and incur large computational and memory costs, which makes them difficult to deploy in real-world applications. Hence, model compression should be performed to reduce the computation and memory cost of pre-trained models. In this work, we aim to compress BERT and address the following two challenging practical issues: (1) the compression algorithm should be able to output multiple compressed models with different sizes and latencies, so as to support devices with diverse memory and latency constraints; (2) the algorithm should be downstream-task agnostic, so that the compressed models are generally applicable to different downstream tasks. We leverage techniques from neural architecture search (NAS) and propose NAS-BERT, an efficient method for BERT compression. NAS-BERT trains a big supernet on a carefully designed search space containing various architectures and outputs multiple compressed models with adaptive sizes and latencies. Furthermore, the training of NAS-BERT is conducted on standard self-supervised pre-training tasks (e.g., masked language modeling) and does not depend on specific downstream tasks; thus, the models it produces can be used across various downstream tasks. The technical challenge of NAS-BERT is that training a big supernet on the pre-training task is extremely costly. We employ several techniques, including block-wise search, search space pruning, and performance approximation, to improve search efficiency and accuracy. Extensive experiments on GLUE benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches, and can be directly applied to different downstream tasks with adaptive model sizes for different requirements of memory or latency.

1. INTRODUCTION

Pre-trained Transformer-based (Vaswani et al., 2017) language models like BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) have achieved impressive performance on a variety of downstream natural language processing tasks. These models are pre-trained on massive language corpora via self-supervised tasks to learn language representations, and then fine-tuned on specific downstream tasks. Despite their effectiveness, these models are quite expensive in terms of computation and memory cost, which makes them difficult to deploy for different downstream tasks and in various resource-restricted scenarios such as online servers, mobile phones, and embedded devices. Therefore, it is crucial to compress pre-trained models for practical deployment. Recently, a variety of compression techniques have been adopted to compress pre-trained models, such as pruning (McCarley, 2019; Gordon et al., 2020), weight factorization (Lan et al., 2019), quantization (Shen et al., 2020; Zafrir et al., 2019), and knowledge distillation (Sun et al., 2019; Sanh et al., 2019; Chen et al., 2020; Jiao et al., 2019; Hou et al., 2020; Song et al., 2020). Several existing works (Tsai et al., 2020; McCarley, 2019; Gordon et al., 2020; Sanh et al., 2019; Zafrir et al., 2019; Chen et al., 2020; Lan et al., 2019; Sun et al., 2019) compress a large pre-trained model into a small or fast model of fixed size at the pre-training or fine-tuning stage, and have achieved good compression efficiency and accuracy. However, from the perspective of practical deployment, a fixed-size model cannot be deployed on devices with different memory and latency constraints. For example, smaller models are preferred on embedded devices more than on online servers, and faster inference speed is more critical for online services than for offline services.
On the other hand, some previous methods (Chen et al., 2020; Hou et al., 2020) compress the models at the fine-tuning stage for each specific downstream task. This can achieve good accuracy due to the dedicated design for each task. However, compressing the model for each task can be laborious, and a model compressed for one task may not generalize well to another downstream task. In this paper, we study BERT compression in a different setting: the compressed models need to cover different sizes and latencies, in order to support devices with different kinds of memory and latency constraints; and the compression is conducted at the pre-training stage so as to be downstream-task agnostic. To this end, we propose NAS-BERT, which leverages neural architecture search (NAS) to automatically compress BERT models. We carefully design a search space that contains multi-head attention (Vaswani et al., 2017), separable convolution (Kaiser et al., 2018), feed-forward network and identity operations with different hidden sizes, to find efficient models. In order to search for models with adaptive sizes that satisfy the diverse memory and latency constraints of different devices, we train a big supernet that contains all the candidate operations and architectures with weight sharing (Bender et al., 2018; Cai et al., 2018; 2019; Yu et al., 2020). To avoid laborious compression for each downstream task, we train the supernet and obtain the compressed models directly on the pre-training task, making them applicable across different downstream tasks. However, it is extremely expensive to directly perform architecture search in a big supernet on the heavy pre-training task.
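The weight-sharing idea behind the supernet can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: each supernet layer holds one shared copy of every candidate operation, and each step samples a single path through them, so all sampled architectures train the same underlying weights. The operation bodies below are toy placeholders for the real multi-head attention, separable convolution, and feed-forward modules.

```python
import random

# Toy stand-ins for the search-space operations (names follow the paper's
# search space; the arithmetic here is purely illustrative).
def mha(x):      return [v * 0.9 for v in x]   # placeholder for multi-head attention
def sep_conv(x): return [v + 0.1 for v in x]   # placeholder for separable convolution
def ffn(x):      return [v * 2.0 for v in x]   # placeholder for a feed-forward network
def identity(x): return x

# One shared set of operations: every sampled path reuses these same objects,
# which is what "weight sharing" means in the supernet.
CANDIDATE_OPS = {"mha": mha, "sep_conv": sep_conv, "ffn": ffn, "id": identity}

def sample_path(num_layers, rng):
    """Sample one architecture: a single candidate operation per layer."""
    return [rng.choice(list(CANDIDATE_OPS)) for _ in range(num_layers)]

def forward(path, x):
    """Run the input through the sampled path of shared operations."""
    for op_name in path:
        x = CANDIDATE_OPS[op_name](x)
    return x

rng = random.Random(0)
path = sample_path(num_layers=4, rng=rng)
out = forward(path, [1.0, 2.0])
```

In an actual supernet, the loss of each sampled path would be backpropagated into the shared operation weights, so that every candidate architecture is (approximately) trained by the end of supernet training.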
To improve the search efficiency and accuracy, we employ several techniques during the search process, including block-wise search, search space pruning and performance approximation: (1) We adopt block-wise search (Li et al., 2020a) to divide the supernet into blocks, so that the size of the search space is reduced exponentially. To train each block, we leverage a pre-trained teacher model, divide the teacher model into blocks in the same way, and use the input and output hidden states of the corresponding teacher block as paired training data. (2) Although block-wise search greatly reduces the search space, the search cost of each block remains high due to the heavy burden of the pre-training task. We therefore propose progressive shrinking, which dynamically prunes the search space according to the validation loss during training. To ensure that architectures with different sizes and latencies are retained during pruning, we further divide all the architectures in each block into several bins according to their model sizes and perform progressive shrinking within each bin. (3) We obtain the compressed models under specific memory and latency constraints by assembling the architectures of every block using performance approximation, which reduces the cost of model selection.

We evaluate the models compressed by NAS-BERT on the GLUE benchmark (Wang et al., 2018). The results show that NAS-BERT can find lightweight models with various sizes from 5M to 60M parameters, with better accuracy than that achieved by previous work. Our contributions can be summarized as follows:

• We carefully design a search space that contains various architectures of different sizes, and apply NAS on the pre-training task to search for efficient lightweight models. This makes it possible to deliver adaptive model sizes under different memory or latency requirements and to apply the compressed models to different downstream tasks.

• We further apply block-wise search, progressive shrinking and performance approximation to reduce the huge search cost and improve the search accuracy.

• Experiments on the GLUE benchmark datasets demonstrate the effectiveness of NAS-BERT for model compression.
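The bin-based progressive shrinking described above can be sketched as follows. This is an illustrative sketch under assumed inputs, not the paper's actual procedure: each candidate is a (architecture id, model size, validation loss) tuple, candidates are grouped into equal-width size bins, and within each bin the worst-performing fraction is pruned each round, so every size range keeps survivors for later model selection.

```python
# Hypothetical sketch of one round of progressive shrinking with size bins.
# The bin count and keep ratio are illustrative, not the paper's hyperparameters.
def progressive_shrink(candidates, num_bins, keep_ratio):
    """candidates: list of (arch_id, size, val_loss) tuples. Returns survivors."""
    sizes = [size for _, size, _ in candidates]
    lo, hi = min(sizes), max(sizes)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all sizes equal

    # Group candidates into equal-width bins by model size.
    bins = [[] for _ in range(num_bins)]
    for cand in candidates:
        idx = min(int((cand[1] - lo) / width), num_bins - 1)
        bins[idx].append(cand)

    # Within each bin, keep the best candidates by validation loss,
    # so architectures of every size range survive the pruning.
    survivors = []
    for b in bins:
        b.sort(key=lambda c: c[2])               # lowest validation loss first
        keep = max(1, int(len(b) * keep_ratio))  # never empty a non-empty bin
        survivors.extend(b[:keep])
    return survivors
```

Repeating such a round as training proceeds shrinks the per-block search space while guaranteeing that small, medium and large architectures all remain available when the final models are assembled.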

2. RELATED WORK

BERT Model Compression Recently, compressing pre-trained language models has been studied extensively, and several techniques have been proposed, such as knowledge distillation, pruning, weight factorization and quantization. Existing works (Tsai et al., 2020; Sanh et al., 2019; Sun et al., 2019; Song et al., 2020; Jiao et al., 2019; Lan et al., 2019; Zafrir et al., 2019; Shen et al., 2020; Wang et al., 2019b; Chen et al., 2020) aim to compress the pre-trained model into a model of fixed size, and have achieved a good trade-off between small parameter size (usually no more than 66M parameters) and performance. However, these compressed models cannot be deployed on devices with different memory and latency constraints.

