LENGTH-ADAPTIVE TRANSFORMER: TRAIN ONCE WITH LENGTH DROP, USE ANYTIME WITH SEARCH

Abstract

Although transformers have achieved impressive accuracies in various tasks in natural language processing, they often come with a prohibitive computational cost that prevents their use in scenarios with limited computational resources for inference. This need for computational efficiency at inference has been addressed, for instance, by PoWER-BERT (Goyal et al., 2020), which gradually decreases the length of a sequence as it is passed through layers. Such approaches, however, often assume that the target computational complexity is known in advance at training time, implying that a separate model must be trained for each inference scenario with its distinct computational budget. In this paper, we extend PoWER-BERT to address this inefficiency and redundancy. The proposed extension enables us to train a large-scale transformer, called Length-Adaptive Transformer, once and use it for various inference scenarios without re-training. To do so, we train a transformer with LengthDrop, a structured variant of dropout, which stochastically determines the length of the sequence at each layer. We then use a multi-objective evolutionary search to find a length configuration that maximizes accuracy and minimizes computational complexity under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification to token-level tasks, such as span-based question answering, by introducing the idea of Drop-and-Restore. With Drop-and-Restore, word vectors are dropped temporarily in intermediate layers and restored at the last layer when necessary. We empirically verify the utility of the proposed approach by demonstrating a superior accuracy-efficiency trade-off under various setups, including SQuAD 1.1, MNLI-m, and SST-2. Upon publication, the code to reproduce our work will be open-sourced.

1. INTRODUCTION

Pretrained language models (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019) have achieved notable improvements in various natural language processing (NLP) tasks. Most of them rely on transformers (Vaswani et al., 2017), and the number of model parameters ranges from hundreds of millions to billions (Shoeybi et al., 2019; Raffel et al., 2019; Kaplan et al., 2020; Brown et al., 2020). Despite their high accuracy, the excessive computational overhead of these models during inference, both in time and memory, has hindered their use in real applications. This level of computation has further raised concerns over energy consumption (Schwartz et al., 2019; Strubell et al., 2019).

Recent studies have attempted to address these concerns about large-scale transformers' computational and energy efficiency (see §6 for a more extensive discussion). Among these, we focus on PoWER-BERT (Goyal et al., 2020), which progressively reduces the sequence length by eliminating word vectors based on attention values as the sequence passes through the layers. PoWER-BERT establishes a superior accuracy-time trade-off over earlier approaches (Sanh et al., 2019; Sun et al., 2019; Michel et al., 2019). It, however, requires us to train a separate model for each efficiency constraint. In this paper, we thus develop a framework based on PoWER-BERT with which we can train a single model that can be adapted at inference time to meet any given efficiency target.

To train a transformer that copes with a diverse set of computational budgets at inference time, we propose to train it while reducing the sequence length by a random proportion at each layer. We refer to this procedure as LengthDrop, which was motivated by nested dropout (Rippel et al., 2014). From a model trained this way, we can extract sub-models with shared weights under any length configuration, without extra post-processing or additional fine-tuning.
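The LengthDrop procedure described above can be sketched in a few lines. The sketch below is our own minimal reading, not the authors' implementation: we assume that at each layer the sequence length is reduced by a proportion drawn uniformly from [0, p], where p is a LengthDrop hyperparameter.

```python
import random

def sample_length_config(n_layers, seq_len, p=0.2, rng=random):
    """Sample a per-layer length configuration for one training step.

    At each layer, the surviving sequence length is reduced by a
    random proportion in [0, p] (a hedged reading of LengthDrop).
    """
    lengths = []
    length = seq_len
    for _ in range(n_layers):
        # keep at least one word vector; floor the sampled length
        length = max(1, int(length * (1.0 - p * rng.random())))
        lengths.append(length)
    return lengths
```

Because a fresh configuration is sampled at every training step, every sub-model defined by a monotonically shrinking length configuration is exercised during training, which is what allows any such configuration to be used directly at inference time.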
Once a transformer is trained with the proposed LengthDrop, we search for the length configuration that maximizes accuracy under a given computational budget. Finding an optimal length configuration for an inference-time budget is not trivial, yet it is essential for deploying these large-scale transformers in practice. Because this search is combinatorial and has multiple objectives (accuracy and efficiency), we use an evolutionary search algorithm, which further allows us to obtain the full Pareto frontier of each model's accuracy-efficiency trade-off, i.e., a sequence of length configurations with varying efficiency profiles.

PoWER-BERT, which forms the foundation of the proposed two-stage procedure, is only applicable to sequence-level classification, because by design it eliminates some of the word vectors at each layer. In other words, it cannot be used for token-level tasks such as span-based question answering (Rajpurkar et al., 2016), because these tasks require hidden representations of the entire input sequence at the final layer. We thus propose to extend PoWER-BERT with a novel Drop-and-Restore process (§3.3), which eliminates this inherent limitation. Word vectors are dropped and set aside, rather than eliminated, in intermediate layers to retain the computational savings of the original PoWER-BERT. Unlike in the original PoWER-BERT, these set-aside vectors are then restored at the final hidden layer and provided as input to a subsequent task-specific layer. The main contributions of this work are two-fold.
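The multi-objective search can be sketched as follows. This is a hedged, self-contained toy rather than the paper's implementation: `evaluate` is assumed to return an (accuracy, cost) pair for a length configuration, and mutation simply perturbs one layer's length while keeping lengths non-increasing across layers.

```python
import random

def pareto_front(population, evaluate):
    """Keep the configurations not dominated in (accuracy, cost)."""
    scored = [(cfg, *evaluate(cfg)) for cfg in population]
    front = []
    for cfg, acc, cost in scored:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in scored
        )
        if not dominated:
            front.append((cfg, acc, cost))
    return front

def mutate(cfg, seq_len, rng=random):
    """Perturb one layer's length, then re-impose monotonicity."""
    cfg = list(cfg)
    cfg[rng.randrange(len(cfg))] = rng.randint(1, seq_len)
    for j in range(1, len(cfg)):          # lengths must not increase
        cfg[j] = min(cfg[j], cfg[j - 1])
    return cfg

def evolutionary_search(seq_len, n_layers, evaluate,
                        iters=50, pop=16, rng=random):
    population = [mutate([seq_len] * n_layers, seq_len, rng)
                  for _ in range(pop)]
    for _ in range(iters):
        # survivors are the current Pareto frontier; children mutate them
        parents = [cfg for cfg, _, _ in pareto_front(population, evaluate)]
        children = [mutate(rng.choice(parents), seq_len, rng)
                    for _ in range(pop)]
        population = parents + children
    return pareto_front(population, evaluate)
```

The returned frontier directly yields the sequence of length configurations with varying efficiency profiles mentioned above: picking the highest-accuracy configuration whose cost fits a given budget adapts the trained model to that budget without re-training.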
First, we introduce LengthDrop, a structured variant of dropout for training a single Length-Adaptive Transformer from which we can automatically derive, via evolutionary search at inference time, multiple sub-models with different length configurations, without any re-training. Second, we design the Drop-and-Restore process, which lifts PoWER-BERT's restriction to sequence-level classification and makes it applicable to a wider range of NLP tasks, such as span-based question answering. We empirically verify that Length-Adaptive Transformer works well using variants of BERT on a diverse set of NLP tasks, including SQuAD 1.1 (Rajpurkar et al., 2016) and two sequence-level classification tasks from the GLUE benchmark (Wang et al., 2018). Our experiments reveal that the proposed approach grants us fine-grained control of computational efficiency and a superior accuracy-efficiency trade-off at inference time compared to existing approaches.
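The Drop-and-Restore process can be illustrated on a single sequence. The sketch below is ours and heavily simplified: `score_fn` stands in for the attention-based significance scores used by PoWER-BERT, each element of `layers` is an arbitrary function over the surviving vectors, and set-aside vectors are restored unchanged at their original positions at the end.

```python
def forward_with_drop_and_restore(hidden, lengths, layers, score_fn):
    """Run `layers` while shrinking the sequence to `lengths`,
    then restore set-aside word vectors at their original positions."""
    kept = list(enumerate(hidden))  # (original position, vector)
    set_aside = []                  # dropped vectors, restored at the end
    for layer, length in zip(layers, lengths):
        # keep the `length` highest-scored word vectors; set the rest aside
        kept.sort(key=lambda pv: score_fn(pv[1]), reverse=True)
        set_aside.extend(kept[length:])
        kept = kept[:length]
        # the layer only processes the surviving (shorter) sequence
        positions = [pos for pos, _ in kept]
        outputs = layer([vec for _, vec in kept])
        kept = list(zip(positions, outputs))
    # restore: merge survivors and set-aside vectors into one full sequence
    full = dict(kept + set_aside)
    return [full[pos] for pos in range(len(hidden))]
```

The computational saving comes from the shrinking loop, exactly as in PoWER-BERT; the difference is only the final merge, which hands a full-length hidden sequence to a token-level task head.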

2. BACKGROUND: TRANSFORMERS AND POWER-BERT

Before we describe our main approach, we review its building blocks in this section: transformers, the standard backbone of modern natural language processing, and PoWER-BERT, which was recently proposed as an effective way to train a large-scale but highly efficient transformer for sequence-level classification.

2.1. TRANSFORMERS AND BERT

A transformer is a neural network designed to work with variable-length sequence input, implemented as a stack of self-attention and fully-connected layers (Vaswani et al., 2017). Here, we give a brief overview of the transformer, which is the basic building block of the proposed approach. Each token x_t in a sequence of tokens x = (x_1, ..., x_N), representing the input text, is first turned into a continuous vector h^0_t ∈ R^H, which is the sum of the token and position embedding vectors. This sequence is fed into the first transformer layer, which returns another sequence of the same length, h^1 ∈ R^{N×H}. We repeat this procedure L times, for a transformer with L layers, to obtain h^L = (h^L_1, ..., h^L_N). We refer to each vector in the hidden sequence at each layer as a word vector, to emphasize that there exists a correspondence between each such vector and one of the input words.
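The pipeline just described, embedding followed by L length-preserving layers, can be written schematically. The function below is a notational sketch only: the internals of each layer (self-attention and fully-connected sublayers) are abstracted into opaque functions mapping an N×H sequence to an N×H sequence.

```python
def transformer_hidden_states(token_ids, token_emb, pos_emb, layers):
    """Schematic forward pass matching the notation above.

    token_emb: rows are token embedding vectors in R^H
    pos_emb:   rows are position embedding vectors in R^H
    layers:    L functions, each mapping an (N, H) sequence to (N, H)
    Returns h^L, the hidden sequence after the last layer.
    """
    # h^0_t is the sum of the token and position embedding vectors
    h = [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]
    for layer in layers:
        h = layer(h)  # each layer preserves the sequence length N
    return h
```

Note that every layer returns a sequence of the same length N; it is exactly this invariant that PoWER-BERT relaxes by shrinking the sequence as it passes through the layers.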

