DELIGHT: DEEP AND LIGHT-WEIGHT TRANSFORMER

Abstract

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each transformer block, using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline transformers with 2 to 3 times fewer parameters on average.

1. INTRODUCTION

Attention-based transformer networks (Vaswani et al., 2017) are widely used for sequence modeling tasks, including language modeling and machine translation. To improve performance, models are often scaled to be either wider, by increasing the dimension of hidden layers, or deeper, by stacking more transformer blocks. For example, T5 (Raffel et al., 2019) uses a feed-forward dimension of 65K and GPT-3 (Brown et al., 2020) uses 96 transformer blocks. However, such scaling increases the number of network parameters significantly (e.g., T5 and GPT-3 have 11 billion and 175 billion parameters, respectively) and complicates learning: these models either require very large training corpora (Raffel et al., 2019; Devlin et al., 2019; Brown et al., 2020) or careful regularization (Hinton et al., 2012; Wan et al., 2013; Merity et al., 2018a).

In this paper, we introduce a new parameter-efficient attention-based architecture that can be easily scaled to be both wide and deep. Our Deep and Light-weight Transformer architecture, DeLighT, extends the transformer architecture of Vaswani et al. (2017) and delivers similar or better performance with significantly fewer parameters and operations. At the heart of DeLighT is the DeLighT transformation, which uses the group linear transformations (GLTs) of Mehta et al. (2018) with an expand-reduce strategy to vary the width and depth of the DeLighT block efficiently. Since GLTs are local by nature, the DeLighT transformation uses feature shuffling, analogous to channel shuffling in convolutional networks (Zhang et al., 2018), to share information between different groups. Such wide and deep representations facilitate replacing the multi-head attention and feed-forward layers in transformers with single-headed attention and light-weight feed-forward layers, reducing total network parameters and operations.
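The two building blocks named above, group linear transformations and feature shuffling, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, weight layout, and toy shapes are our own assumptions.

```python
import numpy as np

def group_linear_transform(x, weights):
    """Group linear transformation (GLT): split the features into groups
    and apply an independent linear layer to each group (local mixing only)."""
    groups = len(weights)
    chunks = np.split(x, groups, axis=-1)            # one [.., d/g] chunk per group
    outs = [c @ w for c, w in zip(chunks, weights)]  # per-group linear layers
    return np.concatenate(outs, axis=-1)

def feature_shuffle(x, groups):
    """Interleave features across groups (analogous to channel shuffling in
    ShuffleNet) so that subsequent GLTs can mix information between groups."""
    *lead, d = x.shape
    return (x.reshape(*lead, groups, d // groups)
             .swapaxes(-1, -2)
             .reshape(*lead, d))

# Toy example: d = 8 features, 2 groups, each group with its own 4x4 weight.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
weights = [rng.standard_normal((4, 4)) for _ in range(2)]
y = feature_shuffle(group_linear_transform(x, weights), groups=2)
print(y.shape)  # (1, 8)
```

With g groups, each per-group weight is (d/g) x (d/g), so a GLT layer uses d^2/g parameters instead of the d^2 of a dense linear layer, which is why more groups yield wider representations at lower parameter cost.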
Importantly, unlike transformers, the DeLighT transformation decouples the depth and width from the input size, allowing us to allocate parameters more efficiently across blocks by using shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output. We demonstrate that DeLighT models achieve similar or better performance than transformer models with significantly fewer parameters and operations on two common sequence modeling tasks: (i) machine translation and (ii) language modeling. On the low-resource WMT'16 En-Ro machine translation dataset, DeLighT attains transformer performance using 2.8× fewer parameters. On the high-resource WMT'14 En-Fr dataset, DeLighT delivers better performance (+0.4 BLEU) with 1.8× fewer parameters than baseline transformers. Similarly, on language modeling, DeLighT matches the performance of Transformer-XL (Dai et al., 2019) with 1.5× fewer parameters on the WikiText-103 dataset. Our source code is open-source and available at: https://github.com/sacmehta/delight

2. RELATED WORK

Improving transformers: Several methods have been introduced to improve the transformer architecture. The first line of research addresses the challenge of computing self-attention on long input sequences (Child et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020). These methods can be combined with our architecture. The second line of research focuses on explaining multi-head attention (Raganato and Tiedemann, 2018; Brunner et al., 2020). These works show that increasing the number of transformer heads can lead to redundant representations (Voita et al., 2019a; Michel et al., 2019), and that using fixed attention heads with predefined patterns (Raganato et al., 2020) or synthetic attention matrices (Tay et al., 2020) improves performance. The third line of research focuses on improving transformers by learning better representations (Wu et al., 2019; 2020; So et al., 2019).
These works aim to improve the expressiveness of transformers using different transformations: for example, using convolutions (Wu et al., 2019; Gehring et al., 2017), gated linear units (Dauphin et al., 2017), or multi-branch feature extractors (So et al., 2019; Wu et al., 2020). Our work falls into this category. Unlike previous works, we show that it is possible to efficiently allocate parameters both at the block level, using the DeLighT transformation, and across blocks, using block-wise scaling.

Model scaling: Model scaling is a standard method to improve the performance of sequence models (Vaswani et al., 2017; Raffel et al., 2019; Lan et al., 2020; Devlin et al., 2019; Shoeybi et al., 2019; Tan and Le, 2019; Brown et al., 2020). Model dimensions are increased in width-wise scaling (Vaswani et al., 2017; Devlin et al., 2019), while more blocks (e.g., transformer blocks) are stacked in depth-wise scaling (Shoeybi et al., 2019; Brown et al., 2020; Wang et al., 2019). In both cases (and their combination), the parameters inside each block of the network are the same, which may lead to a sub-optimal solution. To further improve the performance of sequence models, this paper introduces block-wise scaling, which allows for variably-sized blocks and efficient allocation of parameters in the network. Our results show that (1) shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output deliver the best performance, and (2) models with block-wise scaling coupled with model scaling achieve better performance than model scaling alone. We note that convolutional neural networks (CNNs) also learn shallower and narrower representations near the input and deeper and wider representations near the output. Unlike CNNs (e.g., the ResNet of He et al. 2016), which perform a fixed number of operations at each convolutional layer, the proposed block-wise scaling uses a variable number of operations in each layer and block.
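As a concrete illustration of the shallow-to-deep trend behind block-wise scaling, the sketch below assigns each block a depth that grows linearly from the input side to the output side. The linear schedule, function name, and example numbers are our assumptions for illustration; the exact scaling function used by DeLighT is defined later in the paper.

```python
def blockwise_depths(num_blocks, n_min, n_max):
    """Assign block b (0-indexed) a depth that increases linearly from
    n_min at the input-side block to n_max at the output-side block.
    Illustrative only: the linear schedule is an assumption, not the
    paper's exact scaling function."""
    if num_blocks == 1:
        return [n_max]
    return [round(n_min + (n_max - n_min) * b / (num_blocks - 1))
            for b in range(num_blocks)]

# Six blocks, depths ranging from 4 near the input to 8 near the output.
print(blockwise_depths(6, 4, 8))  # [4, 5, 6, 6, 7, 8]
```

In contrast, uniform model scaling corresponds to `blockwise_depths(6, n, n)`, i.e., the same depth in every block; block-wise scaling spends those parameters unevenly across the network instead.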
Improving sequence models: There is also significant recent work on other related methods for improving sequence models, including (1) improving accuracy using better token-level representations, for example with BPE (Sennrich et al., 2016), adaptive inputs (Baevski and Auli, 2019) and outputs (Grave et al., 2017a), and DeFINE (Mehta et al., 2020), and (2) improving efficiency, for example using compression (Chen et al., 2018; Sun et al., 2020), pruning (Han et al., 2016; Voita et al., 2019b), and distillation (Hinton et al., 2015; Sanh et al., 2019). The closest to our work is the DeFINE transformation, which also learns representations using an expand-reduce strategy. The key difference between the DeFINE transformation (Figure 1c) and the DeLighT transformation (Figure 1d) is that the DeLighT transformation more efficiently allocates parameters within its expansion and reduction layers. Unlike DeFINE, which uses fewer groups in its group linear transformations to learn wider representations, the DeLighT transformation uses more groups to learn wider representations with fewer parameters. The DeLighT transformation achieves performance comparable to the DeFINE transformation but with significantly fewer parameters.

3. DELIGHT: DEEP AND LIGHT-WEIGHT TRANSFORMER

A standard transformer block (Figure 1a) comprises multi-head attention, which uses a query-key-value decomposition to model relationships between sequence tokens, and a feed-forward network (FFN) to learn wider representations. Multi-head attention obtains the query Q, key K, and value V by applying three projections to the input, each consisting of h linear layers (or heads) that map the d_m-dimensional input into a d_h-dimensional space, where d_h = d_m/h is the head dimension. The FFN consists of two linear layers, where the first expands the dimensions from d_m to d_f and the second reduces them from d_f back to d_m.
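To make the parameter budget of this standard block concrete, the sketch below counts its weights. The output projection applied after concatenating the heads is standard in Vaswani et al. (2017) but not spelled out in the description above, and biases and layer norms are omitted for simplicity.

```python
def transformer_block_params(d_m, h, d_f):
    """Rough weight count for a standard transformer block
    (biases and layer-norm parameters omitted)."""
    d_h = d_m // h                 # per-head dimension d_h = d_m / h
    qkv = 3 * d_m * (h * d_h)      # Q, K, V projections: h heads of size d_m x d_h each
    out = d_m * d_m                # output projection after concatenating the heads
    ffn = d_m * d_f + d_f * d_m    # expand d_m -> d_f, then reduce d_f -> d_m
    return qkv + out + ffn

# Base transformer of Vaswani et al. (2017): d_m = 512, h = 8, d_f = 2048.
print(transformer_block_params(512, 8, 2048))  # 3145728
```

With d_f = 4 * d_m, the FFN accounts for 8 * d_m^2 of the roughly 12 * d_m^2 weights per block, which is why DeLighT targets both the multi-head attention and the FFN for lightening.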

