DELIGHT: DEEP AND LIGHT-WEIGHT TRANSFORMER

Abstract

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and lightweight transformation and (2) across blocks using block-wise scaling, that allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.

1. INTRODUCTION

Attention-based transformer networks (Vaswani et al., 2017) are widely used for sequence modeling tasks, including language modeling and machine translation. To improve performance, models are often scaled to be either wider, by increasing the dimension of hidden layers, or deeper, by stacking more transformer blocks. For example, T5 (Raffel et al., 2019) uses a dimension of 65K and GPT-3 (Brown et al., 2020) uses 96 transformer blocks. However, such scaling increases the number of network parameters significantly (e.g., T5 and GPT-3 have 11 billion and 175 billion parameters, respectively), and complicates learning, i.e., these models either require very large training corpora (Raffel et al., 2019; Devlin et al., 2019; Brown et al., 2020) or careful regularization (Hinton et al., 2012; Wan et al., 2013; Merity et al., 2018a) . In this paper, we introduce a new parameter-efficient attention-based architecture that can be easily scaled to be both wide and deep. Our Deep and Light-weight Transformer architecture, DeLighT, extends the transformer architecture of Vaswani et al. (2017) and delivers similar or better performance with significantly fewer parameters and operations. At the heart of DeLighT is the DeLighT transformation that uses the group linear transformations (GLTs) of Mehta et al. (2018) with an expand-reduce strategy for varying the width and depth of the DeLighT block efficiently. Since GLTs are local by nature, the DeLighT transformation uses feature shuffling, which is analogous to channel shuffling in convolutional networks (Zhang et al., 2018) , to share information between different groups. Such wide and deep representations facilitate replacing the multi-head attention and feed-forward layers in transformers with single headed attention and light-weight feed-forward layers, reducing total network parameters and operations. Importantly, unlike transformers, the DeLighT transformation decouples the depth and width from the input size, allowing us to allocate parameters more efficiently across blocks by using shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output. We demonstrate that DeLighT models achieve similar or better performance than transformer models with significantly fewer parameters and operations, on two common sequence modeling tasks, (i) machine translation and (ii) language modeling. On the low resource WMT'16 En-Ro machine translation dataset, DeLighT attains transformer performance using 2.8× fewer parameters. On the high resource WMT'14 En-Fr dataset, DeLighT delivers better performance (+0.4 BLEU score) with 1.8× fewer parameters than baseline transformers. Similarly, on language modeling, DeLighT matches the performance of Transformer-XL (Dai et al., 2019) with 1.5× fewer parameters on the WikiText-103 dataset. Our source code is open-source and is available at: https://github.com/ sacmehta/delight 2 RELATED WORK Improving transformers: Several methods have been introduced to improve the transformer architecture. The first line of research addresses the challenge of computing self attention on long input sequences (Child et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020) . These methods can be combined with our architecture. The second line of research focuses on explaining multi-head attention (Raganato and Tiedemann, 2018; Brunner et al., 2020) . They show that increasing the number of transformer heads can lead to redundant representations (Voita et al., 2019a; Michel et al., 2019) and using fixed attention heads with predefined patterns (Raganato et al., 2020) or synthetic attention matrices (Tay et al., 2020) improves performance. The third line of research focuses on improving transformers by learning better representations (Wu et al., 2019; 2020; So et al., 2019) . These works aim to improve the expressiveness of transformers using different transformations -for example, using convolutions (Wu et al., 2019; Gehring et al., 2017) , gated linear units (Dauphin et al., 2017) , or multi-branch feature extractors (So et al., 2019; Wu et al., 2020) . Our work falls into this category. Unlike previous works, we show that it is possible to efficiently allocate parameters both at the block-level using the DeLighT transformation and across blocks using block-wise scaling. Model scaling: Model scaling is a standard method to improve the performance of sequence models (Vaswani et al., 2017; Raffel et al., 2019; Lan et al., 2020; Devlin et al., 2019; Shoeybi et al., 2019; Tan and Le, 2019; Brown et al., 2020) . Model dimensions are increased in width-wise scaling (Vaswani et al., 2017; Devlin et al., 2019) while more blocks (e.g., Transformer blocks) are stacked in depth-wise scaling (Shoeybi et al., 2019; Brown et al., 2020; Wang et al., 2019) . In both cases (and their combination), parameters inside each block of the network are the same, which may lead to a sub-optimal solution. To further improve the performance of sequence models, this paper introduces block-wise scaling that allows for variably-sized blocks and efficient allocation of parameters in the network. Our results show that (1) shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output deliver the best performance, and (2) models with block-wise scaling coupled with model scaling achieve better performance compared to model scaling alone. We note that convolutional neural networks (CNNs) also learn shallower and narrower representations near the input and deeper and wider representations near the output. Unlike CNNs (e.g., ResNet of He et al. 2016 ) that perform a fixed number of operations at each convolutional layer, the proposed block-wise scaling uses a variable number of operations in each layer and block. Improving sequence models: There is also significant recent work on other related methods for improving sequence models, including (1) improving accuracy using better token-level representations -for example, using BPE (Sennrich et al., 2016) , adaptive inputs (Baevski and Auli, 2019) and outputs (Grave et al., 2017a) , and DeFINE (Mehta et al., 2020) , and (2) improving efficiency -for example, using compression (Chen et al., 2018; Sun et al., 2020) , pruning (Han et al., 2016; Voita et al., 2019b) , and distillation (Hinton et al., 2015; Sanh et al., 2019) . The closest to our work is the DeFINE transformation, which also learns representations using an expand-reduce strategy. The key difference between the DeFINE transformation (Figure 1c ) and the DeLighT transformation (Figure 1d ) is that the DeLighT transformation more efficiently allocates parameters within expansion and reduction layers. Unlike DeFINE, which uses fewer groups in group linear transformations to learn wider representations, DeLighT transformation uses more groups to learn wider representations with fewer parameters. The DeLighT transformation achieves comparable performance to the DeFINE transformation but with significantly fewer parameters.

3. DELIGHT: DEEP AND LIGHT-WEIGHT TRANSFORMER

A standard transformer block (Figure 1a ) comprises of multi-head attention that uses a query-keyvalue decomposition to model relationships between sequence tokens, and a feed forward network (FFN) to learn wider representations. Multi-head attention obtains query Q, key K, and value V by applying three projections to the input, each consisting of h linear layers (or heads) that map the d m -dimensional input into a d h -dimensional space, where d h = d m /h is the head dimension. The FFN consists of two linear layers, where the first expands the dimensions from d m to d f and the second reduces the dimensions from d f to d m . The depth of a transformer block is 4, consisting of (1) three parallel branches for queries, keys, and values, (2) a fusion layer that combines the output of multiple heads, and (3) two sequential linear layers in the FFN. In general, transformer-based networks sequentially stacks transformer blocks to increase network capacity and depth. This paper extends the transformer architecture and introduces a deep and light-weight transformer, DeLighT. Our model uses a deep and light-weight expand-reduce transformation, DeLighT transformation (Section 3.1), that enables learning wider representations efficiently. It also enables replacing multi-head attention and feed forward network (FFN) layers with single-head attention and a light-weight FFN (Section 3.2). DeLighT transformation decouples attention dimensions from the depth and width, allowing us to learn representations efficiently using block-wise scaling instead of uniform stacking of transformer blocks (Section 3.3).

3.1. DELIGHT TRANSFORMATION

DeLighT transformation maps a d m dimensional input vector into a high dimensional space (expansion) and then reduces it down to a d o dimensional output vector (reduction) using N layers of the group transformations of Mehta et al. (2018) , as shown in Figure 1d . During these expansion and reduction phases, DeLighT transformation uses group linear transformations (GLTs) because they learn local representations by deriving the output from a specific part of the input and are more efficient than linear transformations. To learn global representations, the DeLighT transformation shares information between different groups in the group linear transformation using feature shuffling, analogous to channel shuffling in convolutional networks (Zhang et al., 2018) . A standard approach to increase the expressivity and capacity of transformers is to increase the input dimensions, d m . However, increasing d m linearly also increases the number of operations in multihead attention (O(n 2 d m ), where n is the sequence length) in a standard transformer block (Figure 1a ). In contrast, to increase the expressivity and capacity of the DeLighT block, we increase the depth and width of its intermediate DeLighT transformations using expansion and reduction phases. This enables us to use smaller dimensions for computing attention, requiring fewer operations. (5) maximum groups g max in a GLT. In the expansion phase, the DeLighT transformation projects the d m -dimensional input to a high-dimensional space, d max = w m d m , linearly using N 2 layers. In the reduction phase, the DeLighT transformation projects the d max -dimensional vector to a d o -dimensional space using the remaining N -N 2 GLT layers. Mathematically, we define the output Y at each GLT layer l as: Y l = F X, W l , b l , g l , l = 1 F H X, Y l-1 , W l , b l , g l , Otherwise where W l = W l 1 , • • • , W l g l and b l = b l 1 , • • • , b l g l are the learnable weights and biases of group linear transformation F with g l groups at the l-th layer. Briefly, the F function takes the input X or H X, Y l-1 and splits into g l non-overlapping groups such that X = X 1 , • • • , X g l . The function F then linearly transforms each X i with weights W l i and bias b l i to produce output Y l i = X i W l i + b l i . The outputs of each group Y l i are then concatenated to produce the output Y l . The function H first shuffles the output of each group in Y l-1 and then combines it with the input X using the input mixer connection of Mehta et al. (2020) to avoid vanishing gradient problems. Figure 2 visualizes the expansion phase in the DeLighT transformation with group linear transformation, feature shuffling, and the input mixer connection. The number of groups at the l-th GLT in DeLighT transformation are computed as: g l = min(2 l-1 , g max ), 1 ≤ l ≤ N/2 g N -l , Otherwise In our experiments, we use g max = dm 32 so that each group has at least 32 input elements. We then model contextual relationships between these n tokens using scaled dot-product attention (Eq. 3). To enable the use of residual connections (He et al., 2016) , the d o -dimensional outputs of this attention operation are linearly projected into a d m -dimensional space.

3.2. DELIGHT BLOCK

Attention(K, Q, V) = softmax QK T √ d o V We hypothesize that the ability of DeLighT to learn wider representations allows us to replace multi-head attention with single-head attention. The computational costs for computing attention in Block depth: The DeLighT block stacks (1) a DeLighT transformation with N GLTs, (2) three parallel linear layers for key, query, and value, (3) a projection layer, and (4) two linear layers of a light-weight FFN. Thus, the depth of DeLighT block is N + 4. Compared to the standard transformer block (depth is 4), DeLighT block is deeper.

Block-wise Uniform

Input Output Input Output B blocks N 0 = N min N B-1 = N max (see Eq. 4) N B-1 =N N 0 =N (a)

3.3. BLOCK-WISE SCALING

Standard methods for improving the performance of sequence models include increasing the model dimensions (width scaling), stacking more blocks (depth scaling), or both. However, such scaling is not very effective on small datasets. For example, when a Transformer-Base (d m = 512) network is replaced with Transformer-Large (d m = 1024) on the WMT'16 En-Ro corpus, the number of parameters increases by approximately 4× while the performance does not change appreciably (BLEU: 34.28 vs. 34.35). We hypothesize that this happens because scaling model width and depth allocates parameters uniformly across blocks, which may lead to learning redundant parameters. To create deep and wide networks, we extend model scaling to the block level (see Figure 3 ). Scaling the DeLighT block: The DeLighT block learns deep and wide representations using the DeLighT transformation, whose depth and width are controlled by two configuration parameters: the number of GLT layers N and the width multiplier w m , respectively (Figure 3a ). These configuration parameters allow us to increase the number of learnable parameters inside the DeLighT block independently of the input d m and output d o dimensions. Such calibration is not possible with the standard transformer block because their expressiveness and capacity are a function of the input (input dimension = number of heads × head dimension). Here, we introduce block-wise scaling that creates a network with variably-sized DeLighT blocks, allocating shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output. To do so, we introduce two network-wide configuration parameters: minimum N min and maximum N max number of GLTs in a DeLighT transformation. For the b-th DeLighT block, we compute the number of GLTs N b and the width multiplier w b m in a DeLighT transformation using linear scaling (Eq. 4). With this scaling, each DeLighT block has a different depth and width (Figure 3a ). N b = N min + (N max -N min ) b B -1 , w b m = w m + (N max -N min ) b N min (B -1) , 0 ≤ b ≤ B -1 (4) Here, B denotes the number of DeLighT blocks in the network. We add superscript b to number of GLT layers N and width multiplier w m to indicate that these parameters are for the b-th block. Network depth: The depth of transformer block is fixed, i.e., 4. Therefore, previous works (Raffel et al., 2019; Brown et al., 2020; Wang et al., 2019) have associated the depth of transformer-based networks with the number of transformer blocks. In DeLighT, we present a different perspective to learn deeper representations, wherein each block is variably-sized. To compute the network depth, we use the standard definition across different domains, including computer vision (e.g., ResNet of He et al. 2016 ) and theoretical machine learning (Telgarsky, 2016) . These works measures network depth as the number of sequential learnable layers (e.g., convolution, linear, or group linear). Similarly, the depth of DeLighT and transformer networks with B blocks is B-1 b=0 (N b + 4 ) and 4B, respectively.

4. EXPERIMENTAL RESULTS

We evaluate the performance of DeLighT on two standard sequence modeling tasks: (1) machine translation (Section 4.1) and (2) language modeling (Section 4.2).

4.1. MACHINE TRANSLATION

Datasets and evaluation: We benchmark DeLighT models on four datasets: (1) IWSLT'14 German-English (De-En), (2) WMT'16 English-Romanian (En-Ro), (3) WMT'14 English-German (WMT'14 En-De), and (4) WMT'14 English-French (WMT'14 En-Fr). For simplicity, we set B = N max . We use a learnable look-up table that maps every token in the vocabulary to a 128-dimensional vector. We implement our models using Fairseq (Ott et al., 2019) and use their provided scripts for data pre-processing, training, and evaluation. Training: For IWSLT'14 De-En models, we follow the setup of Wu et al. (2019) and train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU. For WMT'16 En-Ro, we follow the training setup of Ghazvininejad et al. (2019) and train models for 100K iterations on 16 NVIDIA Tesla V100 GPUs with an effective batch size of 64K tokens. For WMT'14 En-De and WMT'14 En-Fr, we follow the training set-up of Wu et al. (2019) and train our models on 16 V100 GPUs for 30K and 50K iterations, respectively. We use Adam (Kingma and Ba, 2015) to minimize cross entropy loss with a label smoothing value of 0.1 during training. For a fair comparison, we trained baseline transformer models using the same training set-up. 

4.1.1. RESULTS

Comparison with baseline transformers: Table 1 compares the performance of DeLighT with the baseline transformers of Vaswani et al. (2017) on different corpora. DeLighT delivers better performance with fewer parameters than transformers, across different corpora. Specifically, on low-resource (WMT'16 En-Ro) and high resource (WMT'14 En-De & WMT'14 En-Fr) corpora, DeLighT delivers similar or better performance with 2.8× and 1.8× fewer parameters, respectively. When the number of parameters are increased, DeLighT outperforms transformers. For example, on WMT'14 En-Fr dataset, DeLighT is 3.7× deeper than transformers and improves its BLEU score by 1.3 points yet with 13 million fewer parameters and 3 billion fewer operations (see Table 2 ). Particularly interesting are the performance comparisons of DeLighT with the baseline transformers of Vaswani et al. (2017) and its neural search variant, i.e., Evolved Transformer of So et al. (2019) , at two different parametric settings on WMT'14 En-De corpora in Figure 4 . For small models (< 10 M parameters), DeLighT models delivers better performance and for attaining the same performance as these models, DeLighT models requires fewer parameters. Comparison with state-of-the-art methods: Most state-of-the-art methods have evaluated the performance on WMT'14 En-De while some have also evaluated on IWSLT'14 De-En. Table 3 compares the performance of DeLighT with state-of-the-art methods on these two corpora. DeLighT delivers similar or better performance than existing methods. It is important to note that existing methods have improved baseline transformers with different design choices -for example, the asymmetric encoder-decoder structure (Wang et al., 2019) and neural architecture search (So et al., 2019) . We believe that DeLighT, in the future, would also benefit from such design choices. Scaling up DeLighT models: Figure 5 shows the performance of DeLighT models improves with increase in network parameters; suggesting their ability to learn representations across different corpora, including low-resource. 

4.2. LANGUAGE MODELING

Datasets and evaluation: We evaluate on the WikiText-103 dataset (Merity et al., 2017) that has 103M/217K/245K tokens for training, validation, and testing. It has a word-level vocabulary of about 260K tokens. Following recent works (Baevski and Auli, 2019; Dai et al., 2019) , we report performance in terms of perplexity (lower is better) on the test set. Architecture: We use the transformer-based decoder architecture of Baevski and Auli (2019) with B DeLighT blocks. We use w m =2, N min =4, and N max =12. We scale d m using values {384, 512, 784, 1024} for increasing network parameters. For simplicity, we set B = N max . Following standard practice, we use adaptive input (Baevski and Auli, 2019) as a look-up table and adaptive output (Grave et al., 2017a) as the classification layer with one head (head dimension is 128) and two tails (tail dimensions are 64 and 32). We also share weights between the input and the output layers. Training: We follow the training setup of Baevski and Auli (2019) , except that we train our models on 8 NVIDIA Tesla V100 GPUs for 100K iterations with a context length of 512 and an effective batch size of 64K tokens. We use Adam during training and use a context length of 480 during test. Results: Table 4b compares the performance of DeLighT with previous methods on WikiText-103. Table 4a plots the variation of perplexity with number of parameters for DeLighT and Transformer-XL (Dai et al., 2019) -which outperforms other transformer-based implementations (e.g., Baevski and Auli 2019) . Both tables show that DeLighT delivers better performance than state-of-the-art methods (including Transformer-XL) and it does this using a smaller context length and significantly fewer parameters, suggesting that the DeLighT transformation helps learn strong contextual relationships.

5. ANALYSIS AND DISCUSSIONS ON COMPUTATIONAL EFFICIENCY

Training time and memory consumption: Table 5 compares the training time and memory consumption of DeLighT with baseline transformers. For an apples-to-apples comparisons, we implemented the Transformer unit without NVIDIA's dedicated CUDA kernel, and trained both transformer and DeLighT full-precision networks for 30K iterations on 16 NVIDIA V100 GPUs. The transformer and DeLighT models took about 37 and 23 hours for training and consumed about 12.5 GB and 14.5 GB of GPU memory, respectively (R1 vs. R2). When we enabled the dedicated CUDA kernel provided by APEX libraryfoot_2 for multi-head attention in Transformers, the training time of the transformer model reduced from 37 to 16 hours while we did not observe any significant change in memory consumption. Motivated by this observation, we implemented dedicated CUDA kernels for grouping and ungrouping functions in GLTs (see Appendix E). With these changes, training time and GPU memory consumption of DeLighT reduced by about 4 hours and 3 GB, respectively. We emphasize that grouping, linear transformation, feature shuffling, and ungrouping, can be implemented efficiently using a single CUDA kernel. In future, we expect a dedicated CUDA kernel for these operations would further reduce the memory consumption as well as training/inference time. Regularization: Table 6 shows that DeLighT delivers similar performance to baseline transformers, but with fewer parameters and less regularization. This suggests that learning representations with better transformation functions alleviates the need for dropout.

6. CONCLUSION

This paper introduces a deep and light-weight transformer architecture, DeLighT, that efficiently allocates parameters both within the DeLighT block and across DeLighT blocks. Compared to state-of-the-art transformer models, DeLighT models are (1) deep and light-weight and (2) deliver similar or better performance. In the future, we plan to apply DeLighT to other tasks, including language model pre-training, question answering, and language generation.

A DELI G HT ARCHITECTURES FOR LANGUAGE MODELING AND MACHINE TRANSLATION

DeLighT architectures for language modeling and machine translation are shown in Figure 6 . For language modeling, we follow the architecture in Baevski and Auli (2019) while for machine translation, we follow the architecture in Vaswani et al. (2017) . Language modeling: Figure 6a shows the architecture for language modeling. The architecture stacks B DeLighT blocks, the configuration of each block is determined using block-wise scaling. Each block has three sub-layers. The first layer is a DeLighT transformation that learns representations in high-dimensional space. The second layer is a single-head attention that encodes contextual relationships. The third layer is a position-wise light-weight feed-forward network. Similar to Vaswani et al. (2017) , we employ a residual connections (He et al., 2016) . Similar to previous works (Baevski and Auli, 2019; Dai et al., 2019) , we use tied adaptive input (Baevski and Auli, 2019) and adaptive softmax (Grave et al., 2017a) to map tokens to vectors and vectors to tokens, respectively. Machine translation: Figure 6b shows the architecture for machine translation. The encoder stacks B DeLighT blocks, the configuration of each block is determined using block-wise scaling. Similar to language modeling, each encoder block has three sub-layers. The first layer is a DeLighT transformation that learns representations in high-dimensional space. The second layer is a single-head attention that encodes contextual relationships. The third layer is a position-wise light-weight feed-forward network. Similar to Vaswani et al. (2017) , we employ a residual connections (He et al., 2016) . We use learnable look-up to vectors. Similar to the encoder, the decoder also stacks B blocks. Decoder blocks are identical to encoder blocks, except that they have an additional source-target single-head attention unit before the light-weight FFN. Keys and values in source-target single-head attention unit are projections over the encoder output. We use standard learnable look-up table to map tokens to vectors and linear classification layer to map vectors to tokens.

B GROUP LINEAR TRANSFORMATION WITH INPUT-MIXER CONNECTION

Group linear transformation (GLT) F splits a dm-dimensional input X into g non-overlapping groups such that X = Concat(X1, • • • , Xg), where Xi is a dm g -dimensional vector. Xi's are simultaneously transformed using g linear transforms Wi ∈ R Figure 7a shows an example of GLT in the expansion phase of DeLighT transformation. For illustrative purposes, we have used the same dimensions in this example. Recall that as we go deeper in the expansion phase, the number of groups increases. In this example, the first layer has one group, the second layer has two groups and the third layer has four groups. GLTs learns group-specific representations and are local. To allow GLT to learn global representations, we use feature shuffle. An example of GLT with feature shuffle is shown in Figure 7b . Furthermore, training deep neural networks by merely stacking linear or group linear (with or without feature shuffle) is challenging because of vanishing gradient problem. Residual connections introduced by He et al. (2016) mitigates this problem and helps train deep neural networks. However, such connections cannot be employed when input and output dimensions are not the same (e.g., during the expansion and reduction phases in DeLighT transformation). To stabilize the training and learn deeper representations, we use input-mixer connection of Mehta et al. (2020) . Figure 7c shows an example of GLT with feature shuffle and input mixer connection. 

C MULTIPLICATION-ADDITION OPERATIONS IN DELI G HT

The DeLighT block is built using linear transformations, GLTs, and scaled dot-product attention. Total number of multiplication-addition operations (MACs) in a network is an accumulation of these individual operations. Let n denotes the number of source tokens, m denotes the number of target tokens, dm denotes the input dimension, do denotes the output dimension, and g denotes the number of groups in GLT. The procedure for counting MACs for each of these operations is described below. Group linear transformation (GLT): GLT F has g learnable matrices Wi ∈ R dm g × do g . Therefore, GLT learns dmdo g parameters and performs dmdo g MACs to transform dm-dimensional input to do-dimensional output. Following a standard practice, e.g., ResNet of He et al. (2016) , we count addition and multiplication as one operation instead of two because these operations can be fused in recent hardwares. Importantly, when g = 1, the GLT is the same as linear transformation. Self-attention in DeLighT: The scaled dot-product self-attention in DeLighT is defined as: Attention(K, Q, V) = softmax QK T √ do V where Q ∈ R n×do , K ∈ R n×do , V ∈ R n×do denotes query, key, and value, respectively. The attention operation involves two dot-products. The first dot product between Q and K while the second dot product is between the output of first dot product and V. Both dot products require don 2 MACs. Therefore, total number of MACs in computing scaled dot-product self-attention are 2don 2 . In case of a source-target attention (as in machine translation), K's and V's are from the source (encoder) and Q's are incrementally decoded (one token at a time). Therefore, the number of MACs required to decode m target tokens given n source tokens are m k=1 2kndo . D ABLATIONS ON THE WIKITEXT-103 DATASET 1b ). Figure 8 , Figure 9 , and Figure 10 shows the impact of the DeLighT transformation, feature shuffling, and the light-weight FFN. Table 8 shows the effect of position of DeLighT transformation in the DeLighT block while Figure 12 shows the effect of scaling DeLighT networks. We choose the WikiText-103 dataset for ablations because it has very large vocabulary compared to other datasets (267K vs. 30-40K), allowing us to test the ability under large vocabulary sizes. The performance is reported in terms of perplexity (lower is better) on the validation set. In our ablation studies, we used the same settings for training as in Section 4.2 except that we train only for 50K iterations. DeLighT block: Overall, Table 7 shows that scaling depth and width using DeLighT transformation and block-wise scaling improves performance. We make following observations: a) Block-wise scaling (R4, R5) delivers better performance compared to uniform scaling (R1-R3). For instance, DeLighT with Nmin = 4 and Nmax = 8 (R4) is 1.25× shallower than DeLighT with Nmin = 8 and Nmax = 8 (R2), but delivers better performance with a similar number of parameters and operations. Scaling wm improves performance (R2 vs. R3), however, the improvement is significantly lower than for the model with block-wise scaling (R3 vs. R5). This suggests that non-uniform distribution of parameters across blocks allows the network to learn better representations. b) Different ratios between Nmax and Nmin yields different results. We observe significant performance improvements when the ratio is greater than or equal to two. For example, when we scale Nmax N min from 2 to 3 (R6 vs. R8), the perplexity improves by ∼5 points with only a moderate increase in network parameters. On the other hand, when the Nmax N min is close to 1 (R6 vs. R7), performance does not change appreciably. This is likely because the allocation of parameters across blocks is close to uniform (Eq. 4). This is consistent with our previous observation. c) Learning shallower and narrower representations near the input and deeper and wider representations near the output achieves better performance. For example, when we scaled Nmax from 8 to 12 for Nmin = 4 (R6, R8), DeLighT delivered better performance with a similar number of parameters compared to a model with Nmin = 6 (R7, R9). This is likely because the ratio of Nmax and Nmin is higher when Nmin = 4, which helps allocate parameters per block more effectively. d) Deeper and wider representations near the input and shallower and narrower representations near the output hurts performance (R13 vs. R16). e) Scaling width using wm and dm improves performance (R10-R15), however, their impact is different. For example, when we scale wm and dm by two, the rate of increase in number of parameters and operations is more rapid with dm compared to wm. DeLighT's ability to learn wider representations in different ways may be useful in selecting application specific models.

Impact of DeLighT transformation:

We replace DeLighT transformation in the DeLighT block (Figure 1b ) with (1) the DeFINE transformation and (2) a stack of linear layers. Figure 8 shows that DeLighT transformation delivers similar performance with significantly fewer parameters compared to the DeFINE unit and linear layers. In these experiments, the settings are the same as R13-R15 (Table 7 ), except, Nmax = 8, because models with a stack of linear layers learn too many parameters. Feature shuffling: Figure 9 shows that feature shuffling improves the performance of DeLighT by 1-2 perplexity points. Here, we use the same settings as in R13-R15 (Table 7 ). Light-weight FFN: Figure 10 shows the impact of varying the reduction factor r in the light-weight FFN. We use the same settings as in R13 (Table 7 ). We did not observe any significant drop in performance until r = 4. Beyond r = 4, we see a drop in performance (perplexity increases by ∼2 points). In such cases, the inner dimensions of the light-weight FFN are very small and hurt performance. Notably, the light-weight FFN with r = 2 2 delivered the same performance as r = 2 -2 , but with 1.28× fewer network parameters. At r = 2 -2 , the light-weight FFN is the same as the FFN in Vaswani et al. (2017) . This suggests that the ability of This work investigates relationships between Nmin, Nmax, wm, and dm, manually. We believe that a more principled approach, such as compound scaling of Tan and Le (2019) , that establishes relationships between these parameters would produce more efficient and accurate models.

E SOURCE CODE FOR GROUP LINEAR TRANSFORMATION

The source code for implementing group linear transformation (GLT) in PyTorch is shown in Listing 1. The source code for efficiently implementing the grouping function in GLT is shown in Listing 2. Since the ungrouping kernel is similar to grouping kernel, we have not shown it here. The reshape and transpose operations in naive PyTorch implementation for grouping and ungrouping are replaced with a dedicated CUDA kernels, resulting in reduced memory footprint and faster training.



Transformer-base uses dm=512 and d f =2048 while Transformer-large uses dm=1024 and d f =4096. We use training and validation data that is compatible with the Tensor2Tensor library(Vaswani et al., 2018) in order to have fair comparisons with recent works (e.g., Evolved Transformer). https://github.com/NVIDIA/apex



Figure 1: (a, b) Block-wise comparison between the standard transformer block of Vaswani et al. (2017) and the DeLighT block. In the DeLighT transformation, the number of operations in computing attention are reduced by half while the number of parameters (and operations) in the FFN are reduced by 16×. Transformations with learnable parameters ( Linear and DeLighT ) are shown in color. The shape of linear transformations indicate their operation (expansion, reduction, etc.).(c, d) compares the DeFINE transformation(Mehta et al., 2020) with the DeLighT transformation. Compared to the DeFINE transformation, the DeLighT transformation uses group linear transformations (GLTs) with more groups to learn wider representations with fewer parameters. Different colors are used to show groups in GLTs. For simplicity, feature shuffling is not shown in (d).

Figure2: Example illustrating the expansion phase in the DeLighT transformation that uses GLTs, feature shuffling, and an input mixer connection, to learn deeper and wider representations efficiently. For illustrative purposes, we have used the same input and output dimensions.

Figure 1b shows how we integrate DeLighT transformation into the transformer block to improve its efficiency. The d m -dimensional inputs are first fed to the DeLighT transformation to produce d o -dimensional outputs, where d o < d m . These d o -dimensional outputs are then fed into a single head attention, followed by a light-weight FFN to model their relationships. DeLighT layer and single head attention: Let us assume we have a sequence of n input tokens, each of dimensionality d m . These n, d m -dimensional inputs are first fed to the DeLighT transformation to produce n, d o -dimensional outputs, where d o < d m . These n, d o -dimensional outputs are then projected simultaneously using three linear layers to produce d o -dimensional queries Q, keys K, and values V.We then model contextual relationships between these n tokens using scaled dot-product attention (Eq. 3). To enable the use of residual connections(He et al., 2016), the d o -dimensional outputs of this attention operation are linearly projected into a d m -dimensional space.

Figure3: Block-wise scaling efficiently allocates parameters and operations across blocks, leading to shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output. In (b), DeLighT networks with both uniform (N =Nmin=Nmax=8) and block-wise (Nmin=4, Nmax=8) scaling have about 16.7 M parameters and perform 3.5 B operations (computed for a sequence length of n = 30), however, the DeLighT network with block-wise scaling delivered 2 points better perplexity.

For the IWSLT'14 De-En dataset, we replicate the setup ofWu et al. (2019) andEdunov et al. (2018), which uses 160K/7K/7K sentence pairs for training, validation, and testing with a joint BPE vocabulary of about 10K tokens, respectively. For the WMT'14 English-German (En-De) dataset, we follow the setup ofVaswani et al. (2017). The dataset has 3.9M/39K/3K sentence pairs for training, validation, and testing respectively with a joint BPE vocabulary size of 44K. 2 For the WMT'14 English-French (En-Fr) dataset, we replicate the setup ofGehring et al. (2017), which uses 36M/27K/3K sentence pairs for training, validation, and testing respectively with a joint BPE vocabulary size of 44K. The performance is evaluated in terms of BLEU(Papineni et al., 2002) (higher is better) on the test set. We followWu et al. (2019) for beam search related hyper-parameters.Architecture: We follow the symmetric encoder-decoder architecture ofVaswani et al. (2017) with sinusoidal positional encodings. Both the encoder and the decoder have B DeLighT blocks. Decoder blocks are identical to the encoder blocks (Figure1b), except that they have an additional source-target single-head attention unit before the light-weight FFN. In the source-target single-head attention unit, keys and values are projections over the encoder output (full details in Appendix A). In our experiments, we use w m = 2, N min = 4, and N max = 8 for WMT'16 En-Ro, WMT'14 En-De, and WMT'14 En-Fr; resulting in 222 layer deep DeLighT networks. For IWSLT'14 De-En, we used w m = 1, N min = 3, and N max = 9 for IWSLT'14 De-En; resulting in 289 layer deep network.

Figure 4: Comparison of DeLighT with Transformers and Evolved Transformers at two different settings, on the WMT'14 En-De corpus: (1) the number of parameters is the same and (2) the performance is the same.

Figure 5: Scaling up DeLighT models. The performance of DeLighT improves with an increase in the number of network parameters, across different corpora, including low-resource (WMT'16 En-Ro).

Figure 6: Sequence modeling with DeLighT. Here, green color hexagon represents the DeLighT transformation.

g outputs Yi = XiWi. Yi's are then concatenated to produce the final do-dimensional output Y = Concat(Y1, • • • , Yg).

Figure 7: This figure visualizes different variants of group linear transformations that are used in the DeLighT transformation.

Figure 8: Impact of different transformations. DeLighT transformations are more parametric efficient than DeFINE and linear transformations. Lower perplexity value means better performance.

Comparison with baseline transformers on machine translation corpora. DeLighT models require significantly fewer parameters to achieve similar performance. Here, † and ‡ indicate the best reported transformer baselines fromWu et al. (2019) andGhazvininejad et al. (2019), respectively.

DeLighT

Comparison with state-of-the-art methods on machine translation corpora. DeLighT delivers similar or better performance than state-of-the-art models with fewer parameters. Here, † indicates that the network uses neural architecture search (NAS) and ‡ indicates that full network parameters are not reported.

Comparison with baseline transformers in terms of training speed and memory consumption. In R4, we implemented CUDA kernels for grouping and ungrouping functions only (see Appendix E). We expect DeLighT to be more efficient with a single and dedicated CUDA kernel for grouping, transformation, feature shuffling, and ungrouping. Memory consumption is measured on a single NVIDIA GP100 GPU (16 GB memory) with a maximum of 4096 tokens per batch and without any gradient accumulation.

DeLighT requires less regularization as compared to baseline transformers (Dataset: WMT'14 En-De).

table to map tokens



Ablations on different aspects of the DeLighT block, including uniform vs. block-wise scaling, depth scaling, and width scaling. Rows partially highlighted in color have the same configuration (repeated for illustrating results). Our experimental setup is similar to Section 4, except that we train our models for 50K iterations. Multiplication and addition operations (MACs) are computed for 20 time steps.

acknowledgement

Acknowledgements: This research was supported by ONR N00014-18-1-2826, DARPA N66001-19-2-403, NSF (IIS-1616112, IIS1252835), and an Allen Distinguished Investigator Award. Authors would also like to thank members of the UW-NLP and the H2Lab at The University of Washington for their valuable feedback and comments.

annex

LSTM (Grave et al., 2017b ) ---48.70 LSTM + Neural Cache (Grave et al., 2017b) ---40.80 QRNN (Merity et al., 2018b) --151 M 33.00Transformer-XL (Dai et al., DeLighT networks with block-wise scaling delivers better performance across different settings. Lower perplexity value means better performance.

Block-wise Uniform

DeLighT transformation to learn representations in high-dimensional spaces efficiently allows us to reduce the computational burden on the FFN.We also tested removing the light-weight FFN and while it reduced parameters by ∼0.5-1 M, performance dropped by about 2-3 perplexity points across different parametric settings.Uniform vs. block-wise scaling: Figure 11 compares the performance of DeLighT with uniform and blockwise scaling. For a given model dimension dm, DeLighT models with block-wise scaling delivers better performance.Position of DeLighT transformation: We studied three configurations for the DeLighT transformation on the WikiText-103 validation set (Table 8 ): (1) DeLighT transformation followed by single-headed attention and light-weight FFN, (2) single-headed attention followed by DeLighT transformation, and (3) single-headed attention followed by DeLighT transformation and light-weight FFN. For similar number of parameters, we found that (2) and (3) drops the performance of (1) significantly across different parametric settings. This suggests that deeper and wider representations helps learn better contextual representations; allowing us to replace multi-headed attention with single-headed attention.Scaling up DeLighT: Figure 12 shows the results of DeLighT models obtained after varying configuration parameters of DeLighT transformations (Nmin={4, 6}, Nmax={8, 12}, wm={2, 3, 4}, and dm={256, 384, 512}). We can see that scaling one configuration parameter (e.g., dm) while keeping other configuration parameters constant (e.g., Nmin, Nmax, and wm) consistently improves performance.Published as a conference paper at ICLR 2021 Listing 1: "Naive implementation of GLT in Pytorch" import torch def glt_function(x, n_groups, weights, bias=None): ''' :param x: Input tensor of size [B x N], where B is batch size and N is input dimension :param n_groups: number of groups in GLT :param weights: glt weights [g x N/g x M/g] :param bias: GLT bias (optional) of size [g x 1 x M/g] :return: output tensor of size [B x M] ''' bsz = x.size(0)x.transpose(0, 1) # transpose so that group is first ## TRANSFORMATION FUNCTION: Transforms from N/g-dimensional space to M/g-dimensional space ## # [g x B x N/g] x [g x N/g x M/g] --> [g x B x M/g] x = torch.bmm(x, weights) # multiply with Weights # add bias if bias is not None:x = torch.add(x, bias) 

