DELIGHT: DEEP AND LIGHT-WEIGHT TRANSFORMER

Abstract

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.
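To make block-wise scaling concrete, the short sketch below allocates a per-block depth that grows from the input side of the network to the output side. The linear ramp and the function name blockwise_depths are our own illustrative assumptions; the paper specifies the exact allocation rule.

    # Sketch of block-wise scaling: blocks near the input are shallow, blocks
    # near the output are deep. The linear ramp is assumed for illustration only.

    def blockwise_depths(num_blocks, n_min, n_max):
        """Number of DeLighT-transformation layers per block, growing linearly
        from n_min (first block) to n_max (last block)."""
        if num_blocks == 1:
            return [n_max]
        return [round(n_min + (n_max - n_min) * b / (num_blocks - 1))
                for b in range(num_blocks)]

    # Example: 6 blocks ramping from 4 to 8 layers per block.
    print(blockwise_depths(6, 4, 8))  # [4, 5, 6, 6, 7, 8]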

1. INTRODUCTION

Attention-based transformer networks (Vaswani et al., 2017) are widely used for sequence modeling tasks, including language modeling and machine translation. To improve performance, models are often scaled to be either wider, by increasing the dimension of hidden layers, or deeper, by stacking more transformer blocks. For example, T5 (Raffel et al., 2019) uses a dimension of 65K and GPT-3 (Brown et al., 2020) uses 96 transformer blocks. However, such scaling increases the number of network parameters significantly (e.g., T5 and GPT-3 have 11 billion and 175 billion parameters, respectively) and complicates learning, i.e., these models either require very large training corpora (Raffel et al., 2019; Devlin et al., 2019; Brown et al., 2020) or careful regularization (Hinton et al., 2012; Wan et al., 2013; Merity et al., 2018a). In this paper, we introduce a new parameter-efficient attention-based architecture that can be easily scaled to be both wide and deep.

Our Deep and Light-weight Transformer architecture, DeLighT, extends the transformer architecture of Vaswani et al. (2017) and delivers similar or better performance with significantly fewer parameters and operations. At the heart of DeLighT is the DeLighT transformation, which uses the group linear transformations (GLTs) of Mehta et al. (2018) with an expand-reduce strategy for varying the width and depth of the DeLighT block efficiently. Since GLTs are local by nature, the DeLighT transformation uses feature shuffling, analogous to channel shuffling in convolutional networks (Zhang et al., 2018), to share information between different groups. Such wide and deep representations facilitate replacing the multi-head attention and feed-forward layers in transformers with single-headed attention and light-weight feed-forward layers, reducing total network parameters and operations. Importantly, unlike transformers, the DeLighT transformation decouples the depth and width from the input size, allowing us to allocate parameters more efficiently across blocks by using shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.
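To illustrate the core building block, the following PyTorch-style sketch shows a group linear transformation with feature shuffling. The module name GroupLinear, the einsum-based per-group multiply, and the expand-reduce usage at the end are our own illustrative choices, not the authors' reference implementation.

    import torch
    import torch.nn as nn

    class GroupLinear(nn.Module):
        # Splits the input into `groups` chunks, applies a separate linear
        # transform to each chunk, then interleaves the outputs across groups
        # (feature shuffling, analogous to channel shuffling) so that the next
        # layer mixes information between groups.
        def __init__(self, in_features, out_features, groups):
            super().__init__()
            assert in_features % groups == 0 and out_features % groups == 0
            self.groups = groups
            self.weight = nn.Parameter(
                0.02 * torch.randn(groups, in_features // groups, out_features // groups))

        def forward(self, x):  # x: (batch, in_features)
            b = x.size(0)
            x = x.view(b, self.groups, -1)                    # (batch, groups, in/groups)
            x = torch.einsum('bgi,gio->bgo', x, self.weight)  # per-group linear transform
            # Feature shuffle: swap the group and feature axes before flattening.
            return x.transpose(1, 2).contiguous().view(b, -1)

    # Expand-reduce sketch: widen the representation with GLTs, then project back.
    expand = GroupLinear(128, 256, groups=4)
    reduce_ = GroupLinear(256, 128, groups=4)
    y = reduce_(expand(torch.randn(8, 128)))
    print(y.shape)  # torch.Size([8, 128])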



We demonstrate that DeLighT models achieve similar or better performance than transformer models with significantly fewer parameters and operations on two common sequence modeling tasks: (i) machine translation and (ii) language modeling. On the low-resource WMT'16 En-Ro machine translation dataset, DeLighT attains transformer performance using 2.8× fewer parameters. On the high-resource WMT'14 En-Fr dataset, DeLighT delivers better performance (+0.4 BLEU) with 1.8× fewer parameters than baseline transformers. Similarly, on language modeling, DeLighT matches the performance of Transformer-XL (Dai et al., 2019) with 1.5× fewer parameters.

