WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?

Abstract

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and present an alternative: building wider attention Transformers. We demonstrate that wide single-layer Transformer models can typically equal or sometimes outperform deeper ones on a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. We systematically study the impact of changing the model aspect ratio on Transformers. This ratio balances the number of layers against the number of attention heads per layer, while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single-layer wide models perform 0.3% better than their deep counterparts. We provide an in-depth evaluation and demonstrate that wide models have a far smaller memory footprint, can run faster on commodity hardware, and are more interpretable. For example, a single-layer Transformer on IMDb byte-level text classification has 3.1× faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.

1. INTRODUCTION

Since Vaswani et al. (2017), Transformer-based architectures have become widespread due to their advantages over previous architectures such as recurrent neural networks (RNNs) and sometimes even convolutional neural networks (CNNs). Many new X-formers have also been proposed that improve on the original Transformer by overcoming its limitation on sequence length, providing a more scalable attention mechanism (Choromanski et al., 2020; Wang et al., 2020b; Beltagy et al., 2020). However, little research has been done on the relevance of the size of the attention computation in each layer, the number of attention layers, and how these parameters relate to the resulting Transformer's characteristics.

The primary sources of parameters in a Transformer network are the feed-forward network (FFN) in each encoder or decoder layer, and the linear layers which convert from the sequence feature dimension (often equal to the initial embedding dimension) to the attention feature dimension, and back again after attention is applied. Each attention head typically has an equal number of attention features. Consider an input sequence X ∈ R^(S×E), where S is the sequence length and E is the embedding dimension. Multi-head attention with H heads is used, and each head operates on a learned projection with dimension A. After the attention mechanism there is an FFN with a single hidden dimension of size M. These layers are then stacked L times, as illustrated on the left of Figure 1. In a typical Transformer, often E = AH. The total number of parameters in a Transformer encoder is given by:

Encoder Parameters = L(3EAH + AHE + EM + ME) = 2LE(2AH + M)

In this paper, we investigate the effects of changing L and H while keeping their product, the total number of heads, constant. We start with typical values for L and H and then move down to a single layer.
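The parameter count above can be made concrete with a short calculation. The sketch below is illustrative only: the dimension values (E = 512, A = 64, M = 2048) and the deep/wide configurations are assumptions chosen for the example, not configurations reported in this work, and bias terms are ignored.

```python
def encoder_params(L, H, E, A, M):
    """Parameter count for a Transformer encoder stack with L layers,
    H heads per layer, embedding dim E, per-head attention dim A, and
    FFN hidden dim M (biases omitted for simplicity).
    Per layer: 3*E*A*H  (Q, K, V projections)
             + A*H*E    (attention output projection)
             + E*M + M*E (the two FFN linear layers)."""
    return L * (3 * E * A * H + A * H * E + E * M + M * E)

# Hypothetical deep model: 6 layers of 8 heads.
# Equivalent wide model: 1 layer of 6*8 = 48 heads.
E, A, M = 512, 64, 2048
deep = encoder_params(L=6, H=8, E=E, A=A, M=M)
wide = encoder_params(L=1, H=48, E=E, A=A, M=M)

# Sanity check against the closed form 2LE(2AH + M).
assert deep == 2 * 6 * E * (2 * A * 8 + M)

# Same total head count, but the wide model pays for the FFN only once,
# so it has substantially fewer parameters.
assert wide < deep
```

This also illustrates why the wide model ends up smaller: the attention-projection parameters depend only on the total head count L × H, while the FFN parameters scale with the number of layers.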
A diagram illustrating our design space and the differences between our widest and deepest models is given in Figure 1. We refer to the ratio of layers to heads as the model aspect ratio. This naturally leads to an intriguing question: what is the best model aspect ratio for the growing number of X-former models? We consider the impact of model aspect ratio on accuracy, run-time performance, model size, and interpretability.

Based on this question, we investigate the influence of various model aspect ratios on 9 X-former models, each with its own attention mechanism, in addition to the original Transformer. Prior work on Transformer architectures has mainly focused on designing more efficient attention styles (Wang et al., 2020b; Choromanski et al., 2020) or on using Network Architecture Search (NAS) to discover an optimal combination of operators (So et al., 2019). By changing the model aspect ratio, we consider a more coarse-grained design space. This design space is not commonly explored by NAS algorithms for Transformers, and it lets us evaluate some interesting model architectures, such as a single-layer model with many parallel heads.

For each model aspect ratio, we run experiments with each X-former across a number of text classification tasks with input sequence lengths ranging from 500 to 4000. We empirically observe that wider and shallower models can typically equal or sometimes beat the accuracy of deeper models. This observation challenges the common design paradigm of building ever deeper neural networks. We show several other major advantages of shallower and wider models. First, they are more latency-friendly on commodity hardware. Second, wider models are smaller in terms of the number of parameters. Third, the outputs of wider models are more interpretable.
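To make the "single layer with many parallel heads" architecture concrete, here is a minimal NumPy sketch of one wide multi-head self-attention layer. This is an illustration under assumptions, not the paper's implementation: the dimensions (S, E, H, A), the random weights, and the omission of layer norm, residual connections, and the FFN are all simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def wide_self_attention(X, Wq, Wk, Wv, Wo):
    """Single-layer multi-head self-attention with H parallel heads.
    X: (S, E); Wq, Wk, Wv: (H, E, A); Wo: (H*A, E)."""
    H, _, A = Wq.shape
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]  # each (S, A)
        scores = softmax(Q @ K.T / np.sqrt(A))      # (S, S) attention weights
        heads.append(scores @ V)                    # (S, A)
    # Concatenate all heads and project back to the embedding dimension.
    return np.concatenate(heads, axis=-1) @ Wo      # (S, E)

# A "wide" configuration: one layer, many heads (values are hypothetical).
rng = np.random.default_rng(0)
S, E, H, A = 10, 32, 16, 2
X = rng.standard_normal((S, E))
Wq, Wk, Wv = (rng.standard_normal((H, E, A)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((H * A, E)) * 0.1
out = wide_self_attention(X, Wq, Wk, Wv, Wo)
assert out.shape == (S, E)
```

Because all H heads are independent, a wide layer exposes more parallelism to the hardware than a deep stack, whose layers must execute sequentially; this is one intuition behind the latency advantage reported above.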
To summarise, we make the following contributions in this paper:

• We demonstrate that wider and shallower models can typically equal or sometimes beat the accuracy of deeper models when there is no pretraining of weights or embeddings. Across

Figure 1: A comparison of a deep Transformer-based classifier (left, with L layers and H heads per layer) vs. an equivalent wide one (right, with a single layer and L × H heads). Layer norms and residual connections have been omitted from the diagram for clarity; for details on the full Transformer architecture see Vaswani et al. (2017).

