WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?

Abstract

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and present an alternative approach: building Transformers with wider attention. We demonstrate that wide single-layer Transformer models can typically equal or sometimes outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. We systematically study the impact of changing the model aspect ratio on Transformers. This ratio balances the number of layers against the number of attention heads per layer, while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single-layer wide models perform 0.3% better than their deep counterparts. We present an in-depth evaluation and demonstrate that wide models require a far smaller memory footprint and can run faster on commodity hardware. In addition, these wider models are more interpretable. For example, a single-layer Transformer on the IMDb byte-level text classification task has 3.1× faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.

1. INTRODUCTION

Since Vaswani et al. (2017), Transformer-based architectures have become widespread due to their advantages over previous architectures such as recurrent neural networks (RNNs) and sometimes even convolutional neural networks (CNNs). Many new X-formers have also been proposed that improve on the original Transformer by overcoming its limitation on sequence length by providing a more scalable attention mechanism (Choromanski et al., 2020; Wang et al., 2020b; Beltagy et al., 2020). However, little research has been done on the relevance of the size of the attention computation in each layer, the number of attention layers, and how these parameters relate to the resulting Transformer's characteristics.

The primary source of parameters in a Transformer network is the Feed-forward Network (FFN) in each encoder or decoder layer, and the linear layers which convert from the sequence feature dimension (often equal to the initial embedding dimension) to the attention feature dimension, and back again after attention is applied. Each attention head typically has an equal number of attention features. Consider an input sequence $X \in \mathbb{R}^{S \times E}$, where $S$ is the sequence length and $E$ is the embedding dimension. Here, a multi-head attention with $H$ heads is used and each head operates on the learned projection with a dimension of $A$. After the attention mechanism there is a FFN with a single hidden dimension of size $M$. These layers are then stacked $L$ times as illustrated on the left of Figure 1. Often in a typical Transformer $E = AH$. The total number of parameters in a Transformer encoder is given by:

$$\text{Encoder Parameters} = L(3EAH + AHE + EM + ME) = 2LE(2AH + M)$$

In this paper, we investigate the effects of changing $L$ and $H$ while keeping their product, the total number of heads, the same. We start with typical values for $L$ and $H$ and then move down to a single

