AN ATTENTION FREE TRANSFORMER

Abstract

We introduce the Attention Free Transformer (AFT), an efficient variant of Transformers (Vaswani et al., 2017) that eliminates the need for dot product attention. AFT offers great simplicity and efficiency compared with standard Transformers: the multi-head attention operation is replaced with a composition of element-wise multiplications/divisions and global/local pooling. During training, AFT has linear time and space complexity w.r.t. both the sequence length and the feature dimension; in the autoregressive decoding mode, AFT has constant memory and time complexity per step. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and to match or surpass standard Transformer counterparts and other efficient variants. In particular, AFT achieves the state-of-the-art result on CIFAR10 autoregressive modeling with much reduced complexity, and also outperforms several efficient Transformer variants on Enwik8.

1. INTRODUCTION

Attention mechanisms, represented by Transformers (Vaswani et al., 2017), have driven the advancement of various machine learning problems, including language modeling (Devlin et al., 2018; Radford et al.), image modeling (Chen et al.), and set modeling (Lee et al., 2019). Different from other well known model architectures such as Convolutional Neural Nets (CNNs) or Recurrent Neural Nets (RNNs), Transformers enable direct interaction between every pair of elements within a sequence, which makes them especially powerful at capturing long term dependencies. However, Transformers incur high computational costs. The root cause of this challenge is the attention operation, which has quadratic time and space complexity w.r.t. the context size. This makes it especially difficult for Transformers to scale to inputs with large context sizes. A number of recent works have been dedicated to addressing the scalability issue of Transformers (Child et al., 2019; Kitaev et al., 2020; Rae et al., 2020; Wang et al., 2020b; Katharopoulos et al., 2020; Tay et al., 2020a; Choromanski et al., 2020). While the techniques adopted in the literature range from sparsity and locality sensitive hashing to low rank decomposition and kernel approximation, most of them approximate the full attention operation. In this paper, we take a bolder step towards the same goal, by proposing a computational module that does not use or approximate the standard dot product attention. We hence name our model the Attention Free Transformer (AFT). Similar to dot product attention, AFT is composed of the interaction of three quantities, namely the query, key and value. What is different, however, is that AFT operates solely based on element-wise operations. To be more concrete, the key and value are first multiplied element-wise, and the result is then pooled over the context dimension (in the causal model, this corresponds to a cumulative sum).
The query is then multiplied element-wise with the reduced key-value representation to produce the final output. See Figure 1a for an illustration. AFT maintains the full advantage of dot product attention, namely direct interaction between any two elements in a sequence (up to proper masking). However, the computational cost is drastically reduced to O(Td) time and space complexity, where T, d are the context length and feature dimension, respectively. In the autoregressive decoding mode, AFT also provides constant decoding time and space complexity per step, compared to O(T) for standard Transformers. To the best of our knowledge, AFT is the first model that achieves such efficiency in the context of Transformers. See Table 1 for the complexity analysis of AFT in comparison to other variants.
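The causal computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full model (it omits the learned position biases and multi-dimensional batching); the sigmoid gating of the query and the exponentiated keys follow the paper's formulation, and the max-subtraction is our own numerical-stability convenience that cancels out in the ratio.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple_causal(Q, K, V):
    """Causal attention-free update, as a sketch.

    Q, K, V have shape (T, d). For each position t,
        Y_t = sigmoid(Q_t) * cumsum_{t'<=t}(exp(K_{t'}) * V_{t'})
                            / cumsum_{t'<=t}(exp(K_{t'})),
    i.e. the key-value product is pooled over the context with a
    cumulative sum (the causal form of pooling), and the query then
    gates the reduced representation element-wise. Time and space
    are O(T*d); decoding needs only the running sums, so each new
    step costs O(d) time and memory.
    """
    w = np.exp(K - K.max(axis=0, keepdims=True))  # stabilized exp(K); the shift cancels below
    num = np.cumsum(w * V, axis=0)                # causal pooling of key-value products
    den = np.cumsum(w, axis=0)                    # causal normalizer
    return sigmoid(Q) * (num / den)
```

Because the context is reduced to the two running sums `num` and `den`, autoregressive decoding only needs to update them in place at each step, which is where the constant per-step complexity comes from.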

