AN ATTENTION FREE TRANSFORMER

Abstract

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers (Vaswani et al., 2017) that eliminates the need for dot product attention. AFT offers great simplicity and efficiency compared with standard Transformers, where the multi-head attention operation is replaced with the composition of element-wise multiplications/divisions and global/local pooling. During training time, AFT has linear time and space complexity w.r.t. both the sequence length and feature dimension; in the autoregressive decoding mode, AFT has constant memory and time complexity per step. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts and other efficient variants. In particular, AFT achieves the state-of-the-art result on CIFAR10 autoregressive modeling with much reduced complexity, and also outperforms several efficient Transformer variants on Enwik8.

1. INTRODUCTION

Attention mechanisms, represented by Transformers (Vaswani et al., 2017), have driven the advancement of various machine learning problems, including language modeling (Devlin et al., 2018; Radford et al.), image modeling (Chen et al.), and set modeling (Lee et al., 2019). Different from other well known model architectures such as Convolutional Neural Nets (CNNs) or Recurrent Neural Nets (RNNs), Transformers enable direct interaction between every pair of elements within a sequence, which makes them especially powerful at capturing long term dependencies. However, Transformers come at a high computational cost. The root cause of this challenge is the need to perform attention operations with quadratic time and space complexity w.r.t. the context size, which makes it especially difficult for Transformers to scale to inputs with large context sizes. A number of recent works have been dedicated to addressing the scalability issue of Transformers (Child et al., 2019; Kitaev et al., 2020; Rae et al., 2020; Wang et al., 2020b; Katharopoulos et al., 2020; Tay et al., 2020a; Choromanski et al., 2020). While the techniques adopted in the literature include sparsity, locality sensitive hashing, low rank decomposition and kernel approximation, most of them try to approximate the full attention operation. In this paper, we take a bolder step towards the same goal by proposing a computational module that does not use or approximate the standard dot product attention at all. We hence name our model the Attention Free Transformer (AFT). Similar to dot product attention, AFT is composed of the interaction of three quantities, namely the query, key and value. What is different, however, is that AFT operates solely with element-wise operations. To be more concrete, the key and value are first multiplied element-wise, and the result is then pooled over the context dimension (in the causal model, this corresponds to a cumulative sum).
The query is then multiplied element-wise with the reduced key-value representation to produce the final output. See Figure 1a for an illustration. AFT maintains the key advantage of dot product attention, namely direct interaction between any two elements in a sequence (up to proper masking). However, the computational cost is drastically reduced to O(Td) complexity in both time and space, where T, d are the context length and feature dimension, respectively. In the autoregressive decoding mode, AFT also provides constant decoding time and space complexity per step, compared to O(Td) for standard Transformers. To the best of our knowledge, AFT is the first model that achieves such efficiency in the context of Transformers. See Table 1 for the complexity analysis of AFT in comparison to Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020a) and Linear Transformer (Katharopoulos et al., 2020) (only variants that support the causal mode are shown); here T, d denote the sequence length and feature dimension, respectively.

Table 1: Complexity comparison with different Transformers (only variants that support the causal mode are shown). T, d denote the sequence length and feature dimension, respectively.

Model               Time @ train    Space @ train     Time/step @ decode   Space/step @ decode
Full Attention      O(T^2 d)        O(T^2 + Td)       O(Td)                O(Td)
Reformer            O(T log T d)    O(T log T + Td)   O(log T + d)         O(Td)
Synthesizer         O(T^2 d)        O(T^2 + Td)       O(Td)                O(Td)
Linear Transformer  O(T d^2)        O(Td + d^2)       O(d^2)               O(d^2)
AFT (ours)          O(Td)           O(Td)             O(d)                 O(d)

We show that we can interpret AFT as an extreme case of multi-head dot product attention (MHA). In particular, we show that by 1) setting the number of heads equal to the feature dimension in MHA and 2) using relu in place of softmax as the non-linearity, MHA can be decomposed into the summation of two AFT modules (see Equation 6). However, this relationship does not hold in general, i.e., by varying the non-linearity applied to the query and key in AFT, we can obtain models that do not have an MHA counterpart. This realization allows us to freely explore the design choices (e.g., nonlinearity) of AFT to achieve the best performance. This philosophy is in direct contrast with previous and concurrent "linearized attention" works (Katharopoulos et al., 2020; Choromanski et al., 2020), which are constrained by the design space of MHA. We perform experiments with AFT on several benchmarks, including unconditional image modeling, image super-resolution, language modeling, machine translation and point cloud generation. We show that AFT works very well as an alternative to the standard Transformer, providing competitive results as well as excellent efficiency. To summarize, our contributions are as follows:

• We propose AFT, a new family of Transformer models that achieves O(Td) time and space complexity in training, as well as O(d) time and space complexity per step in autoregressive decoding.
• We show strong performance of AFT as a drop-in replacement for MHA on various benchmarks, including setting the state-of-the-art result on CIFAR10 in the standard setting and outperforming other efficient Transformer variants.

2. MULTI-HEAD ATTENTION

At the core of Transformers is the Multi-Head Attention (MHA) operation. Given three sequences, namely the query Q ∈ R^{T×d}, key K ∈ R^{T×d} and value V ∈ R^{T×d}, and the number of heads h, MHA performs scaled dot product attention for each head i, defined as:

f_i(Q, K, V) = σ(Q_i (K_i)^T / sqrt(d_k)) V_i,  s.t.  Q_i = Q W_i^Q, K_i = K W_i^K, V_i = V W_i^V,   (1)

where W_i^Q ∈ R^{d×d_k}, W_i^K ∈ R^{d×d_k}, W_i^V ∈ R^{d×d_v} are linear transformations for head i, and σ is the non-linearity, by default set to the softmax_r function (the subscript r indicates that softmax is applied to each row of a matrix). d_k, d_v are the dimensions of the key and value, respectively. MHA concatenates the outputs of the h attention heads along the channel dimension, resulting in feature dimension h d_v. Unless otherwise mentioned, we assume d_k = d_v and h = d / d_k. This means the query, key and value have the same dimension within each head, and the output dimension matches that of the input.
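As an illustration, the scaled dot product attention of Equation 1 (with σ = softmax_r) can be sketched in a few lines of NumPy. This is a minimal reference computation, not the implementation used in the experiments; the function and variable names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V, Wq, Wk, Wv):
    """One head: sigma(Q_i K_i^T / sqrt(d_k)) V_i, with row-wise softmax."""
    Qi, Ki, Vi = Q @ Wq, K @ Wk, V @ Wv           # (T, d_k), (T, d_k), (T, d_v)
    scores = Qi @ Ki.T / np.sqrt(Ki.shape[-1])    # (T, T) scaled dot products
    return softmax(scores, axis=-1) @ Vi          # (T, d_v)

def mha(Q, K, V, heads):
    """Concatenate the h head outputs along the channel dimension."""
    return np.concatenate([attention_head(Q, K, V, *w) for w in heads], axis=-1)
```

With d_k = d_v = d/h, the concatenated output has the same dimension d as the input, as assumed in the text.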

3.1. ATTENTION FREE TRANSFORMER

We now define the Attention Free Transformer (AFT), which provides an alternative to MHA. Given Q, K, V, AFT first linearly transforms them into Q' = Q W^Q, K' = K W^K, V' = V W^V, then performs the following operation:

f(Q, K, V) = σ_q(Q') ⊙ (Σ_{t=1}^{T} σ_k(K'_t) ⊙ V'_t),   (2)

where ⊙ is the element-wise product, with support for broadcasting when the operands' dimensions do not exactly match (see footnote 1); σ_q, σ_k are nonlinearities applied to the query and key, respectively. Explained in words, the key and value are first combined with an element-wise multiplication, and the result is then pooled over the context dimension, yielding a fixed length context vector ∈ R^d. This context vector is then multiplied with each row of the query, which forms the final output of an AFT layer. One particularly useful variant of MHA is masked attention, oftentimes presented in the form of causal attention. Specifically, in auto-regressive models, queries are constrained from interacting with keys and values beyond the current position. In standard attention, this is usually implemented with an explicit binary masking matrix of shape T × T, with non-causal entries masked as 0. We show that it is also straightforward to extend AFT to the causal mode while maintaining its efficiency. We denote an AFT layer's output as Y_t = f(Q_{≤t}, K_{≤t}, V_{≤t}), t = 1, ..., T (see footnote 2). We formulate the causal AFT as:

Y_t = σ_q(Q'_t) ⊙ (Σ_{t'=1}^{t} σ_k(K'_{t'}) ⊙ V'_{t'}),  t = 1, ..., T,   (3)

where the subscript X_t indexes the t-th row of matrix X.
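As a concrete illustration, the causal AFT layer can be computed with cumulative sums over the context dimension. The sketch below assumes the sigmoid + softmax setting for σ_q, σ_k (described later as the default); it is an illustrative NumPy implementation, not the authors' code.

```python
import numpy as np

def causal_aft_softmax(Qp, Kp, Vp):
    """Causal AFT with sigma_q = sigmoid and sigma_k = softmax normalized
    over the positions t' <= t seen so far.
    Qp, Kp, Vp: (T, d) arrays, i.e. Q', K', V' after the linear transforms."""
    w = np.exp(Kp - Kp.max(axis=0, keepdims=True))  # stabilized exp(K'); the shift cancels in num/den
    num = np.cumsum(w * Vp, axis=0)                 # sum_{t'<=t} exp(K'_{t'}) * V'_{t'}
    den = np.cumsum(w, axis=0)                      # sum_{t'<=t} exp(K'_{t'})
    return (1.0 / (1.0 + np.exp(-Qp))) * (num / den)
```

Both cumulative sums cost O(Td) time and space, matching the complexity claimed for AFT training.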

Discussions:

The design philosophy of AFT is to promote extreme efficiency while keeping the benefit of standard Transformers. Concretely, AFT enables direct interaction between any two elements within the sequence, which is arguably the biggest advantage of Transformers over other types of models such as RNNs and ConvNets. However, AFT gets rid of the need to perform the costly spatial dot product attention, by computing a reduced value representation whose weights depend only on the keys. The resulting operation is extremely efficient, with O(Td) time and space complexity, making AFT the first such model to achieve linear complexity along both the context and feature dimensions. Moreover, the causal mode of AFT has the additional advantage of a constant decoding cost per step, similar to (Katharopoulos et al., 2020). To see this, from Equation 3 we have the simple recursion

Y_t = σ_q(Q'_t) ⊙ (σ_k(K'_t) ⊙ V'_t + KV_{t-1}),  with  KV_t = Σ_{t'=1}^{t} σ_k(K'_{t'}) ⊙ V'_{t'},

assuming σ_q, σ_k are both element-wise functions. One thus only needs to keep KV_t in memory, and update it with constant cost per step. Selecting nonlinearities: σ_q, σ_k provide additional nonlinearity, which helps to increase the model's capacity. Empirically, we have found that one particularly strong setting is to let σ_k = softmax, normalized along the context dimension. This choice brings an interesting benefit, especially in the causal mode, which we can write explicitly as:

Y_t = σ_q(Q'_t) ⊙ (g_t(t) ⊙ V'_t + Σ_{t'=1}^{t-1} g_t(t') ⊙ V'_{t'}),  g_t(t') = exp(K'_{t'}) / Σ_{t''=1}^{t} exp(K'_{t''}).

Here g_t(t) plays a role similar to that of an input gate in an LSTM, and g_t(t'), t' < t, operates like the forget gate which, depending on the input at time t, dynamically downweights the contribution of past time steps. When augmented with standard position embeddings as commonly used in Transformers, this allows the model to learn the notion of recency while having access to the full context of the history.
From this view, σ_q can also be interpreted as the output gate, for which we found that both sigmoid and relu work well, with the former being slightly better. Also note that the same space and time complexity still holds for σ_k = softmax, both in training and decoding. In our experiments, unless otherwise mentioned, we use the sigmoid + softmax setting for σ_q and σ_k by default. Relation to MHA: Although AFT performs fundamentally different operations than standard attention, we show that the two families of models overlap in an extreme case. To see this, we consider the limit of the number of heads in MHA, which amounts to letting d_k = 1 for each head. In this case, the dot product operation within each head reduces to a scalar product. Next, we set σ to be relu instead of softmax in Equation 1. We then have:

f_i(Q, K, V) = [Q_i (K_i)^T]_+ V_i
             = ([Q_i]_+ [(K_i)^T]_+ + [-Q_i]_+ [-(K_i)^T]_+) V_i
             = [Q_i]_+ ([K_i]_+^T V_i) + [-Q_i]_+ ([-K_i]_+^T V_i),

where [·]_+ denotes the relu operator, and Q_i, K_i, V_i ∈ R^{T×1} by definition. The concatenated output of the attention heads can then be concisely written as:

f(Q, K, V) = [Q']_+ ⊙ Σ_{t=1}^{T} [K'_t]_+ ⊙ V'_t + [-Q']_+ ⊙ Σ_{t=1}^{T} [-K'_t]_+ ⊙ V'_t,   (6)

which consists of two terms, each of which is an AFT operation with σ_q(x) = σ_k(x) = [x]_+ and σ_q(x) = σ_k(x) = [-x]_+, respectively. However, note that this correspondence is not general, i.e., AFT does not need to approximate any MHA counterpart, and can indeed have very different inductive biases from those of a standard Transformer. Relation to Linearized Attention: A few recent works propose to linearize the dot product attention (Linear Attention) from the view of kernel approximation, first proposed in Katharopoulos et al. (2020) and also in concurrent work (Choromanski et al., 2020).
Katharopoulos et al. (2020) propose the linear attention operation in the form:

Y_t = (φ(Q_t) Σ_{t'=1}^{t} φ(K_{t'})^T V_{t'}) / (φ(Q_t) Σ_{t'=1}^{t} φ(K_{t'})^T),   (7)

where Q_t, K_t, V_t are all row vectors of R^d. Equation 7 is similar to AFT in the sense that the key and value are first combined and reduced in both cases. However, AFT differs in two aspects: 1) the time complexity of Linear Attention is O(Td^2), which is linear in the sequence length but has difficulty scaling to wide networks; 2) Linear Attention is designed to approximate MHA, where the nonlinearity on the query and key is shared. In AFT, however, we show that it is beneficial to search for different nonlinearities for the query and the key.
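The constant-cost decoding discussed above can be sketched as a per-step state update. This illustrative version assumes the causal softmax key nonlinearity, so the O(d) state keeps running numerator and denominator sums rather than a single KV vector; names and details are ours, not the authors' implementation.

```python
import numpy as np

def aft_decode_step(q_t, k_t, v_t, state):
    """One decoding step of causal AFT-softmax with O(d) state.
    state = (num, den): running sums of exp(K') * V' and exp(K').
    q_t, k_t, v_t: (d,) vectors for the current position."""
    num, den = state
    w = np.exp(k_t)
    num = num + w * v_t                     # accumulate exp(K'_t) * V'_t
    den = den + w                           # accumulate the softmax normalizer
    y_t = (1.0 / (1.0 + np.exp(-q_t))) * (num / den)
    return y_t, (num, den)
```

Each step touches only d-dimensional vectors, in contrast to the O(Td) per-step cost of caching keys and values in a standard Transformer.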

3.2. LOCAL CAUSAL AFT

In autoregressive modeling, locality is a strong and effective inductive bias, as has been explored in Chen et al.; Child et al. (2019). We similarly propose an augmented version of causal AFT, where we rewrite Equation 3 as:

Y_t = σ_q(Q'_t) ⊙ (Σ_{t'=1}^{t} w_{t,t'} σ_k(K'_{t'}) ⊙ V'_{t'}),  t = 1, ..., T,

where w_{t,t'} ∈ R is a locality masking scalar. We consider two strategies of constructing w. The first is a hard local mask, where w_{t,t'} = 1 if t − t' < s and 0 otherwise, with s being the desired window size (for 2d inputs such as images, we can similarly construct 2d windows; see Appendix for details). The second, which works better in practice, is to learn a position based local bias, while still assigning non-zero weights to out-of-window contexts. More concretely, we let

w_{t,t'} = exp(I(t − t' < s) u_t^T v_{t'}) / Σ_{t''=1}^{t} exp(I(t − t'' < s) u_t^T v_{t''}),

where I(·) is an indicator function, and u, v ∈ R^{T×d_u} are two sets of low dimensional learnable position embeddings, independently learned per layer. Note that in this case we maintain a dense connection between every t, t' pair, while introducing learnable biases for more recent contexts. We typically set d_u to a small number (e.g., 64), which greatly reduces the amount of additional parameters compared to learning a full matrix. The learned version also adds very little overhead to both the time and space complexity, as w is static, so its memory cost is amortized across batches during training. We denote the two versions as AFT-local-hard and AFT-local-learned, respectively.

Figure 1: AFT blocks require only element-wise and pooling operations. (a) Causal attention free operation. (b) Local causal attention free operation.
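The learned position-based local bias above can be sketched as follows. The embedding shapes follow the text; the masking and normalization details are an illustrative reading of the formula, and the function name is ours.

```python
import numpy as np

def local_bias_weights(u, v, s):
    """w[t, t'] = softmax over t' <= t of I(t - t' < s) * u_t . v_t'.
    u, v: (T, d_u) learnable position embeddings; s: window size.
    Out-of-window (but past) positions keep a bias of 0, i.e. non-zero weight."""
    T = u.shape[0]
    logits = u @ v.T                                     # (T, T) low rank position biases
    idx = np.arange(T)
    inside = (idx[:, None] - idx[None, :]) < s           # I(t - t' < s)
    logits = np.where(inside, logits, 0.0)               # zero bias outside the window
    causal = idx[None, :] <= idx[:, None]                # restrict to t' <= t
    logits = np.where(causal, logits, -np.inf)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because u and v are (T, d_u) with small d_u, the parameter count is far below that of a full T × T matrix, as the text notes.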
4. RELATED WORK

Since the Transformer was introduced, there have been numerous attempts to address the major source of inefficiency in the architecture, the quadratic cost of the attention operation. Improving this operation can enable larger context sizes and more efficient implementations. For a comprehensive recent survey of efficient Transformers, see (Tay et al., 2020c).

Approximating the dot product. Katharopoulos et al. (2020) and Choromanski et al. (2020) both propose to approximate the exponential kernel with the inner product of projections, which leads to a linearized attention operation of complexity O(Td^2). AFT is similar but offers greater efficiency of O(Td) due to the exclusive use of element-wise operations, as well as more design flexibility. Reformer (Kitaev et al., 2020) applies LSH as an approximation to the dot product, whereas AFT completely gets rid of the need for the dot product.

Sparse, local attention. Sparse Transformers (Child et al., 2019) and the Image Transformer (Parmar et al., 2018) propose to use fixed sparse or local context patterns. Attention models in vision tasks (often combined with convolutions) use image structure to help handcraft relevant spatial patterns to attend over (Wang et al., 2020a; Huang et al., 2019b; Zhu et al., 2019; Huang et al., 2019a; Ramachandran et al., 2019). AFT also borrows the locality idea, but we use it as a bias rather than a hard constraint (see AFT-local-learned). Also, AFT is a standalone module that works as a plug-in replacement for MHA in autoregressive tasks.

Context compression.

Other approaches try to learn context patterns. Adaptive-Span Transformers (Sukhbaatar et al., 2019) learn a range for each attention head within which to attend. Routing Transformers (Roy et al., 2020) use clustering to compute dot-product attention only over a subset of elements within the same cluster. The Linformer (Wang et al., 2020b) reduces the length of the context by compressing the keys and values with a linear layer. Compressive Transformers (Rae et al., 2020) compute and update reduced representations of inputs that are far enough back in the input sequence, and attend to those compressed representations. AFT is largely complementary to these approaches, as our focus is to improve complexity at the operation level for any given sequence. Eliminating dot product attention. Instead of limiting the number of comparisons, other methods change the operation used to compute attention. The Synthesizer (Tay et al., 2020a) uses attention weights predicted from inputs, rather than derived from dot-product interactions. The LightConv module introduced in (Wu et al., 2019) replaces dot product self-attention with dynamic lightweight depthwise convolution, where the weights are normalized across the temporal dimension. The Sinkhorn Transformer (Tay et al., 2020b) uses a differentiable sorting operation to identify relevant comparisons that may not be local in the original sequence order. AFT can be viewed as a more drastic version of this direction, where we adopt a single global "attention mask" (w) of all ones (vanilla AFT) or with a few learnable entries (AFT-local-learned). Gated RNNs. AFT is also related to the classic line of work on gated RNN variants, including LSTMs (Hochreiter & Schmidhuber, 1997), GRUs (Chung et al., 2014) and Quasi-RNNs (Bradbury et al., 2016). AFT maintains the benefits of RNN models (linear complexity w.r.t. sequence length, constant decoding cost), but offers greater parallelism and effectiveness, thanks to the use of a simple context reduction operation that is amenable to a fully parallel implementation during training. We believe that AFT also offers new perspectives for rethinking the success and limitations of gated RNNs. Dynamic Convolution. AFT is also related to dynamic convolution (Wu et al., 2019) when applied to auto-regressive tasks, where the reduced key-value representation can be interpreted as a per-sequence convolutional kernel. However, AFT operates in an extreme case where the dimension of the kernel is 1 along both the feature and spatial dimensions, again yielding superior efficiency.

5. EXPERIMENTS

We conduct experiments on five tasks: unconditional image modeling (Sec. 5.2), language modeling (Sec. 5.3), machine translation (Sec. 5.4), image super resolution (Sec. B.2) and point cloud generation (Sec. B.3). We focus on the causal mode of AFT, while leaving systematic evaluation of the non-causal version as future work. Unless otherwise mentioned, all experiments are conducted on 8×V100 GPU machines.

5.1. EFFICIENCY

To support our analysis in Table 1, we benchmarked an implementation of AFT on a single forward and backward pass on random data with a batch size of 4. We compared AFT with the self-attention from Transformers, a linear attention mechanism from (Katharopoulos et al., 2020), and a Reformer (Kitaev et al., 2020). For all of these, we used a base architecture with 12 layers, 8 heads (except for AFT, which has no heads), a feature dimension of either 256 or 1024, and context lengths up to 10,000. We used the code from the fast-transformers library to perform the evaluations (see footnote 3). Results in terms of runtime and peak GPU usage are shown in Figure 2; where data points do not exist in the figure, the model exhausted GPU memory. We see that, compared to Transformers and Reformers, AFT and linear attention require far fewer computational resources as the context grows. In addition, AFT is not as sensitive to the feature dimension as linear attention, which expands the design space of feasible models. In all settings, AFT performs best. Additionally, we examine decoding speed in the same benchmark. In Figure 3, we see that the total runtime of AFT increases linearly with the context length, and is faster than linear attention and Transformers with self-attention.

5.2. UNCONDITIONAL IMAGE MODELING

In our first set of experiments, we consider the problem of image modeling by minimizing the negative log likelihood (NLL). Similar to Parmar et al. (2018), we represent an RGB image as a sequence of length H × W × 3, with H, W being the height and width, respectively. Each sub-pixel is represented as a 256-way discrete variable. We use CIFAR10 for image density modeling. Feasibility study and choice of nonlinearities. We first conduct experiments validating the legitimacy of AFT and its four nonlinearity variants. Our reference Transformer design largely follows that of Chen et al., where a transformer block consists of an attention layer (an AFT layer in our case) with a residual connection and a 2 layer MLP with residual connections. Layer Normalization (LN) (Ba et al., 2016) is applied in a "pre-act" fashion. We adopt learned position embeddings, and use a set of shared token embeddings and prediction heads across RGB. Our base architecture consists of 24 Transformer blocks, each with d = 256 dimensional features. The hidden layer of the MLP in each block has 4× the dimensionality of its input. We use Adam, and follow a standard warmup learning rate schedule as in Vaswani et al. (2017). We use an initial learning rate of 3 × 10^-3, a weight decay of 0.1 applied to all linear transformation weights, and a dropout of 0.1. We adopt simple data augmentation: during training, we first randomly flip each image horizontally, then add or subtract a value in the range [-10, 10] from all its subpixels, and clip the resulting pixel values to [0, 255]. We use the cross entropy loss, and a default batch size of 128 for 200 training epochs. We train four versions of AFT, namely AFT-relu2 (Equation 6), AFT-relu (σ_q = σ_k = relu), AFT-relu-softmax (σ_q = relu, σ_k = softmax), and AFT-sigmoid-softmax (σ_q = sigmoid, σ_k = softmax), all of which use full contexts. We show the training and test loss curves in Figure 4.
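The augmentation pipeline above (random horizontal flip, a shared intensity shift in [-10, 10], then clipping to [0, 255]) might be sketched as follows; whether the offset is shared per image and how it is sampled are assumptions about details the text leaves open.

```python
import numpy as np

def augment(img, rng):
    """img: (H, W, 3) uint8 image. Randomly flip horizontally, add one
    shared offset in [-10, 10] to all subpixels, and clip to [0, 255]."""
    x = img.astype(np.int16)                 # widen dtype so the shift cannot wrap around
    if rng.random() < 0.5:
        x = x[:, ::-1, :]                    # horizontal flip
    x = x + rng.integers(-10, 11)            # assumed: one offset per image
    return np.clip(x, 0, 255).astype(np.uint8)
```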
All versions of AFT are trainable with standard optimization techniques for Transformers. AFT-relu2 performs slightly worse than AFT-relu, and both are significantly worse than AFT-relu-softmax and AFT-sigmoid-softmax. AFT-sigmoid-softmax also outperforms the strong PixelCNN++ baseline (Salimans et al., 2017). Based on this observation, we use AFT-sigmoid-softmax as the default setting for all remaining experiments, unless otherwise mentioned. Comparing with the state of the art. CIFAR10 is a crowded benchmark for image autoregressive modeling, and we compare with a few competitive baselines, as shown in Table 2. Note that CIFAR10 has an unrolled sequence length of 3072, which is already prohibitive for training a full Transformer of reasonable size. For example, for a standard 12 layer, 512 dimension, 8 head configuration, the maximum batch size we can fit on our 8 V100 node is only 16, which is already infeasible. Our closest baseline is the Image Transformer (Parmar et al., 2018), which restricts attention to local 2d windows of size 256. We test our AFT-local-learned and AFT-local2d-hard variants with the same window size, using both the same architecture and a deeper but narrower one (24 layers and 256 dimensions), which are still fair comparisons. We also compare to Sparse Transformers (Child et al., 2019), which restrict attention to a sparse but global subset of context elements. From Table 2, we see that all AFT variants outperform the Image Transformer baseline. Both AFT local versions are better than the full-context counterpart, with AFT-local-learned being significantly stronger than the others. We also observe that the deeper but narrower architecture is more effective than the shallower but wider baseline. Our best model also achieves the state-of-the-art result on CIFAR10 in this setting, outperforming a much larger Sparse Transformer model. Efficiency wise, we benchmarked the Image Transformer against AFT variants on an 8 V100 GPU node (see footnote 4). All our variants are faster than the Image Transformer, while consuming only half of the memory (see footnote 5).

5.3. LANGUAGE MODELING

We apply AFT to character level language modeling on Enwik8 (Mahoney, 2011), another popular benchmark for auto-regressive modeling. We follow the standard preprocessing procedures and training/validation/test splits as in (Dai et al., 2019). Our base Transformer reference is a 12 layer, 512 dimensional, 8 head architecture with 2048 feed forward dimensions. For the first set of experiments, we use a sequence length of 1024. Our training protocol is largely the same as before, except that we increase the weight decay to 0.5 and train for 100 epochs with batch size 128 in all experiments. We evaluate the AFT-local-learned variant with a window size of 32 and d_u = 256. We also compare to several efficient Transformer baselines, namely Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020a) and Linear Transformer (Katharopoulos et al., 2020). From Table 3, we see that with the base L = 12, d = 512 architecture, AFT achieves the lowest training bits per character (bpc), indicating its high capacity. Its test performance is slightly worse than that of the basic Transformer, but outperforms all three other variants. The deeper and narrower architecture of AFT strikes the best balance across parameters, speed, memory and performance. Its test bpc is only 0.02 away from the full Transformer's, while consuming only a third of the memory and providing a 44% speedup. Finally, we also trained the same architecture with a sequence length of 2048, which improves performance on both the training and test sets. This suggests AFT's ability to effectively model long range dependencies.

5.4. MACHINE TRANSLATION

As a machine translation benchmark, we show experiments with the WMT 2014 English to German translation task. The training set contains approximately 4.5 million sentence pairs. We compare against a Transformer architecture baseline using the OpenNMT implementation (Klein et al., 2017). For translation, the standard architecture is an encoder-decoder structure, where the encoder uses non-causal attention to encode the input sentence. The decoder uses two different types of attention: the first, self attention, sequentially attends to the output translation as it is generated token by token; the second attends to the translation and the context from the encoder. In our experiments, we replace the multi-head decoder self-attention blocks with AFT. We compare perplexity (PPL), BLEU score, and efficiency between the Transformer base and AFT in Table 5. On this task, we see that AFT performs on par with the Transformer. As expected for the small context size, typically around 50 tokens, AFT does not show dramatic improvements in speed or memory.

6. CONCLUSIONS

We have introduced the Attention Free Transformer, which replaces attention with an efficient, easy-to-implement new operation. We have demonstrated strong results on challenging benchmarks, despite the simplicity of our design. We believe that our model opens a new design space for Transformer-like models, and will see impact in various areas where Transformers are applied.

B.2 IMAGE SUPER RESOLUTION

We also consider a super-resolution task based on pixel-wise image generation. Following (Dahl et al., 2017; Parmar et al., 2018), we enlarge an 8 × 8 sized image to 32 × 32. We use the CelebA dataset (Liu et al., 2015) as the benchmark. Our baseline model is the Image Transformer (Parmar et al., 2018) with its encoder and decoder connected through the attention mechanism. Both the 1D and 2D local Image Transformer models have L = 12 layers and d = 512.

Table 6: Image super resolution results on CelebA. Our AFT models outperform the PixelRecursive baseline (Dahl et al., 2017) in bits/dim (the lower the better), and show clear advantages in parameter efficiency and memory saving over Image Transformers (Parmar et al., 2018), with comparable or even better performance.

B.3 POINT CLOUD GENERATION

In addition to images and text, we explore modeling point clouds randomly sampled from objects in the ShapeNetCore v2 dataset (Chang et al., 2015). Each point cloud consists of 2048 points. Following Nash et al. (2020), the points are sorted into a sequence in the order of z, y and x, then uniformly 8-bit quantized based on their positions. Our preliminary results of point cloud generation are shown in Table 7, and examples of generated point clouds are shown in Figure 8. We see that our model is able to generate self consistent objects with fine details and great diversity.
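The preprocessing described above (sort points by z, then y, then x, and quantize each coordinate uniformly to 8 bits) could look like the following sketch; the per-axis min-max normalization before quantization is an assumption, as the text does not specify the coordinate range.

```python
import numpy as np

def preprocess_point_cloud(points):
    """points: (N, 3) float array of (x, y, z) coordinates.
    Returns an (N, 3) uint8 sequence, ordered by z, then y, then x."""
    # np.lexsort treats the LAST key as primary, so z is the primary sort key.
    order = np.lexsort((points[:, 0], points[:, 1], points[:, 2]))
    pts = points[order]
    lo, hi = pts.min(axis=0), pts.max(axis=0)        # assumed: per-axis min-max normalization
    q = (pts - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.clip((q * 255).round(), 0, 255).astype(np.uint8)
```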



Footnotes:
1. We adopt the NumPy style broadcasting convention: https://numpy.org/doc/stable/user/theory.broadcasting.html
2. We assume here that Y_t includes input information at the current position t; the version where the current position is excluded can be obtained by shifting the outputs to the right.
3. https://github.com/idiap/fast-transformers
4. We use a batch size of 32, which is the largest batch size Image Transformer can fit.
5. A fair comparison against Sparse Transformer is infeasible, as it relies on a set of advanced implementation tricks such as mixed precision and gradient checkpointing, whereas AFT is implemented with standard PyTorch utilities run in full precision.




Figure 2: Comparisons of efficiency between models for a forward and backward pass with batchsize of 4 on a single GPU with 32 GB of RAM.

Figure 5: Image completion with test examples.

Figure 6: Upscaled images from baseline and our 2D local transformers on CelebA.

Figure 7 shows more samples from different models trained on CelebA face images.

Figure 7: Upscaled images from baseline 1D/2D local Image Transformers (Parmar et al., 2018) and our AFT-local2D model trained on CelebA.

Figure 8: Point clouds generated by AFT trained on airplane point clouds.



Figure 4: Proof of concept experiments for AFT-relu2, AFT-relu, AFT-relu-softmax and AFT-sigmoid-softmax on CIFAR10. All versions train well with standard optimization settings. AFT-relu2 and AFT-relu perform similarly, while AFT-relu-softmax and AFT-sigmoid-softmax are more stable and yield significantly better results.

Table 2: NLL results on CIFAR10, evaluated in bits/dim (the lower the better). Speed and memory are measured during training, with a batch size of 32 across 8 V100 GPUs. AFT achieves the state-of-the-art result in this setting, with many fewer parameters, better speed and significantly less memory.

Table 3: Enwik8 results, measured in bits per character (bpc), the lower the better. Baselines compared are Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020a) (its best performing dense version) and Linear Transformer (Katharopoulos et al., 2020). Speed and memory are measured during training, with a batch size of 128 on an 8 V100 GPU node.

On the local window size. In all our experiments, AFT-local-learned demonstrates superior performance compared to other variants. In order to validate its efficacy, we performed additional experiments with the L = 24, d = 256 architecture, fixing everything but the local window size s. We show the results in Table 4, where we see that both the training and testing bpc form a U-shape w.r.t. the window size, with 32 achieving the best performance.

Table 4: Training and testing bpc w.r.t. the local window size for AFT-local-learned.

Table 5: WMT 2014 English-to-German translation.

Table 7: Results on ShapeNetCore v2, evaluated in bits/dim (the lower the better).

A APPENDIX

B LOCAL AFT

Algorithm 1: Pseudo code of an efficient, in-place causal AFT-softmax / AFT-softmax-local1d. Input: query, key and value Q', K', V' ∈ R^{T×d}; optionally a context size s ∈ {2^n, n ∈ N}. For each position, the keys are normalized according to softmax and multiplied with the query.

Algorithm 2: Pseudo code of an efficient, in-place causal AFT-softmax-local2d. Input: query, key and value Q', K', V' ∈ R^{T×d}.

Here we show visualizations of our best performing model trained on CIFAR10 (with test bits/dim 2.81). In Figure 5, we sample 32 test images and mask out the bottom half of each. We then use the model to sample the remaining pixels, one at a time. We see that the model provides consistent completions in most cases.
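As a dense (non-in-place) illustration of the computation Algorithm 1 targets, a NumPy sketch of causal AFT-softmax with an optional hard 1d window follows. This is a reconstruction from Equation 3 with the hard local mask, not the authors' in-place algorithm; it materializes a (T, T, d) weight tensor and so trades the algorithm's efficiency for clarity.

```python
import numpy as np

def aft_softmax_local1d(Qp, Kp, Vp, s=None):
    """Causal AFT-softmax with an optional hard local window.
    Qp, Kp, Vp: (T, d); s: window size (None = full context).
    Y_t = sigmoid(Q'_t) * sum_{t'} w[t, t'] * V'_{t'},
    where w[t, :] is the softmax of K' over valid positions t' <= t."""
    T, d = Qp.shape
    idx = np.arange(T)
    mask = idx[None, :] <= idx[:, None]             # causal: t' <= t
    if s is not None:
        mask = mask & ((idx[:, None] - idx[None, :]) < s)  # hard window: t - t' < s
    logits = np.where(mask[:, :, None], Kp[None, :, :], -np.inf)  # (T, T, d)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = e / e.sum(axis=1, keepdims=True)            # softmax over valid t' per feature
    y = (w * Vp[None, :, :]).sum(axis=1)            # (T, d) pooled values
    return (1.0 / (1.0 + np.exp(-Qp))) * y
```

The in-place algorithm in the paper achieves the same result without ever forming the (T, T, d) tensor.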

