RANDOM FEATURE ATTENTION

Abstract

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

1. INTRODUCTION

Transformer architectures (Vaswani et al., 2017) have achieved tremendous success on a variety of sequence modeling tasks (Ott et al., 2018; Radford et al., 2018; Parmar et al., 2018; Devlin et al., 2019; Parisotto et al., 2020, inter alia). Under the hood, the key component is attention (Bahdanau et al., 2015), which models pairwise interactions of the inputs, regardless of their distances from each other. This comes with quadratic time and memory costs, making transformers computationally expensive, especially for long sequences. A large body of research has been devoted to improving their time and memory efficiency (Tay et al., 2020c). Although better asymptotic complexity and prominent gains for long sequences have been achieved (Lee et al., 2019; Child et al., 2019; Beltagy et al., 2020, inter alia), in practice many existing approaches are less well-suited for moderate-length sequences: the additional computation steps required by some approaches can overshadow the time and memory they save (Kitaev et al., 2020; Wang et al., 2020; Roy et al., 2020, inter alia).

This work proposes random feature attention (RFA), an efficient attention variant that scales linearly in sequence length in terms of time and space, and achieves practical gains for both long and moderate-length sequences. RFA builds on a kernel perspective of softmax (Rawat et al., 2019). Using well-established random feature maps (Rahimi & Recht, 2007; Avron et al., 2016; §2), RFA approximates the dot-then-exponentiate function with a kernel trick (Hofmann et al., 2008): exp(x · y) ≈ φ(x) · φ(y). Inspired by its connections to gated recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and fast weights (Schmidhuber, 1992), we further augment RFA with an optional gating mechanism, offering a straightforward way of learning with recency bias when locality is desired.
RFA and its gated variant (§3) can be used as a drop-in substitute for the canonical softmax attention, and increase the number of parameters by less than 0.1%. We explore its applications in transformers on language modeling, machine translation, and long text classification (§4). Our experiments show that RFA achieves performance comparable to vanilla transformer baselines in all tasks, while outperforming a recent related approach (Katharopoulos et al., 2020). The gating mechanism proves particularly useful in language modeling: the gated variant of RFA outperforms the transformer baseline on WikiText-103. RFA shines in decoding, even for shorter sequences: in our head-to-head comparison on machine translation benchmarks, RFA decodes around 2× faster than a transformer baseline, without loss in accuracy. Comparisons to several recent efficient transformer variants on three long text classification datasets show that RFA is competitive in terms of both accuracy and efficiency. Our analysis (§5) shows that more significant time and memory efficiency improvements can be achieved for longer sequences: a 12× decoding speedup with less than 10% of the memory for 2,048-length outputs.

2. BACKGROUND

2.1 ATTENTION IN SEQUENCE MODELING

The attention mechanism (Bahdanau et al., 2015) has been widely used in many sequence modeling tasks. Its dot-product variant is the key building block for the state-of-the-art transformer architectures (Vaswani et al., 2017). Let {q_t}_{t=1}^N denote a sequence of N query vectors that attend to sequences of M key and value vectors.¹ At each timestep, the attention linearly combines the values, weighted by the outputs of a softmax:

attn(q_t, {k_i}, {v_i}) = Σ_i [ exp(q_t · k_i / τ) / Σ_j exp(q_t · k_j / τ) ] v_i.   (1)

τ is the temperature hyperparameter determining how "flat" the softmax is (Hinton et al., 2015). Calculating attention for a single query takes O(M) time and space. For the full sequence of N queries the space amounts to O(MN). When the computation cannot be parallelized across the queries, e.g., in autoregressive decoding, the time complexity is quadratic in the sequence length.
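To make Eq. 1 concrete, here is a minimal NumPy sketch of softmax attention for a single query; the function name, shapes, and sample values are ours, for illustration only. The per-query cost is O(M), so attending with all N queries costs O(MN) time and space.

```python
import numpy as np

def softmax_attention(q_t, K, V, tau=1.0):
    """Eq. 1 for a single query q_t: softmax over q_t . k_i / tau,
    then a weighted combination of the values. K: (M, d), V: (M, d_v).
    Cost is O(M) per query, hence O(MN) for a length-N query sequence."""
    scores = K @ q_t / tau        # (M,) dot products with every key
    scores -= scores.max()        # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum()      # softmax normalization
    return weights @ V            # (d_v,) convex combination of values

rng = np.random.default_rng(0)
M, d = 6, 4
K, V = rng.normal(size=(M, d)), rng.normal(size=(M, d))
q = rng.normal(size=d)
out = softmax_attention(q, K, V)
```

As the temperature τ grows, the softmax flattens and the output approaches a uniform average of the values, matching the role of τ described above.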

2.2. RANDOM FEATURE METHODS

The theoretical backbone of this work is the unbiased estimation of the Gaussian kernel by Rahimi & Recht (2007). Based on Bochner's theorem (Bochner, 1955), Rahimi & Recht (2007) proposed random Fourier features to approximate a desired shift-invariant kernel. The method nonlinearly transforms a pair of vectors x and y using a random feature map φ; the inner product between φ(x) and φ(y) approximates the kernel evaluation on x and y. More precisely:

Theorem 1 (Rahimi & Recht, 2007). Let φ : R^d → R^{2D} be a nonlinear transformation:

φ(x) = √(1/D) [sin(w_1 · x), …, sin(w_D · x), cos(w_1 · x), …, cos(w_D · x)]^⊤.

When the d-dimensional random vectors w_i are independently sampled from N(0, σ² I_d),

E_{w_i}[φ(x) · φ(y)] = exp(−‖x − y‖² / 2σ²).   (2)

The variance of the estimate is inversely proportional to D (Appendix A.2; Yu et al., 2016). Random feature methods proved successful in speeding up kernel methods (Oliva et al., 2015; Avron et al., 2017; Sun, 2019, inter alia), and have more recently been used to efficiently approximate softmax (Rawat et al., 2019). In §3.1, we use this method to derive an unbiased estimate of exp(x · y) and, from it, an efficient approximation to softmax attention.
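As a sanity check on Theorem 1, the following sketch (our own, with σ = 1 and illustrative names) samples D random vectors, builds φ, and verifies that φ(x) · φ(y) approximates the Gaussian kernel exp(−‖x − y‖²/2); the last lines rescale that estimate into an approximation of exp(x · y), the dot-then-exponentiate quantity needed for softmax attention.

```python
import numpy as np

def phi(x, W):
    """Random feature map of Theorem 1 with sigma = 1.
    W: (D, d), rows drawn i.i.d. from N(0, I_d); output lives in R^{2D}."""
    proj = W @ x                                   # (D,) projections w_i . x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, D = 4, 20_000
W = rng.normal(size=(D, d))
x, y = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3

# phi(x) . phi(y) is an unbiased estimate of exp(-||x - y||^2 / 2).
approx = phi(x, W) @ phi(y, W)
exact = np.exp(-np.sum((x - y) ** 2) / 2)

# Rescaling yields an estimate of exp(x . y), because
# exp(x . y) = exp(||x||^2 / 2) * exp(||y||^2 / 2) * exp(-||x - y||^2 / 2).
dot_exp = np.exp(x @ x / 2) * np.exp(y @ y / 2) * approx
```

Since the estimator's variance shrinks as 1/D, increasing the number of random features D tightens the approximation at a proportional cost in compute.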

3. MODEL

This section presents RFA (§3.1) and its gated variant (§3.2). In §3.3 we lay out several design choices and relate RFA to prior work. We close with an analysis of RFA's complexity in practice (§3.4).



¹ M = N in self-attention; they may differ, e.g., in the cross attention of a sequence-to-sequence model.

