CAB: COMPREHENSIVE ATTENTION BENCHMARKING ON LONG SEQUENCE MODELING

Abstract

Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.

1. INTRODUCTION

Transformer has achieved great breakthroughs in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021b), speech processing (Schneider et al., 2019; Ren et al., 2021), and biology (Jumper et al., 2021; Brandes et al., 2022). A major drawback of the transformer architecture is its quadratic complexity in both time and memory. This problem has become more evident with the ever-increasing need to apply transformers to longer sequence modeling in different domains. Recent research on efficient attention mechanisms responds to this problem by improving attention efficiency while preserving efficacy (Wang et al., 2020; Kitaev et al., 2019; Zaheer et al., 2020; Choromanski et al., 2020; Zheng et al., 2022). The commonly adopted test bed for benchmarking efficient transformers on long sequences is Long Range Arena (LRA; Tay et al., 2020a), consisting of both synthetic probing tasks and real-world tasks. However, all of its tasks focus on the self attention setting, ignoring cross attention and causal attention, which are equally important and often more challenging. In other words, the transformer model is only used as a sequence encoder in LRA, while in real applications, cross attention is essential for conditionality modeling tasks such as sequence-to-sequence (Bahdanau et al., 2015), data-to-text (Dušek et al., 2020), and knowledge-enhanced models (Liu et al., 2021a), and causal attention is critical for causality modeling tasks such as text generation (Vaswani et al., 2017; Zhang et al., 2020), language modeling (Radford et al., 2019; Brown et al., 2020), and speech synthesis (Li et al., 2019).
Another potential drawback of LRA, as discovered by researchers recently, is that with proper tuning, the performance gap between different transformer variants on these tasks can be insignificant (Xiong et al., 2022; Ivgi et al., 2022; Shaham et al., 2022), impairing its effectiveness as a standard benchmark. To address these problems, we propose the Comprehensive Attention Benchmark (CAB) for long sequence modeling. Above all, we present a fine-grained attention taxonomy, considering attentive functionality in conditionality and causality modeling. We present four patterns of attention, namely noncausal self, causal self, noncausal cross, and causal cross, representing distinguishable attentive functionality for sequence modeling (§2). With that in mind, we collect seven real-world tasks from the diverse fields of computer vision, natural language processing, speech processing, and time series forecasting (§3). Across these tasks, CAB includes rich backbone architectures to evaluate the attention mechanisms, testing their performance and generalization ability. Given the four attention patterns defined by the attention taxonomy, we advocate a pattern-wise comparison between attention mechanisms, evaluating their attentive functionality respectively. We conduct exhaustive experiments on CAB, assessing nine widely-used efficient attentions designed with different philosophies (§4). The experimental results reveal several insights into designing efficient attention architectures. First, we show that existing efficient transformers claiming comparable or even superior performance to vanilla attention often achieve less competitive results in the causal cross scenario, indicating that efficient attentions' modeling capability does not always generalize across attention patterns.
Second, by quantifying the efficiency of attentions in long sequence contexts using efficiency length (i.e., the minimum length at which a sub-quadratic efficient model surpasses vanilla attention in efficiency), we disclose the underlying inefficiency of existing efficient attention methods in modeling relatively short sequences. Third, we investigate interpolation and extrapolation on a long-context language model, and find that efficient attentions such as local attention and LongShort Transformer are promising for scaling to long-context language modeling. We hope that CAB and our elaborate experimental efforts shed light on the fundamental problems of efficient attentions and inspire the design of advanced attention mechanisms. CAB and all the related code will be released at https://github.com/Anonymous.

2. ATTENTION TAXONOMY

Attention methods are designed to capture token-to-token dependency within and across sequences. Let Y = {y_1, y_2, . . . , y_n} and Y ∈ R^{n×d} be a target sequence and its feature matrix, and X = {x_1, x_2, . . . , x_m} and X ∈ R^{m×d} be a source sequence and its feature matrix, where n and m are the target and source sequence lengths respectively, and d denotes dimensionality. Note that X could be the same as Y in the case of self attention. The target and source feature matrices are transformed into query-key-value feature matrices Q = YW^Q ∈ R^{n×d}, K = XW^K ∈ R^{m×d}, V = XW^V ∈ R^{m×d} with learned parameter matrices W^Q, W^K, W^V. The vanilla transformer (Vaswani et al., 2017) learns to integrate the key-value feature matrices K, V into the query matrix Q = {q_1, q_2, . . . , q_n} with token-to-token attention:

Attn(Q, K, V) = softmax(QK^⊤ / √d) V    (1)

in which we omit the multihead notation without loss of generality. Vanilla attention in Eq. 1 advocates a token-to-token alignment matrix QK^⊤, which results in quadratic complexity O(nm) in both computational time and memory usage. It is observed that Attn(·) in Eq. 1 acts as an integration model, projecting the query features Q by integrating key-value features without changing Q's shape. Beyond the vanilla token-to-token attention mechanism, we extend Eq. 1 to a more general form of attention family:

Attn(Q, K, V) = f(Q; K, V): R^{n×d} → R^{n×d}, where Attn_i(Q, K, V) = f_i(q_i; Q, K, V), i = 1, . . . , n    (2)

where Q is treated as the main variable and K, V as conditional ones. Compared with Eq. 1, Eq. 2 covers efficient attention methods that do not explicitly model the token-to-token alignment. In usage, attentions are modified to fit generative models with underlying structures, such as conditional models and causal models, to achieve specific attentive functionality. In this section, we put forward a fine-grained attention taxonomy, considering the conditionality and causality of attentions.
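For concreteness, Eq. 1 can be sketched in a few lines of NumPy. This is a minimal single-head sketch (not any model's actual implementation); the (n, m) alignment matrix it materializes is exactly what makes the cost quadratic:

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Eq. 1: softmax(Q K^T / sqrt(d)) V; materializes the (n, m) alignment."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, m) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (n, d): Q's shape is kept

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
Y, X = rng.normal(size=(n, d)), rng.normal(size=(m, d))          # target, source
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))      # learned in practice
out = vanilla_attention(Y @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```

Note how the output inherits the query's length n regardless of the source length m, matching the "integration model" view of Eq. 2.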
We show their challenges and potential impact on real-world applications in modeling conditionality and causality. Under the taxonomy, we present four attention patterns with different attentive functionality, as shown in Figure 1.

Self Attention & Cross Attention  Self attention (Lin et al., 2017; Vaswani et al., 2017; Dosovitskiy et al., 2020) learns contextual features within a sequence, whereas cross attention (Bahdanau et al., 2015) learns to integrate features across sequences. Both are highly effective in sequence learning, but designing cross attention is more challenging for efficient attentions without explicit token-to-token alignment. In terms of computation, Q ∈ R^{n×d} and K, V ∈ R^{m×d} have different shapes, so element-wise operations along the length dimension, such as the cross-covariance used in XCiT (Ali et al., 2021), are impossible. Besides, unlike self attention, cross attention lacks implicit alignment between Q and K, V, such as neighborhood information. Cross attention learns to integrate the conditional features K, V into the main features Q, and has been successfully applied to real-world applications such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), data-to-text generation (Dušek et al., 2020; Shen et al., 2020), and knowledge-enhanced models (Jin et al., 2020; Liu et al., 2021a).

Noncausal Attention & Causal Attention  While modeling a sequence, noncausal models have a holistic view of the whole sequence, whereas causal ones only have access to previous tokens. This restriction forbids efficient attentions like Nyströmformer (Xiong et al., 2021), which must be conditioned on the whole target sequence to achieve linearity, from being applied in causal models. Furthermore, when training a causal model, efficient attentions such as RFA (Peng et al., 2021) and ABC (Peng et al., 2022) have to provide a different set of features from Y for each query token.
This mechanism therefore leads to multiplied memory consumption compared to their noncausal counterparts. We provide a detailed analysis of the increased computation costs in Appendix B. In real-world applications, causal attentions bring strong generation capability and are widely adopted in text generation (Bahdanau et al., 2015; Vaswani et al., 2017; Zhang et al., 2020), text-to-speech synthesis (Li et al., 2019), and language modeling (Radford et al., 2019; Brown et al., 2020; Raffel et al., 2020; Chen et al., 2021a).

Attention Patterns  As discussed above, conditionality and causality are two orthogonal aspects of modeling sequences. We thus obtain four attention patterns via the Cartesian product {noncausal, causal} × {self, cross}: noncausal self (NS), causal self (CS), noncausal cross (NC), and causal cross (CC). Following the general form of attention in Eq. 2, we formulate the four attention patterns as:

NS_i(Q, K, V) ≜ f_i(q_i; Q, K, V), s.t. Y = X    (3)
CS_i(Q, K, V) ≜ f_i(q_i; Q_{≤i}, K_{≤i}, V_{≤i}), s.t. Y = X    (4)
NC_i(Q, K, V) ≜ f_i(q_i; Q, K, V)    (5)
CC_i(Q, K, V) ≜ f_i(q_i; Q_{≤i}, K, V)    (6)

Figure 1 illustrates the computation diagrams of these four patterns. These attention patterns construct typical sequence models with attention mechanisms. For instance, Transformer (Vaswani et al., 2017) adopts NS attention in its encoder and CS/CC attentions in its decoder, while the recently prevailing Non-Autoregressive Transformer (Gu et al., 2018) adopts NS and NC attentions. Existing benchmarks cover only some of these patterns without the others, leading to an insufficient evaluation of attentive functionality. In the subsequent sections, we introduce the Comprehensive Attention Benchmark (CAB), which evaluates attentions comprehensively on all four patterns.
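As an illustration of Eqs. 3-6, the four patterns can be instantiated with vanilla attention in a minimal NumPy sketch (not any specific efficient attention): causality becomes a mask over source positions, and conditionality is whether K, V come from X or Y:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V, causal=False):
    """Vanilla instantiation of Eqs. 3-6: `causal` masks out future source
    positions (valid for self attention, where target and source align)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 8))  # target features (queries)
X = rng.normal(size=(7, 8))  # source features (keys/values for cross attention)

ns = attend(Y, Y, Y)               # NS: Y = X, holistic view of the sequence
cs = attend(Y, Y, Y, causal=True)  # CS: token i only sees K_<=i, V_<=i
nc = attend(Y, X, X)               # NC: the full source is always visible
# CC: decode step by step; q_i conditions on Q_<=i but on the full source.
cc = np.stack([attend(Y[: i + 1], X, X)[-1] for i in range(len(Y))])
print(np.allclose(cc, nc))  # True
```

For vanilla attention the CC rows coincide with the NC rows, since each q_i depends only on itself and the source; the distinction matters precisely for efficient attentions whose f_i conditions on the whole Q.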

3. COMPREHENSIVE ATTENTION BENCHMARK

We propose Comprehensive Attention Benchmark (CAB) that collects seven tasks covering four research fields, including computer vision, natural language processing, speech processing, and time series forecasting. We introduce these tasks with their datasets, data lengths, evaluation metrics, and backbone neural networks. Then we present a compositional index (CI) to combine individual task performance into a compositional one for direct performance comparison across tasks and models.

3.1. LONG SEQUENCE MODELING TASKS

We collect seven real-world long-sequence modeling tasks in CAB, with data lengths from 300 to 16,000, and include eight widely-used models as backbones to assess typical efficient attention mechanisms. Table 1 summarizes the tasks' statistics, evaluation metrics, and backbone neural networks, and marks the attention patterns (§2) each task requires.

Text-to-Speech Synthesis (TTS)  This task requires models to convert input text sequences (descriptions or narrations produced by a single speaker) into synthesized audio clips sharing the same timbre as the provided speaker. CAB incorporates the LJSpeech dataset (Ito, 2017), whose audio clips are sampled at 22,050 Hz. At this relatively high sample rate, the average sequence length of the processed audio clips is 559. We use the non-autoregressive FastSpeech 2 (Ren et al., 2021) and the autoregressive Transformer-TTS (Li et al., 2019) as backbone networks, and adopt Mel Cepstral Distortion (MCD) and Mel Spectral Distortion (MSD) for objective evaluation.

Summarization (Sum)

This task requires producing a comprehensive summary of multiple input documents without losing important information. We consider the multi-document summarization setting and use the Multi-News dataset (Fabbri et al., 2019) for long sequence modeling. To make the task more challenging, we set the maximum source text and target summary lengths to 4,096 and 400, respectively. Transformer (Vaswani et al., 2017), implemented on PARAGEN (Feng et al., 2022), serves as the evaluation backbone, and ROUGE (R-N) (Lin, 2004) is used to evaluate each attention.

Long Sequence Time-series Forecasting (LSTF)  Long sequence time-series forecasting predicts long-term future behavior from past states. This task evaluates models on three datasets: Electricity Transformer Temperature (ETT), Electricity Consuming Load (ECL), and Weather (Zhou et al., 2021). Following Zhou et al. (2021), we use Informer as the backbone and conduct univariate and multivariate evaluations, averaging the Mean Square Error (MSE) and Mean Absolute Error (MAE) to obtain final scores.

Point Cloud Completion (PCC)  This task is to complete a frame of point cloud data given partial points. CAB adopts the PCN dataset (Griffiths & Boehm, 2019), which provides 2,048 3D point coordinates as input and requires models to output complete point cloud data comprising 14,336 points. PoinTr (Yu et al., 2021) serves as the evaluation backbone. Chamfer Distance (CD) (Huang et al., 2020) and F-Score (Tatarchenko et al., 2019) are selected to evaluate each attention; for clearer comparison, CD metrics are multiplied by 1,000.

Language Modeling (LM)  The language modeling task requires a language model to predict the next token from the preceding context. In this task, we consider long-context language modeling where the context length is prolonged to 8,192.
We use the PG-19 dataset (Rae et al., 2019) and conduct evaluations with the backbone model GPT-2 (Radford et al., 2019) implemented on FAIRSEQ (Ott et al., 2019). Perplexity (PPL) serves as the evaluation metric.

Masked Language Modeling (MLM)  Different from the language modeling task, masked language modeling (Devlin et al., 2019) uses the full context to predict masked tokens. RoBERTa-base (Liu et al., 2019) implemented on FAIRSEQ is adopted as the evaluation backbone. We use the same dataset and evaluation metric as the LM task, while the context length is set to 2,048.

Super-Resolution (SR)

In this task, we aim to convert low-resolution (16 × 16) face images into high-resolution (128 × 128) images. Following Saharia et al. (2022), we train the backbone model SR3 on the Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019) and conduct evaluation on the CelebA-HQ dataset (Karras et al., 2018). Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) (Wang et al., 2004) are selected to evaluate models from different aspects. To test attentions under different patterns, we assign tasks to each pattern as suggested in Table 1. We exclude the LSTF and PCC tasks from the noncausal self attention test because attention contributes little to them (see §4.2).

3.2. COMPOSITIONAL INDEXES

We compute a compositional index (CI) over the seven tasks for performance comparison. CI is a normalized score that balances the influence of different evaluation metrics; a higher CI indicates better overall performance. It is computed as follows: a) we transform all evaluation metrics beforehand so that a higher score indicates better performance; b) we normalize each transformed metric with Z-score normalization; c) after normalization, the scores of the evaluation metrics are averaged within each task, and then further averaged across tasks. We include the details of CI calculation in Appendix C.
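The three-step recipe above can be sketched as follows. The metric names and numbers are illustrative placeholders, not the paper's actual results:

```python
import numpy as np

def compositional_index(scores, higher_is_better, task_of_metric):
    """CI sketch: rows = models, columns = metrics.
    a) flip metrics so higher is better, b) z-score each metric column,
    c) average within each task, then average across tasks."""
    S = np.asarray(scores, dtype=float)
    S = np.where(higher_is_better, S, -S)           # step a: sign-flip "lower is better"
    Z = (S - S.mean(0)) / S.std(0)                  # step b: per-metric z-score
    tasks = sorted(set(task_of_metric))
    per_task = np.stack(                            # step c: mean within each task
        [Z[:, [i for i, t in enumerate(task_of_metric) if t == tk]].mean(1)
         for tk in tasks], axis=1)
    return per_task.mean(1)                         # step c: mean across tasks

# three models; metrics: Sum R-1 (up), LM PPL (down), SR PSNR (up), SR SSIM (up)
scores = [[40.1, 18.0, 23.1, 0.65],
          [41.3, 16.5, 23.4, 0.67],
          [39.0, 20.2, 22.8, 0.61]]
ci = compositional_index(scores, [True, False, True, True],
                         ["Sum", "LM", "SR", "SR"])
print(ci.argmax())  # 1: the second model is best on every metric
```

Averaging within a task first prevents a task with many metrics (e.g. both PSNR and SSIM) from dominating the cross-task mean.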

4.1. MAIN RESULTS

We conduct exhaustive experiments on CAB by applying various efficient attentions to the backbone models and measure their performance under noncausal self, causal self, noncausal cross, and causal cross attention patterns. We extensively study the fundamental problems of efficient attentions, in terms of efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention, and interpolation/extrapolation on long-context language modeling. The details of experimental settings for tasks are shown in Appendix D. 

Compared Efficient Attentions

We benchmark nine widely-used efficient attentions: local attention, kernel-based attentions (LARA (Zheng et al., 2022), cosFormer (Qin et al., 2021), and Performer (Choromanski et al., 2020)), low-rank factorization (Nyströmformer (Xiong et al., 2021)), prototype-based attentions (ABC (Peng et al., 2022) and ProbSparse (Zhou et al., 2021)), hybrid attention (LongShort Transformer (Zhu et al., 2021)), and a state space model (S4D (Gu et al., 2022)), and evaluate these attention architectures in terms of the attention patterns described in §2. Table 2 summarizes their categories and complexities (kernel: LARA, cosFormer, and Performer, all O(n); low-rank factorization: Nyströmformer, O(n); prototype: ABC, O(n), and ProbSparse, O(n log n); hybrid: LongShort, O(n); state space: S4D, O(n)); a detailed description is left to Appendix F.

The main results are shown in Table 3. We observe that efficient attentions achieve performance comparable to vanilla attention under the noncausal self attention pattern, and some of them even surpass it, suggesting that current efficient attentions indeed improve the ability to learn contextual features in long sequences of various modalities. Although their performance is remarkable in the noncausal self pattern, they are less competitive in the other three patterns. As shown in Table 3d, causal cross attentions are extremely hard to design and fall behind vanilla attention by a significant margin. Owing to the challenge of causality modeling, it is notable that, compared with autoregressive models (i.e., Transformer and Transformer-TTS), non-autoregressive or parallel generation models (i.e., FastSpeech 2, SR3, and RoBERTa) are more likely to have efficient variants without significant performance drops. On the dimension of efficient attention, we find that local attention is a strongly competitive efficient model on most real-world tasks, in agreement with previous studies (Xiong et al., 2022), even though it is utterly defeated on LRA. It achieves the highest score of 0.978 under the noncausal self pattern and ranks second under the causal self pattern.
In contrast, S4D, which achieves impressive results on LRA, appears less inspiring than expected on the Sum and MLM tasks under the noncausal self attention pattern.

Efficiency Analysis  We conduct a simulated efficiency experiment under the noncausal self attention pattern. The experiment is performed on a single A100 GPU, where attention mechanisms are fed dummy sequences with lengths of {256, 512, 1024, 2048, 4096, 8192}. Figure 2 shows the time and memory consumption against sequence length for each attention. We find that the advantage of efficient attentions gradually expands as the sequence length increases; moreover, kernel-based and low-rank attentions are more efficient in both time and space. Further, we report the efficiency length to measure the utility of efficient attentions. The efficiency length is defined as the intersection point of the computational time (or memory) curves of an efficient model and vanilla attention, i.e., the minimum length at which a sub-quadratic efficient model surpasses vanilla attention in efficiency. Specifically, we fit regressions to vanilla and efficient attentions according to their theoretical complexity, and then compute the intersection of the efficient model's regression curve with vanilla attention's quadratic regression curve. Figure 2c shows that efficient attentions gain superiority over vanilla attention once the sequence contains thousands of tokens. For noncausal self attention, ABC and Performer have the shortest efficiency lengths on running time and memory usage, respectively. Details of the efficiency lengths and experiment settings can be found in Appendix G.
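The regression-and-intersection procedure for a linear-complexity attention can be sketched as below. The timing numbers are toy values for illustration, not our measurements:

```python
import numpy as np

def efficiency_length(lengths, t_vanilla, t_efficient):
    """Fit t ~ a*n^2 + c to vanilla timings and t ~ b*n + c' to a linear
    efficient attention, then solve a*n^2 - b*n + (c - c') = 0: the larger
    root is the crossover, i.e. the efficiency length."""
    n = np.asarray(lengths, dtype=float)
    a, c = np.linalg.lstsq(np.stack([n**2, np.ones_like(n)], 1),
                           np.asarray(t_vanilla, float), rcond=None)[0]
    b, c2 = np.linalg.lstsq(np.stack([n, np.ones_like(n)], 1),
                            np.asarray(t_efficient, float), rcond=None)[0]
    roots = np.roots([a, -b, c - c2])
    return float(max(roots.real))

lengths = [256, 512, 1024, 2048, 4096, 8192]
vanilla = [1e-6 * n**2 + 0.5 for n in lengths]   # toy quadratic timings (ms)
efficient = [4e-3 * n + 0.5 for n in lengths]    # toy linear timings (ms)
print(round(efficiency_length(lengths, vanilla, efficient)))  # 4000
```

For an O(n log n) method such as ProbSparse, the efficient model's design matrix would use n·log(n) instead of n; the intersection is then found numerically rather than from the quadratic formula.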

4.2. DISCUSSIONS

Performance Consistency Across Attention Patterns  We investigate whether efficient attentions perform consistently across different attention patterns. To achieve this, we compute the Pearson correlation (Benesty et al., 2009) over CI between each pair of attention patterns, using the same set of efficient attentions, for intra-benchmark comparison. We also include CI computed from LRA for inter-benchmark comparison between CAB's four attention patterns and LRA. Figure 3a shows a positive correlation between the noncausal self pattern and LRA, suggesting that attentions developed on LRA do contribute to a wide range of real-world applications. In contrast, cross attention patterns share non-positive correlations with self attention patterns, indicating inconsistent performance on self and cross attention. On the causality dimension, the performances of causal and noncausal models are highly related but not strictly consistent. We also include a task-to-task correlation analysis in Appendix I.

Benefit of Attention  Self attention is designed to capture contextual features, while cross attention integrates non-homologous information. But how much do these features benefit the model? We conduct an ablation study without the attention mechanism. In a preliminary experiment, we find that the causal and cross patterns are indispensable to the backbone models, as expected, while noncausal self attention makes only a limited contribution. Therefore, we focus on the attentive functionality of noncausal self attention. Figure 3b depicts the CI of backbone models with and without attention on each task; detailed CI values can be found in Appendix H. Results show that the attention mechanism improves performance on most tasks. Embarrassingly, after removing the attention mechanism, PoinTr and Informer achieve improvements on the PCC and LSTF tasks, respectively.
A possible reason is that the backbone models contain alternative contextualizing structures, such as convolutional neural networks, which weaken the contribution of the attention mechanism. Besides, we also notice that the current efficient attention family has already surpassed vanilla attention in modeling noncausal self relationships on most tasks.

Interpolation and Extrapolation on Long-Context Language Modeling  Language models with efficient attentions are likely to face unexpectedly long contexts that exceed predefined lengths, resulting in an extrapolation problem. It is critical for efficient attentions to scale to longer sequences while preserving performance on short ones. We therefore conduct an experiment on the language modeling task, training the model with inputs of 8,192 tokens and computing test perplexity on contexts of 256 to 16,384 tokens. Figure 3c shows the results on both interpolation and extrapolation. On one hand, as the context grows to 8,192 tokens, the language models achieve decreasing perplexity: longer contexts indeed improve language modeling, and efficient attentions interpolate well as expected. On the other hand, for sequences with more than 8,192 tokens, the perplexities of all these efficient attention-equipped language models become higher. Notably, local attention and LongShort, which use local features, show only a slight performance drop in extrapolation, while ABC, which uses only global features, fails to extrapolate to longer sequences. Although S4D performs well as a causal attention, it yields a large perplexity on inputs longer than 8,192 tokens, indicating relatively poor extrapolation ability.

5. RELATED WORK

Long Sequence Modeling  Most efficient attention architectures are designed for long sequence modeling. Previous works evaluate their proposed attention modules separately on language modeling (Choromanski et al., 2020; Qin et al., 2021; Peng et al., 2022), masked language modeling (Qin et al., 2021; Peng et al., 2022), image classification (Choromanski et al., 2020; Zheng et al., 2022), and machine translation (Peng et al., 2022; 2021). Without unified criteria for evaluating attention modules, researchers find it hard to distinguish between them and select an appropriate module for their research. Later, Tay et al. (2020a) proposed Long Range Arena (LRA), a benchmark containing five classification tasks across the language and image domains. However, LRA only considers the noncausal self attention pattern and consists mostly of synthetic tasks that are far from real-world applications. Besides, different attention modules perform nearly equally on LRA (Xiong et al., 2022), which makes LRA a limited benchmark. Shaham et al. (2022) proposed SCROLLS for long language sequences, but it ignores long sequence modeling tasks in other fields.

Efficient Attention

The emergence and prevalence of Transformer models (Vaswani et al., 2017), with their strong capability for encoding and generation, have made a strong impression on various applications. To improve the efficiency of attention in Transformer models, previous work has explored efficient attention mechanisms, including sparse attention (Guo et al., 2019a; Ainslie et al., 2020; Beltagy et al., 2020), low-rank attention (Guo et al., 2019b; Chen et al., 2020; Xiong et al., 2021), kernel-based linearized attention (Choromanski et al., 2020; Katharopoulos et al., 2020), prototype attention (Vyas et al., 2020; Zhou et al., 2021), prior attention (Zhang et al., 2018; Ying et al., 2021; Tay et al., 2021), and memory compression (Liu* et al., 2018; Lee et al., 2019; Wang et al., 2020). We refer interested readers to Tay et al. (2020b) and Lin et al. (2021) for a more detailed description of efficient attention.

6. CONCLUSION

In this paper, we propose to evaluate efficient attentions under a more fine-grained attention taxonomy with four distinguishable attention patterns, each of which reflects different attentive functionality. Under the taxonomy, we present the Comprehensive Attention Benchmark (CAB), consisting of seven real-world tasks and eight backbone networks. We conduct exhaustive experiments to benchmark nine typical efficient attentions on CAB. The empirical results show that existing efficient attentions make significant progress in noncausal self attention, but still struggle to catch up with vanilla attention in conditionality and causality modeling. The efficiency analysis shows that although existing efficient attentions enjoy lower computational complexity, most of them only achieve significant efficiency gains when applied to sequences with more than one thousand tokens, especially in the causal attention scenario. Moreover, we find that attention mechanisms, in the noncausal self case, are sometimes not helpful or even harmful to neural models that already contain feature mixing. Our study sheds light on long-context language modeling and shows that efficient attentions such as local attention and LongShort Transformer are promising for extrapolation in long-context language modeling.

Local Attention  Local attention restricts each query to a window of w neighboring tokens (preceding tokens under the causal self attention pattern), which serve as K_w and V_w:

Local(q_i, K_w, V_w) = softmax(q_i K_w^⊤) V_w

where K_w, V_w ∈ R^{w×d} denote the neighborhood tokens within a window of size w around the query token q_i in K and V. The complexity is linear, O(nwd), since w is predefined and independent of the sequence length n. Due to its locality inductive bias, local attention is applicable only to the causal self and noncausal self attention patterns.

Nyströmformer  Nyströmformer (Xiong et al., 2021) adopts the Nyström method to approximate the softmax attention matrix.
It selects r landmarks for Q and K with Segment-means (Shen et al., 2018), producing Q̃ and K̃. Using the iterative Moore-Penrose pseudoinverse (Razavi et al., 2014), Nyströmformer decomposes the softmax attention matrix into three sub-softmax matrices:

Nyströmformer(Q, K, V) = softmax(QK̃^⊤) (softmax(Q̃K̃^⊤))^+ softmax(Q̃K^⊤) V

where (softmax(Q̃K̃^⊤))^+ is the Moore-Penrose pseudoinverse of softmax(Q̃K̃^⊤). The complexity of the Moore-Penrose pseudoinverse is linear in n when its iteration count is constant, so the overall computation cost is O(nrd) for predefined r. Because it must select the same number of landmarks for Q and K and requires the full Q representation, Nyströmformer is only applicable to the noncausal self attention pattern.

Performer  Performer (Choromanski et al., 2020) is another linear transformer, using a random feature map φ(·): R^d → R^r_+ (for some r > 0) as a kernel function to reduce computational complexity. Formally, the random feature map is defined to approximate the exponential kernel as exp(q_i^⊤ k_j) = E_q[φ(q_i)^⊤ φ(k_j)], where φ(x) = exp(W_r x − ½∥x∥² 1_r) ∈ R^r and all elements of W_r are independently drawn from a standard Gaussian q(w) = N(w; 0, 1). Given this random feature map, the whole softmax attention for each query can be approximated as

Σ_{j=1}^m exp(q_i^⊤ k_j) v_j / Σ_{j'=1}^m exp(q_i^⊤ k_{j'}) ≈ (φ(q_i)^⊤ Σ_{j=1}^m φ(k_j) v_j^⊤) / (φ(q_i)^⊤ Σ_{j'=1}^m φ(k_{j'})) = Performer(q_i, K, V)

Thanks to the kernel approximation, Performer achieves O(nrd) complexity in both the noncausal self and noncausal cross attention patterns; however, for causal self and causal cross attention, Performer has to store cumulative summations of both φ(k_j) v_j^⊤ and φ(k_j), which requires a dedicated CUDA operator to accelerate the computation.
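The noncausal Performer estimator can be sketched numerically. This is a simplified NumPy illustration (scaling constants of the feature map are absorbed, and the exact baseline omits the √d scaling to match the exp(q^⊤k) kernel above); it is not the library implementation:

```python
import numpy as np

def phi(x, W):
    """Positive random features phi(x) = exp(W x - ||x||^2 / 2), unbiased for
    the exponential kernel in expectation over Gaussian rows of W."""
    return np.exp(x @ W.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True))

def performer_attention(Q, K, V, W):
    phi_q, phi_k = phi(Q, W), phi(K, W)          # (n, r), (m, r)
    kv = phi_k.T @ V                             # (r, d): sum_j phi(k_j) v_j^T
    z = phi_k.sum(axis=0)                        # (r,):   sum_j phi(k_j)
    return (phi_q @ kv) / (phi_q @ z)[:, None]   # never builds an (n, m) matrix

rng = np.random.default_rng(0)
n, m, d, r = 8, 10, 4, 4096
Q, K = 0.5 * rng.normal(size=(n, d)), 0.5 * rng.normal(size=(m, d))
V = rng.normal(size=(m, d))
W = rng.normal(size=(r, d))                      # rows drawn i.i.d. from N(0, I)

s = np.exp(Q @ K.T)                              # exact exponential-kernel attention
exact = (s / s.sum(-1, keepdims=True)) @ V
approx = performer_attention(Q, K, V, W)
print(np.abs(approx - exact).max())              # shrinks as r grows
```

Because the features are strictly positive, each output row remains a convex combination of the rows of V, mirroring softmax attention; the approximation variance grows with ∥q + k∥², which is why the inputs are kept small in this toy check.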
LARA  LARA (Zheng et al., 2022) further extends the random feature approximation from estimating individual exponential kernels to directly estimating the whole softmax function. In particular, there exists a distribution p that instantiates the random features (Zheng et al., 2022) such that

Σ_{j=1}^m exp(q_i^⊤ k_j) v_j^⊤ / Σ_{j'=1}^m exp(q_i^⊤ k_{j'}) = E_p[ Σ_{j=1}^m φ(q_i)^⊤ φ(k_j) v_j^⊤ / Σ_{j'=1}^m φ(q_i)^⊤ φ(k_{j'}) ].

LARA(q_n, K, V) = (φ(q_n)^⊤ A_n Σ_{j=1}^m φ(k_j) v_j^⊤) / (φ(q_n)^⊤ A_n Σ_{j'=1}^m φ(k_{j'}))

where A_n is a query-specific diagonal matrix that ensures the overall computation still yields a valid self-normalized importance sampling estimate, and the elements of the sampled W_r in φ(x) = exp(W_r x − ½∥x∥² 1_r) may be drawn from distributions beyond the standard Gaussian.

cosFormer  Following the same method defined in Eq. 13, cosFormer (Qin et al., 2021) adopts the kernel function ReLU(·) instead of a random feature map. Because the softmax operator ensures the non-negativeness of value weights and stabilizes the training process, cosFormer designs a new cos-based re-weighting mechanism similar to softmax. Denoting q'_i = ReLU(q_i) and k'_j = ReLU(k_j), the re-weighting formula is

s(q_i, k_j) = q'^⊤_i k'_j cos(π/2 × (i − j)/L)

where L = max(n, m). Using this formula and Ptolemy's theorem, the final expression of cosFormer is

cosFormer(q_i, K, V) = (Σ_{j=1}^m q^cos_i ((k^cos_j)^⊤ v_j) + Σ_{j=1}^m q^sin_i ((k^sin_j)^⊤ v_j)) / (Σ_{j=1}^m (q^cos_i)^⊤ k^cos_j + Σ_{j=1}^m (q^sin_i)^⊤ k^sin_j)

where q^cos_i = q'_i cos(πi/2L), q^sin_i = q'_i sin(πi/2L), k^cos_j = k'_j cos(πj/2L), and k^sin_j = k'_j sin(πj/2L). Similar to Performer, cosFormer can efficiently fulfill noncausal self and noncausal cross attention with linear complexity O(nd²), but its causal attention relies on a specialized CUDA operator to achieve high efficiency.
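The cos/sin decomposition can be checked numerically. The sketch below assumes the noncausal case and uses positive toy inputs so the ReLU is inert; it is an illustration of the re-weighting identity, not the paper's implementation:

```python
import numpy as np

def cosformer_attention(Q, K, V):
    """Noncausal cosFormer sketch: ReLU feature maps with cos-based
    re-weighting. cos(pi/2 * (i-j)/L) splits into cos/sin products
    (Ptolemy), so the K^T V statistics are shared by all queries."""
    n, m = len(Q), len(K)
    L = max(n, m)
    Qr, Kr = np.maximum(Q, 0.0), np.maximum(K, 0.0)      # ReLU features
    ci = np.cos(np.pi * np.arange(n) / (2 * L))[:, None]
    si = np.sin(np.pi * np.arange(n) / (2 * L))[:, None]
    cj = np.cos(np.pi * np.arange(m) / (2 * L))[:, None]
    sj = np.sin(np.pi * np.arange(m) / (2 * L))[:, None]
    num = (Qr * ci) @ ((Kr * cj).T @ V) + (Qr * si) @ ((Kr * sj).T @ V)
    den = (Qr * ci) @ (Kr * cj).sum(0) + (Qr * si) @ (Kr * sj).sum(0)
    return num / den[:, None]                            # linear: no (n, m) matrix

rng = np.random.default_rng(0)
Q, K = rng.uniform(0.1, 1.0, (5, 4)), rng.uniform(0.1, 1.0, (7, 4))
V = rng.normal(size=(7, 4))

# sanity check against the quadratic re-weighting formula s(q_i, k_j)
i, j = np.arange(5)[:, None], np.arange(7)[None, :]
w = (Q @ K.T) * np.cos(np.pi / 2 * (i - j) / 7)          # ReLU is a no-op: Q, K > 0
print(np.allclose(cosformer_attention(Q, K, V), (w / w.sum(-1, keepdims=True)) @ V))  # True
```

The check works because cos(a − b) = cos a cos b + sin a sin b turns the position-dependent weight into two query-independent sums over keys.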
LongShort Transformer LongShort Transformer (Zhu et al., 2021) models the long-range and short-term dependencies of a sequence with two different methods discussed above. For long-range modeling, it utilizes the low-rank property of the softmax function to compress keys and values into more compact representations using a learnable matrix W^P ∈ R^{d×r}: K̄ = P^⊤K ∈ R^{r×d} and V̄ = P^⊤V ∈ R^{r×d}, where P = softmax(KW^P) and r is the compressed dimension. For short-term modeling, it borrows local attention to output token representations fused with neighborhood features via segment-wise local attention. Let l be the segment length and s_i = ⌊i/l⌋ the segment id of the i-th token; segment-wise local attention produces K^w = K_{(s_i−1)l+l/2 : (s_i+1)l+l/2}. Combining the two methods, the output of LongShort Transformer is formalized as LongShort(q_i, K, V) = softmax(q_i [K^w; K̄]^⊤)[V^w; V̄], where [·; ·] is a concatenation operator along the first dimension. The complexity O(nrd + nwd) is also linear w.r.t. the target length n. Due to its introduction of local attention, LongShort Transformer can only complete tasks under the causal self and noncausal self attention patterns. ProbSparse Attention ProbSparse (Zhou et al., 2021) calculates attention scores for only a few (Q, K) dot-product pairs, based on the sparsity of self attention scores. Specifically, ProbSparse selects the "important" queries whose attention distributions over keys are non-uniform. Then it only calculates the attention scores between the "important" queries and all keys, and outputs the weighted sum of V. For the remaining trivial queries, ProbSparse takes the mean of V as output. To identify the "important" queries, ProbSparse first samples some keys K̃, and then measures the sparsity of query q_i by M(q_i, K̃) = max_j (q_i k̃_j^⊤ / √d) − (1/L_K) Σ_{j=1}^{L_K} q_i k̃_j^⊤ / √d, where L_K denotes the number of keys. Finally, the Top-u queries with the largest M(q_i, K̃) are the "important" queries.
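The sparsity measure M can be sketched directly; this is a toy version with names of our choosing (the official Informer implementation also subsamples the score computation itself):

```python
import numpy as np

def probsparse_topu(Q, K_sample, u):
    """Toy sketch of ProbSparse's query-sparsity measure.

    M(q_i, K~) = max_j (q_i . k~_j / sqrt(d)) - mean_j (q_i . k~_j / sqrt(d));
    a large gap between max and mean marks a non-uniform, hence
    "important", query distribution."""
    d = Q.shape[-1]
    scores = Q @ K_sample.T / np.sqrt(d)          # (n, L_K) sampled dot products
    M = scores.max(axis=1) - scores.mean(axis=1)  # sparsity measure per query
    return np.argsort(-M)[:u]                     # indices of the Top-u queries
```

A query strongly aligned with one sampled key gets a large max-minus-mean gap and is selected first.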
ABC In this paper, we use ABC to denote ABC_MLP (Peng et al., 2022). ABC_MLP bounds the number of memory slots for K and V to r < m, independent of both n and m. It uses learnable matrices to fill each memory slot with reweighted key and value representations: K̄ = softmax(W^K_ϕ K^⊤)K, V̄ = softmax(W^V_ϕ V^⊤)V, where W^K_ϕ, W^V_ϕ ∈ R^{r×d} are learnable matrices representing the bounded memory. Leveraging the bounded representations K̄ and V̄, ABC_MLP executes the following softmax attention: ABC_MLP(q_i, K̄, V̄) = softmax(q_i K̄^⊤)V̄. ABC_MLP is again of linear complexity O(nrd). S4D S4D (Gu et al., 2022) differs from the above attention-based methods in that it relies on a state space matrix, the "HiPPO matrix" (Gu et al., 2020), which enables the whole model to be viewed as
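A minimal sketch of the bounded-memory computation, assuming the two equations above (function and variable names are ours, not the official ABC code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def abc_mlp_attention(Q, K, V, Wk, Wv):
    """Toy sketch of ABC's bounded-memory attention.

    Wk, Wv in R^{r x d} define r memory slots; each slot holds a convex
    combination of the m keys/values, so the query only attends over r
    slots: O(n r d) overall."""
    K_bar = softmax(Wk @ K.T, axis=-1) @ K        # (r, d) compressed keys
    V_bar = softmax(Wv @ V.T, axis=-1) @ V        # (r, d) compressed values
    return softmax(Q @ K_bar.T, axis=-1) @ V_bar  # (n, d)
```

Because every step is a convex combination, each output coordinate stays within the range spanned by V.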

I INTER-TASK CORRELATION

Figure 4 shows the task-to-task correlation. Except for the causal cross attention pattern, tasks within the other three patterns scarcely correlate positively with each other. This suggests that hardly any single task can comprehensively test the capability of an attention module under a given attention pattern.



1. When generating a sequence Y = {y_1, y_2, ..., y_n} autoregressively, neural models (Vaswani et al., 2017; Li et al., 2019; Brown et al., 2020) usually adopt a causal graph G(V, E) with vertices V = {y_i ∈ Y} and edges E = {y_i → y_j | i ∈ [1, ..., n], j ∈ [1, ..., n], i ≤ j}.
2. We implement and experiment with ABC in its fast version, as suggested by Peng et al. (2022).
3. The order is [vanilla, local, ABC, LARA, cosFormer, ProbSparse, Performer, LongShort, Nyströmformer, S4D].
4. The negative correlation here is biased due to the limited exploration of causal and cross efficient attentions.
5. https://github.com/OpenNLPLab/cosFormer/blob/main/cosformer.py#L106
6. https://github.com/google-research/google-research/blob/master/performer/fast_attention/tensorflow/fast_attention.py#L190
7. https://github.com/pkuzengqi/Skyformer



Figure 1: Computation diagrams for (a) noncausal self, (b) causal self, (c) noncausal cross, (d) causal cross attentions. Shaded blocks represent future tokens that are invisible to the current state. Blocks with red rims represent the current state token.

Figure 2: Empirical running time (a) and memory cost (b) with sequence length. Relative measurements to vanilla attention are reported. (c) efficiency length of attention architectures compared to vanilla attention. The efficient attention order on the y-axis is monotonically sorted by efficiency length on memory usage.

Figure 4: Task-to-task correlation.

models adopt NS attention in the encoder, and NS/NC attention in the decoder. The previous efficient attention benchmark LRA only tests attentions with the NS pattern.

Task statistics of datasets, data lengths, evaluation metrics, backbone neural networks, and required attention patterns. For encoder-decoder models, both source and target lengths are listed.

Efficient attentions with supported attention patterns. The complexity is based on sequence length n, omitting independent factors for simplicity. (*) denotes attentions that perform differently from what the original paper claims. We clarify the reasons in Appendix A.

Experimental results on CAB with (a) noncausal self, (b) causal self, (c) noncausal cross, and (d) causal cross attention pattern. (-) denotes that the method fails on the task. ( †) denotes that the attention does not support fp16 training. The CI of each efficient attention is shown in Appendix E.

Compositional index.

The means and standard deviations for noncausal self attention pattern evaluation.

The means and standard deviations for causal self attention pattern evaluation.

By viewing softmax attention as an expectation over a potentially complicated distribution p, Zheng et al. (2022) noted that Performer can be understood as performing self-normalized importance sampling to approximate this expectation, with a standard Gaussian as the proposal distribution. LARA generalizes this view by adopting multiple adaptive proposal distributions to further improve the approximation fidelity, yielding the LARA form given above.

The means and standard deviations for noncausal cross attention pattern evaluation.

The means and standard deviations for causal cross attention pattern evaluation.

A EFFICIENT ATTENTIONS WITH PERFORMANCE DIFFERENCES

cosFormer requires a cos-reweighting mechanism to behave like vanilla softmax attention. However, its reweighting mechanism needs the maximum length L between the source and target sequences in Eq. 16. Under the causal self pattern, this operation requires a different L value for each token of Q and K, which is ignored in its released codebase (footnote 5). Therefore, the released code cannot model causality.

Performer uses random feature matrices to project Q and K. However, before transforming Q and K with the softmax kernel, its released codebase (footnote 6) uses the gold target sequence length to compute random feature maps under the causal self pattern. This operation brings an inconsistency between training and inference, because the gold target length is unavailable during inference. Thus, we cannot use the released code to perform causal self tasks.

B INCREASING COMPUTATION COSTS FOR CAUSAL EFFICIENT ATTENTION

In this section, we use an efficient attention example, RFA (Peng et al., 2021), to illustrate that enforcing causality may increase the computation cost of an efficient attention. In its noncausal version, RFA models the output for query q_t as RFA(q_t, K, V) = ϕ(q_t)^⊤ S / (ϕ(q_t)^⊤ z), with S = Σ_{j=1}^m ϕ(k_j) v_j^⊤ and z = Σ_{j=1}^m ϕ(k_j), where S_t ∈ R^{2D×d}, z_t ∈ R^{2D}, and 2D is the dimensionality of the kernel ϕ. In the causal version, the hidden states S_t and z_t are modified recurrently to maintain causality: S_t = S_{t−1} + ϕ(k_t) v_t^⊤, z_t = z_{t−1} + ϕ(k_t), and RFA(q_t, K, V) = ϕ(q_t)^⊤ S_t / (ϕ(q_t)^⊤ z_t). We need to store S_t and z_t explicitly during training, since they contain the historical information specific to each query q_t. As a result, these recurrent operations add an additional O(ndD) time and memory consumption.
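The causal recurrence can be sketched as follows. This is illustrative only: phi stands for any positive feature map, and a real implementation would keep the per-step states for backpropagation, which is exactly the extra cost discussed above.

```python
import numpy as np

def causal_rfa(Q, K, V, phi):
    """Toy sketch of RFA's causal recurrence.

    S_t = S_{t-1} + phi(k_t) v_t^T,  z_t = z_{t-1} + phi(k_t),
    out_t = phi(q_t)^T S_t / (phi(q_t)^T z_t)."""
    n, d = V.shape
    D = phi(Q[:1]).shape[-1]      # feature dimension of the kernel map
    S = np.zeros((D, d))
    z = np.zeros(D)
    out = np.empty((n, d))
    for t in range(n):
        fk, fq = phi(K[t:t+1])[0], phi(Q[t:t+1])[0]
        S += np.outer(fk, V[t])   # accumulate phi(k_t) v_t^T
        z += fk                   # accumulate phi(k_t)
        out[t] = (fq @ S) / (fq @ z)
    return out
```

At the last position the recurrence has seen the whole sequence, so it coincides with the noncausal form.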

C COMPOSITIONAL INDEX

We define s^k_ij as the score of the i-th attention model on the k-th metric in the j-th task, where k = 1, ..., K and j = 1, ..., M. To obtain the Compositional Index (CI), we first normalize s^k_ij. For a metric where a higher value indicates better performance, we normalize s̃^k_ij = (s^k_ij − μ^k_j)/σ^k_j, where μ^k_j and σ^k_j denote the mean and standard deviation of the models' scores on this metric; for a metric where a lower value indicates better performance, we instead normalize s̃^k_ij = (μ^k_j − s^k_ij)/σ^k_j. We then average the normalized scores over the metrics of each task to obtain a task score, and average the task scores to obtain CI.
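Under our reading of the procedure above (z-normalize per metric across models, flip the sign for lower-is-better metrics, then average metrics into task scores and task scores into CI), the computation can be sketched as:

```python
import numpy as np

def compositional_index(scores, higher_better):
    """Sketch of the CI computation (our reading of the description).

    scores: array of shape (num_models, num_tasks, num_metrics);
    higher_better: per-metric booleans. Each metric is z-normalized
    across models, sign-flipped when lower is better, averaged into a
    task score, and task scores are averaged into CI per model."""
    mu = scores.mean(axis=0, keepdims=True)
    sigma = scores.std(axis=0, keepdims=True)
    z = (scores - mu) / sigma
    sign = np.where(np.asarray(higher_better), 1.0, -1.0)  # (num_metrics,)
    task_scores = (z * sign).mean(axis=2)  # (num_models, num_tasks)
    return task_scores.mean(axis=1)        # (num_models,)
```

By construction the z-scores are centered, so CI values sum to zero across models.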

D HYPERPARAMETERS

Hyperparameters for all tasks are shown in Table 4 . We also report the hyperparameters of efficient attentions in Table 5 . 

E TASK SCORE FOR COMPARISON

We show the compositional index of each task in Table 6. A new efficient attention can use our provided means and variances to obtain its normalized scores. The means and standard deviations of the metrics are shown in Table 7, Table 8, Table 9, and Table 10.

F DETAILED DESCRIPTION OF EFFICIENT ARCHITECTURES

We here only present the noncausal self forms of these efficient architectures. In the subsequent discussion, we denote n and m as the target and source sequence lengths, respectively. Also, we omit the factor 1/√d for simplicity.

Local Attention Local attention (Luong et al., 2015) intuitively forces each query token to attend to its neighborhood tokens within a fixed window size w to reduce the computation costs. In the full attention map, the neighborhood tokens contribute a large proportion of the attention score of the querying token (Qin et al., 2021). Therefore, local attention selects w neighborhood tokens (w/2 preceding tokens in the causal case) for each query.

Viewed as a convolutional model, S4D does not require the introduction of Q, K, V, as it treats the whole modeling process as a linear time-invariant (LTI) system u(t) → y(t), where u(t) is the input sequence. To parameterize the system, S4D designs a convolution kernel K(t) ∈ R^{n×d} and introduces an inner state dimension r during the parameterization of K(t), where the state matrices A, B, C are all initialized with certain values and trained with the whole model. During implementation, S4D processes u and K in the frequency domain and transforms the output back to the time domain. Therefore, S4D has complexity O(nrd).
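The frequency-domain step can be sketched for a single 1-D channel. The kernel here is an arbitrary stand-in array, not the actual HiPPO-parameterized K(t); the point is only that a causal convolution of length n costs O(n log n) via the FFT.

```python
import numpy as np

def ssm_convolve(u, kernel):
    """Apply a (causal) convolution kernel via FFT, as S4D does conceptually.

    Zero-padding to 2n makes the circular FFT convolution equal to the
    linear convolution, after which we keep the first n outputs."""
    n = len(u)
    L = 2 * n
    return np.fft.irfft(np.fft.rfft(u, L) * np.fft.rfft(kernel, L), L)[:n]
```

This matches a direct time-domain causal convolution of the two signals.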

G EFFICIENCY LENGTH

We show efficiency lengths under the noncausal self attention pattern in the table below.

H BENEFIT OF ATTENTION

We report the CI of backbone models with and without attention in Table 12 . 

