CAB: COMPREHENSIVE ATTENTION BENCHMARKING ON LONG SEQUENCE MODELING

Abstract

Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.

1. INTRODUCTION

Transformer has achieved great breakthroughs in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021b), speech processing (Schneider et al., 2019; Ren et al., 2021), and biology (Jumper et al., 2021; Brandes et al., 2022). A major drawback of the transformer architecture is its quadratic complexity in both time and memory. The problem has become more evident with the ever-increasing need to apply transformers to longer sequence modeling in different domains. Recent research on efficient attention mechanisms responds to this problem by improving attention efficiency while preserving efficacy (Wang et al., 2020; Kitaev et al., 2019; Zaheer et al., 2020; Choromanski et al., 2020; Zheng et al., 2022). The commonly adopted test bed for benchmarking efficient transformers on long sequences is the Long Range Arena (LRA; Tay et al., 2020a), consisting of both synthetic probing tasks and real-world tasks. However, all of these tasks focus on the self attention setting, ignoring cross attention and causal attention, which are equally important and often more challenging. In other words, the transformer model is only used as a sequence encoder in LRA, while in real applications, cross attention is essential for conditionality modeling tasks such as sequence-to-sequence (Bahdanau et al., 2015), data-to-text (Dušek et al., 2020) and knowledge-enhanced models (Liu et al., 2021a), and causal attention is critical for causality modeling tasks such as text generation (Vaswani et al., 2017; Zhang et al., 2020), language modeling (Radford et al., 2019; Brown et al., 2020) and speech synthesis (Li et al., 2019).
Another potential drawback of LRA, as researchers have recently discovered, is that with proper tuning, the performance gap between different transformer variants on these tasks can be insignificant (Xiong et al., 2022; Ivgi et al., 2022; Shaham et al., 2022), impairing its effectiveness as a standard benchmark. To address these problems, we propose a Comprehensive Attention Benchmark (CAB) for long sequence modeling. First, we present a fine-grained attention taxonomy, considering attentive functionality in conditionality and causality modeling. We define four attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross, representing distinguishable attentive functionality for sequence modeling ( §2). With that in mind, we then collect seven real-world tasks from the diverse fields of computer vision, natural language processing, speech processing, and time series forecasting ( §3). Across these tasks, CAB includes rich backbone architectures to evaluate attention mechanisms, testing their performance and generalization ability. Given the four attention patterns defined by the taxonomy, we advocate a pattern-wise comparison between attention mechanisms, evaluating each attentive functionality separately. We conduct exhaustive experiments on CAB, assessing nine widely-used efficient attentions designed with different philosophies ( §4). The experimental results reveal several insights into designing efficient attention architectures. First, we show that existing efficient transformers claiming comparable or even superior performance to vanilla attention often achieve less competitive results in the causal cross scenario, indicating that the modeling capability of efficient attentions does not always generalize across attention patterns.
Second, by quantifying the efficiency of attentions in long sequence contexts using efficiency length (i.e., the minimum sequence length at which a sub-quadratic efficient model surpasses vanilla attention in efficiency), we reveal the underlying inefficiency of existing efficient attention methods on relatively short sequences. Third, we investigate interpolation and extrapolation in a long-context language model, and find that efficient attentions such as local attention and LongShort Transformer show promise in scaling to long-context language modeling. We hope that CAB and our extensive experimental efforts shed light on the fundamental problems of efficient attentions and inspire the design of advanced attention mechanisms. CAB and all related code will be released at https://github.com/Anonymous.
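The efficiency-length idea can be made concrete with a small sketch: given per-layer cost models for vanilla and efficient attention, find the smallest sequence length at which the efficient method becomes cheaper. The cost functions and constants below are illustrative assumptions, not measurements from the paper.

```python
def crossover_length(vanilla_cost, efficient_cost, max_len=1 << 20):
    """Smallest length n at which the efficient model is strictly cheaper.

    vanilla_cost / efficient_cost: callables mapping a sequence length to a
    cost (FLOPs, memory, or wall-clock). Returns None if no crossover is
    found up to max_len, i.e. the "efficient" model never wins in range.
    """
    for n in range(1, max_len + 1):
        if efficient_cost(n) < vanilla_cost(n):
            return n
    return None

# Illustrative cost models: vanilla attention ~ n^2; a linear-complexity
# attention ~ 256 * n (large constant from feature maps / extra projections).
print(crossover_length(lambda n: n * n, lambda n: 256 * n))  # -> 257
```

Under these toy constants, the efficient method only pays off beyond length 256, mirroring the paper's observation that sub-quadratic methods can be slower than vanilla attention on short sequences.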

2. ATTENTION TAXONOMY

Attention methods are designed to capture token-to-token dependencies within and across sequences. Let Y = {y_1, y_2, . . . , y_n} and Y ∈ R^{n×d} be a target sequence and its feature matrix, and X = {x_1, x_2, . . . , x_m} and X ∈ R^{m×d} be a source sequence and its feature matrix, where n and m are the target and source sequence lengths respectively, and d denotes the dimensionality. Note that X could be the same as Y in the case of self attention. The target and source feature matrices are transformed into query-key-value feature matrices Q = YW_Q ∈ R^{n×d}, K = XW_K ∈ R^{m×d}, V = XW_V ∈ R^{m×d} with learnt parameter matrices W_Q, W_K, W_V. The vanilla transformer (Vaswani et al., 2017) learns to integrate the key-value feature matrices K, V into the query matrix Q = {q_1, q_2, . . . , q_n} with token-to-token attention:

Attn(Q, K, V) = softmax(QK^⊤ / √d) V,  (1)

in which we omit the multihead notation without loss of generality. Vanilla attention in Eq. 1 computes a token-to-token alignment matrix QK^⊤, which results in quadratic complexity O(nm) in computational time and memory usage. Observe that Attn(•) in Eq. 1 acts as an integration model, projecting the query features Q by integrating key-value features without changing Q's shape. Beyond the vanilla token-to-token attention mechanism, we extend Eq. 1 to a more general form of the attention family:

Attn(Q, K, V) = f(Q; K, V) : R^{n×d} → R^{n×d}, where Attn_i(Q, K, V) = f_i(q_i; Q, K, V), i = 1 . . . n,  (2)

where Q is treated as the main variable and K, V as conditional ones. Compared with Eq. 1, Eq. 2 covers efficient attention methods that do not explicitly model the token-to-token alignment. In usage, attentions are modified to fit generative models with underlying structures, such as conditional models and causal models, to achieve specific attentive functionality. In this section, we put forward a fine-grained attention taxonomy, considering the conditionality and causality of attentions.
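As a concrete reference point, Eq. 1 can be sketched in NumPy. The shapes follow the definitions above (n target tokens, m source tokens, dimensionality d); the random toy data and parameter matrices are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. 1: Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V (single head)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, m) token-to-token alignment
    return softmax(scores, axis=-1) @ V  # (n, d): Q's shape is preserved

# Toy shapes: n = 4 target tokens, m = 6 source tokens, d = 8.
rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
Y, X = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Y @ W_Q, X @ W_K, X @ W_V
out = attention(Q, K, V)  # shape (n, d)
```

Note that materializing the (n, m) score matrix is exactly the O(nm) time and memory cost that efficient attentions try to avoid.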
We discuss the challenges each pattern poses and its potential impact on real-world applications of conditionality and causality modeling. Under this taxonomy, we present four attention patterns with distinct attentive functionality, as shown in Figure 1.
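The four patterns can be sketched as variations of one primitive: whether K, V come from the target itself (self) or a separate source (cross), and whether a causal mask restricts each query position to earlier key positions. The masking scheme for the causal cross case below (query position i attends to source positions j ≤ i) is one simple alignment assumption for illustration; real causal cross applications may use different alignments.

```python
import numpy as np

def attn(Q, K, V, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)                        # (n, m) alignment scores
    if causal:
        # query position i may only attend to key positions j <= i
        mask = np.triu(np.ones(s.shape, dtype=bool), k=1)
        s = np.where(mask, -np.inf, s)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, m, d = 5, 7, 4
Y = rng.normal(size=(n, d))  # target sequence features
X = rng.normal(size=(m, d))  # source sequence features

noncausal_self  = attn(Y, Y, Y)                  # K, V from the target itself
causal_self     = attn(Y, Y, Y, causal=True)     # plus a causal mask
noncausal_cross = attn(Y, X, X)                  # K, V from the source X
causal_cross    = attn(Y, X, X, causal=True)     # prefix-aligned sketch
```

A quick sanity check on causality: perturbing a later source token must leave earlier causal outputs unchanged, while noncausal outputs may change everywhere.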

