CAB: COMPREHENSIVE ATTENTION BENCHMARKING ON LONG SEQUENCE MODELING

Abstract

The Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve the Transformer's efficiency while largely preserving its efficacy, especially for modeling long sequences. A widely-used benchmark for testing these efficient methods' long-range modeling capability is Long Range Arena (LRA). However, LRA focuses only on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important in downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose the Comprehensive Attention Benchmark (CAB), built on a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attention. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Across these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performance of nine widely-used efficient attention architectures, designed with different philosophies, on CAB. The extensive experimental results also shed light on fundamental problems of efficient attentions, such as the efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.

1. INTRODUCTION

The Transformer has achieved great breakthroughs in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021b), speech processing (Schneider et al., 2019; Ren et al., 2021), and biology (Jumper et al., 2021; Brandes et al., 2022). A major drawback of the Transformer architecture is its quadratic complexity in both time and memory. This problem has become more evident with the ever-increasing need to apply Transformers to longer sequences in different domains. Recent research on efficient attention mechanisms responds to this problem by improving attention efficiency while preserving efficacy (Wang et al., 2020; Kitaev et al., 2019; Zaheer et al., 2020; Choromanski et al., 2020; Zheng et al., 2022). The commonly adopted test bed for benchmarking efficient transformers on long-sequence processing is the Long Range Arena (LRA; Tay et al., 2020a), which consists of both synthetic probing tasks and real-world tasks. However, all of these tasks focus on the self attention setting, ignoring cross attention and causal attention, which are equally important and often more challenging. In other words, the Transformer model is used only as a sequence encoder in LRA, whereas in real applications, cross attention is essential for conditionality modeling tasks such as sequence-to-sequence (Bahdanau et al., 2015), data-to-text (Dušek et al., 2020), and knowledge-enhanced models (Liu et al., 2021a), and causal attention is critical for causality modeling tasks such as text generation (Vaswani et al., 2017; Zhang et al., 2020), language modeling (Radford et al., 2019; Brown et al., 2020), and speech synthesis (Li et al., 2019).
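The four attention patterns differ only in where the keys and values come from (the query sequence itself for self attention, a separate source sequence for cross attention) and in whether a triangular mask restricts each query to earlier positions (causal) or not (noncausal). A minimal NumPy sketch of vanilla attention illustrates the taxonomy; this is illustrative only, and for the causal cross case we assume the source and target sequences are step-aligned, as in streaming applications:

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Vanilla scaled dot-product attention; q: (Tq, d), k and v: (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (Tq, Tk) score matrix
    if causal:
        # query position i may only attend to key positions j <= i
        scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1),
                          -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))          # target sequence (queries)
m = rng.normal(size=(4, 8))          # source sequence of a different length
m_aligned = rng.normal(size=(6, 8))  # step-aligned source, for the causal cross case

noncausal_self  = attention(x, x, x)                               # e.g. encoders (the LRA setting)
causal_self     = attention(x, x, x, causal=True)                  # e.g. language models
noncausal_cross = attention(x, m, m)                               # e.g. encoder-decoder cross attention
causal_cross    = attention(x, m_aligned, m_aligned, causal=True)  # e.g. streaming decoding
```

With `causal=True`, the first query position can attend only to the first key, so `causal_self[0]` equals `x[0]` exactly. The materialized (Tq × Tk) score matrix is the source of the quadratic cost that efficient attentions aim to avoid.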
Another potential drawback of LRA, discovered by researchers recently, is that with proper tuning, the performance gap between different transformer variants on these tasks can be insignificant (Xiong et al., 2022; Ivgi et al., 2022; Shaham et al., 2022), impairing its effectiveness as a standard benchmark.

