DBA: EFFICIENT TRANSFORMER WITH DYNAMIC BI-LINEAR LOW-RANK ATTENTION

Abstract

Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear complexity for long sequences. Among them, low-rank-based methods learn projection matrices that compress the sequence length, thereby achieving efficiency gains. However, these projection matrices are fixed once learned, compressing every sequence with the same coefficients for tokens at a given position regardless of the input. Adopting such input-invariant low-rank projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, and thus fails to preserve the most useful information, which lies in different positions for different sequences. In addition, previous efficient Transformers focus only on the influence of the sequence length while neglecting the effect of the hidden state dimension, forgoing further efficiency gains. To address these problems, we present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and hidden state dimension, while maintaining state-of-the-art performance. Specifically, we first demonstrate theoretically, from a novel information-theoretic perspective, that the sequence length can be compressed losslessly, with the compression matrices dynamically determined by the input sequence. Furthermore, we show by extending the Johnson-Lindenstrauss lemma that the hidden state dimension can be approximated with only a high-order (negligibly small) error, optimizing the attention in bilinear form. In addition, theoretical analysis shows that DBA is proficient at capturing high-order relations in cross-attention problems.
Experiments on tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while consuming less memory and running faster, demonstrating the effectiveness and efficiency of DBA.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has shown immense capabilities in a wide range of areas, including natural language processing (Dai et al., 2019), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021), time series analysis (Zerveas et al., 2021), and multi-modal tasks (Qin et al., 2022a; Yu et al., 2019). However, the Vanilla Transformer suffers from quadratic time and memory complexity, limiting its range of application. Therefore, several efficient Transformers have been introduced (Tay et al., 2022). Among them, kernel-based methods have drawn much attention due to their optimization-friendly characteristics, improving efficiency by approximating the attention mechanism (Katharopoulos et al., 2020; Wang et al., 2020; Ma et al., 2021; Xiong et al., 2021; Qin et al., 2022b; Choromanski et al., 2021). One popular kernel-based technique is low-rank approximation, which compresses the sequence-length dimension using the same coefficients for all sequences. For instance, Wang et al. (2020) approximated the stochastic matrix along the sequence-length dimension by using sets of fixed coefficients, learned during training, to compute a weighted sum of tokens at different positions. Xiong et al. (2021) adopted the Nyström method to approximate the attention mechanism with linear complexity, reducing the sequence length via mean pooling. However, the flexibility of the low-rank projection in these methods is limited: the projection matrices are pre-determined or fixed after training, so different sequences are compressed with the same coefficients for tokens at a given position. Such input-invariant low-rank compression ignores the fact that the informative part of a sequence varies from sequence to sequence.
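To make the input-invariant approach concrete, the sketch below (our own illustration in NumPy, with hypothetical variable names) shows Linformer-style low-rank attention: fixed projections E and F, learned once during training, compress the keys and values along the sequence-length axis and are then applied identically to every input sequence.

```python
import numpy as np

def fixed_low_rank_attention(X, Wq, Wk, Wv, E, F):
    """Linformer-style attention sketch: keys/values are compressed along
    the sequence-length axis by FIXED (k x n) projections E and F, which
    do not depend on the input sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (n, d) each
    K_low, V_low = E @ K, F @ V                  # (k, d): length n -> k
    scores = Q @ K_low.T / np.sqrt(Q.shape[-1])  # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # row-wise softmax
    return weights @ V_low                       # (n, d), linear in n

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8                              # toy sizes: n tokens -> k landmarks
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
E, F = rng.standard_normal((k, n)) / n, rng.standard_normal((k, n)) / n
out = fixed_low_rank_attention(X, Wq, Wk, Wv, E, F)
print(out.shape)  # (64, 16)
```

Because E and F do not depend on X, tokens at a given position receive the same compression coefficients in every sequence, which is precisely the input-invariance criticized above.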
Hence, the compression might fail to preserve the most informative parts, which lie in different positions, limiting performance on tasks where the most informative parts of the input change significantly, such as image-related tasks. In addition, previous efficient Transformers focus only on optimizing over the sequence length while ignoring the influence of the hidden state dimension. The hidden state dimension also contributes to the computation cost and becomes more critical to efficiency when processing moderate or short sequences. Previous efficient Transformers that achieve significant memory compression and speed-up in long sequence conditions can end up with efficiency similar to that of the Vanilla Transformer when processing moderate or short sequences, as shown in Figure 1. To address these problems, we propose an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear computation and memory efficiency through bilinear optimization over both the sequence length and the hidden state dimension. Specifically, we first show theoretically, from a novel information-theoretic perspective, that the sequence length can be compressed losslessly, with the projection matrices dynamically determined by the input sequence to best preserve its most informative parts. Furthermore, we demonstrate that the hidden state dimension can be approximated with only a high-order (negligibly small) error by extending the Johnson-Lindenstrauss lemma (Arriaga & Vempala, 2006; Lindenstrauss & Johnson, 1984). In addition, theoretical analysis shows that DBA is able to capture high-order relations in cross-attention problems, which is crucial for performance on multi-modality tasks.
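To contrast with the fixed-projection approach, the following schematic sketch (our own illustration, not the paper's exact construction; Wp and R are hypothetical parameters) combines the two ideas described above: a projection P computed from the input X compresses the sequence-length axis, and a Johnson-Lindenstrauss-style random projection R reduces the hidden-state dimension before the score computation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_bilinear_attention(X, Wq, Wk, Wv, Wp, R):
    """Schematic sketch of a dynamic bilinear scheme (assumptions, not
    the authors' method): (1) a (k x n) projection P is computed FROM X,
    so the compression coefficients change with the input sequence;
    (2) a JL-style random projection R (d x m, m < d) reduces the
    hidden-state dimension used in the score computation."""
    P = softmax(X @ Wp, axis=0).T           # (k, n), input-dependent
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # (n, d)
    K_low, V_low = P @ K, P @ V             # (k, d): n -> k along length
    Qr, Kr = Q @ R, K_low @ R               # (n, m), (k, m): d -> m
    scores = Qr @ Kr.T / np.sqrt(Qr.shape[-1])  # (n, k)
    return softmax(scores) @ V_low          # (n, d), linear in n

rng = np.random.default_rng(1)
n, d, k, m = 128, 32, 8, 16
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
Wp = rng.standard_normal((d, k)) * d**-0.5
R = rng.standard_normal((d, m)) / np.sqrt(m)    # JL random projection
out = dynamic_bilinear_attention(X, Wq, Wk, Wv, Wp, R)
print(out.shape)  # (128, 32)
```

Since P depends on X, two different sequences are compressed with different coefficients, and the score matrix is (n x k) with a reduced inner dimension m, so the cost stays linear in the sequence length n for fixed k and m.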
Extensive experiments over tasks with various sequence length conditions are conducted on three different datasets: Long-Range Arena (LRA) (Tay et al., 2021b) as a long-sequence benchmark, the UEA multivariate time series classification archive (Bagnall et al., 2018) to evaluate performance across varied sequence lengths, and VQA-v2 (Goyal et al., 2017) to illustrate DBA's ability to capture high-order relations. DBA achieves state-of-the-art performance with impressive speed-up and memory compression rates compared with other competitors over various sequence length conditions, demonstrating its effectiveness and efficiency in a wide range of applications. Our main contributions can be summarized as follows:



Figure 1: Performance, speed, and memory consumption on the Long-Range Arena benchmark (Tay et al., 2021b) compared with the Vanilla Transformer (Vaswani et al., 2017) under different sequence length conditions (512 and 4k). DBA achieves state-of-the-art performance with the highest speed and lowest memory consumption over various sequence length conditions.

