DBA: EFFICIENT TRANSFORMER WITH DYNAMIC BI-LINEAR LOW-RANK ATTENTION

Abstract

Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear complexity over long sequences. Among them, low-rank-based methods learn projection matrices to compress the sequence length, thereby achieving efficiency gains. However, the projection matrices are fixed once learned, compressing every sequence with the same coefficients for tokens in the same position, regardless of the input. Adopting such input-invariant low-rank projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information that lies in varied positions of different sequences. In addition, previous efficient Transformers only focus on the influence of sequence length while neglecting the effect of the hidden state dimension on further efficiency gains. To address the aforementioned problems, we present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and the hidden state dimension while maintaining state-of-the-art performance. Specifically, we first theoretically demonstrate that the sequence length can be compressed losslessly from a novel information theory perspective, with the compression matrices dynamically determined by the input sequence. Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma with only a high-order small error term, optimizing the attention in bilinear form. In addition, theoretical analysis shows that DBA is proficient in capturing high-order relations in cross-attention problems.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while consuming less memory and running faster, demonstrating the effectiveness and efficiency of DBA.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has shown immense capabilities in a wide range of areas, including natural language processing (Dai et al., 2019), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021), time series analysis (Zerveas et al., 2021), and multi-modal tasks (Qin et al., 2022a; Yu et al., 2019). However, the Vanilla Transformer suffers from quadratic time and memory complexity, raising concerns about its broader application. Therefore, several efficient Transformers have been introduced (Tay et al., 2022). Among them, kernel-based methods have drawn much attention due to their optimization-friendly characteristics, improving efficiency through approximations in the attention mechanism (Katharopoulos et al., 2020; Wang et al., 2020; Ma et al., 2021; Xiong et al., 2021; Qin et al., 2022b; Choromanski et al., 2021). One popular kernel-based technique is low-rank approximation, which compresses the sequence length dimension using the same coefficients for all sequences. For instance, Wang et al. (2020) approximated the stochastic matrix in the sequence length dimension using sets of fixed coefficients, learned during training, to compute weighted sums of tokens in different positions. Xiong et al. (2021) adopted the Nyström method to approximate the attention mechanism with linear complexity, decreasing the sequence length with mean pooling.

Figure 1: Performance (y-axis, higher is better), speed (x-axis, higher is better), and memory footprint (circle sizes, smaller is better) of efficient Transformers on the Long-Range Arena benchmark (Tay et al., 2021b) compared with the Vanilla Transformer (Vaswani et al., 2017) in different sequence length conditions (512 and 4k). DBA achieves state-of-the-art performance with the highest speed and lowest memory consumption over various sequence length conditions.

However, the flexibility of low-rank projection in previous methods is limited.
The projection matrices are pre-determined or fixed after the training process, compressing different sequences with the same coefficients for tokens in the same position. Such input-invariant low-rank compression ignores the fact that the informative part of a sequence varies from sequence to sequence. Hence, the compression may fail to preserve the most informative parts lying in different positions, limiting performance on tasks where the most informative parts of the input change significantly, such as image-related tasks. In addition, previous efficient Transformers only focused on optimizing the sequence length while ignoring the influence of the hidden state dimension. The hidden state dimension also contributes to the computation cost and becomes more critical to efficiency when processing moderate or short sequences. Previous efficient Transformers that achieve significant memory compression and speed-up in long sequence conditions can end up with efficiency similar to the Vanilla Transformer when processing moderate or short sequences, as shown in Figure 1. To address the aforementioned problems, we propose an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear computation and memory efficiency through bilinear optimization of both the sequence length and the hidden state dimension. Specifically, we first theoretically show that the sequence length can be compressed losslessly from a novel information theory perspective, where the projection matrices are dynamically determined by the input sequence to best preserve its most informative parts. Furthermore, we demonstrate that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma (Arriaga & Vempala, 2006; Lindenstrauss & Johnson, 1984) with only a high-order small error term.
In addition, theoretical analysis shows that DBA is able to capture high-order relations in cross-attention problems, which is crucial to performance in multi-modality tasks. Extensive experiments over tasks with various sequence length conditions are conducted on three different datasets: Long-Range Arena (LRA) (Tay et al., 2021b) as the long sequence benchmark, the UEA multivariate time series classification archive (Bagnall et al., 2018) to evaluate performance over various sequence lengths, and VQA-v2 (Goyal et al., 2017) to illustrate DBA's ability to capture high-order relations. DBA achieves state-of-the-art performance with impressive speed-up and memory compression rates compared with other competitors over various sequence length conditions, demonstrating its effectiveness and efficiency in a wide range of applications. Our main contributions can be summarized as follows: 1) We introduce an efficient and effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length via input-sensitive dynamic projection matrices. DBA achieves efficiency gains over various sequence length conditions with linear time and space complexity by jointly optimizing the sequence length and the hidden state dimension. 2) Theoretical guarantees from information theory and matrix low-rank approximation demonstrate that DBA has a capability similar to Vanilla Attention with low expected error. In addition, theoretical analysis shows that DBA is able to capture high-order inter-relations in cross-attention problems. 3) Extensive experiments on tasks with various sequence length conditions show that DBA achieves state-of-the-art performance in a wide range of applications with impressive efficiency gains. In addition, DBA is superior in capturing high-order relations in the cross-attention task, outperforming the Vanilla Transformer based MCAN (Yu et al., 2019) with only 12% of the parameters in the attention layer.

2. BACKGROUND AND RELATED WORK

2.1 VANILLA TRANSFORMER

The Vanilla Transformer (Vaswani et al., 2017) uses Vanilla Attention as its core algorithm, which computes the softmax-weighted sum of all values V, with weights obtained from the multiplication of Q and K:

P_ϕ(K, Q) = softmax(QK^T / √d)    (1)

Attention(P_ϕ(K, Q), V) = P_ϕ(K, Q) V    (2)

Here we define P_ϕ(K, Q) as the attention map and later abbreviate it as P_ϕ for simplicity. Q ∈ R^{n×d}, K ∈ R^{n×d}, V ∈ R^{n×d}, and P_ϕ ∈ R^{n×n}, where n is the sequence length and d is the hidden state dimension. Note that the time and memory complexity of Vanilla Attention is O(n²d). For long sequence applications, the impact of n becomes dominant, while the influence of d grows when facing moderate or short sequences.
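As a concrete reference point, Vanilla Attention in equations 1 and 2 can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's implementation; shapes follow the notation above):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Vanilla Attention: P = softmax(QK^T / sqrt(d)); output = P V."""
    d = Q.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))  # attention map P_phi, shape (n, n)
    return P @ V                       # output, shape (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))   # three random (n, d) matrices
out = vanilla_attention(Q, K, V)
assert out.shape == (n, d)
```

Both the n×n attention map and the matrix multiplications make the cost O(n²d), which is the bottleneck the following sections remove.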

2.2. EFFICIENT TRANSFORMERS

One class of efficient Transformers exploits sparsity, where each token can only access limited perspective fields with fixed or learned patterns, including local attention (Parmar et al., 2018), Reformer (Kitaev et al., 2020), Sinkhorn (Tay et al., 2020), Routing Transformer (Roy et al., 2021), ALiBi (Press et al., 2022), Learning-to-Hash Attention (LHA) (Sun et al., 2022), YOSO (Zeng et al., 2021), ClusterFormer (Wang et al.), Poolingformer (Zhang et al., 2021), and Big Bird (Zaheer et al., 2020). To give attention wider perspective fields, some works concentrate on the interactions between the near field and the far field, such as Focal Attention (Yang et al., 2021), FMMformers (Nguyen et al., 2021), Long-Short Transformer (Zhu et al., 2021), and Crossformer (Wang et al., 2022). Another popular approach is the kernel-based method, which improves the efficiency of the Transformer by rewriting the multiplications in equations 1 and 2, such as Linear Transformer (Parmar et al., 2018), PoNet (Tan et al., 2022), Random Feature Attention (Peng et al., 2021), and LARA (Zheng et al., 2022). In (Choromanski et al., 2021; Qin et al., 2022b; Chen et al., 2021b), the authors optimize the softmax kernel with faster reweighting functions. Since the kernels are approximations of the attention matrices, they can also be optimized by low-rank methods, such as Linformer (Wang et al., 2020), Luna (Ma et al., 2021), and Nyströmformer (Xiong et al., 2021). In (Zhuang et al., 2022; Luo et al., 2021; Kreuzer et al., 2021; Zhou et al., 2022), the authors improve the efficiency of the attention kernel by exploring the frequency domain. Some works also focus on the multi-head characteristic of the attention mechanism and optimize it by reducing parallel computations, such as FLASH (Hua et al., 2022) and Transformer-MGK (Nguyen et al., 2022).
The proposed DBA is most similar to Linformer: both approximate features in Vanilla Attention with low-rank matrices and achieve linear complexity. The main differences are four-fold. First, the low-rank projection matrices in DBA are more flexible than in Linformer: they are dynamically determined by the input sequence rather than fixed after training, so as to best preserve the most informative parts of a sequence. Second, DBA can process sequences of various lengths, as the dimensions of the sequence length compression matrices are also determined by the input. Third, DBA achieves state-of-the-art performance with high efficiency over various sequence length conditions by jointly considering the sequence length and the hidden state dimension. Furthermore, DBA is proficient in capturing high-order relations with multi-stage interactions, whereas Linformer can only perform a one-stage interaction.

3. METHOD

Our goal is to design an efficient attention mechanism with linear complexity, starting from the Vanilla Attention defined in equations 1 and 2. In Section 3.1, we theoretically demonstrate that the input sequence length n can be compressed losslessly from the perspective of information theory, leading to linear complexity in both time and space. In Section 3.2, we extend the Johnson-Lindenstrauss lemma to prove that the multiplication between Q and K can be reduced by a low-rank approximation with only a high-order small error term in the results, mitigating the impact of the hidden state dimension d on efficiency. In Section 3.3, we present the sources of the matrices newly introduced in DBA and show that the sequence length compression matrices are dynamically determined by the input features, leading to adaptive coefficients for tokens in the same position. In Section 3.4, we show that DBA can capture high-order inter-relations in cross-attention problems and perform multi-stage interactions within a single attention layer.

3.1. OPTIMIZE THE SEQUENCE LENGTH WITH INFORMATION THEORY

In this section, we optimize the quadratic complexity of the Transformer in sequence length by analyzing the attention mechanism from an information theory perspective, leading to linear complexity in both time and space. Specifically, we show that the attention map P_ϕ ∈ R^{n×n} can be replaced by a set of smaller matrices without information loss. Note that in Vanilla Attention, P_ϕ is deterministic given QK^T. Hence, the conditional entropy of P_ϕ given QK^T is 0:

H(P_ϕ | QK^T) = H(softmax(QK^T / √d) | QK^T) = 0    (3)

Therefore, QK^T contains all the information P_ϕ carries. Notice that QK^T can be reconstructed losslessly from its bases and the corresponding reconstruction coefficients. Hence, the conditional entropy of P_ϕ given the bases of QK^T and the reconstruction coefficients is also 0:

H(P_ϕ | basis_r(basis_c(QK^T)), W′_r, W′_c) = 0    (4)

where basis_r and basis_c compute the bases of QK^T in the row and column spaces, respectively, and W′_r and W′_c are the reconstruction coefficients for rows and columns, whose values and dimensions are determined by QK^T. From the rank property of matrix multiplication, we have the following inequality:

Rank(QK^T) ≤ min(Rank(Q), Rank(K)) ≤ min(n, d)    (5)

where Rank(·) denotes the rank of a matrix. Hence, basis_r(basis_c(QK^T)) is no larger than R^{min(n,d)×min(n,d)}. Therefore, with the help of equation 4, a given P_ϕ can be represented losslessly by a matrix P′_ϕ ∈ R^{d_p×d_p} (d_p ≤ min(n, d)) and reconstruction coefficients W′_r ∈ R^{n×d_p} and W′_c ∈ R^{n×d_p}. In practice, we form:

P_ϕ = W′_r P′_ϕ W′_c^T    (6)

where P′_ϕ, W′_r, and W′_c are determined by the input and learned through the training process. Here, we project Q ∈ R^{n×d} and K ∈ R^{n×d} to Q_l ∈ R^{d_p×d} and K_l ∈ R^{d_p×d} to generate P′_ϕ, with Q_l and K_l as the input to equation 1.
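The rank bound in equation 5 is easy to verify numerically: for n much larger than d, the n×n logit matrix QK^T has rank at most d. A small illustrative check with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8                        # long sequence, small hidden dimension
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
logits = Q @ K.T                    # shape (n, n), but its rank is bounded by d
r = np.linalg.matrix_rank(logits)   # numerically computed rank
assert r <= min(n, d)               # rank(QK^T) <= min(rank(Q), rank(K)) <= min(n, d)
```

This is the structural redundancy the lossless factorization P_ϕ = W′_r P′_ϕ W′_c^T exploits: the n×n map carries no more information than a min(n, d)-dimensional object.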
Therefore, we can derive the following equation:

P_ϕ = W′_r P′_ϕ W′_c^T = W′_r softmax(Q_l K_l^T / √d) W′_c^T = W′_r softmax((W_r Q)(K^T W_c^T) / √d) W′_c^T    (7)

where W_r ∈ R^{d_p×n}, W_c ∈ R^{d_p×n}, W′_r ∈ R^{n×d_p}, and W′_c ∈ R^{n×d_p}. By using P′_ϕ as the new attention map instead of P_ϕ, we avoid quadratic time and space complexity in attention map generation. However, the reconstruction process using W′_r and W′_c^T still incurs high complexity. We therefore first merge W′_c and V into V_DBA, then multiply V_DBA by P′_ϕ, and apply the reconstruction by W′_r last. By optimizing the calculation order, DBA achieves linear complexity:

DBA(K, Q, V) = W′_r (P′_ϕ (W′_c^T V)) = W′_r P′_ϕ V_DBA    (8)

where V_DBA = W′_c^T V ∈ R^{d_p×d}.
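The reordered computation in equation 8 can be sketched as follows, with random matrices standing in for the learned and input-dependent projections (illustrative only). Note that no n×n intermediate is ever materialized:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dba_sequence_compressed(Q, K, V, Wr, Wc, Wr_rec, Wc_rec):
    """Sketch of Eq. 8: compress length n -> d_p, attend on the short
    sequence, then reconstruct. Wr, Wc: (d_p, n) compression matrices;
    Wr_rec, Wc_rec: (n, d_p) reconstruction matrices (W'_r, W'_c)."""
    d = Q.shape[-1]
    Ql, Kl = Wr @ Q, Wc @ K                       # compressed Q_l, K_l: (d_p, d)
    P_small = softmax(Ql @ Kl.T / np.sqrt(d))     # small attention map: (d_p, d_p)
    V_dba = Wc_rec.T @ V                          # merge W'_c with V first: (d_p, d)
    return Wr_rec @ (P_small @ V_dba)             # reconstruct last: (n, d), linear in n

n, d, dp = 32, 8, 4
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, n, d))
Wr, Wc = rng.normal(size=(2, dp, n))
Wr_rec, Wc_rec = rng.normal(size=(2, n, dp))
out = dba_sequence_compressed(Q, K, V, Wr, Wc, Wr_rec, Wc_rec)
assert out.shape == (n, d)
```

By associativity, this evaluation order gives exactly the same result as first materializing the full n×n map W′_r P′_ϕ W′_c^T and then multiplying by V, but every intermediate is at most d_p×n or d_p×d.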

3.2. OPTIMIZE HIDDEN STATE DIMENSION WITH MATRIX APPROXIMATION

In Section 3.1, we optimized the sequence length by analyzing it from the information theory perspective, leading to linear complexity. In this section, we further increase the efficiency of DBA by mitigating the impact of the hidden state dimension d. Specifically, we extend the Johnson-Lindenstrauss lemma (Arriaga & Vempala, 2006; Lindenstrauss & Johnson, 1984) to show that the multiplication between Q and K can be approximated with only a high-order small error term. Based on the Johnson-Lindenstrauss lemma, we can derive that when d_in ≥ 10 log(d_p) / (ϵ² − ϵ³), the following holds:

Pr(‖(W_r Q) R R^T K^T W_c^T − (W_r Q) K^T W_c^T‖ ≤ ϵ ‖(W_r Q) K^T W_c^T‖) > 1 − o(1)    (9)

The proof details are in Appendix A.1. Equation 9 shows that the multiplication between Q and K can be replaced by alternatives with a lower hidden state dimension (d_in vs. d), incurring only a high-order small error compared to the full-rank multiplication. Therefore, we further project Q_l ∈ R^{d_p×d} and K_l ∈ R^{d_p×d} from Section 3.1 to Q_DBA ∈ R^{d_p×d_in} and K_DBA ∈ R^{d_p×d_in}, and finally DBA can be written as follows:

DBA(K, Q, V) = W′_r softmax(((W_r Q)R)(R^T(K^T W_c^T)) / √d_in) W′_c^T V = W′_r softmax(Q_DBA K_DBA^T / √d_in) V_DBA    (10)

The attention mechanism is now compressed in bilinear form in both the sequence length and the hidden state dimension, increasing efficiency for sequences of various lengths. A graphical comparison between Vanilla Attention and DBA is illustrated in Figure 2.
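The approximation behind equation 9 can be illustrated with a random Gaussian projection R: with entries drawn i.i.d. from N(0, 1/d_in), the product x R R^T y^T concentrates around x y^T. A small NumPy check (illustrative only; the constants are not tuned to the bound in the text, and the error is measured relative to the vector norms):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in = 512, 64
# Random projection with i.i.d. N(0, 1/d_in) entries, so E[R R^T] = I_d
R = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d, d_in))

errs = []
for _ in range(100):
    x = rng.normal(size=d)
    y = rng.normal(size=d)
    exact = x @ y                          # full-dimension inner product
    approx = (x @ R) @ (R.T @ y)           # computed through the d_in-dim projection
    errs.append(abs(approx - exact) / (np.linalg.norm(x) * np.linalg.norm(y)))
mean_err = float(np.mean(errs))            # small relative error on average
```

Projecting both sides through R reduces the inner-product dimension from d to d_in while keeping the result close on average, which is exactly how Q_DBA and K_DBA replace Q_l and K_l.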

3.3. THE SOURCE OF MATRICES

In this section, we define the sources of the matrices newly introduced in DBA in the self-attention setting, including the hidden state compression matrix R, the sequence compression matrices W_r and W_c, and the reconstruction matrices W′_r and W′_c. We show that the weights in the sequence compression matrices are determined by the input sequence, leading to dynamic coefficients for tokens in the same position across different sequences. R compresses the hidden state to a fixed dimension; it is therefore implemented as a fully connected layer and learned through training. W′_r and W′_c are obtained by propagating the input sequence through fully connected layers to reach the expected hidden state dimension. W_r and W_c are generated by combining the input sequence with an extra input Z ∈ R^{d_p×d} of shorter length:

W_r = φ(Z Q^T)    (11)

W_c = φ(Z K^T)    (12)

where φ is a normalization function that stabilizes the training process; in practice, we set φ to the softmax function. The compression matrices W_r and W_c are thus dynamically determined by the input sequence, where every coefficient in W_r and W_c is a linear transformation of the token features in the corresponding position. Each row of W_r and W_c is a set of compression coefficients over all tokens in the input sequence, so each position of the final compressed sequences W_r Q and K^T W_c^T is a different weighted sum of tokens in the original sequence. The weights in the reconstruction matrices W′_r and W′_c are also determined by the input, where the rows likewise represent different sets of input-dependent coefficients. Note that the dimensions of W_r, W_c, W′_r, and W′_c are dynamically determined by the input, enabling DBA to process sequences of various lengths without fixed padding. In practice, we set Z as learnable parameters propagated through the different attention layers.
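Equation 11 can be sketched directly: each row of W_r = softmax(Z Q^T) is a distribution over the n input tokens, so the compressed sequence W_r Q consists of d_p input-dependent weighted sums of tokens. A sketch with random Z and Q (in DBA, Z is learnable and Q comes from the input):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_compression(Z, Q):
    """Eq. 11: W_r = softmax(Z Q^T). Each row is a set of mixing
    coefficients over the n tokens, computed from the input itself,
    so different inputs get different compression weights."""
    return softmax(Z @ Q.T)                 # shape (d_p, n), rows sum to 1

n, d, dp = 16, 8, 4
rng = np.random.default_rng(0)
Z = rng.normal(size=(dp, d))                # stand-in for the learnable extra input
Q = rng.normal(size=(n, d))
Wr = dynamic_compression(Z, Q)
Q_compressed = Wr @ Q                       # (d_p, d): d_p weighted sums of tokens
assert np.allclose(Wr.sum(axis=1), 1.0)
```

Because W_r is computed from Q itself, a sequence of any length n yields a valid (d_p, n) compression matrix, which is why no fixed padding is needed.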

3.4. CAPTURE HIGH-ORDER RELATIONS IN CROSS-ATTENTION

In this section, we show that DBA is able to capture high-order relations through multi-stage interactions within a single attention layer in the cross-attention setting. We first introduce the cross-attention algorithm of the Vanilla Transformer and then compare it with the proposed DBA. Cross-attention in the Vanilla Transformer shares the same expression as self-attention in equations 1 and 2; the only difference is the input. In cross-attention, one input X_1 from H_1 is processed as Q_1, and the other input X_2 from H_2 is processed as K_2 and V_2, where the subscripts 1 and 2 indicate variables in different hierarchies, and H_1 and H_2 denote the hierarchies. Using the Vanilla Attention algorithm, Q_1 is fused with K_2, leading to a one-stage interaction within an attention layer:

Cross-Attention(K_2, Q_1, V_2) = softmax(Q_1 K_2^T / √d) V_2    (13)

DBA takes different inputs from Vanilla Attention in cross-attention. Instead of taking the full-length features X_2 as K_2 and V_2, DBA takes the compressed sequence W_r2 X_2 as Z_1, K_2, and V_2; both models take X_1 as Q_1. First, we compress the sequence length following Section 3.1. As K_2 and V_2 formed from W_r2 X_2 have already been compressed, we only need to compress the sequence length of Q_1 using W_r1, which is obtained from Z_1:

W_r1 = φ(Z_1 Q_1^T) = φ(Linear(W_r2 X_2) Q_1^T)    (14)

where Linear(·) denotes a fully connected layer. The advantages of using W_r2 X_2 as Z_1 are two-fold. First, the features from the two hierarchies interact when generating W_r1. Second, it compresses the sequence length of Q_1 with compression coefficients guided by the features in H_2. After sequence length compression, we mitigate the impact of d on efficiency following Section 3.2, and finally obtain (Q_DBA)_1, (K_DBA)_2, and (V_DBA)_2 to perform a second interaction between the two features as in equation 10.
Therefore, DBA captures inter-relations both in the generation of the sequence compression matrices and in the attention mechanism within a single attention layer, where the compressed feature in H_2 interacts with the original and compressed features in H_1, enabling DBA to capture high-order relations and perform multi-stage interactions.
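The two-stage interaction described above can be sketched end-to-end as follows. The stand-in weight matrices W_z (for the Linear layer in equation 14) and W_rec (for the reconstruction matrix W′_r, which DBA produces from X_1 via a fully connected layer) are random placeholders, so this is only a shape-level illustration of the data flow:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dba_cross_attention(X1, X2, Wr2, R, W_z, W_rec):
    """Sketch of DBA cross-attention. X1 (n1, d) from hierarchy H1 plays Q_1;
    X2 (n2, d) from H2 is compressed by Wr2 (d_p, n2), and the compressed
    result serves as Z_1, K_2 and V_2. W_z and W_rec are hypothetical
    placeholders for the learned linear layers."""
    d_in = R.shape[1]
    KV = Wr2 @ X2                                 # (d_p, d): compressed H2 features
    Z1 = KV @ W_z                                 # Eq. 14: Z_1 = Linear(W_r2 X_2)
    Wr1 = softmax(Z1 @ X1.T)                      # (d_p, n1): first interaction (H2 guides H1 compression)
    Q_dba = (Wr1 @ X1) @ R                        # (d_p, d_in): compressed, projected Q_1
    K_dba = KV @ R                                # (d_p, d_in)
    P = softmax(Q_dba @ K_dba.T / np.sqrt(d_in))  # second interaction via attention
    return (X1 @ W_rec) @ (P @ KV)                # (n1, d): restore length n1

n1, n2, d, dp, d_in = 12, 20, 8, 4, 6
rng = np.random.default_rng(0)
X1 = rng.normal(size=(n1, d))
X2 = rng.normal(size=(n2, d))
Wr2 = rng.normal(size=(dp, n2))
R = rng.normal(size=(d, d_in))
W_z = rng.normal(size=(d, d))
W_rec = rng.normal(size=(d, dp))
out = dba_cross_attention(X1, X2, Wr2, R, W_z, W_rec)
assert out.shape == (n1, d)
```

The compressed H_2 feature appears twice: once when generating W_r1 (guiding how Q_1 is compressed) and once inside the attention itself, which is the multi-stage interaction the text describes.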

4. EXPERIMENTS

We evaluate the performance of DBA on three datasets covering long and diverse sequence conditions with self- and cross-attention tasks: Long-Range Arena (LRA) (Tay et al., 2021b) as the long sequence benchmark, the UEA multivariate time series classification archive (Bagnall et al., 2018) to evaluate performance over various sequence lengths, and VQA-v2 (Goyal et al., 2017) to test cross-attention performance. Detailed descriptions of the datasets are in Appendix A.2, and the experiment settings are listed in Appendix A.3.

4.1. EFFICIENCY

The efficiency of DBA compared with the Vanilla Transformer and other efficient Transformers is illustrated in Figure 1 and Table 1. We report the speed and peak memory usage of different attention mechanisms for sequence lengths from 256 to 4k. DBA achieves state-of-the-art efficiency in terms of speed and peak memory usage: it is faster than the Vanilla Transformer and consumes less memory over various sequence conditions. In long sequence conditions, DBA is 6.1 times faster than the Vanilla Transformer and uses only 9% of its memory at 4k sequence length. For shorter sequences, DBA also achieves the highest efficiency, being 1.4 times faster and using only 66% of the memory of Vanilla Attention at 512 sequence length. DBA only falls behind the Synthesizer in speed at 256 sequence length; however, DBA uses much less memory and is far more efficient in long sequence conditions. In addition, DBA achieves the best average performance on the LRA task, demonstrating its effectiveness and efficiency.

4.2. PERFORMANCE ON LONG SEQUENCE MODELING

We evaluate the long sequence modeling performance of DBA and previous methods on the LRA benchmark, as listed in Table 2. DBA achieves state-of-the-art performance in terms of average score. Looking at individual tasks, DBA achieves the best results on three out of five. Notably, DBA surpasses the Vanilla Transformer and previous low-rank-based methods on all five tasks and is exceptionally proficient in image-related tasks, where the most informative parts change significantly across inputs. Note that DBA also has the fastest speed and lowest memory consumption in all tasks, demonstrating its effectiveness and efficiency.

4.3. PERFORMANCE ON TIME SERIES SIGNAL IN VARIOUS LENGTH

We use the UEA multivariate time series classification archive to evaluate the performance of models over various sequence lengths. The results are illustrated in Table 3. DBA achieves the best performance compared with previous methods on all 10 tasks, with a 2.3% improvement in average accuracy over the Vanilla Transformer, highlighting its capability to process sequences of various lengths.

4.4. PERFORMANCE ON CAPTURE CROSS-ATTENTION RELATIONS

We use the VQA-v2 dataset to evaluate the performance of DBA on cross-attention tasks. The results are shown in Table 4. Compared with previous methods, where the image and question interact once per layer, DBA captures high-order relations between hierarchies and performs multi-stage interactions within a single attention layer. As a result, DBA achieves the best results on the VQA-v2 task in all four evaluation aspects with only 12% of the attention-layer parameters of the Vanilla Transformer based model (Yu et al., 2019), highlighting its effectiveness in capturing cross-attention relations.

4.5. PERFORMANCE WITH STATE SPACE MODEL BACKBONE

As a different approach from the Transformer, the state space model has achieved promising results in long sequence modeling. The state space model takes a similar input X ∈ R^{n×d} to the Transformer when processing a sequence, with its speed and memory consumption heavily influenced by the sequence length n. DBA can be directly plugged into the state space model by compressing the sequence that the state space model needs to process from R^{n×d} to R^{d_p×d}, improving efficiency while maintaining the final performance. We use S4 (Gu et al., 2022) as the backbone. The results are illustrated in Table 5. S4 with DBA optimization achieves a 1.4x average speed boost and 0.8x average memory consumption with competitive performance compared to the baseline, highlighting the universality of DBA.
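The plug-and-play use with a state space backbone can be sketched at the shape level: compress the length-n input to d_p tokens, run the backbone on the short sequence, and restore the original length. The `seq_model` callable below is a toy stand-in, not the actual S4 layer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dba_wrap(X, Z, W_rec, seq_model):
    """Hedged sketch: dynamically compress the length-n input to d_p tokens
    (Sec. 3.1), run an arbitrary sequence model on the short sequence, then
    restore the original length with a reconstruction matrix. `Z` and
    `W_rec` are random placeholders for the learned parameters."""
    Wr = softmax(Z @ X.T)            # (d_p, n): dynamic compression matrix
    Xc = Wr @ X                      # (d_p, d): compressed sequence
    Yc = seq_model(Xc)               # backbone now sees d_p << n tokens
    return W_rec @ Yc                # (n, d): restore original length

n, d, dp = 64, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Z = rng.normal(size=(dp, d))
W_rec = rng.normal(size=(n, dp))
out = dba_wrap(X, Z, W_rec, seq_model=lambda h: np.tanh(h))  # toy stand-in backbone
assert out.shape == (n, d)
```

Because the backbone's cost scales with the sequence length it sees, shrinking n to d_p before the backbone is where the reported speed and memory savings would come from.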

5. CONCLUSION

In this paper, we propose Dynamic Bilinear Low-Rank Attention (DBA), an efficient attention mechanism that compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and the hidden state dimension. Theoretical analysis from the perspectives of information theory and matrix low-rank approximation shows that DBA achieves functionality similar to Vanilla Attention with only a high-order small error term. In addition, DBA is capable of capturing high-order relations in cross-attention problems. Experiments show that DBA achieves state-of-the-art performance with faster speed and lower memory consumption than previous models, highlighting its efficiency and effectiveness.

A APPENDIX

A.1 PROOF IN OPTIMIZING HIDDEN STATE DIMENSION

Johnson-Lindenstrauss lemma. Let R ∈ R^{d×d_in}, 1 ≤ d_in ≤ d, with i.i.d. entries from N(0, 1/d_in). For any x, y ∈ R^d, we have:

Pr(|x R R^T y^T − x y^T| ≤ ϵ |x y^T|) > 1 − 2e^{−(ϵ²−ϵ³) d_in / 4}

Based on the Johnson-Lindenstrauss lemma, since Q contains n rows, we obtain by the union bound:

Pr(‖Q R R^T k_i^T − Q k_i^T‖ ≤ ϵ ‖Q k_i^T‖) ≥ 1 − Σ_{q_i ∈ Q} Pr(|q_i R R^T k_i^T − q_i k_i^T| > ϵ |q_i k_i^T|) > 1 − 2n e^{−(ϵ²−ϵ³) d_in / 4}

K also contains n rows. Hence:

Pr(‖Q R R^T K^T − Q K^T‖ ≤ ϵ ‖Q K^T‖) ≥ 1 − Σ_{k_i ∈ K} Pr(‖Q R R^T k_i^T − Q k_i^T‖ > ϵ ‖Q k_i^T‖) > 1 − 2n² e^{−(ϵ²−ϵ³) d_in / 4}

Applying the same argument to the compressed matrices, which have d_p rows each:

Pr(‖(W_r Q) R R^T K^T W_c^T − (W_r Q) K^T W_c^T‖ ≤ ϵ ‖(W_r Q) K^T W_c^T‖) > 1 − 2d_p² e^{−(ϵ²−ϵ³) d_in / 4}

Letting d_in ≥ 10 log(d_p) / (ϵ² − ϵ³), we derive equation 9, and the theorem follows.

A.2 DATASETS

LRA (Tay et al., 2021b) is a popular benchmark for testing the efficiency of Transformers in long sequence conditions, containing a suite of tasks (ListOps (Nangia & Bowman, 2018), byte-level text classification (Maas et al., 2011), document retrieval (Radev et al., 2013), pixel-level image classification (Krizhevsky & Hinton, 2009), and Pathfinder (Linsley et al., 2018)) with sequence lengths ranging from 1k to 4k. The UEA multivariate time series classification archive (Bagnall et al., 2018) is a collection of datasets for evaluating time series classification algorithms, covering a wide range of problems with various sequence lengths. VQA-v2 (Goyal et al., 2017) is a popular benchmark for multi-modal models, containing 1.1 million human-labeled image-question pairs with around 13 million associated answers on 200k images from the Microsoft COCO dataset (Lin et al., 2014); it is split into train, val, and test sets.

Benchmark | Task | Sequence length
LRA (Tay et al., 2021b) | Long Sequence Modeling | 1k-4k
UEA (Bagnall et al., 2018) | Time Series | 29-1751
VQA-v2 (Goyal et al., 2017) | Visual Question Answering | 5-625

A.3 EXPERIMENT SETTINGS

All the experiments are conducted using PyTorch (Paszke et al., 2019) and Numpy (Harris et al., 2020) with Nvidia GPU. 

A.4 VISUALIZATION OF DYNAMIC SEQUENCE LENGTH PROJECTION MATRICES

We visualized the dynamic sequence length compression matrices in DBA and compared them with the input-invariant compression matrices in Linformer on the SelfRegulationSCP1 task (Birbaumer et al., 1999), as shown in Figure 3. SelfRegulationSCP1 records EEG data and is one of the UEA multivariate time series classification archives. Results show that the sequence length projection matrix is determined by the input, highlighting values in different positions for different inputs, while Linformer concentrates on the same positions for different samples. In addition, the sequence compression matrices in DBA are "smoother" between adjacent positions and show a more noticeable trend than those in Linformer. Given the characteristics of the SelfRegulationSCP1 task and how people diagnose such conditions (Birbaumer et al., 1999), the concentrated position should differ across inputs and follow coherent trends between adjacent points, as in DBA, rather than oscillate. DBA processes the signals in a more human-like way and achieves higher performance, demonstrating the superiority of dynamic sequence length projection matrices.

The ablation results are illustrated in Tables 11 and 12. Results show that both the sequence length and the hidden state dimension compression contribute to efficiency: the sequence length compression contributes a higher speed-up ratio as the sequence length increases (from 1.0x to 5.9x speed-up over 256-4k sequence lengths), while the hidden state dimension compression contributes a dedicated speed-up rate (around 1.1x) for inputs of all lengths. We also investigate different settings of d_p and d_in with respect to efficiency: DBA runs faster with less memory consumption as d_p and d_in decrease. Table 12 illustrates the effect of d_p and d_in on the final performance.
Results show that DBA achieves performance similar to the counterparts without sequence length or hidden state dimension compression on three out of five tasks on the LRA dataset, consistent with our theory that the optimization in DBA is either lossless or introduces only a high-order small error term. DBA achieves higher performance on the image and pathfinder tasks, which we believe is because the optimization in DBA contributes to generalization capability and ease of optimization. DBA achieves higher validation accuracy at the same training loss and converges faster, as shown in detail in Figure 4. Note that DBA is faster and uses less memory than the ablation counterparts without sequence length or hidden state dimension compression, demonstrating its efficiency and effectiveness. Different settings of d_p and d_in and their effects on the final performance are also illustrated in Table 12. We set d_in = 24 for the experiments on d_p, and d_p = 16 for the experiments on d_in. Increasing d_p from 8 to 128, DBA achieves higher performance, especially on the retrieval, image, and pathfinder tasks, while maintaining similar performance on the other two tasks; however, the pathfinder performance drops when d_p is increased to 256. Increasing d_in from 12 to 64, DBA also achieves higher performance on the same three tasks. To balance performance and efficiency, we set d_p = 16 and d_in = 24 for all tasks on the LRA dataset, as shown in Table 7.



Figure 2: Illustration of the algorithm of Vanilla Attention versus DBA. Compared with the Vanilla Transformer (Vaswani et al., 2017), the Q and K in DBA are compressed to low-rank alternatives in bilinear form over both the sequence length and the hidden state dimension. DBA has the same input and output as the Vanilla Transformer through the reconstruction matrix W'_r, making DBA plug-and-play with existing Transformers. The labels on the rows and columns of the squares denote feature dimensions.
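As a rough illustration of the bilinear form in Figure 2, the NumPy sketch below compresses the sequence length with an input-dependent projection and the hidden state dimension with a second projection, then restores the output shape with a reconstruction matrix. The weight names (Wp, Wh, Wr), the softmax parameterization of the dynamic projection, and the single-head setting are all illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dba_attention(X, Wq, Wk, Wv, Wp, Wh, Wr):
    """Bilinear low-rank attention sketch (all names illustrative).

    X : (n, d)    input sequence
    Wp: (d, dp)   produces the dynamic sequence-length compression
    Wh: (d, din)  hidden-state-dimension compression
    Wr: (din, d)  reconstruction back to the model dimension
    """
    # The compression matrix P is computed FROM the input, so the
    # emphasized positions change from sequence to sequence.
    P = softmax((X @ Wp).T, axis=-1)                      # (dp, n)
    Q = X @ Wq @ Wh                                       # (n, din)
    K = P @ (X @ Wk @ Wh)                                 # (dp, din)
    V = P @ (X @ Wv @ Wh)                                 # (dp, din)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (n, dp), linear in n
    return (A @ V) @ Wr                                   # (n, d), same shape as X

n, d, dp, din = 32, 16, 4, 8
rng = np.random.default_rng(0)
W = lambda a, b: rng.normal(size=(a, b)) / np.sqrt(a)
X = rng.normal(size=(n, d))
Y = dba_attention(X, W(d, d), W(d, d), W(d, d), W(d, dp), W(d, din), W(din, d))
```

Since the output has the same shape as the input, such a layer can replace a standard attention layer without changing the surrounding architecture, which is the plug-and-play property the caption describes.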

Figure 3: Visualization of sequence length compression matrices. The first column is the input sequence. The second column compares the dynamic sequence length projection matrix in DBA and the learned input invariant projection matrix in Linformer. The third column illustrates the compressed input by DBA and Linformer, respectively. Different rows represent different input samples.
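The property the figure visualizes, input-invariant versus input-dependent compression weights, can be reproduced in a few lines. The softmax parameterization of the dynamic weights below is an illustrative assumption; the fixed matrix E stands in for a Linformer-style learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dp = 8, 4, 2

# Linformer-style: one learned matrix E, fixed for every input sequence.
E = rng.normal(size=(dp, n))

# DBA-style sketch: compression weights are generated from the input itself.
Wp = rng.normal(size=(d, dp))

def dynamic_projection(x):
    s = (x @ Wp).T                              # (dp, n) scores per position
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)    # rows sum to 1

x1 = rng.normal(size=(n, d))
x2 = rng.normal(size=(n, d))
P1, P2 = dynamic_projection(x1), dynamic_projection(x2)

# E weighs the same positions for any input; P1 and P2 do not.
weights_differ = not np.allclose(P1, P2)
```

This is the mechanism behind the second column of the figure: for the fixed projection the highlighted positions are identical across rows (samples), while the dynamic projection shifts its emphasis per input.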

Figure 4: Comparison of training loss and validation accuracy of DBA with ablation variants, including DBA without hidden state dimension compression and DBA without sequence length compression.

Speed and peak memory consumption of different models on byte-level text classification with various sequence lengths (256, 512, 1k, 2k, 3k, and 4k). The average performances on the LRA task are listed on the right. The best model is made bold.

Performance on the LRA benchmark. The DBA is trained with 5 random seeds, and the average scores with accuracy variances are reported. The best model is made bold.

Performance on the UEA multivariate time series classification archive. The best model is made bold.

Performance on the val split of VQA-v2 dataset. nPara denotes the number of parameters in attention layers. The best model is made bold.

Efficiency and performance of DBA with state space model on LRA dataset.

Summary of experiment benchmarks.

Experiment settings for the S4 with DBA optimization.

Appendix

For the experiments on the LRA dataset, DBA follows the configurations of (Ma et al., 2021), where all models use the same data processing strategy and model architecture for fair comparison.

For the experiments on the UEA multivariate time series classification archive, we select 10 multivariate datasets similar to (Wu et al., 2022) and use the same configurations following (Zerveas et al., 2021). As some of the tasks contain sequences of various lengths, the batched input is padded to the maximum length of the task during training; in our implementation, however, DBA takes the input without padding, since it can process sequences of varying lengths.

For the experiments on the VQA-v2 dataset, we use ALBERT (Lan et al., 2020) to extract question features, resulting in an R^768 embedding for every token in a sentence. On the vision side, we use the grid image features (Jiang et al., 2020) obtained from a ResNet-152 model (He et al., 2016). The i-th grid is represented as a feature x_i ∈ R^2048, with at most 608 grid features per image. After cross-attention interactions, both the language and vision parts perform intra-modality fusion following (Yu et al., 2019), and the final answer is predicted via addition.

For the experiments on the state space model, we use S4 (Gu et al., 2022) as the backbone. Our goal is to improve the efficiency of the state space model while maintaining its performance. Note that DBA first compresses the input sequence from R^{n×d} to R^{d_p×d}, then processes the compressed feature, and finally restores the sequence to its original dimension R^{n×d}. We can therefore extract the compressed feature in DBA as the input of the state space model to improve efficiency. We use one layer of DBA to compress the sequence length of the input, and DBA is plugged in after the first layer of the S4 model.

The detailed settings with hyper-parameters are listed in Tables 7, 8, 9, and 10.
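The compress-process-restore pipeline described above can be sketched as follows, with a placeholder linear map standing in for the S4 backbone. The function names and the softmax parameterization of the compression weights are illustrative assumptions, not the released implementation.

```python
import numpy as np

def compress(X, Wp):
    """Dynamic sequence compression: (n, d) -> (dp, d)."""
    s = (X @ Wp).T                               # (dp, n) position scores
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)        # row-stochastic weights
    return P, P @ X

def restore(P, Y):
    """Map the processed compressed sequence back to length n."""
    return P.T @ Y                               # (n, d)

rng = np.random.default_rng(0)
n, d, dp = 1024, 32, 16
X  = rng.normal(size=(n, d))
Wp = rng.normal(size=(d, dp)) / np.sqrt(d)

# Placeholder linear map standing in for the S4 backbone: whatever sits
# between compress and restore now sees dp = 16 positions instead of n = 1024.
Wb = rng.normal(size=(d, d)) / np.sqrt(d)

P, Z = compress(X, Wp)       # Z: (dp, d) compressed sequence
Y = restore(P, Z @ Wb)       # (n, d), original length recovered
```

Because the backbone only processes the d_p compressed positions, its cost no longer depends on the original sequence length, while the restore step returns a sequence of the original shape for the remaining layers.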

