DBA: EFFICIENT TRANSFORMER WITH DYNAMIC BI-LINEAR LOW-RANK ATTENTION

Abstract

Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear complexity over long-sequence conditions. Among them, low-rank-based methods aim to learn projection matrices that compress the sequence length, thus achieving efficiency gains. However, the projection matrices are fixed once learned, so they compress the sequence with the same coefficients for tokens at a given position regardless of the input sequence. Adopting such input-invariant low-rank projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information, which lies in different positions across different sequences. In addition, previous efficient Transformers focus only on the influence of sequence length while neglecting the hidden state dimension as a source of further efficiency gains. To address these problems, we present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and hidden state dimension, while maintaining state-of-the-art performance. Specifically, we first demonstrate theoretically, from a novel information-theoretic perspective, that the sequence length can be compressed losslessly, with the compression matrices dynamically determined by the input sequence. Furthermore, we show by extending the Johnson-Lindenstrauss lemma that the hidden state dimension can be approximated with only a high-order, vanishingly small error, optimizing the attention in a bilinear form. In addition, theoretical analysis shows that DBA is proficient in capturing high-order relations in cross-attention problems.
Experiments on tasks with diverse sequence-length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while consuming less memory and running faster, demonstrating both the effectiveness and the efficiency of DBA.
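The mechanism described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption for exposition, not the paper's actual parameterization: the scoring matrix `W_p` that makes the length-compression input-dependent, the JL-style random map `R` for the hidden dimension, and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_lowrank_attention(Q, K, V, W_p, R):
    """Illustrative sketch of dynamic bilinear low-rank attention.

    Q, K, V: (n, d) sequence representations.
    W_p:     (d, k) hypothetical scoring weights; the resulting
             projection P depends on the input, unlike a fixed
             learned projection.
    R:       (d, m) JL-style random map compressing the hidden dim.
    """
    # Input-sensitive projection over the length dimension: each of the
    # k compressed slots is a data-dependent mixture of the n tokens.
    P = softmax(K @ W_p, axis=0).T            # (k, n), columns vary per input
    K_c, V_c = P @ K, P @ V                   # (k, d): length n -> k
    # Approximate dot products in a lower hidden dimension m (JL lemma).
    Qr, Kr = Q @ R, K_c @ R                   # (n, m), (k, m)
    A = softmax(Qr @ Kr.T / np.sqrt(R.shape[1]), axis=-1)  # (n, k)
    return A @ V_c                            # (n, d); cost O(n*k*d), linear in n
```

Because `P` is recomputed from each input sequence, different sequences are compressed with different coefficients, which is the distinction the abstract draws against fixed, input-invariant projections.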

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has shown immense capabilities in a wide range of areas, including natural language processing (Dai et al., 2019), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021), time series analysis (Zerveas et al., 2021), and multi-modal tasks (Qin et al., 2022a; Yu et al., 2019). However, the vanilla Transformer suffers from quadratic time and memory complexity, raising concerns about its applicability in further scenarios. Therefore, several efficient Transformers have been introduced (Tay et al., 2022). Among them, kernel-based methods have drawn much attention due to their optimization-friendly characteristics; they improve efficiency by approximating the attention mechanism (Katharopoulos et al., 2020; Wang et al., 2020; Ma et al., 2021; Xiong et al., 2021; Qin et al., 2022b; Choromanski et al., 2021). One popular kernel-based technique is low-rank approximation, which compresses the sequence-length dimension using the same coefficients for all sequences. For instance, Wang et al. (2020) approximated the stochastic attention matrix along the sequence-length dimension using sets of fixed coefficients, learned during training, to compute weighted sums of tokens at different positions. Xiong et al. (2021) adopted the Nyström method to approximate the attention mechanism with linear complexity, decreasing the sequence length via mean pooling.
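For contrast, a minimal NumPy sketch of such an input-invariant low-rank scheme (in the spirit of Wang et al. (2020), with illustrative shapes; the projection `E` here stands in for the fixed coefficients learned during training):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fixed_lowrank_attention(Q, K, V, E):
    """Sketch of low-rank attention with a fixed projection.

    E: (k, n) projection whose coefficients are frozen after training,
    so every sequence is compressed with the same per-position weights.
    """
    K_c, V_c = E @ K, E @ V                   # (k, d): length n -> k
    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[1]), axis=-1)  # (n, k)
    return A @ V_c                            # (n, d); O(n*k*d) vs O(n^2*d)
```

Whatever positions happen to carry the most information in a given sequence, `E` mixes tokens with the same position-wise coefficients every time, which is the limitation the proposed dynamic projection is designed to remove.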

