MRSFORMER: TRANSFORMER WITH MULTIRESOLUTION-HEAD ATTENTION

Abstract

We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases. The MRA decomposes a signal into components that lie on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads in multi-head attention into fine-scale and coarse-scale heads, modeling the attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and a smaller memory footprint than the standard softmax transformer with multi-head attention. We analyze and validate the advantages of MrsFormer over standard transformers on a wide range of applications, including image and time series classification.

1. INTRODUCTION

The transformer architecture (Vaswani et al., 2017) is widely used in natural language processing (Devlin et al., 2018; Al-Rfou et al., 2019; Dai et al., 2019; Child et al., 2019; Raffel et al., 2020; Baevski & Auli, 2019; Brown et al., 2020; Dehghani et al., 2018), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2020; Ramesh et al., 2021; Radford et al., 2021; Arnab et al., 2021; Liu et al., 2022; Zhao et al., 2021; Guo et al., 2021; Chen et al., 2022), speech processing (Gulati et al., 2020; Dong et al., 2018; Zhang et al., 2020; Wang et al., 2020b), and other related applications (Rives et al., 2021; Jumper et al., 2021; Chen et al., 2021; Zhang et al., 2019; Wang & Sun, 2022). Transformers achieve state-of-the-art performance in many of these practical tasks, and the results improve with larger model sizes and increasingly long sequences. For example, the text-generation model in (Liu et al., 2018a) processes input sequences of up to 11,000 tokens. Applications involving other data modalities, such as music (Huang et al., 2018) and images (Parmar et al., 2018), can require even longer sequences. Lying at the heart of transformers is the self-attention mechanism, an inductive bias that connects each token in the input to a relevance-weighted combination of every other token to capture the contextual representation of the input sequence (Cho et al., 2014; Parikh et al., 2016; Lin et al., 2017; Bahdanau et al., 2014; Vaswani et al., 2017; Kim et al., 2017). The capability of self-attention to attain diverse syntactic and semantic representations from long input sequences accounts for much of the success of transformers in practice (Tenney et al., 2019; Vig & Belinkov, 2019; Clark et al., 2019; Voita et al., 2019a; Hewitt & Liang, 2019). Multi-head attention (MHA) extends self-attention by concatenating the outputs of multiple attention heads to compute the final output, as explained in Section 2.1 below.
In spite of the success of MHA, it has been shown that its attention heads are redundant and tend to learn similar attention patterns, thus limiting the representation capacity of the model (Michel et al., 2019; Voita et al., 2019b; Bhojanapalli et al., 2021). Furthermore, additional heads increase the computational and memory costs, which becomes a bottleneck when scaling up transformers to very long sequences in large-scale practical tasks. These high computational and memory costs, together with the head-redundancy issue, motivate the need for a new, efficient attention mechanism.

1.1. CONTRIBUTION

Leveraging the idea of the multiresolution approximation (MRA) (Mallat, 1999; 1989; Crowley, 1981), we propose a class of efficient and flexible transformers, namely the Transformer with Multiresolution-head Attention (MrsFormer). At the core of MrsFormer is the novel Multiresolution-head Attention (MrsHA), which computes approximations of the outputs H_h, h = 1, ..., H, of the attention heads in MHA at different scales, saving computation and reducing the memory cost of the model. The MRA has been widely used to efficiently approximate complicated signals such as video and images in signal and image processing (Mallat, 1999; Taubman & Marcellin, 2002; Bhaskaran & Konstantinides, 1997), as well as to approximate solutions of partial differential equations (Dahmen et al., 1997; Qian & Weiss, 1993). While existing works approximate the attention matrices using the MRA (Zeng et al., 2022; Fan et al., 2021; Tao et al., 2020; Li et al., 2022), our MrsHA is the first method that approximates the output of an attention head, resulting in a better approximation scheme than approaches that approximate the attention matrices. Our contribution is three-fold:

1. We derive the approximation of an attention head at different scales via two steps: i) directly approximating the output sequence H, and ii) approximating the value matrix V, i.e., the dictionary that contains the bases of H.

2. We develop MrsHA, a novel MHA whose attention heads approximate the output sequences H_h, h = 1, ..., H, at different scales. We then propose MrsFormer, a new class of transformers that use MrsHA in their attention layers.

3. We empirically verify that MrsFormer helps reduce head redundancy and achieves better efficiency than the baseline softmax transformer while attaining comparable accuracy.

Organization: We structure this paper as follows. In Section 2, we derive the approximation of the output sequences H_h, h = 1, ..., H, at different scales and propose MrsHA and MrsFormer. In Sections 3 and 4, we empirically validate and analyze the advantages of MrsFormer over the baseline softmax transformer. We discuss related work in Section 5. The paper ends with concluding remarks. Additional experimental details are provided in the Appendix.
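To make the coarse-scale idea concrete, the following is a minimal illustrative sketch, not the paper's exact construction: the function name `coarse_head` and the average-pooling scheme are our own. A coarse-scale head attends between groups of tokens rather than individual tokens, shrinking the attention matrix from N x N to (N/scale) x (N/scale).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def coarse_head(Q, K, V, scale):
    """Toy coarse-scale head (illustrative, not the paper's exact scheme):
    average-pool tokens into groups of size `scale`, attend between the
    N/scale group representatives, then upsample back to N tokens."""
    N, D = Q.shape
    g = N // scale
    Qc = Q.reshape(g, scale, D).mean(axis=1)   # group-level queries
    Kc = K.reshape(g, scale, D).mean(axis=1)   # group-level keys
    Vc = V.reshape(g, scale, -1).mean(axis=1)  # group-level values
    A = softmax(Qc @ Kc.T / np.sqrt(D))        # (g, g) coarse attention
    return np.repeat(A @ Vc, scale, axis=0)    # upsample to (N, D_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
H_coarse = coarse_head(Q, K, V, scale=2)
print(H_coarse.shape)  # (8, 4)
```

Every token in a group receives the same coarse output, which is what makes the cost quadratic in the number of groups rather than the number of tokens.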

2.1. BACKGROUND: SELF-ATTENTION

The self-attention mechanism learns long-range dependencies via parallel processing of the input sequence. For a given input sequence X := [x_1, ..., x_N]^T ∈ R^{N×D_x} of N feature vectors, self-attention transforms X into the output sequence H := [h_1, ..., h_N]^T ∈ R^{N×D_v} as follows:

H = softmax(QK^T / sqrt(D)) V := AV,    (1)

where Q := [q_1, ..., q_N]^T, K := [k_1, ..., k_N]^T, and V := [v_1, ..., v_N]^T are the projections of the input sequence X into three different subspaces spanned by W_Q, W_K ∈ R^{D×D_x} and W_V ∈ R^{D_v×D_x}, i.e., Q = XW_Q^T, K = XW_K^T, and V = XW_V^T. In the context of transformers, Q, K, and V are named the query, key, and value matrices, respectively. The softmax function is applied row-wise. The matrix A = softmax(QK^T / sqrt(D)) ∈ R^{N×N} is the attention matrix, whose components a_ij for i, j = 1, ..., N are the attention scores. The structure of the attention matrix A learned from data determines the ability of self-attention to capture contextual representations for each token. Eqn. (1) is also called the scaled dot-product or softmax attention. In our paper, we call a transformer that uses this attention the softmax transformer.

Multi-head Attention (MHA). In MHA, multiple heads are concatenated to compute the final output. Let H be the number of heads, let H_h denote the output of head h, and let W_O^multi := [W_O^(1); ...; W_O^(H)] ∈ R^{HD_v×D_v} stack the per-head projections W_O^(h) ∈ R^{D_v×D_v}. The multi-head attention is defined as

MultiHead({H_h}_{h=1}^{H}) = Concat(H_1, ..., H_H) W_O^multi.    (2)

The MHA enables transformers to capture more diverse attention patterns.
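The two definitions above can be sketched in NumPy; this is a minimal reference implementation where `attention` follows Eqn. (1) and `multi_head` follows Eqn. (2), with the per-head output projections stacked into a single matrix. The dimension choices in the usage example are arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, WQ, WK, WV):
    # Q = X WQ^T, K = X WK^T, V = X WV^T; then H = softmax(QK^T / sqrt(D)) V
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T
    D = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D))  # row-stochastic attention matrix A
    return A @ V                        # output sequence H

def multi_head(X, heads, WO_multi):
    # Concatenate the H head outputs, then apply the stacked projection
    Hs = [attention(X, *w) for w in heads]
    return np.concatenate(Hs, axis=-1) @ WO_multi

rng = np.random.default_rng(1)
N, Dx, D, Dv, H = 8, 6, 4, 5, 2
X = rng.normal(size=(N, Dx))
heads = [tuple(rng.normal(size=s) for s in ((D, Dx), (D, Dx), (Dv, Dx)))
         for _ in range(H)]
WO_multi = rng.normal(size=(H * Dv, Dv))  # stacked per-head projections
out = multi_head(X, heads, WO_multi)
print(out.shape)  # (8, 5)
```

Each row of A sums to one, so every output token h_i is a convex combination of the value vectors, which is the relevance-weighted basis view described in the introduction.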

2.2. BACKGROUND: WAVELET TRANSFORM AND MULTIRESOLUTION APPROXIMATIONS

The wavelet transform uses time-frequency atoms with different time supports to analyze the structure of a signal. In particular, it decomposes signals over dilated and translated copies of a fixed function, known as the mother wavelet.
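A minimal example of one level of this decomposition uses the Haar wavelet, where the coarse approximation is given by pairwise averages; the function name `haar_step` and the specific signal are our own illustrative choices. The approximation and detail components lie in orthogonal subspaces and sum exactly to the original signal.

```python
import numpy as np

def haar_step(f):
    """One level of the Haar MRA: project f onto the coarser approximation
    space (pairwise averages) and the orthogonal detail space."""
    pairs = f.reshape(-1, 2)
    approx = np.repeat(pairs.mean(axis=1), 2)  # coarse-scale approximation
    detail = f - approx                        # orthogonal detail component
    return approx, detail

f = np.array([4., 6., 10., 12., 8., 6., 5., 7.])
approx, detail = haar_step(f)
print(np.allclose(f, approx + detail))  # True: exact reconstruction
print(abs(np.dot(approx, detail)))      # 0.0: the components are orthogonal
```

Applying `haar_step` recursively to the (downsampled) averages yields the full multilevel decomposition, with each level capturing structure at a coarser scale.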




