MRSFORMER: TRANSFORMER WITH MULTIRESOLUTION-HEAD ATTENTION

Abstract

We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases. MRA decomposes a signal into components that lie on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads in the multi-head attention into fine-scale and coarse-scale heads, modeling the attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and memory footprint compared to the standard softmax transformer with multi-head attention. We analyze and validate the advantage of MrsFormer over the standard transformers on a wide range of applications including image and time series classification.

1. INTRODUCTION

Transformer architectures (Vaswani et al., 2017) are widely used in natural language processing (Devlin et al., 2018; Al-Rfou et al., 2019; Dai et al., 2019; Child et al., 2019; Raffel et al., 2020; Baevski & Auli, 2019; Brown et al., 2020; Dehghani et al., 2018), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2020; Ramesh et al., 2021; Radford et al., 2021; Arnab et al., 2021; Liu et al., 2022; Zhao et al., 2021; Guo et al., 2021; Chen et al., 2022), speech processing (Gulati et al., 2020; Dong et al., 2018; Zhang et al., 2020; Wang et al., 2020b), and other relevant applications (Rives et al., 2021; Jumper et al., 2021; Chen et al., 2021; Zhang et al., 2019; Wang & Sun, 2022). Transformers achieve state-of-the-art performance in many of these practical tasks, and the results improve with larger model sizes and increasingly long sequences. For example, the text generation model in (Liu et al., 2018a) processes input sequences of up to 11,000 tokens of text. Applications involving other data modalities, such as music (Huang et al., 2018) and images (Parmar et al., 2018), can require even longer sequences. At the heart of transformers lies the self-attention mechanism, an inductive bias that connects each token in the input through a relevance-weighted basis of every other token to capture the contextual representation of the input sequence (Cho et al., 2014; Parikh et al., 2016; Lin et al., 2017; Bahdanau et al., 2014; Vaswani et al., 2017; Kim et al., 2017). The capability of self-attention to attain diverse syntactic and semantic representations from long input sequences accounts for the success of transformers in practice (Tenney et al., 2019; Vig & Belinkov, 2019; Clark et al., 2019; Voita et al., 2019a; Hewitt & Liang, 2019). Multi-head attention (MHA) extends self-attention by concatenating multiple attention heads to compute the final output, as explained in Section 2.1.
Despite the success of MHA, it has been shown that attention heads in MHA are redundant and tend to learn similar attention patterns, thus limiting the representation capacity of the model (Michel et al., 2019; Voita et al., 2019b; Bhojanapalli et al., 2021). Furthermore, additional heads increase the computational and memory costs, which become a bottleneck in scaling up transformers for very long sequences in large-scale practical tasks. The high computational and memory costs and the head redundancy of MHA motivate the need for a new efficient attention mechanism.

1.1. CONTRIBUTION

Leveraging the idea of the multiresolution approximation (MRA) (Mallat, 1999; 1989; Crowley, 1981), we propose a class of efficient and flexible transformers, namely the Transformer with Multiresolution-head Attention (MrsFormer). At the core of MrsFormer is the novel Multiresolution-head Attention (MrsHA), which computes approximations of the outputs $\mathbf{H}^{(h)}$, $h = 1, \dots, H$, of the attention heads in MHA at different scales to save computation and reduce the memory cost of the model. The MRA has been widely used to efficiently approximate complicated signals such as video and images in signal and image processing (Mallat, 1999; Taubman & Marcellin, 2002; Bhaskaran & Konstantinides, 1997), as well as to approximate solutions of partial differential equations (Dahmen et al., 1997; Qian & Weiss, 1993). While existing works have been proposed to approximate the attention matrices using the MRA (Zeng et al., 2022; Fan et al., 2021; Tao et al., 2020; Li et al., 2022), our MrsHA is the first method that approximates the output of an attention head, resulting in a better approximation scheme compared to other works that try to approximate the attention matrices. Our contribution is three-fold:
1. We derive the approximation of an attention head at different scales via two steps: i) directly approximating the output sequence $\mathbf{H}$, and ii) approximating the value matrix $\mathbf{V}$, i.e., the dictionary that contains the bases of $\mathbf{H}$.
2. We develop MrsHA, a novel MHA whose attention heads approximate the output sequences $\mathbf{H}^{(h)}$, $h = 1, \dots, H$, at different scales. We then propose MrsFormer, a new class of transformers that use MrsHA in their attention layers.
3. We empirically verify that the MrsFormer helps reduce head redundancy and achieves better efficiency than the baseline softmax transformer while attaining comparable accuracy to the baseline.
Organization: We structure this paper as follows. In Section 2, we derive the approximation for the output sequences $\mathbf{H}^{(h)}$, $h = 1, \dots, H$, at different scales and propose the MrsHA and MrsFormer. In Sections 3 and 4, we empirically validate and analyze the advantages of the MrsFormer over the baseline softmax transformer. We discuss related work in Section 5. The paper ends with concluding remarks. More experimental details are provided in the Appendix.

2.1. BACKGROUND: SELF-ATTENTION

The self-attention mechanism learns long-range dependencies via parallel processing of the input sequence. For a given input sequence $\mathbf{X} := [x_1, \cdots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, self-attention transforms $\mathbf{X}$ into the output sequence $\mathbf{H} := [h_1, \cdots, h_N]^\top \in \mathbb{R}^{N \times D_v}$ as follows:

$$\mathbf{H} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{D}}\right)\mathbf{V} := \mathbf{A}\mathbf{V}, \tag{1}$$

where $\mathbf{Q} := [q_1, \cdots, q_N]^\top$, $\mathbf{K} := [k_1, \cdots, k_N]^\top$, and $\mathbf{V} := [v_1, \cdots, v_N]^\top$ are the projections of the input sequence $\mathbf{X}$ into three different subspaces spanned by $\mathbf{W}_Q, \mathbf{W}_K \in \mathbb{R}^{D \times D_x}$ and $\mathbf{W}_V \in \mathbb{R}^{D_v \times D_x}$, i.e., $\mathbf{Q} = \mathbf{X}\mathbf{W}_Q^\top$, $\mathbf{K} = \mathbf{X}\mathbf{W}_K^\top$, $\mathbf{V} = \mathbf{X}\mathbf{W}_V^\top$. Here, in the context of transformers, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are named the query, key, and value matrices, respectively. The softmax function is applied row-wise. The matrix $\mathbf{A} = \mathrm{softmax}\left(\mathbf{Q}\mathbf{K}^\top / \sqrt{D}\right) \in \mathbb{R}^{N \times N}$ is the attention matrix, whose components $a_{ij}$, $i, j = 1, \cdots, N$, are the attention scores. The structure of the attention matrix $\mathbf{A}$ after training determines the ability of self-attention to capture a contextual representation for each token. Eqn. (1) is also called the scaled dot-product or softmax attention. In our paper, we call a transformer that uses this attention the softmax transformer.

Multi-head Attention (MHA). In MHA, multiple heads are concatenated to compute the final output. Let $H$ be the number of heads and $\mathbf{W}_O^{\mathrm{multi}} = \left[\mathbf{W}_O^{(1)}, \dots, \mathbf{W}_O^{(H)}\right] \in \mathbb{R}^{D_v \times H D_v}$ be the projection matrix for the output, where $\mathbf{W}_O^{(1)}, \dots, \mathbf{W}_O^{(H)} \in \mathbb{R}^{D_v \times D_v}$. The multi-head attention is defined as

$$\mathrm{MultiHead}\left(\{\mathbf{H}^{(h)}\}_{h=1}^{H}\right) = \mathrm{Concat}\left(\mathbf{H}^{(1)}, \dots, \mathbf{H}^{(H)}\right)\mathbf{W}_O^{\mathrm{multi}\top} = \sum_{h=1}^{H} \mathbf{H}^{(h)} \mathbf{W}_O^{(h)\top} = \sum_{h=1}^{H} \mathbf{A}^{(h)} \mathbf{V}^{(h)} \mathbf{W}_O^{(h)\top}. \tag{2}$$

The MHA enables transformers to capture more diverse attention patterns.
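To make Eqns. (1) and (2) concrete, the following is a minimal pure-Python sketch of softmax attention and of MHA written as a sum over heads. The function names (`attention`, `multi_head`) and the list-of-rows matrix representation are illustrative assumptions, not part of the original implementation.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def attention(X, WQ, WK, WV):
    """Return (A, H) with A = softmax(Q K^T / sqrt(D)) and H = A V, as in Eqn. (1)."""
    Q = matmul(X, transpose(WQ))
    K = matmul(X, transpose(WK))
    V = matmul(X, transpose(WV))
    D = len(Q[0])
    scores = [[x / math.sqrt(D) for x in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)
    return A, matmul(A, V)

def multi_head(X, heads):
    """heads: list of (WQ, WK, WV, WO). Returns sum_h H^(h) W_O^(h)T, as in Eqn. (2)."""
    total = None
    for WQ, WK, WV, WO in heads:
        _, H = attention(X, WQ, WK, WV)
        out = matmul(H, transpose(WO))
        if total is None:
            total = out
        else:
            total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, out)]
    return total
```

The sum-over-heads form is used instead of an explicit concatenation because the two are equal by Eqn. (2) and the sum needs no extra bookkeeping.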

2.2. BACKGROUND: WAVELET TRANSFORM AND MULTIRESOLUTION APPROXIMATIONS

The wavelet transform uses time-frequency atoms with different time supports to analyze the structure of a signal. In particular, it decomposes signals over dilated and translated copies of a fixed function $\phi$. A dictionary of time-frequency atoms is obtained by scaling $\phi$ by $s$ and translating it by $t$:

$$\mathcal{B} = \left\{ \phi_t^s = \frac{1}{\sqrt{s}}\, \phi\!\left(\frac{x - t}{s}\right) \right\}_{t \in \mathbb{R},\, s \in \mathbb{R}^+}. \tag{3}$$

Here, $s$ controls the dilation, i.e., the scale, and $t$ controls the location, e.g., the time. Using this dictionary of time-frequency atoms, a signal $f \in L^2(\mathbb{R})$ can be expanded in the following form:

$$f = \int_0^{+\infty} \int_{-\infty}^{+\infty} \alpha_t^s\, \phi_t^s(x)\, dt\, ds. \tag{4}$$

The wavelet transform then maps the signal $f$ to the coefficient $\alpha_t^s$ as follows:

$$\alpha_t^s = \langle f, \phi_t^s \rangle = \int_{-\infty}^{+\infty} f(x)\, (\phi^*)_t^s(x)\, dx, \tag{5}$$

where $\phi^*$ is the complex conjugate of $\phi$. The coefficient $\alpha_t^s$ captures the measurement of the signal $f$ at scale $s$ and location $t$ (Mallat, 1999).

2.3.1. FIRST LEVEL APPROXIMATION: APPROXIMATING THE OUTPUT SEQUENCE H AT DIFFERENT SCALES

Let $\mathcal{B}^s = \{\phi_t^s \in \mathbb{R}^N\}$ be a set of orthogonal expansion functions for possible translations at scale $s$, where $s = 1, 2, 4, \dots, N$. For simplicity, we assume that the sequence length is $N = 2^k$. The expansion functions $\phi_t^s$ are chosen to be the boxcar functions

$$\phi_t^s[i] = \begin{cases} 1 & \text{if } st - s < i \le st \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

for $s \in \{1, 2, 4, \dots, N\}$ and $t = 1, \dots, N/s$. The $d$-th column of $\mathbf{H}$ is approximated at scale $s$ by

$$\mathbf{H}[:, d] \approx \mathbf{H}^s[:, d] = \sum_{\phi_t^s \in \mathcal{B}^s} \alpha_{td}^s\, \phi_t^s, \tag{7}$$

where the coefficient $\alpha_{td}^s$ is computed as

$$\alpha_{td}^s = \frac{1}{s} \langle \phi_t^s, \mathbf{H}[:, d] \rangle. \tag{8}$$

Plugging Eqn. (1) and Eqn. (8) into Eqn. (7), we obtain

\begin{align}
\mathbf{H}[:, d] \approx \mathbf{H}^s[:, d] &= \sum_{\phi_t^s \in \mathcal{B}^s} \frac{1}{s} \langle \phi_t^s, \mathbf{H}[:, d] \rangle\, \phi_t^s = \sum_{t=1}^{N/s} \left( \frac{1}{s} \sum_{i=st-s+1}^{st} \mathbf{H}[i, d] \right) \phi_t^s \notag \\
&= \sum_{t=1}^{N/s} \left( \frac{1}{s} \sum_{i=st-s+1}^{st} \mathbf{A}[i, :]\, \mathbf{V}[:, d] \right) \phi_t^s \tag{9} \\
&= \uparrow_{s,1}\!\big( (\downarrow_{s,1} \mathbf{A})\, \mathbf{V}[:, d] \big). \tag{10}
\end{align}

Here, we employ the notations for downsampling and upsampling from signal processing. In particular, $\downarrow_{s,\ell}$ denotes average pooling by the factor $s$ along the $\ell$-th dimension, and $\uparrow_{s,\ell}$ denotes nearest-neighbor interpolation by the factor $s$ along the $\ell$-th dimension. Applying Eqn. (10) for $d = 1, \dots, D_v$, we obtain the approximation of $\mathbf{H}$ at scale $s$:

$$\mathbf{H} \approx \mathbf{H}^s = \uparrow_{s,1}\!\big( (\downarrow_{s,1} \mathbf{A})\, \mathbf{V} \big). \tag{11}$$

An illustration of Eqn. (11) is given in Figure 1 (Left).

Remark 1 (Approximating the columns of H independently) As pointed out in (Nguyen et al., 2022), the features $\mathbf{H}[:, d]$ in the output sequence $\mathbf{H}$, as well as the features $\mathbf{V}[:, d]$ in the value matrix $\mathbf{V}$, $d = 1, \dots, D_v$, in the softmax attention are independent due to the use of the unnormalized Gaussian kernels with isotropic covariance. This finding justifies our approach of approximating the columns of $\mathbf{H}$ independently.

Remark 2 (Group-to-token attention) The downsampling $\downarrow_{s,1} \mathbf{A}$ of the matrix $\mathbf{A}$ in Eqn. (11) computes the attention between groups of tokens and individual tokens in the sequence.
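The equivalence between the boxcar expansion in Eqns. (7)-(8) and the pooled form in Eqn. (11) can be checked numerically. The sketch below is illustrative only: the helper names are assumptions, matrices are lists of rows, and the sequence length is assumed to be divisible by the scale `s`.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def avg_pool_rows(M, s):
    """Average every s consecutive rows of M (average pooling along dimension 1)."""
    return [[sum(col) / s for col in zip(*M[i:i + s])] for i in range(0, len(M), s)]

def upsample_rows(M, s):
    """Repeat each row s times (nearest-neighbor interpolation along dimension 1)."""
    return [list(row) for row in M for _ in range(s)]

def first_level(A, V, s):
    """H^s = up_s((down_s A) V), the pooled form of Eqn. (11)."""
    return upsample_rows(matmul(avg_pool_rows(A, s), V), s)

def project_on_boxcars(H, s):
    """Eqns. (7)-(8): replace each block of s consecutive rows of H by the block mean."""
    return upsample_rows(avg_pool_rows(H, s), s)
```

Both functions return the same matrix because average pooling is linear, but `first_level` only multiplies an (N/s) x N matrix by V, which is where the saving comes from.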

2.3.2. SECOND LEVEL APPROXIMATION: APPROXIMATING THE HEAD BASES V AT DIFFERENT SCALES

In Eqn. (11), which approximates the output sequence $\mathbf{H}$ at scale $s$ by $\mathbf{H}^s$, we can further approximate the bases $\mathbf{V}$, i.e., the value matrix, by its approximation at scale $s'$. Following the derivation in Section 2.3.1, we can derive the approximation $\mathbf{V}^{s'}[:, d]$ for the $d$-th column of $\mathbf{V}$ as follows:

$$\mathbf{V}[:, d] \approx \mathbf{V}^{s'}[:, d] = \sum_{t'=1}^{N/s'} \left( \frac{1}{s'} \sum_{j=s't'-s'+1}^{s't'} \mathbf{V}[j, d] \right) \phi_{t'}^{s'}. \tag{12}$$

Plugging Eqn. (12) into Eqn. (9), we obtain the second-level approximation of the head output $\mathbf{H}$:

\begin{align}
\mathbf{H}[:, d] \approx \mathbf{H}^{s,s'}[:, d] &= \sum_{t=1}^{N/s} \left( \frac{1}{s} \sum_{i=st-s+1}^{st} \mathbf{A}[i, :] \sum_{t'=1}^{N/s'} \left( \frac{1}{s'} \sum_{j=s't'-s'+1}^{s't'} \mathbf{V}[j, d] \right) \phi_{t'}^{s'} \right) \phi_t^s \notag \\
&= \sum_{t=1}^{N/s} \left( \sum_{t'=1}^{N/s'} \left( \frac{1}{s's} \sum_{i=st-s+1}^{st} \mathbf{A}[i, :]\, \phi_{t'}^{s'} \right) \left( \sum_{j=s't'-s'+1}^{s't'} \mathbf{V}[j, d] \right) \right) \phi_t^s \notag \\
&= \sum_{t=1}^{N/s} \left( \sum_{t'=1}^{N/s'} \left( \frac{1}{s's} \sum_{i=st-s+1}^{st} \sum_{j=s't'-s'+1}^{s't'} \mathbf{A}[i, j] \right) \left( \sum_{j=s't'-s'+1}^{s't'} \mathbf{V}[j, d] \right) \right) \phi_t^s \notag \\
&= \uparrow_{s,1}\!\big( (\downarrow_{s,1} \downarrow_{s',2} \mathbf{A}) (\downarrow_{s',1} \mathbf{V}[:, d]) \big). \tag{13}
\end{align}

As above, by applying Eqn. (13) for $d = 1, \dots, D_v$, we obtain the full approximation of $\mathbf{H}$ at scale $s$ of $\mathbf{H}$ and scale $s'$ of $\mathbf{V}$:

$$\mathbf{H} \approx \mathbf{H}^{s,s'} := \uparrow_{s,1}\!\big( (\downarrow_{s,1} \downarrow_{s',2} \mathbf{A}) (\downarrow_{s',1} \mathbf{V}) \big). \tag{14}$$

An illustration of Eqn. (14) is given in Figure 1 (Right). Given the approximation $\mathbf{H}^{s,s'}$ of the head output $\mathbf{H}$, we have the following upper bound on the approximation error.

Theorem 1 Assume that $\delta > 0$ is chosen such that the attention matrix $\mathbf{A}$ satisfies the following inequalities:

$$|A_{i,j} - A_{i \pm 1, j}| \le \delta, \qquad |A_{i,j} - A_{i, j \pm 1}| \le \delta \tag{18}$$

for all $1 \le i, j \le N$. Then,

$$\|\mathbf{H} - \mathbf{H}^{s,s'}\|_F \le \frac{(s + s' - 2) N \delta}{2\sqrt{s s'}}\, \|\mathbf{V}\|_2, \tag{19}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_2$ denotes the spectral norm of a matrix.

The proof of Theorem 1 is in Appendix B. Theorem 1 shows that $\mathbf{H}^{s,s'}$ approximates $\mathbf{H}$ exactly when $s = s' = 1$, as expected, since the bound then vanishes. At the coarsest scale, $s = s' = N$, the upper bound attains its maximum value $(N - 1)\delta \|\mathbf{V}\|_2$.
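As a hedged illustration of the second-level approximation, the sketch below builds the block-averaged attention matrix described in Appendix B (every s x s' block replaced by its mean), multiplies it by V, and compares the result with a pooled computation that never forms an N x N product. The function names are assumptions. Note the normalization used here: the pooled version pairs the row-averaged, column-summed block of A with the averaged block of V, which matches the intermediate lines of Eqn. (13); treating the column pooling as a sum is an assumption of this sketch.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def block_average(A, s, sp):
    """Replace every s x s' block of A by the mean of that block."""
    n, m = len(A), len(A[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, m, sp):
            block = [A[i][j] for i in range(i0, i0 + s) for j in range(j0, j0 + sp)]
            mean = sum(block) / (s * sp)
            for i in range(i0, i0 + s):
                for j in range(j0, j0 + sp):
                    out[i][j] = mean
    return out

def second_level_pooled(A, V, s, sp):
    """Pooled computation: small (N/s) x (N/s') matrix times small (N/s') x Dv matrix."""
    # Average over each group of s query rows.
    A_rows = [[sum(col) / s for col in zip(*A[i:i + s])] for i in range(0, len(A), s)]
    # Sum over each group of s' key columns (assumed normalization, see lead-in).
    A_small = [[sum(r[j:j + sp]) for j in range(0, len(r), sp)] for r in A_rows]
    # Average over each group of s' value rows.
    V_small = [[sum(col) / sp for col in zip(*V[j:j + sp])] for j in range(0, len(V), sp)]
    H_small = matmul(A_small, V_small)
    # Nearest-neighbor upsampling back to N rows.
    return [list(r) for r in H_small for _ in range(s)]
```

With `s = sp = 1` both routes reduce to the exact product A V, which mirrors the remark that the approximation is exact at the finest scale.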
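Remark 3 (Group-to-group attention) The downsampling $\downarrow_{s,1} \downarrow_{s',2} \mathbf{A}$ of the matrix $\mathbf{A}$ in Eqn. (14) computes the attention between groups of tokens and groups of tokens in the sequence.

2.3.3. EFFICIENT DOWNSAMPLING OF THE ATTENTION MATRIX A

As shown in Eqn. (1), $\mathbf{A} = \mathrm{softmax}\left(\mathbf{Q}\mathbf{K}^\top / \sqrt{D}\right)$. Since the softmax function needs access to the full matrix $\mathbf{Q}\mathbf{K}^\top / \sqrt{D}$, downsampling $\mathbf{A}$ directly requires computing the full attention matrix first. In order to avoid this redundant computation, we propose to compute the lower bound of this average pooling (due to the convexity of the exponential in the softmax function). In particular, we approximate the downsampling of $\mathbf{A}$ as follows:

$$\downarrow_{s,1} \downarrow_{s',2} \mathbf{A} \approx \mathrm{softmax}\left( \frac{\downarrow_{s,1} \downarrow_{s',2} (\mathbf{Q}\mathbf{K}^\top)}{\sqrt{D}} \right) = \mathrm{softmax}\left( \frac{(\downarrow_{s,1} \mathbf{Q}) (\downarrow_{s',1} \mathbf{K})^\top}{\sqrt{D}} \right). \tag{15}$$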
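The last equality in Eqn. (15) holds exactly because average pooling is linear, so pooling the scores equals multiplying the pooled queries and keys. The sketch below checks this and then applies the softmax to the small score matrix. Function names are illustrative assumptions, and the sequence length is assumed divisible by both scales.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def avg_pool_rows(M, s):
    """Average every s consecutive rows."""
    return [[sum(col) / s for col in zip(*M[i:i + s])] for i in range(0, len(M), s)]

def avg_pool_cols(M, s):
    """Average every s consecutive columns."""
    return [[sum(row[j:j + s]) / s for j in range(0, len(row), s)] for row in M]

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def pooled_scores_direct(Q, K, s, sp):
    """Pool the full N x N score matrix Q K^T (forms all N^2 scores first)."""
    return avg_pool_cols(avg_pool_rows(matmul(Q, transpose(K)), s), sp)

def pooled_scores_cheap(Q, K, s, sp):
    """Pool Q and K first; only (N/s) x (N/s') scores are ever formed."""
    return matmul(avg_pool_rows(Q, s), transpose(avg_pool_rows(K, sp)))

def pooled_attention(Q, K, s, sp):
    """Right-hand side of Eqn. (15): softmax of the pooled scores, scaled by sqrt(D)."""
    D = len(Q[0])
    scores = pooled_scores_cheap(Q, K, s, sp)
    return softmax_rows([[x / math.sqrt(D) for x in row] for row in scores])
```

Only the score pooling is exact; replacing the pooled softmax output by the softmax of pooled scores remains the approximation stated in Eqn. (15).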

2.3.4. TRANSFORMER WITH MULTIRESOLUTION-HEAD ATTENTION: EACH HEAD APPROXIMATES THE ATTENTION AT A DIFFERENT SCALE

In this section, we formally define our Multiresolution-head Attention (MrsHA) and Transformer with Multiresolution-head Attention (MrsFormer). MrsHA combines Eqn. (14) and Eqn. (15) to implement the approximation of the output sequences $\mathbf{H}^{(h)}$, $h = 1, \dots, H$, at different scales $s$ and $s'$.

Definition 1 (Multiresolution-head Attention) Let $H$ be the number of heads and $\mathbf{W}_O^{\mathrm{multi}} = \left[\mathbf{W}_O^{(1)}, \dots, \mathbf{W}_O^{(H)}\right] \in \mathbb{R}^{D_v \times H D_v}$ be the projection matrix for the head outputs, where $\mathbf{W}_O^{(1)}, \dots, \mathbf{W}_O^{(H)} \in \mathbb{R}^{D_v \times D_v}$. Given a set of scales $\{s^{(h)}, s'^{(h)}\}_{h=1}^{H}$ for the output $\mathbf{H}^{(h)}$ and the value matrix $\mathbf{V}^{(h)}$, $h = 1, \dots, H$, at each head, the MrsHA is an efficient attention mechanism that computes the approximation of $\mathbf{H}^{(h)}$ at scale $s^{(h)}$ using an approximation of $\mathbf{V}^{(h)}$ at scale $s'^{(h)}$ by the following attention formula:

$$\mathrm{MrsHA}\left(\{\mathbf{H}^{(h)}\}_{h=1}^{H}\right) = \sum_{h=1}^{H} \uparrow_{s^{(h)},1}\!\left( \mathrm{softmax}\!\left( \frac{(\downarrow_{s^{(h)},1} \mathbf{Q}^{(h)}) (\downarrow_{s'^{(h)},1} \mathbf{K}^{(h)})^\top}{\sqrt{D}} \right) (\downarrow_{s'^{(h)},1} \mathbf{V}^{(h)}) \right) \mathbf{W}_O^{(h)\top}. \tag{16}$$

The MrsFormer is the class of transformers that use the MrsHA in their attention layers.

Remark 4 (Downsampling Q, K, and V) Downsampling $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ can be efficiently implemented by downsampling the input sequence $\mathbf{X}$ before projecting it into the query matrix $\mathbf{Q}$, the key matrix $\mathbf{K}$, and the value matrix $\mathbf{V}$ via the linear transformations $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$, respectively. Eqn. (16) of the MrsHA then becomes

$$\mathrm{MrsHA}\left(\{\mathbf{H}^{(h)}\}_{h=1}^{H}\right) = \sum_{h=1}^{H} \uparrow_{s^{(h)},1}\!\left( \mathrm{softmax}\!\left( \frac{(\downarrow_{s^{(h)},1} \mathbf{X}\, \mathbf{W}_Q^{(h)\top}) (\downarrow_{s'^{(h)},1} \mathbf{X}\, \mathbf{W}_K^{(h)\top})^\top}{\sqrt{D}} \right) (\downarrow_{s'^{(h)},1} \mathbf{X}\, \mathbf{W}_V^{(h)\top}) \right) \mathbf{W}_O^{(h)\top}. \tag{17}$$

An illustration of Eqn. (17) is given in Figure 2.

Remark 5 (Choosing $s^{(h)}$ and $s'^{(h)}$) $s^{(h)}$ and $s'^{(h)}$ are hyperparameters that can be tuned for each head. In our experiments, we use $s^{(h)} = s'^{(h)} = 2^{k^{(h)}}$, where $k^{(h)}$ is an integer.

Remark 6 (Choosing the expansion functions $\phi_t^s$ and 1-D convolution) In order to derive the MrsHA in Eqn. (16), we have chosen the expansion functions $\phi_t^s$ to be the boxcar functions. Other expansion functions, such as the wavelet bases or the triangular functions, can be used to derive different forms of the MrsHA. In the general case, the average pooling and the nearest-neighbor interpolation in Eqn. (16) and Eqn. (17) can be replaced by 1-D convolution operators with $\phi_t^s$ as the corresponding filters.
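Putting Definition 1 and Remark 4 together, a forward pass of MrsHA in the form of Eqn. (17) might look like the sketch below. This is not the authors' implementation: the dictionary keys (`WQ`, `WK`, `WV`, `WO`, `s`, `sp`), the function names, and the list-of-rows representation are assumptions, and the sequence length is assumed divisible by every scale.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def avg_pool_rows(M, s):
    """Average every s consecutive rows (tokens)."""
    return [[sum(col) / s for col in zip(*M[i:i + s])] for i in range(0, len(M), s)]

def upsample_rows(M, s):
    """Repeat each row s times (nearest-neighbor interpolation)."""
    return [list(row) for row in M for _ in range(s)]

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def mrsha_head(X, WQ, WK, WV, s, sp):
    """One head: pool X, project, attend on the small matrices, then upsample."""
    Q = matmul(avg_pool_rows(X, s), transpose(WQ))    # (N/s)  x D
    K = matmul(avg_pool_rows(X, sp), transpose(WK))   # (N/s') x D
    V = matmul(avg_pool_rows(X, sp), transpose(WV))   # (N/s') x Dv
    D = len(Q[0])
    scores = [[x / math.sqrt(D) for x in row] for row in matmul(Q, transpose(K))]
    return upsample_rows(matmul(softmax_rows(scores), V), s)  # N x Dv

def mrsha(X, heads):
    """Sum of upsampled head outputs, each multiplied by its output projection."""
    total = [[0.0] * len(heads[0]["WO"]) for _ in X]
    for h in heads:
        H = mrsha_head(X, h["WQ"], h["WK"], h["WV"], h["s"], h["sp"])
        out = matmul(H, transpose(h["WO"]))
        total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, out)]
    return total
```

A head with `s = sp = 1` reduces to a standard softmax head, so mixing scales such as `[1, 2]` keeps one fine-scale head alongside a coarse-scale one.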

3. EXPERIMENTAL RESULTS

In this section, we empirically justify the advantages of our proposed MrsFormer model. We compare the performance of the MrsFormer with the baseline softmax transformer, the MRA-2 (Zeng et al., 2022), and the MRA-2-s (the sparse version of the MRA-2) on various benchmarks. Unlike our method, the MRA-2 and MRA-2-s perform multiresolution analysis for each head by approximating the attention matrix with blocks of different scales, while the MrsHA in our MrsFormer computes the approximation of each head $\mathbf{H}^{(h)}$ at a specific scale. The benchmarks studied in our experiments include 10 tasks from the UEA time series classification dataset (Bagnall et al., 2018), 3 tasks from the Long Range Arena (LRA) benchmark (Tay et al., 2021b), and the ImageNet image classification task (Russakovsky et al., 2015). In addition, we study the performance of the MrsHA when combined with other attention mechanisms such as the linear attention (Katharopoulos et al., 2020), the MRA-2 attention, and the MRA-2-s attention (Zeng et al., 2022). We aim to show that: (i) the MrsFormer can achieve better or comparable accuracy relative to the baseline softmax, MRA-2, and MRA-2-s transformers; (ii) the MrsFormer saves a significant amount of FLOPs and memory compared to the baseline softmax transformer, and this advantage grows with the sequence length; (iii) the MrsHA can be combined with other attentions to achieve similar or better performance with better efficiency; and (iv) the MrsFormer reduces redundancy between heads compared to the softmax baseline. In our experiments, we keep the hyperparameters the same for all models for fair comparison. All of our results are averaged over 5 runs with different seeds.

3.1. UEA TIME SERIES CLASSIFICATION

Models and baselines. We adapt code from (Wu et al., 2022; Zerveas et al., 2021) for our experiments. Following the same setting from these papers, we set the number of heads and layers to 8 and 2, respectively. For the MrsFormers, we use the same set of scales at each layer, which is given by s = [1, 1, 2, 2, 4, 4, 8, 8]. For the MRA-2 and MRA-2-s models (Zeng et al., 2022), each head is approximated by blocks of scales [1, 32] as suggested in their paper. The percentage of blocks with scale 1 in these MRA-2 models is set to 25% of the full attention matrix. Other hyperparameters have the same values as in (Wu et al., 2022) (for the PEMS-SF, SelfRegulationSCP2, and UWaveGestureLibrary tasks) and (Zerveas et al., 2021) (for other tasks).

Results. We summarize the results in Table 1. The MrsFormer achieves better test accuracy than the baseline softmax transformer for 5 out of 10 tasks while being much more efficient. Among these tasks, the MrsFormer outperforms the baseline by at least 1% accuracy. For the remaining tasks, besides Handwriting, our model maintains an accuracy gap of less than 0.8% compared to the baseline. Our model gets the best accuracy for 4 out of the 10 tasks. In addition, it achieves the second best accuracy for 4 of the remaining tasks. The MrsFormer achieves the average accuracy across all tasks. Note that among the 8 heads at each layer, our model computes 6 of them with sizes of only 1/4, 1/4, 1/16, 1/16, 1/64, and 1/64 of the size of the corresponding heads in the baseline softmax transformer. Thus, the MrsFormer has significantly lower FLOPs and memory usage than the baseline.
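The head-size fractions quoted above follow from the scales when s = s' for each head: the attention matrix shrinks from N x N to (N/s) x (N/s). A small sketch of this arithmetic is given below; the function name is illustrative, and only attention-matrix entries are counted, not projections or other layers.

```python
from fractions import Fraction

def relative_attention_sizes(scales):
    """Size of each head's (N/s) x (N/s) attention matrix relative to N x N."""
    return [Fraction(1, s * s) for s in scales]

# Scales used for the UEA experiments.
sizes = relative_attention_sizes([1, 1, 2, 2, 4, 4, 8, 8])
# Average relative size over the 8 heads of a layer.
average_size = sum(sizes) / len(sizes)
```

Under this count, the eight attention matrices together hold 85/256 (about one third) of the entries of eight full-resolution heads.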

3.2. LONG RANGE ARENA

Models and baselines. We follow the same settings and adapt code for the LRA tasks from (Zeng et al., 2022), which uses a transformer with 2 heads and 2 layers. We choose the same set of scales s = [1, 2] for all the layers in the MrsFormer.

Results. Table 2 summarizes our results. Although the MrsFormer is an approximation of the softmax attention, it is evident from Table 2 that it consistently achieves better or comparable accuracy relative to the baseline softmax attention on the LRA tasks. The MRA-2 and MRA-2-s models (Zeng et al., 2022) are also included for comparison. Our MrsFormer's performance is comparable with these MRA baselines. Overall, the MrsFormer yields the best average accuracy across the LRA tasks.

Table 1: Accuracy (%) of the MrsFormer vs. the baseline softmax transformer on the UEA Time Series Classification task averaged over 5 seeds. The best model for each task is highlighted in bold, while the second best one is underlined. We also include the reported results for the softmax transformer from (Wu et al., 2022) and (Zerveas et al., 2021).

3.3. IMAGE CLASSIFICATION ON IMAGENET

Models and baselines. In this section, we apply the MrsFormer to the DeiT model (Touvron et al., 2020) with 4 heads. Since DeiT uses a special class token [CLS] for classification, we do not downsample this token along with the other tokens in the sequence. For our MrsFormers, we use the set of scales s = [1, 2, 2, 4] at each layer. We also study the MRA-2-s attention on this task. As reported in (Zeng et al., 2022), the MRA-2-s is a better model than the MRA-2 on the ImageNet image classification task since its sparse attention structure is more effective for modeling images.

Results. We present our results in Table 3. The MrsFormer DeiT's top-1 accuracy is about 0.5% higher than that of the MRA-2-s DeiT and is the closest to the performance of the softmax DeiT baseline. The performance gap of less than 1% for the MrsFormer DeiT is very promising for applying MrsFormer-based models to large-scale tasks to reduce the computational and memory cost while maintaining comparable performance with the baseline transformer.

4. EMPIRICAL ANALYSIS

In this section, we use the models trained on the LRA retrieval task for our analysis.

4.1. EFFICIENCY ANALYSIS

We study the efficiency of the MrsFormer relative to the baseline softmax transformer. Figure 3 shows the reduction ratio of training and test FLOPs of the MrsFormer over the softmax transformer. Although in this experiment we only approximate one head with scale s = 2 and keep the other head the same as in the baseline, the FLOP saving ratio over softmax attention still ranges from 18% to more than 36% and grows with the sequence length in both the training and testing phases. Figure 4 presents the memory saving ratio of the MrsFormer over the softmax transformer. This figure shows a similar trend of more memory saving as the sequence length increases. Our model achieves up to 49% and 31% decreases in memory usage in the training and testing phases, respectively. This indicates that our model scales well with long sequences and uses significantly fewer resources than the baseline softmax attention in both training and testing.
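As a rough sanity check on these numbers, the following back-of-the-envelope estimate counts only the sequence-length-squared attention terms and ignores projections, feed-forward layers, and pooling overhead, which grow linearly in N. It is an estimate under these stated assumptions, not a measurement from the experiments.

```latex
% Assumption: only the N^2-dependent terms (QK^T and AV) are counted,
% and each head uses s = s'.
\begin{align*}
\text{cost of one head at scale } s &\propto \frac{N}{s}\cdot\frac{N}{s}\cdot D, \\
\frac{\text{cost with scales } [1, 2]}{\text{cost with scales } [1, 1]}
  &= \frac{1 + \tfrac{1}{4}}{2} = \frac{5}{8}, \\
\text{saving on the attention terms} &= 1 - \frac{5}{8} = 37.5\%.
\end{align*}
```

Since the quadratic terms dominate as N grows, a saving that rises with sequence length toward roughly this ceiling is consistent with the trend reported for Figure 3.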

4.2. MRSFORMER HELPS REDUCE HEAD REDUNDANCY

To show that the MrsFormer captures more diverse attention patterns, we compare the average L 2 distances between the heads of our trained MrsFormer model (on the retrieval task) and the softmax baseline. Table 4 reports the layer-average mean and standard deviation of distances between heads. Since the MrsFormer attains higher L 2 distances, it reduces the risk of learning redundant heads compared to the softmax baseline.

4.3. BEYOND THE SOFTMAX ATTENTION: COMBINING MRSHA WITH OTHER ATTENTIONS

The MrsHA is complementary to many other types of attention. A natural question is therefore whether we can combine the MrsHA with attentions other than the softmax attention. To answer this question, we combine the MrsHA with the MRA attention (Zeng et al., 2022) and the linear attention (Katharopoulos et al., 2020) and train these combined models on the LRA tasks (Tay et al., 2021a) as in Section 3.2. The results are presented in Table 5. It is interesting to see from Table 5 that all combined models gain an improvement in average test accuracy over the original models despite being approximations. This observation suggests that the MrsHA can be applied to other attention mechanisms besides softmax to reduce computation and memory while maintaining the accuracy of the original models.

5. RELATED WORK

Efficient Transformers. To reduce the quadratic computational cost and memory usage of transformers, many efficient transformer models have been developed (Roy et al., 2021). Sparse transformers are one line of work in this branch; they explore and design the sparsity structure of the attention matrix, resulting in more efficient models (Parmar et al., 2018; Liu et al., 2018b; Qiu et al., 2019; Child et al., 2019; Beltagy et al., 2020). Another class of efficient transformers is pattern integration, which combines different attention patterns to cover a diverse and wide range of dependencies (Child et al., 2019; Ho et al., 2019). These patterns can be pre-specified or learned during training, along with the model parameters (Kitaev et al., 2020; Roy et al., 2021; Tay et al., 2020). In another approach, multiple tokens can be accessed simultaneously with a side memory module, saving computation and memory (Lee et al., 2019; Sukhbaatar et al., 2019; Asai & Choi, 2020; Beltagy et al., 2020). In a different approach, based on the observation that the attention matrices are low-rank, kernelization and low-rank approximation methods have been proposed to replace the softmax attention with more efficient attentions (Tsai et al., 2019; Wang et al., 2020a; Katharopoulos et al., 2020; Choromanski et al., 2021; Shen et al., 2021; Nguyen et al., 2021; Peng et al., 2021; Jaegle et al., 2021). From a signal processing perspective, wavelet-based and multiscale methods have been used lately to learn a multiresolution approximation of self-attention (Zeng et al., 2022; Fan et al., 2021; Tao et al., 2020; Li et al., 2022), which flexibly discovers the coarse and fine attention patterns. Our approach decomposes the attention heads into coarse- and fine-scale heads, diversely modeling the dependencies between tokens and between groups of tokens to reduce the computational and memory costs of the model in both training and testing.

Redundancy in Transformers. Pre-trained transformers contain redundant neurons and heads which can be pruned away for downstream tasks (Dalvi et al., 2020; Michel et al., 2019; Durrani et al., 2020). Studying the contextualized embeddings in these pre-trained networks shows the anisotropy of the representations learned by these models under this redundancy (Mu & Viswanath, 2018; Ethayarajh, 2019). Multiple approaches have been proposed to reduce this redundancy and improve the efficiency of transformers, such as knowledge distillation and sparse approximation (Sanh et al., 2019; Sun et al., 2019; Voita et al., 2019b; Sajjad et al., 2020). Our MrsHA/MrsFormer represent the attention heads at different scales and are complementary to these methods.

6. CONCLUDING REMARKS

In this paper, we propose the MrsFormer, a class of efficient transformers that calculates the approximation of the attention heads at different scales using the Multiresolution-head Attention (MrsHA). The MrsFormer achieves lower computational and memory cost than the corresponding softmax transformer baseline. Furthermore, the MrsFormer helps reduce the redundancy between attention heads and can be easily combined with other attention mechanisms. In the MrsFormer, we use the boxcar function to form a set of orthogonal expansion functions. It is natural to further develop the MrsFormer using other basis functions, including the popular wavelets. Furthermore, in our derivation of the MrsHA and MrsFormer in Section 2.3, we employ the observation from (Nguyen et al., 2022) that the features H[:, d] in the output sequence H are independent. We leave the extension of the MrsHA and MrsFormer to capture dependent output features as future work.

Supplement to "MrsFormer: Transformer with Multiresolution-head Attention"

A ADDITIONAL DETAILS ON THE EXPERIMENTS

A.1 UEA TIME SERIES CLASSIFICATION

Datasets and metrics

The benchmark (Bagnall et al., 2018) consists of 30 datasets. Following (Wu et al., 2022), we choose 10 datasets, which vary in input sequence length, number of classes, and dimensionality, to evaluate our models on temporal sequences.

Models and baselines

We adapt code from (Wu et al., 2022; Zerveas et al., 2021) for our experiments. Following the same setting from these papers, we set the number of heads and layers to 8 and 2, respectively. For the MrsFormers, we use the same set of scales at each layer, which is given by s = [1, 1, 2, 2, 4, 4, 8, 8]. For the MRA-2 and MRA-2-s models (Zeng et al., 2022), each head is approximated by blocks of scales [1, 32] as suggested in their paper. The percentage of blocks with scale 1 in these MRA-2 models is set to 25% of the full attention matrix. Other hyperparameters have the same values as in (Wu et al., 2022) (for the PEMS-SF, SelfRegulationSCP2, and UWaveGestureLibrary tasks) and (Zerveas et al., 2021) (for other tasks). Hyperparameters for these tasks are presented in Table 6.

A.2 LONG RANGE ARENA BENCHMARK

Datasets and metrics

We adopt the following tasks from the LRA benchmark for our experiments: Listops (Nangia & Bowman, 2018), byte-level IMDb reviews text classification (Maas et al., 2011), and byte-level document retrieval (Radev et al., 2013). They consist of long sequences of length 2K, 4K, and 4K, respectively. The evaluation protocol and metric are the same as in (Tay et al., 2021b).

Models and baselines

We follow the same settings and adapt code for the LRA tasks from (Zeng et al., 2022), which uses a transformer with 2 heads and 2 layers. We choose the same set of scales s = [1, 2] for all the layers in the MrsFormer. Hyperparameters for these tasks are presented in Table 7.

A.3 IMAGE CLASSIFICATION ON IMAGENET

Dataset and metric: We perform the classification task on the ILSVRC-2012 ImageNet dataset to validate the performance of our model on a large dataset. This dataset has 1000 classes and about 1.28 million images.

Models and baselines

In this section, we apply the MrsFormer to the DeiT model (Touvron et al., 2020) with 4 heads. Since DeiT uses a special class token [CLS] for classification, we do not downsample this token along with the other tokens in the sequence. For our MrsFormers, we use the set of scales s = [1, 2, 2, 4] at each layer. We also study the MRA-2-s attention on this task. As reported in (Zeng et al., 2022), the MRA-2-s is a better model than the MRA-2 on the ImageNet image classification task since its sparse attention structure is more effective for modeling images.

B PROOF OF THEOREM 1

Recall from Eqn. (14) that $\mathbf{H} \approx \mathbf{H}^{s,s'} = \uparrow_{s,1}\!\big( (\downarrow_{s,1} \downarrow_{s',2} \mathbf{A}) (\downarrow_{s',1} \mathbf{V}) \big)$. Let $T_s$ be the down-sampling operator (matrix multiplication) on the first dimension of a matrix corresponding to the scale $s$. $T_s$ is the Kronecker product (or outer product) between an identity matrix $I$ and the row vector $\frac{1}{s}\vec{1}$ of size $1 \times s$, i.e., $T_s = I \otimes \frac{1}{s}\vec{1}$. Under this notation, the up-sampling operator is the transpose of $T_s$. In addition, the down-sampling operator on the second dimension of a matrix is also $T_s^\top$, but applied by right multiplication instead. Then, we can rewrite the approximation $\mathbf{H}^{s,s'}$ as follows:

$$\mathbf{H}^{s,s'} = T_s^\top \big( (T_s \mathbf{A} T_{s'}^\top) (T_{s'} \mathbf{V}) \big) = (T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'}) \mathbf{V}.$$

From the above equation, we have

$$\mathbf{H} - \mathbf{H}^{s,s'} = \big( \mathbf{A} - T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'} \big) \mathbf{V}.$$

From the inequality with the Frobenius norm, we have

$$\|\mathbf{H} - \mathbf{H}^{s,s'}\|_F \le \|\mathbf{A} - T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'}\|_F\, \|\mathbf{V}\|_2.$$

Therefore, it suffices to bound $\|\mathbf{A} - T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'}\|_F$ from above. Let $\mathbf{A}^{s,s'} = T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'}$; $\mathbf{A}^{s,s'}$ consists of blocks whose entries share the same value. We can rewrite $\mathbf{A}$ and $\mathbf{A}^{s,s'}$ as block matrices with blocks of size $s \times s'$: $\mathbf{A} = [\mathbf{A}_{m,n}]_{m,n}$ and $\mathbf{A}^{s,s'} = [\mathbf{A}^{s,s'}_{m,n}]_{m,n}$, where $m = 0, 1, \dots, \mathrm{qlen}/s$ and $n = 0, 1, \dots, \mathrm{klen}/s'$. Note that all elements of $\mathbf{A}^{s,s'}_{m,n}$ are equal to the average of all elements of the sub-matrix $\mathbf{A}_{m,n}$. We can now decompose the above quantity into a sum of Frobenius norms:

$$\|\mathbf{A} - T_s^\top T_s \mathbf{A} T_{s'}^\top T_{s'}\|_F^2 = \sum_{m,n} \|\mathbf{A}_{m,n} - \mathbf{A}^{s,s'}_{m,n}\|_F^2.$$

Recall that from the hypothesis, we have $|A_{i,j} - A_{i \pm 1, j}| \le \delta$ and $|A_{i,j} - A_{i, j \pm 1}| \le \delta$. Then, by applying Popoviciu's inequality, we have

$$\mathrm{Var}[X] \le \frac{(M - m)^2}{4},$$

where $m = \inf X$ and $M = \sup X$. Since the matrix is finite, the infimum and the supremum become the minimum and the maximum, respectively. By Assumption (18), we can bound $M - m$ from above as follows:

$$(M - m)^2 \le (s + s' - 2)^2 \delta^2.$$

Summing over all blocks, we find that

$$\|\mathbf{A} - \mathbf{A}^{s,s'}\|_F^2 \le \frac{\mathrm{qlen}}{s} \cdot \frac{\mathrm{klen}}{s'} \cdot \frac{(s + s' - 2)^2 \delta^2}{4}.$$

When we plug in $\mathrm{klen} = \mathrm{qlen} = N$, we obtain a simpler version:

$$\|\mathbf{A} - \mathbf{A}^{s,s'}\|_F \le \frac{s + s' - 2}{\sqrt{s s'}} \cdot \frac{N \delta}{2}.$$

As a consequence, we obtain the conclusion of the theorem.
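The step from Assumption (18) to the bound on the range M - m within a block can be spelled out as follows. This is a sketch of the presumably intended argument, with (i, j) and (i', j') denoting any two entries of the same s x s' block; the path construction is an assumption of this sketch rather than text from the proof.

```latex
% Move from (i, j) to (i', j') first along rows, then along columns.
% Each unit step changes the entry by at most delta, by Assumption (18).
\begin{align*}
|A_{i,j} - A_{i',j'}|
  &\le |A_{i,j} - A_{i',j}| + |A_{i',j} - A_{i',j'}| \\
  &\le |i - i'|\,\delta + |j - j'|\,\delta \\
  &\le (s - 1)\,\delta + (s' - 1)\,\delta = (s + s' - 2)\,\delta .
\end{align*}
```

Taking the two entries to be the maximum and the minimum of the block gives the stated bound on M - m.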

C ADDITIONAL EXPERIMENTS

C.1 COMBINING MRSHA WITH OTHER EFFICIENT ATTENTIONS

In this section, we combine the proposed MrsHA architecture with other efficient attention mechanisms to demonstrate that MrsHA can be combined with other efficient transformers to reduce memory and computation requirements. We run our experiments on five efficient transformers: Linformer (Wang et al., 2020a), Linear transformer (Katharopoulos et al., 2020), FMM transformer (Nguyen et al., 2021), Performer (Choromanski et al., 2021), and Luna transformer (Ma et al., 2021). All experimental settings in this section follow subsections 3.1 and 3.2 unless stated otherwise.

C.1.1 UEA TIME SERIES CLASSIFICATION

Table 8 presents the accuracy of the combined and original models on the UEA Time Series Classification task. All the efficient transformers in this experiment either maintain comparable performance or experience a boost in average accuracy when combined with MrsHA.

C.1.2 LONG RANGE ARENA

In the ListOps experiments, we increase the number of training steps from 5000 to 15000 to ensure convergence for all models. Table 9 further consolidates the advantage of the proposed MrsHA architecture: all the combined models obtain better average accuracy than the original models on the LRA tasks. Figure 6 compares our MrsHA-based models with the Performer, Linear, and FMM baselines trained for the LRA retrieval task. We observe that in both the train and test cases, the scatter plots of our MrsHA-based models lie above and to the left of those of the baselines, suggesting that our MrsHA-based models are more memory efficient while achieving comparable or better accuracies than the baseline models.



Figure 1: Illustration of Eqn. 11 (Left) and Eqn. 14 (Right).

Figure 2: Illustration of Eqn. 17.

2.3.3 EFFICIENT DOWNSAMPLING OF THE ATTENTION MATRIX A

As shown in Eqn. (1), $A = \mathrm{softmax}\big(QK^\top / \sqrt{D}\big)$. Since the softmax function needs access to the full

Figure 3: Training (A) and inference (B) FLOP ratios between the MrsFormer and the baseline softmax transformer across different model dimensions D (dim) and sequence lengths N on the LRA retrieval task. The MrsFormer requires fewer FLOPs than the baseline, and this advantage grows with the sequence length for very long sequences. The efficiency advantage of the MrsFormer holds for large-scale models with large D.

Figure 4: Training (A) and inference (B) memory ratios between the MrsFormer and the baseline softmax transformer across different model dimensions D (dim) and sequence lengths N on the LRA retrieval task.

2, 4, . . . , N} and t ∈ {1, . . . , N/s}. At each scale s, we approximate the columns H[:, d], d = 1, . . . , D_v, of the output sequence H as follows

(in parentheses). The MrsFormer attains the best average accuracy across all tasks while being much more efficient than the baseline softmax transformers.

Accuracy (%) of the MrsFormer vs. the baseline softmax transformer averaged over 5 seeds. The best model for each task is highlighted in bold, while the second best one is underlined. The MrsFormer attains the best average accuracy across all tasks while being much more efficient than the baseline softmax transformers.

Accuracy (%) of the MrsFormer DeiT vs. the baseline softmax DeiT and the MRA-2-s DeiT on the


Accuracy (%) of the models that combine MrsHA with the MRA and linear attentions vs. the original MRA and linear transformers on the LRA tasks. The combined models are indicated by the prefix "Mrs". Results are averaged over 5 seeds. In this experiment, we use the set of scales s = [1, 2].

Hyperparameter configuration for the UEA time series classification task.

Hyperparameter configuration for the LRA task.

Accuracy (%) of the models that combine MrsHA with other efficient transformers versus the accuracy of the original efficient transformers on the UEA Time Series Classification task. The combined models are indicated by the prefix "Mrs". Results are averaged over 5 seeds. In this experiment, we use the set of scales s = [1, 1, 2, 2, 4, 4, 8, 8].

Accuracy (%) of the models that combine MrsHA with other efficient transformers versus the accuracy of the original efficient transformers (in parentheses) on the LRA tasks. The combined models are indicated by the prefix "Mrs". Results are averaged over 5 seeds. In this experiment, we use the set of scales s = [1, 2].

Results of the comparison between MrsFT-Transformer and FT-Transformer. The ↑ symbol denotes that the reported metric is accuracy (higher is better); the ↓ symbol denotes that the reported metric is root mean square error (lower is better).

C.2 TABULAR DATA

We include a diverse set of 11 tabular datasets in our benchmark: California Housing (Kelley Pace & Barry, 1997), Adult (Kohavi, 1996), Helena (Guyon et al., 2019), Jannis (Guyon et al., 2019), Higgs (Baldi et al., 2014), ALOI (Geusebroek et al., 2005), Epsilon (EP, simulated physics experiments), Year (Bertin-Mahieux et al., 2011), Covertype (Blackard & Dean, 1999), Yahoo (Chapelle & Chang, 2011), and Microsoft (Qin & Liu, 2013). We follow all the training settings and use the default set of hyperparameters from Gorishniy et al. (2021) for all models. For simplicity, we omit the ensemble step from Gorishniy et al. (2021). We report average accuracy over 5 random seeds for both FT-Transformer (Gorishniy et al., 2021) and the combined model of MrsHA and FT-Transformer, which we denote MrsFT-Transformer. Table 10 shows that our combined model obtains better results on 7 out of 11 tasks, while maintaining comparable performance on the remaining tasks. This result consolidates the benefit of combining MrsHA with other transformer models on a diverse set of tasks.

C.3 EFFICIENCY WHEN COMBINING MRSHA WITH OTHER EFFICIENT TRANSFORMERS

For illustration, Figure 5 presents the FLOP and memory reduction ratios, in the training and testing phases, of our MrsFMM transformer compared to the original FMM transformer on the LRA retrieval task. Our model saves up to 35% of the original FLOPs and has a lower memory footprint: less than 65% and 85% of the original model's memory for the training and testing phases, respectively.

