MRSFORMER: TRANSFORMER WITH MULTIRESOLUTION-HEAD ATTENTION

Abstract

We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases. MRA decomposes a signal into components that lie on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads in multi-head attention into fine-scale and coarse-scale heads, modeling the attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and a smaller memory footprint than the standard softmax transformer with multi-head attention. We analyze and validate the advantage of MrsFormer over standard transformers on a wide range of applications, including image and time series classification.

1. INTRODUCTION

Transformer architectures (Vaswani et al., 2017) are widely used in natural language processing (Devlin et al., 2018; Al-Rfou et al., 2019; Dai et al., 2019; Child et al., 2019; Raffel et al., 2020; Baevski & Auli, 2019; Brown et al., 2020; Dehghani et al., 2018), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2020; Ramesh et al., 2021; Radford et al., 2021; Arnab et al., 2021; Liu et al., 2022; Zhao et al., 2021; Guo et al., 2021; Chen et al., 2022), speech processing (Gulati et al., 2020; Dong et al., 2018; Zhang et al., 2020; Wang et al., 2020b), and other applications (Rives et al., 2021; Jumper et al., 2021; Chen et al., 2021; Zhang et al., 2019; Wang & Sun, 2022). Transformers achieve state-of-the-art performance on many of these practical tasks, and their results improve with larger model sizes and increasingly long sequences. For example, the text-generation model of Liu et al. (2018a) processes input sequences of up to 11,000 tokens. Applications involving other data modalities, such as music (Huang et al., 2018) and images (Parmar et al., 2018), can require even longer sequences. At the heart of transformers is the self-attention mechanism, an inductive bias that represents each token in the input as a relevance-weighted combination of every other token, capturing a contextual representation of the input sequence (Cho et al., 2014; Parikh et al., 2016; Lin et al., 2017; Bahdanau et al., 2014; Vaswani et al., 2017; Kim et al., 2017). The capability of self-attention to attain diverse syntactic and semantic representations from long input sequences accounts for the success of transformers in practice (Tenney et al., 2019; Vig & Belinkov, 2019; Clark et al., 2019; Voita et al., 2019a; Hewitt & Liang, 2019). Multi-head attention (MHA) extends self-attention by concatenating the outputs of multiple attention heads to compute the final output, as explained in Section 2.1 below.
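As a point of reference for the costs discussed next, standard softmax MHA can be sketched as follows. This is a minimal NumPy sketch for exposition; the function names and weight shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Standard softmax MHA (Vaswani et al., 2017).
    X: (N, D) token embeddings; Wq, Wk, Wv, Wo: (D, D) projections."""
    N, D = X.shape
    d = D // n_heads  # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q = Q[:, h * d:(h + 1) * d]
        k = K[:, h * d:(h + 1) * d]
        v = V[:, h * d:(h + 1) * d]
        A = softmax(q @ k.T / np.sqrt(d))  # (N, N) attention matrix per head
        heads.append(A @ v)                # head output H_h, shape (N, d)
    # Concatenate the H head outputs and apply the output projection.
    return np.concatenate(heads, axis=-1) @ Wo
```

Each head materializes an (N, N) attention matrix, so both computation and memory grow quadratically in the sequence length N and linearly in the number of heads, which is the bottleneck discussed below.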
In spite of the success of MHA, it has been shown that its attention heads are redundant and tend to learn similar attention patterns, limiting the representation capacity of the model (Michel et al., 2019; Voita et al., 2019b; Bhojanapalli et al., 2021). Furthermore, additional heads increase the computational and memory costs, which become a bottleneck when scaling up transformers to very long sequences in large-scale practical tasks. These high computational and memory costs, together with the head redundancy of MHA, motivate the need for a new, efficient attention mechanism.

1.1. CONTRIBUTION

Leveraging the idea of the multiresolution approximation (MRA) (Mallat, 1999; 1989; Crowley, 1981), we propose a class of efficient and flexible transformers, namely the Transformer with Multiresolution-head Attention (MrsFormer). At the core of MrsFormer is the novel Multiresolution-head Attention (MrsHA), which computes approximations of the attention-head outputs H_h, h = 1, ..., H, of MHA at different scales, thereby saving computation and reducing memory cost.
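To make the coarse-scale idea concrete, a single coarse-scale head can be sketched by average-pooling keys and values into groups of g tokens, so that each query attends to N/g groups rather than N individual tokens. This sketch, including the function name `coarse_head`, the mean-pooling scheme, and the group size g, is an illustrative assumption and not the paper's exact construction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_head(q, k, v, g):
    """One coarse-scale attention head (illustrative sketch).
    q, k, v: (N, d) per-head queries/keys/values; g: group size (divides N).
    Keys and values are mean-pooled over groups of g consecutive tokens,
    so the attention matrix shrinks from (N, N) to (N, N/g)."""
    N, d = k.shape
    k_c = k.reshape(N // g, g, d).mean(axis=1)  # (N/g, d) group keys
    v_c = v.reshape(N // g, g, d).mean(axis=1)  # (N/g, d) group values
    A = softmax(q @ k_c.T / np.sqrt(d))         # (N, N/g) token-to-group attention
    return A @ v_c                              # (N, d) approximate head output
```

With g = 1 this recovers the standard fine-scale head, while larger g gives progressively coarser, cheaper approximations; mixing heads at several values of g mirrors how an MRA combines components at different scales.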

