WAVEFORMER: LINEAR-TIME ATTENTION WITH FORWARD AND BACKWARD WAVELET TRANSFORM

Abstract

We propose Waveformer, which learns the attention mechanism in the wavelet coefficient space, requires only linear time complexity, and enjoys universal approximating power. Specifically, we first apply a forward wavelet transform to project the input sequences onto multi-resolution orthogonal wavelet bases, then conduct nonlinear transformations (in this case, a random feature kernel) in the wavelet coefficient space, and finally reconstruct the representation in the input space via a backward wavelet transform. We note that other nonlinear transformations may be used; hence we name the learning paradigm Wavelet transformatIon for Sequence lEarning (WISE). We emphasize the importance of backward reconstruction in the WISE paradigm: without it, one would mix information from the input space and the coefficient space through skip-connections, which is not mathematically sound. Compared with the Fourier transform used in recent works, the wavelet transform is more efficient in time complexity and better captures local and positional information; our ablation studies further support this. Extensive experiments on seven long-range understanding datasets from the Long Range Arena benchmark and on code understanding tasks demonstrate that (1) Waveformer achieves competitive and even better accuracy than a number of state-of-the-art Transformer variants and (2) WISE can boost the accuracy of various attention approximation methods without increasing their time complexity. These together showcase the superiority of learning attention in a wavelet coefficient space over the input space.
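The three-step pipeline above (forward transform, nonlinear map in coefficient space, backward reconstruction) can be sketched with a single-level Haar wavelet. This is a minimal illustration, not the authors' implementation: the paper uses multi-resolution bases and a learned random-feature kernel, for which the `np.tanh` below is only a placeholder.

```python
import numpy as np

def haar_forward(x):
    """Project an even-length sequence onto Haar approximation/detail bases."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)  # low-frequency (approximation) coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)  # high-frequency (detail) coefficients
    return a, d

def haar_backward(a, d):
    """Reconstruct the input-space sequence from wavelet coefficients."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def wise_layer(x, nonlinearity=np.tanh):
    # forward transform -> nonlinear map in coefficient space -> backward transform
    a, d = haar_forward(x)
    return haar_backward(nonlinearity(a), nonlinearity(d))

x = np.random.randn(8)
# With the identity map, backward(forward(x)) recovers x exactly; this perfect
# reconstruction is what keeps the output purely in the input space rather than
# mixing input- and coefficient-space information through skip-connections.
assert np.allclose(haar_backward(*haar_forward(x)), x)
```

Because the Haar basis is orthogonal, the forward and backward transforms are exact inverses, so every skip-connection around `wise_layer` adds representations that live in the same (input) space.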

1. INTRODUCTION

Transformer (Vaswani et al., 2017) has become one of the most influential models in natural language processing (Devlin et al., 2018; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020), speech processing (Baevski et al., 2020), code understanding (Chen et al., 2021a), and many other applications. It is composed of attention layers and feed-forward layers, with layer normalization and skip-connections added in between. The original design of the attention layer scales quadratically with the sequence length, becoming a scalability bottleneck of Transformers, as texts, images, speech, and code can be of vast lengths. State-of-the-art attention approximation methods have enabled Transformers to scale sub-quadratically or even linearly with the input sequence length. Typical approaches to computing a cheaper pseudo-attention include sparse attention patterns (Parmar et al., 2018; Wang et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020), low-rank approximation (Wang et al., 2020; Chen et al., 2021b), and kernel approximation (Katharopoulos et al., 2020; Choromanski et al., 2020; Peng et al., 2020), where most of these methods have linear time complexity. For a comprehensive review, please refer to Section 4. Recent works on improving the effectiveness and efficiency of Transformers' long-range capabilities have started to explore attention learning in a transformed space. For example, conducting low-cost token-mixing with a forward Fourier transform leads to remarkable accuracy improvement with quasi-linear time complexity (Lee-Thorp et al., 2021). Token-mixing ideas (You et al., 2020; Lee-Thorp et al., 2021) are simple and effective; however, they lose Transformer's universal approximating power by replacing attention with hard averaging (Yun et al., 2019). Moreover, without the backward transform, the model will mix information from both the input and transformed spaces,

