SKTFORMER: A SKELETON TRANSFORMER FOR LONG SEQUENCE DATA

Abstract

Transformers have become a preferred tool for modeling sequential data. Many studies on using Transformers for long-sequence modeling focus on reducing computational complexity. They usually exploit the low-rank structure of data and approximate a long sequence by a sub-sequence. One challenge with such approaches is how to strike an appropriate balance between information preservation and noise reduction: the longer the sub-sequence used to approximate the long sequence, the better the information is preserved, but at the price of introducing more noise into the model and, of course, more computational cost. We propose the skeleton transformer, SKTformer for short, an efficient transformer architecture that improves upon previous attempts to negotiate this tradeoff. It introduces two mechanisms to effectively reduce the impact of noise while keeping the computation linear in the sequence length: a smoothing block to mix information over long sequences and a matrix sketch method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of SKTformer both theoretically and empirically. Extensive studies over the Long Range Arena (LRA) datasets and six time-series forecasting datasets show that SKTformer significantly outperforms both the vanilla Transformer and other state-of-the-art Transformer variants.

1. INTRODUCTION

Transformer-type models (Vaswani et al., 2017) have achieved many breakthroughs in various artificial intelligence areas, such as natural language processing (NLP) (Brown et al., 2020; Clark et al., 2020; Devlin et al., 2018; Liu et al., 2019), computer vision (CV) (Dosovitskiy et al., 2020; Liu et al., 2021; Touvron et al., 2021; Yuan et al., 2021; Zhou et al., 2021b), and time series forecasting (Xu et al., 2021; Zhou et al., 2022). The self-attention scheme plays a key role in these transformer-based models; it efficiently captures long-term global and short-term local correlations when the token sequence is relatively short. Due to the quadratic complexity of standard self-attention, many approaches have been developed to reduce the computational complexity of the Transformer for long sequences (e.g., (Zhu et al., 2021)). Most of them try to exploit special patterns of the attention matrix, such as low-rankness, locality, sparsity, or graph structure. One group of approaches builds a linear approximation of the softmax operator (e.g., (Chen et al., 2021; Choromanski et al., 2020; Chowdhury et al., 2021; Qin et al., 2021)). Despite their efficiency, these approximation methods often perform worse than the original softmax-based attention. More discussion of efficient transformers for long sequences can be found in the related work section. In this work, we focus on approaches that assume a low-rank structure of the input matrix. They approximate the global information in a long sequence by a sub-sequence (i.e., a short sequence) of landmarks, and only compute attention between queries and the selected landmarks (e.g., (Ma et al., 2021; Nguyen et al., 2021; Zhu et al., 2021; Zhu & Soricut, 2021)). Although these models enjoy linear computational cost and often better performance than the vanilla Transformer, they face one major challenge: how to balance information preservation against noise reduction.
By choosing a larger number of landmarks, we are able to preserve more global information, but at the price of introducing more noise into the sequential model and more computational cost. In this work, we propose an efficient Transformer architecture, termed Skeleton Transformer, or SKTformer for short, that introduces two mechanisms to explicitly address this balance. First, we introduce a smoothing block into the Transformer architecture. It effectively mixes global information over the long sequence via Fourier analysis and local information via a convolution kernel. Through this information mixture, we reduce the noise of individual tokens over the sequence and, at the same time, improve their representativeness of the entire sequence. Second, we introduce a matrix sketch technique that approximates the input matrix by a small number of its rows and columns. A standard self-attention can be seen as reweighting the columns of the value matrix: important columns are assigned high attention weights and remain in the output matrix, while small attention weights eliminate insignificant columns. The self-attention mechanism is equivalent to column selection if we replace the softmax operator with the corresponding argmax operator. However, sampling only columns may not generate a good summary of the matrix and can be susceptible to noise in individual columns. We address this problem by exploiting the CUR (Drineas et al., 2008), or Skeleton, approximation technique (Chiu & Demanet, 2013) from the matrix approximation community. Theoretically, for a rank-$r$ matrix $X \in \mathbb{R}^{n\times d}$, we can take $O(r\log d)$ column samples and $O(r\log n)$ row samples to construct a so-called Skeleton approximation $X \approx CUR$, where $C$ and $R$ are matrices consisting of columns and rows of $X$, respectively, and $U$ is the pseudo-inverse of their intersection.
By combining these mechanisms, we find, both theoretically and empirically, that SKTformer preserves global information over long sequences and reduces the impact of noise simultaneously, leading to better performance than state-of-the-art Transformer variants for long sequences without sacrificing linear complexity w.r.t. sequence length. In short, we summarize our main contributions as follows: 1. We propose the Skeleton Transformer (SKTformer), an efficient model that integrates a smoother, a column-attention, and a row-attention component to unfold a randomized linear matrix sketch algorithm. 2. By randomly selecting a fixed number of rows and columns, the proposed model achieves near-linear computational complexity and memory cost. The effectiveness of this selection method is verified both theoretically and empirically. 3. We conduct extensive experiments on long-sequence tasks, long-term time series forecasting, and GLUE tasks. In particular, on the Long Range Arena benchmark (Tay et al., 2021), SKTformer achieves average accuracies of 64% and 66% with fixed parameters (the setting suggested in Mathieu et al. (2014); Tay et al. (2021)) and fine-tuned parameters, respectively, improving upon the 62% of the best Transformer-type model. Moreover, it performs comparably to recent state-of-the-art models on long-term time series forecasting and GLUE tasks. Organization. We structure the rest of this paper as follows: In Section 2, we briefly review the relevant literature on efficient transformers and Skeleton approximations. Section 3 introduces the model structure and provides a theoretical analysis to justify the proposed model. We empirically verify the efficiency and accuracy of SKTformer in Section 4. We discuss limitations and future directions in Section 5. Technical proofs and experimental details are provided in the appendix.

2. RELATED WORK

This section provides an overview of the literature on efficient Transformer models. The techniques include sparse or local attention, low-rankness, and kernel approximation. We refer the reader interested in details to the survey (Tay et al., 2020c). Sparse Attention. The general idea of these methods is to restrict the query token to attend only within a specific small region, such as its local neighborhood or some global tokens. In this setting, the attention matrix becomes sparse compared to the original one. (Qiu et al., 2019) proposes BlockBERT, which introduces sparse block structures into the attention matrix by multiplying with a masking matrix. (Parmar et al., 2018) applies self-attention within blocks for the image generation task. (Liu et al., 2018) divides a sequence into blocks and uses a strided convolution to reduce model complexity. However, these block-type Transformers ignore the connections among blocks. To address this issue, Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2019) propose a recurrence mechanism to connect multiple blocks. Transformer-LS (Zhu et al., 2021) combines local attention with a dynamic projection to capture long-term dependence. (Tay et al., 2020b) uses a meta-sorting network to permute over sequences and quasi-global attention with local windows to improve memory efficiency. Another approach in this category is based on strided attention. Longformer (Beltagy et al., 2020) uses dilated sliding windows to obtain a sparse attention matrix. Sparse Transformers (Child et al., 2019) approximate a dense attention matrix by several sparse factorization methods. In addition, some methods reduce the complexity by clustering tokens. For example, Reformer (Kitaev et al., 2020b) uses a hash similarity measure to cluster tokens, and Routing Transformer (Roy et al., 2021) uses k-means to cluster tokens.
BigBird (Zaheer et al., 2020) proposes a generalized attention mechanism described by a directed graph to reduce attention complexity. (Lee-Thorp et al., 2021) considers using a 2D Fourier transform to mix the token matrix directly. (Tan et al., 2021) uses a max-pooling scheme to reduce computation costs. Low-rank and Kernel Methods. Inducing low-rankness into the attention matrix can quickly reduce the complexity, and kernel approximation is widely applied for efficient low-rank approximation. Linformer (Wang et al., 2020) and Luna (Ma et al., 2021) approximate softmax with linear functions, which yields linear time and space complexity. (Choromanski et al., 2020) and (Peng et al., 2021) use random-feature tricks and reach promising numerical performance. (Winata et al., 2020) proposes the Low-Rank Transformer based on matrix factorization. FMMformer (Nguyen et al., 2021) combines the fast multipole method with the kernel method. Synthesizer (Tay et al., 2020a) uses a random low-rank matrix to replace the attention matrix. Nyströmformer (Xiong et al., 2021) adopts the Nyström method to approximate standard self-attention. Linear Transformer (Katharopoulos et al., 2020) expresses self-attention as a linear dot-product of kernel feature maps. (Zhu & Soricut, 2021) applies the Multigrid method to efficiently compute the attention matrix recursively. Cosformer (Qin et al., 2021) develops a cosine-based re-weighting mechanism to linearize the softmax function. (Chen et al., 2021) proposes Scatterbrain, which unifies locality-sensitive hashing and the kernel method into attention for accurate and efficient approximation.

3. SKTFORMER

We start by reviewing the vanilla attention. For a sequence of length $n$, the vanilla self-attention in the transformer is of dot-product type (Vaswani et al., 2017). Following standard notation, the attention matrix $A \in \mathbb{R}^{n\times n}$ is defined as $A = \mathrm{softmax}\!\left(\frac{1}{\sqrt{d}} QK^\top\right)$, where $Q \in \mathbb{R}^{n\times d}$ denotes the queries, $K \in \mathbb{R}^{n\times d}$ denotes the keys, and $d$ is the hidden dimension. By multiplying the attention weights $A$ with the values $V \in \mathbb{R}^{n\times d}$, we calculate the new values $\hat V = AV$. Intuitively, the new values are weighted averages of the old ones, where the weights are defined by the attention matrix $A$. In this paper, we generate $Q$, $K$, and $V$ via linear projections of the input token matrix $X$: $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $X \in \mathbb{R}^{n\times d}$ and $W_Q, W_K, W_V \in \mathbb{R}^{d\times d}$. The vanilla procedure has two drawbacks in concentrating the information from $V$. First, computing $QK^\top$ involves a full dense matrix multiplication at a cost of $O(n^2)$ vector multiplications, which can be prohibitive for long-sequence problems. Second, if we view the softmax operator as an approximation of its argmax counterpart, $\hat V$ becomes a row selection from $V$; the column-wise concentration of information is ignored.
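As a reference point, the quadratic-cost computation above can be sketched in a few lines of NumPy (a toy illustration with our own helper names, not the authors' implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(X, W_Q, W_K, W_V):
    """Dot-product self-attention: A = softmax(Q K^T / sqrt(d)), V_hat = A V."""
    d = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d))  # (n, n) attention matrix: the O(n^2) bottleneck
    return A @ V
```

A quick sanity check of the "weighted average" view: if all input rows are identical, the attention weights are uniform and all output rows coincide.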

3.1. SKELETON ATTENTION

We propose a Skeleton self-attention structure, motivated by the Skeleton approximation, to address these issues. First, we modify the original self-attention to build the column self-attention as follows: $\hat V_1 = \mathrm{softmax}\!\left(\frac{1}{\sqrt{d}} QK^\top P_1^\top\right) P_1 V$, where $P_1 \in \mathbb{R}^{s_1\times n}$ denotes the sampling matrix and $s_1$ is the number of columns sampled. Let $i_1 < i_2 < \dots < i_{s_1}$ be the indices of the randomly sampled columns. Let $P_{1,ab}$ denote the element in the $a$-th row and $b$-th column; we have $P_{1,ab} = 1$ if $i_a = b$ and $0$ otherwise. With this construction, the computational cost reduces to $O(n s_1 d^2 + n s_1^2 d)$. Similarly, we build the row sampling matrix $P_2 \in \mathbb{R}^{d\times s_2}$ indicating the locations of the $s_2$ sampled rows, and compute the row self-attention as $\hat V_2 = V P_2\, \mathrm{softmax}\!\left(\frac{1}{\sqrt{n}} P_2^\top K^\top Q\right)$. Finally, we apply layer norm to $\hat V_1$ and $\hat V_2$ and then add them together to generate the final output: $\hat V = \mathrm{layernorm}_1(\hat V_1) + \mathrm{layernorm}_2(\hat V_2)$. The layer norms balance the output scales of the column and row self-attentions. A similar trick is used in (Zhu et al., 2021), where layer norm resolves scale mismatches between different attention mechanisms. Before the detailed analysis, we first introduce the incoherence parameter of a matrix, which is commonly used in low-rank matrix applications. Definition 1 (µ-incoherence). Given a rank-$r$ matrix $X \in \mathbb{R}^{n\times d}$, let $X = W\Sigma V^\top$ be its compact singular value decomposition. $X$ is µ-incoherent if there exists a constant $\mu$ such that $\max_i \|e_i^\top W\| \le \sqrt{\mu r / n}$ and $\max_i \|e_i^\top V\| \le \sqrt{\mu r / d}$, where $e_i$ denotes the $i$-th canonical basis vector. The µ-incoherence describes the correlation between the column/row spaces and the canonical basis vectors: a smaller $\mu$ implies the singular vectors are less aligned with any single canonical basis vector, which leads to a better chance of successful reconstruction from sparse row/column samples. We next use the following proposition to characterize the efficiency of sampling both columns and rows. Proposition 1.
Let $X \in \mathbb{R}^{n\times d}$ be a rank-$r$ matrix with µ-incoherence. Without loss of generality, assume $n \ge d$. Let $E \in \mathbb{R}^{n\times d}$ be a noise matrix. By uniformly sampling $O(\mu r \log n)$ columns and rows from the noisy $X + E$, the Skeleton approximation can construct a matrix $\widetilde X$ such that, with probability at least $1 - O(n^{-2})$, $\|X - \widetilde X\| \le O\!\left(\|E\| \sqrt{\tfrac{nd}{\mu r \log n}}\right)$. Several works (e.g., (Chiu & Demanet, 2013; Drineas et al., 2008)) have proposed explicit methods to construct $\widetilde X$. Those methods require computing a pseudo-inverse, which is generally inefficient in deep learning settings. (Xiong et al., 2021) uses an approximation of the pseudo-inverse in the symmetric-matrix setting. It remains an open question whether the approximated pseudo-inverse also works for general matrices in deep learning settings. On the other hand, in the transformer model a good matrix approximation is not our primary goal, so we pursue a different route that only maintains sufficient information to pass through the network via (2).
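A minimal NumPy sketch of the Skeleton attention of Section 3.1: the sampling matrices $P_1, P_2$ are realized as uniform index selections, the learnable layer norms are replaced by a plain standardization, and all names are illustrative rather than the authors' code:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def skeleton_attention(X, W_Q, W_K, W_V, s1=4, s2=4, seed=None):
    """Column + row self-attention over uniformly sampled indices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Column attention: attend only to s1 sampled sequence positions (P1 K, P1 V).
    idx1 = np.sort(rng.choice(n, size=s1, replace=False))
    V1 = softmax(Q @ K[idx1].T / np.sqrt(d), axis=-1) @ V[idx1]        # (n, d)
    # Row attention: mix s2 sampled hidden dimensions (V P2, P2^T K^T Q);
    # weights are normalized over the sampled dimensions.
    idx2 = np.sort(rng.choice(d, size=s2, replace=False))
    V2 = V[:, idx2] @ softmax(K[:, idx2].T @ Q / np.sqrt(n), axis=0)   # (n, d)
    def standardize(M):  # stand-in for a learnable LayerNorm
        return (M - M.mean(-1, keepdims=True)) / (M.std(-1, keepdims=True) + 1e-6)
    return standardize(V1) + standardize(V2)
```

Both attention maps have shapes $n\times s_1$ and $s_2\times d$, so for fixed $s_1, s_2$ the cost is linear in the sequence length.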

3.2. SMOOTHER COMPONENT

Based on the analysis of the Skeleton approximation, the matrix incoherence parameter µ plays a crucial role in determining the number of rows and columns to sample: decreasing µ leads to a smaller sampling size. Furthermore, the µ-incoherence condition implies that the "energy" of the matrix is evenly distributed over its entries, i.e., the matrix is "smooth" (Candès & Recht, 2009). In this subsection, we propose a novel smoother component to reduce the incoherence parameter without introducing excessive information loss. The incoherence parameter can be viewed as a measure of the smoothness of a matrix: a "smoother" matrix tends to have a smaller incoherence parameter. Intuitively, adjacent columns or rows of a smooth matrix have similar values, so a few landmark columns or rows can represent the matrix with little error. On the other hand, if the matrix is rough (e.g., containing spiky columns or rows), more landmarks are required. A common way to smooth a matrix is to convolve it with a smoothing kernel, such as a Gaussian kernel. However, directly using a fixed smoothing kernel can potentially remove too many details and harm the final performance. In the recent literature (e.g., Guo et al. 2022), large convolution-kernel-based attentions show superior performance in vision Transformers. In this paper, we propose to use a data-driven convolution layer along the sequence dimension with a kernel size equal to the sequence length. In this setting, the information of a given row can be decentralized among the rows. As the input token matrix is computed through a feed-forward layer, the information among different rows is already adaptively allocated; hence, we do not perform the convolution along the hidden dimension. We use the Fast Fourier Transform (FFT) to implement the convolution. Let $L_0 \in \mathbb{R}^{n\times d}$ be the convolution kernel matrix.
Via the convolution theorem, circular convolution in the spatial domain is equivalent to a pointwise product in the Fourier domain, so we have: $X^{\mathrm{smooth}} = X * L_0 = \mathcal{F}^{-1}[\mathcal{F}(X) \cdot \mathcal{F}(L_0)]$ (4), where $\mathcal{F}$, $*$, and $\cdot$ denote the FFT operator, the convolution operator, and the pointwise product, respectively. Equation (4) requires $3d$ fast Fourier operations, which can be prohibitive for large $d$. To save computational cost, we instead use a learnable matrix $L \in \mathbb{C}^{n\times d}$ in the frequency domain and apply segment averaging (averaging segments of the hidden dimension) to $X$. To simplify notation, we assume there are integers $s$ and $r$ with $d = sr$. Instead of (4), we apply the following (5) to smooth the token matrix: $X^{\mathrm{smooth}} = \mathcal{F}^{-1}[\mathcal{F}(XS) \cdot L]$ (5), where $S = \mathrm{blockdiag}\!\left(\tfrac{1}{s}\mathbf{1}, \dots, \tfrac{1}{s}\mathbf{1}\right) \in \mathbb{R}^{d\times d}$ (6) and $\mathbf{1}$ denotes the $s\times s$ matrix with all elements equal to 1. As $XS$ contains repeated columns, in (5) we can reduce the number of fast Fourier operations to $r + d$. The following proposition characterizes the smoothing ability of the Fourier convolution. Proposition 2. Let $f$ denote a column of the output of the Fourier convolution with a randomly initialized kernel. With probability at least $1 - \delta$, $\max_t |f(t) - f(t-1)| \le b_{\max}\sigma\sqrt{\tfrac{1}{2n}\log\tfrac{2n}{\delta}} + a_{\max}\sigma\sqrt{\tfrac{1}{2n^2}\log\tfrac{2}{\delta}}$. Proposition 2 describes the Fourier convolution layer's behavior in the early training stage. With standard initialization methods (e.g., Kaiming or Xavier initialization), the variance of the elements in the learnable matrix $L$ is $O(n^{-1})$ and their scale is $O(n^{-1/2})$. To simplify the discussion, assume Kaiming normal initialization, so that $L$ is a random complex Gaussian matrix with zero mean and variance $n^{-1}\sigma^2$. Using the fact that the FFT of a Gaussian sequence remains Gaussian with $2n$ times larger variance, passing the $n^{-1}\sigma^2$-variance Gaussian sequence through the inverse FFT (IFFT) results in a Gaussian sequence with variance $\tfrac{1}{2n^2}\sigma^2$.
By Proposition 2, the maximum difference between adjacent elements after the convolution scales as $b_{\max}\sigma n^{-1/2} + a_{\max}\sigma n^{-1} \approx b_{\max}\sigma n^{-1/2}$ when the sequence length $n$ is large enough. Thus, as long as $\sigma < O(\sqrt{n})$, the sequence is smoothed by the Fourier convolution. During training, the elements of the learnable matrix $L$ move away from independent random variables and help generate a better representation of the segment-averaged token matrix $XS$. The following Proposition 3 describes the potential representation ability of the proposed Fourier convolution component. Proposition 3. Let $X \in \mathbb{R}^{n\times d}$ be a bounded matrix and $S \in \mathbb{R}^{d\times d}$ be constructed by (6). There exist matrices $G, L \in \mathbb{R}^{n\times d}$ such that $\left\|(XS)_{1:t} - X^{\mathrm{smooth}}_t G_{1:t}\right\| \le O\!\left(r^{3/2}\, t \log(n)\, d^{-1/2}\right)$, where $(\cdot)_{1:t}$ is the submatrix of the first $t$ rows of a given matrix, $X^{\mathrm{smooth}}_t$ is the $t$-th row of $X^{\mathrm{smooth}} = \mathcal{F}^{-1}[\mathcal{F}(XS) \cdot L]$, and $G$ satisfies $G_{i,j} = G_{i+s,j} = \dots = G_{i+r(s-1),j} = g_i(j)$. Here $\{g_1(\cdot), \dots, g_s(\cdot)\}$ is an orthogonal polynomial basis. Proposition 3 states that if we properly train the matrix $L$, the information in $XS$ up to row $t$ can be compressed into the $t$-th row of $X^{\mathrm{smooth}}$ with moderate error. Therefore, sampled rows of $X^{\mathrm{smooth}}$ contain more information than the same number of rows of the original $XS$. Similar results are discussed in FNet (Lee-Thorp et al., 2021) and in the RNN literature, such as (Gu et al., 2020) and (Voelker et al., 2019). In (Gu et al., 2020), several specific types of polynomials (e.g., Legendre or Chebyshev) are explored, and the corresponding matrix $L$ is predefined instead of data-driven. Recently, (Gu et al., 2021b) proposed a more sophisticated method that could be used to compute $X^{\mathrm{smooth}}$. We leave this for future work.
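The segment averaging plus frequency-domain filtering of Eq. (5) can be sketched in NumPy as follows (`L` stands in for the learnable frequency-domain matrix and is passed in explicitly; a sketch, not the authors' implementation):

```python
import numpy as np

def fourier_smooth(X, L, s):
    """Eq. (5): segment-average the hidden dimension into r = d/s groups,
    then apply a frequency-domain filter L along the sequence dimension."""
    n, d = X.shape
    assert d % s == 0
    r = d // s
    seg = X.reshape(n, r, s).mean(axis=2)   # column-block averages, shape (n, r)
    XS = np.repeat(seg, s, axis=1)          # XS: each block average repeated s times
    # Circular convolution via the convolution theorem: pointwise product in frequency.
    return np.fft.ifft(np.fft.fft(XS, axis=0) * L, axis=0).real
```

With $L \equiv 1$ the filter is the identity, so the output is exactly $XS$, which is a handy sanity check.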

3.2.2. CONVOLUTION STEM

$X^{\mathrm{smooth}}$ may suffer from over-smoothing, where local details are wiped out. We use a convolution stem (CNN + BN + ReLU) to tackle this problem. We first concatenate $X^{\mathrm{smooth}}$ with the original token matrix $X$ into an $n\times 2d$ matrix and then apply a 1D convolution with kernel size 3 to transform it back to $n\times d$ dimensions. Finally, the output is normalized with a BatchNorm layer and truncated by the ReLU activation function to stabilize training. (Wang et al., 2021) reports that the ReLU activation coupled with a normalization layer plays an important role in various vision transformers and analyzes this phenomenon theoretically.
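A rough NumPy rendering of this stem (illustrative only; in practice this is a standard Conv1d + BatchNorm + ReLU in a deep learning framework, and the "batch" statistics here are taken over the sequence as a stand-in):

```python
import numpy as np

def conv_stem(X, X_smooth, W, eps=1e-5):
    """Concatenate X_smooth with X (n x 2d), apply a kernel-size-3 convolution
    along the sequence back to n x d, then normalize and apply ReLU.
    W: hypothetical learnable kernel of shape (3, 2d, d)."""
    n, d = X.shape
    Z = np.concatenate([X_smooth, X], axis=1)     # (n, 2d)
    Zp = np.pad(Z, ((1, 1), (0, 0)))              # 'same' padding along the sequence
    W2 = W.reshape(3 * 2 * d, d)
    out = np.stack([Zp[t:t + 3].reshape(-1) @ W2 for t in range(n)])  # (n, d)
    out = (out - out.mean(0)) / np.sqrt(out.var(0) + eps)             # BatchNorm-style
    return np.maximum(out, 0.0)                                       # ReLU truncation
```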

4. EXPERIMENTS

In this section, we test our SKTformer on the Long Range Arena (LRA) datasets (Tay et al., 2021) and six real-world time series benchmark datasets for long-term forecasting. We also evaluate the transfer learning ability of SKTformer on GLUE tasks. We implement SKTformer based on the official code of (Zhu et al., 2021) and (Zhou et al., 2022) for the LRA and time-series forecasting tasks, respectively. The implementation details (source code) for SKTformer are provided in Appendix A.

4.1. LONG-RANGE ARENA

The open-source Long-Range Arena (LRA) benchmark (Tay et al., 2021) is a standard way to test the capability of transformer-variant architectures on long-sequence tasks. In particular, SKTformer significantly outperforms the benchmarks on the Image task by relatively large margins (12.6% and 20.6%, respectively), supporting the view that SKTformer's smoothing effect on low-level features benefits high-level image classification. Moreover, we want to highlight the sampling efficiency of SKTformer. The sequence length of LRA tasks is over one thousand, and efficient Transformers in the literature usually cannot project the token matrix to a very small size while maintaining comparable numerical performance. By sampling only 8 rows and columns from the token matrix, SKTformer already obtains a 64.11% average score, improving on the previous best score of 62.03% from Transformer-LS.

4.2. LONG-TERM FORECASTING TASKS FOR TIME SERIES

To further evaluate the proposed SKTformer, we also conduct extensive experiments on six popular real-world benchmark datasets for long-term time series forecasting, covering traffic, energy, economics, weather, and disease, as shown in Table 2. To highlight the relevant comparison, we mainly include five state-of-the-art (SOTA) Transformer-based models, i.e., FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021a), LogTrans (Li et al., 2019), and Reformer (Kitaev et al., 2020a), as well as one recent state-space model with recursive memory, S4 (Gu et al., 2021a). FEDformer is selected as the main baseline as it achieves SOTA results in most settings. More details about baseline models, datasets, and implementations are described in the Appendix. Compared with the SOTA model (FEDformer), our proposed SKTformer yields comparable performance on these tasks, with relative MSE reductions on 4 of 6 datasets. It is worth noting that the improvement is even more significant on certain datasets, e.g., Exchange (> 30%). Although Exchange does not exhibit an apparent periodicity pattern, SKTformer still achieves superior performance.

4.3. TRANSFER LEARNING IN GLUE TASKS

We evaluate the transfer learning ability of the proposed model in the pretraining-finetuning paradigm in NLP tasks. We pretrain vanilla BERT (Devlin et al., 2018) 

4.4. TRAINING SPEED AND PEAK MEMORY USAGE

We compare the training speed (in steps per second) and peak memory usage with several baseline models. SKTformer achieves a 4× speed advantage and an 87% memory reduction compared to the vanilla Transformer in the 3k-input setting, and performs neck and neck with the most efficient baseline models, as shown in Table 4.

4.5. ROBUSTNESS ANALYSIS

We conduct a noise-resistance experiment for SKTformer and other Xformers, as shown in Table 5. We use the Image experiment setting from the LRA datasets. When generating a sample sequence, we add noise drawn from the uniform distribution $U(-a, a)$ to each position in the sequence. We consider $a \in \{0, 2, 4, 8\}$ and train every model for 5k steps with 5 replicates. Our model's performance remains robust under a high level of noise injection. This supports our theoretical robustness analysis and shows that SKTformer indeed makes an appropriate tradeoff between information preservation and noise reduction.
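The perturbation used here can be sketched in one helper (a hypothetical name; the actual data pipeline lives in the official LRA code):

```python
import numpy as np

def inject_noise(seq, a, seed=None):
    """Add i.i.d. Uniform(-a, a) noise to every position of the input sequence."""
    rng = np.random.default_rng(seed)
    return seq + rng.uniform(-a, a, size=seq.shape)
```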

4.6. ABLATION STUDY

This subsection provides an ablation test on four components: Fourier Convolution, Convolution Stem, Column Attention, and Row Attention. We use SKTformer with $r = s_1 = s_2 = 8$ as the baseline; the detailed settings are in Table 11 in Appendix F. In Table 6, we present the accuracy changes when removing each component. The performance decreases in Table 6 indicate that all four components used in SKTformer are necessary to reach promising results. The most significant component is Column Attention, which accounts for an 8.28% average accuracy difference, reflecting that a good summary of the whole sequence is important. Similar observations are reported in Transformer-LS (Zhu et al., 2021) and XCiT (Ali et al., 2021), where the spirit of attention over columns appears in the dynamic projection and Cross-Covariance Attention, respectively. The second most effective part is the Fourier Convolution: it accounts for a 13.89% accuracy difference on the Retrieval task, which involves two 4k sequences. Fourier Convolution also works well on shorter-sequence tasks (e.g., Image and Pathfinder), bringing a 6.12% accuracy difference.

5. CONCLUDING REMARKS

We propose SKTformer, a robust and efficient transformer architecture for modeling long sequences with a good balance between feature preservation and noise resistance. It combines a Fourier convolutional stem that smooths information among tokens with a Skeleton-decomposition-inspired efficient self-attention. In particular, our proposed Skeleton Attention directly samples the columns and rows of the token matrix. This design increases the model's robustness and yields near-linear complexity as a beneficial side effect. We conduct a thorough theoretical and experimental analysis of the proposed model and show its effectiveness. Lastly, extensive experiments show that the proposed model achieves the best performance on Long Range Arena among all transformer-based baselines and state-of-the-art performance on long-term time series forecasting tasks. One limitation of the current SKTformer is that we need to use both the FFT and IFFT sequentially, which is potentially slower than existing Fourier-based Transformers (e.g., (Lee-Thorp et al., 2021)) that only involve the FFT. As our primary goal in using the Fourier convolution is to smooth the token matrix and reduce the incoherence parameter, we could use the Random Fourier Transformation (Ailon & Chazelle, 2006) to modify SKTformer to use only the FFT. Another limitation is that the size of the $L$ matrix in the Fourier Convolution part matches the input sequence length; on longer sequences, $L$ contains more learnable parameters, which makes the model easier to overfit. We may introduce low-rankness or use a more sophisticated design, such as (Gu et al., 2021b), to tackle this issue in the future.

C PROOF OF PROPOSITION 2

The convolved sequence satisfies $f(t) - f(t-1) = \underbrace{\sum_{i=1}^{t-1} (l_{i+1} - l_i) x_i}_{:=(a_t)} + l_1 x_t$. By Hoeffding's inequality, the term $(a_t)$ satisfies the following inequality for any $\varepsilon > 0$.
$P(|(a_t)| \ge \varepsilon) = P\!\left(\left|\sum_{i=1}^{t-1}(l_{i+1}-l_i)x_i\right| \ge \varepsilon\right) \le \exp\!\left(-\frac{2\varepsilon^2}{(t-1)\, b_{\max}^2 \cdot \frac{1}{n^2}\sigma^2}\right). \quad (10)$
Combining (10) with the union bound over $t = 1, 2, \dots, n$, the following holds with probability at least $1 - \delta/2$:
$\max_t |(a_t)| \le b_{\max}\sigma\sqrt{\frac{1}{2n}\log\frac{2n}{\delta}}. \quad (11)$
Similarly, with probability $1 - \delta/2$, we have
$\max_t |l_1 x_t| \le a_{\max}\sigma\sqrt{\frac{1}{2n^2}\log\frac{2}{\delta}}. \quad (12)$
Therefore, via (11) and (12), with probability at least $1 - \delta$, we have
$\max_t |f(t) - f(t-1)| \le b_{\max}\sigma\sqrt{\frac{1}{2n}\log\frac{2n}{\delta}} + a_{\max}\sigma\sqrt{\frac{1}{2n^2}\log\frac{2}{\delta}}.$

D PROOF OF PROPOSITION 3

The proof contains two parts. In the first part, we view the data sequence as a function of the index $t$ and construct the coefficients and orthogonal polynomials for function approximation. In the second part, we show such coefficients can be computed with the Fourier convolution, i.e., (5). Function Approximation. We reformulate the matrix $XS$ as
$XS = [\bar x_1 e \;\; \bar x_2 e \;\; \cdots \;\; \bar x_r e], \quad (13)$
where $e \in \mathbb{R}^{1\times s}$ is the all-ones vector and $\bar x_i \in \mathbb{R}^{n\times 1}$ is the average of the $(s(i-1)+1)$-th through $(si)$-th columns of $X$. Next, we focus on the vector $\bar x_j$ and view its $t$-th element as the output of a function $h_j(t) = \bar x_{jt}$. Via the analysis in (Gu et al., 2020, Appendices C and D), we can form an approximation of $h_j(t)$ as
$h_j[x_{\le t}](x) \approx \sum_{i=1}^{s} c_i^j(t)\, g_i(x), \quad (14)$
where $\{g_i\}$ is a sequence of orthogonal polynomials and $[c_1^j(t), c_2^j(t), \dots, c_s^j(t)] := c^j_t \in \mathbb{R}^{1\times s}$ satisfies
$\frac{d}{dt} c(t)^j = \frac{1}{t}\, c(t)^j A_0 + \frac{1}{t s \log n}\, h(t)\, b_0, \quad (15)$
where $A_0 \in \mathbb{R}^{s\times s}$ and $b_0 \in \mathbb{R}^{1\times s}$ are a predefined matrix and vector, respectively. Equation (15) corresponds to the case $\lambda_n = s\log n$ in (Gu et al., 2020). We then use the forward Euler approach to discretize it:
$\hat c(t)^j = \hat c(t-1)^j \left(I + \frac{1}{t} A_0\right) + \frac{1}{t s \log n}\, h(t)\, b_0. \quad (16)$
Via standard error analysis of the forward Euler approach, we have
$c(t+1)^j = c(t)^j + \frac{1}{t} c(t)^j A_0 + \frac{1}{t s \log n} h(t) b_0 + \frac{d^2}{dt^2} c(t)^j \Big|_{t=\xi} = c(t)^j + \frac{1}{t} c(t)^j A_0 + \frac{1}{t s \log n} h(t) b_0 + \frac{1}{\xi s \log n} h'(\xi) b_0 = c(t)^j + \frac{1}{t} c(t)^j A_0 + \frac{1}{t s \log n} h(t) b_0 + O\!\left(\frac{1}{t s \log n}\right),$
where $\xi \in [t, t+1]$. It implies that for $t = 1, 2, \dots, n$,
$\|\hat c(t)^j - c(t)^j\| \le O\!\left(\frac{\log t}{s \log n}\right). \quad (17)$
Combining (17) with the proof procedure of (Gu et al., 2020, Proposition 6), if $h_j(x)$ is the quadratic spline interpolation on $\{\bar x_{jt}\}$, we obtain
$\left\|\bar x_{jt} - \sum_{i=1}^{s} \hat c_i(t) g_i(x)\right\| \le O\!\left(\frac{t \log n}{\sqrt{s}}\right) = O\!\left(t \log n \sqrt{\frac{r}{d}}\right). \quad (18)$
The desired result in Proposition 3 is obtained by repeatedly applying (18) for $j = 1, 2, \dots, r$. Coefficients via Fourier Convolution. The remaining task is to show that $\{\hat c(t)^j\}$ can be generated via Fourier convolution. To simplify notation, denote $A = I + \frac{1}{t} A_0$ and $b = \frac{1}{t s \log n} b_0$, so that (16) becomes $\hat c(t)^j = \hat c(t-1)^j A + h(t) b$ (19). Unrolling (19) over $t = 1, 2, \dots$, one may verify
$\hat c^j_t = \sum_{i=1}^{t-1} b A^{t-i} h(i) = \sum_{i=1}^{t-1} b A^{t-i} \bar x_{ji} \;\Rightarrow\; C^j = \bar A^j * (\bar x_j e), \quad (20)$
where
$C^j = \begin{bmatrix} \hat c^j_1 \\ \hat c^j_2 \\ \vdots \\ \hat c^j_n \end{bmatrix} \in \mathbb{R}^{n\times s}, \qquad \bar A^j = \begin{bmatrix} b \\ bA \\ \vdots \\ bA^{n-1} \end{bmatrix} \in \mathbb{R}^{n\times s}.$
Next we repeatedly apply (20) for $j = 1, 2, \dots, r$, and one has
$\underbrace{[C^1 \; C^2 \; \cdots \; C^r]}_{:=X^{\mathrm{smooth}}} = \underbrace{[\bar A^1 \; \bar A^2 \; \cdots \; \bar A^r]}_{:=L_0} * \underbrace{[\bar x_1 e \; \bar x_2 e \; \cdots \; \bar x_r e]}_{=XS} \;\Rightarrow\; X^{\mathrm{smooth}} = L_0 * XS \;\Rightarrow\; X^{\mathrm{smooth}} = \mathcal{F}^{-1}(\mathcal{F}(L_0) \cdot \mathcal{F}(XS)) \;\Rightarrow\; X^{\mathrm{smooth}} = \mathcal{F}^{-1}(L \cdot \mathcal{F}(XS)),$
where we use the fact that $L$ is constructed in the frequency domain in the Fourier convolution in Eq. (5).

E MODEL PARAMETERS IMPACT

SKTformer introduces three extra hyperparameters: $r$, $s_1$, and $s_2$. We test the influence of varying them and report the results in Table 7. We use SKTformer with $r = s_1 = s_2 = 8$ as the baseline model; other parameters are reported in Table 10 in Appendix F. Influence of r in Fourier Convolution. The parameter $r$ determines the number of segment averages computed in (5). A smaller $r$ leads to a matrix with more duplicate columns, so more detailed information is lost. On the other hand, according to Proposition 3, a larger $r$ would potentially decrease the memorization ability and yield a high approximation error. In Table 7a, the best performance is observed at $r = 8$ or $r = 16$. For $r = 1$, the token matrix is smoothed to a rank-one matrix, and the average accuracy drops by 3.55 from the best setting. When $r$ grows beyond 16, the accuracy in all experiments slightly decreases. We believe this is due to overfitting, since the smoothed token matrix has more flexibility and more irrelevant information from the training dataset is learned. Influence of sample number s1 in Row Attention. In the Row Attention part, we randomly sample $s_1$ key and value tokens. Table 7b reports that the optimal sampling amounts differ among tasks. On the Pathfinder task, the optimal result is obtained with $s_1 = 256$, while the best performance on the other tasks is reached with $s_1 = 32$. The Pathfinder task requires learning extremely long-range dependence (the connectivity between two circles far away from each other); a lack of tokens leads to inaccurate long-range dependence estimation and damages the final results. For tasks like Image or Retrieval, modest-range dependence may already be enough for promising performance, so fewer token samples suffice. Influence of sample number s2 in Column Attention. In Column Attention, $s_2$ columns are selected. The experimental results are shown in Table 7c.
When setting s_2 = 1, average performance decreases by 13.24%. Similar behavior is observed in the first row of Table 7a with r = 1: the information loss due to insufficient rank limits the final performance. On average, s_2 = 16 gives the best result, and further increasing s_2 slightly harms the accuracy on all tasks except Pathfinder.
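The row/column sampling step discussed above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `skeleton_sample` is a hypothetical helper that uniformly draws s_1 rows (tokens) and s_2 columns from the smoothed token matrix, which the row/column attention would then operate on.

```python
import numpy as np

def skeleton_sample(X, s1, s2, rng):
    """Uniformly sample s1 rows and s2 columns of the token matrix X (n x d).

    Hypothetical sketch of the sampling step; row attention attends against
    the sampled rows, column attention against the sampled columns.
    """
    n, d = X.shape
    rows = rng.choice(n, size=min(s1, n), replace=False)
    cols = rng.choice(d, size=min(s2, d), replace=False)
    return X[rows, :], X[:, cols]

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))       # toy smoothed token matrix
R, C = skeleton_sample(X, s1=32, s2=16, rng=rng)
assert R.shape == (32, 64) and C.shape == (128, 16)
```

Because only s_1 rows and s_2 columns enter the attention computations, the cost stays linear in the sequence length n.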

F EXPERIMENT CONFIGURATIONS

In this section, we report the configurations for the experiments in Sections 4.1, 4.2, and 4.3. 

H DATASET AND IMPLEMENTATION DETAILS

In this subsection, we summarize the details of the datasets used in this paper as follows. The LRA benchmark has several desirable advantages that made us focus on it for evaluation: generality (it only requires the encoder part); simplicity (data augmentation and pretraining are out of scope); challenging long inputs (difficult enough, with room to improve); diversity (tasks covering math, language, image, and spatial modeling); and lightweight setup (it runs with low resource requirements).

Time series datasets: 1) The ETT (Zhou et al., 2021a) dataset contains two sub-datasets, ETT1 and ETT2, collected from two separate counties. Each of them has two versions with different sampling resolutions (15 min & 1 h). The ETT dataset contains multiple time series of electrical loads and one time series of oil temperature. 2) The Electricity dataset contains the electricity consumption of more than three hundred clients, with each column corresponding to one client. 3) The Exchange (Lai et al., 2018) dataset contains the currency exchange rates of eight countries. 4) The Traffic dataset contains the occupation rate of freeway systems in California, USA. 5) The Weather dataset contains 21 meteorological indicators over a range of one year in Germany. 6) The Illness dataset contains influenza-like illness patient counts in the United States. Table 13 summarizes the features of the six benchmark datasets. They are all split into training, validation, and test sets by the ratio 7:1:2.

GLUE datasets: The GLUE benchmark covers various natural language understanding tasks and is widely used to evaluate transfer ability. The tasks can be divided into two types: single-sentence tasks (SST-2 and CoLA) and sentence-pair tasks (MNLI, QQP, QNLI, STS-B, MRPC, RTE). Following the same settings as (Devlin et al., 2018), we exclude the WNLI task.

I EXPERIMENTS ON THE SMOOTHNESS EFFECT OF FOURIER CONVOLUTION

In this section, we verify that the Fourier convolution component in the Smoother block can reduce the incoherence value in the early training stage. We use SKTformer with (r, s_1, s_2 = 8) as the test model and evaluate on an NLP dataset, Text, and a vision dataset, Pathfinder. We compute the µ-incoherence value of the token matrix before and after the Fourier convolution (denoted µ_X and µ_X_smooth, respectively) for each sample in the validation dataset. Since we do not explicitly force the token matrix to be low-rank, as required by Definition 1, we report the incoherence value approximately for different rank settings (rank = 16 and rank = 32); the mean and standard deviation of the incoherence values can be found in Table 14. The average value is reduced by about 30% after the Fourier convolution on both datasets. Moreover, we observe that the standard deviation significantly decreases, which suggests the Fourier convolution may also stabilize the training procedure.

An illustration of the Smoother and Skeleton Attention parts is shown in Figure 2. We smooth the input token matrix to ensure that the sampled rows and columns contain more local and/or global information. Thus, sampling several rows and columns from the smoothed token matrix can be more effective than sampling from the original token matrix.
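The µ-incoherence measurement above can be sketched as follows. This assumes the standard incoherence definition via the rank-r SVD, i.e., µ is the larger of (n/r) and (d/r) times the maximum squared row norm of the left and right singular vectors; the paper's Definition 1 may normalize slightly differently.

```python
import numpy as np

def mu_incoherence(X, rank):
    """Approximate mu-incoherence of X for a given target rank.

    mu = max over left/right singular subspaces of
    (dimension / rank) * max squared row norm of the top-rank singular vectors.
    """
    n, d = X.shape
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U_r, V_r = U[:, :rank], Vt[:rank, :].T
    mu_u = (n / rank) * np.max(np.sum(U_r**2, axis=1))
    mu_v = (d / rank) * np.max(np.sum(V_r**2, axis=1))
    return max(mu_u, mu_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))    # toy token matrix
mu = mu_incoherence(X, rank=16)
assert mu >= 1.0  # incoherence is always at least 1 by construction
```

A smaller µ means the mass of the singular vectors is spread evenly over rows, which is exactly the regime in which uniform row/column sampling is reliable.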



In practice, we use rFFT/irFFT, the fast (inverse) Fourier transform for real input, instead of the general FFT/IFFT, so the size of the matrix L is reduced to L ∈ C^((⌊n/2⌋+1)×d). Here we omit the dependence on d for brevity.

Footnotes:
2. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3. http://pems.dot.ca.gov
4. https://www.bgc-jena.mpg.de/wetter
5. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
7. Incoherence is defined by Definition 1 in Appendix B.
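The storage saving from the real-input transform can be verified directly with NumPy's `rfft`/`irfft`: a real signal of length n keeps only ⌊n/2⌋+1 frequency bins, since its spectrum is conjugate-symmetric. A minimal check (toy shapes, illustrative only):

```python
import numpy as np

n, d = 10, 4
X = np.random.default_rng(0).standard_normal((n, d))

# rFFT along the sequence axis keeps only floor(n/2)+1 frequency bins.
Xf = np.fft.rfft(X, axis=0)
assert Xf.shape == (n // 2 + 1, d)

# irFFT recovers the real signal exactly (pass n explicitly for odd lengths).
X_rec = np.fft.irfft(Xf, n=n, axis=0)
assert np.allclose(X, X_rec)
```

Roughly halving the frequency axis halves both the memory for L and the cost of the pointwise product in the frequency domain.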



Figure 1: Illustration of the architecture of Vanilla Transformer versus SKTformer

Figure 2: Illustration on effect of the Smoother and Skeleton Attention on Token Matrix.

Proposition 2. Let $\{x_1, \ldots, x_n\}$ be a sequence with $\max_t |x_t| \le a_{\max}$ and $\max_t |x_t - x_{t-1}| \le b_{\max}$. Let $\{l_1, \ldots, l_n\}$ be a sequence of i.i.d. $\frac{1}{n^2}\sigma^2$-subgaussian variables. Let $f(t)$ be the convolution of $\{x_t\}$ and $\{l_t\}$, i.e., $f(t) = \sum_{i=1}^{t} l_{t+1-i} x_i$. With probability at least $1 - \delta$, we have:

Experimental results on the Long-Range Arena benchmark. The best model is in boldface and the second best is underlined. The standard deviations of SKTformer are reported in parentheses.



We report the best GLUE results for each model from multiple hyper-parameter configurations in Table 3, and the detailed training configurations in Table 15 in Appendix K. Our SKTformer reaches a 77.01 average score (96.0% of the accuracy of vanilla BERT), which also outperforms FNet by 4.6% and PoNet by 0.3% relatively.

GLUE validation results. We report the mean of accuracy and F1 for QQP and MRPC, Matthews correlation for CoLA, Spearman correlation for STS-B, and accuracy for the other tasks. For the MNLI task, we consider the matched test set.

Benchmark results of all Xformer models with a consistent batch size of 32 across all models and various input lengths.

Average Accuracy on Image task (CIFAR-10 dataset) in Long Range Arena with noise injections. The relative performance changes are reported in parentheses.

Ablation experiments. SKT (r, s_1, s_2 = 8) is used as the baseline. The performance differences from removing each component from the baseline model are reported.

Experimental results on varying r, s_1, and s_2, together with ablation experiments for each component. The best result is in boldface and the second best is underlined. (a) Experimental results on varying the r parameter in the smoothing component.

Experiment Configuration of SKTformer (r, s 1 , s 2 = 8).

Experiment Configuration of SKTformer (best).

Experiment Configuration for Model Parameters Impact.

Experiment Configuration for Ablation. We have already provided the average of 5 runs with different random seeds in Table 1. Here we also provide the standard deviations for these experiments in Table 12.

Accuracy on Long Range Arena (LRA) with standard errors shown in parentheses. All results are averages of 5 runs with different random seeds. ListOps (2K-length mathematical expression task that investigates parsing ability); Text (up to 4K byte/character-level document classification task that tests capacity for character compositionality); Retrieval (byte/character-level document matching task that examines information compression ability with two 4K-length sequences); Image (pixel-wise sequence image classification based on the CIFAR-10 dataset); Pathfinder (long-range spatial dependency identification task: the input images contain two small points/circles and dashed-line paths, and the model needs to identify whether the two points/circles are connected).

The average incoherence parameters after 100 training steps with standard errors shown in parentheses.

Dataset | µ_X (rank = 32) | µ_X_smooth (rank = 32) | µ_X (rank = 16) | µ_X_smooth (rank = 16)

The training configurations for Pretraining and GLUE tasks

Availability: //anonymous.4open.

B PROOF OF PROPOSITION 1

A similar result, under a slightly different setting, can be found in (Cai et al., 2021). For the completeness of the paper, we provide a proof here.

First, we resolve the sampling strategy. We consider a clean rank-r matrix X ∈ R^(n×d), i.e., no additive noise and the rank is exact. Without loss of generality, we assume n ≥ d. Provided X is µ-incoherent, the uniform sampling guarantee of (Chiu & Demanet, 2013) applies.

Thirdly, we resolve the error bound estimation. For the noisy matrix X + E, we directly apply (Hamm & Huang, 2021, Corollary 4.3). Thus we obtain the stated bound, where Ĉ and R̂ are sampled from the noisy matrix, Û is the pseudo-inverse of their intersection, and l_C (resp. l_R) is the number of columns (resp. rows) sampled in Ĉ (resp. R̂). Note that this error bound assumes good column and row sampling, i.e., that the clean submatrices corresponding to Ĉ and R̂ can recover X exactly. Therefore, by combining the above two results, we obtain the claim in Proposition 1.
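The exact-recovery property underlying the proof (a clean rank-r matrix is recovered by its sampled columns, rows, and the pseudo-inverse of their intersection) can be checked numerically. This is a generic CUR sketch under the assumption that the sampled rows and columns span the row and column spaces, which holds almost surely for the random low-rank matrix below; sample sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 60, 40, 5
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # exact rank r, no noise

# Sample O(r) rows and columns (a few more than r for robustness).
rows = rng.choice(n, size=2 * r, replace=False)
cols = rng.choice(d, size=2 * r, replace=False)
C = X[:, cols]                              # selected columns
R = X[rows, :]                              # selected rows
U = np.linalg.pinv(X[np.ix_(rows, cols)])   # pseudo-inverse of the intersection

# With a good sample (spanning the row/column spaces), CUR recovers X exactly.
assert np.allclose(X, C @ U @ R, atol=1e-6)
```

With additive noise E, the same construction applied to X + E no longer recovers X exactly, which is precisely the regime handled by the cited corollary.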

C PROOF OF PROPOSITION 2

As f(t) is the convolution of {x_t} and {l_t}, by the definition of convolution, for t = 1, 2, ..., n we have

