SKTFORMER: A SKELETON TRANSFORMER FOR LONG SEQUENCE DATA

Abstract

Transformers have become a preferred tool for modeling sequential data. Many studies on long-sequence modeling with Transformers focus on reducing computational complexity. They usually exploit the low-rank structure of data and approximate a long sequence by a sub-sequence. One challenge with such approaches is how to strike an appropriate balance between information preservation and noise reduction: the longer the sub-sequence used to approximate the long sequence, the better the information is preserved, but at the price of introducing more noise into the model and, of course, more computational cost. We propose the skeleton transformer, SKTformer for short, an efficient transformer architecture that improves upon previous attempts to negotiate this tradeoff. It introduces two mechanisms to effectively reduce the impact of noise while keeping the computation linear in the sequence length: a smoothing block to mix information over long sequences and a matrix sketch method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of SKTformer both theoretically and empirically. Extensive studies over the Long Range Arena (LRA) benchmark and six time-series forecasting datasets show that SKTformer significantly outperforms both the vanilla Transformer and other state-of-the-art Transformer variants.

1. INTRODUCTION

Transformer-type models (Vaswani et al., 2017) have achieved many breakthroughs in various areas of artificial intelligence, such as natural language processing (NLP) (Brown et al., 2020; Clark et al., 2020; Devlin et al., 2018; Liu et al., 2019), computer vision (CV) (Dosovitskiy et al., 2020; Liu et al., 2021; Touvron et al., 2021; Yuan et al., 2021; Zhou et al., 2021b), and time series forecasting (Xu et al., 2021; Zhou et al., 2022). The self-attention scheme plays a key role in these Transformer-based models: it efficiently captures long-term global and short-term local correlations when the token sequence is relatively short. Due to the quadratic complexity of standard self-attention, many approaches have been developed to reduce the computational complexity of the Transformer for long sequences (e.g., (Zhu et al., 2021)). Most of them try to exploit special patterns in the attention matrix, such as low-rankness, locality, sparsity, or graph structure. One group of approaches builds a linear approximation for the softmax operator (e.g., (Chen et al., 2021; Choromanski et al., 2020; Chowdhury et al., 2021; Qin et al., 2021)). Despite their efficiency, these approximation methods often perform worse than the original softmax-based attention. More discussion of efficient Transformers for long sequences can be found in the related work section. In this work, we focus on approaches that assume a low-rank structure of the input matrix. They approximate the global information in a long sequence by a sub-sequence (i.e., a short sequence) of landmarks, and only compute attention between queries and the selected landmarks (e.g., (Ma et al., 2021; Nguyen et al., 2021; Zhu et al., 2021; Zhu & Soricut, 2021)). Although these models enjoy linear computational cost and often better performance than the vanilla Transformer, they face one major challenge: how to balance information preservation against noise reduction.
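To make the quadratic bottleneck discussed above concrete, the following is a minimal NumPy sketch of standard scaled dot-product self-attention; the dimensions and function names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V have shape (n, d). The (n, n) score matrix is what makes
    the time and memory cost quadratic in the sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): the quadratic bottleneck
    return softmax(scores, axis=-1) @ V  # (n, d)

rng = np.random.default_rng(0)
n, d = 128, 16
X = rng.standard_normal((n, d))
out = self_attention(X, X, X)  # shape (128, 16)
```

Landmark-based methods replace the full (n, n) score matrix with attention between all n queries and a much shorter sequence of m landmarks, reducing the cost to O(nm).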
By choosing a larger number of landmarks, we are able to preserve more global information, but at the price of introducing more noise into the sequential model and incurring more computational cost. In this work, we propose an efficient Transformer architecture, termed Skeleton Transformer, or SKTformer for short, that introduces two mechanisms to explicitly address this balance. First, we introduce a smoothing block into the Transformer architecture. It effectively mixes global information over the long sequence via Fourier analysis and local information over the sequence via a convolution kernel. Through this information mixture, we reduce the noise in individual tokens and, at the same time, improve their representativeness of the entire sequence. Second, we introduce a matrix sketch technique to approximate the input matrix by a small number of rows and columns. Standard self-attention can be seen as reweighing the columns of the value matrix: important columns are assigned high attention weights and remain in the output matrix, while small attention weights eliminate insignificant columns. The self-attention mechanism is equivalent to column selection if we replace the softmax operator with the corresponding argmax operator. However, sampling only columns may not generate a good summary of the matrix and can be sensitive to noise in individual columns. We address this problem by exploiting the CUR (Drineas et al., 2008) or Skeleton approximation technique (Chiu & Demanet, 2013) from the matrix approximation community. Theoretically, for a rank-r matrix X ∈ R^(n×d), we can take O(r log d) column samples and O(r log n) row samples to construct a so-called Skeleton approximation X ≈ CUR, where C and R are matrices consisting of columns and rows of X, respectively, and U is the pseudo-inverse of their intersection.
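The Skeleton/CUR construction described above can be sketched in a few lines of NumPy. The uniform sampling without replacement used here is a simplification for illustration, not the paper's exact sampling scheme:

```python
import numpy as np

def cur_approx(X, c, r, rng):
    """Skeleton (CUR) approximation: X ≈ C @ U @ R.

    C holds c sampled columns of X, R holds r sampled rows of X, and
    U is the pseudo-inverse of their intersection W = X[rows][:, cols].
    """
    n, d = X.shape
    cols = rng.choice(d, size=c, replace=False)
    rows = rng.choice(n, size=r, replace=False)
    C = X[:, cols]                              # (n, c)
    R = X[rows, :]                              # (r, d)
    U = np.linalg.pinv(X[np.ix_(rows, cols)])   # (c, r)
    return C @ U @ R

rng = np.random.default_rng(0)
# Exactly rank-2 matrix: CUR recovers it whenever the sampled
# intersection also has rank 2 (almost surely for Gaussian factors).
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))
X_hat = cur_approx(X, c=20, r=20, rng=rng)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

When the intersection W captures the full rank of X, the approximation is exact; the theory cited above shows that O(r log d) columns and O(r log n) rows suffice with high probability.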
By combining these two mechanisms, we find, both theoretically and empirically, that SKTformer is able to preserve global information over long sequences and reduce the impact of noise simultaneously, leading to better performance than state-of-the-art Transformer variants for long sequences, without sacrificing linear complexity w.r.t. sequence length. In short, we summarize our main contributions as follows: 1. We propose the Skeleton Transformer (SKTformer), an efficient model that integrates a smoother, a column attention component, and a row attention component to unfold a randomized linear matrix sketch algorithm. 2. By randomly selecting a fixed number of rows and columns, the proposed model achieves near-linear computational complexity and memory cost. The effectiveness of this selection method is verified both theoretically and empirically. 3. We conduct extensive experiments on long-sequence modeling, long-term time series forecasting, and GLUE tasks. In particular, on the Long Range Arena benchmark (Tay et al., 2021), SKTformer achieves average accuracies of 64% and 66% with fixed hyperparameters (the setting suggested in Mathieu et al. (2014); Tay et al. (2021)) and fine-tuned hyperparameters, respectively, improving upon the 62% of the best Transformer-type model. Moreover, it performs comparably to recent state-of-the-art models on long-term time series forecasting and GLUE tasks. Organization. We structure the rest of this paper as follows: In Section 2, we briefly review the relevant literature on efficient transformers and Skeleton approximations. Section 3 introduces the model structure and provides a theoretical analysis to justify the proposed model. We empirically verify the efficiency and accuracy of SKTformer in Section 4. We discuss limitations and future directions in Section 5. Technical proofs and experimental details are provided in the appendix.
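The smoothing block in contribution 1 combines global Fourier mixing with local convolutional mixing. The following NumPy sketch illustrates the idea; the low-pass cutoff, the averaging kernel, and the way the two branches are combined are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def smooth(X, keep_frac=0.25, kernel=None):
    """Mix tokens globally (FFT low-pass along the sequence axis)
    and locally (small depthwise 1-D convolution), then average.

    X: (n, d) token matrix; axis 0 is the sequence dimension.
    """
    n, _ = X.shape
    # Global mixing: drop high sequence frequencies, keep the rest.
    F = np.fft.rfft(X, axis=0)
    cutoff = max(1, int(keep_frac * F.shape[0]))
    F[cutoff:] = 0.0
    global_mix = np.fft.irfft(F, n=n, axis=0)
    # Local mixing: short averaging kernel applied per feature column.
    if kernel is None:
        kernel = np.ones(3) / 3.0
    local_mix = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, X)
    return 0.5 * (global_mix + local_mix)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
Y = smooth(X)  # same shape as X, with high-frequency noise attenuated
```

Both branches cost O(n log n) or less per feature, so the smoother does not threaten the model's near-linear overall complexity.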

2. RELATED WORK

This section provides an overview of the literature on efficient Transformer models. The techniques include sparse or local attention, low-rankness, and kernel approximation. We refer the reader interested in further details to the survey (Tay et al., 2020c). Sparse Attention. The general idea of these methods is to restrict each query token to attend only within a specific small region, such as its local neighborhood or a set of global tokens. In this setting, the attention matrix becomes sparse compared to the original one.
Qiu et al. (2019) propose BlockBERT, which introduces sparse block structures into the attention matrix by multiplying with a masking matrix. Parmar et al. (2018) apply self-attention within blocks for the image generation task. Liu et al. (2018) divide a sequence into blocks and use a strided convolution to reduce model complexity. However, these block-type Transformers ignore the connections among blocks. To address this issue, Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2019) propose a recurrence mechanism to connect multiple blocks. Transformer-LS (Zhu et al., 2021) combines local attention with a dynamic projection to capture long-term dependence. (Tay et al.,

Code availability: https://anonymous.4open.science/r/SKTFormer-B33B/

