SKTFORMER: A SKELETON TRANSFORMER FOR LONG SEQUENCE DATA

Abstract

Transformers have become a preferred tool for modeling sequential data. Many studies on applying Transformers to long-sequence modeling focus on reducing computational complexity. They usually exploit the low-rank structure of the data and approximate a long sequence by a sub-sequence. One challenge with such approaches is striking an appropriate balance between information preservation and noise reduction: the longer the sub-sequence used to approximate the long sequence, the better the information is preserved, but at the price of introducing more noise into the model and, of course, higher computational cost. We propose the skeleton transformer, SKTformer for short, an efficient Transformer architecture that improves upon previous attempts to negotiate this trade-off. It introduces two mechanisms that effectively reduce the impact of noise while keeping the computation linear in the sequence length: a smoothing block that mixes information over long sequences, and a matrix-sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of SKTformer both theoretically and empirically. Extensive studies on the Long Range Arena (LRA) benchmark and six time-series forecasting datasets show that SKTformer significantly outperforms both the vanilla Transformer and other state-of-the-art Transformer variants.

1. INTRODUCTION

Transformer-type models (Vaswani et al., 2017) have achieved many breakthroughs in various areas of artificial intelligence, such as natural language processing (NLP) (Brown et al., 2020; Clark et al., 2020; Devlin et al., 2018; Liu et al., 2019), computer vision (CV) (Dosovitskiy et al., 2020; Liu et al., 2021; Touvron et al., 2021; Yuan et al., 2021; Zhou et al., 2021b), and time series forecasting (Xu et al., 2021; Zhou et al., 2022). The self-attention scheme plays a key role in these Transformer-based models: it efficiently captures long-term global and short-term local correlations when the token sequence is relatively short. Due to the quadratic complexity of standard self-attention, many approaches have been developed to reduce the computational complexity of the Transformer for long sequences (e.g., Zhu et al., 2021). Most of them try to exploit special patterns of the attention matrix, such as low-rankness, locality, sparsity, or graph structure. One group of approaches builds a linear approximation of the softmax operator (e.g., Chen et al., 2021; Choromanski et al., 2020; Chowdhury et al., 2021; Qin et al., 2021). Despite their efficiency, these approximation methods often perform worse than the original softmax-based attention. More discussion of efficient Transformers for long sequences can be found in the related work section. In this work, we focus on approaches that assume a low-rank structure of the input matrix. They approximate the global information in a long sequence by a sub-sequence (i.e., a short sequence) of landmarks, and only compute attention between queries and the selected landmarks (e.g., Ma et al., 2021; Nguyen et al., 2021; Zhu et al., 2021; Zhu & Soricut, 2021). Although these models enjoy linear computational cost and often better performance than the vanilla Transformer, they face one major challenge: how to balance information preservation against noise reduction.
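To make the landmark idea concrete, the following is a minimal NumPy sketch of landmark-based attention. It is an illustration only, not the method of any cited paper: the function name, the uniform random landmark sampling, and the single-head setting are our own simplifying assumptions (real systems typically learn or compute landmarks more carefully).

```python
import numpy as np

def landmark_attention(Q, K, V, m, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V by attending only to
    m landmark rows sampled from K/V. Cost is O(n*m) per layer
    instead of O(n^2). Uniform sampling is a toy choice."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # pick m landmark positions
    K_l, V_l = K[idx], V[idx]                    # (m, d) landmark keys/values
    scores = Q @ K_l.T / np.sqrt(d)              # (n, m) query-landmark scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax over landmarks
    return A @ V_l                               # (n, d) attention output
```

The trade-off discussed above is visible directly in the parameter `m`: a larger `m` retains more of the full attention matrix (better information preservation) at the cost of more computation and more noisy landmark rows entering the output.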
By choosing a larger number of landmarks, we preserve more global information, but at the price of introducing more noise into the sequential model and incurring more computational cost. In this work, we propose an efficient Transformer architecture, termed Skeleton Transformer, or SKTformer for short, that introduces two mechanisms to explicitly address this balance. First,

Code availability: https://anonymous.4open.science/r/SKTFormer-B33B/

