WHAT MAKES CONVOLUTIONAL MODELS GREAT ON LONG SEQUENCE MODELING?

Abstract

Convolutional models have been widely used in multiple domains. However, most existing models use only local convolution, making them unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information based on pair-wise attention scores, but it also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. (2021a) proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieves significant gains over the SoTA on several long-range tasks. Despite its empirical success, the design of S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom of several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with the sequence length. 2) The kernel needs to have a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance across several tasks: 1) With faster speed, SGConv surpasses the previous SoTA on the Long Range Arena and Speech Commands datasets. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance.
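
To make the two principles concrete, the sketch below (ours, not the paper's actual SGConv implementation) builds a global kernel whose parameter count grows only logarithmically with the sequence length and whose magnitude decays with distance, then applies it with the FFT in O(L log L) time. The names make_decaying_global_kernel and global_conv_fft and the hyperparameters sub_dim and alpha are illustrative assumptions rather than the paper's notation.

import numpy as np

def make_decaying_global_kernel(seq_len, sub_dim=16, alpha=0.5, seed=0):
    # Build a length-`seq_len` kernel from O(log seq_len) parameters:
    # each sub-kernel has `sub_dim` parameters, is interpolated to a span
    # that doubles at every scale, and is scaled by alpha**i so that
    # weights acting on more distant positions are smaller (decaying structure).
    rng = np.random.default_rng(seed)
    pieces, i = [], 0
    while sum(len(p) for p in pieces) < seq_len:
        span = sub_dim if i == 0 else sub_dim * 2 ** (i - 1)
        params = rng.standard_normal(sub_dim)                       # parameters of scale i
        upsampled = np.interp(np.linspace(0.0, sub_dim - 1, span),  # stretch to its span
                              np.arange(sub_dim), params)
        pieces.append(alpha ** i * upsampled)                       # decaying magnitude
        i += 1
    return np.concatenate(pieces)[:seq_len]

def global_conv_fft(x, kernel):
    # Causal global convolution y[t] = sum_s kernel[s] * x[t - s], computed in
    # O(L log L) time; zero-padding to 2L avoids circular wrap-around.
    L = len(x)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(kernel, n), n)
    return y[:L]

# Usage: a kernel as long as the input, built from only ~7 * sub_dim parameters when L = 1024.
L = 1024
x = np.random.randn(L)
k = make_decaying_global_kernel(L)
print(k.shape, global_conv_fft(x, k).shape)   # (1024,) (1024,)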

1. INTRODUCTION

Handling Long-Range Dependency (LRD) is a key challenge in long-sequence modeling tasks such as time-series forecasting, language modeling, and pixel-level image generation. Unfortunately, standard deep learning models fail to solve this problem for different reasons: Recurrent Neural Networks (RNNs) suffer from vanishing gradients, Transformers have complexity quadratic in the sequence length, and Convolutional Neural Networks (CNNs) usually have only a local receptive field in each layer. A recently proposed benchmark called Long-Range Arena (LRA) (Tay et al., 2020b) reveals that all existing models perform poorly at modeling LRD. Notably, on one spatial-level sequence modeling task from LRA called Pathfinder-X, all models fail except a new Structured State Space sequence model (S4) (Gu et al., 2021a). The S4 model is inspired by the state space model widely used in control theory and can be computed efficiently with a special parameterization based on the Cauchy kernel. The exact implementation of S4 can be viewed as a (depthwise) global convolutional model whose global convolution kernel requires an involved computation. Thanks to the global receptive field of the convolution kernel, S4 is able to handle tasks that require LRD, such as Pathfinder (Linsley et al., 2018; Tay et al., 2020b), where classic local CNNs fail (Linsley et al., 2018; Kim et al.,

