WHAT MAKES CONVOLUTIONAL MODELS GREAT ON LONG SEQUENCE MODELING?

Abstract

Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making them unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information based on pair-wise attention scores, but it also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. (2021a) proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieves significant gains over the SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom of several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles, enjoyed by S4, that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with the sequence length. 2) The kernel needs to have a decaying structure such that the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses the previous SoTA on the Long Range Arena and Speech Command datasets. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance.

1. INTRODUCTION

Handling Long-Range Dependency (LRD) is a key challenge in long-sequence modeling tasks such as time-series forecasting, language modeling, and pixel-level image generation. Unfortunately, standard deep learning models fail to solve this problem for different reasons: Recurrent Neural Networks (RNNs) suffer from vanishing gradients, Transformers have complexity quadratic in the sequence length, and Convolutional Neural Networks (CNNs) usually only have a local receptive field in each layer. A recently proposed benchmark called Long Range Arena (LRA) (Tay et al., 2020b) reveals that all existing models perform poorly in modeling LRD. Notably, on one spatial-level sequence modeling task called Pathfinder-X from LRA, all models fail except the recently proposed Structured State Space sequence model (S4) (Gu et al., 2021a).

The S4 model is inspired by the state space model widely used in control theory and can be computed efficiently with a special parameterization based on the Cauchy kernel. The exact implementation of S4 can be viewed as a (depthwise) global convolutional model with an involved computation of the global convolution kernel. Thanks to the global receptive field of the convolution kernel, S4 is able to handle tasks that require LRD, such as Pathfinder (Linsley et al., 2018; Tay et al., 2020b), where classic local CNNs fail (Linsley et al., 2018; Kim et al., 2019). Moreover, the use of the Fast Fourier Transform (FFT) and techniques from numerical linear algebra makes the computational complexity of S4 tractable compared to the quadratic complexity of attention. Together, these properties show the potential of global convolutional models to model LRD, and S4 advances the SoTA on LRA.

Despite its accomplishments, the delicate design of S4 makes it unfriendly even to knowledgeable researchers. In particular, the empirical success of S4 relies on 1) a Diagonal Plus Low-Rank (DPLR) parameterization whose efficient implementation requires several numerical linear algebra tricks, and 2) an initialization scheme based on the HiPPO matrix derived in prior work (Gu et al., 2020). Therefore, aiming to reduce the complications of the model and highlight minimal principles, we raise the following questions: What contributes to the success of the S4 model? Can we establish a simpler model based on minimal principles to handle long-range dependency?

To answer these questions, we focus on the design of the global convolution kernel and extract two simple and intuitive principles that contribute to the success of the S4 kernel. The first principle is that the parameterization of the global convolution kernel should be efficient in terms of the sequence length: the number of parameters should scale slowly with the sequence length. For example, classic CNNs use a fixed kernel size, and S4 likewise uses a fixed number of parameters to compute the convolution kernel, although this number is larger than in classic CNNs; both satisfy the first principle because the number of parameters does not scale with the input length. Efficient parameterization is also necessary because a naive parameterization of a global convolution kernel, with as many parameters as the sequence length, is impractical for inputs with thousands of tokens, and too many parameters would cause overfitting and hurt performance.
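To make the computational side of this concrete, the sketch below shows how a depthwise global convolution, whose kernel is as long as the input, can be applied in O(L log L) time with the FFT. This is a minimal illustration in PyTorch (our choice of framework, not a claim about the S4 codebase); the function name and tensor shapes are ours.

```python
import torch

def fft_global_conv(u, k):
    """Apply a depthwise global convolution via the FFT (illustrative sketch,
    not code from the S4 implementation).

    u: (batch, channels, L) input sequences
    k: (channels, L)        one kernel per channel, as long as the input
    Returns the first L outputs of the linear convolution of u with k.
    The cost is O(L log L) per channel, versus O(L^2) for direct convolution.
    """
    L = u.shape[-1]
    n = 2 * L  # zero-pad so circular FFT convolution equals linear convolution
    u_f = torch.fft.rfft(u, n=n)           # (batch, channels, n // 2 + 1)
    k_f = torch.fft.rfft(k, n=n)           # (channels, n // 2 + 1)
    y = torch.fft.irfft(u_f * k_f, n=n)    # broadcasts over the batch dimension
    return y[..., :L]

# Example: 2 sequences, 4 channels, length 1024.
u = torch.randn(2, 4, 1024)
k = torch.randn(4, 1024)
print(fft_global_conv(u, k).shape)  # torch.Size([2, 4, 1024])
```

The point of the sketch is that applying a length-L kernel is cheap; what needs care is how the L kernel values are produced from far fewer parameters, which is exactly what the two principles constrain.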
The second principle is the decaying structure of the convolution kernel, meaning that the weights for convolving with closer neighbors are larger than those for more distant ones. This structure appears ubiquitously in signal processing, with the well-known Gaussian filter as an example. The intuition is clear: closer neighbors provide a more helpful signal. S4 inherently enjoys this decaying property because of the exponential decay of the spectrum of matrix powers (see Figure 2), and we find this inductive bias improves model performance (see Section 4.1.2).

We show that these two principles are sufficient for designing a global convolutional model that captures LRD well. To verify this, we introduce a class of global convolution kernels with a simple multi-scale structure, as shown in Figure 1. Specifically, we compose the convolution kernel from a sequence of sub-kernels of increasing sizes, yet every sub-kernel is upsampled from the same number of parameters. This parameterization ensures that the number of parameters scales only logarithmically with the input length, satisfying the first principle. In addition, we apply a decaying weight to each scale during the combination step, fulfilling the second principle. We name our method Structured Global Convolution kernels (SGConv).

Empirically, SGConv improves over S4 by more than 1% and achieves SoTA results on the LRA benchmark. On the Speech Command datasets, SGConv achieves comparable results on the ten-class classification task and significantly better results than the previous SoTA on the 35-class classification task. We further show that SGConv is more efficient than S4 and can be used as a general-purpose module in different domains. For example, a hybrid model of classic attention and SGConv shows promising performance.



Figure 1: Illustration of the parameterization used in SGConv (Eq. (1)). The convolution kernel is composed of multi-scale sub-kernels. Parameterization Efficiency. Every larger sub-kernel doubles the size of the previous one, while the same number of parameters is used for every scale, ensuring a logarithmic dependency of the number of parameters on the input length. Decaying. We use a weighted combination of sub-kernels where the weights are decaying, with smaller weights assigned to larger scales.
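Complementing the caption above, here is a minimal sketch of how such a multi-scale kernel can be assembled. It is written in PyTorch and makes two illustrative assumptions that are ours rather than the paper's exact choices: sub-kernels are upsampled by linear interpolation, and the per-scale decay weights are powers of a fixed alpha in (0, 1).

```python
import torch
import torch.nn.functional as F

def sgconv_kernel(params, alpha=0.5, target_len=1024):
    """Assemble a global kernel from multi-scale sub-kernels (illustrative sketch).

    params: (num_scales, d) learnable parameters; every scale uses the same
            number d of parameters, so the total count grows with the number
            of scales rather than with the kernel length (first principle).
    alpha:  decay factor; scale i is weighted by alpha ** i, so weights for
            more distant positions are smaller (second principle).
    """
    num_scales, d = params.shape
    sub_kernels = []
    for i in range(num_scales):
        size = d * 2 ** i                  # sub-kernel sizes double: d, 2d, 4d, ...
        # Upsample the d parameters of scale i to the sub-kernel length
        # (linear interpolation is an assumption made for this sketch).
        k_i = F.interpolate(params[i].view(1, 1, d), size=size,
                            mode='linear', align_corners=False).view(-1)
        sub_kernels.append(alpha ** i * k_i)
    kernel = torch.cat(sub_kernels)        # concatenate scales from near to far
    kernel = F.pad(kernel, (0, max(0, target_len - kernel.numel())))
    return kernel[:target_len]

# Example: 6 scales of 16 parameters each (96 parameters in total) produce a
# kernel covering 16 * (2**6 - 1) = 1008 positions, zero-padded to length 1024.
params = torch.randn(6, 16)
k = sgconv_kernel(params, alpha=0.5, target_len=1024)
print(k.shape)  # torch.Size([1024])
```

As a back-of-the-envelope check of the logarithmic claim: with d parameters per scale and doubling sub-kernel sizes, S scales cover roughly d(2^S - 1) positions, so reaching length L needs about log2(L/d + 1) scales and O(d log L) parameters in total, e.g. 96 parameters instead of roughly 1000 in the example above.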

