TOEPLITZ NEURAL NETWORK FOR SEQUENCE MODELING

Abstract

Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks, relying on attention to capture pairwise token relations and on position embedding to inject positional information. While showing good performance, transformer models are inefficient at scaling to long input sequences, mainly due to the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative-position-encoded Toeplitz matrix and use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log-linear. A lightweight sub-network called the relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters, enabling the proposed Toeplitz neural network to deal with varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequence lengths of up to 14K tokens at inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method achieves better performance than its competitors on most downstream tasks while being significantly faster. The code is available at https://github.com/OpenNLPLab/Tnn.

1. INTRODUCTION

Sequence modeling is a fundamental problem in natural language processing, speech processing, and computer vision. Various sequence modeling methods have been proposed in the literature, including recurrent (Hochreiter & Schmidhuber, 1997), convolutional (LeCun et al., 1989), and transformer-based architectures (Vaswani et al., 2017). These models exploit different properties of sequential data. For example, recurrent models (Hochreiter & Schmidhuber, 1997) mimic the sequential nature of the data by processing the input step by step while maintaining hidden states across steps. Convolutional models (LeCun et al., 1989) enforce a locality bias and only allow interactions between elements within local patches. Transformers use attention matrices to model pairwise relations between tokens regardless of the distance between them.

Recently, transformers (Vaswani et al., 2017; Dosovitskiy et al., 2021) have shown strong performance on a wide range of applications across domains and have arguably become one of the most successful architectures for sequence modeling in general. There are two main components in transformers: the attention mechanism, which learns pairwise correlations of tokens from data, and the position embedding, which introduces positional inductive biases. The vanilla attention mechanism requires quadratic space-time complexity, which precludes transformers from handling long sequences. Numerous attention variants have been proposed to reduce this complexity, including linear transformers (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2021). Although the types of attention vary, the position embedding remains in every method, which indicates the importance of position information in sequence modeling. This motivates us to ask the following question: since position information is important, can we design a model that relies entirely on the positional relations of its elements, regardless of their content, thus avoiding the quadratic computation cost of the vanilla attention mechanism?

In this paper, we give an affirmative answer to this question by introducing the Toeplitz neural network (TNN), a new efficient architecture that relies solely on relative positional relations for sequence modeling. Specifically, instead of attention matrices, the Toeplitz neural network uses Toeplitz matrices to capture the relation between each token pair. There are two motivations for choosing the Toeplitz matrix. One is that it compactly represents relative positional relations between tokens with far fewer parameters, i.e., 2n - 1 parameters for an n × n Toeplitz matrix. The other is that the Toeplitz matrix-vector product can be computed in O(n log n) time, which is exactly the operation we use for token mixing (a minimal sketch is given at the end of this section). In this way, we avoid computing content similarities between tokens and effectively reduce the quadratic computation complexity of transformers to log-linear, yielding a more efficient sequence modeling architecture. We further propose the relative position encoder, a lightweight module that generates relative position parameters to assemble the Toeplitz matrices, so that the number of TNN parameters no longer depends on the sequence length. Moreover, it allows TNN to handle varying sequence lengths without retraining. In addition, input sequence length extrapolation has become an important ability in sequence modeling, as training on longer sequences can be prohibitively expensive (Press et al., 2022).
We propose an exponential decay bias that is applied directly to the Toeplitz matrix. Our model maintains consistent performance up to a sequence length of 14K tokens at inference when trained on sequences of 512 tokens. We also show analytically that the Toeplitz neural network is a general form of sequence modeling, from which transformers, CNNs, and the recently proposed state-space-based methods (Gu et al., 2022) can be derived as special cases. We validate our model on a wide range of sequence modeling tasks and benchmarks, including autoregressive language modeling, text classification, image classification, and the Long-Range Arena benchmark. As illustrated in Fig. 1, our model achieves state-of-the-art performance on most tasks at a favorable log-linear space-time complexity. It also demonstrates superior extrapolation capabilities when trained on shorter sequences and evaluated on longer ones off-the-shelf.
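To make the log-linear token mixing concrete, the sketch below shows how an n × n Toeplitz matrix defined by 2n - 1 relative-position coefficients can be multiplied with a vector in O(n log n) time by embedding it into a 2n × 2n circulant matrix and applying the FFT. This is a minimal illustration of the general technique, not the exact implementation in our released code; the optional exponential scaling of the coefficients is likewise only one plausible way to instantiate a decay bias on the Toeplitz matrix.

```python
import torch


def toeplitz_matvec_fft(t_pos, t_neg, x, decay=None):
    """O(n log n) product of an n x n Toeplitz matrix T with a vector x.

    T is never materialised; it is defined by 2n - 1 coefficients:
        T[i, j] = t_pos[i - j]      if i >= j   (main and lower diagonals)
        T[i, j] = t_neg[j - i - 1]  if i <  j   (upper diagonals)
    If `decay` (0 < decay < 1) is given, the coefficient at relative
    offset k is scaled by decay**|k| -- one possible exponential decay bias.
    """
    n = x.shape[-1]
    if decay is not None:
        t_pos = t_pos * decay ** torch.arange(n, dtype=x.dtype)
        t_neg = t_neg * decay ** torch.arange(1, n, dtype=x.dtype)
    # First column of a 2n x 2n circulant matrix whose top-left block is T.
    zero = torch.zeros(1, dtype=x.dtype)
    c = torch.cat([t_pos, zero, t_neg.flip(0)])
    # Circular convolution via FFT; the first n outputs equal T @ x.
    x_pad = torch.cat([x, torch.zeros(n, dtype=x.dtype)])
    y = torch.fft.irfft(torch.fft.rfft(c) * torch.fft.rfft(x_pad), n=2 * n)
    return y[:n]


# Sanity check against an explicitly built Toeplitz matrix (O(n^2)).
n = 6
t_pos, t_neg, x = torch.randn(n), torch.randn(n - 1), torch.randn(n)
T = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        T[i, j] = t_pos[i - j] if i >= j else t_neg[j - i - 1]
assert torch.allclose(T @ x, toeplitz_matvec_fft(t_pos, t_neg, x), atol=1e-5)
```

The same circulant-embedding trick applies per feature channel, so a full token-mixing layer is a batch of such FFT products rather than an explicit n × n matrix multiplication.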

2. PRELIMINARY

In this section, we introduce the concepts used throughout the paper, including positional embedding, token and channel mixing, and the Toeplitz matrix. The notations used can be found in Appendix A.

Positional embedding was introduced in transformers (Vaswani et al., 2017) to inject positional inductive bias. It typically uses fixed or learned parameters to encode position-specific information, thus making the model position-aware. There are two main types of positional embedding: absolute positional embedding (Vaswani et al., 2017) and relative positional embedding (Shaw et al., 2018). In this work, we focus on relative positional embedding to emphasize pairwise token relations.
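As a toy illustration of how relative positions and the Toeplitz structure fit together (this is not the relative position encoder of our model, which generates the coefficients with a small network), the snippet below assigns one learnable coefficient to each relative offset i - j. Gathering by relative offset automatically yields a matrix that is constant along its diagonals, i.e., a Toeplitz matrix parameterized by 2n - 1 values.

```python
import torch

n = 4
# Relative position of every (query, key) pair: R[i, j] = i - j.
rel = torch.arange(n)[:, None] - torch.arange(n)[None, :]
# One coefficient per relative offset in [-(n-1), n-1]: 2n - 1 values in total.
coeffs = torch.nn.Parameter(torch.randn(2 * n - 1))
# Entries on the same diagonal share the same parameter -> Toeplitz matrix.
T = coeffs[rel + (n - 1)]     # shape (n, n)
print(rel)
# tensor([[ 0, -1, -2, -3],
#         [ 1,  0, -1, -2],
#         [ 2,  1,  0, -1],
#         [ 3,  2,  1,  0]])
```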



Figure 1: The left figure shows the training speed (x-axis), performance (y-axis), and GPU memory footprints (circle sizes) of the TNN and competing methods on the Long-Range Arena benchmark. The TNN beats the competitors by a clear margin. The right figure plots the extrapolation results for different sequence lengths, where the x-axis denotes sequence length and the y-axis denotes log PPL. It demonstrates that the PPL of the TNN remains constant regardless of the sequence length.

