TOEPLITZ NEURAL NETWORK FOR SEQUENCE MODELING

Abstract

Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks; they rely on attention to capture pairwise token relations and on position embedding to inject positional information. Despite their good performance, transformer models scale poorly to long input sequences, mainly because of the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative position encoded Toeplitz matrix and use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log linear. A lightweight sub-network called the relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters, enabling the proposed Toeplitz neural network to deal with varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method achieves better performance than its competitors in most downstream tasks while being significantly faster. The code is available at https://github.com/OpenNLPLab/Tnn.

1. INTRODUCTION

Sequence modeling is a fundamental problem in natural language processing, speech processing, and computer vision. Various sequence modeling methods have been proposed in the literature, including recurrent architectures (Hochreiter & Schmidhuber, 1997), convolutional architectures (LeCun et al., 1989), and transformers (Vaswani et al., 2017). These models exploit different properties of sequential data. For example, recurrent models (Hochreiter & Schmidhuber, 1997) mimic the sequential property by processing the input step by step while maintaining hidden states. Convolutional models (LeCun et al., 1989) enforce a locality bias and only let elements within local patches interact. Transformers use attention matrices to model pairwise relations between tokens regardless of the distance between them. Recently, Transformers (Vaswani et al., 2017; Dosovitskiy et al., 2021) have shown strong performance on a wide range of applications across domains and have arguably become one of the most successful architectures for sequence modeling in general. There are two main components in transformers: the attention mechanism, which learns pairwise correlations of tokens from data, and the position embedding, which introduces positional inductive biases. The vanilla attention mechanism requires quadratic space-time complexity, which precludes Transformers from handling long sequences. Numerous attention variants have been proposed recently to reduce the complexity, including linear transformers (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2021). Although the types of attention vary, the position embedding remains in every method, which indicates the importance of position information in sequence modeling.
This motivates us to ask the following question: since position information is important, can we design a model that relies entirely on the position information of its elements regardless of their content, thus alleviating the quadratic computation cost of the vanilla attention mechanism? In this paper, we give an affirmative answer to this question by introducing the Toeplitz neural network (TNN), a new efficient architecture that solely exploits relative position relations for sequence modeling. Specifically, instead of attention matrices, the Toeplitz neural network uses Toeplitz matrices to capture relations between each token pair. There are two motivations for selecting the Toeplitz matrix. One is that it compactly represents relative positional relations between tokens with far fewer parameters, i.e., $2n-1$ parameters for an $n \times n$ Toeplitz matrix. The other is that the Toeplitz matrix-vector product can be computed in $O(n \log n)$ time, which is exactly what we use in our token mixing operation. In this way, we avoid computing content similarities between tokens and reduce the quadratic computation complexity of transformers to log linear, yielding a more efficient sequence modeling architecture. We further propose the relative position encoder, a lightweight module that generates the relative position parameters used to assemble the Toeplitz matrices, so that the number of the TNN's parameters no longer depends on the sequence length. Moreover, it allows the TNN to deal with varying sequence lengths without retraining. In addition, input sequence length extrapolation has become an important ability in sequence modeling, as training on longer sequences can be prohibitively expensive (Press et al., 2022). We propose an exponential decay bias that is applied directly to the Toeplitz matrix. Our model achieves consistent performance up to a sequence length of 14K tokens at inference when trained on sequences of 512 tokens.
We also show analytically that the Toeplitz neural network represents a general form of sequence modeling methods, with transformers, CNNs, and the recently proposed State-space-based methods (Gu et al., 2022) as its special forms. We validate our model on a wide range of sequence modeling tasks and benchmarks. These include autoregressive language modeling, text classification, image classification, and the Long-Range Arena benchmark. As illustrated in Fig. 1, our model achieves state-of-the-art performance on most tasks at a favorable log linear space-time complexity. It also demonstrates superior extrapolation capabilities when trained on shorter sequences and evaluated on longer ones off-the-shelf.

2. PRELIMINARY

In this section, we introduce concepts used throughout the paper, including positional embedding, token and channel mixing, and the Toeplitz matrix. Notations used can be found in Appendix A.
A typical relative positional embedding (Raffel et al., 2020) is formulated as:
$$e_{ij} = q_i^\top k_j / \sqrt{d} + w_{i-j}, \tag{1}$$
where $i, j$ are two positional indices and $e_{ij}$ denotes the attention score before softmax. The $q_i$ and $k_j$ represent the queries and keys in the attention, and $w_{i-j}$ is a positional coefficient. In this case, the relative position information is added to the attention as a bias.
Token and channel mixing are used by Yu et al. (2022) to refer to the two main procedures in sequence modeling. Token mixing refers to the process of mixing information between token pairs, and channel mixing to that of mixing information between feature channels. In Transformers, given the attention matrix $A \in \mathbb{R}^{n \times n}$ and the token matrix $X \in \mathbb{R}^{n \times d}$, the attention operation $AX$ can be regarded as a token mixing process, and the FFN module is used for channel mixing. Researchers often classify sequence modeling techniques by the token mixing technique used. MLP-based methods (Liu et al., 2021; Tolstikhin et al., 2021) use matrix multiplication on the sequence dimension for token mixing. FFT-based methods (Lee-Thorp et al., 2022) use the FFT on the sequence dimension to mix token-wise information. State-space-based methods (Gu et al., 2022) use state equations and hidden states to model sequences and to perform interactions between tokens.
A Toeplitz matrix is a special form of matrix that has constant values along each diagonal running from left to right, i.e., $T_{ij} = T_{i+1,j+1} = t_{i-j}$, $T \in \mathbb{R}^{n \times n}$. A Toeplitz matrix has two nice properties: 1) an $n \times n$ Toeplitz matrix can be described with only $2n-1$ parameters; 2) the Toeplitz matrix-vector product is faster than the standard matrix-vector product. In particular, we have:
Theorem 2.1. For a Toeplitz matrix $T \in \mathbb{R}^{n \times n}$ and any vector $x \in \mathbb{R}^n$, the time complexity of $Tx$ is $O(n \log n)$.
We provide a detailed proof in Appendix B. This property enables us to use Toeplitz matrices to perform efficient token mixing.
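The first property above can be illustrated with a minimal NumPy sketch. The function name `build_toeplitz` and the storage order of the coefficients (from $t_{-(n-1)}$ to $t_{n-1}$) are assumptions made for this example only; they are not taken from the released code.

```python
import numpy as np

def build_toeplitz(coeffs: np.ndarray) -> np.ndarray:
    """Assemble an n x n Toeplitz matrix from its 2n - 1 coefficients.

    coeffs is assumed to hold t_{-(n-1)}, ..., t_0, ..., t_{n-1} in that
    order, so t_k is stored at index k + (n - 1).
    """
    n = (len(coeffs) + 1) // 2
    i = np.arange(n)[:, None]          # row index
    j = np.arange(n)[None, :]          # column index
    return coeffs[(i - j) + (n - 1)]   # T[i, j] = t_{i-j}

n = 4
coeffs = np.arange(-(n - 1), n, dtype=float)  # t_k = k, for illustration only
T = build_toeplitz(coeffs)
print(T)
```

Every diagonal of `T` is constant, so the $n^2$ entries are fully determined by the $2n-1$ stored values.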

3. TOEPLITZ NEURAL NETWORK

In this section, we present the design and analysis of the proposed Toeplitz Neural Network (TNN). We first give an overview of the overall structure of our model and then describe each of its components. We also discuss the connection between the TNN and other sequence modeling methods at the end of this section.

3.1. THE OVERALL ARCHITECTURE

Our model consists of a stack of Gated Toeplitz Units (GTU) and GLU (Shazeer, 2020). GTU is a modified GLU layer injected with the proposed Toeplitz Neural Operator (TNO), as illustrated in Fig. 2. A TNO is used to perform token mixing with a Toeplitz matrix. To generate relative position coefficients for the Toeplitz matrix, we propose a Relative Position Encoder (RPE), a lightweight fully-connected sub-network to encode the relative position information. An exponential decay bias is also added to the Toeplitz matrix to enable extrapolation on longer inputs.

3.2. TOEPLITZ NEURAL OPERATOR

Here, we show how to use a Toeplitz matrix to represent relative positional information. Let $i, j$ be two positions in a 1D sequence. Using the relative position embedding in Eq. 1, we can define a Toeplitz matrix $T \in \mathbb{R}^{n \times n}$, where $T_{ij} = t_{i-j}$. Specifically, given a sequence $x$ of $n$ tokens, $x = [x_0, x_1, \ldots, x_{n-1}]^\top \in \mathbb{R}^n$, we use a scalar $t_{i-j}$ to represent the relative position coefficient between $x_i$ and $x_j$. Then a Toeplitz matrix $T \in \mathbb{R}^{n \times n}$ can be formed by gathering $t_{i-j}$ for every token pair:
$$T = \begin{bmatrix} t_0 & t_{-1} & \cdots & t_{-n+1} \\ t_1 & t_0 & \ddots & \vdots \\ \vdots & \ddots & t_0 & t_{-1} \\ t_{n-1} & \cdots & t_1 & t_0 \end{bmatrix} \in \mathbb{R}^{n \times n}.$$
Let us define a token mixing operation as:
$$y = Tx \in \mathbb{R}^n, \tag{4}$$
where $y$ is the token mixing result. For any $d$-dimensional sequence, the token mixing is performed on each dimension individually. By Theorem 2.1, the computation complexity of Eq. 4 is $O(n \log n)$. As we need to perform token mixing on $d$ dimensions, our TNO has a computation complexity of $O(nd \log n)$.
A natural follow-up question is how to calculate the relative position coefficients in $T$. A naive solution is to make the coefficients learnable parameters, so that the model can learn them directly from training data. However, this solution has some drawbacks:
1) Parameter explosion. For a $d$-dimensional sequence of $n$ tokens, there are a total of $(2n-1)d$ learnable parameters, which can be prohibitively large as $n$ increases. It also shows unsatisfactory performance in our ablation studies in Sec. 4.3.
2) Fixed input sequence length. Since the sequence length $n$ is fixed in training, we are unable to adjust the sequence length during inference, i.e., there is a severe performance drop when the sequence length changes.
To address these drawbacks, we propose a relative position encoder to generate the relative position coefficients.
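The per-channel token mixing of Eq. 4 and its $O(nd \log n)$ cost can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `tno_naive` and `tno_fft`, and the split of the coefficients into `t_pos` ($t_0, \ldots, t_{n-1}$) and `t_neg` ($t_{-1}, \ldots, t_{-(n-1)}$), are hypothetical choices made for this example.

```python
import numpy as np

def tno_naive(t_pos, t_neg, x):
    """Reference O(n^2 d) token mixing: y[:, c] = T_c @ x[:, c] per channel c."""
    n, d = x.shape
    y = np.zeros((n, d))
    for c in range(d):
        T = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                k = i - j
                # t_pos[k] stores t_k (k >= 0); t_neg[-k - 1] stores t_k (k < 0)
                T[i, j] = t_pos[k, c] if k >= 0 else t_neg[-k - 1, c]
        y[:, c] = T @ x[:, c]
    return y

def tno_fft(t_pos, t_neg, x):
    """O(n d log n) token mixing via a length-2n circular convolution."""
    n, d = x.shape
    # First column of the 2n x 2n circulant embedding, per channel:
    # [t_0, ..., t_{n-1}, t_0, t_{-(n-1)}, ..., t_{-1}]
    col = np.concatenate([t_pos, t_pos[:1], t_neg[::-1]], axis=0)  # (2n, d)
    spec = np.fft.rfft(col, axis=0) * np.fft.rfft(x, n=2 * n, axis=0)
    return np.fft.irfft(spec, n=2 * n, axis=0)[:n]                 # keep first n rows

rng = np.random.default_rng(0)
n, d = 8, 3
t_pos = rng.normal(size=(n, d))
t_neg = rng.normal(size=(n - 1, d))
x = rng.normal(size=(n, d))
y_fast = tno_fft(t_pos, t_neg, x)
y_ref = tno_naive(t_pos, t_neg, x)
print(np.allclose(y_fast, y_ref))
```

The FFT version never materializes the $n \times n$ matrices, which is where the memory saving comes from in addition to the time saving.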

3.3. RELATIVE POSITION ENCODER

We illustrate the network structure of our RPE in Fig. 2; it is a fully connected network with $K$ layers. The input of the network is a scalar, i.e., one of the relative positions $-(n-1), \ldots, (n-1)$, $\forall n \in \mathbb{N}^+$, and the output is a $d$-dimensional vector, which is used to assemble the Toeplitz matrix. In this case, the number of the TNN's parameters no longer depends on the input sequence length, and the TNN has the flexibility to deal with various sequence lengths in the inference stage. Note that recent literature (Mildenhall et al., 2021) claims that projecting the scalar input to a higher dimensional space with high frequency functions, i.e., sin and cos functions, before passing it to a network can lead to better performance. However, in our ablations, we find that using the original integer achieves better performance.

Exponential decay bias

Previous models (Vaswani et al., 2017; Qin et al., 2022) often use a fixed sequence length in both training and inference. If we need to run inference on a longer sequence, the model needs to be retrained on the longer sequence length to maintain performance, which can be prohibitively expensive in practice. ALiBi (Press et al., 2022) shows that by applying a simple penalty to the query-key attention scores, the Transformer can handle longer sequence lengths in inference without compromising performance. The penalty is a linear bias that is proportional to the distance between tokens. Inspired by this technique, we propose an exponential decay bias that is applied directly to the Toeplitz matrix to achieve the same goal. Specifically, let us define a decay rate $\lambda \in [0, 1]$; the new relative position coefficients $\bar{t}_{i-j}$ in $T$ can be expressed as:
$$\bar{t}_{i-j} = \lambda^{|i-j|} t_{i-j}.$$
ALiBi can be seen as a special case of our method. Given the equation of ALiBi:
$$\bar{s}_{ij} = q_i^\top k_j / \sqrt{d} + m|i-j|, \quad \exp(\bar{s}_{ij}) = \exp(q_i^\top k_j / \sqrt{d}) \exp(m|i-j|),$$
and
$$s_{ij} = q_i^\top k_j / \sqrt{d}, \quad \lambda \triangleq \exp(m), \tag{7}$$
we have:
$$\exp(\bar{s}_{ij}) = \exp(s_{ij}) \lambda^{|i-j|}. \tag{8}$$
This means that ALiBi applies an exponential decay to the softmax attention matrices, whereas ours applies it to the Toeplitz matrices.
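A minimal NumPy sketch of the two ideas in this section follows: a small fully connected network that maps each integer relative position to $d$ coefficients, and the decay factor $\lambda^{|i-j|}$ applied to its output. The layer sizes, the ReLU activation, the random initialization, and the names `rpe` and `decayed_coeffs` are all assumptions made for illustration; the text above only fixes the scalar input, the $d$-dimensional output, and the $K$-layer fully connected structure.

```python
import numpy as np

def rpe(positions, weights, biases):
    """Map each scalar relative position to d coefficients with a small MLP."""
    h = positions[:, None].astype(float)        # (2n-1, 1): raw integer input
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)          # ReLU: an assumed activation
    return h @ weights[-1] + biases[-1]         # (2n-1, d)

def decayed_coeffs(n, weights, biases, lam=0.99):
    """Return lam^{|i-j|} * t_{i-j} for i - j = -(n-1), ..., n-1."""
    k = np.arange(-(n - 1), n)
    return (lam ** np.abs(k))[:, None] * rpe(k, weights, biases)

rng = np.random.default_rng(0)
sizes = [1, 16, 16, 4]                          # hypothetical: K = 3 layers, d = 4
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

coeffs_short = decayed_coeffs(8, weights, biases)    # shape (15, 4)
coeffs_long = decayed_coeffs(32, weights, biases)    # shape (63, 4), same parameters

# Numerical check of Eq. 7-8: a linear bias m|i-j| inside the exponential
# equals a multiplicative decay lambda^{|i-j|} with lambda = exp(m).
m, s, dist = -0.05, 0.3, 4
alibi_lhs = np.exp(s + m * dist)
alibi_rhs = np.exp(s) * np.exp(m) ** dist
```

The same `weights` and `biases` serve both sequence lengths, which is the sense in which the parameter count is independent of $n$; the coefficients for the shorter length are simply the central slice of those for the longer one.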

3.4. RELATION TO OTHER SEQUENCE MODELING MODELS

In this section, we show the relationship between our model and other sequence modeling models such as Transformers (Vaswani et al., 2017), CNNs (LeCun et al., 1989), and the State space (Gu et al., 2022). We also compare the theoretical space-time complexity of our model with previous sequence modeling models in Table 1.

Transformers

A Transformer with relative position embedding can be expressed as:
$$O = \mathrm{Softmax}(QK^\top / \sqrt{d} + T)V.$$
Comparing it with Eq. 4, the TNN can be regarded as an attention-free transformer, i.e., one that removes $Q$, $K$, and the Softmax while keeping only the relative position matrix $T$.

CNNs

A convolutional layer can be viewed as a Toeplitz matrix of a special structure. Consider a 1D convolution:
$$y = h * x, \quad y_i = \sum_{j=0}^{i} h_{i-j} x_j, \quad h \in \mathbb{R}^m,\ x \in \mathbb{R}^n,\ y \in \mathbb{R}^{n+m-1}. \tag{10}$$
Let us define a Toeplitz matrix $T \in \mathbb{R}^{(n+m-1) \times (n+m-1)}$ and a zero-padded input $z$:
$$T_{st} = \begin{cases} h_{s-t}, & 0 \le s-t \le m-1,\ 0 \le t \le n-1 \\ 0, & \text{otherwise} \end{cases}, \quad z = \begin{bmatrix} x \\ 0_{m-1} \end{bmatrix} \in \mathbb{R}^{n+m-1}.$$
Then:
$$y = Tz \in \mathbb{R}^{n+m-1}. \tag{12}$$
Therefore, a 1D CNN can be viewed as a special case of the TNN with a zero-padded input. For better illustration, we provide a matrix form of the CNN operation in Appendix C.1.
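The CNN case can be checked numerically with a short NumPy sketch. The kernel is indexed so that $y_s = \sum_t h_{s-t} x_t$, which is the indexing of the convolution sum in the text and of `numpy.convolve` in its default full mode. The helper name `conv_toeplitz` and the example values are hypothetical.

```python
import numpy as np

def conv_toeplitz(h, n):
    """(n+m-1) x (n+m-1) Toeplitz matrix whose product with the zero-padded
    input reproduces the full 1D convolution h * x."""
    m = len(h)
    size = n + m - 1
    T = np.zeros((size, size))
    for s in range(size):
        for t in range(n):               # columns t >= n stay zero
            if 0 <= s - t <= m - 1:
                T[s, t] = h[s - t]
    return T

h = np.array([1.0, 2.0, 3.0])            # kernel, m = 3
x = np.array([4.0, 5.0, 6.0, 7.0])       # input, n = 4
n, m = len(x), len(h)
z = np.concatenate([x, np.zeros(m - 1)]) # zero-padded input
y_toeplitz = conv_toeplitz(h, n) @ z
y_conv = np.convolve(h, x)               # full convolution, length n + m - 1
print(np.allclose(y_toeplitz, y_conv))
```

Only the first $n$ columns of the matrix are populated, and each holds a shifted copy of the kernel, which is the special structure referred to above.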

State space

The equation of the State space can be expressed as:
$$u_i = A u_{i-1} + B x_i, \quad y_i = C u_i, \quad A \in \mathbb{R}^{h \times h},\ B \in \mathbb{R}^{h \times 1},\ C \in \mathbb{R}^{1 \times h},\ i = 1, \ldots, n,$$
where $x_i$ is the input, $y_i$ is the output, and $u_i$ is the intermediate state. According to Gu et al. (2022), the output of the State space is:
$$y_i = \sum_{j=0}^{i} k_{i-j} x_j, \quad k = \left(CB, CAB, \ldots, CA^{n-1}B\right)^\top \in \mathbb{R}^n.$$
Let us define the Toeplitz matrix $T \in \mathbb{R}^{n \times n}$:
$$T_{ij} = \begin{cases} k_{i-j}, & i \ge j \\ 0, & i < j \end{cases}. \tag{15}$$
Then:
$$y = Tx, \quad x \in \mathbb{R}^n,\ y \in \mathbb{R}^n. \tag{16}$$
In this case, the State space can be regarded as a special form of the TNN whose coefficients are calculated by the State space. We also provide the matrix form in Appendix C.2 for better illustration.
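The equivalence between the recurrence and the lower-triangular Toeplitz product can be verified with a small NumPy sketch. It assumes a zero initial state and zero-based positions, so that the first output is $CBx_0$; the function names and the randomly drawn $A$, $B$, $C$ are hypothetical and serve only as a numerical check.

```python
import numpy as np

def ssm_recurrent(A, B, C, x):
    """Run u_i = A u_{i-1} + B x_i, y_i = C u_i from a zero initial state."""
    u = np.zeros(A.shape[0])
    ys = []
    for xi in x:
        u = A @ u + B[:, 0] * xi
        ys.append(C[0] @ u)
    return np.array(ys)

def ssm_kernel(A, B, C, n):
    """Return k = (CB, CAB, ..., CA^{n-1}B)."""
    k, v = [], B[:, 0].copy()
    for _ in range(n):
        k.append(C[0] @ v)   # C A^p B for p = 0, 1, ...
        v = A @ v
    return np.array(k)

def causal_toeplitz(k):
    """Lower-triangular Toeplitz matrix with T[i, j] = k[i - j] for i >= j."""
    n = len(k)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(i >= j, k[np.clip(i - j, 0, n - 1)], 0.0)

rng = np.random.default_rng(0)
hdim, n = 4, 6
A = 0.3 * rng.normal(size=(hdim, hdim))
B = rng.normal(size=(hdim, 1))
C = rng.normal(size=(1, hdim))
x = rng.normal(size=n)

y_rec = ssm_recurrent(A, B, C, x)
y_toep = causal_toeplitz(ssm_kernel(A, B, C, n)) @ x
print(np.allclose(y_rec, y_toep))
```

The zero upper triangle reflects causality: each output depends only on the current and earlier inputs.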

4. EXPERIMENT

We compare our method to four kinds of sequence modeling methods: attention-based, MLP-based, FFT-based, and State-space-based methods. In particular, we select the following methods:
• Attention-based: Vanilla transformer (Vaswani et al., 2017), Transformer-LS (Zhu et al., 2021), FLASH (Hua et al., 2022), 1+elu (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), cosFormer (Qin et al., 2022).
• MLP-based: gMLP (Liu et al., 2021), Synthesizer (Random), Synthesizer (Dense) (Tay et al., 2021).
• FFT-based: FNet (Lee-Thorp et al., 2022), GFNet (Rao et al., 2021), AFNO (Guibas et al., 2021).
• State-space-based: S4 (Gu et al., 2022), DSS (Gupta et al., 2022), GSS (Mehta et al., 2022).
We evaluate our methods on WikiText-103 (Merity et al., 2017) for autoregressive language modeling and input length extrapolation, and on the GLUE benchmark (Wang et al., 2018) for bidirectional language modeling. We also validate the accuracy and efficiency of our methods in handling long-range dependencies on the Long-Range Arena benchmark (Tay et al., 2020). To demonstrate the robustness of our model, we implement it in the DeiT (Touvron et al., 2021) structure and compare its performance with the vanilla DeiT (Touvron et al., 2021) on ImageNet-1K (Deng et al., 2009) for image classification.

4.1. SETTING

We implement our models in PyTorch (Paszke et al., 2019) and train them on 8 V100 GPUs. We adopt the same training configuration for all competitors, including batch size, learning rate, training epochs/updates, etc. More detailed hyper-parameters are listed in Appendix D. For autoregressive language modeling, all models are trained on the WikiText-103 dataset (Merity et al., 2017) for 50K steps with a learning rate of 0.005. We use perplexity (PPL) as the evaluation metric. For bidirectional language modeling, we choose the RoBERTa (Liu et al., 2019) model as the base model structure for all methods. All models are pre-trained on WikiText-103 (Merity et al., 2017) for 50K steps with a learning rate of 0.005 and fine-tuned on the GLUE dataset (Wang et al., 2018). We try learning rates of 1e-5, 3e-5, 6e-5, and 1e-4 and choose the best result after fine-tuning for 3 epochs. For the Long-Range Arena benchmark, we adopt the same experimental configurations as Skyformer (Chen et al., 2021). We ensure that the performance and efficiency of all methods are obtained with a similar parameter size and the same training hyperparameters. For image classification on the ImageNet-1k dataset, we adopt the DeiT (Touvron et al., 2021) network structure and replace the transformer layers with our model.

4.2. RESULTS

Autoregressive language modeling

Autoregressive language modeling is a crucial task that requires models to estimate a causal probability distribution given the previously seen tokens. In Table 2, we compare the proposed TNN with competing sequence modeling models. First, compared to existing MLP-based methods, TNN performs better by a clear margin on both the val and test sets. Transformer-based methods are currently the dominant sequence modeling methods. Although the Transformer is a strong baseline that adopts a standard self-attention module with quadratic complexity, TNN still outperforms it on both the val and test sets. In addition, TNN achieves better results than most efficient transformers, including FLASH, 1+elu, Performer, and cosFormer. Finally, compared with recently emerging State-space-based sequence modeling methods, TNN achieves superior performance to all competing methods. This supports the effectiveness of our method in causal models. Further, we also compare the extrapolation capabilities of each method. In Figure 1, we show that our method outperforms all other methods and is comparable to ALiBi (Press et al., 2022). Complete results can be found in Appendix 15.

Bidirectional language modeling

We benchmark bidirectional modeling methods on the GLUE datasets in Table 3. TNN achieves competitive results across all tasks. Further, it is worth noting that TNN boosts the results on CoLA by a significant margin, showing its ability to reason over logistic information from sequences. This demonstrates the effectiveness of TNN in bidirectional language modeling.

Long-Range Arena benchmark

As shown in Table 4, we compare TNN with competing methods across five tasks of the LRA benchmark. The results before the Transformer-LS are taken from Skyformer (Chen et al., 2021). As demonstrated, TNN achieves the best scores on three tasks and second place on the remaining two tasks. In terms of overall results, TNN outperforms all other competing methods, including S4 (Gu et al., 2022). For speed comparison, we compare the training speed of the TNN with other methods in Table 5. For a fair and comprehensive comparison, we follow exactly the same configurations as Skyformer (Chen et al., 2021) and report steps per second under different sequence lengths. Timing is conducted on an Nvidia A6000 GPU with 48G GPU memory.

Image modeling

We report classification results on the ImageNet-1k dataset in Table 6. As shown, under similar parameter sizes, TNN achieves better results than DeiT-Tiny and comparable results with DeiT-Small. This demonstrates the capability of our method in encoding visual signals.

4.3. ABLATION STUDY

Network structure configuration

We ablate different structure configurations on the autoregressive language modeling task in Table 7. We consider three options of configuration: the GTU+GLU, GTU only, and attention+GLU. We empirically find that the GTU+GLU one achieves better performance than other options and choose it as our structure in TNN.

Input of relative position encoder

In Table 8, we ablate different RPE inputs on language modeling. (-(n-1),...,(n-1)) denotes that we feed 2n-1 constants into the RPE. (-(n-1),...,(n-1))/n denotes normalized constants. The sin, cos denotes the absolute position embedding method used in (Vaswani et al., 2017). We empirically find that using the original integers as the input for the RPE leads to better performance.

Relative position encoder

There are two ways to generate relative position coefficients for the Toeplitz matrix. One is to set these coefficients as learnable parameters and allow TNN to learn them from data. The other is to use our proposed RPE network to generate these coefficients. We compare these two strategies in Table 9. The TNN with our RPE network achieves an improvement of 2.47 PPL in language modeling.

Exponential decay rate

We ablate different exponential decay rates on language modeling in Table 10. We train these model variants with a fixed sequence length of 512, test them on a series of sequence lengths from 512 to 14336, and compute the average PPL. When there is no exponential decay, the model fails to extrapolate to longer sequence lengths. We also test our model with a learnable decay rate, but it does not show better performance. We empirically select 0.99 as the exponential decay rate in our method.

Table 5: Speed comparison on the Long-Range Arena benchmark. We mark it with a dash if a method exhausts GPU memory. The higher the better for all metrics. The 1K,...,5K represent the input sequence length.

5. CONCLUSION

In this paper, we propose the Toeplitz neural network, a new efficient architecture that relies entirely on relative positional information for sequence modeling. The proposed model enjoys a favorable log linear space-time complexity. Thanks to the proposed relative position encoder and exponential decay techniques, the Toeplitz neural network generalizes to long sequences with a fixed budget of parameters while obtaining consistently better performance than competing methods across multiple challenging tasks, including language modeling, image modeling, and sequence modeling on long inputs, i.e., the Long-Range Arena benchmark. The Toeplitz neural network is also a generic sequence modeling approach, which renders various popular architectures, such as Transformers, CNNs, and State-space-based methods, as its special forms, offering a unified view for sequence modeling.

B PROOF OF THEOREM

In this section, we prove Theorem 2.1. Before doing that, let us first introduce the circulant matrix and the Toeplitz matrix:
Definition B.1. A matrix $C \in \mathbb{R}^{n \times n}$ is a circulant matrix if and only if $C_{ij} = c_{(i-j+n) \bmod n}$, i.e.,
$$C = \begin{bmatrix} c_0 & c_{n-1} & c_{n-2} & \cdots & c_1 \\ c_1 & c_0 & c_{n-1} & \cdots & c_2 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ c_{n-2} & \cdots & c_1 & c_0 & c_{n-1} \\ c_{n-1} & c_{n-2} & \cdots & c_1 & c_0 \end{bmatrix} \in \mathbb{R}^{n \times n}.$$
Definition B.2. A matrix $T \in \mathbb{R}^{n \times n}$ is a Toeplitz matrix if and only if $T_{ij} = t_{i-j}$, i.e.,
$$T = \begin{bmatrix} t_0 & t_{-1} & t_{-2} & \cdots & t_{-n+1} \\ t_1 & t_0 & t_{-1} & \cdots & t_{-n+2} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ t_{n-2} & \cdots & t_1 & t_0 & t_{-1} \\ t_{n-1} & t_{n-2} & \cdots & t_1 & t_0 \end{bmatrix} \in \mathbb{R}^{n \times n}.$$
Based on the definition, we can give a key lemma:
Lemma B.3. A circulant matrix $C \in \mathbb{R}^{n \times n}$ is orthogonally equivalent to the diagonal matrix $\Lambda$; in particular, the orthogonal matrix $F$ is an $n \times n$ DFT matrix:
$$C = F^\top \Lambda F, \quad \Lambda = \mathrm{diag}\{F[c_0, c_1, \ldots, c_{n-1}]^\top\} \in \mathbb{R}^{n \times n}, \quad F_{st} = \exp\left(\frac{2\pi s t\, i}{n}\right), \quad i^2 = -1.$$
The proof can be found in (Gray et al., 2006). Based on this, we can prove a key lemma:
Lemma B.4. For a vector $x \in \mathbb{R}^n$ and a circulant matrix $C \in \mathbb{R}^{n \times n}$, the matrix multiplication $Cx$ can be done in $O(n \log n)$ time.
Proof of Lemma B.4. Because $F$ and $F^\top$ are DFT matrices, $Fx$ and $F^\top x$ can be computed in $O(n \log n)$ time (Bracewell & Bracewell, 1986). Since $\Lambda$ is a diagonal matrix, $\Lambda x$ can be computed in $O(n)$ time; note that its diagonal elements $F[c_0, c_1, \ldots, c_{n-1}]^\top$ can also be computed in $O(n \log n)$ time. Therefore,
$$Cx = F^\top \Lambda F x = F^\top (\Lambda (F x)) \tag{20}$$
can be computed in $O(n \log n)$ time.
Based on this, we can prove Theorem 2.1:
Proof of Theorem 2.1. We first embed the Toeplitz matrix $T \in \mathbb{R}^{n \times n}$ into a circulant matrix $C \in \mathbb{R}^{2n \times 2n}$:
$$c_k = \begin{cases} t_k, & 0 \le k \le n-1 \\ t_0, & k = n \\ t_{k-2n}, & n+1 \le k \le 2n-1 \end{cases},$$
i.e., $C_{ij} = c_{(i-j+2n) \bmod 2n}$, $C \in \mathbb{R}^{2n \times 2n}$. Using the notation of block matrices, we can define:
$$C = \begin{bmatrix} C_1 & C_2 \\ C_3 & C_4 \end{bmatrix} \in \mathbb{R}^{2n \times 2n}, \quad C_s \in \mathbb{R}^{n \times n},\ s = 1, 2, 3, 4, \quad C_1 = T.$$
For the vector $x \in \mathbb{R}^n$, let us define:
$$x_1 = \begin{bmatrix} x \\ 0_n \end{bmatrix} \in \mathbb{R}^{2n},$$
so:
$$C x_1 = \begin{bmatrix} C_1 & C_2 \\ C_3 & C_4 \end{bmatrix} \begin{bmatrix} x \\ 0_n \end{bmatrix} = \begin{bmatrix} C_1 x \\ C_3 x \end{bmatrix} = \begin{bmatrix} Tx \\ C_3 x \end{bmatrix} \in \mathbb{R}^{2n},$$
therefore:
$$\begin{bmatrix} I_n & 0_{n \times n} \end{bmatrix} C x_1 = \begin{bmatrix} I_n & 0_{n \times n} \end{bmatrix} \begin{bmatrix} Tx \\ C_3 x \end{bmatrix} = Tx.$$
Note that:
• Computing $C x_1$ has a time complexity of $O(2n \log(2n)) = O(n \log n)$.
• $\begin{bmatrix} I_n & 0_{n \times n} \end{bmatrix} C x_1$ is equivalent to selecting the first $n$ rows of $C x_1$; the time complexity is $O(n)$.
So the total time complexity is $O(n \log n)$.
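The construction in the proof can be followed step by step in NumPy. This is a sketch under stated assumptions rather than the authors' implementation: the coefficient layout (from $t_{-(n-1)}$ to $t_{n-1}$), the function names, and the use of `numpy.fft.fft` and `numpy.fft.ifft` for the DFT steps are choices made for this example. The inverse transform carries the $1/(2n)$ normalization internally.

```python
import numpy as np

def circulant_first_column(coeffs):
    """Build c_0, ..., c_{2n-1} from t_{-(n-1)}, ..., t_{n-1} as in the proof."""
    n = (len(coeffs) + 1) // 2
    t = lambda k: coeffs[k + n - 1]          # t_k lookup
    c = np.empty(2 * n)
    for k in range(2 * n):
        if k <= n - 1:
            c[k] = t(k)
        elif k == n:
            c[k] = t(0)
        else:
            c[k] = t(k - 2 * n)
    return c

def toeplitz_matvec_fft(coeffs, x):
    """Compute T x in O(n log n): embed, multiply in Fourier space, truncate."""
    n = len(x)
    c = circulant_first_column(coeffs)
    x1 = np.concatenate([x, np.zeros(n)])    # x padded with n zeros
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x1)).real
    return y[:n]                             # first n rows of C x1

rng = np.random.default_rng(0)
n = 5
coeffs = rng.normal(size=2 * n - 1)
x = rng.normal(size=n)

# Dense matrices, built only to check the block structure C_1 = T.
idx = np.arange(n)
T = coeffs[(idx[:, None] - idx[None, :]) + n - 1]
c = circulant_first_column(coeffs)
idx2 = np.arange(2 * n)
C = c[(idx2[:, None] - idx2[None, :]) % (2 * n)]

y_fft = toeplitz_matvec_fft(coeffs, x)
print(np.allclose(C[:n, :n], T), np.allclose(y_fft, T @ x))
```

The value placed at $c_n$ never affects the result, because it only multiplies the zero-padded half of $x_1$.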

C MATRIX FORM OF SEQUENTIAL MODELS

In this section, we give the matrix form of some sequence models mentioned in section 3.4.

C.1 CNN

The matrix form of the CNN mentioned in Eq. 10 is:
$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n+m-2} \end{bmatrix} = \begin{bmatrix} h_0 & 0 & \cdots & 0 & 0 \\ h_1 & h_0 & \ddots & \vdots & \vdots \\ h_2 & h_1 & \ddots & 0 & 0 \\ \vdots & h_2 & \ddots & h_0 & 0 \\ h_{m-2} & \vdots & \ddots & h_1 & h_0 \\ h_{m-1} & h_{m-2} & \ddots & \vdots & h_1 \\ 0 & h_{m-1} & \ddots & h_{m-3} & \vdots \\ 0 & 0 & \ddots & h_{m-2} & h_{m-3} \\ \vdots & \vdots & \ddots & h_{m-1} & h_{m-2} \\ 0 & 0 & 0 & \cdots & h_{m-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix} \in \mathbb{R}^{n+m-1}.$$

C.2 STATE SPACE

The Toeplitz matrix mentioned in Eq. 15 is:
$$T = \begin{bmatrix} k_0 & 0 & 0 & \cdots & 0 \\ k_1 & k_0 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ k_{n-1} & \cdots & k_2 & k_1 & k_0 \end{bmatrix} \in \mathbb{R}^{n \times n}.$$

D CONFIGURATIONS

G VISUALIZATION

In this section, we visualize the TNN. In particular, we choose the Toeplitz matrices used in RoBERTa for visualization.



We re-run the S4 experiments with the new configuration to match the number of parameters. For the sake of completeness, we also compare TNN with S4 at the original size of S4, using the suffix "-Large", in Table 14, which validates our ability to encode long sequences.



Figure 1: The left figure shows the training speed (x-axis), performances (y-axis), and GPU memory footprints (circle sizes) of the TNN and competing methods on Long-Range Arena benchmark. The TNN beats the competitors with a clear margin. The right figure plots the extrapolation results with different sequence lengths, where the x-axis denotes sequence lengths, and the y-axis denotes log PPL. It demonstrates that regardless of the sequence length, the PPL of the TNN remains constant.

Figure 2: Network structure overview of the proposed Toeplitz Neural Network. The proposed sequence modeling block is composed of a Gated Toeplitz Unit and a GLU (Shazeer, 2020). We propose the TNO to perform token mixing with only relative position information. We use a small fully-connected network named RPE to encode relative position information.

Figure 3: Visualization of the Toeplitz matrix used by each layer in RoBERTa. Each element of the matrix represents the interaction between tokens. The Toeplitz matrices show similar behaviors to conventional transformer attention matrices, where the diagonal concentrates the most attention. (i) Layer 1. (ii) Layer 2. (iii) Layer 3.

Table 1: Comparison of theoretical space-time complexity of several models. Parallel indicates whether parallel training is possible, n indicates the sequence length, d indicates the feature dimension, and e indicates the CNN kernel size. Only 1D CNNs are listed here.

Table 2: Performance comparison of autoregressive language modeling on the Wikitext-103 dataset. The best result is highlighted in bold and the second in underline. ↓ means lower is better. Attn stands for Attention, Ss stands for State space, Trans stands for Transformer, LS stands for Transformer-LS.

Table 3: Performance comparison of bidirectional sequence modeling on the GLUE benchmark. MNLI is reported by the match/mismatch splits. MRPC is reported by F1 score. CoLA is reported by Matthews correlation coefficient. All the other tasks are measured by accuracy. The best result is highlighted in bold and the second in underline. The larger the better for all metrics. "-" means unconverted. Attn stands for Attention, Ss stands for State space, Trans stands for Transformer, LS stands for Transformer-LS.

Table 4: Performance comparison on the Long Range Arena benchmark. We use bold and underline to highlight the best and the second result of each task respectively. The proposed TNN achieves the best performances and outperforms all competing methods.

Table 6: Performance comparison of image classification on the ImageNet-1k dataset.

Performances

Results comparison with different RPE inputs.

Performances comparison of TNN with and without RPE. RPE brings an improvement in language modeling.

Ablation of exponential decay rates in input length extrapolation. The model variants are trained on a fixed sequence length of 512 and tested on a series of sequence lengths ranging from 512 to 14336. We compute the average PPL for all sequence lengths.

Mathematical notations used in the paper.

Detailed training configurations used in our experiments. "Total batch size" means batch per gpu × update freq × num gpus. "ALM" stands for Autoregressive Language Model. "BLM" stands for Bidirectional Language Model. "IM" stands for Image Modeling.

Detailed model configurations used in our experiments.

Performances Comparison on the Long Range Arena benchmark. We use bold and underline to highlight the best and the second result of each task respectively. The proposed TNN achieves the best performances and outperforms all competing methods.

The extrapolation performance of each method. The best result is highlighted in bold and the second in underline. ↓ means lower is better.

