DATA CONTINUITY MATTERS: IMPROVING SEQUENCE MODELING WITH LIPSCHITZ REGULARIZER

Abstract

Sequence modeling is a core problem in machine learning, and various neural networks have been designed to process different types of sequence data. However, few attempts have been made to understand the inherent properties of sequence data, neglecting a critical factor that may significantly affect the performance of sequence modeling. In this paper, we theoretically and empirically analyze a generic property of sequence data, i.e., continuity, and connect this property with the performance of deep models. First, we empirically observe that different kinds of models for sequence modeling prefer data with different continuity. Then, we theoretically analyze the continuity preferences of different models in both the time and frequency domains. To further utilize continuity to improve sequence modeling, we propose a simple yet effective Lipschitz Regularizer, which can flexibly adjust data continuity according to model preferences and brings very little extra computational cost. Extensive experiments on various tasks demonstrate that altering data continuity via the Lipschitz Regularizer can largely improve the performance of many deep models for sequence modeling.

1. INTRODUCTION

Sequence modeling is a central problem in many machine learning tasks, ranging from natural language processing (Kenton & Toutanova, 2019) to time-series forecasting (Li et al., 2019). Although simple deep models, like MLPs, can be used for this problem, various architectures have been specially designed to process different types of real-world sequence data, achieving vastly superior performance to simple models. For instance, the vanilla Transformer shows great power in natural language processing (Wolf et al., 2020), and its variant Informer (Zhou et al., 2021) is more efficient in time-series forecasting tasks. A recent work, the Structured State Space sequence model (S4) (Gu et al., 2021), reaches state-of-the-art performance in handling data with long-range dependencies. However, few attempts have been made to understand the inherent properties of sequence data across tasks, neglecting a critical factor that could largely influence the performance of different types of deep models. Such investigations can help us understand what kind of deep model is suitable for a specific task, and are essential for improving deep sequence modeling.

In this paper, we study a generic property of sequence data, i.e., continuity, and investigate how this property connects with the performance of different deep models. Naturally, all sequence data can be treated as discrete samples from an underlying continuous function with time as the hidden axis. Based on this view, we use continuity to describe the smoothness of the underlying function, and further quantify it with Lipschitz continuity. It can then be noticed that different data types have different continuity. For instance, time-series or audio data are more continuous than language sequences, since they are sampled from physically continuous signals evolving through time. Furthermore, we empirically observe that different deep models prefer data with different continuity.
We design a sequence-to-sequence task to show this phenomenon. Specifically, we generate two kinds of input sequences with different continuity, and map them to output sequences using an exponential moving average. Then, we use two different deep models to learn this mapping. Each model has an identical 1D convolution embedding layer and a separate sequence processing module: one uses the S4 model (Gu et al., 2021) and the other uses the vanilla Transformer (Vaswani et al., 2017) (with the same number of layers and hidden dimensions). The results of this experiment are shown in Figure 1. It can be observed that the S4 model achieves significantly better performance with more continuous inputs, while the Transformer performs better with more discrete inputs. Note that, essentially, they are learning the same mapping, only with different data continuity. This clearly shows that different models prefer different data continuity. Inspired by this observation, we hypothesize that it is possible to enhance model performance by changing the data continuity according to these preferences. To make the proposed method simple and applicable to different deep models, we derive a surrogate for the Lipschitz continuity that can be directly optimized, and use it as a regularizer in the loss function.

Figure 1: A sequence-to-sequence task showing that different deep models prefer data with different continuity. The first row shows input and output sequences. We generate input sequences with different continuities (left column: high continuity; right column: low continuity) and learn a mapping function using different models (second row: S4; third row: Transformer). We can see that S4 prefers more continuous sequences, while Transformer prefers more discrete sequences. Adjusting continuity according to the preferences of models with the Lipschitz Regularizer can largely improve their performance. More details of this experiment are in Appendix A.
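The data-generation step of this toy task can be sketched in a few lines. This is our own illustrative reconstruction, not the paper's code; the sequence length, noise scales, and EMA coefficient below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# High-continuity input: a smooth random walk (small, accumulated steps).
smooth = np.cumsum(rng.normal(0.0, 0.05, n))
# Low-continuity input: i.i.d. noise, so adjacent points are unrelated.
rough = rng.normal(0.0, 1.0, n)

def ema(x, alpha=0.9):
    """Map an input sequence to its exponential moving average."""
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = alpha * y[t - 1] + (1 - alpha) * x[t]
    return y

# Both settings learn the same mapping x -> ema(x); only the continuity
# of the inputs differs between the two columns of Figure 1.
target_smooth, target_rough = ema(smooth), ema(rough)
```

Because the mapping is identical in both settings, any performance gap between the two models can only come from the continuity of the inputs.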
We call the proposed surrogate the Lipschitz Regularizer; it depicts data continuity and can also be used to adjust it. We then investigate the continuity property for different models and how to use the Lipschitz Regularizer to change data continuity according to model preference, providing in-depth analyses in both the time and frequency domains. On the one hand, Lipschitz continuity describes the continuity of sequences over time, a feature of the time domain. Here, we investigate two models: the continuous-time model S4, and Informer, which is based on self-attention. As for S4, since its fitting error is bounded in terms of the Lipschitz constant, S4 prefers smoother input sequences with smaller Lipschitz constants. Hence, we make the inputs of S4 layers more continuous by adding the Lipschitz Regularizer to the loss function. Experimental results on the Long Range Arena benchmark demonstrate that the Lipschitz Regularizer can largely improve the performance of the S4 model, especially on tasks with discrete inputs. Conversely, Informer is built upon self-attention, which is designed to process tokenized discrete data, e.g., text, so Informer prefers less continuous sequences. Therefore, we decrease the continuity of input sequences by subtracting the Lipschitz Regularizer from the loss function. Prediction performance and empirical analyses on many time-series tasks demonstrate the superiority of the Lipschitz Regularizer, and we observe the same results on the task shown in Figure 1. On the other hand, in the frequency domain, we find that the Lipschitz Regularizer represents the expectation of the squared frequency of the data's underlying function. Here, we take the ReLU network as the studied case, and theoretically justify that the Lipschitz Regularizer relates to the spectral bias, a phenomenon whereby neural networks tend to prioritize learning low-frequency modes.
We then propose to use the Lipschitz Regularizer by subtracting it from the loss function to mitigate the spectral bias. In this way, neural networks are forced to learn high-frequency parts, and convergence can be accelerated since information in different frequency bands can be learned simultaneously. In summary, the Lipschitz Regularizer can flexibly adjust data continuity for a wide range of deep models that have a preference for data continuity. It improves various models at very little extra computational cost, shedding light on inherent data-property analyses for sequence modeling.

2. RELATED WORK

Deep Neural Networks for Sequence Modeling Sequence modeling plays a critical role in many machine learning problems. Many general architectures, including MLPs (Rahaman et al., 2019), RNNs (Mikolov et al., 2010), and CNNs (Bai et al., 2018), can be used for sequence modeling. Recently, two types of models have shown great power in addressing the challenges of sequence modeling, such as handling complex interactions and long-range dependencies. The first type is self-attention-based models: for instance, the vanilla Transformer (Vaswani et al., 2017), Informer (Zhou et al., 2021), and Performer (Choromanski et al., 2020) show great performance on natural language processing, time-series forecasting, and speech processing, respectively. The second type is continuous-time models, which are built upon the view that inputs are sampled from continuous functions; these include, but are not limited to, Neural ODE (Chen et al., 2018), Lipschitz RNN (Erichson et al., 2020), and the state-space model (Gu et al., 2021). In this paper, we do not aim to propose a novel model as in previous works; instead, we focus on understanding models' intrinsic preferences for input sequences. We show that both types of models have a preference regarding data continuity, and we utilize it to promote their performance.

Lipschitz Continuity of Neural Networks

Lipschitz continuity is a general property of functions, and is widely used for analyzing different kinds of neural networks, including MLPs (Zhang et al., 2021; Gouk et al., 2021), CNNs (Zou et al., 2019), self-attention-based networks (Dasoulas et al., 2021), graph neural networks (Gama et al., 2020), and GANs (Gulrajani et al., 2017). It has become an essential property of neural networks, with uses such as improving adversarial robustness (Meunier et al., 2022) and proving generalization bounds (Sokolić et al., 2017). In this paper, we focus on the Lipschitz continuity of the underlying function of sequence data, using it as a property of the data rather than of the models.

3. LIPSCHITZ CONTINUITY OF SEQUENCE DATA

In this section, we first present the measure we use for the continuity of sequence data, and show how it can serve as a regularizer. We give the definition of the Lipschitz Regularizer here, and leave detailed analyses and its usage in specific models to the rest of the paper. We then provide views of the Lipschitz Regularizer in both the time and frequency domains.

To define the measure, we view inputs as signals: the data points in a sequence are discrete samples of an underlying continuous function at certain time steps. We then compute the Lipschitz constant of the underlying function, which is widely used as a measure of continuity. Specifically, suppose the sequence is $x_0, x_1, \ldots, x_n$, and the underlying function $f$ satisfies $f(t_0) = x_0, f(t_1) = x_1, \ldots, f(t_n) = x_n$, where $t_0, t_1, \ldots, t_n$ are the time steps. Letting $t_0 = 0, t_1 = 1, \ldots, t_n = n$, the Lipschitz constant $L_f$ of $f$ is

$$L_f = \max_{\substack{t_i, t_j \in \{0, 1, \ldots, n\} \\ t_i \neq t_j}} \frac{|f(t_i) - f(t_j)|}{|t_i - t_j|} = \max_{\substack{i, j \in \{0, 1, \ldots, n\} \\ i \neq j}} \frac{|x_i - x_j|}{|i - j|}.$$

By a discrete analogue of the Mean Value Theorem, for any $i, j \in \{0, 1, \ldots, n\}$ with $j - i > 1$, we can always find a unit interval $[k, k+1]$ with $i \le k \le j - 1$ such that $\frac{|x_i - x_j|}{|i - j|} \le |x_{k+1} - x_k|$. Therefore, we have

$$L_f = \max_{\substack{i, j \in \{0, 1, \ldots, n\} \\ i \neq j}} \frac{|x_i - x_j|}{|i - j|} = \max_{k \in \{0, 1, \ldots, n-1\}} |x_{k+1} - x_k|.$$

However, since we would like to adjust continuity according to the preferences of different models, the measure should be easy to optimize, and the max operator makes it hard to pass gradients. To help the optimization process, we design a surrogate by taking the average over all consecutive differences and switching to the squared $\ell_2$ norm. Moreover, since we simply use this surrogate as a regularizer in the loss function to flexibly adjust continuity for various models, we name it the Lipschitz Regularizer, and its formal definition is given as follows. Definition 3.1.
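The reduction above, that the pairwise maximum is always attained on adjacent points, is easy to check numerically. This is our own quick sanity check, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
n = len(x)

# Brute force over all pairs: L_f = max_{i != j} |x_i - x_j| / |i - j|.
pairwise = max(abs(x[i] - x[j]) / abs(i - j)
               for i in range(n) for j in range(n) if i != j)

# The reduction from the text: the maximum over adjacent points only.
adjacent = float(np.max(np.abs(np.diff(x))))

# |x_i - x_j| is at most the sum of the |i - j| adjacent gaps along the
# way, so the pairwise ratio never exceeds the largest adjacent gap.
assert np.isclose(pairwise, adjacent)
```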
(Lipschitz Regularizer) Suppose the sequence is $x_0, x_1, \ldots, x_n$. We define the Lipschitz Regularizer as

$$\mathcal{L}_{Lip} = \frac{1}{n} \sum_{i=0}^{n-1} (x_{i+1} - x_i)^2. \tag{3}$$
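Definition 3.1 translates directly into code. A minimal NumPy sketch (in practice a differentiable framework would be used so the term can enter the loss):

```python
import numpy as np

def lipschitz_regularizer(x):
    """L_Lip = (1/n) * sum_{i=0}^{n-1} (x_{i+1} - x_i)^2 for a sequence
    x_0, ..., x_n (Definition 3.1); n is the number of adjacent pairs."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.diff(x) ** 2) / (len(x) - 1))
```

For example, an alternating sequence scores high while a linear ramp scores low: `lipschitz_regularizer([0, 1, 0, 1])` gives 1.0, while `lipschitz_regularizer(np.linspace(0, 1, 11))` gives approximately 0.01.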

3.1. VIEW LIPSCHITZ REGULARIZER IN TIME AND FREQUENCY DOMAINS

We then provide two views of the Lipschitz Regularizer. On the one hand, it is a feature of sequence data in the time domain, representing the continuity of sequences over time. Thus, it can be used to alter the continuity of the input sequences fed to specific models. As shown in Figure 1, different deep models have different preferences for data continuity. We can use the Lipschitz Regularizer to make the sequences more or less continuous, and thereby improve model performance. An example of increasing continuity to improve a continuous-time model is described in §4.1, and a converse example of decreasing continuity to improve an attention-based model is described in §4.2.

On the other hand, from the frequency perspective, the Lipschitz Regularizer directly relates to the frequency content of the underlying function, and can be used to rebalance modes of different frequencies. Specifically,

$$\sum_{i=0}^{n-1} (x_{i+1} - x_i)^2 \approx \int_{\mathbb{R}} \left( \frac{df(t)}{dt} \right)^2 dt = \int_{\mathbb{R}} |2\pi i \xi|^2 |\hat{f}(\xi)|^2 d\xi = 4\pi^2 C\, \mathbb{E}_{p(\xi)}[\xi^2], \tag{4}$$

where $\xi$ is the frequency variable of the Fourier transform $\hat{f}(\xi) := \int f(x) e^{-i 2\pi \xi x} dx$, and $p(\xi) = |\hat{f}(\xi)|^2 / C$ is the normalized squared magnitude of the Fourier transform, with $C = \int |\hat{f}(\xi)|^2 d\xi$. Details of the derivation are presented in Appendix G.1. Essentially, the Lipschitz Regularizer of sequence data represents the expectation of the squared frequency of the data's underlying function. Besides, previous literature shows that neural networks tend to prioritize learning the low-frequency parts of the target function (Rahaman et al., 2019). We find that the Lipschitz Regularizer can be utilized to mitigate this phenomenon by emphasizing high-frequency parts, which allows the network to fit all spectra simultaneously and results in a faster convergence rate. The details of this discussion are in §5.1.
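The message of Equation (4), that the regularizer measures the expected squared frequency, can be illustrated numerically. The signals and lengths below are our own choices for demonstration:

```python
import numpy as np

def lip_reg(x):
    """L_Lip of Definition 3.1: mean squared consecutive difference."""
    return float(np.sum(np.diff(x) ** 2) / (len(x) - 1))

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
low = np.sin(2.0 * np.pi * 2.0 * t)    # 2 Hz component
high = np.sin(2.0 * np.pi * 20.0 * t)  # 20 Hz component

# L_Lip scales with E[xi^2], so the 20 Hz signal should score roughly
# (20 / 2)^2 = 100x higher than the 2 Hz one.
ratio = lip_reg(high) / lip_reg(low)
```

The observed ratio is close to 100, matching the quadratic frequency dependence predicted by Equation (4).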

4. TIME DOMAIN

In this section, we view the Lipschitz Regularizer in the time domain, and show how it can be used to make the sequence more discrete or more continuous over time, catering to the preferences of different models. Generally, to change the continuity of the input sequence with the Lipschitz Regularizer, we apply it to the output of the embedding layer before the sequence is sent to the model. We describe the details for two different models in the following sections.

4.1. STATE-SPACE MODELS

The S4 layer is built upon the continuous-time state-space model

$$\dot{x} = Ax + Bu, \qquad y = Cx + Du, \tag{5}$$

where $u$ is the input function, $x$ is the hidden state, and $y$ is the output; $A$, $B$, $C$, $D$ are trainable matrices. The critical and essential design in the S4 layer is the transition matrix $A$, which is initialized with the HiPPO matrix. The HiPPO matrix makes the S4 layer optimally remember the history of the input's underlying function, so the S4 model can substantially outperform previous methods on long-range sequence modeling tasks. Particularly, the HiPPO matrix is designed to find the best polynomial approximation of the input's underlying function, given a measure that defines the optimal history and a memory budget, which is the hidden dimension of the model. Each measure corresponds to an optimal HiPPO matrix.

Theoretical Analyses To connect the continuity property with the S4 model, we provide the following intuition here; a formal proposition and its proof are given in Appendix G.2. Generally, the error of the HiPPO-LegS projection decreases when the sequence is more continuous/smooth (Gu et al., 2021). Here, LegS denotes the scaled Legendre measure, which assigns uniform weights to all history. This also holds for S4 layers, since the HiPPO matrix is the most critical design in the S4 layer. However, in many tasks, such as natural language processing, the input sequence is not particularly smooth, which deteriorates the performance of the S4 model.
The Lipschitz Regularizer can be used to solve the above problem, because it can adjust the continuity of sequences. Specifically, since we cannot directly manipulate the underlying function of the input sequence, we add a 1D convolutional layer that does not change the sequence length as an embedding layer before the S4 layer, and apply the Lipschitz Regularizer to the output of the embedding layer as follows:

$$\mathcal{L}(y, \hat{y}, l) = \mathcal{L}_{S4}(y, \hat{y}) + \lambda \mathcal{L}_{Lip}(l), \tag{6}$$

where $y$ is the ground truth, $\hat{y}$ is the output of the S4 model, $l$ is the output of the embedding layer, and $\mathcal{L}_{S4}$ is the original loss of the S4 model. $\lambda$ is a hyperparameter controlling the magnitude of the Lipschitz Regularizer. By using Equation (6) as the loss function, the input of the S4 layers becomes more continuous, so the error of the HiPPO-LegS projection and the S4 layer is reduced, leading to better model performance.

Experiments To demonstrate the effectiveness of the Lipschitz Regularizer, we use a modified version of the Long Range Arena (LRA) (Tay et al., 2020) benchmark with harder tasks. Descriptions of the original LRA are in the Appendix. In addition to the original LRA, we create 3 harder tasks with more discrete sequences. Particularly, we notice that among the 6 original tasks, 3 use pixels as inputs (i.e., Image, Pathfinder, and Path-X), which could be more continuous than the texts in the other 3 tasks. So we design Image-c, Path-c, and PathX-c, in which the contrast of the images is increased, with the degree of increase randomly sampled from 50% to 100% for each image. We test three methods on the modified LRA. The first is the original S4 model (denoted S4). The second is the S4 model with a 1D convolutional embedding layer, with the Lipschitz Regularizer applied to the outputs of the embedding layer (denoted S4 + Emb + Lip). Furthermore, we design a third model to ablate the effect of the extra embedding layer.
Here, we use the S4 model with the same embedding layer as the previous method, but without the Lipschitz Regularizer (denoted S4 + Emb). The hyperparameter λ is chosen from {1, 2, 3, 4, 5} according to performance on the validation set. Results are listed in Table 1.
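As a hedged illustration of the objective in Equation (6), the following sketch combines a task loss with the added regularizer on the embedding output. The MSE term stands in for the S4 model's actual task loss, and all names are ours:

```python
import numpy as np

def lip_reg(x):
    """L_Lip of Definition 3.1: mean squared consecutive difference."""
    return float(np.sum(np.diff(x) ** 2) / (len(x) - 1))

def s4_loss(y_true, y_pred, emb_out, lam=1.0):
    """Equation (6)-style loss: task loss plus lambda * L_Lip on the
    embedding output. The *added* penalty pushes the 1D-conv embedding
    toward smoother sequences, matching S4's continuity preference."""
    task = float(np.mean((y_true - y_pred) ** 2))  # stand-in for L_S4
    return task + lam * lip_reg(emb_out)
```

With identical predictions, a rougher embedding output yields a larger total loss, so gradient descent trades task loss against embedding smoothness.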

4.2. TRANSFORMER-BASED MODELS

In this section, we show that the Lipschitz Regularizer can improve the performance of Transformer-based models when inputs are continuous. In particular, we choose time-series forecasting tasks, whose inputs are highly continuous, and use three Transformer-based models, i.e., the vanilla Transformer (Vaswani et al., 2017), Informer (Zhou et al., 2021), and Autoformer (Wu et al., 2021), to evaluate the effectiveness of the Lipschitz Regularizer. Although these models already perform well on time-series forecasting tasks, given the preference of Transformer-based models for discrete sequences (shown in Figure 1) and the highly continuous inputs, we can still apply the Lipschitz Regularizer to further improve them by decreasing the continuity of the input sequences. Specifically, since all three models have an embedding layer, we directly apply the Lipschitz Regularizer to the output of the embedding layer as follows:

$$\mathcal{L}(y, \hat{y}, l) = \mathcal{L}_{Transformer}(y, \hat{y}) - \lambda \mathcal{L}_{Lip}(l), \tag{7}$$

where $y$ is the ground truth, $\hat{y}$ is the output of the respective model, $l$ is the output of the embedding layer, and $\mathcal{L}_{Transformer}$ is the original loss of the Transformer-based model. $\lambda$ controls the magnitude of the Lipschitz Regularizer. Note that, unlike in the S4 model, here we subtract the Lipschitz Regularizer to make the input more discrete, catering to the model preference. We also show an ablation study for the hyperparameter λ in Figure 3. We observe that (1) MSE increases when λ < 0 and decreases when λ > 0; since positive λ reduces data continuity, we conclude that Informer prefers discrete sequences, and the Lipschitz Regularizer can reduce continuity to cater to this preference; (2) MSEs do not vary much across different positive λ, indicating that the performance improvement is not sensitive to hyperparameter changes.
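For intuition on why subtracting the regularizer decreases continuity: one gradient step on the loss term −λ L_Lip moves the sequence along +∇L_Lip, which grows the consecutive differences. The sketch below uses our own hand-derived gradient, not the paper's code:

```python
import numpy as np

def lip_reg(x):
    """L_Lip of Definition 3.1: mean squared consecutive difference."""
    return float(np.sum(np.diff(x) ** 2) / (len(x) - 1))

def lip_reg_grad(x):
    """Analytic gradient of L_Lip with respect to the sequence values."""
    n = len(x) - 1
    d = np.diff(x)          # d[i] = x[i+1] - x[i]
    g = np.zeros_like(x)
    g[:-1] -= 2.0 * d / n   # x_i enters d[i] with a minus sign
    g[1:] += 2.0 * d / n    # x_{i+1} enters d[i] with a plus sign
    return g

x = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
# Descending on -lam * L_Lip is ascending on L_Lip: the sequence becomes
# less continuous, which is the stated preference of Informer.
x_after = x + 0.1 * lip_reg_grad(x)
```

Because L_Lip is a positive semidefinite quadratic form in the sequence values, any step along its gradient strictly increases it whenever the gradient is nonzero.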

5. FREQUENCY DOMAIN

In this section, we study how continuity affects the performance of deep models from the frequency perspective. We take the ReLU network as the study case, and provide theoretical analyses and experimental results to show the effectiveness of applying the Lipschitz Regularizer to ReLU networks.

Published as a conference paper at ICLR 2023

5.1. RELU NETWORK

A ReLU network $g: \mathbb{R}^d \to \mathbb{R}$ with $L$ hidden layers of widths $d_1, \ldots, d_L$ is defined as

$$g(x) = T^{(L+1)} \circ \sigma \circ T^{(L)} \circ \cdots \circ \sigma \circ T^{(1)}(x),$$

where $T^{(k)}: \mathbb{R}^{d_{k-1}} \to \mathbb{R}^{d_k}$ is an affine function ($d_0 = d$, $d_{L+1} = 1$) and $\sigma$ is the ReLU activation.

Theoretical Analyses In previous literature, Rahaman et al. (2019) showed that the low-frequency part of the sequence data is learned faster by the ReLU network, a phenomenon called the "spectral bias". We claim that the Lipschitz Regularizer can help mitigate the spectral bias. Intuitively, when the Lipschitz constant of the ReLU network increases, we expect the model to learn more information in the high-frequency parts. We provide a formal proposition and its proof for this intuition in Appendix G.3. This inspires us to balance frequency modes by changing the Lipschitz continuity of functions. Besides, suppose we use a ReLU network to learn a sequence-to-sequence mapping, where the values in the input sequence (length $n$) increase linearly over the interval $(0, 1)$ with step size $\frac{1}{n}$, and the output is generated by the mapping function $h(t)$. Note that, since the values in the input sequence increase linearly, the Lipschitz constant of the ReLU network equals that of the output sequence. Therefore, we design the decayed Lipschitz Regularizer as follows:

$$\mathcal{L}(y, \hat{y}) = \mathcal{L}_{MSE}(y, \hat{y}) - \lambda e^{-\epsilon t} \mathcal{L}_{Lip}(\hat{y}), \tag{9}$$

where $y$ is the ground truth generated by $h(t)$, $\hat{y}$ is the prediction, and $\mathcal{L}_{MSE}$ is the MSE loss. $\lambda$ and $\epsilon$ are hyperparameters that control the magnitude and decay rate of the Lipschitz Regularizer, respectively, with $t$ the training step. We further explain why this regularizer mitigates spectral bias from two perspectives. First, by Equation (4), the added term can be seen as a direct penalty on the low-frequency part of the output sequence.
Since the values in the input sequence increase linearly, this is equivalent to penalizing the low-frequency part of the ReLU network, prioritizing the learning of the high-frequency part. From another perspective, Rahaman et al. (2019) claimed that the origin of the spectral bias is the gradually increasing parameter norm, which the Lipschitz Regularizer can intentionally counteract. Specifically, the Fourier components $\hat{g}_\theta(\xi)$ of the ReLU network are bounded by $O(L_g)$, and $L_g$ is bounded by the parameter norm, which can only increase by a small amount during each optimization step. Hence, gradually increasing parameter norms can hinder the learning of high-frequency parts at the early optimization stage. Because the Lipschitz Regularizer can intentionally change $L_g$, subtracting it as in Equation (9) can enlarge the parameter norm, making it possible to optimize both high- and low-frequency parts. This can be seen as a warm-up process in which the parameter norm increases at the beginning of optimization; convergence is then significantly accelerated, since modes of all frequencies can be learned simultaneously after the warm-up.

Experiments We choose a mapping task to evaluate the proposed Lipschitz Regularizer. Specifically, we learn a mapping function whose input is a sequence with linearly increasing values and whose output is a highly periodic sequence. Given frequencies $K = \{k_1, k_2, \ldots, k_n\}$, amplitudes $A = \{a_1, a_2, \ldots, a_n\}$, and phases $\Phi = \{\phi_1, \phi_2, \ldots, \phi_n\}$, the mapping function is defined as

$$h(x) = \sum_{i=1}^{n} a_i \sin(2\pi k_i x + \phi_i).$$

In this experiment, we take $n = 10$, frequencies $K = \{5, 10, \ldots, 45, 50\}$, and amplitudes $A = \{0.1, 0.2, \ldots, 1\}$. The phases are uniformly sampled from 0 to $2\pi$, i.e., $\phi_i \sim U(0, 2\pi)$. The input samples in the sequence are uniformly placed over $(0, 1)$ with $N = 100$ samples, and the output is generated by $h(x)$.
As for the model, we use a 6-layer deep ReLU network with the hidden dimension set to 256 for all layers. To verify the effectiveness of the proposed Lipschitz Regularizer, we train two identical networks with the same training procedure, one with the decayed Lipschitz Regularizer and one without it. The hyperparameters λ ∈ {1, 2, 3, 4, 5} and ϵ ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1} are chosen according to performance on the validation set. We show the frequency content and MSE of the ReLU networks during training in Figure 4. From Figure 4 (a) and (b), we notice that low-frequency parts are learned first in both networks, but with the decayed Lipschitz Regularizer, high frequencies are learned significantly faster. Figure 4 (c) shows that the Lipschitz Regularizer accelerates convergence. We also show predictions of the two models during training in Figure 5, which gives a more intuitive picture, indicating that high frequencies are learned faster when we use the decayed Lipschitz Regularizer to warm up optimization. All results demonstrate that the Lipschitz Regularizer enables almost simultaneous learning of all frequencies, so the spectral bias is relieved and convergence is accelerated.
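The target construction and the decayed regularizer weight from Equation (9) can be sketched as follows. The phase sampling seed, the interpretation of the decay variable as the training step, and the default λ, ϵ values are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

K = np.arange(5, 51, 5)        # frequencies {5, 10, ..., 50}
A = np.arange(1, 11) / 10.0    # amplitudes {0.1, 0.2, ..., 1.0}
PHI = rng.uniform(0.0, 2.0 * np.pi, len(K))  # phases ~ U(0, 2*pi)

def h(x):
    """Target map h(x) = sum_i a_i * sin(2*pi*k_i*x + phi_i)."""
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(A * np.sin(2.0 * np.pi * K * x + PHI), axis=-1)

t = np.linspace(0.0, 1.0, 100, endpoint=False)  # N = 100 input samples
y = h(t)

def decayed_weight(step, lam=1.0, eps=1e-3):
    """lambda * exp(-eps * step): the warm-up coefficient of Equation (9),
    which fades the regularizer out as training proceeds."""
    return lam * np.exp(-eps * step)
```

The decayed weight makes the high-frequency push strongest at the start of training, consistent with the warm-up interpretation above.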

6. SUMMARY

We investigate a generic property of sequence data, i.e., continuity, which is closely related to the performance of different models, and propose the Lipschitz Regularizer to flexibly adjust continuity for various models. We first empirically observe that different deep models prefer different data continuity. Then, in both the time and frequency domains, we provide in-depth theoretical and experimental studies for specific models. For the time domain, we show that the continuous-time model S4 prefers continuous sequences, while the Transformer-based model Informer prefers discrete inputs; we use the Lipschitz Regularizer to adjust data continuity for both, largely improving their performance by catering to their preferences. For the frequency domain, we show that the Lipschitz Regularizer helps mitigate the spectral bias and accelerates convergence for ReLU networks. In general, the Lipschitz Regularizer is applicable to any sequence modeling task and model with a preference for data continuity, and can accordingly facilitate learning for various models at very little computational cost.

We also randomly sample four dimensions, and the corresponding curves are shown in Figures 6, 7, 8, and 9. Our findings and conclusions in the univariate experiment also hold in the multivariate case.

B.2 ANALYSIS FOR THE PATHFINDER TASK

We notice that the Lipschitz Regularizer causes deteriorated performance on the Pathfinder dataset, so here we provide detailed analyses to explore the reason. We show a case from the experiment of the S4 model on the Path dataset in Figure 10. We can see that the performance drop is mainly caused by the embedding layer. As explained in §4.1, since we cannot directly manipulate the underlying function of the input sequence, we add an extra embedding layer before the S4 layer. However, the change from panel (a1) to (a2) shows that this embedding layer may overly and incorrectly blur or even erase some informative shapes in the original picture, causing necessary information to be lost and confusing the model. Although the Lipschitz Regularizer can slightly relieve this issue, the necessary path information is not as obvious as in the original image. The performance of the 3 models (i.e., S4, S4 + Emb, S4 + Emb + Lip) in Table 1 also matches this finding. Moreover, panels (b1), (b2), and (b3) show that when the contrast is increased, these shapes are unlikely to be erased, since their pixels all have high gray values. Hence, the Lipschitz Regularizer can improve model performance on the Path-c task.

In this experiment, we show that the Lipschitz Regularizer can improve the performance of Transformer-based models, including Transformer, Informer, and Autoformer. We use 3 real-world datasets: ETT (Electricity Transformer Temperature), ECL (Electricity Consuming Load), and Weather. ETT has 3 separate datasets, i.e., {ETTh1, ETTh2} at the 1-hour level from two separate counties, and ETTm1 at the 15-minute level. We use multiple prediction window sizes: {24h, 48h, 168h, 336h, 720h, 960h} for ETTh, ECL, and Weather, and {6h, 12h, 24h, 72h, 168h} for ETTm.

C.2 RESULTS OF MULTIVARIATE TIME-SERIES FORECASTING

Results of multivariate time-series forecasting with the three Transformer-based models are presented in Table 5. We can see that the Lipschitz Regularizer improves model performance in most cases, showing that altering data continuity is also helpful for multivariate time-series forecasting tasks. Besides, we observe that the improvement from the Lipschitz Regularizer is slightly less significant than that in univariate time-series forecasting. The reason may be that we apply the same regularizer to the input sequences of all variates, while these sequences may have different continuities and need regularizers with different weights. Future work might add trainable weights to the regularizers of different input sequences. Figure 13 shows a negative case: the model trained with the Lipschitz Regularizer captures a larger decrease than the original data in the 100-200 time range.

D FINE-TUNE SWIN TRANSFORMER WITH LIPSCHITZ REGULARIZER

In this section, we investigate whether the proposed Lipschitz Regularizer can be used to improve large pre-trained models by fine-tuning the embedding layer on downstream tasks. Here, we use the regular setting in which the model is pre-trained on a large image dataset and then fine-tuned on a downstream image classification task. Moreover, considering that Transformer-based models have shown great power in computer vision (Dosovitskiy et al., 2020; Liu et al., 2021), and that these models process images by splitting them into patches and feeding the model a sequence of patch embeddings, we choose a typical model, the Swin Transformer (Liu et al., 2021), for the experiments in this section. Generally, because the input tokens of the Transformer are local image patches, they tend to be continuous, which might not match the preference of the Transformer model. Therefore, we apply the Lipschitz Regularizer to make them more discrete. Specifically, we apply the Lipschitz Regularizer to the outputs of the embedding layer, and change the loss as follows:

$$\mathcal{L}(y, \hat{y}, l) = \mathcal{L}_{Swin}(y, \hat{y}) - \lambda \mathcal{L}_{Lip}(l),$$

where $y$ is the ground truth, $\hat{y}$ is the output of the Swin Transformer, $l$ is the output of the embedding layer, and $\mathcal{L}_{Swin}$ is the original loss of the Swin Transformer. $\lambda$ controls the magnitude of the Lipschitz Regularizer. We use a Swin Transformer pre-trained on ImageNet, and fine-tune it on the image classification task with the Beans dataset (Lab, 2020), containing images of diseased and healthy bean leaves. We only fine-tune the embedding layer and the last linear layer, freezing the other parts of the model. We show the validation accuracy during training in Figure 18 and list the testing accuracy in Table 6. By setting λ to zero, we obtain the baseline result. We can see that model performance is improved by the Lipschitz Regularizer, showing great potential for changing data continuity in the fine-tuning setting and in vision tasks.
Besides, the results also show that setting λ to positive values (i.e., 5 and 1) benefits model performance, while setting it to negative values (i.e., -5 and -1) degrades it. This verifies our intuition that the Transformer-based model prefers more discrete inputs.

E APPLY LIPSCHITZ REGULARIZER TO THE SPEECH CLASSIFICATION TASK

In this section, we show the effectiveness of the Lipschitz Regularizer on a speech classification task by applying it to a Transformer-based model. As discussed in §1, Transformer-based models prefer discrete inputs. However, voice signals are highly continuous, since they are sampled from a continuous physical process at a high sample rate. This inspires us to use the Lipschitz Regularizer to make them more discrete and therefore more preferable for Transformer-based models. Specifically, following the settings of Gu et al. (2021), we investigate the Performer model (Choromanski et al., 2020) on the Speech Commands (SC) dataset (Warden, 2018). We test the Performer on two versions of SC. One is MFCC, where the sequence is pre-processed into standard MFCC features (length 161). The other is Raw, which contains unprocessed signals (length 16000). The Lipschitz Regularizer is applied after the embedding layer, changing the loss as follows:

$$\mathcal{L}(y, \hat{y}, l) = \mathcal{L}_{Per}(y, \hat{y}) - \lambda \mathcal{L}_{Lip}(l),$$

where $y$ is the ground truth, $\hat{y}$ is the output of the Performer model, $l$ is the output of the embedding layer, and $\mathcal{L}_{Per}$ is the original loss of the Performer. $\lambda$ controls the magnitude of the Lipschitz Regularizer.

F NEURAL ODE WITH LIPSCHITZ REGULARIZER

In this section, we apply the Lipschitz Regularizer to the Neural ODE model (Chen et al., 2018) to see its effect on the model. Similar to the state-space model, Neural ODE is also a continuous-time model, which treats the input as samples from a continuous function. We therefore expect the Neural ODE to perform better when the input is more continuous. We adopt the experiment of fitting time series using the latent ODE from the original paper (Chen et al., 2018). Essentially, the neural network is a generative latent-function time-series model that predicts the solution to an ODE, and the input data of this experiment is sampled from a randomly generated ODE with the same generation process as Chen et al. (2018). The network is a variational autoencoder consisting of an RNN encoder and a Neural ODE decoder. To alter the continuity of the input to the Neural ODE, we directly apply the Lipschitz Regularizer to the output of the RNN encoder as follows: $$\mathcal{L}(y, \hat{y}, \tilde{l}) = \mathcal{L}_{\mathrm{ODE}}(y, \hat{y}) - \lambda \mathcal{L}_{\mathrm{Lip}}(\tilde{l}),$$ where y is the ground-truth, ŷ is the output of the model, l̃ is the output of the RNN encoder, and L_ODE is the original loss of the Neural ODE. λ controls the magnitude of the Lipschitz Regularizer. The MSE during training is shown in Figure 19. We can observe that the Neural ODE performs better when the data is more continuous. Predictions of 9 independent runs are presented in Figure 20. We can see that the model fits better when we use the Lipschitz Regularizer to make the inputs more continuous.

G MATHEMATICAL DERIVATIONS

G.1 DERIVATION OF EQUATION (4)

Taking unit time spacing, t_{i+1} − t_i = 1, we have

$$\sum_{i=0}^{n-1}(x_{i+1}-x_i)^2 = \sum_{i=0}^{n-1}\left(\frac{f(t_{i+1})-f(t_i)}{t_{i+1}-t_i}\right)^2 \approx \int_{\mathbb{R}}\left(\frac{df(t)}{dt}\right)^2\,dt = \int_{\mathbb{R}}|2\pi i\xi|^2\,|\hat{f}(\xi)|^2\,d\xi = 4\pi^2\int_{\mathbb{R}}\xi^2|\hat{f}(\xi)|^2\,d\xi = 4\pi^2 C\int_{\mathbb{R}}\xi^2\,\frac{|\hat{f}(\xi)|^2}{C}\,d\xi = 4\pi^2 C\,\mathbb{E}_{p(\xi)}[\xi^2],$$

where the third equality follows from Plancherel's theorem and the Fourier derivative identity $\widehat{f'}(\xi) = 2\pi i\xi\,\hat{f}(\xi)$, and $C = \int_{\mathbb{R}}|\hat{f}(\xi)|^2\,d\xi$ normalizes $|\hat{f}(\xi)|^2$ into the probability density $p(\xi)$.

G.2 CONTINUITY AND THE S4 MODEL

Proposition G.1. Suppose f₁, f₂ : ℝ⁺ → ℝ are two differentiable functions of input sequences, with Lipschitz constants L_f1 and L_f2. The HiPPO matrix with the scaled Legendre measure (LegS) is denoted as HiPPO-LegS. Let the errors of the HiPPO-LegS projections of f₁, f₂ at time t be δ₁(t), δ₂(t), respectively, and let δ̃₁(t) = tL_f1, δ̃₂(t) = tL_f2. For any time t, if L_f1 ≤ L_f2, we have δ₁(t) = O(δ̃₁(t)), δ₂(t) = O(δ̃₂(t)), and δ̃₁(t) ≤ δ̃₂(t).

Proof. By Gu et al. (2020, Proposition 6), the LegS measure, which uniformly weighs all history, has the following property. Suppose the HiPPO-LegS projection of the target function f(t) at time t is p^(t) = proj_t(f); then the error satisfies δ_f(t) = ∥f_{≤t} − p^(t)∥ = O(tL_f/√N), where L_f is the Lipschitz constant of f(t) and the maximum polynomial degree is N − 1. So we have δ₁(t) = O(δ̃₁(t)), δ₂(t) = O(δ̃₂(t)), and δ̃₁(t) ≤ δ̃₂(t). Therefore, the error of the HiPPO-LegS projection decreases with the Lipschitz constant, so with a smaller Lipschitz constant, we expect a smaller projection error.
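The chain of identities above (discrete squared differences approximating the energy of the derivative, which Plancherel's theorem rewrites as a frequency-weighted spectral sum) can be checked numerically. The following NumPy sketch (our own check, not from the paper) compares the discrete-difference energy with 4π²Σξ²|f̂(ξ)|² for a simple band-limited signal on [0, 1):

```python
import numpy as np

# Sample a band-limited signal f on [0, 1): two Fourier modes.
N = 1024
dt = 1.0 / N
t = np.arange(N) * dt
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)

# Left-hand side: discrete squared differences (with periodic wrap),
# scaled so that sum((dx/dt)^2 * dt) approximates the derivative energy.
lhs = np.sum(np.diff(x, append=x[:1]) ** 2) / dt

# Right-hand side: 4*pi^2 * sum(xi^2 * |f_hat(xi)|^2) via the FFT
# (Plancherel: the energy of f' equals the xi^2-weighted spectrum).
c = np.fft.fft(x) / N                 # Fourier coefficients
xi = np.fft.fftfreq(N, d=dt)          # integer frequencies for period 1
rhs = 4 * np.pi ** 2 * np.sum(xi ** 2 * np.abs(c) ** 2)

# Analytic derivative energy: (6*pi)^2/2 + (7*pi)^2/2 = 42.5 * pi^2.
print(lhs, rhs, 42.5 * np.pi ** 2)
```

The two sides agree up to the O((πξΔt)²) error of the finite-difference approximation, which matches the "≈" step in the derivation.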

G.3 CONTINUITY AND THE RELU NETWORK

Proposition G.2. Suppose there are two ReLU networks g_θ1, g_θ2 with identical architectures, whose Lipschitz constants are L₁ and L₂, respectively. Let h₁(ξ) = L₁/∥ξ∥^{n+1} and h₂(ξ) = L₂/∥ξ∥^{n+1}, where ξ is the frequency and ĝθ(ξ) is the Fourier component of g_θ. If L₁ ≤ L₂, we have ĝθ1(ξ) = O(h₁(ξ)), ĝθ2(ξ) = O(h₂(ξ)), and h₁(ξ) ≤ h₂(ξ).



¹ The pre-trained model is acquired from https://huggingface.co/microsoft/swin-base-patch4-window7-224



Figure 2: The Lipschitz constant of the output of the embedding layer during the training process of Informer + Lip. The experiment is the univariate ETTh2 with the prediction window size of 24h.

Figure 4: Evolution of the frequency and MSE of ReLU networks during the training process. In (a) and (b), color indicates the normalized amplitude of the Fourier component at the corresponding frequency, i.e., |ĝθ(k_i)|/A_i. The Lipschitz Regularizer enables faster learning of high frequencies and faster convergence.

Figure 6: Results of the S4 model with high continuity multivariate data for the experiment in the Introduction.

Figure 10: A case in the experiment of the S4 model on the Path dataset. In this task, the model needs to deduce whether two points in the image are connected by a dashed line. (a1) An image randomly sampled from the Path dataset. (b1) The image in (a1) with 100% contrast increased. (a2, b2) Average of the output vector of the embedding layer in a trained S4 + Emb model, with a1 and b1 as the input, respectively. (a3, b3) Average of the output vector of the embedding layer in a trained S4 + Emb + Lip model, with a1 and b1 as the input, respectively.

Figure 11: Univariate forecasting example of Transformer on the Weather dataset with the prediction window size set to 720. Left figure shows the result of the original Transformer (MSE: 0.00933, MAE: 0.07630). Right figure shows the result of the Transformer trained with Lipschitz Regularizer (λ = 1, MSE: 0.00272, MAE: 0.03823).

Figure 13: Univariate forecasting example of Autoformer on the ECL dataset with the prediction window size set to 720. Left figure shows the result of the original Autoformer (MSE: 0.80028, MAE: 0.66756). Right figure shows the result of the Autoformer trained with Lipschitz Regularizer (λ = 5, MSE: 0.91144, MAE: 0.69692).

Figure 17: Multivariate forecasting example of Autoformer on the ETTh1 dataset with the prediction window size set to 720. Left figure shows the result of the original Autoformer (MSE: 0.47690, MAE: 0.49172). Right figure shows the result of the Autoformer trained with Lipschitz Regularizer (λ = 1, MSE: 0.50977, MAE: 0.50879).

Figure 18: Validation accuracy of fine-tuning Swin Transformer in each epoch with different values of λ.

Figure 19: The MSE of the Neural ODE model with Lipschitz Regularizer during training.

Proof. By Rahaman et al. (2019, Theorem 1), for a ReLU network g_θ with parameters θ, its Fourier component is

$$\hat{g}_\theta(\xi) = \sum_{n=0}^{d} \frac{G_n(\theta, \xi)}{\|\xi\|^{n+1}}, \tag{14}$$

where the numerator G_n(θ, ·) : ℝ^d → ℂ is bounded by O(L_g). So we have ĝθ1(ξ) = O(h₁(ξ)), ĝθ2(ξ) = O(h₂(ξ)), and h₁(ξ) ≤ h₂(ξ). Therefore, with a smaller Lipschitz constant, we expect a smaller ĝθ(ξ).

Accuracy of the S4 model and its variant with our proposed Lipschitz Regularizer (S4 + Emb + Lip) on LRA. S4 + Emb is included to ablate the effect of the extra embedding layer. The State Space Model is a classic model in control engineering. Gu et al. (2021) extended it to a deep sequence model and proposed the S4 model. S4 is a continuous-time sequence model that advances SoTA on long-range sequence modeling tasks by a large margin. An S4 layer builds on the continuous state-space formulation x′(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), where u is the input signal, x is the latent state, and y is the output.
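To make the state-space view concrete, here is a minimal NumPy sketch of the bilinear-discretized state-space recurrence x_k = Āx_{k−1} + B̄u_k, y_k = Cx_k + Du_k that continuous-time models like S4 build on. This illustrates only the recurrence with a scalar state; S4's HiPPO initialization, structured parameterization, and FFT-based convolutional view are omitted:

```python
import numpy as np

def ssm_scan(u, A=-1.0, B=1.0, C=1.0, D=0.0, dt=0.1):
    """Run the discretized SSM: x_k = Ab*x_{k-1} + Bb*u_k, y_k = C*x_k + D*u_k.

    Ab and Bb come from the bilinear (Tustin) discretization of
    x'(t) = A x(t) + B u(t) with step size dt. Scalar state for clarity.
    """
    Ab = (1 + dt * A / 2) / (1 - dt * A / 2)   # discretized state coefficient
    Bb = dt * B / (1 - dt * A / 2)             # discretized input coefficient
    x, ys = 0.0, []
    for uk in u:
        x = Ab * x + Bb * uk
        ys.append(C * x + D * uk)
    return np.array(ys)

# Impulse response: with A < 0 the state decays geometrically at rate Ab.
y = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(y)
```

For a stable system (A < 0), the impulse response decays smoothly, which is one way to see why such models favor continuous inputs: the recurrence is an implicit smoothing of the input history.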

Table 1, and we have the following observations. (1) Our method (i.e., S4 + Emb + Lip) significantly outperforms the other methods in almost all tasks, especially those with discrete inputs, such as Text and Retrieval. Improved performance on Image-c, Path-c, and PathX-c shows that the Lipschitz Regularizer can mitigate the influence of increased contrast. These results demonstrate the effectiveness of the Lipschitz Regularizer, indicating that it can make the input sequences of the S4 layer more continuous and better cater to the preference of the S4 model. (2) Comparing the results of S4 on Image/Path(X) and Image-c/Path(X)-c, we observe that the performance of the S4 model degenerates with increasing image contrast. The cause is the decreased continuity, which goes against the preference of the S4 model, verifying that the S4 model indeed prefers continuous inputs. (3) Adding only the extra embedding layer (S4 + Emb) decreases accuracy in 4 out of 7 tasks, indicating that the improvements come from the Lipschitz Regularizer rather than the extra layer. Besides, this extra embedding layer is also the main reason for the performance drop on the Path and PathX datasets. In Appendix B.2, the visualization of the output vector of the embedding layer shows that this layer may overly and incorrectly blur or even erase some informative shapes in the original picture, losing necessary information and confusing the model.

Experiments We use 5 datasets in this experiment; their descriptions are in Appendix C.1. Evaluation metrics are Mean Square Error (MSE) and Mean Absolute Error (MAE). The hyperparameter λ is chosen from {1, 2, 3, 4, 5, 6, 7, 8} according to the best performance on the validation set. Results of Transformer, Informer, Autoformer, and these models with the Lipschitz Regularizer (denoted as Transformer + Lip, Informer + Lip, and Autoformer + Lip) are shown in Table 2, and results of the multivariate experiments are in Appendix C.2. We can see that the models with the Lipschitz Regularizer generally outperform the original models on most tasks. This indicates that Transformer-based models prefer discrete sequences, and that reducing input continuity with the Lipschitz Regularizer can be helpful for them. We also note that the Lipschitz Regularizer is more effective on the vanilla Transformer than on models with specialized designs for time-series forecasting. This indicates that the vanilla Transformer is more sensitive to data continuity, and that the special designs in Informer and Autoformer may mitigate this sensitivity.

Univariate time-series forecasting results of 3 Transformer-based models and the same models trained with the Lipschitz Regularizer (indicated by + Lip). Note that in this table, prediction window sizes are converted to the lengths of sequences used in the model.

Results of the experiment in the Introduction.

Results of the experiment in the Introduction running with multivariate data.

Multivariate time-series forecasting results of 3 Transformer-based models and the same models trained with the Lipschitz Regularizer (indicated by + Lip). Note that in this table, prediction window sizes are converted to the lengths of sequences used in the model.

Results of fine-tuning Swin Transformer with Lipschitz Regularizer on an image classification task. Test accuracy with different λ is reported.

Results of the Performer model with the Lipschitz Regularizer on the Speech Commands dataset. Test accuracy for MFCC and Raw speech data is reported. Results are shown in Table 7. The column with λ = 0 represents the baseline. The performance of the Performer model is improved by the Lipschitz Regularizer, which further verifies our claim that Transformer-based models prefer discrete inputs.

ACKNOWLEDGMENTS

The authors would like to thank Yifei Shen and Yansen Wang for their helpful discussions and insights. The authors also want to thank our reviewers for providing all the valuable feedback and suggestions.

APPENDIX

We present a sequence-to-sequence task in the Introduction section and show more details here. In this experiment, we generate two types of input sequences with different continuity (each with 1000 samples), and map them to outputs with the exponential moving average h(t), where x1, x2, ..., xN is the input sequence and w is the window size (set to 50). We choose the exponential moving average because it is a sequence-to-sequence mapping that makes use of contextual information. The Lipschitz constants of the input and output sequences are shown in Table 3. Note that a high Lipschitz constant represents low continuity, while a low Lipschitz constant represents high continuity. Then, we train the S4 model and the Transformer model with the generated input and output sequences. Each model has a 1D convolution embedding layer with kernel size 5, stride 1, and padding 2. Both the Transformer and S4 have 1 separate layer with the hidden dimension set to 16. We also apply the Lipschitz Regularizer to the output of the embedding layer and train the models again. The MSE of these 4 models is shown in Table 3. We observe that S4 performs better with continuous inputs while the Transformer is better with discrete inputs. Also, the Lipschitz Regularizer can improve the performance of S4 and the Transformer by changing the data continuity into their preferred one.
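As an illustration of this setup, here is a small NumPy sketch (our own, assuming a standard recursive EMA with smoothing factor derived from the window size w; the paper's exact h(t) formula is not reproduced here) that generates a low-continuity input, applies an exponential moving average, and estimates the discrete Lipschitz constant as the largest jump between consecutive samples:

```python
import numpy as np

def ema(x, w=50):
    """Recursive exponential moving average with smoothing alpha = 2/(w+1).

    NOTE: this is a standard EMA definition assumed for illustration; the
    paper's h(t) may parameterize the window w differently.
    """
    alpha = 2.0 / (w + 1)
    h = np.empty_like(x)
    h[0] = x[0]
    for i in range(1, len(x)):
        h[i] = alpha * x[i] + (1 - alpha) * h[i - 1]
    return h

def lipschitz_const(x):
    """Discrete Lipschitz constant: largest jump between adjacent samples."""
    return float(np.max(np.abs(np.diff(x))))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)   # low-continuity (jumpy) input
h = ema(x, w=50)

# Averaging smooths the sequence: the output's Lipschitz constant drops.
print(lipschitz_const(x), lipschitz_const(h))
```

This mirrors the pattern in Table 3: the EMA target has a much lower Lipschitz constant (higher continuity) than a jumpy input, since each output step is bounded by alpha times the deviation of the new sample from the running average.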

A.2 MULTIVARIATE

We repeat the above experiment with multivariate data. Specifically, we also generate high- and low-continuity input sequences with dimension 16 (each with 1000 samples). The input sequences are

