DEEP LATENT STATE SPACE MODELS FOR TIME-SERIES GENERATION

Abstract

Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time series. In addition to the high computational overhead of explicitly computing the hidden-state recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions, which are common in many real-world systems, due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets a new state of the art for continuous-time latent generative models, with significant improvement in mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being ×100 faster than other baselines on long sequences.

1. INTRODUCTION

Time series are a ubiquitous data modality, and find extensive application in weather (Hersbach et al., 2020), engineering disciplines, biology (Peng et al., 1995), and finance (Poli et al., 2019). The main existing approaches for deep generative learning of temporal data can be broadly categorized into autoregressive (Oord et al., 2016), latent autoencoder (Chen et al., 2018; Yildiz et al., 2019; Rubanova et al., 2019), normalizing flow (de Bézenac et al., 2020), generative adversarial (Yoon et al., 2019; Yu et al., 2022; Brooks et al., 2022), and diffusion (Rasul et al., 2021) models. Among these, continuous-time methods (often based on underlying ODEs) are the preferred approach for irregularly-sampled sequences because they can make predictions at arbitrary time steps and can handle sequences of varying lengths. Unfortunately, existing ODE-based methods (Rubanova et al., 2019; Yildiz et al., 2019) often fall short in learning models for real-world data (e.g., with stiff dynamics) due to their limited expressivity and numerical instabilities during backward gradient computation (Hochreiter, 1998; Niesen & Hall, 2004; Zhuang et al., 2020).

A natural way to increase the flexibility of ODE-based models is to increase the dimensionality of their (deterministic) hidden states. However, that leads to quadratic scaling in the hidden dimensionality due to the need to explicitly compute hidden states by unrolling the underlying recurrence over time, thus preventing scaling to long sequences. An alternative approach to increasing modeling capacity is to incorporate stochastic latent variables into the model, a highly successful strategy in generative modeling (Kingma & Welling, 2013; Chung et al., 2015; Song et al., 2020; Ho et al., 2020). However, this incurs additional computational cost, and existing models like latent neural ODE models (Rubanova et al., 2019) inject stochasticity only at the initial condition of the system.
In contrast, we introduce LS4, a latent generative model where the sequence of latent variables is represented as the solution of linear state space equations (Chen, 1984). Unrolling the recurrence equation shows an autoregressive dependence in the sequence of latent variables, the joint distribution of which is highly expressive in representing time series distributions. The high-dimensional structure of the latent space, being equivalent to that of the input sequence, allows LS4 to learn expressive latent representations and fit the distribution of sequences produced by a family of dynamical systems, a common setting resulting from non-stationarity. We further show how LS4 can learn the dynamics of stiff dynamical systems (Shampine & Thompson, 2007) where previous methods fail to do so. Inspired by recent works on deep state space models, or stacks of linear state spaces and non-linearities (Gu et al., 2020; 2021), we leverage a convolutional kernel representation to solve the recurrence, bypassing any explicit computation of hidden states via the recurrence equation, which ensures log-linear scaling in both the hidden state dimensionality and the sequence length. We validate our method across a variety of time series datasets, benchmarking LS4 against an extensive set of baselines. We propose a set of 3 metrics to measure the quality of generated time series samples and show that LS4 performs significantly better than baselines on datasets with stiff transitions and obtains on average 30% lower MSE scores and ELBO. On sequences of length ≈ 20K, our model trains ×100 faster than the baseline methods.

2. RELATED WORK

Rapid progress on deep generative modeling of natural language and images has consolidated diffusion (Ho et al., 2020; Song et al., 2020; Song & Ermon, 2019; Sohl-Dickstein et al., 2015) and autoregressive techniques (Brown et al., 2020) as the state of the art. Although various approaches have been proposed for generative modeling of time series and dynamical systems, consensus on the advantages and disadvantages of each method has yet to emerge.

Deep generative modeling of sequences. All the major paradigms for deep generative modeling have seen application to time series and sequences. Most relevant to our work are the latent continuous-time autoencoder models proposed by Chen et al. (2018); Yildiz et al. (2019); Rubanova et al. (2019), where a neural differential equation encoder is used to parametrize a distribution of initial conditions for the decoder. Massaroli et al. (2021) propose a variant that parallelizes computation in time by casting the ODE solve as a root-finding problem. Beyond latent models, other continuous-time approaches are given in Kidger et al. (2020), which develops a GAN formulation using SDEs.

State space models. State space models (SSMs) are at the foundation of dynamical system theory (Chen, 1984) and signal processing (Oppenheim, 1999), and have also been adapted to deep generative modeling. Chung et al. (2015); Bayer & Osendorfer (2014) propose VAE variants of discrete-time RNNs, generalized later by Franceschi et al. (2020), among others. These models all unroll the recurrence and are thus challenging to scale to longer sequences. Our work is inspired by recent advances in deep architectures built as stacks of linear SSMs, notably S4 (Gu et al., 2021). Similar to S4, our generative model leverages the convolutional representation of SSMs during training, thus bypassing the need to materialize the hidden state of each recurrence.

3. PRELIMINARIES

We briefly introduce relevant details of continuous-time SSMs and their different representations. Then we introduce preliminaries of generative models for sequences.

3.1. STATE SPACE MODELS (SSM)

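A single-input single-output (SISO) linear state space model is defined by the following differential equation

$$\frac{d}{dt} h_t = A h_t + B x_t, \qquad y_t = C h_t + D x_t \tag{1}$$

with scalar input $x_t \in \mathbb{R}$, state $h_t \in \mathbb{R}^N$ and scalar output $y_t \in \mathbb{R}$. The system is fully characterized by the matrices $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, $D \in \mathbb{R}^{1 \times 1}$. Let $x, y \in C([a, b], \mathbb{R})$ be absolutely continuous real signals on the time interval $[a, b]$. Given an initial condition $h_0 \in \mathbb{R}^N$, the SSM (1) realizes a mapping $x \mapsto y$.

Discrete recurrent representation. In practice, continuous input signals are often sampled at time interval $\Delta$ and the sampled sequence is represented by $x = (x_{t_0}, x_{t_1}, \dots, x_{t_L})$ where $t_{k+1} = t_k + \Delta$. The discretized SSM follows the recurrence

$$h_{t_{k+1}} = \bar{A} h_{t_k} + \bar{B} x_{t_k}, \qquad y_{t_k} = C h_{t_k} + D x_{t_k}$$

where $\bar{A} = e^{A\Delta}$ and $\bar{B} = A^{-1}(e^{A\Delta} - I)B$, with the assumption that signals are constant during the sampling interval. Among many approaches to efficiently computing $e^{A\Delta}$, Gu et al. (2021) use a bilinear transform to estimate $e^{A\Delta} \approx (I - \tfrac{1}{2} A\Delta)^{-1}(I + \tfrac{1}{2} A\Delta)$. This recurrence equation can be used to iteratively solve for the next hidden state $h_{t_{k+1}}$, allowing the states to be calculated like an RNN or a Neural ODE (Chen et al., 2018; Massaroli et al., 2020).

Convolutional representation. Recurrent representations of SSMs are not practical in training because explicit calculation of hidden states for every time step requires $O(N^2 L)$ time and $O(NL)$ space for a sequence of length $L$.¹ This materialization of hidden states significantly restricts RNN-based methods in scaling to long sequences. To efficiently train an SSM, the recurrence equation can be fully unrolled, assuming zero initial hidden states, as

$$h_{t_0} = \bar{B} x_{t_0}, \qquad h_{t_1} = \bar{A}\bar{B} x_{t_0} + \bar{B} x_{t_1}, \qquad h_{t_2} = \bar{A}^2 \bar{B} x_{t_0} + \bar{A}\bar{B} x_{t_1} + \bar{B} x_{t_2}, \qquad \dots$$

$$y_{t_0} = C\bar{B} x_{t_0}, \qquad y_{t_1} = C\bar{A}\bar{B} x_{t_0} + C\bar{B} x_{t_1}, \qquad y_{t_2} = C\bar{A}^2 \bar{B} x_{t_0} + C\bar{A}\bar{B} x_{t_1} + C\bar{B} x_{t_2}, \qquad \dots$$

and more generally as

$$y_{t_k} = C\bar{A}^k \bar{B} x_{t_0} + C\bar{A}^{k-1} \bar{B} x_{t_1} + \dots + C\bar{B} x_{t_k}.$$

For an input sequence $x = (x_{t_0}, x_{t_1}, \dots, x_{t_L})$, one can observe that the output sequence $y = (y_{t_0}, y_{t_1}, \dots, y_{t_L})$ can be computed using a convolution with a skip connection

$$y = CK * x + Dx, \quad \text{where } K = (\bar{B}, \bar{A}\bar{B}, \dots, \bar{A}^{L-1}\bar{B}, \bar{A}^{L}\bar{B}). \tag{3}$$

This is the well-known connection between SSMs and convolution (Oppenheim & Schafer, 1975; Chen, 1984; Chilkuri & Eliasmith, 2021; Romero et al., 2021; Gu et al., 2020; 2021; 2022), and it can be computed very efficiently with a Fast Fourier Transform (FFT), which scales better than explicit matrix multiplication at each step.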
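The equivalence between the recurrent and convolutional views can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes $D = 0$, a zero initial state, and the convention $h_{t_0} = \bar{B} x_{t_0}$ used in the unrolled equations. The function names (`discretize_bilinear`, `ssm_recurrence`, `ssm_kernel`, `fft_causal_conv`) are hypothetical, and the kernel is materialized naively for clarity rather than with the structured algorithm of S4.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear transform: A_bar approximates exp(A*dt); B_bar = A^{-1}(A_bar - I)B."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - 0.5 * dt * A)
    A_bar = inv @ (I + 0.5 * dt * A)
    B_bar = inv @ (dt * B)  # algebraically equal to A^{-1}(A_bar - I)B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Unroll h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k from a zero state."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_k in x:
        h = A_bar @ h + B_bar * x_k
        ys.append((C @ h).item())
    return np.array(ys)

def ssm_kernel(A_bar, B_bar, C, L):
    """Materialize (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    k = np.empty(L)
    v = B_bar.copy()
    for i in range(L):
        k[i] = (C @ v).item()
        v = A_bar @ v
    return k

def fft_causal_conv(k, x):
    """Causal convolution y_n = sum_{j<=n} k_j x_{n-j} via a zero-padded FFT."""
    L = len(x)
    n = 2 * L  # padding avoids circular wrap-around
    y = np.fft.irfft(np.fft.rfft(k, n) * np.fft.rfft(x, n), n)
    return y[:L]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L = 4, 64
    A = -np.diag(np.arange(1.0, N + 1))  # a stable toy state matrix (assumption)
    B = rng.standard_normal((N, 1))
    C = rng.standard_normal((1, N))
    x = rng.standard_normal(L)
    A_bar, B_bar = discretize_bilinear(A, B, dt=0.1)
    y_rec = ssm_recurrence(A_bar, B_bar, C, x)
    y_conv = fft_causal_conv(ssm_kernel(A_bar, B_bar, C, L), x)
    print("max abs difference:", np.max(np.abs(y_rec - y_conv)))
```

The recurrence touches the $N$-dimensional state at every step, whereas the convolution only needs the length-$L$ kernel once it is available, which is the property exploited during training.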

3.2. VARIATIONAL AUTOENCODER (VAE)

VAEs (Kingma & Welling, 2013; Burda et al., 2015) are a highly successful paradigm for learning latent representations of high-dimensional data and are remarkably capable of modeling complex distributions. A VAE introduces a joint probability distribution between a latent variable $z$ and an observed random variable $x$ of the form

$$p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$$

where $\theta$ represents learnable parameters. The prior $p(z)$ over the latent is usually chosen as a standard Gaussian distribution, and the conditional distribution $p_\theta(x \mid z)$ is defined through a flexible non-linear mapping (such as a neural network) taking $z$ as input. Such highly flexible non-linear mappings often lead to an intractable posterior $p_\theta(z \mid x)$. Therefore, an inference model with parameters $\phi$ parametrizing $q_\phi(z \mid x)$ is introduced as an approximation, which allows learning through a variational lower bound of the marginal likelihood:

$$\log p_\theta(x) \ge -D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$$

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between two distributions.

VAE for sequences. Sequence data can be modeled in many different ways since the latent space can be chosen to encode information at different levels of granularity, i.e. $z$ can be a single variable encoding entire trajectories or a sequence of variables of the same length as the trajectories. We focus on the latter. Given observed sequence variables $x_{\le T}$ up to time $T$ discretized into a sequence $(x_{t_0}, \dots, x_{t_{L-1}})$ of length $L$ where $t_{L-1} = T$, a sequence VAE model with parameters $\theta, \lambda, \phi$ learns a generative and an inference distribution

$$p_{\theta,\lambda}(x_{\le t_{L-1}}, z_{\le t_{L-1}}) = \prod_{i=0}^{L-1} p_\theta(x_{t_i} \mid x_{<t_i}, z_{\le t_i})\, p_\lambda(z_{t_i} \mid z_{<t_i})$$

$$q_\phi(z_{\le t_{L-1}} \mid x_{\le t_{L-1}}) = \prod_{i=0}^{L-1} q_\phi(z_{t_i} \mid x_{\le t_i})$$

where $z_{\le t_{L-1}} = (z_{t_0}, \dots, z_{t_{L-1}})$ is the corresponding latent variable sequence. The approximate posterior $q_\phi$ is explicitly factorized as a product of marginals for efficiency reasons that we discuss in the next section.

Given this form of factorization, the variational lower bound has been considered for discrete sequence data (Chung et al., 2015) with the objective

$$\mathbb{E}_{q_\phi(z_{\le t_{L-1}} \mid x_{\le t_{L-1}})} \left[ \sum_{i=0}^{L-1} \Big( -D_{\mathrm{KL}}\big(q_\phi(z_{t_i} \mid x_{\le t_i}) \,\|\, p_\lambda(z_{t_i} \mid z_{<t_i})\big) + \log p_\theta(x_{t_i} \mid x_{<t_i}, z_{\le t_i}) \Big) \right] \tag{5}$$
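When all conditionals are Gaussian, each term of objective (5) has a closed form. The sketch below is illustrative only: it assumes scalar Gaussian conditionals whose means and standard deviations have already been computed from one posterior sample, and the names `gaussian_kl`, `gaussian_log_prob`, and `sequence_elbo` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Elementwise KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def gaussian_log_prob(x, mu, sigma):
    """Elementwise log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def sequence_elbo(x, mu_x, sigma_x, mu_q, sigma_q, mu_p, sigma_p):
    """Single-sample estimate of objective (5) for one sequence of length L.

    All arguments are arrays of length L (or scalars that broadcast). mu_x and
    (mu_p, sigma_p) are assumed to have been evaluated at a sample z ~ q.
    """
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)   # one KL term per time step
    rec = gaussian_log_prob(x, mu_x, sigma_x)        # one reconstruction term per step
    return float(np.sum(rec - kl))

if __name__ == "__main__":
    x = np.array([0.1, -0.2, 0.3])
    print(sequence_elbo(x, mu_x=np.zeros(3), sigma_x=1.0,
                        mu_q=np.zeros(3), sigma_q=np.ones(3),
                        mu_p=np.zeros(3), sigma_p=np.ones(3)))
```

When the posterior and prior terms coincide, the KL terms vanish and the estimate reduces to the summed log-likelihood, which is a convenient sanity check.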

4. METHOD

In this section, we introduce Latent S4 (LS4), a latent variable generative model parameterized using SSMs. We show how SSMs can parametrize the generative distribution $p_\theta(x_{\le T} \mid z_{\le T})\, p_\lambda(z_{\le T})$, the prior distribution $p_\lambda(z_{\le T})$ and the inference distribution $q_\phi(z_{\le T} \mid x_{\le T})$ effectively. For the purpose of exposition, we assume $z_t, x_t$ are scalars at any time step $t$. Their generalization to arbitrary dimensions is discussed in Section 4.4.

We first define a structured state space model with two input streams and use this as a building block for our generative model. It is an SSM of the form

$$\frac{d}{dt} h_t = A h_t + B x_t + E z_t, \qquad y_t = C h_t + D x_t + F z_t \tag{6}$$

where $x, y, z \in C([0, T], \mathbb{R})$ are continuous real signals on the time interval $[0, T]$. We denote $H(x, z, A, B, E, h_0, t) = H_\beta(x, z, h_0, t)$, where $\beta$ denotes the trainable parameters $A, B, E$, as the deterministic function mapping from signals $x, z$ to $h_t$, the state at time $t$, given initial state $h_0$ at time 0. The above SSM can be compactly represented by

$$y_t = C H_\beta(x, z, h_0, t) + D x_t + F z_t. \tag{7}$$

When the continuous-time input signals are discretized into discrete-time sequences $(x_{t_0}, \dots, x_{t_{L-1}})$ and $(z_{t_0}, \dots, z_{t_L})$, the corresponding output at time $t_k$ has a convolutional view (assuming $D = F = 0$ for simplicity)

$$y_{t_k} = C K_{t_k} * x_{t_k} + C \hat{K}_{t_k} * z_{t_k}, \quad \text{where } K_{t_k} = \bar{A}^k \bar{B}, \quad \hat{K}_{t_k} = \bar{A}^k \bar{E} \tag{8}$$

which can be evaluated efficiently using FFT. Additionally, $A$ is HiPPO-initialized (Gu et al., 2021) for all such SSM blocks.

4.1. LATENT SPACE AS STRUCTURED STATE SPACE

The goal of the prior model is to realize a sequence of random variables $(z_{t_0}, z_{t_1}, \dots, z_{t_L})$, which the prior distribution $p_\lambda(z_{\le t_L})$ models autoregressively. Suppose $(z_{t_0}, z_{t_1}, \dots, z_{t_n})$ is a sequence of latent variables up to time $t_n$; we define the prior distribution of $z_{t_n}$ autoregressively as

$$p_\lambda(z_{t_n} \mid z_{<t_n}) = \mathcal{N}\big(\mu_{z,n}(z_{<t_n}, \lambda),\ \sigma^2_{z,n}(z_{<t_n}, \lambda)\big)$$

where the mean $\mu_{z,n}$ and standard deviation $\sigma_{z,n}$ are deterministic functions of the previously generated $z_{<t_n}$, parameterized by $\lambda$. To parameterize the above distribution, we first define an intermediate building block, a stack of which will produce the desired distribution.

LS4 prior block. The forward pass through our SSM is a two-step procedure: first, we consider the latent dynamics of $z$ on $[t_0, t_{n-1}]$, where we simply leverage Equation 6 to define the hidden states to follow $H_{\beta_1}(0, z, 0, t)$. Second, on $(t_{n-1}, t_n]$, since no additional $z$ is available in this interval, we ignore additional input signals in the ODE and only include the last given latent, i.e. $z_{t_{n-1}}$, as an auxiliary signal for the outputs, which can be compactly denoted, with a final GELU non-linearity, as

$$y_{z,n} = \mathrm{GELU}\Big(C_{y_z} H_{\beta_1}\big(0, 0, \underbrace{H_{\beta_2}(0, z_{[t_0, t_{n-1}]}, 0, t_{n-1})}_{h_{t_{n-1}}}, t_n\big) + F_{y_z} z_{t_{n-1}}\Big) \tag{10}$$

We call this the LS4 prior layer, which we use to build our LS4 prior block upon a ResNet structure with a skip connection, denoted as

$$\mathrm{LS4}_{\mathrm{prior}}(z_{[t_0, t_{n-1}]}, \psi) = \mathrm{LayerNorm}(G_{y_z} y_{z,n} + b_{y_z}) + z_{t_{n-1}}$$

where $\psi$ denotes the union of parameters $\beta_i, C_{y_z}, F_{y_z}, G_{y_z}, b_{y_z}$. We define the final parameters $\mu_{z,n}$ and $\sigma_{z,n}$ for the conditional distribution in the autoregressive model as the result of a stack of LS4 prior blocks. During generation, as an initial condition, $z_{t_0} \sim \mathcal{N}(\mu_{z,0}, \sigma^2_{z,0})$ where $\mu_{z,0}, \sigma_{z,0}$ are learnable parameters, and subsequent latent variables are generated autoregressively. We specify our architecture in Appendix C and use $\lambda$ to denote the union of all trainable parameters.
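The autoregressive generation procedure for the latent sequence can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions, not the LS4 architecture: the stack of LS4 prior blocks is replaced by hypothetical linear readouts `w_mu` and `w_sigma` of a discretized SSM state, LayerNorm is omitted because the state is read out to a scalar, and the function name `sample_prior` is made up for illustration.

```python
import numpy as np

def softplus(a):
    """Numerically stable log(1 + exp(a)); used to keep the std positive."""
    return np.logaddexp(0.0, a)

def sample_prior(L, A_bar, E_bar, w_mu, w_sigma, mu0, sigma0, rng):
    """Autoregressively sample (z_{t_0}, ..., z_{t_{L-1}}).

    Toy stand-in: the mean and std of z_{t_n} are linear readouts of an SSM
    state summarizing z_{<t_n}. A real LS4 prior would use a stack of LS4
    prior blocks in place of these readouts.
    """
    h = np.zeros(A_bar.shape[0])
    z = np.empty(L)
    z[0] = mu0 + sigma0 * rng.standard_normal()  # initial condition z_{t_0}
    for n in range(1, L):
        h = A_bar @ h + E_bar * z[n - 1]         # fold z_{t_{n-1}} into the state
        mu = w_mu @ h + z[n - 1]                 # skip connection on z_{t_{n-1}}
        sigma = softplus(w_sigma @ h) + 1e-3     # strictly positive std
        z[n] = mu + sigma * rng.standard_normal()
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A_bar = np.diag([0.9, 0.5])      # stable toy dynamics (assumption)
    E_bar = np.array([0.1, 0.2])
    z = sample_prior(20, A_bar, E_bar,
                     w_mu=np.array([0.3, -0.2]), w_sigma=np.array([0.1, 0.1]),
                     mu0=0.0, sigma0=1.0, rng=rng)
    print(z.shape)
```

Because fresh noise is injected at every step, the sampled path is stochastic throughout, in contrast to models that inject randomness only at the initial condition.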

4.2. GENERATIVE MODEL

Given the latent variables, we now specify a decoder that represents the distribution $p_\theta(x_{\le t_L} \mid z_{\le t_L})$. Suppose $z_{\le t_L}$ is a latent path generated via the latent state space model; the output path $x_{\le t_L}$ also follows the state space formulation. Assuming we have generated $(x_{t_0}, \dots, x_{t_{n-1}})$ and $(z_{t_0}, \dots, z_{t_n})$, the conditional distribution of $x_{t_n}$ is parametrized as

$$p_\theta(x_{t_n} \mid x_{<t_n}, z_{\le t_n}) = \mathcal{N}\big(\mu_{x,n}(x_{<t_n}, z_{\le t_n}, \theta),\ \sigma_x^2\big)$$

where $\sigma_x$ is a pre-defined observation standard deviation and $\mu_{x,n}$ is a deterministic function of $z_{\le t_n}$ and $x_{<t_n}$.

LS4 generative block. Different from the prior block, both observation and latent sequences are input into our model, and we define intermediate outputs $g_{x,n}$ and $g_{z,n}$ as

$$h_{t_n} = H_{\beta_3}\big(0, z_{t_{n-1}}, H_{\beta_4}(x_{[t_0, t_{n-1}]}, z_{[t_0, t_{n-1}]}, 0, t_{n-1}), t_n\big)$$

$$g_{x,n} = \mathrm{GELU}(C_{g_x} h_{t_n} + D_{g_x} x_{t_{n-1}} + F_{g_x} z_{t_n})$$

$$g_{z,n} = \mathrm{GELU}(C_{g_z} h_{t_n} + D_{g_z} x_{t_{n-1}} + F_{g_z} z_{t_n})$$

which are used to build an LS4 generative block defined as

$$\hat{g}_{x,n} = \mathrm{LayerNorm}(G_{g_x} g_{x,n} + b_{g_x}) + x_{t_{n-1}}$$

$$\hat{g}_{z,n} = \mathrm{LayerNorm}(G_{g_z} g_{z,n} + b_{g_z}) + z_{t_n}$$

$$\mathrm{LS4}_{\mathrm{gen}}(x_{[t_0, t_{n-1}]}, z_{[t_0, t_n]}, \psi) = (\hat{g}_{x,n}, \hat{g}_{z,n})$$

where $\psi$ denotes all parameters inside the block. Note that the generative block gives two streams of outputs, which can be used as inputs for the next block in the stack. We then define the final mean value $\mu_{x,n}$ as the result of a stack of LS4 generative blocks. The initial condition for generation is given as $x_{t_0} \sim \mathcal{N}(\mu_{x,0}(z_{t_0}, \theta), \sigma_x^2)$, where $\mu_{x,0}$ exactly follows our formulation while taking only $z_{t_0}$ as input. The subsequent $x_{t_n}$'s are generated autoregressively. We specify our architecture in Appendix C and use $\theta$ to denote the union of all trainable parameters.

4.3. INFERENCE MODEL

The latent variable model up to time $t_n$ has an intractable posterior $p_\theta(z_{\le t_n} \mid x_{\le t_n})$. Therefore, we approximate this distribution with $q_\phi(z_{\le t_n} \mid x_{\le t_n})$ using variational inference. We parameterize the inference distribution at time $t_n$ to depend only on the observed path $x_{\le t_n}$:

$$q_\phi(z_{t_n} \mid x_{\le t_n}) = \mathcal{N}\big(\hat{\mu}_{z,t_n}(x_{\le t_n}, \phi),\ \hat{\sigma}^2_{z,t_n}(x_{\le t_n}, \phi)\big)$$

LS4 inference block. The inference block is defined as

$$\hat{y}_{z,n} = \mathrm{GELU}\big(C_{\hat{y}_z} H_{\beta_5}(x_{[t_0, t_n]}, 0, 0, t_{n-1}) + D_{\hat{y}_z} x_{t_n}\big)$$

$$\mathrm{LS4}_{\mathrm{inf}}(x_{[t_0, t_n]}, \psi) = \mathrm{LayerNorm}(G_{\hat{y}_z} \hat{y}_{z,n} + b_{\hat{y}_z}) + x_{t_n} \tag{16}$$

Notice that the input $x$ is fully present in $[t_0, t_n]$, unlike in the generative model. $\hat{\mu}_{z,t_n}$ and $\hat{\sigma}_{z,t_n}$ are each a deep stack of our inference blocks, defined similarly as before. Here we also justify our choice of factorizing the posterior as a product of marginals as in Equation 5. By having each $z_{t_n}$ depend explicitly on $x_{\le t_n}$ only, we can leverage the fast convolution operation to obtain all $z_{t_n}$ in parallel, thus achieving fast inference, unlike the autoregressive prior and generative models.
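The parallelism argument can be made concrete with a small sketch: because each posterior parameter depends only on $x_{\le t_n}$, a causal convolution over $x$ yields the parameters for every time step at once, and all latents are then drawn with one reparameterized sample. This is an illustrative assumption-laden toy, with hypothetical kernels `k_mu` and `k_sigma` standing in for the stack of inference blocks and a softplus used to keep the standard deviation positive.

```python
import numpy as np

def causal_conv(k, x):
    """Causal convolution via zero-padded FFT: y_n depends only on x_0..x_n."""
    L = len(x)
    n = 2 * L
    return np.fft.irfft(np.fft.rfft(k, n) * np.fft.rfft(x, n), n)[:L]

def posterior_sample(x, k_mu, k_sigma, rng):
    """Draw all z_{t_n} ~ q(z_{t_n} | x_{<=t_n}) in parallel.

    k_mu and k_sigma are hypothetical convolution kernels standing in for the
    deep stack of inference blocks.
    """
    mu = causal_conv(k_mu, x)
    sigma = np.logaddexp(0.0, causal_conv(k_sigma, x)) + 1e-3  # softplus, > 0
    eps = rng.standard_normal(len(x))
    return mu + sigma * eps, mu, sigma  # reparameterization: z = mu + sigma * eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = 32
    x = rng.standard_normal(L)
    k_mu = 0.8 ** np.arange(L)
    k_sigma = 0.5 ** np.arange(L)
    z, mu, sigma = posterior_sample(x, k_mu, k_sigma, rng)
    # Perturbing future observations leaves earlier posterior means unchanged.
    x_future = x.copy()
    x_future[10:] += 5.0
    _, mu_future, _ = posterior_sample(x_future, k_mu, k_sigma, rng)
    print(np.allclose(mu[:10], mu_future[:10]))
```

No loop over time steps is needed, which is what distinguishes inference from the autoregressive sampling used by the prior and the decoder.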

4.4. LS4: PROPERTIES AND PRACTICE

We highlight some properties of LS4. In particular, we compare in the following proposition the expressive power of our generative model against structured state space models.

Proposition 4.1. (LS4 subsumes S4.) Given any autoregressive model $r(x)$ with conditionals $r(x_n \mid x_{<n})$ parameterized via deep S4 models, there exists a choice of $\theta, \lambda, \phi$ such that $p_{\theta,\lambda}(x) = r(x)$ and $p_{\theta,\lambda}(z \mid x) = q_\phi(z \mid x)$, i.e. the variational lower bound (ELBO) is tight.

A proof sketch is provided in Appendix B. This result shows that LS4 subsumes autoregressive generative models based on vanilla S4 (Gu et al., 2021), given that the architecture between SSM layers is the same. Crucially, under the assumption that we are able to globally optimize the ELBO training objective, LS4 will fit the data at least as well as vanilla S4.

Scaling to arbitrary feature dimensions. So far we have assumed the input and latent signals are real numbers. The approach can be scaled to arbitrary dimensions of inputs and latents by constructing LS4 layers for each dimension, whose outputs are fed into a linear mixing layer. We call such parallelized SSMs heads and provide pseudocode in Appendix C. Note that in practice, $A$ is HiPPO-initialized (Gu et al., 2020) and the materialized kernel includes $C$, so that the convolution is computed directly in the projected space, bypassing the materialization of the high-dimensional hidden states.

5. EXPERIMENTS

In this section, we verify the modeling capability of LS4 empirically. There are three main questions we seek to answer: (1) How effective is LS4 in modeling stiff sequence data? (2) How expressive is LS4 in scaling to real time-series with a variety of temporal dynamics? (3) How efficient in training and inference is LS4 in terms of wall-clock time?

5.1. LEARNING TO GENERATE DATA FROM STIFF SYSTEMS

Modeling data generated by dynamics with widely separated time scales has proven particularly challenging for vanilla ODE-based approaches, which make use of standard explicit solvers for inference and gradient calculation. Kim et al. (2021) showed that as the learned dynamics stiffen up to track data paths, the ODE numerics start to fail catastrophically; the inference cost rises drastically and the gradient estimation process becomes ill-conditioned. These issues can be mitigated by employing implicit ODE solvers or ad-hoc rescalings of the learned vector field (see Kim et al. (2021) for further details). In turn, state space models do not suffer from stiffness of dynamics, as the numerical methods are sidestepped in favor of an exact evaluation of the convolution operator. We hereby show that LS4 is able to model data generated by a prototype stiff system.

FLAME problem

We consider a simple model of flame growth (FLAME) (Wanner & Hairer, 1996), which has been extensively studied as a representative of highly stiff systems:

$$\frac{d}{dt} x_t = x_t^2 - x_t^p$$

where $p \in \{3, 4, \dots, 10\}$. For each $p$, 1000 trajectories are generated for $t \in [0, 1000]$ with unit increment.

Generation. In Figure 1a, we show the mean trajectories and the distribution at each time step, and that our samples closely match the ground-truth data. The Latent ODE (Rubanova et al., 2019) instead fails to do so and produces non-stiff samples drastically different from the target.

Figure 1: (a) Generation of the stiff system. (b) Marginal histograms at steps equally spaced between the 0.5% and 10% steps in log scale.

Marginal Distribution. We plot the marginal distribution of the real data and the generated data from both our model and Latent ODE. Since the stiff transitions are mostly distributed before 10% of total steps, we visualize the marginal histograms at 4 time steps equally spaced between the 0.5% and 10% steps in log scale (see Figure 1b). We observe that the empirical histogram of our samples matches the ground-truth distribution significantly better than that produced by Latent ODEs, as is also qualitatively visible from the samples in (a).
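To give a feel for the stiffness of the FLAME dynamics, the sketch below integrates a single trajectory with backward Euler, an implicit method, solving each step with Newton's method. It is not the data-generation procedure used in the experiments, which is not specified here: the initial condition `x0 = 0.002`, the choice of solver, and the function name `flame_trajectory` are all assumptions made for illustration, and only $p = 3$ is exercised (larger $p$ may need a smaller step size).

```python
def flame_trajectory(p=3, x0=0.002, t_end=1000, dt=1.0, newton_iters=50, tol=1e-12):
    """Integrate dx/dt = x^2 - x^p with backward Euler.

    Each implicit step solves g(y) = y - x_n - dt * (y^2 - y^p) = 0 for
    y = x_{n+1} with Newton's method. x0 is an assumed initial condition.
    """
    xs = [x0]
    x = x0
    for _ in range(int(round(t_end / dt))):
        y = x  # initial guess for x_{n+1}
        for _ in range(newton_iters):
            g = y - x - dt * (y ** 2 - y ** p)
            dg = 1.0 - dt * (2.0 * y - p * y ** (p - 1))
            step = g / dg
            y -= step
            if abs(step) < tol:
                break
        x = y
        xs.append(x)
    return xs

if __name__ == "__main__":
    xs = flame_trajectory()
    # The state stays tiny for a long time, then jumps sharply toward 1.
    for t in (0, 100, 400, 500, 600, 1000):
        print(t, xs[t])
```

The trajectory has a long slow phase followed by an abrupt transition to the equilibrium at 1, which is the kind of sharp temporal change that explicit solvers struggle to track.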

5.2. GENERATION WITH REAL TIME-SERIES DATASETS

We investigate the generative capability of LS4 on real time-series data. We show that our model is capable of fitting a wide variety of time-series data with significantly different temporal dynamics. We leave implementation details to Appendix D.1.

Data. We use the Monash Time Series Repository (Godahewa et al., 2021), a comprehensive benchmark containing 30 time-series datasets collected in the real world, and we choose FRED-MD, NN5 Daily, Temperature Rain, and Solar Weekly as our target datasets. Each dataset exhibits unique temporal dynamics, which makes generative learning a challenging task. A sample from each dataset can be visualized in Figure 2.

Metrics. Marginal scores compare the empirical probability density functions of two distributions; the lower the better (Ni et al., 2020). Following Kidger et al. (2021), we define Classification scores as using a sequence model to classify whether a sample is real or generated and use its cross-entropy loss as a proxy for generation quality; the higher the score, the less distinguishable the samples and thus the better the generation. Similarly, Prediction scores use a train-on-synthetic-test-on-real seq2seq model to predict k steps into the future; the lower the score, the more predictable the samples and thus the better the generation. We use a 1-layer S4 (Gu et al., 2021) for both Classification and Prediction scores (see Appendix D.1 for more details). We discuss the necessity of all 3 metrics in the following discussion.

We compare our model with several time-series generative models, namely VAE-based methods such as RNN-VAE (Rubanova et al., 2019), GP-VAE (Fortuin et al., 2020), ODE$^2$VAE (Yildiz et al., 2019), and Latent ODE (Rubanova et al., 2019), GAN-based methods such as TimeGAN (Yoon et al., 2019) and SDE GAN (Kidger et al., 2021), and SaShiMi (Goel et al., 2022). We show quantitative results in Table 1. Our model significantly outperforms the baselines on all datasets.

We note that baseline models have a hard time modeling NN5 Daily and Temperature Rain, where the transition dynamics are stiff. For Temperature Rain, where most data points lie around the x-axis with sharp spikes throughout, Latent ODE generates samples that lie mostly close to the x-axis, thus achieving lower Marginal scores, but its generations are easily distinguishable from the data, making it a less favorable generative model. This demonstrates that Marginal scores alone are an insufficient metric for generation quality. SaShiMi, an autoregressive model based on S4, does not perform as well on time series generation in the tasks considered. We further discuss the reason in Appendix D.1.

5.3. INTERPOLATION & EXTRAPOLATION

We also show that our model is expressive enough to fit irregularly-sampled data and performs favorably in terms of interpolation and extrapolation. Interpolation refers to the task of predicting missing data given a subset of a sequence, while extrapolation refers to the task in which a sequence is split into 2 segments, each with half the length of the full sequence, and the latter segment is predicted given the former.

Data. Following Rubanova et al. (2019) and Schirmer et al. (2022), we use USHCN (Menne et al., 2015) and Physionet (Silva et al., 2012) as our datasets.

Metrics. We use mean squared error (MSE) to evaluate both interpolation and extrapolation.

We compare our model with RNN (Rumelhart et al., 1985), RNN-VAE (Chung et al., 2014; Rubanova et al., 2019), ODE-RNN (Rubanova et al., 2019), GRU-D (Rubanova et al., 2019), Latent ODE (Chen et al., 2018; Rubanova et al., 2019), and CRU (Schirmer et al., 2022). Results are shown in Table 2. We observe that our model outperforms previous continuous-time methods. Our model performs less well on extrapolation for Physionet compared to ODE-RNN and Latent ODE. We postulate that this is due to the high variability granted by our latent space. Since new latent variables are generated as we extrapolate, our model generates paths that are more flexible (hence less predictable) than those of Latent ODE, which instead uses a single latent variable to encode an entire path.

We additionally observe that our model achieves ELBO of -669.0 and -250.2 on Physionet interpolation and extrapolation tasks respectively. These bounds are much tighter lower bounds than those of other VAE-based methods, i.e. RNN-VAE, which reports -412.8 and -220.2, and Latent ODE, which reports -410.3 and -168.5. We leave additional ELBO comparisons to Appendix D.

5.4. RUNTIME

We additionally verify the computational efficiency of our model for both training and inference. We do so by training and running inference on synthetic data with controlled lengths specified below.

Data. We create a set of synthetic datasets with lengths {80, 320, 1280, 5120, 20480} to investigate the scaling of training and inference time with respect to sequence length. Training is done with 100 iterations through the synthetic data, and inference is performed given one batch of synthetic data (see Appendix D.3).

Metrics. We use wall-clock runtime measured in milliseconds.

Figure 3 shows model runtime in log scale. ODE$^2$VAE fails to finish training on the last sequence length within a reasonable time frame, so we omit its last data point from the plot. Our model's runtime is consistently and significantly lower than that of the baselines, which are observed to scale linearly with input length, and our model is ×100 faster than the baselines in both training and inference at length 20480.

6. CONCLUSION

We introduce LS4, a powerful generative model with latent space evolution following a state space ODE. Our model is built with a deep stack of LS4 prior/generative/inference blocks, which are trained via standard sequence VAE objectives. We also show that under specific choices of model parameters, LS4 subsumes autoregressive S4 models. Experimentally, we demonstrate the modeling power of LS4 on datasets with a wide variety of temporal dynamics and show significant improvement in generation/interpolation/extrapolation quality. In addition, our model shows ×100 speed-up in training and inference time on long sequences. LS4 demonstrates improved expressivity and computational efficiency, and we believe that it has a further role to play in modeling general time-series sequences.



¹ Further explanations in Appendix A.1.
² Numbers are taken from the original paper. We keep the significant digits unchanged.



Proposition 4.2. (Efficiency.) For an SSM with $H$ heads, an observation sequence of length $L$ and hidden dimension $N$ can be calculated in $O(H(L+N)\log(L+N))$ time and $O(H(L+N))$ space.

We provide a proof in Appendix B. Note that our model is much more efficient in both time and space than RNN/ODE-based methods (which require $O(N^2 L)$ time and $O(NL)$ space, as discussed in Section 3.1). To demonstrate the computational efficiency, we additionally provide below pseudocode for a single LS4 prior layer (Equation 10). The other blocks can be constructed similarly.

```python
def LS4_prior_layer(z, A, B, C, F, h_0):
    # z: (B, L, 1) batch of latent sequences
    # materialize_kernel, fft_conv and gelu are helper routines not defined here.
    K = C @ materialize_kernel(z, A, B, h_0)  # O((L+N)log(L+N)) time
    CH = fft_conv(K, z)                       # O(LlogL) time and O(L) space
    y = gelu(CH + F * z)
    return y
```
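As a rough illustration of the gap between the two complexity classes, the snippet below evaluates the leading-order operation counts for the sequence lengths used in the runtime experiments. It ignores all constant factors, so it is not a prediction of wall-clock time; the hidden dimension `N = 64` and the function names are assumptions chosen only for this example.

```python
import math

def recurrent_cost(N, L):
    """Leading-order count O(N^2 L) for unrolling the recurrence step by step."""
    return N * N * L

def convolutional_cost(N, L, H=1):
    """Leading-order count O(H (L + N) log(L + N)) from Proposition 4.2."""
    return H * (L + N) * math.log2(L + N)

if __name__ == "__main__":
    N = 64  # hypothetical hidden dimension
    for L in (80, 320, 1280, 5120, 20480):
        ratio = recurrent_cost(N, L) / convolutional_cost(N, L)
        print(f"L={L:6d}  recurrent / convolutional ~ {ratio:.1f}")
```

Under these counts the recurrent cost grows with $N^2$ per time step while the convolutional cost grows only logarithmically faster than the sequence length, so the advantage of the convolutional view widens as the hidden dimension increases.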

Figure 2: Monash data. The selected datasets exhibit a variety of temporal dynamics ranging from relatively smooth to stiff transitions.

Figure 3: Runtime comparison. The y-axis shows run-time (ms) of each setting in log scale. Our runtime stays consistently lower across all sequence lengths investigated.

R) be absolutely continuous real signals on the time interval [a, b]. Given an initial condition $h_0 \in \mathbb{R}^N$, the SSM (1) realizes a mapping $x \mapsto y$. Discrete recurrent representation. In practice, continuous input signals are often sampled at time interval $\Delta$ and the sampled sequence is represented by $x = (x_{t_0}, x_{t_1}, \dots, x_{t_L})$ where $t_{k+1} = t_k + \Delta$.

Table 1: Generation results on FRED-MD, NN5 Daily, Temperature Rain, and Solar Weekly.

Following Rubanova et al. (2019); Schirmer et al. (2022), we use USHCN and Physionet as our datasets of choice. The United States Historical Climatology Network (USHCN) (Menne et al., 2015) is a climate dataset containing daily measurements from 1,218 weather stations across the US for precipitation, snowfall, snow depth, minimum and maximum temperature. Physionet (Silva et al., 2012) is a dataset containing health measurements of 41 signals from 8,000 ICU patients in their first 48 hours. We follow the preprocessing steps of Schirmer et al. (2022) for training and testing.

Table 2: Interpolation and extrapolation MSE ($\times 10^{-3}$) scores. Lower scores are better.

