LATENT NEURAL ODES WITH SPARSE BAYESIAN MULTIPLE SHOOTING

Abstract

Training dynamic models, such as neural ODEs, on long trajectories is a hard problem that requires various tricks, such as trajectory splitting, to make training work in practice. These methods are often heuristics with poor theoretical justification and require iterative manual tuning. We propose a principled multiple shooting technique for neural ODEs that splits trajectories into manageable short segments, which are optimised in parallel, while ensuring probabilistic control of continuity across consecutive segments. We derive variational inference for our shooting-based latent neural ODE models and propose amortized encodings of irregularly sampled trajectories with a transformer-based recognition network featuring temporal attention and relative positional encoding. We demonstrate efficient and stable training, and state-of-the-art performance on multiple large-scale benchmark datasets.

1. INTRODUCTION

Dynamical systems, from biological cells to the weather, evolve according to underlying mechanisms that are often described by differential equations. In data-driven system identification we aim to learn the rules governing a dynamical system by observing the system over a time interval [0, T] and fitting a model of the underlying dynamics to the observations by gradient descent. Such optimisation suffers from the curse of length: the complexity of the loss function grows with the length of the observed trajectory (Ribeiro et al., 2020). For even moderate T the loss landscape can become highly complex, and gradient descent fails to produce a good fit (Metz et al., 2021). To alleviate this problem, previous works resort to cumbersome heuristics such as iterative training and trajectory splitting (Yildiz et al., 2019; Kochkov et al., 2021; Han et al., 2022; Lienen & Günnemann, 2022). The optimal control literature has a long history of multiple shooting methods, where trajectory fitting is split into piecewise segments that are easy to optimise, with constraints ensuring continuity across the segments (van Domselaar & Hemker, 1975; Bock & Plitt, 1984; Baake et al., 1992). Multiple-shooting based models have simpler loss landscapes and are practical to fit by gradient descent (Voss et al., 2004; Heiden et al., 2022; Turan & Jäschke, 2022; Hegde et al., 2022). Inspired by this line of work, we develop a shooting-based latent neural ODE model (Chen et al., 2018; Rubanova et al., 2019; Yildiz et al., 2019; Massaroli et al., 2020). Our multiple shooting formulation generalizes standard approaches by sparsifying the shooting variables in a probabilistic setting to account for irregularly sampled time grids and redundant shooting variables. We furthermore introduce an attention-based (Vaswani et al., 2017) encoder architecture for latent neural ODEs that is compatible with our sparse shooting formulation and can handle noisy and partially observed high-dimensional data.
Consequently, our model produces state-of-the-art results, naturally handles long observation intervals, and is stable and quick to train. Our contributions are:

• We introduce a latent neural ODE model with quick and stable training on long trajectories.
• We derive sparse Bayesian multiple shooting: a Bayesian version of multiple shooting with efficient utilization of shooting variables and a continuity-inducing prior.
• We introduce a transformer-based encoder with novel time-aware attention and relative positional encodings, which efficiently handles data observed at arbitrary time points.

Figure 2: Method overview with two blocks (see Section 3.1). The encoder maps the input sequence y_{1:5}, observed at arbitrary time points t_{1:5}, to two distributions q_ψ1(s_1), q_ψ2(s_2) from which we sample shooting variables s_1, s_2. Then, s_1, s_2 are used to compute two sub-trajectories that define the latent trajectory x_{1:5}, from which the decoder reconstructs the input sequence.

2. PROBLEM SETTING AND BACKGROUND

Data. We observe a dynamical system at arbitrary consecutive time points t_{1:N} = (t_1, ..., t_N), which generates an observed trajectory y_{1:N} = (y_1, ..., y_N), where y_i := y(t_i) ∈ R^D. Our goal is to model the observations and forecast future states. For brevity we present our methodology for a single trajectory, but the extension to many trajectories is straightforward.

Latent neural ODE models. L-NODE models (Chen et al., 2018; Rubanova et al., 2019) relate the observations y_{1:N} to a latent trajectory x_{1:N} := (x_1, ..., x_N), where x_i := x(t_i) ∈ R^d, and learn the dynamics in the latent space. An L-NODE model is defined as:

x_i = ODEsolve(x_1, t_1, t_i, f_θdyn),   i = 2, ..., N,   (1)
y_i | x_i ∼ p(y_i | g_θdec(x_i)),   i = 1, ..., N,   (2)

where x_1 is the initial state at time t_1. The dynamics function f_θdyn is the time derivative of x(t), and ODEsolve(x_1, t_1, t_i, f_θdyn) is defined as the solution of the following initial value problem at time t_i:

dx(t)/dt = f_θdyn(t, x(t)),   x(t_1) = x_1,   t ∈ [t_1, t_i].   (3)

The decoder g_θdec maps the latent state x_i to the parameters of p(y_i | g_θdec(x_i)). The dynamics and decoder functions are neural networks with parameters θ_dyn and θ_dec. In typical applications the data is high-dimensional whereas the dynamics are modeled in a low-dimensional latent space, i.e., d ≪ D. L-NODE models are commonly trained by minimizing a loss function, e.g., the negative evidence lower bound (ELBO), via gradient descent (Chen et al., 2018; Yildiz et al., 2019). In gradient-based optimization the complexity of the loss landscape plays a crucial role in the success of the optimization. However, it has been shown empirically that the loss landscape of L-NODE-like models (i.e., models that compute the latent trajectory x_{1:N} from the initial state x_1) is strongly affected by the length of the simulation interval [t_1, t_N] (Voss et al., 2004; Metz et al., 2021; Heiden et al., 2022). Furthermore, Ribeiro et al. (2020) show that the loss complexity, in terms of the Lipschitz constant, can grow exponentially with the length of [t_1, t_N]. Figure 1 shows an example of this phenomenon (details in Appendix A).
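To make the curse of length concrete, the following toy computation (our illustration, not from the paper; all names are ours) measures the curvature of a squared-error loss for the scalar ODE dx/dt = θx, whose solution is available in closed form. The curvature around the optimum grows rapidly with the horizon T, in line with the exponential growth of loss complexity discussed above.

```python
import numpy as np

# Toy model: dx/dt = theta * x has the closed-form solution x(t) = x0 * exp(theta * t).
def loss(theta, t_grid, y, x0=1.0):
    """Mean squared error between the model solution and observations y."""
    pred = x0 * np.exp(theta * t_grid)
    return np.mean((pred - y) ** 2)

TRUE_THETA = 0.5

def sharpness(T, n=50, eps=1e-3):
    """Finite-difference curvature of the loss at the optimum for horizon T."""
    t = np.linspace(0.0, T, n)
    y = np.exp(TRUE_THETA * t)  # noiseless observations from the true system
    return (loss(TRUE_THETA + eps, t, y) - 2.0 * loss(TRUE_THETA, t, y)
            + loss(TRUE_THETA - eps, t, y)) / eps ** 2

# The loss becomes dramatically sharper as the observation horizon grows.
print(sharpness(1.0) < sharpness(5.0) < sharpness(10.0))
```

Splitting [0, T] into short segments, as multiple shooting does, keeps each term of the loss in the benign low-curvature regime of small T.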

3. METHODS

In Section 3.1, we present our latent neural ODE formulation that addresses the curse of length by sparse multiple shooting. In Section 3.2 we describe the generative model, inference, and forecasting procedures. In Section 3.3 we describe our time-aware, attention-based encoder architecture that complements our sparse multiple shooting framework.

3.1. LATENT NEURAL ODES WITH SPARSE MULTIPLE SHOOTING

Figure 3: Top: trajectory over [t_1, t_4]; x_i is computed from x_1. Bottom: [t_1, t_4] is split into three sub-intervals; x_i is computed from s_{i-1}.

Multiple shooting. A simple and effective method for solving optimisation problems with long simulation intervals is to split these intervals into short, non-overlapping sub-intervals that are optimised in parallel. This is the main idea of multiple shooting (Hemker, 1974; Bock & Plitt, 1984). To apply multiple shooting to an L-NODE model we introduce new parameters, called shooting variables, s_{1:N-1} = (s_1, ..., s_{N-1}) with s_i ∈ R^d, corresponding to time points t_{1:N-1}, and redefine the L-NODE model as

x_1 = s_1,   (4)
x_i = ODEsolve(s_{i-1}, t_{i-1}, t_i, f_θdyn),   i = 2, ..., N,   (5)
y_i | x_i ∼ p(y_i | g_θdec(x_i)),   i = 1, ..., N.   (6)

The initial state x_1 is represented by the first shooting variable s_1, and the latent state x_i is computed from the previous shooting variable s_{i-1}. This gives short simulation intervals [t_{i-1}, t_i], which greatly reduces the complexity of the loss landscape. Continuity of the entire piecewise trajectory is enforced via constraints on the distances between x_i and s_i (see Figure 3), which we discuss in Section 3.2.

Sparse multiple shooting. Multiple shooting assigns a shooting variable to every time point (see Figure 3). For irregular or densely sampled time grids this approach may result in redundant shooting variables and excessively short, uninformative sub-intervals due to a high concentration of time points in some regions of the grid. We propose to fix these problems by sparsifying the shooting variables: instead of assigning a shooting variable to every time point, we divide the time grid into B non-overlapping blocks and assign a single shooting variable to each block, so that

x_i = ODEsolve(s_b, t_[b], t_i, f_θdyn),   i ∈ I_b,   (7)

where t_[b] is the time point of shooting variable s_b and I_b indexes the time points in block b.
As shown in Figure 4, this approach reduces the number of shooting variables and grants finer control over the length of each sub-interval, ensuring that it is both long enough to contain sufficient information about the dynamics and short enough to keep the loss landscape manageable. As Figure 4 also illustrates, an ODE solution (Eq. 7) does not necessarily match the corresponding shooting variable. Standard multiple shooting formulations enforce continuity of the entire trajectory via a hard constraint or a penalty term (Voss et al., 2004; Jordana et al., 2021; Turan & Jäschke, 2022). Instead, we utilize Bayesian inference and naturally encode continuity as a prior, which leads to sparse Bayesian multiple shooting, discussed in the next section.
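As a minimal sketch of the sparse multiple shooting decomposition above (illustrative names and toy linear dynamics; a fixed-step Euler integrator stands in for ODEsolve), the following code assigns one shooting variable per block and integrates each latent state only from the start of its own block, so blocks can in principle be processed in parallel. Taking t_[b] to be the first time point of block b is an assumption of this sketch.

```python
import numpy as np

def f(x):
    """Toy latent dynamics dx/dt = -x (stand-in for f_theta_dyn)."""
    return -x

def euler_solve(s, t0, t1, steps=20):
    """Fixed-step Euler stand-in for ODEsolve(s, t0, t1, f)."""
    x, h = s, (t1 - t0) / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

def piecewise_trajectory(s_blocks, t_grid, block_index):
    """x_i = ODEsolve(s_b, t_[b], t_i, f) for i in block b, cf. Eq. 7."""
    # t_[b]: here, the first time point of block b (assumption of this sketch)
    t_start = {b: t_grid[np.argmax(block_index == b)] for b in set(block_index)}
    xs = [euler_solve(s_blocks[b], t_start[b], t)
          for b, t in zip(block_index, t_grid)]
    return np.array(xs)

t_grid = np.array([0.0, 0.3, 0.5, 1.1, 1.4, 2.0])   # irregular grid, N = 6
block_index = np.array([0, 0, 0, 1, 1, 1])           # B = 2 blocks
s_blocks = np.array([1.0, 0.4])                      # one shooting variable per block
x = piecewise_trajectory(s_blocks, t_grid, block_index)
print(x.shape)  # (6,)
```

Note that x at the start of each block equals that block's shooting variable; the gap between the end state of block 1 simulated forward and s_2 is exactly what the continuity prior of the next section penalizes.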

3.2. MODEL, INFERENCE, AND FORECASTING

Model. Our model is a latent neural ODE with sparse multiple shooting (Section 3.1). To infer the parameters s_{1:B}, θ_dyn, and θ_dec we use Bayesian inference with the following prior:

p(s_{1:B}, θ_dyn, θ_dec) = p(s_{1:B} | θ_dyn) p(θ_dyn) p(θ_dec),   (8)

where p(θ_dyn), p(θ_dec) are Gaussians, and the continuity-inducing prior p(s_{1:B} | θ_dyn) is defined as

p(s_{1:B} | θ_dyn) = p(s_1) ∏_{b=2}^{B} p(s_b | s_{b-1}, θ_dyn)
                   = p(s_1) ∏_{b=2}^{B} N( s_b | ODEsolve(s_{b-1}, t_[b-1], t_[b], f_θdyn), σ_c² I ),   (9)

where p(s_1) is a diagonal Gaussian, N is the Gaussian distribution, I ∈ R^{d×d} is the identity matrix, and the parameter σ_c² controls the strength of the prior. The continuity prior forces the shooting variable s_b and the final state of the previous block b-1, obtained using the dynamics model, to be close (e.g., s_2 and x(t_[2]) = x_4 in Figure 4), thus promoting continuity of the entire trajectory. With the priors above, we get the following generative model:

θ_dyn, θ_dec ∼ p(θ_dyn) p(θ_dec),   s_{1:B} | θ_dyn ∼ p(s_{1:B} | θ_dyn),   (10)
x_1 = s_1,   (11)
x_i = ODEsolve(s_b, t_[b], t_i, f_θdyn),   b ∈ {1, ..., B}, i ∈ I_b,   (12)
y_i | x_i ∼ p(y_i | g_θdec(x_i)),   i = 1, ..., N.   (13)

Inference. We use variational inference (Blei et al., 2017) to approximate the true posterior p(θ_dyn, θ_dec, s_{1:B} | y_{1:N}) by an approximate posterior

q(θ_dyn, θ_dec, s_{1:B}) = q(θ_dyn) q(θ_dec) q(s_{1:B}) = q_ψdyn(θ_dyn) q_ψdec(θ_dec) ∏_{b=1}^{B} q_ψb(s_b),

with variational parameters ψ_dyn, ψ_dec, and ψ_{1:B} = (ψ_1, ..., ψ_B). Note that contrary to standard VAEs, which use point estimates of θ_dyn and θ_dec, we extend variational inference to these parameters to adequately handle their uncertainty. To avoid direct optimization over the local variational parameters ψ_{1:B}, we use amortized variational inference (Kingma & Welling, 2013) and learn an encoder h_θenc with parameters θ_enc which maps the observations y_{1:N} to ψ_{1:B} (see Section 3.3).
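The continuity-inducing prior of Eq. 9 can be sketched as follows (a toy example with illustrative names; a closed-form exponential flow for the toy dynamics dx/dt = -x stands in for ODEsolve). The log-prior rewards shooting variables that sit close to the simulated end state of the previous block, with σ_c² setting the tolerance.

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of N(x | mu, var * I) for a d-dimensional x."""
    d = x.size
    return -0.5 * (d * np.log(2.0 * np.pi * var) + np.sum((x - mu) ** 2) / var)

def flow(s, dt):
    """Closed-form stand-in for ODEsolve with toy dynamics dx/dt = -x."""
    return s * np.exp(-dt)

def log_continuity_prior(s_blocks, t_block, sigma_c2, var0=1.0):
    """log p(s_1) + sum_b log N(s_b | flow(s_{b-1}), sigma_c^2 I), cf. Eq. 9."""
    lp = log_gauss(s_blocks[0], 0.0, var0)  # p(s_1): diagonal Gaussian
    for b in range(1, len(s_blocks)):
        mu = flow(s_blocks[b - 1], t_block[b] - t_block[b - 1])
        lp += log_gauss(s_blocks[b], mu, sigma_c2)
    return lp

s1 = np.array([1.0, -0.5])
s2 = flow(s1, 1.0)                       # perfectly continuous second block
print(log_continuity_prior([s1, s2], [0.0, 1.0], sigma_c2=1e-2))
```

Shrinking σ_c² sharpens the prior: discontinuous configurations are penalized more heavily, mirroring the behaviour studied in the continuity prior experiments of Section 4.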
We denote the amortized shooting distributions q_ψb(s_b | y_{1:N}, θ_enc), where ψ_b = h_θenc(y_{1:N}), simply as q(s_b) or q_ψb(s_b) for brevity. We assume q_ψdyn, q_ψdec, and q_ψb to be diagonal Gaussians. With a fully factorised q(s_{1:B}) we can sample the shooting variables s_{1:B} independently, which allows us to compute the latent states x_{1:N} in parallel by simulating the dynamics only over short sub-intervals. If the posterior q(s_{1:B}) followed the structure of the prior p(s_{1:B} | θ_dyn), we could not utilize these benefits of multiple shooting, since sampling s_{1:B} would require simulating the whole trajectory starting at s_1.

In variational inference we minimize the Kullback-Leibler divergence between the variational approximation and the true posterior,

KL( q(θ_dyn, θ_dec, s_{1:B}) ∥ p(θ_dyn, θ_dec, s_{1:B} | y_{1:N}) ),   (16)

which is equivalent to maximizing the ELBO, which for our model is

L = E_{q(θ_dec, s_1)}[ log p(y_1 | s_1, θ_dec) ]                                      (i) data likelihood
  + Σ_{b=1}^{B} Σ_{i∈I_b} E_{q(θ_dyn, θ_dec, s_b)}[ log p(y_i | s_b, θ_dyn, θ_dec) ]  (ii) data likelihood   (17)
  - KL( q(s_1) ∥ p(s_1) )                                                             (iii) initial state prior
  - Σ_{b=2}^{B} E_{q(θ_dyn, s_{b-1})}[ KL( q(s_b) ∥ p(s_b | s_{b-1}, θ_dyn) ) ]       (iv) continuity prior   (18)
  - KL( q(θ_dyn) ∥ p(θ_dyn) )                                                         (v) dynamics prior
  - KL( q(θ_dec) ∥ p(θ_dec) )                                                         (vi) decoder prior.   (19)

Appendix B contains a detailed derivation of the ELBO and fully specifies the model and the approximate posterior. While terms (iii), (v) and (vi) have a closed form, computation of terms (i), (ii) and (iv) involves approximations: Monte Carlo sampling for the expectations, and numerical integration for the solutions of the initial value problems. Appendix C details the computation of the ELBO.

Forecasting. Given the first m observations y*_{1:m} of a test trajectory, we predict a future observation y*_n via the approximate posterior predictive distribution

p(y*_n | y*_{1:m}, y_{1:N}) ≈ ∫ p(y*_n | s*_1, θ_dyn, θ_dec) q_ψ*1(s*_1) q_ψdyn(θ_dyn) q_ψdec(θ_dec) ds*_1 dθ_dyn dθ_dec,

where ψ*_1 = h_θenc(y*_{1:m}). The expectation is estimated via Monte Carlo integration (Appendix C).
Note that inferring s * m instead of s * 1 could lead to more accurate predictions, but in this work we use s * 1 to simplify implementation of the method.

3.3. ENCODER

We want to design an encoder capable of operating on irregular time grids, handling noisy and partially observed data, and parallelizing the computation of the local variational parameters ψ_{1:B}. The transformer (Vaswani et al., 2017) satisfies most of these requirements but is not directly applicable to our setup, so we design a transformer-based encoder with time-aware attention and continuous relative positional encodings. These modifications provide useful inductive biases and allow the encoder to operate effectively on input sequences with a temporal component. The encoder computes ψ_{1:B} as (see Figure 5(a-b)):

ψ_{1:B} = h_θenc(y_{1:N}) = h_read(h_agg(h_comp(y_{1:N}))),

where

1. h_comp : R^D → R^{D_low} compresses the observations y_{1:N} ∈ R^{D×N} into a low-dimensional sequence a_{1:N} ∈ R^{D_low×N}, where D_low ≪ D.
2. h_agg : R^{D_low×N} → R^{D_low×B} aggregates information across a_{1:N} into b_{1:B} ∈ R^{D_low×B}, where b_i is located at the temporal position of s_i (Figure 5(b)).
3. h_read : R^{D_low} → R^P reads the parameters ψ_{1:B} ∈ R^{P×B} from b_{1:B}.

Transformations h_comp and h_read are any suitable differentiable functions. Transformation h_agg is a transformer encoder (Vaswani et al., 2017), a sequence-to-sequence mapping represented by a stack of L layers (Figure 5(a)). Each layer ℓ ∈ {1, ..., L} contains an attention sub-layer, which maps an input sequence α^{(ℓ)}_{1:N} := (α^{(ℓ)}_1, ..., α^{(ℓ)}_N) ∈ R^{D_low×N} to an output sequence β^{(ℓ)}_{1:N} := (β^{(ℓ)}_1, ..., β^{(ℓ)}_N) ∈ R^{D_low×N}, followed by FF(·), a feed-forward network with a residual connection. In the following we drop the index ℓ for notational simplicity since each layer has the same structure. The attention sub-layer for the standard scaled dot-product self-attention (assuming a single attention head) is defined via the dot product C^DP_ij, softmax C_ij, and weighted average β_i (Vaswani et al., 2017):

C^DP_ij = ⟨W_Q α_i, W_K α_j⟩ / √(D_low),   C_ij = exp(C^DP_ij) / Σ_{k=1}^{N} exp(C^DP_ik),   β_i = Σ_{j=1}^{N} C_ij W_V α_j,

where W_Q, W_K, W_V ∈ R^{D_low×D_low} are learnable layer-specific parameter matrices and C ∈ R^{N×N} is the attention matrix. This standard formulation of self-attention works poorly on irregularly sampled trajectories (see Section 4). Next, we discuss the modifications that make it applicable to irregularly sampled data.

Temporal attention. Dot-product attention has no notion of time, hence it can attend to arbitrary elements of the input sequence. To make β_i depend mostly on input elements close in time to t_i, we augment the dot-product attention with temporal attention C^TA_ij and redefine the attention matrix as

C^TA_ij = ln(ε) ( |t_j - t_i| / δ_r )^p,   C_ij = exp(C^DP_ij + C^TA_ij) / Σ_{k=1}^{N} exp(C^DP_ik + C^TA_ik),   (20)

where ε ∈ (0, 1], p ∈ N, and δ_r ∈ R_{>0} are constants. Since exp(C^DP_ij + C^TA_ij) = exp(C^DP_ij) exp(C^TA_ij), the main purpose of temporal attention is to reduce the amount of attention from β_i to α_j as the temporal distance |t_i - t_j| grows. The parameter δ_r defines the distance beyond which exp(C^DP_ij) is scaled down by at least a factor of ε, while p defines the shape of the scaling curve. Figure 6(a) shows the scaling curves for various values of p.

Relative positional encodings. To make β_i independent of its absolute temporal position t_i, we replace the standard global positional encodings with relative positional encodings, defined as

P_ij = w ⊙ hardtanh( (t_j - t_i) / δ_r ),   and redefine   β_i = Σ_{j=1}^{N} C_ij (W_V α_j + P_ij),

where w ∈ R^{D_low} is a vector of trainable parameters, ⊙ denotes element-wise multiplication, and δ_r is the same as for temporal attention.
This formulation is synergistic with temporal attention, as it ensures that β_i has useful positional information about α_j only if |t_i - t_j| < δ_r, which further forces β_i to depend on input elements close to t_i (see Figure 6(b)). In this work we share w across attention sub-layers. For further details about the encoder, see Appendix E. In Appendix F we investigate the effects of p and δ_r. In Appendix J we compare our transformer-based aggregation function with the ODE-RNN of Rubanova et al. (2019). Note that our encoder can process input sequences of varying lengths. Also, as discussed in Section 3.2, at test time we set B = 1 so that the encoder outputs only the first parameter vector ψ_1, since we are only interested in the initial state s_1, from which we predict the test trajectory.
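A minimal single-head numpy sketch of the time-aware attention above (function and variable names are ours; hardtanh is implemented as clipping to [-1, 1]): the temporal bias C^TA damps attention between elements whose temporal distance exceeds δ_r, and the relative positional term P_ij is bounded and shared across pairs via the trainable vector w.

```python
import numpy as np

def time_aware_attention(alpha, t, W_q, W_k, W_v, w, eps=0.01, p=2, delta_r=1.0):
    """Single-head attention with temporal bias C^TA and relative encodings P."""
    D_low = alpha.shape[1]
    C_dp = (alpha @ W_q.T) @ (alpha @ W_k.T).T / np.sqrt(D_low)  # dot product
    dt = np.abs(t[:, None] - t[None, :])
    C_ta = np.log(eps) * (dt / delta_r) ** p                     # temporal bias
    C = np.exp(C_dp + C_ta)
    C = C / C.sum(axis=1, keepdims=True)                         # row-wise softmax
    # P_ij = w ⊙ hardtanh((t_j - t_i) / delta_r); hardtanh == clip to [-1, 1]
    P = w * np.clip((t[None, :, None] - t[:, None, None]) / delta_r, -1.0, 1.0)
    beta = C @ (alpha @ W_v.T) + np.einsum('ij,ijd->id', C, P)
    return beta, C

rng = np.random.default_rng(0)
N, D_low = 5, 4
alpha = rng.normal(size=(N, D_low))
t = np.array([0.0, 0.1, 0.2, 2.0, 2.1])            # irregular time grid
W_q, W_k, W_v = (0.1 * rng.normal(size=(D_low, D_low)) for _ in range(3))
w = 0.1 * rng.normal(size=D_low)
beta, C = time_aware_attention(alpha, t, W_q, W_k, W_v, w)
print(C[0, 3] < C[0, 1])  # temporally distant pairs get vanishing attention
```

With δ_r = 1, element 0 (at t = 0) attends almost exclusively to elements 1 and 2, while the pair at t ≈ 2 is suppressed by a factor of roughly ε^4, matching the scaling behaviour described above.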

4. EXPERIMENTS

To demonstrate the properties and capabilities of our method we use three datasets: PENDULUM, RMNIST, and BOUNCING BALLS, which consist of high-dimensional (D = 1024) observations of physical systems evolving over time (Figure 7) and are often used in the literature on modeling dynamical systems. We generate these datasets on regular and irregular time grids. Unless otherwise stated, we use the versions with irregular time grids. See Appendix D for more details. We train our model for 300,000 iterations with the Adam optimizer (Kingma & Ba, 2015) and a learning rate that decreases exponentially from 3·10^-4 to 10^-5. To simulate the dynamics we use an ODE solver from the torchdiffeq package (Chen et al., 2018) (dopri5 with rtol = atol = 10^-5). We use second-order dynamics and set the latent space dimension d to 32. See Appendix E for a detailed description of the training/validation/testing setup and model architecture. Error bars are standard errors evaluated with five random seeds. Training is done on a single NVIDIA Tesla V100 GPU.
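The exponentially decreasing learning rate mentioned above can be sketched as a geometric interpolation between the initial and final values (the paper does not spell out the exact schedule, so this parameterization is an assumption of ours):

```python
def lr_at(step, total_steps=300_000, lr_start=3e-4, lr_end=1e-5):
    """Geometric interpolation from lr_start down to lr_end over training."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)

# The rate starts at 3e-4, passes through the geometric mean mid-training,
# and ends at 1e-5.
print(lr_at(0), lr_at(150_000), lr_at(300_000))
```

In PyTorch this would typically be realized with `torch.optim.lr_scheduler.ExponentialLR` or `LambdaLR` wrapping the Adam optimizer.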

4.1. REGULAR AND IRREGULAR TIME GRIDS

Here we compare the performance of our model on regular and irregular time grids. As Figure 8 shows, our model performs very similarly on both types of time grids for all datasets, demonstrating strong and robust performance on irregularly sampled data. Next, to investigate how the design choices in our encoder affect the results on irregular time grids, we perform an ablation study in which we remove temporal attention (TA) and relative positional encodings (RPE). Note that when we remove RPE we add the standard sinusoidal positional encodings of Vaswani et al. (2017). The results are shown in Table 1. Removing temporal attention, RPE, or both tends to noticeably increase test errors, indicating the effectiveness of our modifications.

4.2. BLOCK SIZE

Our model operates on sub-trajectories whose lengths are controlled by the block sizes, i.e., the number of observations in each block (Section 3.1). Here we set the size of all blocks to a given value and demonstrate how it affects the performance of our model. Figure 9 shows test errors and training times for various block sizes. We see that the optimal block size is much smaller than the length of the observed trajectory (51 in our case), and that in some cases the model benefits from increasing the block size, but only up to a point after which the performance starts to drop. We also see how the ability to parallelize computations across blocks improves training times.

4.3. CONTINUITY PRIOR

Our model divides training sequences into blocks and uses the continuity prior (Equation 9) to enforce continuity of the latent trajectories across the blocks. Here we investigate how the strength of the prior (in terms of σ_c) affects the model's performance. In Figure 10 we show results for different values of σ_c. We see that a stronger continuity prior tends to improve the results. For BOUNCING BALLS with σ_c = 2·10^-5 the model failed to learn meaningful latent dynamics, perhaps due to an excessively strong continuity prior. For new datasets the continuity prior, as well as other hyperparameters, can be set e.g. by cross-validation. In Appendix I we also show how the value of σ_c affects the gap between the blocks.

4.4. CONSTRAINING POSTERIOR VARIANCE

We found that constraining the variance of the approximate posteriors q_ψi(s_i) to be at least τ²_min > 0 (in each direction) can noticeably improve the performance of our model. In Figure 11 we compare the results for τ_min = 0 and τ_min = 0.02. As can be seen, this simple constraint greatly improves the model's performance on more complex datasets. This constraint can be viewed as an instance of noise injection, a technique used to improve the stability of model predictions (Laskey et al., 2017; Sanchez-Gonzalez et al., 2020; Pfaff et al., 2021).
Previous works inject noise into the input data, but we found that injecting noise directly in the latent space produces better results. Details are in Appendix E.4.3.

4.5. COMPARISON TO TRAINING HEURISTICS

As discussed previously, models that compute x_{1:N} directly from x_1 without multiple shooting (so-called single shooting models) require various heuristics to be trained in practice. Here we compare two commonly used heuristics with our multi-block model. First, we train our model with a single block (equivalent to single shooting) and use it as the baseline (SS). Then, we augment SS with the two heuristics and train it on short sub-trajectories (SS+sub) and on progressively increasing trajectory lengths (SS+progr). Finally, we train our sparse multiple shooting model (Ours), which is identical to SS but has multiple blocks and the continuity prior. See Appendix G for details. The results are in Figure 12. The baseline single shooting model (SS) tends to fail during training, with only a few runs converging; hence, SS produces poor predictions on average. Training a single shooting model on short sub-trajectories tends to make the results even worse in our case: while training is relatively easy, SS+sub produces unstable test predictions that quickly blow up. In our case SS+progr was the most effective heuristic, with stable training and reasonable test predictions (though a few become somewhat unstable towards the end). Still, none of the heuristics matched the performance of our sparse multiple shooting model.

4.6. COMPARISON TO OTHER MODELS

We compare our model to recent models from the literature: ODE2VAE (Yildiz et al., 2019) and NODEP (Norcliffe et al., 2021) . Both models learn continuous-time deterministic dynamics in the latent space and use an encoder to map observations to the latent initial state. For comparison we use datasets on regular time grids since ODE2VAE's encoder works only on regular time grids. All models are trained and tested on full trajectories and use the first 8 observations to infer the latent initial state. We use the default parameters and code provided in the ODE2VAE and NODEP papers. All models are trained for the same amount of time. See Appendix H for more details. Figure 13 shows the results. We see that NODEP produces reasonable predictions only for the PENDULUM dataset. ODE2VAE performs slightly better and manages to learn both PENDULUM and RMNIST data quite well, but fails on the most complex BOUNCING BALLS dataset (note that ODE2VAE uses the iterative training heuristic). Our model performs well on all three datasets. Also, see Appendix H.5 for a demonstration of the effect of the training trajectory length on NODEP and ODE2VAE.

5. RELATED WORK

The problem of training on long trajectories is not new, and multiple shooting (MS) was proposed as a solution long ago (van Domselaar & Hemker, 1975; Baake et al., 1992; Voss et al., 2004). Recent works have adapted MS to modern neural-network-based models and large-data regimes. Jordana et al. (2021) and Beintema et al. (2021) apply MS directly in the latent space in a fully deterministic setting, but use discrete-time dynamics without amortization or with encoders applicable only to regular time grids, and both use ad-hoc loss terms to enforce continuity (see Appendix H.6 for a comparison against our method). Hegde et al. (2022) proposed a probabilistic formulation of MS for Gaussian-process-based dynamics, but do not use amortization and learn the dynamics directly in the data space. While not directly related to this work, Massaroli et al. (2021) recently proposed to use MS to derive a parallel-in-time ODE solver focused on efficient parallelization of the forward pass, but they do not explicitly consider the long-trajectory problem. Different forms of relative positional encodings (RPE) and distance-based attention were introduced in previous works (e.g., Shaw et al., 2018), but usually for discrete and regular grids; other approaches (2020) are based on global positional encodings and do not constrain the size and shape of the attention windows.

6. CONCLUSION

In this work we developed a method that merges classical multiple shooting with principled probabilistic modeling and efficient amortized variational inference, making this classical technique applicable in modern large-data and large-model regimes. Our method makes it possible to learn large-scale continuous-time dynamical systems from long observations quickly and efficiently, and, owing to its probabilistic formulation, enables principled handling of noisy and partially observed data.

REPRODUCIBILITY STATEMENT

Datasets and data generation processes are described in Appendix D. Model, hyperparameters, architectures, training, validation and testing procedures, and computation algorithms are detailed in Appendices B, C, E. Source code accompanying this work will be made publicly available after review.

A DEPENDENCE OF LOSS LANDSCAPE ON THE OBSERVATION INTERVAL

Here we demonstrate how the complexity of the loss landscape grows with the length of the training trajectory. For simplicity, we train a neural ODE model similar to the L-NODE model in Equations 1-2, but with g_θdec being the identity function. The dynamics function is an MLP with two hidden layers of size 16 and hyperbolic tangent nonlinearities. The training data consists of a single 2-dimensional trajectory observed over the time interval [0, 20] seconds (see Figure 14). The trajectory is generated by solving the ODE

d²x(t)/dt² = -9.81 sin(x(t)),

with the initial position being 90 degrees (relative to the vertical) and the initial velocity being zero. The training data is generated by saving the solution of the ODE every 0.1 seconds. We train the model with the MSE loss using the Adam optimizer (Kingma & Ba, 2015) and the dopri5 adaptive solver from the torchdiffeq package (Chen et al., 2018). We start training on the first 10 points of the trajectory and double that length every 3000 iterations (hence the spikes in the loss plot in Figure 15). At the end of each 3000-iteration cycle (right before doubling the training trajectory length) we plot the loss landscape around the parameter value to which the optimizer converged. Let θ be the point to which the optimizer converged during the given cycle; we denote the corresponding loss value by a marker in Figure 15. We then plot the loss landscape around θ by evaluating the loss at parameter values cθ, where c ∈ [-4, 6]. For the given observation time interval, a trajectory of length 10 is easy to fit and is hence considered "short".
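The landscape probe described above (evaluating the loss at scaled parameter values cθ) can be sketched as follows; the quadratic toy loss is a stand-in for the actual neural ODE training loss, and all names are illustrative.

```python
import numpy as np

def toy_loss(params):
    """Quadratic stand-in for the neural ODE training loss, minimised at 1."""
    return float(np.sum((params - 1.0) ** 2))

theta = np.array([1.0, 1.0])            # pretend this is the converged optimum
cs = np.linspace(-4.0, 6.0, 101)        # scaling factors c in [-4, 6]
landscape = [toy_loss(c * theta) for c in cs]
print(min(landscape))                    # minimum attained near c = 1
```

For the real model, the same one-dimensional slice through parameter space is what produces the landscape plots in Figure 15.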

B MODEL, APPROXIMATE POSTERIOR, AND ELBO

Here we provide details about our model, approximate posterior and derivation of the ELBO.

Joint distribution

The joint distribution is

p(y_{1:N}, s_{1:B}, θ_dyn, θ_dec) = p(y_{1:N} | s_{1:B}, θ_dyn, θ_dec) p(s_{1:B} | θ_dyn) p(θ_dyn) p(θ_dec),

with

p(θ_dyn) = N(θ_dyn | μ_θdyn, σ²_θdyn I),   p(θ_dec) = N(θ_dec | μ_θdec, σ²_θdec I),

p(s_{1:B} | θ_dyn) = p(s_1) ∏_{b=2}^{B} p(s_b | s_{b-1}, θ_dyn)   (28)
                   = N(s_1 | μ_0, σ²_0 I) ∏_{b=2}^{B} N( s_b | ODEsolve(s_{b-1}, t_[b-1], t_[b], f_θdyn), σ²_c I ),

p(y_{1:N} | s_{1:B}, θ_dyn, θ_dec) = N(y_1 | g_θdec(s_1), σ²_Y I) ∏_{b=1}^{B} ∏_{i∈I_b} N( y_i | g_θdec(ODEsolve(s_b, t_[b], t_i, f_θdyn)), σ²_Y I )   (32)
                                   = N(y_1 | g_θdec(x_1), σ²_Y I) ∏_{b=1}^{B} ∏_{i∈I_b} N(y_i | g_θdec(x_i), σ²_Y I),   (33)

where N is the Gaussian distribution, I is the identity matrix of the appropriate size, and σ²_Y is the observation noise variance, shared across data dimensions.

Approximate posterior

The family of approximate posteriors is defined as

q(θ_dyn, θ_dec, s_{1:B}) = q(θ_dyn) q(θ_dec) ∏_{b=1}^{B} q(s_b)   (34)
 = N(θ_dyn | γ_θdyn, diag(τ²_θdyn)) N(θ_dec | γ_θdec, diag(τ²_θdec)) ∏_{b=1}^{B} N(s_b | γ_b, diag(τ²_b)),   (35)

where diag(τ) is a matrix with the vector τ on the main diagonal.

ELBO. The ELBO can be written as

L = ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ p(y_{1:N}, s_{1:B}, θ_dyn, θ_dec) / q(θ_dyn, θ_dec, s_{1:B}) ] dθ_dyn dθ_dec ds_{1:B}   (36)
  = ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ p(y_{1:N} | s_{1:B}, θ_dyn, θ_dec) p(s_{1:B} | θ_dyn) p(θ_dyn) p(θ_dec) / ( q(s_{1:B}) q(θ_dyn) q(θ_dec) ) ] dθ_dyn dθ_dec ds_{1:B}
  = ∫ q(θ_dyn, θ_dec, s_{1:B}) ln p(y_{1:N} | s_{1:B}, θ_dyn, θ_dec) dθ_dyn dθ_dec ds_{1:B}
   - ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ q(s_{1:B}) / p(s_{1:B} | θ_dyn) ] dθ_dyn dθ_dec ds_{1:B}   (39)
   - ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ q(θ_dyn) / p(θ_dyn) ] dθ_dyn dθ_dec ds_{1:B}   (40)
   - ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ q(θ_dec) / p(θ_dec) ] dθ_dyn dθ_dec ds_{1:B}   (41)
  = L_1 - L_2 - L_3 - L_4.

We consider each term L_i separately. For L_1, the likelihood factorises over blocks, and each factor depends only on the variables of its own block, so all remaining variables integrate out:

L_1 = ∫ q(θ_dyn, θ_dec, s_{1:B}) ln [ p(y_1 | s_1, θ_dec) ∏_{b=1}^{B} p( {y_i}_{i∈I_b} | s_b, θ_dyn, θ_dec ) ] dθ_dyn dθ_dec ds_{1:B}
    = E_{q(θ_dec, s_1)}[ ln p(y_1 | s_1, θ_dec) ] + Σ_{b=1}^{B} Σ_{i∈I_b} E_{q(θ_dyn, θ_dec, s_b)}[ ln p(y_i | s_b, θ_dyn, θ_dec) ].

For L_2, both q(s_{1:B}) and the prior p(s_{1:B} | θ_dyn) factorise over blocks, so the log-ratio splits into pairwise terms, each involving only (s_{b-1}, s_b, θ_dyn):

L_2 = ∫ q(s_1) ln [ q(s_1) / p(s_1) ] ds_1
    + Σ_{b=2}^{B} ∫ q(θ_dyn, s_{b-1}) [ ∫ q(s_b) ln [ q(s_b) / p(s_b | s_{b-1}, θ_dyn) ] ds_b ] dθ_dyn ds_{b-1}
    = KL( q(s_1) ∥ p(s_1) ) + Σ_{b=2}^{B} E_{q(θ_dyn, s_{b-1})}[ KL( q(s_b) ∥ p(s_b | s_{b-1}, θ_dyn) ) ],   (65)

where KL is the Kullback-Leibler divergence. Finally,

L_3 = KL( q(θ_dyn) ∥ p(θ_dyn) ),   L_4 = KL( q(θ_dec) ∥ p(θ_dec) ).   (66)

Computing the ELBO. All expectations are approximated using Monte Carlo integration with a single sample, that is,

E_{p(z)}[f(z)] ≈ f(ζ),   where ζ is sampled from p(z).   (67)

The KL terms contain only Gaussian distributions and can therefore be computed in closed form.
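The two numerical ingredients above, the closed-form KL divergence between diagonal Gaussians and the single-sample Monte Carlo estimate of an expectation, can be sketched as follows (function names are ours):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def mc_expectation(f, mu, var, rng):
    """One-sample Monte Carlo estimate E_{N(mu, diag(var))}[f(z)] ≈ f(zeta), cf. Eq. 67."""
    zeta = rng.normal(mu, np.sqrt(var))
    return f(zeta)

mu_q, var_q = np.zeros(3), np.ones(3)
print(kl_diag_gauss(mu_q, var_q, mu_q, var_q))  # 0.0
```

The KL sketch covers terms (iii), (v) and (vi) of the ELBO, while `mc_expectation` (combined with an ODE solve inside f) covers terms (i), (ii) and (iv).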

D DATASETS

Figure 16: Examples of trajectories from the PENDULUM dataset.

Here we provide details about the datasets used in this work and about the data generation procedures. The datasets we selected are commonly used in the literature concerned with modeling of temporal processes (Karl et al., 2017; Ha et al., 2019; Casale et al., 2018; Yildiz et al., 2019; Norcliffe et al., 2021; Sutskever et al., 2008; Lotter et al., 2015; Hsieh et al., 2018; Gan et al., 2015). To the best of our knowledge, previous works consider these datasets only on regular time grids (i.e., the temporal distance between consecutive observations is constant). Since in this work we are mostly interested in processes observed at irregular time intervals, we generate these datasets on both regular and irregular time grids. The datasets and data generation scripts can be downloaded at https://github.com/yakovlev31/msvi.

D.1 PENDULUM

This dataset consists of images of a pendulum moving under the influence of gravity. Each trajectory is generated by sampling the initial angle x and angular velocity ẋ of the pendulum and simulating its dynamics over a period of time. The algorithm for simulating one trajectory is:

1. Sample the initial angle x ∼ Uniform[0, 2π] (in rads) and the initial angular velocity ẋ.
2. Generate the time grid (t_1, ..., t_N).
3. Solve the ODE d²x(t)/dt² = −9.81 sin(x(t)) with initial state (x, ẋ) at time points (t_1, ..., t_N).
4. Create the sequence of observations (y_1, ..., y_N) with y_i = observe(x(t_i)), where x(t_i) is the solution of the ODE above at time point t_i and observe(·) is a mapping from the pendulum angle to the corresponding observation.

The training/validation/test sets contain 400/50/50 trajectories. Regular time grids are identical across all trajectories. Irregular time grids are unique for each trajectory. The only constraint we place on the time grids is that they contain N time points (for efficient implementation and meaningful comparison). We set t_1 = 0, t_N = 3, and N = 51. Each observation y_i is a 1024-dimensional vector (a flattened 32 × 32 image).
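Step 3 above can be sketched with a standard fixed-step RK4 integrator. This is a minimal sketch in pure Python; the solver choice, step count, and function names are our assumptions, not details from the paper:

```python
import math

def simulate_pendulum(x0, v0, ts, steps_per_interval=100):
    """Integrate d2x/dt2 = -9.81 * sin(x) with classical RK4 over the
    time grid ts, returning the pendulum angle at each time point."""
    def deriv(state):
        x, v = state
        return (v, -9.81 * math.sin(x))

    def rk4_step(state, h):
        k1 = deriv(state)
        k2 = deriv((state[0] + 0.5 * h * k1[0], state[1] + 0.5 * h * k1[1]))
        k3 = deriv((state[0] + 0.5 * h * k2[0], state[1] + 0.5 * h * k2[1]))
        k4 = deriv((state[0] + h * k3[0], state[1] + h * k3[1]))
        return (state[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
                state[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

    state, angles = (x0, v0), [x0]
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        h = (t_next - t_prev) / steps_per_interval
        for _ in range(steps_per_interval):
            state = rk4_step(state, h)
        angles.append(state[0])
    return angles
```

For a pendulum released at rest from a small angle, the trajectory should stay within the initial amplitude (by energy conservation), which is a quick sanity check on the integrator.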

D.2 RMNIST

This dataset consists of images of rotating digits "3" sampled from the MNIST dataset. Each trajectory is generated by sampling a digit 3 from MNIST uniformly at random without replacement, then sampling the initial angle x and angular velocity ẋ and simulating the frictionless rotation of the digit. The algorithm for simulating one trajectory is:

1. Sample a digit 3 from the MNIST dataset uniformly at random without replacement.
2. Sample x ∼ Uniform[0, 2π] (in rads) and ẋ ∼ Uniform[π, 2π] (in rads/second).
3. Generate the time grid (t_1, ..., t_N). Regular time grids are generated by placing the time points at equal distances along the time interval [t_1, t_N], with the first time point at t_1 and the last at t_N. Irregular time grids are generated by sampling N points from [t_1, t_N] uniformly at random, with the first time point fixed at t_1, the last fixed at t_N, and the minimum distance between time points constrained to be larger than (t_N − t_1)/(4(N − 1)) (i.e., a quarter of the time step of a regular grid).
4. Solve the ODE dx(t)/dt = ẋ with initial state x at time points (t_1, ..., t_N).
5. Create the sequence of observations (y_1, ..., y_N) with y_i = observe(x(t_i)), where x(t_i) is the solution of the ODE above at time point t_i and observe(·) is a mapping from the digit angle to the corresponding observation.

The training/validation/test sets contain 4000/500/500 trajectories. Regular time grids are identical across all trajectories. Irregular time grids are unique for each trajectory. The only constraint we place on the time grids is that they contain N time points (for efficient implementation and meaningful comparison). We set t_1 = 0, t_N = 2, and N = 51. Each observation y_i is a 1024-dimensional vector (a flattened 32 × 32 image).
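The irregular grid of step 3 can be sampled without rejection by reserving the minimum gaps up front and distributing the remaining slack uniformly. This is a sketch of one way to satisfy the constraint; the paper does not specify its sampler, and the function name is ours:

```python
import random

def irregular_time_grid(t1, tN, N, rng=random):
    """Sample an irregular time grid on [t1, tN] with fixed endpoints, N points
    in total, and every gap at least (tN - t1) / (4 * (N - 1)), i.e. a quarter
    of the regular step. Trick: reserve N-1 minimum gaps, then place the N-2
    interior points uniformly in the leftover slack and add the gaps back."""
    min_gap = (tN - t1) / (4 * (N - 1))
    slack = (tN - t1) - (N - 1) * min_gap
    offsets = sorted(rng.uniform(0.0, slack) for _ in range(N - 2))
    grid = [t1]
    for i, u in enumerate(offsets, start=1):
        grid.append(t1 + u + i * min_gap)
    grid.append(tN)
    return grid
```

Each consecutive gap equals the spacing of the sorted uniform offsets plus `min_gap`, so the minimum-distance constraint holds by construction.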

E.1.4 TESTING

Predictions for the test trajectories are made as described in Section 3.2. Similarly to validation, we use all observations within the interval [t 1 , t 1 + δ test ] as initial observations from which we predict the latent initial state. We set δ test to 15% of the observation interval [t 1 , t N ].

E.2 PRIORS

As discussed in Appendix B, we use the following priors:
\[
p(\theta_{\mathrm{dyn}}) = \mathcal{N}(\theta_{\mathrm{dyn}} \mid \mu_{\theta_{\mathrm{dyn}}}, \sigma^2_{\theta_{\mathrm{dyn}}} I), \qquad
p(\theta_{\mathrm{dec}}) = \mathcal{N}(\theta_{\mathrm{dec}} \mid \mu_{\theta_{\mathrm{dec}}}, \sigma^2_{\theta_{\mathrm{dec}}} I),
\]
\[
p(s_{1:B} \mid \theta_{\mathrm{dyn}}) = \mathcal{N}(s_1 \mid \mu_0, \sigma_0^2 I) \prod_{b=2}^{B} \mathcal{N}\big(s_b \,\big|\, \mathrm{ODEsolve}(s_{b-1}, t_{[b-1]}, t_{[b]}, f_{\theta_{\mathrm{dyn}}}), \sigma_c^2 I\big).
\]
We set μ_θdyn = μ_θdec = 0, σ_θdyn = σ_θdec = 1, μ_0 = 0, σ_0 = 1, and σ_c = ξ/√d, where ξ denotes the required average distance between s_i and x_i, and d is the latent space dimension. In this work we use d = 32. The parameter ξ is dataset specific: for PENDULUM and RMNIST we set ξ = 10⁻⁴, and for BOUNCING BALLS we set ξ = 10⁻³.
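The scaling σ_c = ξ/√d makes the expected squared norm of a d-dimensional Gaussian displacement equal d·σ_c² = ξ², so the root-mean-square block-to-block gap is approximately ξ. A quick numerical sanity check of this choice (a sketch with the paper's PENDULUM/RMNIST values; variable names are ours):

```python
import math
import random

d = 32        # latent dimension used in the paper
xi = 1e-4     # required average distance for PENDULUM and RMNIST
sigma_c = xi / math.sqrt(d)

# With per-dimension std sigma_c, E[||eps||^2] = d * sigma_c^2 = xi^2,
# so the RMS norm of the displacement should come out close to xi.
random.seed(0)
n_samples = 20000
mean_sq = sum(
    sum(random.gauss(0.0, sigma_c) ** 2 for _ in range(d))
    for _ in range(n_samples)
) / n_samples
rms = math.sqrt(mean_sq)
print(rms)  # close to xi = 1e-4
```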

E.3 VARIATIONAL PARAMETERS

As discussed in Appendix B, we use the following family of approximate posteriors:
\[
q(\theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}}, s_{1:B}) = \mathcal{N}(\theta_{\mathrm{dyn}} \mid \gamma_{\theta_{\mathrm{dyn}}}, \operatorname{diag}(\tau^2_{\theta_{\mathrm{dyn}}}))\,
\mathcal{N}(\theta_{\mathrm{dec}} \mid \gamma_{\theta_{\mathrm{dec}}}, \operatorname{diag}(\tau^2_{\theta_{\mathrm{dec}}}))
\prod_{b=1}^{B} \mathcal{N}(s_b \mid \gamma_b, \operatorname{diag}(\tau_b^2)).
\]
While γ_b and τ_b are provided by the encoder, the other variational parameters are optimized directly. We initialize γ_θdyn and γ_θdec using the default Xavier (Glorot & Bengio, 2010) initialization of the dynamics function and decoder (see the PyTorch 1.12 (Paszke et al., 2019) documentation for details). We initialize τ_θdyn and τ_θdec as vectors with each element equal to 9 · 10⁻⁴.

E.4 MODEL ARCHITECTURE

E.4.1 DYNAMICS FUNCTION

Many physical systems, including the ones we consider in this work, are naturally modeled using second-order dynamics. We structure the latent space and dynamics function to include this useful inductive bias in our model. In particular, we follow Yildiz et al. (2019) and split the latent space into two parts representing "position" and "velocity". That is, we represent the latent state x(t) ∈ R^d as a concatenation of two components,
\[
x(t) = \begin{pmatrix} x^p(t) \\ x^v(t) \end{pmatrix},
\]
where x^p(t) ∈ R^{d/2} is the position component and x^v(t) ∈ R^{d/2} is the velocity component. We then represent the dynamics function f_θdyn(t, x(t)) as
\[
f_{\theta_{\mathrm{dyn}}}(t, x(t)) = \begin{pmatrix} x^v(t) \\ f^v_{\theta_{\mathrm{dyn}}}(t, x(t)) \end{pmatrix},
\]
where f^v_θdyn : R × R^d → R^{d/2} is the dynamics function modeling the instantaneous rate of change of the velocity component. In all our experiments we remove the dependence of f^v_θdyn on time t and represent it as a multi-layer perceptron whose architecture depends on the dataset:

• PENDULUM: input size d, output size d/2, two hidden layers of size 256 with ReLU nonlinearities.

h^p_agg and h^v_agg are transformer encoders with our temporal dot-product attention and relative positional encodings (Section 3.3). The number of layers (i.e., L in Figure 5) is 4 for h^p_agg and 8 for h^v_agg. We set D_low = 128, ε = 10⁻², p = ∞ (i.e., use masking), and δ_r to 15% of the training time interval [t_1, t_N]. For both aggregation functions we use temporal attention only at the first layer, since we found that this slightly improves performance. In Appendix F we investigate the effects of p and δ_r on the model's performance.

h_read is a mapping from b_i to ψ_i. Recall that we define b_i as the concatenation of b^p_i and b^v_i, so h_read is defined as
\[
h_{\mathrm{read}}(b_i) =
\begin{pmatrix}
\mathrm{Linear}(b^p_i) \\ \exp(\mathrm{Linear}(b^p_i)) \\ \mathrm{Linear}(b^v_i) \\ \exp(\mathrm{Linear}(b^v_i))
\end{pmatrix}
=
\begin{pmatrix} \gamma^p_i \\ \tau^p_i \\ \gamma^v_i \\ \tau^v_i \end{pmatrix}
= \psi_i,
\]
where Linear(·) is a linear layer (a different one on each line).

Constraining the variance of the approximate posteriors. As we showed in Section 4.4, forcing the variance of the approximate posteriors q_ψi(s_i) to be at least τ²_min > 0 in each direction can greatly improve the model's performance. In practice, we implement this constraint by simply adding τ_min to τ^p_i. We do not add τ_min to τ^v_i, as we found that it tends to make long-term predictions less accurate.

Structured attention dropout. We found that dropping the attention between random elements of the input and output sequences improves the performance of our model on regular time grids and for block sizes larger than one. In particular, at each attention layer we set an element of the unnormalized attention matrix C^DP_ij + C^TA_ij to −∞ with some probability (0.1 in this work), which ensures that the corresponding element of C_ij is zero. This is similar to the DropAttention of Zehui et al. (2019); however, in our case we do not drop arbitrary elements, but leave the diagonal of C_ij and one of the first off-diagonal elements unchanged. This ensures that output element i has access to at least the i'th element of the input sequence and to one of its immediate neighbors.
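The structured attention dropout described above can be sketched as follows, operating on plain Python lists for clarity (function and variable names are ours; a real implementation would work on tensors inside the attention layer):

```python
import random

def structured_attention_dropout(scores, p=0.1, rng=random):
    """Randomly set entries of an unnormalized attention matrix to -inf
    (so they become exactly zero after the softmax), but never the diagonal
    entry or one randomly chosen immediate off-diagonal neighbor of each row:
    output i always keeps access to input i and to one of its neighbors."""
    n = len(scores)
    out = [row[:] for row in scores]
    for i in range(n):
        if i == 0:                       # only one neighbor exists at the edges
            keep = 1
        elif i == n - 1:
            keep = n - 2
        else:
            keep = i + rng.choice([-1, 1])
        for j in range(n):
            if j != i and j != keep and rng.random() < p:
                out[i][j] = float("-inf")
    return out
```

With drop probability 1.0, each row keeps exactly two finite entries: the diagonal and one adjacent neighbor, which is the invariant the text describes.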

F PROPERTIES OF THE ENCODER

Our encoder has parameters p and δ_r, which control the shape and size, respectively, of the temporal attention windows (see Section 3.3). Here we investigate how these parameters affect our model's performance. At test time we assume access to observations within some initial time interval [t_1, t_1 + δ_test]. Figure 21 (left) shows no conclusive effect from the shape of the attention window. On the other hand, as Figure 21 (right) shows, the parameter δ_r has a noticeable effect on all three datasets. The curves are U-shaped, with the best performance at δ_r = δ_test/2. We also see that too wide attention windows (i.e., large δ_r) tend to increase the error.

G.2 HEURISTICS

Training on sub-trajectories. Here, instead of training on full trajectories, at each training iteration we randomly select a short sub-trajectory from each full trajectory and train on these sub-trajectories. For the PENDULUM/RMNIST/BOUNCING BALLS datasets we used sub-trajectories of length 2/2/6. These lengths were selected to be identical to the sub-trajectories used in the multiple shooting version of our model (Ours).

Increasing training trajectory length. Here, instead of starting training on full trajectories, we start training on a small number of initial observations and then gradually increase the training trajectory length. In particular, for the PENDULUM and RMNIST datasets we start training on the first 5 observations and then double that length every 10k iterations until we reach the full length. For the BOUNCING BALLS dataset we start training on the first 2 observations and then double that length every 10k iterations until we reach the full length.
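The doubling schedule above can be sketched as a small helper (the function name and defaults are ours; defaults use the PENDULUM/RMNIST values with the full length N = 51):

```python
def training_length(iteration, start_len=5, full_len=51, double_every=10_000):
    """Sub-trajectory length at a given training iteration: start short and
    double every `double_every` iterations, capped at the full length."""
    return min(start_len * 2 ** (iteration // double_every), full_len)

print([training_length(i) for i in (0, 9_999, 10_000, 20_000, 40_000)])
# -> [5, 5, 10, 20, 51]
```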

H COMPARISON TO OTHER MODELS

Here we provide details about our model comparison setup in Section 4.6 and show predictions from different models.

H.1 NODEP

NODEP is similar to our model in the sense that it also uses the encode-simulate-decode approach: it takes some number of initial observations, maps them to a latent initial state, simulates the deterministic latent dynamics, and then maps the latent trajectory to the observation space via a decoder. The encoder works by concatenating the initial observations with their temporal positions, mapping each pair to a representation space, and averaging the individual representations to compute the aggregated representation from which the initial latent state is obtained. This encoder allows NODEP to operate on irregular time grids but, due to its simplicity (it is roughly equivalent to a single attention layer), might be unable to accurately estimate the latent initial state. NODEP reported results on a variant of the RMNIST dataset, so we use their setup directly with our RMNIST and PENDULUM datasets. For our BOUNCING BALLS dataset we used 32 filters for the encoder and decoder (close to our model) and the same dynamics function as for our model. We train NODEP using random subsets of the first 8 observations to infer the latent initial state; we found this approach to generalize better than training strictly on the first 8 observations. For validation and testing we always use the first 8 observations.

H.2 ODE2VAE

ODE2VAE is similar to our model in the sense that it also uses the encode-simulate-decode approach: it takes some number of initial observations, maps them to a latent initial state, simulates the deterministic second-order latent dynamics, and then maps the latent trajectory to the observation space via a decoder. The encoder computes the latent initial state by stacking the initial observations and passing them through a CNN. This encoder is flexible, but restricted to regular time grids and a constant number of initial observations. ODE2VAE reported results on variants of the RMNIST and Bouncing Balls datasets, so we use their setup directly with our RMNIST and BOUNCING BALLS datasets. For our PENDULUM dataset we use ODE2VAE with the same setup as for RMNIST. We tried to increase the sizes of the ODE2VAE components, but it resulted in extremely long training times. For training, validation, and testing we use the first 8 observations to infer the latent initial state.

We use the official implementation from Jordana et al. (2021). For the PENDULUM/RMNIST/BOUNCING BALLS datasets we use a penalty constant of 1e3/1e3/1e4, a learning rate of 1e-3/1e-3/3e-4, a batch size of 16/16/64, and 600/600/3000 training epochs. In all cases the number of shooting variables is set to 5. In all cases, the architecture of the dynamics function and decoder is the same as for our model. The encoder of Jordana et al. (2021) first maps the images to low-dimensional vectors using a CNN (we used the same architecture as for our model), and then applies an LSTM (we used a latent state of dimension 1024) to map these vectors to shooting variables.

I STRENGTH OF THE CONTINUITY PRIOR VS GAP BETWEEN BLOCKS

We investigate how the strength of the continuity prior (as measured by σ_c) affects the gap between consecutive blocks of the latent trajectory. We train our model with different values of σ_c and compute the mean squared gap between the end of the current block and the beginning of the next block (i.e., between the latent state x at time t_[b] and the shooting variable s_[b]). We report the results in Table 3. We see that a stronger continuity prior (i.e., smaller σ_c) tends to result in a smaller gap between the blocks and, consequently, in better continuity of the whole trajectory. We also see that better continuity tends to result in smaller prediction errors.
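The inter-block gap metric above can be computed as follows (a minimal sketch on Python lists; the function and argument names are ours):

```python
def mean_squared_gap(block_end_states, next_shooting_vars):
    """Mean squared distance between the latent state at the end of each block
    (x at time t_[b]) and the shooting variable s_[b] starting the next block."""
    assert len(block_end_states) == len(next_shooting_vars)
    total = 0.0
    for x_end, s_next in zip(block_end_states, next_shooting_vars):
        total += sum((a - b) ** 2 for a, b in zip(x_end, s_next))
    return total / len(block_end_states)

print(mean_squared_gap([[1.0, 0.0]], [[0.0, 0.0]]))  # -> 1.0
```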

J USING ODE-RNN AS AGGREGATION FUNCTION

Here we test the effect of replacing our transformer-based aggregation function h_agg with ODE-RNN (Rubanova et al., 2019). For each dataset, we set ODE-RNN's hyperparameters such that the number of parameters is similar to that of our transformer-based h_agg. We report the results in Table 4. We see that on the PENDULUM dataset ODE-RNN works on par with our method, while on the other datasets it has higher test error. The training time for ODE-RNN tends to be much larger than for our method, highlighting the effectiveness of the parallelization provided by the Transformer architecture.



Figure 1: Top: Training loss of an L-NODE model using the iterative training heuristic. We start training on a short trajectory (N = 10) and double its length every 3000 iterations. Training fails for the longest trajectory. Bottom: 1-D projection of the loss landscape around the parameters to which the optimizer converged for a given trajectory length. The complexity of the loss grows dramatically with N.

2. Multiple shooting leads to a new optimisation problem over θ_dyn, θ_dec, and s_{1:N−1}.

Figure 4: An example of sparse multiple shooting with B = 2, I 1 = {2, 3, 4} and I 2 = {5, 6}.

For block b ∈ {1, ..., B}, we define an index set I b containing indices of consecutive time points associated with that block such that ∪ b I b = {2, . . . , N }. We do not include the first time point t 1 in any of the blocks. With every block b we associate observations {y i } i∈I b , time points {t i } i∈I b and a shooting variable s b placed at the first time point before the block. The temporal position of s b is denoted by t [b] . Latent states {x i } i∈I b are computed from s b as

Figure 5: (a) Encoder structure. (b) Encoder with two blocks (i.e., B = 2) operating on input sequence y 1:5 with shooting variables s 1 , s 2 located at t 1 , t 3 .

except for the last layer, which maps α^(L)_{1:N} to β^(L)_{1:B} to match the number of shooting variables. For the first layer, α^(1)_{1:N} = a_{1:N}, and for the last layer, b_{1:B} = FF(β^(L)_{1:B}).

Figure 6: (a) Temporal attention. (b) Relative position encoding.

Figure 7: Top row: PENDULUM dataset consisting of images of a pendulum moving under the influence of gravity. Middle row: RMNIST dataset consisting of images of rotating digits 3. Bottom row: BOUNCING BALLS dataset consisting of images of three balls bouncing in a box.

Figure 8: Test errors for our model on regular and irregular time grids.

Figure 9: Test errors and training times for different block sizes.

Figure 11: Errors for constrained and unconstrained approximate posteriors.

Figure 12: Errors for different heuristics.

Figure 13: Left: Test errors for different models and datasets. Right: For each dataset, we plot data and predictions for NODEP, ODE2VAE and our model (top to bottom). Each sub-plot shows data as the first row, and prediction as the second row. We show prediction with the median test error. See Appendix H.4 for more predictions.

and Raffel et al. (2020) use discrete learnable RPEs which they add to keys, values, or attention scores. Both works use clipping, i.e., learn RPEs only for the k closest points, which is in some sense similar to using the hardtanh function. Press et al. (2022) use discrete distance-based attention which decreases linearly with the distance. Zhao et al. (2021) use continuous learnable RPEs, represented as an MLP that maps the difference between the spatial positions of two points to the corresponding RPEs, which are then added to values and attention scores without clipping. Variants of attention-based models for irregular time series were introduced in Shukla & Marlin (2021) and Zhang et al. (

Figure 14: Pendulum data.

Figure 15: Top: Training loss of a NODE model. We start with a short training trajectory (N = 10) and double its length at the iterations denoted by the markers. Note that training fails for a long enough trajectory. Bottom: One-dimensional projection of the loss landscape around the parameter values to which the optimizer converged for a given trajectory length. Note that the complexity of the loss landscape grows with the trajectory length.

\[
p(y_{1:N} \mid s_{1:B}, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}})
= p(y_1 \mid s_1, \theta_{\mathrm{dec}}) \prod_{b=1}^{B} p(\{y_i\}_{i \in I_b} \mid s_b, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}})
= p(y_1 \mid s_1, \theta_{\mathrm{dec}}) \prod_{b=1}^{B} \prod_{i \in I_b} p(y_i \mid s_b, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}})
\]


Figure 17: Examples of trajectories from the RMNIST dataset.

Figure 18: Examples of trajectories from the BOUNCING BALLS dataset.

Figure 19: Examples of regular and irregular time grids for PENDULUM dataset. At test time, observations before the red lines are used to compute the latent initial state.

Figure 21: Test errors for different values of p and δ r

H.3 OUR MODEL

Our model followed the same setup as described in Appendix E.

H.4 MORE PREDICTIONS

In the model comparison experiment (Section 4.6) we showed only the median test predictions. Here, we plot test predictions corresponding to different percentiles. Figures 22, 23, and 24 show predictions of NODEP, ODE2VAE, and our model.

Figure 22: Predictions on PENDULUM dataset. Shown are test predictions corresponding to different percentiles wrt test MSE. The first snapshot is at t 1 , the last one is at t 51 . The distance between snapshots is five time points. First row is ground truth, second row is the prediction.

Figure 23: Predictions on RMNIST dataset. Shown are test predictions corresponding to different percentiles wrt test MSE. The first snapshot is at t 1 , the last one is at t 51 . The distance between snapshots is five time points. First row is ground truth, second row is the prediction.

Figure 25: Left: Test errors for different models and datasets. Right: For each dataset, we plot ground truth and predictions for NODEP, ODE2VAE and our model (top to bottom). Each sub-plot shows the ground truth as the first row, and the prediction as the second row. We plot test prediction with the median test error (for each model and dataset we select the value of N which gives the best predictions).

Figure 26: Predictions of NODEP and ODE2VAE on the PENDULUM dataset when trained on sub-trajectories of length N. Shown are test predictions with the median test error. The first snapshot is at t_1, the last one is at t_51. The distance between snapshots is five time points. The first row is ground truth, the second row is the prediction.

Figure 27: Predictions of NODEP and ODE2VAE on the RMNIST dataset when trained on sub-trajectories of length N. Shown are test predictions with the median test error. The first snapshot is at t_1, the last one is at t_51. The distance between snapshots is five time points. The first row is ground truth, the second row is the prediction.

Figure 28: Predictions of NODEP and ODE2VAE on the BOUNCING BALLS dataset when trained on sub-trajectories of length N. Shown are test predictions with the median test error. The first snapshot is at t_1, the last one is at t_51. The distance between snapshots is five time points. The first row is ground truth, the second row is the prediction.

Since x_1:N are deterministic functions of s_1:B and θ_dyn, we have the following joint distribution (see Appendix B for more details):
\[
p(y_{1:N}, s_{1:B}, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}}) = p(y_{1:N} \mid s_{1:B}, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}})\, p(s_{1:B} \mid \theta_{\mathrm{dyn}})\, p(\theta_{\mathrm{dyn}})\, p(\theta_{\mathrm{dec}}).
\]

Test MSEs for different ablations.

Note that in this experiment we remove the iterative training heuristic from ODE2VAE to study the sub-trajectory length effects directly. All models are tested on full trajectories and use the first 8 observations to infer the latent initial state. Figure 25 shows results for different values of N. We see that our model outperforms NODEP and ODE2VAE in all cases. We also see that both NODEP and ODE2VAE perform poorly when trained on short sub-trajectories; in the figures below we show that for N = 10 both models perform well on the first N time points, but fail to generalize far beyond the training time intervals, in contrast to our model, which shows excellent generalization. Increasing the sub-trajectory length tends to provide some improvement, but only up to a certain point, where training starts to fail; in the figures below we show how NODEP and ODE2VAE fail for large N.

Overall, we see that NODEP and ODE2VAE tend to perform well when trained and tested on short trajectories, but do not generalize beyond the training time interval very well. Simply training these models on longer sequences does not necessarily help, as the optimization problem becomes harder and training might fail. Our model provides a principled solution to this dilemma by splitting long trajectories into short blocks and utilizing the continuity prior to enforce consistency of the solution across the blocks, thus ensuring easy and fast training with stable predictions over long time intervals.

H.6 COMPARISON AGAINST ANOTHER MULTIPLE-SHOOTING-BASED METHOD

We compare the performance of our method against Jordana et al. (2021), who use a deterministic discrete-time latent dynamics model and apply multiple shooting directly in the latent space without amortization. After training the model, the optimized shooting variables are used to train a discrete-time RNN-based recognition network to map observations to the corresponding shooting variables.
The recognition network is then used at test time to map initial observations to the latent initial state.

Comparison results.

Note that the encoder of Jordana et al. (2021) is trained after the model. The latent space dimension is the same as for our model. At test time we use the first 8 observations to infer the latent initial state.

We applied the method of Jordana et al. (2021) to our datasets with regular and irregular time grids and report the results in Table 2. We found that Jordana et al. (2021) performs quite similarly to our method on the regularly sampled PENDULUM and RMNIST datasets, but fails to produce stable long-term predictions on the BOUNCING BALLS dataset. Also, being a discrete-time method, Jordana et al. (2021) fails on the irregularly sampled versions of the datasets.

Dependence of test MSE and inter-block continuity on σ c .

ACKNOWLEDGMENTS

This work was supported by NVIDIA AI Technology Center Finland.

C COMPUTATION ALGORITHMS

C.1 ELBO

To find the approximate posterior that minimizes the Kullback–Leibler divergence
\[
\operatorname{KL}\big(q(\theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}}, s_{1:B}) \,\|\, p(\theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}}, s_{1:B} \mid y_{1:N})\big),
\]
we maximize the evidence lower bound (ELBO), which for our model is defined as
\[
\begin{aligned}
\mathcal{L} ={}& \underbrace{\mathbb{E}_{q(\theta_{\mathrm{dec}}, s_1)}\big[\log p(y_1 \mid s_1, \theta_{\mathrm{dec}})\big]
+ \sum_{b=1}^{B} \sum_{i \in I_b} \mathbb{E}_{q(\theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}}, s_b)}\big[\log p(y_i \mid s_b, \theta_{\mathrm{dyn}}, \theta_{\mathrm{dec}})\big]}_{\text{(ii) data likelihood}} \\
&- \underbrace{\operatorname{KL}\big(q(s_1) \,\|\, p(s_1)\big)}_{\text{(iii) initial state prior}}
- \underbrace{\sum_{b=2}^{B} \mathbb{E}_{q(\theta_{\mathrm{dyn}}, s_{b-1})}\big[\operatorname{KL}\big(q(s_b) \,\|\, p(s_b \mid s_{b-1}, \theta_{\mathrm{dyn}})\big)\big]}_{\text{(iv) continuity prior}} \\
&- \underbrace{\operatorname{KL}\big(q(\theta_{\mathrm{dyn}}) \,\|\, p(\theta_{\mathrm{dyn}})\big)}_{\text{(v) dynamics prior}}
- \operatorname{KL}\big(q(\theta_{\mathrm{dec}}) \,\|\, p(\theta_{\mathrm{dec}})\big).
\end{aligned}
\]
The ELBO is computed using the following algorithm:

1. Sample θ_dyn, θ_dec from q_ψdyn(θ_dyn), q_ψdec(θ_dec).
2. Sample s_1:B from q_ψ1(s_1), ..., q_ψB(s_B) with ψ_1:B = h_θenc(y_1:N).

3. Compute x_1:N from s_1:B as in Equations 11-12.
4. Compute the ELBO L (KL terms are computed in closed form; for the expectations we use Monte Carlo integration with one sample).

Sampling is done using reparametrization to allow unbiased gradients with respect to the model parameters. We observed that under some hyperparameter configurations the continuity-promoting term (iv) might cause the shooting variables to collapse to a single point, preventing the learning of meaningful dynamics. Downscaling this term helps to avoid the collapse; however, in our experiments we did not use any scaling.
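The reparametrized sampling used in steps 1-2 can be sketched as follows (a minimal sketch; the function name is ours):

```python
import random

def sample_reparam(gamma, tau, rng=random):
    """Reparameterized sample from N(gamma, diag(tau^2)):
    z = gamma + tau * eps with eps ~ N(0, I), so the sample is a
    differentiable function of the variational parameters gamma and tau."""
    return [g + t * rng.gauss(0.0, 1.0) for g, t in zip(gamma, tau)]

# With tau = 0 the sample is deterministic and equals the mean.
print(sample_reparam([0.5, -1.0], [0.0, 0.0]))  # -> [0.5, -1.0]
```

In an autodiff framework the same expression lets gradients flow through `gamma` and `tau` while the randomness stays in the parameter-free `eps`.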

C.2 FORECASTING

Given initial observations y*_1:N1 at time points t*_1:N1, we predict the future observations y*_{N1+1:N2} at time points t*_{N1+1:N2} as the expected value of the (approximate) posterior predictive distribution, where ψ*_1 = h_θenc(y*_1:N1). The expected value is estimated via Monte Carlo integration, so the algorithm for predicting y*_{N1+1:N2} is:

1. Sample θ_dyn, θ_dec from q_ψdyn(θ_dyn), q_ψdec(θ_dec).

2. Sample the initial latent state s*_1 from q_{ψ*_1}(s*_1).
3. Compute the latent trajectory x*_{N1+1:N2} by solving the ODE with initial state s*_1 at time points t*_{N1+1:N2}.
4. Compute the predicted observations y*_i = g_θdec(x*_i) for i ∈ {N1+1, ..., N2}.
5. Repeat steps 1-4 n times and average the predicted trajectories y*_{N1+1:N2} (we use n = 10).
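The forecasting loop above can be sketched end-to-end in pure Python. This is a sketch under our own simplifications: a fixed-step Euler integrator stands in for the adaptive ODE solver, `decode` stands in for the decoder g_θdec, and all names are ours:

```python
import random

def euler_solve(x0, t0, ts, f, substeps=20):
    """Fixed-step Euler integration of dx/dt = f(x) from (t0, x0),
    recording the state at each time point in ts."""
    x, t, out = list(x0), t0, []
    for t_next in ts:
        h = (t_next - t) / substeps
        for _ in range(substeps):
            x = [xi + h * di for xi, di in zip(x, f(x))]
        t = t_next
        out.append(list(x))
    return out

def forecast(gamma, tau, ts, f, decode, n=10, rng=random):
    """Monte Carlo estimate of the posterior predictive mean: repeatedly sample
    the latent initial state via reparametrization (s = gamma + tau * eps),
    integrate the latent ODE, decode, and average the n predicted trajectories."""
    preds = None
    for _ in range(n):
        s = [g + t_ * rng.gauss(0.0, 1.0) for g, t_ in zip(gamma, tau)]
        traj = [decode(x) for x in euler_solve(s, 0.0, ts, f)]
        if preds is None:
            preds = traj
        else:
            preds = [[a + b for a, b in zip(p, q)] for p, q in zip(preds, traj)]
    return [[v / n for v in p] for p in preds]
```

With zero posterior variance and zero dynamics the forecast simply repeats the decoded mean, which makes the averaging easy to check.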

D.3 BOUNCING BALLS

This dataset consists of images of three balls bouncing in a frictionless box. Each trajectory is generated by sampling the initial positions and velocities of the three balls and simulating the frictionless collision dynamics. The algorithm for simulating one trajectory is:

1. Sample the initial positions of the three balls uniformly at random such that the balls do not overlap and do not extend outside the boundaries of the box.

E.4.2 DECODER

The decoder g_θdec maps the latent state x_i to the parameters of p(y_i | g_θdec(x_i)). As we discussed in Appendix B, we set p(y_i | g_θdec(x_i)) = N(y_i | g_θdec(x_i), σ²_Y I), so the decoder outputs the mean of a Gaussian distribution. We treat σ_Y as a hyperparameter and set it to 10⁻³; in our experiments, trying to learn σ_Y resulted in overfitting. Following Yildiz et al. (2019), our decoder utilizes only the "position" part x^p_i of the latent state x_i, since this part is assumed to contain all the information required to reconstruct the observations (see Appendix E.4.1). We represent g_θdec as the composition of a convolutional neural network (CNN) with a sigmoid function to keep the mean in the interval (0, 1). In particular, g_θdec has the following architecture: a linear layer; four transposed convolution layers (2x2 kernel, stride 2) with batch norm and ReLU nonlinearities; a convolution layer (5x5 kernel, padding 2); a sigmoid function. The four transposed convolution layers have 8n, 4n, 2n, and n channels, respectively, and the final convolution layer has n channels. For the PENDULUM, RMNIST, and BOUNCING BALLS datasets we set n to 8, 16, and 32, respectively.

E.4.3 ENCODER

The encoder maps the observations y_1, ..., y_N to the parameters ψ_1, ..., ψ_B of the approximate posterior (Equation 75). In particular, it returns the means γ_1, ..., γ_B and standard deviations τ_1, ..., τ_B of the normal distributions N(s_1 | γ_1, diag(τ²_1)), ..., N(s_B | γ_B, diag(τ²_B)). Using second-order dynamics naturally suggests splitting the parameters into two groups: the first contains parameters for the "position" part of the latent space, while the second contains parameters for the "velocity" part. So we split the means and standard deviations into position and velocity parts, each occupying half of the latent space (dimension d/2), and each ψ_i contains the corresponding means and standard deviations (γ^p_i, τ^p_i, γ^v_i, τ^v_i).

In Section 3.3 we described the structure of our encoder. For ease of exposition we omitted overly general descriptions and presented a simple-to-understand overall architecture (Figure 5 (a)). In practice, however, we use a slightly more general setup, shown in Figure 20: we use two aggregation functions, h^p_agg and h^v_agg, to aggregate information for the position and velocity components separately, and then concatenate b^p_1:B and b^v_1:B to get b_1:B. The other components remain exactly as described in Section 3.3.

Now we describe the sub-components of the encoder. h_comp is represented as a convolutional neural network (CNN) with the following architecture: three convolution layers (5x5 kernel, stride 2, padding 2) with batch norm and ReLU nonlinearities; one convolution layer (2x2 kernel, stride 2) with batch norm and ReLU nonlinearities; a linear layer. The four convolution layers have n, 2n, 4n, and 8n channels, respectively. For the PENDULUM, RMNIST, and BOUNCING BALLS datasets we set n to 8, 16, and 32, respectively.

