LEARNING CONTINUOUS-TIME DYNAMICS BY STOCHASTIC DIFFERENTIAL NETWORKS

Abstract

Learning continuous-time stochastic dynamics is a fundamental and essential problem in modeling sporadic time series, whose observations are irregular and sparse in both time and dimension. For a given system whose latent states and observed data are multivariate, it is generally impossible to derive a precise continuous-time stochastic process to describe the system behaviors. To solve this problem, we apply the variational Bayesian method and propose a flexible continuous-time stochastic recurrent neural network named the Variational Stochastic Differential Network (VSDN), which embeds the complicated dynamics of sporadic time series with neural Stochastic Differential Equations (SDEs). VSDNs capture the stochastic dependencies among latent states and observations with deep neural networks. We also incorporate two differential Evidence Lower Bounds to efficiently train the models. Through comprehensive experiments, we show that VSDNs outperform state-of-the-art continuous-time deep learning models and achieve remarkable performance on prediction and interpolation tasks for sporadic time series.

1. INTRODUCTION AND RELATED WORKS

Many real-world systems experience complicated stochastic dynamics over a continuous time period. The challenges in modeling such stochastic dynamics mainly come from two sources. First, the underlying state transitions of many systems are uncertain, as the systems are placed in unpredictable environments and their states are continuously affected by unknown disturbances. Second, the monitoring data collected may be sparse and at irregular intervals as a result of the sampling strategy or data corruption. A sporadic data sequence loses a large amount of information about the system behavior hidden in the intervals between observations. In order to accurately model and analyze the dynamics of these systems, it is important to reliably and efficiently represent the continuous-time stochastic process based on the discrete-time observations. In some domains, the derivation of a continuous-time stochastic model relies heavily on human knowledge, and many studies focus on its inference problem (Ryder et al., 2018; Särkkä et al., 2015). But in many other domains (e.g., video analysis (Vondrick et al., 2016) and human activity detection (Rubanova et al., 2019)), it is difficult and sometimes intractable to derive an accurate model that captures the underlying temporal evolution from the collected data sequences. Although some studies approximate the stochastic process from collected data, the majority of these methods define the system dynamics with a linear model (Macke et al., 2011; Yu et al., 2009b;a), which cannot represent high-dimensional data with nonlinear relationships well. Recently, Neural Ordinary Differential Equation (ODE) studies (Chen et al., 2018; Rubanova et al., 2019; Jia & Benson, 2019; De Brouwer et al., 2019; Yildiz et al., 2019; Kidger et al., 2020) introduced deep learning models that learn an ODE and apply it to approximate continuous-time dynamics.
Nevertheless, these methods generally neglect the randomness of the latent state trajectories and posit simplified assumptions on the data distribution (e.g. Gaussian), which strongly limits their capability of modeling complicated continuous-time stochastic processes. Compared to the ODE, the Stochastic Differential Equation (SDE) (Jørgensen et al., 2020) is a more practical tool for modeling continuous-time stochastic processes. Recently there have been some studies on bridging the gap between deep neural networks and SDEs (Ha et al., 2018). In (Hegde et al., 2019; Liu et al., 2020; Peluchetti & Favaro, 2020; Wang et al., 2019; Kong et al., 2020), SDEs are introduced to define more robust and accurate deep learning architectures for supervised learning problems (e.g. classification and regression). These studies focus on the design of neural network architectures and are orthogonal to our work on modeling sporadic time series. In (Tzen & Raginsky, 2019a;b), the authors studied the theoretical guarantees of the optimization and inference problems of neural SDEs. In (Li et al., 2020), a stochastic adjoint method is proposed to efficiently compute the gradients of neural SDEs. In this paper, we propose a new continuous-time stochastic recurrent network called the Variational Stochastic Differential Network (VSDN) that incorporates SDEs into a recurrent neural model to effectively capture continuous-time stochastic dynamics based only on sparse or irregular observations. Taking advantage of the capacity of deep neural networks, VSDN has higher flexibility and generalizability in modeling nonlinear stochastic dependencies from high-dimensional observations. Compared to Neural ODEs, VSDN incorporates a latent state trajectory to capture the underlying factors of the system dynamics. The trajectory helps the model represent the data distribution more flexibly and generate the output data more accurately than Neural ODEs.
Parallel to the theoretical analysis (Tzen & Raginsky, 2019a;b) and gradient computations (Li et al., 2020), our study focuses on exploring a feasible variational loss and a flexible recurrent architecture for neural SDEs to model sporadic data. The contributions of this paper are three-fold: 1. We incorporate continuous-time variants of the VAE and IWAE losses into VSDN to train continuous-time stochastic neural networks with latent state trajectories. 2. We propose an efficient and flexible network architecture for VSDN which can model complicated stochastic processes under high-dimensional sporadic data sequences. 3. We conduct comprehensive experiments to show that VSDN outperforms state-of-the-art deep learning methods in modeling continuous-time dynamics and achieves remarkable performance in the prediction and interpolation of irregular or sporadic time series. The rest of this paper is organized as follows. In Section 2, we first present the continuous-time variant of the VAE loss, and then derive a continuous-time IWAE loss to train continuous-time state-space models with deep neural networks. In Section 3, we propose the deep learning structures of VSDN. Comprehensive experiments are presented in Section 4 and conclusions are given in Section 5.

2. CONTINUOUS-TIME VARIATIONAL BAYES

In this section, we first introduce the basic notation and formulate our problem. We then define the continuous-time variants of the Variational Auto-Encoding (VAE) and Importance-Weighted Auto-Encoding (IWAE) lower bounds to enable efficient training of our models. Due to the page limit, we present all derivations in Appendix A.

2.1. BASIC NOTATIONS AND PROBLEM FORMULATION

Throughout this paper, we define $X_t \in \mathbb{R}^{d_1}$ as the continuous-time latent state at time $t$ and $Y_n \in \mathbb{R}^{d_2}$ as the $n$-th discrete-time observation at time $t_n$; $d_1$ and $d_2$ are the dimensions of the latent state and the observation, respectively. $X_{<t}$ is the continuous trajectory before time $t$ and $X_{\leq t}$ is the path up to time $t$. $Y_{n_1:n_2}$ is a sequence of data points and $X_{t_{n_1}:t_{n_2}}$ is the continuous-time state trajectory from $t_{n_1}$ to $t_{n_2}$. $\overrightarrow{Y}_t = \{Y_n \mid t_n < t\}$ denotes the historical observations before $t$ and $\overleftarrow{Y}_t = \{Y_n \mid t_n \geq t\}$ denotes the current and future observations. For simplicity, we also assume that the initial value of the latent state is constant; the results in this paper can easily be extended to the case where the initial state is also a random variable. Given $K$ data sequences $\{y^{(i)}_{1:n_i}\}$, $i = 1, \cdots, K$, the target of our study is to learn an accurate continuous-time generative model $G^*$ that maximizes the log-likelihood:
$$G^* = \arg\max_G \frac{1}{K} \sum_{i=1}^{K} \log P_G(y^{(i)}_{1:n_i}). \qquad (1)$$
For multivariate sequential data, there exists a complicated nonlinear relationship between the observed data and the unobservable latent state, which can be either the physical state of a dynamic system or a low-dimensional manifold of the data. In our study, the latent state evolves in the continuous time domain and generates the observations through some transformation.
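To make the notation concrete, a sporadic sequence can be stored as irregular timestamps, frame values, and a per-dimension observation mask, and the history/future split at a time $t$ mirrors the sets $\overrightarrow{Y}_t$ and $\overleftarrow{Y}_t$ defined above. A toy sketch with synthetic data (all names and values are illustrative, not part of the model):

```python
import random

def make_sporadic_sequence(n_obs=6, d=3, t_max=10.0, p_missing=0.5, seed=0):
    """Toy sporadic sequence: irregular timestamps, frame values, and a
    per-dimension observation mask (1 = observed, 0 = missing)."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0.0, t_max) for _ in range(n_obs))
    values = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_obs)]
    masks = [[1 if rng.random() > p_missing else 0 for _ in range(d)]
             for _ in range(n_obs)]
    return times, values, masks

def split_history_future(times, values, t):
    """History = {Y_n | t_n < t}, future = {Y_n | t_n >= t}."""
    hist = [(tn, y) for tn, y in zip(times, values) if tn < t]
    fut = [(tn, y) for tn, y in zip(times, values) if tn >= t]
    return hist, fut

times, values, masks = make_sporadic_sequence()
hist, fut = split_history_future(times, values, t=5.0)
```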

2.2. CONTINUOUS-TIME VARIATIONAL INFERENCE

In order to capture the underlying stochastic process from sporadic data, we design the generative model as a neural continuous-time state-space model, which consists of a latent Stochastic Differential Equation (SDE) and a conditional distribution of the observation. The latent SDE describes the stochastic process of the latent states, and the conditional distribution depicts the probabilistic dependency of the current data on the latent states and historical observations:
$$dX_t = H_G(X_t, \overrightarrow{Y}_t; t)\,dt + R_G(\overrightarrow{Y}_t; t)\,dW_t, \qquad (2)$$
$$P_G(Y_n \mid Y_{1:n-1}, X_{t_n}) = \Phi(Y_n \mid f(Y_{1:n-1}, X_{t_n})), \qquad (3)$$
where $H_G$ and $R_G$ are the drift and diffusion functions, respectively. The log-likelihood
$$\log P_G(y_{1:n}) = \log \int P_G(X_{\leq t_n}) \prod_{i=1}^{n} P_G(y_i \mid y_{1:i-1}, X_{t_i})\, dX_{\leq t_n} \qquad (4)$$
does not have a closed-form solution in general. Therefore, $G$ cannot be directly trained by maximizing the log-likelihood. To overcome this difficulty, an inference model $Q$ is introduced to depict the stochastic dependency of the latent state on the observed data. Similar to the generative model, $Q$ consists of a posterior SDE:
$$dX_t = H_Q(X_t, \overrightarrow{Y}_t, \overleftarrow{Y}_t; t)\,dt + R_G(\overrightarrow{Y}_t; t)\,dW_t, \qquad (5)$$
where $H_Q$ is the posterior drift function. Different from $H_G$, $H_Q$ also uses the future observations $\overleftarrow{Y}_t$ as input, and therefore the inference model $Q$ induces the posterior distribution $P_Q(X_{\leq t_n} \mid y_{1:n})$. Based on Auto-Encoding Variational Bayes (Kingma & Welling, 2014), it is straightforward to introduce a continuous-time variant of the VAE lower bound of the log-likelihood:
$$\mathcal{L}_{VAE}(y_{1:n}) = -\beta\, KL(P_Q \| P_G) + \sum_{i=1}^{n} \mathbb{E}_{P_Q(X_{t_i})} \log P_G(y_i \mid y_{1:i-1}, X_{t_i}), \qquad (6)$$
$$KL(P_Q \| P_G) = \frac{1}{2} \int_0^{t_n} \mathbb{E}_{P_Q(X_t)} \left[ (H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G) \right] dt, \qquad (7)$$
where $P_G(X_{\leq t_n})$ and $P_Q(X_{\leq t_n})$ are the probability densities of the latent states induced by the prior SDE Eq. (2) and the posterior SDE Eq. (5), $KL(\cdot\|\cdot)$ denotes the KL divergence between two distributions, and $\beta$ is a hyper-parameter that weights the effect of the KL term. In this paper, we fix $\beta$ at 1.0, so $\mathcal{L}_{VAE}$ is the original VAE objective (Kingma & Welling, 2014).
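The KL term between the prior and posterior SDEs can be estimated by Monte Carlo along Euler-Maruyama paths of the posterior. The sketch below uses scalar SDEs with hand-picked constant drifts and a shared diffusion (all hypothetical), so the estimate can be checked against the closed form:

```python
import math, random

def kl_between_sdes(H_G, H_Q, R, x0=0.0, T=1.0, dt=1e-3, n_paths=200, seed=0):
    """Monte Carlo estimate of the KL term between two scalar SDEs that
    share the diffusion R(t):
        KL = 0.5 * integral_0^T E_{P_Q}[ (H_Q - H_G)^2 / R^2 ] dt,
    with paths sampled from the posterior SDE by Euler-Maruyama."""
    rng = random.Random(seed)
    K = round(T / dt)
    total = 0.0
    for _ in range(n_paths):
        x, acc = x0, 0.0
        for k in range(K):
            t = k * dt
            diff = H_Q(x, t) - H_G(x, t)
            acc += 0.5 * diff * diff / (R(t) ** 2) * dt
            x += H_Q(x, t) * dt + R(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        total += acc
    return total / n_paths

# Constant drifts -1 and +1, shared diffusion 0.5:
# closed form KL = 0.5 * T * (1 - (-1))^2 / 0.5^2 = 8.0.
kl = kl_between_sdes(lambda x, t: -1.0, lambda x, t: 1.0, lambda t: 0.5)
```

With constant drifts the integrand is deterministic, so the Riemann sum recovers the closed form exactly up to floating-point error.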
In β-VAE (Higgins et al., 2017; Burgess et al., 2018), it is shown that a larger $\beta$ can encourage the model to learn more efficient and disentangled representations from the data. Eq. (5) is restricted to having the same diffusion function as Eq. (2). A feasible $\mathcal{L}_{VAE}$ cannot be defined to train VSDN-SDE without this restriction, as the KL divergence of two SDEs with different diffusions is infinite (Archambeau et al., 2008). The VAE objective has been widely used for discrete-time stochastic recurrent models, such as LFADS (Sussillo et al., 2016), VRNN (Chung et al., 2015) and SRNN (Fraccaro et al., 2016). The major difference between these models and our work is that we incorporate a continuous-time latent state into our model, while the latent states of the discrete-time models evolve only at distinct and separate time slots.
Continuous-Time Importance-Weighted Variational Bayes: $\mathcal{L}_{VAE}(y_{1:n})$ equals the exact log-likelihood when $P_Q(X_{\leq t_n})$ of the inference model is identical to the exact posterior distribution induced by the generative model. Errors in the inference model result in looseness of the VAE loss during model training. Under the framework of the Importance-Weighted Auto-Encoder (IWAE) (Burda et al., 2016; Cremer et al., 2017), we can define a tighter evidence lower bound:
$$\mathcal{L}^K_{IWAE}(y_{1:n}) = \mathbb{E}_{x^1_{\leq t_n}, \cdots, x^K_{\leq t_n} \sim P_Q(x_{\leq t_n})} \left[ \log \frac{1}{K} \sum_{k=1}^{K} w_k \prod_{i=1}^{n} P_G(y_i \mid y_{1:i-1}, X^k_{t_i}) \right], \qquad (8)$$
where the importance weights satisfy the following SDE:
$$d \log w_k = d \log \frac{P_G(x^k_{\leq t_n})}{P_Q(x^k_{\leq t_n})} = -\frac{1}{2} (H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G)\,dt - (H_Q - H_G)^T [R_G]^{-1}\,dW_t. \qquad (9)$$
Given the variational auto-encoding lower bound $\mathcal{L}_{VAE}(\cdot)$ and the importance-weighted auto-encoding lower bound $\mathcal{L}^K_{IWAE}(\cdot)$ for the continuous-time generative model, the tightness of the lower bounds is given by the following inequality:
$$\log P_G(y_{1:n}) \geq \mathcal{L}^{K+1}_{IWAE}(\cdot) \geq \mathcal{L}^K_{IWAE}(\cdot) \geq \mathcal{L}_{VAE}(\cdot), \qquad (10)$$
for any positive integer $K$.
Consequently, $\mathcal{L}^K_{IWAE}(\cdot)$ is also infinite if the diffusions of Eq. (2) and Eq. (5) are different. In our implementation, we notice that training our models with $\mathcal{L}^K_{IWAE}$ is not stable, possibly due to the drawbacks of importance sampling and the signal-to-noise problem (Rainforth et al., 2018). To alleviate the problem, we train our models with a convex combination of the VAE and IWAE losses:
$$\hat{\mathcal{L}}^K_{IWAE}(y_{1:n}) = (1 - \alpha)\, \mathcal{L}_{VAE}(y_{1:n}) + \alpha\, \mathcal{L}^K_{IWAE}(y_{1:n}), \quad \alpha \in (0, 1). \qquad (11)$$
With the use of reparameterization (Kingma & Welling, 2014), both $\mathcal{L}_{VAE}(y_{1:n})$ and $\hat{\mathcal{L}}^K_{IWAE}(y_{1:n})$ are differentiable with respect to the parameters of the generative and inference models. Therefore, they can be applied to train continuous-time stochastic models with deep learning components.
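The importance-weight SDE can be simulated alongside the posterior path, reusing the same Brownian increments for both. A scalar sketch with hypothetical constant drifts; since $w = dP_G/dP_Q$ under $P_Q$, the sample mean of $w$ should be close to 1:

```python
import math, random

def iwae_log_weights(H_G, H_Q, R, x0=0.0, T=1.0, dt=1e-2, n_paths=2000, seed=1):
    """Simulate the importance-weight SDE (scalar case)
        d log w = -0.5 * (H_Q - H_G)^2 / R^2 dt - (H_Q - H_G) / R dW_t,
    reusing the SAME Brownian increments that drive the posterior path
    sampled by Euler-Maruyama."""
    rng = random.Random(seed)
    K = round(T / dt)
    log_ws = []
    for _ in range(n_paths):
        x, log_w = x0, 0.0
        for k in range(K):
            t = k * dt
            dW = math.sqrt(dt) * rng.gauss(0.0, 1.0)
            a = (H_Q(x, t) - H_G(x, t)) / R(t)
            log_w += -0.5 * a * a * dt - a * dW
            x += H_Q(x, t) * dt + R(t) * dW
        log_ws.append(log_w)
    return log_ws

# Sanity check with constant drifts 0 and 0.2 and diffusion 1:
# E_{P_Q}[w] = 1, so the sample mean of w should be close to 1.
lw = iwae_log_weights(lambda x, t: 0.0, lambda x, t: 0.2, lambda t: 1.0)
mean_w = sum(math.exp(v) for v in lw) / len(lw)
```

Using the same noise for the path and the weight is essential; drawing fresh noise for the $dW_t$ term in the weight would break the change-of-measure identity.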

3. VARIATIONAL STOCHASTIC DIFFERENTIAL NETWORKS

We propose a new continuous-time stochastic recurrent network called the Variational Stochastic Differential Network (VSDN) (Figure 1). VSDN introduces the latent state to capture the underlying unobservable factors that generate the observed data, and incorporates efficient deep learning structures to compute the components of the generative model Eqs. (2)-(3) and the inference model Eq. (5).
Generative Model G: Inside the generative model, the latent SDE Eq. (2) depicts the dynamics of the latent state trajectory controlled by the historical observations $\overrightarrow{Y}_t$; both the drift and diffusion functions depend on $\overrightarrow{Y}_t$. Therefore, we first apply a forward ODE-RNN (Rubanova et al., 2019) to embed the information of the historical data into the hidden feature $\overrightarrow{h}_{t,pre}$. Two feed-forward networks are defined to compute the drift and diffusion, respectively, and the decoder network $D$ computes the parameters of the conditional distribution in Eq. (3) from the concatenation of the latent state and the forward feature:
$$\begin{aligned}
H_G &= N_{drift}([X_t, \overrightarrow{h}_{t,pre}]), & \overrightarrow{h}_{t,pre} &= \text{ODE-RNN}_1(\overrightarrow{Y}_t; t), \\
R_G &= \exp\big(N_{diff}(\overrightarrow{h}_{t,pre})\big), & P_G(Y_n \mid Y_{1:n-1}, X_{t_n}) &= \Phi(Y_n \mid f), \quad f = D([X_{t_n}, \overrightarrow{h}_{t_n,pre}]).
\end{aligned} \qquad (12)$$
Inference Model Q: We propose two types of inference models in VSDN: a filtering model and a smoothing model. $\mathcal{L}_{VAE}(y_{1:n})$ and $\mathcal{L}^K_{IWAE}(y_{1:n})$ equal the exact log-likelihood when $P_Q(X_{\leq t_n})$ is identical to the exact posterior distribution $P_G(X_{\leq t_n} \mid y_{1:n})$, so the inference model must process the whole data sequence to compute $H_Q$ at each time. According to d-separation (Bishop, 2006), the latent state $X_t$ depends on both the historical data $\overrightarrow{Y}_t$ and the future observations $\overleftarrow{Y}_t$. Therefore, we first define $Q$ as a smoothing model by introducing a backward ODE-RNN to embed the information of the future observations into a hidden feature $\overleftarrow{h}_t$. The drift function is computed as:
$$H_Q = N_{drift}([X_t, \overrightarrow{h}_{t,pre} + \overleftarrow{h}_t]), \quad \overleftarrow{h}_t = \text{ODE-RNN}_2(\overleftarrow{Y}_t; t). \qquad (13)$$
In real-world applications, it is sometimes possible to achieve comparable inference performance without processing the future observations. Besides, future measurements are unavailable in online systems. Therefore, we also design a filtering inference model that infers the latent state from the historical and current data only. The drift of the filtering model is given as:
$$H_Q = \begin{cases} N_{drift}([X_t, \overrightarrow{h}_{t,pre} + \overrightarrow{h}_{t,post}]) & \text{if there is an observation } y_t \text{ at time } t, \\ H_G & \text{otherwise,} \end{cases} \qquad (14)$$
where $\overrightarrow{h}_{t,post}$ is the post-observation updated feature of the forward ODE-RNN (Rubanova et al., 2019). The filtering model does not need a backward RNN to process the future observations, and thus runs faster. The whole architectures of VSDN with the filtering $Q$ (VSDN-F) and the smoothing $Q$ (VSDN-S) are shown in Figure 1(a)-(b). The inference model and the generative model share the drift network; this strategy forces the ODE-RNNs to embed more information into the hidden features and reduces the model complexity.
Applications: VSDN consists of a generative model and an inference model. The generative model is an online predictive model which can recurrently predict the future values of the sequence. The inference models can be applied to either filtering or smoothing of the latent states accordingly. Furthermore, the smoothing inference model infers the latent state trajectory from the whole sequence, which can be further used in Eq. (3) to synthesize missing data; therefore, the smoothing inference model is capable of offline interpolation. Since the motivation of this paper is to design an efficient continuous-time stochastic recurrent model, VSDNs only use the generative model to recurrently predict future values in the experiments.
Discussions: VSDN has higher flexibility and model capability than current continuous-time deep learning models in modeling sporadic sequences.
LatentODE (Chen et al., 2018) and ODE²VAE (Yildiz et al., 2019) encode the information of the time series into the initial values of the latent state trajectories and neglect the variance in the latent state transition. This strategy is impractical and inefficient in real-world applications, as it requires the initial latent states to disentangle the properties of a long sequence. Furthermore, LatentODE and ODE²VAE are offline models, as the encoder used during training cannot be directly used for online prediction. In contrast, VSDN defines a latent SDE controlled by the historical observations and recurrently integrates the information of the sequence along the time axis. This is more efficient than initial-state embedding and is also applicable to online prediction. GRU-ODE (De Brouwer et al., 2019), ODE-RNN (Rubanova et al., 2019) and NCDE (Kidger et al., 2020) also utilize a recurrent scheme but do not explicitly model the stochasticity of the underlying latent state. Therefore, they are less capable than VSDN of modeling the complicated stochastic processes of irregular data.
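For intuition, the generative recurrence (an Euler-Maruyama step of the latent SDE followed by decoding) can be sketched as below. The tiny linear maps standing in for the drift network, the diffusion network, the decoder, and the frozen ODE-RNN feature `h` are hypothetical placeholders, not the networks used in the paper:

```python
import math, random

rng = random.Random(0)

# Hypothetical stand-ins for the networks: tiny scalar maps instead of
# the real N_drift, N_diff and decoder D.
def N_drift(x, h):  return -0.5 * x + 0.1 * h
def N_diff(h):      return math.exp(-1.0 + 0.05 * h)  # exp keeps R_G > 0
def decoder(x, h):  return (0.8 * x + 0.2 * h, 0.1)   # Gaussian mean, std

def generative_step(x, h, dt):
    """One Euler-Maruyama step of the latent SDE plus decoding:
    h plays the role of the forward ODE-RNN feature at time t."""
    drift = N_drift(x, h)
    diff = N_diff(h)
    x_next = x + drift * dt + diff * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    y_mean, y_std = decoder(x_next, h)
    return x_next, (y_mean, y_std)

# Roll the generative model forward for a few steps.
x, h = 0.0, 1.0
for _ in range(10):
    x, (y_mean, y_std) = generative_step(x, h, dt=0.25)
```

In the actual model the feature `h` would be updated by the ODE-RNN between observations rather than held fixed.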

4. EXPERIMENTS

In this section, we conduct comprehensive experiments to validate the performance of our models and demonstrate their advantages in real-world applications. We compare VSDN with state-of-the-art continuous-time recurrent neural networks (i.e. ODE-RNN (Rubanova et al., 2019) and GRU-ODE (De Brouwer et al., 2019)), as well as LatentODE (Chen et al., 2018) and LatentSDE (Li et al., 2020).

4.1. HUMAN MOTION ACTIVITIES

We first evaluate the performance of different models on prediction and interpolation tasks for human motion capture data. For a given sequence of data points sampled at irregular time intervals, the prediction task is to estimate the next observed data point along the time axis, and the interpolation task is to recover the missing parts of the whole data trajectory. In both tasks, only the generative models of VSDNs are evaluated. The experiments are conducted on the following datasets:
• Human3.6M (Ionescu et al., 2014): We apply the same data pre-processing as (Martinez et al., 2017), after which the data frame at each time step is a 51-dimensional vector. The long data sequences are further segmented into clips of 248 frames.
• CMU MoCap *: We follow the data pre-processing in (Liu et al., 2019). In each data frame, the human activity is represented as a 62-dimensional vector and each dimension is normalized by its global mean and standard deviation. The long data sequences are further segmented into clips of 300 frames.
After data pre-processing, we randomly remove half of the frames in each data sequence as missing data. To quantify the model performance, we consider two evaluation metrics: the negative log-likelihood (NLL) per frame, and the frame-level mean square error (MSE) between the ground-truth and estimated values. The model configurations are given in Appendix C. The model performance is shown in Tables 1 and 2. VSDN incorporates an SDE to model the stochastic dynamics, and also applies a recurrent structure to embed the information of the irregular time series into the whole latent state trajectory. With these advances, VSDN outperforms the baseline models in both the prediction and interpolation tasks. VSDN has a much smaller negative log-likelihood, which indicates that it can better model the underlying stochastic process of the data.
Furthermore, VSDNs trained with IWAE losses have similar and sometimes better performance than those trained with VAE losses. As the latent state in the inference model has a stochastic dependency on the future observations, VSDN-S using the smoothing model has a slightly lower NLL and is a better choice than VSDN-F using the filtering model.
Visualization: We further compare the models qualitatively through the visualization of interpolated human skeletons in Figure 2. The VSDN models are able to generate vivid skeletons that are closer to the ground-truth ones. In contrast, ODE-RNN and GRU-ODE cannot interpolate the postures correctly (e.g. the angles of the arms in each frame differ significantly from the ground-truth ones). We also observe that the motions generated by VSDNs are smooth and closer to the real data, while there are large vibrations in the movements generated by the baseline models. Videos of these human motions are provided in the supplementary materials.

4.2. TOY SIMULATION AND CLIMATE PREDICTION

• Double-OU †: The Double-OU dataset consists of data sequences synthesized from a 2-dimensional Ornstein-Uhlenbeck process, a classic stochastic differential equation in finance and physics.
• USHCN ‡: The United States Historical Climatology Network (USHCN) dataset contains daily measurements of 5 climate variables from meteorological stations in the United States. In our experiment, we use the pre-processed subset of the data given in (De Brouwer et al., 2019).
Compared with the previous experiments, the data in Double-OU and USHCN are not only sampled at irregular times but also have missing dimensions in each sampled frame; that is, a data sequence is sparse in both the time axis and the frame dimensions. We evaluate the model performance in predicting future values based on the sporadic observations. The results are shown in Table 3. All VSDN models outperform the baselines. On the USHCN dataset, VSDN-S has a better NLL than VSDN-F when using either the VAE or the IWAE loss during training.
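For reference, a Double-OU-style sequence can be synthesized as follows; the parameters and the sampling scheme are assumptions for illustration, not the dataset's actual configuration:

```python
import math, random

def simulate_double_ou(theta=(1.0, 0.5), mu=(0.0, 1.0), sigma=(0.3, 0.2),
                       T=10.0, dt=0.01, p_obs=0.5, seed=0):
    """Sketch of a Double-OU-style dataset (assumed parameters): a 2-d
    Ornstein-Uhlenbeck process dX = theta*(mu - X)dt + sigma*dW simulated
    by Euler-Maruyama, then observed sporadically in time and dimension."""
    rng = random.Random(seed)
    x = list(mu)
    obs = []  # list of (time, [value-or-None per dimension])
    t = 0.0
    while t < T:
        for d in range(2):
            x[d] += theta[d] * (mu[d] - x[d]) * dt \
                    + sigma[d] * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if rng.random() < p_obs * dt * 10:  # sparse, irregular sampling times
            frame = [x[d] if rng.random() < p_obs else None for d in range(2)]
            obs.append((t, frame))
    return obs

obs = simulate_double_ou()
```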
VSDNs trained with the IWAE loss also have a smaller NLL than those trained with the VAE loss. However, on the Double-OU dataset, training with IWAE performs slightly worse than training with the VAE loss. This is possibly caused by the randomness of the training process, as the Double-OU process is a very simple stochastic differential equation and all VSDNs achieve the smallest errors on the prediction tasks. We also investigate the tightness of the two lower bounds during training. We visualize the $\mathcal{L}_{VAE}$ and $\mathcal{L}^K_{IWAE}$ of VSDN trained for 40 epochs on the Human3.6M dataset in Figure 3. As VSDN-S contains both forward and backward ODE-RNNs, it is more difficult to train than VSDN-F. The looseness of $\mathcal{L}_{VAE}$ further increases the training difficulty and results in a worse lower bound for VSDN-S (VAE); therefore, VSDN-S (VAE) requires more epochs to converge during training. For the other cases, we observe that $\mathcal{L}^K_{IWAE}$ is tighter than $\mathcal{L}_{VAE}$ during training when the number of trajectories is small; therefore, $\mathcal{L}^K_{IWAE}$ converges faster when training our models. For a large number of trajectories, $\mathcal{L}_{VAE}$ has similar tightness to $\mathcal{L}^K_{IWAE}$ on the training set.

5. CONCLUSIONS

In this paper, we propose a continuous-time stochastic recurrent neural network called VSDN to learn continuous-time stochastic dynamics from irregular or even sporadic data sequences. We provide two variants: VSDN-F, whose inference model is a filtering model, and VSDN-S, which uses a smoothing inference model. Continuous-time variants of the VAE and IWAE losses are incorporated to efficiently train our models. We demonstrate the effectiveness of VSDN through evaluation studies on different datasets and tasks, and our results show that VSDN achieves much better performance than state-of-the-art continuous-time deep learning models. In future work, we will investigate several potential directions. First, we will apply our models to higher-dimensional and more complicated data, such as videos, which are more challenging to model, especially given the increasing demand for producing videos at high resolution and frame rate (FPS). Second, as stochastic differential equations are the basis of many important control methodologies, we will further extend the capacity of our models so that they can be used in precise control scenarios.

A DERIVATIONS OF THE CONTINUOUS-TIME EVIDENCE LOWER BOUNDS

A.1 PRELIMINARIES OF STOCHASTIC DIFFERENTIAL EQUATIONS

During the model design and implementation, we use the Euler-Maruyama method to discretize the stochastic differential equations. The details are given as follows.

Lemma 1 (Discretization of SDE). An SDE $dX = H(X, t)\,dt + R(t)\,dW$ can be discretized as
$$X_{k+1} = X_k + H(X_k, t_k)\Delta t + R(t_k)\sqrt{\Delta t}\,\varepsilon, \qquad (15)$$
where $\varepsilon \sim \mathcal{N}(0, I)$, $t_k = k\Delta t$ and $\Delta t$ is the sampling interval. Eq. (15) converges to the original SDE when $\Delta t \to 0$.

Lemma 2. The state $X_{k+1}$ in Eq. (15) follows the conditional Gaussian distribution $P(X_{k+1} \mid X_k) = \mathcal{N}\big(X_k + H(X_k)\Delta t,\ \Delta t\, R(t_k) R(t_k)^T\big)$. The joint distribution of the state sequence $X_{1:K}$ of Eq. (15) is given by
$$P(X_{1:K} \mid X_0) \propto \exp\left( -\frac{1}{2} \sum_{k=0}^{K-1} (X_{k+1} - m_k)^T \Sigma_k^{-1} (X_{k+1} - m_k) \right),$$
where $m_k = X_k + H(X_k)\Delta t$ and $\Sigma_k = \Delta t\, R(t_k) R(t_k)^T$.

A.2 DERIVATION OF $\mathcal{L}_{VAE}$

The proof is similar to that of the evidence lower bound in (Archambeau et al., 2008). By applying Jensen's inequality, we obtain:
$$\begin{aligned}
\log P_G(y_{1:n}) &= \log \int P_G(X_{\leq t_n}) \prod_{i=1}^n P_G(y_i \mid y_{1:i-1}, X_{t_i})\, dX_{\leq t_n} \\
&= \log \int P_Q(X_{\leq t_n}) \frac{P_G(X_{\leq t_n}) \prod_{i=1}^n P_G(y_i \mid y_{1:i-1}, X_{t_i})}{P_Q(X_{\leq t_n})}\, dX_{\leq t_n} \\
&\geq \int P_Q(X_{\leq t_n}) \log \frac{P_G(X_{\leq t_n}) \prod_{i=1}^n P_G(y_i \mid y_{1:i-1}, X_{t_i})}{P_Q(X_{\leq t_n})}\, dX_{\leq t_n} \\
&= \int P_Q(X_{\leq t_n}) \log \frac{P_G(X_{\leq t_n})}{P_Q(X_{\leq t_n})}\, dX_{\leq t_n} + \int P_Q(X_{\leq t_n}) \log \prod_{i=1}^n P_G(y_i \mid y_{1:i-1}, X_{t_i})\, dX_{\leq t_n} \\
&= -KL(P_Q \| P_G) + \sum_{i=1}^n \mathbb{E}_{P_Q(X_{t_i})} \log P_G(y_i \mid y_{1:i-1}, X_{t_i}).
\end{aligned}$$
The next step is to derive the KL divergence term for the prior and inference SDEs.
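Lemma 1 can be sanity-checked numerically: for an Ornstein-Uhlenbeck process the discretization should reproduce the analytic mean and variance as $\Delta t$ becomes small. A minimal sketch:

```python
import math, random

def euler_maruyama(drift, diffusion, x0, T, dt, rng):
    """Lemma 1: X_{k+1} = X_k + H(X_k, t_k)*dt + R(t_k)*sqrt(dt)*eps."""
    x, t = x0, 0.0
    while t < T - 1e-12:
        x += drift(x, t) * dt + diffusion(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return x

# Check against the analytic moments of the OU process dX = -X dt + dW:
# E[X_T] = x0 * exp(-T),  Var[X_T] = 0.5 * (1 - exp(-2T)).
rng = random.Random(0)
xs = [euler_maruyama(lambda x, t: -x, lambda t: 1.0, 1.0, 1.0, 0.01, rng)
      for _ in range(5000)]
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
```

The discretization bias shrinks with `dt`, so the estimated moments agree with the analytic ones up to Monte Carlo error.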
After discretization into $K$ points via Lemma 1, the KL divergence of the two SDEs in VSDN-SDE is:
$$\begin{aligned}
KL(P_Q \| P_G) &= \int P_Q(X_{1:K}) \log \frac{P_Q(X_{1:K})}{P_G(X_{1:K})}\, dX_{1:K} \\
&= \sum_{k=0}^{K-1} \int P_Q(X_{1:K}) \log \frac{P_Q(X_{k+1} \mid X_k)}{P_G(X_{k+1} \mid X_k)}\, dX_{1:K} \\
&= \sum_{k=0}^{K-1} \int P_Q(X_{k+2:K} \mid X_{k+1}) P_Q(X_{k+1} \mid X_k) P_Q(X_{1:k}) \log \frac{P_Q(X_{k+1} \mid X_k)}{P_G(X_{k+1} \mid X_k)}\, dX_{1:K} \\
&= \sum_{k=0}^{K-1} \int P_Q(X_{k+1} \mid X_k) P_Q(X_k) \log \frac{P_Q(X_{k+1} \mid X_k)}{P_G(X_{k+1} \mid X_k)}\, dX_k\, dX_{k+1} \\
&= \sum_{k=0}^{K-1} \int P_Q(X_k)\, KL\big(P_Q(X_{k+1} \mid X_k) \| P_G(X_{k+1} \mid X_k)\big)\, dX_k \\
&= \sum_{k=0}^{K-1} \mathbb{E}_{X_k \sim P_Q(X_k)}\, KL\big(P_Q(X_{k+1} \mid X_k) \| P_G(X_{k+1} \mid X_k)\big),
\end{aligned}$$
where $P_Q(X_k)$ is the marginal distribution of $X_k$ under the inference SDE. According to Lemma 2 and the KL divergence between two Gaussian distributions, we further have
$$\begin{aligned}
KL\big(P_Q(X_{k+1} \mid X_k) \| P_G(X_{k+1} \mid X_k)\big) &= \frac{1}{2} \left[ \mathrm{tr}(\Sigma_{k,G}^{-1} \Sigma_{k,Q}) + (m_{k,G} - m_{k,Q})^T \Sigma_{k,G}^{-1} (m_{k,G} - m_{k,Q}) + \log \frac{\det \Sigma_{k,G}}{\det \Sigma_{k,Q}} - d \right] \\
&= \frac{1}{2} \left[ \mathrm{tr}\big((R_G R_G^T)^{-1} R_Q R_Q^T\big) + \Delta t (H_G - H_Q)^T (R_G R_G^T)^{-1} (H_G - H_Q) + \log \frac{\det (R_G R_G^T)}{\det (R_Q R_Q^T)} - d \right],
\end{aligned}$$
where $d$ is the dimension of $X_{k+1}$. When we restrict $R_G = R_Q$, we have
$$KL(P_Q \| P_G) = \frac{1}{2} \sum_{k=0}^{K-1} \mathbb{E}_{X_k \sim P_Q(X_k)} \left[ (H_G - H_Q)^T (R_G R_G^T)^{-1} (H_G - H_Q) \right] \Delta t.$$
When we let $\Delta t \to 0$, the discretized SDEs converge to the original SDEs and $KL(P_Q \| P_G)$ converges to:
$$KL(P_Q \| P_G) = \frac{1}{2} \int_0^{t_n} \mathbb{E}_{P_Q(X_t)} \left[ (H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G) \right] dt.$$
The expectation operator can be removed when $H_G$, $H_Q$ and $R_G$ are independent of $X_t$.
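The per-step Gaussian KL of Lemma 2 can be verified numerically in the scalar case: with a shared diffusion it reduces exactly to the summand $\frac{1}{2}\Delta t (H_Q - H_G)^2 / R^2$ derived above. A small check with arbitrary illustrative values:

```python
import math

def gauss_kl(m_q, s_q, m_g, s_g):
    """KL( N(m_q, s_q^2) || N(m_g, s_g^2) ) for scalar Gaussians."""
    return math.log(s_g / s_q) + (s_q ** 2 + (m_q - m_g) ** 2) / (2 * s_g ** 2) - 0.5

# Per-step KL of the two discretized SDEs (Lemma 2), scalar case with a
# shared diffusion R: the means differ by (H_Q - H_G)*dt and both stds
# equal R*sqrt(dt).
H_G, H_Q, R, dt, x = -1.0, 1.0, 0.5, 1e-3, 0.3
m_g, m_q = x + H_G * dt, x + H_Q * dt
s = R * math.sqrt(dt)
step_kl = gauss_kl(m_q, s, m_g, s)
closed_form = 0.5 * dt * (H_Q - H_G) ** 2 / R ** 2
```

The two quantities agree to machine precision, confirming that summing the per-step KLs yields the Riemann sum of the KL integral.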
If $R_G$ does not equal $R_Q$, we have
$$KL(P_Q \| P_G) = \frac{1}{2} \lim_{\Delta t \to 0} \sum_{k=0}^{K-1} \mathbb{E}_{X_k \sim P_Q(X_k)} \left[ (H_G - H_Q)^T (R_G R_G^T)^{-1} (H_G - H_Q) + \frac{\mathrm{const}}{\Delta t} \right] \Delta t = +\infty.$$

A.3 DERIVATION OF $\mathcal{L}_{IWAE}$

Given $X_{k+1} = X_k + H_Q \Delta t + R_G \sqrt{\Delta t}\,\varepsilon = m_{k,Q} + R_G \sqrt{\Delta t}\,\varepsilon$, we have:
$$\begin{aligned}
\log w &= \log \frac{P_G(x_{\leq t_n})}{P_Q(x_{\leq t_n})} = \sum_{k=0}^{K-1} \log \frac{P_G(X_{k+1} \mid X_k)}{P_Q(X_{k+1} \mid X_k)} \\
&= \frac{1}{2} \sum_{k=0}^{K-1} \left[ -(X_{k+1} - m_{k,G})^T \Sigma_{k,G}^{-1} (X_{k+1} - m_{k,G}) + (X_{k+1} - m_{k,Q})^T \Sigma_{k,G}^{-1} (X_{k+1} - m_{k,Q}) \right] \\
&= \frac{1}{2} \sum_{k=0}^{K-1} \left[ -\big((H_Q - H_G)\Delta t + R_G \sqrt{\Delta t}\,\varepsilon\big)^T [\Delta t R_G R_G^T]^{-1} \big((H_Q - H_G)\Delta t + R_G \sqrt{\Delta t}\,\varepsilon\big) + \big(R_G \sqrt{\Delta t}\,\varepsilon\big)^T [\Delta t R_G R_G^T]^{-1} \big(R_G \sqrt{\Delta t}\,\varepsilon\big) \right] \\
&= \frac{1}{2} \sum_{k=0}^{K-1} \left[ -(H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G)\Delta t - 2 (H_Q - H_G)^T [R_G]^{-1} \sqrt{\Delta t}\,\varepsilon \right].
\end{aligned}$$
Letting $\Delta t \to 0$, we have
$$\log w = \int_0^{t_n} \left[ -\frac{1}{2} (H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G)\,dt - (H_Q - H_G)^T [R_G]^{-1}\,dW_t \right],$$
which is equivalent to
$$d \log w = -\frac{1}{2} (H_Q - H_G)^T [R_G R_G^T]^{-1} (H_Q - H_G)\,dt - (H_Q - H_G)^T [R_G]^{-1}\,dW_t.$$

B ILLUSTRATION OF THE NOISE INJECTION OF $R(X_t)$

In this section, we give an example to illustrate the noise-injection problem that arises when we include $X_t$ as an input to the diffusion function $R(X_t)$ in a neural SDE. For simplicity, we consider the scalar case (i.e. $X_t \in \mathbb{R}$).

B.1 CASE A: R IS INDEPENDENT OF X t

Consider the following neural SDE:
$$dX_t = H_\phi(X_t; t)\,dt + R_\theta(t)\,dW_t,$$
where $H_\phi$ and $R_\theta$ are neural networks, $\phi$ denotes the parameters of the drift network and $\theta$ denotes the parameters of the diffusion network.

[Figure 4: An example to show the noise-injection problem of $R(X_t)$.]

Now consider the following example (shown in Figure 4), where we compute the gradient of the loss $\mathcal{L}$ at $t = 3\Delta t$ with respect to the network parameters and the neural SDE is discretized by the Euler-Maruyama method:
$$X_{\Delta t} = X_0 + H_\phi(X_0; 0)\Delta t + R_\theta(0)\sqrt{\Delta t}\,\varepsilon_1, \qquad (21)$$
$$X_{2\Delta t} = X_{\Delta t} + H_\phi(X_{\Delta t}; \Delta t)\Delta t + R_\theta(\Delta t)\sqrt{\Delta t}\,\varepsilon_2, \qquad (22)$$
$$X_{3\Delta t} = X_{2\Delta t} + H_\phi(X_{2\Delta t}; 2\Delta t)\Delta t + R_\theta(2\Delta t)\sqrt{\Delta t}\,\varepsilon_3, \qquad (23)$$
where $\varepsilon_n \sim \mathcal{N}(0, 1)$. It is straightforward to prove the following lemma. For notational simplicity, we define $H_\phi(n) = H_\phi(X_{(n-1)\Delta t}; (n-1)\Delta t)$ and $R_\theta(n) = R_\theta((n-1)\Delta t)$.

Lemma 3. Eqs. (21)-(23) satisfy the following relationship between the gradients:
$$\frac{\partial X_{n\Delta t}}{\partial X_{(n-1)\Delta t}} = 1 + \Delta t \frac{\partial H_\phi(n)}{\partial X_{(n-1)\Delta t}}. \qquad (24)$$
Therefore, the gradients of the parameters of the drift and diffusion functions are given by:
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \phi} &= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial \phi} + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial X_{2\Delta t}} \frac{\partial X_{2\Delta t}}{\partial \phi} + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial X_{2\Delta t}} \frac{\partial X_{2\Delta t}}{\partial X_{\Delta t}} \frac{\partial X_{\Delta t}}{\partial \phi} \\
&= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial H_\phi(3)}{\partial \phi} \Delta t + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}}\right) \frac{\partial H_\phi(2)}{\partial \phi} \Delta t + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}}\right) \left(1 + \Delta t \frac{\partial H_\phi(2)}{\partial X_{\Delta t}}\right) \frac{\partial H_\phi(1)}{\partial \phi} \Delta t,
\end{aligned} \qquad (25)$$
and
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \theta} &= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial \theta} + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial X_{2\Delta t}} \frac{\partial X_{2\Delta t}}{\partial \theta} + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial X_{3\Delta t}}{\partial X_{2\Delta t}} \frac{\partial X_{2\Delta t}}{\partial X_{\Delta t}} \frac{\partial X_{\Delta t}}{\partial \theta} \\
&= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial R_\theta(3)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_3 + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}}\right) \frac{\partial R_\theta(2)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_2 + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}}\right) \left(1 + \Delta t \frac{\partial H_\phi(2)}{\partial X_{\Delta t}}\right) \frac{\partial R_\theta(1)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_1.
\end{aligned} \qquad (26)$$
According to Eq. (25) and Eq. (26), the gradient of $\phi$ of the drift network is deterministic except for the weight of the first hidden layer, while the gradient of $\theta$ of the diffusion network is obstructed by the Gaussian noise terms $\sqrt{\Delta t}\,\varepsilon_1$, $\sqrt{\Delta t}\,\varepsilon_2$ and $\sqrt{\Delta t}\,\varepsilon_3$.
B.2 CASE B: $R$ USES $X_t$ AS INPUT

Now we consider the case where the diffusion network $R$ also uses $X_t$ as input. Eq. (24) changes to:
$$\frac{\partial X_{n\Delta t}}{\partial X_{(n-1)\Delta t}} = 1 + \Delta t \frac{\partial H_\phi(n)}{\partial X_{(n-1)\Delta t}} + \sqrt{\Delta t}\,\varepsilon_n \frac{\partial R_\theta(n)}{\partial X_{(n-1)\Delta t}}. \qquad (27)$$
Inserting Eq. (27) into Eq. (25) and Eq. (26), we have
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \phi} &= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial H_\phi(3)}{\partial \phi} \Delta t + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}} + \sqrt{\Delta t}\,\varepsilon_3 \frac{\partial R_\theta(3)}{\partial X_{2\Delta t}}\right) \frac{\partial H_\phi(2)}{\partial \phi} \Delta t \\
&\quad + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}} + \sqrt{\Delta t}\,\varepsilon_3 \frac{\partial R_\theta(3)}{\partial X_{2\Delta t}}\right) \left(1 + \Delta t \frac{\partial H_\phi(2)}{\partial X_{\Delta t}} + \sqrt{\Delta t}\,\varepsilon_2 \frac{\partial R_\theta(2)}{\partial X_{\Delta t}}\right) \frac{\partial H_\phi(1)}{\partial \phi} \Delta t,
\end{aligned} \qquad (28)$$
and
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \theta} &= \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \frac{\partial R_\theta(3)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_3 + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}} + \sqrt{\Delta t}\,\varepsilon_3 \frac{\partial R_\theta(3)}{\partial X_{2\Delta t}}\right) \frac{\partial R_\theta(2)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_2 \\
&\quad + \frac{\partial \mathcal{L}}{\partial X_{3\Delta t}} \left(1 + \Delta t \frac{\partial H_\phi(3)}{\partial X_{2\Delta t}} + \sqrt{\Delta t}\,\varepsilon_3 \frac{\partial R_\theta(3)}{\partial X_{2\Delta t}}\right) \left(1 + \Delta t \frac{\partial H_\phi(2)}{\partial X_{\Delta t}} + \sqrt{\Delta t}\,\varepsilon_2 \frac{\partial R_\theta(2)}{\partial X_{\Delta t}}\right) \frac{\partial R_\theta(1)}{\partial \theta} \sqrt{\Delta t}\,\varepsilon_1.
\end{aligned}$$
According to Eq. (28), the gradient of $\phi$ is now also corrupted by noise terms (i.e. $\sqrt{\Delta t}\,\varepsilon_3$ and $\Delta t\,\varepsilon_2 \varepsilon_3$), while in the previous case it was deterministic. What is worse, more noise terms enter the gradient of $\theta$. When we train our models on long data sequences, these injected noise terms cause a large variance in the parameters' gradients. Therefore, we conclude that introducing $X_t$ into the diffusion function is not beneficial.
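The variance inflation described above can be observed empirically. The sketch below propagates the pathwise derivative of $\mathcal{L} = X_K^2$ for a linear drift $H_\phi(x) = \phi x$ with either a constant diffusion (Case A) or a state-dependent one $R = \theta x$ (Case B); all parameter values are illustrative:

```python
import math, random

def grad_var_wrt_phi(state_dep_diffusion, phi=0.5, theta=1.0, x0=1.0,
                     dt=0.01, K=100, n_paths=2000, seed=0):
    """Variance of the pathwise gradient dL/dphi for L = X_K^2 under
    dX = phi*X dt + R dW, with R = theta (Case A) or R = theta*X (Case B).
    Mirrors Eq. (27): a state-dependent diffusion injects sqrt(dt)*eps
    into the Jacobian factors dX_{k+1}/dX_k."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n_paths):
        x, dx_dphi = x0, 0.0
        for _ in range(K):
            eps = rng.gauss(0.0, 1.0)
            R = theta * x if state_dep_diffusion else theta
            dR_dx = theta if state_dep_diffusion else 0.0
            # dX_{k+1}/dX_k = 1 + dt*dH/dx + sqrt(dt)*eps*dR/dx
            jac = 1.0 + phi * dt + math.sqrt(dt) * eps * dR_dx
            dx_dphi = jac * dx_dphi + x * dt
            x = x + phi * x * dt + R * math.sqrt(dt) * eps
        grads.append(2.0 * x * dx_dphi)  # dL/dphi for L = X_K^2
    mean = sum(grads) / len(grads)
    return sum((g - mean) ** 2 for g in grads) / len(grads)

var_a = grad_var_wrt_phi(state_dep_diffusion=False)
var_b = grad_var_wrt_phi(state_dep_diffusion=True)
```

On this toy problem the Case B gradient variance visibly exceeds Case A, consistent with the noise-injection argument above.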

C MODEL CONFIGURATION C.1 HUMAN MOTION ACTIVITIES

For all models, the feed-forward networks contain one hidden layer with 256 ReLU units. Δt is set to 0.25. The dimension of the hidden features of ODE-RNN and GRU-ODE is 512, and the dimension of the latent states is 128. A single-layer feed-forward network with 128 ReLU units computes the initial latent state. For LatentSDE, the posterior initial state is computed from the encoding feature of a backward ODE-RNN. The number of latent state trajectories sampled to compute the VAE and IWAE losses is 5. All models are trained with the Adam optimizer with learning rate 0.0001 and weight decay 0.0005. The batch size is 64. Early stopping with a tolerance of 10 epochs is applied.
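The early-stopping rule used in all experiments can be sketched as follows: training halts once the validation loss has failed to improve for `tolerance` consecutive epochs. This is a minimal, framework-agnostic sketch; the loss values at the bottom are made up for illustration:

```python
# Early stopping with an epoch tolerance, as used in the experiments above.
class EarlyStopping:
    def __init__(self, tolerance=10):
        self.tolerance = tolerance
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.tolerance

stopper = EarlyStopping(tolerance=10)
losses = [1.0, 0.8, 0.7] + [0.7] * 12   # improvement stops after the third epoch
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stopped_at)  # stops 10 epochs after the last improvement
```

The same class with `tolerance=25` matches the setting used for the toy simulation and climate prediction experiments below.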

C.2 TOY SIMULATION AND CLIMATE PREDICTION

For all models, the feed-forward networks contain one hidden layer with 25 ReLU units. Δt is set to 0.1 for USHCN and 0.01 for Double-OU. The dimension of the hidden features of ODE-RNN and GRU-ODE is 15, and the dimension of the latent states is 15 as well. A single-layer feed-forward network with 128 ReLU units computes the initial latent state. The number of latent state trajectories sampled to compute the VAE and IWAE losses is 5. All models are trained with the Adam optimizer with learning rate 0.0001 and weight decay 0.0001. The batch size is 500 for USHCN and 250 for Double-OU. Early stopping with a tolerance of 25 epochs is applied.

C.3 SPECTROGRAM MODELING

To further evaluate the performance of our model on high-dimensional data, we conduct a brief experiment on spectrogram data extracted from the FMA dataset (Defferrard et al., 2017), a large collection of music and songs. We transform the first 500 songs in FMA-small into spectrograms and then split the spectrograms into segments. Each segment has 100 frames and each frame is 1025-dimensional; each dimension of a frame corresponds to a specific frequency component of the STFT. The definitions of the prediction and interpolation tasks are the same as those in Section 4.1. For all models, the feed-forward networks contain one hidden layer with 256 ReLU units. The dimension of the hidden features of ODE-RNN and GRU-ODE is 256, and the dimension of the latent states is 64. A single-layer feed-forward network with 128 ReLU units computes the initial latent state. For LatentSDE, the posterior initial state is computed from the encoding feature of a backward ODE-RNN. The number of latent state trajectories sampled to compute the VAE and IWAE losses is 5. The batch size is 32. The NLL and MSE (per dimension) are shown in Table 4. Our model achieves mean squared errors similar to the baselines but a much better NLL, indicating that it better estimates the stochastic process underlying the data.
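The spectrogram pre-processing described above can be sketched with a plain NumPy STFT: a 2048-sample window yields 1025 frequency bins per frame (n_fft // 2 + 1), and the frame sequence is then split into 100-frame segments. The window size, hop length, and the synthetic signal below are assumptions for illustration; the exact STFT parameters are not stated above:

```python
import numpy as np

n_fft, hop, seg_len = 2048, 512, 100
signal = np.random.default_rng(0).standard_normal(hop * 250 + n_fft)

# Frame the signal, apply a Hann window, and take the one-sided FFT magnitude.
window = np.hanning(n_fft)
n_frames = 1 + (len(signal) - n_fft) // hop
frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                   for i in range(n_frames)])
spec = np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, 1025)

# Split the frame sequence into non-overlapping 100-frame segments.
segments = [spec[i : i + seg_len]
            for i in range(0, n_frames - seg_len + 1, seg_len)]
print(spec.shape[1], len(segments))
```

Each resulting segment is a (100, 1025) array, matching the segment shape used in this experiment.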



* http://mocap.cs.cmu.edu/ † https://github.com/edebrouwer/gru_ode_bayes ‡ https://cdiac.ess-dive.lbl.gov/epubs/ndp/ushcn/monthly_doc.html



Figure 1: Model Architectures of (a) VSDN-F (filtering); (b) VSDN-S (smoothing).

Figure 2: Visualization for human skeleton interpolation of different models.

Figure 3: Training processes of our models with respect to different numbers of sampled latent state trajectories. (Top: training set; bottom: validation set)

H_G and R_G are the drift and diffusion functions of the latent SDE. W_t denotes a Wiener process, also called standard Brownian motion. To integrate the information of the observed data, H_G is a function of both the current state X_t and the historical observations Y_t, whereas R_G only uses the historical data as input. It is not beneficial to include X_t as an input of the diffusion function, as it would inject more noise into the gradients of the network parameters; a detailed example and analysis of this noise-injection problem is given in Appendix B. Φ(·) is a parametric family of distributions over the data, and f(·) is the function computing the parameters of Φ. Following modern deep learning practice, we parameterize H_G, R_G and f(·) with deep neural networks.
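One transition step of this latent SDE can be sketched as follows. The key design point is that the drift H_G sees both the current latent state x and a history feature h (a summary of the past observations Y_t), while the diffusion R_G sees only h, never x, to avoid the noise-injection problem of Appendix B. The tiny linear "networks" (W_x, W_h, w_r) and all dimensions are illustrative placeholders, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, dt = 4, 8, 0.05
W_x = rng.standard_normal((d_x, d_x)) * 0.1
W_h = rng.standard_normal((d_x, d_h)) * 0.1
w_r = rng.standard_normal((d_x, d_h)) * 0.1

def drift(x, h):           # H_G(X_t, Y_t): uses state and history
    return W_x @ x + W_h @ h

def diffusion(h):          # R_G(Y_t): history only, positive diagonal noise scale
    return np.abs(w_r @ h) + 1e-3

x = np.zeros(d_x)               # latent state X_t
h = rng.standard_normal(d_h)    # history feature summarizing observations Y_t
# One Euler-Maruyama step of dX_t = H_G dt + R_G dW_t.
x_next = x + drift(x, h) * dt + diffusion(h) * np.sqrt(dt) * rng.standard_normal(d_x)
print(x_next.shape)
```

In the full model, h would be produced by the recurrent encoder over the sporadic observations, and f(·) would map x_next to the parameters of the output distribution Φ.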

Model performance on Human3.6M dataset

Model performance on MoCap dataset

Model performance on sporadic time series

Model performance on Spectrogram dataset

