NEURAL JUMP ORDINARY DIFFERENTIAL EQUATIONS: CONSISTENT CONTINUOUS-TIME PREDICTION AND FILTERING

Abstract

Combinations of neural ODEs with recurrent neural networks (RNN), like GRU-ODE-Bayes or ODE-RNN, are well suited to model irregularly observed time series. While these models outperform existing discrete-time approaches, no theoretical guarantees for their predictive capabilities are available. Assuming that the irregularly-sampled time series data originates from a continuous stochastic process, the L^2-optimal online prediction is the conditional expectation given the currently available information. We introduce the Neural Jump ODE (NJ-ODE), which provides a data-driven approach to learn, continuously in time, the conditional expectation of a stochastic process. Our approach models the conditional expectation between two observations with a neural ODE and jumps whenever a new observation is made. We define a novel training framework, which allows us to prove theoretical guarantees for the first time. In particular, we show that the output of our model converges to the L^2-optimal prediction. This can be interpreted as a solution to a special filtering problem. We provide experiments showing that the theoretical results also hold empirically. Moreover, we experimentally show that our model outperforms the baselines in more complex learning tasks and give comparisons on real-world datasets.

1. INTRODUCTION

Stochastic processes are widely used in many fields to model time series that exhibit random behaviour. In this work, we focus on processes that can be expressed as solutions of stochastic differential equations (SDE) of the form dX_t = µ(t, X_t) dt + σ(t, X_t) dW_t, with certain assumptions on the drift µ and the diffusion σ. With respect to the L^2-norm, the best prediction of a future value of the process is provided by the conditional expectation given the current value. If the drift and diffusion are known or a good estimation is available, the conditional expectation can be approximated by a Monte Carlo (MC) simulation. However, since µ and σ are usually unknown, this approach strongly depends on the assumptions made on their parametric form. A more flexible approach is given by neural SDEs, where the drift µ and diffusion σ are modelled by neural networks (Tzen & Raginsky, 2019; Li et al., 2020; Jia & Benson, 2019). Nevertheless, modelling the diffusion can be avoided if one is only interested in forecasting the behaviour instead of sampling new paths. An alternative widely used approach is to use Recurrent Neural Networks (RNN), where a neural network dynamically updates a latent variable with the observations of a discrete input time series. RNNs are successfully applied to tasks for which time series are regularly sampled, as for example speech or text recognition. However, observations are often made irregularly in time. The standard approach of dividing the time-line into equally-sized intervals and imputing or aggregating observations might lead to a significant loss of information (Rubanova et al., 2019). Frameworks that overcome this issue are GRU-ODE-Bayes (Brouwer et al., 2019) and ODE-RNN (Rubanova et al., 2019), which combine a RNN with a neural ODE (Chen et al., 2018). In standard RNNs, the hidden state is updated at each observation and constant in between.
Conversely, in the GRU-ODE-Bayes and ODE-RNN frameworks, a neural ODE is trained to model the continuous evolution of the hidden state of the RNN between two observations. While GRU-ODE-Bayes and ODE-RNN both provide convincing empirical results, they lack thorough theoretical guarantees. Contribution. In this paper, we introduce a mathematical framework to precisely describe the problem statement of online prediction and filtering of a stochastic process with temporally irregular observations. Based on this rigorous mathematical description, we introduce the Neural Jump ODE (NJ-ODE). The model architecture is very similar to those of GRU-ODE-Bayes and ODE-RNN; however, we introduce a novel training framework which, in contrast to them, allows us to prove convergence guarantees for the first time. Moreover, we demonstrate empirically the capabilities of our model. Precise problem formulation. We emphasize that a precise definition of all ingredients is needed to be able to show theoretical convergence guarantees, which is the main purpose of this work. Since the objects of interest are stochastic processes, we use tools from probability theory and stochastic calculus. To make the paper more readable and comprehensible also for readers without a background in these fields, the precise formulations and proofs of all claims are given in the appendix, while the main part of the paper focuses on giving well-understandable heuristics.

2. PROBLEM STATEMENT

The problem we consider in this work is the online forecasting of temporal data. We assume that we make observations of a Markovian stochastic process described by the stochastic differential equation (SDE) dX_t = µ(t, X_t) dt + σ(t, X_t) dW_t, at irregularly-sampled time points. Between those observation times, we want to predict the stochastic process based only on the observations that we made previously in time, excluding the possibility to interpolate observations. Due to the Markov property, only the last observation is needed for an optimal prediction. Hence, after each observation we extrapolate the current observation into the future until the next observation is made. The time at which the next observation will be made is random and assumed to be independent of the stochastic process itself. More precisely, we suppose to have a training set of N independent realisations of the R^{d_X}-dimensional stochastic process X defined in (1). Each realisation j is observed at n_j random observation times t_1^{(j)}, . . . , t_{n_j}^{(j)} ∈ [0, T] with values x_1^{(j)}, . . . , x_{n_j}^{(j)} ∈ R^{d_X}. We assume that all coordinates of the vector x_i^{(j)} are observed. We are interested in forecasting how a new independent realisation evolves in time, such that our predictions of X minimize the expected squared distance (L^2-metric) to the true unknown path. The optimal prediction, i.e. the L^2-minimizer, is the conditional expectation. Given that the value of the new realisation at time t is x_t, we are therefore interested in estimating the function f(x_t, t, s) := E[X_{t+s} | X_t = x_t], s ≥ 0, which is the L^2-optimal prediction until the next observation is made. To learn an approximation f̂ of f we make use of the N realisations of the training set. After training, f̂ is applied to the new realisation. Hence, this can be interpreted as a special type of filtering problem. The following example illustrates the considered problem. Example.
A complicated-to-measure vital parameter of patients in a hospital is measured multiple times during the first 48 hours of their stay. For each patient, this happens at different times depending on the resources, hence the observation dates are irregular and exhibit some randomness. Patient 1 has n_1 = 4 measurements at hours (t_1^{(1)}, t_2^{(1)}, t_3^{(1)}, t_4^{(1)}) = (1, 14, 27, 34), where the values (x_1^{(1)}, x_2^{(1)}, x_3^{(1)}, x_4^{(1)}) = (0.74, 0.65, 0.78, 0.81) are measured. Patient 2 only has n_2 = 2 measurements at hours (t_1^{(2)}, t_2^{(2)}) = (3, 28), where the values (x_1^{(2)}, x_2^{(2)}) = (0.56, 0.63) are measured. Similarly, the j-th patient has n_j measurements at times (t_1^{(j)}, . . . , t_{n_j}^{(j)}) with measured values (x_1^{(j)}, . . . , x_{n_j}^{(j)}). Based on this data, we want to forecast the vital parameter of new patients coming to the hospital. In particular, for a patient with measured value x_1 at time t_1, we want to predict what the values will likely be at any time t_1 + s > t_1. Importantly, we do not only focus on predicting the value at some t_2 > t_1, but we want to know the entire evolution of the value.
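For intuition on why the conditional expectation is the target, the case of a Black-Scholes process with known coefficients can be checked directly: there E[X_{t+s} | X_t = x] = x e^{µs}, and a Monte Carlo simulation of the SDE reproduces it. The following sketch is our own illustration (not part of the paper's method); all parameter values are placeholders.

```python
import numpy as np

# Monte Carlo estimate of E[X_{t+s} | X_t = x] for a Black-Scholes
# process dX_t = mu*X_t dt + sigma*X_t dW_t, via Euler-Maruyama steps,
# compared against the closed form x * exp(mu * s). Hypothetical
# parameter values, purely for illustration.
def mc_conditional_expectation(x, mu, sigma, s, n_paths=100_000, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    dt = s / n_steps
    X = np.full(n_paths, float(x))
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + mu * X * dt + sigma * X * dW   # Euler-Maruyama step
    return X.mean()

x, mu, sigma, s = 0.74, 0.2, 0.3, 1.0
mc = mc_conditional_expectation(x, mu, sigma, s)
closed_form = x * np.exp(mu * s)
```

As the paper notes, this MC route requires knowing (or estimating) µ and σ, which is exactly what the data-driven approach avoids.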

3. BACKGROUND

Recurrent Neural Network. The input to a RNN is a discrete time series of observations {x_1, . . . , x_n}. At each observation time t_{i+1}, a neural network, the RNNCell, updates the latent variable h using the previous latent variable h_i and the input x_{i+1} as h_{i+1} := RNNCell(h_i, x_{i+1}). Neural Ordinary Differential Equation. Neural ODEs (Chen et al., 2018) are a family of continuous-time models defining a latent variable h_t := h(t) to be the solution to an ODE initial-value problem h_t := h_0 + ∫_{t_0}^{t} f(h_s, s, θ) ds, t ≥ t_0, (3) where f(·, ·, θ) = f_θ is a neural network with weights θ. Therefore, the latent variables can be updated continuously by solving this ODE (3). We can emphasize the dependence of h_t on a numerical ODE solver by rewriting (3) as h_t := ODESolve(f_θ, h_0, (t_0, t)). (4) ODE-RNN. ODE-RNN (Rubanova et al., 2019) is a mixture of a RNN and a neural ODE. In contrast to a standard RNN, we are not only interested in an output at the observation times t_i, but also in between those times. In particular, we want an output stream that is generated continuously in time. This is achieved by using a neural ODE to model the latent dynamics between two observation times, i.e. for t_{i-1} < t < t_i the latent variable is defined as in (3) and (4), with h_0 and t_0 replaced by h_{i-1} and t_{i-1}. At the next observation time t_i, the latent variable is updated by a RNN with the new observation x_i. Fixing h_0, the entire latent process can be computed by iteratively solving an ODE followed by applying a RNN. Rubanova et al. (2019) write this as h'_i := ODESolve(f_θ, h_{i-1}, (t_{i-1}, t_i)), h_i := RNNCell(h'_i, x_i). (5) GRU-ODE-Bayes. The model architecture describing the latent variable in GRU-ODE-Bayes (Brouwer et al., 2019) is defined as a special case of the ODE-RNN architecture. In particular, a gated recurrent unit (GRU) is used for the RNN cell and a continuous version of the GRU for the neural ODE f_θ.
Therefore, we focus on explaining the difference between our model architecture and the ODE-RNN architecture in the following section.
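The ODE-RNN recursion (5) can be sketched in a few lines of numpy. The weights below are random placeholders and a plain Euler stepper stands in for ODESolve; this is purely to illustrate the control flow, not a trainable implementation.

```python
import numpy as np

# Sketch of the ODE-RNN recursion: an Euler "ODESolve" evolves the
# hidden state between irregular observation times, and a simple tanh
# RNN cell updates it at each observation. Weights are random stand-ins.
rng = np.random.default_rng(0)
d_h, d_x = 4, 1
Wf = rng.normal(size=(d_h, d_h)) * 0.1            # ODE vector field weights
Wh = rng.normal(size=(d_h, d_h)) * 0.1            # RNN cell weights
Wx = rng.normal(size=(d_h, d_x)) * 0.1

def f_theta(h, t):                                 # neural ODE vector field
    return np.tanh(Wf @ h)

def ode_solve(h, t0, t1, dt=0.01):                 # explicit Euler discretization
    t = t0
    while t + dt <= t1:
        h = h + dt * f_theta(h, t)
        t += dt
    return h

def rnn_cell(h, x):                                # jump update at an observation
    return np.tanh(Wh @ h + Wx @ x)

ts = [0.0, 1.0, 2.5]                               # irregular observation times
xs = [np.array([0.5]), np.array([0.7]), np.array([0.6])]
h = np.zeros(d_h)
for i in range(len(ts)):
    if i > 0:
        h = ode_solve(h, ts[i - 1], ts[i])         # h'_i = ODESolve(f_theta, h_{i-1}, (t_{i-1}, t_i))
    h = rnn_cell(h, xs[i])                         # h_i = RNNCell(h'_i, x_i)
```

Reading the hidden state at any intermediate step of `ode_solve` yields the continuous-in-time output stream described above.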

4. PROPOSED METHOD -NEURAL JUMP ODE

Markovian paths. Our assumptions on the stochastic process X imply that it is a Markov process. In particular, the optimal prediction of a future state of X only depends on the current state rather than on the full history. Hence, the previous values do not provide any additional useful information for the prediction. JumpNN instead of RNN. Using a neural ODE between two observations has the advantage that it allows us to continuously model the hidden state between two observations. But since the underlying process is Markov, there is no need to use a RNN cell to model the updates of the hidden state at each new observation. Instead, whenever a new observation is made, the new hidden state can solely be defined from this observation. Therefore, we replace the RNN cell used in ODE-RNN by a standard neural network mapping the observation to the hidden state. We call this the jumpNN, which can be interpreted as an encoder map. Compared to ODE-RNN, this architecture is easier to train. Last observation and time increment as additional inputs for the neural ODE. The neural network f_θ used in the neural ODE takes two arguments as inputs, the hidden state h_t and the current time t. However, our theoretical problem analysis suggests that, instead of t, the last observation time t_{i-1} and the time increment t - t_{i-1} should be used. Additionally, the last observation x_{i-1} should also be part of the input.

NJ-ODE.

Combining the ODE-RNN architecture (5) with the previous considerations, we introduce the modified architecture of the Neural Jump ODE (NJ-ODE):

h'_i := ODESolve(f_θ, (h_{i-1}, x_{i-1}, t_{i-1}, t - t_{i-1}), (t_{i-1}, t_i)), h_i := jumpNN(x_i).

An implementable version of this method is presented in Algorithm 1. A neural ODE f_θ transforms the hidden state between observations, and the hidden state jumps according to jumpNN when a new observation is available. The outputNN, a standard neural network, maps any hidden state h_t to the output y_t. To implement the continuous-in-time ODE evaluation, a discretization scheme is provided by the inner loop. In the training process, the weights of all three neural networks, jumpNN, the neural ODE f_θ and outputNN, are optimized.

Algorithm 1 The NJ-ODE. A small step size ∆t is fixed and we denote t_{n+1} := T.
Input: Data points with timestamps {(x_i, t_i)}_{i=0,...,n}
for i = 0 to n do
    h_{t_i} = jumpNN(x_i)                                          (update hidden state given next observation x_i)
    y_{t_i} = outputNN(h_{t_i})                                    (compute output)
    s ← t_i
    while s + ∆t ≤ t_{i+1} do
        h_{s+∆t} = ODESolve(f_θ, (h_s, x_i, t_i, s - t_i), (s, s + ∆t))   (get next hidden state)
        y_{s+∆t} = outputNN(h_{s+∆t})                              (compute output)
        s ← s + ∆t
    end while
end for

Objective function. Our goal is to train the NJ-ODE model such that its output approximates the conditional expectation (2), which is the optimal prediction of the target process X with respect to the L^2-norm. Therefore, we define a new objective function, with which we can prove convergence. Let y_{i-} denote the output of the NJ-ODE at t_i before the jump and y_i the output at t_i after the jump. Note that the outputs depend on the parameters θ and the previously observed x_i, which are inputs to the model. Then the objective function is defined as

Φ_N(θ) := (1/N) Σ_{j=1}^{N} (paths) (1/n_j) Σ_{i=1}^{n_j} (dates) ( |x_i^{(j)} - y_i^{(j)}| (jump part at observations) + |y_i^{(j)} - y_{i-}^{(j)}| (continuous part between two observations) )^2.

We give an intuitive explanation for this definition.
The "jump part" of the loss function forces the jumpNN to produce good updates based on new observations, while the other part forces the jump size to be small in (the empirical) L 2 -norm. Since the conditional expectation minimizes the jump size with respect to the L 2 -norm, this forces the neural ODE f θ to continuously transform the hidden state such that the output approximates the conditional expectation. Moreover, both parts of the loss function force the outputNN to reasonably transform the hidden state h t to the output y t .
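Assuming the per-path model outputs before and after each jump are already computed, the objective can be sketched as follows. The arrays below are illustrative stand-ins, not outputs of a trained model.

```python
import numpy as np

# Sketch of the NJ-ODE objective: for each path j and observation i it
# combines the jump error |x_i - y_i| (after the jump) and the jump
# size |y_i - y_{i-}| (before vs. after the jump), squared and averaged
# over observations and paths.
def nj_ode_loss(xs, ys_after, ys_before):
    """xs, ys_after, ys_before: lists over paths of (n_j, d_X) arrays."""
    total = 0.0
    for x, y, y_minus in zip(xs, ys_after, ys_before):
        jump = np.linalg.norm(x - y, axis=1)         # |x_i^(j) - y_i^(j)|
        cont = np.linalg.norm(y - y_minus, axis=1)   # |y_i^(j) - y_{i-}^(j)|
        total += np.mean((jump + cont) ** 2)
    return total / len(xs)

xs        = [np.array([[0.74], [0.65]])]             # one path, two observations
ys_after  = [np.array([[0.74], [0.65]])]             # perfect jump updates
ys_before = [np.array([[0.74], [0.65]])]             # zero jump size
```

With perfect jump updates and zero jump size, as here, the loss is exactly zero, matching the intuition that both parts must vanish for the conditional expectation.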

5. MAIN RESULT -THEORETICAL CONVERGENCE GUARANTEE

In the following we informally state our main result. To formally state and prove the theorem, precise definitions of all ingredients are needed. This analysis is provided in the appendix, where the following result is stated in Theorem E.2 and Theorem E.13. Theorem 5.1 (informal). We assume that for each number of paths N and for every size of the neural networks M, their weights are chosen optimally, so as to minimize Φ_N(θ). Then, if N and M tend to infinity, the output of NJ-ODE converges in mean (L^1-convergence) to the conditional expectation of the stochastic process X given the current information. An intuitive explanation for this theorem was given with the definition of the objective function. In this result, the focus lies on the convergence analysis under the assumption that optimal weights are found. In the appendix we discuss why this assumption is not restrictive.

6. EXPERIMENTS

For further details and results for all experiments see Appendix F.

6.1. TRAINING ON SYNTHETIC DATASETS

Evaluation metric. For synthetic datasets where an analytic formula for the conditional expectation exists, we can evaluate the distance of the model output to the target process (2). We use a sampling time grid with equidistant step size ∆t := T/K, K ∈ N, on [0, T]. On this grid, we compare, for path j at time t, the true conditional expectation x̂_t^{(j)} with the predicted conditional expectation (the model output) y_t^{(j)}. For N_2 test samples, the evaluation metric is defined as

eval(x̂, y) := (1/N_2) Σ_{j=1}^{N_2} (1/(K+1)) Σ_{i=0}^{K} | x̂_{i∆t}^{(j)} - y_{i∆t}^{(j)} |^2.   (8)

Black-Scholes, Ornstein-Uhlenbeck and Heston. We test our algorithm on three scalar stochastic models, Black-Scholes, Ornstein-Uhlenbeck and Heston, with fixed parameters. For each model, we generate a dataset by sampling N = 20 000 paths on the time interval [0, 1] using the Euler scheme with 100 time steps. Independently for each path, on average 10% of the grid points are randomly chosen as observation times. The NJ-ODE is trained on 80% of the data. On the remaining 20% the model is tested by comparing the loss function (31) computed with the NJ-ODE to the loss function computed with the true conditional expectation (Figure 2). During training, the relative difference becomes very small, hence the true conditional expectation is nearly replicated.
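Assuming the true conditional expectations x̂ and the model outputs y are given as arrays on the evaluation grid, the metric can be sketched as follows.

```python
import numpy as np

# Sketch of the evaluation metric (8): squared distance between the
# true conditional expectation and the model output, averaged over the
# K+1 grid points and then over the N_2 test paths. Inputs illustrative.
def eval_metric(x_hat, y):
    """x_hat, y: arrays of shape (N2, K+1, d) on the grid i*dt, i = 0..K."""
    diff = np.linalg.norm(x_hat - y, axis=-1)    # |x̂ - y| per path and grid point
    return np.mean(np.mean(diff ** 2, axis=1))   # average over grid, then over paths
```

By construction the metric is zero exactly when the model output coincides with the true conditional expectation on the grid.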

6.2. FURTHER SYNTHETIC DATASETS

Heston model without Feller condition. We also train our model on a Heston dataset for which the Feller condition is not satisfied. As explained by Andersen (2007) and Jean-François et al. (2015), this is a more delicate situation. We see that our model is very robust, since even in this critical case it learns to replicate the true conditional expectation process. In the Heston model, the variance of the stochastic process X_t is itself a stochastic process, v_t. Here, we train our model to predict both processes at the same time. The training on this 2-dimensional dataset is successful, as can be seen in Figure 3, where in both plots the upper sub-plot corresponds to the 1-dimensional path of X_t and the lower sub-plot to the 1-dimensional path of v_t. The minimal evaluation metric after 200 epochs is 0.0983. Dataset with changing regime. In this experiment we test how well our model can deal with stochastic processes that undergo an (abrupt) change of regime at a certain point in time. Many real-world time series might exhibit such a change of regime. Some examples are listed below.
• Longitudinal patient health recordings might experience changes depending on seasonal or longer-term influences, for example due to the seasonal flu or currently the Covid-19 pandemic.
• In many regions climate data has strong seasonal dependencies that can lead to relatively abrupt changes, for example when the weather changes from the dry to the rain season.
• A stock market that suddenly changes from a bullish to a bearish market, for example due to a macro-economic event. An example for this would be the start of the Covid-19 crisis in the first quarter of 2020.
We test a change of regime by combining two synthetic datasets. On the first half of the time interval, [0, 0.5], we use the Ornstein-Uhlenbeck model and on the second half, [0.5, 1], the Black-Scholes model. In Figure 4 we see that our model correctly learns the change of regime.
The minimal evaluation metric after 200 epochs is 0.0463. Dataset with explicit time dependence. Many real-world datasets have an explicit time dependence, i.e. the drift and diffusion of (9) explicitly depend on t. Examples are all datasets that have a certain periodicity, as for example weather data (seasonal and daily periodicity), intraday periodicity of stock prices (Andersen & Bollerslev, 1997) or prices of certain seasonal goods. We incorporate an explicit time dependence into the Black-Scholes dataset by replacing the drift constant µ with the time-dependent function (α/2)(sin(βt) + 1), for α, β > 0. In Figure 5 we see that the model learns to adapt to the time-dependent coefficients. The minimal evaluation metric after 100 epochs is 0.0215 (β = 2π) and 0.02805 (β = 4π) respectively.
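The generation of such a time-dependent path with the Euler scheme can be sketched as below. This is a hypothetical illustration: the parameter values, and our reading of the drift as (α/2)(sin(βt) + 1), are assumptions rather than the paper's exact setup.

```python
import numpy as np

# Euler-scheme simulation of a Black-Scholes path whose drift constant
# mu is replaced by the time-dependent function (alpha/2)(sin(beta*t)+1).
# All parameter values are placeholders, not the paper's.
def simulate_time_dependent_bs(x0=1.0, alpha=0.4, beta=2 * np.pi, sigma=0.3,
                               T=1.0, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    path = np.empty(n_steps + 1)
    path[0] = x0
    for k in range(n_steps):
        t = k * dt
        drift = 0.5 * alpha * (np.sin(beta * t) + 1.0)   # time-dependent drift
        dW = rng.normal(0.0, np.sqrt(dt))
        path[k + 1] = path[k] + drift * path[k] * dt + sigma * path[k] * dW
    return path
```

Sampling many such paths and randomly subsampling observation times reproduces the dataset-generation recipe described in Section 6.1.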

6.3. EMPIRICAL CONVERGENCE STUDY

We confirm the theoretical results of Theorem 5.1 by an empirical convergence study for growing numbers of training samples N_1 and network sizes M, where the performance is measured by the evaluation metric (8). For each combination of the number of training samples N_1 and the neural network size M, the NJ-ODE is trained 5 times on the Black-Scholes, Ornstein-Uhlenbeck and Heston datasets. The means of the evaluation metric over the 5 trials, with their standard deviations, are shown in Figure 6 for Black-Scholes. Already 200 training samples lead to a small evaluation metric if the network size is big enough, suggesting that the Monte Carlo approximation of the loss function is already good with only a few samples.

6.4. COMPARISON TO GRU-ODE-BAYES ON SYNTHETIC DATA

On the Black-Scholes, Ornstein-Uhlenbeck and Heston datasets, we compare our model to GRU-ODE-Bayes (Brouwer et al., 2019), which is, to the best of our knowledge, the neural network based method addressing the task most similar to ours. Results of our comparison are shown in Table 6.4. On the Black-Scholes and Ornstein-Uhlenbeck datasets, our model performs similarly to GRU-ODE-Bayes. However, on the more complicated Heston dataset, the training of GRU-ODE-Bayes is unstable and does not converge. On the other hand, our model converges during training. Although the value of the evaluation metric is much higher than for Black-Scholes and Ornstein-Uhlenbeck, the resulting model output is still meaningful, which is not the case for GRU-ODE-Bayes. Hence we conclude that GRU-ODE-Bayes cannot be applied reasonably to the Heston dataset, while our method works as the theoretical results suggest.

6.5. REAL WORLD DATASETS WITH INCOMPLETE OBSERVATIONS

Self-imputation for incomplete observations. Until now we assumed that at each observation time all coordinates of the stochastic process X are observed. However, in many real-world applications the observations are incomplete, i.e. an observation is only available for some of the coordinates. To deal with such incomplete observations, we propose the following self-imputation method. Whenever an observation is incomplete, let m_i ∈ {0, 1}^{d_X} denote the mask indicating which coordinates of x_i are observed and define the imputed observation x̃_i := m_i ⊙ x_i + (1_{d_X} - m_i) ⊙ y_{i-}, where ⊙ is the element-wise multiplication (Hadamard product) and 1_{d_X} ∈ R^{d_X} is the one-vector. Instead of x_i we use (x̃_i, m_i) as an input for the jump part jumpNN. The intuition behind this definition is the following. In the one-dimensional case, if we do not make an observation, but input y_{t-} as if it was an observation, we expect that this does not change the output y_s for s ≥ t. From this point of view, y_{t-} does not provide any additional information for the model. Similarly, we expect that imputing y_{t-} for unobserved coordinates does not provide any information about these coordinates to the model. However, since the model might learn to transfer the information about an observed coordinate to an unobserved one, we extend the input to also include the information which coordinates were observed. For the ODE part we use y_i, the prediction after processing the input (x̃_i, m_i), as input instead of x_i. Here, the intuition is that if the model learns how to best use the incomplete observation, i.e. if the jump is good, then this is the best approximation of x_i. Our objective function is adjusted by multiplying each term in the sum with the mask. Climate forecast. We compare our model to GRU-ODE-Bayes on the USHCN daily dataset (Menne et al., 2016), using the same experimental setting as was used by Brouwer et al. (2019). We train a small (S) and a large (L) version of NJ-ODE with different total numbers of parameters.
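The self-imputation step described above can be sketched in one line of numpy; the values below are illustrative.

```python
import numpy as np

# Sketch of self-imputation: unobserved coordinates (mask 0) are filled
# with the model's own pre-jump prediction y_{i-}, observed coordinates
# (mask 1) keep the new measurement x_i.
def self_impute(x_i, m_i, y_before):
    return m_i * x_i + (1.0 - m_i) * y_before   # element-wise (Hadamard) product

x_i      = np.array([0.8, 0.0])    # second coordinate not actually observed
m_i      = np.array([1.0, 0.0])    # observation mask
y_before = np.array([0.7, 0.5])    # model output just before the jump
imputed  = self_impute(x_i, m_i, y_before)
```

The pair (imputed, m_i) is then fed to jumpNN, so the network can distinguish genuine observations from self-imputed coordinates.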
The validation set was used for early stopping after the first 100 epochs, where we trained for a total of 200 epochs. We see in Table 6.5 that our small version performs slightly worse than GRU-ODE-Bayes, while our large version slightly outperforms it. Physionet. We compare our model to the latent ODE on their extrapolation task on the PhysioNet Challenge 2012 dataset (Goldberger et al., 2000), using the same experimental setting as was used by Rubanova et al. (2019). The mean and standard deviation over 5 runs starting at different random initializations are reported for our model. We see in Table 6.5 that our model outperforms the latent ODE models although having only about a seventh of the trainable weights.

7. RELATED WORK

Filtering. Classical filtering methods approximate the conditional distribution of the signal given the observations (Bain & Crisan, 2008, Def. 3.2) and therefore can provide conditional expectations. Comparably, we directly compute the conditional expectation given the last observation. Similar to our assumptions, in the neural filtering approaches of GRU-ODE-Bayes (Brouwer et al., 2019) and (Ryder et al., 2018), observations are only available at irregular discrete time points. They approximate the conditional law of X given the last observation by a Gaussian distribution and learn its mean and variance parameters. In particular, the conditional expectation is then given by this mean parameter. In contrast, we do not make normality assumptions about the conditional distribution and we theoretically prove convergence to the true conditional expectation. Neural ODEs with jumps. Besides GRU-ODE-Bayes (Brouwer et al., 2019) and ODE-RNN (Rubanova et al., 2019), another work studying a neural ODE with jumps is the Neural Jump SDE (NJSDE) (Jia & Benson, 2019). Similar to the NJ-ODE framework (29), the latent process of NJSDE is described by a neural ODE with jumps at random times. This model is used to describe hybrid systems which evolve continuously in time but may also be interrupted by stochastic events.
In contrast to that, we model the conditional expectation of a continuous stochastic process.

8. CONCLUSION

We presented the Neural Jump ODE, a data-driven framework for modelling the conditional expectation of a stochastic process given the previous observations. We introduced a rigorous mathematical description of our model and, more generally, of the class of neural ODE based models. Moreover, for the first time we provided theoretical guarantees for a model falling into this category. We evaluated our model empirically on six synthetic and two real-world datasets. In comparison to the baselines GRU-ODE-Bayes and latent ODE, we achieved better results, especially on complex datasets.

APPENDIX A SETUP

A.1 STOCHASTIC PROCESS X

In this section we rigorously describe the process X and give the assumptions which are needed in order to derive the convergence results. Let d_X, d_W ∈ N and T > 0 be the fixed time horizon. Consider a filtered probability space (Ω, F, F := {F_t}_{0≤t≤T}, P), on which an adapted d_W-dimensional Brownian motion {W_t}_{t∈[0,T]} is defined. We define the stochastic process X := (X_t)_{t∈[0,T]} as the solution of the stochastic differential equation (SDE)

dX_t = µ(t, X_t) dt + σ(t, X_t) dW_t, for all 0 ≤ t ≤ T,

where X_0 = x ∈ R^{d_X} is the starting point and the measurable functions µ : [0, T] × R^{d_X} → R^{d_X} and σ : [0, T] × R^{d_X} → R^{d_X × d_W} are the drift and the diffusion respectively. We impose the following assumptions:
• X is continuous and square integrable, i.e. for P-a.e. ω ∈ Ω the map t ↦ X_t(ω) is continuous and E[X_t^2] < ∞ for every t ∈ [0, T].
• µ and σ are both globally Lipschitz continuous in their second component, i.e. for ϕ ∈ {µ, σ} there exists a constant M > 0 such that for all t ∈ [0, T] we have |ϕ(t, x) - ϕ(t, y)|_2 ≤ M |x - y|_2 and |ϕ(t, x)|_2 ≤ (1 + |x|_2) M. In particular, their growth is at most linear in the second component.
• µ is bounded and continuous in its first component (t) uniformly in its second component (x), i.e. for every t ∈ [0, T] and ε > 0 there exists a δ > 0 such that for all s ∈ [0, T] with |t - s| < δ and all x ∈ R^{d_X} we have |µ(t, x) - µ(s, x)| < ε.
• σ is càdlàg (right-continuous with existing left limits) in the first component and L^2-integrable with respect to W, σ ∈ L^2(W), i.e.

E[ Σ_{i=1}^{d_X} Σ_{j=1}^{d_W} ∫_0^T sup_x σ_{i,j}(t, x)^2 d[W_j, W_j]_t ] = ∫_0^T | sup_x σ(t, x) |_F^2 dt < ∞,

where |·|_F denotes the Frobenius matrix norm. This is in particular implied if σ is bounded.
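Under these assumptions, paths of the SDE can be simulated with the Euler-Maruyama scheme. The sketch below uses toy coefficients µ(t, x) = -x (globally Lipschitz) and a constant, hence bounded, σ that satisfy the conditions above; it is a hypothetical illustration, not one of the paper's datasets.

```python
import numpy as np

# Euler-Maruyama simulation of dX_t = mu(t, X_t) dt + sigma(t, X_t) dW_t
# with d_X = d_W = 2. Toy coefficients chosen to satisfy the Lipschitz,
# linear-growth and boundedness assumptions of Appendix A.1.
def mu(t, x):
    return -x                      # mean reversion, globally Lipschitz

def sigma(t, x):
    return 0.2 * np.eye(2)         # constant, bounded, hence in L^2(W)

def euler_maruyama(x0, T=1.0, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=2)   # Brownian increment
        x = x + mu(t, x) * dt + sigma(t, x) @ dW
    return x

x_T = euler_maruyama([1.0, -1.0])
```

The Lipschitz and linear-growth conditions are exactly what guarantees that such a scheme converges to the unique strong solution as the step size shrinks.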

A.2 RANDOM OBSERVATION DATES

In this section we describe how we model the observation dates. We want to treat irregularly observed time series, i.e. without a fixed number of observations and not necessarily equally distributed. Therefore, we suppose that we have a possibly random number of n observations and that those observations are made at random times t_1, . . . , t_n. For simplicity we make the assumption that the observation dates are independent of the stochastic process itself. In future work, this will be generalized to observation dates correlated with X, which could be modelled by a point process on the same probability space. In contrast, here we hard-code the independence assumption by considering a second probability space (Ω̃, F̃, P̃), on which the random observation times of the stochastic process X are defined. More precisely, we assume that:
• n : Ω̃ → N_{≥0} is a random variable with E_P̃[n] < ∞, the random number of observations,
• t_i : Ω̃ → [0, T] for 0 ≤ i ≤ n are sorted random variables, the random observation times.
We denote the joint pushforward measure of n and {t_i}_{0≤i≤n} as P̃_t := (n, t_0, . . . , t_n)_# P̃. The random variable n can but does not have to be unbounded. If it is bounded, we define K := max{k ∈ N | P̃(n ≥ k) > 0} to be the maximal value of n, which otherwise is infinity. We use the notation B([0, T]) for the Borel σ-algebra of the set [0, T]. Then we define for each 1 ≤ k ≤ K

λ_k : B([0, T]) → [0, 1], B ↦ λ_k(B) := P̃(n ≥ k, t_k- ∈ B) / P̃(n ≥ k),

which is a probability measure on the time interval, as shown in the lemma below. Here, t_k- means the left-point of t_k; for example, if t_0 = 0, t_1 = 1, we have t_0 ∈ [0, 1), t_1 ∉ [0, 1) but t_0- ∉ [0, 1), t_1- ∈ [0, 1). Lemma A.1. For 1 ≤ k < K + 1 the map λ_k defines a probability measure. Proof. First we see that λ_k([0, T]) = 1 since t_k maps to [0, T].
Furthermore, for any disjoint sets B_i ∈ B([0, T]), i ∈ N, we have {n ≥ k, t_k- ∈ ∪_{i≥1} B_i} = ∪_{i≥1} {n ≥ k, t_k- ∈ B_i} and these sets are F̃-measurable, since they are defined through pre-images of random variables. Therefore, the additivity of P̃ implies that λ_k(∪_{i≥1} B_i) = Σ_{i≥1} λ_k(B_i). Moreover, we define τ as the time of the last observation before a certain time t,

τ : [0, T] × Ω̃ → [0, T], (t, ω̃) ↦ τ(t, ω̃) := max{t_i(ω̃) | 0 ≤ i ≤ n(ω̃), t_i(ω̃) ≤ t}.

Example A.2. We give two examples of how the random observation dates could be defined.
• Let (N_t)_{t∈[0,T]} be a homogeneous Poisson point process with rate r > 0. Hence, N(0) = 0, t ↦ N(t) is constant except for jumps of size 1 at discrete random times, and for any fixed t ∈ [0, T], N(t) is Poisson distributed, i.e. P̃[N(t) = k] = (rt)^k / k! · e^{-rt} for k ∈ N. Then the number of observations is defined as n := N(T) and satisfies E_P̃[n] = rT. The observation dates t_1, . . . , t_n are defined as the discontinuity times of N, i.e. the times where N increases by 1. In particular, for 0 ≤ k ≤ n, t_k := inf{t | N(t) = k} ≤ T. Therefore, for 0 ≤ a ≤ b ≤ T we can rewrite λ_k((a, b)) = P̃(N(a) < k, N(b) ≥ k) / P̃(N(T) ≥ k).
• We define n by one of the following options:
  - as a constant in N_{>0},
  - as a Binomial random variable, n ∼ Binom(p, n_max), for p ∈ (0, 1), n_max ∈ N_{>0},
  - as a Geometric random variable, n ∼ Geom(p), for p ∈ (0, 1),
  - as a Poisson random variable, n ∼ Poi(r), for r > 0,
and t_1, . . . , t_n are defined by choosing n uniform random variables on [0, T] and sorting them.
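The second construction of Example A.2 can be sketched as follows, together with the last-observation map τ; the parameter values are placeholders.

```python
import numpy as np

# Sampling random observation dates as in Example A.2: n is Poisson
# distributed and the observation times are n sorted uniforms on [0, T].
def sample_observation_times(T=1.0, rate=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n = rng.poisson(rate)                       # random number of observations
    return np.sort(rng.uniform(0.0, T, size=n))

def tau(t, obs_times):
    """Time of the last observation at or before t; 0.0 if none yet."""
    earlier = obs_times[obs_times <= t]
    return earlier[-1] if earlier.size else 0.0

ts = sample_observation_times()
```

Note that τ is exactly the quantity the NJ-ODE conditions on between observations: the prediction at time t extrapolates from the observation at τ(t).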

A.3 INFORMATION σ-ALGEBRA

In this section, we define a mathematical tool, the σ-algebra, that is essential for the description of the conditional expectation. This object describes which information is available at a certain time t. In the following, we omit ω̃ ∈ Ω̃ whenever the meaning is clear. We define the filtration of the currently available information A := (A_t)_{t∈[0,T]} by A_t := σ(X_{t_i} | t_i ≤ t), where t_i are the observation times and σ(·) denotes the generated σ-algebra. By the definition of τ we have A_t = A_{τ(t)} for all t ∈ [0, T]. (Ω × Ω̃, F ⊗ F̃, F ⊗ F̃, P × P̃) is the filtered product probability space which, intuitively speaking, combines the randomness of the stochastic process with the randomness of the observations. Here, F ⊗ F̃ consists of the tensor-product σ-algebras (F ⊗ F̃)_t := F_t ⊗ F̃ for t ∈ [0, T]. As explained in the remark below, A_t can be identified with a sub-σ-algebra of (F ⊗ F̃)_t. Remark A.3. The σ-fields A_t depend on ω̃ ∈ Ω̃ as well. If we look at the product probability space (Ω × Ω̃, F ⊗ F̃, F ⊗ F̃, P × P̃), where F ⊗ F̃ consists of the tensor-product σ-algebras (F ⊗ F̃)_t := F_t ⊗ F̃ for t ∈ [0, T], then

σ(X_{t_i} | t_i ≤ t) := σ( { (ω, ω̃) ∈ Ω × Ω̃ | X_{t_i(ω̃)}(ω) ∈ A, n(ω̃) ≥ i, t_i(ω̃) ≤ t } | A ∈ B(R^{d_X}), i ∈ N )

is a well-defined sub-σ-algebra of (F ⊗ F̃)_t. Furthermore, we can recover the ω̃-wise defined version of A_t by intersecting each set in it with Ω × {ω̃} and subsequently projecting the intersection onto its first component. We use the notation Ã_t := Ã_t(ω̃) = σ(X_{t_i(ω̃)} | t_i(ω̃) ≤ t) to distinguish this ω̃-wise definition from the definition as sub-σ-algebra of the product space given above. However, Lemma B.3 implies that for our considerations both versions of this σ-algebra have the same effect, therefore we simply write A_t for both versions, by abuse of notation.

B OPTIMAL APPROXIMATION X̂ OF THE STOCHASTIC PROCESS X

We are interested in the "best" approximation (or prediction) X̂_t of the process X that one can make at any time t ∈ [0, T], given the currently available information A_t. For us, "best" refers to the L²(Ω × Ω, F ⊗ F, P × P)-minimizer; therefore, this approximation is given by the conditional expectation. Indeed, if we define ∆ := {(t, r) ∈ [0, T]² | t + r ≤ T} and the function μ̃ : ∆ × R^{d_X} → R^{d_X}, ((t, r), ξ) ↦ E[µ(t + r, X_{t+r}) | X_t = ξ], this is proven in the following proposition.

Proposition B.1. The optimal (i.e. L²-norm minimizing) A-adapted process in L²(Ω × Ω, F ⊗ F, P × P) approximating (X_t)_{t∈[0,T]} is given by X̂ := (X̂_t)_{t∈[0,T]} with X̂_t := E_{P×P}[X_t | A_t]. Moreover, this process is unique up to (P × P)-null-sets. In addition we have (ω-wise for ω ∈ Ω) that X̂_t = X_{τ(t)} + ∫_{τ(t)}^t μ̃(τ(t), s − τ(t), X_{τ(t)}) ds, implying that X̂ is càdlàg.

The first part of the result follows from the elementary proposition below (which is proven for example in (Durrett, 2010, Thm. 5.1.8) for R-valued random variables and can easily be extended to R^d-valued random variables when using the 2-norm).

Proposition B.2. Given a probability space (Ω, F, P) and a sub-σ-algebra A ⊂ F, the orthogonal projection of X ∈ L²(Ω, F, P) onto L²(Ω, A, P) is given by X̂ := E[X | A]. In particular, for every Z ∈ L²(Ω, A, P) with P(Z ≠ X̂) > 0 we have E[|X − Z|²_2] = E[|X − X̂|²_2] + E[|Z − X̂|²_2] > E[|X − X̂|²_2].

Proof of Proposition B.1. In our case this means that the optimal A-adapted process in L²(Ω × Ω, F ⊗ F, P × P) approximating X is given by (X̂_t)_{t∈[0,T]} with X̂_t := E_{P×P}[X_t | A_t]. Here, A_t is meant as a sub-σ-algebra of F_t ⊗ F. This process is unique up to (P × P)-null-sets. Moreover, the following lemma shows that it coincides ω-wise with E_P[X_t | Ã_t](ω), where Ã_t = Ã_t(ω) is defined in Remark A.3.

Lemma B.3. For P-almost-every ω ∈ Ω we have E_{P×P}[X_t | A_t](ω) = E_P[X_t | Ã_t](ω) P-almost-surely.

Proof.
Assume the statement does not hold. Then, by Fubini's theorem, Proposition B.2 and an argument similar to the one in Lemma E.4, we have E_{P×P}[|X_t − E_{P×P}[X_t | A_t]|²_2] = E_P[E_P[|X_t − E_{P×P}[X_t | A_t]|²_2]] > E_P[E_P[|X_t − E_P[X_t | Ã_t]|²_2]], which is a contradiction to Proposition B.2.

This proves the first part of Proposition B.1. The second part of this proposition, i.e. (12), should be understood ω-wise, for ω ∈ Ω. This is justified by Lemma B.3 and derived below. In particular, for the remainder of this section, all statements are meant ω-wise. With the assumption that µ and σ are Lipschitz, (Protter, 2005, Thm. 7, Chap. V) implies that a unique solution of (9) exists. Furthermore, this solution is a Markov process as soon as the starting point x is fixed (Protter, 2005, Thm. 32, Chap. V). Hence, one can define a transition function P (compare (Protter, 2005, Chap. V.6)) such that for all s < t and φ bounded and measurable, P_{s,t}(X_s, φ) := E_P[φ(X_t) | σ(X_s)] = E_P[φ(X_t) | F_s]. We have that X_{τ(s)} is A_{τ(s)}-measurable and therefore, since A_s = A_{τ(s)} ⊂ F_{τ(s)}, P_{τ(s),t}(X_{τ(s)}, φ) = E_P[φ(X_t) | A_s]. (13) By our additional assumption on σ it follows from (Protter, 2005, Lem. before Thm. 28, Chap. IV) that M_t := ∫_0^t σ(s, X_s) dW_s, 0 ≤ t ≤ T, is a square integrable martingale, since the Brownian motion W is square integrable. In particular, for 0 ≤ s ≤ t ≤ T we have E_P[∫_s^t σ(r, X_r) dW_r | F_s] = E_P[M_t − M_s | F_s] = 0. Moreover, the same is true when conditioning on A_s. Using the martingale property of M, we have (ω-wise) for every t ∈ [0, T]:

X̂_t = E_P[(X_t − X_{τ(t)}) + X_{τ(t)} | A_{τ(t)}] = X_{τ(t)} + E_P[∫_{τ(t)}^t µ(r, X_r) dr | A_{τ(t)}] + E_P[∫_{τ(t)}^t σ(r, X_r) dW_r | A_{τ(t)}] = X_{τ(t)} + ∫_{τ(t)}^t E_P[µ(r, X_r) | A_{τ(t)}] dr, (14)

where we used Fubini's theorem (for conditional expectations) in the last step. This is justified because E_P[∫_0^T |µ(r, X_r)|² dr] < ∞ follows from µ being bounded.
Let us define ∆ := {(t, r) ∈ [0, T]² | t + r ≤ T} and the function μ̃ : ∆ × R^{d_X} → R^{d_X}, ((t, r), ξ) ↦ P_{t,t+r}(X_t, µ)|_{X_t=ξ} = E_P[µ(t + r, X_{t+r}) | X_t = ξ]; then we can use (13) to rewrite (14) as X̂_t = X_{τ(t)} + ∫_{τ(t)}^t μ̃(τ(t), s − τ(t), X_{τ(t)}) ds.

Proposition B.4. The function μ̃ is continuous.

Proof. For s ∈ [0, T], ξ ∈ R^{d_X}, we define ζ_{s,·}(ξ) : [0, T] × Ω → R^{d_X}, (t, ω) ↦ ζ_{s,t}(ξ)(ω) to be the solution of the SDE ζ_{s,t}(ξ) = ξ + ∫_s^t µ(r, ζ_{s,r}) dr + ∫_s^t σ(r, ζ_{s,r}) dW_r. This solution exists and is unique by (Protter, 2005, Chap. V, Thm. 7); therefore we have that X_t = ζ_{0,t}(x) P-almost surely. Furthermore, (Gubinelli, 2016, Thm. 4) implies that for s ≤ t we have X_t = ζ_{s,t}(ζ_{0,s}(x)). Hence, for t = s + r, we have the identity μ̃(s, r, ξ) = E[µ(t, ζ_{s,t}(ξ))]. Furthermore, by (Gubinelli, 2016, Thm. 8) we have for any ξ, ξ′ ∈ R^{d_X} and (s, r), (s′, r′) ∈ ∆ with t := s + r, t′ := s′ + r′ ∈ [0, T] that there exists some constant C such that

E[|ζ_{s,t}(ξ) − ζ_{s′,t′}(ξ′)|²_2] ≤ C( |ξ − ξ′|²_2 + (1 + |ξ|_2 + |ξ′|_2)² (|t − t′| + |s − s′|) ). (16)

Therefore, we have that

|μ̃(s, r, ξ) − μ̃(s′, r′, ξ′)|_2 = |E[µ(t, ζ_{s,t}(ξ))] − E[µ(t′, ζ_{s′,t′}(ξ′))]|_2 ≤ |E[µ(t, ζ_{s,t}(ξ))] − E[µ(t′, ζ_{s,t}(ξ))]|_2 + |E[µ(t′, ζ_{s,t}(ξ))] − E[µ(t′, ζ_{s′,t′}(ξ′))]|_2 ≤ E[|µ(t, ζ_{s,t}(ξ)) − µ(t′, ζ_{s,t}(ξ))|²_2]^{1/2} + E[|µ(t′, ζ_{s,t}(ξ)) − µ(t′, ζ_{s′,t′}(ξ′))|²_2]^{1/2} ≤ E[|µ(t, ζ_{s,t}(ξ)) − µ(t′, ζ_{s,t}(ξ))|²_2]^{1/2} + M E[|ζ_{s,t}(ξ) − ζ_{s′,t′}(ξ′)|²_2]^{1/2}, (17)

where we used Jensen's inequality in the second-to-last and (10) in the last step. Hence, for (s′, r′, ξ′) → (s, r, ξ), the first term of (17) goes to zero due to continuity of µ in its first component, uniformly in the second component. Moreover, the second term of (17) converges to zero by (16). Together, this proves continuity of μ̃.
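To make Proposition B.1 concrete, consider a hypothetical Ornstein-Uhlenbeck example (not from the paper): dX_t = −a X_t dt + s dW_t, so µ(t, x) = −a x and μ̃(t, r, ξ) = E[µ(t + r, X_{t+r}) | X_t = ξ] = −a ξ e^{−a r}. Integrating the ODE of Proposition B.1 between observations then gives X̂ = ξ e^{−a r}, the known conditional expectation. A minimal numerical check of this identity:

```python
import math

def mubar(a, r, xi):
    """mubar(t, r, xi) = E[mu(t+r, X_{t+r}) | X_t = xi] = -a * xi * exp(-a*r)
    for the OU process; the process is time-homogeneous, so t drops out."""
    return -a * xi * math.exp(-a * r)

def xhat_between_observations(a, xi, r, steps=10_000):
    """Left-point Riemann sum for Xhat = xi + int_0^r mubar(., s, xi) ds.
    Note the integrand depends on the last observation xi, not on Xhat itself,
    exactly as in the formula of Proposition B.1."""
    h = r / steps
    return xi + sum(h * mubar(a, k * h, xi) for k in range(steps))
```

For a = 0.5, ξ = 2 and r = 1 this reproduces ξ e^{−a r} up to the discretization error of the Riemann sum.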

C RECALL: RNN, NEURAL ODE AND ODE-RNN AS EQUIVALENT WAYS OF WRITING

We describe our model as the solution of the SDE (29), which is a compact way of writing it. In the following, we first briefly recall recurrent neural networks (RNN) and the neural ODE, and then recall the ODE-RNN. Furthermore, we give a step-by-step explanation of how the usual formulation of the ODE-RNN can equivalently be written in terms of an SDE similar to (29). Finally, we give the alternative way of writing our model and a short comparison to the ODE-RNN.

Recurrent Neural Network. The input to an RNN is a discrete time series of observations {x_{t_1}, ..., x_{t_n}}. At each time t_{i+1}, the latent variable h is updated using the previous latent variable h_{t_i} and the input x_{t_{i+1}} as h_{t_{i+1}} := RNNCell(h_{t_i}, x_{t_{i+1}}), (18) where RNNCell is a neural network.

Neural Ordinary Differential Equation. Neural ODEs (Chen et al., 2018) are a family of continuous-time models defining a latent variable h_t := h(t) to be the solution to an ODE initial-value problem (IVP): h_t := h_{t_0} + ∫_{t_0}^t f(h_s, s, θ) ds, t ≥ t_0, (19) where f(·, ·, θ) = f_θ is a neural network with weights θ. Therefore, the latent variable can be updated continuously by solving this ODE (19). We can emphasize the dependence of h_t on a numerical ODE solver by rewriting (19) as h_t := ODESolve(f_θ, h_{t_0}, (t_0, t)). (20)

ODE-RNN. The ODE-RNN (Rubanova et al., 2019) is a mixture of an RNN and a neural ODE. In contrast to a standard RNN, we are not only interested in an output at the observation times t_i, but also in between those times. In particular, we want an output stream that is generated continuously in time. This is achieved by using a neural ODE to model the latent dynamics between two observation times, i.e. for t_i < t < t_{i+1} the latent variable is defined as in (19) and (20), with h_{t_0} and t_0 replaced by h_{t_i} and t_i. At the next observation time t_{i+1}, the latent variable is then updated by an RNN. Rubanova et al.
(2019) write this as h′_{t_{i+1}} := ODESolve(f_θ, h_{t_i}, (t_i, t_{i+1})), h_{t_{i+1}} := RNNCell(h′_{t_{i+1}}, x_{t_{i+1}}). (21) Therefore, fixing h_{t_0} := h_0, the entire latent process can be computed by iteratively solving an ODE followed by applying an RNN cell.

GRU-ODE-Bayes. The model architecture describing the latent variable in GRU-ODE-Bayes (Brouwer et al., 2019) is defined as in the ODE-RNN, but with the special choice of a gated recurrent unit (GRU) for the RNN cell, and with the neural network f_θ also being derived from a GRU.

ODE-RNN as càdlàg process. Thinking about the process h := (h_t)_{t≥t_0} defined in (21) as a (stochastic) process in time, it evolves continuously for t_i ≤ t < t_{i+1} according to the ODE dynamics f_θ and jumps at time t_{i+1} according to the RNN cell. In particular, it is defined to be a càdlàg process, for which h_{t_{i+1}−} is the standard notation for the left limit, i.e. the last point before the jump at time t_{i+1}. According to this notation we have h_{t_{i+1}−} = h′_{t_{i+1}}; hence, we can rewrite (21) as h_{t_{i+1}−} := ODESolve(f_θ, h_{t_i}, (t_i, t_{i+1})), h_{t_{i+1}} := RNNCell(h_{t_{i+1}−}, x_{t_{i+1}}). (22)
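A minimal scalar sketch of the recursion (22), with hand-picked maps standing in for the trained networks f_θ and RNNCell (both hypothetical; a fixed-step explicit Euler scheme plays the role of ODESolve):

```python
import math

def f_theta(h, t):
    """Toy ODE dynamics between observations (stand-in for a trained network)."""
    return math.tanh(0.3 * h - 0.1 * t)

def rnn_cell(h, x):
    """Toy RNN cell applied at an observation (stand-in for a trained network)."""
    return math.tanh(0.5 * h + 0.8 * x)

def ode_rnn(observations, h0=0.0, steps=100):
    """observations: list of (t_i, x_i) with increasing t_i > 0.
    Following (22): Euler-solve the ODE on [t_i, t_{i+1}] to obtain the left
    limit h_{t_{i+1}-}, then jump with the RNN cell to get h_{t_{i+1}}.
    Returns the latent state after the last jump."""
    h, t = h0, 0.0
    for t_next, x in observations:
        dt = (t_next - t) / steps
        for k in range(steps):
            h += dt * f_theta(h, t + k * dt)   # continuous part: h_{t_{i+1}-}
        h = rnn_cell(h, x)                     # jump at the observation: h_{t_{i+1}}
        t = t_next
    return h
```

Sampling the latent state between observations simply means reading off h during the Euler loop, which is exactly the continuous output stream described above.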

C.1 RESIDUAL ODE-RNN AS A SPECIAL CASE OF CONTROLLED ODE

Residual ODE-RNN. We replace the standard RNN cell by a residual RNN cell (rRNN), as described e.g. in Yue et al. (2018). In particular, instead of applying the RNN cell such that h_{t_i} = RNNCell(h_{t_i−}, x_{t_i}), we use a residual RNN cell so that h_{t_i} = h_{t_i−} + rRNNCell(h_{t_i−}, x_{t_i}). The residual RNN is as expressive as the standard RNN and was empirically shown to perform very similarly to or even better than the standard framework (Yue et al., 2018). This way, the residual RNN cell models exactly the jump of the latent variable (i.e. the difference) that occurs at time t_{i+1} when the next observation x_{t_{i+1}} is taken into account. Therefore, we can rewrite the ODE-RNN (22) as h_{t_{i+1}−} := ODESolve(f_θ, h_{t_i}, (t_i, t_{i+1})), h_{t_{i+1}} := h_{t_{i+1}−} + rRNNCell(h_{t_{i+1}−}, x_{t_{i+1}}). (23)

Controlled Ordinary Differential Equation. We briefly recall the definition of controlled ODEs as given in (Herrera et al., 2020, Section 4.1) and used in (Cuchiero et al., 2019; Herrera et al., 2020) to describe neural networks. We fix ℓ, d ∈ N and define for 1 ≤ i ≤ d the vector fields V_i : Θ × R_{≥0} × R^ℓ → R^ℓ, (θ, t, x) ↦ V^θ_i(t, x), which are càglàd in t and Lipschitz continuous in x. Furthermore, we define the scalar càdlàg control functions u_i : R_{≥0} → R, t ↦ u_i(t), which have finite variation and satisfy u_i(0) = 0. Then we define the process Z := (Z_t)_{t≥0} as the solution of the controlled ODE dZ_t = Σ_{i=1}^d V^θ_i(t, Z_{t−}) du_i(t), Z_0 = z, (24) where z ∈ R^ℓ is some starting point. (24) is written in Itô's differential notation for (stochastic) integrals. The solution of (24) exists and is unique under much more general assumptions than we made here (Protter, 2005, Chap. V, Thm. 7). As in (Herrera et al., 2020), one can take the u_i to be semimartingales instead of deterministic functions. In line with this, we set d = 2, take u_1(t) := t for the drift part, and define u_2 := u as the pure jump stochastic control process u : Ω × [0, T] → R, (ω, t) ↦ u_t(ω) := Σ_{i=1}^{n(ω)} 1_{[t_i(ω),∞)}(t). (25)
We note that u is an adapted process starting at 0 with finite variation on the product probability space (Ω × Ω, F ⊗ F, F ⊗ F, P × P), since the total variation of u up to time T is n and E_{P×P}[n] < ∞. The following result shows that the residual ODE-RNN can compactly be described as a controlled ODE.

Proposition C.1. Using the vector fields and controls defined above, the latent variable process h = (h_t)_{t≥t_0} of the residual ODE-RNN can equivalently be written as the solution of the controlled ODE dh_t = f_θ(h_{t−}, t−) dt + rRNNCell(h_{t−}, x_t) du(t), h_{t_0} = h_0. (26)

Proof of Proposition C.1. By (Protter, 2005, Chap. II, Thm. 17), the stochastic integral is indistinguishable from the path-wise Lebesgue-Stieltjes integral if the integrator is of finite variation. Hence, we can assume that some ω ∈ Ω is fixed and that the following expressions are evaluated at this ω whenever applicable. First note that u is constant except at the t_i, where it increases by 1 (cf. Figure 7). In particular, the Lebesgue-Stieltjes integral of some càdlàg function g with respect to u is a sum, i.e. ∫_0^t g(s−) du_s = Σ_{t_i≤t} g(t_i−) ∆u_{t_i}, where ∆u_{t_i} = u_{t_i} − u_{t_i−} = 1. Therefore, integrating (26) from t_0 to t we get

h_t = h_{t_0} + ∫_{t_0}^t f_θ(h_{s−}, s−) ds + ∫_{t_0}^t rRNNCell(h_{s−}, x_s) du(s) = h_{t_0} + ∫_{t_0}^t f_θ(h_{s−}, s−) ds + Σ_{t_i≤t} rRNNCell(h_{t_i−}, x_{t_i}).

In particular, since the first integral is continuous in t, we have for every 1 ≤ k ≤ n:

h_{t_k} = h_{t_0} + ∫_{t_0}^{t_k} f_θ(h_{s−}, s−) ds + Σ_{t_i<t_k} rRNNCell(h_{t_i−}, x_{t_i}) + rRNNCell(h_{t_k−}, x_{t_k}) = h_{t_k−} + rRNNCell(h_{t_k−}, x_{t_k}), (27)

and for t_k < t < t_{k+1}:

h_t = h_{t_0} + ∫_{t_0}^{t_k} f_θ(h_{s−}, s−) ds + Σ_{t_i≤t_k} rRNNCell(h_{t_i−}, x_{t_i}) + ∫_{t_k}^t f_θ(h_{s−}, s−) ds = h_{t_k} + ∫_{t_k}^t f_θ(h_{s−}, s−) ds. (28)

Together, (27) and (28) prove that (26) is equivalent to (23). We also emphasize that x_t is used as input only for t = t_i, 1 ≤ i ≤ n.
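The key step of the proof, that integration against the pure jump control u reduces path-wise to a sum over the observation times, can be checked in a few lines (an illustration, not the paper's code):

```python
def integral_wrt_jump_control(g, jump_times, t):
    """Lebesgue-Stieltjes integral int_0^t g(s-) du_s for the control
    u(s) = sum_i 1_{[t_i, oo)}(s): every jump of u has size Delta u = 1,
    so the integral collapses to a sum of g over the jump times up to t.
    For continuous g we may evaluate g(t_i) in place of the left limit g(t_i-)."""
    return sum(g(ti) for ti in jump_times if ti <= t)
```

For g(s) = s² and jump times {0.2, 0.5, 0.9}, the integral up to t = 0.6 equals 0.2² + 0.5², since only the first two jumps contribute.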

D NEURAL JUMP ODE

We propose a model framework that can be interpreted as a simplification of the ODE-RNN (Rubanova et al., 2019) and GRU-ODE-Bayes (Brouwer et al., 2019) architectures.

D.1 THE MODEL FRAMEWORK: NEURAL ODE BETWEEN JUMPS

We define X ⊂ R^{d_X} and H ⊂ R^{d_H} to be the observation and latent space for d_X, d_H ∈ N. Moreover, we define the following feed-forward neural networks with sigmoid activation functions:

• f_{θ_1} : R^{d_H} × R^{d_X} × [0, T] × [0, T] → R^{d_H}, modelling the ODE dynamics,
• ρ_{θ_2} : R^{d_X} → R^{d_H}, modelling the jumps when new observations are made, and
• g_{θ_3} : R^{d_H} → R^{d_Y}, the readout map, mapping into the target space Y ⊂ R^{d_Y} for d_Y ∈ N.

The trainable parameters of the neural networks are θ := (θ_1, θ_2, θ_3) ∈ Θ := ∪_{M≥1} Θ_M. Here, for every M ∈ N, Θ_M is defined as the set of all parameters such that f_{θ_1} and ρ_{θ_2} have hidden dimension M. We define the pure jump stochastic process (cf. Appendix C) u : Ω × [0, T] → R, (ω, t) ↦ u_t(ω) := Σ_{i=1}^{n(ω)} 1_{[t_i(ω),∞)}(t). Then the Neural Jump ODE (NJ-ODE) is defined by the latent process H := (H_t)_{t∈[0,T]} and the output process Y := (Y_t)_{t∈[0,T]}, defined as solutions of the SDE system (cf. Appendix C)

H_0 = ρ_{θ_2}(X_0), dH_t = f_{θ_1}(H_{t−}, X_{τ(t)}, τ(t), t − τ(t)) dt + (ρ_{θ_2}(X_t) − H_{t−}) du_t, Y_t = g_{θ_3}(H_t). (29)

Note that only the values of X at the times t_i, 0 ≤ i ≤ n, are used as inputs. We use the notation H^θ_t(X) and Y^θ_t(X) to emphasize the dependence of the latent process H and the output process Y on the model parameters θ and the input X.

Existence and uniqueness. We note that u is an adapted process starting at 0 with finite variation on the product probability space (Ω × Ω, F ⊗ F, F ⊗ F, P × P), since the total variation of u up to time T is n and E_{P×P}[n] < ∞. Hence, since f_{θ_1} and ρ_{θ_2} are Lipschitz continuous (as compositions of Lipschitz continuous functions), a unique càdlàg solution H^θ exists once an initial value is fixed (Protter, 2005, Thm. 7, Chap. V). Moreover, the resulting process (Y^θ_t)_{t∈[0,T]} is also càdlàg and A-adapted.

Equivalent way of writing the NJ-ODE.
By applying the steps explained in Section C backwards to (29) (but without introducing the residual version of the RNN), it is easy to see that our model can equivalently be written (similarly to (22)) as h_{t_{i+1}−} := ODESolve(f_{θ_1}, (h_t, x_{t_i}, t_i, t − t_i), (t_i, t_{i+1})), h_{t_{i+1}} := ρ_{θ_2}(x_{t_{i+1}}). (30) In particular, the main difference to (22) is that we use a modified input for f_{θ_1} and that the neural network performing the jumps does not take h_{t_{i+1}−} as an input, i.e. it is not an RNN.
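The following scalar sketch illustrates the forward pass (29)/(30); f1, rho and readout are hypothetical hand-picked stand-ins for the trained networks f_{θ_1}, ρ_{θ_2} and g_{θ_3} (only their signatures mirror the model), and the ODE segments are solved with an explicit Euler scheme:

```python
import math

def f1(h, x_last, tau, dt_since):
    """Toy ODE dynamics f_{theta_1}(H_{t-}, X_{tau(t)}, tau(t), t - tau(t));
    tau is accepted to mirror the model input but unused in this toy choice."""
    return math.tanh(0.2 * h + 0.3 * x_last - 0.05 * dt_since)

def rho(x):
    """Toy jump map rho_{theta_2}: depends on the new observation only, not on H_{t-}."""
    return math.tanh(0.7 * x)

def readout(h):
    """Toy readout map g_{theta_3}."""
    return 2.0 * h

def nj_ode_path(obs, t_end, dt=0.01):
    """obs: list of (t_i, x_i) with t_0 = obs[0][0]; returns samples (t, Y_t).
    At each observation the latent jumps to rho(x_i), cf. (30); between
    observations it follows an Euler scheme for the drift part of (29)."""
    out = []
    seg_ends = [t for t, _ in obs[1:]] + [t_end]
    for (tau, x), t_next in zip(obs, seg_ends):
        h = rho(x)                      # jump: H_{t_i} = rho_{theta_2}(X_{t_i})
        out.append((tau, readout(h)))
        t = tau
        while t < t_next - 1e-12:       # neural-ODE segment on [t_i, t_{i+1})
            step = min(dt, t_next - t)
            h += step * f1(h, x, tau, t - tau)
            t += step
            out.append((t, readout(h)))
    return out
```

Note how the jump discards the left limit of the latent state, which is exactly the difference to the ODE-RNN pointed out above.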

D.2 OBJECTIVE FUNCTION

We introduce a loss function such that the output Y^θ of the NJ-ODE can be trained to approximate X̂, i.e. to model the best online prediction of the stochastic process X. We define the theoretical loss function and its Monte Carlo and ergodic approximations.

Objective function. Let D be the set of all R^{d_X}-valued A-adapted processes on the probability space (Ω × Ω, F ⊗ F, F ⊗ F, P × P). Then we define our objective functions

Ψ : D → R, Z ↦ Ψ(Z) := E_{P×P}[ (1/n) Σ_{i=1}^n ( |X_{t_i} − Z_{t_i}|_2 + |Z_{t_i} − Z_{t_i−}|_2 )² ], (31)
Φ : Θ → R, θ ↦ Φ(θ) := Ψ(Y^θ(X)), (32)

where Φ will be our (theoretical) loss function. From the definition of Y^θ it directly follows that Y^θ(X) is an element of D; hence Φ is well-defined.

Monte Carlo approximation of the objective function. Let us assume that we observe N ∈ N independent realisations of the path X at times (t^{(j)}_1, ..., t^{(j)}_{n_j}), 1 ≤ j ≤ N, which are themselves independent realisations of the random vector (n, t_1, ..., t_n). In particular, let us assume that X^{(j)} ∼ X and (n_j, t^{(j)}_1, ..., t^{(j)}_{n_j}) ∼ (n, t_1, ..., t_n) are i.i.d. random processes (respectively variables) for 1 ≤ j ≤ N and that our training data is one realisation of them. Then

Φ̂_N(θ) := (1/N) Σ_{j=1}^N (1/n_j) Σ_{i=1}^{n_j} ( |X^{(j)}_{t^{(j)}_i} − Y^θ_{t^{(j)}_i}(X^{(j)})|_2 + |Y^θ_{t^{(j)}_i}(X^{(j)}) − Y^θ_{t^{(j)}_i−}(X^{(j)})|_2 )² (33)

converges (P × P)-a.s. to Φ(θ) as N → ∞ by the law of large numbers (cf. Theorem E.13).

Ergodic approximation of the objective function. If we only observe one realization of the path X at times (t̃_1, ..., t̃_N), we can still approximate the objective function by assuming that µ and σ are time-independent and that the stochastic process X is ergodic in the following sense. We fix n = 1 and assume that the time increments ∆t̃_j := t̃_j − t̃_{j−1} are i.i.d. realizations of the probability distribution λ_1. Furthermore, we consider each observation X_{t̃_j} as one sample with initial condition X_{t̃_{j−1}}, for which Y^{θ,j} is the realization of Y^θ.
Then we approximate the objective function by

Φ̃_N(θ) := (1/N) Σ_{j=1}^N ( |X_{t̃_j} − Y^θ_{∆t̃_j}(X)|_2 + |Y^θ_{∆t̃_j}(X) − Y^θ_{∆t̃_j−}(X)|_2 )², (34)

which, by the ergodicity assumption, is assumed to converge for N → ∞ to

E_{P×P}[ ( |X_{t_1} − Y^θ_{t_1}|_2 + |Y^θ_{t_1} − Y^θ_{t_1−}|_2 )² ]. (35)

Instead of setting the random variable n = 1, one could similarly fix the time horizon T and take for each sample all subsequent observations that lie in the time interval [t_start, t_start + T], where t_start is the date of the first observation of this sample. The next sample would then start with the first observation after t_start + T.
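Given model outputs at the observation times, the Monte Carlo loss (33) is a plain double average; the tuple layout below is an illustrative convention, not the paper's data format:

```python
def mc_loss(samples):
    """samples: one list per path j; each entry is (x, y, y_minus), i.e. the
    observation X_{t_i}, the model output Y_{t_i} after the jump, and its left
    limit Y_{t_i-} before the jump. Returns the scalar (d_X = 1) version of
    (33): (1/N) sum_j (1/n_j) sum_i (|x - y| + |y - y_minus|)^2."""
    total = 0.0
    for traj in samples:
        total += sum((abs(x - y) + abs(y - y_minus)) ** 2
                     for x, y, y_minus in traj) / len(traj)
    return total / len(samples)
```

A perfect, jump-free prediction gives loss 0. The first summand rewards matching the new observation after the jump, while the second penalizes the jump size, which forces the left limit Y_{t_i−} (the prediction just before the observation) toward the observation as well.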

E THEORETICAL CONVERGENCE RESULTS

Our main results show that the output of the NJ-ODE converges to the conditional expectation when the size of the neural networks and the number of samples go to infinity. For each network size (and number of samples) we assume to have the weights minimizing the (Monte Carlo approximation of the) loss function. In particular, we do not consider the problem of finding those optimal weights, and therefore also do not analyse backpropagation through our model. In a similar setting, backpropagation was studied by Jia & Benson (2019). For completeness we recall the definition of L^p-convergence.

Definition E.1. Let 1 ≤ p < ∞ and let (Ω, F, P) be a probability space. A sequence of random variables (X_n)_{n∈N} converges to a random variable X in L^p(Ω, F, P) (or simply L^p) if E[|X_n − X|^p] → 0 as n → ∞. We use the notation X_n → X in L^p. L^1-convergence is also called convergence in mean.

E.1 CONVERGENCE WITH RESPECT TO THEORETICAL OBJECTIVE FUNCTION

Theorem E.2. Let θ^min_M ∈ Θ^min_M := argmin_{θ∈Θ_M} {Φ(θ)} for every M ∈ N. Then, for M → ∞, the value of the loss function Φ (32) converges to the minimal value of Ψ (31), which is uniquely achieved by X̂, i.e. Φ(θ^min_M) → min_{Z∈D} Ψ(Z) = Ψ(X̂) as M → ∞. Furthermore, for every 1 ≤ k ≤ K we have that Y^{θ^min_M} converges to X̂ as a random variable in L^1(Ω × [0, T], P × λ_k). In particular, the limit process Ŷ := lim_{M→∞} Y^{θ^min_M} equals X̂ (P × λ_k)-almost surely as a random variable on Ω × [0, T].

The idea of the proof is to split the target jump process X̂ into its continuous-in-time parts and its jumps. Both parts are continuous functions of their inputs and can therefore be approximated by neural networks (Hornik et al., 1989). More precisely, by Propositions B.1 and B.4, the continuous-in-time part can be written as an integral over time of a function which is jointly continuous in its inputs.
This jointly continuous function is approximated by a neural network, which is itself integrated over time; that is the neural ODE part of the NJ-ODE. The remainder of the proof shows L^1-convergence of the output of the NJ-ODE with optimal weights (with respect to the loss function) to the conditional expectation process. For completeness we restate the straightforward generalization of the universal approximation result (Hornik et al., 1989, Thm. 2.4) to multidimensional output.

Theorem E.3 (Hornik). Let r, d ∈ N and let σ be a sigmoid function. Let NN^σ_{r,d} be the set of all 1-hidden-layer neural networks mapping R^r to R^d. Then for every compact subset K ⊂ R^r, every ε > 0 and every f ∈ C(R^r, R^d) there exists a neural network g ∈ NN^σ_{r,d} such that sup_{x∈K} |f(x) − g(x)|_2 < ε.

The following lemmas are used in the proof of Theorem E.2.

Lemma E.4. Let 1 ≤ k ≤ K and let Z ∈ D be a process such that (P × λ_k)[X̂ ≠ Z] > 0. Then there exists an ε > 0 such that B̃ := {t ∈ [0, T] | E_P[|X_t − Z_{t−}|²_2] ≥ ε + E_P[|X_t − X̂_{t−}|²_2]} satisfies λ_k(B̃) > 0.

Proof. First remark that, since X is continuous, we have X_t = X_{t−}. Let us define C := {(ω, t) ∈ Ω × [0, T] | X̂_{t−}(ω) ≠ Z_{t−}(ω)} and, for each t ∈ [0, T], C_t := {ω ∈ Ω | (ω, t) ∈ C}. Then B := {t ∈ [0, T] | P(C_t) > 0} satisfies λ_k(B) > 0, since otherwise, by Fubini's theorem, 0 < (P × λ_k)[C] = ∫_{[0,T]} P(C_t) dλ_k(t) = 0, which is a contradiction. Now Proposition B.2 yields that for each t ∈ B there exists some ε_t > 0 such that E_P[|X_{t−} − Z_{t−}|²_2] ≥ ε_t + E_P[|X_{t−} − X̂_{t−}|²_2]. This implies the claim. Indeed, assume no such ε > 0 exists; then for each n ∈ N we have λ_k({t ∈ [0, T] | E_P[|X_{t−} − Z_{t−}|²_2] ≥ 1/n + E_P[|X_{t−} − X̂_{t−}|²_2]}) = 0. Therefore, λ_k({t ∈ [0, T] | E_P[|X_{t−} − Z_{t−}|²_2] > E_P[|X_{t−} − X̂_{t−}|²_2]}) ≤ Σ_{n∈N} λ_k({t ∈ [0, T] | E_P[|X_{t−} − Z_{t−}|²_2] ≥ 1/n + E_P[|X_{t−} − X̂_{t−}|²_2]}) = 0, which is a contradiction to λ_k(B) > 0.

Lemma E.5.
For any A-adapted process Z it holds that

E_{P×P}[ (1/n) Σ_{i=1}^n |X_{t_i} − Z_{t_i−}|²_2 ] = E_{P×P}[ (1/n) Σ_{i=1}^n |X_{t_i} − X̂_{t_i−}|²_2 ] + E_{P×P}[ (1/n) Σ_{i=1}^n |X̂_{t_i−} − Z_{t_i−}|²_2 ].

Proof. First recall that by continuity X_{t_i} = X_{t_i−}. Then the statement is a consequence of Proposition B.2, Lemma B.3 and Fubini's theorem, which imply

E_{P×P}[ (1/n) Σ_{i=1}^n |X_{t_i−} − Z_{t_i−}|²_2 ] = E_P[ (1/n) Σ_{i=1}^n E_P[|X_{t_i−} − Z_{t_i−}|²_2] ] = E_P[ (1/n) Σ_{i=1}^n ( E_P[|X_{t_i−} − X̂_{t_i−}|²_2] + E_P[|X̂_{t_i−} − Z_{t_i−}|²_2] ) ] = E_{P×P}[ (1/n) Σ_{i=1}^n |X_{t_i−} − X̂_{t_i−}|²_2 ] + E_{P×P}[ (1/n) Σ_{i=1}^n |X̂_{t_i−} − Z_{t_i−}|²_2 ].

Lemma E.6. Let 1 ≤ p < ∞ and let (Z_n)_{n∈N} be a sequence of random variables converging in L^p to Z as well as to Z̃. Then Z = Z̃ almost surely.

Proof. With the triangle inequality it follows that E[|Z − Z̃|^p]^{1/p} = 0, which implies the claim.

Lemma E.7. The random variable S_T := sup_{0≤t≤T} |X_t|_1 is square integrable and bounded in probability, i.e. for any ε > 0 there exists some K > 0 such that P[S_T > K] ≤ ε.

Proof. From the proof of Proposition B.1 we know that M_t := ∫_0^t σ(s, X_s) dW_s, 0 ≤ t ≤ T, is a square integrable martingale and that E_P[sup_{0≤t≤T} |M_t|^j_1] ≤ c E_P[|M_T|^j_1] < ∞ for j = 1, 2 and some constant c > 0, by Doob's inequality. Moreover, µ is bounded, say by B; hence |X_t|_1 = |x + ∫_0^t µ(r, X_r) dr + ∫_0^t σ(r, X_r) dW_r|_1 ≤ |x|_1 + ∫_0^t |µ(r, X_r)|_1 dr + |M_t|_1 ≤ |x|_1 + BT + |M_t|_1. Therefore, E_P[S_T] ≤ |x|_1 + BT + c E_P[|M_T|_1] < ∞ and similarly E_P[S²_T] < ∞, which implies the claim. Indeed, if for a fixed ε > 0 no such K exists, then P[S_T = ∞] ≥ ε, which is a contradiction to integrability.

Proof of Theorem E.2. We start by showing that X̂ ∈ D is the unique minimizer of Ψ up to (P × λ_k)-null-sets for any k ≤ K. First, we recall that for every t_i we have X̂_{t_i} = X_{t_i} and, by continuity of X, that X_{t_i} = X_{t_i−}.
Therefore, Ψ( X) = E P× P 1 n n i=1 X ti -Xti- 2 2 = E P 1 n n i=1 E P X ti -Xti- 2 2 = min Z∈D E P× P 1 n n i=1 |X ti -Z ti-| 2 2 ≤ min Z∈D E P× P 1 n n i=1 (|X ti -Z ti | 2 + |Z ti -Z ti-| 2 ) 2 = min Z∈D Ψ(Z), where we used Fubini's theorem for the second line, Proposition B.2 for the third line and the triangle inequality for the fourth line. Hence, X is a minimizer of Ψ. To see that it is unique (P × λ k )-a.s., let Z ∈ D be a process such that (P × λ k )[ X = Z] > 0. By Lemma E.4, this implies that there exists an ε > 0 such that B := {t ∈ [0, T ] | E P [|X t--Z t-| 2 2 ] ≥ + E P [|X t--Xt-| 2 2 ]} satisfies λ k (B) > 0. Now recall that by definition of λ k we have λ k (B) = P( ∪j≥k {n = j, t k -∈ B})/ P(n ≥ k) > 0. This implies that there exists j ∈ N ≥k such that P(n = j, t k -∈ B) > 0. Therefore, E P 1 n n i=1 1 {ti-∈B} ≥ E P 1 {n=j} 1 n n i=1 1 {ti-∈B} ≥ E P 1 {n=j} 1 j 1 {t k -∈B} = 1 j P(n = j, t k -∈ B) > 0. This inequality implies now that Z is not a minimizer of Ψ, because Ψ(Z) = E P× P 1 n n i=1 (|X ti -Z ti | 2 + |Z ti -Z ti-| 2 ) 2 ≥ E P 1 n n i=1 E P |X ti -Z ti-| 2 2 = E P 1 n n i=1 1 {ti-∈B} + 1 {ti-∈B C } E P |X ti -Z ti-| 2 2 ≥ E P 1 n n i=1 ε 1 {ti-∈B} + E P X ti -Xti- 2 2 = ε E P 1 n n i=1 1 {ti-∈B} + min Z∈D Ψ(Z) > min Z∈D Ψ(Z). Next we show that (29) can approximate X arbitrarily well. Since the dimension d H can be chosen freely, let us fix it to d H := d X . Furthermore, let us fix θ * 2 and θ * 3 such that ρ θ * 2 = g θ * 3 = id. From Theorem E.3 it follows that for any ε > 0 there exist M ∈ N and θ * 1 with (θ * 1 , θ * 2 , θ * 3 ) ∈ Θ M such that sup (u,v,t,r)∈[-M,M ] d H ×d X ×∆ f θ * 1 (u, v, t, r) -μ (t, r, v) 2 ≤ ε, where we used that μ is continuous by Proposition B.4. Since µ is bounded, also μ is bounded, say by 12), (36) and the previous bound yield for B -1 > 0. 
On [-M, M ] d X we , • • • , t n }, we have Y θ * M t -Xt 2 = H t-+ (ρ θ * M (X t ) -H t-) -X t 2 = 0, and if t not in {t 1 , • • • , t n }, then ( S T := sup 0≤t≤T |X t | 1 Y θ * M t -Xt 1 ≤ Y θ * M τ (t) -Xτ(t) 1 + t τ (t) f θ * 1 H s-, X τ (s) , τ (s), s -τ (s) -μ τ (s), s -τ (s), X τ (s) 1 ds ≤ ε T 1 {S T ≤M } + 2BT 1 {S T >M } ≤ ε T + 2BT 1 {S T >M } . Here we used that S T ≤ M implies that (36) can be used for all τ (t) ≤ s ≤ t. Moreover, by equivalence of the 1and 2-norm, there exists a constant c > 0 such that Y θ * M t -Xt 2 ≤ c ε T + 2cBT 1 {S T >M } . With Lemma E.7 we know that E P [1 {S T >M } ] = P[S T > M ] =: M M →∞ ----→ 0. Now we can show the convergence of Φ(θ min M ) using these two bounds and X ti = Xti . Indeed, min Z∈D Ψ(Z) ≤ Φ(θ min M ) ≤ Φ(θ * M ) = E P× P 1 n n i=1 X ti -Y θ * M ti 2 + Y θ * M ti -Y θ * M ti-2 2 ≤ E P× P 1 n n i=1 Xti -Y θ * M ti 2 + Y θ * M ti -Xti 2 + Xti -Xti- 2 + Xti--Y θ * M ti-2 2 ≤ E P× P 1 n n i=1 X ti -Xti- 2 + c(ε T + 2BT 1 {S T >M } ) 2 ≤ E P 1 n n i=1 E P X ti -Xti- 2 + c(ε T + 2BT 1 {S T >M } ) 2 ≤ E P   1 n n i=1 E P X ti -Xti- 2 2 1/2 + E P (c ε T + 2cBT 1 {S T >M } ) 2 1/2 2   , where we used the triangle-inequality for the L 2 -norm in the last step. We can bound E P (c ε T + 2cBT 1 {S T >M } ) 2 ≤ 3(c ε T ) 2 + 3(2cBT ) 2 M =: c 2 M , which is a constant converge to 0 as M → ∞. Using that for a ∈ R, we have a ≤ a 2 + 1, we get min Z∈D Ψ(Z) ≤ Φ(θ min M ) ≤ Φ(θ * M ) ≤ E P   1 n n i=1 E P X ti -Xti- 2 2 1/2 + c(M ) 2   ≤ E P 1 n n i=1 E P X ti -Xti- 2 2 + 2c M E P X ti -Xti- 2 2 1/2 + c 2 M ≤ (1 + 2c M )E P× P 1 n n i=1 X ti -Xti- 2 2 + 2c M + c 2 M ≤ (1 + 2c M ) min Z∈D Ψ(Z) + 2c M + c 2 M M →∞ ----→ min Z∈D Ψ(Z), where we used Ψ( X) = min Z∈D Ψ(Z) and that c M converges to 0. In the last step we show that the limits lim M →∞ Y θ min M and lim M →∞ Y θ * M exist as limits in the Banach space L := L 1 (Ω × [0, T ], F ⊗ B([0, T ]), P × λ k ), for every k ≤ K, and that they are both equal to X. 
Let us fix k ≤ K. First we note that for every B ∈ B([0, T ]) we have E λ k [1 B ] = λ k (B) = P(n≥k,t k -∈B) P(n≥k) = E P[1 {n≥k} 1 {t k -∈B }] P(n ≥ k) . Using "measure theoretic induction" (Durrett, 2010 , Case 1-4 of Proof of Thm. 1.6.9) this yields for c := (P(n ≥ k)) -1 and a B([ 0, T ])-measurable function Z : [0, T ] → R, t → Z t := Z(t) that E λ k [Z] = c E P[1 {n≥k} Z t k -]. Moreover, the triangle inequality and Lemma E.5 yield Φ(θ * M ) -Ψ( X) ≥ E P× P 1 n n i=1 X ti -Y θ * M ti- 2 2 -Ψ( X) = E P× P 1 n n i=1 Xti--Y θ * M ti- 2 2 . For any R d X -valued Z ∈ L the Hölder inequality, together with the fact that n ≥ 1, yields E P× P [|Z| 2 ] = E P× P √ n √ n |Z| 2 ≤ E P× P [n] 1/2 E P× P 1 n |Z| 2 2 1/2 . ( ) Together, this implies that lim M →∞ Y θ * M = X as a L-limit. Indeed, with c := E P× P [n] 1/2 < ∞ we have E P×λ k X -Y θ * M 2 = c E P× P 1 {n≥k} Xt k --Y θ * M t k -2 ≤ c c E P× P 1 {n≥k} 1 n Xt k --Y θ * M t k - 2 2 1/2 ≤ c c E P× P 1 {n≥k} 1 n n i=1 Xti--Y θ * M ti- 2 2 1/2 ≤ c c E P× P 1 n n i=1 Xti--Y θ * M ti- 2 2 1/2 ≤ c c Φ(θ * M ) -Ψ( X) 1/2 M →∞ ----→ 0, where we used first (37) and ( 39) followed by two simply upper bounds and (38) in the last step. The same argument can be applied to show that lim M →∞ Y θ min M = X as a L-limit. In particular, this proves that the limit Y := lim M →∞ Y θ min M exists as L-limit and by Lemma E.6 it equals X (P × λ k )-almost surely, for any k ≤ K. Remark E.8. This result can be extended to any other neural network architecture for which a universal approximation theorem equivalent to (Hornik et al., 1989, Thm. 2.4) exists. Moreover, the stochastic process X defined in (9) can be chosen more general, in particular, the diffusion part σdW can be replaced by any martingale, as long as the resulting process still is a Markov process and μ stays continuous. Remark E.9. 
If we used the modified loss function Ψ̃, which is identical to Ψ except that we drop the factor 1/n, everything would work similarly, and we could show L²-convergence instead of L¹-convergence of Y^{θ^min_M} to X̂. However, we remark that there might exist Z ∈ D such that Ψ(Z) < ∞ while Ψ̃(Z) = ∞. In particular, if some moment of n does not exist, such a process can be constructed.

Remark E.10. It follows directly from Theorem E.2 that lim_{M→∞} Y^{θ^min_M} = X̂ as random variables on Ω × [0, T], except on sets which are null sets with respect to every product measure P × λ_k for 1 ≤ k ≤ K.

Remark E.11. The result of Theorem E.2 does not imply that X̂ and lim_{M→∞} Y^{θ^min_M} are modifications of each other or indistinguishable. For example, if B ⊂ [0, T] is a subset such that no observation time t_k lies in B with positive probability, i.e. λ_k(B) = 0 for 1 ≤ k ≤ K, then Theorem E.2 does not tell us how close (lim_{M→∞} Y^{θ^min_M})_t is to X̂_t for t ∈ B. In particular, it does not tell us whether they are equal P-almost surely. Furthermore, such a set B always exists, since there has to exist some t ∈ [0, T] such that B := {t} has measure 0 under λ_k for all k.

In the following corollary we show that Theorem E.2 can be extended to show convergence to the conditional expectation of ϕ(X), for some function ϕ ∈ C^{2,b}(R^{d_X}, R), i.e. a function that is twice continuously differentiable with bounded derivatives.

Corollary E.12. Let ϕ ∈ C^{2,b}(R^{d_X}, R). Then the statement of Theorem E.2 holds equivalently when replacing X in the loss functions Ψ and Φ by Γ = ϕ(X) and X̂ by the conditional expectation Γ̂, where Γ̂_t := E_{P×P}[ϕ(X_t) | A_t].

Corollary E.12, combined with the monotone convergence theorem for conditional expectations, theoretically enables us to make statements about the conditional law and conditional moments of X under some a priori integrability assumptions.

Proof of Corollary E.12.
We first remark that Proposition B.2 and Lemmas E.4, E.5 and B.3 hold similarly for the conditional expectation Γ̂. Hence, the same argument as in the proof of Theorem E.2 implies that Γ̂ is the unique A-adapted minimizer of Ψ. For simplicity of notation we assume that d_X = d_Y = 1, i.e. that the process X and the Brownian motion W are 1-dimensional. However, the following works as well in the general case, where the correlations of the Brownian motion components have to be taken into account. By Itô's formula (Protter, 2005, Chap. II, Thm. 32), Γ = ϕ(X) is the solution of the SDE

dϕ(X)_t = ϕ′(X_t) dX_t + (1/2) ϕ″(X_t) d[X, X]_t = ϕ′(X_t) µ(t, X_t) dt + ϕ′(X_t) σ(t, X_t) dW_t + (1/2) ϕ″(X_t) σ(t, X_t)² dt = α(t, X_t) dt + β(t, X_t) dW_t,

for α(t, X_t) := ϕ′(X_t) µ(t, X_t) + (1/2) ϕ″(X_t) σ(t, X_t)² and β(t, X_t) := ϕ′(X_t) σ(t, X_t). Defining ᾱ similarly to μ̃ as ᾱ : ∆ × R^{d_X} → R^{d_X}, ((t, r), ξ) ↦ P_{t,t+r}(X_t, α)|_{X_t=ξ} = E_P[α(t + r, X_{t+r}) | X_t = ξ], one can use the boundedness of ϕ′ and ϕ″ to show that ᾱ is continuous and that Γ̂_t = E[ϕ(X_t) | A_t] = Γ_{τ(t)} + ∫_{τ(t)}^t ᾱ(τ(t), s − τ(t), X_{τ(t)}) ds. In particular, Propositions B.1 and B.4 hold equivalently for Γ̂. Similarly to (36), the neural network parameters can be chosen such that sup_{(u,v,t,r)∈[−M,M]^{d_H+d_X}×∆} |f_{θ*_1}(u, v, t, r) − ᾱ(t, r, v)|_2 ≤ ε, which then implies the statement of the corollary, exactly as in the proof of Theorem E.2.

E.2 CONVERGENCE OF THE MONTE CARLO APPROXIMATION

In the following, we assume that the size $M$ of the neural network is fixed and we study the convergence with respect to the number of samples $N$. Moreover, we show that both types of convergence can be combined. To do so, we define $\tilde\Theta_M := \{\theta \in \Theta_M \mid |\theta|_2 \le M\}$, which is a compact subspace of $\Theta_M$. It is straightforward to see that $\Theta_M$ in Theorem E.2 can be replaced by $\tilde\Theta_M$. Indeed, if the neural network weights needed for an $\varepsilon$-approximation have too large a norm, then one can increase $M$ until it is sufficiently big.

Theorem E.13. Let $\theta^{\min}_{M,N} \in \Theta^{\min}_{M,N} := \operatorname{arg\,inf}_{\theta \in \tilde\Theta_M} \{\hat\Phi_N(\theta)\}$ for every $M, N \in \mathbb{N}$. Then, for every $M \in \mathbb{N}$, $(P \times \tilde P)$-a.s.,

$$\hat\Phi_N \xrightarrow{N \to \infty} \Phi \quad \text{uniformly on } \tilde\Theta_M.$$

Moreover, for every $M \in \mathbb{N}$, $(P \times \tilde P)$-a.s.,

$$\Phi(\theta^{\min}_{M,N}) \xrightarrow{N \to \infty} \Phi(\theta^{\min}_M) \quad \text{and} \quad \hat\Phi_N(\theta^{\min}_{M,N}) \xrightarrow{N \to \infty} \Phi(\theta^{\min}_M).$$

In particular, one can define an increasing sequence $(N_M)_{M \in \mathbb{N}}$ in $\mathbb{N}$ such that for every $1 \le k \le K$ we have that $Y^{\theta^{\min}_{M,N_M}}$ converges to $\hat X$ for $M \to \infty$ as a random variable in $L^1(\Omega \times [0,T], P \times \lambda_k)$. In particular, the limit process $Y := \lim_{M \to \infty} Y^{\theta^{\min}_{M,N_M}}$ equals $\hat X$ $(P \times \lambda_k)$-almost surely as a random variable on $\Omega \times [0,T]$.

The following Monte Carlo convergence analysis is based on (Lapeyre & Lelong, 2019, Section 4.3). In comparison to their work, we do not need the additional assumption that was essential there, namely that all minimizing neural network weights generate the same neural network output. This assumption is not needed because we do not aim to show that $Y^{\theta^{\min}_{M,N}}$ converges to $Y^{\theta^{\min}_M}$. We define the separable Banach space $\mathbb{S} := \{x = (x_i)_{i \in \mathbb{N}} \in \ell^1(\mathbb{R}^{d_X}) \mid \|x\|_1 < \infty\}$ with the norm $\|x\|_1 := \sum_{i \in \mathbb{N}} |x_i|_2$.

E.2.1 CONVERGENCE OF OPTIMIZATION PROBLEMS

Consider a sequence of real-valued functions $(f_n)_n$ defined on a compact set $K \subset \mathbb{R}^d$. Define $v_n := \inf_{x \in K} f_n(x)$ and let $(x_n)_n$ be a sequence of minimizers, i.e. $f_n(x_n) = \inf_{x \in K} f_n(x)$.
From (Rubinstein & Shapiro, 1993, Theorem A1 and the discussion thereafter) we have the following lemma.

Lemma E.14. Assume that the sequence $(f_n)_n$ converges uniformly on $K$ to a continuous function $f$. Let $v^* := \inf_{x \in K} f(x)$ and $S^* := \{x \in K : f(x) = v^*\}$. Then $v_n \to v^*$ and $d(x_n, S^*) \to 0$ a.s.

The following lemma is a consequence of (Ledoux & Talagrand, 1991, Corollary 7.10) and (Rubinstein & Shapiro, 1993, Lemma A1).

Lemma E.15. Let $(\xi_i)_{i \ge 1}$ be a sequence of i.i.d. random variables with values in $\mathbb{S}$ and let $h : \mathbb{R}^d \times \mathbb{S} \to \mathbb{R}$ be a measurable function. Assume that, a.s., the function $\theta \in \mathbb{R}^d \mapsto h(\theta, \xi_1)$ is continuous, and that for all $C > 0$, $\mathbb{E}\big(\sup_{|\theta|_2 \le C} |h(\theta, \xi_1)|\big) < +\infty$. Then, a.s., $\theta \in \mathbb{R}^d \mapsto \frac{1}{N} \sum_{i=1}^N h(\theta, \xi_i)$ converges locally uniformly to the continuous function $\theta \in \mathbb{R}^d \mapsto \mathbb{E}(h(\theta, \xi_1))$, i.e.

$$\lim_{N \to \infty} \sup_{|\theta|_2 \le C} \left| \frac{1}{N} \sum_{i=1}^N h(\theta, \xi_i) - \mathbb{E}\big(h(\theta, \xi_1)\big) \right| = 0 \quad \text{a.s.}$$
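The interplay of Lemmas E.14 and E.15 can be illustrated numerically. The following sketch is our own toy example (not from the paper): with $h(\theta, \xi) = (\xi - \theta)^2$ and $\xi \sim \mathcal{N}(0,1)$, the limit function is $\Phi(\theta) = \theta^2 + 1$ with minimizer set $S^* = \{0\}$; the empirical average converges uniformly on a compact set and the empirical minimizers approach $S^*$:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_loss(theta_grid, xi):
    # (1/N) * sum_j h(theta, xi_j) with h(theta, xi) = (xi - theta)^2,
    # expanded as m2 - 2*theta*m1 + theta^2 to avoid a large intermediate array
    m1, m2 = xi.mean(), (xi ** 2).mean()
    return m2 - 2.0 * theta_grid * m1 + theta_grid ** 2

def true_loss(theta_grid):
    # E[(xi - theta)^2] = theta^2 + 1 for xi ~ N(0, 1)
    return theta_grid ** 2 + 1.0

theta_grid = np.linspace(-2.0, 2.0, 201)  # compact set K = [-2, 2]
sup_errors, minimizers = [], []
for N in (100, 10_000, 1_000_000):
    xi = rng.standard_normal(N)
    emp = empirical_loss(theta_grid, xi)
    sup_errors.append(np.max(np.abs(emp - true_loss(theta_grid))))  # sup-norm error over K
    minimizers.append(theta_grid[np.argmin(emp)])                   # empirical minimizer
```

The sup-norm error over $K$ shrinks as $N$ grows (uniform convergence, Lemma E.15) and the empirical minimizers approach $S^* = \{0\}$ (Lemma E.14).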

E.2.2 STRONG LAW OF LARGE NUMBERS

Let us define $F(x, y, z) := |x - y|_2 + |y - z|_2$ and $\xi^j := (X^j_{t^j_1}, \ldots, X^j_{t^j_{n^j}}, 0, \ldots)$, where the $X^j_{t^j_i}$ are random variables describing the realizations of the training data, as defined in Section D.2. By this definition we have $n^j := n^j(\xi^j) := \max_{i \in \mathbb{N}} \{\xi^{j,i} \ne 0\}$ $P$-almost surely, and we know that the $\xi^j$ are i.i.d. random variables taking values in $\mathbb{S}$. Furthermore, let us write $Y^\theta_t(\xi)$ to make the dependence of $Y$ on the input $\xi$ and the weights $\theta$ explicit. Then we define

$$h(\theta, \xi^j) := \frac{1}{n^j} \sum_{i=1}^{n^j} F\Big(X^j_{t^j_i},\, Y^\theta_{t^j_i}(\xi^j),\, Y^\theta_{t^j_i-}(\xi^j)\Big)^2.$$

Lemma E.16. The following properties are satisfied.

(P1) There exists $\kappa > 0$ such that for all $S \in \mathbb{S}$ and $\theta \in \tilde\Theta_M$ we have $|Y^\theta_t(S)|_2 \le \kappa (1 + |X_{\tau(t)}|_2)$ for all $t \in [0, T]$.

(P2) Almost surely, the random function $\theta \in \tilde\Theta_M \mapsto Y^\theta_t$ is uniformly continuous for every $t \in [0, T]$.

(P3) We have $\mathbb{E}_{P \times \tilde P}\big[\frac{1}{n} \sum_{i=1}^n |X_{t_i}|_2^2\big] < \infty$ and $\mathbb{E}_{P \times \tilde P}\big[\frac{1}{n} \sum_{i=1}^n |X_{t_i-}|_2^2\big] < \infty$.

Proof. By definition of the neural networks with sigmoid activation functions (in particular having bounded outputs), all neural network outputs are bounded in terms of the norm of the network weights, which is assumed to be bounded, independently of the norm of the input. Since after a jump at $\tau(t)$, $Y$ has the value $X_{\tau(t)}$, we can find $\kappa$ depending on $T$ such that the claimed bound is satisfied for all $t$, proving (P1). Since the activation functions are continuous, the neural networks are continuous with respect to their weights $\theta$, which implies that $\theta \in \tilde\Theta_M \mapsto Y^\theta_t$ is also continuous. Since $\tilde\Theta_M$ is compact, this automatically yields uniform continuity and therefore finishes the proof of (P2). (P3) follows directly from the stronger result in Lemma E.7.

Proof of Theorem E.13. We apply Lemma E.15 to the sequence of i.i.d. random functions $h(\theta, \xi^j)$.
From (P1) of Lemma E.16 we have (enlarging $\kappa$ if necessary)

$$F\big(X_{t^j_i}, Y^\theta_{t^j_i}, Y^\theta_{t^j_i-}\big)^2 = \Big( \big|X_{t^j_i} - Y^\theta_{t^j_i}\big|_2 + \big|Y^\theta_{t^j_i} - Y^\theta_{t^j_i-}\big|_2 \Big)^2 \le 4 \Big( \big|X_{t^j_i}\big|_2^2 + 2\big|Y^\theta_{t^j_i}\big|_2^2 + \big|Y^\theta_{t^j_i-}\big|_2^2 \Big) \le 4 \Big( 3 \big|X_{t^j_i}\big|_2^2 + \kappa \big(1 + |X_{t^j_{i-1}}|_2^2\big) \Big).$$

Hence, we obtain

$$h(\theta, \xi^j) = \frac{1}{n^j} \sum_{i=1}^{n^j} F\big(X_{t^j_i}, Y^\theta_{t^j_i}, Y^\theta_{t^j_i-}\big)^2 \le \frac{12 + 4\kappa}{n^j} \sum_{i=1}^{n^j} \big|X_{t^j_i}\big|_2^2 + 4\kappa + |x|_2^2,$$

implying that

$$\mathbb{E}_{P \times \tilde P}\Big[ \sup_{\theta \in \tilde\Theta_M} h(\theta, \xi^j) \Big] \le (12 + 4\kappa)\, \mathbb{E}_{P \times \tilde P}\Big[ \frac{1}{n} \sum_{i=1}^{n} |X_{t_i}|_2^2 \Big] + 4\kappa + |x|_2^2 < \infty, \quad (41)$$

using (P3) of Lemma E.16. By (P2) of Lemma E.16, the function $\theta \mapsto h(\theta, \xi^j)$ is continuous. Therefore, we can apply Lemma E.15, yielding that almost surely, for $N \to \infty$, the function

$$\theta \mapsto \frac{1}{N} \sum_{j=1}^{N} h(\theta, \xi^j) = \hat\Phi_N(\theta) \quad \text{converges uniformly on } \tilde\Theta_M \text{ to} \quad \theta \mapsto \mathbb{E}_{P \times \tilde P}\big[h(\theta, \xi^1)\big] = \Phi(\theta). \quad (42)$$

We deduce from Lemma E.14 that $d(\theta^{\min}_{M,N}, \Theta^{\min}_M) \to 0$ a.s. when $N \to \infty$. Then there exists a sequence $(\tilde\theta^{\min}_{M,N})_{N \in \mathbb{N}}$ in $\Theta^{\min}_M$ such that $|\theta^{\min}_{M,N} - \tilde\theta^{\min}_{M,N}|_2 \to 0$ a.s. for $N \to \infty$. The uniform continuity of the random functions $\theta \mapsto Y^\theta_t$ on $\tilde\Theta_M$ implies that $|Y^{\theta^{\min}_{M,N}}_t - Y^{\tilde\theta^{\min}_{M,N}}_t|_2 \to 0$ a.s. when $N \to \infty$ for all $t \in [0, T]$. By continuity of $F$, this yields $|h(\theta^{\min}_{M,N}, \xi^1) - h(\tilde\theta^{\min}_{M,N}, \xi^1)| \to 0$ a.s. as $N \to \infty$. With (41) we can apply dominated convergence, which yields

$$\lim_{N \to \infty} \mathbb{E}_{P \times \tilde P}\Big[ \big| h(\theta^{\min}_{M,N}, \xi^1) - h(\tilde\theta^{\min}_{M,N}, \xi^1) \big| \Big] = 0. \quad (43)$$

Since for every integrable random variable $Z$ we have $0 \le |\mathbb{E}[Z]| \le \mathbb{E}[|Z|]$, and since $\tilde\theta^{\min}_{M,N} \in \Theta^{\min}_M$, we can deduce that

$$\lim_{N \to \infty} \Phi(\theta^{\min}_{M,N}) = \lim_{N \to \infty} \mathbb{E}_{P \times \tilde P}\big[ h(\theta^{\min}_{M,N}, \xi^1) \big] = \lim_{N \to \infty} \mathbb{E}_{P \times \tilde P}\big[ h(\tilde\theta^{\min}_{M,N}, \xi^1) \big] = \Phi(\theta^{\min}_M). \quad (44)$$

Now, by the triangle inequality,

$$\big| \hat\Phi_N(\theta^{\min}_{M,N}) - \Phi(\theta^{\min}_M) \big| \le \big| \hat\Phi_N(\theta^{\min}_{M,N}) - \Phi(\theta^{\min}_{M,N}) \big| + \big| \Phi(\theta^{\min}_{M,N}) - \Phi(\theta^{\min}_M) \big|. \quad (45)$$

(42), (43) and (44) imply that both terms on the right-hand side converge to $0$ when $N \to \infty$, which finishes the proof of the first part of the theorem. We define $N_0 := 0$ and, for every $M \in \mathbb{N}$,

$$N_M := \min\Big\{ N \in \mathbb{N} \;\Big|\; N > N_{M-1},\ \big| \Phi(\theta^{\min}_{M,N}) - \Phi(\theta^{\min}_M) \big| \le \tfrac{1}{M} + \big| \Phi(\theta^{\min}_M) - \Psi(\hat X) \big| \Big\},$$

which is possible due to (44).
Then Theorem E.2 implies that

$$\big| \Phi(\theta^{\min}_{M,N_M}) - \Psi(\hat X) \big| \le \frac{1}{M} + 2 \big| \Phi(\theta^{\min}_M) - \Psi(\hat X) \big| \xrightarrow{M \to \infty} 0.$$

Therefore, we can apply the same arguments as in the proof of Theorem E.2 (starting from (38)) to show that, for some constant $c > 0$,

$$\mathbb{E}_{P \times \lambda_k}\Big[ \big| \hat X - Y^{\theta^{\min}_{M,N_M}} \big|_2 \Big] \le c \Big( \Phi(\theta^{\min}_{M,N_M}) - \Psi(\hat X) \Big)^{1/2} \xrightarrow{M \to \infty} 0,$$

for every $1 \le k \le K$.

Corollary E.17. In the setting of Theorem E.13, we also have that $(P \times \tilde P)$-a.s.

$$\Phi(\theta^{\min}_{M,N_M}) \xrightarrow{M \to \infty} \Psi(\hat X) \quad \text{and} \quad \hat\Phi_{\tilde N_M}(\theta^{\min}_{M,\tilde N_M}) \xrightarrow{M \to \infty} \Psi(\hat X),$$

where $(\tilde N_M)_{M \in \mathbb{N}}$ is another increasing sequence in $\mathbb{N}$.

Proof. The first convergence result was already shown in the proof of Theorem E.13, and the second one can be shown similarly, by defining $\tilde N_0 := 0$ and, for every $M \in \mathbb{N}$,

$$\tilde N_M := \min\Big\{ N \in \mathbb{N} \;\Big|\; N > \tilde N_{M-1},\ \big| \hat\Phi_N(\theta^{\min}_{M,N}) - \Phi(\theta^{\min}_M) \big| \le \tfrac{1}{M} + \big| \Phi(\theta^{\min}_M) - \Psi(\hat X) \big| \Big\},$$

which is possible due to (45).

E.3 DISCUSSION ABOUT OPTIMAL WEIGHTS

In Theorems E.2 and E.13, the focus lies on the convergence analysis under the assumption that optimal weights are found. Below we discuss why this assumption is not restrictive in theory.

Global versus local optima. The assumption that the optimal weights are found is typical for the convergence analysis of a neural network based algorithm, since the objective function is highly complex and non-convex with respect to the weights. In particular, it is well known that the standard choice of (stochastic) gradient descent optimization methods in general only finds local and not global minima. Since the difference between any local minimum and the global minimum cannot in general be bounded, it is unrealistic to hope for a theoretical proof of convergence with respect to such optimization schemes. On the other hand, global optimization methods such as simulated annealing provably converge (in probability) to a global optimum (Locatelli, 2000; Lecchini-Visintini et al., 2008). Hence, combining those with our result, convergence in probability of our model output to the conditional expectation can be established without the assumption that the optimal weights are found. However, these global optimization schemes come at the cost of much slower training compared to (stochastic) gradient descent methods when applied in practice. Moreover, several works have focused on showing that most local optima of neural networks are nearly global, see for example (Feizi et al., 2017) and the related work therein. Hence, using (stochastic) gradient descent optimization methods likely yields nearly globally optimal weights much more efficiently. In our case, this is also supported by our empirical convergence studies in Section 6.3.

F EXPERIMENTAL DETAILS

All implementations were done using PyTorch. The code is available at https://github.com/HerreraKrachTeichmann/NJODE.

F.1 IMPLEMENTATION DETAILS

Dataset. For each of the SDE models (Black-Scholes, Ornstein-Uhlenbeck, Heston) a dataset was generated by sampling N = 20 000 paths of the SDE using the Euler scheme. We used an equidistant time grid of mesh 0.01 between time 0 and T = 1. Independently for each path, observation times were sampled from $P_t$, by using each of the grid points as an observation time with probability 0.1. In particular, $n \sim \mathrm{Bin}(100, 0.1)$, $t_0 = 0$ and the observation times $\{t_i\}_{1 \le i \le n}$ were chosen uniformly on the time grid. Hence, 10% of the grid points were used on average. This way, $n$ and $t_i$ are defined as a discretized version of those given in Example A.2, where $n$ is binomially distributed and the $t_i$ are chosen uniformly on $[0, T]$. For each of these datasets, the samples were used in an 80%/20% split for training and testing. The SDEs of the dataset models and the chosen parameters are described below.

• Black-Scholes:
  – SDE: $dX_t = \mu X_t\,dt + \sigma X_t\,dW_t$, where $W$ is a 1-dimensional Brownian motion
  – conditional expectation: $\mathbb{E}(X_{t+s} \mid X_t) = X_t e^{\mu s}$
  – used parameters: $\mu = 2$, $\sigma = 0.3$, $X_0 = 1$
• Ornstein-Uhlenbeck:
  – SDE: $dX_t = -k(X_t - m)\,dt + \sigma\,dW_t$, where $W$ is a 1-dimensional Brownian motion
  – conditional expectation: $\mathbb{E}(X_{t+s} \mid X_t) = X_t e^{-ks} + m\big(1 - e^{-ks}\big)$
  – used parameters: $k = 2$, $m = 4$, $\sigma = 0.3$, $X_0 = 1$
• Heston:
  – SDE: for $W$ and $Z$ 1-dimensional Brownian motions, $dX_t = \mu X_t\,dt + \sqrt{v_t}\, X_t\,dW_t$ and $dv_t = -k(v_t - m)\,dt + \sigma \sqrt{v_t}\,dZ_t$
  – conditional expectation: $\mathbb{E}(X_{t+s} \mid X_t) = X_t e^{\mu s}$
  – used parameters: $\mu = 2$, $\sigma = 0.3$, $X_0 = 1$, $k = 2$, $m = 4$, $v_0 = 4$, $\rho = \mathrm{Corr}(W, Z) = 0.5$

Architecture. In our experiments we choose the dimension of the latent variable to be $d_H = 10$. For $\tilde f_{\theta_1}$, $\tilde g_{\theta_3}$ and $\tilde\rho_{\theta_2}$ we use 2-hidden-layer feed-forward neural networks, with 50 nodes in each hidden layer and tanh activation functions. The neural networks $g_{\theta_3}$ and $\rho_{\theta_2}$ are then defined as residual versions of $\tilde g_{\theta_3}$ and $\tilde\rho_{\theta_2}$, obtained by adding a residual shortcut between the input and the output of the neural networks.
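The dataset generation described above can be sketched as follows; this is our own minimal NumPy version for the Black-Scholes model (function and variable names are ours, not from the official code):

```python
import numpy as np

rng = np.random.default_rng(42)
T, steps = 1.0, 100          # equidistant grid of mesh 0.01 on [0, 1]
dt = T / steps

def sample_black_scholes(n_paths, mu=2.0, sigma=0.3, x0=1.0):
    # Euler scheme for dX_t = mu*X_t dt + sigma*X_t dW_t
    x = np.full(n_paths, x0)
    path = [x.copy()]
    for _ in range(steps):
        dw = np.sqrt(dt) * rng.standard_normal(n_paths)
        x = x + mu * x * dt + sigma * x * dw
        path.append(x.copy())
    return np.stack(path, axis=1)        # shape (n_paths, steps + 1)

def sample_observation_mask(n_paths, p=0.1):
    # each grid point is an observation time with probability 0.1;
    # t_0 = 0 is always observed
    mask = rng.random((n_paths, steps + 1)) < p
    mask[:, 0] = True
    return mask

paths = sample_black_scholes(2_000)
obs_mask = sample_observation_mask(2_000)
```

On average 10% of the grid points are observed, and the empirical mean of $X_T$ is close to $X_0 e^{\mu T}$, in line with the conditional expectation formula above.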
Dropout was applied after each non-linearity with a rate of 0.1. To scale the possibly unbounded inputs of the neural networks to a bounded hypercube, we applied tanh component-wise to x and h in every neural network. This was done because neural networks sometimes become unstable when their inputs become large. To solve the ODE of the neural ODE part, the simple Euler method was used. Training. The neural networks were trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and weight decay 0.0005 for 200 epochs using a batch size of 200. A random initialization was used and no hyper-parameter optimization was needed. Further training results. Further training results on test samples are shown in Figures 8, 9 and 10.
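For illustration, a heavily simplified forward pass of the NJ-ODE can be sketched as follows. This is our own sketch, not the paper's implementation: we use a single random linear layer per map instead of the 2-hidden-layer networks and omit training entirely; only the structure follows Section F.1 — Euler steps of a latent ODE between observations, a residual jump update at observations, tanh-scaled inputs, and a readout giving the predicted conditional expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_H, d_X = 10, 1

# toy weights: one random linear layer per map (assumption for illustration)
W_f = 0.1 * rng.standard_normal((d_H, d_H + d_X + 2))  # latent vector field f_{theta_1}
W_rho = 0.1 * rng.standard_normal((d_H, d_H + d_X))    # jump network rho_{theta_2}
W_g = 0.1 * rng.standard_normal((d_X, d_H))            # readout g_{theta_3}

def vector_field(h, x_last, t_last, t):
    # inputs squashed component-wise with tanh (cf. Section F.1);
    # current time and time since the last observation are appended
    inp = np.concatenate([np.tanh(h), np.tanh(x_last), [t, t - t_last]])
    return np.tanh(W_f @ inp)

def jump(h, x_obs):
    # residual update of the latent state when a new observation arrives
    return h + np.tanh(W_rho @ np.concatenate([np.tanh(h), np.tanh(x_obs)]))

def forward(obs, grid):
    # obs: dict mapping grid index -> observed value; returns model output on the grid
    h, x_last, t_last = np.zeros(d_H), np.zeros(d_X), 0.0
    dt = grid[1] - grid[0]
    out = []
    for i, t in enumerate(grid):
        if i in obs:                                       # jump at observation times
            x_last, t_last = np.atleast_1d(float(obs[i])), t
            h = jump(h, x_last)
        out.append(W_g @ h)                                # predicted cond. expectation
        h = h + dt * vector_field(h, x_last, t_last, t)    # Euler step of the neural ODE
    return np.array(out)

grid = np.linspace(0.0, 1.0, 101)
y = forward({0: 1.0, 50: 1.2}, grid)   # observations at t = 0 and t = 0.5
```

Between observations the output evolves continuously via the Euler steps; at each observation the latent state, and hence the output, jumps.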

F.2 EXPERIMENTS ON OTHER DATASETS

We test our framework on additional synthetic datasets.

F.2.1 HESTON MODEL WITHOUT FELLER CONDITION

If the Feller condition $2km \ge \sigma^2$ is satisfied in the Heston model, it is known that the variance process $v_t$ is always strictly bigger than 0 and that the Euler scheme works well to sample from it. However, if the Feller condition is not satisfied, the variance process can touch 0, where it is reflected deterministically. Then the Euler scheme can no longer be used to sample from the model. Moreover, this situation is generally considered more delicate: close to 0, the distribution of $v_t$ behaves differently than further away from 0 (Andersen, 2007, Section 3). As explained by Andersen (2007), there are sampling schemes of varying quality for a Heston model in which the Feller condition is not satisfied. We use the one that is simplest to implement, a slight extension of the Euler scheme where values of $v_t$ below 0 are replaced by 0 (Andersen, 2007, Section 2.3). Although there is empirical evidence that the resulting sampling distribution of $v_t$ close to 0 does not correctly replicate the true distribution (Jean-François et al., 2015, Figures 2 and 3), this method already produces sufficiently good sample paths.

Dataset. Heston model, sampled as in Section F.1 with the extension described above. Used parameters: $\mu = 2$, $\sigma = 3$, $X_0 = 1$, $k = 2$, $m = 1$, $v_0 = 0.5$, $\rho = \mathrm{Corr}(W, Z) = 0.5$. Hence the Feller condition and also the weaker condition $4km \ge \sigma^2$ discussed in (Jean-François et al., 2015, Section 3.2) are both not satisfied. We produce two datasets: a 1-dimensional one similar to before, where only $X$ is stored, and a 2-dimensional one, where both $X$ and $v$ are stored, hence $v$ is also a target for prediction. Note that $v$ has the same conditional expectation as the Ornstein-Uhlenbeck SDE. In the 2-dimensional dataset, $X$ and $v$ are always observed at the same time.

Architecture & Training. Same as in Section F.1, but with batch size 100.

Results.
The model learns to replicate the true conditional expectation process, which is known analytically and hence not affected by the sampling scheme. In particular, we see that our model is very robust: even in the delicate case where the Feller condition is not satisfied and a sampling scheme is used that does not perfectly replicate the true distribution, our model still works well. Due to the very similar results, in Figure 11 we only show plots on test samples of the 2-dimensional dataset, where X and v are predicted. In both plots, the upper sub-plot corresponds to the 1-dimensional path of X_t and the lower sub-plot corresponds to the 1-dimensional path of v_t.
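The truncated Euler scheme used for this dataset can be sketched as follows; this is our own minimal implementation of the absorption fix from (Andersen, 2007, Section 2.3), with the parameters of this section:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_heston_truncated(n_paths, steps=100, T=1.0, mu=2.0, sigma=3.0,
                            k=2.0, m=1.0, x0=1.0, v0=0.5, rho=0.5):
    # Euler scheme for the Heston model where negative variance values are
    # replaced by 0; here the Feller condition 2*k*m >= sigma^2 fails (4 < 9)
    dt = T / steps
    x = np.full(n_paths, x0)
    v = np.full(n_paths, v0)
    for _ in range(steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)
        sqrt_v = np.sqrt(v)   # v >= 0 is guaranteed by the truncation below
        x = x + mu * x * dt + sqrt_v * x * np.sqrt(dt) * z1
        v = v - k * (v - m) * dt + sigma * sqrt_v * np.sqrt(dt) * z2
        v = np.maximum(v, 0.0)   # truncation at 0
    return x, v

x_T, v_T = sample_heston_truncated(5_000)
```

The correlated Brownian increments are generated via the Cholesky-type construction $Z = \rho Z_1 + \sqrt{1 - \rho^2}\, Z_2$; the `np.maximum` line is the whole extension of the plain Euler scheme.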

F.2.2 DATASET WITH CHANGING REGIME

Dataset. To evaluate the performance of our model under a change of regime, we combine two synthetic datasets. On the first half of the time interval, [0, 0.5], we use the Ornstein-Uhlenbeck model and on the second half, [0.5, 1], the Black-Scholes model. The Black-Scholes process takes as starting point the last value of the Ornstein-Uhlenbeck process. We use the same hyper-parameters for the dataset generation as in Section F.1, except that we set the parameter m = 10 in the Ornstein-Uhlenbeck model to make the two parts act on similar scales. Results. In the plots on test samples of the dataset shown in Figure 12 we see that our model correctly learns the change of regime.
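A minimal sketch of the regime-switching data generation (our own code): Euler steps with the Ornstein-Uhlenbeck dynamics on [0, 0.5] (with m = 10) and the Black-Scholes dynamics on (0.5, 1], started at the last OU value:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_regime_switch(n_paths, steps=100, T=1.0,
                         k=2.0, m=10.0, sigma_ou=0.3,     # OU regime on [0, 0.5]
                         mu=2.0, sigma_bs=0.3, x0=1.0):   # Black-Scholes on (0.5, 1]
    dt = T / steps
    half = steps // 2
    x = np.full(n_paths, x0)
    path = [x.copy()]
    for i in range(steps):
        dw = np.sqrt(dt) * rng.standard_normal(n_paths)
        if i < half:   # Ornstein-Uhlenbeck regime
            x = x - k * (x - m) * dt + sigma_ou * dw
        else:          # Black-Scholes regime, started at the last OU value
            x = x + mu * x * dt + sigma_bs * x * dw
        path.append(x.copy())
    return np.stack(path, axis=1)

paths = sample_regime_switch(2_000)
mean_half, mean_end = paths[:, 50].mean(), paths[:, -1].mean()
```

At t = 0.5 the empirical mean is close to the OU mean $m + (X_0 - m)e^{-k/2} \approx 6.7$, after which the Black-Scholes drift takes over and the mean grows exponentially.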

F.2.3 DATASET WITH EXPLICIT TIME DEPENDENCE

Dataset. We use the Black-Scholes dataset of Section F.1, where we replace the constant drift $\mu$ by the time-dependent drift $\mu(t) = \frac{\alpha}{2}(\sin(\beta t) + 1)$, for $\alpha, \beta > 0$. The conditional expectation changes accordingly. For the data generation we use the same hyper-parameters as in Section F.1, except that instead of $\mu$ we use the parameters $\alpha = 2$ and $\beta \in \{2\pi, 4\pi\}$. Architecture & Training. Same as in Section F.1, but with batch size 100. Moreover, we used 400 neurons in each hidden layer, to account for the more complicated setting where an explicit time dependence has to be learnt. Results. We show plots on test samples of the datasets in Figure 5. We see that the model learns to adapt to the time-dependent coefficients.
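For a deterministic time-dependent drift the conditional expectation is available in closed form, $\mathbb{E}(X_{t+s} \mid X_t) = X_t \exp\big(\int_t^{t+s} \mu(u)\,du\big)$; a small sketch (our own helper) together with a quadrature sanity check:

```python
import numpy as np

def cond_exp_time_dep_bs(x_t, t, s, alpha=2.0, beta=2.0 * np.pi):
    # E[X_{t+s} | X_t] = X_t * exp( integral_t^{t+s} mu(u) du ) with the
    # time-dependent drift mu(u) = alpha/2 * (sin(beta*u) + 1)
    integral = alpha / 2.0 * (s + (np.cos(beta * t) - np.cos(beta * (t + s))) / beta)
    return x_t * np.exp(integral)

# sanity check against trapezoidal quadrature of the drift
t, s = 0.2, 0.5
u = np.linspace(t, t + s, 100_001)
mu = 2.0 / 2.0 * (np.sin(2.0 * np.pi * u) + 1.0)
numeric = np.exp((0.5 * (mu[:-1] + mu[1:]) * np.diff(u)).sum())
```

For $\beta = 2\pi$, $t = 0$ and $s = 1$ the sine integrates to zero over the full period, so the formula reduces to the constant-drift case $X_t e^{\alpha s / 2}$.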

F.3 DETAILS ON CONVERGENCE STUDY

Evaluation metric. For the sampling time grid with equidistant step size $\Delta_t := T/\nu$, $\nu \in \mathbb{N}$, on $[0, T]$, and the true and predicted conditional expectations $\hat X^j$ and $Y^j$ for path $j \in \mathbb{N}$, we define the evaluation metric as

$$\mathrm{eval}(\hat X, Y) := \frac{1}{N_2} \sum_{j=1}^{N_2} \frac{1}{\nu + 1} \sum_{i=0}^{\nu} \Big( \hat X^j_{i \Delta_t} - Y^j_{i \Delta_t} \Big)^2,$$

where $N_2$ is the number of test samples and $j$ accordingly iterates over the paths in the test set.

Increasingly big training sets. We use the following procedure to create increasingly big training sets while keeping exactly the same test set for evaluation. Out of the initial 20 000 paths, we take $N_2 := 4\,000$ paths, which are fixed as the test set. Out of the remaining 16 000 paths, we randomly choose $N_1$ training paths for $N_1 \in \{200, 400, 800, 1\,600, 3\,200, 6\,400, 12\,800\}$.

Increasing neural network sizes. The increasingly big neural networks are defined as follows. For all involved networks, we use the feed-forward 2-layer architecture with tanh activations (cf. Appendix F.1), where each hidden layer has the same size $M$ for $M \in \{10, 20, 40, 80, 160, 320\}$.

Results on Black-Scholes dataset. In accordance with the theoretical results in Theorems E.2 and E.13, we see that the evaluation metric decreases when $N_1$ and $M$ increase (Figure 13). It is important to notice that already a quite small number of samples can lead to a good approximation of the conditional expectation, if the network is big enough. In particular, the Monte Carlo approximation of the theoretical loss function is already good with few samples. On the other hand, even with a large number of samples, the evaluation metric does not become as small if the network size is not big enough. From a practitioner's point of view, this is good news, since increasing the network size is often much easier than collecting more training data.

Results on Ornstein-Uhlenbeck and Heston dataset. We get very similar results on the Ornstein-Uhlenbeck (Figure 14) and Heston (Figure 15) datasets.
As for Black-Scholes, also for Ornstein-Uhlenbeck the training size $N_1$ is not as important as the network size $M$. For all network sizes, increasing the training size beyond 1 600 hardly changes the performance, while increasing $M$ is crucial for better performance. In contrast, for the Heston dataset we see that a large number of training samples is more important to obtain a small evaluation metric. This reflects the fact that the Heston dataset is more complex and more difficult to learn.
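The evaluation metric defined above is straightforward to implement; a minimal sketch (our own code, assuming the true and predicted conditional expectations are stored as arrays on the evaluation grid):

```python
import numpy as np

def eval_metric(x_hat, y):
    # eval(X_hat, Y) = (1/N_2) sum_j (1/(nu+1)) sum_i (X_hat^j_{i*dt} - Y^j_{i*dt})^2,
    # with x_hat and y of shape (N_2, nu + 1):
    # one row per test path, one column per grid point
    return float(np.mean(np.mean((x_hat - y) ** 2, axis=1)))

# toy check: a constant offset of 1 at every grid point gives metric 1.0
x_hat = np.zeros((4, 11))
y = np.ones((4, 11))
err = eval_metric(x_hat, y)
```

Since the grid is equidistant and shared across paths, the double mean over paths and grid points implements the two normalized sums directly.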

F.4 DETAILS ON COMPARISON TO GRU-ODE-BAYES

GRU-ODE-Bayes (Brouwer et al., 2019). To the best of our knowledge, this is the neural network based method addressing the task most similar to ours. In particular, this continuous-time model is trained to learn the unknown temporal parameters of a normal distribution best describing the conditional distribution of X given the previous observations. This distribution is given by the Fokker-Planck equation. Brouwer et al. (2019) outlined that their model can exactly represent the Fokker-Planck dynamics of the Ornstein-Uhlenbeck process, since the corresponding distribution is Gaussian. Implementation of GRU-ODE-Bayes. We use the code of the official implementation of Brouwer et al. (2019) and slightly adjust it for our purpose. In particular, we do not use incomplete observations, hence the input mask used for this task always has only 1-entries. Furthermore, we slightly changed the scheme of how the time steps are taken, to be the same as in our implementation, so that comparisons can be made. Besides these minor changes, the original model is used and trained on all 3 datasets (Black-Scholes, Ornstein-Uhlenbeck and Heston). We tried out all combinations of the following parameters and always chose the best performing one for comparison to our model: impute ∈ {True, False}, logvar ∈ {True, False}, mixing ∈ {0.0001, 0.5} and hiddensize ∈ {50, 100}, whereby phidden and prephidden were chosen equal to hiddensize. The first parameter choices were the ones used in the official implementation. The model was always trained using the Euler method for solving ODEs and with dropout = 0.1, to be comparable to our implementation. Furthermore, we always used the full GRU-ODE-Bayes implementation, since it should be the more powerful one. For the comparison, we only use the estimated mean of the normal distribution (the estimated variance is not used), which is precisely the estimated conditional expectation of the model. Implementation of NJ-ODE.
Same as described in Section F.1. For each dataset only one model was trained, since we already saw the convergence properties before and wanted to have a qualitative comparison to GRU-ODE-Bayes. In particular, we did not try to optimize our hyper-parameters for best performance (e.g. by choosing different number of layers or neurons or different activation functions) and used about 10K trainable parameters compared to the best performing GRU-ODE-Bayes models of our study which used 112K trainable parameters. Datasets. Same as described in Section F.1. The train and test sets were fixed to be the same for all trained models. Training. All models were trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and weight decay 0.0005 for 100 epochs using a batch size of 20. A random initialization was used.

F.4.1 FURTHER RESULTS AND DISCUSSION OF THE COMPARISON

Evolution of losses and evaluation metric during training. In Figures 16 and 17 we see that already after a few epochs the NJ-ODE model finds close-to-optimal weights for the given network size on the Black-Scholes and Ornstein-Uhlenbeck datasets and oscillates around this optimum. The Heston dataset is considerably more difficult and the model slowly converges to close-to-optimal weights for the given network size. In comparison, it takes the GRU-ODE-Bayes model longer to converge to its close-to-optimal weights on the Black-Scholes and Ornstein-Uhlenbeck datasets. Moreover, the model does not converge to close-to-optimal weights on the Heston dataset, but rather oscillates between bad weights. Due to some very large outliers, this is not directly visible in the plot, but can be deduced from Table 6.4. Comparison of predicted paths. For each dataset we show 5 paths that were predicted with NJ-ODE and with GRU-ODE-Bayes, first at the optimal epoch, i.e. where the test loss was minimal during training (Figures 18, 19, 20), and then at the last epoch (Figures 21, 22, 23). For the sake of comparison, each row shows the performance on the same test sample. The results of NJ-ODE and GRU-ODE-Bayes are very similar on the Black-Scholes and Ornstein-Uhlenbeck datasets, with sometimes the one and sometimes the other having a slightly better prediction. On the Heston dataset, the predictions of NJ-ODE are good, being correct whenever a new observation is made and not too far off even after longer periods without observations. On the other hand, GRU-ODE-Bayes is not even correct at times of new observations and learns an incorrect behaviour between observations (e.g. making kinks where there should not be any). For the last epoch, we see that this malfunction is amplified.

F.4.2 FURTHER EXPERIMENTS

Since the results of GRU-ODE-Bayes were unexpectedly bad on the Heston dataset, we performed additional tests. In particular, we retrained the best combinations of GRU-ODE-Bayes with the larger batch size 100. For smaller versions of the network (M), i.e. with hiddensize = prephidden = phidden = 50, this stabilized the training such that the models converged. Still, even at the best epoch, the models suffered from the same difficulties as explained in Section F.4.1. Increasing the hiddensize by a factor of 2 to 100 (L) made the training unstable again, and the models did not converge any more. This gives more empirical evidence that GRU-ODE-Bayes cannot be reliably trained on more complex datasets, where the target conditional distribution differs too much from a normal distribution. In contrast, retraining NJ-ODE with batch size 100, once for the smaller version described in Section F.1 (S), once for a larger version with 100 instead of 50 neurons in all hidden layers (M) and once with 200 neurons (L), yielded the expected result of better performance with larger networks. In particular, there is no instability for larger networks. In Table F.4.2 we show the results of our model and of the best GRU-ODE model for the given size.

In accordance with the self-imputation scheme, the loss function is adjusted to only use the non-imputed coordinates, i.e. if $m$ is the random process in $\{0,1\}^{d_X}$ describing which coordinates are observed, we have

$$\Psi(Z) := \mathbb{E}_{P \times \tilde P}\left[ \frac{1}{n} \sum_{i=1}^{n} \Big( \big| m_{t_i} \odot (X_{t_i} - Z_{t_i}) \big|_2 + \big| m_{t_i} \odot (Z_{t_i} - Z_{t_i-}) \big|_2 \Big)^2 \right].$$
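The masked loss above can be sketched for a single sample path as follows (our own minimal version; `*` plays the role of the component-wise product $\odot$, and the empirical mean over observation times stands in for the expectation):

```python
import numpy as np

def masked_loss(x, z, z_pre, m):
    # Psi for one sample path:
    # (1/n) * sum_i ( |m_i ⊙ (x_i - z_i)|_2 + |m_i ⊙ (z_i - z_i^-)|_2 )^2
    # x: observations, z: model values, z_pre: left limits at t_1..t_n,
    # each of shape (n, d_X); m: 0/1 observation mask of shape (n, d_X)
    jump_term = np.linalg.norm(m * (x - z), axis=1)
    cont_term = np.linalg.norm(m * (z - z_pre), axis=1)
    return float(np.mean((jump_term + cont_term) ** 2))

x = np.array([[1.0, 2.0], [3.0, 4.0]])
z = np.array([[1.0, 0.0], [3.0, 4.0]])
z_pre = np.array([[0.5, 0.0], [3.0, 5.0]])
m = np.array([[1.0, 0.0], [1.0, 1.0]])   # second coordinate unobserved at t_1
loss = masked_loss(x, z, z_pre, m)       # ((0 + 0.5)^2 + (0 + 1)^2) / 2 = 0.625
```

Because the mask zeroes out the unobserved coordinates before the norms are taken, self-imputed values contribute nothing to the loss.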

F.5.2 CLIMATE FORECASTING DETAILS

Dataset. We use the publicly available United States Historical Climatology Network (USHCN) daily dataset (Menne et al., 2016) together with all pre-processing steps as provided by Brouwer et al. (2019). In particular, there are 5 sporadically observed (i.e. with incomplete observations) climate variables (daily temperatures, precipitation, and snow) measured at 1 114 stations scattered over the United States during an observation window of 4 years (between 1996 and 2000), where each station has an average of 346 observations over those 4 years. For 5 folds, the data is split into train (70%), validation (20%) and test (10%) sets. The task is to predict the next 3 measurements after the first 3 years of observation. The mean squared error between the prediction and the correct values is computed on the validation and test set. Baselines. We use the results reported in (Brouwer et al., 2019, Table 1) as baselines for our comparison and perform the exact same 5-fold cross validation using the same folds with the same train, validation and test sets. For completeness, we give all results of that table together with our results in Table 6.5. We only show the mean squared error (MSE), since our model does not provide the negative log-likelihood. Implementation of NJ-ODE. We once use the architecture described in Section F.1 (S) and once use the exact same architecture but with hidden size $d_H = 50$ and 400 instead of 50 nodes in each hidden layer (L). To deal with the incomplete observations, the self-imputation scheme described in Section 6.5 is used. In particular, the data is self-imputed and passed together with the observation mask as input to the network $\tilde\rho_{\theta_2}$. Moreover, the network $f_{\theta_1}$ uses the NJ-ODE output at the last observation time instead of the last observation as input. Training.
Both versions of NJ-ODE were trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and weight decay 0.0005 for 100 epochs using a batch size of 100. A random initialization was used. The performance on the validation set was used for early stopping after the first 100 epochs, i.e. early stopping was possible at any epoch in {101, ..., 200}. Results. The results of our model and all models reported in (Brouwer et al., 2019, Table 1) are shown in Table 6.5.

F.5.3 PHYSIONET PREDICTION DETAILS

Dataset. We use the publicly available PhysioNet Challenge 2012 dataset (Goldberger et al., 2000) together with all pre-processing steps as described and provided by Rubanova et al. (2019). In particular, there are 41 features of 8 000 patients that are observed irregularly over a 48-hour time period. The observations are put on a time grid with step size 0.016 hours, leading to 3 000 grid points. While Rubanova et al. (2019) state in their paper that they use 2 880 grid points (i.e. minute-wise), in their implementation they used 3 000. Moreover, in contrast to what was written in the paper, the 4 constant features were not excluded in their implementation, hence we also keep them. Furthermore, we also rescale the time grid to [0, 1] and normalize each feature as Rubanova et al. (2019) do. Baselines. We compare the performance of our model to latent ODE on the extrapolation task (as described above) on PhysioNet. As baseline for our comparison we use the results reported in (Rubanova et al., 2019, Table 5). For completeness, we show all extrapolation results of that table together with our results in Table 6.5. We shortly outline the different approach of latent ODE compared to our model for the given extrapolation task. Latent ODE also splits the training samples, similarly to the test samples, in half, using the first half as input and the second half as target. It is trained as an encoder-decoder, encoding the observations of the first half and reconstructing (decoding) the second half. This falls into the standard supervised learning framework. In particular, this approach cannot straightforwardly be extended to online forecasting. Moreover, this approach might learn certain path dependencies. On the other hand, our model is trained as always: online forecasting after each observation until the next observation is made. Instead of splitting the training samples, we use the entire path as input for our unsupervised training framework.
Our model is based on the assumption that paths are Markov; therefore it cannot learn path dependencies, i.e. dependencies on more than just the last observation. However, by training the model also on the second half of the training samples, it learns the underlying behaviour there, which should be helpful for the extrapolation task. Implementation of NJ-ODE. We use the architecture described in Section F.1, but with hidden size $d_H = 41$. To deal with the incomplete observations, the self-imputation scheme described in Section 6.5 is used. In particular, the data is self-imputed and passed together with the observation mask as input to the network $\tilde\rho_{\theta_2}$. Moreover, the network $f_{\theta_1}$ uses the NJ-ODE output at the last observation time instead of the last observation as input. Training. The NJ-ODE was trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and weight decay 0.0005 for 175 epochs using a batch size of 50. 5 runs with random initialization were performed, over which the means and standard deviations were calculated. In particular, these runs always used the same training and test set, specified by the same random seed as in the implementation of Rubanova et al. (2019). Results. The results of our model and all models reported in (Rubanova et al., 2019, Table 5, extrapolation) are shown in Table 6.5. We report the minimal MSE on the test set during the 175 epochs, since it was not specified differently in (Rubanova et al., 2019). However, if the MSE of the epoch where the training loss is minimal is used, the results are nearly the same with 1.986 ± 0.058 (×10⁻³). Moreover, we trained a larger model for 120 epochs, where 200 nodes were used instead of 50 (187 323 parameters in total), leading to slightly better results of 1.934 ± 0.007 (×10⁻³) (at minimal MSE) and 1.982 ± 0.027 (×10⁻³) (at minimal training loss).



Footnotes (collected).
– A stochastic process is a collection of random variables $X_t : \Omega \to \mathbb{R}^{d_X}$, $\omega \mapsto X_t(\omega)$, for $0 \le t \le T$.
– For all $\omega \in \Omega$: $0 = t_0 < t_1(\omega) < \cdots < t_{n(\omega)}(\omega) \le T$.
– While we give a pointwise definition, (Cohen & Elliott, 2015, Theorem 7.6.5) allows one to define $\hat X$ directly as the optional projection. By (Cohen & Elliott, 2015, Remark 7.2.2) this implies that the process $\hat X$ is progressively measurable, in particular jointly measurable in $t$ and $\omega \times \tilde\omega$. However, as we show below, even from the pointwise definition it follows that $\hat X$ is càdlàg, hence optional (Cohen & Elliott, 2015, Theorem 7.2.7).
– I.e. right-continuous with existing left limits, also denoted RCLL.
– I.e. left-continuous with existing right limits.
– See (Rujivan & Zhu, 2012, Equation 2.9).
– Architecture & Training: same as in Section F.1, but with batch size 100. Moreover, we used 100 neurons in each hidden layer, to account for the more complicated setting where a time dependence also has to be learnt.
– https://github.com/edebrouwer/gru_ode_bayes



Figure 1: Predicted and true conditional expectation on a test sample of the Heston dataset.

Figure 3: Heston model without Feller condition. In both plots, the upper sub-plot corresponds to the 1-dimensional path of X t and the lower sub-plot corresponds to the 1-dimensional path of v t .

Figure 4: Our model evaluated on a stochastic dataset that follows an Ornstein-Uhlenbeck SDE on the time interval [0, 0.5] and a Black-Scholes model on the time interval (0.5, 1]. We see that our model correctly learns the change of regime.
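A path of such a regime-switching dataset can be simulated with a simple Euler-Maruyama scheme, switching from Ornstein-Uhlenbeck to Black-Scholes dynamics at t = 0.5. The parameter values below are illustrative, not the ones used to generate the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n_steps = 0.001, 1000              # time grid on [0, 1]
theta, mu, sigma_ou = 2.0, 1.0, 0.3    # illustrative OU parameters
r, sigma_bs = 0.5, 0.3                 # illustrative Black-Scholes parameters

x = np.empty(n_steps + 1)
x[0] = 1.0
for i in range(n_steps):
    t = i * dt
    dW = rng.normal(0.0, np.sqrt(dt))
    if t < 0.5:
        # Ornstein-Uhlenbeck regime: dX = theta*(mu - X) dt + sigma dW
        x[i + 1] = x[i] + theta * (mu - x[i]) * dt + sigma_ou * dW
    else:
        # Black-Scholes regime: dX = r*X dt + sigma*X dW
        x[i + 1] = x[i] + r * x[i] * dt + sigma_bs * x[i] * dW
```

The change of drift and diffusion at t = 0.5 is exactly the regime switch the model has to pick up from the observations.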

Figure 5: NJ-ODE evaluated on the time-dependent Black-Scholes dataset with β = 2π (left) and β = 4π (right).

Figure 6: Black-Scholes dataset. Mean ± standard deviation (black bars) of the evaluation metric for varying training samples N 1 and network size M .

This proves the second part of Proposition B.1.

Proposition B.4. The function μ is (jointly) continuous.

To see this, we choose a localizing sequence (τ n ) n∈N such that M^{τ_n} is bounded by n (which works since M is continuous). Then the Markov property implies that E[M^{τ_n}_t − M^{τ_n}_s | A s ] = E[M^{τ_n}_t − M^{τ_n}_s | F s ] = 0. Since M^{τ_n}_t → M t P-a.s. as n → ∞, and since this sequence is dominated by the integrable random variable 1 + sup_{r≤t} |M r |² (by Doob's inequality and square integrability of M), dominated convergence implies that E[M t − M s | A s ] = 0.

Proof of Proposition B.4. For any fixed s

Figure 7: Schematic representation of the stochastic control u t (25).

Let (Z n ) n∈N be a sequence of random variables, and Z and Z̃ random variables defined on a common probability space, such that Z n → Z in L^p and Z n → Z̃ in L^p. Then Z = Z̃ almost surely.
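The proof, not spelled out above, is a direct application of the triangle inequality in L^p (writing Z̃ for the second limit random variable):

```latex
\| Z - \tilde{Z} \|_{L^p}
  \;\le\; \| Z - Z_n \|_{L^p} + \| Z_n - \tilde{Z} \|_{L^p}
  \;\xrightarrow{n \to \infty}\; 0,
```

hence \| Z - \tilde{Z} \|_{L^p} = 0 and therefore Z = Z̃ almost surely.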

Figure 8: Black-Scholes

Figure 11: Heston model without Feller condition. In both plots, the upper sub-plot corresponds to the 1-dimensional path of X t and the lower sub-plot corresponds to the 1-dimensional path of v t .

Figure 12: Our model evaluated on a stochastic dataset that follows an Ornstein-Uhlenbeck SDE on the time interval [0, 0.5] and a Black-Scholes model on the time interval (0.5, 1].

Figure 13: Black-Scholes dataset. Mean ± standard deviation (black bars) of the evaluation metric for varying N 1 and M .

Figure 14: Ornstein-Uhlenbeck dataset. Mean ± standard deviation (black bars) of the evaluation metric for varying N 1 and M .

Figure 15: Heston dataset. Mean ± standard deviation (black bars) of the evaluation metric for varying N 1 and M .

Figure 17: GRU-ODE-Bayes' best performing models on Black-Scholes, Ornstein-Uhlenbeck and Heston (from left to right). Blue (1st row): training loss, orange (2nd row): evaluation loss, green (3rd row): evaluation metric.

Figure 18: Comparison of predictions for Black-Scholes paths at best epoch.

Figure 19: Comparison of predictions for Ornstein-Uhlenbeck paths at best epoch.

Figure 20: Comparison of predictions for Heston paths at best epoch.

Figure 21: Comparison of predictions for Black-Scholes paths at last epoch.

Figure 22: Comparison of predictions for Ornstein-Uhlenbeck paths at last epoch.

Figure 23: Comparison of predictions for Heston paths at last epoch.

The minimal, last and average value of the evaluation metric throughout the 100 epochs of training are shown for GRU-ODE-Bayes and our method, together with the number of trainable parameters.

Mean and standard deviation of the MSE on the test sets of USHCN. Results of the baselines were reported by Brouwer et al. (2019). Where known, the number of trainable parameters is reported.

Mean and standard deviation of the MSE on the test set of PhysioNet. Results of the baselines were reported by Rubanova et al. (2019). Where known, the number of trainable parameters is reported.

This observation process Y is usually described by the dynamics dY t = h(X t )dt + dB t , where h is a measurable function and B is a Brownian motion. Stochastic filtering then estimates the conditional law of X t given the noisy observations (Y s ) 0≤s≤t .
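The observation dynamics dY t = h(X t )dt + dB t can be discretised with a simple Euler scheme. The sketch below uses an illustrative hidden signal X (a Brownian motion) and the identity as observation function h; both are hypothetical choices, made only to illustrate the structure of the filtering setup.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n_steps = 0.001, 1000  # time grid on [0, 1]

def h(x):
    # Illustrative measurable observation function; h(x) = x for simplicity.
    return x

# Illustrative hidden signal X: a standard Brownian motion.
dWx = rng.normal(0.0, np.sqrt(dt), n_steps)
X = np.concatenate(([0.0], np.cumsum(dWx)))

# Observation process: dY_t = h(X_t) dt + dB_t, with Y_0 = 0.
dB = rng.normal(0.0, np.sqrt(dt), n_steps)
Y = np.empty(n_steps + 1)
Y[0] = 0.0
for i in range(n_steps):
    Y[i + 1] = Y[i] + h(X[i]) * dt + dB[i]
```

A filter only sees the noisy path Y and must estimate the conditional law of X t from it; the simulation makes explicit that X enters Y only through the drift term h(X t )dt.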

approximate μ by the neural network f θ * 1 and outside we continuously extend f θ * 1 such that it is bounded by B. By abuse of notation we call this f θ *

The minimal, last and average value of the evaluation metric (smaller is better) on the Heston dataset throughout the 100 epochs of training, together with the number of trainable parameters.

did it for training the model. The dataset is split with the same fixed seed into 80% training and 20% test set. In particular, no cross-validation but only multiple runs with new random initializations are performed, to be exactly comparable to the results reported by Rubanova et al.

ACKNOWLEDGEMENT

The authors thank Andrew Allan, Robert A. Crowell, Anastasis Kratsios and Pierre Ruyssen for helpful discussions, references and insights. Moreover, the authors thank the reviewers for their thoughtful feedback, which contributed to significantly improving the paper. The authors gratefully acknowledge financial support from the Swiss National Science Foundation (SNF) under grant 179114.

