MODELING TEMPORAL DATA AS CONTINUOUS FUNCTIONS WITH PROCESS DIFFUSION Anonymous authors Paper under double-blind review

Abstract

Temporal data like time series are often observed at irregular intervals which is a challenging setting for existing machine learning methods. To tackle this problem, we view such data as samples from some underlying continuous function. We then define a diffusion-based generative model that adds noise from a predefined stochastic process while preserving the continuity of the resulting underlying function. A neural network is trained to reverse this process which allows us to sample new realizations from the learned distribution. We define suitable stochastic processes as noise sources and introduce novel denoising and score-matching models on processes. Further, we show how to apply this approach to the multivariate probabilistic forecasting and imputation tasks. Through our extensive experiments, we demonstrate that our method outperforms previous models on synthetic and real-world datasets.

1. INTRODUCTION

Time series data is collected from measurements of some real-world system that evolves via complex unknown dynamics, and the sampling rate is often arbitrary and non-constant. Thus, the assumption that a time series follows some underlying continuous function is reasonable; consider, e.g., the temperature or load of a system over time. Although the values are observed as separate events, we know the temperature always exists and its evolution over time is smooth, not jittery. The continuity assumption remains valid when the intervals between the measurements vary. This kind of data can be found in many domains, from medical and industrial to financial applications.

Different approaches to model irregular data have been proposed, including neural (ordinary or stochastic) differential equations (Chen et al., 2018; Li et al., 2020), neural processes (Garnelo et al., 2018), normalizing flows (Deng et al., 2020), etc. As it turns out, capturing the true generative process proves to be difficult, especially with the inherent stochasticity of the data.

We propose an alternative method, a generative model for continuous data that is based on the diffusion framework (Ho et al., 2020), which simply adds noise to a data point until it contains no information about the original input. At the same time, the generative part of these models learns to reverse this process so that we can sample new realizations once training is completed. In this paper, we apply these ideas to the time series setting and address the unique challenges that arise.

Contributions. Contrary to the previous works on diffusion, we model continuous functions, not vectors (Fig. 1). To do so, we first define a suitable noising process that will preserve continuity. Next, we derive the transition probabilities to perform the noising and specify the evidence bound on the likelihood as well as the new sampling procedure. Finally, we propose new models that take in the noisy input and produce the denoised output or, alternatively, the value of the score function.

2. BACKGROUND

Given training data $\{x\}$ where each $x \in \mathbb{R}^d$, the goal of generative modeling is to learn the probability density function $p(x)$ and be able to generate new samples from this learned distribution. Diffusion models (Ho et al., 2020; Song et al., 2021) achieve both of these goals by learning to reverse some fixed process that adds noise to the data. In the following, we present a brief overview of the two ways to define diffusion; in Section 2.1 the noise is added across $N$ increasing scales, which is then taken to the limit in Section 2.2 by defining the diffusion using a stochastic differential equation (SDE).

2.1. FIXED-STEP DIFFUSION

Sohl-Dickstein et al. (2015) and Ho et al. (2020) propose the denoising diffusion probabilistic model (DDPM), which gradually adds fixed Gaussian noise to the observed data point $x_0$ via known scales $\beta_n$ to define a sequence of progressively noisier values $x_1, x_2, \dots, x_N$. The final noisy output $x_N \sim \mathcal{N}(0, I)$ carries no information about the original data point, and thus the sequence of positive noise (variance) scales $\beta_1, \dots, \beta_N$ has to be increasing such that the first noisy output $x_1$ is close to the original data $x_0$ and the final value $x_N$ is pure noise. The goal is then to learn to reverse this process.

As diffusion forms a Markov chain, the transition between any two consecutive points is defined with a conditional probability $q(x_n | x_{n-1}) = \mathcal{N}(\sqrt{1 - \beta_n}\, x_{n-1}, \beta_n I)$. Since the transition kernel is Gaussian, the value at any step $n$ can be sampled directly from $x_0$. Given $\alpha_n = 1 - \beta_n$ and $\bar{\alpha}_n = \prod_{k=1}^{n} \alpha_k$, we can write:

$$q(x_n | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_n}\, x_0, (1 - \bar{\alpha}_n) I). \tag{1}$$

Further, the probability of any intermediate value $x_{n-1}$ given its successor $x_n$ and initial $x_0$ is

$$q(x_{n-1} | x_n, x_0) = \mathcal{N}(\tilde{\mu}_n, \tilde{\beta}_n I), \tag{2}$$

where:

$$\tilde{\mu}_n = \frac{\sqrt{\bar{\alpha}_{n-1}}\, \beta_n}{1 - \bar{\alpha}_n}\, x_0 + \frac{\sqrt{\alpha_n}\, (1 - \bar{\alpha}_{n-1})}{1 - \bar{\alpha}_n}\, x_n, \qquad \tilde{\beta}_n = \frac{1 - \bar{\alpha}_{n-1}}{1 - \bar{\alpha}_n}\, \beta_n.$$

The generative model learns the reverse process $p(x_{n-1} | x_n)$. Sohl-Dickstein et al. (2015) set $p(x_{n-1} | x_n) = \mathcal{N}(\mu_\theta(x_n, n), \beta_n I)$ and parameterize $\mu_\theta$ with a neural network. The training objective is to maximize the evidence lower bound

$$\log p(x_0) \geq -\mathbb{E}_q \Big[ D_{KL}\big(q(x_N | x_0) \,\|\, p(x_N)\big) + \sum_{n>1} D_{KL}\big(q(x_{n-1} | x_n, x_0) \,\|\, p(x_{n-1} | x_n)\big) - \log p(x_0 | x_1) \Big]. \tag{4}$$

In practice, however, the approach by Ho et al. (2020) is to reparameterize $\mu_\theta$ and predict the noise $\epsilon$ that was added to $x_0$, using a neural network $\epsilon_\theta(x_n, n)$, and minimize the simplified loss function:

$$\mathcal{L}(x_0) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, n \sim \mathcal{U}(\{1, \dots, N\})} \big\| \epsilon_\theta(\sqrt{\bar{\alpha}_n}\, x_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, n) - \epsilon \big\|_2^2. \tag{5}$$

To generate new data, the first step is to sample a point from the final distribution $x_N \sim \mathcal{N}(0, I)$ and then iteratively denoise it using the above model ($x_N \to x_{N-1} \to \dots \to x_0$) to get a sample from the data distribution. To summarize, the forward process adds the noise $\epsilon$ to the input $x_0$, at different scales, to produce $x_n$. The model learns to invert this, i.e., it predicts the noise $\epsilon$ from $x_n$.
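The closed-form forward sampling in Equation 1 and the simplified loss in Equation 5 can be sketched in a few lines of NumPy. This is only an illustration: the linear schedule, the number of steps, and the helper names `make_schedule`, `q_sample` and `simple_loss` are assumptions made here, and the zero-output "model" is a placeholder for a trained network $\epsilon_\theta$.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_min, beta_max, num_steps)  # beta_1, ..., beta_N
    alphas = 1.0 - betas                                 # alpha_n = 1 - beta_n
    alpha_bars = np.cumprod(alphas)                      # alpha_bar_n = prod_k alpha_k
    return betas, alphas, alpha_bars

def q_sample(x0, n, alpha_bars, rng):
    """Draw x_n ~ q(x_n | x_0) in one shot (Equation 1); n is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[n - 1]
    xn = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xn, eps

def simple_loss(eps_model, x0, alpha_bars, rng):
    """Single Monte Carlo draw of the simplified loss (Equation 5)."""
    n = int(rng.integers(1, len(alpha_bars) + 1))  # n ~ U({1, ..., N})
    xn, eps = q_sample(x0, n, alpha_bars, rng)
    return float(np.sum((eps_model(xn, n) - eps) ** 2))

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = np.array([1.0, -2.0, 0.5])
xn, eps = q_sample(x0, 500, alpha_bars, rng)
# Placeholder model that always predicts zero noise (hypothetical, untrained).
loss = simple_loss(lambda x, n: np.zeros_like(x), x0, alpha_bars, rng)
```

With this schedule $\bar{\alpha}_N$ is close to zero, so $x_N$ is approximately a draw from $\mathcal{N}(0, I)$ regardless of $x_0$.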

2.2. SDE DIFFUSION

Instead of taking a finite number of diffusion steps as in Section 2.1, Song et al. (2021) introduce a continuous diffusion of vector-valued data, $x_0 \to x_s$, where $s \in [0, S]$ is now a continuous variable. The forward process can be elegantly defined with an SDE:

$$dx_s = f(x_s, s)\, ds + g(s)\, dW_s, \tag{6}$$

where $W$ is a standard Wiener process. The variable $s$ is the continuous analogue of the discrete steps, implying that the input gets noisier during the SDE evolution. The final value $x_S \sim p(x_S)$ will follow some predefined distribution, as in Section 2.1. For the forward SDE in Equation 6 there exists a corresponding reverse SDE (Anderson, 1982):

$$dx_s = \big[ f(x_s, s) - g(s)^2\, \nabla_{x_s} \log p(x_s) \big]\, ds + g(s)\, dW_s,$$

where $\nabla_{x_s} \log p(x_s)$ is the score function. Solving the above SDE from $S$ to $0$, given the initial condition $x_S \sim p(x_S)$, returns a sample from the data distribution. The generative model's goal is to learn the score function via a neural network $\psi_\theta(x_s, s)$ by minimizing the following loss:

$$\mathcal{L}(x_0) = \mathbb{E}_{x_s \sim \mathrm{SDE}(x_0),\, s \sim \mathcal{U}(0, S)} \big\| \psi_\theta(x_s, s) - \nabla_{x_s} \log p(x_s) \big\|_2^2. \tag{8}$$

Song et al. (2021) define the continuous equivalent of the DDPM forward process as the following SDE:

$$dx_s = -\tfrac{1}{2} \beta(s)\, x_s\, ds + \sqrt{\beta(s)}\, dW_s,$$

where $\beta(s)$ and $S$ are chosen in such a way that the final noise distribution is unit normal, $x_S \sim \mathcal{N}(0, I)$. Given this specific parameterization, one can easily derive the transition probability $q(x_s | x_0)$ and calculate the exact score in closed form (see Section 3.3 and Appendix A.3).
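To make the closed-form transition of the variance-preserving SDE $dx_s = -\tfrac{1}{2}\beta(s) x_s\, ds + \sqrt{\beta(s)}\, dW_s$ concrete, the sketch below compares the analytic mean $x_0 e^{-\frac{1}{2}\int_0^s \beta}$ and variance $1 - e^{-\int_0^s \beta}$ with a plain Euler-Maruyama simulation. The linear $\beta(s)$ on $[0, 1]$, its end points, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def beta(s, beta_min=0.1, beta_max=20.0):
    """Assumed linear noise rate beta(s) on s in [0, 1]."""
    return beta_min + s * (beta_max - beta_min)

def int_beta(s, beta_min=0.1, beta_max=20.0):
    """Closed-form integral of beta from 0 to s."""
    return beta_min * s + 0.5 * (beta_max - beta_min) * s ** 2

def vp_marginal(x0, s):
    """Mean and variance of q(x_s | x_0) for the variance-preserving SDE."""
    b = int_beta(s)
    return x0 * np.exp(-0.5 * b), 1.0 - np.exp(-b)

def euler_maruyama(x0, s_end, num_steps, num_paths, rng):
    """Simulate the forward SDE for many scalar paths at once."""
    x = np.full(num_paths, x0, dtype=float)
    ds = s_end / num_steps
    for i in range(num_steps):
        b = beta(i * ds)
        x = x - 0.5 * b * x * ds + np.sqrt(b * ds) * rng.standard_normal(num_paths)
    return x

rng = np.random.default_rng(0)
mean, var = vp_marginal(x0=2.0, s=0.5)
paths = euler_maruyama(2.0, 0.5, num_steps=1000, num_paths=20000, rng=rng)
```

The empirical mean and variance of `paths` should agree with `mean` and `var` up to Monte Carlo and discretization error, and at $s = 1$ the variance is numerically indistinguishable from one for this choice of $\beta(s)$.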

3. DIFFUSION FOR TIME SERIES DATA

In contrast to the previous section, which deals with data points that are represented by vectors, we are interested in generative modeling for time series data. We represent a time series as a set of points $X = \{x(t_0), \dots, x(t_{M-1})\}$, $t_i \in [0, T]$, observed across $M$ timestamps. The observations can be equally spaced, but this formulation encompasses irregularly sampled data as well. We assume that each observed time series comes from its corresponding underlying continuous function $x(\cdot)$. Our approach can be viewed as modeling the distribution "$p(x(\cdot))$" over functions instead of vectors, which amounts to learning the stochastic process. To preserve continuity, we cannot apply the ideas from Section 2 directly, unless we assume measurements are independent of each other. The problem is that adding independent noise to each observation produces discontinuous samples, which is at odds with our continuity assumption. In the following, we address this and propose a solution.

3.1. STOCHASTIC PROCESSES AS NOISE SOURCES FOR DIFFUSION

Instead of defining the diffusion by adding some scaled noise vector $\epsilon \sim \mathcal{N}(0, I)$ to a data vector $x$, we add a noise function (stochastic process) $\epsilon(\cdot)$ to the underlying data function $x(\cdot)$. The only restriction on $\epsilon(\cdot)$ is that it has to be continuous so that the output remains continuous as well, which clearly rules out stochastic processes where time is indexed by a finite set, e.g., $\epsilon(\cdot) \sim \mathcal{N}(0, I)$. However, using a normal distribution proved to be very convenient in Section 2 as it allows for closed-form formulations of various terms, especially the loss. This is due to the fact that adding two Gaussian random variables leads to another Gaussian variable, which, as an aside, is a property shared by some other parametric distributions (Nachmani et al., 2021). Therefore, our goal is to define $\epsilon(\cdot)$ which will satisfy the continuity property while giving us tractable training and sampling. Note that $t$ refers to the time of the observation and $\epsilon(t)$ is the noise at $t$, in contrast to the previous section where the time-like variables $n$ and $s$ referred to the noise scale.

We could consider obtaining the noise from a standard Wiener process $\epsilon(\cdot) = W(\cdot)$. Although we would have all the terms in closed form, a clear disadvantage of this approach is that the variance grows with time. Additionally, the distribution of $W(0)$ is degenerate, so we would never add any noise at $t = 0$. This can be solved in an ad hoc manner by shifting the whole time series, similar to Deng et al. (2020). Instead, in the following, we present two stationary stochastic processes that add the same amount of noise regardless of the time of the observation. Note that the noise is correlated in the time dimension, hence the use of the stochastic process. An additional nice property of these processes is that they reduce to the diffusion from Section 2 in the trivial case of a time series with only one element.

1. Gaussian process prior. Given a set of $M$ time points $\mathbf{t}$, we propose sampling $\epsilon(t)$, $t \in \mathbf{t}$, from a Gaussian process $\mathcal{N}(0, \Sigma)$, where each element of the covariance matrix is specified with a kernel $\mathrm{cov}(t, u) = k(t, u)$. This produces smooth noise functions $\epsilon(\cdot)$ that can be evaluated at any $t$. To define a stationary process, we have to use a stationary kernel; we will use a radial basis kernel $k(t, u) = \exp(-\gamma (t - u)^2)$. Adjusting the parameter $\gamma$ lets us vary the flatness of the noise curves. Given a set of time points $\mathbf{t}$, we can easily sample from this process by first computing the covariance $\Sigma = k(\mathbf{t}, \mathbf{t})$ and then sampling from the multivariate normal distribution $\mathcal{N}(0, \Sigma)$.

Algorithm 1 Loss (DSPD-GP diffusion)
1: $X_0, \mathbf{t} \sim p_{\mathrm{data}}(X, \mathbf{t})$
2: $\Sigma = k(\mathbf{t}, \mathbf{t})$
3: $L = \mathrm{Cholesky}(\Sigma)$
4: $\varepsilon \sim \mathcal{N}(0, I)$
5: $\epsilon = L \varepsilon$
6: $n \sim \mathcal{U}(\{1, \dots, N\})$
7: $X_n = \sqrt{\bar{\alpha}_n}\, X_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon$
8: $\mathcal{L} = \| \varepsilon - \epsilon_\theta(X_n, \mathbf{t}, n) \|_2^2$

Algorithm 2 Sampling (DSPD-GP diffusion)
1: input: $\mathbf{t} = \{t_0, \dots, t_{M-1}\}$
2: $\Sigma = k(\mathbf{t}, \mathbf{t})$; $L = \mathrm{Cholesky}(\Sigma)$
3: $X_N \sim \mathcal{N}(0, \Sigma)$
4: for $n = N, \dots, 1$ do
5: $z \sim \mathcal{N}(0, \Sigma)$
6: $X_{n-1} = \frac{1}{\sqrt{\alpha_n}} \Big( X_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}}\, L\, \epsilon_\theta(X_n, \mathbf{t}, n) \Big) + \sqrt{\beta_n}\, z$
7: end for
8: return $X_0$

2. Ornstein-Uhlenbeck diffusion. The alternative noise distribution is a stationary OU process, which is specified as a solution to the following SDE:

$$d\epsilon_t = -\gamma \epsilon_t\, dt + dW_t,$$

where $W_t$ is the standard Wiener process and we use the initial condition $\epsilon_0 \sim \mathcal{N}(0, I)$. We can obtain samples from the OU process easily by sampling from a time-changed and scaled Wiener process: $e^{-\gamma t}\, W_{e^{2\gamma t}}$. The covariance can be calculated as $\mathrm{cov}(t, u) = \exp(-\gamma |t - u|)$. The OU process is a special case of a Gaussian process with a Matérn kernel ($\nu = 0.5$) (Rasmussen & Williams, 2005, p. 86). We discuss different sampling techniques and their trade-offs in Appendix A.4.

In the end, both the GP and OU processes are defined with a multivariate normal distribution $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is calculated using the times of the observations.
As opposed to the diffusions from Section 2, we use correlated noise in the forward process. Our approach allows us to produce continuous functions as samples and will prove to be a natural way to do forecasting and imputation.
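The two noise sources can be sketched as follows: build the covariance from the observation times with either the radial basis kernel or the OU kernel, factorize it, and multiply the factor with white noise. The function names, the diagonal jitter added for numerical stability, and the value $\gamma = 1$ are assumptions for this example. The helper `time_changed_wiener_cov` evaluates the covariance of $e^{-\gamma t} W_{e^{2\gamma t}}$ directly from $\mathrm{cov}(W_a, W_b) = \min(a, b)$, which reproduces $\exp(-\gamma |t - u|)$.

```python
import numpy as np

def rbf_kernel(t, u, gamma=1.0):
    """Radial basis kernel k(t, u) = exp(-gamma (t - u)^2)."""
    return np.exp(-gamma * (t[:, None] - u[None, :]) ** 2)

def ou_kernel(t, u, gamma=1.0):
    """Stationary OU covariance exp(-gamma |t - u|)."""
    return np.exp(-gamma * np.abs(t[:, None] - u[None, :]))

def time_changed_wiener_cov(t, gamma=1.0):
    """Covariance of exp(-gamma t) * W(exp(2 gamma t)) using cov(W_a, W_b) = min(a, b)."""
    scale = np.exp(-gamma * (t[:, None] + t[None, :]))
    clock = np.exp(2.0 * gamma * np.minimum(t[:, None], t[None, :]))
    return scale * clock

def sample_noise(t, kernel, rng, jitter=1e-6):
    """Draw one noise function evaluated at times t; jitter is a stability assumption."""
    cov = kernel(t, t) + jitter * np.eye(len(t))
    chol = np.linalg.cholesky(cov)             # cov = chol @ chol.T
    return chol @ rng.standard_normal(len(t)), cov

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, size=20))   # irregular observation times
eps_gp, cov_gp = sample_noise(t, rbf_kernel, rng)
eps_ou, cov_ou = sample_noise(t, ou_kernel, rng)
```

For a time series with a single observation both kernels give the 1x1 covariance $[1]$, which is the sense in which the processes reduce to the independent-noise diffusion of Section 2.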

3.2. DISCRETE STOCHASTIC PROCESS DIFFUSION (DSPD)

We apply the discrete diffusion framework to the time series setting. Reusing the notation from before, $X_0$ denotes the input data and $X_n = \{x_n(t_0), \dots, x_n(t_{M-1})\}$ is the noisy output after $n$ diffusion steps. In contrast to the classical DDPM (Section 2.1), where one adds independent Gaussian noise to data, we now add the noise from a stochastic process. In particular, given the times of the input observations, we can compute the covariance $\Sigma$ and sample noise $\epsilon(\cdot)$ from a GP or OU process as defined in Section 3.1. We can write the transition kernel and the posterior as:

$$q(X_n | X_0) = \mathcal{N}(\sqrt{\bar{\alpha}_n}\, X_0, (1 - \bar{\alpha}_n) \Sigma), \qquad q(X_{n-1} | X_n, X_0) = \mathcal{N}(\tilde{\mu}_n, \tilde{\beta}_n \Sigma),$$

where the difference to Equations 1 and 2 is the inclusion of $\Sigma$. The full derivation is in Appendix A.1.

The generative model is defined with the reverse process $p(X_{n-1} | X_n) = \mathcal{N}(\mu_\theta(X_n, \mathbf{t}, n), \beta_n \Sigma)$, similar to Ho et al. (2020), but we swap the identity matrix for $\Sigma$. Another key difference is that the model now takes the full time series and the time points in order to output the prediction, which has the same size as $X_n$. The architecture, therefore, has to be a type of time series encoder-decoder.

Since all the probabilities are still normal, the terms in the ELBO (Equation 4) can be calculated in closed form. Following Ho et al. (2020), we also adopt the reparameterization which predicts the noise $\epsilon(\cdot)$ and use the simplified loss as in Equation 5:

$$\mathcal{L}(X_0) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma),\, n \sim \mathcal{U}(\{1, \dots, N\})} \big\| \epsilon_\theta(\sqrt{\bar{\alpha}_n}\, X_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \mathbf{t}, n) - \epsilon \big\|_2^2.$$

More details can be found in Appendix A.2. The sampling is similar to Ho et al. (2020), but the noise comes from a stochastic process instead of independent normal distributions. In Algorithms 1 and 2 we show the training and the sampling procedure, respectively, when using the GP diffusion.
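The training and sampling procedures for the GP diffusion can be sketched for a univariate series as below. This is a simplified illustration, not the authors' implementation: the schedule, the jitter, and the names `dspd_loss` and `dspd_sample` are assumptions, the placeholder model returns zeros instead of a trained $\epsilon_\theta$, and the injected noise is scaled by $\sqrt{\beta_n}$ so that it has covariance $\beta_n \Sigma$, matching the reverse kernel $p(X_{n-1} | X_n)$ above.

```python
import numpy as np

def rbf_cov(t, gamma=1.0, jitter=1e-6):
    """RBF covariance at times t; the jitter is a numerical-stability assumption."""
    diff = t[:, None] - t[None, :]
    return np.exp(-gamma * diff ** 2) + jitter * np.eye(len(t))

def dspd_loss(eps_model, x0, t, alpha_bars, rng):
    """One Monte Carlo draw of the loss, with the model predicting white noise."""
    chol = np.linalg.cholesky(rbf_cov(t))
    white = rng.standard_normal(x0.shape)       # independent noise
    eps = chol @ white                          # correlated noise ~ N(0, Sigma)
    n = int(rng.integers(1, len(alpha_bars) + 1))
    a_bar = alpha_bars[n - 1]
    xn = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return float(np.sum((white - eps_model(xn, t, n)) ** 2))

def dspd_sample(eps_model, t, betas, rng):
    """Ancestral sampling with noise drawn from the stochastic process."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    chol = np.linalg.cholesky(rbf_cov(t))
    x = chol @ rng.standard_normal(len(t))      # X_N ~ N(0, Sigma)
    for n in range(len(betas), 0, -1):
        z = chol @ rng.standard_normal(len(t))  # z ~ N(0, Sigma)
        pred = chol @ eps_model(x, t, n)        # map predicted white noise to process noise
        coef = (1.0 - alphas[n - 1]) / np.sqrt(1.0 - alpha_bars[n - 1])
        x = (x - coef * pred) / np.sqrt(alphas[n - 1]) + np.sqrt(betas[n - 1]) * z
    return x

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 12)
betas = np.linspace(1e-4, 0.02, 1000)
zero_model = lambda x, t, n: np.zeros_like(x)   # hypothetical untrained stand-in
loss = dspd_loss(zero_model, np.sin(t), t, np.cumprod(1.0 - betas), rng)
sample = dspd_sample(zero_model, t, betas, rng)
```

With a trained network in place of `zero_model`, `sample` would be one realization of the learned process evaluated at the query times `t`.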

3.3. CONTINUOUS STOCHASTIC PROCESS DIFFUSION (CSPD)

Similarly to the previous section, we can extend the continuous diffusion framework to use the noise coming from a Gaussian or OU process. Now, the noise scales $\beta(s)$ are continuous in the diffusion time $s$, see Section 2.2. Given a factorized covariance matrix $\Sigma = L L^T$, we modify the variance-preserving diffusion SDE (Song et al., 2021):

$$dX_s = -\tfrac{1}{2} \beta(s)\, X_s\, ds + \sqrt{\beta(s)}\, L\, dW_s,$$

which gives us the following transition probability (see Appendix A.3 for details):

$$q(X_s | X_0) = \mathcal{N}(\tilde{\mu}, \tilde{\Sigma}) = \mathcal{N}\Big( X_0\, e^{-\frac{1}{2} \int_0^s \beta(s)\, ds},\; \Sigma \big( 1 - e^{-\int_0^s \beta(s)\, ds} \big) \Big). \tag{16}$$

Since this probability is normal, the value of the score function can be computed in closed form:

$$\nabla_{X_s} \log q(X_s | X_0) = -\tilde{\Sigma}^{-1} (X_s - \tilde{\mu}),$$

which we can use to optimize the same objective as in Equation 8. Our neural network $\epsilon_\theta(X_s, \mathbf{t}, s)$ will take in the full time series, together with the observation times $\mathbf{t}$ and the diffusion time $s$, and predict the values of the score function. As it turns out, we can again use the reparameterization in which we predict the noise, whilst the score is only calculated when sampling new realizations.
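The closed-form transition and score can be evaluated directly, as in the sketch below. The linear $\beta(s)$, the OU covariance with $\gamma = 1$, and the function names are assumptions for illustration; the score is computed with a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def int_beta(s, beta_min=0.1, beta_max=20.0):
    """Integral of an assumed linear beta(s) from 0 to s."""
    return beta_min * s + 0.5 * (beta_max - beta_min) * s ** 2

def cspd_transition(x0, cov, s):
    """Mean and covariance of q(X_s | X_0) for the correlated VP diffusion."""
    b = int_beta(s)
    mean = x0 * np.exp(-0.5 * b)
    cov_s = cov * (1.0 - np.exp(-b))
    return mean, cov_s

def cspd_score(xs, mean, cov_s):
    """Score of the Gaussian transition: -cov_s^{-1} (xs - mean)."""
    return -np.linalg.solve(cov_s, xs - mean)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 8)
cov = np.exp(-np.abs(t[:, None] - t[None, :]))   # OU covariance, gamma = 1
x0 = np.sin(2.0 * np.pi * t)
mean, cov_s = cspd_transition(x0, cov, s=0.3)
xs = mean + np.linalg.cholesky(cov_s) @ rng.standard_normal(len(t))
score = cspd_score(xs, mean, cov_s)
```

The score vanishes at `xs == mean` and points back toward the mean elsewhere, scaled by the inverse of the time-dependent covariance.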

3.4. IMPLEMENTATION DETAILS

In our work, we consider multivariate time series, which means each observation at a certain time point is a $d$-dimensional vector. In the forward diffusion process we treat the data as $d$ individual univariate time series and add the noise to them independently. This is in line with previous works where, e.g., independent noise is added to the individual pixels in an image. The model takes in a complete noisy multivariate time series to learn the reverse process, so it handles the correlations between the data dimensions and across the time dimension.

We note that the best results are obtained if the model is reparameterized to always predict the independent Gaussian noise. In the discrete stochastic process diffusion, this means the noise is computed as $\epsilon = L \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$, where $L L^T = \Sigma$ is the covariance matrix of the stochastic process. The model then learns to predict $\varepsilon$. Similarly, in the continuous diffusion, we represent the score as $L \varepsilon / \sigma^2$, where $\sigma^2 = 1 - \exp(-\int_0^s \beta(s)\, ds)$ (Equation 16). In both cases, when we sample, the model will output the prediction of $\varepsilon$, which we transform to get the sample from the stochastic process or to obtain the score function value. An example of this can be found in Algorithm 2 for the discrete diffusion.
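For a multivariate series stored as an array of shape $(M, d)$, adding the process noise independently to each dimension amounts to a single matrix product with the factor $L$, and the white noise that the model is trained to predict can be recovered by a triangular solve. The sizes, the OU covariance, and the variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 16, 3                                       # time points, data dimensions
t = np.sort(rng.uniform(0.0, 5.0, size=M))
cov = np.exp(-np.abs(t[:, None] - t[None, :]))     # OU covariance, gamma = 1
chol = np.linalg.cholesky(cov)                     # lower triangular, chol @ chol.T == cov

white = rng.standard_normal((M, d))                # independent noise, the model's target
eps = chol @ white                                 # column j is a draw from N(0, cov) for dimension j

# Invert the transformation to recover the white noise from the correlated noise.
recovered = np.linalg.solve(chol, eps)
```

Because the same factor multiplies every column, the noise is correlated across time within each dimension but independent across dimensions.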

4. APPLICATIONS

Training the generative model amounts to learning to reverse the forward diffusion process by predicting the noise that was added to the clean data. The input to the model is the time series $(X_0, \mathbf{t})$ along with the diffusion step $n$ or diffusion time $s$, and the output is of the same size as $X_0$. If additional inputs are available, we can also model the conditional distribution; for example, time series data often contains covariates for each time point of $\mathbf{t}$. We can also condition the generation on the past observations, which essentially defines a probabilistic forecaster, or condition only on the observed values, which defines a neural process or an imputation model.

4.1. FORECASTING MULTIVARIATE TIME SERIES

Forecasting answers the question of what is going to happen given what we have seen, and as such is one of the most prominent tasks in time series analysis. Probabilistic forecasting adds a layer of (aleatoric) uncertainty on top of that and returns confidence intervals, which is often a requirement for deploying models in real-world settings. Neural forecasters are usually encoder-decoders, where the history of observations $(X_H, \mathbf{t}_H)$ is represented with a single vector $z$ and the decoder outputs the distribution of the future values $X_F$ given $z$ at time points $\mathbf{t}_F$. Previous works relied on specifying the parameters of the output distribution, e.g., via a diagonal covariance (Salinas et al., 2020) or some low-rank approximation (Salinas et

In our case, the prediction $X_F$ will not be a single vector but a set $\{x(t_M), \dots, x(t_{M+K-1})\}$ of size $K$. This type of data is naturally handled by process diffusion as defined in Section 3. The only thing left to do is to design a suitable denoising model $\epsilon_\theta$. Previous observations are again represented with an RNN to obtain $z$, and we condition the reverse process on it. We propose an architecture similar to the TimeGrad model from Rasul et al. (2021a) but which outputs multiple values simultaneously. Figure 2 shows the architecture overview: we add $\mathbf{t}$ as an input and take the whole time series at once. Thus, we use 2D convolution where the extra dimension corresponds to the time dimension.
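Since the forecaster is sample-based, the confidence intervals mentioned above are obtained empirically from repeated draws. A minimal sketch, assuming the samples are stacked in an array of shape (number of samples, $K$, $d$); the random-walk draws below are synthetic stand-ins for model output, and `empirical_intervals` is a name introduced here.

```python
import numpy as np

def empirical_intervals(samples, level=0.9):
    """Per-time-step, per-dimension quantile band and median from forecast samples."""
    tail = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(samples, [tail, 0.5, 1.0 - tail], axis=0)
    return lower, median, upper

rng = np.random.default_rng(0)
# Synthetic stand-in for 200 forecast samples over K = 24 steps and d = 2 dimensions.
samples = rng.standard_normal((200, 24, 2)).cumsum(axis=1)
lower, median, upper = empirical_intervals(samples, level=0.9)
```

Here `lower` and `upper` bound the central 90% of the samples at every forecast step, and `median` can serve as a point forecast.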

4.2. DIFFUSION PROCESS AS A NEURAL PROCESS

Neural processes (Garnelo et al., 2018) are a class of latent variable models that define a stochastic process with neural networks. Given a set of data points (a dataset), the model outputs the probability distribution over the functions that generated this dataset. That is, for different datasets, the model will define different stochastic processes. Due to this behavior, neural processes bear a resemblance to Gaussian processes but can also be viewed as a meta-learning model (Hospedales et al., 2021).

Let $X_A$ denote the observed data, in our case a time series, and let $X_B$ be the unobserved data at the time points $\mathbf{t}_B$. Garnelo et al. (2018) construct an encoder-decoder model that uses amortized variational inference for training (Kingma & Welling, 2014). The encoder takes in a set of observed points $(X_A, \mathbf{t}_A)$ and outputs the distribution over the latent variable $q(z)$. It is crucial that the encoder is permutation invariant, i.e., the order of the input points does not alter the result. This is easy to achieve using, e.g., deep sets (Zaheer et al., 2017). The decoder takes in the sampled latent vector $z$ and the query time points $\mathbf{t}_B$ and predicts the values of the unobserved points $X_B$.

Since our approach samples functions, we can condition the generation on an input dataset $(X_A, \mathbf{t}_A)$ in order to create our version of a neural process, based purely on the diffusion framework. The encoder will be a deterministic neural network that outputs the latent vector $z$, contrary to Garnelo et al. (2018), whose encoder outputs a distribution. Similar to Section 4.1, the diffusion is conditioned on $z$ and we can output samples for any query $\mathbf{t}_B$. Therefore, we capture the distribution $p(X_B | X_A)$ directly. During training, we adopt the approach of feeding in the data such that we learn $p(X_A \cup X_B | X_A)$, which helps the model learn to output high certainty around $\mathbf{t}_A$, see Garnelo et al. (2018).
In the end, our model sees many observed-unobserved pairs coming from different true underlying processes. The model learns to represent the observed points X A such that the denoising process corresponds to the correct distribution, given X A . After training is completed, we take a time series X A and output the samples at any set of query time points t B . We can view such an approach as an interpolation or imputation model that fills-in the missing values across time. The main appeal is the ability to capture different stochastic processes within a single model.
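The permutation invariance required of the encoder can be illustrated with a tiny deep-sets style network: each observed pair $(t, x)$ is embedded separately, the embeddings are averaged, and the pooled vector is mapped to $z$. The random weights, layer sizes, and the helper `split_context` are assumptions for this sketch and do not describe the architecture used in the experiments.

```python
import numpy as np

def deep_set_encode(x_obs, t_obs, w1, w2):
    """Deterministic, order-independent summary z of the observed points."""
    pairs = np.stack([t_obs, x_obs], axis=1)   # shape (num_obs, 2)
    hidden = np.tanh(pairs @ w1)               # per-point embedding
    pooled = hidden.mean(axis=0)               # mean pooling removes the ordering
    return np.tanh(pooled @ w2)

def split_context(x, t, num_context, rng):
    """Pick a random observed subset (X_A, t_A); targets are all points (X_A and X_B)."""
    idx = np.sort(rng.permutation(len(t))[:num_context])
    return x[idx], t[idx]

rng = np.random.default_rng(0)
w1 = rng.standard_normal((2, 16))
w2 = rng.standard_normal((16, 8))
t = np.sort(rng.uniform(0.0, 1.0, size=30))
x = np.sin(6.0 * t)

x_a, t_a = split_context(x, t, num_context=10, rng=rng)
z = deep_set_encode(x_a, t_a, w1, w2)

# Shuffling the observed points leaves the latent vector unchanged.
perm = rng.permutation(len(t_a))
z_perm = deep_set_encode(x_a[perm], t_a[perm], w1, w2)
```

The denoising model would then be conditioned on `z` while the diffusion runs over all target time points.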

4.3. PROBABILISTIC TIME SERIES IMPUTATION

The previous section considered interpolating points in time. Now, we look into filling in the missing values across the observation dimensions, i.e., the imputation of the vectors. An element $x$ of the time series $X$ is assigned a mask $m$ of the same dimension that indicates whether the $i$-th value $x_i$ has been observed ($m_i = 1$) or is missing ($m_i = 0$). Given observed $X_A$ and missing points $X_B$, Tashiro et al. (2021) propose a model that learns a conditional distribution $p(X_B | X_A)$. The model is built upon a diffusion framework and the reverse process is conditioned on $X_A$, similar to Section 4.2. We extend this by introducing noise from a stochastic process, as presented above. The learnable model remains the same but we introduce the correlated noise in the loss and sampling. We posit that a continuous noise process is a more natural inductive bias for irregular time series.

Table 1 shows that using a stochastic process as the noise source outperforms independent noise. The ablation in Table 6 shows that using an independent denoising model, i.e., $\epsilon_\theta$ that processes time series inputs individually, performs worse than a model that processes the whole time series at once. Figure 3 demonstrates the quality of the samples. Finally, Table 2 ) averaged over five runs, but we note that the rank of the models' performance does not change when using other metrics as well. Table 3 shows that our method outperforms TimeGrad even though we predict over the complete forecast horizon at once, and Figure 4 demonstrates the prediction quality alongside the uncertainty estimate.

Neural process. We construct a dataset where each time series $X$ comes from a different stochastic process, by sampling from Gaussian processes with varying kernel parameters and time series lengths. This is a standard training setting in the neural process literature (Garnelo et al., 2018). In our denoising network, we modify the attention-like layer to make it stationary (see Appendix B.2) and train as described in Section 4.2. Due to the use of tanh activations in the final layers, combined with this stationarity, our model extrapolates well, i.e., when tanh saturates, the mean and variance do not vary far from the observations. This is the same behaviour we see in a GP with an RBF kernel, for example. The quantile loss of the unobserved data under the true GP model is 0.845 while we achieve 0.737, which indicates we capture the true process; this can also be seen in Figure 5. We remark that the attentive neural process (Kim et al., 2019) does not produce the correct uncertainty.

Imputation. We compare to the CSDI model (

7. DISCUSSION

In this paper, we introduced the stochastic process diffusion framework for time series modeling. We demonstrated that the improvements over the previous works come from (1) using the stochastic process as the noise source; and (2) using the model that takes in the whole time series at once, instead of modeling points independently. We also show how one can condition the generation to obtain a forecasting model as well as interpolation and imputation models. In our experiments we used a mixture of synthetic and real-world datasets which both have regular and irregular sampling. We outperform strong baseline models on all of the tasks which demonstrates practical utility of our method.

7.1. FUTURE WORK

We used bare-bones diffusion without extensive tuning to demonstrate the modeling potential and make a fair comparison to other methods. However, it should be straightforward to improve upon our models by implementing recent advances in diffusion models (e.g., Nichol & Dhariwal, 2021a). If we have a large number of points, we can consider replacing the current sampling strategies with more scalable variants, such as switching to a sparse GP (Quiñonero-Candela & Rasmussen, 2005). Additionally, one can train a latent diffusion model (Rombach et al., 2021) by first learning a time series encoder-decoder, which might be helpful for high-dimensional data, such as those we encountered in the forecasting task. It would be interesting to explore different architecture choices, e.g., implementing improvements in conditioning models via learned activations (Ramos et al., 2022). Finally, we can also apply the presented methods to other areas outside time series, such as modeling point clouds or even images, as we have demonstrated that our method is competitive on regular grids.

ETHICS STATEMENT

We introduced a new method for time series generation. As such, it has many applications, such as probabilistic forecasting and imputation, both of which are of practical significance in real-world settings. In particular, we would like to see successful applications in the healthcare domain. As with any generative model, one has to pay attention to privacy and fairness when collecting data and building a model. We do not anticipate any negative outcomes from applying our model to time series data. No personal information is contained in any of the datasets.

REPRODUCIBILITY STATEMENT

Throughout the paper we describe in detail how our novel stochastic process diffusion framework works, including the algorithms for training and generation. We also describe the datasets and models we used, both in the main text and the Appendix. All the datasets we used are publicly available and the code that reproduces the results is released to the reviewers.

$dx = a(b - x)\, dt + \sigma \sqrt{x}\, dW_t$, where we set $a = 1$, $b = 1.2$, $\sigma = 0.2$ and sample $x_0 \sim \mathcal{N}(0, 1)$ but only take the positive values, otherwise the $\sqrt{x}$ term is undefined. We solve for $t \in \{1, \dots, 64\}$.
2. Lorenz is a chaotic system in three dimensions. It is governed by the following equations: $\dot{x} = \sigma(y - x)$, $\dot{y} = \rho x - y - xz$, $\dot{z} = xy - \beta z$, where $\rho = 28$, $\sigma = 10$, $\beta = 2.667$, and $t$ is sampled 100 times, uniformly on $[0, 2]$, and $x, y, z \sim \mathcal{N}(0, 100 I)$.
3. Ornstein-Uhlenbeck is defined as: $dx = (\mu t - \theta x)\, dt + \sigma\, dW_t$, with $\mu = 0.02$, $\theta = 0.1$ and $\sigma = 0.4$. We sample time the same way as for CIR.
4. Predator-prey is a 2D dynamical system defined with an ODE: $\dot{x} = \tfrac{2}{3} x - \tfrac{2}{3} xy$, $\dot{y} = xy - y$.
5. Sine dataset is generated as a mixture of 5 random sine waves $a \sin(bx + c)$, where $a \sim \mathcal{N}(3, 1)$, $b \sim \mathcal{N}(0, 0.25)$, and $c \sim \mathcal{N}(0, 1)$.
6. Sink is again a dynamical system, governed by: $\frac{dx}{dt} = \begin{pmatrix} -4 & 10 \\ -3 & 2 \end{pmatrix} x$, with $x_0 \sim \mathcal{N}(0, I)$.
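As an illustration of how such synthetic data can be generated, the sketch below simulates the square-root SDE $dx = a(b - x)\, dt + \sigma \sqrt{x}\, dW_t$ with the stated parameters and records it at $t \in \{1, \dots, 64\}$. The text does not specify the solver, so the Euler-Maruyama scheme, the number of substeps, the start at $t = 0$, the fixed initial value, and the truncation at zero are all assumptions made here.

```python
import numpy as np

def simulate_cir(x0, times, a=1.0, b=1.2, sigma=0.2, substeps=100, rng=None):
    """Euler-Maruyama simulation with truncation at zero (an assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    x, t_prev, path = float(x0), 0.0, []
    for t in times:
        dt = (t - t_prev) / substeps
        for _ in range(substeps):
            x_pos = max(x, 0.0)  # keep the square root defined
            x = x + a * (b - x_pos) * dt + sigma * np.sqrt(x_pos * dt) * rng.standard_normal()
        path.append(x)           # record the state at the observation time
        t_prev = t
    return np.array(path)

rng = np.random.default_rng(0)
times = np.arange(1, 65)         # t in {1, ..., 64}
path = simulate_cir(0.5, times, rng=rng)
```

With these parameters the process reverts to the level $b = 1.2$, so the later part of `path` fluctuates around that value.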

B.1.2 CTFP

We implement the continuous-time flow process (Deng et al., 2020), which is a normalizing flow model for stochastic processes. That is, there is a predefined base distribution $p(z)$ and a series of invertible transformations $f$ such that we can generate samples $x = f(z)$ and evaluate the density in closed form by computing $z = f^{-1}(x)$ and using the change of variables formula. For more details on normalizing flows, see Kobyzev et al. (2020). The novel idea in CTFP is to change the base density to a stochastic process, i.e., a Wiener process, to obtain the distribution over functions, similar to our work. In our case, we do not use invertible functions but learn to invert the noising process, and additionally, we add noise at multiple levels instead of only at the beginning. In the experiments we define a CTFP model as a 12-layer real NVP architecture (Dinh et al., 2017) with 2 hidden layers in each layer's MLP.
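The change of variables formula can be illustrated with the simplest possible invertible map, an elementwise affine transformation of a standard normal base distribution. This toy example is an assumption for illustration and is unrelated to the real NVP architecture used in the experiments.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of the base distribution p(z) = N(0, 1), elementwise."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(x, scale, shift):
    """log p(x) for x = f(z) = scale * z + shift via the change of variables formula."""
    z = (x - shift) / scale                       # z = f^{-1}(x)
    log_det_jacobian = np.log(np.abs(scale))      # log |df/dz|
    return standard_normal_logpdf(z) - log_det_jacobian

x = np.array([0.0, 1.0, 3.0])
logp = affine_flow_logpdf(x, scale=2.0, shift=1.0)
```

The result coincides with the log-density of $\mathcal{N}(1, 4)$, as expected for an affine map of a standard normal variable.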

B.1.3 LATENT ODE

Latent ODE is a variational autoencoder architecture, with an encoder that represents the complete time series as a single vector following $q(z)$, and a decoder that produces the samples at observation times $t_i$, $z(t_i) = f(z)$, $z \sim q(z)$. The final step is projection to a data space $q(t_i) \to x(t_i)$. The key idea is to use the neural ordinary differential equation (Chen et al., 2018) to define the evolution of the latent variable $z(\cdot)$ and thus obtain a probabilistic model of the function. This is different from our approach as it models the function in a latent space, with a single source of randomness at the beginning of the time series. That is, the random value is sampled at $t = 0$ and the time series is determined from there onward, whereas our method samples random values on the whole interval $[0, T]$ and does so multiple times (for $N$ diffusion steps) until we get the new realization. In the experiments we use a two-layer neural network for the neural ODE, and another two-layer network for projection to the data space.

B.1.4 OUR MODELS

We use two models, one is a simple feedforward network, and the second is an RNN-based model. The model takes in the time series X, times of the observations t and the diffusion step n or diffusion time s. The output is the same size as X. The feedforward model embeds the time and the diffusion step with a positional encoding (Vaswani et al., 2017) and passes it together with X through the multilayer neural network. Here, there is no interaction between the points along the time dimension. The model, however, has the capacity to learn transformation based on time of observation. The second model is RNN based, that is, we pass the same concatenated input as before to a 2-layer bidirectional GRU (Chung et al., 2014) and use a single linear layer to project to the output dimension. Table 6 shows that it is important to have interactions in the time dimension, regardless of the noise source, because otherwise we only learn the marginal distribution and the quality of the samples suffers. 
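A sinusoidal encoding of the observation times or the diffusion step, in the spirit of Vaswani et al. (2017), can be sketched as below. The embedding size, the maximum period, and the choice to concatenate sine and cosine features (rather than interleave them) are assumptions for this example.

```python
import numpy as np

def positional_encoding(values, dim=16, max_period=10000.0):
    """Map each scalar (time or diffusion step) to `dim` sinusoidal features."""
    values = np.asarray(values, dtype=float)[..., None]
    freqs = max_period ** (-np.arange(0, dim, 2) / dim)    # geometric frequencies
    angles = values * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

t = np.array([0.0, 0.3, 1.7, 4.2])    # irregular observation times
enc = positional_encoding(t, dim=16)  # shape (4, 16)
```

The encoded times and diffusion step would then be concatenated with $X$ before being passed to the feedforward or recurrent network.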



Figure 1: (Left) We add noise from a stochastic process to the whole time series at once. The model ϵ θ learns to reverse this process. (Right) We can use this approach to, e.g., forecast with uncertainty.

Figure 3: Real data and samples from our model based on an Ornstein-Uhlenbeck process.

We test our model as defined in Section 4.1 and Figure 2 against TimeGrad (Rasul et al., 2021b) on three established real-world datasets: Electricity, Exchange and Solar (Lai et al., 2018). Due to the limitations of the CRPS-sum metric (Koochali et al., 2022), we report the NRMSE and the energy score (Gneiting & Raftery, 2007

Figure 4: Forecast and uncertainty intervals on Electricity.

al., 2019), relying on normalizing flows (de Bézenac et al., 2020; Rasul et al., 2021b), or Generative Adversarial Networks (GANs) (Koochali et al., 2021). Recently, Rasul et al. (2021a) introduced a diffusion based forecasting model to learn the conditional probability $p(X_F | X_H)$. In particular, let $X_H = \{x(t_0), \dots, x(t_{M-1})\}$ be a history window of size $M$ sampled randomly from the full training data. They specify the distribution $p(x(t_M) | X_H)$ using a conditional DDPM model. The forward process adds independent Gaussian noise to $x(t_M)$ the same way as in DDPM. However, the reverse denoising model is conditioned on the history $X_H$, which is represented with a fixed-size vector $z$. After training is completed, the predictions are made in the following way: (1) $X_H$ is encoded with an RNN to obtain $z$; (2) the initial noisy value is sampled $x_N(t_M) \sim \mathcal{N}(0, I)$; and (3) denoising is performed using the sampling algorithm from Ho et al. (2020) but conditioned on $z$ to obtain $x_0(t_M)$. The final denoised value is the forecast, and sampling multiple values allows computing empirical confidence intervals of interest.

Figure 2: Overview of the forecasting model. The inputs are time series $X_n$, diffusion step $n$, observation times $\mathbf{t}$ and the history vector $z$. The output is the predicted noise value $\epsilon_n$.

Chen et al., 2018) allow capturing irregularly sampled time series, as they naturally handle continuous time. As such, this work is a building block which can also be used alongside our method to devise more powerful denoising networks. Rubanova et al. (2019) construct an encoder-decoder architecture based on neural ODEs which resembles the variational autoencoder (Kingma & Welling, 2014). The time series is thus modeled in a latent space by sampling a random vector which is propagated with an ODE. Neural stochastic differential equations (Li et al., 2020) extend this by adding noise in every solver step. This still amounts to generating the time series in a single pass, from t_0 to t_{M-1}, whereas we use the diffusion framework, which refines the generated time series from pure noise X_N to the final sample X_0.

Negative log-likelihood on synthetic data (lower is better) shows OU/GP consistently better than independent noise.

Our) 0.5115±0.0282 0.5135±0.0288 0.5055±0.0458 0.5855±0.0219 0.5255±0.009 0.513±0.0103

Accuracy of the discriminator trained to distinguish real data and model samples.

compares our model with the baselines and demonstrates that our model produces samples that are indistinguishable to a powerful transformer-based (Vaswani et al., 2017) discriminator model. The same does not hold for the competing methods.

NRMSE (top) and energy score on real-world forecasting data.

Tashiro et al., 2021) introduced in Section 4.3 on an imputation task. To this end, we use exactly the same training setup, including the random seeds and model architecture, but change the noise source to a Gaussian process. Following Tashiro et al. (2021), we use the Physionet dataset (Silva et al., 2012), which is a collection of medical time series recorded at an hourly rate. It already contains missing values, but for testing purposes we choose varying degrees of missingness and report the results on the test set. We update the loss and sampling accordingly, as in Section 3. Table 4 shows that we outperform the original CSDI model even though we only changed the noise and the dataset we used has regular time sampling.

Imputation RMSE on Physionet data with varying amounts of missingness.

Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2019.

Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Advances in Neural Information

Multivariate dimension, domain, frequency, total training time steps and prediction length properties of the training datasets used in the forecasting experiments.

draw ε ∼ N(0, I) and then ϵ = Lε. Since our model performs best if it predicts ε, we opted for this particular sampling approach. If t is not changing, L can be computed once and the performance impact will be minimal. Also, when sampling new realizations, L has to be computed only once, before the sampling loop (see Algorithm 2).

The properties of the open datasets used in the forecasting experiment are detailed in Table 5. Additionally, we generate 6 synthetic datasets, each with 10000 samples, that involve stochastic processes, dynamical and chaotic systems.
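The noise sampling described above (draw ε ∼ N(0, I), set ϵ = Lε, and compute L only once when t is fixed) can be sketched as follows. This is a minimal illustration rather than the authors' code: the squared-exponential kernel form, the jitter term and the names `rbf_covariance` and `CorrelatedNoise` are assumptions.

```python
import numpy as np


def rbf_covariance(t, sigma, jitter=1e-6):
    """Squared-exponential kernel matrix on times t (kernel form is an assumption)."""
    diff = t[:, None] - t[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2)) + jitter * np.eye(len(t))


class CorrelatedNoise:
    """Caches the Cholesky factor L of Sigma(t) so that every draw is one matrix product."""

    def __init__(self, t, sigma):
        self.chol = np.linalg.cholesky(rbf_covariance(t, sigma))   # Sigma = L L^T

    def sample(self, rng, batch_size):
        eps_unit = rng.standard_normal((batch_size, self.chol.shape[0]))  # ε ~ N(0, I)
        eps_corr = eps_unit @ self.chol.T                                 # rows are ϵ = L ε
        return eps_unit, eps_corr


rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, size=20))    # irregular observation times
noise = CorrelatedNoise(t, sigma=0.1)          # L is computed once here
eps_unit, eps_corr = noise.sample(rng, batch_size=4)
```

Returning both `eps_unit` and `eps_corr` reflects the setup in the text: the correlated noise is added to the data, while the unit-normal draw is what the model is trained to predict.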

Accuracy of the discriminator trained on samples from a diffusion model. Values around 0.5 indicate the discriminative model cannot distinguish the model samples and real data. Values closer to 1 indicate the generative model is not capturing the data distribution.

A DERIVATIONS

A.1 DISCRETE DIFFUSION POSTERIOR PROBABILITY

We extend Ho et al. (2020) by using a full covariance Σ(t) to define the noise distribution across time t. If Σ = LL^T, and keeping the same definitions from Section 2.1 for β_n, α_n, and ᾱ_n, we can write: with ϵ ∼ N(0, I). This corresponds to the following transition distributions: We are interested in q(X_{n-1} | X_n, X_0) ∝ q(X_n | X_{n-1}) q(X_{n-1} | X_0). Since both distributions on the right-hand side are normal, the result will be normal as well. We can write the resulting distribution as N(μ̃, Σ̃), where: We can now write: and from there: and using the fact that Σ is a symmetric matrix: Therefore, the only difference to the derivation in Ho et al. (2020) is the Σ(t) instead of the identity matrix I in the covariance.
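The steps of this derivation can be written out explicitly. The following is a sketch under the standard definitions of Ho et al. (2020), α_n = 1 − β_n and ᾱ_n = ∏_{k=1}^{n} α_k, which we assume match Section 2.1; the grouping of the equations is illustrative.

```latex
% Forward noising with correlated noise L\epsilon, \epsilon \sim N(0, I), \Sigma = L L^\top
\begin{align*}
X_n &= \sqrt{1-\beta_n}\,X_{n-1} + \sqrt{\beta_n}\,L\epsilon,
& X_n &= \sqrt{\bar\alpha_n}\,X_0 + \sqrt{1-\bar\alpha_n}\,L\epsilon, \\
q(X_n \mid X_{n-1}) &= \mathcal{N}\!\big(\sqrt{1-\beta_n}\,X_{n-1},\ \beta_n\Sigma\big),
& q(X_n \mid X_0) &= \mathcal{N}\!\big(\sqrt{\bar\alpha_n}\,X_0,\ (1-\bar\alpha_n)\Sigma\big).
\end{align*}
% Product of two Gaussians in X_{n-1}: the precisions add
\begin{align*}
\tilde\Sigma^{-1}
 &= \frac{\alpha_n}{\beta_n}\Sigma^{-1} + \frac{1}{1-\bar\alpha_{n-1}}\Sigma^{-1}
  = \frac{1-\bar\alpha_n}{\beta_n(1-\bar\alpha_{n-1})}\Sigma^{-1}
 \;\Longrightarrow\;
 \tilde\Sigma = \frac{1-\bar\alpha_{n-1}}{1-\bar\alpha_n}\,\beta_n\,\Sigma, \\
\tilde\mu
 &= \tilde\Sigma\Big(\frac{\sqrt{\alpha_n}}{\beta_n}\Sigma^{-1}X_n
  + \frac{\sqrt{\bar\alpha_{n-1}}}{1-\bar\alpha_{n-1}}\Sigma^{-1}X_0\Big)
  = \frac{\sqrt{\bar\alpha_{n-1}}\,\beta_n}{1-\bar\alpha_n}\,X_0
  + \frac{\sqrt{\alpha_n}\,(1-\bar\alpha_{n-1})}{1-\bar\alpha_n}\,X_n .
\end{align*}
```

The factors Σ^{-1} cancel in the mean, so μ̃ and the scalar multiplier of Σ̃ coincide with those of Ho et al. (2020); only the identity matrix is replaced by Σ.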

A.2 DISCRETE DIFFUSION LOSS

We use the evidence lower bound from Equation 4. The distribution q(X_{n-1} | X_n, X_0) is defined as N(μ̃, C_1 Σ), where C_1 is some constant (Equations 21 and 22). Similar to Ho et al. (2020), we choose the parameterization for the reverse process p(X_{n-1} | X_n) = N(µ_θ(X_n, t, n), β_n Σ), where: Then the KL-divergence is between two normal distributions, so we can write the following, where C_2 is a term which does not depend on the parameters θ: If we follow the implementation from Section 3.4, we reparameterize the model to predict the noise from the unit normal distribution. Then, the matrix L, where Σ = LL^T, can be factored out of the previous equation, leaving us with the L^T Σ^{-1} L term in the middle, which evaluates to the identity. We found that simplifying the loss to the mean squared error between the true noise ϵ and the predicted noise ϵ_θ, as in Ho et al. (2020), leads to faster evaluation and better results. Therefore, during training we use the loss as described in Equation 13.

Note that in the above notation we have a set of observations X for times t that we feed into the model ϵ_θ to predict a set of noise values ϵ(t), t ∈ t, whereas previous works predicted the noise for each data point independently.
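The claim that the L^T Σ^{-1} L term evaluates to the identity when Σ = LL^T, so that the Σ^{-1}-weighted quadratic form in the KL term reduces to a plain squared error on the unit-normal noise, can be checked numerically. The sketch below uses the OU covariance exp(−γ|t − u|) from Appendix A.4; the time points, the value of γ and the random stand-in for the model prediction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, size=6))               # irregular time points
gamma = 2.0                                               # illustrative OU rate
Sigma = np.exp(-gamma * np.abs(t[:, None] - t[None, :]))  # OU covariance matrix
L = np.linalg.cholesky(Sigma)                             # Sigma = L L^T

# The middle term L^T Sigma^{-1} L that appears after factoring L out of the loss.
middle = L.T @ np.linalg.solve(Sigma, L)

eps_true = rng.standard_normal(len(t))                    # unit-normal noise ε
eps_pred = rng.standard_normal(len(t))                    # stand-in for the model output
diff = L @ (eps_true - eps_pred)                          # difference of correlated noises
quadratic_form = diff @ np.linalg.solve(Sigma, diff)      # Sigma^{-1}-weighted distance
squared_error = np.sum((eps_true - eps_pred) ** 2)        # plain squared error
```

Both `middle` being the identity and `quadratic_form` matching `squared_error` rely on the factorization Σ = LL^T; with Σ = L^T L the cancellation would not hold in general.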

A.3 CONTINUOUS DIFFUSION TRANSITION PROBABILITY

Given the SDE in Equation 14, we want to compute the change in the variance Σ_s, where s denotes the diffusion time. The derivation is similar to that in Song et al. (2021). We start with Equation 5.51 from Särkkä & Solin (2019): where f is the drift, L is the SDE diffusion term and Q is the diffusion matrix. From here, the only difference to Song et al. (2021) is in the last term; they obtain β(s)I while we have a full covariance matrix from the stochastic process: β(s)Σ. Therefore, we only need to slightly modify the result: which gives us the covariance of the transition probability as in Equation 15. The derivation for the mean is unchanged as our drift term is the same as in Song et al. (2021).
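The modification can be made explicit with a short worked calculation. This sketch assumes that Equation 14 is the variance-preserving SDE of Song et al. (2021) driven by noise with covariance Σ across the observation times, i.e. drift f(X, s) = −½ β(s) X and L Q L^T = β(s) Σ, and that the process starts from a fixed X_0 (so Σ_0 = 0). Here L denotes the SDE diffusion term, not the Cholesky factor of Σ.

```latex
\begin{align*}
\frac{d\Sigma_s}{ds}
 &= \mathbb{E}\big[f(X_s,s)\,(X_s-m_s)^\top\big]
  + \mathbb{E}\big[(X_s-m_s)\,f(X_s,s)^\top\big]
  + \mathbb{E}\big[L\,Q\,L^\top\big] \\
 &= -\beta(s)\,\Sigma_s + \beta(s)\,\Sigma
  = \beta(s)\,\big(\Sigma - \Sigma_s\big), \\
\Sigma_s
 &= \Sigma + e^{-\int_0^s \beta(r)\,dr}\,\big(\Sigma_0 - \Sigma\big)
  = \Big(1 - e^{-\int_0^s \beta(r)\,dr}\Big)\,\Sigma
  \quad \text{for } \Sigma_0 = 0 .
\end{align*}
```

Setting Σ = I recovers the covariance of Song et al. (2021), which is the sense in which only the last term changes.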

A.4 SAMPLING FROM AN ORNSTEIN-UHLENBECK PROCESS

In the following, we discuss three different approaches to sampling noise ϵ(·) from an OU process defined by γ at time points t_0, ..., t_{M-1}.

1. Modified Wiener. As we already mentioned in Section 3.1, we can use a time-changed and scaled Wiener process: e^{−γt} W_{e^{2γt}}. Sampling from a Wiener process is straightforward: given a set of time increments ∆t_0, ..., ∆t_{M-1}, we sample M points independently from N(0, ∆t_i) and cumulatively sum all the samples. The time-changed process first needs to reparameterize the time values. The issue arises when applying the exponential for large t, which leads to numerical instability. This can be mitigated by re-scaling t.
2. Discretized SDE. A numerically stable approach involves solving the OU SDE in fixed steps. The point at t = 0, ϵ(0), is sampled from a unit Gaussian. After that, each point is obtained from the previous one, i.e., the i-th point is calculated as ϵ(t_i) = c ϵ(t_{i-1}) + √(1 − c²) z, where c = exp(−γ(t_i − t_{i-1})) and z ∼ N(0, 1). This is an iterative procedure but is quite fast and stable.
3. Multivariate normal. Finally, we can treat the process as a multivariate normal distribution with mean zero and covariance cov(t, u) = exp(−γ|t − u|). Given a set of time points t, it is easy to obtain the covariance matrix Σ and its factorization LL^T.
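The three approaches above can be sketched side by side. This is a minimal illustration with hypothetical function names; it assumes the time points are sorted in increasing order and that the process is stationary with unit variance, so the first point is drawn from N(0, 1).

```python
import numpy as np


def ou_covariance(t, gamma):
    """Covariance cov(t, u) = exp(-gamma * |t - u|) evaluated on all pairs of times."""
    return np.exp(-gamma * np.abs(t[:, None] - t[None, :]))


def ou_time_changed(t, gamma, rng):
    """Approach 1: e^{-gamma t} W(e^{2 gamma t}); the exponential overflows for large gamma * t."""
    s = np.exp(2.0 * gamma * t)                       # reparameterized times
    increments = np.diff(s, prepend=0.0)              # Wiener increments start from W(0) = 0
    wiener = np.cumsum(np.sqrt(increments) * rng.standard_normal(len(t)))
    return np.exp(-gamma * t) * wiener


def ou_discretized(t, gamma, rng):
    """Approach 2: step-by-step recursion eps_i = c * eps_{i-1} + sqrt(1 - c^2) * z."""
    eps = np.empty(len(t))
    eps[0] = rng.standard_normal()
    for i in range(1, len(t)):
        c = np.exp(-gamma * (t[i] - t[i - 1]))
        eps[i] = c * eps[i - 1] + np.sqrt(1.0 - c**2) * rng.standard_normal()
    return eps


def ou_multivariate(t, gamma, rng):
    """Approach 3: zero-mean multivariate normal via the Cholesky factor, Sigma = L L^T."""
    chol = np.linalg.cholesky(ou_covariance(t, gamma))
    return chol @ rng.standard_normal(len(t))


rng = np.random.default_rng(0)
t = np.array([0.0, 0.1, 0.4, 0.5, 1.3])               # irregular time points
paths = {f.__name__: f(t, 1.5, rng) for f in (ou_time_changed, ou_discretized, ou_multivariate)}
```

All three samplers target the same covariance exp(−γ|t − u|); they differ only in cost and numerical behaviour, as discussed in the list.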

B.2.2 MODEL

The denoising model takes in X^A (observed points) as a conditioning variable and X^B_n (target points) as the noisy input. We first apply a learnable RBF kernel k(t^A, t^B) to obtain a similarity matrix K between the observed and unobserved time points. We project X^A with a neural network by transforming each point independently to obtain Z, and then obtain a latent variable with the same time dimension as X^B by multiplying K and Z. We then use this latent variable as a conditioning vector, add it to the projected X^B, transform the sum with a multilayer network, and obtain the output.
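The data flow of this denoising model can be sketched with plain NumPy. This is an untrained illustration under several assumptions: the layer sizes, the tanh nonlinearity, the fixed kernel length scale (learnable in the actual model) and all function names are hypothetical, and the dependence on the diffusion step n is omitted for brevity.

```python
import numpy as np


def rbf_kernel(t_a, t_b, length_scale):
    """Similarity K between target times t_b [B] and observed times t_a [A]; shape [B, A]."""
    return np.exp(-(t_b[:, None] - t_a[None, :]) ** 2 / (2.0 * length_scale**2))


def init_mlp(rng, sizes):
    """Random weights for a small multilayer network (illustrative initialization)."""
    return [(0.3 * rng.standard_normal((m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(x, layers):
    """Applies the network to each row (time point) independently."""
    for w, b in layers[:-1]:
        x = np.tanh(x @ w + b)
    w, b = layers[-1]
    return x @ w + b


def denoise(x_a, t_a, x_b_noisy, t_b, params):
    k = rbf_kernel(t_a, t_b, params["length_scale"])   # [B, A] similarity matrix K
    z = mlp(x_a, params["proj_a"])                     # [A, H] pointwise projection Z
    context = k @ z                                    # [B, H] latent on the target times
    h = mlp(x_b_noisy, params["proj_b"]) + context     # add conditioning to projected X^B
    return mlp(h, params["out"])                       # [B, d] predicted noise


rng = np.random.default_rng(0)
d, hidden = 2, 16
params = {
    "length_scale": 0.2,
    "proj_a": init_mlp(rng, [d, hidden, hidden]),
    "proj_b": init_mlp(rng, [d, hidden, hidden]),
    "out": init_mlp(rng, [hidden, hidden, d]),
}
t_a, t_b = np.sort(rng.uniform(size=7)), np.sort(rng.uniform(size=4))
x_a, x_b_noisy = rng.standard_normal((7, d)), rng.standard_normal((4, d))
eps_hat = denoise(x_a, t_a, x_b_noisy, t_b, params)
```

Because K has one row per target time, the product of K and Z aligns the conditioning information with X^B regardless of how many points were observed.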

B.2.3 ADDITIONAL RESULTS

We test the hypothesis that using a stochastic process with similar properties to the data will lead to better performance. The difference from the neural process setup in Section 6 is that we fix the synthetic GP to always have σ = 0.05. As can be seen from Figure 6, the marginal distribution will be equal regardless of which process and which kernel parameter we use. On the other hand, when we look at the path probability p(X), we notice better results when the noise process matches the data properties (as was also shown in Tables 1 and 6). That means that, while our model can reverse the process well, the qualitative properties of the sampled curves will be different. In particular, the curves will be rougher with increasing γ in OU and smoother with increasing σ in GP.

B.3 CSDI IMPUTATION

The imputation experiment presented in Sections 4.3 and 6 uses the original CSDI model (Tashiro et al., 2021) and only changes the noise to include the stochastic process source. In this case, the time points at which we evaluate the stochastic process are regular, which does not reflect the true nature of the Physionet dataset. Here, we change the setup such that the measurements keep the actual time that has passed instead of rounding to the nearest hour. This is still in favour of the original paper as it only takes one measurement per hour and discards others if they are present. The model from Tashiro et al. (2021) remains the same and we replace the independent normal noise with the GP noise with σ ∈ {0.005, 0.01, 0.02}.

We run each experimental setup 10 times with different data maskings (see Tashiro et al. (2021) for more details) and report the results in Table 7. We perform the one-sided Wilcoxon signed-rank test (Conover, 1999) and reject the null hypothesis that the expected RMSE values are the same when p < 0.05. As we can see, higher values of σ produce better results, which makes sense since σ = 0.005 is, informally, closer to independent Gaussian sampling than σ = 0.02, which has stronger temporal dependency between the samples. We suspect the 10%-missing case does not produce significant results due to noise. Using higher σ does not further improve the results.

Columns correspond to different values of the kernel parameter σ. The first row shows samples from the GP prior. As we can see, the higher the value of σ, the smoother the process is. This is also reflected in the samples from the model. (Bottom) Same but for the Ornstein-Uhlenbeck process; however, increasing the kernel parameter γ now decreases the smoothness. All of the models perfectly capture the marginal distribution.
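For 10 paired runs, the one-sided Wilcoxon signed-rank test used in the imputation comparison can be computed exactly by enumerating all sign patterns. The sketch below assumes no zero differences and no tied absolute differences; the function name is hypothetical, and the RMSE values are synthetic placeholders chosen for illustration, not results from the paper.

```python
from itertools import product


def wilcoxon_one_sided(baseline, candidate):
    """Exact p-value for the alternative that candidate errors are smaller than baseline errors.

    Assumes no zero differences and no ties among the absolute differences.
    """
    diffs = [c - b for b, c in zip(baseline, candidate)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # Sum of the ranks of positive differences; small values favour the candidate.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    n = len(diffs)
    at_least_as_extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) <= w_plus
    )
    return w_plus, at_least_as_extreme / 2**n


# Synthetic RMSE values for 10 maskings (placeholders, not measured results).
rmse_independent = [0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62, 0.64, 0.66, 0.68]
rmse_gp_noise = [0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.685]
w_plus, p_value = wilcoxon_one_sided(rmse_independent, rmse_gp_noise)
reject_null = p_value < 0.05
```

Enumerating all 2^10 sign patterns is cheap at this sample size and avoids the normal approximation, which is unreliable for so few pairs.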

