MODELING TEMPORAL DATA AS CONTINUOUS FUNCTIONS WITH PROCESS DIFFUSION

Anonymous authors
Paper under double-blind review

Abstract

Temporal data such as time series are often observed at irregular intervals, which is a challenging setting for existing machine learning methods. To tackle this problem, we view such data as samples from some underlying continuous function. We then define a diffusion-based generative model that adds noise from a predefined stochastic process while preserving the continuity of the resulting underlying function. A neural network is trained to reverse this process, which allows us to sample new realizations from the learned distribution. We define suitable stochastic processes as noise sources and introduce novel denoising and score-matching models on processes. Further, we show how to apply this approach to multivariate probabilistic forecasting and imputation tasks. Through extensive experiments, we demonstrate that our method outperforms previous models on synthetic and real-world datasets.

1. INTRODUCTION

Time series data is collected from measurements of some real-world system that evolves via complex, unknown dynamics, and the sampling rate is often arbitrary and non-constant. Thus, the assumption that a time series follows some underlying continuous function is reasonable; consider, e.g., the temperature or load of a system over time. Although the values are observed as separate events, we know the temperature always exists and its evolution over time is smooth, not jittery. The continuity assumption remains valid when the intervals between the measurements vary. This kind of data can be found in many domains, from medical and industrial to financial applications. Different approaches to model irregular data have been proposed, including neural (ordinary or stochastic) differential equations (Chen et al., 2018; Li et al., 2020), neural processes (Garnelo et al., 2018), normalizing flows (Deng et al., 2020), etc. As it turns out, capturing the true generative process proves to be difficult, especially with the inherent stochasticity of the data. We propose an alternative method: a generative model for continuous data based on the diffusion framework (Ho et al., 2020), which simply adds noise to a data point until it contains no information about the original input. At the same time, the generative part of these models learns to reverse this process so that we can sample new realizations once training is completed. In this paper, we apply these ideas to the time series setting and address the unique challenges that arise.

Contributions. Contrary to previous works on diffusion, we model continuous functions, not vectors (Fig. 1). To do so, we first define a suitable noising process that preserves continuity. Next, we derive the transition probabilities to perform the noising and specify the evidence bound on the likelihood, as well as the new sampling procedure. Finally, we propose new models that take in the noisy input and produce the denoised output or, alternatively, the value of the score function.
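To make the idea of continuity-preserving noise concrete, one simple option is to draw correlated Gaussian noise from a stationary Ornstein–Uhlenbeck process evaluated at the (possibly irregular) observation times, so that the noise itself is a continuous function of time. The sketch below is purely illustrative and assumes this particular kernel choice; it is not necessarily one of the processes defined by the method itself.

```python
import numpy as np

def ou_noise(t, theta=1.0, sigma=1.0, rng=None):
    """Sample a stationary Ornstein-Uhlenbeck process at (possibly irregular)
    time points t.  Unlike i.i.d. Gaussian noise, nearby values are
    correlated, so the sampled noise varies continuously over time.
    (Illustrative kernel choice, not the paper's exact noising process.)"""
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(t, dtype=float)
    # Stationary OU covariance: k(t_i, t_j) = sigma^2/(2*theta) * exp(-theta*|t_i - t_j|)
    cov = sigma**2 / (2 * theta) * np.exp(-theta * np.abs(t[:, None] - t[None, :]))
    return rng.multivariate_normal(np.zeros(len(t)), cov)

ts = np.sort(np.random.default_rng(0).random(50))  # irregular observation times
eps = ou_noise(ts)                                 # one correlated noise draw, shape (50,)
```

Because the covariance depends only on time gaps, the same construction works for any sampling pattern, which is exactly what the irregular-interval setting requires.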

2. BACKGROUND

Given training data $\{x\}$ where each $x \in \mathbb{R}^d$, the goal of generative modeling is to learn the probability density function $p(x)$ and to generate new samples from this learned distribution. Diffusion models (Ho et al., 2020; Song et al., 2021) achieve both of these goals by learning to reverse some fixed process that adds noise to the data. In the following, we present a brief overview of the two ways to define diffusion: in Section 2.1 the noise is added across $N$ increasing scales, which is then taken to the limit in Section 2.2 by defining the diffusion using a stochastic differential equation (SDE).

2.1. FIXED-STEP DIFFUSION

Sohl-Dickstein et al. (2015) and Ho et al. (2020) propose the denoising diffusion probabilistic model (DDPM), which gradually adds fixed Gaussian noise to the observed data point $x_0$ via known scales $\beta_n$ to define a sequence of progressively noisier values $x_1, x_2, \dots, x_N$. The final noisy output $x_N \sim \mathcal{N}(0, I)$ carries no information about the original data point, and thus the sequence of positive noise (variance) scales $\beta_1, \dots, \beta_N$ has to be increasing, such that the first noisy output $x_1$ is close to the original data $x_0$ while the final value $x_N$ is pure noise. The goal is then to learn to reverse this process. As diffusion forms a Markov chain, the transition between any two consecutive points is defined by the conditional probability
$$q(x_n \mid x_{n-1}) = \mathcal{N}\big(\sqrt{1-\beta_n}\, x_{n-1},\, \beta_n I\big).$$
Since the transition kernel is Gaussian, the value at any step $n$ can be sampled directly from $x_0$. Given $\alpha_n = 1 - \beta_n$ and $\bar{\alpha}_n = \prod_{k=1}^{n} \alpha_k$, we can write
$$q(x_n \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_n}\, x_0,\, (1-\bar{\alpha}_n) I\big).$$
Further, the probability of any intermediate value $x_{n-1}$ given its successor $x_n$ and the initial $x_0$ is $q(x_{n-1} \mid x_n, x_0) = \mathcal{N}(\tilde{\mu}_n, \tilde{\beta}_n I)$, where
$$\tilde{\mu}_n = \frac{\sqrt{\bar{\alpha}_{n-1}}\, \beta_n}{1-\bar{\alpha}_n}\, x_0 + \frac{\sqrt{\alpha_n}\,(1-\bar{\alpha}_{n-1})}{1-\bar{\alpha}_n}\, x_n, \qquad \tilde{\beta}_n = \frac{1-\bar{\alpha}_{n-1}}{1-\bar{\alpha}_n}\, \beta_n.$$
The generative model learns the reverse process $p(x_{n-1} \mid x_n)$. Sohl-Dickstein et al. (2015) set $p(x_{n-1} \mid x_n) = \mathcal{N}(\mu_\theta(x_n, n), \beta_n I)$ and parameterize $\mu_\theta$ with a neural network.
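The closed-form transition $q(x_n \mid x_0)$ is what makes training efficient: any noise level can be reached in one step rather than by simulating the whole Markov chain. A minimal sketch, assuming a linear $\beta$ schedule (the schedule values and step count here are illustrative, in the spirit of Ho et al. (2020)):

```python
import numpy as np

# Illustrative linear noise schedule with N = 1000 steps.
N = 1000
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas                 # alpha_n = 1 - beta_n
alpha_bars = np.cumprod(alphas)      # abar_n = prod_{k<=n} alpha_k

def q_sample(x0, n, rng=None):
    """Draw x_n ~ q(x_n | x_0) = N(sqrt(abar_n) * x0, (1 - abar_n) * I)
    in a single step; n indexes the schedule (0 <= n < N).
    Returns the noisy sample and the noise that was added."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)
    x_n = np.sqrt(alpha_bars[n]) * x0 + np.sqrt(1.0 - alpha_bars[n]) * eps
    return x_n, eps
```

With this schedule, `alpha_bars[-1]` is vanishingly small, so the final sample is effectively pure Gaussian noise, as required.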
The training objective is to maximize the evidence lower bound:
$$\log p(x_0) \ge -\,\mathbb{E}_q\Big[ D_{KL}\big(q(x_N \mid x_0)\,\|\,p(x_N)\big) + \sum_{n>1} D_{KL}\big(q(x_{n-1} \mid x_n, x_0)\,\|\,p(x_{n-1} \mid x_n)\big) - \log p(x_0 \mid x_1) \Big]. \quad (4)$$
In practice, however, the approach by Ho et al. (2020) is to reparameterize $\mu_\theta$ and predict the noise $\epsilon$ that was added to $x_0$, using a neural network $\epsilon_\theta(x_n, n)$, and minimize the simplified loss function:
$$\mathcal{L}(x_0) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, n \sim \mathcal{U}(\{1, \dots, N\})} \big\| \epsilon_\theta\big(\sqrt{\bar{\alpha}_n}\, x_0 + \sqrt{1-\bar{\alpha}_n}\, \epsilon,\, n\big) - \epsilon \big\|_2^2. \quad (5)$$
To generate new data, the first step is to sample a point from the final distribution $x_N \sim \mathcal{N}(0, I)$ and then iteratively denoise it using the above model ($x_N \to x_{N-1} \to \dots \to x_0$) to get a sample from the data distribution. To summarize, the forward process adds the noise $\epsilon$ to the input $x_0$, at different scales, to produce $x_n$; the model learns to invert this, i.e., to predict the noise $\epsilon$ from $x_n$.
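The simplified objective can be sketched as a single Monte Carlo estimate: pick a random step, noise the input in closed form, and regress the network output onto the noise. The helper below is a minimal sketch; `eps_model` stands in for an arbitrary noise-prediction network.

```python
import numpy as np

def ddpm_loss(eps_model, x0, betas, rng=None):
    """One Monte Carlo sample of the simplified DDPM objective:
    || eps_theta(sqrt(abar_n) * x0 + sqrt(1 - abar_n) * eps, n) - eps ||^2,
    with n drawn uniformly from the N diffusion steps."""
    rng = np.random.default_rng() if rng is None else rng
    abar = np.cumprod(1.0 - betas)
    n = rng.integers(len(betas))          # array index n corresponds to step n+1
    eps = rng.standard_normal(x0.shape)   # the noise the model must recover
    x_n = np.sqrt(abar[n]) * x0 + np.sqrt(1.0 - abar[n]) * eps
    return np.sum((eps_model(x_n, n) - eps) ** 2)
```

In a real training loop this quantity would be averaged over a batch and backpropagated through `eps_model`; here it just illustrates the structure of the loss.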

2.2. SDE DIFFUSION

Instead of taking a finite number of diffusion steps as in Section 2.1, Song et al. (2021) introduce a continuous diffusion of vector-valued data, $x_0 \to x_s$, where $s \in [0, S]$ is now a continuous variable. The forward process can be elegantly defined with an SDE:
$$\mathrm{d}x_s = f(x_s, s)\,\mathrm{d}s + g(s)\,\mathrm{d}W_s, \quad (6)$$
where $W$ is a standard Wiener process. The variable $s$ is the continuous analogue of the discrete steps, implying that the input gets noisier during the SDE evolution. The final value $x_S \sim p(x_S)$ will follow some predefined distribution, as in Section 2.1. For the forward SDE in Equation 6 there exists a corresponding reverse SDE (Anderson, 1982):
$$\mathrm{d}x_s = \big[f(x_s, s) - g(s)^2\, \nabla_{x_s} \log p(x_s)\big]\,\mathrm{d}s + g(s)\,\mathrm{d}\bar{W}_s, \quad (7)$$
where $\bar{W}$ is a Wiener process in reverse time and $\nabla_{x_s} \log p(x_s)$ is the score function.
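Sampling from the reverse SDE is typically done with a numerical integrator such as Euler–Maruyama, stepping from $s = S$ down to $s = 0$ with a learned approximation of the score. A minimal sketch, assuming generic `f`, `g`, and `score` callables (not any specific parameterization from the paper):

```python
import numpy as np

def euler_maruyama_reverse(x_S, score, f, g, S=1.0, steps=500, rng=None):
    """Integrate the reverse SDE
        dx = [f(x, s) - g(s)^2 * score(x, s)] ds + g(s) dW
    backwards from s = S to s = 0 with Euler-Maruyama steps.
    `score` approximates grad_x log p_s(x); f and g define the forward SDE."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x_S, dtype=float)
    ds = S / steps
    for i in reversed(range(steps)):
        s = (i + 1) * ds
        drift = f(x, s) - g(s) ** 2 * score(x, s)
        # x_{s-ds} = x_s - drift * ds + g(s) * sqrt(ds) * z, z ~ N(0, I)
        x = x - drift * ds + g(s) * np.sqrt(ds) * rng.standard_normal(x.shape)
    return x
```

As a sanity check, with $f = 0$, $g = 1$ and data concentrated at the origin, the exact score is $-x/s$ and the integrator drives an arbitrary starting point back toward zero.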



Figure 1: (Left) We add noise from a stochastic process to the whole time series at once. The model ϵ θ learns to reverse this process. (Right) We can use this approach to, e.g., forecast with uncertainty.


