VARIATIONAL INFERENCE FOR DIFFUSION MODULATED COX PROCESSES

Anonymous

Abstract

This paper proposes a stochastic variational inference (SVI) method for computing an approximate posterior path measure of a Cox process. These processes are widely used in the natural and physical sciences, engineering, and operations research, and represent a non-trivial model of a wide array of phenomena. In our work, we model the stochastic intensity as the solution of a diffusion stochastic differential equation (SDE), and our objective is to infer the posterior, or smoothing, measure over the paths given Poisson process realizations. We first derive a system of stochastic partial differential equations (SPDE) for the pathwise smoothing posterior density function, a non-trivial result, since the standard solution of SPDEs typically involves an Itô stochastic integral, which is not defined pathwise. Next, we propose an SVI approach to approximating the solution of the system. We parametrize the class of approximate smoothing posteriors using a neural network, derive a lower bound on the evidence of the observed point process sample path, and optimize the lower bound using stochastic gradient descent (SGD). We demonstrate the efficacy of our method on both synthetic and real-world problems, and demonstrate the advantage of the neural network solution over standard numerical solvers.

1. INTRODUCTION

Cox processes (Cox, 1955; Cox & Isham, 1980), also known as doubly-stochastic Poisson processes, are a class of stochastic point processes wherein the point intensity is itself stochastic and, conditional on a realization of the intensity process, the number of points in any subset of space is Poisson distributed. These processes are widely used in the natural and physical sciences, engineering and operations research, and form useful models of a wide array of phenomena. We model the intensity by a diffusion process that is the solution of a stochastic differential equation (SDE). This is a standard assumption across a range of applications (Susemihl et al., 2011; Kutschireiter et al., 2020). The measure induced by the solution of the SDE serves as a prior measure over sample paths, and our objective is to infer a posterior measure over the paths of the underlying intensity process, given realizations of the Poisson point process observations over a fixed time horizon. This type of inference problem has been studied in the Bayesian filtering literature (Schuppen, 1977; Bain & Crisan, 2008; Särkkä, 2013), where it is of particular interest to infer the state of the intensity process at any past time given all count observations until the present time instant (the resulting posterior is called the smoothing posterior measure). In a seminal paper, Snyder (1972) derived a stochastic partial differential equation (SPDE) describing the dynamics of the corresponding posterior density for Cox processes. The solution of this smoothing SPDE requires the computation of an Itô stochastic integral with respect to the counting process. It has long been recognized (Clark, 1978; Davis, 1981; 1982) that for stochastic smoothing (and filtering) theory to be useful in practice, it should be possible to compute smoothing posteriors conditioned on a single observed sample path.
However, Itô integrals are not defined pathwise, and deriving a pathwise smoothing density is remarkably hard. Thirty years after Snyder's original work, Elliott & Malcolm (2005) derived a pathwise smoothing SPDE in the form of a coupled system of forward and backward pathwise SPDEs. Nonetheless, solving the system of pathwise SPDEs, or sampling from the corresponding SDE, is still challenging and intractable in general. It is well known, for example, that numerical techniques for solving these SPDEs, such as the finite element method (FEM), suffer from the curse of dimensionality (Han et al., 2018). Therefore, it is of considerable interest to find more efficient methods to solve the smoothing SPDE. We take a variational inference approach to computing an approximate smoothing posterior measure. Variational representations of Bayesian posteriors in stochastic filtering and smoothing theory have been developed in considerable generality; see Mitter & Newton (2003) for a rigorous treatment. A number of papers consider the computation of an approximate posterior distribution over the paths of an underlying intensity process that is observed with additive Gaussian noise (Archambeau et al., 2007; 2008; Cseke et al., 2013; Susemihl et al., 2011; Sutter et al., 2016). Susemihl et al. (2011) studied Bayesian filtering of Gaussian processes by deriving a differential equation characterizing the evolution of the mean-square error (MSE) in estimating the underlying Gaussian process. On the other hand, Sutter et al. (2016) compute a variational approximation to the smoothing posterior density when the underlying diffusion intensity is observed with additive Brownian noise. They choose their variational family to be a class of SDEs with an analytically computable marginal density. This setting is considerably different from ours, where the observed process is a point process. Nonetheless, Sutter et al.
(2016) provides methodological motivation for our current study. In the context of computing approximate smoothing/filtering posteriors for point process observations, Harel et al. (2015) developed an analytically tractable approximation to the filtering posterior distribution of a diffusion modulated marked point process under specific modeling assumptions suited to a neural encoding/decoding problem. In general, however, analytical tractability cannot be assured without restrictive assumptions. We present a stochastic variational inference (SVI) (Hoffman et al., 2013) method for computing a variational approximation to the smoothing posterior density. Our approach fixes the approximating family of path measures to those induced by a class of parametrized SPDEs. In particular, we parametrize the drift function of the approximating SPDEs by a neural network with input and output variables matching the theoretical smoothing SPDE. Thereafter, using standard stochastic analysis tools, we compute a tractable lower bound on the evidence of observing a sample path of count observations, the so-called evidence lower bound (ELBO). A sample average approximation (SAA) to the ELBO is further computed by simulating sample paths from the stochastic differential equation (SDE) corresponding to the approximating SPDE. Finally, by maximizing the ELBO, the neural network is trained using stochastic gradient descent (SGD) utilizing multiple batches of sample paths of count observations. Note that each sample path of the count observations entails the simulation of a separate SDE. We note that there are many problems in the natural and physical sciences, engineering, and operations research where multiple paths of a point process (over a finite time horizon) may be obtained.
For instance, in Section 5 we present an example modeling the demand for bikes rented during a 24-hour period in a bike-sharing platform, where the underlying driving intensity is subject to stochastic variations and demand information is collected over multiple days. In contrast to the variational algorithm developed in Sutter et al. (2016), where the variational lower bound must be re-optimized for each new sample path of the observation process, our variational method is more general, and our approximation to the smoothing posterior can be used as a map for another (unobserved) sample path of count observations. Our computational approach can also be straightforwardly adapted to solve the problem of interest in Sutter et al. (2016). In the subsequent sections, we describe our problem and method in detail and demonstrate the utility of our method with the help of numerical experiments. In particular, we show how the choice of approximating family enables us to use the trained neural network, and in turn the variational Bayesian smoothing posterior (VBSP), to compute the smoothing density in almost three-fourths of the computational time required to compute the original smoothing density using FEM. Moreover, we also efficiently generate Monte Carlo samples from the learned VBSP and use them for inference on the bike-sharing dataset, whereas FEM failed to compute either the VBSP or the true smoothing density for the given time-space discretization.

2. PROBLEM DESCRIPTION

Let $N_t$ be a Cox process with unknown stochastic intensity $\{z_t \in \mathbb{R}_+, t \in [0,T]\}$. We use $N_{t',t}$ to denote a sample path realization of $N_t$ restricted to the interval $[t', t]$, and use $N_t$ to denote $N_t - N_0$; recall that $N_0 = 0$ by definition. As noted before, a Cox process conditioned on the intensity is a Poisson process. Therefore, given a realized sample path $\{z_t, t \in [0,T]\}$ of the intensity, and for any $0 \le t' < t \le T$, the marginal likelihood of observing $N_t - N_{t'} = n \in \mathbb{N}$ counts in $(t', t]$ is

$$\mathcal{L}\big(N_t - N_{t'} = n \mid \{z_s\}_{t' < s \le t}\big) := \frac{\left(\int_{t'}^{t} z_s \, ds\right)^{n} e^{-\int_{t'}^{t} z_s \, ds}}{n!}, \qquad (1)$$

where $\mathcal{L}$ denotes the Poisson likelihood. Rather than directly modeling the intensity $z$, we bring a little more flexibility to our setting and assume that $z_t$ is a deterministic transformation of another stochastic process $x_t$ through a known mapping $h: \mathbb{R}^d \to \mathbb{R}_+$, that is, $z_t = h(x_t)$. Note that the non-negative range of $h$ ensures that the Poisson intensity $z_t = h(x_t)$ is non-negative; unless $x_t \in \mathbb{R}_+$, the mapping $h$ cannot be an identity function. We use the term intensity process to refer to either $z_t$ or $x_t$. We model the intensity process $\{x_t \in \mathbb{R}^d, t \in [0,T]\}$ with the following SDE,

$$dx_t = b(x_t)\,dt + \sigma(x_t)\,dB_t, \quad \forall t \le T, \quad x_0 = 0, \qquad (2)$$

where $b: \mathbb{R}^d \to \mathbb{R}^d$ is the drift function, $\sigma: \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is the diffusion coefficient, and $B_t$ is a $d$-dimensional Brownian motion (or Wiener process). We assume that there exists a strong solution to the SDE above (Oksendal, 2013, Chapter 5). Moreover, we assume that $b(\cdot)$, $h(\cdot)$, and $\sigma(\cdot)$ are fixed by the modeler a priori, and we are interested in inferring the unknown intensity process with these fixed definitions; learning these functions from data is possible, but incorporating it would obscure our main contribution, and we leave it for future work. The model of the count observations above forms a diffusion modulated Cox process.
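The generative model above can be simulated directly: discretize the SDE with Euler-Maruyama, pass the state through $h$, and draw conditionally independent Poisson increments. The following is a minimal sketch of this forward model; the function name `simulate_cox` and the discretization choices are ours, not the paper's.

```python
import numpy as np

def simulate_cox(b, sigma, h, T=2.0, K=200, x0=0.0, rng=None):
    """Forward-simulate a 1-D diffusion modulated Cox process (our sketch):
    Euler-Maruyama for dx = b(x) dt + sigma(x) dB, intensity z = h(x), and
    conditionally Poisson count increments with means z_{t_i} * dt."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / K
    x = np.empty(K + 1)
    x[0] = x0
    for i in range(K):
        x[i + 1] = x[i] + b(x[i]) * dt + sigma(x[i]) * np.sqrt(dt) * rng.standard_normal()
    z = h(x[:-1])                 # intensity on each sub-interval
    counts = rng.poisson(z * dt)  # increments N_{t_i, t_{i+1}}
    return x, counts

# Example with the choices used later in Section 5.1:
# b(x) = -x, sigma(x) = 1, h(a) = 5*exp(-0.08*(a - 5)^2).
x, counts = simulate_cox(lambda x: -x, lambda x: 1.0,
                         lambda a: 5.0 * np.exp(-0.08 * (a - 5.0) ** 2))
```

Drawing Poisson increments with mean $h(x_{t_i})\Delta$ is the discrete-time analog of equation 1 applied to each sub-interval.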
Diffusion modulated Cox processes are widely used to model the arrival process in various service systems such as call centers, hospitals, and airports (Zhang et al., 2014; Wang et al., 2020). Zhang & Kou (2010) use a Gaussian process modulated Cox process to infer proteins' conformations; in particular, they model the arrival rates of the photons collected from a laser-excited protein molecule as a Gaussian process. Schnoerr et al. (2016) model spatio-temporal stochastic systems from systems biology and epidemiology using Cox processes whose intensity is modeled with diffusions. As stated in the introduction, we seek to infer the smoothing posterior measure over the unknown intensity process $\{x_t, t \in [0,T]\}$ using the count observations up to time $T$. Following terminology from Bayesian filtering theory (Särkkä, 2013), we use smoothing to refer to inferring the unobserved intensity process at any past time given the observations up to the current time. Mathematically, the smoothing posterior is defined using conditional expectations of the form $\mathbb{E}[f(x_t) \mid \sigma(N_u, u \in [0,T])]$, where $\sigma(N_u, u \in [0,T])$ is the smallest sigma-algebra generated by the Cox process $\{N_t\}$ from time 0 to $T$. For brevity, we write $\mathbb{E}[f(x_t) \mid \sigma(N_u, u \in [0,T])]$ as $\mathbb{E}[f(x_t) \mid N_{0,T}]$. Interested readers may refer to Kutschireiter et al. (2020) for more details on non-linear filtering theory. We now provide a formal derivation of the smoothing posterior using Bayes' theorem (Bain & Crisan, 2008; Elliott & Malcolm, 2005). Observe that the conditional expectation satisfies

$$\mathbb{E}[f(x_t) \mid N_{0,T}] = \frac{\mathbb{E}^{\dagger}[\Lambda_{0,T}\, f(x_t) \mid N_{0,T}]}{\mathbb{E}^{\dagger}[\Lambda_{0,T} \mid N_{0,T}]} \qquad (3)$$

for any measurable function $f(\cdot)$, where $\Lambda_{s,t} := \mathcal{L}(N_{s,t}) / \mathcal{L}^{\dagger}(N_{s,t})$ for any $0 \le s < t \le T$, $\mathcal{L}^{\dagger}$ is the unit intensity Poisson likelihood, and $\mathbb{E}^{\dagger}[\cdot]$ denotes expectation with respect to $\mathcal{L}^{\dagger}$. Note that $\mathcal{L}^{\dagger}$ does not depend on the stochastic intensity process $x$ and forms a reference measure.
The marginal smoothing posterior density is defined as $p_t(x \mid N_{0,T})\,dx := \mathbb{P}(x_t \in dx \mid N_{0,T})$, which can be formally obtained from equation 3 by setting $f(x_t) = \mathbb{I}_{\{A\}}(x_t)$ for any measurable $A \subseteq \mathbb{R}^d$, where $\mathbb{I}_{\{A\}}(y)$ is the indicator function that equals 1 when $y \in A$ and 0 otherwise. Now, define the unnormalized filtering density function $\bar q_t(x)$ as the function satisfying

$$\mathbb{P}(x_t \in dx \mid N_{0,t}) = \frac{\bar q_t(x)\,dx}{\int_{\mathbb{R}^d} \bar q_t(\xi)\,d\xi}, \qquad (4)$$

and also define $\bar v_t(x) := \mathbb{E}^{\dagger}[\Lambda_{t,T} \mid N_{0,T}]$. Then, it can be shown (Elliott & Malcolm, 2005) that for any measurable function $f$,

$$\mathbb{E}[f(x_t) \mid N_{0,T}] = \frac{\mathbb{E}^{\dagger}[\Lambda_{0,T}\, f(x_t) \mid N_{0,T}]}{\mathbb{E}^{\dagger}[\Lambda_{0,T} \mid N_{0,T}]} = \frac{\int_{\mathbb{R}^d} f(\xi)\,\bar q_t(\xi)\,\bar v_t(\xi)\,d\xi}{\int_{\mathbb{R}^d} \bar q_t(\xi)\,\bar v_t(\xi)\,d\xi}. \qquad (6)$$

Next, recalling that $h(\cdot)$ is the mapping that ensures the intensity process is positive, define the function $\Psi_t$ for a given sample path of count observations (i.e., pathwise) as

$$\Psi_t := \Psi(h(x), t, N_t) = \exp\big[(1 - h(x))\,t + N_t \log h(x)\big], \quad \forall x \in \mathbb{R}^d.$$

Following Elliott & Malcolm (2005, Theorem 4), one may use $\Psi_t$ to derive a coupled system of pathwise SPDEs that characterize $\bar q_t(x)$ and $\bar v_t(x)$. In particular, they show that $q_t = \Psi_t^{-1} \bar q_t$ is a solution of the following SPDE,

$$\partial_t q_t(x) = \Psi_t^{-1} \mathcal{L}^*[\Psi_t\, q_t(x)], \quad \forall t \le T, \qquad q_0(x) = \delta_{x_0}(x), \qquad (7)$$

where $\mathcal{L}^*$ is the adjoint of

$$\mathcal{L}[F(x)] = \frac{1}{2}\sum_{i,j} a_{i,j}(x)\,\partial_{x_i x_j} F(x) + \sum_i b_i(x)\,\partial_{x_i} F(x),$$

the infinitesimal generator of the prior process acting on any twice-differentiable, continuous, and bounded function $F: \mathbb{R}^d \to \mathbb{R}$; here $a(x) = \sigma(x)\sigma(x)^T$, and $\delta_{x_0}(x)$ is the Dirac delta distribution at $x_0$. Moreover, they also show that $v_t(x) = \Psi_t \bar v_t(x)$ satisfies the following backward parabolic equation,

$$\partial_t v_t(x) = -\Psi_t\, \mathcal{L}[\Psi_t^{-1} v_t(x)], \qquad v_T(x) = \Psi_T(x). \qquad (8)$$

It now follows from equation 6 that, using the solution of these two SPDEs, the marginal smoothing posterior density for any $t \in [0,T]$ satisfies

$$p_t(x \mid N_{0,T}) = \frac{q_t(x)\,v_t(x)}{\int_{\mathbb{R}^d} q_t(\xi)\,v_t(\xi)\,d\xi}. \qquad (9)$$
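The pathwise function $\Psi_t$ above is elementary to evaluate for a given count path. A small helper, working in log space to avoid overflow when $N_t$ or $t$ is large, might look as follows; the name `log_Psi` is ours.

```python
import numpy as np

def log_Psi(h_x, t, N_t):
    """log Psi_t = (1 - h(x)) * t + N_t * log h(x), evaluated pathwise for a
    given cumulative count N_t. Working in log space avoids overflow when
    N_t or t is large; exponentiate only when Psi_t itself is needed."""
    return (1.0 - h_x) * t + N_t * np.log(h_x)

# e.g. h(x) = 2 at t = 1 with N_t = 3 gives log Psi = -1 + 3*log(2).
val = log_Psi(2.0, 1.0, 3)
```

Since $\Psi_t$ enters the SPDEs only through $\Psi_t^{-1}(\cdot)\Psi_t$ conjugations and $\log$-gradients, the log-space form is usually the convenient one numerically.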
Using the SPDEs in equations 7 and 8, together with equation 9, it can be shown that the marginal smoothing posterior density $p_t(x \mid N_{0,T})$ satisfies its own SPDE: for any $t \in [0,T]$,

$$\partial_t p_t(x \mid N_{0,T}) = -\sum_i \partial_{x_i}\Big\{\Big[\big(a(x)\,\nabla \log(\Psi_t^{-1} v_t(x))\big)_i + b_i(x)\Big]\, p_t(x \mid N_{0,T})\Big\} + \frac{1}{2}\sum_{i,j} \partial_{x_i x_j}\big[a_{i,j}(x)\, p_t(x \mid N_{0,T})\big], \qquad (10)$$

with $p_0(x \mid N_{0,T}) = \delta_{x_0}(x)$ and $x_0 = 0$. We present a detailed derivation in Appendix A.1. Corresponding to this SPDE, there exists a smoothing posterior SDE, defined as

$$d\tilde x_t = \Big[a(\tilde x_t)\,\nabla \log\big(\Psi_t^{-1} v_t(\tilde x_t)\big) + b(\tilde x_t)\Big]\,dt + \sigma(\tilde x_t)\,d\tilde B_t, \qquad \tilde x_0 = 0, \qquad (11)$$

where $\{\tilde x_t\}$ is a modification of the process $\{x_t\}$ such that $\tilde B_t$ is independent of the Cox process $N_t$ (and thus of $B_t$). Observe that the entire sample path of the count observations $N_{0,T}$ enters the drift term of this SDE through the pathwise functions $\Psi_t$ and $v_t$. Also note that the diffusion coefficient of the smoothing posterior SDE is precisely the same as that of the prior SDE. The computation of the drift term in the smoothing posterior SDE requires solving equation 8 for $v_t(x)$, which makes the posterior computation challenging and computationally intractable in general; consequently, so is the computation of the marginal posterior density (and hence of the path measure). Therefore, we propose a variational inference-based method to compute an approximation to the solution of the smoothing posterior SPDE, by computing an approximate solution to the smoothing posterior SDE in equation 11.

3. VARIATIONAL BAYES FOR APPROXIMATING THE SMOOTHING DENSITY

Observe that the posterior path measure is the maximizer of the following variational optimization problem (Mitter & Newton, 2003, Proposition 2.1):

$$\max_{\Pi(x_{0,T}) \in \mathcal{P}(C)} \; -\mathrm{KL}(\Pi \,\|\, \Pi_0) + \int d\Pi(x_{0,T}) \log \mathcal{L}\big(N_{0,T} \mid h(x_{0,T})\big), \qquad (12)$$

where $\mathcal{P}(C)$ is the space of all measures absolutely continuous with respect to $\Pi_0$, the measure induced by a solution of the intensity SDE (equation 2) on the space $C[0,T]$ of continuous functions with support $[0,T]$, and KL denotes the Kullback-Leibler divergence between two absolutely continuous measures. Note that $\mathcal{P}(C)$ also contains $\Pi_0$. Solving this optimization problem over all measures in $\mathcal{P}(C)$ is intractable. Therefore, we choose the subset of absolutely continuous measures $\mathcal{Q}_{\tilde b} \subset \mathcal{P}(C)$ induced by solutions of the following SDE:

$$dx_t = \tilde b(x_t, N_t, t)\,dt + \sigma(x_t)\,dB_t, \quad t \le T, \quad x_0 = 0, \qquad (13)$$

where $\tilde b(\cdot,\cdot,\cdot): \mathbb{R}^d \times \mathbb{N} \times [0,T] \to \mathbb{R}^d$ is the drift function. We term this space of measures $\mathcal{Q}_{\tilde b}$ the variational family. The measures in $\mathcal{Q}_{\tilde b}$ are absolutely continuous with respect to $\Pi_0$ because they have the same diffusion coefficient; therefore $\mathcal{Q}_{\tilde b} \subset \mathcal{P}(C)$. This choice of the variational family is not arbitrary, but rather motivated by the smoothing posterior SDE derived in equation 11, where the diffusion coefficient is $\sigma(\cdot)$ and the drift coefficient has an intractable form (depending on the prior drift $b(\cdot)$, the diffusion coefficient $\sigma(\cdot)$, and $N_t$ through $\Psi_t$ and $v_t(\cdot)$). Notice that the choice of drift function spans the space of measures in the variational family $\mathcal{Q}_{\tilde b}$. Since $\mathcal{Q}_{\tilde b} \subset \mathcal{P}(C)$, it follows from equation 12 that

$$\max_{\Pi(x_{0,T}) \in \mathcal{P}(C)} -\mathrm{KL}(\Pi \| \Pi_0) + \int d\Pi(x_{0,T}) \log \mathcal{L}\big(N_{0,T} \mid h(x_{0,T})\big) \;\ge\; \max_{Q \in \mathcal{Q}_{\tilde b}} -\mathrm{KL}(Q \| \Pi_0) + \int dQ(x_{0,T}) \log \mathcal{L}\big(N_{0,T} \mid h(x_{0,T})\big). \qquad (14)$$

The right-hand side above is known as the evidence lower bound (ELBO). The corresponding ELBO maximization problem to compute the optimal $Q \in \mathcal{Q}_{\tilde b}$ (for a given sample path $N_{0,T}$) is simply

$$Q^*(\cdot \mid N_{0,T}) = \arg\max_{Q \in \mathcal{Q}_{\tilde b}} \; \mathbb{E}_Q\{\log \mathcal{L}(N_{0,T} \mid x_{0,T})\} - \mathrm{KL}(Q \| \Pi_0). \qquad (15)$$
Note that absolutely continuous measures on path space correspond to changes in the drift function, for a fixed diffusion coefficient (otherwise, the measures are singular). As a consequence of Girsanov's theorem (Oksendal, 2013, Theorem 8.6.8) (see Appendix A.2 for the proof), we have

$$\mathrm{KL}(Q \| \Pi_0) = \frac{1}{2}\,\mathbb{E}_Q\left[\int_0^T \big\|\sigma^{-1}(x_t)\big(b(x_t) - \tilde b(x_t, N_t, t)\big)\big\|^2\,dt\right], \qquad (16)$$

where, recall, $b(\cdot)$ is the drift of the prior SDE defined in equation 2 and $\tilde b(\cdot,\cdot,\cdot)$ is the drift of the variational SDE. Substituting this into equation 15 yields

$$Q^*(\cdot \mid N_{0,T}) = \arg\max_{Q \in \mathcal{Q}_{\tilde b}} \; \mathbb{E}_Q\left[\log \mathcal{L}(N_{0,T} \mid x_{0,T}) - \frac{1}{2}\int_0^T \big\|\sigma^{-1}(x_t)\big(b(x_t) - \tilde b(x_t, N_t, t)\big)\big\|^2\,dt\right]. \qquad (17)$$

We call $Q^*(\cdot \mid N_{0,T})$ the variational Bayesian smoothing posterior (VBSP) path measure. Next, we lay out the details of the SVI algorithm that solves the above optimization problem to compute the VBSP.
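The Girsanov KL term in equation 16 is straightforward to estimate by Monte Carlo once sample paths from the variational SDE are available: average the squared, noise-rescaled drift gap over paths and apply a Riemann sum in time. The sketch below, with our helper name `kl_estimate` and a scalar diffusion for simplicity, illustrates the discretization (it is not the paper's code).

```python
import numpy as np

def kl_estimate(paths, counts, t_grid, b_prior, b_var, sigma):
    """Monte Carlo estimate of the Girsanov KL term over a uniform grid:
    0.5 * E_Q int_0^T || sigma(x)^{-1} (b(x) - b_tilde(x, N_t, t)) ||^2 dt,
    with the expectation replaced by an average over `paths` (an M x K array
    of samples from the variational SDE) and the integral by a Riemann sum."""
    dt = t_grid[1] - t_grid[0]
    N = np.cumsum(counts)  # cumulative counts N_{t_i} along the grid
    gap = (b_prior(paths) - b_var(paths, N[None, :], t_grid[None, :])) / sigma(paths)
    return 0.5 * np.mean(np.sum(gap ** 2 * dt, axis=1))

# Sanity check: if the variational drift equals the prior drift, the KL is 0.
t_grid = np.linspace(0.0, 2.0, 100)
paths = np.random.default_rng(0).standard_normal((8, 100))
counts = np.random.default_rng(1).poisson(1.0, size=100)
kl = kl_estimate(paths, counts, t_grid, lambda x: -x,
                 lambda x, N, t: -x, lambda x: np.ones_like(x))
```

The same quantity reappears as the penalty term inside the SAA of the ELBO in Section 4.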

4. STOCHASTIC VARIATIONAL INFERENCE OF THE VBSP

It is evident from the ELBO in equation 17 and the definition of the variational family $\mathcal{Q}_{\tilde b}$ that computing the VBSP measure entails the computation of the unknown drift function $\tilde b(\cdot,\cdot,\cdot)$ in equation 13. We further restrict the family of measures $\mathcal{Q}_{\tilde b}$ by assuming the drift functions belong to a class of parametrized, smooth functions. A feasible way to model this class of drift functions is through a neural network. However, it is possible to use simpler approximating function classes, as done in, for example, Sutter et al. (2016), who fixed $\tilde b$ to ensure that the marginal distributions of the variational smoothing SDE belong to a specific exponential family of distributions. We note that in choosing a parametrized class, we must still ensure that the resulting drift functions are Lipschitz continuous and satisfy sufficient regularity so that a solution to the SDE in equation 13 exists. Furthermore, restricting the drift functions in this way entails a further restriction of the class of (approximating) path measures. An open question here is how much of a loss this entails (in terms of the Kullback-Leibler divergence from the 'true' posterior path measure). To fix ideas, we assume that $\tilde b(\cdot,\cdot,\cdot)$ lies in a general class of functions parametrized by $\theta$. Henceforth, we write the drift coefficient as $\tilde b(\cdot,\cdot,\cdot;\theta)$ to make its dependence on $\theta$ explicit. We use stochastic gradient descent (SGD) to maximize the ELBO, which requires the computation of stochastic gradients of the ELBO with respect to $\theta$. To compute the gradients, we generate sample paths of $x^\theta$, the solution of the variational SDE (equation 13) for a given $\theta$, using a first-order Euler-Maruyama integration of the SDE. We do this for convenience; higher-order approximations could be used. Specifically, we partition the time interval $[0,T]$ into $K$ equal sub-intervals of length $\Delta = T/K$, $\{t_0, t_1, \ldots$
$t_K\}$, where $t_0 = 0$ and $t_K = T$, and then generate the sequence $\{x^\theta_{t_i}\}_{i=1}^K$ using the following recursion with initial condition $x^\theta_{t_0} = 0$:

$$x^\theta_{t_i} = x^\theta_{t_{i-1}} + \tilde b\big(x^\theta_{t_{i-1}}, N_{t_{i-1}}, t_{i-1}; \theta\big)\,\Delta + \sqrt{\Delta}\,\sigma\big(x^\theta_{t_{i-1}}\big)\,Z_i, \quad \forall i \in \{1,2,\ldots,K\}, \qquad (18)$$

where $\{Z_i\}_{i=1}^K$ is a sequence of $K$ independent and identically distributed (i.i.d.) $d$-dimensional standard Gaussian random vectors. We generate $M$ independent sample paths of the discrete-time process in equation 18, denoted $\{x^{\theta,m}_{t_i}\}$ for $m \in \{1,2,\ldots,M\}$, to compute a sample average approximation (SAA) of the ELBO over the partition $\{t_0, t_1, \ldots, t_K\}$ as

$$\widehat{\mathrm{ELBO}} = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=0}^{K-1}\left[\log \mathcal{L}\big(N_{t_i,t_{i+1}} \,\big|\, h(x^{\theta,m}_{t_i})\Delta\big) - \frac{1}{2}\big\|\sigma(x^{\theta,m}_{t_i})^{-1}\big[b(x^{\theta,m}_{t_i}) - \tilde b(x^{\theta,m}_{t_i}, N_{t_i}, t_i; \theta)\big]\big\|^2\,\Delta\right]$$

$$= \frac{1}{M}\sum_{m=1}^{M}\sum_{i=0}^{K-1}\left[N_{t_i,t_{i+1}} \log\big(h(x^{\theta,m}_{t_i})\Delta\big) - h(x^{\theta,m}_{t_i})\Delta - \frac{1}{2}\big\|\sigma(x^{\theta,m}_{t_i})^{-1}\big[b(x^{\theta,m}_{t_i}) - \tilde b(x^{\theta,m}_{t_i}, N_{t_i}, t_i; \theta)\big]\big\|^2\,\Delta\right] + C, \qquad (19)$$

where we used the definition of the Poisson likelihood $\mathcal{L}$ from equation 1 and $C$ is a constant independent of $\theta$. Now, to compute gradients of $\widehat{\mathrm{ELBO}}$ with respect to $\theta$, observe that the gradient operator can be exchanged with the expectation, since the only source of randomness in each sample path $x^{\theta,m}_t$ is the $K$ i.i.d. Gaussian random vectors $\{Z^m_i\}_{i=1}^K$, which are independent of $\theta$. Notice too that this is a pathwise analog of the reparametrization trick, which has recently been used to learn deep latent models (Tzen & Raginsky, 2019; Li et al., 2020). Also note that, thus far, the ELBO is defined for a single sample path of the count observations $N_{0,T}$; in our method, we additionally average $\widehat{\mathrm{ELBO}}$ over multiple batches of sample paths of count observations at each epoch of the training algorithm.
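The recursion in equation 18 and the SAA of equation 19 can be combined into one pass over the time grid: at each step, accumulate the Poisson log-likelihood and drift-gap penalty, then advance the paths. The following 1-D sketch (our function name `elbo_saa`, with an arbitrary drift callable standing in for the neural network) illustrates the computation for a single observed count path; it is not the paper's implementation.

```python
import numpy as np

def elbo_saa(drift_theta, b_prior, sigma, h, counts, T=2.0, K=100, M=32, rng=None):
    """SAA of the ELBO (up to the theta-independent constant C) for ONE
    observed count path: simulate M Euler-Maruyama paths of the variational
    SDE (the recursion in equation 18) and accumulate, on each sub-interval,
    the Poisson log-likelihood term and the Girsanov drift-gap penalty."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / K
    t = np.arange(K) * dt
    N_cum = np.concatenate([[0], np.cumsum(counts)[:-1]])  # N_{t_i} up to t_i
    x = np.zeros(M)                                        # x_{t_0} = 0
    total = 0.0
    for i in range(K):
        rate = h(x) * dt
        ll = counts[i] * np.log(rate) - rate               # Poisson term, sans log(N!)
        gap = (b_prior(x) - drift_theta(x, N_cum[i], t[i])) / sigma(x)
        total += np.mean(ll - 0.5 * gap ** 2 * dt)
        x = x + drift_theta(x, N_cum[i], t[i]) * dt \
              + np.sqrt(dt) * sigma(x) * rng.standard_normal(M)
    return total

# Toy run with the prior drift itself as the variational drift.
counts = np.random.default_rng(0).poisson(1.0, size=100)
val = elbo_saa(lambda x, N, t: -x, lambda x: -x, lambda x: np.ones_like(x),
               lambda x: 5.0 * np.exp(-0.08 * (x - 5.0) ** 2), counts)
```

In an automatic-differentiation framework, fixing the Gaussian draws and differentiating this scalar through the recursion realizes the pathwise reparametrization gradient described above.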

5. NUMERICAL EXPERIMENTS

We present three experiments demonstrating the efficacy and utility of our proposed SVI method. First, we consider a setting in which the underlying stochastic intensity process is 1-dimensional. We compare the SVI approximation with the 'true' smoothing posterior density computed using the solution of the forward and backward SPDEs, defined in equations 7 and 8. We solve the SPDEs using a standard finite element method (FEM) solver (Skeel & Berzins, 1990, Matlab solvers for 1-D PDEs). In the second experiment, we demonstrate the performance of our algorithm on a subset of a bike-sharing dataset obtained from the UCI machine learning repository (Fanaee-T & Gama, 2013). In this experiment, we estimate a smoothing posterior density for the observed counts of the demand for bikes in a 24-hour period, assuming that the demand process is well-modeled by a Cox process. In our third and final experiment, we apply our method to compute an approximation to a 4-dimensional smoothing posterior density. We note that, despite being low dimensional, the standard FEM solver does not scale to this setting, while our method can be straightforwardly adapted.

5.1. VARIATIONAL APPROXIMATION OF UNIVARIATE SMOOTHING POSTERIOR DENSITY

As defined in Section 2, we are interested in learning the posterior measure over an unknown process $\{x_t \in \mathbb{R}\}$, where the intensity process $\{z_t\}$ satisfies $z_t = h(x_t)$. We set $b(x) = -x$ and $\sigma(x) = 1$ in the prior SDE defined in equation 2. Furthermore, motivated by the mathematical structure of the true smoothing SDE (equation 11), we fix our variational family $\mathcal{Q}$ to be the class of measures induced by solutions of the class of SDEs in equation 13 with drift and diffusion coefficients set as

$$\tilde b(x, t, N_t) = -\frac{\Psi_t'}{\Psi_t} + V(x, t, N_T - N_t; \theta) - x, \qquad \sigma(x) = 1,$$

where $\Psi_t'$ is the derivative of $\Psi_t$ with respect to $x$. Here $V(x, t, N_T - N_t; \theta)$ is modeled using a neural network with 2 hidden layers whose parameters we call $\theta$ (see Appendix A.3 for more details on the architecture).
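For this 1-D setup, $\Psi_t'/\Psi_t = \partial_x \log \Psi_t = h'(x)\,(N_t/h(x) - t)$ is available in closed form, so only the residual $V$ needs a network. A minimal NumPy sketch of this drift parametrization follows; the hidden width, initialization, and class name `MLPDrift` are our placeholders, not the paper's exact architecture (which is in its Appendix A.3).

```python
import numpy as np

def h(x):                      # Section 5.1 choice: h(a) = 5*exp(-0.08*(a-5)^2)
    return 5.0 * np.exp(-0.08 * (x - 5.0) ** 2)

def dh(x):                     # h'(x)
    return h(x) * (-0.16 * (x - 5.0))

def d_log_Psi(x, t, N_t):
    """Psi'_t/Psi_t = d/dx [(1 - h(x)) t + N_t log h(x)] = h'(x) (N_t/h(x) - t)."""
    return dh(x) * (N_t / h(x) - t)

class MLPDrift:
    """V(x, t, N_T - N_t; theta) as a 2-hidden-layer ReLU network; width and
    initialization are our illustrative choices."""
    def __init__(self, width=32, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.W1 = 0.1 * rng.standard_normal((3, width)); self.b1 = np.zeros(width)
        self.W2 = 0.1 * rng.standard_normal((width, width)); self.b2 = np.zeros(width)
        self.W3 = 0.1 * rng.standard_normal((width, 1)); self.b3 = np.zeros(1)

    def V(self, x, t, N_rem):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        z = np.stack([x, np.full_like(x, t), np.full_like(x, float(N_rem))], axis=-1)
        z = np.maximum(z @ self.W1 + self.b1, 0.0)
        z = np.maximum(z @ self.W2 + self.b2, 0.0)
        return (z @ self.W3 + self.b3)[..., 0]

    def drift(self, x, t, N_t, N_T):
        # b_tilde(x, t, N_t) = -Psi'_t/Psi_t + V(x, t, N_T - N_t; theta) - x
        return -d_log_Psi(x, t, N_t) + self.V(x, t, N_T - N_t) - x

net = MLPDrift()
out = net.drift(np.array([1.0, 2.0]), 0.5, 3, 10)
```

Building the known $-\Psi_t'/\Psi_t - x$ structure into the drift leaves the network to learn only the $\nabla \log v_t$-like correction, which mirrors the form of the true smoothing SDE.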

5.1.1. SIMULATED DATASET

We generate sample paths of the count observations $N_{0,T}$ from a non-homogeneous Poisson process, where the intensity process $\{z^0_t\}$ is the solution of the following ordinary differential equation,

$$\frac{dz^0_t}{dt} = 20(2-t)\exp\big(-0.85(2-t)^2\big), \qquad z^0_0 = 0.$$

We fix the map $h(a) = 5\exp(-0.08(a-5)^2)$ in this experiment. To train the neural network $V$, we use 150 sample paths of the count observations on the time interval $[0,2]$ and optimize $\widehat{\mathrm{ELBO}}$ in equation 19 using Adam (Kingma & Ba, 2014). To demonstrate the efficacy of our approach, we first generate 20 test sample paths of count observations and compute the true smoothing posterior density (defined in equation 9) using the solution of the forward and backward SPDEs (defined in equations 7 and 8, respectively). We do this using the FEM method. Then, for the same test observations, we compute the VBSP density by FEM using the trained drift coefficient of the VBSP SDE (see equations 10 and 11). Comparative results are presented in Figure 1. We clearly see from the top row plot that the two densities are very similar, with our variational approximation capturing the sharp rises in density with high fidelity. Note from the first two plots in the bottom row that, as training progresses and the ELBO improves, the gap between the true and VB smoothing posteriors shrinks as well. Moreover, the time required to compute the smoothing density using the forward and backward SPDEs on the test data is about 2.2 seconds, approximately fifty percent higher than for the trained VBSP density, which required 1.6 seconds (on a 3.1 GHz Intel i5 CPU). This is because we solve one SPDE in the latter case instead of two in the former. Notice that the learnt drift of the smoothing SDE (equation 13) is a map that can be used with any sample path of count observations to generate Monte Carlo samples from an approximate smoothing posterior density.
In contrast, it is challenging to sample from the true smoothing SDE as it involves computing the solution of the system of SPDEs in equation 7 and 8. Furthermore, this solution must be recomputed for each new sample path of count observations.
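The test count paths used above can be generated by numerically integrating the intensity ODE and drawing Poisson increments on each sub-interval. A minimal sketch, under our own grid and step-size choices (the paper does not specify them), is:

```python
import numpy as np

def simulate_test_counts(T=2.0, K=100, rng=None):
    """One count path from the Section 5.1.1 test process: Euler-integrate
    dz/dt = 20*(2 - t)*exp(-0.85*(2 - t)^2) with z_0 = 0, then draw
    independent Poisson increments with means z_{t_i} * dt."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / K
    t = np.arange(K) * dt
    z = np.zeros(K)
    for i in range(1, K):
        z[i] = z[i - 1] + 20.0 * (2.0 - t[i - 1]) * np.exp(-0.85 * (2.0 - t[i - 1]) ** 2) * dt
    return t, z, rng.poisson(z * dt)

t, z, counts = simulate_test_counts()
```

Note that this intensity is deterministic (an ODE), so the generated test data deliberately sit outside the diffusion prior class, probing the robustness of the learned smoothing map.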

5.1.2. BIKE SHARING DATASET

In this experiment, we compute the VBSP density for the hourly counts of demand in a bike-sharing system. The experimental setting remains unaltered, except that the diffusion coefficient is set to $\sigma(x) = 1.1$ to capture the greater variability in the bike-demand counts relative to the simulated counts in the previous experiment. Notice that fixing the map $h$ is a modeling question, and we consider mappings of the form $h(x) = a\exp(-b(x-c)^2)$ parametrized by $a$, $b$, and $c$. We take a simple empirical Bayes heuristic to fix their values after observing the count observations, based on the fact that $h(x)\Delta$ is the mean of the Poisson counts in an interval of length $\Delta$. We set $a$ in such a way that $h(x)\Delta$ can match the maximum of the median observed counts. For the bike-sharing data we re-scaled the problem to the interval $[0,2]$ and fixed $\Delta = 0.083$, and thus choose $a = 1050 \approx 90/0.083$. The choice of $b$ and $c$ depends on the diffusion coefficient $\sigma(x)$ and $x_0$, as the SDE should be able to explore the relevant domain of $h$ to appropriately model the actual count observations. Thus, after looking at the count data, we chose $h(x) = 1050\exp(-0.001(x-50)^2)$. The empirical results are summarized in Figure 2. We note here that the FEM approach (our implementation) to computing the VBSP density was numerically unstable and failed for the given time-space discretization.

Note that in a smoothing problem, the prior intensity process (specifically, the drift coefficient $b(\cdot)$, the diffusion coefficient $\sigma(\cdot)$, and the map $h(\cdot)$) is sourced from an expert, and the objective is to update the modeler's beliefs with count observations to compute the smoothing posterior density. In many settings, the functions $b(\cdot)$, $\sigma(\cdot)$, and $h(\cdot)$ are known only up to some unknown parameters. It is fairly straightforward to combine the parameters of $h$ and $b$ with the neural network parameters $\theta$ and learn them all in a data-driven manner; we choose not to do this to keep the discussion simple.
Learning the parameters of $\sigma$ presents a slightly greater challenge, since the path measures for different settings of $\sigma$ are mutually singular. This is a crucial difference from the finite-dimensional setting, where the Lebesgue measure serves as a common reference. We leave this problem for future work.
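The empirical Bayes heuristic for the peak parameter $a$ amounts to one line of arithmetic. A sketch, with our hypothetical helper name `choose_a` and an assumed days-by-hours layout for the raw counts, is:

```python
import numpy as np

def choose_a(hourly_counts, dt=0.083):
    """Empirical-Bayes heuristic of Section 5.1.2 for the peak `a` of
    h(x) = a*exp(-b*(x - c)^2): since h(x)*dt is the mean Poisson count on
    an interval of length dt, match a*dt to the largest per-hour median
    count. `hourly_counts` is a (days x hours) array; the layout is our
    assumption about the dataset."""
    peak_median = np.median(hourly_counts, axis=0).max()
    return peak_median / dt

# A peak median of 90 rentals/hour with dt = 0.083 gives a ~ 90/0.083,
# which the paper takes as a = 1050.
a = choose_a(np.full((5, 24), 90.0))
```

Medians rather than means keep the heuristic robust to occasional demand spikes in the raw counts.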

5.2. VARIATIONAL APPROXIMATION OF MULTIVARIATE SMOOTHING POSTERIOR DENSITY

We demonstrate our method on a 4-dimensional smoothing problem. In this case, we fix the map $h(a) = 25\exp(-0.08\|a-5\|^2)$, where $\|\cdot\|$ is the L-2 norm and $a \in \mathbb{R}^4$. We also choose the prior measure to be induced by the SDE defined in equation 2 with $b(x) = -x$ and $\sigma(x) = I$, where $x \in \mathbb{R}^d$ and $I$ is the $d \times d$ identity matrix. Furthermore, we choose our variational family to be the family of SDEs defined in equation 13, with drift and diffusion coefficients

$$\tilde b(x, t, N_t) = -\frac{\nabla \Psi_t}{\Psi_t} + V_d(x, t, N_T - N_t; \theta) - x, \qquad \sigma(x) = I.$$

To train the neural network $V_d$, we use 150 sample paths of the count observations on the time interval $[0,2]$ and optimize the ELBO defined in equation 19 using Adam (Kingma & Ba, 2014). We plot the empirical results in Figure 3.

For completeness, we record here the chain of equalities used in Appendix A.1 to obtain the smoothing SPDE; as there, $V_t(x) := \Psi_t^{-1} v_t(x)$ and $p_{S,t} := p_t(x \mid N_{0,T})$:

$$\begin{aligned}
\frac{\partial}{\partial t} p_{S,t} ={}& -\frac{p_{S,t}}{V_t(x)}\,\mathcal{L}[V_t(x)] + V_t(x)\,\mathcal{L}^*\!\left[\frac{p_{S,t}}{V_t(x)}\right]\\
={}& -\frac{1}{2}\sum_{i,j} a_{i,j}(x)\,\frac{p_{S,t}}{V_t(x)}\,\partial_{x_i x_j} V_t(x) - \sum_i b_i(x)\,\frac{p_{S,t}}{V_t(x)}\,\partial_{x_i} V_t(x)\\
&+ \frac{1}{2}\sum_{i,j}\bigg[\partial_{x_i x_j}\big(a_{i,j}(x) p_{S,t}\big) - \frac{\partial_{x_i} V_t(x)\,\partial_{x_j}\big(a_{i,j}(x) p_{S,t}\big)}{V_t(x)} + \frac{2\,\partial_{x_i} V_t(x)\,\partial_{x_j} V_t(x)}{V_t^2(x)}\,a_{i,j}(x) p_{S,t}\\
&\qquad\quad - \frac{1}{V_t(x)}\Big(\partial_{x_j} V_t(x)\,\partial_{x_i}\big(a_{i,j}(x) p_{S,t}\big) + \partial_{x_i x_j} V_t(x)\,a_{i,j}(x) p_{S,t}\Big)\bigg]\\
&+ \sum_i \frac{\partial_{x_i} V_t(x)\, b_i(x)\, p_{S,t}}{V_t(x)} - \sum_i \big(\partial_{x_i} b_i(x)\, p_{S,t} + b_i(x)\,\partial_{x_i} p_{S,t}\big).
\end{aligned}$$

The $b_i$ terms involving $V_t$ cancel, leaving

$$\begin{aligned}
\frac{\partial}{\partial t} p_{S,t} ={}& \frac{1}{2}\sum_{i,j}\partial_{x_i x_j}\big(a_{i,j}(x) p_{S,t}\big) - \sum_i \partial_{x_i}\big[b_i(x) p_{S,t}\big]\\
&- \frac{1}{2}\sum_{i,j}\bigg[\frac{\partial_{x_i x_j} V_t(x)}{V_t(x)}\,a_{i,j}(x) p_{S,t} + \frac{\partial_{x_i} V_t(x)\,\partial_{x_j}\big(a_{i,j}(x) p_{S,t}\big)}{V_t(x)} - \frac{2\,\partial_{x_i} V_t(x)\,\partial_{x_j} V_t(x)}{V_t^2(x)}\,a_{i,j}(x) p_{S,t}\\
&\qquad\quad + \frac{1}{V_t(x)}\Big(\partial_{x_j} V_t(x)\,\partial_{x_i}\big(a_{i,j}(x) p_{S,t}\big) + \partial_{x_i x_j} V_t(x)\,a_{i,j}(x) p_{S,t}\Big)\bigg]\\
={}& \frac{1}{2}\sum_{i,j}\partial_{x_i x_j}\big(a_{i,j}(x) p_{S,t}\big) - \sum_i \partial_{x_i}\big[b_i(x) p_{S,t}\big]\\
&- \frac{1}{2}\sum_{i,j}\bigg[\partial_{x_i}\partial_{x_j}\log V_t(x)\,a_{i,j}(x) p_{S,t} + \frac{\partial_{x_i} V_t(x)\,\partial_{x_j}\big(a_{i,j}(x) p_{S,t}\big)}{V_t(x)}\\
&\qquad\quad - \frac{\partial_{x_i} V_t(x)\,\partial_{x_j} V_t(x)}{V_t^2(x)}\,a_{i,j}(x) p_{S,t} + \frac{1}{V_t(x)}\Big(\partial_{x_j} V_t(x)\,\partial_{x_i}\big(a_{i,j}(x) p_{S,t}\big) + \partial_{x_i x_j} V_t(x)\,a_{i,j}(x) p_{S,t}\Big)\bigg].
\end{aligned}$$

Since $a_{i,j}(x) = a_{j,i}(x)$ and $\partial_{x_i x_j} V_t(x) = \partial_{x_j x_i} V_t(x)$, the bracketed terms combine as $-\sum_{i,j} \partial_{x_j}\big(\partial_{x_i}\log V_t(x)\,a_{i,j}(x) p_{S,t}\big)$, and therefore

$$\frac{\partial}{\partial t} p_{S,t} = \frac{1}{2}\sum_{i,j}\partial_{x_i x_j}\big(a_{i,j}(x) p_{S,t}\big) - \sum_i \partial_{x_i}\Big\{\big[\big(a(x)\nabla\log V_t(x)\big)_i + b_i(x)\big]\, p_{S,t}\Big\},$$

which is the smoothing SPDE in equation 10 with $V_t(x) = \Psi_t^{-1} v_t(x)$.

A.2 KL-DIVERGENCE BETWEEN A MEMBER OF THE VARIATIONAL FAMILY AND THE PRIOR SDE

We derive a pathwise expression for $\mathrm{KL}(Q \| \Pi_0)$ for a given count observation path $N_{0,T}$.

Theorem A.1. Define $u(x_t, t, N_t; \theta) := \sigma(x_t)^{-1}\big(b(x_t) - \tilde b(x_t, t, N_t; \theta)\big)$ and suppose that $u$ satisfies a strong Novikov condition:

$$\mathbb{E}\exp\left(\frac{1}{2}\int_0^T \|u(x_t, t, N_t; \theta)\|^2\,dt\right) < +\infty \quad \forall \theta.$$

Then,

$$\mathrm{KL}(Q \| \Pi_0) = \mathbb{E}_Q\left[\frac{1}{2}\int_0^T \|u(x_t, t, N_t; \theta)\|^2\,dt\right].$$

Proof. Given a sample path of count observations $N_{0,T}$, using the definition of $u$ and under Novikov's condition, using Girsanov's theorem (Oksendal, 2013, Theorem 8.6.8),



Figure 1: Variational vs. true smoothing posterior density on the 1-D simulated dataset. The top row compares the variational approximation with the true smoothing density. The bottom-left plot shows the ELBO as a function of the epochs of the training algorithm. We also compute the $L_1$ distance between the VB and true smoothing densities over 20 sample paths of both training and test count observations as the training progresses, and plot the 10th, 50th, and 90th quantiles in the bottom-middle figure. The bottom-right plot depicts that 95% of the test count observations are within the 5th and 95th quantiles of the simulated counts, where counts are simulated using the learned VBSP SDE for that test sample path of count observations.

Figure 2: VBSP smoothing posterior density for the bike-sharing data. The left plot shows the ELBO as the training progresses. The second and third plots depict the empirical VBSP density computed on a test sample path of count observations, using 1000 simulated sample paths of the learnt variational smoothing SDE. In the right plot, we use the same test sample path of count observations, compute the VBSP using the trained drift, and demonstrate that most count observations of the test sample path (about 70%) lie within the 2.5th and 97.5th quantiles of the simulated counts.

Figure 3: Multivariate VB smoothing posterior density: The first two rows of Figure 3 show the marginals of the (empirical) VBSP density using the trained drift coefficient for a given test sample path of count observations. The bottom-left figure plots the ELBO value as a function of the number of training epochs. Then, for a test sample path of count observations, the right hand plot shows that more than 95% of these observations are within the 95% confidence interval of the simulated count observations generated from the trained 4-dimensional VBSP density computed for that test path.

defining $d\tilde B_t = dB_t - u(x_t, t, N_t; \theta)\,dt$, we have that $\tilde B_t$ is a Brownian motion with respect to the measure under which $x$ has the prior law. Furthermore, we also have $dx_t = b(x_t)\,dt + \sigma(x_t)\,d\tilde B_t$, and

$$\log \frac{dQ}{d\Pi_0} = -\int_0^T u(x_s, s, N_s; \theta)\,dB_s + \frac{1}{2}\int_0^T \|u(x_s, s, N_s; \theta)\|^2\,ds.$$

Taking the expectation under $Q$ (the stochastic integral being a zero-mean $Q$-martingale, since $B$ is a $Q$-Brownian motion in equation 13) yields the claim.

A.3 NEURAL NETWORK ARCHITECTURE

For all the experiments, we use the neural network architecture depicted in Figure 4, with ReLU activation functions between fully-connected hidden layers.

Figure 4: Neural network architecture used in the numerical experiments.

A.4 COMPARING COMPUTATIONAL TIME REQUIRED TO NUMERICALLY COMPUTE VBSP AND TRUE SMOOTHING DENSITY USING FEM

Comparing computational time (in sec) required to compute the 1-D VBSP and true smoothing density using FEM; the times reported are medians over 20 test sample paths of count observations at each epoch:

VBSP:    .84 1.81 1.85 1.75 1.78 1.87 1.9 1.95 2.1 2.06 1.81 1.84
True SD: 2.09 2.18 2.25 2.37 2.22 2.18 2.22 2.23 2.21 2.24 2.33 2.29 2.36 2.39 2.41 2.17 2.12
VBSP:    .81 1.85 1.89 1.87 1.84 1.85 1.86 1.92 1.91 1.92 1.88 1.84 1.91 1.91 1.82
True SD: 2.14 2.14 2.15 2.18 2.25 2.17 2.14 2.12 2.2 2.15 2.23 2.17 2.17 2.28 2.25 2.17

A APPENDIX

A.1 DERIVATION OF SMOOTHING SPDE

According to Theorem * [sic] in Elliott & Malcolm (2005), $K_t := \left(\int_{\mathbb{R}^d} q_t(\xi)\,v_t(\xi)\,d\xi\right)^{-1}$ is almost surely constant for $t \le T$. Using this result, it follows that $\partial_t p_{S,t} = K_t\,\partial_t[q_t(x)\,v_t(x)]$, where $V_t(x) = \Psi_t^{-1} v_t(x)$ and $p_{S,t} = p_t(x \mid N_{0,T})$ are introduced for brevity. Expanding $\partial_t[q_t v_t]$ using the forward SPDE (equation 7) for $q_t$ and the backward equation (equation 8) for $v_t$, collecting the generator terms, and simplifying then yields the smoothing SPDE in equation 10.

