COMPUTATIONAL DOOB'S h-TRANSFORMS FOR ONLINE FILTERING OF DISCRETELY OBSERVED DIFFUSIONS

Abstract

This paper is concerned with online filtering of discretely observed nonlinear diffusion processes. Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's h-transforms that are typically intractable. We propose a computational framework to approximate these h-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than state-of-the-art particle filters in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.

1. INTRODUCTION

Diffusion processes are fundamental tools in applied mathematics, statistics, and machine learning. Because this flexible class of models is easily amenable to computations and simulations, diffusion processes are very common in the biological sciences (e.g. population and multi-species models, stochastic delay population systems), neuroscience (e.g. models for synaptic input, the stochastic Hodgkin-Huxley model, the stochastic FitzHugh-Nagumo model), and finance (e.g. modeling multi-asset prices) (Allen, 2010; Shreve et al., 2004; Capasso & Capasso, 2021). In these disciplines, tracking a signal from partial or noisy observations is a very common task. However, working with diffusion processes can be challenging as their transition densities are only tractable in rare and simple situations such as (geometric) Brownian motions or Ornstein-Uhlenbeck (OU) processes. This difficulty has hindered the use of standard methodologies for inference and data assimilation of models driven by diffusion processes, and various approaches have been developed to circumvent or mitigate some of these issues, as discussed in Section 4. Consider a time-homogeneous multivariate diffusion process $dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t$ that is discretely observed at regular intervals. Noisy observations $y_k$ of the latent process $X_{t_k}$ are collected at equispaced times $t_k = kT$ for $k \geq 1$. We consider the online filtering problem, which consists in estimating the filtering distributions $\pi_k(dx) = \mathbb{P}(X_{t_k} \in dx \mid y_1, \ldots, y_k)$ as observations are collected. We focus on Particle Filters (PFs), which approximate the filtering distributions with a system of weighted particles. Although many previous works have relied on the Bootstrap Particle Filter (BPF), which simulates particles from the diffusion process, it can perform poorly in challenging scenarios as it fails to take the incoming observation $y_k$ into account.
The goal of this article is to show that the (locally) optimal approach given by the Fully Adapted Auxiliary Particle Filter (FA-APF) (Pitt & Shephard, 1999) can be implemented. This necessitates simulating a conditioned diffusion process, which can be formulated as a control problem involving an intractable Doob's h-transform (Rogers & Williams, 2000; Chung & Walsh, 2006). We propose the Computational Doob's h-Transform (CDT) framework for efficiently approximating these quantities. The method relies on nonlinear Feynman-Kac formulas for solving backward Kolmogorov equations simultaneously for all possible observations. Importantly, this preprocessing step only needs to be performed once before starting the online filtering procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the BPF in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large. A PyTorch implementation to reproduce our numerical experiments is available at https://anonymous.4open.science/r/CompDoobTransform/.

Notations. For two matrices $A, B \in \mathbb{R}^{d,d}$, their Frobenius inner product is defined as $\langle A, B \rangle_F = \sum_{i,j=1}^{d} A_{i,j} B_{i,j}$. The Euclidean inner product for $u, v \in \mathbb{R}^d$ is denoted by $\langle u, v \rangle = \sum_{i=1}^{d} u_i v_i$. For two (or more) functions $F$ and $G$, we sometimes use the shortened notation $[FG](x)$ to denote the product $F(x)G(x)$.

2. BACKGROUND

2.1. FILTERING OF DISCRETELY OBSERVED DIFFUSIONS

Consider a homogeneous diffusion process $\{X_t\}_{t \geq 0}$ in $\mathcal{X} = \mathbb{R}^d$ with initial distribution $\rho_0(dx)$ and dynamics $dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t$, described by the drift and volatility functions $\mu : \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma : \mathbb{R}^d \to \mathbb{R}^{d,d}$. Noisy observations $Y_k$ taking values in $\mathcal{Y}$ are collected at the equispaced times $t_k = kT$; they are assumed conditionally independent given the latent path, with $\mathbb{P}(Y_k \in A \mid X_{t_k} = x_k) = \int_A g(x_k, y)\,dy$ for a likelihood function $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ and some dominating measure $dy$ on $\mathcal{Y}$. For a test function $\varphi : \mathcal{X} \to \mathbb{R}$, the generator of the diffusion process $\{X_t\}_{t \geq 0}$ is given by $\mathcal{L}\varphi = \langle \mu, \nabla \varphi \rangle + \tfrac{1}{2} \langle \sigma\sigma^\top, \nabla^2 \varphi \rangle_F$. This article is concerned with approximating the filtering distributions $\pi_k(dx) = \mathbb{P}(X_{t_k} \in dx \mid y_1, \ldots, y_k)$. For notational convenience, we set $\pi_0(dx) \equiv \rho_0(dx)$ since no observation is collected at the initial time $t = 0$.

2.2. PARTICLE FILTERING

Particle Filters (PFs), also known as Sequential Monte Carlo methods, are a set of Monte Carlo algorithms that can be used to solve filtering problems (see Chopin et al. (2020) for a recent textbook on the topic). PFs evolve a set of $M \geq 1$ particles $x_t^{1:M} = (x_t^1, \ldots, x_t^M) \in \mathcal{X}^M$ forward in time using a combination of propagation and resampling operations. To initialize the PF, each initial particle $x_0^j \in \mathcal{X}$ for $1 \leq j \leq M$ is sampled independently from the distribution $\rho_0(dx)$ so that $\pi_0(dx) \approx M^{-1} \sum_{j=1}^M \delta(dx; x_0^j)$. Approximations of the filtering distribution $\pi_k$ for $k \geq 1$ are built recursively as follows. Given the Monte Carlo approximation $\pi_k(dx) \approx M^{-1} \sum_{j=1}^M \delta(dx; x_{t_k}^j)$ of the filtering distribution at time $t_k$, the particles $x_{t_k}^{1:M}$ are propagated independently forward in time by $\widetilde{x}_{t_{k+1}}^j \sim q_{k+1}(d\widetilde{x} \mid x_{t_k}^j)$, using a Markov kernel $q_{k+1}(d\widetilde{x} \mid x)$ specified by the user. The BPF corresponds to the Markov kernel $q_{k+1}^{\mathrm{BPF}}(d\widetilde{x} \mid x) = \mathbb{P}(X_{t_{k+1}} \in d\widetilde{x} \mid X_{t_k} = x)$, while the FA-APF (Pitt & Shephard, 1999) corresponds to the (typically intractable) kernel $q_{k+1}^{\mathrm{FA\text{-}APF}}(d\widetilde{x} \mid x) = \mathbb{P}(X_{t_{k+1}} \in d\widetilde{x} \mid X_{t_k} = x, Y_{k+1} = y_{k+1})$. Each particle $\widetilde{x}_{t_{k+1}}^j$ is associated with a normalized weight $\overline{W}_{k+1}^j = W_{k+1}^j / \sum_{i=1}^M W_{k+1}^i$, where the unnormalized weights (by time-homogeneity of (1)) are defined as $W_{k+1}^j = \frac{p_T(d\widetilde{x}_{t_{k+1}}^j \mid x_{t_k}^j)}{q_{k+1}(d\widetilde{x}_{t_{k+1}}^j \mid x_{t_k}^j)}\, g(\widetilde{x}_{t_{k+1}}^j, y_{k+1})$. The BPF and FA-APF correspond respectively to $W_{k+1}^{j,\mathrm{BPF}} = g(\widetilde{x}_{t_{k+1}}^j, y_{k+1})$ and $W_{k+1}^{j,\mathrm{FA\text{-}APF}} = \mathbb{E}[g(X_{t_{k+1}}, y_{k+1}) \mid X_{t_k} = x_{t_k}^j]$. The weights are such that $\pi_{k+1}(dx) \approx \sum_{j=1}^M \overline{W}_{k+1}^j\, \delta(dx; \widetilde{x}_{t_{k+1}}^j)$. The resampling step consists in defining a new set of particles $x_{t_{k+1}}^{1:M}$ with $\mathbb{P}(x_{t_{k+1}}^j = \widetilde{x}_{t_{k+1}}^i) = \overline{W}_{k+1}^i$.
This resampling scheme ensures that the equally weighted set of particles $x_{t_{k+1}}^{1:M}$ provides a Monte Carlo approximation of the filtering distribution at time $t_{k+1}$ in the sense that $\pi_{k+1}(dx) \approx M^{-1} \sum_{j=1}^M \delta(dx; x_{t_{k+1}}^j)$. Note that the particles $x_{t_{k+1}}^{1:M}$ do not need to be resampled independently given the set of propagated particles $\widetilde{x}_{t_{k+1}}^{1:M}$. We refer the reader to Gerber et al. (2019) for a recent discussion of resampling schemes within PFs and to Del Moral (2004) for a book-length treatment of the convergence properties of this class of Monte Carlo methods. In most settings, the FA-APF (Pitt & Shephard, 1999), which minimizes a local variance criterion (Doucet et al., 2009), generates particles that are more consistent with informative data and weights that exhibit significantly less variability than those of the BPF. This gain in efficiency can be very substantial when the signal-to-noise ratio is high or when observations contain outliers under the model specification. Nevertheless, implementing the FA-APF requires sampling from the transition probability $q_{k+1}^{\mathrm{FA\text{-}APF}}(d\widetilde{x} \mid x)$, which is typically not feasible in practice. We will show in the following that this can be achieved in our setting by simulating a conditioned diffusion.
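To make the propagate-weight-resample recursion concrete, the sketch below implements a minimal BPF for a one-dimensional OU process $dX_t = -X_t\,dt + dB_t$ with Gaussian observations (the model later used in Section 5.1). All function and variable names are ours, and the hyperparameters are purely illustrative; this is not the paper's implementation.

```python
import numpy as np

def bootstrap_pf(ys, M=1000, T=1.0, n_steps=50, sigma_y=0.5, seed=0):
    """Minimal BPF for dX_t = -X_t dt + dB_t with observations y_k ~ N(X_{t_k}, sigma_y^2).

    Propagation uses Euler-Maruyama under the original (uncontrolled) dynamics,
    weighting uses the observation likelihood g, and resampling is multinomial.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.normal(0.0, np.sqrt(0.5), size=M)  # rho_0: stationary distribution
    means = []
    for y in ys:
        for _ in range(n_steps):                        # q^BPF: simulate the diffusion
            x = x - x * dt + np.sqrt(dt) * rng.normal(size=M)
        logw = -0.5 * ((y - x) / sigma_y) ** 2          # log g(x, y) up to a constant
        w = np.exp(logw - logw.max())
        w /= w.sum()                                    # normalized weights
        means.append(np.sum(w * x))                     # estimate of E[X_{t_k} | y_{1:k}]
        x = x[rng.choice(M, size=M, p=w)]               # multinomial resampling
    return np.array(means)
```

Note how, exactly as described above, the observation $y$ enters only through the weights, never through the proposal: this is the weakness that the FA-APF addresses.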

2.3. CONDITIONED AND CONTROLLED DIFFUSIONS

As the diffusion process (1) is assumed to be time-homogeneous, it suffices to focus on the initial interval $[0, T]$ and study the dynamics of the diffusion $X_{[0,T]} = \{X_t\}_{t \in [0,T]}$ conditioned upon the first observation $Y_T = y$. It is a standard result that the conditioned diffusion is described by a diffusion process with the same volatility as the original diffusion but with a time-dependent drift function that takes the future observation $Y_T = y$ into account. Before deriving the exact form of the conditioned diffusion, the notion of a controlled diffusion needs to be discussed. For an arbitrary control function $c : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$ and $y \in \mathcal{Y}$, consider the controlled diffusion $\{X_t^{c,y}\}_{t \in [0,T]}$ with generator $\mathcal{L}^{c,y,t} \varphi(x) = \mathcal{L}\varphi(x) + \langle [\sigma c](x, y, t), \nabla \varphi(x) \rangle$ and dynamics
$$dX_t^{c,y} = \underbrace{\mu(X_t^{c,y})\,dt + \sigma(X_t^{c,y})\,dB_t}_{\text{original dynamics}} + \underbrace{[\sigma c](X_t^{c,y}, y, t)\,dt}_{\text{control drift term}}. \quad (4)$$
If $\mathbb{P}_{[0,T]}$ and $\mathbb{P}_{[0,T]}^{c,y}$ denote the probability measures on the space of continuous functions $C([0,T], \mathbb{R}^d)$ generated by the original and controlled diffusions, Girsanov's theorem shows that
$$\frac{d\mathbb{P}_{[0,T]}}{d\mathbb{P}_{[0,T]}^{c,y}}(X_{[0,T]}) = \exp\left( -\frac{1}{2} \int_0^T \|c(X_t, y, t)\|^2\,dt - \int_0^T \langle c(X_t, y, t), dB_t \rangle \right). \quad (5)$$
We now describe the optimal control function $c^\star : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$ such that, for any observation value $y \in \mathcal{Y}$, the controlled diffusion $X_{[0,T]}^{c^\star,y}$ has the same dynamics as the original diffusion $X_{[0,T]}$ conditioned upon the observation $Y_T = y$. For this purpose, consider the function
$$h(x, y, t) = \mathbb{E}[g(X_T, y) \mid X_t = x] = \int_{\mathcal{X}} g(x_T, y)\, p_{T-t}(dx_T \mid x) \quad (6)$$
that gives the probability of observing $Y_T = y$ when the diffusion has state $x \in \mathcal{X}$ at time $t \in [0, T]$. Recall that the likelihood function $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ was defined in Section 2.1. Equation (6) implies that $h : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}_+$ satisfies the backward Kolmogorov equation (Oksendal, 2013),
$$(\partial_t + \mathcal{L})\,h = 0, \quad (7)$$
with terminal condition $h(x, y, T) = g(x, y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
As described in Appendix A.1, the theory of Doob's h-transforms shows that the optimal control is given by
$$c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t). \quad (8)$$
We refer readers to Rogers & Williams (2000) for a formal treatment of Doob's h-transform.
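For intuition, $h$ is available in closed form for the scalar OU model used later in Section 5.1, and one can verify numerically that it solves the backward Kolmogorov equation (7). The sketch below does this with finite differences; the constants `T` and `sy2` and the function names are ours, chosen purely for illustration.

```python
import numpy as np

T, sy2 = 1.0, 0.25   # horizon and observation variance (illustrative values)

def h(x, y, t):
    # h(x, y, t) = E[g(X_T, y) | X_t = x] for dX = -X dt + dB and g = N(y; x, sy2):
    # X_T | X_t = x is Gaussian, so h is a Gaussian density in y.
    s = T - t
    var = (1.0 - np.exp(-2.0 * s)) / 2.0 + sy2
    m = x * np.exp(-s)
    return np.exp(-0.5 * (y - m) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def kolmogorov_residual(x, y, t, eps=1e-4):
    # finite-difference check of (d/dt + L)h = 0 with Lh = -x h_x + (1/2) h_xx
    ht = (h(x, y, t + eps) - h(x, y, t - eps)) / (2 * eps)
    hx = (h(x + eps, y, t) - h(x - eps, y, t)) / (2 * eps)
    hxx = (h(x + eps, y, t) - 2 * h(x, y, t) + h(x - eps, y, t)) / eps ** 2
    return ht - x * hx + 0.5 * hxx
```

The residual is zero up to finite-difference error, and the optimal control (8) is then simply $c^\star = \partial_x \log h$ in this scalar example.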

3.1. NONLINEAR FEYNMAN-KAC FORMULA

Obtaining the control function $c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t)$ by solving the backward Kolmogorov equation (7) for each observation $y \in \mathcal{Y}$ is computationally not feasible when filtering many observations. Furthermore, when the dimensionality of the state-space $\mathcal{X}$ becomes larger, standard numerical methods for solving Partial Differential Equations (PDEs), such as Finite Differences or the Finite Element Method, become impractical. For these reasons, we propose instead to approximate the control function $c^\star$ with neural networks, and employ methods based on automatic differentiation and the nonlinear Feynman-Kac approach to solve semilinear PDEs (Hartmann et al., 2017; 2019; Kebiri et al., 2017; E et al., 2017; Chan-Wai-Nam et al., 2019; Hutzenthaler & Kruse, 2020; Hutzenthaler et al., 2020; Beck et al., 2019; Han et al., 2018; Nüsken & Richter, 2021). As the non-negative function $h$ typically decays exponentially for large $\|x\|$, it is computationally more stable to work on the logarithmic scale and approximate the value function $v(x, y, t) = -\log[h(x, y, t)]$. Using the fact that $h$ satisfies the PDE (7), the value function satisfies
$$(\partial_t + \mathcal{L})\,v = \frac{1}{2} \|\sigma^\top \nabla v\|^2, \qquad v(x, y, T) = -\log[g(x, y)] \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y}. \quad (9)$$
Let $\{X_t^{c,y}\}_{t \in [0,T]}$ be the controlled diffusion defined in Equation (4) for a given control function $c : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$, and define the process $\{V_t\}_{t \in [0,T]}$ as $V_t = v(X_t^{c,y}, y, t)$. While any control function $c(x, y, t)$ satisfying mild growth and regularity assumptions can be considered within our framework, we will see that iterative schemes that choose it as a current approximation of $c^\star(x, y, t)$ tend to perform better in practice. Since $\partial_t v + \mathcal{L}v + \langle \sigma c, \nabla v \rangle = \frac{1}{2} \|\sigma^\top \nabla v\|^2 + \langle c, \sigma^\top \nabla v \rangle$, Itô's Lemma shows that for any observation $Y_T = y$ and $0 \leq s \leq T$, we have
$$V_T = V_s + \int_s^T \left( \frac{1}{2} \|Z_t\|^2 + \langle c, Z_t \rangle \right) dt + \int_s^T \langle Z_t, dB_t \rangle$$
with $Z_t = [\sigma^\top \nabla v](X_t^{c,y}, y, t)$ and $V_T = -\log[g(X_T^{c,y}, y)]$.
For notational simplicity, we have suppressed the dependence of $(V_t, Z_t)$ on the control $c$ and observation $y$. In summary, the pair of processes $(V_t, Z_t)$ defined by $V_t = v(X_t^{c,y}, y, t)$ and $Z_t = [\sigma^\top \nabla v](X_t^{c,y}, y, t)$ satisfies
$$-\log[g(X_T^{c,y}, y)] = V_s + \int_s^T \left( \frac{1}{2} \|Z_t\|^2 + \langle c, Z_t \rangle \right) dt + \int_s^T \langle Z_t, dB_t \rangle. \quad (10)$$
Crucially, under mild growth and regularity assumptions on the drift and volatility functions $\mu : \mathcal{X} \to \mathbb{R}^d$ and $\sigma : \mathcal{X} \to \mathbb{R}^{d,d}$, the pair of processes $(V_t, Z_t)$ is the unique solution to Equation (10) (Pardoux & Peng, 1990; 1992; Pardoux & Tang, 1999; Yong & Zhou, 1999). This result can be used as a building block for designing Monte Carlo approximations of the solution to semilinear and fully nonlinear PDEs (E et al., 2017; Han et al., 2018; Raissi, 2018; Beck et al., 2019; Huré et al., 2020; Pham et al., 2021).
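The identity (10) can be checked by simulation in the scalar OU example of Section 5.1, where $v = -\log h$ is known in closed form: integrating $(X_t, V_t, Z_t)$ forward with Euler-Maruyama and zero control, $V_T$ should match $-\log g(X_T, y)$ up to discretization error. The code below is an illustrative sketch; the constants and function names are ours.

```python
import numpy as np

T, sy2, y_obs = 1.0, 0.25, 1.0   # illustrative horizon, obs variance, observation

def v(x, t):
    # value function v = -log h for dX = -X dt + dB with g = N(y; x, sy2)
    s = T - t
    var = (1.0 - np.exp(-2.0 * s)) / 2.0 + sy2
    m = x * np.exp(-s)
    return 0.5 * (y_obs - m) ** 2 / var + 0.5 * np.log(2.0 * np.pi * var)

def Z(x, t):
    # Z_t = sigma^T grad v; here sigma = 1
    s = T - t
    var = (1.0 - np.exp(-2.0 * s)) / 2.0 + sy2
    return -(y_obs - x * np.exp(-s)) * np.exp(-s) / var

def simulate_pair(x0, n_steps=1000, seed=0):
    """Euler-Maruyama for (X_t, V_t) driven by the SAME Brownian path, c = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x, vv = x0, v(x0, 0.0)
    for i in range(n_steps):
        z = Z(x, i * dt)
        db = np.sqrt(dt) * rng.normal()
        vv += 0.5 * z * z * dt + z * db   # dV = (1/2)|Z|^2 dt + <Z, dB>  (c = 0)
        x += -x * dt + db                 # dX = -X dt + dB
    return x, vv
```

Running many paths, the terminal mismatch $|V_T + \log g(X_T, y)|$ shrinks with the stepsize, which is exactly the property the CDT loss (14) exploits in reverse: enforcing the terminal match identifies $v$ and $\sigma^\top \nabla v$.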

3.2. COMPUTATIONAL DOOB'S h-TRANSFORM

As before, consider a diffusion $\{X_t^{c,y}\}_{t \in [0,T]}$ controlled by a function $c : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$ and driven by the standard Brownian motion $\{B_t\}_{t \geq 0}$. Furthermore, for two functions $N_0 : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and $N : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$, consider the process $\{V_t\}_{t \in [0,T]}$ defined as
$$V_s = V_0 + \int_0^s \left( \frac{1}{2} \|Z_t\|^2 + \langle c(X_t^{c,y}, y, t), Z_t \rangle \right) dt + \int_0^s \langle Z_t, dB_t \rangle, \quad (11)$$
where the initial condition $V_0$ and the process $\{Z_t\}_{t \in [0,T]}$ are defined as
$$V_0 = N_0(X_0^{c,y}, y) \quad \text{and} \quad Z_t = N(X_t^{c,y}, y, t). \quad (12)$$
Importantly, we remind the reader that the two processes $X_t^{c,y}$ and $V_t$ are driven by the same Brownian motion $B_t$. The uniqueness result mentioned at the end of Section 3.1 implies that, if for any choice of initial condition $X_0^{c,y} \in \mathcal{X}$ and terminal observation $y \in \mathcal{Y}$ the condition $V_T = -\log[g(X_T^{c,y}, y)]$ is satisfied, then for all $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times [0, T]$ we have
$$N_0(x, y) = -\log h(x, y, 0) \quad \text{and} \quad N(x, y, t) = -[\sigma^\top \nabla \log h](x, y, t). \quad (13)$$
In particular, the optimal control is given by $c^\star(x, y, t) = -N(x, y, t)$. These remarks suggest parametrizing the functions $N_0(\cdot, \cdot)$ and $N(\cdot, \cdot, \cdot)$ by two neural networks with respective parameters $\theta_0 \in \Theta_0$ and $\theta \in \Theta$ while minimizing the loss function
$$\mathcal{L}(\theta_0, \theta; c) = \mathbb{E}\left[ \left( V_T + \log[g(X_T^{c,Y}, Y)] \right)^2 \right]. \quad (14)$$
The above expectation is with respect to the Brownian motion $\{B_t\}_{t \geq 0}$, the initial condition $X_0^{c,Y} \sim \eta^X(dx)$ of the controlled diffusion, and the observation $Y \sim \eta^Y(dy)$ at time $T$. In (14), we fix the dynamics of $X_t^{c,y}$ and optimize over the dynamics of $V_t$. The spread of the distributions $\eta^X$ and $\eta^Y$ should be large enough to cover typical states under the filtering distributions $\pi_k$, $k \geq 1$, and future observations to be filtered, respectively. Specific choices will be detailed for each application in Section 5. For offline problems, one could learn in a data-driven manner by selecting $\eta^Y$ as the empirical distribution of actual observations.
We stress that these choices only impact the training of the neural networks and will not affect the asymptotic guarantees of our filtering approximations.

CDT algorithm. The following outlines our training procedure to learn neural networks $N_0$ and $N$ that satisfy (13). To minimize the loss function (14), any stochastic gradient algorithm can be used with a user-specified mini-batch size of $J \geq 1$. The following steps are iterated until convergence.

1. Choose a control $c : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$, possibly based on the current neural network parameters $(\theta_0, \theta) \in \Theta_0 \times \Theta$.
2. Simulate independent Brownian paths $B_{[0,T]}^j$, initial conditions $X_0^j \sim \eta^X(dx)$, and observations $Y^j \sim \eta^Y(dy)$ for $1 \leq j \leq J$.
3. Generate the controlled trajectories: the $j$-th sample path $X_{[0,T]}^j$ is obtained by forward integration of the controlled dynamics in Equation (4) with initial condition $X_0^j$, control $c(\cdot, Y^j, \cdot)$, and the Brownian path $B_{[0,T]}^j$.
4. Generate the value trajectories: the $j$-th sample path $V_{[0,T]}^j$ is obtained by forward integration of the dynamics in Equations (11)-(12) with the Brownian path $B_{[0,T]}^j$ and the current neural network parameters $(\theta_0, \theta) \in \Theta_0 \times \Theta$.
5. Construct a Monte Carlo estimate of the loss function (14):
$$\widehat{\mathcal{L}} = J^{-1} \sum_{j=1}^J \left( V_T^j + \log[g(X_T^j, Y^j)] \right)^2. \quad (15)$$
6. Use automatic differentiation to compute $\partial_{\theta_0} \widehat{\mathcal{L}}$ and $\partial_\theta \widehat{\mathcal{L}}$ and update the parameters $(\theta_0, \theta)$.
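The steps above can be sketched compactly in PyTorch for the scalar OU model of Section 5.1. This is a minimal illustration of the static scheme (zero control in Step 1), not the paper's released code: the architecture, learning rate, batch size, and stepsize below are our own illustrative choices.

```python
import math
import torch

torch.manual_seed(0)
d, T, n_steps, sy = 1, 1.0, 20, 0.5   # OU model dX = -X dt + dB, obs std sy
dt = T / n_steps

def mlp(in_dim, out_dim, width=64):
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, width), torch.nn.LeakyReLU(),
        torch.nn.Linear(width, width), torch.nn.LeakyReLU(),
        torch.nn.Linear(width, out_dim))

N0 = mlp(2 * d, 1)       # N0(x, y)   ~ -log h(x, y, 0)
N = mlp(2 * d + 1, d)    # N(x, y, t) ~ -[sigma^T grad log h](x, y, t)
opt = torch.optim.Adam(list(N0.parameters()) + list(N.parameters()), lr=1e-2)

def cdt_step(J=128):
    """One stochastic-gradient step of the static CDT scheme (zero control)."""
    x = math.sqrt(0.5) * torch.randn(J, d)               # X_0 ~ eta_X (stationary law)
    y = math.sqrt(0.5 + sy ** 2) * torch.randn(J, d)     # Y ~ eta_Y (implied obs law)
    V = N0(torch.cat([x, y], dim=1))                     # V_0 = N0(X_0, Y)
    for i in range(n_steps):
        t = torch.full((J, 1), i * dt)
        Z = N(torch.cat([x, y, t], dim=1))               # Z_t = N(X_t, Y, t)
        dB = math.sqrt(dt) * torch.randn(J, d)
        # Euler step of (11) with c = 0: dV = (1/2)|Z|^2 dt + <Z, dB>
        V = V + 0.5 * (Z ** 2).sum(1, keepdim=True) * dt + (Z * dB).sum(1, keepdim=True)
        x = x - x * dt + dB                              # uncontrolled OU dynamics
    neg_log_g = (0.5 * ((y - x) / sy) ** 2).sum(1, keepdim=True) \
        + 0.5 * d * math.log(2 * math.pi * sy ** 2)
    loss = ((V - neg_log_g) ** 2).mean()                 # Monte Carlo estimate (15)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Note that both $X$ and $V$ are driven by the same batch of Brownian increments `dB`, as required by (11)-(12).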

Importantly, if the control function $c$ in Step 1 does depend on the current parameters $(\theta_0, \theta)$, the gradient operations executed in Step 6 should not be propagated through the control function $c$. A standard stop-gradient operation, available in most popular automatic differentiation frameworks, can be used for this purpose.

Time-discretization of diffusions. For clarity of exposition, we have described our algorithm in continuous time. In practice, one has to discretize these diffusion processes, which is entirely straightforward. Although any numerical integrator could potentially be considered, the experiments in Section 5 employed the standard Euler-Maruyama scheme (Kloeden & Platen, 1992).

Parametrizations of the functions $N_0$ and $N$. In all numerical experiments presented in Section 5, the functions $N_0$ and $N$ are parametrized with fully-connected neural networks with two hidden layers, a number of neurons that grows linearly with the dimension $d$, and the Leaky ReLU activation function in all but the last layer. Future work could explore other neural network architectures for our setting. In situations that are close to a Gaussian setting (e.g. an Ornstein-Uhlenbeck process observed with additive Gaussian noise), where the value function has the form $v(x, y, t) = \langle x, a(y, t)x \rangle + \langle b(y, t), x \rangle + c(y, t)$, a more parsimonious parametrization could certainly be exploited. Furthermore, the function $N(x, y, t)$ could be parametrized to automatically satisfy the terminal condition $N(x, y, T) = -[\sigma^\top \nabla \log g](x, y)$. A possible approach consists in setting $N(x, y, t) = (1 - t/T)\, \widetilde{N}(x, y, t) - (t/T)\, [\sigma^\top \nabla \log g](x, y)$ for some neural network $\widetilde{N} : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$. These strategies have not been used in the experiments of Section 5.

Choice of controlled dynamics. In challenging scenarios where observations are highly informative and/or extreme under the model, choosing a good control function to implement Step 1 of the proposed algorithm can be crucial.
We focus on two possible implementations:

• CDT static scheme: a simple (and naive) choice is to use no control, i.e. $c(x, y, t) \equiv 0 \in \mathbb{R}^d$ for all $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times [0, T]$.
• CDT iterative scheme: use the current approximation of the optimal control $c^\star$ described by the parameters $(\theta_0, \theta) \in \Theta_0 \times \Theta$. This corresponds to setting $c(x, y, t) = -N(x, y, t)$.

While the static control approach can perform reasonably well in some situations, our results in Section 5 suggest that the iterative control procedure is a more reliable strategy. This is consistent with findings in the stochastic optimal control literature (Thijssen & Kappen, 2015; Pereira et al., 2019). This choice of control function drives the forward process $X_t^{c,y}$ to regions of the state-space where the likelihood function is large, and helps mitigate convergence and stability issues. Furthermore, Section 5 reports that, at convergence, the solutions $N_0$ and $N$ obtained under the two schemes can be significantly different. The iterative control procedure leads to more accurate solutions and, ultimately, better performance when used for online filtering.
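In PyTorch, the stop-gradient required by the iterative scheme amounts to a single `detach()` call when evaluating the control. The fragment below is a minimal sketch, with a stand-in linear layer playing the role of the network $N$ in dimension $d = 1$.

```python
import torch

N = torch.nn.Linear(3, 1)   # stand-in for the network N(x, y, t) with d = 1

def control(x, y, t):
    # Iterative CDT scheme: c(x, y, t) = -N(x, y, t). The detach() is the
    # stop-gradient: loss gradients from Step 6 must not flow through the
    # control chosen in Step 1.
    return -N(torch.cat([x, y, t], dim=1)).detach()
```

The same network is still trained through the $V_t$ dynamics; only its use as a control in the forward simulation is excluded from the computational graph.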

3.3. ONLINE FILTERING

Before performing online filtering, we first run the CDT algorithm described in Section 3.2 to construct an approximation of the optimal control $c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t)$. For concreteness, denote by $\widehat{c} : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$ the resulting approximate control, i.e. $\widehat{c}(x, y, t) = -N(x, y, t)$ where $N(\cdot, \cdot, \cdot)$ is parametrized by the final parameter $\theta \in \Theta$. Similarly, denote by $\widehat{V}_0 : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ the approximation of the initial value function $v(x, y, 0) = -\log h(x, y, 0)$, i.e. $\widehat{V}_0(x, y) = N_0(x, y)$ where $N_0(\cdot, \cdot)$ is parametrized by the final parameter $\theta_0 \in \Theta_0$. To implement online filtering with $M \geq 1$ particles, consider a current approximation $\pi_k(dx) \approx M^{-1} \sum_{j=1}^M \delta(dx; x_{t_k}^j)$ of the filtering distribution at time $t_k \geq 0$. Given the future observation $Y_{k+1} = y_{k+1}$, the particles $x_{t_k}^{1:M}$ are then propagated forward by exploiting the approximately optimal control $(x, t) \mapsto \widehat{c}(x, y_{k+1}, t - t_k)$. In particular, $\widetilde{x}_{t_{k+1}}^j$ is obtained by setting $\widetilde{x}_{t_{k+1}}^j = \widetilde{X}_{t_{k+1}}^j$, where $\{\widetilde{X}_t^j\}_{t \in [t_k, t_{k+1}]}$ follows the controlled diffusion
$$d\widetilde{X}_t^j = \underbrace{\mu(\widetilde{X}_t^j)\,dt + \sigma(\widetilde{X}_t^j)\,dB_t^j}_{\text{original dynamics}} + \underbrace{[\sigma \widehat{c}](\widetilde{X}_t^j, y_{k+1}, t - t_k)\,dt}_{\text{approximately optimal control}} \quad (16)$$
initialized at $\widetilde{X}_{t_k}^j = x_{t_k}^j$. Each propagated particle $\widetilde{x}_{t_{k+1}}^j$ is associated with a normalized weight $\overline{W}_{k+1}^j = W_{k+1}^j / \sum_{i=1}^M W_{k+1}^i$, where $W_{k+1}^j = (d\mathbb{P}_{[t_k, t_{k+1}]} / d\mathbb{P}_{[t_k, t_{k+1}]}^{\widehat{c}, y_{k+1}})(\widetilde{X}_{[t_k, t_{k+1}]}^j) \times g(\widetilde{x}_{t_{k+1}}^j, y_{k+1})$. We recall that the probability measures $\mathbb{P}_{[t_k, t_{k+1}]}$ and $\mathbb{P}_{[t_k, t_{k+1}]}^{\widehat{c}, y_{k+1}}$ correspond to the original and controlled diffusions on the interval $[t_k, t_{k+1}]$. Girsanov's theorem, as described in Equation (5), implies that
$$W_{k+1}^j = \exp\left( -\frac{1}{2} \int_{t_k}^{t_{k+1}} \|Z_t^j\|^2\,dt + \int_{t_k}^{t_{k+1}} \langle Z_t^j, dB_t^j \rangle + \log g(\widetilde{x}_{t_{k+1}}^j, y_{k+1}) \right) \quad (17)$$
where $Z_t^j = -\widehat{c}(\widetilde{X}_t^j, y_{k+1}, t - t_k)$.
Similarly to Equation (11), consider the process $\{V_t^j\}_{t \in [t_k, t_{k+1}]}$ defined by the dynamics $dV_t^j = -\frac{1}{2} \|Z_t^j\|^2\,dt + \langle Z_t^j, dB_t^j \rangle$ with initialization $V_{t_k}^j = \widehat{V}_0(x_{t_k}^j, y_{k+1})$. The weight can therefore be rewritten as
$$W_{k+1}^j = \exp\Big( \underbrace{V_{t_{k+1}}^j + \log g(\widetilde{x}_{t_{k+1}}^j, y_{k+1})}_{\approx\, 0} \Big)\, \exp\left( -\widehat{V}_0(x_{t_k}^j, y_{k+1}) \right),$$
and computed by numerically integrating the process $\{V_t^j\}_{t \in [t_k, t_{k+1}]}$. Given the definition of the loss function in (14), we can expect the term within the first exponential to be close to zero. In the ideal case where $\widehat{c}(x, y, t) \equiv c^\star(x, y, t)$ and $\widehat{V}_0(x, y) \equiv -\log h(x, y, 0)$, one recovers the exact FA-APF weights in (3). Once the unnormalized weights (17) are computed, the resampling steps are identical to those described in Section 2.2 for a standard PF. For practical implementations, all the processes involved in the proposed methodology can be straightforwardly time-discretized. To distinguish between CDT learning with static or iterative control, we shall refer to the resulting approximations of the FA-APF as Static-APF and Iterative-APF respectively. We note that these APFs do not involve modified resampling probabilities as described e.g. in Chopin et al. (2020, p. 145).
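One assimilation step of this controlled APF can be sketched on the scalar OU model, using the exact control $c^\star$ of Appendix A.2 as a stand-in for the learned network. All names and constants below are ours and purely illustrative; the point is that, for an informative observation, the controlled proposal yields far less weight degeneracy than the BPF proposal.

```python
import numpy as np

# One assimilation step on dX = -X dt + dB with g = N(y; x, sy^2),
# using the exact OU control as a stand-in for the learned c-hat.
T, sy, y1 = 1.0, 0.125, 1.5        # informative observation (illustrative values)
n_steps, M = 100, 1000
dt = T / n_steps
rng = np.random.default_rng(1)

def c_star(x, t):
    s = T - t
    var_h = (1.0 - np.exp(-2.0 * s)) / 2.0 + sy ** 2
    return (y1 - x * np.exp(-s)) * np.exp(-s) / var_h   # sigma^T grad log h

x = rng.normal(0.0, np.sqrt(0.5), size=M)   # particles x_{t_k}^{1:M}
x_bpf = x.copy()                            # same particles, uncontrolled proposal
logw = np.zeros(M)                          # accumulates the Girsanov term in (17)
for i in range(n_steps):
    z = -c_star(x, i * dt)                  # Z_t = -c
    db = np.sqrt(dt) * rng.normal(size=M)
    logw += -0.5 * z ** 2 * dt + z * db
    x += (-x - z) * dt + db                 # controlled dynamics (16): mu + sigma*c
    x_bpf += -x_bpf * dt + np.sqrt(dt) * rng.normal(size=M)
logw += -0.5 * ((y1 - x) / sy) ** 2         # + log g, up to a constant
logw_bpf = -0.5 * ((y1 - x_bpf) / sy) ** 2

def ess(lw):
    w = np.exp(lw - lw.max()); w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```

With the observation roughly two prior standard deviations out and a small `sy`, the controlled weights stay well-balanced while the BPF weights collapse onto a handful of particles.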

4. RELATED WORK

This section positions our work within the existing literature. MCMC methods: several works have developed MCMC methods for smoothing and parameter estimation of SDEs; for example, Roberts & Stramer (2001) propose to treat paths between observations as missing data. Our work concentrates on the online filtering problem, which cannot be tackled with MCMC methods. Exact simulation: several methods have been proposed to reduce or eliminate the bias due to discretization (Beskos et al., 2006a;b; Fearnhead et al., 2008; 2010); these methods typically rely on the Lamperti transform, which is only rarely available in multivariate settings. Furthermore, when filtering diffusions with highly informative observations, the discretization bias is often orders of magnitude smaller than other sources of error. We also stress that our method is generic: it does not exploit any specific structure of the diffusion process being assimilated. Gaussian assumptions: in the data-assimilation literature, methods based on variations of the Ensemble Kalman Filter (EnKF) (Evensen, 2003) have been successfully deployed in applied scenarios and very high-dimensional settings. These methods do rely on strong Gaussian assumptions and are inappropriate for highly nonlinear and non-Gaussian models; they typically achieve lower variance by increasing the bias. In contrast, our method is asymptotically exact in the limit where the number of particles $M \to \infty$ (up to discretization error). We do not expect our method to be competitive with this class of (approximate) methods in the very high-dimensional settings that are common in numerical weather forecasting: our method is designed to filter diffusion processes in low or moderate dimensional settings. It is likely that scaling our method to truly high-dimensional settings with effective dimension $D \gg 10^2$ would require introducing model-specific approximations (e.g. localization strategies).
Steering particles towards observations: particle methods pioneered by Van Leeuwen (2010) are based on this natural principle in order to mitigate the collapse of PFs in high-dimensional settings. These methods typically rely on some model structure (e.g. a linear Gaussian observation process) and have a number of tuning parameters. They can be understood as parametrizing a linear control, which is only expected to work well for linear Gaussian dynamics, admittedly very important in applications such as geoscience. Implicit Particle Filter: the method of Chorin et al. (2010) attempts to transform standard i.i.d. Gaussian samples into samples from the optimal proposal density. Implementing this methodology requires a number of assumptions and involves solving a non-convex optimization problem for each particle at each time step, which can quickly become computationally burdensome. Guided Intermediate Resampling Filters (GIRF): the methods of Del Moral & Murray (2015) and Park & Ionides (2020) propagate particles at intermediate times between observations using the original dynamics and trigger resampling steps based on guiding functions that forecast the likelihood of future observations. The choice of guiding functions is crucial for good algorithmic performance. We note that GIRF is in fact intimately related to Doob's h-transform, as the optimal choice of guiding functions is given by (6) (Park & Ionides, 2020). However, even under this optimal choice, the resulting GIRF is still sub-optimal compared to an APF that moves particles using the optimal control induced by Doob's h-transform, i.e. it is better to move particles well than to rely on weighting and resampling. The latter behaviour is supported by our numerical experiments. Appendix A.5 details our GIRF implementation and its connection to Doob's h-transform.

5. EXPERIMENTS

We performed numerical experiments on three different models: an Ornstein-Uhlenbeck model, a (nonlinear) logistic diffusion model, and a (nonlinear) diffusion model describing cell differentiation. This section presents experiments on the Ornstein-Uhlenbeck model; the two other studies can be found in Appendix A. The cost of the CDT training procedure, performed once on CPU, is negligible when compared to the cost of running filters with many particles and/or assimilating a large number of observations. The inter-observation time was $T = 1$ and we employed the Euler-Maruyama integrator with a stepsize of 0.02 for all examples. Our results are not sensitive to the choice of $T$ and discretization stepsize as long as the latter is sufficiently small. We report the Effective Sample Size (ESS) averaged over observation times and independent repetitions, the evidence lower bound (ELBO) $\mathbb{E}[\log \widehat{p}(y_1, \ldots, y_K)]$, and the variance $\mathrm{Var}[\log \widehat{p}(y_1, \ldots, y_K)]$, where $\widehat{p}(y_1, \ldots, y_K)$ denotes the unbiased estimator of the marginal likelihood $p(y_1, \ldots, y_K)$ of the time-discretized filter. When testing particle filters with a varying number of observations $K$, we increased the number of particles $M$ linearly with $K$ to keep the marginal likelihood estimators stable (Bérard et al., 2014). For non-toy models, our GIRF implementation relies on a sub-optimal but practical choice of guiding functions that gradually introduces information from the future observation by annealing the observation density using a linear (Linear-GIRF) or quadratic (Quadratic-GIRF) schedule.
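The ESS reported here is the standard quantity $\mathrm{ESS} = 1 / \sum_j (\overline{W}^j)^2$ computed from the normalized weights at each observation time; a numerically stable sketch (our own helper, working in log space) is:

```python
import numpy as np

def ess(log_weights):
    """Effective sample size 1 / sum(W_bar^2) from unnormalized log-weights.

    Subtracting the max before exponentiating avoids overflow/underflow.
    """
    w = np.exp(log_weights - np.max(log_weights))
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```

The ESS ranges from 1 (all mass on one particle, complete weight degeneracy) to $M$ (perfectly uniform weights), which is why it is a natural summary of filter health averaged over observation times.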

5.1. ORNSTEIN-UHLENBECK MODEL

Consider a $d$-dimensional Ornstein-Uhlenbeck process given by (1) with $\mu(x) = -x$, $\sigma(x) = I_d$, and the Gaussian observation model $g(x, y) = \mathcal{N}(y; x, \sigma_Y^2 I_d)$. We chose $\eta^X = \mathcal{N}(0_d, I_d/2)$, the stationary distribution, and $\eta^Y = \mathcal{N}(0_d, (1/2 + \sigma_Y^2) I_d)$, the implied distribution of the observation, when training neural networks with the CDT iterative scheme. We took different values of $\sigma_Y \in \{0.125, 0.25, 0.5, 1.0\}$ to vary the informativeness of observations and $d \in \{1, 2, 4, 8, 16, 32\}$ to illustrate the impact of dimension. Analytical tractability in this example (Appendix A.2) allows us to consider three idealized particle filters, namely an APF with exact networks (Exact-APF), the FA-APF, and GIRF with optimal guiding functions (Appendix A.5). Comparing our proposed Iterative-APF to the Exact-APF and FA-APF enables us to distinguish between neural network approximation errors and time-discretization errors. We note that all PFs except the FA-APF involve time-discretization. Columns 1 to 4 of Figure 1 summarize our numerical findings when filtering simulated observations from the model with varying $\sigma_Y$ and fixed $d = 1$. We see that the performance of the BPF deteriorates as the observations become more informative, which is to be expected. Furthermore, when $\sigma_Y$ is small, the impact of our neural network approximation and time-discretization becomes more noticeable. For the values of $\sigma_Y$ and the number of observations $K$ considered, Iterative-APF had substantial gains in efficiency over the BPF and typically outperformed GIRF. From Column 5, we note that these gains over the BPF become very large when we filter $K = 100$ observations simulated with observation standard deviations that are multiples of the value $\sigma_Y = 0.25$ used to run the filters. In particular, while the ELBO of the BPF diverges as we increase the degree of noise in the simulated observations, the ELBOs of Iterative-APF and GIRF remain stable.
Figure 2 shows the impact of increasing the dimension $d$ with fixed $\sigma_Y = 1.0$ when filtering simulated observations from the model. Due to the curse of dimensionality (Snyder et al., 2008; 2015), it is not surprising that the performance of all PFs degrades with dimension. Although the error of our neural network approximation becomes more pronounced when $d$ is large, the gain in efficiency of Iterative-APF relative to the BPF is very significant in the higher dimensional regime, particularly so when the number of observations $K$ is also large. Iterative-APF also outperformed GIRF in most settings, with comparable performance when $d$ is large.

6. DISCUSSION

This paper introduced the CDT algorithm, a Sequential Monte Carlo method for online filtering of diffusion processes evolving in state-spaces of low to moderate dimension. In contrast to a number of existing methods, the CDT approach is general and does not exploit any particular structure of the diffusion process. Furthermore, numerical simulations suggest that the CDT algorithm is especially worthwhile compared to competing approaches (e.g. BPF or GIRF) in higher dimensional settings or when the observations are highly informative. Ongoing work involves extending the CDT framework to parameter estimation and experimenting with alternative formulations and/or parametrizations to accelerate the training procedure.

A APPENDIX

A.1 DOOB'S h-TRANSFORM

This section gives a heuristic derivation of Equation (8), which describes the optimal control. To simplify notation, we denote the conditioned process $X_{[0,T]} \mid (Y_T = y)$ by $\widetilde{X}_{[0,T]}$. Recall the function $h(x, y, t) = \mathbb{E}[g(X_T, y) \mid X_t = x] = \int_{\mathcal{X}} g(x_T, y)\, p_{T-t}(dx_T \mid x)$, which gives the probability of observing $Y_T = y$ when the diffusion process has state $x \in \mathcal{X}$ at time $t \in [0, T]$. The definition in (6) implies that the function $h : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}_+$ satisfies the backward Kolmogorov equation (Oksendal, 2013), $(\partial_t + \mathcal{L})h = 0$, with terminal condition $h(x, y, T) = g(x, y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. For $\varphi : \mathcal{X} \to \mathbb{R}$ and an infinitesimal increment $\delta > 0$, we have
$$\mathbb{E}[\varphi(\widetilde{X}_{t+\delta}) \mid \widetilde{X}_t = x] = \mathbb{E}[\varphi(X_{t+\delta})\, g(X_T, y) \mid X_t = x] \,/\, \mathbb{E}[g(X_T, y) \mid X_t = x]$$
$$= \mathbb{E}[\varphi(X_{t+\delta})\, h(X_{t+\delta}, y, t+\delta) \mid X_t = x] \,/\, h(x, y, t) = \varphi(x) + \delta\, \frac{\mathcal{L}[\varphi h]}{h}(x, y, t) + \mathcal{O}(\delta^2).$$
Furthermore, since the function $h$ satisfies (7), some algebra shows that $\mathcal{L}[\varphi h]/h = \mathcal{L}\varphi + \langle \sigma\sigma^\top \nabla \log h, \nabla \varphi \rangle$. Taking $\delta \to 0$, this heuristic derivation shows that the generator of the conditioned diffusion equals $\mathcal{L}\varphi + \langle \sigma\sigma^\top \nabla \log h, \nabla \varphi \rangle$. Hence $\widetilde{X}_{[0,T]}$ satisfies the dynamics of a controlled diffusion (4) with control function $c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t)$. This proves Equation (8).

A.2 ANALYTICAL TRACTABILITY OF THE ORNSTEIN-UHLENBECK MODEL

The transition probability of the Ornstein-Uhlenbeck process considered in Section 5.1 is $p_t(dx' \mid x) = \mathcal{N}(x'; \mu_X(x, t), \sigma_X^2(t) I_d)\,dx'$ for time $t > 0$, with mean $\mu_X(x, t) = x \exp(-t)$ and variance $\sigma_X^2(t) = \{1 - \exp(-2t)\}/2$. From (6), we have
$$h(x, y, t) = \int_{\mathbb{R}^d} \mathcal{N}(y; x_T, \sigma_Y^2 I_d)\, \mathcal{N}(x_T; \mu_X(x, T-t), \sigma_X^2(T-t) I_d)\, dx_T = (2\pi)^{-d/2} \sigma_X^{-d}(T-t)\, \sigma_Y^{-d}\, \sigma_h^d(T-t) \exp\left\{ \frac{\sigma_h^2(T-t)}{2} \left\| \frac{\mu_X(x, T-t)}{\sigma_X^2(T-t)} + \frac{y}{\sigma_Y^2} \right\|^2 \right\} \exp\left\{ -\frac{\|\mu_X(x, T-t)\|^2}{2\sigma_X^2(T-t)} - \frac{\|y\|^2}{2\sigma_Y^2} \right\},$$
where $\sigma_h^2(t) = \{\sigma_X^{-2}(t) + \sigma_Y^{-2}\}^{-1}$. Hence we can compute the value function $v(x, y, t) = -\log h(x, y, t)$ in closed form. Next, the optimal control function is
$$c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t) = \frac{\sigma_h^2(T-t) \exp\{-(T-t)\}}{\sigma_X^2(T-t)} \left( \frac{\mu_X(x, T-t)}{\sigma_X^2(T-t)} + \frac{y}{\sigma_Y^2} \right) - \frac{\exp\{-(T-t)\}}{\sigma_X^2(T-t)}\, \mu_X(x, T-t).$$
The distribution of $X_T$ conditioned on $X_0 = x_0$ and $Y_T = y$ is $\mathcal{N}(\mu_h(x_0, y, T), \sigma_h^2(T) I_d)$ with
$$\mu_h(x_0, y, T) = \sigma_h^2(T) \left( \frac{\mu_X(x_0, T)}{\sigma_X^2(T)} + \frac{y}{\sigma_Y^2} \right).$$
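As a sanity check on the algebra above, the optimal control admits an equivalent direct form $\nabla_x \log \mathcal{N}(y; \mu_X(x, T-t), \{\sigma_X^2(T-t) + \sigma_Y^2\} I_d)$, since $h$ is a Gaussian convolution. The following NumPy sketch (variable names are ours; $\sigma = I_d$ as in Section 5.1) evaluates both expressions so they can be compared numerically:

```python
import numpy as np

# Closed-form OU quantities from the formulas above (sigma = I_d).

def mu_X(x, t):
    # Mean of the OU transition: X_t | X_0 = x is N(x e^{-t}, sigma_X^2(t) I_d)
    return x * np.exp(-t)

def var_X(t):
    # Variance sigma_X^2(t) = (1 - e^{-2t}) / 2
    return (1.0 - np.exp(-2.0 * t)) / 2.0

def c_star(x, y, t, T, var_Y):
    # Optimal control c*(x, y, t) = sigma^T grad log h, as displayed above
    tau = T - t
    sX2 = var_X(tau)
    sh2 = 1.0 / (1.0 / sX2 + 1.0 / var_Y)
    m = mu_X(x, tau)
    return (sh2 * np.exp(-tau) / sX2) * (m / sX2 + y / var_Y) \
        - (np.exp(-tau) / sX2) * m

def c_star_direct(x, y, t, T, var_Y):
    # Equivalent form: grad_x log N(y; mu_X(x, tau), (sigma_X^2(tau) + sigma_Y^2) I_d)
    tau = T - t
    return np.exp(-tau) * (y - mu_X(x, tau)) / (var_X(tau) + var_Y)
```

The two functions agree up to floating-point rounding for any $(x, y, t)$ with $0 \le t < T$.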

A.3 LOGISTIC DIFFUSION MODEL

In this section we consider a logistic diffusion process (Dennis & Costantino, 1988; Knape & De Valpine, 2012) to model the dynamics of a population size $\{P_t\}_{t \geq 0}$, defined by
$$dP_t = (\theta_3^2/2 + \theta_1 - \theta_2 P_t) P_t\, dt + \theta_3 P_t\, dB_t. \qquad (22)$$
We apply the Lamperti transformation $X_t = \log(P_t)/\theta_3$ and work with the process $\{X_t\}_{t \geq 0}$ that satisfies (1) with $\mu(x) = \theta_1/\theta_3 - (\theta_2/\theta_3)\exp(\theta_3 x)$ and $\sigma(x) = 1$. Following Knape & De Valpine (2012), we adopt a negative binomial observation model $g(x, y) = \mathcal{NB}(y; \theta_4, \exp(\theta_3 x))$ for counts $y \in \mathbb{N}_0$, with dispersion $\theta_4 > 0$ and mean $\exp(\theta_3 x)$. We set $(\theta_1, \theta_2, \theta_3, \theta_4)$ to the parameter estimates obtained in Knape & De Valpine (2012). Noting that (22) admits a Gamma distribution with shape parameter $2(\theta_3^2/2 + \theta_1)/\theta_3^2 - 1$ and rate parameter $2\theta_2/\theta_3^2$ as its stationary distribution (Dennis & Costantino, 1988), we select $\eta_X$ as its push-forward under the Lamperti transformation and $\eta_Y$ as the implied distribution of the observation when training neural networks under both static and iterative CDT schemes. To induce varying levels of informativeness in the observations, we considered $\theta_4 \in \{1.069, 4.303, 17.631, 78.161\}$. Figure 3 displays our filtering results for various numbers of observations simulated from the model (Columns 1 to 4) and for $K = 100$ observations simulated with observation standard deviations larger than those implied by the value $\theta_4 = 17.631$ used to run the filters (Column 5). In the latter setup, we varied $\theta_4$ in the negative binomial observation model used for simulation to induce larger standard deviations. The behaviour of BPF and Iterative-APF is similar to the previous example as the observations become more informative with larger values of $\theta_4$. Iterative-APF outperformed all other algorithms over all combinations of $\theta_4$ and $K$ considered, and also when filtering observations that are increasingly extreme under the model.
We note also that the APFs trained using the CDT static scheme can sometimes give unstable results, particularly in challenging scenarios.
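The state dynamics and observation model above can be simulated directly. The following sketch (our own; the $\theta$ values below are illustrative placeholders, not the estimates of Knape & De Valpine, 2012) uses Euler-Maruyama steps between observation times and NumPy's $(n, p)$ parameterization of the negative binomial, with $n = \theta_4$ and $p = \theta_4/(\theta_4 + \text{mean})$ so that the mean equals $\exp(\theta_3 x)$:

```python
import numpy as np

# Illustrative placeholder parameters (NOT the published estimates)
theta1, theta2, theta3, theta4 = 0.18, 0.0003, 0.12, 17.631

def drift(x):
    # Lamperti-transformed drift mu(x) = theta1/theta3 - (theta2/theta3) exp(theta3 x)
    return theta1 / theta3 - (theta2 / theta3) * np.exp(theta3 * x)

def simulate(x0, T=1.0, n_steps=100, n_obs=50, rng=None):
    """Euler-Maruyama with unit diffusion between observation times;
    one negative binomial count observation per interval of length T."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x, xs, ys = x0, [], []
    for _ in range(n_obs):
        for _ in range(n_steps):
            x = x + drift(x) * dt + np.sqrt(dt) * rng.standard_normal()
        mean = np.exp(theta3 * x)  # observation mean exp(theta3 x)
        # NB with dispersion theta4 and the given mean:
        y = rng.negative_binomial(theta4, theta4 / (theta4 + mean))
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

Starting the chain near the (log-transformed) stationary mean keeps the simulated trajectory in a plausible region of the state-space.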

A.4 CELL MODEL

This section examines a cell differentiation and development model from Wang et al. (2011). Cellular expression levels $X_t = (X_{t,1}, X_{t,2})$ of two genes are modelled by (1) with drift
$$\mu(x) = \begin{pmatrix} x_1^4/(2^{-4} + x_1^4) + 2^{-4}/(2^{-4} + x_2^4) - x_1 \\ x_2^4/(2^{-4} + x_2^4) + 2^{-4}/(2^{-4} + x_1^4) - x_2 \end{pmatrix}, \qquad (23)$$
whose three terms describe self-activation, mutual inhibition and inactivation respectively; the volatility captures intrinsic and external fluctuations. We initialize the diffusion process from the undifferentiated state $X_0 = (1, 1)$ and consider the Gaussian observation model $g(x, y) = \mathcal{N}(y; x, \sigma_Y^2 I_2)$. To train neural networks under both static and iterative CDT schemes, we selected $\eta_X$ and $\eta_Y$ as the empirical distributions obtained by simulating states and observations from the model for 2000 time units. Figure 4 illustrates our numerical results for various numbers of observations $K$ and $\sigma_Y \in \{0.25, 0.5, 1.0, 2.0\}$. It shows that Iterative-APF offers significant gains over all other algorithms when filtering observations that are informative (Columns 1 to 4) and highly extreme under the model specification of $\sigma_Y = 0.5$ (Column 5). In this example, Static-APF did not exhibit any unstable behaviour and its performance lies between those of BPF and Iterative-APF.
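As a quick sketch (our own notation), the drift in (23) can be coded directly; a useful check is that the undifferentiated initial state $x = (1, 1)$ is an equilibrium of the drift, since the activation and inhibition terms there sum exactly to the inactivation term:

```python
import numpy as np

def mu(x):
    # Drift (23) of the two-gene cell model: self-activation, mutual
    # inhibition and inactivation, with Hill threshold 2^{-4}
    x1, x2 = x
    s = 2.0 ** -4
    return np.array([
        x1 ** 4 / (s + x1 ** 4) + s / (s + x2 ** 4) - x1,  # gene 1
        x2 ** 4 / (s + x2 ** 4) + s / (s + x1 ** 4) - x2,  # gene 2
    ])
```

At $x = (1, 1)$ each component reads $1/(1 + 2^{-4}) + 2^{-4}/(1 + 2^{-4}) - 1 = 0$, consistent with initializing the diffusion there.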

A.5 GUIDED INTERMEDIATE RESAMPLING FILTERS

We first describe our implementation of GIRF for online filtering. For $M \geq 1$ particles, let $\pi_k(dx) = M^{-1} \sum_{j=1}^M \delta(dx; x^j_{t_k})$ denote a current approximation of the filtering distribution at time $t_k \geq 0$. Given the future observation $Y_{k+1} = y_{k+1}$ at time $t_{k+1}$, GIRF introduces a sequence of intermediate time steps $t_k = s_0 < s_1 < \cdots < s_P = t_{k+1}$ between the observation times, and a sequence of guiding functions $\{G_p\}_{p=0}^P$ specified below.



Figure 1: Results for Ornstein-Uhlenbeck model with d = 1 based on 100 independent repetitions of each PF. The ELBO gap in the second row is relative to FA-APF.

Figure 3: Results for logistic diffusion model based on 100 independent repetitions of each PF. The ELBO gap in the second row is relative to Iterative-APF.

Figure 4: Results for cell model based on 100 independent repetitions of each PF. The ELBO gap in the second row is relative to Iterative-APF.

(b) Evolution of neural network $N_0(x, y)$ (black to copper) approximating the initial value function $v(x, y, 0)$ (red) over the first 500 optimization iterations for a typical (left) and an extreme (right) observation $y$. (c) Evolution of neural network $-N(x, y, t)$ (black to copper) approximating the optimal control function $c^\star(x, y, t)$ (red) over the first 500 optimization iterations for a typical (upper row) and an extreme (lower row) observation $y$.

Figure 5: Results for Ornstein-Uhlenbeck model with d = 1 and σ Y = 1.0 during initial training phase.

Figure 6: Results for Ornstein-Uhlenbeck model with d = 1 and σ Y = 1.0 after training.

Figure 7: Results for logistic diffusion model with θ 4 = 1.069 during initial training phase.

Figure 8: Results for logistic diffusion model with θ 4 = 1.069 after training.

Figure 9: Results for cell model with σ Y = 0.5 after training.


The guiding functions $\{G_p\}_{p=0}^P$ satisfy
$$G_0(x_{s_0}, y_{k+1}) \prod_{p=1}^P G_p(x_{s_{p-1}}, x_{s_p}, y_{k+1}) = g(x_{t_{k+1}}, y_{k+1}). \qquad (24)$$
For each intermediate step $p \in \{0, 1, \ldots, P-1\}$, the particles $x^{1:M}_{s_p}$ are propagated forward according to the original SDE (1), i.e. $x^j_{s_{p+1}} \sim p_{\Delta s_{p+1}}(dx \mid x^j_{s_p})$ with stepsize $\Delta s_{p+1} = s_{p+1} - s_p$. In practice, this propagation step can be replaced by a numerical integrator. Each particle $x^j_{s_p}$ is then associated with a normalized weight computed from the unnormalized weights
$$W^j_p = G_p(x^j_{s_{p-1}}, x^j_{s_p}, y_{k+1}), \quad p \in \{1, \ldots, P-1\},$$
$$W^j_P = G_P(x^j_{s_{P-1}}, x^j_{s_P}, y_{k+1})\, G_0(x^j_{s_P}, y_{k+2}), \quad \text{if } t_{k+1} \text{ is not the final observation time},$$
$$W^j_P = G_P(x^j_{s_{P-1}}, x^j_{s_P}, y_{k+1}), \quad \text{if } t_{k+1} \text{ is the final observation time}.$$
After the unnormalized weights are computed, the resampling operation is the same as in a standard PF (see Section 2.2). From the above description, we see that the role of $\{G_p\}_{p=0}^P$ is to guide particles to appropriate regions of the state-space using the weighting and resampling steps. The optimal choice of guiding functions (Park & Ionides, 2020) is
$$G_0(x_{s_0}, y_{k+1}) = h(x_{s_0}, y_{k+1}, s_0), \qquad G_p(x_{s_{p-1}}, x_{s_p}, y_{k+1}) = \frac{h(x_{s_p}, y_{k+1}, s_p)}{h(x_{s_{p-1}}, y_{k+1}, s_{p-1})} \qquad (25)$$
for $p \in \{1, \ldots, P\}$, where $h : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}_+$ defined in (6) is given by Doob's $h$-transform. The condition (24) is satisfied as we have a telescoping product and $h(x_{t_{k+1}}, y_{k+1}, t_{k+1}) = g(x_{t_{k+1}}, y_{k+1})$. For the Ornstein-Uhlenbeck model of Section 5.1, we leveraged the analytical tractability of (25) in our implementation of GIRF. When the optimal choice (25) is intractable, one sub-optimal but practical choice, which gradually introduces information from the future observation by annealing the observation density, is
$$G_0(x_{s_0}, y_{k+1}) = g(x_{s_0}, y_{k+1})^{\lambda_0}, \qquad G_p(x_{s_{p-1}}, x_{s_p}, y_{k+1}) = \frac{g(x_{s_p}, y_{k+1})^{\lambda_p}}{g(x_{s_{p-1}}, y_{k+1})^{\lambda_{p-1}}}$$
for $p \in \{1, \ldots, P\}$, where $\{\lambda_p\}_{p=0}^P$ is a non-decreasing sequence with $\lambda_P = 1$. This construction clearly satisfies the condition in (24). It is interesting to note that under the choice $\lambda_p = 0$ for $p \in \{0, 1, \ldots, P-1\}$, GIRF recovers the BPF.
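To make the mechanics of the annealed guiding functions concrete, here is a minimal sketch of one GIRF assimilation step (our own illustrative example, not the paper's implementation): a one-dimensional OU model with Gaussian likelihood, Euler-Maruyama propagation between intermediate times, a linear annealing schedule $\lambda_p = p/P$ (so $\lambda_0 = 0$ and $G_0 \equiv 1$), and multinomial resampling at every intermediate step:

```python
import numpy as np

def log_g(x, y, sigma_Y):
    # Gaussian observation log-density log N(y; x, sigma_Y^2)
    return -0.5 * ((y - x) / sigma_Y) ** 2 - 0.5 * np.log(2 * np.pi * sigma_Y ** 2)

def girf_step(x, y, T, P, rng, sigma_Y=0.5):
    """One GIRF step for the 1-d OU model dX = -X dt + dB with annealed
    guiding functions G_p = g(x_{s_p}, y)^{lam_p} / g(x_{s_{p-1}}, y)^{lam_{p-1}}."""
    M = x.shape[0]
    lam = np.linspace(0.0, 1.0, P + 1)  # linear schedule, lam_P = 1
    dt = T / P
    for p in range(1, P + 1):
        x_prev = x
        # Euler-Maruyama propagation over the intermediate interval (s_{p-1}, s_p]
        x = x - x * dt + np.sqrt(dt) * rng.standard_normal(M)
        # Unnormalized log-weight: lam_p log g(x_{s_p}) - lam_{p-1} log g(x_{s_{p-1}})
        logw = lam[p] * log_g(x, y, sigma_Y) - lam[p - 1] * log_g(x_prev, y, sigma_Y)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Multinomial resampling at every intermediate time
        x = x[rng.choice(M, size=M, p=w)]
    return x
```

With an informative observation, the resampled particles concentrate around the conditional mean of $X_{t_{k+1}}$ given $y_{k+1}$, illustrating how the guiding functions pull the ensemble toward the observation through weighting and resampling alone.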
In our numerical implementation, we considered both linear and quadratic annealing schedules $\{\lambda_p\}_{p=0}^P$, which determine the rate at which information from the future observation is introduced.

Lastly, we explain why GIRF with the optimal guiding functions (25) is still sub-optimal compared to an APF that moves particles using the optimal control $c^\star : \mathcal{X} \times \mathcal{Y} \times [0, T] \to \mathbb{R}^d$ induced by Doob's $h$-transform. Consider the law of $\{X_{s_p}\}_{p=1}^P$ conditioned on $X_{s_0} = x_{s_0}$ and $Y_{k+1} = y_{k+1}$, which is proportional to
$$\prod_{p=1}^P p_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}})\, g(x_{s_P}, y_{k+1}). \qquad (26)$$
Under the condition (24), we can write the law (26) as
$$G_0(x_{s_0}, y_{k+1}) \prod_{p=1}^P p_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}})\, G_p(x_{s_{p-1}}, x_{s_p}, y_{k+1}). \qquad (27)$$
GIRF can be understood as a Sequential Monte Carlo (SMC) algorithm (Chopin et al., 2020) approximating the law (27) with Markov transitions $\{p_{\Delta s_p}\}_{p=1}^P$ and potential functions $\{G_p\}_{p=0}^P$ given by (25). We can rewrite (27) as
$$h(x_{s_0}, y_{k+1}, s_0) \prod_{p=1}^P p^h_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}}), \qquad (28)$$
where the Markov transitions $\{p^h_{\Delta s_p}\}_{p=1}^P$ are defined as
$$p^h_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}}) = p_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}}) \frac{h(x_{s_p}, y_{k+1}, s_p)}{h(x_{s_{p-1}}, y_{k+1}, s_{p-1})} \qquad (29)$$
for $p \in \{1, \ldots, P\}$. By the Markov property, we have $h(x_{s_{p-1}}, y_{k+1}, s_{p-1}) = \int_{\mathcal{X}} p_{\Delta s_p}(dx_{s_p} \mid x_{s_{p-1}})\, h(x_{s_p}, y_{k+1}, s_p)$, hence (29) is a valid Markov transition kernel. Moreover, it follows from Dai Pra (1991, Theorem 2.1) that $\{p^h_{\Delta s_p}\}_{p=1}^P$ are the transition probabilities of the controlled diffusion process in (4) with optimal control $c^\star(x, y, t) = [\sigma^\top \nabla \log h](x, y, t)$. Hence an APF propagating particles according to this optimally controlled process can be seen as an SMC algorithm approximating (28) with Markov transitions $\{p^h_{\Delta s_p}\}_{p=1}^P$ and a single potential function $G_0$. By viewing GIRF and APF as specific instances of SMC algorithms, it is clear that the former is sub-optimal compared to the latter. Intuitively, this means that better particle approximations can be obtained by moving particles well instead of relying on weighting and resampling.

A.6 COMPUTATIONAL DOOB'S h-TRANSFORM ALGORITHM

In this section, we provide figures to illustrate how our proposed CDT algorithm behaves. We report the training curves (i.e. loss vs. iteration) and describe the evolution of the approximate control functions parametrized by the neural networks. In the analytically tractable Ornstein-Uhlenbeck case, comparison with the optimal control is possible. See Figures 5 and 6 for the Ornstein-Uhlenbeck model of Section 5.1, Figures 7 and 8 for the logistic diffusion model of Section A.3, and Figure 9 for the cell model of Section A.4.

