Φ-DVAE: LEARNING PHYSICALLY INTERPRETABLE REPRESENTATIONS WITH NONLINEAR FILTERING

Abstract

Incorporating unstructured data into physical models is a challenging problem that is emerging in data assimilation. Traditional approaches focus on well-defined observation operators whose functional forms are typically assumed to be known. This prevents these methods from achieving a consistent model-data synthesis in configurations where the mapping from data-space to model-space is unknown. To address these shortcomings, in this paper we develop a physics-informed dynamical variational autoencoder (Φ-DVAE) for embedding diverse data streams into time-evolving physical systems described by differential equations. Our approach combines a standard (possibly nonlinear) filter for the latent state-space model and a VAE, to embed the unstructured data stream into the latent dynamical system. A variational Bayesian framework is used for the joint estimation of the embedding, latent states, and unknown system parameters. To demonstrate the method, we look at three examples: video datasets generated by the advection and Korteweg-de Vries partial differential equations, and a velocity field generated by the Lorenz-63 system. Comparisons with relevant baselines show that the Φ-DVAE provides a data-efficient dynamics encoding methodology that is competitive with standard approaches, with the added benefit of incorporating a physically interpretable latent space.

1. INTRODUCTION

Physical models, as represented by ordinary, stochastic, or partial differential equations, are ubiquitous throughout engineering and the physical sciences. These differential equations are the synthesis of scientific knowledge into mathematical form. However, as a description of reality they are imperfect (Judd & Smith, 2004), leading to the well-known problem of model misspecification (Box, 1979). At least since Kalman (1960), physical modellers have sought to reconcile their models with observations (Anderson & Moore, 1979). Such approaches usually solve the inverse problem of attempting to recover model parameters from data, and/or the data assimilation (DA) problem of conducting state inference based on a time-evolving process. For the inverse problem, Bayesian methods are common (Tarantola, 2005; Stuart, 2010). In this approach, prior belief in model parameters Λ is updated with data y to give a posterior distribution, p(Λ|y). This describes the uncertainty in the parameters given the data and modelling assumptions. DA can also proceed from a Bayesian viewpoint, where inference is cast as a nonlinear state-space model (SSM) (Law et al., 2015; Reich & Cotter, 2015). The SSM is typically the combination of a time-discretised differential equation and an observation process: uncertainty enters the model through extrusive, additive errors. For a latent state variable u_n representing some discretised system at time n, with observations y_n, the object of interest is the filtering distribution p(u_n|y_{1:n}), where y_{1:n} := {y_k}_{k=1}^n. Additionally, the joint filtering and estimation problem, which estimates p(u_n, Λ|y_{1:n}), has received significant attention in the literature (see, e.g., Kantas et al. (2015) and references therein).
This has been well studied in, e.g., electrical engineering (Storvik, 2002), geophysics (Bocquet & Sakov, 2013), neuroscience (Ditlevsen & Samson, 2014), chemical engineering (Kravaris et al., 2013), biochemistry (Dochain, 2003), and hydrology (Moradkhani et al., 2005), to name a few. Typically in data assimilation tasks, while parameters of an observation model may be unknown, the observation model itself is assumed known (Kantas et al., 2015). This assumption breaks down in settings where data arrives in various modalities, such as videos, images, or audio, hindering the ability to perform inference. However, in such cases the underlying variation in the data stream is often due to a latent physical process, which is typically at least partially known. In this work, these data streams are video data and velocity fields. We develop a variational Bayes (VB) (Blei et al., 2017) methodology which jointly solves the inverse and filtering problems for the case in which the observation operator is unknown. We model this unknown mapping with a variational autoencoder (VAE) (Kingma & Welling, 2014), which encodes the assumed time-dependent observations y_{1:N} into pseudo-data x_{1:N} in a latent space. On this latent space, we stipulate that the pseudo-observations are taken from a known dynamical system, given by a stochastic ordinary differential equation (ODE) or partial differential equation (PDE) with possibly unknown coefficients. The differential equation is also assumed to have stochastic forcing, which accounts for possible model misspecification. The stipulated system gives a structured prior p(x_{1:N}|Λ), which acts as a physics-informed regulariser whilst also enabling inference over the unknown Λ. This prior is approximated using classical nonlinear filtering algorithms.
Our framework is fully probabilistic: inference proceeds from a derived evidence lower bound (ELBO), enabling joint estimation of unknown network parameters and unknown dynamical coefficients via VB. To set the scene for this work, we now review the relevant literature.

2. RELATED WORK

As introduced above, VAEs (Kingma & Welling, 2014) are a popular high-dimensional encoder. A VAE defines a generative model that learns low-dimensional representations, x, of high-dimensional data, y, using VB. To perform efficient inference, a variational approximation q_ϕ(x|y) is made to the intractable posterior p(x|y). Variational parameters ϕ are estimated via optimisation of the ELBO. This unsupervised learning approach infers latent representations of high-dimensional data. Recent works have extended the VAE to high-dimensional time-series data y_{1:N}, indexed by time n, with the aim of jointly learning latent representations x_{1:N} and a dynamical system that evolves them. These dynamical variational autoencoder (DVAE) methods (Girin et al., 2021) enforce the dynamics with a structured prior p(x_{1:N}) on the latent space. Various DVAE methods have been proposed. The Kalman variational autoencoder (KVAE) of Fraccaro et al. (2017) is a popular approach, which encodes y_{1:N} into latent variables x_{1:N} that are assumed to be observations of a linear Gaussian state-space model (LGSSM), driven by latent dynamic states u_{1:N}. Assumed linear dynamics are jointly learnt with the encoder and decoder, via Kalman filtering/smoothing. Another approach is the Gaussian process variational autoencoder (GPVAE) (Pearce, 2020; Jazbec et al., 2021; Fortuin et al., 2020), which models x_{1:N} as a temporally correlated Gaussian process (GP). The Markovian variant of Zhu et al. (2022) allows for a similar Kalman procedure as in the KVAE, except, in this instance, the dynamics are known and are given by a stochastic differential equation (SDE) approximation to the GP (Hartikainen & Sarkka, 2010). A related approach is provided for control applications in Watter et al. (2015); Hafner et al. (2019), where locally linear embeddings are estimated. Yildiz et al. (2019) also propose the so-called ODE²VAE, which encodes the data to an initial condition that is integrated through time using a Bayesian neural ODE (Chen et al., 2018); only this trajectory is used to generate the reconstructions via the decoder network. A related class of methods are deep SSMs (Bayer & Osendorfer, 2014; Krishnan et al., 2015; Karl et al., 2017). These works assume that the parametric form of the SSM is unknown, and replace the transition and emission distributions with neural network (NN) models, which are trained based on an ELBO. They harness the representational power of deep NNs to directly model transitions between high-dimensional states; more emphasis is placed on generative modelling and prediction than on representation learning or system identification. We also note the related VAE works of Wu et al. (2021); Franceschi et al. (2020); Babaeizadeh et al. (2022), which use VAE-type architectures for similar video prediction tasks. In Chung et al. (2015) the variational recurrent neural network (VRNN) attempts to capture variation in highly structured time-series data by pairing a recurrent NN for learning nonlinear state transitions with a sequential latent random variable model. Methods to include physical information inside autoencoders have been studied in the physics community. A popular approach uses SINDy (Brunton et al., 2016) for the discovery of low-dimensional latent dynamical systems using autoencoders (Champion et al., 2019). A predictive framework is given in Lopez & Atzberger (2021), which aims to learn nonlinear dynamics by jointly optimizing an ELBO; following our notation, this learns some function which maps u_n → u_{n+k}, for some k, via a VAE. Lusch et al. (2018) use a physics-informed autoencoder to linearise nonlinear dynamical systems via a Koopman approach; inference is regularised through incorporating the Koopman structure in the loss function.
Otto & Rowley (2019) present a similar method, and an extension of these approaches to PDE systems is given in Gin et al. (2021). Morton et al. (2018) use the linear regression methods of Takeishi et al. (2017) within a standard autoencoder to similarly compute the Koopman observables. Erichson et al. (2019) derive an autoencoder which incorporates a linear Markovian prediction operator, similar to a Koopman operator, and use a physics-informed regulariser to promote Lyapunov stability. Hernández et al. (2018) study VAE methods to encode high-dimensional dynamical systems. Finally, we note the related work which studies the estimation of dynamical parameters within so-called "gray-box" systems, blending NN methods with known physical laws (Lu et al., 2020; Yin et al., 2021; Long et al., 2018; de Bézenac et al., 2019). Our contribution. In this paper we propose a physics-informed dynamical variational autoencoder (Φ-DVAE): a DVAE approach which imposes the additional structure of known physics on the latent space. We assume that there is a low-dimensional dynamical system generating the high-dimensional observed time-series; a NN is used to learn the unknown embedding to this lower-dimensional space. On the lower-dimensional space, the embedded data are pseudo-observations of a latent dynamical system, which is, in general, derived from a numerical discretisation of a nonlinear PDE. However, the methodology is suitably generic, also allowing for ODE latent systems. Inference on this latent system is done with efficient nonlinear stochastic filtering methods, enabling the use of mature DA algorithms within our framework. Our approach follows a probabilistically coherent VB construction and allows for joint learning of both the embedding and unknown dynamical parameters.
In contrast to the learnt, linear dynamics of the KVAE (Fraccaro et al., 2017), Φ-DVAE assumes a possibly misspecified nonlinear differential equation, with possibly unknown parameters, is driving the variation in the latent space. This is also in contrast to incorporating generic physical structure in the latent space (such as Koopman structure (Lusch et al., 2018), or Lyapunov stability (Erichson et al., 2019)), or generic temporal structure (such as in the GPVAE (Pearce, 2020; Jazbec et al., 2021; Fortuin et al., 2020)). Our latent dynamical systems give a known functional form of the latent transition density, instead of the learnt, NN-parameterised, transition and emission densities in deep SSMs (Bayer & Osendorfer, 2014; Krishnan et al., 2015; Karl et al., 2017). Similarly, whilst we share commonality with a latent differential equation, the Φ-DVAE differs from the ODE²VAE (Yildiz et al., 2019) as we perform inference with this ODE/PDE, instead of learning it and leveraging it to deterministically evolve the latents. Our approach trades off the generality of latent system discovery against the ability to infer physical quantities of interest relating to a particular latent system. Φ-DVAE can infer physical parameters and states, solving the joint filtering and parameter estimation problem in scenarios where the observation model is unknown.

3. THE PROBABILISTIC MODEL

In this section we define our probabilistic model; our presentation roughly follows the structure of the generative model. We first give an overview of the dependencies between variables, as described by conditional probabilities. We then cover the latent differential equation model used to describe the underlying physics. Then, the pseudo-observation model is covered, followed by the decoder and the encoder. To be precise, we assume a general SSM: Λ ∼ p(Λ), u_n | u_{n-1}, Λ ∼ p(u_n | u_{n-1}, Λ), x_n | u_n ∼ p_ν(x_n | u_n), y_n | x_n ∼ p_θ(y_n | x_n), where Λ describes the parameters of the Markov process {u_n}_{n=0}^N evolving w.r.t. the dynamic model p(u_n | u_{n-1}, Λ), ν describes the parameters of the likelihood denoted by p_ν(x_n | u_n), and θ describes NN parameters for the decoder p_θ(y_n | x_n). The conditional independence structure imposed by the model gives

p(y_{1:N}, x_{1:N}, u_{1:N}, Λ) = p_θ(y_{1:N} | x_{1:N}) p_ν(x_{1:N} | u_{1:N}) p(u_{1:N} | Λ) p(Λ). (1)

Intuitively, {y_n}_{n=1}^N is the sequence of high-dimensional video frames, {x_n}_{n=1}^N is its embedding (or the pseudo-data), and {u_n}_{n=0}^N is the latent physics process. For each n, we assume that y_n ∈ Y (with dim(Y) = n_y), x_n ∈ R^{n_x}, u_n ∈ R^{n_u}, and Λ ∈ R^{n_λ}. In what follows, we describe the components of our probabilistic model in detail.

[Figure 1: On the left, the frames of a video, denoted y_{1:N}, are converted into physically interpretable low-dimensional encodings x_{1:N} using an encoder. This learning process is informed by the physics-driven state-space model (bottom right), which treats x_{1:N} as pseudo-observations.]
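The factorisation in equation (1) can be made concrete with a short ancestral-sampling sketch. The linear transition, pseudo-observation map, and decoder below are hypothetical stand-ins for the model components, chosen only to illustrate the conditional dependence Λ → u_{1:N} → x_{1:N} → y_{1:N}:

```python
import numpy as np

# Ancestral sampling through the factorisation in equation (1).
# A, H, and D are hypothetical stand-ins for the dynamics, the
# pseudo-observation operator, and the decoder, respectively.
rng = np.random.default_rng(0)
n_u, n_x, n_y, N = 3, 2, 8, 10

A = 0.9 * np.eye(n_u)                  # toy dynamics p(u_n | u_{n-1}, Λ)
H = rng.standard_normal((n_x, n_u))    # pseudo-observation map (ν)
D = rng.standard_normal((n_y, n_x))    # toy decoder weights (θ)

def sample_joint():
    u = np.zeros(n_u)                  # known initial condition u_0
    us, xs, ys = [], [], []
    for _ in range(N):
        u = A @ u + 0.1 * rng.standard_normal(n_u)    # transition noise e_n
        x = H @ u + 0.05 * rng.standard_normal(n_x)   # pseudo-obs noise r_n
        y = D @ x + 0.01 * rng.standard_normal(n_y)   # decoder noise
        us.append(u); xs.append(x); ys.append(y)
    return np.array(us), np.array(xs), np.array(ys)

us, xs, ys = sample_joint()
```

In the paper the decoder is a NN and the dynamics come from a discretised differential equation; the sketch only mirrors the shapes and the order of sampling.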

3.1. DYNAMIC MODEL

The first component of the generative model is the latent dynamical system p(u_n | u_{n-1}, Λ). In general, we model this latent physics process {u_n}_{n=0}^N as a discretised stochastic PDE, however ODE latent physics is admissible within our framework (see Section 5.1). We discretise this process with the statistical finite element method (STATFEM) (Girolami et al., 2021; Duffin et al., 2021; 2022; Akyildiz et al., 2022), forming the basis of the physics-informed prior on the latent space. Stochastic additive forcing inside the PDE represents additive model error, which results from possibly misspecified physics. Full details, including the ODE case, are given in Appendix C. In general, we assume that the model has possibly unknown coefficients Λ; on these we place the Bayesian prior Λ ∼ p(Λ) (Stuart, 2010), describing our a priori knowledge on the model parameters before observing any data. We also assume that u_0, the initial condition, is known up to measurement noise, with prior p(u_0) set accordingly. For pedagogical purposes, we derive the discrete-time dynamic model using the Korteweg-de Vries (KDV) equation as a running example, which is used in later sections as an example PDE: ∂_t u + αu∂_s u + β∂³_s u = ξ, ξ ∼ GP(0, δ(t - t′) · k(s, s′)), where u := u(s, t) ∈ R, ξ := ξ(s, t), s ∈ [0, L], t ∈ [0, T], and Λ = {α, β}. Informally, ξ is a GP with delta correlations in time, and spatial correlations given by the covariance kernel k(•, •) : R × R → R (Williams & Rasmussen, 2006). This is an uncertain term in the PDE, representing possible model misspecification. Note that we assume all GP hyperparameters are known in this work. The KDV equation is used to model nonlinear internal waves (see, e.g., Drazin & Johnson, 1989), and describes the balance between nonlinear advection and dispersion. Note that although the KDV equation defines a scalar field, u(s, t), the approach similarly holds for vector fields.
Following STATFEM, the equations are first spatially discretised with the finite element method (FEM), then discretised in time. Thus, equation 2 is multiplied with a smooth test function v(s) ∈ V, where V is an appropriate function space, and integrated over the domain Ω to give the weak form (Brenner & Scott, 2008; Thomée, 2006): ⟨∂_t u, v⟩ + α⟨u∂_s u, v⟩ + β⟨∂³_s u, v⟩ = ⟨ξ, v⟩, where ⟨•, •⟩ is the L²(Ω) inner product. The domain is discretised to give the mesh Ω_h with vertices {s_j}_{j=0}^{n_h}. In this case, we take the s_j to be a uniformly spaced set of points, so that s_j = jh (h gives the spacing between grid-points). On the mesh a set of polynomial basis functions {ϕ_i(s)}_{i=1}^{n_u} is defined, giving a finite-dimensional approximation to the PDE. Letting u_h(s, t) = Σ_{i=1}^{n_u} u_i(t)ϕ_i(s), the weak form is now rewritten with these basis functions: ⟨∂_t u_h, ϕ_j⟩ + α⟨u_h ∂_s u_h, ϕ_j⟩ + β⟨∂³_s u_h, ϕ_j⟩ = ⟨ξ, ϕ_j⟩, j = 1, …, n_u. This gives a finite-dimensional SDE over the FEM coefficients u(t) = (u_1(t), …, u_{n_u}(t))⊤: M du/dt + βAu + αF(u) = ξ, ξ(t) ∼ N(0, δ(t - t′) · G), where M_ij = ⟨ϕ_i, ϕ_j⟩, A_ji = ⟨∂³_s ϕ_i, ϕ_j⟩, F(u)_j = ⟨u_h ∂_s u_h, ϕ_j⟩, and G_ij = ⟨ϕ_i, ⟨k(•, •), ϕ_j⟩⟩. Time discretisation eventually gives the transition density p(u_n | u_{n-1}, Λ), for u_n = u(n∆_t), whose form is dependent on the discretisation used (see Appendix C for all details).
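As a minimal illustration of the FEM quantities above, the sketch below assembles the mass matrix M_ij = ⟨ϕ_i, ϕ_j⟩ for piecewise-linear basis functions on a uniform periodic mesh. This is only the simplest ingredient: the paper's KDV discretisation uses a Petrov-Galerkin scheme (Appendix C) to handle the third-derivative term, which P1 elements cannot represent directly.

```python
import numpy as np

def p1_mass_matrix(n, h):
    """Mass matrix M_ij = <phi_i, phi_j> for piecewise-linear "hat"
    basis functions on a uniform periodic mesh with n nodes and
    spacing h. For P1 elements the exact entries are 2h/3 on the
    diagonal and h/6 on the (periodically wrapped) off-diagonals."""
    M = np.zeros((n, n))
    for i in range(n):
        M[i, i] = 2 * h / 3
        M[i, (i - 1) % n] = h / 6
        M[i, (i + 1) % n] = h / 6
    return M

# e.g. a domain of length 2 (as in the KDV example) with 100 cells
h = 2.0 / 100
M = p1_mass_matrix(100, h)
```

A quick sanity check on such an assembly: each row sums to ⟨1, ϕ_j⟩ = h, since the hat functions form a partition of unity.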

3.2. LIKELIHOOD

The second component of the generative model is the likelihood p_ν(x_n | u_n). This density acts as a data likelihood for the pseudo-data {x_n}_{n=1}^N. This middle layer in the model is usually necessary, as the high-dimensional observations {y_n}_{n=1}^N may be generated by only some observed dimensions of {u_n}_{n=0}^N. For example, perhaps it is known a priori that only a single component of a latent coupled differential equation generates the observations (see also Section 5.1). This explicit likelihood is introduced to separate the encoding process from the state-space inference; in practice we compute the pseudo-data via the encoding q_ϕ(x_{1:N} | y_{1:N}), then condition on it with standard nonlinear filtering algorithms (Fraccaro et al., 2017). The latent states u_n are mapped at discrete times to pseudo-observations via x_n = Hu_n + r_n, with r_n ∼ N(0, R); the noise process r_n represents extraneous uncertainty associated with the pseudo-observations. We parameterise this observation density as p_ν(x_n | u_n), where ν = {H, R}. Both the pseudo-observation operator H ∈ R^{n_x × n_u} and the noise covariance R ∈ R^{n_x × n_x} are assumed to be known in this work. Observations y_n are related to pseudo-observations x_n via the decoder, represented with the conditional density p_θ(y_n | x_n). The combination of the transition and observation densities provides the nonlinear Gaussian SSM: Transition: u_n = M(u_{n-1}) + e_{n-1}, e_{n-1} ∼ N(0, Q); Pseudo-observation: x_n = Hu_n + r_n, r_n ∼ N(0, R). Inference is performed with the extended Kalman filter (EXKF) (Jazwinski, 1970; Law et al., 2015), which computes p(u_n | x_{1:n}, Λ). Further details are contained in Section 4.
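The EXKF recursion for the SSM above can be sketched as follows. Here `Mfun` and `Jfun` are stand-ins for the transition map M and its Jacobian, which in the paper come from the discretised dynamics; the returned log predictive density is the quantity reused in the ELBO of Section 4.

```python
import numpy as np

def ekf_step(m, C, Mfun, Jfun, Q, H, R, x):
    """One extended Kalman filter step for the SSM
        u_n = M(u_{n-1}) + e_{n-1},   x_n = H u_n + r_n.
    Returns the updated mean/covariance of p(u_n | x_{1:n}) and the
    log predictive density log p(x_n | x_{1:n-1})."""
    # Predict: linearise M about the current mean.
    m_pred = Mfun(m)
    J = Jfun(m)
    C_pred = J @ C @ J.T + Q
    # Update: standard Kalman correction with the pseudo-observation.
    S = H @ C_pred @ H.T + R                 # innovation covariance
    K = C_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    v = x - H @ m_pred                       # innovation
    m_new = m_pred + K @ v
    C_new = (np.eye(len(m)) - K @ H) @ C_pred
    # Gaussian log predictive density of the innovation.
    _, logdet = np.linalg.slogdet(2 * np.pi * S)
    loglik = -0.5 * (logdet + v @ np.linalg.solve(S, v))
    return m_new, C_new, loglik

# Toy usage with a linear map (EXKF then coincides with the exact KF).
Mfun = lambda u: 0.9 * u
Jfun = lambda u: 0.9 * np.eye(2)
m1, C1, ll = ekf_step(np.zeros(2), np.eye(2), Mfun, Jfun,
                      0.1 * np.eye(2), np.eye(2), 0.1 * np.eye(2),
                      np.array([1.0, 0.0]))
```

For a nonlinear M (e.g. the discretised KDV dynamics) only `Mfun` and `Jfun` change; the correction step is unchanged.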

3.3. DECODER

The last component of our generative model is the decoder p_θ(y_n | x_n), which describes the unknown mapping between the pseudo-observations x_n and the observed data y_n. The decoding of latents to data should model as closely as possible the true data generation process. Prior knowledge about this process can be used to select an appropriate p_θ(y_n | x_n). No temporal structure is assumed on θ, so the decoder is shared across all times: p_θ(y_{1:N} | x_{1:N}) = ∏_{n=1}^N p_θ(y_n | x_n). For more details on specific architectures see Appendix A.

4. VARIATIONAL INFERENCE

In this section, we introduce the variational family and the ELBO. Denote by q(u_{1:N}, x_{1:N}, Λ | y_{1:N}) the variational posterior, which, similar to Fraccaro et al. (2017), factorises as q(u_{1:N}, x_{1:N}, Λ | y_{1:N}) = q(u_{1:N} | x_{1:N}, Λ) q_ϕ(x_{1:N} | y_{1:N}) q_λ(Λ). Note here that we do not make a variational approximation for q(u_{1:N} | x_{1:N}, Λ); this factor is taken to be the exact posterior p(u_{1:N} | x_{1:N}, Λ). We derive the ELBO to be (see also Appendix B)

log p(y_{1:N}) ≥ ∫ log [ p(u_{1:N}, x_{1:N}, Λ, y_{1:N}) / q(u_{1:N}, x_{1:N}, Λ | y_{1:N}) ] q(u_{1:N}, x_{1:N}, Λ | y_{1:N}) dx_{1:N} du_{1:N} dΛ
= E_{q_ϕ}[ log ( p_θ(y_{1:N} | x_{1:N}) / q_ϕ(x_{1:N} | y_{1:N}) ) ] + E_{q_ϕ q_λ}[ log p(x_{1:N} | Λ) ] + E_{q_λ}[ log ( p(Λ) / q_λ(Λ) ) ]. (3)

Typically this expectation is not analytically tractable and Monte Carlo is used to compute an approximation.
Nonlinear filtering. In the ELBO of equation 3, estimating log p(x_{1:N} | Λ) requires marginalising over u_{1:N}, the latent physics process. We perform this via the EXKF (Jazwinski, 1970; Law et al., 2015), which recursively computes a Gaussian approximation to the filtering posterior p(u_n | x_{1:n}, Λ) ≈ N(m_n, C_n). We note however that this can also be realised with other nonlinear filters, e.g., ensemble Kalman filters (Chen et al., 2022) or particle filters (Corenflos et al., 2021). The factorisation of the pseudo-observation marginal likelihood, p(x_{1:N} | Λ) = p(x_1 | Λ) ∏_{n=2}^N p(x_n | x_{1:n-1}, Λ), enables computation, as the filter can compute log p(x_n | x_{1:n-1}, Λ) at each prediction step, via p(x_n | x_{1:n-1}, Λ) = ∫ p(x_n | u_n, Λ) p(u_n | x_{1:n-1}, Λ) du_n. Variational approximations. As with the decoder, encoder parameters ϕ are shared between variational distributions {q_ϕ(x_n | y_n)}_{n=1}^N to give an amortized approach (Kingma et al., 2019). Unless otherwise specified, for each n the encoding has the form q_ϕ(x_n | y_n) = N(µ_ϕ(y_n), σ_ϕ(y_n)). The functions µ_ϕ(•) : R^{n_y} → R^{n_x} and σ_ϕ(•) : R^{n_y} → R^{n_x} are NNs, with parameters ϕ to be learnt.
Specific encoding architectures are given in Appendix A. As for q_λ, we set it to a Gaussian with mean µ_λ and covariance diag(σ_λ).
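To see how the filter yields the marginal likelihood term log p(x_{1:N} | Λ) via the prediction-step factorisation above, the toy check below runs a scalar linear-Gaussian SSM, accumulates the one-step predictive log densities with a Kalman filter, and verifies them against direct evaluation of the joint Gaussian over x_{1:N}. All parameter values are arbitrary illustrations.

```python
import numpy as np

# Scalar linear-Gaussian SSM: u_n = a u_{n-1} + e_n, x_n = u_n + r_n,
# with u_0 ~ N(0, p0), Var(e_n) = q, Var(r_n) = r.
a, q, r, p0 = 0.8, 0.3, 0.2, 1.0
N = 5
rng = np.random.default_rng(1)

# Simulate pseudo-data x_{1:N}.
u = rng.normal(0.0, np.sqrt(p0))
xs = []
for _ in range(N):
    u = a * u + rng.normal(0.0, np.sqrt(q))
    xs.append(u + rng.normal(0.0, np.sqrt(r)))
xs = np.array(xs)

# (i) Kalman filter: accumulate log p(x_n | x_{1:n-1}) at each predict step.
m, P, ll_filter = 0.0, p0, 0.0
for x in xs:
    m_pred, P_pred = a * m, a * a * P + q          # predict
    S = P_pred + r                                 # predictive variance of x_n
    ll_filter += -0.5 * (np.log(2 * np.pi * S) + (x - m_pred) ** 2 / S)
    K = P_pred / S                                 # update
    m = m_pred + K * (x - m_pred)
    P = (1 - K) * P_pred

# (ii) Direct evaluation: x_{1:N} is jointly Gaussian with
# Cov(u_i, u_j) = a^{|i-j|} Var(u_{min(i,j)}).
vs, var = [], p0
for _ in range(N):
    var = a * a * var + q
    vs.append(var)
Pu = np.array([[a ** abs(i - j) * vs[min(i, j)] for j in range(N)]
               for i in range(N)])
S_joint = Pu + r * np.eye(N)
_, logdet = np.linalg.slogdet(2 * np.pi * S_joint)
ll_direct = -0.5 * (logdet + xs @ np.linalg.solve(S_joint, xs))
```

The two quantities agree exactly in the linear-Gaussian case; with nonlinear latent dynamics the EXKF version becomes an approximation.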

5. EXPERIMENTS

We present three examples with different dynamical systems. To demonstrate the generality of the method, the first uses the stochastic Lorenz-63 system (Lorenz, 1963), a highly nonlinear stochastic ODE. In this case, the high-dimensional observations are of a velocity field being modulated by the chaotic system. For the final two examples, we use video data. We consider the advection and KDV PDEs, and, in these examples, we indirectly observe high-dimensional representations in the form of video data. This mimics the experimental setup of the various DVAE papers (e.g., Fraccaro et al., 2017; Pearce et al., 2018; Jazbec et al., 2021; Fortuin et al., 2020; Zhu et al., 2022; Girin et al., 2021). For these two examples, the video datasets are generated in a similar fashion. In both cases, simulated data emulate the scenario where a noisy video of an internal wave profile is captured. Internal waves arise as waves of depression or elevation flowing within a density-stratified fluid at regions of maximum density gradient (Gerkema & Zimmerman, 2008). Our experimental setup thus emulates an idealised scenario where a black-and-white, side-on video of a laboratory experiment has been obtained, and is inspired by settings where the high-resolution use of classical measurement devices is not feasible, yet the use of commonplace video-capturing devices is (see, e.g., Horn et al., 2001; 2002). The advection equation example is motivated by an internal wave propagating undisturbed through some medium. For this linear case, comparisons reveal that, after training, the Φ-DVAE outperforms the KVAE in terms of both the estimated ELBO and the mean-squared error (MSE). The KDV example is a more complex case, extending into the nonlinear PDE setting while also being a classical model for internal waves (Drazin & Johnson, 1989). For this example, comparisons are made with VRNNs, the GPVAE, and the standard VAE.
We demonstrate that the MSE of the Φ-DVAE is comparable or better than these approaches. Furthermore, for joint state and parameter inference, we verify the methodology and demonstrate contraction of the posterior about the truth.

5.1. LORENZ-63 EXAMPLE

In our first example, the latent dynamical model p(u_n | u_{n-1}, Λ) is given by an Euler-Maruyama discretisation (Kloeden & Platen, 1992) of the stochastic Lorenz-63 system:

du_1 = (-σu_1 + σu_2) dt + dw_1,  du_2 = (-u_1 u_3 + ru_1 - u_2) dt + dw_2,  du_3 = (u_1 u_2 - bu_3) dt + dw_3, (4)

where u(t) := [u_1(t), u_2(t), u_3(t)]⊤, t ∈ [0, 6], u_n = [u_1(n∆_t), u_2(n∆_t), u_3(n∆_t)], Λ = {σ, r, b}, and w_1, w_2, and w_3 are independent Brownian motion processes (Øksendal, 2003; Särkkä & Solin, 2019). For full details we refer to Appendices A and C. The Lorenz-63 system is a classical system widely used to benchmark filtering and data assimilation methods (see, e.g., Akyildiz & Míguez, 2020). It was popularised in Lorenz (1963) through its characterisation of "deterministic nonperiodic flow", and is a common example of chaotic dynamics. We observe synthetic 2D velocity fields, y_{1:N}, of convective fluid flow, and we use our method to embed these synthetic data into the stochastic Lorenz-63 system. The Lorenz-63 system is related to the velocity fields through a truncated spectral expansion (see, e.g., Wouters, 2013). In brief, it is assumed that the velocity fields have no vertical velocity, so the 3D velocity field is realised in 2D. The velocity field can be described by the stream function ψ := ψ(s_1, s_2, t), where s_1 and s_2 are the spatial coordinates of variation, and thus y(t) = (-∂_{s_2}ψ, 0, ∂_{s_1}ψ). A truncated spectral approximation and a transform to non-dimensional equations yields ψ(s_1, s_2, t) ∝ u_1(t) sin(πs_1/l) sin(πs_2/d), where u_1(t) is governed by the Lorenz-63 ODE (i.e., equation 4 with w_i ≡ 0). To generate the synthetic data y_{1:N}, we generate a trajectory u^{true}_{1:N} from the deterministic version of equation 4, at discrete timepoints n∆_t, and use the generated u^{true}_{1,n} to compute the two-dimensional velocity field, y_n, via ψ(s_1, s_2, t).
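The Euler-Maruyama discretisation of equation (4) can be sketched as below. The parameter values follow the classical Lorenz regime (σ = 10, r = 28, b = 8/3), assumed here for illustration; setting the noise scale to zero recovers the deterministic system used to generate the synthetic velocity fields.

```python
import numpy as np

def lorenz63_em(u0, T=6.0, dt=1e-3, sigma=10.0, r=28.0, b=8.0 / 3.0,
                noise=0.0, rng=None):
    """Euler-Maruyama discretisation of the (stochastic) Lorenz-63
    system in equation (4); noise=0 recovers the deterministic model."""
    rng = rng or np.random.default_rng(0)
    n = round(T / dt)
    u = np.empty((n + 1, 3))
    u[0] = u0
    for k in range(n):
        u1, u2, u3 = u[k]
        drift = np.array([-sigma * u1 + sigma * u2,
                          -u1 * u3 + r * u1 - u2,
                          u1 * u2 - b * u3])
        # dw_i ~ N(0, dt), scaled by the noise amplitude
        u[k + 1] = u[k] + drift * dt + noise * np.sqrt(dt) * rng.standard_normal(3)
    return u

traj = lorenz63_em(np.array([1.0, 1.0, 1.0]))   # deterministic trajectory
```

The returned array plays the role of u^{true}_{1:N} (after subsampling at the observation times n∆_t).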
This corresponds to having a middle layer x_n = u_{1,n} + w_n, where w_n ∼ N(0, R²), with likelihood p(x_n | u_n) = N(h⊤u_n, R²), where h = [1, 0, 0]⊤. The synthetic data of u^{true}(t), x_{1:N}, and y_{1:N} are visualised in Figure 2; the full trajectory u^{true}(t) is shown in 3D, and the velocity field y_n is shown in 2D, for a single n. The decoding is assumed to be of the form p_w(y_n | x_n) = N(wx_n, η²I), with unknown coefficients w ∈ R^{n_y}. The variational encoding q_w(x_n | y_n) is determined via a pseudo-inverse, as detailed in Appendix A, along with the relevant hyperparameters and numerical details. For individual parameter estimation, the posteriors contract about the true values, with b visually demonstrating more rapid convergence. For joint estimation (Figure 3a, bottom), similar behaviour is observed, though the parameter r is not identified by the final epoch. We conjecture this is due to identifiability issues with the other parameters when estimating jointly. Note, however, that the true values are all contained within the confidence bands of the final variational posteriors. We also investigate the posterior inference achieved by Φ-DVAE. We visualise the filtering posterior through time, conditioned on a sample from the trained encoder x^{(i)}_{1:N} ∼ q_ϕ(• | y_{1:N}), with fixed, known Λ. Figure 3b (left) shows clear agreement between the filter mean and the latent states u^{true}_{1:N}. Marginalising over the encoding (Figure 3b, centre) targets the filtering posterior directly conditioned on the observed data y_{1:N}, and demonstrates unbiased mean estimates of the true state. This is particularly clear for the first latent dimension, where the posterior conditioned on an individual sample x^{(i)}_{1:N} often has poor posterior coverage of the true value (cf. Figure 3b, right).

5.2. ADVECTION PDE

As a second example, we consider the advection equation with periodic boundary conditions. In this example, we derive the transition density p(u_n | u_{n-1}, Λ) from a STATFEM discretisation of a stochastic advection equation: ∂_t u + c∂_s u = ξ, ξ ∼ GP(0, δ(t - t′) · k(s, s′)), where u := u(s, t), ξ := ξ(s, t), s ∈ [0, 1], and t ∈ [0, T]. Video data y_{1:N} is generated from the deterministic version of equation 5 (i.e., equation 5 with ξ ≡ 0). A trajectory u^{true}_{1:N} is simulated and the corresponding FEM solutions u^{true}_h(s, n∆_t) are imposed onto a 2D grid. On the grid, pixels below u^{true}_h(s, n∆_t) are lit up in a binary fashion, with salt-and-pepper noise (Gonzalez & Woods, 2007). In this experiment we use fixed parameters, setting Λ ≡ c = 0.5. We set the decoder to p_θ(y_n | x_n) = Bern(µ_θ(x_n)) and the encoder to q_ϕ(x_n | y_n) = N(µ_ϕ(y_n), σ_ϕ(y_n)). As previously, see Appendix A for full details. Due to the linearity of the underlying dynamical system, we compare the Φ-DVAE to the KVAE for a set of video data generated from the advection equation, for various dimensions of the KVAE latent space (n_x ∈ {2, 4, 16, 64, 128, 256}). Specifying a particular form of latent dynamics on the latent states increases the inductive bias imposed on the latent space, and should provide faster learning, and more likely representations, than with learnt dynamics. To explore this, we compare our method to the KVAE for the linear advection example in terms of the ELBO and the normalised MSE, ∥y_{1:N} - ŷ_{1:N}∥²₂ / ∥y_{1:N}∥²₂, over training epochs (all methods use Adam (Kingma & Ba, 2017)). These are plotted in Figure 4. For the MSE, the KVAE quickly learns to reconstruct the images, with Φ-DVAE taking longer to reconstruct with similar accuracy but eventually producing better reconstructions (final MSEs 0.0221 vs. 0.0533 for Φ-DVAE and KVAE-64, respectively). The ELBO for Φ-DVAE is rapidly optimised in comparison to the KVAE models, and is greater by the end of training. The trained Φ-DVAE thus gives better evidence for the data (greater ELBO), whilst providing more accurate reconstructions (lower MSE).
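The video-generation procedure described above (pixels lit below a wave profile, plus salt-and-pepper noise) can be sketched as follows. The Gaussian bump is a hypothetical stand-in for the FEM solution u^{true}_h(s, n∆_t), and the image resolution is illustrative rather than the paper's.

```python
import numpy as np

def render_frames(c=0.5, n_s=64, n_px=28, N=20, p_flip=0.02, dt=0.05,
                  rng=None):
    """Binary 'video' of a wave advecting with speed c on a periodic
    domain: pixels below the profile are lit, then a fraction p_flip
    of pixels is flipped (salt-and-pepper noise)."""
    rng = rng or np.random.default_rng(0)
    s = np.linspace(0.0, 1.0, n_s, endpoint=False)
    heights = np.linspace(0.0, 1.0, n_px)
    frames = np.empty((N, n_px, n_s), dtype=np.uint8)
    for n in range(N):
        centre = (0.2 + c * n * dt) % 1.0
        # periodic distance to the wave centre; Gaussian bump stands in
        # for the FEM solution at time n*dt
        dist = np.minimum(np.abs(s - centre), 1.0 - np.abs(s - centre))
        u = 0.8 * np.exp(-dist ** 2 / 0.01)
        lit = (heights[:, None] < u[None, :]).astype(np.uint8)
        flips = (rng.random((n_px, n_s)) < p_flip).astype(np.uint8)
        frames[n] = lit ^ flips           # XOR flips the noisy pixels
    return frames

frames = render_frames()
```

Each frame would then be flattened into y_n ∈ {0, 1}^{n_y}, matching the Bernoulli decoder used in the experiments.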
[Figure 5: normalised MSE for Φ-DVAE, Φ-DVAE (pred), VRNN, VRNN (pred), GP-VAE, and VAE.]

5.3. KORTEWEG-DE VRIES PDE

Our final example uses the KDV equation as the underlying dynamical system. As previously, the latent transition density p(u_n | u_{n-1}, Λ) defines the evolution of the FEM coefficients, as given by a STATFEM discretisation of a stochastic KDV equation: ∂_t u + αu∂_s u + β∂³_s u = ξ, ξ ∼ GP(0, δ(t - t′) · k(s, s′)), where u := u(s, t), ξ := ξ(s, t), s ∈ [0, 2], t ∈ [0, 1], and u(s, t) = u(s + 2, t). The parameters are Λ = {α, β}. Data is simulated in the same fashion as for the advection equation: we simulate a trajectory u^{true}_{1:N} using a FEM discretisation of the deterministic KDV equation, impose the FEM solutions u^{true}_h(s, n∆_t) on a 2D grid, light up pixels below the solution, and add salt-and-pepper noise. For the data-generating process, we use the parameters Λ = {α = 1, β = 0.022²}, and the initial condition u^{true}_h(s, 0) = cos(πs), as in the classical work of Zabusky & Kruskal (1965). This regime is characterised by the steepening of the initial condition and the generation of solitons: nonlinear waves which have particle-like interactions (Drazin & Johnson, 1989). As for the advection example, we set the decoder to p_θ(y_n | x_n) = Bern(µ_θ(x_n)) and the encoder to q_ϕ(x_n | y_n) = N(µ_ϕ(y_n), σ_ϕ(y_n)); again, see Appendix A for details. We test the VRNN, GPVAE, and standard VAE approaches alongside Φ-DVAE, with equal encoding dimension n_x = 40. In Figure 5 we report the normalised MSE after a fixed number of epochs, using Adam (Kingma & Ba, 2017) with the same learning rate for each method. Φ-DVAE outperforms both the GPVAE and the standard VAE in terms of median MSE. Note that the variation in MSE of the standard VAE is also greater than that of the other models, suggesting that the dynamical structure provides more consistent learning. Both Φ-DVAE and the VRNN perform well, producing visually similar reconstructions, with the VRNN having the lower median MSE (0.0239 for Φ-DVAE vs. 0.0105 for the VRNN).
A predictive MSE is also computed by giving the model a single image frame to encode, sampling from the generative model forward in time, and comparing the decoded samples against the ground-truth images. Here, Φ-DVAE outperforms the VRNN, with median predictive MSEs of 0.0415 vs. 0.0806, respectively. We report joint estimation results for the partially known KDV PDE, where we fix β = 0.022² and estimate α. The prior over α, p(α) = N(1.5, 0.3²), is semi-informative. Joint inference of α and the latent states u_n is shown in Figure 6. The Gaussian variational posterior, q_λ(α) = N(µ_λ, σ²_λ) (initialised at the prior), contracts about the true value α = 1.0 (see Figure 6a). The filtering posterior, p(u_n | y_{1:n}), is shown in Figure 6c. This is obtained via Monte Carlo approximation, marginalising over the variational posteriors q_ϕ and q_λ, to account for uncertainty in the encoding and parameter estimates. Including a structured prior on the latent space has forced the encoding to be representative of observations taken from the KDV system, clearly capturing the latent dynamics causing the variation in the image data. Figures 6b and 6d display the image data and reconstructions, respectively, showing the ability of Φ-DVAE to both accurately reconstruct and de-noise the data.

6. CONCLUSION

In this paper we developed Φ-DVAE, a methodology for incorporating unstructured data into physical models in settings where the model-data mapping may be unknown. The proposed approach combines variational autoencoders with nonlinear filtering algorithms for PDEs to learn physically interpretable latent spaces in which analysis and prediction can be performed straightforwardly. Our framework connects traditional nonlinear filtering techniques and VAEs, opening up the possibility of further combinations of these methods. Future work will focus on more challenging PDE systems, as well as more complex, higher-dimensional observational data.

KDV equation experiment.
• μ_θ(·): MLP, two fully connected hidden layers with dimension 128.
• Encoder: q_ϕ(x_n | y_n) = N(μ_ϕ(y_n), σ_ϕ(y_n)).
• μ_ϕ(·), σ_ϕ(·): MLP, two fully connected hidden layers with dimension 128.
• Neural network activations: LeakyReLU, negative slope = 0.01.
• Latent initial condition: u(s, 0) = cos(πs).
• Latent noise processes: ρ = 0.01, ℓ = 0.2, R = diag(0.05²).
• Latent discretisation: Petrov-Galerkin approach of Debussche & Printems (1999): C⁰([0, 2]) polynomial trial functions, Crank-Nicolson time discretisation, dt = 0.01.
• Parameter prior: p(α) = N(1.5, 0.3²).
• Optimiser: Adam, learning rate = 0.005.

To generate the data we simulate the KDV equation with dt = 0.01, observing every timestep so that Δ_t = 0.01, and we take N = 100 observations y_n, with y_n ∈ [0, 1]^{n_y}. Each y_n is a flattened image of dimension n_y = 64 × 28 = 1792, and we encode to pseudo-observations x_n of dimension n_x = 40. The latent state dimension is n_u = 600. The predictive MSE is calculated by encoding y_t at t = 0.1 to x_t, sampling x_t from the generative model forward in time for 10 time-steps (up to t = 0.2), and computing the normalised MSE of the decoded samples ŷ_t compared to the ground truth.

Linear decoding/encoding.
If a linear data-generating process is assumed from x_{1:N} to y_{1:N}, then this structure can inform the decoder. With a linear decoder of the form p_A(y|x) = N(Ax, η²I), we use the "inverted" linear encoder given by q_A(x|y) = N((AᵀA)⁻¹Aᵀy, η²(AᵀA)⁻¹). By selecting the encoder appropriately, the space of parameterised variational distributions can be restricted to align with our beliefs about the data-generating process.
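A minimal sketch of this inverted linear encoder, computing the moments of q_A(x|y) directly from the decoder matrix A (the function name is illustrative):

```python
import numpy as np

def linear_encoder(A, y, eta):
    """Given the linear decoder p_A(y|x) = N(Ax, eta^2 I), return the mean
    and covariance of the inverted encoder
    q_A(x|y) = N((A^T A)^{-1} A^T y, eta^2 (A^T A)^{-1})."""
    AtA_inv = np.linalg.inv(A.T @ A)
    mean = AtA_inv @ A.T @ y          # least-squares "inversion" of A
    cov = eta ** 2 * AtA_inv
    return mean, cov
```

Note the encoder mean is the ordinary least-squares solution, so in the noise-free case y = Ax the encoding recovers x exactly.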

B FULL VARIATIONAL FRAMEWORK

We derive the approximate ELBO for joint estimation of the dynamic parameters Λ and the autoencoder parameters ϕ, θ. We start by writing the evidence

p(y_1:N) = ∫ p(u_1:N, x_1:N, Λ, y_1:N) dx_1:N du_1:N dΛ.

We maximise log p(y_1:N), which we lower-bound as

log p(y_1:N) = log ∫ p(u_1:N, x_1:N, Λ, y_1:N) dx_1:N du_1:N dΛ
= log ∫ [p(u_1:N, x_1:N, Λ, y_1:N) / q(u_1:N, x_1:N, Λ | y_1:N)] q(u_1:N, x_1:N, Λ | y_1:N) dx_1:N du_1:N dΛ
≥ ∫ log [p(u_1:N, x_1:N, Λ, y_1:N) / q(u_1:N, x_1:N, Λ | y_1:N)] q(u_1:N, x_1:N, Λ | y_1:N) dx_1:N du_1:N dΛ =: ELBO,

where the third line follows from the application of Jensen's inequality. Our generative model determines the factorisation of the joint distribution, given in equation 1:

p(u_1:N, x_1:N, Λ, y_1:N) = p(y_1:N | x_1:N, u_1:N, Λ) p(x_1:N | u_1:N, Λ) p(u_1:N | Λ) p(Λ)
= p_θ(y_1:N | x_1:N) p(x_1:N | u_1:N, Λ) p(u_1:N | Λ) p(Λ).

Next, we plug this factorised distribution into the ELBO and obtain

ELBO = ∫ log [p_θ(y_1:N | x_1:N) p(x_1:N | u_1:N, Λ) p(u_1:N | Λ) p(Λ) / q(u_1:N, x_1:N, Λ | y_1:N)] q(u_1:N, x_1:N, Λ | y_1:N) dx_1:N du_1:N dΛ.

The latent prior p(u_1:N | Λ) is defined by a stochastic differential equation,

du = f_Λ(u, t) dt + L(t) dW(t),

where u := u(t) ∈ R^{n_u}, t ∈ [0, T], f_Λ : R^{n_u} × [0, T] → R^{n_u}, and L : [0, T] → R^{n_u × n_u}. The noise process W(t) is a standard vector Brownian motion. The diffusion term L(t) can be used to describe any a priori correlation between the dimensions of the error process. As stated in the main text, this error process represents possibly misspecified or unknown physics, which may have been omitted when specifying the model. Discretisation with an explicit Euler-Maruyama scheme (Kloeden & Platen, 1992) gives

u_n = u_{n-1} + Δ_t f_{n-1}(u_{n-1}; Λ) + L_{n-1} ΔW_{n-1}, ΔW_{n-1} ~ N(0, Δ_t I),

where u_n := u(nΔ_t), f_n(·; Λ) := f_Λ(·, nΔ_t), and so on. This gives the transition density

p(u_n | u_{n-1}, Λ) = N(u_{n-1} + Δ_t f_{n-1}(u_{n-1}; Λ), Δ_t L_{n-1} L_{n-1}ᵀ),

defining a Markov model on the now-discretised state vector u_n.
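The Euler-Maruyama transition above can be sketched as a one-step sampler from p(u_n | u_{n-1}, Λ), assuming for simplicity an autonomous drift and a constant diffusion matrix L:

```python
import numpy as np

def em_step(u, f, L, dt, rng):
    """One Euler-Maruyama step of du = f(u) dt + L dW, i.e. a draw from the
    transition density p(u_n | u_{n-1}) = N(u + dt*f(u), dt * L @ L.T)."""
    dW = rng.normal(scale=np.sqrt(dt), size=u.shape)  # Brownian increment
    return u + dt * f(u) + L @ dW
```

Iterating this step produces a sample path of the latent SDE; setting L to the zero matrix recovers the deterministic explicit Euler scheme.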
To align with the notation introduced in the main text, this gives

p(u_n | u_{n-1}, Λ) = N(M(u_{n-1}), Q), M(u_{n-1}) := u_{n-1} + Δ_t f_{n-1}(u_{n-1}; Λ), Q := Δ_t L_{n-1} L_{n-1}ᵀ.

Due to the structure of the STATFEM discretisation, the fully-discretised underlying model has the same mathematical form in the PDE case; the difference lies in whether the dynamics are defined by a PDE or an ODE system. Typically, the ODE case results in a lower-dimensional state vector u_n than the PDE case. In the PDE case, the entries of the state vector u_n are the coefficients of the finite element basis functions, and the derivation includes an additional step before time discretisation: the system is first spatially discretised, yielding a method-of-lines approach (Schiesser, 1991). As in the main text, we consider a generic nonlinear PDE system of the form

∂_t u + L_Λ u + F_Λ(u) = f + ξ, ξ ~ GP(0, δ(t − t′) · k(s, s′)),

where u := u(s, t), ξ := ξ(s, t), f := f(s), s ∈ Ω ⊂ R^d, and

k(s, s′) = ρ² exp(−‖s − s′‖²₂ / 2ℓ²).

The hyperparameters {ρ, ℓ} are always assumed to be known, being set a priori; further work investigating inference of these hyperparameters is of interest. As stated in the main text, we discretise the time-evolving PDE following the STATFEM of Duffin et al. (2021), to which we refer for the full details of this approach. In brief, we discretise spatially with finite elements (see, e.g., Brenner & Scott (2008); Thomée (2006) for standard references), then temporally via finite differences. We first multiply equation 8 by a sufficiently smooth test function v ∈ V, where V is an appropriate function space (e.g. the Sobolev space H¹₀(Ω) (Evans, 2010)), and integrate over the domain Ω to give the weak form (Brenner & Scott, 2008)

⟨∂_t u, v⟩ + A_Λ(u, v) + ⟨F_Λ(u), v⟩ = ⟨f, v⟩ + ⟨ξ, v⟩, ∀v ∈ V.
Recall that A_Λ(·, ·) is the bilinear form generated from the linear operator L_Λ, and ⟨f, g⟩ = ∫_Ω f(s) g(s) ds is the L²(Ω) inner product. Next we introduce a discrete approximation to the domain, Ω_h ⊆ Ω, with vertices {s_j}_{j=1}^{n_h}; this is parameterised by h, which indicates the degree of mesh refinement. We now introduce a finite-dimensional set of polynomial basis functions {ϕ_j(s)}_{j=1}^{n_u}, such that ϕ_i(s_j) = δ_ij. In this work these are exclusively the C⁰(Ω) linear polynomial "hat" basis functions. This gives the finite-dimensional function space V_h = span{ϕ_j(s)}_{j=1}^{n_u}, which is the space in which we look for solutions. For the ODE case, we have worked with observation operators that extract relevant entries from the state vector. The pseudo-observations are mapped to the high-dimensional observed data through a possibly nonlinear observation model with probability density p_θ(y_n | x_n); recall that θ are neural network parameters. This defines the decoding component of our model (see Figure 1).

Nonlinear Filtering for Latent State Estimation. To perform state inference given a set of pseudo-observations we use the EXKF. The EXKF constructs an approximate Gaussian posterior distribution by linearising about the nonlinear model M(·); the action of M(·) is approximated via a tangent linear approximation. We derive our filter in the general context of a nonlinear Gaussian SSM given by

Transition: M(u_n, u_{n-1}) = e_{n-1}, e_{n-1} ~ N(0, Q),
Observation: x_n = H u_n + r_n, r_n ~ N(0, R).

This implicit form allows for the use of implicit time-integrators and subsumes the derivation for the explicit case. We assume that at the previous timestep an approximate Gaussian posterior has been obtained, p(u_{n-1} | x_1:n-1, Λ) = N(m_{n-1}, C_{n-1}). For each n the EXKF then proceeds as:

1. Prediction step. Solve M(m̂_n, m_{n-1}) = 0 for m̂_n.
Calculate the tangent linear covariance update:

Ĉ_n = J_n⁻¹ (J_{n-1} C_{n-1} J_{n-1}ᵀ + Q) J_n⁻ᵀ,

where J_n = ∂M/∂u_n |_{m̂_n, m_{n-1}} and J_{n-1} = ∂M/∂u_{n-1} |_{m̂_n, m_{n-1}}. This gives p(u_n | x_1:n-1, Λ) = N(m̂_n, Ĉ_n).

2. Update step. Compute the posterior mean m_n and covariance C_n:

m_n = m̂_n + Ĉ_n Hᵀ (H Ĉ_n Hᵀ + R)⁻¹ (x_n − H m̂_n),
C_n = Ĉ_n − Ĉ_n Hᵀ (H Ĉ_n Hᵀ + R)⁻¹ H Ĉ_n.

This gives p(u_n | x_1:n, Λ) = N(m_n, C_n). The log-marginal likelihood can be calculated recursively, with each term of the log-likelihood computed after each prediction step. Note that although we focus on the EXKF, other nonlinear filters could be used; two popular alternatives are the ensemble Kalman filter (Evensen, 2003) and the particle filter (Doucet et al., 2000). For a linear dynamical model, such as the advection equation considered in Section 5.2, the EXKF reduces to the standard Kalman filter (Kalman, 1960). As mentioned in the main text, we can marginalise over the uncertainty in the encoder via a Monte Carlo approximation:

p(u_n | y_1:n, Λ) ≈ ∫ p(u_n | x_1:n, Λ) q_ϕ(x_1:n | y_1:n) dx_1:n   (12)
≈ (1/M) Σ_{i=1}^{M} p(u_n | x^{(i)}_1:n, Λ), x^{(i)}_1:n ~ q_ϕ(· | y_1:n).   (13)

The intractable integral is approximated using samples from the encoder, which provides an approximate posterior in the form of a mixture of Gaussians, where each p(u_n | x^{(i)}_1:n, Λ) = N(m^{(i)}_n, C^{(i)}_n). A similar marginalisation procedure can proceed over the parameters:

p(u_n | y_1:n) ≈ ∫ p(u_n | x_1:n, Λ) q_λ(Λ) q_ϕ(x_1:n | y_1:n) dΛ dx_1:n   (14)
≈ (1/(M_x M_Λ)) Σ_{i=1}^{M_x} Σ_{j=1}^{M_Λ} p(u_n | x^{(i)}_1:n, Λ^{(j)}), x^{(i)}_1:n ~ q_ϕ(· | y_1:n), Λ^{(j)} ~ q_λ(·).   (15)
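For a linear transition model the EXKF recursion above reduces to the standard Kalman filter. A minimal sketch of one predict/update cycle, including the log-likelihood increment log p(x_n | x_1:n-1, Λ) used in the recursive log-marginal likelihood:

```python
import numpy as np

def kalman_step(m, C, M, Q, H, R, x):
    """One predict/update cycle of the (linear) Kalman filter: the special
    case the EXKF reduces to when M(.) is linear. Returns the posterior
    moments and the log-likelihood increment log p(x_n | x_{1:n-1})."""
    # Prediction step: p(u_n | x_{1:n-1}) = N(m_pred, C_pred)
    m_pred = M @ m
    C_pred = M @ C @ M.T + Q
    # Update step
    S = H @ C_pred @ H.T + R                 # innovation covariance
    K = C_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    innov = x - H @ m_pred
    m_post = m_pred + K @ innov
    C_post = C_pred - K @ S @ K.T
    # log N(x; H m_pred, S)
    _, logdet = np.linalg.slogdet(S)
    d = x.shape[0]
    loglik = -0.5 * (d * np.log(2 * np.pi) + logdet
                     + innov @ np.linalg.solve(S, innov))
    return m_post, C_post, loglik
```

Summing the `loglik` terms over n yields the marginal likelihood of the pseudo-observation sequence, which is the quantity entering the ELBO.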



Figure 1: An illustration of the Φ-DVAE model. On the left are the frames of a video, denoted y_1:N. These are converted into physically interpretable low-dimensional encodings x_1:N using an encoder. This learning process is informed by the physics-driven state-space model (bottom right), which treats x_1:N as pseudo-observations.

Figure 2: Lorenz-63: latent states u 1:N , pseudo-observations x 1:N , and velocity field y N .

Figure 3: Lorenz-63 system: (a) parameter estimation results for (top) σ estimation, (center) b estimation, and (bottom) joint estimation of Λ = {σ, r, b}. In (b) we show an example result of state estimation: the left column shows the true Lorenz states vs. the EXKF means; the center and right columns show the distributions of the estimates of the final states.

Figure 3a displays the parameter estimation results. For individual σ and b estimation (Figure 3a, top and center), we initialise the variational posterior randomly and visualise it across training epochs. Both results show the posteriors contracting about the true value, with b demonstrating visually more rapid convergence. For joint estimation (Figure 3a, bottom), similar behaviour is observed, although the parameter r is not identified by the final epoch. We conjecture this is due to an identifiability issue with the other parameters when estimating jointly. Note, however, that the true values are all contained within the confidence bands of the final variational posteriors.

Figure 4: Advection: ELBO (top) and reconstruction MSE (bottom), against ground truth, for Φ-DVAE and the KVAE.

Figure 5: KDV: normalised reconstruction MSE after 200 epochs, for 30 independent simulations.

Figure 6: KDV joint inference results; frames shown for n = {10, 50, 75}.

and t ∈ [0, T]. The operators L_Λ and F_Λ(·) are linear and nonlinear differential operators, respectively. The process ξ is the derivative of a function-valued Wiener process, whose increments are given by a Gaussian process with covariance kernel k(·, ·). In our examples, we use the squared-exponential covariance function (Williams & Rasmussen, 2006)

log p(x_1:N | Λ) = Σ_{n=2}^{N} log p(x_n | x_1:n−1, Λ), p(x_n | x_1:n−1, Λ) = N(H m̂_n, H Ĉ_n Hᵀ + R).

40], and u(s, t) = u(s + 1, t). Recall that, as in Section 5.2, the FEM coefficients u_n = (u_1(nΔ_t), . . . , u_{n_u}(nΔ_t)) are the latent variables. These are related to the discretised solution via u_h(s, nΔ_t) = Σ_{j=1}^{n_u} u_j(nΔ_t) ϕ_j(s).

REPRODUCIBILITY STATEMENT

We have included in Appendix A full numerical details to reproduce all results in our paper. The code developed for this paper will be made publicly available after the review decision, enabling the generation of all datasets and figures.

A NUMERICAL DETAILS AND NETWORK ARCHITECTURES

Lorenz-63 experiment.
• Time-series length: N = 150.
• Input dimension: n_y = 200.
• Pseudo-observation dimension: n_x = 1.
• Latent dimension: n_u = 3.
• Decoder: p_w(y_n | x_n) = N(w x_n, η² I), η = 0.005.
• Encoder: q_w(x_n | y_n) = N((wᵀw)⁻¹ wᵀ y_n, η² (wᵀw)⁻¹).
• Latent initial condition: u_0 = [−3.7277, −3.8239, 21.1507]ᵀ.
• Latent noise processes: L = diag(0.2²), R = diag(0.4²).
• Latent discretisation: Euler-Maruyama, dt = 0.001.
• Optimiser: Adam, learning rate = 10⁻⁴.

To generate the data, we simulate the Lorenz SDE with dt = 0.001 and take pseudo-observations x_n every 40 time-steps of the latent system, for a total of N = 150 with Δ_t = 0.04. Velocity measurements are taken in the s_1 and s_2 directions over a regular 10 × 10 grid on the domain s_1, s_2 ∈ [−4, 4], via the streamfunction ψ(s_1, s_2, t). These measurements are flattened to the data vector y_n ∈ R^{n_y}, n_y = 200. Parameters for data generation are Λ = {σ = 10, r = 28, b = 8/3}.
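The latent data-generating process above can be sketched as an Euler-Maruyama simulation of the Lorenz SDE with subsampling every 40 steps. Treating the latent noise as an isotropic diffusion with standard deviation 0.2 is an interpretation of the settings listed above, made for this sketch.

```python
import numpy as np

def lorenz_drift(u, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Drift of the Lorenz-63 system with the data-generating parameters."""
    x, y, z = u
    return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

def simulate_lorenz_sde(u0, n_obs=150, sub=40, dt=0.001, L_std=0.2, seed=0):
    """Euler-Maruyama simulation of the Lorenz SDE with the Appendix A
    settings: dt = 0.001, pseudo-observations every 40 steps (dt_obs = 0.04)."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    obs = []
    for n in range(n_obs * sub):
        u = u + dt * lorenz_drift(u) + L_std * rng.normal(scale=np.sqrt(dt), size=3)
        if (n + 1) % sub == 0:
            obs.append(u.copy())  # record a pseudo-observation
    return np.array(obs)
```

The resulting trajectory stays on the Lorenz attractor; mapping each state through the streamfunction then yields the velocity-field data y_n.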

Advection equation experiment.

• Time-series length: N = 200.
• Input dimension: n_y = 784.
• Pseudo-observation dimension: n_x = 64.
• Latent dimension: n_u = 64.
• μ_θ(·): MLP, two fully connected hidden layers with dimension 128.
• μ_ϕ(·), σ_ϕ(·): MLP, two fully connected hidden layers with dimension 128.
• Neural network activations: LeakyReLU, negative slope = 0.01.
• Latent initial condition: u(s, 0) = exp(−(x − 2.5)²/0.1).
• Latent noise processes: ρ = 0.02, ℓ = 0.1, R = diag(0.1²).
• Latent discretisation: FEM, C⁰([0, 1]) polynomial trial/test functions, Crank-Nicolson time discretisation, dt = 0.02.
• Optimiser: Adam, learning rate = 0.001.

To generate the data, we simulate the advection equation with dt = 0.02, observing every 10 time-steps for Δ_t = 0.2 and N = 200. The latent dimensions are n_u = 64 and n_x = 64, with each image 28 × 28 pixels. These images are then flattened to vectors y_n ∈ [0, 1]^{n_y}, with n_y = 784.

KDV equation experiment.
• Time-series length: N = 100.
• Input dimension: n_y = 1792.
• Pseudo-observation dimension: n_x = 40.
• Latent dimension: n_u = 600.
• Decoder: p_θ(y_n | x_n) = Bern(μ_θ(x_n)).

Under review as a conference paper at ICLR 2023

The family of distributions which we use to approximate the posterior is described below. We assume a factorisation based on the model, into the variational encoding q_ϕ(·), a full latent state posterior q_ν(·), and the variational approximation to the parameter posterior q_λ(·):

q(u_1:N, x_1:N, Λ | y_1:N) = q_ν(u_1:N | x_1:N, Λ) q_ϕ(x_1:N | y_1:N) q_λ(Λ)
= q_ν(u_1:N | x_1:N, Λ) [Π_{n=1}^{N} q_ϕ(x_n | y_n)] q_λ(Λ).

The second line demonstrates the amortized structure of the autoencoder, where the same encoding parameters are shared across datapoints. We can then substitute this expression into our ELBO and obtain

ELBO = E_{q_ν q_ϕ q_λ}[log p_θ(y_1:N | x_1:N) + log p(x_1:N | u_1:N, Λ) + log p(u_1:N | Λ) + log p(Λ) − log q_ν(u_1:N | x_1:N, Λ) − log q_ϕ(x_1:N | y_1:N) − log q_λ(Λ)].

Assuming the variational posterior is the exact filtering posterior, i.e., q_ν(u_1:N | x_1:N, Λ) = p(u_1:N | x_1:N, Λ), leads to a simplification of the ELBO in terms of the marginal likelihood p(x_1:N | Λ) of the state-space model. Substituting this expression leads to

ELBO = E_{q_ϕ q_λ}[log p_θ(y_1:N | x_1:N) + log p(x_1:N | Λ) − log q_ϕ(x_1:N | y_1:N)] − KL(q_λ(Λ) ‖ p(Λ)).

Using a single MC sample x_1:N ~ q_ϕ(x_1:N | y_1:N) to approximate the expectation, we can write

ELBO ≈ log p_θ(y_1:N | x_1:N) + E_{q_λ}[log p(x_1:N | Λ)] − log q_ϕ(x_1:N | y_1:N) − KL(q_λ(Λ) ‖ p(Λ)).

We can sample from q_λ(Λ) to approximate the expectation of log p(x_1:N | Λ).
Note that this requires the reparameterisation trick, which is also used when sampling x_1:N; this allows for backpropagation of errors through the sampling step. The KL divergence can be calculated analytically in the case of a Gaussian prior and posterior. For example, with q_λ(α) = N(μ_λ, σ_λ²) and prior p(α) = N(μ_0, σ_0²),

KL(q_λ ‖ p) = log(σ_0/σ_λ) + (σ_λ² + (μ_λ − μ_0)²) / (2σ_0²) − 1/2,

and it is approximated via Monte Carlo otherwise.
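The closed-form univariate Gaussian KL term can be sketched directly; this is the standard formula, stated here for the scalar-parameter case used in the KDV experiment.

```python
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL(N(mu_q, sig_q^2) || N(mu_p, sig_p^2)) for univariate
    Gaussians, e.g. KL(q_lambda(alpha) || p(alpha)) in the ELBO."""
    return (np.log(sig_p / sig_q)
            + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2)
            - 0.5)
```

With the KDV prior p(α) = N(1.5, 0.3²), this term penalises the variational posterior q_λ(α) for moving away from the prior and enters the ELBO with a negative sign.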

C FURTHER DETAILS ON THE DYNAMIC MODEL

In this work we take the latent dynamical model to be a stochastic ODE or PDE. For an ODE this follows from a standard SDE (Särkkä & Solin, 2019; Øksendal, 2003), given by

du = f_Λ(u, t) dt + L(t) dW(t).

Next, we rewrite u and v in terms of these basis functions: u_h(s, t) = Σ_{j=1}^{n_u} u_j(t) ϕ_j(s) and v_h(s, t) = Σ_{j=1}^{n_u} v_j(t) ϕ_j(s). As the weak form must hold for all v_h ∈ V_h, this is equivalent to it holding for all ϕ_j; thus, the weak form can be rewritten in terms of this set of basis functions. Note that, in general, u_h and v_h do not necessarily have to be defined on the same function space, but as we use the linear basis functions in this work we stick with this here. As stated in the main text, this is an SDE over the FEM coefficients u(t) = (u_1(t), . . . , u_{n_u}(t))ᵀ. Letting G = LLᵀ, we can write this in the familiar notation as above, from which an Euler-Maruyama time discretisation gives

u_n = u_{n-1} + Δ_t f_{n-1}(u_{n-1}; Λ) + L ΔW_{n-1}, ΔW_{n-1} ~ N(0, Δ_t I),

eventually defining a transition model of the form

p(u_n | u_{n-1}, Λ) = N(u_{n-1} + Δ_t f_{n-1}(u_{n-1}; Λ), Δ_t G).

Note also that the STATFEM methodology allows for implicit discretisations, which may be desirable for time-integrator stability. The transition equations for these approaches can be written out in closed form; although they give Markovian transition models, the resultant transition densities p(u_n | u_{n-1}, Λ) are not necessarily Gaussian, due to the nonlinear dynamics being applied to the current state u_n. Letting e_{n-1} = L ΔW_{n-1} ~ N(0, Δ_t G), the implicit Euler is

u_n = u_{n-1} + Δ_t f_n(u_n; Λ) + e_{n-1},

and the Crank-Nicolson is

u_n = u_{n-1} + Δ_t f_{n-1/2}(u_{n-1/2}; Λ) + e_{n-1},

where u_{n-1/2} := (u_n + u_{n-1})/2.
Furthermore, computing the marginal measure p(u_n | Λ) also requires integrating over the previous solution u_{n-1}; again, due to the nonlinear dynamics, this will not necessarily be Gaussian. In each of these cases, therefore, the transition equation is

M(u_n, u_{n-1}) = e_{n-1}, e_{n-1} ~ N(0, Δ_t G),

where we take, for the implicit Euler,

M(u_n, u_{n-1}) = u_n − u_{n-1} − Δ_t f_n(u_n; Λ),

and, for the Crank-Nicolson,

M(u_n, u_{n-1}) = u_n − u_{n-1} − Δ_t f_{n-1/2}(u_{n-1/2}; Λ),

with u_{n-1/2} = (u_n + u_{n-1})/2. In practice, due to the conservation properties of the Crank-Nicolson discretisation, we use it for all models in this work.

The discretised solutions u_n are mapped at time n to "pseudo-observations" via the observation process

x_n = H u_n + r_n, r_n ~ N(0, R).

This observation process has density p_ν(x_n | u_n), where ν = {H, R}. As stated in the main text, the pseudo-observation operator H and the observational covariance R are assumed known in this work; we typically use a diagonal covariance, setting R = σ² I. In the PDE case, for a given STATFEM discretisation as above, the pseudo-observations are assumed to be taken on a user-specified grid x_obs ∈ R^{n_x}. The pseudo-observation operator thus acts as an interpolant, such that

H u_n = (u_h(x¹_obs, nΔ_t), u_h(x²_obs, nΔ_t), . . . , u_h(x^{n_x}_obs, nΔ_t))ᵀ.
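For the 1D experiments, this interpolating operator H can be sketched as a sparse matrix whose rows hold the values of the linear "hat" basis functions at the observation points; the assembly below assumes a sorted 1D mesh and is illustrative rather than the paper's implementation.

```python
import numpy as np

def hat_interpolation_matrix(nodes, x_obs):
    """Pseudo-observation operator H for C^0 linear 'hat' FEM bases on a
    sorted 1D mesh: row i evaluates u_h(x_obs[i]) = sum_j u_j phi_j(x_obs[i]),
    so each row holds the two interpolation weights of the containing element."""
    H = np.zeros((len(x_obs), len(nodes)))
    for i, x in enumerate(x_obs):
        j = np.searchsorted(nodes, x)     # first node >= x
        if j == 0:
            H[i, 0] = 1.0                 # x at (or before) the left boundary
        else:
            j = min(j, len(nodes) - 1)
            h = nodes[j] - nodes[j - 1]   # element width
            w = (x - nodes[j - 1]) / h    # local barycentric weight
            H[i, j - 1] = 1.0 - w
            H[i, j] = w
    return H
```

Because linear interpolation is exact on linear functions, applying H to nodal values of a linear field reproduces the field at the observation points, a convenient sanity check for the operator.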

