SCORE MATCHING VIA DIFFERENTIABLE PHYSICS

Abstract

Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution p(x) over time by adding noise to it. A neural network is trained to approximate the score ∇_x log p_t(x) at time t, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference to previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. At the core of our method, we optimize the so-called probability flow ODE to fit a training set of simulation trajectories inside an ODE solver, and we solve the reverse-time SDE for inference to sample plausible trajectories that evolve towards a given end state. We demonstrate the competitiveness of our approach for different challenging inverse problems.

1. INTRODUCTION

Many physical systems are time-reversible on a microscopic scale. For example, a continuous material can be represented by a collection of interacting particles (Gurtin, 1982; Blanc et al., 2002) based on which we can predict future states of the material. We can also compute earlier states, meaning we can evolve the simulation backwards in time (Martyna et al., 1996). When taking a macroscopic perspective, we only know the average quantities within specific regions (Farlow, 1993), which constitutes a loss of information. It is only then that time-reversibility is no longer possible, since many macroscopic and microscopic initial states exist that evolve to yield the same macroscopic state. In the following, we target inverse problems to reconstruct the distribution of initial macroscopic states for a given end state. This genuinely tough problem has applications in many areas of scientific machine learning (Zhou et al., 1996; Gómez-Bombarelli et al., 2018; Delaquis et al., 2018; Lim & Psaltis, 2022), and existing methods lack tractable approaches to represent and sample the distribution of states. We address this issue by leveraging continuous approaches for diffusion models in the context of physical simulations. In particular, our work builds on the reverse-diffusion theorem (Anderson, 1982). Given the drift f(•, t) : R^d → R^d and the diffusion g(•) : R → R, it can be shown that under mild conditions, for the forward stochastic differential equation (SDE) dx = f(x, t) dt + g(t) dw there is a corresponding reverse-time SDE dx = [f(x, t) - g(t)^2 ∇_x log p_t(x)] dt + g(t) dw̄. In particular, this means that given a marginal distribution of states p_0(x) at time t = 0 and p_T(x) at t = T such that the forward SDE transforms p_0(x) to p_T(x), the reverse-time SDE runs backward in time and transforms p_T(x) into p_0(x). The term ∇_x log p_t(x) is called the score.
This theorem is a central building block for SDE-based diffusion models and denoising score matching (Song et al., 2021c; Jolicoeur-Martineau et al., 2021), which parameterize the drift and diffusion in such a way that the forward SDE corrupts the data and transforms it into random noise. By training a neural network to represent the score, the reverse-time SDE can be deployed as a generative model, which transforms samples from random noise p_T(x) to the data distribution p_0(x). In this paper, we show that a similar methodology can likewise be employed to model physical processes. We replace the drift f(x, t) by a physics model P(x) : R^d → R^d, which is implemented by a differentiable solver or a neural network that represents the dynamics of a physical system, thus deeply integrating physical knowledge into our method. The end state at t = T on which the forward SDE acts is not fully destroyed by the diffusion g(t); instead, the noise acts as a perturbation of the system state over time. An overview of our method is shown in figure 1. To the best of our knowledge, our work is the first to leverage the reverse-diffusion theorem as a method for solving inverse problems of physical systems. As such, our primary aim is to demonstrate how existing algorithms from this field can be used in the context of physics simulations. We showcase the efficacy of the score matching viewpoint on physics problems with a range of challenging inverse problems. Specifically, our contributions are:
1. We develop a framework in which we incorporate the reverse-diffusion theorem and score matching into a method for solving inverse problems that involve the time evolution of physical systems. We demonstrate its competitiveness against common baseline approaches using the heat equation as an example.
2. We highlight the effectiveness of our method with a more challenging inverse problem where we simulate a fluid-based transport process backwards in time in the presence of randomized obstacles. Here, we compare our method to different strategies for learned solvers.
3. Finally, we show that this approach can even be used when the underlying SDE is unknown. Our approach can be combined with operator learning methods, and we demonstrate its effectiveness for learning the Navier-Stokes equations in the turbulent regime.

2. BACKGROUND AND RELATED WORK

Learned solvers: Numerical simulations benefit greatly from machine learning models (Tompson et al., 2017; Morton et al., 2018; Pfaff et al., 2020; Li et al., 2020). By integrating a neural network inside differential equation solvers, it is possible to learn to reduce numerical errors (Tompson et al., 2017; Kochkov et al., 2021; Brandstetter et al., 2022) or to guide the simulation towards a desired target state (Holl et al., 2020b; Li et al., 2022). As errors may accumulate quickly over time, trained networks benefit from gradients that are backpropagated over multiple time steps (Um et al., 2020).
Diffusion models: Diffusion models (Ho et al., 2020; Song et al., 2021c) have been considered for a wide range of applications. Most notably, diffusion models have been proposed for image (Dhariwal & Nichol, 2021), video (Ho et al., 2022; Höppe et al., 2022; Yang et al., 2022) and audio synthesis (Chen et al., 2021). Recently, Bansal et al. (2022) have proposed to train generalized diffusion models for arbitrary transformations and suggest that fully deterministic models without any noise are sufficient for generative behaviour. Many methods have been proposed specifically for uncertainty quantification, solving inverse problems, and conditional sampling (Chung et al., 2021; 2022; Song et al., 2021b; Kawar et al., 2021; Ramzi et al., 2020). However, most approaches either focus on the denoising objective that is common for tasks involving natural images, or the synthesis process of solutions does not take the underlying physics directly into account.
Generative modeling via SDEs: Classical denoising score matching approaches based on Langevin dynamics (Vincent, 2011; Song & Ermon, 2019, SMLD) and on discrete Markov chains, e.g. Denoising Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015; Ho et al., 2020, DDPM), can be unified in a time-continuous framework using SDEs (Song et al., 2021c).
Given a distribution of states p_0(x) at time t = 0, an SDE

dx = f(x, t) dt + g(t) dw,    (1)

with w the standard Brownian motion, a drift f(•, t) : R^d → R^d and a diffusion g(•) : R → R, transforms p_0(x) to a tractable distribution p_T(x) and, for x_0 ∼ p_0(x), yields a diffusion process (x_t)_{t=0}^T. As outlined above, the reverse-time SDE of the reverse-diffusion theorem (Anderson, 1982) is given by

dx = [f(x, t) - g(t)^2 ∇_x log p_t(x)] dt + g(t) dw̄.    (2)

The evolution of the marginal probability density p_t(x) for the SDE in equation (1) is described by Kolmogorov's forward equation (Øksendal, 2003). Maoutsa et al. (2020) and Song et al. (2021c) show that there exists an ODE with the same Kolmogorov forward equation. This ODE is called the probability flow ODE and is given by

dx = [f(x, t) - (1/2) g(t)^2 ∇_x log p_t(x)] dt.    (3)

The probability flow ODE (3) represents a continuous normalizing flow (CNF) and, if f(x, t) is known, a network s_θ parameterized by θ representing ∇_x log p_t(x) can be trained via maximum likelihood using standard methods (Chen et al., 2018). While the evolution of p_t(x) is the same between the probability flow ODE from equation (3) and the reverse-time SDE from equation (2), there are caveats due to the approximation by s_θ(x, t) (Song et al., 2021b; Lu et al., 2022). Huang et al. (2021) show that minimizing the score-matching loss is equivalent to maximizing a lower bound of the likelihood obtained by sampling from the reverse-time SDE. A recent variant combines score matching with CNFs (Zhang & Chen, 2021) and employs a joint training of the drift and score with an integration backwards in time.
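As a sanity check for the relationship between equations (1) and (2), the following toy sketch (our illustration, not part of the paper's experiments) simulates a pure-diffusion forward SDE with f ≡ 0 starting from a Gaussian, for which the score is known in closed form, and then runs the reverse-time SDE to recover the original marginal:

```python
import numpy as np

rng = np.random.default_rng(0)
g, sigma0, T, n_steps, n = 1.0, 0.5, 1.0, 500, 100_000
dt = T / n_steps

def score(x, t):
    # For f = 0 the marginals stay Gaussian: p_t = N(0, sigma0^2 + g^2 t),
    # so the score -x / var(t) is available analytically.
    return -x / (sigma0**2 + g**2 * t)

# Forward SDE dx = g dw: pure diffusion broadens the initial Gaussian.
x = sigma0 * rng.standard_normal(n)
for _ in range(n_steps):
    x += g * np.sqrt(dt) * rng.standard_normal(n)
var_T = x.var()  # approx. sigma0^2 + g^2 T = 1.25

# Reverse-time SDE dx = [f - g^2 score] dt + g dw_bar, integrated T -> 0:
# stepping backwards in time adds +dt * g^2 * score plus fresh noise.
for i in range(n_steps, 0, -1):
    t = i * dt
    x += dt * g**2 * score(x, t) + g * np.sqrt(dt) * rng.standard_normal(n)
var_0 = x.var()  # approx. sigma0^2 = 0.25: the original marginal is recovered
```

The reverse simulation only matches p_0 in distribution, not pathwise, which is exactly the property SMDP exploits for posterior sampling.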

3. METHOD

Modeling assumptions: We consider a known physics model P(x) : R^d → R^d that is differentiable and approximates the time evolution, i.e. x_{t_{m+1}} ≈ x_{t_m} + (t_{m+1} - t_m) · P(x_{t_m}). One of our key modelling choices is to describe the time evolution of the physical system by a stochastic differential equation

dx = P(x) dt + g(t) dw,    (4)

with a diffusion g(•) : R → R that perturbs the simulation states. We can simulate paths from this SDE using Euler-Maruyama steps, i.e. for a time discretization t_0 < t_1 < ... < t_M and an initial state x_{t_0}, we obtain the iteration rule

x_{t_{m+1}} ← x_{t_m} + (t_{m+1} - t_m) · P(x_{t_m}) + sqrt(t_{m+1} - t_m) · g(t_m) z_{t_m},    (5)

where the z_{t_m} are i.i.d. with z_{t_m} ∼ N(0, I). The random noise g(t_m) z_{t_m} that is added to the data at each time step can be regarded either as noise inherent to the physical problem or as noise from a measurement process. As training data, we consider a set of N trajectories {(x_{t_i,n})_{i=0,...,M}}_{n=0,...,N} sampled with equation (5), which describe the evolution of a physical system.
Inverse problem: Given an end state x_T, we are interested in recovering a likely trajectory (x^pred_{t_i})_{i=0,...,M} that evolves towards x_T. More formally, the set of trajectories {(x_{t_i,n})_{i=0,...,M}}_{n=0,...,N} implicitly defines marginal likelihoods p_t(x) at every time step t, which are linked through the SDE (4) of the physical system by the Kolmogorov forward equation. The solution trajectory may not be unique, so we want to sample from the full posterior instead of obtaining only a maximum likelihood solution, i.e. we want to sample from p_0(x | x_T).
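The Euler-Maruyama sampling rule of equation (5) can be sketched as follows. This is a minimal NumPy version for illustration; the `physics` callable stands in for the differentiable simulator P, and the exponential-decay operator used at the bottom is a hypothetical stand-in, not one of the paper's systems:

```python
import numpy as np

def simulate_sde(x0, physics, g, ts, rng):
    """Euler-Maruyama rollout of dx = P(x) dt + g(t) dw, cf. equation (5).

    physics: the operator P(x); g: diffusion g(t); ts: increasing time grid."""
    traj = [x0]
    x = x0
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        z = rng.standard_normal(x.shape)  # z_{t_m} ~ N(0, I)
        x = x + dt * physics(x) + np.sqrt(dt) * g(t0) * z
        traj.append(x)
    return np.stack(traj)

# Stand-in physics (exponential decay P(x) = -x) purely for illustration.
rng = np.random.default_rng(0)
ts = np.linspace(0.0, 0.2, 33)
traj = simulate_sde(np.ones(16), lambda x: -x, lambda t: 0.1, ts, rng)
```

Repeating this rollout for many initial states x_{t_0} produces exactly the kind of trajectory training set described above.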

Method

In line with previous work in score-based generative modeling (Song & Ermon, 2019; Song et al., 2021c), we approximate the score ∇_x log p_t(x) of the marginal likelihoods by a neural network s_θ(x, t). We optimize s_θ(x, t) via maximum likelihood training of the probability flow ODE, as discussed in section 2. For this, we maximize a variational lower bound for the maximum likelihood objective, which we estimate by minimizing the following loss

L((x_{t_i})_{i=0}^M, θ) = Σ_{i=1}^M ||x_{t_i} - x^ODE_{t_i}||_2^2    (6)
s.t. x^ODE_0 = x_T + ∫_T^0 [P(x^ODE_t) - (1/2) g^2(t) s_θ(x^ODE_t, t)] dt,    (7)

where we sample an SDE trajectory (x_{t_i})_{i=0}^M from the training set. We give theoretical justification for this objective in appendix A. Intuitively, our method fits bijective and deterministic trajectories of the probability flow ODE to the non-deterministic SDE trajectories. In contrast to previous work, our method deeply integrates a prior about the physical system in the form of the simulation operator P(x) into the training process. In this context, the end state at t = T is not fully destroyed by the noise; instead, the noise acts as a perturbation of the system state over time. An overview of our method is shown in figure 2. Given an end state x_T, we can solve the probability flow ODE backwards in time using the trained score function s_θ(x, t) to obtain a trajectory (x^pred_{t_i})_{i=0,...,M}. However, this will only give a single, deterministic solution and not allow for sampling from the posterior p(x | x_T). We therefore simulate trajectories from the reverse-time SDE (see section 1) via

dx = [P(x) - g^2(t) s_θ(x, t)] dt + g(t) dw̄.    (8)

The evolution of the marginal probabilities p_t(x) for this SDE is the same as for the probability flow ODE (7) (Song et al., 2021c). Moreover, by the reverse-diffusion theorem (Anderson, 1982), SDE (8) is the time-reverse of the physical system SDE from equation (4).
Therefore, we can approximate sampling from p(x | x_T) by simulating trajectories from SDE (8) using any traditional SDE solver. In the following, we refer to the integration of the physics model P(x) into the score-based modelling approach as score matching via differentiable physics, or SMDP in short. We denote trajectories from the probability flow ODE by SMDP-ODE, and those obtained by simulating the reverse-time SDE by SMDP-SDE.

Algorithm 1: SMDP-ODE, SMDP-SDE
Require: x_{t_M}, {t_m}_{m=0}^M, {g_{t_m}}_{m=0}^M
1: for m ← M to 1 do
2:     p ← P(x_{t_m})
3:     s ← -g^2_{t_m} s_θ(x_{t_m}, t_m) / 2
4:     if SMDP-ODE then
5:         x_{t_{m-1}} ← x_{t_m} - (t_m - t_{m-1}) · (p + s)
6:     if SMDP-SDE then
7:         x_{t_{m-1}} ← x_{t_m} - (t_m - t_{m-1}) · (p + 2s)
8:         z ∼ N(0, I)
9:         x_{t_{m-1}} ← x_{t_{m-1}} + g_{t_m} · sqrt(t_m - t_{m-1}) · z
10: return x_{t_0}

Training and inference: Algorithm 1 gives an overview of SMDP inference for the ODE as well as the SDE variant when using the explicit Euler method as the ODE solver. For simplicity, we also employ the explicit Euler method for training and backpropagate gradients through multiple solver steps when computing the ODE trajectory in equation (7) to obtain updates for θ. We also refer to this procedure as unrolling the dynamics. Our training setup is similar to Um et al. (2020), which was originally developed for training correction functions in the context of controlling numerical errors for physical simulations. In particular, in our implementation, we consider a sliding window for unrolling the dynamics, which makes our training very flexible, i.e. we can consider single-step updates as well as unrolling the entire simulation. We give a more detailed description of our training setup in appendix A. We consider an additional variant of SMDP, for which we apply a bidirectional training: instead of only training the probability flow ODE for the time-backward direction T → 0, we alternate with optimizing the time-forward direction 0 → T, i.e.
equation (7) becomes

x^ODE_T = x_0 + ∫_0^T [P(x^ODE_t) - (1/2) g^2(t) s_θ(x^ODE_t, t)] dt.
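Algorithm 1 can be sketched in code as follows. This is a minimal NumPy version under the assumption of generic callables for P, s_θ, and g; the actual implementation relies on a differentiable solver, and the function name `smdp_rollout` is ours:

```python
import numpy as np

def smdp_rollout(x_T, physics, score, g, ts, rng, variant="ODE"):
    """Algorithm 1: integrate backwards from t_M to t_0 with explicit Euler.

    variant="ODE" solves the probability flow ODE (deterministic);
    variant="SDE" simulates the reverse-time SDE (adds fresh noise)."""
    x = x_T
    for m in range(len(ts) - 1, 0, -1):
        dt = ts[m] - ts[m - 1]
        p = physics(x)                              # p <- P(x_{t_m})
        s = -0.5 * g(ts[m]) ** 2 * score(x, ts[m])  # s <- -g^2 s_theta / 2
        if variant == "ODE":
            x = x - dt * (p + s)
        else:  # SMDP-SDE: double score weight plus noise term
            x = x - dt * (p + 2.0 * s)
            x = x + g(ts[m]) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```

With a zero score and g ≡ 0, the ODE variant reduces to reversing the explicit Euler physics steps, which makes the role of s_θ as a learned correction for the lost information explicit.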

4. EXPERIMENTS

We conduct several experiments to demonstrate the advantages of SMDP compared to a number of baseline methods. The source code for all experiments will be made available upon acceptance. We first consider the 2D heat equation in section 4.1, where our objective is to find possible initial states at time t = 0 given a noisy end state at time t = T. In our second experiment in section 4.2, we transfer the established practices to a more challenging problem, where we are interested in reconstructing the trajectory of a buoyancy-driven flow within a closed simulation domain given an end state at time T. What makes this problem challenging is that for each simulation, we place different obstacles at different positions within the simulation domain. Then, in section 4.3, we consider the situation where the physics of the system is unknown. For this purpose, we train a network that approximates the solutions to the Navier-Stokes equations and a network that approximates the score field. We demonstrate that by doing so, we obtain an improved performance for inverse problems, and that the learned score can be used to refine predictions in post-processing. We provide additional results in the appendix. In appendix B, we compare our method with implicit score matching (Hyvärinen, 2005) in a toy experiment, which demonstrates the importance of including the physics dynamics in the training. We analyze how many steps are required when unrolling the dynamics to obtain stable trajectories in appendix C. Finally, in the appendix we give an additional evaluation of the quality and diversity of the posterior distribution obtained when sampling from the reverse-time SDE in the heat equation experiment.
Figure 3: Heat diffusion case. We simulate a Gaussian random field at t = 0 forwards in time using equation (5). Given s_θ, we can either solve the probability flow ODE or simulate trajectories of the reverse-time SDE to obtain solutions for the state at t = 0.
Table 1: The closer a method is to the ground truth, the better it produces structures of a similar scale.

4.1. HEAT EQUATION

We consider the heat equation ∂u/∂t = α∆u, which plays a fundamental role in many physical systems. Here, we set the diffusivity constant to α = 1, and initial conditions at t = 0 are generated from Gaussian random fields with n = 4 at a resolution of 32 × 32. We simulate the heat diffusion using a spectral method until t = 0.2 with a fixed step size ∆t = 6.25 × 10^-3, using the iteration rule from equation (5) with g ≡ 0.1. Our training data set consists of 2,500 initial conditions with their corresponding trajectories, sampled with varying step size ∆t, and end states at t = 0.2. The test set comprises 500 initial conditions and corresponding end states generated directly without any noise.
Training: We consider a small ResNet-like architecture based on an encoder and a decoder part (see appendix E) as the representation for the score function s_θ(x, t). The physics model P is implemented via differentiable programming in JAX (Schoenholz & Cubuk, 2020). For better comparison with the baseline methods, these are trained with Gaussian random noise of σ = 0.1 added to the inputs. This noise is applied to all network inputs during testing.
Baseline methods: As baseline methods, we consider the ResNet-like architecture from above, a Bayesian neural network (BNN) based on a U-Net architecture with spatial dropout (Mueller et al., 2022), and a Fourier neural operator (FNO) network (Li et al., 2020). For each of these three methods, we consider two variants: the first variant is trained with a supervised loss, i.e. the training data consists of pairs (x_0, x_T) with initial state x_0 and end state x_T. The supervised loss corresponds to the squared L2 distance between the network prediction x^pred_0 and the ground truth, i.e. ||x^pred_0 - x_0||_2^2.
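The spectral forward simulation of the heat equation can be sketched as follows. This is our illustrative NumPy version (the paper's solver is written in JAX), assuming a periodic unit-length domain: each Fourier mode of u decays as exp(-α|k|² ∆t), which is exact in time for this linear PDE.

```python
import numpy as np

def heat_step_spectral(u, alpha=1.0, dt=6.25e-3, length=1.0):
    """One exact step of du/dt = alpha * laplace(u) on a periodic grid:
    every Fourier mode is damped by exp(-alpha * |k|^2 * dt)."""
    n = u.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    decay = np.exp(-alpha * (kx**2 + ky**2) * dt)
    return np.real(np.fft.ifft2(np.fft.fft2(u) * decay))
```

Adding g · sqrt(∆t) · z after each such step yields the noisy SDE trajectories of equation (5) for this system.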
For the second variant, the reconstruction loss, we rely on the differentiable solver and only make use of the end state x_T, such that the loss becomes ||P(x^pred_0; T) - x_T||_2^2, i.e. we simulate the network output forward in time using P to obtain a state at t = T, which we compare to the desired end state x_T. We denote the supervised variant by S and the physics-based one by P. Additionally, we consider an adapted generative model from Rissanen et al. (2022), denoted by HeatGen. We train this network similarly to SMDP-ODE, but without the solver P, such that the network has to learn the score and the physics at the same time.
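The two baseline losses can be sketched as follows (illustrative only; `physics_step` is a hypothetical stand-in for one differentiable solver step, and in practice both losses are differentiated through for training):

```python
import numpy as np

def supervised_loss(x0_pred, x0_true):
    """Variant S: squared L2 distance to the ground-truth initial state."""
    return np.sum((x0_pred - x0_true) ** 2)

def reconstruction_loss(x0_pred, x_T, physics_step, n_steps):
    """Variant P: roll the prediction forward with the differentiable
    solver and compare only against the observed end state x_T."""
    x = x0_pred
    for _ in range(n_steps):
        x = physics_step(x)
    return np.sum((x - x_T) ** 2)
```

Variant P only requires end states as supervision, which is why it depends on the solver P being differentiable.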

Reconstruction accuracy vs. fitting the data manifold

We evaluate our method and the baselines by considering the reconstruction MSE on the test set: how well a predicted initial state x̂_0 that is simulated forward in time yields states that correspond to the reference end state x_T in terms of MSE. This metric has the disadvantage that it does not measure how well the prediction matches the training data manifold, i.e. in this case, whether the prediction resembles the statistics of the Gaussian random field. For that reason, we additionally compare the power spectral density of the states as the spectral loss. The corresponding measurements are given in table 1, which shows that our method SMDP-ODE performs best in terms of the reconstruction MSE. However, solutions obtained by SMDP-ODE are very smooth and do not contain the small-scale structures of the references, which is reflected in a high spectral error that is also visually prominent, as shown in figure 4. SMDP-SDE, on the other hand, performs very well in terms of the spectral error and yields visually convincing solutions with only a slight increase in the reconstruction loss. We note that there is a natural tradeoff between both metrics, and SMDP-ODE and SMDP-SDE each perform best for one of the two while using the same set of weights.
Ablation study: We performed an ablation study to highlight several design choices of the proposed method. In particular, we note that despite fundamental differences between SMDP-ODE and SMDP-SDE, as explained in section 3, the main difference at inference time is the constant factor for s_θ and the noise term g(t) dw̄ for SMDP-SDE, cf. equations (7) and (8). We investigated how the change in noise integration affects the performance of SMDP-ODE and considered a variant, SMDP-ODE+noise, that includes the addition of the noise term during inference but is otherwise identical to SMDP-ODE.
As shown with crosses in figure 5, this method has a slightly higher reconstruction loss but, in contrast to SMDP-SDE, does not improve upon the spectral error. This indicates that SMDP-SDE can recover the small-scale distributions of the references, while the ODE variant, by construction, tends to favour smooth solutions when faced with uncertainties. We additionally investigated the effect of the proposed bidirectional training scheme. The error measurements of figure 5 show that the ODE variant is not affected, but SMDP-SDE benefits significantly. The resulting model robustly handles a wide range of temporal discretizations for inference that deviate from the training discretization, with high accuracy. In conclusion, our SMDP-SDE model with bidirectional training yields the best performance for a wide range of hyperparameter settings. In appendix F, we additionally evaluate the effects of a logarithmic time discretization and a physics-conditioned score field.

4.2. BUOYANCY-DRIVEN FLOW WITH OBSTACLES

Next, we test our methodology on a more challenging problem. For this purpose, we consider simulations of buoyancy-driven flow within a fixed domain Ω ⊂ [0, 1] × [0, 1] with randomly placed obstacles. We make use of semi-Lagrangian advection for the velocity and MacCormack advection for the hot marker density. The temperature dynamics of the marker field are modeled with a Boussinesq approximation. Each simulation runs from time t = 0.0 to t = 0.65 with a step size of ∆t = 0.01. The inflow at (0.2, 0.5) is active until t = 0.2. Our objective is to employ SMDP-ODE and SMDP-SDE to obtain trajectories that reconstruct a plausible flow given an end state of the marker density and velocity fields at time t = 0.65. Our training data set consists of 250 simulations with corresponding trajectories. For the data generation, we make use of the differentiable phiflow solver (Holl et al., 2020a). We place spheres and boxes of varying sizes at different positions within the simulation domain such that they do not overlap with the marker inflow. For each simulation, we place one to two objects of each category. The testing set comprises 5 simulations. It becomes apparent that our ODE method clearly outperforms the learned baseline. Interestingly, the SDE variant performs less well for this test case. This behavior can be explained by the highly nonlinear system dynamics and the comparatively approximate reverse simulator, which yields the substantial errors for the Physics only version in table 2. These errors cause the score network to inadvertently learn significant corrections of the physics operator, which deteriorates the quality of the score field. Nonetheless, as qualitatively shown in figure 6, both variants are able to accurately recover the initial states despite the complex motion of the fluid around the obstacles.

Algorithmic variants

We evaluate several altered configurations of our SMDP method to further illustrate its behavior. We consider adding noise during rollouts, i.e. adding a noise term σ · z for z ∼ N(0, I) to the state x after applying the score and physics updates during training. Additionally, we experiment with a physics-conditioned score, i.e. we extend the input dimension of the score function to accept concatenated inputs of the form s_θ([x, P(x)], t). We also evaluate the effects of the bidirectional training. Finally, we evaluate a version that decouples score and physics, i.e. we do not evaluate the score and physics update on the same input x; instead, we first apply the physics update P(x) and evaluate the score function afterwards. Overall, the error measurements in table 2 justify our choices for the baseline SMDP algorithm.

4.3. ISOTROPIC TURBULENCE

As a third example, we consider a problem where the physics operator is unknown, i.e. we approximate both P and the score ∇_x log p_t(x) by neural networks. We consider the problem of learning the time evolution of isotropic, forced turbulence as determined by the 2D Navier-Stokes equations with a viscosity of ν = 10^-5 (Li et al., 2020). The training data set consists of vorticity fields from 1000 simulation trajectories from t = 0 until T = 10 with ∆t = 1 and a spatial resolution of 64 × 64. Our objective is to predict a trajectory x̂_{0:10} that reconstructs the true trajectory x_{0:10} given an end state x_10 of the solution, whereas in the original paper, the objective was to learn an operator mapping that predicts the vorticity at a later point in time.
Evaluation: We evaluate the improvements of SMDP over the learned variants in table 3. Compared to the Learned physics variant, our methods improve the mean squared error (MSE) between the ground truth trajectories and the reconstructed trajectories slightly, while there is a substantial decrease in the spectral error. This can be seen qualitatively in figure 7. In this scenario, DiffFlow has severe difficulties learning the state updates and score field, resulting in large differences in terms of MSE. As before, the SMDP-SDE method performs best in terms of the spectral error at the expense of a slightly increased MSE.
Outlook: Refinement with Langevin dynamics. Since the score ∇_x log p_t(x) represents a data gradient, we can use gradient-based optimization methods to find local optima of the data distribution p_t(x) that are close to x. Inspired by stochastic gradient Langevin dynamics (Welling & Teh, 2011), we consider the iteration rule

x^{i+1}_t = x^i_t + ϵ · ∇_x log p_t(x^i_t) + sqrt(2ϵ) z_t,

for ϵ = 2 × 10^-5, where z_t ∼ N(0, I) (details in appendix H.1). Denoted by SMDP-SDE+LD in figure 7, this method manages to extract even finer details from the reverse-time SDE solution.
As such it provides an interesting starting point for a further refinement of the SMDP results.
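The SGLD-style refinement can be sketched as follows; a minimal NumPy version, where the function name `langevin_refine` and the Gaussian score used in the test are our illustrative stand-ins (the paper applies the learned score s_θ with ϵ = 2 × 10^-5):

```python
import numpy as np

def langevin_refine(x, score, t, rng, eps=2e-5, n_iter=100):
    """Stochastic gradient Langevin dynamics (Welling & Teh, 2011):
    x <- x + eps * score(x, t) + sqrt(2 * eps) * z, with z ~ N(0, I)."""
    for _ in range(n_iter):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x, t) + np.sqrt(2.0 * eps) * z
    return x
```

For small ϵ the iterates approximately sample from p_t in a neighbourhood of the starting point, nudging the reverse-SDE solution toward the data manifold without leaving the mode it converged to.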

5. CONCLUSION

We presented SMDP, a derivative of score matching in the context of physical simulations and differentiable physics. We demonstrated its competitiveness against different baseline methods on challenging inverse physics problems. We demonstrated the versatility of SMDP with two variants: while the neural ODE variant focuses on accuracy in terms of MSE, the neural SDE variant allows for sampling the posterior and yields an improved coverage of the target data manifold. Despite the promising initial results, our work gives rise to many interesting questions. In particular, the time discretization is a crucial issue: for training data generation and differentiable solvers, we would favor larger step sizes and fewer evaluations due to computational constraints, while for conventional diffusion models, a large number of smaller time steps often yields substantial improvements in terms of quality. Determining a good balance between accurate solutions with few time steps and diverse solutions with many time steps will remain an important direction for future research in this area.

APPENDIX A ADDITIONAL DETAILS OF TRAINING METHODOLOGY

Below we summarize the problem formulation from the main paper and provide details about the training procedure, together with further information on how solutions of the probability flow ODE relate to solutions of the SDE.

Problem setting

We consider the time evolution of the physical system modeled by the stochastic differential equation

dx = P(x) dt + g(t) dw,    (10)

with a drift P : R^d → R^d and diffusion g : [0, T] → R^+, which transforms the marginal distribution p_0 of initial states at time 0 to the marginal distribution p_T of end states at time T. Moreover, we assume that we have sampled N trajectories of length M from the above SDE with a fixed time discretization 0 ≤ t_0 < t_1 < ... < t_M ≤ T for the interval [0, T] and collected them in a training data set {(x_{t_i,n})_{i=0,...,M}}_{n=0,...,N}. We are interested in training a neural network s_θ(x, t) parameterized by θ to approximate the score ∇_x log p_t(x), i.e. to minimize the score matching objective

J_SM(θ; λ(•)) := (1/2) ∫_0^T E_{x∼p_t}[λ(t) ||s_θ(x, t) - ∇_x log p_t(x)||_2^2] dt,    (11)

where λ : [0, T] → R^+ is a weighting function. In the case of denoising score matching, where the underlying SDE is dx = f(x, t) dt + g(t) dw for affine functions f(•) and g(•), the score can be learned by minimizing the denoising score matching objective using transition kernels, which enables an efficient training of diffusion models; see Song et al. (2021c) for a reference. In our case, P is an arbitrary function describing the dynamics of the physical system, and hence we cannot rely on an analytical expression for the transition kernel.
Training via continuous normalizing flows: Score-based diffusion models can be transformed into continuous normalizing flows (Chen et al., 2018, CNFs), which allows for a tractable computation of the likelihood. We can train the corresponding CNF given by

dx = [P(x) - (1/2) g^2(t) s_θ(x, t)] dt    (12)

using maximum likelihood training, i.e. maximizing

E_{x_0∼p_0}[log p^ODE_0(x_0)]    (13)
s.t. x_T = x_0 + ∫_0^T [P(x_t) - (1/2) g^2(t) s_θ(x_t, t)] dt,

where p^ODE_0 is the distribution obtained by sampling x_T ∼ p_T and simulating x_0 using ODE (12).
The log-likelihood can be computed using the instantaneous change of variables formula (Chen et al., 2018) and by using the fact that p_T is approximately Gaussian. For denoising score matching, maximum likelihood training of the corresponding CNF is usually not done, because it requires running an ODE solver for every optimization step, and training with the denoising score matching objective is more efficient (Huang et al., 2021; Song et al., 2021a). It was shown by Song et al. (2021a) that there is a connection between the Kullback-Leibler divergence and the score matching objective. In particular,

KL(p_0 || p^SDE_θ) ≤ J_SM(θ; g(•)^2) + KL(p_T || π)

for a prior distribution π and p^SDE_θ defined by sampling x_T ∼ π and solving the reverse-time SDE using the score approximation s_θ(x, t).
Chaining CNFs: For SDE (10), the distribution p_T describes the perturbed end states, and an exact likelihood computation is no longer possible. Equivalently to maximizing the likelihood of the CNF, we can minimize the Kullback-Leibler divergence KL(p_0 || p^ODE_0). Since in our problem setting we do not require p_0 or p_T to be a simple distribution for which we can evaluate the likelihood, we can think of the CNF from time 0 to time T as multiple smaller CNFs chained together, e.g. we consider a chain of CNFs for the time discretizations t_1 to t_0, t_2 to t_1, and so on. Each of these CNFs transforms the marginal probabilities p_{t_i} to p_{t_{i+1}} via the probability flow ODE (12) by minimizing KL(p_{t_i} || p^{ODE, t_{i+1}}_{t_i}) during training, where p^{ODE, t_{i+1}}_{t_i} is defined by sampling x_{t_{i+1}} ∼ p_{t_{i+1}} and simulating the probability flow ODE from time t_{i+1} until t_i. For example, we can sample x_T ∼ p_T and generate data from p_0 by recursively using the CNFs to generate x_{t_i} given x_{t_{i+1}} until we reach x_0. Since CNFs are bijective, this also works in the reverse direction, i.e. we can sample x_0 ∼ p_0 and simulate x_T from p_T with the same method.
In the following, we will derive our method using the regular, increasing flow of time, i.e. $0 \to T$. The direction $T \to 0$ follows analogously by replacing the SDE (10) with the corresponding reverse-time SDE (Anderson, 1982). Then, the new objective obtained by chaining the CNFs becomes the minimization of $$\sum_{i=0}^{M-1} \mathrm{KL}(p_{t_{i+1}} \,\|\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}),$$ where $p_{t_{i+1}}^{\mathrm{ODE}, t_i}$ is now defined for the time-increasing direction, i.e. by sampling $x_{t_i} \sim p_{t_i}$ and simulating $x_{t_{i+1}}$ via the probability flow ODE. If $s_\theta(x, t) \equiv \nabla_x \log p_t(x)$, then by the theory of probability flow ODEs (Song et al., 2021c), the above objective becomes $0$, since the marginal probabilities $p_t$ of the SDE coincide with the marginal likelihoods of the probability flow CNF at every time $t$. In this case, we also obtain $\mathrm{KL}(p_T \,\|\, p_T^{\mathrm{ODE}}) = 0$, i.e. when sampling $x_{t_0} \sim p_0$ and solving the probability flow ODE, we obtain exactly the distribution $p_T$.

Additional assumptions

In the following, we make the same additional assumptions as Song et al. (2021a), Appendix A. Specifically, we require that

(i) $\exists C_p > 0\ \forall x \in \mathbb{R}^d,\ t \in [0, T]: \|\nabla_x \log p_t(x)\|_2 \leq C_p (1 + \|x\|_2)$
(ii) $\exists C_s > 0\ \forall \theta,\ \forall x \in \mathbb{R}^d,\ t \in [0, T]: \|s_\theta(x, t)\|_2 \leq C_s (1 + \|x\|_2)$
(iii) $\exists C_{\mathcal{P}} > 0\ \forall x \in \mathbb{R}^d: \|\mathcal{P}(x)\|_2 \leq C_{\mathcal{P}} (1 + \|x\|_2)$
(iv) $\forall t \in [0, T]\ \exists k > 0: p_t(x) \in \mathcal{O}(e^{-\|x\|_2^k})$ as $\|x\|_2 \to \infty$

Minimizing the Kullback-Leibler divergence The $i$-th summand of equation (16) can be simplified to $$\mathrm{KL}(p_{t_{i+1}} \,\|\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}) = \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}}\left[ \log \frac{p_{t_{i+1}}(x_{t_{i+1}})}{p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}})} \right] \quad (17)$$ $$= \mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}}\left[ \log \frac{p_{t_{i+1}}(x_{t_{i+1}})}{p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}})} \right] \quad (18)$$ $$= -\mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}}\left[ \log p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}}) \right] + C, \quad (19)$$ where $C$ is a constant independent of $\theta$. Thus, we can maximize the expectation $$\mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}}\left[ \log p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}}) \right]. \quad (20)$$ Locality of CNFs and estimating $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}})$ With the law of iterated expectations, we can write the probability $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}})$ in equation (20) as $$p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}}) = \mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right]. \quad (21)$$ We show in the following that we can approximate the expectation in equation (21) with only negligible error $\kappa > 0$ using a different distribution $\tilde p_{t_i}(x_{t_{i+1}})$ that depends on $x_{t_{i+1}}$ and $\kappa$. Importantly, the support of $\tilde p_{t_i}(x_{t_{i+1}})$ is a bounded set with $\mathrm{diam}(\mathrm{supp}(\tilde p_{t_i}(x_{t_{i+1}}))) \to 0$ as $t_{i+1} - t_i \to 0$ for all $x_{t_{i+1}} \in \mathbb{R}^d$. Intuitively, the score $\nabla_x \log p_t(x)$ is determined by the (local) probabilities $p_t$ in a small environment around $x$ instead of the global properties of $p_t$. This insight is an important part of our methodology and loss estimation. To prove this, given $x_{t_{i+1}}$, we first define the set of points $x$ for which the density $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x)$ is greater than a chosen $\kappa > 0$, i.e. we define $$S_\kappa(x_{t_{i+1}}) = \left\{ x \in \mathbb{R}^d \,\middle|\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x) \geq \kappa \right\}. \quad (22)$$
Then, for this set, we consider the indicator function $I_{x_{t_{i+1}}}^\kappa(\cdot)$, which is $1$ for elements of $S_\kappa(x_{t_{i+1}})$ and $0$ otherwise. Then, we can rewrite equation (21) as $$\mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] \quad (23)$$ $$= \mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ I_{x_{t_{i+1}}}^\kappa(x_{t_i})\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) + (1 - I_{x_{t_{i+1}}}^\kappa(x_{t_i}))\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right]. \quad (24)$$ By the definition of the set $S_\kappa(x_{t_{i+1}})$, we can derive the following bounds: $$\mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ (1 - I_{x_{t_{i+1}}}^\kappa(x_{t_i}))\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] \leq \kappa \quad (25)$$ and $$\mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ I_{x_{t_{i+1}}}^\kappa(x_{t_i})\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) + (1 - I_{x_{t_{i+1}}}^\kappa(x_{t_i}))\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] \quad (26)$$ $$\geq \mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ I_{x_{t_{i+1}}}^\kappa(x_{t_i})\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] \quad (27)$$ $$\geq p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}}) - \kappa. \quad (28)$$ For the next part, we need an approximation of $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(\cdot | x_{t_i})$. Approximating $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(\cdot | x_{t_i})$ as a Gaussian Since the CNF is bijective, choosing $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(\cdot | x_{t_i})$ as a Dirac delta function with its spike located at $\mu_{\mathrm{ODE}}(x_{t_i})$ would be the exact choice. Here, $\mu_{\mathrm{ODE}}(x_{t_i})$ is defined as the solution of the probability flow ODE (12) for $x_{t_i}$ integrated from time $t_i$ to $t_{i+1}$. However, we are in any case limited by machine precision and inexact ODE solvers when computing $\mu_{\mathrm{ODE}}(x)$. Therefore, we make the assumption that, given $x_{t_i}$, solving the probability flow ODE (12) until $t_{i+1}$ yields a solution $x_{t_{i+1}}^{\mathrm{ODE}}$ that is approximately Gaussian with mean $\mu_{\mathrm{ODE}}(x_{t_i}) = x_{t_i} + (t_{i+1} - t_i)\left( \mathcal{P}(x_{t_i}) - \frac{1}{2} g_{t_i}^2 s_\theta(x_{t_i}, t_i) \right)$ and standard deviation $\sigma_{\mathrm{ODE}} = \epsilon (t_{i+1} - t_i)$, for an arbitrary but small $\epsilon > 0$. This approximation makes use of the explicit Euler method and thus also relies on the time step $t_{i+1} - t_i$ being sufficiently small to ensure stability of the time integration.
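As a minimal sketch of this Gaussian approximation, the snippet below computes a single explicit Euler step of the probability flow ODE and the corresponding Gaussian log-transition density. All function names are hypothetical; `P`, `g`, and `score` stand in for the physics operator, noise schedule, and learned score.

```python
import numpy as np

def mu_ode(x, t0, t1, P, g, score):
    """One explicit Euler step of the probability flow ODE (eq. 12)."""
    return x + (t1 - t0) * (P(x) - 0.5 * g(t0) ** 2 * score(x, t0))

def log_transition(x_next, x, t0, t1, P, g, score, eps=1e-2):
    """Log-density of the Gaussian approximation of the ODE transition:
    mean mu_ode(x), standard deviation sigma = eps * (t1 - t0)."""
    mu = mu_ode(x, t0, t1, P, g, score)
    sigma = eps * (t1 - t0)
    d = x.size
    return (-0.5 * np.sum((x_next - mu) ** 2) / sigma ** 2
            - d * np.log(sigma * np.sqrt(2.0 * np.pi)))
```

The choice of `eps` mirrors the arbitrary but small $\epsilon$ in the text; the transition collapses toward a Dirac delta as `eps` shrinks.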
Given the above approximation with Euler steps, we can derive the following bound on the distance between $x_{t_i}$ and $\mu_{\mathrm{ODE}}(x_{t_i})$ using assumptions (i), (ii), and (iii): $$\|x_{t_i} - \mu_{\mathrm{ODE}}(x_{t_i})\|_2 = \left\| (t_{i+1} - t_i)\left( \mathcal{P}(x_{t_i}) - \tfrac{1}{2} g_{t_i}^2 s_\theta(x_{t_i}, t_i) \right) \right\|_2 \quad (29)$$ $$\leq (t_{i+1} - t_i)\left( \|\mathcal{P}(x_{t_i})\|_2 + \tfrac{1}{2} g_{t_i}^2 \|s_\theta(x_{t_i}, t_i)\|_2 \right) \quad (30)$$ $$\leq (t_{i+1} - t_i)\left( C_{\mathcal{P}}(1 + \|x_{t_i}\|_2) + \tfrac{1}{2} g_{t_i}^2 C_s (1 + \|x_{t_i}\|_2) \right) \quad (31)$$ $$\leq (t_{i+1} - t_i)\, C_{\mathrm{ODE}}\, (1 + \|x_{t_i}\|_2). \quad (32)$$ Moreover, the Gaussian approximation gives us the following equivalence for the term $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i})$, which we used to define the set $S_\kappa(x_{t_{i+1}})$ in equation (22): $$p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \geq \kappa \quad (33)$$ $$\iff \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\tfrac{1}{2} (x_{t_{i+1}} - \mu_{\mathrm{ODE}}(x_{t_i}))^\top \Sigma^{-1} (x_{t_{i+1}} - \mu_{\mathrm{ODE}}(x_{t_i})) \right) \geq \kappa, \quad (34)$$ where $\Sigma = \sigma_{\mathrm{ODE}}^2 I$. The above is then equivalent to $$\|x_{t_{i+1}} - \mu_{\mathrm{ODE}}(x_{t_i})\|_2^2 \leq -2\,\sigma_{\mathrm{ODE}}^2 \log\left( \kappa \sqrt{(2\pi)^d \sigma_{\mathrm{ODE}}^{2d}} \right). \quad (35)$$ Note that $\sigma_{\mathrm{ODE}}$ depends on $\epsilon$ and $(t_{i+1} - t_i)$. We can now finally define the distribution $\tilde p_{t_i}(x_{t_{i+1}})$. For this, we define the set $$\tilde S_\kappa(x_{t_{i+1}}) := \bigcup_{x \in S_\kappa(x_{t_{i+1}})} \left\{ \tilde x \in \mathbb{R}^d \,\middle|\, \|\tilde x - x\|_2 \leq (t_{i+1} - t_i)\, C_{\mathrm{ODE}}\, (1 + \|\tilde x\|_2) \right\}. \quad (36)$$ Now, by combining equation (35) and equation (32), we obtain $\mathrm{diam}(\tilde S_\kappa(x_{t_{i+1}})) \to 0$ as $t_{i+1} - t_i \to 0$. For the indicator function $\tilde I_{x_{t_{i+1}}}^\kappa(\cdot)$ of $\tilde S_\kappa(x_{t_{i+1}})$, we get the bound $$I_{x_{t_{i+1}}}^\kappa(x) \leq \tilde I_{x_{t_{i+1}}}^\kappa(x) \quad \forall x \in \mathbb{R}^d \quad (37)$$ and therefore also $$\mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ (1 - \tilde I_{x_{t_{i+1}}}^\kappa(x_{t_i}))\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] \leq \kappa. \quad (38)$$ Given $\tilde S_\kappa(x_{t_{i+1}})$, we define the distribution $\tilde p_{t_i}(x_{t_{i+1}})$ based on $p_{t_i}$ but with support restricted to $\tilde S_\kappa(x_{t_{i+1}})$. Then, we get the equality $$\mathbb{E}_{x_{t_i} \sim p_{t_i}}\left[ \tilde I_{x_{t_{i+1}}}^\kappa(x_{t_i})\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right] = \mathbb{E}_{x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ C_{x_{t_{i+1}}}\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | x_{t_i}) \right], \quad (39)$$ where $C_{x_{t_{i+1}}}$ is a normalizing constant depending on $\kappa$, $t_{i+1} - t_i$, and $x_{t_{i+1}}$.
Maximizing the variational lower bound From Jensen's inequality, equation (39), and equation (38), we obtain a lower bound on equation (20): $$\mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}}\left[ \log p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}}) \right] \quad (40)$$ $$\geq \mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}}\left[ \log \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ C_{x_{t_{i+1}}}\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | \tilde x_{t_i}) \right] \right] \quad (41)$$ $$\geq \mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}} \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ \log\left( C_{x_{t_{i+1}}}\, p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | \tilde x_{t_i}) \right) \right] \quad (42)$$ $$= \mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}} \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ \log p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | \tilde x_{t_i}) + \log C_{x_{t_{i+1}}} \right], \quad (43)$$ which is the same as maximizing $$\mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}} \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ \log p_{t_{i+1}}^{\mathrm{ODE}, t_i}(x_{t_{i+1}} | \tilde x_{t_i}) \right]. \quad (44)$$ Thus, instead of the original objective (20), we maximize the lower bound (44). Deriving the L2 loss Since $p_{t_{i+1}}^{\mathrm{ODE}, t_i}(\cdot | \tilde x_{t_i})$ is the density of a Gaussian, the objective (44) is equivalent to minimizing $$\mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}} \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ \|x_{t_{i+1}} - \mu_{\mathrm{ODE}}(\tilde x_{t_i})\|_2^2 / \sigma_{\mathrm{ODE}}^2 \right] \quad (45)$$ $$\propto \mathbb{E}_{x_{t_i} \sim p_{t_i}} \mathbb{E}_{x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}} \mathbb{E}_{\tilde x_{t_i} \sim \tilde p_{t_i}(x_{t_{i+1}})}\left[ \|x_{t_{i+1}} - \mu_{\mathrm{ODE}}(\tilde x_{t_i})\|_2^2 \right], \quad (46)$$ where dropping the constant factor $1/\sigma_{\mathrm{ODE}}^2$ does not change the minimizer. Estimating the loss The training data set $\{(x_{t_i,n})_{i=0,\dots,M}\}_{n=0,\dots,N}$ is sampled directly from the SDE of the physical system (10), so the empirical distribution $p_{t_i}^{\mathrm{emp}}$ induced by the training data set at time discretization $t_i$ is close to the marginal distribution $p_{t_i}$ for $0 \leq i \leq M$ and sufficiently large $N$. We make use of this fact to approximate the sampling of $x_{t_i} \sim p_{t_i}$ and $x_{t_{i+1}} \sim p_{t_{i+1}}|x_{t_i}$ in equation (46) by drawing a data sample $x_{t_i}$ at time $t_i$ and its successor $x_{t_{i+1}}$ on the trajectory at time $t_{i+1}$ from the training data set. We approximate sampling from the distribution $\tilde p_{t_i}(x_{t_{i+1}})$ by reusing the sampled $x_{t_i}$.
Intuitively, for the CNF from $t_i$ to $t_{i+1}$, our optimization fits a deterministic (bijective) process defined by the probability flow to the non-deterministic SDE trajectories, which start at the same point, with $$\mu_{\mathrm{ODE}}(x_{t_i}) = x_{t_i} + (t_{i+1} - t_i)\left( \mathcal{P}(x_{t_i}) - \frac{1}{2} g_{t_i}^2 s_\theta(x_{t_i}, t_i) \right).$$ Extension to multiple time steps We jointly train multiple time steps. First, we sample a trajectory $(x_{t_i}, \dots, x_{t_j})$ with $0 \leq i < j \leq M$ from the training data set. Then, the loss becomes $$\sum_{k=i+1}^{j} \|x_{t_k} - \mu_{\mathrm{ODE}}^k(x_{t_i})\|_2^2, \quad (48)$$ where $\mu_{\mathrm{ODE}}^k(x_{t_i})$ is the discretized trajectory from the probability flow ODE, i.e. $$\mu_{\mathrm{ODE}}^i(x_{t_i}) = x_{t_i} \quad (49)$$ $$\mu_{\mathrm{ODE}}^k(x_{t_i}) = \mu_{\mathrm{ODE}}^{k-1}(x_{t_i}) + (t_k - t_{k-1})\left( \mathcal{P}(\mu_{\mathrm{ODE}}^{k-1}(x_{t_i})) - \frac{1}{2} g_{t_k}^2 s_\theta(\mu_{\mathrm{ODE}}^{k-1}(x_{t_i}), t_k) \right). \quad (50)$$ For multiple steps, instead of reusing $x_{t_i}$ as a sample from $\tilde p_{t_i}(x_{t_{i+1}})$, we use the previous ODE solution $\mu_{\mathrm{ODE}}^k(x_{t_i})$. An overview of the training for multiple steps is shown in Figure 8. Bidirectional training Since CNFs are bidirectional and our loss formulation does not make any specific assumptions about $p_{t_i}$ and $p_{t_{i+1}}$, we can also train the reverse time direction. In this case, the SDE describing the physical system becomes the reverse-time SDE. Analogously to the increasing time direction, we sample a trajectory $(x_{t'_i}, \dots, x_{t'_j})$, but the time direction is reversed, i.e. $t'_i > t'_j$ with $t_0 = t'_M$ and $t_M = t'_0$. We train the combined objective by sampling a trajectory from the training data set and alternating between optimizing the time-forward direction loss (48) and the corresponding time-backward direction loss. Note that training both directions (i.e. $p_0$ to $p_T$ and vice versa) has been explored in other works as well, for example for training Schrödinger bridges (Bortoli et al., 2021; Chen et al., 2022), where the discrepancy between the final marginal distribution (i.e.
either $p_T^{\mathrm{SDE}}$ or $p_0^{\mathrm{SDE}}$) for the time-forward and time-backward (reverse-time) SDE, respectively, is minimized.
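The multi-step loss of equations (48)–(50) amounts to unrolling Euler steps of the probability flow ODE from the first trajectory point and penalizing deviations from the recorded states. A minimal sketch, with `P`, `g`, and `score` as placeholders for the physics operator, noise schedule, and score network:

```python
import numpy as np

def rollout_loss(x_traj, ts, P, g, score):
    """Sum_k ||x_{t_k} - mu^k_ODE(x_{t_i})||^2, with mu^k produced by
    recursive Euler steps of the probability flow ODE (eqs. 49-50)."""
    mu = x_traj[0]          # mu^i_ODE(x_{t_i}) = x_{t_i}
    loss = 0.0
    for k in range(1, len(x_traj)):
        dt = ts[k] - ts[k - 1]
        mu = mu + dt * (P(mu) - 0.5 * g(ts[k]) ** 2 * score(mu, ts[k]))
        loss += float(np.sum((x_traj[k] - mu) ** 2))
    return loss
```

In a real training loop, this loss would be differentiated with respect to the score network's parameters, which requires a differentiable implementation of $\mathcal{P}$.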

Rollout length

The training of SMDP requires unrolling algorithm 1 as shown in figure 8. Additionally, we adopt a training method based on sliding windows. For a window size $S$, we consider the points $(x_{t_M}, x_{t_{M-1}}, \dots, x_{t_{M-S}})$ from a training trajectory and unroll the SMDP algorithm 1 for $S$ steps. We compute the loss in equation (48) and backpropagate gradients through all steps to obtain updates for $\theta$. Then, we move the window by 1, i.e. consider the points $(x_{t_{M-1}}, x_{t_{M-2}}, \dots, x_{t_{M-S-1}})$, and compute the updates for $\theta$. We repeat this until we have covered the entire trajectory. If the training data trajectories are short, we use $S = M$; otherwise, we pick a lower window size $S$. If $S = 1$, then we do not require differentiability of the physics operator $\mathcal{P}$, as SMDP reduces to predicting the next point on the trajectory given the previous point. Starting with an untrained score network, long rollouts may yield divergent trajectories because of the physics dynamics. The resulting losses can be very high and make the training unstable. We therefore typically start training with a short sliding window, e.g. 2. We train for a few epochs and then increase the sliding window size by a constant. We repeat this until we reach a sufficiently high rollout length, which yields stable trajectories for the entire simulation. We give details of this for each specific experiment either directly in the main text or the accompanying appendix. Comparison to DiffFlow Zhang & Chen (2021) train DiffFlow by considering forward and backward processes in the context of generative modelling. In their setting, the drift $f(x, t)$ is also a learnable neural network, and they consider $p_0$ to be the data distribution and $p_T$ to resemble a simple noise distribution.
Specifically, they implement the forward and backward processes as $$x_{i+1} = x_i + f_i(x_i)\,\Delta t_i + g_i\, \delta_i^F \sqrt{\Delta t_i}$$ $$x_i = x_{i+1} - \left[ f_{i+1}(x_{i+1}) - g_{i+1}^2\, s_{i+1}(x_{i+1}) \right] \Delta t_i + g_{i+1}\, \delta_{i+1}^B \sqrt{\Delta t_i}$$ for two samples $\delta_i^F, \delta_{i+1}^B \sim \mathcal{N}(0, I)$, time discretization $\{t_i\}_{i=0}^N$, and $\Delta t_i = t_{i+1} - t_i$. Zhang & Chen (2021) directly minimize the KL divergence between the trajectory distributions of the forward and backward processes, i.e. they minimize $$\mathrm{KL}(p^F(\tau) \,\|\, p^B(\tau)) = \mathbb{E}_{\tau \sim p^F}\left[ \log p^F(x_0) \right] - \mathbb{E}_{\tau \sim p^F}\left[ \log p^B(x_N) \right] + \sum_{i=1}^{N-1} \mathbb{E}_{\tau \sim p^F}\left[ \log \frac{p^F(x_i | x_{i-1})}{p^B(x_{i-1} | x_i)} \right].$$ Using the forward and backward process discretizations, as well as the fact that $p^B(x_N)$ is a Gaussian distribution, they are able to derive a loss based on the squared difference between the forward and backward process, together with an additional likelihood term for $-\log p^B(x_N)$; see equation (15) in their paper. Our method, on the other hand, directly minimizes the KL divergence between the marginal distributions $p_t$ and the ones produced by the deterministic probability flow ODE, $p_t^{\mathrm{ODE}}$. Thus, our loss likewise minimizes the difference between a forward and a backward trajectory. Moreover, in our case, $p_T$ is not constrained to a simple noise distribution. This affects the reverse-time SDE trajectories. Because of the stronger shearing, SMDP trajectories result in states of either $1$ or $-1$, see figure 10d. On the other hand, for ISM in figure 10c, where the shearing is less pronounced, many trajectories end up in between $1$ and $-1$. These states are not valid samples from the posterior distribution of the solution from equation (55). Interestingly, the probability flow solutions for both SMDP and ISM have comparable quality, cf. figure 10e and figure 10f. In both cases, trajectories that start above $0$ at $t = 10$ end close to $1$ at $t = 0$, and vice versa. Overall, this case illustrates that the learned score of SMDP is more accurate than the one of ISM, as evidenced by its stronger shearing.
This results in a better quality of the reverse-time SDE trajectories. SMDP performs better here, because it is directly trained with an integration of the physics dynamics, whereas ISM is purely data-driven and does not incorporate the physics model with its temporal evolutions at training time. 

C ABLATION FOR GRADIENT BACKPROPAGATION AND ROLLOUT LENGTH

Gradient backpropagation steps It is possible to unroll for $S$ steps but stop the gradient backpropagation after $G < S$ steps. For example, when unrolling the algorithm for $S = 32$ steps, we can unroll the first $G = 8$ steps, save the intermediate results, and then backpropagate the gradient to optimize $\theta$. Then, we continue unrolling with the intermediate results as a new initialization, again stopping and updating $\theta$ after $G$ steps. For this example, we repeat this $S/G = 4$ times in total until we obtain a combined rollout of $S$ steps. An advantage of stopping the gradient after $G$ steps is reduced memory requirements, as we do not need to store all intermediate results, which allows training with large window sizes. For all other experiments in the paper, we always set $G = S$. Effect on reconstruction MSE We evaluate the effect of changing the window size $S$ and the number of steps until we stop the gradient backpropagation using the setup of the heat equation experiment in section 4.1. In figure 11, we show the results on the reconstruction MSE for SMDP-ODE and SMDP-SDE. We keep the time discretization fixed, i.e. a full simulation trajectory consists of 32 steps from $t = 0$ until $t = 0.2$; however, we vary the window size $S$. In this case, we always backpropagate gradients through all unrolling steps. In the second case, where we change the number of gradient backpropagation steps, we keep the window size $S$ fixed at 32, which is the entire simulation trajectory. For both SMDP-ODE and SMDP-SDE, our evaluation shows that both longer rollouts and more gradient backpropagation steps improve the reconstruction MSE. The improvements are more significant for SMDP-SDE.
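The segmentation into $G$-step chunks can be sketched as pure bookkeeping of which unrolling steps share one backpropagation pass, independent of any deep learning framework; in practice, each segment boundary is where the state would be detached from the computation graph. The function name is ours.

```python
def rollout_segments(S, G):
    """Split an S-step rollout into S/G segments of G steps each.

    Gradients are backpropagated within a segment only; each segment's
    final state serves as the (detached) initial state of the next.
    """
    assert S % G == 0, "S must be a multiple of G"
    return [list(range(start, start + G)) for start in range(0, S, G)]
```

For the example in the text, `rollout_segments(32, 8)` yields four segments of eight steps each.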

D 1D HEAT EQUATION: EVALUATION OF POSTERIOR

To demonstrate the diversity and high quality of the reverse-time SDE trajectories, we provide an additional evaluation of the posterior for a one-dimensional heat equation case. We compare the SDE solutions with samples obtained from filtering a large data set. To simplify the comparison, we consider 1D processes based on the 2D Gaussian random fields from section 4.1. Each process has a length of 100 and corresponds to a Gaussian random field of size $1 \times 100$. Analogously to section 4.1, we use the heat equation to simulate the states forward in time from $t = 0.0$ until $t = 0.2$. Figure 12 shows some examples of the 1D processes we consider here. Training of SMDP The network $s_\theta(x, t)$ representing the score is trained as described for the heat equation experiment in section 4.1, using the same ResNet architecture with the padding in $y$-direction removed. We consider a time discretization with $\Delta t = 0.01$. We begin training with an initial rollout of 6 steps and increase the rollout length by 2 every 5 epochs until we reach 14 rollout steps. For every epoch, we train on 20 randomly generated 1D processes. We use the Adam optimizer with learning rate $10^{-4}$. In the following, we randomly generate a 1D process $P$ and describe how we generate samples for $t = 0.0$ conditioned on the simulation end state of $P$ at $t = 0.2$.

Reverse-SDE posterior

We initialize the state based on $P$ at $t = 0.2$. Then, we simulate the reverse-time SDE with the learned score $s_\theta(x, t)$ via Euler-Maruyama steps. A visualization of 100 samples is shown in figure 14. Empirical distribution We sample $10^6$ processes and form pairs of initial states at $t = 0$ and end states at $t = 0.2$. We filter the 100 process end states that are closest to the end state of $P$ in terms of the L2 distance. As solutions from the empirical sampling, we consider the 100 corresponding initial states, as shown in figure 13. This empirical distribution makes a qualitative comparison possible, as shown in figures 14 and 13. They indicate that the reverse-time SDE solutions are diverse while matching $P$ very well. Simulating the obtained solutions forward in time gives end states that are in excellent agreement with $P$.
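The empirical-distribution baseline reduces to a nearest-neighbor filter over a large pool of simulated (initial, end) pairs. A sketch with hypothetical array names:

```python
import numpy as np

def empirical_posterior(x0_pool, xT_pool, xT_target, k=100):
    """Return the k initial states whose simulated end states are
    closest (L2 distance) to the target end state."""
    dists = np.linalg.norm(xT_pool - xT_target[None, :], axis=1)
    nearest = np.argsort(dists)[:k]
    return x0_pool[nearest]
```

Here `x0_pool` and `xT_pool` would hold the $10^6$ initial and forward-simulated end states, and `xT_target` the end state of the chosen process $P$.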

E ARCHITECTURES

ResNet We employ a simple ResNet-like architecture, which is used in Section 4.1 for the score function $s_\theta(x, t)$ and the convolutional neural network baselines (ResNet-S and ResNet-P), as well as in Section 4.3, again for the score $s_\theta(x, t)$. Since both experiments have periodic boundary conditions, we apply a periodic padding with length 16, i.e. if the underlying 2-dimensional data dimensions are $N \times N$, the dimensions after the periodic padding are $(N + 16) \times (N + 16)$. We implement the periodic padding by first tiling the input 3 times in $x$- and $y$-direction and then cropping to the correct size. The time $t$ is concatenated as an additional constant channel to the 2-dimensional input data when this architecture is used to represent the score $s_\theta(x, t)$. The encoder part of our network begins with a single 2D convolution encoding layer with 32 filters, kernel size 4, and no activation function. This is followed by 4 consecutive residual blocks, each consisting of 2D convolution, LeakyReLU, 2D convolution, and LeakyReLU. All 2D convolutions have 32 filters with kernel size 4 and stride 1. The encoder part ends with a single 2D convolution with 1 filter, kernel size 1, and no activation. The decoder part begins with a transposed 2D convolution, 32 filters, kernel size 4. Afterwards, there are 4 consecutive residual blocks, analogous to the encoder residual blocks, but with the 2D convolutions replaced by transposed 2D convolutions. Finally, there is a final 2D convolution with 1 filter and kernel size 5. Parameter statistics of this model, as well as of the others, are given in table 4. UNet We use the UNet architecture with spatial dropout as given in (Mueller et al., 2022), Appendix A.1. The dropout rate is set to 0.25. We do not include batch normalization and apply the same periodic padding as for our ResNet architecture.
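The tile-and-crop periodic padding described above can be sketched as follows for a single-channel 2D array. This standalone version is for illustration only; `pad` counts padded cells per side, so the paper's $(N+16) \times (N+16)$ output corresponds to `pad=8`.

```python
import numpy as np

def periodic_pad(x, pad=8):
    """Periodic padding by tiling the input 3x in each direction and
    cropping the centered (N + 2*pad) x (N + 2*pad) window."""
    n, m = x.shape
    tiled = np.tile(x, (3, 3))
    return tiled[n - pad:2 * n + pad, m - pad:2 * m + pad]
```

Cropping from the center of the 3x3 tiling guarantees that values wrap around the domain boundary exactly as with periodic boundary conditions.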
FNO For all experiments, we consider the FNO-2D architecture introduced in (Li et al., 2020) with k max,j = 12 Fourier modes per channel.

Dil-ResNet

The Dil-ResNet architecture is described in (Stachenfeld et al., 2021), Appendix A.

F HEAT EQUATION F.1 ADDITIONAL TRAINING DETAILS

Spectral loss We consider a spectral error based on the two-dimensional power spectral density. For two 2D images, we compute their 2D Fourier transforms and the radially averaged power spectra $s_1$ and $s_2$. Then, we define the spectral error as the difference between the logarithms of the spectral densities, $L(s_1, s_2) = |\log(s_1) - \log(s_2)|$. Baseline methods All other baseline methods are trained for 80 epochs using the Adam optimizer with an initial learning rate of $10^{-4}$, which is decreased by a factor of 0.5 every 20 epochs. For the training data, we consider solutions to the heat equation consisting of initial state $x_0$ and end state $x_T$ that are noise-free, and add Gaussian noise with standard deviation $\sigma = 0.1$ to the network input.
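A sketch of the radially averaged power spectral density and the resulting spectral error; the function names are ours, and a small constant guards the logarithm, which the text does not specify.

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged 2D power spectral density of a square image."""
    psd = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    n = img.shape[0]
    y, x = np.indices(psd.shape)
    # integer radial bin of each frequency relative to the center
    r = np.sqrt((x - n // 2) ** 2 + (y - n // 2) ** 2).astype(int)
    sums = np.bincount(r.ravel(), weights=psd.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

def spectral_error(img1, img2, eps=1e-12):
    """L(s1, s2) = |log s1 - log s2|, averaged over radial bins."""
    s1 = radial_power_spectrum(img1) + eps
    s2 = radial_power_spectrum(img2) + eps
    return float(np.mean(np.abs(np.log(s1) - np.log(s2))))
```

The error is zero for identical images and grows as the reconstructed field produces structures at the wrong spatial scales.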

F.2 LOGARITHMIC TIME DISCRETIZATION

For this experiment, we consider a logarithmic time discretization during training and inference. This discretization is finer around $t = 0$, where most of the small-scale structures are smoothed out. We find that this method also works well, but the linear time discretization performs better in most cases, especially when $\Delta t$ is changed for inference, cf. figure 5 and figure 15.

F.3 PHYSICS-CONDITIONED SCORE

We extend the definition of the score function to also include information about the physics at time $t$, i.e. $\mathcal{P}(x_t)$. Thus, we replace $s_\theta(x, t)$ in algorithm 1 by $s_\theta([x, \mathcal{P}(x)], t)$, where we concatenate both inputs. An evaluation is shown in figure 16. Although the stability of SMDP-ODE and SMDP-ODE + noise is greatly increased, we do not obtain results with a low spectral error for SMDP-SDE. Interestingly, the bidirectional training also seems harmful in this case. We train all networks with Adam and learning rate $10^{-4}$ with batch size 16. We begin with unrolling $N = 2$ steps, which we increase by 2 every 30 epochs until we reach $N = 20$.

G.2 ADDITIONAL RESULTS

We give more detailed time evolutions of the results for the buoyancy-driven flow case in figure 21 and figure 22. These again highlight the difficulties of the physics simulator in recovering the initial states by itself. The SMDP variants significantly improve upon this behavior. In figure 20, we also show an example of posterior sampling for this case. It becomes apparent that the inferred small-scale structures vary between the different samples. However, in contrast to cases like the heat diffusion example, the physics simulation in this scenario leaves only little room for substantial changes of the states.

H ISOTROPIC TURBULENCE

For the learned physics network, we employ an FNO neural network with batch size 20. We train the FNO for 500 epochs using the Adam optimizer with learning rate $10^{-3}$, which we decrease by a factor of 0.5 every 100 epochs. We train SMDP-ODE with the ResNet architecture for 250 epochs with learning rate $10^{-4}$, decreased by a factor of 0.5 every 50 epochs, and batch size 6.

H.1 REFINEMENT WITH LANGEVIN DYNAMICS

As a post-processing and refinement strategy, we perform a fixed point iteration at a single point in time via $$x_t^{i+1} = x_t^i + \epsilon \cdot \nabla_x \log p_t(x_t^i) + \sqrt{2\epsilon}\, z_t, \quad (60)$$ for a number of steps $T$ and $\epsilon = 2 \times 10^{-5}$, cf. figure 23 and figure 24. This is motivated by established methods in score-based generative modelling (Welling & Teh, 2011; Song & Ermon, 2019). For a prior distribution $\pi_t$ with $x_t^0 \sim \pi_t$, and by iterating equation (60), the distribution of $x_t^T$ equals $p_t$ for $\epsilon \to 0$ and $T \to \infty$. There are some theoretical caveats, i.e. an additional Metropolis-Hastings update needs to be added in equation (60), and there are regularity conditions (Song & Ermon, 2019).
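A sketch of this Langevin refinement loop, assuming a learned score function `score(x, t)`; the Metropolis-Hastings correction mentioned above is omitted, as in the paper, and the function name is ours.

```python
import numpy as np

def langevin_refine(x, score, t, steps=300, eps=2e-5, rng=None):
    """Unadjusted Langevin iteration at a fixed physical time t:
    x <- x + eps * score(x, t) + sqrt(2 * eps) * z,  z ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x, t) + np.sqrt(2.0 * eps) * z
    return x
```

Applied to a reconstructed state, each iteration nudges the state toward higher density under $p_t$ while the injected noise keeps the iterates distributed rather than collapsing to a mode.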



Figure 1: Overview: we employ a physics simulator P to learn the score field ∇x log pt(x) with a neural network s θ in the presence of noise or uncertainties. The trained model allows for sampling the posterior of p0, i.e. different states that explain an observation pT , via probability flow or by solving the reverse-time SDE.

Figure 2: During the training phase, we optimize s θ (x, t) that approximates the score ∇x log pt(x) inside the probability flow ODE to fit the data trajectories. For the inference part, we simulate trajectories of the reverse-time SDE.

Figure 4: Spectral density on different scales, the red line indicating ground truth.The closer a method is to the ground truth, the better it produces structures of a similar scale.

Figure 6: Buoyancy flow case. Ground truth shows the marker density and velocity field in the x-direction at different points of the simulation trajectories. The simulation end state at t = 0.65 is the input to SMDP-ODE and SMDP-SDE.

Figure 5: Reconstruction MSE and spectral errors for the bidirectional (top) and regular variant (bottom). The x-axis shows the relative increase of the number of time steps during inference compared to training. Large errors are truncated at the top of each graph.

Figure 7: Turbulence case. Comparison of reconstructed trajectories at t = 9.

Figure 9: Trajectories from SDE equation (55) with λ2 = 0 (a) and λ2 = 0.03 (b).

(a) ISM learned score. (b) SMDP learned score. (c) ISM reverse-time SDE trajectories. (d) SMDP reverse-time SDE trajectories. (e) ISM probability flow trajectories. (f) SMDP probability flow trajectories.

Figure 10: Comparison of Implicit Score Matching (ISM, left) and Score Matching via Differentiable Physics (SMDP, right). Colormap in (a) and (b) truncated to [-75, 75].

Figure 11: Heat equation reconstruction MSE for different sliding window sizes S (Rollout length) and number of gradient backpropagation steps for SMDP-ODE (a) and SMDP-SDE (b).

Figure 12: Examples of training data for 1D heat equation with initial states at t = 0.0 (a) and end states at t = 0.2 (b).

Figure13: Empirical distribution: we generate 10 6 Gaussian processes and simulate them forward in time from t = 0.0 until t = 0.2. We pick one specific process P and sort all other processes in ascending order based on the L2 distance to P at time t = 0.2. Then we pick the first 100 and visualize them at time t = 0.0 (a) and t = 0.2 (b). The top row in both plots shows the process P.

Figure14: Reverse-time SDE: We pick the process P from figure13. Then, we simulate 100 trajectories from the reverse-time SDE with learned score and P at t = 0.2 as initialization. We sort the states at t = 0.0 based on their distance to P and visualize them (a). We then simulate all states from (a) forward in time again until t = 0.2, see (b). The forward simulated trajectories almost exactly match P at t = 0.2.

Figure 15: Logarithmic time discretization. Reconstruction MSE and spectral errors for varying time steps of regular variant (a) and bidirectional variant (b). Large errors are truncated at the top of each graph.

Figure 16: Physics-conditioned score. Reconstruction MSE and spectral errors for varying time steps of regular variant (a) and bidirectional variant (b). Large errors are truncated at the top of each graph.

Figure 17: Higher training noise. Reconstruction MSE and spectral errors for varying time steps of regular variant (a) and bidirectional variant (b). Large errors are truncated at the top of each graph.

Figure 20: Comparison of SMDP-SDE predictions and ground truth for buoyancy-driven flow at t = 0.36.

Figure 23: Steps of Langevin dynamics for $\epsilon = 2 \times 10^{-5}$. Panels: ground truth, learned physics, 100 steps, 200 steps, 300 steps.

Evaluation of reconstruction MSE and spectral error for SMDP and baselines. The column "full posterior" indicates whether models yield point estimates or allow sampling from the posterior.

the training data set without any noise, but add Gaussian random noise with standard deviation $\sigma = \sqrt{\Delta t}$ to each simulation state of the training trajectories.

Evaluation of variants for the buoyancy obstacle case in terms of reconstruction MSE and LPIPS of the marker field.

Evaluation of the turbulence case.

Training overview for the trajectory $(x_{t_0}, x_{t_1}, \dots, x_{t_M})$. Gradients are backpropagated over multiple time steps via automatic differentiation. This requires that the physics operator $\mathcal{P}$ is differentiable. Incoming gradients at $s_\theta(x_{t_i}, t_i)$ are used to obtain gradients for $\theta$, which are summed over all steps $i$. The network weights $\theta$ are then updated based on the optimizer, e.g. stochastic gradient descent or Adam.

Since this architecture represents the score $s_\theta$ in Section 4.2, we concatenate the constant time channel analogously to the ResNet architecture. Additionally, positional information is added to the network input by encoding the $x$-position and $y$-position inside the domain in two separate channels.

Summary of architectures.

SMDP-ODE For inference, we consider the linear time discretization $t_n = n\Delta t$ with $\Delta t = 0.2/32$ and $t_{32} = 0.2$. During training, we sample a random time discretization $\tilde t_n$ for each batch based on $t_n$ via $\tilde t_n \sim \mathcal{U}(t_n - \Delta t/2, t_n + \Delta t/2)$ for $n = 1, \dots, 31$ to avoid overfitting on the step size. In the warmup phase of training, we unroll Algorithm 1 for $N = 6, 8, \dots, 32$ steps, where we increase $N$ every 2 epochs. We employ Adam to update the weights $\theta$ with learning rate $10^{-4}$. After the warmup is finished, we finetune the network weights for 80 epochs with an initial learning rate of $10^{-4}$, which we reduce by a factor of 0.5 every 20 epochs. Diffusion model The diffusion model is inspired by (Rissanen et al., 2022). We make use of the same ResNet architecture as SMDP-ODE; however, we keep the time discretization $t_n$ fixed. The diffusion model does not include the physics operator $\mathcal{P}$ in the rollout. Therefore, it has to learn the score and the physics at the same time. Except for those changes, the training setup is identical to SMDP-ODE.

B ADDITIONAL EXPERIMENT: 1D PROCESS

We discuss an additional experiment, where we compare SMDP with Implicit Score Matching (ISM) (Hyvärinen, 2005). For this task, we consider the SDE given by $$dx = -\lambda_1 \cdot \mathrm{sign}(x)\, x^2\, dt + \lambda_2\, dw. \quad (55)$$ The corresponding reverse-time SDE is given by $$dx = \left[ -\lambda_1 \cdot \mathrm{sign}(x)\, x^2 - \lambda_2^2\, \nabla_x \log p_t(x) \right] dt + \lambda_2\, d\bar{w}.$$ Throughout this experiment, $p_0$ is a categorical distribution, where we draw either $1$ or $-1$ with the same probability. In figure 9, we show trajectories from this SDE simulated with the Euler-Maruyama method. Trajectories either start at $1$ or $-1$ and approach $0$ as $t$ increases. Given the trajectory value at $t = 10$, it is no longer possible to infer the origin of the trajectory at $t = 0$. We employ a neural network $s_\theta(x, t)$ parameterized by $\theta$ to approximate the score via ISM and compare it to SMDP. In both cases, the neural network is a simple multilayer perceptron with tanh activations and 5 hidden layers, with 20 neurons for the first four hidden layers and 10 neurons for the last hidden layer. Our training data set consists of 250 simulated trajectories from $t = 0$ until $t = 10$ with $\Delta t = 0.02$. Therefore, each training trajectory has length $M = 500$. Implicit Score Matching Implicit Score Matching (Hyvärinen, 2005) is a score matching method that leverages the fact that for a random vector $x \in \mathbb{R}^d$ with probability density function $p$, minimizing the score matching objective $$\mathbb{E}_{x \sim p}\left[ \frac{1}{2} \| s_\theta(x) - \nabla_x \log p(x) \|_2^2 \right]$$ is equivalent to minimizing the objective $$\mathbb{E}_{x \sim p}\left[ \sum_{i=1}^d \frac{\partial s_\theta(x)_i}{\partial x_i} + \frac{1}{2} \| s_\theta(x) \|_2^2 \right].$$ Note that for ISM there is no explicit time dimension, so we absorb the time dimension into $x$, i.e. for the trajectory $(x_1, x_2, \dots, x_M)$ sampled at times $(t_1, t_2, \dots, t_M)$, we concatenate value and time as $x^{(i)} := (x_i, t_i) \in \mathbb{R}^2$. We collect the $x^{(i)}$ for $i = 1, \dots, M$ from all trajectories in a new training data set. When training $s_\theta(x)$, we therefore lose the information about which trajectory a value-time pair originally belonged to. We obtain the time-dependent score $\nabla_{x_1} \log p_{x_2}(x_1)$ from the first coordinate of the output of $s_\theta(x)$, i.e.
$s_\theta(x)_1$. We compute the partial derivatives $\partial s_\theta(x)_i / \partial x_i$ using reverse-mode automatic differentiation in JAX (jax.jacrev). We train $s_\theta(x)$ with the Adam optimizer for 15,000 epochs with learning rate $10^{-3}$, which we decrease by a factor of 0.1 every 5,000 epochs.
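The training trajectories for both methods are generated by simulating SDE (55) with Euler-Maruyama. A sketch; the value of $\lambda_1$ here is a placeholder, since the text does not specify it, while $\lambda_2 = 0.03$ matches figure 9b.

```python
import numpy as np

def simulate_toy_sde(n_traj=250, lam1=7.0, lam2=0.03, T=10.0, dt=0.02,
                     seed=0):
    """Euler-Maruyama trajectories of dx = -lam1*sign(x)*x^2 dt + lam2 dw,
    starting at +1 or -1 with equal probability (lam1 is illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=n_traj)
    traj = [x.copy()]
    for _ in range(round(T / dt)):
        drift = -lam1 * np.sign(x) * x ** 2
        x = x + drift * dt + lam2 * np.sqrt(dt) * rng.standard_normal(n_traj)
        traj.append(x.copy())
    return np.stack(traj)  # shape (M + 1, n_traj)
```

As described in the text, the drift pulls every trajectory toward $0$, so the sign of the initial state becomes indistinguishable at $t = 10$.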

SMDP Training

The training of SMDP follows the experiments in section 4 with slight problem-specific modifications. Since the trajectories are very long ($M = 500$), we subsample them and keep every 5th point in time. We train with a sliding window of size 4, which we increase by 2 every 500 epochs until we reach a window size of 40, using the Adam optimizer with learning rate $10^{-5}$. Then, we finetune the network on the complete trajectories for an additional 500 epochs with sliding window size 70 and learning rate $10^{-6}$. Comparison We show a direct comparison of the learned score, the reverse-time SDE trajectories, and the probability flow trajectories between ISM and SMDP in figure 10. The learned scores of ISM and SMDP in figure 10a and figure 10b are similar until $t = 2$. Then, the score of both networks becomes positive for points marginally above $0$, i.e. the score pushes points up, leading them to $1$ at $t = 0$. Analogously, points marginally below $0$ are pushed down. For SMDP, the absolute value of the score in this region is considerably higher than for ISM, thus having a more significant shearing effect.

