REINFORCEMENT LEARNING-BASED ESTIMATION FOR PARTIAL DIFFERENTIAL EQUATIONS

Abstract

In systems governed by nonlinear partial differential equations such as fluid flows, the design of state estimators such as Kalman filters relies on a reduced-order model (ROM) that projects the original high-dimensional dynamics onto a computationally tractable low-dimensional space. However, ROMs are prone to large errors, which negatively affect the performance of the estimator. Here, we introduce the reinforcement learning reduced-order estimator (RL-ROE), a ROM-based estimator in which the correction term that takes in the measurements is given by a nonlinear policy trained through reinforcement learning. The nonlinearity of the policy enables the RL-ROE to compensate efficiently for errors of the ROM, while still taking advantage of the imperfect knowledge of the dynamics. Using examples involving the Burgers and Navier-Stokes equations, we show that in the limit of very few sensors, the trained RL-ROE outperforms a Kalman filter designed using the same ROM. Moreover, it yields accurate high-dimensional state estimates for reference trajectories corresponding to various physical parameter values, without direct knowledge of the latter.

1. INTRODUCTION

Active control of turbulent flows has the potential to cut down emissions across a range of industries through drag reduction in aircraft and ships or improved efficiency of heating and air-conditioning systems, among many other examples (Brunton & Noack, 2015). But real-time feedback control requires inferring the state of the system from sparse measurements using an algorithm called a state estimator, which typically relies on a model for the underlying dynamics (Simon, 2006). Among state estimators, the Kalman filter is by far the most well-known thanks to its optimality for linear systems, which has led to its widespread use in numerous applications (Kalman, 1960; Zarchan, 2005). However, continuous systems such as fluid flows are governed by partial differential equations (PDEs) which, when discretized, yield high-dimensional and oftentimes nonlinear dynamical models with hundreds or thousands of state variables. These high-dimensional models are too expensive to integrate with common state estimation techniques, especially in the context of embedded systems. Thus, state estimators for control are instead designed based on a reduced-order model (ROM) of the system, in which the underlying dynamics are projected onto a low-dimensional subspace that is computationally tractable (Barbagallo et al., 2009; Rowley & Dawson, 2017). A key challenge is that ROMs provide a simplified and imperfect description of the dynamics, which negatively affects the performance of the state estimator. One potential solution is to improve the accuracy of the ROM through the inclusion of additional closure terms (Ahmed et al., 2021). In this paper, we leave the ROM untouched and instead propose a new design paradigm for the estimator itself, which we call a reinforcement learning reduced-order estimator (RL-ROE).
The RL-ROE is constructed from the ROM in an analogous way to a Kalman filter, with the crucial difference that the linear filter gain function, which takes in the current measurement data, is replaced by a nonlinear policy trained through reinforcement learning (RL). The flexibility of the nonlinear policy, parameterized by a neural network, enables the RL-ROE to compensate for errors of the ROM while still taking advantage of the imperfect knowledge of the dynamics. Indeed, we show that in the limit of sparse measurements, the trained RL-ROE outperforms a Kalman filter designed using the same ROM and displays robust estimation performance across different dynamical regimes. To our knowledge, the RL-ROE is the first application of RL to state estimation of parametric PDEs.

2.1. PROBLEM FORMULATION

Consider the parametric discrete-time nonlinear system

z_{k+1} = f(z_k; µ), (1a)
y_k = C z_k + n_k, (1b)

where z_k ∈ R^n and y_k ∈ R^p are respectively the state and measurement at time k, f : R^n → R^n is a time-invariant nonlinear map from current to next state, n_k ∈ R^p is observation noise (assumed zero unless stated otherwise), µ ∈ R is a physical parameter, and C ∈ R^{p×n} is a linear map from state to measurement. In this study, we assume that the dynamics in (1) are obtained from a high-fidelity numerical discretization of a nonlinear partial differential equation (PDE), which typically requires a large number n of continuous state variables (on the order of at least a few hundred). Nonetheless, our work is applicable to any high-dimensional nonlinear system of the form (1). We do not account for exogenous control inputs to the system, which is left for future work. Here, we focus on the post-transient dynamics of (1), that is, the dynamics observed once the transients associated with the initial condition have died down. In particular, we consider systems whose post-transient dynamics are described by an attractor that is either a steady state, a periodic limit cycle or a quasi-periodic limit cycle, which encompasses the behavior of a large class of physical systems. The nature of the attractor is independent of the initial condition but depends on the value of µ, which we consider to lie in a range [µ_1, µ_2]. The purpose of the present work is to combine reduced-order modeling (ROM) and reinforcement learning (RL) to construct a state estimator that solves the following problem: given a sequence of measurements {y_1, ..., y_k} from a post-transient reference trajectory of (1), estimate the high-dimensional state z_k at current time k without knowledge of µ itself. The ROM procedure, which follows standard practices, is described in Section 2.2.
The integration of the ROM with RL to solve the estimation problem, which constitutes the main novelty of the paper, is described in Section 2.3.

2.2. REDUCED-ORDER MODEL

Since the high dimensionality of (1) renders online estimation impractical, it is customary to formulate a reduced-order model (ROM) of the dynamics (Rowley & Dawson, 2017). First, one chooses a suitable linearly independent set of modes {u_1, ..., u_r}, where u_i ∈ R^n, defining an r-dimensional subspace of R^n in which most of the dynamics is assumed to take place. Stacking these modes as columns of a matrix U ∈ R^{n×r}, one can then express z_k ≈ U x_k, where the reduced-order state x_k ∈ R^r represents the coordinates of z_k in the subspace. Finally, one finds a ROM for the dynamics of x_k, which is vastly cheaper to evolve than (1) when r ≪ n. There exist various ways to find an appropriate set of modes U and a corresponding ROM for the dynamics of x_k (Taira et al., 2017). In this work, we employ the Dynamic Mode Decomposition (DMD), a purely data-driven algorithm that has found numerous applications in fields ranging from fluid dynamics to neuroscience (Schmid, 2010; Kutz et al., 2016). Importantly, we seek a single ROM to describe dynamics corresponding to various parameter values µ ∈ [µ_1, µ_2], since the state estimator that we later construct based on this ROM does not have knowledge of µ. In order to apply DMD, we first construct a training dataset by solving (1) for values of µ belonging to a finite set S ⊂ [µ_1, µ_2], resulting in a concatenated collection of snapshots Z_train = {Z_µ}_{µ∈S}, where each Z_µ = {z_0^µ, ..., z_K^µ} is a post-transient trajectory of (1a) for a specific value µ ∈ S. The DMD then seeks a best-fit linear model of the dynamics in the form of a matrix A ∈ R^{n×n} such that z_{k+1}^µ ≈ A z_k^µ for all k and µ, and computes the modes U as the r leading principal component analysis (PCA) modes of Z_train.
The transformation z_k ≈ U x_k and the orthogonality of U then yield a linear discrete-time ROM of the form

x_k = A_r x_{k-1} + w_{k-1}, (2a)
y_k = C_r x_k + v_k, (2b)

where A_r = U^T A U ∈ R^{r×r} and C_r = C U ∈ R^{p×r} are the reduced-order state-transition and observation models, respectively. The (unknown) non-Gaussian process noise w_k and observation noise v_k account for the neglected PCA modes of Z_train in U, as well as the error incurred by the linear approximation and the effective averaging of the dynamics over a range of µ. Additional details regarding the calculation of A_r and U are provided in Appendix A of the supplementary materials.
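The relation z_k ≈ U x_k becomes an exact reconstruction whenever the state lies in the span of the modes. A minimal numerical sketch of this projection (the variable names here are illustrative, assuming orthonormal columns of U as produced by an SVD-based mode computation):

```python
import numpy as np

# Sketch: given orthonormal modes U (n x r), the reduced state is the
# coordinate vector x = U^T z, and z is approximated by U x.
rng = np.random.default_rng(0)
n, r = 8, 3
U, _ = np.linalg.qr(rng.normal(size=(n, r)))  # orthonormal columns (hypothetical modes)
z = U @ rng.normal(size=r)                    # a state lying exactly in the subspace
x = U.T @ z                                   # reduced-order coordinates
assert np.allclose(U @ x, z)                  # exact reconstruction for in-subspace states
```

For a general state outside the subspace, U U^T z is the orthogonal projection onto the modes, whose residual is the neglected-mode error absorbed by the noise terms in (2).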

2.3. REINFORCEMENT LEARNING-BASED REDUCED-ORDER ESTIMATOR

Using the ROM (2) defined by A_r, C_r and U, we now want to solve the estimation problem defined in Section 2.1. To this effect, we design a reduced-order estimator (ROE) of the form

x̂_k = A_r x̂_{k-1} + a_k, (3a)
a_k ∼ π_θ(·|y_k, x̂_{k-1}), (3b)

where x̂_k is an estimate of the reduced-order state x_k and a_k ∈ R^r is an action sampled from a nonlinear and stochastic policy π_θ, which takes as input the current measurement y_k from the reference trajectory of (1a) and the previous state estimate x̂_{k-1}. The subscript θ denotes the set of parameters that define the policy, whose goal is to use the sparse measurements y_k to act on the dynamics of x̂_k in (3a) so that the reconstructed high-dimensional state estimate ẑ_k = U x̂_k converges towards the (hidden) reference state z_k. Note that designing state estimators, also called state observers, by correcting the dynamics model with a measurement-dependent term is a standard approach in control theory (Korovin & Fomichev, 2009; Besançon, 2007). A Kalman filter is a special case of such an estimator, for which the action in (3b) is given by

a_k = K_k (y_k - C_r A_r x̂_{k-1}), (4)

with K_k ∈ R^{r×p} the optimal Kalman gain. Although the Kalman filter is the optimal linear filter for linear systems (Julier & Uhlmann, 2004; Simon, 2006), its performance suffers in the presence of unmodeled dynamics and parameter uncertainty, both of which are present in our case. This motivates the adoption of the more general form (3b), which retains the dependence of a_k on y_k and x̂_{k-1} but is more flexible thanks to the nonlinearity of the policy π_θ. We first train the policy π_θ in an offline phase, using deep RL to solve the optimization problem

θ* = arg min_θ E [ Σ_{k=1}^K ( ||z_k - U x̂_k||² + λ||a_k||² ) ], (5)

where the expectation is taken over initial estimates x̂_0, initial true states z_0, parameters µ, estimate trajectories {x̂_1, x̂_2, ...} induced by π_θ through (3), and true trajectories of states {z_1, z_2, ...} and measurements {y_1, y_2, ...} induced by (1). The first squared term in (5) penalizes the error between the high-dimensional estimate ẑ_k = U x̂_k and the true state z_k. The second squared term favors smaller values of the action a_k, which acts as a regularization mechanism. Unless indicated otherwise, we consider λ = 0. By considering different values of µ during training, a strategy called domain randomization (Peng et al., 2018b), we ensure robustness of the policy with respect to µ during online deployment of the estimator. Note that the stochasticity of π_θ lets the RL algorithm explore different actions during the training process, but is turned off during online deployment. We call the estimator constructed and trained through this process an RL-trained ROE, or RL-ROE for short. Finally, an interpretation of the estimator dynamics (3) and the optimization problem (5) in the context of Bayesian inference is presented in Appendix B.
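As a concrete illustration of the estimator dynamics (3), the following sketch rolls out a generic ROE over a measurement sequence. The function name and interface are assumptions for illustration; `correction` stands in for either the trained policy π_θ (RL-ROE) or the Kalman update of equation (4) (KF-ROE):

```python
import numpy as np

def rollout_roe(A_r, U, x0_hat, ys, correction):
    """Roll out the reduced-order estimator (3):
        x_hat_k = A_r x_hat_{k-1} + a_k,  with  a_k = correction(y_k, x_hat_{k-1}).

    Returns the reconstructed high-dimensional estimates z_hat_k = U x_hat_k.
    """
    x_hat = x0_hat
    z_hats = []
    for y in ys:
        a = correction(y, x_hat)      # measurement-dependent correction term
        x_hat = A_r @ x_hat + a       # estimator dynamics (3a)
        z_hats.append(U @ x_hat)      # lift back to the full state space
    return np.array(z_hats)
```

With a zero correction the estimator simply evolves the ROM open-loop; the role of the policy (or Kalman gain) is to steer this recursion towards the hidden reference trajectory using y_k.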

2.4. SUMMARY OF THE PROPOSED METHODOLOGY

In summary, the methodology we propose consists of the following three steps:

1. Construction of the ROM (offline);
2. Training of the RL-ROE (offline);
3. Online deployment of the RL-ROE.

The first two are carried out offline using a training dataset Z_train = {Z_µ}_{µ∈S} containing high-dimensional state snapshots from multiple trajectories of (1) for various µ. The third takes place online using sole knowledge of measurements {y_1, ..., y_k} from a trajectory of (1) corresponding to an unknown µ not necessarily belonging to S. At each time step k, the agent selects an action a_k ∈ A according to the policy π_θ defined in (3b), which can be expressed as

a_k ∼ π_θ(·|o_k), (6)

where o_k = (y_k, x̂_{k-1}) = (C z_k, x̂_{k-1}) is a partial observation of the current state s_k. The state s_{k+1} = (x̂_k, z_{k+1}, µ) at the next time step is then obtained from equations (1a) and (3a) as

s_{k+1} = (A_r x̂_{k-1} + a_k, f(z_k; µ), µ), (7)

which defines the transition model s_{k+1} ∼ P(·|s_k, a_k). Finally, the agent receives the reward

r_k = R(s_k, a_k, s_{k+1}) = -||z_k - U x̂_k||² - λ||a_k||², (8)

which is minus the term to be minimized at each step in (5). Thanks to the incorporation of z_k into s_k, the reward function (8) has no explicit time dependence and the MDP is therefore stationary. The RL training process then finds the optimal policy parameters

θ* = arg max_θ E_{τ∼π_θ}[R(τ)], (9)

where the expectation is over trajectories τ = (s_1, a_1, s_2, a_2, ...), and R(τ) = Σ_{k=1}^K r_k is the finite-horizon undiscounted return. Thus, the optimization problem (9) solved by RL is equivalent to that stated in (5). At the beginning of every episode, the environment is reset according to the distributions

x̂_0 ∼ p_{x̂_0}(·), z_0 ∼ p_{z_0}(·), µ ∼ p_µ(·), (10)

from which the augmented state s_1 = (x̂_0, z_1, µ) = (x̂_0, f(z_0; µ), µ) follows immediately; s_1 thus constitutes the start of the agent-environment interactions. The distribution p_{x̂_0}(·) bestows robustness of the learned policy with respect to the initial state estimate x̂_0, while the distributions p_{z_0}(·) and p_µ(·) enable the same policy π_θ to be trained on several reference trajectories of (1) corresponding to various µ ∈ [µ_1, µ_2]. In practice, we reuse the reference trajectories corresponding to µ ∈ S from the training dataset Z_train utilized to construct the DMD in Section 2.2, so that we do not have to keep solving (1a) during the training process. Thus, at the beginning of every episode, we draw a random µ from S and initialize z_0 as the starting state z_0^µ of the corresponding trajectory Z_µ = {z_0^µ, ..., z_K^µ}.
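The episode structure just described — reset by drawing a stored reference trajectory, then step through estimator dynamics (3a) with reward (8) — can be sketched as a minimal RL environment. The class and its interface are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

class ROEEnv:
    """Minimal sketch of the estimation MDP of Section 2.3 (hypothetical names).

    Each episode replays one stored reference trajectory; the agent's action a_k
    corrects the ROM prediction, and the reward is the negative squared
    estimation error minus an action penalty, as in equation (8).
    """

    def __init__(self, A_r, C, U, trajectories, lam=0.0, rng=None):
        self.A_r, self.C, self.U = A_r, C, U
        self.trajectories = trajectories  # dict: mu -> array of shape (K+1, n)
        self.lam = lam
        self.rng = rng if rng is not None else np.random.default_rng()

    def reset(self):
        mu = self.rng.choice(list(self.trajectories))  # draw mu from the training set
        self.Z = self.trajectories[mu]
        self.k = 1
        r = self.A_r.shape[0]
        self.x_hat = self.rng.normal(size=r)           # random initial estimate x_hat_0
        return self._obs()

    def _obs(self):
        # Partial observation o_k = (y_k, x_hat_{k-1}), with y_k = C z_k (noise-free)
        return np.concatenate([self.C @ self.Z[self.k], self.x_hat])

    def step(self, a):
        # Estimator dynamics (3a), then reward (8) against the hidden state z_k
        self.x_hat = self.A_r @ self.x_hat + a
        z_k = self.Z[self.k]
        reward = -np.sum((z_k - self.U @ self.x_hat) ** 2) - self.lam * np.sum(a ** 2)
        self.k += 1
        done = self.k >= len(self.Z)
        obs = None if done else self._obs()
        return obs, reward, done
```

Because z_k is part of the (hidden) environment state while the agent only sees o_k, this sketch realizes the POMDP structure discussed in the Remark below.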
To learn θ*, we employ the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), which belongs to the class of policy gradient methods (Sutton et al., 2000). The parameterization of the policy π_θ, implementation details and training hyperparameters are discussed in Appendix D. Remark. Since the policy (6) is conditioned on a partial observation o_k of the state s_k, the stationary MDP we have defined in this section is, in fact, a partially observable MDP (POMDP). In this case, it is known that the globally optimal policy depends on a summary of the history of past observations and actions, h_k = {o_1, a_1, ..., o_k}, rather than just the current observation o_k (Kaelbling et al., 1998). However, policies formulated based on an incomplete summary of h_k are common in practice and still achieve good results (Sutton & Barto, 2018). We therefore pursue this approach in the present paper, and leave for future work the generalization of our policy input to a more complete summary of h_k. We also note that policy gradient methods, to which PPO belongs, do not require the Markov property of the state (that is, conditional independence of future states from past states given the present state) and can therefore be readily applied to the POMDP setting. For our problem, this guarantees that the PPO algorithm will converge to a locally optimal policy.

4. RESULTS

We evaluate the state estimation performance of the RL-ROE for systems governed by the Burgers equation and the Navier-Stokes equations. For each system, we first compute various solution trajectories corresponding to different physical parameter values, which we use to construct the ROM and train the RL-ROE following the procedure outlined in Section 2.4. The trained RL-ROE is then deployed online and compared against a time-dependent Kalman filter constructed from the same ROM, which we refer to as the KF-ROE. The KF-ROE is given by equations (3a) and (4), with the calculation of the time-varying Kalman gain detailed in Appendix C of the supplementary materials. Before proceeding to the results, we discuss our choice of baseline. The ensemble Kalman filter and 4D-Var are two estimation techniques for high-dimensional systems such as those governed by PDEs (Lorenc, 2003). Although they are commonly employed for data assimilation in numerical weather prediction, they require large computational resources since they involve repeated solutions of the high-dimensional dynamics (1). Thus, they are not applicable in the context of embedded control systems, whose limited resources call for an inexpensive model such as the ROM (2). Since the ROM that we consider has linear dynamics, extensions of the Kalman filter to nonlinear dynamics such as the extended or unscented Kalman filters (Wan & Van Der Merwe, 2000; Julier & Uhlmann, 2004) are not relevant, and the vanilla Kalman filter remains the most appropriate baseline.

4.1. BURGERS EQUATION

The forced Burgers equation is a prototypical nonlinear hyperbolic PDE that takes the form

∂u/∂t + u ∂u/∂x - ν ∂²u/∂x² = f(x, t),

where u(x, t) is the velocity at position x ∈ [0, L] and time t, f(x, t) is a distributed time-dependent forcing, and the scalar ν acts like a viscosity. Here, we choose a forcing of the form f(x, t) = 2 sin(ωt - kx) + 2 sin(3ωt - kx) + 2 sin(5ωt - kx), where k = 2π/L, and we let ν and ω be related through a scalar parameter µ ∈ [0, 1] as follows: ν = ν_1 + (ν_2 - ν_1)µ, ω = ω_1 + (ω_2 - ω_1)µ. Thus, µ can be regarded as a physical parameter that affects the dynamics of the forced Burgers equation through both ν and ω. We consider periodic boundary conditions and choose L = 1, ν_1 = 0.01, ν_2 = 0.1, ω_1 = 0.2π, ω_2 = 0.4π. We solve the forced Burgers equation using a spectral method with n = 256 Fourier modes and a fifth-order Runge-Kutta time integration scheme. We define the discrete-time state vector z_k ∈ R^n that contains the values of u at n equally-spaced collocation points and at discrete time steps t = k∆t, where ∆t = 0.05. To generate the training dataset Z_train = {Z_µ}_{µ∈S} used for constructing the ROM and training the RL-ROE, we compute solutions of the Burgers equation corresponding to µ ∈ S = {0, 0.1, 0.2, ..., 1}. For each µ, we discard the transient portion of the dynamics and save 201 snapshots Z_µ = {z_0^µ, ..., z_200^µ} in the post-transient regime. We retain r = 10 modes when constructing the ROM, corresponding to an order-of-magnitude reduction in the dimensionality of the system.
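For intuition about the dynamics being estimated, one explicit time step of the forced Burgers equation on a periodic grid can be sketched with finite differences as below. This is a simplified illustration, not the paper's solver, which uses a spectral method with fifth-order Runge-Kutta time integration:

```python
import numpy as np

def burgers_step(u, dt, dx, nu, force):
    """One explicit Euler step of u_t + u u_x - nu u_xx = f on a periodic grid.

    Central differences for both derivatives; `force` is f evaluated at the
    current time (scalar or array broadcastable against u).
    """
    up = np.roll(u, -1)                       # u_{i+1} (periodic wrap)
    um = np.roll(u, 1)                        # u_{i-1}
    dudx = (up - um) / (2 * dx)               # first derivative
    d2udx2 = (up - 2 * u + um) / dx**2        # second derivative
    return u + dt * (-u * dudx + nu * d2udx2 + force)
```

A spatially constant state with zero forcing is a fixed point of this update, which is a convenient sanity check for the discretization.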
We train the RL-ROE using episodes of length K = 200 steps to make full use of the trajectories stored in Z_train, and we end the training process when the return no longer increases on average. The trained RL-ROE and the KF-ROE are then compared based on their ability to track reference trajectories corresponding to various µ, using sparse measurements of u from a limited number p of equally-spaced sensors (Appendix E describes the corresponding C). The evaluation is carried out using 20 initial state estimates sampled from a Gaussian distribution with unit standard deviation. The error of the RL-ROE is very close to the lower bound, which is the error incurred by projecting the reference solution z_k onto the modes U, i.e. the lowest possible error achievable by any ROE based on U. Spatio-temporal contours of the reference solutions for the same 3 values of µ and the corresponding RL-ROE and KF-ROE high-dimensional estimates are shown in Figure 2. The RL-ROE vastly outperforms the KF-ROE, which demonstrates the superiority of a nonlinear correction to the estimator dynamics (3a). We emphasize that the Kalman filter was tuned to obtain the best possible results for the KF-ROE (see Appendix C for details on the tuning process). Figure 3 reports the time average of the normalized L2 error as a function of µ for p = 4, with the values present in Z_train indicated by large circles. The RL-ROE exhibits robust performance across the entire parameter range µ ∈ [0, 1], including when estimating previously unseen trajectories. Finally, Figure 4 displays the average over time and over µ of the normalized L2 error for a varying number p of sensors. Note that each value of p corresponds to a separately trained RL-ROE. As the number of sensors increases, the KF-ROE performs better and better until its accuracy overtakes that of the RL-ROE.
We hypothesize that the accuracy of the RL-ROE is limited by the inability of the RL training process to find an optimal policy, due both to the non-convexity of the optimization landscape and to shortcomings inherent to current deep RL algorithms. That being said, the strength of the nonlinear policy of the RL-ROE becomes very clear in the very sparse sensing regime: its performance remains remarkably robust as the number of sensors is reduced to 2 or even 1. In Appendix F, spatio-temporal contours (similar to those in Figure 2) of the reference solution and corresponding estimates for p = 2 and 12 illustrate that the slight advantage held by the KF-ROE for p = 12 is reversed into a clear superiority of the RL-ROE for p = 2.

4.2. NAVIER-STOKES EQUATIONS

The Navier-Stokes equations are a set of nonlinear PDEs that describe the motion of fluid flows. For incompressible fluids, they take the form

∂u/∂t + (u · ∇)u = -∇p + (1/Re) ∆u, (14a)
∇ · u = 0, (14b)

where u(x, t) and p(x, t) are the velocity vector and pressure at position x and time t, and the scalar Re is the Reynolds number. We consider the classical problem of flow past a cylinder in a 2D domain, which is well known to exhibit a Hopf bifurcation from a steady wake to periodic vortex shedding at a critical Reynolds number Re_c ∼ 40 (Jackson, 1987). For our study, we focus on the range Re ∈ [10, 110], which makes the estimation problem very challenging since this range includes the bifurcation and therefore comprises solution trajectories with very different dynamics: steady for Re < Re_c, a periodic limit cycle for Re > Re_c. Furthermore, the shedding frequency and the spacing between consecutive vortices in the limit-cycle regime both vary with Re. We solve the Navier-Stokes equations with the open-source finite volume code OpenFOAM, using a mesh consisting of 18840 nodes and a second-order implicit scheme with time step 0.05. The discrete-time state vector z_k ∈ R^37680 contains the two velocity components of u at discrete time steps t = k∆t, where we choose ∆t = 0.25. To generate the training dataset Z_train = {Z_Re}_{Re∈S} for constructing the ROM and training the RL-ROE, we run simulations of the Navier-Stokes equations for Re ∈ S = {10, 20, 30, ..., 110}. We discard the transient portion of the dynamics (for the cases Re > Re_c) and save 201 snapshots Z_Re = {z_0^Re, ..., z_200^Re} in the post-transient regime. We retain r = 20 modes when constructing the ROM, corresponding to a three-orders-of-magnitude reduction in the dimensionality of the system. These modes are shown in Appendix G.
We train the RL-ROE using episodes of length K = 200 steps to make full use of the trajectories stored in Z_train, and end the training process when the return no longer increases on average. The RL hyperparameters and the learning curve displaying the performance improvement of the RL-ROE during the training process are reported in Appendix D. The trained RL-ROE and the KF-ROE are then compared based on their ability to track reference trajectories corresponding to various Re, using sparse measurements of u from a limited number p of sensors randomly distributed in the wake region behind the cylinder (see Appendix E for the construction of the corresponding C). The evaluation is carried out using 5 initial state estimates. The velocity magnitude of the reference solutions for the same 3 values of Re and the corresponding RL-ROE and KF-ROE reconstructed high-dimensional estimates are shown at t = 50 in Figure 6. Remarkably, the RL-ROE manages to estimate the entire flow field very precisely across different dynamical regimes, with the steady wake at Re = 35 being reproduced as well as the vortex wakes at Re = 65 and 105. The KF-ROE, on the other hand, struggles to estimate the flow fields for all 3 Reynolds numbers, and instead predicts a velocity field that is almost zero everywhere except in the wake. Again, the superiority of the RL-ROE is granted by the nonlinearity of its policy; in fact, note that bifurcations such as the one exhibited by this flow are inherently nonlinear phenomena. Appendix H shows corresponding results in the presence of non-zero observation noise. Figure 8 displays the average over time and over Re of the normalized L2 error for a varying number p of sensors. Although the KF-ROE eventually becomes very accurate in the presence of a large number p of sensors, the accuracy of the RL-ROE remains remarkably stable as p decreases, allowing it to vastly outperform the KF-ROE as soon as p < 8.
Once again, this showcases the benefits of using a nonlinear policy to correct the estimator dynamics (3a).

5. RELATED WORK

Previous studies have already proposed designing state estimators using policies trained through reinforcement learning. Morimoto & Doya (2007) introduced an estimator of the form x̂_k = f(x̂_{k-1}) + L(x̂_{k-1})(y_{k-1} - C x̂_{k-1}), where f(·) is the state-transition model of the system, and the state-dependent filter gain matrix L(x̂_{k-1}) is defined using Gaussian basis functions whose parameters are learned through a variant of vanilla policy gradient. Hu et al. (2020) proposed an estimator of the form x̂_k = f(x̂_{k-1}) + L(x_k - x̂_k)(y_k - C f(x̂_{k-1})), where L(x_k - x̂_k) is approximated by neural networks trained with a modified Soft Actor-Critic algorithm (Haarnoja et al., 2018). Although they derived convergence properties for the estimation error, the dependence of the filter gain L(x_k - x̂_k) on the reference state x_k limits its practical applicability. Furthermore, a major difference between these past studies and our work is that they only consider low-dimensional systems with at most four state variables. Our RL-ROE, on the other hand, handles parametric PDEs described by tens of thousands of state variables, as shown in the previous section. In another line of work, reinforcement learning has been applied to learn control policies for joint torques that enable simulated characters to imitate given reference motions consisting of a sequence of target poses; see e.g. Peng et al. (2018a) or Lee et al. (2019). These are essentially trajectory tracking problems, as the policy learns to drive the character's pose towards that defined by the reference motion. Similarly to us, these studies also propose to transform the problem into a stationary MDP by augmenting the state; but they do so by appending a scalar phase variable φ ∈ [0, 1] that represents the normalized time elapsed in the reference motion. The reward can then be formulated in terms of q(φ), where q(·) describes the reference motion, and no explicit time dependence appears.
However, this approach would not work for our purpose since we do not wish to restrict the RL-ROE to estimating a single reference trajectory for each value of µ. Many dynamical systems indeed admit multiple post-transient trajectories for given parameter values (Cross & Hohenberg, 1993) . Augmenting the MDP's state with the entire snapshot from the reference trajectory instead of just time ensures that the policy π θ can learn any number of reference trajectories for each µ.

6. CONCLUSIONS

In this paper, we have introduced the reinforcement learning reduced-order estimator (RL-ROE), a new state estimation methodology for parametric PDEs. Our approach relies on the construction of a computationally inexpensive reduced-order model (ROM) to approximate the dynamics of the system. The novelty of our contribution lies in the design, based on this ROM, of a reduced-order estimator (ROE) in which the filter correction term is given by a nonlinear stochastic policy trained offline through reinforcement learning. We demonstrate using simulations of the Burgers and Navier-Stokes equations that in the limit of very few sensors, the trained RL-ROE vastly outperforms a Kalman filter designed using the same ROM, which is attributable to the nonlinearity of its policy (see Appendix I for a quantification of this nonlinearity). Finally, the RL-ROE also yields accurate high-dimensional state estimates for reference trajectories corresponding to various parameter values, without direct knowledge of the latter.

A DYNAMIC MODE DECOMPOSITION

In this appendix, we describe the DMD algorithm (Schmid, 2010; Tu et al., 2014), a popular data-driven method to extract spatial modes and low-dimensional dynamics from a dataset of high-dimensional snapshots. Here, we use the DMD to construct a ROM of the form (2), given an observation model C and a concatenated collection of snapshots Z_train = {Z_µ}_{µ∈S}, where each Z_µ = {z_0^µ, ..., z_m^µ} contains snapshots from a trajectory of (1a) for a specific value µ. Fundamentally, the DMD seeks a best-fit linear model of the dynamics in the form of a matrix A ∈ R^{n×n} such that z_{k+1} ≈ A z_k. First, arrange the snapshots into two time-shifted matrices

X = {z_0^µ1, ..., z_{m-1}^µ1, ..., z_0^µq, ..., z_{m-1}^µq},
Y = {z_1^µ1, ..., z_m^µ1, ..., z_1^µq, ..., z_m^µq},

where q denotes the number of elements in S. The best-fit linear model is then given by A = Y X†, where X† is the pseudoinverse of X. The ROM is obtained by projecting the matrices A and C onto a basis U consisting of the r leading left singular vectors of X, which approximate the r leading PCA modes of Z_train. Using the truncated singular value decomposition

X = U Σ V^T, (16)

where U ∈ R^{n×r}, V ∈ R^{qm×r} and Σ ∈ R^{r×r}, the resulting reduced-order state-transition and observation models are given by

A_r = U^T A U = U^T Y V Σ^{-1}, (17a)
C_r = C U. (17b)

Conveniently, the ROM matrices A_r and C_r can be calculated directly from the truncated SVD of X, which avoids forming the large n × n matrix A.
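This construction of A_r, C_r and U from the truncated SVD of X can be sketched compactly as follows; the function name and data layout are assumptions for illustration:

```python
import numpy as np

def build_rom(Z_list, C, r):
    """Assemble a DMD-based ROM (A_r, C_r, U) from training trajectories.

    Z_list : list of snapshot arrays, each of shape (n, m+1), one per parameter value.
    C      : observation matrix, shape (p, n).
    r      : number of retained modes.
    """
    # Time-shifted snapshot matrices X and Y, concatenated over trajectories.
    X = np.hstack([Z[:, :-1] for Z in Z_list])
    Y = np.hstack([Z[:, 1:] for Z in Z_list])

    # Truncated SVD of X; the columns of U approximate the r leading PCA modes.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

    # Reduced operators as in (17a)-(17b), without forming the n x n matrix A = Y X^+.
    A_r = U.T @ Y @ Vt.T @ np.diag(1.0 / s)
    C_r = C @ U
    return A_r, C_r, U
```

When the data are generated by exactly linear dynamics and r equals the rank of X, the one-step prediction U A_r U^T z_k recovers z_{k+1} exactly, which gives a simple consistency check.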

B BAYESIAN INTERPRETATION

Here, we evaluate the meaningfulness of the RL-ROE design by framing the estimator dynamics (3) and the optimization problem (5) in the context of Bayesian inference. Unless stated otherwise, we consider the estimation problem in terms of the reduced state x_k, governed by the ROM (2).

B.1 BAYESIAN OPTIMAL FILTER

From a Bayesian perspective, the goal at each time k = 1, ..., K is to calculate p(x_k|y_{1:k}), the posterior probability density function (pdf) measuring our belief in the true state x_k given the measurement data y_{1:k} = {y_1, ..., y_k}. It is assumed that we know the transition model p(x_k|x_{k-1}) describing the dynamics of the system, the observation model p(y_k|x_k) relating state and measurement data, and the initial pdf p(x_0). The posterior pdf p(x_k|y_{1:k}) can then formally be obtained recursively by alternating between prediction and update steps (Särkkä, 2013).

Prediction step. Starting from the posterior pdf p(x_{k-1}|y_{1:k-1}) at time k-1, one first uses the dynamics of the system to compute the prior pdf at time k via the Chapman-Kolmogorov equation

p(x_k|y_{1:k-1}) = ∫ p(x_k|x_{k-1}) p(x_{k-1}|y_{1:k-1}) dx_{k-1}, (18)

where the Markovian property of the dynamics has been used.

Update step. The prior is then updated using the new measurement y_k via Bayes' rule, yielding the posterior pdf

p(x_k|y_{1:k}) = p(y_k|x_k) p(x_k|y_{1:k-1}) / p(y_k|y_{1:k-1}), (19)

where the normalizing constant is p(y_k|y_{1:k-1}) = ∫ p(y_k|x_k) p(x_k|y_{1:k-1}) dx_k. These equations generally do not admit analytical solutions, except in the linear and Gaussian setting, which results in the Kalman filter (Särkkä, 2013). Particle filters provide an approximate solution in the general setting, but they are computationally expensive, their required ensemble size scaling exponentially with the state dimension (Daum & Huang, 2003; Snyder et al., 2008).
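In the linear-Gaussian special case, the prediction and update steps above admit the closed-form Kalman filter recursions, which can be sketched as follows. This is a generic textbook recursion for illustration (hypothetical helper name, not the paper's tuned KF-ROE implementation):

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One prediction/update cycle of the Kalman filter: the closed-form
    solution of the Bayesian recursion for linear dynamics and Gaussian noise.

    x_hat, P : posterior mean and covariance at the previous step.
    A, C     : state-transition and observation models.
    Q, R     : process and observation noise covariances.
    """
    # Prediction: propagate mean and covariance through the dynamics (Chapman-Kolmogorov).
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Update: Bayes' rule collapses to a linear correction with the Kalman gain K.
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new
```

With near-zero observation noise the gain approaches identity and the update snaps the estimate onto the measurement, which matches the intuition that the posterior concentrates around y_k.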

B.2 RL-ROE FROM A BAYESIAN PERSPECTIVE

To frame the RL-ROE in the context of Bayesian inference, we first need to relate the state estimate $\hat{x}_k$ to the posterior $p(x_k | y_{1:k})$. Specifically, we define $\hat{x}_k$ as the mean of the posterior,
$$\hat{x}_k = E[x_k | y_{1:k}] = \int x_k \, p(x_k | y_{1:k}) \, dx_k, \quad (20)$$
which, assuming that $p(x_k | y_{1:k})$ is known, is commonly referred to as the minimum mean square error (MMSE) estimator (Kay, 1993). Using this definition, we can now interpret the estimator dynamics (3) and the optimization problem (5) from the perspective of the Bayesian filter equations, with the transition model $p(x_k | x_{k-1})$ and the observation model $p(y_k | x_k)$ given by (2a) and (2b).

Prediction step. Starting from the estimate $\hat{x}_{k-1} = E[x_{k-1} | y_{1:k-1}]$ at time $k-1$, we take the expectation of (18) to obtain the 'prior' estimate at time $k$,
$$\hat{x}_k^- = E[x_k | y_{1:k-1}] = A_r \hat{x}_{k-1}, \quad (21)$$
where we have made use of the linearity of the transition model (2a) and assumed that the process noise $w_k$ is zero-mean (which is empirically observed in our examples; see Appendix J).

Update step. Formally, the estimate $\hat{x}_k$ is given by (20) using the posterior pdf $p(x_k | y_{1:k})$ from the Bayes update (19). However, (19) cannot be solved explicitly since the RL-ROE does not evolve the full prior $p(x_k | y_{1:k-1})$. Instead, we make use of the well-known fact that the MMSE estimator is equivalent to minimizing the mean square error, or variance, of the posterior estimate (see, e.g., chapters 10 and 11 in Kay (1993) or chapter 2 in Särkkä (2013)). This can be seen as follows:
$$\hat{x}_k = E[x_k | y_{1:k}] = \int x_k \, p(x_k | y_{1:k}) \, dx_k = \arg\min_{\hat{x}_k} \int \|x_k - \hat{x}_k\|^2 \, p(x_k | y_{1:k}) \, dx_k = \arg\min_{\hat{x}_k} E\big[\|x_k - \hat{x}_k\|^2 \,\big|\, y_{1:k}\big]. \quad (22)$$
We relax the above minimization problem by restricting $\hat{x}_k$ to a specific class of functions, as is done when deriving linear MMSE estimators. In the latter case, the optimal solution is $\hat{x}_k = \hat{x}_k^- + K_k (y_k - C_r \hat{x}_k^-)$, where the prior estimate $\hat{x}_k^-$ is given by (21), and $K_k$ is obtained from the closed-form solution of (22) (Kay, 1993).
We generalize this approach by considering the nonlinear form
$$\hat{x}_k = \hat{x}_k^- + \mu_{\theta_k}(y_k, \hat{x}_k^-), \quad (24)$$
where $\mu_{\theta_k}$ is a nonlinear function parameterized by $\theta_k$, whose role is to update the prior $\hat{x}_k^-$ using $y_k$ in such a way as to minimize the posterior variance. The minimization problem (22) then becomes
$$\theta_k^* = \arg\min_{\theta_k} E\big[\|x_k - \hat{x}_k\|^2\big], \quad (25)$$
where it is implicitly assumed that the expectation is conditioned on $y_{1:k}$.

RL-ROE.

To obtain the estimator dynamics (3) and the optimization problem (5), we further consider that the function $\mu_{\theta_k}$ is stochastic and independent of time; it is therefore expressed by a stationary policy $\pi_\theta$ parameterized by $\theta$. Then, combining the posterior estimate (24) with the prior estimate (21) gives the recursion
$$\hat{x}_k = A_r \hat{x}_{k-1} + a_k, \quad (26a)$$
$$a_k \sim \pi_\theta(\,\cdot\, | y_k, \hat{x}_{k-1}), \quad (26b)$$
which is the estimator dynamics (3). The parameters $\theta$ of the stationary policy are found by extending the minimization problem (25) to all time steps, yielding
$$\theta^* = \arg\min_\theta E\left[\sum_{k=1}^K \|x_k - \hat{x}_k\|^2\right]. \quad (27)$$
In the RL setting, this optimization problem is solved in a data-driven manner by sampling system trajectories. Thus, the expectation is taken over initial estimates $\hat{x}_0$, initial true states $x_0$, parameters $\mu$, estimate trajectories $\{\hat{x}_1, \hat{x}_2, \ldots\}$ induced by $\pi_\theta$ through (3), and true trajectories of states $\{x_1, x_2, \ldots\}$ and measurements $\{y_1, y_2, \ldots\}$. Finally, we express the posterior variance through the reconstructed high-dimensional estimate $\hat{z}_k = U \hat{x}_k$ and the high-dimensional true state $z_k$, and we add a regularization term that penalizes the magnitude of the action $a_k$. This gives
$$\theta^* = \arg\min_\theta E\left[\sum_{k=1}^K \left(\|z_k - U \hat{x}_k\|^2 + \lambda \|a_k\|^2\right)\right], \quad (28)$$
which is the optimization problem (5).

We conclude this analysis with a few remarks. First, it demonstrates that the cost minimized in (28) derives from the variance minimization principle underlying the definition of the MMSE estimator. Second, an important difference between the RL-ROE and the Bayesian optimal filter is that the RL-ROE does not require knowledge of the distributions of the process noise $w_k$ and observation noise $v_k$. Rather, it leverages knowledge of the true trajectories and corresponding measurements in the offline training phase to find the policy $\mu_\theta$ that yields the minimum mean square error. Third, the RL algorithm solves the optimization problem (28) in a batch approach during the offline training phase, sampling entire trajectories of the estimate between each update of the policy parameters. The trained RL-ROE is then applied online in a recursive manner.
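The cost inside the expectation in (28) can be accumulated along a single sampled trajectory as follows. This is a sketch with a deterministic stand-in for the policy $\pi_\theta$ (stochasticity omitted); the function names and interface are ours.

```python
import numpy as np

def rollout_cost(policy, A_r, U, z_traj, y_traj, x0_hat, lam):
    """Accumulate sum_k ||z_k - U x_hat_k||^2 + lam * ||a_k||^2 along one
    reference trajectory, with x_hat_k = A_r x_hat_{k-1} + a_k. Here
    `policy(y, x_hat_prev)` is a deterministic stand-in for pi_theta."""
    x_hat, cost = x0_hat, 0.0
    for z, y in zip(z_traj, y_traj):
        a = policy(y, x_hat)                # a_k from (y_k, x_hat_{k-1})
        x_hat = A_r @ x_hat + a             # estimator dynamics (3a)
        cost += np.sum((z - U @ x_hat)**2) + lam * np.sum(a**2)
    return cost
```

The training objective (28) is the expectation of this quantity over initial conditions, parameter values, and trajectories; RL minimizes it by sampling many such rollouts between policy updates.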

C KALMAN FILTER

The time-dependent Kalman filter that we use as a benchmark in this paper, the KF-ROE, is based on the same ROM (2) as the RL-ROE, with identical matrices $A_r$, $C_r$, and $U$. Similarly to the RL-ROE, the reduced-order estimate $\hat{x}_k$ is given by equation (3a), from which the high-dimensional estimate is reconstructed as $\hat{z}_k = U \hat{x}_k$. However, the KF-ROE differs from the RL-ROE in its definition of the action $a_k$ in (3a), which is instead given by the linear feedback term (4). The calculation of the optimal Kalman gain $K_k$ in (4) requires the following operations at each time step:
$$P_k^- = A_r P_{k-1} A_r^T + Q_k,$$
$$S_k = C_r P_k^- C_r^T + R_k,$$
$$K_k = P_k^- C_r^T S_k^{-1},$$
$$P_k = (I - K_k C_r) P_k^-,$$
where $P_k^-$ and $P_k$ are respectively the a priori and a posteriori estimate covariance matrices, $S_k$ is the innovation covariance, and $Q_k$ and $R_k$ are respectively the covariance matrices of the process noise $w_k$ and observation noise $v_k$ in the ROM (2). Following a standard procedure, we tune these noise covariance matrices to yield the best possible results (Simon, 2006). We assume that $Q_k = \beta_Q I$ and $R_k = \beta_R I$, and perform a line search to find the values of $\beta_Q$ and $\beta_R$ that yield the best performance. This resulted in $\beta_Q = 10^3$ and $\beta_R = 1$ for the Burgers example, and $\beta_Q = 10^9$ and $\beta_R = 1$ for the Navier-Stokes example. At time step $k = 0$, the a posteriori estimate covariance is initialized as $P_0 = \mathrm{cov}(U^T z_0 - \hat{x}_0)$, which can be calculated from the distributions (10).
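One predict/update cycle of these recursions can be sketched as follows; the function name and interface are ours, not from the paper.

```python
import numpy as np

def kf_roe_step(x_hat, P, y, A_r, C_r, Q, R):
    """One predict/update cycle of a time-dependent Kalman filter on the
    ROM (2). Returns the posterior estimate and covariance at time k,
    given the posterior estimate and covariance at time k-1 and the new
    measurement y_k."""
    # Predict (a priori estimate and covariance)
    x_prior = A_r @ x_hat
    P_prior = A_r @ P @ A_r.T + Q
    # Update (innovation covariance, optimal gain, a posteriori quantities)
    S = C_r @ P_prior @ C_r.T + R
    K = P_prior @ C_r.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (y - C_r @ x_prior)
    P_post = (np.eye(P.shape[0]) - K @ C_r) @ P_prior
    return x_post, P_post
```

Here the correction $K_k (y_k - C_r \hat{x}_k^-)$ plays the role of the action $a_k$ in (3a); the RL-ROE replaces this linear term with the nonlinear policy output.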

D RL ALGORITHM, HYPERPARAMETERS AND LEARNING CURVES

We employ the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) to find the optimal policy parameters $\theta^*$. PPO alternates between sampling data by computing a set of trajectories $\{\tau_1, \tau_2, \tau_3, \ldots\}$ using the most recent version of the policy, and updating the policy parameters $\theta$ in a way that increases the probability of actions that led to higher rewards during the sampling phase. The policy $\pi_\theta$ encodes a diagonal Gaussian distribution described by a neural network that maps from observation to mean action, $\mu_\theta(o_k)$, together with a vector of standard deviations $\sigma$, so that $\theta$ comprises the network weights and $\sigma$. We utilize the Stable Baselines3 (SB3) implementation of PPO (Raffin et al., 2019) and define our MDP as a custom environment in OpenAI Gym (Brockman et al., 2016). For both the Burgers and Navier-Stokes examples, the stochastic policy $\pi_\theta$ is trained with PPO using the default hyperparameters from Stable Baselines3, except for the discount factor $\gamma$, which we choose as 0.75. The mean output of the stochastic policy and the value function are approximated by two neural networks, each containing two hidden layers with 64 neurons and tanh activation functions. The input to the policy is normalized using a running average and standard deviation during the training process, which alternates between sampling data for 10 trajectories (of length 200 time steps each) and updating the policy. Each policy update consists of multiple gradient steps through the most recent data, using 10 epochs, a minibatch size of 64, and a learning rate of 0.0003. The policy is trained for a total of one to three million time steps, corresponding to 5,000 to 15,000 trajectories, which takes between 15 minutes and one hour on a Core i7-12700K CPU, depending on the dimensionality of the ROM. Figure 9 reports the learning curves corresponding to the unforced and forced cases.
During training, the policy is tested (with stochasticity switched off) after each update using 20 separate test trajectories, and is saved if it outperforms the previous best policy. The RL-ROE is assigned the latest saved policy when training ends, and the stochasticity of the policy remains switched off during subsequent evaluation of the RL-ROE.
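The structure of the custom environment can be illustrated with a minimal Gym-style stand-in. This is a dependency-free sketch of the estimator MDP, not the authors' implementation: the class name, the observation layout (measurement concatenated with the previous reduced estimate), and the episode logic are our assumptions; in practice the environment would subclass `gym.Env` and be passed to SB3's `PPO`.

```python
import numpy as np

class ROEEnv:
    """Minimal Gym-style environment for the estimator MDP (illustrative).
    The action is the correction a_k in x_hat_k = A_r x_hat_{k-1} + a_k;
    the reward is -(||z_k - U x_hat_k||^2 + lam * ||a_k||^2)."""

    def __init__(self, A_r, C, U, z_traj, lam=0.01):
        self.A_r, self.C, self.U = A_r, C, U
        self.z_traj, self.lam = z_traj, lam

    def reset(self):
        self.k = 0
        self.x_hat = np.zeros(self.A_r.shape[0])
        # Observation: current measurement y_k and previous reduced estimate
        return np.concatenate([self.C @ self.z_traj[0], self.x_hat])

    def step(self, action):
        self.x_hat = self.A_r @ self.x_hat + action   # estimator dynamics (3a)
        z = self.z_traj[self.k]
        reward = -(np.sum((z - self.U @ self.x_hat)**2)
                   + self.lam * np.sum(action**2))
        self.k += 1
        done = self.k >= len(self.z_traj)
        obs = None if done else np.concatenate(
            [self.C @ self.z_traj[self.k], self.x_hat])
        return obs, reward, done, {}
```

Maximizing the discounted sum of these rewards is equivalent (up to the discount factor) to minimizing the training objective over each sampled trajectory.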

E CONSTRUCTION OF THE OBSERVATION MATRIX

We describe how we construct the observation matrix $C$ in the Burgers and Navier-Stokes examples, once the number, type, and locations of the sensors have been chosen. In the Burgers example, the state vector $z_k \in \mathbb{R}^n$ contains the values of $u$ at $n$ collocation points, and the measurements $y_k \in \mathbb{R}^p$ consist of the values of $u$ at $p$ equally-spaced sensors. Let us introduce the indices $\{j_1, \ldots, j_p\}$ of the entries in $z_k$ corresponding to the measurements $y_k$. Then, $y_k$ and $z_k$ are related by $y_k = C z_k$, where the matrix $C \in \mathbb{R}^{p \times n}$ contains ones at the entries indexed $\{(1, j_1), \ldots, (p, j_p)\}$ and zeros everywhere else. In the Navier-Stokes example, the state vector $z_k \in \mathbb{R}^{2n}$ contains the horizontal and vertical components of the velocity $u$ at $n$ collocation points, and the measurements $y_k \in \mathbb{R}^{2p}$ consist of the components of $u$ at $p$ equally-spaced sensors. Let us introduce the indices $\{j_1, \ldots, j_{2p}\}$ of the entries in $z_k$ corresponding to the measurements $y_k$. Then, $y_k$ and $z_k$ are related by $y_k = C z_k$, where the matrix $C \in \mathbb{R}^{2p \times 2n}$ contains ones at the entries indexed $\{(1, j_1), \ldots, (2p, j_{2p})\}$ and zeros everywhere else.
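In both cases $C$ is a selection matrix, which can be built as follows (the function name is ours):

```python
import numpy as np

def selection_matrix(n_state, sensor_indices):
    """Build the observation matrix C relating the state z_k to the
    measurements y_k = C z_k, given the indices {j_1, ..., j_p} of the
    measured state entries. C has a single one in each row."""
    C = np.zeros((len(sensor_indices), n_state))
    for row, j in enumerate(sensor_indices):
        C[row, j] = 1.0
    return C
```

For the Burgers example `n_state = n` with $p$ indices; for the Navier-Stokes example `n_state = 2n` with $2p$ indices (two velocity components per sensor).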

F ADDITIONAL RESULTS FOR THE BURGERS EQUATION

Figure 10 shows spatio-temporal contours of the reference solution and corresponding estimates, for p = 2 and 12. The RL-ROE vastly outperforms the KF-ROE for p = 2, while for p = 12 the KF-ROE is slightly more accurate.

G MODES FOR FLOW PAST A CYLINDER

The first 18 modes (i.e. columns of U ) of the 20-dimensional ROM for the flow past a cylinder are displayed in Figure 11 .

H EFFECT OF OBSERVATION NOISE FOR THE NAVIER-STOKES EQUATIONS

In this appendix, we evaluate the estimation accuracy of the RL-ROE in the presence of non-zero observation noise $n_k$ polluting the sensor measurements $y_k$ in (1). Specifically, we consider $n_k$ to be Gaussian white noise of standard deviation $\sigma = 0.1$. For the case of $p = 3$ sensors, Figure 12 shows the time series of the noise-free measurements contained in $y_k$ for various values of Re, together with their polluted counterparts (recall that each sensor measures two components of velocity, as detailed in Appendix E). Using the noisy measurements, we then repeat the experiments carried out in the main text; the results are shown in Figures 13 and 14, which are the counterparts of Figures 5 and 6. We observe excellent robustness of the RL-ROE in the presence of noise, with the estimate maintaining high accuracy.

I NONLINEARITY OF THE TRAINED POLICY

In this appendix, we evaluate the degree of nonlinearity of the RL-trained policies $\pi_\theta$ obtained in our Burgers and Navier-Stokes examples. Since the stochasticity of $\pi_\theta$ is switched off during online deployment, $\pi_\theta$ can be expressed by its mean function $\mu_\theta$ (see Appendix D), so that the action in (3b) becomes
$$a_k = \mu_\theta(y_k, \hat{x}_{k-1}). \quad (33)$$
We therefore quantify the nonlinearity of the policy by evaluating the Jacobian of the function $\mu_\theta$ with respect to its two arguments $y_k$ and $\hat{x}_{k-1}$. The Jacobian is a matrix $J$ whose components are the first-order derivatives $\partial \mu_i / \partial y_j$ and $\partial \mu_i / \partial \hat{x}_j$, where $i$ and $j$ refer to the entries of $\mu$ and of $y_k$ or $\hat{x}_{k-1}$, respectively. Instead of looking at individual components, we consider the Frobenius norm of the Jacobian, defined as
$$\|J\|_F = \sqrt{\sum_i \sum_j |J_{ij}|^2}. \quad (34)$$
The Jacobian (and hence its norm) of a linear policy is independent of the input values, while the Jacobian (and its norm) of a nonlinear policy changes with the input values.
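This diagnostic can be sketched in a dependency-free way using central finite differences in place of the automatic differentiation one would use with a neural-network policy; the function name and interface are ours.

```python
import numpy as np

def jacobian_fro_norm(mu, y, x_prev, eps=1e-6):
    """Frobenius norm of the Jacobian of a = mu(y, x_prev) with respect
    to both inputs, estimated column by column via central differences."""
    inp = np.concatenate([y, x_prev])
    ny = len(y)
    sq_sum = 0.0
    for j in range(len(inp)):
        e = np.zeros_like(inp)
        e[j] = eps
        hi, lo = inp + e, inp - e
        # j-th Jacobian column: d mu / d inp_j
        col = (mu(hi[:ny], hi[ny:]) - mu(lo[:ny], lo[ny:])) / (2 * eps)
        sq_sum += np.sum(col**2)
    return np.sqrt(sq_sum)
```

For a linear policy this norm is the same at every input point; evaluating it along an RL-ROE trajectory and observing a spread of values is what signals nonlinearity.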



Figure 1: Burgers equation with p = 4 sensors. Normalized L 2 error of the RL-ROE and KF-ROE for the estimation of trajectories corresponding to values of µ not seen during training.

Figure 2: Burgers equation with p = 4 sensors. Reference trajectories for values of µ not seen during training and corresponding RL-ROE and KF-ROE estimates. The dashed lines on the reference trajectory plots indicate the sensor data seen by the RL-ROE and KF-ROE.

Figure 3: Burgers equation with p = 4 sensors. Time average of the normalized $L_2$ error versus $\mu$. Values of $\mu$ present in $Z_{\mathrm{train}}$ are shown by large circles.

Beginning with p = 4 sensors, Figure 1 reports the mean (lines) and standard deviation (shaded areas) of the normalized $L_2$ error for 3 values of $\mu$ not present in the training dataset $Z_{\mathrm{train}}$ used to construct the ROM and train the RL-ROE. The normalized $L_2$ error is defined as $\|\hat{z}_k - z_k\| / \|z_k\|$, where $z_k$ is the high-dimensional reference solution and $\hat{z}_k = U \hat{x}_k$ is the high-dimensional reconstruction of the RL-ROE or KF-ROE estimate $\hat{x}_k$. The error of the RL-ROE is very close to the lower bound, which is the error incurred by projecting the reference solution $z_k$ onto the modes $U$, i.e., the lowest possible error achievable by an ROE based on $U$. Spatio-temporal contours of the reference solutions for the same 3 values of $\mu$ and the corresponding RL-ROE and KF-ROE high-dimensional estimates are shown in Figure 2.

Figure 4: Burgers equation. Average over time and over µ of the normalized L 2 error versus number p of sensors.

Figure 5: Navier-Stokes equations with p = 3 sensors. Normalized L 2 error of the RL-ROE and KF-ROE for the estimation of trajectories corresponding to values of Re not seen during training.

Figure 6: Navier-Stokes equations with p = 3 sensors. Velocity magnitude at t = 50 of the reference trajectories for values of Re not seen during training and corresponding RL-ROE and KF-ROE estimates. The black crosses in the contours of the reference solutions indicate the sensor locations.

Figure 8: Navier-Stokes equations. Average over time and over Re of the normalized L 2 error versus number p of sensors.

Figure 7 reports the time average of the normalized L 2 error as a function of µ for p = 3, with the values present in Z train indicated by large circles. The RL-ROE exhibits robust performance across the entire range of Reynolds numbers, including in the vicinity of the bifurcation at Re c ∼ 40 and for values of Re not seen during training.

Figure 9: Learning curves for the stochastic policy of the RL-ROE for the (a) Burgers and (b) Navier-Stokes cases. Each plot corresponds to a different number p of sensor measurements. The line and shaded area show the mean and standard deviation of the results over 10 runs, each one smoothed with a moving average of size 100 episodes.

Figure 15 shows the distribution of the norm of the Jacobian of the trained policies obtained in the Burgers and Navier-Stokes examples for various values of p. The distributions are obtained by calculating the Jacobian along a solution trajectory of the RL-ROE corresponding to the results shown in Sections 4.1 and 4.2. The wide spread of the distributions demonstrates that the mean function $\mu_\theta$ trained by the RL process is highly nonlinear, in contrast to the linear correction term (4) in the KF-ROE.

Figure 10: Burgers equation with (a) p = 2 and (b) p = 12 sensors. Reference trajectories for values of µ not seen during training and corresponding RL-ROE and KF-ROE estimates. The dashed lines on the reference trajectory plots indicate the sensor data seen by the RL-ROE and KF-ROE.

Figure 14: Navier-Stokes equations with p = 3 sensors and observation noise of std σ = 0.1. Velocity magnitude at t = 50 of the reference trajectories for values of Re not seen during training and corresponding RL-ROE and KF-ROE estimates. The black crosses in the contours of the reference solutions indicate the sensor locations.

Figure 15: Nonlinearity of the trained policies obtained in the (a) Burgers and (b) Navier-Stokes examples for various p. Distribution of the Frobenius norm of the Jacobian of the trained mean policy µ θ , sampled along solution trajectories of the RL-ROE.

1. Construction of the ROM (offline). A ROM of the form (2) is obtained by applying the DMD to the training dataset $Z_{\mathrm{train}}$.
2. Training of the RL-ROE (offline). An RL-ROE of the form (3) is designed based upon the ROM constructed in Step 1, and its policy $\pi_\theta$ is trained using the reference trajectories contained in $Z_{\mathrm{train}}$.
3. Deployment of the RL-ROE (online). Using measurements $\{y_1, \ldots, y_k\}$ from a reference trajectory of (1) corresponding to unknown $\mu$, the trained RL-ROE returns an estimate $\hat{z}_k = U \hat{x}_k$ of the (unobserved) high-dimensional state $z_k$.

Letting $s_k = (\hat{x}_{k-1}, z_k, \mu) \in \mathbb{R}^{r+n+1}$ denote an augmented state at time $k$, we can define an MDP consisting of the tuple $(S, A, P, R)$, where $S = \mathbb{R}^{r+n+1}$ is the augmented state space, $A \subset \mathbb{R}^r$ is the action space, $P(\cdot | s_k, a_k)$ is a transition probability, and $R(s_k, a_k, s_{k+1})$ is a reward function.

J DISTRIBUTION OF PROCESS NOISE

We evaluate empirically the distribution of the process noise $w_k$ in the ROM dynamics (2a). First, note that the vector $w_k$ can be evaluated as $w_k = x_k - A_r x_{k-1}$, where $x_{k-1} = U^T z_{k-1}$ and $x_k = U^T z_k$ are the reduced-order projections of two consecutive states $z_{k-1}$ and $z_k$ solving the high-dimensional dynamics (1). For a given ROM, we can then sample the process noise by evaluating $w_k$ along trajectories of the high-dimensional dynamics (1). We show in Figure 16 the distributions of the components of $w_k$ for the Burgers and Navier-Stokes examples, using the ROMs constructed in Sections 4.1 and 4.2. The sampling is done along trajectories of (1) corresponding to different parameter values (the parameter being $\mu$ for Burgers and Re for Navier-Stokes), and the corresponding distributions of process noise are shown separately for each parameter value. (Note that the same ROM is shared across all parameter values.) The distributions reveal that the process noise is non-Gaussian but approximately zero-mean.
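The sampling procedure can be sketched as follows (the function name and interface are ours):

```python
import numpy as np

def sample_process_noise(A_r, U, z_traj):
    """Sample the ROM process noise w_k = x_k - A_r x_{k-1} along a
    high-dimensional trajectory {z_0, ..., z_m}, with x_k = U^T z_k.
    Returns an array of shape (m, r), one row per time step."""
    X = np.array([U.T @ z for z in z_traj])   # reduced-order projections
    return X[1:] - X[:-1] @ A_r.T             # w_k for k = 1, ..., m
```

Histogramming the columns of the returned array, per parameter value, yields the distributions shown in Figure 16.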

