LEARNING REDUCED FLUID DYNAMICS

Abstract

Predicting the state evolution of ultra high-dimensional, time-reversible fluid dynamic systems is a crucial but computationally expensive task. Model reduction has proven to be an effective way to reduce the computational cost by learning a low-dimensional state embedding. However, existing reduced models ignore either the time-reversible property or the nonlinear dynamics, leading to sub-optimal performance. We propose a model-based approach to identify locally optimal, model-reduced, time-reversible, nonlinear fluid dynamic systems. Our main idea is to use stochastic Riemannian optimization to obtain a high-quality reduced fluid model by minimizing the expected trajectory-wise model reduction error over a given distribution of initial conditions. To this end, our method formulates the reduced fluid dynamics as an invertible state transfer function parameterized by the reduced subspace. We further show that the reduced trajectories are differentiable with respect to the subspace bases over the entire Grassmannian manifold, under proper choices of timestep sizes and numerical integrators. Finally, we propose a loss function measuring the trajectory-wise discrepancy between the original and reduced models. Using tensor precomputation, we show that the gradient of this loss function can be evaluated efficiently over a long trajectory without time-integrating the high-dimensional dynamic system. Through evaluations on a series of simulation benchmarks, we show that our method reduces the discrepancy by 50%-90% over conventional reduced models.

1. INTRODUCTION

High-dimensional Partial Differential Equations (PDE), especially fluid dynamic systems, find vast applications in scientific computation Moin & Mahesh (1998); Alfonsi (2009), PDE-constrained optimization Biegler et al. (2003); Herzog & Kunisch (2010), design prototyping Baysal & Eleshaky (1992); Zang & Green (1999), fluidic device design Du et al. (2020); Li et al. (2022), and digital entertainment Bridson & Batty (2010); Bridson (2015), to name a few. A fundamental task in all these applications is the efficient prediction of numerical solutions over a long horizon. In design prototyping, for example, a designer needs to quickly preview the fluid flow surrounding an aerial vehicle in order to refine its form factor. In a game engine, a fluid simulator needs to achieve real-time performance to provide interactive special effects for players. Although abundant numerical tools Petrila & Trif (2004); Demkowicz et al. (1989) have been developed over the past decades with improved efficiency, their algorithmic complexity still challenges the limits of current computational resources. As a parallel effort, the idealized, incompressible, inviscid Eulerian fluid should be time reversible and energy preserving Duponcheel et al. (2008), and dedicated numerical schemes have been proposed to faithfully preserve these properties in a discrete setting Rowley & Marsden (2002); Pavlov et al. (2011). This implies that the initial condition of a trajectory can be recovered from any state thereafter and that the discrete total energy is constant throughout the predicted trajectory. Although idealized fluid models are not pursued in applications, their accurate prediction is an important criterion for reliable numerical schemes.

Since its proposal Berkooz et al. (1993); Rowley (2005), model reduction has been quickly established as one of the most effective approaches for significantly reducing the PDE prediction cost. By restricting the state variables to low-dimensional linear or nonlinear sub-manifolds, the dimension of the associated dynamic system can be reduced by orders of magnitude. Over the years, several data-driven and data-free approaches have been proposed to identify sub-manifolds that can capture the complex dynamic behaviors of fluids. The earliest data-driven method, Proper Orthogonal Decomposition (POD) Berkooz et al. (1993), finds the optimal linear subspace that best explains the variation of the state distribution. However, POD is flawed in that it ignores the temporal dependence of state variables. This problem is remedied by Dynamic Mode Decomposition (DMD) Schmid (2010), which finds the optimal linear subspace that best approximates the Koopman operator. However, these data-driven algorithms ignore the nonlinearity in the underlying PDE. Comparatively, data-free methods, such as balanced POD Rowley (2005), $H_2$-optimization Gugercin et al. (2006), and modal analysis Taira et al. (2017), identify bases corresponding to the intrinsic properties of the PDE by analyzing the system transfer matrices in the frequency domain, and are thus independent of data. Unfortunately, these techniques are largely limited to linear systems, and their extensions to nonlinear fluid dynamics, such as Yang et al. (2019), are in their infancy.

More generally, the construction of reduced fluid models has been formulated as a machine learning problem of system identification. The vast majority of prior works generalize the non-intrusive approach and identify the state transfer function via supervised learning in an existing sub-manifold, where the transfer functions are parameterized via radial basis functions Zhang et al. (2016), feedforward networks Hsieh & Tang (1998), recurrent networks Pearlmutter (1989); Wang et al. (2018), etc. More recent approaches jointly learn the state transfer function and identify the sub-manifold via convolutional autoencoders Wu et al. (2021); Hasegawa et al. (2020). Unfortunately, none of these non-intrusive learning techniques can preserve the time-reversible property of idealized fluid, potentially leading to large prediction errors or requiring a large dataset. ...

We propose a machine learning approach to identify locally optimal, time-reversible, reduced-order fluid dynamic models. We first interpret the linear subspace of fluid velocities as a point on the Grassmannian manifold and study the dependence of reduced trajectories on the choice of subspace. Thanks to time reversibility, we show that the map from the subspace bases to reduced trajectories is globally differentiable, which allows us to optimize the reduced model via gradient-based Riemannian optimization. We further propose a trajectory-wise discrepancy loss that penalizes the difference between the full-order and the reduced trajectories. To make the optimization tractable, we propose a tensor precomputation scheme to accelerate the back-propagation of gradient information. Figure 1 illustrates the high-level pipeline of our method, which fine-tunes the reduced fluid model to minimize the expected trajectory-wise discrepancy loss over the distribution of initial conditions. In essence, our method extends prior optimal reduced basis construction algorithms Berkooz et al. (1993); Schmid (2010) to the nonlinear, idealized fluid dynamic model. As an intrusive approach, our method preserves the desirable property of time reversibility. When compared with a POD-type reduced model baseline on a series of idealized fluid simulation benchmarks, our method lowers the discrepancy by 50%-90%.

[Figure 1: pipeline of our method. Diagram labels: $\bar U$, $I$, $\bar v^+(\bar v^0, \bar U)$, $\bar v^+(\bar v^1, \bar U)$, $\bar v^+(\bar v^T, \bar U)$, $L_{dyn}$.]

2. RELATED WORK

We review related works on machine learning for solving ODE and PDE, reduced physics models beyond fluid dynamics, and finally learning under hard constraints.

Learning for Solving ODE and PDE: To study the complex behavior of dynamic systems, various mathematical models have been proposed for idealized models of fluid, solid, elasto-magnetic fields, etc. However, there are oftentimes subtle discrepancies between these models and real-world observations that are hard to model, in which case machine learning stands out as an effective approach for acquiring these behaviors from ground-truth data. Chen et al. (2018) propose to learn such dynamics as a general Ordinary Differential Equation (ODE) with the time derivative of the state predicted by a neural network. Although this method is applicable to general dynamic systems, it does not reflect the spatial and temporal structures of certain systems, which limits its accuracy, data efficiency, and scalability to high-dimensional systems such as fluids. Several follow-up works improve the network architecture to reflect additional structures. For example, the inter-dependency among spatial variables is oftentimes local and sparse, which can be modeled via neighborhood message passing Battaglia et al. (2016); Li et al. (2019). Hamiltonian dynamics are time reversible and energy preserving, which is modeled by learning the Hamiltonian operator in "canonical" coordinates Greydanus et al. (2019), generalized coordinates Cranmer et al. (2020), or ambient space with additional constraints Finzi et al. (2020). However, the above techniques use Lagrangian coordinates, while fluid mechanics is oftentimes modeled on an Eulerian grid, see e.g. Takahashi et al. (2021), which is a major point of difference from our method. Parallel efforts have been made to learn Eulerian fluid mechanics Um et al. (2020); Takahashi et al. (2021); Holl et al. (2020); Prantl et al. (2019); Kim et al. (2019). Some of these works Um et al. (2020); Takahashi et al. (2021); Holl et al. (2020) learn to control fluids via differentiable simulators, but the dynamic systems themselves are not learned. Other works Prantl et al. (2019); Kim et al. (2019) learn to predict short future trajectories of free-surface flows. The major difference from these techniques is that our goal is to predict arbitrarily long trajectories by utilizing the time-reversible structure of the dynamic system to guarantee stability. On the downside, however, our method cannot predict free-surface flows.

Learning Reduced Physical Models: Model reduction is a special kind of dimension reduction technique dealing with time series datasets, and we refer readers to Rowley & Dawson (2017) for a review of its application in fluid mechanics. Other than fluid, reduced models have been adopted in predicting the behaviors of solids Sampaio & Soize (2007), electromagnetic fields Ralph-Uwe et al. (2008), quantum and molecular mechanics Mohan & Fredrickson (2020), neuron propagations Amsallem & Nordstrom (2016), etc. A successful reduced model involves two steps: 1) embed the data into a proper subspace that explains the data variations well; 2) project the dynamic system into the subspace. Conventional techniques for model reduction are restricted to linear dynamic systems, for which the optimal linear subspace can be identified via POD or DMD Berkooz et al. (1993); Rowley (2005) and the projected dynamic system can be precomputed via Galerkin projection. More general machine learning techniques have been proposed as extensions to nonlinear dynamics. For example, convolutional autoencoders have been used to identify nonlinear subspaces Wu et al. (2021); Hasegawa et al. (2020). The ROM-net Daniel et al. (2020) learns to select a suitable subspace from a dictionary. Li et al. (2017) propose to represent the linear subspace bases as the output of a universal neural network. In order to efficiently project the nonlinear dynamic system into the subspace, the Discrete Empirical Interpolation Method (DEIM) Chaturantabut & Sorensen (2010) has been proposed to select a sparse set of interpolation points. The interpolation points are then contracted with the subspace bases in an intrusive manner. Non-intrusive approaches use universal neural networks to learn the entire nonlinear transfer function Wu et al. (2021); Hasegawa et al. (2020); Lee et al. (2021) or part of the nonlinear terms Maulik et al. (2019). It has been noticed in Amsallem & Nordstrom (2016); Liu et al. (2015) that time reversibility and energy preservation can be retained by using an intrusive approach, which is a main reason behind our technical choice.

Learning Under Hard Constraints: Our work deals with idealized fluid satisfying two hard constraints: 1) incompressibility and 2) time reversibility. Since prominent training algorithms Duchi et al. (2011); Kingma & Ba (2014) and neural network architectures are designed for unconstrained optimization, dealing with hard constraints has been a long-standing problem Márquez-Neila et al. (2017). There are two general approaches to inform a learned model of hard constraints: softening and constraint layers. Softening transforms the hard constraints into soft losses and relies on unconstrained optimization. Some hard constraints model invariances, in which case data augmentation can be used to encourage a neural network to give the same output over all invariant transforms of its inputs. In the learning of physical models, softening has been adopted to enforce physical correctness Sirignano & Spiliopoulos (2018); Ober-Blöbaum & Offen (2022), fluid incompressibility Ajuria Illarramendi et al. (2020), and collision-free constraints Tan et al. (2022), and data augmentation has been used to enforce invariance to rigid Morozov et al. (2021) and Galilean transformations Ling et al. (2016). A common problem with all these approaches lies in the unpredictable constraint violation in regions of insufficient data coverage. To impose hard constraints exactly, a series of works Amos & Kolter (2017); Agrawal et al. (2019) propose to formulate the constrained optimization as a differentiable layer in the neural network architecture. In particular, the entire fluid simulator has been formulated as a differentiable layer Schenck & Fox (2018); Takahashi et al. (2021) for model-based control and system identification. The incompressibility constraint has also been formulated as an elliptic PDE solver layer in Mohan et al. (2020). Although these techniques can enforce hard constraints, the cost of forward- and back-propagation through these layers is prohibitive. Even worse, these layers must be evaluated at both training and test time. Our method uses the constraint layer approach to enforce fluid incompressibility and time reversibility, by incorporating the reduced model Liu et al. (2015) as our differentiable layer. However, we encode the constraints into the reduced bases, which are fixed at test time, leading to a low computational cost of trajectory prediction.

3. TIME REVERSIBLE REDUCED FLUID MODEL

We briefly review the underlying geometric structure and associated computational model of idealized, incompressible, inviscid fluid Pavlov et al. (2011). Given a simulation domain $M$, the fluid configuration can be described as a vector field $v \in V(M)$, where $v(x)$ for any $x \in M$ represents the velocity of the fluid at $x$. The governing equation for $v$ is:

$$\dot v + (\nabla \times v) \times v + \nabla\lambda = 0 \quad \text{s.t.} \quad \nabla \cdot v = 0, \tag{1}$$

where $\lambda$ is the pressure field, which is also the Lagrangian multiplier for the divergence-free constraint $\nabla \cdot v = 0$. The above system is closed with appropriate initial and boundary conditions. Pavlov et al. (2011) proposed time-reversible, energy-preserving spatial and temporal discretization schemes for Equation 1. However, directly time integrating the discrete system requires solving large-scale nonlinear systems of equations. The reduced-order model of Liu et al. (2015) scales down the cost by embedding $v$ into a $p$-dimensional subspace with a divergence-free, orthogonal basis $U$, giving $v = Uz$ where $z$ is the coefficient vector. The reduced-order governing equation can be derived via Galerkin projection:

$$\dot z + \int_M U^T \left[ (\nabla \times (Uz)) \times (Uz) \right] = 0, \tag{2}$$

where the second term is the reduced-order advector, which can be succinctly written as a contraction with a third-order tensor $C_{kij}$:

$$\dot z_k + \sum_i \sum_j C_{kij} z_i z_j = 0 \quad \text{s.t.} \quad C_{kij} \triangleq \int_M \langle U_k, (\nabla \times U_i) \times U_j \rangle, \tag{3}$$

where we use $z_k$ (resp. $U_k$) to denote the $k$th element (resp. column). For fast reduced trajectory prediction, the tensor $C_{kij}$ is precomputed, and a small $p$ is used. An essential feature of $C_{kij}$ is antisymmetry: $C_{kij} = -C_{jik}$, which implies that the continuous-time, reduced system is also energy preserving:

$$\frac{d}{dt}\|z\|^2 = -2\sum_{kij} C_{kij} z_k z_i z_j = -\sum_{kij} (C_{kij} + C_{jik}) z_k z_i z_j = 0.$$

Using a variational integrator, e.g. the trapezoidal rule, the energy will also be conserved in a time-discrete computational model. We use a superscript $+$ to denote variables at the next time instance, a superscript $d$ to denote variables at the $d$th timestep, and $\delta t$ to denote the timestep size. The trapezoidal rule relates $z^+$ and $z$ by:

$$z^+(z): \quad \frac{z^+ - z}{\delta t} + C(z^+) = 0 \quad \text{s.t.} \quad C(z^+) \triangleq \sum_{ij} C_{:ij} \frac{z_i^+ + z_i}{2} \frac{z_j^+ + z_j}{2}, \tag{4}$$

from which $z^+$ can be solved via the Newton-Raphson method to satisfy $\|z^+\|^2 = \|z\|^2$, i.e. energy conservation, as well as discrete-time reversibility. We formalize and prove these properties in Appendix A and Appendix B. In particular, Equation 4 defines a unique $z^+$ given $z$ and a sufficiently small $\delta t$, so we define the function $z^+(z)$ by a slight abuse of notation. These features, originally discovered in Pavlov et al. (2011); Liu et al. (2015), achieve an ideal balance between computational efficiency and numerical stability. As pointed out by Pavlov et al. (2011), although real-world flows do not preserve energy exactly, simulating ideal flows is a crucial benchmark for evaluating the stability and fidelity of a simulator. More general non-reversible flows can be modeled by adding constitutive terms. As an example, we could add a viscous term $\mu\nabla(\nabla \cdot v)$ to model energy dissipation, and this term can be projected to the reduced space via Galerkin projection. In Figure 2, we plot the energy dissipation over time under different $\mu$ using both our learned reduced model and the ground-truth full-space model.

The accuracy of a reduced model relies on a proper choice of the basis $U$, which remains a difficult but underappreciated problem.
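To make the role of antisymmetry concrete, the following minimal sketch integrates a reduced system with the trapezoidal rule and checks energy conservation and reversibility numerically. It is only an illustration under stated assumptions: the tensor is a random antisymmetric stand-in rather than one computed from a fluid basis, and the helper names (`random_antisymmetric_tensor`, `advect`, `trapezoidal_step`) are hypothetical.

```python
import numpy as np

def random_antisymmetric_tensor(p, seed=0):
    """Random stand-in tensor with C[k, i, j] = -C[j, i, k] (not a real fluid tensor)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((p, p, p))
    return 0.5 * (A - A.transpose(2, 1, 0))

def advect(C, z_mid):
    """Reduced advection term: sum_ij C[k, i, j] * z_mid[i] * z_mid[j]."""
    return np.einsum("kij,i,j->k", C, z_mid, z_mid)

def trapezoidal_step(C, z, dt, tol=1e-13, max_iter=50):
    """Solve (z_next - z)/dt + C((z_next + z)/2) = 0 for z_next by Newton-Raphson."""
    z_next = z.copy()
    for _ in range(max_iter):
        z_mid = 0.5 * (z_next + z)
        residual = z_next - z + dt * advect(C, z_mid)
        if np.linalg.norm(residual) < tol:
            break
        # Jacobian of advect w.r.t. z_next (the factor 0.5 comes from the midpoint).
        jac_adv = 0.5 * (np.einsum("kij,j->ki", C, z_mid)
                         + np.einsum("kji,j->ki", C, z_mid))
        z_next = z_next - np.linalg.solve(np.eye(z.size) + dt * jac_adv, residual)
    return z_next

C = random_antisymmetric_tensor(p=6)
z0 = np.random.default_rng(1).standard_normal(6)
z0 /= np.linalg.norm(z0)

# Energy conservation: the norm of z stays constant along the trajectory.
z = z0.copy()
for _ in range(200):
    z = trapezoidal_step(C, z, dt=0.01)
energy_drift = abs(np.linalg.norm(z) - np.linalg.norm(z0))

# Time reversibility: stepping forward and then with -dt recovers the start.
z_back = trapezoidal_step(C, trapezoidal_step(C, z0, 0.01), -0.01)
```

The drift and the round-trip error are limited only by the Newton tolerance, which is the discrete counterpart of the identity $\sum_{kij}(C_{kij} + C_{jik}) z_k z_i z_j = 0$.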

4. REDUCED MODEL OPTIMIZATION

As illustrated in Figure 1, we propose to identify reduced-order fluid models via gradient-based optimization of $U$ to minimize the trajectory-wise discrepancy between a reduced-order model (Equation 3) and the full-order model (Equation 1). In this section, we first discretize the spatial computational domain (Section 4.1) and lift the reduced transfer function to the full space (Section 4.2), then propose our discrepancy loss function (Section 4.3), and finally discuss our optimization algorithm (Section 4.4).

4.1. SPATIAL DISCRETIZATION

We assume that $M$ is discretized using a tetrahedral mesh or a rectangular grid via Discrete Exterior Calculus (DEC) as in Pavlov et al. (2011). As a result, each vector field has a finite dimension $n \gg p$. We use a bar to denote discrete variables, so $\bar v \in \mathbb{R}^n$. $\bar U$ belongs to the intersection of the Stiefel manifold $St(n, p)$ and the divergence-free basis subspace:

$$D(n, p) = \{\bar U \in \mathbb{R}^{n \times p} \mid \nabla \cdot \bar U = 0\},$$

where $\nabla\cdot \in \mathbb{R}^{(n-m) \times n}$ is the discrete divergence operator and $m \gg p$ is the dimension of the divergence-free velocity subspace. The elements of $D(n, p) \cap St(n, p)$ can also be identified with the elements of $St(m, p)$. Indeed, we can find an orthonormal basis $\bar D \in \mathbb{R}^{n \times m}$ spanning the subspace of divergence-free velocity fields. For each $\bar U$, we can identify some $\bar U_m \in St(m, p)$ such that $\bar U = \bar D \bar U_m$. As illustrated in Figure 8, a point on $St(n, p)$ is a basis of a $p$-dimensional velocity field subspace, while a point on $St(m, p)$ is a basis of a $p$-dimensional divergence-free velocity field subspace. Since we merely use $\bar U$ to project the velocity field into a subspace, we are only interested in the lower-dimensional Grassmannian manifold (the manifold of velocity subspaces irrespective of the particular basis), but we use the Stiefel representation for better memory and computational efficiency. In other words, we treat $\bar U$ as our decision variable. We further write the tensor coefficient $C_{kij}$ as a function $C(\bar U_k, \bar U_i, \bar U_j)$, which is derived by discretizing the continuous definition of $C_{kij}$ in Equation 3 using DEC.
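The identification $\bar U = \bar D \bar U_m$ can be illustrated with a small sketch. The divergence matrix below is a random stand-in, not a DEC operator, and the sizes are arbitrary; the point is only that a basis built this way is both orthonormal and divergence-free.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 40, 25, 4

# Hypothetical stand-in for the discrete divergence operator, shape (n - m) x n.
div = rng.standard_normal((n - m, n))

# Orthonormal basis D of the null space of div, i.e. the divergence-free subspace.
# With full_matrices=True, the last m rows of vt span the null space.
_, _, vt = np.linalg.svd(div, full_matrices=True)
D = vt[n - m:].T                      # shape (n, m)

# A point on St(m, p), obtained by QR factorization of a random matrix.
U_m, _ = np.linalg.qr(rng.standard_normal((m, p)))

# Lift to the ambient space: U lies in D(n, p) and on St(n, p).
U = D @ U_m

print(np.abs(div @ U).max())          # close to zero: divergence-free columns
print(np.abs(U.T @ U - np.eye(p)).max())  # close to zero: orthonormal columns
```

Optimizing over $\bar U_m$ instead of $\bar U$ would keep the divergence-free constraint satisfied by construction, at the price of storing the dense matrix $\bar D$.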

4.2. LIFTING TRANSFER FUNCTION FROM REDUCED-TO FULL-SPACE

In order to optimize the accuracy of the reduced dynamic system, we first need to compare simulated trajectories generated by different bases $\bar U$. However, the coordinate vectors $z$ of different $\bar U$ are not comparable, as they reside in different linear subspaces. To resolve this problem, we propose to lift $z$ to $\bar v = \bar U z$ in the ambient space $\mathbb{R}^n$, so that two vectors can be compared using the induced Euclidean metric. Further, we can smoothly extend the reduced-order simulator function to the ambient space using the projection operators $P = \bar U \bar U^T$ and $P_\perp = I - P$:

$$\bar v^+(\bar v, \bar U) \triangleq \bar U z^+(\bar U^T \bar v) + P_\perp \bar v. \tag{5}$$

In other words, the velocity component orthogonal to the subspace is stationary, and the tangential velocity is governed by the reduced dynamic system. As detailed in Appendix C, the above extension can be written as a function defined on the Grassmannian manifold:

$$\bar v^+(\bar v, P_m): \mathbb{R}^n \times Gr(m, p) \to \mathbb{R}^n,$$

where we denote $P_m = \bar U_m [\bar U_m]^T$. With the smooth extension, we can evaluate the derivatives of $\bar v^+$ with respect to $\bar v$ and the subspace. We can also compare two velocity fields generated by reduced-order simulators using different subspaces. Note that the full-order dynamics (Equation 1) can be identified with $\bar U_m = I_{m \times m}$. The above lifting is not unique, and a useful alternative is to discard the orthogonal component, i.e. setting $P_\perp \bar v^+ = 0$, which is discussed in Appendix C.2. As our major contribution, we show in Appendix C that the above function $\bar v^+$ is a well-defined smooth function on $Gr(m, p)$. We further show that for any differentiable loss function $L(\bar v^+)$, its derivatives with respect to the bases can be efficiently computed under a proper representation of $\bar U$ as a manifold point.
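The lifted transfer function can be sketched in a few lines. This is an illustration under assumptions, not the DEC-based implementation: the full-space advection tensor `B` is a random antisymmetric stand-in, the reduced tensor is obtained by contracting it with the basis, and the function names are hypothetical. The sketch checks that the orthogonal component is stationary and that the result depends only on the subspace, not on the particular basis.

```python
import numpy as np

def reduced_tensor(B, U):
    """C[k,i,j] = sum_abc B[a,b,c] U[a,k] U[b,i] U[c,j] (Galerkin-style contraction)."""
    return np.einsum("abc,ak,bi,cj->kij", B, U, U, U)

def reduced_step(C, z, dt, iters=20):
    """Trapezoidal step of the reduced system, solved by Newton-Raphson."""
    z_next = z.copy()
    for _ in range(iters):
        mid = 0.5 * (z_next + z)
        res = z_next - z + dt * np.einsum("kij,i,j->k", C, mid, mid)
        jac = np.eye(z.size) + 0.5 * dt * (
            np.einsum("kij,j->ki", C, mid) + np.einsum("kji,j->ki", C, mid))
        z_next = z_next - np.linalg.solve(jac, res)
    return z_next

def lifted_step(B, U, v, dt):
    """v+ = U z+(U^T v) + (I - U U^T) v."""
    z = U.T @ v
    v_perp = v - U @ z                      # component orthogonal to the subspace
    return U @ reduced_step(reduced_tensor(B, U), z, dt) + v_perp

rng = np.random.default_rng(0)
n, p, dt = 12, 3, 0.01
A = rng.standard_normal((n, n, n))
B = 0.5 * (A - A.transpose(2, 1, 0))        # stand-in with B[a,b,c] = -B[c,b,a]
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
R, _ = np.linalg.qr(rng.standard_normal((p, p)))   # change of basis inside the subspace
v = rng.standard_normal(n)

v_plus = lifted_step(B, U, v, dt)
v_plus_rotated = lifted_step(B, U @ R, v, dt)      # same subspace, different basis
perp_before = v - U @ (U.T @ v)
perp_after = v_plus - U @ (U.T @ v_plus)
```

Here `v_plus` and `v_plus_rotated` agree up to solver tolerance, which is the property that lets the extension be treated as a function on the Grassmannian.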
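Algorithm 1 Forward-Backward($\bar v^0$, $\bar U$)
1: Precompute tensor $C_{kij} = C(\bar U_k, \bar U_i, \bar U_j)$
2: Precompute tensor $C(\bar D \bar D^T, \bar U_i, \bar U_j)$
3: for $d = 0, \cdots, T-1$ do ▷ forward propagation
4:     $\bar v^{d+1} \leftarrow \bar v^+(\bar v^d, \bar U)$
5: $G \leftarrow 0$ ▷ backward propagation
6: Evaluate $\nabla L \leftarrow \partial[\gamma^T L_{dyn}(\bar v^T)] / \partial \bar v^T$
7: for $d = T-1, \cdots, 1$ do
8:     $G \leftarrow G +$ Equation 12
9:     $\nabla L \leftarrow [\partial \bar v^{d+1} / \partial \bar v^d]^T \nabla L + \partial[\gamma^d L_{dyn}(\bar v^d)] / \partial \bar v^d$
10: Compute $\nabla_{\bar U} L$ via Equation 11 ▷ divergence-free projection
11: Return $\nabla_{\bar U} L$

Algorithm 2 RAMSGRAD($I$, $\bar U$)
Input: $\beta_1$, $\beta_2$, $\alpha$, $\delta t$
1: $m \leftarrow 0$, $\tau \leftarrow 0$, $\nu \leftarrow 0$, $\hat\nu \leftarrow 0$
2: while not converged do
3:     Sample $\bar v^0 \sim I$ ▷ we always use a batch size of 1
4:     $g \leftarrow$ Forward-Backward($\bar v^0$, $\bar U$)
5:     $m \leftarrow \beta_1 \tau + (1 - \beta_1) g$
6:     $\nu \leftarrow \beta_2 \nu + (1 - \beta_2)\|g\|^2$
7:     $\hat\nu \leftarrow \max(\hat\nu, \nu)$
8:     $\bar U \leftarrow$ Retract($\bar U$, $-\alpha m / \sqrt{\hat\nu}$) ▷ by QR factorization
9:     $\tau \leftarrow P_\perp m$ ▷ approximate parallel transport
10: Return $\bar U$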
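The update rule of Algorithm 2 can be sketched on a toy problem. The sketch below is not the fluid optimizer: the gradient oracle is a deterministic stand-in (maximizing $\mathrm{tr}(U^T A U)$ for a fixed symmetric matrix `A`), sampling of initial conditions is omitted, the small constant added to the denominator is an extra safeguard, and the gradient is projected with $P_\perp$ before use, which is an assumption about how the Riemannian gradient is formed.

```python
import numpy as np

def qr_retract(U, step):
    """Retraction by QR factorization; signs are fixed so that diag(R) > 0."""
    Q, R = np.linalg.qr(U + step)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def project_out(U, G):
    """P_perp G = (I - U U^T) G: remove the component inside span(U)."""
    return G - U @ (U.T @ G)

def ramsgrad(grad_fn, U, alpha=0.01, beta1=0.9, beta2=0.999, iters=500):
    tau = np.zeros_like(U)       # transported momentum
    v, v_hat = 0.0, 0.0          # scalar second-moment estimates
    for _ in range(iters):
        g = project_out(U, grad_fn(U))
        m = beta1 * tau + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * np.sum(g * g)
        v_hat = max(v_hat, v)
        U = qr_retract(U, -alpha * m / (np.sqrt(v_hat) + 1e-12))
        tau = project_out(U, m)  # approximate parallel transport to the new point
    return U

# Toy objective (an assumption for illustration): loss(U) = -trace(U^T A U).
rng = np.random.default_rng(0)
n, p = 20, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1.0, 10.0, n)) @ Q.T

def loss(U):
    return -np.trace(U.T @ A @ U)

def grad(U):
    return -2.0 * A @ U

U0, _ = np.linalg.qr(rng.standard_normal((n, p)))
U_opt = ramsgrad(grad, U0)
print(loss(U0), loss(U_opt))
```

The iterate stays on the Stiefel manifold after every retraction, so no separate re-orthogonalization step is needed.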

4.3. REDUCED DISCREPANCY LOSS

The differentiable structure of the reduced fluid model allows us to minimize the discrepancy between the reduced- and full-order models in an efficient model-based manner. Given two velocity fields $\bar v$ and $\bar v^+$, a full-order model should satisfy the governing equation of motion, which inspires the following discrepancy measure:

$$L_{dyn}(\bar v^+, \bar v) \triangleq \left\| \bar D \bar D^T \frac{\bar v^+ - \bar v}{\delta t} + C\left(\bar D \bar D^T, \frac{\bar v^+ + \bar v}{2}, \frac{\bar v^+ + \bar v}{2}\right) \right\|^2.$$

This is similar to the physics correctness loss used in Sirignano & Spiliopoulos (2018); Ober-Blöbaum & Offen (2022), and we absorb the linear divergence-free constraint by using the projection operator $\bar D \bar D^T$. Evaluating $L_{dyn}$ naively involves a sparse linear solve for each of the $T$ timesteps. However, we can accelerate this computation thanks to the low-rank property of the velocity fields. Since $\bar v$ and $\bar v^+$ both reside in low-rank spaces, we can write:

$$C\left(\bar D \bar D^T, \frac{\bar v^+ + \bar v}{2}, \frac{\bar v^+ + \bar v}{2}\right) = \sum_{ij} C(\bar D \bar D^T, \bar U_i, \bar U_j) \frac{z_i^+ + z_i}{2} \frac{z_j^+ + z_j}{2},$$

and precompute the tensor $C(\bar D \bar D^T, \bar U_i, \bar U_j)$ via $p^2$ sparse linear solves at a cost of $O(n^\omega p^2)$. For a trajectory with $T \gg p^2$ timesteps, this precomputation reduces the cost of evaluating $L_{dyn}$ from $O(n^\omega T)$ to $O(n^\omega p^2 + T n p^2)$.
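The precomputation relies only on the bilinearity of the advection term, which the following sketch demonstrates. The dense tensor `B` is a hypothetical stand-in for the map $(\bar v, \bar w) \mapsto C(\bar D \bar D^T, \bar v, \bar w)$, which in the actual method involves sparse linear solves; the basis is assumed divergence-free so that $\bar D \bar D^T \bar U = \bar U$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4

# Hypothetical dense stand-in for the bilinear map (v, w) -> C(D D^T, v, w) in R^n.
B = rng.standard_normal((n, n, n))

def full_advection(v, w):
    """Expensive full-space evaluation (one 'solve' per call in the real method)."""
    return np.einsum("abc,b,c->a", B, v, w)

U, _ = np.linalg.qr(rng.standard_normal((n, p)))

# Precompute once: p^2 full-space evaluations on pairs of basis vectors.
# T[:, i, j] stores the response to (U_i, U_j); shape (n, p, p).
T = np.stack(
    [np.stack([full_advection(U[:, i], U[:, j]) for j in range(p)], axis=1)
     for i in range(p)],
    axis=1,
)

def reduced_advection(z_mid):
    """Per-timestep evaluation in O(n p^2), with no full-space operator call."""
    return np.einsum("aij,i,j->a", T, z_mid, z_mid)

def l_dyn(z_next, z, dt):
    """Discrepancy loss for velocities U z_next and U z (assumes D D^T U = U)."""
    residual = U @ (z_next - z) / dt + reduced_advection(0.5 * (z_next + z))
    return float(residual @ residual)

z, z_next = rng.standard_normal(p), rng.standard_normal(p)
z_mid = 0.5 * (z + z_next)
direct = full_advection(U @ z_mid, U @ z_mid)
fast = reduced_advection(z_mid)
print(np.abs(direct - fast).max())
```

After the one-time precomputation, each timestep touches only the $n \times p \times p$ tensor, which is where the $T n p^2$ term in the cost estimate comes from.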

4.4. STOCHASTIC RIEMANNIAN OPTIMIZATION

Using a low-dimensional subspace, it is impossible to approximate all fluid simulation trajectories with sufficient accuracy. Instead, reduced models are designed to optimize a subset of trajectories with a given distribution $I$ of initial conditions, i.e. $\bar v^0 \sim I$, and our goal is to solve the following problem via stochastic Riemannian optimization:

$$\underset{\bar U \in D(n,p) \cap St(n,p)}{\mathrm{argmin}} \; \mathbb{E}_{\bar v^0 \sim I}\left[\sum_{d=1}^{T} \gamma^d L_{dyn}(\bar v^d, \bar v^{d-1})\right],$$

where $T$ is the horizon of the trajectory and $\gamma \in (0, 1]$ is a constant discount factor. Riemannian optimization is a well-studied problem in both deterministic and stochastic settings, and we use the RAMSGRAD algorithm proposed in Becigneul & Ganea (2019). This algorithm requires both the retraction and parallel transport operators on $St(n, p)$. We use QR factorization for the retraction operator Bendokat et al. (2020). Unfortunately, there is no efficient way to compute the parallel transport operator Edelman et al. (1998), so we approximate the transport operator by projecting out the non-tangential component. This corresponds to using a single step of the forward Euler integrator to solve the associated ODE of the transport operator. Again due to time reversibility, the objective function is globally differentiable with respect to $\bar U$ under compact $I$ and sufficiently small $\delta t$. We outline our forward-backward gradient propagation in Algorithm 1 and the adapted RAMSGRAD in Algorithm 2. These algorithms are well-defined due to the following lemma:

Lemma 4.1. For any compact initial distribution $I$, there exists a sufficiently small $\delta t$ such that the objective function $\sum_{d=0}^{T} \gamma^d L_{dyn}(\bar v^d)$ is globally differentiable, i.e. differentiable for any $z^0 \in I$ and $\bar U \in D(n, p) \cap St(n, p)$.

Proof. Since $I$ is compact, $\|\bar v^0\|$ is uniformly bounded above by some $r$ and $\|z^0\| = \|\bar U^T \bar v^0\| \leq r$. By Corollary A.5, there exists a sufficiently small $\delta t$ making any $z^d$ a differentiable, reversible function of $z^0$. This also implies that $\bar v^d$ is a differentiable, reversible function of $\bar v^0$ under the definition of Equation 5, and our result follows.

We implement our method using PyTorch with a fluid simulator implemented in native C++ with CPU parallelization, and perform all the computations on an AMD Threadripper 3970X CPU with 32 cores. We initialize our method using a conventional POD-type algorithm. Given $I$, we first sample a set of $N$ trajectories using the full-order dynamics (Equation 1) and then perform a POD-type basis extraction. The number of extracted bases is determined by truncating the eigenvalues below a fraction $\epsilon$ of the largest eigenvalue. We always use a batch size of 1. The performance of our method is summarized in Table 1. We consider two variants of our method: the coupled case, where $C_{kij}$ is treated as a function $C(U_k, U_i, U_j)$ as discussed in Section 4, and the decoupled case, where $C_{kij}$ is treated as an antisymmetric independent decision variable. Our main experiments are performed with the coupled case. Experiments with the decoupled case and a summary of decision variables are included in Appendix D. The cost of trajectory prediction using a reduced-order model depends on $p$, as illustrated in Figure 3, so the POD baseline and our method have the same runtime performance, while the cost of evaluating the full-order model is 252ms (26× slower than the reduced-order model with $p = 49$).

Under review as a conference paper at ICLR 2023

Our first benchmark is Taylor vortices Pavlov et al. (2011), where two vortices are separated by a distance slightly larger than the critical threshold. We use a velocity field discretized on a 64 × 64 rectangular grid with periodic boundary conditions, leading to $n = 8192$. This is a single trajectory ($I$ is deterministic) and we set $T = 500$, $\delta t = 0.01$. We experiment with four parameters $\epsilon$ = 0.05, 0.01, 0.001, and 0.0001, and the number of bases is $p$ = 8, 11, 16, and 25, correspondingly. With each $\bar U$ as the initial guess, we run our optimizer for 24 hours. In Figure 4bc, we plot the trajectory-wise discrepancy loss against the number of bases $p$ and the convergence history of our method.
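The POD-type initialization with relative eigenvalue truncation can be sketched as follows. This is a minimal version under assumptions: the snapshots are stacked column-wise without mean subtraction, the eigenvalues are taken to be the squared singular values of the snapshot matrix, and the function name `pod_basis` is hypothetical. The demo uses synthetic snapshots with prescribed singular values rather than fluid data.

```python
import numpy as np

def pod_basis(snapshots, eps):
    """Keep the POD modes whose eigenvalue is at least eps times the largest one.

    snapshots: (n, number_of_snapshots) matrix of stacked velocity fields.
    """
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    eigvals = s ** 2                       # eigenvalues of snapshots @ snapshots.T
    keep = eigvals >= eps * eigvals[0]
    return U[:, keep], eigvals

# Synthetic snapshots with known singular values 10, 3, 1, 0.1, 0.01.
rng = np.random.default_rng(0)
left, _ = np.linalg.qr(rng.standard_normal((50, 5)))
right, _ = np.linalg.qr(rng.standard_normal((40, 5)))
snapshots = left @ np.diag([10.0, 3.0, 1.0, 0.1, 0.01]) @ right.T

U_coarse, _ = pod_basis(snapshots, eps=0.05)    # eigenvalues 100 and 9 survive
U_fine, _ = pod_basis(snapshots, eps=0.001)     # eigenvalue 1 survives as well
print(U_coarse.shape[1], U_fine.shape[1])
```

Lowering $\epsilon$ admits more modes, which matches the trend of larger $p$ for smaller $\epsilon$ reported for the benchmarks.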

5. EVALUATION

Compared with POD bases, our method reduces the discrepancy loss by 87.93%, 90.12%, 91.47%, and 90.16%, respectively. Snapshots are shown in Figure 4a, where our method predicts a velocity field closer to the full-order ground truth.

Our second benchmark involves a smoke plume rising at a constant speed. We use a rectangular domain of $[0, 1]^2$ with Dirichlet boundary conditions on all sides. The region $[0.25, 0.75] \times [0.125, 0.375]$ is occupied by the smoke with a constant velocity $(0, 1)$, the remaining regions have zero velocity, and we use $T = 1000$. All other settings are the same as in our first benchmark. The discrepancy loss and convergence history are plotted in Figure 13bc. We experiment with four parameters $\epsilon$ = 0.05, 0.01, 0.001, and 0.0001; the corresponding numbers of bases $p$ are 6, 9, 15, and 26, respectively. Our method reduces the discrepancy loss by 88.82%, 80.94%, 87.60%, and 75.79%, respectively. We have also tested a variant of this benchmark with an obstacle in the simulation domain, where our method reduces the discrepancy loss by 92.13%, 81.49%, 85.38%, and 81.70%, respectively. Snapshots of our second benchmark are shown in Figure 13a and Figure 14 of Appendix H.

In our first benchmark, Taylor vortices Pavlov et al. (2011), we further analyze the sensitivity of our method with respect to the initial guess $\bar U$. To this end, we first compute $\bar U$ via POD and then corrupt $\bar U$ using a random noise basis $\tilde U$ with each element sampled from a normal distribution with $\mu = 0$, $\sigma = 1$, truncated to the range $[-1, 1]$. We then use the following initial guess: Retract($\bar U$, $\bar D \bar D^T \tilde U \Sigma$), where $\Sigma$ is a diagonal scaling matrix such that each column of $\tilde U \Sigma$ has $l_2$-norm equal to some $\varepsilon$, and $\varepsilon$ controls the magnitude of the random noise. Here multiplying by $\bar D \bar D^T$ ensures that our noise is divergence-free. In Figure 5, we profile the convergence history with $\varepsilon$ = 0.01, 0.05, 0.25, 0.5. Although the noise can drastically change the initial discrepancy loss, all four instances can reduce the loss to similar levels after sufficiently many iterations. Our analysis also implies that the POD baseline provides a good initial guess of $\bar U$, because a fully noisy initialization of $\bar U$ can lead to a worse result.

[Figure 5 (caption, partially recovered): ... and different noise levels $\varepsilon$ = 0.01, 0.05, 0.25, 0.5. We first run the four training instances for 8000 iterations, which already brings the ultimate discrepancy loss down to similarly low levels. We then give the $\varepsilon = 0.5$ instance another 4500 iterations (purple after red curve) and it could outperform the $\varepsilon = 0.25$ instance. Finally, we tried using a fully noisy initialization of $\bar U$ and the result is much worse than other instances.]

Our third benchmark involves a spherical smoke plume, with initial diameter 1/3 and speed 1.0, located in the center of a $[0, 1]^2$ domain and moving in varying directions. We assume the direction of motion is parameterized by the angle $\theta \in [0, 2\pi]$ sampled from the initial distribution $I = U([0, 2\pi])$. We use a velocity field discretized on a 64 × 64 rectangular grid with Dirichlet boundary conditions ($n = 8320$). Our training dataset for the POD baseline contains $N = 8$ trajectories with evenly sampled $\theta = 0^\circ, 45^\circ, 90^\circ, \cdots$. With $T = 500$, $\delta t = 0.01$, $\epsilon = 0.01$, $p = 36$, we run our method for 12200 iterations, taking 72 hours to converge. We then test our method on another 24 evenly sampled $\theta = 7.5^\circ, 22.5^\circ, 37.5^\circ, \cdots$, which are not covered by the training dataset (some snapshots can be found in Figure 15 of Appendix H). As plotted in Figure 7a, our method reduces the discrepancy by 54.65% on average.

Our fourth benchmark extends the third one by involving two smoke plumes, located at $(0.5, 0.25)$ and $(0.5, 0.75)$. The directions of motion $\theta_1, \theta_2 \in [0, \pi]$ are sampled from the initial distribution $I = U([0, \pi]^2)$ and we set $\epsilon = 0.01$, $p = 59$. Our training dataset for the POD baseline contains $N = 25$ trajectories with 5 evenly sampled $\theta_{1,2} = 0^\circ, 72^\circ, 144^\circ, 216^\circ, 288^\circ$. Other parameters are the same as those of our third benchmark. We run our method for 18000 iterations, taking 72 hours to converge (some snapshots can be found in Figure 16 of Appendix H). Afterwards, we test our method on another 25 evenly sampled $\theta_{1,2} = 36^\circ, 108^\circ, 180^\circ, 252^\circ, 324^\circ$ that are not covered by the training dataset. As plotted in Figure 7bc, our method reduces the discrepancy by 59.28% on average.

[Figure 7 (caption, partially recovered): ... and final (c) trajectory-wise discrepancy with respect to $\theta_1, \theta_2$ for our fourth benchmark.]
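The noisy initialization used in the sensitivity study on the first benchmark can be sketched as below. Several pieces are assumptions made for illustration: the divergence operator is a random stand-in, the "POD basis" is a random divergence-free orthonormal matrix, the truncated normal is drawn by rejection sampling, and the retraction is read as the QR factorization of the POD basis plus the projected, rescaled noise.

```python
import numpy as np

def truncated_normal(rng, shape, low=-1.0, high=1.0):
    """Standard normal samples restricted to [low, high] by rejection sampling."""
    out = rng.standard_normal(shape)
    bad = (out < low) | (out > high)
    while bad.any():
        out[bad] = rng.standard_normal(bad.sum())
        bad = (out < low) | (out > high)
    return out

def noisy_initial_guess(U, D, eps, rng):
    """QR retraction of U along the divergence-free part of column-scaled noise."""
    noise = truncated_normal(rng, U.shape)
    noise *= eps / np.linalg.norm(noise, axis=0)   # each column gets l2-norm eps
    Q, R = np.linalg.qr(U + D @ (D.T @ noise))     # D D^T removes divergent noise
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(0)
n, m, p = 40, 25, 4
div = rng.standard_normal((n - m, n))              # hypothetical discrete divergence
D = np.linalg.svd(div)[2][n - m:].T                # orthonormal divergence-free basis
U_pod = D @ np.linalg.qr(rng.standard_normal((m, p)))[0]   # stand-in for the POD basis

U_init = noisy_initial_guess(U_pod, D, eps=0.25, rng=rng)
print(np.abs(div @ U_init).max())
```

Because both the basis and the projected noise lie in the span of $\bar D$, the retracted point remains divergence-free and orthonormal for any $\varepsilon$.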

6. CONCLUSION

We propose a model-based approach to fine-tune reduced fluid dynamic systems. Our main idea is to rely on the differentiable structure relating the state transfer function to the linear subspace bases in order to minimize the expected trajectory-wise discrepancy loss over a distribution of initial conditions. Through evaluations on several simulation benchmarks, we show that our method outperforms the POD baseline. As a major limitation, our trajectory prediction has sequential dependence and cannot exploit GPU parallelization. Even with our tensor precomputation technique, training still takes hours on a desktop machine, which is orders of magnitude slower than the simple POD or DMD methods. In addition, our method uses a linear subspace with limited expressivity compared with the universal neural networks Wu et al. (2021); Hasegawa et al. (2020); Lee et al. (2021) used by non-intrusive model reduction techniques. We speculate that using neural networks to represent the reduced bases $\bar{U}$ is possible, as done in Li et al. (2017), although the orthogonality and divergence-free constraints will be more difficult to enforce. Enforcing these constraints exactly as in Mohan et al. (2020) would compromise the efficacy of reduced time integration.

A DISCRETE ENERGY PRESERVATION

We prove that energy preservation and time reversibility hold in a time-discrete setting.

Lemma A.1. The tensor $C_{kij}$ is antisymmetric.

Proof. This follows from the definition of $C_{kij}$:

$$C_{kij} = \int_M \langle U_k, \nabla \times U_i \times U_j \rangle = \int_M U_k^T (\nabla U_i - \nabla U_i^T) U_j = -\int_M U_j^T (\nabla U_i - \nabla U_i^T) U_k = -\int_M \langle U_j, \nabla \times U_i \times U_k \rangle = -C_{jik},$$

where we used the elementary vector identity $(\nabla \times A) \times B = B \cdot (\nabla A - \nabla A^T)$.

Using the antisymmetry of $C_{kij}$, we can show that the trapezoidal rule is indeed energy preserving.

Lemma A.2. For any $z^+$ satisfying the trapezoidal rule, $\|z^+\| = \|z\|$.

Proof. Multiplying the left-hand side of Equation 4 by $z^+_k + z_k$ and summing over $k$, we have:

$$\frac{\|z^+\|^2 - \|z\|^2}{\delta t} + 2\sum_{kij}\left[C_{kij}\,\frac{z^+_k + z_k}{2}\,\frac{z^+_i + z_i}{2}\,\frac{z^+_j + z_j}{2}\right] = \frac{\|z^+\|^2 - \|z\|^2}{\delta t} + \sum_{kij}\left[(C_{kij} + C_{jik})\,\frac{z^+_k + z_k}{2}\,\frac{z^+_i + z_i}{2}\,\frac{z^+_j + z_j}{2}\right] = \frac{\|z^+\|^2 - \|z\|^2}{\delta t} = 0,$$

from which our result follows.

Next, we show that the trapezoidal integrator (Equation 4) must have a solution for a proper choice of sufficiently small $\delta t$.

Lemma A.3. There exists a sufficiently small $\delta t$ such that Equation 4 can be solved for $z^+$ via the following negative gradient flow:

$$f(z^+) \triangleq z^+ - z + \delta t\, C(z^+), \qquad \dot{z}^+ \triangleq -\nabla f(z^+)^T f(z^+)/2,$$

with initial guess $z^+ = z$.

Proof. We consider the Lyapunov candidate $V(z^+) \triangleq \|f(z^+)\|^2$ on the ball $B_r(z) = \{z^+ \,|\, \|z^+ - z\| \le r\}$. The negative gradient flow satisfies:

$$\dot{V}(z^+) = -\|\nabla f(z^+)^T f(z^+)\|^2 = -\|(I + \delta t \nabla C(z^+)^T) f(z^+)\|^2 = -V(z^+) - 2\delta t\, f(z^+)^T \nabla C(z^+)^T f(z^+) - \delta t^2 \|\nabla C(z^+)^T f(z^+)\|^2 \le -(1 - \delta t) V(z^+) + (\delta t - \delta t^2)\|\nabla C(z^+)^T f(z^+)\|^2.$$

Now since the eigenvalue of a Hermitian matrix is a Lipschitz function of the matrix entries Golub & Van Loan (2013), we must have:

$$\underline{\rho}(\|z\|, r) \le \rho(\nabla C(z^+) \nabla C(z^+)^T) \le \bar{\rho}(\|z\|, r),$$

for some $\underline{\rho}$, $\bar{\rho}$ and any $z^+ \in B_r(z)$. Combining the above estimates, we have:

$$\dot{V}(z^+) \le -(1 - \delta t) V(z^+) + (\delta t - \delta t^2)\,\bar{\rho}(\|z\|, r)\, V(z^+).$$

Therefore, with sufficiently small $\delta t$, we have $\dot{V}(z^+) \le -\epsilon V(z^+)$ for some $\epsilon \in (0, 1)$ and all $z^+ \in B_r(z)$. Next, consider the boundary case $z^+ \in \partial B_r(z)$, where we have:

$$V(z^+) - V(z) = r^2 + \delta t\, C(z^+)^T (z^+ - z) + 2\delta t^2 \left[\|C(z^+)\|^2 - \|C(z)\|^2\right] \ge (1 - \delta t) r^2 + (\delta t^2 - \delta t)\|C(z^+)\|^2 - \delta t^2 \|C(z)\|^2,$$

and we can choose sufficiently small $\delta t$ such that $V(z^+) > V(z)$ for all $z^+ \in \partial B_r(z)$. Our result follows from the exponential stability condition Murray et al. (2017).

In practice, however, a continuous gradient flow cannot be realized, but an argument similar to that of Lemma A.3 can be used to show that the Newton-Raphson method is guaranteed to converge when minimizing $V(z^+)$ under sufficiently small $\delta t$:

Lemma A.4. There exists a sufficiently small $\delta t$ such that Equation 4 can be solved for $z^+$ via the Newton-Raphson method:

$$z^{(d)} = z^{(d-1)} - \nabla f(z^{(d-1)})^{-1} f(z^{(d-1)}),$$

with initial guess $z^{(0)} = z$. Here we use a bracketed superscript to denote the iteration index.

Proof. Considering the reduction of the Lyapunov candidate $V$ after one iteration, we have:

$$V(z^{(d)}) = \|f(z^{(d-1)} - \nabla f(z^{(d-1)})^{-1} f(z^{(d-1)}))\|^2 = \sum_k \left\|\frac{\delta t}{2} f(z^{(d-1)})^T H_k(z^{(d-1)}) f(z^{(d-1)})\right\|^2,$$

$$H_k(z^{(d-1)}) \triangleq \nabla f(z^{(d-1)})^{-T}\, \frac{C_{k::} + C_{k::}^T}{2}\, \nabla f(z^{(d-1)})^{-1}.$$

By an argument similar to that of Lemma A.3, we can choose sufficiently small $\delta t$ such that:

$$\rho(H_k(z^{(d-1)})) \le \bar{\rho}(\|z\|, r), \qquad V(z^{(d)}) \le \frac{p\, \delta t^2\, \bar{\rho}(\|z\|, r)^2}{4} \|f(z^{(d-1)})\|^4,$$

as long as $z^{(d-1)} \in B_r(z)$. We can also choose sufficiently small $\delta t$ such that:

$$V(z^{(d)}) \le \epsilon V(z^{(d-1)}) \quad \forall z^{(d-1)} \in B_r(z) \wedge \|f(z^{(d-1)})\| \le 1,$$

for some $\epsilon \in (0, 1)$. Next, we consider the Hessian of $V(z^+)$:

$$\nabla^2 V(z^+) = \left[I - \delta t \nabla C(z^+)^T\right]\left[I - \delta t \nabla C(z^+)\right] + \delta t \nabla^2 C(z^+) f(z^+) \triangleq I + O(\delta t) R(z^+),$$

where $R(z^+)$ is a smooth, symmetric matrix function. We can further choose sufficiently small $\delta t$ such that $V(z^+)$ is $1/2$-strongly convex and, for any $z^+ \in B_{3r}(z) \setminus B_r(z)$:

$$V(z^+) - V(z) \ge r^2/2 + \nabla V(z)^T (z^+ - z) = r^2/2 + \delta t^2\, C(z)^T \nabla C(z) (z^+ - z).$$

By the smallness of $\delta t$, we have:

$$\begin{cases} V(z^+) > V(z) & \forall z^+ \in B_{3r}(z) \setminus B_r(z) \\ \|\nabla f(z^{(d-1)})^{-1} f(z^{(d-1)})\| \le 2r & \forall z^{(d-1)} \in B_r(z) \wedge V(z^{(d-1)}) \le \min(1, r^2/2). \end{cases} \tag{9}$$

Combining Equation 8 and Equation 9, we have for small enough $\delta t$:

$$\begin{cases} z^{(d)} \in B_r(z) \\ V(z^{(d)}) \le \epsilon V(z^{(d-1)}) \end{cases} \quad \forall z^{(d-1)} \in B_r(z) \wedge V(z^{(d-1)}) \le \min(1, r^2/2).$$

Our result follows by choosing sufficiently small $\delta t$ such that $V(z^{(0)}) \le \min(1, r^2/2)$ and invoking the discrete exponential stability condition Aitken & Schwartz (1994).

Note that the choice of $\delta t$ depends only on $\|z\|$ and $r$, which can be used to show that the timestep size can be fixed throughout the trajectory for time-reversible fluid systems:

Corollary A.5. Given an initial condition $z_0$, an energy preserving discrete trajectory can be computed by repeatedly solving Equation 4 for $z_k$ using a fixed timestep size $\delta t$ via the Newton-Raphson method.

Proof. This result can be derived by induction on two facts: 1) $\|z_k\| = \|z_{k-1}\|$ by Lemma A.2; 2) to solve for $z_k$, $\delta t$ can be determined as a function $\delta t(\|z_{k-1}\|, r)$ by Lemma A.4.
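The construction in this appendix can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a small random tensor that is antisymmetric in its first and last indices (standing in for the $C_{kij}$ of Equation 3), reads Equation 4 as $(z^+ - z)/\delta t + \sum_{ij} C_{kij} m_i m_j = 0$ with midpoint $m = (z^+ + z)/2$, and solves it by Newton-Raphson starting from $z^{(0)} = z$. The helper names `random_antisymmetric_tensor`, `advect` and `trapezoidal_step` are hypothetical.

```python
import numpy as np

def random_antisymmetric_tensor(p, seed=0):
    """Random stand-in tensor with C[k, i, j] = -C[j, i, k] (cf. Lemma A.1)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((p, p, p))
    return 0.5 * (A - A.transpose(2, 1, 0))

def advect(C, m):
    """Quadratic term sum_ij C[k, i, j] * m_i * m_j."""
    return np.einsum("kij,i,j->k", C, m, m)

def trapezoidal_step(C, z, dt, iters=50, tol=1e-12):
    """Solve z+ - z + dt * C(m, m) = 0, m = (z+ + z) / 2, by Newton-Raphson."""
    z_next = z.copy()                       # initial guess z^(0) = z
    I = np.eye(len(z))
    for _ in range(iters):
        m = 0.5 * (z_next + z)
        f = z_next - z + dt * advect(C, m)  # residual f(z+)
        if np.linalg.norm(f) < tol:
            break
        # Jacobian of f w.r.t. z+; the factor 0.5 comes from m = (z+ + z) / 2.
        J = I + 0.5 * dt * (np.einsum("kaj,j->ka", C, m)
                            + np.einsum("kja,j->ka", C, m))
        z_next = z_next - np.linalg.solve(J, f)
    return z_next

if __name__ == "__main__":
    C = random_antisymmetric_tensor(p=6)
    z = np.ones(6) / np.sqrt(6.0)
    norms = [np.linalg.norm(z)]
    for _ in range(100):                    # fixed timestep, as in Corollary A.5
        z = trapezoidal_step(C, z, dt=0.01)
        norms.append(np.linalg.norm(z))
    print("max drift of ||z||:", max(abs(n - norms[0]) for n in norms))
```

Under these assumptions the norm drift should stay at the level of the Newton tolerance, which is the discrete counterpart of Lemma A.2.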

B DISCRETE TIME REVERSIBILITY

The above result guarantees energy preservation throughout the trajectory. We now move on to show time reversibility in the discrete setting:

Lemma B.1. There exists a sufficiently small $\delta t$ such that, for any $z \in B_r(0)$, the negative gradient flow of Equation 4 defines an invertible map from $z$ to $z^+$.

Proof. Following the same argument as in Lemma A.4, we can choose sufficiently small $\delta t$ such that $V(z^+)$ is strongly convex when restricted to $B_{2r}(0)$ and the map $z^+(z) = \operatorname{argmin}_{z^+} V(z^+)$ is well-defined and differentiable Still (2018). The derivative of the function $z^+(z)$ can then be derived via the implicit function theorem as:

$$\nabla z^+(z) = \left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right].$$

By Lemma A.2, we know that $z^+ \in B_r(0)$ as well. Again by the Lipschitz continuity of singular values, we can choose sufficiently small $\delta t$ such that $\det(\nabla z^+(z)) \neq 0$ throughout $B_r(0)$, and our result follows by the inverse function theorem.

Lemma B.1 can also be extended to the entire trajectory via induction:

Corollary B.2. Given an initial condition $z_0 \in B_r(0)$ for some $r$, an energy preserving discrete trajectory can be computed by repeatedly solving Equation 4 for $z_k$ using a fixed timestep size $\delta t$, such that the resulting map $z_k(z_0)$ is invertible.

Proof. By induction on Lemma A.2 and Lemma B.1, we know that $z_k(z_{k-1})$ is invertible for any $k > 0$, and our result follows by composition of invertible functions.
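As a rough numerical illustration of Corollary B.2, the sketch below integrates a small reduced state forward and then backward. It is an assumption-laden toy, not the paper's code: the tensor is a random antisymmetric stand-in, Equation 4 is read as $z^+ - z + \delta t\, C(m, m) = 0$ with $m = (z^+ + z)/2$, and the inverse map is evaluated by applying the same rule with timestep $-\delta t$, which is valid because the trapezoidal rule is symmetric in $z$ and $z^+$. The function name `trapezoidal_step` is hypothetical.

```python
import numpy as np

def trapezoidal_step(C, z, dt, iters=50):
    """Newton-Raphson solve of z+ - z + dt * C(m, m) = 0 with m = (z+ + z) / 2."""
    z_next = z.copy()
    I = np.eye(len(z))
    for _ in range(iters):
        m = 0.5 * (z_next + z)
        f = z_next - z + dt * np.einsum("kij,i,j->k", C, m, m)
        J = I + 0.5 * dt * (np.einsum("kaj,j->ka", C, m)
                            + np.einsum("kja,j->ka", C, m))
        z_next = z_next - np.linalg.solve(J, f)
    return z_next

rng = np.random.default_rng(1)
p, dt, steps = 5, 0.01, 200
A = rng.standard_normal((p, p, p))
C = 0.5 * (A - A.transpose(2, 1, 0))   # antisymmetric in first and last index

z0 = rng.standard_normal(p)
z0 /= np.linalg.norm(z0)

z = z0.copy()
for _ in range(steps):                 # forward in time
    z = trapezoidal_step(C, z, dt)
z_T = z.copy()
for _ in range(steps):                 # backward: the same rule with -dt
    z = trapezoidal_step(C, z, -dt)
z_back = z

print("distance travelled:", np.linalg.norm(z_T - z0))
print("round-trip error:  ", np.linalg.norm(z_back - z0))
```

In this setting the round-trip error should be many orders of magnitude smaller than the distance travelled, consistent with the map $z_k(z_0)$ being invertible.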

C DERIVATIVE FORMULATION

In this section, we analyze the differentiability of our lifted transfer function (Equation 5). To compute derivatives of the forward dynamic function with respect to the bases $\bar{U}$, we need to utilize the implicit function theorem and a special representation of the bases as a manifold point, which cannot be exploited by automatic differentiation. First, we show that the function is well-defined on the manifold $\mathrm{Gr}(m, p)$ via the following lemma:

Lemma C.1. The lifted transfer function (Equation 5) can be written as a function $v^+(v, P^m)$.

Proof. By the incompressibility of the bases, $\bar{U} = D \bar{U}^m$, we have $P = D P^m D^T$. Plugging this into Equation 5, we obtain the following rewrite:

$$v^+(v, P^m): \begin{cases} P^\perp v^+ = P^\perp v \\ P\,\dfrac{v^+ - v}{\delta t} + C\left(P,\; P\,\dfrac{v^+ + v}{2},\; P\,\dfrac{v^+ + v}{2}\right) = 0, \end{cases}$$

from which our result follows. We can recover the original definition (Equation 5) by multiplying the second equation by $\bar{U}^T$ from the left.

Although the function is well-defined, the complexity of its derivative computation relies on an efficient representation of the bases. A straightforward representation is to use the matrix $P^m$ and consider the function $v^+(v, P^m)$. However, this representation requires storing the large matrix $P^m$, which is computationally impractical. In this section, we exploit equivalent manifold representations to derive computationally tractable formulas for the derivatives of arbitrary loss functions $L \circ v^+$. The relevant manifolds are illustrated in Figure 8. We first derive the partial derivative $\partial v^+ / \partial v$ via the implicit function theorem:

$$\frac{\partial v^+}{\partial v} = \left[\bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right]\bar{U}^T + P^\perp\right]. \tag{10}$$

The inverse of the system matrix above is well-defined when the timestep size $\delta t$ is sufficiently small according to Appendix A. It can be verified that the above derivative is invariant to orthogonal transforms of the bases. Next, we derive the partial derivative with respect to $P^m \in \mathrm{Gr}(m, p)$. We denote by $\bar{U}^m_\perp$ the orthogonal complement of $\bar{U}^m$ and let $Q^m = (\bar{U}^m, \bar{U}^m_\perp) \in O(m)$.
An element of $T_{P^m}\mathrm{Gr}(m, p)$ can be identified with a matrix $dB \in \mathbb{R}^{(m-p) \times p}$ via:

$$dP^m = Q^m \begin{pmatrix} 0 & dB^T \\ dB & 0 \end{pmatrix} [Q^m]^T.$$

We can lift $\mathrm{Gr}(m, p)$ to $\mathrm{St}(m, p)$ via the map $\pi_{\mathrm{St}(m,p) \mapsto \mathrm{Gr}(m,p)}(\bar{U}^m) = \bar{U}^m [\bar{U}^m]^T$. Under this map, an element $d\bar{U}^m \in T_{\bar{U}^m}\mathrm{St}(m, p)$ that is the horizontal lift of an element of $T_{P^m}\mathrm{Gr}(m, p)$ must satisfy the condition $d\bar{U}^m = \bar{U}^m_\perp dB$ (we refer readers to Bendokat et al. (2020) for the derivation). Representing the gradient as some $dB$ is the most memory efficient method, since the dimension of $\mathrm{Gr}(m, p)$ equals that of $dB$. However, we have to multiply $dB$ with $\bar{U}^m_\perp$ and then with $D$ to recover divergence-free velocity bases, while computing either $\bar{U}^m_\perp$ or $D$ is intractable. Instead, we choose to work with $d\bar{U}$ directly and rely on Lemma C.2, which establishes a connection between $dB$ and $d\bar{U}$.

In order to calculate the gradient on the manifold, we can smoothly extend the composite function to the entire $\mathbb{R}^{n \times p}$, calculate the Euclidean-space gradient denoted by $G \in \mathbb{R}^{n \times p}$, and then project the gradient onto the tangent space. By Lemma C.2, this projection is:

$$\nabla_{\bar{U}} L = P^\perp D D^T G,$$

where multiplying by $D D^T$ ensures $\nabla_{\bar{U}} L \in D(n, p)$ and multiplying by $P^\perp$ ensures $\bar{U}^T \nabla_{\bar{U}} L = 0$. Note that, although computing the entire $D$ is intractable, evaluating $D D^T G$ is tractable. Indeed, this involves projecting each column of $G$ onto the divergence-free vector subspace, which can be calculated by solving a discrete Poisson's equation Petrila & Trif (2004) via a sparse linear solve at a complexity of $O(n^\omega)$ Zhang (1998), where $\omega \ge 1$ depends on the numerical linear system solver. Therefore, the entire projection has a cost of $O(n^\omega p)$, as compared with the $O(n^\omega m)$ complexity of computing $D$. We refer readers to Appendix C.1 for the derivation of the Euclidean-space gradient $G$.

The computation of $\nabla_{\bar{U}} L$ over a long trajectory with $T \gg p$ timesteps is efficient. Indeed, we can precompute and accumulate $G$ over all timesteps, and apply the divergence-free projection operator only once at the end to compute $\nabla_{\bar{U}} L$, at a total cost of $O(n^\omega p + T n p + T p^3)$.
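The projection $\nabla_{\bar{U}} L = P^\perp D D^T G$ can be sketched on a toy problem. The code below is illustrative only: it uses a small random matrix with orthonormal columns as a stand-in for $D$ (in the paper, applying $D D^T$ is instead realized by a Poisson solve and $D$ is never formed), and the name `project_to_tangent` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 40, 12, 4

# Stand-in for D: an orthonormal basis (n x m) of a "divergence-free" subspace.
D, _ = np.linalg.qr(rng.standard_normal((n, m)))
# Reduced bases inside that subspace: U = D U_m with orthonormal U_m (m x p).
U_m, _ = np.linalg.qr(rng.standard_normal((m, p)))
U = D @ U_m

def project_to_tangent(U, apply_DDt, G):
    """Return P_perp D D^T G, where P_perp = I - U U^T."""
    G_df = apply_DDt(G)               # project each column onto range(D)
    return G_df - U @ (U.T @ G_df)    # remove the component along the bases U

# Euclidean gradient G (random placeholder) and its manifold projection.
G = rng.standard_normal((n, p))
grad = project_to_tangent(U, lambda X: D @ (D.T @ X), G)

print("||U^T grad||        :", np.linalg.norm(U.T @ grad))
print("||D D^T grad - grad||:", np.linalg.norm(D @ (D.T @ grad) - grad))
```

Both printed residuals should be at round-off level, matching the two conditions of Lemma C.2. Passing the projector as a callable mirrors the point made above that only the action of $D D^T$ is needed.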

C.1 DERIVATIVE FORMULATION IN EUCLIDEAN SPACE

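We derive the formula for $G$ in the following lemma:

Lemma C.3. If we introduce the third-order tensor:

$$\Phi_{\alpha\beta\gamma} \triangleq \sum_{ij} C(e_\beta, \bar{U}_i, \bar{U}_j)\,\delta_{\alpha\gamma}\,\frac{z^+_i + z_i}{2}\,\frac{z^+_j + z_j}{2} + \sum_{j} C(\bar{U}_\alpha, e_\beta, \bar{U}_j)\,\frac{z^+_\gamma + z_\gamma}{2}\,\frac{z^+_j + z_j}{2} + \sum_{i} C(\bar{U}_\alpha, \bar{U}_i, e_\beta)\,\frac{z^+_i + z_i}{2}\,\frac{z^+_\gamma + z_\gamma}{2},$$

and consider an arbitrary differentiable function $L(v)$, then the Euclidean-space gradient $G$ of the function $L \circ v^+(v, \bar{U})$ with respect to $\bar{U}$ is given by:

$$G = v \nabla L^T \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right] - \nabla L^T \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\Phi + \nabla L\,[z^+ - z]^T - v \nabla L^T \bar{U}.$$

Proof. Assuming $v$ is fixed, we first derive some useful fundamental results:

$$dz = [d\bar{U}^m]^T D^T v = dB^T [\bar{U}^m_\perp]^T D^T v,$$

$$d[Pv] = d[D \bar{U}^m z] = D\, d\bar{U}^m z + D \bar{U}^m dz = D\left[\bar{U}^m_\perp dB\, [\bar{U}^m]^T + \bar{U}^m dB^T [\bar{U}^m_\perp]^T\right] D^T v = D Q^m \begin{pmatrix} 0 & dB^T \\ dB & 0 \end{pmatrix} [Q^m]^T D^T v = dP\, v = -dP^\perp v.$$

Plugging $\Phi$ into the first-order expansion of Equation 4, we have:

$$\left[I + \delta t \nabla C(z^+)\right] dz^+ + \Phi : d\bar{U} = \left[I - \delta t \nabla C(z^+)\right] dz = \left[I - \delta t \nabla C(z^+)\right] d\bar{U}^T v,$$

where $:$ denotes tensor contraction of the last two indices. The remaining derivation follows the chain rule:

$$dv^+ = \bar{U} dz^+ + d\bar{U} z^+ + dP^\perp v = \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[\left[I - \delta t \nabla C(z^+)\right] d\bar{U}^T v - \Phi : d\bar{U}\right] + d\bar{U}\,[z^+ - z] - \bar{U} d\bar{U}^T v,$$

$$dL = \nabla L^T dv^+ = \operatorname{tr}(d\bar{U}^T G).$$

By comparing the two sides of the last equation, our result follows.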

C.2 ALTERNATIVE LIFTED FUNCTION

The above derivation is based on the definition of $v^+(v, \bar{U})$ in Equation 5, which assumes that the orthogonal component of $v$ is kept across timesteps. A useful alternative is to assume that the orthogonal component is discarded, that is:

$$v^+(v, \bar{U}) \triangleq \bar{U} z^+(\bar{U}^T v). \tag{13}$$

By a similar argument, we can derive the following derivatives for Equation 13:

$$\frac{\partial v^+}{\partial v} = \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right]\bar{U}^T,$$

$$G = v \nabla L^T \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right] - \nabla L^T \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\Phi + \nabla L\,[z^+]^T.$$
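The state Jacobian of the alternative lifted function can be checked against finite differences. The sketch below makes several assumptions that are not taken from the paper: a random antisymmetric stand-in tensor, Equation 4 read as $z^+ - z + \delta t\, C(m, m) = 0$ with $m = (z^+ + z)/2$, and the matrix `A = 0.5 * dt * jac_mid(C, m)` playing the role of $\delta t \nabla C(z^+)$. All function names are hypothetical.

```python
import numpy as np

def jac_mid(C, m):
    """Derivative of sum_ij C[k, i, j] m_i m_j with respect to m."""
    return np.einsum("kaj,j->ka", C, m) + np.einsum("kja,j->ka", C, m)

def reduced_step(C, z, dt, iters=50):
    """Newton-Raphson solve of z+ - z + dt * C(m, m) = 0, m = (z+ + z) / 2."""
    z_next = z.copy()
    I = np.eye(len(z))
    for _ in range(iters):
        m = 0.5 * (z_next + z)
        f = z_next - z + dt * np.einsum("kij,i,j->k", C, m, m)
        z_next = z_next - np.linalg.solve(I + 0.5 * dt * jac_mid(C, m), f)
    return z_next

def lifted_step(U, C, v, dt):
    """Alternative lifted function: v+ = U z+(U^T v)."""
    return U @ reduced_step(C, U.T @ v, dt)

def lifted_jacobian(U, C, v, dt):
    """dv+/dv = U [I + A]^{-1} [I - A] U^T from the implicit function theorem."""
    z = U.T @ v
    z_next = reduced_step(C, z, dt)
    A = 0.5 * dt * jac_mid(C, 0.5 * (z_next + z))
    I = np.eye(len(z))
    return U @ np.linalg.solve(I + A, I - A) @ U.T

rng = np.random.default_rng(2)
n, p, dt, h = 10, 3, 0.05, 1e-6
T3 = rng.standard_normal((p, p, p))
C = 0.5 * (T3 - T3.transpose(2, 1, 0))
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
v = rng.standard_normal(n)

J_analytic = lifted_jacobian(U, C, v, dt)
J_fd = np.zeros((n, n))
for a in range(n):                      # central finite differences, column by column
    e = np.zeros(n)
    e[a] = h
    J_fd[:, a] = (lifted_step(U, C, v + e, dt) - lifted_step(U, C, v - e, dt)) / (2 * h)

print("max |J_analytic - J_fd|:", np.abs(J_analytic - J_fd).max())
```

Agreement between the two Jacobians in this toy setting supports the structure of the formula, but it says nothing about the $\Phi$ term in $G$, which would need a separate check.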

D DECOUPLED REDUCED-ORDER MODEL

We observe that the energy preservation and time reversibility discussed in Appendix A only require the tensor $C_{kij}$ to be antisymmetric. In other words, the construction of the tensor $C_{kij}$ via Equation 3 is not necessary. We speculate that using a learned antisymmetric tensor $C_{kij}$ can expose a larger search space, leading to a better match with the full-order model. We refer to such a model as the decoupled reduced-order model, where the $C_{kij}$ are separate decision variables not constructed from $\bar{U}$. The formula for $\partial v^+ / \partial v$ (Equation 10) stays the same and the formula for $G$ takes the following simpler form:

$$G = v \nabla L^T \bar{U}\left[I + \delta t \nabla C(z^+)\right]^{-1}\left[I - \delta t \nabla C(z^+)\right] + \nabla L\,[z^+ - z]^T - v \nabla L^T \bar{U}.$$

Finally, the derivative with respect to $C_{kij}$ reads (we write $\ell$ for the auxiliary vector to distinguish it from the loss $L$):

$$\begin{cases} \ell \triangleq \left[I + \delta t \nabla C(z^+)\right]^{-T} \bar{U}^T \nabla L \\ \dfrac{\partial L}{\partial C_{kij}} = \dfrac{1}{2}\left[\ell_k\,\dfrac{z^+_i + z_i}{2}\,\dfrac{z^+_j + z_j}{2} - \ell_j\,\dfrac{z^+_i + z_i}{2}\,\dfrac{z^+_k + z_k}{2}\right], \end{cases}$$

where we have projected the derivative onto the antisymmetric subspace. On the downside, there is no universally valid $\delta t$ that makes our objective function globally differentiable for all $\bar{U}$ and $C_{kij}$, because discrete time reversibility requires a sufficiently small $\delta t$ that depends on $C_{kij}$. Empirically, however, we have not observed any convergence issue. In Figure 9, we compare the coupled and decoupled versions on the Taylor vortices and the smoke plume benchmarks; their convergence histories are almost identical. Therefore, we recommend always using the coupled model due to its theoretical differentiability guarantee. In Table 2, we summarize the number of decision variables in our various experiments.
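The antisymmetric projection of the derivative with respect to $C_{kij}$ can be sketched as follows. This is an illustration under assumptions: the auxiliary vector (called `ell` here, a hypothetical name) is taken as given rather than computed from $\bar{U}$ and $\nabla L$, and the function name `grad_wrt_C` is also hypothetical.

```python
import numpy as np

def grad_wrt_C(ell, z, z_next):
    """0.5 * (ell_k m_i m_j - ell_j m_i m_k) with midpoint m = (z+ + z) / 2."""
    m = 0.5 * (z_next + z)
    raw = np.einsum("k,i,j->kij", ell, m, m)     # unprojected term ell_k m_i m_j
    return 0.5 * (raw - raw.transpose(2, 1, 0))  # antisymmetrize in (k, j)

rng = np.random.default_rng(3)
p = 4
ell = rng.standard_normal(p)
z = rng.standard_normal(p)
z_next = rng.standard_normal(p)

g = grad_wrt_C(ell, z, z_next)

# A gradient step on C along g keeps C[k, i, j] = -C[j, i, k].
A = rng.standard_normal((p, p, p))
C = 0.5 * (A - A.transpose(2, 1, 0))
C_new = C - 0.1 * g
print("antisymmetry violation after update:",
      np.abs(C_new + C_new.transpose(2, 1, 0)).max())
```

Because the projected derivative is itself antisymmetric in its first and last indices, a plain gradient step keeps the learned tensor inside the subspace on which Lemma A.2 relies, so no separate constraint handling is needed in this sketch.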

E COMPARISON WITH ALTERNATIVE LOSS

To highlight the effectiveness of our physics correctness loss, we conduct a comparison with two other loss functions, the $L_1$ and $L_2$ losses, defined as:

$$L_1(v^+, v) \triangleq \|v^+ - v\|_1, \qquad L_2(v^+, v) \triangleq \|v^+ - v\|_2,$$

where $v$ denotes the velocity generated by the groundtruth fullspace fluid simulator Pavlov et al. (2011). We note that these loss functions are impractical for large-scale test cases because they require solving for the groundtruth data of a different initial condition during each iteration of training. Therefore, we choose to only evaluate them on our first three benchmarks in Table 1, where there is only a single trajectory so $v$ can be precomputed. For these benchmarks, we both train and evaluate on the three losses $L_{dyn}$, $L_1$, $L_2$, and summarize the results in Table 3. We also plot the convergence history of the first benchmark (Taylor Vortices) in Figure 10. Our plots show that, when the first benchmark is trained using $L_{1,2}$, both $L_1$ and $L_2$ decrease by at most 64%, but our $L_{dyn}$ can increase drastically, by up to 1083%. In contrast, when trained using $L_{dyn}$, $L_{1,2}$ increase or decrease by at most 3.3%, while our $L_{dyn}$ decreases significantly, by 76.8%. Considering these properties and the fact that $L_1$ and $L_2$ are impractical to compute because they require groundtruth data, we conclude that our $L_{dyn}$ is overall more practical for training reduced fluid systems.

F COMPARISON WITH DMD

We have shown that our method works best with POD initialization. In this section, we conduct additional experiments with DMD. DMD extends POD by assuming that the data is generated from a linear dynamic system. DMD can be used as either an intrusive or a non-intrusive method. In the intrusive mode, we use DMD to compute the bases $\bar{U}$ and compute $C_{kij}$ from $\bar{U}$ via Equation 3. In the non-intrusive mode, we simply use the DMD-assumed linear dynamic system as the surrogate. To evaluate the performance of DMD, we use two metrics. For the intrusive DMD, we use our physics correctness loss (Equation 6). Unfortunately, our physics correctness loss is not suitable for evaluating non-intrusive methods that can be non-reversible. Indeed, it is always possible to let $L_{dyn} = 0$ by setting $v^+ = v = 0$. Therefore, we also measure the energy gain $\Delta e = (\|v_T\| - \|v_0\|)/\|v_0\|$ as an indication of dynamic system stability. We perform the experiments using the open source DMD library Demo et al. (2018) on our first three benchmarks. The results are shown in Table 4. They show that the performance of intrusive DMD is worse than that of either POD or our method in terms of the physics correctness loss. This is because the main assumption of DMD, i.e., the dynamic system being linear, is invalid for the bilinear dynamic system of Equation 3. In contrast, POD does not make any assumption on the time dependency between frames and serves as a better initialization for our method. On the other hand, the non-intrusive DMD leads to better performance in terms of $L_{dyn}$, but the dynamic system tends to be rather unstable, with a drastic energy gain of 1.9× to 73.3×.
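The energy gain metric $\Delta e = (\|v_T\| - \|v_0\|)/\|v_0\|$ can be illustrated on a toy linear surrogate. The sketch below does not use the DMD library cited above; it assumes a hand-built $2 \times 2$ linear map as the surrogate, and the names `energy_gain` and `rollout` are hypothetical.

```python
import numpy as np

def energy_gain(v0, vT):
    """Relative change of the state norm over a trajectory: (|vT| - |v0|) / |v0|."""
    return (np.linalg.norm(vT) - np.linalg.norm(v0)) / np.linalg.norm(v0)

def rollout(A, v0, T):
    """Roll out the linear surrogate v_{t+1} = A v_t for T steps."""
    v = v0.copy()
    for _ in range(T):
        v = A @ v
    return v

theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # norm-preserving rotation
v0 = np.array([1.0, 0.0])

gain_rot = energy_gain(v0, rollout(R, v0, 500))           # stable surrogate
gain_grow = energy_gain(v0, rollout(1.002 * R, v0, 500))  # slightly expanding surrogate

print("energy gain, rotation:        ", gain_rot)
print("energy gain, expanding system:", gain_grow)
```

A surrogate whose eigenvalues lie only slightly outside the unit circle already accumulates a large gain over a long horizon, which is why the metric is a useful stability indicator for non-reversible models.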

G COMPARISON WITH PINNS

We conduct comparisons with PINNs Raissi et al. (2019). PINNs were originally designed for solving PDEs, while our divergence-free Navier-Stokes equation is a DAE. In order to extend PINNs to handle DAEs, we learn a neural network DAE solution function, denoted as $\mathrm{NN}(x, y, t) = (v_x, v_y, \lambda)$ and represented as an MLP with 3 hidden layers, each having $H$ neurons and the Tanh activation function, and minimize the following physics violation loss:

$$\|\dot{v} + \nabla \times v \times v + \nabla \lambda\|^2 + \|\nabla \cdot v\|^2.$$

We also enforce additional temporal and spatial boundary conditions as loss functions. All the loss functions have weights equal to 1. For fairness of comparison, we use the same training data for both our method and PINNs. Note that our method uses a grid-based spatial discretization, so we use all the grid centers as spatial samples of training data and we sample the temporal domain at a regular interval of $\delta t = 0.01$, which equals our timestep size. We aim to predict a trajectory of the same length as our method, i.e. $T\delta t$. We use Adam as our optimizer and we train both methods on CPU for 24 hours. Since PINNs can lead to non-divergence-free velocity fields, we measure the accuracy of both methods via three metrics: $L_{dyn}$, $\Delta e$, and the average divergence error $\|v - v^*\|_\infty$, where $v^*$ is the closest divergence-free velocity field to $v$. The results are summarized in Table 5. PINNs mostly perform worse than our method in terms of $L_{dyn}$. In the Taylor Vortices benchmark using $H = 128$, the $L_{dyn}$ metric generated by PINNs is slightly better than that of our method. But this is again because $L_{dyn}$ is designed only for measuring time-reversible flows, and it is not an effective metric for comparing reversible and non-reversible flows due to its trivial solutions. Such trivial solutions are indeed exhibited by PINNs, as illustrated in Figure 11. After a very short period of time, the solution predicted by PINNs becomes significantly smeared out and meaningless.

H ADDITIONAL RESULTS

We demonstrate additional experimental results. Some snapshots of our 4 benchmark scenarios are shown in Figures 12, 13, 14, 15, and 16.



Figure 1: Given a distribution of initial conditions I, we identify a reduced-order fluid model v+ (v, Ū ) by optimizing the bases Ū that minimize the expected trajectory-wise discrepancy loss L dyn . Our output model v+ (v, Ū ) can perform efficient and as-accurate-as-possible fluid trajectory predictions.

Figure 2: We plot the energy dissipation caused by a viscous term under µ = 0, 1, 10, 100, simulated using our learned reduced model (a) and the groundtruth fullspace model (b).

Figure 3: The cost of evaluating z + (z) plotted against p.

Figure 4: (a) Velocity magnitude field snapshots of the Taylor vortices benchmark, generated by full-order model (top row), our method with ϵ = 0.0001 and p = 25 (middle row), and POD with ϵ = 0.0001 and p = 25 (bottom row). (b) Trajectory-wise discrepancy loss of POD and our method, under different p. (c) The convergence history of our method over 24 hours.

Figure 5: The convergence history of four instances of learning reduced Taylor vortices with ϵ = 0.05, p = 8,

Figure 6: The convergence history over 3000 iterations of four instances of learning the reduced rising smoke plume trajectory. We use two sets of instances: ϵ = 0.05, p = 6 and ϵ = 0.01, p = 9. For each set, we compare the one-step and full-unrolling modes of training.

In the recent work of Brandstetter et al. (2022), the authors proposed two training modes for learning neural PDE solvers: one-step training and full unrolling. One-step training cuts off the gradient after a single timestep, while the full-unrolling mode considers the full gradient of Equation 7 over the entire trajectory. We compare the two modes in Figure 6 in terms of the trajectory-wise discrepancy loss, using our second benchmark scenario, the rising smoke plume. Both modes can reduce the loss after 3000 iterations, although there is some initial fluctuation in one-step training, while full unrolling leads to significantly faster convergence. We use the full-unrolling mode for all other examples.

Figure 8: We illustrate the four manifolds: St(n, p) for the velocity bases; St(m, p) for the divergence-free velocity bases; Gr(n, p) for the velocity subspace; Gr(m, p) for the divergence-free velocity subspace. Our method maintains Ū ∈ St(n, p) and represents the gradient as some ∇ Ū L ∈ T Ū St(n, p), which is both memory efficient and computationally tractable.

Lemma C.2. For divergence-free velocity bases $\bar{U}$, a direction $d\bar{U}$ belongs to the tangent plane of $D(n, p) \cap \mathrm{St}(n, p)$ at $\bar{U}$ if and only if $d\bar{U} \in D(n, p)$ and $\bar{U}^T d\bar{U} = 0$.

Proof. If $d\bar{U}$ belongs to the tangent plane, then it must satisfy $d\bar{U} = D\, d\bar{U}^m$ for some $d\bar{U}^m = \bar{U}^m_\perp dB$, so $d\bar{U} \in D(n, p)$. Further, $\bar{U}^T d\bar{U} = \bar{U}^T D \bar{U}^m_\perp dB = [\bar{U}^m]^T \bar{U}^m_\perp dB = 0$. Conversely, $d\bar{U} \in D(n, p)$ implies $d\bar{U} = D\, d\bar{U}^m$ for some $d\bar{U}^m$. Further, $\bar{U}^T d\bar{U} = 0$ implies $[\bar{U}^m]^T d\bar{U}^m = 0$, which in turn implies $d\bar{U}^m = \bar{U}^m_\perp dB$ for some $dB$.

Suppose we have a loss function $L \circ v^+(v, P^m)$ with $v$ held constant. We can compose the loss function with the map $\pi_{\mathrm{St}(n,p) \mapsto \mathrm{Gr}(m,p)}(\bar{U}) = D^T \bar{U} \bar{U}^T D = P^m$. The domain of this composite function is the intersection of $D(n, p)$ and $\mathrm{St}(n, p)$, which is an embedded sub-manifold of $\mathbb{R}^{n \times p}$.

Figure 9: We compare the performance of coupled and decoupled versions on the Taylor vortices benchmark (a), with ϵ = 0.01 and p = 11, and the smoke plume benchmark (b), with ϵ = 0.01 and p = 9.

Figure 10: For our first benchmark (Taylor Vortices), we plot the convergence history when trained using Ldyn (a), L1 (b), and L2 (c). The scale of Ldyn is shown on the left and L1,2 is shown on the right of each plot.

Table 4: We compare the POD baseline and our method with intrusive DMD (I-DMD) and non-intrusive DMD (NI-DMD) in terms of trajectory-wise physics correctness loss and energy gain.

Figure 11: We compare frames generated by the groundtruth (a), PINNs (H = 128) (b), and our method (ϵ = 0.0001, p = 25) (c) on the Taylor Vortices benchmark. After a very short period of time, the results generated by PINNs become significantly smeared out and meaningless.

Figure 12: Velocity magnitude field snapshots of the Taylor vortices benchmark, generated by full-order model (a), our method with ϵ = 0.0001 and p = 25 (b), and POD with ϵ = 0.0001 and p = 25 (c).

Figure 13: (a) Velocity magnitude field snapshots of the smoke plume benchmark, generated by full-order model (top row), our method with ϵ = 0.0001 and p = 26 (middle row), and POD with ϵ = 0.0001 and p = 26 (bottom row). (b) Trajectory-wise discrepancy loss of POD and our method, under different p. (c) The convergence history of our method over 24 hours.

Figure 14: Velocity magnitude field snapshots of the smoke plume benchmark with a spherical obstacle, generated by full-order model (a), our method with ϵ = 0.0001 and p = 26 (b), and POD with ϵ = 0.0001 and p = 26 (c).

Figure 15: Velocity magnitude field snapshots of the spherical plume benchmark, generated by full-order model (a), our method with ϵ = 0.01 and p = 36 (b), and POD with ϵ = 0.01 and p = 36 (c). The plume moves along θ = 7.5 ○ (arrow), which is not covered by our training dataset.

Table 1: Summary of benchmarks for comparing POD and our method under different ϵ and p. (Columns: p, Loss-POD, Loss-Ours, repeated per benchmark; table body not recovered.)



Table 2: We summarize the number of decision variables in each example. In the coupled case, our decision variable is Ū, having np variables. In the decoupled case, our decision variables are Ū and C kij, having np + p 3 variables.

Table 3: We evaluate our first three benchmarks when trained and evaluated using Ldyn, L1, L2.



Table 5: We compare our method and PINNs in terms of Ldyn and ∥v -v * ∥∞.

                  PINNs (H = 64)             PINNs (H = 128)            Ours
                  Ldyn/∆e      ∥v -v * ∥∞    Ldyn/∆e      ∥v -v * ∥∞    Ldyn/∆e      ∥v -v * ∥∞
Taylor Vortices   17.46/0.32   0.000027      8.96/0.64    0.000018      [not recovered]

