AUGMENTING PHYSICAL MODELS WITH DEEP NETWORKS FOR COMPLEX DYNAMICS FORECASTING

Abstract

Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides existence and uniqueness guarantees for the decomposition, but also ensures interpretability and benefits generalization. Experiments conducted on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters.

1. INTRODUCTION

Modeling and forecasting complex dynamical systems is a major challenge in domains such as environment and climate (Rolnick et al., 2019), health science (Choi et al., 2016), and in many industrial applications (Toubeau et al., 2018). Model-based (MB) approaches typically rely on partial or ordinary differential equations (PDE/ODE) and stem from a deep understanding of the underlying physical phenomena. Machine learning (ML) and deep learning methods are more prior-agnostic, yet have become state-of-the-art for several spatio-temporal prediction tasks (Shi et al., 2015; Wang et al., 2018; Oreshkin et al., 2020; Donà et al., 2020), and connections have been drawn between deep architectures and numerical ODE solvers, e.g. neural ODEs (Chen et al., 2018; Ayed et al., 2019b). However, modeling complex physical dynamics is still beyond the scope of pure ML methods, which often cannot properly extrapolate to new conditions as MB approaches do. Developing the interplay between the MB and ML paradigms is an emerging trend. For example, Brunton et al. (2016); Long et al. (2018b) learn the explicit form of PDEs directly from data, Raissi et al. (2019); Sirignano & Spiliopoulos (2018) use NNs as implicit methods for solving PDEs, Seo et al. (2020) learn spatial differences with a graph network, Ummenhofer et al. (2020) learn the velocity field of an advection-diffusion system, and Greydanus et al. (2019); Chen et al. (2020) enforce conservation laws in the network architecture or in the loss function. The large majority of the aforementioned MB/ML hybrid approaches assume that the physical model adequately describes the observed dynamics. This assumption is, however, commonly violated in practice. This may be due to various factors, e.g.
idealized assumptions and the difficulty of explaining processes from first principles (Gentine et al., 2018), computational constraints precluding fine-grained modeling of the system (Ayed et al., 2019a), or unknown external factors, forces and sources (Large & Yeager, 2004). In this paper, we aim at leveraging prior dynamical ODE/PDE knowledge in situations where this physical model is incomplete, i.e. unable to represent the whole complexity of observed data. To handle this case, we introduce a principled learning framework to Augment incomplete PHYsical models for ideNtIfying and forecasTing complex dYnamics (APHYNITY). The rationale of APHYNITY, illustrated in Figure 1 on the pendulum problem, is to augment the physical model when, and only when, it falls short. Designing a general method for combining MB and ML approaches is still a widely open problem, and a clear problem formulation for the latter is lacking (Reichstein et al., 2019). Our contributions towards these goals are the following:
• We introduce a simple yet principled framework for combining both approaches. We decompose the dynamics into a physical and a data-driven term such that the data-driven component only models information that cannot be captured by the physical model. We provide existence and uniqueness guarantees (Section 3.1) for the decomposition given mild conditions, and show that this formulation ensures interpretability and benefits generalization.
• We propose a trajectory-based training formulation (Section 3.2) along with an adaptive optimization scheme (Section 3.3) enabling end-to-end learning for both physical and deep learning components. This allows APHYNITY to automatically adjust the complexity of the neural network to different approximation levels of the physical model, paving the way to flexible learned hybrid models.
• We demonstrate the generality of the approach on three use cases (reaction-diffusion, wave equations and the pendulum), representative of different PDE families (parabolic, hyperbolic) and covering a wide spectrum of application domains, e.g. acoustics, electromagnetism, chemistry, biology, physics (Section 4). We show that APHYNITY is able to achieve performance close to complete physical models by augmenting incomplete ones, both in terms of forecasting accuracy and physical parameter identification. Moreover, APHYNITY can also be successfully extended to the partially observable setting (see discussion in Section 5).

2. RELATED WORK

Correction in data assimilation Prediction under approximate physical models has been tackled by traditional statistical calibration techniques, which often rely on Bayesian methods (Pernot & Cailliez, 2017). In data assimilation techniques, e.g. the Kalman filter (Kalman, 1960; Becker et al., 2019) or 4D-var (Courtier et al., 1994), prediction errors are modeled probabilistically and a correction using observed data is applied after each prediction step. Similar residual correction procedures are commonly used in robotics and optimal control (Chen, 2004; Li et al., 2014). However, these sequential (two-stage) procedures prevent cooperation between prediction and correction. Besides, in model-based reinforcement learning, model deficiencies are typically handled by considering only short-term rollouts (Janner et al., 2019) or by model predictive control (Nagabandi et al., 2018). The originality of APHYNITY is to leverage model-based prior knowledge by augmenting it with neurally parametrized dynamics. It does so while ensuring optimal cooperation between the prior model and the augmentation.

Augmented physical models Combining physical models with machine learning (gray-box or hybrid modeling) was first explored in the 1990s: Psichogios & Ungar (1992); Thompson & Kramer (1994); Rico-Martinez et al. (1994) use neural networks to predict the unknown parameters of physical models. The challenge of proper MB/ML cooperation was already raised as a limitation of gray-box approaches but not addressed. Moreover, these methods were evaluated on specific applications with a residual targeted to the form of the equation. In the last few years, there has been renewed interest in deep hybrid models bridging data assimilation techniques and machine learning to identify complex PDE parameters using cautiously constrained forward models (Long et al., 2018b; de Bézenac et al., 2018), as discussed in the introduction.
Recently, some approaches have specifically targeted the MB/ML cooperation. HybridNet (Long et al., 2018a) and PhICNet (Saha et al., 2020) both use data-driven networks to learn additive perturbations or source terms to a given PDE. The former considers the favorable context where the perturbations can be accessed, and the latter the special case of additive noise on the input. Wang et al. (2019); Mehta et al. (2020) propose several empirical fusion strategies with deep neural networks but lack theoretical grounding. PhyDNet (Le Guen & Thome, 2020) tackles augmentation in partially observed settings, but with specific recurrent architectures dedicated to video prediction. Crucially, none of the aforementioned approaches addresses the issues of uniqueness of the decomposition or of proper cooperation for correct parameter identification. Besides, we found experimentally that this vanilla cooperation is inferior to the APHYNITY learning scheme in terms of forecasting and parameter identification performance (see experiments in Section 4.2).

3. THE APHYNITY MODEL

In the following, we study dynamics driven by an equation of the form dX_t/dt = F(X_t), defined over a finite time interval [0, T], where the state X is either vector-valued, i.e. X_t ∈ R^d for every t (pendulum equations in Section 4), or X_t is a d-dimensional vector field over a spatial domain Ω ⊂ R^k, with k ∈ {2, 3}, i.e. X_t(x) ∈ R^d for every (t, x) ∈ [0, T] × Ω (reaction-diffusion and wave equations in Section 4). We suppose that we have access to a set of observed trajectories D = {X_• : [0, T] → A | ∀t ∈ [0, T], dX_t/dt = F(X_t)}, where A is the set of X values (either R^d or a space of vector fields). In our case, the unknown F has A as domain and we only assume that F ∈ F, with (F, ‖·‖) a normed vector space.

3.1. DECOMPOSING DYNAMICS INTO PHYSICAL AND AUGMENTED TERMS

As introduced in Section 1, we consider the common situation where incomplete information is available on the dynamics, under the form of a family of ODEs or PDEs characterized by their temporal evolution F_p ∈ F_p ⊂ F. The APHYNITY framework leverages the knowledge of F_p while mitigating the approximations induced by this simplified model, through the combination of physical and data-driven components. F being a vector space, we can write F = F_p + F_a, where F_p ∈ F_p encodes the incomplete physical knowledge and F_a ∈ F is the data-driven augmentation term complementing F_p. The incomplete physical prior is supposed to belong to a known family, but the physical parameters (e.g. propagation speed for the wave equation) are unknown and need to be estimated from data. Both F_p and F_a parameters are estimated by fitting the trajectories from D. The decomposition F = F_p + F_a is in general not unique. For example, all the dynamics could be captured by the F_a component. This decomposition is thus ill-defined, which hampers the interpretability and the extrapolation abilities of the model. In other words, one wants the estimated parameters of F_p to be as close as possible to the true parameter values of the physical model, and F_a to play only a complementary role w.r.t. F_p, so as to model only the information that cannot be captured by the physical prior. For example, when F ∈ F_p, the data can be fully described by the physical model, and in this case it is sensible to desire F_a to be nullified; this is of central importance in a setting where one wishes to identify physical quantities, and for the model to generalize and extrapolate to new conditions. In the more general setting where the physical model is incomplete, the action of F_a on the dynamics, as measured through its norm, should be as small as possible.
This general idea is embedded in the following optimization problem:

min_{F_p ∈ F_p, F_a ∈ F} ‖F_a‖ subject to ∀X ∈ D, ∀t, dX_t/dt = (F_p + F_a)(X_t)   (2)

A first key question is whether the minimum in Eq. (2) is indeed well-defined, in other words whether there indeed exists a decomposition with a minimal norm ‖F_a‖. The answer actually depends on the geometry of F_p, and is formulated in the following proposition, proven in Appendix B:

Proposition 1 (Existence of a minimizing pair). If F_p is a proximinal set, there exists a decomposition minimizing Eq. (2).

Proximinality is a mild condition which, as shown through the proof of the proposition, cannot be weakened. It is a property verified by any boundedly compact set. In particular, it is true for closed subsets of finite-dimensional spaces. However, if only existence is guaranteed, while forecasts would be expected to be accurate, non-uniqueness of the decomposition would hamper the interpretability of F_p, meaning that the identified physical parameters would not be uniquely determined. It is then natural to ask under which conditions solving problem Eq. (2) leads to a unique decomposition into a physical and a data-driven component. The following result provides guarantees on the existence and uniqueness of the decomposition under mild conditions. The proof is given in Appendix B:

Proposition 2 (Uniqueness of the minimizing pair). If F_p is a Chebyshev set, Eq. (2) admits a unique minimizer. The F_p in this minimizer pair is the metric projection of the unknown F onto F_p.

The Chebyshev assumption is strictly stronger than proximinality but is still quite mild and necessary.
Indeed, in practice, many sets of interest are Chebyshev, including all closed convex sets in strictly convex normed spaces and, if F = L², F_p can be any closed convex set, including all finite-dimensional subspaces. In particular, all examples considered in the experiments are Chebyshev sets. Propositions 1 and 2 provide, under mild conditions, the theoretical guarantees for the APHYNITY formulation to infer the correct MB/ML decomposition, thus enabling both recovering the proper physical parameters and accurate forecasting.
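To make the metric-projection view concrete, here is a toy sketch of our own construction (not from the paper's experiments): F_p is the one-parameter family {x ↦ a·x, a ≥ 0}, a closed convex subset of L² on a finite sample grid, hence Chebyshev, so the minimal-norm residual F_a is the unique remainder after projection. All names are ours.

```python
import math

# F_p = {x -> a*x | a >= 0}: a closed convex (hence Chebyshev) family.
# The L2 projection of F onto it has a closed form: a* = <F, x> / <x, x>,
# clamped at 0 to respect the constraint a >= 0.

def project_onto_linear_family(xs, f_vals):
    """Least-squares (L2) projection of sampled F onto {x -> a*x, a >= 0}."""
    num = sum(x * f for x, f in zip(xs, f_vals))
    den = sum(x * x for x in xs)
    return max(num / den, 0.0)

# True F: a linear part (capturable by F_p) plus a residual the prior cannot express.
xs = [0.1 * i for i in range(1, 11)]
f_vals = [2.0 * x + 0.5 * math.sin(5 * x) for x in xs]

a_star = project_onto_linear_family(xs, f_vals)
f_a = [f - a_star * x for x, f in zip(xs, f_vals)]   # minimal-norm residual F_a
norm_fa = math.sqrt(sum(v * v for v in f_a))
# Any other a != a* yields a strictly larger residual norm: the decomposition
# (F_p, F_a) minimizing ||F_a|| is unique, as in Proposition 2.
```

Here the physical component absorbs everything it can express, and the residual norm is what the data-driven term would have to model.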

3.2. SOLVING APHYNITY WITH DEEP NEURAL NETWORKS

In the following, both terms of the decomposition are parametrized and are denoted F_p^{θ_p} and F_a^{θ_a}. Solving APHYNITY then consists in estimating the parameters θ_p and θ_a. θ_p are the physical parameters and are typically low-dimensional, e.g. 2 or 3 in our experiments for the considered physical models. For F_a, we need sufficiently expressive models able to optimize over all of F: we thus use deep neural networks, which have shown promising performance for the approximation of differential equations (Raissi et al., 2019; Ayed et al., 2019b). When learning the parameters of F_p^{θ_p} and F_a^{θ_a}, we have access to a finite dataset of trajectories discretized with a given temporal resolution ∆t: D_train = {(X^(i)_{k∆t})_{0≤k≤T/∆t}}_{1≤i≤N}. Solving Eq. (2) requires estimating the state derivative dX_t/dt appearing in the constraint term. One solution is to approximate this derivative using e.g. finite differences as in (Brunton et al., 2016; Greydanus et al., 2019; Cranmer et al., 2020). This numerical scheme requires high space and time resolutions in the observation space in order to get reliable gradient estimates. Furthermore, it is often unstable, leading to explosive numerical errors as discussed in Appendix D. We propose instead to solve Eq. (2) using an integral trajectory-based approach: we compute the predicted state X̃^(i)_{k∆t} from an initial state X^(i)_0 using the current F_p^{θ_p} + F_a^{θ_a} dynamics, then enforce the constraint X̃^(i)_{k∆t} = X^(i)_{k∆t}. This leads to our final objective function on (θ_p, θ_a):

min_{θ_p, θ_a} ‖F_a^{θ_a}‖ subject to ∀i, ∀k, X̃^(i)_{k∆t} = X^(i)_{k∆t}   (3)

where X̃^(i)_{k∆t} is the approximate solution of the integral X^(i)_0 + ∫_0^{k∆t} (F_p^{θ_p} + F_a^{θ_a})(X_s) ds obtained by a differentiable ODE solver.
In our setting, where we consider situations for which F_p^{θ_p} only partially describes the physical phenomenon, this coupled MB + ML formulation leads to different parameter estimates than using the MB formulation alone, as analyzed more thoroughly in Appendix C. Interestingly, our experiments show that using this formulation also leads to a better identification of the physical parameters θ_p than when fitting the simplified physical model F_p^{θ_p} alone (Section 4). With only incomplete knowledge of the physics, the θ_p estimator would be biased by the additional dynamics which need to be fitted in the data. Appendix F also confirms that the integral formulation gives better forecasting results and a more stable behavior than supervising over finite-difference approximations of the derivatives.
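The integral trajectory-based objective can be sketched on a one-dimensional toy problem of our own design (not the paper's code): roll X out from X_0 under the current F_p + F_a with a solver and penalize the gap to the observed states, instead of supervising on finite-difference derivative estimates.

```python
import math

def rk4_step(f, x, dt):
    """One 4th-order Runge-Kutta step, the solver family used in Section 4."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def trajectory_loss(f, x0, observed, dt):
    """Sum of squared gaps between the rollout of f and the observed states."""
    loss, x = 0.0, x0
    for target in observed:
        x = rk4_step(f, x, dt)
        loss += (x - target) ** 2
    return loss

# Toy data from dx/dt = -x (exact exponential decay), sampled every dt.
dt = 0.1
observed = [math.exp(-(k + 1) * dt) for k in range(30)]

f_p = lambda x: -0.8 * x            # incomplete physical prior
f_a = lambda x: -0.2 * x            # augmentation (a neural net in APHYNITY)
full = lambda x: f_p(x) + f_a(x)    # F_p + F_a, the dynamics being rolled out
```

With this step size the rollout of `full` matches the data almost exactly while `f_p` alone leaves a residual gap; in APHYNITY, `f_a` is a neural network and the whole rollout is differentiated end-to-end through the solver.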

3.3. ADAPTIVELY CONSTRAINED OPTIMIZATION

The formulation in Eq. (3) involves constraints which are difficult to enforce exactly in practice. We considered a variant of the method of multipliers (Bertsekas, 1996), which uses a sequence of Lagrangian relaxations L_{λ_j}(θ_p, θ_a):

L_{λ_j}(θ_p, θ_a) = ‖F_a^{θ_a}‖ + λ_j · L_traj(θ_p, θ_a), where L_traj(θ_p, θ_a) = Σ_{i=1}^{N} Σ_{h=1}^{T/∆t} ‖X^(i)_{h∆t} − X̃^(i)_{h∆t}‖.

Algorithm 1: APHYNITY
Initialization: λ_0 ≥ 0, τ_1 > 0, τ_2 > 0
for epoch = 1 : N_epochs do
    for iter = 1 : N_iter do
        for batch = 1 : B do
            θ^{j+1} = θ^j − τ_1 ∇_θ [λ_j L_traj(θ^j) + ‖F_a^{θ_a}‖]
    λ_{j+1} = λ_j + τ_2 L_traj(θ^{j+1})

This method needs an increasing sequence (λ_j)_j such that the successive minima of L_{λ_j} converge to a solution (at least a local one) of the constrained problem in Eq. (3). We select (λ_j)_j using an iterative strategy: starting from a value λ_0, we iterate, minimizing L_{λ_j} by gradient descent, then updating λ_j with λ_{j+1} = λ_j + τ_2 L_traj(θ^{j+1}), where τ_2 is a chosen hyper-parameter and θ = (θ_p, θ_a). This procedure is summarized in Algorithm 1. This adaptive iterative procedure allows us to obtain stable and robust results, in a reproducible fashion, as shown in the experiments.
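A minimal sketch of this adaptive scheme, on a scalar toy problem of our own design (hypothetical values throughout): data comes from dx/dt = −x, the "physical" parameter is a, the scalar "augmentation" is c, and the model is f(x) = −(a + c)·x. Minimizing |c| + λ_j·L_traj while growing λ_j pushes the fit onto the physical parameter.

```python
import math

DT, STEPS = 0.1, 20
observed = [math.exp(-(k + 1) * DT) for k in range(STEPS)]  # from x0 = 1

def l_traj(a, c):
    """Trajectory loss: squared gaps along an Euler rollout of -(a + c) * x."""
    x, loss = 1.0, 0.0
    for target in observed:
        x = x + DT * (-(a + c) * x)
        loss += (x - target) ** 2
    return loss

def grad(fn, a, c, eps=1e-6):
    """Central finite-difference gradient of fn w.r.t. (a, c)."""
    return ((fn(a + eps, c) - fn(a - eps, c)) / (2 * eps),
            (fn(a, c + eps) - fn(a, c - eps)) / (2 * eps))

a, c, lam = 0.2, 0.5, 1.0          # initial theta_p, theta_a, lambda_0
TAU1, TAU2 = 0.01, 1.0             # inner and outer step sizes
for _ in range(20):                # outer iterations: one lambda update each
    for _ in range(200):           # inner gradient steps on theta = (a, c)
        ga, gc = grad(lambda aa, cc: abs(cc) + lam * l_traj(aa, cc), a, c)
        a, c = a - TAU1 * ga, c - TAU1 * gc
    lam += TAU2 * l_traj(a, c)     # lambda_{j+1} = lambda_j + tau_2 * L_traj
```

The augmentation c is driven toward 0 while a absorbs the dynamics, settling near the value that makes the Euler rollout match the data (close to, but not exactly, 1.0, reflecting the integration scheme).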

4. EXPERIMENTAL VALIDATION

We validate our approach on three classes of challenging physical dynamics: reaction-diffusion, wave propagation, and the damped pendulum, representative of various application domains such as chemistry, biology or ecology (for reaction-diffusion) and earth physics, acoustics, electromagnetism or even neuro-biology (for wave equations). The first two dynamics are described by PDEs and thus in practice must be learned from very high-dimensional vectors, discretized from the original compact domain. This makes learning much more difficult than in the low-dimensional pendulum case. For each problem, we investigate the cooperation between physical models of increasing complexity encoding incomplete knowledge of the dynamics (denoted Incomplete physics in the following) and data-driven models. We show the relevance of APHYNITY (denoted APHYNITY models) both in terms of forecasting accuracy and physical parameter identification.

4.1. EXPERIMENTAL SETTING

We describe the three families of equations studied in the experiments. In all experiments, F = L²(A) where A is the set of all admissible states for each problem, and the L² norm is computed on D_train by ‖F‖² ≈ Σ_{i,k} ‖F(X^(i)_{k∆t})‖². All considered sets of physical functionals F_p are closed and convex in F and thus are Chebyshev. In order to enable the evaluation of both prediction and parameter identification, all our experiments are conducted on simulated datasets with known model parameters. Each dataset has been simulated using an appropriate high-precision integration scheme for the corresponding equation. All solver-based models take the first state X_0 as input and predict the remaining time-steps by integrating F through the same differentiable, generic and common ODE solver (4th-order Runge-Kutta). Implementation details and architectures are given in Appendix E.
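The empirical norm above can be sketched directly; the container layout (one list of states per trajectory, each state a flat list of floats) is our own minimal choice, not the paper's implementation.

```python
# ||F||^2 is estimated by summing ||F(X)||^2 over every state of every
# training trajectory, i.e. a discrete approximation of the L2 norm on D_train.

def empirical_sq_norm(f, trajectories):
    """Estimate of ||F||^2 over the states in D_train."""
    total = 0.0
    for traj in trajectories:            # one list of states per trajectory
        for state in traj:               # each state: a flat list of floats
            fx = f(state)
            total += sum(v * v for v in fx)
    return total

# Example: F doubles each coordinate; one trajectory with states [1.0], [2.0]
# contributes 2^2 + 4^2 = 20.
norm_sq = empirical_sq_norm(lambda s: [2.0 * x for x in s], [[[1.0], [2.0]]])
```

In APHYNITY this quantity, computed on the neural component F_a, is exactly the term minimized in Eq. (2).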

Reaction-diffusion equations

We consider a 2D FitzHugh-Nagumo type model (Klaasen & Troy, 1984). The system is driven by the PDE

∂u/∂t = a∆u + R_u(u, v; k),  ∂v/∂t = b∆v + R_v(u, v)

where a and b are respectively the diffusion coefficients of u and v, and ∆ is the Laplace operator. The local reaction terms are R_u(u, v; k) = u − u³ − k − v and R_v(u, v) = u − v. The state is X = (u, v) and is defined over a compact rectangular domain Ω with periodic boundary conditions. The considered physical models are:
• Param PDE (a, b), with unknown (a, b) diffusion terms and without reaction terms: F_p = {F_p^{a,b} : (u, v) ↦ (a∆u, b∆v) | a ≥ a_min > 0, b ≥ b_min > 0};
• Param PDE (a, b, k), the full PDE with unknown parameters: F_p = {F_p^{a,b,k} : (u, v) ↦ (a∆u + R_u(u, v; k), b∆v + R_v(u, v)) | a ≥ a_min > 0, b ≥ b_min > 0, k ≥ k_min > 0}.

Damped wave equations We investigate the damped-wave PDE

∂²w/∂t² − c²∆w + k ∂w/∂t = 0

where k is the damping coefficient. The state is X = (w, ∂w/∂t) and we consider a compact spatial domain Ω with homogeneous Neumann boundary conditions. Note that this damping differs from the pendulum one, as its effect is global. Our physical models are:
• Param PDE (c), without damping term: F_p = {F_p^c : (u, v) ↦ (v, c²∆u) | c ∈ [ε, +∞) with ε > 0};
• Param PDE (c, k): F_p = {F_p^{c,k} : (u, v) ↦ (v, c²∆u − kv) | c, k ∈ [ε, +∞) with ε > 0}.

Damped pendulum The evolution follows the ODE d²θ/dt² + ω_0² sin θ + α dθ/dt = 0, where θ(t) is the angle, ω_0 the proper pulsation (T_0 the period) and α the damping coefficient. With state X = (θ, dθ/dt), the ODE is F_p^{ω_0,α} : X ↦ (dθ/dt, −ω_0² sin θ − α dθ/dt). Our physical models are:
• Hamiltonian (Greydanus et al., 2019), a conservative approximation, with F_p = {F_p^H : (u, v) ↦ (∂_y H(u, v), −∂_x H(u, v)) | H ∈ H¹(R²)}, where H¹(R²) is the first-order Sobolev space;
• Param ODE (ω_0), the frictionless pendulum: F_p = {F_p^{ω_0,α=0} | ω_0 ∈ [ε, +∞) with ε > 0};
• Param ODE (ω_0, α), the full pendulum equation: F_p = {F_p^{ω_0,α} | ω_0, α ∈ [ε, +∞) with ε > 0}.

Baselines As purely data-driven baselines, we use Neural ODE (Chen et al., 2018) for the three problems and PredRNN++ (Wang et al., 2018, for reaction-diffusion only), which are competitive models for datasets generated by differential equations and for spatio-temporal data. As MB/ML methods, in the ablation studies (see Appendix F), we compare, for all problems, to the vanilla MB/ML cooperation scheme found in (Wang et al., 2019; Mehta et al., 2020). We also show results for True PDE/ODE, which corresponds to the equation used for data simulation (which does not lead to zero error due to the difference between simulation and training integration schemes). For the pendulum, we compare to Hamiltonian neural networks (Greydanus et al., 2019; Toth et al., 2020) and to the deep Galerkin method (DGM, Sirignano & Spiliopoulos, 2018). See additional details in Appendix E.

Table 1: Forecasting and identification results on the (a) reaction-diffusion, (b) wave equation, and (c) damped pendulum datasets. We set for (a) a = 1 × 10⁻³, b = 5 × 10⁻³, k = 5 × 10⁻³, for (b) c = 330, k = 50 and for (c) T_0 = 6, α = 0.2 as true parameters. log MSEs are computed respectively over 25, 25, and
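The complete and incomplete pendulum models above can be sketched directly, using the Table 1 values T_0 = 6 (so ω_0 = 2π/6) and α = 0.2; the explicit-Euler rollout and step size are our own minimal choices, not the paper's high-precision solver.

```python
import math

W0, ALPHA = 2 * math.pi / 6, 0.2   # proper pulsation (T_0 = 6) and damping

def pendulum_full(state):
    """Complete dynamics F_p^{omega_0, alpha}: damped non-linear pendulum."""
    theta, dtheta = state
    return (dtheta, -(W0 ** 2) * math.sin(theta) - ALPHA * dtheta)

def pendulum_frictionless(state):
    """Incomplete prior, Param ODE (omega_0): the damping term is missing."""
    theta, dtheta = state
    return (dtheta, -(W0 ** 2) * math.sin(theta))

def euler_rollout(f, state, dt=0.02, steps=1000):
    """Explicit Euler integration of the 2-D state (theta, dtheta/dt)."""
    traj = [state]
    for _ in range(steps):
        d = f(traj[-1])
        traj.append((traj[-1][0] + dt * d[0], traj[-1][1] + dt * d[1]))
    return traj

damped = euler_rollout(pendulum_full, (1.0, 0.0))
free = euler_rollout(pendulum_frictionless, (1.0, 0.0))
```

The damped trajectory decays while the frictionless one keeps oscillating at roughly constant amplitude: this residual decay is exactly what the augmentation F_a must account for under the incomplete prior.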

4.2. RESULTS

We analyze and discuss below the results obtained for the three kinds of dynamics. We successively examine different evaluation and quality criteria. The conclusions are consistent across the three problems, which allows us to highlight clear trends for all of them.

Forecasting accuracy

The data-driven models do not perform well compared to True PDE/ODE (all values are test errors expressed as log MSE): -4.6 for PredRNN++ vs. -9.17 for reaction-diffusion, -2.51 vs. -5.24 for the wave equation, and -2.84 vs. -8.44 for the pendulum in Table 1. The deep Galerkin method with complete physics, DGM (ω_0, α), being constrained by the equation, outperforms Neural ODE but remains far inferior to APHYNITY models. In the incomplete physics case, DGM (ω_0) fails to compensate for the missing information. The incomplete physical models, Param PDE (a, b) for the reaction-diffusion, Param PDE (c) for the wave equation, and Param ODE (ω_0) and Hamiltonian models for the damped pendulum, perform even worse than purely data-driven ones, as can be expected since they ignore important dynamical components, e.g. friction in the pendulum case. Using APHYNITY with these imperfect physical models greatly improves forecasting accuracy in all cases, significantly outperforming purely data-driven models, and reaching results often close to the accuracy of the true ODE, when APHYNITY and the true ODE models are integrated with the same numerical scheme (which is different from the one used for data generation, hence the non-null errors even for the true equations), e.g. -5.92 vs. -5.24 for the wave equation in Table 1. This clearly highlights the capacity of our approach to augment incomplete physical models with a learned data-driven component.

Physical parameter estimation

Confirming the phenomenon mentioned in the introduction and detailed in Appendix C, incomplete physical models can lead to bad estimates of the relevant physical parameters: errors of up to 67.6% and 10.4% on the parameters of the reaction-diffusion and wave equations respectively, and of more than 13% on the pendulum parameters in Table 1. APHYNITY is able to significantly improve physical parameter identification: 2.3% error for the reaction-diffusion, 0.3% for the wave equation, and 4% for the pendulum. This validates the fact that augmenting a simple physical model to compensate for its approximations is not only beneficial for prediction, but also helps to limit errors in parameter identification when dynamical models do not fit the data well. This is crucial for the interpretability and explainability of the estimates.

Ablation study

We conduct ablation studies to validate the importance of the APHYNITY augmentation compared to a naive strategy consisting in learning F = F_p + F_a without taking care of the quality of the decomposition, as done in (Wang et al., 2019; Mehta et al., 2020). Results shown in Table 1 of Appendix F show a consistent gain of APHYNITY for the three use cases and for all physical models: for instance, for Param PDE (a, b) in reaction-diffusion, both forecasting performance (log MSE = -5.10 vs. -4.56) and parameter identification (error = 2.33% vs. 6.39%) improve. Other ablation results are provided in Appendix F, showing the relevance of the trajectory-based approach described in Section 3.2 (vs. supervising over finite-difference approximations of the derivative F).

Flexibility When applied to complete physical models, APHYNITY does not degrade accuracy, contrary to a vanilla cooperation scheme (see ablations in Appendix F). This is due to the least-action principle of our approach: when the physical knowledge is sufficient for properly predicting the observed dynamics, the model learns to ignore the data-driven augmentation. This is shown by the norm of the trained neural net component F_a, reported in the last column of Table 1: as expected, ‖F_a‖² diminishes as the complexity of the corresponding physical model increases, and, relative to incomplete models, the norm becomes very small for complete physical models (for example in the pendulum experiments, we have ‖F_a‖ = 8.5 for the APHYNITY model, to be compared with 132 and 623 for the incomplete models). Thus, we see that the norm of F_a is a good indication of how imperfect the physical models F_p are. It highlights the flexibility of APHYNITY to successfully adapt to very different levels of prior knowledge. Note also that APHYNITY sometimes slightly improves over the true ODE, as it compensates for the error introduced by different numerical integration methods for data simulation and training (see Appendix E).
Qualitative visualizations

Results in Figure 2 for reaction-diffusion show that the incomplete diffusion parametric PDE in Figure 2(a) is unable to properly match ground truth simulations: the behavior of the two components in Figure 2(a) is reduced to simple independent diffusions due to the lack of interaction terms between u and v. By using APHYNITY in Figure 2(b), the correlation between the two components appears together with the formation of Turing patterns, which is very similar to the ground truth. This confirms that F_a can learn the reaction terms and improve prediction quality. In Figure 3, we see for the wave equation that the data-driven Neural ODE model fails at approximating dw/dt as the forecast horizon increases: it misses crucial details for the second component dw/dt, which makes the forecast diverge from the ground truth. APHYNITY incorporates a Laplacian term as well as the data-driven F_a, thus capturing the damping phenomenon and succeeding in maintaining physically sound results for long-term forecasts, unlike Neural ODE.

Extension to non-stationary dynamics We provide additional results in Appendix G to tackle datasets where physical parameters of the equations vary in each sequence. To this end, we design an encoder able to perform parameter estimation for each sequence. Results show that APHYNITY accommodates well to this setting, with similar trends as those reported in this section.

Additional illustrations

We give further visual illustrations to demonstrate how the estimation of parameters in incomplete physical models is improved with APHYNITY. For the reaction-diffusion equation, we show that the incomplete parametric PDE underestimates both diffusion coefficients. The difference is visually recognizable between the poorly estimated diffusion (Figure 4 

5. CONCLUSION

In this work, we introduce the APHYNITY framework that can efficiently augment approximate physical models with deep data-driven networks, performing similarly to models for which the underlying dynamics are entirely known. We exhibit the superiority of APHYNITY over data-driven, incomplete-physics, and state-of-the-art approaches combining ML and MB methods, both in terms of forecasting and parameter identification, on three different classes of physical systems. Besides, APHYNITY is flexible enough to adapt to different approximation levels of prior physical knowledge. An appealing perspective is the applicability of APHYNITY to partially observable settings, such as video prediction. Besides, we hope that the APHYNITY framework will open up the way to the design of a wide range of more flexible MB/ML models, e.g. in climate science, robotics or reinforcement learning. In particular, analyzing the theoretical decomposition properties in a partially observed setting is an important direction for future work.

A REMINDER ON PROXIMINAL AND CHEBYSHEV SETS

We begin by giving definitions of proximinal and Chebyshev sets, taken from (Fletcher & Moors, 2014):

Definition 1. A proximinal set of a normed space (E, ‖·‖) is a subset C ⊂ E such that every x ∈ E admits at least one nearest point in C.

Definition 2. A Chebyshev set of a normed space (E, ‖·‖) is a subset C ⊂ E such that every x ∈ E admits a unique nearest point in C.

Proximinality reduces to a compactness condition in finite-dimensional spaces. In general, it is a weaker one: boundedly compact sets verify this property, for example. In Euclidean spaces, Chebyshev sets are simply the closed convex subsets. Whether all Chebyshev sets of infinite-dimensional Hilbert spaces are closed convex sets is still an open question. In general, there exist examples of non-convex Chebyshev sets, a famous one being presented in (Johnson, 1987) for a non-complete inner-product space. Given the importance of this topic in approximation theory, finding necessary conditions for a set to be Chebyshev and studying the properties of those sets have been the subject of many efforts. Some of those properties are summarized below:
• The metric projection on a boundedly compact Chebyshev set is continuous.
• If the norm is strict, every closed convex set, in particular any finite-dimensional subspace, is Chebyshev.
• In a Hilbert space, every closed convex set is Chebyshev.

B PROOF OF PROPOSITIONS 1 AND 2

We prove the following result, which implies both propositions in the article:

Proposition 3. The optimization problem

min_{F_p ∈ F_p, F_a ∈ F} ‖F_a‖ subject to ∀X ∈ D, ∀t, dX_t/dt = (F_p + F_a)(X_t)   (5)

is equivalent to a metric projection onto F_p. If F_p is proximinal, Eq. (5) admits a minimizing pair. If F_p is Chebyshev, Eq. (5) admits a unique minimizing pair, in which F_p is the metric projection.

Proof. The idea is to reconstruct the full functional from the trajectories of D. By definition, A is the set of points reached by trajectories in D, so that A = {x ∈ R^d | ∃X_• ∈ D, ∃t, X_t = x}. Then let us define a function F_D in the following way: for a ∈ A, we can find X_• ∈ D and t_0 such that X_{t_0} = a. Differentiating X at t_0, which is possible by definition of D, we take F_D(a) = dX_t/dt |_{t=t_0}. For any (F_p, F_a) satisfying the constraint in Eq. (5), we then have (F_p + F_a)(a) = dX_t/dt |_{t_0} = F_D(a) for all a ∈ A. Conversely, any pair (F_p, F_a) ∈ F_p × F such that F_p + F_a = F_D verifies the constraint. Thus we have the equivalence between Eq. (5) and the metric projection formulated as:

minimize_{F_p ∈ F_p} ‖F_D − F_p‖   (6)

If F_p is proximinal, the projection problem admits a solution, which we denote F_p*. Taking F_a* = F_D − F_p*, we have F_p* + F_a* = F_D, so that (F_p*, F_a*) verifies the constraint of Eq. (5). Moreover, for any (F_p, F_a) satisfying the constraint of Eq. (5), we have F_p + F_a = F_D by what was shown above, and ‖F_a‖ = ‖F_D − F_p‖ ≥ ‖F_D − F_p*‖ = ‖F_a*‖ by definition of F_p*. This shows that (F_p*, F_a*) is minimal. Moreover, if F_p is a Chebyshev set, by uniqueness of the projection, if F_p ≠ F_p* then ‖F_a‖ > ‖F_a*‖. Thus the minimal pair is unique.

C PARAMETER ESTIMATION IN INCOMPLETE PHYSICAL MODELS

Classically, when a set F_p ⊂ F summarizing the most important properties of a system is available, this gives a simplified model of the true dynamics, and the adopted problem is then to fit the trajectories using this model as well as possible, solving:

minimize_{F_p ∈ F_p} E_{X∼D} L(X̃_{X_0}, X)  subject to  ∀g ∈ I, X̃^g_0 = g and ∀t, dX̃^g_t/dt = F_p(X̃^g_t)

where L is a discrepancy measure between trajectories. Recall that X̃_{X_0} is the trajectory produced by an ODE solver taking X_0 as initial condition. In other words, we try to find a function F_p whose trajectories are as close as possible to those of the dataset. While the estimation of the function becomes easier, a residual part is then left unexplained, and this can be a non-negligible issue in at least two ways:
• When F ∉ F_p, the loss is strictly positive at the minimum. This means that reducing the space of functions to F_p makes us lose accuracy.
• The obtained function F_p might not even be the most meaningful function of F_p, as it tries to capture phenomena which are not explainable with functions in F_p, thus giving the wrong bias to the calculated function. For example, if one considers a damped periodic trajectory where only the period can be learned within F_p but not the damping, the estimated period will partly account for the damping and will thus be biased. This is confirmed in Section 4 of the paper: the incomplete physical models augmented with APHYNITY obtain different, and experimentally better, physical identification results than the physical models alone.

Let us compare our approach with this one on the linearized damped pendulum to show how the estimates of physical parameters can differ. The equation is the following:

d²θ/dt² + ω₀² θ + α dθ/dt = 0

We take the same notations as in the article and parametrize the simplified physical models as:

F_p^a : X ↦ (dθ/dt, −aθ)

where a > 0 corresponds to ω₀².
The corresponding solution for an initial state X_0, which we denote X^a, can then be written explicitly as:

θ^a_t = θ_0 cos(√a t)

Let us consider damped pendulum solutions X written as:

θ_t = θ_0 e^{−t} cos t

which corresponds to:

F : X ↦ (dθ/dt, −2(θ + dθ/dt))

It is then easy to see that the estimate of a with the physical model alone can be obtained by minimizing:

∫_0^T |e^{−t} cos t − cos(√a t)|² dt

This expression depends on T; depending on the chosen time interval and on the way the integral is discretized, it will thus almost always give biased estimates. In other words, the estimated value of a will not give us the desired solution t ↦ cos t. On the other hand, in the APHYNITY framework, for a given a, the residual must be equal to:

F_r^a : X ↦ (0, (a − 2)θ − 2 dθ/dt)

in order to satisfy the fitting constraint. Here a corresponds to 1 + ω₀², not to ω₀² as in the simplified case. Minimizing its norm, we obtain a = 2, which gives us the desired solution θ_t = θ_0 e^{−t} cos t, with the right period.
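This bias can be made concrete numerically. The sketch below (grid search over a, horizon T = 10, and the symmetric state-space box are illustrative choices, not taken from the paper) contrasts the two estimates: the trajectory loss of the simplified undamped model never vanishes, while minimizing the residual norm over a symmetric region of the state space recovers a = 2:

```python
import numpy as np

t = np.linspace(0.0, 10.0, 2000)
target = np.exp(-t) * np.cos(t)              # damped solution, theta_0 = 1

def traj_loss(a):
    # discretized trajectory loss of the undamped model theta_a(t) = cos(sqrt(a) t)
    return ((target - np.cos(np.sqrt(a) * t)) ** 2).sum() * (t[1] - t[0])

def residual_norm(a):
    # norm of F_r^a(X) = (0, (a - 2)*theta - 2*dtheta/dt) over a symmetric box
    th, thd = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41))
    return np.sqrt(np.mean(((a - 2.0) * th - 2.0 * thd) ** 2))

grid = np.linspace(0.1, 10.0, 500)
a_traj = grid[np.argmin([traj_loss(a) for a in grid])]    # biased, T-dependent
a_aph = grid[np.argmin([residual_norm(a) for a in grid])]  # close to 2
```

Because no undamped solution can match a damped one, the trajectory loss stays bounded away from zero; the residual-norm criterion, in contrast, has its minimum at the desired a = 2.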

D DISCUSSION ON SUPERVISION OVER DERIVATIVES

In order to find the appropriate decomposition (F_p, F_a), we use a trajectory-based error by solving:

minimize_{F_p ∈ F_p, F_a ∈ F} ‖F_a‖  subject to  ∀g ∈ I, X̃^g_0 = g and ∀t, dX̃^g_t/dt = (F_p + F_a)(X̃^g_t), ∀X ∈ D, L(X, X̃_{X_0}) = 0   (8)

In the continuous setting where the data is available at all times t, this problem is in fact equivalent to the following one:

minimize_{F_p ∈ F_p} E_{X∼D} ‖dX_t/dt − F_p(X_t)‖   (9)

where the supervision is done directly over derivatives, obtained through finite-difference schemes. This echoes the proof in Section B of the Appendix, where F can be reconstructed from the continuous data. However, in practice, data is only available at discrete times with a certain temporal resolution. While Eq. (9) is indeed equivalent to Eq. (8) in the continuous setting, in the practical discrete one the way error propagates is no longer the same: for Eq. (8) it is controlled over integrated trajectories, while for Eq. (9) the supervision is over the approximate derivatives of the trajectories from the dataset. We argue that the trajectory-based approach is more flexible and more robust, for the following reasons:
• In Eq. (8), if F_a is appropriately parameterized, it is possible to perfectly fit the data trajectories at the sampled points.
• The use of finite-difference schemes to estimate F, as is done in Eq. (9), necessarily induces a non-zero discretization error.
• This discretization error is explosive in terms of divergence from the true trajectories.

The last point is quite important, especially when time sampling is sparse (even though we do observe this adverse effect empirically in our experiments, with relatively finely time-sampled trajectories). The following gives a heuristic reasoning as to why this is the case. Let F̃ = F + ε be the function estimated from the sampled points, with an error ε such that ‖ε‖_∞ ≤ α.
Denoting X̃ the corresponding trajectory generated by F̃, we then have, for all X ∈ D:

∀t, d(X − X̃)_t/dt = F(X_t) − F(X̃_t) − ε(X̃_t)

Integrating over [0, T] and using the triangle inequality as well as the mean value inequality, supposing that F has uniformly bounded spatial derivatives:

∀t ∈ [0, T], ‖(X − X̃)_t‖ ≤ ‖∇F‖_∞ ∫_0^t ‖X_s − X̃_s‖ ds + αt

which, using a variant of the Grönwall lemma, gives us the inequality:

∀t ∈ [0, T], ‖X_t − X̃_t‖ ≤ (α / ‖∇F‖_∞)(exp(‖∇F‖_∞ t) − 1)

When α tends to 0, we recover the true trajectories X. However, as α is bounded away from 0 by the available temporal resolution, this inequality gives a rough estimate of the way X̃ diverges from them, and it can be an equality in many cases. This exponential behaviour explains our choice of a trajectory-based optimization.
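Both effects can be checked on toy examples (the sine signal, step sizes, and the scalar linear ODE below are illustrative choices, not the paper's data): finite differences carry an error set by the sampling step, and for dx/dt = x, where ‖∇F‖_∞ = 1, a constant derivative error α produces a deviation of exactly α(e^t − 1), i.e. the Grönwall bound above with equality:

```python
import numpy as np

# (1) Finite-difference derivative error at a fixed temporal resolution.
def fd_error(dt):
    t = np.arange(0.0, 10.0, dt)
    dx_fd = np.gradient(np.sin(t), dt, edge_order=2)  # FD estimate of dx/dt
    return np.max(np.abs(dx_fd - np.cos(t)))          # vs. exact derivative

coarse, fine = fd_error(0.1), fd_error(0.05)
# second-order scheme: halving dt shrinks the error roughly 4x, but it never
# vanishes at a fixed resolution

# (2) Exponential amplification of that error along integrated trajectories.
alpha = 1e-3
t = np.linspace(0.0, 5.0, 100)
x_true = np.exp(t)                          # solution of dx/dt = x, x(0) = 1
x_pert = (1.0 + alpha) * np.exp(t) - alpha  # solution of dx/dt = x + alpha
deviation = np.abs(x_pert - x_true)
bound = alpha * (np.exp(t) - 1.0)           # Gronwall bound, tight here
```

Even a derivative error of 10^-3 is amplified by roughly e^t − 1, which is the behaviour the trajectory-based loss avoids.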

E IMPLEMENTATION DETAILS

We describe here the three use cases studied in the paper for validating APHYNITY. All experiments are implemented with PyTorch (Paszke et al., 2019), and the differentiable ODE solvers with the adjoint method are those implemented in torchdiffeq.

E.1 REACTION-DIFFUSION EQUATIONS

The system is driven by a FitzHugh-Nagumo type PDE (Klaasen & Troy, 1984):

∂u/∂t = a∆u + R_u(u, v; k),  ∂v/∂t = b∆v + R_v(u, v)

where a and b are respectively the diffusion coefficients of u and v, and ∆ is the Laplace operator. The local reaction terms are R_u(u, v; k) = u − u³ − k − v and R_v(u, v) = u − v. The state X = (u, v) is defined over a compact rectangular domain Ω = [−1, 1]² with periodic boundary conditions. Ω is spatially discretized with a 32 × 32 2D uniform square mesh grid. The periodic boundary condition is implemented with circular padding around the borders, and ∆ is systematically estimated with a 3 × 3 discrete Laplace operator.

Dataset  Starting from a randomly sampled initial state X_init ∈ [0, 1]^{2×32×32}, we generate states by integrating the true PDE with coefficients fixed across the dataset (a = 1×10⁻³, b = 5×10⁻³, k = 5×10⁻³). We first simulate high time-resolution (δt_sim = 0.001) sequences with an explicit finite-difference method. We then extract states every δt_data = 0.1 to construct our low time-resolution datasets. We set the time of the random initial state to t = −0.5 and the time horizon to t = 2.5. 1920 sequences are generated, with 1600 for training/validation and 320 for test. We take the state at t = 0 as X_0 and predict the sequence until the horizon (equivalent to 25 time steps) in all reaction-diffusion experiments. Note that the sub-sequences with t < 0 are reserved for the extensive experiments in Appendix G.1.

Neural network architectures  Our F_a here is a 3-layer convolutional network (ConvNet). Its two input channels are (u, v) and its two output channels are (∂u/∂t, ∂v/∂t). The purely data-driven Neural ODE uses the same ConvNet as its F. The detailed architecture is provided in Table 2. The estimated physical parameters θ_p in F_p are simply a trainable vector (a, b) ∈ R²₊ or (a, b, k) ∈ R³₊.
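One simulation step can be sketched in a few lines of numpy (illustrative, not the paper's PyTorch implementation; the 3 × 3 discrete Laplace operator is rendered here as the standard 5-point stencil, with circular padding obtained via np.roll):

```python
import numpy as np

def laplacian(f):
    # 5-point discrete Laplacian with periodic (circular-padding) boundaries
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0)
            + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f)

def fhn_step(u, v, a=1e-3, b=5e-3, k=5e-3, dt=1e-3):
    # one explicit Euler step of the FitzHugh-Nagumo system
    du = a * laplacian(u) + u - u ** 3 - k - v   # a*Lap(u) + R_u(u, v; k)
    dv = b * laplacian(v) + u - v                # b*Lap(v) + R_v(u, v)
    return u + dt * du, v + dt * dv

rng = np.random.default_rng(0)
u = rng.random((32, 32))     # random initial state in [0, 1]^(32x32)
v = rng.random((32, 32))
u1, v1 = fhn_step(u, v)
```

The circular `np.roll` shifts implement exactly the periodic boundary condition described above; on a constant field the discrete Laplacian vanishes, as expected.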
Table 2 (excerpt): 3 × 3 kernel, 16 input channels, 2 output channels, 1 pixel zero padding.

Optimization hyperparameters  We apply the same hyperparameters for all the reaction-diffusion experiments: N_iter = 1, λ₀ = 1, τ₁ = 1×10⁻³, τ₂ = 1×10³.

E.2 WAVE EQUATIONS

The damped wave equation is defined by

∂²w/∂t² − c²∆w + k ∂w/∂t = 0

where c is the wave speed and k is the damping coefficient. The state is X = (w, ∂w/∂t). We consider a compact spatial domain Ω represented as a 64 × 64 grid and discretize the Laplace operator similarly: ∆ is implemented using a 5 × 5 discrete Laplace operator in the simulation and a 3 × 3 one in the experiments. A null Neumann boundary condition is imposed for generation.

Dataset  δt was set to 0.001 to respect the Courant number and provide stable integration. The simulation was integrated using a 4th-order Runge-Kutta scheme (with finite differences in space) for 300 steps from an initial Gaussian state, i.e. for all sequences, at t = 0 we have:

w(x, y, t = 0) = C × exp(−((x − x₀)² + (y − y₀)²) / σ²)

The amplitude C is fixed to 1, and (x₀, y₀) = (32, 32) so that the Gaussian curve is centered for all sequences. However, σ is different for each sequence, uniformly sampled in [10, 100]. The same δt was used for train and test. All initial conditions are Gaussian with varying amplitudes. 250 sequences are generated, 200 of which are used for training while 50 are reserved as a test set. In the main paper setting, c = 330 and k = 50. As in the reaction-diffusion case, the algorithm takes as input a state X_{t₀} = (w, ∂w/∂t)(t₀) and predicts all states from t₀ + δt up to t₀ + 25δt.
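The initial state described above can be sketched as follows (numpy, illustrative; the exponent is negative so the bump decays away from the center, C = 1 and (x₀, y₀) = (32, 32), with σ sampled per sequence in [10, 100]):

```python
import numpy as np

def gaussian_state(sigma, n=64, c_amp=1.0, x0=32.0, y0=32.0):
    # centered Gaussian bump w(x, y, 0) and zero initial velocity dw/dt
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    w0 = c_amp * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / sigma ** 2)
    return w0, np.zeros_like(w0)          # state X = (w, dw/dt)

rng = np.random.default_rng(0)
sigma = rng.uniform(10.0, 100.0)          # per-sequence width
w0, dw0 = gaussian_state(sigma)
```

The peak value at the grid center is exactly C, and the zero-velocity second component matches the state definition X = (w, ∂w/∂t) at t = 0.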

Neural network architectures

The neural network for F_a is a 3-layer convolutional neural network with the same architecture as in Table 2. For F_p, the parameter(s) to be estimated are either a scalar c ∈ R₊ or a vector (c, k) ∈ R²₊. Similarly, the Neural ODE networks are built as presented in Table 2.

Optimization hyperparameters  We use the same hyperparameters for all the wave equation experiments: N_iter = 3, λ₀ = 1, τ₁ = 1×10⁻⁴, τ₂ = 1×10².
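For reference, the physical component F_p of this system maps X = (w, ∂w/∂t) to (∂w/∂t, c²∆w − k ∂w/∂t); a numpy sketch (5-point Laplacian with edge padding as a stand-in for the null Neumann boundary, all illustrative):

```python
import numpy as np

def laplacian(f):
    # 5-point stencil; edge padding mimics a zero-flux (Neumann-like) boundary
    g = np.pad(f, 1, mode="edge")
    return g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * f

def wave_F(w, dw, c=330.0, k=50.0):
    # damped wave equation written as a first-order system: d(w, dw/dt)/dt
    return dw, c ** 2 * laplacian(w) - k * dw

w = np.zeros((64, 64)); w[32, 32] = 1.0   # toy spike state
dw = np.zeros((64, 64))
dwdt, d2wdt2 = wave_F(w, dw)
```

In APHYNITY, the learned F_a would be added to this F_p before being fed to the ODE solver; on a constant field the Laplacian term vanishes, and with zero velocity only the c²∆w term drives the acceleration.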

E.3 DAMPED PENDULUM

We consider the non-linear damped pendulum problem, governed by the ODE

d²θ/dt² + ω₀² sin θ + α dθ/dt = 0

where θ(t) is the angle, ω₀ = 2π/T₀ is the proper pulsation (T₀ being the period) and α is the damping coefficient. With the state X = (θ, dθ/dt), the ODE can be written as dX_t/dt = F(X_t) with F : X ↦ (dθ/dt, −ω₀² sin θ − α dθ/dt).

Dataset  For each train/validation/test split, we simulate a dataset with 25 trajectories of 40 timesteps (time interval [0, 20], timestep δt = 0.5) with fixed ODE coefficients (T₀ = 12, α = 0.2) and varying initial conditions. The simulation integrator is the Dormand-Prince Runge-Kutta method of order 4(5) (DOPRI5, Dormand & Prince, 1980). We also add a small amount of white Gaussian noise (σ = 0.01) to the state. Note that our pendulum dataset is much more challenging than the ideal frictionless pendulum considered in Greydanus et al. (2019).

Neural network architectures  We detail in Table 3 the neural architectures used for the damped pendulum experiments. All data-driven augmentations approximating the mapping X_t ↦ F(X_t) are implemented by multi-layer perceptrons (MLP) with 3 layers of 200 neurons and ReLU activation functions (except at the last layer: linear activation). The Hamiltonian models (Greydanus et al., 2019; Toth et al., 2020) are implemented by an MLP that takes the state X_t and outputs a scalar estimation of the Hamiltonian H of the system; the derivative is then computed by an in-graph gradient of H with respect to the input:

F(X_t) = (∂H/∂(dθ/dt), −∂H/∂θ)

Optimization hyperparameters  The hyperparameters of the APHYNITY optimization algorithm (N_iter, λ₀, τ₁, τ₂) were cross-validated on the validation set and are shown in Table 4. All models were trained with a maximum number of 5000 steps with early stopping.
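The dataset generation can be sketched as follows (numpy with a fixed-step RK4 in place of the adaptive DOPRI5 integrator, an illustrative simplification; parameters as above):

```python
import numpy as np

T0, alpha, dt, n_steps = 12.0, 0.2, 0.5, 40
omega0 = 2.0 * np.pi / T0

def f(x):
    # damped pendulum as a first-order system on X = (theta, dtheta/dt)
    theta, dtheta = x
    return np.array([dtheta, -omega0 ** 2 * np.sin(theta) - alpha * dtheta])

def rk4_step(x, h):
    k1 = f(x)
    k2 = f(x + h / 2 * k1)
    k3 = f(x + h / 2 * k2)
    k4 = f(x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])                  # one initial condition (theta, dtheta/dt)
traj = [x]
for _ in range(n_steps - 1):
    x = rk4_step(x, dt)
    traj.append(x)
traj = np.array(traj) + rng.normal(0.0, 0.01, (n_steps, 2))  # sigma = 0.01 noise
```

The friction term makes the oscillation amplitude decay over the 20-second window, which is what distinguishes this dataset from the ideal frictionless setting.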

F ABLATION STUDY

We conduct ablation studies to show the effectiveness of APHYNITY's adaptive optimization and trajectory-based learning scheme.

F.1 ABLATION TO VANILLA MB/ML COOPERATION

In Table 5, we consider the ablation case with the vanilla augmentation scheme found in Le Guen & Thome (2020); Wang et al. (2019); Mehta et al. (2020), which does not offer any proper decomposition guarantee. We observe that the APHYNITY cooperation scheme outperforms this vanilla scheme in all cases, both in terms of forecasting performance (e.g. log MSE = −0.35 vs. −3.97 for the Hamiltonian in the pendulum case) and parameter identification (e.g. Err Param = 8.4% vs. 2.3% for Param PDE (a, b) in reaction-diffusion). This confirms the crucial benefits of APHYNITY's principled decomposition scheme.

We also conduct an extensive evaluation in a setting with varying diffusion parameters for the reaction-diffusion equations. The only varying parameters are the diffusion coefficients, i.e. individual a and b for each sequence. We randomly sample a ∈ [1×10⁻³, 2×10⁻³] and b ∈ [3×10⁻³, 7×10⁻³]; k is still fixed to 5×10⁻³ across the dataset. In order to estimate a and b for each sequence, we use here a ConvNet encoder E to estimate the parameters from 5 reserved frames (t < 0). The architecture of E is similar to the one in Table 2, except that E takes 5 frames (10 channels) as input and outputs a vector of estimated (ã, b̃) after applying a sigmoid activation scaled by 1×10⁻² (to avoid possible divergence). For the baseline Neural ODE, we concatenate a and b to each sequence as two channels. In Table 7, we observe that combining data-driven and physical components outperforms the purely data-driven one. When applying APHYNITY to Param PDE (a, b), the prediction precision is significantly improved (log MSE: −1.32 vs. −4.32), with the errors on a and b respectively reduced from 55.6% and 54.1% to 11.8% and 18.7%.
For complete physics cases, the parameter estimations are also improved for Param PDE (a, b, k): the error on b is reduced by over 60% (3.10 vs. 1.23), and the errors on a and k by 10% to 20% (resp. 1.55/0.59 vs. 1.29/0.39). These extensive results reflect the same conclusion as the main article: APHYNITY improves both the prediction precision and the parameter estimation. The same decreasing tendency of ‖F_a‖ is also confirmed.
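The adaptive optimization scheme used throughout these ablations can be illustrated on a toy constrained decomposition (all step sizes and numbers below are illustrative, not Algorithm 1's actual hyperparameters): minimize ‖F_a‖ under the constraint F_p + F_a = F_D, with F_p restricted to a line, solved by gradient steps on ‖f_a‖² + λ·(constraint violation) while λ is increased proportionally to the remaining violation:

```python
import numpy as np

f_D = np.array([3.0, 4.0])        # toy analogue of the observed derivative field
u = np.array([1.0, 0.0])          # admissible "physical" direction: f_p = p * u
p, f_a = 0.0, np.zeros(2)
lam, tau1, tau2 = 1.0, 1e-2, 0.1  # illustrative analogues of (lambda_0, tau_1, tau_2)

for _ in range(2000):
    r = p * u + f_a - f_D                  # constraint violation (L_traj analogue)
    p -= tau1 * (2.0 * lam * np.dot(r, u))  # gradient step on the physical part
    f_a -= tau1 * (2.0 * f_a + 2.0 * lam * r)  # gradient step on the augmentation
    lam += tau2 * np.dot(r, r)             # adaptive increase of the multiplier
# as lam grows, (p, f_a) approaches the projection decomposition (3, (0, 4))
```

With a fixed λ, the augmentation would absorb part of what the physical component can explain; letting λ grow drives the constraint violation to zero while keeping ‖f_a‖ minimal, which is the decomposition guarantee the vanilla scheme lacks.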



Footnotes:
• A proximinal set is one from which every point of the space has at least one nearest point. A Chebyshev set is one from which every point of the space has a unique nearest point. More details in Appendix A.
• Convergence to a local minimum isn't necessary; a few steps are often sufficient for a successful optimization.
• This integration scheme is thus different from the one used for data generation, the rationale for this choice being that, when training a model, one does not know how exactly the data has been generated.
• This is true in theory, although not necessarily in practice when F overfits a small dataset.
• https://github.com/rtqichen/torchdiffeq



Figure 1: Predicted dynamics for the damped pendulum vs. ground truth (GT) trajectories of d²θ/dt² + ω₀² sin θ + α dθ/dt = 0. We show that in (a) the data-driven approach (Chen et al., 2018) fails to properly learn the dynamics due to the lack of training data, while in (b) an ideal pendulum cannot take friction into account. The proposed APHYNITY, shown in (c), augments the over-simplified physical model in (b) with a data-driven component. APHYNITY improves both forecasting (MSE) and parameter identification (Error T₀) compared to (b).

Figure 2: Comparison of predictions of the two components u (top) and v (bottom) of the reaction-diffusion system. Note that t = 4 is largely beyond the dataset horizon (t = 2.5).

Figure 3: Comparison between the prediction of APHYNITY when c is estimated and Neural ODE for the damped wave equation. Note that t + 32 (last column for (a, b, c)) is already beyond the training time horizon (t + 25), showing the consistency of the APHYNITY method.

There is a visible gap between the diffusion obtained with the incomplete physical model (Figure 4(a)) and the true one (Figure 4(c)), while APHYNITY gives a fairly good estimation of those diffusion parameters, as shown in Figure 4(b).

Figure 4: Diffusion predictions using coefficients learned with (a) the incomplete physical model Param PDE (a, b) and (b) APHYNITY-augmented Param PDE (a, b), compared with (c) the true diffusion.

40 predicted time-steps. %Err param. averages the results when several physical parameters are present. For each level of incorporated physical knowledge, equivalent best results according to a Student t-test are shown in bold. n/a corresponds to non-applicable cases.

ConvNet architecture in the reaction-diffusion and wave equation experiments, used as the data-driven derivative operator in APHYNITY and Neural ODE (Chen et al., 2018).

Neural network architectures for the damped pendulum experiments. n/a corresponds to non-applicable cases.

Hyperparameters of the damped pendulum experiments.

Detailed ablation study on supervision and optimization for the reaction-diffusion equation, wave equation and damped pendulum.

ACKNOWLEDGEMENTS:

Funding (P. Gallinari), Chaires de recherche et d'enseignement en intelligence artificielle (Chaires IA), DL4Clim project.

annex

Published as a conference paper at ICLR 2021

Table 5: Ablation study comparing APHYNITY to the vanilla augmentation scheme (Wang et al., 2019; Mehta et al., 2020).

We also conduct two other ablations in Table 6:
• derivative supervision: F_p + F_a is trained with supervision over approximated derivatives on the ground truth trajectory, as performed in Greydanus et al. (2019); Cranmer et al. (2020). More precisely, APHYNITY's L_traj is here replaced with L_deriv = ‖dX_t/dt − F(X_t)‖ as in Eq. (9), where dX_t/dt is approximated by finite differences on X_t.
• non-adaptive optim.: we train APHYNITY by minimizing ‖F_a‖ without the adaptive optimization of λ shown in Algorithm 1. This case is equivalent to λ = 1, τ₂ = 0.

We highlight the importance of using a principled adaptive optimization algorithm (the APHYNITY algorithm described in the paper) compared to a non-adaptive one: for example, in the reaction-diffusion case, log MSE = −4.55 vs. −5.10 for Param PDE (a, b). Finally, when the supervision occurs on the derivative, both forecasting and parameter identification results are systematically worse than with APHYNITY's trajectory-based approach: for example, log MSE = −1.16 vs. −4.64 for Param PDE (c) in the wave equation. This confirms the good properties of the APHYNITY training scheme.

We conduct an experiment where each sequence is generated with a different wave celerity. This dataset is challenging because both c and the initial conditions vary across the sequences. For each simulated sequence, an initial condition is sampled as described previously, along with a wave celerity c sampled uniformly in [300, 400]. The initial state is then integrated with the same Runge-Kutta scheme. 200 such sequences are generated for training, while 50 are kept for testing. For this experiment, we also use a ConvNet encoder to estimate the wave speed c from 5 consecutive reserved states (w, ∂w/∂t).
The architecture of the encoder E is the same as in Table 2, but with 10 input channels. Here also, k is fixed for all sequences, with k = 50. The hyperparameters used in these experiments are the same as described in Section E.2. The results when multiple wave speeds c are present in the dataset are consistent with those obtained when only one is considered: while prediction performance is slightly hindered, the parameter estimation remains consistent for both c and k. This extension attests to the robustness and adaptability of our method in more complex settings. Finally, the purely data-driven Neural ODE fails to cope with the increased difficulty.

To extend the experiments conducted in the paper (Section 4) with fixed parameters (T₀ = 6, α = 0.2) and varying initial conditions, we evaluate APHYNITY on a much more challenging dataset where both the parameters (T₀, α) and the initial conditions vary between trajectories. We simulate 500/50/50 trajectories for the train/valid/test sets, integrated with DOPRI5. For each trajectory, the period T₀ (resp. the damping coefficient α) is sampled uniformly in the range [3, 10] (resp. [0, 0.5]). We train models that take the first 20 steps as input and predict the next 20 steps. To account for the varying ODE parameters between sequences, we use an encoder that estimates the parameters based on the first 20 timesteps: in practice, a recurrent encoder composed of 1 layer of 128 GRU units. The output of the encoder is fed as additional input to the data-driven augmentation models and, when necessary, to an MLP with final softplus activations to estimate the physical parameters (ω₀ ∈ R₊ for Param ODE (ω₀), (ω₀, α) ∈ R²₊ for Param ODE (ω₀, α)). In this varying ODE context, we also compare to the state-of-the-art univariate time series forecasting method N-Beats (Oreshkin et al., 2020). Results shown in Table 9 are consistent with those presented in the paper.
Purely data-driven models, Neural ODE (Chen et al., 2018) and N-Beats (Oreshkin et al., 2020), fail to properly extrapolate the pendulum dynamics. Incomplete physical models (Hamiltonian and Param ODE (ω₀)) are even worse, since they do not account for friction. Augmenting them with APHYNITY significantly and consistently improves forecasting results and parameter identification.

