NEURAL NETWORK DIFFERENTIAL EQUATION SOLVERS ALLOW UNSUPERVISED ERROR ANALYSIS AND CORRECTION

Abstract

Neural Network Differential Equation (NN DE) solvers have surged in popularity due to a combination of factors: computational advances making their optimization more tractable, their capacity to handle high dimensional problems, their easy interpretability, etc. However, most NN DE solvers suffer from a fundamental limitation: their loss functions do not depend explicitly on the errors associated with the solution estimates. As such, validation and error estimation usually require knowledge of the true solution. Indeed, when the true solution is unknown, we are often reduced to simply hoping that a "low enough" loss implies "small enough" errors, since explicit relationships between the two are not available. In this work, we describe a general strategy for efficiently constructing error estimates and corrections for Neural Network Differential Equation solvers. Our methods do not require a priori knowledge of the true solutions and obtain explicit relationships between loss functions and the errors. In turn, these explicit relationships allow for the unsupervised estimation and correction of the model errors.

1. INTRODUCTION

Deep learning has heralded new methods for many scientific disciplines, and the field of numerical methods for differential equations has been no exception (1; 2; 3; 4). Deep neural network based differential equation (NN DE) solvers have been proposed under a variety of different names (PINNs (2), DGM (3), etc), each catering to various classes of problems. However, they all share certain common features: the use of a known differential equation (DE) as the central component of an appropriate loss function, the use of existing knowledge (boundary conditions, experimental/synthetic data, etc) to constrain the search for solutions, randomized optimization methods that sample from the domain of interest at a requisite resolution, etc. We investigate another common facet of many NN DE solvers: the lack of unsupervised error quantification/correction methods to estimate model errors without prior knowledge of the solution. Most solvers use the equation based loss functions as a surrogate measure for the error. However, while such a measure is intuitively related to the error in the solution model, an explicit description of that connection is mandatory if error quantification is to be done without knowing the solution. We achieve these goals by explicitly relating the loss terms and the model error. We showcase how these connections allow results on the model error that don't rely on prior knowledge of the solution. We propose techniques by which these results can be used to build significantly more efficient NN DE solvers, with only marginal increases in computational complexity. We formalize our ideas into four theorems, two inequalities, and two algorithms. We validate our claims with a collection of numerical experiments on several non-trivial DEs (including nonlinear PDEs). For the sake of readability and simplicity, all proofs have been rigorously presented in the appendices, while the main text simply reports the results and discusses their significance.
An associated codebase is also provided, with inbuilt options for the DEs already studied as part of this work. However, the codebase has been designed so that the users may easily add their own DEs of interest (the assumptions under which this work is valid should be general enough for a wide variety of DEs from many different scientific disciplines).

2. NEURAL NETWORK DIFFERENTIAL EQUATION SOLVERS

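For ease of discussion, we let $\mathcal{D} \subseteq \mathbb{R}^d$ denote some closed, bounded, path connected domain of interest for some differential equation (DE). Let $\partial\mathcal{D} \subseteq \mathcal{D}$ be the portion of the domain over which some constraint conditions on the solution exist (usually obtained as boundary conditions, empirical data, etc). Assume that our chosen DE, when given unique constraints over $\partial\mathcal{D}$, admits a unique solution $\Phi : \mathcal{D} \to \mathbb{R}^D$. We wish to consider a (possibly non-linear) equation operator $\mathbf{F} : G \to H$, where $G$ and $H$ are some suitable spaces of functions over $\mathcal{D}$ ($\Phi \in G$). We decompose $\mathbf{F}$ as $\mathbf{F}[\cdot] = \mathbf{L}[\cdot] + \mathbf{N}[\cdot] + \mathbf{C}$, where $\mathbf{L}$ represents the term(s) which depend linearly on $\Phi$, $\mathbf{N}$ represents the term(s) which depend non-linearly on $\Phi$, and $\mathbf{C}$ collects the terms independent of $\Phi$. This additive decomposition into linear, nonlinear, and constant terms is always possible: for linear DEs, $\mathbf{N} \equiv 0$ (and $\mathbf{C} \equiv 0$ as well when the DE is homogeneous). We have

$$\mathbf{F}[\Phi] = \mathbf{L}[\Phi] + \mathbf{N}[\Phi] + \mathbf{C} = 0 .$$

Let us assume we wish to construct an NN based approximation $\mathcal{N} : \mathcal{D} \to \mathbb{R}^D$. Let us also assume the NN uses analytic activation functions so that $\mathcal{N} \in C^\infty$. Let $W$ be its width (neurons per hidden layer) and $D_{\mathcal{N}}$ be its depth (number of hidden layers). Finally, let $w \equiv \{ b^1_1, w \ldots$ [...]

These theoretical guarantees may be practically realized by leveraging the minimum regularity expected from $\Phi$. Recall that $W^{k,\infty}(\mathcal{D})$ is the Sobolev space of order $k$, on compact $\mathcal{D} \subseteq \mathbb{R}^d$, w.r.t. the $L^\infty$ norm. Let us assume that $\Phi \in W^{k,\infty}(\mathcal{D})$. There exists some $\mathcal{N}$ with $(W, D_{\mathcal{N}})$ that can $\varepsilon$-approximate $\Phi : \mathcal{D} \to \mathbb{R}$, if $W = d + 1$ and $D_{\mathcal{N}} = O\left(\mathrm{diam}(\mathcal{D}) / \omega_f^{-1}(\varepsilon)\right)^d$ (7), where $\omega_f^{-1}(\varepsilon) = \sup\{\delta : \omega_f(\delta) \le \varepsilon\}$, $\delta = |x_1 - x_2|$, $x_1, x_2 \in \mathcal{D}$. There also exist $\mathcal{N}$ that $\varepsilon$-approximate $\Phi : \mathcal{D} \to \mathbb{R}$, if $M = O\left(\ln(1/\varepsilon)\, \varepsilon^{-d/k}\right)$ and $D_{\mathcal{N}} = O\left(\ln(1/\varepsilon)\right)$ (8), with $M$ the number of parameters in the model.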

2.2. OPTIMIZATION

The loss function $L$ used to train $\mathcal{N}$ usually takes the following form (2; 3; 4):

$$L = \mathbb{E}_{x_1 \in \mathcal{D}} \left\| \mathbf{F}[\mathcal{N}(x_1)] \right\|_p + \mathbb{E}_{x_2 \in \partial\mathcal{D}} \left\| \Phi(x_2) - \mathcal{N}(x_2) \right\|_p \tag{2}$$

where $\|\cdot\|_p$ is the usual $p$-norm on $\mathbb{R}^D$. Variants of Stochastic Gradient Descent (SGD) are almost always capable of eventually reaching adequately small loss values for large enough $(W, D_{\mathcal{N}})$ under such a loss (9). Thus, as $L \to 0$ during optimization, the uniqueness of $\Phi$ implies $\mathcal{N} \to \Phi$ over $\mathcal{D}$ (formally shown in Theorem 1).
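As a minimal sketch of how the two terms of Eq. 2 are assembled, the following pure-Python example estimates the loss for a toy problem, $\Phi' + \Phi = 0$ on $[0, 1]$ with $\Phi(0) = 1$, using the mean squared variant that appears later in the experiments. The toy DE, the finite-difference derivative, and every function name here are illustrative assumptions and are not taken from the accompanying codebase; a real NN DE solver would typically use automatic differentiation on a network instead.

```python
import math
import random

def residual(model, x, h=1e-5):
    """F[N](x) = N'(x) + N(x); N' comes from a central difference (assumption)."""
    d_model = (model(x + h) - model(x - h)) / (2.0 * h)
    return d_model + model(x)

def de_loss(model, n_samples=256, seed=0):
    """Monte Carlo estimate of the equation term plus the constraint term."""
    rng = random.Random(seed)
    interior = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
    eq_term = sum(residual(model, x) ** 2 for x in interior) / n_samples
    # The constraint set is the single point x = 0, where Phi(0) = 1 is known.
    bc_term = (1.0 - model(0.0)) ** 2
    return eq_term + bc_term

exact = lambda x: math.exp(-x)   # true solution, used only as a sanity check
crude = lambda x: 1.0 - x        # satisfies the constraint but not the DE

loss_exact = de_loss(exact)      # close to zero
loss_crude = de_loss(crude)      # clearly positive (roughly the mean of x^2)
```

The candidate `crude` matches the constraint exactly, so its entire loss comes from the equation term; this is the situation in which the loss is the only available signal about the error.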

2.3. ALTERNATIVE PARAMETERIZATION OF THE CONSTRAINT CONDITIONS ON ∂D

Sometimes, the constraint conditions (such as initial or boundary conditions) may be enforced by using the following parametrization as the model $\mathcal{N}$ for $\Phi$, at some arbitrary $x_1 \in \mathcal{D}$:

$$\mathcal{N}(x_1) = \Phi(x_2) + \mathrm{dist}(x_1, x_2)\, \mathcal{N}_o$$

where $x_2 \in \partial\mathcal{D}$ is some appropriate nearest constraint point to $x_1$. Here, $\mathcal{N}_o$ represents the NN and $\mathrm{dist}(x_1, x_2)$ is some metric that enforces that the model $\mathcal{N}$ is always exact over $\partial\mathcal{D}$, while allowing the NN flexibility to learn $\Phi$ elsewhere (e.g. $1 - e^{-\|x_1 - x_2\|}$, as used in (10)). This parametrization can eliminate the need for the $\mathbb{E}_{x_2 \in \partial\mathcal{D}}$ term in Eq. 2. Our work is applicable in such scenarios too.
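The parametrization above can be sketched in a few lines. The example below assumes an initial value problem on $[0, 1]$ with a single constraint point $x_2 = 0$ and the value $\Phi(0) = 2$; the wrapper name `constrained_model` and the stand-in for the raw network are hypothetical and chosen only for illustration.

```python
import math

def constrained_model(raw_net, phi_constraint, nearest_constraint_point):
    """Wrap raw_net so that the returned model is exact on the constraint set."""
    def model(x):
        x2 = nearest_constraint_point(x)
        dist = 1.0 - math.exp(-abs(x - x2))   # vanishes exactly when x == x2
        return phi_constraint(x2) + dist * raw_net(x)
    return model

# Toy setup (assumed): one constraint point at x = 0 with Phi(0) = 2.
nearest = lambda x: 0.0
phi_at_constraint = lambda x2: 2.0
raw = lambda x: 3.0 * math.sin(7.0 * x) + 5.0   # arbitrary stand-in for N_o

model = constrained_model(raw, phi_at_constraint, nearest)
value_on_constraint = model(0.0)   # equals 2.0 whatever raw_net returns
value_inside = model(0.5)          # free to differ from the constraint value
```

Because the distance factor is exactly zero on the constraint set, the constraint term of the loss is identically zero for this model and only the equation term needs to be optimized.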

3. ERROR BOUNDS AND CONVERGENCE FOR NN DE SOLVERS

Define $\Phi_{\epsilon_1} := \Phi - \mathcal{N}$ as the error in our solution model. Three simple observations about $\mathcal{N}$, $\Phi_{\epsilon_1}$, the convergence of $L$ to 0, and their interplay underpin the entirety of this work:

1. $\mathbf{F}[\mathcal{N}]$ is not explicitly dependent on $\Phi_{\epsilon_1}$: there is no explicit relation between $L$ and $\Phi_{\epsilon_1}$ over $\mathcal{D} - \partial\mathcal{D}$, where $\Phi$ is unknown. We don't know if/how $\mathbb{E}\|\Phi_{\epsilon_1}\| \to 0$ as $L \to 0$.
2. Thus, the $\Phi_{\epsilon_1}$ associated with $\mathcal{N}$ is not estimable over $\mathcal{D} - \partial\mathcal{D}$ in standard NN DE solvers.
3. Optimization performance saturates when $\mathcal{N}$ settles around a local minimum (or minima) of $L$.

In its most general form, the first problem can be immensely intractable and is often handled separately for different kinds of DEs (3; 10; 11; 12; 13). In this section, we present generic results (proofs in Appendix B) which strongly combat this problem for a large collection of DEs originating in scientific domains, including many nonlinear PDEs. Our results will be valid for any $\mathbf{F}$ satisfying the assumptions of Theorem 1 (stated in full in Appendix B.1), under which $\mathbf{F}[\mathcal{N}] = 0 \implies \mathcal{N} = \Phi$ and $\mathbf{F}[\mathcal{N}] \to 0 \implies \mathcal{N} \to \Phi$. Note that the existence of the Fréchet derivative of $\mathbf{F}$ (denoted $D\mathbf{F}$) is necessary if gradient descent or related methods are to work. As such, only the assumption that $D\mathbf{F}^{-1}$ exists within some open neighbourhood of $\Phi$ imposes additional structure. Theorem 1 is a powerful result backing the wisdom of using NN DE solvers. However, the same assumptions allow us to go much further:

Theorem 2. Under the same assumptions as Theorem 1, we have $\|\Phi_{\epsilon_1}\| = O\left(\|\mathbf{F}[\mathcal{N}]\|\right)$.

A common observation made for NN DE solvers is that $\mathbb{E}\|\mathbf{F}[\mathcal{N}]\| \propto \mathbb{E}\|\Phi_{\epsilon_1}\|$. We codify that as a theorem over a relaxed class of assumptions on $\mathbf{F}$ as Theorem 2, validating the intuitive wisdom of using $L$ as a surrogate for error, especially when the associated constant of proportionality (the local Lipschitz constant of $\mathbf{F}^{-1}$, see Appendix B.1) is of order unity.

Theorem 3 (assumptions in Appendix B). For an initial model $\mathcal{N}(0) \in B_R(\Phi) \subset U$, the gradient descent equation $\dot{\mathcal{N}}(t) = -\nabla L(\mathcal{N}(t))$ has a solution that satisfies

$$\|\mathcal{N}(t) - \Phi\| \le e^{-(1-\epsilon)\frac{\sigma_{\min}}{2} t}\, \|\mathcal{N}(0) - \Phi\| , \qquad \sigma_{\min} := \inf_{\mathcal{N} \in G \setminus \{0\}} \frac{\|D\mathbf{F}[\Phi]\mathcal{N}\|^2}{\|\mathcal{N}\|^2} = \inf \mathrm{Spec}\left( (D\mathbf{F}[\Phi])^\dagger D\mathbf{F}[\Phi] \right) > 0 .$$
$\sigma_{\min} > 0$ implies that exponential convergence at least at that rate is always possible. However, while we mitigate the concerns raised in the first two observations, Theorems 1-3 still only weakly/globally describe $\Phi_{\epsilon_1}$ over $\mathcal{D}$. We can obtain stronger estimates on $\|\Phi_{\epsilon_1}\|$ by assuming structural information on the $\Phi$ dependent terms $\mathbf{L}$, $\mathbf{N}$ and on the interactions they might have with each other. In particular, assume that $D\mathbf{N}$ is a positive (or negative) definite operator (such an assumption is often taken to avoid degenerate systems). Define $\mathcal{N}_s = \Phi - s\Phi_{\epsilon_1}$. Further, let:

$$H_{\min} := \inf_{s \in [0,1]} \inf \left\{ |\lambda| : \lambda \in \mathrm{Spec}\left( D\mathbf{N}[\mathcal{N}_s] \right) \right\}$$

Let $F_{\max}$ be the maximum of $\|\mathbf{F}[\mathcal{N}]\|$ over $\mathcal{D}$ for some model $\mathcal{N}$ that has finished optimizing. We then have the following inequality on $\|\Phi_{\epsilon_1}\|$ (see Appendix C for assumptions and proof):

$$\|\Phi_{\epsilon_1}(x)\| \le \frac{F_{\max}}{H_{\min}}, \qquad x \in \mathcal{D} \tag{4}$$

A variation of Inequality 4 in Appendix C showcases how NN DE solver design can encode other knowledge/guesses on the system's behavior. Those can be used to fine-tune the assumptions that go into obtaining the presented inequality and lead us to the modifications needed on the bound. However, even with the stronger inequalities on $\|\Phi_{\epsilon_1}\|$ (which might still give severe overestimates, if $H_{\min}$ is not properly controlled for), we don't yet have information regarding the pointwise behavior of $\Phi_{\epsilon_1}$ over $\mathcal{D} - \partial\mathcal{D}$. There is a scarcity of work on algorithms that can explicitly estimate $\Phi_{\epsilon_1}$ over $\mathcal{D} - \partial\mathcal{D}$, using only the information available to the model $\mathcal{N}$. More often than not, precise error estimations rely on actual knowledge of $\Phi$ over the entirety of $\mathcal{D}$. In the next section, we shall provide an unsupervised algorithm to estimate $\Phi_{\epsilon_1}$ over $\mathcal{D} - \partial\mathcal{D}$, without knowledge of $\Phi$. This endeavor, combined with the third observation, will directly lead to another useful application.
For any nontrivial DE, it is improbable that $\mathcal{N}$ over $\mathcal{D}$ will be the exact solution once a stable minimum of the optimization is reached (stability is meant here in the sense of a converged model that is unlikely to find a significantly better minimum in a reasonable number of iterations). Indeed, it is demonstrably impossible if the DE does not admit closed form solutions, since $\mathcal{N}$ is always a closed form expression. Any further training would not provide any meaningful gains in accuracy. However, in that regime, even a moderately good estimate $\mathcal{N}_{\epsilon_1}$ for $\Phi_{\epsilon_1}$ would be immediately useful: we could simply use $\mathcal{N} + \mathcal{N}_{\epsilon_1}$ as a better model (albeit at additional optimization/model memory costs). Finally, just as Theorems 1-3 remove the ambiguity associated with standard NN DE solvers, we will find that similar results reliably eliminate such ambiguities vis-a-vis the error model $\mathcal{N}_{\epsilon_1}$.

4. ERROR ESTIMATION AND CORRECTION

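We begin by noticing the following:

$$\mathbf{F}[\mathcal{N}] = \mathbf{L}[\mathcal{N}] + \mathbf{N}[\mathcal{N}] + \mathbf{C} = -\mathbf{L}[\Phi_{\epsilon_1}] - \mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] \tag{5}$$

where $\mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] := \mathbf{N}[\mathcal{N} + \Phi_{\epsilon_1}] - \mathbf{N}[\mathcal{N}]$. Thus, we have a new equation and its operator $\mathbf{F}_1$:

$$\mathbf{F}_1[\Phi_{\epsilon_1}] = \mathbf{F}[\mathcal{N}] + \mathbf{L}[\Phi_{\epsilon_1}] + \mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] = 0 \tag{6}$$

with $\Phi_{\epsilon_1}$ as the only unknown quantity. The uniqueness of $\Phi$ implies that if $\mathcal{N}$ is fixed, then $\Phi_{\epsilon_1}$ is the unique solution to the DE given by Eq. 6, and we have $\mathbf{F}_1[\Phi_{\epsilon_1}] = \mathbf{F}[\mathcal{N} + \Phi_{\epsilon_1}] = 0$. Note that $\mathbf{N}'$ is obtained such that terms explicitly dependent on $\Phi$ don't appear in Eq. 5: that is what allows tight bounds on $\Phi_{\epsilon_1}$ in (10), without depending upon explicit knowledge of $\Phi$ (Inequality 4 is a strong generalization of that result).

As a motivating example, consider the case when the nonlinearities in $\mathbf{F}$ are degree 2 polynomials: $\mathbf{N}[\,\cdot\,] = [\,\cdot\,] \bullet [\,\cdot\,]$. In that case, we have:

$$\mathbf{F}[\mathcal{N}] + \mathbf{L}[\Phi_{\epsilon_1}] + \mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] = 0 , \qquad \mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] = \Phi_{\epsilon_1} \bullet \Phi_{\epsilon_1} + \mathcal{N} \bullet \Phi_{\epsilon_1} + \Phi_{\epsilon_1} \bullet \mathcal{N}$$

The transformation technique itself (algebraic manipulations, Taylor expansions, etc) is not important: we only need it to be such that $\mathbf{F}[\mathcal{N}]$ and $\Phi_{\epsilon_1}$ appear in relation to each other over all of $\mathcal{D}$. Indeed, we are always able to directly leverage the unsimplified $\mathbf{N}[\mathcal{N} + \Phi_{\epsilon_1}] - \mathbf{N}[\mathcal{N}]$ form of $\mathbf{N}'$ (or even the $\mathbf{F}[\mathcal{N} + \Phi_{\epsilon_1}] = 0$ equation) for optimization purposes in the proposed Algorithm 1.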
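The identities in Eq. 5 and Eq. 6 can be checked numerically on a deliberately trivial case. The sketch below assumes a pointwise (algebraic) operator $\mathbf{F}[u] = 3u + u^2 - 10$, whose root $\Phi = 2$ plays the role of the solution; this toy operator and all names are assumptions made only for illustration, since a real DE would involve derivatives of the model.

```python
def lin(u):
    """Toy linear part L[u] = 3u (assumption)."""
    return 3.0 * u

def nonlin(u):
    """Toy quadratic nonlinearity N[u] = u * u."""
    return u * u

C = -10.0

def F(u):
    return lin(u) + nonlin(u) + C

def n_prime(model, err):
    """Unsimplified form: N[model + err] - N[model]."""
    return nonlin(model + err) - nonlin(model)

def n_prime_expanded(model, err):
    """Expanded form for a degree 2 nonlinearity."""
    return err * err + model * err + err * model

phi = 2.0                 # exact root: 6 + 4 - 10 = 0
model = 1.7               # imperfect "solution model"
err = phi - model         # the error Phi_eps1

eq5_lhs = F(model)                                   # F[N]
eq5_rhs = -lin(err) - n_prime(model, err)            # -L[e] - N'[N, e]
eq6_value = F(model) + lin(err) + n_prime(model, err)  # F_1[e], should vanish
```

Both forms of $\mathbf{N}'$ agree, and the residual of the error equation vanishes at the true error, which is the property the error model is later trained to reproduce.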

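Algorithm 1 Unsupervised Estimation and Correction of $\Phi_{\epsilon_1}$

     1: procedure ERRORCORRECT
     2:     initialize NN DE solver $\mathcal{N} : \mathcal{D} \to \mathbb{R}$
     3:     while $\mathcal{N}$ not converged do                ▷ Initial model $\mathcal{N}$ over $\mathcal{D}$
     4:         Optimize $\mathcal{N}$ for a step using $L$ (Eq. 2)
     5:     end while
     6:     Freeze the parameters of $\mathcal{N}$              ▷ Fix $\mathcal{N}$ to start error analysis
     7:     initialize Error estimation model $\mathcal{N}_{\epsilon_1}$
     8:     while $\mathcal{N}_{\epsilon_1}$ not converged do   ▷ Error estimation over $\mathcal{D}$
     9:         Optimize $\mathcal{N}_{\epsilon_1}$ for a step using $L_{\epsilon_1}$ (Eq. 8)
    10:     end while
    11:     return $\mathcal{N} + \mathcal{N}_{\epsilon_1}$     ▷ Error correction over $\mathcal{D}$
    12: end procedure

Eq. 6 immediately hints at a method to estimate $\Phi_{\epsilon_1}$: we use Eq. 6 with another network model, $\mathcal{N}_{\epsilon_1}$, in the manner we used $\mathcal{N}$ with Eq. 2, to obtain an approximation for $\Phi_{\epsilon_1}$. More precisely, once we have a converged model $\mathcal{N}$ that has saturated its capacity to model $\Phi$, we use:

$$\mathbf{F}_1[\mathcal{N}_{\epsilon_1}] = \mathbf{F}[\mathcal{N}] + \mathbf{L}[\mathcal{N}_{\epsilon_1}] + \mathbf{N}'[\mathcal{N}, \mathcal{N}_{\epsilon_1}] \tag{7}$$

to define a new loss function for the error model $\mathcal{N}_{\epsilon_1}$:

$$L_{\epsilon_1} = \mathbb{E}_{x_1 \in \mathcal{D}} \left\| \mathbf{F}_1[\mathcal{N}_{\epsilon_1}(x_1)] \right\|_p + \mathbb{E}_{x_2 \in \partial\mathcal{D}} \left\| \Phi_{\epsilon_1}(x_2) - \mathcal{N}_{\epsilon_1}(x_2) \right\|_p \tag{8}$$

Alternatively, since the parameters of $\mathcal{N}$ are kept unchanged, we can use the equivalent form:

$$L_{\epsilon_1} = \mathbb{E}_{x_1 \in \mathcal{D}} \left\| \mathbf{F}[\mathcal{N}(x_1) + \mathcal{N}_{\epsilon_1}(x_1)] \right\|_p + \mathbb{E}_{x_2 \in \partial\mathcal{D}} \left\| \Phi(x_2) - [\mathcal{N}(x_2) + \mathcal{N}_{\epsilon_1}(x_2)] \right\|_p$$

In problems with homogeneous constraint conditions, $\Phi(\partial\mathcal{D}) = 0$, we may replace the second term in Eq. 8 by $\mathbb{E}\left\| \mathcal{N}_{\epsilon_1}(x_2) + \mathcal{N}(x_2) \right\|_p$, since $\Phi_{\epsilon_1}(\partial\mathcal{D}) = 0 - \mathcal{N}(\partial\mathcal{D}) = -\mathcal{N}(\partial\mathcal{D})$. The same reasoning underlying $L = 0 \implies \mathcal{N} = \Phi$ gives $L_{\epsilon_1} = 0 \implies \mathcal{N}_{\epsilon_1} = \Phi_{\epsilon_1}$. Indeed, analogs of Theorems 1-3 and Inequality 4 are obtained directly, as stated in a generalized form in the next subsection. In particular, our knowledge of $\mathbf{F}[\mathcal{N}]$ allows us substantial control over how to estimate $\Phi_{\epsilon_1}$, whereas the knowledge of $\mathbf{F}_1[\mathcal{N}_{\epsilon_1}]$ (or alternatively $\mathbf{F}[\mathcal{N} + \mathcal{N}_{\epsilon_1}]$) allows us bounds on $\|\Phi_{\epsilon_1} - \mathcal{N}_{\epsilon_1}\|$. The ability to internally and reliably estimate the $\Phi_{\epsilon_1}$ associated with $\mathcal{N}$ is not only useful for error analysis purposes, but also for refining $\mathcal{N}$. Algorithm 1 describes the procedure. Note that the difficulty of estimating $\Phi$ and $\Phi_{\epsilon_1}$ should be roughly equivalent, since $\Phi_{\epsilon_1}$ has at least the same regularity as $\Phi$.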
As such, the optimization costs of $\mathcal{N}_{\epsilon_1}$ per iteration will not be substantially larger than those of $\mathcal{N}$ (see discussion/experiments in the next sections). Finally, note that Alg. 1 provides relative corrections to some model $\mathcal{N}$. We do not claim that only error correction can provide improvements in performance to the extent shown in Fig. 2. We only claim that error analysis and correction are useful and available tools for any model $\mathcal{N}$ that might have been obtained via some technique based on NN DE solvers. It should always be possible to error correct $\mathcal{N}$ using the strategy we are describing, as long as the associated assumptions are satisfied.

Figure 1: Graphical Model of the $i$th order error correction scheme. Each vertical pair of blue and red nodes denotes a specific order in the estimation and correction process.
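The two stages of Algorithm 1 can be sketched on a toy linear problem. To stay self-contained, the example below swaps the neural networks for one-parameter sine models and uses hand-written gradients: it assumes $\mathbf{F}[u] = -u'' - f$ on $[0, \pi]$ with $\Phi(x) = \sin x + 0.5 \sin 3x$, a capacity-limited base model $a \sin x$, and an error model $b \sin 3x$. Both models vanish on the boundary by construction, so the constraint term of the loss is omitted. These choices and all names are illustrative assumptions, not the method of the accompanying codebase.

```python
import math

K = 200
XS = [math.pi * (k + 0.5) / K for k in range(K)]   # sample points in [0, pi]

def phi(x):
    """True solution; used only to report errors, never for training."""
    return math.sin(x) + 0.5 * math.sin(3 * x)

def forcing(x):
    """f = -Phi'', so that F[u] = -u'' - f vanishes at u = Phi."""
    return math.sin(x) + 4.5 * math.sin(3 * x)

def train(coeff, lr, steps, residual, d_residual):
    """Plain gradient descent on mean(residual^2) for one scalar parameter."""
    for _ in range(steps):
        grad = sum(2 * residual(coeff, x) * d_residual(x) for x in XS) / K
        coeff -= lr * grad
    return coeff

# Stage 1: base model N(x) = a*sin(x), so F[N](x) = a*sin(x) - f(x).
a = train(0.0, 0.5, 200,
          lambda a_, x: a_ * math.sin(x) - forcing(x),
          lambda x: math.sin(x))

# Stage 2: freeze a. Error model N_eps(x) = b*sin(3x), and since the toy
# operator is linear, F_1[N_eps] = F[N] + L[N_eps] with L[N_eps] = 9*b*sin(3x).
f_of_n = lambda x: a * math.sin(x) - forcing(x)
b = train(0.0, 0.01, 200,
          lambda b_, x: f_of_n(x) + 9 * b_ * math.sin(3 * x),
          lambda x: 9 * math.sin(3 * x))

def relative_error(model):
    num = sum((phi(x) - model(x)) ** 2 for x in XS)
    return num / sum(phi(x) ** 2 for x in XS)

re_base = relative_error(lambda x: a * math.sin(x))
re_corrected = relative_error(lambda x: a * math.sin(x) + b * math.sin(3 * x))
```

In this sketch the base model saturates at a relative error of about 0.2 no matter how long it trains, because $\sin 3x$ lies outside its capacity; the error model recovers the missing component from the frozen residual alone.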

4.1. HIGHER ORDER ERROR ESTIMATIONS AND CORRECTIONS

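Consider some sequence of networks $\mathcal{N}_{\epsilon_1}, \mathcal{N}_{\epsilon_2}, \ldots, \mathcal{N}_{\epsilon_{i-1}}$, each modeling the errors $\Phi_{\epsilon_1}, \Phi_{\epsilon_2}, \ldots, \Phi_{\epsilon_{i-1}}$ associated with the error corrected solution models $\mathcal{N}, \mathcal{N}_1, \ldots, \mathcal{N}_{i-2}$. Here, $\mathcal{N}_j = \mathcal{N} + \mathcal{N}_{\epsilon_1} + \ldots + \mathcal{N}_{\epsilon_j}$ is the $j$th order corrected model. Assume the $(i-1)$th order error correction model $\mathcal{N}_{\epsilon_{i-1}}$ has saturated its capacity to model $\Phi_{\epsilon_{i-1}}$. We can obtain a higher order corrected model $\mathcal{N}_i$ by setting up a new network $\mathcal{N}_{\epsilon_i}$ to estimate $\Phi_{\epsilon_i}$. The new network $\mathcal{N}_{\epsilon_i}$ is optimized using:

$$\mathbf{F}_i[\mathcal{N}_{\epsilon_i}] := \mathbf{F}[\mathcal{N}_{\epsilon_i} + \mathcal{N}_{i-1}] = \mathbf{F}_{i-1}[\mathcal{N}_{i-1}] + \mathbf{L}[\mathcal{N}_{\epsilon_i}] + \mathbf{N}[\mathcal{N}_{i-1} + \mathcal{N}_{\epsilon_i}] - \mathbf{N}[\mathcal{N}_{i-1}]$$

$$L_{\epsilon_i} = \mathbb{E}_{x_1 \in \mathcal{D}} \left\| \mathbf{F}_i[\mathcal{N}_{\epsilon_i}(x_1)] \right\|_p + \mathbb{E}_{x_2 \in \partial\mathcal{D}} \left\| \Phi_{\epsilon_i}(x_2) - \mathcal{N}_{\epsilon_i}(x_2) \right\|_p$$

Figure 1 provides a visual representation of the recursive process involved in building $i$th order corrected models. Appendix A demonstrates how to arrive at these results. We have direct analogs of Theorems 1, 2, and 3 for the $i$th order errors and error models as:

Theorem 4. Under the assumptions above on $\mathbf{F}$, the analogous result holds for $\mathbf{F}_i$ for all $i \in \mathbb{N}$, i.e. we can choose a neighbourhood $U_i \subset G$ of 0 small enough such that if $\mathcal{N}_{\epsilon_i} \in U_i$ then $\mathbf{F}_i[\mathcal{N}_{\epsilon_i}] \to 0 \implies \mathcal{N}_{\epsilon_i} \to \Phi_{\epsilon_i}$. Furthermore, $\|\Phi_{\epsilon_{i+1}}\| = O\left(\|\mathbf{F}_i[\mathcal{N}_{\epsilon_i}]\|\right)$. Lastly, gradient descent started from the initial condition $\mathcal{N}_{\epsilon_i}(0)$ has a solution that satisfies

$$\|\mathcal{N}_{\epsilon_i}(t) - \Phi_{\epsilon_i}\| \le e^{-(1-\epsilon)\frac{\sigma^i_{\min}}{2} t}\, \|\mathcal{N}_{\epsilon_i}(0) - \Phi_{\epsilon_i}\|$$

Inequality 4 becomes (under the same structural assumptions as Inequality 4):

$$\|\Phi_{\epsilon_i}\| \le \frac{F^{i-1}_{\max}}{H_{\min}} \tag{11}$$

Algorithm 2 in the appendix provides the pseudo-code to set up higher order corrections.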
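The recursion above can be sketched as a loop in which each new model is fitted against the frozen residual of the previous corrected model. As a stand-in for neural networks, the example below assumes a linear toy problem $\mathbf{F}[u] = -u'' - f$ on $[0, \pi]$ with $\Phi = \sin x + 0.5 \sin 3x + 0.25 \sin 5x$, and gives each stage a single sine mode as its entire capacity. The problem, the one-mode models, the step size, and all names are assumptions made only to keep the sketch runnable without any library.

```python
import math

K = 240
XS = [math.pi * (k + 0.5) / K for k in range(K)]
TRUE_COEFFS = {1: 1.0, 3: 0.5, 5: 0.25}   # Phi = sum of c_n * sin(n x), assumed

def phi(x):
    return sum(c * math.sin(n * x) for n, c in TRUE_COEFFS.items())

def forcing(x):
    """f = -Phi'' for the toy operator F[u] = -u'' - f."""
    return sum(n * n * c * math.sin(n * x) for n, c in TRUE_COEFFS.items())

def fit_mode(n, frozen_residual, steps=300):
    """Gradient descent on mean((F[N_{i-1}] + L[b sin(n x)])^2) over b."""
    b, lr = 0.0, 0.5 / n ** 4
    for _ in range(steps):
        grad = sum(2 * (frozen_residual(x) + n * n * b * math.sin(n * x))
                   * n * n * math.sin(n * x) for x in XS) / K
        b -= lr * grad
    return b

def relative_error(coeffs):
    model = lambda x: sum(c * math.sin(n * x) for n, c in coeffs.items())
    return (sum((phi(x) - model(x)) ** 2 for x in XS)
            / sum(phi(x) ** 2 for x in XS))

coeffs, errors = {}, []
for n in (1, 3, 5):                        # base model, then orders 1 and 2
    frozen = dict(coeffs)                  # parameters of N_{i-1} stay fixed
    prev_residual = lambda x, f=frozen: (
        sum(m * m * c * math.sin(m * x) for m, c in f.items()) - forcing(x))
    coeffs[n] = fit_mode(n, prev_residual)
    errors.append(relative_error(coeffs))
```

The list `errors` holds the relative error of the base model and of the first and second order corrected models; in this sketch each added order removes the largest remaining component of the error.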

5. WHY ERROR CORRECTION WORKS AND HOW WE CAN MAKE IT BETTER

NN DE solvers transform the problem of estimating $\Phi$ into one of finding an appropriate $\mathcal{N}$ in some suitably chosen functional space of NN models. The use of gradient descent type methods turns this into a stochastic dynamical process in that space. The probability that a real network $\mathcal{N}$ will end up exactly modeling $\Phi$ within a finite number of iterations is 0 (assuming $\mathbf{F}$ is even moderately non-trivial). Note this is true without even considering that $\mathcal{N}$ is constrained to be a closed form model, whereas most non-trivial DEs don't have such solutions. As such, a real network $\mathcal{N}$ is doomed to languish in some inadequate minimum of the optimization, no matter how much we optimize it. Its capacity to approximate $\Phi$ is necessarily limited, and once that capacity is saturated, further training of $\mathcal{N}$ is a waste of computational resources. Error correction expands that capacity for approximating $\Phi$ by introducing new trainable parameters (as components of $\mathcal{N}_{\epsilon_1}$) which exist exclusively for modeling the error $\Phi_{\epsilon_1} = \Phi - \mathcal{N}$. Since the parameters of the base network $\mathcal{N}$ are frozen, we have locked in the amount of performance already achieved on the base model. Further, since we assume error correction is being turned on when the base model's approximation capacity is effectively saturated (which is certain to happen), the gains we make with $\mathcal{N}_{\epsilon_1}$ are truly unachievable without it. We have put forth a claim that error correction is not significantly more complex or resource hungry than the standard NN DE solver method: the reason being that $\Phi$ and $\Phi - \mathcal{N}$ are in the same functional space, have the same regularity, etc. Essentially, we meant that $\Phi_{\epsilon_1}$ is no more complicated than $\Phi$. However, it bears noting that modeling $\Phi_{\epsilon_1}$ using $\mathcal{N}_{\epsilon_1}$ can be a very different kind of problem than modeling $\Phi$ using $\mathcal{N}$.
For example, $\Phi_{\epsilon_1}$ presumably lies in a different scale of magnitude than $\Phi$ and might present itself as a different kind of structure over $\mathcal{D}$ (for example, in one sense, it might be significantly more or less "oscillatory" than $\Phi$). Its optimization might also require its own set of hyper-parameters, algorithm choices, activations, etc. This is where our control over the sampling of $\mathbf{F}$ (and $\mathbf{F}_1$) comes into play. Theorems 1-3 and Inequality 4 allow us to estimate both the scale and complexity of $\Phi_{\epsilon_1}$, relative to $\mathcal{N}$. The correlation expected between $\Phi_{\epsilon_1}$ and $\mathbf{F}[\mathcal{N}]$ means even a crude analysis of $\mathbf{F}[\mathcal{N}]$ is enough to point us in the correct direction, vis-a-vis the scale and complexity we expect $\mathcal{N}_{\epsilon_1}$ to model. For example, we can bound the expected scale of $\Phi_{\epsilon_1}$ by simply analysing the maximum of $\|\mathbf{F}[\mathcal{N}]\|$ over $\mathcal{D}$: the linear correlation result already gives us one way to weakly bound $\Phi_{\epsilon_1}$. The stronger Inequality 4 may allow for more precise control, when its assumptions are satisfied. Similarly, over a domain $\mathcal{D}$, we could perform an FFT analysis of $\mathbf{F}[\mathcal{N}]$ and gauge how oscillatory $\mathcal{N}_{\epsilon_1}$ should be relative to $\mathcal{N}$ (since we have control over both $\mathcal{N}$ and $\mathbf{F}[\mathcal{N}]$ and we expect $\Phi_{\epsilon_1}$ to be linearly correlated to $\mathbf{F}[\mathcal{N}]$). Alternatively, we could also analyse the distribution of $\|\mathbf{F}[\mathcal{N}]\|$ over $\mathcal{D}$ to understand where the equation is being badly solved, and switch from uniform to targeted sampling over those subdomains. Other methods of estimating the relative complexity of features should also be possible, since we have complete access to the behavior of $\mathbf{F}[\mathcal{N}]$ and $\mathbf{F}_1[\mathcal{N}_{\epsilon_1}]$. The code we provide allows the user to supply their own estimates for the relative scale and complexity of $\mathcal{N}_{\epsilon_1}$ with respect to $\mathcal{N}$, while keeping our preferred defaults. We note again that in no way is prior knowledge of $\Phi$ assumed: only the information already known before, or computed using, the NN DE solver is being used.
Domain knowledge of an expert building an NN DE solver for a system of their interest is thus rewarded during error analysis and correction. Finally, note that error correction inherits one major issue prevalent within machine learning: the ambiguity in determining if a real model $\mathcal{N}$ has reached (or nearly reached) its capacity for effectively modelling $\Phi$. The issue is caused by the lack of work on a priori or in situ convergence guarantees. It is extremely tough to gauge whether a loss trajectory has flattened due to saturation in modeling capacity or whether it is simply in a non-trivial region of dynamics from which it could escape at any time. If error correction is attempted while the base model itself is fully capable of providing significant increases in accuracy, the additional resources will simply be wasted (and may even prove to be counter-productive). Indeed, in our numerical experiments, we found that convergence of the base model $\mathcal{N}$ was a practical pre-requisite for significant gains via error correction. However, in our proposed approach, the issue is still somewhat mitigated. Error correction cannot cost significantly more than the standard methods (at worst, it can be twice as expensive). Since it can lead to very significant improvements in accuracy when used at the appropriate time, it can always be used as an unsupervised post-processing/refinement routine without significant risks.
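Two of the residual diagnostics discussed in this section, the scale estimate from the maximum of $\|\mathbf{F}[\mathcal{N}]\|$ and the switch from uniform to targeted sampling, can be sketched as follows. The residual profile below is a made-up stand-in for the residual of a converged base model, and the function names are hypothetical; nothing here is taken from the accompanying codebase.

```python
import math
import random

def residual_scale_and_magnitudes(residual, points):
    """Return max |F[N]| over the probe points and the per-point magnitudes."""
    magnitudes = [abs(residual(x)) for x in points]
    return max(magnitudes), magnitudes

def targeted_sample(points, magnitudes, n, seed=0):
    """Resample points with probability proportional to |F[N](x)|."""
    rng = random.Random(seed)
    return rng.choices(points, weights=magnitudes, k=n)

# Hypothetical residual of a converged base model: badly solved near x = 0.8.
toy_residual = lambda x: math.exp(-50.0 * (x - 0.8) ** 2)

grid = [k / 200 for k in range(201)]       # uniform probe of the domain [0, 1]
f_max, mags = residual_scale_and_magnitudes(toy_residual, grid)
batch = targeted_sample(grid, mags, n=500)
frac_near_peak = sum(0.6 <= x <= 1.0 for x in batch) / len(batch)
```

Here `f_max` is an empirical estimate of the $F_{\max}$ that enters Inequality 4, and `batch` concentrates the training points of the error model in the subdomain where the equation is being solved worst. Neither quantity requires knowledge of $\Phi$.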

6. NUMERICAL EXPERIMENTS

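Let us exemplify the above discussion with an example DE that contains terms with linear, nonlinear, and no dependence on the solution. In particular, we investigate a variant of the nonlinear Poisson-Boltzmann equation (nPBE), which serves as an important tool for the study of electrostatic interactions with widespread applications in biology and medicine (14). We will choose $\mathbf{L}[\,\cdot\,] = -\Delta[\,\cdot\,]$, $\mathbf{N}[\,\cdot\,] \equiv \sinh[\,\cdot\,]$, $\mathbf{C} \equiv g$, where $g$ is some independent function/term. Thus, we define:

$$\mathbf{F}[\,\cdot\,] \equiv -\Delta[\,\cdot\,] + \sinh[\,\cdot\,] + g$$

For some solution attempt $\mathcal{N} : \mathbb{R}^d \to \mathbb{R}$ and associated error $\Phi_{\epsilon_1} = \Phi - \mathcal{N}$, we get

$$\mathbf{F}[\mathcal{N}] = \Delta[\Phi_{\epsilon_1}] + \sinh[\mathcal{N}] - \sinh[\mathcal{N} + \Phi_{\epsilon_1}]$$

Thus, replacing $\Phi_{\epsilon_1}$ with an error model $\mathcal{N}_{\epsilon_1}$, we get the residual equation for the new loss as:

$$\mathbf{F}_1[\mathcal{N}_{\epsilon_1}] := \mathbf{F}[\mathcal{N} + \mathcal{N}_{\epsilon_1}] = \mathbf{F}[\mathcal{N}] - \Delta[\mathcal{N}_{\epsilon_1}] + \sinh[\mathcal{N}]\left(\cosh[\mathcal{N}_{\epsilon_1}] - 1\right) + \cosh[\mathcal{N}] \sinh[\mathcal{N}_{\epsilon_1}] \tag{14}$$

Now, to set up $\mathcal{N}_{\epsilon_1}$ as the error NN DE solver (after halting the optimization of $\mathcal{N}$), we can use Eq. 14 to obtain an approximation of $\Phi_{\epsilon_1}$. We not only get an estimate for $\Phi_{\epsilon_1}$, we also get to use $\mathcal{N}_{\epsilon_1}$ as a correction term, while retaining the advantages that make $\mathcal{N}$ an attractive option.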
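For readers who want the intermediate step behind the nonlinear terms of Eq. 14, the following worked expansion uses only the addition formula for $\sinh$ and the general definition $\mathbf{N}'[\mathcal{N}, \mathcal{N}_{\epsilon_1}] = \mathbf{N}[\mathcal{N} + \mathcal{N}_{\epsilon_1}] - \mathbf{N}[\mathcal{N}]$ from Section 4; the symbols are those already used in the text.

```latex
\begin{aligned}
\mathbf{N}'[\mathcal{N}, \mathcal{N}_{\epsilon_1}]
  &= \sinh[\mathcal{N} + \mathcal{N}_{\epsilon_1}] - \sinh[\mathcal{N}] \\
  &= \sinh[\mathcal{N}]\cosh[\mathcal{N}_{\epsilon_1}]
     + \cosh[\mathcal{N}]\sinh[\mathcal{N}_{\epsilon_1}] - \sinh[\mathcal{N}] \\
  &= \sinh[\mathcal{N}]\left(\cosh[\mathcal{N}_{\epsilon_1}] - 1\right)
     + \cosh[\mathcal{N}]\sinh[\mathcal{N}_{\epsilon_1}] , \\[4pt]
\mathbf{F}_1[\mathcal{N}_{\epsilon_1}]
  &= \mathbf{F}[\mathcal{N}] + \mathbf{L}[\mathcal{N}_{\epsilon_1}]
     + \mathbf{N}'[\mathcal{N}, \mathcal{N}_{\epsilon_1}]
   = \mathbf{F}[\mathcal{N}] - \Delta[\mathcal{N}_{\epsilon_1}]
     + \mathbf{N}'[\mathcal{N}, \mathcal{N}_{\epsilon_1}] .
\end{aligned}
```

When $\mathcal{N}_{\epsilon_1}$ is small, both nonlinear terms are small as well, since $\cosh[\mathcal{N}_{\epsilon_1}] - 1$ and $\sinh[\mathcal{N}_{\epsilon_1}]$ vanish at $\mathcal{N}_{\epsilon_1} = 0$.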

6.1. ERROR CORRECTION RESULTS

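We choose $\mathcal{D} = [-\pi, \pi]^d \subset \mathbb{R}^d$, and randomly sample points from this domain as training data. The exact form of our chosen DE, considered over $\mathcal{D}$ with solution $\Phi : \mathcal{D} \to \mathbb{R}$ and homogeneous boundary conditions $\Phi(\partial\mathcal{D}) = 0$, is:

$$\mathbf{F}[\Phi] = -\Delta[\Phi] + \sinh[\Phi] + g = 0 , \qquad g(x) = -\omega^2 d \left[ \sin(\omega x_1) \cdots \sin(\omega x_d) \right] - \sinh\left( \sin(\omega x_1) \cdots \sin(\omega x_d) \right) , \qquad x \equiv \{x_1, \ldots, x_d\} \in \mathcal{D}$$

The solution to this DE, which showcases significant scale and feature variation over $\mathcal{D}$ (Fig. 3), is:

$$\Phi(x) = \sin(\omega x_1) \cdots \sin(\omega x_d) , \qquad x \in \mathcal{D} \tag{16}$$

Note that the solvers never see this solution, and only have access to their own parameters and the operator $\mathbf{F}$ during optimization. We only use the exact solution to verify that our claims hold true. Given a network $\mathcal{N}$ to model the true solution $\Phi$, we optimize $\mathcal{N}$ using the mean squared loss $L$:

$$L(\mathcal{N}) = \mathbb{E}_{x_1 \sim \mathcal{D}} \left\| -\Delta(\mathcal{N}(x_1)) + \sinh[\mathcal{N}(x_1)] + g(x_1) \right\|^2 + \mathbb{E}_{x_2 \sim \partial\mathcal{D}} \left\| \mathcal{N}(x_2) \right\|^2$$

The error correction model $\mathcal{N}_{\epsilon_1} : \mathcal{D} \to \mathbb{R}$ is effectively optimized using the following loss:

$$L_{\epsilon_1} = \mathbb{E}_{x_1 \sim \mathcal{D}} \left\| \mathbf{F}_1[\mathcal{N}_{\epsilon_1}(x_1)] \right\|^2 + \mathbb{E}_{x_2 \sim \partial\mathcal{D}} \left\| \mathcal{N}(x_2) + \mathcal{N}_{\epsilon_1}(x_2) \right\|^2 \tag{18}$$

As discussed before, $L \to 0 \implies \mathbf{F}[\mathcal{N}(x)] \to 0$ over $x \in \mathcal{D}$, which implies $\mathcal{N} \to \Phi$. Similarly, $L_{\epsilon_1} \to 0 \implies \mathbf{F}_1[\mathcal{N}_{\epsilon_1}(x)] \to 0$, which implies $\mathcal{N}_{\epsilon_1} \to \Phi_{\epsilon_1}$. We use Eq. 16 to benchmark the efficacy of our proposed algorithm for $(\omega, d)$ choices of $(5, 2)$ and $(1, 4)$. The goal of the experiments was to figure out what impact error correction would have if used in lieu of the standard approach, while running for an equivalent number of training iterations. As such, choices on width, depth, architecture, etc were kept the same between $\mathcal{N}$ and $\mathcal{N}_{\epsilon_1}$. The experiments also demonstrate the effects of invoking error correction at different stages of training (Fig. 2).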
To present and analyse our results, we define a computational cost metric, $\tau$ (time per iteration), and a computational performance metric, Relative Error (RE):

$$\text{Relative Error} := \frac{\sum_{k=1}^{K} \left[ \Phi(x_k) - \mathcal{N}_i(x_k) \right]^2}{\sum_{k=1}^{K} \left[ \Phi(x_k) \right]^2} ,$$

where $\mathcal{N}_i : \mathcal{D} \to \mathbb{R}$ is the $i$th order corrected, NN DE based approximation for $\Phi$ over $\mathcal{D}$, and $K$ represents the number of sampled points in $\mathcal{D}$ for the calculation. In Fig. 2, we plot the relative error dynamics of the various choices made. Notice the very immediate gains made by error correction.

$$\mathrm{ROI}(c) = \frac{\tau_{\mathrm{Std}}}{\tau(c)} \times \frac{\mathrm{RE}(c)}{\mathrm{RE}_{\mathrm{Std}}} \times \frac{T_{\mathrm{Std}}}{T(c)}$$

As Table 1 shows, error correction consistently provides the best ROI. The codebase associated with this work provides numerical experiments on some other nonlinear/chaotic DEs (it is designed to be compatible with any DE operator with permissible Fréchet derivatives and to allow any finite order of correction). The results on those systems are also tabulated in Table 1 (first order error correction was injected at 50% of the run-time, for all other examples). All results are in line with our expectations. The additional systems are described in Appendix D.

7. CONCLUSIONS

The surging popularity of NN DE solvers presents exciting possibilities in many scientific fields: their capacity to sidestep the curse of dimensionality (15), general advances in computing/GPU power, and their easy interpretability make for a powerful combination. As such, it is important that these solution models be capable of validation over domains where true solutions are not available. Our work proposes theorems/methods that fix this deficiency for many NN DE solvers. In summary, NN model errors are unambiguously estimable, and profitably so, if the assumptions of Theorems 1-3 hold. For systems where the assumptions of Theorems 1-3 do not hold, existing work still leads us to expect that NN DE solvers (and thus, error correction) will work. However, for those systems, error correction converts the ambiguity associated with the models $\mathcal{N}$ into the ambiguity associated with the estimate $\mathcal{N}_{\epsilon_1}$. These unreliable estimates of $\Phi_{\epsilon_1}$ can lead to more accurate models, but our work does not rigorously predict when or how this happens. However, since this ambiguity is present for all NN DE solvers prior to our work, unsupervised error analysis and correction can only be an improvement upon the existing situation (even when it is unreliable). Future work will focus on extending our results rigorously to more systems, so that even larger classes of DEs may come within the remit of reliable error estimation and correction. Lastly, note that the suggested ideas are not radically different from those that already exist for classical numerical methods: higher order corrections are about as old as the field itself. By pairing the many significant advantages of modern NNs with those old ideas, we simply hope to have presented a blueprint that will be useful for a wide class of scientific problems that are being tackled using NN DE solvers.

Let us define the 1st order corrected model as $\mathcal{N}_1 := \mathcal{N} + \mathcal{N}_{\epsilon_1}$, and with a slight abuse of notation, $\mathbf{F}_1[\mathcal{N}_1] := \mathbf{F}[\mathcal{N} + \mathcal{N}_{\epsilon_1}]$.
Thus, we may rewrite Eq. 19 as:

$$\mathbf{F}_1[\mathcal{N}_1] + \mathbf{L}[\Phi_{\epsilon_2}] + \mathbf{N}[\mathcal{N}_1 + \Phi_{\epsilon_2}] - \mathbf{N}[\mathcal{N}_1] = 0$$

Note the functional similarity to Eq. 6. Further, to set up a 2nd order correction estimate $\mathcal{N}_{\epsilon_2}$ for $\Phi_{\epsilon_2}$, we need to minimize over the following residual, similar to Eq. 7:

$$\mathbf{F}_2[\mathcal{N}_{\epsilon_2}] = \mathbf{F}_1[\mathcal{N}_1] + \mathbf{L}[\mathcal{N}_{\epsilon_2}] + \mathbf{N}[\mathcal{N}_1 + \mathcal{N}_{\epsilon_2}] - \mathbf{N}[\mathcal{N}_1]$$

In general, given an $(i-1)$th order corrected model $\mathcal{N}_{i-1} = \mathcal{N} + \mathcal{N}_{\epsilon_1} + \mathcal{N}_{\epsilon_2} + \ldots + \mathcal{N}_{\epsilon_{i-1}}$, we can define $\Phi_{\epsilon_i} := \Phi - \mathcal{N}_{i-1}$ and $\mathbf{F}_{i-1}[\mathcal{N}_{i-1}]$ as the corresponding residual for the preceding equations. $\Phi_{\epsilon_i}$ is the unique solution to the following equation:

$$\mathbf{F}_{i-1}[\mathcal{N}_{i-1}] + \mathbf{L}[\Phi_{\epsilon_i}] + \mathbf{N}[\mathcal{N}_{i-1} + \Phi_{\epsilon_i}] - \mathbf{N}[\mathcal{N}_{i-1}] = 0$$

The error can be estimated using a differential equation solver (e.g. a NN) with residual:

$$\mathbf{F}_i[\mathcal{N}_{\epsilon_i}] := \mathbf{F}[\mathcal{N}_{\epsilon_i} + \mathcal{N}_{i-1}] = \mathbf{F}_{i-1}[\mathcal{N}_{i-1}] + \mathbf{L}[\mathcal{N}_{\epsilon_i}] + \mathbf{N}[\mathcal{N}_{i-1} + \mathcal{N}_{\epsilon_i}] - \mathbf{N}[\mathcal{N}_{i-1}]$$

Figure 1 depicts the extension of the error estimation and correction framework as a graphical model. Notice that the initial solver estimator and the error estimators may be pre-fetched before carrying out the optimization steps. Since $\Phi_{\epsilon_i} = \Phi - \mathcal{N}_{i-1}$ is always known exactly over $\partial\mathcal{D}$ for all $i$, we can optimize the $i$th order error model $\mathcal{N}_{\epsilon_i}$ using $L_{\epsilon_i}$:

$$L_{\epsilon_i} = \mathbb{E}_{x_1 \sim \mathcal{D}} \left\| \mathbf{F}_i[\mathcal{N}_{\epsilon_i}(x_1)] \right\|_p + \mathbb{E}_{x_2 \sim \partial\mathcal{D}} \left\| \Phi_{\epsilon_i}(x_2) - \mathcal{N}_{\epsilon_i}(x_2) \right\|_p$$

Algorithm 2 demonstrates the procedure for higher order estimation and correction. The uniqueness of $\Phi$ as a solution to $\mathbf{F}[\,\cdot\,]$ also implies the uniqueness of $\Phi_{\epsilon_i}$ as a solution to $\mathbf{F}_i[\,\cdot\,]$. Unpacking the definitions of $\mathbf{F}_i$ and $\mathcal{N}_{i-1}$, we find that $\mathbf{F}_i[\Phi_{\epsilon_i}] = \mathbf{F}[\Phi_{\epsilon_i} + \mathcal{N}_{i-1}] = \mathbf{F}[\Phi - \mathcal{N}_{i-1} + \mathcal{N}_{i-1}] = \mathbf{F}[\Phi] = 0$, and therefore the solution to this equation is still unique, and convergence of $L_{\epsilon_i} \to 0$ implies $\mathbf{F}_i \to 0$, and thus also $\mathcal{N}_{\epsilon_i} \to \Phi_{\epsilon_i}$. Theorem 3 follows naturally from Theorems 1 and 2. Inequality 11 follows naturally from Inequality 4. Algorithm 2 describes the proposed implementation.

    [...]
     4:     Train $\mathcal{N}$, and define $\widehat{\mathcal{N}} := \mathcal{N}$     ▷ $\widehat{\mathcal{N}}$ serves as a dummy variable
     5:     for $i = 1, \ldots, m$ do
     6:         Train $\mathcal{N}_{\epsilon_i}$ until loss converges
     7:         Save parameter states of $\mathcal{N}_{\epsilon_i}$
     8:         $\widehat{\mathcal{N}} := \widehat{\mathcal{N}} + \mathcal{N}_{\epsilon_i}$
     9:         Freeze parameters of $\widehat{\mathcal{N}}$
    10:     end for
    11:     return $\mathcal{N} + \sum_{i=1}^{m} \mathcal{N}_{\epsilon_i}$     ▷ Equivalently, return $\widehat{\mathcal{N}}$
    12: end procedure

B THEOREMS ON THE RELATIONSHIPS BETWEEN $\Phi_{\epsilon_1}$, $\mathbf{F}[\mathcal{N}]$, AND $L$

This work is centered around three theorems. We begin this section by summarizing and contextualizing those results. Theorem 1 verifies the intuitive expectation that $\mathbf{F}[\mathcal{N}] \to 0$ implies the fitness of $\mathcal{N}$ as an approximation of $\Phi$. We do this by taking advantage of the uniqueness of $\Phi$ as a solution, alongside the fact that any gradient descent method can only be successful iff the gradient itself exists in a well defined sense. Theorem 2 naturally follows as a consequence of Theorem 1. It adds a strong a priori expectation we should have from NN DE solvers that satisfy the assumptions of Theorem 1, and tells us that their errors should be quantifiable in some sense if information about $\mathbf{F}[\mathcal{N}]$ is sampled (which is always possible up to the desired resolution). Additional structure leads us to stronger quantification capabilities (summarized in Inequalities 4 and 11 and proved in Appendix C). Finally, assume a mapping $\mathbf{F} : G \supset U \to H$ between two Hilbert spaces, e.g. representing a non-linear PDE mapping between function spaces. Further, assume it is twice differentiable and that, at the solution $\Phi$ satisfying $\mathbf{F}(\Phi) = 0$, $D\mathbf{F}[\Phi]$ is an invertible linear map. Then, the gradient descent procedure guarantees convergence at a rate

$$\|\mathcal{N}(t) - \Phi\| \le e^{-(1-\epsilon)\frac{\sigma_{\min}}{2} t}\, \|\mathcal{N}(0) - \Phi\|$$

where $\epsilon > 0$ can be chosen sufficiently small if $\|\mathcal{N}(0) - \Phi\| \le R$, where $R$ depends on $\epsilon$. Here, $\sigma_{\min}$ is given by the minimum of the spectrum of $(D\mathbf{F}[\Phi])^\dagger D\mathbf{F}[\Phi]$, which is strictly positive as $D\mathbf{F}[\Phi]$ is invertible. Theorem 3 guarantees that exponentially converging NN models for $\Phi$ exist somewhere in $G$, and that gradient descent will allow us that level of performance. We need only find some initial model from that region.
Together, the three theorems give us an idealized framework for describing NN DE solvers under iterative optimization. Real NN DE solvers differ from this idealized framework in two ways: their optimization is a discrete process, and the subspace within which real NN models lie (say $G_M$, where $M$ is the number of parameters in our model), while being capable of coming arbitrarily close to $\Phi$, seldom contains it. The real optimization is a discrete approximation of the ideal trajectory; further, the empirical trajectory is actually a projection of that discrete trajectory onto $G_M$. Note that for a finite dimensional approximation of the problem on a subspace $G_M \subset G$ of dimension $M$, the corresponding constant $\sigma^{G_M}_{\min}$ is greater than or equal to $\sigma_{\min}$. Furthermore, if $G_M \subset G_m$ then $\sigma^{G_M}_{\min} \ge \sigma^{G_m}_{\min}$. Thus, one can expect the error correction procedure to allow for an exponential improvement at each step iff $\sigma_{\min} > 0$, i.e. exactly in the cases when $D\mathbf{F}[\Phi]$ is invertible. Finally, Theorem 4 follows trivially from the definitions of $\mathbf{F}_i$, $\mathcal{N}_{i-1}$, $\Phi_{\epsilon_i}$, and Theorems 1-3. We now move to rigorously prove Theorems 1-3 below.
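The subspace monotonicity stated above can be checked in finite dimensions. The following is a minimal sketch, assuming a random matrix `D` as a stand-in for $D\mathbf{F}[\Phi]$ and nested subspaces spanned by the leading columns of a random orthonormal basis; the function names are hypothetical and not part of the associated codebase.

```python
import numpy as np

def restricted_sigma_min(D, V):
    """Smallest eigenvalue of D^T D restricted to the subspace span(V).

    V is assumed to have orthonormal columns."""
    DV = D @ V
    return float(np.linalg.eigvalsh(DV.T @ DV)[0])

def nested_sigmas(seed=0, dim=8, sub_dims=(2, 4, 8)):
    """sigma_min on nested subspaces G_2 within G_4 within G_8 (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((dim, dim))  # stand-in for DF[Phi] (assumption)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # orthonormal basis
    # The leading k columns of Q span nested subspaces of increasing dimension.
    return [restricted_sigma_min(D, Q[:, :k]) for k in sub_dims]
```

Since the Rayleigh quotient is minimized over a smaller set on a smaller subspace, the returned values should be non-increasing as the subspace grows.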

B.1 INVERSE FUNCTION THEOREM AND THEOREMS 1 AND 2

We quickly summarize the Banach space version of the inverse function theorem that allows us to establish that $\mathbf{F}[\mathcal{N}] \to 0$ implies $\mathcal{N} \to \Phi$ (for further details see for example (16)). We also establish another lemma we will use to estimate the convergence rate of NN DE solvers.

Let $G$ and $H$ be two Banach spaces, and $U \subset G$ an open subset. A continuous map $\mathbf{F} : G \supset U \to H$ between the two Banach spaces is said to be Fréchet differentiable at a point $x \in U$ iff there exists a bounded linear operator $L_x : G \to H$ such that

$$\|\mathbf{F}(y) - \mathbf{F}(x) - L_x(y - x)\|_H = o(\|y - x\|_G).$$

If the map $x \mapsto D\mathbf{F}[x] = L_x$ is continuous, then one says that $\mathbf{F} \in C^1(U; H)$. Analogously, $C^2(U; H)$ denotes the functions which are twice differentiable, and if $H = \mathbb{R}$ we drop $H$ from the notation.

Lemma 1. Let $\mathbf{F} : U \to H$ be a $C^1$-map. Suppose that there exists a point $x_0 \in U$ such that $D\mathbf{F}(x_0)$ is an isomorphism (i.e. it has a continuous inverse). Then there exists a neighbourhood $V \subset H$ of $\mathbf{F}(x_0)$ and a $C^1$ function $\mathbf{F}^{-1} : V \to G$ that is a local inverse of $\mathbf{F}$.

Proof. See Theorem 5.2 of (16).

Lemma 2. Assume $\mathcal{L} \in C^2(U)$. $\mathcal{L}$ is strongly convex at $\mathcal{N} \in U$ if, for all $\mathcal{N}' \in G$,

$$D^2\mathcal{L}[\mathcal{N}](\mathcal{N}', \mathcal{N}') \ge \mu\|\mathcal{N}'\|^2.$$

Proof. This follows by a simple application of Taylor's theorem as in the finite dimensional case. For Taylor's theorem for functions on Banach spaces, cf. (16).

Theorem 1. Suppose that $\mathbf{F} \in C^1(U; H)$, that the derivative of $\mathbf{F}$ at $\Phi \in U$ is invertible, and $\mathbf{F}(\Phi) = 0$. There is a neighbourhood $V \subset H$ of $0$ small enough such that

$$\mathbf{F}(\mathcal{N}) \to 0 \implies \mathcal{N} \to \Phi.$$

Proof. By Lemma 1, we can choose neighbourhoods $\Phi \in U' \subset U$ and $0 \in V \subset H$ such that $\mathbf{F} : U' \to V$ is a diffeomorphism. Then, if $\mathcal{N} \in U'$, the continuity of $\mathbf{F}^{-1}$ implies $\mathbf{F}[\mathcal{N}] \to 0 \implies \mathcal{N} \to \Phi$, which is the assertion.

Theorem 2. Under the same assumptions as above we have

$$\|\mathcal{N} - \Phi\| = O(\|\mathbf{F}[\mathcal{N}]\|).$$

Proof. Since by Lemma 1, $\mathbf{F}^{-1}$ is differentiable at $0$, it follows that it is locally Lipschitz continuous around $0$, implying that $\|\mathcal{N} - \Phi\| = O(\|\mathbf{F}[\mathcal{N}]\|)$.

The constant of proportionality is approximately given by $\lambda_{\min}^{-1}$, where $\lambda_{\min} := \inf_{\lambda \in \mathrm{Spec}(D\mathbf{F}[\Phi])} |\lambda|$ is the eigenvalue with minimal absolute value in the spectrum of $D\mathbf{F}[\Phi]$. The nPBE example from section 6 fits nicely into this paradigm, as the map

$$\mathbf{F} : W^{2,\infty}(\mathbb{R}^3) \to W^{0,\infty}(\mathbb{R}^3), \qquad f \mapsto -\Delta f + \sinh(f) + g$$

is continuous and continuously Fréchet differentiable with Fréchet derivative $D\mathbf{F}(f) = -\Delta + \cosh(f)$, which is everywhere continuously invertible as a linear map. The constant of proportionality is approximately 1.

Theorems 1 and 2 guarantee that there exist adequate models for $\Phi$ if the loss decreases sufficiently. However, we still do not know how fast or slow this convergence is going to be, or how to convert these existence results into actual estimates on $\Phi_{\epsilon_1}$ (and higher order $\Phi_{\epsilon_i}$). We develop those results below, with minimal additional assumptions.
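The remark that the constant of proportionality is approximately 1 for the nPBE map can be illustrated with a finite-dimensional analogue. The sketch below is an assumption-laden toy, not the paper's experiment: it discretizes $-\Delta$ on a 1D grid with Dirichlet boundaries, uses a manufactured solution `phi`, and measures errors in the Euclidean norm rather than $W^{2,\infty}$. Because the discrete $-\Delta$ is positive definite and the slope of $\sinh$ is at least 1, the error of any candidate is bounded by its residual norm in this setting.

```python
import numpy as np

n = 64                                   # interior grid points (illustrative)
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)

# Finite-difference -Laplacian with Dirichlet boundaries (symmetric positive definite).
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

phi = np.sin(np.pi * x)                  # manufactured "true" solution (assumption)
g = -(A @ phi + np.sinh(phi))            # source term chosen so that F(phi) = 0

def residual(f):
    """Discrete analogue of F(f) = -Laplace(f) + sinh(f) + g."""
    return A @ f + np.sinh(f) + g

def error_and_residual(scale, seed=0):
    """Perturb phi and return (||N - phi||_2, ||F(N)||_2)."""
    rng = np.random.default_rng(seed)
    approx = phi + scale * rng.standard_normal(n)
    return np.linalg.norm(approx - phi), np.linalg.norm(residual(approx))
```

In this toy the residual norm is an upper bound on the error for every perturbation size, which is the finite-dimensional counterpart of Theorem 2 with a constant no larger than 1.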

B.2 GRADIENT DESCENT IN HILBERT SPACES

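We assume that $G$ is a Hilbert space, i.e. in addition to being a Banach space, it is also equipped with an inner product $\langle \cdot, \cdot \rangle$ such that $\langle v, v \rangle = \|v\|^2$ for all $v \in G$. Assume that we are given a loss function $\mathcal{L} : G \supset U \to \mathbb{R}$ that is an element of $C^1(U)$. We denote by $\nabla\mathcal{L}(\mathcal{N})$ the unique element of $G$ s.t. for all $\Psi \in G$

$$D\mathcal{L}[\mathcal{N}](\Psi) = \langle \nabla\mathcal{L}(\mathcal{N}), \Psi \rangle.$$

If $\nabla\mathcal{L}$ is locally Lipschitz continuous, or equivalently if $\mathcal{L}$ is a locally $L$-smooth function, then the ODE

$$\dot{\mathcal{N}}(t) = -\nabla\mathcal{L}(\mathcal{N}(t))$$

has a unique local solution, cf. (17). By locally $L$-smooth, we mean that around a minimum $\Phi$ there exists $R > 0$ such that for all $\mathcal{N}, \mathcal{N}' \in B_R(\Phi)$

$$\|\nabla\mathcal{L}(\mathcal{N}) - \nabla\mathcal{L}(\mathcal{N}')\| \le L\|\mathcal{N} - \mathcal{N}'\|.$$

Furthermore, for $\mu > 0$, we call the function $\mathcal{L}$ locally $\mu$-strongly convex around a minimum $\Phi$ if there exists $R > 0$ such that for all $\mathcal{N}, \mathcal{N}' \in B_R(\Phi)$

$$\mathcal{L}(\mathcal{N}') \ge \mathcal{L}(\mathcal{N}) + \langle \nabla\mathcal{L}(\mathcal{N}), \mathcal{N}' - \mathcal{N} \rangle + \frac{\mu}{2}\|\mathcal{N}' - \mathcal{N}\|^2.$$

We need one more lemma to have all the tools we will use to get Theorem 3.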

C STRICT UPPER BOUNDS ON $\|\Phi_{\epsilon_1}\|$

We will begin our attempt to ascertain bounds on ∥Φ ϵ1 ∥ with classical Hamiltonian systems, to exemplify the intuition and the technique involved in the generic bound we wish to develop. Once that is achieved, we will show how a generalization follows in a straightforward manner.

C.1 HAMILTONIAN SYSTEMS

To start, let us recall that $\Phi_{\epsilon_1} := \Phi - \mathcal{N} = (\phi_1 - \mathcal{N}_1, \dots, \phi_D - \mathcal{N}_D)$ for a $D$-dimensional dynamical system to be solved on a domain of interest $[0, T]$. Note that $D$ is necessarily even for a Hamiltonian dynamical system. Let us construct a worst case scenario for the norm of $\Phi_{\epsilon_1}$. We write the equation in terms of the Hamiltonian formulation:

$$\frac{d\Phi}{dt} = J\nabla\mathcal{H}(\Phi)$$

where $\mathcal{H}$ is the appropriate Hamiltonian, $J$ is the symplectic matrix

$$J = \begin{pmatrix} 0 & -I_{D/2} \\ I_{D/2} & 0 \end{pmatrix}$$

and $I_{D/2}$ is the identity matrix. The NN DE solver is trained using $\mathbf{F}[\mathcal{N}]$ given by

$$\mathbf{F}[\mathcal{N}] = J\nabla\mathcal{H}(\mathcal{N}) - \frac{d\mathcal{N}}{dt}$$

The equation above represents $D$ separate differential equations in vector form. Since $\mathbf{F}[\Phi](t) = 0$ for all $t$, we can write $\mathbf{F}[\mathcal{N}]$ as follows, suppressing the time dependence:

$$\mathbf{F}[\mathcal{N}] = -\int_0^1 D\mathbf{F}[\mathcal{N}_s]\Phi_{\epsilon_1}\,ds = \dot{\Phi}_{\epsilon_1} - \left(\int_0^1 J D^2\mathcal{H}(\mathcal{N}_s)\,ds\right) \cdot \Phi_{\epsilon_1} =: \dot{\Phi}_{\epsilon_1} - R\Phi_{\epsilon_1}$$

where $\mathcal{N}_s := \Phi - s\Phi_{\epsilon_1}$, and $D^2\mathcal{H}(\mathcal{N})$ is the Hessian matrix of $\mathcal{H}$ with components $\partial_i\partial_j\mathcal{H}(\mathcal{N})$. In order to extract a meaningful error bound from Eq. 27 we need to make a structural assumption: we assume that at a time $t_{\Phi_{\epsilon_1}} \in [0, T]$ at which $\|\Phi_{\epsilon_1}\|$ attains its maximum,

$$R(t_{\Phi_{\epsilon_1}})\Phi_{\epsilon_1}(t_{\Phi_{\epsilon_1}}) \cdot \dot{\Phi}_{\epsilon_1}(t_{\Phi_{\epsilon_1}}) = 0$$

Note that the assumption of focusing all error in only one component of $\Phi$, as done in (10), already implies the above, but not vice versa. As such, we are using a much weaker assumption than (10).³ Then, for time $t_{\Phi_{\epsilon_1}} \in [0, T]$, we have

$$\|R\Phi_{\epsilon_1}\|_2^2 \le \|R\Phi_{\epsilon_1}\|_2^2 + \|\dot{\Phi}_{\epsilon_1}\|_2^2 = \|\mathbf{F}[\mathcal{N}]\|_2^2$$

We next need an estimate on the matrix norm of the inverse of $R$ for every $t \in [0, T]$. We can ignore $J$, as it is an orthogonal matrix, and only consider $\int_0^1 D^2\mathcal{H}(\mathcal{N}_s)\,ds$. We can do this if we know that $D^2\mathcal{H}(x)$ is a strictly positive or strictly negative definite matrix for all $x \in D \subset \mathbb{R}^D$ in our domain of interest, which is a standard assumption for the motion to be non-degenerate.
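W.l.o.g., assuming that $D^2\mathcal{H}$ is positive definite, and setting $H_{\min} := \min_{x \in D} \lambda_{\min}(D^2\mathcal{H}(x))$, where $\lambda_{\min}$ is the smallest eigenvalue of the matrix $D^2\mathcal{H}(x)$, we have for all $t \in [0, T]$

$$\|R^{-1}\| \le H_{\min}^{-1}.$$

Finally, letting $F_{\max} := \max_{t \in [0, T]} \|\mathbf{F}[\mathcal{N}](t)\|_2$, we obtain $\|\Phi_{\epsilon_1}\| \le F_{\max} / H_{\min}$.

In the general setting $\mathbf{F}[\,] \equiv \mathbf{L}[\,] + \mathbf{N}[\,] + C$, the analogous identity reads:

$$\mathbf{F}[\mathcal{N}] = -\int_0^1 D\mathbf{F}[\mathcal{N}_s]\Phi_{\epsilon_1}\,ds = -\int_0^1 D\mathbf{N}[\mathcal{N}_s]\,ds \cdot \Phi_{\epsilon_1} - \mathbf{L}[\Phi_{\epsilon_1}] =: -R\Phi_{\epsilon_1} - \mathbf{L}[\Phi_{\epsilon_1}]$$

The generalized version of the $R\Phi_{\epsilon_1} \cdot \dot{\Phi}_{\epsilon_1} = 0$ assumption is that at the $x_{\Phi_{\epsilon_1}} \in D$ at which $\|\Phi_{\epsilon_1}\|$ attains its maximum, we have:

$$R(x_{\Phi_{\epsilon_1}})\Phi_{\epsilon_1}(x_{\Phi_{\epsilon_1}}) \cdot \mathbf{L}[\Phi_{\epsilon_1}](x_{\Phi_{\epsilon_1}}) = 0, \qquad x_{\Phi_{\epsilon_1}} \in D$$

The next assumption we generalize is the one made to have non-degeneracy in the solutions we were trying to model. As such, we assume that $D\mathbf{N}$ is a positive definite (or negative definite) operator. From then on, we define $H_{\min}$ in a similar manner. More precisely, we take $H_{\min 1}$ and $H_{\min 2}$ to be the infima of the spectra of $|D\mathbf{N}[\mathcal{N}_s]|$ over $s \in [0, 1]$ and of $|\mathbf{L}[\Phi_{\epsilon_1}]|$ respectively (Eq. 32). Finally, let $H_{\min} = \max\{H_{\min 1}, H_{\min 2}\}$. With that in place, we obtain the following inequality for $\|\Phi_{\epsilon_1}\|$:

$$\|\Phi_{\epsilon_1}\| \le \frac{F_{\max}}{H_{\min}}$$

The generalized version of the assumption on the $x_{\Phi_\epsilon}$ where $|\dot{\Phi}_\epsilon|$ attains its maximum similarly gives:

$$\|\dot{\Phi}_\epsilon\| \le F_{\max}$$

The assumption that $R(x_{\Phi_{\epsilon_1}})\Phi_{\epsilon_1}(x_{\Phi_{\epsilon_1}}) \cdot \mathbf{L}[\Phi_{\epsilon_1}](x_{\Phi_{\epsilon_1}}) = 0$ is critical in allowing us strong estimates on $\|\Phi_{\epsilon_1}\|$. Superficially, it might seem like too strong an assumption (and even somewhat of a non sequitur). However, note that $\mathbf{L}[\,]$ in a DE is often the domain derivative term (we denote it as $\nabla_x$ in this discussion, where $x \in D$ represents an arbitrary element in the domain. As such, $\nabla_x$ are the spatial derivatives for spatial DEs, time derivatives for ODEs, a spatio-temporal one for space-time DEs, etc). When $\|\Phi_{\epsilon_1}\|$ achieves its maximum, $\Phi_{\epsilon_1} \cdot \nabla_x(\Phi_{\epsilon_1}) = 0$. When $\mathbf{L} \equiv \nabla_x$, we indeed have $\Phi_{\epsilon_1} \cdot \mathbf{L}[\Phi_{\epsilon_1}] = 0$.

$\mathcal{N}$ and $\mathcal{N}_{\epsilon_1}$ were fully connected Neural Networks, with 4 hidden layers, each with 50 sine activation functions. The base model was trained via ADAM for 400000 iterations, while error models were trained for a varying number of iterations, depending on when the error correction models were activated (see Fig. 2). We sampled 1024 points per iteration. The $(\omega, d) = (1, 4)$ model was trained for 50000 iterations, with 8192 points sampled per iteration. Both NN DE solvers are implemented using the methods prescribed in (2).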

D.3 NONLINEAR QUARTIC OSCILLATOR

The nonlinear quartic oscillator is represented by the following ODE governing the dynamics of $\Phi \equiv \{x, p_x\}$:

$$\dot{\Phi} = -J\nabla\mathcal{H}, \qquad J = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad \mathcal{H} \equiv \frac{x^2 + p_x^2}{2} + \frac{x^4}{4}$$

We sample $\Phi(0)$ such that $\{x(0), p_x(0)\} \in [-1, 1] \times [-1, 1]$ each time. We implement the NN DE solver prescribed in (10). The models had 2 hidden layers each, with 50 sine activation functions per layer. The base models were trained for 50000 iterations using ADAM, with an error correction made at 25000 iterations. We sampled 200 time points per iteration. The associated codebase allows the user to add other systems as per their choice.
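For orientation, the quartic oscillator and its equation residual can be sketched without any neural network. The snippet below is an illustrative assumption, not the solver of (10): it uses the standard form of Hamilton's equations ($\dot{x} = \partial\mathcal{H}/\partial p_x$, $\dot{p}_x = -\partial\mathcal{H}/\partial x$; the sign convention is assumed), a classical RK4 integrator as a reference trajectory, and finite differences in place of automatic differentiation. All function names are hypothetical.

```python
import numpy as np

def hamiltonian(state):
    """H = (x^2 + p^2)/2 + x^4/4, evaluated along the last axis."""
    x, p = state[..., 0], state[..., 1]
    return 0.5 * (x**2 + p**2) + 0.25 * x**4

def vector_field(state):
    """Standard Hamilton's equations (assumed convention): x' = p, p' = -(x + x^3)."""
    x, p = state[..., 0], state[..., 1]
    return np.stack([p, -(x + x**3)], axis=-1)

def rk4_trajectory(state0, dt, steps):
    """Reference trajectory from a classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((steps + 1, 2))
    traj[0] = state0
    for k in range(steps):
        s = traj[k]
        k1 = vector_field(s)
        k2 = vector_field(s + 0.5 * dt * k1)
        k3 = vector_field(s + 0.5 * dt * k2)
        k4 = vector_field(s + dt * k3)
        traj[k + 1] = s + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return traj

def residual(candidate, dt):
    """Equation residual of a sampled candidate: vector field minus dN/dt.

    dN/dt is approximated by finite differences along the time axis."""
    return vector_field(candidate) - np.gradient(candidate, dt, axis=0)
```

A candidate trajectory that deviates from the reference produces a visibly larger residual, which is the quantity the loss-to-error relationships in appendix C are built on.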



¹ Note that $\mathbf{F}_1[\mathcal{N}_{\epsilon_1}] = \mathbf{F}[\mathcal{N} + \mathcal{N}_{\epsilon_1}]$ for any appropriate mapping $\mathcal{N}_{\epsilon_1}$, and not just $\Phi_{\epsilon_1}$.

² However, it is sometimes profitable to obtain an explicit, separable expression for $\mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}]$. For example, whenever Taylor expansions can be used, $\mathbf{N}'[\mathcal{N}, \Phi_{\epsilon_1}] \equiv T_1(\mathcal{N})\Phi_{\epsilon_1} + \frac{\Phi_{\epsilon_1}^\dagger T_2(\mathcal{N})\Phi_{\epsilon_1}}{2!} + \dots$, where $T_i(\mathcal{N})$ are the $i$th order Taylor terms. These can be useful for analysis, since $\mathcal{N}$ and $\Phi_{\epsilon_1}$ appear in multiplicatively separable terms in the new expressions ((10) used these forms to bound $\Phi_{\epsilon_1}$). However, we obtained Inequality 4 in a way that renders such transformations superfluous: our bounds stand under weaker assumptions.

³ Unfortunately, we need this assumption to separate the two terms appearing on the right hand side of Eq. 27.



Figure 2: Relative errors for a single order correction on nPBE. Panels: (a) 2 dimensions, ω = 5; (b) 4 dimensions, ω = 1. Legend labels indicate duration of training $\mathcal{N}$, and duration of training $\mathcal{N}_{\epsilon_1}$ for correction respectively.

Algorithm 2: Internal Error Estimation and Correction to Order m

procedure ERRORCORRECT(m)
    initialize NN DE solver $\mathcal{N} : D \to \mathbb{R}$
    initialize error estimators $\{\mathcal{N}_{\epsilon_i}\}_{i=1}^{m}$
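The error correction loop can be sketched end to end on a toy problem. The following is a minimal illustration under stated assumptions, not the paper's implementation: the neural networks are replaced by polynomial models, ADAM is replaced by Gauss-Newton iterations on the collocation residual, and the test equation $u' + u^2 = 0$, $u(0) = 1$ on $[0, 1]$ (true solution $1/(1+t)$) is chosen purely for convenience. The names `train`, `error_correct`, and the chosen degrees are hypothetical.

```python
import numpy as np

T_GRID = np.linspace(0.0, 1.0, 200)   # collocation points in D = [0, 1]
PHI = 1.0 / (1.0 + T_GRID)            # true solution, used only to report errors

def poly_basis(degree):
    """Monomial features t^k and their time derivatives on T_GRID."""
    k = np.arange(degree + 1)
    B = T_GRID[:, None] ** k
    dB = np.zeros_like(B)
    dB[:, 1:] = k[1:] * T_GRID[:, None] ** (k[1:] - 1)
    return B, dB

def train(degree, base, dbase, bc_target, iters=25, bc_weight=10.0):
    """Fit e(t) = B c so that F[base + e] = (base + e)' + (base + e)^2 is small
    on the grid and e(0) matches bc_target. `base` is held fixed (frozen)."""
    B, dB = poly_basis(degree)
    c = np.zeros(degree + 1)
    for _ in range(iters):            # Gauss-Newton on the stacked residual
        u, du = base + B @ c, dbase + dB @ c
        r = np.concatenate([du + u**2, [bc_weight * (B[0] @ c - bc_target)]])
        J = np.vstack([dB + 2.0 * u[:, None] * B, bc_weight * B[:1]])
        c -= np.linalg.lstsq(J, r, rcond=None)[0]
    return B @ c, dB @ c

def error_correct(m, base_degree=3, corr_degree=7):
    """Train a base model, then m error models on the residual of the running sum."""
    zero = np.zeros_like(T_GRID)
    acc, dacc = train(base_degree, zero, zero, 1.0)    # base solver, u(0) = 1
    errors = [np.max(np.abs(PHI - acc))]
    for _ in range(m):
        # The error Phi - acc is known exactly at the boundary t = 0.
        e, de = train(corr_degree, acc, dacc, 1.0 - acc[0])
        acc, dacc = acc + e, dacc + de                 # accumulate corrections
        errors.append(np.max(np.abs(PHI - acc)))
    return acc, errors
```

Calling `error_correct(1)` returns the corrected model on the grid together with the maximum errors before and after the correction. The true solution enters only in that reporting step; training relies solely on the residual and the boundary value.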

$$H_{\min} := \min_{x \in D} \|D^2\mathcal{H}(x)^{-1}\|^{-1} = \min_{x \in D} \lambda_{\min}(D^2\mathcal{H}(x))$$

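$$H_{\min 1} := \inf_{s \in [0,1]} \inf_\lambda \{\lambda \in \mathrm{Spec}(|D\mathbf{N}[\mathcal{N}_s]|)\}, \qquad H_{\min 2} := \inf_{s \in [0,1]} \inf_\lambda \{\lambda \in \mathrm{Spec}(|\mathbf{L}[\Phi_{\epsilon_1}]|)\} \qquad (32)$$

where $|D\mathbf{N}[\mathcal{N}_s]| := (D\mathbf{N}[\mathcal{N}_s])^\dagger D\mathbf{N}[\mathcal{N}_s]$.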

Figure 3: Heatmaps of non-error corrected N & error corrected solver with N ϵ1 at 50% of the total training iterations (left & middle), and Φ (right) for the (ω, d) = (5, 2) setting.

thus allowing a larger class of loss function designs as well.

Performance comparison across different systems and optimization strategies

We also investigate alternative scenarios, where the slightly higher consumption of resources is used in other ways, to quantify ablation possibilities. We conduct two sets of examples. For size ablation experiments, we allow $\mathcal{N}$ to be proportionally larger, while training it for the same number of training iterations (say $T$). For time ablation experiments, we allow the standard algorithms a proportionally higher number of iterations. We define the Return on Investment or ROI of each choice c = (error correction, size, or time ablation) as:

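With $F_{\max}$ denoting the maximum of $\|\mathbf{F}[\mathcal{N}](t)\|_2$ over $t \in [0, T]$, we reach an estimate on $\|\Phi_{\epsilon_1}\|$ as:

$$\|\Phi_{\epsilon_1}\| \le \frac{F_{\max}}{H_{\min}}$$

Additionally, if we assume that at the time $t_{\Phi_{\epsilon_1}}$ at which $|\dot{\Phi}_{\epsilon_1}|$ reaches its maximum, $R(t_{\Phi_{\epsilon_1}})\Phi_{\epsilon_1}(t_{\Phi_{\epsilon_1}}) \cdot \dot{\Phi}_{\epsilon_1}(t_{\Phi_{\epsilon_1}}) = 0$, we have the following bound on $\dot{\Phi}_{\epsilon_1}$ as well:

$$\|\dot{\Phi}_{\epsilon_1}\| \le F_{\max} \qquad (30)$$

C.2 BOUNDS ON $\|\Phi_{\epsilon_1}\|$ IN MORE GENERAL SETTINGS

Eq. 27 generalizes beyond Hamiltonian systems in a straightforward manner for $\mathbf{F}[\,] \equiv \mathbf{L}[\,] + \mathbf{N}[\,] + C$.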


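Lemma 3. Let $\mathcal{L} \in C^1(U)$ be a locally $L$-smooth, locally $\mu$-strongly convex function around a minimum $\Phi$, s.t. $\mathcal{L} \ge 0$ and $\mathcal{L}(\Phi) = 0$. Then for any initial condition $\mathcal{N}(0) \in B_R(\Phi)$, a ball where $\mathcal{L}$ is both $L$-smooth and $\mu$-strongly convex, the gradient descent Eq. 23 converges exponentially with rate $\frac{\mu}{2}$ towards $\Phi$, i.e.

$$\|\mathcal{N}(t) - \Phi\| \le e^{-\frac{\mu}{2} t}\,\|\mathcal{N}(0) - \Phi\|.$$

Proof. From the $\mu$-strong convexity (applied with $\mathcal{N}' = \Phi$, together with $\mathcal{L}(\Phi) = 0$ and $\mathcal{L} \ge 0$) it follows that

$$\frac{d}{dt}\|\mathcal{N}(t) - \Phi\|^2 = -2\langle \nabla\mathcal{L}(\mathcal{N}(t)), \mathcal{N}(t) - \Phi \rangle \le -\mu\|\mathcal{N}(t) - \Phi\|^2.$$

Solving this differential inequality, we find

$$\|\mathcal{N}(t) - \Phi\|^2 \le e^{-\mu t}\,\|\mathcal{N}(0) - \Phi\|^2,$$

which gives the claim after taking square roots.

Lemma 3 guarantees the existence of an exponentially convergent regime of optimization, as long as we assume that $D\mathbf{F}$ exists and $\mathcal{L}$ is $\mu$-strongly convex. However, it is a non-constructive statement: we have no information on what behavior should be expected from $\mu$. We can make it constructive with an additional assumption on the existence of $D^2\mathbf{F}$.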
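The rate in Lemma 3 can be checked numerically on the simplest strongly convex loss. The sketch below assumes a quadratic loss $\frac{1}{2}x^\top A x$ with a random symmetric positive definite `A` (so the minimum is at the origin and $\mu$ is the smallest eigenvalue of `A`) and approximates the gradient flow by small explicit Euler steps; these choices and the function names are illustrative assumptions.

```python
import numpy as np

def gradient_flow(A, x0, dt, steps):
    """Explicit Euler discretization of x'(t) = -grad L(x) = -A x."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(xs[-1] - dt * (A @ xs[-1]))
    return np.array(xs)

def check_lemma3(seed=0, dim=5, dt=1e-3, steps=5000):
    """Return distances to the minimum and the bound exp(-mu t / 2) * ||x(0)||."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((dim, dim))
    A = Q @ Q.T + 0.5 * np.eye(dim)          # symmetric positive definite
    mu = np.linalg.eigvalsh(A)[0]            # strong convexity constant
    x0 = rng.standard_normal(dim)
    xs = gradient_flow(A, x0, dt, steps)
    t = dt * np.arange(steps + 1)
    dist = np.linalg.norm(xs, axis=1)        # minimum Phi is the origin here
    bound = np.exp(-0.5 * mu * t) * dist[0]
    return dist, bound
```

For this quadratic the flow actually contracts at rate $\mu$, so the bound with rate $\frac{\mu}{2}$ holds with room to spare along the whole trajectory.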

B.3 CONSTRUCTIVELY ESTIMATING THE RATE OF CONVERGENCE

Assume that $\mathbf{F}$ is in $C^2(U; H)$. Furthermore, we require that $H$ is also a Hilbert space. From the definition of Fréchet differentiability we can write $\mathbf{F}$ as

$$\mathbf{F}(\mathcal{N}) = D\mathbf{F}[\Phi](\mathcal{N} - \Phi) + R_1(\mathcal{N})$$

where $\|R_1(\mathcal{N})\| = o(\|\mathcal{N} - \Phi\|^2)$. Then our loss function $\mathcal{L}(\mathcal{N})$ [...] where we used that $\mathcal{L}(\Phi) = 0$ and defined $\sigma_{\min} := \inf \mathrm{Spec}\left((D\mathbf{F}[\Phi])^\dagger D\mathbf{F}[\Phi]\right) > 0$, as $D\mathbf{F}[\Phi]$ is invertible. Furthermore, our loss function can be written as [...] where [...] and $R_2$ is also a $C^2$-function, and its second derivative is $o(\|\mathcal{N} - \Phi\|)$. Thus, for every $\epsilon > 0$ there exists $R > 0$ such that for all $\mathcal{N}$ in the ball of radius $R$ around $\Phi$ [...]. Then, we have that for all $\mathcal{N} \in B_R(\Phi)$ [...]. It follows that $\mathcal{L}$ is $\mu$-strongly convex in the ball of radius $R$ around $\Phi$ with constant $\mu = (1 - \epsilon)\sigma_{\min}$, and thus the gradient descent flow $\mathcal{N}(t)$ converges exponentially at rate $\frac{(1-\epsilon)\sigma_{\min}}{2}$.

We have thus proven Theorem 3. Theorem 4 follows immediately from the results in appendix B.

As such, what we are really assuming is that the $R$ term effectively has $\Phi_{\epsilon_1}$ as its eigenvector at $x_{\Phi_{\epsilon_1}}$ (since that means $R\Phi_{\epsilon_1} \cdot \nabla_x(\Phi_{\epsilon_1})$ is 0). For example, such is the case for Hamiltonian systems without mixed position-momentum terms. What we are really constraining is the behavior exhibited by the operator $\mathbf{N}$ in a system of interest.

If the problem of interest comes from a scientific domain with which the user has familiarity, they may supply a different sort of assumption based on those considerations. For example, we might be able to say: [...] at the $x_{\Phi_{\epsilon_1}}$ where $\|\Phi_{\epsilon_1}\|$ takes its maximum. The generalized bound simply changes to: [...]

D ADDITIONAL NUMERICAL EXPERIMENTS

As part of validating our claims, we also performed numerical experiments on several scientific ODEs and PDEs. We chose our experiments such that the set of examples showcases non-trivial spatial and temporal phenomena (chaos, high dimensional domains, etc). Table 1 presents the median results from the collection of randomized experiments conducted for each system. All numerical experiments were done using a 2019 MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 processor and 16 GB of 2667 MHz DDR4 RAM.

We describe the chosen systems below:

D.1 HENON HEILES

Henon Heiles is represented by the following ODE governing the dynamics of $\Phi \equiv \{x, y, p_x, p_y\}^\dagger$: [...] where $I$ is the $2 \times 2$ identity matrix. We picked $[0, 6\pi]$ as the time domain of interest. All $\Phi(0) \equiv \{x(0), y(0), p_x(0), p_y(0)\}$ were picked s.t. [...], and $y(0) \in [-0.5, 1 - \sqrt{3}|x|]$, which corresponds to the subset of the phase space that has bounded orbits.

We implement the NN DE solver prescribed in (10). The models each had 2 hidden layers, with 50 sine activation functions per layer. The base models were trained for 50000 iterations using ADAM, with an error correction made at 25000 iterations. We sampled 200 time points per iteration.

D.2 NONLINEAR POISSON BOLTZMANN EQUATION

We have already given details of the equation in the main paper, alongside the associated choices of dimensionality and frequency parameter. Below, we plot what the $(\omega, d) = (5, 2)$ system should look like when modeled without (a) and with (b) error correction, alongside the true solution (c). To quantify the visual, we also quote the mean value of $\|\mathcal{N}\|$, $\|\mathcal{N} + \mathcal{N}_{\epsilon_1}\|$, and $\|\Phi\|$ respectively.

