NEURAL NETWORK DIFFERENTIAL EQUATION SOLVERS ALLOW UNSUPERVISED ERROR ANALYSIS AND CORRECTION

Abstract

Neural Network Differential Equation (NN DE) solvers have surged in popularity due to a combination of factors: computational advances making their optimization more tractable, their capacity to handle high dimensional problems, easy interpretability, etc. However, most NN DE solvers suffer from a fundamental limitation: their loss functions are not explicitly dependent on the errors associated with the solution estimates. As such, validation and error estimation usually require knowledge of the true solution. Indeed, when the true solution is unknown, we are often reduced to simply hoping that a "low enough" loss implies "small enough" errors, since explicit relationships between the two are not available. In this work, we describe a general strategy for efficiently constructing error estimates and corrections for Neural Network Differential Equation solvers. Our methods do not require a priori knowledge of the true solutions and obtain explicit relationships between loss functions and the errors. In turn, these explicit relationships allow for the unsupervised estimation and correction of the model errors.

1. INTRODUCTION

Deep learning has heralded new methods for many scientific disciplines; the field of numerical methods for differential equations has been no exception (1; 2; 3; 4). Deep neural network based differential equation (NN DE) solvers have been proposed under a variety of names (PINNs (2), DGM (3), etc.), each catering to a different class of problems. However, they all share certain common features: the use of a known differential equation (DE) as the central component of an appropriate loss function, the use of existing knowledge (boundary conditions, experimental/synthetic data, etc.) to constrain the search for solutions, randomized optimization methods that sample from the domain of interest at a requisite resolution, etc.

We investigate another common facet of many NN DE solvers: the lack of unsupervised error quantification/correction methods that can estimate model errors without prior knowledge of the solution. Most solvers use the equation-based loss function as a surrogate measure for the error. However, while such a measure is intuitively related to the error in the solution model, an explicit description of that connection is mandatory if error quantification is to be done without knowing the solution. We achieve these goals by explicitly relating the loss terms and the model error. We show how these connections allow results on the model error that do not rely on prior knowledge of the solution, and we propose techniques by which these results can be used to build significantly more efficient NN DE solvers, with only marginal increases in computational complexity. We formalize our ideas into four theorems, two inequalities, and two algorithms, and we validate our claims with a collection of numerical experiments on several nontrivial DEs (including nonlinear PDEs). For the sake of readability and simplicity, all proofs are rigorously presented in the appendices, while the main text simply reports the results and discusses their significance.
An associated codebase is also provided, with inbuilt options for the DEs already studied as part of this work. However, the codebase has been designed so that the users may easily add their own DEs of interest (the assumptions under which this work is valid should be general enough for a wide variety of DEs from many different scientific disciplines).

2. NEURAL NETWORK DIFFERENTIAL EQUATION SOLVERS

For ease of discussion, let D ⊆ R^d denote some closed, bounded, path-connected domain of interest for some differential equation (DE). Let ∂D ⊆ D be the portion of the domain over which constraint conditions on the solution exist (usually obtained as boundary conditions, empirical data, etc.). Assume that our chosen DE, when given unique constraints over ∂D, admits a unique solution Φ : D → R^D. We wish to consider a (possibly nonlinear) equation operator F : G → H, where G and H are suitable spaces of functions over D (Φ ∈ G). We decompose F as

F[•] = L[•] + N[•] + C    (1)

where L represents the term(s) that depend linearly on Φ, N represents the term(s) that depend nonlinearly on Φ, and C represents the terms independent of Φ. This additive decomposition into linear, nonlinear, and constant terms is always possible; for linear DEs, N ≡ 0 ≡ C. We have F[Φ] = L[Φ] + N[Φ] + C = 0.

Let us assume we wish to construct an NN-based approximation N : D → R^D. Let us also assume the NN uses analytic activation functions, so that N ∈ C^∞. Let W be its width (neurons per hidden layer) and D_N be its depth (number of hidden layers). Finally, let w ≡ {b^1_1, w^1_{11}, w^1_{12}, ..., b^2_1, w^2_{11}, w^2_{12}, ...} ∈ M ⊆ R^M be the M weights and biases of the NN.
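The decomposition in Eq. 1 is straightforward to realize in code. The sketch below (not taken from the paper's codebase) uses a hypothetical scalar ODE, du/dt + u² − cos(t) = 0, on D = [0, 1], for which L[u] = du/dt, N[u] = u², and C = −cos(t); the network architecture and domain are illustrative assumptions.

```python
import torch

# Hypothetical example: the scalar ODE  du/dt + u^2 - cos(t) = 0,
# decomposed per Eq. 1 as F[u] = L[u] + N[u] + C with
#   L[u] = du/dt (linear),  N[u] = u^2 (nonlinear),  C = -cos(t).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def residual(t):
    """Evaluate F[N](t) term by term for the NN approximation."""
    t = t.requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    L_u = du_dt           # linear term
    N_u = u ** 2          # nonlinear term
    C = -torch.cos(t)     # solution-independent term
    return L_u + N_u + C

t_sample = torch.rand(128, 1)    # Monte Carlo sample of D = [0, 1]
print(residual(t_sample).shape)  # one residual value per sample point
```

Keeping the three terms separate in the residual function, rather than hard-coding F[N] as a single expression, is what later makes it possible to relate the loss to the error term by term.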

2.1. EXISTENCE AND COMPLEXITY

Feedforward NNs with ReLU activation and width W ≥ d + 4 can approximate arbitrarily well any Lebesgue-integrable function Φ : D → R^D w.r.t. the L^1 norm, provided that D is a compact subset of R^d and the NN is given enough depth D_N (5). The same holds for NNs with ReLU activation and depth D_N ≥ log_2(d + 1), provided that they are wide enough (6). Hence, there exist w such that N can arbitrarily well approximate any Φ ∈ C^k over D, given large enough (W, D_N).

These theoretical guarantees may be practically realized by leveraging the minimum regularity expected from Φ. Recall that W^{k,∞}(D) is the Sobolev space of order k, on compact D ⊆ R^d, w.r.t. the L^∞ norm. Let us assume that Φ ∈ W^{k,∞}(D). There exists some N with (W, D_N) that can ε-approximate Φ : D → R if W = d + 1 and D_N = O((diam(D)/ω_f^{-1}(ε))^d) (7), where ω_f^{-1}(ε) = sup{δ : ω_f(δ) ≤ ε}, δ = |x_1 − x_2|, x_1, x_2 ∈ D. There also exist N that ε-approximate Φ : D → R if M = O(ln(1/ε) ε^{−d/k}) and D_N = O(ln(1/ε)) (8).
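To get a feel for the second bound, one can tabulate the asymptotic scaling M ~ ln(1/ε) ε^{−d/k} for a few accuracy targets. The snippet below is purely illustrative (constants hidden by the O(·) are omitted), with hypothetical values d = 3, k = 2 chosen for concreteness.

```python
import math

# Illustrative only: evaluate the parameter-count scaling from (8),
# M ~ ln(1/eps) * eps^(-d/k), with the O(.) constants omitted.
# The choices d = 3, k = 2 are hypothetical, not from the paper.
def param_scale(eps, d=3, k=2):
    return math.log(1.0 / eps) * eps ** (-d / k)

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps = {eps:g}  ->  M scale ~ {param_scale(eps):.3g}")
```

The table makes the smoothness trade-off visible: for fixed dimension d, larger Sobolev order k slows the blow-up of the required parameter count as ε → 0.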

2.2. OPTIMIZATION

The loss function L used to train N usually takes the following form (2; 3; 4):

L = E_{x_1 ∈ D} ∥F[N(x_1)]∥_p + E_{x_2 ∈ ∂D} ∥Φ(x_2) − N(x_2)∥_p    (2)

where ∥•∥_p is the usual p-norm on R^D. Variants of Stochastic Gradient Descent (SGD) are almost always capable of eventually reaching adequately small values of such a loss, for large enough (W, D_N) (9). Thus, as L → 0 during optimization, the uniqueness of Φ implies N → Φ over D (formally shown in Theorem 1).
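A minimal sketch of this training setup (not the authors' implementation) is shown below for the hypothetical ODE du/dt + u² − cos(t) = 0 on D = [0, 1] with the single constraint u(0) = 0. Both expectations in Eq. 2 are estimated by uniform Monte Carlo sampling, and Adam stands in for the SGD variant; the architecture, learning rate, and step count are all illustrative.

```python
import torch

# Sketch of the loss in Eq. 2 for the hypothetical ODE
# du/dt + u^2 - cos(t) = 0 on D = [0, 1], constrained by u(0) = 0.
net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 1))

def loss(n_interior=256, p=2):
    t = torch.rand(n_interior, 1, requires_grad=True)          # x1 ~ Uniform(D)
    u = net(t)
    du = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    interior = (du + u**2 - torch.cos(t)).abs().pow(p).mean()  # E_{x1} ||F[N]||_p
    boundary = net(torch.zeros(1, 1)).abs().pow(p).mean()      # E_{x2} ||Phi - N||_p
    return interior + boundary

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):        # a few steps of an SGD variant
    opt.zero_grad()
    loss().backward()
    opt.step()
```

Note that nothing in this loop ever touches the true solution Φ on the interior of D, which is precisely why the loss alone cannot report the model error without the explicit loss-error relationships developed in this work.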

2.3. ALTERNATIVE PARAMETERIZATION OF THE CONSTRAINT CONDITIONS ON ∂D

Sometimes, the constraint conditions (such as initial or boundary conditions) may be enforced by using the following parametrization as the model N for Φ, at some arbitrary x_1 ∈ D:

N(x_1) = Φ(x_2) + dist(x_1, x_2) N_o    (3)

where x_2 ∈ ∂D is some appropriate nearest constraint point to x_1. Here, N_o represents the NN and dist(x_1, x_2) is some metric that enforces that the model N is always exact over ∂D, while allowing the NN flexibility to learn Φ elsewhere (e.g., 1 − e^{−∥x_1 − x_2∥}, as used in (10)). This parametrization can eliminate the need for the E_{x_2 ∈ ∂D} term in Eq. 2. Our work is applicable in such scenarios too.
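For an initial value problem this parametrization is a one-liner. The sketch below assumes a hypothetical problem on D = [0, T] whose single constraint point is x_2 = 0 with known value phi0, and uses dist(t, 0) = 1 − e^{−t}, which vanishes exactly at the constraint point.

```python
import torch

# Sketch of the hard-constraint parametrization in Eq. 3 for an initial
# value problem: nearest constraint point x2 = 0, known value phi0,
# and dist(t, 0) = 1 - exp(-t), which is zero exactly at t = 0.
net_o = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))
phi0 = 1.0  # hypothetical constraint value Phi(0)

def model(t):
    # Exact at t = 0 by construction, since the dist factor vanishes there.
    return phi0 + (1.0 - torch.exp(-t)) * net_o(t)

t0 = torch.zeros(1, 1)
print(model(t0))  # equals phi0 regardless of the network weights
```

Because the constraint is satisfied identically for every choice of weights, the boundary expectation in Eq. 2 can be dropped and optimization acts only on the equation residual.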

