NEURAL NETWORK APPROXIMATIONS OF PDES BEYOND LINEARITY: REPRESENTATIONAL PERSPECTIVE

Abstract

A burgeoning line of research has developed deep neural networks capable of approximating the solutions to high-dimensional PDEs, opening related lines of theoretical inquiry focused on explaining how these models appear to evade the curse of dimensionality. However, most theoretical analyses thus far have been limited to simple linear PDEs. In this work, we take a step towards studying the representational power of neural networks for approximating solutions to nonlinear PDEs. We focus on a class of PDEs known as nonlinear variational elliptic PDEs, whose solutions minimize an Euler-Lagrange energy functional $E(u) = \int_\Omega L(\nabla u)\,dx$. We show that if composing a function with Barron norm $b$ with $L$ produces a function of Barron norm at most $B_L b^p$, the solution to the PDE can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\big((dB_L)^{p\log(1/\epsilon)}\big)$. By a classical result due to Barron (1993), this correspondingly bounds the size of a 2-layer neural network needed to approximate the solution. Treating $p$, $\epsilon$, and $B_L$ as constants, this quantity is polynomial in the dimension, thus showing that neural networks can evade the curse of dimensionality. Our proof technique involves "neurally simulating" (preconditioned) gradient descent in an appropriate Hilbert space, which converges exponentially fast to the solution of the PDE, and is such that we can bound the increase of the Barron norm at each iterate. Our results subsume and substantially generalize analogous prior results for linear elliptic PDEs.

1. INTRODUCTION

Scientific applications have become one of the new frontiers for deep learning (Jumper et al., 2021; Tunyasuvunakool et al., 2021; Sønderby et al., 2020). PDEs are one of the fundamental modeling tools in scientific domains, and designing neural network-aided solvers, particularly in high dimensions, is of widespread use in many fields (Hsieh et al., 2019; Brandstetter et al., 2022). One of the most common approaches for applying neural networks to solve PDEs is to parameterize the solution as a neural network and minimize a loss which characterizes the solution (Sirignano & Spiliopoulos, 2018; E & Yu, 2017). The hope in doing so is to obtain a method which computationally avoids the "curse of dimensionality", i.e., one that scales less than exponentially with the ambient dimension. To date, neither theoretical analysis nor empirical applications have yielded a precise characterization of the range of PDEs for which neural network-aided methods outperform classical methods. Active research on the empirical side (Han et al., 2018; E et al., 2017; Li et al., 2020a;b) has explored several families of PDEs, e.g., Hamilton-Jacobi-Bellman and Black-Scholes, where neural networks have been demonstrated to outperform classical grid-based methods. On the theory side, a recent line of works (Marwah et al., 2021; Chen et al., 2021; 2022)

has considered the following fundamental question:

For what families of PDEs can the solution be represented by a small neural network?

The motivation for this question is computational: the complexity of fitting a neural network (by minimizing some objective) grows with its size. Specifically, these works focus on understanding when the approximating neural network can be sub-exponential in size, thus avoiding the curse of dimensionality. Unfortunately, the techniques introduced in this line of work have so far only been applicable to linear PDEs. In this paper, we take the first step beyond linear PDEs, with a particular focus on nonlinear variational elliptic PDEs. These equations have the form $-\mathrm{div}(\nabla L(\nabla u)) = 0$ and are instances of nonlinear Euler-Lagrange equations. Equivalently, $u$ is the minimizer of the energy functional $E(u) = \int_\Omega L(\nabla u)\,dx$. This paradigm is very generic: its origins are in Lagrangian formulations of classical mechanics, and for different $L$, a variety of variational problems can be modeled or learned (Schmidt & Lipson, 2009; Cranmer et al., 2020). These PDEs have a variety of applications in scientific domains, e.g., (non-Newtonian) fluid dynamics (Koleva & Vulkov, 2018), meteorology (Weller et al., 2016), and nonlinear diffusion equations (Burgers, 2013). Our main result shows that when the function $L$ has "low complexity", so does the solution. The notion of complexity we work with is the Barron norm of the function, similar to Chen et al. (2021); Lee et al. (2017). This is a frequently used notion of complexity, since a function with small Barron norm can be represented by a small two-layer neural network, due to a classical result (Barron, 1993).
Mathematically, our proof techniques are based on "neurally unfolding" preconditioned gradient descent in an appropriate function space: namely, we show that each of the iterates can be represented by a neural network with Barron norm not much worse than that of the previous iterate, along with a bound on the number of required steps. Importantly, our results go beyond the typical non-parametric bounds on the size of an approximating network that can be obtained by combining classical regularity results for nonlinear variational PDEs (De Giorgi, 1957; Nash, 1957; 1958) with universal approximation results (Yarotsky, 2017).

2. OVERVIEW OF RESULTS

Let $\Omega \subset \mathbb{R}^d$ be a bounded open set with $0 \in \Omega$, and let $\partial\Omega$ denote the boundary of $\Omega$. Furthermore, we assume that the domain $\Omega$ is such that the Poincaré constant $C_p$ is greater than 1 (see Theorem 2 for the exact definition of the Poincaré constant). We first define the energy functional whose minimizers are characterized by a nonlinear variational elliptic PDE, i.e., the Euler-Lagrange equation of the energy functional.

Definition 1 (Energy functional). For all $u : \Omega \to \mathbb{R}$ such that $u|_{\partial\Omega} = 0$, we consider an energy functional of the form
$$E(u) = \int_\Omega L(\nabla u)\,dx,$$
where $L : \mathbb{R}^d \to \mathbb{R}$ is a smooth and uniformly convex function, i.e., there exist constants $0 < \lambda \le \Lambda$ such that for all $x \in \mathbb{R}^d$ we have $\lambda I_d \preceq D^2 L(x) \preceq \Lambda I_d$. Further, without loss of generality, we assume that $\lambda \le 1/C_p$.

Note that due to the uniform convexity of $L$, the minimizer $u^\star$ exists and is unique. The proof of existence and uniqueness is standard (e.g., Theorem 3.3 in Fernández-Real & Ros-Oton (2020)). Writing down the condition for stationarity, we can derive a (nonlinear) elliptic PDE satisfied by the minimizer of the energy functional in Definition 1.

Lemma 1. Let $u^\star : \Omega \to \mathbb{R}$ be the unique minimizer of the energy functional in Definition 1. Then for all $\varphi \in H^1_0(\Omega)$ the minimizer $u^\star$ satisfies
$$dE[u^\star](\varphi) = \int_\Omega \nabla L(\nabla u^\star) \cdot \nabla\varphi\,dx = 0,$$
where $dE[u](\varphi)$ denotes the directional derivative of the energy functional at $u$ in the direction $\varphi$. Thus, the minimizer $u^\star$ satisfies the PDE
$$DE(u) := -\mathrm{div}(\nabla L(\nabla u)) = 0 \quad \forall x \in \Omega, \qquad (3)$$
with $u(x) = 0$ for all $x \in \partial\Omega$. Here $\mathrm{div}$ denotes the divergence operator.

The proof of the Lemma can be found in Appendix Section A.1. Here, $-\mathrm{div}(\nabla L(\nabla\cdot))$ is a functional operator that acts on a function (in this case $u$). Our goal is to determine whether the solution to the PDE in Equation 3 can be expressed by a neural network with a small number of parameters.
In order to do so, we utilize the concept of a Barron norm, which measures the complexity of a function in terms of its Fourier representation. We show that if composing a function with $L$ increases its Barron norm only in a bounded fashion, then the solution to the PDE in Equation 3 has a bounded Barron norm. The motivation for using this norm is a seminal paper (Barron, 1993), which established that any function with Barron norm $C$ can be $\epsilon$-approximated in the $L^2$ sense by a two-layer neural network of size $O(C^2/\epsilon)$, thus evading the curse of dimensionality if $C$ is substantially smaller than exponential in $d$. Informally, we will show the following result:

Theorem 1 (Informal). Let $L$ be convex and smooth, such that composing a function with Barron norm $b$ with $L$ produces a function of Barron norm at most $B_L b^p$. Then, for all sufficiently small $\epsilon > 0$, the minimizer of the energy functional in Definition 1 can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\big((dB_L)^{p\log(1/\epsilon)}\big)$.

As a consequence, when $\epsilon$ and $p$ are thought of as constants, we can represent the solution to the Euler-Lagrange PDE in Equation 3 by a polynomially-sized network, as opposed to the exponentially-sized network we would get from standard universal approximation results combined with regularity results for the solutions of the PDE. We establish this by "neurally simulating" preconditioned gradient descent (for a strongly convex loss) in an appropriate Hilbert space, and show that the Barron norm of each iterate (which is a function) is finite, and at most polynomially larger than the Barron norm of the previous iterate. We get the final bound by (i) bounding the growth of the Barron norm at every iteration; and (ii) bounding the number of iterations required to reach an $\epsilon$-approximation to the solution.
The result is formally stated in Section 5. Note that this stands in contrast with classical grid-based solvers, which proceed by discretizing the input space and are hence limited to problems on low-dimensional input spaces.

3. RELATED WORK

Several recent works theoretically analyze neural network based approaches for solving PDEs. Mishra & Molinaro (2020) study the generalization properties of physics-informed neural networks. Lu et al. (2021) provide a generalization analysis of the Deep Ritz method for elliptic equations such as the Poisson equation, and Lu & Lu (2021) extend this analysis to the Schrödinger eigenvalue problem. In addition to analyzing the generalization capabilities of neural networks, theoretical analysis of their representational capabilities has also gained a lot of attention. Khoo et al. (2021) show the existence of an approximating network by discretizing the input space into a mesh and then using convolutional NNs, where the size of the layers is exponential in the input dimension. Sirignano & Spiliopoulos (2018) provide a universal approximation result, showing that for sufficiently regular PDEs, there exists a multilayer network that approximates the solution. Jentzen et al. (2018); Grohs & Herrmann (2020); Hutzenthaler et al. (2020) provide a better-than-exponential dependence on the input dimension for some special parabolic PDEs, based on a stochastic representation using the Feynman-Kac Lemma, thus limiting the applicability of their approach to PDEs that have such a probabilistic interpretation. Moreover, their results avoid the curse of dimensionality only over domains with unit volume. A recent line of work has focused on families of PDEs for which neural networks evade the curse of dimensionality, i.e., for which the solution can be approximated by a neural network of subexponential size. Marwah et al. (2021) show that for elliptic PDEs whose coefficients are approximable by neural networks with at most $N$ parameters, there exists a neural network that $\epsilon$-approximates the solution and has size $O(d^{\log(1/\epsilon)} N)$.
As mentioned, while most previous works show key representational results for neural network approximations of solutions to PDEs, their analysis is largely limited to simple linear PDEs. The focus of this paper is on extending these results to the family of nonlinear variational PDEs. This family contains many well-known PDEs, such as the p-Laplacian (on a bounded domain), which is used to model phenomena like non-Newtonian fluid dynamics and nonlinear diffusion processes. Establishing the regularity of solutions for this family of PDEs was posed as Hilbert's XIXth problem. We note that classical results like De Giorgi (1957) and Nash (1957; 1958) provide regularity estimates for the solutions of nonlinear variational elliptic PDEs of the form in Equation 3. One can easily use these regularity estimates, along with standard universal approximation results (Yarotsky, 2017), to show that the solutions can be approximated arbitrarily well. However, the size of the resulting networks is exponentially large (i.e., suffers from the curse of dimensionality), so this approach is of no use for our desired results.

4. NOTATION AND DEFINITIONS

In this section, we introduce key concepts and notation used throughout the paper. For a vector $x \in \mathbb{R}^d$, we use $\|x\|_2$ to denote its $\ell_2$ norm. Further, $C^\infty(\Omega)$ denotes the set of functions $f : \Omega \to \mathbb{R}$ that are infinitely differentiable. We also define some important function spaces and associated key results below.

Definition 2. For a vector-valued function $g : \Omega \to \mathbb{R}^d$, we define the $L^p(\Omega)$ norm for $p \in [1, \infty)$ as
$$\|g\|_{L^p(\Omega)} = \left(\int_\Omega \sum_{i=1}^d |g_i(x)|^p\,dx\right)^{1/p}.$$
For $p = \infty$, we have $\|g\|_{L^\infty(\Omega)} = \max_{1 \le i \le d}\|g_i\|_{L^\infty(\Omega)}$, where $\|g_i\|_{L^\infty(\Omega)} = \inf\{c \ge 0 : |g_i(x)| \le c \text{ for almost all } x \in \Omega\}$.

Definition 3. For a domain $\Omega$, the space $H^1_0(\Omega)$ is defined as
$$H^1_0(\Omega) := \{g : \Omega \to \mathbb{R} : g \in L^2(\Omega),\ \nabla g \in L^2(\Omega),\ g|_{\partial\Omega} = 0\}.$$
The corresponding norm is $\|g\|_{H^1_0(\Omega)} = \|\nabla g\|_{L^2(\Omega)}$.

We will make use of the Poincaré inequality throughout several of our results.

Theorem 2 (Poincaré inequality, Poincaré (1890)). For $\Omega \subset \mathbb{R}^d$ open and bounded, there exists a constant $C_p > 0$ such that for all $u \in H^1_0(\Omega)$,
$$\|u\|_{L^2(\Omega)} \le C_p\|\nabla u\|_{L^2(\Omega)}.$$
This constant can be very benignly behaved with dimension for many natural domains, even dimension-independent; examples include convex domains (Payne & Weinberger, 1960).
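As a quick numerical illustration (ours, not from the paper), the best constant in the Poincaré inequality is governed by the smallest Dirichlet eigenvalue of $-\Delta$: on the unit interval $\Omega = (0,1)$ that eigenvalue is $\pi^2$, giving $C_p = 1/\pi$. The finite-difference sketch below recovers this; the grid size and variable names are our own choices, and of course the paper's assumption $C_p > 1$ concerns other domains.

```python
# Finite-difference check of the Poincaré constant on Omega = (0, 1):
# the optimal constant is C_p = 1/sqrt(lambda_1), where lambda_1 is the
# smallest Dirichlet eigenvalue of -d^2/dx^2, known to equal pi^2.
import numpy as np

n = 500
h = 1.0 / (n + 1)
# -u'' via the standard second-difference stencil, with u(0) = u(1) = 0
main = np.full(n, 2.0) / h**2
off = np.full(n - 1, -1.0) / h**2
A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

lam1 = float(np.linalg.eigvalsh(A)[0])   # smallest eigenvalue, ~ pi^2
C_p = 1.0 / np.sqrt(lam1)                # ~ 1/pi ~ 0.3183
```

The discrete eigenvalue converges to $\pi^2$ at rate $O(h^2)$, so even a modest grid recovers the constant to several digits.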

4.1. BARRON NORMS

For a function $f : \mathbb{R}^d \to \mathbb{R}$, the Fourier transform and the inverse Fourier transform are defined as
$$\hat f(\omega) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} f(x)e^{-ix^T\omega}\,dx, \qquad f(x) = \int_{\mathbb{R}^d}\hat f(\omega)e^{ix^T\omega}\,d\omega.$$
The Barron norm is an average of the norm of the frequency vector weighted by the Fourier magnitude $|\hat f(\omega)|$. A slight technical issue is that the Fourier transform is defined only for functions on all of $\mathbb{R}^d$. Since we are interested in defining the Barron norm of functions defined over a bounded domain, we allow for arbitrary extensions of a function outside of its domain. (This is the standard definition, e.g., in Barron (1993).)

Definition 4. We define $\mathcal{F}$ to be the set of functions $g \in L^1(\Omega)$ such that the Fourier inversion formula holds over the domain $\Omega$, i.e.,
$$\mathcal{F} = \left\{g : \mathbb{R}^d \to \mathbb{R} \ :\ \forall x \in \Omega,\ g(x) = g(0) + \int_{\mathbb{R}^d}\big(e^{i\omega^T x} - 1\big)\hat g(\omega)\,d\omega\right\}.$$

Definition 5 (Spectral Barron norm, (Barron, 1993)). Let $\Gamma$ be the set of functions defined over $\Omega$ whose extensions over $\mathbb{R}^d$ belong to $\mathcal{F}$, that is, $\Gamma = \{f : \Omega \to \mathbb{R} : \exists g,\ g|_\Omega = f,\ g \in \mathcal{F}\}$. Then we define the spectral Barron norm $\|\cdot\|_{\mathcal{B}(\Omega)}$ as
$$\|f\|_{\mathcal{B}(\Omega)} = \inf_{g|_\Omega = f,\ g \in \mathcal{F}}\int_{\mathbb{R}^d}(1 + \|\omega\|_2)|\hat g(\omega)|\,d\omega.$$

The Barron norm is an $L^1$ relaxation of requiring sparsity in the Fourier basis, which is intuitively why it confers representational benefits in terms of the size of the neural network required. We refer to Barron (1993) for a more exhaustive list of the Barron norms of some common function classes. The main theorem from Barron (1993) formalizes this intuition, by bounding the size of a 2-layer network approximating a function with small Barron norm:

Theorem 3 (Theorem 1, Barron (1993)). Let $f \in \Gamma$ be such that $\|f\|_{\mathcal{B}(\Omega)} \le C$, and let $\mu$ be a probability measure defined over $\Omega$. There exist $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$ and $c_i \in \mathbb{R}$ with $\sum_{i=1}^k |c_i| \le 2C$ such that the function $f_k(x) = \sum_{i=1}^k c_i\,\sigma(a_i^T x + b_i)$ satisfies
$$\int_\Omega (f(x) - f_k(x))^2\,\mu(dx) \le \frac{4C^2}{k}.$$
Here $\sigma$ denotes a sigmoidal activation function, i.e., $\lim_{x\to\infty}\sigma(x) = 1$ and $\lim_{x\to-\infty}\sigma(x) = 0$.
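To make Definition 5 and Theorem 3 concrete, here is a small numerical sketch (our illustration, not from the paper). For the 1-D Gaussian $f(x) = e^{-x^2/2}$, whose transform under the convention above is $\hat f(\omega) = e^{-\omega^2/2}/\sqrt{2\pi}$, the spectral Barron norm has the closed form $1 + \sqrt{2/\pi} \approx 1.80$; Theorem 3 then gives the network width needed for a target squared $L^2$ error.

```python
# Spectral Barron norm of f(x) = exp(-x^2/2) in 1-D, computed by
# numerical integration of (1 + |w|) * |fhat(w)| over frequencies,
# with fhat(w) = exp(-w^2/2) / sqrt(2*pi). Closed form: 1 + sqrt(2/pi).
import numpy as np

w = np.linspace(-40.0, 40.0, 400001)
dw = w[1] - w[0]
fhat = np.exp(-w**2 / 2.0) / np.sqrt(2.0 * np.pi)
barron_norm = float(np.sum((1.0 + np.abs(w)) * fhat) * dw)

# Theorem 3: squared L2 error <= 4 C^2 / k, so a width of
# k = ceil(4 C^2 / eps2) suffices for squared error eps2.
eps2 = 1e-2
width = int(np.ceil(4.0 * barron_norm**2 / eps2))
```

Note the width depends only on the Barron norm and the target error, not on the ambient dimension, which is the source of the "evading the curse of dimensionality" statements in the paper.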
Note that while Theorem 3 is stated for sigmoidal activations like sigmoid and tanh (after appropriate rescaling), the results are also valid for ReLU activations, since $\mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$ is in fact sigmoidal. We will also need to work with functions that have no Fourier mass beyond some frequency (i.e., are band-limited), hence we introduce the following definition:

Definition 6. Let $\mathcal{F}_W$ be the set of functions whose Fourier coefficients vanish outside a bounded ball, i.e., $\mathcal{F}_W = \{g : \mathbb{R}^d \to \mathbb{R} : \hat g(\omega) = 0\ \forall\,\|\omega\|_\infty \ge W\}$. Similarly, we denote $\Gamma_W = \{f : \Omega \to \mathbb{R} : \exists g,\ g|_\Omega = f,\ g \in \mathcal{F}_W\}$.

Since we will work with vector-valued functions, we also define the Barron norm of a vector-valued function as the maximum of the Barron norms of its coordinates:

Definition 7. For a vector-valued function $g : \Omega \to \mathbb{R}^d$, we define $\|g\|_{\mathcal{B}(\Omega)} = \max_i \|g_i\|_{\mathcal{B}(\Omega)}$.

5. MAIN RESULT

Before stating the main result we introduce the key assumption.

Assumption 1. The function $L$ in Definition 1 can be approximated by a function $\tilde L : \mathbb{R}^d \to \mathbb{R}$ in the sense that there exists a constant $\epsilon_L \in [0, \lambda)$ with
$$\sup_{x \in \mathbb{R}^d}\|\nabla L(x) - \nabla\tilde L(x)\|_2 \le \epsilon_L\|x\|_2.$$
Furthermore, we assume that $\tilde L$ is such that for any $g \in H^1_0(\Omega)$, we have $\tilde L \circ g \in H^1_0(\Omega)$, $\tilde L \circ g \in \mathcal{F}$, and
$$\|\tilde L \circ g\|_{\mathcal{B}(\Omega)} \le B_L\|g\|^p_{\mathcal{B}(\Omega)} \qquad (5)$$
for some constants $B_L \ge 0$ and $p \ge 0$. Furthermore, if $g \in \Gamma_W$ then $\tilde L \circ g \in \Gamma_{kW}$ for some $k > 0$.

This assumption is fairly natural: it states that $L$ can be approximated (up to $\epsilon_L$, in the sense of the gradients of the functions) by a function $\tilde L$ with the property that when applied to a function $g$ with small Barron norm, the resulting Barron norm is not much bigger. The constant $p$ specifies the order of this growth. The cases for which our results are most interesting are those where the dependence of $B_L$ on $d$ is at most polynomial, so that the final size of the approximating network does not exhibit the curse of dimensionality. For instance, we can take $L$ to be a multivariate polynomial of degree up to $P$: we show in Lemma 7 that the constant $B_L$ is $O(d^P)$ (intuitively, this dependence comes from the total number of monomials of this degree), whereas $p$ and $k$ are both $O(P)$.

With all the assumptions stated, we now state our main theorem:

Theorem 4 (Main Result). Consider the nonlinear variational elliptic PDE in Equation 3 satisfying Assumption 1, and let $u^\star \in H^1_0(\Omega)$ denote the solution to the PDE. If $u_0 \in H^1_0(\Omega)$ is a function such that $u_0 \in \Gamma_{W_0}$, then for all sufficiently small $\epsilon > 0$ and
$$T := \log\!\left(\frac{2(E(u_0) - E(u^\star))}{\lambda\epsilon}\right)\Big/\log\!\left(\frac{1}{1 - \frac{\lambda^5}{(1+C_p)\Lambda^4}}\right),$$
there exists a function $u_T \in H^1_0(\Omega)$ such that $u_T \in \Gamma_{k^T W_0}$, with Barron norm bounded as
$$\|u_T\|_{\mathcal{B}(\Omega)} \le \left(1 + \frac{\lambda^3}{(C_p+1)\Lambda^3}\,dk^2 W_0^2 B_L\right)^{p^{T+1}}\|u_0\|^{p^T}_{\mathcal{B}(\Omega)}.$$
Furthermore, $u_T$ satisfies $\|u_T - u^\star\|_{H^1_0(\Omega)} \le \epsilon + \varepsilon$, where
$$\varepsilon \le \frac{\epsilon_L\,\frac{\lambda^3}{(C_p+1)\Lambda^3}\left(\|u^\star\|_{H^1_0(\Omega)} + \frac{1}{\lambda}E(u_0)\right)}{\Lambda + \epsilon_L}\left(\left(1 + \frac{\lambda^3}{(C_p+1)\Lambda^3}(\Lambda + \epsilon_L)\right)^T - 1\right).$$

Remark 1: The function $u_0$ can be seen as an initial estimate of the solution, which is refined to an estimate $u_T$ that is progressively better at the expense of a larger Barron norm. A trivial choice is $u_0 = 0$, which has Barron norm 1 and, by Lemma 2, satisfies $E(u_0) - E(u^\star) \le \Lambda\|u^\star\|^2_{H^1_0(\Omega)}$.

Remark 2: The final approximation error has two terms. The number of steps $T$ goes to $\infty$ as $\epsilon$ tends to 0, which is a consequence of the way $u_T$ is constructed: by simulating a functional (preconditioned) gradient descent which converges to the solution of the PDE. The error term $\varepsilon$ stems from the approximation of $L$ by $\tilde L$, and grows as $T$ increases: it is a consequence of the fact that the gradient descent updates with $L$ and $\tilde L$ progressively drift apart as $T$ tends to $\infty$.

Remark 3: As in the informal theorem, if we think of $p, \Lambda, \lambda, C_p, k, \|u_0\|_{\mathcal{B}(\Omega)}$ as constants, the theorem implies that $u^\star$ can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\big((dB_L)^{p\log(1/\epsilon)}\big)$. Therefore, combining Theorem 4 with Theorem 3, the total number of parameters required to $\epsilon$-approximate $u^\star$ by a 2-layer neural network is $O\big(\frac{1}{\epsilon^2}(dB_L)^{2p\log(1/\epsilon)}\big)$.

Remark 4: We further note that this result recovers (and generalizes) prior results which bound the Barron norm of solutions of linear elliptic PDEs, such as Chen et al. (2021). In those results, the elliptic PDE takes the form $-\mathrm{div}(A\nabla u) = 0$ and $A$ is assumed to have bounded Barron norm. Thus $\|L \circ u\|_{\mathcal{B}(\Omega)} \le d^2\|A\|_{\mathcal{B}(\Omega)}\|u\|_{\mathcal{B}(\Omega)}$, satisfying Equation 5 in Assumption 1 with $p = 1$.
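To get a feel for the quantities in Theorem 4, the following toy computation (every constant here is hypothetical, chosen only for illustration and not taken from the paper) evaluates the number of iterations $T$ and the resulting Barron-norm bound:

```python
# Toy evaluation of the bounds in Theorem 4 with made-up constants.
import math

lam, Lam, C_p = 0.5, 1.0, 2.0          # hypothetical lambda, Lambda, Poincare constant
eps = 1e-2                             # target accuracy
gap0 = 1.0                             # E(u_0) - E(u*), assumed

rate = 1.0 - lam**5 / ((1 + C_p) * Lam**4)         # per-step contraction factor
T = math.ceil(math.log(2 * gap0 / (lam * eps)) / math.log(1.0 / rate))

eta = lam**3 / ((C_p + 1) * Lam**3)                # preconditioned step size
d, k, W0, B_L, p = 10, 2, 1.0, 5.0, 1              # hypothetical problem constants
u0_norm = 1.0                                      # Barron norm of u_0 = 0
barron_bound = (1 + eta * d * k**2 * W0**2 * B_L) ** (p ** (T + 1)) * u0_norm ** (p ** T)
```

With $p = 1$ (the linear-growth case of Remark 4) the outer exponent collapses to 1, so the bound stays modest even though $T$ is in the hundreds; for $p > 1$ the doubly exponential dependence on $T$ is why $T = O(\log(1/\epsilon))$ matters.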

6. PROOF OF MAIN RESULT

The proof will proceed by "neurally unfolding" preconditioned gradient descent on the objective $E$ in the Hilbert space $H^1_0(\Omega)$. This is inspired by previous works of Marwah et al. (2021); Chen et al. (2021), where the authors show that for a linear elliptic PDE, a quadratic objective can be designed. In our case, we show that $E$ is "strongly convex" in a suitable sense, thus again bounding the number of steps needed. More precisely, the result proceeds in two parts:

1. First, we show that the sequence of functions $\{u_t\}_{t=0}^\infty$, where
$$u_{t+1} \leftarrow u_t - \eta(I - \Delta)^{-1}DE(u_t),$$
can be interpreted as performing preconditioned gradient descent with the (constant) preconditioner $(I - \Delta)^{-1}$. We show that in an appropriate sense (Lemma 2), $E$ is strongly convex in $H^1_0(\Omega)$, so the updates converge at a rate of $O(\log(1/\epsilon))$.

2. We then show that the Barron norm of each iterate $u_{t+1}$ can be bounded in terms of the Barron norm of the previous iterate $u_t$. We show this in Lemma 5: under Assumption 1, $\|u_{t+1}\|_{\mathcal{B}(\Omega)}$ is $O(d\|u_t\|^p_{\mathcal{B}(\Omega)})$. By unrolling this recursion we show that the Barron norm of the $\epsilon$-approximation of $u^\star$ is of the order $O\big(d^{p^T}\|u_0\|^{p^T}_{\mathcal{B}(\Omega)}\big)$, where $T$ is the total number of steps required for an $\epsilon$-approximation and $\|u_0\|_{\mathcal{B}(\Omega)}$ is the Barron norm of the first function in the iterative updates.

We now proceed to delineate the main technical ingredients for both of these parts.
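To make step 1 concrete, here is a minimal 1-D finite-difference sketch (our illustration, not the paper's construction) of the iteration $u_{t+1} \leftarrow u_t - \eta(I-\Delta)^{-1}DE(u_t)$. Since the PDE with zero boundary data and no forcing has the trivial minimizer $u = 0$, we add a hypothetical source term $f$, so that $DE(u) = -(L'(u'))' - f$, purely to make the run nontrivial; the integrand $L$, the grid size, and the step size are all our own choices.

```python
# Preconditioned gradient descent on a discretized 1-D variational energy
# E(u) = sum_j h * L((Gu)_j) - sum_j h * f_j u_j on (0, 1), u(0) = u(1) = 0.
import numpy as np

n = 199
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)

# Forward-difference gradient G (implicit zero boundaries); K = G^T G ~ -Laplacian
G = np.zeros((n + 1, n))
for i in range(n + 1):
    if i < n:
        G[i, i] = 1.0
    if i > 0:
        G[i, i - 1] = -1.0
G /= h
K = G.T @ G

def grad_L(s):
    # gradient of L(s) = s^2/2 + 0.1*log(cosh(s)); D^2 L in [1, 1.1],
    # so L is uniformly convex and smooth (lambda = 1, Lambda = 1.1)
    return s + 0.1 * np.tanh(s)

f = np.sin(np.pi * x)               # hypothetical source (the paper has f = 0)
u = np.zeros(n)
P = np.eye(n) + K                   # the preconditioner I - Laplacian
eta = 0.5
for _ in range(500):
    dE = G.T @ grad_L(G @ u) - f    # discrete DE(u) = -div(grad L(grad u)) - f
    u -= eta * np.linalg.solve(P, dE)

residual = float(np.linalg.norm(G.T @ grad_L(G @ u) - f))
```

The residual drops geometrically, mirroring the $O(\log(1/\epsilon))$ iteration count from the convergence argument: without the preconditioner, the step size would have to shrink with the mesh, while $(I + K)^{-1}$ keeps the effective condition number $O(1)$.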

6.1. CONVERGENCE RATE OF SEQUENCE

The proof of convergence to the solution $u^\star$ is based on adapting the standard (finite-dimensional) proof of convergence of gradient descent when minimizing a strongly convex function $f$. Recall, the basic idea is to Taylor expand $f(x + \delta) \approx f(x) + \nabla f(x)^T\delta + O(\|\delta\|^2)$. Taking $\delta = -\eta\nabla f(x)$, we lower bound the progress term $\eta\|\nabla f(x)\|^2$ using the convexity of $f$, and upper bound the second-order term $\eta^2\|\nabla f(x)\|^2$ using the smoothness of $f$. We follow analogous steps, and prove that we can lower bound the progress term using an appropriate notion of convexity of $E$, and upper bound the second-order term using an appropriate notion of smoothness of $E$, when considered as a function over $H^1_0(\Omega)$. Precisely, we show:

Lemma 2 (Strong convexity of $E$ in $H^1_0$). If $E, L$ are as in Definition 1, we have:

1. $\forall u, v \in H^1_0(\Omega)$: $\langle DE(u), v\rangle_{L^2(\Omega)} = \int_\Omega -\mathrm{div}(\nabla L(\nabla u))\,v\,dx = \int_\Omega\nabla L(\nabla u)\cdot\nabla v\,dx$.

2. $\forall u, v \in H^1_0(\Omega)$: $\lambda\|u - v\|^2_{H^1_0(\Omega)} \le \langle DE(u) - DE(v), u - v\rangle_{L^2(\Omega)} \le \Lambda\|u - v\|^2_{H^1_0(\Omega)}$.

3. $\forall u, v \in H^1_0(\Omega)$: $\frac{\lambda}{2}\|\nabla v\|^2_{L^2(\Omega)} + \langle DE(u), v\rangle_{L^2(\Omega)} \le E(u+v) - E(u) \le \langle DE(u), v\rangle_{L^2(\Omega)} + \frac{\Lambda}{2}\|\nabla v\|^2_{L^2(\Omega)}$.

4. $\forall u \in H^1_0(\Omega)$: $\frac{\lambda}{2}\|u - u^\star\|^2_{H^1_0(\Omega)} \le E(u) - E(u^\star) \le \frac{\Lambda}{2}\|u - u^\star\|^2_{H^1_0(\Omega)}$.

Part 1 is a helpful way to rewrite an inner product of a "direction" $v$ with $DE(u)$; it is essentially a consequence of integration by parts and the Dirichlet boundary condition. Parts 2 and 3 are common proxies for convexity: they formalize the notion that $E$ is strongly convex when viewed as a function over $H^1_0(\Omega)$. Finally, part 4 is a consequence of strong convexity, capturing the fact that if the value of $E(u)$ is suboptimal, $u$ must be (quantitatively) far from $u^\star$.
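As a sanity check on the kind of two-sided bounds appearing in Lemma 2 (a simplified, pointwise illustration of ours, not the paper's proof), consider the hypothetical uniformly convex integrand $L(s) = s^2/2 + 0.1\log\cosh(s)$ in one dimension, which has $\lambda = 1$ and $\Lambda = 1.1$. The integrand-level analogue of part 4 is $\lambda s^2/2 \le L(s) - L(0) \le \Lambda s^2/2$, which integrating over $\Omega$ turns into the bound on $E(u) - E(u^\star)$:

```python
# Pointwise two-sided convexity bounds for L(s) = s^2/2 + 0.1*log(cosh(s)),
# whose second derivative 1 + 0.1*(1 - tanh(s)^2) lies in [1, 1.1].
import numpy as np

s = np.linspace(-10.0, 10.0, 20001)
L_vals = s**2 / 2.0 + 0.1 * np.log(np.cosh(s))   # L(0) = 0
lam, Lam = 1.0, 1.1

lower_ok = bool(np.all(lam * s**2 / 2.0 <= L_vals + 1e-12))
upper_ok = bool(np.all(L_vals <= Lam * s**2 / 2.0 + 1e-12))
```

Both checks pass because $0 \le \log\cosh(s) \le s^2/2$ for all $s$, the elementary fact behind the choice $\lambda = 1$, $\Lambda = 1.1$ for this integrand.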
The proof of the Lemma can be found in the Appendix (Section B.1). When analyzing gradient descent (in finite dimensions) to minimize a loss $E$, the standard condition for progress is that the inner product of the gradient with the direction towards the optimum is lower bounded as $\langle DE(u), u - u^\star\rangle_{L^2(\Omega)} \ge \alpha\|u - u^\star\|^2_{L^2(\Omega)}$. From parts 2 and 3, one can readily see that the above condition is only satisfied "with the wrong norm": i.e., we only have $\langle DE(u), u - u^\star\rangle_{L^2(\Omega)} \ge \alpha\|u - u^\star\|^2_{H^1_0(\Omega)}$. We can fix this mismatch by instead performing preconditioned gradient descent, with the fixed preconditioner $(I - \Delta)^{-1}$. Towards that, the main lemma about the preconditioner we require is the following:

Lemma 3 (Norms with preconditioning). For all $v \in H^1_0(\Omega)$, we have:

1. $\|(I - \Delta)^{-1}\nabla\cdot\nabla v\|_{L^2(\Omega)} = \|(I - \Delta)^{-1}\Delta v\|_{L^2(\Omega)} \le \|v\|_{L^2(\Omega)}$.

2. $\langle(I - \Delta)^{-1}v, v\rangle_{L^2(\Omega)} \ge \frac{1}{1+C_p}\langle(-\Delta)^{-1}v, v\rangle_{L^2(\Omega)}$.

The first part of the lemma is a relatively simple consequence of the fact that the operators $\Delta$ and $\nabla$ "commute", and therefore can be reordered. The second part can be understood intuitively as follows: $(I - \Delta)^{-1}$ and $(-\Delta)^{-1}$ act similarly on eigenfunctions of $-\Delta$ with large eigenvalues (the extra $I$ does not do much), and only differ on eigenfunctions with small eigenvalues; since the smallest eigenvalue is lower bounded by $1/C_p$, their gap can be bounded. Next, we use Lemmas 2 and 3 to show that preconditioned gradient descent converges exponentially fast to the solution of the nonlinear variational elliptic PDE in Equation 3.

Lemma 4 (Preconditioned gradient descent convergence). Let $u^\star$ denote the unique solution to the PDE in Equation 3. For all $t \in \mathbb{N}$, define the sequence of functions
$$u_{t+1} \leftarrow u_t - \frac{\lambda^3}{(1 + C_p)\Lambda^3}(I - \Delta)^{-1}DE(u_t). \qquad (6)$$
If $u_0 \in H^1_0(\Omega)$, then after $t$ iterations we have
$$E(u_t) - E(u^\star) \le \left(1 - \frac{\lambda^5}{(1 + C_p)\Lambda^4}\right)^t(E(u_0) - E(u^\star)).$$
The complete proof of convergence can be found in Section B.3 of the Appendix.
Therefore, combining Lemma 4 with part 4 of Lemma 2, i.e., $\|u_t - u^\star\|^2_{H^1_0(\Omega)} \le \frac{2}{\lambda}(E(u_t) - E(u^\star))$, we have
$$\|u_t - u^\star\|^2_{H^1_0(\Omega)} \le \frac{2}{\lambda}\left(1 - \frac{\lambda^5}{(1 + C_p)\Lambda^4}\right)^t(E(u_0) - E(u^\star)),$$
and $\|u_T - u^\star\|^2_{H^1_0(\Omega)} \le \epsilon$ after $T$ steps, where
$$T \ge \log\!\left(\frac{2(E(u_0) - E(u^\star))}{\lambda\epsilon}\right)\Big/\log\!\left(\frac{1}{1 - \frac{\lambda^5}{(1+C_p)\Lambda^4}}\right). \qquad (7)$$
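The key eigenvalue inequality behind Lemma 3, part 2 can be checked numerically (our illustration, not from the paper) using the explicit Dirichlet eigenvalues $\lambda_i = (i\pi)^2$ of $-\Delta$ on $\Omega = (0,1)$, together with the paper's convention $\lambda_1 = 1/C_p$:

```python
# Check 1/(1 + lam_i) >= 1/((1 + C_p) * lam_i) for the Dirichlet
# eigenvalues lam_i = (i*pi)^2 on (0, 1), using lam_1 = 1/C_p.
import numpy as np

lams = (np.arange(1, 1001) * np.pi) ** 2   # lam_i, i = 1..1000
C_p = 1.0 / lams[0]                        # paper's convention lam_1 = 1/C_p

lhs = 1.0 / (1.0 + lams)                   # eigenvalues of (I - Laplacian)^{-1}
rhs = 1.0 / ((1.0 + C_p) * lams)           # (1+C_p)^{-1} x eigenvalues of (-Laplacian)^{-1}
margin = float(np.min(lhs - rhs))          # >= 0 up to floating-point rounding
```

The inequality is tight at $i = 1$ (where $\lambda_1 = 1/C_p$) and slack for all higher modes, matching the intuition that the $I$ in $(I - \Delta)^{-1}$ only matters for the smallest eigenvalues.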

6.2. BOUNDING THE BARRON NORM

Having obtained a sequence of functions that converges to the solution $u^\star$, we bound the Barron norms of the iterates. We draw inspiration from Marwah et al. (2021); Lu et al. (2021) and show that each iterate's Barron norm is boundedly larger than that of the previous iterate. Note that in general, the Fourier spectrum of a composition of functions cannot easily be expressed in terms of the Fourier spectra of the functions being composed. However, from Assumption 1, we know that $L$ can be approximated by $\tilde L$ such that $\tilde L \circ g$ has a boundedly larger Barron norm than $g$. Thus, instead of tracking the iterates in Equation 6, we track the Barron norm of the functions in the sequence
$$\tilde u_{t+1} = \tilde u_t - \eta(I - \Delta)^{-1}D\tilde E(\tilde u_t), \qquad (8)$$
where $\tilde E(u) := \int_\Omega\tilde L(\nabla u)\,dx$. We derive the following result (the proof is deferred to Section D.1 of the Appendix):

Lemma 5. Consider the updates in Equation 8. If $\tilde u_t \in \Gamma_{W_t}$, then for all $\eta \in \big(0, \frac{\lambda^3}{(C_p+1)\Lambda^3}\big]$ we have $\tilde u_{t+1} \in \Gamma_{kW_t}$, and the Barron norm of $\tilde u_{t+1}$ can be bounded as
$$\|\tilde u_{t+1}\|_{\mathcal{B}(\Omega)} \le \left(1 + \eta d(kW_t)^2 B_L\right)\|\tilde u_t\|^p_{\mathcal{B}(\Omega)}.$$

The proof uses the bound in Equation 5 on the Barron norm of a composition with $\tilde L$, as well as an accounting of the increase in the Barron norm of a function under basic algebraic operations, as established in Lemma 6. Precisely, we show:

Lemma 6 (Barron norm algebra). If $h, h_1, h_2 \in \Gamma$, then the following results hold:

• Addition: $\|h_1 + h_2\|_{\mathcal{B}(\Omega)} \le \|h_1\|_{\mathcal{B}(\Omega)} + \|h_2\|_{\mathcal{B}(\Omega)}$.
• Multiplication: $\|h_1 h_2\|_{\mathcal{B}(\Omega)} \le \|h_1\|_{\mathcal{B}(\Omega)}\|h_2\|_{\mathcal{B}(\Omega)}$.
• Derivative: if $h \in \Gamma_W$, then for $i \in [d]$ we have $\|\partial_i h\|_{\mathcal{B}(\Omega)} \le W\|h\|_{\mathcal{B}(\Omega)}$.
• Preconditioning: if $h \in \Gamma$, then $\|(I - \Delta)^{-1}h\|_{\mathcal{B}(\Omega)} \le \|h\|_{\mathcal{B}(\Omega)}$.

The proof of this Lemma can be found in Appendix D.3. It bears similarity to an analogous result in Chen et al. (2021), with the difference that our bounds are stated in the spectral Barron space, which differs from the definition of the Barron norm used in Chen et al. (2021).
Expanding the recurrence in Lemma 5, after $T$ steps we have $W_T \le k^T W_0$, hence $u_T \in \Gamma_{k^T W_0}$, and the Barron norm of $u_T$ can be bounded as
$$\|u_T\|_{\mathcal{B}(\Omega)} \le \left(1 + \eta dk^2 W_0^2 B_L\right)^{p^{T+1}}\|u_0\|^{p^T}_{\mathcal{B}(\Omega)}.$$

Finally, we exhibit a natural class of functions that satisfies the main Barron growth property in Equation 5. Precisely, we show that (multivariate) polynomials of bounded degree yield effective bounds on $p$ and $B_L$:

Lemma 7. Let $f(x) = \sum_{\alpha, |\alpha| \le P} A_\alpha\prod_{i=1}^d x_i^{\alpha_i}$, where $\alpha$ is a multi-index and $x \in \mathbb{R}^d$. If $g : \mathbb{R}^d \to \mathbb{R}^d$ is such that $g \in \Gamma_W$, then $f \circ g$ satisfies the Barron growth property in Assumption 1 with
$$B_L = d^{P/2}\left(\sum_{\alpha, |\alpha| \le P}|A_\alpha|^2\right)^{1/2}, \qquad p = P.$$
Hence, if $L$ is a polynomial of degree $P$, the constants in Assumption 1 take these values.

Finally, since we use the approximation $\tilde L$ of $L$, we incur an error at each step of the iteration. The following lemma shows that the error between the iterates $u_t$ and the approximate iterates $\tilde u_t$ increases with $t$; it is obtained by recursively tracking the error between $u_t$ and $\tilde u_t$ in terms of the error at step $t-1$. Note that this error can be controlled by using smaller values of $\eta$.

Lemma 8. For all $t \in \mathbb{N}$, with $R \le \|u^\star\|_{H^1_0(\Omega)} + \frac{1}{\lambda}E(u_0)$, we have
$$\|u_t - \tilde u_t\|_{H^1_0(\Omega)} \le \frac{\epsilon_L\eta R}{\Lambda + \epsilon_L}\left(\left(1 + \eta(\Lambda + \epsilon_L)\right)^t - 1\right).$$
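The derivative rule in Lemma 6 can be illustrated with a discrete model (ours, not from the paper): represent a band-limited $h$ by Fourier coefficients $c_w$ on integer frequencies $|w| \le W$, with the discrete analogue $\|h\|_{\mathcal{B}} = \sum_w (1 + |w|)|c_w|$ and $\|h'\|_{\mathcal{B}} = \sum_w (1 + |w|)|w\,c_w|$; the bound $\|h'\|_{\mathcal{B}} \le W\|h\|_{\mathcal{B}}$ then follows termwise from $|w| \le W$.

```python
# Discrete sanity check of the derivative rule in Lemma 6 for a
# band-limited h with cutoff W, modeled by random Fourier coefficients.
import numpy as np

rng = np.random.default_rng(0)
W = 7
freqs = np.arange(-W, W + 1)
coeffs = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)

norm_h = float(np.sum((1 + np.abs(freqs)) * np.abs(coeffs)))
norm_dh = float(np.sum((1 + np.abs(freqs)) * np.abs(freqs * coeffs)))
ratio_ok = norm_dh <= W * norm_h
```

This is the mechanism by which band-limitedness (Definition 6) enters the paper's bounds: without a cutoff $W$, differentiation could inflate the Barron norm by an unbounded factor.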

7. CONCLUSION AND FUTURE WORK

In this work, we take a representational complexity perspective on neural networks as they are used to approximate solutions of nonlinear variational elliptic PDEs of the form $-\mathrm{div}(\nabla L(\nabla u)) = 0$. We prove that if composing $L$ with a function of bounded Barron norm increases the Barron norm only in a bounded fashion, then we can bound the Barron norm of the solution $u^\star$ to the PDE, potentially evading the curse of dimensionality depending on the rate of this increase. Our results subsume and vastly generalize prior work on the linear case (Marwah et al., 2021; Chen et al., 2021). Our proof consists of neurally simulating preconditioned gradient descent on the energy functional defining the PDE, which we prove is strongly convex in an appropriate sense. There are many potential avenues for future work. Our techniques (and prior techniques) strongly rely on the existence of a variational principle characterizing the solution of the PDE. In the classical PDE literature, these classes of PDEs are also considered better behaved: e.g., proving regularity bounds is much easier for such PDEs (Fernández-Real & Ros-Oton, 2020). There are many nonlinear PDEs that come without a variational formulation, e.g., the Monge-Ampère equation, for which regularity estimates are derived using non-constructive methods like comparison principles. It is a wide-open question to construct representational bounds for any interesting family of PDEs of this kind. It is also very interesting to explore other notions of complexity, e.g., the number of parameters in a (potentially deep) network as in Marwah et al. (2021), or Rademacher complexity, among others.

A APPENDIX

A.1 PROOF FOR LEMMA 1

The proof follows Fernández-Real & Ros-Oton (2020), Chapter 3; we provide it here for completeness.

Proof of Lemma 1. If the function $u^\star$ minimizes the energy functional in Definition 1, then for all $\epsilon \in \mathbb{R}$ we have $E(u) \le E(u + \epsilon\varphi)$, where $\varphi \in C_c^\infty(\Omega)$. That is, there is a minimum at $\epsilon = 0$, and taking a derivative w.r.t. $\epsilon$ we get
$$dE[u](\varphi) = \lim_{\epsilon\to 0}\frac{E(u + \epsilon\varphi) - E(u)}{\epsilon} = \lim_{\epsilon\to 0}\frac{\int_\Omega L(\nabla u + \epsilon\nabla\varphi)\,dx - \int_\Omega L(\nabla u)\,dx}{\epsilon} = \lim_{\epsilon\to 0}\frac{\int_\Omega\big(\epsilon\nabla L(\nabla u)\cdot\nabla\varphi + \frac{\epsilon^2}{2}r(x)\big)\,dx}{\epsilon},$$
where for all $x \in \Omega$ we have $|r(x)| \le |\nabla\varphi|^2\sup_{p\in\mathbb{R}^d}\|D^2 L(p)\|$. Taking $\epsilon \to 0$, the derivative is
$$dE[u](\varphi) = \int_\Omega\nabla L(\nabla u)\cdot\nabla\varphi\,dx = 0. \qquad (10)$$
For functions $r, s \in H^1_0(\Omega)$, note the following Green's formula:
$$\int_\Omega\frac{\partial r}{\partial x_i}s\,dx = -\int_\Omega r\frac{\partial s}{\partial x_i}\,dx + \int_{\partial\Omega}rs\,n_i\,d\Gamma. \qquad (11)$$
Using the identity in Equation 11 in Equation 10, we get
$$dE[u](\varphi) = \int_\Omega -\mathrm{div}(\nabla L(\nabla u))\,\varphi\,dx = 0.$$
Since this holds for every $\varphi$, the minimum of the energy functional is attained at a $u$ which solves the PDE
$$dE(u) = -\mathrm{div}(\nabla L(\nabla u)) = 0.$$

B PROOFS FROM SECTION 6.1

B.1 PROOF FOR LEMMA 2

Proof. To prove part 1, we use the following integration-by-parts identity: for functions $r, s \in H^1_0(\Omega)$,
$$\int_\Omega\frac{\partial r}{\partial x_i}s\,dx = -\int_\Omega r\frac{\partial s}{\partial x_i}\,dx + \int_{\partial\Omega}rs\,n_i\,d\Gamma, \qquad (12)$$
where $n_i$ is the $i$-th component of the normal at the boundary and $d\Gamma$ is an infinitesimal element of the boundary $\partial\Omega$. Using the formula in Equation 12 for functions $u, v \in H^1_0(\Omega)$, we have
$$\langle DE(u), v\rangle_{L^2(\Omega)} = \langle-\nabla\cdot\nabla L(\nabla u), v\rangle_{L^2(\Omega)} = -\int_\Omega\sum_{i=1}^d\frac{\partial(\nabla L(\nabla u))_i}{\partial x_i}v\,dx = \int_\Omega\sum_{i=1}^d(\nabla L(\nabla u))_i\frac{\partial v}{\partial x_i}\,dx - \int_{\partial\Omega}\sum_{i=1}^d(\nabla L(\nabla u))_i v\,n_i\,d\Gamma = \int_\Omega\nabla L(\nabla u)\cdot\nabla v\,dx,$$
where in the last equality we use the fact that $v \in H^1_0(\Omega)$, so $v(x) = 0$ for all $x \in \partial\Omega$.

To show the second part, first note that since $L : \mathbb{R}^d \to \mathbb{R}$ is uniformly convex and smooth, we have for all $x, y \in \mathbb{R}^d$,
$$\lambda\|x - y\|_2 \le \|\nabla L(x) - \nabla L(y)\|_2 \le \Lambda\|x - y\|_2. \qquad (13)$$
( ) This implies ∀x ∈ Ω, ∥∇L(∇u(x)) -∇L(∇v(x))∥ 2 ≤ Λ∥∇u(x) -∇v(x)∥ 2 Taking square on each side and itegrating over Ω we get Ω ∥∇L(∇u(x)) -∇L(∇v(x))∥ 2 2 dx ≤ Λ 2 Ω ∥∇u(x) -∇v(x)∥ 2 2 dx =⇒ ∥∇L(∇u) -∇L(∇v)∥ L 2 (Ω) ≤ Λ∥∇u -∇v∥ L 2 (Ω) On the other hand, from part 1, we have ⟨DE(u) -DE(v), u -v⟩ L 2 (Ω) = ⟨∇L(∇u) -∇L(∇v), ∇u -∇v⟩ L 2 (Ω) Hence, by Cauchy-Schwartz, we get ⟨DE(u) -DE(v), u -v⟩ L 2 (Ω) = ⟨∇L(∇u) -∇L(∇v), ∇u -∇v⟩ L 2 (Ω) ≤ ∥∇L(∇u) -∇L(∇v)∥ L 2 (Ω) ∥∇u -∇v∥ L 2 (Ω) ≤ Λ∥∇u -∇v∥ 2 L 2 (Ω) which proves the right hand side of the inequality in part 2. For the left size of the inequality, by convexity of L we have ∀x, y ∈ R d , (∇L(x) -∇L(y)) T (x - y) ≥ λ∥x -y∥ 2 2 is convex, (∇L (∇u(x)) -∇L (∇v(x))) T (∇u(x) -∇v(x)) ≥ λ∥∇u(x) -∇v(x)∥ 2 2 Integrating over Ω we get, Ω (∇L (∇u(x)) -∇L (∇v(x))) T (∇u(x) -∇v(x))dx ≥ λ Ω ∥∇u(x) -∇v(x)∥ 2 2 dx =⇒ ⟨∇L(∇u) -∇L(∇v), ∇u -∇v⟩ L 2 (Ω) ≥ λ∥∇u -∇v∥ 2 L 2 (Ω) Using part 1 again, this implies Equation 15 ⟨DE(u) -DE(v), u -v⟩ L 2 (Ω) = ⟨∇L(∇u) -∇L(∇v), ∇u -∇v⟩ L 2 (Ω) ≥ λ∥∇v -∇v∥ 2 L 2 (Ω) . as we wanted. Under review as a conference paper at ICLR 2023 To show part 3, we first Taylor expand L to rewrite the energy function as: E(u + v) = Ω L(∇u(x) + ∇v(x))dx = Ω L(∇u(x)) + ∇L(∇u(x))∇v(x) + 1 2 D 2 L(x)∥∇v(x)∥ 2 2 dx where x ∈ R d (and is potentially different for every x ∈ Ω). Since the function L is strongly convex we have λI d ≤ D 2 L(x) ≤ ΛI d . Plugging in these bounds in Equation 16, we have: E(u + v) ≤ Ω L(∇u(x)) + ∇L(∇u(x))∇v(x) + Λ 2 ∥∇v(x)∥ 2 dx =⇒ E(u + v) ≤ E(u) + ⟨DE(u), v⟩ L 2 (Ω) + Λ 2 ⟨∇v, ∇v⟩ L 2 (Ω) . as well as E(u + v) ≥ Ω L(∇u(x)) + ∇L(∇u(x))∇v(x) + Λ 2 ∥∇v(x)∥ 2 dx =⇒ E(u + v) ≥ E(u) + ⟨DE(u), v⟩ L 2 (Ω) + λ 2 ⟨∇v, ∇v⟩ L 2 (Ω) . Combining Equation 17and Equation 18 we get, λ 2 ∥∇v∥ 2 L 2 (Ω) + ⟨DE(u), v⟩ L 2 (Ω) ≤ E(u + v) -E(u) ≤ ⟨DE(u), v⟩ L 2 (Ω) + Λ 2 ∥∇v∥ 2 L 2 (Ω) Finally, part 4 follows by plugging in u = u ⋆ and v = u -u ⋆ in part 3 and using the fact that DE(u ⋆ ) = 0.
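As a finite-dimensional sanity check of the two-sided bound in part 2 (illustrative only, not part of the proof), one can verify numerically that the monotonicity bounds hold for a strongly convex and smooth Lagrangian. The specific $L$ below is a hypothetical example chosen so that $\lambda = 1$ and $\Lambda = 1.5$:

```python
import numpy as np

# Check of the finite-dimensional analogue of Lemma 2, part 2:
#   lam * ||x - y||^2  <=  <grad L(x) - grad L(y), x - y>  <=  Lam * ||x - y||^2.
# Example Lagrangian (an assumption, not from the paper):
#   L(p) = ||p||^2 / 2 + c * sum_i log cosh(p_i),
# whose Hessian I + c * diag(sech^2(p_i)) has eigenvalues in [1, 1 + c].
rng = np.random.default_rng(0)
d, c = 8, 0.5
lam, Lam = 1.0, 1.0 + c

def grad_L(p):
    return p + c * np.tanh(p)

ok = True
for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    inner = float(np.dot(grad_L(x) - grad_L(y), x - y))
    sq = float(np.dot(x - y, x - y))
    ok = ok and (lam * sq - 1e-9 <= inner <= Lam * sq + 1e-9)
print(ok)  # True
```

The same two-sided estimate, applied pointwise to $\nabla u(x)$ and $\nabla v(x)$ and integrated over $\Omega$, is exactly what the proof uses.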

B.2 PROOF FOR LEMMA 3

Proof. Let $\{\lambda_i, \phi_i\}_{i=1}^\infty$ denote the (eigenvalue, eigenfunction) pairs of the operator $-\Delta$, where $0 < \lambda_1 \le \lambda_2 \le \cdots$; these are real and countable (Evans (2010), Theorem 1, Section 6.5). By the variational characterization of the eigenvalues, we have
$$\lambda_1 = \inf_{v \in H_0^1(\Omega)} \frac{\langle -\Delta v, v\rangle_{L^2(\Omega)}}{\|v\|_{L^2(\Omega)}^2} = \inf_{v \in H_0^1(\Omega)} \frac{\langle \nabla v, \nabla v\rangle_{L^2(\Omega)}}{\|v\|_{L^2(\Omega)}^2} = \frac{1}{C_p},$$
where in the last equality we use Theorem 2. Let us write $v$ in the eigenbasis as $v = \sum_i \mu_i\phi_i$. Notice that an eigenfunction of $-\Delta$ is also an eigenfunction of $(I - \Delta)^{-1}$, with corresponding eigenvalue $\frac{1}{1+\lambda_i}$. Thus, to show part 1, we have
$$\left\|(I - \Delta)^{-1}\nabla\cdot\nabla v\right\|_{L^2(\Omega)}^2 = \left\|(I - \Delta)^{-1}\Delta v\right\|_{L^2(\Omega)}^2 = \Big\|\sum_{i=1}^\infty \frac{\lambda_i}{1 + \lambda_i}\,\mu_i\phi_i\Big\|_{L^2(\Omega)}^2 \le \Big\|\sum_{i=1}^\infty \mu_i\phi_i\Big\|_{L^2(\Omega)}^2 = \sum_{i=1}^\infty \mu_i^2 = \|v\|_{L^2(\Omega)}^2,$$
where in the last equality we use the fact that the $\phi_i$ are orthonormal. For part 2, note that $(I - \Delta)^{-1}v = \sum_{i=1}^\infty \frac{1}{1+\lambda_i}\mu_i\phi_i$. Since $\lambda_1 \le \lambda_2 \le \cdots$ and $x \mapsto \frac{x}{1+x}$ is monotonically increasing, we have for all $i \in \mathbb{N}$
$$\frac{1}{1 + \lambda_i} \ge \frac{1}{(1 + C_p)\lambda_i}, \quad (19)$$
and note that the $\frac{1}{\lambda_i}$ are the eigenvalues of $(-\Delta)^{-1}$. Now,
$$\langle (I - \Delta)^{-1}v, v\rangle_{L^2(\Omega)} = \Big\langle \sum_{i=1}^\infty \frac{\mu_i}{1+\lambda_i}\phi_i,\ \sum_{i=1}^\infty \mu_i\phi_i\Big\rangle_{L^2(\Omega)} = \sum_{i=1}^\infty \frac{\mu_i^2}{1+\lambda_i}\|\phi_i\|_{L^2(\Omega)}^2, \quad (20)$$
where we use the orthogonality of the $\phi_i$ to get Equation 20. Using the inequality in Equation 19 we can further lower bound $\langle (I - \Delta)^{-1}v, v\rangle_{L^2(\Omega)}$ as
$$\langle (I - \Delta)^{-1}v, v\rangle_{L^2(\Omega)} \ge \sum_{i=1}^\infty \frac{\mu_i^2}{(1 + C_p)\lambda_i}\|\phi_i\|_{L^2(\Omega)}^2 = \frac{1}{1 + C_p}\langle (-\Delta)^{-1}v, v\rangle_{L^2(\Omega)},$$
where we use the following set of equalities in the last step:
$$\langle (-\Delta)^{-1}v, v\rangle_{L^2(\Omega)} = \Big\langle \sum_{i=1}^\infty \frac{\mu_i}{\lambda_i}\phi_i,\ \sum_{i=1}^\infty \mu_i\phi_i\Big\rangle_{L^2(\Omega)} = \sum_{i=1}^\infty \frac{\mu_i^2}{\lambda_i}\|\phi_i\|_{L^2(\Omega)}^2.$$
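The spectral facts used above can be checked numerically on a discretized domain. The sketch below (an illustration; the 1D finite-difference Dirichlet Laplacian stands in for $-\Delta$, and the grid size is an arbitrary choice) verifies that the eigenvalues of $(I - \Delta)^{-1}(-\Delta)$ are at most 1 (part 1) and that $\frac{1}{1+\lambda_i} \ge \frac{1}{(1+C_p)\lambda_i}$ with $C_p = 1/\lambda_1$ (Equation 19):

```python
import numpy as np

n = 50
h = 1.0 / (n + 1)
# Discrete -Δ on (0,1) with Dirichlet boundary conditions.
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
lams = np.sort(np.linalg.eigvalsh(A))     # 0 < λ_1 ≤ λ_2 ≤ ...
Cp = 1.0 / lams[0]                        # Poincaré constant: λ_1 = 1/C_p

# Part 1: eigenvalues of (I - Δ)^{-1}(-Δ) are λ_i/(1+λ_i) ≤ 1.
part1 = bool(np.all(lams / (1.0 + lams) <= 1.0))
# Equation 19: 1/(1+λ_i) ≥ 1/((1+C_p)λ_i), with equality at i = 1.
part2 = bool(np.all(1.0 / (1.0 + lams) + 1e-12 >= 1.0 / ((1.0 + Cp) * lams)))
print(part1, part2)  # True True
```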

B.3 PROOF FOR CONVERGENCE: PROOF FOR LEMMA 4

Proof. For the analysis we take $\eta = \frac{\lambda^3}{(1 + C_p)\Lambda^3}$. Taylor expanding as in Equation 17, with $u_{t+1} - u_t = -\eta(I - \Delta)^{-1}DE(u_t)$ plugged in, we have
$$E(u_{t+1}) \le E(u_t) - \eta\underbrace{\left\langle \nabla L(\nabla u_t),\ \nabla(I - \Delta)^{-1}DE(u_t)\right\rangle_{L^2(\Omega)}}_{\text{Term 1}} + \frac{\eta^2\Lambda}{2}\underbrace{\left\|\nabla(I - \Delta)^{-1}DE(u_t)\right\|_{L^2(\Omega)}^2}_{\text{Term 2}}. \quad (21)$$
First we lower bound Term 1. Since $u^\star$ is the solution to the PDE in Equation 3, we have $DE(u^\star) = 0$, and therefore
$$\left\langle \nabla L(\nabla u_t), \nabla(I - \Delta)^{-1}DE(u_t)\right\rangle_{L^2(\Omega)} = \left\langle \nabla L(\nabla u_t), \nabla(I - \Delta)^{-1}(DE(u_t) - DE(u^\star))\right\rangle_{L^2(\Omega)}.$$
Similarly, since $u^\star$ solves the PDE, its weak formulation (Equation 2) gives, for all $\varphi \in H_0^1(\Omega)$,
$$\langle \nabla L(\nabla u^\star), \nabla\varphi\rangle_{L^2(\Omega)} = 0. \quad (22)$$
Using Equation 22 (with $\varphi = (I - \Delta)^{-1}(DE(u_t) - DE(u^\star))$) we get
$$\left\langle \nabla L(\nabla u_t), \nabla(I - \Delta)^{-1}DE(u_t)\right\rangle_{L^2(\Omega)} = \left\langle \nabla L(\nabla u_t) - \nabla L(\nabla u^\star),\ \nabla(I - \Delta)^{-1}(DE(u_t) - DE(u^\star))\right\rangle_{L^2(\Omega)}. \quad (23)$$
Using Equation 23, together with the facts that $-\nabla\cdot\nabla v = -\Delta v$ and that $\nabla$ commutes with $(I - \Delta)^{-1}(-\Delta)$, we can rewrite Term 1 as
$$\left\langle \nabla L(\nabla u_t), \nabla(I - \Delta)^{-1}DE(u_t)\right\rangle_{L^2(\Omega)} = \int_\Omega (\nabla L(\nabla u_t) - \nabla L(\nabla u^\star))\cdot(I - \Delta)^{-1}(-\Delta)\,(\nabla L(\nabla u_t) - \nabla L(\nabla u^\star))\,dx, \quad (24)$$
where the operator $(I - \Delta)^{-1}(-\Delta)$ acts componentwise. Plugging part 2 of Lemma 3 into Equation 24, we get
$$\left\langle \nabla L(\nabla u_t), \nabla(I - \Delta)^{-1}DE(u_t)\right\rangle_{L^2(\Omega)} \ge \frac{1}{1 + C_p}\left\langle \nabla L(\nabla u_t) - \nabla L(\nabla u^\star),\ \nabla L(\nabla u_t) - \nabla L(\nabla u^\star)\right\rangle_{L^2(\Omega)} \overset{(i)}{\ge} \frac{\lambda^2}{1 + C_p}\|\nabla u_t - \nabla u^\star\|_{L^2(\Omega)}^2 \ge \frac{2\lambda^2}{(1 + C_p)\Lambda}\,(E(u_t) - E(u^\star)), \quad (25)$$
where (i) follows from part 2 of Lemma 2, and we use part 4 of Lemma 2 for the last inequality. We now proceed to upper bound Term 2.
Using part 1 of Lemma 3 we have
$$\left\|\nabla(I - \Delta)^{-1}DE(u_t)\right\|_{L^2(\Omega)}^2 = \left\|\nabla(I - \Delta)^{-1}(DE(u_t) - DE(u^\star))\right\|_{L^2(\Omega)}^2 = \left\|(I - \Delta)^{-1}(-\Delta)\,(\nabla L(\nabla u_t) - \nabla L(\nabla u^\star))\right\|_{L^2(\Omega)}^2 \le \|\nabla L(\nabla u_t) - \nabla L(\nabla u^\star)\|_{L^2(\Omega)}^2 \le \Lambda^2\|\nabla u_t - \nabla u^\star\|_{L^2(\Omega)}^2 \le \frac{2\Lambda^2}{\lambda}\,(E(u_t) - E(u^\star)). \quad (26)$$
Here we use the fact that $-\nabla\cdot\nabla = -\Delta$ and that $\nabla$ commutes with $(I - \Delta)^{-1}(-\Delta)$ in the first two equalities, part 1 of Lemma 3 for the first inequality, and parts 2 and 4 of Lemma 2 for the last two inequalities. Combining Equation 25 and Equation 26 in Equation 21, we get
$$E(u_{t+1}) - E(u^\star) \le E(u_t) - E(u^\star) - \eta\left(\frac{2\lambda^2}{(1 + C_p)\Lambda} - \eta\,\frac{\Lambda^3}{\lambda}\right)(E(u_t) - E(u^\star)).$$
Since $\eta = \lambda^3/((1 + C_p)\Lambda^3)$, we have
$$E(u_{t+1}) - E(u^\star) \le \left(1 - \frac{\lambda^5}{(1 + C_p)\Lambda^4}\right)(E(u_t) - E(u^\star)) \implies E(u_t) - E(u^\star) \le \left(1 - \frac{\lambda^5}{(1 + C_p)\Lambda^4}\right)^t (E(u_0) - E(u^\star)).$$
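To illustrate the convergence guarantee, the sketch below runs the preconditioned update $u_{t+1} = u_t - \eta(I-\Delta)^{-1}DE(u_t)$ on a 1D finite-difference discretization with boundary values $u(0)=0$, $u(1)=1$. The Lagrangian $L(p) = p^2/2 + 0.1\log\cosh p$ (so $\lambda = 1$, $\Lambda = 1.1$), the grid size, and the step size are assumptions made for the demo, not quantities from the paper; the discrete minimizer is the linear interpolant, and the energy gap decays geometrically:

```python
import numpy as np

n = 63
h = 1.0 / (n + 1)
x = np.linspace(0.0, 1.0, n + 2)
lam, Lam = 1.0, 1.1

def Lprime(p):
    # L(p) = p^2/2 + 0.1*log cosh(p)  =>  L'(p) = p + 0.1*tanh(p)
    return p + 0.1 * np.tanh(p)

def energy(ui):
    uf = np.concatenate(([0.0], ui, [1.0]))   # attach boundary values
    p = np.diff(uf) / h
    return h * np.sum(p**2 / 2 + 0.1 * np.log(np.cosh(p)))

def dE(ui):
    # discrete DE(u) = -d/dx L'(u')
    uf = np.concatenate(([0.0], ui, [1.0]))
    flux = Lprime(np.diff(uf) / h)
    return -np.diff(flux) / h

A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2   # discrete -Δ
P = np.linalg.inv(np.eye(n) + A)                                 # (I - Δ)^{-1}

eta = lam**3 / (2 * Lam**3)      # conservative step size (demo choice)
E_star = energy(x[1:-1])         # discrete minimizer is u*(x) = x
u = np.zeros(n)
gaps = []
for t in range(200):
    gaps.append(energy(u) - E_star)
    u = u - eta * P @ dE(u)
print(gaps[0] > 1 and abs(gaps[-1]) < 1e-8)  # True
```

The preconditioner keeps the effective conditioning of the iteration bounded independently of the grid resolution, which is the discrete analogue of the dimension-free contraction factor in the lemma.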

C ERROR ANALYSIS

First, we will need the following simple technical lemma showing that the $H_0^1(\Omega)$ norm is self-dual:

Lemma 9. The dual norm of $\|\cdot\|_{H_0^1(\Omega)}$ is $\|\cdot\|_{H_0^1(\Omega)}$.

Proof. If $\|u\|_*$ denotes the dual norm of $\|u\|_{H_0^1(\Omega)}$, by definition we have
$$\|u\|_* = \sup_{\substack{v \in H_0^1(\Omega) \\ \|v\|_{H_0^1(\Omega)} = 1}} \langle u, v\rangle_{H_0^1(\Omega)} = \sup_{\substack{v \in H_0^1(\Omega) \\ \|v\|_{H_0^1(\Omega)} = 1}} \langle \nabla u, \nabla v\rangle_{L^2(\Omega)} \le \sup_{\substack{v \in H_0^1(\Omega) \\ \|v\|_{H_0^1(\Omega)} = 1}} \|\nabla u\|_{L^2(\Omega)}\|\nabla v\|_{L^2(\Omega)} = \|\nabla u\|_{L^2(\Omega)},$$
where the inequality follows by Cauchy-Schwarz. Equality is achieved by taking $v = \frac{u}{\|\nabla u\|_{L^2(\Omega)}}$. Thus $\|u\|_* = \|\nabla u\|_{L^2(\Omega)} = \|u\|_{H_0^1(\Omega)}$, as we wanted.

With this we can proceed to the proof of Lemma 8.

C.1 PROOF FOR LEMMA 8

Proof. For all $t$ we define $r_t = \tilde u_t - u_t$, and we will iteratively bound $\|r_t\|_{H_0^1(\Omega)}$. Starting from $\tilde u_0 = u_0 = 0$, the two iterative sequences are
$$u_{t+1} = u_t - \eta(I - \Delta)^{-1}DE(u_t), \qquad \tilde u_{t+1} = \tilde u_t - \eta(I - \Delta)^{-1}D\tilde E(\tilde u_t),$$
where $\eta \in \big(0, \frac{\lambda^3}{(1 + C_p)\Lambda^3}\big]$. Subtracting the two we get
$$\tilde u_{t+1} - u_{t+1} = \tilde u_t - u_t - \eta(I - \Delta)^{-1}\big(D\tilde E(\tilde u_t) - DE(u_t)\big) \implies r_{t+1} = r_t - \eta(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big).$$
Taking the $H_0^1(\Omega)$ norm on both sides we get
$$\|r_{t+1}\|_{H_0^1(\Omega)} \le \|r_t\|_{H_0^1(\Omega)} + \eta\left\|(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big)\right\|_{H_0^1(\Omega)}. \quad (28)$$
Towards bounding the last term, from Lemma 9 we know that the dual norm of $\|\cdot\|_{H_0^1(\Omega)}$ is itself, so
$$\left\|(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big)\right\|_{H_0^1(\Omega)} = \sup_{\substack{\varphi \in H_0^1(\Omega) \\ \|\varphi\|_{H_0^1(\Omega)} = 1}} \left\langle \nabla(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big), \nabla\varphi\right\rangle_{L^2(\Omega)}.$$
Splitting $D\tilde E(u_t + r_t) - DE(u_t) = \big(D\tilde E(u_t + r_t) - DE(u_t + r_t)\big) + \big(DE(u_t + r_t) - DE(u_t)\big)$, bounding the two resulting suprema separately (dropping the contraction $\nabla(I - \Delta)^{-1}\nabla\cdot$ as in part 1 of Lemma 3, and applying Cauchy-Schwarz together with the assumption $\sup_x \|\nabla L(x) - \nabla\tilde L(x)\|_2 \le \epsilon_L\|x\|_2$), we obtain
$$\left\|(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big)\right\|_{H_0^1(\Omega)} \le \epsilon_L\|\nabla u_t + \nabla r_t\|_{L^2(\Omega)} + \|\nabla L(\nabla u_t + \nabla r_t) - \nabla L(\nabla u_t)\|_{L^2(\Omega)}. \quad (29)$$
Using the Lipschitzness of $\nabla L$, for all $x \in \Omega$,
$$\|\nabla L(\nabla u_t(x) + \nabla r_t(x)) - \nabla L(\nabla u_t(x))\|_2 \le \sup_{p \in \mathbb{R}^d} \|D^2 L(p)\|\,\|\nabla r_t(x)\|_2 \le \Lambda\|\nabla r_t(x)\|_2.$$
Squaring and integrating over $\Omega$ on both sides we get
$$\|\nabla L(\nabla u_t + \nabla r_t) - \nabla L(\nabla u_t)\|_{L^2(\Omega)} \le \Lambda\|\nabla r_t\|_{L^2(\Omega)}. \quad (31)$$
Plugging Equation 31 into Equation 29 (and using $\|\nabla u_t + \nabla r_t\|_{L^2(\Omega)} \le \|\nabla u_t\|_{L^2(\Omega)} + \|\nabla r_t\|_{L^2(\Omega)}$), we get
$$\left\|(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big)\right\|_{H_0^1(\Omega)} \le (\Lambda + \epsilon_L)\|\nabla r_t\|_{L^2(\Omega)} + \epsilon_L\|\nabla u_t\|_{L^2(\Omega)}. \quad (32)$$
Furthermore, from Lemma 4 we have for all $t \in \mathbb{N}$
$$E(u_t) - E(u^\star) \le \left(1 - \frac{\lambda^5}{(1 + C_p)\Lambda^4}\right)^t (E(u_0) - E(u^\star)) \le E(u_0),$$
and by part 4 of Lemma 2,
$$\|u_t - u^\star\|_{H_0^1(\Omega)} \le \sqrt{\frac{2}{\lambda}\,(E(u_t) - E(u^\star))} \le \sqrt{\frac{2}{\lambda}\,E(u_0)}.$$
Hence we have that for all $t \in \mathbb{N}$,
$$\|u_t\|_{H_0^1(\Omega)} \le \|u^\star\|_{H_0^1(\Omega)} + \sqrt{\frac{2}{\lambda}\,E(u_0)} =: R.$$
Putting this all together, we have
$$\left\|(I - \Delta)^{-1}\big(D\tilde E(u_t + r_t) - DE(u_t)\big)\right\|_{H_0^1(\Omega)} \le (\Lambda + \epsilon_L)\|\nabla r_t\|_{L^2(\Omega)} + \epsilon_L R. \quad (33)$$
Using the result from Equation 33 in Equation 28, we get
$$\|r_{t+1}\|_{H_0^1(\Omega)} \le (1 + \eta(\Lambda + \epsilon_L))\|r_t\|_{H_0^1(\Omega)} + \epsilon_L\eta R,$$
and since $r_0 = 0$, unrolling this recursion gives
$$\|r_{t+1}\|_{H_0^1(\Omega)} \le \frac{\epsilon_L R}{\Lambda + \epsilon_L}\left((1 + \eta(\Lambda + \epsilon_L))^{t+1} - 1\right). \quad (34)$$
Notice that $\epsilon_L \ll \Lambda$, and since $\eta \le \frac{\lambda^3}{(1 + C_p)\Lambda^3}$ with $\lambda \le \Lambda$, the per-step amplification factor $1 + \eta(\Lambda + \epsilon_L)$ is bounded, so Equation 34 controls the accumulated error.

D PROOFS FOR BARRON NORM APPROXIMATION: SECTION 6.2

D.1 PROOF FOR LEMMA 5: BARRON NORM RECURSION

Proof. Note that the update equation is
$$u_{t+1} = u_t - \eta(I - \Delta)^{-1}DE(u_t) = u_t - \eta(I - \Delta)^{-1}\big(-\nabla\cdot\nabla L(\nabla u_t)\big) = u_t - \eta(I - \Delta)^{-1}\Big(-\sum_{i=1}^d \partial_i^2\,L(\nabla u_t)\Big).$$
From Lemma 6 we have
$$\|\nabla u_t\|_{B(\Omega)} = \max_{i \in [d]} \|\partial_i u_t\|_{B(\Omega)} \le W_t\|u_t\|_{B(\Omega)}.$$
Note that since $u_t \in \Gamma_{W_t}$ we have $\nabla u_t \in \Gamma_{W_t}$, and $L(\nabla u_t) \in \Gamma_{kW_t}$ (from Assumption 1).
Therefore, we can bound the Barron norm of the update term as
$$\Big\|(I - \Delta)^{-1}\Big(-\sum_{i=1}^d \partial_i^2 L(\nabla u_t)\Big)\Big\|_{B(\Omega)} \overset{(i)}{\le} \Big\|\sum_{i=1}^d \partial_i^2 L(\nabla u_t)\Big\|_{B(\Omega)} \overset{(ii)}{\le} d\,\max_{i \in [d]}\big\|\partial_i^2 L(\nabla u_t)\big\|_{B(\Omega)} \le d(kW_t)^2\,\|L(\nabla u_t)\|_{B(\Omega)} \le d(kW_t)^2 B_L\|u_t\|_{B(\Omega)}^r,$$
where in (i) we use the fact that $\|(I - \Delta)^{-1}h\|_{B(\Omega)} \le \|h\|_{B(\Omega)}$ for a function $h$ (Lemma 6), in (ii) the triangle inequality, and in the remaining steps the Derivative rule of Lemma 6 together with the facts that $L(\nabla u_t) \in \Gamma_{kW_t}$ and that $\|L(\nabla u_t)\|_{B(\Omega)} \le B_L\|u_t\|_{B(\Omega)}^r$ (Assumption 1). Using the Addition rule from Lemma 6, we conclude
$$\|u_{t+1}\|_{B(\Omega)} \le \|u_t\|_{B(\Omega)} + \eta d(kW_t)^2 B_L\|u_t\|_{B(\Omega)}^r \le \big(1 + \eta d(kW_t)^2 B_L\big)\|u_t\|_{B(\Omega)}^r,$$
where the last inequality holds when $\|u_t\|_{B(\Omega)} \ge 1$.
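For intuition (an illustrative unrolling, not part of the proof), in the case $r = 1$ the recursion above gives, after $T$ steps,
$$\|u_T\|_{B(\Omega)} \le \|u_0\|_{B(\Omega)}\prod_{t=0}^{T-1}\big(1 + \eta d(kW_t)^2 B_L\big),$$
so each of the $T = O(\log(1/\epsilon))$ iterations required by Lemma 4 multiplies the Barron norm by a factor that is only polynomial in $d$. This is why the final bound on the Barron norm of the approximation to $u^\star$ remains polynomial in the dimension, as stated in the abstract.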

D.2 PROOF FOR BARRON NORM OF POLYNOMIAL

Lemma 10 (Lemma 7 restated). Let $f(x) = \sum_{\alpha, |\alpha| \le P} A_\alpha \prod_{i=1}^d x_i^{\alpha_i}$, where $\alpha$ is a multi-index, $x \in \mathbb{R}^d$, and each $A_\alpha \in \mathbb{R}$ is a scalar. If $g : \mathbb{R}^d \to \mathbb{R}^d$ is a function such that $g \in \Gamma_W$, then we have $f \circ g \in \Gamma_{PW}$ and the Barron norm can be bounded as
$$\|f \circ g\|_{B(\Omega)} \le d^{P/2}\Big(\sum_{\alpha, |\alpha| \le P} |A_\alpha|^2\Big)^{1/2}\|g\|_{B(\Omega)}^P.$$

Proof. Recall from Definition 7 that for a vector-valued function $g : \mathbb{R}^d \to \mathbb{R}^d$ we have $\|g\|_{B(\Omega)} = \max_{i \in [d]}\|g_i\|_{B(\Omega)}$. Then, using the Addition and Multiplication rules of Lemma 6 together with Corollary 1, we have
$$\|f(g)\|_{B(\Omega)} = \Big\|\sum_{\alpha, |\alpha| \le P} A_\alpha \prod_{i=1}^d g_i^{\alpha_i}\Big\|_{B(\Omega)} \le \sum_{\alpha, |\alpha| \le P} |A_\alpha|\,\Big\|\prod_{i=1}^d g_i^{\alpha_i}\Big\|_{B(\Omega)} \le \sum_{\alpha, |\alpha| \le P} |A_\alpha| \prod_{i=1}^d \|g_i\|_{B(\Omega)}^{\alpha_i} \le \Big(\sum_{\alpha, |\alpha| \le P} |A_\alpha|^2\Big)^{1/2}\Big(\sum_{\alpha, |\alpha| \le P} \Big(\prod_{i=1}^d \|g_i\|_{B(\Omega)}^{\alpha_i}\Big)^2\Big)^{1/2}, \quad (37)$$
where the last step is Cauchy-Schwarz. (The Addition and Multiplication rules used here are proved in Section D.3; in particular, the Multiplication rule rests on the fact that multiplication of functions corresponds to convolution in the frequency domain, i.e., for $g_1, g_2 : \mathbb{R}^d \to \mathbb{R}$ we have $\widehat{g_1 \cdot g_2} = \hat g_1 * \hat g_2$.) Finally, to show that for any $k$ the function $g^k \in \Gamma_{kW}$, we write $g^k$ in the Fourier basis: its Fourier transform is the $k$-fold convolution $\hat g * \cdots * \hat g$, which is supported in $\{\omega : \|\omega\|_\infty \le kW\}$ whenever $\hat g$ is supported in $\{\omega : \|\omega\|_\infty \le W\}$.



Since $\lambda$ is a lower bound on the strong convexity constant, by choosing a weaker lower bound if necessary we can always ensure $\lambda \le 1/C_p$. For a vector-valued function $F : \mathbb{R}^d \to \mathbb{R}^d$, we denote the divergence operator by $\operatorname{div} F$ or $\nabla \cdot F$, where $\operatorname{div} F = \nabla \cdot F = \sum_{i=1}^d \frac{\partial F_i}{\partial x_i}$.



Let $\tilde L : \mathbb{R}^d \to \mathbb{R}$ be a function satisfying the properties in Assumption 1 such that $\sup_{x \in \mathbb{R}^d} \|\nabla L(x) - \nabla\tilde L(x)\|_2 \le \epsilon_L\|x\|_2$, and let $E(u) = \int_\Omega L(\nabla u)\,dx$ and $\tilde E(u) = \int_\Omega \tilde L(\nabla u)\,dx$. For $\eta \in \big(0, \frac{\lambda^3}{(C_p + 1)\Lambda^3}\big]$ consider the sequences
$$u_{t+1} = u_t - \eta(I - \Delta)^{-1}DE(u_t), \qquad \tilde u_{t+1} = \tilde u_t - \eta(I - \Delta)^{-1}D\tilde E(\tilde u_t).$$

where we have repeatedly used Lemma 6 and Cauchy-Schwarz in the last line of Equation 37. Using the fact that for a vector-valued function $g : \mathbb{R}^d \to \mathbb{R}^d$ we have $\|g\|_{B(\Omega)} \ge \|g_i\|_{B(\Omega)}$ for all $i \in [d]$, from Equation 37 we get
$$\|f(g)\|_{B(\Omega)} \le d^{P/2}\Big(\sum_{\alpha, |\alpha| \le P} |A_\alpha|^2\Big)^{1/2}\|g\|_{B(\Omega)}^P.$$
Since the maximum total degree of the polynomial is $P$, Corollary 1 gives $f \circ g \in \Gamma_{PW}$.

D.3 PROOF FOR BARRON NORM ALGEBRA: LEMMA 6

The proof of Lemma 6 is fairly similar to the proof of Lemma 3.3 in Chen et al. (2021), the changes stemming from the difference in the Barron norm being considered.

Proof. For Multiplication, recall that multiplication of functions corresponds to convolution in the frequency domain, $\widehat{g_1 \cdot g_2} = \hat g_1 * \hat g_2$; hence, by Young's convolution inequality (Lemma 11),
$$\|h_1 \cdot h_2\|_{B(\Omega)} \le \Big(\int_{\omega \in \mathbb{R}^d} (1 + \|\omega\|_2)\,|\hat g_1(\omega)|\,d\omega\Big)\Big(\int_{\omega \in \mathbb{R}^d} (1 + \|\omega\|_2)\,|\hat g_2(\omega)|\,d\omega\Big) \implies \|h_1 \cdot h_2\|_{B(\Omega)} \le \|h_1\|_{B(\Omega)}\|h_2\|_{B(\Omega)}.$$
In order to show the bound for Derivative, since $h \in \Gamma_W$ there exists a function $g : \mathbb{R}^d \to \mathbb{R}$ with $g|_\Omega = h$ such that
$$g(x) = \int_{\|\omega\|_\infty \le W} e^{i\omega^\top x}\,\hat g(\omega)\,d\omega.$$
Taking the derivative on both sides we get
$$\partial_j g(x) = \int_{\|\omega\|_\infty \le W} i\,e^{i\omega^\top x}\,\omega_j\,\hat g(\omega)\,d\omega. \quad (39)$$
This implies that
$$\widehat{\partial_j g}(\omega) = i\omega_j\,\hat g(\omega) \implies |\widehat{\partial_j g}(\omega)| \le W|\hat g(\omega)|. \quad (40)$$
Hence we can bound the Barron norm of $\partial_j h$ as follows:
$$\|\partial_j h\|_{B(\Omega)} = \inf_{g|_\Omega = h,\ g \in \mathcal{F}_W} \int_{\|\omega\|_\infty \le W} (1 + \|\omega\|_\infty)\,|\widehat{\partial_j g}(\omega)|\,d\omega \le W \inf_{g|_\Omega = h,\ g \in \mathcal{F}_W} \int_{\|\omega\|_\infty \le W} (1 + \|\omega\|_\infty)\,|\hat g(\omega)|\,d\omega \le W\|h\|_{B(\Omega)}.$$
In order to show the Preconditioning bound, note that for a function $g : \mathbb{R}^d \to \mathbb{R}$, if $f = (I - \Delta)^{-1}g$ then $(I - \Delta)f = g$. Using the result from Lemma 12 (applied twice), $(1 + \|\omega\|_2^2)\hat f(\omega) = \hat g(\omega)$. Bounding $\|(I - \Delta)^{-1}h\|_{B(\Omega)}$,
$$\|(I - \Delta)^{-1}h\|_{B(\Omega)} \le \inf_{g|_\Omega = h} \int (1 + \|\omega\|_2)\,\frac{|\hat g(\omega)|}{1 + \|\omega\|_2^2}\,d\omega \le \inf_{g|_\Omega = h} \int (1 + \|\omega\|_2)\,|\hat g(\omega)|\,d\omega \implies \|(I - \Delta)^{-1}h\|_{B(\Omega)} \le \|h\|_{B(\Omega)}.$$

Corollary 1. Let $g : \mathbb{R}^d \to \mathbb{R}$. Then for any $k \in \mathbb{N}$ we have $\|g^k\|_{B(\Omega)} \le \|g\|_{B(\Omega)}^k$. Furthermore, if the function $g \in \mathcal{F}_W$, then the function $g^k \in \Gamma_{kW}$.

Proof. The bound on $\|g^k\|_{B(\Omega)}$ follows from the Multiplication result in Lemma 6, by induction. For $k = 2$, from Lemma 6 we have $\|g^2\|_{B(\Omega)} \le \|g\|_{B(\Omega)}^2$. Assume that for all $n \le k - 1$ we have
$$\|g^n\|_{B(\Omega)} \le \|g\|_{B(\Omega)}^n. \quad (42)$$
For $n = k$ we get
$$\|g^k\|_{B(\Omega)} = \|g\,g^{k-1}\|_{B(\Omega)} \le \|g\|_{B(\Omega)}\|g^{k-1}\|_{B(\Omega)} \le \|g\|_{B(\Omega)}^k.$$
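A quick discrete analogue of Corollary 1 (illustrative only; the coefficient vector is an arbitrary choice): for a function whose Fourier coefficients are supported on frequencies $\{-W, \dots, W\}$, squaring the function convolves the coefficient sequence with itself, so the support grows to $\{-2W, \dots, 2W\}$ and the weighted coefficient sum playing the role of the Barron norm is submultiplicative:

```python
import numpy as np

# Coefficients a_w on frequencies -W..W play the role of g-hat; squaring the
# function corresponds to convolving the coefficient sequence with itself.
rng = np.random.default_rng(2)
W = 5
a = rng.normal(size=2 * W + 1)            # frequencies -W..W
aa = np.convolve(a, a)                    # coefficients of g^2, frequencies -2W..2W

def B(c):
    # discrete Barron-type weight: sum_w (1 + |w|) |c_w|
    w = np.arange(-(len(c) // 2), len(c) // 2 + 1)
    return np.sum((1 + np.abs(w)) * np.abs(c))

ok_support = len(aa) == 4 * W + 1         # g in F_W  =>  g^2 in F_{2W}
ok_norm = B(aa) <= B(a)**2 + 1e-9         # ||g^2||_B <= ||g||_B^2
print(ok_support, ok_norm)  # True True
```

The norm inequality holds because $(1 + |w|) \le (1 + |z|)(1 + |w - z|)$ for every split $w = z + (w - z)$, which is the same weight factorization used in the Multiplication proof above.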


For Addition, taking near-optimal Fourier representatives $g_1, g_2$ of $h_1, h_2$ and using the triangle inequality $|\hat g_1 + \hat g_2| \le |\hat g_1| + |\hat g_2|$,
$$\|h_1 + h_2\|_{B(\Omega)} \le \inf_{g_1|_\Omega = h_1} \int (1 + \|\omega\|_2)\,|\hat g_1(\omega)|\,d\omega + \inf_{g_2|_\Omega = h_2} \int (1 + \|\omega\|_2)\,|\hat g_2(\omega)|\,d\omega \implies \|h_1 + h_2\|_{B(\Omega)} \le \|h_1\|_{B(\Omega)} + \|h_2\|_{B(\Omega)}.$$

The intermediate step in the Multiplication bound expands the Barron weight of the convolution:
$$\int_\omega \int_z \big(1 + \|\omega - z\|_2 + \|z\|_2 + \|z\|_2\|\omega - z\|_2\big)\,|\hat g_1(z)\,\hat g_2(\omega - z)|\,dz\,d\omega = \int_\omega \int_z (1 + \|\omega - z\|_2)(1 + \|z\|_2)\,|\hat g_1(z)|\,|\hat g_2(\omega - z)|\,dz\,d\omega,$$
where we use $\|\omega\|_2 \le \|\omega - z\|_2 + \|z\|_2$ and the fact that $\int_\omega \int_z \|z\|_2\|\omega - z\|_2\,|\hat g_1(z)\,\hat g_2(\omega - z)|\,dz\,d\omega \ge 0$.


In particular, the coefficients with $\|\omega\|_\infty > kW$ vanish, as we needed.

Lemma 11 (Young's convolution inequality). For functions $g \in L^p(\mathbb{R}^d)$ and $h \in L^q(\mathbb{R}^d)$, where $1 \le p, q, r \le \infty$ and $\frac{1}{p} + \frac{1}{q} = \frac{1}{r} + 1$, we have
$$\|g * h\|_r \le \|g\|_p\,\|h\|_q.$$
Here $*$ denotes the convolution operator.

Lemma 12. For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ such that $f \in L^1(\mathbb{R}^d)$, we have $\widehat{\nabla f}(\omega) = i\omega\hat f(\omega)$.
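Both lemmas can be spot-checked numerically (an illustrative sketch; the grid sizes and test functions are arbitrary choices). The first check verifies Young's inequality with $p = q = r = 1$ on a discretized convolution; the second verifies Lemma 12 for a 1D Gaussian via the FFT, with $\omega = 2\pi\xi$ matching the continuous transform convention:

```python
import numpy as np

# Young's inequality (Lemma 11) with p = q = r = 1: ||g*h||_1 <= ||g||_1 ||h||_1.
rng = np.random.default_rng(1)
dx = 0.01
g = rng.normal(size=400)
h = rng.normal(size=400)
l1 = lambda f: np.sum(np.abs(f)) * dx            # discrete L^1 norm
ok_young = l1(np.convolve(g, h) * dx) <= l1(g) * l1(h) + 1e-9

# Lemma 12: the Fourier transform of f' is i*omega*f_hat(omega).
n = 1024
x = np.linspace(-10.0, 10.0, n, endpoint=False)
f = np.exp(-x**2)                                # Gaussian, effectively band-limited
fprime = -2.0 * x * np.exp(-x**2)                # its exact derivative
omega = 2.0 * np.pi * np.fft.fftfreq(n, d=x[1] - x[0])
ok_deriv = np.allclose(np.fft.fft(fprime), 1j * omega * np.fft.fft(f), atol=1e-6)
print(ok_young, ok_deriv)  # True True
```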

