LEARNING DEEP OPERATOR NETWORKS: THE BENEFITS OF OVER-PARAMETERIZATION

Abstract

Neural operators that directly learn mappings between function spaces have received considerable recent attention. Deep Operator Networks (DeepONets) (Lu et al., 2021), a popular recent class of operator networks, have shown promising preliminary results in approximating solution operators of parametric partial differential equations. Despite the universal approximation guarantees (Lu et al., 2021; Chen & Chen, 1995), there is as yet no optimization convergence guarantee for DeepONets based on gradient descent (GD). In this paper, we establish such guarantees and show that over-parameterization based on wide layers provably helps. In particular, we present two types of optimization convergence analysis: first, for smooth activations, we bound the spectral norm of the Hessian of DeepONets and use the bound to show geometric convergence of GD based on restricted strong convexity (RSC); and second, for ReLU activations, we show that the neural tangent kernel (NTK) of DeepONets at initialization is positive definite, which can be used with the standard NTK analysis to imply geometric convergence. Further, we present empirical results on three canonical operator learning problems: the Antiderivative operator, the Diffusion-Reaction equation, and Burgers' equation, and show that wider DeepONets lead to lower training loss on all the problems, thereby supporting the theoretical results.

1. INTRODUCTION

Replicating the success of deep learning in scientific computing, for example by developing neural PDE solvers, constructing surrogate models and developing hybrid numerical solvers, has recently captured the interest of the broader scientific community. Neural Operators (Li et al., 2021a; b) and Deep Operator Networks (DeepONets) (Lu et al., 2021; Wang et al., 2021) are two recent approaches aimed at learning mappings between function spaces. In contrast to a classical supervised learning setup, which aims at learning mappings between two finite-dimensional vector spaces, these neural operators/operator networks aim to learn mappings between infinite-dimensional function spaces. The key underlying idea in both approaches is to parameterize the solution operator as a deep neural network and proceed with learning as in a standard supervised learning setup. Since a neural operator directly learns the mapping between the input and output function spaces, it is a natural choice for learning solution operators of parametric PDEs, where the PDE solution needs to be inferred for multiple instances of these "input parameters", or for inverse problems, where the forward problem needs to be solved multiple times to optimize a given functional. While there exist results on the approximation properties and convergence of DeepONets (see, e.g., (Deng et al., 2021) for a convergence analysis of DeepONets, vis-a-vis their approximation guarantees, for the advection-diffusion equation), there do not exist any optimization results on when and why GD converges during the optimization of the DeepONet loss. In this work we put forth theoretical convergence guarantees for DeepONets centered around over-parameterization and show that over-parameterization based on wider layers (for both branch and trunk net) provably helps in DeepONet convergence.
This is reflected in Figure 1, which summarizes an empirical evaluation of over-parameterized DeepONets with ReLU and smooth activations on a prototypical operator learning problem. To complement our theoretical results, we present an empirical evaluation of our guarantees on three template operator learning problems: (i) the Antiderivative operator, (ii) the Diffusion-Reaction PDE, and (iii) Burgers' equation, and demonstrate that wider DeepONets lead to overall lower training loss at the end of the training process.

[Figure 1 (caption partially recovered): training loss of DeepONets for the operator $\int_0^x u(\xi)\,d\xi$. In both cases $m$ denotes the width of the branch net and the trunk net. For both ReLU and smooth activations, increasing the width $m$ leads to much lower losses. Note that the y-axis is in log-scale.]

The rest of the paper is organized as follows. In Section 2 we review the existing literature on neural operators, operator networks and over-parameterization based approaches for establishing convergence guarantees for deep networks. Next, we devote Section 3 to briefly outlining the DeepONet model, the learning problem and the corresponding architecture. Section 4 contains the first technical result of the paper, where we establish convergence guarantees for DeepONets with smooth activations (for both branch and trunk net) based on the Restricted Strong Convexity (RSC) of the loss. Next, in Section 5, we present the second technical result of the paper, where we establish optimization guarantees for DeepONets with ReLU activations by showing that the Neural Tangent Kernel (NTK) of the DeepONet is positive definite at initialization. In Section 6 we present simple empirical evaluations of the main results by carrying out a parametric study based on increasing the DeepONet width and noting its effect on the total loss during training. We finally conclude by summarizing the main contributions in Section 7.

2.1. LEARNING OPERATOR NETWORKS

Constructing operator networks for ordinary differential equations (ODEs) using learning-based approaches was first studied in (Chen & Chen, 1995), where the authors showed that a neural network with a single hidden layer can approximate a nonlinear continuous functional to arbitrary accuracy. This was, in essence, akin to the Universal Approximation Theorem for classical neural networks (see, e.g., (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991; Lu et al., 2017)). While the theorem only guaranteed the existence of a neural architecture, it was not practically realized until (Lu et al., 2021), which also provided an extension of the theorem to deep networks. Since then a number of works have pursued applications of DeepONets to different problems (see, e.g., (Goswami et al., 2022; Wang et al., 2021; Wang & Perdikaris, 2021)). Recently, (Kontolati et al., 2022) studied the influence of over-parameterization on neural surrogates based on DeepONets in the context of dynamical systems. While their paper studies the effects of over-parameterization on the generalization properties of DeepONets, an optimization analysis of DeepONets remains a largely open problem.

2.2. OPTIMIZATION: NTK, ETC.

Optimization of over-parameterized deep networks has been studied extensively (see, e.g., (Du et al., 2019; Arora et al., 2019b; a; Allen-Zhu et al., 2019; Liu et al., 2021a)). In particular, (Jacot et al., 2018) showed that the neural tangent kernel (NTK) of a deep network converges to an explicit kernel in the limit of infinite network width and stays constant during training. (Liu et al., 2021a) showed that this constancy arises due to the scaling properties of the Hessian of the predictor as a function of network width. (Du et al., 2019; Allen-Zhu et al., 2019) showed that gradient descent converges to zero training error in polynomial time for a deep over-parameterized model, with (Du et al., 2019) showing it for a deep model with residual connections (ResNet) and (Allen-Zhu et al., 2019) showing it in the context of feed-forward models, CNNs and ResNets. (Karimi et al., 2016) showed that the Polyak-Łojasiewicz (PL) condition, a much weaker condition than strong convexity, can be used to explain the linear convergence of gradient-based methods.

3. LEARNING DEEP OPERATOR NETWORKS

Learning neural operators (Li et al., 2020; Lu et al., 2021) is a fundamentally challenging problem as it requires learning parametric maps between two infinite-dimensional function spaces which is in sharp contrast to classical deep learning which learns parametric maps between two finite-dimensional vector spaces. A succinct review of learning in infinite dimensions is provided in Section A.1 in the Appendix. Here we focus on the DeepONet model and outline its main features.

3.1. DEEPONET SETUP

In what follows, we use the notation $f(\theta_f; u)$ to denote a deep network $f_{\theta_f} : \mathbb{R}^m \to \mathbb{R}^n$, where $u$ denotes the input and $\theta_f$ the learnable parameters. A DeepONet is an operator network that learns a parametric map $G_\theta$ such that $G_\theta(u) \approx G^\dagger(u)$, where $u$ denotes the input function and $G^\dagger$ denotes the "true" operator whose approximation we wish to learn. Following (Lu et al., 2021), a DeepONet predictor can be defined as the inner product of two deep networks, $f = \{f_k\}_{k=1}^K$ known as the branch net and $g = \{g_k\}_{k=1}^K$ the trunk net, namely

$$G_\theta(u)(y) := \sum_{k=1}^{K} f_k(\theta_f; u)\, g_k(\theta_g; y), \tag{1}$$

where $u \in \mathbb{R}^{d_u}$ is the input function and $y \in \mathrm{dom}(G_\theta(u)) \subseteq \mathbb{R}^{d_v}$ are the output locations. The training data comprises $n$ input functions $\{u^{(i)}\}_{i=1}^n$ and $q_i$ output locations for each $G(u^{(i)})$, i.e. $\{\{y^{(i)}_j\}_{j=1}^{q_i}\}_{i=1}^n$, with $y^{(i)}_j$ denoting the $j$-th output location for $G_\theta(u^{(i)})$. The branch net $f$ has parameters $\theta_f \in \mathbb{R}^{p_f}$ and the trunk net $g$ has parameters $\theta_g \in \mathbb{R}^{p_g}$. The entire set of parameters for the DeepONet is given by $\theta = [\theta_f^\top\ \theta_g^\top]^\top \in \mathbb{R}^{p_f + p_g}$. Further, let $\{x_r\}_{r=1}^R$ with $x_r \in \mathrm{dom}(u) \subseteq \mathbb{R}^d$ for all $r \in [R]$, and $u^{(i)}(x_r) \in \mathbb{R}^{d_u}$. For scalar functions, $u^{(i)}(x_r) \in \mathbb{R}$, the branch net takes as input $\{u^{(i)}(x_r)\}_{r=1}^R$, which implies $f : \mathbb{R}^R \to \mathbb{R}^K$. Similarly, for scalar output locations $y^{(i)}_j \in \mathbb{R}$ we have $g : \mathbb{R} \to \mathbb{R}^K$. The DeepONet learning problem can then be cast as the minimization of the following empirical risk:

$$\theta^\dagger = \arg\min_{\theta \in \Theta} L\big(G_\theta(u), G^\dagger(u)\big), \qquad L\big(G_\theta(u), G^\dagger(u)\big) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{q_i} \sum_{j=1}^{q_i} \ell\Big(G_\theta(u^{(i)})(y^{(i)}_j) - G^\dagger(u^{(i)})(y^{(i)}_j)\Big), \tag{2}$$

with

$$G_\theta(u^{(i)})(y^{(i)}_j) = \sum_{k=1}^{K} f_k\big(\theta_f; \{u^{(i)}(x_r)\}_{r=1}^R\big)\, g_k\big(\theta_g; y^{(i)}_j\big),$$

and $\ell(z) := \tfrac{1}{2} z^2$ denoting the squared-error (MSE) loss. Note that the "true" operator $G^\dagger$ whose approximation is sought in (2) can either be explicit, e.g. the integral of a function, or implicit, e.g. the solution to a nonlinear partial differential equation (PDE).
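As an illustration of the predictor (1) and the empirical risk (2), the following is a minimal NumPy sketch, not the implementation used in the paper. The function names (`init_mlp`, `mlp`, `deeponet`, `empirical_risk`), the choice of `tanh` as a smooth 1-Lipschitz activation, and the depth are assumptions made for the example; the $1/\sqrt{m}$ layer scaling and the linear last layer follow the architecture described in Sections 3.2 and 4.

```python
import numpy as np

def init_mlp(rng, d_in, m, K, depth=2, sigma0=1.0):
    """Hypothetical initializer: `depth` hidden layers of width m, linear output of size K."""
    dims = [d_in] + [m] * depth
    hidden = [rng.normal(0.0, sigma0, size=(dims[l + 1], dims[l])) for l in range(depth)]
    # Last layer drawn with variance 1/(m K), i.e. standard deviation 1/sqrt(m K).
    last = rng.normal(0.0, 1.0 / np.sqrt(m * K), size=(K, m))
    return hidden, last

def mlp(params, x, m):
    """Fully connected net with 1/sqrt(m) scaling in hidden layers and a linear last layer."""
    hidden, last = params
    beta = x
    for W in hidden:
        beta = np.tanh(W @ beta / np.sqrt(m))  # tanh: smooth, 1-Lipschitz (assumed choice)
    return last @ beta

def deeponet(theta_f, theta_g, u_sensors, y, m):
    """Predictor (1): inner product of branch output f(u) and trunk output g(y)."""
    f = mlp(theta_f, u_sensors, m)          # branch net: R^R -> R^K
    g = mlp(theta_g, np.atleast_1d(y), m)   # trunk net:  R   -> R^K
    return float(f @ g)

def empirical_risk(theta_f, theta_g, U, Y, targets, m):
    """Empirical risk (2) with the squared-error loss l(z) = z^2 / 2."""
    total = 0.0
    for u_i, y_i, t_i in zip(U, Y, targets):
        preds = np.array([deeponet(theta_f, theta_g, u_i, y, m) for y in y_i])
        total += np.mean(0.5 * (preds - t_i) ** 2)  # average over the q_i output locations
    return total / len(U)                            # average over the n input functions
```

Here `U` holds one row of sensor values $\{u^{(i)}(x_r)\}_{r=1}^R$ per input function, and `Y` and `targets` hold the output locations and the corresponding values of $G^\dagger(u^{(i)})(y^{(i)}_j)$.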

3.2. DEEPONET ARCHITECTURE

We now briefly outline the architecture used throughout the analysis in Sections 4 and 5 and in the numerical experiments in Section 6. We adopt fully connected feedforward neural networks (FNNs) for both the branch and trunk nets, which is also the baseline DeepONet model in (Lu et al., 2021). Figure 4 in the Appendix shows a schematic of the architecture and the notation used throughout this paper. For the architecture we adopt the unstacked configuration (see Fig. 1d in (Lu et al., 2021)). For each input function $u^{(i)}$, the corresponding training tuple is

$$D^{(i)} := \Big( \{u^{(i)}(x_r)\}_{r=1}^R,\ \{y^{(i)}_j\}_{j=1}^{q_i},\ \{G(u^{(i)})(y^{(i)}_j)\}_{j=1}^{q_i} \Big).$$

Remark 3.2 (Training Dataset). The entire training dataset $D$ is then simply given by the collection of all tuples $\{D^{(i)}\}_{i=1}^n$.

Remark 3.3 (Widths $m_f$ and $m_g$). We denote the width of the branch net by $m_f$ and that of the trunk net by $m_g$, and set $m_f = m_g = m$ throughout Sections 4-5. Similarly, for the experiments we use $m_f = m_g$ unless otherwise stated. When $m_f \neq m_g$, the analysis in Section 4 still holds with $m = \min(m_f, m_g)$.

Remark 3.4 (Last layer of the branch and trunk net). Note that the last layer of the branch and trunk net is simply a linear layer and does not have any nonlinearity/activation.

4. OPTIMIZATION GUARANTEES FOR DEEPONETS: SMOOTH ACTIVATIONS

In this section, we focus on DeepONets based on smooth activation functions. To build up to the optimization analysis, we first establish a bound on the spectral norm of the Hessian of the DeepONet predictor, in particular showing that $\|\nabla^2 G_\theta(u)(y)\|_2 = O(1/\sqrt{m})$ where, again, $m = m_f = m_g$. The spectral norm bound is then used to establish a form of Restricted Strong Convexity (RSC) of the DeepONet loss (2), which in turn is used to establish geometric convergence of gradient descent (GD). For the analysis, analogous to (Liu et al., 2021b), we consider an FNN for the branch net:

$$\beta_f^{(0)} = u, \qquad \beta_f^{(l)} = \phi_l\Big(\tfrac{1}{\sqrt{m_f}} W_f^{(l)} \beta_f^{(l-1)}\Big),\ \forall l \in [L-1], \qquad \beta_f^{(L)} = W_f^{(L)} \beta_f^{(L-1)}, \qquad f_k = \beta_{f,k}^{(L)},\ \forall k \in [K],$$

where $m_f = m$ and $L$ denote the width and depth of the branch net respectively, $[K] := \{1, \ldots, K\}$, $\phi_l$ is the activation function at layer $l$, $\beta_f^{(l)}$ are the outputs at layer $l$ and $W_f^{(l)} \equiv [w^{(l)}_{f,ij}]$ denote the weight matrices at layer $l$. Similarly, we consider a fully connected feedforward network for the trunk net:

$$\beta_g^{(0)} = y, \qquad \beta_g^{(l)} = \phi_l\Big(\tfrac{1}{\sqrt{m_g}} W_g^{(l)} \beta_g^{(l-1)}\Big),\ \forall l \in [L-1], \qquad \beta_g^{(L)} = W_g^{(L)} \beta_g^{(L-1)}, \qquad g_k = \beta_{g,k}^{(L)},\ \forall k \in [K],$$

where, again, $m_g = m$ and $L$ denote the width and depth of the trunk net respectively and $W_g^{(l)} \equiv [w^{(l)}_{g,ij}]$ denote the weight matrices at layer $l$ of the trunk net. In order to aid our analysis, we make the following assumptions on the activations, the loss and the weights:

Assumption 1 (Activation functions). The activation functions $\phi_l$ at each layer $l$ are 1-Lipschitz and $\beta_\phi$-smooth (i.e. $|\phi_l''| \le \beta_\phi$) for some $\beta_\phi > 0$.

Assumption 2 (Loss function). We assume the loss $\ell_{i,j}$ is (i) strongly convex, i.e., $\ell''_{i,j} \ge a > 0$, (ii) smooth, i.e., $\ell''_{i,j} \le b$, and (iii) Lipschitz, i.e., $|\ell'_{i,j}| \le \lambda$, where we make use of the following notation:

$$\ell_{i,j} \equiv \ell\big(G_\theta(u^{(i)})(y^{(i)}_j) - G^\dagger(u^{(i)})(y^{(i)}_j)\big), \quad \ell'_{i,j} \equiv \ell'\big(G_\theta(u^{(i)})(y^{(i)}_j) - G^\dagger(u^{(i)})(y^{(i)}_j)\big), \quad \ell''_{i,j} \equiv \ell''\big(G_\theta(u^{(i)})(y^{(i)}_j) - G^\dagger(u^{(i)})(y^{(i)}_j)\big). \tag{6}$$

Assumption 3 (Initialization of Weights). All weights of the branch and trunk net are initialized as $w^{(l)}_{f,ij}\big|_{t=0} = w^{(l)}_{f_0,ij} \sim \mathcal{N}(0, \sigma_0^2)$ and $w^{(l)}_{g,ij}\big|_{t=0} = w^{(l)}_{g_0,ij} \sim \mathcal{N}(0, \sigma_0^2)$ respectively, for $l \in [L-1]$ and some constant $\sigma_0 > 0$. Furthermore, in order for the model (1) to have a suitable scaling in our analysis, we initialize the weights in the last layer of the branch and trunk nets as $w^{(L)}_{f_0,ij} \sim \mathcal{N}(0, 1/(mK))$ and $w^{(L)}_{g_0,ij} \sim \mathcal{N}(0, 1/(mK))$ respectively.

Remark 4.1. For Assumption 1, our analysis straightforwardly extends to general $\varsigma$-Lipschitz activations, with constants depending on $\varsigma$. For Assumption 2, the Lipschitz loss assumption can be dropped by assuming that the true responses $G^\dagger(u)(y)$ are bounded and showing that the prediction responses $G_\theta(u)(y)$ are bounded with high probability.

Definition 1 (Norm ball). Our analysis focuses on the standard Euclidean norm ball around the initialization $\theta_0$, i.e. $B^{\mathrm{Euc}}_\rho(\theta_0)$, where

$$B^{\mathrm{Euc}}_\rho(\bar\theta) := \big\{ \theta \in \mathbb{R}^{p_f + p_g} \ \big|\ \|\theta - \bar\theta\|_2 \le \rho \big\}.$$

4.1. SPECTRAL NORM OF THE HESSIAN OF BRANCH AND TRUNK NETS

The convergence analysis makes use of the gradients and Hessians of the total loss (2) and the predictor (1) with respect to the parameters $\theta$, namely,

$$\nabla_\theta L(\theta) = \begin{bmatrix} \nabla_{\theta_f} L \\ \nabla_{\theta_g} L \end{bmatrix}, \qquad \nabla^2_\theta L = H(\theta) = \begin{bmatrix} H_{ff} & H_{fg} \\ H_{gf} & H_{gg} \end{bmatrix}, \tag{8}$$

where $\nabla_{\theta_f} L(\theta) \in \mathbb{R}^{p_f}$ and $\nabla_{\theta_g} L(\theta) \in \mathbb{R}^{p_g}$. Note that we use the notation $\nabla_{\theta_f}(\cdot)$ to denote the derivative with respect to the parameters $\theta_f$; this is not a functional gradient. Similarly, the individual blocks in the $2 \times 2$ block Hessian $H(\theta)$ are given by

$$H_{ff} = \frac{\partial^2 L}{\partial \theta_f^2}, \qquad H_{fg} = \frac{\partial^2 L}{\partial \theta_f \partial \theta_g}, \qquad H_{gf} = H_{fg}^\top = \frac{\partial^2 L}{\partial \theta_g \partial \theta_f}, \qquad H_{gg} = \frac{\partial^2 L}{\partial \theta_g^2}, \tag{9}$$

where $H_{ff} \in \mathbb{R}^{p_f \times p_f}$, $H_{gg} \in \mathbb{R}^{p_g \times p_g}$, $H_{fg} \in \mathbb{R}^{p_f \times p_g}$, $H_{gf} \in \mathbb{R}^{p_g \times p_f}$ and the argument $\theta$ is omitted for clarity of exposition.
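Using (2) and rewriting the derivatives in (8) and (9), we get

$$\frac{\partial L}{\partial \theta_f} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} g^{(i)}_{k,j}\, \nabla_{\theta_f} f^{(i)}_k, \qquad \frac{\partial L}{\partial \theta_g} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} f^{(i)}_k\, \nabla_{\theta_g} g^{(i)}_{k,j}, \tag{10}$$

for the gradients, and

$$\frac{\partial^2 L}{\partial \theta_f^2} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} g^{(i)}_{k,j}\, \nabla^2_{\theta_f} f^{(i)}_k + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \Bigg[ \sum_{k,k'=1}^{K} g^{(i)}_{k,j}\, g^{(i)}_{k',j}\, \nabla_{\theta_f} f^{(i)}_k\, \nabla_{\theta_f} f^{(i)\top}_{k'} \Bigg],$$

$$\frac{\partial^2 L}{\partial \theta_g^2} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} f^{(i)}_k\, \nabla^2_{\theta_g} g^{(i)}_{k,j} + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \Bigg[ \sum_{k,k'=1}^{K} f^{(i)}_k\, f^{(i)}_{k'}\, \nabla_{\theta_g} g^{(i)}_{k,j}\, \nabla_{\theta_g} g^{(i)\top}_{k',j} \Bigg],$$

$$\frac{\partial^2 L}{\partial \theta_f \partial \theta_g} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} \nabla_{\theta_f} f^{(i)}_k\, \nabla_{\theta_g} g^{(i)\top}_{k,j}}_{=\,H^{(1)}_{fg}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \Bigg[ \sum_{k,k'=1}^{K} g^{(i)}_{k,j}\, f^{(i)}_{k'}\, \nabla_{\theta_f} f^{(i)}_k\, \nabla_{\theta_g} g^{(i)\top}_{k',j} \Bigg]}_{=\,H^{(2)}_{fg}}, \tag{11}$$

for the individual blocks of the Hessian (8), where we make use of the notation $g^{(i)}_{k,j} = g_k(\theta_g; y^{(i)}_j)$ and $f^{(i)}_k = f_k(\theta_f; u^{(i)})$.

Lemma 4.1. Under Assumptions 1, 2 and 3, and for $\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)$, with high probability we have for all $k \in [K]$

$$\max_{i \in [n]} \big\|\nabla^2_{\theta_f} f^{(i)}_k\big\|_2 \le \frac{c^{(f)}}{\sqrt{m_f}}, \qquad \max_{i \in [n]} \max_{j \in [q_i]} \big\|\nabla^2_{\theta_g} g^{(i)}_{k,j}\big\|_2 \le \frac{c^{(g)}}{\sqrt{m_g}}, \tag{12}$$

$$\big\|\nabla_{\theta_f} f_k\big\|_2 \le \varrho^{(f)}, \qquad \big\|\nabla_{\theta_g} g_k\big\|_2 \le \varrho^{(g)},$$

where $c^{(f)}, c^{(g)}, \varrho^{(f)}, \varrho^{(g)}$ are suitable constants, $\nabla_{\theta_f}(\cdot) = \partial(\cdot)/\partial\theta_f$, $\nabla_{\theta_g}(\cdot) = \partial(\cdot)/\partial\theta_g$, $\nabla^2_{\theta_f}(\cdot) = \partial^2(\cdot)/\partial\theta_f^2$ and $\nabla^2_{\theta_g}(\cdot) = \partial^2(\cdot)/\partial\theta_g^2$.

Proof. The proof follows directly from Theorem 3.2 in (Liu et al., 2021a).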

4.2. GEOMETRIC CONVERGENCE BASED ON RESTRICTED STRONG CONVEXITY

We now focus on establishing the convergence of gradient descent (GD) by using the Hessian spectral norm bound. The convergence analysis is based on a generalization of the notion of restricted strong convexity (RSC) of the loss (2) (see (Negahban et al., 2012; Negahban & Wainwright, 2012; Zhang & Cheng, 2015; Zhang & Yin, 2013; Lai & Yin, 2013) for a review of RSC and its applications in high-dimensional statistics for linear models). To aid clarity of exposition, we state the main results along with their implications in this section and leave the details of the proofs to Section A.3 in the Appendix.

Definition 2 (Restricted Strong Convexity (RSC)). A function $L$ is said to satisfy $\alpha$-restricted strong convexity ($\alpha$-RSC) with respect to the tuple $(Q, \theta)$ if for any $\theta' \in Q \subseteq \mathbb{R}^p$ and some fixed $\theta \in \mathbb{R}^p$, we have

$$L(\theta') \ge L(\theta) + \langle \theta' - \theta, \nabla_\theta L(\theta) \rangle + \frac{\alpha}{2} \|\theta' - \theta\|_2^2,$$

with $\alpha > 0$.

Note that $L$ being $\alpha$-RSC w.r.t. $(Q, \theta)$ does not require $L$ to be convex on $\mathbb{R}^p$. Further, let $\{\theta_t\}_{t \ge 0}$ denote a sequence of iterates obtained from GD, i.e.,

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta L(\theta_t).$$

Since $\theta_t = [\theta_{f,t}^\top\ \theta_{g,t}^\top]^\top$, we will use dynamic restricted sets $Q^t_\kappa \subseteq \mathbb{R}^p$, where $p = p_f + p_g$, to show RSC of the DeepONet loss. Note that these sets are parameterized by a constant $\kappa$ which measures the absolute cosine similarity between suitable vectors (see Section A.3 and specifically Proposition 1 in the Appendix for a detailed discussion of these sets). Further, our optimization analysis for DeepONets is based on a second order Taylor expansion of the loss (2) around $\theta_t = [\theta_{f,t}^\top\ \theta_{g,t}^\top]^\top$:

$$L(\theta) = L(\theta_t) + \langle \theta - \theta_t, \nabla_\theta L(\theta_t) \rangle + \frac{1}{2} (\theta - \theta_t)^\top H(\tilde\theta) (\theta - \theta_t), \tag{15}$$

where $\nabla_\theta L(\theta_t)$ and $H(\tilde\theta)$ are as in (8), (9), (10), and (11), and $\tilde\theta = \tau \theta + (1 - \tau)\theta_t$ for some $\tau \in [0, 1]$.

Definition 3 ($Q^t_\kappa$ sets). For an iterate $\theta_t = [\theta_{f,t}^\top\ \theta_{g,t}^\top]^\top$, consider the singular value decomposition

$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} \nabla_{\theta_f} f^{(i)}_k\, \nabla_{\theta_g} g^{(i)\top}_{k,j} = \sum_{h=1}^{\bar q} \sigma_h a_h b_h^\top, \tag{16}$$

where $\bar q \le qK$, and $\sigma_h > 0$, $a_h \in \mathbb{R}^{p_f}$, $b_h \in \mathbb{R}^{p_g}$ respectively denote the singular values, left singular vectors, and right singular vectors. Further, let $\bar G_\theta = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} G_\theta(u^{(i)})(y^{(i)}_j)$. Then, for some suitable $\kappa \in (0, \tfrac{1}{2}]$, we define the set:

$$Q^t_\kappa := \Big\{ \theta' = [\theta'_f; \theta'_g] \ :\ \big|\cos(\theta' - \theta_t, \nabla_\theta \bar G_{\theta_t})\big| \ge \kappa, \quad \sum_{h=1}^{\bar q} \sigma_h \langle \theta'_f - \theta_{f,t}, a_h \rangle \langle \theta'_g - \theta_{g,t}, b_h \rangle \ge 0 \Big\}.$$

Remark 4.2. Note that $p_f, p_g$ are respectively the number of parameters in the branch and trunk nets, and the models can be sufficiently over-parameterized such that they are larger than the number of training examples $q = \sum_{i=1}^{n} q_i$.

Remark 4.3 ($Q^t_\kappa$ sets). The specifics of the restricted set $Q^t_\kappa \subseteq \mathbb{R}^{p_f + p_g}$ stem from technicalities in the analysis. For a detailed outline of these sets we refer the reader to Section A.3 and specifically Proposition 1 in the Appendix.

Theorem 4.2 (RSC of the loss). Under Assumptions 1, 2 and 3, for all $\theta' \in B^t_\kappa := Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0)$, with high probability we have

$$L(\theta') \ge L(\theta_t) + \langle \theta' - \theta_t, \nabla_\theta L(\theta_t) \rangle + \frac{\alpha_t}{2} \|\theta' - \theta_t\|_2^2, \qquad \text{where } \alpha_t = c_1 \|\nabla_\theta \bar G_t\|_2^2 - \frac{c_2}{\sqrt{m}},$$

where $\bar G_t = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} G_{\theta_t}(u^{(i)})(y^{(i)}_j)$ and $c_1, c_2 > 0$ are constants. Thus, the loss $L(\theta)$ satisfies RSC w.r.t. $(B^t_\kappa, \theta_t)$ whenever $\alpha_t > 0$.

Proof. A detailed proof is presented in Section A.3 in the Appendix.

Theorem 4.3 (Smoothness of Loss). Under Assumptions 1, 2 and 3, with high probability, for $\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)$, $L(\theta)$ is $\beta$-smooth with $\beta = b\varrho^2 + \frac{c\sqrt{\lambda}}{\sqrt{m}}$, where $c = \max(c^{(f)}, c^{(g)})$ and $\varrho = \max(\varrho^{(f)}, \varrho^{(g)})$, with $c^{(f)}, c^{(g)}, \varrho^{(f)}, \varrho^{(g)}$ as in Lemma 4.1.

Proof. We again refer the reader to Section A.3 in the Appendix for a detailed proof.

Lemma 4.4 (RSC $\Rightarrow$ Restricted PL). The RSC and smoothness of the loss together imply a form of the Polyak-Łojasiewicz (PL) condition w.r.t. the tuple $(B_t, \theta_t)$, unlike standard PL which holds without restrictions (Karimi et al., 2016).

Proof. For a detailed outline of the proof, we refer the reader to Section A.3 in the Appendix.

Theorem 4.5 (Global Loss Reduction). Consider the same conditions as in Theorem 4.3 and $\alpha_t > 0$ for $t \in [T]$, for a gradient descent update $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta L(\theta_t)$ with $\eta_t = \frac{\omega_t}{\beta}$ for some $\omega_t \in (0, 2)$, where $\beta$ is defined as in Theorem 4.3. Then, with high probability, for all $\bar\theta \in \operatorname{arginf}_{\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)} L(\theta)$, with $0 \le \gamma_t := \frac{L(\bar\theta_{t+1}) - L(\bar\theta)}{L(\theta_t) - L(\bar\theta)} < 1$ and $\bar\theta_{t+1} \in \operatorname{arginf}_{\theta \in Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0)} L(\theta)$, we have

$$L(\theta_{t+1}) - L(\bar\theta) \le \Big( 1 - \frac{\alpha_t \omega_t}{\beta} (1 - \gamma_t)(2 - \omega_t) \Big) \big( L(\theta_t) - L(\bar\theta) \big).$$

Proof. We refer the reader to Section A.3 and in particular Lemma 1 in the Appendix for the proof.

Remark 4.4. A direct consequence of Theorem 4.5 is global reduction in the loss in $\mathbb{R}^p$ and thus the convergence of gradient descent for the DeepONet optimization problem.
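To make the contraction factor in Theorem 4.5 concrete, the sketch below runs gradient descent with step size $\eta = \omega/\beta$ on a simple quadratic whose Hessian eigenvalues lie in $[\alpha, \beta]$, and compares the per-step loss ratio with $1 - \frac{\alpha\omega}{\beta}(2 - \omega)$, i.e. the factor of Theorem 4.5 with $\gamma_t = 0$. The quadratic is only a stand-in that is strongly convex and smooth everywhere; it is not the DeepONet loss, and the function names are hypothetical.

```python
import numpy as np

def gd_loss_trace(A, theta0, omega, steps):
    """Run GD on L(theta) = 0.5 * theta^T A theta with step size omega / beta."""
    beta = np.linalg.eigvalsh(A).max()   # smoothness constant of the quadratic
    eta = omega / beta
    theta = theta0.copy()
    losses = [0.5 * theta @ A @ theta]
    for _ in range(steps):
        theta = theta - eta * (A @ theta)  # gradient of the quadratic is A @ theta
        losses.append(0.5 * theta @ A @ theta)
    return np.array(losses)

def contraction_factor(alpha, beta, omega):
    """Per-step factor 1 - (alpha * omega / beta) * (2 - omega), i.e. gamma_t = 0."""
    return 1.0 - (alpha * omega / beta) * (2.0 - omega)

# Toy problem: eigenvalues between alpha = 0.5 and beta = 4.0, minimum loss 0 at theta = 0.
A = np.diag([0.5, 1.0, 2.0, 4.0])
losses = gd_loss_trace(A, np.ones(4), omega=1.0, steps=20)
rate = contraction_factor(0.5, 4.0, 1.0)
```

In this toy setting each ratio `losses[t + 1] / losses[t]` stays below `rate`, which is the geometric decrease the theorem describes; for DeepONets the analogous statement holds only on the restricted sets and with high probability.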

5. OPTIMIZATION GUARANTEES FOR DEEPONETS: RELU ACTIVATIONS

We now present an alternative optimization analysis, based on the Neural Tangent Kernel (NTK) (Jacot et al., 2018), to establish guarantees for the convergence of gradient descent and its variants for DeepONets. Although the NTK convergence theory holds for both smooth and ReLU activations (see, e.g., (Allen-Zhu et al., 2019; Du et al., 2019)), we present it in the context of the latter. Further, in this work we only present a proof of the positive definiteness of the DeepONet NTK at initialization. This allows one to readily transcribe analogous existing arguments for the convergence of gradient descent in deep networks to the DeepONet model (Allen-Zhu et al., 2019; Du et al., 2019; Nguyen et al., 2021).

Under review as a conference paper at ICLR 2023

Now, recall the DeepONet predictor

$$G_\theta(u^{(i)})(y^{(i)}_j) := \sum_{k=1}^{K} f_k\big(\theta_f; u^{(i)}\big)\, g_k\big(\theta_g; y^{(i)}_j\big),$$

and its corresponding gradient with respect to the parameters $\theta$,

$$\nabla_\theta G_\theta(u^{(i)})(y^{(i)}_j) = \sum_{k=1}^{K} \begin{bmatrix} g_k\big(\theta_g; y^{(i)}_j\big)\, \nabla_{\theta_f} f_k\big(\theta_f; u^{(i)}\big) \\ f_k\big(\theta_f; u^{(i)}\big)\, \nabla_{\theta_g} g_k\big(\theta_g; y^{(i)}_j\big) \end{bmatrix}.$$

This allows us to write the NTK $K(\theta)$ of the DeepONet, which is a $q \times q$ matrix, as:

$$K(\theta) = \Big[ \big\langle \nabla_\theta G_\theta(u^{(i)})(y^{(i)}_j),\ \nabla_\theta G_\theta(u^{(i')})(y^{(i')}_{j'}) \big\rangle \Big]_{q \times q},$$

where, again, $q = \sum_{i=1}^{n} q_i$. Given that the branch and trunk nets are initialized using standard initialization techniques, as typically done in practice for deep networks (Arora et al., 2019b; Du et al., 2019), with the addition of Assumption 3, the resulting branch net NTK $K_{f,k}$ and trunk net NTK $K_{g,k}$, namely

$$K_{f,k} = \Big[ \big\langle \nabla_{\theta_{f,k}} f_k\big(\theta_f; u^{(i)}\big),\ \nabla_{\theta_{f,k}} f_k\big(\theta_f; u^{(i')}\big) \big\rangle \Big]_{n \times n}, \qquad K_{g,k} = \Big[ \big\langle \nabla_{\theta_{g,k}} g_k\big(\theta_g; y^{(i)}_j\big),\ \nabla_{\theta_{g,k}} g_k\big(\theta_g; y^{(i')}_{j'}\big) \big\rangle \Big]_{q \times q},$$

can be shown to be positive definite with high probability for each $k \in [K]$ (Nguyen, 2021; Liu et al., 2021a; Du et al., 2019).
In particular, with high probability, we have

$$\lambda_{\min}(K_{f,k}) \ge \lambda_{0,f}, \qquad \lambda_{\min}(K_{g,k}) \ge \lambda_{0,g}, \qquad \forall k \in [K],$$

where $\lambda_{\min}(K_{f,k})$ and $\lambda_{\min}(K_{g,k})$ are the minimum eigenvalues of the branch and trunk net NTKs respectively, and $\lambda_{0,f}, \lambda_{0,g} > 0$ are positive constants (see, e.g., Theorem 4.1 in (Nguyen, 2021)).

Theorem 5.1 ($K(\theta)$ is positive definite at initialization). Given standard initialization for the branch and trunk nets, and granted that the individual branch and trunk net NTKs are positive definite (24)-(25), the NTK of the DeepONet is positive definite at initialization, i.e.

$$\alpha^\top \mathbb{E}[K(\theta)]\, \alpha \ge c > 0,$$

where $\alpha$ denotes an arbitrary block unit vector with $n$ blocks, $\alpha_{i,j}$ corresponds to the $j$-th entry in the $i$-th block, and $c$ denotes a positive constant.

Proof. We refer the reader to Section A.4 and specifically Propositions 2, 3 and 4 in the Appendix.

Remark 5.1. A direct consequence of Theorem 5.1 is that the DeepONet NTK is positive definite at initialization. It is then straightforward to invoke standard NTK analysis to show convergence of gradient descent during training; see, e.g., (Jacot et al., 2018; Du et al., 2019; Arora et al., 2019b; a; Allen-Zhu et al., 2019; Nguyen et al., 2021).
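The NTK defined above can be formed explicitly for a small model. The sketch below is an illustration under stated assumptions rather than the construction used in the proofs: it uses a one-hidden-layer ReLU branch and trunk net (the paper allows general depth), hand-derived gradients, and hypothetical names (`init_params`, `predict`, `grad_G`, `empirical_ntk`). It stacks the per-sample gradients $\nabla_\theta G_\theta(u^{(i)})(y^{(i)}_j)$ into a Jacobian $J$ and returns $K(\theta) = J J^\top$, whose smallest eigenvalue can then be inspected numerically.

```python
import numpy as np

def init_params(rng, R, d_y, m, K):
    """Hypothetical init: Gaussian hidden weights, last layers with variance 1/(m K)."""
    Wf, Wg = rng.normal(size=(m, R)), rng.normal(size=(m, d_y))
    Vf = rng.normal(size=(K, m)) / np.sqrt(m * K)
    Vg = rng.normal(size=(K, m)) / np.sqrt(m * K)
    return Wf, Vf, Wg, Vg

def predict(params, u, y):
    """G(u)(y) = <f(u), g(y)> with one-hidden-layer ReLU branch f and trunk g."""
    Wf, Vf, Wg, Vg = params
    m = Wf.shape[0]
    f = Vf @ np.maximum(Wf @ u / np.sqrt(m), 0.0)
    g = Vg @ np.maximum(Wg @ y / np.sqrt(m), 0.0)
    return f @ g

def grad_G(params, u, y):
    """Flattened gradient of G w.r.t. (Wf, Vf, Wg, Vg), in that order."""
    Wf, Vf, Wg, Vg = params
    m = Wf.shape[0]
    pre_f, pre_g = Wf @ u / np.sqrt(m), Wg @ y / np.sqrt(m)
    hf, hg = np.maximum(pre_f, 0.0), np.maximum(pre_g, 0.0)
    f, g = Vf @ hf, Vg @ hg
    dWf = np.outer((Vf.T @ g) * (pre_f > 0), u) / np.sqrt(m)  # sum_k g_k * grad f_k
    dVf = np.outer(g, hf)
    dWg = np.outer((Vg.T @ f) * (pre_g > 0), y) / np.sqrt(m)  # sum_k f_k * grad g_k
    dVg = np.outer(f, hg)
    return np.concatenate([a.ravel() for a in (dWf, dVf, dWg, dVg)])

def empirical_ntk(params, U, Y):
    """q x q Gram matrix of gradients over all (input function, output location) pairs."""
    rows = [grad_G(params, u, y) for u, ys in zip(U, Y) for y in ys]
    J = np.stack(rows)
    return J @ J.T
```

A typical use is to call `np.linalg.eigvalsh(empirical_ntk(params, U, Y)).min()` for increasing widths `m` and observe how the smallest eigenvalue behaves at initialization; the matrix is positive semidefinite by construction, and strict positivity depends on the data and the width.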

6. EXPERIMENTS

We now turn to an empirical evaluation of the effect of over-parameterization on the training performance of DeepONets, as measured by the empirical risk over a mini-batch $B$ of the training dataset (Remark 3.2), for three canonical operator-learning problems. The results for smooth activations empirically support the analysis in Section 4, whereas the ones for ReLU activations support the analysis in Section 5. In all the examples described below, we consider FNN architectures for both branch and trunk nets which are similar to the ones chosen in (Lu et al., 2021). For definiteness, we set the width in each layer of the branch and trunk net to be the same (i.e. $m_f = m_g = m$) and then increase it uniformly from $m = 10$ to $m = 500$. We monitor the training process over 80,000 training epochs and report the resulting average loss over each mini-batch of size $n_B$,

$$L_{D_B} := \frac{1}{n_B}\sum_{i=1}^{n_B}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell\Big(G_\theta(u^{(i)})(y^{(i)}_j) - G^\dagger(u^{(i)})(y^{(i)}_j)\Big), \tag{27}$$

where $n_B$ denotes the number of input training functions $u^{(i)}$ in the batch $B$. We refer the reader to Section A.5 in the Appendix for specific details of the setup. Figure 2 shows the training loss (27) as a function of the epochs for DeepONets with smooth activations and Figure 3 shows the same for ReLU activations.

Remark 6.1. We store the mini-batch training loss at every 100-th training epoch and observe that the training loss measured over the mini-batch is lower for wider DeepONets. This observation is consistent for both smooth (SELU) and non-smooth (ReLU) activations.

Remark 6.2 (Antiderivative Operator). The Antiderivative operator is a linear operator and hence is learned very accurately, especially by wider DeepONets ($L_{D_B} \sim 10^{-12}$ at the end of training).

Remark 6.3 (Diffusion-Reaction). The Diffusion-Reaction equation also demonstrates lower loss with increasing width, albeit less markedly than the antiderivative operator. This can be attributed in part to the fact that the operator is inherently nonlinear.
Remark 6.4 (Burgers' equation). The operator corresponding to Burgers' equation is more intricate, with the added periodicity constraints on the solution (see Section A.5.3 in the Appendix for details on the problem). We note the distinction from the operator learning problem for Burgers' equation studied in (Li et al., 2021a), where the operator only sought to learn the mapping from the input (the initial condition at $t = 0$) to the final output at $t = 1$ and not over the entire solution domain $(x, t) \in [0, 1] \times [0, 1]$.
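For the antiderivative problem of Remark 6.2, training tuples of the form described in Section 3.2 can be generated in closed form. The sketch below is illustrative only: the family of input functions $u(x) = a\sin(\pi x) + b\cos(\pi x) + c$ with Gaussian coefficients, the domain $[0, 1]$, and the function name are assumptions made for the example, and the actual experimental setup is the one described in Section A.5 of the Appendix.

```python
import numpy as np

def antiderivative_dataset(n, R, q, seed=0):
    """Hypothetical generator: n input functions, R sensors, q output locations each."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, R)                 # sensor locations x_r
    coeffs = rng.normal(size=(n, 3))             # (a, b, c) per input function
    a, b, c = coeffs[:, [0]], coeffs[:, [1]], coeffs[:, [2]]
    # Branch inputs: u(x_r) = a sin(pi x_r) + b cos(pi x_r) + c
    U = a * np.sin(np.pi * x) + b * np.cos(np.pi * x) + c
    # Trunk inputs: random output locations y in [0, 1]
    Y = rng.uniform(0.0, 1.0, size=(n, q))
    # Targets: exact antiderivative int_0^y u(xi) d xi
    targets = a * (1.0 - np.cos(np.pi * Y)) / np.pi + b * np.sin(np.pi * Y) / np.pi + c * Y
    return x, U, Y, targets, coeffs
```

Because the targets are exact, any residual training loss in such a setup reflects optimization and approximation error rather than label noise.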

7. DISCUSSION AND CONCLUSION

We present two novel optimization analyses for DeepONets (Lu et al., 2021) based on over-parameterization and establish convergence guarantees for DeepONet models with smooth and ReLU activations. The analysis for smooth activations is built on the restricted strong convexity of the DeepONet loss, whereas the one for ReLU activations is based on the positive definiteness of the NTK at initialization. To the best of our knowledge, this is the first work to mathematically and empirically show the benefits of over-parameterization for the learning of DeepONets.

A APPENDIX

A.1 LEARNING OPERATORS

Here we briefly outline the notion of learning for neural operators (Li et al., 2021a; 2020; Lu et al., 2021). The standard operator learning problem seeks to approximate a possibly nonlinear operator $G^\dagger : U \to V$ by a parametric operator $G_\theta : U \to V$, $\theta \in \Theta$, that depends on the learnable parameters $\theta$. The goal is to learn an optimal set of parameters $\theta^\dagger$ such that $G_{\theta^\dagger} \approx G^\dagger$. Given observations $\{u^{(j)}\}_{j=1}^n \subset U$ and $\{G^\dagger(u^{(j)})\}_{j=1}^n \subset V$, where $u^{(j)} \sim \mu$ is an i.i.d. sequence from the probability measure $\mu$ supported on $U$ and $G^\dagger(u^{(j)})$ is possibly corrupted with noise, the objective is to find $\theta^\dagger$ as the solution of the minimization problem

$$\theta^\dagger = \arg\min_{\theta \in \Theta} \mathbb{E}_{u \sim \mu}\big[ C\big(G_\theta(u), G^\dagger(u)\big) \big],$$

where $U$ and $V$ are separable Banach spaces and $C$ is a suitable cost functional. This is analogous to the notion of learning in finite dimensions, which is precisely the setup classical deep learning is used for.

A.2 DEEPONET ARCHITECTURE

[Figure 4 (caption partially recovered): DeepONet architecture (Lu et al., 2021) used in this study. Note that the input functions need not be sampled on a structured grid of points in general.]

A.3 OPTIMIZATION GUARANTEES FOR DEEPONETS: SMOOTH ACTIVATIONS

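Theorem 4.2 (RSC of the loss). Under Assumptions 1, 2 and 3, for all $\theta' \in B^t_\kappa := Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0)$, with high probability we have

$$L(\theta') \ge L(\theta_t) + \langle \theta' - \theta_t, \nabla_\theta L(\theta_t) \rangle + \frac{\alpha_t}{2} \|\theta' - \theta_t\|_2^2, \qquad \text{where } \alpha_t = c_1 \|\nabla_\theta \bar G_t\|_2^2 - \frac{c_2}{\sqrt{m}},$$

where $\bar G_t = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} G_{\theta_t}(u^{(i)})(y^{(i)}_j)$ and $c_1, c_2 > 0$ are constants. Thus, the loss $L(\theta)$ satisfies RSC w.r.t. $(B^t_\kappa, \theta_t)$ whenever $\alpha_t > 0$.

Proof. From the Taylor expansion in (15), to establish (19) it suffices to focus on the second order term and, for $\theta' \in B^t_\kappa$, show

$$(\theta' - \theta_t)^\top H(\tilde\theta)(\theta' - \theta_t) \ge \alpha_t \|\theta' - \theta_t\|_2^2.$$

Given the $2 \times 2$ block structure of the Hessian as in (8), and denoting $\delta\theta := \theta' - \theta_t$ for compactness, we note that

$$\delta\theta^\top H(\tilde\theta)\,\delta\theta = \underbrace{\delta\theta_f^\top H_{ff}\, \delta\theta_f}_{T_1} + \underbrace{2\,\delta\theta_f^\top H_{fg}\, \delta\theta_g}_{T_2} + \underbrace{\delta\theta_g^\top H_{gg}\, \delta\theta_g}_{T_3}.$$

Focusing on $T_1$ and using the exact form of $H_{ff}$ as in (11), we have

$$T_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \Big\langle \delta\theta_f, \sum_{k=1}^{K} g^{(i)}_{k,j} \nabla_{\theta_f} f^{(i)}_k \Big\rangle^2 + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} g^{(i)}_{k,j}\, \delta\theta_f^\top \nabla^2_{\theta_f} f^{(i)}_k\, \delta\theta_f$$

$$\overset{(a)}{\ge} \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \delta\theta_f, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle^2 - \frac{\lambda c_0 c^{(f)}}{\sqrt{m_f}} \|\delta\theta_f\|_2^2,$$

where (a) follows from Assumption 2 and Lemma 4.1. The analysis for $T_3$ is similar, and we get

$$T_3 \ge \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \delta\theta_g, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle^2 - \frac{\lambda c_0 c^{(g)}}{\sqrt{m_g}} \|\delta\theta_g\|_2^2.$$

Focusing on $T_2$ and using the exact forms in terms of $H^{(1)}_{fg}$ and $H^{(2)}_{fg}$ as in (11), we have

$$\frac{1}{2} T_2 = \delta\theta_f^\top \Bigg[ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j} \sum_{k=1}^{K} \nabla_{\theta_f} f^{(i)}_k \nabla_{\theta_g} g^{(i)\top}_{k,j} \Bigg] \delta\theta_g + \delta\theta_f^\top \Bigg[ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \sum_{k=1}^{K} g^{(i)}_{k,j} \nabla_{\theta_f} f^{(i)}_k \sum_{k'=1}^{K} f^{(i)}_{k'} \nabla_{\theta_g} g^{(i)\top}_{k',j} \Bigg] \delta\theta_g$$

$$\overset{(a)}{=} \delta\theta_f^\top \Bigg[ \sum_{h=1}^{\bar q} \sigma_h a_h b_h^\top \Bigg] \delta\theta_g + \delta\theta_f^\top \Bigg[ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j}\, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j)\, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j)^\top \Bigg] \delta\theta_g$$

$$= \sum_{h=1}^{\bar q} \sigma_h \langle \delta\theta_f, a_h \rangle \langle \delta\theta_g, b_h \rangle + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \delta\theta_f, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle \big\langle \delta\theta_g, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle$$

$$\overset{(b)}{\ge} \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \delta\theta_f, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle \big\langle \delta\theta_g, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle,$$

where (a) follows from the SVD as in (16), and (b) follows since, by Definition 3, $\sum_{h=1}^{\bar q} \sigma_h \langle \delta\theta_f, a_h \rangle \langle \delta\theta_g, b_h \rangle \ge 0$. Combining (31), (33), (32), using $m = m_g = m_f$ and $c_1 = \max(c^{(f)}, c^{(g)})$, we have

$$\delta\theta^\top H(\tilde\theta)\,\delta\theta \ge \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \Big( \big\langle \delta\theta_f, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle + \big\langle \delta\theta_g, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle \Big)^2 - \frac{\lambda c_0 c_1}{\sqrt{m}} \|\delta\theta\|_2^2$$

$$\overset{(a)}{\ge} a\, \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \Big( \big\langle \delta\theta_f, \nabla_{\theta_f} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle + \big\langle \delta\theta_g, \nabla_{\theta_g} G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle \Big)^2 - \frac{\lambda c_0 c_1}{\sqrt{m}} \|\delta\theta\|_2^2$$

$$\overset{(b)}{\ge} a \Big( \big\langle \delta\theta_f, \nabla_{\theta_f} \bar G_\theta \big\rangle + \big\langle \delta\theta_g, \nabla_{\theta_g} \bar G_\theta \big\rangle \Big)^2 - \frac{\lambda c_0 c_1}{\sqrt{m}} \|\delta\theta\|_2^2 = a \big\langle \delta\theta, \nabla_\theta \bar G_\theta \big\rangle^2 - \frac{\lambda c_0 c_1}{\sqrt{m}} \|\delta\theta\|_2^2$$

$$\overset{(c)}{\ge} a \kappa^2 \|\nabla_\theta \bar G_\theta\|_2^2 \|\delta\theta\|_2^2 - \frac{\lambda c_0 c_1}{\sqrt{m}} \|\delta\theta\|_2^2 = \alpha_t \|\delta\theta\|_2^2,$$

where (a) follows from Assumption 2, (b) follows from Jensen's inequality with $\bar G_\theta = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} G_\theta(u^{(i)})(y^{(i)}_j)$ as in Definition 3, (c) follows from the cosine condition in Definition 3, and $\alpha_t = a\kappa^2 \|\nabla_\theta \bar G_\theta\|_2^2 - \frac{\lambda c_0 c_1}{\sqrt{m}}$. That completes the proof.

Proposition 1 ($Q^t_\kappa$ is non-empty). For over-parameterized branch and trunk nets with $p_f, p_g > qK$, the restricted set $Q^t_\kappa$ is non-empty.

Proof. We simply construct a $\theta' = [\theta_f'^\top\ \theta_g'^\top]^\top \in Q^t_\kappa$ along with the value of $\kappa$.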
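Without loss of generality, we make $\theta_t$ the origin of the coordinate system and work with the unit vector $\bar g = [\bar g_f^\top\ \bar g_g^\top]^\top = \nabla_\theta \bar G_{\theta_t} / \|\nabla_\theta \bar G_{\theta_t}\|_2$. Further, we assume $\theta'$ also to be a unit vector. Then, our problem reduces to feasibility of the following system of two quadratic inequalities over $\theta'_f \in \mathbb{R}^{p_f}$, $\theta'_g \in \mathbb{R}^{p_g}$:

$$\theta_f'^{\top} \Big( \sum_{h=1}^{q} \sigma_h a_h b_h^\top \Big) \theta'_g \ge 0, \qquad \big( \langle \theta'_f, \bar g_f\rangle + \langle \theta'_g, \bar g_g\rangle \big)^2 \ge \kappa^2 ,$$

where $\sigma_h > 0$, $a_h \in \mathbb{R}^{p_f}$ are orthogonal unit vectors, $b_h \in \mathbb{R}^{p_g}$ are orthogonal unit vectors, $\bar g = [\bar g_f^\top\ \bar g_g^\top]^\top \in \mathbb{R}^{p_f + p_g}$ and $\theta' = [\theta'_f; \theta'_g] \in \mathbb{R}^{p_f + p_g}$ are unit vectors, and we can choose $\kappa \in (0, 1]$. Without loss of generality, assume $\|\bar g_f\|_2 \ge \|\bar g_g\|_2$. Then, set $\theta'_g = 0$, so that the first inequality holds trivially and the feasibility condition reduces to $\langle \theta'_f, \bar g_f\rangle^2 \ge \kappa^2$ for some suitably chosen $\kappa \in (0, 1]$. Finally, set $\theta'_f = \bar g_f / \|\bar g_f\|_2$, so that

$$\langle \theta'_f, \bar g_f\rangle^2 = \Big\langle \frac{\bar g_f}{\|\bar g_f\|_2}, \bar g_f \Big\rangle^2 = \|\bar g_f\|_2^2 =: \kappa^2 ,$$

so that $\kappa \in (0, 1]$ (in fact, $\kappa \ge 1/\sqrt{2}$, since $\|\bar g\|_2 = 1$ and $\|\bar g_f\|_2 \ge \|\bar g_g\|_2$) as desired. That completes the proof.

Theorem 4.3 (Smoothness of Loss). Under Assumptions 1, 2 and 3, with high probability, for $\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)$, $L(\theta)$ is $\beta$-smooth with $\beta = b\varrho^2 + \frac{c\sqrt{\lambda}}{\sqrt{m}}$, where $c = \max(c^{(f)}, c^{(g)})$ and $\varrho = \max(\varrho^{(f)}, \varrho^{(g)})$, with $c^{(f)}, c^{(g)}, \varrho^{(f)}, \varrho^{(g)}$ as in Lemma 4.1.

Proof. By the second-order Taylor expansion about $\bar\theta$, we have

$$L(\theta') = L(\bar\theta) + \langle \theta' - \bar\theta, \nabla_\theta L(\bar\theta)\rangle + \frac{1}{2} (\theta' - \bar\theta)^\top \frac{\partial^2 L(\tilde\theta)}{\partial\theta^2} (\theta' - \bar\theta),$$

where $\tilde\theta = \xi\theta' + (1 - \xi)\bar\theta$ for some $\xi \in [0, 1]$. Then,

$$\begin{aligned}
(\theta' - \bar\theta)^\top \frac{\partial^2 L(\tilde\theta)}{\partial\theta^2} (\theta' - \bar\theta)
&= (\theta' - \bar\theta)^\top \Big[ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \Big( \ell''_{i,j}\, \nabla G_\theta(u^{(i)})(y^{(i)}_j)\, \nabla G_\theta(u^{(i)})(y^{(i)}_j)^\top + \ell'_{i,j}\, \nabla^2 G_\theta(u^{(i)})(y^{(i)}_j) \Big) \Big] (\theta' - \bar\theta) \\
&= \underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \theta' - \bar\theta, \nabla G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle^2}_{I_1} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j}\, (\theta' - \bar\theta)^\top \nabla^2 G_\theta(u^{(i)})(y^{(i)}_j)\, (\theta' - \bar\theta)}_{I_2} .
\end{aligned}$$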
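Now, note that

$$I_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell''_{i,j} \big\langle \theta' - \bar\theta, \nabla G_\theta(u^{(i)})(y^{(i)}_j) \big\rangle^2 \overset{(a)}{\le} \frac{b}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \big\| \nabla G_\theta(u^{(i)})(y^{(i)}_j) \big\|_2^2\, \|\theta' - \bar\theta\|_2^2 \overset{(b)}{\le} b\varrho^2 \|\theta' - \bar\theta\|_2^2 ,$$

where (a) follows by the Cauchy-Schwarz inequality and (b) from Lemma 4.1. For $I_2$, with $Q_{t,(i,j)} = (\theta' - \bar\theta)^\top \nabla^2 G_\theta(u^{(i)})(y^{(i)}_j)\, (\theta' - \bar\theta)$, we have

$$|Q_{t,(i,j)}| \le \|\theta' - \bar\theta\|_2^2\, \big\| \nabla^2 G_\theta(u^{(i)})(y^{(i)}_j) \big\|_2 \le \frac{c\,\|\theta' - \bar\theta\|_2^2}{\sqrt{m}} .$$

Then, we have

$$I_2 = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} \ell'_{i,j}\, (\theta' - \bar\theta)^\top \nabla^2 G_\theta(u^{(i)})(y^{(i)}_j)\, (\theta' - \bar\theta) \overset{(a)}{\le} \lambda \Big( \frac{1}{n}\sum_{i=1}^{n}\frac{1}{q_i}\sum_{j=1}^{q_i} Q_{t,(i,j)}^2 \Big)^{1/2} \le \frac{\lambda c\,\|\theta' - \bar\theta\|_2^2}{\sqrt{m}} ,$$

where (a) follows by Cauchy-Schwarz. Putting the upper bounds on $I_1$ and $I_2$ back, we have

$$(\theta' - \bar\theta)^\top \frac{\partial^2 L(\tilde\theta)}{\partial\theta^2} (\theta' - \bar\theta) \le \Big( b\varrho^2 + \frac{c\sqrt{\lambda}}{\sqrt{m}} \Big) \|\theta' - \bar\theta\|_2^2 .$$

This completes the proof.

Lemma 4.4 (RSC $\Rightarrow$ Restricted PL). The RSC and smoothness of the loss together imply a form of the Polyak-Łojasiewicz (PL) condition w.r.t. the tuple $(B_t, \theta_t)$, unlike standard PL, which holds without restrictions (Karimi et al., 2016).

Proof. Define

$$\hat L_{\theta_t}(\theta) := L(\theta_t) + \langle \theta - \theta_t, \nabla_\theta L(\theta_t)\rangle + \frac{\alpha_t}{2}\|\theta - \theta_t\|_2^2 .$$

By Theorem 4.2, for all $\theta' \in B_t$ we have $L(\theta') \ge \hat L_{\theta_t}(\theta')$. Further, note that $\hat L_{\theta_t}(\theta)$ is minimized at $\hat\theta_{t+1} := \theta_t - \nabla_\theta L(\theta_t)/\alpha_t$, and the minimum value is

$$\inf_\theta \hat L_{\theta_t}(\theta) = \hat L_{\theta_t}(\hat\theta_{t+1}) = L(\theta_t) - \frac{1}{2\alpha_t}\|\nabla_\theta L(\theta_t)\|_2^2 .$$

Then, we have

$$\inf_{\theta \in B_t} L(\theta) \overset{(a)}{\ge} \inf_{\theta \in B_t} \hat L_{\theta_t}(\theta) \ge \inf_\theta \hat L_{\theta_t}(\theta) = L(\theta_t) - \frac{1}{2\alpha_t}\|\nabla_\theta L(\theta_t)\|_2^2 ,$$

where (a) follows from (34). Rearranging terms completes the proof.

Lemma 1 (Local Loss Reduction in $B_t$). Let $\alpha_t$ and $\beta$ be as in Theorems 4.2 and 4.3 respectively, and $B_t := Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0) \cap B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$. Under Assumptions 1, 2 and 3, for gradient descent with step size $\eta_t = \frac{\omega_t}{\beta}$, $\omega_t \in (0, 2)$, and for any $\bar\theta_{t+1} \in \operatorname{arginf}_{\theta \in B_t} L(\theta)$, we have with high probability

$$L(\theta_{t+1}) - L(\bar\theta_{t+1}) \le \Big( 1 - \frac{\alpha_t \omega_t}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) .$$

Proof.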
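Since $L$ is $\beta$-smooth by Theorem 4.3, we have

$$\begin{aligned}
L(\theta_{t+1}) &\le L(\theta_t) + \langle \theta_{t+1} - \theta_t, \nabla_\theta L(\theta_t)\rangle + \frac{\beta}{2}\|\theta_{t+1} - \theta_t\|_2^2 \\
&= L(\theta_t) - \eta_t \|\nabla_\theta L(\theta_t)\|_2^2 + \frac{\beta\eta_t^2}{2}\|\nabla_\theta L(\theta_t)\|_2^2 \\
&= L(\theta_t) - \eta_t \Big( 1 - \frac{\beta\eta_t}{2} \Big) \|\nabla_\theta L(\theta_t)\|_2^2 . \qquad (36)
\end{aligned}$$

Since $\bar\theta_{t+1} \in \operatorname{arginf}_{\theta \in B_t} L(\theta)$ and $\alpha_t > 0$ by assumption, from Lemma 4.4 we obtain

$$-\|\nabla_\theta L(\theta_t)\|_2^2 \le -2\alpha_t \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) .$$

Hence

$$\begin{aligned}
L(\theta_{t+1}) - L(\bar\theta_{t+1}) &\le L(\theta_t) - L(\bar\theta_{t+1}) - \eta_t \Big( 1 - \frac{\beta\eta_t}{2} \Big) \|\nabla_\theta L(\theta_t)\|_2^2 \\
&\overset{(a)}{\le} L(\theta_t) - L(\bar\theta_{t+1}) - \eta_t \Big( 1 - \frac{\beta\eta_t}{2} \Big)\, 2\alpha_t \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) \\
&= \Big( 1 - 2\alpha_t\eta_t \Big( 1 - \frac{\beta\eta_t}{2} \Big) \Big) \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) ,
\end{aligned}$$

where (a) follows for any $\eta_t \le \frac{2}{\beta}$, because this implies $1 - \frac{\beta\eta_t}{2} \ge 0$. Choosing $\eta_t = \frac{\omega_t}{\beta}$ with $\omega_t \in (0, 2)$ gives

$$L(\theta_{t+1}) - L(\bar\theta_{t+1}) \le \Big( 1 - \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) .$$

This completes the proof.

Theorem 4.5 (Global Loss Reduction). Consider the same conditions as in Theorem 4.3, with $\alpha_t > 0$ for $t \in [T]$, and the gradient descent update $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta L(\theta_t)$ with $\eta_t = \frac{\omega_t}{\beta}$ for some $\omega_t \in (0, 2)$, where $\beta$ is defined as in Theorem 4.3. Then, with high probability, for all $\theta^* \in \operatorname{arginf}_{\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)} L(\theta)$, with $0 \le \gamma_t := \frac{L(\bar\theta_{t+1}) - L(\theta^*)}{L(\theta_t) - L(\theta^*)} < 1$ and $\bar\theta_{t+1} \in \operatorname{arginf}_{\theta \in Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0)} L(\theta)$, we have

$$L(\theta_{t+1}) - L(\theta^*) \le \Big( 1 - \frac{\alpha_t\omega_t(1 - \gamma_t)}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\theta^*) \big) .$$

Proof. We start by showing that $\gamma_t = \frac{L(\bar\theta_{t+1}) - L(\theta^*)}{L(\theta_t) - L(\theta^*)}$ satisfies $0 \le \gamma_t < 1$. Since $\theta^* \in \operatorname{arginf}_{\theta \in B^{\mathrm{Euc}}_\rho(\theta_0)} L(\theta)$, $\bar\theta_{t+1} \in \operatorname{arginf}_{\theta \in B_t} L(\theta)$, and $\theta_{t+1} \in Q^t_\kappa \cap B^{\mathrm{Euc}}_\rho(\theta_0)$ by the definition of gradient descent, we have

$$L(\theta^*) \le L(\bar\theta_{t+1}) \le L(\theta_{t+1}) \overset{(a)}{\le} L(\theta_t) - \frac{\omega_t(2 - \omega_t)}{2\beta}\|\nabla_\theta L(\theta_t)\|_2^2 < L(\theta_t) ,$$

where (a) follows from (36) with $\eta_t = \frac{\omega_t}{\beta}$. Since $L(\bar\theta_{t+1}) \ge L(\theta^*)$ and $L(\theta_t) > L(\theta^*)$, we have $\gamma_t \ge 0$. Further, since $L(\bar\theta_{t+1}) < L(\theta_t)$, we have $\gamma_t < 1$.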
Now, with $\omega_t \in (0, 2)$, we have

$$\begin{aligned}
L(\theta_{t+1}) - L(\theta^*) &= L(\theta_{t+1}) - L(\bar\theta_{t+1}) + L(\bar\theta_{t+1}) - L(\theta^*) \\
&\le \Big( 1 - \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\bar\theta_{t+1}) \big) + \big( L(\bar\theta_{t+1}) - L(\theta^*) \big) \\
&= \Big( 1 - \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\theta^*) \big) + \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t) \big( L(\bar\theta_{t+1}) - L(\theta^*) \big) \\
&= \Big( 1 - \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t) \Big) \big( L(\theta_t) - L(\theta^*) \big) + \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t)\,\gamma_t \big( L(\theta_t) - L(\theta^*) \big) \\
&= \Big( 1 - \frac{\alpha_t\omega_t}{\beta}(1 - \gamma_t)(2 - \omega_t) \Big) \big( L(\theta_t) - L(\theta^*) \big) ,
\end{aligned}$$

where the inequality uses Lemma 1. That completes the proof.

Theorem 5.1 ($K(\theta)$ is positive definite at initialization). Given standard initialization for the branch and trunk nets, and granted that the individual branch and trunk net NTKs are positive definite (24)-(25), the NTK of the DeepONet is positive definite at initialization, i.e., $\alpha^\top \mathbb{E}[K(\theta)]\,\alpha \ge c > 0$, where $\alpha$ denotes an arbitrary block unit vector with $n$ blocks, $\alpha_{i,j}$ corresponds to the $j$-th entry in the $i$-th block, and $c$ denotes a positive constant.

Proof. The proof follows as a direct consequence of Proposition 2 together with Propositions 3 and 4.

Proposition 2. With the branch and trunk net weights initialized using standard initialization techniques and the last layer of the branch net initialized according to Assumption 3, we have

$$\mathbb{E}\big[ T^{(3)}_{k,k'} \,\big|\, \theta_{\bar f}, \theta_{\bar g} \big] = 0, \qquad \mathbb{E}\big[ T^{(4)}_{k,k'} \,\big|\, \theta_{\bar f}, \theta_{\bar g} \big] = 0, \qquad k, k' \in [K],$$

where $T^{(3)}_{k,k'}$ is defined in (39).

Proof. As noted in Assumption 3, the last layer weights for the branch and trunk nets are initialized as zero-mean Gaussians, i.e., $w_{k,h'} \sim \mathcal{N}(0, \frac{1}{mK})$ for $k \in [K]$, $h \in [m_f]$, $h' \in [m_g]$, similar to the other layers. Now, $T^{(3)}_{k,k'} = g_k(\theta_g; y^{(i)}_j)\, g_{k'}(\theta_g; y^{(i')}_{j'}) = \sum_{h'=1}^{m_g} \ldots$

A.5.2 DIFFUSION-REACTION PDE

In this example we learn the operator mapping the input forcing function $u(x)$ to the output $v(x, t)$ for the nonlinear Diffusion-Reaction PDE given by

$$\frac{\partial v}{\partial t} = D\,\frac{\partial^2 v}{\partial x^2} + k v^2 + u(x), \quad (x, t) \in (0, 1] \times (0, 1], \qquad \text{s.t.} \quad v(x, 0) = 0, \quad v(0, t) = 0, \quad v(1, t) = 0, \qquad (47)$$

where $D = 0.01$ and $k = 0.01$ are constants denoting the diffusivity and reaction rate respectively. Note that in this case we are learning the operator $v(x, t) = G_\theta(u)(x, t)$. For each sampled input, the PDE is solved using a backward finite-difference solver on a grid $(x, t)$ of size $(150 \times 120)$. For training, the number of input sensors is fixed at $m = 120$. The number of input samples ($n$) is chosen to be 5000 and $n_B = 10000$.
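As a rough illustration of how a single solve of (47) might be carried out, the sketch below uses backward (implicit) Euler for the diffusion term and treats the reaction and forcing terms explicitly. This splitting, the small default grid, and the helper names `solve_tridiagonal` and `solve_diffusion_reaction` are assumptions made for illustration only; the text above states only that a backward finite-difference solver is used.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (all inputs are lists of equal length)."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def solve_diffusion_reaction(u, D=0.01, k=0.01, nx=51, nt=100):
    """Return (grid, v(., t=1)) for v_t = D v_xx + k v^2 + u(x) with zero initial/boundary data.

    Assumed scheme: implicit in diffusion, explicit in reaction and forcing.
    """
    dx, dt = 1.0 / (nx - 1), 1.0 / nt
    xs = [i * dx for i in range(nx)]
    r = D * dt / dx ** 2
    n_int = nx - 2                       # number of interior grid points
    lower = [-r] * n_int
    diag = [1.0 + 2.0 * r] * n_int
    upper = [-r] * n_int
    v = [0.0] * nx                       # initial condition v(x, 0) = 0
    for _ in range(nt):
        rhs = [v[i] + dt * (k * v[i] ** 2 + u(xs[i])) for i in range(1, nx - 1)]
        v = [0.0] + solve_tridiagonal(lower, diag, upper, rhs) + [0.0]  # v(0,t) = v(1,t) = 0
    return xs, v

# Example forcing (hypothetical): u(x) = sin(pi x)
xs, v_final = solve_diffusion_reaction(lambda x: math.sin(math.pi * x))
```

With $k = 0$ and $u(x) = \sin(\pi x)$ the PDE has the closed-form solution $v(x, t) = \sin(\pi x)\,(1 - e^{-D\pi^2 t})/(D\pi^2)$, which gives a convenient sanity check for any such solver.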

A.5.3 BURGERS' EQUATION

Finally, we look at the Burgers' equation benchmark, similar to the one investigated in (Li et al., 2021a), with the distinction that we learn a mapping from the initial condition $v(x, 0) = u(x)$ to the solution $v(x, t)$ for $(x, t) \in [0, 1] \times (0, 1]$:

$$\begin{aligned}
\frac{\partial v}{\partial t} + v\,\frac{\partial v}{\partial x} - \nu\,\frac{\partial^2 v}{\partial x^2} &= 0, && (x, t) \in (0, 1) \times (0, 1], \\
v(x, 0) &= u(x), && x \in (0, 1), \\
v(1, t) &= v(0, t), && t \in (0, 1).
\end{aligned} \qquad (48)$$

We generate the training data using the stiff PDE integrator chebfun (Driscoll et al., 2014) on a grid of resolution $(501, 501)$, with $p = 200$ training points sampled randomly on the solution grid.



The original DeepONet paper puts forth the above model and another one with a bias term added to the inner product. For definiteness, we restrict our attention to the model without bias.

The Hessian-based analysis presented in Section 4 constitutes a sufficient condition and is not directly applicable to the case of ReLU activations. In contrast, the NTK analysis presented here applies to both smooth and ReLU activations.



Figure 1: Benefits of over-parameterization on learning of DeepONets for the Antiderivative Operator: G θ (u)(x) =

Remark 3.1 (DeepONet Training Tuple). As mentioned earlier, each DeepONet training datum comprises the tuple D

Figure 2: Training progress of DeepONet with a smooth activation function, SELU (Klambauer et al., 2017), for (a) the Antiderivative Operator, (b) the Diffusion-Reaction Equation and (c) Burgers' Equation. We plot the training loss (27) as a function of the epochs, with the y-axis on a log scale to clearly distinguish the effect of increasing width ($m$). Increasing the width $m$ of the branch and trunk nets leads to lower losses for all the problems.

Figure 4: A schematic of the unstacked DeepONet architecture (Lu et al., 2021) used in this study. Note that the input functions need not be sampled on a structured grid of points in general.

k ′ , h′ ḡh ′ θ ḡ ; y k ′ , h′ ]ḡ h ′ θ ḡ ; y k ′ , h′ ]ḡ h ′ θ ḡ ; y ′can be suitably bounded close to zero, we have for any arbitrary block unit vector αα T K(θ)α ≥ λ 0,f K k=1 α ⊙ g k (θ g ; y ⊙ 1 p ⊗ f k (θ f ; u (•) 2 2 .

Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepOnets. arXiv:2103.10974 [cs, math, stat], March 2021. URL http://arxiv.org/abs/2103.10974.

A.4 OPTIMIZATION GUARANTEES FOR DEEPONETS: RELU ACTIVATIONS

Recall the DeepONet predictor (1). In the analysis, it is useful to distinguish between the parameters up to the pre-final layer and those of the final layer, i.e., $\dim(\theta) = M_f + M_g + (m_f + m_g)K$, where $M_f$ and $M_g$ denote the number of parameters in the branch and trunk nets up to the pre-final layer respectively, and $m_f \cdot K$ and $m_g \cdot K$ are the numbers of weights in the last layers of the branch and trunk nets respectively. In essence, we have $\dim(\theta_f) = M_f + m_f \cdot K$ and $\dim(\theta_g) = M_g + m_g \cdot K$.

We note that it is sufficient to show positive definiteness of the above NTK at initialization. Once that has been established, standard approaches (Jacot et al., 2018; Du et al., 2019; Arora et al., 2019b; a; Allen-Zhu et al., 2019) allow one to show the geometric convergence of (S)GD.

In the sequel it proves useful to rewrite the branch and trunk net outputs $f_k$ and $g_k$ through their linear last layers, whose weights are denoted by $w$. Similarly, $\theta_{\bar f}$ and $\theta_{\bar g}$ are the parameters leading up to the pre-final layer in the branch net, with $m_f$ outputs $[\bar f_h, h \in [m_f]]$, and in the trunk net, with $m_g$ outputs $[\bar g_{h'}, h' \in [m_g]]$, respectively. We denote by $\theta_{f,k}$ all the parameters corresponding to $f_k$ and, similarly, by $\theta_{g,k}$ all the parameters corresponding to $g_k$. We can explicitly write the NTK for the DeepONet model, specifically the entry corresponding to the inputs $\{u^{(i)}, y^{(i)}_j\}$ and $\{u^{(i')}, y^{(i')}_{j'}\}$.

Proof. Focusing on a quadratic form of $K(\theta)$, and ignoring the $T^{(3)}_{k,k'}, T^{(4)}_{k,k'}$ terms, for any arbitrary block unit vector $\alpha$ we obtain a bound in which $I_q$ denotes the $q$-dimensional identity matrix, since $\alpha_{i,j}\, g_k(\theta_g; y^{(i)}_j)$ varies with $j$ whereas the kernel $K_f$ does not; $\mathbb{1}_q$ is the $q$-dimensional all-ones vector, since for the $\alpha_{i,j}\, f_k(\theta_f; u^{(i)})$ terms, for a fixed $i$, $\alpha_{i,j}$ differs with $j$ but $f_k(\theta_f; u^{(i)})$ stays the same; $\lambda_{\min}(K_{f,k}) \ge \lambda_{0,f}$; and $\lambda_{\min}(K_{g,k}) \ge \lambda_{0,g}$.
This completes the proof.

Note that we require $\alpha$ to be a block unit vector, with $\alpha_{i,j}$ denoting the $j$-th position in the $i$-th block, for convenience in dealing with the quadratic form above. Given that the $T^{(3)}_{k,k'}, T^{(4)}_{k,k'}$ terms are close to 0 in expectation, as shown in Proposition 2, we now need to show that the NTK $K(\theta)$ is positive definite.

Proposition 4. Given that the terms $T^{(3)}_{k,k'}, T^{(4)}_{k,k'}$ can be suitably bounded close to zero by Proposition 2, we have, for any arbitrary block unit vector $\alpha$ and some $k \in [K]$, … where $\mathbb{1}_q$ denotes the $q$-dimensional vector of all entries equal to one.

Proof. Now, for the first term, for some $k \in [K]$, making use of (38), we have … Since the trunk net is a ReLU network, following the argument in Lemma 7.1 in (Allen-Zhu et al., 2019), with probability at least $1 - O(q e^{-\Omega(m/4L)})$, for all $(i, j)$ we have $\bar g_{\bullet}(\theta_{\bar g}; y$ … Recalling that $w$ … Hence, we have … and, with a similar argument, … As a result, we have … The high-probability version of the result can be obtained by applying Hoeffding (for cross terms) and Bernstein (for square terms) bounds on $w$. That completes the analysis.
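The positive-definiteness claim can be probed numerically on a toy model. The sketch below is an illustration under several assumptions that are not taken from the paper: one-hidden-layer ReLU branch and trunk nets with biases, standard Gaussian initialization with a $1/\sqrt{m}$ feature scaling, and a Gram matrix built only from the gradients with respect to the two last-layer weight matrices. Since the full empirical NTK adds the (positive semi-definite) Gram contribution of the remaining parameters, positive definiteness of this restricted Gram matrix implies positive definiteness of the full empirical NTK at that initialization. All function names are hypothetical.

```python
import math
import random

def relu(z):
    return z if z > 0.0 else 0.0

def init_net(in_dim, m, K, rng):
    """One hidden ReLU layer of width m followed by a linear last layer with K outputs."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(m)]
    b = [rng.gauss(0.0, 1.0) for _ in range(m)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(K)]
    return W, b, V

def forward(net, x):
    """Return (hidden features scaled by 1/sqrt(m), the K last-layer outputs)."""
    W, b, V = net
    m = len(W)
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + bi) / math.sqrt(m)
         for row, bi in zip(W, b)]
    out = [sum(v * hi for v, hi in zip(Vk, h)) for Vk in V]
    return h, out

def last_layer_gradient(branch, trunk, u, y):
    """Gradient of G(u)(y) = sum_k f_k(u) g_k(y) w.r.t. both last-layer weight matrices."""
    hf, f = forward(branch, u)
    hg, g = forward(trunk, [y])
    grad_f = [gk * h for gk in g for h in hf]   # dG/dV_f[k][h]  = g_k * hf[h]
    grad_g = [fk * h for fk in f for h in hg]   # dG/dV_g[k][h'] = f_k * hg[h']
    return grad_f + grad_g

def gram(vectors):
    return [[sum(a * b for a, b in zip(p, q)) for q in vectors] for p in vectors]

def is_positive_definite(K):
    """Attempt a Cholesky factorization; it succeeds iff K is numerically positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1):
            s = K[a][b] - sum(L[a][c] * L[b][c] for c in range(b))
            if a == b:
                if s <= 0.0:
                    return False
                L[a][b] = math.sqrt(s)
            else:
                L[a][b] = s / L[b][b]
    return True

rng = random.Random(0)
branch = init_net(in_dim=5, m=64, K=4, rng=rng)
trunk = init_net(in_dim=1, m=64, K=4, rng=rng)
us = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)]  # n = 3 input functions
ys = [[0.1, 0.9], [0.3, 0.7], [0.2, 0.6]]                         # q = 2 locations each
grads = [last_layer_gradient(branch, trunk, u, y) for u, yi in zip(us, ys) for y in yi]
K_emp = gram(grads)                                               # (n*q) x (n*q) Gram matrix
print(is_positive_definite(K_emp))
```

This checks a single random draw rather than the expectation $\mathbb{E}[K(\theta)]$ appearing in Theorem 5.1, so it is a sanity check of the structure of the argument, not a verification of the theorem.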

A.5 EXPERIMENTAL DETAILS

For the optimizer we choose Adam (Kingma & Ba, 2014) with an adaptive learning rate schedule, initialized at a learning rate $\eta_0 = 10^{-3}$. To generate training data for all three examples, we sample the input, denoted by $u(x)$, from a zero-mean Gaussian process (GP) on a grid $\{x_l\}_{l=1}^{m} \subset [0, 1]$ and generate the outputs corresponding to each sampled function by solving the ODE/PDE (see (Wang et al., 2021; Lu et al., 2021) for a detailed discussion on data generation). For end-to-end training we use the deep learning framework JAX (Bradbury et al., 2018); we build our code on top of (Wang et al., 2021) for the Diffusion-Reaction and Antiderivative operators, and develop our own for Burgers' equation. We now briefly outline the problems below, along with the specifics of the training process for each of them.
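The input-sampling step can be sketched as follows. The squared-exponential (RBF) kernel, its length scale of 0.2, the diagonal jitter, and the helper names are assumptions made for illustration; the text specifies only a zero-mean GP on a grid in $[0, 1]$.

```python
import math
import random

def rbf_kernel(xs, length_scale=0.2, jitter=1e-6):
    """Covariance K[a][b] = exp(-(x_a - x_b)^2 / (2 l^2)), with jitter added on the diagonal."""
    m = len(xs)
    return [[math.exp(-((xs[a] - xs[b]) ** 2) / (2.0 * length_scale ** 2))
             + (jitter if a == b else 0.0) for b in range(m)] for a in range(m)]

def cholesky(K):
    """Lower-triangular L with L L^T = K, for a symmetric positive definite K."""
    m = len(K)
    L = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1):
            s = sum(L[a][c] * L[b][c] for c in range(b))
            if a == b:
                L[a][b] = math.sqrt(K[a][a] - s)
            else:
                L[a][b] = (K[a][b] - s) / L[b][b]
    return L

def sample_gp(xs, rng, length_scale=0.2):
    """Draw one zero-mean GP sample on the grid xs as u = L z with z ~ N(0, I)."""
    L = cholesky(rbf_kernel(xs, length_scale))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(L[a][c] * z[c] for c in range(a + 1)) for a in range(len(xs))]

sensors = [l / 19 for l in range(20)]          # m = 20 sensor locations in [0, 1]
u_sample = sample_gp(sensors, random.Random(0))
```

Each call with a fresh random state yields one input function evaluated at the sensors; the corresponding outputs would then come from the ODE/PDE solve for that problem.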

A.5.1 ANTIDERIVATIVE OPERATOR

The antiderivative (or simply the integral) operator corresponds to a linear operator defined explicitly by a linear ODE (initial value problem) in the unknown function $v(x) \in V$, given the input $u(x) \in U$ and the constant $v(0)$ for mathematical well-posedness, i.e., $\frac{dv(x)}{dx} = u(x)$ for $x \in (0, 1]$ with $v(0)$ prescribed. We learn the operator mapping $u(x)$ to its corresponding integral $v(x) = G_\theta(u)(x)$ for all $x \in (0, 1]$. For generating the training data, we sample the input functions from a univariate Gaussian process as outlined above and the output points randomly on the interval $[0, 1]$, and choose $n_B = 10000$ for our empirical results.
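A minimal sketch of assembling one training tuple for this operator is given below. It assumes $v(0) = 0$, a cumulative trapezoidal rule for the integral, and piecewise-linear interpolation at the random output locations; these choices and the function names are illustrative assumptions, not the paper's data pipeline.

```python
import random

def antiderivative_on_grid(u_vals, xs, v0=0.0):
    """Cumulative trapezoidal rule: v(x_l) ~ v0 + integral of u from xs[0] to x_l."""
    v = [v0]
    for l in range(1, len(xs)):
        v.append(v[-1] + 0.5 * (u_vals[l - 1] + u_vals[l]) * (xs[l] - xs[l - 1]))
    return v

def interpolate(xs, vals, y):
    """Piecewise-linear interpolation of grid values at a location y in [xs[0], xs[-1]]."""
    for l in range(1, len(xs)):
        if y <= xs[l]:
            w = (y - xs[l - 1]) / (xs[l] - xs[l - 1])
            return (1.0 - w) * vals[l - 1] + w * vals[l]
    return vals[-1]

def make_training_tuple(u, m=100, q=5, rng=None):
    """Return (u at m sensors, q random output locations y, targets v(y))."""
    rng = rng or random.Random(0)
    xs = [l / (m - 1) for l in range(m)]
    u_vals = [u(x) for x in xs]
    v_vals = antiderivative_on_grid(u_vals, xs)
    ys = [rng.uniform(0.0, 1.0) for _ in range(q)]
    targets = [interpolate(xs, v_vals, y) for y in ys]
    return u_vals, ys, targets

# Example with a known input (hypothetical): u(x) = 2x, whose antiderivative is x^2.
u_vals, ys, targets = make_training_tuple(lambda x: 2.0 * x)
```

In practice the callable `u` would be replaced by a GP sample evaluated at the sensors, in which case the trapezoidal step operates directly on the sampled values.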

