LEARNING DEEP OPERATOR NETWORKS: THE BENEFITS OF OVER-PARAMETERIZATION

Abstract

Neural operators that directly learn mappings between function spaces have received considerable recent attention. Deep Operator Networks (DeepONets) (Lu et al., 2021), a popular recent class of operator networks, have shown promising preliminary results in approximating solution operators of parametric partial differential equations. Despite universal approximation guarantees (Lu et al., 2021; Chen & Chen, 1995), there is as yet no optimization convergence guarantee for DeepONets trained with gradient descent (GD). In this paper, we establish such guarantees and show that over-parameterization based on wide layers provably helps. In particular, we present two types of optimization convergence analysis: first, for smooth activations, we bound the spectral norm of the Hessian of DeepONets and use the bound to show geometric convergence of GD based on restricted strong convexity (RSC); second, for ReLU activations, we show that the neural tangent kernel (NTK) of DeepONets at initialization is positive definite, which, combined with the standard NTK analysis, implies geometric convergence. Further, we present empirical results on three canonical operator learning problems: the antiderivative operator, the diffusion-reaction equation, and Burgers' equation, and show that wider DeepONets lead to lower training loss on all three problems, supporting the theoretical results.

1. INTRODUCTION

Replicating the success of deep learning in scientific computing, such as developing neural PDE solvers, constructing surrogate models, and building hybrid numerical solvers, has recently captured the interest of the broader scientific community. Neural Operators (Li et al., 2021a;b) and Deep Operator Networks (DeepONets) (Lu et al., 2021; Wang et al., 2021) are two recent approaches aimed at learning mappings between function spaces. In contrast to the classical supervised learning setup, which learns mappings between two finite-dimensional vector spaces, these neural operators/operator networks learn mappings between infinite-dimensional function spaces. The key idea underlying both approaches is to parameterize the solution operator as a deep neural network and proceed with learning as in a standard supervised learning setup. Since a neural operator directly learns the mapping between the input and output function spaces, it is a natural choice for learning solution operators of parametric PDEs, where the PDE solution must be inferred for many instances of the "input parameters", or for inverse problems, where the forward problem must be solved many times to optimize a given functional. While there exist results on the approximation properties and convergence of DeepONets (see, e.g., (Deng et al., 2021) for a convergence analysis of DeepONets, vis-a-vis their approximation guarantees, for the advection-diffusion equation), there are no optimization results on when and why GD converges when minimizing the DeepONet loss. In this work we put forth theoretical convergence guarantees for DeepONets centered around over-parameterization and show that over-parameterization based on wider layers (for both the branch and trunk nets) provably helps DeepONet convergence.
This is reflected in Figure 1, which summarizes an empirical evaluation of over-parameterized DeepONets with ReLU and smooth activations on a prototypical operator learning problem. To complement our theoretical results, we present an empirical evaluation of our guarantees on three template operator learning problems: (i) the antiderivative operator, (ii) a diffusion-reaction PDE, and (iii) Burgers' equation, and demonstrate that wider DeepONets lead to overall lower training loss at the end of training.

The rest of the paper is organized as follows. In Section 2 we review the existing literature on neural operators, operator networks, and over-parameterization-based approaches for establishing convergence guarantees for deep networks. Section 3 briefly outlines the DeepONet model, the learning problem, and the corresponding architecture. Section 4 contains the first technical result of the paper: convergence guarantees for DeepONets with smooth activations (for both branch and trunk nets) based on the Restricted Strong Convexity (RSC) of the loss. Next, in Section 5, we present the second technical result, establishing optimization guarantees for DeepONets with ReLU activations by showing that the Neural Tangent Kernel (NTK) of the DeepONet is positive definite at initialization. In Section 6 we present a simple empirical evaluation of the main results via a parametric study of the effect of increasing DeepONet width on the total training loss. We conclude by summarizing the main contributions in Section 7.
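To fix notation ahead of the formal setup in Section 3, the DeepONet prediction combines a branch net (which encodes the input function sampled at sensor points) and a trunk net (which encodes the query location) via a dot product. The following numpy sketch illustrates the forward pass only; the layer sizes, depth, and initialization here are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes, rng):
    """Random (He-style) initialization of a fully connected net."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# The branch net sees u sampled at n_sensors fixed points; the trunk net
# sees a query point y. The shared hidden width m is the quantity whose
# growth (over-parameterization) the paper analyzes.
n_sensors, width, p = 100, 512, 64        # p = number of shared basis functions
branch = mlp_params([n_sensors, width, p], rng)
trunk = mlp_params([1, width, p], rng)

def deeponet(u_sensors, y):
    """G_theta(u)(y) = <branch(u), trunk(y)>, a p-dimensional dot product."""
    b = mlp_forward(branch, u_sensors)    # (batch_u, p)
    t = mlp_forward(trunk, y)             # (batch_y, p)
    return b @ t.T                        # (batch_u, batch_y)

u = rng.standard_normal((4, n_sensors))   # 4 input functions on the sensor grid
y = np.linspace(0.0, 1.0, 50)[:, None]    # 50 query locations in [0, 1]
out = deeponet(u, y)
print(out.shape)                          # (4, 50)
```

Training then proceeds as in standard supervised learning, with GD on the squared loss between `deeponet(u, y)` and the target operator values.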

2. RELATED WORK

2.1. LEARNING OPERATOR NETWORKS

Constructing operator networks for ordinary differential equations (ODEs) using learning-based approaches was first studied in (Chen & Chen, 1995), where the authors showed that a neural network with a single hidden layer can approximate a nonlinear continuous functional to arbitrary accuracy. This was, in essence, akin to the Universal Approximation Theorem for classical neural networks (see, e.g., (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991; Lu et al., 2017)). While the theorem only guaranteed the existence of such an architecture, it was not practically realized until (Lu et al., 2021), which also extended the theorem to deep networks. Since then, a number of works have pursued applications of DeepONets to different problems (see, e.g., (Goswami et al., 2022; Wang et al., 2021; Wang & Perdikaris, 2021)). Recently, (Kontolati et al., 2022) studied the influence of over-parameterization on DeepONet-based neural surrogates in the context of dynamical systems. While their paper studies the effect of over-parameterization on the generalization properties of DeepONets, an optimization analysis of DeepONets remains largely an open problem.

2.2. OPTIMIZATION: NTK, ETC.

The optimization of over-parameterized deep networks has been studied extensively (see, e.g., (Du et al., 2019; Arora et al., 2019b;a; Allen-Zhu et al., 2019; Liu et al., 2021a)). In particular, (Jacot et al., 2018) showed that the neural tangent kernel (NTK) of a deep network converges to an explicit kernel in the limit of infinite network width and stays constant during training. (Liu et al., 2021a) showed that this constancy arises from the scaling of the Hessian of the predictor with network width. (Du et al., 2019; Allen-Zhu et al., 2019) showed that gradient descent
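As a concrete illustration of the NTK viewpoint discussed above, the empirical NTK Gram matrix of a toy one-hidden-layer ReLU network can be computed directly from per-sample parameter gradients and checked for positive definiteness at initialization. This is a sketch under simplifying assumptions common in NTK analyses (gradients taken with respect to the hidden weights only, output weights fixed to random signs), not the DeepONet setting analyzed in Section 5:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 1024, 5, 8                  # width, input dim, number of samples
W = rng.standard_normal((m, d))       # hidden-layer weights at initialization
a = rng.choice([-1.0, 1.0], m)        # fixed random output signs

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # distinct unit-norm inputs

def grad_theta(x):
    """Gradient of f(x) = a . relu(W x) / sqrt(m) w.r.t. the hidden weights W."""
    act = (W @ x > 0).astype(float)                        # ReLU derivative per neuron
    return (a * act)[:, None] * x[None, :] / np.sqrt(m)    # shape (m, d)

# Empirical NTK Gram matrix: K[i, j] = <grad f(x_i), grad f(x_j)>.
G = np.stack([grad_theta(x).ravel() for x in X])           # (n, m*d)
K = G @ G.T                                                # (n, n), symmetric PSD

eigs = np.linalg.eigvalsh(K)
print(eigs.min())   # strictly positive w.h.p. for wide m and distinct inputs
```

Positive definiteness of this Gram matrix is exactly what drives the geometric convergence of GD in the standard NTK analyses cited above: it lower-bounds the decrease of the loss along the gradient flow.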



Figure 1: Benefits of over-parameterization on learning of DeepONets for the antiderivative operator: G_θ(u)(x) = ∫_0^x u(ξ) dξ. In both cases m denotes the width of the branch net and the trunk net. For both ReLU and smooth activations, increasing the width m leads to much lower losses. Note that the y-axis is in log scale.
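For concreteness, training pairs for the antiderivative benchmark in Figure 1 can be generated by sampling input functions on a sensor grid and integrating them numerically. The random-Fourier sampler below is an illustrative stand-in for the Gaussian-random-field inputs commonly used in DeepONet experiments; the grid size is likewise an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 100)       # sensor grid on [0, 1]

def sample_u(rng, n_modes=5):
    """Sample a smooth random input function as a short sine series."""
    coeffs = rng.standard_normal(n_modes)
    return sum(c * np.sin((k + 1) * np.pi * xs) for k, c in enumerate(coeffs))

def antiderivative(u):
    """Target operator G(u)(x) = int_0^x u(xi) dxi via cumulative trapezoid rule."""
    dx = xs[1] - xs[0]
    out = np.zeros_like(u)
    out[1:] = np.cumsum(0.5 * (u[1:] + u[:-1]) * dx)
    return out

us = np.stack([sample_u(rng) for _ in range(3)])        # inputs for the branch net
Gus = np.stack([antiderivative(u) for u in us])         # targets at the sensor points
print(us.shape, Gus.shape)                              # (3, 100) (3, 100)
```

Each pair (u, G(u)) then supplies branch-net inputs (the sensor values of u) and supervision targets at the query points fed to the trunk net.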

