LEARNING DEEP OPERATOR NETWORKS: THE BENEFITS OF OVER-PARAMETERIZATION

Abstract

Neural Operators that directly learn mappings between function spaces have received considerable recent attention. Deep Operator Networks (DeepONets) (Lu et al., 2021), a popular recent class of operator networks, have shown promising preliminary results in approximating solution operators of parametric partial differential equations. Despite universal approximation guarantees (Lu et al., 2021; Chen & Chen, 1995), there is as yet no guarantee that gradient descent (GD) converges when optimizing the DeepONet training loss. In this paper, we establish such guarantees and show that over-parameterization based on wide layers provably helps. In particular, we present two types of optimization convergence analysis: first, for smooth activations, we bound the spectral norm of the Hessian of DeepONets and use the bound to show geometric convergence of GD based on restricted strong convexity (RSC); and second, for ReLU activations, we show that the neural tangent kernel (NTK) of DeepONets at initialization is positive definite, which, combined with the standard NTK analysis, implies geometric convergence. Further, we present empirical results on three canonical operator learning problems: the antiderivative operator, the diffusion-reaction equation, and Burgers' equation, and show that wider DeepONets achieve lower training loss on all three problems, thereby supporting the theoretical results.

1. INTRODUCTION

Replicating the success of Deep Learning in scientific computing, e.g., by developing Neural PDE solvers, constructing surrogate models, and building hybrid numerical solvers, has recently captured the interest of the broader scientific community. Neural Operators (Li et al., 2021a; b) and Deep Operator Networks (DeepONets) (Lu et al., 2021; Wang et al., 2021) are two recent approaches aimed at learning mappings between function spaces. In contrast to the classical supervised learning setup, which learns mappings between two finite-dimensional vector spaces, these neural operators/operator networks aim to learn mappings between infinite-dimensional function spaces. The key idea in both approaches is to parameterize the solution operator as a deep neural network and then proceed as in a standard supervised learning setup. Since a neural operator directly learns the mapping between the input and output function spaces, it is a natural choice for learning solution operators of parametric PDEs, where the solution must be inferred for multiple instances of the "input parameters," or for inverse problems, where the forward problem must be solved repeatedly to optimize a given functional. While there exist results on the approximation properties and convergence of DeepONets, e.g., a convergence analysis of DeepONets vis-à-vis their approximation guarantees for the advection-diffusion equation (Deng et al., 2021), there are no optimization results characterizing when and why GD converges when minimizing the DeepONet loss. In this work, we put forth theoretical convergence guarantees for DeepONets centered around over-parameterization, and show that over-parameterization based on wider layers (for both the branch and trunk nets) provably helps DeepONet convergence. This is reflected in Figure 1, which summarizes an empirical evaluation of over-parameterized DeepONets with ReLU and smooth activations on a prototypical operator learning problem. To complement our theoretical results, we present an empirical evaluation of our guarantees on three template operator learning problems: (i) the antiderivative operator, (ii) the diffusion-reaction PDE, and (iii) Burgers' equation, and demonstrate that wider DeepONets achieve lower training loss at the end of training.
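To make the setup concrete, the sketch below illustrates the DeepONet parameterization of Lu et al. (2021): a branch net encodes the input function u through its values at m fixed sensor locations, a trunk net encodes a query point y, and the predicted output G(u)(y) is the inner product of the two p-dimensional embeddings plus a bias. This is a minimal sketch in PyTorch; the widths, depths, activation, and the synthetic training loop are illustrative assumptions, not the configurations used in our experiments.

```python
import torch
import torch.nn as nn


class DeepONet(nn.Module):
    """Minimal DeepONet: G(u)(y) ~ sum_k b_k(u(x_1),...,u(x_m)) * t_k(y) + b0."""

    def __init__(self, num_sensors: int, width: int = 128, p: int = 64):
        super().__init__()
        # Branch net: maps sensor values u(x_1), ..., u(x_m) to p coefficients.
        self.branch = nn.Sequential(
            nn.Linear(num_sensors, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, p),
        )
        # Trunk net: maps a query location y to p basis values.
        self.trunk = nn.Sequential(
            nn.Linear(1, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, p),
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # u_sensors: (batch, num_sensors), y: (batch, 1) -> output: (batch, 1)
        b = self.branch(u_sensors)   # (batch, p)
        t = self.trunk(y)            # (batch, p)
        return (b * t).sum(dim=-1, keepdim=True) + self.bias


# Illustrative usage: fit placeholder (u, y, G(u)(y)) triples with plain GD.
if __name__ == "__main__":
    torch.manual_seed(0)
    model = DeepONet(num_sensors=100)
    u = torch.randn(256, 100)        # sensor readings of input functions
    y = torch.rand(256, 1)           # query locations in [0, 1]
    target = torch.randn(256, 1)     # placeholder operator outputs
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for step in range(100):
        loss = ((model(u, y) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In this parameterization, "wider layers" refers to increasing the hidden width of the branch and trunk nets (the `width` argument above), which is the over-parameterization regime our convergence analysis concerns.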

