MOMENTUM DIMINISHES THE EFFECT OF SPECTRAL BIAS IN PHYSICS-INFORMED NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs). However, even for the simplest PDEs, they often fail to converge to desirable solutions when the target function contains high-frequency modes, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (GDM). This demonstrates that GDM significantly reduces the effect of spectral bias. We also examine why training a model via the Adam optimizer can accelerate convergence while reducing spectral bias. Moreover, our numerical experiments confirm that wide-enough networks using GDM or Adam still converge to desirable solutions, even in the presence of high-frequency features.

1. INTRODUCTION

Physics-informed neural networks (PINNs) have been proposed as alternatives to traditional numerical partial differential equation (PDE) solvers (Raissi et al., 2019; 2020; Sirignano & Spiliopoulos, 2018; Tripathy & Bilionis, 2018). In PINNs, a PDE that describes the physical domain knowledge of a problem is added as a regularization term to an empirical loss function. Although PINNs have shown remarkable performance in solving a wide range of problems in science and engineering (Cai et al., 2022; Kharazmi et al., 2019; Sun et al., 2020; Kissas et al., 2020; Tartakovsky et al., 2018), they often fail to converge to accurate solutions when the target function contains high-frequency features, regardless of the simplicity of the PDE itself (Krishnapriyan et al., 2021; Wang et al., 2021). This phenomenon, known as spectral bias, exists even in the simplest linear PDEs (Wang et al., 2021; Moseley et al., 2021; Krishnapriyan et al., 2021).

Spectral bias is not limited to PINNs. Rahaman et al. (2019) empirically showed that fully-connected feed-forward neural networks (NNs) are biased against learning complex components of target functions. Furthermore, Cao et al. (2019) theoretically proved that when training infinitely-wide networks with squared loss, the eigenvalues of the neural tangent kernel (NTK) (Jacot et al., 2018) give the exact convergence rates of the different components of the target function. Thus, spectral bias occurs when some of the eigenvalues of the NTK are large while others are small. Recently, utilizing the NTK of infinitely-wide PINNs, Wang et al. (2022) examined the gradient flow of these networks during training. They proved that the training error decays as e^{-κ_i t}, where κ_i are the eigenvalues of the NTK. Thus, the components of the target function corresponding to the smaller eigenvalues decay at a slower rate, which causes spectral bias.
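This mode-wise decay can be illustrated with a small numerical sketch (our own illustration, not from the cited works; the eigenvalues below are hypothetical): under gradient flow, the i-th error component shrinks as e^{-κ_i t}, so modes with small κ_i lag far behind the others.

```python
import numpy as np

# Hypothetical NTK eigenvalues: one fast (large) and one slow (small) mode.
kappa = np.array([10.0, 0.1])

# Initial error of the target-function components in the NTK eigenbasis.
err0 = np.array([1.0, 1.0])

# Under gradient flow, the i-th component decays as e^{-kappa_i * t}.
t = 5.0
err_t = err0 * np.exp(-kappa * t)

print(err_t)  # the large-eigenvalue mode is essentially learned; the small one barely moves
```

After the same training time, the high-eigenvalue component has error on the order of e^{-50} while the low-eigenvalue component is still at e^{-0.5} ≈ 0.61, which is exactly the spectral-bias picture described above.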
To tackle the issue of spectral bias, they proposed to assign a weight to each term of the loss function and to update it dynamically. Although the results showed some improvement, as the frequency of the target function increased, their proposed PINN still failed to converge to the solutions of the PDEs. Moreover, as assigning weights can result in indefinite kernels, the training process could become extremely unstable. Of note, compared to typical NNs, analyzing the effect of spectral bias in PINNs is more challenging, as the loss function is regularized by the addition of the PDE residual. Thus, Wang et al. (2022)'s study was limited to training the model only with GD. Some studies proposed an alternative approach in which, instead of modifying the loss function terms, a high-frequency PDE is solved in a few successive steps. These methods assume that the optimal solution of a low-frequency PDE is close to the optimal solution of the corresponding high-frequency PDE. Hence, instead of being randomly initialized, the weights are initialized using the optimal solution of the low-frequency PDE. Moseley et al. (2021) implemented a finite-element-style approach in which PINNs were trained to learn basis functions over several small, overlapping subdomains. Similarly, Krishnapriyan et al. (2021) proposed a method based on learning the solution over small successive chunks of time. They also proposed another sequential learning scheme in which the model was gradually trained on target functions of increasing frequency, and, at each step, the optimized weights were used as a warm initialization for the higher-frequency target functions. In a similar approach, Huang & Alkhalifah (2021) proposed to use models pre-trained on low-frequency functions and to increase the size of the network (neuron splitting) as the frequency of the target function increases.
Although these methods performed well on some PDEs, as the frequency terms became larger the process became much slower, since the number of required steps grows significantly. In this work, we study the spectral bias of PINNs from an optimization perspective. Existing studies have focused only on the effect of vanilla GD (Wang et al., 2022), or are limited to weak empirical evidence indicating that Adam might learn high-frequency features faster (Taylor et al., 2022). We prove that an infinitely-wide PINN under vanilla GD will converge to the solution. However, for high-frequency modes, the learning rate must become very small, which makes convergence extremely slow, and hence impractical. Moreover, we prove that for infinitely-wide networks, the GD with momentum (GDM) optimizer reduces the effect of spectral bias while significantly outperforming vanilla GD. We also investigate why the Adam optimizer accelerates the optimization process while decreasing the effect of spectral bias in PINNs. To the best of our knowledge, this is the first time that the gradient flow of the output of PINNs under GDM and Adam has been analyzed and its relation to mitigating spectral bias discussed. Finally, our extensive numerical experiments on sufficiently wide networks confirm our theoretical findings.
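The intuition behind the GDM claim can be previewed with a toy sketch (our own illustration, not one of the paper's experiments): minimize a quadratic whose Hessian eigenvalues stand in for a badly conditioned NTK spectrum. Under vanilla GD with a stable learning rate, the small-eigenvalue mode stalls; with heavy-ball momentum and β close to 1, the per-mode decay rates become far more uniform.

```python
import numpy as np

# Eigenvalues standing in for a badly conditioned NTK spectrum.
kappa = np.array([100.0, 1.0])
lr, beta, steps = 0.01, 0.9, 300  # lr <= 1/kappa_max keeps vanilla GD stable

# Vanilla GD on f(w) = 0.5 * sum(kappa * w**2): each mode contracts by (1 - lr*kappa).
w_gd = np.ones(2)
for _ in range(steps):
    w_gd = (1.0 - lr * kappa) * w_gd

# Heavy-ball momentum: v <- beta*v - lr*grad; w <- w + v
w_m, v = np.ones(2), np.zeros(2)
for _ in range(steps):
    v = beta * v - lr * kappa * w_m
    w_m = w_m + v

print(np.abs(w_gd))  # slow mode still has visible error
print(np.abs(w_m))   # both modes driven near zero at a similar rate
```

With these settings, both momentum modes contract at roughly sqrt(beta) per step, so the spectrum-induced gap between fast and slow modes largely disappears; this is the mechanism the analysis in this paper makes precise for PINNs.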

2.1. PINNS GENERAL FORM

The general form of a well-posed PDE on a bounded domain Ω ⊂ R^d is defined as:

D[u](x) = f(x),  x ∈ Ω,
u(x) = g(x),  x ∈ ∂Ω,  (1)

where D is a differential operator and u(x) is the solution of the PDE, with x = (x_1, x_2, ..., x_d). Note that for time-dependent equations, the temporal variable t is viewed as an additional coordinate within x. Hence, the initial condition is viewed as a special type of Dirichlet boundary condition and is included in the second term. Using PINNs, the solution of Eq. 1 can be approximated as u(x, w) by minimizing the following loss function:

L(w) := L_b(w) + L_r(w),

L_b(w) = (1/N_b) Σ_{i=1}^{N_b} (u(x_b^i, w) - g(x_b^i))^2,
L_r(w) = (1/N_r) Σ_{i=1}^{N_r} (D[u](x_r^i, w) - f(x_r^i))^2,

where {x_b^i}_{i=1}^{N_b} and {x_r^i}_{i=1}^{N_r} are boundary and collocation points, respectively, and w denotes the neural network parameters. L_b(w) corresponds to the mean squared error on the boundary (and initial) condition data points, and L_r(w) encapsulates the physics of the problem through the randomly selected collocation points. As with all other NNs, minimizing the loss function L(w) amounts to finding the optimal parameters w^*.
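To make the two loss terms concrete, the following is a minimal sketch (our own, not the paper's code) of L(w) for the 1D Poisson problem -u'' = f on [0, 1] with u(0) = u(1) = 0. For simplicity it uses a one-hidden-layer tanh network whose second derivative is computed analytically rather than by the automatic differentiation a real PINN framework would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer tanh network: u(x, w) = sum_k a_k * tanh(W_k * x + c_k)
H = 16
W = rng.normal(size=H)
c = rng.normal(size=H)
a = rng.normal(size=H) / np.sqrt(H)

def u(x):
    return np.tanh(np.outer(x, W) + c) @ a

def u_xx(x):
    th = np.tanh(np.outer(x, W) + c)
    # d^2/dx^2 tanh(W*x + c) = -2 * tanh * (1 - tanh^2) * W^2
    return (-2.0 * th * (1.0 - th**2) * W**2) @ a

# Poisson problem: -u'' = f with f(x) = pi^2 * sin(pi x); exact solution sin(pi x).
f = lambda x: np.pi**2 * np.sin(np.pi * x)

x_b = np.array([0.0, 1.0])             # boundary points (g = 0 here)
x_r = rng.uniform(0.0, 1.0, size=100)  # collocation points

L_b = np.mean(u(x_b) ** 2)                 # boundary term
L_r = np.mean((-u_xx(x_r) - f(x_r)) ** 2)  # PDE residual term
loss = L_b + L_r
print(loss)
```

Minimizing this loss over (W, c, a) with any gradient-based optimizer is exactly the PINN training problem analyzed in the remainder of the paper.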

2.2. INFINITELY-WIDE NEURAL NETWORKS

A fully-connected infinitely-wide NN with L hidden layers can be written as (Jacot et al., 2018):

u^h(x) = (1/√N) Θ^h x^h + b^h,
x^{h+1} = σ(u^h),
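A small sketch of this parameterization (our own, under the assumptions of standard-normal weights and a tanh nonlinearity for illustration): the 1/√N factor keeps the output scale roughly width-independent, which is what makes the infinite-width NTK limit well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

def ntk_forward(x, width):
    """One hidden layer in NTK parameterization: pre-activations scaled by 1/sqrt(fan_in)."""
    d = x.shape[0]
    W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
    W2, b2 = rng.normal(size=width), rng.normal()
    h = np.tanh(W1 @ x / np.sqrt(d) + b1)  # hidden layer
    return W2 @ h / np.sqrt(width) + b2    # output stays O(1) as width grows

x = np.array([1.0, -0.5, 2.0])
stds = {}
for width in (50, 5000):
    outs = np.array([ntk_forward(x, width) for _ in range(500)])
    stds[width] = outs.std()
print(stds)  # comparable standard deviation at both widths
```

Without the 1/√N scaling, the output variance would grow linearly with the width, and the NTK analysis used throughout this paper would not apply.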

