MOMENTUM DIMINISHES THE EFFECT OF SPECTRAL BIAS IN PHYSICS-INFORMED NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs). However, even for the simplest PDEs, they often fail to converge to desirable solutions when the target function contains high-frequency modes, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (GDM). Our analysis demonstrates that GDM significantly reduces the effect of spectral bias. We also examine why training a model via the Adam optimizer can accelerate convergence while reducing the spectral bias. Moreover, our numerical experiments confirm that wide-enough networks using GDM or Adam still converge to desirable solutions, even in the presence of high-frequency features.

1. INTRODUCTION

Physics-informed neural networks (PINNs) have been proposed as alternatives to traditional numerical solvers for partial differential equations (PDEs) (Raissi et al., 2019; 2020; Sirignano & Spiliopoulos, 2018; Tripathy & Bilionis, 2018). In PINNs, a PDE which describes the physical domain knowledge of a problem is added as a regularization term to an empirical loss function. Although PINNs have shown remarkable performance in solving a wide range of problems in science and engineering (Cai et al., 2022; Kharazmi et al., 2019; Sun et al., 2020; Kissas et al., 2020; Tartakovsky et al., 2018), regardless of the simplicity of the PDE itself, they often fail to converge to accurate solutions when the target function contains high-frequency features (Krishnapriyan et al., 2021; Wang et al., 2021). This phenomenon, known as spectral bias, exists in even the simplest linear PDEs (Wang et al., 2021; Moseley et al., 2021; Krishnapriyan et al., 2021).

Spectral bias is not limited to PINNs. Rahaman et al. (2019) empirically showed that all fully-connected feed-forward neural networks (NNs) are biased against learning complex components of target functions. Furthermore, Cao et al. (2019) theoretically proved that in training infinitely-wide networks with squared loss, the corresponding eigenvalues of the neural tangent kernel (NTK) (Jacot et al., 2018) indicate the exact convergence rate for different components of the target functions. Thus, spectral bias happens when the absolute values of some of the eigenvalues of the NTK are large while others are small. Recently, utilizing the NTK of infinitely-wide PINNs, Wang et al. (2022) examined the gradient flow of these networks during training. They proved that the training error decays as $e^{-\kappa_i t}$, where $\kappa_i$ are the eigenvalues of the NTK. Thus, the components of the target function corresponding to the smaller eigenvalues have a slower rate of decay, which causes spectral bias.
To tackle the issue of spectral bias, Wang et al. (2022) proposed to assign a weight to each term of the loss function and to update it dynamically. Although the results showed some improvements, as the frequency of the target function increased, their proposed PINN still failed to converge to solutions of PDEs. Moreover, as assigning weights can result in indefinite kernels, the training process could become extremely unstable. Of note, compared to typical NNs, analyzing the effect of spectral bias for PINNs is more challenging, as the loss function is regularized by adding the PDE. Thus, the study of Wang et al. (2022) was limited to training the model with GD only.

Some studies proposed an alternative approach in which, instead of modifying the loss function terms, a high-frequency PDE is solved in a few successive steps. In these methods, it is assumed that the optimal solution of low-frequency PDEs is close to the optimal solution of high-frequency PDEs. Hence, instead of being randomly initialized, the weights are initialized using the optimal solution of low-frequency PDEs. Moseley et al. (2021) implemented a finite element approach where PINNs were trained to learn basis functions over several small, overlapping subdomains. Similarly, Krishnapriyan et al. (2021) proposed a learning method based on learning the solution over small successive chunks of time. Moreover, they proposed another sequential learning scheme in which the model was gradually trained on target functions with lower frequencies, and, at each step, the optimized weights were used as the warm initialization for higher-frequency target functions. In a similar approach, Huang & Alkhalifah (2021) proposed to use the pre-trained models from low-frequency functions and to increase the size of the network (neuron splitting) as the frequency of the target function is increased.
Although these methods showed good performance on some PDEs, as the frequency terms became larger, the process became much slower because the required number of steps grew significantly.

In this work, we study the spectral bias of PINNs from an optimization perspective. Existing studies have focused only on the effect of vanilla GD (Wang et al., 2022), or they are limited to some weak empirical evidence indicating that Adam might learn high-frequency features faster (Taylor et al., 2022). We prove that an infinitely-wide PINN under the vanilla GD optimization process will converge to the solution. However, for high-frequency modes, the learning rate needs to become very small, which makes the convergence extremely slow, and hence not possible in practice. Moreover, we prove that for infinitely-wide networks, using the GD with momentum (GDM) optimizer can reduce the effect of spectral bias in the networks, while significantly outperforming vanilla GD. We also investigate why the Adam optimizer can accelerate the optimization process while decreasing the effect of spectral bias in PINNs. To the best of our knowledge, this is the first time that the gradient flow of the output of PINNs under GDM and Adam is analyzed, and its relation to resolving spectral bias is discussed. Finally, our extensive numerical experiments on sufficiently wide networks confirm our theoretical findings.

2.1. PINNS GENERAL FORM

The general form of a well-posed PDE on a bounded domain $\Omega \subset \mathbb{R}^d$ is defined as:

$$D[u](x) = f(x), \quad x \in \Omega, \qquad u(x) = g(x), \quad x \in \partial\Omega \tag{1}$$

where $D$ is a differential operator and $u(x)$ is the solution (of the PDE), in which $x = (x_1, x_2, \ldots, x_d)$. Note that for time-dependent equations, $t = (t_1, t_2, \ldots, t_d)$ are viewed as additional coordinates within $x$. Hence, the initial condition is viewed as a special type of Dirichlet boundary condition and included in the second term. Using PINNs, the solution of Eq. 1 can be approximated as $u(x, w)$ by minimizing the following loss function:

$$L(w) := \underbrace{\frac{1}{N_b}\sum_{i=1}^{N_b}\left(u(x_b^i, w) - g(x_b^i)\right)^2}_{L_b(w)} + \underbrace{\frac{1}{N_r}\sum_{i=1}^{N_r}\left(D[u](x_r^i, w) - f(x_r^i)\right)^2}_{L_r(w)} \tag{2}$$

where $\{x_b^i\}_{i=1}^{N_b}$ and $\{x_r^i\}_{i=1}^{N_r}$ are boundary and collocation points respectively, and $w$ describes the neural network parameters. $L_b(w)$ corresponds to the mean squared error of the boundary (and initial) condition data points, and $L_r(w)$ encapsulates the physics of the problem using the randomly selected collocation points. Similar to all other NNs, minimizing the loss function $L(w)$ results in finding the optimal solutions $w^*$.
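As an illustration of how the two loss terms behave, the following is a minimal sketch, assuming the one-dimensional operator $D[u] = u''$ with a sinusoidal source term as an example instance. The names `pinn_loss`, `exact`, and `shifted` are hypothetical, the candidate solutions are plain callables standing in for the network $u(x, w)$, and the second derivative is approximated with a central finite difference rather than the automatic differentiation a PINN would use.

```python
import math

def pinn_loss(u, f, g, boundary_pts, colloc_pts, h=1e-4):
    """Return (L_b, L_r) for the illustrative operator D[u] = u''.

    u is a callable candidate solution (a stand-in for the network output).
    u'' is approximated by a central finite difference; this is an
    assumption of the sketch, not how a PINN differentiates its output.
    """
    # Boundary term: mean squared mismatch with g on the boundary points.
    loss_b = sum((u(x) - g(x)) ** 2 for x in boundary_pts) / len(boundary_pts)

    def d2u(x):
        return (u(x + h) - 2.0 * u(x) + u(x - h)) / h ** 2

    # Residual term: mean squared PDE residual on the collocation points.
    loss_r = sum((d2u(x) - f(x)) ** 2 for x in colloc_pts) / len(colloc_pts)
    return loss_b, loss_r

C = 5 * math.pi
f = lambda x: -C ** 2 * math.sin(C * x)   # source term
g = lambda x: 0.0                          # boundary data
boundary = [0.0, 1.0]
colloc = [(i + 0.5) / 100 for i in range(100)]

exact = lambda x: math.sin(C * x)
shifted = lambda x: math.sin(C * x) + 0.5  # satisfies the PDE, violates the boundary data

lb_exact, lr_exact = pinn_loss(exact, f, g, boundary, colloc)
lb_shift, lr_shift = pinn_loss(shifted, f, g, boundary, colloc)
```

In this sketch the vertically shifted candidate has a residual term as small as that of the exact solution, while its boundary term is large. The two terms of Eq. 2 therefore constrain different aspects of the solution.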

2.2. INFINITELY-WIDE NEURAL NETWORKS

A fully-connected infinitely-wide NN with $L$ hidden layers can be written as (Jacot et al., 2018):

$$u^h(x) = \frac{1}{\sqrt{N}}\,\Theta^h \cdot x^h + b^h, \qquad x^{h+1} = \sigma(u^h)$$

where $\Theta^h$ and $b^h$ are respectively the weight matrices and the bias vectors in the layers $h = 1, \ldots, L$, $N$ is the width of the layer, and $\sigma(\cdot)$ is a $\beta$-smooth activation function (e.g. $\tanh(\cdot)$). The final output of the NN is written as:

$$u(x, w) = \frac{1}{\sqrt{N}}\,\Theta^L \cdot x^L + b^L$$

where $w = (\Theta^0, b^0, \ldots, \Theta^L, b^L)$. At each time step $t$, the gradient of the output with respect to the parameters defines the NTK:

$$K_t(x, x') = \nabla_w u(x, w(t))^\top \nabla_w u(x', w(t)).$$

It is worth noting that the NTK is associated with the model and is completely independent of the choice of optimization algorithm or loss function. Liu et al. (2020) showed that for a fully-connected NN with a linear output layer, the spectral norm of its Hessian satisfies $\|H(w)\| = O(1/\sqrt{N})$. Consequently, as the width becomes larger, the norm of the Hessian becomes smaller: $\lim_{N \to \infty} \|H(w)\| = 0$. Thus, one major consequence of dealing with infinitely-wide fully-connected NNs is that if the last layer of the network is linear, the NTK becomes static (not changing over iterations), and the output of the network can be linearized (Liu et al., 2020):

$$u^{lin}_t(w) \approx u(w)|_{w_0} + (w - w_0)\,\nabla u(w)|_{w_0}.$$

3. THEORETICAL RESULTS

3.1. CONVERGENCE OF GRADIENT DESCENT IN THE PRESENCE OF HIGH-FREQUENCY FEATURES

Generally, the optimization problems corresponding to over-parametrized systems, even on a local scale, are non-convex (Liu et al., 2022). A loss function $L(w)$ of a $\mu$-uniformly conditioned NN satisfies the $\mu$-PL$^*$ condition on a set $S \subset \mathbb{R}^m$ if:

$$\|\nabla_w L(w)\|^2 \ge \mu L(w) \quad \text{for all } w \in S \tag{3}$$

where $\mu$ is a lower bound on the smallest eigenvalue of the tangent kernel $K(w)$ of the NN. It has been shown that infinitely-wide networks satisfy the $\mu$-PL$^*$ condition, a variant of the Polyak-Lojasiewicz condition, and as a result the (stochastic) gradient descent (SGD/GD) optimization algorithms will converge to the optimal solution (Liu et al., 2022). The following proposition makes use of the $\mu$-PL$^*$ condition to provide a convergence analysis for an infinitely-wide PINN with a loss function as in Eq. 2 optimized with GD (Appendix A).

Proposition 1. Let $\lambda_{b_{\max}}$ and $\lambda_{r_{\max}}$, respectively, be the largest eigenvalues of the Hessians $\nabla^2 L_b(w_t)$ and $\nabla^2 L_r(w_t)$. Consider an infinitely-wide PINN optimized with the following update rule: $w_{t+1} = w_t - \eta\nabla L(w_t)$, where $\eta$ is a constant learning rate. Then, provided $\eta = O(1/(\lambda_{b_{\max}} + \lambda_{r_{\max}}))$, the $\mu$-PL$^*$ condition is satisfied and the PINN will converge.

Although Proposition 1 provides a convergence guarantee for an infinitely-wide PINN, for PDEs exhibiting stiff dynamics (those with high-frequency modes), the eigenvalues of the Hessian dictating convergence are often very large (Wang et al., 2021). For example, take the one-dimensional Poisson equation:

$$\frac{\partial^2 u}{\partial x^2} = f(x), \quad x \in \Omega, \qquad u(x) = g(x), \quad x \in \partial\Omega \tag{4}$$

with $u(x) = \sin(Cx)$. The norm of the Hessian $H(w_t)$ is of order $O(C^4)$ (Appendix B). Thus, as $C$ grows so does the bound on the eigenvalues of $H(w_t)$, indicating that at least one of $\lambda_{b_{\max}}$ or $\lambda_{r_{\max}}$ will be large. Consequently, the learning rate must be prohibitively small to guarantee convergence, which makes GD impractical for such problems.
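The effect of the learning-rate bound can be illustrated on a toy problem. The sketch below is not the PINN setting: it assumes a diagonal quadratic loss whose Hessian eigenvalues (`lams`, hypothetical values) play the role of the small and large curvatures, and the function name `gd_quadratic` is introduced only for this example.

```python
def gd_quadratic(lams, eta, steps=200):
    """Run GD on L(w) = 0.5 * sum(lam_i * w_i**2), whose Hessian is diag(lams)."""
    w = [1.0] * len(lams)
    for _ in range(steps):
        # Gradient of the quadratic is lam_i * w_i in each coordinate.
        w = [wi - eta * lam * wi for wi, lam in zip(w, lams)]
    return 0.5 * sum(lam * wi ** 2 for lam, wi in zip(lams, w))

lams = [1.0, 1.0e4]   # one flat and one stiff direction (illustrative values)
lam_max = max(lams)

loss_stable = gd_quadratic(lams, eta=1.0 / lam_max)    # step size within O(1/lam_max)
loss_unstable = gd_quadratic(lams, eta=2.5 / lam_max)  # step size too large for the stiff direction
```

With the step size tied to the largest eigenvalue, the stiff direction is resolved but the flat direction has barely moved after 200 steps. With a larger step size the stiff direction diverges. This is the trade-off that a large $\lambda_{b_{\max}} + \lambda_{r_{\max}}$ forces on vanilla GD.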

3.2. WIDE PINN OPTIMIZATION USING GD WITH MOMENTUM

In order to accelerate learning, we investigate the training dynamics of PINNs under GD with momentum (GDM) (Du, 2019):

$$w_{t+1} = w_t + \alpha(w_t - w_{t-1}) - \eta\nabla_w L(w) \tag{5}$$

where $\alpha$ and $\eta$ are fixed rates. Specifically, we analyze the gradient flow of the update rule in Eq. 5 by leveraging the notion of the NTK. The following theorem (proof in Appendix C) reveals that GDM has a different convergence behavior from GD:

Theorem 1. For an infinitely-wide PINN, the gradient flow of GDM is:

$$m\begin{bmatrix}\ddot{u}(x_b, w(t)) \\ D[\ddot{u}](x_r, w(t))\end{bmatrix} = -\mu\begin{bmatrix}\dot{u}(x_b, w(t)) \\ D[\dot{u}](x_r, w(t))\end{bmatrix} - K\begin{bmatrix}u(x_b, w(t)) - g(x_b) \\ D[u](x_r, w(t)) - h(x_r)\end{bmatrix}$$

which is analogous to a point mass $m$ undergoing a damped harmonic oscillation in a viscous medium with a friction coefficient $\mu(\alpha)$ that is a function of $\alpha$. Furthermore, $K$ is defined as:

$$K = \begin{bmatrix}K_{bb} & K_{rb} \\ K_{br} & K_{rr}\end{bmatrix}$$

where:

$$K_{bb}(x, x') = \nabla_w u(w, x)^\top \nabla_w u(w, x')$$
$$K_{br}(x, x') = \nabla_w u(w, x)^\top \nabla_w D[u](w, x')$$
$$K_{rr}(x, x') = \nabla_w D[u](w, x)^\top \nabla_w D[u](w, x')$$

are three NTKs associated with the boundary and residual terms. Moreover, let $\gamma = \mu/2m$, let $\kappa_i$ be the $i$-th eigenvalue of $K$, and let $\kappa'_i = \kappa_i/m$. Then, the solutions to the gradient flow are of the form:

$$A_1 e^{\lambda_{i_1} t} + A_2 e^{\lambda_{i_2} t}, \qquad \lambda_{i_{1,2}} = -\gamma \pm \sqrt{\gamma^2 - \kappa'_i} \tag{6}$$

where $A_1$ and $A_2$ are constants.

In the analogous gradient flow for vanilla GD, the training error decays at the rate $e^{-\kappa_i t}$ (Wang et al., 2021); thus, PINNs under vanilla GD suffer from spectral bias. Once we add momentum, the decay rate analysis becomes more involved, as Eq. 6 yields three different cases of solutions. Each of the three cases is analogous to one of the solutions of a damped harmonic oscillator (Qian, 1999; Arya) (Fig. 1):

• Under-damped: Imaginary roots ($\gamma^2 < \kappa'_i$)
• Critically-damped: Real and equal roots ($\gamma^2 = \kappa'_i$)
• Over-damped: Real roots ($\gamma^2 > \kappa'_i$).

Under-damped case. As $\sqrt{\gamma^2 - \kappa'_i}$ is imaginary, Eq. 6 can be rewritten as:

$$A e^{-\gamma t}\cos(\omega_1 t + \phi)$$

where $\omega_1 = \sqrt{\kappa'_i - \gamma^2}$, and $A$ and $\phi$ are two constants corresponding to the amplitude and the phase of the damped oscillation. In physics, this solution corresponds to an oscillatory motion in which the amplitude decays exponentially.

Critically-damped case. The general solution of the critically-damped case can be written as:

$$(B_1 + B_2 t)\,e^{-\gamma t}$$

where $B_1$ and $B_2$ are constants. As the oscillatory motion is not present, the decay rate for this case is much faster (Fig. 1).

Over-damped case. Lastly, in the over-damped case the general solution simplifies to:

$$e^{-\gamma t}\left(C_1 e^{\omega_2 t} + C_2 e^{-\omega_2 t}\right)$$

where $\omega_2 = \sqrt{\gamma^2 - \kappa'_i}$ and $C_1$ and $C_2$ are constants. Similar to the critically-damped case, the above equation describes a fast decay.

Thus, depending on $|\kappa_i|$, the absolute value of the eigenvalues of the kernel matrix $K$, the dynamics of the training error corresponding to different frequency components can differ. For larger eigenvalues, the training error corresponds to an under-damped solution, in which the amplitude of an oscillatory motion decays exponentially, whereas for smaller eigenvalues the correspondence is with an over-damped or critically-damped oscillation, with a much faster decay. Thus, when using GDM instead of vanilla GD, the training error for the components of the target function that correspond to the smaller eigenvalues will decay fast (undergoing the over-damped or critically-damped motions), while the components that correspond to the larger eigenvalues will decay at a slower rate. As a consequence, the effect of spectral bias will be less prominent (compared to vanilla GD). To provide visualizations, in Appendix F the decay dynamics of GDM and vanilla GD are plotted using a small and a large eigenvalue.
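The three regimes can be explored numerically from the roots in Eq. 6. The following is a minimal sketch, assuming $m = 1$ (so that $\kappa'_i = \kappa_i$) and purely illustrative values for $\gamma$ and the eigenvalues. The function name `gdm_modes` is hypothetical. It classifies each mode and reports the asymptotic decay rate, taken here as the negated real part of the slowest root.

```python
import cmath

def gdm_modes(gamma, kappa_prime):
    """Classify a mode and return its asymptotic decay rate.

    Roots follow Eq. 6: lambda_{1,2} = -gamma +/- sqrt(gamma^2 - kappa').
    """
    disc = gamma ** 2 - kappa_prime
    root = cmath.sqrt(disc)
    lam1, lam2 = -gamma + root, -gamma - root
    if disc < 0:
        regime = "under-damped"
    elif disc == 0:
        regime = "critically-damped"
    else:
        regime = "over-damped"
    # The slowest-decaying exponential dominates at long times.
    rate = -max(lam1.real, lam2.real)
    return regime, rate

gamma = 1.0                    # illustrative damping, gamma = mu / (2 m)
kappas = (0.1, 1.0, 100.0)     # illustrative kernel eigenvalues (m = 1 assumed)

regimes = [gdm_modes(gamma, k)[0] for k in kappas]
gdm_rates = [gdm_modes(gamma, k)[1] for k in kappas]

# Under the GD flow exp(-kappa_i t), the decay rates are the eigenvalues themselves.
gd_spread = max(kappas) / min(kappas)
gdm_spread = max(gdm_rates) / min(gdm_rates)
```

For these values the ratio between the fastest and slowest decay rates is considerably smaller under the damped-oscillator dynamics than under the GD flow, which is one way of quantifying a reduced spread in convergence speed across eigenvalues. The particular numbers depend entirely on the assumed $\gamma$ and $\kappa_i$.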

4. NUMERICAL EXPERIMENTS

In the previous section, we showed that, in theory, GDM and Adam can diminish the effect of spectral bias for infinitely-wide networks. Here, we show that in practice, for sufficiently wide networks, Adam and GDM can significantly diminish the effect of spectral bias. We provide results from our numerical experiments on Poisson's equation, the transport equation, and the reaction-diffusion problem. Experiments for the reaction-diffusion problem are presented in Appendix H.2.

4.1. POISSON'S EQUATION

Poisson's equation is a well-known elliptic PDE in physics. For example, in electromagnetism, the solution to Poisson's equation is the potential field of a given electric charge. In Eq. 4 the general form of the one-dimensional Poisson's equation was presented. Here, we write Poisson's equation for a specific source function and a specific boundary condition:

$$f(x) = -C^2\sin(Cx), \quad x \in [0, 1], \qquad g(x) = 0, \quad x = 0, 1$$

The defined loss function as well as the analytical solution of Poisson's equation are shown in Appendix G. We trained a network of two hidden layers of width 500, with $N_r = 100$ and $N_b = 100$. Of note, Wang et al. (2022) showed that during the training process, the NTK of networks with a width of 500 practically stayed constant.

Our numerical experiments confirmed that for a relatively small value of $C = 5\pi$ (where the effect of the spectral bias is not significant), after 55000 epochs, models trained via all three algorithms could accurately estimate the solution (Fig. 2; top panel). The relative error $\|(u - \hat{u})/\hat{u}\|$ for vanilla GD was on the order of $10^{-2}$. The relative errors using GDM and Adam were respectively on the order of $10^{-3}$ and $10^{-4}$ (Fig. 3a). As the parameters of a PDE increase, its solution contains higher-frequency modes, and as a result, convergence becomes more challenging. When $C = 10\pi$, after 55000 epochs, the solution obtained from training the network with vanilla GD was far from the exact solution (Fig. 2; bottom panel), and it exhibited a relative error larger than $10^{-1}$ (Fig. 3b). Meanwhile, the models trained with GDM and Adam had relative errors of $10^{-2}$ and $10^{-4}$, respectively (Fig. 3b).

It is also of interest to investigate the behavior of the training loss function and its decay rate. For the low-frequency case ($C = 5\pi$), the training loss via Adam converged faster; however, all three networks converged after about 30000 epochs (Fig. 4, left panel).
For the high-frequency case ($C = 10\pi$), the training loss via GD had a much slower decay rate, and after 35000 epochs it had not converged. On the other hand, the training loss for both GDM and Adam converged, and the model trained via Adam converged after about 25000 epochs (Fig. 4, right panel). Of note, using vanilla GD and training the model for 180000 epochs resulted in a small relative error on the order of $10^{-2}$ (Fig. G.1). This confirmed our discussion in Section 3.1 that vanilla GD will converge for large parameter values, though due to the presence of high-frequency features it is extremely slow.

4.2. EVOLUTION OF THE SOLUTION DURING LEARNING

Knowing the fundamental difference between GDM and GD, it is of interest to investigate how the solutions evolve (in time) under these two algorithms. Thus, for the Poisson equation with $C = 15\pi$, we have plotted the solutions at epochs 2000, 10000, 20000, 30000, 40000, and 50000 (GD: Fig. 5 and GDM: Fig. 6). For GD, at epoch 10000 the solution has the correct sinusoidal shape; however, it is vertically shifted and clearly cannot satisfy the boundary conditions. Plotting the eigenvalues of $K_{bb}$ (Fig.

4.3. TRANSPORT EQUATION

The transport equation is a hyperbolic PDE that models the concentration of a substance flowing in a fluid. Here, we focus on a one-dimensional transport equation:

$$\frac{\partial u}{\partial t} + \beta\frac{\partial u}{\partial x} = 0, \quad x \in \Omega, \; t \in [0, 1], \qquad g(x) = u(x, 0), \quad x \in \Omega$$

where $\beta$ is a control parameter (independent of $x$ and $t$). To facilitate later comparisons with Krishnapriyan et al. (2021), we used the same network architecture as their study: $N_b = 100$, $N_r = 1500$, and a 4-layer network. We also chose the initial and boundary conditions to be $u(x, 0) = \sin(x)$ and $u(0, t) = u(2\pi, t)$. The defined loss function as well as the analytical solution of the transport equation are shown in Appendix G.

Similar to Poisson's equation, for small $\beta$ values, the models trained via vanilla GD, GDM, and Adam could all easily converge to the solution, and had small relative errors. However, for $\beta = 20$, after 125000 epochs the model trained with vanilla GD still failed to find the solution, and the averaged relative error stood at a large value (on the order of $10^{0}$). In contrast, the model trained via GDM converged to the solution after 55000 epochs, and the estimated solution had a relative error on the order of $10^{-2}$. The estimated solution from training the model via Adam, after only 15000 epochs, had a small relative error also on the order of $10^{-2}$. The exact solution, the estimated solutions (based on the three optimizers), and the absolute difference between the exact and estimated solutions are shown in Fig. 7.

2021) and the adaptive weight approach (hereafter AW) presented in Wang et al. (2022). Of note, C-learning used the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm that mimics second-order optimization, as Krishnapriyan et al. (2021) reported that, for their approach, other optimization algorithms underperformed compared to L-BFGS. To reproduce the results we used the code respectively presented in Krishnapriyan et al. (2021) and Wang et al. (2022).
All networks were trained with the same architecture introduced in Section 4. We tested a 1D Poisson's equation with different values of $C$. The comparison between the relative errors of the estimated solutions is plotted in Fig. 8a. For $C \ge 8\pi$, both AW and C-learning exhibited errors on the order of $10^{-1}$. The errors using Adam were at most $10^{-3}$. For the 1D transport equation with initial condition $\sin^2(x)$ and for different values of $\beta \in [2, 20]$, the relative errors of the estimated solutions using the mentioned methods are plotted in Fig. 8b. As the values of $\beta$ became larger, the estimated solutions from C-learning and AW became less accurate. C-learning for $\beta \ge 15$ and AW for $\beta \ge 10$ had relative errors of the estimated solutions larger than $10^{-1}$. Furthermore, using the C-learning methodology, in order to solve the transport equation for $\beta \ge 5$, we had to train models for $\beta = 9, 11, 13, 15, 17, 18, 19, 20$ such that the optimized weights for the previous β 8d).

Although C-learning also showed a fast training loss convergence, the relative error of the estimate from C-learning was large (see Fig. 8b). This is not surprising, as the loss landscape for PINNs of PDEs with high-frequency modes is compound (Krishnapriyan et al., 2021), and it contains many saddle points which attract second-order optimization algorithms (Dauphin et al., 2014). Our observation is consistent with Markidis (2021), who also observed this downside of L-BFGS and recommended using Adam for at least the first few epochs. It is also important to mention that, unlike the other methods, the weight initialization for C-learning was not random. Instead, it was based on the optimal solution achieved for lower-frequency modes, which requires extra training.

5. CONCLUSION

In the present study, through the lens of NTKs, we examined the dynamics of training PINNs via GDM. We also showed that under GDM the convergence rate of low-frequency features becomes slower (analogous to under-damped oscillation) and many high-frequency features undergo much faster convergence dynamics (analogous to over-damped or critically-damped oscillations). Thus, the effect of spectral bias becomes less prominent. Moreover, we discussed how training a PINN via Adam can even further accelerate convergence, and showed it can be much faster than GDM. Although we analyzed the theoretical dynamics of convergence by assuming an infinitely-wide network, our experiments confirmed that the estimated solutions obtained from models trained via GDM and Adam had high accuracy, using finite and practical widths.

where $H(w_t) = \nabla^2 L_b(w_t) + \nabla^2 L_r(w_t)$ (Wang et al., 2021). Following the approach of Wang et al. (2021), let $Q$ be an orthogonal matrix diagonalizing $H(w_t)$ and $v = \nabla L(w_t)/\|\nabla L(w_t)\|$. With $y = Qv$, we have the following:

$$L(w_{t+1}) \approx L(w_t) - \eta\|\nabla L(w_t)\|^2 + \frac{\eta^2}{2}\|\nabla L(w_t)\|^2\left(\sum_{i}^{N_b}\lambda_{b_i} y_i^2 + \sum_{i}^{N_r}\lambda_{r_i} y_i^2\right)$$
$$\lessapprox L(w_t) - \eta\|\nabla L(w_t)\|^2 + \frac{\eta^2}{2}\|\nabla L(w_t)\|^2\left(\lambda_{b_{\max}} + \lambda_{r_{\max}}\right)$$

where the $\lambda_{b_i}$ and $\lambda_{r_i}$ are the respective eigenvalues of the Hessians of $L_b$ and $L_r$ ordered non-decreasingly, and the summation is taken over the components of $y$. From here, fixing any bound $B$ such that $\lambda_{b_{\max}} + \lambda_{r_{\max}} \le B$, we obtain the following inequality:

$$L(w_{t+1}) \lessapprox L(w_t) - \eta\|\nabla L(w_t)\|^2\left(1 - \frac{\eta B}{2}\right)$$

Observe that if $\eta = 1/B$, we can further simplify this to:

$$L(w_{t+1}) \lessapprox L(w_t) - \eta\|\nabla L(w_t)\|^2$$

Since our NN satisfies the $\mu$-PL$^*$ condition from Eq. 3 due to its width, we therefore have:

$$L(w_{t+1}) \le (1 - \eta\mu)\,L(w_t),$$

which concludes the proof.

B HESSIAN FOR POISSON'S EQUATION

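Proposition B. Let $H$ be the Hessian of the loss function of an infinitely-wide PINN for the one-dimensional Poisson equation defined by Eq. 4 with $u(x) = \sin(Cx)$. Then, we have $H = O(C^4)$.

Proof. The Hessian of the loss on the collocation points is calculated following the methods introduced in Wang et al. (2021), where the gradient of the loss function is:

$$\frac{\partial L_r}{\partial w} = \frac{\partial}{\partial w}\int_0^1\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)^2 dx.$$

Here, $w \in \mathbf{w}$, $u(x)$ is the target solution (admitting some parameter $C$), and $u_w(x)$ is the NN approximation of the output. Assuming that the approximation is a good representation of the actual solution, it can be written as $u_w(x) = u(x)\,\epsilon_w(x)$, where $\epsilon_w(x)$ is a smooth function taking values in $[0, 1]$, such that $|\epsilon_w(x) - 1| < \delta$ for some $\delta > 0$ and $\left\|\frac{\partial^k \epsilon_w(x)}{\partial x^k}\right\| \le \delta$, where $\|\cdot\|$ denotes the $L^\infty$ norm. The Hessian of the loss function will therefore be:

$$\frac{\partial^2 L_r}{\partial w^2} = \frac{\partial^2}{\partial w^2}\int_0^1\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)^2 dx = \frac{\partial I}{\partial w}$$

where $I = \frac{\partial L_r}{\partial w}$, and can be calculated to be:

$$I = 2\,\frac{\partial^2 u_w}{\partial w\,\partial x}\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1 - \frac{\partial u_w}{\partial w}\left(\frac{\partial^3 u_w}{\partial x^3} - \frac{\partial^3 u}{\partial x^3}\right)\Bigg|_0^1 + \int_0^1 \frac{\partial u_w}{\partial w}\left(\frac{\partial^4 u_w}{\partial x^4} - \frac{\partial^4 u}{\partial x^4}\right) dx.$$

The above equation contains 3 terms, which we call $I_1$, $I_2$, and $I_3$ respectively. Note that $\frac{\partial I}{\partial w} = \frac{\partial I_1}{\partial w} + \frac{\partial I_2}{\partial w} + \frac{\partial I_3}{\partial w}$. We calculate these terms as follows:

$$I_1 = 2\,\frac{\partial^2 u_w}{\partial w\,\partial x}\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1$$

$$\frac{\partial I_1}{\partial w} = 2\,\frac{\partial}{\partial w}\left(\frac{\partial^2 u_w}{\partial w\,\partial x}\right)\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1$$
$$= 2\,\frac{\partial}{\partial w}\left(\frac{\partial\left(u'(x)\epsilon_w(x) + u(x)\epsilon'_w(x)\right)}{\partial w}\right)\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1$$
$$= 2\,\frac{\partial}{\partial w}\left(u'(x)\frac{\partial \epsilon_w(x)}{\partial w} + u(x)\frac{\partial \epsilon'_w(x)}{\partial w}\right)\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1$$
$$= 2\left(u'(x)\frac{\partial^2 \epsilon_w(x)}{\partial w^2} + u(x)\frac{\partial^2 \epsilon'_w(x)}{\partial w^2}\right)\left(\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right)\Bigg|_0^1.$$

Note that $|u(x)| \le 1$ and by the chain rule $|u'(x)| \le C$. An application of the triangle inequality then yields:

$$\left|u'(x)\frac{\partial^2 \epsilon_w(x)}{\partial w^2} + u(x)\frac{\partial^2 \epsilon'_w(x)}{\partial w^2}\right| \le C\left|\frac{\partial^2 \epsilon_w(x)}{\partial w^2}\right| + \left|\frac{\partial^2 \epsilon'_w(x)}{\partial w^2}\right|.$$

Using the assumptions on $\epsilon_w(x)$ we get:

$$\left|\frac{\partial^2 u_w}{\partial x^2} - \frac{\partial^2 u}{\partial x^2}\right| \le (C^2 + 2)\,\delta.$$

Combining both of these, we see that $\frac{\partial I_1}{\partial w} = O(C^3)$. Similarly, $\frac{\partial I_2}{\partial w} = O(C^3)$ and $\frac{\partial I_3}{\partial w} = O(C^4)$, which concludes the proof.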
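The origin of the powers of $C$ can be read off from the target function itself. As a short worked step (added for illustration, not part of the original proof), differentiating $u(x) = \sin(Cx)$ repeatedly gives:

```latex
\begin{aligned}
u(x)       &= \sin(Cx),      & |u(x)|       &\le 1,   \\
u'(x)      &= C\cos(Cx),     & |u'(x)|      &\le C,   \\
u''(x)     &= -C^2\sin(Cx),  & |u''(x)|     &\le C^2, \\
u'''(x)    &= -C^3\cos(Cx),  & |u'''(x)|    &\le C^3, \\
u^{(4)}(x) &= C^4\sin(Cx),   & |u^{(4)}(x)| &\le C^4.
\end{aligned}
```

Each additional derivative in $x$ contributes one more factor of $C$, so the term $I_3$, which involves the fourth derivative, carries the highest power of $C$.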

C GRADIENT FLOW FOR GD WITH MOMENTUM

Theorem C. For an infinitely-wide PINN, the gradient flow of GDM is:

$$m\begin{bmatrix}\ddot{u}(x_b, w(t)) \\ D[\ddot{u}](x_r, w(t))\end{bmatrix} = -\mu\begin{bmatrix}\dot{u}(x_b, w(t)) \\ D[\dot{u}](x_r, w(t))\end{bmatrix} - K\begin{bmatrix}u(x_b, w(t)) - g(x_b) \\ D[u](x_r, w(t)) - h(x_r)\end{bmatrix}$$

which is analogous to a point mass $m$ undergoing a damped harmonic oscillation in a viscous medium with a friction coefficient $\mu(\alpha)$ that is a function of $\alpha$. Furthermore, $K$ is defined as:

$$K = \begin{bmatrix}K_{bb} & K_{rb} \\ K_{br} & K_{rr}\end{bmatrix}$$

where:

$$K_{bb}(x, x') = \nabla_w u(w, x)^\top \nabla_w u(w, x')$$
$$K_{br}(x, x') = \nabla_w u(w, x)^\top \nabla_w D[u](w, x')$$
$$K_{rr}(x, x') = \nabla_w D[u](w, x)^\top \nabla_w D[u](w, x')$$

are three NTKs associated with the boundary and residual terms. Moreover, let $\gamma = \mu/2m$, let $\kappa_i$ be the $i$-th eigenvalue of $K$, and let $\kappa'_i = \kappa_i/m$. Then, the solutions to the gradient flow are of the form:

$$A_1 e^{\lambda_{i_1} t} + A_2 e^{\lambda_{i_2} t}, \qquad \lambda_{i_{1,2}} = -\gamma \pm \sqrt{\gamma^2 - \kappa'_i} \tag{7}$$

where $A_1$ and $A_2$ are constants.

Proof. Recall that as the NN becomes wider, the norm of the Hessian becomes smaller, such that in the limit as $N \to \infty$ the norm of the Hessian becomes 0. One immediate consequence of a small Hessian for a NN is that its output can be estimated by a linear function (Lee et al., 2019). The output of a NN can thus be replaced by its first-order Taylor expansion:

$$u^{lin}_t(w) \approx u(w)|_{w_0} + (w - w_0)\,\nabla u(w)|_{w_0}.$$

The update rule for GDM can be written as (Du, 2019):

$$w_{t+1} = w_t + \alpha(w_t - w_{t-1}) - \eta\nabla_w L(w).$$

The discrete updates to the output of the NN become (see Appendix C.1):

$$u^{lin}_{t+1} = u^{lin}_t + \alpha\left(u^{lin}_t - u^{lin}_{t-1}\right) - \eta\nabla_w L(w)\,\nabla_w u(w)|_{w_0}$$

In the rest of this section, for simplicity, we will drop the "lin" superscript. The dynamics of GDM are analogous to the equation of motion of a point mass $m$ undergoing a damped harmonic oscillation (Appendix C.2):

$$m\ddot{u} + \mu\dot{u} - \nabla_w L(w) f(u) = 0$$

where $f(u)$ is a linear function of $u$, and $\mu$ is the friction coefficient that is related to the momentum term in GDM as follows: $\alpha = \frac{m}{m + \mu\Delta t}$ (Qian, 1999). Thus, the gradient flow of $u_t$ and $D[u](x, w(t))$ can be written as:

$$m\ddot{u}(x_b, w(t)) = -\mu\dot{u}(x_b, w(t)) - K_{bb}(x, x')\left(u(x_b, w(t)) - g(x_b)\right) - K_{rb}(x, x')\left(D[u](x_r, w(t)) - h(x_r)\right)$$
$$mD[\ddot{u}](x_r, w(t)) = -\mu D[\dot{u}](x_r, w(t)) - K_{br}(x, x')\left(u(x_b, w(t)) - g(x_b)\right) - K_{rr}(x, x')\left(D[u](x_r, w(t)) - h(x_r)\right). \tag{C.1}$$

As mentioned earlier, in wide NNs, if the last layer of the network is linear then the tangent kernels are static. We write Eq. C.1 in matrix form as follows:

$$m\begin{bmatrix}\ddot{u}(x_b, w(t)) \\ D[\ddot{u}](x_r, w(t))\end{bmatrix} = -\mu\begin{bmatrix}\dot{u}(x_b, w(t)) \\ D[\dot{u}](x_r, w(t))\end{bmatrix} - K\begin{bmatrix}u(x_b, w(t)) - g(x_b) \\ D[u](x_r, w(t)) - h(x_r)\end{bmatrix} \tag{C.2}$$

As $K$ is a positive semi-definite matrix (Wang et al., 2021), Eq. C.2 can be viewed as a set of independent differential equations, each one corresponding to an eigenvalue $\kappa_i$ of the kernel. These give rise to individual general solutions of the form:

$$A_1 e^{\lambda_{i_1} t} + A_2 e^{\lambda_{i_2} t}, \qquad \lambda_{i_{1,2}} = -\gamma \pm \sqrt{\gamma^2 - \kappa'_i}$$

where $A_1$ and $A_2$ are constants, $\gamma = \mu/2m$, and $\kappa'_i = \kappa_i/m$. Of note, $\mu$ and $m$ are set by the user, which in turn define the value of the momentum. The choice of values of $m$ and $\mu$ will determine the rate of decay of the above equation.

C.1 LINEAR NN UPDATES

The Taylor expansions of the outputs at time steps $t + 1$ and $t$ are:

$$u^{lin}_{t+1}(w) = u(w)|_{w_0} + (w_{t+1} - w_0)\,\nabla u(w)|_{w_0}$$
$$u^{lin}_{t}(w) = u(w)|_{w_0} + (w_{t} - w_0)\,\nabla u(w)|_{w_0}.$$

Thus, the difference between the outputs in the interval is:

$$u^{lin}_{t+1}(w) - u^{lin}_t(w) = \nabla u(w)|_{w_0}\,(w_{t+1} - w_t) = \nabla u(w)|_{w_0}\left(\alpha(w_t - w_{t-1}) - \eta\nabla L(w_t)\right)$$

where we used the update rule of GD with momentum. Similarly, we have:

$$u^{lin}_{t}(w) - u^{lin}_{t-1}(w) = \nabla u(w)|_{w_0}\,(w_t - w_{t-1})$$

and:

$$u^{lin}_{t+1}(w) - u^{lin}_t(w) = \nabla u(w)|_{w_0}\left(\alpha\,\frac{u^{lin}_t(w) - u^{lin}_{t-1}(w)}{\nabla u(w)|_{w_0}} - \eta\nabla L(w_t)\right).$$

Thus:

$$u^{lin}_{t+1}(w) = u^{lin}_t(w) + \alpha\left(u^{lin}_t(w) - u^{lin}_{t-1}(w)\right) - \eta\nabla L(w_t) \cdot \nabla u(w)|_{w_0}.$$
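The output-space recursion can be checked numerically on a model that is exactly linear in its parameters. The following is a minimal sketch under that assumption: the model is $u = Xw$ with a squared loss $\tfrac{1}{2}\|u - y\|^2$, so that $\nabla_w L = X^\top(u - y)$ and the kernel is $K = XX^\top$. The matrix `X`, the targets `y`, and the rates `alpha` and `eta` are hypothetical values chosen for illustration.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

# Jacobian du/dw of a model that is exactly linear in w (illustrative values).
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
y = [1.0, -2.0, 0.5]
alpha, eta = 0.9, 0.01

Xt = [list(col) for col in zip(*X)]
# Tangent kernel of the linear model: K = X X^T.
K = [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

w_prev = w = [0.0, 0.0]
u_prev = u = matvec(X, w)
max_gap = 0.0
for _ in range(50):
    # GDM step in parameter space.
    residual_w = [ui - yi for ui, yi in zip(matvec(X, w), y)]
    grad = matvec(Xt, residual_w)
    w_next = [wi + alpha * (wi - wp) - eta * gi
              for wi, wp, gi in zip(w, w_prev, grad)]
    # The same step written purely in output space, using only K.
    k_res = matvec(K, [ui - yi for ui, yi in zip(u, y)])
    u_next = [ui + alpha * (ui - up) - eta * ki
              for ui, up, ki in zip(u, u_prev, k_res)]
    w_prev, w = w, w_next
    u_prev, u = u, u_next
    # Track the largest disagreement between the two descriptions.
    gap = max(abs(a - b) for a, b in zip(u, matvec(X, w)))
    max_gap = max(max_gap, gap)
```

For this linear model the two trajectories agree up to floating-point error, which is the discrete counterpart of replacing the parameter dynamics with kernel-driven output dynamics.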

C.2 RELATION BETWEEN GDM AND DAMPED OSCILLATION

The dynamics of a point mass $m$ undergoing a damped harmonic oscillation with a friction coefficient of $\mu$ are given by:

$$m\ddot{w} + \mu\dot{w} = -\nabla_w L(w) \tag{3}$$

where $\nabla_w L(w)$ is the force field. Qian (1999) showed that Eq. 3 in its discrete form can be written as:

$$m\,\frac{w_{t+\Delta t} + w_{t-\Delta t} - 2w_t}{\Delta t^2} + \mu\,\frac{w_{t+\Delta t} - w_t}{\Delta t} = -\nabla_w L(w).$$

After some algebraic simplifications, the above equation becomes:

$$w_{t+\Delta t} - w_t = \frac{m}{m + \mu\Delta t}\,(w_t - w_{t-\Delta t}) - \frac{\Delta t^2}{m + \mu\Delta t}\,\nabla_w L(w).$$

Clearly, $\frac{m}{m + \mu\Delta t}$ is equivalent to the momentum term in GDM, and $\frac{\Delta t^2}{m + \mu\Delta t}$ can be treated as the learning rate. The dynamics of the output of a wide network (Appendix C.1) under GDM are similar.
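The algebraic simplification above can be verified numerically. The sketch below assumes a scalar parameter and a quadratic loss $L(w) = \tfrac{1}{2}\kappa w^2$. The function names and the values of `m`, `mu`, `dt`, and `kappa` are hypothetical. One function solves the discretized oscillator equation for $w_{t+\Delta t}$ directly, and the other applies the GDM update with the implied momentum and learning rate.

```python
def oscillator_to_gdm(m, mu, dt):
    """Momentum and learning rate implied by the discretized oscillator."""
    alpha = m / (m + mu * dt)
    eta = dt ** 2 / (m + mu * dt)
    return alpha, eta

def oscillator_step(w, w_prev, grad, m, mu, dt):
    """Solve m*(w_next + w_prev - 2w)/dt^2 + mu*(w_next - w)/dt = -grad for w_next."""
    lhs_coeff = m / dt ** 2 + mu / dt
    rhs = (2 * m / dt ** 2 + mu / dt) * w - (m / dt ** 2) * w_prev - grad
    return rhs / lhs_coeff

def gdm_step(w, w_prev, grad, alpha, eta):
    """Standard GDM update: w_next = w + alpha*(w - w_prev) - eta*grad."""
    return w + alpha * (w - w_prev) - eta * grad

m, mu, dt, kappa = 1.0, 0.1, 1.0, 0.5   # illustrative values
alpha, eta = oscillator_to_gdm(m, mu, dt)

wa_prev = wa = 1.0   # trajectory from the discretized oscillator
wb_prev = wb = 1.0   # trajectory from the GDM update
max_diff = 0.0
for _ in range(100):
    wa_prev, wa = wa, oscillator_step(wa, wa_prev, kappa * wa, m, mu, dt)
    wb_prev, wb = wb, gdm_step(wb, wb_prev, kappa * wb, alpha, eta)
    max_diff = max(max_diff, abs(wa - wb))
```

Both trajectories coincide up to rounding, and increasing the friction `mu` lowers the implied momentum `alpha`, in line with $\alpha = m/(m + \mu\Delta t)$.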

D ANALYSIS OF ADAM FOR BAND-LIMITED FUNCTIONS

Adam is a commonly-used optimizer which can be interpreted as GDM adapted for variance (Kingma & Ba, 2014; Balles & Hennig, 2018). Thus, it has the same properties as GDM related to the diminishing of the spectral bias because of the momentum term (see Appendix E for more details). However, because of the adaptive learning rate, it can be expected to be even faster than GDM. Here, we briefly provide more insights into the general behavior of Adam. Optimizing with Adam, we define:

$$g(w) = \sum_{x \in X} \nabla\ell(w; x)/M,$$

where $\ell = L_{b,r}$, $M$ is the batch size and $X$ is the batch. Furthermore, the pointwise variance of $g(w)$ is written as $\sigma^2 = \operatorname{var} g(w)$. We will use subscripts to denote entries, so that $\sigma_i$ is the $i$-th entry of $\sigma$. With $\nabla L$ being the true gradient, at each epoch, Adam updates $w_i$ with a magnitude inversely proportional to an estimator of $\sigma_i^2/\nabla L_i^2$ (Balles & Hennig, 2018). The goal of the training process is to find a weight $w$ which minimizes the loss $L$ across all collocation and boundary points from Eq. 2.

The role that variance adaptation plays in this speedup is readily seen when the solution to a PDE is (or is well-approximated by) a band-limited function. A band-limited function on a finite set of frequencies $K$ has the form:

$$u(x) = \sum_{k \in K} \alpha_k \exp(2\pi i k x).$$

When using GD, the solutions eventually satisfy the following bound with high probability (Basri et al., 2019; Arora et al., 2019):

$$L \lessapprox \frac{2\pi\sum_{k \in K}\alpha_k^2 k^2}{N}.$$

This indicates that a network optimized with GD will learn better approximations of $u(x)$ if lower frequencies $k$ are present (especially when compared to another solution with otherwise identical amplitudes $\alpha_k$). Conversely, if $u(x)$ is primarily high-frequency then $L$ has very weak bounds. Since our NN is infinitely-wide, it satisfies the $\mu$-PL$^*$ condition $\|\nabla L\|^2 \ge \mu L$ of Eq. 3. This implies similarly weak bounds on $\|\nabla L\|$ and thus on $\nabla L_i$ (since $\mu$ is constant). Furthermore, convergence requires $\nabla L_i \approx 0$ for all $i$, so if any $\nabla L_i \gg 0$ the network has yet to converge.
However, this is precisely the situation in which Adam accelerates toward the solution fastest, as its updates are largest when $\sigma_i^2 / \nabla L_i^2$ is close to zero. That is, when high-frequency features result in poor bounds on the loss under GD, Adam instead benefits from accelerated convergence. Combined with the inclusion of momentum discussed earlier, this may explain why it outperforms even GDM. Our numerical experiments confirm that for sufficiently wide NNs both GDM and Adam can converge to desirable solutions, but that Adam is even faster than GDM (see Section 4).

that correspond to the large eigenvalues will be learned much faster. These are the eigenvalues corresponding to lower frequencies in the target function, so the network clearly suffers from spectral bias under GD. In contrast, under GDM, the training error for $\lambda_1$ decays following an under-damped oscillation (the red curve in

As discussed in Section 3.2, under GD optimization the total training error decays as $e^{-\kappa_i t}$, where $\kappa_i$ are the eigenvalues of $K$, which comprises $K_{bb}$ (the NTK for boundary and initial data points) and $K_{rr}$ (the NTK for collocation points representing the PDE). Thus, the convergence rate of the training error is governed by the eigenvalues of $K_{bb}$ and $K_{rr}$ together. Consequently, there may be a discrepancy between the magnitudes of the eigenvalues of $K_{bb}$ and $K_{rr}$: if the eigenvalues of one of the two matrices are much larger, the training error for the corresponding components converges much faster. Hence, the network becomes biased toward learning the components corresponding to those eigenvalues first. Fig. F.1 plots the eigenvalues at initialization (in descending order) of $K_{bb}$ (left panel) and $K_{rr}$ (right panel) for Poisson's equation with $C = 15\pi$.
Clearly, the eigenvalues of $K_{rr}$, which represents the PDE, are much larger than the eigenvalues of $K_{bb}$, which represents the initial and boundary conditions (bcs). Hence, not surprisingly, under GD, where the training errors decay as $e^{-\kappa_i t}$, the network learns the general form of the PDE first and is slow to learn the bcs. This bias toward learning the PDE first can be reduced by using GDM. In fact, plotting the individual terms of the loss function (the PDE and bcs terms) confirms that under GD the bcs are learnt slowly (see Fig. F.2): the loss for the bcs (green curve) decays much more slowly than the loss under GDM (blue curve). Under GD, the network is thus much slower in learning the bcs than the PDE term. As explained earlier, this is not surprising, as the eigenvalues of $K_{bb}$ are much smaller than those of $K_{rr}$.

These observations reveal a significant difference between PINNs and regular neural networks. To explain the difference, it is useful to investigate the evolution of the solution under GD for a completely data-driven case. We trained a fully-connected feed-forward neural network with the same architecture as the above PINN. The training dataset contained 3000 synthetic data points generated from the solution of Poisson's equation, and we used the mean squared error loss function. The solutions after 5000 and 50000 epochs are shown in Fig. F.3. As the fully-connected network has no physics-informed regularization term, it has difficulty estimating the correct solution. It nevertheless exhibits spectral bias, as the estimated solutions at both 5000 and 50000 epochs have low-frequency sinusoidal forms. However, unlike for a PINN, there is no vertical shift in the solutions.
Clearly, the physics-informed regularization term helps significantly to resolve the classical spectral bias (the difficulty with high-frequency target functions), so that the correct sinusoidal shape is learned faster. In return, the solutions are worse at satisfying the boundary condition values. This is because the eigenvalues of the boundary kernel represent high-frequency modes.

G LOSS FUNCTION OF EQUATIONS

G.1 LOSS FUNCTION FOR POISSON'S EQUATION

For Poisson's equation, we use
$$f(x) = -C^2 \sin(Cx), \quad x \in [0, 1],$$
$$g(x) = 0, \quad x = 0, 1,$$
and $u(x) = \sin(Cx)$ is used as the exact solution. The corresponding loss function is written as:
$$L(w) := \frac{1}{N_b} \sum_{i=1}^{N_b} \left( \hat{u}(x_b^i, w) - g(x_b^i) \right)^2 + \frac{1}{N_r} \sum_{i=1}^{N_r} \left( \frac{\partial^2 \hat{u}}{\partial x^2}(x_r^i, w) - f(x_r^i) \right)^2,$$
where $\hat{u}$ is the output of the network.
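The Poisson loss above can be made concrete with a small sketch. The following Python snippet is illustrative only: it assumes a one-hidden-layer tanh network, chosen so that the second derivative has a closed form, whereas the networks used in the experiments are deeper and would rely on automatic differentiation. The names `u_hat`, `u_hat_xx`, and `poisson_loss`, as well as the parameter values, are hypothetical and are not taken from the authors' implementation.

```python
import math

def u_hat(x, params):
    """One-hidden-layer tanh network: sum_j a_j * tanh(w_j * x + b_j)."""
    return sum(a * math.tanh(w * x + b) for a, w, b in params)

def u_hat_xx(x, params):
    """Closed-form second derivative of u_hat with respect to x."""
    total = 0.0
    for a, w, b in params:
        t = math.tanh(w * x + b)
        # d^2/dz^2 tanh(z) = -2 * tanh(z) * (1 - tanh(z)^2); chain rule adds w^2
        total += a * w * w * (-2.0 * t * (1.0 - t * t))
    return total

def poisson_loss(params, x_b, x_r, C):
    """Return (boundary term, PDE-residual term) of the loss for u_xx = f."""
    g = lambda x: 0.0                           # boundary values at x = 0, 1
    f = lambda x: -C * C * math.sin(C * x)      # forcing term
    loss_b = sum((u_hat(x, params) - g(x)) ** 2 for x in x_b) / len(x_b)
    loss_r = sum((u_hat_xx(x, params) - f(x)) ** 2 for x in x_r) / len(x_r)
    return loss_b, loss_r

# Hypothetical (a_j, w_j, b_j) triples and point sets, for illustration only.
params = [(0.5, 3.0, -1.0), (-0.3, 1.5, 0.2)]
x_b = [0.0, 1.0]
x_r = [i / 99 for i in range(100)]
loss_b, loss_r = poisson_loss(params, x_b, x_r, C=5 * math.pi)
total_loss = loss_b + loss_r
```

Returning the two terms separately makes it easy to track the boundary and residual losses individually, which is the quantity compared between GD and GDM in Appendix F.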

G.2 LOSS FUNCTION FOR THE TRANSPORT FUNCTION

Using the method of characteristics (Evans, 2010), the transport function has a well-defined analytical solution: $u(x, t) = g(x - \beta t)$. This is used as the exact solution. The corresponding loss function is written as:
$$L(w) := \frac{1}{N_b} \sum_{i=1}^{N_b} \left( \hat{u}(x_b^i, t_b^i, w) - g(x_b^i) \right)^2 + \frac{1}{N_r} \sum_{i=1}^{N_r} \left( \frac{\partial \hat{u}(x_r^i, t_b^i, w)}{\partial t} + \beta \frac{\partial \hat{u}(x_r^i, t_b^i, w)}{\partial x} \right)^2,$$
where $\hat{u}$ is the output of the network.

H EXPERIMENT PLOTS

H.2 REACTION-DIFFUSION EQUATION

A reaction-diffusion equation contains a reaction term and a diffusion term. Its general form is written as:
$$\frac{\partial u}{\partial t} = \nu \Delta u + f(u) \qquad \text{(G.1)}$$
where $u(x, t)$ is the solution describing the density/concentration of a substance, $\Delta$ is the Laplace operator, $\nu$ is a diffusion coefficient, and $f(u)$ is a smooth function describing processes that change the present state of $u$ (for example, birth, death, or a chemical reaction). Here, we assume a one-dimensional equation, where $f(u) = \rho u (1 - u)$. For $\rho$ independent of $x$ and $t$, and with the defined $f(u)$, Eq. G.1 can be solved analytically (Evans, 2010). To estimate the solution using PINNs, the loss function for the 1D reaction-diffusion PDE is written as:
$$L(w) := \frac{1}{N_b} \sum_{i=1}^{N_b} \left( \hat{u}(x_b^i, t_b^i, w) - g(x_b^i) \right)^2 + \frac{1}{N_r} \sum_{i=1}^{N_r} \left( \frac{\partial \hat{u}}{\partial t}(x_r^i, t_b^i, w) - \nu \frac{\partial^2 \hat{u}}{\partial x^2}(x_r^i, t_b^i, w) - \rho \hat{u}(x_r^i, t_b^i) \left( 1 - \hat{u}(x_r^i, t_b^i) \right) \right)^2.$$
For

We further compared the solutions of wide Adam, C-learning, and AW. Similar to the previous experiments, for higher-frequency modes Adam showed superior results. The relative errors of the estimated solutions for different values of $\nu$ and $\rho$ were computed (Fig. 8 (left panel)). Furthermore, when $\nu = 9$ and $\rho = 7$, we observed that the loss under GDM and Adam decayed much faster than under AW, whereas the loss under L-BFGS decayed as fast as under Adam. The reasons are discussed in Section 4.



Figure 1: Damped harmonic oscillators show three different characteristics.

Figure 2: 1D Poisson equation when C = 5π (top panel), and 10π (bottom panel). (a) Vanilla GD after (b) GDM (c) Adam.

Figure 3: Estimated training error for 1D Poisson equation. a) C = 5π. b) C = 10π.

Figure 4: Left panel: training loss when C = 5π. Right panel: training loss when C = 10π.

F.1) confirms that they are much smaller than the eigenvalues of $K_{rr}$, indicating that under GD the boundary and initial conditions are learnt much more slowly. As training continues, the solution becomes less vertically shifted and closer to the boundary condition values. However, the learning process is slow, and even at epoch 50000 the estimate is far from the particular solution. Meanwhile, GDM is much faster: at epoch 10000 the solution already has the correct sinusoidal shape and satisfies one end of the initial conditions. By epoch 20000 the estimated solution has a low error on the order of $10^{-2}$.

Figure 5: The solution in different stages of the learning process via GD. a) 2000 epochs, b) 10000 epochs, c) 20000 epochs, d) 30000 epochs, e) 40000 epochs, and f) 50000 epochs.

Figure 6: The solution in different stages of the learning process via GDM. a) 2000 epochs, b) 10000 epochs, c) 20000 epochs, d) 30000 epochs, e) 40000 epochs, and f) 50000 epochs.

Figure 7: 1D transport equation when β = 20: Top panels: The exact (analytical) solution. Middle panels: The estimated solutions. Bottom panels: The absolute difference between the exact and estimated solutions. (a) Vanilla GD, (b) GDM, (c) Adam.

Figure 8: Top Panel: The relative errors of the estimated solutions of 1D transport and Poisson's equations for different values of β and C using the C-learning, AW, and Adam. Bottom Panel: the training loss of different methods.

Fig. E.1b), while the training error for $\lambda_2$ follows an over-damped oscillation (the blue curve in Fig. E.1a). The training errors for both eigenvalues, the large $\lambda_1$ and the small $\lambda_2$, decay at approximately the same time. The components of the target function that correspond to the larger eigenvalue are thus not learned any faster, so the effect of spectral bias is minimized.
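The contrast between the two decay behaviors can be illustrated with a toy computation. The Python sketch below is not a reproduction of Fig. E.1: it assumes the standard discrete heavy-ball form of momentum, $e_{t+1} = e_t - \eta \lambda e_t + \beta (e_t - e_{t-1})$, applied independently to the error along each eigen-direction, and it uses hypothetical values ($\lambda = 1$ and $\lambda = 0.01$, $\eta = 0.5$, $\beta = 0.9$) rather than the $\lambda_1$ and $\lambda_2$ of Appendix F.1, so that the iteration is stable and finishes in a few hundred steps. For these particular values both modes happen to be under-damped.

```python
def gd_errors(lam, eta, steps):
    """Error along one eigen-direction under vanilla GD: e <- (1 - eta*lam) * e."""
    e = 1.0
    history = [e]
    for _ in range(steps):
        e = e - eta * lam * e
        history.append(e)
    return history

def gdm_errors(lam, eta, beta, steps):
    """Error along one eigen-direction under heavy-ball momentum (assumed form)."""
    e_prev, e = 1.0, 1.0  # start at rest: zero initial velocity
    history = [e]
    for _ in range(steps):
        e_next = e - eta * lam * e + beta * (e - e_prev)
        e_prev, e = e, e_next
        history.append(e)
    return history

# Illustrative values only; not the eigenvalues used in the paper.
eta, beta, steps = 0.5, 0.9, 200
lam_large, lam_small = 1.0, 0.01

for name, lam in [("large", lam_large), ("small", lam_small)]:
    gd_final = abs(gd_errors(lam, eta, steps)[-1])
    gdm_final = abs(gdm_errors(lam, eta, beta, steps)[-1])
    print(f"{name} eigenvalue: |error| GD = {gd_final:.3e}, GDM = {gdm_final:.3e}")
```

Under these assumptions, vanilla GD leaves a sizeable residual error along the small-eigenvalue direction after 200 steps, while with momentum the errors along both directions shrink to a comparable, much smaller level.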

Figure E.1: Comparison of the error decay for a small and a large eigenvalue: a) decay during the GD optimization process, b) decay during the GDM optimization process.

Figure F.1: Left panel: Eigenvalues for K bb . Right panel: Eigenvalues for K rr

Figure F.3: The completely data-driven solution of the Poisson equation based on GD optimization process. a) after 5000 epochs, b) after 50000 epochs.

Figure G.1: The estimation of the solution of the Poisson's equation for C = 15π, using the vanilla GD algorithm after 180000 epochs.

consistency with Krishnapriyan et al. (2021), we used $N_r = 1000$ and $N_b = 100$. We also chose the initial and boundary conditions $u(x, 0) = \exp\left( -\frac{(x - \pi)^2}{\pi^2 / 2} \right)$ and $u(0, t) = u(2\pi, t)$, respectively. Similar to the previous experiments, for larger choices of $\nu$ and $\rho$ the model trained via vanilla GD (after 85000 epochs) had difficulty converging to the solution (Fig. G.3, left panel). However, models trained via GDM and Adam (after 45000 epochs) could provide solutions with low error. The plots of the estimated and exact solutions for the three algorithms when $\nu = 3$ and $\rho = 5$ are shown in Fig. G.2.

Figure G.2: 1D reaction-diffusion equation for ν = 3 and ρ = 5. The solutions are obtained by training a 4-layer network with width=500 at each layer. Top panels: The exact (analytical) solution. Middle panels: The estimated solution. Bottom panels: The absolute difference between the exact and estimated solutions. (a) Vanilla GD, (b) GDM, (c) Adam.

Figure G.3: Left Panel: The training losses of the network, when ν = 9 and ρ = 7, via AW, C-learning, wide GDM, and wide Adam are plotted. Right Panel: The relative errors of the estimated solutions for different values of ν and ρ are plotted.

H.3 TRANSPORT FUNCTION: FURTHER COMPARISON WITH DIFFERENT INITIAL CONDITIONS

We implemented different initial conditions for the transport function and ran several experiments. For the initial condition of tanh(x), when β ≥ 6 the solutions estimated by both C-learning and AW showed large relative errors on the order of $10^{-1}$. However, the solution estimated by a wide network trained via Adam had an acceptable relative error on the order of $10^{-2}$ (Fig. G.4). The estimated solutions for all three methods when β = 6 are shown in Fig. G.5. All models used a 4-layer network, and AW and Adam both had a width of 500 neurons. The estimated solutions are based on training the model for 85000 epochs.

Figure G.4: The relative errors of the estimated solutions of 1D transport function, for different values of β using the curriculum learning method (red dashed line), and our approach (blue dashed line) are plotted.

Figure G.5: The estimated solutions of the transport function based on the initial condition of tanh(x) when β = 6: a) The exact solution b) Estimated solution of a wide network trained via Adam c) Estimated solution of C-learning d) Estimated solution of AW

Figure G.6: The relative errors of the estimated solutions of 1D transport function, based on the initial condition of sin(x) cos(x), for different values of β using the curriculum learning method (red dashed line), and our approach (blue dashed line) are plotted.

Figure G.7: The estimation of the solution of the transport function based on the initial condition of sin(x) cos(x) when β = 25: a) The exact solution b) Estimated solution of a wide network trained via Adam c) Estimated solution of C-learning d) Estimated solution of AW

A CONVERGENCE OF WIDE PINNS

Proposition A. Let $\lambda_{b\max}$ and $\lambda_{r\max}$, respectively, be the largest eigenvalues of the Hessians $\nabla^2 L_b(w_t)$ and $\nabla^2 L_r(w_t)$. Consider an infinitely-wide PINN optimized with the following update rule: where $\eta$ is a constant learning rate. Then, provided $\eta = O(1/(\lambda_{b\max} + \lambda_{r\max}))$, the $\mu$-PL$^*$ condition is satisfied and the PINN will converge.

Proof. The loss function at time $t + 1$ can be written as:

E RELATION BETWEEN ADAM AND DAMPED OSCILLATION

It has been observed that Adam yields similar convergence rates to GDM (Kingma & Ba, 2014). Here, we demonstrate that Adam also has dynamics equivalent to a damped oscillator, similar to GDM (Appendix C.2). First, let us briefly recall the traditional update rule for Adam: where $m_t$ and $\nu_t$ are defined as:

Proposition E. The update rule for Adam can be written as: where:

Proof. It is straightforward to show that the update rule can be written as: Inserting $w_t = w_{t-1} - \frac{\eta}{\sqrt{\nu_t} + \epsilon} m_t$ into the above equation shows: By the definition of $P(t)$ and $Q(t)$ we can rewrite Eq. E.1 as: Clearly, Eq. E.2 has the same form as the GDM update rule. Thus, the weight updates for Adam follow the same dynamics as GDM, that of the oscillatory motion of a point mass under friction (from Appendix C.2).

However, based on the updated value of the second moment of the gradient of the loss function, Adam can adjust both the momentum and the learning rate at each iteration. Thus, convergence to the solution under Adam is much faster than under GDM.
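The claim that Adam can be rewritten as a momentum-type update can be checked numerically. The Python sketch below is illustrative and rests on several assumptions: it uses Adam without bias correction (matching the inline update $w_t = w_{t-1} - \frac{\eta}{\sqrt{\nu_t} + \epsilon} m_t$), a hypothetical scalar quadratic loss, and time-varying coefficients `lr_t` and `mom_t` obtained here by solving the previous step for $m_{t-1}$ and substituting it into $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$. These coefficients are not necessarily identical to the $P(t)$ and $Q(t)$ of Proposition E, and the function names are hypothetical.

```python
import math

def adam_trajectory(grad, w0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=50):
    """Run scalar Adam without bias correction; return weights, second moments, gradients."""
    w, m, v = w0, 0.0, 0.0
    ws, vs, gs = [w0], [0.0], [0.0]   # index t holds the values after step t
    for _ in range(steps):
        g = grad(w)                    # gradient at the current weight
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        w = w - eta * m / (math.sqrt(v) + eps)
        ws.append(w)
        vs.append(v)
        gs.append(g)
    return ws, vs, gs

def momentum_form_step(ws, vs, gs, t, eta=0.01, beta1=0.9, eps=1e-8):
    """Predict w_t from w_{t-1} and w_{t-2} via a heavy-ball-style identity (t >= 2)."""
    denom_t = math.sqrt(vs[t]) + eps
    denom_prev = math.sqrt(vs[t - 1]) + eps
    lr_t = eta * (1 - beta1) / denom_t      # time-varying learning rate
    mom_t = beta1 * denom_prev / denom_t    # time-varying momentum coefficient
    return ws[t - 1] - lr_t * gs[t] + mom_t * (ws[t - 1] - ws[t - 2])

# Hypothetical quadratic loss L(w) = 1.5 * w^2, so grad L(w) = 3 * w.
ws, vs, gs = adam_trajectory(lambda w: 3.0 * w, w0=1.0)
max_gap = max(abs(momentum_form_step(ws, vs, gs, t) - ws[t]) for t in range(2, len(ws)))
print("largest mismatch between Adam and its momentum form:", max_gap)
```

In this form the update has the same structure as heavy-ball momentum, $w_t = w_{t-1} - \text{lr}_t\, g_t + \text{mom}_t (w_{t-1} - w_{t-2})$, except that both coefficients change at every step with the running second moment.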

F DECAY COMPARISON BETWEEN GDM AND GD

F.1 NUMERICAL EVALUATION OF DECAY RATE FOR GD AND GDM

Here, the dynamics of the error decay of GDM and vanilla GD under a large eigenvalue ($\lambda_1 = 10^3$) and a relatively small eigenvalue ($\lambda_2 = 10^{-5}$) are plotted. For vanilla GD, the training error decays at a rate of $e^{-\lambda_i t}$. As shown in Fig. E.1a, the training error decays slowly for $\lambda_2$ but very fast for $\lambda_1$ (almost immediately dropping to zero). Thus, components of the target function

