SOBOLEV TRAINING FOR THE NEURAL NETWORK SOLUTIONS OF PDES

Abstract

Approximating the numerical solutions of partial differential equations (PDEs) using neural networks is a promising application of deep learning. The smooth architecture of a fully connected neural network is appropriate for finding the solutions of PDEs; the corresponding loss function can also be designed intuitively and guarantees convergence for various kinds of PDEs. However, the rate of convergence has been considered a weakness of this approach. This paper introduces a novel loss function for training neural networks to find the solutions of PDEs, making the training substantially more efficient. Inspired by recent studies that incorporate derivative information into the training of neural networks, we develop a loss function that guides a neural network to reduce the error in the corresponding Sobolev space. Surprisingly, a simple modification of the loss function can make the training process similar to Sobolev Training, even though solving PDEs with neural networks is not a fully supervised learning task. We provide several theoretical justifications for this approach for the viscous Burgers equation and the kinetic Fokker-Planck equation. We also present several simulation results, which show that, compared with the traditional $L^2$ loss function, the proposed loss function guides the neural network to significantly faster convergence. Moreover, we provide empirical evidence that the proposed loss function, together with iterative sampling techniques, performs better in solving high-dimensional PDEs.

1. INTRODUCTION

Deep learning has achieved remarkable success in many scientific fields, including computer vision and natural language processing. In addition to engineering, deep learning has been successfully applied to the field of scientific computing. In particular, the use of neural networks for the numerical integration of partial differential equations (PDEs) has emerged as an important new application of deep learning. Being universal approximators (Cybenko, 1989; Hornik et al., 1989; Li, 1996), neural networks can approximate solutions of complex PDEs. To find the neural network solution of a PDE, a neural network is trained on the domain wherein the PDE is defined. Training a neural network comprises feeding the input data through a forward pass and minimizing a predefined loss function with respect to the network parameters through a backward pass. In the traditional supervised learning setting, the loss function is designed to guide the neural network to generate the same output as the target data for a given input. However, when solving PDEs using neural networks, the target values that correspond to the analytic solution are not available. One possible way to guide the neural network to produce the same output as the solution of the PDE is to penalize the neural network so that it satisfies the PDE itself (Sirignano & Spiliopoulos, 2018; Berg & Nyström, 2018; Raissi et al., 2019; Hwang et al., 2020). Unlike traditional mesh-based schemes, including the finite difference method (FDM) and the finite element method (FEM), neural networks are inherently mesh-free function approximators. Advantageously, as mesh-free function approximators, neural networks can avoid the curse of dimensionality (Sirignano & Spiliopoulos, 2018) and approximate the solutions of PDEs on complex geometries (Berg & Nyström, 2018). Recently, Hwang et al.
(2020) showed that neural networks can approximate the solutions of kinetic Fokker-Planck equations under various kinds of kinetic boundary conditions as well as several irregular initial conditions. Moreover, they showed that the neural networks automatically approximate macroscopic physical quantities, including the kinetic energy, the entropy, the free energy, and the asymptotic behavior of the solutions. Further issues, including the inverse problem, were investigated by Raissi et al. (2019); Jo et al. (2020). Although the neural network approach can solve several complex PDEs in various kinds of settings, it generally requires a relatively high computational cost compared with traditional mesh-based schemes. To resolve this issue, we propose a novel loss function using Sobolev norms in this paper. Inspired by a recent study that incorporated derivative information into the training of neural networks (Czarnecki et al., 2017), we develop a loss function that efficiently guides neural networks to find the solutions of PDEs. We prove that the $H^1$ and $H^2$ norms of the approximation errors converge to zero as our loss functions tend to zero for the 1-D heat equation, the 1-D viscous Burgers equation, and the 1-D kinetic Fokker-Planck equation. Moreover, we show via several simulation results that the number of epochs needed to achieve a certain accuracy is significantly reduced as the order of derivatives in the loss function increases, provided that the solution is smooth. This study might pave the way for overcoming the issue of high computational cost when solving PDEs using neural networks. The main contributions of this work are threefold: 1) We introduce novel loss functions that enable the Sobolev Training of neural networks for solving PDEs. 2) We prove that the proposed loss functions guarantee the convergence of neural networks in the corresponding Sobolev spaces, although this is not a supervised learning task.
3) We empirically demonstrate the effect of Sobolev Training for several regression problems and the improved performance of our loss functions in solving several PDEs, including the heat equation, Burgers' equation, the Fokker-Planck equation, and the high-dimensional Poisson equation.

2. RELATED WORKS

Training neural networks to approximate the solutions of PDEs has been studied intensively over the past decades. For example, Lagaris et al. (1998; 2000) used neural networks to solve ordinary differential equations (ODEs) and PDEs on a predefined set of grid points. Subsequently, Sirignano & Spiliopoulos (2018) proposed a method to solve high-dimensional PDEs by approximating the solution with a neural network. They focused on the fact that traditional finite mesh-based schemes become computationally intractable in high dimensions. Because neural networks are mesh-free function approximators, however, they can solve high-dimensional PDEs by incorporating mini-batch sampling. Furthermore, the authors showed the convergence of the neural network to the solution of quasilinear parabolic PDEs under certain conditions. Recently, Raissi et al. (2019) reported that one can use observed data to solve PDEs using physics-informed neural networks (PINNs). Notably, PINNs can solve a supervised regression problem on observed data while satisfying physical properties given by nonlinear PDEs. A significant advantage of PINNs is that the data-driven discovery of PDEs, also called the inverse problem, is possible with a small change in the code. The authors provided several numerical simulations for various types of nonlinear PDEs, including the Navier-Stokes equation and Burgers' equation. The first theoretical justification for PINNs was provided by Shin et al. (2020), who showed that a sequence of neural networks converges to the solutions of linear elliptic and parabolic PDEs in the $L^2$ sense as the number of observed data increases. There also exists a study aiming to enhance the convergence of PINNs (van der Meer et al., 2020). Additionally, several works have connected deep neural networks with PDEs without directly approximating the solutions of PDEs. For instance, Long et al.
(2018) attempted to discover hidden physics models from data by learning differential operators. A fast iterative PDE solver was proposed by learning to modify each iteration of an existing solver (Hsieh et al., 2019). A deep backward stochastic differential equation (BSDE) solver was proposed and investigated in Weinan et al. (2017); Han et al. (2018) for solving high-dimensional parabolic PDEs by reformulating them as BSDEs. The main strategy of the present study is to leverage derivative information while solving PDEs via neural networks. The authors of Czarnecki et al. (2017) first proposed Sobolev Training, which uses derivative information of the target function when training a neural network through a slight modification of the loss function. They showed that Sobolev Training has lower sample complexity than regular training and is therefore highly efficient in many application fields, such as regression and policy distillation problems. We adapt the concept of Sobolev Training to develop a loss function for the efficient training of neural networks for solving PDEs.
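In the fully supervised setting of Czarnecki et al. (2017), Sobolev Training simply adds a derivative-matching term to the regression loss. The following is a minimal sketch of that idea in PyTorch; the network size, learning rate, and other names are our illustrative assumptions, not the paper's code.

```python
# Minimal Sobolev Training sketch for supervised regression:
# the loss penalizes both value error and first-derivative error.
import math
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
x = torch.linspace(0.0, 2.0 * math.pi, 100).reshape(-1, 1)
y, dy = torch.sin(x), torch.cos(x)  # target values AND target derivatives

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
history = []
for _ in range(200):
    opt.zero_grad()
    xg = x.clone().requires_grad_(True)
    out = net(xg)
    # derivative of the network output w.r.t. its input, via autograd;
    # create_graph=True keeps the loss differentiable in the parameters
    dout = torch.autograd.grad(out.sum(), xg, create_graph=True)[0]
    loss = ((out - y) ** 2).mean() + ((dout - dy) ** 2).mean()  # "H1 loss"
    loss.backward()
    opt.step()
    history.append(loss.item())
```

The key contrast with solving PDEs, developed in the next section, is that here the derivative labels `dy` are given; for PDEs they are not.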

3. LOSS FUNCTION

We consider the following Cauchy problem of PDEs:

$$P u = f, \quad (t, x) \in [0, T] \times \Omega, \qquad (3.1)$$
$$I u = g, \quad (t, x) \in \{0\} \times \Omega, \qquad (3.2)$$
$$B u = h, \quad (t, x) \in [0, T] \times \partial\Omega, \qquad (3.3)$$

where $P$ denotes a differential operator; $I$ and $B$ denote the initial and boundary operators, respectively; and $f$, $g$, and $h$ denote the inhomogeneous term and the initial and boundary data, respectively. In most studies that reported neural network solutions of PDEs, a neural network was trained on uniformly sampled grid points $\{(t_i, x_j)\}_{i,j=1}^{N_t, N_x} \subset [0, T] \times \Omega$, which were completely determined before training. One of the most intuitive ways to make the neural network satisfy PDEs (3.1)-(3.3) is to minimize the following loss functional:

$$\mathrm{Loss}(u_{nn}; p) = \|P u_{nn} - f\|^p_{L^p([0,T]\times\Omega)} + \|I u_{nn} - g\|^p_{L^p(\Omega)} + \|B u_{nn} - h\|^p_{L^p([0,T]\times\partial\Omega)},$$

where $u_{nn}$ denotes the neural network and $p = 1$ or $2$, as these have been the most commonly used exponents in regression problems in previous studies. Evidently, an analytic solution $u$ satisfies $\mathrm{Loss}(u) = 0$, and thus one can regard a neural network that makes $\mathrm{Loss}(u_{nn}) = 0$ as a possible solution of PDEs (3.1)-(3.3). This statement is in fact proved for second-order parabolic equations with the Dirichlet boundary condition in Jo et al. (2020), and for the Fokker-Planck equation with inflow and specular reflective boundary conditions in Hwang et al. (2020). Both proofs are based on the following inequality:

$$\|u - u_{nn}\|_{L^\infty(0,T;L^2(\Omega))} \le C\,\mathrm{Loss}(u_{nn}; 2),$$

for some constant $C$, which states that minimizing the loss functional implies minimizing the approximation error. The main concept behind Sobolev Training is to minimize both the error between the output and the target function and the error between their derivatives. However, unlike the traditional supervised regression problem, neither the target function nor its derivatives are provided while solving PDEs via neural networks.
Thus, special treatment is required to apply Sobolev Training to the problem of solving PDEs with neural networks. In this and the following sections, we propose several loss functions and prove that they guarantee the convergence of the neural network to the solution of a given PDE in the corresponding Sobolev space. Therefore, the proposed loss functions play roles similar to those of the loss in Sobolev Training.

We define the loss functions that depend on the Sobolev norms $W^{k,p}$ and $W^{l,q}$ as follows:

$$\mathrm{Loss}_{GE}(u_{nn}; k, p, l, q) = \Big\| \big\|P(u_{nn}(t, \cdot)) - f(t, \cdot)\big\|^q_{W^{l,q}(\Omega)} \Big\|^p_{W^{k,p}([0,T])}, \qquad (3.4)$$
$$\mathrm{Loss}_{IC}(u_{nn}; l, q) = \big\|I u_{nn}(t, x) - g(x)\big\|^q_{W^{l,q}(\Omega)}, \qquad (3.5)$$
$$\mathrm{Loss}_{BC}(u_{nn}; k, p, l, q) = \Big\| \big\|B u_{nn}(t, \cdot) - h(t, \cdot)\big\|^q_{W^{l,q}(\partial\Omega)} \Big\|^p_{W^{k,p}([0,T])}. \qquad (3.6)$$

Remark 3.1. Here, $\mathrm{Loss}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 0,2,0,2) + \mathrm{Loss}_{IC}(u_{nn}; 0,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2)$ coincides with the traditional $L^2$ loss function employed by Sirignano & Spiliopoulos (2018); Berg & Nyström (2018); Raissi et al. (2019); Hwang et al. (2020).

When we train a neural network, the loss functions (3.4)-(3.6) are computed by Monte-Carlo approximation. Because the grid points are uniformly sampled, the loss functions are approximated as follows:

$$\mathrm{Loss}_{GE}(u_{nn}; k,p,l,q) \approx \frac{T|\Omega|}{N_t N_x} \sum_{|\beta| \le k} \sum_{i=1}^{N_t} \Bigg| \frac{d^\beta}{dt^\beta} \sum_{|\alpha| \le l} \sum_{j=1}^{N_x} \big|D^\alpha P(u_{nn}(t_i, x_j)) - D^\alpha f(t_i, x_j)\big|^q \Bigg|^p,$$
$$\mathrm{Loss}_{IC}(u_{nn}; l,q) \approx \frac{|\Omega|}{N_x} \sum_{|\alpha| \le l} \sum_{j=1}^{N_x} \big|D^\alpha u_{nn}(0, x_j) - D^\alpha g(x_j)\big|^q,$$
$$\mathrm{Loss}_{BC}(u_{nn}; k,p,l,q) \approx \frac{T|\partial\Omega|}{N_t N_B} \sum_{|\beta| \le k} \sum_{i=1}^{N_t} \Bigg| \frac{d^\beta}{dt^\beta} \sum_{|\alpha| \le l} \sum_{x_j \in \partial\Omega} \big|D^\alpha u_{nn}(t_i, x_j) - D^\alpha h(t_i, x_j)\big|^q \Bigg|^p,$$

where $\alpha$ and $\beta$ denote the conventional multi-indices, and $D$ denotes spatial derivatives.
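The derivative terms in these Monte-Carlo sums come from automatic differentiation of the PDE residual itself. As a concrete sketch (our illustration, not the paper's code), take the 1-D heat equation, $P u = u_t - u_{xx}$ with $f = 0$: the $l = 0$ case is the plain $L^2$ residual loss, while $l = 1$ additionally penalizes the spatial derivative of the residual, in the spirit of (3.4).

```python
# Sketch of Loss_GE Monte-Carlo approximation via autograd (illustrative).
import torch

torch.manual_seed(0)
u_nn = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

t = torch.rand(64, 1, requires_grad=True)  # uniformly sampled interior points
x = torch.rand(64, 1, requires_grad=True)

out = u_nn(torch.cat([t, x], dim=1))
u_t = torch.autograd.grad(out.sum(), t, create_graph=True)[0]
u_x = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
r = u_t - u_xx                               # PDE residual P u_nn - f

loss_ge_l0 = (r ** 2).mean()                 # k=0, p=2, l=0, q=2: plain L2 loss
r_x = torch.autograd.grad(r.sum(), x, create_graph=True)[0]
loss_ge_l1 = loss_ge_l0 + (r_x ** 2).mean()  # l=1: add the residual's H^1_x term
```

Note that no labels for the solution or its derivatives appear anywhere: the extra derivative supervision acts on the residual, which is what makes Sobolev-style training possible without a fully supervised target.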

4. THEORETICAL RESULTS

In this section, we theoretically validate our claim that the proposed loss functions guarantee the convergence of the neural network to the solution of a given PDE in the corresponding Sobolev spaces, and that they thus play a role similar to the loss in Sobolev Training while solving PDEs via neural networks. Throughout this section, we denote the strong solution of each equation by $u$, the neural network solution by $u_{nn}$, and the Sobolev spaces $W^{1,2}$ and $W^{2,2}$ by $H^1$ and $H^2$, respectively. All proofs are provided in the Appendix.

4.1. THE HEAT EQUATION AND BURGERS' EQUATION

We define the following three total loss functions for the heat equation and Burgers' equation:

$$\mathrm{Loss}^{(0)}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 0,2,0,2) + \mathrm{Loss}_{IC}(u_{nn}; 0,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2), \qquad (4.1)$$
$$\mathrm{Loss}^{(1)}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 0,2,0,2) + \mathrm{Loss}_{IC}(u_{nn}; 1,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2), \qquad (4.2)$$
$$\mathrm{Loss}^{(2)}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 1,2,0,2) + \mathrm{Loss}_{IC}(u_{nn}; 2,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2). \qquad (4.3)$$

We then obtain the following convergence theorem:

Theorem 4.1. (Proofs are provided in (A.5) for the heat equation and (A.8) for Burgers' equation.) Consider the following 1-D heat and Burgers' equations:

The heat equation:
$$u_t - u_{xx} = 0 \ \text{ in } (0,T] \times \Omega, \quad u(0,x) = u_0(x) \ \text{ on } \Omega, \quad u(t,x) = 0 \ \text{ on } [0,T] \times \partial\Omega;$$

Burgers' equation:
$$u_t + u u_x - \nu u_{xx} = 0 \ \text{ in } (0,T] \times \Omega, \quad u(0,x) = u_0(x) \ \text{ on } \Omega, \quad u(t,x) = 0 \ \text{ on } [0,T] \times \partial\Omega.$$

Provided that $u_{nn}$ is smooth, there hold

$$\max_{0 \le t \le T} \|u(t) - u_{nn}(t)\|_{L^2(\Omega)} \to 0 \ \text{ as } \mathrm{Loss}^{(0)}_{TOTAL} \to 0,$$
$$\operatorname*{ess\,sup}_{0 \le t \le T} \|u(t) - u_{nn}(t)\|_{H^1_0(\Omega)} \to 0 \ \text{ as } \mathrm{Loss}^{(1)}_{TOTAL} \to 0,$$
$$\operatorname*{ess\,sup}_{0 \le t \le T} \|u(t) - u_{nn}(t)\|_{H^2(\Omega)} \to 0 \ \text{ as } \mathrm{Loss}^{(2)}_{TOTAL} \to 0.$$
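To see why the first convergence claim is plausible, set $w = u_{nn} - u$ for the heat equation; the following energy-estimate sketch is ours, a simplified version of the argument carried out rigorously in the Appendix, under the simplifying assumption that the boundary error vanishes exactly:

```latex
% w := u_nn - u satisfies w_t - w_xx = P u_nn =: f_r, with w = 0 on the boundary
% (simplifying assumption). Multiply by w and integrate by parts:
\begin{aligned}
\frac{1}{2}\frac{d}{dt}\|w\|_{L^2}^2 + \|w_x\|_{L^2}^2
  &= \int_\Omega f_r\, w\,dx
   \;\le\; \tfrac12\|f_r\|_{L^2}^2 + \tfrac12\|w\|_{L^2}^2, \\
\text{Gr\"onwall:}\quad
\sup_{0\le t\le T}\|w(t)\|_{L^2}^2
  &\le e^{T}\Big(\|w(0,\cdot)\|_{L^2}^2 + \int_0^T\|f_r\|_{L^2}^2\,dt\Big)
   \;\lesssim\; e^{T}\,\mathrm{Loss}^{(0)}_{TOTAL}(u_{nn}).
\end{aligned}
```

The two terms in parentheses are controlled by $\mathrm{Loss}_{IC}$ and $\mathrm{Loss}_{GE}$, respectively, so driving the total loss to zero drives the $L^2$ error to zero; the $H^1$ and $H^2$ claims repeat this argument on differentiated equations.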

4.2. THE FOKKER-PLANCK EQUATION

For the Fokker-Planck equation, we need additional parameters for the new input variable $v$. We define the following two total loss functions for the Fokker-Planck equation:

$$\mathrm{Loss}^{(0;FP)}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 0,2,0,2,0,2) + \mathrm{Loss}_{IC}(u_{nn}; 0,2,0,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2,0,2), \qquad (4.4)$$
$$\mathrm{Loss}^{(1;FP)}_{TOTAL}(u_{nn}) = \mathrm{Loss}_{GE}(u_{nn}; 0,2,1,2,1,2) + \mathrm{Loss}_{IC}(u_{nn}; 1,2,1,2) + \mathrm{Loss}_{BC}(u_{nn}; 0,2,0,2,0,2). \qquad (4.5)$$

We then have the following convergence theorem:

Theorem 4.2. (Proofs are provided in (A.10) and (A.12).) Consider the 1-D Fokker-Planck equation with the periodic boundary condition:

$$u_t + v u_x - \beta (v u)_v - q u_{vv} = 0, \quad \text{for } (t,x,v) \in [0,T] \times [0,1] \times \mathbb{R},$$
$$u(0,x,v) = u_0(x,v), \quad \text{for } (x,v) \in [0,1] \times \mathbb{R},$$
$$\partial^\alpha_{t,x,v} u(t,1,v) - \partial^\alpha_{t,x,v} u(t,0,v) = 0, \quad \text{for } (t,v) \in [0,T] \times \mathbb{R}.$$

Under assumptions (A.50) and (A.51), there hold

$$\sup_{0 \le t \le T} \|u(t) - u_{nn}(t)\|_{L^2(\Omega \times [-V,V])} \to 0 \ \text{ as } \mathrm{Loss}^{(0;FP)}_{TOTAL} \to 0,$$
$$\sup_{0 \le t \le T} \|u(t) - u_{nn}(t)\|_{H^1(\Omega; L^2([-V,V]))} \to 0 \ \text{ as } \mathrm{Loss}^{(1;FP)}_{TOTAL} \to 0.$$

Remark 4.3. The theorems in this section imply that the proposed loss functions guarantee the convergence of neural networks in the corresponding Sobolev spaces, thereby coinciding with the main idea of Sobolev Training.

Remark 4.4. The theorems in this section cannot be directly generalized to higher-dimensional cases, because even the 2-dimensional case starts to involve the convexity of the boundary. Although it has been shown that the Fokker-Planck operator is strongly hypoelliptic and that the solutions to the boundary problems are smooth even in the higher-dimensional case, the proof requires a long and rigorous mathematical analysis. For more information, see Hwang et al. (2018; 2019).

Remark 4.5. Because we cannot access the label (which corresponds to the analytic solution) on the interior grid, solving PDEs using a neural network is not a fully supervised problem. Interestingly, by incorporating derivative information in the loss function, the proposed approach enables Sobolev Training even though neither the labels nor the derivatives of the target function are provided.
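A quick structural check on the operator in Theorem 4.2 (our illustration, not a claim from the paper): the velocity part $-\beta(vu)_v - q u_{vv}$ vanishes on the Maxwellian $M(v) = \exp(-\beta v^2/(2q))$, the equilibrium toward which the dynamics relaxes. A finite-difference verification:

```python
# Verify beta*(v M)_v + q*M_vv = 0 for M(v) = exp(-beta v^2 / (2q)),
# i.e. the Maxwellian is annihilated by the velocity part of the operator.
import numpy as np

beta = q = 0.1
M = lambda v: np.exp(-beta * v ** 2 / (2 * q))  # here exp(-v^2/2)

v0, h = 0.7, 1e-3
vM = lambda v: v * M(v)
d_vM = (vM(v0 + h) - vM(v0 - h)) / (2 * h)           # central diff for (v M)_v
M_vv = (M(v0 + h) - 2 * M(v0) + M(v0 - h)) / h ** 2  # central diff for M_vv
residual = abs(beta * d_vM + q * M_vv)
print(residual)  # ≈ 0
```

With $\beta = q$, as in the experiments below, the Maxwellian is simply $\exp(-v^2/2)$; the choice of evaluation point $v_0$ is arbitrary.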

5. EXPERIMENTAL RESULTS

In this section, we provide experimental results for toy examples that comprise several regression problems and for various kinds of differential equations, including the heat equation, Burgers' equation, the kinetic Fokker-Planck equation, and the high-dimensional Poisson equation. We employ a fully connected neural network, which is a natural choice for function approximation, and use the hyperbolic tangent function as the nonlinear activation. Although $\mathrm{ReLU}(x) = \max(0, x)$ is a frequent choice in modern machine learning, it is not appropriate for solving PDEs because the second derivatives of the resulting network vanish. Thanks to automatic differentiation, we can easily compute derivatives of any order of a neural network with respect to the input data despite its compositional structure; see Baydin et al. (2017) and the references therein. We implemented our neural networks using PyTorch, a widely used deep learning library (Paszke et al., 2019). For the numerical experiments, we used a neural network with three hidden layers, i.e., d-256-256-256-1 neurons, where $d$ denotes the input dimension, and we used the ADAM optimizer (Kingma & Ba, 2014), a popular gradient-based optimizer. To test whether our loss functions perform more efficiently than the traditional $L^2$ loss function introduced in Remark 3.1, we kept everything the same except the loss function. We compared the loss functions on the basis of the $L^2(\Omega)$ test error for the toy examples, the absolute relative test error for the high-dimensional Poisson equation, and the $L^\infty(0,T;L^2(\Omega))$ test error for the other PDEs. For each loss function, we recorded the number of epochs required to meet a certain error threshold as well as the test error. To account for the randomness of network initialization, we repeated the training a hundred times; that is, we initialized a hundred different neural networks with uniform initialization and trained them in the same manner.
To compute the test error, we used analytic solutions for the heat equation, Burgers' equation, and the high-dimensional Poisson equation, and a numerical solution from Wollman & Ozizmir (2008) for the kinetic Fokker-Planck equation.
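The architecture above can be transcribed directly in PyTorch; the learning rate and other optimizer hyperparameters are our placeholders, since they are not listed in this excerpt.

```python
# The paper's architecture: d-256-256-256-1 with tanh activations.
import torch

d = 1  # input dimension (e.g., d = 2 for inputs (t, x))
net = torch.nn.Sequential(
    torch.nn.Linear(d, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 1),
)
# ADAM optimizer (Kingma & Ba, 2014); the learning rate here is
# PyTorch's default, a tuning choice rather than the paper's value.
optimizer = torch.optim.Adam(net.parameters())
n_params = sum(p.numel() for p in net.parameters())
```

Note that tanh is smooth, so all higher-order derivatives appearing in the H1 and H2 losses are well defined, unlike with ReLU.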

5.1. TOY EXAMPLES

First, we consider two simple regression problems with target functions $\sin(x)$ and $\mathrm{ReLU}(x)$, respectively. For these toy examples, we define the loss functions as follows:

$$\text{L2 loss} = \|u_{nn}(x) - y(x)\|_2^2,$$
$$\text{H1 loss} = \|u_{nn}(x) - y(x)\|_2^2 + \|u_{nn}'(x) - y'(x)\|_2^2,$$
$$\text{H2 loss} = \|u_{nn}(x) - y(x)\|_2^2 + \|u_{nn}'(x) - y'(x)\|_2^2 + \|u_{nn}''(x) - y''(x)\|_2^2,$$

where $y(x)$ denotes either $\sin(x)$ or $\mathrm{ReLU}(x)$. We uniformly sampled a hundred grid points from $[0, 2\pi]$ for training $\sin(x)$ and, similarly, a hundred grid points from $[-1, 1]$ for training $\mathrm{ReLU}(x)$. We expected the training to become faster as more orders of derivatives are used when training $\sin(x)$ and $\mathrm{ReLU}(x)$. Figure 1 confirms this expectation. Interestingly, even though $\mathrm{ReLU}(x)$ fails to be twice weakly differentiable at only a single point, $x = 0$, the H2 loss does not facilitate the training. To explore the nature of Sobolev Training, we designed more complicated toy examples. Consider the target functions $\sin(kx)$ for $k = 1, 2, \dots, 5$ and $\mathrm{ReLU}(kx) = \max(0, kx)$ for $k = 1, 2, \dots, 10$. As $k$ increases, the target functions and their derivatives exhibit more drastic changes in their values, so these functions are more difficult to learn. We hypothesize that Sobolev Training becomes faster because it provides explicit labels for the derivatives, which makes the drastic changes in the derivatives easier to capture. This is empirically shown to be true in Figure 2. We train neural networks to approximate $\sin(kx)$ and $\mathrm{ReLU}(kx)$ for different $k$ and record the number of training epochs needed to achieve a certain error threshold, which can be regarded as the difficulty of the problem. As one can see in Figure 2, the difficulty changes little when we train with the H1 and H2 losses, while it increases with $k$ when the L2 loss is used. This implies that the difficulty of training barely changes in Sobolev Training even when the target function has stiff changes. The same observations are made when solving PDEs.
The improvements of our loss functions over the L2 loss function are more dramatic for Burgers' equation (which has a stiff solution (Raissi et al., 2019)) than for the heat equation, with the initial condition $f_2$ (which has a higher frequency) than with the initial condition $f_1$ for the Fokker-Planck equation, and as $k$ increases for the high-dimensional Poisson equation (Figure 7).
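The growing difficulty with $k$ can be quantified directly (our illustration): the derivative label supplied by the H1 loss for $\sin(kx)$ is $k\cos(kx)$, whose amplitude grows linearly in $k$, which is exactly the "drastic change" that the explicit derivative supervision makes visible to the network.

```python
# The derivative of sin(kx) is k*cos(kx): its amplitude grows linearly in k.
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 10001)
amps = {k: np.max(np.abs(np.gradient(np.sin(k * x), x))) for k in (1, 3, 5)}
print(amps)  # amplitudes ≈ 1, 3, 5
```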

5.2. THE HEAT EQUATION & BURGERS' EQUATION

We now demonstrate the results of the Sobolev Training of neural networks for solving PDEs. We begin with the 1-D heat equation and Burgers' equation, the latter being the simplest PDE that combines both a nonlinear propagation effect and a diffusive effect. We consider:

The heat equation:
$$u_t - u_{xx} = 0 \ \text{ in } (0,10] \times [0,\pi], \quad u(0,x) = \sin(x) \ \text{ on } [0,\pi], \quad u(t,x) = 0 \ \text{ on } [0,10] \times \{0,\pi\}.$$

Burgers' equation:
$$u_t + u u_x - 0.2\, u_{xx} = 0 \ \text{ in } (0,0.01] \times [0,1], \quad u(0,x) = -\sin(\pi x) \ \text{ on } [0,1], \quad u(t,x) = 0 \ \text{ on } [0,0.01] \times \{0,1\}.$$

The heat equation admits a unique analytic solution $u(t,x) = \sin(x)\exp(-t)$; an analytic solution of Burgers' equation is provided in Basdevant et al. (1986). Although Sirignano & Spiliopoulos (2018) indicated that iterative random sampling reduces the computational cost, we fixed the grid points before training because we aimed to compare the efficiency of our loss function with that of the traditional one. For the heat equation and Burgers' equation, we uniformly sampled the grid points $\{(t_i, x_j)\}_{i,j=1}^{N_t,N_x}$ from $(0,T] \times \Omega$, where $N_t$ and $N_x$ denote the numbers of samples for interior $t$ and $x$, respectively. For the initial and boundary conditions, we sampled the grid points from $\{0\} \times \Omega$ and $[0,T] \times \partial\Omega$, respectively, where $N_B$ denotes the number of grid points in $\partial\Omega$. Here, we set $N_t = N_x = N_B = 31$. The testing data were also uniformly sampled from the domain of the PDEs. The L2, H1, and H2 losses are the Monte-Carlo approximations of (4.1), (4.2), and (4.3), respectively, for the heat equation and Burgers' equation. Because both problems admit smooth solutions, we observed that the H2 loss performed the best, followed by the H1 loss and then the L2 loss, in both accuracy and computation time. We show the corresponding results in Figure 3.
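As a quick sanity check (ours, not from the paper), the stated analytic solution $u(t,x) = \sin(x)e^{-t}$ indeed satisfies $u_t - u_{xx} = 0$, which can be confirmed with central finite differences at any interior point:

```python
# Finite-difference verification that sin(x)*exp(-t) solves the heat equation.
import numpy as np

u = lambda t, x: np.sin(x) * np.exp(-t)

t0, x0, h = 1.0, 1.0, 1e-3
u_t = (u(t0 + h, x0) - u(t0 - h, x0)) / (2 * h)
u_xx = (u(t0, x0 + h) - 2 * u(t0, x0) + u(t0, x0 - h)) / h ** 2
print(abs(u_t - u_xx))  # ≈ 0
```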

5.3. THE FOKKER-PLANCK EQUATION

The kinetic Fokker-Planck equation describes the dynamics of a particle whose behavior is similar to that of a Brownian particle. By hypoellipticity, the Fokker-Planck operator has a strong regularizing effect not only in the velocity variable but also in the temporal and spatial variables. The Fokker-Planck equation has been considered in numerous physical settings, including the Brownian motion described by Ornstein-Uhlenbeck processes. We provide two simulation results for different initial conditions for the 1-D Fokker-Planck equation with the periodic boundary condition. For the Fokker-Planck equation, we adopted the sampling idea of Hwang et al. (2020). Because it is practically difficult to consider the entire space for the variable $v \in \mathbb{R}$, we truncated the space for $v$ to $[-5,5]$. We then uniformly sampled the grid points $\{(t_i, x_j, v_k)\}_{i,j,k=1}^{N_t,N_x,N_v}$ from $(0,T] \times \Omega \times [-5,5]$, where $N_v$ denotes the number of samples for $v$. The grid points for the initial and periodic boundary conditions were sampled accordingly. The truncated equation reads as follows:

$$u_t + v u_x - \beta (v u)_v - q u_{vv} = 0, \quad \text{for } (t,x,v) \in (0,3] \times [0,1] \times [-5,5],$$
$$u(0,x,v) = f(x,v), \quad \text{for } (x,v) \in [0,1] \times [-5,5],$$
$$\partial^\alpha_{t,x,v} u(t,1,v) - \partial^\alpha_{t,x,v} u(t,0,v) = 0, \quad \text{for } (t,v) \in [0,3] \times [-5,5],$$

where $f(x,v)$ is either

$$f_1(x,v) = \frac{\exp(-v^2)}{\int_{-5}^{5} \exp(-v^2)\,dv}, \quad \text{or} \quad f_2(x,v) = \frac{(1+\cos(2\pi x))\exp(-v^2)}{\int_0^1 \int_{-5}^{5} (1+\cos(2\pi x))\exp(-v^2)\,dv\,dx},$$

and $\beta = 0.1$, $q = 0.1$. A numerical solution on the test data was computed by the method of Wollman & Ozizmir (2008) and used for computing the test error. The L2 and H1 losses denote the Monte-Carlo approximations of (4.4) and (4.5), respectively. The values of $N_t$, $N_x$, and $N_v$ were set to 31, and the grid points were uniformly sampled. As expected, a solution of the Fokker-Planck equation could be estimated substantially faster using our loss function in both cases. Detailed results are provided in Figure 4.
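Both initial data $f_1$ and $f_2$ are normalized probability densities on $[0,1] \times [-5,5]$; a midpoint-rule quadrature (our sanity check) confirms unit mass for both:

```python
# Midpoint-rule check that f_1 and f_2 integrate to 1 on [0,1] x [-5,5].
import numpy as np

nx, nv = 400, 1000
dx, dv = 1.0 / nx, 10.0 / nv
x = (np.arange(nx) + 0.5) * dx          # midpoints of [0, 1]
v = -5.0 + (np.arange(nv) + 0.5) * dv   # midpoints of [-5, 5]
X, V = np.meshgrid(x, v, indexing="ij")

Z = np.sum(np.exp(-v ** 2)) * dv        # ≈ ∫_{-5}^{5} e^{-v^2} dv
f1 = np.exp(-V ** 2) / Z
f2 = (1.0 + np.cos(2.0 * np.pi * X)) * np.exp(-V ** 2) / Z

mass1 = np.sum(f1) * dx * dv
mass2 = np.sum(f2) * dx * dv
print(mass1, mass2)  # both ≈ 1
```

Note that $f_2$ shares the same normalizing constant as $f_1$, since $\int_0^1 (1+\cos(2\pi x))\,dx = 1$.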
5.4. THE HIGH-DIMENSIONAL POISSON EQUATION

In this section, we provide empirical results to demonstrate that the proposed loss functions perform satisfactorily when equipped with iterative sampling for solving high-dimensional PDEs; see Sirignano & Spiliopoulos (2018) for more information. Convergence results similar to those of Section 4 for the Poisson equation are given in Section A.4. We consider the following high-dimensional Poisson equation with the Dirichlet boundary condition:

$$-\Delta u = \frac{\pi^2}{4} \sum_{i=1}^{d} \sin\Big(\frac{\pi}{2} x_i\Big) \quad \text{for } x \in \Omega = (0,1)^d, \qquad u = \sum_{i=1}^{d} \sin\Big(\frac{\pi}{2} x_i\Big) \quad \text{on } \partial\Omega.$$
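The analytic solution here is $u(x) = \sum_{i=1}^d \sin(\pi x_i/2)$: each coordinate contributes $-\partial^2_{x_i}\sin(\pi x_i/2) = (\pi/2)^2 \sin(\pi x_i/2)$, so the negative Laplacian equals the stated right-hand side. A finite-difference check (ours) in $d = 3$:

```python
# Verify -Laplacian(u) = (pi^2/4) * sum_i sin(pi x_i / 2) at a random point.
import numpy as np

d = 3
u = lambda x: np.sum(np.sin(np.pi * x / 2.0))
rhs = lambda x: (np.pi ** 2 / 4.0) * np.sum(np.sin(np.pi * x / 2.0))

rng = np.random.default_rng(0)
x0, h = rng.uniform(0.1, 0.9, size=d), 1e-3

lap = 0.0
for i in range(d):
    e = np.zeros(d)
    e[i] = h  # central second difference along coordinate i
    lap += (u(x0 + e) - 2.0 * u(x0) + u(x0 - e)) / h ** 2
err = abs(-lap - rhs(x0))
print(err)  # ≈ 0
```

Because the solution separates across coordinates, the same check passes for any dimension $d$, which is what makes this a convenient high-dimensional benchmark.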

A PROOFS FOR THE THEOREMS IN SECTION 4

We begin with the basic definitions of Sobolev spaces, excerpted from Evans (2010).

Definition A.1. Suppose $u, v \in L^1_{loc}(U)$ and $\alpha$ is a multi-index. We say that $v$ is the $\alpha$-th weak derivative of $u$, written $D^\alpha u = v$, provided

$$\int_U u\, D^\alpha \phi\, dx = (-1)^{|\alpha|} \int_U v\, \phi\, dx,$$

for all test functions $\phi \in C_c^\infty(U)$.

Now we define the Sobolev space. Fix $1 \le p \le \infty$ and let $k$ be a nonnegative integer.

Definition A.2. The Sobolev space $W^{k,p}(U)$ consists of all locally integrable functions $u : U \to \mathbb{R}$ such that for each multi-index $\alpha$ with $|\alpha| \le k$, $D^\alpha u$ exists in the weak sense and belongs to $L^p(U)$. Note that if $p = 2$, we usually write $H^k(U) = W^{k,2}(U)$ $(k = 0, 1, 2, \dots)$.

The Sobolev norms are defined as follows:

Definition A.3. If $u \in W^{k,p}(U)$, we define its norm to be

$$\|u\|_{W^{k,p}(U)} = \begin{cases} \Big(\sum_{|\alpha| \le k} \int_U |D^\alpha u|^p\, dx\Big)^{1/p} & (1 \le p < \infty), \\[4pt] \sum_{|\alpha| \le k} \operatorname{ess\,sup}_U |D^\alpha u| & (p = \infty). \end{cases}$$

Finally, we define a notion of convergence in Sobolev spaces.

Definition A.4. Let $\{u_m\}_{m=1}^\infty, u \in W^{k,p}(U)$. We say $u_m$ converges to $u$ in $W^{k,p}(U)$ provided $\lim_{m\to\infty} \|u_m - u\|_{W^{k,p}(U)} = 0$.

We will show that the following theorem holds.

Theorem A.8. Let $u$ and $u_{nn}$ be strong solutions of (A.5)-(A.6) and (A.7)-(A.8), respectively, on the time interval $[0,T]$. For $w := u_{nn} - u$, the following statements are valid.

(1) There exists a continuous function $F_0 = F_0\big(\|w(0,\cdot)\|_2^2,\ \int_0^T \|f\|_2^2\, dt,\ \int_0^T \|\partial_x u\|_2^2\, dt\big)$ such that
$$\sup_{0 \le t \le T} \|w\|_2^2 + \int_0^T \|\partial_x w\|_2^2\, dt \le F_0 \to 0, \quad \text{as } \|w(0,\cdot)\|_2^2,\ \int_0^T \|f\|_2^2\, dt \to 0.$$

(2) There exists a continuous function $F_1 = F_1\big(\|w(0,\cdot)\|_{H^1}^2,\ \int_0^T \|f\|_2^2\, dt,\ \int_0^T \|\partial_x u\|_2^2\, dt\big)$ such that
$$\sup_{0 \le t \le T} \|w\|_{H^1}^2 + \int_0^T \big(\|\partial_x w\|_{H^1}^2 + \|\partial_t w\|_2^2\big)\, dt \le F_1 \to 0, \qquad (A.9)$$
as $\|w(0,\cdot)\|_{H^1}^2,\ \int_0^T \|f\|_2^2\, dt \to 0$.

(3) There exists a continuous function $F_2 = F_2\big(\|w(0,\cdot)\|_{H^2}^2,\ \int_0^T (\|f\|_2^2 + \|\partial_t f\|_2^2)\, dt,\ \sup_{0 \le t \le T} \|\partial_x u\|_2^2 + \int_0^T \|\partial_t u\|_2^2\, dt\big)$ such that
$$\sup_{0 \le t \le T} \big(\|w\|_{H^2}^2 + \|w_t\|_2^2\big) + \int_0^T \|\partial_x w\|_{H^2}^2\, dt \le F_2 \to 0, \qquad (A.10)$$
as $\|w(0,\cdot)\|_{H^2}^2,\ \int_0^T (\|f\|_2^2 + \|\partial_t f\|_2^2)\, dt \to 0$.
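A concrete instance (ours) of the norm in Definition A.3: the $H^1 = W^{1,2}$ norm of $u(x) = \sin(x)$ on $U = (0, 2\pi)$ satisfies $\|u\|_{H^1}^2 = \int u^2 + \int (u')^2 = \pi + \pi = 2\pi$, since $\int_0^{2\pi}\sin^2 = \int_0^{2\pi}\cos^2 = \pi$. A midpoint-rule quadrature confirms this:

```python
# Numerical evaluation of the squared H^1(0, 2*pi) norm of sin(x).
import numpy as np

n = 100000
dx = 2.0 * np.pi / n
x = (np.arange(n) + 0.5) * dx  # midpoints of (0, 2*pi)
h1_norm_sq = np.sum(np.sin(x) ** 2 + np.cos(x) ** 2) * dx
print(h1_norm_sq, 2.0 * np.pi)  # the two values agree
```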
Remark A.9. By Morrey's embedding theorem and the Poincaré inequality, for $f \in H_0^1(\Omega)$ we have

$$\|f\|_\infty^2 \lesssim \|f\|_2^2 + \|f_x\|_2^2 \lesssim \|f_x\|_2^2. \qquad (A.11)$$

We use (A.11) extensively throughout the proof.

Proof. Subtracting (A.5) from (A.7), we obtain the equations for $w$:

$$w_t - w_{xx} + w w_x + w u_x + u w_x = f \quad \text{in } \Omega, \qquad (A.12)$$
$$w = 0 \quad \text{on } \partial\Omega, \qquad (A.13)$$
$$w(0,\cdot) = g \quad \text{in } \Omega. \qquad (A.14)$$

Multiplying (A.12) by $w$ and integrating by parts in $\Omega$, we bound $\frac{1}{2}\frac{d}{dt}\|w\|_2^2 + \|w_x\|_2^2$ (A.15) for any small $\epsilon > 0$. Applying estimates (A.25)-(A.29) to (A.24), we obtain the inequality

$$\frac{d}{dt}\|w_x\|_2^2 + \|w_{xx}\|_2^2 \lesssim \big(\|w_x\|_2^2 + \|u_x\|_2^2\big)\|w_x\|_2^2 + \|f\|_2^2. \qquad (A.30)$$

It follows from (A.23), (A.30), and the Grönwall inequality that

$$\sup_{0 \le t \le T} \|w_x\|_2^2 \lesssim e^{\int_0^T \|w_x\|_2^2 + \|u_x\|_2^2\, dt}\Big(\|g_x\|_2^2 + \int_0^T \|f\|_2^2\, dt\Big) \lesssim e^\beta e^{F_0}\Big(\|g_x\|_2^2 + \int_0^T \|f\|_2^2\, dt\Big). \qquad (A.31)$$

In a similar way to (A.23), there exists a function $\tilde{F}_1 = \tilde{F}_1(\|f\|_{L^2(0,T;L^2)}, \|g\|_{H^1}, \beta)$ such that

$$\int_0^T \|w_{xx}\|_2^2\, dt \lesssim \tilde{F}_1. \qquad (A.32)$$

Part (2) of Theorem A.8 follows from (A.31), (A.32), and the fact that

$$w_t = f + w_{xx} - w w_x - w u_x - u w_x. \qquad (A.33)$$

Finally, we differentiate (A.12) with respect to $t$ and obtain

$$w_{tt} - w_{xxt} + w_t w_x + w w_{xt} + w_t u_x + w u_{xt} + u_t w_x + u w_{xt} = f_t \quad \text{in } \Omega, \qquad (A.34)$$
$$w_t = 0 \quad \text{on } \partial\Omega, \qquad (A.35)$$
$$w_t(0) = f + g_{xx} - g g_x - g u_{0x} - u_0 g_x \quad \text{in } \Omega. \qquad (A.36)$$

Multiplying (A.34) by $w_t$ and integrating by parts in $\Omega$, we have

$$\frac{1}{2}\frac{d}{dt}\|w_t\|_2^2 + \|w_{xt}\|_2^2 = \int_\Omega f_t w_t - \int_\Omega w_t^2 w_x - \int_\Omega w w_t w_{xt} - \int_\Omega w_t^2 u_x - \int_\Omega w w_t u_{xt} - \int_\Omega u_t w_t w_x - \int_\Omega u w_t w_{xt} + \int_{\partial\Omega} w_t w_{xt} =: \sum_{k=1}^{8} I_k^2. \qquad (A.37)$$

The terms on the right-hand side of (A.37) are estimated, for any small $\epsilon > 0$, by

$$I_1^2 \le \|f_t\|_2 \|w_t\|_2 \lesssim \|f_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.38)$$
$$I_2^2 \le \|w_t\|_\infty \|w_t\|_2 \|w_x\|_2 \lesssim \|w_x\|_2^2 \|w_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.39)$$
$$I_3^2 \le \|w\|_\infty \|w_t\|_2 \|w_{xt}\|_2 \lesssim \|w_x\|_2^2 \|w_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.40)$$
$$I_4^2 \le \|w_t\|_\infty \|w_t\|_2 \|u_x\|_2 \lesssim \|u_x\|_2^2 \|w_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.41)$$
$$I_5^2 + I_6^2 = \int_\Omega w\, w_{xt}\, u_t \le \|w\|_\infty \|u_t\|_2 \|w_{xt}\|_2 \lesssim \|u_t\|_2^2 \|w_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.42)$$
$$I_7^2 \le \|u\|_\infty \|w_t\|_2 \|w_{xt}\|_2 \lesssim \|u_x\|_2^2 \|w_t\|_2^2 + \epsilon\|w_{xt}\|_2^2, \qquad (A.43)$$
$$I_8^2 = 0. \qquad (A.44)$$

Applying estimates (A.38)-(A.44) to (A.37), we obtain

$$\frac{d}{dt}\|w_t\|_2^2 + \|w_{xt}\|_2^2 \lesssim \big(\|w_x\|_2^2 + \|u_x\|_2^2 + \|u_t\|_2^2\big)\|w_t\|_2^2 + \|f_t\|_2^2. \qquad (A.45)$$

It follows from (A.36), (A.45), and the Grönwall inequality that

$$\sup_{0 \le t \le T} \|w_t\|_2^2 \lesssim e^{\int_0^T \|w_x\|_2^2 + \|u_x\|_2^2 + \|u_t\|_2^2\, dt}\Big(\|w_t(0)\|_2^2 + \int_0^T \|f_t\|_2^2\Big) \qquad (A.46)$$
$$\lesssim e^\gamma e^{F_0}\Big(\|f_0\|_2^2 + \|g_{xx}\|_2^2 + \|g_x\|_2^4 + \|u_{0x}\|_2^2 \|g_x\|_2^2 + \int_0^T \|f_t\|_2^2\, dt\Big),$$

where $\gamma = \int_0^T \|u_x\|_2^2 + \|u_t\|_2^2\, dt$. In a similar way to the proof of (2) of Theorem A.8, (3) of Theorem A.8 follows from (A.33) and (A.46). This completes the proof of the theorem.
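Both closing steps above invoke Grönwall's inequality; for reference, the standard differential form used here (see, e.g., Evans (2010)) is:

```latex
% Differential form of Gr\"onwall's inequality:
% if y'(t) \le a(t)\, y(t) + b(t) on [0, T] with a, b \ge 0, then
y(t) \;\le\; e^{\int_0^t a(s)\,ds}\Big( y(0) + \int_0^t b(s)\,ds \Big),
\qquad 0 \le t \le T.
```

In (A.31) and (A.46), $y$ is $\|w_x\|_2^2$ and $\|w_t\|_2^2$, respectively, $a$ collects the norms of $u$ and $w$ appearing in the exponentials, and $b$ collects the forcing terms.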

A.3 THE FOKKER-PLANCK EQUATION

A.3.1 BOUNDARY LOSS DESIGN

Define the loss function for the periodic boundary condition as

$$\mathrm{Loss}_{BC} = \sum_{|\alpha|=1} \int_0^T dt \int_{-5}^{5} dv\, \big|\partial^\alpha_{t,x,v} f_{nn}(t,1,v;m,w,b) - \partial^\alpha_{t,x,v} f_{nn}(t,0,v;m,w,b)\big|^2 \approx \frac{1}{N_{i,k}} \sum_{|\alpha|=1} \sum_{i,k} \big|\partial^\alpha_{t,x,v} f_{nn}(t_i,1,v_k;m,w,b) - \partial^\alpha_{t,x,v} f_{nn}(t_i,0,v_k;m,w,b)\big|^2. \qquad (A.47)$$

A.3.2 THE FOKKER-PLANCK EQUATION IN A PERIODIC INTERVAL

In this section, we introduce an $L^2$ energy method for the Fokker-Planck equation and a regularity inequality for its solutions. Throughout the section, we abuse notation and use both $\partial_z u$ and $u_z$ for the derivative of $u$ with respect to $z$. We consider the Fokker-Planck equation in the periodic interval $[0,1]$:

$$u_t + v u_x - \beta(v u)_v - q u_{vv} = 0, \quad \text{for } (t,x,v) \in [0,T] \times [0,1] \times \mathbb{R},$$
$$u(0,x,v) = u_0(x,v), \quad \text{for } (x,v) \in [0,1] \times \mathbb{R},$$
$$\partial^\alpha_{t,x,v} u(t,1,v) - \partial^\alpha_{t,x,v} u(t,0,v) = 0, \quad \text{for } (t,v) \in [0,T] \times \mathbb{R}, \qquad (A.48)$$

for any 3-dimensional multi-index $\alpha$ such that $|\alpha| \le 1$ and a given initial distribution $u_0 = u_0(x,v)$. Now we consider the equations that the corresponding neural network solution $u_{nn}$ would satisfy:

$$(u_{nn})_t + v(u_{nn})_x - \beta(v u_{nn})_v - q(u_{nn})_{vv} = f, \quad \text{for } (t,x,v) \in [0,T] \times [0,1] \times [-5,5],$$
$$u_{nn}(0,x,v) = g, \quad \text{for } (x,v) \in [0,1] \times [-5,5],$$
$$\sum_{|\alpha|=1} \int_0^T dt \int_{-5}^{5} dv\, \big|(\partial^\alpha_{t,x,v} u_{nn})(t,1,v) - (\partial^\alpha_{t,x,v} u_{nn})(t,0,v)\big|^2 \le L, \qquad (A.49)$$

for any 3-dimensional multi-index $\alpha$ such that $|\alpha| \le 1$ and given $f = f(t,x,v)$, $g = g(x,v)$, and a constant $L > 0$. Suppose that $f$, $g$, and $h$ are $C^1$ functions. Also, we suppose that the a priori solutions $u$ and $u_{nn}$ are sufficiently smooth; indeed, we require them to be in $C^{1,1,2}_{t,x,v}$. For the a priori solutions $u$ and $u_{nn}$ to equations (A.48) and (A.49), we assume that if $|v|$ is sufficiently large, then the smallness condition (A.50) holds for some sufficiently small $\delta > 0$, for $|\alpha| \le 1$ and $\alpha = (0,0,2)$.
Also, suppose that

$$\sup_{t \in [0,T]} \big\|\partial^\alpha_{t,x,v} u(t,\cdot,\pm 5) - \partial^\alpha_{t,x,v} u_{nn}(t,\cdot,\pm 5)\big\|_{L^2_x([0,1])} \le \delta, \qquad \big|\partial^\alpha_{t,x,v} u(t,x,\pm 5)\big|,\ \big|\partial^\alpha_{t,x,v} u_{nn}(t,x,\pm 5)\big| \le C, \qquad (A.51)$$

for some $C < \infty$, for $|\alpha| \le 1$ and $\alpha = (0,0,2)$. Now we introduce the following theorem on the energy estimates:

Theorem A.10. Let $u$ and $u_{nn}$ be the classical solutions to equations (A.48) and (A.49), respectively. Then we have

$$\sup_{0 \le t \le T} \|u_{nn}(t) - u(t)\|_2^2 + 2(q - \varepsilon)\int_0^T \|\partial_v u_{nn}(s) - \partial_v u(s)\|_2^2\, ds \le \big(\|g - u_0\|_2^2 + L^2\big)\exp\Big[\Big(1 + \frac{25\beta^2}{2\varepsilon}\Big)T + \int_0^T \|f(s)\|_2^2\, ds\Big] + 2q\delta C T,$$

for any $\varepsilon \in (0,q)$, where $L, u_0, f, g, \beta, q, m, \delta$, and $C$ are given in equations (A.48)-(A.51).

Proof. Define $w := u_{nn} - u$. Then by (A.48) and (A.49), $w$ satisfies

$$w_t + v w_x - \beta v w_v - q w_{vv} = f, \quad \text{for } (t,x,v) \in [0,T] \times [0,1] \times [-5,5],$$
$$w(0,x,v) = w_0, \quad \text{for } (x,v) \in [0,1] \times [-5,5], \qquad (A.52)$$

where $w_0 := g - u_0$. Multiplying (A.52) by $w$ and integrating with respect to $dx\,dv$, we have

$$\frac{1}{2}\frac{d}{dt}\int_{[0,1]\times[-5,5]} |w|^2\,dx\,dv + \int_{[0,1]\times[-5,5]} v w_x w\,dx\,dv - \int_{[0,1]\times[-5,5]} q w_{vv} w\,dx\,dv = \int_{[0,1]\times[-5,5]} f w\,dx\,dv + \int_{[0,1]\times[-5,5]} \beta v w_v w\,dx\,dv.$$

We then integrate by parts and obtain

$$\frac{1}{2}\frac{d}{dt}\int |w|^2\,dx\,dv + \frac{1}{2}\int_{-5}^{5} v\big(w(t,1,v)^2 - w(t,0,v)^2\big)\,dv + q\int |w_v|^2\,dx\,dv = \int f w\,dx\,dv + \int \beta v w_v w\,dx\,dv + q\int_{[0,1]} w_v(t,x,5)w(t,x,5)\,dx - q\int_{[0,1]} w_v(t,x,-5)w(t,x,-5)\,dx =: I_1 + I_2 + I_3 + I_4.$$

We first define

$$A(t) := \frac{1}{2}\Big|\int_{-5}^{5} v\big(w(t,1,v)^2 - w(t,0,v)^2\big)\,dv\Big|.$$

We now estimate $I_1$-$I_4$ on the right-hand side. By the Hölder and Young inequalities,

$$|I_1| \le \|f\|_2\|w\|_2 \le \tfrac12\|f\|_2^2 + \tfrac12\|w\|_2^2,$$

where we denote $\|h\|_2^2 := \int_{[0,1]\times[-5,5]} |h|^2\,dx\,dv$. Similarly, we observe that

$$|I_2| \le 5\beta\|w_v\|_2\|w\|_2 \le \varepsilon\|w_v\|_2^2 + \frac{25\beta^2}{4\varepsilon}\|w\|_2^2,$$

for a sufficiently small $\varepsilon > 0$, since $|v| \le 5$. By (A.50), we have

$$|I_3 + I_4| \le q\|w_v(t,\cdot,5)\|_{L^2_x}\|w(t,\cdot,5) - w(t,\cdot,-5)\|_{L^2_x} + q\|w_v(t,\cdot,5) - w_v(t,\cdot,-5)\|_{L^2_x}\|w(t,\cdot,-5)\|_{L^2_x} \le 2q\delta C.$$
Altogether, we have
$$\frac{d}{dt}\|w\|_2^2 + 2(q-\varepsilon)\|w_v\|_2^2 \le \|f\|_2^2 + \Big(1+\frac{25\beta^2}{2\varepsilon}\Big)\|w\|_2^2 + A(t) + 2q\epsilon C.$$
We integrate with respect to the temporal variable on $[0,t]$ and obtain
$$\|w(t)\|_2^2 + 2(q-\varepsilon)\int_0^t \|w_v(s)\|_2^2\,ds \le \|w(0)\|_2^2 + \int_0^t \Big( \|f(s)\|_2^2 + \Big(1+\frac{25\beta^2}{2\varepsilon}\Big)\|w(s)\|_2^2 + A(s) + 2q\epsilon C \Big)\,ds.$$
By equation A.49$_3$, we have $\int_0^t A(s)\,ds \le L^2$. Thus, by the Grönwall inequality, we have
$$\|w(t)\|_2^2 + 2(q-\varepsilon)\int_0^t \|w_v(s)\|_2^2\,ds \le \Big( \|w_0\|_2^2 + L^2 + \int_0^t \|f(s)\|_2^2\,ds + 2q\epsilon C t \Big)\exp\Big( \Big(1+\frac{25\beta^2}{2\varepsilon}\Big) t \Big),$$
where $w_0(x,v) = g(x,v) - u_0(x,v)$. This completes the proof of the theorem.

Regarding the derivatives $\partial_t w$ and $\partial_x w$, we can obtain similar estimates as follows.

Corollary A.11. Let $u$ and $u_{nn}$ be the classical solutions to equation A.48 and equation A.49, respectively. Assume that equation A.51 holds. Then, for $z = t$ or $x$, we have
$$\sup_{0\le t\le T} \|\partial_z u_{nn}(t)-\partial_z u(t)\|_2^2 + 2(q-\varepsilon)\int_0^T \|\partial_v \partial_z u_{nn}(s)-\partial_v \partial_z u(s)\|_2^2\,ds \le \Big( \|\partial_z g-\partial_z u_0\|_2^2 + L^2 + \int_0^T \|\partial_z f(s)\|_2^2\,ds + 2q\epsilon C T \Big)\exp\Big( \Big(1+\frac{25\beta^2}{2\varepsilon}\Big) T \Big),$$
for any $\varepsilon \in (0,q)$, where $L, u_0, f, g, \beta, q, m, \epsilon$, and $C$ are given in equation A.48-equation A.51.

Proof. For both $\partial_z = \partial_t$ and $\partial_z = \partial_x$, we apply $\partial_z$ to equation A.52 and obtain
$$(\partial_z w)_t + v(\partial_z w)_x - \beta v(\partial_z w)_v - q(\partial_z w)_{vv} = \partial_z f, \quad \text{for } (t,x,v) \in [0,T]\times[0,1]\times[-5,5],$$
$$(\partial_z w)(0,x,v) = (\partial_z w)_0, \quad \text{for } (x,v)\in[0,1]\times[-5,5], \tag{A.53}$$
where $(\partial_z w)_0 := \partial_z g - \partial_z u_0$. The rest of the proof is the same as that of Theorem A.10 with $\partial_z w$ in place of $w$. This completes the proof.

Finally, we can also obtain the regularity estimates for the derivative $\partial_v w$ as follows.

Theorem A.12. Let $u$ and $u_{nn}$ be the classical solutions to equation A.48 and equation A.49, respectively. Assume that equation A.51 holds.
Then we have
$$\sup_{0\le t\le T} \|\partial_v u_{nn}(t)-\partial_v u(t)\|_2^2 + 2(q-\varepsilon)\int_0^T \|\partial_{vv} u_{nn}(s)-\partial_{vv} u(s)\|_2^2\,ds \le \Big( L + \|\partial_x g-\partial_x u_0\|_2^2 + \|\partial_v g-\partial_v u_0\|_2^2 + \int_0^T \big( \|\partial_x f(s)\|_2^2 + \|\partial_v f(s)\|_2^2 \big)\,ds + 4q\epsilon C T \Big)\exp\Big( \Big(2+\frac{25\beta^2}{2\varepsilon}\Big) T \Big),$$
for any $\varepsilon \in (0,q)$, where $L, u_0, f, g, \beta, q, m, \epsilon$, and $C$ are given in equation A.48-equation A.51.

Proof. We apply $\partial_v$ to equation A.52 and obtain
$$(\partial_v w)_t + w_x + v(\partial_v w)_x - \beta v(\partial_v w)_v - q(\partial_v w)_{vv} = \partial_v f, \quad \text{for } (t,x,v) \in [0,T]\times[0,1]\times[-5,5],$$
$$(\partial_v w)(0,x,v) = (\partial_v w)_0, \quad \text{for } (x,v)\in[0,1]\times[-5,5], \tag{A.54}$$
where $(\partial_v w)_0 := \partial_v g - \partial_v u_0$. Multiplying equation A.54 by $\partial_v w$ and integrating with respect to $dx\,dv$, we have
$$\frac{1}{2}\frac{d}{dt}\int |\partial_v w|^2\,dx\,dv + \int v(\partial_v w)_x (\partial_v w)\,dx\,dv - \int q(\partial_v w)_{vv} (\partial_v w)\,dx\,dv = \int (-w_x + \partial_v f)(\partial_v w)\,dx\,dv + \int \beta v(\partial_v w)_v (\partial_v w)\,dx\,dv,$$
where every unlabeled integral is taken over $[0,1]\times[-5,5]$. Integrating by parts, we obtain
$$\frac{1}{2}\frac{d}{dt}\int |w_v|^2\,dx\,dv + \frac{1}{2}\int_{-5}^{5} v\big( (\partial_v w)^2(t,1,v) - (\partial_v w)^2(t,0,v) \big)\,dv + q\int |w_{vv}|^2\,dx\,dv$$
$$= \int \partial_v f\, w_v\,dx\,dv + \int \beta v\, w_{vv} w_v\,dx\,dv + q\int_0^1 w_{vv}(t,x,5) w_v(t,x,5)\,dx - q\int_0^1 w_{vv}(t,x,-5) w_v(t,x,-5)\,dx - \int w_x w_v\,dx\,dv =: I_1 + I_2 + I_3 + I_4 + I_5.$$
We first define
$$B(t) := \frac{1}{2}\Big| \int_{-5}^{5} v\big( \partial_v w(t,1,v)^2 - \partial_v w(t,0,v)^2 \big)\,dv \Big|.$$
We now estimate $I_1$-$I_5$ on the right-hand side. By the Hölder inequality and Young's inequality, we have
$$|I_1| \le \|\partial_v f\|_2 \|w_v\|_2 \le \frac{1}{2}\|\partial_v f\|_2^2 + \frac{1}{2}\|w_v\|_2^2,$$
where we denote $\|h\|_2^2 := \int_{[0,1]\times[-5,5]} |h|^2\,dx\,dv$. Similarly, since $|v| \le 5$, we observe that
$$|I_2| \le 5\beta \|w_{vv}\|_2 \|w_v\|_2 \le \varepsilon \|w_{vv}\|_2^2 + \frac{25\beta^2}{4\varepsilon}\|w_v\|_2^2,$$
for a sufficiently small $\varepsilon > 0$.
In Figure 5, we plot the test errors against the number of training epochs for different learning rates. We used $10^{-3}$, $10^{-4}$, and $10^{-5}$ as learning rates, and we observe that the H2 loss performs best, followed by the H1 and L2 loss functions.



Figure 1: First row: results for $\sin(x)$; second row: results for $\mathrm{ReLU}(x)$. First column: histograms generated from the repeated training of neural networks on $\sin(x)$ and $\mathrm{ReLU}(x)$. Second column: test $L^2$ errors. Third column: average training time for each loss function to achieve a certain error threshold. Error bars denote standard deviations. The error threshold is set to $10^{-4}$.

Figure 2: The average number of epochs required to reduce the error below $10^{-3}$ increases with $k$ under the L2 loss. In contrast, under the H1 and H2 losses, the required number of epochs increases much more slowly or stays the same as $k$ increases.

Figure 3: First row: results for the heat equation; second row: results for Burgers' equation. First column: histograms for the heat and Burgers' equations, generated from a hundred neural networks for each loss function. Second column: test $L^\infty(0,T;L^2(\Omega))$ errors. Third column: average training time for each loss function to achieve a certain error threshold. Error bars denote standard deviations. The error threshold is set to $10^{-5}$.

Figure 4: First row: results for the $f_1$ initial condition; second row: results for the $f_2$ initial condition. First column: histograms generated from a hundred neural networks for each loss function. Second column: test $L^\infty(0,T;L^2(\Omega))$ errors. Third column: average training time for each loss function to achieve a certain error threshold. Error bars denote standard deviations. The error thresholds for the initial conditions $f_1(x,v)$ and $f_2(x,v)$ are set to $10^{-4}$ and $10^{-3}$, respectively.

$$\|\partial_v w(t)\|_2^2 + 2(q-\varepsilon)\int_0^t \|\partial_{vv} w(s)\|_2^2\,ds \le \Big( L + \|\partial_x g - \partial_x u_0\|_2^2 + \|\partial_v g - \partial_v u_0\|_2^2 + \int_0^t \big( \|\partial_x f(s)\|_2^2 + \|\partial_v f(s)\|_2^2 \big)\,ds + 4q\epsilon C t \Big)\exp\Big( \Big(2+\frac{25\beta^2}{2\varepsilon}\Big) t \Big),$$
where $\partial_v w_0(x,v) = \partial_v g(x,v) - \partial_v u_0(x,v)$. This completes the proof of the theorem.

A.4 THE POISSON EQUATION

We consider the Poisson equation with the Dirichlet boundary condition:
$$-\Delta u = f \ \text{in } \Omega, \qquad u = g \ \text{on } \partial\Omega.$$
Suppose that there exists $\tilde g \in H^2(\Omega)$ such that $\tilde g|_{\partial\Omega} = g$. (A.56) Then the equation can be rewritten as:

$$-\Delta v = \tilde f \ \text{in } \Omega, \qquad v = 0 \ \text{on } \partial\Omega,$$
where $v = u - \tilde g$ and $\tilde f = f + \Delta \tilde g$. Therefore, we may assume the homogeneous Dirichlet boundary condition provided that (A.56) holds. Now, let $u$ be a strong solution of
$$-\Delta u = f \ \text{in } \Omega, \qquad u = 0 \ \text{on } \partial\Omega. \tag{A.57}$$
By subtracting (A.58) from (A.57), we obtain an error equation whose right-hand side corresponds to $\mathrm{Loss}_{GE}(u_{nn}; m, 2)$.

B ADDITIONAL EXPERIMENTAL RESULTS

B.1 DEPENDENCY ON LEARNING RATES

In this subsection, we present several experiments showing that the proposed loss functions generally perform better across different learning rates. We first show the results for Burgers' equation.

Figure 5: Test errors over the course of training for different learning rates.

Figure 6: Test errors over the course of training for different learning rates.

for $x \in \partial\Omega$, where $x = (x_1, x_2, \ldots, x_d) \in \Omega$. One can readily derive the analytic solution $u(x)$ in closed form. Notably, the aforementioned loss functions involve only the variable $x$. Table 1 presents the relative errors on a predefined test set for $d = 10$, $50$, and $100$. Evidently, in all cases, the proposed loss functions outperform the traditional $L^2$ loss function. More detailed experimental results are given in Section B.

Table 1: Average of the relative errors of a hundred neural networks for the high-dimensional Poisson equations. We uniformly sampled 500 data points from $\Omega$ for each epoch and trained the neural networks for 10000 epochs at a learning rate of $10^{-4}$.

We showed that solving PDEs with neural networks can benefit from Sobolev Training by slightly modifying the loss function, although the process of estimating neural network solutions of PDEs is not fully supervised. In addition to the toy examples, which showed the exceptional speed of Sobolev Training, we provided empirical evidence demonstrating that our loss functions expedite training more than the traditional $L^2$ loss function. We believe that this can alleviate the high cost of estimating the neural network solutions of PDEs. Moreover, our experiments on high-dimensional problems showed that the proposed loss function performs better when equipped with iterative grid sampling. The histograms in Figures 1-4 indicate that our loss function provides more stable training in that it reduces the variance in the distribution of the number of epochs (e.g., for Burgers' equation, L2 loss: 3651±812, H1 loss: 995±71, and H2 loss: 331±15). Thus, training governed by our loss function is robust to the random initialization of the weights.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035, 2019.
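The iterative sampling used for the high-dimensional experiments (a fresh batch of 500 uniform points from $\Omega$ per epoch) can be sketched as follows. The candidate function and the right-hand side below are illustrative assumptions chosen so that the residual vanishes identically; for a partially trained network, each fresh batch gives an unbiased Monte-Carlo estimate of the squared $L^2$ residual.

```python
import numpy as np

def residual_L2(u_lap, f, d=10, n=500, seed=0):
    # One epoch of the iterative sampling scheme: draw a fresh batch of
    # n uniform points in Omega = [0,1]^d and Monte-Carlo-estimate the
    # squared L2 residual || -Delta u - f ||^2. `u_lap` returns Delta u
    # at the sampled points.
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, (n, d))
    r = -u_lap(x) - f(x)
    return float(np.mean(r ** 2))

# Hypothetical candidate u(x) = sum_i x_i^2, so Delta u = 2d exactly;
# with f = -2d the residual vanishes on every fresh batch.
d = 10
loss = residual_L2(lambda x: np.full(len(x), 2.0 * d),
                   lambda x: np.full(len(x), -2.0 * d), d=d)
print(loss)  # 0.0
```

Resampling a fresh batch every epoch, rather than fixing a grid, is what makes the scheme viable in high dimension: the number of points needed per epoch does not grow with $d$.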
Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806, 2020.

Justin Sirignano and Konstantinos Spiliopoulos. DGM: a deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339-1364, 2018.

Remco van der Meer, Cornelis Oosterlee, and Anastasia Borovykh. Optimally weighted loss functions for solving PDEs with neural networks. arXiv preprint arXiv:2002.06269, 2020.

By equation A.50 and equation A.51, we have
$$|I_3 + I_4| \le q\|w_{vv}(t,\cdot,5)\|_{L^2_x} \|w_v(t,\cdot,5) - w_v(t,\cdot,-5)\|_{L^2_x} + q\|w_{vv}(t,\cdot,5) - w_{vv}(t,\cdot,-5)\|_{L^2_x} \|w_v(t,\cdot,-5)\|_{L^2_x} \le 2q\epsilon C.$$
Finally, we have
$$|I_5| \le \|w_x\|_2 \|w_v\|_2 \le \frac{1}{2}\|w_x\|_2^2 + \frac{1}{2}\|w_v\|_2^2.$$
We then integrate with respect to the temporal variable on $[0,t]$ and obtain
$$\|w_v(t)\|_2^2 + 2(q-\varepsilon)\int_0^t \|w_{vv}(s)\|_2^2\,ds \le \|(\partial_v w)_0\|_2^2 + \int_0^t \Big( \|\partial_v f(s)\|_2^2 + \Big(2+\frac{25\beta^2}{2\varepsilon}\Big)\|w_v(s)\|_2^2 + \|w_x(s)\|_2^2 + B(s) \Big)\,ds + 4q\epsilon C t, \tag{A.55}$$
where $\partial_v w_0(x,v) = \partial_v g(x,v) - \partial_v u_0(x,v)$. We then use Corollary A.11 for an upper bound of $\|\partial_x w(s)\|_2^2$ and, together with the Grönwall inequality, obtain the estimate stated in Theorem A.12.
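The energy arguments in Theorems A.10-A.12 all close with the Grönwall inequality: a differential inequality $y' \le a\,y + b(t)$ with $b \ge 0$ yields $y(t) \le \big(y(0) + \int_0^t b\,ds\big)e^{at}$. As a hedged illustration, the toy check below integrates the borderline ODE $y' = a\,y + b(t)$ by forward Euler and verifies the bound along the way; the coefficients and forcing term are illustrative assumptions.

```python
import numpy as np

def gronwall_check(a=1.5, y0=0.3, T=1.0, n=10000):
    # Integrate the borderline ODE y' = a*y + b(t), b >= 0, by forward
    # Euler and verify the Gronwall bound y(t) <= (y0 + int_0^t b) e^{a t}
    # at every step -- the same closing step as in Theorem A.10.
    dt = T / n
    t, y, B = 0.0, y0, 0.0
    ok = True
    for _ in range(n):
        b = np.sin(t) ** 2              # an arbitrary nonnegative forcing
        y += dt * (a * y + b)           # forward Euler step
        B += dt * b                     # running integral of b
        t += dt
        ok = ok and (y <= (y0 + B) * np.exp(a * t) + 1e-9)
    return bool(ok)

print(gronwall_check())  # True
```

Forward Euler under-approximates the exact solution of this ODE (per-step growth $1 + a\,\Delta t \le e^{a\Delta t}$), so the numerical trajectory stays below the Grönwall bound, as asserted.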

A.1 THE HEAT EQUATION

We denote by $u$ the strong solution of the heat equation
$$u_t - u_{xx} = 0 \ \text{in } (0,T] \times \Omega, \qquad u(0,x) = u_0(x) \ \text{on } \Omega, \qquad u(t,x) = 0 \ \text{on } [0,T] \times \partial\Omega,$$
and by $u_{nn}$ the neural network solution. Then $v = u - u_{nn}$ satisfies
$$v_t - v_{xx} = f \ \text{in } (0,T] \times \Omega, \qquad v(0,x) = g(x) \ \text{on } \Omega, \qquad v(t,x) = 0 \ \text{on } [0,T] \times \partial\Omega, \tag{A.1}$$
for some $f$ and $g$. Here, we can make the boundary values zero by multiplying by $B(x)$, where $B(x)$ is a smooth function vanishing on $\partial\Omega$. Then the estimates (A.2)-(A.4) hold, and by applying them to (A.1), we obtain the results of Theorem 4.1.

Remark A.6. The left-hand sides in (A.2)-(A.4) are the errors of the neural networks in the corresponding norms, and the right-hand sides are the losses (4.1)-(4.3) for the heat equation, respectively. This implies that the proposed loss functions are upper bounds of the errors in the Sobolev spaces, and by minimizing them, we can expect the effect of Sobolev Training when solving PDEs with neural networks. In the rest of this section, we show similar results for Burgers' equation and the Fokker-Planck equation.
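Remark A.6 says that the proposed losses bound the error in Sobolev norms rather than only in $L^2$. The sketch below compares discrete $L^2$ and $H^1$ norms of an oscillatory error profile on a uniform grid; the grid, the profile, and the forward-difference discretization are illustrative assumptions. It shows why an error that is small in $L^2$ can still be large in $H^1$, which is exactly what the H1 loss penalizes.

```python
import numpy as np

def discrete_norms(w, h):
    # Discrete L2 and H1 norms of a grid function w on a uniform mesh:
    # the H1 norm adds the derivative energy, so it always dominates L2.
    l2_sq = h * np.sum(w ** 2)
    wx = np.diff(w) / h                 # forward differences
    h1_sq = l2_sq + h * np.sum(wx ** 2)
    return float(np.sqrt(l2_sq)), float(np.sqrt(h1_sq))

x = np.linspace(0.0, 1.0, 1001)
w = 0.01 * np.sin(40 * np.pi * x)       # small amplitude, large derivative
l2, h1 = discrete_norms(w, x[1] - x[0])
print(l2, h1)   # the H1 norm far exceeds the L2 norm for oscillatory errors
```

An $L^2$-trained network can leave exactly this kind of high-frequency residual error invisible to its loss, whereas the H1 and H2 losses penalize it directly.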

A.2 BURGERS' EQUATION

We consider the strong solution $u$ of the following Burgers equation in a bounded interval $\Omega = [a,b]$:
$$u_t + u u_x - \nu u_{xx} = 0, \tag{A.5}$$
and the corresponding neural network solution $u_{nn}$ satisfying
$$(u_{nn})_t + u_{nn}(u_{nn})_x - \nu (u_{nn})_{xx} = f, \tag{A.6}$$
with the initial data $u(0,\cdot)$ and $u_{nn}(0,\cdot)$, respectively. The following proposition ensures the existence of a strong solution to the initial boundary value problem (A.5)-(A.6); see Benia & Sadallah (2016). Here, we multiply $u_{nn}(t,x)$ by $B(x)$ in order to meet the boundary condition. We write $A \lesssim B$ when $A \le CB$ for a generic constant $C$.

Proposition A.7 (Theorem 1.2 in Benia & Sadallah (2016)). Let $u_0 \in H^1_0$. Then there exists a time $T^* = T^*(u_0) > 0$ such that the problem (A.5)-(A.6) with initial data $u_0$ has a unique solution $u$.

Now we estimate the terms on the right-hand side of (A.15). Applying Young's inequality, Hölder's inequality, the Sobolev inequality, and the Poincaré inequality, together with $w^3(a) = 0$ (A.17), we obtain the estimates (A.16)-(A.20) for any small $\epsilon > 0$. Applying the estimates (A.16)-(A.20) to (A.15) gives the differential inequality (A.21), and the Grönwall inequality then yields the desired bound. Next, we integrate (A.21) between $0$ and $T$ and drop the term $\|w(T)\|_2^2$ on the left-hand side. This completes the proof of (1) of Theorem A.8. Next, by multiplying (A.12) by $-w_{xx}$ and integrating by parts in $\Omega$, we obtain (A.24). Similarly to (A.16)-(A.20), we estimate the terms on the right-hand side of (A.24).
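The loss $\mathrm{Loss}_{GE}$ for Burgers' equation penalizes the pointwise PDE residual in (A.6). As a hedged sketch, this residual can be evaluated for any candidate function with central finite differences; the travelling-wave profile below is a classical exact solution of the viscous Burgers equation, used only to check that the residual of a true solution vanishes. The viscosity $\nu = 0.1$ and the step `h` are illustrative assumptions.

```python
import numpy as np

def burgers_residual(u, t, x, nu=0.1, h=1e-4):
    # Central finite differences for u_t + u u_x - nu u_xx, the residual
    # penalized by Loss_GE for the viscous Burgers equation (A.5)-(A.6).
    ut = (u(t + h, x) - u(t - h, x)) / (2 * h)
    ux = (u(t, x + h) - u(t, x - h)) / (2 * h)
    uxx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
    return ut + u(t, x) * ux - nu * uxx

# Classical travelling-wave solution u(t, x) = 1 - tanh((x - t) / (2 nu)).
nu = 0.1
u_exact = lambda t, x: 1.0 - np.tanh((x - t) / (2 * nu))
max_res = float(np.max(np.abs(burgers_residual(u_exact, 0.5,
                                               np.linspace(-1.0, 1.0, 9),
                                               nu=nu))))
print(max_res)  # dominated by the O(h^2) discretization error
```

For an exact solution the residual is at the level of the discretization error, while for an untrained candidate it is order one; this is precisely the quantity that the $L^2$, H1, and H2 losses measure in progressively stronger norms.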

