MITIGATING PROPAGATION FAILURES IN PINNS USING EVOLUTIONARY SAMPLING

Abstract

Despite the success of physics-informed neural networks (PINNs) in approximating partial differential equations (PDEs), PINNs can sometimes fail to converge to the correct solution in problems involving complicated PDEs. This is reflected in several recent studies on characterizing and mitigating the "failure modes" of PINNs. While most of these studies have focused on balancing loss functions or adaptively tuning PDE coefficients, what is missing is a thorough understanding of the connection between failure modes of PINNs and the sampling strategies used for training PINNs. In this paper, we provide a novel perspective on the failure modes of PINNs by hypothesizing that training PINNs relies on the successful "propagation" of the solution from initial and/or boundary condition points to interior points. We show that PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures. We additionally demonstrate that propagation failures are characterized by highly imbalanced PDE residual fields, where very high residuals are observed over very narrow regions. To mitigate propagation failures, we propose a novel evolutionary sampling (Evo) method that can incrementally accumulate collocation points in regions of high PDE residuals with little to no computational overhead. We provide an extension of Evo that respects the principle of causality while solving time-dependent PDEs. We theoretically analyze the behavior of Evo and empirically demonstrate its efficacy and efficiency in comparison with baselines on a variety of PDE problems.

Under review as a conference paper at ICLR 2023

We show that propagation failures in PINNs are characterized by highly imbalanced PDE residual fields, where very high residuals are observed in narrow regions of the domain. Such high residual regions are not adequately sampled in the set of collocation points (which is generally kept fixed across all training iterations), making it difficult to overcome the propagation failure mode. This motivates us to develop sampling strategies that focus on selecting more collocation points from high residual regions. This is related to the idea of local-adaptive mesh refinement used in FEM (Zienkiewicz et al., 2005) to selectively refine the computational mesh in regions with higher errors. We propose a novel evolutionary sampling (Evo) strategy that can accumulate collocation points in high PDE residual regions, thereby dynamically emphasizing these skewed regions as we progress in training iterations. We also provide a causal extension of our proposed Evo algorithm (Causal Evo) that can explicitly encode the strong inductive bias of causality in propagating the solution from initial points to interior points over training iterations when solving time-dependent PDEs. We theoretically prove the adaptive quality of Evo to accumulate points from high residual regions that persist over iterations. We empirically demonstrate the efficacy and efficiency of our proposed sampling methods on a variety of benchmark PDE problems previously studied in the PINN literature. We show that Evo and Causal Evo are able to mitigate propagation failure modes and converge to the correct solution with significantly smaller sample sizes as compared to baseline methods, while incurring negligible computational overhead. We also demonstrate the ability of Evo to solve a particularly hard PDE problem: solving 2D Eikonal equations for complex arbitrary surface geometries.

The novel contributions of our work are as follows: (1) We provide a novel perspective for characterizing failure modes in PINNs by postulating the "Propagation Hypothesis." (2) We propose a novel evolutionary algorithm, Evo, to adaptively sample collocation points in PINNs, which empirically shows superior prediction performance with little to no computational overhead compared to existing methods for adaptive sampling. (3) We theoretically show that Evo can accumulate points from high residual regions if they persist over iterations, and release points once they have been resolved by PINN training, while maintaining non-zero representation of points from other regions.

1. INTRODUCTION

Physics-informed neural networks (PINNs) (Raissi et al., 2019) represent a seminal line of work in deep learning for solving partial differential equations (PDEs), which appear naturally in a number of domains. The basic idea of PINNs for solving a PDE is to train a neural network to minimize errors w.r.t. the solution provided at initial/boundary points of a spatio-temporal domain, as well as the PDE residuals observed over a sample of interior points, referred to as collocation points. Despite the success of PINNs, it is known that they can sometimes fail to converge to the correct solution in problems involving complicated PDEs, as reflected in several recent studies on characterizing the "failure modes" of PINNs (Wang et al., 2020; 2022c; Krishnapriyan et al., 2021). Many of these failure modes are related to the susceptibility of PINNs to getting stuck at trivial solutions acting as poor local minima, due to the unique optimization challenges of PINNs. In particular, note that training PINNs is different from conventional deep learning problems, as we only have access to the correct solution at the initial and/or boundary points, while for all interior points in the domain, we can only compute PDE residuals. Also note that minimizing PDE residuals does not guarantee convergence to the correct solution, since many commonly observed PDEs admit trivial solutions with zero residuals. While previous studies on understanding and preventing failure modes of PINNs have mainly focused on modifying network architectures or balancing loss functions during PINN training, the effect of sampling collocation points on avoiding failure modes of PINNs has been largely overlooked. Although some previous approaches have explored the effect of sampling strategies on PINN training (Wang et al., 2022a; Lu et al., 2021), they either suffer from large computation costs or fail to converge to correct solutions, as empirically demonstrated in our results.
In this work, we present a novel perspective on the failure modes of PINNs by postulating the propagation hypothesis: "in order for PINNs to avoid converging to trivial solutions at interior points, the correct solution must be propagated from the initial/boundary points to the interior points." When this propagation is hindered, PINNs can get stuck at trivial solutions that are difficult to escape, which we refer to as the propagation failure mode. This hypothesis is motivated by a similar behavior observed in numerical methods for solving PDEs (LeVeque, 2007).

2. BACKGROUND AND RELATED WORK

Physics-Informed Neural Networks (PINNs). The basic formulation of PINNs (Raissi et al., 2017) is to use a neural network f_θ(x, t) to infer the forward solution u of a non-linear PDE:

u_t + N_x[u] = 0, x ∈ X, t ∈ [0, T];    u(x, 0) = h(x), x ∈ X;    u(x, t) = g(x, t), t ∈ [0, T], x ∈ ∂X,

where x and t are the space and time coordinates, respectively, X is the spatial domain, ∂X is the boundary of the spatial domain, and T is the time horizon. The PDE is enforced on the entire spatio-temporal domain (Ω = X × [0, T]) on a set of collocation points {x_r^i = (x_r^i, t_r^i)}_{i=1}^{N_r} by computing the PDE residual R(x, t) and the corresponding PDE loss L_r as follows:

R_θ(x, t) = ∂f_θ(x, t)/∂t + N_x[f_θ(x, t)],    (1)

L_r(θ) = E_{x_r ∼ U(Ω)}[R_θ(x_r)²] ≈ (1/N_r) Σ_{i=1}^{N_r} [R_θ(x_r^i, t_r^i)]²,    (2)

where L_r is the expectation of the squared PDE residuals over collocation points sampled from a uniform distribution U. PINNs approximate the solution of the PDE by optimizing the overall loss function L = λ_r L_r(θ) + λ_bc L_bc(θ) + λ_ic L_ic(θ), where L_ic and L_bc are the mean squared losses on the initial and boundary data, respectively, and λ_r, λ_ic, λ_bc are hyperparameters that control the interplay between the different loss terms. Although PINNs can be applied to inverse problems, i.e., to estimate PDE parameters from observations, we focus only on forward problems in this paper.
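The PDE loss in Eq. (2) is simply a Monte Carlo average of squared residuals over uniformly sampled collocation points. The sketch below (our own illustration, not the paper's code) evaluates this estimate for the convection residual R = u_t + β u_x on the exact traveling-wave solution u(x, t) = sin(x − βt), for which the residual vanishes identically; the names `pde_loss` and `residual_fn` and the domain bounds are illustrative choices.

```python
import numpy as np

def pde_loss(residual_fn, n_r, domain, rng):
    """Monte Carlo estimate of the PDE loss L_r (Eq. 2): the mean squared
    residual over N_r collocation points drawn uniformly from Omega."""
    (x_lo, x_hi), (t_lo, t_hi) = domain
    x = rng.uniform(x_lo, x_hi, n_r)   # spatial coordinates of collocation points
    t = rng.uniform(t_lo, t_hi, n_r)   # temporal coordinates of collocation points
    r = residual_fn(x, t)              # PDE residuals R_theta(x, t)
    return np.mean(r ** 2)

# Illustrative residual: u(x, t) = sin(x - beta * t) solves the convection
# equation u_t + beta * u_x = 0 exactly, so its residual is identically zero.
beta = 2.0

def residual_fn(x, t):
    # u_t = -beta * cos(x - beta * t); u_x = cos(x - beta * t)
    return -beta * np.cos(x - beta * t) + beta * np.cos(x - beta * t)

rng = np.random.default_rng(0)
loss = pde_loss(residual_fn, 1000, ((0.0, 2 * np.pi), (0.0, 1.0)), rng)
```

For the exact solution the estimated loss is zero; a trivial constant-residual field would instead yield the mean of its squared residuals.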
Prior Work on Characterizing Failure Modes of PINNs. Despite the popularity of PINNs in approximating PDEs, several works have emphasized the presence of failure modes while training PINNs. One early work (Wang et al., 2020) demonstrated that imbalance in the gradients of multiple loss terms can lead to poor convergence of PINNs, motivating the development of adaptive PINNs. Another recent development (Wang et al., 2022c) made use of Neural Tangent Kernel (NTK) theory to show that the different convergence rates of the loss terms can lead to training instabilities. Large values of PDE coefficients have also been connected to possible failure modes in PINNs (Krishnapriyan et al., 2021). In another line of work, the tendency of PINNs to get stuck at trivial solutions due to poor initializations has been demonstrated theoretically in Wong et al. (2022) and empirically in Rohrhofer et al. (2022). In all these works, the effect of sampling collocation points on PINN failure modes has largely been overlooked. Although some recent works have explored strategies to grow the representation of collocation points with high residuals, either by modifying the sampling procedure (Wu et al., 2022; Lu et al., 2021; Nabian et al., 2021) or by choosing higher-order L_p norms of the PDE loss (Wang et al., 2022a), they either require a prohibitively dense set of collocation points, and hence are computationally expensive, or suffer from poor convergence to the correct solution, as empirically demonstrated in Section 5. In another recent line of work on Causal PINNs (Wang et al., 2022b), it was shown that traditional approaches for training PINNs can violate the principle of causality for time-dependent PDEs. Hence, an explicit way of incorporating the causal structure into the training procedure was proposed. However, this solution can only be applied to time-dependent PDEs and, as we demonstrate empirically in Section 5, also requires large sample sizes.

3. PROPAGATION HYPOTHESIS: A NEW PERSPECTIVE OF FAILURE MODES IN PINNS

What is Unique About Training PINNs? Training PINNs presents fundamentally different optimization challenges than those encountered in conventional deep learning problems. In a conventional supervised learning problem, the correct solution for every training sample is known, and the training samples are considered representative of the test samples, such that the trained model can easily be extrapolated to closely situated test samples. However, in the context of PINNs, we only have access to the "correct" solution of the PDE on the initial and/or boundary points, while not having any labels for the interior points in the spatio-temporal domain Ω. Note that the interior points in Ω can be quite far away from the initial/boundary points, making extrapolation difficult. Further, training PINNs involves minimizing the PDE residuals over a set of collocation points sampled from Ω. However, minimizing PDE residuals alone is not sufficient to ensure convergence to the correct solution, since there may exist many trivial solutions of a PDE showing very small residuals. For example, u(x, t) = 0 is a trivial solution for any homogeneous PDE, which a neural network is likely to get stuck at in the absence of the correct solution at initial/boundary points. Another unique aspect of training PINNs is that minimizing PDE residuals requires computing the gradients of the output w.r.t. (x, t) (e.g., u_x and u_t). Hence, the solution at a collocation point is affected by the solutions at nearby points, leading to local propagation of solutions.

Propagation Hypothesis. In light of the unique properties of PINNs, we postulate that in order for PINNs to converge to the "correct" solution, the correct solution must propagate from the initial and/or boundary points to the interior points as we progress in training iterations.
We draw inspiration for this hypothesis from a similar behavior observed in numerical methods for solving PDEs, where the solution of the PDE at initial/boundary points is iteratively propagated to interior points using finite differencing schemes (LeVeque, 2007). Figure 1 demonstrates the propagation hypothesis of PINNs for a simple ordinary differential equation (ODE).

Propagation Failure: Why Does It Happen and How Can It Be Diagnosed? As a corollary of the propagation hypothesis, PINNs can fail to converge to the correct solution if the solution at initial/boundary points is unable to propagate to interior points during the training process. We call this phenomenon the "propagation failure" mode of PINNs. It is likely to happen if some collocation points start converging to trivial solutions before the correct solution from initial/boundary points is able to reach them. Such collocation points would also propagate their trivial solutions to nearby interior points, leading to a cascading effect in the learning of trivial solutions over large regions of Ω and further hindering the propagation of the correct solution from initial/boundary points. To diagnose propagation failures, note that the PDE residuals are expected to be low over both types of regions: regions that have converged to the correct solution and regions that have converged to trivial solutions. However, the boundaries between these two types of regions would show a sharp discontinuity in solutions, leading to very high PDE residuals in very narrow regions. A similar phenomenon is observed in numerical methods, where sharp high-error regions disrupt the evolution of the PDE solution in surrounding regions, leading to a cascading of errors. We use the imbalance of high PDE residual regions as a diagnostic tool for characterizing propagation failure modes in PINNs.
To demonstrate propagation failure, let us consider an example PDE, the convection equation: ∂u/∂t + β ∂u/∂x = 0, u(x, 0) = h(x), where β is the convection coefficient and h(x) is the initial condition (see Appendix G for details about this PDE). Previous work (Krishnapriyan et al., 2021) has shown that PINNs fail to converge for this PDE when β > 10. We experiment with two cases, β = 10 and β = 50, in Figure 2. We can see that the PDE loss steadily decreases with training iterations in both cases, but the relative error w.r.t. the ground-truth solution only decreases for β = 10, while for β = 50, it remains flat. This suggests that for β = 50, the PINN is likely getting stuck at a trivial solution that shows low PDE residuals but high errors. To diagnose this failure mode, we plot two additional metrics in Figure 2 that measure the imbalance in high PDE residual regions: the Fisher-Pearson coefficient of Skewness (Kokoska & Zwillinger, 2000) and Fisher's Kurtosis (Kokoska & Zwillinger, 2000) (see Appendix H for computation details). High Skewness indicates a lack of symmetry in the distribution of PDE residuals, while high Kurtosis indicates the presence of a heavy tail. For β = 10, both Skewness and Kurtosis remain relatively small across all iterations, indicating the absence of imbalance in the residual field. However, for β = 50, both metrics shoot up significantly as training progresses, indicating the formation of very high residuals in very narrow regions, a characteristic feature of the propagation failure mode. Figure 3 confirms that this is indeed the case by visualizing the PINN solution and PDE residual maps. We observe similar trends of propagation failure for other values of β > 10 (see Appendix J.1).
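The two diagnostic metrics can be computed directly from a sampled residual field. Below is a minimal sketch (our own helper functions, following the standard moment-based definitions rather than Appendix H) of the Fisher-Pearson coefficient of skewness and Fisher's excess kurtosis; a field with very high residuals confined to a very narrow region scores high on both.

```python
import numpy as np

def skewness(r):
    """Fisher-Pearson coefficient of skewness: m3 / m2^(3/2)."""
    r = np.asarray(r, dtype=float)
    d = r - r.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def kurtosis(r):
    """Fisher's (excess) kurtosis: m4 / m2^2 - 3; heavy tails score high."""
    r = np.asarray(r, dtype=float)
    d = r - r.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0

# A balanced residual field vs. one with very high residuals
# concentrated in a very narrow region (propagation-failure signature).
balanced = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000)) ** 2
spiky = np.zeros(1000)
spiky[:5] = 100.0
```

On the balanced field both metrics stay small, while the spiky field produces large positive skewness and kurtosis, mirroring the β = 50 diagnosis above.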
Potential Remedies. One line of work to address high residual regions is the Residual-based Adaptive Refinement (RAR) method proposed in Lu et al. (2021), where a dense set of collocation points P_dense is maintained to approximate the continuous residual field R(x), and points with high residuals are regularly added from P_dense to the set of collocation points every K iterations according to a sampling function. A second line of work was recently proposed in Wang et al. (2022a), where higher-order L_p norms of the PDE loss (e.g., L_∞) were advocated for training PINNs, in contrast to the standard practice of using L_2 norms, to ensure stability of learned solutions in control problems involving high-dimensional PDEs. Note that by using higher-order L_p norms, we effectively increase the importance of collocation points from high residual regions in the PDE loss, thereby having a similar effect as increasing their representation in the sample of collocation points (see Theorem A.1 for more details).

Challenges with Potential Remedies. There are two main challenges faced by the potential remedies described above that limit their effectiveness in mitigating propagation failures in PINNs. (1) High computational complexity: Sampling methods such as RAR and its variants require a dense set of collocation points P_dense (typically with 100K to 1M points spread uniformly across the entire domain) to locate high residual regions, such that points from these regions can be added to the training set every K iterations. This increases the computational cost in two ways. First, computing the PDE residuals on the entire dense set is very expensive. Second, the size of the training set keeps growing every K iterations, further increasing the training cost at later iterations.
See Appendix F for a detailed analysis of the computational complexity of RAR-based methods. (2) Poor prediction performance: As we empirically demonstrate in Section 5, both sampling-based methods such as RAR and its variants, as well as L_∞ PDE loss-based methods, suffer from poor performance in converging to the correct solution for complex PDE problems. This can be attributed to several reasons. First, while increasing the value of k in sampling-based methods and p in L_p norm-based methods induces a greater skew in the sample set towards points with higher residuals, the optimal values of k or p are generally unknown for an arbitrary PDE. Second, choosing the L_∞ loss (or equivalently, only sampling collocation points with the highest residuals) may not be ideal for the training dynamics of PINNs, as it can lead to oscillatory behavior between different peaks of the PDE residual landscape, while forgetting to retain the solution over other regions. Instead, we need a way to gradually accumulate points from high residual regions in the collocation set while keeping a non-zero representation of points from other regions.

4. PROPOSED APPROACH: EVOLUTIONARY SAMPLING (EVO)

Before presenting our proposed sampling approach, we first describe the four motivating properties that we focus on in this work to mitigate propagation failures:
(1) Accumulation Property: To break propagation barriers, the set of collocation points should evolve at every iteration such that it starts from a uniform distribution and accumulates in regions with high residuals, until the PINN training process eventually resolves them in later iterations. This is similar to starting with an L_2 loss and increasing the order of the L_p norm if high residual regions persist over iterations.
(2) Sample Release Property: Upon sufficient minimization of a high residual region through PINN training, collocation points that were once accumulated from the region need to be released, so that we can focus on minimizing other high residual regions in later iterations.
(3) Uniform Background Property: At every iteration, the set of collocation points should contain non-zero support of points sampled from a uniform distribution over the entire domain Ω, so that the collocation points do not collapse solely onto high residual regions.
(4) Computational Efficiency: While satisfying the above properties, we should incur little to no computational overhead in sampling collocation points from high residual regions. Specifically, we should be able to add points from high residual regions without maintaining a dense set of collocation points P_dense, and by only observing the residuals over a small set of N_r points at every iteration.
To satisfy all of the above properties, we present a novel sampling strategy termed Evolutionary Sampling (Evo), inspired by algorithms used for modeling biological evolution (Eiben et al., 2003). The key idea behind Evo is to gradually evolve the population of collocation points at every iteration by retaining points with high residuals from the current population and re-sampling the remaining points from a uniform distribution.
Algorithm 1 shows the pseudo-code of our proposed evolutionary sampling strategy. Analogous to evolutionary algorithms developed in the optimization literature (Simon, 2013), we introduce a notion of "fitness" F(x_r) for every collocation point x_r such that points with higher fitness are allowed to survive in the next iteration. Specifically, we define F(x_r) as the absolute value of the PDE residual of x_r, i.e., F(x_r) = |R(x_r)|.

Algorithm 1: Proposed Evolutionary Sampling Algorithm for PINNs
1: Sample the initial population P_0 of N_r collocation points, P_0 ← {x_r^i}_{i=1}^{N_r}, x_r^i ∼ U(Ω), where Ω = [0, T] × X is the input domain.
2: for i = 0 to max_iterations − 1 do
3:   Compute the fitness of collocation points x_r ∈ P_i as F(x_r) = |R(x_r)|.
4:   Compute the threshold τ_i = (1/N_r) Σ_{j=1}^{N_r} F(x_r^j).
5:   Select the retained population P_i^r ← {x_r^j : F(x_r^j) > τ_i}.
6:   Generate the re-sampled population P_i^s ← {x_r^j : x_r^j ∼ U(Ω)}, s.t. |P_i^s| + |P_i^r| = N_r.
7:   Merge the two populations: P_{i+1} ← P_i^r ∪ P_i^s.
8: end for

At iteration 0, we start with an initial population P_0 of N_r points sampled from a uniform distribution. At iteration i, in order to evolve the population to the next iteration, we first construct the "retained population" P_i^r comprising points from P_i with fitness values greater than τ_i, i.e., P_i^r ← {x_r^j : F(x_r^j) > τ_i}, where τ_i is equal to the expectation of fitness values over all points in P_i. The remaining collocation points in P_i are re-sampled from a uniform distribution, constructing the "re-sampled population" P_i^s ← {x_r^j : x_r^j ∼ U(Ω)}. The retained and re-sampled populations are then merged to generate the population for the next iteration, P_{i+1}. Appendix Figure 8 schematically shows the accumulation of collocation points from high residual regions in Evo over training iterations.

Analysis of Evo. Note that at every iteration of PINN training, Evo attempts to retain the set of collocation points in P^r with the highest fitness (corresponding to high residual regions). At the same time, the PINN optimizer is attempting to minimize the residuals by updating θ, which in turn affects the fitness function. We first show that when F(x) is fixed (e.g., when θ is kept constant), Evo maximizes F(x) in the retained population and thus accumulates points from high residual regions. In particular, Theorem 4.1 shows that for a fixed F(x), the expectation of fitness over the retained population in Evo approaches its maximum (the L_∞ norm) as the number of iterations approaches ∞.

Theorem 4.1 (Accumulation Dynamics Theorem). Let F_θ(x) : R^n → R^+ be a fixed real-valued k-Lipschitz continuous objective function optimized using the evolutionary sampling algorithm. Then, the expectation over the retained population satisfies E_{x∈P^r}[F(x)] ≥ max_x F(x) − kϵ as iteration i → ∞, for any arbitrarily small ϵ > 0.

The proof of Theorem 4.1 can be found in Appendix C.1. This demonstrates the accumulation property of Evo: points from high residual regions keep accumulating in the retained population, making its expectation maximal if the fitness function is kept fixed. However, since the PINN optimizer is also minimizing the residuals at every iteration, we would not expect the fitness function to remain fixed unless a high residual region persists over a long number of iterations. In fact, points from a high residual region keep accumulating until they are resolved by the PINN optimizer and thus eventually released from P^r. Also note that Evo always maintains some collocation points from a uniform distribution, i.e., the re-sampled population P^s is always non-empty. Theoretical proofs of the sample release property and the uniform background property of Evo are provided in Appendix C.2 and Lemma C.1.1, respectively. We also provide details of the computational complexity of Evo in comparison with baseline methods in Appendix F, showing that Evo is computationally efficient. Table 2 in the Appendix summarizes our ability to satisfy the motivating properties of our work in comparison with baselines. Note that Evo shares a similar motivation as local-adaptive mesh refinement methods developed for Finite Element Methods (FEM) (Zienkiewicz et al., 2005), where the goal is to preferentially refine the computational mesh used in numerical methods based on localization of the errors.
It is also related to the idea of boosting in ensemble learning, where training samples with larger errors are assigned higher probabilities of being picked in the next epoch, so as to increasingly focus on high-error regions (Schapire, 2003).
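One evolution step of Algorithm 1 can be sketched in a few lines. The example below is a minimal sketch on a fixed 1-D residual field: the PINN update is omitted, and the peaked `residual_fn` is an illustrative stand-in for a narrow high-residual region. With the fitness held fixed, the population gradually accumulates around the peak, as Theorem 4.1 predicts.

```python
import numpy as np

def evo_step(points, residual_fn, domain, rng):
    """One evolution step of Evo (Algorithm 1): retain collocation points
    whose fitness |R(x)| exceeds the population mean, and refill the
    remainder with fresh uniform samples from the domain."""
    fitness = np.abs(residual_fn(points))          # F(x) = |R(x)|
    tau = fitness.mean()                           # threshold tau_i
    retained = points[fitness > tau]               # retained population P_i^r
    n_new = len(points) - len(retained)
    lo, hi = domain
    resampled = rng.uniform(lo, hi, n_new)         # re-sampled population P_i^s
    return np.concatenate([retained, resampled])   # merged population P_{i+1}

# Toy residual field with a sharp peak at x = 0.8 (a stand-in for a
# narrow high-residual region that persists over iterations).
residual_fn = lambda x: np.exp(-((x - 0.8) / 0.05) ** 2)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, 1000)
frac0 = np.mean(np.abs(pts - 0.8) < 0.1)     # initial fraction near the peak
for _ in range(50):
    pts = evo_step(pts, residual_fn, (0.0, 1.0), rng)
frac_near_peak = np.mean(np.abs(pts - 0.8) < 0.1)
```

After 50 steps the fraction of points near the peak grows well beyond its initial uniform value, while the re-sampled portion keeps a uniform background over the rest of the domain.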

4.1. CAUSAL EXTENSION OF EVOLUTIONARY SAMPLING (CAUSAL EVO)

In problems with time-dependent PDEs, a strong prior dictating the propagation of the solution is the principle of causality: the solution of the PDE needs to be well-approximated at time t before moving to time t + ∆t. To incorporate this prior guidance, we present a causal extension of Evo (Causal Evo) that includes two modifications: (1) a causal formulation of the PDE loss L_r that pays attention to the temporal evolution of PDE solutions over iterations, and (2) a causally biased sampling scheme that respects the causal structure while sampling collocation points. We describe both modifications in the following.

Causal Formulation of PDE Loss. We weight the PDE residuals by a causal gate g(t), controlled by a shift parameter γ, that reveals the time domain progressively, yielding the causally-weighted PDE loss L_r^g(θ) = (1/N_r) Σ_{i=1}^{N_r} [R(x_r^i, t_r^i)]² · g(t_r^i). We initially start with a small value of the shift parameter (γ = −0.5), which reveals only a very small portion of the time domain, and then gradually increase γ during training to reveal more of the time domain. For γ ≥ 1.5, the entire time domain is revealed.

Causally Biased Sampling. We bias the sampling strategy in Evo such that it not only favors the selection of collocation points from high residual regions but also accounts for the causal gate values at every iteration. In particular, we modify the fitness function as F(x_r) = |R(x_r)| · g(t_r). A schematic illustration of causally biased sampling is provided in Appendix D.

How to Update γ? Ideally, at some iteration i, we would like to increase γ_i at the next iteration only if the PDE residuals at iteration i are low. Otherwise, γ_i should remain in place until the PDE residuals under the current gate are minimized. To achieve this behavior, we propose the following update scheme for γ: γ_{i+1} = γ_i + η_g exp(−ϵ L_r^g(θ)), where η_g is the learning rate and ϵ is a tolerance that controls how low the PDE loss needs to be before the gate shifts to the right.
Since the update to γ decays exponentially with the causally-weighted PDE loss L_r^g, the gate shifts slowly if the PDE residuals are large. Also note that increasing γ also increases the value of g(t) for all collocation points, thus increasing the causally-weighted PDE loss and slowing down the gate movement. Upon convergence, γ attains a large value such that the entire time domain is revealed.

[Displaced rows of Table 1 — Evo (ours): 1.51 ± 0.26%, 0.78 ± 0.18%, 6.03 ± 6.99%, 1.98 ± 0.72%, 0.83 ± 0.15%; Causal Evo (ours): 2.12 ± 0.67%, 0.75 ± 0.12%, 5.99 ± 5.25%, 2.28 ± 0.76%, 0.71 ± 0.007%]
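The gate-shift rule above can be sketched in a few lines; the values of η_g and ϵ below are placeholders, not the paper's tuned settings.

```python
import math

def update_gamma(gamma, loss_g, eta_g=0.05, eps=10.0):
    """Gate-shift update: gamma_{i+1} = gamma_i + eta_g * exp(-eps * L_r^g).
    The shift stalls whenever the causally-weighted PDE loss is large, and
    proceeds at nearly full speed eta_g once the revealed residuals are low."""
    return gamma + eta_g * math.exp(-eps * loss_g)

# With high residuals under the gate, gamma barely moves;
# once they are minimized, gamma advances by ~eta_g per step.
g = -0.5                                       # initial shift: almost nothing revealed
stalled = update_gamma(g, loss_g=5.0) - g      # tiny step while residuals are high
resolved = update_gamma(g, loss_g=0.0) - g     # full step of eta_g once resolved
```

This reproduces the intended behavior: the gate is effectively frozen until the causally-weighted loss drops, then advances steadily toward revealing the full time domain.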

5. RESULTS

Experiment Setup. We perform experiments over three benchmark PDEs that have been used in existing literature to study failure modes of PINNs. In particular, we consider two time-dependent PDEs: the convection equation (with β = 30 and β = 50) and the Allen-Cahn equation, and one time-independent PDE: the Eikonal equation for solving signed distance fields for varying input geometries. We consider the following baseline methods: (1) PINN-fixed (conventional PINN using a fixed set of uniformly sampled collocation points), (2) PINN-dynamic (a simple baseline where collocation points are dynamically sampled from a uniform distribution every iteration; see Appendix B for more details), (3) Curr. Reg. (the curriculum regularization method proposed in Krishnapriyan et al. (2021)), (4) cPINN-fixed (causal PINNs proposed in Wang et al. (2022b) with fixed sampling), (5) cPINN-dynamic (cPINN with dynamic sampling), (6) RAR-G (the Residual-based Adaptive Refinement strategy proposed in Lu et al. (2021)), (7) RAD (Residual-based Adaptive Distribution, originally proposed in Nabian et al. (2021) and later generalized in Wu et al. (2022)), (8) RAR-D (RAR with a sampling Distribution, proposed in Wu et al. (2022)), (9) L_∞ (sampling the top-N_r collocation points at every iteration from a dense set P_dense to approximate the L_∞ norm). For every benchmark PDE, we use the same neural network architecture and hyper-parameter settings across all baselines and our proposed methods, wherever possible. Details about the PDEs, experiment setups, and hyper-parameter settings are provided in Appendix I. All code and datasets used in this paper are publicly available.

Comparing Prediction Performance. Table 1 shows the relative L_2 errors (over 5 random seeds) of PDE solutions obtained by the comparative methods w.r.t. ground-truth solutions for different time-dependent PDEs when N_r is set to 1K.
We particularly chose a small value of N_r to study the effect of small sample sizes on PINN performance (note that the original formulations of the baseline methods used very high N_r). We can see that while PINN-fixed fails to converge for the convection (β = 30) and Allen-Cahn equations (admitting very high errors), PINN-dynamic shows significantly lower errors. However, for complex PDEs such as convection (β = 50), PINN-dynamic is still not able to converge to low errors. We also see that cPINN-fixed shows high errors across all PDEs when N_r = 1000. This is likely because such small collocation samples are insufficient for cPINNs to converge to the correct solution. As we show later, cPINN is indeed able to converge to the correct solution when N_r is large. Performing dynamic sampling with cPINN shows some reduction in errors, but it is still not sufficient for the convection (β = 50) case. All other baseline methods, including Curr. Reg., RAR-based methods, and L_∞, fail to converge on most PDEs and show worse performance than even the simple baseline of PINN-dynamic. On the other hand, our proposed approaches (Evo and Causal Evo) consistently show the lowest errors across all PDEs. Figure 4 shows that Evo and PINN-dynamic are indeed able to mitigate propagation failures for convection (β = 50), maintaining low values of Skewness, Kurtosis, and maximum PDE residuals across all iterations, in contrast to PINN-fixed. Additional visualizations of the evolution of samples in Evo, Causal Evo, and RAR-based methods across iterations are provided in Appendix J. The sensitivity of RAR-based methods to hyper-parameters is provided in Appendix J.3.

Adaptive Nature of Evo PDE Loss.

To demonstrate the ability of Evo to accumulate high residual points in the retained population P^r (or equivalently, to focus on higher-order L_p norms of the PDE loss), we consider optimizing a fixed objective function, the Ackley function, in Figure 5 (see Appendix K.1 for details of this function). We can see that at iteration 1, the expected loss over P^r is equal to the L_2 norm of the PDE loss over the entire domain. As training progresses, the expected loss over P^r quickly reaches higher-order L_p norms and approaches L_∞ at very large iteration counts. This confirms the gradual accumulation property of Evo, as theoretically stated in Theorem 4.1. Additional visualizations of the dynamics of Evo for a number of test optimization functions are provided in Appendix K.

Sampling Efficiency. Figure 6 shows the scalability of Evo to smaller sample sizes of collocation points, N_r. Though all baselines demonstrate similar performance when N_r is large (> 10K), only Evo and Causal Evo manage to maintain low errors even for very small values of N_r = 100, showing two orders of magnitude improvement in sampling efficiency. Note that the sample size N_r is directly related to the compute and memory requirements of training PINNs. We also show that Evo and Causal Evo converge faster than baseline methods for both the convection and Allen-Cahn equations (see Appendix J.2 for details). Additional results on three cases of the Kuramoto-Sivashinsky (KS) equation, including chaotic behavior, are provided in Appendix J.7.

Solving Eikonal Equations. Given the equation of a surface geometry in 2D space, u(x_s, y_s) = 0, the Eikonal equation is a time-independent PDE used to solve for the signed distance field (SDF) u(x, y), which takes negative values inside the surface and positive values outside. See Appendix G for details of the Eikonal equation.
The primary difficulty in solving the Eikonal equation comes from determining the sign of the field (interior or exterior) in regions with rich details. We compare the performance of different baseline methods with respect to the ground-truth (GT) solution obtained from numerical methods for three complex surface geometries in Figure 7. We also plot the reconstructed geometry of the predicted solutions to demonstrate a real-world application of solving this PDE, e.g., in downstream real-time graphics rendering. The quality of the reconstructed geometries is quantitatively evaluated using the mean Intersection-Over-Union (mIOU) metric. The results show that PINN-fixed performs poorly across all three geometries, while PINN-dynamic is able to capture most of the outline of the solutions, with a few details missing for difficult geometries like "sailboat" and "gear". On the other hand, Evo captures even the fine details of the SDF for all three complex geometries and thus shows better reconstruction quality. We can see that the mIOU of Evo is significantly higher than the baselines for "sailboat" and "gear". See Appendix Section J.8 for more discussion and visualizations.

6. CONCLUSIONS AND FUTURE WORK DIRECTIONS

We present a novel perspective for identifying failure modes in PINNs, termed "propagation failures," and develop a novel evolutionary sampling algorithm to mitigate them. Our experiments demonstrate better performance on a variety of benchmark PDEs. Future work can focus on theoretically understanding the interplay between minimizing the PDE loss and sampling from high-residual regions on PINN performance. Other directions for future work include exploring more sophisticated evolutionary algorithms involving mutation and crossover techniques.

A CONNECTIONS BETWEEN $L^p$ NORM AND SAMPLING

In this section, we provide connections between adaptively sampling collocation points from a distribution $q(x_r) \propto |R_\theta(x_r)|^k$ (Wu et al., 2022) and using the $L^p$ norm of the PDE loss (Wang et al., 2022a).

Theorem A.1. For $p \ge 2$, let $L^p_r(U)$ denote the expected $L^p$ PDE loss computed on collocation points sampled from a uniform distribution $U(\Omega)$. Similarly, for $k \ge 0$, let $L^2_r(Q_k)$ denote the expected $L^2$ PDE loss computed on collocation points sampled from an alternate distribution $Q_k(\Omega): x_r \sim q(x_r)$, where $q(x_r) \propto |R_\theta(x_r)|^k$. Then,
$$L^2_r(Q_k) = Z^{-1/2}\, V^{-1/2} \left[L^{k+2}_r(U)\right]^{(k+2)/2},$$
where $Z$ is a normalization constant, defined as $Z = \int_{x_r \in \Omega} |R_\theta(x_r)|^k \, dx_r$, and $V$ is the volume of the domain $\Omega$.

Proof. The expected $L^p$ PDE loss for collocation points sampled from a uniform distribution $U(\Omega): x_r \sim p(x_r)$ can be defined as follows²:
$$L^p_r(U) = \left(\mathbb{E}_{x_r \sim U(\Omega)}\left[|R_\theta(x_r)|^p\right]\right)^{1/p} = \left(\int p(x_r)\,|R_\theta(x_r)|^p \, dx_r\right)^{1/p}. \quad (3)$$
Note that for a uniform distribution, $p(x_r) = 1/V$, where $V$ is the volume of the domain $\Omega$, i.e., $V = \prod_{i=1}^n \big(\sup(x_i) - \inf(x_i)\big)$, with $\sup(\cdot)$ and $\inf(\cdot)$ being the supremum and infimum operators, and $x_i$ the $i$-th dimension of $x_r$ (e.g., the space dimension $x$ or the time dimension $t$).
Now, let us consider the case where we are interested in sampling from an alternate distribution $Q_k(\Omega): x_r \sim q(x_r)$, where $q(x_r) \propto |R_\theta(x_r)|^k$, while using the $L^2$ PDE loss (the most standard loss formulation used in PINNs). The sampling function of $Q_k(\Omega)$ can be defined as:
$$q(x_r) = \frac{|R_\theta(x_r)|^k}{Z}, \quad (4)$$
where $Z$ is the normalizing constant, i.e., $Z = \int_{x_r \in \Omega} |R_\theta(x_r)|^k \, dx_r$. Hence, the $L^2$ PDE loss for collocation points sampled from $Q_k(\Omega)$ can be written as:
$$
\begin{aligned}
L^2_r(Q_k) &= \left(\mathbb{E}_{x_r \sim Q_k(\Omega)}\left[|R_\theta(x_r)|^2\right]\right)^{1/2} = \left(\int q(x_r)\,|R_\theta(x_r)|^2 \, dx_r\right)^{1/2} \\
&= \left(\int \frac{|R_\theta(x_r)|^k}{Z}\, |R_\theta(x_r)|^2 \, dx_r\right)^{1/2} \quad \text{(from Equation 4)} \\
&= Z^{-1/2}\left(\int |R_\theta(x_r)|^{k+2} \, dx_r\right)^{1/2} = Z^{-1/2}\left(\int \frac{p(x_r)}{p(x_r)}\, |R_\theta(x_r)|^{k+2} \, dx_r\right)^{1/2} \\
&= Z^{-1/2}\, V^{-1/2} \left(\int p(x_r)\,|R_\theta(x_r)|^{k+2} \, dx_r\right)^{1/2}, \quad \because x_r \sim U(\Omega) \implies p(x_r) = \tfrac{1}{V} \\
&= Z^{-1/2}\, V^{-1/2} \left(\mathbb{E}_{x_r \sim U(\Omega)}\left[|R_\theta(x_r)|^{k+2}\right]\right)^{1/2} \\
&= Z^{-1/2}\, V^{-1/2} \left[L^{k+2}_r(U)\right]^{(k+2)/2}. \quad \text{(from Equation 3, with } p = k+2\text{)} \quad (5)
\end{aligned}
$$
Theorem A.1 suggests that sampling collocation points from a distribution $q(x_r) \propto |R_\theta(x_r)|^k$ and using the $L^p$ norm of the PDE loss are related to each other by the scaling term $Z^{-1/2} V^{-1/2}$.

On the other hand, if we perform dynamic sampling, the probability of picking at least one point from $\Omega_{high}$ across all iterations will be equal to $1 - \big(1 - A_{high}/(A_{high} + A_{low})\big)^N$, where $N$ is the number of iterations. When $N$ is large, this probability approaches 1, indicating that across all iterations of PINN training we would likely have sampled a point from $\Omega_{high}$ in at least one iteration and used it to minimize its PDE residual. As we empirically demonstrate in Section 5, dynamic sampling is indeed able to control the skewness of PDE residuals compared to fixed sampling, and thus acts as a strong baseline for mitigating propagation failures.
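The at-least-once probability above is easy to tabulate (the value of $p = A_{high}/(A_{high} + A_{low})$ below is a hypothetical toy choice for illustration):

```python
# Probability of sampling at least one point from Omega_high across N rounds of
# dynamic (re-)sampling, when a single uniform draw lands in Omega_high with
# probability p = A_high / (A_high + A_low).
p_high = 0.01                      # hypothetical relative area of Omega_high
for n_iters in (10, 100, 1000):
    p_at_least_once = 1 - (1 - p_high) ** n_iters
    print(n_iters, p_at_least_once)   # approaches 1 as n_iters grows
```

Even with only 1% of the domain occupied by $\Omega_{high}$, the hit probability exceeds 0.63 after 100 rounds and is essentially 1 after 1000 rounds.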
However, note that even if we use dynamic sampling, the contribution of points from $\Omega_{high}$ to the overall PDE residual loss computed at any iteration is still low. In particular, since the probability of sampling points from $\Omega_{high}$ at any iteration is equal to $A_{high}/(A_{high} + A_{low})$, the expected PDE residual loss computed over all collocation points will be
$$\mathbb{E}_{\Omega}[L_r(\theta)] = \mathbb{E}_{\Omega_{high}}[L_r(\theta)] \times \frac{A_{high}}{A_{high} + A_{low}} + \mathbb{E}_{\Omega_{low}}[L_r(\theta)] \times \frac{A_{low}}{A_{high} + A_{low}}. \quad (6)$$
Since $A_{high} \ll A_{low}$, the gradient update of $\theta$ at every epoch will be dominated by the low PDE residuals observed over points from $\Omega_{low}$, leading to slow propagation of information from initial/boundary points to interior points.
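A toy numerical illustration of Equation (6), with hypothetical areas and per-region losses:

```python
# Even when residuals in Omega_high are 100x larger, its tiny area keeps its
# share of the expected loss (and hence of the gradient) small.
# Areas and per-region losses below are hypothetical values for illustration.
A_high, A_low = 0.001, 0.999
loss_high, loss_low = 10.0, 0.1
total = A_high + A_low

expected = loss_high * A_high / total + loss_low * A_low / total
share_high = (loss_high * A_high / total) / expected
print(expected, share_high)   # Omega_high contributes under 10% of the loss
```

Here the 100x larger residuals in $\Omega_{high}$ supply less than a tenth of the expected loss, so gradient updates are dominated by $\Omega_{low}$.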

C ANALYSIS OF EVOLUTIONARY SAMPLING

In this section, we analyze the dynamic behavior (or evolution) of the collocation points for our proposed Evolutionary Sampling (Evo) approach over iterations. A schematic representation of Evo is provided in Figure 8.

C.1 ACCUMULATION PROPERTY OF EVO

In this section, we provide the proof of Theorem 4.1 presented in the main paper.

Definition 1 (Objective Function). Let $F_\theta(x): \mathbb{R}^n \to \mathbb{R}^+$ be an arbitrary positive real-valued $k$-Lipschitz continuous function, where $\theta$ denotes the neural network parameters. When $\theta$ is fixed, the function $F(x)$ does not vary with iterations, representing a fixed objective function. Let $X^* = \{x^*_i : F(x^*_i) = \max_x F(x),\ \forall i \in [n]\}$ be the set of points where the objective function $F$ is maximal. Now, let us define an $\epsilon$-neighborhood around each point $x^*_i \in X^*$ as $N_\epsilon(x^*_i)$, such that $\|x_i - x^*_i\| \le \epsilon$ for any arbitrarily small $\epsilon > 0$ and for all $x_i \in N_\epsilon(x^*_i)$, with $i \in [n]$. Since the objective function $F$ is $k$-Lipschitz continuous, the following is true:
$$|F(x^*_i) - F(x_i)| \le k\epsilon \quad \forall\, x_i \in N_\epsilon(x^*_i)\ \&\ i \in [n] \quad (7)$$
$$\implies F(x^*_i) - F(x_i) \le k\epsilon \quad \because F(x^*_i) = \max_x F(x).$$

Definition 2 ($\epsilon$-maximal Neighborhood). Let $F^* = \max_x F(x)$ be the maximal value of the objective function $F(x)$ (Definition 1). Then an $\epsilon$-maximal neighborhood $N^\infty_\epsilon$ can be defined as $N^\infty_\epsilon = N_\epsilon(x^*_1) \cup N_\epsilon(x^*_2) \cup \dots \cup N_\epsilon(x^*_n)$, such that any point $x$ sampled from $N^\infty_\epsilon$ satisfies $F^* - F(x) \le k\epsilon$, for any arbitrarily small $\epsilon > 0$. Note that since the volume of $N^\infty_\epsilon$ is greater than 0, the probability of sampling any $x \in N^\infty_\epsilon$ from a uniform distribution $U(x)$ is greater than 0.

Lemma C.1 (Population Properties). For any population $P$ generated at some iteration of Evo optimizing a given objective function $F(x)$ (Definition 1), the following properties are always true:
1. The re-sampled population is always non-empty, i.e., $|P_s| > 0$.
2.
The size of the retained population is always less than the total population size, i.e., $|P_r| < |P|$.
3. The size of the retained population is zero, i.e., $|P_r| = 0$, if and only if $F(x) = c,\ \forall x \in P$.

Proof. The threshold $\tau$ for evolutionary sampling is computed as $\tau = \frac{1}{|P|}\sum_{x \in P} F(x)$. The retained population is defined as $P_r \leftarrow \{x : F(x) > \tau,\ \forall x \in P\}$; similarly, the non-retained population is defined as $\bar{P}_r \leftarrow \{x : F(x) \le \tau,\ \forall x \in P\}$.

Proof of Property 1: For any set of real numbers, there always exists some element in the set that is less than or equal to the mean. Hence, the size of the non-retained population is always non-zero, as there always exists some point $x \in P$ such that $F(x) \le \tau$. Thus, $|\bar{P}_r| > 0$.

Proof of Property 3: Consider the case where $F(x) = c,\ \forall x \in P$ (where $c$ is some constant), i.e., the value of the function is constant at all points $x \in P$. In this case, the mean of the population $P$, which equals the threshold $\tau$, will be equal to $c$. This condition leads to the entire population being re-sampled, as every element $x \in P$ satisfies the condition to belong to the non-retained population. Note that a constant $F(x) = c$ over $P$ is the only case in which all elements are less than or equal to the mean; otherwise, there would always be at least one element strictly greater than the mean, resulting in a non-zero size of the retained population.
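A minimal NumPy sketch of the retain/re-sample loop analyzed in this section (the residual function and domain below are hypothetical toy choices, not the paper's PDE residuals):

```python
import numpy as np

rng = np.random.default_rng(0)

def evo_step(points, residual_fn, domain_lo, domain_hi):
    """One evolutionary-sampling iteration (skeleton of Evo): retain points with
    above-average |residual|, re-sample the rest uniformly over the domain."""
    F = np.abs(residual_fn(points))
    tau = F.mean()                              # fitness threshold = population mean
    retained = points[F > tau]                  # P_r: high-residual points survive
    n_new = len(points) - len(retained)         # |P_s| = |P| - |P_r| > 0 always
    resampled = rng.uniform(domain_lo, domain_hi,
                            size=(n_new,) + points.shape[1:])
    return np.concatenate([retained, resampled])

# Toy usage: a residual field peaking at x = 0, so points should accumulate there
pts = rng.uniform(-1, 1, size=(512, 1))
for _ in range(50):
    pts = evo_step(pts, lambda x: np.exp(-10 * x[:, 0] ** 2), -1.0, 1.0)
# Fraction of points near the peak, well above the uniform baseline of 0.3
print(np.mean(np.abs(pts[:, 0]) < 0.3))
```

This illustrates both Lemma C.1 (the re-sampled set is never empty, since some point always falls at or below the mean) and the accumulation behavior proved in Theorem C.4.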

Now by definition

Lemma C.2 (Entry Condition). If a point $x_m$ is sampled from $N^\infty_\epsilon$ at any arbitrary iteration $m$, then it will always enter the retained population $P_r^m$ unless $\mathbb{E}_{x \in P_r^m}[F(x)] > F^* - k\epsilon$.

Proof. The condition for any arbitrary point $x_m$ to enter the retained population $P_r^m$ at any arbitrary iteration $m$ is:
$$F(x_m) > \tau_m = \mathbb{E}_{x \in P_m}[F(x)]. \quad \text{(by definition of the threshold } \tau_m\text{)} \quad (9)$$
Now, if the point $x_m$ is sampled from $N^\infty_\epsilon$, then $F(x_m) \ge F^* - k\epsilon$ (from Definition 2). Hence, for $x_m$ to enter the retained population $P_r^m$, we need to ensure that $F^* - k\epsilon > \tau_m$. Let us consider the case where $x_m$ is not able to enter the retained population. In such a case, we have the inequality:
$$F^* - k\epsilon < \tau_m. \quad (10)$$
It is also easy to show from the definition of the retained population that the threshold $\tau$ is always less than or equal to the expectation of the retained population:
$$\tau_m \le \mathbb{E}_{x \in P_r^m}[F(x)]. \quad (11)$$
From Equations 10 and 11, we get
$$F^* - k\epsilon < \tau_m \le \mathbb{E}_{x \in P_r^m}[F(x)] \implies \mathbb{E}_{x \in P_r^m}[F(x)] > F^* - k\epsilon.$$
We have thus proved that $x_m$ will not be able to enter the retained population $P_r^m$ only if $\mathbb{E}_{x \in P_r^m}[F(x)] > F^* - k\epsilon$, which suggests that the expectation of the retained population is already close to $F^*$, for any arbitrarily small $\epsilon > 0$. On the other hand, if $\mathbb{E}_{x \in P_r^m}[F(x)] \le F^* - k\epsilon$, we would necessarily add $x_m$ to the retained population.

Lemma C.3 (Exit Condition). A point $x_m$ sampled from $N^\infty_\epsilon$ that entered the retained population at any arbitrary iteration $m$ can exit the retained population $P_r^n$ at an arbitrary iteration $n$ (with $n > m$) only if $\mathbb{E}_{x \in P_r^n}[F(x)] \ge F^* - k\epsilon$.

Proof. The generic condition for any arbitrary point $x$ to exit the retained population $P_r^n$ at iteration $n$ is:
$$F(x) \le \tau_n = \mathbb{E}_{x \in P_n}[F(x)]. \quad \text{(by definition of the threshold } \tau\text{)}$$
Since a point $x_m$ that was originally sampled from $N^\infty_\epsilon$ has $F(x_m) \ge F^* - k\epsilon$ (from Definition 2), we can use this inequality in the generic exit condition above to get
$$F^* - k\epsilon \le \tau_n \le \mathbb{E}_{x \in P_r^n}[F(x)].$$
Hence, the point $x_m$ can exit the retained population at iteration $n$ only if $\mathbb{E}_{x \in P_r^n}[F(x)] \ge F^* - k\epsilon$.

Theorem C.4 (Accumulation Dynamics Theorem). Let $F_\theta(x): \mathbb{R}^n \to \mathbb{R}^+$ be a fixed real-valued $k$-Lipschitz continuous objective function optimized using the evolutionary sampling algorithm. Then, the expectation of the retained population satisfies $\mathbb{E}_{x \in P_r}[F(x)] \ge \max_x F(x) - k\epsilon$ as iteration $i \to \infty$, for any arbitrarily small $\epsilon > 0$.

Proof. We prove this theorem by contradiction. For the sake of contradiction, let us assume that as iterations $i \to \infty$, the expectation of the retained population satisfies $\mathbb{E}_{x \in P_r}[F(x)] < \max_x F(x) - k\epsilon$ for some arbitrarily small $\epsilon > 0$. We can then make the following two remarks.

Entry of collocation points: Note that the probability of sampling $x$ from $N^\infty_\epsilon$ is non-zero because the size of the re-sampled population is non-zero, i.e., $|P_s| > 0$ (proved in Lemma C.1). Also, since we have assumed $\mathbb{E}_{x \in P_r}[F(x)] < F^* - k\epsilon$, we can use the entry condition proved in Lemma C.2 to conclude that a point from $N^\infty_\epsilon$ will always be able to enter the retained population.

Exit of collocation points: Similarly, a point $x$ that belongs to the $\epsilon$-maximal neighborhood and is part of the retained population $P_r$ will not be able to escape the retained population, as we have assumed $\mathbb{E}_{x \in P_r}[F(x)] < F^* - k\epsilon$ (using the exit condition proved in Lemma C.3).

From the above two remarks, we can see that points would keep accumulating indefinitely in the retained population if our initial assumption (for the sake of contradiction) were true. However, since the total size of the population $|P|$ is bounded, the size of the retained population $|P_r|$ cannot grow indefinitely. We have thus arrived at a contradiction, suggesting our assumption is incorrect.
Hence, as iterations $i \to \infty$, the expectation of the retained population satisfies $\mathbb{E}_{x \in P_r}[F(x)] \ge \max_x F(x) - k\epsilon$, for any arbitrarily small $\epsilon > 0$. From Theorem 4.1, we can thus prove a continuous accumulation of collocation points from the $\epsilon$-maximal neighborhood until the expectation of the retained population is close to the maximum value (i.e., reaches $L^\infty$), thus exhibiting the Accumulation Property described in Section 4. Although this theorem assumes that the objective function $F(x)$ (or, in the context of PINNs, the absolute residual field $|R_\theta(x_r)|$) is constant over iterations, the theorem remains valid when $R_\theta(x_r)$ changes gradually, with the highest-error regions (defined using our $\epsilon$-maximal neighborhood) persisting over iterations. Under such conditions, the theorem states that the retained population $P_r$ would always accumulate points from the $\epsilon$-maximal neighborhood, thereby adaptively increasing their contribution to the overall PDE residual loss and eventually resulting in their minimization.

C.2 RELEASE OF COLLOCATION POINTS FROM HIGH PDE RESIDUAL REGIONS.

Our definition of the Sample Release Property states that the distribution of collocation points should revert to its original form by releasing the accumulated points in high PDE residual regions once they are "sufficiently minimized." Let us define that for an arbitrary collocation point $x_r^i \in P$, "sufficient minimization" of the PDE residual is achieved if $|R_\theta(x_r^i)| \le \mathbb{E}_{x_r \in P}[|R_\theta(x_r)|] = \tau$ (where $\tau$ is the threshold used by Evo). Then, by definition, such points belong to the non-retained population and are immediately replaced by the re-sampled population. Since we generate the re-sampled population $P_s$ from a uniform distribution, these sufficiently minimized collocation points are replaced with uniform density. Thus, Evo satisfies the Sample Release Property of an "ideal" sampling algorithm.

D ADDITIONAL DETAILS FOR CAUSAL EVOLUTIONARY SAMPLING

Figure 9b presents a schematic of the causally biased evolutionary sampling described in Section 4.1 and of the causal gate $g$ that is updated every iteration. The shift parameter $\gamma$ of the causal gate is updated every iteration using the following scheme: $\gamma_{i+1} = \gamma_i + \eta_g\, e^{-\epsilon L_r^g(\theta)}$, where $\eta_g$ is the learning rate that controls how fast the gate propagates, $\epsilon$ denotes a tolerance that controls how low the PDE loss needs to be before the gate shifts to the right, and $i$ denotes the $i$-th iteration. Typically, in our experiments, we set the learning rate to $10^{-3}$. Thus, for example, if the expectation of $e^{-\epsilon L_r^g(\theta)}$ over 1000 iterations is 0.1, then $\gamma$ would change by a value of 0.1 after 1000 iterations (since $\gamma_{i+N} \approx \gamma_i + \eta_g\, N\, \mathbb{E}[e^{-\epsilon L_r^g(\theta)}]$). Additionally, note that for a typical "tanh" causal gate, the operating range of $\gamma$ values is from $-0.5$ to $1.5$. However, if the loss is very small ($L_r^g(\theta) \to 0$), the magnitude of the update $e^{-\epsilon L_r^g(\theta)} \to 1$, leading to an abrupt change in the causal gate. Thus, to prevent abrupt gate movement due to a large-magnitude update, we employ a magnitude clipping scheme (similar to gradient clipping in conventional ML): $\gamma_{i+1} = \gamma_i + \eta_g \min\big(e^{-\epsilon L_r^g(\theta)}, \Delta_{max}\big)$, where $\Delta_{max}$ is the maximum allowed magnitude of the update. Typically, for our experiments, we keep $\Delta_{max} = 0.1$. Note that $\Delta_{max}$ needs to be chosen carefully depending on the gate learning rate $\eta_g$.
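A minimal sketch of this clipped update rule, using the hyper-parameter values stated in the text (the loss values in the usage example are hypothetical):

```python
import math

def update_gamma(gamma, gated_pde_loss, eta_g=1e-3, eps=20.0, delta_max=0.1):
    """Clipped shift-parameter update for the causal gate: gamma moves right only
    when the gated PDE loss L_r^g is small, and the step magnitude is clipped
    to delta_max to avoid abrupt gate movement."""
    step = math.exp(-eps * gated_pde_loss)
    return gamma + eta_g * min(step, delta_max)

gamma = -0.5                        # typical initial value; gate range is [-0.5, 1.5]
g_high = update_gamma(gamma, gated_pde_loss=1.0)    # high loss: gate barely moves
g_low = update_gamma(gamma, gated_pde_loss=1e-8)    # near-zero loss: step clipped
print(g_high - gamma, g_low - gamma)   # ~2e-12 vs exactly eta_g * delta_max = 1e-4
```

Without the `min(..., delta_max)` clip, the second update would be $\eta_g \cdot e^{-\epsilon \cdot 0} \approx 10^{-3}$, ten times larger, which is exactly the abrupt movement the clipping scheme prevents.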

D.2 CHOICE OF OTHER GATE FUNCTIONS.

The gate function $g$ used to enforce the principle of causality is not limited to the "tanh" gate presented in Section 4.1 of the main paper. Any function can be used as a causal gate as long as it obeys the following criteria:
1. Continuous Time Property: The function $g$ should be continuous in time, such that it can be evaluated at any arbitrary time $t$.
2. Monotonic Property: The value of the gate $g$ at time $t + \Delta t$ should be less than or equal to its value at time $t$, i.e., $g(t + \Delta t) \le g(t)$. In other words, $g$ should be a monotonically decreasing function.
3. Shift Property: The gate function should be parameterized by a shift parameter $\gamma$ such that $g_\gamma(t) < g_{\gamma+\delta}(t)$ for $\delta > 0$, i.e., increasing the value of the shift parameter increases the gate value at any arbitrary time.

An alternate choice of causal gate is a composition of ReLU and tanh functions: $g = \mathrm{ReLU}(-\tanh(\alpha(t - \gamma)))$, as shown in Figure 10. By using ReLU, this alternate gate function provides a stricter thresholding of gate values to 0 after a cutoff time. The effect of this strict thresholding on the incorporation of causality in training PINNs can be studied in future analyses. In our current analysis, we simply used the tanh gate function for all our experiments.

E COMPARISON OF BASELINE METHODS W.R.T. MOTIVATING PROPERTIES

In this section, we compare the different baseline methods with respect to the motivating properties. See Table 2.
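The three gate criteria of Appendix D.2 can be checked numerically. In the sketch below, `tanh_gate` is an assumed member of the "tanh" family (the exact gate of Section 4.1 is defined in the main paper), while `relu_tanh_gate` follows the ReLU-tanh composition given above:

```python
import numpy as np

def tanh_gate(t, gamma, alpha=5.0):
    # Assumed smooth gate of the "tanh" family: continuous in t, monotonically
    # decreasing in t, and increasing in the shift parameter gamma.
    return 0.5 * (1.0 - np.tanh(alpha * (t - gamma)))

def relu_tanh_gate(t, gamma, alpha=5.0):
    # Alternate gate g = ReLU(-tanh(alpha * (t - gamma))): exactly zero for
    # all times past the cutoff gamma, i.e., a strict causal threshold.
    return np.maximum(0.0, -np.tanh(alpha * (t - gamma)))

t = np.linspace(0.0, 1.0, 101)
print(np.all(np.diff(tanh_gate(t, 0.3)) < 0))           # Monotonic Property
print(np.all(tanh_gate(t, 0.3) < tanh_gate(t, 0.4)))    # Shift Property
print(np.all(relu_tanh_gate(t, 0.3)[t > 0.35] == 0.0))  # strict cutoff after gamma
```

Each check prints True; the last one shows the qualitative difference between the two gates: the tanh gate only decays toward 0, while the ReLU-tanh gate is identically 0 past $\gamma$.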

F COMPUTATIONAL COMPLEXITY ANALYSIS OF EVOLUTIONARY SAMPLING VS ITS BASELINES

In this section, we aim to provide a comprehensive comparison of the computational complexity of evolutionary sampling and its baselines. It is well known that the cost of computing PDE residuals using automatic differentiation during training is considerable, especially when the PDE contains higher-order derivatives that require repeated backward passes through the computational graphs built by standard deep learning packages like PyTorch/TensorFlow. Thus, in this section, we mainly focus on comparing the number of PDE residual computations that each algorithm makes during training.

Notations: let $N$ denote the total number of training epochs, $K$ the re-sampling period (new collocation points are added every $K$ epochs), $C$ the cost of one PDE residual evaluation at a single point, $M$ the number of points added to the training set at every re-sampling round of RAR-based methods, $|P|$ the size of the (initial) collocation set, and $|P_{dense}|$ the size of the dense candidate set used by RAR-based methods to select high-residual points.

Since RAR-based methods grow the training set by $M$ points every $K$ epochs, the cost of computing the PDE residuals on the training set $P$ is:
$$C^{train}_{RAR} = KC|P| + KC(|P| + M) + KC(|P| + 2M) + \dots + KC\left(|P| + M\tfrac{N}{K}\right) = KC|P|\left(\tfrac{N}{K} + 1\right)\left(1 + \tfrac{MN}{2|P|K}\right).$$
There is also an additional cost of evaluating the PDE residuals on the dense set in order to select the $M$ points from high PDE residual regions:
$$C^{dense}_{RAR} = C|P_{dense}|\tfrac{N}{K}.$$
Therefore, the overall cost for RAR-based methods is $C_{RAR} = C^{dense}_{RAR} + C^{train}_{RAR}$.

Comparing the cost between Evo and RAR: assume that the total number of epochs $N$ is divisible by the re-sampling period $K$, which is true in most practical scenarios, and note that Evo evaluates residuals on a fixed-size population every epoch, i.e., $C_{Evo} = NC|P|$. Then:
$$C_{RAR} = NC|P| + KC|P| + \tfrac{MNC}{2}\left(\tfrac{N}{K} + 1\right) + C|P_{dense}|\tfrac{N}{K} = C_{Evo} + KC|P| + \tfrac{MNC}{2}\left(\tfrac{N}{K} + 1\right) + C|P_{dense}|\tfrac{N}{K},$$
$$\implies C_{RAR} - C_{Evo} = KC|P| + \tfrac{MNC}{2}\left(\tfrac{N}{K} + 1\right) + C|P_{dense}|\tfrac{N}{K}.$$
Thus, the difference in computational cost can grow quickly depending on the choice of RAR settings. Also note that, since $|P_{dense}| \gg |P|$, the additional cost $C|P_{dense}|\tfrac{N}{K}$ is significant, especially if we want to re-sample/re-evaluate frequently (i.e., for small values of $K$).
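The bookkeeping above can be sketched as a toy cost model in the section's notation (the concrete numbers in the usage example are hypothetical):

```python
def cost_evo(N, C, P):
    """Evo evaluates residuals on a fixed population of |P| points every epoch."""
    return N * C * P

def cost_rar(N, C, P, K, M, P_dense):
    """RAR-style methods: the training set grows by M points every K epochs, plus
    a dense candidate sweep of |P_dense| points at each of the N/K rounds."""
    rounds = N // K                                   # assumes K divides N
    train = sum(K * C * (P + j * M) for j in range(rounds + 1))
    dense = C * P_dense * rounds
    return train + dense

# Toy setting; the gap matches the closed form KC|P| + (MNC/2)(N/K+1) + C|P_dense|N/K
N, C, P, K, M, P_dense = 1000, 1, 1000, 100, 10, 100_000
gap = cost_rar(N, C, P, K, M, P_dense) - cost_evo(N, C, P)
print(gap)   # extra residual evaluations, dominated here by the dense sweep
```

In this toy setting the gap is 1,155,000 extra residual evaluations, of which the dense-set sweep alone contributes 1,000,000, illustrating why $|P_{dense}| \gg |P|$ makes frequent re-sampling expensive.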

G DETAILS OF PARTIAL DIFFERENTIAL EQUATIONS USED IN THIS WORK G.1 CONVECTION EQUATION

We considered a 1D convection equation that is commonly used to model transport phenomena, described as follows:
$$\frac{\partial u}{\partial t} + \beta \frac{\partial u}{\partial x} = 0, \quad x \in [0, 2\pi],\ t \in [0, 1], \quad (17)$$
$$u(x, 0) = h(x), \quad (18)$$
$$u(0, t) = u(2\pi, t), \quad (19)$$
where $\beta$ is the convection coefficient and $h(x)$ is the initial condition. For our case studies, we used a constant setting of $h(x) = \sin(x)$ with periodic boundary conditions in all our experiments, while varying the value of $\beta$ in different case studies.
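As a quick check of Equations 17-19, the analytic solution $u(x, t) = \sin(x - \beta t)$ for $h(x) = \sin(x)$ transports the initial profile with speed $\beta$ and drives the residual to zero. A small finite-difference sketch ($\beta = 30$ is one of the paper's settings; the grid and step size are illustrative):

```python
import numpy as np

beta = 30.0

def u_exact(x, t):
    # Analytic solution of u_t + beta * u_x = 0 with u(x, 0) = sin(x):
    # the initial profile transported with speed beta.
    return np.sin(x - beta * t)

# Central finite differences as a stand-in for autograd when checking residuals
h = 1e-5
x = np.linspace(0.0, 2.0 * np.pi, 64)
t = np.full_like(x, 0.37)
u_t = (u_exact(x, t + h) - u_exact(x, t - h)) / (2 * h)
u_x = (u_exact(x + h, t) - u_exact(x - h, t)) / (2 * h)
residual = u_t + beta * u_x
print(np.max(np.abs(residual)))   # ~1e-6: zero up to O(h^2) finite-difference error
```

In a PINN, `u_exact` would be replaced by the network and the derivatives computed via automatic differentiation; the residual field $R_\theta = u_t + \beta u_x$ is what Evo's sampling targets.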

G.2 ALLEN-CAHN EQUATION

We considered a 1D Allen-Cahn equation that is used to describe the process of phase separation in multi-component alloy systems, as follows:
$$\frac{\partial u}{\partial t} - 0.0001 \frac{\partial^2 u}{\partial x^2} + 5u^3 - 5u = 0, \quad x \in [-1, 1],\ t \in [0, 1], \quad (20)$$
$$u(x, 0) = x^2 \cos(\pi x), \quad (21)$$
$$u(t, -1) = u(t, 1), \quad (22)$$
$$\frac{\partial u}{\partial x}\bigg|_{x=-1} = \frac{\partial u}{\partial x}\bigg|_{x=1}. \quad (23)$$

G.3 EIKONAL EQUATION

We formulate the Eikonal equation for signed distance function (SDF) calculation as:
$$|\nabla u| = 1, \quad x, y \in [-1, 1], \quad (24)$$
$$u(x_s) = 0, \quad x_s \in S, \quad (25)$$
$$u(x, -1),\ u(x, 1),\ u(-1, y),\ u(1, y) > 0, \quad (26)$$
where $S$ is the zero-contour set of the SDF. In training the PINN, we use the zero-contour constraint as the initial condition loss and the positive boundary constraint as the boundary loss (see Table 3 for details of loss balancing).

G.4 KURAMOTO-SIVASHINSKY EQUATION

We use the 1D Kuramoto-Sivashinsky equation from CausalPINN (Wang et al., 2022b):
$$\frac{\partial u}{\partial t} + \alpha u \frac{\partial u}{\partial x} + \beta \frac{\partial^2 u}{\partial x^2} + \gamma \frac{\partial^4 u}{\partial x^4} = 0,$$
subject to periodic boundary conditions and an initial condition $u(0, x) = u_0(x)$. The parameters $\alpha$, $\beta$, $\gamma$ control the dynamical behavior of the equation. We use the same configurations as CausalPINN: $\alpha = 5$, $\beta = 0.5$, $\gamma = 0.005$ for the regular setting, and $\alpha = 100/16$, $\beta = 100/16^2$, $\gamma = 100/16^4$ for chaotic behavior.

H DETAILS ON SKEWNESS AND KURTOSIS METRICS

Skewness and kurtosis are two basic metrics used in statistics to characterize the distribution of a set of values $\{Y_i\}_{i=1}^N$. A high value of skewness indicates a lack of symmetry in the distribution, i.e., the distributions of values to the left and to the right of the center point are not identical. On the other hand, a high value of kurtosis indicates the presence of a heavy tail, i.e., there are more values far away from the center of the distribution relative to a Normal distribution. In our implementation using scipy, we used the adjusted Fisher-Pearson coefficient of skewness and Fisher's definition of kurtosis, as defined below.

Skewness: For univariate data $Y_1, Y_2, \dots, Y_N$, the formula for skewness is
$$\text{skewness} = \frac{\sqrt{N(N-1)}}{N-2} \times \frac{\sum_{i=1}^N (Y_i - \bar{Y})^3 / N}{s^3},$$
where $\bar{Y}$ is the sample mean of the distribution and $s$ is the standard deviation. For any symmetric distribution (e.g., the Normal distribution), the skewness is equal to zero. A positive value of skewness means there is more mass to the right of the center point of the distribution than to the left; similarly, a negative value of skewness means there is more mass to the left of the center point than to the right. In our use-case, a large positive skewness of the PDE residuals indicates that there are some asymmetrically high PDE residual values to the right.

Kurtosis:

Kurtosis is the fourth central moment divided by the square of the variance, minus 3, defined as follows:
$$\text{kurtosis} = \frac{\sum_{i=1}^N (Y_i - \bar{Y})^4 / N}{s^4} - 3.$$
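The two definitions above can be written directly in NumPy (a small self-contained sketch; here $s$ is taken as the population standard deviation, and the "spiky" sample is a hypothetical stand-in for an imbalanced residual field):

```python
import numpy as np

def skewness(y):
    # Adjusted Fisher-Pearson coefficient: sqrt(N(N-1))/(N-2) * m3 / s^3
    y = np.asarray(y, dtype=float)
    n, m = y.size, y.mean()
    s = np.sqrt(np.mean((y - m) ** 2))       # population standard deviation
    return np.sqrt(n * (n - 1)) / (n - 2) * np.mean((y - m) ** 3) / s ** 3

def kurtosis(y):
    # Fisher's definition: fourth central moment over squared variance, minus 3
    y = np.asarray(y, dtype=float)
    m = y.mean()
    m2 = np.mean((y - m) ** 2)
    return np.mean((y - m) ** 4) / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
sym = rng.normal(size=100_000)                        # symmetric, Normal tails
spiky = np.concatenate([rng.normal(0.0, 0.01, 99_000),
                        rng.normal(5.0, 0.1, 1_000)])  # narrow high-residual spike
print(skewness(sym), kurtosis(sym))       # both near 0
print(skewness(spiky), kurtosis(spiky))   # both large and positive
```

The spiky sample mimics the residual distributions observed under propagation failure: a narrow region of very high values yields simultaneously large skewness and kurtosis, which is exactly what Figures 4 and 11 track.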

I HYPER-PARAMETER SETTINGS AND IMPLEMENTATION DETAILS

The hyper-parameter settings of the different baseline methods for every benchmark PDE are provided in Table 3. Note that we used the same network architecture and other hyper-parameter settings across all baseline method implementations for the same PDE. In this table, the column 'r/ic/bc' reports the settings of the $\lambda_r$, $\lambda_{ic}$, $\lambda_{bc}$ hyper-parameters used to weight the different loss terms in the overall learning objective of PINNs. Table 3 also lists the type of optimizer, learning rate (lr), and learning rate scheduler (lr.scheduler) used across all baselines for every PDE. For the Eikonal equation, we used the same modified multi-layer perceptron (MLP) architecture as the one proposed in (Wang et al., 2020). Additionally, for the Causal Evolutionary Sampling method, we used the following hyper-parameter settings across all PDEs: $\alpha = 5$, gate learning rate $\eta_g = 10^{-3}$, tolerance $\epsilon = 20$, initial value of $\gamma = -0.5$, and $\Delta_{max} = 0.1$. The number of iterations (and the corresponding PDE coefficients for the convection equation) are provided in Section 5 of the main paper.

Hardware Implementation Details: We trained each of our models on one Nvidia Titan RTX 24GB GPU.

J ADDITIONAL DISCUSSION OF RESULTS

J.1 VISUALIZING PROPAGATION FAILURE FOR DIFFERENT SETTINGS OF β

In Figure 2, we observed highly imbalanced PDE residual fields persisting over a large number of iterations (or epochs), and a simultaneous stagnation in the relative error values even though the mean PDE residual kept decreasing. Here, in Figure 11, we show that the same phenomenon can be observed for other large values of $\beta > 10$, namely $\beta = 30, 50, 70$. We can see that the relative errors for all three cases remain high even though the PDE residual loss keeps decreasing with iterations. We can also see that the absolute values of skewness and kurtosis increase as we increase $\beta$, indicating higher risks of propagation failure. In fact, for $\beta = 30$, the epoch that marks an abrupt increase in skewness and kurtosis (around 50K iterations) also shows a sudden increase in the relative error, highlighting the connection between imbalanced PDE residuals and the phenomenon of propagation failure.

J.4 VISUALIZING THE EVOLUTION OF COLLOCATION POINTS IN EVO

Figure 14 shows the evolution of collocation points and PDE residual maps of Evo as we progress in training iterations for the convection equation with β = 50. We can see that the retained population of Evo at every iteration (shown in red) selectively focuses on high PDE residual regions, while the re-sampled population (shown in blue) are generated from a uniform distribution. By increasing the contribution of high residual regions in the computation of the PDE loss, we can see that Evo is able to reduce the PDE loss over iterations without admitting high imbalance, thus mitigating the propagation failure mode, in contrast to conventional PINNs.

J.5 VISUALIZING THE EVOLUTION OF COLLOCATION POINTS IN CAUSAL EVO

Figure 15 shows the evolution of collocation points and PDE residuals of Causal Evo, along with the dynamics of the causal gate function. We can see that the retained population at every iteration (shown in red) strictly adheres to the principle of causality, such that collocation points are sampled only from the region currently allowed by the causal gate.

J.6 VISUALIZING THE EVOLUTION OF COLLOCATION POINTS IN RAR-BASED METHODS

Figures 16, 17, and 18 show the evolution of collocation points and PDE residuals of RAR-G, RAD, and RAR-D, respectively. We can see that all three RAR-based methods fail to converge to the correct solution even after 100K iterations, demonstrating their inability to mitigate propagation failures.

J.7 KURAMOTO-SIVASHINSKY (KS) EQUATIONS

We ran three additional experiments on the Kuramoto-Sivashinsky (KS) equations (one regular, relatively simple case and two exhibiting chaotic behavior). Note that these equations are particularly complex, especially in the chaotic cases, where a small change in the state of the solution can result in very large errors downstream in time. Thus, for chaotic domains, successful propagation of the solution from the initial and boundary conditions is critical to guarantee convergence. We would also like to highlight that the computational cost of these experiments is significantly higher. We used the exact same hyper-parameter settings as those provided in CPINN, except the sample size, which was varied from 128 to 2048 in the KS-regular case, and the number of training iterations, which was kept at 300K for our proposed approaches while CPINN was allowed a maximum of about 1M iterations with early stopping. Our method on average takes 50-60% less time than CPINN because of the significantly smaller number of iterations. Figure 19 compares the performance of CPINN, Evo, and Causal Evo on the KS equation (regular case) as a function of the number of collocation points used in PINN training. We can see that both Evo and Causal Evo show improvements over CPINN when the number of collocation points is small ($N_r = 128$). As the number of collocation points is increased, Causal Evo shows better performance than Evo, as it incorporates an additional prior of causality along with satisfying the four motivating properties of Evo. Overall, Causal Evo mostly performs better than both CPINN and Evo across different training set sizes. Note that these curves were obtained using a single run of every method due to the computational cost of training for the KS equations; multiple runs per method would help quantify the variance in these results.
Table 4 compares the performance of CPINN, Evo, and Causal Evo on the three KS-equation cases used in the original CPINN paper. Note that for these experiments, we used the exact same hyper-parameter settings as the original CPINN; thus, a large number of collocation points was used (2048 for the regular case and 8192 for the two chaotic regimes, for each time-window). We observe that on the regular case, Evo performs similarly to CPINN, while Causal Evo is significantly better than both. However, in the first chaotic case, CPINN is slightly better than both Evo and Causal Evo. Finally, in the much more chaotic regime of the KS equation (extended case), we find that all methods struggle to obtain a high-fidelity solution of the field, though Evo and Causal Evo are somewhat better than CPINN. Hence, we can say that Evo and Causal Evo have comparable performance to CPINN in its own benchmark settings. Additional visualizations of the KS solutions are provided in the Appendix.

J.8 SOLVING EIKONAL EQUATIONS

We chose to solve 2D Eikonal equations for complex arbitrary surface geometries as they represent particularly hard PDE problems that are susceptible to PINN failure modes. In these problems, we are given the zero contours of the equation on the boundaries (representing the outline of the 2D object), which can take arbitrary shapes. The goal is to correctly propagate the boundary conditions to obtain the unique target solution where the interior is negative and the exterior is positive. Here, any small error in propagation from the boundaries can lead to cascading errors, such that a large segment of the predicted field can have opposite signs compared to the ground truth even though its PDE residuals are close to 0. Since Evo is explicitly designed to break propagation barriers and thus enable easy transmission of the solution from the boundary to the interior/exterior points, it shows significantly better performance.
On the other hand, PINN (fixed) and PINN (dynamic) struggle to converge to the correct solution, especially for complex geometries (e.g., the 'gear'), because of the inherent challenge of sampling an adequate number of points from arbitrarily shaped object boundaries exhibiting highly imbalanced residuals. In Figure 23, we show the evolution of the solutions of the comparative methods for the 'gear' case over iterations. We can see that Evo resolves the high residual regions better than the baselines, and thus encounters fewer incorrect "sign flips" compared to the ground truth, even in the early iterations of PINN training.

K OPTIMIZATION CHARACTERISTICS OF EVOLUTIONARY SAMPLING ON TEST OPTIMIZATION FUNCTIONS

In this section, we demonstrate the ability of our proposed evolutionary sampling algorithm to find the global maxima of various test optimization functions, and provide further characterizations of its optimization behavior.

K.1 AUCKLEY FUNCTION

The two-dimensional form of the Auckley function has multiple local maxima in the near-flat region of the function and one large peak at the center:
$$f(\mathbf{x}) = a \exp\left(-b\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}\right) + \exp\left(\frac{1}{d}\sum_{i=1}^d \cos(c x_i)\right) + a + \exp(1).$$
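A NumPy sketch of this objective (the constants `a`, `b`, `c` are the standard Ackley-family values and are an assumption here, since the paper's exact constants are not given in this appendix):

```python
import numpy as np

def auckley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    # Peaked test function matching the description in K.1: a near-flat rippled
    # region far from the origin and one large peak at the center.
    # Constants a, b, c are assumed standard Ackley-family values.
    x = np.atleast_2d(np.asarray(x, dtype=float))
    term1 = a * np.exp(-b * np.sqrt(np.mean(x ** 2, axis=-1)))
    term2 = np.exp(np.mean(np.cos(c * x), axis=-1))
    return term1 + term2 + a + np.e

# The global maximum sits at the origin, with value 2a + 2e
print(auckley([0.0, 0.0]))                  # ~45.44 for a = 20
print(auckley([[1.5, -0.5], [10.0, 10.0]]))  # well below the central peak
```

Maximizing a function of this shape is exactly the regime studied in Figure 5: the retained population must discover and accumulate on the single narrow peak among many small local maxima.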

K.2 BOHACHEVSKY FUNCTION

The two-dimensional form of the Bohachevsky function is an inverted bowl-shaped function with one global maximum:
$$f(x, y) = -x^2 - 2y^2 + 0.3\cos(3\pi x) + 0.4\cos(4\pi y) - 0.7.$$

K.3 DROP-WAVE FUNCTION

The two-dimensional form of the Drop-Wave function is multimodal and highly complex: f(x, y) = (1 + cos(12 sqrt(x^2 + y^2))) / (0.5 (x^2 + y^2) + 2)

K.4 EGG-HOLDER FUNCTION

The two-dimensional form of the Egg-Holder function is a highly complex function that is difficult to optimize because of the presence of multiple local maxima.

K.5 HOLDER-TABLE FUNCTION

The two-dimensional form of the Holder-Table function has many local maxima, but has 4 global maxima at the four corners.



https://www.dropbox.com/sh/45gec7qvgiutz2x/AABz4aFJ1IfMIxY11OkLbJ9Oa?dl=0.

We assume that the batch size (the number of collocation points used to compute the PDE loss) tends to infinity, i.e., N_r → ∞. This allows us to analyze the behavior of the continuous PDE loss function L_r(θ).
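The N_r → ∞ assumption can be illustrated empirically: the Monte Carlo estimate of the loss over sampled collocation points converges to the continuous integral as N_r grows. A small sketch on a toy squared-residual field (the field x^4, the sample sizes, and the number of repeated trials are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy squared-residual field on [0, 1]; the continuous L2 loss is the
# exact integral of x^4 over [0, 1], i.e., 1/5.
exact = 0.2

def mc_loss(n_r):
    # Empirical PDE loss computed on n_r uniform collocation points.
    x = rng.uniform(0.0, 1.0, size=n_r)
    return np.mean(x**4)

# Average estimation error over repeated draws, for small vs. large N_r.
err_small = np.mean([abs(mc_loss(10) - exact) for _ in range(200)])
err_large = np.mean([abs(mc_loss(10_000) - exact) for _ in range(200)])

print(err_small, err_large)   # error shrinks as N_r grows
```

The error decays at the usual O(1/sqrt(N_r)) Monte Carlo rate, so the empirical loss is a faithful proxy for the continuous loss L_r(θ) in the large-N_r limit used by the analysis.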



Figure 1: PINN solutions for a simple ODE: u_xx + k^2 u = 0 (k = 20) with the analytical solution u = A sin(kx) + B cos(kx). We can see smooth propagation of the correct solution from the boundary point at x = 0 to interior points (x > 0) as we increase training iterations.

Figure 2: Demonstration of propagation failure while solving the convection equation with β = 50.

Figure 4: Comparison of Skewness, Kurtosis, Max PDE Residuals, and Mean PDE Residuals over training iterations for PINN-fixed, PINN-dynamic, and Evo for convection equation with β = 50.

Figure 5: PDE loss of P_r going from L_2 to L_∞ with iterations for a sample objective function.

Figure 6: Sample Efficiency of Evo and Causal Evo at low N_r.

Figure 7: Solving the Eikonal equation for a signed distance field (SDF). The color of the heatmap represents the values of the SDF. The gray region shows the negative values of the SDF that represent the interior points in the geometry reconstructed from the predicted SDF.

Figure 8: Schematic to describe our proposed evolutionary sampling algorithm, where collocation points are incrementally accumulated in regions with high PDE residuals (shown as contour lines).

Proof of Property 1: Since the re-sampled population P_s replaces the non-retained population P̄_r at every iteration, |P_s| = |P̄_r|. Hence, |P_s| > 0, i.e., the size of the re-sampled population is always non-zero. Proof of Property 2: By definition, |P_r| + |P̄_r| = |P| (where |P| is the total size of the population and is always constant). Since |P̄_r| > 0, we can say that |P_r| < |P|, i.e., the size of the retained population can never be equal to the entire population size |P|.

Figure 9: Causal Evo uses a time-dependent causal gate for computing PDE loss and for sampling.

Figure 10: ReLU-tanh Causal Gate

C: Computational cost to evaluate the PDE residual on a single collocation point.
N: Total number of training iterations.
|P|: Total number of collocation points / population size (also referred to as the initial set of collocation points for RAR-based methods).
|P_dense|: Auxiliary set of dense points [for RAR-G, RAD, and RAR-D], where |P_dense| >> |P|.
K: Resampling period [for RAR-G, RAD, and RAR-D].
M: Number of additional collocation points added to the initial set.

Next, we present the computational cost to run each method separately. PINN/Evo: The cost of computing the PDE residuals at an arbitrary epoch i is C_Evo(i) = C|P|. Thus, the overall computational cost for N iterations is C_Evo = N C |P|. RAR-G/RAD/RAR-D: For RAR-based methods, the initial set of collocation points P keeps growing as M new points are added every K iterations. Thus, at iteration i, the size of the collocation point set is |P_i| = |P_0| + M⌊i/K⌋ = |P| + M⌊i/K⌋ (to simplify notation, let |P_0| = |P|, i.e., all of the methods start with the same number of initial collocation points).
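Under this notation, the total cost of RAR-style methods follows by summing the growing population size over iterations. A small sketch comparing the two totals; the numeric settings are illustrative choices, not values from the paper's experiments:

```python
# Computational cost comparison sketch (illustrative settings).
C = 1.0        # cost of one residual evaluation
N = 10_000     # training iterations
P = 1_000      # population size |P| (constant for PINN/Evo)
M = 100        # points added every K iterations (RAR-based)
K = 500        # resampling period

# PINN/Evo: the population size is constant, so the total cost is N * C * |P|.
cost_evo = N * C * P

# RAR-G/RAD/RAR-D: |P_i| = |P| + M * floor(i / K) grows over training,
# so the per-iteration cost must be summed explicitly.
cost_rar = sum(C * (P + M * (i // K)) for i in range(N))

print(cost_evo, cost_rar)
```

With these settings the RAR total is nearly twice the Evo total, even though both start from the same |P|; the gap widens with longer training since the RAR cost grows quadratically in N while Evo's grows linearly.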

For a Normal distribution, the (excess) Kurtosis is equal to 0. A positive value of Kurtosis indicates that there are more values in the tails of the distribution than expected from a Normal distribution, while a negative value of Kurtosis indicates fewer values in the tails relative to a Normal distribution. In our use-case, a large positive Kurtosis of the PDE residuals indicates that some very high PDE residual values occur in very narrow regions of the spatio-temporal domain, which are picked up as the heavy tails of the distribution.
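This diagnostic can be computed directly from residual samples using the standard moment definition of excess kurtosis. A numpy sketch; the synthetic residual fields below are illustrative, not outputs of a trained PINN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Excess kurtosis: fourth standardized moment minus 3 (0 for a Normal).
def excess_kurtosis(r):
    z = (r - r.mean()) / r.std()
    return (z**4).mean() - 3.0

# A "balanced" residual field vs. an imbalanced one where very high
# residuals occur over a very narrow region (a heavy right tail).
balanced = rng.normal(0.1, 0.02, size=100_000)
imbalanced = rng.normal(0.1, 0.02, size=100_000)
imbalanced[:500] += 10.0   # 0.5% of the points carry huge residuals

print(excess_kurtosis(balanced))    # close to 0
print(excess_kurtosis(imbalanced))  # large and positive
```

Even though only 0.5% of the points are affected, the excess kurtosis jumps by two orders of magnitude, which is exactly the signature of a propagation failure described above.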

In the main paper, we demonstrated the phenomenon of propagation failure for the convection equation with β = 50, which was characterized by large values of Skewness and Kurtosis in the distribution of PDE residuals.

Figure 11: Demonstration of Propagation Failure for Different Settings of β (β= 10, 30, 50, 70)

Figure 14: Demonstrating the propagation of information from the initial/boundary points to the interior points for Evo on the Convection Equation (β = 50)

Figure 15: Demonstrating the propagation of information from the initial/boundary points to the interior points for Causal Evo on the Convection Equation (β = 50)

Figure 16: Demonstrating the dynamic changes in the collocation points for RAR-G on the Convection Equation (β = 50)

Figure 18: Demonstrating the dynamic changes in the collocation points for RAR-D on the Convection Equation (β = 50)

Figure 20: Visualization of the Evo, Causal Evo and CausalPINN on KS-Equation (Regular Case).

Figure 21: Visualization of the Evo, Causal Evo and CausalPINN on KS-Equation (Chaotic Case).

Figure 22: Visualization of the Evo, Causal Evo and CausalPINN on KS-Equation (Extended Chaotic Case).

Figure 25: Demonstrating the evolution of the randomly initialized points while optimizing the Ackley function. The red triangles represent the re-sampled population at that epoch, and the blue dots represent the retained population at that epoch. The contour plot of the objective function is shown in the background.

Figure 31: Demonstrating the dynamic evolution of the retained population size over epochs on the Bohachevsky Function.

Figure 34: Illustrating the dynamic behavior of the Evolutionary Sampling algorithm on the Drop-Wave Function using the L_2 Physics-informed Loss computed on the retained and re-sampled populations. The horizontal lines represent the L_p Physics-informed Loss on a dense set of uniformly sampled collocation points (where p = 2, 4, 6, ∞).

Figure 35: Demonstrating the dynamic evolution of the retained population size over epochs on the Drop-Wave Function.

Figure 41: Demonstrating the evolution of the randomly initialized points while optimizing the Holder-Table function. The red triangles represent the re-sampled population at that epoch, and the blue dots represent the retained population at that epoch. The contour function of the objective function is shown in the background.

Figure 42: Illustrating the dynamic behavior of the Evolutionary Sampling algorithm on the Holder-Table Function using the L_2 Physics-informed Loss computed on the retained and re-sampled populations. The horizontal lines represent the L_p Physics-informed Loss on a dense set of uniformly sampled collocation points (where p = 2, 4, 6, ∞).

Figure 43: Demonstrating the dynamic evolution of the retained population size over epochs on the Holder-Table Function.

Figure 51: Demonstrating the dynamic evolution of the retained population size over epochs on the Michalewicz Function.

Relative L_2 errors (in %) of comparative methods over benchmark PDEs with N_r = 1000.

However, since this scaling term is variable in nature as Z depends on the neural network parameters θ, optimizing L_r^p(U) is not directly equivalent to optimizing L_r^2(Q_k) (or a power thereof).

Table comparing baseline methods in terms of their ability to comply with the motivating properties of this work.

Hyper-parameter settings for different baseline methods for every benchmark PDE

Relative L_2 errors (in %) of CausalPINN, Evo, and CausalEvo over the three different KS-Equation benchmarks (Wang et al., 2022b).


