D-CIPHER: DISCOVERY OF CLOSED-FORM PARTIAL DIFFERENTIAL EQUATIONS

Abstract

Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena. Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known phenomena. Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequent observations. To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations. We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.

1. INTRODUCTION

Scientists have been using mathematical equations to describe the world for centuries. In particular, closed-form differential equations have turned out to be one of the best tools to model physical phenomena. A differential equation describes a relationship between a quantity and its derivatives (rates of change); it is called closed-form if this relationship is described by a mathematical expression consisting of a finite number of variables, constants, arithmetic operations, and some well-known functions (e.g., exponential, logarithm, trigonometric functions). Closed-form differential equations provide a general description of reality in a concise representation that is amenable to closer inspection by scientists. This renders them transparent and interpretable to human experts. Discovering these equations has traditionally required a thorough knowledge of the theory, strong mathematical skills, substantial creativity, and good intuition. The goal of this work is to discover closed-form differential equations directly from data, thus accelerating the process of scientific discovery.

Challenges in discovering differential equations from data

• Partial and higher-order derivatives. Many algorithms (Brunton et al., 2016; Qian et al., 2022) can only identify Ordinary Differential Equations (ODEs), which evolve with respect to only one variable (usually time). In contrast, many natural phenomena are described by equations involving many variables (e.g., spatial coordinates), called Partial Differential Equations (PDEs). Many equations also involve higher-order derivatives.
• Derivatives not observed. Discovering differential equations from data is challenging because the derivatives are usually not observed in the dataset (equation-data mismatch (Qian et al., 2022)). This makes verifying a candidate equation a non-trivial task. Most of the methods proposed in the literature try to resolve this issue by estimating the derivatives (Brunton et al., 2016; Rudy et al., 2017). However, estimating the derivative is difficult, especially when the data is sampled infrequently or with high noise (Qian et al., 2022; Messenger & Bortz, 2021a).
• Strong assumptions and constrained search space. The majority of algorithms for identifying differential equations make many assumptions about the form of the equation. In particular, they make the evolution assumption (defined and explained later) and assume that the equation can be represented as a linear combination of prespecified functions and differential operators (Brunton et al., 2016; Messenger & Bortz, 2021a). However, many well-known equations, such as a forced harmonic oscillator or an inhomogeneous wave equation, cannot be represented in that way.

Currently, a few algorithms tackle only some of these challenges. In particular, Weak SINDy (Messenger & Bortz, 2021a) is able to discover PDEs without estimating the derivative by utilizing a variational approach. However, the equation is constrained to a form amenable to a sparse regression algorithm.
D-CODE (Qian et al., 2022), on the other hand, uses a variational approach in conjunction with a symbolic regression algorithm to discover closed-form ODEs. However, it cannot handle higher-order derivatives or multiple independent variables, so it cannot be used to discover closed-form PDEs. Algorithms that do not require the evolution assumption appeared in Mangan et al. (2016) and Kaheman et al. (2020), but they require derivative estimation and only consider equations represented as linear combinations of prespecified functions.

Contributions. In this work, we develop the Discovery of Closed-form Partial and Higher-order Differential Equations in a Robust Way framework (D-CIPHER) that does not estimate the derivatives, requires fewer assumptions, has a broader search space than previous techniques, and works for both higher-order ODEs and PDEs. Our contributions are as follows:
• We examine the landscape of different types of PDEs from the discovery perspective. In particular, we introduce new notions such as evolution form, evolution assumption, derivative-bound part, and derivative-free part. We use them to describe what kinds of PDEs can be discovered with current methods and to motivate our new class of differential equations. (Section 2)
• We propose a new general class of PDEs (Variational-Ready PDEs) that admit the variational formulation (and thus allow us to circumvent derivative estimation). We also prove a theorem that motivates a novel objective function. (Section 4)
• We use the novel objective function to develop D-CIPHER, a new algorithm that searches over the Variational-Ready PDEs. (Section 5)
• We develop a new optimization procedure (CoLLie) to efficiently solve a constrained least-squares problem and thus help D-CIPHER search through this space efficiently. (Section 6)

2. PARTIAL DIFFERENTIAL EQUATIONS

In this section, we provide background information about Partial Differential Equations and introduce new notions necessary for the following discussion.

Notation and definitions. We denote the set $\{1, 2, \ldots, n\}$ as $[n]$ and the set of non-negative integers as $\mathbb{N}_0$. Throughout this paper we let $M, N, K \in \mathbb{N}$ be some natural numbers and let $\Omega \subset \mathbb{R}^M$ be an open set inside $\mathbb{R}^M$. A comprehensive table with all symbols used in this work can be found in Appendix A together with some definitions restated more formally.

Going beyond ODEs. The simplest differential equations are ordinary differential equations that describe quantities that evolve with respect to only one independent variable, usually time. Most methods assume that the ODE is explicit and as such can be represented as a system of first-order ODEs:

$$\dot{u}_j(t) = f_j(t, u(t)) \qquad (1)$$

where $\dot{u}_j$ represents the derivative of $u_j$. Then the discovery problem is reduced to deciding the order of the derivative (usually first or second) and the discovery of $f_j$.

For PDEs, it is not enough to talk about the derivative, as we can take derivatives with respect to different variables. We denote the mixed derivative as $\partial^{\alpha}$, where $\alpha \in \mathbb{N}_0^M$ is called a multi-index, and define it as $\partial^{\alpha} = \partial_1^{\alpha_1} \partial_2^{\alpha_2} \ldots \partial_M^{\alpha_M}$. Each $\partial_i^{\alpha_i} = \partial^{\alpha_i} / \partial x_i^{\alpha_i}$ is an $\alpha_i$-th-order partial derivative with respect to $x_i$ (the $i$-th independent variable). We define the order of $\alpha$ as $|\alpha| = \sum_{i=1}^{M} \alpha_i$. We call $\partial^{\alpha}$ non-trivial if $|\alpha| > 0$. A PDE of order $K$ is any equation of the form

$$f(x, u(x), \partial^{[K]} u(x)) = 0 \quad \forall x \in \Omega \qquad (2)$$

where $\partial^{[K]} u$ are all non-trivial mixed derivatives of all $u_j$ ($j \in [N]$) up to the $K$-th order. We call a PDE closed-form if $f$ is closed-form. As PDEs might include many different combinations of derivatives, there is no generally accepted counterpart of explicit ODEs in the space of PDEs.

Evolution Assumption.
Although there is no generally accepted notion of an explicit PDE, we define an evolution form of a PDE to be an equation of the form

$$\partial^{\alpha} u_j(x) = f(x, u(x), \partial^{[K]/\alpha} u(x)) \quad \forall x \in \Omega \qquad (3)$$

where $\partial^{[K]/\alpha}$ is $\partial^{[K]}$ with $\partial^{\alpha}$ omitted, $\alpha$ is a known multi-index, and $j \in [N]$. Note that if $M = 1$ and $|\alpha| = K$ then Equation 3 becomes exactly the definition of an explicit ODE. In fact, many algorithms for PDE discovery assume a particular evolution form (Messenger & Bortz, 2021a). We call it an evolution assumption (EA). However, this assumption requires the knowledge of $\alpha$ and $j$, which might not be trivial. Usually, $\partial^{\alpha}$ is assumed to be the first derivative with respect to time ($\partial_t$) (Rudy et al., 2017), but this is not the case for many well-known PDEs such as the wave equation or Gauss's law. D-CIPHER does not need the evolution assumption. Moreover, it can discover some PDEs that cannot be put into the evolution form.

Linear combinations. Current PDE discovery algorithms (Rudy et al., 2017; Messenger & Bortz, 2021a; Chen et al., 2021) consider PDEs that can be represented as a linear combination of functions. That means the PDE has the form

$$\sum_{p=1}^{P} \theta_p f_p(x, u(x), \partial^{[K]} u(x)) = 0 \quad \forall x \in \Omega \qquad (4)$$

where $\theta_p \in \mathbb{R}$ for $p \in [P]$ are the only constants that are optimized. As there are a lot of expressions that cannot be put in that form, these algorithms fail to discover more complex equations. In particular, for an unknown $\theta \in \mathbb{R}$, functions such as $\sin(\theta x_i)$, $e^{\theta x_i}$, or $\frac{1}{x_i + \theta}$ cannot be learned by these algorithms. Unlike previous methods, D-CIPHER is not limited to PDEs that can be represented as a linear combination of functions. To describe the family of PDEs that D-CIPHER can discover we need to introduce the notions of derivative-bound and derivative-free parts.

Derivative-bound part and derivative-free part.
Any PDE can be expressed in the form

$$f(x, u(x), \partial^{[K]} u(x)) - g(x, u(x)) = 0 \quad \forall x \in \Omega \qquad (5)$$

where we collect all the terms with the derivatives into $f(x, u(x), \partial^{[K]} u(x))$ and all terms without the derivatives into $g(x, u(x))$. We call $f$ the derivative-bound part and $g$ the derivative-free part. We also denote them as ∂-bound and ∂-free. The significance of the ∂-free part is that it can be evaluated directly given $u$, whereas the ∂-bound part requires access to the derivatives, which are not observed. It is important to note that for first-order ODEs, $f$ is trivial and is just equal to $\dot{u}_j$.

The challenge of unobserved derivatives might put constraints on the ∂-bound part of the PDE, and thus in practice it might not be possible to search over all closed-form PDEs. However, we observe that no such constraints need to be put on the ∂-free part (as it does not include the derivatives). We take full advantage of this observation and search separately for the two parts. We search over all closed-form functions $g$ and, for each candidate, we try to find the best counterpart $f$ among the allowed expressions. This is very different from the previous approaches, which either do not need to find $f$ as they work only for first-order ODEs (Qian et al., 2022) or constrain both the ∂-bound part and the ∂-free part equally to be a linear combination of some pre-specified functions (Messenger & Bortz, 2021a).
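To make the multi-index notation concrete, the following minimal Python sketch enumerates all non-trivial multi-indices with order at most K for M independent variables, i.e., the index set behind $\partial^{[K]}$ for a single field. The function names and the string rendering are illustrative assumptions and are not part of the paper.

```python
from itertools import product


def multi_indices(num_vars, max_order):
    """Return all non-trivial multi-indices alpha in N_0^M with 0 < |alpha| <= K.

    num_vars plays the role of M and max_order the role of K.
    """
    indices = [
        alpha
        for alpha in product(range(max_order + 1), repeat=num_vars)
        if 0 < sum(alpha) <= max_order
    ]
    # Sort by order |alpha| first, then lexicographically, for readable output.
    return sorted(indices, key=lambda alpha: (sum(alpha), alpha))


def derivative_name(alpha, var_names):
    """Render a multi-index as text, e.g. (1, 2) -> 'd_t^1 d_x^2' (ad hoc notation)."""
    parts = [f"d_{name}^{a}" for name, a in zip(var_names, alpha) if a > 0]
    return " ".join(parts)


# Two independent variables (t, x) and derivatives up to second order.
for alpha in multi_indices(num_vars=2, max_order=2):
    print(alpha, derivative_name(alpha, ("t", "x")))
```

For M = 2 and K = 2 this yields five multi-indices, corresponding to the first-order derivatives in t and x, the mixed derivative, and the two pure second-order derivatives.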

3. RELATED WORKS

Symbolic Regression. The goal of symbolic regression is to find a closed-form expression that best models the given dataset both in terms of accuracy and simplicity. In contrast with conventional regression analysis, which optimizes the parameters of a pre-specified model, symbolic regression also searches over the form of the model itself. Genetic programming (Koza, 1992) has been widely used for that task (Schmidt & Lipson, 2009). A different strategy has been employed in AI Feynman (Udrescu et al., 2020; Udrescu & Tegmark, 2020), which uses neural networks to reduce the search space by identifying simplifying properties like symmetry or separability. Optimization methods based on pre-trained neural networks (Biggio et al., 2021), reinforcement learning (Petersen et al., 2021), and Meijer-G functions (Alaa & van der Schaar, 2019) have also been proposed.

[Table 1 (only part of the caption survives): "... 3)? Can it discover any closed-form ∂-free part (Equation 5)?" References: [1] (Brunton et al., 2016), [2] (Mangan et al., 2016), [3] (Rudy et al., 2017), [4] (Long et al., 2019)]

Data-driven discovery of closed-form differential equations. Data-driven discovery of physical laws is an established area of machine learning (Bongard & Lipson, 2007; Schmidt & Lipson, 2009). The pioneering work in that area was SINDy (Brunton et al., 2016), which constrained the space of equations to linear combinations of functions from a predefined library and used sparse regression to discover explicit ODEs. It was later extended to include implicit ODEs (Mangan et al., 2016; Kaheman et al., 2020) and PDEs (Rudy et al., 2017; Schaeffer, 2017). Various other extensions were proposed by improving the derivative estimation and the training procedure (Rao et al., 2022; Xu, 2021), adding additional selection criteria (Mangan et al., 2017), and learning the library using genetic programming (Maslyaev et al., 2019; Chen et al., 2021; Xu et al., 2020). A different approach is taken by Long et al. (2019), an extension of Long et al. (2018), which uses convolutional and symbolic neural networks.
It is important to note that all of these methods still assume the PDE to be a linear combination as discussed in Section 2 (Equation 4), which significantly limits their search space. Some other developments are based on Gaussian processes (Raissi et al., 2018; Raissi & Karniadakis, 2018), but they require the exact form of the PDE and only optimize the parameters.

Variational approach. Recently, the variational approach has been used as a viable alternative to derivative estimation. However, it has only been used for differential equations in a linear combination form (Messenger & Bortz, 2021a; b; Reinbold et al., 2020) or closed-form first-order ODEs (Qian et al., 2022). Extending the variational approach to closed-form PDEs is not trivial, as PDEs are much more complex than ODEs and not all closed-form PDEs admit the variational formulation. In fact, the approaches that learn the library mentioned in the previous paragraph can produce terms that do not admit the variational formulation, which prohibits its use. To address these challenges, we use the new notions defined in Section 2 to define a new and general class of PDEs in Section 4 that admit the variational formulation.

4. VARIATIONAL-READY PDES

In this section, we propose a new and very general class of PDEs, the Variational-Ready PDEs (VR-PDEs), which can be characterized without referring to the derivative. The VR-PDEs allow an arbitrary ∂-free part but impose some minor restrictions on the ∂-bound part. These restrictions allow one to use the variational formulation of PDEs to circumvent derivative estimation entirely. Despite the minor restrictions, VR-PDEs contain many well-known PDEs, including all linear PDEs, Maxwell's equations, and Navier-Stokes equations (additional examples provided in Appendix B). To define the new class of PDEs, we need the following definition.

Definition 1 (Extended derivative and differential operator). Let $\alpha \in \mathbb{N}_0^M$, $|\alpha| \le K$, be a multi-index. Let $h : \mathbb{R}^{M+N} \to \mathbb{R}$ and $a : \mathbb{R}^M \to \mathbb{R}$ be smooth functions. An extended derivative $E$, denoted $(\alpha, a, h)$, maps a vector field $u : \mathbb{R}^M \to \mathbb{R}^N$ to a function $E[u] : \mathbb{R}^M \to \mathbb{R}$ defined as:

$$E[u](x) = a(x)\, \partial^{\alpha} [h(x, u(x))]$$

$E$ is called closed-form if $a$ and $h$ are closed-form. We call $E$ non-degenerate if $|\alpha| > 0$. Now, let $(E_p)_{p \in [P]}$ be a finite sequence of non-degenerate extended derivatives. The extended differential operator, denoted $E^{[P]}$, is an operator defined as:

$$E^{[P]}[u](x) = \sum_{p=1}^{P} E_p[u](x)$$

Remark. Any linear operator $L = \sum_{\alpha \in A} a_{\alpha} \partial^{\alpha}$ acting on $u_j$ is an extended differential operator.

Definition 2 (Variational-Ready PDE). Let $E^{[P]}$ be an extended differential operator, and let $g : \mathbb{R}^{M+N} \to \mathbb{R}$ be a continuous function. We denote a Variational-Ready PDE (VR-PDE) by a pair $(E^{[P]}, g)$ and define it as:

$$E^{[P]}[u](x) - g(x, u(x)) = 0 \quad \forall x \in \Omega \qquad (8)$$

We extend the standard variational formulation of PDEs (Proposition 1 in Appendix B) from linear PDEs to all VR-PDEs. The following definition is useful in further discussion.

Definition 3. Consider a field $u : \Omega \to \mathbb{R}^N$ and an extended derivative $E = (\alpha, a, h)$. Let $\varphi : \Omega \to \mathbb{R}$ be a testing function ($C^K$ function with compact support).
We define the functional

$$\mathcal{F}(E, u, \varphi) = \int_{\Omega} h(x, u(x))\, (-1)^{|\alpha|}\, \partial^{\alpha} [a(x) \varphi(x)] \, dx$$

We can now use this functional to formulate a variational characterization of VR-PDEs.

Theorem 1. $u : \Omega \to \mathbb{R}^N$, where $u_j \in C^K$, is a solution to the VR-PDE in Equation 8 if and only if

$$\sum_{p=1}^{P} \mathcal{F}(E_p, u, \varphi) - \int_{\Omega} g(x, u(x)) \varphi(x) \, dx = 0 \qquad (9)$$

for all testing functions $\varphi : \Omega \to \mathbb{R}$.

Proof. Appendix B.

This theorem motivates the variational loss function, as we expect the left-hand side of Equation 9 to be closer to 0 the closer the candidate PDE is to the true one. To calculate how well a set of vector fields $\mathcal{D} = \{u^{(d)}\}_{d=1}^{D}$ matches a VR-PDE $(E^{[P]}, g)$ we propose the following loss function:

$$\mathcal{L}(E^{[P]}, g) = \sum_{d=1}^{D} \sum_{s=1}^{S} \left( \sum_{p=1}^{P} \mathcal{F}(E_p, u^{(d)}, \varphi_s) - \int_{\Omega} g(x, u^{(d)}(x)) \varphi_s(x) \, dx \right)^2 \qquad (10)$$

where $\{\varphi_s\}_{s=1}^{S}$ is a set of predefined testing functions. This novel loss function makes it possible to evaluate to what extent any VR-PDE matches the observed data. This loss can be used as an optimization objective in any algorithm that searches over some subspace of closed-form VR-PDEs. We propose D-CIPHER in Section 5 as an example of such an algorithm.
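As an illustration of Theorem 1, the sketch below numerically evaluates the left-hand side of Equation 9 in one dimension for the field u(t) = sin(t), which satisfies the second-order ODE with ∂-bound part $\partial_t^2 u$ and ∂-free part g(t, u) = -u. Everything here is an assumption made for illustration: the polynomial bump used as the testing function, the trapezoidal quadrature, and the function names. It is not the implementation used in the paper (which uses B-splines as testing functions).

```python
import math


def bump(t, lo, hi):
    """Polynomial testing function ((t-lo)(hi-t))^4 on [lo, hi], zero elsewhere."""
    if t <= lo or t >= hi:
        return 0.0
    return ((t - lo) * (hi - t)) ** 4


def bump_second_derivative(t, lo, hi):
    """Analytic second derivative of bump; no derivative of the data is needed."""
    if t <= lo or t >= hi:
        return 0.0
    s = (t - lo) * (hi - t)
    ds = lo + hi - 2.0 * t  # s'(t); note that s''(t) = -2
    return 12.0 * s**2 * ds**2 - 8.0 * s**3


def integrate(func, lo, hi, steps=2000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    total += sum(func(lo + k * h) for k in range(1, steps))
    return total * h


def variational_residual(u, g, lo=0.0, hi=2.0):
    """Left-hand side of Equation 9 for the candidate d^2u/dt^2 - g(t, u) = 0."""
    # F(E, u, phi) with E = (alpha=2, a=1, h=u): integral of u * (-1)^2 * phi''.
    derivative_bound = integrate(
        lambda t: u(t) * bump_second_derivative(t, lo, hi), lo, hi)
    derivative_free = integrate(lambda t: g(t, u(t)) * bump(t, lo, hi), lo, hi)
    return derivative_bound - derivative_free


# u(t) = sin(t) satisfies u'' = -u, so the first residual should be close to 0.
true_residual = variational_residual(math.sin, lambda t, u: -u)
# The candidate u'' = -2u is wrong, so its residual should be clearly non-zero.
wrong_residual = variational_residual(math.sin, lambda t, u: -2.0 * u)
print(true_residual, wrong_residual)
```

Only the testing function is differentiated, which is the point of the variational formulation: the field u enters the integrals through its values alone.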

5. D-CIPHER

In this section, we formulate the problem of PDE discovery and then we introduce a novel algorithm (D-CIPHER) to solve it.

Problem formulation

We are given a dataset of observed fields $\mathcal{D} = \{v^{(d)}\}_{d=1}^{D}$ with a finite sampling grid $G \subset \Omega$. Each $v^{(d)}(x)$ is a noisy measurement, i.e., $v^{(d)} : G \to \mathbb{R}^N$ is defined as

$$v_j^{(d)}(x) = u_j^{(d)}(x) + \epsilon_j^{(d)}(x) \quad \forall x \in G \;\; \forall j \in [N]$$

where $\epsilon_j^{(d)}(x)$ is a realization of a zero-mean random variable (noise), each $u_j^{(d)} : \Omega \to \mathbb{R}$ is a $C^K$ function, and every true field $u^{(d)}$ is governed by the same closed-form PDE $f$. The task is to infer the closed-form PDE, $f$, from the dataset $\mathcal{D} = \{v^{(d)}\}_{d=1}^{D}$ and the sampling grid $G$. We assume that $f$ is inside the class of closed-form VR-PDEs (Section 4) and its ∂-bound part is inside a subspace of extended differential operators spanned by a user-specified dictionary (see Step 1 below).

We propose an algorithm that consists of three steps. In the first step, we define the subspace of closed-form VR-PDEs we want to search over to reflect our knowledge of the problem. In the second step, we reconstruct the fields from noisy measurements. In the last step, we solve an optimization problem using a modified symbolic regression algorithm. For more details, see Appendix C.

Step 1: Choose the form and incorporate prior knowledge. A human expert should encode their prior knowledge of the problem into a dictionary of non-degenerate extended derivatives $Q = \{\hat{E}_p\}_{p \in [P]}$. We use this dictionary to search over a finite-dimensional subspace of closed-form operators spanned by this set. In other words, we assume that the VR-PDE is of the form:

$$\sum_{p=1}^{P} \beta_p \hat{E}_p[u](x) - g(x, u(x)) = 0 \quad \forall x \in \Omega \qquad (12)$$

where $\beta \in \mathbb{R}^P$, $g$ is any closed-form function of $M + N$ variables, and $\hat{E}_p = (\alpha_p, a_p, h_p)$. For instance, a dictionary might include only the partial derivatives up to a certain order. For a 1+1 second-order equation that means $Q = \{\partial_t, \partial_x, \partial_{tx}, \partial_t^2, \partial_x^2\}$. That is already enough to discover heat and wave equations with any closed-form source.
If, for instance, the user suspects the presence of the advection term $u u_x$ (as in Burgers' equation), the term $\partial_x(u^2)$ can be included in the library. It is important to note that we do not assume any particular form of $g$ apart from being closed-form.

Step 2: Estimate the fields. As the dataset $\mathcal{D}$ consists of noisy and infrequently sampled fields, we first need to estimate the "true" fields $\hat{u}^{(d)}$ from $v^{(d)}$. Any choice of reconstruction algorithm can be used, and the user should choose it according to the problem setting and their domain knowledge.

Step 3: Optimize. We minimize the loss function in Equation 10 for the estimated fields $\{\hat{u}^{(d)}\}_{d=1}^{D}$ among all PDEs of the form in Equation 12. We solve the following optimization problem:

$$\min_{g} \min_{\|\beta\|_1 = 1} \sum_{d=1}^{D} \sum_{s=1}^{S} \left( \sum_{p=1}^{P} \mathcal{F}(\beta_p \hat{E}_p, \hat{u}^{(d)}, \varphi_s) - \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \varphi_s(x) \, dx \right)^2 \qquad (13)$$

As we want to discover both $g$ and $\beta$, we cannot use the standard penalties on $\beta$ such as $\lambda \|\beta\|_2$ or $\lambda \|\beta\|_1$, as the loss would be minimized by $g = 0$ and $\beta = 0$. Therefore we impose the constraint $\|\beta\|_1 = 1$. We choose the L1 norm to encourage sparsity in the coordinates of the vector $\beta$. The inner minimization in Equation 13 can be rewritten as a constrained least-squares problem:

$$\min_{\|\beta\|_1 = 1} \sum_{(d,s) \in [D] \times [S]} \left( \beta \cdot z^{(d,s)} - w^{(d,s)} \right)^2 \qquad (14)$$

where $\hat{E}_p = (\alpha_p, a_p, h_p)$ and $z^{(d,s)} \in \mathbb{R}^P$, $w^{(d,s)} \in \mathbb{R}$ are defined as

$$z_p^{(d,s)} = \int_{\Omega} h_p(x, \hat{u}^{(d)}(x))\, (-1)^{|\alpha_p|}\, \partial^{\alpha_p} (a_p(x) \varphi_s(x)) \, dx$$

$$w^{(d,s)} = \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \varphi_s(x) \, dx$$

We show the full derivation in Appendix C. The vectors $z^{(d,s)}$ can be precomputed at the beginning of the algorithm without estimating the derivatives of the reconstructed fields. They can be easily calculated if the derivatives of the testing functions $\varphi_s$ and the derivatives of $a_p$ can be analytically computed. As the optimization problem in Equation 14 has to be solved many times for different closed-form expressions $g$, it poses some unique challenges.
As standard approaches are not sufficiently fast, we design a new heuristic algorithm to solve this problem. We describe it in the next section.
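The quantities $z^{(d,s)}$ and $w^{(d,s)}$ from Step 3 can be illustrated with a small, self-contained sketch. It assumes a single noise-free synthetic field u(t, x) = exp(-θt) sin(x), which solves the homogeneous heat equation, a two-element dictionary {∂_t, ∂_x^2}, one polynomial product testing function, and a plain trapezoidal rule. All names and numerical choices are illustrative assumptions, not the paper's implementation.

```python
import math

THETA = 0.5  # assumed diffusion coefficient of the synthetic field


def field(t, x):
    """Synthetic solution of d_t u - THETA * d_x^2 u = 0."""
    return math.exp(-THETA * t) * math.sin(x)


def bump(s, lo, hi, order=0):
    """((s-lo)(hi-s))^4 on [lo, hi] (zero elsewhere), or its 1st/2nd derivative."""
    if s <= lo or s >= hi:
        return 0.0
    p, dp = (s - lo) * (hi - s), lo + hi - 2.0 * s
    if order == 0:
        return p**4
    if order == 1:
        return 4.0 * p**3 * dp
    return 12.0 * p**2 * dp**2 - 8.0 * p**3  # order == 2


def integrate_2d(func, t_range, x_range, steps=200):
    """Tensor-product trapezoidal rule over a rectangle."""
    (t0, t1), (x0, x1) = t_range, x_range
    ht, hx = (t1 - t0) / steps, (x1 - x0) / steps
    total = 0.0
    for i in range(steps + 1):
        wt = 0.5 if i in (0, steps) else 1.0
        for j in range(steps + 1):
            wx = 0.5 if j in (0, steps) else 1.0
            total += wt * wx * func(t0 + i * ht, x0 + j * hx)
    return total * ht * hx


T_RANGE, X_RANGE = (0.0, 1.0), (0.0, 3.0)

# z_p = integral of h_p * (-1)^|alpha_p| * d^alpha_p [a_p * phi], here with a_p = 1, h_p = u.
z_dt = integrate_2d(
    lambda t, x: field(t, x) * (-1.0) * bump(t, *T_RANGE, order=1) * bump(x, *X_RANGE),
    T_RANGE, X_RANGE)
z_dxx = integrate_2d(
    lambda t, x: field(t, x) * bump(t, *T_RANGE) * bump(x, *X_RANGE, order=2),
    T_RANGE, X_RANGE)
w = 0.0  # the candidate derivative-free part is g = 0

# (1, -THETA) rescaled to unit L1 norm, as required by the constraint in Equation 14.
beta_true = (1.0 / (1.0 + THETA), -THETA / (1.0 + THETA))
beta_wrong = (0.5, 0.5)

residual_true = beta_true[0] * z_dt + beta_true[1] * z_dxx - w
residual_wrong = beta_wrong[0] * z_dt + beta_wrong[1] * z_dxx - w
print(residual_true, residual_wrong)
```

With the correct coefficients the term $\beta \cdot z - w$ is close to zero, while a wrong coefficient vector of the same L1 norm leaves a visible residual. In the full method this term would be summed over all fields and testing functions.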

6. COLLIE

The problem in Equation 14 from the previous section can be formulated as follows. Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$, find a vector $z \in \mathbb{R}^n$ that minimizes $\|Az - b\|_2^2$ such that $\|z\|_1 = 1$. The task is challenging as the unit L1 sphere is not convex. A method that guarantees an optimal solution is based on the observation that the $(n-1)$-dimensional L1 sphere consists of $2^n$ $(n-1)$-simplices (which are convex). Minimizing $\|Az - b\|_2^2$ on a simplex is a quadratic program (Boyd et al., 2004) with many available solvers (Andersen et al., 2013; Stellato et al., 2020; ApS, 2019). However, that means that the computation time scales exponentially with the number of dimensions. This is prohibitively long for the inner optimization of our algorithm. Therefore we design a heuristic algorithm, CoLLie (Constrained L1 Norm Least Squares), that finds an approximate solution but is significantly faster (Figure 1). We provide a detailed description of CoLLie in Appendix D.
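The exhaustive baseline described above (not CoLLie itself, which is specified in Appendix D) can be sketched as follows. The sketch enumerates all $2^n$ sign patterns and, on each simplex, minimizes the least-squares objective with projected gradient descent; projected gradient is used here only as a dependency-free stand-in for a quadratic programming solver. All function names are illustrative assumptions, and A is assumed to be non-zero.

```python
from itertools import product


def project_to_simplex(v):
    """Euclidean projection of v onto {y : y_i >= 0, sum(y) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def residual_norm_sq(A, z, b):
    """Squared Euclidean norm of A z - b."""
    return sum((sum(a * zj for a, zj in zip(row, z)) - bi) ** 2
               for row, bi in zip(A, b))


def solve_on_orthant(A, b, signs, iterations=500):
    """Minimise ||A diag(signs) y - b||^2 over the simplex by projected gradient."""
    n = len(signs)
    B = [[a * s for a, s in zip(row, signs)] for row in A]
    # 2 * ||B||_F^2 bounds the Lipschitz constant of the gradient (A must be non-zero).
    step = 1.0 / (2.0 * sum(a * a for row in B for a in row))
    y = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(a * yj for a, yj in zip(row, y)) - bi for row, bi in zip(B, b)]
        grad = [2.0 * sum(B[i][j] * r[i] for i in range(len(B))) for j in range(n)]
        y = project_to_simplex([yj - step * gj for yj, gj in zip(y, grad)])
    return [s * yj for s, yj in zip(signs, y)]


def exhaustive_l1_least_squares(A, b):
    """Try all 2^n sign patterns; the cost grows exponentially in n."""
    n = len(A[0])
    best_z, best_loss = None, float("inf")
    for signs in product((1.0, -1.0), repeat=n):
        z = solve_on_orthant(A, b, signs)
        loss = residual_norm_sq(A, z, b)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return best_z, best_loss


# Toy problem: b itself has unit L1 norm, so the optimum is z = b with zero loss.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.8, -0.2]
z_best, loss_best = exhaustive_l1_least_squares(A, b)
print(z_best, loss_best)
```

The outer loop over sign patterns is what makes this approach impractical as the dictionary grows, which is the motivation the text gives for a faster heuristic.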

7. EXPERIMENTS

We perform a series of synthetic experiments to show how well D-CIPHER is able to discover some well-known differential equations (Table 2). First, we demonstrate that D-CIPHER performs better than current methods when discovering PDEs in a linear combination form (Section 7.1). Then we demonstrate it can discover PDEs with a closed-form ∂-free part that cannot be expressed as a linear combination and thus are beyond the capabilities of current methods (Section 7.2). We contrast D-CIPHER with its ablated version where the derivatives are estimated and the standard MSE loss is used instead of the variational loss (details in Appendix E.1). For additional information about the experiments (e.g., implementation details, data generation, experimental settings) see Appendix E.

Table 2: Differential equations used in the experiments.

| Name | Equation |
| Homogeneous heat equation | $\partial_t u - \theta_1 \partial_x^2 u = 0$ |
| Burgers' equation | $\partial_t u + u \partial_x u - \theta_1 \partial_x^2 u = 0$ |
| Kuramoto-Sivashinsky equation | $\partial_t u + \partial_x^2 u + \partial_x^4 u + u \partial_x u = 0$ |
| Forced and damped harmonic oscillator | $\partial_t^2 u + 2 \theta_1 \theta_2 \partial_t u + \theta_2^2 u = \theta_3 \sin(\theta_4 t)$ |
| SLM model (Appendix F.1) | $\partial_t u + \partial_x u = -2 e^{\theta_1 x} u$ |
| Inhomogeneous heat equation | $\partial_t u - \theta_1 \partial_x^2 u = \theta_2 e^{\theta_3 t}$ |
| Inhomogeneous wave equation | $\partial_t^2 u - \theta_1 \partial_x^2 u = \theta_2 e^{t} \sin(\theta_3 t)$ |

Evaluation metrics. To establish how well a discovered PDE matches the ground truth, we evaluate its ∂-free and ∂-bound parts separately. For the ∂-free part, we assign a binary variable indicating whether the correct functional form of the equation was recovered (see Appendix E.8 for details). For the ∂-bound part, we measure the RMSE between the found coefficients of $\beta$ and the target ones. We report the averages and standard deviations for both parts. We call the averages Success Probability and Average RMSE, respectively.

Implementation. We use B-Splines (De Boor, 1978) as the testing functions and we estimate the fields in Step 2 of D-CIPHER with a Gaussian Process (Williams & Rasmussen, 2006).
The outer optimization in Step 3 is performed using a modified genetic programming algorithm (Koza, 1992) and the inner optimization by CoLLie (Section 6).
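The two evaluation metrics described above can be computed as in the following minimal sketch. The data layout (a list of per-run tuples) and the function names are assumptions made for illustration; how the binary success indicator is determined is described in Appendix E.8 and is simply taken as an input here.

```python
import math


def coefficient_rmse(beta_found, beta_true):
    """RMSE between two coefficient vectors of equal length."""
    if len(beta_found) != len(beta_true):
        raise ValueError("coefficient vectors must have the same length")
    squared = [(f - t) ** 2 for f, t in zip(beta_found, beta_true)]
    return math.sqrt(sum(squared) / len(squared))


def summarize(runs):
    """Aggregate runs given as (form_recovered, beta_found, beta_true) tuples."""
    successes = [1.0 if ok else 0.0 for ok, _, _ in runs]
    rmses = [coefficient_rmse(found, true) for _, found, true in runs]
    return {
        "success_probability": sum(successes) / len(successes),
        "average_rmse": sum(rmses) / len(rmses),
    }
```

Both coefficient vectors are assumed to be expressed in the same normalization (for example unit L1 norm, as in Equation 14) before they are compared.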

7.1. DISCOVERING LINEAR COMBINATIONS: COMPARISON WITH OTHER METHODS

We compare D-CIPHER against two variants of PDE-FIND (Rudy et al., 2017) and WSINDy (Reinbold et al., 2020), with optimization performed by Stepwise Sparse Regression (Boninsegna et al., 2018) or Forward Regression Orthogonal Least-Squares (Billings, 2013). We note that D-CIPHER is specifically designed to discover PDEs that are beyond the capabilities of current methods, i.e., where the derivative-free part can be any closed-form expression. Current methods are usually tested on equations where the derivative-free part is trivial (identically equal to 0). Even though these algorithms are specialized to discover these simpler kinds of equations, D-CIPHER performs better than (or equally well as) PDE-FIND and WSINDy, regardless of the optimization algorithm, when tested on Burgers' equation, the homogeneous heat equation, and the Kuramoto-Sivashinsky equation (Figure 2). This demonstrates the gain from both the variational loss and the new optimization routine. Note that some of the benchmarks overlap.

7.2. DISCOVERING EQUATIONS BEYOND CURRENT METHODS

Forced and damped harmonic oscillator. As the oscillator is described by a second-order ODE, it cannot be discovered by D-CODE (Qian et al., 2022). D-CIPHER discovers the correct functional form of the ∂-free part and achieves a low RMSE for the coefficients of $\beta$ in most of the experimental settings. The performance is higher than or comparable to the ablated version of D-CIPHER, thus demonstrating the gain from using the variational approach. We present the results in Figure 3.

Inhomogeneous heat equation. D-CIPHER is able to discover the correct equation even in settings with very high noise. It performs better than the ablated version, thus showing the importance of the variational objective. The results are presented in Table 3.

Inhomogeneous wave equation. This equation does not have the standard evolution form, as it does not involve the $\partial_t$ term. Thus, even without the source term, most of the current methods cannot be applied directly to discover this equation. In Figure 4 we show the absolute difference between the true field and the fields computed from the sources discovered by D-CIPHER and its ablated version across different measurement settings. D-CIPHER finds the correct functional form with coefficients not far from the ground truth. The ablated version fails to discover the correct functional form, and the ∂-free part it finds does not reproduce the correct behavior of the equation.

8. DISCUSSION

Applications. As D-CIPHER can potentially discover any closed-form ∂-free part, it is especially useful when this part of the PDE captures an essential component of the phenomenon. We demonstrate this by finding the heat and vibration sources as well as the driving force of an oscillator. Beyond the spatio-temporal physical equations, D-CIPHER might prove useful in discovering population models structured by age, size, and spatial position (Webb, 1985; 2008), age-dependent epidemiological models (Hoppensteadt, 1974), and predator-prey models with age structure (Promrak et al., 2017). All these equations are VR-PDEs where the ∂-free parts are crucial elements of the equations, signifying the rates of mortality, infection, recovery, or growth.

Limitations and open challenges. D-CIPHER may fail in some scenarios, either due to challenging experimental settings or a challenging underlying PDE. Challenging experimental settings might include unobserved variables, high measurement noise, infrequent sampling, and an inadequate domain (e.g., a small time horizon). Challenging PDE forms might include a PDE outside of the VR-PDE class or a ∂-free part with a complex expression that is difficult to find. We note that we address some of these challenges by utilizing a variational approach, defining VR-PDEs to be a very general class of equations, and designing CoLLie, which enables a thorough search across closed-form expressions.

A DEFINITIONS

In this section, we collect the definitions of some of the important terms used in the paper for easy reference.

Definition 4 (Closed-form expressions and functions). A closed-form expression is a mathematical expression that consists of a finite number of variables, constants, arithmetic operations, and certain well-known functions (e.g., logarithm, trigonometric functions). A function $f$ is called closed-form if it can be represented by a closed-form expression, e.g., $f(x, y, z) = x^2 \log(y) + \sin(3z)$.

Remark.
In practice, we do not want to consider any finite expression. Any symbolic regression algorithm penalizes expressions that are too long, putting a soft constraint on the number of elements used. That is why deep neural networks are not considered closed-form even if they satisfy the conditions in Definition 4.

Definition 5 (Multi-index). An $n$-dimensional multi-index $\alpha$ is an $n$-tuple $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ where $\alpha_i \in \mathbb{N}_0$ for all $i \in [n]$. Thus $\alpha \in \mathbb{N}_0^n$. We define the order of $\alpha$ as $|\alpha| = \sum_{i=1}^{n} \alpha_i$.

Definition 6. For any $n$-dimensional multi-index $\alpha$ we define a mixed derivative $\partial^{\alpha} = \partial_1^{\alpha_1} \partial_2^{\alpha_2} \ldots \partial_n^{\alpha_n}$ where $\partial_i^{\alpha_i} = \partial^{\alpha_i} / \partial x_i^{\alpha_i}$ is an $\alpha_i$-th-order partial derivative with respect to $x_i$ (the $i$-th independent variable). We call $\partial^{\alpha}$ non-trivial if $|\alpha| > 0$. We denote the list of all non-trivial partial derivatives of $u$ up to order $K$ as $\partial^{[K]} u$.

Definition 7 (Closed-form Partial Differential Equation). Let $f$ be a closed-form real smooth function. We say that a vector field $u : \Omega \to \mathbb{R}^N$ is governed by a $K$-th-order closed-form PDE described by $f$ if

$$f(x, u(x), \partial^{[K]} u(x)) = 0 \quad \forall x \in \Omega$$

where $\partial^{[K]} u$ are all non-trivial mixed derivatives of all $u_j$ ($j \in [N]$) up to the $K$-th order.

Definition 8 (Testing function). The support of a function $\varphi : \Omega \to \mathbb{R}$ is defined as $\operatorname{supp} \varphi = \overline{\{x \in \Omega : \varphi(x) \neq 0\}}$, where $\overline{B}$ is the topological closure of $B$ in $\Omega$. $\varphi$ is called a testing function if it is a $C^K$ function with compact support.

Definition 9 (True field). We define a true field on $\Omega$ as a vector-valued function $u : \Omega \to \mathbb{R}^N$ where each $u_j : \Omega \to \mathbb{R}$ is a $C^K$ function.

Definition 10 (Observed field and sampling grid). We define a sampling grid $G$ to be a finite subset of $\Omega$. Let $u : \Omega \to \mathbb{R}^N$ be a true field on $\Omega$. An observed field sampled from $u$ on a grid $G$ is a function $v : G \to \mathbb{R}^N$ of the form

$$v_j(x) = u_j(x) + \epsilon_j(x) \quad \forall x \in G \;\; \forall j \in [N]$$

where $\epsilon_j(x)$ corresponds to noise, a realisation of a zero-mean random variable.

Definition 11 (L1 sphere). Let $n \in \mathbb{N}$.
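We define the $n$-dimensional L1 sphere to be a subset of $\mathbb{R}^{n+1}$ defined as:

$$\{x \in \mathbb{R}^{n+1} \mid \|x\|_1 = 1\} \subset \mathbb{R}^{n+1} \qquad (17)$$

Definition 12 (Standard simplex). Let $n \in \mathbb{N}$. We define the standard $n$-simplex to be a subset of $\mathbb{R}^{n+1}$ defined as:

$$\left\{x \in \mathbb{R}^{n+1} \;\middle|\; \sum_{i=1}^{n+1} x_i = 1 \,\wedge\, x_i \ge 0 \;\; \forall i \in [n+1]\right\} \subset \mathbb{R}^{n+1}$$

B VARIATIONAL-READY PDES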

B.1 VARIATIONAL FORMULATION OF PDES

In this section, we provide the standard variational formulation of PDEs for linear PDEs (Friedlander, 1982).

Definition 13 (Linear differential operator). Let $A$ be a finite set of multi-indices. A linear differential operator $L$ is defined as

$$L = \sum_{\alpha \in A} a_{\alpha} \partial^{\alpha}$$

where each $a_{\alpha} \in C^K$ is a non-zero, sufficiently smooth function of the independent variables. If $\max_{\alpha \in A} |\alpha| = n$ then we call $L$ an $n$-th-order linear differential operator. If all $a_{\alpha}$ are constants we say that $L$ has constant coefficients. The adjoint of $L$, denoted $L^{\dagger}$, is a linear differential operator defined as

$$L^{\dagger} u(x) = \sum_{\alpha \in A} (-1)^{|\alpha|} \partial^{\alpha} (a_{\alpha}(x) u(x))$$

Proposition 1 (Variational formulation of PDEs for linear PDEs). Let $K \in \mathbb{N}$. Consider a scalar field $u : \Omega \to \mathbb{R}$ such that $u \in C^K$, a $K$-th-order linear differential operator $L$, and a continuous function $g : \Omega \to \mathbb{R}$. Then $u$ satisfies the linear PDE

$$L[u(x)] - g(x) = 0 \quad \forall x \in \Omega \qquad (20)$$

if and only if

$$\int_{\Omega} \left( u(x) L^{\dagger} \varphi(x) - g(x) \varphi(x) \right) dx = 0 \qquad (21)$$

for all testing functions $\varphi : \Omega \to \mathbb{R}$. Note that the integrals are always well-defined as $\varphi$ has compact support.
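As a small worked illustration of Definition 13 and Proposition 1 (an example added here for intuition, not taken from the paper), consider the one-dimensional operator $L = x\,\partial_x$ on an interval $\Omega \subset \mathbb{R}$, the field $u(x) = x^2$, and hence $g(x) = L[u](x) = 2x^2$. The adjoint and the variational identity can be checked by hand:

```latex
% Adjoint of L = x \partial_x (a single multi-index with |\alpha| = 1, a_\alpha(x) = x):
L^{\dagger}\varphi(x) = (-1)^{1}\,\partial_x\big(x\,\varphi(x)\big) = -\varphi(x) - x\,\varphi'(x)

% Left-hand side of Equation 21 with u(x) = x^2:
\int_{\Omega} u\,L^{\dagger}\varphi\,dx
  = -\int_{\Omega} x^{2}\varphi\,dx - \int_{\Omega} x^{3}\varphi'\,dx

% Integrate the second term by parts; the boundary term vanishes
% because \varphi has compact support in \Omega:
-\int_{\Omega} x^{3}\varphi'\,dx = \int_{\Omega} 3x^{2}\varphi\,dx

% Hence
\int_{\Omega} u\,L^{\dagger}\varphi\,dx
  = \int_{\Omega} 2x^{2}\varphi\,dx
  = \int_{\Omega} g\,\varphi\,dx
```

The derivative has moved entirely from $u$ onto the known functions $a_{\alpha}$ and $\varphi$, which is the property the VR-PDE class is designed to preserve.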

B.2 THEOREM

Before we prove Theorem 1, we need the following lemma, which is a particular formulation of the fundamental lemma of calculus of variations (Elsgolc, 2012). We also need a generalized version of the divergence theorem (Loomis, 1968).

Lemma (Fundamental lemma of calculus of variations). Let $K \in \mathbb{N}$, $\Omega$ be an open set in $\mathbb{R}^M$, and $u : \Omega \to \mathbb{R}$ be a continuous function. Then $u$ is equal to 0 on the whole $\Omega$ if and only if
$$\int_{\Omega} u(x) \phi(x)\, dx = 0$$
for all $C^K$ functions $\phi : \Omega \to \mathbb{R}$ with compact support.

Proof. If $u$ is identically 0 on $\Omega$ then all the integrals are trivially equal to 0. We now prove the converse. Let us assume for contradiction that there exists a point $x_0 \in \Omega$ such that $u(x_0) \neq 0$. Without loss of generality we assume $u(x_0) = \epsilon > 0$. As $u$ is continuous there exists an open ball around $x_0$ of radius $\delta$, denoted $B^{\delta}_{x_0} = \{x \in \Omega \mid \|x - x_0\|_2 < \delta\}$, such that $u(x) > \epsilon/2 > 0$ for all $x \in B^{\delta}_{x_0}$. Now let $\phi$ be a $C^K$ function that is positive on $B^{\delta}_{x_0}$ and 0 elsewhere. Such a function can always be created by appropriately shifting and scaling
$$\tilde{\phi}(x) = e^{-1/(1 - \|x\|_2^2)} \cdot \mathbb{1}_{\{\|x\|_2 < 1\}}.$$
Its support is the closed ball $\overline{B}^{\delta}_{x_0} = \{x \in \Omega \mid \|x - x_0\|_2 \leq \delta\}$, which is compact. Then
$$\int_{\Omega} u(x) \phi(x)\, dx = \int_{B^{\delta}_{x_0}} u(x) \phi(x)\, dx > 0$$
as both $u(x)$ and $\phi(x)$ are positive on $B^{\delta}_{x_0}$. Thus we found a $C^K$ function $\phi$ with compact support such that
$$\int_{\Omega} u(x) \phi(x)\, dx \neq 0,$$
which contradicts the assumption. Therefore $u$ is identically 0 on $\Omega$.

To make this section self-contained, we provide the statement of the generalized divergence theorem (Loomis, 1968).

Theorem 2 (Divergence theorem). Let $\Omega$ be an open set in $\mathbb{R}^M$ and let $f, g$ be continuous on $\overline{\Omega} = \Omega \cup \partial\Omega$ and continuously differentiable on $\Omega$. Then
$$\int_{\Omega} \partial_i^1 [f(x)]\, g(x)\, dx = - \int_{\Omega} f(x)\, \partial_i^1 [g(x)]\, dx + \int_{\partial\Omega} \nu_i f(x) g(x)\, dx$$
where $\nu$ is a normal unit vector to the boundary $\partial\Omega$. In a 1-dimensional setting, the statement of the theorem reduces to integration by parts.

We can now prove Theorem 1.

Proof. Let us denote $E_p = (\alpha_p, a_p, h_p)$.
Then the PDE in Equation 8 can be written as:
$$\sum_{p=1}^{P} a_p(x)\, \partial^{\alpha_p} [h_p(x, u(x))] - g(x, u(x)) = 0 \quad \forall x \in \Omega \tag{25}$$
The LHS is continuous as all $a_p$ and $h_p$ are smooth, $g$ is continuous, $u \in C^K$, and $|\alpha_p| \leq K$ for all $p \in [P]$. Thus we can use the fundamental lemma of calculus of variations to say that Equation 25 is true if and only if
$$\int_{\Omega} \Big( \sum_{p=1}^{P} a_p(x)\, \partial^{\alpha_p} [h_p(x, u(x))] - g(x, u(x)) \Big) \phi(x)\, dx = 0 \tag{26}$$
for all testing functions $\phi$. We transform the LHS of Equation 26 using linearity to:
$$\sum_{p=1}^{P} \int_{\Omega} a_p(x)\, \partial^{\alpha_p} [h_p(x, u(x))]\, \phi(x)\, dx - \int_{\Omega} g(x, u(x)) \phi(x)\, dx \tag{27}$$
Let us now focus on $\int_{\Omega} a_p(x)\, \partial^{\alpha_p} [h_p(x, u(x))]\, \phi(x)\, dx$ and let us denote $\alpha_p = (\alpha_{p1}, \ldots, \alpha_{pM})$. Then $\partial^{\alpha_p} = \partial_1^{\alpha_{p1}} \cdots \partial_M^{\alpha_{pM}}$ and the expression can be written as
$$\int_{\Omega} \partial_1^{\alpha_{p1}} \cdots \partial_M^{\alpha_{pM}} [h_p(x, u(x))]\, a_p(x) \phi(x)\, dx$$
Let us denote the support of $\phi$ as $B$. As $\phi$ is equal to zero outside of its support, we can write the expression as
$$\int_{B} \partial_1^{\alpha_{p1}} \cdots \partial_M^{\alpha_{pM}} [h_p(x, u(x))]\, a_p(x) \phi(x)\, dx$$
Without loss of generality, let us assume that $\alpha_{p1} > 0$. By the divergence theorem, this can be rewritten as
$$- \int_{B} \partial_1^{\alpha_{p1} - 1} \cdots \partial_M^{\alpha_{pM}} [h_p(x, u(x))]\, \partial_1^1 [a_p(x) \phi(x)]\, dx$$
because the integral over the boundary is equal to 0,
$$\int_{\partial B} \nu_1\, \partial_1^{\alpha_{p1} - 1} \cdots \partial_M^{\alpha_{pM}} [h_p(x, u(x))]\, a_p(x) \phi(x)\, dx = 0 \tag{32}$$
as $\phi$ has compact support (and thus vanishes on the boundary). We can perform this operation $\alpha_{p1}$ times to shift the whole derivative $\partial_1^{\alpha_{p1}}$ to the second factor and obtain
$$(-1)^{\alpha_{p1}} \int_{B} \partial_2^{\alpha_{p2}} \cdots \partial_M^{\alpha_{pM}} [h_p(x, u(x))]\, \partial_1^{\alpha_{p1}} [a_p(x) \phi(x)]\, dx$$
Then we repeat this for the other derivatives and we end up with the following expression:
$$(-1)^{\alpha_{p1}} \cdots (-1)^{\alpha_{pM}} \int_{B} h_p(x, u(x))\, \partial_1^{\alpha_{p1}} \cdots \partial_M^{\alpha_{pM}} [a_p(x) \phi(x)]\, dx$$
As the integrand is zero outside of $B$, this can be rewritten as:
$$(-1)^{|\alpha_p|} \int_{\Omega} h_p(x, u(x))\, \partial^{\alpha_p} [a_p(x) \phi(x)]\, dx$$
or more compactly, using the functional defined in Definition 3, as $F(E_p, u, \phi)$. Therefore Equation 27 can be written as:
$$\sum_{p=1}^{P} F(E_p, u, \phi) - \int_{\Omega} g(x, u(x)) \phi(x)\, dx$$
Thus, we proved that Equation 25 is true if and only if
$$\sum_{p=1}^{P} F(E_p, u, \phi) - \int_{\Omega} g(x, u(x)) \phi(x)\, dx = 0$$
for all testing functions $\phi$.

B.3 EXAMPLES

The examples of VR-PDEs can be found in Table 5.

… $u = g(x)$
Gauss law: $\nabla \cdot E = \rho / \epsilon_0$
Burgers' equation: $u_t + u u_x - \nu u_{xx} = 0$
Navier-Stokes equations: $u_t + (u \cdot \nabla) u - \nu \nabla^2 u = -\frac{1}{\rho} \nabla p + g$
Korteweg-De Vries equation: $u_t + u_{xxx} - 6 u u_x = 0$
Kuramoto-Sivashinsky equation: $u_t + u_{xx} + u_{xxxx} + u u_x = 0$
Fisher's equation: $u_t - \kappa u_{xx} = r u (1 - u)$
Liouville's equation: $u_{xx} + u_{yy} = \kappa e^{\rho u}$
Porous medium equation: $u_t - \nabla^2 (u^{\kappa}) = 0$
Sine-Gordon equation: $u_{tt} - u_{xx} = -\sin(u)$

C D-CIPHER

C.1 REWRITE THE INNER OPTIMIZATION AS A CONSTRAINED LEAST SQUARES

Let us rewrite the objective in Equation 13,
$$\sum_{d=1}^{D} \sum_{s=1}^{S} \Big( \sum_{p=1}^{P} F(\beta_p \hat{E}_p, \hat{u}^{(d)}, \phi_s) - \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \phi_s(x)\, dx \Big)^2 \tag{39}$$
First, let us observe that
$$F(\beta_p \hat{E}_p, \hat{u}^{(d)}, \phi_s) = \int_{\Omega} h_p(x, \hat{u}^{(d)}(x)) (-1)^{|\alpha_p|} \partial^{\alpha_p} [\beta_p a_p(x) \phi_s(x)]\, dx = \beta_p z_p^{(d,s)} \tag{40}$$
if we let $z_p^{(d,s)} \in \mathbb{R}$ be defined as
$$z_p^{(d,s)} = \int_{\Omega} h_p(x, \hat{u}^{(d)}(x)) (-1)^{|\alpha_p|} \partial^{\alpha_p} (a_p(x) \phi_s(x))\, dx$$
Moreover, if we define $w^{(d,s)} \in \mathbb{R}$ as
$$w^{(d,s)} = \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \phi_s(x)\, dx$$
we can rewrite expression 39 as
$$\sum_{d=1}^{D} \sum_{s=1}^{S} \Big( \sum_{p=1}^{P} \beta_p z_p^{(d,s)} - w^{(d,s)} \Big)^2$$
Now, the sum over $p$ can be written as a dot product between $z^{(d,s)} \in \mathbb{R}^P$ and $\beta \in \mathbb{R}^P$. We can also combine the sums over $d$ and $s$. We obtain
$$\sum_{(d,s) \in [D] \times [S]} \big( \beta \cdot z^{(d,s)} - w^{(d,s)} \big)^2 \tag{44}$$
which is exactly the same as the objective in Equation 14.
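The equivalence between the triple sum and the stacked least-squares form can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the arrays `z` and `w` are filled with random numbers standing in for the integrals $z_p^{(d,s)}$ and $w^{(d,s)}$, and the function names are hypothetical.

```python
import numpy as np

def stacked_loss(z, w, beta):
    """z: (D, S, P) array of z_p^{(d,s)}; w: (D, S) array; beta: (P,) vector."""
    D, S, P = z.shape
    Z = z.reshape(D * S, P)       # one row per (d, s) pair
    w_flat = w.reshape(D * S)
    return float(np.sum((Z @ beta - w_flat) ** 2))   # ||Z beta - w||_2^2

def nested_loss(z, w, beta):
    """Same objective written as the explicit sums over d, s and p."""
    total = 0.0
    for d in range(z.shape[0]):
        for s in range(z.shape[1]):
            inner = sum(beta[p] * z[d, s, p] for p in range(z.shape[2]))
            total += (inner - w[d, s]) ** 2
    return total

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 4, 2))    # D = 3 fields, S = 4 testing functions, P = 2 terms
w = rng.normal(size=(3, 4))
beta = np.array([0.7, -0.3])      # satisfies the constraint ||beta||_1 = 1
```

Because `Z` does not depend on $g$, it can be computed once and reused for every candidate expression, while only `w` has to be recomputed.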

C.2 PSEUDOCODE

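The pseudocode of D-CIPHER is presented in Algorithm 1.

Algorithm 1
  Output: Target PDE
  $Q = \{\hat{E}_p\}_{p=1}^{P}$, $\hat{E}_p = (\alpha_p, a_p, h_p)$    (Step 1)
  $\hat{u}^{(d)} = S(v^{(d)}) \ \forall d \in [D]$    (Step 2)
  initialize matrix $Z \in \mathbb{R}^{D \times S} \times \mathbb{R}^{P}$
  $Z_p^{(d,s)} \leftarrow \int_{\Omega} h_p(x, \hat{u}^{(d)}(x)) (-1)^{|\alpha_p|} \partial^{\alpha_p} (a_p(x) \phi_s(x))\, dx$
  procedure LOSS($g$)
    initialize vector $w \in \mathbb{R}^{D \times S}$
    $w^{(d,s)} \leftarrow \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \phi_s(x)\, dx$
    $\beta \leftarrow$ COLLIE($Z$, $w$)    (Section 6)
    $\mathcal{L} = \|Z\beta - w\|_2^2$
    return $\mathcal{L}$
  end procedure
  $g = O(\text{LOSS})$    (Step 3)
  initialize vector $w \in \mathbb{R}^{D \times S}$
  $w^{(d,s)} \leftarrow \int_{\Omega} g(x, \hat{u}^{(d)}(x)) \phi_s(x)\, dx$
  $\beta \leftarrow$ COLLIE($Z$, $w$)    (Section 6)
  return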

C.4 TESTING FUNCTIONS

Testing functions, by definition, need to be sufficiently smooth functions (in C K class) with compact support. Moreover, the result proved in (Qian et al., 2022) suggests that these functions should be a subset of a Hilbert basis in L2 space. In particular, that means they should be orthonormal. We use B-Splines (De Boor, 1978) as the testing functions in our experiments because we can control their smoothness and the derivatives are easy to compute. We scale and shift them appropriately so that they are orthonormal.
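As an illustration of such a testing function, the sketch below builds a single cubic B-spline with the Cox-de Boor recursion and rescales it to unit $L^2$ norm. It is a hypothetical example rather than the implementation used in the experiments: the uniform knots, the grid, and the function name are assumptions, and only normalization (not mutual orthogonality of several splines) is shown.

```python
import numpy as np

def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion for the i-th B-spline of degree k (distinct knots assumed)."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return np.where((knots[i] <= x) & (x < knots[i + 1]), 1.0, 0.0)
    left = (x - knots[i]) / (knots[i + k] - knots[i]) * bspline_basis(x, knots, i, k - 1)
    right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
             * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

knots = np.arange(5.0)                  # uniform knots 0, 1, 2, 3, 4
x = np.linspace(-1.0, 5.0, 6001)
phi = bspline_basis(x, knots, 0, 3)     # cubic B-spline: C^2, supported on [0, 4]

dx = x[1] - x[0]
l2_norm = np.sqrt(np.sum(phi ** 2) * dx)    # Riemann-sum estimate of the L2 norm
phi_unit = phi / l2_norm                    # rescaled to unit norm on this grid
```

A spline of degree $k$ with distinct knots is $C^{k-1}$, so the degree can be chosen to match the order $K$ of the derivatives that have to be moved onto the testing function.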

D COLLIE

D.1 LAGRANGIAN

The problem that CoLLie is supposed to solve is a constrained least-squares optimization defined as:
$$\text{minimize } \|Az - b\|_2^2 \quad \text{subject to } \|z\|_1 - 1 = 0$$
where $A \in \mathbb{R}^{m \times n}$ has full column rank, $b \in \mathbb{R}^m$, and $z \in \mathbb{R}^n$ for some $m, n \in \mathbb{N}$. We consider the Lagrangian $\mathcal{L} : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$ associated with this problem (Boyd et al., 2004) defined as
$$\mathcal{L}(z, \lambda) = \|Az - b\|_2^2 + \lambda (\|z\|_1 - 1)$$
Now let us define $\hat{z} : \mathbb{R} \to \mathbb{R}^n$ as
$$\hat{z}(\lambda) = \arg\min_{z \in \mathbb{R}^n} \mathcal{L}(z, \lambda) = \arg\min_{z \in \mathbb{R}^n} \|Az - b\|_2^2 + \lambda \|z\|_1 \tag{47}$$
The goal of our algorithm is to find $\lambda^* \in \mathbb{R}$ such that $\|\hat{z}(\lambda^*)\|_1 = 1$. Let us define a function $q : \mathbb{R} \to \mathbb{R}$ as
$$q(\lambda) = \|\hat{z}(\lambda)\|_1 \tag{48}$$
The goal can be phrased as finding $\lambda^* \in \mathbb{R}$ such that $q(\lambda^*) = 1$. Let us note that $\hat{z}(0)$ is just a solution to the ordinary least squares (OLS) problem with no constraints and its norm is $q(0)$.
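The role of $q$ can be made concrete in a special case where $\hat{z}(\lambda)$ has a closed form. The sketch below assumes $A$ is the identity matrix (so $b = y$) and that no entry of $y$ is zero; under these assumptions the minimizer of Equation 47 is a soft-thresholding of $y$ for $\lambda \geq 0$ and an outward shift of $y$ for $\lambda < 0$. This is an illustration of the goal $q(\lambda^*) = 1$, not the CoLLie algorithm itself, and all names are hypothetical.

```python
import numpy as np

def z_hat(y, lam):
    """Minimizer of ||z - y||_2^2 + lam * ||z||_1 for A = identity (nonzero y_i assumed)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

def q(y, lam):
    """L1 norm of the minimizer as a function of lam."""
    return float(np.sum(np.abs(z_hat(y, lam))))

def find_lambda_star(y, iters=200):
    """Bisection on the continuous, non-increasing function q to solve q(lam) = 1."""
    lo = -2.0 / len(y)                     # q(lo) = sum|y_i| + 1 >= 1
    hi = 2.0 * float(np.max(np.abs(y)))    # q(hi) = 0 < 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q(y, mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_shrink = find_lambda_star(np.array([0.8, -0.5]))   # q(0) = 1.3 >= 1
lam_grow = find_lambda_star(np.array([0.3, -0.2]))     # q(0) = 0.5 < 1
```

In the first call the OLS solution is too large and a positive $\lambda$ shrinks it onto the L1 sphere; in the second it is too small and a negative $\lambda$ is needed, which is the harder situation for general $A$.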

D.2 EXTENDING LARS

Case 1. $q(0) \geq 1$. If we assume that $\hat{z}$ is continuous then $q$ is also continuous. From the continuity and the fact that $\lim_{\lambda \to +\infty} q(\lambda) = 0$ and $q(0) \geq 1$ we infer that there exists a $\lambda \geq 0$ such that $q(\lambda) = 1$. Moreover, for $\lambda \geq 0$ the problem in Equation 47 is the same as in LASSO (Tibshirani, 1996). Therefore we just need to perform LASSO for different $\lambda$ and choose the one that gives the solution with L1 norm equal to 1. To do it in practice we use Least Angle Regression (LARS) (Efron et al., 2004), a popular algorithm used to minimize the LASSO objective. It generates complete solution paths, i.e., a function $c : \mathbb{R}_+ \to \mathbb{R}^n$ defined as
$$c(\lambda) = \arg\min_{z \in \mathbb{R}^n} \|Az - b\|_2^2 + \lambda \|z\|_1 \tag{49}$$
which is equivalent to $\hat{z}$ for $\lambda \geq 0$. An illustration of LARS solution paths can be seen in Figure 6. Each line corresponds to a function $c_i$ which describes the coefficient for the $i$-th covariate. The paths are defined from some $\lambda_0$ where all $c_i(\lambda_0) = 0$ to $\lambda = 0$ where $c(0) = \hat{z}(0)$. In other words, the solution paths cover the whole range of constraints from the strictest, effectively imposing the L1 norm of $z$ to be 0, up to no constraints, solving the OLS problem. The solution paths from the LARS algorithm are piecewise linear and the outputs are the values of the coefficients for points ($\lambda_0 > \ldots > \lambda_n = 0$) where the slopes change. We calculate the norm at each of these points, $\|c(\lambda_i)\|_1$, and find $j \in [n]$ such that $\|c(\lambda_{j-1})\|_1 < 1 \leq \|c(\lambda_j)\|_1$. As each $c_i$ is a linear function on $[\lambda_j, \lambda_{j-1}]$ and we know both $c(\lambda_{j-1})$ and $c(\lambda_j)$, we can effectively search for $\lambda \in [\lambda_j, \lambda_{j-1}]$ such that $\|c(\lambda)\|_1 = 1$. The search can be performed by any root-finding algorithm. We use Brent's method (Brent, 2013).

Case 2. $0 < q(0) < 1$. This is much more difficult as it corresponds to solving the problem in Equation 47 for $\lambda < 0$. The solutions given by the LARS algorithm are too small.
In fact, the solution with the biggest norm is $c(0) = \hat{z}(0)$, the OLS solution, with norm exactly $q(0) < 1$. To address this challenge, we propose the following heuristic. We extend the solution paths generated by LARS beyond $\lambda = 0$ for $\lambda < 0$. We assume that the paths will continue to be piecewise linear and that they will keep the slope they have in the last interval $[\lambda_n = 0, \lambda_{n-1}]$. Let us denote this slope as
$$\Delta c_i = \frac{c_i(0) - c_i(\lambda_{n-1})}{0 - \lambda_{n-1}}$$
This is graphically represented in Figure 6. Formally, these extended paths, $\bar{c}_i : \mathbb{R} \to \mathbb{R}$, are defined as:
$$\bar{c}_i(\lambda) = \begin{cases} c_i(\lambda), & \lambda \geq 0 \\ c_i(0) + \lambda \Delta c_i, & \lambda < 0 \end{cases} \tag{51}$$
Now, we want to find $\lambda < 0$ such that $\|\bar{c}(\lambda)\|_1 = 1$. To achieve this in practice, we first make the following observations. For any $\lambda < 0$ we say that $\bar{c}_i(\lambda)$ is on the right side if $\bar{c}_i(\lambda) \Delta c_i \leq 0$ and we say that $\bar{c}_i(\lambda)$ is on the wrong side if $\bar{c}_i(\lambda) \Delta c_i > 0$. In other words, being on the wrong side just means that the path is yet to cross the x-axis if we keep decreasing $\lambda$. We can easily find $\lambda'$ such that for all $\lambda < \lambda'$ all $\bar{c}_i(\lambda)$ are on the right side (none of the paths will ever cross the x-axis):
$$\lambda' = \min \Big\{ \frac{0 - c_i(0)}{\Delta c_i} \;\Big|\; i \in [n] \wedge c_i(0) \Delta c_i > 0 \Big\} \tag{52}$$
If $\|\bar{c}(\lambda')\|_1 \geq 1$ we just need to search the interval $[\lambda', 0]$ for $\lambda^*$ such that $\|\bar{c}(\lambda^*)\|_1 = 1$. If $\|\bar{c}(\lambda')\|_1 < 1$ then we need to search $\lambda < \lambda'$. However, by definition, for all $\lambda < \lambda'$, all $\bar{c}_i(\lambda)$ are on the right side. That means $\|\bar{c}(\lambda)\|_1$ as a function of $\lambda$ is just a linear function on the interval $(-\infty, \lambda')$. To see that, let us observe that
$$\|\bar{c}(\lambda)\|_1 = \sum_{i=1}^{n} |\bar{c}_i(\lambda)| = \sum_{i=1}^{n} \operatorname{sign}(\bar{c}_i(\lambda))\, \bar{c}_i(\lambda)$$
Additionally, for $\lambda < \lambda'$ all $\bar{c}_i(\lambda)$ are on the right side, so we have $\operatorname{sign}(\bar{c}_i(\lambda)) = -\operatorname{sign}(\Delta c_i)$. We can rewrite $\|\bar{c}(\lambda)\|_1$ as:
$$\|\bar{c}(\lambda)\|_1 = \sum_{i=1}^{n} \big( -\operatorname{sign}(\Delta c_i)(c_i(0) + \lambda \Delta c_i) \big) = -\sum_{i=1}^{n} \operatorname{sign}(\Delta c_i) c_i(0) - \Big( \sum_{i=1}^{n} |\Delta c_i| \Big) \lambda \tag{54}$$
Therefore the solution can be found using the following equation
$$\lambda^* = \lambda' + \frac{1 - \|\bar{c}(\lambda')\|_1}{-\sum_{i=1}^{n} |\Delta c_i|}$$

We perform 1000 experiments for each $n$. As losses for optimal solutions can be on widely different scales, we report the relative error between the loss obtained by CoLLie and the loss obtained by the algorithm based on CVXOPT, which seems to always find the optimal solution. The time is measured on a single computer with an Intel Core i5-6500 CPU (4 cores) and 16GB of RAM.
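The two cases of Section D.2 can be sketched in a few lines once the LARS breakpoints are available. The code below is a hedged illustration, not the authors' implementation: it takes the breakpoints `lams` (decreasing, ending at 0, with all coefficients zero at the first one) and the path values `coefs` as given inputs, uses plain bisection in place of Brent's method, and assumes at least one non-zero slope on the last segment. The function names are hypothetical.

```python
import numpy as np

def _bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f(lo) >= 0 > f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def unit_norm_lambda(lams, coefs):
    """Return lambda at which the (possibly extended) path has L1 norm 1."""
    lams = np.asarray(lams, dtype=float)
    coefs = np.asarray(coefs, dtype=float)       # shape (len(lams), n)
    norms = np.abs(coefs).sum(axis=1)

    if norms[-1] >= 1.0:
        # Case 1: first segment on which the norm reaches 1; interpolate linearly.
        j = int(np.argmax(norms >= 1.0))
        def path(lam):
            t = (lams[j - 1] - lam) / (lams[j - 1] - lams[j])
            return coefs[j - 1] + t * (coefs[j] - coefs[j - 1])
        return _bisect(lambda lam: np.abs(path(lam)).sum() - 1.0, lams[j], lams[j - 1])

    # Case 2: keep the slope of the last segment for lam < 0 (Equation 51).
    c0 = coefs[-1]
    slope = (c0 - coefs[-2]) / (0.0 - lams[-2])
    extended = lambda lam: c0 + lam * slope
    wrong = c0 * slope > 0                        # paths that still cross zero
    lam_prime = float(np.min(-c0[wrong] / slope[wrong])) if wrong.any() else 0.0
    norm_prime = np.abs(extended(lam_prime)).sum()
    if norm_prime >= 1.0:
        return _bisect(lambda lam: np.abs(extended(lam)).sum() - 1.0, lam_prime, 0.0)
    # Below lam_prime the norm is linear in lam, so the root is available in closed form.
    return lam_prime + (1.0 - norm_prime) / (-np.abs(slope).sum())

lam_case1 = unit_norm_lambda([2, 1, 0], [[0, 0], [0.5, 0.0], [1.0, 0.6]])
lam_case2 = unit_norm_lambda([2, 1, 0], [[0, 0], [0.1, 0.3], [0.3, 0.1]])
lam_case2_search = unit_norm_lambda([2, 1, 0], [[0, 0], [0.4, 0.03], [0.9, 0.02]])
```

The three toy paths exercise the interpolation in Case 1, the closed-form branch of Case 2, and the branch of Case 2 that searches the interval $[\lambda', 0]$.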

E EXPERIMENTS

E.1 ABLATED D-CIPHER

The ablated version uses the standard MSE loss with estimated derivatives and thus solves the following optimization problem:
$$\min_{g} \min_{\|\beta\|_1 = 1} \sum_{d=1}^{D} \sum_{x \in G} \Big( \sum_{p=1}^{P} \beta_p \hat{E}_p[v^{(d)}](x) - g(x, v^{(d)}(x)) \Big)^2 \tag{57}$$
where $v^{(d)}$ is the observed field, $G$ is the sampling grid, and $\hat{E}_p[v^{(d)}](x)$ requires derivative estimation. The ablated version uses the same symbolic regression algorithm to search over closed-form $g$ and CoLLie for the inner optimization.

E.2 IMPLEMENTATION

Step 1. For the homogeneous heat equation and Burgers' equation we use the dictionary $Q = \{u, \partial_t u, \partial_x u, \partial_x (u^2), \partial_x^2 u, \partial_x^2 (u^2)\}$. For the Kuramoto-Sivashinsky equation we use the dictionary $Q = \{u, \partial_t u, \partial_x u, \partial_x (u^2), \partial_x^2 u, \partial_x^2 (u^2), \partial_x^3 u, \partial_x^3 (u^2), \partial_x^4 u, \partial_x^4 (u^2)\}$. For the damped and forced harmonic oscillator we use the dictionary $\{\partial_t, \partial_t^2\}$, and for the wave and heat equations, we use $\{\partial_t, \partial_x, \partial_t^2, \partial_t \partial_x, \partial_x^2\}$.

Step 2. Field estimation is performed using Gaussian Process Regression from the Python library scikit-learn (Pedregosa et al., 2011). The kernel is chosen to be the RBF kernel (Williams & Rasmussen, 2006) with an added White kernel to account for noise. The observed field is initially standardized by subtracting the mean and dividing by the standard deviation. Then the GaussianProcessRegressor is fitted to the data. The estimated fields are generated by predicting the values of the trained Gaussian Process on a full integration grid and then scaling back to their original range (by multiplying by the standard deviation and adding the mean).

Step 3. The search over the closed-form expression is performed using the symbolic regression library gplearn (Stephens, 2022). We use a custom fitness function that solves the inner optimization problem in Equation 13. This inner optimization is performed by CoLLie (Section 6). The integration is performed using Riemann sums.

Ablated version of D-CIPHER. The derivative estimation is performed by first fitting a Gaussian process (in the same way as in Step 2) and then using the finite difference to estimate the derivative in one of the coordinates for all points in the sampling grid. To obtain higher-order derivatives, a Gaussian process is fitted again and the derivative is once again calculated using the finite difference (possibly in a different direction than the first time).
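The finite-difference step that the ablated version relies on can be sketched as follows. This is a simplified illustration under stated assumptions: the differences are applied directly to noisy samples of a hypothetical signal $u(t) = \sin(3t)$ rather than to a fitted Gaussian process, and the step size and noise level are chosen only for demonstration. It shows why estimated derivatives are fragile: the same noise that barely perturbs the field is amplified by roughly $1/h$ in the first derivative and $1/h^2$ in the second.

```python
import numpy as np

def central_diff(values, h):
    """Central finite difference for the first derivative on interior grid points."""
    return (values[2:] - values[:-2]) / (2.0 * h)

def second_diff(values, h):
    """Central finite difference for the second derivative on interior grid points."""
    return (values[2:] - 2.0 * values[1:-1] + values[:-2]) / h ** 2

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
h = 0.01                                    # illustrative step, not the paper's setting
t = np.arange(0.0, 2.0 + h / 2, h)
u_true = np.sin(3.0 * t)
u_noisy = u_true + rng.normal(scale=0.01, size=t.shape)

err_field = rmse(u_noisy, u_true)
err_d1 = rmse(central_diff(u_noisy, h), 3.0 * np.cos(3.0 * t[1:-1]))
err_d2 = rmse(second_diff(u_noisy, h), -9.0 * np.sin(3.0 * t[1:-1]))
```

Smoothing the field before differencing reduces this amplification but does not remove it, which is the motivation for avoiding derivative estimation altogether.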

E.3 HYPERPARAMETERS

Gaussian process regression. The kernel parameters of the Gaussian Process are automatically adjusted during training. The default bounds of the length scale of the RBF kernel and the noise level of the White kernel are used, i.e., (1e-5, 1e5).

GPlearn. We do not perform parameter tuning for the gplearn library and use the same parameters as in D-CODE (Qian et al., 2022) except for the parsimony coefficient and the number of generations. The number of generations is chosen to be 30 for the damped and forced harmonic oscillator and 20 for the inhomogeneous heat and wave equations. See Stephens (2022) for a detailed description of these parameters. We modify the implementation of the parsimony coefficient. The standard implementation adds to the loss the length of the equation multiplied by the parsimony coefficient. In our implementation, we increase the loss by the parsimony coefficient. This modification is performed because for different experiments we record the loss on widely different scales. To avoid tuning this parameter for every experimental setting we introduce a penalty that can work on different scales. The parsimony coefficient is chosen manually by performing experiments for a few values. The value used in the experiments is 0.05. The set of allowed mathematical operations is $\{+, -, \times, \div, \sin, \exp, \log\}$. We want to emphasize that we use the same configuration of gplearn in D-CIPHER and its ablated version.

Integration and number of testing functions. For the damped and forced harmonic oscillator we use 10 testing functions and the integration step 0.01. For the inhomogeneous heat and wave equations we use 100 testing functions and integrate on a grid with steps $\delta t = 0.01$ and $\delta x = 0.01$.

Derivative estimation in the ablated version of D-CIPHER. The Gaussian process is configured the same way as described above. The interval used in the finite difference method to estimate the derivative was chosen to be $10^{-3}$.

E.4 CHOICE OF EQUATIONS

Equations used in Section 7.1 are canonical equations from physics that often appear in other works about PDE discovery (Rudy et al., 2017). The homogeneous heat equation is a second-order PDE that models how heat diffuses through a region. It contains the dissipative term $\partial_x^2 u$. Burgers' equation is a second-order PDE used, for instance, in fluid mechanics or nonlinear acoustics (Crighton, 1979). It contains the advection term $u \partial_x u$ and the diffusion term that prevents shock formation. The Kuramoto-Sivashinsky equation is a fourth-order PDE used in modelling reaction-diffusion systems (Kuramoto, 1980) and is known for its chaotic behavior (Hyman & Nicolaenko, 1986).

In Section 7.2, we chose equations of physical significance that have an interesting ∂-free part that is not a linear combination (as discussed in Section 2). That makes them impossible to discover by the current methods. A forced and damped harmonic oscillator is a second-order ODE. Although D-CODE (Qian et al., 2022) can discover any closed-form first-order ODE, it cannot be used to discover second-order ODEs. Thus there is currently no algorithm capable of discovering this equation. Inhomogeneous heat and wave equations are second-order PDEs where the ∂-free part is a source (of heat and wave respectively). Moreover, the wave equation does not have the standard evolution form, as it does not involve the $\partial_t$ term. Thus even without the source term, most of the current methods cannot be applied directly to discover this equation.

E.5 DATA GENERATION

Homogeneous heat equation. The fields were generated by solving the equation $\partial_t u - \theta_1 \partial_x^2 u = 0$ ($\theta_1 = 0.25$) with Neumann boundary conditions $\partial_x u(t, 0) = \partial_x u(t, X) = 0$ and an initial condition $u(0, x) = u_0(x)$, where $u_0$ is randomly sampled from a Gaussian process. The equation is solved using the implicit BTCS scheme (Kereyu & Gofe, 2016) with steps $\delta t = 0.001$ and $\delta x = 0.001$. The observed field is generated by sampling $(t, x) \in [0, T] \times [0, X]$, evaluating the true field $u(t, x)$ and adding Gaussian noise. $T = 2$ and $X = 2$ are used in the experiments.

Burgers' equation. The fields are computed by solving $\partial_t u + u \partial_x u - \theta_1 \partial_x^2 u = 0$ ($\theta_1 = 0.2$) with an initial condition $u(0, x) = u_0(x)$, where $u_0$ is randomly sampled from a Gaussian process, and with Dirichlet boundary conditions $u(t, 0) = u_0(0)$, $u(t, X) = u_0(X)$. The equation is solved using the Crank-Nicolson scheme (Wani & Thakar, 2013) with steps $\delta t = 0.002$ and $\delta x = 0.002$. The observed field is generated by sampling $(t, x) \in [0, T] \times [0, X]$, evaluating the field $u(t, x)$ and adding Gaussian noise. $T = 2$ and $X = 2$ are used in the experiments.

Kuramoto-Sivashinsky equation. The solution is the same as the one used in (Rudy et al., 2017). The observed field is generated by sampling $(t, x) \in [0, T] \times [0, X]$, evaluating the field $u(t, x)$ and adding Gaussian noise. $T = 100$ and $X = 100$ are used in the experiments.

Damped and forced harmonic oscillator. The true fields are created by analytically solving the equation $\partial_t^2 u(t) + 2 \theta_1 \theta_2 \partial_t u(t) + \theta_2^2 u(t) = \theta_3 \sin(\theta_4 t)$, where $\theta_1 = 0.5$, $\theta_2 = 4.0$, $\theta_3 = 5.0$, $\theta_4 = 3.0$, with random initial conditions for $u(0)$ and $\partial_t u(0)$. The observed fields are then created by sampling $t \in [0, T]$, evaluating $u(t)$, and adding Gaussian noise. $T = 2$ was used in the experiments.

Inhomogeneous heat equation. The true fields are computed by solving $\partial_t u(t, x) - \theta_1 \partial_x^2 u(t, x) = \theta_2 e^{\theta_3 t}$, where $\theta_1 = 0.25$, $\theta_2 = 1.25$, $\theta_3 = 1.8$, with Neumann boundary conditions $\partial_x u(t, 0) = \partial_x u(t, X) = 0$ and an initial condition $u(0, x) = u_0(x)$, where $u_0$ is randomly sampled from a Gaussian process. The equation is solved using the implicit BTCS scheme (Kereyu & Gofe, 2016) with steps $\delta t = 0.001$ and $\delta x = 0.001$. The observed field is generated by sampling $(t, x) \in$ …

Grid (G): $\{0, 0.07, \ldots, 2\} \times \{0, 0.07, \ldots, 2\}$

E.7 BENCHMARKS

An important hyperparameter in PDE-FIND and WSINDy is the library $\Theta$ used. We impose that $Q$ and $\Theta \cup \{\partial_t u\}$ have the same number of elements. Moreover, solutions given by PDE-FIND and WSINDy are scaled to have the L1 norm equal to 1. Both of these measures are undertaken to ensure that the RMSE is comparable between the algorithms. For the homogeneous heat equation and Burgers' equation we use a library $\Theta = \{u, \partial_x u, \partial_x^2 u, u \partial_x u, u \partial_x^2 u\}$. For the Kuramoto-Sivashinsky equation, we use a library $\Theta = \{u, \partial_x u, \partial_x^2 u, \partial_x^3 u, \partial_x^4 u, u \partial_x u, u \partial_x^2 u, u \partial_x^3 u, u \partial_x^4 u\}$. In the experiments, we have not optimized for the derivative-free part as it is identically equal to 0 in all equations.

E.8 CORRECT FUNCTIONAL FORM

To measure success probability we need to establish whether two closed-form functions match. The previous approach (Qian et al., 2022) considered their functional forms, i.e., expressions where all numeric constants are replaced by placeholders. By this measure, functions $\sin(3x)$ and $\sin(3.5x)$ match as they have the same functional form $\sin(Cx)$, where $C$ is a placeholder. However, this definition is quite restrictive because functions $\sin(3x)$, $\sin(3x) + 0.001$, $1.001 \sin(3x)$, and $\sin(3x + 0.001)$ all have different functional forms. We consider it an open challenge to design a good metric that would meaningfully reflect whether the correct equation is discovered. We propose the following. For a target function $f$, we consider its augmented form $\tilde{f}$, defined as $\tilde{f}(x) = C_1 f(C_3 x + C_4) + C_2$, where all $C_i$ are placeholders. Then all numeric constants are turned into placeholders as well. In the end, we combine the constants. For instance, $C_1 + C_2$ becomes just $C_3$. As an example, let us consider a function $f(x) = 1.3 e^{2x}$. The augmented functional form is created in the following way:

1. Augment: $C_1 \times 1.3 e^{2 \times (C_3 \times x + C_4)} + C_2$
2. Replace: $C_1 \times C_5 e^{C_6 \times (C_3 \times x + C_4)} + C_2$
3. Combine: $C_1 e^{C_3 x + C_4} + C_2$

We perform this procedure for the target function. We can now take the standard functional form of the candidate function and check whether it matches the augmented functional form of the target function, taking into account that some of the constants might not be present in the candidate expression. To aid in this procedure, we use a Python library for symbolic mathematics, SymPy (Meurer et al., 2017).

E.9 COMPUTATION TIME

The average computation time for a single experiment with the damped and forced harmonic oscillator is 281 seconds with a standard error of 4.5 seconds. The average computation time for a single experiment with an inhomogeneous heat equation is 68 minutes with a standard error of 38 seconds. This time is measured on a single computer with an Intel Core i5-6500 CPU (4 cores) and 16GB of RAM. The experiments are run simultaneously on 5 computers like the one described above. The total time for all experiments (all seeds, all equations, all experimental settings, and both versions of D-CIPHER) is 65 hours.

E.10 LICENSES

The licenses of the software used in this work are presented in …

… (1985). The model is described by the following equation
$$\partial_t u(t, a) + \partial_a u(t, a) + m(a) u(t, a) = 0$$
where $m(a)$ is the age-specific mortality rate. We choose $m(a) = 2 e^{\theta a}$. We set $\theta = 1.5$ in the experiments. We note that the derivative-free part of the target PDE cannot be expressed as a linear combination of functions from a finite dictionary if the parameters are not known a priori. D-CIPHER is uniquely positioned among other discovery algorithms as the only technique that can recover any mortality rate that can be represented as a closed-form expression. We show a comparison between D-CIPHER and the Ablated D-CIPHER in …

…
$$f_1(x, u^{(d)}(x), \partial^{[K]} u^{(d)}(x)) = 0 \quad \forall x \in \Omega, \qquad f_2(x, u^{(d)}(x), \partial^{[K]} u^{(d)}(x)) = 0 \quad \forall x \in \Omega$$
then it is also a solution to any linear combination of these equations, i.e.,
$$\lambda_1 \times f_1(x, u^{(d)}(x), \partial^{[K]} u^{(d)}(x)) + \lambda_2 \times f_2(x, u^{(d)}(x), \partial^{[K]} u^{(d)}(x)) = 0 \quad \forall x \in \Omega \tag{60}$$

… (58) with different dictionaries. We start with a small dictionary $Q_1 = \{\partial_t u, \partial_a u\}$, and we create every new dictionary from the previous one by adding one more extended derivative. The final dictionary contains 10 elements, $Q_{10} = \{\partial_t u, \partial_a u, \partial_a^2 u, \partial_t^2 u, \partial_t \partial_a u, \partial_a (u^2), \partial_a^2 (u^2), \partial_t (u^2), \partial_t^2 (u^2), \partial_t (u^3), \partial_a (u^3)\}$. The average RMSE of the ∂-bound part is shown in Figure 8. We do not observe any increase in average error. Note that in these experiments we just focus on the ∂-bound part and do not optimize the ∂-free part.

F.4 COMPUTATIONAL COMPLEXITY

We want to emphasize that PDE discovery is not a time-critical application (usually this process is performed manually by scientists) and we believe D-CIPHER's computation time is acceptable for such a task. In this section, we describe which parts of the algorithm are most computationally intensive. Computation in D-CIPHER is performed in Step 2 and Step 3.

Step 2. The computational complexity of Step 2 depends on the choice of the smoothing algorithm. We want to emphasize that the user can use any smoothing algorithm based on their domain knowledge and experience, including spline regression, LOWESS, and Kalman filters. Gaussian process regression has time complexity $O(n^3)$ where $n$ is the number of data points in a grid $G$. D-CIPHER is specifically designed to work for sparse and noisy data, so we have not encountered major computational issues while performing Gaussian process regression. We also note that significant progress has been made in adapting Gaussian processes for datasets with many data points (Liu et al., 2020).

Step 3. D-CIPHER consists of two optimization loops. The outer optimization is performed by a symbolic regression algorithm (in our case genetic programming). The inner optimization is performed by CoLLie (Section 6). Searching through a space of closed-form expressions requires testing many candidate equations. D-CIPHER is designed to work with many different algorithms for symbolic regression and it is advised to choose an algorithm that can search through this space most efficiently. CoLLie was specifically designed to solve the optimization problem as quickly as possible with a minor accuracy trade-off (see Figure 1). It is based on LARS, which has time complexity $O(mn^2)$ (Efron et al., 2004), where $m$ is the number of samples and $n$ is the number of features. In our case, $n = P$ is usually small as it corresponds to the size of the dictionary and $m = SD$. In our experiments, $S$ was set up to 100 and $D$ up to 10. Overall LARS is performed very quickly. The additional steps in CoLLie require only a few arithmetical operations (linear in $n$) and a possible root search of a single-variable function, which is efficiently implemented using Brent's method (Brent, 2013).

Other important parts of the algorithm are the numerical integrations. One such integration is performed at the beginning of the algorithm to compute the matrix $Z$ (see Algorithm 1). It does not contribute much to the computation time as it is performed only once. The other integration is performed for each candidate equation to compute the vector $w$. Also, substantial time is spent on computing the values of the candidate function $g$ used in the integration. Fortunately, both of these operations can be implemented as vectorized operations which are designed to run very efficiently on modern hardware. We also discourage overly long equations for $g$ (see the discussion in Appendix E.3) to limit the number of operations performed.

F.5 CHALLENGES OF DERIVATIVE ESTIMATION

One of the advantages of D-CIPHER compared to other methods is the use of the variational formulation of PDEs that allows it to circumvent derivative estimation. This is important as derivative estimation is challenging, especially in noisy settings with infrequent sampling. The problem becomes more pronounced the higher the order of the derivative. To demonstrate these issues, we perform a series of synthetic experiments.

Qualitative study. First we qualitatively show how challenging the task of derivative estimation is. We generate an observed trajectory for the damped and forced harmonic oscillator. Then we estimate this trajectory using both Gaussian Process regression and Spline regression. As shown in Figure 9 (Panel A), the estimated trajectories are very close to the true trajectory. Then we estimate the first derivative (Panel B) and the second derivative (Panel C). We show the standard finite difference methods as well as derivative estimation techniques using Spline regression and Gaussian Process regression. In both cases we see that the estimated derivatives do not match the ground truth (calculated analytically) as closely as in Panel A. Moreover, the mismatch for the second derivative seems to be bigger than for the first derivative.

Quantitative study. We investigate this relation quantitatively for the damped and forced harmonic oscillator and the wave equation. For the oscillator we generate an observed trajectory and then we estimate its derivatives up to the fourth order using both finite difference and Gaussian Processes. We then compare the derivatives with the analytically calculated ground truths and measure the root mean squared error. The results are shown in Figure 10 (Panel A). We can see that the error increases the higher the order of the derivative. For the wave equation, we perform a similar experiment but this time we estimate different mixed derivatives.
We consider any mixed derivative $\partial_t^i \partial_x^j$, where $i, j \in \{0, 1, 2\}$. We demonstrate the results in Figure 10 (Panel B). We observe that the error increases the higher the order of the derivative ($i + j$). The $(i, j)$ entry of the heatmap should be interpreted as the RMSE between the estimated derivative $\partial_t^i \partial_x^j$ and the ground truth.

F.6 CHALLENGES OF PDE DISCOVERY AND HOW WE ADDRESS THEM

PDE discovery is a very difficult task with many challenges. In Table 9 we summarize some of them and describe how our work addresses them. We do not estimate the initial conditions (Chen et al., 2018) or perform forward time stepping (Long et al., 2019), which are computationally unstable for chaotic systems, and we use the variational formulation to circumvent derivative estimation.

We also note how the new notions we introduce help us compare the two works. The algorithm presented in Qian et al. (2022) can discover any first-order explicit closed-form ODE, i.e., an equation of the form
$$\partial_t u_j(t) = g(u(t), t)$$
where $g$ is a closed-form function. It uses the variational formulation of ODEs to circumvent derivative estimation. The algorithm of Qian et al. (2022) can be considered a special case of D-CIPHER. We can recover it from D-CIPHER by choosing the dictionary to contain only one element, i.e., $Q = \{\partial_t u_j\}$. As the derivative-bound part is fixed, every ODE of that form admits the variational formulation. This is not true for PDEs as there are derivative-bound parts that might prohibit the variational formulation. Thus D-CIPHER required careful consideration of the appropriate class of equations to search over. As D-CIPHER needs to find both the ∂-free part (function $g$) and the ∂-bound part, the optimization problem is much more complicated.
That is why we restrict the derivative-bound part of the PDE to be spanned by terms from the pre-specified dictionary and develop an efficient optimization algorithm, CoLLie. We emphasize that, as is the case for Qian et al. (2022), we do not put any constraints on the derivative-free part of the PDE, apart from it being closed-form.

F.9 ERROR BOUNDS

While we would like to have error bounds for the discovered PDE, we note that the problem we solve is significantly more difficult than the one considered in other works. The space of PDEs we consider is much more complex than the space of PDEs in a linear combination form. In other works (e.g., Rudy et al. (2017), Messenger & Bortz (2021a)), the PDE is basically a vector in $\mathbb{R}^P$ and the discovery task is mostly reduced to finding a sparse enough vector that approximately solves a certain linear equation. Of course, there is a lot of literature that aids in establishing error bounds in such problem settings. However, D-CIPHER searches over a space $\mathbb{R}^P \times \mathrm{CFE}(M + N)$, where $\mathrm{CFE}(M + N)$ is a space of closed-form expressions in $M + N$ variables. This space is combinatorial in the functional form and continuous in real constants. This makes it very challenging to derive any error bounds.



Throughout this work we assume that the functions we use are smooth enough for the equality of mixed partials (Spivak, 2018) to hold. In that case, any mixed derivative can be uniquely specified by a multi-index. We say $u : \mathbb{R}^M \to \mathbb{R}$ is in $C^K$ if $\partial^{\alpha} u$ exists and is continuous for all $|\alpha| \le K$.

and 6, as well as in Appendix C and D. The implementation details, data generation procedures, hyperparameters, and experimental settings are described in Appendix E for D-CIPHER and in Appendix D for CoLLie. All experiment code will be published upon acceptance.



Figure 1: We compare CoLLie with an algorithm that uses CVXOPT (Andersen et al., 2013) to solve each of the convex subproblems. We report the relative error between the loss obtained by CoLLie and the minimum loss achieved by CVXOPT. Panels B and C show the averages and the distributions of relative errors. The average relative error is below 0.005 and the bulk of the distribution is below $10^{-7}$. At the same time, CoLLie is orders of magnitude faster (Panel A).

Figure 2: Simulation results for the Burgers' equation, homogeneous heat equation, and Kuramoto-Sivashinsky equation. We report the Average RMSE of the ∂-bound part of the equation. Note that some of the benchmarks overlap.

Figure 3: Success probability of discovering the correct ∂-free part of the equation and the average RMSE between the recovered ∂-bound part and the target one across different experimental settings. We compare D-CIPHER against its ablated version (Abl. D-CIPHER).

Figure 4: We solve the inhomogeneous wave equation for the ∂-free parts found by D-CIPHER and its ablated version (Abl. D-CIPHER). We show the absolute difference between the computed fields and the true field generated by the ∂-free part $2 \times e^{t} \sin(3t)$.

2. Appendix B: variational formulation for linear PDEs and the proof of Theorem 1
3. Appendix C: details of the D-CIPHER framework, including pseudocode
4. Appendix D: details of the CoLLie algorithm
5. Appendix E: details of experiments and the implementation
6. Appendix F: additional experiments and discussion

A NOTATION AND DEFINITIONS

A.1 NOTATION

Input: Observed fields $\mathcal{D} = \{v^{(d)}\}_{d=1}^{D}$, grid $G$
Input: Symbolic regression optimization algorithm $O$
Input: Smoothing algorithm $S$
Input: Testing functions $\{\phi_s\}_{s=1}^{S}$
Input: Dictionary

Figure 5: This diagram describes how the algorithm works. After the optimization procedure is finished, we get the best found closed-form function $g$ and use CoLLie to find the best vector $\beta$. The found equation has the form $\sum_{p=1}^{P} \beta_p \hat{E}_p[u](x) - g(x, u(x)) = 0$.

Case 3. $q(0) = 0$. In that case, we just return a precomputed solution to the problem

minimize $\|Az\|$ subject to $\|z\|_1 - 1 = 0$    (56)

which we compute by subdividing the problem into $2^n$ quadratic programs and solving each of them separately using the CVXOPT algorithm (Andersen et al., 2013), as described in Section 6.
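The subdivision in Case 3 can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: the norm in the objective is taken to be Euclidean, each of the $2^n$ subproblems corresponds to one sign pattern of $z$, and each subproblem is solved with a simple projected gradient method rather than CVXOPT. All function names are hypothetical.

```python
import itertools
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def objective(A, z):
    """Return the Euclidean norm of A z, with A given as a list of rows."""
    return math.sqrt(sum(sum(a * x for a, x in zip(row, z)) ** 2 for row in A))


def solve_orthant(A, signs, iterations=2000):
    """Minimize ||A z||^2 over {z : signs_i * z_i >= 0, ||z||_1 = 1}.

    Assumption: projected gradient descent stands in for the QP solver.
    """
    n = len(signs)
    # Substitute z_i = signs_i * w_i so that w lies on the probability simplex.
    B = [[row[i] * signs[i] for i in range(n)] for row in A]
    # Gram matrix M = B^T B; the gradient of ||B w||^2 is 2 M w.
    M = [[sum(row[i] * row[j] for row in B) for j in range(n)] for i in range(n)]
    trace = sum(M[i][i] for i in range(n))
    # 2 * trace(M) upper-bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * trace) if trace > 0 else 1.0
    w = [1.0 / n] * n
    for _ in range(iterations):
        grad = [2.0 * sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = project_to_simplex([w[i] - step * grad[i] for i in range(n)])
    return [signs[i] * w[i] for i in range(n)]


def min_norm_on_l1_sphere(A):
    """Solve one convex subproblem per sign pattern and keep the best solution."""
    n = len(A[0])
    best = None
    for signs in itertools.product((1.0, -1.0), repeat=n):
        z = solve_orthant(A, signs)
        if best is None or objective(A, z) < objective(A, best):
            best = z
    return best


if __name__ == "__main__":
    z = min_norm_on_l1_sphere([[1.0, 0.0], [0.0, 2.0]])
    print(z, objective([[1.0, 0.0], [0.0, 2.0]], z))
```

Restricting $z$ to a fixed sign pattern turns the non-convex constraint $\|z\|_1 = 1$ into a linear one, which is why each subproblem is convex. The number of subproblems grows as $2^n$, so a sketch of this kind is only practical for small dictionaries.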

Figure 6: Panel A shows an example of solution paths calculated by the LARS algorithm. Panel B shows their extended versions as defined in Case 2 in D.2. The x-axis is reversed, so λ decreases as it moves to the right.

Figure 7: Simulation results for both Cauchy-Riemann equations and two Laplace's equations. We report the Average RMSE of the ∂-bound part in different noise settings.

Figure 8: Simulation results for the Sharpe-Lotka-McKendrick model. We report the Average RMSE of the ∂-bound part for different sizes of the dictionary Q.

Figure 9: Panel A shows that estimation of the true trajectory can be performed successfully by both Spline regression (Spline) and Gaussian Process regression (GP). Panel B shows the estimated first derivative and Panel C shows the estimated second derivative. We observe that the higher the order the less accurate is the estimate.

Figure 10: Panel A demonstrates that the error between the estimated derivative and the ground truth increases with the order of the derivative (performed for the damped and forced harmonic oscillator). Panel B shows that the same happens for the wave equation. The $(i, j)$ entry of the heatmap should be interpreted as the RMSE between the estimated derivative $\partial_t^i \partial_x^j$ and the ground truth.
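The trend described in the captions of Figures 9 and 10 can be reproduced in a minimal sketch. It is not the procedure used in the experiments: it assumes a noisy sine signal and plain central finite differences instead of spline or Gaussian Process regression, and all names and parameter values are hypothetical.

```python
import math
import random


def central_first(values, h):
    """Central-difference estimate of the first derivative at interior points."""
    return [(values[k + 1] - values[k - 1]) / (2 * h)
            for k in range(1, len(values) - 1)]


def central_second(values, h):
    """Central-difference estimate of the second derivative at interior points."""
    return [(values[k + 1] - 2 * values[k] + values[k - 1]) / h ** 2
            for k in range(1, len(values) - 1)]


def rmse(estimate, truth):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


def derivative_errors(n_points=201, noise_std=0.01, seed=0):
    """Return the RMSE of the order-0, order-1, and order-2 estimates of sin(t)."""
    rng = random.Random(seed)
    h = 2 * math.pi / (n_points - 1)
    ts = [k * h for k in range(n_points)]
    # Assumed observation model: true signal plus i.i.d. Gaussian noise.
    noisy = [math.sin(t) + rng.gauss(0.0, noise_std) for t in ts]
    interior = ts[1:-1]
    err_order0 = rmse(noisy, [math.sin(t) for t in ts])
    err_order1 = rmse(central_first(noisy, h), [math.cos(t) for t in interior])
    err_order2 = rmse(central_second(noisy, h), [-math.sin(t) for t in interior])
    return err_order0, err_order1, err_order2


if __name__ == "__main__":
    print(derivative_errors())
```

Each differentiation divides the measurement noise by another power of the grid spacing, so the error grows with the order of the derivative even when the signal itself is observed accurately.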

Figure 11: Comparison of different estimation algorithms that can be used in D-CIPHER. GP: Gaussian Process regression; Nearest: Nearest point interpolation; Linear: Linear interpolation; Cubic: Cubic interpolation.

Columns correspond to challenges outlined in Section 1 and answer the following questions: Can it discover PDEs? Does it avoid derivative estimation? Is the evolution assumption unnecessary (Equation

Equations used in the experiments. "LC" column specifies if the equation can be represented as a linear combination (Equation 4). "VR" column specifies if the PDE is Variational-Ready.

We report the success probability of discovering the ∂-free part and the Average RMSE of the ∂-bound part for the inhomogeneous heat equation. Standard deviations are shown in brackets.

CIPHER 0.46 (.07) 0.20 (.06) 0.04 (.03) 0.18 (.009) 0.24 (.008) 0.27 (.007)



Symbols used in this work

Examples of equations which are Variational-Ready

Hyperparameters used in gplearn

Software used and their licenses

Simulation results for the Sharpe-Lotka-McKendrick model. We report the success probability of discovering the ∂-free part and the Average RMSE of the ∂-bound part. Standard deviations are shown in brackets.

Discovering systems of PDEs is a much harder problem than discovering a single PDE. One of the issues is what we would call indeterminism. It follows from the following fact. If a vector field $u$ is a solution to two differential equations $f_1$ and $f_2$, i.e.,

Some challenges of PDE discovery and how we address them

Comparison between D-CIPHER and Qian et al. (2022)


Ethics Statement. We want to emphasize that D-CIPHER was designed to facilitate the process of scientific discovery by extracting closed-form PDEs from data. It is neither intended to replace nor capable of replacing human experts in the modeling process. No human-derived data was used.

Reproducibility Statement. The assumptions of Theorem 1 are discussed in Section 4 and the proof is presented in Appendix B.2. The details of D-CIPHER and CoLLie are discussed in Section

for any $\lambda_1, \lambda_2 \in \mathbb{R}$. Moreover, equations can sometimes be differentiated to yield more equations.

Let us take as an example the Cauchy-Riemann equations defined as:

Let us assume that we have a true vector field $(u_1, u_2)$ that satisfies both equations. Then the following equations are also satisfied

We can also differentiate the Cauchy-Riemann equations to arrive at the Laplacian equations for $u_1$ and $u_2$.

We could also combine the first-order equations with second-order equations or consider even higher-order derivatives. Although all these equations are compatible with our vector field $(u_1, u_2)$, not all of them are equally desirable to discover. That is why we believe that any algorithm for discovering systems of differential equations has to encode substantial expert knowledge or inductive biases to guide it towards the right equations.

Current methods do not consider systems of equations or consider a system of equations of a very particular form. In the latter case, each equation models a derivative with respect to time of a different scalar field. The system is assumed to look like this:

In addition, the LHS of these equations is often assumed to only contain spatial derivatives.

D-CIPHER can be used to discover some systems of differential equations if enough prior knowledge is provided in the choice of the dictionary Q. Moreover, the discovered equations are not required to have a particular evolution form as is the case in current approaches.

Firstly, we note that D-CODE (Qian et al., 2022) has been shown to discover a system of equations that looks like Equations 64 when all equations are first-order ODEs. As D-CIPHER reduces to D-CODE when applied to first-order ODEs, it is also capable of discovering such a system.

We demonstrate that D-CIPHER is able to discover both Cauchy-Riemann equations if we use two different dictionaries. Each of the dictionaries yields a different equation. Additionally, it is a well-known fact that if a vector field satisfies the Cauchy-Riemann equations then the constituent scalar fields are harmonic, i.e., they satisfy Laplace's equation. Based on the same dataset we are also able to discover both Laplace's equations given another set of two different dictionaries. We note that Laplace's equation does not contain ∂ x term, so most of the current methods cannot be directly applied to discover this equation. The results are presented in Figure 7. The dictionaries used to discover each of the equations are the following.

In the experiments, we have not optimized for the derivative-free part as it is identically equal to 0 in all equations.
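The indeterminism discussed above, where a single field satisfies several equations at once, can be checked numerically in a short sketch. The field below is an assumption chosen for illustration: the real and imaginary parts of $(x + iy)^2$, with the Cauchy-Riemann equations written in their standard form $\partial_x u_1 - \partial_y u_2 = 0$ and $\partial_y u_1 + \partial_x u_2 = 0$. The notation and function names are hypothetical and may differ from those used in the paper.

```python
H = 1e-3  # finite-difference step (assumed value)


def u1(x, y):
    """Real part of (x + iy)^2."""
    return x * x - y * y


def u2(x, y):
    """Imaginary part of (x + iy)^2."""
    return 2.0 * x * y


def dx(f, x, y):
    return (f(x + H, y) - f(x - H, y)) / (2 * H)


def dy(f, x, y):
    return (f(x, y + H) - f(x, y - H)) / (2 * H)


def dxx(f, x, y):
    return (f(x + H, y) - 2 * f(x, y) + f(x - H, y)) / H ** 2


def dyy(f, x, y):
    return (f(x, y + H) - 2 * f(x, y) + f(x, y - H)) / H ** 2


def residuals(x, y, lam1=2.0, lam2=-3.0):
    """Residuals of several equations that the same field (u1, u2) satisfies."""
    cr1 = dx(u1, x, y) - dy(u2, x, y)
    cr2 = dy(u1, x, y) + dx(u2, x, y)
    return {
        "cauchy_riemann_1": cr1,
        "cauchy_riemann_2": cr2,
        # Any linear combination of the two equations also holds.
        "linear_combination": lam1 * cr1 + lam2 * cr2,
        # Differentiating and combining the equations gives Laplace's equation.
        "laplace_u1": dxx(u1, x, y) + dyy(u1, x, y),
        "laplace_u2": dxx(u2, x, y) + dyy(u2, x, y),
    }


if __name__ == "__main__":
    for name, value in residuals(0.7, -0.3).items():
        print(name, value)
```

All five residuals vanish up to rounding error, so the data alone cannot single out one of these equations. This is consistent with the role the dictionary plays in selecting which equation is recovered.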

F.7 SIGNIFICANCE OF THE NEW NOTIONS

In this section, we want to justify that the new notions we introduce (evolution assumption, linear combination form, derivative-bound part, derivative-free part, Variational-Ready PDEs) are important theoretical contributions that help us understand the landscape of different PDEs from the machine learning perspective.

The definitions we introduce let us characterize different classes of PDEs. These new notions complement the standard recognized PDE classes such as semi-linear, quasilinear, hyperbolic, etc. These standard classes were introduced predominantly to characterize the solving techniques or the properties of the solutions, whereas the notions we introduce relate to the difficulty of discovering such equations from data.

Significance of linear combination form and the evolution assumption. In Table 10 we demonstrate how the presence of the two assumptions, the linear combination form (LC) and the evolution assumption (EA), influences the optimization problem. We see that with both assumptions, the problem is relatively straightforward and reduces to sparse linear regression. With only one of the assumptions present, the problem becomes more difficult. With neither of these assumptions, the problem becomes very difficult and requires some other assumptions. D-CIPHER does not make either of these assumptions but it assumes the PDE to be of the form described by Equation 12.

Significance of ∂-bound part, ∂-free part and Variational-Ready PDEs. The difficulty of derivative estimation has been one of the main challenges of PDE discovery. The variational formulation allows us to circumvent derivative estimation and is thus more robust to noisy data. Previously, the variational formulation has been applied only to a subset of equations in a linear combination form and with the evolution assumption. We observed that any restrictions that the variational formulation might put on the equation come from the terms containing the derivatives.
Thus we define the derivative-bound part and the derivative-free part of the PDE due to their significance for the variational formulation. That allows us to define Variational-Ready PDEs as currently the broadest class of PDEs that admit the variational formulation. We believe it is an important contribution as methods requiring derivative estimation underperform in settings with high noise. This definition outlines the current limits of any method that circumvents derivative estimation in that way.

Table 10: This table demonstrates how the presence of the two assumptions, the linear combination form (LC) and the evolution assumption (EA), influences the optimization problem.

LC | EA | Equation form | Optimization problem | References
Yes | Yes | $\partial_t u_j = \sum_{p=1}^{P} \theta_p f_p(x, u(x), \partial^{[K]} u(x))$ | Relatively easy. Can be formulated as finding a sparse solution to linear least squares (similar to ridge regression) | [1, 2]
Yes | No | | Medium difficulty. Can be formulated as P separate linear least squares problems as above | [3]
No | Yes | $\partial_t u_j = g(x, u(x))$ | Medium difficulty. Find $g$ using symbolic regression | [4]
No | No | $f(x, u(x), \partial^{[K]} \dots$ | |

Step 2 of D-CIPHER requires estimating the fields. We emphasize that any choice of reconstruction algorithm can be used, and it should be chosen based on the application and domain knowledge. In our experiments, D-CIPHER is implemented using Gaussian Process regression (Williams & Rasmussen, 2006). In this section, we investigate other common interpolation algorithms. We implement D-CIPHER with different estimation algorithms. In particular, we compare Gaussian Process (GP) against: Nearest point interpolation (Nearest), Linear interpolation (Linear), and Cubic interpolation (Cubic). The implementation details of these three algorithms can be found in the scipy (Virtanen et al., 2020) documentation. The results are presented in Figure 11. We see that the estimation algorithms that produce smoother functions (GP, Cubic) tend to give better results.
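The effect of the reconstruction algorithm on the estimated field can be illustrated with a small sketch. It only measures how well a coarsely sampled smooth signal is reconstructed, not the downstream discovery performance reported in Figure 11. It also uses hand-written nearest-point and linear interpolation instead of the scipy routines, and the sine signal, sample counts, and function names are assumptions.

```python
import math


def nearest_interp(xs, ys, x):
    """Piecewise-constant estimate: value at the closest sample point."""
    k = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[k]


def linear_interp(xs, ys, x):
    """Piecewise-linear estimate between the two neighbouring sample points."""
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            weight = (x - xs[k]) / (xs[k + 1] - xs[k])
            return (1 - weight) * ys[k] + weight * ys[k + 1]
    raise ValueError("x lies outside the sampled grid")


def reconstruction_rmse(interp, n_samples=11, n_eval=200):
    """RMSE of reconstructing sin(x) on [0, 2*pi] from n_samples observations."""
    xs = [2 * math.pi * k / (n_samples - 1) for k in range(n_samples)]
    ys = [math.sin(x) for x in xs]
    # Evaluation points lie strictly inside the sampled interval.
    eval_points = [2 * math.pi * (k + 0.5) / n_eval for k in range(n_eval)]
    squared = sum((interp(xs, ys, x) - math.sin(x)) ** 2 for x in eval_points)
    return math.sqrt(squared / n_eval)


if __name__ == "__main__":
    print("nearest:", reconstruction_rmse(nearest_interp))
    print("linear: ", reconstruction_rmse(linear_interp))
```

In this toy setting the continuous, piecewise-linear reconstruction is closer to the smooth ground truth than the discontinuous nearest-point one, which agrees in direction with the observation that smoother estimators tend to give better results.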

