PHYSICS INFORMED DEEP KERNEL LEARNING

Abstract

Deep kernel learning is a promising combination of deep neural networks and nonparametric function estimation. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a simple model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification.

1. Introduction

Deep kernel learning (Wilson et al., 2016a) uses deep neural networks to construct kernels for nonparametric function estimation (e.g., Gaussian processes (Williams and Rasmussen, 2006)) and unifies the expressive power of neural networks with the self-adaptation of nonparametric function learning. Many applications have shown that deep kernel learning substantially outperforms conventional shallow kernel learning (e.g., RBF). Compared to standard neural networks, deep kernel learning enjoys closed-form posterior distributions and hence is more convenient for uncertainty quantification and reasoning, which is important for decision making. Nonetheless, as a data-driven approach, the performance of deep kernel learning can still be restricted by scarce data, especially when the training samples are insufficient to reflect the complexity of the system (that produced the data) or the test points are far away from the training set, i.e., extrapolation. On the other hand, physics knowledge, expressed as differential equations, is used to build physical models for various science and engineering applications (Lapidus and Pinder, 2011). These models are meant to characterize the underlying mechanism (i.e., physical processes) that drives the system (e.g., how heat diffuses across the spatial and temporal domains) and are much less restricted by data availability: they can make accurate predictions even without training data, e.g., the landing of Curiosity on Mars and the flight of Voyager 1. Therefore, we consider integrating physics knowledge into deep kernel learning to further improve its performance in prediction and uncertainty quantification, especially for scarce-data and extrapolation tasks. Our work is enlightened by the recent Physics Informed Neural Networks (PINNs) (Raissi et al., 2019). However, there are two substantial differences. First, PINNs require the form of the differential equations to be fully specified.
We allow the equations to include unknown latent sources (functions), which is often the case in practice. Second, we integrate the differential equations in a principled Bayesian manner to pursue better calibrated posterior estimations. Specifically, we use the posterior sample of the Gaussian process (GP), which is a random function, as the surrogate of the solution of the differential equation. We then apply the differential operators in the equation to obtain the sample of the latent source (function), for which we assign another GP prior. To ensure the sampling procedure is valid, we use the symmetry of the Gaussian distribution to sample a set of virtual observations {0}, which is computationally equivalent to placing the GP prior with zero mean function over the latent source. The sampling procedure constitutes a generative component and ties it to the original deep kernel model within the Bayesian hybrid framework (Lasserre et al., 2006). For efficient and high-quality inference, we marginalize out all the latent variables in the joint distribution to avoid approximating their complex posteriors. Then we use Jensen's inequality to derive a simple model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. The ELBO can be further explained as a soft posterior regularization objective (Ganchev et al., 2010), regularized by physics. For evaluation, we examined our physics informed deep kernel learning (PI-DKL) in both simulation and real-world applications. On synthetic datasets based on two commonly used differential equations, PI-DKL outperforms standard deep kernel learning, shallow kernel learning, and latent force models (LFM) that combine the physics via kernel convolution, in both ground-truth function recovery and prediction uncertainty, especially in the case of extrapolation. We then examined PI-DKL in four real-world applications.
PI-DKL consistently improves upon the competing approaches in prediction error and test log-likelihood. We applied PI-DKL for a nonlinear differential equation where LFM is infeasible. PI-DKL significantly outperforms standard deep/shallow kernel learning methods.

2. Background

Gaussian Process and Kernel Learning. The Gaussian process (GP) is the most commonly used nonparametric function prior for kernel learning. Suppose we aim to learn a function f : R^d → R from a training set D = (X, y), where X = [x_1, ..., x_N]^T, y = [y_1, ..., y_N]^T, each x_n is a d-dimensional input vector, and y_n is the observed output. To avoid under-fitting and over-fitting, we do not want to assume any parametric form of f. Instead, we want the complexity of f(·) to automatically adapt to the data. To this end, we introduce a kernel function k(·, ·) that measures the similarity of the function values in terms of their inputs. The similarity only brings in a smoothness assumption about the target function. For example, the commonly used RBF kernel, k_RBF(x_i, x_j) = exp(-||x_i - x_j||^2 / η), implies the function is infinitely differentiable. We then use the kernel to construct a GP prior, f ~ GP(m(·), k(·, ·)), where m(·) is the mean function, usually set to constant 0. According to the GP definition, the finite projection of f(·) on the training inputs X, namely f = [f(x_1), ..., f(x_N)]^T, follows a multivariate Gaussian distribution, p(f|X) = N(f|0, K), where K is the kernel matrix on X and each [K]_{i,j} = k(x_i, x_j). Given the function values f, the observed outputs y are sampled from a noise model. For example, when y are continuous, we can use the isotropic Gaussian noise model, p(y|f) = N(y|f, τ^{-1} I), where τ is the inverse variance. We can then integrate out f to obtain the marginal likelihood,

p(y|X) = N(y|0, K + τ^{-1} I).    (1)

To learn the model, we can maximize the likelihood to estimate the kernel parameters and the inverse variance τ.
According to the GP prior, given a new input x_*, the posterior (or predictive) distribution of the output f(x_*) is conditional Gaussian,

p(f(x_*)|x_*, X, y) = N(f(x_*)|µ(x_*), v(x_*)),    (2)

where µ(x_*) = k_*^T (K + τ^{-1} I)^{-1} y, v(x_*) = k(x_*, x_*) - k_*^T (K + τ^{-1} I)^{-1} k_*, and k_* = [k(x_*, x_1), ..., k(x_*, x_N)]^T. Deep Kernel Learning. While GP priors with shallow kernels (e.g., RBF and Matern) have achieved great success in many applications, these shallow structures can limit the expressiveness of the kernels in estimating highly complicated functions, e.g., with sharp discontinuities and high curvatures. To address this problem, Wilson et al. (2016a) proposed to construct deep kernels with neural networks. Specifically, they first choose a shallow kernel as the base kernel. Each input is first fed into a neural network (NN), and the NN outputs are then fed into the base kernel to compute the final kernel function value. Taking RBF as the base kernel, we can construct a deep kernel by

k_DEEP(x_i, x_j) = k_RBF(NN(x_i), NN(x_j)).    (3)

Note that the NN weights now become the kernel parameters. We can then use the deep kernel to construct a GP prior for nonparametric function estimation. The likelihood and predictive distribution have the same forms as in (1) and (2).
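The computations in (1)-(3) can be sketched in a few lines of NumPy. The toy one-layer tanh feature map, the synthetic data, and all hyper-parameter values below are illustrative assumptions, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, eta=1.0):
    # k_RBF(a, b) = exp(-||a - b||^2 / eta)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eta)

def nn(X, W, b):
    # toy single-layer feature map standing in for the deep network NN(.)
    return np.tanh(X @ W + b)

def deep_kernel(A, B, W, b, eta=1.0):
    # Eq. (3): k_DEEP(a, b) = k_RBF(NN(a), NN(b))
    return rbf(nn(A, W, b), nn(B, W, b), eta)

# synthetic 1-D data (an assumption for illustration)
N, d, h = 30, 1, 20
X = rng.uniform(-1, 1, (N, d))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(N)
W, b = rng.standard_normal((d, h)), rng.standard_normal(h)
tau = 100.0  # inverse noise variance

K = deep_kernel(X, X, W, b)
Ky = K + np.eye(N) / tau

# log marginal likelihood, Eq. (1): log N(y | 0, K + tau^-1 I)
sign, logdet = np.linalg.slogdet(Ky)
alpha = np.linalg.solve(Ky, y)
log_ml = -0.5 * (y @ alpha + logdet + N * np.log(2 * np.pi))

# predictive distribution, Eq. (2), at a new input x_*
Xs = np.array([[0.3]])
ks = deep_kernel(Xs, X, W, b)                                    # row vector k_*
mu = ks @ alpha                                                  # posterior mean
v = deep_kernel(Xs, Xs, W, b) - ks @ np.linalg.solve(Ky, ks.T)   # posterior variance
```

In the full method, log_ml would be maximized with respect to both the NN weights (W, b) and τ, since the weights are kernel parameters.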

3. Model

By using deep neural networks to construct highly expressive kernels, deep kernel learning greatly enhances the capability of estimating complicated functions, and meanwhile inherits the self-adaptation of nonparametric function learning and convenient posterior inference. However, as a purely data-driven approach, deep kernel learning can still suffer from data scarcity, especially when the training examples are inadequate to reflect the complexity of the underlying mapping and when the test points are distant from all the training samples, i.e., extrapolation. To overcome this limitation, we propose PI-DKL, a physics informed deep kernel learning model that exploits physics prior knowledge to improve function learning and uncertainty reasoning. Our model is presented as follows.

3.1. Physics Informed Deep Kernel Learning

We assume that, in general, the physics is described by a differential equation of the form

ψ[f(x)] = g(x),    (4)

where ψ is a functional that combines a set of differential operators, f(x) is the target (or solution) function we want to estimate from the training dataset D = (X, y), and g(x) is a latent source whose form is unknown. Note that the functional ψ[·] may include unknown parameters as well. One example is ψ[f(x)] = df(x)/dx + α f(x) - β, where the input x is a scalar, and α and β are unknown parameters. This functional represents a linear operator. Another commonly seen example is from the viscous version of Burgers' equation (Olsen-Kettle, 2011), ψ[f(x)] = ∂f(x)/∂x_2 + f(x) ∂f(x)/∂x_1 - v ∂^2 f(x)/∂x_1^2, where x = [x_1, x_2]^T, x_1 is the spatial variable, x_2 the time variable, and v the unknown viscosity parameter. This functional includes a nonlinear operator, f(x) ∂f(x)/∂x_1. To incorporate the physics knowledge in (4), we propose a hybrid of conditional and generative models based on the general framework proposed by Lasserre et al. (2006). The conditional component is the standard deep-kernel GP that, given the training inputs X, samples the (noisy) output observations y, and the probability p(y|X) is given in (1). The generative component places another GP prior over the latent source g(·), but avoids the double-prior problem to ensure a valid joint Bayesian model for posterior inference. Coupled with the differential operators, the generative component regularizes and guides the deep kernel learning of f(·). The graphical illustration of PI-DKL is shown in Fig. 1 of the supplementary document. Specifically, to place a GP prior over g(·), we first sample a finite set of input locations Z = [z_1, ..., z_m] (we will discuss the choice of p(Z) later). Then the projection of g(·) on Z follows a multivariate Gaussian distribution,

p(g|Z) = N(g|0, Σ),    (5)

where g = [g(z_1), . . .
, g(z_m)]^T, [Σ]_ij = κ(z_i, z_j), and κ(·, ·) is another kernel. Next, we link the GP model of the target f(·) to the latent source g(·) via the differential equation (4). Our key idea is that from the GP posterior (2), we can construct a sample of the target function, f̃(·) = µ(·) + ε v(·)^{1/2}, where ε ~ N(ε|0, 1), µ(·) is the posterior mean function, and v(·)^{1/2} is the posterior standard deviation function from (2). While this is a random function (due to ε), it has a closed form and we can apply the functional ψ to obtain a sample of g(·),

g̃(·) = h(·, ε) = ψ[µ(·) + ε v(·)^{1/2}].    (6)

Therefore, to sample g -- the values of g(·) on Z -- we can first sample a standard Gaussian white noise ε, and then sample from

p(g|ε, X, y) = ∏_{j=1}^m δ(g_j - h(z_j, ε)),    (7)

where g_j = g(z_j), and δ(·) is the Dirac delta function. Note that we can also directly view g as a transformation of the Gaussian noise ε and derive the marginal distribution p(g|X, y) (see the discussion in the supplementary material), which, however, is much more difficult to compute. Now, we want to tie the GP prior for g(·) in (5) to the samples g generated from the GP model of the target function f(·), i.e., through (7). In this way, the learning of f(·) can be guided or regularized by the differential equation (4). However, directly multiplying (5) and (7) is problematic, because g would have double priors and the sampling procedure would be invalid: if g is sampled from (5), it cannot be sampled again from (7), and vice versa. To ensure our model is a valid probabilistic model for posterior inference, we utilize the symmetry of the Gaussian distribution,

p(g|Z) = N(g|0, Σ) = N(0|g, Σ) = p(0|g, Z).    (8)

We can see that placing a (finite) GP prior over g(·) is equivalent to sampling a set of virtual observations 0, due to the symmetry of the Gaussian distribution. Therefore, we can turn the GP prior of the latent source into a generative component that samples a set of virtual observations 0.
From the computational perspective, the two are entirely equivalent. However, the sampling procedure now becomes valid: we first sample g from (7), and then sample 0 from (8). Note that the virtual observations 0 come from the zero mean function of the GP prior of g(·). We could use different virtual observations by choosing a nonzero mean function. Finally, we combine the conditional model and the generative model (see (1), (7) and (8)) to obtain the joint probability distribution

p(y, 0, Z, ε, g|X) = p(y|X) p(Z) p(ε) p(g|ε, X, y) p(0|g, Z)
                   = N(y|0, K + τ^{-1} I) p(Z) N(ε|0, 1) ∏_{j=1}^m δ(g_j - h(z_j, ε)) N(0|g, Σ).    (9)

The choice of p(Z) is flexible. If we have no knowledge about the input distribution, we can use a uniform distribution for a bounded domain; for unbounded domains we can use a wide zero-mean Gaussian distribution, or a uniform distribution on a region large enough to cover the predictions of interest.
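To make the generative component concrete, the sketch below draws a posterior function sample f̃(·) = µ(·) + ε v(·)^{1/2} from a toy GP, pushes it through the linear operator ψ[f] = f' + αf - β from the example above, and evaluates the virtual-observation likelihood N(0|g, Σ) of (8). The toy data, the shallow RBF kernel (standing in for the deep kernel), and the central finite differences (standing in for exact differentiation of the closed form) are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy 1-D GP fit, standing in for the deep-kernel GP posterior (2)
def rbf(a, b, eta=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / eta)

X = np.linspace(0.0, 0.5, 20)
y = np.sin(2 * np.pi * X)
tau = 100.0
Ky = rbf(X, X) + np.eye(len(X)) / tau

def mu_v(t):
    # posterior mean and standard deviation at inputs t
    ks = rbf(t, X)
    mean = ks @ np.linalg.solve(Ky, y)
    var = np.clip(1.0 - np.einsum('ij,ji->i', ks, np.linalg.solve(Ky, ks.T)),
                  1e-12, None)
    return mean, np.sqrt(var)

# one posterior function sample: f~(t) = mu(t) + eps * v(t)^(1/2), eps ~ N(0,1)
eps = rng.standard_normal()
def f_sample(t):
    mean, sd = mu_v(t)
    return mean + eps * sd

# apply psi[f] = f' + alpha f - beta to the sample, Eq. (6); central finite
# differences approximate the derivative (the paper differentiates the closed
# form exactly -- this numerical shortcut is our assumption)
alpha, beta, dt = 1.0, 1.0, 1e-4
def h(Z):
    fp = (f_sample(Z + dt) - f_sample(Z - dt)) / (2 * dt)
    return fp + alpha * f_sample(Z) - beta   # sample of g(.) on Z

# virtual observations, Eq. (8): evaluate log N(0 | g, Sigma)
m = 10
Z = rng.uniform(0, 1, m)
g = h(Z)
Sigma = rbf(Z, Z, eta=0.5) + 1e-6 * np.eye(m)
sign, logdet = np.linalg.slogdet(Sigma)
log_virtual = -0.5 * (g @ np.linalg.solve(Sigma, g) + logdet + m * np.log(2 * np.pi))
```

Because eps is fixed once, all three evaluations inside h come from the same random function, which is what makes g a valid sample of the latent source.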

4.1. Stochastic Collapsed Inference

We now present the model inference algorithm. The exact posteriors of the latent random variables Z, ε, and g in (9) are infeasible to calculate because they are coupled in the kernels and differential operators. While we could use variational approximations, they would introduce extra variational parameters, complicate the optimization, and affect the integration of the physics knowledge. Therefore, we marginalize out all the latent variables and conduct collapsed inference to avoid approximating their complex posteriors. Specifically, we derive that p(y, 0|X) = p(y|X) p(0|y, X), where

p(0|y, X) = ∫ p(Z) p(ε) p(g|ε, X, y) p(0|g, Z) dZ dε dg
          = E_{p(Z)} E_{p(ε)} [∫ δ(g - h) N(0|g, Σ) dg]
          = E_{p(Z)} E_{N(ε|0,1)} [N(h(Z, ε)|0, Σ)],    (10)

where h(Z, ε) = [h(z_1, ε), ..., h(z_m, ε)]^T. Note that h(·, ·) is defined in (6). To allow us to adjust the importance of the generative component, and so the influence of the physics during training, we weight the likelihood of the generative component by a free hyper-parameter γ ≥ 0. The weighted marginal likelihood (Warm, 1989; Hu and Zidek, 2002) is

p_γ(y, 0|X) = p(y|X) p(0|X, y)^γ.    (11)

Our inference maximizes the log weighted marginal likelihood to optimize the kernel parameters in k_DEEP(·, ·) and κ(·, ·), the inverse noise variance τ, and the unknown parameters in the differential equation,

log p_γ(y, 0|X) = log N(y|0, K + τ^{-1} I) + γ log E_{p(Z)} E_{N(ε|0,1)} [N(h(Z, ε)|0, Σ)].

However, this log likelihood is infeasible to compute due to the intractable expectation inside the logarithm. To address this problem, we use Jensen's inequality on the log function to obtain a model evidence lower bound (ELBO), L ≤ log p_γ(y, 0|X), where

L = log N(y|0, K + τ^{-1} I) + γ · E_{p(Z)} E_{N(ε|0,1)} [log N(h(Z, ε)|0, Σ)].    (12)

While L is still intractable, it is straightforward to maximize L with stochastic optimization. Each time, we generate a sample of the input locations from p(Z) and of the noise from N(ε|0, 1), denoted by Z̃ and ε̃.
We then obtain L̃ = log N(y|0, K + τ^{-1} I) + γ log N(h(Z̃, ε̃)|0, Σ), an unbiased stochastic estimate of L. We calculate ∇L̃ as an unbiased stochastic gradient of L, with which we can use any stochastic optimization method to estimate the model parameters. While h(·, ·) couples the deep kernels and complex operators in ψ, it is differentiable, and we can use automatic differentiation libraries to calculate the stochastic gradient conveniently. The ELBO L in (12) is the GP log marginal likelihood plus an extra term, E_{p(Z)} E_{N(ε|0,1)} [log N(h(Z, ε)|0, Σ)]. Each element of h is obtained by applying the functional ψ to the posterior sample of f(·) (see (6)). Jointly maximizing this term in L encourages all the possible latent source values (at the m locations) obtained from the GP posterior function f(·) (through the equation) to be plausible samples of another GP. This can be viewed as a soft constraint over the posterior function of the GP. Therefore, our ELBO is also a posterior regularization objective (Ganchev et al., 2010), and our inference estimates the standard deep-kernel GP model with a soft regularization on its posterior distribution.
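A single step of the stochastic collapsed inference can be sketched as follows. The GP and the map h(Z, ε) are toy stand-ins here; in the real model, h comes from applying ψ to the deep-kernel GP posterior sample, and the gradient of the estimate would be taken with automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_mvn0(x, S):
    # log N(x | 0, S)
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (x @ np.linalg.solve(S, x) + logdet + len(x) * np.log(2 * np.pi))

# hypothetical stand-ins: in the model these come from the deep-kernel GP
# and the differential operator psi; here they are toy closed forms
N, m, tau, gamma = 25, 10, 100.0, 0.5
X = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * X)
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1)

def h(Z, eps):
    # placeholder for psi[mu(.) + eps v(.)^(1/2)] evaluated on Z
    return eps * np.cos(2 * np.pi * Z)

# one stochastic step of Eq. (12): sample Z~ ~ p(Z) (uniform) and eps~ ~ N(0,1),
# then form the unbiased single-sample estimate of L
Z = rng.uniform(0, 1, m)
eps = rng.standard_normal()
Sigma = np.exp(-(Z[:, None] - Z[None, :]) ** 2 / 0.5) + 1e-6 * np.eye(m)
L_hat = log_mvn0(y, K + np.eye(N) / tau) + gamma * log_mvn0(h(Z, eps), Sigma)
```

Repeating this sampling at every optimizer step (e.g., with ADAM, as in the experiments) gives the stochastic collapsed inference loop; γ controls how strongly the physics term regularizes the GP likelihood.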

4.2. Algorithm Complexity

The time complexity of inference in our model is O(N^3 + m^3), because it involves the covariance computations of two GPs: one for the standard model, and the other for the generative component. The time complexity of prediction is still O(N^3). The space complexity is O(N^2 + m^2), which includes storing the kernel matrices of the two GPs.

5. Related Work

An influential work, physics informed neural networks (PINNs) (Raissi et al., 2019), was recently proposed to train neural networks that respect physical laws. The key idea is to use a neural network as a surrogate for the solution of the (partial) differential equation, and to minimize the NN loss plus the residual error on a set of randomly collected collocation points in the input domain. Research along this line is growing quickly: (Mao et al., 2020; Jagtap et al., 2020; Zhang et al., 2020; Chen et al., 2020; Pang et al., 2019), to name a few. While our work is enlightened by PINNs, there are several substantial differences. First, PINNs demand that the form of the PDE be fully specified, i.e., ψ[f(x)] = 0, while we assume there can be some unknown source (function), ψ[f(x)] = g(x). Thus, our work exploits incomplete physics knowledge. Second, we use the posterior of the deep-kernel GP to construct a random surrogate for the PDE solution, and cast the integration of the physics into a principled Bayesian framework to enable posterior inference and uncertainty quantification, while PINNs only conduct point estimation. Our experiments show that incomplete physics knowledge can also improve uncertainty quantification. Note that, around the same time, Zhang et al. (2019) combined polynomial chaos (Xiu and Karniadakis, 2002) and dropout (Gal and Ghahramani, 2016) to estimate the total uncertainty of PINNs for stochastic PDEs. Many works have used GPs to model or learn physical systems (Graepel, 2003; Lawrence et al., 2007; Gao et al., 2008; Alvarez et al., 2009; 2013; Raissi et al., 2017). For example, Graepel (2003) uses GPs to solve a linear equation given observed noisy sources. He first defines the kernel for the solution function, from which he derives the kernel for the source function. The kernel parameters are then estimated from the noisy source data, given which the solution can be predicted. Raissi et al.
(2017) assume both the noisy forces and solutions are observed, and they jointly model these examples in one single GP with a heterogeneous block covariance matrix. Latent force models (LFMs) (Alvarez et al., 2009) make the same assumption about the differential equations as our work. LFMs convolve the Green's function of the equation with the kernel for the latent source to obtain the kernel for the target, and then learn the kernel parameters from data. While LFMs enable a hard encoding of the physics, they rely on an analytical Green's function, which is not available for many equations. In addition, LFMs construct shallow kernels, which can be less expressive than deep kernels. Other works include (Calderhead et al., 2009; Barber and Wang, 2014; Macdonald et al., 2015; Heinonen et al., 2018; Wenk et al., 2020), etc. They mainly focus on estimating parameters/operators in ODEs without latent functions/sources.

6.1. Simulation

We first examined whether PI-DKL can improve extrapolation given the right physics knowledge. We generated two synthetic datasets. The first dataset, 1stODE, was simulated from a first-order ordinary differential equation (ODE), ∂f(t)/∂t + B·f(t) - D = g(t), where B = D = 1, g(t) = sin(2πt) exp(-t), and the initial condition is f(0) = 0.1. We set the time domain t ∈ [0, 1]. We ran the finite difference algorithm (Mitchell and Griffiths, 1980) to obtain an accurate solution. We chose 1,001 equally spaced time points (t_0 = 0, t_1000 = 1) and their solution values as the dataset. The second dataset, 1dDiffusion, was simulated from a diffusion equation with a one-dimensional spatial domain, ∂f(x,t)/∂t - α ∂^2 f(x,t)/∂x^2 = g(x, t), where α = 10, g(x, t) = 0, and the initial condition f(x, 0) is a square wave. We set the domain (x, t) ∈ [0, 1] × [0, 1]. We ran a numerical solver to obtain an accurate solution. Then we discretized the entire spatial and time domain into a 48 × 101 grid with equal spacing in each dimension. We retrieved the grid points and their solution values as our dataset. Competing methods. We compared with (1) shallow kernel learning (SKL) with the SE-ARD kernel, (2) deep kernel learning (DKL), and (3) LFM, which uses SE-ARD for the latent source and then convolves it with the Green's function to obtain the kernel for the target function. To construct a deep kernel, we followed (Wilson et al., 2016a) to feed the input variables into a (deep) neural network (NN) and calculated the RBF kernel over the neural network outputs (see (3)). Across our experiments, we used a 5-layer NN with 20 nodes in each hidden and output layer. We used tanh(·) as the activation function. For our method PI-DKL, we used the same deep kernel for the target function. As in LFM, we used the SE-ARD kernel for the latent source. We set the number of virtual observations m = 10 for the generative component, and uniformly sampled the input locations from the entire domain (see (12)).
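As a rough illustration of how the 1stODE data can be generated, the sketch below integrates ∂f(t)/∂t + B·f(t) - D = g(t) with a simple forward-Euler scheme on the 1,001 time points; the paper uses a finite difference algorithm, and this particular scheme is our assumption:

```python
import numpy as np

# Forward-Euler reconstruction of the 1stODE dataset (a sketch; the exact
# finite difference scheme used in the paper is not specified here)
B, D = 1.0, 1.0
t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]
g = np.sin(2 * np.pi * t) * np.exp(-t)   # the latent source g(t)

f = np.empty_like(t)
f[0] = 0.1                               # initial condition
for i in range(len(t) - 1):
    # df/dt = g(t) - B f + D
    f[i + 1] = f[i] + dt * (g[i] - B * f[i] + D)
```

Pairing the 1,001 points (t_i, f_i) gives the dataset; the training split then keeps only the first 101 points (t ∈ [0, 0.1]).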
We chose the weight of the generative component γ from {0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10}. For both LFM and PI-DKL, the parameters of the differential equations are unknown. All the methods were implemented with TensorFlow (Abadi et al., 2016). For our method, we used ADAM (Kingma and Ba, 2014) for stochastic inference. We ran 10K epochs to ensure convergence. For the other methods, we used L-BFGS for optimization and set the maximum number of iterations to 5K. For 1stODE, we used the first 101 samples (t_i ∈ [0, 0.1]) for training, and the remaining 900 samples (t_i ∈ (0.1, 1]) for testing. We show the posterior distributions of the functions learned by all the methods, along with the ground-truth, in Fig. 1. We can see that the predictions of SKL and DKL are largely biased when the test points are far from the training region [0, 0.1]. On average, DKL obtains better accuracy than SKL. The root-mean-square errors (RMSEs) are {DKL: 0.21, SKL: 0.25}. As a comparison, the posterior means of LFM and PI-DKL are much closer to the ground-truth in the test region, and the RMSEs are {LFM: 0.09, PI-DKL: 0.04}, showing the benefit of the physics. However, LFM is quite unstable in extrapolation: the farther away the test region, the more the prediction fluctuates. By contrast, PI-DKL obtains much smoother curves that are closer to the ground-truth, and smaller posterior variances in the test region. This suggests that the LFM kernel obtained from shallow kernel convolution is less expressive than the regularized deep kernel in PI-DKL. Note that unlike SKL/DKL, both LFM and PI-DKL estimated nontrivial posterior variances (i.e., not extremely close to 0) in the training region, implying that the physics also helps prevent overfitting. Since LFM cannot derive the kernel for the time variable t of a diffusion equation, for a fair comparison on 1dDiffusion we fixed t = 0.5 and used the 48 spatial points as the training inputs.
We then evaluated the posterior distributions of the function values at all the grid points (48 × 101) in the entire domain. We report the absolute difference between the posterior mean and the ground-truth in Fig. 2a-d. We can see that the prediction errors of SKL/DKL are close to 0 (dark colors) in regions close to the training data (t = 0.5). However, when the test points get farther away, say, close to the boundary (t = 0 or 1), the error grows significantly (see the bright colors). Overall, DKL still achieves smaller extrapolation error than SKL, implying an advantage of using more flexible deep kernels. From Fig. 2c, we can see that although LFM misses the time information, it still exhibits better extrapolation than SKL/DKL, showing the benefit of the physics. PI-DKL achieves even smaller prediction error (i.e., darker colors) when t is away from the training time point, exhibiting the best extrapolation performance. The RMSEs of all the methods are {SKL: 0.18, DKL: 0.11, LFM: 0.09, PI-DKL: 0.07}. We also report the predictive standard deviation (PSD) of each method in Fig. 2e-f. We can see that the PSDs of SKL/DKL are both close to 0 in the training region, and quickly increase when the inputs move away (on average DKL shows smaller PSDs and smoother changes). By contrast, LFM and PI-DKL obtain PSDs that are quite uniform across the entire domain and smaller than those of SKL/DKL. This means that the physics knowledge helps inhibit overfitting and reduces the uncertainty in extrapolation. Compared with LFM, PI-DKL obtains even smaller PSDs (darker colors) across the domain, showing even smaller uncertainty in extrapolation.

6.2. Real-World Applications

Metal Pollution in Swiss Jura. Next, we evaluated PI-DKL in real-world applications. We examined the predictive performance in terms of normalized RMSE (nRMSE) and test log-likelihood (LL). Due to the space limit, the test LL results are provided in the supplementary material. We first considered predicting the metal concentration in Swiss Jura. The data were collected from 300 locations in a 14.5 km^2 region (https://rdrr.io/cran/gstat/man/jura.html). The diffusion of the metal concentration is naturally modelled by a diffusion equation over the two-dimensional spatial domain, ∂f(x_1, x_2, t)/∂t = α (∂^2 f(x_1, x_2, t)/∂x_1^2 + ∂^2 f(x_1, x_2, t)/∂x_2^2), where f(·, ·, ·) is the concentration of the metal at a particular location and time point. However, the dataset does not include the time t_s when these concentrations were measured. LFM treats the initial condition f(x_1, x_2, 0) as the latent function and obtains a kernel over the locations, where t_s can be viewed as a kernel parameter learned from data. In our approach, we estimate the solution function at t_s, h(x_1, x_2) = f(x_1, x_2, t_s). Hence, the equation can be viewed as ∂^2 h(x_1, x_2)/∂x_1^2 + ∂^2 h(x_1, x_2)/∂x_2^2 = g(x_1, x_2), where the latent function g(x_1, x_2) = (1/α) ∂f(x_1, x_2, t)/∂t |_{t=t_s}. We were interested in predicting the concentrations of cadmium and copper. The input variables include the coordinates of the location (x_1, x_2), the concentrations of {nickel, zinc} for cadmium, and {lead, nickel, zinc} for copper. For PI-DKL, we selected m from {10, 50, 100, 200, 500} for the generative component and γ from {0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10}. We normalized the training inputs and then sampled the latent inputs Z from N(0, I) in model estimation. For LFM, we varied the number of latent forces over {1, 3, 5}. We randomly selected 50 samples for training and used the remaining 250 samples for testing. We repeated the experiments 5 times, and report the average nRMSE and its standard deviation for each method in Fig. 3a and b.
PI-DKL outperforms all the competing approaches in both prediction tasks. PI-DKL always significantly improves upon SKL and DKL (p < 0.05). In addition, PI-DKL significantly outperforms LFM in predicting cadmium concentration (Fig. 3b). Note that LFM does improve upon SKL in predicting copper concentration (Fig. 3a), but not as significantly as PI-DKL. Motion Capture. We then looked into predicting the trajectories of joints in a motion capture application. To this end, we used the CMU motion capture database (http://mocap.cs.cmu.edu/), from which we used the samples collected from subject 35 in the walk and jog motions, lasting 2,644 seconds. We trained all the models to predict the angles of Joint 60 over time. We used the first-order ODE from the simulation to represent the physical model, based on which we ran LFM and PI-DKL. Note that this physical model might be oversimplified (Alvarez et al., 2009). For LFM, we varied the number of latent forces over {1, 3, 5}. Again, we randomly selected 500 samples for training and 2,000 samples for testing. We repeated the experiments 5 times and report the average nRMSE and its standard deviation in Fig. 3c. As we can see, PI-DKL improves upon all the competing methods by a large margin. Note that LFM is even far worse than SKL. This might be because LFM over-exploits the over-simplified physics, which harms the prediction. By contrast, PI-DKL allows us to tune the number of virtual observations m and the likelihood weight (γ in (12)), and hence can consistently improve upon DKL. PM2.5 in Salt Lake City. Next, we considered predicting the particulate matter (PM2.5) levels across Salt Lake City. The dataset was collected from sensor reads at different times and locations (https://aqandu.org/). We chose the time range from 07/04/2018 to 07/06/2018.
Following (Wang et al., 2018), we used the diffusion equation plus a source term (i.e., the latent function) to represent the physical model, ∂f(x_1, x_2, t)/∂t - α Σ_{j=1}^2 ∂^2 f(x_1, x_2, t)/∂x_j^2 = g(x_1, x_2, t), where f is the concentration level and g the source term. The input variables include both the location coordinates and the detailed time points. Since LFM cannot construct a full kernel of the input variables from the physics, we did not test it, to avoid unfair comparisons. We trained SKL and DKL with both the spatial and time inputs. We randomly selected 500 samples for training and 2,000 samples for testing. We repeated the experiments 5 times and report the average nRMSE and its standard deviation in Fig. 4a. As we can see, with a more expressive kernel, DKL improves upon SKL significantly, and with the incorporation of the physics, PI-DKL in turn outperforms DKL significantly (p < 0.05). Highway Traffic Flow Prediction. Finally, we applied PI-DKL to predict the traffic flow on Interstate 215 in Utah. The Utah Department of Transportation (UDOT) has installed sensors every few miles along the highway. Each sensor counts the number of vehicles passing every minute and sends the data back to a central database. The real-time data and road conditions are available at https://udot.iteris-pems.com/. We used the data collected by 20 sensors installed consecutively along a 30-mile segment, with the time range chosen from 08/05/2019 to 08/11/2019. The input variables include the location coordinates of each sensor and the time of each read. Following (Nagatani, 2000), we used Burgers' equation plus a source term to describe the system, ∂f/∂t + f Σ_{j=1}^2 ∂f/∂x_j - ν Σ_{j=1}^2 ∂^2 f/∂x_j^2 = g(x_1, x_2, t), where f is the traffic flow, ν the unknown viscous coefficient, and g the source term, i.e., the latent function. Note that the equation is nonlinear and we do not have an analytical form of the Green's function.
Hence we cannot use LFM to incorporate the physics to enhance GP training, and we compared with SKL and DKL only. We randomly selected 500 and 2,000 samples for training and testing, respectively, and repeated 5 times. The average nRMSEs and standard deviations are reported in Fig. 4b. As we can see, DKL significantly outperforms SKL, which demonstrates the advantage of the more expressive deep kernel. More importantly, PI-DKL further improves upon DKL, showing that the physics incorporated by our approach indeed promotes the predictive performance.

7. Conclusion

We have presented PI-DKL, a physics informed deep kernel learning model that can flexibly incorporate physics knowledge from incomplete differential equations to improve function learning and uncertainty quantification. In future work, we will extend our model with sparse approximations (Hensman et al., 2013; Wilson et al., 2016b) to exploit physics in large-scale applications.



In computational physics, this posterior function sample is viewed as a surrogate for the solution function of the differential equation.



Figure 1: The posterior distributions of the learned solution functions on 1stODE. The red lines in the middle are the posterior means, and the red dashed lines on the boundary of the shaded region are the posterior means plus/minus one posterior standard deviation. The black line is the ground-truth solution. The training inputs lie in [0, 0.1] (to the left of the green line).

Figure 2: The absolute value of the difference between the posterior mean and the ground-truth (1st row) and posterior standard deviation (2nd row) on 1dDiffusion. The training examples stay on t = 0.5 (the green line).

Figure 4: PM2.5 and traffic flow prediction.

