PHYSICS INFORMED DEEP KERNEL LEARNING

Abstract

Deep kernel learning is a promising combination of deep neural networks and nonparametric function estimation. However, as a data-driven approach, its performance can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL), which exploits physics knowledge represented by differential equations with latent sources. Specifically, we use a posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a simple model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. Our ELBO can also be viewed as an interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification.

1. Introduction

Deep kernel learning (Wilson et al., 2016a) uses deep neural networks to construct kernels for nonparametric function estimation (e.g., Gaussian processes (Williams and Rasmussen, 2006)), unifying the expressive power of neural networks with the self-adaptation of nonparametric function learning. Many applications have shown that deep kernel learning substantially outperforms conventional shallow kernel learning (e.g., with RBF kernels). Compared to standard neural networks, deep kernel learning enjoys closed-form posterior distributions and hence is more convenient for uncertainty quantification and reasoning, which is important for decision making. Nonetheless, as a data-driven approach, its performance can still be restricted by scarce data, especially when the training samples are insufficient to reflect the complexity of the system that produced the data, or when the test points are far away from the training set, i.e., extrapolation.

On the other hand, physics knowledge, expressed as differential equations, is used to build physical models for various science and engineering applications (Lapidus and Pinder, 2011). These models are meant to characterize the underlying mechanism (i.e., the physical processes) that drives the system (e.g., how heat diffuses across the spatial and temporal domains) and are much less restricted by data availability: they can make accurate predictions even without training data, e.g., the landing of Curiosity on Mars and the flight of Voyager 1. Therefore, we consider integrating physics knowledge into deep kernel learning to further improve its prediction and uncertainty quantification, especially for scarce data and extrapolation tasks.

Our work is inspired by the recent Physics Informed Neural Networks (PINNs) (Raissi et al., 2019). However, there are two substantial differences. First, PINNs require the form of the differential equations to be fully specified.
We allow the equations to include unknown latent sources (functions), which is often the case in practice. Second, we integrate the differential equations in a principled Bayesian manner to pursue better calibrated posterior estimations. Specifically, we use a posterior sample of the Gaussian process (GP), which is a random function, as the surrogate of the solution of the differential equation. We then apply the differential operators in the equation to obtain a sample of the latent source (function), to which we assign another GP prior. To ensure the sampling procedure is valid, we use the symmetry of the Gaussian distribution to sample a set of virtual observations {0}, which is computationally equivalent to placing a zero-mean GP prior over the latent source. The sampling procedure constitutes a generative component and ties the physics to the original deep kernel model in a Bayesian hybrid framework (Lasserre et al., 2006). For efficient and high-quality inference, we marginalize out all the latent variables in the joint distribution to avoid approximating their complex posteriors. We then use Jensen's inequality to derive a simple model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. The ELBO can be further interpreted as a soft posterior regularization objective (Ganchev et al., 2010), where the regularization comes from the physics.

For evaluation, we examined our physics informed deep kernel learning (PI-DKL) in both simulations and real-world applications. On synthetic datasets based on two commonly used differential equations, PI-DKL outperforms standard deep kernel learning, shallow kernel learning, and latent force models (LFMs), which incorporate the physics via kernel convolution, in both ground-truth function recovery and prediction uncertainty, especially in the case of extrapolation. We then examined PI-DKL in four real-world applications.
PI-DKL consistently improves upon the competing approaches in prediction error and test log-likelihood. We also applied PI-DKL to a nonlinear differential equation, where LFM is infeasible; there, PI-DKL significantly outperforms standard deep and shallow kernel learning methods.
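To make the generative step concrete, the following is a minimal numpy sketch, not the paper's implementation: the operator L[f] = f' - f'' is a hypothetical stand-in for the actual differential equation, the derivatives are taken by finite differences rather than exactly, and the sine curve is a stand-in for a genuine GP posterior sample.

```python
import numpy as np

def apply_operator(f_sample, x):
    # L[f](x) = f'(x) - f''(x): a hypothetical linear differential operator
    # standing in for the equation's operators; derivatives are approximated
    # with central finite differences (assumes a uniform grid).
    h = x[1] - x[0]
    df = np.gradient(f_sample, h)
    d2f = np.gradient(df, h)
    return df - d2f

# A posterior GP sample evaluated on a grid serves as the surrogate solution;
# applying L yields a sample of the latent source g = L[f]. In the model, g is
# then tied to a zero-mean GP prior via the virtual observations {0}.
x = np.linspace(0.0, 1.0, 200)
f_sample = np.sin(2 * np.pi * x)   # stand-in for a GP posterior sample
g_sample = apply_operator(f_sample, x)
```

The key point is that the operator is applied to a sampled function, so the latent source inherits the randomness of the GP posterior.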

2. Background

Gaussian Process and Kernel Learning. The Gaussian process (GP) is the most commonly used nonparametric function prior for kernel learning. Suppose we aim to learn a function f : R^d → R from a training set D = (X, y), where X = [x_1, …, x_N]^⊤, y = [y_1, …, y_N]^⊤, each x_n is a d-dimensional input vector, and y_n is the observed output. To avoid under-fitting and over-fitting, we do not assume any parametric form for f; instead, we want the complexity of f(·) to adapt automatically to the data. To this end, we introduce a kernel function k(·,·) that measures the similarity of function values in terms of their inputs; the kernel only brings in a smoothness assumption about the target function. For example, the commonly used RBF kernel, k_RBF(x_i, x_j) = exp(-||x_i - x_j||^2 / η), implies the function is infinitely differentiable. We then use the kernel to construct a GP prior, f ∼ GP(m(·), k(·,·)), where m(·) is the mean function, usually set to the constant 0. By the definition of a GP, the finite projection of f(·) on the training inputs X, namely f = [f(x_1), …, f(x_N)]^⊤, follows a multivariate Gaussian distribution, p(f|X) = N(f|0, K), where K is the kernel matrix on X with [K]_{i,j} = k(x_i, x_j). Given the function values f, the observed outputs y are sampled from a noise model. For example, when y is continuous, we can use the isotropic Gaussian noise model, p(y|f) = N(y|f, τ^{-1} I), where τ is the inverse variance. We can then integrate out f to obtain the marginal likelihood,

p(y|X) = N(y|0, K + τ^{-1} I).    (1)

To learn the model, we maximize this likelihood with respect to the kernel parameters and the inverse variance τ.
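As a concrete illustration, the RBF kernel and the log of the marginal likelihood in (1) can be computed as follows. This is a minimal numpy sketch; the bandwidth η and inverse noise variance τ are placeholder values, whereas in practice they are learned by maximizing (1).

```python
import numpy as np

def rbf_kernel(X1, X2, eta=1.0):
    # k_RBF(x_i, x_j) = exp(-||x_i - x_j||^2 / eta)
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sq / eta)

def gp_log_marginal_likelihood(X, y, eta=1.0, tau=100.0):
    # log N(y | 0, K + tau^{-1} I), i.e., the log of Eq. (1),
    # computed stably via a Cholesky factorization.
    N = X.shape[0]
    C = rbf_kernel(X, X, eta) + (1.0 / tau) * np.eye(N)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # -0.5 * log|C|
            - 0.5 * N * np.log(2.0 * np.pi))
```

Maximizing this quantity over (η, τ) with any gradient-based optimizer recovers the kernel-learning procedure described above.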
According to the GP prior, given a new input x_*, the posterior (or predictive) distribution of the output f(x_*) is a conditional Gaussian,

p(f(x_*)|x_*, X, y) = N(f(x_*)|µ(x_*), v(x_*)),    (2)

where µ(x_*) = k_*^⊤ (K + τ^{-1} I)^{-1} y, v(x_*) = k(x_*, x_*) - k_*^⊤ (K + τ^{-1} I)^{-1} k_*, and k_* = [k(x_*, x_1), …, k(x_*, x_N)]^⊤.

Deep Kernel Learning. While GP priors with shallow kernels (e.g., RBF and Matérn) have achieved great success in many applications, such shallow structures can limit the expressiveness of the kernel when estimating highly complicated functions, e.g., those with sharp discontinuities or high curvature. To address this problem, Wilson et al. (2016a) proposed constructing deep kernels with neural networks. Specifically, a shallow kernel is first chosen as the base kernel; each input is fed into a neural network (NN), and the NN outputs are then fed into the base kernel to compute the final kernel value. Taking RBF as the base kernel, we can construct a deep kernel by

k_DEEP(x_i, x_j) = k_RBF(NN(x_i), NN(x_j)).    (3)

Note that the NN weights now become kernel parameters. We can then use the deep kernel to construct a GP prior for nonparametric function estimation. The likelihood and predictive distribution have the same forms as in (1) and (2).
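The deep kernel in (3) and the predictive distribution in (2) can be sketched together in numpy. This is an illustration only: the tiny MLP's weights are fixed random values standing in for NN(·), whereas in deep kernel learning they are learned jointly with η and τ by maximizing (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed MLP feature map standing in for NN(.); the weights here are
# illustrative placeholders, in practice learned jointly with eta and tau.
W1, b1 = rng.normal(size=(1, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def nn(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

def deep_kernel(X1, X2, eta=1.0):
    # k_DEEP(x_i, x_j) = k_RBF(NN(x_i), NN(x_j)), Eq. (3)
    Z1, Z2 = nn(X1), nn(X2)
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-sq / eta)

def gp_predict(X, y, Xs, tau=100.0):
    # Eq. (2): mu(x_*) = k_*^T (K + tau^{-1} I)^{-1} y
    #          v(x_*)  = k(x_*, x_*) - k_*^T (K + tau^{-1} I)^{-1} k_*
    K = deep_kernel(X, X) + (1.0 / tau) * np.eye(len(X))
    Ks = deep_kernel(X, Xs)            # N x N_* cross-kernel matrix
    Kss = deep_kernel(Xs, Xs)
    sol = np.linalg.solve(K, Ks)       # (K + tau^{-1} I)^{-1} k_*
    mu = sol.T @ y
    v = np.diag(Kss - Ks.T @ sol)
    return mu, v
```

Swapping `deep_kernel` for `rbf_kernel` recovers the shallow-kernel predictive distribution, which is why the likelihood and predictions keep the same form as in (1) and (2).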

3. Model

By using deep neural networks to construct highly expressive kernels, deep kernel learning greatly enhances the capability of estimating complicated functions, while inheriting the self-adaptation of nonparametric function learning and convenient posterior inference. However, as a purely

