DEEP ENSEMBLE KERNEL LEARNING

Abstract

Gaussian processes (GPs) are nonparametric Bayesian models that are both flexible and robust to overfitting. One of the main challenges of GP methods is selecting the kernel. In the deep kernel learning (DKL) paradigm, a deep neural network or "feature network" is used to map inputs into a latent feature space, where a GP with a "base kernel" acts; the resulting model is then trained in an end-to-end fashion. In this work, we introduce the "deep ensemble kernel learning" (DEKL) model, which is a special case of DKL. In DEKL, a linear base kernel is used, enabling exact optimization of the base kernel hyperparameters and a scalable inference method that does not require approximation by inducing points. We also represent the feature network as a concatenation of an ensemble of learner networks with a common architecture, allowing for easy model parallelism. We show that DEKL is able to approximate any kernel if the number of learners in the ensemble is arbitrarily large. Comparing the DEKL model to DKL and deep ensemble (DE) baselines on both synthetic and real-world regression tasks, we find that DEKL often outperforms both baselines in terms of predictive performance and that the DEKL learners tend to be more diverse (i.e., less correlated with one another) compared to the DE learners.

1. INTRODUCTION

In recent years, there has been growing interest in Bayesian deep learning (DL), where the point predictions of traditional deep neural network (DNN) models are replaced with full predictive distributions using Bayes' rule (Neal, 2012; Wilson, 2020). The advantages of Bayesian DL over traditional DL are numerous and include greater robustness to overfitting and better-calibrated uncertainty quantification (Guo et al., 2017; Kendall & Gal, 2017). Furthermore, the success of traditional DL already rests on a number of probabilistic elements, such as stochastic gradient descent (SGD), dropout, and weight initialization, all of which have been given Bayesian interpretations (Smith & Le, 2018; Gal & Ghahramani, 2016; Kingma et al., 2015; Schoenholz et al., 2016; Jacot et al., 2018), so insights into Bayesian DL may help to advance DL as a whole.

Gaussian processes (GPs) are nonparametric Bayesian models with appealing properties: they admit exact inference for regression and allow for a natural functional perspective suitable for predictive modeling (Rasmussen & Williams, 2005). While at first glance GPs appear unrelated to DL models, a number of interesting connections between GPs and DNNs exist in the literature, suggesting that GPs can constitute a valid approach to Bayesian DL (Neal, 1996; Lee et al., 2018; de Matthews et al., 2018; Jacot et al., 2018; Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017; Agrawal et al., 2020). A GP prior is typically characterized by its covariance function or "kernel", which determines the class of functions the GP can model, as well as its generalization properties outside the training data. Kernel selection is the primary problem in GP modeling, and unfortunately traditional kernels such as the radial basis function (RBF) kernel are not sufficiently expressive for complex problems where more flexible models such as DNNs generally perform well.
This is the key motivation for kernel learning, which refers to the selection of an optimal kernel out of a family of kernels in a data-driven way. A number of approaches to kernel learning exist in the literature, including some that parameterize kernels using DNNs (Zhou et al., 2019; Li et al., 2019; Bullins et al., 2018; Sinha & Duchi, 2016). As these approaches involve learning feature representations, they are fundamentally different from random-feature methods for efficient kernel representation (Rahimi & Recht, 2007; 2008). However, these approaches are not specific to GPs and do not take advantage of a robust Bayesian framework. In contrast, the deep kernel learning (DKL) paradigm does exactly this: in DKL, a DNN is used as a feature extractor that maps data inputs into a latent feature space, where GP inference with some "base kernel" is then performed (Wilson et al., 2016b;a; Jean et al., 2016; Al-Shedivat et al., 2017; Bradshaw et al., 2017; Izmailov et al., 2018; Xuan et al., 2018). The resulting model is then trained end-to-end using standard gradient-based optimization, usually in a variational framework. We note that the DKL model is simply a GP with a highly flexible kernel parameterized by a DNN. By optimizing all hyperparameters (including the DNN weights) with type II maximum likelihood estimation, the DKL model is able to learn an optimal kernel in a manner directly informed by the data, while also taking advantage of the robustness granted by the Bayesian framework. A special case of DKL worthy of note was considered by Dasgupta et al. (2018), who use a linear base kernel and impose a soft orthogonality constraint to learn the eigenfunctions of a kernel. Although similar in spirit to the approach in this paper, their method does not make use of an efficient variational method, nor does it permit distributed training, since all of the basis functions are derived from the same feature network.
In this work, we introduce the "deep ensemble kernel learning" (DEKL) model, a simpler and more efficient special case of DKL with two specifications: the base kernel is linear, and the feature network is partitioned into an "ensemble" of "learners" with a common network architecture. In contrast to nonlinear kernels, the linear kernel allows us to derive an efficient training and inference method for DEKL that circumvents the inducing-point approximation commonly used in traditional DKL. The hyperparameters of the linear kernel can also be optimized in closed form, allowing us to simplify the loss function considerably. Convenience aside, we show that DEKL remains highly expressive, proving that it is universal in the sense that it can approximate any continuous kernel so long as its feature network is arbitrarily wide. In other words, we may keep the base kernel simple if we are willing to let the feature network be more complex. The second specification of DEKL lets us handle the complexity of the feature network: because the feature network is partitioned, it admits easy model parallelism, where the learners in the ensemble are distributed. Moreover, our universality result only requires the number of learners to be arbitrarily large; the learners themselves need not grow (meaning fixed-capacity learners are sufficient), avoiding additional model parallelism. From a different perspective, DEKL may be regarded as an extension of traditional ensembling methods for DNNs, in particular the deep ensemble (DE) model of Lakshminarayanan et al. (2017), which is also highly parallelizable. In a DE, each DNN learner parameterizes a distribution over the variates (e.g., the mean and variance of a Gaussian in regression, or the logits of a softmax vector in classification).
Each learner is trained independently with maximum likelihood estimation, and the final predictive distribution of the DE is then defined to be a uniform mixture of the individual learners' predictive distributions. Although not Bayesian itself, the DE model boasts impressive predictive performance and was shown to outperform Bayesian methods such as probabilistic backpropagation (Hernández-Lobato & Adams, 2015) and MC-dropout (Gal & Ghahramani, 2016). In contrast, in DEKL, the learners are trained jointly via a shared linear GP layer. We surmise that this may help to promote diversity (i.e., low correlation) among the learners by facilitating coordination, which we verify experimentally. Unlike non-Bayesian joint ensemble training methods such as that of Webb et al. (2019), we hypothesize that the DEKL learners might learn to diversify in order to better approximate the posterior covariance, an inherently Bayesian feature. We therefore expect DEKL to be more efficient than DKL and more robust than DE, by drawing on the strengths of both (see Fig. 1 for a comparison of model architectures).
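To make the two DEKL specifications concrete, the following is a minimal numerical sketch (ours, not the paper's code): stand-in "learners" whose outputs are concatenated into the feature map ϕ, a linear base kernel k(x1, x2) = ϕ(x1)ᵀϕ(x2), and the resulting exact GP regression, which reduces to Bayesian linear regression in feature space. The toy networks, sizes, and noise level are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the DEKL structure. Each "learner" maps inputs to a
# few features; the feature network is their concatenation, and the base
# kernel is linear: k(x1, x2) = phi(x1)^T phi(x2).
rng = np.random.default_rng(0)

def make_learner(d_in, d_out):
    """A stand-in learner: a random single-layer tanh network."""
    W, b = rng.normal(size=(d_in, d_out)), rng.normal(size=d_out)
    return lambda X: np.tanh(X @ W + b)

learners = [make_learner(d_in=1, d_out=4) for _ in range(3)]  # 3 learners

def phi(X):
    # Concatenated learner outputs = feature map of the linear deep kernel.
    return np.concatenate([f(X) for f in learners], axis=1)

# With a linear kernel, exact GP regression reduces to Bayesian linear
# regression in feature space: the cost is cubic in the feature dimension M
# rather than the number of data points N, so no inducing points are needed.
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

F = phi(X)                                  # N x M design matrix (M = 12)
noise = 0.1 ** 2
A = F.T @ F + noise * np.eye(F.shape[1])    # M x M, cheap when M << N
w_mean = np.linalg.solve(A, F.T @ y)        # posterior mean weights

def predict(X_new):
    return phi(X_new) @ w_mean              # posterior predictive mean
```

Because only the M x M matrix A is ever factorized, the learners can in principle be evaluated on separate devices and only their M-dimensional features gathered, which is the model parallelism the partitioned feature network affords.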

2. DEEP ENSEMBLE KERNEL LEARNING

A DKL model is a GP whose kernel encapsulates a DNN for feature extraction (Wilson et al., 2016b;a). A deep kernel is defined as K_deep(x_1, x_2; θ, γ) = K_base(ϕ(x_1; θ), ϕ(x_2; θ); γ), where ϕ(·; θ) is a DNN with weight parameters θ and K_base(·, ·; γ) is any chosen kernel, called the "base kernel", with hyperparameters γ. Note that the kernel hyperparameters of K_deep include all hyperparameters γ of the base kernel K_base as well as the DNN weight parameters θ. Given the expressive power of DNNs, the deep kernel is also highly expressive and may be viewed as a method to automatically select a GP model.
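The definition above composes directly into code. The following sketch (ours; the toy two-layer network and the choice of an RBF base kernel are illustrative assumptions) computes a deep-kernel Gram matrix by first mapping inputs through ϕ(·; θ) and then applying K_base in the latent feature space:

```python
import numpy as np

# Sketch of K_deep(x1, x2; theta, gamma) = K_base(phi(x1; theta), phi(x2; theta); gamma)
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 3))  # theta (toy weights)

def feature_net(X):
    """phi(.; theta): a toy two-layer ReLU network standing in for the DNN."""
    return np.maximum(X @ W1, 0.0) @ W2

def rbf_base_kernel(Z1, Z2, lengthscale=1.0):
    """K_base(., .; gamma): RBF kernel with hyperparameter gamma = lengthscale."""
    sq_dists = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def deep_kernel(X1, X2):
    # Map inputs to the latent feature space, then apply the base kernel there.
    return rbf_base_kernel(feature_net(X1), feature_net(X2))

X = rng.normal(size=(5, 2))
K = deep_kernel(X, X)  # 5 x 5 Gram matrix: symmetric, positive semi-definite
```

In DKL proper, the weights θ and the base-kernel hyperparameters γ would all be trained jointly by maximizing the GP marginal likelihood; here they are fixed only to show the composition.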

