DEEP KERNEL PROCESSES

Abstract

We define deep kernel processes in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed, and, as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix; we do not need knowledge of the underlying features. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly-stochastic inducing-point variational inference scheme that operates on the Gram matrices, not on the features as in DGPs. We show that the deep inverse Wishart process gives superior performance to DGPs and infinite BNNs on standard fully-connected baselines.

1. INTRODUCTION

The deep learning revolution has shown us that effective performance on difficult tasks such as image classification (Krizhevsky et al., 2012) requires deep models with flexible lower layers that learn task-dependent representations. Here, we consider whether these insights from the neural network literature can be applied to purely kernel-based methods. (Note that we do not consider deep Gaussian processes or DGPs to be "fully kernel-based", as they use a feature-based representation in intermediate layers.) Importantly, deep kernel methods (e.g. Cho & Saul, 2009) already exist. In these methods, which are closely related to infinite Bayesian neural networks (Lee et al., 2017; Matthews et al., 2018; Garriga-Alonso et al., 2018; Novak et al., 2018), we take an initial kernel (usually the dot product of the input features) and perform a series of deterministic, parameter-free transformations to obtain an output kernel that we use in e.g. a support vector machine or Gaussian process. However, the deterministic, parameter-free nature of the transformation from input to output kernel means that these methods lack the capability to learn a top-layer representation, which is believed to be crucial for the effectiveness of deep methods (Aitchison, 2019). To obtain the flexibility necessary to learn a task-dependent representation, we propose deep kernel processes (DKPs), which combine nonlinear transformations of the kernel, as in Cho & Saul (2009), with a flexible learned representation obtained by exploiting a Wishart or inverse Wishart process (Dawid, 1981; Shah et al., 2014). We find that models ranging from DGPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) to Bayesian neural networks (BNNs; Blundell et al., 2015) (App. C.1), infinite BNNs (App. C.2), and infinite BNNs with bottlenecks (App. C.3) can be written as DKPs, i.e. purely in terms of kernel/Gram matrices, without needing features or weights.
Practically, we find that the deep inverse Wishart process (DIWP) admits convenient forms for variational approximate posteriors. We give a novel scheme for doubly-stochastic variational inference (DSVI) with inducing points purely in the kernel domain (as opposed to Salimbeni & Deisenroth, 2017, who described DSVI for standard feature-based DGPs), and demonstrate improved performance with carefully matched models on fully-connected benchmark datasets.

2. BACKGROUND

We briefly review Wishart and inverse Wishart distributions. The Wishart distribution is a multivariate generalization of the gamma distribution, defined over positive semidefinite matrices. Suppose that we have a collection of P-dimensional random variables x_i with i ∈ {1, . . . , N} such that x_i ~iid N(0, V). Then S = \sum_{i=1}^N x_i x_i^T ∼ W(V, N) has a Wishart distribution with scale matrix V and N degrees of freedom. When N > P − 1, the density is

\mathcal{W}(S; V, N) = \frac{|S|^{(N-P-1)/2}}{2^{NP/2} |V|^{N/2} \Gamma_P(N/2)} \exp\left(-\tfrac{1}{2} \operatorname{Tr}\left(V^{-1} S\right)\right),

where Γ_P is the multivariate gamma function. Further, the inverse, S^{-1}, has an inverse Wishart distribution, W^{-1}(V^{-1}, N). The inverse Wishart is defined only for N > P − 1 and also has a closed-form density. Finally, we note that the Wishart distribution has mean NV, while the inverse Wishart W^{-1}(V^{-1}, N) has mean V^{-1}/(N − P − 1) (defined for N > P + 1).
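These properties can be checked numerically. The following sketch (assuming NumPy and SciPy, whose `wishart`/`invwishart` use the same scale/degrees-of-freedom parameterisation as the text; variable names and sample sizes are our choices) verifies that a sum of N outer products of N(0, V) vectors has mean NV, and that inverting Wishart draws gives the inverse Wishart mean V^{-1}/(N − P − 1):

```python
# Illustrative sketch; Monte Carlo checks of the Wishart/inverse-Wishart means.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
P, N = 3, 10                        # dimension P, degrees of freedom N > P - 1
A = rng.normal(size=(P, P))
V = A @ A.T + P * np.eye(P)         # a positive definite scale matrix

# S = sum_i x_i x_i^T with x_i ~ N(0, V) is W(V, N), so E[S] = N * V.
X = rng.multivariate_normal(np.zeros(P), V, size=(100_000, N))
S_mean = np.einsum('snp,snq->pq', X, X) / 100_000
assert np.allclose(S_mean, N * V, rtol=0.05, atol=0.2)

# If S ~ W(V, N), then S^{-1} ~ W^{-1}(V^{-1}, N), with mean V^{-1}/(N - P - 1).
Vinv = np.linalg.inv(V)
S = wishart.rvs(df=N, scale=V, size=50_000, random_state=rng)
Sinv_mean = np.linalg.inv(S).mean(axis=0)
assert np.allclose(Sinv_mean, Vinv / (N - P - 1), rtol=0.05, atol=0.005)
```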

3. DEEP KERNEL PROCESSES

We define a kernel process to be a set of distributions over positive definite matrices of different sizes that are consistent under marginalisation (Dawid, 1981; Shah et al., 2014). The two most common kernel processes are the Wishart process and the inverse Wishart process, which we write in a slightly unusual form to ensure their expectation is K. We take G and G* to be finite-dimensional marginals of the underlying Wishart or inverse Wishart process,

G ∼ W(K/N, N)  or  G ∼ W^{-1}(δK, δ + (P + 1)),

with consistent marginals

G* ∼ W(K*/N, N)  or  G* ∼ W^{-1}(δK*, δ + (P* + 1)),

where K* and G* are the P* × P* principal submatrices of the P × P matrices K and G, obtained by dropping the same rows and columns from each. In the inverse Wishart distribution, δ is a positive parameter that can be understood as controlling the degree of variability, with larger values of δ implying smaller variability in G. We define a deep kernel process by analogy with a DGP, as a composition of kernel processes, and show in App. A that under sensible assumptions any such composition is itself a kernel process.

In a DGP, the Gram matrix at layer ℓ is G^ℓ = (1/N_ℓ) F^ℓ F^{ℓT}. Here, F^ℓ ∈ R^{P×N_ℓ} are the N_ℓ hidden features in layer ℓ; λ indexes hidden features, so f^ℓ_λ is a single column of F^ℓ, representing the value of the λth feature for all training inputs. Note that K(•) is a
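As a numerical illustration (a sketch assuming SciPy's `wishart`/`invwishart`; the constants are our choices), both parameterisations above indeed have expectation K, and principal submatrices of the Wishart draws match the stated marginal W(K*/N, N), e.g. in their mean:

```python
# Illustrative sketch; checks E[G] = K and marginalisation consistency.
import numpy as np
from scipy.stats import wishart, invwishart

rng = np.random.default_rng(1)
P, N, delta = 4, 6, 5.0
A = rng.normal(size=(P, P))
K = A @ A.T + P * np.eye(P)          # a positive definite "target" matrix

# G ~ W(K/N, N): mean is N * (K/N) = K.
G = wishart.rvs(df=N, scale=K / N, size=50_000, random_state=rng)
assert np.allclose(G.mean(axis=0), K, rtol=0.05, atol=0.2)

# G ~ W^{-1}(delta*K, delta + P + 1): mean is delta*K / (delta + P + 1 - P - 1) = K.
G_iw = invwishart.rvs(df=delta + P + 1, scale=delta * K, size=50_000,
                      random_state=rng)
assert np.allclose(G_iw.mean(axis=0), K, rtol=0.05, atol=0.2)

# Marginalisation consistency: dropping the same row and column of each
# Wishart draw leaves samples whose mean matches the submatrix scale K*.
idx = np.array([0, 2, 3])
K_star = K[np.ix_(idx, idx)]
G_star = G[:, idx][:, :, idx]
assert np.allclose(G_star.mean(axis=0), K_star, rtol=0.05, atol=0.2)
```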



Note that we leave the question of the full Kolmogorov extension theorem (Kolmogorov, 1933) for matrices to future work: for our purposes, it is sufficient to work with very large but ultimately finite input spaces, since in practice the input vectors are represented by elements of the finite set of 32-bit or 64-bit floating-point numbers (Sterbenz, 1974).



Figure 1: Generative models for two-layer (L = 2) deep GPs. (Top) Generative model for a deep GP, with a kernel that depends on the Gram matrix, and with Gaussian-distributed features. (Bottom) Integrating out the features, the Gram matrices become Wishart distributed.
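The Fig. 1 (bottom) generative model can be sketched directly on Gram matrices, with no features anywhere. The code below is illustrative only: the squared-exponential kernel, unit lengthscale, and constant layer width N = 50 are our assumed choices. The key point is that an isotropic kernel needs only G, since ||x_i − x_j||² = G_ii + G_jj − 2G_ij:

```python
# Illustrative sketch of a deep Wishart process: G^l ~ W(K(G^{l-1})/N, N).
import numpy as np
from scipy.stats import wishart

def se_kernel_from_gram(G, lengthscale=1.0):
    """Isotropic squared-exponential kernel computed from a Gram matrix alone,
    using ||x_i - x_j||^2 = G_ii + G_jj - 2 G_ij (no features needed)."""
    d = np.diag(G)
    sq_dists = d[:, None] + d[None, :] - 2.0 * G
    return np.exp(-sq_dists / (2.0 * lengthscale**2))

def deep_wishart_sample(X, L=2, N=50, jitter=1e-6, rng=None):
    """Propagate the input Gram matrix through L Wishart layers."""
    rng = rng if rng is not None else np.random.default_rng()
    P = X.shape[0]
    G = X @ X.T / X.shape[1]                   # input Gram matrix G^0
    for _ in range(L):
        K = se_kernel_from_gram(G) + jitter * np.eye(P)
        G = wishart.rvs(df=N, scale=K / N, random_state=rng)
    return G                                    # final P x P Gram matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # P = 5 inputs, N_0 = 3 features
G_out = deep_wishart_sample(X, rng=rng)
assert G_out.shape == (5, 5)
assert np.allclose(G_out, G_out.T)              # symmetric...
assert np.min(np.linalg.eigvalsh(G_out)) > -1e-8  # ...and positive semidefinite
```

Note the layer widths N_ℓ play the role of the Wishart degrees of freedom, mirroring the correspondence between DGP hidden features and Wishart draws described in the text.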

3.1 DEEP GPS WITH ISOTROPIC KERNELS ARE DEEP WISHART PROCESSES

We consider deep GPs of the form shown in Fig. 1 (top), with X ∈ R^{P×N_0}

