DEEP KERNEL PROCESSES

Abstract

We define deep kernel processes in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed, and, as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix: we do not need knowledge of the underlying features. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly-stochastic inducing-point variational inference scheme that operates on the Gram matrices rather than on the features, as in DGPs. We show that the deep inverse Wishart process gives superior performance to DGPs and infinite BNNs on standard fully-connected baselines.

1. INTRODUCTION

The deep learning revolution has shown us that effective performance on difficult tasks such as image classification (Krizhevsky et al., 2012) requires deep models with flexible lower layers that learn task-dependent representations. Here, we consider whether these insights from the neural network literature can be applied to purely kernel-based methods. (Note that we do not consider deep Gaussian processes, or DGPs, to be "fully kernel-based", as they use a feature-based representation in intermediate layers.) Importantly, deep kernel methods (e.g. Cho & Saul, 2009) already exist. In these methods, which are closely related to infinite Bayesian neural networks (Lee et al., 2017; Matthews et al., 2018; Garriga-Alonso et al., 2018; Novak et al., 2018), we take an initial kernel (usually the dot product of the input features) and perform a series of deterministic, parameter-free transformations to obtain an output kernel that we use in e.g. a support vector machine or Gaussian process. However, the deterministic, parameter-free nature of the transformation from input to output kernel means that these methods lack the capability to learn a top-layer representation, which is believed to be crucial for the effectiveness of deep methods (Aitchison, 2019). To obtain the flexibility necessary to learn a task-dependent representation, we propose deep kernel processes (DKPs), which combine nonlinear transformations of the kernel, as in Cho & Saul (2009), with a flexible learned representation, by exploiting a Wishart or inverse Wishart process (Dawid, 1981; Shah et al., 2014). We find that models ranging from DGPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) to Bayesian neural networks (BNNs; Blundell et al., 2015; App. C.1), infinite BNNs (App. C.2) and infinite BNNs with bottlenecks (App. C.3) can be written as DKPs (i.e. purely in terms of kernel/Gram matrices, without needing features or weights).
Practically, we find that the deep inverse Wishart process (DIWP) admits convenient forms for variational approximate posteriors. We give a novel scheme for doubly-stochastic variational inference (DSVI) with inducing points purely in the kernel domain (as opposed to Salimbeni & Deisenroth, 2017, who described DSVI for standard feature-based DGPs), and demonstrate improved performance with carefully matched models on fully-connected benchmark datasets.

2. BACKGROUND

We briefly revise Wishart and inverse Wishart distributions. The Wishart distribution is a generalization of the gamma distribution that is defined over positive semidefinite matrices. Suppose that we have a collection of P-dimensional random variables x_i with i ∈ {1, . . . , N} such that x_i ~iid N(0, V). Then

S = Σ_{i=1}^N x_i x_i^T ∼ W(V, N)   (1)

has Wishart distribution with scale matrix V and N degrees of freedom. When N > P − 1, the density is

W(S; V, N) = \frac{|S|^{(N-P-1)/2}}{2^{NP/2} |V|^{N/2} \Gamma_P(N/2)} \exp(-\tfrac{1}{2} \operatorname{Tr}(V^{-1} S)),

where Γ_P is the multivariate gamma function. Further, the inverse, S^{-1}, has inverse Wishart distribution, W^{-1}(V^{-1}, N). The inverse Wishart is defined only for N > P − 1 and also has a closed-form density. Finally, we note that the Wishart distribution has mean N V, while the inverse Wishart has mean V^{-1}/(N − P − 1) (for N > P + 1).

Figure 1: Generative models for two-layer (L = 2) deep GPs. (Top) Generative model for a deep GP, with a kernel that depends on the Gram matrix, and with Gaussian-distributed features: X → K^1 → F^1 → G^1 → K^2 → F^2 → G^2 → K^3 → F^3 → Y. (Bottom) Integrating out the features, the Gram matrices become Wishart distributed: X → G^1 → G^2 → F^3 → Y.
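As a concrete sanity check of these definitions, the construction and both moments can be verified by Monte Carlo (a sketch; the sizes P = 3, N = 50 and the scale matrix V are hypothetical, using scipy's `wishart`):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
P, N = 3, 50
A = rng.standard_normal((P, P))
V = A @ A.T + P * np.eye(P)          # a positive definite scale matrix

# Construct S = sum_i x_i x_i^T with x_i ~ N(0, V); by definition S ~ W(V, N).
L = np.linalg.cholesky(V)
X = L @ rng.standard_normal((P, N))  # columns are iid N(0, V) draws
S = X @ X.T                          # one Wishart sample

# Check the moments: E[S] = N V and E[S^{-1}] = V^{-1} / (N - P - 1).
samples = wishart(df=N, scale=V).rvs(size=20000, random_state=rng)
rel_err = lambda M, T: np.linalg.norm(M - T) / np.linalg.norm(T)
assert rel_err(samples.mean(axis=0), N * V) < 0.05
inv_mean = np.linalg.inv(samples).mean(axis=0)
assert rel_err(inv_mean, np.linalg.inv(V) / (N - P - 1)) < 0.05
```
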

3. DEEP KERNEL PROCESSES

We define a kernel process to be a set of distributions over positive definite matrices of different sizes that are consistent under marginalisation (Dawid, 1981; Shah et al., 2014). The two most common kernel processes are the Wishart process and the inverse Wishart process, which we write in a slightly unusual form to ensure that their expectation is K. We take G and G′ to be finite-dimensional marginals of the underlying Wishart and inverse Wishart processes,

G ∼ W(K/N, N),      G′ ∼ W^{-1}(δK, δ + (P + 1)),
G* ∼ W(K*/N, N),    G′* ∼ W^{-1}(δK*, δ + (P* + 1)),

where we explicitly give the consistent marginal distributions over K*, G* and G′*, which are the P* × P* principal submatrices of the P × P matrices K, G and G′, formed by dropping the same rows and columns from each. In the inverse Wishart distribution, δ is a positive parameter that can be understood as controlling the degree of variability, with larger values of δ implying smaller variability in G′. We define a deep kernel process by analogy with a DGP, as a composition of kernel processes, and show in App. A that under sensible assumptions any such composition is itself a kernel process.
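The role of these parameterisations, both with expectation K and with δ controlling variability in the inverse Wishart case, can be checked numerically (a sketch; the matrix K and all sizes are hypothetical):

```python
import numpy as np
from scipy.stats import wishart, invwishart

rng = np.random.default_rng(1)
P, N, delta = 4, 20, 4.0
A = rng.standard_normal((P, P))
K = A @ A.T + np.eye(P)              # a hypothetical kernel matrix

G = wishart(df=N, scale=K / N).rvs(size=30000, random_state=rng)
Gp = invwishart(df=delta + (P + 1), scale=delta * K).rvs(size=30000, random_state=rng)
rel_err = lambda M: np.linalg.norm(M - K) / np.linalg.norm(K)
assert rel_err(G.mean(axis=0)) < 0.05   # E[G]  = N (K/N)    = K
assert rel_err(Gp.mean(axis=0)) < 0.05  # E[G'] = dK / d      = K

# Larger delta gives smaller variability around K in the inverse Wishart.
Gp_big = invwishart(df=100.0 + (P + 1), scale=100.0 * K).rvs(size=30000, random_state=rng)
assert Gp_big.std(axis=0).mean() < Gp.std(axis=0).mean()
```
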

3.1. DGPS WITH ISOTROPIC KERNELS ARE DEEP WISHART PROCESSES

We consider deep GPs of the form (Fig. 1 top), with X ∈ R^{P×N_0},

K^ℓ = (1/N_0) X X^T   for ℓ = 1,   and   K^ℓ = K(G^{ℓ-1})   otherwise,   (4a)
P(F^ℓ | K^ℓ) = ∏_{λ=1}^{N_ℓ} N(f^ℓ_λ; 0, K^ℓ),   (4b)
G^ℓ = (1/N_ℓ) F^ℓ F^{ℓT}.   (4c)

Here, F^ℓ ∈ R^{P×N_ℓ} are the N_ℓ hidden features in layer ℓ; λ indexes hidden features, so f^ℓ_λ is a single column of F^ℓ, representing the value of the λth feature for all training inputs. Note that K(·) is a function that takes a Gram matrix and returns a kernel matrix, whereas K^ℓ is a (possibly random) variable representing a kernel matrix. Note also that we have restricted ourselves to kernels that can be written as functions of the Gram matrix, G^ℓ, and do not require the full set of activations, F^ℓ. As we describe later, this is not too restrictive, as it includes, amongst others, all isotropic kernels (i.e. those that can be written as a function of the distance between points; Williams & Rasmussen, 2006). Note that we have a number of choices as to how to initialize the kernel in Eq. (4a). The current choice just uses a linear dot-product kernel, rather than immediately applying the kernel function K. This is both to ensure exact equivalence with infinite NNs with bottlenecks (App. C.3) and to highlight an interesting interpretation of this layer as Bayesian inference over generalised lengthscale hyperparameters in the squared-exponential kernel (App. B; e.g. Lalchand & Rasmussen, 2020). For DGP regression, the outputs, Y, are most commonly given by a likelihood that can be written in terms of the output features, F^{L+1}. For instance, for regression, the distribution of the λth output column could be

P(y_λ | F^{L+1}) = N(y_λ; f^{L+1}_λ, σ² I),

but our methods can be used with many other forms of likelihood, including e.g. classification. The generative process for the Gram matrices, G^ℓ, consists of generating samples from a Gaussian distribution (Eq. 4b) and taking their product with themselves transposed (Eq. 4c). This exactly matches the generative process for a Wishart distribution (Eq. 1), so we can write the Gram matrices, G^ℓ, directly in terms of the kernel, without needing to sample features (Fig. 1 bottom),

P(G^1 | X) = W((1/N_1)(1/N_0) X X^T, N_1),
P(G^ℓ | G^{ℓ-1}) = W(K(G^{ℓ-1})/N_ℓ, N_ℓ),   for ℓ ∈ {2, . . . , L},   (6)
P(F^{L+1} | G^L) = ∏_{λ=1}^{N_{L+1}} N(f^{L+1}_λ; 0, K(G^L)).

Except at the output, the model is phrased entirely in terms of positive-definite kernels and Gram matrices, is consistent under marginalisation (assuming a valid kernel), and is thus a DKP. At a high level, the model can be understood as alternately sampling a Gram matrix (introducing flexibility in the representation) and nonlinearly transforming the Gram matrix using a kernel (Fig. 2). This highlights a particularly simple interpretation of the DKP as an autoregressive process. In a standard autoregressive process, we might propagate the current vector, x_t, through a deterministic function, f(x_t), and add zero-mean Gaussian noise, ξ,

x_{t+1} = f(x_t) + σξ,   such that   E[x_{t+1} | x_t] = f(x_t).

By analogy, the next Gram matrix has expectation centered on a deterministic transformation of the previous Gram matrix,

E[G^ℓ | G^{ℓ-1}] = K(G^{ℓ-1}),

so G^ℓ can be written as this expectation plus a zero-mean random variable, Ξ^ℓ, that can be interpreted as noise,

G^ℓ = K(G^{ℓ-1}) + Ξ^ℓ.

Note that Ξ^ℓ is not in general positive definite, and may not have an analytically tractable distribution. This noise decreases as N_ℓ increases,

V[G^ℓ_ij] = V[Ξ^ℓ_ij] = (1/N_ℓ)(K²_ij(G^{ℓ-1}) + K_ii(G^{ℓ-1}) K_jj(G^{ℓ-1})).

Notably, as N_ℓ tends to infinity, the Wishart samples converge on their expectation and the noise disappears, leaving us with a series of deterministic transformations of the Gram matrix. Therefore, we can understand a deep kernel process as alternately adding "noise" to the kernel by sampling from e.g. a Wishart or inverse Wishart distribution (G^2 and G^3 in Fig. 2) and computing a nonlinear transformation of the kernel (K(G^2) and K(G^3) in Fig. 2).
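The alternating sample-then-transform process, and the vanishing of the noise Ξ^ℓ as N_ℓ grows, can be illustrated with a small simulation (a sketch assuming a squared-exponential kernel computed from the Gram matrix; all sizes and the helper names are hypothetical):

```python
import numpy as np
from scipy.stats import wishart

# Squared-exponential kernel acting on a Gram matrix via
# R_ij = G_ii - 2 G_ij + G_jj, K(G) = exp(-R/2).
def kernel_from_gram(G):
    d = np.diag(G)
    R = d[:, None] - 2.0 * G + d[None, :]
    return np.exp(-0.5 * R) + 1e-8 * np.eye(len(G))  # jitter for stability

rng = np.random.default_rng(2)
P, N0 = 4, 8
X = rng.standard_normal((P, N0))
K1 = X @ X.T / N0

def sample_G2(N, n_samples=2000):
    """Two-layer deep Wishart process: G1 ~ W(K1/N, N), G2 ~ W(K(G1)/N, N)."""
    G2 = np.empty((n_samples, P, P))
    for s in range(n_samples):
        G1 = wishart(df=N, scale=K1 / N).rvs(random_state=rng)
        G2[s] = wishart(df=N, scale=kernel_from_gram(G1) / N).rvs(random_state=rng)
    return G2

# As N grows, the Wishart "noise" vanishes and G^2 concentrates.
assert sample_G2(200).std(axis=0).mean() < sample_G2(10).std(axis=0).mean()
```
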
Remember that we are restricted to kernels that can be written as a function of the Gram matrix,

K^ℓ = K(G^ℓ) = K_features(F^ℓ),    K^ℓ_ij = k(F^ℓ_{i,:}, F^ℓ_{j,:}),   (11)

where K_features(·) takes a matrix of features, F^ℓ, and returns the kernel matrix, K^ℓ, and k is the usual kernel function, which takes two feature vectors (rows of F^ℓ) and returns a single element of the kernel matrix. This does not include all possible kernels, because it is not possible to recover the features from the Gram matrix. In particular, the Gram matrix is invariant to unitary transformations of the features: it is the same for F^ℓ and F′^ℓ = F^ℓ U, where U is a unitary matrix such that U U^T = I,

G^ℓ = (1/N_ℓ) F^ℓ F^{ℓT} = (1/N_ℓ) F^ℓ U U^T F^{ℓT} = (1/N_ℓ) F′^ℓ F′^{ℓT}.   (12)

Superficially, this might seem very limiting, leaving us only with dot-product kernels (Williams & Rasmussen, 2006) such as

k(f, f′) = f · f′ + σ².

However, in reality a far broader range of kernels fits within this class. Importantly, isotropic or radial basis function kernels, including the squared exponential and Matérn, depend only on the squared distance between points, R (Williams & Rasmussen, 2006),

k(f, f′) = k(R),    R = |f − f′|².

These kernels can be written as a function of G because the matrix of squared distances, R, can be computed from G:

R_ij = (1/N) Σ_{λ=1}^N (F_iλ − F_jλ)² = (1/N) Σ_{λ=1}^N (F_iλ² − 2 F_iλ F_jλ + F_jλ²) = G_ii − 2 G_ij + G_jj.

Figure 2: Samples from a deep kernel process, alternating between nonlinearly transformed kernels (K(G^1), K(G^2), . . .) and Wishart-sampled Gram matrices (G^2, G^3), plotted against input index.
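A minimal check of this equivalence (hypothetical sizes): the squared-exponential kernel computed from features agrees with the same kernel computed purely from the Gram matrix, and is unchanged by unitary transformations of the features:

```python
import numpy as np

rng = np.random.default_rng(3)
P, N = 6, 4
F = rng.standard_normal((P, N))
G = F @ F.T / N

def se_from_features(F):
    # k(f, f') = exp(-R/2) with R the (1/N-scaled) squared distance.
    R = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1) / F.shape[1]
    return np.exp(-0.5 * R)

def se_from_gram(G):
    # Same kernel, using only R_ij = G_ii - 2 G_ij + G_jj.
    d = np.diag(G)
    return np.exp(-0.5 * (d[:, None] - 2 * G + d[None, :]))

assert np.allclose(se_from_features(F), se_from_gram(G))

# Rotating the features changes F but not G, and hence not the kernel.
U, _ = np.linalg.qr(rng.standard_normal((N, N)))
F_rot = F @ U
assert np.allclose(F_rot @ F_rot.T / N, G)
assert np.allclose(se_from_features(F_rot), se_from_gram(G))
```
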

4. VARIATIONAL INFERENCE IN DEEP KERNEL PROCESSES

A key part of the motivation for developing deep kernel processes was that the posteriors over weights in a BNN or over features in a deep GP are extremely complex and multimodal, with a large number of symmetries that are not captured by standard approximate posteriors (MacKay, 1992; Moore, 2016; Pourzanjani et al., 2017) . For instance, in the Appendix we show that there are permutation symmetries in the prior and posteriors over weights in BNNs (App. D.1) and rotational symmetries in the prior and posterior over features in deep GPs with isotropic kernels (App. D.2). The inability to capture these symmetries in standard variational posteriors may introduce biases in the parameters inferred by variational inference, because the variational bound is not uniformly tight across the state-space (Turner & Sahani, 2011) . Intuitively, these symmetries arise in DGPs with isotropic kernels because the features at the next layer depend only on the kernel matrix at the previous layer, and this kernel is invariant to unitary transformations of the features (Eq. 12). As such, we can sidestep these complex posterior symmetries by working directly with the Gram matrices as the random variables for variational inference. We show that DGPs (Sec. 3.1) and infinite NNs with bottlenecks (App. C.3) are deep Wishart processes, so a natural approach would be to define an approximate posterior over the Gram matrices in the deep Wishart process. However, this turns out to be difficult, predominantly because the approximate posterior we would like to use, the non-central Wishart (App. E), has a probability density function that is prohibitively costly and complex to evaluate in the inner loop of a deep learning model (Koev & Edelman, 2006) . Instead, we consider an inverse Wishart process prior, for which the inverse Wishart itself makes a good choice of approximate posterior.

4.1. THE DEEP INVERSE WISHART PROCESSES

By analogy with Eq. (6), our deep inverse Wishart processes (DIWPs) are given by

P(Ω) = W^{-1}(δ_1 I, δ_1 + N_0 + 1),    (with G^1 = (1/N_0) X Ω X^T),   (16a)
P(G^ℓ | G^{ℓ-1}) = W^{-1}(G^ℓ; δ_ℓ K(G^{ℓ-1}), P + 1 + δ_ℓ),    for ℓ ∈ {2, . . . , L},   (16b)
P(F^{L+1} | G^L) = ∏_{λ=1}^{N_{L+1}} N(f^{L+1}_λ; 0, K(G^L)),   (16c)

where, remember, X ∈ R^{P×N_0}, G^ℓ ∈ R^{P×P} and F^{L+1} ∈ R^{P×N_{L+1}}. Note that at the input layer, the kernel (1/N_0) X X^T may be singular if there are more datapoints than features. Instead of attempting to use a singular Wishart distribution over G^1, which would be complex and difficult to work with (Bodnar & Okhrin, 2008; Bodnar et al., 2016), we instead define an approximate posterior over the full-rank N_0 × N_0 matrix, Ω, and use G^1 = (1/N_0) X Ω X^T ∈ R^{P×P}. Critically, the distributions in Eq. (16b) are consistent under marginalisation as long as δ_ℓ is held constant (Dawid, 1981), with P taken to be the number of input points, or equivalently the size of K^{ℓ-1}. Further, the deep inverse Wishart process retains the interpretation as a deterministic transformation of the kernel plus noise, because the expectation is

E[G^ℓ | G^{ℓ-1}] = δ_ℓ K(G^{ℓ-1}) / ((P + 1 + δ_ℓ) − (P + 1)) = K(G^{ℓ-1}).

The resulting inverse Wishart process does not have a direct interpretation as e.g. a deep GP, but does have more appealing properties for variational inference, as it is always full rank and allows independent control over the approximate posterior mean and variance. Finally, it is important to note that Wishart and inverse Wishart distributions do not differ as much as one might expect: the standard Wishart and standard inverse Wishart distributions have isotropic distributions over eigenvectors, so they differ only in their distributions over eigenvalues, and these are often quite similar, especially if we consider a Wishart model with a ResNet-like structure (App. H).
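The identity E[G^ℓ | G^{ℓ-1}] = K(G^{ℓ-1}) for the prior in Eq. (16b) can be verified by Monte Carlo (a sketch assuming a squared-exponential kernel computed from the Gram matrix; sizes and δ are hypothetical):

```python
import numpy as np
from scipy.stats import invwishart

# Squared-exponential kernel computed from the Gram matrix.
def kernel_from_gram(G):
    d = np.diag(G)
    return np.exp(-0.5 * (d[:, None] - 2 * G + d[None, :]))

rng = np.random.default_rng(4)
P, delta = 3, 5.0
F = rng.standard_normal((P, 6))
G_prev = F @ F.T / 6                   # a fixed previous-layer Gram matrix
K = kernel_from_gram(G_prev)

# G^l ~ W^{-1}(delta K, P + 1 + delta) has mean delta K / delta = K.
samples = invwishart(df=P + 1 + delta, scale=delta * K).rvs(size=50000, random_state=rng)
rel_err = np.linalg.norm(samples.mean(axis=0) - K) / np.linalg.norm(K)
assert rel_err < 0.05
```
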

4.2. AN APPROXIMATE POSTERIOR FOR THE DEEP INVERSE WISHART PROCESS

Choosing an appropriate and effective form for variational approximate posteriors is usually a difficult research problem. Here, we take inspiration from Ober & Aitchison (2020) by exploiting the fact that the inverse Wishart distribution is the conjugate prior for the covariance matrix of a multivariate Gaussian. In particular, consider an inverse Wishart prior with mean Σ_0 over Σ ∈ R^{P×P}, which forms the covariance of a Gaussian-distributed matrix, V, consisting of columns v_λ:

P(Σ) = W^{-1}(Σ; δΣ_0, P + 1 + δ),
P(V | Σ) = ∏_{λ=1}^{N_V} N(v_λ; 0, Σ),
P(Σ | V) = W^{-1}(Σ; δΣ_0 + V V^T, P + 1 + δ + N_V).

Inspired by this exact posterior, which is available in simple models, we choose the approximate posterior in our model to be

Q(Ω) = W^{-1}(Ω; δ_1 I + V_1 V_1^T, δ_1 + γ_1 + (N_0 + 1)),
Q(G^ℓ | G^{ℓ-1}) = W^{-1}(G^ℓ; δ_ℓ K(G^{ℓ-1}) + V_ℓ V_ℓ^T, δ_ℓ + γ_ℓ + (P + 1)),
Q(F^{L+1} | G^L) = ∏_{λ=1}^{N_{L+1}} N(f^{L+1}_λ; Σ_λ Λ_λ v_λ, Σ_λ),   where Σ_λ = (K^{-1}(G^L) + Λ_λ)^{-1},   (19)

and where V_1 is a learned N_0 × N_0 matrix, {V_ℓ}_{ℓ=2}^L are learned P × P matrices, and {γ_ℓ}_{ℓ=1}^L are learned non-negative real numbers. For more details about the input layer, see App. F. At the output layer, we take inspiration from the global inducing approximate posterior for DGPs from Ober & Aitchison (2020), with the learned parameters being vectors, v_λ, and positive definite matrices, Λ_λ (see App. G). In summary, the prior has parameters {δ_ℓ}_{ℓ=1}^L (which also appear in the approximate posterior), and the approximate posterior has parameters {V_ℓ}_{ℓ=1}^L and {γ_ℓ}_{ℓ=1}^L for the inverse Wishart hidden layers, and {v_λ}_{λ=1}^{N_{L+1}} and {Λ_λ}_{λ=1}^{N_{L+1}} at the output. In all our experiments, we optimize all five sets of parameters, {δ_ℓ, V_ℓ, γ_ℓ}_{ℓ=1}^L and {v_λ, Λ_λ}_{λ=1}^{N_{L+1}}; in addition, for inducing-point methods, we also optimize a single set of "global" inducing inputs, X_i ∈ R^{P_i×N_0}, which are defined only at the input layer.
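The conjugacy underlying this choice of approximate posterior can be checked numerically in the scalar case P = 1, where scipy's inverse Wishart density is available (a sketch with hypothetical values of δ, Σ_0 and N_V):

```python
import numpy as np
from scipy.stats import invwishart, norm

# Prior W^{-1}(d*s0, P+1+d) times a Gaussian likelihood for Nv zero-mean
# observations should equal the closed-form posterior
# W^{-1}(d*s0 + sum v^2, P+1+d+Nv). Here P = 1, so P + 1 + d = 2 + d.
d, s0, Nv = 4.0, 1.5, 6
rng = np.random.default_rng(5)
v = rng.standard_normal(Nv)

grid = np.linspace(0.05, 20.0, 4000)   # grid over the variance Sigma
dx = grid[1] - grid[0]
prior = invwishart(df=2 + d, scale=d * s0).pdf(grid)
lik = np.prod([norm(0.0, np.sqrt(grid)).pdf(vi) for vi in v], axis=0)
post_numeric = prior * lik
post_numeric /= post_numeric.sum() * dx  # normalise numerically

post_closed = invwishart(df=2 + d + Nv, scale=d * s0 + (v ** 2).sum()).pdf(grid)
assert np.max(np.abs(post_numeric - post_closed)) < 0.01
```
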

4.3. DOUBLY STOCHASTIC INDUCING-POINT VARIATIONAL INFERENCE IN DEEP INVERSE WISHART PROCESSES

For efficient inference in high-dimensional problems, we take inspiration from the DGP literature (Salimbeni & Deisenroth, 2017) by considering doubly-stochastic inducing-point deep inverse Wishart processes. We begin by decomposing all variables into inducing and training (or test) points, with X_t ∈ R^{P_t×N_0},

X = (X_i; X_t),   F^{L+1} = (F^{L+1}_i; F^{L+1}_t),   G^ℓ = (G^ℓ_ii, G^ℓ_it; G^ℓ_ti, G^ℓ_tt),

where e.g. G^ℓ_ii is P_i × P_i and G^ℓ_it is P_i × P_t, with P_i the number of inducing points and P_t the number of test/training points. Note that Ω does not decompose, as it is N_0 × N_0. The full ELBO, including latent variables for all the inducing and training points, is

L = E[log P(Y | F^{L+1}) + log P(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X) − log Q(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X)],   (21)

where the expectation is taken over Q(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X). The prior is given by combining all terms in Eq. (16) for both inducing and test/train inputs,

P(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X) = P(Ω) ∏_{ℓ=2}^L P(G^ℓ | G^{ℓ-1}) P(F^{L+1} | G^L),

where the X-dependence enters on the right because G^1 = (1/N_0) X Ω X^T. Taking inspiration from Salimbeni & Deisenroth (2017), the full approximate posterior is the product of an approximate posterior over inducing points and the conditional prior for train/test points,

Q(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X) = Q(Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i | X_i) P({G^ℓ_it}_{ℓ=2}^L, {G^ℓ_tt}_{ℓ=2}^L, F^{L+1}_t | Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i, X),   (23)

and the prior can be written in the same form,

P(Ω, {G^ℓ}_{ℓ=2}^L, F^{L+1} | X) = P(Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i | X_i) P({G^ℓ_it}_{ℓ=2}^L, {G^ℓ_tt}_{ℓ=2}^L, F^{L+1}_t | Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i, X).   (24)

We discuss the second terms (the conditional prior) in Eq. (28). The first terms (the prior and approximate posterior over inducing points) are given by combining terms in Eq. (16) and Eq. (19),

P(Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i | X_i) = P(Ω) ∏_{ℓ=2}^L P(G^ℓ_ii | G^{ℓ-1}_ii) P(F^{L+1}_i | G^L_ii),   (25)
Q(Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i | X_i) = Q(Ω) ∏_{ℓ=2}^L Q(G^ℓ_ii | G^{ℓ-1}_ii) Q(F^{L+1}_i | G^L_ii).   (26)

Substituting Eqs. (23-26) into the ELBO (Eq. 21), the conditional prior cancels, and we obtain

L = E[log P(Y | F^{L+1}_t) + log(P(Ω) ∏_{ℓ=2}^L P(G^ℓ_ii | G^{ℓ-1}_ii) P(F^{L+1}_i | G^L_ii)) − log(Q(Ω) ∏_{ℓ=2}^L Q(G^ℓ_ii | G^{ℓ-1}_ii) Q(F^{L+1}_i | G^L_ii))].

Importantly, the first term is a summation across test/train datapoints, and the second term depends only on the inducing points, so, as in Salimbeni & Deisenroth (2017), we can compute unbiased estimates of the expectation by taking only a minibatch of datapoints. Moreover, we never need to compute the density of the conditional prior in Eq. (28); we only need to be able to sample from it. Finally, to sample the test/training points conditioned on the inducing points, we need to sample from

P({G^ℓ_it}_{ℓ=2}^L, {G^ℓ_tt}_{ℓ=2}^L, F^{L+1}_t | Ω, {G^ℓ_ii}_{ℓ=2}^L, F^{L+1}_i, X) = P(F^{L+1}_t | F^{L+1}_i, G^L) ∏_{ℓ=2}^L P(G^ℓ_it, G^ℓ_tt | G^ℓ_ii, G^{ℓ-1}).   (28)

The first distribution, P(F^{L+1}_t | F^{L+1}_i, G^L), is a multivariate Gaussian, and can be evaluated using methods from the GP literature (Williams & Rasmussen, 2006; Salimbeni & Deisenroth, 2017). The difficulties arise for the inverse Wishart terms, P(G^ℓ_it, G^ℓ_tt | G^ℓ_ii, G^{ℓ-1}). To sample this distribution, note that samples from the joint over inducing and train/test locations can be written

(G^ℓ_ii, G^ℓ_it; G^ℓ_ti, G^ℓ_tt) ∼ W^{-1}((Ψ_ii, Ψ_it; Ψ_ti, Ψ_tt), δ_ℓ + P_i + P_t + 1),   where   (Ψ_ii, Ψ_it; Ψ_ti, Ψ_tt) = δ_ℓ K(G^{ℓ-1}),   (30)

and where P_i is the number of inducing inputs and P_t the number of train/test inputs. Defining the Schur complements

G^ℓ_tt·i = G^ℓ_tt − G^ℓ_ti (G^ℓ_ii)^{-1} G^ℓ_it,   Ψ_tt·i = Ψ_tt − Ψ_ti Ψ_ii^{-1} Ψ_it,

we know that G^ℓ_tt·i and (G^ℓ_ii)^{-1} G^ℓ_it have distributions (Eaton, 1983)

G^ℓ_tt·i | G^ℓ_ii, G^{ℓ-1} ∼ W^{-1}(Ψ_tt·i, δ_ℓ + P_i + P_t + 1),
(G^ℓ_ii)^{-1} G^ℓ_it | G^ℓ_tt·i, G^ℓ_ii, G^{ℓ-1} ∼ MN(Ψ_ii^{-1} Ψ_it, Ψ_ii^{-1}, G^ℓ_tt·i),

where MN is the matrix normal distribution. Now, G^ℓ_it and G^ℓ_tt can be recovered by algebraic manipulation.
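The conditional sampling procedure above can be sketched for a single layer as follows (a minimal implementation; the helper name `sample_cond_invwishart` and all sizes are hypothetical):

```python
import numpy as np
from scipy.stats import invwishart

def sample_cond_invwishart(G_ii, Psi, nu, Pi, rng):
    """Given G_ii, draw (G_it, G_tt) so that the joint block matrix is a
    sample from W^{-1}(Psi, nu), via the Schur-complement decomposition."""
    Pt = Psi.shape[0] - Pi
    Psi_ii, Psi_it, Psi_tt = Psi[:Pi, :Pi], Psi[:Pi, Pi:], Psi[Pi:, Pi:]
    Psi_tt_i = Psi_tt - Psi_it.T @ np.linalg.solve(Psi_ii, Psi_it)

    # G_{tt.i} ~ W^{-1}(Psi_{tt.i}, nu)
    G_tt_i = np.atleast_2d(invwishart(df=nu, scale=Psi_tt_i).rvs(random_state=rng))

    # T = G_ii^{-1} G_it ~ MN(Psi_ii^{-1} Psi_it, Psi_ii^{-1}, G_{tt.i}):
    # sample as M + A Z B^T with A A^T = Psi_ii^{-1} and B B^T = G_{tt.i}.
    M = np.linalg.solve(Psi_ii, Psi_it)
    A = np.linalg.cholesky(np.linalg.inv(Psi_ii))
    B = np.linalg.cholesky(G_tt_i)
    T = M + A @ rng.standard_normal((Pi, Pt)) @ B.T

    G_it = G_ii @ T                                       # recover G_it
    G_tt = G_tt_i + G_it.T @ np.linalg.solve(G_ii, G_it)  # recover G_tt
    return G_it, G_tt

rng = np.random.default_rng(6)
Pi, Pt, delta = 3, 2, 4.0
A0 = rng.standard_normal((Pi + Pt, Pi + Pt))
Psi = A0 @ A0.T + np.eye(Pi + Pt)    # stands in for delta * K(G^{l-1})
nu = delta + Pi + Pt + 1

# Marginally, G_ii ~ W^{-1}(Psi_ii, nu - Pt); then fill in the rest.
G_ii = invwishart(df=nu - Pt, scale=Psi[:Pi, :Pi]).rvs(random_state=rng)
G_it, G_tt = sample_cond_invwishart(G_ii, Psi, nu, Pi, rng)
G = np.block([[G_ii, G_it], [G_it.T, G_tt]])
assert np.all(np.linalg.eigvalsh(G) > 0)  # joint sample is positive definite
```
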
Finally, because of the doubly-stochastic form of the objective, we do not need to draw jointly consistent samples across multiple test points; instead (and as in DGPs; Salimbeni & Deisenroth, 2017), we can independently sample each test point (App. I), which dramatically reduces computational complexity. We optimize using standard reparameterised variational inference (Kingma & Welling, 2013; Rezende et al., 2014; see Ober & Aitchison, 2020, for details on how to reparameterise samples from the Wishart).

5. COMPUTATIONAL COMPLEXITY

As in non-deep GPs, the complexity is O(P³) for time and O(P²) for space for standard DKPs (the O(P³) time dependencies emerge e.g. because of the inverses and determinants required for the inverse Wishart distributions). For DSVI, there is an O(P_i³) time and O(P_i²) space term for the inducing points, because the computations for inducing points are exactly the same as in the non-DSVI case. As we can treat each test/train point independently (App. I), the complexity for test/train points scales linearly with P_t, and this term has O(P_i²) time scaling, e.g. due to the matrix products in Eq. (30). Thus, the overall complexity for DSVI is O(P_i³ + P_i² P_t) for time and O(P_i² + P_i P_t) for space, which is exactly the same as for non-deep inducing-point GPs. Thus, and exactly as in non-deep inducing-point GPs, by using a small number of inducing points we are able to convert a cubic dependence on the number of input points into a linear dependence, which gives considerably better scaling. Surprisingly, this is substantially better than standard DGPs. In standard DGPs, we allow the approximate posterior covariance for each feature to differ (Salimbeni & Deisenroth, 2017), in which case we are in essence doing standard inducing-GP inference over N_ℓ hidden features, which gives a complexity of O(N_ℓ P_i³ + N_ℓ P_i² P_t) for time and O(N_ℓ P_i² + N_ℓ P_i P_t) for space (Salimbeni & Deisenroth, 2017). It is possible to improve this complexity by restricting the approximate posterior to have the same covariance for each feature, but this restriction can be expected to harm performance.

Table 1: Performance in terms of ELBO and predictive log-likelihood for a three-layer (two hidden layer) DGP, NNGP and DIWP on UCI benchmark tasks. Errors are quoted as two standard errors in the difference between each method and the best performing method, as in a paired t-test.
This is to account for the shared variability that arises from the use of different test/train splits of the data (20 splits for all datasets except protein, where 5 splits are used; Gal & Ghahramani, 2015): some splits are harder for all models, and some are easier. Because we consider these differences, errors for the best method are implicitly included in the errors for the other methods, and we cannot provide a comparable error for the best method itself.

6. RESULTS

We began by comparing the performance of our deep inverse Wishart process (DIWP) against infinite Bayesian neural networks (known as the neural network Gaussian process, or NNGP) and DGPs. To ensure sensible comparisons with the NNGP, we used a ReLU kernel in all models (Cho & Saul, 2009). For all models, we used three layers (two hidden layers and one output layer), with three applications of the kernel. In each case, we used a learned bias and scale for each input feature, and trained for 8000 gradient steps with the Adam optimizer with 100 inducing points, using a learning rate of 10^-2 for the first 4000 steps and 10^-3 for the final 4000 steps. For evaluation, we used 100 samples from the final iteration of gradient descent; for each training step, we used 10 samples on the smaller datasets (boston, concrete, energy, wine, yacht) and 1 sample on the larger datasets. We found that DIWP usually gives better predictive performance and ELBOs. We expected DIWP to be better than (or the same as) the NNGP, as the NNGP is a special case of our DIWP (sending δ → ∞ sends the variance of the inverse Wishart to zero, so the model becomes equivalent to the NNGP). We found that the DGP performs poorly in comparison to DIWP and NNGPs, and even to past baselines, on all datasets except protein (which is by far the largest). This is because we use a ReLU rather than a squared-exponential kernel, as in Salimbeni & Deisenroth (2017), and because we used a plain feedforward architecture for all models. In contrast, Salimbeni & Deisenroth (2017) found that good performance with DGPs, even on UCI datasets, required a more complex architecture involving skip connections. Here, we used simple feedforward architectures, both to ensure a fair comparison to the other models and to avoid the need for an architecture search.
In addition, the inverse Wishart process is implicitly able to learn the effective network "width", δ_ℓ, whereas in the DGPs the width is fixed to be equal to the number of input features, following standard practice in the literature (e.g. Salimbeni & Deisenroth, 2017). Next, we considered fully-connected networks for small image classification datasets (MNIST and CIFAR-10). We used the same models as in the previous section, omitting the learned bias and scaling of the inputs. Note that we do not expect these fully-connected methods to perform well relative to convolutional architectures for CIFAR-10 (App. K). For MNIST, remember that the ELBO must be negative (because both the log-likelihood for classification and the KL-divergence term give negative contributions), so the improvement from -0.301 to -0.214 represents a dramatic change.

7. RELATED WORK

Our first contribution was the observation that DGPs with isotropic kernels can be written as deep Wishart processes, as the kernel depends only on the Gram matrix. We then gave similar observations for neural networks (App. C.1), infinite neural networks (App. C.2) and infinite networks with bottlenecks (App. C.3; also see Aitchison, 2019). These observations motivated us to consider the deep inverse Wishart process prior, which is a novel combination of two pre-existing elements: nonlinear transformations of the kernel (e.g. Cho & Saul, 2009) and inverse Wishart priors over kernels (e.g. Shah et al., 2014). Deep nonlinear transformations of the kernel have been used in the infinite neural network literature (Lee et al., 2017; Matthews et al., 2018), where they form deterministic, parameter-free kernels that do not have any flexibility in the lower layers (Aitchison, 2019). Likewise, inverse Wishart distributions have been suggested as priors over covariance matrices (Shah et al., 2014), but in a model without nonlinear transformations of the kernel. Surprisingly, without these nonlinear transformations, the inverse Wishart prior becomes equivalent to simply scaling the covariance with a scalar random variable (App. L; Shah et al., 2014). Further, linear (inverse) Wishart processes have been used in the financial domain to model how the volatility of asset prices changes over time (Philipov & Glickman, 2006a;b; Asai & McAleer, 2009; Gourieroux & Sufana, 2010; Wilson & Ghahramani, 2010; Heaukulani & van der Wilk, 2019). Importantly, inference in these dynamical (inverse) Wishart processes is often performed by assuming fixed, integer degrees of freedom and working with underlying Gaussian-distributed features. This approach allows one to leverage standard GP techniques (e.g. Kandemir & Hamprecht, 2015; Heaukulani & van der Wilk, 2019), but it is then not possible to optimize the degrees of freedom, and the posterior over these features usually has rotational symmetries (App. D.2) that are not captured by standard variational posteriors. In contrast, we give a novel doubly-stochastic variational inducing-point inference method that operates purely on Gram matrices and thus avoids needing to capture these symmetries.

8. CONCLUSIONS

We proposed deep kernel processes, which combine nonlinear transformations of the Gram matrix with sampling from matrix-variate distributions such as the inverse Wishart. We showed that DGPs, BNNs (App. C.1), infinite BNNs (App. C.2) and infinite BNNs with bottlenecks (App. C.3) are all instances of DKPs. We defined a new family of deep inverse Wishart processes, and gave a novel doubly-stochastic inducing-point variational inference scheme that works purely in the space of Gram matrices. DIWP performed better than NNGPs and DGPs on UCI, MNIST and CIFAR-10 benchmarks.

A DKPS ARE KERNEL PROCESSES

We define a generic kernel process, K, to be a distribution over positive definite matrices given a matrix K ∈ R^{P×P}. For instance, we could take

K(K) = W(K, N)    or    K(K) = W^{-1}(K, δ + (P + 1)),

where N is a positive integer and δ is a positive real number. A deep kernel process, D, is the composition of two (or more) underlying kernel processes, K_1 and K_2,

G_1 ∼ K_1(K),    G_2 ∼ K_2(G_1),   (33a)
G_2 ∼ D(K).   (33b)

We define K*, G*_1 and G*_2 as principal submatrices of K, G_1 and G_2 respectively, dropping the same rows and columns. To establish that D is consistent under marginalisation, we use the consistency under marginalisation of K_1 and K_2,

G*_1 ∼ K_1(K*),    G*_2 ∼ K_2(G*_1),

and the definition of D as the composition of K_1 and K_2 (Eq. 33),

G*_2 ∼ D(K*).

D is thus consistent under marginalisation, and hence is a kernel process. Further, note that we can consider K to be a deterministic distribution that puts mass on only a single G. In that case, K can be thought of as a deterministic function which must satisfy a corresponding consistency property,

G = K(K),    G* = K(K*),

and this is indeed satisfied by all deterministic transformations of kernels considered here. In practical terms, as long as G is always a valid kernel, it is sufficient for the off-diagonal elements, G_{i≠j}, to depend only on K_ij, K_ii and K_jj, and for G_ii to depend only on K_ii, which is satisfied by e.g. the squared exponential kernel (Eq. 15) and by the ReLU kernel (Cho & Saul, 2009).
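This consistency argument can be checked empirically for a two-layer composition of Wishart processes linked by a squared-exponential kernel, which satisfies the locality condition above (a sketch; all sizes are hypothetical):

```python
import numpy as np
from scipy.stats import wishart

# SE kernel from a Gram matrix; elementwise local, so it preserves
# consistency under marginalisation.
def se_from_gram(G):
    d = np.diag(G)
    return np.exp(-0.5 * (d[:, None] - 2 * G + d[None, :])) + 1e-8 * np.eye(len(G))

def chain(K, N, n, rng):
    """G1 ~ W(K/N, N), G2 ~ W(se_from_gram(G1)/N, N); return n samples of G2."""
    out = np.empty((n,) + K.shape)
    for s in range(n):
        G1 = wishart(df=N, scale=K / N).rvs(random_state=rng)
        out[s] = wishart(df=N, scale=se_from_gram(G1) / N).rvs(random_state=rng)
    return out

rng = np.random.default_rng(7)
P, Pstar, N, n = 4, 2, 30, 4000
A = rng.standard_normal((P, P))
K = A @ A.T / P + np.eye(P)

full = chain(K, N, n, rng)[:, :Pstar, :Pstar]     # full chain, then marginalise
direct = chain(K[:Pstar, :Pstar], N, n, rng)      # chain run directly on K*
for stat in (np.mean, np.std):
    a, b = stat(full, axis=0), stat(direct, axis=0)
    assert np.linalg.norm(a - b) / np.linalg.norm(b) < 0.1
```
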

B THE FIRST LAYER OF OUR DEEP GP AS BAYESIAN INFERENCE OVER A GENERALISED LENGTHSCALE

In our deep GP architecture, we first sample F^1 ∈ R^{P×N_1} from a Gaussian with covariance K^1 = (1/N_0) X X^T (Eq. 4a). This might seem odd, as the usual deep GP involves passing the input, X ∈ R^{P×N_0}, directly to the kernel function. However, in the standard deep GP framework, the kernel (e.g. a squared-exponential kernel) has lengthscale hyperparameters which can be inferred using Bayesian inference. In particular,

k_param((1/√N_0) x_i, (1/√N_0) x_j) = exp(−(1/(2N_0)) (x_i − x_j) Ω (x_i − x_j)^T),

where k_param is a new squared-exponential kernel that explicitly includes hyperparameters Ω ∈ R^{N_0×N_0}, and where x_i is the ith row of X. Typically, in deep GPs, the parameter Ω is diagonal, and the diagonal elements correspond to the inverse square of the lengthscale, l_i (i.e. Ω_ii = 1/l_i²). However, in many cases it may be useful to have a non-diagonal scaling. For instance, we could use

Ω ∼ W((1/N_1) I, N_1),   which corresponds to   Ω = W W^T,   where W_iλ ∼ N(0, 1/N_1),   W ∈ R^{N_0×N_1}.

Under our approach, we sample F = F^1 from Eq. (4b), so F can be written as

F = X W,   f_i = x_i W,

where f_i is the ith row of F.

Figure 3: A series of generative models for a standard, finite BNN. (Top) The standard model, with features F^ℓ and weights W^ℓ (Eq. 41): X → W^1 → F^1 → W^2 → F^2 → W^3 → F^3 → W^4 → F^4 → Y. (Middle) Integrating out the weights, the distribution over features becomes Gaussian (Eq. 44), and we explicitly introduce the kernel, K^ℓ, as a latent variable: X → K^1 → F^1 → K^2 → F^2 → K^3 → F^3 → K^4 → F^4 → Y. (Bottom) Integrating out the activations, F^ℓ, gives a deep kernel process, albeit one where the distributions P(K^ℓ | K^{ℓ-1}) cannot be written down analytically, but where the expectation E[K^ℓ | K^{ℓ-1}] is known (Eq. 45): X → K^1 → K^2 → K^3 → K^4 → F^4 → Y.
Putting this into a squared exponential kernel without a lengthscale parameter,

k(f_i/√N_0, f_j/√N_0) = exp(−(1/(2N_0)) (f_i − f_j)(f_i − f_j)^T)
= exp(−(1/(2N_0)) (x_i W − x_j W)(x_i W − x_j W)^T)
= exp(−(1/(2N_0)) (x_i − x_j) W W^T (x_i − x_j)^T)
= exp(−(1/(2N_0)) (x_i − x_j) Ω (x_i − x_j)^T)
= k_param(x_i/√N_0, x_j/√N_0).

We find that a parameter-free squared exponential kernel applied to F is equivalent to a squared-exponential kernel with generalised lengthscale hyperparameters applied to the input.
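This equivalence can be verified numerically. Below is a small sketch (NumPy; the sizes and random inputs are our own illustrative assumptions) checking that the parameter-free squared exponential kernel on F = XW coincides with k_param on X with Ω = W W^T:

```python
import numpy as np

rng = np.random.default_rng(1)

N0, N1 = 5, 8                     # illustrative feature counts
X = rng.standard_normal((3, N0))  # three inputs, rows x_i
W = rng.standard_normal((N0, N1)) / np.sqrt(N1)  # W_{i,lam} ~ N(0, 1/N1)
F = X @ W                         # features f_i = x_i W
Omega = W @ W.T                   # generalised lengthscale, Omega = W W^T

def k_plain(fi, fj):
    # Parameter-free squared exponential kernel on the features.
    d = fi - fj
    return np.exp(-d @ d / (2 * N0))

def k_param(xi, xj):
    # Squared exponential kernel on the inputs with generalised lengthscale.
    d = xi - xj
    return np.exp(-d @ Omega @ d / (2 * N0))

# The two kernels agree, as (f_i - f_j)(f_i - f_j)^T
# = (x_i - x_j) W W^T (x_i - x_j)^T.
print(k_plain(F[0], F[1]), k_param(X[0], X[1]))
```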

C BNNS AS DEEP KERNEL PROCESSES

Here we show that standard, finite BNNs, infinite BNNs and infinite BNNs with bottlenecks can be understood as deep kernel processes.

C.1 STANDARD FINITE BNNS (AND GENERAL DGPS)

Standard, finite BNNs are deep kernel processes, albeit ones which do not admit an analytic expression for the probability density. In particular, the prior for a standard Bayesian neural network (Fig. 3 top) is

P(W_ℓ) = ∏_{λ=1}^{N_ℓ} N(w_{λℓ}; 0, I/N_{ℓ-1}), W_ℓ ∈ R^{N_{ℓ-1}×N_ℓ}, (41a)
F_ℓ = X W_1 for ℓ = 1, and F_ℓ = φ(F_{ℓ-1}) W_ℓ otherwise, with F_ℓ ∈ R^{P×N_ℓ}, (41b)

where w_{λℓ} is the λth column of W_ℓ. In the neural-network case, φ is a pointwise nonlinearity such as a ReLU. Integrating out the weights, the features, F_ℓ, become Gaussian distributed, as they depend linearly on the Gaussian distributed weights, W_ℓ,

P(F_ℓ|F_{ℓ-1}) = ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K_ℓ) = P(F_ℓ|K_ℓ), where K_ℓ = (1/N_{ℓ-1}) φ(F_{ℓ-1}) φ^T(F_{ℓ-1}).

Crucially, F_ℓ depends on the previous-layer activities, F_{ℓ-1}, only through the kernel, K_ℓ. As such, we can write the generative model as (Fig. 3 middle)

K_ℓ = (1/N_0) X X^T for ℓ = 1, and K_ℓ = (1/N_{ℓ-1}) φ(F_{ℓ-1}) φ^T(F_{ℓ-1}) otherwise, (44a)
P(F_ℓ|K_ℓ) = ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K_ℓ), (44b)

where we have explicitly included the kernel, K_ℓ, as a latent variable. This form highlights that BNNs are deep GPs, in the sense that the features f_{λℓ} are Gaussian, with a kernel that depends on the activations from the previous layer. Indeed, note that any deep GP has this form (i.e. including those with kernels that cannot be written as a function of the Gram matrix), as a kernel, K_ℓ, is by definition a matrix that can be written as the outer product of a potentially infinite number of features, φ(F_{ℓ-1}), where we allow φ to be a much richer class of functions than the usual pointwise nonlinearities (Hofmann et al., 2008). We might now try to follow the approach we took above for deep GPs, and consider a Wishart-distributed Gram matrix, G_ℓ = (1/N_ℓ) F_ℓ F_ℓ^T. However, for BNNs we encounter an issue: we are not able to compute the kernel, K_ℓ, using the Gram matrix, G_ℓ, alone; we need the full set of features, F_ℓ. Instead, we need an alternative approach to show that a neural network is a deep kernel process. In particular, after integrating out the weights, the resulting distribution is chain-structured (Fig. 3 middle), so in principle we can integrate out F_ℓ to obtain a distribution over K_ℓ conditioned on K_{ℓ-1}, giving the DKP model in Fig. 3 (bottom),

P(K_ℓ|K_{ℓ-1}) = ∫ dF_{ℓ-1} δ_D(K_ℓ − (1/N_{ℓ-1}) φ(F_{ℓ-1}) φ^T(F_{ℓ-1})) P(F_{ℓ-1}|K_{ℓ-1}), (45)

where P(F_{ℓ-1}|K_{ℓ-1}) is given by Eq. (44b) and δ_D is the Dirac delta, representing the deterministic distribution, P(K_ℓ|F_{ℓ-1}) (Eq. 44a). Using this integral to write out the generative process only in terms of K_ℓ gives the deep kernel process in Fig. 3 (bottom). While this distribution exists in principle, it cannot be evaluated analytically. However, we can explicitly evaluate the expected value of K_{ℓ+1} given K_ℓ using results from Cho & Saul (2009). In particular, we take Eq. (44a), write out the matrix multiplication explicitly as a sum of vector outer products, and note that as f_{λℓ} is IID across λ, the empirical average is equal to the expectation of a single term, which is computed by Cho & Saul (2009),

E[K_{ℓ+1}|K_ℓ] = (1/N_ℓ) ∑_{λ=1}^{N_ℓ} E[φ(f_{λℓ}) φ^T(f_{λℓ})|K_ℓ] = E[φ(f_{λℓ}) φ^T(f_{λℓ})|K_ℓ] = ∫ df_{λℓ} N(f_{λℓ}; 0, K_ℓ) φ(f_{λℓ}) φ^T(f_{λℓ}) ≡ K(K_ℓ).

Finally, we define this expectation to be K(K_ℓ) in the case of NNs.
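For φ = ReLU, this expectation is the degree-1 arc-cosine kernel of Cho & Saul (2009). The following sketch (NumPy, with an illustrative toy K of our own choosing) checks the closed form against a Monte Carlo estimate of E[φ(f) φ^T(f)] with f ∼ N(0, K):

```python
import numpy as np

rng = np.random.default_rng(2)

def relu_kernel(K):
    # Closed-form E[phi(f) phi(f)^T] for phi = ReLU and f ~ N(0, K):
    # the degree-1 arc-cosine kernel of Cho & Saul (2009).
    d = np.sqrt(np.diag(K))
    c = np.clip(K / np.outer(d, d), -1.0, 1.0)       # correlations
    theta = np.arccos(c)
    return np.outer(d, d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# Monte Carlo estimate of E[phi(f) phi(f)^T | K] with f ~ N(0, K).
P, S = 3, 200_000
A = rng.standard_normal((P, P))
K = A @ A.T / P + 0.5 * np.eye(P)                    # toy input kernel
f = np.linalg.cholesky(K) @ rng.standard_normal((P, S))
phi = np.maximum(f, 0.0)                             # ReLU features
K_mc = phi @ phi.T / S

print(np.max(np.abs(K_mc - relu_kernel(K))))
```

On the diagonal the closed form reduces to E[relu(u)²] = K_ii/2, which is a quick sanity check.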

C.2 INFINITE NNS

We have found that for standard finite neural networks, we are not able to compute the distribution over K_ℓ conditioned on K_{ℓ-1} (Eq. 45). To resolve this issue, one approach is to consider the limit of an infinitely wide neural network. In this limit, K_{ℓ+1} becomes a deterministic function of K_ℓ: K_{ℓ+1} can be written as the average of N_ℓ IID outer products, and as N_ℓ grows to infinity, the law of large numbers tells us that this average becomes equal to its expectation,

lim_{N_ℓ→∞} K_{ℓ+1} = lim_{N_ℓ→∞} (1/N_ℓ) ∑_{λ=1}^{N_ℓ} φ(f_{λℓ}) φ^T(f_{λℓ}) = E[φ(f_{λℓ}) φ^T(f_{λℓ})|K_ℓ] = K(K_ℓ).
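A quick numerical illustration of this law-of-large-numbers argument (NumPy; sizes are our own illustrative choices): the fluctuations of the empirical average around its expectation shrink as the width N grows. Here a very wide average stands in as a proxy for the exact expectation:

```python
import numpy as np

rng = np.random.default_rng(3)

# K_{l+1} = (1/N) sum_lam phi(f_lam) phi(f_lam)^T is an average of N i.i.d.
# outer products, so it concentrates on its expectation as the width N grows.
P = 3
A = rng.standard_normal((P, P))
K = A @ A.T / P + 0.5 * np.eye(P)   # toy previous-layer kernel
L = np.linalg.cholesky(K)

def next_kernel(N):
    f = L @ rng.standard_normal((P, N))   # N samples f_lam ~ N(0, K)
    phi = np.maximum(f, 0.0)              # ReLU
    return phi @ phi.T / N

K_limit = next_kernel(2_000_000)          # large-N proxy for E[phi phi^T | K]
errs = [np.max(np.abs(next_kernel(N) - K_limit)) for N in (100, 100_000)]
print(errs)  # deviation from the limit shrinks as N grows
```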

C.3 INFINITE NNS WITH BOTTLENECKS

In infinite NNs, the kernel is deterministic, meaning that there is no flexibility/variability, and hence no capability for representation learning (Aitchison, 2019). Here, we consider infinite networks with bottlenecks, which combine the tractability of infinite networks with the flexibility of finite networks (Aitchison, 2019). The trick is to separate flexible, finite linear "bottlenecks" from infinite-width nonlinearities. We keep the nonlinearity infinite in order to ensure that the output kernel is deterministic and can be computed using results from Cho & Saul (2009). In particular, we use finite-width F_ℓ ∈ R^{P×N_ℓ} and infinite-width F'_ℓ ∈ R^{P×M_ℓ} (we send M_ℓ to infinity while leaving N_ℓ finite),

P(W_ℓ) = ∏_{λ=1}^{N_ℓ} N(w_{λℓ}; 0, I/M_{ℓ-1}), with M_0 = N_0,
F_ℓ = X W_1 if ℓ = 1, and F_ℓ = φ(F'_{ℓ-1}) W_ℓ otherwise,
P(M_ℓ) = ∏_{λ=1}^{M_ℓ} N(m_{λℓ}; 0, I/N_ℓ),
F'_ℓ = F_ℓ M_ℓ.

This generative process is given graphically in Fig. 4 (top).

Integrating over the expansion weights, M_ℓ ∈ R^{N_ℓ×M_ℓ}, and the bottleneck weights, W_ℓ ∈ R^{M_{ℓ-1}×N_ℓ}, the generative model (Fig. 4 second row) can be rewritten as

K_ℓ = (1/N_0) X X^T for ℓ = 1, and K_ℓ = (1/M_{ℓ-1}) φ(F'_{ℓ-1}) φ^T(F'_{ℓ-1}) otherwise, (49a)
P(F_ℓ|K_ℓ) = ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K_ℓ), (49b)
G_ℓ = (1/N_ℓ) F_ℓ F_ℓ^T, (49c)
P(F'_ℓ|G_ℓ) = ∏_{λ=1}^{M_ℓ} N(f'_{λℓ}; 0, G_ℓ). (49d)

Remembering that K_{ℓ+1} is the empirical mean of M_ℓ IID terms, as M_ℓ → ∞ it converges on its expectation,

lim_{M_ℓ→∞} K_{ℓ+1} = lim_{M_ℓ→∞} (1/M_ℓ) ∑_{λ=1}^{M_ℓ} φ(f'_{λℓ}) φ^T(f'_{λℓ}) = E[φ(f'_{λℓ}) φ^T(f'_{λℓ})|G_ℓ] = K(G_ℓ), (50)

and we define this limit to be K(G_ℓ). Note that if we use standard (e.g. ReLU) nonlinearities, we can use results from Cho & Saul (2009) to compute K(G_ℓ). Thus, we get the following generative process,

K_ℓ = (1/N_0) X X^T for ℓ = 1, and K_ℓ = K(G_{ℓ-1}) otherwise,
P(G_ℓ) = W(G_ℓ; K_ℓ/N_ℓ, N_ℓ).

Finally, eliminating the deterministic kernels, K_ℓ, from the model, we obtain exactly the deep GP generative model in Eq. 6 (Fig. 4, fourth row).

D STANDARD APPROXIMATE POSTERIORS OVER FEATURES AND WEIGHTS FAIL TO CAPTURE SYMMETRIES

We have shown that it is possible to represent DGPs and a variety of NNs as deep kernel processes. Here, we argue that standard deep GP approximate posteriors are seriously flawed, and that working with deep kernel processes may alleviate these flaws. In particular, we show that the true DGP posterior has rotational symmetries and that the true BNN posterior has permutation symmetries that are not captured by standard variational posteriors.

D.1 PERMUTATION SYMMETRIES IN DNNS POSTERIORS OVER WEIGHTS

Permutation symmetries in neural network posteriors were known in classical work on Bayesian neural networks (e.g. MacKay, 1992). Here, we spell out the argument in full. Taking P to be a permutation matrix (i.e. a unitary matrix, with PP^T = I and a single 1 in every row and column), we have

φ(F)P = φ(FP), (52)

i.e. permuting the input to a nonlinearity is equivalent to permuting its output. Expanding two steps of the recursion defined by Eq. (41b),

F_ℓ = φ(φ(F_{ℓ-2}) W_{ℓ-1}) W_ℓ,

multiplying by the identity, PP^T = I,

F_ℓ = φ(φ(F_{ℓ-2}) W_{ℓ-1}) P P^T W_ℓ, where P ∈ R^{N_{ℓ-1}×N_{ℓ-1}},

applying Eq. (52),

F_ℓ = φ(φ(F_{ℓ-2}) W_{ℓ-1} P) P^T W_ℓ,

and defining permuted weights, W'_{ℓ-1} = W_{ℓ-1} P and W'_ℓ = P^T W_ℓ, the output is the same under the original or permuted weights,

F_ℓ = φ(φ(F_{ℓ-2}) W_{ℓ-1}) W_ℓ = φ(φ(F_{ℓ-2}) W'_{ℓ-1}) W'_ℓ.

Introducing a different permutation between every pair of layers, we get a more general symmetry,

W'_1 = W_1 P_1, W'_ℓ = P^T_{ℓ-1} W_ℓ P_ℓ for ℓ ∈ {2, . . . , L}, W'_{L+1} = P^T_L W_{L+1},

where P_ℓ ∈ R^{N_ℓ×N_ℓ}. As the output of the neural network is the same under any of these permutations, the likelihoods for original and permuted weights are equal,

P(Y|X, W_1, . . . , W_{L+1}) = P(Y|X, W'_1, . . . , W'_{L+1}),

and as the prior over elements within a weight matrix is IID Gaussian (Eq. 41a), the prior probability density is equal under original and permuted weights,

P(W_1, . . . , W_{L+1}) = P(W'_1, . . . , W'_{L+1}).

Thus, the joint probability is invariant to permutations,

P(Y|X, W_1, . . . , W_{L+1}) P(W_1, . . . , W_{L+1}) = P(Y|X, W'_1, . . . , W'_{L+1}) P(W'_1, . . . , W'_{L+1}),

and applying Bayes theorem, the posterior is invariant to permutations,

P(W_1, . . . , W_{L+1}|Y, X) = P(W'_1, . . . , W'_{L+1}|Y, X).

Due in part to these permutation symmetries, the posterior distribution over weights is extremely complex and multimodal.
Importantly, it is not possible to capture these symmetries using standard variational posteriors over weights, such as factorised posteriors. However, it is also not necessary to capture them if we work with Gram matrices and kernels, which are invariant to permutations (and other unitary transformations; Eq. 12).
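The permutation symmetry is straightforward to verify numerically. The sketch below (NumPy, with a toy two-hidden-layer ReLU network whose sizes are our own illustrative choices) checks that the outputs under the original and permuted weights agree:

```python
import numpy as np

rng = np.random.default_rng(4)

relu = lambda x: np.maximum(x, 0.0)

# A toy two-hidden-layer network (sizes are illustrative).
P, N0, N1, N2, N3 = 6, 4, 5, 7, 2
X = rng.standard_normal((P, N0))
W1 = rng.standard_normal((N0, N1))
W2 = rng.standard_normal((N1, N2))
W3 = rng.standard_normal((N2, N3))

def forward(W1, W2, W3):
    return relu(relu(X @ W1) @ W2) @ W3

# Permute the hidden units: W1' = W1 P1, W2' = P1^T W2 P2, W3' = P2^T W3.
P1 = np.eye(N1)[:, rng.permutation(N1)]
P2 = np.eye(N2)[:, rng.permutation(N2)]
out = forward(W1, W2, W3)
out_perm = forward(W1 @ P1, P1.T @ W2 @ P2, P2.T @ W3)

print(np.max(np.abs(out - out_perm)))  # zero up to floating-point error
```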

D.2 ROTATIONAL SYMMETRIES IN DEEP GP POSTERIORS

To show that deep GP posteriors are invariant to unitary transformations, U_ℓ ∈ R^{N_ℓ×N_ℓ}, where U_ℓ U_ℓ^T = I, we define transformed features, F'_ℓ = F_ℓ U_ℓ. To evaluate P(F_ℓ|F'_{ℓ-1}), we begin by substituting for F'_{ℓ-1},

P(F_ℓ|F'_{ℓ-1}) = ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K((1/N_{ℓ-1}) F'_{ℓ-1} F'^T_{ℓ-1}))
= ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K((1/N_{ℓ-1}) F_{ℓ-1} U_{ℓ-1} U^T_{ℓ-1} F^T_{ℓ-1}))
= ∏_{λ=1}^{N_ℓ} N(f_{λℓ}; 0, K((1/N_{ℓ-1}) F_{ℓ-1} F^T_{ℓ-1}))
= P(F_ℓ|F_{ℓ-1}).

To evaluate P(F'_ℓ|F_{ℓ-1}), we substitute for F'_ℓ in the explicit form of the multivariate Gaussian log density,

log P(F'_ℓ|F_{ℓ-1}) = −(1/2) Tr(F'^T_ℓ K^{-1}_{ℓ-1} F'_ℓ) + const
= −(1/2) Tr(K^{-1}_{ℓ-1} F'_ℓ F'^T_ℓ) + const
= −(1/2) Tr(K^{-1}_{ℓ-1} F_ℓ U_ℓ U^T_ℓ F^T_ℓ) + const
= −(1/2) Tr(K^{-1}_{ℓ-1} F_ℓ F^T_ℓ) + const
= log P(F_ℓ|F_{ℓ-1}),

where K_{ℓ-1} = K((1/N_{ℓ-1}) F_{ℓ-1} F^T_{ℓ-1}), and the constant depends only on F_{ℓ-1}. Combining these derivations, each of these conditionals is invariant to rotations of F_ℓ and F_{ℓ-1},

P(F'_ℓ|F'_{ℓ-1}) = P(F'_ℓ|F_{ℓ-1}) = P(F_ℓ|F_{ℓ-1}).

The same argument can straightforwardly be extended to the inputs,

P(F'_1|X) = P(F_1|X),

and to the final probability density, for the output activations, F_{L+1}, which are not transformed,

P(F_{L+1}|F'_L) = P(F_{L+1}|F_L).

Therefore, we have

P(F'_1, . . . , F'_L, F_{L+1}, Y|X) = P(Y|F_{L+1}) P(F_{L+1}|F'_L) ∏_{ℓ=2}^{L} P(F'_ℓ|F'_{ℓ-1}) P(F'_1|X)
= P(Y|F_{L+1}) P(F_{L+1}|F_L) ∏_{ℓ=2}^{L} P(F_ℓ|F_{ℓ-1}) P(F_1|X)
= P(F_1, . . . , F_L, F_{L+1}, Y|X).

Therefore, applying Bayes theorem, the posterior is invariant to rotations,

P(F'_1, . . . , F'_L, F_{L+1}|X, Y) = P(F_1, . . . , F_L, F_{L+1}|X, Y).

Importantly, these posterior symmetries are not captured by standard variational posteriors with non-zero means (e.g. Salimbeni & Deisenroth, 2017).
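Both invariances can be checked numerically. The sketch below (NumPy/SciPy; sizes are illustrative, and a linear kernel of the Gram matrix plus jitter stands in for the kernel function, which is an assumption of the sketch) verifies that rotating the previous layer leaves the kernel unchanged, and that rotating the current layer leaves the Gaussian log density unchanged:

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(5)

P, N_prev, N_l = 4, 3, 5              # illustrative sizes
F_prev = rng.standard_normal((P, N_prev))
# Linear kernel of the Gram matrix plus jitter (a stand-in for K(.)).
K = F_prev @ F_prev.T / N_prev + 1e-3 * np.eye(P)
F = np.linalg.cholesky(K) @ rng.standard_normal((P, N_l))

def log_density(F, K):
    # log P(F|K) = -1/2 Tr(F^T K^{-1} F) + const (constant depends on K only).
    return -0.5 * np.trace(F.T @ np.linalg.solve(K, F))

U = ortho_group.rvs(N_l, random_state=6)       # U U^T = I
U_prev = ortho_group.rvs(N_prev, random_state=7)

# Rotating F_prev leaves the Gram matrix, and hence K, unchanged ...
K_rot = (F_prev @ U_prev) @ (F_prev @ U_prev).T / N_prev + 1e-3 * np.eye(P)
print(np.max(np.abs(K_rot - K)))
# ... and rotating F leaves the (unnormalised) log density unchanged.
print(log_density(F, K), log_density(F @ U, K))
```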

D.3 THE TRUE POSTERIOR OVER FEATURES IN A DGP HAS ZERO MEAN

We can use this symmetry to show that the posterior over F_ℓ has zero mean. We begin by writing the expectation as an integral,

E[F_ℓ|F_{ℓ-1}, F_{ℓ+1}] = ∫ dF F P(F_ℓ = F|F_{ℓ-1}, F_{ℓ+1}).

Changing variables in the integral to F' = −F, and noting that the absolute value of the Jacobian is 1, we have

E[F_ℓ|F_{ℓ-1}, F_{ℓ+1}] = ∫ dF' (−F') P(F_ℓ = −F'|F_{ℓ-1}, F_{ℓ+1}).

Using the symmetry of the posterior (taking U_ℓ = −I, which is orthogonal),

E[F_ℓ|F_{ℓ-1}, F_{ℓ+1}] = ∫ dF' (−F') P(F_ℓ = F'|F_{ℓ-1}, F_{ℓ+1}) = −E[F_ℓ|F_{ℓ-1}, F_{ℓ+1}].

The expectation is equal to minus itself, so it must be zero,

E[F_ℓ|F_{ℓ-1}, F_{ℓ+1}] = 0.

E DIFFICULTIES WITH VI IN DEEP WISHART PROCESSES

The deep Wishart generative process is well-defined as long as we admit singular Wishart distributions (Uhlig, 1994; Srivastava et al., 2003). The issue comes when we try to form a variational approximate posterior over low-rank positive semidefinite matrices. The matrices are typically low-rank because the number of datapoints, P, is usually far larger than the number of features, N_ℓ. In particular, the only convenient distribution over low-rank positive semidefinite matrices is the Wishart itself,

Q(G_ℓ) = W(G_ℓ; Ψ_ℓ/N_ℓ, N_ℓ).

However, a key feature of most variational approximate posteriors is the ability to increase and decrease the variance independently of other properties, such as the mean and, in our case, the rank of the matrix. For a Wishart, the mean and variance are given by

E_{Q(G_ℓ)}[G_ℓ] = Ψ_ℓ, V_{Q(G_ℓ)}[(G_ℓ)_ij] = (1/N_ℓ)(Ψ²_ij + Ψ_ii Ψ_jj).

Initially, this may look fine: we can increase or decrease the variance by changing N_ℓ. However, remember that N_ℓ is the degrees of freedom, which controls the rank of the matrix, G_ℓ. As such, N_ℓ is fixed by the prior: the prior and approximate posterior must define distributions over matrices of the same rank. And once N_ℓ is fixed, we no longer have independent control over the variance. To resolve this issue, we would need a distribution over low-rank matrices with independent control of the mean and variance. The natural approach is to use a non-central Wishart, defined as the outer product of Gaussian-distributed vectors with non-zero means. While this distribution is easy to sample from and does give independent control over the rank, mean and variance, its probability density is prohibitively costly and complex to evaluate (Koev & Edelman, 2006).
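The coupling between rank and variance can be seen directly in samples. The sketch below (NumPy; sizes and sample counts are our own illustrative choices) checks the moment formulae for G ∼ W(Ψ/N, N) and shows that the degrees of freedom, N, also sets the rank:

```python
import numpy as np

rng = np.random.default_rng(6)

# Moments of G ~ W(Psi/N, N): E[G] = Psi, V[G_ij] = (Psi_ij^2 + Psi_ii Psi_jj)/N.
P, N, S = 3, 10, 100_000           # illustrative sizes; S = number of samples
A = rng.standard_normal((P, P))
Psi = A @ A.T / P + 0.5 * np.eye(P)
L = np.linalg.cholesky(Psi / N)

Z = rng.standard_normal((S, P, N))
F = L @ Z                          # broadcasts to S samples of shape (P, N)
G = F @ F.transpose(0, 2, 1)       # S Wishart samples, each (P, P)

var_theory = (Psi**2 + np.outer(np.diag(Psi), np.diag(Psi))) / N
print(np.max(np.abs(G.mean(0) - Psi)))
print(np.max(np.abs(G.var(0) - var_theory)))

# N is also the rank: with N = 2 < P degrees of freedom, samples are rank 2,
# so fixing the rank (via the prior) also fixes the variance.
F2 = np.linalg.cholesky(Psi / 2) @ rng.standard_normal((P, 2))
print(np.linalg.matrix_rank(F2 @ F2.T))
```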

F SINGULAR (INVERSE) WISHART PROCESSES AT THE INPUT LAYER

In almost all cases of interest, the kernel functions K(G) return full-rank matrices, so we can use standard (inverse) Wishart distributions, which assume that the input matrix is full-rank. However, this is not true at the input layer, as K_0 = (1/N_0) X X^T will often be low-rank. This would require us to use singular (inverse) Wishart distributions, which are in general difficult to work with (Uhlig, 1994; Srivastava et al., 2003; Bodnar & Okhrin, 2008; Bodnar et al., 2016). Instead, we exploit knowledge of the input features to work with a smaller, full-rank matrix, Ω ∈ R^{N_0×N_0}, where, remember, N_0 is the number of input features in X. For a deep Wishart process,

G_1 = (1/N_0) X Ω X^T ∼ W(K_0/N_1, N_1), where Ω ∼ W(I/N_1, N_1), (88)

and for a deep inverse Wishart process,

G_1 = (1/N_0) X Ω X^T ∼ W^{-1}(δ_1 K_0, δ_1 + P + 1), where Ω ∼ W^{-1}(δ_1 I, δ_1 + N_0 + 1). (89)

Now, we are able to use the full-rank matrix, Ω, rather than the low-rank matrix, G_1, as the random variable for variational inference. For the approximate posterior over Ω in a deep inverse Wishart process, we use

Q(Ω) = W^{-1}(δ_1 I + V_1 V_1^T, δ_1 + γ_1 + (N_0 + 1)). (90)

Note that in the usual case where there are fewer inducing points than input features, the matrix K_0 will be full-rank, and we can work with G_1 as the random variable as usual.
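The following sketch (NumPy; all sizes are our own illustrative choices) illustrates the trick for the deep Wishart case: with fewer input features than datapoints, K_0 is low-rank, but sampling the small full-rank Ω and forming G_1 = (1/N_0) X Ω X^T gives a matrix with the right (low) rank, and with first moment E[G_1] = K_0, since E[Ω] = I:

```python
import numpy as np

rng = np.random.default_rng(7)

# P = 6 datapoints but only N0 = 2 input features: K0 = X X^T / N0 is rank 2.
P, N0, N1 = 6, 2, 3               # illustrative sizes
X = rng.standard_normal((P, N0))
K0 = X @ X.T / N0

def sample_G1(rng):
    # Omega ~ W(I/N1, N1) via Omega = W W^T with W_{i,lam} ~ N(0, 1/N1),
    # then G1 = X Omega X^T / N0.
    W = rng.standard_normal((N0, N1)) / np.sqrt(N1)
    return X @ (W @ W.T) @ X.T / N0

G1 = sample_G1(rng)
print(np.linalg.matrix_rank(K0), np.linalg.matrix_rank(G1))  # both rank 2

# E[Omega] = I, so E[G1] = K0: the small full-rank Omega carries the
# randomness that a singular Wishart over G1 would otherwise have to describe.
G1_mean = np.mean([sample_G1(rng) for _ in range(50_000)], axis=0)
print(np.max(np.abs(G1_mean - K0)))
```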

G APPROXIMATE POSTERIORS OVER OUTPUT FEATURES

To define approximate posteriors over inducing outputs, we are inspired by global inducing point methods (Ober & Aitchison, 2020). In particular, we take the approximate posterior to be the prior multiplied by a "pseudo-likelihood",

Q(F_{L+1}|G_L) ∝ P(F_{L+1}|G_L) ∏_{λ=1}^{N_{L+1}} N(v_λ; f_{λ,L+1}, Λ^{-1}_λ). (91)

This is valid both for global inducing inputs and (for small datasets) training inputs; the key thing to remember is that in either case, for any given input (e.g. an MNIST handwritten 2), there is a desired output (e.g. the class-label "2"), and the top-layer global inducing outputs, v_λ, express these desired outcomes. Substituting for the prior,

Q(F_{L+1}|G_L) ∝ ∏_{λ=1}^{N_{L+1}} N(f_{λ,L+1}; 0, K(G_L)) N(v_λ; f_{λ,L+1}, Λ^{-1}_λ).

G = LΩL^T, G' = LΩ'L^T, where K = LL^T, such that

Ω ∼ W(I/N, N), Ω' ∼ W^{-1}(NI, λN), G ∼ W(K/N, N), G' ∼ W^{-1}(NK, λN).

Note that as the standard Wishart and inverse Wishart have uniform distributions over eigenvectors (Shah et al., 2014), the priors differ only in the distribution over the eigenvalues of Ω and Ω'. We plotted the eigenvalue histogram for samples from a Wishart distribution with N = P = 2000 (Fig. 5, top left). This corresponds to an IID Gaussian prior over weights, with 2000 features in the input and output layers. Notably, there are many very small eigenvalues, which are undesirable as they eliminate information present in the input. To eliminate these very small eigenvalues, a common approach is to use a ResNet-inspired architecture (which is done even in the deep GP literature, e.g. Salimbeni & Deisenroth, 2017). To understand the eigenvalues in a residual layer, we define a ResW distribution by taking the outer product of a weight matrix with itself,

WW^T = Ω ∼ ResW(N, α),

where the weight matrix is IID Gaussian plus the identity matrix, with the identity weighted by α,

W = (1/√(1+α²)) (ξ/√N + αI), ξ_{iλ} ∼ N(0, 1).
With α = 1, there are still many very small eigenvalues, but these disappear as α increases. We compared these distributions to inverse Wishart distributions (Fig. 5, bottom) with varying degrees of freedom. For all degrees of freedom, we found that inverse Wishart distributions do not produce very small eigenvalues, which would eliminate information. As such, their eigenvalue distributions resemble those of ResW with α larger than 1.
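A minimal sketch of this eigenvalue comparison (NumPy; we use N = 300 rather than the N = 2000 of Fig. 5, for speed): the ordinary Wishart sample has near-zero eigenvalues, while the ResW residual branch keeps the spectrum away from zero:

```python
import numpy as np

rng = np.random.default_rng(8)

def resw_eigvals(N, alpha, rng):
    # Omega = W W^T with W = (xi/sqrt(N) + alpha*I) / sqrt(1 + alpha^2),
    # xi_{i,lam} ~ N(0, 1): an i.i.d. Gaussian weight matrix plus an
    # identity "residual" branch weighted by alpha.
    xi = rng.standard_normal((N, N))
    W = (xi / np.sqrt(N) + alpha * np.eye(N)) / np.sqrt(1 + alpha**2)
    return np.linalg.eigvalsh(W @ W.T)

N = 300                                   # illustrative; Fig. 5 uses N = 2000
eig_wishart = resw_eigvals(N, 0.0, rng)   # alpha = 0: ordinary Wishart
eig_resw = resw_eigvals(N, 3.0, rng)      # alpha = 3: strong residual branch

# The ordinary Wishart produces near-zero eigenvalues (information loss);
# the residual branch pushes the whole spectrum away from zero.
print(eig_wishart.min(), eig_resw.min())
```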

I DOUBLY STOCHASTIC VARIATIONAL INFERENCE IN DEEP INVERSE WISHART PROCESSES

Due to the doubly stochastic results in Sec. 4.3, we only need to compute the conditional distribution over a single test/train point (we do not need the joint distribution over a number of test points). As such, we can decompose G and Ψ as

G = [G_ii, g_it; g^T_it, g_tt], Ψ = [Ψ_ii, ψ_it; ψ^T_it, ψ_tt],

where G_ii, Ψ_ii ∈ R^{P_i×P_i}, g_it, ψ_it ∈ R^{P_i×1} are column vectors, and g_tt and ψ_tt are scalars. Taking the results in Eq. (31) to the univariate case,

g_tt·i = g_tt − g^T_it G^{-1}_ii g_it, ψ_tt·i = ψ_tt − ψ^T_it Ψ^{-1}_ii ψ_it.

As g_tt·i is univariate, its distribution becomes inverse gamma,

g_tt·i|G_ii, G_{ℓ-1} ∼ InverseGamma(α = (δ + P_t + P_i + 1)/2, β = ψ_tt·i/2).

As g_it is a vector rather than a matrix, its distribution becomes Gaussian,

G^{-1}_ii g_it|g_tt·i, G_ii, G_{ℓ-1} ∼ N(Ψ^{-1}_ii ψ_it, g_tt·i Ψ^{-1}_ii).

J SAMPLES FROM THE 1D PRIOR AND APPROXIMATE POSTERIOR

First, we drew samples from a one-layer (top) and two-layer (bottom) deep inverse Wishart process with a squared-exponential kernel (Fig. 6). We found considerable differences in the function family corresponding to different prior samples of the top-layer Gram matrix, G_L (panels). While differences across function classes in a one-layer IW process can be understood as equivalent to doing inference over a prior on the lengthscale, this is not true of the two-layer process. To emphasise this, the two-layer panels all use the same, fixed first-layer sample (equivalent to choosing a lengthscale), so variability in the function class arises only from sampling the second-layer Gram matrix, G_2. Next, we exploited the kernel flexibility in IW processes by training a one-layer deep IW model with a fixed kernel bandwidth on data generated with various bandwidths.
The first row of Figure 7 shows posterior samples from one-layer deep IW processes trained on different datasets. For each panel, we first sampled five full G_1 matrices using Eqs. (31a) and (31b). Then, for each G_1, we used Gaussian conditioning to obtain a posterior distribution at the test locations and drew one sample from that posterior, plotted as a single line. Remarkably, these posterior samples exhibit wiggles that are consistent with the training data even outside the training range, which highlights the additional kernel flexibility in IW processes. In contrast, as the model bandwidth was fixed, samples from vanilla GPs with a fixed bandwidth (second row) display almost identical shapes outside the training range across the different training sets.

K WHY WE CARE ABOUT THE ELBO

While we have shown that DIWP offers some benefits in predictive performance, it gives much more dramatic improvements in the ELBO. While we might think that predictive performance is the only goal, there are two reasons to believe that the ELBO itself is also an important metric. First, the ELBO is very closely related to PAC-Bayesian generalisation bounds (e.g. Germain et al., 2016). In particular, these bounds are generally written in terms of the average training log-likelihood, penalised by the KL-divergence between the approximate posterior over parameters and the prior. This mirrors the standard form of the ELBO,

L = E_{Q(z)}[log P(x|z)] − D_KL(Q(z) || P(z)),

where x is all the data (here, the inputs, X, and outputs, Y), and z is all the latent variables. Remarkably, Germain et al. (2016) present a bound on the test log-likelihood that is exactly the ELBO per data point, up to additive constants. As such, in certain circumstances, optimizing the ELBO is equivalent to optimizing a PAC-Bayes bound on the test log-likelihood. Similar results are available in Rivasplata et al. (2019). Second, we can write down an alternative form of the ELBO, as the model evidence minus the KL-divergence between the approximate and true posterior,

L = log P(x) − D_KL(Q(z) || P(z|x)) ≤ log P(x).

As such, for a fixed generative model, and hence a fixed value of the model evidence, log P(x), the ELBO measures the closeness of the variational approximate posterior, Q(z), to the true posterior, P(z|x). As we are trying to perform Bayesian inference, our goal should be to make the approximate posterior as close as possible to the true posterior. If, for instance, we could set Q(z) to give better predictive performance while being further from the true posterior, that may be acceptable in certain settings, but not when the goal is inference.
Obviously, it is desirable for the true and approximate posterior to be as close as possible, which corresponds to larger values of L (indeed, when the approximate posterior equals the true posterior, the KL-divergence is zero, and L = log P(x)).

L DIFFERENCES WITH SHAH ET AL. (2014)

For a one-layer deep inverse Wishart process, using our definition in Eq. (16),

K_0 = (1/N_0) X X^T, (104a)
P(G_1|K_0) = W^{-1}(δ_1 K_0, δ_1 + (P + 1)), (104b)
P(y_λ|G_1) = N(y_λ; 0, K(G_1)). (104c)

Importantly, we apply the nonlinear kernel transformation after sampling the inverse Wishart, so the inverse-Wishart sample acts as a generalised lengthscale hyperparameter (App. B), and hence dramatically changes the function family. In contrast, for Shah et al. (2014), the nonlinear kernel is computed before the inverse Wishart is sampled, and the inverse-Wishart sample is used directly as the covariance of the Gaussian,

K_0 = K((1/N_0) X X^T), (105a)
P(G_1|K_0) = W^{-1}(δ_1 K_0, δ_1 + (P + 1)), (105b)
P(y_λ|G_1) = N(y_λ; 0, G_1). (105c)

This difference in ordering, and in particular the lack of a nonlinear kernel transformation between the inverse Wishart and the output, is why Shah et al. (2014) obtained trivial results in their model (it is equivalent to multiplying the covariance by a random scale).
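The two forms of the ELBO in App. K can be checked exactly in a conjugate toy model (a sketch with our own illustrative numbers: z ∼ N(0, 1), x|z ∼ N(z, s2), Gaussian Q):

```python
import numpy as np

# Conjugate toy model: z ~ N(0, 1), x | z ~ N(z, s2); everything is
# closed-form, so L = log P(x) - KL(Q || P(z|x)) can be checked exactly.
s2 = 0.5                          # likelihood variance (illustrative)
x = 1.3                           # a single observation (illustrative)

def log_norm(y, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)).
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(m, v):
    # E_Q[log P(x|z)] - KL(Q || P(z)), with Q = N(m, v);
    # E_Q[(x - z)^2] = (x - m)^2 + v makes the expectation analytic.
    return log_norm(x, m, s2) - 0.5 * v / s2 - kl_gauss(m, v, 0.0, 1.0)

log_evidence = log_norm(x, 0.0, 1.0 + s2)
v_post = 1.0 / (1.0 + 1.0 / s2)   # true posterior is N(m_post, v_post)
m_post = v_post * x / s2

# The gap log P(x) - L equals KL(Q || posterior), and vanishes at Q = posterior.
for m, v in [(0.0, 1.0), (0.5, 0.2), (m_post, v_post)]:
    print(log_evidence - elbo(m, v), kl_gauss(m, v, m_post, v_post))
```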



Note that we leave the question of the full Kolmogorov extension theorem (Kolmogorov, 1933) for matrices to future work: for our purposes, it is sufficient to work with very large but ultimately finite input spaces, as in practice the input vectors are represented by elements of the finite set of 32-bit or 64-bit floating-point numbers (Sterbenz, 1974).



Figure 2: Visualisations of a single prior sample of the kernels and Gram matrices as they pass through the network. We use 1D, equally spaced inputs with a squared exponential kernel. As we transition K(G_{ℓ-1}) → G_ℓ, we add "noise" by sampling from a Wishart (top) or an inverse Wishart (bottom). As we transition from G_ℓ to K(G_ℓ), we deterministically transform the Gram matrix using a squared-exponential kernel.

Figure 4: A series of generative models for an infinite network with bottlenecks. First row. The standard model. Second row. Integrating out the weights. Third row. Integrating out the features, the Gram matrices are Wishart-distributed, and the kernels are deterministic. Last row. Eliminating all deterministic random variables, we get a model equivalent to that for DGPs (Fig. 1 bottom).

Figure 5: Eigenvalue histograms for a single sample from the labelled distribution, with N = 2000.


Figure 6: Samples from a one-layer (top) and a two-layer (bottom) deep IW process prior (Eq. 16). On the far left, we have included a set of samples from a GP with the same kernel, for comparison. This GP is equivalent to sending δ_0 → ∞ in the one-layer deep IW process, and additionally sending δ_1 → ∞ in the two-layer deep IW process. All of the deep IW process panels use the same squared-exponential kernel with bandwidth 1 and δ_0 = δ_1 = 0. For each panel, we draw a single sample of the top-layer Gram matrix, G_L, then draw multiple GP-distributed functions, conditioned on that Gram matrix.

Performance in terms of ELBO, test log-likelihood and test accuracy for fully-connected three-layer (two hidden layer) DGPs, NNGPs and DIWP on MNIST and CIFAR-10. We do not expect these results to be competitive with the state-of-the-art (e.g. from CNNs) for these datasets, as we are using fully-connected networks with only 100 inducing points (whereas e.g. work in the NNGP literature uses the full 60,000 × 60,000 covariance matrix). Nonetheless, as the architectures are carefully matched, this provides another opportunity to compare the performance of DIWPs, NNGPs and DGPs. Again, we found that DIWP usually gave statistically significant but perhaps underwhelming gains in predictive performance (except for CIFAR-10 test log-likelihood, where DIWP lagged by only 0.01). Importantly, DIWP gives very large improvements in the ELBO, with gains of 0.09 against DGPs for MNIST and 0.08 for CIFAR-10.

and computing this value gives the approximate posterior in the main text (Eq. 19).

H USING EIGENVALUES TO COMPARE DEEP WISHART, DEEP RESIDUAL WISHART AND INVERSE WISHART PRIORS

One might be concerned that the deep inverse Wishart processes in which we can easily perform inference are different from the deep Wishart processes corresponding to BNNs (App. C.1) and infinite NNs with bottlenecks (App. C.3). To address these concerns, we begin by noting that the (inverse) Wishart priors can be written in terms of samples from the standard (inverse) Wishart.

