A THEORY OF REPRESENTATION LEARNING IN NEURAL NETWORKS GIVES A DEEP GENERALISATION OF KERNEL METHODS

Abstract

The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches (formally NNGPs) involving infinite width limits eliminate representation learning. We therefore develop a new infinite width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width models, yet at the same time retains some of the simplicity of standard infinite-width limits. In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and the posterior covariances can be obtained by optimizing an interpretable objective combining a log-likelihood, to improve performance, with a series of KL-divergences, which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, which we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to NNs (which have non-Gaussian posteriors) in the Appendices.

1. INTRODUCTION

The successes of modern machine learning methods, from neural networks (NNs) to deep Gaussian processes (DGPs; Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017), are based on their ability to use depth to transform the input into high-level representations that are good for solving difficult tasks (Bengio et al., 2013; LeCun et al., 2015). However, theoretical approaches using infinite limits to understand deep models struggle to capture representation learning. In particular, there are two broad families of infinite limit, and while they both use kernel-matrix-like objects, they are ultimately very different. The neural network Gaussian process (NNGP; Neal, 1996; Lee et al., 2017; Matthews et al., 2018) applies to Bayesian models like Bayesian neural networks (BNNs) and DGPs, and describes the representations at each layer (formally, the NNGP kernel is the raw second moment of the activities). In contrast, the neural tangent kernel (NTK; Jacot et al., 2018) is a very different quantity that involves gradients, and describes how predictions at all datapoints change if we do a gradient update on a single datapoint. As such, the NNGP and NTK are suited to asking very different theoretical questions. For instance, the NNGP is better suited to understanding the transformation of representations across layers, while the NTK is better suited to understanding how predictions change through NN training. While challenges surrounding representation learning have recently been addressed in the NTK setting (Yang & Hu, 2020), we are the first to address this challenge in the NNGP setting. At the same time, kernel methods (Smola & Schölkopf, 1998; Shawe-Taylor & Cristianini, 2004; Hofmann et al., 2008) were a leading machine learning approach prior to the deep learning revolution (Krizhevsky et al., 2012). However, kernel methods were eclipsed by deep NNs because depth gives NNs the flexibility to learn a good top-layer representation (Aitchison, 2020).
In contrast, in a standard kernel method, the kernel (or equivalently the representation) is highly inflexible: there are usually a few tunable hyperparameters, but nothing that approaches the enormous flexibility of the top-layer representation in a deep model. There is therefore a need to develop flexible, deep generalisations of kernel methods. Remarkably, our advances in understanding representation learning in DGPs give such a flexible, deep kernel method.

2. CONTRIBUTIONS

• We present a new infinite width limit, the Bayesian representation learning limit, that retains representation learning in deep Bayesian models including DGPs. The key insight is that as the width goes to infinity, the prior becomes stronger and eventually overwhelms the likelihood. We fix this by rescaling the likelihood to match the prior; this rescaling can be understood in a Bayesian context as copying the labels (Sec. 4.3).
• We show that in the Bayesian representation learning limit, DGP posteriors are exactly zero-mean multivariate Gaussian, P(f_λ^ℓ | X, Y) = N(f_λ^ℓ; 0, G_ℓ), where f_λ^ℓ is the activation of the λth feature in layer ℓ for all inputs (Sec. 4.4 and Appendix D).
• We show that the posterior covariances can be obtained by optimizing the "deep kernel machine objective",
$$\mathcal{L}(G_1, \ldots, G_L) = \log P(Y | G_L) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\left( \mathcal{N}(0, G_\ell) \,\|\, \mathcal{N}(0, K(G_{\ell-1})) \right),$$
where G_ℓ are the posterior covariances, K(G_{ℓ-1}) are the kernel matrices, and ν_ℓ accounts for any differences in layer width (Sec. 4.3).
• We give an interpretation of this objective: log P(Y | G_L) encourages improved performance, while the KL-divergence terms act as a regulariser, keeping the posteriors, N(0, G_ℓ), close to the prior, N(0, K(G_{ℓ-1})) (Sec. 4.5).
• We introduce a sparse DKM, which takes inspiration from the GP inducing point literature to obtain a practical, scalable method that is linear in the number of datapoints. In contrast, naively computing/optimizing the DKM objective is cubic in the number of datapoints (as with most other naive kernel methods; Sec. 4.7).
• We extend these results to BNNs (which have non-Gaussian posteriors) in Appendix A.

3. RELATED WORK

Our work is focused on DGPs and gives new results such as the extremely simple multivariate Gaussian form for DGP true posteriors. As such, our work is very different from previous work on NNs, where such results are not available. There are at least three families of such work. First, there is recent work on representation learning in the very different NTK setting (Jacot et al., 2018; Yang, 2019; Yang & Hu, 2020) (see Sec. 1). In contrast, here we focus on NNGPs (Neal, 1996; Williams, 1996; Lee et al., 2017; Matthews et al., 2018; Novak et al., 2018; Garriga-Alonso et al., 2018; Jacot et al., 2018), where the challenge of representation learning has yet to be addressed. Second, there is a body of work using methods from physics to understand representation learning in neural networks (Antognini, 2019; Dyer & Gur-Ari, 2019; Hanin & Nica, 2019; Aitchison, 2020; Li & Sompolinsky, 2020; Yaida, 2020; Naveh et al., 2020; Zavatone-Veth et al., 2021; Zavatone-Veth & Pehlevan, 2021; Roberts et al., 2021; Naveh & Ringel, 2021; Halverson et al., 2021). This work focuses on perturbational, rather than variational, methods. Third, there is a body of theoretical work (Mei et al., 2018; Nguyen, 2019; Sirignano & Spiliopoulos, 2020a;b; Nguyen & Pham, 2020) which establishes properties such as convergence to the global optimum. This work is focused on two-layer (i.e. one-hidden-layer) networks, and, like the NTK, considers learning under SGD rather than Bayesian posteriors. Another related line of work uses kernels to give a closed-form expression for the weights of a neural network, based on a greedy, layerwise objective (Wu et al., 2022). This work differs in that it uses the HSIC objective, and therefore does not have a link to DGPs or Bayesian neural networks, and in that it uses a greedy, layerwise objective, rather than end-to-end gradient descent.
Figure 1: The graphical model structure for each of our generative models for L = 3. Top: the standard model (Eq. 1), written purely in terms of features, F_ℓ. Middle: the standard model, including the Gram matrices as random variables (Eq. 5). Bottom: the model with the activations, F_ℓ, integrated out, written purely in terms of Gram matrices (Eq. 6).

4. RESULTS

We start by defining a DGP, which contains Bayesian NNs (BNNs) as a special case (Appendix A). This model maps from inputs, X ∈ R^{P×ν_0}, to outputs, Y ∈ R^{P×ν_{L+1}}, where P is the number of input points, ν_0 is the number of input features, and ν_{L+1} is the number of output features. The model has L intermediate layers, indexed ℓ ∈ {1, ..., L}, and at each intermediate layer there are N_ℓ features, F_ℓ ∈ R^{P×N_ℓ}. Both F_ℓ and Y can be written as stacks of vectors,

$$F_\ell = (f^\ell_1 \; f^\ell_2 \; \cdots \; f^\ell_{N_\ell}), \qquad Y = (y_1 \; y_2 \; \cdots \; y_{\nu_{L+1}}),$$

where f_λ^ℓ ∈ R^P gives the value of one feature, and y_λ ∈ R^P gives the value of one output, for all P input points. The features, F_1, ..., F_L, and (for regression) the outputs, Y, are sampled from a Gaussian process (GP) with a covariance which depends on the previous layer's features (Fig. 1 top),

$$P(F_\ell | F_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(f^\ell_\lambda; \mathbf{0}, K(G(F_{\ell-1}))\right), \quad \text{(1a)}$$
$$P(Y | F_L) = \prod_{\lambda=1}^{\nu_{L+1}} \mathcal{N}\left(y_\lambda; \mathbf{0}, K(G(F_L)) + \sigma^2 I\right). \quad \text{(1b)}$$

Note that we use the regression likelihood only to give a concrete example; we could equally use an alternative likelihood, e.g. for classification (Appendix B). The distinction between DGPs and BNNs arises through the choice of K(·) and G(·). For BNNs, see Appendix A. For DGPs, G(·), which takes the features and computes the corresponding P × P Gram matrix, is

$$G(F_{\ell-1}) = \frac{1}{N_{\ell-1}} \sum_{\lambda=1}^{N_{\ell-1}} f^{\ell-1}_\lambda (f^{\ell-1}_\lambda)^T = \frac{1}{N_{\ell-1}} F_{\ell-1} F_{\ell-1}^T. \quad \text{(2)}$$

Now, we introduce random variables representing the Gram matrices, G_{ℓ-1} = G(F_{ℓ-1}). Here, G_{ℓ-1} is a random variable representing the Gram matrix at layer ℓ-1, whereas G(·) is a deterministic function that takes features and computes the corresponding Gram matrix using Eq. (2). Finally, K(·) transforms the Gram matrix, G_{ℓ-1}, into the final kernel. Many kernels of interest are isotropic, meaning they depend only on the normalized squared distance between datapoints, R_ij,

$$K_{\text{isotropic};ij}(G_{\ell-1}) = k_{\text{isotropic}}(R_{ij}(G_{\ell-1})). \quad \text{(3)}$$

Importantly, we can compute this squared distance from G_{ℓ-1}, without needing F_{ℓ-1},

$$R_{ij}(G) = \frac{1}{N} \sum_{\lambda=1}^{N} \left(F_{i\lambda} - F_{j\lambda}\right)^2 = \frac{1}{N} \sum_{\lambda=1}^{N} \left(F_{i\lambda}^2 - 2 F_{i\lambda} F_{j\lambda} + F_{j\lambda}^2\right) = G_{ii} - 2 G_{ij} + G_{jj}, \quad \text{(4)}$$

where λ indexes features, i and j index datapoints, and we have omitted the layer index for simplicity. Importantly, we are not restricted to isotropic kernels: other kernels that depend only on the Gram matrix, such as the arccos kernels from the infinite NN literature (Cho & Saul, 2009), can also be used (for further details, see Aitchison et al., 2020).
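The generative process above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: all function names are ours, and we use a squared-exponential kernel as the isotropic k(·) of Eq. (3), with the Gram matrix of Eq. (2) and the squared distances of Eq. (4).

```python
# Sketch of one DGP layer under the prior (Eq. 1a), using only the Gram matrix.
import numpy as np

def gram(F):
    """G = F F^T / N for a feature matrix F of shape (P, N) (Eq. 2)."""
    return F @ F.T / F.shape[1]

def sq_dists(G):
    """R_ij = G_ii - 2 G_ij + G_jj, computed from G alone (Eq. 4)."""
    d = np.diag(G)
    return d[:, None] - 2.0 * G + d[None, :]

def k_sqexp(G):
    """An isotropic squared-exponential kernel applied to a Gram matrix (Eq. 3)."""
    return np.exp(-0.5 * sq_dists(G))

def sample_layer(G_prev, N, rng):
    """Draw N iid features f ~ N(0, K(G_prev)) (Eq. 1a)."""
    K = k_sqexp(G_prev)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(K.shape[0]))  # jitter for stability
    return L @ rng.standard_normal((K.shape[0], N))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # P=5 datapoints, nu_0=3 input features
G0 = gram(X)
F1 = sample_layer(G0, N=1000, rng=rng)   # wide layer-1 features
G1 = gram(F1)                            # close to k_sqexp(G0) for large N
```

For a wide layer, G_1 concentrates around K(G_0), previewing the infinite-width behaviour discussed below.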

4.1. BNN AND DGP PRIORS CAN BE WRITTEN PURELY IN TERMS OF GRAM MATRICES

Notice that F_ℓ depends on F_{ℓ-1} only through G_{ℓ-1} = G(F_{ℓ-1}), and Y depends on F_L only through G_L = G(F_L) (Eq. 1). We can therefore write the graphical model in terms of those Gram matrices (Fig. 1 middle),

$$P(F_\ell | G_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(f^\ell_\lambda; \mathbf{0}, K(G_{\ell-1})\right), \quad \text{(5a)}$$
$$P(G_\ell | F_\ell) = \delta\left(G_\ell - G(F_\ell)\right), \quad \text{(5b)}$$
$$P(Y | G_L) = \prod_{\lambda=1}^{\nu_{L+1}} \mathcal{N}\left(y_\lambda; \mathbf{0}, K(G_L) + \sigma^2 I\right), \quad \text{(5c)}$$

where δ is the Dirac delta, and G_0 depends on X (e.g. G_0 = (1/ν_0) X X^T). Again, for concreteness we have used a regression likelihood, but other likelihoods could also be used. Now, we can integrate F_ℓ out of the model, in which case we get an equivalent generative model written solely in terms of Gram matrices (Fig. 1 bottom), with

$$P(G_\ell | G_{\ell-1}) = \int dF_\ell \, P(G_\ell | F_\ell) \, P(F_\ell | G_{\ell-1}), \quad \text{(6)}$$

and with the usual likelihood (e.g. Eq. 5c). This looks intractable (and indeed, in general it is intractable). However, for DGPs, an analytic form is available. In particular, note that the Gram matrix (Eq. 2) is the (scaled) sum of outer products of IID Gaussian distributed vectors (Eq. 1a). This matches the definition of the Wishart distribution (Gupta & Nagar, 2018), so we have

$$P(G_\ell | G_{\ell-1}) = \text{Wishart}\left(G_\ell; \tfrac{1}{N_\ell} K(G_{\ell-1}), N_\ell\right), \quad \text{(7)}$$
$$\log P(G_\ell | G_{\ell-1}) = \frac{N_\ell - P - 1}{2} \log |G_\ell| - \frac{N_\ell}{2} \log |K(G_{\ell-1})| - \frac{N_\ell}{2} \text{Tr}\left(K^{-1}(G_{\ell-1}) G_\ell\right) + \text{const}.$$

This distribution over Gram matrices is valid for DGPs of any width (though we need to be careful in the low-rank setting where N_ℓ < P). We are going to leverage these Wishart distributions to understand the behaviour of the Gram matrices in the infinite width limit.
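The Wishart claim in Eq. (7) can be checked empirically. The following sketch (our own code, not the paper's) verifies that G = F F^T / N, with iid columns f ~ N(0, K), has the mean K and per-entry variance (K_ii K_jj + K_ij^2)/N implied by a Wishart(K/N, N) distribution.

```python
# Monte-Carlo check of the Wishart moments implied by Eq. (7).
import numpy as np

rng = np.random.default_rng(1)
P, N, S = 4, 64, 4000                     # datapoints, layer width, MC samples
A = rng.standard_normal((P, P))
K = A @ A.T + P * np.eye(P)               # an arbitrary SPD kernel K(G_{l-1})
L = np.linalg.cholesky(K)

Gs = np.empty((S, P, P))
for s in range(S):
    F = L @ rng.standard_normal((P, N))   # N iid features f ~ N(0, K)
    Gs[s] = F @ F.T / N                   # Gram matrix, ~ Wishart(K/N, N)

mean_G = Gs.mean(axis=0)                  # should be close to K
var_G = Gs.var(axis=0)                    # should be close to pred_var
pred_var = (np.outer(np.diag(K), np.diag(K)) + K**2) / N
```

The 1/N shrinkage of the variance is exactly why the Gram matrices concentrate in the infinite-width limits considered next.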

4.2. STANDARD INFINITE WIDTH LIMITS OF DGPS LACK REPRESENTATION LEARNING

We are now in a position to take a new viewpoint on the DGP analogue of standard NNGP results (Lee et al., 2017; Matthews et al., 2018; Hron et al., 2020; Pleiss & Cunningham, 2021). We can evaluate the log-posterior for a model written only in terms of Gram matrices,

$$\log P(G_1, \ldots, G_L | X, Y) = \log P(Y | G_L) + \sum_{\ell=1}^{L} \log P(G_\ell | G_{\ell-1}) + \text{const}. \quad \text{(8)}$$

Then we take the limit of infinite width, N_ℓ = N ν_ℓ for ℓ ∈ {1, ..., L} with N → ∞. This limit modifies log P(G_ℓ | G_{ℓ-1}) (Eq. 7), but does not modify G_1, ..., G_L in Eq. (8), as we get to choose the values of G_1, ..., G_L at which to evaluate the log-posterior. Specifically, the log-prior, log P(G_ℓ | G_{ℓ-1}) (Eq. 7), scales with N_ℓ and hence with N. To get a finite limit, we therefore divide by N,

$$\lim_{N \to \infty} \frac{1}{N} \log P(G_\ell | G_{\ell-1}) = \frac{\nu_\ell}{2} \left( \log \left| K^{-1}(G_{\ell-1}) G_\ell \right| - \text{Tr}\left( K^{-1}(G_{\ell-1}) G_\ell \right) \right) + \text{const} \quad \text{(9)}$$
$$= -\nu_\ell \, D_{\mathrm{KL}}\left( \mathcal{N}(0, G_\ell) \,\|\, \mathcal{N}(0, K(G_{\ell-1})) \right) + \text{const}. \quad \text{(10)}$$

Remarkably, this limit can be written as the KL-divergence between two multivariate Gaussians. In contrast, the log-likelihood, log P(Y | G_L), is constant w.r.t. N (Eq. 5c), so lim_{N→∞} (1/N) log P(Y | G_L) = 0. The limiting log-posterior is thus

$$\lim_{N \to \infty} \frac{1}{N} \log P(G_1, \ldots, G_L | X, Y) = -\sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\left( \mathcal{N}(0, G_\ell) \,\|\, \mathcal{N}(0, K(G_{\ell-1})) \right) + \text{const}. \quad \text{(11)}$$

This form highlights that the log-posterior scales with N, so in the limit as N → ∞, the posterior converges to a point distribution at the global maximum, denoted G*_1, ..., G*_L (see Appendix C for a formal discussion of weak convergence),

$$\lim_{N \to \infty} P(G_1, \ldots, G_L | X, Y) = \prod_{\ell=1}^{L} \delta\left(G_\ell - G^*_\ell\right). \quad \text{(12)}$$

Moreover, it is evident from the KL-divergence form for the log-posterior (Eq. 11) that the unique global maximum can be computed recursively as G*_ℓ = K(G*_{ℓ-1}), with e.g. G*_0 = (1/ν_0) X X^T. Thus, the limiting posterior over Gram matrices does not depend on the training targets, so there is no possibility of representation learning (Aitchison, 2020).
This is deeply problematic as the successes of modern deep learning arise from flexibly learning good top-layer representations.
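The label-independence of the standard limit is easy to see in code: the fixed point G*_ℓ = K(G*_{ℓ-1}) is just a forward kernel recursion. This is a sketch in our own notation (function names are ours); any kernel of the Gram matrix can be substituted for the linear kernel used here.

```python
# The standard-limit posterior over Gram matrices (Sec. 4.2) is the forward
# recursion G*_l = K(G*_{l-1}); the targets Y never enter.
import numpy as np

def k_linear(G):
    return G  # a linear kernel; any kernel of the Gram matrix would do

def standard_limit_grams(X, L, kernel):
    G = X @ X.T / X.shape[1]              # G*_0 = X X^T / nu_0
    grams = []
    for _ in range(L):
        G = kernel(G)                     # G*_l = K(G*_{l-1})
        grams.append(G)
    return grams

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
grams = standard_limit_grams(X, L=3, kernel=k_linear)
# Representation learning is impossible here: the recursion depends only on X.
```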

4.3. THE BAYESIAN REPRESENTATION LEARNING LIMIT

In the previous section, we saw that standard infinite width limits eliminate representation learning: as N → ∞, the log-prior terms, log P(G_ℓ | G_{ℓ-1}), in Eq. (8) dominate the log-likelihood, log P(Y | G_L), and the likelihood is the only term that depends on the labels. We therefore introduce the "Bayesian representation learning limit", which retains representation learning. The Bayesian representation learning limit sends the number of output features, N_{L+1}, to infinity along with the layer widths,

$$N_\ell = N \nu_\ell \quad \text{for } \ell \in \{1, \ldots, L+1\}, \text{ with } N \to \infty. \quad \text{(13)}$$

Importantly, the Bayesian representation learning limit gives a valid probabilistic model with a well-defined posterior, arising from the prior (Eq. 6) and a likelihood which assumes each output channel is IID,

$$P(\tilde{Y} | G_L) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\left(\tilde{y}_\lambda; \mathbf{0}, K(G_L) + \sigma^2 I\right), \quad \text{(14)}$$

where Ỹ ∈ R^{P×N_{L+1}} is infinite width (Eq. 13), whereas the usual DGP data, Y ∈ R^{P×ν_{L+1}}, is finite width. Of course, infinite-width data is unusual if not unheard-of: in practice, real data, Y ∈ R^{P×ν_{L+1}}, almost always has a finite number of features, ν_{L+1}. How do we apply the DKM to such data? The answer is to define Ỹ as N copies of the underlying data, Y, i.e. Ỹ = (Y ··· Y). As each channel is assumed to be IID (Eqs. 5c and 14), the log-likelihood is N times larger,

$$\log P(\tilde{Y} | G_L) = N \log P(Y | G_L). \quad \text{(15)}$$

The log-posterior in the Bayesian representation learning limit is very similar to the log-posterior in the standard limit (Eq. 11). The only difference is that the likelihood, log P(Ỹ | G_L), now scales with N, so it does not disappear as we take the limit, allowing us to retain representation learning,

$$\mathcal{L}(G_1, \ldots, G_L) = \lim_{N \to \infty} \frac{1}{N} \log P(G_1, \ldots, G_L | X, \tilde{Y}) + \text{const}$$
$$= \log P(Y | G_L) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\left( \mathcal{N}(0, G_\ell) \,\|\, \mathcal{N}(0, K(G_{\ell-1})) \right). \quad \text{(16)}$$

Here, we denote the limiting log-posterior as L(G_1, ..., G_L), and this forms the DKM objective.
As long as the global maximum of the DKM objective is unique, the posterior is again a point distribution around that maximum (Eq. 12); local maxima are no problem, so long as they all lie below the unique global maximum. Of course, the inclusion of the likelihood term means that the global optimum, G*_1, ..., G*_L, can no longer be computed recursively; instead, we need to optimize, e.g. using gradient descent (see Sec. 4.7). Unlike in the standard limit (Eq. 11), it is no longer possible to guarantee uniqueness of the global maximum. We nonetheless expect the global maximum to be unique in most practical settings: we know the maximum is unique when the prior dominates (Eq. 11); in Appendix J, we prove uniqueness for linear models; and in Appendix K, we give a number of experiments on nonlinear models in which optimizing from very different initializations found the same global maximum, indicating uniqueness in practical settings.
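The two terms of Eq. (16) are straightforward to evaluate for given Gram matrices. The following is a hedged sketch in our own notation (not the paper's code), using the regression likelihood of Eq. (5c) and the closed form for the KL-divergence between zero-mean Gaussians.

```python
# Evaluating the DKM objective (Eq. 16) at candidate Gram matrices.
import numpy as np

def gauss_kl(G, K):
    """D_KL( N(0,G) || N(0,K) ) for P x P covariance matrices."""
    P = G.shape[0]
    Kinv_G = np.linalg.solve(K, G)
    return 0.5 * (np.trace(Kinv_G) - P - np.log(np.linalg.det(Kinv_G)))

def log_lik(Y, KGL, sigma2):
    """log P(Y|G_L) = sum_lambda log N(y_lambda; 0, K(G_L) + sigma^2 I) (Eq. 5c)."""
    P, nu = Y.shape
    S = KGL + sigma2 * np.eye(P)
    _, logdet = np.linalg.slogdet(S)
    quad = np.trace(Y.T @ np.linalg.solve(S, Y))
    return -0.5 * (nu * (logdet + P * np.log(2.0 * np.pi)) + quad)

def dkm_objective(grams, G0, Y, kernel, nus, sigma2):
    """log P(Y|G_L) - sum_l nu_l KL( N(0,G_l) || N(0,K(G_{l-1})) )."""
    obj, prev = 0.0, G0
    for G, nu in zip(grams, nus):
        obj -= nu * gauss_kl(G, kernel(prev))
        prev = G
    return obj + log_lik(Y, kernel(prev), sigma2)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))          # P=3, nu_0=5, so G_0 is full rank
Y = rng.standard_normal((3, 2))
G0 = X @ X.T / 5
# At the standard-limit fixed point, G_l = K(G_{l-1}), every KL term vanishes:
obj = dkm_objective([G0, G0], G0, Y, lambda G: G, nus=[1.0, 1.0], sigma2=0.1)
```

In practice the paper optimizes this objective (e.g. with gradient descent); the sketch only shows how it is evaluated.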

4.4. THE EXACT DGP POSTERIOR OVER FEATURES IS MULTIVARIATE GAUSSIAN

Above, we noted that the DGP posterior over Gram matrices in the Bayesian representation learning limit is a point distribution, as long as the DKM objective has a unique global maximum. Remarkably, in this setting, the corresponding posterior over features is multivariate Gaussian (see Appendix D for the full derivation),

$$P(f^\ell_\lambda | X, Y) = \mathcal{N}\left(f^\ell_\lambda; \mathbf{0}, G^*_\ell\right). \quad \text{(17)}$$

While such a simple result might initially seem remarkable, it should not surprise us too much. In particular, the prior is Gaussian (Eq. 1). In addition, in Fig. 1 (middle), we saw that the next-layer features depend on the current-layer features only through the Gram matrices, which are just the raw second moments of the features (Eq. 2). Thus, in effect, the likelihood only constrains the raw second moments of the features. Critically, constraints on the raw second moment are tightly connected to Gaussian distributions: under the MaxEnt framework, a Gaussian distribution arises by maximizing the entropy under a constraint on the raw second moment (Jaynes, 2003). Thus it is entirely plausible that a Gaussian prior, combined with a likelihood that "constrains" the raw second moment, would give rise to a Gaussian posterior (though of course this is not a proof; see Appendix D for the full derivation). Finally, note that we appear to use G_ℓ or G*_ℓ in two separate senses: as (1/N_ℓ) F_ℓ F_ℓ^T in Eq. (2), and as the posterior covariance in the Bayesian representation learning limit (Eq. 17). In the infinite limit, these two uses are consistent. In particular, consider the value of G_ℓ defined by Eq. (2) under the posterior,

$$G_\ell = \lim_{N \to \infty} \frac{1}{N_\ell} \sum_{\lambda=1}^{N_\ell} f^\ell_\lambda (f^\ell_\lambda)^T = \mathbb{E}_{P(f^\ell_\lambda | X, Y)}\left[ f^\ell_\lambda (f^\ell_\lambda)^T \right] = G^*_\ell.$$

The second equality arises by noticing that we are computing the average of infinitely many terms, f_λ^ℓ (f_λ^ℓ)^T, which are IID under the true posterior (Eq. 17), so we can apply the law of large numbers; the final equality arises by computing moments under Eq. (17).
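The law-of-large-numbers consistency argument above can be illustrated numerically: averaging the outer products of many iid draws f ~ N(0, G*) recovers G*. This is a toy sketch (names are ours); G_star here is just a stand-in posterior covariance.

```python
# Consistency of the two uses of G_l (Sec. 4.4): the empirical second moment of
# iid posterior features converges to the posterior covariance G*_l.
import numpy as np

rng = np.random.default_rng(3)
P, N = 4, 200_000
A = rng.standard_normal((P, P))
G_star = A @ A.T + np.eye(P)              # stand-in posterior covariance G*_l
L = np.linalg.cholesky(G_star)
F = L @ rng.standard_normal((P, N))       # N iid posterior features f ~ N(0, G*)
G_hat = F @ F.T / N                       # empirical (1/N) sum_l f f^T
```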

4.5. THE DKM OBJECTIVE GIVES INTUITION FOR REPRESENTATION LEARNING

The form of the DKM objective in Eq. (16) gives a strong intuition for how representation learning occurs in deep networks. In particular, the likelihood, log P(Y | G_L), encourages the model to find a representation giving good performance on the training data. At the same time, the KL-divergence terms keep the posterior over features, N(0, G_ℓ) (Eq. 17), close to the prior, N(0, K(G_{ℓ-1})) (Eq. 1a). This encourages the optimized representations, G_ℓ, to lie close to their value under the standard infinite-width limit, K(G_{ℓ-1}). We could use any form for the likelihood, including classification and regression, but to understand how the likelihood interacts with the KL-divergence terms, it is easiest to consider regression (Eq. 5c), as this log-likelihood can also be written as a KL-divergence,

$$\log P(Y | G_L) = -\nu_{L+1} \, D_{\mathrm{KL}}\left( \mathcal{N}(0, G_{L+1}) \,\|\, \mathcal{N}(0, K(G_L) + \sigma^2 I) \right) + \text{const}.$$

Thus, the likelihood encourages K(G_L) + σ²I to be close to the covariance of the data, G_{L+1} = (1/ν_{L+1}) Y Y^T, while the DGP prior terms encourage each G_ℓ to lie close to K(G_{ℓ-1}). In combination, we would expect the optimal Gram matrices to "interpolate" between the input kernel, G_0 = (1/ν_0) X X^T, and the output kernel, G_{L+1}. To make the notion of interpolation explicit, we consider σ² = 0 with a linear kernel, K(G_{ℓ-1}) = G_{ℓ-1}, so named because it corresponds to a linear neural network layer. With this kernel, and with all ν_ℓ = ν, there is an analytic solution for the (unique) optimum of the DKM objective (Appendix J.1),

$$G^*_\ell = G_0 \left( G_0^{-1} G_{L+1} \right)^{\ell/(L+1)},$$

which explicitly geometrically interpolates between G_0 and G_{L+1}. Of course, this discussion was primarily for DGPs, but the exact same intuitions hold for BNNs, in that maximizing the DKM objective finds a sequence of Gram matrices, G*_1, ..., G*_L, that interpolates between the input kernel, G_0, and the output kernel, G_{L+1}. The only difference lies in the details of P(G_ℓ | G_{ℓ-1}), and specifically in slight differences in the KL-divergence terms (see below).
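The geometric interpolation formula for the linear kernel can be checked numerically. In this sketch (our own code, with arbitrary stand-in kernels) we use the identity G_0 (G_0^{-1} G_{L+1})^a = G_0^{1/2} (G_0^{-1/2} G_{L+1} G_0^{-1/2})^a G_0^{1/2}, which lets us take the matrix power of a symmetric positive definite matrix via its eigendecomposition.

```python
# The analytic optimum for the linear kernel (Appendix J.1) interpolates
# geometrically between G_0 (at l=0) and G_{L+1} (at l=L+1).
import numpy as np

rng = np.random.default_rng(4)
P, L = 4, 3
A = rng.standard_normal((P, P))
B = rng.standard_normal((P, P))
G0 = A @ A.T + np.eye(P)                  # input kernel G_0
GL1 = B @ B.T + np.eye(P)                 # output kernel G_{L+1}

def spd_power(S, a):
    """S^a for a symmetric positive definite S, via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**a) @ V.T

G0_half = spd_power(G0, 0.5)
G0_ihalf = spd_power(G0, -0.5)
M = G0_ihalf @ GL1 @ G0_ihalf             # symmetrised G_0^{-1} G_{L+1}

def G_star(l):
    """G_0 (G_0^{-1} G_{L+1})^{l/(L+1)} = G_0^{1/2} M^{l/(L+1)} G_0^{1/2}."""
    return G0_half @ spd_power(M, l / (L + 1)) @ G0_half

grams = [G_star(l) for l in range(L + 2)]  # l = 0, ..., L+1
```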

4.6. THE DKM OBJECTIVE MIRRORS REPRESENTATION LEARNING IN FINITE NETWORKS

Here, we confirm that optimizing the DKM objective for an infinite network matches doing inference in wide but finite-width networks using Langevin sampling (see Appendix F for details). We began by looking at DGPs, and confirming that the posterior marginals are Gaussian (Eq. 17; Fig. 3ab). Then, we confirmed that the representations match closely for infinite-width DKMs (Fig. 2, top and bottom rows) and finite-width DGPs (Fig. 2, middle two rows), both at initialization (Fig. 2, top two rows) and after training to convergence (Fig. 2, bottom two rows). Note that the first column, K(G_0), is a squared exponential kernel applied to the input data, and G_3 = y y^T is the output Gram matrix (in this case, there is only one output feature). To confirm that the match improves as the DGP gets wider, we considered the RMSE between elements of the Gram matrices for networks of different widths (x-axis), for different UCI datasets (columns), and for different numbers of layers (top row is one layer, bottom row is two layers; Fig. 3c). In most cases, we found a good match as long as the width was at least 128, which is around the width of a typical fully connected neural network, but a little larger than typical DGP widths (e.g. Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017).

Figure 2: A two-hidden-layer DGP with 1024 units per hidden layer and a DKM, both with squared exponential kernels, match closely. The data was the first 50 datapoints of the yacht dataset. The first column, K(G_0), is a fixed squared exponential kernel applied to the inputs, and the last column, G_3 = y y^T, is the fixed output Gram matrix. The first row is the DKM initialization at the prior Gram matrices and kernels, and the second row is the DGP, which is initialized by sampling from the prior.
As expected, the finite width DGP prior closely matches the infinite-width DKM initialization, which corresponds to the standard infinite width limit. The third row is the Gram matrices and kernels for the trained DGP, which has changed dramatically relative to its initialization (second row) in order to better fit the data. The fourth row is the Gram matrices and kernels for the optimized DKM, which closely matches those for the trained DGP.

4.7. METHODS

DGPs in the representation learning limit constitute a deep generalisation of kernel methods with a very flexible learned kernel, which we call the deep kernel machine (DKM; introduced earlier just in the context of the objective). Here, we design a sparse DKM, inspired by sparse methods for DGPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) (Appendix L). The sparse DKM scales linearly in the number of datapoints, P, as opposed to the cubic scaling of the plain DKM (similar to the cubic scaling of most naive kernel methods). We compared DKMs (Eq. 16) and MAP over features (Appendix E) for DGPs. In addition, we considered a baseline, which was a standard, shallow kernel method mirroring the structure of the deep kernel machine, but where the only flexibility comes from the hyperparameters. Formally, this model can be obtained by setting G_ℓ = K(G_{ℓ-1}), and is denoted "Kernel Hyper" in Table 1. We applied these methods to UCI datasets (Gal & Ghahramani, 2016) using a two-hidden-layer architecture, with a kernel inspired by DGP skip connections,

$$K(G_\ell) = w^\ell_1 G_\ell + w^\ell_2 K_{\text{sqexp}}(G_\ell).$$

Here, w^ℓ_1, w^ℓ_2 and σ are hyperparameters, and K_sqexp(G_ℓ) is a squared-exponential kernel. We used 300 inducing points, fixed to a random subset of the training data and not optimised during training. We used the Adam optimizer with a learning rate of 0.001 and full-batch gradients, with 5000 iterations for smaller datasets and 1000 iterations for larger datasets (kin8nm, naval and protein). We found that the deep kernel machine objective gave better performance than MAP or the hyperparameter optimization baseline (Table 1). Note that these numbers are not directly comparable to those in the deep GP literature (Salimbeni & Deisenroth, 2017), as deep GPs have a full posterior and so offer excellent protection against overfitting, while DKMs give only a point estimate.
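The skip-connection kernel used in the experiments is easy to implement from the Gram matrix alone. This is a hedged sketch (our own function names, illustrative weights, unit lengthscale) combining a linear term with a squared-exponential term built from the distances of Eq. (4).

```python
# The skip-connection kernel K(G) = w1 * G + w2 * K_sqexp(G) (Sec. 4.7).
import numpy as np

def k_skip(G, w1=1.0, w2=1.0):
    d = np.diag(G)
    R = d[:, None] - 2.0 * G + d[None, :]   # squared distances from G (Eq. 4)
    return w1 * G + w2 * np.exp(-0.5 * R)   # linear "skip" + squared-exponential

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 3))
G0 = X @ X.T / 3
K = k_skip(G0, w1=0.5, w2=0.5)
```

Because both terms are valid kernels of the Gram matrix, any nonnegative combination is again positive semi-definite, so the result can be used directly as K(G_ℓ) in the DKM recursion.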

Figure 3 (panels): RMSE between DKM and finite-DGP Gram matrices, G_1 and G_2, as a function of width (2^1 to 2^11), for one-layer (top) and two-layer (bottom) models; see Sec. 4.6 for the description.

5. CONCLUSION

We introduced the Bayesian representation learning limit, a new infinite-width limit for BNNs and DGPs that retains representation learning. Representation learning in this limit is described by the intuitive DKM objective, which is composed of a log-likelihood describing performance on the task (e.g. classification or regression) and a sum of KL-divergences keeping representations at every layer close to those under the infinite-width prior. For DGPs, the exact posteriors are IID across features and are multivariate Gaussian, with covariances given by optimizing the DKM objective. Empirically, we found that the distributions over features and representations matched those in wide but finite DGPs. We argued that DGPs in the Bayesian representation learning limit form a new class of practical deep kernel method: DKMs. We introduced sparse DKMs, which scale linearly in the number of datapoints. Finally, we gave the extension to BNNs, where the exact posteriors are intractable and so must be approximated.

A BAYESIAN NEURAL NETWORK EXTENSION

Consider a neural network of the form

$$F_1 = X W_0, \quad \text{(21a)}$$
$$F_\ell = \phi(F_{\ell-1}) W_{\ell-1} \quad \text{for } \ell \in \{2, \ldots, L+1\}, \quad \text{(21b)}$$
$$W^\ell_{\lambda\mu} \sim \mathcal{N}\left(0, \tfrac{1}{N_\ell}\right), \qquad W^0_{\lambda\mu} \sim \mathcal{N}\left(0, \tfrac{1}{\nu_0}\right), \quad \text{(21c)}$$

where W_0 ∈ R^{ν_0×N_1}, W_ℓ ∈ R^{N_ℓ×N_{ℓ+1}}, and W_L ∈ R^{N_L×ν_{L+1}} are weight matrices with independent Gaussian priors, and ϕ is the usual pointwise nonlinearity. In principle, we could integrate out the weights to find P(F_ℓ | F_{ℓ-1}),

$$P(F_\ell | F_{\ell-1}) = \int dW_{\ell-1} \, P(W_{\ell-1}) \, \delta\left(F_\ell - \phi(F_{\ell-1}) W_{\ell-1}\right), \quad \text{(22)}$$

where δ is the Dirac delta. In practice, it is much easier to note that, conditioned on F_{ℓ-1}, the random variables of interest, F_ℓ, are a linear combination of Gaussian distributed random variables, the weights. Thus, F_ℓ is itself Gaussian, and this Gaussian is completely characterised by its mean and covariance. We begin by writing the feature vectors, f_λ^ℓ, in terms of weight vectors, w_λ^ℓ,

$$f^\ell_\lambda = \phi(F_{\ell-1}) w^\ell_\lambda. \quad \text{(23)}$$

As the prior over weight vectors is IID, the prior over features, conditioned on F_{ℓ-1}, is also IID,

$$P(W) = \prod_{\lambda=1}^{N_\ell} P(w^\ell_\lambda) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(w^\ell_\lambda; \mathbf{0}, \tfrac{1}{N_{\ell-1}} I\right), \qquad P(F_\ell | F_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} P(f^\ell_\lambda | F_{\ell-1}).$$

The mean of f_λ^ℓ conditioned on F_{ℓ-1} is zero,

$$\mathbb{E}\left[f^\ell_\lambda \,\middle|\, F_{\ell-1}\right] = \phi(F_{\ell-1}) \, \mathbb{E}\left[w^\ell_\lambda\right] = \mathbf{0}. \quad \text{(26)}$$

The covariance of f_λ^ℓ conditioned on F_{ℓ-1} is

$$\mathbb{E}\left[f^\ell_\lambda (f^\ell_\lambda)^T \,\middle|\, F_{\ell-1}\right] = \phi(F_{\ell-1}) \, \mathbb{E}\left[w^\ell_\lambda (w^\ell_\lambda)^T\right] \phi^T(F_{\ell-1}) = \tfrac{1}{N_{\ell-1}} \phi(F_{\ell-1}) \phi^T(F_{\ell-1}). \quad \text{(27)}$$

This mean and covariance imply that Eq. (1) captures the BNN prior, as long as we choose K_BNN(·) and G_BNN(·) such that

$$K_{\text{BNN}}(G_{\text{BNN}}(F_{\ell-1})) = \tfrac{1}{N_{\ell-1}} \sum_{\lambda=1}^{N_{\ell-1}} \phi(f^{\ell-1}_\lambda) \phi^T(f^{\ell-1}_\lambda).$$

Specifically, we choose the kernel function, K_BNN(·), to be the identity function, and G_BNN(·) to be the same average-outer-product as in the main text for DGPs (Eq. 2), except that we have applied the NN nonlinearity,

$$K_{\text{BNN}}(G_{\ell-1}) = G_{\ell-1}, \qquad G_{\text{BNN}}(F_{\ell-1}) = \tfrac{1}{N_{\ell-1}} \sum_{\lambda=1}^{N_{\ell-1}} \phi(f^{\ell-1}_\lambda) \phi^T(f^{\ell-1}_\lambda). \quad \text{(29)}$$
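The conditional Gaussianity derived in Eqs. (26–27) can be checked by Monte Carlo. This is our own sketch (not the paper's code), using a ReLU nonlinearity as an example: we draw many weight vectors, form one feature f = ϕ(F)w, and compare its empirical covariance to ϕ(F)ϕ^T(F)/N.

```python
# MC check: conditioned on F_{l-1}, f_l = phi(F_{l-1}) w_l is zero-mean Gaussian
# with covariance phi(F_{l-1}) phi^T(F_{l-1}) / N_{l-1} (Eqs. 26-27).
import numpy as np

rng = np.random.default_rng(6)
P, N_prev, S = 4, 8, 200_000
F_prev = rng.standard_normal((P, N_prev))
phi = np.maximum(F_prev, 0.0)             # ReLU nonlinearity, as an example

w = rng.standard_normal((N_prev, S)) / np.sqrt(N_prev)  # w ~ N(0, I / N_{l-1})
f = phi @ w                               # S Monte-Carlo draws of one feature
emp_cov = f @ f.T / S                     # empirical E[f f^T | F_{l-1}]
pred_cov = phi @ phi.T / N_prev           # K_BNN(G_BNN(F_{l-1})) from Eq. (27)
```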
This form retains the average-outer-product form for G_BNN(·), which is important for our derivations.

Now, Eq. (16) only gave the DKM objective for DGPs. To get a more general form, we need to consider the implied posteriors over features. This posterior is IID over features (Appendix D.1), and for DGPs, it is multivariate Gaussian (Appendix D.2),

$$P(F_\ell | G_{\ell-1}, G_\ell) = \prod_{\lambda=1}^{N_\ell} P(f^\ell_\lambda | G_{\ell-1}, G_\ell) \overset{\text{for DGPs}}{=} \prod_{\lambda=1}^{N_\ell} \mathcal{N}\left(f^\ell_\lambda; \mathbf{0}, G_\ell\right). \quad \text{(31)}$$

Now, we can see that Eq. (16) is a specific example of a general expression. In particular, note that the distribution on the left of the KL-divergence in Eq. (16) is the DGP posterior over features (Eq. 31). Thus, the DKM objective can alternatively be written

$$\mathcal{L}(G_1, \ldots, G_L) = \log P(Y | G_L) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\left( P(f^\ell_\lambda | G_{\ell-1}, G_\ell) \,\|\, \mathcal{N}(0, K(G_{\ell-1})) \right), \quad \text{(32)}$$

and this form holds for both BNNs and DGPs (Appendix D.3). As in DGPs, the log-posterior is N times L(G_1, ..., G_L) (Eq. 16), so as N is taken to infinity, the posterior for all models becomes a point distribution (Eq. 12) if L(G_1, ..., G_L) has a unique global maximum. In practice, the true posteriors required to evaluate Eq. (32) are intractable for BNNs, raising the question of how to develop accurate approximations for BNNs. We develop a variational DKM (vDKM) by taking inspiration from variational inference (Jordan et al., 1999; Blei et al., 2017) (Appendix D.4).

Figure 4: a. [...] with 2^16 Monte-Carlo samples. The Gram matrices, G_1, ..., G_4, for the flow posterior (third row) closely match those from the BNN posterior (second row), while those for a multivariate Gaussian approximate posterior (fourth row) do not match. b. Marginal distributions over features at each layer for one input datapoint, estimated using kernel density estimation. Note that the BNN (blue line) marginals are non-Gaussian, but the variational DKM with a flow posterior (red line) is capable of capturing this non-Gaussianity.
Of course, variational inference is usually impossible in infinite width models, because it is impossible to work with infinitely large latent variables. Our key insight is that, as the true posterior factorises across features (Appendix D.1), we can work with the approximate posterior over only a single feature vector, Q_{θ_ℓ}(f_λ^ℓ), where θ_ℓ are the parameters and f_λ^ℓ ∈ R^P is finite. This approach allows us to define a vDKM objective, which bounds the true DKM objective,

$$\mathcal{L}(G_\theta(\theta_1), \ldots, G_\theta(\theta_L)) \geq \mathcal{L}_V(\theta_1, \ldots, \theta_L),$$
$$\mathcal{L}_V(\theta_1, \ldots, \theta_L) = \log P(Y | G_\theta(\theta_L)) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\left( Q_{\theta_\ell}(f^\ell_\lambda) \,\|\, \mathcal{N}(0, K(G_\theta(\theta_{\ell-1}))) \right),$$

with equality when the approximate posteriors, Q_{θ_ℓ}(f_λ^ℓ), equal the true posteriors, P(f_λ^ℓ | G_{ℓ-1}, G_ℓ). The only subtlety is that it is practically difficult to design flexible approximate posteriors, Q_{θ_ℓ}(f_λ^ℓ), in which we explicitly specify and optimize the Gram matrices. Instead, we optimize general approximate posterior parameters, θ, and compute the implied Gram matrices,

$$G_\theta(\theta_\ell) = \lim_{N_\ell \to \infty} \frac{1}{N_\ell} \sum_{\lambda=1}^{N_\ell} \phi(f^\ell_\lambda) \phi^T(f^\ell_\lambda) = \mathbb{E}_{Q_{\theta_\ell}(f^\ell_\lambda)}\left[ \phi(f^\ell_\lambda) \phi^T(f^\ell_\lambda) \right], \quad \text{(34)}$$

where f_λ^ℓ are sampled from Q_{θ_ℓ}(f_λ^ℓ), and the second equality arises from the law of large numbers. We can compute the Gram matrix analytically in simple cases (such as a multivariate Gaussian), but in general we can always estimate it using a Monte-Carlo estimate of Eq. (34). Finally, we checked that the vDKM objective closely matched the posterior under neural networks. This is a bit more involved, as the marginal distributions over features are no longer Gaussian (Fig. 4b). To capture these non-Gaussian marginals, we used a simple normalizing flow.
In particular, we first sampled z^ℓ_λ ~ N(μ_ℓ, Σ_ℓ) from a multivariate Gaussian with a learned mean, μ_ℓ, and covariance, Σ_ℓ; then we obtained features, f^ℓ_λ = f(z^ℓ_λ), by passing z^ℓ_λ through f, a learned pointwise function parameterised as in a neural spline flow (Durkan et al., 2019). The resulting distribution is a high-dimensional Gaussian copula (e.g. Cai & Zhang, 2018). As shown in Fig. 4, the vDKM with a multivariate Gaussian (MvG) approximate posterior cannot match the Gram matrices learned by the BNN (Fig. 4a), while the vDKM with a flow posterior is able to capture the non-Gaussian marginals (Fig. 4b) and thus matches the Gram matrices learned by the BNN.
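As a concrete illustration, the Gaussian-copula sampling and the Monte-Carlo Gram-matrix estimate of Eq. (34) can be sketched in a few lines of NumPy. This is only a sketch: tanh stands in for the learned neural-spline pointwise function, and all parameters are random stand-ins rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: P datapoints, K Monte-Carlo samples.
P, K = 5, 100_000

# Stand-ins for the learned base-Gaussian parameters (mu_l, Sigma_l).
mu = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma = A @ A.T + np.eye(P)          # positive-definite covariance
L = np.linalg.cholesky(Sigma)

# Stand-in for the learned pointwise spline flow f (any monotone map works here).
f = np.tanh

# Sample z ~ N(mu, Sigma), then push each coordinate through the pointwise flow.
z = mu + rng.normal(size=(K, P)) @ L.T
features = f(z)                      # Monte-Carlo samples of f^l_lambda under Q

# Monte-Carlo estimate of the implied Gram matrix (Eq. 34, with phi = identity).
G = features.T @ features / K
```

The resulting `G` is a symmetric positive semi-definite P×P matrix, exactly the object that enters the vDKM objective.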

B GENERAL LIKELIHOODS THAT DEPEND ONLY ON GRAM MATRICES

We consider likelihoods which depend only on the top-layer Gram matrix, G_L,

P(Y|G_L) = ∫ dF_{L+1} P(Y|F_{L+1}) P(F_{L+1}|G_L), where P(F_{L+1}|G_L) = ∏_{λ=1}^{N_{L+1}} N(f^{L+1}_λ; 0, K(G_L)).

This family of likelihoods captures regression,

P(y_λ|f^{L+1}_λ) = N(y_λ; f^{L+1}_λ, σ² I),

(which is equivalent to the model used in the main text, Eq. 1b), and e.g. classification,

P(y|F) = Categorical(y; softmax(F_{L+1})),

among many others.

C WEAK CONVERGENCE

Here, we give a formal argument for weak convergence of the DGP posterior over Gram matrices to a point distribution in the limit as N → ∞,

P_N(G_1, ..., G_L|X, Ỹ) →(d) ∏_{ℓ=1}^L δ(G_ℓ - G*_ℓ),

where we have included N in the subscript of the probability distribution as a reminder that this distribution depends on the width. By the Portmanteau theorem, weak convergence is established if the expectations of all bounded continuous functions, f, converge,

lim_{N→∞} E_{P_N(G_1,...,G_L|X,Ỹ)}[f(G_1, ..., G_L)] = f(G*_1, ..., G*_L).

To show this in a reasonably general setting (of which the DGP posterior is a special case), we consider an unnormalized probability density of the form h(g) e^{N L(g)}, and compute the expectation as,

E[f(g)] = ∫_G dg f(g) h(g) e^{N L(g)} / ∫_G dg h(g) e^{N L(g)}, (43)

where g = (G_1, ..., G_L) collects all L positive semi-definite matrices, G_ℓ. Thus, g ∈ G, where G is a convex set. We consider the superlevel set A(∆) = {g | L(g) ≥ L(g*) - ∆}, where g* is the unique global optimum. We select out a small region, A(∆), surrounding the global maximum, and split the integrals,

E[f(g)] = [∫_{A(∆)} dg f(g)h(g)e^{NL(g)} + ∫_{G\A(∆)} dg f(g)h(g)e^{NL(g)}] / [∫_{A(∆)} dg h(g)e^{NL(g)} + ∫_{G\A(∆)} dg h(g)e^{NL(g)}].

We then divide the numerator and denominator by ∫_{A(∆)} dg h(g)e^{NL(g)},

E[f(g)] = [∫_{A(∆)} f h e^{NL} / ∫_{A(∆)} h e^{NL} + ∫_{G\A(∆)} f h e^{NL} / ∫_{A(∆)} h e^{NL}] / [1 + ∫_{G\A(∆)} h e^{NL} / ∫_{A(∆)} h e^{NL}].

Now, we deal with each term separately. The ratio in the denominator can be lower-bounded by zero, and upper-bounded by using a smaller superlevel set, A(∆/2), in its denominator,

0 ≤ ∫_{G\A(∆)} h e^{NL} / ∫_{A(∆)} h e^{NL} ≤ ∫_{G\A(∆)} h e^{NL} / ∫_{A(∆/2)} h e^{NL} ≤ [e^{N(L(g*)-∆)} ∫_{G\A(∆)} dg h(g)] / [e^{N(L(g*)-∆/2)} ∫_{A(∆/2)} dg h(g)] = [∫_{G\A(∆)} dg h(g) / ∫_{A(∆/2)} dg h(g)] e^{-N∆/2}. (44)

The upper bound converges to zero (as h(g) is independent of N), and therefore by the sandwich theorem the ratio of interest also tends to zero.
The second ratio in the numerator can be rewritten as,

∫_{G\A(∆)} f h e^{NL} / ∫_{A(∆)} h e^{NL} = [∫_{G\A(∆)} f h e^{NL} / ∫_{G\A(∆)} h e^{NL}] · [∫_{G\A(∆)} h e^{NL} / ∫_{A(∆)} h e^{NL}]. (45)

The first factor is an expectation of a bounded function, f(g), so it is bounded, while the second factor converges to zero in the limit (by the previous result); hence this term also converges to zero. Finally, we consider the first ratio in the numerator,

∫_{A(∆)} dg f(g) h(g) e^{NL(g)} / ∫_{A(∆)} dg h(g) e^{NL(g)}, (46)

which can be understood as an expectation of f(g) over the region A(∆). As f is continuous, for any ϵ > 0 we can find a δ > 0 such that for all g with |g* - g| < δ,

f(g*) - ϵ < f(g) < f(g*) + ϵ. (47)

Further, because the continuous function L(g) has a unique global optimum, g*, for every δ > 0 we can find a ∆ > 0 such that all points g ∈ A(∆) are within δ of g*, i.e. |g* - g| < δ. Combining these two facts: given an ϵ, we can find a δ such that Eq. 47 holds for all g with |g* - g| < δ, and given that δ, we can find a ∆ such that all g ∈ A(∆) have |g* - g| < δ. Hence, for every ϵ > 0 we can find a ∆ > 0 such that Eq. 47 holds for all g ∈ A(∆). Choosing this ϵ-dependent ∆ and substituting Eq. 47 into Eq. 46, ϵ also bounds the error in the expectation,

f(g*) - ϵ < ∫_{A(∆)} dg f(g)h(g)e^{NL(g)} / ∫_{A(∆)} dg h(g)e^{NL(g)} < f(g*) + ϵ. (48)

Now, we use the results in Eq. (44), Eq. (45) and Eq. (48) to take the limit of Eq. (43) (we can compose these limits by the algebraic limit theorem, as all the individual limits exist and are finite),

f(g*) - ϵ ≤ lim_{N→∞} E[f(g)] ≤ f(g*) + ϵ.

And as this holds for any ϵ, we have,

f(g*) = lim_{N→∞} E[f(g)].

This result is applicable to the DGP posterior over Gram matrices, as that posterior can be written as,

P_N(G_1, ..., G_L|X, Ỹ) ∝ h(g) e^{N L(g)},

where L(g) is the usual DKM objective, L(g) = L(G_1, . . .
, G_L), and h(g) collects the remaining terms in the log-posterior, which do not depend on N: h(g) = exp(-((P+1)/2) Σ_ℓ log|G_ℓ|) (this requires P ≤ N, so that the G_ℓ are full-rank).
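The concentration mechanism in this argument can be illustrated numerically in one dimension: for any N-independent factor h(g) and any L(g) with a unique global maximum g*, expectations under h(g)e^{NL(g)} converge to f(g*) as N grows. This is only a sketch with arbitrary illustrative choices of h, L and f, not the matrix-valued case.

```python
import numpy as np

# 1-D illustration of the Laplace-type concentration argument.
g = np.linspace(0.0, 5.0, 200_001)
L_fn = -(g - 2.0) ** 2           # L(g), unique global maximum at g* = 2
h = 1.0 + 0.1 * g                # arbitrary N-independent factor h(g) > 0
f = np.sin                       # any bounded continuous test function

def expectation(N):
    """E[f(g)] under the unnormalized density h(g) exp(N L(g))."""
    w = h * np.exp(N * (L_fn - L_fn.max()))   # subtract max for stability
    return float(np.sum(f(g) * w) / np.sum(w))

errs = [abs(expectation(N) - f(2.0)) for N in (1, 10, 100, 1000)]
# errs shrinks towards zero, illustrating E[f(g)] -> f(g*).
```

The same mechanism, applied to the posterior over Gram matrices, is what collapses it to a point distribution at G*.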

D GENERAL MODELS IN THE BAYESIAN REPRESENTATION LEARNING LIMIT

Overall, our goal is to compute the integral in Eq. (6) in the limit as N → ∞. While the integral is intractable for general models such as BNNs, we can use variational inference to reason about its properties. In particular, we can bound the integral using the ELBO,

log P(G_ℓ|G_{ℓ-1}) ≥ ELBO_ℓ = E_{Q(F_ℓ)}[log P(G_ℓ|F_ℓ) + log P(F_ℓ|G_{ℓ-1}) - log Q(F_ℓ)]. (54)

Note that Q(F_ℓ) here is different from Q_{θ_ℓ}(f^ℓ_λ) in the main text, both because the approximate posterior here, Q(F_ℓ), is over all features jointly, F_ℓ, whereas the approximate posterior in the main text is only over a single feature, f^ℓ_λ, and because in the main text we chose a specific family of distributions with parameters θ_ℓ, while here we leave the approximate posterior, Q(F_ℓ), completely unconstrained, so that it has the flexibility to capture the true posterior. Indeed, if the optimal approximate posterior is equal to the true posterior, Q*(F_ℓ) = P(F_ℓ|G_{ℓ-1}, G_ℓ), then the bound is tight, so we get log P(G_ℓ|G_{ℓ-1}) = ELBO*_ℓ. Our overall strategy is thus to use variational inference to characterise the optimal approximate posterior, which is equal to the true posterior, Q*(F_ℓ) = P(F_ℓ|G_{ℓ-1}, G_ℓ), and to use the corresponding ELBO to obtain log P(G_ℓ|G_{ℓ-1}).

D.1 CHARACTERISING EXACT BNN POSTERIORS

Remember that if the approximate posterior family, Q(F_ℓ), is flexible enough to capture the true posterior, P(F_ℓ|G_{ℓ-1}, G_ℓ), then the Q*(F_ℓ) that optimizes the ELBO is indeed the true posterior, the bound is tight, and the ELBO is equal to log P(G_ℓ|G_{ℓ-1}) (Jordan et al., 1999; Blei et al., 2017). Thus, we are careful to ensure that our approximate posterior family captures the true posterior, by ensuring that we only impose constraints on Q(F_ℓ) that must hold for the true posterior, P(F_ℓ|G_{ℓ-1}, G_ℓ). In particular, note that P(G_ℓ|F_ℓ) in Eq. (5b) constrains the true posterior to give non-zero mass only to F_ℓ that satisfy G_ℓ = (1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ). However, this constraint is difficult to handle. We therefore consider an alternative, weaker constraint on expectations, which holds for the true posterior (the first equality below holds because Eq. (5b) constrains G_ℓ = (1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ)), and we impose the same constraint on the approximate posterior,

G_ℓ = E_{P(F_ℓ|G_ℓ,G_{ℓ-1})}[(1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ)] = E_{Q(F_ℓ)}[(1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ)]. (55)

Now, we can solve for the optimal Q(F_ℓ) under this constraint on the expectation. In particular, the Lagrangian is obtained by taking the ELBO (Eq. 54), dropping the log P(G_ℓ|F_ℓ) term representing the equality constraint (that G_ℓ = (1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ)), and including Lagrange multipliers for the expectation constraint, Λ (Eq. 55), and for the constraint that the distribution must normalize to 1, a scalar λ̄,

L = ∫ dF_ℓ Q(F_ℓ)(log P(F_ℓ|G_{ℓ-1}) - log Q(F_ℓ)) + (1/2) Tr[Λ(G_ℓ - ∫ dF_ℓ Q(F_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ))] + λ̄(1 - ∫ dF_ℓ Q(F_ℓ)). (56)

Differentiating wrt Q(F_ℓ), and solving for the optimal approximate posterior, Q*(F_ℓ),

0 = ∂L/∂Q(F_ℓ) |_{Q*(F_ℓ)}, (57)
0 = (log P(F_ℓ|G_{ℓ-1}) - log Q*(F_ℓ)) - 1 - (1/2) Tr[Λ ϕ(F_ℓ)ϕᵀ(F_ℓ)] - λ̄. (58)

Solving for log Q*(F_ℓ),

log Q*(F_ℓ) = log P(F_ℓ|G_{ℓ-1}) - (1/2) Tr[Λ ϕ(F_ℓ)ϕᵀ(F_ℓ)] + const. (59)
Using the cyclic property of the trace,

log Q*(F_ℓ) = log P(F_ℓ|G_{ℓ-1}) - (1/2) Tr[ϕᵀ(F_ℓ) Λ ϕ(F_ℓ)] + const. (60)

Thus, log Q*(F_ℓ) can be written as a sum over features,

log Q*(F_ℓ) = Σ_{λ=1}^{N_ℓ} [log P(f^ℓ_λ|G_{ℓ-1}) - (1/2) ϕᵀ(f^ℓ_λ) Λ ϕ(f^ℓ_λ)] + const = Σ_{λ=1}^{N_ℓ} log Q*(f^ℓ_λ), (61)

so the optimal approximate posterior is IID over features,

Q*(F_ℓ) = ∏_{λ=1}^{N_ℓ} Q*(f^ℓ_λ). (62)

Remember that this approximate posterior was constrained only in expectation, and that this constraint holds for the true posterior (Eq. 55). Thus, we might think that this optimal approximate posterior would equal the true posterior. However, remember that the true posterior has a tighter equality constraint, G_ℓ = (1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ), while so far we have imposed only the weaker constraint in expectation (Eq. 55). We thus need to check that our optimal approximate posterior does indeed satisfy the equality constraint in the limit as N_ℓ → ∞. This can be shown using the law of large numbers, as the f^ℓ_λ are IID under the optimal approximate posterior, and by using Eq. (55) for the final equality,

lim_{N_ℓ→∞} (1/N_ℓ) ϕ(F_ℓ)ϕᵀ(F_ℓ) = lim_{N_ℓ→∞} (1/N_ℓ) Σ_{λ=1}^{N_ℓ} ϕ(f^ℓ_λ)ϕᵀ(f^ℓ_λ) = E_{Q(f^ℓ_λ)}[ϕ(f^ℓ_λ)ϕᵀ(f^ℓ_λ)] = G_ℓ. (63)

Thus, the optimal approximate posterior does meet the constraint in the limit as N_ℓ → ∞, so in that limit the true posterior, like the optimal approximate posterior, is IID across features,

P(F_ℓ|G_{ℓ-1}, G_ℓ) = Q*(F_ℓ) = ∏_{λ=1}^{N_ℓ} Q*(f^ℓ_λ) = ∏_{λ=1}^{N_ℓ} P(f^ℓ_λ|G_{ℓ-1}, G_ℓ). (64)

D.2 EXACTLY MULTIVARIATE GAUSSIAN DGP POSTERIORS

For DGPs, we have ϕ(f^ℓ_λ) = f^ℓ_λ, so the optimal approximate posterior is Gaussian,

log Q*_DGP(f^ℓ_λ) = log P_DGP(f^ℓ_λ|G_{ℓ-1}) - (1/2)(f^ℓ_λ)ᵀ Λ f^ℓ_λ + const (65)
= -(1/2)(f^ℓ_λ)ᵀ (Λ + K⁻¹(G_{ℓ-1})) f^ℓ_λ + const (66)
= log N(f^ℓ_λ; 0, (Λ + K⁻¹(G_{ℓ-1}))⁻¹). (67)

As the approximate posterior and true posterior are IID, the constraint in Eq. (55) becomes,

G_ℓ = E_{P_DGP(f^ℓ_λ|G_ℓ,G_{ℓ-1})}[f^ℓ_λ (f^ℓ_λ)ᵀ] = E_{Q*_DGP(f^ℓ_λ)}[f^ℓ_λ (f^ℓ_λ)ᵀ] = (Λ + K⁻¹(G_{ℓ-1}))⁻¹. (68)

As the Lagrange multipliers are unconstrained, we can always set them such that this constraint holds. In that case, both the optimal approximate posterior and the true posterior become,

P_DGP(f^ℓ_λ|G_{ℓ-1}, G_ℓ) = Q*_DGP(f^ℓ_λ) = N(f^ℓ_λ; 0, G_ℓ), (69)

as required.
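Eq. (68) admits a quick numerical sanity check: for any positive-definite target G_ℓ and kernel matrix, choosing Λ = G_ℓ⁻¹ - K⁻¹(G_{ℓ-1}) makes the posterior covariance (Λ + K⁻¹(G_{ℓ-1}))⁻¹ equal G_ℓ. The matrices below are random stand-ins for the kernel and Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
P = 4

def random_psd(P):
    """A random well-conditioned positive-definite P x P matrix."""
    A = rng.normal(size=(P, P))
    return A @ A.T + P * np.eye(P)

K = random_psd(P)      # stands in for K(G_{l-1})
G = random_psd(P)      # stands in for the target Gram matrix G_l

# Lagrange multipliers that make the optimal Gaussian posterior
# covariance (Lambda + K^{-1})^{-1} equal to G (cf. Eq. 68).
Lambda = np.linalg.inv(G) - np.linalg.inv(K)

cov = np.linalg.inv(Lambda + np.linalg.inv(K))
assert np.allclose(cov, G)   # posterior covariance recovers G_l
```

This is the sense in which "the Lagrange multipliers are unconstrained, so we can always set them such that this constraint holds".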

D.3 GENERAL FORM FOR THE CONDITIONAL DISTRIBUTION OVER GRAM MATRICES

Now that we have shown that the true posterior, P(F_ℓ|G_{ℓ-1}, G_ℓ), factorises, we can obtain a simple form for log P(G_ℓ|G_{ℓ-1}). In particular, log P(G_ℓ|G_{ℓ-1}) is equal to the ELBO if we use the true posterior in place of the approximate posterior,

lim_{N_ℓ→∞} (1/N) log P(G_ℓ|G_{ℓ-1}) = lim_{N_ℓ→∞} (1/N) E_{P(F_ℓ|G_{ℓ-1},G_ℓ)}[log P(G_ℓ|F_ℓ) + log (P(F_ℓ|G_{ℓ-1}) / P(F_ℓ|G_{ℓ-1}, G_ℓ))].

Under the posterior, the constraint represented by log P(G_ℓ|F_ℓ) is satisfied, so in the limit we can absorb that term into a constant,

lim_{N_ℓ→∞} (1/N) log P(G_ℓ|G_{ℓ-1}) = lim_{N_ℓ→∞} (1/N) E_{P(F_ℓ|G_{ℓ-1},G_ℓ)}[log (P(F_ℓ|G_{ℓ-1}) / P(F_ℓ|G_{ℓ-1}, G_ℓ))] + const.

Now, we use the fact that the prior, P(F_ℓ|G_{ℓ-1}), and the posterior, P(F_ℓ|G_{ℓ-1}, G_ℓ), are IID across features,

lim_{N_ℓ→∞} (1/N) log P(G_ℓ|G_{ℓ-1}) = ν_ℓ E_{P(f^ℓ_λ|G_{ℓ-1},G_ℓ)}[log (P(f^ℓ_λ|G_{ℓ-1}) / P(f^ℓ_λ|G_{ℓ-1}, G_ℓ))] + const, (72)

and this expectation is a negative KL-divergence,

lim_{N_ℓ→∞} (1/N) log P(G_ℓ|G_{ℓ-1}) = -ν_ℓ D_KL(P(f^ℓ_λ|G_{ℓ-1}, G_ℓ) || P(f^ℓ_λ|G_{ℓ-1})) + const,

which gives Eq. (32) when combined with Eq. (8).

D.4 PARAMETRIC APPROXIMATE POSTERIORS

Eq. (64) represents a considerable simplification, as we now need to consider only a single feature, f^ℓ_λ, rather than the joint distribution over all features, F_ℓ. However, in the general case it is still not possible to compute Eq. (64), because the true posterior over a single feature remains intractable. Following the true posteriors derived in the previous section, we therefore choose a parametric approximate posterior that factorises across features,

Q_θ(F_1, ..., F_L) = ∏_{ℓ=1}^L ∏_{λ=1}^{N_ℓ} Q_{θ_ℓ}(f^ℓ_λ). (74)

Remember that we optimize the approximate posterior parameters, θ, directly, and set the Gram matrices as a function of θ (Eq. 34). As before, we can bound log P(G_ℓ = G_θ(θ_ℓ)|G_{ℓ-1}) using the ELBO, and the bound is tight when the approximate posterior equals the true posterior,

log P(G_ℓ = G_θ(θ_ℓ)|G_{ℓ-1}) (75)
= E_{P(F_ℓ|G_{ℓ-1}, G_ℓ=G_θ(θ_ℓ))}[log P(G_ℓ = G_θ(θ_ℓ)|F_ℓ) + log (P(F_ℓ|G_{ℓ-1}) / P(F_ℓ|G_{ℓ-1}, G_ℓ = G_θ(θ_ℓ)))] (76)
≥ E_{Q_θ(F_ℓ)}[log P(G_ℓ = G_θ(θ_ℓ)|F_ℓ) + log (P(F_ℓ|G_{ℓ-1}) / Q_{θ_ℓ}(F_ℓ))]. (77)

Now, we can cancel the log P(G_ℓ = G_θ(θ_ℓ)|F_ℓ) terms, as they represent a constraint that holds both under the true posterior and under the approximate posterior,

E_{P(F_ℓ|G_{ℓ-1}, G_ℓ=G_θ(θ_ℓ))}[log (P(F_ℓ|G_{ℓ-1}) / P(F_ℓ|G_{ℓ-1}, G_ℓ = G_θ(θ_ℓ)))] ≥ E_{Q_{θ_ℓ}(F_ℓ)}[log (P(F_ℓ|G_{ℓ-1}) / Q_{θ_ℓ}(F_ℓ))]. (78)

Using the fact that the prior, posterior and approximate posterior are all IID over features, we can write this inequality in terms of distributions over a single feature, f^ℓ_λ, and divide by N_ℓ,

E_{P(f^ℓ_λ|G_{ℓ-1}, G_ℓ=G_θ(θ_ℓ))}[log (P(f^ℓ_λ|G_{ℓ-1}) / P(f^ℓ_λ|G_{ℓ-1}, G_ℓ = G_θ(θ_ℓ)))] ≥ E_{Q_{θ_ℓ}(f^ℓ_λ)}[log (P(f^ℓ_λ|G_{ℓ-1}) / Q_{θ_ℓ}(f^ℓ_λ))]. (79)

Noting that both sides of this inequality are negative KL-divergences, we obtain,

-D_KL(P(f^ℓ_λ|G_{ℓ-1}, G_ℓ = G_θ(θ_ℓ)) || P(f^ℓ_λ|G_{ℓ-1})) ≥ -D_KL(Q_{θ_ℓ}(f^ℓ_λ) || P(f^ℓ_λ|G_{ℓ-1})), (80)

which gives Eq. (33) in the main text.

E THEORETICAL SIMILARITIES IN REPRESENTATION LEARNING IN FINITE AND INFINITE NETWORKS

In the main text, we considered probability densities of the Gram matrices, G_1, ..., G_L. However, we can also consider probability densities of the features, F_1, ..., F_L; for a DGP,

log P(F_ℓ|F_{ℓ-1}) = -(N_ℓ/2) log |K(G_DGP(F_{ℓ-1}))| - (1/2) tr[F_ℓᵀ K⁻¹(G_DGP(F_{ℓ-1})) F_ℓ] + const. (81)

We can rewrite this density such that it is still a density over features, F_ℓ, but expressed in terms of the DGP Gram matrix,

log P(F_ℓ|F_{ℓ-1}) = -(N_ℓ/2) log |K(G_{ℓ-1})| - (N_ℓ/2) tr[K⁻¹(G_{ℓ-1}) G_ℓ] + const. (82)

Here, we have used the cyclic property of the trace to combine F_ℓ and F_ℓᵀ into G_ℓ, and we have used the fact that our kernels can be written as a function of the Gram matrix. Overall, we can therefore write the posterior over features, P(F_1, ..., F_L|X, Ỹ), in terms of only Gram matrices,

J(G_1, ..., G_L) = (1/N) log P(F_1, ..., F_L|X, Ỹ) = log P(Y|G_L) + (1/N) Σ_{ℓ=1}^L log P(F_ℓ|F_{ℓ-1}), (83)

and substituting Eq. (82),

J(G_1, ..., G_L) = log P(Y|G_L) - (1/2) Σ_{ℓ=1}^L ν_ℓ [log |K(G_{ℓ-1})| + tr(K⁻¹(G_{ℓ-1}) G_ℓ)] + const. (84)

Thus, J(G_1, ..., G_L) does not depend on N, and so the Gram matrices that maximize J(G_1, ..., G_L) are the same for any choice of N. The only restriction is that we need N_ℓ ≥ P, to ensure that the Gram matrices are full-rank. To confirm these results, we used Adam with a learning rate of 10⁻³ to optimize full-rank Gram matrices with Eq. (84), and to directly do MAP inference over features using Eq. (81). As expected, as the number of features increased, the Gram matrix from MAP inference over features converged rapidly to that obtained using Eq. (84) (Fig. 5).
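The key step here, that the quadratic term of Eq. (81) depends on the features only through the Gram matrix, is just the cyclic property of the trace, and is easy to check numerically. Random matrices below stand in for F_ℓ and K(G_{ℓ-1}).

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 4, 7

A = rng.normal(size=(P, P))
K = A @ A.T + P * np.eye(P)          # stands in for K(G_{l-1})
F = rng.normal(size=(P, N))          # features F_l, one column per feature
G = F @ F.T / N                      # Gram matrix G_l = (1/N) F F^T

Kinv = np.linalg.inv(K)
# Quadratic term written over features (as in Eq. 81)...
quad_features = np.trace(F.T @ Kinv @ F)
# ...equals the Gram-matrix form (as in Eq. 82), by tr(F^T K^{-1} F)
# = tr(K^{-1} F F^T) = N tr(K^{-1} G).
quad_gram = N * np.trace(Kinv @ G)
```

Since the two expressions agree exactly, the log-density over features at every width is a function of the Gram matrices alone, which is why the maximizers coincide for all N ≥ P.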

[Figure 5: Gram matrices, G_1 and G_2, from MAP inference over features in a two-layer network, for widths ranging from 2^1 to 2^11.]

F ADDITIONAL EXPERIMENTAL DETAILS

To optimize the analytic DKM objective for DGPs and the variational DKM objective for DGPs, we parameterised the Gram matrices (or the covariances of the variational approximate posterior) as the product of a square matrix, R_ℓ ∈ R^{P×P}, with its transpose, G_ℓ = (1/P) R_ℓ R_ℓᵀ, and we used Adam with a learning rate of 10⁻³ to learn R_ℓ. To do Bayesian inference in finite BNNs and DGPs, we used Langevin sampling with 10 parallel chains and a step size of 10⁻³. Note that in certain scenarios, Langevin sampling can be very slow, as the features have a Gaussian prior with covariance K(G_{ℓ-1}), which has some very small and some larger eigenvalues, making sampling difficult. Instead, we reparameterised the model in terms of standard Gaussian random variables, V_ℓ ∈ R^{P×N_ℓ}. We then wrote F_ℓ in terms of V_ℓ as F_ℓ = L_{ℓ-1} V_ℓ, where L_{ℓ-1} is the Cholesky factor of K(G_{ℓ-1}), so K(G_{ℓ-1}) = L_{ℓ-1} L_{ℓ-1}ᵀ. This gives an equivalent distribution P(F_ℓ|F_{ℓ-1}). Importantly, as the prior on V_ℓ is IID standard Gaussian, sampling V_ℓ is much faster. To ensure that the computational cost of these expensive simulations remained reasonable, we used a subset of 50 datapoints from each dataset. For the DKM objective for BNNs, we used Monte-Carlo sampling to approximate the Gram matrices,

G_θ(θ_ℓ) ≈ (1/K) Σ_{k=1}^K ϕ(f^ℓ_k)ϕᵀ(f^ℓ_k),

with f^ℓ_k drawn from the appropriate approximate posterior, and K = 2¹⁶. We can use the reparameterisation trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate through these Monte-Carlo estimates.
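The two parameterisations described above, the G = (1/P) R Rᵀ parameterisation of Gram matrices and the Cholesky reparameterisation used to speed up Langevin sampling, can be sketched as follows; a small jitter and random stand-ins replace the actual kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N = 5, 200_000

# Gram-matrix parameterisation for optimization: G = (1/P) R R^T,
# which is positive (semi-)definite for any square matrix R.
R = rng.normal(size=(P, P))
G = R @ R.T / P

# Reparameterised sampling: with V ~ N(0, I) IID and L the Cholesky
# factor of K(G_{l-1}), F = L V has prior covariance L L^T = K(G_{l-1}).
K = G + 1e-6 * np.eye(P)             # stand-in for K(G_{l-1}), jittered
L = np.linalg.cholesky(K)
V = rng.normal(size=(P, N))          # IID standard Gaussian latents
F = L @ V                            # each column f_lambda ~ N(0, K)
```

Because the distribution over V is isotropic, Langevin steps in V-space do not suffer from the ill-conditioning of K, while the implied distribution over F is unchanged.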

G ADDITIONAL COMPARISONS WITH FINITE-WIDTH DGPS

Here, we give additional results supporting those in Sec. 4.6. In particular, we give the DGP representations learned by two-layer networks on all UCI datasets (boston, concrete, energy, yacht), except those already given in the main text (Figs. 6, 7 and 8).

I MULTIVARIATE GAUSSIAN APPROXIMATE POSTERIORS IN DEEPER NETWORKS

There is a body of theoretical work on BNNs (e.g. Seroussi & Ringel, 2021) that approximates BNN posteriors over features as Gaussian. While we have shown that this is a bad idea in general (Figs. 4 and 9), we can nonetheless ask whether there are circumstances where the idea might work well. In particular, we hypothesised that depth is an important factor: in shallow networks, in order to get G_L close to the required representation, we may need the posterior over F_ℓ to be quite different from the prior. In contrast, in deeper networks, we might expect the posterior over F_ℓ to be closer to its (Gaussian) prior, and therefore we might expect Gaussian approximate posteriors to work better. However, we cannot simply make the network deeper, because as we do so we apply the nonlinearity more times, dramatically altering the network's inductive biases. To resolve this issue, we derive a leaky-relu nonlinearity that allows (approximately) independent control over the inductive biases (or effective depth) and the actual depth (Appendix I.1). Using these nonlinearities, we indeed find that very deep networks are reasonably well approximated by multivariate Gaussian approximate posteriors (Appendix I.2).

I.1 LEAKY RELU NONLINEARITIES

Our goal is to find a pointwise nonlinearity, ϕ, such that (under the prior),

E_{P_DGP(f^ℓ_λ|G_{ℓ-1})}[ϕ(f^ℓ_λ)ϕᵀ(f^ℓ_λ)] = p E_{P(f^ℓ_λ|G_{ℓ-1})}[relu(f^ℓ_λ) reluᵀ(f^ℓ_λ)] + (1-p) G_{ℓ-1}. (87)

We will set p = α/L, where α is the "effective" depth of the network and L is the real depth. These networks are designed such that their inductive biases in the infinite-width limit are similar to those of a standard relu network with depth α. Indeed, we would take this approach if we wanted a well-defined infinite-depth DKM limit. Without loss of generality, we consider a 2D case, where x and y are zero-mean bivariate Gaussian,

π(x, y) = N((x, y); 0, [[Σ_xx, Σ_xy], [Σ_xy, Σ_yy]]), (88)

where π(x, y) is the probability density for the joint distribution. Note that we use a scaled relu,

relu(x) = √2 x if 0 < x, and 0 otherwise, (89)

such that E[relu²(x)] = Σ_xx. Mirroring Eq. 87, we want the nonlinearity, ϕ, to satisfy,

E[ϕ²(x)] = p E[relu²(x)] + (1-p) Σ_xx = Σ_xx, (90a)
E[ϕ²(y)] = p E[relu²(y)] + (1-p) Σ_yy = Σ_yy, (90b)
E[ϕ(x)ϕ(y)] = p E[relu(x)relu(y)] + (1-p) Σ_xy. (90c)

We hypothesise that this nonlinearity has the form ϕ(x) = a relu(x) + bx (91). We will write the relu as a sum of x and |x|, relu(x) = (1/√2)(x + |x|), because E[f(x,y)] = 0 for f(x,y) = x|y| or f(x,y) = |x|y. In fact, we get zero expectation for all functions satisfying f(-x,-y) = -f(x,y), which holds for the two choices above. To show such functions have zero expectation, we write out the integral explicitly,

E[f(x,y)] = ∫_{-∞}^{∞} dx ∫_{-∞}^{∞} dy π(x,y) f(x,y).

We split the domain of integration for y at zero,

E[f(x,y)] = ∫_{-∞}^{∞} dx ∫_{-∞}^{0} dy π(x,y) f(x,y) + ∫_{-∞}^{∞} dx ∫_{0}^{∞} dy π(x,y) f(x,y).

We substitute y′ = -y and x′ = -x in the first integral,

E[f(x,y)] = ∫_{-∞}^{∞} dx′ ∫_{0}^{∞} dy′ π(-x′,-y′) f(-x′,-y′) + ∫_{-∞}^{∞} dx ∫_{0}^{∞} dy π(x,y) f(x,y).
As the variables we integrate over are arbitrary, we can relabel y′ as y and x′ as x, and then merge the integrals as their limits are the same,

E[f(x,y)] = ∫_{-∞}^{∞} dx ∫_{0}^{∞} dy [π(-x,-y) f(-x,-y) + π(x,y) f(x,y)].

Under a zero-mean Gaussian, π(-x,-y) = π(x,y), so

E[f(x,y)] = ∫_{-∞}^{∞} dx ∫_{0}^{∞} dy π(x,y) (f(-x,-y) + f(x,y)).

Thus, if f(-x,-y) = -f(x,y), then the expectation of that function under a bivariate zero-mean Gaussian distribution is zero. Remember that our overall goal was to design a nonlinearity, ϕ (Eq. 91), which satisfies Eq. (90). We therefore compute the expectation,

E[ϕ(x)ϕ(y)] = E[(a relu(x) + bx)(a relu(y) + by)] = E[(a(1/√2)(x + |x|) + bx)(a(1/√2)(y + |y|) + by)].

Using the fact that E[x|y|] = E[|x|y] = 0 under a multivariate Gaussian,

E[ϕ(x)ϕ(y)] = E[a² (1/√2)(x + |x|)(1/√2)(y + |y|) + (√2 ab + b²) xy] (100)
= a² E[relu(x)relu(y)] + (√2 ab + b²) E[xy]. (101)

Thus, we can find the value of a by comparing with Eq. (90c),

p = a², so a = √p. (102)

For b, things are a bit more involved,

1 - p = √2 ab + b² = √(2p) b + b², (103)

where we substitute the value of a. This can be rearranged into a quadratic equation in b,

0 = b² + √(2p) b + (p - 1), (104)

which can be solved,

b = (1/2)(-√(2p) ± √(2p - 4(p-1))) (105)
= (1/2)(-√(2p) ± √(4 - 2p)) (106)
= -√(p/2) ± √(1 - p/2). (107)

Only the positive root is of interest,

b = √(1 - p/2) - √(p/2). (108)

Thus, the nonlinearity is,

ϕ(x) = √p relu(x) + (√(1 - p/2) - √(p/2)) x, (109)

where we set p = α/L, and remember that we use the scaled relu of Eq. (89). Finally, we established these choices by considering only the cross term, E[ϕ(x)ϕ(y)]. We also need to check that the E[ϕ²(x)] and E[ϕ²(y)] terms are as required (Eq. 90a and Eq. 90b). In particular,

E[ϕ²(x)] = E[(a relu(x) + bx)²] = E[(a(1/√2)(x + |x|) + bx)²]. (110)

Using E[x|x|] = 0, as x|x| is an odd function of x and the zero-mean Gaussian density is even,

E[ϕ²(x)] = a² E[relu²(x)] + (√2 ab + b²) Σ_xx, (111)

using Eq. (102) to identify a² and Eq.
(103) to identify √2 ab + b², we obtain E[ϕ²(x)] = p E[relu²(x)] + (1 - p) Σ_xx, as required.
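The moment condition Eq. (90a) can also be checked by Monte-Carlo, using the final nonlinearity of Eq. (109) with the scaled relu of Eq. (89); the value of Σ_xx below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x, p):
    """Leaky-relu nonlinearity of Eq. (109), with p = alpha / L."""
    a = np.sqrt(p)
    b = np.sqrt(1 - p / 2) - np.sqrt(p / 2)
    relu_scaled = np.sqrt(2) * np.maximum(x, 0)   # scaled relu of Eq. (89)
    return a * relu_scaled + b * x

# Monte-Carlo check of Eq. (90a): E[phi(x)^2] = Sigma_xx for any p in [0, 1].
Sigma_xx = 2.5
x = rng.normal(0.0, np.sqrt(Sigma_xx), size=2_000_000)
second_moments = {p: float(np.mean(phi(x, p) ** 2)) for p in (0.1, 0.5, 0.9)}
```

At p = 1 the nonlinearity reduces to the scaled relu, and at p = 0 it reduces to the identity, which is how the effective depth interpolates between a relu network of depth α and a linear layer.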

I.2 MULTIVARIATE GAUSSIAN IN DEEPER NETWORKS

In the main text, we show that a more complex approximate posterior can match the distributions in these networks. Here, we consider an alternative approach. In particular, we hypothesise that these distributions are strongly non-Gaussian because the networks are shallow, meaning that the posterior needs to be far from the prior in order to get a top-layer kernel close to G_{L+1}. We can therefore make the posteriors closer to Gaussian by using leaky-relu nonlinearities (Appendix I.1) with fixed effective depth (α = 2) but increasing real depth, L. In particular, we use multivariate Gaussian approximate posteriors with learned means, Q_{θ_ℓ}(f^ℓ_λ) = N(f^ℓ_λ; μ_ℓ, Σ_ℓ), so θ_ℓ = (μ_ℓ, Σ_ℓ). As expected, for a depth-32 network, the marginals are much more similar (Fig. 10).

J MODES OF THE DKM OBJECTIVE FOR LINEAR MODELS

To find the mode, again we set the gradient wrt G_ℓ to zero,

0 = ∂L/∂G_ℓ = -((ν_{ℓ+1} - ν_ℓ)/2) G_ℓ⁻¹ - (ν_ℓ/2) G_{ℓ-1}⁻¹ + (ν_{ℓ+1}/2) G_ℓ⁻¹ G_{ℓ+1} G_ℓ⁻¹,

for ℓ = 1, ..., L. Right-multiplying by 2G_ℓ and rearranging,

ν_{ℓ+1} G_ℓ⁻¹ G_{ℓ+1} = ν_ℓ G_{ℓ-1}⁻¹ G_ℓ + (ν_{ℓ+1} - ν_ℓ) I, for ℓ = 1, ..., L.

Evaluating this expression for ℓ = 1 and ℓ = 2 gives,

ν_2 G_1⁻¹ G_2 = ν_1 G_0⁻¹ G_1 + (ν_2 - ν_1) I,
ν_3 G_2⁻¹ G_3 = ν_2 G_1⁻¹ G_2 + (ν_3 - ν_2) I = ν_1 G_0⁻¹ G_1 + (ν_3 - ν_1) I.

Recursing, we get,

ν_ℓ G_{ℓ-1}⁻¹ G_ℓ = ν_1 G_0⁻¹ G_1 + (ν_ℓ - ν_1) I. (128)

Critically, this form highlights constraints on G_1. In particular, the left-hand side, ν_ℓ G_{ℓ-1}⁻¹ G_ℓ, is (up to the positive factor ν_ℓ) the product of two positive definite matrices, so it has positive eigenvalues (though it may be non-symmetric; Horn & Johnson, 2012). Thus, all eigenvalues of ν_1 G_0⁻¹ G_1 must be larger than ν_1 - ν_ℓ, and this holds at every layer. This will become important later, as it rules out inadmissible solutions.
Given G_0 and G_1, we can compute any G_ℓ using,

G_0⁻¹ G_ℓ = ∏_{ℓ'=1}^{ℓ} G_{ℓ'-1}⁻¹ G_{ℓ'} = (1 / ∏_{ℓ'=1}^{ℓ} ν_{ℓ'}) ∏_{ℓ'=1}^{ℓ} ν_{ℓ'} G_{ℓ'-1}⁻¹ G_{ℓ'}, (129)

so

(∏_{ℓ'=1}^{ℓ} ν_{ℓ'}) G_0⁻¹ G_ℓ = ∏_{ℓ'=1}^{ℓ} (ν_1 G_0⁻¹ G_1 + (ν_{ℓ'} - ν_1) I), (130)

where the matrix products are ordered as ∏_{ℓ=1}^{L} A_ℓ = A_1 ⋯ A_L. Now, we seek to solve for G_1 using our knowledge of G_{L+1}. Computing G_0⁻¹ G_{L+1},

(∏_{ℓ=1}^{L+1} ν_ℓ) G_0⁻¹ G_{L+1} = ∏_{ℓ=1}^{L+1} (ν_1 G_0⁻¹ G_1 + (ν_ℓ - ν_1) I). (131)

We write the eigendecomposition of ν_1 G_0⁻¹ G_1 as,

ν_1 G_0⁻¹ G_1 = V D V⁻¹. (132)

Thus,

(∏_{ℓ=1}^{L+1} ν_ℓ) G_0⁻¹ G_{L+1} = ∏_{ℓ=1}^{L+1} (V D V⁻¹ + (ν_ℓ - ν_1) I) = V Λ V⁻¹, (133)

where Λ is a diagonal matrix,

Λ = ∏_{ℓ=1}^{L+1} (D + (ν_ℓ - ν_1) I). (134)

Thus, we can identify V and Λ by performing an eigendecomposition of the known matrix (∏_{ℓ=1}^{L+1} ν_ℓ) G_0⁻¹ G_{L+1}. Then, we can solve for D (and hence G_1) in terms of Λ and V. The diagonal elements of D satisfy,

0 = -Λ_ii + ∏_{ℓ=1}^{L+1} (D_ii + (ν_ℓ - ν_1)). (135)

This is a polynomial, and remembering the constraints from Eq. (128), we are interested in solutions which satisfy,

ν_1 - ν_min ≤ D_ii, where ν_min = min(ν_1, ..., ν_{L+1}). (136, 137)

To reason about the number of such solutions, we use Descartes' rule of signs, which states that the number of positive real roots either equals the number of sign changes in the sequence of polynomial coefficients, or is less than it by an even number. Thus, if there is exactly one sign change, there must be exactly one positive real root. For instance, in the polynomial 0 = x³ + x² - 1, the signs go (+), (+), (-), so there is one sign change, and hence one positive real root. To use Descartes' rule of signs, we work in terms of D′_ii, which is constrained to be nonnegative,

0 ≤ D′_ii = D_ii - (ν_1 - ν_min), i.e. D_ii = D′_ii + (ν_1 - ν_min). (139, 140)

Thus, the polynomial of interest (Eq. 135) becomes,

0 = -Λ_ii + ∏_{ℓ=1}^{L+1} (D′_ii + (ν_1 - ν_min) - (ν_1 - ν_ℓ)) = -Λ_ii + ∏_{ℓ=1}^{L+1} (D′_ii + (ν_ℓ - ν_min)), (141)

where 0 ≤ ν_ℓ - ν_min, as ν_min is defined to be the smallest ν_ℓ (Eq. 137).
Thus, the constant term, -Λ_ii, is negative, while the coefficients of all the other terms, D′_ii, ..., (D′_ii)^{L+1}, in the polynomial are nonnegative (the leading coefficient is 1). Thus, there is only one sign change in the sequence of coefficients, which proves the existence of exactly one valid real root, as required.
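This root-counting argument is easy to check numerically for random ν_ℓ; the polynomial in D′_ii is built here with arbitrary illustrative values of Λ_ii and the ν's.

```python
import numpy as np

rng = np.random.default_rng(5)

# Numerical check of the Descartes'-rule argument: the polynomial
#   0 = -Lambda_ii + prod_{l=1}^{L+1} (x + (nu_l - nu_min)),
# in x = D'_ii, has exactly one positive real root.
nu = rng.uniform(0.5, 5.0, size=6)     # nu_1, ..., nu_{L+1}, with L = 5
c = nu - nu.min()                      # nonnegative shifts nu_l - nu_min
Lambda_ii = 3.7                        # an arbitrary positive diagonal entry

coeffs = np.poly(-c)                   # coefficients of prod_l (x + c_l)
coeffs[-1] -= Lambda_ii                # subtract the constant Lambda_ii
roots = np.roots(coeffs)
pos_real = [r.real for r in roots
            if abs(r.imag) < 1e-8 and r.real > 1e-12]
# pos_real should contain a single admissible root
```

The single positive root fixes D_ii = D′_ii + (ν_1 - ν_min), and hence G_1, uniquely.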

K UNIMODALITY EXPERIMENTS WITH NONLINEAR KERNELS

For the posterior over Gram matrices to converge to a point distribution, we need the DKM objective, L(G_1, ..., G_L), to have a unique global optimum. As noted above, this is guaranteed when the prior dominates (Eq. 11), and for linear models (Appendix J). While we believe it might be possible to construct counterexamples, in practice we expect a single global optimum in most practical settings. To confirm this expectation, we ran a number of experiments, starting from many different random initializations of a deep kernel machine and optimizing using gradient descent. In all cases tested, the optimizers converged to the same maximum. We parameterise the Gram matrices as G_ℓ = (1/P) V_ℓ V_ℓᵀ, with V_ℓ ∈ R^{P×P} being trainable parameters. To make initializations with different seeds sufficiently separated while ensuring stability, we initialize G_ℓ from a broad distribution that depends on K(G_{ℓ-1}). Specifically, we first take the Cholesky decomposition K(G_{ℓ-1}) = L_{ℓ-1} L_{ℓ-1}ᵀ, then set V_ℓ = L_{ℓ-1} Ξ_ℓ D_ℓ^{1/2}, where each entry of Ξ_ℓ ∈ R^{P×P} is independently sampled from a standard Gaussian, and D_ℓ is a diagonal scaling matrix with each entry sampled IID from an inverse-Gamma distribution. The variance of the inverse-Gamma distribution is fixed to 100, and the mean is drawn from a uniform distribution U[0.5, 3] for each seed. Since for any random variable x ~ Inv-Gamma(α, β), E(x) = β/(α-1) and V(x) = β²/((α-1)²(α-2)), once we fix the mean and variance we can compute α and β as α = E(x)²/V(x) + 2 and β = E(x)(α-1). We set ν_ℓ = 5, and use the Adam optimizer (Kingma & Ba, 2014) with learning rate 0.001 to optimize the parameters V_ℓ described above. We fixed all model hyperparameters to ensure that any multimodality could emerge only from the underlying deep kernel machine. As we did not use inducing points, we were forced to consider only the smaller UCI datasets (yacht, boston, energy and concrete).
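The moment-matching step for the inverse-Gamma initialization can be sketched as follows; the mean and variance values are illustrative.

```python
import numpy as np

def inv_gamma_params(mean, var):
    """Shape alpha and scale beta of an inverse-Gamma distribution with
    the given mean and variance, via alpha = mean^2/var + 2 and
    beta = mean * (alpha - 1)."""
    alpha = mean ** 2 / var + 2.0
    beta = mean * (alpha - 1.0)
    return alpha, beta

# Check against the inverse-Gamma moments E(x) = beta/(alpha-1) and
# V(x) = beta^2 / ((alpha-1)^2 (alpha-2)).
alpha, beta = inv_gamma_params(mean=1.5, var=100.0)
E = beta / (alpha - 1.0)
V = beta ** 2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))
```

Because the variance is fixed at 100 while the mean varies per seed, each seed draws its diagonal scales from a broad, seed-specific distribution, which is what makes the initializations well separated.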
For the deep kernel machine objective, all Gram matrices converge rapidly to the same solution, as measured by RMSE (Fig. 12). Critically, we did find multiple modes for the MAP objective (Fig. 13), indicating that our experiments are indeed powerful enough to find multiple modes (though of course they cannot be guaranteed to find them). Finally, note that the Gram matrices took a surprisingly long time to converge: this was largely due to the high degree of diversity in the initializations; convergence was much faster if we initialised deterministically from the prior. This unimodality might seem to contradict our usual intuitions about huge multimodality in the weights/features of BNNs and DGPs. This can be reconciled by noting that each mode, written in terms of Gram matrices, corresponds to (perhaps infinitely) many modes in terms of features. In particular, in Appendix E, we show that the log-probability for features, P(F_ℓ|F_{ℓ-1}) (Eq. 82), depends only on the Gram matrices, and note that there are many settings of the features which give the same Gram matrix. In particular, the Gram matrix is the same for any unitary transformation of the features, F′_ℓ = F_ℓ U, satisfying UUᵀ = I, as

(1/N_ℓ) F′_ℓ F′_ℓᵀ = (1/N_ℓ) F_ℓ U Uᵀ F_ℓᵀ = (1/N_ℓ) F_ℓ F_ℓᵀ = G_ℓ.

For DGPs we can use any unitary matrix, so there are infinitely many sets of features consistent with a particular Gram matrix, while for BNNs we can only use permutation matrices, which are a subset of unitary matrices. Thus, the objective landscape must be far more complex in the feature domain than in terms of Gram matrices, as a single optimal Gram matrix corresponds to a large family of optimal features.

[Figure 13: Multiple modes found when optimizing the MAP objective (Eq. 84). Rows and columns are the same as in Figure 12. Using the same randomly scaled initializations described above, we are able to find multiple modes in energy and concrete, showing that our initializations are diverse enough, albeit there is still only a single global optimum for the DKM objective.]
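The unitary-invariance argument above can be verified directly; a random orthogonal matrix stands in for U.

```python
import numpy as np

rng = np.random.default_rng(6)
P, N = 4, 10

F = rng.normal(size=(P, N))
G = F @ F.T / N

# Any orthogonal (real unitary) transformation of the features leaves the
# Gram matrix unchanged: (F U)(F U)^T = F U U^T F^T = F F^T.
U, _ = np.linalg.qr(rng.normal(size=(N, N)))   # random orthogonal matrix
F_prime = F @ U
G_prime = F_prime @ F_prime.T / N
```

Since `G_prime` equals `G` for every such U, an entire orthogonal orbit of features maps onto a single point in Gram-matrix space, which is why the feature-space landscape is so much more multimodal.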

L INDUCING POINT DKMS

To do large-scale experiments on UCI datasets, we introduce inducing-point DKMs by extending Gaussian process inducing-point methods (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) to the DKM setting. This approach uses the variational interpretation of the deep kernel machine objective described in Appendix D. To do inducing-point variational inference, we need to explicitly introduce top-layer features mirroring F_{L+1} ∈ R^{P×ν_{L+1}} in Appendix B, but replicated N times, F̃_{L+1} ∈ R^{P×N_{L+1}}. Formally, each feature, f̃^{L+1}_1, ..., f̃^{L+1}_{N_{L+1}}, is IID conditioned on F_L,

P(F̃_{L+1}|F_L) = ∏_{λ=1}^{N_{L+1}} N(f̃^{L+1}_λ; 0, K(G(F_L))), P(Ỹ|F̃_{L+1}) = ∏_{λ=1}^{N_{L+1}} N(ỹ_λ; f̃^{L+1}_λ, σ² I),

where we give the likelihood for regression, but other likelihoods (e.g. for classification) are possible (Appendix B). Further, we take the total number of points, P, to be made up of P_i inducing points and P_t test/train points, so that P = P_i + P_t. Thus, we can separate all features, F_ℓ ∈ R^{P×N_ℓ}, into the inducing features, F_ℓi ∈ R^{P_i×N_ℓ}, and the test/train features, F_ℓt ∈ R^{P_t×N_ℓ}. Likewise, we separate the inputs, X, and outputs, Y, into (potentially trained) inducing inputs, X_i, trained inducing outputs, Y_i, and the real test/training inputs, X_t, and outputs, Y_t,

F_ℓ = (F_ℓi; F_ℓt), F̃_{L+1} = (F̃_{L+1,i}; F̃_{L+1,t}), X = (X_i; X_t), Y = (Y_i; Y_t), Ỹ = (Ỹ_i; Ỹ_t).

We follow the usual doubly-stochastic inducing-point approach for DGPs. In particular, we treat all the features at intermediate layers, F_1, ..., F_L, and the top-layer test/train features, F_{L+1,t}, as latent variables. However, we deviate from the usual setup by treating the top-layer inducing outputs, F_{L+1,i}, as learned parameters and maximizing over them, which ensures that the ultimate method does not require sampling while still allowing minibatched training. The approximate posterior over F_1, ..., F_L is given by, Q(F_1, . . .
F L |X) = L ℓ=1 Q (F ℓ |F ℓ-1 ) , and remember F 0 = X, so G 0 = 1 N0 XX T . The prior and approximate posterior at each layer factorises into a distribution over the inducing points and a distribution over the test/train points, Q (F ℓ |F ℓ-1 ) = P F ℓ t |F ℓ i , F ℓ-1 Q F ℓ i , P (F ℓ |F ℓ-1 ) = P F ℓ t |F ℓ i , F ℓ-1 P F ℓ i |F ℓ-1 i . Critically, the approximate posterior samples for the test/train points is the conditional prior P F ℓ t |F ℓ i , F ℓ-1 , which is going to lead to cancellation when we compute the ELBO. Likewise, the approximate posterior over FL+1 t is the conditional prior, Q FL+1 t |F L+1 i , F L = P FL+1 t |F L+1 i , F L . ( ) Concretely, the prior approximate posterior over inducing points are given by, Q F ℓ i = N ℓ λ=1 N f ℓ i;λ ; 0, G ℓ ii , P F ℓ i |F ℓ-1 i = N ℓ λ=1 N f ℓ i;λ ; 0, K(G(F ℓ-1 i )) The approximate posterior is directly analogous to Eq. ( 69) and the prior is directly analogous to Eq. ( 1a), but where we have specified that this is only over inducing points.  Q FL+1 t |F L+1 i , F L Q (F 1 , . . . F L |X) Note that the P F ℓ t |F ℓ i , F ℓ-1 terms are going to cancel in the ELBO, we consider them below when we come to describing sampling), substituting Eq. (145-147) and cancelling P F ℓ t |F ℓ i , F ℓ-1 and P FL+1 t |F L+1 i , F L , ELBO(F L+1 i , G 1 ii , . . . , G L ii ) = E Q log P Ỹt | FL+1 t + L ℓ=1 log P F ℓ i |F ℓ-1 i Q F ℓ i . ( ) So far, we have treated the Gram matrices, G ℓ ii as parameters of the approximate posterior. However, in the infinite limit N → ∞, these are consistent with the features generated by the approximate posterior. In particular the matrix product 1 N ℓ F ℓ i F ℓ i T can be written as an average over infinitely many IID vectors, f ℓ i;λ (first equality), and by the law of large numbers, this is equal to the expectation of one term (second equality), which is G ℓ ii (by the approximate posterior Eq. 
( 148a)), lim N →∞ 1 N ℓ F ℓ i F ℓ i T = lim N →∞ 1 N ℓ N ℓ λ=1 f ℓ i;λ f ℓ i;λ T = E Q(f ℓ i;λ ) f ℓ i;λ f ℓ i;λ T = G ℓ ii . By this argument, the Gram matrix from the previous layer, G ℓ-1 ii is deterministic. Further, in a DGP, F ℓ i only depends on F ℓ-1 i through G ℓ-1 i (Eq. 5), and the prior and approximate posterior factorise. Thus, in the infinite limit, individual terms in the ELBO can be written, lim N →∞ 1 N E Q log P F ℓ i |F ℓ-1 i Q F ℓ i = ν ℓ E Q   log P f ℓ i;λ |G ℓ-1 i Q f ℓ i;λ   (152) = -ν ℓ D KL N 0, K(G ℓ-1 i ) N 0, G ℓ i , where the final equality arises when we notice that the expectation can be written as a KLdivergence. The inducing DKM objective, L ind , is the ELBO, divided by N to ensure that it remains finite in the infinite limit, L ind (F L+1 i , G 1 ii , . . . , G L ii )= lim N →∞ 1 N ELBO(F L+1 i , G 1 ii , . . . , G L ii ) (154) = E Q log P Y t |F L+1 t - L ℓ=1 ν ℓ D KL N 0, K(G ℓ-1 ii ) N 0, G ℓ ii . Note that this has almost exactly the same form as the standard DKM objective for DGPs in the main text (Eq. 16). In particular, the second term is a chain of KL-divergences, with the only difference that these KL-divergences apply only to the inducing points. The first term is a "performance" term that here depends on the quality of the predictions given the inducing points. As the copies are IID, we have, E Q log P Ỹt | FL+1 t = N E Q log P Y t |F L+1 t . ( ) Now that we have a simple form for the ELBO, we need to compute the expected likelihood, E Q log P Y t |F L+1 t . This requires us to compute the full Gram matrices, including test/train points, conditioned on the optimized inducing Gram matrices. We start by defining the full Gram matrix, G ℓ = G ℓ ii G ℓ it G ℓ ti G ℓ tt ( ) for both inducing points (labelled "i") and test/training points (labelled "t") from just G ℓ ii . 
For clarity, we have G ℓ ∈ R P ×P , G ℓ ii ∈ R Pi×Pi , G ℓ ti ∈ R Pt×Pi , G ℓ tt ∈ R Pt×Pt , where P i is the number of inducing points, P t is the number of train/test points and P = P i + P t is the total number of inducing and train/test points. The conditional distribution over F ℓ t given F ℓ i is, Inducing and train/test inputs: X i , X t Inducing outputs: P F ℓ t F ℓ i , G ℓ-1 = N ℓ λ=1 N f ℓ t;λ ; K ti K -1 ii f ℓ i;λ , K tt•i ( F L+1 i Initialize full Gram matrix G 0 ii G 0;T ti G 0 ti G 0 tt = 1 ν0 X i X T i X i X T t X t X T i X t X T t Propagate full Gram matrix for ℓ in (1, . . . , L) do K ii K T ti K ti K tt = K G ℓ-1 ii (G ℓ-1 ti ) T G ℓ-1 ti G ℓ-1 tt K tt•i = K tt -K ti K -1 ii K T ti . G ℓ ti = K ti K -1 ii G ℓ ii G ℓ tt = K ti K -1 ii G ℓ ii K -1 ii K T ti + K tt•i end for Final prediction using standard Gaussian process expressions K ii K T ti K ti K tt = K G L ii (G L ti ) T G L ti G L tt Y t ∼ N K ti K -1 ii F L+1 i , K tt -K ti K -1 ii K T ti + σ 2 I where f ℓ t;λ is the activation of the λth feature for all train/test inputs, f ℓ i;λ is the activation of the λth feature for all train/test inputs, and f ℓ i;λ , and K ii K T ti K ti K tt = K 1 N ℓ-1 F ℓ-1 F T ℓ-1 = K (G ℓ-1 ) K tt•i = K tt -K ti K -1 ii K T ti . In the infinite limit, the Gram matrix becomes deterministic via the law of large numbers (as in Eq. 151), and as such G it and G tt become deterministic and equal to their expected values. Using Eq. ( 157), we can write, F ℓ t = K ti K -1 ii F ℓ i + K 1/2 tt•i Ξ. ( ) where Ξ is a matrix with IID standard Gaussian elements. Thus, G ℓ ti = 1 ν E F ℓ t (F ℓ i ) T (161) = 1 ν K ti K -1 ii E F ℓ i (F ℓ i ) T (162) = K ti K -1 ii G ℓ ii (163) and, G ℓ tt = 1 ν E F ℓ t (F ℓ t ) T (164) = 1 ν K ti K -1 ii E F ℓ i (F ℓ i ) T K -1 ii K T ti + 1 ν K 1/2 tt•i E ΞΞ T K 1/2 tt•i (165) = K ti K -1 ii G ii K -1 ii K T ti + K tt•i (166) For the full prediction algorithm, see Alg. 1.
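As a concrete illustration, the prediction pass of Alg. 1 can be sketched in a few lines of NumPy. The `kernel` function below is a stand-in squared-exponential kernel applied to a Gram matrix, and all sizes and names are made up for illustration; this is a sketch of the algorithm's structure, not the authors' implementation:

```python
import numpy as np

def kernel(G):
    """Stand-in squared-exponential kernel applied to a Gram matrix:
    K_pq = exp(-(G_pp + G_qq - 2 G_pq) / 2), an illustrative choice."""
    d = np.diag(G)
    return np.exp(-0.5 * (d[:, None] + d[None, :] - 2.0 * G))

def dkm_predict(Xi, Xt, G_ii_list, F_top_i, nu0, sigma2=1e-2):
    """Alg. 1 sketch: propagate the full Gram matrix from the optimized
    inducing blocks G^l_ii, then predict with standard GP formulae."""
    Pi, Pt = Xi.shape[0], Xt.shape[0]
    # Initialize the full Gram matrix: G_0 = (1/nu_0) X X^T.
    X = np.concatenate([Xi, Xt], axis=0)
    G = X @ X.T / nu0
    for G_ii in G_ii_list:                       # propagate through layers
        K = kernel(G)
        Kii, Kti, Ktt = K[:Pi, :Pi], K[Pi:, :Pi], K[Pi:, Pi:]
        A = np.linalg.solve(Kii, Kti.T).T        # A = K_ti K_ii^{-1}
        Ktt_i = Ktt - A @ Kti.T                  # K_tt.i
        G_ti = A @ G_ii                          # G^l_ti
        G_tt = A @ G_ii @ A.T + Ktt_i            # G^l_tt
        G = np.block([[G_ii, G_ti.T], [G_ti, G_tt]])
    # Final prediction using standard Gaussian-process expressions.
    K = kernel(G)
    Kii, Kti, Ktt = K[:Pi, :Pi], K[Pi:, :Pi], K[Pi:, Pi:]
    A = np.linalg.solve(Kii, Kti.T).T
    mean = A @ F_top_i                           # K_ti K_ii^{-1} F^{L+1}_i
    cov = Ktt - A @ Kti.T + sigma2 * np.eye(Pt)
    return mean, cov
```

Note that the cost is dominated by solves against the $P_\mathrm{i} \times P_\mathrm{i}$ block, which is what gives the sparse DKM its linear scaling in the number of test/train points.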



Figure 3: Wide DGP posteriors converge to the DKM. Here, we trained DGPs with Langevin sampling (see Appendix F) and compared to a trained DKM. a Marginal distribution over features for one input datapoint for a two-layer DGP trained on a subset of yacht. We used a width of $N_{1 \dots L} = 1024$ and $\nu_{1 \dots L} = 5$ in all plots to ensure that the data had a strong effect on the learned representations. The marginals (blue histogram) are very close to Gaussian (the red line shows the best-fitting Gaussian). Remember that the true posterior over features is IID (Eq. 31), so each column aggregates the distribution over features (and over 10 parallel chains with 100 samples from each chain) for a single input datapoint. b The 2D marginal distributions for the same DGP for two input points (horizontal and vertical axes). c Element-wise RMSE (normalized Frobenius distance) between Gram matrices from a trained DKM and from trained DGPs of increasing width. The DGP Gram matrices converge to the DKM solution as the width grows.

Figure 4: The variational DKM closely matches the BNN true posterior obtained with Langevin sampling. a Comparison of Gram matrices. The first two rows show Gram matrices for the BNN, with the first row being a random initialization and the second row being the posterior. The last two rows show the Gram matrices from variational DKMs with a flow approximate posterior (third row) and a multivariate Gaussian approximate posterior (fourth row). In optimizing the variational DKM, we used Eq. (34) with $2^{16}$ Monte-Carlo samples. The Gram matrices for the flow posterior (third row) closely match those from the BNN posterior (second row), while those for the multivariate Gaussian approximate posterior (fourth row) do not. b Marginal distributions over features at each layer for one input datapoint, estimated using kernel density estimation. Note that the BNN marginals (blue line) are non-Gaussian, and the variational DKM with a flow posterior (red line) is capable of capturing this non-Gaussianity.

Figure 5: RMSE between the trained Gram matrices of one-hidden-layer (first row) and two-hidden-layer (second row) DGPs of various widths trained by gradient descent and the corresponding MAP limit. Columns correspond to different datasets (trained on a subset of 50 datapoints).

Figure 6: One-hidden-layer DGP and DKM with squared exponential kernel trained on a subset of energy. First and second rows: initializations of the DGP and DKM. Third and fourth rows: trained DGP (by Langevin sampling) and DKM Gram matrices and kernels.

Figure 8: One-hidden-layer DGP and DKM with squared exponential kernel trained on a subset of concrete. First and second rows: initializations of the DGP and DKM. Third and fourth rows: trained DGP (by Langevin sampling) and DKM Gram matrices and kernels.

Figure 10: Comparison of posterior feature marginal distributions between a BNN of width 1024 (trained by Langevin sampling over features) and a variational DKM with $2^{16}$ Monte-Carlo samples, in a 4-layer (row 1) and a 32-layer (row 2) network. We give the BNN posterior features from Langevin sampling (blue histogram) and the best-fitting Gaussian (blue line), and compare against the variational DKM approximate posterior Gaussian distribution (red line).


Figure 11: Comparison of Gram matrices between a BNN of width 1024 (trained by Langevin sampling over features) and a variational DKM, in 4-layer (rows 1-3) and 32-layer (rows 4-6) networks. Initializations are shown in rows 1 and 4, trained BNN Gram matrices in rows 2 and 5, and trained variational DKM Gram matrices in rows 3 and 6. As in Figure 10, the variational DKM is a poor match to Langevin sampling in the BNN for the 4-layer network, but is very similar for the 32-layer network.

Figure 12: One-layer DKMs with squared exponential kernel trained on full UCI datasets (columns) converge to the same solution, despite very different initializations obtained by applying the stochastic diagonal scalings described in Appendix F to the standard initialization with different seeds. The standard initialization is shown as a dashed line, while the scaled initializations are shown as colored lines, each denoting a different seed. The first row shows the objective during training: all seeds converge to the same value. The second row shows the element-wise RMSE between the Gram matrix of each seed and the optimized Gram matrix obtained from the standard initialization; the RMSE converges to 0 as all initializations converge on the same maximum. The last row plots RMSE against objective value, again showing a single optimal objective value at which all Gram matrices are the same.

Figure 13: One-layer DGPs with MAP inference over features, as described in Appendix E, Eq. (84). Rows and columns are the same as in Figure 12. Using the same randomly scaled initializations described above, we are able to find multiple modes on energy and concrete, showing that our initializations are diverse enough, although there is still only a single global optimum.





J UNIMODALITY IN LINEAR DEEP KERNEL MACHINES

J.1 THEORY: UNIMODALITY WITH A LINEAR KERNEL AND SAME WIDTHS

Here, we show that the deep kernel machine objective is unimodal for a linear kernel. A linear kernel simply returns the input Gram matrix,
$$K(G) = G.$$
It is called a linear kernel because it arises in the neural-network setting (Eq. 21) by choosing the nonlinearity, $\phi$, to be the identity, in which case $F_\ell = F_{\ell-1} W_{\ell-1}$. For a linear kernel the objective becomes (up to additive constants)
$$\mathcal{L}(G_1, \dots, G_L) = -\sum_{\ell=1}^{L+1} \nu_\ell\, D_\mathrm{KL}\!\left(\mathcal{N}(\mathbf{0}, G_\ell) \,\middle\|\, \mathcal{N}(\mathbf{0}, G_{\ell-1})\right),$$
where we have assumed there is no output noise, $\sigma^2 = 0$. Taking all $\nu_\ell$ to be equal, $\nu = \nu_\ell$ (see Appendix J.2 for the general case),
$$\mathcal{L}(G_1, \dots, G_L) = -\frac{\nu}{2} \sum_{\ell=1}^{L+1} \left[\operatorname{tr}\!\left(G_{\ell-1}^{-1} G_\ell\right) - \log\left|G_{\ell-1}^{-1} G_\ell\right| - P\right].$$
Note that $G_0$ and $G_{L+1}$ are fixed by the inputs and outputs respectively. Thus, to find the mode, we set the gradient w.r.t. $G_1, \dots, G_L$ to zero,
$$\mathbf{0} = \frac{\partial \mathcal{L}}{\partial G_\ell} = -\frac{\nu}{2} \left[G_{\ell-1}^{-1} - G_\ell^{-1} G_{\ell+1} G_\ell^{-1}\right].$$
Thus, at the mode, the recursive relationship
$$G_{\ell-1}^{-1} G_\ell = G_\ell^{-1} G_{\ell+1} \equiv T$$
must hold at every layer. Thus, optimal Gram matrices are given by
$$G_\ell = G_0 T^\ell,$$
and we can solve for $T$ by noting
$$T^{L+1} = G_0^{-1} G_{L+1}.$$
Importantly, $T$ is the product of two positive-definite matrices, $T = G_{\ell-1}^{-1} G_\ell$, so $T$ must have positive, real eigenvalues (but $T$ does not have to be symmetric; Horn & Johnson, 2012). There is only one solution of $T^{L+1} = G_0^{-1} G_{L+1}$ with positive real eigenvalues (Horn et al., 1994). Intuitively, this can be seen using the eigendecomposition, $G_0^{-1} G_{L+1} = V^{-1} D V$, where $D$ is diagonal with positive real entries, so that $T = V^{-1} D^{1/(L+1)} V$. Thus, finding $T$ reduces to finding the $(L+1)$th root of each positive real number on the diagonal of $D$. While each such number has $L+1$ complex roots, only one of them is positive and real, and so $T$, and hence $G_1, \dots, G_L$, are uniquely specified. This contrasts with a deep linear neural network, which has infinitely many optimal settings for the weights.

Note that for the objective to be well-defined, we need $K(G)$ to be full-rank. With standard kernels (such as the squared exponential) this is always the case, even if the input Gram matrix is singular. However, a linear kernel has a singular output if given a singular input, and with enough datapoints, $G_0 = \tfrac{1}{\nu_0} X X^T$ is always singular. To fix this, we could e.g. define $G_0 = K(\tfrac{1}{\nu_0} X X^T)$ by applying a positive-definite kernel (such as a squared exponential) to $\tfrac{1}{\nu_0} X X^T$. This results in a positive-definite $G_0$, as long as the input points are distinct.
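The uniqueness argument is easy to verify numerically: build $T$ as the $(L+1)$th root of $G_0^{-1} G_{L+1}$ via an eigendecomposition, then check that $G_\ell = G_0 T^\ell$ interpolates between the endpoints while every intermediate $G_\ell$ remains a valid symmetric Gram matrix. The sketch below uses random positive-definite matrices for $G_0$ and $G_{L+1}$; all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
P, L = 5, 3                             # datapoints and depth (illustrative)

def random_spd(P):
    A = rng.normal(size=(P, P))
    return A @ A.T + P * np.eye(P)      # well-conditioned SPD matrix

G0, GL1 = random_spd(P), random_spd(P)  # fixed by inputs and outputs

# Symmetrize: M = G0^{-1/2} G_{L+1} G0^{-1/2} is SPD, so its eigenvalues
# are guaranteed positive and real.
w0, V0 = np.linalg.eigh(G0)
G0_half = (V0 * np.sqrt(w0)) @ V0.T     # G0^{1/2}
G0_ihalf = (V0 / np.sqrt(w0)) @ V0.T    # G0^{-1/2}
w, V = np.linalg.eigh(G0_ihalf @ GL1 @ G0_ihalf)
assert np.all(w > 0)

# The unique (L+1)th root with positive real eigenvalues, mapped back:
# T = G0^{-1/2} M^{1/(L+1)} G0^{1/2}. T is similar to an SPD matrix but
# need not itself be symmetric.
M_root = (V * w ** (1.0 / (L + 1))) @ V.T
T = G0_ihalf @ M_root @ G0_half

# G_l = G0 T^l interpolates from G0 to G_{L+1} ...
Gs = [G0]
for _ in range(L + 1):
    Gs.append(Gs[-1] @ T)
assert np.allclose(Gs[-1], GL1)
# ... every intermediate G_l is symmetric, and the stationarity condition
# G_{l-1}^{-1} G_l = T holds at every layer.
for l in range(1, L + 2):
    assert np.allclose(Gs[l], Gs[l].T)
    assert np.allclose(np.linalg.solve(Gs[l - 1], Gs[l]), T)
```

Working in the symmetrized basis avoids relying on a general (possibly complex) eigendecomposition of the non-symmetric matrix $G_0^{-1} G_{L+1}$, while producing exactly the root described above.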

J.2 THEORY: UNIMODALITY WITH A LINEAR KERNEL AND ARBITRARY WIDTHS

In the main text we showed that the deep kernel machine is unimodal when all ν ℓ are equal. Here, we show that unimodality in linear DKMs also holds for all choices of ν ℓ . Recall the linear DKM objective in Eq. ( 115),

