A THEORY OF REPRESENTATION LEARNING IN NEURAL NETWORKS GIVES A DEEP GENERALISATION OF KERNEL METHODS

Abstract

The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches (formally, NNGPs) involving infinite-width limits eliminate representation learning. We therefore develop a new infinite-width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width models, yet at the same time retains some of the simplicity of standard infinite-width limits. In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and the posterior covariances can be obtained by optimizing an interpretable objective that combines a log-likelihood, which improves performance, with a series of KL divergences, which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, which we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to NNs (which have non-Gaussian posteriors) in the Appendices.

1. INTRODUCTION

The successes of modern machine learning methods, from neural networks (NNs) to deep Gaussian processes (DGPs; Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017), are based on their ability to use depth to transform the input into high-level representations that are good for solving difficult tasks (Bengio et al., 2013; LeCun et al., 2015). However, theoretical approaches that use infinite limits to understand deep models struggle to capture representation learning. In particular, there are two broad families of infinite limit, and while both use kernel-matrix-like objects, they are ultimately very different. The neural network Gaussian process (NNGP; Neal, 1996; Lee et al., 2017; Matthews et al., 2018) applies to Bayesian models such as Bayesian neural networks (BNNs) and DGPs, and describes the representations at each layer (formally, the NNGP kernel is the raw second moment of the activities). In contrast, the neural tangent kernel (NTK; Jacot et al., 2018) is a very different quantity that involves gradients, and describes how predictions at all datapoints change if we do a gradient update on a single datapoint. As such, the NNGP and NTK are suited to very different theoretical questions: the NNGP is better suited to understanding the transformation of representations across layers, while the NTK is better suited to understanding how predictions change through NN training. While the challenge of capturing representation learning has recently been addressed in the NTK setting (Yang & Hu, 2020), we are the first to address it in the NNGP setting.

At the same time, kernel methods (Smola & Schölkopf, 1998; Shawe-Taylor & Cristianini, 2004; Hofmann et al., 2008) were a leading machine learning approach prior to the deep learning revolution (Krizhevsky et al., 2012). However, kernel methods were eclipsed by deep NNs because depth gives NNs the flexibility to learn a good top-layer representation (Aitchison, 2020). In contrast, in a standard kernel method the kernel (or equivalently the representation) is highly inflexible: there are usually a few tunable hyperparameters, but nothing that approaches the enormous flexibility of the top-layer representation in a deep model. There is therefore a need to develop flexible, deep generalisations of kernel methods. Remarkably, our advances in understanding representation learning in DGPs yield exactly such a flexible, deep kernel method.
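To make the inflexibility of fixed kernels concrete, the following is a minimal sketch (our own illustration, not code from any particular library) of the standard NNGP kernel recursion for fully-connected ReLU layers, using the well-known closed form for the ReLU expectation (the degree-1 arc-cosine kernel). The function names and the He-style weight variance $\sigma_w^2 = 2$ are our own choices.

```python
import numpy as np

def relu_nngp_layer(K, sigma_w2=2.0):
    """One step of the NNGP kernel recursion for a ReLU layer,
    K_{l+1} = sigma_w^2 * E[relu(f) relu(f')] with f ~ N(0, K_l),
    which has a closed form (the degree-1 arc-cosine kernel)."""
    std = np.sqrt(np.diag(K))
    norm = np.outer(std, std)
    cos_t = np.clip(K / norm, -1.0, 1.0)   # clip to guard against rounding error
    theta = np.arccos(cos_t)
    return sigma_w2 * norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def nngp_kernel(X, depth=3):
    """Propagate the input Gram matrix through `depth` ReLU layers.
    Note that nothing is learned: the kernel at every layer is a fixed,
    deterministic function of the inputs."""
    K = X @ X.T / X.shape[1]               # layer-0 (input) kernel
    for _ in range(depth):
        K = relu_nngp_layer(K)
    return K
```

For instance, `nngp_kernel(np.random.randn(6, 10))` returns the 6×6 kernel of a three-hidden-layer ReLU NNGP; no quantity in the computation adapts to the labels, which is precisely the lack of flexibility discussed above.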

2. CONTRIBUTIONS

• We present a new infinite-width limit, the Bayesian representation learning limit, that retains representation learning in deep Bayesian models including DGPs. The key insight is that as the width goes to infinity, the prior becomes stronger and eventually overwhelms the likelihood. We can fix this by rescaling the likelihood to match the prior. This rescaling can be understood in a Bayesian context as copying the labels (Sec. 4.3).
• We show that in the Bayesian representation learning limit, DGP posteriors are exactly zero-mean multivariate Gaussian, $P(f^\ell_\lambda | X, y) = \mathcal{N}(f^\ell_\lambda; 0, G^\ell)$, where $f^\ell_\lambda$ is the activation of the $\lambda$th feature in layer $\ell$ for all inputs (Sec. 4.4 and Appendix D).
• We show that the posterior covariances can be obtained by optimizing the "deep kernel machine objective" (a sketch evaluating this objective follows this list),
$$\mathcal{L}(G^1, \ldots, G^L) = \log P(Y | G^L) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\!\left(\mathcal{N}(0, G^\ell) \,\middle\|\, \mathcal{N}(0, K(G^{\ell-1}))\right),$$
where $G^\ell$ are the posterior covariances, $K(G^{\ell-1})$ are the kernel matrices, and $\nu_\ell$ accounts for any differences in layer width (Sec. 4.3).
• We give an interpretation of this objective: $\log P(Y|G^L)$ encourages improved performance, while the KL-divergence terms act as a regulariser, keeping the posteriors, $\mathcal{N}(0, G^\ell)$, close to the priors, $\mathcal{N}(0, K(G^{\ell-1}))$ (Sec. 4.5).
• We introduce a sparse DKM, which takes inspiration from the GP inducing point literature to obtain a practical, scalable method that is linear in the number of datapoints. In contrast, naively computing/optimizing the DKM objective is cubic in the number of datapoints (as with most other naive kernel methods; Sec. 4.7).
• We extend these results to BNNs (which have non-Gaussian posteriors) in Appendix A.
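To make the objective concrete, here is a minimal numpy sketch (our own, with illustrative assumptions) that evaluates the DKM objective for a given list of Gram matrices. We assume a Gaussian regression likelihood, $y \sim \mathcal{N}(0, K(G^L) + \sigma^2 I)$, for the $\log P(Y|G^L)$ term, and leave the kernel map $K(\cdot)$ as a user-supplied function (e.g. the `relu_nngp_layer` sketched in Sec. 1); neither choice is prescribed by the objective itself. The KL term uses the standard closed form between zero-mean Gaussians.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gauss(G, K):
    """D_KL( N(0, G) || N(0, K) ) for P x P covariances:
    0.5 * ( tr(K^{-1} G) - P + log det K - log det G )."""
    P = G.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(K, G)) - P
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(G)[1])

def dkm_objective(Gs, G0, y, kernel_fn, nus, noise_var=0.1):
    """L(G^1..G^L) = log P(y | G^L) - sum_l nu_l KL( N(0,G^l) || N(0,K(G^{l-1})) ).

    Gs        : list [G^1, ..., G^L] of P x P posterior covariances
    G0        : input Gram matrix, e.g. X @ X.T / X.shape[1]
    kernel_fn : maps G^{l-1} to the prior kernel matrix K(G^{l-1})
    nus       : per-layer width multipliers nu_l
    """
    obj, G_prev = 0.0, G0
    for G, nu in zip(Gs, nus):
        obj -= nu * kl_gauss(G, kernel_fn(G_prev))  # regulariser at layer l
        G_prev = G
    # Illustrative Gaussian likelihood: y ~ N(0, K(G^L) + noise_var * I).
    cov = kernel_fn(G_prev) + noise_var * np.eye(len(y))
    return obj + multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)
```

In practice the $G^\ell$ would be optimized (e.g. through a Cholesky factor to maintain positive definiteness); this sketch only evaluates the objective, and the cubic cost of the solves and log-determinants is exactly the cost that the sparse DKM of Sec. 4.7 avoids.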

3. RELATED WORK

Our work is focused on DGPs and gives new results, such as the extremely simple multivariate Gaussian form for the true DGP posteriors. As such, our work is very different from previous work on NNs, where such results are not available. There are at least three families of such work. First, there is recent work on representation learning in the very different NTK setting (Jacot et al., 2018; Yang, 2019; Yang & Hu, 2020) (see Sec. 1). In contrast, here we focus on NNGPs (Neal, 1996; Williams, 1996; Lee et al., 2017; Matthews et al., 2018; Novak et al., 2018; Garriga-Alonso et al., 2018; Jacot et al., 2018), where the challenge of representation learning has yet to be addressed. Second, there is a body of work using methods from physics to understand representation learning in neural networks (Antognini, 2019; Dyer & Gur-Ari, 2019; Hanin & Nica, 2019; Aitchison, 2020; Li & Sompolinsky, 2020; Yaida, 2020; Naveh et al., 2020; Zavatone-Veth et al., 2021; Zavatone-Veth & Pehlevan, 2021; Roberts et al., 2021; Naveh & Ringel, 2021; Halverson et al., 2021). This work focuses on perturbational, rather than variational, methods. Third, there is a body of theoretical work (Mei et al., 2018; Nguyen, 2019; Sirignano & Spiliopoulos, 2020a;b; Nguyen & Pham, 2020) establishing properties such as convergence to the global optimum. This work is focused on two-layer (i.e. one-hidden-layer) networks and, like the NTK, considers learning under SGD rather than Bayesian posteriors. Finally, a related line of work uses kernels to give a closed-form expression for the weights of a neural network, based on a greedy, layerwise objective (Wu et al., 2022). That work differs from ours in that it uses the HSIC objective, and therefore does not have a link to DGPs or Bayesian neural networks, and in that it uses a greedy, layerwise objective rather than end-to-end gradient descent.

