A THEORY OF REPRESENTATION LEARNING IN NEURAL NETWORKS GIVES A DEEP GENERALISATION OF KERNEL METHODS

Abstract

The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches (formally, NNGPs) involving infinite width limits eliminate representation learning. We therefore develop a new infinite width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width models, yet at the same time retains some of the simplicity of standard infinite-width limits. In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and that the posterior covariances can be obtained by optimizing an interpretable objective combining a log-likelihood, which improves performance, with a series of KL divergences, which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, which we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to NNs (which have non-Gaussian posteriors) in the Appendices.
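To make the shape of this objective concrete, the following is a purely illustrative sketch (the notation $\mathbf{G}_\ell$, $\mathbf{K}(\cdot)$ and $\nu_\ell$ is introduced here for exposition and is not taken from the main text): writing $\mathbf{G}_\ell$ for the posterior covariance at layer $\ell$ of an $L$-layer DGP, $\mathbf{K}(\cdot)$ for the prior kernel mapping one layer's covariance to the next, and $\nu_\ell$ for a layer-wise weight, an objective of the kind described above combines a data-fit term with per-layer regularisers,

\[
\mathcal{L}(\mathbf{G}_1, \dots, \mathbf{G}_L) = \log P\!\left(\mathbf{Y} \mid \mathbf{G}_L\right) - \sum_{\ell=1}^{L} \nu_\ell \, D_{\mathrm{KL}}\!\left( \mathcal{N}\!\left(\mathbf{0}, \mathbf{G}_\ell\right) \,\middle\|\, \mathcal{N}\!\left(\mathbf{0}, \mathbf{K}(\mathbf{G}_{\ell-1})\right) \right),
\]

where the log-likelihood rewards fit to the targets $\mathbf{Y}$ and each KL divergence keeps the layer-$\ell$ posterior close to the prior implied by the layer below.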

1. INTRODUCTION

The successes of modern machine learning methods, from neural networks (NNs) to deep Gaussian processes (DGPs; Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017), are based on their ability to use depth to transform the input into high-level representations that are good for solving difficult tasks (Bengio et al., 2013; LeCun et al., 2015). However, theoretical approaches that use infinite limits to understand deep models struggle to capture representation learning. In particular, there are two broad families of infinite limit, and while both use kernel-matrix-like objects, they are ultimately very different. The neural network Gaussian process (NNGP; Neal, 1996; Lee et al., 2017; Matthews et al., 2018) applies to Bayesian models such as Bayesian neural networks (BNNs) and DGPs, and describes the representations at each layer (formally, the NNGP kernel is the raw second moment of the activities). In contrast, the neural tangent kernel (NTK; Jacot et al., 2018) is a very different quantity that involves gradients, and describes how predictions at all datapoints change if we take a gradient step on a single datapoint. As such, the NNGP and NTK are suited to asking very different theoretical questions: the NNGP is better suited to understanding the transformation of representations across layers, while the NTK is better suited to understanding how predictions change through NN training. While challenges surrounding representation learning have recently been addressed in the NTK setting (Yang & Hu, 2020), we are the first to address this challenge in the NNGP setting.

At the same time, kernel methods (Smola & Schölkopf, 1998; Shawe-Taylor & Cristianini, 2004; Hofmann et al., 2008) were a leading machine learning approach prior to the deep learning revolution (Krizhevsky et al., 2012). However, kernel methods were eclipsed by deep NNs because depth

