DEEP LEARNING IS COMPOSITE KERNEL LEARNING

Abstract

Recent works have connected deep learning and kernel methods. In this paper, we show that architectural choices such as convolutional layers with pooling and skip connections make deep learning a composite kernel learning method, where the kernel is an (architecture-dependent) composition of base kernels: even before training, standard deep networks have in-built structural properties that ensure their success. In particular, we build on the recently developed 'neural path' framework¹ that characterises the role of gates/masks in fully connected deep networks with ReLU activations.

1. INTRODUCTION

The success of deep learning is attributed to feature learning. The conventional view is that feature learning happens in the hidden layers of a deep network: simple low-level features are learnt in the initial layers, and sophisticated high-level features are learnt as one proceeds in depth. In this viewpoint, the penultimate layer output is the final hidden feature, and the final layer learns a linear model on these hidden features. While this interpretation of feature learning is intuitive, beyond the first couple of layers it is hard to make any meaningful interpretation of what happens in the intermediate layers. Recent works Jacot et al. (2018); Arora et al. (2019); Cao and Gu (2019) have provided a kernel learning interpretation for deep learning by showing that in the limit of infinite width deep learning becomes kernel learning. These works are based on neural tangents, wherein the gradients of the network output with respect to the network parameters, known as the neural tangent features (NTFs), are considered as the features. Arora et al. (2019) show that at randomised initialisation of the weights, the kernel matrix associated with the NTFs, known as the neural tangent kernel (NTK), converges to a deterministic matrix, and that optimisation and generalisation of infinite width deep neural networks are characterised by this deterministic kernel matrix. Cao and Gu (2019) provided generalisation bounds in terms of the NTK matrix. Arora et al. (2019) also proposed a pure-kernel method based on the CNTK (the NTK of convolutional neural networks, i.e., CNNs), which significantly outperformed the previous state-of-the-art kernel methods. The NTK, either as an interpretation or as a method in itself, has been very successful.
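To make the NTF/NTK definitions concrete, the following is a minimal sketch (not code from any of the cited works): for a one-hidden-layer ReLU network, the NTF of an input is the gradient of the network output with respect to all parameters, and the NTK of two inputs is the inner product of their NTFs. All names and the network shape here are illustrative.

```python
import numpy as np

# Illustrative one-hidden-layer ReLU network: f(x) = w2 . relu(W1 @ x).
rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
W1 = rng.standard_normal((d_hid, d_in))   # hidden-layer weights
w2 = rng.standard_normal(d_hid)           # output-layer weights

def ntf(x):
    """Neural tangent feature: gradient of f(x) w.r.t. (W1, w2), flattened."""
    pre = W1 @ x
    gate = (pre > 0).astype(float)         # on/off state of each ReLU
    grad_w2 = np.maximum(pre, 0.0)         # df/dw2 = relu(W1 x)
    grad_W1 = np.outer(w2 * gate, x)       # df/dW1[i, j] = w2[i] * gate[i] * x[j]
    return np.concatenate([grad_W1.ravel(), grad_w2])

def ntk(x, xp):
    """Neural tangent kernel: inner product of the two NTFs."""
    return ntf(x) @ ntf(xp)

x, xp = rng.standard_normal(d_in), rng.standard_normal(d_in)
print(ntk(x, xp))   # a scalar similarity between the two inputs
```

Note that the NTFs depend only on the (random, fixed) weights and the gates, which is precisely the "no feature learning" concern raised below.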
Nevertheless, it has some open issues, namely: i) non-interpretability: the kernel is the inner product of gradients and has no physical interpretation; ii) no feature learning: the NTFs are random and fixed during training; and iii) performance gap: finite width CNNs outperform the infinite width CNTK, i.e., the NTK does not fully explain the success of deep learning. Recently, Lakshminarayanan and Singh (2020) developed a neural path (NP) framework to provide a kernel interpretation for deep learning that addresses the open issues in the current NTK framework. Here, DNNs with ReLU activations are considered, and the gates (on/off states of the ReLUs) are encoded in the so-called neural path feature (NPF) and the weights of the network in the so-called neural path value (NPV). The key findings can be broken into the following steps. Step 1: The NPFs and NPV are decoupled. Gates are treated as masks, which are held in a separate feature network and applied to the main network, called the value network. This enables one to study various kinds of gates (i.e., NPFs), such as random gates (of a randomly initialised network), semi-learnt gates (sampled at an intermediate epoch during training), and learnt gates (sampled from a fully trained network). This addresses the feature learning issue. Step 2: When the gates/masks are decoupled and applied externally, it follows that NTK = const × NPK at random initialisation of weights. For a pair of input examples, the NPK is a similarity measure.

¹Introduced for the first time in the work of Lakshminarayanan and Singh (2020).
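The decoupling in Step 1 can be sketched as follows. This is an illustrative toy example, not the paper's implementation: a "feature network" extracts the ReLU gates as masks, and a "value network" applies externally supplied masks to its pre-activations instead of computing its own ReLUs. When both networks share the same weights, the masked value network reproduces the standard ReLU network exactly; supplying masks from a different source (random, semi-learnt, or learnt) is what allows the different kinds of NPFs to be studied.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 4, 6
W1 = rng.standard_normal((d_hid, d_in))
w2 = rng.standard_normal(d_hid)

def relu_net(x, W1, w2):
    """Standard ReLU network: gates are computed internally."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def feature_net_gates(x, W1):
    """Feature network: only the on/off ReLU gates (masks) are kept."""
    return (W1 @ x > 0).astype(float)

def value_net(x, masks, W1, w2):
    """Value network: masks are applied externally to the pre-activations."""
    return w2 @ (masks * (W1 @ x))

x = rng.standard_normal(d_in)
m = feature_net_gates(x, W1)   # masks could equally come from another network
# With matching weights, the decoupled network equals the ReLU network,
# since relu(z) = gate(z) * z elementwise.
assert np.isclose(relu_net(x, W1, w2), value_net(x, m, W1, w2))
```

The design point is that `value_net` is linear in its weights once the masks are fixed, which is what makes the kernel (NPK) view of Step 2 possible.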

