DEEP LEARNING IS COMPOSITE KERNEL LEARNING

Abstract

Recent works have connected deep learning and kernel methods. In this paper, we show that architectural choices such as convolutional layers with pooling and skip connections make deep learning a composite kernel learning method, where the kernel is an (architecture-dependent) composition of base kernels: even before training, standard deep networks have in-built structural properties that ensure their success. In particular, we build on the recently developed 'neural path' framework¹ that characterises the role of gates/masks in fully connected deep networks with ReLU activations.

1. INTRODUCTION

The success of deep learning is attributed to feature learning. The conventional view is that feature learning happens in the hidden layers of a deep network: simple low-level features are learnt in the initial layers, and increasingly sophisticated high-level features are learnt with depth. In this viewpoint, the penultimate layer output is the final hidden feature, and the final layer learns a linear model on these hidden features. While this interpretation of feature learning is intuitive, beyond the first couple of layers it is hard to make any meaningful interpretation of what happens in the intermediate layers. Jacot et al. (2018); Arora et al. (2019); Cao and Gu (2019) have provided a kernel learning interpretation for deep learning by showing that, in the limit of infinite width, deep learning becomes kernel learning. These works are based on neural tangents, wherein the gradients of the network output with respect to the network parameters, known as the neural tangent features (NTFs), are considered as the features. Arora et al. (2019) show that at randomised initialisation of the weights, the kernel matrix associated with the NTFs, known as the neural tangent kernel (NTK), converges to a deterministic matrix, and that optimisation and generalisation of infinite-width deep neural networks are characterised by this deterministic kernel matrix. Cao and Gu (2019) provided generalisation bounds in terms of the NTK matrix. Arora et al. (2019) also proposed a pure kernel method based on the CNTK (the NTK of convolutional neural networks, i.e., CNNs), which significantly outperformed the previous state-of-the-art kernel methods. The NTK, either as an interpretation or as a method in itself, has been very successful.
Nevertheless, it has some open issues, namely: i) non-interpretability: the kernel is an inner product of gradients and has no physical interpretation; ii) no feature learning: the NTFs are random and fixed during training; and iii) a performance gap: finite-width CNNs outperform the infinite-width CNTK, i.e., the NTK does not fully explain the success of deep learning.
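To make the NTF/NTK definitions concrete, the following minimal sketch (our own illustration, not code from the paper) computes an empirical NTK entry for a toy two-layer ReLU network f(x) = vᵀ relu(Wx), whose tangent features can be written in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width = 4, 64

# Random (untrained) two-layer ReLU network: f(x) = v^T relu(W x).
W = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
v = rng.standard_normal(width) / np.sqrt(width)

def ntf(x):
    """Neural tangent feature: gradient of f(x) w.r.t. all parameters (W, v)."""
    pre = W @ x                       # pre-activations
    gate = (pre > 0).astype(float)    # ReLU gates (on/off)
    grad_W = np.outer(v * gate, x)    # df/dW
    grad_v = gate * pre               # df/dv = relu(W x)
    return np.concatenate([grad_W.ravel(), grad_v])

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
# An NTK entry is the inner product of the two tangent features.
k12 = ntf(x1) @ ntf(x2)
```

Note that the tangent features are determined entirely by the random initialisation; nothing here is learnt, which is the "no feature learning" issue above.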

Recent works

Recently, Lakshminarayanan and Singh (2020) developed a neural path (NP) framework to provide a kernel interpretation for deep learning that addresses the open issues in the current NTK framework. Here, DNNs with ReLU activations are considered, and the gates (on/off states of the ReLUs) are encoded in the so-called neural path features (NPFs) and the weights of the network in the so-called neural path value (NPV). The key findings can be broken into the following steps.
Step 1: The NPFs and NPV are decoupled. Gates are treated as masks, which are held in a separate feature network and applied to the main network, called the value network. This enables one to study various kinds of gates (i.e., NPFs), such as random gates (of a randomly initialised network), semi-learnt gates (sampled at an intermediate epoch during training), and learnt gates (sampled from a fully trained network). This addresses the feature learning issue.
Step 2: When the gates/masks are decoupled and applied externally, it follows that NTK = const × NPK at random initialisation of the weights. For a pair of input examples, the NPK is a similarity measure that depends on the size of the sub-network formed by the gates that are active simultaneously for both examples. This addresses the interpretability issue.
Step 3: The CNTK performs better than random gates/masks, and gates/masks from fully trained networks perform better than the CNTK. This explains the performance gap between CNNs and the CNTK. It was also observed (on standard datasets) that when learnt gates/masks are used, the weights of the value network can be reset and re-trained from scratch without significant loss of performance.
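Step 2's similarity measure can be sketched for a fully connected ReLU network: under the Hadamard-product form (detailed in Section 4), the NPK is the input inner product scaled, layer by layer, by the number of gates active for both examples. The gating network and all names below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, width, depth = 4, 16, 3

# Gating (feature) network: random weights define the ReLU gates/masks.
Ws = [rng.standard_normal((width, d_in if l == 0 else width)) for l in range(depth)]

def gates(x):
    """On/off state of every ReLU in each layer for input x."""
    h, gs = x, []
    for W in Ws:
        pre = W @ h
        g = (pre > 0).astype(float)
        gs.append(g)
        h = g * pre
    return gs

def npk(x1, x2):
    """Neural path kernel: input inner product times, per layer, the number
    of gates active for *both* inputs (overlap of the active sub-networks)."""
    k = float(x1 @ x2)
    for g1, g2 in zip(gates(x1), gates(x2)):
        k *= g1 @ g2   # count of simultaneously active gates
    return k

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
k12 = npk(x1, x2)
```

Two inputs that switch on the same gates share a large sub-network and hence a large kernel value, which is the interpretability claim in Step 2.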

1.1. CONTRIBUTIONS IN THIS WORK

We attribute the success of deep learning to the following two key ingredients: (i) a composite kernel with gates as fundamental building blocks, and (ii) allowing the gates to learn/adapt during training. Formally, we extend the NP framework of Lakshminarayanan and Singh (2020) as explained below.
• Composite Kernel: The NPK matrix has a composite (architecture-dependent) structure.
• Gate Learning: We show that learnt gates perform better than random gates. Starting with the setup of Lakshminarayanan and Singh (2020), we build combinatorially many models by
1. permuting the order of the layers when we apply them as external masks, and
2. having two modes based on the input provided to the value network, namely i) 'standard': the input is the actual image, and ii) 'all-ones': the input is a tensor with all entries equal to '1'.
We observe in our experiments that performance is robust to such combinatorial variations.
Message: This work, along with that of Lakshminarayanan and Singh (2020), provides a paradigm shift in understanding deep learning. Here, gates play a central role. Each gate is related to a hyperplane, and the gates together form layer-level binary features whose kernels are the base kernels. Laying out these binary features depth-wise gives rise to a product of the base kernels. Skip connections give a 'sum of products' structure, and convolution with pooling gives rotation invariance.
Organisation: Section 2 describes the network architectures, namely fully connected, convolutional and residual, which we take up for theoretical analysis. Section 3 extends the neural path framework to CNNs and ResNets. Section 4 explains the composite kernel. Section 5 connects the NTK and NPK for CNNs and ResNets. Section 6 consists of numerical experiments.
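The decoupling of masks, together with the 'permuted order' and 'all-ones' variations listed above, can be sketched as follows; this is a hypothetical minimal setup of our own, not the paper's experimental code:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, width, depth = 4, 16, 3

# Two separate networks: one supplies the gates, one holds the trainable weights.
feat_Ws  = [rng.standard_normal((width, d_in if l == 0 else width))
            for l in range(depth)]
value_Ws = [rng.standard_normal((width, d_in if l == 0 else width)) / np.sqrt(width)
            for l in range(depth)]
v_out = rng.standard_normal(width) / np.sqrt(width)

def masks(x):
    """Gates of the feature network, collected layer by layer."""
    h, ms = x, []
    for W in feat_Ws:
        pre = W @ h
        m = (pre > 0).astype(float)
        ms.append(m)
        h = m * pre
    return ms

def value_net(x, ms, order=None, all_ones=False):
    """Value network with externally applied masks. `order` permutes the
    layer order of the masks; `all_ones` replaces the input with a ones tensor."""
    if order is not None:
        ms = [ms[i] for i in order]
    h = np.ones_like(x) if all_ones else x
    for W, m in zip(value_Ws, ms):
        h = m * (W @ h)    # mask applied in place of the ReLU
    return v_out @ h

x = rng.standard_normal(d_in)
ms = masks(x)
out_std  = value_net(x, ms)                   # 'standard' mode
out_perm = value_net(x, ms, order=[2, 1, 0])  # permuted mask order
out_ones = value_net(x, ms, all_ones=True)    # 'all-ones' mode
```

The robustness claim above is that variations such as `out_perm` and `out_ones`, after training the value network, perform comparably to the standard mode.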

2. ARCHITECTURES: FULLY CONNECTED, CONVOLUTIONAL AND RESIDUAL

In this section, we present the three architectures that we take up for theoretical analysis. These are i) fully connected (FC or FC-DNN), ii) convolutional (CNN) and iii) residual (ResNet). In what follows, [n] denotes the set {1, . . . , n}, and the dataset is given by (x_s, y_s)_{s=1}^n ∈ ℝ^{d_in} × ℝ.
Fully Connected: We consider fully connected networks with width 'w' and depth 'd'.
CNN: We consider a 1-dimensional convolutional neural network with circular convolutions (see Table 2), with d_cv convolutional layers (l = 1, . . . , d_cv), followed by a global-average/max-pooling layer (l = d_cv + 1) and d_fc fully connected layers (l = d_cv + 2, . . . , d_cv + d_fc + 1). The convolutional window size is w_cv < d_in, the number of filters per convolutional layer is w, and the width of the FC layers is also w.



¹ Introduced for the first time in the work of Lakshminarayanan and Singh (2020).



Definition 2.1 (Circular Convolution). For x ∈ ℝ^{d_in}, i ∈ [d_in] and r ∈ {0, . . . , d_in − 1}, define: (i) i ⊕ r = i + r for i + r ≤ d_in, and i ⊕ r = i + r − d_in for i + r > d_in; (ii) rot(x, r)(i) = x(i ⊕ r), for i ∈ [d_in].
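A direct transcription of Definition 2.1 (0-indexed), together with the circular convolution it supports, could look like the following; the function names and the window parameter are ours:

```python
import numpy as np

def i_plus(i, r, d):
    """i ⊕ r from Definition 2.1 (0-indexed): wrap-around index addition."""
    return (i + r) % d

def rot(x, r):
    """rot(x, r)(i) = x(i ⊕ r): circular rotation of x by r positions."""
    d = len(x)
    return np.array([x[i_plus(i, r, d)] for i in range(d)])

def circ_conv(x, w):
    """1-D circular convolution of x with a window w of size len(w) < len(x)."""
    d = len(x)
    return np.array([sum(w[j] * x[i_plus(i, j, d)] for j in range(len(w)))
                     for i in range(d)])
```

A useful sanity check is translation equivariance: convolving a rotated input gives the rotated output, circ_conv(rot(x, r), w) = rot(circ_conv(x, w), r), which is what global pooling later turns into rotation invariance.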

1. Fully connected networks: H_fc is the Hadamard product of the input-data Gram matrix and the kernel matrices corresponding to the binary gating features of the individual layers.
2. Residual networks (ResNets) with skip connections: H_res assumes a 'sum of products' form. In particular, consider a ResNet with (b + 2) blocks and b skip connections. Within this ResNet there are 2^b possible dense networks (indexed by i = 1, . . . , 2^b), and H_res = Σ_{i=1}^{2^b} C_i H_fc_i, where the C_i > 0 are positive constants based on the normalisation layers.
3. Convolutional neural networks (CNNs) with pooling: H_cnn is rotation invariant.
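The first two structures can be illustrated on toy kernel matrices; in this sketch the gating features are random binary matrices and the constants C_i are set to 1, purely for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, d_in, width, depth = 5, 4, 16, 3

X = rng.standard_normal((n, d_in))
# Base kernels: the input Gram matrix plus, per layer, the Gram matrix of the
# binary gating features (random gates here, for illustration only).
input_gram = X @ X.T
gate_feats = [rng.integers(0, 2, (n, width)).astype(float) for _ in range(depth)]
base = [G @ G.T for G in gate_feats]

# Fully connected: Hadamard (entry-wise) product of the base kernels.
H_fc = input_gram.copy()
for B in base:
    H_fc *= B

# ResNet with b skip connections: sum over the 2^b induced dense sub-networks,
# each contributing its own Hadamard-product kernel (C_i = 1 here).
b = 2
H_res = np.zeros((n, n))
for keep in itertools.product([0, 1], repeat=b):
    H_i = input_gram.copy()
    for l, k in enumerate(keep):
        if k:
            H_i *= base[l]
    H_res += H_i
```

By the Schur product theorem, the Hadamard product of positive semi-definite Gram matrices is again positive semi-definite, so both H_fc and the sum H_res are valid kernel matrices.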

