DEEP LEARNING IS COMPOSITE KERNEL LEARNING

Abstract

Recent works have connected deep learning and kernel methods. In this paper, we show that architectural choices such as convolutional layers with pooling and skip connections make deep learning a composite kernel learning method, where the kernel is an (architecture-dependent) composition of base kernels: even before training, standard deep networks have in-built structural properties that ensure their success. In particular, we build on the recently developed 'neural path' framework¹ that characterises the role of gates/masks in fully connected deep networks with ReLU activations.

1. INTRODUCTION

The success of deep learning is attributed to feature learning. The conventional view is that feature learning happens in the hidden layers of a deep network: in the initial layers, simple low-level features are learnt, and sophisticated high-level features are learnt as one proceeds in depth. In this viewpoint, the penultimate layer output is the final hidden feature, and the final layer learns a linear model on these hidden features. While this interpretation of feature learning is intuitive, beyond the first couple of layers it is hard to make any meaningful interpretation of what happens in the intermediate layers. Recent works (Jacot et al., 2018; Arora et al., 2019; Cao and Gu, 2019) have provided a kernel learning interpretation for deep learning by showing that, in the limit of infinite width, deep learning becomes kernel learning. These works are based on neural tangents, wherein the gradients of the network output with respect to the network parameters, known as the neural tangent features (NTFs), are considered as the features. Arora et al. (2019) show that, at randomised initialisation of the weights, the kernel matrix associated with the NTFs, known as the neural tangent kernel (NTK), converges to a deterministic matrix, and that the optimisation and generalisation of infinite-width deep neural networks are characterised by this deterministic kernel matrix. Cao and Gu (2019) provided generalisation bounds in terms of the NTK matrix. Arora et al. (2019) also proposed a pure kernel method based on the CNTK (the NTK of convolutional neural networks, i.e., CNNs), which significantly outperformed the previous state-of-the-art kernel methods. The NTK, either as an interpretation or as a method in itself, has been very successful.
Nevertheless, it has some open issues, namely i) non-interpretability: the kernel is an inner product of gradients and has no physical interpretation; ii) no feature learning: the NTFs are random and fixed during training; and iii) a performance gap: a finite-width CNN outperforms the infinite-width CNTK, i.e., the NTK does not fully explain the success of deep learning. Recently, Lakshminarayanan and Singh (2020) developed a neural path (NP) framework to provide a kernel interpretation for deep learning that addresses the open issues in the current NTK framework. Here, DNNs with ReLU activations are considered, and the gates (the on/off states of the ReLUs) are encoded in the so-called neural path feature (NPF), while the weights of the network are encoded in the so-called neural path value (NPV). The key findings can be broken into the following steps. Step 1: The NPFs and NPV are decoupled. Gates are treated as masks, which are held in a separate feature network and applied to the main network, called the value network. This enables one to study various kinds of gates (i.e., NPFs), such as random gates (of a randomly initialised network), semi-learnt gates (sampled at an intermediate epoch during training), and learnt gates (sampled from a fully trained network). This addresses the feature learning issue. Step 2: When the gates/masks are decoupled and applied externally, it follows that NTK = const × NPK at random initialisation of the weights. For a pair of input examples, the NPK is a similarity measure that depends on the size of the sub-network formed by the gates that are simultaneously active for both examples. This addresses the interpretability issue. Step 3: The CNTK performs better than random gates/masks, and gates/masks from fully trained networks perform better than the CNTK. This explains the performance gap between CNN and CNTK.
It was also observed (on standard datasets) that when learnt gates/masks are used, the weights of the value network can be reset and re-trained from scratch without significant loss of performance.
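To make this notion of similarity concrete, the following minimal numpy sketch (all sizes, seeds, and the random-network setup are illustrative assumptions, not the paper's experimental configuration) computes, for two inputs to a toy ReLU network, the per-layer counts of gates that are active for both inputs; their product counts the paths (from any fixed input node) active for both inputs, i.e., the size of the shared sub-network that the NPK measures.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_gates(x, weights):
    """Forward pass of a ReLU network; returns the 0/1 gate vector of each hidden layer."""
    gs, z = [], x
    for W in weights:
        q = W @ z
        g = (q > 0).astype(float)
        gs.append(g)
        z = q * g
    return gs

d_in, w = 4, 16                                  # toy sizes
weights = [rng.standard_normal((w, d_in)),       # three hidden layers
           rng.standard_normal((w, w)),
           rng.standard_normal((w, w))]

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
G1, G2 = relu_gates(x1, weights), relu_gates(x2, weights)

# Layer-wise overlap: number of gates active for BOTH inputs in each layer.
overlaps = [g1 @ g2 for g1, g2 in zip(G1, G2)]
# Their product counts the paths (from any fixed input node) active for both
# inputs -- the size of the shared active sub-network.
shared_paths = np.prod(overlaps)
```

The shared-path count can never exceed the number of paths active for either input individually, which is the sense in which it is a similarity measure.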

1.1. CONTRIBUTIONS IN THIS WORK

We attribute the success of deep learning to the following two key ingredients: (i) a composite kernel with gates as fundamental building blocks and (ii) allowing the gates to learn/adapt during training. Formally, we extend the NP framework of Lakshminarayanan and Singh (2020) as explained below.
• Composite Kernel: The NPK matrix has a composite (architecture-dependent) structure.
• Gate Learning: We show that learnt gates perform better than random gates. Starting with the setup of Lakshminarayanan and Singh (2020), we build combinatorially many models by 1. permuting the order of the layers when we apply them as external masks, and 2. having two modes based on the input provided to the value network, namely i) 'standard': the input is the actual image, and ii) 'all-ones': the input is a tensor with all entries equal to '1'. We observe in our experiments that performance is robust to such combinatorial variations.
Message: This work, along with that of Lakshminarayanan and Singh (2020), provides a paradigm shift in understanding deep learning. Here, gates play a central role. Each gate is related to a hyperplane, and the gates together form layer-level binary features whose kernels are the base kernels. Laying out these binary features depth-wise gives rise to a product of the base kernels. Skip connections give a 'sum of products' structure, and convolution with pooling gives rotational invariance.
Organisation: Section 2 describes the network architectures, namely fully connected, convolutional and residual, which we take up for theoretical analysis. Section 3 extends the neural path framework to CNN and ResNet. Section 4 explains the composite kernel. Section 5 connects the NTK and NPK for CNN and ResNet. Section 6 consists of numerical experiments.

2. ARCHITECTURES: FULLY CONNECTED, CONVOLUTIONAL AND RESIDUAL

In this section, we present the three architectures that we take up for theoretical analysis. These are i) fully connected (FC or FC-DNN), ii) convolutional (CNN) and iii) residual (ResNet). In what follows, [n] denotes the set {1, . . . , n}, and the dataset is given by (x_s, y_s), s = 1, . . . , n, with x_s ∈ R^{d_in} and y_s ∈ R.

Fully Connected: We consider fully connected networks with width 'w' and depth 'd'.

CNN: We consider a 1-dimensional convolutional neural network with circular convolutions (see Table 2), with d_cv convolutional layers (l = 1, . . . , d_cv), followed by a global-average/max-pooling layer (l = d_cv + 1) and d_fc fully connected layers (l = d_cv + 2, . . . , d_cv + d_fc + 1). The convolutional window size is w_cv < d_in, the number of filters per convolutional layer is w, and the width of the FC layers is also w.

Definition 2.1 (Circular Convolution). For x ∈ R^{d_in}, i ∈ [d_in] and r ∈ {0, . . . , d_in − 1}, define:
(i) i ⊕ r = i + r, for i + r ≤ d_in, and i ⊕ r = i + r − d_in, for i + r > d_in;
(ii) rot(x, r)(i) = x(i ⊕ r), i ∈ [d_in];
(iii) the convolutional pre-activation q_{x,Θ}(i_fout, i_out, l) = Σ_{i_cv, i_in} Θ(i_cv, i_in, i_out, l) · z_{x,Θ}(i_fout ⊕ (i_cv − 1), i_in, l − 1).

Table 1 (information flow in an FC-DNN with ReLU):
Input layer: z_{x,Θ}(·, 0) = x
Pre-activation: q_{x,Θ}(i_out, l) = Σ_{i_in} Θ(i_in, i_out, l) · z_{x,Θ}(i_in, l − 1)
Gating value: G_{x,Θ}(i_out, l) = 1_{{q_{x,Θ}(i_out, l) > 0}}
Hidden unit output: z_{x,Θ}(i_out, l) = q_{x,Θ}(i_out, l) · G_{x,Θ}(i_out, l)
Final output: ŷ_Θ(x) = Σ_{i_in} Θ(i_in, i_out, d) · z_{x,Θ}(i_in, d − 1)
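The circular indexing i ⊕ r and the rotation rot(x, r) used by the CNN can be sketched in a few lines of numpy (the helper names rot and circ_conv and the toy sizes are ours, purely for illustration); the sketch also checks the wrap-around equivariance that underlies the rotational symmetry used later.

```python
import numpy as np

def rot(x, r):
    """rot(x, r)(i) = x(i (+) r): circular rotation with wrap-around indexing."""
    return np.roll(x, -r)                # out[i] = x[(i + r) mod d_in]

def circ_conv(x, theta):
    """1-D circular convolution of x with a window theta of size w_cv < d_in."""
    d_in, w_cv = len(x), len(theta)
    return np.array([sum(theta[c] * x[(i + c) % d_in] for c in range(w_cv))
                     for i in range(d_in)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
theta = np.array([0.5, -1.0])

# Equivariance: convolving a rotated input equals rotating the convolved output.
lhs = circ_conv(rot(x, 2), theta)
rhs = rot(circ_conv(x, theta), 2)
```

Here `lhs` and `rhs` agree exactly, which is the elementary fact behind Proposition 3.3 below.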

Definition 2.2 (Pooling). Let G^pool_{x,Θ}(i_fout, i_out, d_cv + 1) denote the pooling mask; then

z_{x,Θ}(i_out, d_cv + 1) = Σ_{i_fout} z_{x,Θ}(i_fout, i_out, d_cv) · G^pool_{x,Θ}(i_fout, i_out, d_cv + 1),

where in the case of global-average-pooling, G^pool_{x,Θ}(i_fout, i_out, d_cv + 1) = 1/d_in, ∀ i_out ∈ [w], i_fout ∈ [d_in]; and in the case of max-pooling, for a given i_out ∈ [w], G^pool_{x,Θ}(i_max, i_out, d_cv + 1) = 1, where i_max = argmax_{i_fout} z_{x,Θ}(i_fout, i_out, d_cv), and G^pool_{x,Θ}(i_fout, i_out, d_cv + 1) = 0, ∀ i_fout ≠ i_max.

ResNet: We consider ResNets with '(b + 2)' blocks and 'b' skip connections between the blocks (Figure 1). Each block is an FC-DNN of depth 'd_blk' and width 'w'. Here, pre_i, post_i, i ∈ [b], are normalisation variables.

Definition 2.3 (Sub FC-DNNs). Let 2^[b] denote the power set of [b] and let J ∈ 2^[b] denote any subset of [b]. Define the 'J-th' sub-FC-DNN of the ResNet to be the fully connected network obtained by ignoring/removing the skip connections skip_j, ∀ j ∈ J (see Figure 1).
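Definition 2.2 says that pooling is just an inner product of a filter's output with a mask. A minimal sketch (helper names and sizes are illustrative assumptions):

```python
import numpy as np

def gap_mask(d_in):
    """Global-average-pooling mask: every spatial position weighted 1/d_in."""
    return np.full(d_in, 1.0 / d_in)

def max_pool_mask(z):
    """Max-pooling mask: 1 at the arg-max spatial position, 0 elsewhere."""
    m = np.zeros_like(z)
    m[np.argmax(z)] = 1.0
    return m

z = np.array([0.3, 2.0, -1.0, 0.7])       # output of one filter across d_in = 4 positions

gap_out = z @ gap_mask(len(z))            # = average of z
max_out = z @ max_pool_mask(z)            # = max of z
```

Viewing pooling as a (possibly input-dependent) 0/1-or-1/d_in mask is what lets the pooling layer be folded into the path activity in Section 3.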

3. NEURAL PATH FRAMEWORK

In this section, we extend the neural path framework developed by LS2020 to the CNN and ResNet architectures described in the previous section. The neural path framework exploits the gating property of the ReLU activation, which can be thought of as a gate/mask that blocks/allows its pre-activation input depending on its 0/1 state (0 if the pre-activation is negative and 1 if it is positive). The key idea is to break a DNN (with ReLU) into paths, and express its output as a summation of the contributions of the paths. The contribution of a path is the product of the signal at its input node, the weights in the path and the gates in the path. For a DNN with P paths and an input x ∈ R^{d_in}, the gating information is encoded in a novel neural path feature (NPF), φ_{x,Θ} ∈ R^P, and a novel neural path value (NPV), v_Θ ∈ R^P, encodes the weights. The output of the DNN is then the inner product of the NPF and NPV, i.e., ŷ_Θ(x_s) = ⟨φ_{x_s,Θ}, v_Θ⟩ (Proposition 3.4).

Definition 3.1. A path starts from an input node, passes through weights, hidden nodes, and normalisation constants, and ends at the output node.

Proposition 3.1. The total numbers of paths in FC-DNN, CNN and ResNet are respectively given by P_fc = d_in · w^{d−1}, P_cnn = d_in · (w_cv · w)^{d_cv} · w^{d_fc − 1} and P_res = d_in · Σ_{i=0}^{b} C(b, i) · w^{(i+2)·d_blk − 1}.

Notation 3.1 (Index Maps). The ranges of the index maps I^f_l, I^cv_l, I_l are [d_in], [w_cv] and [w] respectively. The index maps are used to identify the nodes through which a path p passes. Further, let I_J(p) : [P_res] → 2^[b] specify the indices of the skip connections ignored in path p. Also, we follow the convention that the weights and gating values of layers corresponding to skipped blocks are 1.

Definition 3.2 (Path Activity). The product of the gating values in a path p is its 'activity', denoted by A_Θ(x, p). We define:
(a) A_Θ(x, p) = Π_{l=1}^{d−1} G_{x,Θ}(I_l(p), l), for FC-DNN and ResNet.
(b) A_Θ(x, p) = Π_{l=1}^{d_cv+1} G_{x,Θ}(I^f_l(p), I_l(p), l) · Π_{l=d_cv+2}^{d_cv+d_fc+1} G_{x,Θ}(I_l(p), l), for CNN. In the CNN, the pooling layer is accounted for by letting G = G^pool for l = d_cv + 1.

(c) For the ResNet, the neural path value (Definition 3.5) of a path p is v_Θ(p) = Π_{l=1}^{d} Θ(I_{l−1}(p), I_l(p), l) · Γ(I_J(p)). The neural path value is defined as v_Θ = (v_Θ(p), p ∈ [P_fc]) ∈ R^{P_fc}, v_Θ = (v_Θ(B_p̂), p̂ ∈ [P̂_cnn]) ∈ R^{P̂_cnn}, and v_Θ = (v_Θ(p), p ∈ [P_res]) ∈ R^{P_res} for FC-DNN, CNN and ResNet respectively.

Proposition 3.3 (Rotational Invariance). The internal variables in the convolutional layers are circularly symmetric, i.e., for r ∈ {0, . . . , d_in − 1} it follows that (i) z_{rot(x,r),Θ}(i_fout, ·, ·) = z_{x,Θ}(i_fout ⊕ r, ·, ·), (ii) q_{rot(x,r),Θ}(i_fout, ·, ·) = q_{x,Θ}(i_fout ⊕ r, ·, ·) and (iii) G_{rot(x,r),Θ}(i_fout, ·, ·) = G_{x,Θ}(i_fout ⊕ r, ·, ·).
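Proposition 3.4 (output = ⟨NPF, NPV⟩) and the FC-DNN path count of Proposition 3.1 can be sanity-checked by brute-force path enumeration on a toy network (sizes and seed are illustrative assumptions):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d_in, w = 2, 3                                     # toy sizes: d = 3 weight layers
W1 = rng.standard_normal((w, d_in))
W2 = rng.standard_normal((w, w))
W3 = rng.standard_normal((1, w))
x = rng.standard_normal(d_in)

# Standard forward pass of a depth-3 ReLU network.
q1 = W1 @ x;  G1 = (q1 > 0).astype(float);  z1 = q1 * G1
q2 = W2 @ z1; G2 = (q2 > 0).astype(float);  z2 = q2 * G2
y_forward = (W3 @ z2)[0]

# Path decomposition: y = sum over paths of  input signal * path activity * path value.
y_paths, n_paths = 0.0, 0
for i0, i1, i2 in product(range(d_in), range(w), range(w)):
    value = W1[i1, i0] * W2[i2, i1] * W3[0, i2]    # product of the weights on the path
    activity = G1[i1] * G2[i2]                     # product of the gates on the path
    y_paths += x[i0] * activity * value
    n_paths += 1
```

The two outputs agree exactly, and the loop visits P_fc = d_in · w^{d−1} = 2 · 3² = 18 paths.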

4. NEURAL PATH KERNEL: COMPOSITE KERNEL BASED ON SUB-NETWORKS

In this section, we discuss the properties of the neural path kernel (NPK) associated with the NPFs defined in Section 3. Recall that a co-ordinate of the NPF can be non-zero only if the corresponding path is active. Consequently, the NPK for a pair of input examples is a similarity measure that depends on the number of paths that are active for both examples. Such common active paths are captured in a quantity denoted by Λ (Definition 4.2). The number of active paths is in turn dependent on the number of active gates in each layer, a fact that endows the NPK with a hierarchical/composite structure. Gates are the basic building blocks; the gates in a layer form a w-dimensional binary feature whose kernels are the base kernels. When the layers are laid out depth-wise, we obtain a product of the base kernels. When skip connections are added, we obtain a sum of products of base kernels. And the presence of convolution with pooling provides rotational invariance. The NPK matrix H_Θ = Φ_Θ^⊤ Φ_Θ is defined in Definition 4.1, where Φ_Θ is the NPF matrix.

Definition 4.2. Define Λ_Θ(i, x_s, x_s′) = |{p ∈ [P] : I_0(p) = i, A_Θ(x_s, p) = A_Θ(x_s′, p) = 1}| to be the total number of paths, 'active' for both x_s and x_s′, that pass through input node i.

Definition 4.3 (Layer-wise Kernel). Let G_{x,Θ}(·, l) ∈ R^w be the w-dimensional feature of the gating values in layer l for input x ∈ R^{d_in}. Define the layer-wise kernels H^lyr_{l,Θ}(s, s′) = ⟨G_{x_s,Θ}(·, l), G_{x_s′,Θ}(·, l)⟩.

Lemma 4.1 (Product Kernel). Let H^fc_Θ denote the NPK of an FC-DNN, and for a diagonal matrix D ∈ R^{d_in×d_in} with strictly positive entries and u, u′ ∈ R^{d_in}, let ⟨u, u′⟩_D = Σ_{i=1}^{d_in} D(i) u(i) u′(i). Then

H^fc_Θ(s, s′) = ⟨x_s, x_s′⟩_{Λ(·, x_s, x_s′)} = ⟨x_s, x_s′⟩ · Π_{l=1}^{d−1} H^lyr_{l,Θ}(s, s′).

Lemma 4.2 (Sum of Product Kernel). Let H^res_Θ be the NPK of the ResNet, and H^J_Θ be the NPK of the sub-FC-DNN within the ResNet obtained by ignoring the skip connections in the set J. Then

H^res_Θ = Σ_{J ∈ 2^[b]} H^J_Θ.

Lemma 4.3 (Rotationally Invariant Kernel). Let H^cnv_Θ denote the NPK of a CNN; then

H^cnv_Θ(s, s′) = Σ_{r=0}^{d_in−1} ⟨x_s, rot(x_s′, r)⟩_{Λ(·, x_s, rot(x_s′, r))} = Σ_{r=0}^{d_in−1} ⟨rot(x_s, r), x_s′⟩_{Λ(·, rot(x_s, r), x_s′)}.
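The product structure of Lemma 4.1 can likewise be verified by brute force on a toy FC-DNN: the inner product of the NPFs factorises into the input inner product times the product of the layer-wise gate overlaps. The sketch below (toy sizes and helper names are ours) does exactly this:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
d_in, w = 2, 3                                   # toy sizes
W1, W2 = rng.standard_normal((w, d_in)), rng.standard_normal((w, w))

def layer_gates(x):
    q1 = W1 @ x;          g1 = (q1 > 0).astype(float)
    q2 = W2 @ (q1 * g1);  g2 = (q2 > 0).astype(float)
    return g1, g2

def npf(x):
    """phi(p) = x(input node of p) * (product of the gates on p), over all paths p."""
    g1, g2 = layer_gates(x)
    return np.array([x[i0] * g1[i1] * g2[i2]
                     for i0, i1, i2 in product(range(d_in), range(w), range(w))])

xa, xb = rng.standard_normal(d_in), rng.standard_normal(d_in)

lhs = npf(xa) @ npf(xb)                          # NPK entry, by brute-force enumeration
g1a, g2a = layer_gates(xa)
g1b, g2b = layer_gates(xb)
rhs = (xa @ xb) * (g1a @ g1b) * (g2a @ g2b)      # <x, x'> times the product of base kernels
```

Because every combination of co-active gates across layers is a co-active path, the sum over paths factorises layer by layer, which is exactly the Hadamard/product structure of the lemma.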

5. MAIN THEORETICAL RESULT

In this section, we proceed with the final step in extending the neural path theory to CNN and ResNet. As in LS2020, we first describe the deep gated network (DGN) setup that decouples the NPFs and NPV, and follow it up with the main result that connects the NPK and the NTK in the DGN setting. The DGN setup was introduced by LS2020 to analytically characterise the role played by the gates in a 'standalone' manner. The DGN has two networks, namely the feature network, parameterised by Θ^f ∈ R^{d^f_net}, which holds the NPFs (i.e., the gating information), and the value network, parameterised by Θ^v ∈ R^{d^v_net}, which holds the NPV. The combined parameterisation is denoted by Θ^DGN = (Θ^f, Θ^v) ∈ R^{d^f_net + d^v_net}. The two networks compute (see Figure 3):

Feature network:
z^f_{x,Θf}(·, 0) = x^f
q^f_{x,Θf}(i_out, l) = ⟨Θ^f(·, i_out, l), z^f_{x,Θf}(·, l − 1)⟩
z^f_{x,Θf}(i_out, l) = q^f_{x,Θf}(i_out, l) · G^f_{x,Θf}(i_out, l)
ŷ^f_{Θf}(x) = ⟨Θ^f(·, i_out, d), z^f_{x,Θf}(·, d − 1)⟩

Value network:
z^v_{x,ΘDGN}(·, 0) = x^v
q^v_{x,ΘDGN}(i_out, l) = ⟨Θ^v(·, i_out, l), z^v_{x,ΘDGN}(·, l − 1)⟩
z^v_{x,ΘDGN}(i_out, l) = q^v_{x,ΘDGN}(i_out, l) · G^f_{x,Θf}(i_out, l)
ŷ_{ΘDGN}(x) = ⟨Θ^v(·, i_out, d), z^v_{x,ΘDGN}(·, d − 1)⟩

Gates: Hard-ReLU: G^f_{x,Θf}(i_out, l) = 1_{{q^f_{x,Θf}(i_out, l) > 0}}, or Soft-ReLU: G^f_{x,Θf}(i_out, l) = 1 / (1 + exp(−β · q^f_{x,Θf}(i_out, l))).

Thus the learning problem in the DGN is ŷ_{ΘDGN}(x) = ⟨φ_{x,Θf}, v_{Θv}⟩.

Definition 5.1. The DGN has 4 regimes, namely decoupled learning (DL), fixed learnt (FL), fixed random-dependent initialisation (FR-DI) and fixed random-independent initialisation (FR-II). In all the regimes, ŷ_{ΘDGN} is the output, and Θ^v_0 is always initialised at random and is trainable. The regimes differ based on i) the trainability of Θ^f and ii) the initialisation Θ^f_0, as described below.
DL: Θ^f is trainable, and Θ^f_0 and Θ^v_0 are random and statistically independent; β > 0.
FL: Θ^f is non-trainable, and Θ^f_0 is pre-trained; Θ^v_0 is statistically independent of Θ^f_0.
FR-II: Θ^f is non-trainable, and Θ^f_0 and Θ^v_0 are random and statistically independent.
FR-DI: Θ^f is non-trainable, and Θ^f_0 = Θ^v_0.

DGN Regimes: The flexibility in a DGN is that a) Θ^f can be trainable/non-trainable and b) Θ^f_0 can be random or pre-trained using ŷ_{Θf} as the output (Definition 5.1). Using the DGN setup, we can study the role of gates by comparing (a) learnable (DL) vs fixed gates (FL, FR-DI, FR-II), (b) random (FR-DI, FR-II) vs learnt gates (FL) and (c) dependent (FR-DI) vs independent initialisations (FR-II). In the DL regime, the 'soft-ReLU' is chosen to enable gradient flow through the feature network.

Proposition 5.1. Let K_{ΘDGN} be the NTK matrix of the DGN; then K_{ΘDGN} = K^v_{ΘDGN} + K^f_{ΘDGN}, where
Overall NTK: K_{ΘDGN}(s, s′) = ⟨ψ_{x_s,ΘDGN}, ψ_{x_s′,ΘDGN}⟩, with ψ_{x,ΘDGN} = ∇_{ΘDGN} ŷ_{ΘDGN}(x) ∈ R^{d_net};
Value NTK: K^v_{ΘDGN}(s, s′) = ⟨ψ^v_{x_s,ΘDGN}, ψ^v_{x_s′,ΘDGN}⟩, with ψ^v_{x,ΘDGN} = ∇_{Θ^v} ŷ_{ΘDGN}(x) ∈ R^{d^v_net};
Feature NTK: K^f_{ΘDGN}(s, s′) = ⟨ψ^f_{x_s,ΘDGN}, ψ^f_{x_s′,ΘDGN}⟩, with ψ^f_{x,ΘDGN} = ∇_{Θ^f} ŷ_{ΘDGN}(x) ∈ R^{d^f_net}.
Remark: There are two separate NTKs, one each for the feature and value networks. In the fixed regimes, K^f = 0.

Theorem 5.1. Assume (i) Θ^v_0 is statistically independent of Θ^f_0 and (ii) the entries of Θ^v_0 are i.i.d. symmetric Bernoulli over {−σ, +σ}. Let σ_fc = c_scale/√w and σ_cv = c_scale/√(w · w_cv) for the FC and convolutional layers respectively. As w → ∞, we have:
(i) K^v_{ΘDGN_0} → β̄_fc · H_{Θf_0}, with β̄_fc = d · σ_fc^{2(d−1)}, for FC-DNN;
(ii) K^v_{ΘDGN_0} → β̄_cv · H_{Θf_0}, with β̄_cv = (1/d_in²) · (d_cv · σ_cv^{2(d_cv−1)} · σ_fc^{2 d_fc} + d_fc · σ_cv^{2 d_cv} · σ_fc^{2(d_fc−1)}), for CNN with GAP;
(iii) K^v_{ΘDGN_0} → Σ_{J ∈ 2^[b]} β̄^J_rs · H^J_{Θf_0}, with β̄^J_rs = (|J| + 2) d_blk · σ_fc^{2((|J|+2) d_blk − 1)} · Γ(J)², for ResNet.

• β̄_fc, β̄_cv, β̄_rs: The simplest of all is β̄_fc = d · σ_fc^{2(d−1)}: the factor d is due to the fact that there are d weights in a path; in the exponent of σ_fc, the factor (d − 1) arises because the gradient with respect to a particular weight is the product of all the weights in the path excluding that weight itself, and the factor of 2 is due to the fact that the NTK is an inner product of two gradients. β̄_cv is similar to β̄_fc, with separate bookkeeping for the convolutional and FC layers. In the infinite-width regime, the change in the activations during training is only of the order √(1/w), which goes to 0 as w → ∞; hence, though the independence assumption in Theorem 5.1 may not hold exactly for a standard DNN with ReLU, it is not a strong assumption to fix the NPFs for the purpose of analysis. Once the NPFs are fixed, it is only natural to statistically decouple the NPV from the fixed NPFs (Theorem 5.1 holds in the FR-II, FL and DL regimes).
• Gates are Key: In simple terms, Theorem 5.1 says that if the gates/masks are known, then the weights are expendable, a fact which we also verify in our extensive experiments.
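A minimal sketch of the DGN forward pass may help fix ideas: the feature network alone decides the (soft) gates, and the value network alone carries the weights that enter the output; the 'standard' and 'all-ones' modes differ only in what is fed to the value network. Sizes and names are illustrative assumptions; β = 10 follows the experimental section.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, w, beta = 4, 8, 10.0           # toy sizes; beta = 10 as in the experiments

def soft_gate(q):
    """Soft-ReLU gate in (0, 1): 1 / (1 + exp(-beta * q))."""
    return 1.0 / (1.0 + np.exp(-beta * q))

# Independently initialised feature (Theta^f) and value (Theta^v) networks (FR-II/DL style).
Tf = [rng.standard_normal((w, d_in)), rng.standard_normal((w, w))]
Tv = [rng.standard_normal((w, d_in)), rng.standard_normal((w, w))]
Tv_out = rng.standard_normal(w)

def dgn_forward(x, x_v=None):
    """Feature network computes the gates; value network applies them to its own pre-activations."""
    zf = x
    zv = x if x_v is None else x_v   # 'standard' mode vs an externally supplied value input
    for Wf, Wv in zip(Tf, Tv):
        qf = Wf @ zf
        g = soft_gate(qf)            # gates come ONLY from the feature network
        zf = qf * g
        zv = (Wv @ zv) * g
    return float(Tv_out @ zv)

x = rng.standard_normal(d_in)
y_std = dgn_forward(x)                      # value network sees the actual input
y_ones = dgn_forward(x, np.ones(d_in))      # 'all-ones' input mode
```

With hard gates (a step function in place of `soft_gate`) and the output read from the value network, this is exactly the decoupled ⟨NPF, NPV⟩ model; the soft gate is what makes Θ^f trainable in the DL regime.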

6. NUMERICAL EXPERIMENTS

We now show via experiments that gates indeed play a central role in deep learning. For this, we use the DGN setup (Figure 4) to create models in the 4 regimes, namely DL, FL, FR-II and FR-DI. In each of the 4 regimes, we create combinatorially many models via a) permutation of the layers when they are copied from the feature to the value network, and b) setting the input to the value network to 1 (in training and testing), i.e., a tensor with all its entries equal to 1. We observe that in all the 4 regimes, the models are robust to the combinatorial variations. Setup: The datasets are MNIST and CIFAR-10. For CIFAR-10, we use Figure 4 with 3×3 windows and 128 filters in each layer. For MNIST, we use FC layers instead of the convolutional layers. All the FC-DNNs and CNNs are trained with Adam [10] (step-size = 3 × 10^{−4}, batch size = 32). A ResNet called DavidNet [12] was trained with SGD (step-size = 0.5, batch size = 256). We use β = 10.

Reporting of Statistics:

The results are summarised in Figure 4. For FC-DNN and CNN, in each of the 4 regimes, we train 48 = 2 (x^v = x or x^v = 1) × 24 (layer permutations) models. Each of these models is trained to almost 100% accuracy, and the test performance is taken to be the best obtained in a given run. Each of the 48 models is run only once. For the ResNet, we train only two models for each of the 4 regimes (without permuting the layers, but with the image as well as the 'all-ones' input variation), and here each model is run 5 times. • Result Discussion: Recall that in regimes FR-II and FR-DI the gates are fixed and random, and only Θ^v is trained. In the DL regime, both Θ^f and Θ^v are trained; in the FL regime, Θ^f is pre-trained and fixed, and only Θ^v is trained. In the following discussion, we compare the performance of the models in the various regimes, along with the performance of the CNTK of Arora et al. (2019) (77.43% on CIFAR-10) and the performance of a standard DNN with ReLU. The main observations are listed below (those of Lakshminarayanan and Singh (2020) are also revisited for the sake of completeness). 1. Decoupling: There is no performance difference between FR-II and FR-DI. Further, decoupled learning of gates (DL) performs significantly better than fixed random gates (FR), and the gap between a standard DNN with ReLU and DL is less than 3%. This marginal performance loss seems to be a worthwhile trade-off for the fundamental insights of Theorem 5.1 under the decoupling assumption.

2. Recovery:

The fixed learnt regime (FL) shows that, using the gates of a pre-trained ReLU network, performance can be recovered by training the NPV. Also, by interpreting the input-dependent component of a model as the features and the input-independent component as the weights, it makes sense to view the gates/NPFs as the hidden features and the NPV as the weights. 3. Random Gates: FR-II does perform well in all the experiments (note that for a 10-class problem, a random classifier would achieve only 10% test accuracy). Given the observation that the gates are the true features, and the fact that there is no learning in the gates in the fixed regime, the performance of fixed random gates can be purely attributed to the in-built structure.

4. Gate Learning:

We group the models into three sets, where S1 = {ReLU, FL, DL}, S2 = {FR} and S3 = {CNTK}, and explain the differences in performance due to gate learning. S2 and S3 have no gate learning. However, S3, due to its infinite width, has better averaging, resulting in a well-formed kernel, and hence performs better than S2, which has finite width. Thus, the difference between S2 and S3 can be attributed to finite versus infinite width. Both S1 and S2 are of finite width, and hence conventional feature learning happens in both S1 and S2; but S1, with gate learning, is better (77.5% or above on CIFAR-10) than S2 (67% on CIFAR-10) with no gate learning. Thus neither finite width nor conventional feature learning explains the difference between S1 and S2. Hence, 'gate learning' discriminates the regimes S1, S2 and S3 better than the conventional feature learning view.

5. Permutation and Input Invariance:

The performance (in all the 4 regimes) is robust to 'all-ones' inputs. Note that in the 'all-ones' case, the input information affects the models only via the gates. Here, all the entries of the input Gram matrix are identical, and the NPK depends only on Λ, which is a measure of the sub-network simultaneously active for the various input pairs. The performance (in all the 4 regimes) is also robust to permutation of the layers. This can be attributed to the fact that the product Π_{l=1}^{d−1} H^lyr_{l,Θ} of the layer-level base kernels is invariant to their order. 6. Visualisation: Figure 5 compares the hidden layer outputs of a standard DNN with ReLU with 4 layers, and those of a DGN which copies the gates from the standard DNN but reverses the gating masks when applying them to the value network. Also, the value network of the DGN was provided with a fixed random input (as shown in Figure 5). Both models achieved about 80% test accuracy, an otherwise surprising outcome; yet, as per the theory developed in this paper, a random input to the value network should not have much effect on performance, and this experiment confirms the same.
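The order invariance in point 5 is immediate from the commutativity of the product of layer-wise base kernels; a tiny sketch with random binary gate features (purely illustrative stand-ins for real gates) makes it concrete:

```python
import numpy as np

rng = np.random.default_rng(4)
w, depth = 8, 5                                   # toy sizes

# Random 0/1 gate features of two inputs at each layer.
Ga = [rng.integers(0, 2, w).astype(float) for _ in range(depth)]
Gb = [rng.integers(0, 2, w).astype(float) for _ in range(depth)]

base = [ga @ gb for ga, gb in zip(Ga, Gb)]        # layer-wise base kernels
npk = np.prod(base)

perm = rng.permutation(depth)                     # permute the order of the layer masks
npk_perm = np.prod([base[i] for i in perm])       # the product kernel is unchanged
```

Because the base kernels here are small non-negative integers, the two products are exactly equal, mirroring the robustness to layer permutation observed in the experiments.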

7. RELATED AND FUTURE WORK

Our paper extended the work of Lakshminarayanan and Singh (2020) to CNN and ResNet. Further, we pointed out the composite nature of the underlying kernel. The experiments with permuted masks and constant inputs are also significant and novel pieces of evidence, which to our knowledge are the first of their kind in the literature. We reserve a formal statement on the behaviour of H^lyr_{l,Θ_0} for the future. Multiple Kernel Learning (Gönen and Alpaydın, 2011; Bach et al., 2004; Sonnenburg et al., 2006; Cortes et al., 2009) is the name given to a class of methods that learn a linear or non-linear combination of one or many base kernels. For instance, Cortes et al. (2009) consider polynomial combinations of base kernels, which also have a 'sum of products' form. Our experiments do indicate that the learning in the gates (and hence in the underlying base kernels) has a significant impact. Understanding K^f (Proposition 5.1) might be a way to establish the extent and nature of kernel learning in deep learning. It is also interesting to check whether, in a ResNet, the kernels of its sub-FC-DNNs are combined optimally.

8. CONCLUSION

We attributed the success of deep learning to the following two key ingredients: (i) a composite kernel with gates as fundamental building blocks and (ii) allowing the gates to learn/adapt during training. We justified our claims theoretically as well as experimentally. This work, along with that of Lakshminarayanan and Singh (2020), provides a paradigm shift in understanding deep learning. Here, gates play a central role. Each gate is related to a hyperplane, and the gates together form layer-level binary features whose kernels are the base kernels. Laying out these binary features depth-wise gives rise to a product of the base kernels. Skip connections give a 'sum of products' structure, and convolution with pooling gives rotational invariance. The learning in the gates further enhances the generalisation capabilities of the models.



¹ Introduced for the first time in the work of Lakshminarayanan and Singh (2020).



Definition 2.1 (Circular Convolution). For x ∈ R^{d_in}, i ∈ [d_in] and r ∈ {0, . . . , d_in − 1}, define:

Information flow in an FC-DNN with ReLU. Here, the 'q's are pre-activation inputs, the 'z's are the outputs of the hidden layers, and the 'G's are the gating values. l ∈ [d − 1] is the index of the layer; i_out and i_in are the indices of nodes in the current and previous layer respectively.

Figure 1: The ResNet architecture is shown at the top. The process of obtaining a sub-FC-DNN by ignoring a skip connection (retaining the block) or retaining a skip connection (ignoring the block) is shown at the bottom.

Here, i_in/i_out are the indices (taking values in [w]) of the input/output filters; i_cv denotes the index of the convolutional window (taking values in [w_cv]) between input and output filters i_in and i_out; i_fout denotes the index (taking values in [d_in], the dimension of the input features) of an individual node in a given output filter.

Definition 3.3 (Bundle Paths of Sharing Weights). Let P̂_cnn = P_cnn/d_in, and let {B_1, . . . , B_{P̂_cnn}} be a collection of sets such that ∀ i, j ∈ [P̂_cnn], i ≠ j, we have B_i ∩ B_j = ∅ and ∪_{i=1}^{P̂_cnn} B_i = [P_cnn]. Further, if paths p, p′ ∈ B_i, then I^cv_l(p) = I^cv_l(p′), ∀ l = 1, . . . , d_cv, and I_l(p) = I_l(p′), ∀ l = 0, . . . , d_cv.

Proposition 3.2. There are exactly d_in paths in a bundle.

Definition 3.4 (Normalisation Factor). Define Γ(J) = Π_{j∈J} pre_j · Π_{j′∈[b]} post_{j′}.

Weight sharing is shown in the cartoon in Figure 2, which shows a CNN with d_in = 3, w = 1, w_cv = 2, d_cv = 3, d_fc = 0. Here, the red-coloured paths all share the same weights Θ(1, 1, 1, l), l = 1, 2, 3, and the blue-coloured paths all share the same weights Θ(2, 1, 1, l), l = 1, 2, 3.

Figure 2: Shows weight sharing and the rotational symmetry of the internal variables and of the output after pooling in a CNN. The leftmost cartoon uses a GAP layer, and the other two cartoons use max-pooling. Circles are nodes, and the 1/0 in each node indicates the gating. Pre-activations/node outputs are shown in brown/purple.

Definition 3.5 (Neural Path Value). The product of the weights and normalisation factors in a path p is its 'value'. The value of a path bundle is the value of any path in that bundle. The path/bundle values are denoted by v_Θ(p)/v_Θ(B_p̂) and are defined as follows: (a) v_Θ(p) = Π_{l=1}^{d} Θ(I_{l−1}(p), I_l(p), l). (b) v_Θ(B_p̂) = Π_{l=1}^{d_cv} Θ(I^cv_l(p), I_{l−1}(p), I_l(p), l) · Π_{l=d_cv+2}^{d_cv+d_fc+1} Θ(I_{l−1}(p), I_l(p), l), for any p ∈ B_p̂.

The neural path feature (NPF) corresponding to a path p is given by (a) φ_{x,Θ}(p) = x(I^f_0(p)) · A_Θ(x, p) for FC-DNN and ResNet, and (b) φ_{x,Θ}(B_p̂) = Σ_{p∈B_p̂} x(I^f_0(p)) · A_Θ(x, p) for CNN. The NPF is defined as φ_{x,Θ} = (φ_{x,Θ}(p), p ∈ [P_fc]) ∈ R^{P_fc}, φ_{x,Θ} = (φ_{x,Θ}(B_p̂), p̂ ∈ [P̂_cnn]) ∈ R^{P̂_cnn}, and φ_{x,Θ} = (φ_{x,Θ}(p), p ∈ [P_res]) ∈ R^{P_res} for FC-DNN, CNN and ResNet respectively.

Proposition 3.4 (Output = ⟨NPF, NPV⟩). The output of the network can be written as an inner product of the NPF and NPV, i.e., ŷ_Θ(x) = ⟨φ_{x,Θ}, v_Θ⟩.

Definition 4.1 (NPK Matrix). Define the NPK matrix to be H_Θ = Φ_Θ^⊤ Φ_Θ, where Φ_Θ = (φ_{x_1,Θ}, . . . , φ_{x_n,Θ}) ∈ R^{P×n} is the NPF matrix.

Figure 3: Shows a deep gated network (DGN). The soft-ReLU enables gradient flow into the feature network.

The factor 1/d_in² in β̄_cv is due to the GAP layer. In β̄_rs, the β̄_fc of each sub-FC-DNN within the ResNet is scaled by the corresponding normalisation factor, and the results are summed. • Decoupling: In a DNN with ReLU (and in the FR-DI regime of the DGN), the NPV and NPF are not statistically independent at initialisation, i.e., Theorem 5.1 does not hold. However, the current state-of-the-art analysis (Jacot et al., 2018; Arora et al., 2019; Cao and Gu, 2019) is in the infinite-width (w → ∞) regime, wherein the change in the activations during training is only of the order √(1/w).

Figure 4: C^f_i, C^v_i, i ∈ [4], are the convolutional layers, which are followed by a global-average-pooling (GAP) layer, then by a dense layer (D^f/D^v), and a softmax layer to produce the final logits.

(Left) Standard ReLU DNN and (Right) DGN with fixed learnt gates from the model on the left, with j_1, j_2, j_3, j_4 = 4, 3, 2, 1 and x^v = random tensor. For each model, the input is shown first; then, starting from the first layer, the first 2 filters of each of the 4 layers are shown.

Figure 5: Hidden layer outputs for a fixed random input to the value network of DGN with permuted gating.

Gated linearity was studied recently by Fiat et al. (2019); however, they considered only single-layered gated networks. Jacot et al. (2018), Arora et al. (2019), Cao and Gu (2019), Jacot et al. (2019) and Du et al. (2018) have all used the NTK framework to understand questions related to optimisation and/or generalisation in DNNs. We now discuss future work. Base Kernel: At random initialisation, for each l, H^lyr_{l,Θ_0}(s, s′)/w is the fraction of gates that are simultaneously active for input examples s, s′, which in the limit of infinite width is equal to 1/2 − angle(z_{x_s}(·, l), z_{x_s′}(·, l))/(2π) (Xie et al., 2017). Further, due to the property of ReLU to pass only positive components, we conjecture that the pairwise angle between input examples, measured at the hidden layer outputs, is a decreasing function of depth, approaching 0 as l → ∞, ∀ s, s′ ∈ [n].
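The infinite-width fraction of co-active gates can be checked by Monte Carlo: for a random Gaussian hyperplane w, the probability that w·u > 0 and w·v > 0 is (π − θ)/(2π) = 1/2 − θ/(2π), where θ is the angle between u and v. A quick sketch (sample size, seed, and test vectors are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n_gates = 500_000                                 # Monte Carlo sample size

theta = np.pi / 3
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])      # angle(u, v) = pi / 3

W = rng.standard_normal((n_gates, 2))             # rows = random gating hyperplanes
both_on = ((W @ u > 0) & (W @ v > 0)).mean()      # fraction of gates active for BOTH inputs

predicted = 0.5 - theta / (2 * np.pi)             # = (pi - theta) / (2 pi) = 1/3
```

The empirical fraction matches the prediction to Monte Carlo accuracy, illustrating how the base kernels at random initialisation are governed purely by pairwise angles.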

1. Fully connected networks: H^fc is the Hadamard product of the input data Gram matrix and the kernel matrices corresponding to the binary gating features of the individual layers. 2. Residual networks (ResNets) with skip connections: H^res assumes a sum of products form.

