A KERNEL PERSPECTIVE OF SKIP CONNECTIONS IN CONVOLUTIONAL NETWORKS

Abstract

Over-parameterized residual networks are among the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a rate similar to that of the corresponding kernels without skip connections, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased. Our analysis further shows that at finite depths the matrices obtained from these residual kernels have more favorable condition numbers than those obtained without skip connections, thereby enabling faster convergence of training with gradient descent.

1. INTRODUCTION

In the past decade, deep convolutional neural network (CNN) architectures with hundreds and even thousands of layers have been utilized for various image processing tasks. Theoretical work has indicated that shallow networks may need exponentially more nodes than deep networks to achieve the same expressive power (Telgarsky, 2016; Poggio et al., 2017). A critical contribution to the utilization of deeper networks has been the introduction of Residual Networks (He et al., 2016).

To gain an understanding of these networks, we turn to a recent line of work that has made precise the connection between neural networks and kernel ridge regression (KRR) when the width of a network (the number of channels in a CNN) tends to infinity. In particular, for such a network $f(x;\theta)$, KRR with respect to the corresponding Gaussian Process Kernel (GPK), $K(x,z) = \mathbb{E}_\theta\left[f(x;\theta)\, f(z;\theta)\right]$ (also called the Conjugate Kernel or NNGP Kernel), is equivalent to training the final layer while keeping the weights of the other layers at their initial values (Lee et al., 2017). Furthermore, KRR with respect to the Neural Tangent Kernel (NTK), $\Theta(x,z) = \mathbb{E}_\theta\left[\left\langle \frac{\partial f(x;\theta)}{\partial \theta}, \frac{\partial f(z;\theta)}{\partial \theta}\right\rangle\right]$, is equivalent to training the entire network (Jacot et al., 2018). Here $x$ and $z$ represent input data items, $\theta$ denotes the network parameters, and the expectation is computed with respect to the distribution of the initialization of the network parameters; both kernels can also be estimated numerically for a finite-width network, as in the sketch below. We distinguish between four different models: the Convolutional Gaussian Process Kernel (CGPK) and the Convolutional Neural Tangent Kernel (CNTK), and ResCGPK and ResCNTK for the same kernels with additional skip connections.

Yang (2020) and Yang & Littwin (2021) showed that, for any architecture made up of convolutions, skip connections, and ReLUs, in the infinite width limit the network converges almost surely to its NTK. This guarantees that sufficiently over-parameterized ResNets converge to their ResCNTK. Lee et al. (2019; 2020) showed that these kernels are highly predictive of finite-width networks as well. Therefore, by analyzing the spectrum and behavior of these kernels at various depths, we can better understand the role of skip connections. This raises the question of what we can learn about skip connections through these kernels, and in this work we aim to answer precisely that: by analyzing the relevant kernels, we expect to gain information that is applicable to finite-width networks. Our contributions include:
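Before listing these contributions, the sketch below illustrates how the two kernel definitions above can be estimated by Monte Carlo for a finite-width network: it averages $f(x;\theta)\,f(z;\theta)$ and $\langle \partial f(x;\theta)/\partial\theta, \partial f(z;\theta)/\partial\theta\rangle$ over random initializations of a toy one-block residual CNN with scalar output. This is a minimal illustration, not the architecture or parameterization analyzed in this paper; the names (`TinyResConvNet`, `empirical_kernels`), the network shape, the sample counts, and the use of standard PyTorch initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TinyResConvNet(nn.Module):
    """One residual conv block followed by a linear readout with scalar output."""

    def __init__(self, channels=32, size=8):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.readout = nn.Linear(channels * size * size, 1)

    def forward(self, x):
        h = x + torch.relu(self.conv(x))          # skip connection around conv + ReLU
        return self.readout(h.flatten(1))[:, 0]   # shape (batch,)


def flat_grad(net, x):
    """Gradient of the scalar output at input x w.r.t. all parameters, flattened."""
    out = net(x.unsqueeze(0))[0]
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])


def empirical_kernels(x, z, n_init=50, channels=32, size=8):
    """Monte Carlo estimates of K(x, z) = E[f(x) f(z)] and
    Theta(x, z) = E[<df(x)/dtheta, df(z)/dtheta>] over random initializations."""
    gpk, ntk = 0.0, 0.0
    for _ in range(n_init):
        net = TinyResConvNet(channels, size)
        with torch.no_grad():
            gpk += (net(x.unsqueeze(0))[0] * net(z.unsqueeze(0))[0]).item()
        ntk += torch.dot(flat_grad(net, x), flat_grad(net, z)).item()
    return gpk / n_init, ntk / n_init


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(32, 8, 8)   # a 32-channel, 8x8 "image"
    z = torch.randn(32, 8, 8)
    k_hat, theta_hat = empirical_kernels(x, z)
    print(f"GPK estimate: {k_hat:.4f}, NTK estimate: {theta_hat:.4f}")
```

Averaging over more initializations and increasing the number of channels would bring such estimates closer to the infinite-width kernels studied in the remainder of the paper.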

