A KERNEL PERSPECTIVE OF SKIP CONNECTIONS IN CONVOLUTIONAL NETWORKS

Abstract

Over-parameterized residual networks are amongst the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a similar rate to the same kernels without skip connections, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased. Our analysis further shows that the matrices obtained with these residual kernels yield more favorable condition numbers at finite depths than those obtained without skip connections, therefore enabling faster convergence of training with gradient descent.

1. INTRODUCTION

In the past decade, deep convolutional neural network (CNN) architectures with hundreds and even thousands of layers have been utilized for various image processing tasks. Theoretical work has indicated that shallow networks may need exponentially more nodes than deep networks to achieve the same expressive power (Telgarsky, 2016; Poggio et al., 2017). A critical contribution to the utilization of deeper networks has been the introduction of Residual Networks (He et al., 2016). To gain an understanding of these networks, we turn to a recent line of work that has made precise the connection between neural networks and kernel ridge regression (KRR) when the width of a network (the number of channels in a CNN) tends to infinity. In particular, for such a network f(x; θ), KRR with respect to the corresponding Gaussian Process Kernel (GPK) K(x, z) = E_θ[f(x; θ) · f(z; θ)] (also called the Conjugate Kernel or NNGP Kernel) is equivalent to training the final layer while keeping the weights of the other layers at their initial values (Lee et al., 2017). Furthermore, KRR with respect to the Neural Tangent Kernel Θ(x, z) = E_θ⟨∂f(x; θ)/∂θ, ∂f(z; θ)/∂θ⟩ is equivalent to training the entire network (Jacot et al., 2018). Here x and z represent input data items, θ denotes the network parameters, and the expectation is computed with respect to the distribution of the initialization of the network parameters. We distinguish between four models: the Convolutional Gaussian Process Kernel (CGPK), the Convolutional Neural Tangent Kernel (CNTK), and ResCGPK and ResCNTK, the same kernels with additional skip connections. Yang (2020) and Yang & Littwin (2021) showed that for any architecture made up of convolutions, skip connections, and ReLUs, in the infinite width limit the NTK of the network converges almost surely to its deterministic limit. This guarantees that sufficiently over-parameterized ResNets converge to their ResCNTK. Lee et al.
(2019; 2020) showed that these kernels are highly predictive of finite-width networks as well. Therefore, by analyzing the spectrum and behavior of these kernels at various depths, we can better understand the role of skip connections, and we expect the resulting insights to apply to finite-width networks. This is precisely the aim of this work. Our contributions include:

1. A precise closed-form recursive formula for the Gaussian Process and Neural Tangent kernels of both equivariant and invariant convolutional ResNet architectures.

2. A spectral decomposition of these kernels with normalized input and ReLU activation, showing that the eigenvalues decay polynomially with the frequency of the eigenfunctions.

3. A comparison of eigenvalues with non-residual CNNs, showing that ResNets resemble a weighted ensemble of CNNs of different depths, and thus place a larger emphasis on nearby pixels than CNNs.

4. An analysis of the condition number associated with these kernels, obtained by relating them to so-called double-constant kernels. We use these tools to show that skip connections speed up the training of the GPK.

Derivations and proofs are given in the Appendix.
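To make the two kernel definitions concrete, the following sketch estimates both by Monte Carlo over random initializations for a toy one-hidden-layer fully connected ReLU network (the convolutional case is analogous). This is our own simplified illustration, not the paper's construction; the parameterization and the helper names `relu_net`, `grad_theta`, and `empirical_kernels` are assumptions made for this example.

```python
import numpy as np

def relu_net(x, W, v):
    """f(x; theta) = v^T relu(W x) / sqrt(m): a width-m, one-hidden-layer ReLU net."""
    m = len(v)
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def grad_theta(x, W, v):
    """Gradient of f with respect to all parameters (W and v), flattened."""
    m = len(v)
    pre = W @ x                                    # pre-activations, shape (m,)
    dv = np.maximum(pre, 0.0) / np.sqrt(m)         # df/dv_i = relu(w_i . x) / sqrt(m)
    dW = np.outer(v * (pre > 0), x) / np.sqrt(m)   # df/dW_i = v_i 1[w_i . x > 0] x / sqrt(m)
    return np.concatenate([dW.ravel(), dv])

def empirical_kernels(x, z, m=512, n_init=400, seed=0):
    """Monte Carlo estimates of the GPK K(x, z) = E_theta[f(x) f(z)] and the
    NTK Theta(x, z) = E_theta<df/dtheta(x), df/dtheta(z)> over random inits."""
    rng = np.random.default_rng(seed)
    K, Theta = 0.0, 0.0
    for _ in range(n_init):
        W = rng.standard_normal((m, len(x)))       # standard Gaussian initialization
        v = rng.standard_normal(m)
        K += relu_net(x, W, v) * relu_net(z, W, v)
        Theta += grad_theta(x, W, v) @ grad_theta(z, W, v)
    return K / n_init, Theta / n_init
```

For a unit-norm input x under this parameterization, the estimates should approach the known closed forms K(x, x) = 1/2 and Θ(x, x) = 1, which gives a quick sanity check of the Monte Carlo averages.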

2. RELATED WORK

The equivalence between over-parameterized neural networks and positive definite kernels was made precise in (Lee et al., 2017; Jacot et al., 2018; Allen-Zhu et al., 2019; Lee et al., 2019; Chizat et al., 2019; Yang, 2020), amongst others. Arora et al. (2019a) derived NTK and GPK formulas for convolutional architectures and trained these kernels on CIFAR-10. Arora et al. (2019b) subsequently showed that CNTKs can outperform standard CNNs on small-data tasks. A number of studies analyzed NTK for fully connected (FC) architectures and their associated Reproducing Kernel Hilbert Spaces (RKHS). These works showed that, for training data drawn from a uniform distribution over the hypersphere, the eigenfunctions of NTK and GPK are the spherical harmonics and that, with ReLU activation, the eigenvalues decay polynomially with frequency (Bietti & Bach, 2020). Bietti & Mairal (2019) further derived explicit feature maps for these kernels. Geifman et al. (2020) and Chen & Xu (2020) showed that these kernels share the same functions in their RKHS with the Laplace kernel, restricted to the hypersphere.

Recent works applied spectral analysis to kernels associated with standard convolutional architectures that include no skip connections. Geifman et al. (2022) characterized the eigenfunctions and eigenvalues of CGPK and CNTK. Xiao (2022) and Cagnetta et al. (2022) studied CNTK with non-overlapping filters, with Xiao (2022) focusing on high-dimensional inputs.

Formulas for NTK for residual, fully connected networks were derived and analyzed in Huang et al. (2020) and Tirer et al. (2022). The latter further showed that, in contrast with FC-NTK and with a particular choice of balancing parameter relating the skip and the residual connections, ResNTK does not become degenerate as the depth tends to infinity. As we discuss later in this manuscript, this result critically depends on the assumption that the last layer is not trained. Belfer et al. (2021) showed that the eigenvalues of ResNTK for fully connected architectures decay polynomially at the same rate as NTK for networks without skip connections, indicating that residual and conventional FC architectures are subject to the same frequency bias. In related work, Du et al. (2019) proved that training over-parameterized convolutional ResNets converges to a global minimum. Balduzzi et al. (2017), Philipp et al. (2018), and Orhan & Pitkow (2017) showed that deep residual networks better address the problems of vanishing and exploding gradients, as well as singularities present in these models, compared to standard networks. Veit et al. (2016) made the empirical observation that ResNets behave like an ensemble of networks. This result is echoed in our proofs, which indicate that the eigenvalues of ResCNTK are weighted sums of eigenvalues of CNTK for an ensemble of networks of different depths.

Below we derive explicit formulas and analyze kernels corresponding to residual, convolutional network architectures. We provide lower and upper bounds on the eigenvalues of ResCNTK and ResCGPK. Our results indicate that these residual kernels are subject to the same frequency bias as their standard convolutional counterparts. However, they further indicate that residual kernels are significantly more locally biased than non-residual kernels. Indeed, locality has recently been identified as a main reason for the success of convolutional networks (Shalev-Shwartz et al., 2020; Favero et al., 2021). Moreover, we show that with the standard choice of constant balancing parameter used in practical residual networks, ResCGPK attains a better condition number than the standard CGPK, allowing it to train significantly more efficiently. This result is motivated by the work of Lee
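The link between condition number and training speed can be illustrated with a short numpy experiment (our own sketch, not from the paper). Gradient descent on the quadratic objective (1/2)aᵀKa − yᵀa, whose minimizer solves the ridgeless KRR system Ka = y, converges at the classic rate (1 − 1/κ) per step, where κ = λ_max/λ_min is the condition number of the kernel matrix. We take "double-constant" here to mean a matrix with one value on the diagonal and another elsewhere, which makes κ easy to control; the function names are hypothetical.

```python
import numpy as np

def double_constant_kernel(n, diag, off):
    """Matrix with `diag` on the diagonal and `off` elsewhere.
    Eigenvalues: diag + (n-1)*off (once) and diag - off (n-1 times),
    so kappa = (diag + (n-1)*off) / (diag - off) for diag > off > 0."""
    return (diag - off) * np.eye(n) + off * np.ones((n, n))

def gd_solve(K, y, steps):
    """Gradient descent on 0.5 a^T K a - y^T a with step size 1/lambda_max;
    returns the residual norm ||K a - y|| after `steps` iterations."""
    lam = np.linalg.eigvalsh(K)          # ascending eigenvalues
    lr = 1.0 / lam[-1]
    a = np.zeros_like(y)
    for _ in range(steps):
        a -= lr * (K @ a - y)
    return np.linalg.norm(K @ a - y)

n = 50
rng = np.random.default_rng(0)
y = rng.standard_normal(n)
K_good = double_constant_kernel(n, 1.0, 0.2)   # kappa = 10.8 / 0.8  = 13.5
K_bad = double_constant_kernel(n, 1.0, 0.9)    # kappa = 45.1 / 0.1 = 451
r_good = gd_solve(K_good, y, 200)              # residual after 200 steps
r_bad = gd_solve(K_bad, y, 200)
```

After 200 steps the well-conditioned system is solved to high accuracy while the ill-conditioned one has barely progressed, mirroring the claim that a kernel with a better condition number trains significantly faster under gradient descent.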

