A KERNEL PERSPECTIVE OF SKIP CONNECTIONS IN CONVOLUTIONAL NETWORKS

Abstract

Over-parameterized residual networks are among the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a rate similar to that of the same kernels without skip connections, thus maintaining a similar frequency bias; (2) residual kernels, however, are more locally biased. Our analysis further shows that the matrices obtained with these residual kernels yield more favorable condition numbers at finite depths than those obtained without skip connections, thereby enabling faster convergence of training with gradient descent.

1. INTRODUCTION

In the past decade, deep convolutional neural network (CNN) architectures with hundreds and even thousands of layers have been utilized for various image processing tasks. Theoretical work has indicated that shallow networks may need exponentially more nodes than deep networks to achieve the same expressive power (Telgarsky, 2016; Poggio et al., 2017). A critical contribution to the utilization of deeper networks has been the introduction of Residual Networks (He et al., 2016). To gain an understanding of these networks, we turn to a recent line of work that has made precise the connection between neural networks and kernel ridge regression (KRR) when the width of a network (the number of channels in a CNN) tends to infinity. In particular, for such a network $f(x;\theta)$, KRR with respect to the corresponding Gaussian Process Kernel (GPK) $K(x,z) = \mathbb{E}_\theta[f(x;\theta) f(z;\theta)]$ (also called the Conjugate Kernel or NNGP Kernel) is equivalent to training the final layer while keeping the weights of the other layers at their initial values (Lee et al., 2017). Furthermore, KRR with respect to the Neural Tangent Kernel $\Theta(x,z) = \mathbb{E}_\theta \left\langle \frac{\partial f(x;\theta)}{\partial \theta}, \frac{\partial f(z;\theta)}{\partial \theta} \right\rangle$ is equivalent to training the entire network (Jacot et al., 2018). Here $x$ and $z$ represent input data items, $\theta$ are the network parameters, and the expectation is computed with respect to the distribution of the initialization of the network parameters. We distinguish between four different models: the Convolutional Gaussian Process Kernel (CGPK) and the Convolutional Neural Tangent Kernel (CNTK), and ResCGPK and ResCNTK for the same kernels with additional skip connections. Yang (2020); Yang & Littwin (2021) showed that for any architecture made up of convolutions, skip connections, and ReLUs, in the infinite width limit the network converges almost surely to its NTK. This guarantees that sufficiently over-parameterized ResNets converge to their ResCNTK. Lee et al.
(2019; 2020) showed that these kernels are highly predictive of finite width networks as well. Therefore, by analyzing the spectrum and behavior of these kernels at various depths, we can better understand the role of skip connections. In this work, we ask what these kernels can teach us about skip connections. By analyzing the relevant kernels, we expect to gain information that is applicable to finite width networks. Our contributions include:

1. A precise closed-form recursive formula for the Gaussian Process and Neural Tangent Kernels of both equivariant and invariant convolutional ResNet architectures.
2. A spectral decomposition of these kernels with normalized input and ReLU activation, showing that the eigenvalues decay polynomially with the frequency of the eigenfunctions.
3. A comparison of eigenvalues with non-residual CNNs, showing that ResNets resemble a weighted ensemble of CNNs of different depths, and thus place a larger emphasis on nearby pixels than CNNs.
4. An analysis of the condition number associated with the kernels, obtained by relating them to so-called double-constant kernels. We use these tools to show that skip connections speed up the training of the GPK.

Derivations and proofs are given in the Appendix.

2. RELATED WORK

The equivalence between over-parameterized neural networks and positive definite kernels was made precise in (Lee et al., 2017; Jacot et al., 2018; Allen-Zhu et al., 2019; Lee et al., 2019; Chizat et al., 2019; Yang, 2020), amongst others. Arora et al. (2019a) derived NTK and GPK formulas for convolutional architectures and trained these kernels on CIFAR-10. Arora et al. (2019b) subsequently showed that CNTKs can outperform standard CNNs on small data tasks. A number of studies analyzed NTK for fully connected (FC) architectures and their associated Reproducing Kernel Hilbert Spaces (RKHS). These works showed, for training data drawn from a uniform distribution over the hypersphere, that the eigenfunctions of NTK and GPK are the spherical harmonics and that with ReLU activation the eigenvalues decay polynomially with frequency (Bietti & Bach, 2020). Bietti & Mairal (2019) further derived explicit feature maps for these kernels. Geifman et al. (2020) and Chen & Xu (2020) showed that these kernels share the same functions in their RKHS with the Laplace kernel, restricted to the hypersphere. Recent works applied spectral analysis to kernels associated with standard convolutional architectures that include no skip connections. Geifman et al. (2022) characterized the eigenfunctions and eigenvalues of CGPK and CNTK. Xiao (2022); Cagnetta et al. (2022) studied CNTK with non-overlapped filters, while Xiao (2022) focused on high dimensional inputs. Formulas for NTK for residual, fully connected networks were derived and analyzed in Huang et al. (2020); Tirer et al. (2022). They further showed that, in contrast with FC-NTK, and with a particular choice of balancing parameter relating the skip and the residual connections, ResNTK does not become degenerate as the depth tends to infinity. As we discuss later in this manuscript, this result critically depends on the assumption that the last layer is not trained. Belfer et al.
(2021) showed that the eigenvalues of ResNTK for fully connected architectures decay polynomially at the same rate as NTK for networks without skip connections, indicating that residual and conventional FC architectures are subject to the same frequency bias. In related work, Du et al. (2019) proved that training over-parameterized convolutional ResNets converges to a global minimum. Balduzzi et al. (2017); Philipp et al. (2018); Orhan & Pitkow (2017) showed that deep residual networks better address the problems of vanishing and exploding gradients compared to standard networks, as well as singularities that are present in these models. Veit et al. (2016) made the empirical observation that ResNets behave like an ensemble of networks. This result is echoed in our proofs, which indicate that the eigenvalues of ResCNTK are weighted sums of eigenvalues of CNTK for an ensemble of networks of different depths. Below we derive explicit formulas and analyze kernels corresponding to residual, convolutional network architectures. We provide lower and upper bounds on the eigenvalues of ResCNTK and ResCGPK. Our results indicate that these residual kernels are subject to the same frequency bias as their standard convolutional counterparts. However, they further indicate that residual kernels are significantly more locally biased than non-residual kernels. Indeed, locality has recently been identified as a main reason for the success of convolutional networks (Shalev-Shwartz et al., 2020; Favero et al., 2021). Moreover, we show that with the standard choice of constant balancing parameter used in practical residual networks, ResCGPK attains a better condition number than the standard CGPK, allowing it to train significantly more efficiently. This result is motivated by the work of Lee et al. Note that the normalized kernel satisfies $\bar{K}(x,x) = 1$ for all $x$, and $\bar{K} \in [-1, 1]$.

3.1. CONVOLUTIONAL RESNET

We consider a residual, convolutional neural network with $L$ hidden layers (often just called a ResNet). Let $x \in \mathbb{R}^{C_0 \times d}$ and let $q$ be the filter size. We define the hidden layers of the network as:

$$f^{(0)}_i(x) = \frac{1}{\sqrt{C_0}} \sum_{j=1}^{C_0} V^{(0)}_{1,j,i} * x_j, \qquad g^{(1)}_i(x) = \frac{1}{\sqrt{C_0}} \sum_{j=1}^{C_0} W^{(1)}_{1,j,i} * x_j \qquad (1)$$

$$f^{(l)}_i(x) = f^{(l-1)}_i(x) + \alpha \sqrt{\frac{c_v}{q C_l}} \sum_{j=1}^{C_l} V^{(l)}_{:,j,i} * \sigma\big(g^{(l)}_j(x)\big), \qquad l = 1, \ldots, L, \quad i = 1, \ldots, C_l \qquad (2)$$

$$g^{(l)}_i(x) = \sqrt{\frac{c_w}{q C_{l-1}}} \sum_{j=1}^{C_{l-1}} W^{(l)}_{:,j,i} * f^{(l-1)}_j(x), \qquad l = 2, \ldots, L, \quad i = 1, \ldots, C_l,$$

where $C_l \in \mathbb{N}$ is the number of channels in the $l$'th layer; $\sigma$ is a nonlinear activation function, which in our analysis below is the ReLU function; $W^{(l)} \in \mathbb{R}^{q \times C_{l-1} \times C_l}$, $V^{(l)} \in \mathbb{R}^{q \times C_l \times C_l}$, $W^{(1)}, V^{(0)} \in \mathbb{R}^{1 \times C_0 \times C_1}$ are the network parameters, where $W^{(1)}, V^{(0)}$ are convolution filters of size 1, and $V^{(0)}$ is fixed throughout training; $c_v, c_w \in \mathbb{R}$ are normalizing factors, commonly set to $c_v = 1/\mathbb{E}_{u \sim \mathcal{N}(0,1)}[\sigma(u)^2]$ (for ReLU, $c_v = 2$) and $c_w = 1$; $\alpha$ is a balancing factor, typically set in applications to $\alpha = 1$; however, previous analyses of non-convolutional kernels also considered $\alpha = L^{-\gamma}$ with $0 \le \gamma \le 1$. We will occasionally omit explicit reference to $c_v$ and $c_w$ and assume in such cases that $c_v = 2$ and $c_w = 1$. As in Geifman et al. (2022), we consider three options for the final layer of the network:

$$f_{Eq}(x;\theta) := \frac{1}{\sqrt{C_L}} W^{Eq} f^{(L)}_{:,1}(x), \qquad f_{Tr}(x;\theta) := \frac{1}{\sqrt{d}\sqrt{C_L}} \left\langle W^{Tr}, f^{(L)}(x) \right\rangle, \qquad f_{GAP}(x;\theta) := \frac{1}{d\sqrt{C_L}} W^{GAP} f^{(L)}(x) \vec{1},$$

where $W^{Eq}, W^{GAP} \in \mathbb{R}^{1 \times C_L}$, $W^{Tr} \in \mathbb{R}^{C_L \times d}$ and $\vec{1} = (1, \ldots, 1)^T \in \mathbb{R}^d$. $f_{Eq}$ is fully convolutional; therefore, applying it to all shifted versions of the input results in a network that is shift-equivariant. $f_{Tr}$ implements a linear last layer, and $f_{GAP}$ implements a global average pooling (GAP) layer, resulting in a shift-invariant network.
The three heads allow us to analyze kernels corresponding to (1) shift-equivariant networks (e.g., image segmentation networks), (2) a convolutional network followed by a fully connected head, akin to AlexNet (Krizhevsky et al., 2017) (but with additional skip connections), and (3) a convolutional network followed by global average pooling, akin to ResNet (He et al., 2016). Note that $f^{(l)}(x) \in \mathbb{R}^{C_l \times d}$ and $g^{(l)}(x) \in \mathbb{R}^{C_l \times d}$. $\theta$ denotes all the network parameters, which we initialize from a standard Gaussian distribution as in (Jacot et al., 2018).
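To make the architecture of Sec. 3.1 concrete, the following is a minimal numpy sketch of a forward pass (not the authors' code). It assumes circular (wrap-around) convolution over the $d$ pixels, a constant width $C$ across layers, ReLU activation, and the default $c_v = 2$, $c_w = 1$; the function and variable names are ours.

```python
import numpy as np

def conv1d_circular(filt, signal):
    """Cross-correlate a length-q filter with a length-d signal, cyclic padding."""
    q, d = len(filt), len(signal)
    r = (q - 1) // 2
    out = np.zeros(d)
    for j in range(d):
        for k in range(-r, r + 1):
            out[j] += filt[k + r] * signal[(j + k) % d]
    return out

def resnet_forward(x, V0, W1, Vs, Ws, alpha=1.0, c_v=2.0, c_w=1.0):
    """x: (C0, d) input. V0, W1: (C0, C) size-1 filters. Vs: list of (q, C, C)
    filters V^{(1..L)}; Ws: list of (q, C, C) filters W^{(2..L)}.
    Returns f^{(L)} of shape (C, d)."""
    C0, d = x.shape
    f = (V0.T @ x) / np.sqrt(C0)              # f^{(0)}, Eq. (1)
    g = (W1.T @ x) / np.sqrt(C0)              # g^{(1)}, Eq. (1)
    L = len(Vs)
    for l in range(L):
        V = Vs[l]
        q, C, _ = V.shape
        s = np.maximum(g, 0.0)                # ReLU
        res = np.zeros_like(f)
        for i in range(C):
            for j in range(C):
                res[i] += conv1d_circular(V[:, j, i], s[j])
        f = f + alpha * np.sqrt(c_v / (q * C)) * res      # residual step, Eq. (2)
        if l < L - 1:                         # pre-activation g^{(l+2)} from f^{(l+1)}
            W = Ws[l]
            g = np.zeros_like(f)
            for i in range(C):
                for j in range(C):
                    g[i] += conv1d_circular(W[:, j, i], f[j])
            g = np.sqrt(c_w / (q * C)) * g
    return f
```

Setting `alpha=0.0` disables the residual branches and returns $f^{(0)}$ unchanged, which is a quick way to see the role of the balancing factor in Eq. (2).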

3.2. MULTI-DOT PRODUCT KERNELS

Following (Geifman et al., 2022), we call a kernel $K : MS(C_0,d) \times MS(C_0,d) \to \mathbb{R}$ multi-dot product if $K(x,z) = K(t)$, where $t = \left( (x^T z)_{1,1}, (x^T z)_{2,2}, \ldots, (x^T z)_{d,d} \right) \in [-1,1]^d$ (note the overload of notation, which should be clear from context). Under our uniform distribution assumption on the multi-sphere, multi-dot product kernels can be decomposed as $K(x,z) = \sum_{k,j} \lambda_k Y_{k,j}(x) Y_{k,j}(z)$, where $k, j \in \mathbb{N}^d$. The eigenfunctions $Y_{k,j}(x)$ of $K$ are products of spherical harmonics in $S^{C_0-1}$, $Y_{k,j}(x) = \prod_{i=1}^d Y_{k_i,j_i}(x_i)$, with $k_i \ge 0$ and $j_i \in [N(C_0,k_i)]$, where $N(C_0,k_i)$ denotes the number of harmonics of frequency $k_i$ in $S^{C_0-1}$. For $C_0 = 2$ these are products of Fourier series in a $d$-dimensional torus. Note that the eigenvalues $\lambda_k$ are non-negative and do not depend on $j$. Using Mercer's representation of RKHSs (Kanagawa et al., 2018), the RKHS $\mathcal{H}_K$ of $K$ is

$$\mathcal{H}_K := \left\{ f = \sum_{k,j} \alpha_{k,j} Y_{k,j} \;\Big|\; \|f\|^2_{\mathcal{H}_K} = \sum_{k,j} \frac{\alpha^2_{k,j}}{\lambda_k} < \infty \right\}.$$

For multi-dot product kernels the normalized kernel simplifies to $\bar{K}(t) = K(t)/K(\vec{1})$, where $\vec{1} = (1, \ldots, 1)^T \in \mathbb{R}^d$. $K$ and $\bar{K}$ thus differ by a constant, and so they share the same eigenfunctions, and their eigenvalues differ by a multiplicative constant.
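The multi-sphere setup above can be sketched in a few lines of numpy (our notation, not the authors' code): each pixel is normalized to the unit sphere, and a multi-dot-product kernel sees the inputs only through the vector $t$ of pixel-wise inner products. The kernel `toy_kernel` below is a hypothetical example used only to illustrate the normalization $\bar{K}(t) = K(t)/K(\vec{1})$.

```python
import numpy as np

def sample_multisphere(C0, d, rng):
    """Draw x in MS(C0, d): each of the d pixel columns lies on S^{C0-1}."""
    x = rng.standard_normal((C0, d))
    return x / np.linalg.norm(x, axis=0, keepdims=True)

def t_vector(x, z):
    """t = ((x^T z)_{1,1}, ..., (x^T z)_{d,d}), the diagonal of x^T z."""
    return np.sum(x * z, axis=0)

def toy_kernel(t):
    """A hypothetical multi-dot-product kernel (illustration only)."""
    return np.prod(1.0 + t)

def normalized(t):
    """Normalized kernel: divide by the value at t = (1, ..., 1)."""
    return toy_kernel(t) / toy_kernel(np.ones_like(t))
```

Since each pixel has unit norm, every $t_i$ lies in $[-1,1]$, and $t = \vec{1}$ exactly when $z = x$, so the normalized kernel evaluates to 1 on the diagonal.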

4. KERNEL DERIVATIONS

We next provide explicit formulas for ResCGPK and ResCNTK.

4.1. RESCGPK

Given a network $f(x;\theta)$, the corresponding Gaussian process kernel is defined as $\mathbb{E}_\theta[f(x;\theta) f(z;\theta)]$. Below we consider the network in Sec. 3.1, which can have any one of three heads: the equivariant head, trace, or GAP. We denote the corresponding ResCGPK by $K^{(L)}_{Eq}$, $K^{(L)}_{Tr}$ and $K^{(L)}_{GAP}$, where $L$ denotes the number of layers. We proceed with the following definition.

Definition 4.1. Let $x, z \in \mathbb{R}^{C_0 \times d}$ and let $f$ be a residual network with $L$ layers. For every $1 \le l \le L$ (with $i$ an arbitrary choice of channel), denote

$$\Sigma^{(l)}_{j,j'}(x,z) := \mathbb{E}_\theta\left[ g^{(l)}_{ij}(x)\, g^{(l)}_{ij'}(z) \right], \qquad \Lambda^{(l)}_{j,j'}(x,z) = \begin{pmatrix} \Sigma^{(l)}_{j,j}(x,x) & \Sigma^{(l)}_{j,j'}(x,z) \\ \Sigma^{(l)}_{j',j}(z,x) & \Sigma^{(l)}_{j',j'}(z,z) \end{pmatrix} \qquad (5)$$

$$K^{(l)}_{j,j'}(x,z) = c_v c_w\, \mathbb{E}_{(u,v)\sim\mathcal{N}\left(0,\, \Lambda^{(l)}_{j,j'}(x,z)\right)}[\sigma(u)\sigma(v)] \qquad (6)$$

$$\dot{K}^{(l)}_{j,j'}(x,z) = c_v c_w\, \mathbb{E}_{(u,v)\sim\mathcal{N}\left(0,\, \Lambda^{(l)}_{j,j'}(x,z)\right)}[\dot\sigma(u)\dot\sigma(v)], \qquad (7)$$

where $\dot\sigma$ is the derivative of the ReLU function, expressed by the indicator $1_{x \ge 0}$. Our first contribution is to give an exact formula for the ResCGPK. We refer the reader to the appendix for the precise derivation and note here some of the key ideas. We give precise formulas for $\Sigma$, $K$ and $\dot K$ and prove that

$$K^{(L)}_{Eq}(x,z) = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2}{q c_w} \sum_{l=1}^L \mathrm{tr}\left( K^{(l)}_{D_{1,1}}(x,z) \right).$$

This gives us the equivariant kernel, and by showing that $K^{(L)}_{Tr}(x,z) = \frac{1}{d}\sum_{j=1}^d K^{(L)}_{Eq}(s_j x, s_j z)$ and $K^{(L)}_{GAP}(x,z) = \frac{1}{d^2}\sum_{j,j'=1}^d K^{(L)}_{Eq}(s_j x, s_{j'} z)$, we obtain precise formulas for the trace and GAP kernels. For clarity, we give here the case of the normalized ResCGPK with multi-sphere inputs, which we prove simplifies significantly. The full derivation for arbitrary inputs is given in Appendix A.2.

Theorem 4.1. [Multi-Sphere Case] For any $x, z \in MS(C_0, d)$, let $t = \left( (x^T z)_{1,1}, \ldots, (x^T z)_{d,d} \right) \in [-1,1]^d$. Fix $c_v = 2$, $c_w = 1$ and let $\kappa_1(u) = \frac{\sqrt{1-u^2} + (\pi - \cos^{-1}(u))u}{\pi}$.
Then,

$$\bar{K}^{(1)}_{Eq}(t) = \frac{1}{1+\alpha^2}\left( t_1 + \frac{\alpha^2}{q} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \kappa_1(t_{1+k}) \right)$$

$$\bar{K}^{(L)}_{Eq}(t) = \frac{1}{1+\alpha^2}\left( \bar{K}^{(L-1)}_{Eq}(t) + \frac{\alpha^2}{q^2} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{k'=-\frac{q-1}{2}}^{\frac{q-1}{2}} \kappa_1\left( \bar{K}^{(L-1)}_{Eq}\left(s_{k+k'}\, t\right) \right) \right).$$
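The Theorem 4.1 recursion can be evaluated numerically; a naive implementation branches over all shift compositions and is exponential in $L$, but since the recursion only ever touches the $d$ cyclic shifts of $t$, it runs in $O(L d q^2)$ when computed over all shifts at once. The sketch below (our code, under the assumption that $s_m t$ denotes the cyclic shift $(s_m t)_j = t_{j+m}$) is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)                 # guard against float round-off
    return (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi

def rescgpk_eq(t, L, q=3, alpha=1.0):
    """Normalized equivariant ResCGPK of Theorem 4.1, evaluated at t in [-1,1]^d."""
    t = np.asarray(t, dtype=float)
    d, r, a2 = len(t), (q - 1) // 2, alpha**2
    # depth-1 kernel on every cyclic shift s_m t (so (s_m t)_1 = t_{1+m})
    v = np.array([(t[m] + a2 / q * sum(kappa1(t[(m + k) % d])
                                       for k in range(-r, r + 1))) / (1 + a2)
                  for m in range(d)])
    for _ in range(2, L + 1):                 # apply the recursion layer by layer
        v = np.array([(v[m] + a2 / q**2 * sum(kappa1(v[(m + k + kp) % d])
                                              for k in range(-r, r + 1)
                                              for kp in range(-r, r + 1))) / (1 + a2)
                      for m in range(d)])
    return v[0]                               # \bar{K}^{(L)}_{Eq}(t)
```

A quick sanity check is that $\kappa_1(1) = 1$ forces $\bar{K}^{(L)}_{Eq}(\vec{1}) = 1$ at every depth, as a normalized kernel must satisfy.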

4.2. RESCNTK

For $x, z \in \mathbb{R}^{C_0 \times d}$ and $f(x;\theta)$ an $L$-layer ResNet, ResCNTK is defined as $\mathbb{E}_\theta\left\langle \frac{\partial f(x;\theta)}{\partial\theta}, \frac{\partial f(z;\theta)}{\partial\theta} \right\rangle$. Considering the three heads in Sec. 3.1, we denote the corresponding kernels by $\Theta^{(L)}_{Eq}$, $\Theta^{(L)}_{Tr}$ and $\Theta^{(L)}_{GAP}$, depending on the choice of last layer. Our second contribution is providing a formula for the ResCNTK for arbitrary inputs.

Theorem 4.2. Let $x, z \in \mathbb{R}^{C_0 \times d}$ and let $f$ be a residual network with $L$ layers. Then, the ResCNTK for $f$ has the form

$$\Theta^{(L)}_{Eq}(x,z) = K^{(L)}_{Eq}(x,z) + \frac{\alpha^2}{q} \sum_{l=1}^{L} \sum_{p=1}^{d} P^{(l)}_p(x,z)\, \mathrm{tr}\left[ \dot{K}^{(l)}_{D_{p,p}}(x,z) \odot \Sigma^{(l)}_{D_{p,p}}(x,z) + K^{(l)}_{D_{p,p}}(x,z) \right],$$

where for all $1 \le j \le d$, $P^{(L)}_j(x,z) = \frac{1_{j=1}}{c_w}$, and for $1 \le l \le L-1$,

$$P^{(l)}_j(x,z) = P^{(l+1)}_j(x,z) + \frac{\alpha^2}{q^2} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \dot{K}^{(l+1)}_{j+k,j+k}(x,z) \sum_{k'=-\frac{q-1}{2}}^{\frac{q-1}{2}} P^{(l+1)}_{j+k+k'}(x,z).$$

The Tr and GAP kernels are given by $\Theta^{(L)}_{Tr}(x,z) = \frac{1}{d}\sum_{j=1}^d \Theta^{(L)}_{Eq}(s_j x, s_j z)$ and $\Theta^{(L)}_{GAP}(x,z) = \frac{1}{d^2}\sum_{j,j'=1}^d \Theta^{(L)}_{Eq}(s_j x, s_{j'} z)$.

5. SPECTRAL DECOMPOSITION

$K^{(L)}_{Eq}$ and $\Theta^{(L)}_{Eq}$ are multi-dot product kernels, and therefore their eigenfunctions consist of products of spherical harmonics. Next, we derive bounds on their eigenvalues. We subsequently use a result due to (Geifman et al., 2022) to extend these to their trace and GAP versions.

5.1. ASYMPTOTIC BOUNDS

The next theorem provides upper and lower bounds on the eigenvalues of ResCGPK and ResCNTK.

Theorem 5.1. The eigenvalues $\lambda_k$ satisfy

$$c_1 \prod_{i \in R,\, k_i > 0} k_i^{-(C_0 + 2\nu_a - 3)} \;\le\; \lambda_k \;\le\; c_2 \prod_{i \in R,\, k_i > 0} k_i^{-(C_0 + 2\nu_b - 3)},$$

where $\nu_a = 2.5$ and $\nu_b$ is $1 + \frac{3}{2d}$ for ResCGPK and $1 + \frac{1}{2d}$ for ResCNTK; $c_1, c_2 > 0$ are constants that depend on $L$. The set $R$ denotes the receptive field, defined as the set of indices of input pixels that affect the kernel output.

We note that these bounds are identical, up to constants, to those obtained for convolutional networks without skip connections (Geifman et al., 2022), although the proof for the ResNet case is more complex. Overall, the theorem shows that over-parameterized ResNets are biased toward low-frequency functions. In particular, with input distributed uniformly on the multi-sphere, the time required to train such a network to fit an eigenfunction with gradient descent (GD) is inversely proportional to the respective eigenvalue (Basri et al., 2020). Consequently, training a network to fit a high-frequency function is polynomially slower than training it to fit a low-frequency function. Note, however, that the rate of decay of the eigenvalues depends on the number of pixels over which the target function has high frequencies. Training a target function whose high frequencies are concentrated in a few pixels is much faster than if the same frequencies are spread over many pixels. This can be seen in Figure 1, which shows, for a target function of frequency $k$ in $m$ pixels, that the exponent (depicted by the slope of the lines) grows with $m$. The same behaviour is seen when the skip connections are removed. This differs from fully connected architectures, in which the decay rate of the eigenvalues depends on the dimension of the input and is invariant to the pixel spread of frequencies; see (Geifman et al., 2022) for a comparison of the eigenvalue decay for standard CNNs and FC architectures.

5.2. LOCALITY BIAS AND ENSEMBLE BEHAVIOR

To better understand the difference between ResNets and vanilla CNNs, we next turn to a fine-grained analysis of the decay. Consider an $l$-layer CNN that is identical to our ResNet but with the skip connections removed. Let $p^{(l)}_i$ be the number of paths from input pixel $i$ to the output in the corresponding CGPK, or equivalently, the number of paths in the same CNN in which there is only one channel in each node.

Theorem 5.2. For both $\Theta^{(L)}_{Eq}$ and $K^{(L)}_{Eq}$ there exist scalars $A > 1$ and $c_l > 0$ s.t., letting $c_{k,l} = c_l \prod_{i=1}^d A^{\min(p^{(l)}_i,\, k_i)}$ for every $1 \le l \le L$ and $c_k = \sum_{l=1}^L c_{k,l}$, it holds that

$$\lambda_k \ge c_k \prod_{i=1,\, k_i > 0}^{d} k_i^{-C_0-2}.$$

The constant $c_k$ differs significantly from that of CNTK (without skip connections), which takes the form $c_{k,L}$ (Geifman et al., 2022). In particular, notice that the constants in the ResCNTK are (up to scale factors) the sum of the constants of the CNTK at depths $1, \ldots, L$. Thus, a major contribution of the paper is providing theoretical justification for the following result, observed empirically in (Veit et al., 2016): over-parameterized ResNets act like a weighted ensemble of CNNs of various depths. In particular, information from smaller receptive fields is propagated through the skip connections, resulting in larger eigenvalues for frequencies that correspond to smaller receptive fields.

[Figure 2 caption fragment: ... with no skip connections. We followed (Luo et al., 2016) in computing the ERF, where the networks are first trained on CIFAR-10. All values are re-scaled to the [0,1] interval. We used L = 8 in all cases.]

Figure 1 shows the eigenvalues computed numerically for various frequencies, for both the CGPK and ResCGPK. Consistent with our results, eigenfunctions with high frequencies concentrated in a few pixels, e.g., $k = (k,0,0,0)$, have larger eigenvalues than those with frequencies spread over more pixels, e.g., $k = (k,k,k,k)$. See Appendix G for implementation details.
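The path counts $p^{(l)}_i$ in Theorem 5.2 can be computed directly. Under the interpretation above (a one-channel, depth-$l$ CNN with filter size $q$, stride 1, counting relative offsets and ignoring boundary effects), each layer mixes $q$ neighboring offsets, so the counts are the $l$-fold convolution of the all-ones filter. This is a sketch of that computation, not the authors' code:

```python
import numpy as np

def path_counts(l, q):
    """Number of paths from each input offset to the output of an l-layer,
    single-channel CNN with filter size q; result has length l*(q-1)+1."""
    p = np.array([1.0])
    for _ in range(l):
        p = np.convolve(p, np.ones(q))   # each layer fans out over q offsets
    return p
```

For example, `path_counts(2, 3)` gives `[1, 2, 3, 2, 1]`: the total number of paths is $q^l = 9$, and the count peaks at the center, which matches the locality bias discussed above, namely that pixels near the center of the receptive field contribute through more paths.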
Figure 2 shows the effective receptive field (ERF) induced by ResCNTK compared to that of a ResNet, and to the kernel and network with the skip connections removed. The ERF is defined as $\partial f_{Eq}(x;\theta)/\partial x$ for ResNet (Luo et al., 2016) and $\partial \Theta_{Eq}(x,x)/\partial x$ for ResCNTK. A similar calculation is applied to CNN and CNTK. We see that residual networks and their kernels give rise to an increased locality bias: more weight at the center of the receptive field (for the equivariant architecture) or to nearby pixels (for the trace and GAP architectures).

5.3. EXTENSION TO f_Tr AND f_GAP

Using (Geifman et al., 2022)[Thm. 3.7], we can extend our analysis of equivariant kernels to trace and GAP kernels. In particular, for ResCNTK, the eigenfunctions of the trace kernel are products of spherical harmonics. In addition, let $\lambda_k$ denote the eigenvalues of $\Theta^{(L)}_{Eq}$; then the eigenvalues of $\Theta^{(L)}_{Tr}$ are $\lambda^{Tr}_k = \frac{1}{d}\sum_{i=0}^{d-1} \lambda_{s_i k}$, i.e., the average over all shifts of the frequency vector $k$. This implies that for the trace kernel, the eigenvalues (but not the eigenfunctions) are invariant to shift. For the GAP kernel, the eigenfunctions are $\frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} Y_{s_i k, s_i j}$, i.e., scaled sums of shifted spherical harmonic products. These eigenfunctions are shift invariant and generally span all shift invariant functions. The eigenvalues of the GAP kernel are identical to those of the trace kernel. The eigenfunctions and eigenvalues of the trace and GAP ResCGPK are determined in a similar way. Finally, we note that the eigenvalues for the trace and GAP kernels decay at the same rate as their equivariant counterparts, and likewise they are biased in frequency and in locality. Moreover, while the equivariant kernel is biased to prefer functions that depend on the center of the receptive field (position biased), the trace and GAP kernels are biased to prefer functions that depend on nearby pixels.

6.1. DECAYING BALANCING PARAMETER α

We next investigate the effects of skip connections in very deep networks. Here the setting of the balancing parameter $\alpha$ between the skip and residual connections in (2) plays a critical role. Previous work on residual, non-convolutional kernels (Huang et al., 2020; Belfer et al., 2021) proposed to use a balancing parameter of the form $\alpha = L^{-\gamma}$ for $\gamma \in (0.5, 1]$, arguing that a decaying $\alpha$ contributes to the stability of the kernel for very deep architectures. However, below we prove that in this setting, as the depth $L$ tends to infinity, ResCNTK converges to a simple dot-product kernel, corresponding to a one-layer, linear neural network, which may be considered degenerate. We subsequently elaborate on the connection between this result and previous work and provide a more comprehensive discussion in Appendix F.

Theorem 6.1. Suppose $\alpha = L^{-\gamma}$ with $\gamma \in (0.5, 1]$. Then, for any $t \in [-1,1]^d$ it holds that

$$\bar\Theta^{(L)}_{Eq}(t) \xrightarrow[L\to\infty]{} \bar\Sigma^{(1)}_{1,1}(t) = t_1 \qquad \text{and likewise} \qquad \bar K^{(L)}_{Eq}(t) \xrightarrow[L\to\infty]{} \bar\Sigma^{(1)}_{1,1}(t) = t_1.$$

Clearly, this limit kernel, which corresponds to a linear network with no hierarchical features, is undesired. A comparison to the previous work of Huang et al. (2020); Belfer et al. (2021), which addressed residual kernels for fully connected architectures, is due here. This previous work proved that FC-ResNTK converges, as $L$ tends to infinity, to a two-layer FC-NTK. They, however, made the additional assumption that the top-most layer is not trained. This assumption turns out to be critical to their result: training the last layer yields a result analogous to ours, namely, that as $L$ tends to infinity FC-ResNTK converges to a simple dot product. Similarly, if we consider ResCNTK in which we do not train the last layer, we get that the limit kernel is the CNTK corresponding to a two-layer convolutional neural network.
However, while a two-layer FC-NTK is universal, the set of functions produced by a two-layer CNTK is very limited; therefore, this limit kernel is also not desired. We conclude that the standard setting of α = 1 is preferable for convolutional architectures.

6.2. THE CONDITION NUMBER OF THE RESCGPK MATRIX WITH α = 1

Next we investigate the properties of ResCGPK when the balancing factor is set to $\alpha = 1$. For ResCGPK and CGPK and any training distribution, we use double-constant matrices (O'Neill, 2021) to bound the condition numbers of their kernel matrices. We further show that at any depth, the lower bound for ResCGPK matrices is lower than that for CGPK matrices (and show empirically that these bounds are close to the actual condition numbers). Lee et al. (2019); Xiao et al. (2020); Chen et al. (2021) argued that a smaller condition number of the NTK matrix implies that training the corresponding neural network with GD converges faster. Our analysis therefore indicates that GD with ResCGPK should generally be faster than GD with CGPK. This phenomenon may partly explain the advantage of residual networks over standard convolutional architectures.

Recall that the condition number of a matrix $A \succeq 0$ is defined as $\rho(A) := \lambda_{max}/\lambda_{min}$. Consider an $n \times n$ double-constant matrix $B_{b,\tilde b}$ that includes $b$ in the diagonal entries and $\tilde b$ in each off-diagonal entry. The eigenvalues of $B_{b,\tilde b}$ are $\lambda_1 = b - \tilde b + n\tilde b$ and $\lambda_2 = \ldots = \lambda_n = b - \tilde b$. Suppose $b = 1$ and $0 < \tilde b \le 1$; then $B_{1,\tilde b}$ is positive semi-definite and its condition number is $\rho(B_{1,\tilde b}) = 1 + \frac{n\tilde b}{1-\tilde b}$. This condition number diverges when either $\tilde b = 1$ or $n$ tends to infinity. The following lemma relates the condition numbers of kernel matrices to those of double-constant matrices.

Lemma 6.1. Let $A \in \mathbb{R}^{n \times n}$ ($n \ge 2$) be a normalized kernel matrix with $\sum_{i \neq j} A_{ij} \ge 0$. Let $B(A) = B_{1,\tilde b}$ with $\tilde b = \frac{1}{n(n-1)} \sum_{i \neq j} A_{ij}$ and $\epsilon = \sup_i \sum_{j \neq i} |A_{ij} - B(A)_{ij}|$. Then,

1. $\rho(B(A)) \le \rho(A)$.
2. If $\epsilon < \lambda_{min}(B(A))$ then $\rho(A) \le \frac{\lambda_{max}(B(A)) + \epsilon}{\lambda_{min}(B(A)) - \epsilon}$,

where $\lambda_{max}$ and $\lambda_{min}$ denote the maximal and minimal eigenvalues of $B(A)$.

The following theorem uses double-constant matrices to compare kernel matrices produced by ResCGPK with those produced by CGPK with no skip connections.

Theorem 6.2. Let $\bar K^{(L)}_{ResCGPK}$ and $\bar K^{(L)}_{CGPK}$ respectively denote kernel matrices for the normalized trace kernels ResCGPK and CGPK of depth $L$. Let $B(K)$ be the double-constant matrix defined for a matrix $K$ as in Lemma 6.1. Then,

1. $\left\| \bar K^{(L)}_{ResCGPK} - B\left(\bar K^{(L)}_{ResCGPK}\right) \right\|_1 \xrightarrow[L\to\infty]{} 0$ and $\left\| \bar K^{(L)}_{CGPK} - B\left(\bar K^{(L)}_{CGPK}\right) \right\|_1 \xrightarrow[L\to\infty]{} 0$.
2. $\rho\left(B\left(\bar K^{(L)}_{ResCGPK}\right)\right) \xrightarrow[L\to\infty]{} \infty$ and $\rho\left(B\left(\bar K^{(L)}_{CGPK}\right)\right) \xrightarrow[L\to\infty]{} \infty$.
3. $\exists L_0 \in \mathbb{N}$ s.t. $\forall L \ge L_0$, $\rho\left(B\left(\bar K^{(L)}_{ResCGPK}\right)\right) < \rho\left(B\left(\bar K^{(L)}_{CGPK}\right)\right)$.

The theorem establishes that, while the condition numbers of both $B\left(\bar K^{(L)}_{ResCGPK}\right)$ and $B\left(\bar K^{(L)}_{CGPK}\right)$ diverge as $L \to \infty$, the condition number of $B\left(\bar K^{(L)}_{ResCGPK}\right)$ is smaller than that of $B\left(\bar K^{(L)}_{CGPK}\right)$ for all $L \ge L_0$. ($L_0$ is the minimal $L$ s.t. the entries of the double-constant matrices are non-negative; we notice in practice that $L_0 \approx 2$.) We can therefore use Lemma 6.1 to derive approximate bounds for the condition numbers obtained with ResCGPK and CGPK.
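The Lemma 6.1 bounds are easy to compute in practice. The sketch below (our code, not the authors') builds the double-constant surrogate $B(A)$ of a normalized kernel matrix $A$ and returns the lower bound $\rho(B(A))$ and, when $\epsilon < \lambda_{min}(B(A))$, the upper bound on $\rho(A)$:

```python
import numpy as np

def double_constant_bounds(A):
    """Lower/upper bounds of Lemma 6.1 on the condition number of a
    normalized kernel matrix A (unit diagonal, nonnegative off-diagonal sum)."""
    n = A.shape[0]
    b = (A.sum() - np.trace(A)) / (n * (n - 1))   # mean off-diagonal entry
    B = np.full((n, n), b)
    np.fill_diagonal(B, 1.0)
    lam_max, lam_min = 1.0 - b + n * b, 1.0 - b   # eigenvalues of B(A)
    D = np.abs(A - B)
    np.fill_diagonal(D, 0.0)
    eps = D.sum(axis=1).max()                     # sup_i sum_{j != i} |A_ij - B_ij|
    lower = lam_max / lam_min                     # rho(B(A)) <= rho(A)
    upper = (lam_max + eps) / (lam_min - eps) if eps < lam_min else np.inf
    return lower, upper
```

When $A$ is itself double-constant, $\epsilon = 0$ and the two bounds coincide with the exact condition number; for a generic normalized Gram matrix with non-negative entries, the lower bound sits below the true $\rho(A)$, as the lemma states.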
Figure 3 indeed shows that the condition number of the CGPK matrix diverges faster than that of ResCGPK and is significantly larger at any finite depth $L$. The approximate bounds, particularly the lower bounds, closely match the actual condition numbers produced by the kernels. (We note that with training data sampled from a uniform distribution on the multi-sphere, the upper bound can be somewhat improved. In this case, the constant vector is the eigenvector of maximal eigenvalue for both $A$ and $B(A)$, and thus the rows of $A$ sum to the same value, yielding $\rho(A) \le \frac{\lambda_{max}(B(A))}{\lambda_{min}(B(A)) - \epsilon}$ with $\epsilon = \frac{1}{n}\|A - B(A)\|_1$. We used this upper bound in our plot in Figure 3.) To the best of our knowledge, this is the first paper that establishes a relationship between skip connections and the condition number of the kernel matrix.

7. CONCLUSION

We derived formulas for the Gaussian process and neural tangent kernels associated with convolutional residual networks, analyzed their spectra, and provided bounds on their implied condition numbers. Our results indicate that over-parameterized residual networks are subject to both frequency and locality bias, and that they can be trained faster than standard convolutional networks. In future work, we hope to gain further insight by tightening our bounds. We further intend to apply our analysis of the condition number of kernel matrices to characterize the speed of training in various other architectures.

APPENDIX

Below we provide derivations and proofs for our paper.

A DERIVATION OF RESCGPK

In this section, we derive explicit formulas for ResCGPK. We begin with a few preliminaries. As in (Jacot et al., 2018), we assume the network parameters $\theta$ are initialized with a standard Gaussian distribution, $\theta \sim \mathcal{N}(0, I)$. Therefore, at initialization, for every pair of parameters $\theta_i, \theta_j$,

$$\mathbb{E}[\theta_i \theta_j] = \delta_{ij}. \qquad (8)$$

We note that Lee et al. (2019) proved the convergence of a network with this initialization to its NTK. For a vector $v$, we use the notation $v_*$ to denote an entry of $v$ with arbitrary index.

A.1 A CLOSED FORMULA FOR K AND K̇

For $u \in [-1,1]$, let $\kappa_0(u) = \frac{\pi - \cos^{-1}(u)}{\pi}$ and $\kappa_1(u) = \frac{\sqrt{1-u^2} + (\pi - \cos^{-1}(u))u}{\pi}$ be the arc-cosine kernels defined in Cho & Saul (2009). Daniely et al. (2016) showed that

$$K^{(l)}(x,z) = \frac{c_v c_w}{2} \sqrt{\Sigma^{(l)}(x,x)\, \Sigma^{(l)}(z,z)}\; \kappa_1\left( \bar\Sigma^{(l)}(x,z) \right) \quad \text{and} \quad \dot K^{(l)}(x,z) = \frac{c_v c_w}{2}\, \kappa_0\left( \bar\Sigma^{(l)}(x,z) \right), \qquad (9)$$

where $K^{(l)}$ and $\dot K^{(l)}$ are defined in (6) and (7), and $c_v, c_w$ are defined in Sec. 3.1.
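The arc-cosine functions $\kappa_0, \kappa_1$ admit a direct Monte-Carlo sanity check. The sketch below (our code, not from the paper) verifies the Cho & Saul identities that, for $(u,v)$ jointly Gaussian with unit variances and correlation $s$, $2\,\mathbb{E}[\sigma(u)\sigma(v)] = \kappa_1(s)$ and $2\,\mathbb{E}[\dot\sigma(u)\dot\sigma(v)] = \kappa_0(s)$ with $\sigma$ the ReLU:

```python
import numpy as np

def kappa0(u):
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    return (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi

def mc_estimates(s, n=400_000, seed=0):
    """Monte-Carlo estimates of 2 E[relu(u) relu(v)] and 2 E[1_{u>=0} 1_{v>=0}]
    for (u, v) standard Gaussian with correlation s."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    v = s * u + np.sqrt(1.0 - s**2) * rng.standard_normal(n)
    est1 = 2.0 * np.mean(np.maximum(u, 0.0) * np.maximum(v, 0.0))
    est0 = 2.0 * np.mean((u >= 0) & (v >= 0))
    return est1, est0
```

Note the endpoint values used repeatedly in the proofs: $\kappa_0(1) = \kappa_1(1) = 1$ and $\kappa_0(-1) = \kappa_1(-1) = 0$.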

A.2 RESCGPK DERIVATION

Theorem A.1. For an $L$-layer neural network $f$ and $x, z \in \mathbb{R}^{C_0 \times d}$,

$$\Sigma^{(1)}_{j,j'}(x,z) = \frac{1}{C_0} (x^T z)_{j,j'}, \qquad \Sigma^{(2)}_{j,j'}(x,z) = \frac{c_w}{q} \mathrm{tr}\left( \Sigma^{(1)}_{D_{j,j'}}(x,z) \right) + \frac{\alpha^2}{q^2} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathrm{tr}\left( K^{(1)}_{D_{j+k,j'+k}}(x,z) \right).$$

For $3 \le l \le L$,

$$\Sigma^{(l)}_{j,j'}(x,z) = \Sigma^{(l-1)}_{j,j'}(x,z) + \frac{\alpha^2}{q^2} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathrm{tr}\left( K^{(l-1)}_{D_{j+k,j'+k}}(x,z) \right).$$

Finally, for the output layer,

$$K^{(L)}_{Eq}(x,z) = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2}{q c_w} \sum_{l=1}^L \mathrm{tr}\left( K^{(l)}_{D_{1,1}}(x,z) \right), \qquad K^{(L)}_{Tr}(x,z) = \frac{1}{d}\sum_{j=1}^d K^{(L)}_{Eq}(s_j x, s_j z), \qquad K^{(L)}_{GAP}(x,z) = \frac{1}{d^2}\sum_{j,j'=1}^d K^{(L)}_{Eq}(s_j x, s_{j'} z).$$

Proof. We begin by deriving a formula for $\Sigma^{(l)}(x,z)$. The case $l = 1$ is shown in Lemma A.1. For $2 \le l \le L$, the strategy is to express $\mathbb{E}\left[g^{(l)}_{ij}(x)\, g^{(l)}_{ij'}(z)\right]$ using $\mathbb{E}\left[f^{(l-1)}_{c,j}(x)\, f^{(l-1)}_{c,j'}(z)\right]$ and vice versa (which we can do using Lemma A.2). This way we derive an expression for $\mathbb{E}\left[f^{(l-1)}_{c,j}(x)\, f^{(l-1)}_{c,j'}(z)\right]$ in Lemma A.3 and subsequently get:

$$\mathbb{E}\left[g^{(l)}_{ij}(x)\, g^{(l)}_{ij'}(z)\right] \overset{\text{Lemma A.2}}{=} \frac{c_w}{q C_{l-1}} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{c=1}^{C_{l-1}} \mathbb{E}\left[f^{(l-1)}_{c,j+k}(x)\, f^{(l-1)}_{c,j'+k}(z)\right]$$

$$\overset{\text{Lemma A.3}}{=} \underbrace{\frac{c_w}{q C_{l-1}} \sum_{c=1}^{C_{l-1}} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathbb{E}\left[f^{(l-2)}_{c,j+k}(x)\, f^{(l-2)}_{c,j'+k}(z)\right]}_{\text{denote by } A} + \frac{\alpha^2 c_v c_w}{q^2 C_{l-1}} \sum_{c=1}^{C_{l-1}} \sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathbb{E}\left[\sigma\left(g^{(l-1)}(x)\right)_{c,j+k+k'} \sigma\left(g^{(l-1)}(z)\right)_{c,j'+k+k'}\right]$$

$$= A + \frac{\alpha^2 c_v c_w}{q^2} \sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathbb{E}\left[\sigma\left(g^{(l-1)}(x)\right)_{*,j+k+k'} \sigma\left(g^{(l-1)}(z)\right)_{*,j'+k+k'}\right] = A + \frac{\alpha^2}{q^2} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathrm{tr}\left( K^{(l-1)}_{D_{j+k,j'+k}}(x,z) \right).$$

If $l > 2$, then using Lemma A.2 we obtain $A = \mathbb{E}\left[g^{(l-1)}_{ij}(x)\, g^{(l-1)}_{ij'}(z)\right] = \Sigma^{(l-1)}_{j,j'}(x,z)$. Otherwise, if $l = 2$, then using Lemma A.1 we obtain $A = \frac{c_w}{q} \mathrm{tr}\left( \Sigma^{(1)}_{D_{j,j'}}(x,z) \right)$. We treat the output layers in Lemmas A.4 and A.5.

Lemma A.1. $\mathbb{E}\left[f^{(0)}_{ij}(x)\, f^{(0)}_{ij'}(z)\right] = \mathbb{E}\left[g^{(1)}_{ij}(x)\, g^{(1)}_{ij'}(z)\right] = \frac{1}{C_0}(x^T z)_{j,j'}$.

Proof.
For $g^{(1)}$ we have:

$$\Sigma^{(1)}_{j,j'}(x,z) = \mathbb{E}\left[g^{(1)}_{ij}(x)\, g^{(1)}_{ij'}(z)\right] = \frac{1}{C_0} \sum_{l,l'=1}^{C_0} \mathbb{E}\left[W^{(1)}_{1,l,i}\, x_{l,j}\, W^{(1)}_{1,l',i}\, z_{l',j'}\right] = \frac{1}{C_0} \sum_{l,l'=1}^{C_0} \mathbb{E}\left[W^{(1)}_{1,l,i} W^{(1)}_{1,l',i}\right] x_{l,j}\, z_{l',j'} \overset{(8)}{=} \frac{1}{C_0} \sum_{l=1}^{C_0} x_{l,j}\, z_{l,j'} = \frac{1}{C_0}(x^T z)_{j,j'}.$$

For $f^{(0)}$ the proof is analogous, simply replacing $W^{(1)}$ with $V^{(0)}$.

Lemma A.2. For all $2 \le l \le L$, $1 \le i, i' \le C_l$, $1 \le j, j' \le d$, we have:

$$\mathbb{E}\left[g^{(l)}_{ij}(x)\, g^{(l)}_{i'j'}(z)\right] = \delta_{i,i'}\, \frac{c_w}{q C_{l-1}} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{c=1}^{C_{l-1}} \mathbb{E}\left[f^{(l-1)}_{c,j+k}(x)\, f^{(l-1)}_{c,j'+k}(z)\right].$$

Proof.

$$\mathbb{E}\left[g^{(l)}_{ij}(x)\, g^{(l)}_{i'j'}(z)\right] = \frac{c_w}{q C_{l-1}}\, \mathbb{E}\left[ \left( \sum_{c=1}^{C_{l-1}} W^{(l)}_{:,c,i} * f^{(l-1)}_c(x) \right)_j \left( \sum_{c'=1}^{C_{l-1}} W^{(l)}_{:,c',i'} * f^{(l-1)}_{c'}(z) \right)_{j'} \right]$$

$$= \frac{c_w}{q C_{l-1}} \sum_{c,c'=1}^{C_{l-1}} \sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathbb{E}\left[ W^{(l)}_{k+\frac{q+1}{2},c,i}\, W^{(l)}_{k'+\frac{q+1}{2},c',i'} \right] \mathbb{E}\left[ f^{(l-1)}_{c,j+k}(x)\, f^{(l-1)}_{c',j'+k'}(z) \right] \overset{(8)}{=} \delta_{i,i'}\, \frac{c_w}{q C_{l-1}} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{c=1}^{C_{l-1}} \mathbb{E}\left[ f^{(l-1)}_{c,j+k}(x)\, f^{(l-1)}_{c,j'+k}(z) \right]. \qquad (11)$$

Lemma A.3. For all $1 \le l \le L$, $1 \le i \le C_l$, $1 \le j, j' \le d$, we have:

$$\mathbb{E}\left[f^{(l)}_{ij}(x)\, f^{(l)}_{ij'}(z)\right] = \mathbb{E}\left[f^{(l-1)}_{ij}(x)\, f^{(l-1)}_{ij'}(z)\right] + \frac{\alpha^2 c_v}{q} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \mathbb{E}\left[\sigma\left(g^{(l)}(x)\right)_{*,j+k} \sigma\left(g^{(l)}(z)\right)_{*,j'+k}\right].$$

Proof. Using the definition of $f^{(l)}_{ij}$ in (2), we get:

$$\mathbb{E}\left[f^{(l)}_{ij}(x)\, f^{(l)}_{ij'}(z)\right] = \mathbb{E}\left[f^{(l-1)}_{ij}(x)\, f^{(l-1)}_{ij'}(z)\right] + \alpha\sqrt{\frac{c_v}{q C_l}} \Bigg( \underbrace{\mathbb{E}\left[ f^{(l-1)}_{ij}(x) \left( \sum_{c=1}^{C_l} V^{(l)}_{:,c,i} * \sigma\left(g^{(l)}_c(z)\right) \right)_{j'} \right]}_{\text{denote by } B_1} + \underbrace{\mathbb{E}\left[ f^{(l-1)}_{ij'}(z) \left( \sum_{c=1}^{C_l} V^{(l)}_{:,c,i} * \sigma\left(g^{(l)}_c(x)\right) \right)_{j} \right]}_{\text{denote by } B_2} \Bigg)$$

$$+ \frac{\alpha^2 c_v}{q} \cdot \frac{1}{C_l}\, \underbrace{\mathbb{E}\left[ \left( \sum_{c=1}^{C_l} V^{(l)}_{:,c,i} * \sigma\left(g^{(l)}_c(x)\right) \right)_j \left( \sum_{c'=1}^{C_l} V^{(l)}_{:,c',i} * \sigma\left(g^{(l)}_{c'}(z)\right) \right)_{j'} \right]}_{\text{denote by } A}.$$

We treat this expression in parts. First, for $B_1$ we get:

$$B_1 = \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{c=1}^{C_l} \mathbb{E}\left[ f^{(l-1)}_{ij}(x)\, V^{(l)}_{k+\frac{q+1}{2},c,i}\, \sigma\left(g^{(l)}(z)\right)_{c,j'+k} \right] = 0,$$

where the rightmost equality follows from (8) and the fact that $V^{(l)}$ has zero mean in every entry. Analogously, we also get $B_2 = 0$.
Opening $A$ and using Equation (8) we get
\[
A = \frac{1}{C_l}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\sum_{c,c'=1}^{C_l}E\Big[V^{(l)}_{k+\frac{q+1}{2},c,i}\,\sigma\big(g^{(l)}(x)\big)_{c,j+k}\,V^{(l)}_{k'+\frac{q+1}{2},c',i}\,\sigma\big(g^{(l)}(z)\big)_{c',j'+k'}\Big] \overset{(8)}{=} \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\frac{1}{C_l}\sum_{c=1}^{C_l}E\big[\sigma\big(g^{(l)}(x)\big)_{c,j+k}\sigma\big(g^{(l)}(z)\big)_{c,j'+k}\big] = \sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}E\big[\sigma\big(g^{(l)}(x)\big)_{*,j+k}\sigma\big(g^{(l)}(z)\big)_{*,j'+k}\big].
\]
Overall, we obtain
\[
E\big[f^{(l)}_{ij}(x)f^{(l)}_{ij'}(z)\big] = E\big[f^{(l-1)}_{ij}(x)f^{(l-1)}_{ij'}(z)\big] + \frac{\alpha^2c_v}{q}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}E\big[\sigma\big(g^{(l)}(x)\big)_{*,j+k}\sigma\big(g^{(l)}(z)\big)_{*,j'+k}\big]. \qquad\square
\]

Lemma A.4. $K^{(L)}_{Eq}(x,z) = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2}{qc_w}\sum_{l=1}^{L}\mathrm{tr}\big[K^{(l)}_{D_{1,1}}(x,z)\big]$.

Proof. Observe that
\[
E_\theta\big[f_{Eq}(x;\theta)f_{Eq}(z;\theta)\big] = \frac{1}{C_L}\sum_{i,i'=1}^{C_L}E\big[W^{Eq}_if^{(L)}_{i,1}(x)\,W^{Eq}_{i'}f^{(L)}_{i',1}(z)\big] = \frac{1}{C_L}\sum_{i,i'=1}^{C_L}\delta_{ii'}E\big[f^{(L)}_{i,1}(x)f^{(L)}_{i',1}(z)\big] = \frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L)}_{i,1}(x)f^{(L)}_{i,1}(z)\big].
\]
Applying Lemma A.3 once we obtain
\[
E_\theta\big[f_{Eq}(x;\theta)f_{Eq}(z;\theta)\big] = \frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L-1)}_{i,1}(x)f^{(L-1)}_{i,1}(z)\big] + \frac{\alpha^2}{qc_w}\,\mathrm{tr}\big[K^{(L)}_{D_{1,1}}(x,z)\big],
\]
and applying Lemma A.3 recursively for all layers we obtain
\[
E_\theta\big[f_{Eq}(x;\theta)f_{Eq}(z;\theta)\big] = \frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(0)}_{i,1}(x)f^{(0)}_{i,1}(z)\big] + \frac{\alpha^2}{qc_w}\sum_{l=1}^{L}\mathrm{tr}\big[K^{(l)}_{D_{1,1}}(x,z)\big] = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2}{qc_w}\sum_{l=1}^{L}\mathrm{tr}\big[K^{(l)}_{D_{1,1}}(x,z)\big]. \qquad\square
\]

Lemma A.5. $K^{(L)}_{Tr}(x,z) = \frac{1}{d}\sum_{j=1}^{d}K^{(L)}_{Eq}(s^jx,s^jz)$ and $K^{(L)}_{GAP}(x,z) = \frac{1}{d^2}\sum_{j,j'=1}^{d}K^{(L)}_{Eq}(s^jx,s^{j'}z)$.

Proof. For $f_{Tr}$ we have
\[
K^{(L)}_{Tr}(x,z) = E_\theta\big[f_{Tr}(x;\theta)f_{Tr}(z;\theta)\big] = \frac{1}{C_Ld}\sum_{j,j'=1}^{d}\sum_{i,i'=1}^{C_L}E\big[W^{Tr}_{i,j}f^{(L)}_{i,j}(x)\,W^{Tr}_{i',j'}f^{(L)}_{i',j'}(z)\big] = \frac{1}{d}\sum_{j=1}^{d}\frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L)}_{i,j}(x)f^{(L)}_{i,j}(z)\big] = \frac{1}{d}\sum_{j=1}^{d}\frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L)}_{i,1}(s^jx)f^{(L)}_{i,1}(s^jz)\big],
\]
where the part inside the parentheses was shown in Lemma A.4 to equal $K^{(L)}_{Eq}(s^jx,s^jz)$.
For $f_{GAP}$ we analogously obtain
\[
K^{(L)}_{GAP}(x,z) = E_\theta\big[f_{GAP}(x;\theta)f_{GAP}(z;\theta)\big] = \frac{1}{C_Ld^2}\sum_{j,j'=1}^{d}\sum_{i,i'=1}^{C_L}E\big[W^{GAP}_if^{(L)}_{i,j}(x)\,W^{GAP}_{i'}f^{(L)}_{i',j'}(z)\big] = \frac{1}{d^2}\sum_{j,j'=1}^{d}\frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L)}_{i,j}(x)f^{(L)}_{i,j'}(z)\big] = \frac{1}{d^2}\sum_{j,j'=1}^{d}\frac{1}{C_L}\sum_{i=1}^{C_L}E\big[f^{(L)}_{i,1}(s^jx)f^{(L)}_{i,1}(s^{j'}z)\big],
\]
from which the claim follows. □

A.3 FORMULAS FOR MULTISPHERE INPUT: PROOF OF THEOREM 4.1

Lemma A.6. For an $L$-layer ResNet $f$ and $x\in\mathbb{MS}(C_0,d)$, and for every $1\le l\le L$, $1\le j,j'\le d$,
\[
\Sigma^{(l)}_{j,j'}(x,x) = \begin{cases}\frac{1}{C_0} & l = 1,\\[4pt] \Big(1+\frac{\alpha^2c_vc_w}{2}\Big)^{l-2}\,\frac{2c_w+\alpha^2c_vc_w}{2C_0} & l\ge2.\end{cases}
\]
Proof. We prove this by induction using the formula in Theorem A.1. For $l=1$, since by assumption $\|x_i\|=1$ for every $i$, $x^Tx$ is the $d\times d$ matrix with 1 in every entry. Therefore, $\Sigma^{(1)}_{j,j'}(x,x) = \frac{1}{C_0}[x^Tx]_{j,j'} = \frac{1}{C_0}$. Similarly, for $l=2$,
\[
\Sigma^{(2)}_{j,j'}(x,x) = \frac{c_w}{q}\,\mathrm{tr}\big[\Sigma^{(1)}_{D_{j,j'}}(x,x)\big] + \frac{\alpha^2}{q^2}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\mathrm{tr}\big[K^{(1)}_{D_{j+k,j'+k}}(x,x)\big].
\]
We can plug in the induction hypothesis and express $K$ as in (9), obtaining
\[
\Sigma^{(2)}_{j,j'}(x,x) = \frac{c_w}{C_0} + \frac{\alpha^2}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\frac{c_vc_w}{2}\,\kappa_1(1)\,\Sigma^{(1)}_{j+k+k',j'+k+k'}(x,x) = \frac{c_w}{C_0} + \frac{\alpha^2c_vc_w}{2}\,N_1 = \frac{2c_w+\alpha^2c_vc_w}{2C_0},
\]
where we used the fact that $\kappa_1(1) = 1$ (and $N_1 = \frac{1}{C_0}$). The proof for $l\ge3$ is analogous:
\[
\Sigma^{(l)}_{j,j}(x,x) = \Sigma^{(l-1)}_{j,j}(x,x) + \frac{\alpha^2}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\frac{c_vc_w}{2}\,N_{l-1}\,\kappa_1(1) = \Big(1+\frac{\alpha^2c_vc_w}{2}\Big)N_{l-1} = \Big(1+\frac{\alpha^2c_vc_w}{2}\Big)^{l-2}\frac{2c_w+\alpha^2c_vc_w}{2C_0}. \qquad\square
\]

Lemma A.7. For any $L\in\mathbb{N}$ let $N_l$ be the value of $\Sigma^{(l)}_{j,j}(x,x)$ from Lemma A.6. Let $x,z\in\mathbb{MS}(C_0,d)$. Then
\[
K^{(L)}_{Eq}(x,z) = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2c_v}{2q}\sum_{l=1}^{L}N_l\,\mathrm{tr}\,\kappa_1\Big(\frac{\Sigma^{(l)}_{D_{1,1}}(x,z)}{N_l}\Big),
\]
where $\kappa_1$ is applied entry-wise.

Proof. Let $L\in\mathbb{N}$. We know from Theorem A.1 that $K^{(L)}_{Eq}(x,z) = \Sigma^{(1)}_{1,1}(x,z) + \frac{\alpha^2}{qc_w}\sum_{l=1}^{L}\mathrm{tr}\big[K^{(l)}_{D_{1,1}}(x,z)\big]$. By expressing $K$ as in (9) and using Lemma A.6 (so that $\Sigma^{(l)}_{j,j} = N_l$ for every $j$) we get the claim. □

Corollary A.1.
Fix $c_v = 2$, $c_w = 1$. Then $K^{(L)}_{Eq}(x,x) = \frac{(1+\alpha^2)^L}{C_0}$.

Proof. Using the previous lemma and the fact that $\kappa_1(1) = 1$,
\[
K^{(L)}_{Eq}(x,x) = \Sigma^{(1)}_{1,1}(x,x) + \alpha^2\sum_{l=1}^{L}N_l = \frac{1}{C_0}\Big(1+\alpha^2\Big(1+(1+\alpha^2)\sum_{l=0}^{L-2}(1+\alpha^2)^l\Big)\Big) = \frac{1}{C_0}\Big(1+\alpha^2\Big(1+(1+\alpha^2)\frac{1-(1+\alpha^2)^{L-1}}{1-(1+\alpha^2)}\Big)\Big) = \frac{1}{C_0}\Big(1+\alpha^2\Big(1+\frac{(1+\alpha^2)^L-(1+\alpha^2)}{\alpha^2}\Big)\Big) = \frac{(1+\alpha^2)^L}{C_0}. \qquad\square
\]

Proposition A.1. For any $L\in\mathbb{N}$, let $x,z\in\mathbb{MS}(C_0,d)$, and denote $t = \big([x^Tz]_{1,1},[x^Tz]_{2,2},\dots,[x^Tz]_{d,d}\big)\in[-1,1]^d$. Suppose that $\alpha$ is fixed for all networks of different depths. Then
\[
K^{(1)}_{Eq}(t) = \frac{1}{C_0}t_1 + \frac{\alpha^2}{qc_wC_0}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1(t_{1+k})
\]
and, for $L\ge2$,
\[
K^{(L)}_{Eq}(t) = K^{(L-1)}_{Eq}(t) + \tilde N_L\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\Big(\frac{K^{(L-1)}_{Eq}(s^{k+k'}t)}{N_L}\Big),
\]
where $N_L$ is the value of $\Sigma^{(L)}_{j,j}(x,x)$ from Lemma A.6 and $\tilde N_L = \frac{\alpha^2c_vc_w}{2q^2}N_L$.

Proof. If $\alpha$ is fixed, then for all $1\le l\le L$ the definition of $K^{(l)}(x,z)$ does not depend on $L$ (only on $l\le L$). Therefore, using Lemma A.7 we obtain
\[
K^{(L)}_{Eq}(x,z) = K^{(L-1)}_{Eq}(x,z) + \frac{\alpha^2c_v}{2q}N_L\,\mathrm{tr}\,\kappa_1\Big(\frac{\Sigma^{(L)}_{D_{1,1}}(x,z)}{N_L}\Big).
\]
To simplify this further, observe first that a direct consequence of Theorem A.1 is that for all $L\ge2$,
\[
\Sigma^{(L)}_{1,1}(x,z) = \frac{c_w}{q}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}K^{(L-1)}_{Eq}(s^kx,s^kz).
\]
We therefore get
\[
K^{(L)}_{Eq}(x,z) = K^{(L-1)}_{Eq}(x,z) + \frac{\alpha^2c_vc_w}{2q^2}N_L\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\Big(\frac{K^{(L-1)}_{Eq}(s^{k+k'}x,s^{k+k'}z)}{N_L}\Big). \qquad\square
\]

Corollary A.2. For any $x,z\in\mathbb{MS}(C_0,d)$, let $t = \big([x^Tz]_{1,1},[x^Tz]_{2,2},\dots,[x^Tz]_{d,d}\big)\in[-1,1]^d$. Fix $c_v = 2$, $c_w = 1$ and some $\alpha$ shared by all the networks. Then,

\[
K^{(1)}_{Eq}(t) = \frac{1}{1+\alpha^2}\Big(t_1 + \frac{\alpha^2}{q}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1(t_{1+k})\Big), \qquad
K^{(L)}_{Eq}(t) = \frac{1}{1+\alpha^2}\Big(K^{(L-1)}_{Eq}(t) + \frac{\alpha^2}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\big(K^{(L-1)}_{Eq}(s^{k+k'}t)\big)\Big),
\]
where the kernels are normalized so that $K^{(l)}_{Eq}(\vec 1) = 1$ (cf. Corollary A.1).
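As a sanity check, the normalized recursion of Corollary A.2 can be evaluated directly. The sketch below is an illustration, not the authors' code; it assumes the standard arc-cosine expression for $\kappa_1$ and cyclic shifts for $s$, and tracks the vector of shifted kernel values $K^{(l)}_{Eq}(s^jt)$:

```python
import numpy as np

def kappa1(u):
    # Degree-1 arc-cosine kernel: kappa_1(u) = (sqrt(1-u^2) + u(pi - arccos u)) / pi
    u = np.clip(u, -1.0, 1.0)
    return (np.sqrt(1.0 - u**2) + u * (np.pi - np.arccos(u))) / np.pi

def res_cgpk_eq(t, L, alpha, q=3):
    """Normalized ResCGPK of Corollary A.2 (c_v = 2, c_w = 1).

    t[j] = <x_j, z_j> for multi-sphere inputs; v[j] tracks K_Eq^{(l)}(s^j t),
    with s acting as a cyclic shift of the pixel index."""
    t = np.asarray(t, dtype=float)
    d = len(t)
    ks = np.arange(-(q - 1) // 2, (q - 1) // 2 + 1)
    # layer 1
    v = np.array([(t[j] + alpha**2 / q * kappa1(t[(j + ks) % d]).sum())
                  / (1 + alpha**2) for j in range(d)])
    # layers 2..L
    for _ in range(L - 1):
        v = np.array([(v[j] + alpha**2 / q**2
                       * sum(kappa1(v[(j + k + kp) % d]) for k in ks for kp in ks))
                      / (1 + alpha**2) for j in range(d)])
    return v[0]
```

On the diagonal ($t = \vec 1$) the recursion has fixed point 1, matching the normalization above.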

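Similarly, the closed form of Corollary A.1 for the diagonal can be checked against the layer values $N_l$ of Lemma A.6. A minimal sketch, assuming $c_v = 2$, $c_w = 1$ and an arbitrary illustrative value of $C_0$:

```python
def keq_diag_recursive(L, alpha, C0=3.0):
    """K_Eq^{(L)}(x, x) via Lemma A.7 with kappa_1(1) = 1:
    K = Sigma^{(1)} + alpha^2 * sum_l N_l, where (for c_v = 2, c_w = 1)
    N_1 = 1/C0 and N_l = (1 + alpha^2)^(l-1)/C0 for l >= 2."""
    N = [1.0 / C0] + [(1.0 + alpha**2) ** (l - 1) / C0 for l in range(2, L + 1)]
    return 1.0 / C0 + alpha**2 * sum(N)

def keq_diag_closed(L, alpha, C0=3.0):
    # Corollary A.1: K_Eq^{(L)}(x, x) = (1 + alpha^2)^L / C0
    return (1.0 + alpha**2) ** L / C0
```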
B DERIVATION OF RESCNTK B.1 REWRITING THE NEURAL NETWORK

The convolution of $w\in\mathbb{R}^q$ with a vector $v\in\mathbb{R}^d$ can be rewritten as $[w*v]_i = \sum_{j=1}^{q}[w]_j[v]_{i+j-\frac{q+1}{2}}$. Therefore, let $\varphi(v)\in\mathbb{R}^{q\times d}$ be defined by $[\varphi(v)]_{ij} := [v]_{i+j-\frac{q+1}{2}}$. Then we can rewrite the above as $w*v = \big(w^T\varphi(v)\big)^T = \varphi(v)^Tw$. Using this definition, if we instead have $w\in\mathbb{R}^{q\times c}$, then
\[
A_{ij} := [w_{:,i}*v]_j = \big[\varphi(v)^Tw_{:,i}\big]_j = \big[\varphi(v)^Tw\big]_{ji} \;\Longrightarrow\; A = w^T\varphi(v).
\]
Lastly, if we instead have $w\in\mathbb{R}^{q\times c\times c'}$ and $v\in\mathbb{R}^{c'\times d}$, then
\[
A_{ij} := \sum_{k=1}^{c'}[w_{:,k,i}*v_k]_j \;\Longrightarrow\; A = \sum_{k=1}^{c'}w_{:,k,:}^T\,\varphi(v_k).
\]
We can now rewrite the network architecture as:
\[
f^{(0)}(x) = \frac{1}{\sqrt{C_0}}V^{(0)T}_1x \tag{12}
\]
\[
g^{(1)}(x) = \frac{1}{\sqrt{C_0}}W^{(1)T}_1x \tag{13}
\]
\[
f^{(l)}(x) = f^{(l-1)}(x) + \alpha\sqrt{\frac{c_v}{qC_l}}\sum_{j=1}^{C_l}V^{(l)T}_{:,j,:}\,\varphi\big(\sigma\big(g^{(l)}_j(x)\big)\big), \qquad l = 1,\dots,L,
\]
\[
g^{(l)}(x) = \sqrt{\frac{c_w}{qC_{l-1}}}\sum_{j=1}^{C_{l-1}}W^{(l)T}_{:,j,:}\,\varphi\big(f^{(l-1)}_j(x)\big), \qquad l = 2,\dots,L,
\]
and as before we have an output layer that corresponds to one of $f_{Eq}$, $f_{Tr}$, or $f_{GAP}$.
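The identity $w*v = \varphi(v)^Tw$ can be verified numerically. A small sketch, where cyclic (circular) indexing of out-of-range positions is an assumption of this illustration:

```python
import numpy as np

def phi(v, q):
    """phi(v) in R^{q x d}: [phi(v)]_{ij} = v_{i+j-(q+1)/2} (1-based), which is
    v[(i + j - (q-1)//2) % d] in 0-based cyclic indexing."""
    d = len(v)
    return np.array([[v[(i + j - (q - 1) // 2) % d] for j in range(d)]
                     for i in range(q)])

def conv(w, v):
    """[w * v]_j = sum_i w_i v_{j+i-(q+1)/2}: a centered sliding window, cyclic."""
    q = len(w)
    ks = range(-(q - 1) // 2, (q - 1) // 2 + 1)
    return sum(w[k + (q - 1) // 2] * np.roll(v, -k) for k in ks)
```

The multi-filter case $A = w^T\varphi(v)$ then follows by stacking the per-filter results.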

B.2 NOTATIONS

We use numerator layout notation, i.e., for $y\in\mathbb{R}$, $A\in\mathbb{R}^{m\times n}$ we denote
\[
\mathbb{R}^{n\times m}\ni\frac{\partial y}{\partial A} = \begin{pmatrix}\frac{\partial y}{\partial A_{11}} & \frac{\partial y}{\partial A_{21}} & \cdots\\ \vdots & \ddots & \vdots\\ \cdots & & \frac{\partial y}{\partial A_{mn}}\end{pmatrix}.
\]
Also, let $J^{ij}_{mn}$ be the $m\times n$ matrix with 1 in coordinate $(i,j)$ and 0 elsewhere; we write $J^{ij}$ when $m,n$ are clear from context. Finally, let $J_{D_i} = \sum_{m\in D_i}J^{m,\,m-i+\frac{q+1}{2}}_{d,q}$.

B.3 CHAIN RULE REMINDER

Recall that by the chain rule the Jacobian of a composition of functions $h\circ\psi(v)$ decomposes as $J_{h\circ\psi}(v) = J_h(\psi(v))\,J_\psi(v)$. As such, the following definitions will come in handy.

Definition B.1. For all $1\le l\le L$, $1\le j,j'\le d$, let
\[
b^{(l)}(x) := \frac{\partial f_{Eq}(x;\theta)}{\partial f^{(l)}(x)^T}, \qquad \Pi^{(l)}_{j,j'}(x,z) := \frac{1}{c_w}E\big[b^{(l)T}(x)\,b^{(l)}(z)\big]_{j,j'}.
\]
Notice that $f^{(l)}(x)\in\mathbb{R}^{C_l\times d}$ implies $b^{(l)}(x)\in\mathbb{R}^{C_l\times d}$.

Remark. There is a slight abuse of notation in the definition of $b^{(l)}$. By Lemma B.4, $b^{(L+1)}$ depends only on the weights of the last layer. Therefore, by the recursive formula for $b^{(l)}$ in Lemma B.3 and by plugging in Lemma B.5, $b^{(l)}$ can be written using only $W^{(l)},\dots,W^{(L)},W^{Eq},V^{(l)},\dots,V^{(L)}$ and $\dot\sigma(g^{(l+1)}(x)),\dots,\dot\sigma(g^{(L)}(x))$, where the latter are indicator functions that are always multiplied by some of $W^{(l)},\dots,W^{(L)},W^{Eq},V^{(l)},\dots,V^{(L)}$. It is now easy to see that for any $l'\le l$,
\[
E\big[b^{(l)}_{ij}(x)f^{(l')}_{i'j'}(x)\big] = 0 = E\big[b^{(l)}_{ij}(x)\big]E\big[f^{(l')}_{i'j'}(x)\big],
\]
and as a result $b^{(l)}$ is uncorrelated with $f^{(0)},\dots,f^{(l)}$, $g^{(1)},\dots,g^{(l)}$, $x$ and $z$.

B.4 PROOF OF THEOREM 4.2 IN THE MAIN TEXT

We start with a lemma that relates the trace and GAP ResCNTK to the equivariant kernel.

Lemma B.1. For an $L$-layer ResNet,
\[
\Theta^{(L)}_{Tr}(x,z) = \frac{1}{d}\sum_{j=1}^{d}\Theta^{(L)}_{Eq}(s^jx,s^jz) \qquad\text{and}\qquad \Theta^{(L)}_{GAP}(x,z) = \frac{1}{d^2}\sum_{j,j'=1}^{d}\Theta^{(L)}_{Eq}(s^jx,s^{j'}z).
\]
Proof.
\[
E\Big\langle\frac{\partial f_{Tr}(x;\theta)}{\partial\theta},\frac{\partial f_{Tr}(z;\theta)}{\partial\theta}\Big\rangle = \frac{1}{C_Ld}\sum_{i,i'=1}^{C_L}\sum_{j,j'=1}^{d}E\Big\langle\frac{\partial\,W^{Tr}_{ij}f^{(L)}_{ij}(x)}{\partial\theta},\frac{\partial\,W^{Tr}_{i'j'}f^{(L)}_{i'j'}(z)}{\partial\theta}\Big\rangle \overset{(8)}{=} \frac{1}{d}\sum_{j=1}^{d}E\Big\langle\frac{\partial f_{Eq}(s^{j-1}x;\theta)}{\partial\theta},\frac{\partial f_{Eq}(s^{j-1}z;\theta)}{\partial\theta}\Big\rangle.
\]
Similarly for GAP:
\[
E\Big\langle\frac{\partial f_{GAP}(x;\theta)}{\partial\theta},\frac{\partial f_{GAP}(z;\theta)}{\partial\theta}\Big\rangle = \frac{1}{C_Ld^2}\sum_{i,i'=1}^{C_L}\sum_{j,j'=1}^{d}E\Big\langle\frac{\partial\,W^{GAP}_if^{(L)}_{ij}(x)}{\partial\theta},\frac{\partial\,W^{GAP}_{i'}f^{(L)}_{i'j'}(z)}{\partial\theta}\Big\rangle \overset{(8)}{=} \frac{1}{d^2}\sum_{j,j'=1}^{d}E\Big\langle\frac{\partial f_{Eq}(s^{j-1}x;\theta)}{\partial\theta},\frac{\partial f_{Eq}(s^{j'-1}z;\theta)}{\partial\theta}\Big\rangle. \qquad\square
\]
We now return to the main proof. Using the lemma above, it remains to prove the theorem for $f_{Eq}$.

Proposition B.1. Theorem 4.2 holds for the case $f = f_{Eq}$.

Proof. By linearity of the derivative operation and expectation we can rewrite
\[
\Theta^{(L)}_{Eq}(x,z) = E\Big\langle\frac{\partial f(x;\theta)}{\partial\theta},\frac{\partial f(z;\theta)}{\partial\theta}\Big\rangle = E\Big\langle\frac{\partial f(x;\theta)}{\partial W^{Eq}},\frac{\partial f(z;\theta)}{\partial W^{Eq}}\Big\rangle + \sum_{l=1}^{L}\Big(E\Big\langle\frac{\partial f(x;\theta)}{\partial W^{(l)}},\frac{\partial f(z;\theta)}{\partial W^{(l)}}\Big\rangle + E\Big\langle\frac{\partial f(x;\theta)}{\partial V^{(l)}},\frac{\partial f(z;\theta)}{\partial V^{(l)}}\Big\rangle\Big).
\]
We deal with each term separately, starting with the first:
\[
E\Big\langle\frac{\partial f(x;\theta)}{\partial W^{Eq}},\frac{\partial f(z;\theta)}{\partial W^{Eq}}\Big\rangle = \frac{1}{C_L}E\big\langle f^{(L)}_{:,1}(x),f^{(L)}_{:,1}(z)\big\rangle = K^{(L)}_{Eq}(x,z).
\]
Next, to handle the $V^{(l)}$ terms, observe that for all $1\le l\le L$, $1\le i\le C_l$, by Lemma B.2,
\[
\frac{\partial f(x;\theta)}{\partial V^{(l)}_{:,i,:}} = \alpha\sqrt{\frac{c_v}{qC_l}}\;\frac{\partial f(x;\theta)}{\partial f^{(l)}(x)^T}\,\varphi^T\big(\sigma\big(g^{(l)}_i(x)\big)\big) = \alpha\sqrt{\frac{c_v}{qC_l}}\;b^{(l)}(x)\,\varphi^T\big(\sigma\big(g^{(l)}_i(x)\big)\big).
\]
Notice that Lemma B.7 implies that the conditions of Lemma B.6 are satisfied. Therefore,
\[
E\Big\langle\frac{\partial f(x;\theta)}{\partial V^{(l)}_{:,i,:}},\frac{\partial f(z;\theta)}{\partial V^{(l)}_{:,i,:}}\Big\rangle = \frac{\alpha^2c_vc_w}{qC_l}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\,E\big[\varphi^T\big(\sigma(g^{(l)}_i(x))\big)\varphi\big(\sigma(g^{(l)}_i(z))\big)\big]_{pp} = \frac{\alpha^2c_vc_w}{qC_l}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}E\big[\sigma\big(g^{(l)}_{i,p+k}(x)\big)\sigma\big(g^{(l)}_{i,p+k}(z)\big)\big] = \frac{\alpha^2}{qC_l}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\,\mathrm{tr}\big[K^{(l)}_{D_{p,p}}\big].
\]
Note that this does not depend on $i$. We therefore obtain
\[
E\Big\langle\frac{\partial f(x;\theta)}{\partial V^{(l)}},\frac{\partial f(z;\theta)}{\partial V^{(l)}}\Big\rangle = \sum_{i=1}^{C_l}E\Big\langle\frac{\partial f(x;\theta)}{\partial V^{(l)}_{:,i,:}},\frac{\partial f(z;\theta)}{\partial V^{(l)}_{:,i,:}}\Big\rangle = \frac{\alpha^2}{q}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\,\mathrm{tr}\big[K^{(l)}_{D_{p,p}}\big].
\]
The next term is the $W^{(l)}$ term. Using Lemma B.9 we have
\[
E\Big\langle\frac{\partial f(x;\theta)}{\partial W^{(l)}},\frac{\partial f(z;\theta)}{\partial W^{(l)}}\Big\rangle = \frac{\alpha^2}{q}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\,\mathrm{tr}\big[\dot K^{(l)}_{D_{p,p}}(x,z)\odot\Sigma^{(l)}_{D_{p,p}}(x,z)\big].
\]
Putting these together we obtain
\[
\Theta^{(L)}_{Eq}(x,z) = K^{(L)}_{Eq}(x,z) + \frac{\alpha^2}{q}\sum_{l=1}^{L}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\Big(\mathrm{tr}\big[\dot K^{(l)}_{D_{p,p}}(x,z)\odot\Sigma^{(l)}_{D_{p,p}}(x,z)\big] + \mathrm{tr}\big[K^{(l)}_{D_{p,p}}(x,z)\big]\Big).
\]
Finally, the formula for $\Pi$ is given in Lemma B.8; denoting $P_j = \Pi_{j,j}$ completes the proof. □

Lemma B.2. Let $\psi$ be a real-valued function and let $X,A$ be matrices such that $\psi\big(h(X^TA)\big)$ is well defined. Then
\[
\frac{\partial\psi\big(h(X^TA)\big)}{\partial X} = \frac{\partial\psi}{\partial(X^TA)^T}\,A^T = \sum_{s,t}\frac{\partial\psi}{\partial h(X^TA)_{s,t}}\,\frac{\partial h(X^TA)_{s,t}}{\partial(X^TA)}\,A^T.
\]
Proof. First, by the linearity of derivatives we get that $\frac{\partial(X^TA)_{ij}}{\partial X_{nm}} = \sum_k\frac{\partial X_{ki}A_{kj}}{\partial X_{nm}} = \delta_{im}A_{nj}$. Using the chain rule we get
\[
\Big[\frac{\partial\psi}{\partial X}\Big]_{mn} = \frac{\partial\psi}{\partial X_{nm}} = \sum_{i,j}\frac{\partial\psi}{\partial(X^TA)_{ij}}\,\frac{\partial(X^TA)_{ij}}{\partial X_{nm}} = \sum_j\frac{\partial\psi}{\partial(X^TA)_{mj}}A_{nj} = \Big[\frac{\partial\psi}{\partial(X^TA)^T}A^T\Big]_{mn}. \qquad\square
\]

Lemma B.3. For all $2\le l\le L$,
\[
b^{(l-1)}(x) = \sum_{m=1}^{C_l}\sum_{n=1}^{d}b^{(l)}_{mn}(x)\,\frac{\partial f^{(l)}_{mn}(x)}{\partial f^{(l-1)}(x)^T}.
\]
Proof. By the definition of $b^{(l-1)}(x)$ and the chain rule we have
\[
b^{(l-1)}(x) = \frac{\partial f(x;\theta)}{\partial f^{(l-1)}(x)^T} = \Big(\sum_{m=1}^{C_l}\sum_{n=1}^{d}\frac{\partial f(x;\theta)}{\partial f^{(l)}_{mn}(x)}\,\frac{\partial f^{(l)}_{mn}(x)}{\partial f^{(l-1)}(x)}\Big)^T = \sum_{m=1}^{C_l}\sum_{n=1}^{d}b^{(l)}_{mn}(x)\,\frac{\partial f^{(l)}_{mn}(x)}{\partial f^{(l-1)}(x)^T}. \qquad\square
\]

Lemma B.4. $E\big[b^{(L)T}(x)\,b^{(L)}(z)\big]_{j,j'} = \mathbb{1}_{j=j'=1}$.

Proof. $b^{(L)}_{:,j}(x) = \frac{1}{\sqrt{C_L}}\frac{\partial}{\partial f^{(L)}_{:,j}(x)}\big(W^{Eq}f^{(L)}_{:,1}(x)\big)^T = \delta_{j,1}\frac{1}{\sqrt{C_L}}W^{Eq}$. Therefore,
\[
E\big[b^{(L)T}(x)\,b^{(L)}(z)\big]_{j,j'} = \sum_{k=1}^{C_L}E\big[b^{(L)}_{kj}(x)b^{(L)}_{kj'}(z)\big] = \frac{\delta_{j,1}\delta_{j',1}}{C_L}\sum_{k=1}^{C_L}E\big[\big(W^{Eq}_k\big)^2\big] = \delta_{j,1}\delta_{j',1}. \qquad\square
\]

Lemma B.5.
∀2 ≤ l ≤ L, ∂f (l) mn (x) ∂f (l-1) (x) = J nm + α q c w c v C l C l-1 C l j=1 q k=1 V (l) k,j,m σ g (l) j,n+k-q+1 2 (x) J D n+k- q+1 2 W (l) :,:,j . Proof. by the definition of f (l) we have f (l) (x) = f (l-1) (x) + α q c w c v C l C l-1 C l j=1 V (l) T :,j,: φ σ g (l) j (x) . Taking a derivative of f (l) mn w.r.t. f (l-1) we obtain ∂f (l) mn (x) ∂f (l-1) (x) = J nm + α q c w c v C l C l-1 C l j=1 ∂ ∂f (l-1) (x) V (l) :,j,: T φ σ g (l) j (x) mn denote by yj . To simplify this, notice first that the derivative can be expressed as ∂ ∂f (l-1) (x) y j = q k=1 V (l) k,j,m ∂ ∂f (l-1) (x) φ kn σ g (l) j (x) = q k=1 V (l) k,j,m ∂ ∂f (l-1) (x) σ g (l) j,n+k-q+1 2 (x) . Using the chain rule we can express the derivative of σ g (l) j,n+k-q+1 2 (x) as follows:    σ g (l) j,n+k-q+1 2 (x) ∂f (l-1) (x)    m ′ n ′ = σ g (l) j,n+k-q+1 2 (x) • ∂ C l-1 j ′ =1 W (l) :,j ′ ,j , f (l-1) j ′ ,D n+k- q+1 2 (x) ∂f (l-1) n ′ m ′ (x) = σ g (l) j,n+k-q+1 2 (x) C l-1 j ′ =1 δ n ′ j ′ 1 m ′ ∈D n+k- q+1 2 W (l) m ′ -n-k+q+1,j ′ ,j = σ g (l) j,n+k-q+1 2 (x) 1 m ′ ∈D n+k- q+1 2 W (l) m ′ -n-k+q+1,j ′ ,j . = ∂g (l) j,n+k-q+1 2 (x) ∂f (l-1) (x) = σ g (l) j,n+k-q+1 2 (x) J D n+k- q+1 2 W (l) :,:,j In summary we obtain ∂f (l) mn (x) ∂f (l-1) (x) = J nm + α q c w c v C l C l-1 C l j=1 q k=1 V (l) k,j,m σ g (l) j,n+k-q+1 2 (x) J D n+k- q+1 2 W (l) :,:,j . Lemma B.6. For any two matrices M , M ′ ∈ R m×s and N , N ′ ∈ R s×n , if M T M ′ is uncorre- lated with N N ′ T , and for every k ̸ = k ′ either E M T M ′ kk ′ = 0 or E N N ′T k ′ k = 0, then E M N , M ′ N ′ = s p=1 E M T M ′ p,p E N T N ′ p,p = E tr M T M ′ ⊙ N T N ′ . Proof. 
Following the definition of an inner product we have E M N , M ′ N ′ = E   m i=1 n j=1 (M N ) ij M ′ N ′ ij   = E   m i=1 n j=1 s p=1 M i,p N p,j   s p ′ =1 M ′ i,p ′ N ′ p ′ j     = E   m i=1 n j=1 s p=1 s p ′ =1 M i,p M ′ ip ′ N p,j N ′ p ′ j   = uncorrelated s p=1 s p ′ =1 E M T M ′ p,p ′ E N N ′T p ′ p = 0 when p̸ =p ′ s p=1 E M T M ′ p,p E N N ′ T p,p = E tr M T M ′ ⊙ N T N ′ . Lemma B.7. ∀1 ≤ l ≤ L -1, 1 ≤ k, k ′ ≤ C l , 1 ≤ j, j ′ ≤ d, it holds that E b (l) kj (x) b (l) k ′ j ′ (z) = δ kk ′ δ jj ′   E b (l+1) kj (x) b (l+1) kj (z) + α 2 c w q 2 C l tr     q-1 2 p=-q-1 2 Π (l+1) Dj+p,j+p (x, z)   ⊙ K(l+1) Dj,j (x, z)     . Proof. By Lemma B.3 we have E b (l) kj (x) b (l) k ′ j ′ (z) = E   C l m=1 d n=1 b (l+1) mn (x) ∂f (l+1) mn (x) ∂f (l) kj , C l m ′ =1 d n ′ =1 b (l+1) m ′ n ′ (z) ∂f (l+1) m ′ n ′ (z) ∂f (l) k ′ j ′   = C l m=1 d n=1 C l m ′ =1 d n ′ =1 E   b (l+1) mn (x) b (l+1) m ′ n ′ (z) ∂f (l+1) mn (x) ∂f (l) kj ∂f (l+1) m ′ n ′ (z) ∂f (l) k ′ j ′   . Now as b (l+1) (x) is uncorrelated with ∂f (l+1) mn (x) ∂f (l) kj we get = C l m=1 d n=1 C l m ′ =1 d n ′ =1 E b (l+1) mn (x) b (l+1) m ′ n ′ (z) E   ∂f (l+1) mn (x) ∂f (l) kj ∂f (l+1) m ′ n ′ (z) ∂f (l) k ′ j ′   . By induction (where the base case is Lemma B.4), we can assume that if m ̸ = m ′ or n ̸ = n ′ then E b (l+1) mn (x) b (l+1) m ′ n ′ (x) = 0. We therefore obtain E b (l) kj (x) b (l) k ′ j ′ (x) = C l m=1 d n=1 E b (l+1) mn (x) b (l+1) mn (z) E   ∂f (l+1) mn (x) ∂f (l) kj ∂f (l+1) mn (z) ∂f (l) k ′ j ′   . It remains to calculate E ∂f (l+1) mn (x) ∂f (l) kj ∂f (l+1) mn (z) ∂f (l) k ′ j ′ . 
Lemma B.5 states that
\[
\frac{\partial f^{(l+1)}_{mn}(x)}{\partial f^{(l)}_{kj}(x)} = \Big[\frac{\partial f^{(l+1)}_{mn}(x)}{\partial f^{(l)}(x)}\Big]_{jk} = \underbrace{\delta_{km}\delta_{jn}}_{=:A^{mn}_{kj}} + \underbrace{\frac{\alpha}{q}\sqrt{\frac{c_wc_v}{C_lC_{l+1}}}}_{=:C}\;\underbrace{\sum_{s=1}^{C_{l+1}}\sum_{p=1}^{q}V^{(l+1)}_{p,s,m}\,\dot\sigma\big(g^{(l+1)}_{s,n+p-\frac{q+1}{2}}(x)\big)\,\mathbb{1}_{j\in D_{n+p-\frac{q+1}{2}}}\,W^{(l+1)}_{j-n-p+q+1,k,s}}_{=:B^{mn}_{kj}(x)}.
\]
Using this notation we have
\[
E\Big[\frac{\partial f^{(l+1)}_{mn}(x)}{\partial f^{(l)}_{kj}}\,\frac{\partial f^{(l+1)}_{mn}(z)}{\partial f^{(l)}_{k'j'}}\Big] = A^{mn}_{kj}A^{mn}_{k'j'} + C\big(A^{mn}_{kj}E\big[B^{mn}_{k'j'}(z)\big] + A^{mn}_{k'j'}E\big[B^{mn}_{kj}(x)\big]\big) + C^2E\big[B^{mn}_{kj}(x)B^{mn}_{k'j'}(z)\big].
\]
We will deal with each term separately. First, we consider the first term:
\[
A^{mn}_{kj}A^{mn}_{k'j'} = \delta_{km}\delta_{jn}\delta_{k'm}\delta_{j'n} = \delta^{kk'}_m\delta^{jj'}_n,
\]
where we use the notation $\delta^{kk'}_m = 1$ if $k=k'=m$ and 0 otherwise, and likewise for $\delta^{jj'}_n$. For the second term, since $V,W$ are zero-mean i.i.d. Gaussians (Equation (8)) and $\dot\sigma$ is an indicator function, we have $E\big[B^{mn}_{kj}(x)\big] = E\big[B^{mn}_{k'j'}(z)\big] = 0$. Therefore, $C\big(A^{mn}_{kj}E\big[B^{mn}_{k'j'}(z)\big] + A^{mn}_{k'j'}E\big[B^{mn}_{kj}(x)\big]\big) = 0$. For the last term, by definition,
\[
B^{mn}_{kj}(x) = \sum_{s=1}^{C_{l+1}}\sum_{p=1}^{q}V^{(l+1)}_{p,s,m}\,\dot\sigma\big(g^{(l+1)}_{s,n+p-\frac{q+1}{2}}(x)\big)\,\mathbb{1}_{j\in D_{n+p-\frac{q+1}{2}}}\,W^{(l+1)}_{j-n-p+q+1,k,s},
\]
\[
B^{mn}_{k'j'}(z) = \sum_{s'=1}^{C_{l+1}}\sum_{p'=1}^{q}V^{(l+1)}_{p',s',m}\,\dot\sigma\big(g^{(l+1)}_{s',n+p'-\frac{q+1}{2}}(z)\big)\,\mathbb{1}_{j'\in D_{n+p'-\frac{q+1}{2}}}\,W^{(l+1)}_{j'-n-p'+q+1,k',s'}.
\]
From ( 8), E V p ′ ,s ′ ,m = δ p,p ′ δ s,s ′ , and since they are uncorrelated with W (l+1) and g (l+1) then E B mn kj (x) B mn k ′ j ′ (z) = = C 2 C l+1 s=1 q p=1 E σ g (l+1) s,n+p-q+1 2 (x) W (l+1) j-n-p+q+1,k,s σ g (l+1) s,p+n-q+1 2 (z) W (l+1) j ′ -n-p+q+1,k ′ ,s 1 j,j ′ ∈D n+p- q+1 2 = δ kk ′ δ jj ′ C 2 C l+1 s=1 q p=1 E σ g (l+1) s,n+p-q+1 2 (x) σ g (l+1) s,n+p-q+1 2 (z) 1 j,j ′ ∈D n+p- q+1 2 = δ kk ′ δ jj ′ α 2 q 2 C l q p=1 K(l+1) n+p-q+1 2 ,n+p-q+1 2 (x, z) 1 j,j ′ ∈D n+p- q+1 Overall, E b (l) kj (x) b (l) k ′ j ′ (z) = = C l m=1 d n=1 E b (l+1) mn (x) b (l+1) mn (z) δ kk ′ m δ jj ′ n + δ kk ′ δ jj ′ α 2 q 2 C l q p=1 K(l+1) n+p-q+1 2 ,n+p-q+1 2 (x, z) 1 j,j ′ ∈D n+p- q+1 2 = δ kk ′ δ jj ′ E b (l+1) kj (x) b (l+1) kj (z) + α 2 q 2 C l d n=1 q p=1 Π (l+1) n,n (x, z) K(l+1) n+p-q+1 2 ,n+p-q+1 2 (x, z) 1 j,j ′ ∈D n+p- q+1 2 . Now observe that j ∈ D n+p-q+1 2 ⇐⇒ n + p -q ≤ j ≤ n + p -1 ⇐⇒ j -p + 1 ≤ n ≤ j -p + q. So we can rewrite the above as E b (l) kj (x) b (l) k ′ j ′ (z) = = δ kk ′ δ jj ′   E b (l+1) kj (x) b (l+1) kj (z) + α 2 c w q 2 C l q p=1 j-p+q n=j-p+1 Π (l+1) n,n (x, z) K(l+1) n+p-q+1 2 ,n+p-q+1 2 (x, z)   = δ kk ′ δ jj ′ E b (l+1) kj (x) b (l+1) kj (z) + α 2 c w q 2 C l q p=1 tr Π (l+1) D j-p+ q+1 2 ,j-p+ q+1 2 (x, z) ⊙ K(l+1) Dj,j (x, z) = δ kk ′ δ jj ′   E b (l+1) kj (x) b (l+1) kj (z) + α 2 c w q 2 C l tr     q-1 2 p=-q-1 2 Π (l+1) Dj+p,j+p (x, z)   ⊙ K(l+1) Dj,j (x, z)     . Lemma B.8. ∀1 ≤ j, j ′ ≤ d, Π (L) j,j ′ (x, z) = 1 dc w 1 j=j ′ =1 , ∀1 ≤ l ≤ L -1, 1 ≤ j, j ′ ≤ d, it holds that: Π (l) j,j ′ (x, z) = δ jj ′   Π (l+1) j,j (x, z) + α 2 q 2 tr     q-1 2 p= q-1 2 Π (l+1) Dj+p,j+p (x, z)   ⊙ K(l+1) Dj,j (x, z)     . Proof. First, from Lemma B.4 we know that: 1 c w E b (L) T (x) b (L) (z) j,j ′ = 1 dc w δ j,j ′ . Now let 1 ≤ l ≤ L -1. 
Using lemma B.7 we get that 1 c w E b (l) T (x) b (l) (z) j,j ′ = 1 c w C l k=1 E b (l) kj (x) b (l) kj ′ (z) = C l k=1 δ jj ′ 1 c w   E b (l+1) kj (x) b (l+1) kj (z) + δ kk ′ δ jj ′ α 2 c w q 2 C l tr     q-1 2 p=-q-1 2 Π (l+1) Dj+p,j+p (x, z)   ⊙ K(l+1) Dj,j (x, z)     = δ jj ′   Π (l+1) j,j (x, z) + α 2 q 2 tr     q-1 2 p= q-1 2 Π (l+1) Dj+p,j+p (x, z)   ⊙ K(l+1) Dj,j (x, z)     . Lemma B.9. ∀1 ≤ l ≤ L, E ∂f (x; θ) ∂W (l) , ∂f (z; θ) ∂W (l) = α 2 q d n=1 Π (l) nn (x, z) tr K(l) Dn,n (x, z) ⊙ Σ (l) Dn,n (x, z) . Proof. We first show the case for 2 ≤ l ≤ L. ∀2 ≤ l ≤ L, 1 ≤ i ≤ C l-1 , 1 ≤ c ≤ C l , we can express ∂f (x;θ) ∂W (l) :,i,c as ∂f (l) mn (x) ∂W (l) :,i,c = ∂ ∂W (l) :,i,c   f (l-1) (x) + α c v qC l C l j=1 V (l) :,j,: T φ σ g (l) j (x)   mn . Since g (l) j (x) depends on W (l) :,i,c iff c = j, we get ∂f (l) mn (x) ∂W (l) :,i,c = α c v qC l q k=1 V (l) k,c,m ∂ ∂W (l) :,i,c σ g (l) c,n+k-q+1 2 (x) = α q c w c v C l C l-1 q k=1 V (l) k,c,m σ g (l) c,n+k-q+1 2 (x) ∂ ∂W (l) :,i,c W (l) :,i,: T φ f (l-1) i (x) c,n+k-q+1 2 = α q c w c v C l C l-1 q k=1 V (l) k,c,m σ g (l) c,n+k-q+1 2 (x) φ T f (l-1) i (x) n+k-q+1 2 ,: . Now we have ∂f (x; θ) ∂W (l) :,i,c = C l m=1 d n=1 ∂f (x; θ) ∂f (l) mn (x) ∂f (l) mn (x) ∂W (l) :,i,c = α q c w c v C l C l-1 C l m=1 d n=1 q k=1 b (l) mn (x) V (l) k,c,m σ g (l) c,n+k-q+1 2 (x) φ T f (l-1) i (x) n+k-q+1 2 ,: . Taking the inner product we get the following expression: E ∂f (x; θ) ∂W (l) :,i,c , ∂f (z; θ) ∂W (l) :,i,c = α 2 c w c v q 2 C l C l-1 C l m=1 C l m ′ =1 d n=1 d n ′ =1 q k=1 q k ′ =1 E b (l) mn (x) b (l) mn ′ (z) •E V (l) k,c,m V (l) k ′ ,c,m ′ • •E σ g (l) c,n+k-q+1 2 (x) φ T f (l-1) i (x) n+k-q+1 2 ,: , σ g (l) c,n ′ +k ′ -q+1 2 (z) φ T f (l-1) i (z) n ′ +k ′ -q+1 2 ,: . Note however that (8) implies that V (l) k,c,m V (l) k ′ ,c,m ′ = δ kk ′ δ mm ′ , and Lemma B.7 implies that E b (l) mn (x) b (l) mn ′ (z) = 0 when n ̸ = n ′ . 
Therefore, E ∂f (x; θ) ∂W (l) :,i,c , ∂f = α 2 c w c v q 2 C l C l-1 d n=1 q k=1 C l m=1 E b (l) mn (x) b (l) mn (z) =cwΠ (l) nn (x,z) • • E σ g (l) c,n+k-q+1 2 (x) φ T f (l-1) i (x) n+k-q+1 2 ,: , σ g (l) c,n+k-q+1 2 (z) φ T f (l-1) i (z) n+k-q+1 2 ,: = α 2 c w q 2 C l C l-1 d n=1 Π (l) nn (x, z) q k=1 K(l) Dn,n (x, z) kk q-1 2 k ′ =-q-1 2 E f (l-1) i,n+k-q+1 2 +k ′ (x) f (l-1) i,n+k-q+1 2 +k ′ (z) . As a result, by linearity and the fact that this result does not depend on c we obtain E ∂f (x; θ) ∂W (l) , ∂f (z; θ) ∂W (l) = = α 2 c w q 2 C l-1 d n=1 Π (l) nn (x, z) q k=1 K(l) Dn,n (x, z) kk q-1 2 k ′ =-q-1 2 C l-1 i=1 E f (l-1) i,n+k-q+1 2 +k ′ (x) f (l-1) i,n+k-q+1 2 +k ′ (z) . Using Lemma A.2 the last term simplifies as follows: q-1 2 k ′ =-q-1 2 C l-1 i=1 E f (l-1) i,n+k-q+1 2 +k ′ (x) f (l-1) i,n+k-q+1 2 +k ′ (z) = qC l-1 c w E g (l) * ,n+k-q+1 2 (x) g (l) * ,n+k-q+1 2 (z) = qC l-1 c w Σ (l) n+k-q+1 2 ,n+k-q+1 2 (x, z) . Finally we obtain E ∂f (x; θ) ∂W (l) , ∂f (z; θ) ∂W (l) = α 2 q d n=1 Π (l) nn (x, z) q k=1 K(l) Dn,n (x, z) kk Σ (l) Dn,n (x, z) kk = α 2 q d n=1 Π (l) nn (x, z) tr K(l) Dn,n (x, z) ⊙ Σ (l) Dn,n (x, z) . The case of l = 1 is analogous, except that we replace φ T f (l-1) i (x) with x T i and similarly for z, (making minor adjustments accordingly). We therefore obtain E ∂f (x; θ) ∂W (1) , ∂f (z; θ) ∂W (1) = = α 2 qC 0 d n=1 Π (l) nn (x, z) q k=1 K(l) Dn,n (x, z) kk C0 i=1 E x i,n+k-q+1 2 z i,n+k-q+1 2 = α 2 q d n=1 Π (1) nn (x, z) tr K(1) Dn,n (x, z) ⊙ Σ (1) Dn,n (x, z) .

C SPECTRAL DECOMPOSITION

C.1 PROOF OF THEOREM 5.1 IN THE MAIN TEXT

Proof. The strategy is to bound the Taylor coefficients of the kernels. We use properties of the ordinary Bell polynomials for the lower bound, and previous work on singularity analysis (Flajolet & Sedgewick, 2009; Chen & Xu, 2020; Bietti & Bach, 2020; Belfer et al., 2021) for the upper bound. The details can be found in the lemmas that follow, which show that $K^{(L)}_{Eq}$ can be written as $\sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n}$ with $c_1\boldsymbol n^{-2.5}\le b_{\boldsymbol n}\le c_2\boldsymbol n^{-\frac{3}{2d}-1}$, and that $\Theta^{(L)}_{Eq}$ can similarly be written as $\sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n}$ with $c_1\boldsymbol n^{-2.5}\le b_{\boldsymbol n}\le c_2\boldsymbol n^{-\frac{1}{2d}-1}$. Thus, applying Geifman et al. (2022)[Theorems 3.3, 3.4] completes the proof. □

Lemma C.1. $K^{(L)}_{Eq}$ can be written as $K^{(L)}_{Eq}(t) = \sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n}$ with $c_1\boldsymbol n^{-\nu}\le b_{\boldsymbol n}$, where $c_1$ is a constant (depending on $L$) if the receptive field of $K^{(L)}_{Eq}$ includes $\boldsymbol n$, and 0 otherwise.

Proof. Let $\kappa_1(t) = \sum_{n=0}^{\infty}a_nt^n$ be the power series expansion of $\kappa_1(t)$, where $a_n\sim n^{-2.5}$ (Chen & Xu, 2020). We prove this by induction on $L$, starting with $L=1$:
\[
K^{(1)}_{Eq}(t) = \frac{1}{C_0}t_1 + \frac{\alpha^2}{qc_wC_0}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1(t_{1+k}) = \frac{1}{C_0}t_1 + \frac{\alpha^2}{qc_wC_0}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\sum_{n=0}^{\infty}a_nt^n_{1+k}.
\]
Letting $e_i$ be the multi-index with 1 in the $i$-th index and 0 elsewhere, it is clear that we can write the above as $\sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n}$ where
\[
b_{\boldsymbol n} = \begin{cases}\frac{\alpha^2}{c_wC_0}a_0 & \boldsymbol n = 0,\\[4pt] \frac{\alpha^2}{qc_wC_0}a_n + \frac{1}{C_0}\delta_{k,0}\delta_{n,1} & \boldsymbol n = n\,e_{1+k} \text{ for } -\frac{q-1}{2}\le k\le\frac{q-1}{2} \text{ and } n\in\mathbb{N},\\[4pt] 0 & \text{otherwise.}\end{cases}
\]
For $L\ge2$, by the induction hypothesis $K^{(L-1)}_{Eq}(t) = \sum_{\boldsymbol n\ge0}\tilde b_{\boldsymbol n}t^{\boldsymbol n}$ with $c_1\boldsymbol n^{-\nu}\le\tilde b_{\boldsymbol n}$. Let $N_L$ be the value of $\Sigma^{(L)}_{j,j}(x,x)$ from Lemma A.6 and $\tilde N_L = \frac{\alpha^2c_vc_w}{2q^2}N_L$. Then
\[
\kappa_1\Big(\frac{K^{(L-1)}_{Eq}(s^jt)}{N_L}\Big) = \sum_{n=0}^{\infty}\frac{a_n}{N_L^n}\Big(\sum_{\boldsymbol n\ge0}\tilde b_{\boldsymbol n}(s^jt)^{\boldsymbol n}\Big)^n = \sum_{n=0}^{\infty}\frac{a_n}{N_L^n}\Big(\sum_{\boldsymbol n\ge0}\tilde b_{s^{-j}\boldsymbol n}\,t^{\boldsymbol n}\Big)^n.
\]
Using the derivations for the multivariate ordinary Bell polynomials in Withers & Nadarajah (2010); Schumann (2019) we can rewrite the above as
\[
\kappa_1\Big(\frac{K^{(L-1)}_{Eq}(s^jt)}{N_L}\Big) = \sum_{\boldsymbol n\ge0}\Big(\sum_{n=0}^{\infty}\frac{a_n}{N_L^n}\,\hat B_{\boldsymbol n,n}\big(s^{-j}\tilde b\big)\Big)t^{\boldsymbol n},
\]
where $\tilde b = (\tilde b_{\boldsymbol n})_{\boldsymbol n\ge0}$ and $\hat B_{\boldsymbol n,n}$ denotes the ordinary Bell polynomials. Plugging this into our formula for the ResCGPK from Proposition A.1 we get
\[
K^{(L)}_{Eq}(t) = K^{(L-1)}_{Eq}(t) + \tilde N_L\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\Big(\frac{K^{(L-1)}_{Eq}(s^{k+k'}t)}{N_L}\Big) = \sum_{\boldsymbol n\ge0}\Big(\tilde b_{\boldsymbol n} + \tilde N_L\sum_{j=-q+1}^{q-1}\big(q-|j|\big)\sum_{n=0}^{\infty}\frac{a_n}{N_L^n}\,\hat B_{\boldsymbol n,n}\big(s^{-j}\tilde b\big)\Big)t^{\boldsymbol n}.
\]
Let this be $\sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n}$. All the terms in $b_{\boldsymbol n}$ are positive, so for a lower bound it suffices to sum up only specific $n$'s and $j$'s.
We choose $n = 1$, $j = 0$, and since Withers & Nadarajah (2010) showed that $\hat B_{\boldsymbol n,1}(\tilde b) = \tilde b_{\boldsymbol n}$, we get that
\[
b_{\boldsymbol n} \ge \tilde b_{\boldsymbol n} + q\,\frac{a_1\tilde N_L}{N_L}\,\tilde b_{\boldsymbol n} = \Big(1+\frac{a_1\alpha^2c_vc_w}{2q}\Big)\tilde b_{\boldsymbol n} \ge \Big(1+\frac{a_1\alpha^2c_vc_w}{2q}\Big)c_1\boldsymbol n^{-\nu},
\]
where the last inequality is by the induction hypothesis. □

Lemma C.2. The bound in Lemma C.1 holds for $\Theta^{(L)}_{Eq}$.

Proof. Denote by $b_{\boldsymbol n}(K)$ the Taylor coefficients of a kernel $K$. Theorem 4.2 implies that
\[
\Theta^{(L)}_{Eq}(x,z) = K^{(L)}_{Eq}(x,z) + \underbrace{\frac{\alpha^2}{q}\sum_{l=1}^{L}\sum_{p=1}^{d}\Pi^{(l)}_{p,p}(x,z)\Big(\mathrm{tr}\big[\dot K^{(l)}_{D_{p,p}}(x,z)\odot\Sigma^{(l)}_{D_{p,p}}(x,z)\big] + \mathrm{tr}\big[K^{(l)}_{D_{p,p}}(x,z)\big]\Big)}_{=:K'}.
\]
Since for any positive definite kernel the Taylor coefficients are non-negative, we get that
\[
b_{\boldsymbol n}\big(\Theta^{(L)}_{Eq}\big) = b_{\boldsymbol n}\big(K^{(L)}_{Eq}\big) + b_{\boldsymbol n}(K') \ge b_{\boldsymbol n}\big(K^{(L)}_{Eq}\big). \qquad\square
\]

Definition C.1. Let $K^{(L)}_{FC}$ be the GPK and $\Theta^{(L)}_{FC}$ the NTK of the bias-free fully connected ResNet defined in Huang et al. (2020). For $x,z\in\mathbb{S}^{C_0-1}$, $u = x^Tz$, following the derivation in Huang et al. (2020) and Belfer et al. (2021) (in particular, Appendix B.1 of the latter), the normalized versions of these kernels are:
\[
K^{(0)}_{FC}(u) = u, \qquad K^{(L)}_{FC}(u) = \frac{1}{1+\alpha^2}\Big(K^{(L-1)}_{FC}(u) + \alpha^2\kappa_1\big(K^{(L-1)}_{FC}(u)\big)\Big),
\]
\[
\Theta^{(L)}_{FC}(u) = \frac{1}{2Lv_{L-1}}\sum_{l=1}^{L}v_{l-1}\,P^{(l)}(u)\Big(\kappa_1\big(K^{(l-1)}_{FC}(u)\big) + K^{(l-1)}_{FC}(u)\,\kappa_0\big(K^{(l-1)}_{FC}(u)\big)\Big),
\]
where $v_l = \big(1+\alpha^2\big)^l$, $P^{(L)} = 1$ and $P^{(l)}(u) = P^{(l+1)}(u)\big(1+\alpha^2\kappa_0\big(K^{(l+1)}_{FC}(u)\big)\big)$.

Lemma C.3. For $t\in[-1,1]^d$, letting $t_1 = t_2 = \dots = t_d = u$ we obtain $K^{(L)}_{FC}(u) = K^{(L)}_{Eq}(t)$, and letting $k^{(L)} = \Theta^{(L)}_{Eq} - K^{(L)}_{Eq}$ we obtain $\Theta^{(L)}_{FC}(u) = k^{(L)}(t)$.

Proof. For the GPK, this is immediate from Corollary A.2. For the ResCNTK, first recall that
\[
\Theta^{(L)}_{Eq}(x,z) = K^{(L)}_{Eq}(x,z) + \frac{\alpha^2}{q}\sum_{l=1}^{L}\sum_{p=1}^{d}P^{(l)}_p(x,z)\Big(\mathrm{tr}\big[\dot K^{(l)}_{D_{p,p}}(x,z)\odot\Sigma^{(l)}_{D_{p,p}}(x,z)\big] + \mathrm{tr}\big[K^{(l)}_{D_{p,p}}(x,z)\big]\Big).
\]
Observe first that a direct consequence of Theorem A.1 is that for all $L\ge2$,
\[
\Sigma^{(L)}_{1,1}(x,z) = \frac{c_w}{q}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}K^{(L-1)}_{Eq}(s^kx,s^kz),
\]
and $\Sigma^{(1)}_{1,1}(x,z) = \frac{1}{C_0}u$.
Therefore, by the choice of $t$, the matrices $\Sigma^{(l)}$, $K^{(l)}$ and $\dot K^{(l)}$ have constant diagonals, so the terms in the trace do not depend on the index $p$, and we get
\[
k^{(L)}(t) = \alpha^2\sum_{l=1}^{L}N_{l+1}\Big(\kappa_0\big(K^{(l-1)}_{FC}(u)\big)K^{(l-1)}_{FC}(u) + \kappa_1\big(K^{(l-1)}_{FC}(u)\big)\Big)\sum_{p=1}^{d}P^{(l)}_p(t).
\]
Note that for each $1\le l\le L$, $\sum_{p=1}^{d}P^{(l)}_p(t) = P^{(l)}(u)$. For $l = L$ both equal 1 by definition. Now assume the claim for $l+1$ and show it for $l$:
\[
\sum_{p=1}^{d}P^{(l)}_p(t) = \sum_{p=1}^{d}\Big(P^{(l+1)}_p(t) + \frac{\alpha^2}{q^2}\,\mathrm{tr}\Big[\Big(\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}P^{(l+1)}_{D_{p+k}}(t)\Big)\odot\dot K^{(l+1)}_{D_{p,p}}(t)\Big]\Big) = \sum_{p=1}^{d}P^{(l+1)}_p(t) + \frac{\alpha^2}{q^2}\,\kappa_0\big(K^{(l+1)}_{FC}(u)\big)\sum_{p=1}^{d}\mathrm{tr}\Big[\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}P^{(l+1)}_{D_{p+k}}(t)\Big] = \big(1+\alpha^2\kappa_0\big(K^{(l+1)}_{FC}(u)\big)\big)\sum_{p=1}^{d}P^{(l+1)}_p(t),
\]
where the last step uses that each $P^{(l+1)}_p$ appears exactly $q^2$ times in the double sum. Plugging this into the induction hypothesis proves the induction. So overall,
\[
k^{(L)}(t) = \alpha^2\sum_{l=1}^{L}N_{l+1}\,P^{(l)}(u)\Big(\kappa_0\big(K^{(l-1)}_{FC}(u)\big)K^{(l-1)}_{FC}(u) + \kappa_1\big(K^{(l-1)}_{FC}(u)\big)\Big).
\]
Since $N_{l+1} = N_2v_{l-1}$, normalizing $k^{(L)}$ completes the proof. □

Lemma C.4 ((Bietti & Bach, 2020), Section 3.2). For a small $t > 0$,
\[
\kappa_1(1-t) = 1 - t + \frac{2\sqrt2}{3\pi}\,t^{\frac32} + O\big(t^{\frac52}\big).
\]

Lemma C.5. For all $L\ge0$, and for a small $t>0$, $K^{(L)}_{FC}(1-t) = 1 - t + \Theta\big(t^{\frac32}\big)$.

Remark. This lemma and its proof are a slightly modified version of Lemma B.4 from Belfer et al. (2021), where we tighten the bound.
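The decay rate $a_n\sim n^{-2.5}$ of the $\kappa_1$ Taylor coefficients, used throughout this section, can be checked numerically. The closed form below follows from $\kappa_1'(u) = \kappa_0(u) = \frac12 + \frac{\arcsin u}{\pi}$ together with the arcsine series; this is an illustrative sketch, not part of the proofs:

```python
import math

def kappa1_coeff(n):
    """Taylor coefficient a_n of kappa_1(t) = (sqrt(1-t^2) + t(pi - arccos t))/pi.

    Since kappa_1'(t) = kappa_0(t) = 1/2 + arcsin(t)/pi and
    arcsin(t) = sum_m C(2m, m) / (4^m (2m+1)) t^{2m+1}, for n = 2m+2 we get
    a_n = C(2m, m) / (4^m (2m+1)(2m+2) pi); odd coefficients beyond a_1 vanish."""
    if n == 0:
        return 1.0 / math.pi
    if n == 1:
        return 0.5
    if n % 2 == 1:
        return 0.0
    m = (n - 2) // 2
    # divide the two big integers first to stay within float range
    return math.comb(2 * m, m) / 4**m / ((2 * m + 1) * (2 * m + 2) * math.pi)
```

The series sums to $\kappa_1(1) = 1$, and $a_n\,n^{2.5}$ flattens to a constant along even $n$, consistent with $a_n\sim n^{-2.5}$ (Chen & Xu, 2020).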

D POSITIONAL BIAS OF EIGENVALUES D.1 PROOF OF THEOREM 5.2 IN THE MAIN TEXT

We define a "stride-$q$" version of the ResCGPK. Let $Q = \{-\frac{q-1}{2},\dots,\frac{q-1}{2}\}$, $R_0 = \{0^{2L-1}\}$ (i.e., a set that contains only the tuple $0 = (0,\dots,0)$), and for $l\ge1$ let $R_l := Q^{2l-1}\times\{0\}^{2(L-l)}$ (i.e., tuples where the first $2l-1$ elements are in $Q$ and the last $2(L-l)$ elements are 0). We write $[-1,1]^{R_l}$ for vectors $\vec t = (\vec t_a)_{a\in R_l}$ of $[-1,1]$-valued parameters indexed by tuples in $R_l$. Also, for every $k,k'\in Q$ and $1\le l\le L$ define $\iota^l_{k,k'}:[-1,1]^{R_L}\to[-1,1]^{R_L}$ by $\iota^l_{k,k'}(\vec t)_a = \vec t_{(a_1,\dots,a_{2l-3},k,k',a_{2l},\dots)}$. We now define the kernel $k^{(1)}:[-1,1]^{R_1}\to[-1,1]$ to be
\[
k^{(1)}(\vec t) = \frac{1}{1+\alpha^2}\Big(\underbrace{\vec t_0}_{=:k^{(1)}_1(\vec t)} + \alpha^2\underbrace{\frac{1}{q}\sum_{k=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\big(\vec t_{(k,0,\dots,0)}\big)}_{=:k^{(1)}_2(\vec t)}\Big),
\]
and for all $2\le l\le L$ we let $k^{(l)}:[-1,1]^{R_l}\to[-1,1]$ be
\[
k^{(l)}(\vec t) = \frac{1}{1+\alpha^2}\Big(\underbrace{k^{(l-1)}\big(\iota^l_{0,0}(\vec t)\big)}_{=:k^{(l)}_1(\vec t)} + \alpha^2\underbrace{\frac{1}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\big(k^{(l-1)}\big(\iota^l_{k,k'}(\vec t)\big)\big)}_{=:k^{(l)}_2(\vec t)}\Big).
\]
We are now ready to define a correspondence between the stride-$q$ ResCGPK and the standard one. Namely, for every $t\in[-1,1]^d$ we let $\phi(t)\in[-1,1]^{R_L}$ be $\phi(t)_a := t_{S_L(a)}$, and claim that $k^{(L)}(\phi(t)) = K^{(L)}_{Eq}(t)$. First observe that for $a\in R_l$ it holds that $\iota^l_{k,k'}(\phi(t))_a = t_{|a|+1+k+k'} = \phi(s^{k+k'}t)_a$ (because $a\in R_l$, so $a_{2l-2} = a_{2l-1} = 0$). Also notice that $\iota^l_{k,k'}(\phi(t))$ is supported on $R_{l+1}$. Thus, we recursively get that for $a\in R_1$,
\[
\iota^1_{k_1,k_1'}\circ\cdots\circ\iota^L_{k_L,k_L'}(\phi(t))_a = \phi\big(s^{k_1+k_1'}\circ\cdots\circ s^{k_L+k_L'}t\big)_a.
\]
Now, for $L=1$ we trivially have that $k^{(1)}(\phi(t)) = K^{(1)}_{Eq}(t)$. So for $a\in R_1$,
\[
k^{(1)}\big(\iota^1_{k_1,k_1'}\circ\cdots\circ\iota^L_{k_L,k_L'}(\phi(t))\big) = K^{(1)}_{Eq}\big(s^{k_1+k_1'}\circ\cdots\circ s^{k_L+k_L'}t\big).
\]
As such, we get that
\[
k^{(2)}\big(\iota^2_{k_2,k_2'}\circ\cdots\circ\iota^L_{k_L,k_L'}(\phi(t))\big) = K^{(2)}_{Eq}\big(s^{k_2+k_2'}\circ\cdots\circ s^{k_L+k_L'}t\big),
\]
and continuing by induction we eventually get $k^{(L-1)}\big(\iota^L_{k,k'}(\phi(t))\big) = K^{(L-1)}_{Eq}\big(s^{k+k'}t\big)$, which implies that $k^{(L)}(\phi(t)) = K^{(L)}_{Eq}(t)$.

We now move towards better understanding their Taylor expansions. For a function $f(t)$ that can be approximated by a Taylor series $\sum_{n=0}^{\infty}a_nt^n$, let $[t^n]f$ denote the coefficient of $t^n$ in that series (meaning $a_n$). Let
\[
M_l(\boldsymbol n) = \Big\{\tilde{\boldsymbol n}\in\mathbb{Z}^{R_L}_{\ge0}\;\Big|\;\mathrm{supp}\,\tilde{\boldsymbol n}\subseteq R_l \text{ and } \forall i\in[d],\;\sum_{j\in S_L^{-1}(i)}\tilde{\boldsymbol n}_j = \boldsymbol n_i\Big\}
\]
be the set of multi-indices which are indexed by tuples in $R_L$, with support in $R_l$, that correspond to $\boldsymbol n$. So if $\tilde{\boldsymbol n}\in M_l(\boldsymbol n)$ we get that for every $t\in[-1,1]^d$, $\phi(t)^{\tilde{\boldsymbol n}} = t^{\boldsymbol n}$. We get a correspondence between the Taylor expansions of the kernels as follows:
\[
\sum_{\boldsymbol n\ge0}b_{\boldsymbol n}t^{\boldsymbol n} = K^{(L)}_{Eq}(t) = k^{(L)}(\vec t) = \sum_{\tilde{\boldsymbol n}\ge0}\tilde b_{\tilde{\boldsymbol n}}\vec t^{\tilde{\boldsymbol n}} = \sum_{\boldsymbol n\ge0}\Big(\sum_{\tilde{\boldsymbol n}\in M_L(\boldsymbol n)}\tilde b_{\tilde{\boldsymbol n}}\Big)t^{\boldsymbol n}.
\]
By the uniqueness of the power series,
\[
b_{\boldsymbol n} = \sum_{\tilde{\boldsymbol n}\in M_L(\boldsymbol n)}\tilde b_{\tilde{\boldsymbol n}} = \frac{1}{1+\alpha^2}\Big(\sum_{\tilde{\boldsymbol n}\in M_{L-1}(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(L)}_1 + \alpha^2\sum_{\tilde{\boldsymbol n}\in M_L(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(L)}_2\Big).
\]
Now $k^{(L)}_1(\vec t) = \frac{1}{1+\alpha^2}\big(k^{(L-1)}_1(\iota^L_{0,0}(\vec t)) + \alpha^2k^{(L-1)}_2(\iota^L_{0,0}(\vec t))\big)$, so we can continue to apply this recursively and eventually get
\[
b_{\boldsymbol n} = \frac{1}{(1+\alpha^2)^L}\sum_{\tilde{\boldsymbol n}\in M_0(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(1)}_1 + \sum_{l=1}^{L}\frac{\alpha^2}{(1+\alpha^2)^{L-l+1}}\sum_{\tilde{\boldsymbol n}\in M_l(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(l)}_2 = \frac{\mathbb{1}_{\mathrm{supp}\,\boldsymbol n\subseteq R_0}}{(1+\alpha^2)^L} + \sum_{l=1}^{L}\frac{\alpha^2}{(1+\alpha^2)^{L-l+1}}\sum_{\tilde{\boldsymbol n}\in M_l(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(l)}_2.
\]
Now let $\nu = 2.5$; via the proof of Lemma C.1 we know that for every $1\le l\le L$ there is some $\hat c_l > 0$ s.t.
\[
\frac{\alpha^2}{(1+\alpha^2)^{L-l+1}}\sum_{\tilde{\boldsymbol n}\in M_l(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(l)}_2(\vec t) \ge \hat c_l\sum_{\tilde{\boldsymbol n}\in M_l(\boldsymbol n)}\tilde{\boldsymbol n}^{-\nu}.
\]
Let $p^{(l)}_i = \big|S_l^{-1}(i)\big|$, the number of paths from an input pixel $i$ to the output of an $l$-layer CGPK. Using Geifman et al. (2022)[Lemmas C.4, C.5] we get that for $A > 1$, some constants $c_l$, and $c_{\boldsymbol n,l} = c_l\prod_{i=1}^{d}A^{\min(p^{(l)}_i,\boldsymbol n_i)}$, it holds that
\[
c_{\boldsymbol n,l}\,\boldsymbol n^{-\nu} \le \frac{\alpha^2}{(1+\alpha^2)^{L-l+1}}\sum_{\tilde{\boldsymbol n}\in M_l(\boldsymbol n)}[\vec t^{\tilde{\boldsymbol n}}]k^{(l)}_2(\vec t).
\]
Overall, we obtain that $b_{\boldsymbol n} \ge \sum_{l=0}^{L}c_{\boldsymbol n,l}\,\boldsymbol n^{-\nu}$. Now consider the kernel $\hat k^{(l)}(t) = \sum_{\boldsymbol n\ge0}c_{\boldsymbol n,l}\,\boldsymbol n^{-\nu}t^{\boldsymbol n}$; by Geifman et al. (2022)[Lemma C.7], the eigenvalues of this kernel satisfy
\[
\lambda_{\boldsymbol k}\big(\hat k^{(l)}\big) \ge c_{\boldsymbol k,l}\prod_{i:\,\boldsymbol k_i>0}\boldsymbol k_i^{-C_0-2}
\]
for constants $c_{\boldsymbol k,l} = c_l\prod_{i=1}^{d}A^{\min(p^{(l)}_i,\boldsymbol k_i)}$. Since for every $l$ the eigenfunctions that correspond to $\lambda_{\boldsymbol k}(\hat k^{(l)})$ are the same (given by the spherical harmonics), we get by linearity that
\[
\lambda_{\boldsymbol k}\big(K^{(L)}_{Eq}\big) \ge \sum_{l=1}^{L}c_{\boldsymbol k,l}\prod_{i:\,\boldsymbol k_i>0}\boldsymbol k_i^{-C_0-2}.
\]
As in Lemma C.2, this also gives a bound on the eigenvalues of $\Theta^{(L)}_{Eq}$.
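The path counts $p^{(l)}_i = |S_l^{-1}(i)|$ that enter the constants $c_{\boldsymbol n,l}$ have a simple combinatorial form: they count the tuples in $Q^{2l-1}$ summing to a given offset, i.e., a $(2l-1)$-fold convolution of the uniform filter. The illustrative sketch below makes the positional bias concrete: offsets near the center of the receptive field are covered by far more paths than offsets near its edges.

```python
import numpy as np

def path_counts(l, q=3):
    """p^{(l)}: number of tuples a in Q^{2l-1}, Q = {-(q-1)/2, ..., (q-1)/2},
    whose entries sum to each possible offset from the output pixel."""
    counts = np.array([1.0])
    for _ in range(2 * l - 1):
        counts = np.convolve(counts, np.ones(q))
    return counts  # entry i corresponds to offset i - (q-1)(2l-1)/2
```

For example, with $q=3$ and $l=3$ the counts are symmetric and unimodal over the 11 reachable offsets, with a single path to each edge of the receptive field.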

E INFINITE DEPTH LIMIT

E.1 PROOF OF THEOREM 6.1 IN THE MAIN TEXT

Lemma E.1. Suppose that $\alpha = L^{-\gamma}$ for $\gamma\in(0.5,1]$. Then for any $t\in[-1,1]^d$,
\[
\big|\Theta^{(L)}_{Eq}(t) - \Sigma^{(1)}_{1,1}(t)\big| \le O\big(L^{1-2\gamma}\big).
\]
Proof. First recall that
\[
\Theta^{(L)}_{Eq}(t) = K^{(L)}_{Eq}(t) + \frac{\alpha^2}{q}\sum_{l=1}^{L}\sum_{p=1}^{d}P^{(l)}_p(t)\Big(\mathrm{tr}\big[\dot K^{(l)}_{D_{p,p}}(t)\odot\Sigma^{(l)}_{D_{p,p}}(t)\big] + \mathrm{tr}\big[K^{(l)}_{D_{p,p}}(t)\big]\Big).
\]
Let $k^{(L)}(t) = \frac{1}{\alpha^2}\big(\Theta^{(L)}_{Eq}(t) - K^{(L)}_{Eq}(t)\big)$, so that $\Theta^{(L)}_{Eq}(t) = K^{(L)}_{Eq}(t) + \alpha^2k^{(L)}(t)$. Using our calculations in (16) we know that $k^{(L)}(\vec1) = \frac{2}{C_0}L(1+\alpha^2)^{L-1}$. Therefore,
\[
\alpha^2k^{(L)}(\vec1) = \frac{2}{C_0}\big(\alpha^2L\big)\big(1+\alpha^2\big)^{L-1} \le \frac{2}{C_0}\big(\alpha^2L\big)\big(1+\alpha^2\big)^{L} = \frac{2}{C_0}L^{1-2\gamma}\Big(1+\frac{1}{L^{2\gamma}}\Big)^{L} \le \frac{2}{C_0}L^{1-2\gamma}\Big(\Big(1+\frac{1}{L^{2\gamma}}\Big)^{L^{2\gamma}}\Big)^{L^{1-2\gamma}} \le \frac{2}{C_0}L^{1-2\gamma}e^{L^{1-2\gamma}} = O\big(L^{1-2\gamma}\big).
\]
Consequently, $\big|\Theta^{(L)}_{Eq}(t) - K^{(L)}_{Eq}(t)\big| \le \alpha^2k^{(L)}(\vec1) \le O(L^{1-2\gamma})$. It therefore suffices to prove that $\big|K^{(L)}_{Eq}(t) - \Sigma^{(1)}_{1,1}(t)\big| \le O(L^{1-2\gamma})$. To avoid confusion, we denote by $K^{(L,\alpha)}_{Eq}(t)$ the ResCGPK with a specific value of $\alpha$ (that may not necessarily be $L^{-\gamma}$). For all $2\le l\le L$, as a result of Corollary A.2 we get that
\[
\big|K^{(l,L^{-\gamma})}_{Eq}(t) - K^{(l-1,L^{-\gamma})}_{Eq}(t)\big| \le \Big|\frac{1}{1+\alpha^2}\Big(K^{(l-1,L^{-\gamma})}_{Eq}(t) + \frac{\alpha^2}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\big(K^{(l-1,L^{-\gamma})}_{Eq}(s^{k+k'}t)\big)\Big) - K^{(l-1,L^{-\gamma})}_{Eq}(t)\Big| = \frac{\alpha^2}{1+\alpha^2}\Big|\frac{1}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}\kappa_1\big(K^{(l-1,L^{-\gamma})}_{Eq}(s^{k+k'}t)\big) - K^{(l-1,L^{-\gamma})}_{Eq}(t)\Big| \le \frac{\alpha^2}{1+\alpha^2},
\]
where the last inequality follows because $K^{(l-1,L^{-\gamma})}_{Eq}\in[-1,1]$ and so is $\kappa_1$. This implies that
\[
\big|K^{(L,L^{-\gamma})}_{Eq}(t) - \Sigma^{(1)}_{1,1}(t)\big| = \big|K^{(L,L^{-\gamma})}_{Eq}(t) - K^{(1,L^{-\gamma})}_{Eq}(t)\big| \le L\cdot\frac{\alpha^2}{1+\alpha^2} \le O\big(L^{1-2\gamma}\big). \qquad\square
\]

Lemma E.2. Suppose that $\alpha$ is a constant that does not depend on $L$. Then for any $t\in[-1,1]^d$, $K^{(L)}_{Eq}(t)\to1$ as $L\to\infty$.

Proof. Denote by $\vec K^{(L)}(t)$ the vector whose $j$-th coordinate is $K^{(L)}_{Eq}(s^{j-1}t)$, and let $E[\vec K^{(L)}(t)]$ denote the mean of the entries of $\vec K^{(L)}(t)$. Using Corollary A.2 and linearity we get
\[
E\big[\vec K^{(L)}(t)\big] = \frac{1}{1+\alpha^2}\Big(E\big[\vec K^{(L-1)}(t)\big] + \frac{\alpha^2}{q^2}\sum_{k,k'=-\frac{q-1}{2}}^{\frac{q-1}{2}}E\big[\kappa_1\big(\vec K^{(L-1)}(s^{k+k'}t)\big)\big]\Big),
\]
where we let $\kappa_1$ act point-wise on vectors.
Since we can permute t without changing the mean (i.e., for any j, E[K^{(L)}(t)] = E[K^{(L)}(s_j t)]), we get:

E[K^{(L)}(t)] = (1/(1+α²)) ( E[K^{(L-1)}(t)] + α² E[κ_1(K^{(L-1)}(t))] )   (17)
= E[K^{(L-1)}(t)] + (α²/(1+α²)) ( E[κ_1(K^{(L-1)}(t))] - E[K^{(L-1)}(t)] )
≥ E[K^{(L-1)}(t)] + (α²/(1+α²)) ( κ_1(E[K^{(L-1)}(t)]) - E[K^{(L-1)}(t)] ),

where the last inequality is Jensen's inequality (since κ_1 is convex (Daniely et al., 2016)). Therefore, letting a_L = E[K^{(L)}(t)], we can rewrite (18) as:

a_L ≥ a_{L-1} + (α²/(1+α²)) (κ_1(a_{L-1}) - a_{L-1}).

We therefore need to show that a_L → 1 as L → ∞. Since a_L is monotonically increasing (κ_1(u) > u for all u ∈ [-1, 1) (Daniely et al., 2016)) and bounded in [-1, 1], it suffices to show that for all ϵ > 0 there exists some L ∈ N s.t. a_L ≥ 1 - ϵ. Suppose not; then there is some ϵ > 0 s.t. for all L, a_L < 1 - ϵ. As (d/du) κ_1(u) = κ_0(u) and κ_0(u) ∈ [0, 1] (Daniely et al., 2016), the function h(u) := κ_1(u) - u satisfies (d/du) h(u) = κ_0(u) - 1 ≤ 0, with equality iff u = 1. Therefore, for any u ∈ [-1, 1-ϵ], h(u) ≥ h(1-ϵ) > 0 (the strict positivity is because κ_1(u) > u for u ∈ [-1, 1-ϵ]). Since we assumed by contradiction that a_L < 1 - ϵ for every L ∈ N, we get that h(a_L) ≥ h(1-ϵ), and thus

a_L ≥ a_{L-1} + (α²/(1+α²)) h(a_{L-1}) ≥ a_{L-1} + (α²/(1+α²)) h(1-ϵ) ≥ a_0 + L (α²/(1+α²)) h(1-ϵ) → ∞ as L → ∞.

However, since a_L ∈ [-1, 1], this leads to a contradiction.

This is in fact a consequence of Huang et al. (2020) not training the last layer (denoted by v in their paper). If they were to train the parameters v, the term E[⟨∂f(x;θ)/∂v, ∂f(z;θ)/∂v⟩] would be added to their ResNTK expression. But this term is exactly equal to the ResGPK. Therefore, training the last layer adds the ResGPK to the ResNTK expression. This is indeed confirmed in (Tirer et al., 2022), who derived the ResNTK when the last layer is trained. So if the last layer were trained, the ResGPK would appear as a summand of the ResNTK. Instead, by eliminating the ResGPK from the ResNTK expression, all the terms are multiplied by α.
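The fixed-point argument in the proof of Lemma E.2 can be illustrated numerically. The sketch below is an illustration, not code from the paper; the starting value a_0 and the constant α are arbitrary choices. It iterates the lower-bound recursion a_L = a_{L-1} + α²/(1+α²)(κ_1(a_{L-1}) - a_{L-1}) with the degree-one arc-cosine kernel κ_1 and shows the iterates climbing toward the fixed point u = 1 of κ_1(u) = u.

```python
import numpy as np

def kappa1(u):
    # Degree-one arc-cosine kernel for unit-norm inputs (Daniely et al., 2016):
    # kappa_1(u) = (sqrt(1 - u^2) + (pi - arccos(u)) * u) / pi.
    return (np.sqrt(1.0 - u ** 2) + (np.pi - np.arccos(u)) * u) / np.pi

alpha = 0.5   # arbitrary depth-independent alpha, as in Lemma E.2
a = -0.5      # arbitrary starting correlation a_0 in [-1, 1]
for _ in range(2000):
    # Lower-bound recursion from the proof: each step moves a_L toward 1.
    a = a + alpha ** 2 / (1 + alpha ** 2) * (kappa1(a) - a)
print(a)  # close to (but below) the fixed point 1
```

Since κ_0(1) = 1, the drift κ_1(a) - a vanishes only at a = 1, so the iterates converge to 1 from any starting correlation, matching the lemma's conclusion that K_Eq^{(L)}(t) → 1.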
So after normalizing, the two-layer ResNTK becomes equivalent to the two-layer FC-NTK (Belfer et al., 2021). If we were not to train the last layer, we would have a similar result, where the resulting kernel would correspond to a two-layer CNTK. We give here a sketch of the proof (the details are analogous to Belfer et al. (2021)). Theorem A.1 states that

Σ_{j,j'}^{(2)}(t) = (c_w/q) tr Σ_{D_{j,j'}}^{(1)}(t) + (α²/q) ∑_{k=-(q-1)/2}^{(q-1)/2} tr K_{D_{j+k,j'+k}}^{(1)}(t).

For 3 ≤ l ≤ L,

Σ_{j,j'}^{(l)}(t) = Σ_{j,j'}^{(l-1)}(t) + (α²/q) ∑_{k=-(q-1)/2}^{(q-1)/2} tr K_{D_{j+k,j'+k}}^{(l-1)}(t).

So for α = L^{-γ} we have that for all l ≥ 3,

|Σ_{j,j'}^{(l)}(t) - Σ_{j,j'}^{(l-1)}(t)| ≤ (α²/q) ∑_{k=-(q-1)/2}^{(q-1)/2} tr K_{D_{j+k,j'+k}}^{(l-1)}(t) ≤ α².

So |Σ_{j,j'}^{(l)}(t) - Σ_{j,j'}^{(2)}(t)| ≤ L · α² = L^{1-2γ}. In turn we also have |K_{j,j'}^{(l)}(t) - K_{j,j'}^{(2)}(t)| ≤ O(L^{1-2γ}) and |K̇_{j,j'}^{(l)}(t) - K̇_{j,j'}^{(2)}(t)| ≤ O(L^{1-2γ}). Furthermore, by Theorem 4.2 we have:



This research was partially supported by the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center and by research grants from the Estate of Tully and Michele Plesser and the Anita James Rosen Foundation.



Let x, z ∈ MS (C 0 , d), where d denotes the number of pixels and C 0 denotes the number of input channels for each pixel. The eigenvalues λ k of either K

Figure 1: Left: The eigenvalues of ResCGPK (filled circles and solid lines) computed numerically for various eigenfunctions, compared to those of CGPK (empty circles and dashed lines). Here L = 3, q = 2, d = 4 and the output head is the equivariant one. The slopes (respectively -5.25, -6.8, -8.5, -9.8 for CGPK and -5.3, -6.84, -8.77, -9.85 for ResCGPK) approximate the decay exponent for each pattern. Notice that the slope increases for eigenfunctions involving more pixels. Right: Additional eigenvalues of ResCGPK. Notice that the eigenvalues for eigenfunctions involving nearby pixels (in red) are larger than those involving farther pixels (in blue).
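The polynomial decay visible in Figure 1 traces back to the Maclaurin coefficients of the arc-cosine kernel κ_1(t) = (√(1-t²) + (π - arccos t)t)/π, which decay as n^{-2.5}. This rate can be checked exactly: the sketch below is an independent numerical check, not part of the paper's experiments, and uses the identity (π - arccos t)t = (π/2)t + t·arcsin t, so that for even n = 2k ≥ 2 the coefficient of t^n combines the series of √(1-t²) and t·arcsin t.

```python
from math import comb, pi

def coeff(n):
    """Exact Maclaurin coefficient of t^n in kappa_1, for even n >= 2."""
    k = n // 2
    # [t^{2k}] sqrt(1 - t^2) = (-1)^k * binom(1/2, k)
    b = 1.0
    for j in range(k):
        b *= (0.5 - j) / (j + 1)
    sqrt_part = (-1) ** k * b
    # [t^{2k}] t * arcsin(t) = binom(2k-2, k-1) / (4^{k-1} * (2k-1))
    arcsin_part = comb(2 * k - 2, k - 1) / (4 ** (k - 1) * (2 * k - 1))
    return (sqrt_part + arcsin_part) / pi

# The two contributions nearly cancel; the remainder decays like n^{-2.5},
# so n^{2.5} * coeff(n) should level off at a positive constant.
for n in (10, 100, 1000):
    print(n, coeff(n) * n ** 2.5)
```

The leading n^{-1.5} terms of the two series cancel, leaving an n^{-2.5} remainder; the printed ratios settle near a positive constant, the singularity-analysis rate quoted from Flajolet & Sedgewick (2009) in the proofs below.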

Figure 2: The effective receptive field (ERF) of ResCNTK (left) compared to those of an actual ResNet, CNTK, and CNN (i.e., no skip connections). We followed (Luo et al., 2016) in computing the ERF, where the networks are first trained on CIFAR-10. All values are re-scaled to the [0, 1] interval. We used L = 8 in all cases.

Figure 3: The condition number of ResCGPK-Tr (solid blue) as a function of depth compared to that of CGPK-Tr (solid red) and the corresponding lower and upper bounds (dashed lines) computed with n = 100.

also define the change of variables S l : R l → [d] by S l (a) = |a| + 1 (reminder: | • | on a multi-index means sum of all entries).

which completes the proof.

E.2 PROOF OF THEOREM 6.2 IN THE MAIN TEXT

Our goal in this subsection is to prove the following:

Theorem E.1. Let K̄_ResCGPK^{(L)} and K̄_CGPK^{(L)} respectively denote kernel matrices for the normalized trace kernels ResCGPK and CGPK of depth L, and let B(K) be the double-constant matrix defined for a matrix K as in Lemma 6.1. Then (1) ∃L_0 ∈ N s.t. ∀L ≥ L_0, ρ(B(K̄

We give here the main ideas and leave the technical details to the lemmas. By Lemmas E.3 and E.2, both K̄_ResCGPK^{(L)} and K̄_CGPK^{(L)} tend towards B_{1,1} as L → ∞. As such, B(K̄_ResCGPK^{(L)}) → B_{1,1} as L → ∞, and so we get (1).
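The role of the double-constant matrix B(A) can be sanity-checked numerically. In the sketch below (illustrative only: A is a random normalized kernel matrix standing in for the ResCGPK/CGPK matrices of the theorem), B(A) has unit diagonal and the mean off-diagonal entry b elsewhere, its eigenvalues are 1 + (n-1)b and 1 - b, and the Gershgorin argument of Lemma 6.1 sandwiches the spectrum of A around that of B(A).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 200))
A = X @ X.T
A /= np.sqrt(np.outer(np.diag(A), np.diag(A)))  # normalized kernel: diag(A) = 1

b = (A.sum() - n) / (n * (n - 1))               # mean off-diagonal entry
B = np.full((n, n), b)
np.fill_diagonal(B, 1.0)                        # double-constant matrix B(A)

# A - B(A) has zero diagonal, so Gershgorin bounds its eigenvalues by the
# largest absolute row sum eps; Weyl's inequality then sandwiches spec(A).
eps = np.abs(A - B).sum(axis=1).max()
lam = np.linalg.eigvalsh(A)
assert lam.max() <= 1 + (n - 1) * b + eps + 1e-9   # lambda_max(B(A)) + eps
assert lam.min() >= 1 - b - eps - 1e-9             # lambda_min(B(A)) - eps
print(b, eps, lam.min(), lam.max())
```

Because B(A) has only two distinct eigenvalues, the condition-number bounds derived from it are explicit.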

Let L_0 ∈ N be the minimal such that the entries of B(K̄), and let L ≥ L_0. By Lemma E.3 we get that

Let µ^{(i)}(t) = (1/d) ∑_{j=1}^{d} (κ_1 ∘ … ∘ κ_1)(t_j), with i compositions of κ_1 (an average of the entries of t after i compositions of κ_1), where κ_1(t) = (1/π)(√(1-t²) + (π - arccos(t)) t). Let K̄_CGPK-Tr^{(L)}(t) be the corresponding CGPK-Tr without skip connections. Then

K̄_CGPK-Tr^{(L)}(t) - K̄_Tr^{(L)}(t) ≥ ∑_{l=1}^{L} (µ^{(l)}(t) - µ^{(l-1)}(t)) / (1+α²)^{L-l+1},

where if t ≠ ⃗1 this quantity is strictly positive.

By the assumptions that the diagonal entries of A are 1 and that ∑_{i≠j} A_{ij} ≥ 0, we get that γ_1(A) = 1 - (1/(n(n-1))) ∑_{i≠j} A_{ij} = λ_min(B(A)) and γ_2(A. By the Gershgorin circle theorem, since A - B(A) is a matrix with zero diagonal, every eigenvalue λ of A - B(A) must satisfy |λ| ≤ ϵ. Since A and A - B(A) are symmetric, it holds that λ_max(A) ≤ λ_max(B(A)) + λ_max(A - B(A)) and λ_min(A) ≥ λ_min(B(A)) + λ_min(A - B(A)), from which the lemma follows.

F INFINITE DEPTH DISCUSSION

Lemma C.3 states that for u ∈ R^d, letting t_1 = t_2 = … = u, we obtain K_FC^{(L)}(u) = K^{(L)}(t), where the fully connected ResNTK and ResGPK are defined in Huang et al. (2020). One may ask why Θ_FC^{(L)}(u) = k^{(L)}(t) and not Θ

Intuitively, this happens because the term u appears in the ResGPK and is the only term that is not multiplied by α. So if α decays quickly enough, u becomes the dominant term.
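This dominance is easy to see numerically. In the proof of Lemma E.1, the non-dominant terms are controlled by the factor α²L(1+α²)^{L-1}, which for α = L^{-γ} with γ > 0.5 behaves like L^{1-2γ} → 0. A small sketch (illustrative; γ = 0.75 is an arbitrary choice in (0.5, 1]):

```python
def residual_factor(L, gamma):
    # alpha^2 * L * (1 + alpha^2)^(L-1) with alpha = L^(-gamma):
    # the bound on the non-dominant terms in the proof of Lemma E.1.
    alpha2 = L ** (-2.0 * gamma)
    return alpha2 * L * (1.0 + alpha2) ** (L - 1)

gamma = 0.75
for L in (10, 100, 1000):
    ratio = residual_factor(L, gamma) / L ** (1 - 2 * gamma)
    print(L, residual_factor(L, gamma), ratio)
# The factor itself shrinks with L, while its ratio to L^(1-2*gamma) stays
# bounded (tending to 1), confirming the O(L^(1-2*gamma)) rate.
```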


Proof. We prove this by induction. For l = 0, K_FC^{(0)}(1-t) = 1-t, trivially satisfying the lemma. Suppose the lemma holds for l-1; then using Lemma C.4,

Proof. Using Lemma C.5 we know that for t ↗ 1 it holds that K^{(L)}. Using (Flajolet & Sedgewick, 2009)[Thm. VI.1, page 392] we get that f admits a Taylor expansion ∑_{n≥0} a_n t^n around 0 with a_n ∼ n^{-2.5}. Therefore, K_FC^{(L)}(u) has a Taylor expansion with coefficients that exhibit the same decay. For Θ_FC^{(L)}(u), using (Belfer et al., 2021)[Lemma 4.5] we know that for t ↗ 1 it holds that Θ^{(L)} for some constant c_1 < 0. Similarly to the previous case, using (Flajolet & Sedgewick, 2009)[Thm. VI.1, page 392] we get the desired bound.

Lemma C.7. Both K_Eq^{(L)}(t) and Θ_Eq^{(L)}(t) can be written as ∑_{n≥0} b_n t^n with b_n ≤ c_2 n^{-ν}, where c_2 depends on L.

Proof. By Lemma C.6, K^{(L)} 2.5. Moreover, we have that

The uniqueness of the power series implies

Plugging in Lemma D.8 from (Geifman et al., 2022), we get that b_n ≤ c_2 n^{-3/(2d)-1} for some constant c_2 > 0. For the bound for ResCNTK, since by Lemma C.6 Θ^{(L)} (the different bound comes from the referenced lemma in (Geifman et al., 2022)). Since Eq, by combining the two results we get that Θ_Eq^{(L)}(t) can be written as

Let K̄_Eq^{(L)}(t) be the normalized CGPK-EqNet and k^{(L)}(t) be the matrix whose j'th entry is k_Eq(s_{1+j} t). Similarly define the matrix K̄_Tr(t). By equation (17) we have that:

and similarly it can be readily verified that

(Note that the CGPK is naturally normalized, so we can omit the bar.) We prove this by induction. For L = 1 we have:

Since κ_1 is increasing, using the induction hypothesis we know that -α, and therefore

Applying the induction hypothesis recursively provides the desired result.

Lemma E.4 (Lemma 6.1 in the main text). Let A ∈ R^{n×n} (n ≥ 2) be a normalized kernel matrix with, where λ_max and λ_min denote the maximal and minimal eigenvalues of B(A).

Proof.
For (1), using Marsli (2015)[Theorem 4.4] we have

where

Now because it holds that:

D_{p,p}(t)

Denote by k^{(L,l)}(t)

where by (16) we know that k

where C is some normalizing constant (note that after normalizing, k^{(L,1)}(t) becomes negligible). For such α, Σ_{j,j'}^{(2)}(t) → (1/q) tr Σ_{D_{1,1}}(t) as L → ∞. As such, after normalizing, in the infinite-depth limit the expression tr Σ_{D_{1,1}}(t) becomes the two-layer CGPK (i.e., one hidden layer, denoted by L = 1) with inputs t̄, where t̄_i = (1/q) ∑_{k=-(q-1)/2}^{(q-1)/2} t_{i+k}.

G EIGENVALUE DECAY EXPERIMENT

We use (Geifman et al., 2022)[Lemma A.6] to numerically compute the eigenvalues. Namely, for each frequency in Figure 1 we compute the Gegenbauer polynomials and the kernel, and numerically integrate. Note that as this integration requires evaluating the kernel many times, we are limited to d = 4 and L = 3. To prevent the receptive field from being much larger than d, and in order to better match the CGPK expression from (Geifman et al., 2022), we slightly modify the ResCGPK to include one convolution in every layer instead of two, where the layer ends after the ReLU. Thus the kernel computed, with d = 4, q = 2, α = 1, is:

where β = 0 gives the CGPK from (Geifman et al., 2022) and β = 1 gives the modified ResCGPK.
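For reference, the shape of this computation in a stripped-down setting: the sketch below is a simplified one-dimensional analogue under assumptions, not the actual d = 4, L = 3 computation. We take a single "pixel" on a sphere with ambient dimension 3, where the Gegenbauer polynomials C_k^{(ν)} with ν = (d-2)/2 = 1/2 reduce to Legendre polynomials and the integration weight (1-t²)^{ν-1/2} is constant, and project the kernel profile κ_1 onto each frequency by Gauss-Legendre quadrature.

```python
import numpy as np

def kappa1(t):
    # Arc-cosine kernel profile of degree one.
    return (np.sqrt(1.0 - t ** 2) + (np.pi - np.arccos(t)) * t) / np.pi

# Gauss-Legendre nodes lie strictly inside (-1, 1), so kappa1 is safe to evaluate.
nodes, weights = np.polynomial.legendre.leggauss(400)

def eigenvalue(k):
    # lambda_k is proportional to the integral of kappa1(t) * P_k(t) over [-1, 1]
    # (weight (1 - t^2)^0 = 1 in this d = 3 illustration).
    Pk = np.polynomial.legendre.Legendre.basis(k)(nodes)
    return float(np.sum(weights * kappa1(nodes) * Pk))

lams = [eigenvalue(k) for k in (0, 2, 4, 6)]
print(lams)  # positive and decaying with frequency
```

The actual experiment replaces κ_1 by the depth-3 modified (Res)CGPK evaluated on d = 4 pixels and uses the Gegenbauer family appropriate to the multi-sphere domain, but the projection step is the same.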

