Deep Equals Shallow for ReLU Networks in Kernel Regimes

Abstract

Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones; however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization, and this paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their "shallow" two-layer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere.

1. Introduction

The question of which functions can be well approximated by neural networks is crucial for understanding when these models are successful, and has always been at the heart of the theoretical study of neural networks (e.g., Hornik et al., 1989; Pinkus, 1999). While early works mostly focused on shallow networks with only two layers, more recent works have shown benefits of deep networks for approximating certain classes of functions (Eldan & Shamir, 2016; Mhaskar & Poggio, 2016; Telgarsky, 2016; Daniely, 2017; Yarotsky, 2017; Schmidt-Hieber et al., 2020). Unfortunately, many of these approaches rely on constructions that are not currently known to be learnable using efficient algorithms. A separate line of work has considered over-parameterized networks with random neurons (Neal, 1996), which also display universal approximation properties while additionally providing efficient algorithms based on kernel methods or their approximations such as random features (Rahimi & Recht, 2007; Bach, 2017b). Many recent results on gradient-based optimization of certain over-parameterized networks have been shown to be equivalent to kernel methods with an architecture-specific kernel called the neural tangent kernel (NTK), and thus also fall in this category (e.g., Jacot et al., 2018; Li & Liang, 2018; Allen-Zhu et al., 2019b; Du et al., 2019a;b; Zou et al., 2019). This regime has been coined lazy (Chizat et al., 2019), as it does not capture the common phenomenon where weights move significantly away from random initialization, and thus may not provide a satisfying model for learning adaptive representations. This stands in contrast to other settings such as the mean field or active regime, which capture complex training dynamics where weights may move in a non-trivial manner and adapt to the data (e.g., Chizat & Bach, 2018; Mei et al., 2018).
Nevertheless, one benefit compared to the mean field regime is that the kernel approach easily extends to deep architectures, leading to compositional kernels similar to the ones of Cho & Saul (2009); Daniely et al. (2016). Our goal in this paper is to study the role of depth in determining approximation properties for such kernels, with a focus on fully-connected deep ReLU networks. Our approximation results rely on the study of eigenvalue decays of integral operators associated to the obtained dot-product kernels on the sphere, which are diagonalized in the basis of spherical harmonics. This provides a characterization of the functions in the corresponding reproducing kernel Hilbert space (RKHS) in terms of their smoothness, and leads to convergence rates for non-parametric regression when the data are uniformly distributed on the sphere. We show that for ReLU networks, the eigenvalue decays for the corresponding deep kernels remain the same regardless of the depth of the network. Our key result is that the decay for a certain class of kernels is characterized by a property related to differentiability of the kernel function around the point where the two inputs are aligned. In particular, the property is preserved when adding layers with ReLU activations, showing that depth plays essentially no role for such networks in kernel regimes. This highlights the limitations of the kernel regime for understanding the power of depth in fully-connected networks, and calls for new models of deep networks beyond kernels (see, e.g., Allen-Zhu & Li, 2020; Chen et al., 2020, for recent works in this direction). We also provide applications of our result to other kernels and architectures, and illustrate our results with numerical experiments on synthetic and real datasets.

Related work.
Kernels for deep learning were originally derived by Neal (1996) for shallow networks, and later for deep networks (Cho & Saul, 2009; Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018) . Smola et al. (2001) ; Minh et al. (2006) study regularization properties of dot-product kernels on the sphere using spherical harmonics, and Bach (2017a) derives eigenvalue decays for such dot-product kernels arising from shallow networks with positively homogeneous activations including the ReLU. Extensions to shallow NTK or Laplace kernels are studied by Basri et al. (2019) ; Bietti & Mairal (2019b) ; Geifman et al. (2020) . The observation that depth does not change the decay of the NTK was previously made by Basri et al. (2020) empirically, and Geifman et al. (2020) provide a lower bound on the eigenvalues for deep networks; our work makes this observation rigorous by providing tight asymptotic decays. Spectral properties of wide neural networks were also considered in (Cao et al., 2019; Fan & Wang, 2020; Ghorbani et al., 2019; Xie et al., 2017; Yang & Salman, 2019) . Azevedo & Menegatto (2014) ; Scetbon & Harchaoui (2020) also study eigenvalue decays for dot-product kernels but focus on kernels with geometric decays, while our main focus is on polynomial decays. Additional works on over-parameterized or infinite-width networks in lazy regimes include (Allen-Zhu et al., 2019a; b; Arora et al., 2019a; b; Brand et al., 2020; Lee et al., 2020; Song & Yang, 2019) . Concurrently to our work, Chen & Xu (2021) also studied the RKHS of the NTK for deep ReLU networks, showing that it is the same as for the Laplace kernel on the sphere. They achieve this by studying asymptotic decays of Taylor coefficients of the kernel function at zero using complex-analytic extensions of the kernel functions, and leveraging this to obtain both inclusions between the two RKHSs. 
In contrast, we obtain precise descriptions of the RKHS and regularization properties in the basis of spherical harmonics for various dot-product kernels through spectral decompositions of integral operators, using (real) asymptotic expansions of the kernel function around endpoints. The equality between the RKHS of the deep NTK and Laplace kernel then easily follows from our results by the fact that the two kernels have the same spectral decay.

2.1. Kernels for wide neural networks

Wide neural networks with random weights or weights close to random initialization naturally lead to certain dot-product kernels that depend on the architecture and activation function, which we now present, with a focus on fully-connected architectures.

Random feature kernels. We first consider a two-layer (shallow) network of the form $f(x) = \frac{1}{\sqrt{m}} \sum_{j=1}^m v_j \sigma(w_j^\top x)$, for some activation function $\sigma$. When $w_j \sim N(0, I) \in \mathbb{R}^d$ are fixed and only $v_j \in \mathbb{R}$ are trained with $\ell_2$ regularization, this corresponds to using a random feature approximation (Rahimi & Recht, 2007) of the kernel
$$k(x, x') = \mathbb{E}_{w \sim N(0,I)}[\sigma(w^\top x)\sigma(w^\top x')]. \quad (1)$$
If $x, x'$ are on the sphere, then by spherical symmetry of the Gaussian distribution, one may show that $k$ is invariant to unitary transformations and takes the form $k(x, x') = \kappa(x^\top x')$ for a certain function $\kappa$. More precisely, if $\sigma(u) = \sum_{i \geq 0} a_i h_i(u)$ is the decomposition of $\sigma$ in the basis of Hermite polynomials $h_i$, which are orthogonal w.r.t. the Gaussian measure, then we have (Daniely et al., 2016):
$$\kappa(u) = \sum_{i \geq 0} a_i^2 u^i. \quad (2)$$
Conversely, given a kernel function of the form above with $\kappa(u) = \sum_{i \geq 0} b_i u^i$ with $b_i \geq 0$, one may construct corresponding activations using Hermite polynomials by taking
$$\sigma(u) = \sum_i a_i h_i(u), \quad a_i \in \{\pm\sqrt{b_i}\}. \quad (3)$$
In the case where $\sigma$ is $s$-positively homogeneous, such as the ReLU $\sigma(u) = \max(u, 0)$ (with $s = 1$), or more generally $\sigma_s(u) = \max(u, 0)^s$, the kernel (1) takes the form $k(x, x') = \|x\|^s \|x'\|^s \kappa\big(\frac{x^\top x'}{\|x\| \|x'\|}\big)$ for any $x, x'$. This leads to RKHS functions of the form $f(x) = \|x\|^s g(\frac{x}{\|x\|})$, with $g$ in the RKHS of the kernel restricted to the sphere (Bietti & Mairal, 2019b, Prop. 8). In particular, for the step and ReLU activations $\sigma_0$ and $\sigma_1$, the functions $\kappa$ are given by the following arc-cosine kernels (Cho & Saul, 2009):
$$\kappa_0(u) = \frac{1}{\pi}\left(\pi - \arccos(u)\right), \qquad \kappa_1(u) = \frac{1}{\pi}\left(u \cdot (\pi - \arccos(u)) + \sqrt{1 - u^2}\right). \quad (4)$$
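As a sanity check, the closed-form arc-cosine expression $\kappa_1$ in (4) can be compared against a Monte Carlo estimate of the expectation in (1) for the ReLU activation; a minimal sketch (the helper names are ours, not from the paper's released code; for unit-norm inputs the expectation equals $\kappa_1(u)/2$ with the normalization of (4)):

```python
import numpy as np

def kappa1(u):
    # arc-cosine kernel of degree 1 (ReLU activation), eq. (4)
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u**2)) / np.pi

def mc_relu_kernel(x, y, m=500_000, seed=0):
    # Monte Carlo estimate of E_w[sigma(w'x) sigma(w'y)] in eq. (1), sigma = ReLU
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, x.shape[0]))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

d = 4
rng = np.random.default_rng(1)
x, y = rng.standard_normal(d), rng.standard_normal(d)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

exact = kappa1(x @ y) / 2   # E[sigma(w'x)^2] = 1/2 for unit x, hence the factor 1/2
approx = mc_relu_kernel(x, y)
print(exact, approx)
```

The factor $1/2$ reflects that (4) is normalized so that $\kappa_1(1) = 1$, while the raw expectation at $x = x'$ equals $1/2$.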
Note that given a kernel function $\kappa$, the corresponding activations (3) will generally not be homogeneous, thus the inputs to a random network with such activations need to lie on the sphere (or be appropriately normalized) in order to yield the kernel $\kappa$.

Extension to deep networks. When considering a deep network with more than two layers and fixed random weights before the last layer, the connection to random features is less direct since the features are correlated through intermediate layers. Nevertheless, when the hidden layers are wide enough, one still approaches a kernel obtained by letting the widths go to infinity (see, e.g., Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018), which takes a similar form to the multi-layer kernels of Cho & Saul (2009):
$$k_L(x, x') = \kappa_L(x^\top x') := \underbrace{\kappa \circ \cdots \circ \kappa}_{L-1 \text{ times}}(x^\top x'), \quad (5)$$
for $x, x'$ on the sphere, where $\kappa$ is obtained as described above for a given activation $\sigma$, and $L$ is the number of layers. We still refer to this kernel as the random features (RF) kernel in this paper, noting that it is sometimes known as the "conjugate kernel" or NNGP kernel (for neural network Gaussian process). It is usually good to normalize $\kappa$ such that $\kappa(1) = 1$, so that we also have $\kappa_L(1) = 1$, avoiding exploding or vanishing behavior for deep networks. In practice, this corresponds to using an activation-dependent scaling in the random weight initialization, which is commonly used by practitioners (He et al., 2015).

Neural tangent kernels. When intermediate layers are trained along with the last layer using gradient methods, the resulting problem is non-convex and the statistical properties of such approaches are not well understood in general, particularly for deep networks. However, in a specific over-parameterized regime, it may be shown that gradient descent can reach a global minimum while keeping weights very close to random initialization.
More precisely, for a network $f(x; \theta)$ parameterized by $\theta$ with large width $m$, the model remains close to its linearization around random initialization $\theta_0$ throughout training, that is, $f(x; \theta) \approx f(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f(x; \theta_0) \rangle$. This is also known as the lazy training regime (Chizat et al., 2019). Learning is then equivalent to a kernel method with another architecture-specific kernel known as the neural tangent kernel (NTK, Jacot et al., 2018), given by
$$k_{NTK}(x, x') = \lim_{m \to \infty} \langle \nabla f(x; \theta_0), \nabla f(x'; \theta_0) \rangle. \quad (6)$$
For a simple two-layer network with activation $\sigma$, it is then given by
$$k_{NTK}(x, x') = (x^\top x') \, \mathbb{E}_w[\sigma'(w^\top x)\sigma'(w^\top x')] + \mathbb{E}_w[\sigma(w^\top x)\sigma(w^\top x')]. \quad (7)$$
For a ReLU network with $L$ layers with inputs on the sphere, taking appropriate limits on the widths, one can show (Jacot et al., 2018): $k_{NTK}(x, x') = \kappa_{NTK}^L(x^\top x')$, with $\kappa_{NTK}^1(u) = \kappa^1(u) = u$ and for $\ell = 2, \ldots, L$,
$$\kappa^\ell(u) = \kappa_1(\kappa^{\ell-1}(u)),$$
$$\kappa_{NTK}^\ell(u) = \kappa_{NTK}^{\ell-1}(u) \, \kappa_0(\kappa^{\ell-1}(u)) + \kappa^\ell(u),$$
where $\kappa_0$ and $\kappa_1$ are given in (4).
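The recursion above is straightforward to implement; a small sketch (our own helper names), which also illustrates the normalization discussed in Section 2.1: $\kappa^L(1) = 1$ for the RF kernel, while the unnormalized NTK satisfies $\kappa_{NTK}^L(1) = L$:

```python
import numpy as np

def kappa0(u):
    # arc-cosine kernel of degree 0, eq. (4)
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    # arc-cosine kernel of degree 1, eq. (4)
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u**2)) / np.pi

def deep_kernels(u, L):
    # Returns (kappa^L(u), kappa_NTK^L(u)) for a depth-L ReLU network, following
    # the recursion of Section 2.1: kappa^1 = kappa_NTK^1 = u, and for l = 2..L,
    #   kappa^l     = kappa1(kappa^{l-1}),
    #   kappa_NTK^l = kappa_NTK^{l-1} * kappa0(kappa^{l-1}) + kappa^l.
    k_rf = k_ntk = u
    for _ in range(2, L + 1):
        k_ntk = k_ntk * kappa0(k_rf) + kappa1(k_rf)  # uses kappa^{l-1} before updating
        k_rf = kappa1(k_rf)
    return k_rf, k_ntk

rf5, ntk5 = deep_kernels(1.0, 5)
print(rf5, ntk5)  # kappa^L(1) = 1 and kappa_NTK^L(1) = L
```

Note the loop updates `k_ntk` before `k_rf`, so both terms are evaluated at $\kappa^{\ell-1}$ as the recursion requires.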

2.2. Approximation and harmonic analysis with dot-product kernels

In this section, we recall approximation properties of dot-product kernels on the sphere, through spectral decompositions of integral operators in the basis of spherical harmonics. Further background is provided in Appendix A.

Spherical harmonics and description of the RKHS.

A standard approach to study the RKHS of a kernel is through the spectral decomposition of an integral operator $T$ given by $Tf(x) = \int k(x, y) f(y) \, d\tau(y)$ for some measure $\tau$, leading to Mercer's theorem (e.g., Cucker & Smale, 2002). When inputs lie on the sphere $S^{d-1}$ in $d$ dimensions, dot-product kernels of the form $k(x, x') = \kappa(x^\top x')$ are rotationally-invariant, depending only on the angle between $x$ and $x'$. Similarly to how translation-invariant kernels are diagonalized in the Fourier basis, rotation-invariant kernels are diagonalized in the basis of spherical harmonics (Smola et al., 2001; Bach, 2017a), which leads to connections between eigenvalue decays and regularity as in the Fourier setting. In particular, if $\tau$ denotes the uniform measure on $S^{d-1}$, then $T Y_{k,j} = \mu_k Y_{k,j}$, where $Y_{k,j}$ is the $j$-th spherical harmonic polynomial of degree $k$; here $k$ plays the role of a frequency as in the Fourier case, and the number of such orthogonal polynomials of degree $k$ is given by $N(d, k) = \frac{2k+d-2}{k}\binom{k+d-3}{d-2}$, which grows as $k^{d-2}$ for large $k$. The eigenvalues $\mu_k$ only depend on the frequency $k$ and are given by
$$\mu_k = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \kappa(t) P_k(t) (1 - t^2)^{(d-3)/2} \, dt, \quad (8)$$
where $P_k$ is the Legendre polynomial of degree $k$ in $d$ dimensions (also known as Gegenbauer polynomial when using a different scaling), and $\omega_{d-1}$ denotes the surface of the sphere $S^{d-1}$. Mercer's theorem then states that the RKHS $\mathcal{H}$ associated to the kernel is given by
$$\mathcal{H} = \Bigg\{ f = \sum_{k \geq 0, \mu_k \neq 0} \sum_{j=1}^{N(d,k)} a_{k,j} Y_{k,j}(\cdot) \ \text{ s.t. } \ \|f\|_{\mathcal{H}}^2 := \sum_{k \geq 0, \mu_k \neq 0} \sum_{j=1}^{N(d,k)} \frac{a_{k,j}^2}{\mu_k} < \infty \Bigg\}. \quad (9)$$
In particular, if $\mu_k$ has a fast decay, then the coefficients $a_{k,j}$ of $f$ must also decay quickly with $k$ in order for $f$ to be in $\mathcal{H}$, which means $f$ must have a certain level of regularity.
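The eigenvalue formula (8) can be made concrete numerically. The sketch below (our own code, not from the paper) takes $d = 3$, where $P_k$ is the standard Legendre polynomial, the weight $(1-t^2)^{(d-3)/2}$ is constant, and $\omega_{d-2}/\omega_{d-1} = 2\pi/4\pi = 1/2$; for instance, $\kappa(t) = t^2$ gives $\mu_0 = 1/3$, $\mu_2 = 2/15$, and zero otherwise:

```python
import numpy as np

def legendre(k, t):
    # standard Legendre polynomial P_k (the d = 3 case), three-term recurrence
    p_prev, p = np.ones_like(t), t
    if k == 0:
        return p_prev
    for n in range(2, k + 1):
        p_prev, p = p, ((2 * n - 1) * t * p - (n - 1) * p_prev) / n
    return p

def mu(kappa, k, n_grid=100_001):
    # mu_k of eq. (8) for d = 3: mu_k = (1/2) * \int_{-1}^{1} kappa(t) P_k(t) dt
    t = np.linspace(-1.0, 1.0, n_grid)
    f = kappa(t) * legendre(k, t)
    h = t[1] - t[0]
    return 0.5 * h * (f.sum() - 0.5 * (f[0] + f[-1]))  # trapezoid rule

mus = [mu(lambda t: t**2, k) for k in range(4)]
print(mus)
```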
Similarly to the Fourier case, an exponential decay of $\mu_k$ implies that the functions in $\mathcal{H}$ are infinitely differentiable, while for a polynomial decay $\mathcal{H}$ contains all functions whose derivatives only up to a certain order are bounded, as in Sobolev spaces. If two kernels lead to the same asymptotic decay of $\mu_k$ up to a constant, then by (9) their RKHS norms are equivalent up to a constant, and thus they have the same RKHS. For the specific case of random feature kernels arising from $s$-positively homogeneous activations, Bach (2017a) shows that $\mu_k$ decays as $k^{-d-2s}$ for $k$ of the opposite parity of $s$, and is zero for large enough $k$ of the same parity, which results in an RKHS that contains even or odd functions (depending on the parity of $s$) defined on the sphere with bounded derivatives up to order $\beta := d/2 + s$ (note that $\beta$ must be greater than $(d-1)/2$ in order for the eigenvalues of $T$ to be summable and thus lead to a well-defined RKHS). Bietti & Mairal (2019b) show that the same decay holds for the NTK of two-layer ReLU networks, with $s = 0$ and a change of parity. Basri et al. (2019) show that the parity constraints may be removed by adding a zero-initialized additive bias term when deriving the NTK. We note that one can also obtain rates of approximation for Lipschitz functions from such decay estimates (Bach, 2017a). Our goal in this paper is to extend this to more general dot-product kernels such as those arising from multi-layer networks, by providing a more general approach for obtaining decay estimates from differentiability properties of the function $\kappa$.

Non-parametric regression.
When the data are uniformly distributed on the sphere, we may also obtain convergence rates for non-parametric regression, which typically depend on the eigenvalue decay of the integral operator associated to the marginal distribution on inputs and on the decomposition of the regression function $f^*(x) = \mathbb{E}[y|x]$ in the same basis (e.g., Caponnetto & De Vito, 2007). Then one may achieve optimal rates that depend mainly on the regularity of $f^*$ when using various algorithms with tuned hyperparameters, but the choice of kernel and its decay may have an impact on the rates in some regimes, as well as on the difficulty of the optimization problem (see, e.g., Bach, 2013, Section 4.3).

3. Main Result and Applications to Deep Networks

In this section, we present our main results concerning approximation properties of dot-product kernels on the sphere, and applications to the kernels arising from wide random neural networks. We begin by stating our main theorem, which provides eigenvalue decays for dot-product kernels from differentiability properties of the kernel function $\kappa$ at the endpoints $\pm 1$. We then present applications of this result to various kernels, including those coming from deep networks, showing in particular that the RKHSs associated to deep and shallow ReLU networks are the same (up to parity constraints).

3.1. Statement of our main theorem

We now state our main result regarding the asymptotic eigenvalue decay of dot-product kernels. Recall that we consider a kernel of the form $k(x, y) = \kappa(x^\top y)$ for $x, y \in S^{d-1}$, and seek to obtain decay estimates on the eigenvalues $\mu_k$ defined in (8). The theorem derives the asymptotic decay of $\mu_k$ with $k$ in terms of differentiability properties of $\kappa$ around $\pm 1$, assuming that $\kappa$ is infinitely differentiable on $(-1, 1)$. This latter condition is always verified when $\kappa$ takes the form of a power series (2) with $\kappa(1) = 1$, since the radius of convergence is then at least 1. We also require a technical condition, namely the ability to "differentiate asymptotic expansions" of $\kappa$ at $\pm 1$, which holds for the kernels considered in this work.

Theorem 1 (Decay from regularity of $\kappa$ at endpoints, simplified). Let $\kappa : [-1, 1] \to \mathbb{R}$ be a function that is $C^\infty$ on $(-1, 1)$ and has the following asymptotic expansions around $\pm 1$:
$$\kappa(1 - t) = p_1(t) + c_1 t^\nu + o(t^\nu) \quad (10)$$
$$\kappa(-1 + t) = p_{-1}(t) + c_{-1} t^\nu + o(t^\nu), \quad (11)$$
for $t \geq 0$, where $p_1, p_{-1}$ are polynomials and $\nu > 0$ is not an integer. Also, assume that the derivatives of $\kappa$ admit similar expansions obtained by differentiating the above ones. Then, there is an absolute constant $C(d, \nu)$ depending on $d$ and $\nu$ such that:
• For $k$ even, if $c_1 \neq -c_{-1}$: $\mu_k \sim (c_1 + c_{-1}) C(d, \nu) k^{-d-2\nu+1}$;
• For $k$ odd, if $c_1 \neq c_{-1}$: $\mu_k \sim (c_1 - c_{-1}) C(d, \nu) k^{-d-2\nu+1}$.
In the case $|c_1| = |c_{-1}|$, we have $\mu_k = o(k^{-d-2\nu+1})$ for one of the two parities (or both if $c_1 = c_{-1} = 0$). If $\kappa$ is infinitely differentiable on $[-1, 1]$, so that no such $\nu$ exists, then $\mu_k$ decays faster than any polynomial.

The full theorem is given in Appendix B along with its proof, and requires an additional mild technical condition on the expansion which is verified for all kernels considered in this paper, namely, a finite number of terms in the expansions with exponents between $\nu$ and $\nu + 1$.
The proof relies on integration by parts using properties of Legendre polynomials, in a way reminiscent of the fast decay of Fourier series of differentiable functions, and on precise computations of the decay for simple functions of the form $t \mapsto (1 - t^2)^\nu$. This allows us to obtain the asymptotic decay for general kernel functions $\kappa$ as long as the behavior around the endpoints is known, in contrast to previous approaches which rely on the precise form of $\kappa$, or of the corresponding activation in the case of arc-cosine kernels (Bach, 2017a; Basri et al., 2019; Bietti & Mairal, 2019b; Geifman et al., 2020). This enables the study of more general and complex kernels, such as those arising from deep networks, as discussed below. When $\kappa$ is of the form $\kappa(t) = \sum_k b_k t^k$, the exponent $\nu$ in Theorem 1 is also related to the decay of the coefficients $b_k$. Such coefficients provide a dimension-free description of the kernel which may be useful for instance in the study of kernel methods in certain high-dimensional regimes (see, e.g., El Karoui, 2010; Ghorbani et al., 2019; Liang et al., 2020). We show in Appendix B.1 that the $b_k$ may be recovered from the $\mu_k$ by taking high-dimensional limits $d \to \infty$, and that they decay as $k^{-\nu-1}$.
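The predicted decay can be probed numerically. For $\kappa_0$, where $\nu = 1/2$ (see Section 3.2), and $d = 3$, Theorem 1 predicts $\mu_k \sim C k^{-d-2\nu+1} = C k^{-3}$ along odd $k$; a rough sketch (our own code), using the substitution $t = \cos\theta$ so that $\kappa_0(\cos\theta) = 1 - \theta/\pi$ makes the integrand in (8) smooth:

```python
import numpy as np

def legendre(k, t):
    # standard Legendre polynomial, i.e. Legendre in dimension d = 3
    p_prev, p = np.ones_like(t), t
    if k == 0:
        return p_prev
    for n in range(2, k + 1):
        p_prev, p = p, ((2 * n - 1) * t * p - (n - 1) * p_prev) / n
    return p

def mu_kappa0(k, n_grid=200_001):
    # mu_k of eq. (8) for kappa_0 and d = 3, with t = cos(theta):
    # mu_k = (1/2) \int_0^pi (1 - theta/pi) P_k(cos theta) sin(theta) d theta
    theta = np.linspace(0.0, np.pi, n_grid)
    f = (1 - theta / np.pi) * legendre(k, np.cos(theta)) * np.sin(theta)
    h = theta[1] - theta[0]
    return 0.5 * h * (f.sum() - 0.5 * (f[0] + f[-1]))

# log-log slope along odd frequencies; Theorem 1 predicts roughly -3
k1, k2 = 15, 31
slope = np.log(mu_kappa0(k2) / mu_kappa0(k1)) / np.log(k2 / k1)
print(slope)
```

The measured slope is only asymptotically equal to $-3$, so moderate frequencies give an approximation with $O(1/k)$ corrections.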

3.2. Consequences for ReLU networks

When considering neural networks with ReLU activations, the corresponding random features and neural tangent kernels depend on the arc-cosine functions $\kappa_1$ and $\kappa_0$ defined in (4). These have the following expansions (with generalized exponents) near $+1$:
$$\kappa_0(1 - t) = 1 - \frac{\sqrt{2}}{\pi} t^{1/2} + O(t^{3/2}) \quad (12)$$
$$\kappa_1(1 - t) = 1 - t + \frac{2\sqrt{2}}{3\pi} t^{3/2} + O(t^{5/2}). \quad (13)$$
Indeed, the first follows from integrating the expansion of the derivative using the relation $\frac{d}{dt} \arccos(1 - t) = \frac{1}{\sqrt{2t}\sqrt{1 - t/2}}$, and the second follows from the first using the expression of $\kappa_1$ in (4). Near $-1$, we have by symmetry $\kappa_0(-1 + t) = 1 - \kappa_0(1 - t) = \frac{\sqrt{2}}{\pi} t^{1/2} + O(t^{3/2})$, and we have $\kappa_1(-1 + t) = \frac{2\sqrt{2}}{3\pi} t^{3/2} + O(t^{5/2})$ by using $\kappa_1' = \kappa_0$ and $\kappa_1(-1) = 0$. The ability to differentiate the expansions follows from (Flajolet & Sedgewick, 2009, Theorem VI.8, p. 419), together with a complex-analytic property known as ∆-analyticity, which was shown to hold for RF and NTK kernels by Chen & Xu (2021). By Theorem 1, we immediately obtain a decay of $k^{-d-2}$ for even coefficients of $\kappa_1$, and $k^{-d}$ for odd coefficients of $\kappa_0$, recovering results of Bach (2017a). For the two-layer ReLU NTK, we have $\kappa_{NTK}^2(u) = u\kappa_0(u) + \kappa_1(u)$, leading to an expansion, and thus a decay, similar to that of $\kappa_0$, up to a change of parity due to the factor $u$, which changes sign in the expansion around $-1$; this recovers Bietti & Mairal (2019b). We note that for these specific kernels, Bach (2017a); Bietti & Mairal (2019b) show in addition that coefficients of the opposite parity are exactly zero for large enough $k$, which imposes parity constraints on functions in the RKHS, although such a constraint may be removed in the NTK case by adding a zero-initialized bias term (Basri et al., 2019), leading to a kernel $\kappa_{NTK,b}(u) = (u + 1)\kappa_0(u) + \kappa_1(u)$.

Deep networks. Recall from Section 2.1 that the RF and NTK kernels for deep ReLU networks may be obtained through compositions and products using the functions $\kappa_1$ and $\kappa_0$.
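Before turning to the deep case, the endpoint expansions (12)–(13) can be checked numerically; a small sketch (our code), measuring the remainders at a moderate $t$:

```python
import numpy as np

kappa0 = lambda u: (np.pi - np.arccos(u)) / np.pi
kappa1 = lambda u: (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u**2)) / np.pi

t = 1e-4
# remainder of (12): kappa0(1-t) - (1 - sqrt(2)/pi * t^{1/2}), expected O(t^{3/2})
r0 = kappa0(1 - t) - (1 - np.sqrt(2) / np.pi * t**0.5)
# remainder of (13): kappa1(1-t) - (1 - t + 2 sqrt(2)/(3 pi) * t^{3/2}), expected O(t^{5/2})
r1 = kappa1(1 - t) - (1 - t + 2 * np.sqrt(2) / (3 * np.pi) * t**1.5)
print(r0, r1)
```

Both remainders are several orders of magnitude smaller than the last retained term, consistent with the claimed orders.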
Since asymptotic expansions can be composed and multiplied, we can then obtain expansions for the deep RF and NTK kernels. The resulting statements, given in full in Appendix C, show that such kernels have the same eigenvalue decay as the ones for the corresponding shallow (two-layer) networks (for the NTK, we consider the normalized kernel $\kappa_{NTK}^L / L$, which satisfies $\kappa_{NTK}^L(1)/L = 1$). The proofs, given in Appendix C, use the fact that $\kappa_1 \circ \kappa_1$ and $\kappa_1$ have the same non-integer exponent factors in their expansions, and similarly for $\kappa_0 \circ \kappa_1$ and $\kappa_0$. One benefit compared to the shallow case is that the odd and even coefficients are both non-zero with the same decay, which removes the parity constraints, but as mentioned before, simple modifications of the shallow kernels can yield the same effect.

The finite neuron case. For two-layer networks with a finite number of neurons, the obtained models correspond to random feature approximations of the limiting kernels (Rahimi & Recht, 2007). Then, one may approximate RKHS functions and achieve optimal rates in non-parametric regression as long as the number of random features exceeds a certain degrees-of-freedom quantity (Bach, 2017b; Rudi & Rosasco, 2017), which is similar to standard such quantities in the analysis of ridge regression (Caponnetto & De Vito, 2007), at least when the data are uniformly distributed on the sphere (otherwise the quantity involved may be larger unless features are sampled non-uniformly). Such a number of random features is optimal for a given eigenvalue decay of the integral operator (Bach, 2017b), which implies that shallow random feature architectures provide optimal approximation for the multi-layer ReLU kernels as well, since the shallow and deep kernels have the same decay, up to the parity constraint. In order to overcome this constraint for shallow kernels while preserving the decay, one may consider vector-valued random features of the form $(\sigma(w^\top x), x_1 \sigma(w^\top x), \ldots, x_d \sigma(w^\top x))$ with $w \sim N(0, I)$, leading to a kernel $\kappa_{\sigma,b}(u) = (1 + u)\kappa_\sigma(u)$, where $\kappa_\sigma$ is the random feature kernel corresponding to $\sigma$. With $\sigma(u) = \max(0, u)$, $\kappa_{\sigma,b}$ has the same decay as $\kappa_{RF}^L$, and with $\sigma(u) = 1\{u \geq 0\}$ it has the same decay as $\kappa_{NTK}^L$.
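The fact that these vector-valued random features produce the kernel $(1 + u)\kappa_\sigma(u)$ can be checked by Monte Carlo; a sketch with the ReLU (our code, where the raw ReLU expectation equals $\kappa_1(u)/2$ with the normalization of (4)):

```python
import numpy as np

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u**2)) / np.pi

d, m = 4, 400_000
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

W = rng.standard_normal((m, d))
sx, sy = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
# feature map phi(x) = (sigma(w'x), x_1 sigma(w'x), ..., x_d sigma(w'x)):
# <phi(x), phi(y)> = sigma(w'x) sigma(w'y) * (1 + x'y), so averaging over w
# gives (1 + u) * E[sigma(w'x) sigma(w'y)] with u = x'y
mc = np.mean(sx * sy * (1 + x @ y))
exact = (1 + x @ y) * kappa1(x @ y) / 2
print(mc, exact)
```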

3.3. Extensions to other kernels

We now provide other examples of kernels for which Theorem 1 provides approximation properties thanks to its generality.

Laplace kernel and generalizations. The Laplace kernel $k_c(x, y) = e^{-c\|x - y\|}$ has been found to provide similar empirical behavior to neural networks when fitting randomly labeled data with gradient descent (Belkin et al., 2018). Recently, Geifman et al. (2020) have shown that when inputs are on the sphere, the Laplace kernel has the same decay as the NTK, which may suggest a similar conditioning of the optimization problem as for fully-connected networks, as discussed in Section 2.2. Denoting $\kappa_c(u) = e^{-c\sqrt{1-u}}$, so that $k_c(x, y) = \kappa_{c\sqrt{2}}(x^\top y)$ on the sphere, we may easily recover this result using Theorem 1 by noticing that $\kappa_c$ is infinitely differentiable around $-1$ and satisfies $\kappa_c(1 - t) = e^{-c\sqrt{t}} = 1 - c\sqrt{t} + O(t)$, which yields the same decay $k^{-d}$ as the NTK. Geifman et al. (2020) also consider a heuristic generalization of the Laplace kernel with different exponents, $\kappa_{c,\gamma}(u) = e^{-c(1-u)^\gamma}$. Theorem 1 allows us to obtain a precise decay for this kernel as well using $\kappa_{c,\gamma}(1 - t) = 1 - c t^\gamma + O(t^{2\gamma})$, which is of the form $k^{-d-2\gamma+1}$ for non-integer $\gamma > 0$, and in particular approaches the limiting order of smoothness $(d-1)/2$ when $\gamma \to 0$.

Deep kernels with step activations. We saw in Section 3.2 that for ReLU activations, depth does not change the decay of the corresponding kernels. In contrast, when considering step activations $\sigma(u) = 1\{u \geq 0\}$, we show in Appendix C.3 that approximation properties of the corresponding random neuron kernels (of the form $\kappa_0 \circ \cdots \circ \kappa_0$) improve with depth, leading to a decay $k^{-d-2\nu+1}$ with $\nu = 1/2^{L-1}$ for $L$ layers. This also leads to an RKHS which becomes as large as allowed (order of smoothness close to $(d-1)/2$) when $L \to \infty$.
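The exponent $\nu = 1/2^{L-1}$ for composed step-activation kernels can be observed numerically by estimating the growth rate of $1 - \kappa_0^{\circ(L-1)}(1 - t)$ as $t \to 0$; a rough sketch (our code), for $L = 3$ layers, i.e. two compositions, where $\nu = 1/4$:

```python
import math

def kappa0(u):
    return (math.pi - math.acos(u)) / math.pi

def kappa0_composed(u, times):
    # kappa_0 composed `times` times, i.e. the RF kernel of a network of
    # step activations with times + 1 layers, cf. eq. (5)
    for _ in range(times):
        u = kappa0(u)
    return u

# L = 3 layers (two compositions): expect 1 - kappa(1 - t) ~ c * t^{nu}, nu = 1/4
f = lambda t: 1.0 - kappa0_composed(1.0 - t, 2)
t1, t2 = 1e-12, 1e-10
slope = (math.log(f(t2)) - math.log(f(t1))) / (math.log(t2) - math.log(t1))
print(slope)  # close to 0.25
```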
While this may suggest a benefit of depth, note that step activations make optimization hard for anything beyond a linear regime with random weights, since the gradients with respect to inner neurons vanish. Theorem 1 may also be applied to deep kernels with other positively homogeneous activations $\sigma_s(u) = \max(0, u)^s$ with $s \geq 2$, for which endpoint expansions easily follow from those of $\kappa_0$ or $\kappa_1$ through integration.

Infinitely differentiable kernels. Finally, we note that Theorem 1 shows that kernels associated to infinitely differentiable activations (such kernels are themselves infinitely differentiable, see Daniely et al., 2016), as well as Gaussian kernels on the sphere of the form $e^{-c(1 - x^\top y)}$, have faster decays than any polynomial. This results in a "small" RKHS that only contains smooth functions. See Azevedo & Menegatto (2014); Minh et al. (2006) for a more precise study of the decay for Gaussian kernels on the sphere.
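Returning to the Laplace kernel at the start of this subsection, the identification $k_c(x, y) = \kappa_{c\sqrt{2}}(x^\top y)$ follows from $\|x - y\|^2 = 2(1 - x^\top y)$ on the sphere, and can be sanity-checked along with the endpoint expansion; a quick sketch (our code):

```python
import numpy as np

c, d = 1.7, 5
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

kappa = lambda cc, u: np.exp(-cc * np.sqrt(1.0 - u))   # kappa_c(u) = exp(-c sqrt(1 - u))
laplace = np.exp(-c * np.linalg.norm(x - y))           # k_c(x, y) = exp(-c ||x - y||)
dot_form = kappa(c * np.sqrt(2.0), x @ y)              # ||x - y||^2 = 2 (1 - x'y) on the sphere
print(laplace, dot_form)

# endpoint expansion: kappa_c(1 - t) = 1 - c sqrt(t) + O(t)
t = 1e-6
exp_err = kappa(c, 1.0 - t) - (1.0 - c * np.sqrt(t))
```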

4. Numerical experiments

We now present numerical experiments on synthetic and real data to illustrate our theory. Our code is available at https://github.com/albietz/deep_shallow_kernel.

Synthetic experiments.

We consider randomly sampled inputs on the sphere $S^3$ in 4 dimensions, and outputs generated according to the following target models, for an arbitrary $w \in S^3$: $f_1^*(x) = 1\{w^\top x \geq 0.7\}$ and $f_2^*(x) = e^{-(1 - w^\top x)^{3/2}} + e^{-(1 + w^\top x)^{3/2}}$. Note that $f_1^*$ is discontinuous and thus not in the RKHS in general, while $f_2^*$ is in the RKHS of $\kappa_1$ (since it is even and has the same decay as $\kappa_1$, as discussed in Section 3.3). In Figure 1 we compare the quality of approximation for different kernels by examining the generalization performance of ridge regression with exact kernels or random features. The regularization parameter $\lambda$ is optimized on 10,000 test datapoints over a logarithmic grid. In order to illustrate the difficulty of optimization due to a small optimal $\lambda$, which would also indicate slower convergence with gradient methods, we consider grids with $\lambda \geq \lambda_{\min}$, for two different choices of $\lambda_{\min}$. We see that all kernels provide a similar rate of approximation for a large enough grid, but when fixing a smaller optimization budget by taking a larger $\lambda_{\min}$, the NTK and Laplace kernels can achieve better performance for large sample size $n$, thanks to a slower eigenvalue decay of the covariance operator. Figure 1 (right) shows that when using $m = \sqrt{n}$ random features (which can achieve optimal rates in some settings, see Rudi & Rosasco, 2017), the "shallow" ReLU network performs better than a three-layer version, despite having fewer weights. This suggests that in addition to providing no improvements to approximation in the infinite-width case, the kernel regimes for deep ReLU networks may even be worse than their two-layer counterparts in the finite-width setting.
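A minimal version of this experiment (our own simplified code with hypothetical parameter choices; the paper's full code is at the GitHub link above) runs kernel ridge regression with $\kappa_1$ on the smooth target $f_2^*$:

```python
import numpy as np

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u**2)) / np.pi

def sphere(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_train, n_test = 4, 400, 1000
w = sphere(1, d, rng)[0]

def f2(X):  # smooth even target, in the RKHS of kappa1
    u = np.clip(X @ w, -1.0, 1.0)
    return np.exp(-(1 - u)**1.5) + np.exp(-(1 + u)**1.5)

Xtr, Xte = sphere(n_train, d, rng), sphere(n_test, d, rng)
ytr, yte = f2(Xtr), f2(Xte)

lam = 1e-6  # ridge parameter (illustrative choice, not tuned as in the paper)
K = kappa1(Xtr @ Xtr.T)
alpha = np.linalg.solve(K + n_train * lam * np.eye(n_train), ytr)
mse = np.mean((kappa1(Xte @ Xtr.T) @ alpha - yte) ** 2)
baseline = np.mean((yte - ytr.mean()) ** 2)
print(mse, baseline)
```

The test error should fall well below that of the constant (mean) predictor, consistent with $f_2^*$ lying in the RKHS.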

MNIST and Fashion-MNIST.

In Table 1, we consider the image classification datasets MNIST and Fashion-MNIST, which both consist of 60k training and 10k test images of size 28x28 with 10 output classes. We evaluate one-versus-all classifiers obtained with kernel ridge regression, setting y = 0.9 for the correct label and y = -0.1 otherwise. We train on random subsets of 50k examples and use the remaining 10k examples for validation. We find that test accuracy is comparable for different numbers of layers in RF or NTK kernels, with slightly poorer performance in the two-layer case, likely due to parity constraints, in agreement with our theoretical result that the decay is the same for different L. There is a small decrease in accuracy for growing L, which may reflect changes in the decay constants or numerical errors when composing kernels. The slightly better performance of RF compared to NTK may suggest that these problems are relatively easy (e.g., the regression function is smooth), so that a faster decay is preferable due to better adaptivity to smoothness.

5. Discussion

In this paper, we have analyzed the approximation properties of deep networks in kernel regimes, by studying eigenvalue decays of integral operators through differentiability properties of the kernel function. In particular, the decay is governed by the form of the function's (generalized) power series expansion around ±1, which remains the same for kernels arising from fully-connected ReLU networks of varying depths. This result suggests that the kernel approach is unsatisfactory for understanding the power of depth in fully-connected networks. In particular, it highlights the need to incorporate other regimes in the study of deep networks, such as the mean field regime (Chizat & Bach, 2018; Mei et al., 2018) , and other settings with hierarchical structure (see, e.g., Allen-Zhu & Li, 2020; Chen et al., 2020) . We note that our results do not rule out benefits of depth for other network architectures in kernel regimes; for instance, depth may improve stability properties of convolutional kernels (Bietti & Mairal, 2019a; b) , and a precise study of approximation for such kernels and its dependence on depth would also be of interest. 

A Background on Spherical Harmonics

In this section, we provide some background on spherical harmonics needed for our study of approximation. See (Efthimiou & Frye, 2014; Atkinson & Han, 2012; Ismail, 2005) for references, as well as (Bach, 2017a, Appendix D). We consider inputs on the sphere $S^{d-1} = \{x \in \mathbb{R}^d, \|x\| = 1\}$. We recall some properties of the spherical harmonics $Y_{k,j}$ introduced in Section 2.2. For $j = 1, \ldots, N(d, k)$, where $N(d, k) = \frac{2k+d-2}{k}\binom{k+d-3}{d-2}$, the spherical harmonics $Y_{k,j}$ are homogeneous harmonic polynomials of degree $k$ that are orthonormal with respect to the uniform distribution $\tau$ on the sphere $S^{d-1}$. The degree $k$ plays the role of an integer frequency, as in Fourier series, and the collection $\{Y_{k,j}, k \geq 0, j = 1, \ldots, N(d, k)\}$ forms an orthonormal basis of $L^2(S^{d-1}, d\tau)$. As with Fourier series, there are tight connections between the decay of coefficients in this basis w.r.t. $k$ and regularity/differentiability of functions, in this case differentiability on the sphere. This follows from the fact that spherical harmonics are eigenfunctions of the Laplace-Beltrami operator $\Delta_{S^{d-1}}$ on the sphere (see Efthimiou & Frye, 2014, Proposition 4.5):
$$\Delta_{S^{d-1}} Y_{k,j} = -k(k + d - 2) Y_{k,j}. \quad (14)$$
For a given frequency $k$, we have the following addition formula:
$$\sum_{j=1}^{N(d,k)} Y_{k,j}(x) Y_{k,j}(y) = N(d, k) P_k(x^\top y), \quad (15)$$
where $P_k$ is the $k$-th Legendre polynomial in dimension $d$ (also known as Gegenbauer polynomial when using a different scaling), given by the Rodrigues formula:
$$P_k(t) = \left(-\frac{1}{2}\right)^k \frac{\Gamma(\frac{d-1}{2})}{\Gamma(k + \frac{d-1}{2})} (1 - t^2)^{(3-d)/2} \left(\frac{d}{dt}\right)^k (1 - t^2)^{k + (d-3)/2}. \quad (16)$$
Note that these may also be expressed using the hypergeometric function ${}_2F_1$ (see, e.g., Ismail, 2005, Section 4.5), an expression we will use in the proof of Theorem 1 (see the proof of Lemma 6).
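The Rodrigues formula (16) can be checked numerically in the $d = 3$ case, where it reduces to the classical formula for standard Legendre polynomials (with $\Gamma(1)/\Gamma(k+1) = 1/k!$ and a trivial weight); a sketch (our code), comparing it against the standard three-term recurrence:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def legendre_rodrigues(k, t):
    # d = 3 case of eq. (16): P_k(t) = (-1/2)^k * (1/k!) * (d/dt)^k (1 - t^2)^k
    poly = P.polypow([1.0, 0.0, -1.0], k)   # coefficients of (1 - t^2)^k
    dk = P.polyder(poly, k)                 # k-th derivative
    factorial = np.prod(np.arange(1, k + 1, dtype=float)) if k > 0 else 1.0
    return (-0.5) ** k / factorial * P.polyval(t, dk)

def legendre_rec(k, t):
    # standard Legendre three-term recurrence
    p_prev, p = np.ones_like(t), t
    if k == 0:
        return p_prev
    for n in range(2, k + 1):
        p_prev, p = p, ((2 * n - 1) * t * p - (n - 1) * p_prev) / n
    return p

t = np.linspace(-1.0, 1.0, 11)
err = max(np.max(np.abs(legendre_rodrigues(k, t) - legendre_rec(k, t)))
          for k in range(6))
print(err)
```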
The polynomials $P_k$ are orthogonal in $L^2([-1,1], d\nu)$, where the measure $d\nu$ is given by the weight function $d\nu(t) = (1-t^2)^{(d-3)/2} dt$, and we have

$$\int_{-1}^1 P_k^2(t)\,(1-t^2)^{(d-3)/2}\,dt = \frac{\omega_{d-1}}{\omega_{d-2}} \frac{1}{N(d,k)}, \quad (17)$$

where $\omega_{p-1} = \frac{2\pi^{p/2}}{\Gamma(p/2)}$ denotes the surface area of the sphere $\mathbb{S}^{p-1}$ in $p$ dimensions. Using the addition formula (15) and the orthogonality of spherical harmonics, we can show

$$\int P_j(w^\top x)\, P_k(w^\top y)\, d\tau(w) = \frac{\delta_{jk}}{N(d,k)}\, P_k(x^\top y). \quad (18)$$

We will use two other properties of Legendre polynomials, namely the following recurrence relation (Efthimiou & Frye, 2014, Eq. 4.36)

$$t P_k(t) = \frac{k}{2k+d-2}\, P_{k-1}(t) + \frac{k+d-2}{2k+d-2}\, P_{k+1}(t), \quad (19)$$

for $k \geq 1$ (for $k = 0$ we simply have $t P_0(t) = P_1(t)$), as well as the differential equation (see, e.g., Efthimiou & Frye, 2014, Proposition 4.20):

$$(1-t^2)\, P_k''(t) + (1-d)\, t\, P_k'(t) + k(k+d-2)\, P_k(t) = 0. \quad (20)$$

The Funk-Hecke formula is helpful for computing Fourier coefficients in the basis of spherical harmonics in terms of Legendre polynomials: for any $j = 1, \ldots, N(d,k)$, we have

$$\int f(x^\top y)\, Y_{k,j}(y)\, d\tau(y) = \frac{\omega_{d-2}}{\omega_{d-1}}\, Y_{k,j}(x) \int_{-1}^1 f(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt. \quad (21)$$

For example, we may use this to obtain decompositions of dot-product kernels by computing Fourier coefficients of the functions $\kappa(\langle x, \cdot \rangle)$. Indeed, denoting

$$\mu_k = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^1 \kappa(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt,$$

writing the decomposition of $\kappa(\langle x, \cdot \rangle)$ using (21) leads to the following Mercer decomposition of the kernel:

$$\kappa(x^\top y) = \sum_{k=0}^\infty \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x)\, Y_{k,j}(y) = \sum_{k=0}^\infty \mu_k\, N(d,k)\, P_k(x^\top y). \quad (22)$$

B Proof of Theorem 1

The proof of Theorem 1, stated below in full as Theorem 7, proceeds as follows. We first derive an upper bound on the decay of $\kappa$ of the form $k^{-d-2\nu+3}$ (Lemma 5), which is weaker than the desired $k^{-d-2\nu+1}$, by exploiting regularity properties of $\kappa$ through integration by parts.
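The Mercer decomposition above can be checked numerically. The sketch below (our own illustration; it assumes $d = 3$ and uses $\kappa = \exp$ as an arbitrary smooth test function, with a simple midpoint quadrature rule) computes the coefficients $\mu_k$ and verifies that the truncated series $\sum_k \mu_k N(d,k) P_k(t)$ reconstructs $\kappa(t)$:

```python
import math

def N(d, k):
    """Number of degree-k spherical harmonics on S^{d-1}."""
    if k == 0:
        return 1
    if k == 1:
        return d
    return math.comb(d + k - 1, k) - math.comb(d + k - 3, k - 2)

def legendre(k, d, t):
    """Legendre polynomial in dimension d with P_k(1) = 1 (recurrence (19))."""
    p_prev, p = 1.0, t
    if k == 0:
        return p_prev
    for j in range(1, k):
        p_prev, p = p, ((2 * j + d - 2) * t * p - j * p_prev) / (j + d - 2)
    return p

def omega(p):
    """Surface area of the unit sphere S^{p-1} in R^p: 2 pi^{p/2} / Gamma(p/2)."""
    return 2 * math.pi ** (p / 2) / math.gamma(p / 2)

def mu(kappa, k, d, n=20_000):
    """Legendre coefficient mu_k of the kernel function kappa (midpoint rule)."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        t = -1 + (i + 0.5) * h
        total += kappa(t) * legendre(k, d, t) * (1 - t * t) ** ((d - 3) / 2)
    return (omega(d - 1) / omega(d)) * total * h

# Mercer reconstruction: kappa(t) ~ sum_k mu_k N(d,k) P_k(t)
d, t0 = 3, 0.3
approx = sum(mu(math.exp, k, d) * N(d, k) * legendre(k, d, t0) for k in range(11))
```

Eleven terms already reconstruct $\exp(0.3)$ to several digits, reflecting the fast coefficient decay of a smooth (here analytic) kernel function.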
The goal is then to apply this result to a function $\tilde\kappa = \kappa - \psi$, where $\psi$ is a function that allows us to "cancel" the leading terms in the expansions of $\kappa$, while being simple enough to allow a precise estimate of its decay. In the proof of Theorem 7, we follow this strategy by considering $\psi$ a sum of functions of the form $t \mapsto (1-t^2)^\nu$ and $t \mapsto t(1-t^2)^\nu$, for which we provide a precise computation of the decay in Lemma 6.

Decay upper bound through regularity. We begin by establishing a weak upper bound on the decay of $\kappa$ (Lemma 5) by leveraging its regularity up to the terms of order $(1-t^2)^\nu$. This is achieved by iteratively applying the following integration by parts lemma, which is conceptually similar to integrating by parts on the sphere by leveraging the spherical Laplacian relation (14) in Appendix A, but directly uses properties of $\kappa$ and of Legendre polynomials instead (namely, the differential equation (20)). We note that the final statement in Theorem 1 on infinitely differentiable $\kappa$ directly follows from Lemma 5.

Lemma 4 (Integration by parts lemma). Let $\kappa : [-1,1] \to \mathbb{R}$ be a function that is $C^\infty$ on $(-1,1)$ and such that $\kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} = O(1)$. We have

$$\int_{-1}^1 \kappa(t)\, P_k(t)\, (1-t^2)^{\frac{d-3}{2}}\,dt = \frac{1}{k(k+d-2)} \Big( \big[-\kappa(t)(1-t^2)^{1+\frac{d-3}{2}} P_k'(t)\big]_{-1}^1 \quad (23)$$

$$\qquad + \big[\kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} P_k(t)\big]_{-1}^1 + \int_{-1}^1 \bar\kappa(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt \Big), \quad (24)$$

with $\bar\kappa(t) = -\kappa''(t)(1-t^2) + (d-1)\,t\,\kappa'(t)$.

Proof. In order to perform integration by parts, we use the following differential equation satisfied by Legendre polynomials (see, e.g., Efthimiou & Frye, 2014, Proposition 4.20):

$$(1-t^2)\, P_k''(t) + (1-d)\, t\, P_k'(t) + k(k+d-2)\, P_k(t) = 0. \quad (25)$$

Using this equation, we may write, for $k \geq 1$,

$$\int_{-1}^1 \kappa(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt = \frac{1}{k(k+d-2)} \Big( (d-1) \int t\,\kappa(t)\, P_k'(t)\, (1-t^2)^{\frac{d-3}{2}}\,dt \quad (26)$$

$$\qquad - \int \kappa(t)\, P_k''(t)\, (1-t^2)^{1+\frac{d-3}{2}}\,dt \Big). \quad (27)$$
We may integrate the second term by parts using

$$\frac{d}{dt}\Big[\kappa(t)(1-t^2)^{1+\frac{d-3}{2}}\Big] = \kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} - 2t\Big(1+\frac{d-3}{2}\Big)\kappa(t)(1-t^2)^{\frac{d-3}{2}} = \kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} - (d-1)\,t\,\kappa(t)(1-t^2)^{\frac{d-3}{2}}. \quad (28)$$

Noting that the first term in (26) cancels out with the integral resulting from the second term in (28), we then obtain

$$\int_{-1}^1 \kappa(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt = \frac{1}{k(k+d-2)} \Big( \big[-\kappa(t)(1-t^2)^{1+\frac{d-3}{2}} P_k'(t)\big]_{-1}^1 + \int_{-1}^1 \kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} P_k'(t)\,dt \Big).$$

Integrating by parts once more, the second term becomes

$$\int_{-1}^1 \kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} P_k'(t)\,dt = \big[\kappa'(t)(1-t^2)^{1+\frac{d-3}{2}} P_k(t)\big]_{-1}^1 - \int_{-1}^1 \big(\kappa''(t)(1-t^2) - (d-1)\,t\,\kappa'(t)\big)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt.$$

The desired result follows.

Lemma 5 (Weak upper bound on the decay). Let $\kappa : [-1,1] \to \mathbb{R}$ be a function that is $C^\infty$ on $(-1,1)$ and whose derivatives have the following expansions around ±1:

$$\kappa^{(j)}(t) = p_{j,1}(1-t) + O\big((1-t)^{\nu-j}\big) \quad (30)$$

$$\kappa^{(j)}(t) = p_{j,-1}(1+t) + O\big((1+t)^{\nu-j}\big), \quad (31)$$

for $t \in [-1,1]$ and $j \geq 0$, where $p_{j,1}, p_{j,-1}$ are polynomials and $\nu$ may be non-integer. Then the Legendre coefficients $\mu_k(\kappa)$ of $\kappa$ given in (8) satisfy

$$\mu_k(\kappa) = O\big(k^{-d-2\nu+3}\big). \quad (32)$$

Proof. Let $f_0 := \kappa$ and, for $j \geq 1$,

$$f_j(t) := -f_{j-1}''(t)(1-t^2) + (d-1)\,t\,f_{j-1}'(t). \quad (33)$$

Then $f_j$ is $C^\infty$ on $(-1,1)$ and has expansions similar to those of $\kappa$, of the form

$$f_j(t) = q_{j,1}(1-t) + O\big((1-t)^{\nu-j}\big) \quad (34)$$

$$f_j(t) = q_{j,-1}(1+t) + O\big((1+t)^{\nu-j}\big), \quad (35)$$

for some polynomials $q_{j,\pm1}$. We may apply Lemma 4 repeatedly as long as the terms in brackets vanish, until we obtain, for $j = \lceil \nu + \frac{d-3}{2} \rceil - 1$,

$$\int_{-1}^1 \kappa(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt = \frac{1}{(k(k+d-2))^{j+1}} \Big( \big[f_j(t)(1-t^2)^{1+\frac{d-3}{2}} P_k(t)\big]_{-1}^1 + \int_{-1}^1 f_{j+1}(t)\, P_k(t)\, (1-t^2)^{(d-3)/2}\,dt \Big).$$

Given our choice of $j$, we have $f_j(t)(1-t^2)^{1+\frac{d-3}{2}} = O(1)$ and $f_{j+1}(t)(1-t^2)^{(d-3)/2} = O\big((1-t^2)^{-1+\epsilon}\big)$ for some $\epsilon > 0$. Since $P_k(t) \in [-1,1]$ for any $t \in [-1,1]$, we obtain

$$\mu_k(\kappa) = O\big(k^{-2(j+1)}\big) = O\big(k^{-d-2\nu+3}\big).$$

Precise decay for simple functions.
We now provide precise decay estimates for functions of the form $t \mapsto (1-t^2)^\nu$ and $t \mapsto t(1-t^2)^\nu$, which will lead to the dominant terms in the decomposition of $\kappa$ in the main theorem.

Lemma 6 (Decay for the simple functions $\phi_\nu$ and $\tilde\phi_\nu$). Let $\phi_\nu(t) = (1-t^2)^\nu$, with $\nu > 0$ non-integer, and let $\mu_k(\phi_\nu)$ denote its Legendre coefficients in $d$ dimensions, given by $\frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^1 (1-t^2)^{\nu+(d-3)/2} P_k(t)\,dt$. We have:

• $\mu_k(\phi_\nu) = 0$ if $k$ is odd;
• $\mu_k(\phi_\nu) \sim C(d,\nu)\, k^{-d-2\nu+1}$ for $k$ even, $k \to \infty$, with $C(d,\nu)$ a constant.

Analogously, let $\tilde\phi_\nu(t) := t(1-t^2)^\nu$. We have:

• $\mu_k(\tilde\phi_\nu) = 0$ if $k$ is even;
• $\mu_k(\tilde\phi_\nu) \sim \tilde C(d,\nu)\, k^{-d-2\nu+1}$ for $k$ odd, $k \to \infty$, with $\tilde C(d,\nu)$ a constant.

Proof. We recall the following representation of Legendre polynomials based on the hypergeometric function (e.g., Ismail, 2005, Section 4.5):

$$P_k(t) = {}_2F_1\big(-k,\, k+d-2;\, (d-1)/2;\, (1-t)/2\big), \quad (36)$$

where the hypergeometric function is given in its generalized form by

$${}_pF_q(a_1,\ldots,a_p;\, b_1,\ldots,b_q;\, x) = \sum_{s=0}^\infty \frac{(a_1)_s \cdots (a_p)_s}{(b_1)_s \cdots (b_q)_s} \frac{x^s}{s!},$$

where $(a)_s = \Gamma(a+s)/\Gamma(a)$ is the rising factorial or Pochhammer symbol. Using the above definitions and the integral representation of the Beta function, we then have

$$\int_{-1}^1 (1-t^2)^{\nu+\frac{d-3}{2}} P_k(t)\,dt = 2^{2\nu+d-3} \int_{-1}^1 \Big(\frac{1-t}{2}\Big)^{\nu+\frac{d-3}{2}} \Big(\frac{1+t}{2}\Big)^{\nu+\frac{d-3}{2}} P_k(t)\,dt$$

$$= 2^{2\nu+d-3} \sum_{s=0}^k \frac{(-k)_s\, (d-2+k)_s}{\big(\frac{d-1}{2}\big)_s\, s!} \int_{-1}^1 \Big(\frac{1-t}{2}\Big)^{\nu+\frac{d-3}{2}+s} \Big(\frac{1+t}{2}\Big)^{\nu+\frac{d-3}{2}}\,dt$$

$$= 2^{2\nu+d-2} \sum_{s=0}^k \frac{(-k)_s\, (d-2+k)_s}{\big(\frac{d-1}{2}\big)_s\, s!} \int_0^1 (1-x)^{\nu+\frac{d-3}{2}+s}\, x^{\nu+\frac{d-3}{2}}\,dx$$

$$= 2^{2\nu+d-2} \sum_{s=0}^k \frac{(-k)_s\, (d-2+k)_s}{\big(\frac{d-1}{2}\big)_s\, s!}\, \frac{\Gamma(\nu+s+\frac{d-1}{2})\,\Gamma(\nu+\frac{d-1}{2})}{\Gamma(2\nu+s+d-1)}$$

$$= 2^{2\nu+d-2}\, \frac{\Gamma(\nu+\frac{d-1}{2})^2}{\Gamma(2\nu+d-1)} \sum_{s=0}^k \frac{(-k)_s\, (d-2+k)_s\, \big(\nu+\frac{d-1}{2}\big)_s}{\big(\frac{d-1}{2}\big)_s\, (2\nu+d-1)_s\, s!}$$

$$= 2^{2\nu+d-2}\, \frac{\Gamma(\nu+\frac{d-1}{2})^2}{\Gamma(2\nu+d-1)}\, {}_3F_2\bigg(\begin{matrix} -k,\ k+d-2,\ \nu+(d-1)/2 \\ (d-1)/2,\ 2\nu+d-1 \end{matrix}\,\bigg|\, 1\bigg). \quad (37)$$

Now, we use Watson's theorem (e.g., Ismail, 2005, Eq.
(1.4.12)), which states that

$${}_3F_2\bigg(\begin{matrix} a,\ b,\ c \\ \frac{a+b+1}{2},\ 2c \end{matrix}\,\bigg|\, 1\bigg) = \frac{\Gamma(\frac12)\,\Gamma(c+\frac12)\,\Gamma(\frac{a+b+1}{2})\,\Gamma(c+\frac{1-a-b}{2})}{\Gamma(\frac{a+1}{2})\,\Gamma(\frac{b+1}{2})\,\Gamma(c+\frac{1-a}{2})\,\Gamma(c+\frac{1-b}{2})}. \quad (38)$$

We remark that with $a = -k$, $b = k+d-2$, $c = \nu+(d-1)/2$, our expression above is of the form required by Watson's theorem, and we may thus evaluate $\mu_k$ in closed form. Indeed, we have

$${}_3F_2\bigg(\begin{matrix} -k,\ k+d-2,\ \nu+(d-1)/2 \\ (d-1)/2,\ 2\nu+d-1 \end{matrix}\,\bigg|\, 1\bigg) = \frac{\Gamma(\frac12)\,\Gamma(\nu+\frac d2)\,\Gamma(\frac{d-1}{2})\,\Gamma(\nu+1)}{\Gamma(\frac{1-k}{2})\,\Gamma(\frac{d+k-1}{2})\,\Gamma(\nu+\frac k2+\frac d2)\,\Gamma(\nu+1-\frac k2)}. \quad (39)$$

When $k$ is odd, $(1-k)/2$ is a non-positive integer, so that the denominator is infinite and thus $\mu_k$ vanishes. We assume from now on that $k$ is even, which makes the denominator finite. Using the following relation, for $\ell \notin \mathbb{Z}$ and an integer $n$:

$$\frac{\Gamma(1+\ell)}{\Gamma(\ell-n)} = \ell(\ell-1)\cdots(\ell-n) = (-1)^{n-1}\,\frac{\Gamma(n+1-\ell)}{\Gamma(-\ell)}, \quad (40)$$

we may then rewrite

$${}_3F_2\bigg(\begin{matrix} -k,\ k+d-2,\ \nu+(d-1)/2 \\ (d-1)/2,\ 2\nu+d-1 \end{matrix}\,\bigg|\, 1\bigg) = \frac{\Gamma(\nu+\frac d2)\,\Gamma(\frac{d-1}{2})\,\Gamma(\nu+1)}{\Gamma(-\frac12)\,\Gamma(\nu+2)\,\Gamma(-\nu-1)} \cdot \frac{\Gamma(\frac{k+1}{2})\,\Gamma(\frac k2-\nu)}{\Gamma(\frac{d+k-1}{2})\,\Gamma(\nu+\frac k2+\frac d2)}. \quad (41)$$

When $k \to \infty$, Stirling's formula $\Gamma(x) \sim x^{x-\frac12} e^{-x} \sqrt{2\pi}$ yields the equivalent

$$\frac{\Gamma(\frac{k+1}{2})\,\Gamma(\frac k2-\nu)}{\Gamma(\frac{d+k-1}{2})\,\Gamma(\nu+\frac k2+\frac d2)} \sim \Big(\frac k2\Big)^{-d-2\nu+1}. \quad (42)$$

This yields

$$\mu_k \sim C(d,\nu)\, k^{-d-2\nu+1}, \quad (43)$$

with

$$C(d,\nu) = 2^{2\nu+d-2}\, \frac{\omega_{d-2}}{\omega_{d-1}}\, \frac{\Gamma(\nu+\frac{d-1}{2})^2}{\Gamma(2\nu+d-1)}\, \frac{\Gamma(\nu+\frac d2)\,\Gamma(\frac{d-1}{2})\,\Gamma(\nu+1)}{\Gamma(-\frac12)\,\Gamma(\nu+2)\,\Gamma(-\nu-1)}\, \Big(\frac12\Big)^{-d-2\nu+1}. \quad (44)$$

Decay for $\tilde\phi_\nu$. The decay for $\tilde\phi_\nu$ follows from that of $\phi_\nu$ and the recurrence relation (19) (Efthimiou & Frye, 2014, Eq. (4.36)),

$$t P_k(t) = \frac{k}{2k+d-2}\, P_{k-1}(t) + \frac{k+d-2}{2k+d-2}\, P_{k+1}(t),$$

which ensures the same decay with a change of parity.

Final theorem. We are now ready to prove our main theorem, which differs from the simplified statement of Theorem 1 by the technical assumption that only a finite number $r$ of terms of order between $\nu$ and $\nu+1$ are present in the series expansions around ±1.

Theorem 7 (Main theorem, full version).
Let $\kappa : [-1,1] \to \mathbb{R}$ be a function that is $C^\infty$ on $(-1,1)$ and has the following expansions around ±1:

$$\kappa(t) = p_1(1-t) + \sum_{j=1}^r c_{j,1}(1-t)^{\nu_j} + O\big((1-t)^{\nu_1+1+\epsilon}\big) \quad (46)$$

$$\kappa(t) = p_{-1}(1+t) + \sum_{j=1}^r c_{j,-1}(1+t)^{\nu_j} + O\big((1+t)^{\nu_1+1+\epsilon}\big), \quad (47)$$

for $t \in [-1,1]$, where $p_1, p_{-1}$ are polynomials, $0 < \nu_1 < \ldots < \nu_r$ are not integers, and $0 < \epsilon < \nu_2 - \nu_1$. We also assume that the derivatives $\kappa^{(s)}$ of $\kappa$ have the following expansions:

$$\kappa^{(s)}(t) = p_{s,1}(1-t) + (-1)^s \sum_{j=1}^r c_{j,1}\, \frac{\Gamma(\nu_j+1)}{\Gamma(\nu_j+1-s)}\, (1-t)^{\nu_j-s} + O\big((1-t)^{\nu_1+1+\epsilon-s}\big) \quad (48)$$

$$\kappa^{(s)}(t) = p_{s,-1}(1+t) + \sum_{j=1}^r c_{j,-1}\, \frac{\Gamma(\nu_j+1)}{\Gamma(\nu_j+1-s)}\, (1+t)^{\nu_j-s} + O\big((1+t)^{\nu_1+1+\epsilon-s}\big), \quad (49)$$

for some polynomials $p_{s,\pm1}$. Then we have, for a constant $C(d,\nu_1)$ depending only on $d$ and $\nu_1$:

• For $k$ even, if $c_{1,1} \neq -c_{1,-1}$: $\mu_k \sim (c_{1,1}+c_{1,-1})\, C(d,\nu_1)\, k^{-d-2\nu_1+1}$;
• For $k$ even, if $c_{1,1} = -c_{1,-1}$: $\mu_k = o(k^{-d-2\nu_1+1})$;
• For $k$ odd, if $c_{1,1} \neq c_{1,-1}$: $\mu_k \sim (c_{1,1}-c_{1,-1})\, C(d,\nu_1)\, k^{-d-2\nu_1+1}$;
• For $k$ odd, if $c_{1,1} = c_{1,-1}$: $\mu_k = o(k^{-d-2\nu_1+1})$.

The Taylor coefficients of $\kappa$ at 0 and their decay may be recovered from the Legendre coefficients in $d$ dimensions by taking high-dimensional limits $d \to \infty$. We illustrate this on the functions $\phi_\nu(t) = (1-t^2)^\nu$, for which Lemma 6 provides precise estimates of the Legendre coefficients $\mu_{k,d}(\phi_\nu)$ in $d$ dimensions (this only serves as an instructive illustration, since in this case the Taylor coefficients may be computed directly through a power series expansion of $\phi_\nu$ using the binomial formula).

Lemma 8 (Recovering Taylor coefficients of $\phi_\nu$ through high-dimensional limits). Let $b_k(\phi_\nu) := \frac{\phi_\nu^{(k)}(0)}{k!}$ for some non-integer $\nu > 0$. For $k$ even, we have

$$b_k(\phi_\nu) = C_\nu\, 2^k\, \frac{\Gamma(\frac{k+1}{2})\,\Gamma(\frac k2-\nu)}{\Gamma(k+1)},$$

for a constant $C_\nu$ depending only on $\nu$. This leads to an equivalent $b_k \sim C'_\nu\, k^{-\nu-1}$ for $k \to \infty$ with $k$ even.

Proof. Assume throughout that $k$ is even.
We recall the expression of the Legendre coefficients $\mu_{k,d}(\phi_\nu)$ of $\phi_\nu$ in $d$ dimensions (we include $d$ as a subscript for more clarity here) from the proof of Lemma 6:

$$\mu_{k,d}(\phi_\nu) = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^1 \phi_\nu(t)\, P_{k,d}(t)\, (1-t^2)^{\frac{d-3}{2}}\,dt \quad (58)$$

$$= 2^{2\nu+d-2}\, \frac{\omega_{d-2}}{\omega_{d-1}}\, \frac{\Gamma(\nu+\frac{d-1}{2})^2}{\Gamma(2\nu+d-1)}\, \frac{\Gamma(\nu+\frac d2)\,\Gamma(\frac{d-1}{2})\,\Gamma(\nu+1)}{\Gamma(-\frac12)\,\Gamma(\nu+2)\,\Gamma(-\nu-1)}\, \frac{\Gamma(\frac{k+1}{2})\,\Gamma(\frac k2-\nu)}{\Gamma(\frac{d+k-1}{2})\,\Gamma(\nu+\frac k2+\frac d2)}. \quad (59)$$

Now, note that when $d$ is large enough compared to $k$, we may use the Rodrigues formula (16) and integration by parts to obtain the following alternative expression:

$$\mu_{k,d}(\phi_\nu) = 2^{-k}\, \frac{\omega_{d-2}}{\omega_{d-1}}\, \frac{\Gamma(\frac{d-1}{2})}{\Gamma(k+\frac{d-1}{2})} \int_{-1}^1 \phi_\nu^{(k)}(t)\, (1-t^2)^{k+\frac{d-3}{2}}\,dt.$$

Following similar arguments to Ghorbani et al. (2019), we may then use dominated convergence to show:

$$\frac{\Gamma(\frac d2)}{\sqrt{\pi}\,\Gamma(\frac{d-1}{2})} \int_{-1}^1 \phi_\nu^{(k)}(t)\, (1-t^2)^{k+\frac{d-3}{2}}\,dt \to \phi_\nu^{(k)}(0) \quad \text{as } d \to \infty.$$

Indeed, $\frac{\Gamma(\frac d2)}{\sqrt{\pi}\,\Gamma(\frac{d-1}{2})}(1-t^2)^{(d-3)/2}$ is a probability density that approaches a Dirac mass at 0 when $d \to \infty$. This yields

$$b_k(\phi_\nu) = \frac{\phi_\nu^{(k)}(0)}{k!} = \lim_{d\to\infty} 2^k\, \frac{\omega_{d-1}}{\omega_{d-2}}\, \frac{\Gamma(\frac d2)\,\Gamma(k+\frac{d-1}{2})}{\sqrt{\pi}\,\Gamma(\frac{d-1}{2})\,\Gamma(\frac{d-1}{2})\,\Gamma(k+1)}\, \mu_{k,d}(\phi_\nu).$$

Plugging in (59) and using Stirling's formula to take the limit $d \to \infty$ yields

$$b_k(\phi_\nu) = C_\nu\, 2^k\, \frac{\Gamma(\frac{k+1}{2})\,\Gamma(\frac k2-\nu)}{\Gamma(k+1)},$$

where $C_\nu$ only depends on $\nu$. Using Stirling's formula once again yields the desired equivalent $b_k(\phi_\nu) \sim C'_\nu\, k^{-\nu-1}$ for $k \to \infty$, $k$ even, with a different constant $C'_\nu$. We note that a similar asymptotic equivalent holds for $b_k(\tilde\phi_\nu)$ for $k$ odd.

The next result leverages this to derive asymptotic decays of $b_k(\kappa)$ for any $\kappa$ of the form $\kappa(u) = \sum_{k\geq0} b_k(\kappa)\, u^k$ satisfying similar conditions as in Theorem 7.

Corollary 9 (Taylor coefficients of $\kappa$). Let $\kappa : [-1,1] \to \mathbb{R}$ be a function admitting a power series expansion $\kappa(u) = \sum_{k\geq0} b_k u^k$, with the following expansions around ±1:

$$\kappa(t) = p_1(1-t) + \sum_{j=1}^r c_{j,1}(1-t)^{\nu_j} + O\big((1-t)^{\nu_1+1+\epsilon}\big) \quad (60)$$

$$\kappa(t) = p_{-1}(1+t) + \sum_{j=1}^r c_{j,-1}(1+t)^{\nu_j} + O\big((1+t)^{\nu_1+1+\epsilon}\big), \quad (61)$$

for $t \in [-1,1]$, where $p_1, p_{-1}$ are polynomials and $0 < \nu_1 <$
$\ldots < \nu_r$ are not integers and $0 < \epsilon < \nu_2 - \nu_1$. Then we have, for a constant $C(\nu_1)$ depending only on $\nu_1$:

• For $k$ even, if $c_{1,1} \neq -c_{1,-1}$: $b_k \sim (c_{1,1}+c_{1,-1})\, C(\nu_1)\, k^{-\nu_1-1}$;
• For $k$ even, if $c_{1,1} = -c_{1,-1}$: $b_k = o(k^{-\nu_1-1})$;
• For $k$ odd, if $c_{1,1} \neq c_{1,-1}$: $b_k \sim (c_{1,1}-c_{1,-1})\, C(\nu_1)\, k^{-\nu_1-1}$;
• For $k$ odd, if $c_{1,1} = c_{1,-1}$: $b_k = o(k^{-\nu_1-1})$.

Proof. As in the proof of Theorem 7, we may construct a function $\psi = \sum_j \alpha_j \phi_{\nu_j} + \bar\alpha_j \tilde\phi_{\nu_j}$, with $\alpha_1 = \frac{c_{1,1}+c_{1,-1}}{2^{\nu_1+1}}$ and $\bar\alpha_1 = \frac{c_{1,1}-c_{1,-1}}{2^{\nu_1+1}}$ for $j = 1$, the other terms being of higher orders $\nu_j > \nu_1$, such that $\tilde\kappa := \kappa - \psi$ (which is also a power series with convergence radius $\geq 1$) satisfies

$$\tilde\kappa(t) = p_1(1-t) + O\big((1-t)^{\nu_1+1+\epsilon}\big) \quad (62)$$

$$\tilde\kappa(t) = p_{-1}(1+t) + O\big((1+t)^{\nu_1+1+\epsilon}\big).$$

It follows that $\tilde\kappa^{(\lceil\nu_1\rceil+1)}(1)$ is bounded, so that the Taylor coefficients of $\tilde\kappa$, denoted $b_k(\tilde\kappa)$, satisfy $b_k(\tilde\kappa) = o(k^{-\lceil\nu_1\rceil-1}) = o(k^{-\nu_1-1})$. The result then follows from Lemma 8 by using the decays of $b_k(\phi_\nu)$ and $b_k(\tilde\phi_\nu)$.
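Both decay rates can be sanity-checked numerically (our own illustration, with arbitrarily chosen $d = 3$ and $\nu = 1.5$): the exponent of $\mu_k(\phi_\nu)$ can be read off from the ratio of Gamma functions in (41) via `math.lgamma`, and the exponent of the Taylor coefficients $b_k$ of $\phi_\nu$ from the binomial expansion $b_{2m} = (-1)^m \binom{\nu}{m}$:

```python
import math

d, nu = 3, 1.5  # assumption: any d >= 3 and non-integer nu > 0 behaves the same

def log_mu_even(k):
    """log of the k-dependent factor of mu_k(phi_nu) for even k (from (41))."""
    return (math.lgamma((k + 1) / 2) + math.lgamma(k / 2 - nu)
            - math.lgamma((d + k - 1) / 2) - math.lgamma(nu + k / 2 + d / 2))

def log_abs_b(k):
    """log |b_k| for even k, where b_{2m} = (-1)^m binom(nu, m) is the Taylor
    coefficient of phi_nu(u) = (1 - u^2)^nu at 0."""
    m = k // 2
    return sum(math.log(abs(nu - i + 1) / i) for i in range(1, m + 1))

# estimate decay exponents from log-log slopes between k and 2k
k = 2000
slope_mu = (log_mu_even(2 * k) - log_mu_even(k)) / math.log(2)
slope_b = (log_abs_b(2 * k) - log_abs_b(k)) / math.log(2)
# Lemma 6 predicts slope_mu ~ -(d + 2 nu - 1); Corollary 9 / Lemma 8 predict
# slope_b ~ -(nu + 1)
```

With these parameters, `slope_mu` is close to -5 and `slope_b` to -2.5, matching $k^{-d-2\nu+1}$ and $k^{-\nu-1}$ respectively.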

C Other Proofs

In this section, we provide the proofs for results in Section 3.3 related to obtaining power series expansions (with generalized exponents) of kernels arising from deep networks, which leads to the corresponding decays by Theorem 1. We note that for the kernels considered in this section, we can differentiate the expansions since the kernel function is ∆-analytic (see Chen & Xu, 2021, Theorem 7) , so that the technical assumption in Theorem 1 is verified.

C.1 Proof of Corollary 2

Proof. Let $\kappa^\ell := \kappa_1 \circ \cdots \circ \kappa_1$ ($\ell-1$ times) $= \kappa^\ell_{RF}$. We have

$$\kappa_1(1-t) = 1 - t + c\,t^{3/2} + o(t^{3/2}), \qquad c := \frac{2\sqrt2}{3\pi}.$$

We now show by induction that

$$\kappa^\ell(1-t) = 1 - t + a_\ell\, t^{3/2} + o(t^{3/2}), \qquad a_\ell = (\ell-1)\,c.$$

This is obviously true for $\ell = 2$ since $\kappa^2 = \kappa_1$, and for $\ell \geq 3$ we have

$$\kappa^\ell(1-t) = \kappa_1\big(\kappa^{\ell-1}(1-t)\big) = \kappa_1\big(1 - t + a_{\ell-1}t^{3/2} + o(t^{3/2})\big)$$

$$= 1 - \big(t - a_{\ell-1}t^{3/2} + o(t^{3/2})\big) + c\big(t + O(t^{3/2})\big)^{3/2} + o(t^{3/2})$$

$$= 1 - t + a_{\ell-1}t^{3/2} + c\,t^{3/2}\big(1+O(t^{1/2})\big)^{3/2} + o(t^{3/2})$$

$$= 1 - t + a_{\ell-1}t^{3/2} + c\,t^{3/2}\big(1+O(t^{1/2})\big) + o(t^{3/2}) = 1 - t + a_\ell\, t^{3/2} + o(t^{3/2}),$$

which proves the result. Around $-1$, we know that $\kappa_1(-1+t) = c\,t^{3/2} + o(t^{3/2})$. We then have

$$\kappa^\ell(-1+t) = b_\ell + c_\ell\, t^{3/2} + o(t^{3/2}),$$

with $0 \leq b_\ell < 1$ and $0 < c_\ell \leq c$ (and the upper bound is strict for $\ell \geq 3$). Indeed, this is true for $\ell = 2$, and for $\ell \geq 3$ we have, for $t > 0$,

$$\kappa^\ell(-1+t) = \kappa_1\big(\kappa^{\ell-1}(-1+t)\big) = \kappa_1\big(b_{\ell-1} + c_{\ell-1}t^{3/2} + o(t^{3/2})\big) = \kappa_1(b_{\ell-1}) + \kappa_1'(b_{\ell-1})\,c_{\ell-1}\,t^{3/2} + o(t^{3/2}).$$

Now, note that $\kappa_1$ and $\kappa_1'$ are both positive and strictly increasing on $[0,1]$, with $\kappa_1(1) = \kappa_1'(1) = 1$. Thus, we have $b_\ell = \kappa_1(b_{\ell-1}) \in (0,1)$ and $c_\ell = \kappa_1'(b_{\ell-1})\,c_{\ell-1} < c_{\ell-1}$, thus completing the proof. Since $c_\ell$ is bounded while $a_\ell$ grows linearly with $\ell$, the constants in front of the asymptotic decay $k^{-d-2}$ grow linearly with $\ell$.
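The induction above is easy to verify numerically. In the sketch below (our own illustration), $\kappa_1$ is the degree-one arc-cosine kernel of the ReLU, normalized so that $\kappa_1(1) = 1$; composing it $\ell-1$ times and dividing the correction $\kappa^\ell(1-t) - (1-t)$ by $t^{3/2}$ at a small $t$ recovers $a_\ell = (\ell-1)c$:

```python
import math

def kappa1(u):
    """Degree-1 arc-cosine kernel of ReLU random features, with kappa1(1) = 1."""
    u = min(max(u, -1.0), 1.0)  # guard against floating-point drift
    return (math.sqrt(1 - u * u) + u * (math.pi - math.acos(u))) / math.pi

c = 2 * math.sqrt(2) / (3 * math.pi)
t = 1e-6
results = []
for ell in range(2, 6):
    u = 1 - t
    for _ in range(ell - 1):        # kappa^ell = kappa1 composed (ell-1) times
        u = kappa1(u)
    results.append((u - (1 - t)) / t ** 1.5)
# the induction gives results[ell-2] -> (ell - 1) * c as t -> 0
```

The estimated coefficients match $(\ell-1)c \approx 0.30, 0.60, 0.90, 1.20$ for $\ell = 2, \ldots, 5$, illustrating the linear growth of the constant with depth.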

C.2 Proof of Corollary 3

Proof. We show by induction that $\kappa^\ell_{NTK}$ as defined in (7) satisfies

$$\kappa^\ell_{NTK}(1-t) = \ell - \Big(\sum_{s=1}^{\ell-1} s\Big)\, c\, t^{1/2} + o(t^{1/2}), \qquad c := \frac{\sqrt2}{\pi}.$$

For $\ell = 2$ we have $\kappa^2_{NTK}(u) = u\,\kappa_0(u) + \kappa_1(u)$, so that

$$\kappa^2_{NTK}(1-t) = (1-t)\big(1 - c\,t^{1/2} + o(t^{1/2})\big) + 1 + O(t) = 2 - c\,t^{1/2} + o(t^{1/2}).$$

By induction, for $\ell \geq 3$, we have $\kappa^\ell_{NTK}(u) = \kappa^{\ell-1}_{NTK}(u)\,\kappa_0\big(\kappa^{\ell-1}(u)\big) + \kappa^\ell(u)$, with $\kappa^\ell$ as in the proof of Corollary 2, which hence satisfies $\kappa^\ell(1-t) = 1 - t + o(t)$ for all $\ell \geq 2$. We then have

$$\kappa_0\big(\kappa^{\ell-1}(1-t)\big) = \kappa_0\big(1 - t + o(t)\big) = 1 - c\,(t + o(t))^{1/2} + o(t^{1/2}) = 1 - c\,t^{1/2}\big(1 + o(t^{1/2})\big) + o(t^{1/2}) = 1 - c\,t^{1/2} + o(t^{1/2}).$$

This yields

$$\kappa^\ell_{NTK}(1-t) = \Big(\ell - 1 - \Big(\sum_{s=1}^{\ell-2} s\Big)\,c\,t^{1/2} + o(t^{1/2})\Big)\big(1 - c\,t^{1/2} + o(t^{1/2})\big) + 1 + O(t)$$

$$= \ell - \Big(\sum_{s=1}^{\ell-1} s\Big)\,c\,t^{1/2} + o(t^{1/2}) = \ell - \frac{\ell(\ell-1)}{2}\,c\,t^{1/2} + o(t^{1/2}),$$

which proves the claim for the expansion around $+1$. Around $-1$, recall the expansion from the proof of Corollary 2, $\kappa^\ell(-1+t) = b_\ell + O(t^{3/2})$, with $0 \leq b_\ell < 1$. For $\ell = 2$, we have

$$\kappa^2_{NTK}(-1+t) = (-1+t)\big(c\,t^{1/2} + o(t^{1/2})\big) + b_2 + o(t^{1/2}) = b_2 - c\,t^{1/2} + o(t^{1/2}).$$

Note also that for $\ell \geq 2$,

$$\kappa_0\big(\kappa^\ell(-1+t)\big) = \kappa_0\big(b_\ell + O(t^{3/2})\big) = \kappa_0(b_\ell) + O(t^{3/2}),$$

since $\kappa_0'(b_\ell)$ is finite for $b_\ell < 1$. We also have $\kappa_0(b_\ell) \in (0,1)$ since $\kappa_0$ is positive and strictly increasing on $[0,1]$ with $\kappa_0(1) = 1$. Then, by an easy induction, we obtain

$$\kappa^\ell_{NTK}(-1+t) = a_\ell - c_\ell\, t^{1/2} + o(t^{1/2}),$$

with $a_\ell \leq \ell$ and $0 < c_\ell < c$. Similarly to the case of the RF kernel, the constant in front of $t^{1/2}$ grows as $\ell^2$ for the expansion around $+1$ but is bounded for the expansion around $-1$, so that the final constants in front of the asymptotic decay $k^{-d}$ grow quadratically with $\ell$. However, they grow linearly with $\ell$ when considering the NTK normalized by $\ell$, $\kappa = \kappa^\ell_{NTK}/\ell$, which then satisfies $\kappa(1) = 1$.
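The quadratic growth of the constant can also be checked numerically. The following sketch (ours) implements the NTK recursion from the proof, with $\kappa_0, \kappa_1$ the arc-cosine kernels, and estimates the coefficient of $t^{1/2}$ in $\ell - \kappa^\ell_{NTK}(1-t)$:

```python
import math

def kappa0(u):
    """Degree-0 arc-cosine kernel (step activation), with kappa0(1) = 1."""
    return 1 - math.acos(min(max(u, -1.0), 1.0)) / math.pi

def kappa1(u):
    """Degree-1 arc-cosine kernel (ReLU), with kappa1(1) = 1."""
    u = min(max(u, -1.0), 1.0)
    return (math.sqrt(1 - u * u) + u * (math.pi - math.acos(u))) / math.pi

def ntk(ell, u):
    """kappa^ell_NTK(u) via the recursion
    kappa^ell_NTK = kappa^{ell-1}_NTK * kappa0(kappa^{ell-1}) + kappa^ell,
    starting from the linear first layer kappa^1_NTK(u) = kappa^1(u) = u."""
    k_ntk = k_rf = u
    for _ in range(ell - 1):
        k_ntk, k_rf = k_ntk * kappa0(k_rf) + kappa1(k_rf), kappa1(k_rf)
    return k_ntk

c = math.sqrt(2) / math.pi
t = 1e-8
coefs = [(ell - ntk(ell, 1 - t)) / t ** 0.5 for ell in range(2, 6)]
# the proof gives coefs[ell-2] -> ell*(ell-1)/2 * c, i.e. quadratic growth in ell
```

The estimated coefficients track $\ell(\ell-1)c/2$, confirming the $\ell^2$ growth of the constant around $+1$.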

C.3 Deep networks with step activations

In this section, we study the decay of the random weight kernel arising from deep networks with step activations, as presented in Section 3.3. For an $L$-layer network, this kernel is of the form $\kappa^L_s := \kappa_0 \circ \cdots \circ \kappa_0$ ($L-1$ times).

Corollary 10. $\kappa^L_s$ has a decay $k^{-d-2\nu_L+1}$ with $\nu_L = 1/2^{L-1}$ for $L$ layers.

Proof. We show by induction that we have, for $\ell \geq 2$,

$$\kappa^\ell_s(1-t) = 1 - c^{\sum_{j=0}^{\ell-2} 2^{-j}}\, t^{1/2^{\ell-1}} + o\big(t^{1/2^{\ell-1}}\big), \qquad c := \frac{\sqrt2}{\pi}.$$

This is true for $\ell = 2$ due to the expansion of $\kappa_0$. Now assume it holds for some $\ell \geq 2$. We have

$$\kappa^{\ell+1}_s(1-t) = \kappa_0\big(\kappa^\ell_s(1-t)\big) = \kappa_0\Big(1 - c^{\sum_{j=0}^{\ell-2} 2^{-j}}\, t^{1/2^{\ell-1}} + o\big(t^{1/2^{\ell-1}}\big)\Big)$$

$$= 1 - c\,\Big(c^{\sum_{j=0}^{\ell-2} 2^{-j}}\, t^{1/2^{\ell-1}} + o\big(t^{1/2^{\ell-1}}\big)\Big)^{1/2} + o\Big(\big(t^{1/2^{\ell-1}}\big)^{1/2}\Big)$$

$$= 1 - c^{\sum_{j=0}^{\ell-1} 2^{-j}}\, t^{1/2^\ell}\,(1+o(1)) + o\big(t^{1/2^\ell}\big) = 1 - c^{\sum_{j=0}^{\ell-1} 2^{-j}}\, t^{1/2^\ell} + o\big(t^{1/2^\ell}\big),$$

proving the desired claim. Around $-1$, we have $\kappa_0(-1+t) = c\,t^{1/2} + o(t^{1/2})$, and for $\ell \geq 3$,

$$\kappa^\ell_s(-1+t) = a_\ell + O(t^{1/2}),$$

by an easy induction using the facts that $\kappa_0([0,1)) \subset (0,1)$ and $\kappa_0$ is smooth on $[0,1)$. Thus the behavior around $-1$ does not affect the decay of $\kappa^\ell_s$ for $\ell \geq 3$, and Theorem 1 leads to the desired decay, with a constant that depends on $\ell$ only through $c^{\sum_{j=0}^{\ell-2} 2^{-j}}$, which lies in the interval $[c^2, c]$ for any $\ell$.
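The exponent $\nu_\ell = 1/2^{\ell-1}$ can be confirmed numerically by estimating the log-log slope of $1 - \kappa^\ell_s(1-t)$ against $t$ (our own sketch):

```python
import math

def kappa0(u):
    """Degree-0 arc-cosine kernel (step activation), with kappa0(1) = 1."""
    return 1 - math.acos(min(max(u, -1.0), 1.0)) / math.pi

def kappa_s(ell, u):
    """kappa^ell_s = kappa0 composed (ell - 1) times."""
    for _ in range(ell - 1):
        u = kappa0(u)
    return u

def exponent(ell, t1=1e-12, t2=1e-8):
    """Estimate nu_ell from the log-log slope of 1 - kappa^ell_s(1 - t)."""
    g1 = 1 - kappa_s(ell, 1 - t1)
    g2 = 1 - kappa_s(ell, 1 - t2)
    return (math.log(g2) - math.log(g1)) / (math.log(t2) - math.log(t1))

# Corollary 10 gives nu_ell = 1 / 2**(ell - 1): 0.5, 0.25, 0.125, ...
```

Calling `exponent(2)`, `exponent(3)`, `exponent(4)` returns slopes close to 0.5, 0.25, 0.125, matching the halving of the exponent with each extra layer.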



Review of Approximation with Dot-Product Kernels

In this section, we provide a brief review of the kernels that arise from neural networks and their approximation properties.

Footnotes:

• Here we assume a scaling $2/m$ instead of $1/m$ in the definition of $f$, which yields $\kappa(1) = 1$, a useful normalization for deep networks, as explained below.
• The rates easily extend to distributions with a density w.r.t. the uniform distribution on the sphere, although the eigenbasis on which regularity is measured is then different.
• For $\kappa_c$ and $\kappa_{c,\gamma}$, the ability to differentiate expansions is straightforward, since we have the exact expansion $\kappa_{c,\gamma}(u) = \sum_k c_k (1-u)^{\gamma k}/k!$, which may be differentiated term by term.
• This requires the mild additional condition that each derivative of the activation is in $L^2$ w.r.t. the Gaussian measure.
• Here we normalize such that $P_k(1) = 1$, as is standard for Legendre polynomials, in contrast to (Ismail, 2005), where the standard Jacobi/Gegenbauer normalization is used.
• These are obtained by writing $\psi_j(t) = (a+bt)(1+t)^\nu(1-t)^\nu$ and computing, e.g., the first two terms in the analytic expansion of $t \mapsto (a+bt)(1+t)^\nu$ around 1.



Corollary 2 (Deep RF decay). For the random neuron kernel $\kappa^L_{RF}$ of an $L$-layer ReLU network with $L \geq 3$, we have $\mu_k \sim C(d,L)\, k^{-d-2}$, where $C(d,L)$ differs depending on the parity of $k$ and grows linearly with $L$.

Corollary 3 (Deep NTK decay). For the neural tangent kernel $\kappa^L_{NTK}$ of an $L$-layer ReLU network with $L \geq 3$, we have $\mu_k \sim C(d,L)\, k^{-d}$, where $C(d,L)$ differs depending on the parity of $k$ and grows quadratically with $L$ (it grows linearly with $L$ when considering the normalized NTK $\kappa^L_{NTK}/L$).

Figure 1: (left, middle) Expected squared error vs. sample size $n$ for kernel ridge regression estimators with different kernels on $f^*_1$, with two different budgets on optimization difficulty $\lambda_{\min}$ (the minimum regularization parameter allowed). (right) Ridge regression with one or two layers of random ReLU features on $f^*_2$, with different scalings of the number of "neurons" at each layer in terms of $n$.

Test accuracies on MNIST (left) and Fashion-MNIST (right) for RF and NTK kernels with varying numbers of layers $L$. We use kernel ridge regression on 50k samples, with $\lambda$ optimized on a validation set of size 10k, and report means and standard errors across 5 such random splits of the 60k training samples. For comparison, the Laplace kernel with $c = 1$ yields accuracies $98.39 \pm 0.02$ on MNIST and $90.38 \pm 0.06$ on F-MNIST.

Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 2019.

Acknowledgments

The authors would like to thank David Holzmüller for finding an error in an earlier version of the paper, which led us to include the new assumption on differentiation of asymptotic expansions in Theorem 1. This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support of the European Research Council (grant SEQUOIA 724063).


Proof (of Theorem 7). Define the functions $\psi_j := \alpha_j \phi_{\nu_j} + \bar\alpha_j \tilde\phi_{\nu_j}$ for $j = 1, \ldots, r$, where $\phi_\nu, \tilde\phi_\nu$ are defined in Lemma 6, with $\alpha_1 = \frac{c_{1,1}+c_{1,-1}}{2^{\nu_1+1}}$ and $\bar\alpha_1 = \frac{c_{1,1}-c_{1,-1}}{2^{\nu_1+1}}$. We have asymptotic expansions of the $\psi_j$ around ±1 matching the $(1 \mp t)^{\nu_j}$ terms of $\kappa$, and similarly for $j \geq 2$. Define additionally $\psi_{r+1}$ in the same way as the other $\psi_j$, which satisfies a similar asymptotic expansion as the above ones for $j \geq 2$. One can check that the derivatives of the $\psi_j$ can be expanded with the derivatives of the expansions above. Then, defining $\tilde\kappa = \kappa - \sum_{j=1}^{r+1} \psi_j$, the leading non-polynomial terms in the expansions of $\tilde\kappa$ and of its derivatives around ±1 cancel for $s \geq 0$. The functions $\psi_j$ satisfy the decay estimates of Lemma 6, and by Lemma 5 we have $\mu_k(\tilde\kappa) = o(k^{-d-2\nu_1+1})$. The result then follows from Lemma 6, with a constant $C(d,\nu_1)/2^{\nu_1+1}$, where $C(d,\nu_1)$ is given by the proof of Lemma 6.

B.1 Dimension-free description

While our above description of the RKHS depends on the dimension $d$, in some cases a dimension-free description given by the Taylor coefficients of the kernel $\kappa$ at 0 may be useful, for instance for the study of kernel methods in certain high-dimensional regimes (e.g., El Karoui, 2010; Ghorbani et al., 2019; Liang et al., 2020). Here we remark that such coefficients and their decay may be recovered from the Legendre coefficients in $d$ dimensions, by taking high-dimensional limits $d \to \infty$.

