DEEP NEURAL TANGENT KERNEL AND LAPLACE KERNEL HAVE THE SAME RKHS

Abstract

We prove that the reproducing kernel Hilbert spaces (RKHS) of a deep neural tangent kernel and the Laplace kernel include the same set of functions when both kernels are restricted to the sphere $S^{d-1}$. Additionally, we prove that the exponential power kernel with a smaller power (making the kernel less smooth) leads to a larger RKHS, both when it is restricted to the sphere $S^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$.

1. INTRODUCTION

In the past few years, one of the most seminal discoveries in the theory of neural networks has been the neural tangent kernel (NTK) (Jacot et al., 2018). Gradient flow on a normally initialized, fully connected neural network with a linear output layer in the infinite-width limit turns out to be equivalent to kernel regression with respect to the NTK. (This statement does not necessarily hold for a non-linear output layer, because the NTK is non-constant (Liu et al., 2020).) Through the NTK, theoretical tools from kernel methods were introduced to the study of deep overparametrized neural networks. Theoretical results were thereby established regarding the convergence (Allen-Zhu et al., 2019; Du et al., 2019b; a; Zou et al., 2020), generalization (Cao & Gu, 2019; Arora et al., 2019b), and loss landscape (Kuditipudi et al., 2019) of overparametrized neural networks in the NTK regime. While the NTK has proved to be a powerful theoretical tool, a recent work (Geifman et al., 2020) posed the important question of whether the NTK is significantly different from our repertoire of standard kernels. Prior work provided empirical evidence that supports a negative answer. For example, Belkin et al. (2018) showed experimentally that the Laplace kernel and neural networks had similar performance in fitting random labels. In the task of speech enhancement, exponential power kernels $K^{\gamma,\sigma}_{\exp}(x, y) = e^{-\|x-y\|^{\gamma}/\sigma}$, which include the Laplace kernel as a special case, outperform deep neural networks with even shorter training time (Hui et al., 2019). The experiments in (Geifman et al., 2020) also exhibited similar performance of the Laplace kernel and the NTK. The expressive power of a positive definite kernel can be characterized by its associated reproducing kernel Hilbert space (RKHS) (Saitoh & Sawano, 2016).
The work (Geifman et al., 2020) considered the RKHS of the kernels restricted to the sphere $S^{d-1} \triangleq \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$ and presented a partial answer to the question by showing the following subset inclusion relation

$$H_{\mathrm{Gauss}}(S^{d-1}) \subsetneq H_{\mathrm{Lap}}(S^{d-1}) = H_{N_1}(S^{d-1}) \subseteq H_{N_k}(S^{d-1}),$$

where the four spaces denote the RKHS associated with the Gaussian kernel, the Laplace kernel, and the NTK of two-layer and $(k+1)$-layer ($k \ge 1$) fully connected neural networks, respectively. All four kernels are restricted to $S^{d-1}$. However, the relation between $H_{\mathrm{Lap}}(S^{d-1})$ and $H_{N_k}(S^{d-1})$ remains open in (Geifman et al., 2020). We settle this problem and show that the RKHS of the Laplace kernel and of the NTK with any number of layers include the same set of functions when both are restricted to $S^{d-1}$. In other words, we prove the following theorem.

Theorem 1. Let $H_{\mathrm{Lap}}(S^{d-1})$ and $H_{N_k}(S^{d-1})$ be the RKHS associated with the Laplace kernel $K_{\mathrm{Lap}}(x, y) = e^{-c\|x-y\|}$ ($c > 0$) and the neural tangent kernel of a $(k+1)$-layer fully connected ReLU network. Both kernels are restricted to the sphere $S^{d-1}$. Then the two spaces include the same set of functions: $H_{\mathrm{Lap}}(S^{d-1}) = H_{N_k}(S^{d-1})$, $\forall k \ge 1$.

Our second result is that the exponential power kernel with a smaller power (making the kernel less smooth) leads to a larger RKHS, both when it is restricted to the sphere $S^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$.

Theorem 2. Let $H_{K^{\gamma,\sigma}_{\exp}}(S^{d-1})$ and $H_{K^{\gamma,\sigma}_{\exp}}(\mathbb{R}^d)$ be the RKHS associated with the exponential power kernel $K^{\gamma,\sigma}_{\exp}(x, y) = \exp\left(-\frac{\|x-y\|^{\gamma}}{\sigma}\right)$ ($\gamma, \sigma > 0$) when it is restricted to the unit sphere $S^{d-1}$ and defined on the entire $\mathbb{R}^d$, respectively. Then we have the following RKHS inclusions:
(1) If $0 < \gamma_1 < \gamma_2 < 2$, then $H_{K^{\gamma_2,\sigma_2}_{\exp}}(S^{d-1}) \subseteq H_{K^{\gamma_1,\sigma_1}_{\exp}}(S^{d-1})$.
(2) If $0 < \gamma_1 < \gamma_2 < 2$ are rational, then $H_{K^{\gamma_2,\sigma_2}_{\exp}}(\mathbb{R}^d) \subseteq H_{K^{\gamma_1,\sigma_1}_{\exp}}(\mathbb{R}^d)$.
When restricted to the unit sphere, the RKHS of the exponential power kernel with $\gamma < 1$ is even larger than that of the NTK. This result partially explains the observation in (Hui et al., 2019) that the best performance is attained by a highly non-smooth exponential power kernel with $\gamma < 1$. Geifman et al. (2020) applied the exponential power kernel and the NTK to classification and regression tasks on the UCI dataset and other large scale datasets. Their experimental results also showed that the exponential power kernel slightly outperforms the NTK.

1.1. FURTHER RELATED WORK

Minh et al. (2006) showed the complete spectrum of the polynomial and Gaussian kernels on $S^{d-1}$. They also gave a recursive relation for the eigenvalues of the polynomial kernel on the hypercube $\{-1, 1\}^d$. Prior to the NTK (Jacot et al., 2018), Cho & Saul (2009) presented a pioneering study on kernel methods for neural networks. Bach (2017) studied the eigenvalues of positively homogeneous activation functions of the form $\sigma_\alpha(u) = \max\{u, 0\}^{\alpha}$ (e.g., the ReLU activation when $\alpha = 1$) in their Mercer decomposition with Gegenbauer polynomials. Using the results in (Bach, 2017), Bietti & Mairal (2019) analyzed the two-layer NTK and its RKHS in order to investigate the inductive bias in the NTK regime. They studied the Mercer decomposition of the two-layer NTK with ReLU activation on $S^{d-1}$ and characterized the corresponding RKHS by showing the asymptotic decay rate of the eigenvalues in the Mercer decomposition with Gegenbauer polynomials. In their derivation of a more concise expression of the ReLU NTK, they used the calculation of (Cho & Saul, 2009) on arc-cosine kernels of degree 0 and 1. Cao et al. (2019) improved the eigenvalue bound for the $k$-th eigenvalue derived in (Bietti & Mairal, 2019) when $d \gg k$. Geifman et al. (2020) used the results in (Bietti & Mairal, 2019) and considered the two-layer ReLU NTK with bias $\beta$ initialized with zero, rather than initialized with a normal distribution (Jacot et al., 2018). However, neither (Bietti & Mairal, 2019) nor (Geifman et al., 2020) went beyond two layers when they tried to characterize the RKHS of the ReLU NTK. This line of work (Bach, 2017; Bietti & Mairal, 2019; Geifman et al., 2020) is closely related to the Mercer decomposition with spherical harmonics. Interested readers are referred to (Atkinson & Han, 2012) for spherical harmonics on the unit sphere. The concurrent work (Bietti & Bach, 2021) analyzed the eigenvalues of the ReLU NTK. Arora et al. (2019a) presented a dynamic programming algorithm that computes the convolutional NTK with ReLU activation. Yang & Salman (2019) analyzed the spectra of the conjugate kernel (CK) and NTK on the boolean cube. Fan & Wang (2020) studied the spectrum of the Gram matrix of training samples under the CK and NTK and showed that their eigenvalue distributions converge to a deterministic limit. The limit depends on the eigenvalue distribution of the training samples.

2. PRELIMINARIES

Let $\mathbb{C}$ denote the set of all complex numbers and write $i \triangleq \sqrt{-1}$. For $z \in \mathbb{C}$, write $\Re z$, $\Im z$, and $\arg z \in (-\pi, \pi]$ for its real part, imaginary part, and argument, respectively. Let $\mathbb{H}^{+} \triangleq \{z \in \mathbb{C} \mid \Im z > 0\}$ and $\mathbb{H}^{-} \triangleq \{z \in \mathbb{C} \mid \Im z < 0\}$ denote the upper and lower half-planes. Suppose that $f(z)$ has a power series representation $f(z) = \sum_{n \ge 0} a_n z^n$ around 0. Denote by $[z^n]f(z) \triangleq a_n$ the coefficient of the $n$-th order term. For two sequences $\{a_n\}$ and $\{b_n\}$, write $a_n \sim b_n$ if $\lim_{n\to\infty} \frac{a_n}{b_n} = 1$. Similarly, for two functions $f(z)$ and $g(z)$, write $f(z) \sim g(z)$ as $z \to z_0$ if $\lim_{z \to z_0} \frac{f(z)}{g(z)} = 1$. We also use big-$O$ and little-$o$ notation to characterize asymptotics. Write $\mathcal{L}\{f(t)\}(s) \triangleq \int_0^{\infty} f(t) e^{-st}\,dt$ for the Laplace transform of a function $f(t)$. The inverse Laplace transform of $F(s)$ is denoted by $\mathcal{L}^{-1}\{F(s)\}(t)$.

2.1. POSITIVE DEFINITE KERNELS

For any positive definite kernel function $K(x, y)$ defined for $x, y \in E$, denote by $H_K(E)$ its associated reproducing kernel Hilbert space (RKHS). For any two positive definite kernel functions $K_1$ and $K_2$, we write $K_1 \preccurlyeq K_2$ if $K_2 - K_1$ is a positive definite kernel. For a complete review of results on kernels and RKHS, please see (Saitoh & Sawano, 2016).

We will study positive definite zonal kernels on the sphere $S^{d-1} = \{x \in \mathbb{R}^d \mid \|x\| = 1\}$. For a zonal kernel $K(x, y)$, there exists a real function $\tilde{K} : [-1, 1] \to \mathbb{R}$ such that $K(x, y) = \tilde{K}(u)$, where $u = x^{\top} y$. We abuse the notation and use $K(u)$ to denote $\tilde{K}(u)$, i.e., $K(u)$ here is a real function on $[-1, 1]$. In the sequel, we introduce the instances of positive definite kernels that this paper will investigate.

Laplace Kernel. The Laplace kernel $K_{\mathrm{Lap}}(x, y) = e^{-c\|x-y\|}$ with $c > 0$ restricted to the sphere $S^{d-1}$ is given by

$$K_{\mathrm{Lap}}(x, y) = e^{-c\sqrt{2(1 - x^{\top} y)}} = e^{-\tilde{c}\sqrt{1-u}} \triangleq K_{\mathrm{Lap}}(u),$$

where by our convention $u = x^{\top} y$ and $\tilde{c} \triangleq \sqrt{2}c > 0$. We denote its associated RKHS by $H_{\mathrm{Lap}}$.

Exponential Power Kernel. The exponential power kernel (Hui et al., 2019) with $\gamma > 0$ and $\sigma > 0$ is given by $K^{\gamma,\sigma}_{\exp}(x, y) = \exp\left(-\frac{\|x-y\|^{\gamma}}{\sigma}\right)$. If $x$ and $y$ are restricted to the sphere $S^{d-1}$, we have $K^{\gamma,\sigma}_{\exp}(x, y) = \exp\left(-\frac{(2(1 - x^{\top} y))^{\gamma/2}}{\sigma}\right)$.

Neural Tangent Kernel. Given the input $x \in \mathbb{R}^d$ (we define $d_0 \triangleq d$) and parameter $\theta$, this paper considers the following network model with $(k+1)$ layers

$$f_\theta(x) = w^{\top} \sqrt{\tfrac{2}{d_k}}\,\sigma\Big(W_k \sqrt{\tfrac{2}{d_{k-1}}}\,\sigma\Big(\cdots \sqrt{\tfrac{2}{d_2}}\,\sigma\Big(W_2 \sqrt{\tfrac{2}{d_1}}\,\sigma(W_1 x + \beta b_1) + \beta b_2\Big) \cdots\Big) + \beta b_k\Big) + \beta b_{k+1},$$

where the parameter $\theta$ encodes $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, $b_l \in \mathbb{R}^{d_l}$ ($l = 1, \dots, k$), and $w \in \mathbb{R}^{d_k}$. The NTK $N_k$ of this network and the auxiliary kernel $\Sigma_k$ satisfy the recursion

$$\Sigma_k(x, y) = \sqrt{\Sigma_{k-1}(x, x)\Sigma_{k-1}(y, y)}\;\kappa_1\!\left(\frac{\Sigma_{k-1}(x, y)}{\sqrt{\Sigma_{k-1}(x, x)\Sigma_{k-1}(y, y)}}\right),$$
$$N_k(x, y) = \Sigma_k(x, y) + N_{k-1}(x, y)\,\kappa_0\!\left(\frac{\Sigma_{k-1}(x, y)}{\sqrt{\Sigma_{k-1}(x, x)\Sigma_{k-1}(y, y)}}\right) + \beta^2, \tag{2}$$

where $\kappa_0$ and $\kappa_1$ are the arc-cosine kernels of degree 0 and 1 (Cho & Saul, 2009) given by

$$\kappa_0(u) = \frac{1}{\pi}\big(\pi - \arccos(u)\big), \qquad \kappa_1(u) = \frac{1}{\pi}\Big(u \cdot \big(\pi - \arccos(u)\big) + \sqrt{1 - u^2}\Big).$$
The initial conditions are $N_0(x, y) = u + \beta^2$ and $\Sigma_0(x, y) = u$, where $u = x^{\top} y$ by our convention. The NTKs defined in (Bietti & Mairal, 2019) and (Geifman et al., 2020) are slightly different. There is no bias term $\beta^2$ in (Bietti & Mairal, 2019), while the bias term appears in (Geifman et al., 2020). We adopt the more general setup with the bias term.

Lemma 3 (Proof in Appendix A.1). $\Sigma_k(x, x) = 1$ for any $x \in S^{d-1}$ and $k \ge 0$.

Lemma 3 simplifies (2) and gives

$$\Sigma_k(u) = \kappa_1^{(k)}(u), \qquad N_k(u) = \kappa_1^{(k)}(u) + N_{k-1}(u)\,\kappa_0\big(\kappa_1^{(k-1)}(u)\big) + \beta^2, \tag{4}$$

where $\kappa_1^{(k)}(u) \triangleq \underbrace{\kappa_1(\kappa_1(\cdots \kappa_1(\kappa_1}_{k}(u)) \cdots))$ is the $k$-th iterate of $\kappa_1(u)$. For example, $\kappa_1^{(0)}(u) = u$, $\kappa_1^{(1)}(u) = \kappa_1(u)$, and $\kappa_1^{(2)}(u) = \kappa_1(\kappa_1(u))$. We present a detailed derivation of (4) in Appendix A.2.
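The simplified recursion (4) can be evaluated directly for real $u \in [-1, 1]$. The following is a minimal sketch under that reading; the function names `kappa0`, `kappa1`, and `ntk` are illustrative choices and do not come from the paper, and inputs are clamped to $[-1, 1]$ only to guard against floating-point drift.

```python
import math

def kappa0(u):
    """Arc-cosine kernel of degree 0 on [-1, 1]."""
    u = max(-1.0, min(1.0, u))  # guard against rounding outside [-1, 1]
    return (math.pi - math.acos(u)) / math.pi

def kappa1(u):
    """Arc-cosine kernel of degree 1 on [-1, 1]."""
    u = max(-1.0, min(1.0, u))
    return (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi

def ntk(u, k, beta=0.0):
    """N_k(u) from recursion (4), starting from N_0(u) = u + beta^2."""
    sigma = u              # kappa_1^{(0)}(u) = u
    n = u + beta ** 2      # N_0(u)
    for _ in range(k):
        prev = sigma                       # kappa_1^{(j-1)}(u)
        sigma = kappa1(prev)               # kappa_1^{(j)}(u)
        n = sigma + n * kappa0(prev) + beta ** 2
    return n
```

Since $\kappa_0(1) = \kappa_1(1) = 1$, the recursion gives $N_k(1) = (k+1)(1+\beta^2)$, which is a convenient sanity check for an implementation.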

3. RESULTS ON NEURAL TANGENT KERNEL

In this section, we present an overview of our proof of Theorem 1. Since (Geifman et al., 2020) showed $H_{\mathrm{Lap}}(S^{d-1}) \subseteq H_{N_k}(S^{d-1})$, it suffices to prove the reverse inclusion $H_{N_k}(S^{d-1}) \subseteq H_{\mathrm{Lap}}(S^{d-1})$. We then relate positive definite kernels with their RKHS according to the following lemma.

Lemma 4 ((Aronszajn, 1950, p. 354) and (Saitoh & Sawano, 2016, Theorem 2.17)). Let $K_1, K_2 : \Omega \times \Omega \to \mathbb{C}$ be two positive definite kernels. Then the Hilbert space $H_{K_1}$ is a subset of $H_{K_2}$ if and only if there exists some constant $\gamma > 0$ such that $K_1 \preccurlyeq \gamma^2 K_2$.

Lemma 4 implies that in order to show $H_{N_k}(S^{d-1}) \subseteq H_{\mathrm{Lap}}(S^{d-1})$, it suffices to show that $\gamma^2 K_{\mathrm{Lap}} - N_k$ is a positive definite kernel for some $\gamma > 0$. Note that both $K_{\mathrm{Lap}}$ and $N_k$ are positive definite kernels on the unit sphere. Then the Maclaurin series of $K_{\mathrm{Lap}}(u)$ and $N_k(u)$ have all non-negative coefficients by the classical approximation theory; see (Schoenberg, 1942, Theorem 2), Bingham (1973), and (Cheney & Light, 2009, Chapter 17). Conversely, if the Maclaurin series of $K(u)$ has all non-negative coefficients, $K(x, y) = K(x^{\top} y)$ is a positive definite kernel on the unit sphere. To be precise, we have the following lemma.

Lemma 5 (Schoenberg (1942); Bingham (1973)). Suppose that $K(x, y) = f(x^{\top} y)$ where $x, y \in S^{d-1}$ and $f$ is continuous on $[-1, 1]$. Then $K$ is a positive definite kernel on $S^{d-1}$ for every $d$ if and only if $f(u) = \sum_{k=0}^{\infty} a_k u^k$, in which $a_k \ge 0$ and $\sum_{k=0}^{\infty} a_k < \infty$.

Thus, we aim to show that there exists $\gamma > 0$ such that $\gamma^2 [z^n]K_{\mathrm{Lap}}(z) \ge [z^n]N_k(z)$ holds for every $n \ge 0$. Exact calculation of the asymptotic rate of the Maclaurin coefficients is intractable for $N_k$ due to its recursive definition. Instead, we apply singularity analysis tools from analytic combinatorics. We refer the readers to (Flajolet & Sedgewick, 2009) for a systematic introduction. We treat all (zonal) kernels, $K_{\mathrm{Lap}}(u)$, $N_k(u)$, $\kappa_0(u)$, and $\kappa_1(u)$, as complex functions of the variable $u \in \mathbb{C}$.
To emphasize this, we use $z \in \mathbb{C}$ instead of $u$ to denote the variable. The theory of analytic combinatorics states that the asymptotics of the coefficients of the Maclaurin series are determined by the local nature of the complex function at its dominant singularities (i.e., the singularities closest to $z = 0$). To apply the methodology from (Flajolet & Sedgewick, 2009), we introduce some additional definitions. For $R > 1$ and $\phi \in (0, \pi/2)$, the $\Delta$-domain $\Delta(\phi, R)$ is defined by

$$\Delta(\phi, R) \triangleq \{z \in \mathbb{C} \mid |z| < R,\ z \ne 1,\ |\arg(z - 1)| > \phi\}.$$

For a complex number $\zeta \ne 0$, a $\Delta$-domain at $\zeta$ is the image by the mapping $z \mapsto \zeta z$ of $\Delta(\phi, R)$ for some $R > 1$ and $\phi \in (0, \pi/2)$. A function is $\Delta$-analytic at $\zeta$ if it is analytic on a $\Delta$-domain at $\zeta$. Suppose the function $f(z)$ has only one dominant singularity and without loss of generality assume that it lies at $z = 1$. We then have the following lemma.

Lemma 6 ((Flajolet & Sedgewick, 2009, Corollary VI.1)). If $f$ is $\Delta$-analytic at its dominant singularity 1 and $f(z) \sim (1 - z)^{-\alpha}$ as $z \to 1$, $z \in \Delta$, with $\alpha \notin \{0, -1, -2, \dots\}$, we have $[z^n]f(z) \sim \frac{n^{\alpha - 1}}{\Gamma(\alpha)}$.

If the function has multiple dominant singularities, the influence of each singularity is added up (see (Flajolet & Sedgewick, 2009, Theorem VI.5) for more details). Careful singularity analysis then gives

$$[z^n]K_{\mathrm{Lap}}(z) \sim C_1 n^{-3/2}, \qquad [z^n]N_k(z) \le C_2 n^{-3/2},$$

for some positive constants $C_1, C_2 > 0$. We refer to Section 3.2 and Appendix A.4 for more detailed steps. They are indeed of the same order of decay rate $n^{-3/2}$, which implies that such $\gamma$ exists. This shows $H_{N_k}(S^{d-1}) \subseteq H_{\mathrm{Lap}}(S^{d-1})$.
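Lemma 6 can be checked numerically in the simplest case $f(z) = (1 - z)^{-\alpha}$, whose Maclaurin coefficients obey a one-term recurrence. The sketch below is illustrative only; the helper names `binomial_series_coeff` and `lemma6_prediction` are hypothetical and not part of the paper.

```python
import math

def binomial_series_coeff(alpha, n):
    """Exact [z^n] (1 - z)^(-alpha), via c_m = c_{m-1} * (m - 1 + alpha) / m."""
    c = 1.0
    for m in range(1, n + 1):
        c *= (m - 1 + alpha) / m
    return c

def lemma6_prediction(alpha, n):
    """Leading-order asymptotic n^(alpha - 1) / Gamma(alpha) from Lemma 6."""
    return n ** (alpha - 1) / math.gamma(alpha)

# alpha = -1/2 corresponds to sqrt(1 - z), the singular term behind the n^(-3/2) rate.
for alpha in (-0.5, 0.5, 1.5):
    exact = binomial_series_coeff(alpha, 2000)
    print(alpha, exact / lemma6_prediction(alpha, 2000))
```

For $\alpha = -1/2$ both quantities are negative, since $\Gamma(-1/2) = -2\sqrt{\pi}$; this is why a term $-C\sqrt{1-z}$ with $C > 0$ yields positive coefficients of order $n^{-3/2}$.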

3.1. ∆-ANALYTICITY OF NEURAL TANGENT KERNELS

We present the $\Delta$-analyticity of the NTKs here. In light of (4), the NTKs $N_k$ are compositions of the arc-cosine kernels $\kappa_0$ and $\kappa_1$. We analytically extend $\kappa_0$ and $\kappa_1$ to complex functions of a complex variable $z \in \mathbb{C}$. Both complex functions $\arccos(z)$ and $\sqrt{1 - z^2}$ have branch points at $z = \pm 1$. Therefore, the branch cut of $\kappa_0(z)$ and $\kappa_1(z)$ is $[1, \infty) \cup (-\infty, -1]$. They have a single-valued analytic branch on $D = \mathbb{C} \setminus [1, \infty) \setminus (-\infty, -1]$. On this branch, we have

$$\kappa_0(z) = \frac{\pi + i \log\big(z + i\sqrt{1 - z^2}\big)}{\pi}, \qquad \kappa_1(z) = \frac{1}{\pi}\Big[z \cdot \Big(\pi + i \log\big(z + i\sqrt{1 - z^2}\big)\Big) + \sqrt{1 - z^2}\Big],$$

where we use the principal value of the logarithm and square root. We then show that the dominant singularities of $\kappa_1^{(k)}(z)$ are $\pm 1$ and that $\kappa_1^{(k)}(z)$ is $\Delta$-analytic at $\pm 1$ for any $k \ge 1$. We further have the following theorem on the $\Delta$-singularity for $N_k$.

Theorem 7 (Proof in Appendix A.3). For each $k \ge 1$, the dominant singularities of $N_k$ are $\pm 1$. There exists $R_k > 1$ such that $N_k$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R_k\} \cap D$, where $D = \mathbb{C} \setminus [1, \infty) \setminus (-\infty, -1]$.

3.2. ASYMPTOTIC RATES OF MACLAURIN COEFFICIENTS FOR N k

The following theorem gives the asymptotic rate of the Maclaurin coefficients of $N_k$.

Theorem 8 (Proof in Appendix A.4). The $n$-th order coefficient of the Maclaurin series of the $(k+1)$-layer NTK in (2) satisfies $[z^n]N_k(z) = O(n^{-3/2})$.

In the proof of Theorem 8, we show the following asymptotics:

$$N_k(z) = (k+1)(z + \beta^2) - \left(\frac{\sqrt{2}(1 + \beta^2)k(k+1)}{2\pi} + o(1)\right)\sqrt{1 - z} \quad \text{as } z \to 1, \tag{6}$$

$$N_k(z) = N_k(-1) + \left(\frac{\sqrt{2}(\beta^2 - 1)}{\pi}\prod_{j=1}^{k-1}\kappa_0\big(\kappa_1^{(j)}(-1)\big) + o(1)\right)\sqrt{1 + z} \quad \text{as } z \to -1. \tag{7}$$

When $\beta = 1$, the singularity at $z = -1$ does not provide a $\sqrt{1 + z}$ term. The dominating term in (7) is a higher power of $\sqrt{1 + z}$. As a result, the contribution of the singularity at $-1$ to the Maclaurin coefficients is $o(n^{-3/2})$ and dominated by the contribution of the singularity at 1. The singularity at $z = 1$ provides a $\sqrt{1 - z}$ term and thus contributes the $O(n^{-3/2})$ decay rate of $[z^n]N_k(z)$. In addition, from (6), we deduce

$$\frac{[z^n]N_k(z)}{n^{-3/2}} \sim -\frac{2\sqrt{2}k(k+1)}{(2\pi)\Gamma\left(-\frac{1}{2}\right)} = \frac{k(k+1)}{\sqrt{2}\pi^{3/2}}. \tag{8}$$

When $\beta \ne 1$, both singularities $\pm 1$ contribute $\Theta(n^{-3/2})$ to the Maclaurin coefficients. The contribution of $z = 1$ is

$$-\frac{\sqrt{2}(1 + \beta^2)k(k+1)}{2\pi\Gamma\left(-\frac{1}{2}\right)}\,n^{-3/2} = \frac{(\beta^2 + 1)k(k+1)}{2\sqrt{2}\pi^{3/2}}\,n^{-3/2}.$$

The contribution of $z = -1$ is

$$(-1)^n\left(\frac{\sqrt{2}(\beta^2 - 1)}{\pi\Gamma(-1/2)}\prod_{j=1}^{k-1}\kappa_0\big(\kappa_1^{(j)}(-1)\big)\right)n^{-3/2} = (-1)^n\left(\frac{1 - \beta^2}{\sqrt{2}\pi^{3/2}}\prod_{j=1}^{k-1}\kappa_0\big(\kappa_1^{(j)}(-1)\big)\right)n^{-3/2}.$$

Combining them gives

$$\frac{[z^n]N_k(z)}{n^{-3/2}} \sim \frac{(\beta^2 + 1)k(k+1)}{2\sqrt{2}\pi^{3/2}} + (-1)^n\,\frac{1 - \beta^2}{\sqrt{2}\pi^{3/2}}\prod_{j=1}^{k-1}\kappa_0\big(\kappa_1^{(j)}(-1)\big). \tag{9}$$

Based on Theorem 8, we are ready to prove Theorem 1.

Proof. Let $K_{\mathrm{Lap}}(z) = e^{-c\sqrt{1 - z}}$, where $c > 0$ is an arbitrary constant. We have $H_{K_{\mathrm{Lap}}} = H_{\mathrm{Lap}}$. The complex function $K_{\mathrm{Lap}}$ is analytic on $\mathbb{C} \setminus [1, \infty)$. As $z \to 1$, we have

$$\frac{K_{\mathrm{Lap}}(z) - 1}{-c} = \sqrt{1 - z} + o(\sqrt{1 - z}) \sim \sqrt{1 - z}.$$

By Lemma 6, we obtain

$$[z^n]K_{\mathrm{Lap}}(z) \sim \frac{c}{2\sqrt{\pi}}\,n^{-3/2}. \tag{10}$$

Note that $[z^n]N_k(z) = O(n^{-3/2})$ from Theorem 8. Therefore, there exists $\gamma > 0$ such that $\gamma^2 \cdot [z^n]K_{\mathrm{Lap}}(z) - [z^n]N_k(z) > 0$ for all $n \ge 0$.
This further implies that $\gamma^2 K_{\mathrm{Lap}}(x^{\top} y) - N_k(x^{\top} y)$ is a positive definite kernel. According to Lemma 4, we have $H_{N_k}(S^{d-1}) \subseteq H_{\mathrm{Lap}}(S^{d-1})$. Note that, due to (Geifman et al., 2020, Theorem 3), we also have $H_{\mathrm{Lap}}(S^{d-1}) \subseteq H_{N_k}(S^{d-1})$. Therefore, for any $k \ge 1$, $H_{\mathrm{Lap}}(S^{d-1}) = H_{N_k}(S^{d-1})$.
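The coefficient asymptotic (10) used in the proof can be checked numerically. The sketch below assumes the standard power-series identity that if $g = e^{h}$ then $n\,g_n = \sum_{k=1}^{n} k\,h_k\,g_{n-k}$; the function names are hypothetical, and the constant $c$ plays the role of the constant in $K_{\mathrm{Lap}}(z) = e^{-c\sqrt{1-z}}$.

```python
import math

def sqrt_one_minus_z_coeffs(n_max):
    """Maclaurin coefficients of sqrt(1 - z): s_m = s_{m-1} * (m - 3/2) / m."""
    s = [1.0]
    for m in range(1, n_max + 1):
        s.append(s[-1] * (m - 1.5) / m)
    return s

def laplace_maclaurin_coeffs(c, n_max):
    """Maclaurin coefficients of exp(-c * sqrt(1 - z))."""
    h = [-c * s for s in sqrt_one_minus_z_coeffs(n_max)]
    g = [math.exp(h[0])]
    for n in range(1, n_max + 1):
        # exp of a power series: n * g_n = sum_{k=1..n} k * h_k * g_{n-k}
        g.append(sum(k * h[k] * g[n - k] for k in range(1, n + 1)) / n)
    return g

c = 1.0
coeffs = laplace_maclaurin_coeffs(c, 100)
ratio = coeffs[100] / 100 ** -1.5          # compare with c / (2 sqrt(pi))
predicted = c / (2.0 * math.sqrt(math.pi))
print(ratio, predicted)
```

All computed coefficients are positive, in line with Lemma 5, and the ratio at $n = 100$ should already be close to $c/(2\sqrt{\pi})$, with a gap of order $1/n$.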

4. RESULTS ON EXPONENTIAL POWER KERNEL

This section presents the proof of Theorem 2. We first show part (1) below by singularity analysis.

Proof of part (1) of Theorem 2. Recall that the exponential power kernel restricted to the unit sphere with $\gamma > 0$ and $\sigma > 0$ is given by

$$K^{\gamma,\sigma}_{\exp}(x, y) = \exp\left(-\frac{\|x - y\|^{\gamma}}{\sigma}\right) = \exp\left(-\frac{(2(1 - x^{\top} y))^{\gamma/2}}{\sigma}\right).$$

Let us study the decay rate of the Maclaurin coefficients of $K^{\gamma,\sigma}_{\exp}(z) \triangleq e^{-c(1 - z)^{\gamma/2}}$, where $c = 2^{\gamma/2}/\sigma$. The dominant singularity lies at $z = 1$. As $z \to 1$, we get $K^{\gamma,\sigma}_{\exp}(z) = 1 - (c + o(1))(1 - z)^{\gamma/2}$. Applying Lemma 6 gives

$$[z^n]K^{\gamma,\sigma}_{\exp}(z) \sim \frac{c\,n^{-\gamma/2 - 1}}{-\Gamma(-\gamma/2)}.$$

For $\gamma_1 < \gamma_2$, the coefficients for $\gamma_2$ therefore decay faster than those for $\gamma_1$, so a suitable multiple of $K^{\gamma_1,\sigma_1}_{\exp}$ dominates $K^{\gamma_2,\sigma_2}_{\exp}$ coefficientwise, and Lemma 4 and Lemma 5 give the inclusion as in the proof of Theorem 1. Therefore, a smaller $\gamma$ results in a larger RKHS.

Part (2) of Theorem 2 requires more technical preparation. Recall that $\mathcal{L}$ and $\mathcal{L}^{-1}$ denote the Laplace transform and inverse Laplace transform, respectively. We explicitly calculate the inverse Laplace transform $\mathcal{L}^{-1}\{\exp(-s^a)\}(t)$ using a Bromwich contour integral and get the following lemma.

Lemma 9 (Proof in Appendix B.1). For $a \in (0, 1)$, $f(t) \triangleq \mathcal{L}^{-1}\{\exp(-s^a)\}(t)$ exists. Moreover, $f(t)$ is continuous in $-\infty < t < \infty$ and satisfies $f(0) = 0$. If $t > 0$, we have

$$f(t) = \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{(-1)^{k+1}\,\Gamma(ak + 1)\sin(\pi a k)}{k!\,t^{ak+1}}. \tag{11}$$

Based on the series representation (11), we then analyze the asymptotic rate for $f(t)$ when $a$ is rational. Note that if $a \in (0, 1)$, we have $\frac{-1}{\Gamma(-a)} > 0$.

Lemma 10 (Proof in Appendix B.2). Let $f(t)$ be as defined in Lemma 9. For $a = \frac{p}{q} \in (0, 1)$ ($p$ and $q$ are co-prime), we have $f(t) \sim \frac{-1}{t^{a+1}\Gamma(-a)}$ as $t \to +\infty$.

Thus, we have the following corollary for the general exponential power kernel.

Corollary 11. For $a = \frac{p}{q} \in (0, 1)$ ($p$ and $q$ are co-prime) and $\sigma > 0$, $\mathcal{L}^{-1}\{\exp(-s^a/\sigma)\}(t)$ is continuous in $t \in \mathbb{R}$ and satisfies $\mathcal{L}^{-1}\{\exp(-s^a/\sigma)\}(0) = 0$. Moreover, $\mathcal{L}^{-1}\{\exp(-s^a/\sigma)\}(t) \sim Ct^{-a-1}$ as $t \to +\infty$, for some constant $C > 0$.

Proof. Use the property $\mathcal{L}^{-1}\{F(cs)\}(t) = \frac{1}{c}f\left(\frac{t}{c}\right)$, where $c > 0$ and $F(s) = \mathcal{L}\{f(t)\}(s)$.

Before completing the proof for part (2), we need two additional lemmas from the classical approximation theory.
Recall that a function $f(t)$ is completely monotone if it is continuous on $[0, \infty)$, infinitely differentiable on $(0, \infty)$, and satisfies $(-1)^n \frac{d^n f(t)}{dt^n} \ge 0$ for every $n = 0, 1, 2, \dots$ and $t > 0$ (Cheney & Light, 2009, Chapter 14).

Lemma 12 (Schoenberg interpolation theorem (Cheney & Light, 2009, Theorem 1 of Chapter 15)). If $f$ is completely monotone but not constant on $[0, \infty)$, then for any $n$ distinct points $x_1, x_2, \dots, x_n$ in any inner-product space, the matrix $A_{ij} = f(\|x_i - x_j\|^2)$ is positive definite.

Lemma 13 (Bernstein-Widder (Cheney & Light, 2009, Theorem 1 of Chapter 14)). A function $f : [0, \infty) \to [0, \infty)$ is completely monotone if and only if there is a nondecreasing bounded function $g$ such that $f(t) = \int_0^{\infty} e^{-st}\,dg(s)$.

Now we are ready to prove part (2).

Proof of part (2) of Theorem 2. By Lemma 12 and Lemma 4, we need to show that

$$c^2\exp\big(-x^{\gamma_1/2}/\sigma_1\big) - \exp\big(-x^{\gamma_2/2}/\sigma_2\big) \tag{12}$$

is completely monotone but not constant on $[0, \infty)$ for some $c > 0$. By Lemma 13, it suffices to check that (12) is the Laplace transform of a non-negative function on $[0, \infty)$. By Corollary 11, for rational $0 < \gamma_1 < \gamma_2 < 2$ (so that $\gamma_1/2, \gamma_2/2 \in (0, 1)$), there exists $c > 0$ such that $c^2\mathcal{L}^{-1}\{\exp(-x^{\gamma_1/2}/\sigma_1)\} - \mathcal{L}^{-1}\{\exp(-x^{\gamma_2/2}/\sigma_2)\}$ is continuous and non-negative on $[0, \infty)$, which completes the proof.
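The coefficient decay rate from the proof of part (1) can be explored numerically on the sphere. The sketch below reuses the power-series exponential recurrence $n\,g_n = \sum_{k=1}^{n} k\,h_k\,g_{n-k}$; the function names and the sample values $\gamma \in \{0.5, 1.5\}$, $\sigma = 1$ are illustrative assumptions, not choices made in the paper.

```python
import math

def exp_power_maclaurin_coeffs(gamma, sigma, n_max):
    """Maclaurin coefficients of exp(-c (1 - z)^(gamma/2)), c = 2^(gamma/2) / sigma."""
    a = gamma / 2.0
    c = 2.0 ** a / sigma
    s = [1.0]                                  # coefficients of (1 - z)^a
    for m in range(1, n_max + 1):
        s.append(s[-1] * (m - 1 - a) / m)
    h = [-c * v for v in s]
    g = [math.exp(h[0])]
    for n in range(1, n_max + 1):
        g.append(sum(k * h[k] * g[n - k] for k in range(1, n + 1)) / n)
    return g

def predicted_coeff(gamma, sigma, n):
    """Leading-order asymptotic c * n^(-gamma/2 - 1) / (-Gamma(-gamma/2))."""
    a = gamma / 2.0
    c = 2.0 ** a / sigma
    return c * n ** (-a - 1) / (-math.gamma(-a))

rough = exp_power_maclaurin_coeffs(0.5, 1.0, 1000)    # less smooth kernel
smooth = exp_power_maclaurin_coeffs(1.5, 1.0, 1000)   # smoother kernel
print(smooth[1000] / predicted_coeff(1.5, 1.0, 1000))
print(rough[100] / smooth[100], rough[1000] / smooth[1000])
```

The ratio of the coefficients for the smaller $\gamma$ to those for the larger $\gamma$ grows with $n$, so no constant multiple of the smoother kernel can dominate the rougher one coefficientwise, while the reverse domination is possible. Convergence to the leading-order asymptotic is slow for small $\gamma$, so the direct comparison is made only for $\gamma = 1.5$.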

5. NUMERICAL RESULTS

Table 1: $[z^{100}]K(z)/100^{-3/2}$ for the Laplace kernel $K_{\mathrm{Lap}}(u) = e^{-\sqrt{2(1-u)}}$ and NTKs $N_1, \dots, N_4$ with $\beta = 0, 1$. These numerical values are the final values of the curves in Fig. 1. We present the theoretical prediction by the asymptotic of $[z^n]K(z)/n^{-3/2}$ alongside each numerical value. The choice of $\beta$ does not apply to the Laplace kernel. Therefore, we only show the results of the Laplace kernel in the columns for $\beta = 1$ and leave blank the columns for $\beta = 0$.

Kernel $K$ | $[z^{100}]K(z)/100^{-3/2}$ ($\beta = 1$) | Theory ($\beta = 1$) | $[z^{100}]K(z)/100^{-3/2}$ ($\beta = 0$) | Theory ($\beta = 0$)
$K_{\mathrm{Lap}}$ | 0.28244 | $\frac{1}{2\sqrt{\pi}} \approx 0.282095$ | |
$N_1$ | 0.261069 | $\frac{\sqrt{2}}{\pi^{3/2}} \approx$ … | … | …

We verify the asymptotics of the Maclaurin coefficients of the Laplace kernel and NTKs through numerical results. Fig. 1 plots $[z^n]K(z)/n^{-3/2}$ versus $n$ for different kernels, including the Laplace kernel $K_{\mathrm{Lap}}(u) = e^{-\sqrt{2(1-u)}}$ and NTKs $N_1, \dots, N_4$ with $\beta = 0, 1$. All curves converge to a constant as $n \to \infty$, which indicates that for every kernel $K(z)$ considered here, we have $[z^n]K(z) = \Theta(n^{-3/2})$. The numerical results agree with our theory in the proofs of Theorem 8 and Theorem 1.

Now we investigate the value of $[z^n]K(z)/n^{-3/2}$. Table 1 reports $[z^{100}]K(z)/100^{-3/2}$ for the Laplace kernel and NTKs with $\beta = 0, 1$. These numerical values are the final values of the curves in Fig. 1. The theoretical predictions are obtained through the asymptotic of $[z^n]K(z)/n^{-3/2}$, which we shall explain below. The theoretical prediction of $[z^{100}]N_4(z)/100^{-3/2}$ with $\beta = 0$ is presented below due to the space limit in the table:

$$\frac{20 + \pi^{-2}\big(\pi - \arccos\pi^{-1}\big)\left(\pi - \arccos\frac{\sqrt{\pi^2 - 1} + \pi - \arccos(\pi^{-1})}{\pi^2}\right)}{2\sqrt{2}\pi^{3/2}} \approx 1.29531.$$

We observe that the theoretical prediction by the asymptotic is close to the corresponding numerical value. There are two possible reasons that account for the minor discrepancy between them. First, the theoretical prediction reflects the situation for an infinitely large $n$ (so that the lower order terms become negligible), while $n = 100$ is clearly finite.
Second, the numerical results for the Maclaurin series are obtained by numerical Taylor expansion and therefore numerical errors could be present. In what follows, we explain how to obtain the theoretical predictions. First, (10) gives $[z^n]K_{\mathrm{Lap}}(z)/n^{-3/2} \sim \frac{1}{2\sqrt{\pi}}$. As a result, the theoretical prediction for $[z^{100}]K_{\mathrm{Lap}}(z)/100^{-3/2}$ is $\frac{1}{2\sqrt{\pi}}$. Now we explain the theoretical predictions for NTKs. When $\beta = 1$, the theoretical prediction is given by (8). We present it in the third column of Table 1 for $N_1, \dots, N_4$. When $\beta = 0$, we plug $\beta = 0$ into (9) and obtain

$$\frac{[z^n]N_k(z)}{n^{-3/2}} \sim \frac{k(k+1)}{2\sqrt{2}\pi^{3/2}} + \frac{(-1)^n}{\sqrt{2}\pi^{3/2}}\prod_{j=1}^{k-1}\kappa_0\big(\kappa_1^{(j)}(-1)\big).$$

The above expression (with $n = 100$ on the right-hand side) is the theoretical value presented in the fifth column of Table 1 for NTKs.
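The theoretical predictions described above can be evaluated with a few lines of code. The sketch below implements the combined leading-order formula (9), reading the factor over $j$ as a product, which is the reading consistent with the closed-form $N_4$ prediction quoted in this section; `predicted_ratio` is a hypothetical helper name.

```python
import math

def kappa0(u):
    return (math.pi - math.acos(u)) / math.pi

def kappa1(u):
    return (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi

def predicted_ratio(k, beta, n):
    """Leading-order prediction for [z^n] N_k(z) / n^(-3/2) from (9)."""
    scale = math.sqrt(2.0) * math.pi ** 1.5
    at_plus_one = (beta ** 2 + 1.0) * k * (k + 1) / (2.0 * scale)
    prod = 1.0
    a = -1.0
    for _ in range(k - 1):
        a = kappa1(a)          # a_j = kappa_1^{(j)}(-1)
        prod *= kappa0(a)
    at_minus_one = (-1) ** n * (1.0 - beta ** 2) / scale * prod
    return at_plus_one + at_minus_one

for k in range(1, 5):
    print(k, predicted_ratio(k, 1.0, 100), predicted_ratio(k, 0.0, 100))
```

For $\beta = 1$ the second term vanishes and the function reduces to (8); for $k = 4$, $\beta = 0$, $n = 100$ it evaluates the closed-form expression given for $N_4$.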

6. DISCUSSION

Our result provides further evidence that the NTK is similar to the existing Laplace kernel. However, the following questions remain open. First, if we still restrict them to the unit sphere, do they have similar learning dynamics when we perform kernelized gradient descent? Second, what is the behavior of the NTK and the Laplace kernel outside of $S^{d-1}$ and in the entire space $\mathbb{R}^d$? Do they still share similarities in terms of the associated RKHS? If not, how far do they deviate from each other and is the difference significant? Third, this work along with (Bietti & Mairal, 2019; Geifman et al., 2020) focuses on the NTK with ReLU activation. It would be interesting to explore the influence of different activations upon the RKHS and other kernel-related quantities. We would like to remark that the ReLU NTK has a clean expression partly because the expectation over the Gaussian process in the general NTK can be computed exactly if the activation function is ReLU (which may not be true for other non-linearities; for example, it may require more work for sigmoid). Fourth, we showed that highly non-smooth exponential power kernels have an even larger RKHS than the NTK. It would be worthwhile to compare the performance of these non-smooth kernels and deep neural networks through more extensive experiments in a variety of machine learning tasks. Moreover, we show that a less smooth exponential power kernel leads to a larger RKHS and therefore greater expressive power. Its generalization capability is a related but different topic. Analyzing the generalization error requires more effort in general. Researchers often use the RKHS norm to provide an upper bound for it. We will study its generalization in future work.

Proof. We prove by induction on $k$. We first prove the statement for $k = 1$. Let $z = 1 - re^{i\theta}$.
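Taylor's theorem around 1 with the integral form of the remainder gives

$$\kappa_1(z) = z + \int_\gamma \frac{z - w}{\pi\sqrt{1 - w^2}}\,dw,$$

where $\gamma : [0, 1] \to \mathbb{C}$ is the simple straight line connecting 1 and $z$ taking the form $\gamma(t) = 1 - tre^{i\theta}$. It follows that

$$\kappa_1(z) = z + \int_\gamma \frac{z - w}{\pi\sqrt{1 - w}} \cdot \frac{1}{\sqrt{1 + w}}\,dw = z + \int_\gamma \frac{z - w}{\pi\sqrt{2}\sqrt{1 - w}}\,dw + \int_\gamma \frac{z - w}{\pi\sqrt{2}\sqrt{1 - w}} \cdot \left(\frac{\sqrt{2}}{\sqrt{1 + w}} - 1\right)dw.$$

Since

$$\int_\gamma \frac{z - w}{\sqrt{1 - w}}\,dw = \left.\frac{2}{3}\sqrt{1 - w}\,(w - 3z + 2)\right|_{w=1}^{w=z} = \frac{4}{3}(1 - z)^{3/2},$$

we have

$$\kappa_1(z) = z + \frac{2\sqrt{2}}{3\pi}(1 - z)^{3/2} + \int_\gamma \frac{z - w}{\pi\sqrt{2}\sqrt{1 - w}} \cdot \left(\frac{\sqrt{2}}{\sqrt{1 + w}} - 1\right)dw.$$

We then show that

$$\lim_{z \to 1}(1 - z)^{-3/2} \cdot \int_\gamma \frac{z - w}{\pi\sqrt{2}\sqrt{1 - w}} \cdot \left(\frac{\sqrt{2}}{\sqrt{1 + w}} - 1\right)dw = 0.$$

Direct calculation gives

$$\lim_{z \to 1}(1 - z)^{-3/2} \cdot \int_\gamma \frac{z - w}{\sqrt{1 - w}} \cdot \left(\frac{\sqrt{2}}{\sqrt{1 + w}} - 1\right)dw = \lim_{r \to 0}(re^{i\theta})^{-3/2} \cdot \int_0^1 \frac{(1 - t)r^2e^{2i\theta}}{\sqrt{tre^{i\theta}}}\left(\frac{\sqrt{2}}{\sqrt{2 - tre^{i\theta}}} - 1\right)dt = \lim_{r \to 0}\int_0^1 \frac{1 - t}{\sqrt{t}}\left(\frac{1}{\sqrt{1 - tre^{i\theta}/2}} - 1\right)dt = 0.$$

Therefore, there exists $c_1(z)$ such that $\lim_{z \to 1}c_1(z) = \frac{2\sqrt{2}}{3\pi} \ne 0$ and $\kappa_1(z) = z + c_1(z)(1 - z)^{3/2}$.

Next, assume that the desired equation holds for some $k \ge 1$. We then have

$$\kappa_1^{(k+1)}(z) = \kappa_1\big(\kappa_1^{(k)}(z)\big) = \kappa_1\big(z + c_k(z)(1 - z)^{3/2}\big) = z + c_k(z)(1 - z)^{3/2} + c_1\big(\kappa_1^{(k)}(z)\big) \cdot \big(1 - z - c_k(z)(1 - z)^{3/2}\big)^{3/2} = z + c_{k+1}(z)(1 - z)^{3/2},$$

where $c_{k+1}(z) \sim c_k(z) + c_1(\kappa_1^{(k)}(z))$. Recall that when $z \to 1$, we have $\kappa_1^{(k)}(z) \to 1$ as well. Therefore we deduce

$$\lim_{z \to 1}c_{k+1}(z) = \lim_{z \to 1}c_k(z) + \lim_{z \to 1}c_1\big(\kappa_1^{(k)}(z)\big) = \frac{2\sqrt{2}(k+1)}{3\pi} \ne 0.$$

Lemma 15. For every $k \ge 1$, there exist $a_k \in \mathbb{R}$ and a complex function $b_k(z)$ such that $\kappa_1^{(k)}(z) = a_k + b_k(z)(z + 1)^{3/2}$, where $a_k = \kappa_1^{(k)}(-1)$ and

$$\lim_{z \to -1}b_k(z) = \frac{2\sqrt{2}}{3\pi}\prod_{j=1}^{k-1}\kappa_1'\big(\kappa_1^{(j)}(-1)\big) > 0.$$

Proof. We prove by induction on $k$. We first prove the statement for $k = 1$. Let $z = -1 + re^{i\theta}$. Taylor's theorem around $-1$ with the integral form of the remainder gives

$$\kappa_1(z) = \int_\gamma \frac{z - w}{\pi\sqrt{1 - w^2}}\,dw,$$

where $\gamma : [0, 1] \to \mathbb{C}$ is the simple straight line connecting $-1$ and $z$ taking the form $\gamma(t) = -1 + tre^{i\theta}$. Similar arguments as in the proof of Lemma 14 give $\kappa_1(z) = b_1(z)(z + 1)^{3/2}$, where $\lim_{z \to -1}b_1(z) = \frac{2\sqrt{2}}{3\pi}$.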
Next, assume that the desired equation holds for some $k \ge 1$. Define $h_k \triangleq \kappa_1^{(k)}(-1)$. Since $\kappa_1$ is strictly increasing on $[-1, 1]$, $\kappa_1(-1) = 0$, and $\kappa_1(1) = 1$, we have $h_1 = 0$ and $h_k \in (0, 1)$ for all $k > 1$. Expanding $\kappa_1$ around $h_k$ yields

$$\kappa_1(z) = \kappa_1(h_k) + p(z)(z - h_k) = h_{k+1} + p(z)(z - h_k),$$

where $\lim_{z \to h_k}p(z) = \kappa_1'(h_k)$. It follows that

$$\kappa_1^{(k+1)}(z) = \kappa_1\big(a_k + b_k(z)(z + 1)^{3/2}\big) = h_{k+1} + p\big(\kappa_1^{(k)}(z)\big)\big(a_k + b_k(z)(z + 1)^{3/2} - h_k\big) = a_{k+1} + b_{k+1}(z)(z + 1)^{3/2},$$

where $a_{k+1} = h_{k+1} + \kappa_1'(h_k)(a_k - h_k)$ and $\lim_{z \to -1}b_{k+1}(z) = \kappa_1'(h_k)\lim_{z \to -1}b_k(z)$. By induction, we can show that $a_k = h_k$ for all $k \ge 1$. Since $\kappa_1'$ is strictly increasing on $[-1, 1]$, $\kappa_1'(-1) = 0$, and $\kappa_1'(1) = 1$, we have $\kappa_1'(h_k) \ge \kappa_1'(0) > 0$. As a result,

$$\lim_{z \to -1}b_{k+1}(z) = \frac{2\sqrt{2}}{3\pi}\prod_{j=1}^{k}\kappa_1'\big(\kappa_1^{(j)}(-1)\big) > 0.$$

In the sequel, we show that $\pm 1$ are the only dominant singularities of $\kappa_1^{(k)}$ and that $\kappa_1^{(k)}$ is $\Delta$-analytic at $\pm 1$ (Lemma 19).

Lemma 16. For any $z \in \mathbb{C}$ with $\arg z \in (0, \pi/4)$, $\kappa_1(z) \in \mathbb{H}^{+}$. For any $z \in \mathbb{C}$ with $\arg z \in (-\pi/4, 0)$, $\kappa_1(z) \in \mathbb{H}^{-}$.

Proof. The second part of the statement follows from the first according to the reflection principle. We only prove the first part here. Let $z = re^{i\theta}$ with $\theta \in (0, \pi/4)$. Taylor's theorem with the integral form of the remainder and direct calculation give

$$\kappa_1(z) = \kappa_1(0) + \kappa_1'(0)z + \int_\gamma (z - w)\kappa_1''(w)\,dw = \frac{1}{\pi} + \frac{1}{2}z + \int_\gamma \frac{z - w}{\pi\sqrt{1 - w^2}}\,dw,$$

where $\gamma : [0, 1] \to \mathbb{C}$ is the simple straight line connecting 0 and $z$ taking the form $\gamma(t) = tre^{i\theta}$. Then we have

$$\int_\gamma \frac{z - w}{\pi\sqrt{1 - w^2}}\,dw = r^2e^{2i\theta}\int_0^1 \frac{1 - t}{\pi\sqrt{1 - r^2t^2e^{2i\theta}}}\,dt = e^{2i\theta}\int_0^r \frac{r - t}{\pi\sqrt{1 - t^2e^{2i\theta}}}\,dt.$$

Since $\theta \in (0, \pi/4)$, we have $\arg(1 - t^2e^{2i\theta}) \in (-\pi, 0)$. Further, $\arg\frac{1}{\sqrt{1 - t^2e^{2i\theta}}} \in (0, \pi/2)$ and

$$\arg\int_0^r \frac{r - t}{\pi\sqrt{1 - t^2e^{2i\theta}}}\,dt \in (0, \pi/2).$$

Noting $\arg(e^{2i\theta}) \in (0, \pi/2)$, we get

$$\arg\int_\gamma \frac{z - w}{\pi\sqrt{1 - w^2}}\,dw \in (0, \pi),$$

which gives a positive imaginary part. Combining with $\Im(1/\pi + z/2) > 0$ yields the desired statement.

Lemma 17.
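For every $k \ge 1$ and $\varepsilon > 0$, there exists $\delta > 0$ such that $\kappa_1^{(k)}$ is analytic on $B_1(\delta) \cap \mathbb{H}^{+}$ and $B_1(\delta) \cap \mathbb{H}^{-}$ with

$$\kappa_1^{(k)}\big(B_1(\delta) \cap \mathbb{H}^{+}\big) \subseteq B_1(\varepsilon) \cap \mathbb{H}^{+}, \qquad \kappa_1^{(k)}\big(B_1(\delta) \cap \mathbb{H}^{-}\big) \subseteq B_1(\varepsilon) \cap \mathbb{H}^{-}.$$

Proof. We present the proof for $\mathbb{H}^{+}$ here; the proof for $\mathbb{H}^{-}$ is similar. We adopt an induction argument on $k$. For $k = 1$, $\kappa_1$ is analytic on $\mathbb{H}^{+}$. Since $\kappa_1$ is continuous at $z = 1$, for any $\varepsilon > 0$, there exists $0 < \delta < 1/2$ such that $\kappa_1(B_1(\delta) \cap \mathbb{H}^{+}) \subseteq B_1(\varepsilon)$. Lemma 16 implies $\kappa_1(B_1(\delta) \cap \mathbb{H}^{+}) \subseteq \mathbb{H}^{+}$. Combining them yields

$$\kappa_1\big(B_1(\delta) \cap \mathbb{H}^{+}\big) \subseteq B_1(\varepsilon) \cap \mathbb{H}^{+}. \tag{14}$$

Now assume that the statement holds true for some $k \ge 1$. Note that for any $\varepsilon > 0$, there exists $0 < \delta < 1/2$ such that (14) holds. Then by the induction hypothesis, for this chosen $\delta$, there exists $\delta_1 > 0$ such that $\kappa_1^{(k)}$ is analytic on $B_1(\delta_1) \cap \mathbb{H}^{+}$ and $\kappa_1^{(k)}(B_1(\delta_1) \cap \mathbb{H}^{+}) \subseteq B_1(\delta) \cap \mathbb{H}^{+}$. It follows that

$$\kappa_1^{(k+1)}\big(B_1(\delta_1) \cap \mathbb{H}^{+}\big) \subseteq \kappa_1\big(B_1(\delta) \cap \mathbb{H}^{+}\big) \subseteq B_1(\varepsilon) \cap \mathbb{H}^{+}.$$

This completes the proof.

Lemma 18. $|\kappa_1(z)| \le 1$ for any $|z| \le 1$, where the equality holds if and only if $z = 1$.

Proof. The Taylor series of $\kappa_1$ around $z = 0$ is

$$\kappa_1(z) = \frac{1}{\pi} + \frac{z}{2} + \sum_{n=1}^{\infty}\frac{(2n - 3)!!}{(2n - 1)\,n!\,2^n\pi}z^{2n}.$$

Therefore, for $|z| \le 1$, we have

$$|\kappa_1(z)| \le \frac{1}{\pi} + \frac{|z|}{2} + \sum_{n=1}^{\infty}\frac{(2n - 3)!!}{(2n - 1)\,n!\,2^n\pi}|z|^{2n} \le \kappa_1(1) = 1.$$

The equality holds if and only if $z = 1$.

Lemma 19. For each $k \ge 1$, there exists $R > 1$ such that $\kappa_1^{(k)}$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R\} \cap D$, where $D = \mathbb{C} \setminus [1, \infty) \setminus (-\infty, -1]$.

Proof. For any $0 < \theta < \pi/2$, there exists $\delta_\theta > 0$ such that for all $|z| \le 1$ with $|\arg z| \ge \theta$, we have $|\kappa_1(z)| \le 1 - \delta_\theta$. To see this, we use an argument similar to (Pinelis, 2020). If we define $\phi \triangleq \arg z$, we have

$$\left|\frac{1}{\pi} + \frac{z}{2}\right| = \sqrt{\frac{|z|^2}{4} + \frac{|z|\cos\phi}{\pi} + \frac{1}{\pi^2}} \le \sqrt{\frac{1}{4} + \frac{\cos\theta}{\pi} + \frac{1}{\pi^2}} = \sqrt{\left(\frac{1}{2} + \frac{1}{\pi}\right)^2 - \frac{1 - \cos\theta}{\pi}} = \frac{1}{2} + \frac{1}{\pi} - \delta_\theta,$$

for some $\delta_\theta > 0$. Consider the Taylor series of $\kappa_1$ around $z = 0$:

$$\kappa_1(z) = \frac{1}{\pi} + \frac{z}{2} + \sum_{n=1}^{\infty}\frac{(2n - 3)!!}{(2n - 1)\,n!\,2^n\pi}z^{2n}.$$

We obtain

$$|\kappa_1(z)| \le \left|\frac{1}{\pi} + \frac{z}{2}\right| + \sum_{n=1}^{\infty}\frac{(2n - 3)!!}{(2n - 1)\,n!\,2^n\pi}|z|^{2n} \le \frac{1}{2} + \frac{1}{\pi} - \delta_\theta + \sum_{n=1}^{\infty}\frac{(2n - 3)!!}{(2n - 1)\,n!\,2^n\pi} = 1 - \delta_\theta.$$

Lemma 17 shows that there exists $0 < \delta' < 1$ such that $\kappa_1^{(k)}$ is analytic on $B_1(\delta') \cap D$. From the argument above, we know that $\kappa_1$ maps $A \triangleq \{z \in \mathbb{C} \mid |z| = 1,\ |\arg z| \ge \theta\}$ to the inside of the open unit ball $B_0(1)$. Since $A$ is compact and Lemma 18 implies that $\kappa_1$ maps $B_0(1)$ to $B_0(1)$, there exists $1 < R_\theta < 1 + \delta'$ such that $\kappa_1$ maps $A_\theta \triangleq (\{z \in \mathbb{C} \mid |z| \le R_\theta,\ |\arg z| \ge \theta\} \cap D) \cup B_0(1)$ to $B_0(1)$. It follows that $\kappa_1^{(k)}$ is analytic on $A_\theta$. Let us pick $\theta \in (0, \pi/2)$ such that $e^{i\theta} \in B_1(\delta')$. Then we conclude that $\kappa_1^{(k)}$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R_\theta\} \cap D$.

Now we are ready to prove Theorem 7.

Proof. Since $\kappa_0$ and $\kappa_1$ are both analytic on $D = \mathbb{C} \setminus [1, \infty) \setminus (-\infty, -1]$, similar arguments as in the proof of Lemma 19 show that $\kappa_0(\kappa_1^{(k)}(z))$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R\} \cap D$ for all $k \ge 1$ and some $R > 1$. We then show by induction that, for any $k \ge 1$, there exists some $R_k > 1$ such that $N_k(z)$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R_k\} \cap D$. The function $N_0(z) = z + \beta^2$ is analytic on $D$. Assume $N_{k-1}(z)$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R_{k-1}\} \cap D$ for some $R_{k-1} > 1$. Recall that $N_k(z) = \kappa_1^{(k)}(z) + N_{k-1}(z)\kappa_0(\kappa_1^{(k-1)}(z)) + \beta^2$. Then we can find some $R_k > 1$ such that $N_k(z)$ is analytic on $\{z \in \mathbb{C} \mid |z| \le R_k\} \cap D$.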
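The bound in Lemma 18 can be spot-checked on the unit circle using the closed-form complex extension of $\kappa_1$ with principal branches. This is only a numerical illustration on a finite grid, not a substitute for the proof; the function name `kappa1_complex` and the grid size are illustrative assumptions.

```python
import cmath
import math

def kappa1_complex(z):
    """kappa_1 on the branch D, using principal values of log and sqrt."""
    root = cmath.sqrt(1.0 - z * z)
    return (z * (math.pi + 1j * cmath.log(z + 1j * root)) + root) / math.pi

# Sample |z| = 1 at 360 equally spaced angles, including z = 1 and z close to -1.
points = [cmath.exp(2j * math.pi * j / 360) for j in range(360)]
moduli = [abs(kappa1_complex(z)) for z in points]
print(max(moduli), moduli[0])
```

On this grid the modulus attains its maximum at the sample $z = 1$, where $\kappa_1(1) = 1$, and stays strictly below 1 at the other sample points.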

A.4 PROOF OF THEOREM 8

Proof. We first analyze the behavior of $N_k(z)$ as $z \to 1$ for any $k \ge 1$. We aim to show that, for any $k \ge 1$, there exists a sequence of complex functions $p_k(z)$ with $\lim_{z\to 1} p_k(z) = -\sqrt{2}(1+\beta^2)k(k+1)/(2\pi)$ such that
$$N_k(z) = (k+1)(z + \beta^2) + p_k(z)\sqrt{1-z}. \tag{15}$$
We prove this by induction on $k$. Recall
$$\kappa_0(z) = \frac{\pi + i\log\left(z + i\sqrt{1-z^2}\right)}{\pi}.$$
The fundamental theorem of calculus then gives, for any $z \in D$,
$$\kappa_0(z) = 1 + \int_\gamma \frac{1}{\pi\sqrt{1-w^2}}\,dw,$$
where $\gamma : [0, 1] \to \mathbb{C}$ is the straight line connecting $1$ and $z$. As $z \to 1$, we have $\frac{1}{\sqrt{1-z^2}} \sim \frac{1}{\sqrt{2}\sqrt{1-z}}$. Therefore, arguments similar to those in the proof of Lemma 14 give
$$\kappa_0(z) = 1 + h(z)\sqrt{1-z},$$
where $\lim_{z\to 1} h(z) = -\frac{\sqrt{2}}{\pi}$. Combining with Lemma 14 further gives, for any $k \ge 1$,
$$\kappa_0(\kappa_1^{(k)}(z)) = 1 + h(\kappa_1^{(k)}(z))\sqrt{1 - z - c_k(z)(1-z)^{3/2}} = 1 + h_k(z)\sqrt{1-z},$$
where $\lim_{z\to 1} h_k(z) = -\frac{\sqrt{2}}{\pi}$.

For $k = 1$, we then have
$$N_1(z) = \kappa_1(z) + (z+\beta^2)\kappa_0(z) + \beta^2 = z + d_1(z)(1-z)^{3/2} + (z+\beta^2)\left(1 + h(z)\sqrt{1-z}\right) + \beta^2 = 2(z+\beta^2) + p_1(z)\sqrt{1-z},$$
where $\lim_{z\to 1} d_1(z) = \frac{2\sqrt{2}}{3\pi}$ and $\lim_{z\to 1} p_1(z) = -\sqrt{2}(1+\beta^2)/\pi$.

Assume $N_{k-1}(z) = k(z+\beta^2) + p_{k-1}(z)\sqrt{1-z}$ with $\lim_{z\to 1} p_{k-1}(z) = -\sqrt{2}(1+\beta^2)k(k-1)/(2\pi)$. We further have
$$\begin{aligned}
N_k(z) &= \kappa_1^{(k)}(z) + N_{k-1}(z)\,\kappa_0(\kappa_1^{(k-1)}(z)) + \beta^2 \\
&= z + d_k(z)(1-z)^{3/2} + \left(k(z+\beta^2) + p_{k-1}(z)\sqrt{1-z}\right)\left(1 + h_{k-1}(z)\sqrt{1-z}\right) + \beta^2 \\
&= (k+1)(z+\beta^2) + \left(p_{k-1}(z) + k\,h_{k-1}(z)(z+\beta^2) + p_{k-1}(z)h_{k-1}(z)\sqrt{1-z} + d_k(z)(1-z)\right)\sqrt{1-z} \\
&= (k+1)(z+\beta^2) + p_k(z)\sqrt{1-z},
\end{aligned}$$
where we set $p_k(z) = p_{k-1}(z) + k\,h_{k-1}(z)(z+\beta^2) + p_{k-1}(z)h_{k-1}(z)\sqrt{1-z} + d_k(z)(1-z)$, and $d_k(z) \to \frac{2\sqrt{2}k}{3\pi}$, $h_{k-1}(z) \to -\frac{\sqrt{2}}{\pi}$ as $z \to 1$. The last two terms of $p_k(z)$ vanish as $z \to 1$, so
$$\lim_{z\to 1} p_k(z) = \lim_{z\to 1}\left(p_{k-1}(z) + k\,h_{k-1}(z)(z+\beta^2)\right) = -\frac{\sqrt{2}(1+\beta^2)k(k-1)}{2\pi} - k\,\frac{\sqrt{2}}{\pi}(1+\beta^2) = -\frac{\sqrt{2}(1+\beta^2)k(k+1)}{2\pi},$$
as desired. This proves (15).

Next we study the behavior of $N_k(z)$ as $z \to -1$ for any $k \ge 1$.
We aim to show that, for any $k \ge 1$, there exists a sequence of complex functions $q_k(z)$ with $\lim_{z\to -1} q_k(z) = \sqrt{2}(\beta^2-1)\prod_{j=1}^{k-1}\kappa_0(a_j)/\pi$, where $a_k \triangleq \kappa_1^{(k)}(-1)$ is as defined in Lemma 15, such that
$$N_k(z) = N_k(-1) + q_k(z)\sqrt{1+z}. \tag{16}$$
We again use induction on $k$. Taylor's theorem gives
$$\kappa_0(z) = \kappa_0(a_k) + r_k(z)(z - a_k),$$
where $\lim_{z\to a_k} r_k(z) = \kappa_0'(a_k) > 0$. Combining with Lemma 15 further gives, for any $k \ge 1$,
$$\kappa_0(\kappa_1^{(k)}(z)) = \kappa_0(a_k) + r_k(\kappa_1^{(k)}(z))\,b_k(z)(z+1)^{3/2} = \kappa_0(a_k) + \tilde r_k(z)(z+1)^{3/2},$$
where $b_k(z) \to \frac{2\sqrt{2}}{3\pi}\prod_{j=1}^{k-1}\kappa_1'(a_j)$ and $\tilde r_k(z) \to \frac{2\sqrt{2}}{3\pi}\,\kappa_0'(a_k)\prod_{j=1}^{k-1}\kappa_1'(a_j) > 0$ as $z \to -1$ by Lemma 15.

For $k = 1$, the fundamental theorem of calculus gives, for any $z \in D$,
$$\kappa_0(z) = \int_\gamma \frac{1}{\pi\sqrt{1-w^2}}\,dw,$$
where $\gamma : [0, 1] \to \mathbb{C}$ is the straight line connecting $-1$ and $z$. As $z \to -1$, we have $\frac{1}{\sqrt{1-z^2}} \sim \frac{1}{\sqrt{2}\sqrt{1+z}}$. Therefore, arguments similar to those in the proof of Lemma 14 give
$$\kappa_0(z) = g(z)\sqrt{1+z},$$
where $g(z) \to \frac{\sqrt{2}}{\pi}$ as $z \to -1$. We then have
$$N_1(z) = \kappa_1(z) + (z+\beta^2)\kappa_0(z) + \beta^2 = a_1 + b_1(z)(z+1)^{3/2} + (z+\beta^2)g(z)\sqrt{1+z} + \beta^2 = (a_1+\beta^2) + q_1(z)\sqrt{1+z} = N_1(-1) + q_1(z)\sqrt{1+z},$$
where $N_1(-1) = a_1 + \beta^2$ and $\lim_{z\to -1} q_1(z) = \frac{\sqrt{2}}{\pi}(\beta^2 - 1)$.

Assume $N_{k-1}(z) = N_{k-1}(-1) + q_{k-1}(z)\sqrt{1+z}$ with $\lim_{z\to -1} q_{k-1}(z) = \sqrt{2}(\beta^2-1)\prod_{j=1}^{k-2}\kappa_0(a_j)/\pi$.
We further have
$$\begin{aligned}
N_k(z) &= \kappa_1^{(k)}(z) + N_{k-1}(z)\,\kappa_0(\kappa_1^{(k-1)}(z)) + \beta^2 \\
&= a_k + b_k(z)(z+1)^{3/2} + N_{k-1}(z)\left(\kappa_0(a_{k-1}) + \tilde r_{k-1}(z)(z+1)^{3/2}\right) + \beta^2 \\
&= a_k + \beta^2 + N_{k-1}(z)\kappa_0(a_{k-1}) + \left(b_k(z) + N_{k-1}(z)\tilde r_{k-1}(z)\right)(z+1)^{3/2} \\
&= a_k + \beta^2 + N_{k-1}(-1)\kappa_0(a_{k-1}) + q_{k-1}(z)\kappa_0(a_{k-1})\sqrt{z+1} + \left(b_k(z) + N_{k-1}(z)\tilde r_{k-1}(z)\right)(z+1)^{3/2} \\
&= N_k(-1) + q_{k-1}(z)\kappa_0(a_{k-1})\sqrt{z+1} + \left(b_k(z) + N_{k-1}(z)\tilde r_{k-1}(z)\right)(z+1)^{3/2} \\
&= N_k(-1) + q_k(z)\sqrt{1+z},
\end{aligned}$$
where we use the induction assumption in the fourth equation, use the fact $N_k(-1) = a_k + \beta^2 + N_{k-1}(-1)\kappa_0(a_{k-1})$ in the fifth equation, and define $q_k(z) = q_{k-1}(z)\kappa_0(a_{k-1}) + \left(b_k(z) + N_{k-1}(z)\tilde r_{k-1}(z)\right)(z+1)$ in the last equation. We also have
$$\lim_{z\to -1} q_k(z) = \lim_{z\to -1}\left[q_{k-1}(z)\kappa_0(a_{k-1}) + \left(b_k(z) + N_{k-1}(z)\tilde r_{k-1}(z)\right)(z+1)\right] = \lim_{z\to -1} q_{k-1}(z)\kappa_0(a_{k-1}) = \frac{\sqrt{2}(\beta^2-1)}{\pi}\prod_{j=1}^{k-1}\kappa_0(a_j),$$
as desired. This proves (16).

Finally, according to Theorem 7, combining (15) and (16) and applying (Flajolet & Sedgewick, 2009, Theorem VI.5) with $\rho = 1$, $r = 2$, $\tau(z) = (1-z)^{1/2}$, $\zeta_1 = 1$, $\zeta_2 = -1$, $\sigma_1(z) = (k+1)(z+\beta^2)$, $\sigma_2(z) = N_k(-1)$, and the domain $\mathbf{D} = \{z \in \mathbb{C} \mid |z| \le R_k\} \cap D$, we conclude $[z^n]N_k(z) = O(n^{-3/2})$.
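For $k = 1$ the decay rate can be observed directly from Taylor coefficients. The sketch below is a numerical illustration under stated assumptions, not the authors' experiment. It assumes $\kappa_0(z) = 1/2 + \arcsin(z)/\pi$ on the unit disk (equivalent to the logarithmic form recalled in the proof) together with $\kappa_1' = \kappa_0$ and $\kappa_1(0) = 1/\pi$, and it builds the coefficients of $N_1(z) = \kappa_1(z) + (z+\beta^2)\kappa_0(z) + \beta^2$. Transferring the square-root terms in (15) and (16) with the standard estimate $[z^n]\sqrt{1-z} \sim -1/(2\sqrt{\pi}\,n^{3/2})$ suggests that $n^{3/2}[z^n]N_1(z)$ approaches $\sqrt{2}/\pi^{3/2}$ for even $n$ and $\sqrt{2}\beta^2/\pi^{3/2}$ for odd $n$. The function names are hypothetical.

```python
import math


def kappa0_taylor(n_max):
    """Taylor coefficients of kappa_0(z) = 1/2 + arcsin(z)/pi up to z**n_max."""
    coeffs = [0.0] * (n_max + 1)
    coeffs[0] = 0.5
    central = 1.0  # binom(2n, n) / 4**n, starting at n = 0
    n = 0
    while 2 * n + 1 <= n_max:
        coeffs[2 * n + 1] = central / ((2 * n + 1) * math.pi)
        central *= (2 * n + 1) / (2 * n + 2)
        n += 1
    return coeffs


def ntk1_taylor(n_max, beta):
    """Taylor coefficients of N_1(z) = kappa_1(z) + (z + beta^2) kappa_0(z) + beta^2."""
    k0 = kappa0_taylor(n_max)
    # Integrate kappa_0 term by term, using kappa_1' = kappa_0 and kappa_1(0) = 1/pi.
    k1 = [1.0 / math.pi] + [k0[m] / (m + 1) for m in range(n_max)]
    coeffs = []
    for m in range(n_max + 1):
        shifted = k0[m - 1] if m >= 1 else 0.0  # coefficient of z * kappa_0(z)
        constant = beta ** 2 if m == 0 else 0.0
        coeffs.append(k1[m] + shifted + beta ** 2 * k0[m] + constant)
    return coeffs


if __name__ == "__main__":
    limit = math.sqrt(2) / math.pi ** 1.5
    coeffs = ntk1_taylor(2001, beta=1.0)
    for n in (500, 501, 2000, 2001):
        print(n, coeffs[n] * n ** 1.5, limit)
```

The scaled coefficients stay bounded, which is the behavior that Figure 1 plots for $N_1, \ldots, N_4$.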

B PROOFS FOR EXPONENTIAL POWER KERNEL

B.1 PROOF OF LEMMA 9

Proof. According to (Doetsch, 1974, Theorem 28.2), we have, for $0 < a < 1$,
$$f(t) = \frac{1}{2\pi i}\lim_{T\to+\infty}\int_{x_0 - iT}^{x_0 + iT}\exp(ts - s^a)\,ds \qquad (x_0 \ge 0).$$
Also, (Doetsch, 1974, Theorem 28.2) implies that $f(t)$ is continuous in $-\infty < t < +\infty$ and $f(0) = 0$. Next we explicitly calculate $f(t)$ using the Bromwich contour integral. We denote the parts of the Bromwich contour by $\Gamma_0, \ldots, \Gamma_5$ as depicted in Fig. 2. Denote the radii of the outer and inner arcs by $R$ and $r$. When $T \to \infty$, we have $R = \sqrt{T^2 + x_0^2} \to \infty$. Also we let $r \to 0$ and let $\Gamma_2, \Gamma_4$ tend to $(-\infty, 0]$ from above and below respectively in the limit. By the residue theorem, we have

where the last two limits are taken as $R \to \infty$, $r \to 0$, and $\Gamma_2, \Gamma_4$ tend to $(-\infty, 0]$. We then calculate each part separately.

Part I: We calculate the parts for $\Gamma_1$ and $\Gamma_5$. We follow an idea similar to the proof of (Spiegel, 1965, Theorem 7.1). Along $\Gamma_1$, since $s = Re^{i\theta}$ with $\theta_0 \le \theta \le \pi$, $\theta_0 = \arccos(x_0/R)$,
$$I_1 = \int_{\theta_0}^{\pi/2} e^{Re^{i\theta}t}e^{-R^a e^{ia\theta}}\,iRe^{i\theta}\,d\theta + \int_{\pi/2}^{\pi} e^{Re^{i\theta}t}e^{-R^a e^{ia\theta}}\,iRe^{i\theta}\,d\theta \triangleq I_{11} + I_{12}.$$
For $I_{11}$,
$$|I_{11}| \le \int_{\theta_0}^{\pi/2} |e^{Rt\cos\theta}|\cdot|e^{-R^a\cos(a\theta)}|\,R\,d\theta \le \int_{\theta_0}^{\pi/2} e^{Rt\cos\theta}\cdot e^{-R^a\cos(a\pi/2)}\,R\,d\theta \le \frac{R}{e^{R^a\cos(a\pi/2)}}\int_{\theta_0}^{\pi/2} e^{Rt\cos\theta}\,d\theta = \frac{R}{e^{R^a\cos(a\pi/2)}}\int_0^{\varphi_0} e^{Rt\sin\varphi}\,d\varphi,$$
where $\varphi_0 = \pi/2 - \theta_0 = \arcsin(x_0/R)$. Since $\sin\varphi \le \sin\varphi_0 \le x_0/R$, we have
$$|I_{11}| \le \frac{R}{e^{R^a\cos(a\pi/2)}}\,\varphi_0\,e^{x_0 t} = \frac{R}{e^{R^a\cos(a\pi/2)}}\,e^{x_0 t}\arcsin(x_0/R).$$
As $R \to \infty$, we have $\lim_{R\to\infty} I_{11} = 0$.

For $I_{12}$,
$$|I_{12}| \le \int_{\pi/2}^{\pi} e^{Rt\cos\theta}\cdot e^{-R^a\cos(a\theta)}\,R\,d\theta.$$
First, we consider the case $0 < a < 1/2$. We have $a\theta \le a\pi < \pi/2$ and $\cos(a\theta) \ge \cos(a\pi) > 0$. It follows that
$$\int_{\pi/2}^{\pi} e^{Rt\cos\theta}\cdot e^{-R^a\cos(a\theta)}\,R\,d\theta \le Re^{-R^a\cos(a\pi)}\int_{\pi/2}^{\pi} e^{Rt\cos\theta}\,d\theta = Re^{-R^a\cos(a\pi)}\int_0^{\pi/2} e^{-Rt\sin\varphi}\,d\varphi \le Re^{-R^a\cos(a\pi)}\int_0^{\pi/2} e^{-2Rt\varphi/\pi}\,d\varphi = e^{-R^a\cos(a\pi)}\,\frac{\pi(1 - e^{-Rt})}{2t},$$
where in the last inequality we use the fact $\sin\varphi \ge 2\varphi/\pi$ for $\varphi \in [0, \pi/2]$. Thus, $\lim_{R\to\infty} I_{12} = 0$.

Next, we consider $1/2 \le a < 1$. Define $p(\theta) \triangleq Rt\cos\theta - R^a\cos(a\theta)$.
We then have its second derivative as follows:
$$p''(\theta) = a^2R^a\cos(a\theta) - Rt\cos(\theta).$$
Choose $\delta$ to be a fixed constant in $\left(0, \frac{\pi}{2}\left(\frac{1}{a} - 1\right)\right)$. Since $a \ge 1/2$, we have $\delta < \pi/2$. If $\pi/2 + \delta \le \theta \le \pi$,
$$p''(\theta) \ge -a^2R^a - Rt\cos(\pi/2 + \delta) = -a^2R^a + Rt\sin(\delta).$$
Since $a < 1$, there exists some large $R_1 > 0$ such that $p''(\theta) \ge -a^2R^a + Rt\sin(\delta) > 0$ holds for all $R > R_1$. If $\pi/2 \le \theta < \pi/2 + \delta$,
$$p''(\theta) \ge a^2R^a\cos(a(\pi/2 + \delta)).$$
Since $a(\pi/2 + \delta) < \pi/2$ by the choice of $\delta$, we get $\cos(a(\pi/2 + \delta)) > 0$. Then we also have $p''(\theta) > 0$.

Hence, we establish $\lim_{R\to\infty} I_{12} = 0$ for all $a \in (0, 1)$. Combining the above, we conclude $\lim_{R\to\infty} I_1 = 0$. Similarly, $\lim_{R\to\infty} I_5 = 0$.

$\exp(ts)\,s^{ak}\,ds$. We then calculate the limit of the summand.

Part III: The limit for $\Gamma_3$ is $0$ as $r \to 0$.

Combining the three parts above, we conclude
$$f(t) = \frac{1}{2\pi i}\sum_{k=0}^{\infty}\frac{(-1)^{k+1}}{k!}\cdot\frac{2i\,\Gamma(ak+1)\sin(\pi ak)}{t^{ak+1}} = \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{(-1)^{k+1}\,\Gamma(ak+1)\sin(\pi ak)}{k!\,t^{ak+1}}.$$
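The final series can be sanity-checked in the one case where a classical closed form is available. For $a = 1/2$, the inverse Laplace transform of $e^{-\sqrt{s}}$ is $\frac{1}{2\sqrt{\pi}}\,t^{-3/2}e^{-1/(4t)}$, a standard table entry that is not taken from this paper. The sketch below compares a partial sum of the series with that formula. The function names and the truncation at 80 terms are illustrative choices.

```python
import math


def f_series(t, a, n_terms=80):
    """Partial sum of (1/pi) * sum_k (-1)^(k+1) Gamma(ak+1) sin(pi a k) / (k! t^(ak+1))."""
    total = 0.0
    for k in range(n_terms):
        total += ((-1) ** (k + 1) * math.gamma(a * k + 1) * math.sin(math.pi * a * k)
                  / (math.factorial(k) * t ** (a * k + 1)))
    return total / math.pi


def f_closed_form_half(t):
    """Classical inverse Laplace transform of exp(-sqrt(s)), i.e. the case a = 1/2."""
    return math.exp(-1.0 / (4.0 * t)) / (2.0 * math.sqrt(math.pi) * t ** 1.5)


if __name__ == "__main__":
    for t in (0.5, 1.0, 3.0):
        print(t, f_series(t, 0.5), f_closed_form_half(t))
```

The truncation is adequate for moderate $t$; for very small $t$ the terms grow before they decay, and more terms or higher precision would be needed.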

B.2 PROOF OF LEMMA 10

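Proof. Euler's reflection formula gives
$$\Gamma(1 + ka)\,\Gamma(-ka) = -\frac{\pi}{\sin(\pi ka)}, \qquad ka \notin \mathbb{Z}.$$
According to Lemma 9, we have
$$f(t) = \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{(-1)^{k+1}\,\Gamma(ak+1)\sin(\pi ak)}{k!\,t^{ak+1}} = \sum_{k=0}^{\infty}\frac{(-1)^k}{k!\,t^{ak+1}\,\Gamma(-ak)} = \sum_{j=1}^{q-1}\sum_{n=0}^{\infty}\frac{(-1)^{nq+j}}{(nq+j)!\,t^{a(nq+j)+1}\,\Gamma(-a(nq+j))}. \tag{17}$$
First, we show that the series in (17) converges absolutely. The inner summation in (18) is a power series in $|t|^{-p}$. We would like to show that its radius of convergence is $\infty$. The ratio of consecutive coefficients of this power series is at most
$$\frac{1}{\prod_{i=nq+p+1}^{(n+1)q}(j+i)} \le \frac{1}{(j + nq + p + 1)^{q-p}} \to 0.$$
As a result, the radius of convergence is $\infty$. Then we have
$$f(t) = \sum_{j=1}^{q-1}\frac{1}{t^{aj+1}\,\Gamma(-aj)}\sum_{n=0}^{\infty}(-1)^{n(p+q)+j}\,t^{-pn}\,\frac{\prod_{i=1}^{np}(aj+i)}{(nq+j)!} = \sum_{j=1}^{q-1}\frac{1}{t^{aj+1}\,\Gamma(-aj)}\Bigg(\frac{(-1)^j}{j!} + \underbrace{\sum_{n=1}^{\infty}(-1)^{n(p+q)+j}\,t^{-pn}\,\frac{\prod_{i=1}^{np}(aj+i)}{(nq+j)!}}_{A}\Bigg).$$
Notice that the quantity $A$ goes to $0$ as $t \to +\infty$. Therefore we deduce
$$f(t) \sim \sum_{j=1}^{q-1}\frac{(-1)^j}{t^{aj+1}\,j!\,\Gamma(-aj)} \sim \frac{-1}{t^{a+1}\,\Gamma(-a)}$$
as $t \to +\infty$.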
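The tail asymptotics can be illustrated numerically from the reflection-formula form of the series. The sketch below assumes $a = p/q$ with coprime integers $0 < p < q$, which is how $p$ and $q$ appear to be used in the proof. Terms with $q \mid k$ are skipped because $1/\Gamma$ vanishes at non-positive integers. The function names and the truncation length are hypothetical.

```python
import math


def f_reflection_series(t, p, q, n_terms=60):
    """Partial sum of sum_k (-1)^k / (k! t^(ak+1) Gamma(-ak)) with a = p/q."""
    a = p / q
    total = 0.0
    for k in range(1, n_terms):
        if k % q == 0:
            continue  # a*k is an integer here, so 1/Gamma(-a*k) = 0
        total += (-1) ** k / (math.factorial(k) * t ** (a * k + 1) * math.gamma(-a * k))
    return total


def leading_term(t, a):
    """Claimed leading behavior -1 / (t^(a+1) Gamma(-a)) as t -> +infinity."""
    return -1.0 / (t ** (a + 1) * math.gamma(-a))


if __name__ == "__main__":
    for p, q, t in ((1, 2, 1e2), (1, 2, 1e4), (1, 3, 1e4), (1, 3, 1e6)):
        ratio = f_reflection_series(t, p, q) / leading_term(t, p / q)
        print(f"a={p}/{q}, t={t:g}, f(t)/leading term = {ratio:.6f}")
```

The printed ratios drift toward 1 as $t$ grows, more slowly for smaller $a$ because the first correction is of relative order $t^{-a}$.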



When $x$ and $y$ live on the unit sphere (i.e., $x^\top x = y^\top y = 1$), their inner product $x^\top y$ can be any real number in $[-1, 1]$.



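denote the upper half-plane and $\mathbb{H}^- \triangleq \{z \in \mathbb{C} \mid \operatorname{Im} z < 0\}$ denote the lower half-plane. Write $B_z(r)$ for the open ball $\{w \in \mathbb{C} \mid |z - w| < r\}$ and $\bar{B}_z(r)$ for the closed ball $\{w \in \mathbb{C} \mid |z - w| \le r\}$.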

and $b_{k+1} \in \mathbb{R}$. The weight matrices $W_1, \ldots, W_k, w$ are initialized with $\mathcal{N}(0, I)$ and the biases $b_1, \ldots, b_{k+1}$ are initialized with zero, where $\mathcal{N}(0, I)$ is the multivariate standard normal distribution. The activation function is chosen to be the ReLU function $\sigma(x) \triangleq \max\{x, 0\}$. Geifman et al. (2020) and Bietti & Mairal (2019) presented the following recursive relations of the NTK $N_k(x, y)$ of the above ReLU network (1):

Figure 1: We plot $[z^n]K(z)/n^{-3/2}$ versus $n$ for the Laplace kernel $K_{\mathrm{Lap}}(u) = e^{-\sqrt{2(1-u)}}$ and NTKs $N_1, \ldots, N_4$ with $\beta = 0, 1$.

Figure 2: Bromwich contour that circumvents the branch cut (-∞, 0]

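$p''(\theta) > 0$. Therefore, if $R > R_1$, $p(\theta)$ is convex in $\theta \in [\pi/2, \pi]$. As a result, we get
$$\max_{\theta\in[\pi/2,\pi]} p(\theta) \le \max\{p(\pi/2), p(\pi)\}.$$
Write $h(R, \theta) \triangleq Re^{Rt\cos\theta}\cdot e^{-R^a\cos(a\theta)} = Re^{p(\theta)}$. Then
$$h(R, \theta) \le \max\{h(R, \pi/2), h(R, \pi)\} = R\max\left\{e^{-R^a\cos(\pi a/2)},\, e^{-R^a\cos(\pi a) - Rt}\right\} \le R\max\left\{e^{-R^a\cos(\pi a/2)},\, e^{R^a - Rt}\right\},$$
which goes to $0$ as $R \to \infty$. Therefore, $h(R, \theta)$ converges to $0$ uniformly (as a function of $\theta \in [\pi/2, \pi]$ with index $R$), which implies
$$\lim_{R\to\infty}\int_{\pi/2}^{\pi} h(R, \theta)\,d\theta = 0.$$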

We calculate the parts for $\Gamma_2$ and $\Gamma_4$. By the dominated convergence theorem, we have, for y

tx · $[(-x)e^{i\pi}]^{ak}\,dx = \int_0^{\infty} e^{-tx}x^{ak}e^{i\pi ak}\,dx = \frac{1}{t^{ak+1}}\,\Gamma(ak+1)\,e^{i\pi ak}$. Similarly, we obtain the corresponding part in $\Gamma_4$: tx · $[(-x)e^{-i\pi}]^{ak}\,dx = -\frac{1}{t^{ak+1}}\,\Gamma(ak+1)\,e^{-i\pi ak}$. Combining the parts of $\Gamma_2$ and $\Gamma_4$ together, we get $\lim(I_2 + I_4) =$

$$\cdots\,(nq+j)!\,|\Gamma(-a(nq+j))| = \sum_{j=1}^{q-1}\frac{1}{|t|^{aj+1}\,|\Gamma(-aj)|}\sum_{n=0}^{\infty}|t|^{-np}\,\frac{\prod_{i=1}^{np}(aj+i)}{(nq+j)!}.$$

We report the numerical values of $[z^{100}]K(z)/100^{-3/2}$ for the Laplace kernel $K_{\mathrm{Lap}}$

ACKNOWLEDGEMENTS

We gratefully acknowledge the support of the Simons Institute for the Theory of Computing. We thank Peter Bartlett, Mikhail Belkin, Jason D. Lee, and Iosif Pinelis for helpful discussions and thank Mikhail Belkin and Alexandre Eremenko for introducing to us the works (Hui et al., 2019; Liu et al., 2020) and (Flajolet & Sedgewick, 2009), respectively.


Proof. We show it by induction. It holds when $k = 0$ by the initial condition (3). Assume that it holds for some $k \ge 0$, i.e., $\Sigma_k(x, x) = 1$. Consider $k + 1$. We have

A.2 PROOF OF EQUATION (4)

Proof. We plug $\Sigma_k(x, x) = 1$ into (2) and obtain

Recall $\Sigma_0(x, y) = u$. By induction, we get

A.3 PROOF OF THEOREM 7

Lemma 14 and Lemma 15 demonstrate that $\pm 1$ are indeed singularities and analyze the asymptotics for $\kappa_1^{(k)}$ as $z$ tends to $\pm 1$, respectively. Our calculation is inspired by Pinelis (2020), which only considers $k = 2$.

Lemma 14. For every $k \ge 1$, there exists $c_k(z)$ such that

