DEEP NEURAL TANGENT KERNEL AND LAPLACE KERNEL HAVE THE SAME RKHS

Abstract

We prove that the reproducing kernel Hilbert spaces (RKHS) of a deep neural tangent kernel and the Laplace kernel include the same set of functions, when both kernels are restricted to the sphere $\mathbb{S}^{d-1}$. Additionally, we prove that the exponential power kernel with a smaller power (making the kernel less smooth) leads to a larger RKHS, when it is restricted to the sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$.

1. INTRODUCTION

In the past few years, one of the most seminal discoveries in the theory of neural networks is the neural tangent kernel (NTK) (Jacot et al., 2018). The gradient flow on a normally initialized, fully connected neural network with a linear output layer in the infinite-width limit turns out to be equivalent to kernel regression with respect to the NTK (this statement does not necessarily hold for a non-linear output layer, because the NTK is then non-constant (Liu et al., 2020)). Through the NTK, theoretical tools from kernel methods were introduced to the study of deep overparametrized neural networks. Theoretical results were thereby established regarding the convergence (Allen-Zhu et al., 2019; Du et al., 2019a;b; Zou et al., 2020), generalization (Cao & Gu, 2019; Arora et al., 2019b), and loss landscape (Kuditipudi et al., 2019) of overparametrized neural networks in the NTK regime.

While the NTK has proved to be a powerful theoretical tool, a recent work (Geifman et al., 2020) posed an important question: is the NTK significantly different from our repertoire of standard kernels? Prior work provided empirical evidence that supports a negative answer. For example, Belkin et al. (2018) showed experimentally that the Laplace kernel and neural networks had similar performance in fitting random labels. In the task of speech enhancement, exponential power kernels $K_{\exp}^{\gamma,\sigma}(x, y) = e^{-\|x-y\|^{\gamma}/\sigma}$, which include the Laplace kernel as a special case, outperform deep neural networks with even shorter training time (Hui et al., 2019). The experiments in (Geifman et al., 2020) also exhibited similar performance of the Laplace kernel and the NTK.

The expressive power of a positive definite kernel can be characterized by its associated reproducing kernel Hilbert space (RKHS) (Saitoh & Sawano, 2016). The work (Geifman et al., 2020) considered the RKHS of the kernels restricted to the sphere $\mathbb{S}^{d-1} \triangleq \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$ and presented a partial answer to the question by showing the following inclusion relation: $\mathcal{H}_{\mathrm{Gauss}}(\mathbb{S}^{d-1}) \subsetneq \mathcal{H}_{\mathrm{Lap}}(\mathbb{S}^{d-1}) = \mathcal{H}_{N_1}(\mathbb{S}^{d-1}) \subseteq \mathcal{H}_{N_k}(\mathbb{S}^{d-1})$, where the four spaces denote the RKHS associated with the Gaussian kernel, the Laplace kernel, and the NTK of two-layer and (k+1)-layer ($k \geq 1$) fully connected neural networks, respectively. All four kernels are restricted to $\mathbb{S}^{d-1}$. However, the relation between $\mathcal{H}_{\mathrm{Lap}}(\mathbb{S}^{d-1})$ and $\mathcal{H}_{N_k}(\mathbb{S}^{d-1})$ remains open in (Geifman et al., 2020).

We settle this problem and show that the RKHS of the Laplace kernel and of the NTK with any number of layers include the same set of functions, when both are restricted to $\mathbb{S}^{d-1}$. In other words, we prove the following theorem.

Theorem 1. Let $\mathcal{H}_{\mathrm{Lap}}(\mathbb{S}^{d-1})$ and $\mathcal{H}_{N_k}(\mathbb{S}^{d-1})$ be the RKHS associated with the Laplace kernel $K_{\mathrm{Lap}}(x, y) = e^{-c\|x-y\|}$ ($c > 0$) and the neural tangent kernel of a (k+1)-layer fully connected ReLU network, respectively. Both kernels are restricted to the sphere $\mathbb{S}^{d-1}$. Then the two spaces include the same set of functions: $\mathcal{H}_{\mathrm{Lap}}(\mathbb{S}^{d-1}) = \mathcal{H}_{N_k}(\mathbb{S}^{d-1})$, $\forall k \geq 1$.

Our second result is that the exponential power kernel with a smaller power (making the kernel less smooth) leads to a larger RKHS, both when it is restricted to the sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$.

Theorem 2. Let $\mathcal{H}_{K_{\exp}^{\gamma,\sigma}}(\mathbb{S}^{d-1})$ and $\mathcal{H}_{K_{\exp}^{\gamma,\sigma}}(\mathbb{R}^d)$ be the RKHS associated with the exponential power kernel $K_{\exp}^{\gamma,\sigma}(x, y) = \exp\!\left(-\frac{\|x-y\|^{\gamma}}{\sigma}\right)$ ($\gamma, \sigma > 0$) when it is restricted to the unit sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$, respectively. Then we have the following RKHS inclusions:
(1) If $0 < \gamma_1 < \gamma_2 < 2$, then $\mathcal{H}_{K_{\exp}^{\gamma_2,\sigma_2}}(\mathbb{S}^{d-1}) \subseteq \mathcal{H}_{K_{\exp}^{\gamma_1,\sigma_1}}(\mathbb{S}^{d-1})$.
(2) If $0 < \gamma_1 < \gamma_2 < 2$ are rational, then $\mathcal{H}_{K_{\exp}^{\gamma_2,\sigma_2}}(\mathbb{R}^d) \subseteq \mathcal{H}_{K_{\exp}^{\gamma_1,\sigma_1}}(\mathbb{R}^d)$.

When restricted to the unit sphere, the RKHS of the exponential power kernel with $\gamma < 1$ is even larger than that of the NTK. This result partially explains the observation in (Hui et al., 2019) that the best performance is attained by a highly non-smooth exponential power kernel with $\gamma < 1$. Geifman et al. (2020) applied the exponential power kernel and the NTK to classification and regression tasks on the UCI dataset and other large-scale datasets. Their experimental results also showed that the exponential power kernel slightly outperforms the NTK.
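To make the kernel family in Theorem 2 concrete, below is a minimal numerical sketch (in Python/NumPy, not part of the paper's analysis) that evaluates the exponential power kernel for several values of $\gamma$ on points sampled from $\mathbb{S}^{d-1}$ and reports how strongly the Gram-matrix spectrum concentrates in its top eigenvalues. The sample size, dimension, and the "top-20 spectral mass" summary are illustrative choices rather than quantities from the paper; the heuristic to observe is that a smaller $\gamma$ (a less smooth kernel, hence a larger RKHS by Theorem 2) tends to spread spectral mass over more eigendirections. This is a finite-sample illustration, not a proof.

import numpy as np

def sample_sphere(n, d, seed=0):
    """Draw n points approximately uniformly from the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def exp_power_kernel(X, Y, gamma, sigma=1.0):
    """Exponential power kernel K(x, y) = exp(-||x - y||^gamma / sigma).
    gamma = 1 gives the Laplace kernel, gamma = 2 the Gaussian kernel."""
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-(dist ** gamma) / sigma)

if __name__ == "__main__":
    X = sample_sphere(n=400, d=5)
    for gamma in (0.5, 1.0, 1.5, 2.0):
        K = exp_power_kernel(X, X, gamma=gamma)
        eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
        # Fraction of the spectral mass (trace) captured by the top 20 eigenvalues:
        # smoother kernels (larger gamma) concentrate their spectrum faster.
        top20 = eigs[:20].sum() / eigs.sum()
        print(f"gamma = {gamma:.1f}: top-20 spectral mass = {top20:.4f}")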

2. PRELIMINARIES

Let $\mathbb{C}$ denote the set of all complex numbers and write $i \triangleq \sqrt{-1}$. For $z \in \mathbb{C}$, write $\Re z$, $\Im z$, and $\arg z \in (-\pi, \pi]$ for its real part, imaginary part, and argument, respectively. Let $\mathbb{H}_+ \triangleq \{z \in \mathbb{C} \mid \Re z > 0\}$ denote the open right half-plane.



FURTHER RELATED WORK

Minh et al. (2006) showed the complete spectrum of the polynomial and Gaussian kernels on $\mathbb{S}^{d-1}$. They also gave a recursive relation for the eigenvalues of the polynomial kernel on the hypercube $\{-1, 1\}^d$. Prior to the NTK (Jacot et al., 2018), Cho & Saul (2009) presented a pioneering study on kernel methods for neural networks. Bach (2017) studied the eigenvalues of positively homogeneous activation functions of the form $\sigma_\alpha(u) = \max\{u, 0\}^{\alpha}$ (e.g., the ReLU activation when $\alpha = 1$) in their Mercer decomposition with Gegenbauer polynomials. Using the results in (Bach, 2017), Bietti & Mairal (2019) analyzed the two-layer NTK and its RKHS in order to investigate the inductive bias in the NTK regime. They studied the Mercer decomposition of the two-layer NTK with ReLU activation on $\mathbb{S}^{d-1}$ and characterized the corresponding RKHS by showing the asymptotic decay rate of the eigenvalues in the Mercer decomposition with Gegenbauer polynomials. In their derivation of a more concise expression of the ReLU NTK, they used the calculation of (Cho & Saul, 2009) on arc-cosine kernels of degree 0 and 1. Cao et al. (2019) improved the eigenvalue bound for the $k$-th eigenvalue derived in (Bietti & Mairal, 2019) in certain regimes of $d$ relative to $k$. Geifman et al. (2020) used the results in (Bietti & Mairal, 2019) and considered the two-layer ReLU NTK with the bias $\beta$ initialized to zero, rather than initialized with a normal distribution (Jacot et al., 2018). However, neither (Bietti & Mairal, 2019) nor (Geifman et al., 2020) went beyond two layers when characterizing the RKHS of the ReLU NTK. This line of work (Bach, 2017; Bietti & Mairal, 2019; Geifman et al., 2020) is closely related to the Mercer decomposition with spherical harmonics; interested readers are referred to (Atkinson & Han, 2012) for spherical harmonics on the unit sphere. The concurrent work (Bietti & Bach, 2021) analyzed the eigenvalues of the ReLU NTK. Arora et al. (2019a) presented a dynamic programming algorithm that computes the convolutional NTK with ReLU activation. Yang & Salman (2019) analyzed the spectra of the conjugate kernel (CK) and the NTK on the Boolean cube. Fan & Wang (2020) studied the spectrum of the Gram matrix of training samples under the CK and NTK and showed that their eigenvalue distributions converge to a deterministic limit that depends on the eigenvalue distribution of the training samples.
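As an informal complement to the spectral characterizations reviewed above, one can compare the Gram-matrix spectra of the Laplace kernel and of deep ReLU NTKs on points drawn from $\mathbb{S}^{d-1}$. The sketch below assumes the standard fully connected ReLU NTK recursion expressed with the degree-0 and degree-1 arc-cosine kernels $\kappa_0, \kappa_1$ of Cho & Saul (2009): $\Sigma^{(0)}(x, y) = \langle x, y\rangle$, $\Sigma^{(h)} = \kappa_1(\Sigma^{(h-1)})$, $\dot{\Sigma}^{(h)} = \kappa_0(\Sigma^{(h-1)})$, and $\Theta^{(h)} = \Sigma^{(h)} + \Theta^{(h-1)} \dot{\Sigma}^{(h)}$ with $\Theta^{(0)} = \Sigma^{(0)}$. Normalization and bias conventions differ across papers (e.g., Jacot et al., 2018; Geifman et al., 2020), so this should be read as an illustrative sketch rather than the exact kernel treated in the theorems; the sample size, dimension, and eigenvalue-index range are arbitrary illustrative choices.

import numpy as np

def kappa0(u):
    """Degree-0 arc-cosine kernel of Cho & Saul (2009)."""
    return (np.pi - np.arccos(np.clip(u, -1.0, 1.0))) / np.pi

def kappa1(u):
    """Degree-1 arc-cosine kernel of Cho & Saul (2009)."""
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi

def deep_relu_ntk(X, Y, k):
    """NTK of a (k+1)-layer fully connected ReLU network for unit-norm inputs,
    via the recursion Theta^{(h)} = Sigma^{(h)} + Theta^{(h-1)} * dotSigma^{(h)}."""
    sigma = np.clip(X @ Y.T, -1.0, 1.0)  # Sigma^{(0)}(x, y) = <x, y> on the sphere
    theta = sigma.copy()                 # Theta^{(0)} = Sigma^{(0)}
    for _ in range(k):
        sigma_dot = kappa0(sigma)
        sigma = kappa1(sigma)
        theta = sigma + theta * sigma_dot
    return theta

def laplace_kernel(X, Y, c=1.0):
    """Laplace kernel K(x, y) = exp(-c ||x - y||)."""
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-c * dist)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((400, 5))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # restrict inputs to S^{d-1}
    for name, K in [("Laplace   ", laplace_kernel(X, X)),
                    ("NTK, k = 1", deep_relu_ntk(X, X, k=1)),
                    ("NTK, k = 3", deep_relu_ntk(X, X, k=3))]:
        eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
        # Crude decay estimate: slope of log-eigenvalue against log-index over an
        # intermediate index range (a finite-sample heuristic only).
        idx = np.arange(10, 200)
        slope = np.polyfit(np.log(idx), np.log(np.maximum(eigs[idx], 1e-12)), 1)[0]
        print(f"{name}: empirical eigenvalue decay exponent ~ {slope:.2f}")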

