CHARACTERIZING THE SPECTRUM OF THE NTK VIA A POWER SERIES EXPANSION

Abstract

Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function and the depth of the network. We observe that faster decay of the Hermite coefficients leads to faster decay of the NTK coefficients, and we explore the role of depth. Using this series, we first relate the effective rank of the NTK to the effective rank of the input-data Gram matrix. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.

1. INTRODUCTION

Neural networks currently dominate modern artificial intelligence; however, despite their empirical success, establishing a principled theoretical foundation for them remains an active challenge. The key difficulties are that neural networks induce nonconvex optimization objectives (Sontag & Sussmann, 1989) and typically operate in an overparameterized regime which precludes classical statistical learning theory (Anthony & Bartlett, 2002). The persistent success of overparameterized models tuned via nonconvex optimization suggests that the relationship between parameterization, optimization, and generalization is more sophisticated than can be addressed using classical theory. A recent breakthrough in understanding the success of overparameterized networks was established through the Neural Tangent Kernel (NTK) (Jacot et al., 2018). In the infinite width limit the optimization dynamics are described entirely by the NTK, and the parameterization behaves like a linear model (Lee et al., 2019). In this regime explicit guarantees for optimization and generalization can be obtained (Du et al., 2019a;b; Arora et al., 2019a; Allen-Zhu et al., 2019; Zou et al., 2020). While one must be judicious when extrapolating insights from the NTK to finite width networks (Lee et al., 2020), the NTK remains one of the most promising avenues for understanding deep learning on a principled basis.

The spectrum of the NTK is fundamental to both the optimization and generalization of wide networks. In particular, bounding the smallest eigenvalue of the NTK Gram matrix is a staple technique for establishing convergence guarantees for the optimization (Du et al., 2019a;b; Oymak & Soltanolkotabi, 2020).
Furthermore, the full spectrum of the NTK Gram matrix governs the dynamics of the empirical risk (Arora et al., 2019b), and the eigenvalues of the associated integral operator characterize the dynamics of the generalization error outside the training set (Bowman & Montúfar, 2022a;b). Moreover, the decay rate of the generalization error for Gaussian process regression using the NTK can be characterized by the decay rate of the spectrum (Caponnetto & De Vito, 2007; Cui et al., 2021; Jin et al., 2022). The importance of the spectrum of the NTK has led to a variety of efforts to characterize its structure via random matrix theory and other tools (Yang & Salman, 2019; Fan & Wang, 2020). There is a broader body of work studying the closely related Conjugate Kernel, Fisher Information Matrix, and Hessian (Poole et al., 2016; Pennington & Worah, 2017; 2018; Louart et al., 2018; Karakida et al., 2020). These results often require involved random matrix theory or operate in a regime where the input dimension is sent to infinity. By contrast, using just a power series expansion we are able to characterize a variety of attributes of the spectrum for fixed input dimension and recover key results from prior work.

1.1. CONTRIBUTIONS

In Theorem 3.1 we derive coefficients for the power series expansion of the NTK under unit variance initialization, see Assumption 2. Consequently, we are able to derive insights into the NTK spectrum, notably concerning the outlier eigenvalues as well as the asymptotic decay.

• In Theorem 4.1 and Observation 4.2 we demonstrate that the largest eigenvalue λ1(K) of the NTK takes up an Ω(1) proportion of the trace and that there are O(1) outlier eigenvalues of the same order as λ1(K).

• In Theorem 4.3 and Theorem 4.5 we show that the effective rank Tr(K)/λ1(K) of the NTK is upper bounded by a constant multiple of the effective rank Tr(XX^T)/λ1(XX^T) of the input data Gram matrix, for both infinite and finite width networks.

• In Corollary 4.7 and Theorem 4.8 we characterize the asymptotic behavior of the NTK spectrum for both uniform and nonuniform data distributions on the sphere.
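The effective-rank comparison above can be checked numerically. The following is a minimal sketch, not the paper's code: it forms the empirical NTK Gram matrix of a one-hidden-layer bias-free ReLU network at initialization by summing parameter gradients over both layers, then compares Tr(K)/λ1(K) against Tr(XX^T)/λ1(XX^T). The network width, data dimensions, and ReLU activation are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 4096                  # samples, input dim, hidden width (illustrative)

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # place the data on the sphere

W = rng.standard_normal((m, d))         # first-layer weights, N(0, 1) entries
a = rng.standard_normal(m)              # second-layer weights

def empirical_ntk(X, W, a):
    """NTK Gram matrix of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x),
    taking gradients with respect to both layers' parameters."""
    pre = X @ W.T                       # (n, m) preactivations w_j . x_i
    act = np.maximum(pre, 0.0)          # relu(w_j . x_i)
    der = (pre > 0).astype(float)       # relu'(w_j . x_i)
    G = X @ X.T
    K_w = ((der * a) @ (der * a).T) * G  # first-layer gradient inner products
    K_a = act @ act.T                    # second-layer gradient inner products
    return (K_w + K_a) / W.shape[0]

def effective_rank(M):
    """Tr(M) / lambda_1(M) for a symmetric PSD matrix M."""
    return np.trace(M) / np.linalg.eigvalsh(M)[-1]

K = empirical_ntk(X, W, a)
print("eff. rank of NTK:      ", effective_rank(K))
print("eff. rank of data Gram:", effective_rank(X @ X.T))
```

For a PSD Gram matrix the effective rank always lies between 1 and the matrix rank, so with rows on the unit sphere the data Gram's effective rank is at most d; the theorems bound the NTK's effective rank by a constant multiple of this quantity.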

1.2. RELATED WORK

Neural Tangent Kernel (NTK): the NTK was introduced by Jacot et al. (2018), who demonstrated that in the infinite width limit neural network optimization is described by a kernel gradient descent. As a consequence, when the network is polynomially wide in the number of samples, global convergence guarantees for gradient descent can be obtained (Du et al., 2019a;b; Allen-Zhu et al., 2019; Zou & Gu, 2019; Lee et al., 2019; Zou et al., 2020; Oymak & Soltanolkotabi, 2020; Nguyen & Mondelli, 2020; Nguyen, 2021). Furthermore, the connection between infinite width networks and Gaussian processes, which traces back to Neal (1996), has been reinvigorated in light of the NTK:
Lee et al. (2018); de G. Matthews et al. (2018); Novak et al. (2019).

Analysis of NTK Spectrum: theoretical analysis of the NTK spectrum via random matrix theory was investigated by Yang & Salman (2019); Fan & Wang (2020) in the high dimensional limit. Velikanov & Yarotsky (2021) demonstrated that for ReLU networks the spectrum of the NTK integral operator asymptotically follows a power law, which is consistent with our results for the uniform data distribution. Basri et al. (2019) calculated the NTK spectrum for shallow ReLU networks under the uniform distribution, which was then extended to the nonuniform case by Basri et al. (2020). Geifman et al. (2022) analyzed the spectrum of the conjugate kernel and NTK for convolutional networks with ReLU activations whose pixels are uniformly distributed on the sphere. Geifman et al. (2020); Bietti & Bach (2021); Chen & Xu (2021) analyzed the reproducing kernel Hilbert spaces of the NTK for ReLU networks and the Laplace kernel via the decay rate of the spectrum of the kernel. In contrast to these previous works, we are able to address the spectrum in the finite dimensional setting and characterize the impact of different activation functions on it.

Hermite Expansion: Daniely et al. (2016) used Hermite expansion to study the expressivity of the Conjugate Kernel. Simon et al. (2022) used this technique to demonstrate that any dot product kernel can be realized by the NTK or Conjugate Kernel of a shallow, zero-bias network. Oymak & Soltanolkotabi (2020) used Hermite expansion to study the NTK and establish a quantitative bound on the smallest eigenvalue for shallow networks. This approach was adopted by Nguyen & Mondelli (2020) to handle convergence for deep networks, with sharp bounds on the smallest NTK eigenvalue for deep ReLU networks provided by Nguyen et al. (2021). The Hermite approach was also utilized by Panigrahi et al. (2020) to analyze the smallest NTK eigenvalue of shallow networks under various activations.
Finally, in a concurrent work Han et al. (2022) use Hermite expansions to develop a principled and efficient polynomial based approximation algorithm for the NTK and CNTK. In contrast to the aforementioned works, here we employ the Hermite expansion to charac-

