THE EIGENLEARNING FRAMEWORK: A CONSERVATION LAW PERSPECTIVE ON KERNEL REGRESSION AND WIDE NEURAL NETWORKS

Anonymous

Abstract

We derive a simple unified framework giving closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. In particular, we show that KRR can be interpreted as an explicit competition among kernel eigenmodes for a fixed supply of a quantity we term "learnability." These improvements are enabled by a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al. (2020), ii) generalize a previous result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics.

1. INTRODUCTION

Kernel ridge regression (KRR) is a popular, tractable learning algorithm that has seen a surge of attention due to equivalences to infinite-width neural networks (NNs) (Lee et al., 2018; Jacot et al., 2018). In this paper, we derive a simple theory of the generalization of KRR that yields estimators for many quantities of interest, including test risk and the covariance of the predicted function. Our framework is consistent with other recent works, such as those of Canatar et al. (2021) and Jacot et al. (2020), but is simpler and easier to derive. Our framework paints a new picture of KRR as an explicit competition between eigenmodes for a fixed budget of a quantity we term "learnability," and downstream generalization metrics can be expressed entirely in terms of the learnability received by each mode.

This picture stems from a conservation law latent in KRR which limits any kernel's ability to learn any complete basis of target functions. The conserved quantity, learnability, is the inner product of the target and predicted functions and, as we show, can be interpreted as a measure of how well the target function can be learned by a particular kernel given n training examples. We prove that the total learnability, summed over a complete basis of target functions (such as the kernel eigenbasis), is no greater than the number of training samples, with equality at zero ridge parameter. The conservation of this quantity suggests that it will prove useful for understanding the generalization of KRR. This intuition is borne out by our subsequent analysis: we derive a set of simple, closed-form estimates for test risk and other objects of interest and find that all of them can be transparently expressed in terms of eigenmode learnabilities.

Our final expressions are more compact and readily interpretable than those of prior work, and our derivation is significantly simpler and more accessible: where prior works relied on the heavy mathematical machinery of replica calculations and random matrix theory to obtain comparable results, our approach requires only basic linear algebra, leveraging our conservation law at a critical juncture to bypass the need for advanced techniques.

We use our improved framework to shed light on several topics of interest:

i) We provide a compelling theoretical explanation for the "deep bootstrap" phenomenon of Nakkiran et al. (2020) and identify two regimes of NN fitting occurring at early and late training times.

ii) We generalize a previous result regarding the hardness of the parity problem for rotation-invariant kernels. Our technique is simple and illustrates the power of our framework.

iii) We craft an estimator for predicted function smoothness, a new tool for the theoretical study of adversarial robustness.

iv) We draw a tight analogy between our framework and the free Fermi gas, a well-studied statistical physics system, and thereby transfer insights about the free Fermi gas to KRR.

We structure these applications as a series of vignettes. The paper is organized as follows. We give preliminaries in Section 2. We define our conserved quantity and state its basic properties in Section 3. We characterize the generalization of KRR in terms of this quantity in Section 4. We check these results experimentally in Section 5. Section 6 consists of a series of short vignettes discussing topics (i)-(iv). We conclude in Section 7.
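To make the conservation law concrete, the following is a minimal numerical sketch (an illustration under assumed choices of kernel, domain, and sample size, not code or an experiment from the paper): we discretize a one-dimensional domain, compute the kernel eigenfunctions under the uniform measure, take each eigenfunction in turn as the target of ridgeless KRR, and compute each mode's learnability as the inner product of target and predicted function. Summed over the eigenbasis, the learnabilities should equal the number of training samples n.

```python
# Numerical check of the conservation law (a sketch under illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

M, n = 200, 30                              # discretized domain size and training set size
X_all = np.linspace(-1, 1, M)[:, None]      # full (discretized) 1-d input domain

def kernel(A, B):
    # Laplacian kernel; chosen only because its Gram matrices are well-conditioned
    return np.exp(-np.abs(A - B.T) / 0.2)

# Kernel eigenfunctions w.r.t. the uniform measure on the domain:
# (1/M) K_full @ phi_k = lambda_k * phi_k, normalized so that (1/M) sum_x phi_k(x)^2 = 1.
K_full = kernel(X_all, X_all)
eigvals, eigvecs = np.linalg.eigh(K_full / M)
Phi = eigvecs * np.sqrt(M)                  # columns are eigenfunctions on the domain

# Ridgeless KRR on a random training subset, one eigenfunction at a time.
train_idx = rng.choice(M, size=n, replace=False)
K_train = K_full[np.ix_(train_idx, train_idx)]
K_cross = K_full[:, train_idx]              # kernel between all domain points and training points

learnabilities = []
for k in range(M):
    target = Phi[:, k]
    f_hat = K_cross @ np.linalg.solve(K_train, target[train_idx])  # zero-ridge KRR prediction
    learnabilities.append(np.mean(target * f_hat))                 # <target, prediction>

print(f"sum of learnabilities = {sum(learnabilities):.4f}, n = {n}")
# The sum matches n up to numerical error; a positive ridge makes it strictly smaller.
```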

1.1. RELATED WORK

The present line of work has its origins in early studies of the generalization of Gaussian process regression (Opper, 1997; Sollich, 1999), with Sollich (2001) deriving an estimator giving the expected test risk of KRR in terms of the eigenvalues of the kernel operator and the eigendecomposition of the target function. We refer to this result as the "omniscient risk estimator" (a term we borrow from Wei et al. (2022)), as it assumes full knowledge of the data distribution and target function; its general form is sketched at the end of this subsection. Bordelon et al. (2020) and Canatar et al. (2021) brought these ideas into a modern context, deriving the omniscient risk estimator with a replica calculation and connecting it to the "neural tangent kernel" (NTK) theory of wide neural networks (Jacot et al., 2018), with Loureiro et al. (2021) extending the result to arbitrary convex losses. Sollich & Halees (2002); Caponnetto & De Vito (2007); Spigler et al. (2020); Cui et al. (2021); Mallinar et al. (2022) study the asymptotic consistency and convergence rates of KRR in a similar vein. Jacot et al. (2020) and Wei et al. (2022) used random matrix theory to derive a risk estimator requiring only training data. In parallel with work on KRR, Dobriban & Wager (2018); Wu & Xu (2020); Richards et al. (2021); Hastie et al. (2022) and Bartlett et al. (2021) developed equivalent results in the context of linear regression using tools from random matrix theory.

In the present paper, we provide a new interpretation of this rich body of work in terms of explicit competition between eigenmodes, provide simplified derivations of its main results, and break new ground with applications to new problems of interest. We compare selected works with ours and provide a dictionary between respective notations in Appendix A.

All prior works in this line (and indeed most works in machine learning theory more broadly) rely on approximations, asymptotics, or bounds to make claims about generalization. The conservation law is unique in that it gives a sharp equality even at finite dataset size. This makes it a particularly robust starting point for the development of our framework (which does later make approximations).

In addition to those listed above, many works have investigated the spectral bias of neural networks in terms of both stopping time (Rahaman et al., 2019; Xu et al., 2019b;a; Xu, 2018; Cao et al., 2019; Su & Yang, 2019) and the number of samples (Valle-Perez et al., 2018; Yang & Salman, 2019; Arora et al., 2019). Our investigation into the deep bootstrap ties together these threads of work: we find that the interplay of these two sources of spectral bias is responsible for the deep bootstrap phenomenology.
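For orientation, the omniscient risk estimate discussed above can be sketched in a few lines. The sketch below is only illustrative: it follows the general form reported in this line of work (e.g., Bordelon et al., 2020; Canatar et al., 2021), in which an implicit constant κ is fixed by the kernel spectrum, sample size, and ridge, each eigenmode receives learnability λ_i/(λ_i + κ), and the test risk is assembled from the unlearned fraction of each mode. All variable names are ours, normalization conventions vary between papers, and our own expressions are given in Section 4.

```python
# Hedged sketch of an omniscient-style risk estimate for KRR (illustrative conventions,
# roughly following Bordelon et al. (2020) / Canatar et al. (2021); details may differ).
import numpy as np
from scipy.optimize import brentq

def omniscient_risk_estimate(eigvals, target_coeffs, n, ridge=0.0, noise_var=0.0):
    """eigvals: kernel eigenvalues lambda_i; target_coeffs: target eigencoefficients v_i;
    n: number of training samples; ridge: explicit ridge; noise_var: label noise variance."""
    # Effective regularization kappa solves: sum_i lambda_i/(lambda_i + kappa) + ridge/kappa = n.
    def constraint(kappa):
        return np.sum(eigvals / (eigvals + kappa)) + ridge / kappa - n
    kappa = brentq(constraint, 1e-12, 1e12)

    learnability = eigvals / (eigvals + kappa)          # per-mode learnability L_i
    # Conservation: sum_i L_i = n - ridge/kappa <= n, with equality at zero ridge.
    overfit_coeff = n / (n - np.sum(learnability**2))   # overfitting coefficient
    unlearned_power = np.sum((1 - learnability)**2 * target_coeffs**2)
    return overfit_coeff * (unlearned_power + noise_var), learnability

# Toy usage with a power-law spectrum and target, as often assumed in this literature.
lam = 1.0 / np.arange(1, 501)**2
v = 1.0 / np.arange(1, 501)
risk, L = omniscient_risk_estimate(lam, v, n=100, ridge=1e-3, noise_var=0.01)
print(f"estimated test risk: {risk:.4f}; total learnability: {L.sum():.2f} (<= 100)")
```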

2. PRELIMINARIES AND NOTATION

We study a standard supervised learning setting in which $n$ training samples $\mathcal{D} \equiv \{x_i\}_{i=1}^n$ are drawn i.i.d. from a distribution $p$ over $\mathbb{R}^d$. We wish to learn a (scalar) target function $f$ given noisy evaluations $y \equiv (y_i)_{i=1}^n$, where $y_i = f(x_i) + \eta_i$ with $\eta_i \sim \mathcal{N}(0, \epsilon^2)$. As it simplifies later




