THE EIGENLEARNING FRAMEWORK: A CONSERVATION LAW PERSPECTIVE ON KERNEL REGRESSION AND WIDE NEURAL NETWORKS Anonymous

Abstract

We derive a simple unified framework giving closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. In particular, we show that KRR can be interpreted as an explicit competition among kernel eigenmodes for a fixed supply of a quantity we term "learnability." These improvements are enabled by a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al. (2020), ii) generalize a previous result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics.

1. INTRODUCTION

Kernel ridge regression (KRR) is a popular, tractable learning algorithm that has seen a surge of attention due to equivalences to infinite-width neural networks (NNs) (Lee et al., 2018; Jacot et al., 2018). In this paper, we derive a simple theory of the generalization of KRR that yields estimators for many quantities of interest, including test risk and the covariance of the predicted function. Our framework is consistent with other recent works, such as those of Canatar et al. (2021) and Jacot et al. (2020), but is simpler and easier to derive.

Our framework paints a new picture of KRR as an explicit competition between eigenmodes for a fixed budget of a quantity we term "learnability," and downstream generalization metrics can be expressed entirely in terms of the learnability received by each mode. This picture stems from a conservation law latent in KRR which limits any kernel's ability to learn any complete basis of target functions. The conserved quantity, learnability, is the inner product of the target and predicted functions and, as we show, can be interpreted as a measure of how well the target function can be learned by a particular kernel given n training examples. We prove that the total learnability, summed over a complete basis of target functions (such as the kernel eigenbasis), is no greater than the number of training samples, with equality at zero ridge parameter. The conservation of this quantity suggests that it will prove useful for understanding the generalization of KRR. This intuition is borne out by our subsequent analysis: we derive a set of simple, closed-form estimates for test risk and other objects of interest and find that all of them can be transparently expressed in terms of eigenmode learnabilities. Our expressions are more compact and readily interpretable than those of prior work and constitute a major simplification. Our derivation of these estimators is significantly simpler and more accessible than those of prior work, which relied on the heavy mathematical machinery of replica calculations and random matrix theory to obtain comparable results. By contrast, our approach requires only basic linear algebra, leveraging our conservation law at a critical juncture to bypass the need for advanced techniques.

We use our improved framework to shed light on several topics of interest:

i) We provide a compelling theoretical explanation for the "deep bootstrap" phenomenon of Nakkiran et al. (2020) and identify two regimes of NN fitting occurring at early and late training times.

ii) We generalize a previous result regarding the hardness of the parity problem for rotation-invariant kernels. Our technique is simple and illustrates the power of our framework.

iii) We craft an estimator for predicted function smoothness, a new tool for the theoretical study of adversarial robustness.

iv) We draw a tight analogy between our framework and the free Fermi gas, a well-studied statistical physics system, and thereby transfer insights into the free Fermi gas over to KRR.

We structure these applications as a series of vignettes. The paper is organized as follows. We give preliminaries in Section 2. We define our conserved quantity and state its basic properties in Section 3. We characterize the generalization of KRR in terms of this quantity in Section 4. We check these results experimentally in Section 5. Section 6 consists of a series of short vignettes discussing topics (i)-(iv). We conclude in Section 7.

1.1. RELATED WORK

The present line of work has its origins with early studies of the generalization of Gaussian process regression (Opper, 1997; Sollich, 1999), with Sollich (2001) deriving an estimator giving the expected test risk of KRR in terms of the eigenvalues of the kernel operator and the eigendecomposition of the target function. We refer to this result as the "omniscient risk estimator," as it assumes full knowledge of the data distribution and target function. Bordelon et al. (2020) and Canatar et al. (2021) brought these ideas into a modern context, deriving the omniscient risk estimator with a replica calculation and connecting it to the "neural tangent kernel" (NTK) theory of wide neural networks (Jacot et al., 2018), with Loureiro et al. (2021) extending the result to arbitrary convex losses. Sollich & Halees (2002); Caponnetto & De Vito (2007); Spigler et al. (2020); Cui et al. (2021); Mallinar et al. (2022) study the asymptotic consistency and convergence rates of KRR in a similar vein. Jacot et al. (2020); Wei et al. (2022) used random matrix theory to derive a risk estimator requiring only training data.

In parallel with work on KRR, Dobriban & Wager (2018); Wu & Xu (2020); Richards et al. (2021); Hastie et al. (2022) and Bartlett et al. (2021) developed equivalent results in the context of linear regression using tools from random matrix theory. In the present paper, we provide a new interpretation for this rich body of work in terms of explicit competition between eigenmodes, provide simplified derivations of the main results of this line of work, and break new ground with applications to new problems of interest. We compare selected works with ours and provide a dictionary between respective notations in Appendix A.

All prior works in this line - and indeed most works in machine learning theory more broadly - rely on approximations, asymptotics, or bounds to make any claims about generalization. The conservation law is unique in that it gives a sharp equality even at finite dataset size. This makes it a particularly robust starting point for the development of our framework (which does later make approximations).

In addition to those listed above, many works have investigated the spectral bias of neural networks in terms of both stopping time (Rahaman et al., 2019; Xu et al., 2019b; a; Xu, 2018; Cao et al., 2019; Su & Yang, 2019) and the number of samples (Valle-Perez et al., 2018; Yang & Salman, 2019; Arora et al., 2019). Our investigation into the deep bootstrap ties together these threads of work: we find that the interplay of these two sources of spectral bias is responsible for the deep bootstrap phenomenology.

2. PRELIMINARIES AND NOTATION

We study a standard supervised learning setting in which $n$ training samples $\mathcal{D} \equiv \{x_i\}_{i=1}^n$ are drawn i.i.d. from a distribution $p$ over $\mathbb{R}^d$. We wish to learn a (scalar) target function $f$ given noisy evaluations $\mathbf{y} \equiv (y_i)_{i=1}^n$ with $y_i = f(x_i) + \eta_i$, where $\eta_i \sim \mathcal{N}(0, \epsilon^2)$. As it simplifies later analysis, we assume this $\mathcal{N}(0, \epsilon^2)$ label noise is also applied to test targets. Our results are easily generalized to vector-valued functions as in Canatar et al. (2021). For scalar functions $g, h$, we define $\langle g, h \rangle \equiv \mathbb{E}_{x\sim p}[g(x)h(x)]$ and $||g||^2 \equiv \langle g, g\rangle$. We shall study the KRR predicted function $\hat f$ given by

$$\hat f(x) = \mathbf{k}_{x\mathcal{D}}\left(\mathbf{K}_{\mathcal{D}\mathcal{D}} + \delta \mathbf{I}_n\right)^{-1}\mathbf{y}, \qquad (1)$$

where, for a positive-semidefinite kernel $K$, we have constructed the row vector $[\mathbf{k}_{x\mathcal{D}}]_i = K(x, x_i)$ and the empirical kernel matrix $[\mathbf{K}_{\mathcal{D}\mathcal{D}}]_{ij} = K(x_i, x_j)$ (which we trust to be nonsingular), $\delta$ is a ridge parameter, and $\mathbf{I}_n$ is the identity matrix. We wish to minimize test mean squared error (MSE) $\mathcal{E}^{(\mathcal{D})}(f) = ||f - \hat f||^2 + \epsilon^2$ and its expectation over training sets $\mathcal{E}(f) = \mathbb{E}_\mathcal{D}\left[\mathcal{E}^{(\mathcal{D})}(f)\right]$ (where the expectation over $\mathcal{D}$ also averages over noise values). We emphasize that, here and in our discussion of learnability, $\hat f$ is understood to be the KRR predictor given by Equation 1 from training targets generated with target function $f$. In the classical bias-variance decomposition of test risk, we have bias $\mathcal{B}(f) = ||f - \mathbb{E}_\mathcal{D}[\hat f]||^2 + \epsilon^2$ and variance $\mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f)$. We also define train MSE as $\mathcal{E}_{\rm tr}(f) = \frac{1}{n}\sum_{i=1}^n \left(f(x_i) - \hat f(x_i)\right)^2$.

2.1. THE KERNEL EIGENSYSTEM

By Mercer's theorem (Mercer, 1909), the kernel admits the decomposition $K(x, x') = \sum_i \lambda_i \phi_i(x)\phi_i(x')$, with eigenvalues $\lambda_i \ge 0$ and a basis of eigenfunctions $\phi_i$ satisfying $\langle\phi_i, \phi_j\rangle = \delta_{ij}$. We assume eigenvalues are indexed in descending order. As the eigenfunctions form a complete basis, we are free to decompose $f$ and $\hat f$ as $f(x) = \sum_i v_i\phi_i(x)$ and $\hat f(x) = \sum_i \hat v_i\phi_i(x)$, where $\mathbf{v} \equiv (v_i)_i$ and $\hat{\mathbf{v}} \equiv (\hat v_i)_i$ are vectors of eigencoefficients.
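To make Equation 1 concrete, here is a minimal NumPy sketch of the KRR predictor. The kernel choice and all function names are illustrative assumptions for this sketch, not part of our framework; any positive-semidefinite kernel may be substituted.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Example positive-semidefinite kernel K(x, x'); purely illustrative.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def krr_predict(kernel, X_train, y, X_test, ridge):
    """KRR predictor f_hat(x) = k_xD (K_DD + delta I_n)^{-1} y (Equation 1)."""
    K_DD = kernel(X_train, X_train)        # empirical kernel matrix
    k_xD = kernel(X_test, X_train)         # test-train kernel evaluations
    alpha = np.linalg.solve(K_DD + ridge * np.eye(len(y)), y)
    return k_xD @ alpha
```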

3. LEARNABILITY AND ITS CONSERVATION LAW

Here we define learnability, our conserved quantity. Learnability is a measure of $\hat f$ defined similarly to test MSE, but it is linear instead of quadratic. For any function $f$ such that $||f|| = 1$, let $\mathcal{L}^{(\mathcal{D})}(f) \equiv \langle f, \hat f\rangle$ and $\mathcal{L}(f) \equiv \mathbb{E}_\mathcal{D}\left[\mathcal{L}^{(\mathcal{D})}(f)\right]$, where, as with MSE, $\hat f$ is given by Equation 1. We refer to $\mathcal{L}^{(\mathcal{D})}(f)$ as the $\mathcal{D}$-learnability of the function $f$ with respect to the kernel and $n$, and we refer to $\mathcal{L}(f)$ as the learnability. Up to normalization, this quantity is akin to the cosine similarity between $f$ and $\hat f$. We shall show that, for KRR, learnability gives a useful indication of how well a function (particularly a kernel eigenfunction) is learned. Results in this section are rigorous and exact; see Appendix G for proofs.

We begin by stating several basic properties of learnability to build intuition for the quantity.

Proposition 3.1. The following properties of $\mathcal{L}^{(\mathcal{D})}$, $\mathcal{L}$, $\{\phi_i\}$, and any $f$ such that $||f|| = 1$ hold:

(a) $\mathcal{L}(\phi_i),\ \mathcal{L}^{(\mathcal{D})}(\phi_i) \in [0, 1]$.

(b) When $n = 0$, $\mathcal{L}^{(\mathcal{D})}(f) = \mathcal{L}(f) = 0$.

(c) Let $\mathcal{D}^+$ be $\mathcal{D} \cup x$, where $x \in \mathcal{X}$, $x \notin \mathcal{D}$ is a new data point. Then $\mathcal{L}^{(\mathcal{D}^+)}(\phi_i) \ge \mathcal{L}^{(\mathcal{D})}(\phi_i)$.

(d) $\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_i) \ge 0$, $\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_j) \le 0$ for $j \ne i$, and $\frac{\partial}{\partial\delta}\mathcal{L}^{(\mathcal{D})}(\phi_i) \le 0$.

(e) $\mathcal{E}(f) \ge \mathcal{B}(f) \ge (1 - \mathcal{L}(f))^2$.

Properties (a-c) together give an intuitive picture of the learning process: the learnability of each eigenfunction monotonically increases from zero as the training set grows, attaining its maximum of one in the ridgeless, maximal-data limit. Property (d) shows that the kernel eigenmodes are in competition - increasing one eigenvalue while fixing all others can only improve the learnability of the corresponding eigenfunction, but can only harm the learnabilities of all others - and that regularization only harms eigenfunction learnability. Property (e) gives a lower bound on MSE in terms of learnability and will be useful when we discuss the parity problem.

We now state the conservation law obeyed by learnability. This rule follows from the view of KRR as a projection of $f$ onto the $n$-dimensional subspace of the RKHS defined by the $n$ samples and is closely related to the "dimension bound" for linear learning rules given by Hsu (2021).

Theorem 3.2 (Conservation of learnability). For any complete basis of orthogonal functions $\mathcal{F}$, when the ridge parameter $\delta = 0$,

$$\sum_{f\in\mathcal{F}} \mathcal{L}^{(\mathcal{D})}(f) = \sum_{f\in\mathcal{F}} \mathcal{L}(f) = n,$$

and when $\delta > 0$,

$$\sum_{f\in\mathcal{F}} \mathcal{L}^{(\mathcal{D})}(f) < n \quad\text{and}\quad \sum_{f\in\mathcal{F}} \mathcal{L}(f) < n.$$

This result states that, summed over any complete basis of target functions, total learnability is at most the number of training examples, with equality at zero ridge. This theorem is a stronger version of the classic "no-free-lunch" theorem for learning algorithms, which states that, averaged over all target functions, all models perform at chance level (Wolpert, 1996). While deep and compelling, this classic result is rarely informative in practice because the set of all possible target functions is too large for the average to yield a nonvacuous statement. By contrast, Theorem 3.2 requires only an average over a basis of target functions and, as we shall see, it is directly informative in understanding the generalization of KRR.
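As a quick numerical illustration of Theorem 3.2, the total $\mathcal{D}$-learnability over any complete orthonormal basis reduces to $\mathrm{Tr}\left[\mathbf{K}_{\mathcal{D}\mathcal{D}}(\mathbf{K}_{\mathcal{D}\mathcal{D}} + \delta\mathbf{I}_n)^{-1}\right]$ (see Appendix G.1). The short sketch below checks this trace for an arbitrary stand-in kernel and dataset; all specific choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))

def rbf(A, B, bandwidth=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

K_DD = rbf(X, X)
for delta in [0.0, 0.1, 1.0]:
    # Summed over any complete basis, D-learnability equals this trace (Appendix G.1).
    total = np.trace(K_DD @ np.linalg.inv(K_DD + delta * np.eye(n)))
    print(delta, total)   # equals n at delta = 0, strictly less for delta > 0
```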

4. THEORY

Eigenmode learnabilities are interpretable quantities, obeying a conservation law and several intuitive properties. These features suggest they may prove a wise choice of variables in a theory of KRR generalization. Here we show that this is indeed the case: we derive a suite of estimators for various metrics of KRR generalization, all of which can be expressed entirely in terms of modewise learnabilities and thereby inherit their interpretability. Because they characterize KRR learning via eigenmode learnabilities, we call these equations the eigenlearning equations. We sketch our method here and relegate the derivation to Appendix H. Our derivation leverages our conservation law to avoid the need for either a replica calculation or random matrix theoretic tools and is consequently more accessible and extensible than the derivations of prior works. Results in this section are nonrigorous. Like comparable works, our derivations use an experimentally-validated "universality" assumption that the kernel features may be replaced by independent Gaussian features with the same statistics without changing downstream generalization metrics. We begin by observing that $\hat{\mathbf{v}}$ depends linearly on $\mathbf{v}$, and we can thus construct a "learning transfer matrix" $T^{(\mathcal{D})}$ such that $\hat{\mathbf{v}} = T^{(\mathcal{D})}\mathbf{v}$. $T^{(\mathcal{D})}$ is equivalent to the "KRR reconstruction operator" of Jacot et al. (2020) and, viewing KRR as linear regression in eigenfeature space, is essentially the "hat" matrix of linear regression. We then study $\mathbb{E}\left[T^{(\mathcal{D})}\right]$, leveraging our universality assumption to show that it is diagonal with diagonal elements of the form $\lambda_i/(\lambda_i + \kappa)$, where $\kappa$ is a mode-independent constant. We show the mode-independence of $\kappa$ with an eigenmode-removal argument reminiscent of the cavity method of statistical physics (Del Ferraro et al., 2014). We use Theorem 3.2 to determine $\kappa$. Differentiating this result with respect to kernel eigenvalues, we obtain the covariance of $T^{(\mathcal{D})}$ and thus of $\hat{\mathbf{v}}$, which permits evaluation of various test metrics.

4.1. THE EIGENLEARNING EQUATIONS

Let the effective regularization $\kappa$ be the unique positive solution to

$$n = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa}. \qquad (6)$$

We calculate various test and train metrics to be

$$\text{(eigenmode learnability)} \quad \mathcal{L}(\phi_i) = \mathcal{L}_i \equiv \frac{\lambda_i}{\lambda_i + \kappa}, \qquad (7)$$

$$\text{(overfitting coefficient)} \quad \mathcal{E}_0 = n\frac{\partial\kappa}{\partial\delta} = \frac{n}{n - \sum_i \mathcal{L}_i^2}, \qquad (8)$$

$$\text{(test MSE)} \quad \mathcal{E}(f) = \mathcal{E}_0\left[\sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2\right], \qquad (9)$$

$$\text{(bias of test MSE)} \quad \mathcal{B}(f) = \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 = \frac{\mathcal{E}(f)}{\mathcal{E}_0}, \qquad (10)$$

$$\text{(variance of test MSE)} \quad \mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f) = \frac{\mathcal{E}_0 - 1}{\mathcal{E}_0}\,\mathcal{E}(f), \qquad (11)$$

$$\text{(train MSE)} \quad \mathcal{E}_{\rm tr}(f) = \frac{\delta^2}{n^2\kappa^2}\,\mathcal{E}(f), \qquad (12)$$

$$\text{(mean predictor)} \quad \mathbb{E}[\hat v_i] = \mathcal{L}_i v_i, \qquad (13)$$

$$\text{(covariance of predictor)} \quad \mathrm{Cov}[\hat v_i, \hat v_j] = \frac{\mathcal{E}(f)\,\mathcal{L}_i^2}{n}\,\delta_{ij}. \qquad (14)$$

The learnability of an arbitrary normalized function can be computed as $\mathcal{L}(f) = \sum_i \mathcal{L}_i v_i^2$. The mean of the predicted function evaluated at input $x$ can be obtained as $\mathbb{E}[\hat f(x)] = \sum_i \mathbb{E}[\hat v_i]\,\phi_i(x)$, with the covariance obtained similarly.
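The eigenlearning equations are straightforward to evaluate numerically given a spectrum and target eigencoefficients. The sketch below (plain NumPy; the function names are ours and illustrative) solves Equation 6 for $\kappa$ by bisection and then evaluates Equations 7-12; it assumes the supplied spectrum has at least $n$ nonzero eigenvalues so that the ridgeless equation has a solution.

```python
import numpy as np

def solve_kappa(eigvals, n, ridge):
    """Solve n = sum_i lam_i/(lam_i + kappa) + ridge/kappa (Equation 6) for kappa > 0."""
    def lhs(kappa):
        return np.sum(eigvals / (eigvals + kappa)) + ridge / kappa
    lo, hi = 1e-15, eigvals.sum() + ridge + 1.0     # lhs is decreasing in kappa
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if lhs(mid) > n else (lo, mid)
    return np.sqrt(lo * hi)

def eigenlearning(eigvals, v, n, ridge=0.0, noise_var=0.0):
    kappa = solve_kappa(eigvals, n, ridge)
    L = eigvals / (eigvals + kappa)                          # Eq. 7: modewise learnabilities
    E0 = n / (n - np.sum(L ** 2))                            # Eq. 8: overfitting coefficient
    test_mse = E0 * (np.sum((1 - L) ** 2 * v ** 2) + noise_var)   # Eq. 9
    return dict(kappa=kappa, L=L, E0=E0, test_mse=test_mse,
                bias=test_mse / E0,                          # Eq. 10
                variance=(E0 - 1) / E0 * test_mse,           # Eq. 11
                train_mse=ridge ** 2 / (n ** 2 * kappa ** 2) * test_mse)  # Eq. 12

# Example usage with a toy powerlaw spectrum and a target concentrated on the first few modes.
lam = 1.0 / np.arange(1, 201) ** 2
v = np.zeros(200); v[:5] = 1.0 / np.sqrt(5)
print(eigenlearning(lam, v, n=50, ridge=1e-3))
```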

4.2. INTERPRETATION OF THE EIGENLEARNING EQUATIONS

Remarkably, all test metrics, and indeed all second-order statistics of $\hat f$, can be expressed solely in terms of modewise learnabilities, with no additional reference to eigenvalues required. This strongly suggests that we have identified the "correct" choice of variables for the problem. We now interpret these equations through the lens of learnability. Equation 6 gives a constant $\kappa$ which decreases monotonically as $n$ increases. Equation 7 states that the learnability of eigenmode $i$ is near 0 when $\kappa \gg \lambda_i$ and approaches 1 when $\kappa \ll \lambda_i$, in line with the observation of Jacot et al. (2020) that a mode is well-learned when $\kappa \ll \lambda_i$. Inserting Equation 7 into 6 and setting $\delta = 0$, we recover our conservation law. Equation 9 is the omniscient risk estimator for test MSE. Modes with learnability equal to one are fully learned and do not contribute to the risk. Noise acts the same as target weight placed in modes with learnability zero. Equation 8 defines $\mathcal{E}_0$, the MSE when trained on pure-noise targets ($v_i = 0$ and $\epsilon^2 = 1$). It is strictly greater than one and can be interpreted as the factor by which pure noise is overfit. The denominator explodes when the $n$ units of learnability are fully allocated to the first $n$ modes, not distributed among a greater number ($\mathcal{L}_{i\le n} = 1$, $\mathcal{L}_{i>n} = 0$). In this sense, overfitting of noise is "overconfidence" on the part of the kernel that the target function lies in the top-$n$ subspace, and "hedging" via a wider distribution of learnability (or sacrificing a portion of the learnability budget to the ridge parameter) lowers $\mathcal{E}_0$ and fixes this problem. This overconfidence occurs when the kernel eigenvalues drop sharply around index $n$ and is the cause of double-descent peaks (Belkin et al., 2019), which other works have also found can be avoided with an appropriate ridge parameter (Canatar et al., 2021; Nakkiran et al., 2021). Equations 10 and 11 show that, remarkably, the bias and variance can be expressed solely in terms of $\mathcal{E}(f)$ and $\mathcal{E}_0$. Since learnabilities strictly increase as $n$ grows, the bias strictly decreases, while the variance can be nonmonotone, as also noted by Canatar et al. (2021). Equation 12 states that train error is related to test error by the target-independent proportionality constant $\delta^2/(n^2\kappa^2)$. Finally, Equations 13 and 14 give the mean and covariance of the predicted eigencoefficients. Different eigencoefficients are uncorrelated, and the variance of $\hat v_i$ is proportional to $\mathcal{L}_i^2$ but surprisingly independent of $v_i$.

5. EXPERIMENTS

Here we describe experiments confirming our main results. The targets are eigenfunctions on three synthetic domains - the unit circle discretized into $M$ points, the (Boolean) hypercube $\{\pm 1\}^d$, and the $d$-sphere - as well as two-class subsets of MNIST and CIFAR-10. Unless otherwise stated, experiments use a fully-connected four-hidden-layer ReLU architecture, and finite networks have width 500. The kernel eigenvalues on each synthetic domain group into degenerate sets which we index by $k \in \mathbb{Z}^+$. Eigenvalues and eigencoefficients for image datasets are approximated numerically from a large sample of training data. Full experimental details can be found in Appendix B.

Figure 1 illustrates Theorem 3.2 in a toy setting. Modewise $\mathcal{D}$-learnabilities indeed sum to $n$. Figure 2 compares theoretical predictions for learnability and MSE with experiments on real and synthetic data, finding good agreement in all cases. Appendix C repeats one of these experiments with network widths varying from $\infty$ to 20, finding good agreement with theory even at narrow width.

6.1. EXPLAINING THE DEEP BOOTSTRAP

[Figure 2: eigenmode learnability $\mathcal{L}(f)$ (top row) and test MSE $\mathcal{E}(f)$ (bottom row) vs. $n$ on the unit circle ($k = 0, 2, 5, 10$), the 8d hypercube ($k = 0, 1, 3, 8$), the 7-sphere ($k = 0, ..., 5$), and two-class image datasets (CIFAR-10 deer/horse and bird/truck; MNIST 3/8 and 0/1).]

The deep bootstrap (DB) is a phenomenon observed by Nakkiran et al. (2020) in which the performance of a neural network stopped after a given number of training steps is relatively insensitive to the size of the training set, unless the training set is so small that it has been interpolated. The DB has been studied theoretically on kernel gradient flow (KGF), which describes the training of wide neural networks, by Ghosh et al. (2022) in the toy case in which the data lies on a high-dimensional sphere. Here we give a convincing general explanation of this phenomenon using our framework and identify two regimes of NN fitting in the process.

Ali et al. (2019) proved that KRR with finite ridge generalizes remarkably similarly to KGF with finite stopping time (a theoretical result confirmed empirically by Lee et al. (2020) for NTKs on image datasets). When considering a standard supervised learning setting, the effective training time corresponding to a ridge $\delta$ is $\tau_{\rm eff} \equiv \delta^{-1} n$ (see Appendix D for a discussion of this scaling). As a proxy for KGF, we shall study KRR as $\tau_{\rm eff}$ increases from 0 to $\infty$.

In Equation 9, the ridge parameter affects MSE solely through the value of $\kappa$. Define $\kappa_0 \equiv \kappa|_{\delta=0}$, the minimum effective regularization at a given dataset size. In Appendix D, we show that, for power-law eigenspectra (as are commonly found in practice), there are two regimes of fitting:

1. Regularization-limited regime: $\tau_{\rm eff} \ll \kappa_0^{-1}$ and $\kappa \approx \tau_{\rm eff}^{-1}$. The generalization gap is small: $\mathcal{E}(f)/\mathcal{E}_{\rm tr}(f) \approx 1$. Regularization dominates generalization, and adding training samples does not affect generalization.

2. Data-limited regime: $\tau_{\rm eff} \gg \kappa_0^{-1}$ and $\kappa \approx \kappa_0$. The generalization gap is large: $\mathcal{E}(f)/\mathcal{E}_{\rm tr}(f) \gg 1$. Data is interpolated, and decreasing regularization (i.e. increasing training time) does not affect generalization.

We suggest that Nakkiran et al. (2020) observe overlapping error curves for different $n$ at early times because, at these times, the model is in the regularization-limited regime. We now present an experiment confirming our interpretation. In Figure 3, we reproduce an experiment of Nakkiran et al. (2020) illustrating the DB using ResNets trained on CIFAR-10, juxtaposing it with our proposed model for this phenomenon - KRR with varying $\tau_{\rm eff}$ - trained on binarized MNIST. The match is excellent: in particular, both plots share the DB phenomenon that error curves for all $n$ overlap at early times, with test and train error peeling off the master curve at roughly the same time. We find that $\tau_{\rm eff} \approx \kappa_0^{-1}$ is indeed when the transition between regimes occurs for KRR, matching our theoretical prediction. This experiment strongly suggests that both neural network and KRR fitting can be thought of as a transition between regularization-limited and data-limited regimes. See Appendix D for experimental details.

6.2. THE HARDNESS OF THE PARITY PROBLEM FOR ROTATION-INVARIANT KERNELS

The parity problem stands as a classic example of a function which is easy to write down but hard for common algorithms to learn. The parity problem was shown to be exponentially hard for Gaussian kernel methods by Bengio et al. (2006). Here we generalize this result to KRR with arbitrary rotation-invariant kernels. Our analysis is made trivial by the use of our framework and is a good illustration of the power of working in terms of learnabilities.

The problem domain is the hypercube $\mathcal{X} = \{-1, +1\}^d$, over which we define the subset-parity functions $\phi_S(x) = (-1)^{\sum_{i\in S}\mathbb{1}[x_i = 1]}$, where $S \subseteq \{1, ..., d\} \equiv [d]$. The objective is to learn $\phi_{[d]}$. For any rotation-invariant kernel (such as the NTK of a fully-connected neural network), $\{\phi_S\}_S$ are the eigenfunctions over this domain, with degenerate eigenvalues $\{\lambda_k\}_{k=0}^d$ depending only on $k = |S|$. Yang & Salman (2019) proved that, for any fully-connected kernel, the even and odd eigenvalues each obey a particular ordering in $k$. Letting $d$ be odd for simplicity, this result and Equation 7 imply that $\mathcal{L}_1 \ge \mathcal{L}_3 \ge \cdots \ge \mathcal{L}_d$. Counting level degeneracies, this is a hierarchy of $2^{d-1}$ learnabilities of which $\mathcal{L}_d$ is the smallest. The conservation law of Theorem 3.2 then implies that $\mathcal{L}_d \le \frac{n}{2^{d-1}}$, which, using Proposition 3.1(e), implies that

$$\mathcal{E}(\phi_{[d]}) \ge \left(1 - \frac{n}{2^{d-1}}\right)^2.$$

Obtaining an MSE below a desired threshold $\epsilon$ thus requires at least $n_{\min} = 2^{d-1}(1 - \epsilon^{1/2})$ samples, a sample complexity exponential in $d$. The parity problem is thus hard for all rotation-invariant kernels.
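To see the scale of this bound concretely, the following few lines (an illustrative calculation, not one of our experiments) evaluate $n_{\min} = 2^{d-1}(1 - \epsilon^{1/2})$ for a few dimensions.

```python
# Samples required by any rotation-invariant kernel to reach MSE <= eps on d-bit parity.
eps = 0.1
for d in [10, 20, 30, 50]:
    n_min = 2 ** (d - 1) * (1 - eps ** 0.5)
    print(d, int(n_min))   # grows as 2^(d-1): exponential in d
```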

6.3. MEAN-SQUARED GRADIENT AND ADVERSARIAL ROBUSTNESS

While the omniscient risk estimate is the most important of the eigenlearning equations, we have also obtained estimators for arbitrary covariances of $\hat f$. Here we point out a first use for these covariances: studying the smoothness of $\hat f$ and thus its adversarial robustness. We fashion an estimator for function smoothness, confirm its accuracy with KRR experiments, and identify a discrepancy between intuition and experiment ripe for further exploration.

Consider the mean squared gradient (MSG) of $\hat f$ defined by $\mathcal{G}(\hat f) \equiv \mathbb{E}_x\left[|\nabla_x \hat f(x)|^2\right] = ||\nabla\hat f||_2^2$. This quantity is a measure of function smoothness. Eigendecomposition yields that $\mathbb{E}_x\left[|\nabla_x \hat f(x)|^2\right] = \sum_{ij}\mathbb{E}[\hat v_i\hat v_j]\,g_{ij}$, with $g_{ij} \equiv \mathbb{E}_x\left[\nabla_x\phi_i(x)\cdot\nabla_x\phi_j(x)\right]$. The expectation $\mathbb{E}[\hat v_i\hat v_j] = \mathbb{E}[\hat v_i]\,\mathbb{E}[\hat v_j] + \mathrm{Cov}[\hat v_i, \hat v_j]$ is given by the eigenlearning equations, and the structure constants $g_{ij}$, which encode information about the domain, can be computed analytically for simple domains. On the $d$-sphere, for which the $\phi_i = \phi_{k\ell}$ are spherical harmonics, these are $g_{(k\ell),(k'\ell')} = k(k + d - 2)\,\delta_{kk'}\delta_{\ell\ell'}$.

The study of adversarial robustness currently suffers from a lack of theoretically tractable toy models, and we suggest our expression for MSG can help fill this gap. To illustrate this, we describe an insight that can be drawn from Figure 4. Vulnerability to gradient-based adversarial attacks can be viewed essentially as a phenomenon of surprisingly large gradients with respect to the input. If such vulnerability is an inevitable consequence of high dimension, a common heuristic belief (e.g. Gilmer et al. (2018)), one might expect that $\mathcal{G}(\hat f)$ is generally much larger than $\mathcal{G}(f)$ at high dimension. Surprisingly, we see no such effect, a discrepancy which can be investigated further using our framework.

[Figure 4: $\mathcal{G}(\hat f)/\mathcal{G}(f)$ vs. $n$ for $d = 4, 16, 64, 256, 1024$.]
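As a sketch of how the MSG estimator is assembled in practice on the $d$-sphere (where $g_{ij}$ is diagonal), the snippet below combines Equations 13 and 14 into the needed second moments; the function name and its inputs are illustrative assumptions, not a fixed API.

```python
import numpy as np

def msg_estimate_sphere(L, v, degrees, d, test_mse, n):
    """Estimate G(f_hat) = sum_ij E[v_i v_j] g_ij on the d-sphere.

    On the sphere g is diagonal with g_ii = k(k + d - 2) for a degree-k harmonic,
    so only i = j terms contribute; E[v_i^2] = (L_i v_i)^2 + test_mse * L_i^2 / n
    follows from Equations 13 and 14.
    """
    g_diag = degrees * (degrees + d - 2)
    second_moments = (L * v) ** 2 + test_mse * L ** 2 / n
    return np.sum(second_moments * g_diag)
```

Replacing the second moments with $v_i^2$ in the same sum recovers $\mathcal{G}(f)$ for the target itself, so the ratio $\mathcal{G}(\hat f)/\mathcal{G}(f)$ plotted in Figure 4 is obtained directly.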

6.4. A QUANTUM MECHANICAL ANALOGY AND UNIVERSAL LEARNABILITY CURVES

Here we describe a remarkably tight analogy between our picture of KRR generalization and the statistics of the free Fermi gas, a canonical model in statistical physics. This allows certain insights into the free Fermi gas to be ported over to KRR. This correspondence is also of fundamental interest: KRR and the free Fermi gas are both paradigmatic systems in their respective fields, and it is remarkable that their statistics are in fact the same. We defer the details of this correspondence to Appendix F and focus here on the takeaways.

The free Fermi gas is defined by a scalar $\mu$ and a set of states with energies $\{\varepsilon_i\}_i$, each of which may be occupied or not. We find that $-\ln\kappa$ is analogous to $\mu$, the state energies are analogous to kernel eigenvalues, and the states' occupation probabilities are precisely analogous to the eigenmodes' learnabilities.

In all prior work and in our work thus far, the constant $\kappa$ has been defined only as the solution to an implicit equation. Having an explicit equation would be advantageous. Leveraging methods for the study of the free Fermi gas, we find the following explicit formula for $\kappa$:

$$\kappa = \frac{m_n(\lambda_1, \lambda_2, ...)}{m_{n-1}(\lambda_1, \lambda_2, ...)}, \quad\text{where}\quad m_n(x_1, x_2, ...) \equiv \sum_{1\le j_1 < \cdots < j_n} x_{j_1}\cdots x_{j_n}$$

is the elementary symmetric polynomial of degree $n$.

The second takeaway is the identification of a universal behavior in the learning dynamics of KRR. In systems obeying Fermi-Dirac statistics, a plot of $p_i$ vs. $\varepsilon_i$ takes a characteristic sigmoidal shape. As shown in Figure 10, we robustly see this sigmoidal shape in plots of $\mathcal{L}_i$ vs. $\ln\lambda_i$. As the number of samples and even the task are varied, this sigmoidal shape remains universal, merely translating horizontally (in particular, moving left as samples are added). This sigmoidal shape is thus a signature of KRR and could in principle be used to, for example, determine if an unknown kernel method resembles KRR.
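The explicit formula is easy to check against the implicit definition. The sketch below (an illustrative toy spectrum, kept small to avoid numerical issues with the symmetric polynomials) computes $m_n$ by the standard one-pass recurrence; the two estimates of $\kappa$ agree up to corrections that vanish in the thermodynamic limit $n \gg 1$.

```python
import numpy as np

def elem_sym(vals, k):
    """Elementary symmetric polynomial m_k(vals) via the usual recurrence."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for x in vals:
        e[1:k + 1] = e[1:k + 1] + x * e[0:k]
    return e[k]

def kappa_implicit(eigvals, n):
    """Ridgeless kappa from n = sum_i lam_i / (lam_i + kappa), solved by bisection."""
    lo, hi = 1e-15, eigvals.sum()
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if np.sum(eigvals / (eigvals + mid)) > n else (lo, mid)
    return np.sqrt(lo * hi)

lam = 1.0 / np.arange(1, 41) ** 2      # toy powerlaw spectrum
n = 10
print(kappa_implicit(lam, n))                       # implicit solution of Equation 6 (ridgeless)
print(elem_sym(lam, n) / elem_sym(lam, n - 1))      # explicit formula m_n / m_{n-1}
```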

7. CONCLUSIONS

We have developed an interpretable unified framework for understanding the generalization of KRR centered on a previously-unexploited conserved quantity. We then used this improved framework to break new theoretical ground in a variety of subjects including the parity problem, the deep bootstrap, and adversarial robustness, and we developed a tight analogy between KRR and a canonical physical system allowing the transfer of insights. We have covered much territory, and each of these subjects is ripe for further exploration in future work.

A NOTATION DICTIONARY AND COMPARISON WITH RELATED WORKS

Tables 1, 2 and 3 provide a dictionary between the notations of our paper and the nearest related works. Rows should be read in comparison with the top row and interpreted as "when the present paper writes X, the other paper would write Y for the same quantity." Comparing expressions for test MSE in Table 2 and predicted function covariance in Table 3 makes it very clear that our learnability framework permits simpler and much more interpretable expression of these results than previously given.

Table 1: Notation dictionary between the present paper and related works (part 1).

| Paper | # samples | ridge | noise | eigenvalues | eigenfns | eigencoeffs |
| (ours) | $n$ | $\delta$ | $\epsilon^2$ | $\lambda_i$ | $\phi_i$ | $v_i$ |
| Bordelon et al. (2020) | $p$ | $\lambda$ | NA | $\lambda_\rho$ | $\lambda_\rho^{-1/2}\psi_\rho$ | $\lambda_\rho^{1/2} w_\rho$ |
| Canatar et al. (2021) | $P$ | $\lambda$ | $\sigma^2$ | $\eta_\rho$ | $\eta_\rho^{-1/2}\psi_\rho$ | $\eta_\rho^{1/2} w_\rho$ |
| Jacot et al. (2020) | $N$ | $N\lambda$ | $\epsilon^2$ | $d_k$ | $f^{(k)}$ | $\langle f^{(k)}, f^*\rangle$ |
| Cui et al. (2021) | $n$ | $n\lambda$ | $\sigma^2$ | $\eta_k$ | $\phi_k$ | $\eta_k^{1/2}\vartheta^*_k$ |

Table 2: Notation dictionary between the present paper and related works (part 2).

| Paper | eff. reg. | overfitting coeff. | test MSE |
| (ours) | $\kappa$ | $\mathcal{E}_0$ | $\mathcal{E}(f) = \mathcal{E}_0\left[\sum_i (1-\mathcal{L}_i)^2 v_i^2 + \epsilon^2\right]$ |
| Bordelon et al. (2020) | $\frac{t+\lambda}{n}$ | $\left(1 - \frac{p\gamma}{(t+\lambda)^2}\right)^{-1}$ | $E_g = \sum_\rho \frac{w_\rho^2}{\lambda_\rho}\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda+t}\right)^{-2}\left(1 - \frac{p\gamma}{(\lambda+t)^2}\right)^{-1}$ |
| Canatar et al. (2021) | $\frac{\kappa}{n}$ | $\frac{1}{1-\gamma}$ | $E_g = \frac{1}{1-\gamma}\left[\sum_\rho \frac{\eta_\rho\,\kappa^2}{(\kappa + P\eta_\rho)^2}\, w_\rho^2 + \sigma^2\right]$ |
| Jacot et al. (2020) | $\vartheta$ | $\partial_\lambda\vartheta$ | $R_\epsilon = \partial_\lambda\vartheta\left(\left\|(I_C - \tilde A_\vartheta)f^*\right\|^2 + \epsilon^2\right)$ |
| Cui et al. (2021) | $\frac{z}{n}$ | $\left(1 - \frac{1}{n}\sum_{k=1}^{p}\frac{\eta_k^2}{(z/n+\eta_k)^2}\right)^{-1}$ | $\epsilon_g = \dfrac{\frac{z^2}{n^2}\sum_{k=1}^{\infty}\frac{\vartheta_k^{*2}\,\eta_k}{(z/n+\eta_k)^2} + \sigma^2}{1 - \frac{1}{n}\sum_{k=1}^{\infty}\frac{\eta_k^2}{(z/n+\eta_k)^2}}$ |

Table 3: Notation dictionary between the present paper and related works (part 3).

| Paper | predicted fn covariance |
| (ours) | $\mathrm{Cov}[\hat v_i, \hat v_j] = \frac{\mathcal{L}_i^2\,\mathcal{E}(f)}{n}\,\delta_{ij}$ |
| Bordelon et al. (2020) | NA |
| Canatar et al. (2021) | $\sqrt{\eta_\alpha\eta_\beta}\,\mathrm{Cov}_\mathcal{D}\!\left[w^*_\alpha, w^*_\beta\right] = \frac{1}{1-\gamma}\left(\sigma^2 + \kappa^2\sum_\rho\frac{\eta_\rho\, w_\rho^2}{(P\eta_\rho+\kappa)^2}\right)\frac{P\eta_\alpha^2}{(P\eta_\alpha+\kappa)^2}\,\delta_{\alpha\beta}$ |
| Jacot et al. (2020) | $V_k(f^*, \lambda, N, \epsilon) = \frac{\partial_\lambda\vartheta(\lambda)}{N}\left(\left\|(I_C - \tilde A_\vartheta)f^*\right\|_S^2 + \epsilon^2 + \langle f^{(k)}, f^*\rangle_S^2\,\frac{\vartheta^2(\lambda)}{(\vartheta(\lambda)+d_k)^2}\right)\frac{d_k^2}{(\vartheta(\lambda)+d_k)^2}$ |

The KRR works all rely on approximations, particularly the spectral "universality" approximation that the eigenfunctions are random and structureless and the design matrix can thus be replaced by a random Gaussian matrix. The works of Bordelon et al. (2020) and Canatar et al. (2021), as well as ours, make additional approximations valid for reasonable eigenspectra and large n, while Jacot et al. (2020) does not make approximations and provides bounds instead (though these bounds diverge at zero ridge, highlighting the difficulty of studying interpolating methods). On the other side, the LRR works do not make a universality approximation, but instead assume in their setting that the feature vectors are high-dimensional and have (sub)Gaussian moments (or some similar condition), which ultimately amounts to the same condition, now enforced in the setting instead of assumed as an approximation. These works are typically mathematically rigorous. Relative to the LRR works, one contribution of the KRR works is the empirical observation that this universality approximation is generally accurate in practice, even for NTKs at initialization (for NTKs after training, the story appears mixed; see Wei et al. (2022) for evidence for universality and Ba et al. (2022) for evidence against it).

While our work is generally consistent with other works in the KRR set, we note one discrepancy regarding the covariance of the predicted function. Our expression, Equation 14, agrees with that of Canatar et al. (2021). However, the expression of Jacot et al. (2020), given in their Theorem 2, contains an extra $O(n^{-1})$ term (the term involving $\langle f^{(k)}, f^*\rangle_S^2$ in Table 3) dependent upon the target eigencoefficient of the mode in question. This term is small enough that it can be removed without affecting the bounds the authors prove. It contributes negligibly when summing over all eigenmodes to compute e.g. test error or mean squared gradient, but contributes to leading order when interrogating the variance of a particular mode. A sanity-check calculation that the sum of variance over all modes ought to equal $\mathcal{V}(f)$ provides evidence that our equations, which lack this term, are correct.

Lastly, we note that Cohen et al. (2021) study Gaussian process regression in the spirit of Sollich (1999) using field theoretic tools. Though they do not explicitly discuss the omniscient risk estimate, Canatar et al. (2021) note that their "equivalence kernel + corrections" learning curve can be viewed as a perturbative expansion of the omniscient risk estimate.

B.1 SYNTHETIC DOMAINS

Our experiments use both real image datasets and synthetic target functions on the following three domains:

1. Discretized Unit Circle. We discretize the unit circle into $M$ points, $\mathcal{X} = \{(\cos(2\pi j/M), \sin(2\pi j/M))\}_{j=1}^{M}$.

Eigenvalues and multiplicities for four two-class image datasets are shown in Figure 6. Spectra for CIFAR-10 tasks roughly follow power laws with exponent -1, while spectra for MNIST tasks follow power laws with slightly steeper descent. (B) Eigencoefficients as computed from $10^4$ training points. Tasks with higher observed learnability (Figure 2) place more weight in higher (i.e., lower-index) eigenmodes and less in lower ones.

B.3 RUNTIMES

For the dataset sizes we consider in this paper, exact NTK regression is typically quite fast, running in seconds, while the training time of finite networks varies from seconds to minutes and depends on width, depth, training set size, and eigenmode. In particular, as described by Rahaman et al. (2019), lower eigenmodes take longer to train (especially when aiming for near-zero training MSE as we do here).

B.4 HYPERPARAMETERS

We conduct all our experiments using JAX (Bradbury et al., 2018), performing exact NTK regression with the neural tangents library (Novak et al., 2019) built atop it. Unless otherwise stated, all experiments used four-hidden-layer ReLU networks initialized with NTK parameterization (Sohl-Dickstein et al., 2020) with $\sigma_w = 1.4$, $\sigma_b = 0.1$. The tanh networks used in generating Figure 1 instead used $\sigma_w = 1.5$. Experiments on the unit circle always used a learning rate of 0.5, while experiments on the hypercube, hypersphere, and image datasets used a learning rate of 0.5 or 0.1 depending on the experiment. While higher learning rates led to faster convergence, they tended to match theory more poorly, in line with the large learning rate regimes described by Lewkowycz et al. (2020). Means and one-standard-deviation error bars are always computed from the statistics of several random dataset draws and initializations (for finite nets):

• The toy experiment of Figure 1 used only a single trial per eigenmode by design.
• Other experiments on synthetic domains used 30 trials.
• The experiments on image datasets in Figure 2 used 15 trials.

B.5 INITIALIZING THE NETWORK FUNCTION AT ZERO

Naively, when training an infinitely-wide network, the NTK only describes the mean learned function, and the true learned function will include an NNGP-kernel-dependent fluctuation term reflecting the random initialization (Lee et al., 2019). However, by storing a copy of the parameters at $t = 0$ and redefining $\hat f_t(x) := \hat f_t(x) - \hat f_0(x)$ throughout optimization and at test time, this term becomes zero. We use this trick in our experiments with finite networks.

C VARYING WIDTH EXPERIMENT

While kernel regression describes only infinitely-wide neural networks, it is natural to wonder whether our equations are nonetheless informative outside this regime. Figure 7 compares predicted learnability and MSE with experiment for hypercube eigenmodes with networks of varying widths. Our predictions remain informative down to width 50, and lower-frequency modes are better learned at all widths, suggesting that kernel eigenanalysis has a role to play even in the study of feature learning.

[Figure 7: $\mathcal{L}^{(\mathcal{D})}(\phi_k)$ for hypercube eigenmodes $k = 0, 1, 3, 8$ at widths from $\infty$ downward.]

D EXPLAINING THE DEEP BOOTSTRAP IN KRR

In this appendix, we use the eigenlearning equations to arrive at an explanation of the deep bootstrap phenomenon of Nakkiran et al. (2020). As explained in the main text, we study KRR as a proxy for KGF. We begin by finding the relationship between the ridge parameter and the effective training time (we assume for simplicity that the learning rate of KGF is one). Ali et al. (2019) obtained their result regarding the similarity of KRR and KGF under the identification $[\text{time}] = [\text{ridge}]^{-1}$. They assume a different ridge scaling than ours, and, in our notation, we identify $\tau_{\rm eff} = \delta^{-1} n$. Their KGF scaling is correct for modeling a standard supervised learning setup, as can be verified by taking the KGF limit of ordinary minibatch SGD.

To study the effect of this "early stopping" on generalization, we must find how finite $\tau_{\rm eff}$ affects the constant $\kappa$. Recalling Equation 6 defining $\kappa$,

$$n = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa},$$

we can easily see that $\kappa \le n^{-1}\sum_i\lambda_i + \tau_{\rm eff}^{-1}$ and $\kappa \ge \tau_{\rm eff}^{-1}$. Squeezing $\kappa$ with these bounds, we find that, when $n \gg \tau_{\rm eff}\sum_i\lambda_i$, then $\kappa \approx \tau_{\rm eff}^{-1}$. For any $\tau_{\rm eff}$, there thus exists some $n$ above which additional data does not further lower $\kappa$ and permit the learning of additional eigenmodes. Let us call this the regularization-limited regime, and its counterpart (in which $\tau_{\rm eff}$ is sufficiently large and regularization is effectively zero) the data-limited regime.

We can find the crossover regularization provided knowledge of the eigenspectrum. Let us assume that $\lambda_i \sim i^{-\alpha}$ for some $\alpha > 1$. Mallinar et al. (2022) show that $\kappa_0 \equiv \kappa|_{\delta=0} \to \left[\alpha^{-1}\pi\csc(\alpha^{-1}\pi)\right]^{\alpha} n^{-\alpha}$ as $n \to \infty$. Extending their analysis to finite ridge, we find that, at large $n$,

$$n = n\left(\frac{\kappa}{\kappa_0}\right)^{-1/\alpha} + \delta\kappa^{-1}, \quad\text{or equivalently}\quad 1 = \left(\frac{\kappa_0}{\kappa}\right)^{1/\alpha} + \frac{1}{\tau_{\rm eff}\kappa}. \qquad (18)$$

Inspection of Equation 18 reveals that when $\tau_{\rm eff} \ll \kappa_0^{-1}$, then $\kappa \approx \tau_{\rm eff}^{-1}$, and when $\tau_{\rm eff} \gg \kappa_0^{-1}$, then $\kappa \approx \kappa_0$. We thus expect a crossover from the regularization-limited to the data-limited regime when $\tau_{\rm eff} \approx \kappa_0^{-1}$.

In Figure 8, we plot the theoretical omniscient risk estimate using power-law eigenvalues ($\lambda_i \sim i^{-\alpha}$) and power-law eigencoefficients ($v_i^2 \sim i^{-\alpha}$) with $\alpha = 2$ for various $n$. The close resemblance of these curves to the experimental KRR curves of Figure 3 confirms that our power-law toy model is good. If desired, Figures 3(A, B) and Figure 8 can be viewed as the distillation of the DB phenomenon in three stages of increasing artificiality.

It is worth noting that increasing $n$ serves to both decrease $\kappa$ and decrease $\mathcal{E}_0$. For power-law spectra, as $n$ increases with fixed $\tau_{\rm eff}$, $\mathcal{E}_0$ tends to its lower bound of 1. It approaches 1 when the regularization-limited regime begins. Similar ideas regarding the scaling of KRR are discussed by Cui et al. (2021). The notion of regimes limited by either dataset size or training time is explored using the language of scaling laws by Kaplan et al. (2020) and Bahri et al. (2021). We expect that this analysis could be extended to proper KGF using the analysis of Bordelon & Pehlevan (2021).
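A minimal numerical illustration of the two regimes under the power-law model: the sketch below solves Equation 18 for $\kappa$ by bisection and compares it with the two limits $\tau_{\rm eff}^{-1}$ and $\kappa_0$. All numerical values here are arbitrary illustrative choices.

```python
import numpy as np

def kappa_from_tau(kappa0, tau_eff, alpha):
    """Solve 1 = (kappa0/kappa)**(1/alpha) + 1/(tau_eff * kappa) (Equation 18) for kappa."""
    def rhs(kappa):
        return (kappa0 / kappa) ** (1 / alpha) + 1.0 / (tau_eff * kappa)
    lo, hi = 1e-16, 1e16          # rhs is decreasing in kappa
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if rhs(mid) > 1.0 else (lo, mid)
    return np.sqrt(lo * hi)

alpha, kappa0 = 2.0, 1e-4         # kappa0 plays the role of the ridgeless kappa
for tau_eff in [1e1, 1e3, 1e4, 1e5, 1e7]:
    kappa = kappa_from_tau(kappa0, tau_eff, alpha)
    # Regularization-limited when tau_eff << 1/kappa0 (kappa ~ 1/tau_eff);
    # data-limited when tau_eff >> 1/kappa0 (kappa ~ kappa0).
    print(f"tau_eff={tau_eff:.0e}  kappa={kappa:.3e}  1/tau_eff={1/tau_eff:.1e}  kappa0={kappa0:.1e}")
```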

D.1 DEEP BOOTSTRAP EXPERIMENTS

In replicating the deep bootstrap phenomenon for neural networks, we use the same experimental setup as Nakkiran et al. (2020): a ResNet-18 architecture trained on CIFAR-10 with data augmentation (random horizontal flips and random crops). We optimize cross-entropy loss using SGD with batch size 128, momentum 0.9, and initial learning rate 0.1 with cosine decay. The test/train curves are the mean of 4 training runs; the curves have been Gaussian-smoothed to remove high-frequency noise artifacts. We perform kernel ridge regression (KRR) using the NTK of a fully-connected network with four hidden layers of width 500. We binarize MNIST into two classes: {0, 1, 2, 3, 4} and {5, 6, 7, 8, 9}. Although the test curves shift slightly depending on the binarization scheme, the theory curves all fall within one standard deviation of the empirical curves for all binarizations we tried. The KRR test/train curves are the mean of 10 training runs.

Here we illustrate the deep bootstrap phenomenon (Figure 8) using a hypothetical kernel giving power-law eigenvalues and eigencoefficients with exponent $\alpha = 2$. In this setting, we note that finite-data test/train curves simultaneously split from the $n \to \infty$ "online learning curve" (black dot-dashed line). We see that $\tau_{\rm eff} = \kappa_0^{-1}$ (vertical dashed lines) predicts the transition from regularization-limited to data-limited fitting for each $n$. (We choose $\alpha = 2$ for a clean illustration of the phenomenon; empirically, CIFAR-10 and MNIST give spectra with exponents closer to 1.)

E ADDITIONAL MEAN-SQUARED GRADIENT EXPERIMENTS

In the MSG experiment of Figure 4, we run KRR with a polynomial kernel on data sampled from hyperspheres of varying dimension. The polynomial kernel is $K(x, x') = 1 + x^\top x' + (x^\top x')^2 + (x^\top x')^3$. We perform this experiment using PyTorch (Paszke et al., 2019) and compute $\nabla\hat f$ numerically.

In the MSG experiment of Figure 9, we train finite-width FCNs on eigenmodes on the (discretized) unit circle. For the $k$-th eigenmode on the unit circle, $\mathcal{G}(f) = \mathbb{E}\left[|f'(x)|^2\right] = k^2$. Because this is a discrete domain, smoothness is computed using a discretization as $\mathcal{G}(\hat f) = \mathbb{E}_j\left[|\hat f(x_j) - \hat f(x_{j+1})|^2\right]$, where $x_j$ and $x_{j+1}$ are neighboring points on the unit circle.

[Figure 9: $\mathcal{G}(\hat f)/\mathcal{G}(f)$ vs. $n$ for unit-circle eigenmodes $k = 1, 2, 4, 8$.]
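For reference, the discretized smoothness metric on the unit circle can be computed in a couple of lines; the example below uses an exact eigenmode $\cos(k\theta)$ in place of a trained $\hat f$, so it is illustrative only.

```python
import numpy as np

def msg_discrete(f_vals):
    # E_j |f(x_j) - f(x_{j+1})|^2 over neighboring points of the discretized circle.
    return np.mean((f_vals - np.roll(f_vals, -1)) ** 2)

M, k = 256, 4
theta = 2 * np.pi * np.arange(M) / M
print(msg_discrete(np.cos(k * theta)))
# Note: relative to E|f'|^2 = k^2, this discrete metric carries a factor proportional to
# the squared grid spacing; that factor cancels in the ratio G(f_hat)/G(f) of Figure 9.
```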

F QUANTUM MECHANICAL ANALOGY

F.1 EXPLICIT EQUATION FOR κ

Here we develop a tight analogy between our picture of KRR and the free Fermi gas. The free Fermi gas consists of a collection of single-particle orbitals with energies $\{\varepsilon_i\}_i$ connected to a bath of particles at chemical potential $\mu$. In our analogy, we will identify orbital $i$ with kernel eigenmode $i$. Each orbital will contain zero or one fermions ($n_i \in \{0, 1\}$), with the occupation probability of orbital $i$ given by the Fermi-Dirac distribution to be $\langle n_i\rangle = (1 + e^{\varepsilon_i - \mu})^{-1}$. If we identify $\varepsilon_i = -\ln\lambda_i$ and $\mu = -\ln\kappa$, we find that $\langle n_i\rangle = \mathcal{L}_i$: the orbital occupation probability is precisely the eigenmode learnability. We can always choose $\kappa$ so that eigenmode learnabilities sum to $n$. This is equivalent to the statement that we can always choose the chemical potential $\mu$ to ensure the system contains $n$ fermions on average. An eigenmode with eigenvalue $\lambda_i \ge \kappa$ receives at least half a unit of learnability, and an orbital with energy $\varepsilon_i \le \mu$ is occupied with probability at least one half.

Here we elaborate further the physical analogy laid out in Section 6.4. In our system of noninteracting fermions, we have thus far identified

$$-\ln\lambda_i \Leftrightarrow \varepsilon_i \quad\text{(energy of orbital } i\text{)}, \qquad (19)$$
$$-\ln\kappa \Leftrightarrow \mu \quad\text{(chemical potential)}, \qquad (20)$$
$$\mathcal{L}_i \Leftrightarrow \langle n_i\rangle \quad\text{(expected occupancy of orbital } i\text{)}. \qquad (21)$$

What is the value of $\kappa$? Observe that

$$\langle n_i\rangle = \frac{1}{1 + e^{\varepsilon_i - \mu}} \qquad (22)$$

gives the expected occupation number in the grand canonical ensemble, where the total number of fermions is allowed to fluctuate. By the equivalence of thermodynamic ensembles, we expect to get the same answer for $\langle n_i\rangle$ in the canonical ensemble, where the total number of fermions does not fluctuate and is exactly $n$. This suggests that we should attempt to compute $\langle n_i\rangle$ in the canonical ensemble. By comparing the answer to the grand canonical expression, we can solve for $\mu$ and hence for $\kappa$. We assume a total number of orbitals $M \gg n$ (which may be infinite). Note that the equivalence of ensembles holds only in the thermodynamic limit $n \gg 1$, so we must take this limit at the end.

In the canonical ensemble, each microstate is labeled by a list of indices $1 \le j_1 < \cdots < j_n \le M$ corresponding to the occupied orbitals. Direct computation of the canonical partition function shows that

$$Z_C = m_n(e^{-\varepsilon_1}, ..., e^{-\varepsilon_M}), \qquad (23)$$

where $m_n$ is the so-called "elementary symmetric polynomial of order $n$":

$$m_n(x_1, ..., x_M) = \sum_{1\le j_1 < \cdots < j_n \le M} x_{j_1}\cdots x_{j_n}. \qquad (24)$$

We also define the same with index $i$ disallowed:

$$m_n^{(i)}(x_1, ..., x_M) = \sum_{\substack{1\le j_1 < \cdots < j_n \le M \\ j_k \ne i\ \forall k}} x_{j_1}\cdots x_{j_n}. \qquad (25)$$

We find that

$$\langle n_i\rangle = -\frac{\partial\ln Z_C}{\partial\varepsilon_i} = \frac{e^{-\varepsilon_i}\, m_{n-1}^{(i)}(e^{-\varepsilon_1}, ..., e^{-\varepsilon_M})}{m_n(e^{-\varepsilon_1}, ..., e^{-\varepsilon_M})}. \qquad (26)$$

Comparing Equations 22 and 26, solving for $\mu$, and using Equation 20, we find that

$$\kappa = \frac{m_n(\boldsymbol{\lambda})}{m_{n-1}^{(i)}(\boldsymbol{\lambda})} - \lambda_i \qquad (27)$$

for any $i$, where $\boldsymbol{\lambda} = (\lambda_i)_{i=1}^M$. We are free to choose $i \gg n$ so that $\lambda_i$ is negligible, yielding

$$\kappa = \frac{m_n(\boldsymbol{\lambda})}{m_{n-1}(\boldsymbol{\lambda})}. \qquad (28)$$

We have arrived at an explicit expression for $\kappa$. This expression "solves" Equation 6, the implicit equation defining $\kappa$, in the thermodynamic limit. Numerics easily confirm a close match with the implicit solution. While many works have defined similar constants implicitly, this is the first place to our knowledge that the explicit solution appears. The meaning of $\kappa$ is elucidated by analogy to $\mu$ in the canonical ensemble. In the canonical ensemble, all orbitals are "fighting" for the supply of $n$ particles, and $\mu$ represents their "average pull" on each unit.
Similarly, in KRR, we can view eigenmodes as competing for the kernel's supply of n units of learnability, with κ encoding their average pull on this supply. This analogy deepens our picture of KRR as the competition among eigenmodes for a fixed supply of learnability.

F.2 ADDITIONAL ANALOGOUS QUANTITIES

In the grand canonical ensemble, the total occupation $N = \sum_i n_i$ concentrates about its mean $n$ and has fluctuations given by $\mathrm{Var}[N] = \sum_i\langle n_i\rangle(1 - \langle n_i\rangle)$ (since each site constitutes a Bernoulli variable). Comparing with Equation 8, we find that

$$n/\mathcal{E}_0 \Leftrightarrow \mathrm{Var}[N]. \qquad (29)$$

Smaller particle number fluctuations in the analogous physical system thus correspond to larger $\mathcal{E}_0$ and greater overfitting of noise. This provides a physical interpretation of the "overfitting as overconfidence" notion of Section 4.2.

The grand canonical partition function of our free Fermi gas is $Z_{GC} = \prod_i\left(1 + e^{\mu - \varepsilon_i}\right)$. The analogous quantity in KRR is $Z_{KRR} = \prod_i\left(1 + \frac{\lambda_i}{\kappa}\right)$. It holds that $\frac{\partial\ln Z_{GC}}{\partial\mu} = \langle N\rangle$, and indeed we find that $-\frac{\partial\ln Z_{KRR}}{\partial\ln\kappa} = n$.

Figure 10 shows that curves of learnability vs. eigenvalue collapse onto a single sigmoidal curve upon rescaling. We note that similar collapse of data from different experiments upon rescaling occurs in many important statistical physics systems including superconductors (Kardar, 2007) and turbulent flows (Goldenfeld, 2006).

It is broadly worth noting that informative constants determined by self-consistency conditions are a hallmark of statistical mechanics. It is thus sensible that such a constant would emerge from the replica calculation of Canatar et al. (2021).

G PROOFS: LEARNABILITY AND ITS CONSERVATION LAW

In this appendix, we provide proofs of the formal claims of Section 3. We make use of our learning transfer matrix formalism (discussed in Appendix H.2) for these proofs as it improves clarity and compactness, but all our proofs can also be straightforwardly carried through without it. The learning transfer matrix is defined as

$$T^{(\mathcal{D})} \equiv \Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\Phi^\top,$$

with $\Phi_{ij} = \phi_i(x_j)$, and $T \equiv \mathbb{E}_\mathcal{D}\left[T^{(\mathcal{D})}\right]$. It holds that $\hat{\mathbf{v}} = T^{(\mathcal{D})}\mathbf{v}$. It is easy to see (by making $\mathbf{v}$ a one-hot vector) that $\mathcal{L}^{(\mathcal{D})}(\phi_i) = T^{(\mathcal{D})}_{ii}$ and $\mathcal{L}(\phi_i) = T_{ii}$.

Property (a): $\mathcal{L}(\phi_i),\ \mathcal{L}^{(\mathcal{D})}(\phi_i) \in [0, 1]$.

Proof. Letting $e_i$ be the one-hot unit vector with a one at index $i$, we observe that

$$\mathcal{L}^{(\mathcal{D})}(\phi_i) = e_i^\top T^{(\mathcal{D})} e_i = e_i^\top\Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\Phi^\top e_i = \lambda_i^{1/2}\, e_i^\top\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\Phi^\top e_i\, \lambda_i^{1/2} = \mathbf{z}^\top\left(\mathbf{z}\mathbf{z}^\top + M\right)^{-1}\mathbf{z} \in [0, 1],$$

where in the third expression we have used the fact that $e_i$ is a unit vector and $\Lambda$ is diagonal, in the fourth we have defined $\mathbf{z} = \lambda_i^{1/2}\Phi^\top e_i$ and $M = \Phi^\top\Lambda\Phi + \delta I_n - \mathbf{z}\mathbf{z}^\top$, and in the last step we have used the fact that $M$ is positive semidefinite. Given this, $\mathcal{L}(\phi_i) \in [0, 1]$ by averaging.

Property (b): When $n = 0$, $\mathcal{L}^{(\mathcal{D})}(f) = \mathcal{L}(f) = 0$.

Proof. When $n = 0$, the predicted function $\hat f$ is uniformly zero, and so we have $\mathcal{L}^{(\mathcal{D})}(f) = \mathcal{L}(f) = 0$.

Remark. As a corollary, it is easy to show that, when $M$ is finite, $n = M$, and $\delta = 0$, then $\mathcal{L}^{(\mathcal{D})}(f) = \mathcal{L}(f) = 1$.

Property (c): Let $\mathcal{D}^+$ be $\mathcal{D}\cup x$, where $x \in \mathcal{X}$, $x\notin\mathcal{D}$ is a new data point. Then $\mathcal{L}^{(\mathcal{D}^+)}(\phi_i) \ge \mathcal{L}^{(\mathcal{D})}(\phi_i)$.

Proof. We first use the Moore-Penrose pseudoinverse, which we denote by $(\cdot)^+$, to cast $T^{(\mathcal{D})}$ into the dual form

$$T^{(\mathcal{D})} = \Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{+}\Phi^\top = \Lambda^{1/2}\left(\Lambda^{1/2}\Phi\Phi^\top\Lambda^{1/2}\right)\left(\Lambda^{1/2}\Phi\Phi^\top\Lambda^{1/2} + \delta I_M\right)^{+}\Lambda^{-1/2}. \qquad (38)$$

This follows from the property of pseudoinverses that $A(A^\top A + \delta I)^+ A^\top = (AA^\top)(AA^\top + \delta I)^+$ for any matrix $A$. We now augment our system with one extra data point, getting

$$T^{(\mathcal{D}^+)} = \Lambda^{1/2}\left(\Lambda^{1/2}(\Phi\Phi^\top + \xi\xi^\top)\Lambda^{1/2}\right)\left(\Lambda^{1/2}(\Phi\Phi^\top + \xi\xi^\top)\Lambda^{1/2} + \delta I_M\right)^{+}\Lambda^{-1/2}, \qquad (39)$$

where $\xi$ is an $M$-element column vector. Equations 38 and 39 yield that

$$\mathcal{L}^{(\mathcal{D})}(\phi_i) = e_i^\top T^{(\mathcal{D})} e_i = e_i^\top\left(\Lambda^{1/2}\Phi\Phi^\top\Lambda^{1/2}\right)\left(\Lambda^{1/2}\Phi\Phi^\top\Lambda^{1/2} + \delta I_M\right)^{+} e_i, \qquad (40)$$

$$\mathcal{L}^{(\mathcal{D}^+)}(\phi_i) = e_i^\top T^{(\mathcal{D}^+)} e_i = e_i^\top\left(\Lambda^{1/2}(\Phi\Phi^\top + \xi\xi^\top)\Lambda^{1/2}\right)\left(\Lambda^{1/2}(\Phi\Phi^\top + \xi\xi^\top)\Lambda^{1/2} + \delta I_M\right)^{+} e_i. \qquad (41)$$

The rightmost expressions of Equations 40 and 41 both contain a factor of the form $B(B + \delta I)^+$, where $B$ is a symmetric positive semidefinite matrix. An operator of this form is a projector onto the row space of $B$ when $\delta = 0$ and a variant of this projector with "shrinkage" when $\delta > 0$. Comparing these equations, we find that the projectors are the same except that, in Equation 41, there is one additional dimension in the row space and thus one new basis vector in the projector (provided $\xi$ does not lie in the span of the columns of $\Phi$; otherwise there are zero additional dimensions). In the case $\delta = 0$, this new basis vector cannot decrease $e_i^\top T^{(\mathcal{D}^+)} e_i$, and thus $\mathcal{L}^{(\mathcal{D}^+)}(\phi_i) \ge \mathcal{L}^{(\mathcal{D})}(\phi_i)$ in the ridgeless case. In the case $\delta > 0$, a singular value decomposition of the projector confirms that the addition still cannot decrease $e_i^\top T^{(\mathcal{D}^+)} e_i$. This shows the desired property. It follows as a corollary that increasing $n \to n + 1$ cannot decrease $\mathcal{L}(f)$.

Property (d): $\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_i) \ge 0$, $\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_j) \le 0$, and $\frac{\partial}{\partial\delta}\mathcal{L}^{(\mathcal{D})}(\phi_i) \le 0$.

Proof. Differentiating $T^{(\mathcal{D})}_{jj}$ with respect to a particular $\lambda_i$, we find that

$$\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_j) = \frac{\partial}{\partial\lambda_i}T^{(\mathcal{D})}_{jj} = \left(\delta_{ij} - \lambda_j\,\boldsymbol{\phi}_j^\top K^{-1}\boldsymbol{\phi}_i\right)\boldsymbol{\phi}_i^\top K^{-1}\boldsymbol{\phi}_j,$$

where $\boldsymbol{\phi}_i^\top$ is the $i$th row of $\Phi$ and $K = \Phi^\top\Lambda\Phi + \delta I_n$.

Specializing to the case $i = j$, we note that $\boldsymbol{\phi}_i^\top K^{-1}\boldsymbol{\phi}_i \ge 0$ because $K$ is positive definite, and $\lambda_i\boldsymbol{\phi}_i^\top K^{-1}\boldsymbol{\phi}_i \le 1$ because $\lambda_i\boldsymbol{\phi}_i\boldsymbol{\phi}_i^\top$ is one of the positive semidefinite summands in $K = \sum_k\lambda_k\boldsymbol{\phi}_k\boldsymbol{\phi}_k^\top + \delta I_n$. The first clause of the property follows. To prove the second clause, we instead specialize to the case $i \ne j$, which yields that

$$\frac{\partial}{\partial\lambda_i}\mathcal{L}^{(\mathcal{D})}(\phi_j) = \frac{\partial}{\partial\lambda_i}T^{(\mathcal{D})}_{jj} = -\lambda_j\left(\boldsymbol{\phi}_j^\top K^{-1}\boldsymbol{\phi}_i\right)^2,$$

which is manifestly nonpositive because $\lambda_j > 0$. The second clause follows. Differentiating Equation 51 w.r.t. $\delta$ yields that $\frac{\partial}{\partial\delta}T^{(\mathcal{D})} = -\Lambda\Phi K^{-2}\Phi^\top$. We then observe that

$$\frac{\partial}{\partial\delta}\mathcal{L}^{(\mathcal{D})}(\phi_i) = e_i^\top\frac{\partial}{\partial\delta}T^{(\mathcal{D})} e_i = -\lambda_i\, e_i^\top\Phi K^{-2}\Phi^\top e_i,$$

which must be nonpositive because $\lambda_i > 0$ and $\Phi K^{-2}\Phi^\top$ is manifestly positive semidefinite. The desired property follows.

Property (e): $\mathcal{E}(f) \ge \mathcal{B}(f) \ge (1 - \mathcal{L}(f))^2$.

Proof. Noting that $||\mathbf{v}|| = ||f|| = 1$, expected MSE is given by

$$\mathcal{E}(f) = \mathbb{E}\left[||\mathbf{v} - \hat{\mathbf{v}}||^2\right] = ||\mathbf{v}||^2 - 2\mathbf{v}^\top\mathbb{E}[\hat{\mathbf{v}}] + \mathbb{E}\left[||\hat{\mathbf{v}}||^2\right] = \underbrace{1 - 2\mathbf{v}^\top\mathbb{E}[\hat{\mathbf{v}}] + ||\mathbb{E}[\hat{\mathbf{v}}]||^2}_{\text{bias } \mathcal{B}(f)} + \underbrace{\mathbb{E}\left[||\hat{\mathbf{v}}||^2\right] - ||\mathbb{E}[\hat{\mathbf{v}}]||^2}_{\text{variance } \mathcal{V}(f)}.$$

It is apparent that $\mathcal{E}(f) \ge \mathcal{B}(f)$. Projecting any vector onto an arbitrary unit vector can only decrease its magnitude, and so

$$\mathcal{B}(f) \ge 1 - 2\mathbf{v}^\top\mathbb{E}[\hat{\mathbf{v}}] + \mathbb{E}[\hat{\mathbf{v}}]^\top\mathbf{v}\mathbf{v}^\top\mathbb{E}[\hat{\mathbf{v}}] = \left(1 - \mathbf{v}^\top\mathbb{E}[\hat{\mathbf{v}}]\right)^2 = (1 - \mathcal{L}(f))^2.$$

G.1 PROOF OF THEOREM 3.2 (CONSERVATION OF LEARNABILITY)

First, we note that, for any orthogonal basis $\mathcal{F}$ on $\mathcal{X}$,

$$\sum_{f\in\mathcal{F}}\mathcal{L}^{(\mathcal{D})}(f) = \sum_{\mathbf{v}\in V}\frac{\mathbf{v}^\top T^{(\mathcal{D})}\mathbf{v}}{\mathbf{v}^\top\mathbf{v}},$$

where $V$ is an orthogonal set of vectors spanning $\mathbb{R}^M$. This is equivalent to $\mathrm{Tr}\left[T^{(\mathcal{D})}\right]$. This trace is given by

$$\mathrm{Tr}\left[T^{(\mathcal{D})}\right] = \mathrm{Tr}\left[\Phi^\top\Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\right] = \mathrm{Tr}\left[K_{\mathcal{D}\mathcal{D}}\left(K_{\mathcal{D}\mathcal{D}} + \delta I_n\right)^{-1}\right].$$

When $\delta = 0$, this trace simplifies to $\mathrm{Tr}[I_n] = n$. When $\delta > 0$, it is strictly less than $n$. This proves the theorem.

Remark. Theorem 3.2 is a consequence of the fact that, at zero ridge, $T^{(\mathcal{D})}$ is simply a projector onto an $n$-dimensional space spanned by the embeddings of the $n$ samples.

H DERIVATIONS: THE EIGENLEARNING EQUATIONS

In this appendix, we derive the eigenlearning equations for the test risk and covariance of the predicted function. Throughout this derivation, we shall prioritize clarity and interpretability over mathematical rigor, giving a derivation which can be understood without advanced mathematical tools and thereby filling a gap in the literature. For a formal derivation using random matrix theory, see Jacot et al. (2020) , and for a derivation using a replica calculation of a similar level of rigor as we use here, see Canatar et al. (2021) . In the interest of clarity, we begin with a brief summary of the appendix. We begin by discussing the problem setting (Section H.1). We then define the learning transfer matrix formalism (H.2) and use it to set the ridge parameter and noise to zero (H.3). After stating our main approximation (H.4), we find the expectation of the learning transfer matrix (H.5, H.6), using our conservation law to fix κ (H.7, H.8). Taking well-placed derivatives, we bootstrap the expectation of the learning transfer matrix to its second order statistics (H.9) and add back the ridge and noise (H.10). We conclude with various useful bounds on κ (H.11).

H.1 THE DATA DISTRIBUTION

We shall, in this derivation, consider a slightly more general setting than discussed in the main text. In the main text, we supposed the data were drawn i.i.d. from a continuous distribution $p$ over $\mathbb{R}^d$. Here, in addition to this case, we shall also consider a setting in which the data are sampled without replacement from a discrete set $\mathcal{X} \subset \mathbb{R}^d$ with $|\mathcal{X}| = M$. We will refer to this as the "discrete setting" and the setting with continuous distribution as the "continuous setting." As discussed in Appendix B, the discrete setting matches several of our experiments (namely those on the discretized unit circle and on the vertices of the hypercube). This discrete setting clearly converges to the continuous setting as $M \to \infty$: when $M \gg n^2$, the probability of sampling the same point twice is negligible (and so we can drop the "without replacement" and sample with replacement as in the continuous setting) and, by distributing $\mathcal{X}$ throughout $\mathbb{R}^d$ with point density proportional to $p$, we approach sampling from a continuous distribution. Alternatively, we could imagine $\mathcal{X}$ consists of $M$ i.i.d. samples from $p$, so that, when we later sample $n$ points from $\mathcal{X}$, they are themselves i.i.d. samples from $p$ as $M \to \infty$. It is worth noting that the data distribution in e.g. a computer vision task is in fact discrete because pixel values are discretized, and thus it is reasonable to work with a discrete measure. Kernel eigenmodes in the discrete setting are defined as $M^{-1}\sum_{x'\in\mathcal{X}}K(x, x')\phi_i(x') = \lambda_i\phi_i(x)$. Note that, in this case, the number of eigenmodes is $M$, the same number as the cardinality of $\mathcal{X}$. In the continuous setting, the number of eigenmodes is infinite (though there may be only finitely many with nonzero eigenvalues), but, like Bordelon et al. (2020), we will find it useful to assume that we need only consider a finite (but very large) number $M$. This serves merely to permit us to work with finite matrices (to which the standard tools of linear algebra apply) and to thereby save us the trouble of dealing with infinite and semi-infinite matrices, which require greater care. So long as $M$ is sufficiently large, we do not lose anything in discarding the exceedingly small eigenvalue tail $\lambda_{i>M}$, as is typical in the study of kernel methods and as our final results will confirm. Our subsequent derivation will generally apply to both the discrete and continuous settings.

H.2 THE LEARNING TRANSFER MATRIX

We begin by translating the KRR predictor into the kernel eigenbasis. Because $K(x, x') = \sum_{i=1}^M\lambda_i\phi_i(x)\phi_i(x')$, we can decompose the empirical kernel matrix as $K_{\mathcal{D}\mathcal{D}} = \Phi^\top\Lambda\Phi$, where $\Lambda \equiv \mathrm{diag}(\lambda_1, ..., \lambda_M)$ and $\Phi$ is the $M\times n$ "design matrix" given by $\Phi_{ij} \equiv \phi_i(x_j)$. The predicted function coefficients $\hat{\mathbf{v}}$ are given by

$$\hat v_i = \langle\phi_i, \hat f\rangle = \lambda_i\boldsymbol{\phi}_i^\top\left(K_{\mathcal{D}\mathcal{D}} + \delta I_n\right)^{-1}\Phi^\top\mathbf{v},$$

where we have used the orthonormality of the eigenfunctions and defined $\boldsymbol{\phi}_i$, with $[\boldsymbol{\phi}_i]_j = \phi_i(x_j)$, to be the $i$th row of $\Phi$. Stacking these coefficients into a matrix equation, we find

$$\hat{\mathbf{v}} = \Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\Phi^\top\mathbf{v} = T^{(\mathcal{D})}\mathbf{v},$$

where the learning transfer matrix

$$T^{(\mathcal{D})} \equiv \Lambda\Phi\left(\Phi^\top\Lambda\Phi + \delta I_n\right)^{-1}\Phi^\top \qquad (51)$$

is an $M\times M$ matrix, independent of $f$, that fully describes the model's learning behavior on a training set $\mathcal{D}$. The learning transfer matrix is the same as the "reconstruction operator" of Jacot et al. (2020) viewed as a finite matrix instead of a linear operator. Full understanding of the statistics of $T^{(\mathcal{D})}$ will give us the statistics of $\hat{\mathbf{v}}$ and thus of $\hat f$, and so our main objective will be to find the mean and the covariance of $T^{(\mathcal{D})}$.
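The following sketch (toy spectrum, Gaussian design matrix standing in for the true eigenfunctions) builds $T^{(\mathcal{D})}$ explicitly from Equation 51 and checks that $\hat{\mathbf{v}} = T^{(\mathcal{D})}\mathbf{v}$ reproduces the eigencoefficients obtained from the KRR formula directly; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n, delta = 50, 12, 1e-3
lam = 1.0 / np.arange(1, M + 1) ** 2          # toy eigenvalues
Lam = np.diag(lam)
Phi = rng.standard_normal((M, n))             # design matrix Phi_ij = phi_i(x_j) (Gaussian stand-in)
v = rng.standard_normal(M)                    # target eigencoefficients

T_D = Lam @ Phi @ np.linalg.inv(Phi.T @ Lam @ Phi + delta * np.eye(n)) @ Phi.T
v_hat = T_D @ v                               # predicted eigencoefficients via Equation 51

K_DD = Phi.T @ Lam @ Phi                      # empirical kernel matrix
y = Phi.T @ v                                 # noiseless training targets
v_hat_direct = Lam @ Phi @ np.linalg.solve(K_DD + delta * np.eye(n), y)
print(np.allclose(v_hat, v_hat_direct))       # True
```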

H.3 SETTING THE RIDGE AND NOISE TO ZERO

Our setting includes both a nonzero ridge parameter and nonzero noise. However, it is by this point modern folklore that many small eigenvalues together act as an effective ridge parameter equal to their sum and that power in zero-eigenvalue modes is effectively noise (see e.g. Canatar et al. (2021) for a discussion). Inverting these equivalences, we should expect to be able to convert $\delta$ into a small increase to many eigenvalues and convert $\epsilon^2$ into power in a zero-eigenvalue mode, thereby permitting ourselves to consider neither ridge nor noise in our derivation and to add them back at the end. Here we explain our method for doing so.

The ridge parameter can be viewed as a uniform increase in all eigenvalues. We first observe that $M^{-1} \Phi^\top \Phi = I_n$ in the discrete case. Letting $T^{(\mathcal{D})}(\Lambda; \delta)$ denote the learning transfer matrix with eigenvalue matrix $\Lambda$ and ridge parameter $\delta$, it follows from Equation 51 and this fact that
$$T^{(\mathcal{D})}(\Lambda; \delta) = \Lambda \left( \Lambda + \tfrac{\delta}{M} I_M \right)^{-1} T^{(\mathcal{D})}\!\left( \Lambda + \tfrac{\delta}{M} I_M;\, 0 \right). \tag{52}$$
In the continuous case, $M^{-1} \Phi^\top \Phi \to I_n$ as $M \to \infty$ (since the columns of $\Phi$ are uncorrelated), and since $M$ is very large in the continuous setting, we are again free to use Equation 52. As for the noise, we simply set $\epsilon^2 = 0$ for now and, once we have our final equations, add power $\epsilon^2$ to a hypothetical zero-eigenvalue mode.

H.4 ASSUMPTION: THE UNIVERSALITY OF Φ

We wish to take averages over $\Phi$ in finding the statistics of $T^{(\mathcal{D})}$. The distribution of $\Phi$ is in fact highly structured (reflecting the eigenstructure of the kernel), and we know only that $\mathbb{E}[\Phi_{ij} \Phi_{i'j}] = \delta_{ii'}$. We neglect this structure, making the "universality" assumption that we may take $\Phi$ to be sampled from a simple Gaussian measure without substantially changing the statistics of $T^{(\mathcal{D})}$. We henceforth assume $\Phi_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$. This universality assumption is also made in comparable works (implicitly by Bordelon et al. (2020) and Canatar et al. (2021), and explicitly by Jacot et al. (2020)); see Appendix A for further references to relevant literature. The validity of this approximation is ultimately justified by the close match of our theory with experiment.

H.5 SHOWING THAT THE MEAN LEARNING TRANSFER MATRIX IS DIAGONAL

We next observe that
$$\mathbb{E}_\Phi\!\left[ \Lambda \Phi \left( \Phi^\top U^\top \Lambda U \Phi \right)^{-1} \Phi^\top \right] = \mathbb{E}_\Phi\!\left[ \Lambda U^\top \Phi \left( \Phi^\top \Lambda \Phi \right)^{-1} \Phi^\top U \right], \tag{53}$$
where $U$ is any orthogonal $M \times M$ matrix. Defining $U^{(m)}$ as the matrix with entries $U^{(m)}_{ab} \equiv \delta_{ab}(1 - 2\delta_{am})$, noting that $U^{(m)} \Lambda U^{(m)} = \Lambda$, and plugging $U^{(m)}$ in as $U$ in Equation 53 (using that $\Lambda$ and $U^{(m)}$ are both diagonal and hence commute), we find that
$$\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ab} = \left( U^{(m)\top}\, \mathbb{E}\big[T^{(\mathcal{D})}\big]\, U^{(m)} \right)_{ab} = (-1)^{\delta_{am} + \delta_{bm}}\, \mathbb{E}\big[T^{(\mathcal{D})}\big]_{ab}. \tag{54}$$
By choosing $m = a$, we conclude that $\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ab} = 0$ if $a \neq b$.

H.6 FIXING THE FORM OF $\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ii}$

H.6.1 ISOLATING THE DESIRED ELEMENT

We now isolate a particular diagonal element of the mean learning transfer matrix. To do so, we write $\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ii}$ in terms of $\lambda_i$ (the $i$-th eigenvalue), $\Lambda_{(i)}$ ($\Lambda$ with its $i$-th row and column removed), $\phi_i^\top$ (the $i$-th row of $\Phi$), and $\Phi_{(i)}$ ($\Phi$ with its $i$-th row removed). Using the Sherman-Morrison matrix inversion formula, we find that
$$\left( \Phi^\top \Lambda \Phi \right)^{-1} = \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} + \lambda_i \phi_i \phi_i^\top \right)^{-1} = \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} - \frac{ \lambda_i \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} }{ 1 + \lambda_i \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i }. \tag{55}$$
Inserting this into the expectation of $T^{(\mathcal{D})}$, we find that
$$\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ii} = \mathbb{E}_{\Phi_{(i)}, \phi_i}\!\left[ \lambda_i \phi_i^\top \left( \Phi^\top \Lambda \Phi \right)^{-1} \phi_i \right] = \mathbb{E}_{\Phi_{(i)}, \phi_i}\!\left[ \lambda_i \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i - \frac{ \lambda_i^2 \left( \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i \right)^2 }{ 1 + \lambda_i \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i } \right] = \mathbb{E}_{\Phi_{(i)}, \phi_i}\!\left[ \frac{\lambda_i}{ \lambda_i + \left( \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i \right)^{-1} } \right] = \mathbb{E}_{\Phi_{(i)}, \phi_i}\!\left[ \frac{\lambda_i}{\lambda_i + \kappa^{(\Phi_{(i)}, \phi_i)}} \right], \tag{56}$$
where $\kappa^{(\Phi_{(i)}, \phi_i)} \equiv \kappa^{(\Phi)}_i \equiv \left( \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i \right)^{-1}$ is a nonnegative scalar.
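As a sanity check of Equations 54 and 56, the following minimal sketch Monte-Carlo averages $T^{(\mathcal{D})}$ over Gaussian design matrices (the universality assumption) at zero ridge and confirms that the off-diagonal elements of $\mathbb{E}\big[T^{(\mathcal{D})}\big]$ are negligible; it also prints the trace, which equals $n$ exactly (the ridgeless conservation law). The spectrum, $M$, $n$, and the number of trials are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, n_trials = 50, 10, 2000
lam = 1.0 / np.arange(1, M + 1) ** 2                # illustrative power-law spectrum
Lam = np.diag(lam)

T_mean = np.zeros((M, M))
for _ in range(n_trials):
    Phi = rng.standard_normal((M, n))               # universality assumption: Phi_ij ~ iid N(0, 1)
    T = Lam @ Phi @ np.linalg.solve(Phi.T @ Lam @ Phi, Phi.T)   # zero-ridge transfer matrix
    T_mean += T / n_trials

off_diag = T_mean - np.diag(np.diag(T_mean))
print(np.abs(off_diag).max())                       # ~0: E[T^(D)] is (approximately) diagonal
print(np.trace(T_mean))                             # exactly n: total learnability is conserved
```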

H.7 CONCENTRATION AND MODE-INDEPENDENCE OF κ (Φ) i

Important quantities in statistical mechanics systems are typically self-averaging (i.e., concentrating about their expectations) in the thermodynamic limit. Self-averaging quantities are the focus of random matrix theory, and generalization metrics in machine learning also tend to be self-averaging under most circumstances (e.g., resampling the data and rerunning a training procedure will typically yield a similar generalization error). Here we argue that $\kappa^{(\Phi)}_i$ is self-averaging in the thermodynamic limit and can be replaced by its expectation. This could be shown rigorously by means of random matrix theory. Here we opt simply to observe that, if $\kappa^{(\Phi)}_i$ were not self-averaging, then for modes $i$ such that $\lambda_i \sim \kappa^{(\Phi)}_i$, $T^{(\mathcal{D})}_{ii}$ and thus $\mathcal{L}^{(\mathcal{D})}(\phi_i)$ would not be self-averaging either. However, because $\mathcal{L}^{(\mathcal{D})}(\phi_i)$ is a generalization metric like MSE, we should in general expect it to be self-averaging at large $n$. Our experimental results (Figure 2) confirm that fluctuations in $\mathcal{L}^{(\mathcal{D})}(\phi_i)$ are indeed small in practice, especially at large $n$. We thus replace $\kappa^{(\Phi)}_i$ with its expectation $\kappa_i \equiv \mathbb{E}_\Phi\big[\kappa^{(\Phi)}_i\big]$.

We next argue that $\kappa_i$ is approximately independent of $i$, so we can replace it with a mode-independent constant $\kappa$. This, too, could be argued rigorously by means of random matrix theory. We opt instead for an eigenmode-removal argument inspired by the cavity method of statistical physics (Del Ferraro et al., 2014). Observe that, in the thermodynamic limit, the addition or removal of a single eigenmode should have a negligible effect on any observable and thus on $\kappa_i$. We shall here show that, by inserting one eigenmode and removing another, we can transform $\kappa_i$ into $\kappa_j$ for any $i$ and $j$, implying that $\kappa_i \approx \kappa_j$.

Assume that the addition or removal of a single eigenmode negligibly affects $\kappa_i$. Concretely, assume that $\kappa_i \approx \kappa_i^+$, where $\kappa_i^+$ is $\kappa_i$ computed with the addition of one extra eigenmode of arbitrary eigenvalue. We choose the additional eigenmode to have eigenvalue $\lambda_i$, and we insert it at index $i$, effectively reinserting the missing mode $i$ into $\Phi_{(i)}$ and $\Lambda_{(i)}$. To clarify the random variables in play, we adopt a more explicit notation, writing $\Phi_{(i)}$ in terms of its row vectors as $\Phi_{(i)} = [\phi_1, \dots, \phi_{i-1}, \phi_{i+1}, \dots, \phi_M]^\top$. Using this notation, we find upon adding the new eigenmode that
$$\kappa_i \equiv \mathbb{E}_{\Phi_{(i)}, \phi_i}\!\left[ \left( \phi_i^\top \left( \Phi_{(i)}^\top \Lambda_{(i)} \Phi_{(i)} \right)^{-1} \phi_i \right)^{-1} \right] \tag{57}$$
$$\equiv \mathbb{E}_{\{\phi_k\}_{k=1}^M}\!\left[ \left( \phi_i^\top \left( [\phi_1, \dots, \phi_{i-1}, \phi_{i+1}, \dots, \phi_M]\, \Lambda_{(i)}\, [\phi_1, \dots, \phi_{i-1}, \phi_{i+1}, \dots, \phi_M]^\top \right)^{-1} \phi_i \right)^{-1} \right] \tag{58}$$
$$\approx \kappa_i^+ \equiv \mathbb{E}_{\{\phi_k\}_{k=1}^M,\, \tilde\phi_i}\!\left[ \left( \phi_i^\top \left( [\phi_1, \dots, \phi_{i-1}, \tilde\phi_i, \phi_{i+1}, \dots, \phi_M]\, \Lambda\, [\phi_1, \dots, \phi_{i-1}, \tilde\phi_i, \phi_{i+1}, \dots, \phi_M]^\top \right)^{-1} \phi_i \right)^{-1} \right], \tag{59}$$
where $\Lambda$ is the original eigenvalue matrix and $\tilde\phi_i^\top$ is the design-matrix row corresponding to the new mode. We can perform the same manipulation with $\kappa_j$ ($j \neq i$), this time adding an extra eigenmode with eigenvalue $\lambda_j$ at index $j$, yielding
$$\kappa_j \equiv \mathbb{E}_{\Phi_{(j)}, \phi_j}\!\left[ \left( \phi_j^\top \left( \Phi_{(j)}^\top \Lambda_{(j)} \Phi_{(j)} \right)^{-1} \phi_j \right)^{-1} \right] \tag{60}$$
$$\equiv \mathbb{E}_{\{\phi_k\}_{k=1}^M}\!\left[ \left( \phi_j^\top \left( [\phi_1, \dots, \phi_{j-1}, \phi_{j+1}, \dots, \phi_M]\, \Lambda_{(j)}\, [\phi_1, \dots, \phi_{j-1}, \phi_{j+1}, \dots, \phi_M]^\top \right)^{-1} \phi_j \right)^{-1} \right] \tag{61}$$
$$\approx \kappa_j^+ \equiv \mathbb{E}_{\{\phi_k\}_{k=1}^M,\, \tilde\phi_j}\!\left[ \left( \phi_j^\top \left( [\phi_1, \dots, \phi_{j-1}, \tilde\phi_j, \phi_{j+1}, \dots, \phi_M]\, \Lambda\, [\phi_1, \dots, \phi_{j-1}, \tilde\phi_j, \phi_{j+1}, \dots, \phi_M]^\top \right)^{-1} \phi_j \right)^{-1} \right]. \tag{62}$$
We now compare Equations 59 and 62. Each is an expectation over $M + 1$ vectors drawn from the same isotropic measure. The statistics of these $M + 1$ vectors are symmetric under exchange, so we are free to relabel them.
Equation 59 is identical to Equation 62 upon relabeling $\phi_i \to \phi_j$, $\tilde\phi_i \to \phi_i$, and $\phi_j \to \tilde\phi_j$, so the two are equivalent and $\kappa_i^+ = \kappa_j^+$. This in turn implies that $\kappa_i \approx \kappa_j$. In light of this, we now replace all $\kappa_i$ with a mode-independent (but as-yet-unknown) constant $\kappa$ and conclude that
$$\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ij} = \delta_{ij} \frac{\lambda_i}{\lambda_i + \kappa}. \tag{63}$$
In summary, we have argued that $\kappa_i$ is not significantly changed by the addition or removal of a single eigenmode, and two such changes suffice to transform $\kappa_i$ into $\kappa_j$, so the two are approximately equal. Our argument here is similar in spirit to the cavity method of statistical physics (Del Ferraro et al., 2014), which essentially compares the behavior of a weakly-interacting system with and without a single element. The cavity method is often used as a simpler and more intuitive alternative to the replica method, a role it reprises here (contrast our approach with the replica approach of Canatar et al. (2021)).
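The approximate mode-independence of $\kappa_i$ argued above is easy to probe numerically. The following minimal sketch Monte-Carlo estimates $\kappa_i$ for a few modes $i$ under the Gaussian universality assumption; the spectrum, $M$, $n$, and the number of trials are illustrative choices, and the estimates should come out roughly equal across modes.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n, n_trials = 50, 10, 2000
lam = 1.0 / np.arange(1, M + 1) ** 2                # illustrative power-law spectrum

def kappa_i(i):
    """Monte Carlo estimate of kappa_i = E[(phi_i^T (Phi_(i)^T Lam_(i) Phi_(i))^{-1} phi_i)^{-1}]."""
    lam_rest = np.delete(lam, i)
    ests = []
    for _ in range(n_trials):
        Phi_rest = rng.standard_normal((M - 1, n))  # Phi with row i removed (Gaussian universality)
        phi_i = rng.standard_normal(n)              # row i of Phi
        A = Phi_rest.T @ np.diag(lam_rest) @ Phi_rest
        ests.append(1.0 / (phi_i @ np.linalg.solve(A, phi_i)))
    return np.mean(ests)

print([round(kappa_i(i), 5) for i in (0, 5, 25, 49)])   # roughly equal across modes
```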

H.8 DETERMINING κ

We can determine the value of $\kappa$ by observing that, using the ridgeless case of Theorem 3.2,
$$\sum_i \mathbb{E}\big[T^{(\mathcal{D})}\big]_{ii} = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} = n. \tag{64}$$
This is a much more straightforward method of fixing this constant than those used in comparable works. The ability to use the ridgeless version of Theorem 3.2 is the main motivation for setting $\delta = 0$ at the start of the derivation.

H.9 DIFFERENTIATING W.R.T. Λ TO OBTAIN THE COVARIANCE OF $T^{(\mathcal{D})}$

Here we obtain expressions for the covariance of $T^{(\mathcal{D})}$ and thereby the covariance of $\hat v$. Remarkably, we shall need no further approximations beyond those already made in approximating $\mathbb{E}\big[T^{(\mathcal{D})}\big]$, which lends credence to our thesis that understanding modewise learnabilities is sufficient for understanding more interesting statistics of $\hat f$.

We begin with a calculation that will later be of use: differentiating both sides of the constraint on $\kappa$ with respect to a particular eigenvalue, we find that
$$\frac{\partial}{\partial \lambda_i} \sum_{j=1}^M \frac{\lambda_j}{\lambda_j + \kappa} = \sum_{j=1}^M \frac{-\lambda_j}{(\lambda_j + \kappa)^2} \frac{\partial \kappa}{\partial \lambda_i} + \frac{\kappa}{(\lambda_i + \kappa)^2} = 0, \tag{65}$$
yielding
$$\frac{\partial \kappa}{\partial \lambda_i} = \frac{\kappa^2}{q (\lambda_i + \kappa)^2}, \quad \text{where} \quad q \equiv \sum_{j=1}^M \frac{\kappa \lambda_j}{(\lambda_j + \kappa)^2}. \tag{66}$$

We now factor $T^{(\mathcal{D})}$ into two matrices as $T^{(\mathcal{D})} = \Lambda Z$, where
$$Z \equiv \Phi \left( \Phi^\top \Lambda \Phi \right)^{-1} \Phi^\top. \tag{67}$$
Unlike $T^{(\mathcal{D})}$, the matrix $Z$ has the advantage of being symmetric and containing only one factor of $\Lambda$. We will find the second-order statistics of $Z$, which trivially give these statistics for $T^{(\mathcal{D})}$. From Equation 63, we find that the expectation of $Z$ is
$$\mathbb{E}[Z] = (\Lambda + \kappa I_M)^{-1}. \tag{68}$$
We also define a modified $Z$-matrix $Z^{(U)} \equiv \Phi \left( \Phi^\top U^\top \Lambda U \Phi \right)^{-1} \Phi^\top$, where $U$ is an orthogonal $M \times M$ matrix. Because the measure over which $\Phi$ is averaged is rotation-invariant, we can equivalently average over $\tilde\Phi \equiv U \Phi$ with the same measure, giving
$$\mathbb{E}_\Phi\big[ Z^{(U)} \big] = \mathbb{E}_{\tilde\Phi}\big[ U^\top \tilde\Phi \big( \tilde\Phi^\top \Lambda \tilde\Phi \big)^{-1} \tilde\Phi^\top U \big] = \mathbb{E}_\Phi\big[ U^\top Z U \big] = U^\top (\Lambda + \kappa I_M)^{-1} U. \tag{69}$$
It is similarly the case that
$$\mathbb{E}_\Phi\big[ (Z^{(U)})_{ij} (Z^{(U)})_{k\ell} \big] = \mathbb{E}_\Phi\big[ (U^\top Z U)_{ij} (U^\top Z U)_{k\ell} \big]. \tag{70}$$

Our aim will be to calculate expectations of the form $\mathbb{E}_\Phi[Z_{ij} Z_{k\ell}]$. With a clever choice of $U$, a symmetry argument quickly shows that most choices of the four indices make this expression zero. We define $U^{(m)}_{ab} \equiv \delta_{ab}(1 - 2\delta_{am})$ and observe that, because $\Lambda$ is diagonal, $(U^{(m)})^\top \Lambda U^{(m)} = \Lambda$ and thus $Z^{(U^{(m)})} = Z$. Equation 70 then yields
$$\mathbb{E}_\Phi[Z_{ij} Z_{k\ell}] = (-1)^{\delta_{im} + \delta_{jm} + \delta_{km} + \delta_{\ell m}}\, \mathbb{E}_\Phi[Z_{ij} Z_{k\ell}], \tag{71}$$
from which it follows that $\mathbb{E}_\Phi[Z_{ij} Z_{k\ell}] = 0$ if any index appears an odd number of times. In light of the fact that $Z_{ij} = Z_{ji}$, there are only three distinct nontrivial cases to consider:

1. $\mathbb{E}_\Phi[Z_{ii} Z_{ii}]$,
2. $\mathbb{E}_\Phi[Z_{ij} Z_{ij}]$ with $i \neq j$, and
3. $\mathbb{E}_\Phi[Z_{ii} Z_{jj}]$ with $i \neq j$.

We note that we are not using the Einstein convention of summation over repeated indices.

Cases 1 and 2. We now consider differentiating $Z$ with respect to a particular element of the matrix $\Lambda$. This yields
$$\frac{\partial Z_{i\ell}}{\partial \Lambda_{jk}} = -\phi_i^\top \left( \Phi^\top \Lambda \Phi \right)^{-1} \phi_j\, \phi_k^\top \left( \Phi^\top \Lambda \Phi \right)^{-1} \phi_\ell = -Z_{ij} Z_{k\ell}, \tag{72}$$
where, as before, $\phi_i$ is the $i$-th row of $\Phi$. This gives us the useful expression
$$\mathbb{E}[Z_{ij} Z_{k\ell}] = -\frac{\partial}{\partial \Lambda_{jk}} \mathbb{E}[Z_{i\ell}]. \tag{73}$$
We now set $\ell = i$, $k = j$ and evaluate this expression using Equation 68, concluding that
$$\mathbb{E}[Z_{ij} Z_{ij}] = \mathbb{E}[Z_{ij} Z_{ji}] = -\frac{\partial}{\partial \lambda_j} \frac{1}{\lambda_i + \kappa} = \frac{1}{(\lambda_i + \kappa)^2} \left( \delta_{ij} + \frac{\partial \kappa}{\partial \lambda_j} \right) \tag{74}$$
and thus
$$\mathrm{Cov}[Z_{ij}, Z_{ij}] = \mathrm{Cov}[Z_{ij}, Z_{ji}] = \frac{1}{(\lambda_i + \kappa)^2} \frac{\partial \kappa}{\partial \lambda_j} = \frac{\kappa^2}{q (\lambda_i + \kappa)^2 (\lambda_j + \kappa)^2}. \tag{75}$$
We did not require that $i \neq j$, and so Equations 74 and 75 hold for Case 1 as well as Case 2.

Case 3. We now aim to calculate $\mathbb{E}[Z_{ii} Z_{jj}]$ for $i \neq j$.
We might hope to use Equation 73 in calculating $\mathbb{E}[Z_{ii} Z_{jj}]$, but this approach is stymied by the fact that we would need to take a derivative with respect to $\Lambda_{ij}$, whereas we have an expression for $\mathbb{E}[Z]$ only for diagonal $\Lambda$. We can circumvent this by means of $Z^{(U)}$. From the definition of $Z^{(U)}$, we find that
$$\left[ \left( \frac{\partial}{\partial U_{ij}} - \frac{\partial}{\partial U_{ji}} \right) Z^{(U)} \right]_{ij} \Bigg|_{U = I_M} = -\phi_i^\top \left( \Phi^\top \Lambda \Phi \right)^{-1} \left( \lambda_i \phi_j \phi_i^\top - \lambda_j \phi_i \phi_j^\top + \lambda_i \phi_i \phi_j^\top - \lambda_j \phi_j \phi_i^\top \right) \left( \Phi^\top \Lambda \Phi \right)^{-1} \phi_j = (\lambda_j - \lambda_i) \left( Z_{ij}^2 + Z_{ii} Z_{jj} \right). \tag{76}$$
Differentiating with respect to both $U_{ij}$ and $U_{ji}$ with opposite signs ensures that the derivative is taken within the manifold of orthogonal matrices. Now, using Equation 69, we find that
$$\left[ \left( \frac{\partial}{\partial U_{ij}} - \frac{\partial}{\partial U_{ji}} \right) \mathbb{E}\big[ Z^{(U)} \big] \right]_{ij} \Bigg|_{U = I_M} = \left[ \left( \frac{\partial}{\partial U_{ij}} - \frac{\partial}{\partial U_{ji}} \right) U^\top (\Lambda + \kappa I_M)^{-1} U \right]_{ij} \Bigg|_{U = I_M} = \frac{1}{\lambda_i + \kappa} - \frac{1}{\lambda_j + \kappa}. \tag{77}$$
Taking the expectation of Equation 76, plugging in Equation 74 for $\mathbb{E}\big[Z_{ij}^2\big]$, comparing to Equation 77, and performing some algebra, we conclude that
$$\mathbb{E}[Z_{ii} Z_{jj}] = \frac{1}{(\lambda_i + \kappa)(\lambda_j + \kappa)} - \frac{\kappa^2}{q (\lambda_i + \kappa)^2 (\lambda_j + \kappa)^2} \tag{78}$$
and thus that $Z_{ii}$ and $Z_{jj}$ are anticorrelated, with covariance
$$\mathrm{Cov}[Z_{ii}, Z_{jj}] = -\frac{\kappa^2}{q (\lambda_i + \kappa)^2 (\lambda_j + \kappa)^2}. \tag{79}$$

Cases 1-3 can be summarized as
$$\mathrm{Cov}[Z_{ij}, Z_{k\ell}] = \frac{\kappa^2 \left( \delta_{ik}\delta_{j\ell} + \delta_{i\ell}\delta_{jk} - \delta_{ij}\delta_{k\ell} \right)}{q (\lambda_i + \kappa)(\lambda_j + \kappa)(\lambda_k + \kappa)(\lambda_\ell + \kappa)}. \tag{80}$$
Using the fact that $T^{(\mathcal{D})}_{ij} = \lambda_i Z_{ij}$, defining $\mathcal{L}_i \equiv \lambda_i (\lambda_i + \kappa)^{-1}$, and noting that $q = \sum_i \mathcal{L}_i (1 - \mathcal{L}_i) = n - \sum_m \mathcal{L}_m^2$ at zero ridge, we find that
$$\mathrm{Cov}\big[ T^{(\mathcal{D})}_{ij}, T^{(\mathcal{D})}_{k\ell} \big] = \frac{\mathcal{L}_i (1 - \mathcal{L}_j)\, \mathcal{L}_k (1 - \mathcal{L}_\ell)}{n - \sum_{m=1}^M \mathcal{L}_m^2} \left( \delta_{ik}\delta_{j\ell} + \delta_{i\ell}\delta_{jk} - \delta_{ij}\delta_{k\ell} \right). \tag{81}$$
Noting that $\mathcal{E}(f) = \mathbb{E}\big[ |v - \hat v|^2 \big]$, recalling that $\hat v = T^{(\mathcal{D})} v$, and using Equation 81 to evaluate a sum over eigenmodes, we find that the expected MSE is given by
$$\mathcal{E}(f) = \frac{n}{n - \sum_m \mathcal{L}_m^2} \sum_i (1 - \mathcal{L}_i)^2 v_i^2. \tag{82}$$
Taking a sum over indices of $\hat v$, we find that the covariance of the predicted function can be written simply in terms of MSE as
$$\mathrm{Cov}[\hat v_i, \hat v_j] = \frac{\mathcal{L}_i^2\, \mathcal{E}(f)}{n}\, \delta_{ij}. \tag{83}$$

H.10 ADDING BACK THE RIDGE AND NOISE

We have thus far assumed $\delta = 0$. We can now add the ridge parameter back using Equation 52: to add a ridge parameter $\delta$, we need merely replace $\lambda_i \to \lambda_i + \frac{\delta}{M}$ and then change $T^{(\mathcal{D})}_{ij} \to \lambda_i \big( \lambda_i + \frac{\delta}{M} \big)^{-1} T^{(\mathcal{D})}_{ij}$. This yields
$$\mathbb{E}\big[T^{(\mathcal{D})}\big]_{ij} = \delta_{ij} \frac{\lambda_i}{\lambda_i + \frac{\delta}{M} + \kappa}, \tag{84}$$
$$\mathrm{Cov}\big[ T^{(\mathcal{D})}_{ij}, T^{(\mathcal{D})}_{k\ell} \big] = \frac{\lambda_i \lambda_k\, \kappa \left( \delta_{ik}\delta_{j\ell} + \delta_{i\ell}\delta_{jk} - \delta_{ij}\delta_{k\ell} \right)}{q \big( \lambda_i + \frac{\delta}{M} + \kappa \big)\big( \lambda_j + \frac{\delta}{M} + \kappa \big)\big( \lambda_k + \frac{\delta}{M} + \kappa \big)\big( \lambda_\ell + \frac{\delta}{M} + \kappa \big)}, \tag{85}$$
where $\kappa \geq 0$ satisfies
$$\sum_i \frac{\lambda_i + \frac{\delta}{M}}{\lambda_i + \frac{\delta}{M} + \kappa} = n \quad \text{and} \quad q \equiv \sum_{j=1}^M \frac{\lambda_j + \frac{\delta}{M}}{\big( \lambda_j + \frac{\delta}{M} + \kappa \big)^2}. \tag{86}$$
Taking either $\delta = 0$ or $M \to \infty$, we find that
$$n = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa} \tag{87}$$
and the mean and covariance of $T^{(\mathcal{D})}$ are again given by Equations 63 and 81. To summarize this simplification: in the continuous setting ($M \to \infty$), we recover the results of prior work. In the discrete setting with zero ridge, these expressions apply unmodified. In the discrete setting with positive ridge, the expressions acquire corrections with perturbative parameter $\frac{\delta}{M}$. In the main text, we report the expressions with $\frac{\delta}{M} = 0$, and our experiments obey this condition.
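Collecting the results of this appendix, the following minimal sketch implements the resulting estimators: it solves the constraint on $\kappa$ by bisection, forms the learnabilities $\mathcal{L}_i$, and evaluates the test-MSE estimator of Equation 82 (with noise added back as described below) along with the predicted-function covariance of Equation 83. It uses the continuous-setting constraint $n = \sum_i \lambda_i / (\lambda_i + \kappa) + \delta/\kappa$; the spectrum, target coefficients, and other inputs are illustrative choices.

```python
import numpy as np

def eigenlearning_estimators(lam, v, n, delta=0.0, eps2=0.0):
    """Solve for kappa, then return (kappa, learnabilities L, predicted test MSE, diag of Cov[v_hat])."""
    def constraint(kappa):                           # n = sum_i lam_i/(lam_i + kappa) + delta/kappa
        return (lam / (lam + kappa)).sum() + delta / kappa
    lo, hi = 1e-30, lam.sum() + delta + 1.0
    while constraint(hi) > n:                        # grow hi until the root is bracketed
        hi *= 2.0
    for _ in range(200):                             # bisection; the constraint is decreasing in kappa
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if constraint(mid) > n else (lo, mid)
    kappa = 0.5 * (lo + hi)

    L = lam / (lam + kappa)                          # modewise learnabilities L_i
    overfit = n / (n - (L ** 2).sum())               # overfitting coefficient n / (n - sum_m L_m^2)
    mse = overfit * (((1 - L) ** 2 * v ** 2).sum() + eps2)   # Eq. 82, with noise eps2 added back
    cov_vhat_diag = L ** 2 * mse / n                 # Eq. 83: Cov[v_hat_i, v_hat_j] = delta_ij L_i^2 E(f)/n
    return kappa, L, mse, cov_vhat_diag

lam = 1.0 / np.arange(1, 1001).astype(float) ** 2    # illustrative power-law spectrum
v = 1.0 / np.arange(1, 1001)                         # illustrative target eigencoefficients
kappa, L, mse, cov_diag = eigenlearning_estimators(lam, v, n=100, delta=1e-3, eps2=0.1)
print(kappa, L.sum(), mse)
```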



- We borrow this terminology from Wei et al. (2022).
- To study noiseless targets instead, simply subtract $\epsilon^2$ from expressions for test MSE.
- We call Theorem 3.2 a conservation law because, at zero ridge, total learnability remains constant as either the data distribution or the kernel changes. This is analogous to, for example, the physical conservation of charge.
- For train risk, we simply quote the result of Canatar et al. (2021).
- See Mallinar et al. (2022) for a discussion of this interpretation.
- We mean here that adding training samples while holding $\tau_{\mathrm{eff}}$ constant does not affect generalization.
- We note a number of recent approaches (Daniely & Malach, 2020; Kamath et al., 2020; Hsu, 2021) which give complexity lower bounds for learning parities using a simple degeneracy argument. However, these approaches do not leverage the spectral bias of the kernel and become vacuous when $k = d$.
- Cui et al. (2021) use simplifications of the expressions of Loureiro et al. (2021), whose notation does not map cleanly onto our tables. The overfitting coefficient and test MSE estimator of Cui et al. (2021) appear more complex than the alternatives in part because other works define intermediate quantities like $E_0$, $\gamma$, etc., which include sums over eigenmodes.
- This is a nontrivial observation, as it is not hard to construct contrived cases in which this approximation does not hold, as discussed by Sollich & Halees (2002).
- The conditions under which this universality assumption holds are an active area of research. A variety of rigorous works have proven this spectral universality assumption in specific high-dimensional settings (El Karoui, 2010; Cheng & Singer, 2013; Fan & Montanari, 2019; Liu et al., 2021; Lu & Yau, 2022), and Tomasini et al. (2022) have shown it does not quite hold in certain noisy low-dimensional settings.
- Equation 71 in the supplement of their published paper, giving the covariance of the predicted function, contains a spurious extra term, but this has been fixed in the arXiv version.
- More precisely, they study linear ridge regression and linear gradient flow, but their result also applies to the kernelized versions of these algorithms.
- We take this terminology from control theory, in which, for a system under study, a "transfer function" maps inputs to outputs, or driving to response.
- Jacot et al. (2020) scale the ridge parameter proportional to $n$ in their definition of kernel ridge regression; our reproduction of their result accounts for this and applies to our convention.



Figure 1: Toy problem illustrating our conservation law. (A) The task domain: the unit circle discretized into M = 10 points, n of which comprise the dataset D (filled circles). (B) The 10 eigenfunctions of a rotation-invariant kernel on this domain, grouped into degenerate pairs and shifted vertically for clarity. (C) We use each eigenfunction ϕ k in turn as the target function. For each ϕ k , we compute training targets ϕ k (D), obtain a predicted function fk in a standard supervised learning setup, and subsequently compute D-learnability. This comprises 10 orthogonal learning problems. (D,E) Stacked bar charts with 10 components showing D-learnability for each eigenfunction. The left bar in each pair contains results from NTK regression, while the right bar contains results from wide neural networks. Models vary in activation function and number of hidden layers (HL). Dashed lines indicate n. Learnabilities always sum to n, exactly for kernel regression and approximately for wide networks.

Figure 3: We reproduce and explain the deep bootstrap phenomenon in KRR. (A) An experiment illustrating the deep bootstrap effect using a ResNet-18 on CIFAR-10. (B) An analogous experiment using KRR on binarized MNIST. Eigenlearning predictions closely match experimental curves, and $\tau_{\mathrm{eff}} = \kappa_0^{-1}$ (vertical dashed lines) faithfully predicts the transition from regularization-limited to data-limited fitting for each n.

Figure 4: Predicted function smoothness matches experiment. Predicted MSG of f (curves) and empirical MSG for kernel regression (triangles) for k = 1 modes on hyperspheres with varying dimension.

Figure 4 shows the MSG of the function learned by KRR with a polynomial kernel trained on k = 1 modes on spheres of increasing dimension d, normalized by the MSG of the ground-truth target function. See Appendix E for experimental details and additional experiments in this vein. True MSG matches predicted MSG well in all settings, particularly at large n and d.

Figure 5: 4HL ReLU NTK eigenvalues and multiplicities on three synthetic domains. (A) Eigenvalues vs. k for the discretized unit circle (M = 256). Eigenvalues decrease as k increases except for a few near-exceptions at high k. (B) Eigenvalues for the 8d hypercube. Eigenvalues decrease monotonically with k. (C) Eigenvalues for the 7-sphere up to k = 70. Eigenvalues decrease monotonically with k. (D) Eigenvalue multiplicity for the discretized unit circle. All eigenvalues are doubly degenerate (due to cos and sin modes) except for k = 0 and k = 128. (E) Eigenvalue multiplicity for the 8d hypercube. (F) Eigenvalue multiplicity for the 7-sphere.

Figure 6: Eigenvalues and eigencoefficients for four binary image classification tasks. (A) Kernel eigenvalues as computed from 10^4 training points as described in Appendix B. Spectra for CIFAR-10 tasks roughly follow power laws with exponent -1, while spectra for MNIST tasks follow power laws with slightly steeper descent. (B) Eigencoefficients as computed from 10^4 training points. Tasks with higher observed learnability (Figure 2) place more weight in higher (i.e., lower-index) eigenmodes and less in lower ones.

Figure 7: Comparison between predicted learnability and MSE for networks of various widths. (A-F) Predicted (curves) and true (triangles and circles) learnability for four eigenmodes on the 8d hypercube. Dataset size n varies within each subplot, and the width of the 4HL ReLU network varies between subplots. (G-L) Same as (A-F) but with MSE instead of learnability.

Figure 8: Deep bootstrap theoretical test/train curves for a synthetic kernel spectrum. Here we illustrate the deep bootstrap phenomenon using a hypothetical kernel giving power-law eigenvalues and eigencoefficients with exponent α = 2. In this setting, we note that finite-data test/train curves simultaneously split from the n → ∞ "online learning curve" (black dot-dashed line). We see that $\tau_{\mathrm{eff}} = \kappa_0^{-1}$ again marks the transition from regularization-limited to data-limited fitting.

Figure 9: Predicted function smoothness matches experiment on the unit circle. Theoretical MSG predictions (curves) and empirical values for finite networks (circles) and kernel regression (triangles) for various eigenmodes on the discretized unit circle with M = 256, normalized by the ground-truth mean squared gradient $G(f) = \mathbb{E}\,|f'(x)|^2 = k^2$. Because this is a discrete domain, smoothness is computed using a discretized estimate of $G(\hat f)$.

Figure 10: Modewise learnabilities fall on universal sigmoidal curves. (A-F) Predicted learnability curve (sigmoidal curves) and empirical learnabilities for trained networks (circles) and NTK regression (triangles) for eigenmodes k ∈ {0, ..., 7} on three domains for n = 8, 64. Vertical dashed lines indicate κ. (G) All data from (A-F) with eigenvalues rescaled by κ.

Sollich (2001) derived the omniscient risk estimate in the context of GP regression and was, to our knowledge, the first to obtain the result. As its notation does not map cleanly onto the variables we use, we do not include this work in the above tables. Other analogous works include Wu &



Modifying $v$ so as to place power $\epsilon^2$ into an eigenmode with zero eigenvalue (for which $\mathcal{L} = 0$), Equation 82 for MSE becomes
$$\mathcal{E}(f) = \frac{n}{n - \sum_m \mathcal{L}_m^2} \left( \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 \right).$$

H.11 PROPERTIES OF κ

In experimental settings, $\kappa$ is in general easy to find numerically, but for theoretical study we anticipate it being useful to have some analytical bounds on $\kappa$ in order to, for example, prove that certain eigenmodes are or are not asymptotically learned for particular spectra. To that end, the following lemma gives some properties of $\kappa$.

Lemma H.1. For $\kappa \geq 0$ solving $\sum_{i=1}^M \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa} = n$, with positive eigenvalues $\{\lambda_i\}_{i=1}^M$ ordered from greatest to least, the following properties hold:

(a) $\kappa = \infty$ when $n = 0$, and $\kappa = 0$ when $n \to M$ and $\delta = 0$.

(b) $\kappa$ is strictly decreasing with $n$.

Proof of property (a): Because $\sum_{i=1}^M \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa}$ is strictly decreasing with $\kappa$ for $\kappa \geq 0$, there can be only one solution for a given $n$. The first statement follows by inspection, and the second follows by inspection and our assumption that all eigenvalues are strictly positive.

Proof of property (b):

Differentiating the constraint on $\kappa$ with respect to $n$ yields
$$\left( \sum_{i=1}^M \frac{\lambda_i}{(\lambda_i + \kappa)^2} + \frac{\delta}{\kappa^2} \right) \frac{\partial \kappa}{\partial n} = -1,$$
and so $\frac{\partial \kappa}{\partial n} < 0$, as desired.

Proof of property (d):

We set $\delta = 0$ and consider replacing $\lambda_i$ with $\lambda_\ell$ if $i \leq \ell$ and with $0$ if $i > \ell$. Noting that this does not increase any term in the sum, we find that
$$n = \sum_{i=1}^M \frac{\lambda_i}{\lambda_i + \kappa} \geq \sum_{i=1}^\ell \frac{\lambda_\ell}{\lambda_\ell + \kappa} = \frac{\ell \lambda_\ell}{\lambda_\ell + \kappa}.$$
The desired property in the ridgeless case follows. A positive ridge parameter only increases $\kappa$, so the property holds in general. We note that a positive ridge parameter can be incorporated into the bound, giving
$$n \geq \frac{\ell \lambda_\ell}{\lambda_\ell + \kappa} + \frac{\delta}{\kappa}.$$
We also note that, as observed by Jacot et al. (2020) and Spigler et al. (2020), the asymptotic scaling of $\kappa$ can be fixed if the kernel eigenvalues follow a power-law spectrum. Specifically, if $\lambda_i \sim i^{-\alpha}$ for some $\alpha > 1$, then Jacot et al. (2020) show that
$$\kappa = \Theta\!\left( \delta n^{-1} + n^{-\alpha} \right). \tag{91}$$
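The following minimal sketch checks Lemma H.1(b) and the power-law scaling of Equation 91 numerically: it solves the constraint on $\kappa$ by bisection for several values of $n$ and compares against $\delta n^{-1} + n^{-\alpha}$. The exponent, ridge, spectrum length, and sample sizes are illustrative choices.

```python
import numpy as np

def solve_kappa(lam, n, delta=0.0):
    """Bisection solve of sum_i lam_i/(lam_i + kappa) + delta/kappa = n for kappa > 0."""
    def constraint(kappa):
        return (lam / (lam + kappa)).sum() + delta / kappa
    lo, hi = 1e-30, lam.sum() + delta + 1.0
    while constraint(hi) > n:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if constraint(mid) > n else (lo, mid)
    return 0.5 * (lo + hi)

alpha, delta = 2.0, 1e-2
lam = 1.0 / np.arange(1, 10_001).astype(float) ** alpha    # lambda_i ~ i^{-alpha}
prev = np.inf
for n in (10, 30, 100, 300, 1000):
    kappa = solve_kappa(lam, n, delta)
    assert kappa < prev                                    # property (b): kappa decreases with n
    prev = kappa
    print(n, kappa, delta / n + n ** -alpha)               # kappa tracks Theta(delta/n + n^{-alpha})
```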

