THE ONSET OF VARIANCE-LIMITED BEHAVIOR FOR NETWORKS IN THE LAZY AND RICH REGIMES

Abstract

For small training set sizes P, the generalization error of wide neural networks is well-approximated by the error of an infinite-width neural network (NN), either in the kernel or mean-field/feature-learning regime. However, after a critical sample size P*, we empirically find that finite-width network generalization becomes worse than that of the infinite-width network. In this work, we empirically study the transition from infinite-width behavior to this variance-limited regime as a function of sample size P and network width N. We find that finite-size effects can become relevant for very small dataset sizes on the order of P* ∼ √N for polynomial regression with ReLU networks. We discuss the source of these effects using an argument based on the variance of the NN's final neural tangent kernel (NTK). This transition can be pushed to larger P by enhancing feature learning or by ensemble averaging the networks. We find that the learning curve for regression with the final NTK is an accurate approximation of the NN learning curve. Using this, we provide a toy model which also exhibits P* ∼ √N scaling and has P-dependent benefits from feature learning.

1. INTRODUCTION

Deep learning systems are achieving state-of-the-art performance on a variety of tasks (Tan & Le, 2019; Hoffmann et al., 2022). Exactly how their generalization is controlled by network architecture, training procedure, and task structure is still not fully understood. One promising direction for deep learning theory in recent years is the infinite-width limit. Under a certain parameterization, infinite-width networks yield a kernel method known as the neural tangent kernel (NTK) (Jacot et al., 2018; Lee et al., 2019). Kernel methods are easier to analyze, allowing for accurate prediction of the generalization performance of wide networks in this regime (Bordelon et al., 2020; Canatar et al., 2021; Bahri et al., 2021; Simon et al., 2021). Infinite-width networks can also operate in the mean-field regime if network outputs are rescaled by a small parameter α that enhances feature learning (Mei et al., 2018; Chizat et al., 2019; Geiger et al., 2020b; Yang & Hu, 2020; Bordelon & Pehlevan, 2022). While infinite-width networks provide useful limiting cases for deep learning theory, real networks have finite width. Analysis at finite width is more difficult, since predictions depend on the initialization of parameters. While several works have attempted to analyze feature evolution and kernel statistics at large but finite width (Dyer & Gur-Ari, 2020; Roberts et al., 2021), the implications of finite width for generalization are not entirely clear. Specifically, it is unknown at what value of the training set size P the effects of finite width become relevant, what impact this critical P has on the learning curve, and how it is affected by feature learning. To identify the effects of finite width and feature learning on the deviation from infinite-width learning curves, we empirically study neural networks trained across a wide range of output scales α, widths N, and training set sizes P on the simple task of polynomial regression with a ReLU neural network.
Concretely, our experiments show the following:

• Learning curves for polynomial regression exhibit significant finite-width effects very early, around P ∼ √N. Finite-width NNs at large α are always outperformed by their infinite-width counterparts. We show this gap is driven primarily by variance of the predictor over initializations (Geiger et al., 2020a). Following prior work (Bahri et al., 2021), we refer to this as the variance-limited regime. We compare three distinct ensembling methods to reduce error in this regime.

• Feature-learning NNs show improved generalization both before and after the transition to the variance-limited regime. Feature learning can be enhanced by re-scaling the output of the network by a small scalar α or by training on a more complex task (a higher-degree polynomial). We show that alignment between the final NTK and the target function on test data improves with feature learning and sample size.

• We demonstrate that the learning curve for the NN is well-captured by the learning curve for kernel regression with the final empirical NTK, eNTK_f, as has been observed in other works (Vyas et al., 2022; Geiger et al., 2020b; Atanasov et al., 2021; Wei et al., 2022).

• Using this correspondence between the NN and the final NTK, we provide a cursory account of how fluctuations in the final NTK over random initializations are suppressed at large width N and large feature-learning strength. In a toy model, we reproduce several scaling phenomena, including the P ∼ √N transition and the improvements due to feature learning through an alignment effect. We validate that these effects qualitatively persist in the realistic setting of wide ResNets (Zagoruyko & Komodakis, 2017) trained on CIFAR in appendix E.
Overall, our results indicate that finite-width corrections to generalization in neural networks become relevant when the scale of the variance of kernel fluctuations becomes comparable to the bias component of the generalization error in the bias-variance decomposition. The variance contribution to generalization error can be reduced both through ensemble averaging and through feature learning, which we show promotes higher alignment between the final kernel and the task. We construct a model of noisy random features which reproduces the essential aspects of our observations.

1.1 RELATED WORKS

Geiger et al. (2020a) analyzed the scaling of network generalization with the number of model parameters. Since the NTK fluctuates with variance O(1/N) for a width-N network (Dyer & Gur-Ari, 2020; Roberts et al., 2021), they find that finite-width networks in the lazy regime generically perform worse than their infinite-width counterparts. The scaling laws of networks over varying N and P were also studied, both empirically and theoretically, by Bahri et al. (2021). They consider two types of learning-curve scalings. First, they describe resolution-limited scaling, where either training set size or width is effectively infinite and the scaling behavior of generalization error with the other quantity is studied. There, the scaling laws can be obtained from the theory in Bordelon et al. (2020). Second, they analyze variance-limited scaling, where width or training set size is fixed to a finite value and the other parameter is taken to infinity. While that work showed for any fixed P that the learning curve converges to the infinite-width curve as O(1/N), these asymptotics do not predict, for fixed N, at which value of P the NN learning curve begins to deviate from the infinite-width theory. This is the focus of our work. The contrast between rich and lazy networks has been empirically studied in several prior works.
Depending on the structure of the task, the lazy regime can have either worse (Fort et al., 2020) or better (Ortiz-Jiménez et al., 2021; Geiger et al., 2020b) performance than the feature-learning regime. For our setting, where the signal depends on only a small number of relevant input directions, we expect representation learning to be useful, as discussed in (Ghorbani et al., 2020; Paccolat et al., 2021b). Consequently, we posit and verify that the rich network will outperform the lazy one. Our toy model is inspired by the literature on random feature models. Analyses of generalization for two-layer networks at initialization in the limit of high-dimensional data have been carried out using techniques from random matrix theory (Mei & Montanari, 2022; Hu & Lu, 2020; Adlam & Pennington, 2020a; Dhifallah & Lu, 2020; Adlam & Pennington, 2020b) and statistical mechanics (Gerace et al., 2020; d'Ascoli et al., 2020). Several of these works have identified that when N is comparable to P, the network generalization error has a contribution from variance over initial parameters. Further, they provide a theoretical explanation of the benefit of ensembling predictions of many networks trained with different initial parameters. Recently, Ba et al. (2022) studied regression with the hidden features of a two-layer network after one step of gradient descent, finding significant improvements to the learning curve due to feature learning. Zavatone-Veth et al. (2022) analyzed linear regression for Bayesian deep linear networks with width N comparable to sample size P and demonstrated the advantage of training multiple layers compared to training only the last layer, finding that the feature-learning advantage has a leading correction of order (P/N)^2 at small P/N.

2. PROBLEM SETUP AND NOTATION

We consider a supervised task with a dataset D = {x^µ, y^µ}_{µ=1}^P of size P. The pairs of data points are drawn from a population distribution p(x, y). Our experiments will focus on training networks to interpolate degree-k polynomials on the sphere (full details in Appendix A). For this task, the infinite-width network learning curves can be found analytically. In particular, at large P the generalization error scales as 1/P^2 (Bordelon et al., 2020). We take a single-output feed-forward NN f̃_θ : R^D → R with hidden width N for each layer. We let θ denote all trainable parameters of the network. Using NTK parameterization (Jacot et al., 2018), the activations for an input x are given by

h_i^(ℓ) = (σ/√N) Σ_{j=1}^N W_ij^(ℓ) φ(h_j^(ℓ-1)),  ℓ = 2, ..., L,    h_i^(1) = (σ/√D) Σ_{j=1}^D W_ij^(1) x_j.

Here, the output of the network is f̃_θ = h_1^(L). We take φ to be a positively homogeneous function, in our case a ReLU nonlinearity, but this is not strictly necessary (Appendix C.2). At initialization we have W_ij ∼ N(0, 1). Consequently, the scale of the output at initialization is O(σ^L). As a consequence of the positive homogeneity of the network, the scale of the output is given by α = σ^L. α controls the feature-learning strength of a given NN. Large α corresponds to a lazy network while small α yields a rich network with feature movement. More details on how α controls feature learning are given in Appendices C.1 and C.2. In what follows, we will denote the infinite-width NTK limit of this network by NTK_∞. We will denote its finite-width linearization by eNTK_0(x, x') := Σ_θ ∂_θ f(x) ∂_θ f(x')|_{θ=θ_0}, and we will denote its linearization around its final parameters θ_f by eNTK_f(x, x') := Σ_θ ∂_θ f(x) ∂_θ f(x')|_{θ=θ_f}. Following other authors (Chizat et al., 2019; Adlam & Pennington, 2020a), we will take the output to be f_θ(x) := f̃_θ(x) − f̃_{θ_0}(x). Thus, at initialization the function output is 0. We explain this choice further in Appendix A.
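As a concrete illustration of the eNTK definition above, the following numpy sketch assembles eNTK_0 for a toy two-layer ReLU network in NTK parameterization by forming the parameter Jacobian explicitly. (Real experiments would use automatic differentiation, e.g. the empirical-NTK utilities in Neural Tangents; all function and variable names here are illustrative.)

```python
import numpy as np

def empirical_ntk(X, W1, W2, sigma=1.0):
    """eNTK(x, x') = sum_theta df(x)/dtheta * df(x')/dtheta for the two-layer
    net f(x) = (sigma/sqrt(N)) * W2 . relu((sigma/sqrt(D)) * W1 @ x)."""
    P, D = X.shape
    N = W1.shape[0]
    pre = (sigma / np.sqrt(D)) * X @ W1.T              # preactivations, shape (P, N)
    h = np.maximum(pre, 0.0)                           # ReLU features
    J2 = (sigma / np.sqrt(N)) * h                      # df/dW2, shape (P, N)
    # df/dW1[j, d] = (sigma/sqrt(N)) * W2[j] * relu'(pre_j) * (sigma/sqrt(D)) * x_d
    gate = (pre > 0).astype(X.dtype) * W2[None, :]     # shape (P, N)
    J1 = (sigma**2 / np.sqrt(N * D)) * gate[:, :, None] * X[:, None, :]
    # Gram matrix of the full parameter Jacobian
    return J2 @ J2.T + np.einsum('pnd,qnd->pq', J1, J1)

rng = np.random.default_rng(0)
D, N, P = 10, 512, 6
X = rng.standard_normal((P, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # points on the unit sphere
K0 = empirical_ntk(X, rng.standard_normal((N, D)), rng.standard_normal(N))
```

Because K0 is a Jacobian Gram matrix, it is symmetric and positive semi-definite by construction, regardless of the initialization draw.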
The parameters are then trained with full-batch gradient descent on a mean squared error loss. We denote the final network function starting from initialization θ_0 on a dataset D by f*_{θ_0,D}(x), or f* for short. The generalization error is calculated using a held-out test set and approximates the population risk E_g(f) := ⟨(f(x) − y)^2⟩_{x,y∼p(x,y)}.

3. EMPIRICAL RESULTS

In this section, we will study learning curves for ReLU NNs trained on polynomial regression tasks of varying degrees. We take our task to be learning y = Q_k(β · x), where β is a random vector of norm 1/D and Q_k is the k-th Gegenbauer polynomial. We will establish the following key observations, which we will set out to theoretically explain in Section 4.

1. Both eNTK_0 and sufficiently lazy networks perform strictly worse than NTK_∞, but the ensembled predictors approach the NTK_∞ test error.

2. NNs in the feature-learning regime of small α can outperform NTK_∞ for an intermediate range of P. Over this range, the effect of ensembling is less notable.

3. Even richly trained finite-width NNs eventually perform worse than NTK_∞ at sufficiently large P. However, these small-α feature-learning networks become variance-limited at larger P than lazy networks. Once in the variance-limited regime, all networks benefit from ensembling over initializations.

4. For all networks, the transition to the variance-limited regime begins at a P* that scales sub-linearly with N. For polynomial regression, we find P* ∼ √N.

These findings support our hypothesis that finite width introduces variance in eNTK_0 over initializations, which ultimately leads to variance in the learned predictor and higher generalization error. Although we primarily focus on polynomial interpolation tasks in this paper, in Appendix F we provide results for wide ResNets trained on CIFAR and observe that rich networks also outperform lazy ones, and that lazy ones benefit more significantly from ensembling. The regression for NTK_∞ was calculated using the Neural Tangents package (Novak et al., 2020). The exact scaling of NTK_∞ is known to go asymptotically as P^{-2} for this task.

Figure 1: (a) Lazy networks perform strictly worse than NTK_∞, while rich networks can outperform it for an intermediate range of P before their performance is also limited. (b) Ensembling 20 networks substantially improves lazy network and eNTK_0 generalization, as well as asymptotic rich-network generalization. This indicates that at sufficiently large P, these neural networks become limited by variance due to initialization. The error bars in (a) and (c) denote the variance due to both training set and initialization. The error bars in (b) and (d) denote the variance due to the training set.
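The ensembling diagnostic used throughout — comparing the error of the initialization-averaged predictor f̄_D to the average single-network error — can be sketched as follows, with synthetic predictions standing in for trained networks (all names illustrative):

```python
import numpy as np

def variance_fraction(preds, y):
    """preds: (E, T) predictions of E independently initialized networks on T
    test points; y: (T,) targets. Returns the average single-network error,
    the error of the ensemble-averaged predictor, and the fraction of the
    error removed by ensembling (the initialization-variance share)."""
    eg_single = np.mean((preds - y[None, :]) ** 2)   # average over inits and test points
    f_bar = preds.mean(axis=0)                       # ensemble-averaged predictor
    eg_ens = np.mean((f_bar - y) ** 2)               # approximates the bias term
    return eg_single, eg_ens, 1.0 - eg_ens / eg_single

# synthetic check: predictor = target + independent noise per initialization
rng = np.random.default_rng(1)
E, T = 20, 2000
y = rng.standard_normal(T)
preds = y[None, :] + 0.5 * rng.standard_normal((E, T))
single, ens, frac = variance_fraction(preds, y)
```

With purely initialization-driven noise of variance 0.25, averaging E = 20 predictors suppresses the error by roughly a factor of E, so nearly all of the single-network error is attributed to initialization variance.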

3.1. FINITE WIDTH EFFECTS CAUSE THE DEVIATION FROM INFINITE-WIDTH LEARNING CURVES

In this section, we first investigate how finite-width NN learning curves differ from infinite-width NTK regression. In Figure 1 we show the generalization error E_g(f*_{θ_0,D}) for a depth-3 network with width N = 1000 trained on a quadratic (k = 2) and quartic (k = 4) polynomial regression task. Additional plots for other degree polynomials are provided in Appendix F. We sweep over P to show the effect of more data on generalization, which is the main relationship we are interested in studying. For each training set size we sweep over a grid of 20 random draws of the training set and 20 random network initializations, giving 400 trained networks in total at each choice of P, k, N, α. We see that a discrepancy arises at large enough P, where the neural networks begin to perform worse than NTK_∞. We probe the source of the discrepancy between finite-width NNs and NTK_∞ by ensemble averaging network predictions f̄_D(x) := ⟨f*_{θ_0,D}(x)⟩_{θ_0} over E = 20 initializations θ_0. In Figures 1b and 1d, we calculate the error of f̄_D(x), with each ensemble trained on the same dataset, and plot E_g(f̄_D). This ensembled error approximates the bias in a bias-variance decomposition (Appendix B). Thus, any gap between Figures 1(a) and 1(b) is driven by variance of f*_{θ_0,D} over θ_0. We sharpen these observations with phase plots of NN generalization, variance, and kernel alignment over P, α, as shown in Figure 2. In Figure 2a, NNs in the rich regime (small α) have lower final E_g than lazy networks. As the dataset grows, the fraction of E_g due to initialization variance (that is, the fraction removed by ensembling) strictly increases (Figure 2(b)). We will show why this effect occurs in Section 3.2. Figure 2b shows that, at any fixed P, the variance is lower for small α. To measure the impact of feature learning on the eNTK_f, we plot its alignment with the target function, measured as y^⊤ K y / (|y|^2 Tr K) for a test set of targets [y]_µ and kernel [K]_{µν} = eNTK_f(x^µ, x^ν).
Alignment of the kernel with the target function is known to be related to good generalization (Canatar et al., 2021). In Section 4, we revisit these effects in a simple model which relates kernel alignment and variance reduction. In addition to initialization variance, variance over the dataset D contributes to the total generalization error. Following Adlam & Pennington (2020b), we discuss a symmetric decomposition of the variance in Appendix B, showing the contribution from dataset variance and the effects of bagging. We find that most of the variance in our experiments is due to initialization. We show several other plots of the results of these studies in the appendix: the effect of bagging (Figure 7), phase plots for different degree target functions (Figures 9, 10), phase plots over N, α (Figure 11), and a comparison of network predictions against the initial and final kernel regressors (Figures 18, 19).
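The alignment metric y^⊤Ky / (|y|^2 Tr K) is straightforward to compute; a minimal sketch with synthetic kernels (illustrative, not the paper's code):

```python
import numpy as np

def kernel_alignment(K, y):
    """Target-kernel alignment y^T K y / (|y|^2 Tr K) on a test set."""
    return float(y @ K @ y) / (float(y @ y) * np.trace(K))

rng = np.random.default_rng(2)
T = 50
y = rng.standard_normal(T)
y /= np.linalg.norm(y)                          # unit-norm target vector

K_iso = np.eye(T)                               # isotropic kernel: alignment 1/T
K_aligned = 10.0 * np.outer(y, y) + np.eye(T)   # strong y y^T component, as in a rich kernel
```

For the isotropic kernel the alignment is exactly 1/T = 0.02, while the kernel with a large component along yy^⊤ reaches (10 + 1)/(10 + T) = 11/60, mimicking the alignment growth observed for feature-learning networks.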

3.2. FINAL NTK VARIANCE LEADS TO GENERALIZATION PLATEAU

In this section, we show how the variance over initialization can be interpreted as kernel variance in both the rich and lazy regimes. We also show how this implies a plateau for the generalization error. To begin, we demonstrate empirically that all networks have the same generalization error as kernel regression solutions with their final eNTKs. At large α, the initial and the final kernel are already close, so this follows from earlier results of Chizat et al. (2019). In the rich regime, the properties of the eNTK_f have been studied in several prior works. Several have empirically demonstrated that the eNTK_f is a good match to the final network predictor for a trained network (Long, 2021; Vyas et al., 2022; Wei et al., 2022), while others have given conditions under which such an effect would hold true (Atanasov et al., 2021; Bordelon & Pehlevan, 2022). We comment on this in Appendix C.4. We show in Figure 3 how the final network generalization error matches the generalization error of eNTK_f across N and α. As a consequence, we can use eNTK_f to study the observed generalization behavior. Next, we relate the variance of the final predictor f*_{θ_0,D} to the corresponding infinite-width network f^∞_D. The finite-size fluctuations of the kernel at initialization have been studied in (Dyer & Gur-Ari, 2020; Hanin & Nica, 2019; Roberts et al., 2021). The variance of the kernel elements has been shown to scale as 1/N. We perform the following bias-variance decomposition: take f_{θ_0,D} to be the eNTK_0 predictor, or a sufficiently lazy network trained to interpolation on a dataset D. Then,

⟨(f*_{θ_0,D}(x) − y)^2⟩_{θ_0,D,x,y} = ⟨(f^∞_D(x) − y)^2⟩_{D,x,y} + O(1/N).   (2)

We demonstrate this equality using a relationship between the infinite-width network and an infinite ensemble of finite-width networks derived in Appendix B.
There we also show that the O(1/N) term is strictly positive for sufficiently large N. Thus, for lazy networks of sufficiently large N, finite-width effects lead to strictly worse generalization error. The decomposition in Equation 2 continues to hold for rich networks at small α if f^∞ is interpreted as the infinite-width mean-field limit. In this case, one can show that ensembles of rich networks approximate an infinite-width network in the mean-field regime. See Appendix B for details.

3.3. FEATURE LEARNING DELAYS VARIANCE LIMITED TRANSITION

We now consider how feature learning alters the onset of the variance-limited regime, and how this onset scales with α and N. We define the onset of the variance-limited regime to take place at the value P* = P_{1/2} where over half of the generalization error is due to variance over initializations; equivalently, E_g(f̄*)/E_g(f*) = 1/2. By using an interpolation method together with bisection, we solve for P_{1/2} and plot it in Figure 4. Figure 4b shows that P_{1/2} scales as √N for this task. In the next section, we show that this scaling is governed by the fact that P_{1/2} is close to the value where the infinite-width generalization curve E_g^∞ equals the variance of eNTK_f. In this case the quantities to compare are E_g^∞ ≈ P^{-2} and Var eNTK_f ≈ N^{-1}. We can understand the delay of the variance-limited transition, as well as the lower value of the final plateau, using a mechanistic picture similar to the effect observed in Atanasov et al. (2021).

Figure 5: We analyze these ensembling methods for a k = 1 task with a width N = 100 ReLU network. (a) While all ensembling methods improve generalization, averaging either the kernel ⟨K⟩ or the features ⟨ψ⟩ gives a better improvement than averaging the output function ⟨f⟩. Computing final kernels for many richly trained networks and performing regression with this averaged kernel gives the best performance. (b) We plot the relative error of each ensembling method against the single-initialization neural network. The gap between ensembling and the single-initialization NN becomes evident for sufficiently large P ∼ P_{1/2}. For small α, all ensembling methods perform comparably, while for large α ensembling the kernel or features gives much lower E_g than averaging the predictors.
In that setting, under small initialization, the kernel follows a deterministic trajectory, picking up a low-rank component in the direction of the train-set targets yy^⊤, and then changing only in scale as the network weights grow to interpolate the dataset. In their case, for initial output scale σ^L, eNTK_f is deterministic up to a variance of O(σ). In our case, the kernel variance at initialization scales as σ^{2L}/N. As σ → 0, the kernel's trajectory becomes deterministic up to a variance term of O(σ), which implies that the final predictor also has variance of O(σ).
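The interpolation-plus-bisection procedure for locating P_{1/2} can be sketched as follows, here applied to synthetic bias/variance curves with a known crossing (names and the use of scipy's `brentq` are illustrative; the paper's exact numerics may differ):

```python
import numpy as np
from scipy.optimize import brentq

def p_half(P_grid, eg_single, eg_ensemble):
    """Solve E_g(ensemble)/E_g(single) = 1/2 by log-log interpolation of the
    measured learning curves followed by root finding (bisection-style)."""
    ratio = np.log(eg_ensemble / eg_single) - np.log(0.5)   # zero at the half-variance point
    f = lambda logP: np.interp(logP, np.log(P_grid), ratio)
    return np.exp(brentq(f, np.log(P_grid[0]), np.log(P_grid[-1])))

# synthetic curves: bias ~ P^-2 with a variance plateau of 1e-4, so the
# single-network error is bias + var and the ensembled error is the bias alone
P = np.logspace(1, 4, 30)
bias = P ** -2.0
var = 1e-4 * np.ones_like(P)
P_half_est = p_half(P, eg_single=bias + var, eg_ensemble=bias)
```

For these curves the ratio hits 1/2 exactly where bias = variance, i.e. near P = 100, which the interpolated root recovers.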

4. SIGNAL PLUS NOISE CORRELATED FEATURE MODEL

In Section 3.2 we have shown that in both the rich and lazy regimes, the generalization error of the NN is well approximated by the generalization of a kernel regression solution with eNTK_f. This finding motivates an analysis of the generalization of kernel machines which depend on the network initialization θ_0. Unlike many analyses of random feature models which specialize to two-layer networks and focus on high-dimensional Gaussian random data (Mei & Montanari, 2022; Adlam & Pennington, 2020a; Gerace et al., 2020; Ba et al., 2022), we propose to analyze regression with the eNTK_f for more general feature structures. This work builds on the kernel generalization theory developed with statistical mechanics (Bordelon et al., 2020; Canatar et al., 2021; Simon et al., 2021; Loureiro et al., 2021). We will attempt to derive approximate learning curves in terms of the eNTK_f's signal and noise components, which provide some phenomenological explanations of the onset of the variance-limited regime and the benefits of feature learning. Starting with the final NTK K_{θ_0}(x, x'), which depends on the random initial parameters θ_0, we project its square root K^{1/2}_{θ_0}(x, x') (as defined in equation 32) onto a fixed basis {b_k(x)}_{k=1}^∞ orthonormal with respect to p(x). This defines a feature map

ψ_k(x, θ_0) = ∫ dx' p(x') K^{1/2}_{θ_0}(x, x') b_k(x'),   k ∈ {1, ..., ∞}.   (3)

The kernel can be reconstructed from these features: K_{θ_0}(x, x') = Σ_k ψ_k(x, θ_0) ψ_k(x', θ_0). The kernel interpolation problem can be solved by performing linear regression with the features ψ(x, θ_0). Here,

w(θ_0) = lim_{λ→0} argmin_w Σ_{µ=1}^P [w · ψ(x^µ, θ_0) − y^µ]^2 + λ|w|^2.

The learned function f(x, θ_0) = w(θ_0) · ψ(x, θ_0) is the minimum-norm interpolator for the kernel K(x, x'; θ_0) and matches the neural network learning curve, as seen in Section 3.2.
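A finite-sample analogue of this construction: build features from the symmetric square root of a kernel Gram matrix and verify that (near-)ridgeless linear regression in those features reproduces kernel interpolation. The polynomial kernel below is only a stand-in for eNTK_f, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
P, P_test = 40, 10
X = rng.standard_normal((P + P_test, 5))
K = (1.0 + X @ X.T) ** 2                          # stand-in kernel for eNTK_f
y = rng.standard_normal(P)

# Features from the symmetric square root of K: K = psi @ psi.T
evals, evecs = np.linalg.eigh(K)
psi = evecs * np.sqrt(np.clip(evals, 0.0, None))

lam = 1e-6                                        # small ridge, approximating lambda -> 0
# linear regression in feature space
w = np.linalg.solve(psi[:P].T @ psi[:P] + lam * np.eye(psi.shape[1]),
                    psi[:P].T @ y)
f_feat = psi[P:] @ w

# direct kernel interpolation with the same kernel
coef = np.linalg.solve(K[:P, :P] + lam * np.eye(P), y)
f_kern = K[P:, :P] @ coef
```

The two predictors coincide because ψ_test (ψ^⊤ψ + λI)^{-1} ψ^⊤ = ψ_test ψ^⊤ (ψψ^⊤ + λI)^{-1} for any λ > 0; in the λ → 0 limit this is exactly the minimum-norm kernel interpolator.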
In general, since the rank of K is finite for a finite-size network, the ψ_k(x, θ_0) have a correlation matrix of finite rank N_H. Since the target function y does not depend on the initialization θ_0, we decompose it in terms of a fixed set of features ψ_M(x) ∈ R^M (for example, the first M basis functions {b_k}_{k=1}^M). In this random feature model, one can interpret the initialization-dependent fluctuations in K(x, x'; θ_0) as generating fluctuations in the features ψ(x, θ_0), which in turn induce fluctuations in the learned network predictor f(x, θ_0). To illustrate the relative improvements to generalization from denoising these three different objects, in Figure 5 we compare averaging the final kernel K, averaging the induced features ψ, and averaging network predictions f directly. For all α, all ensembling methods provide improvements over training a single NN. However, we find that averaging the kernel directly and performing regression with this averaged kernel exhibits the largest reduction in generalization error. Averaging features performs comparably. Ensemble averaging the network predictors does not perform as well as either of these other two methods. The gap between ensembling methods is more significant in the lazy regime (large α) and is negligible in the rich regime (small α).

4.1. TOY MODELS AND APPROXIMATE LEARNING CURVES

To gain insight into the role of feature noise, we characterize the test error of a Gaussian covariate model in a high-dimensional limit P, M, N_H → ∞ with α = P/M, η = N_H/M:

y = (1/√M) ψ_M · w*,   f = (1/√M) ψ · w,   ψ = A(θ_0) ψ_M + ε,   (ψ_M, ε) ∼ N(0, diag(Σ_M, Σ_ε)).   (4)

This model was also studied by Loureiro et al. (2021) and subsumes the classic two-layer random feature models of prior works (Hu & Lu, 2020; Adlam & Pennington, 2020a; Mei & Montanari, 2022). The expected generalization error for any distribution of A(θ_0) has the form

E_{θ_0} E_g(θ_0) = E_A [ (1/(1−γ)) (1/M) w*^⊤ Σ_M^{1/2} ( I − q Σ_s^{1/2} A^⊤ G A Σ_s^{1/2} − q Σ_s^{1/2} A^⊤ G^2 A Σ_s^{1/2} ) Σ_M^{1/2} w* ],

G = (I + q A Σ_M A^⊤ + q Σ_ε)^{-1},   q = α/(λ + q̄),   q̄ = Tr G [A Σ_M A^⊤ + Σ_ε],

where α = P/M and γ = (α/(λ + q̄)^2) Tr G^2 [A Σ_M A^⊤ + Σ_ε]^2. Details of the calculation can be found in Appendix D. We also provide experiments showing the predictive accuracy of the theory in Figure 6. In general, we do not know the induced distribution of A(θ_0) over the disorder θ_0. In Appendix D.5, we compute explicit learning curves for a simple toy model where A(θ_0)'s entries are i.i.d. Gaussian over the random initialization θ_0. A similar random feature model was recently analyzed with diagrammatic techniques by Maloney et al. (2022). In the high-dimensional limit M, P, N_H → ∞ with P/M = α, N_H/M = η, our replica calculation demonstrates that the test error is self-averaging (the same for every random instance of A), which we describe in Appendix D.5 and Figure 16.
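A Monte-Carlo sketch of the noisy-feature model in equation 4, with Σ_M = I and isotropic noise Σ_ε = σ_ε^2 I (a simplification of the general model; all names and parameter choices are illustrative):

```python
import numpy as np

def toy_error(P, M, NH, noise, n_trials=10, lam=1e-8, n_test=500, seed=0):
    """Near-ridgeless regression with noisy features psi = A psi_M + eps,
    Sigma_M = I, Sigma_eps = noise^2 I; returns mean test error over trials."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        A = rng.standard_normal((NH, M)) / np.sqrt(M)       # random "feature map"
        w_star = rng.standard_normal(M)                     # target weights
        psi_M = rng.standard_normal((P + n_test, M))        # latent covariates
        psi = psi_M @ A.T + noise * rng.standard_normal((P + n_test, NH))
        y = psi_M @ w_star / np.sqrt(M)
        w = np.linalg.solve(psi[:P].T @ psi[:P] + lam * np.eye(NH),
                            psi[:P].T @ y[:P])
        errs.append(np.mean((psi[P:] @ w - y[P:]) ** 2))
    return float(np.mean(errs))

err_clean = toy_error(P=200, M=50, NH=50, noise=0.0)   # target realizable: error ~ 0
err_noisy = toy_error(P=200, M=50, NH=50, noise=0.3)   # feature noise induces an error floor
```

With noise-free features and P larger than the feature rank the target is fit essentially exactly, while feature noise produces a persistent error floor, mirroring how eNTK fluctuations limit finite-width networks at large P.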

4.2. EXPLAINING FEATURE LEARNING BENEFITS AND ERROR PLATEAUS

Using this theory, we can attempt to explain some of the observed phenomena associated with the onset of the variance-limited regime. First, we note that the kernels exhibit fluctuations over initialization with variance O(1/N), in either the lazy or the rich regime. In Figure 6(a), we show learning curves for networks of different widths in the lazy regime. Small-width networks enter the variance-limited regime earlier and have higher error. Similarly, if we increase the scale of the noise Σ_ε = σ_ε^2 A Σ_M A^⊤ in our toy model, the corresponding transition point P_{1/2} is smaller and the asymptotic error is higher. In Figure 6(c), we show that our theory also predicts the onset of the variance-limited regime at P_{1/2} ∼ √N if σ_ε^2 ∼ 1/N. We stress that this scaling is a consequence of the structure of the task. Since the target function is an eigenfunction of the kernel, the infinite-width error goes as 1/P^2 (Bordelon et al., 2020). Since the variance scales as 1/N, bias and variance become comparable at P ∼ √N. Realistic tasks often exhibit power-law decays E_g^{N=∞} ∼ P^{-β} with β < 2 (Spigler et al., 2020; Bahri et al., 2021), for which we would expect a transition around P_{1/2} ∼ N^{1/β}. Using our model, we can also approximate the role of feature learning as an enhancement of the signal correlation along task-relevant eigenfunctions. In Figure 6(d) we plot the learning curves for networks trained with different levels of feature learning, controlled by α. We see that feature learning leads to improvements in the learning curve both before and after the onset of the variance-limited regime. In Figure 6(e)-(f), we plot the theoretical generalization for kernels with enhanced signal eigenvalue for the task eigenfunction y(x) = ϕ_k(x). This enhancement, based on the intuition of kernel alignment, leads to lower bias and lower asymptotic variance.
However, this model does not capture the fact that feature-learning advantages are small at small P, nor that the slopes of the learning curves differ across α. Following the observation of Paccolat et al. (2021a) that kernel alignment can grow with scale √P, we plot the learning curves for signal enhancements that scale as √P. Though this toy model reproduces the onset of the variance-limited regime at P_{1/2} and the reduction in variance due to feature learning, our current result is not the complete story. A more refined future theory could use the structure of the neural architecture to constrain the structure of the A distribution.
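The scaling argument in this section reduces to equating the bias term P^{-β} with the variance plateau 1/N; a minimal worked sketch with unit prefactors (an assumption — real learning curves carry task-dependent constants):

```python
import numpy as np

def p_star(N, beta, c_bias=1.0, c_var=1.0):
    """Sample size where the bias term c_bias * P**-beta equals the variance
    plateau c_var / N, i.e. P* = (c_bias * N / c_var)**(1/beta)."""
    return (c_bias * N / c_var) ** (1.0 / beta)

# beta = 2 for an eigenfunction target (E_g ~ P^-2): P* grows like sqrt(N),
# so quadrupling the width should double the transition point
widths = np.array([100, 400, 1600, 6400])
ratios = p_star(4 * widths, 2.0) / p_star(widths, 2.0)
```

For β = 2 this gives P* = √N (e.g. N = 10,000 yields P* = 100), while a slower power law β = 1 gives the linear scaling P* = N.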

5. CONCLUSION

We performed an extensive empirical study of deep ReLU NNs learning fairly simple polynomial regression problems. For sufficiently large dataset size P, all neural networks under-perform the infinite-width limit, and we demonstrated that this worse performance is driven by initialization variance. We showed that the onset of the variance-limited regime can occur early in the learning curve, with P_{1/2} ∼ √N, but that this onset can be delayed by enhancing feature learning. Finally, we studied a simple random-feature model to explain these effects, qualitatively reproducing the observed behavior and quantitatively reproducing the relevant scaling relationship for P_{1/2}. This work takes a step towards understanding scaling laws in regimes where finite-size networks undergo feature learning. It has implications for how the choice of initialization scale, neural architecture, and number of networks in an ensemble can be tuned to achieve optimal performance under a fixed compute and data budget.

A DETAILS ON EXPERIMENTS

We generated the dataset D = {x^µ, y^µ}_{µ=1}^P by sampling x^µ uniformly on S^{D-1}, the unit sphere in R^D. The raw target ỹ was then generated as a Gegenbauer polynomial of degree k of a 1D projection of x, ỹ = Q_k(β · x). Because the scale of the output of the neural network relative to the target is a central quantity in this work, it is especially important to make sure the target is appropriately scaled to unit norm. We did this by defining the target to be y = ỹ / √⟨Q_k(β · x)^2⟩_{x∼S^{D-1}}. The denominator can be easily and accurately approximated by Monte Carlo sampling. We used JAX (Bradbury et al., 2018) for all neural network training. We built multi-layer perceptrons (MLPs) of depth 2 and 3. Most of the results are reported for depth-3 perceptrons, where there is a separation between the width of the network N and the number of parameters N^2. Sweeping over more depths and architectures is possible, but because of the extensive dimensionality of the hyperparameter search space, we have not yet experimented with deeper networks. We considered MLPs with no bias terms. Since the Gegenbauer polynomials are mean zero, we do not need biases to fit the training set and generalize well. We have also verified that adding trainable biases does not change the final results in any substantial way. As mentioned in the main text, we consider the final output function to be the network output minus the output at initialization: f_θ(x) = f̃_θ(x) − f̃_{θ_0}(x). Here, only θ is differentiated through, while θ_0 is held fixed. The rationale for this choice is that without this subtraction, in the lazy limit the trained neural network output can be written as

f*_θ(x) = f̃_{θ_0}(x) + Σ_{µν} k_µ(x) [K^{-1}]_{µν} (y^ν − f̃_{θ_0}(x^ν)).

This is the same as doing eNTK_0 regression on the shifted targets y^µ − f̃_{θ_0}(x^µ). At large initialization, the shift f̃_{θ_0}(x) amounts to adding random, initialization-dependent noise to the targets.
By instead performing the subtraction, the lazy limit can be interpreted as a kernel regression on the targets themselves, which is preferable. We trained this network with full-batch gradient descent with a learning rate η, so that

∆θ = −η ∇_θ L(D, θ),   L(D, θ) := (1/P) Σ_{µ=1}^P (f_θ(x^µ) − y^µ)^2.

Each network was trained to an interpolation threshold of 10^{-6}. If a network could not reach this threshold in under 30k steps, we checked whether the training error was less than 10 times the generalization error. If this was not satisfied, that run of the network was discarded. For each fixed P, k, we generated 20 independent datasets. For each fixed N, α, we generated 20 independent neural network initializations. This 20 × 20 grid yields a total of 400 neural networks trained on every combination of initialization and dataset choice. The infinite-width network predictions were calculated using the Neural Tangents package (Novak et al., 2020). The finite-width eNTK_0's were also calculated using the empirical methods in Neural Tangents. They were trained to interpolation using the gradient descent MSE method, which is substantially faster than training the linearized model using standard full-batch gradient descent; we have found the latter to take a very long time for most networks. We use the same strategy for the eNTK_f's. For the experiments in the main text, we have taken the input dimension to be D = 10 and sweep over k = 1, 2, 3, 4. We swept over 15 values of P in logspace from 30 to 10k, and over 6 values of N in logspace from 30 to 2150. We then swept over α values 0.1, 0.5, 1.0, 10.0, 20.0. Depending on α and N, we tuned the learning rate η to be small enough to stay close to the gradient-flow limit while still allowing the interpolation threshold to be feasibly reached.
For each of the 1800 settings of $P, N, \alpha, k$, and for each of the 400 networks, 400 eNTK$_0$s, 400 eNTK$_f$s, and 20 NTK$_\infty$s, we saved the generalization error as well as a vector of predictions $\hat y$ on a test set of 2000 points. In addition, for the neural networks we saved both the initial and final parameters. All results are stored as lists of numpy arrays in a directory of about 1 TB. We plan to make the results of our experiments publicly accessible, alongside the code to generate them.
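The dataset construction described above can be sketched in a few lines of numpy. The recurrence-based Gegenbauer evaluation, the Monte Carlo sample count, and all names below are our own illustrative choices rather than the exact experiment code; for $S^{D-1}$ the relevant Gegenbauer parameter is $(D-2)/2$.

```python
import numpy as np

def gegenbauer(k, a, z):
    """Evaluate the Gegenbauer polynomial C_k^{(a)}(z) by the standard
    three-term recurrence: n C_n = 2(n+a-1) z C_{n-1} - (n+2a-2) C_{n-2}."""
    c_prev, c = np.ones_like(z), 2 * a * z
    if k == 0:
        return c_prev
    for n in range(2, k + 1):
        c_prev, c = c, (2 * (n + a - 1) * z * c - (n + 2 * a - 2) * c_prev) / n
    return c

def make_dataset(P, D, k, rng, n_mc=100_000):
    """Sample x uniformly on S^{D-1}; the target is a degree-k Gegenbauer
    polynomial of a 1D projection, normalized to unit second moment by
    Monte Carlo, as in the text."""
    x = rng.standard_normal((P, D))
    x /= np.linalg.norm(x, axis=1, keepdims=True)       # uniform on the sphere
    beta = rng.standard_normal(D)
    beta /= np.linalg.norm(beta)
    a = (D - 2) / 2                                     # Gegenbauer index for S^{D-1}
    y_raw = gegenbauer(k, a, x @ beta)
    x_mc = rng.standard_normal((n_mc, D))
    x_mc /= np.linalg.norm(x_mc, axis=1, keepdims=True)
    norm2 = np.mean(gegenbauer(k, a, x_mc @ beta) ** 2) # estimate <Q_k(beta.x)^2>
    return x, y_raw / np.sqrt(norm2)
```

In practice the Monte Carlo estimate of the normalization converges quickly, so the resulting targets have second moment very close to 1.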

A.1 CIFAR EXPERIMENTS

We apply the same methodology of centering the network and allowing $\alpha$ to control the degree of laziness by redefining $f_\theta(x) = \alpha(\tilde f_\theta(x) - \tilde f_{\theta_0}(x))$. We consider the task of binary classification on CIFAR-10. To allow $P$ to become large, we divide the data into two classes: animate and inanimate objects. We subsample eight classes and superclass them into two: (cat, deer, dog, horse) vs. (airplane, automobile, ship, truck), and train with initial learning rate $\eta_0 = 10^{-3}$. Every network is trained for 24,000 steps, such that under nearly all settings of $\alpha$ and dataset size the network attains a negligibly small train loss. We sweep $\alpha$ from $10^{-3}$ to $10^0$ and $P$ from $2^9$ to $2^{15}$. For each value of $P$, we randomly sample five training datasets of size $P$ and compute ensembles of size 20. For each network in an ensemble, the initialization and the order of the training data are chosen independently of those for the other networks.
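The superclassing step above can be sketched as follows, assuming the standard CIFAR-10 integer label ordering (airplane = 0, automobile = 1, bird = 2, ..., truck = 9); the function name and the $\pm 1$ target convention are our own choices.

```python
import numpy as np

# Standard CIFAR-10 class ordering (an assumption of this sketch).
ANIMATE = {3, 4, 5, 7}      # cat, deer, dog, horse
INANIMATE = {0, 1, 8, 9}    # airplane, automobile, ship, truck

def superclass(labels):
    """Keep only the eight selected classes and map them to a binary
    animate (+1) vs. inanimate (-1) target.

    Returns a boolean mask over the original labels and the binary targets
    for the kept examples."""
    labels = np.asarray(labels)
    keep = np.isin(labels, list(ANIMATE | INANIMATE))
    y = np.where(np.isin(labels[keep], list(ANIMATE)), 1.0, -1.0)
    return keep, y
```

The mask would then be used to subsample the images before drawing the five random training sets of each size $P$.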

B FINE-GRAINED BIAS-VARIANCE DECOMPOSITION B.1 FINE GRAINED DECOMPOSITION OF GENERALIZATION ERROR

Let $\mathcal{D}$ be a dataset of $(x^\mu, y^\mu)_{\mu=1}^P \sim p(x,y)$, viewed as a random variable, and let $\theta_0$ denote the initial parameters of the network, also viewed as a random variable. In the case of no label noise, following section 2.2.1 of Adlam & Pennington (2020b), we derive the symmetric decomposition of the generalization error into the variance due to initialization and the variance due to the dataset:
$$E_g(f^*_{\theta_0,\mathcal{D}}) = \langle (f^*_{\theta_0,\mathcal{D}}(x) - y)^2\rangle_{x,y} = \langle (\langle f^*_{\theta_0,\mathcal{D}}(x)\rangle_{\theta_0,\mathcal{D}} - y)^2\rangle_{x,y} + \mathbb{E}_x \operatorname{Var}_{\theta_0,\mathcal{D}} f^*_{\theta_0,\mathcal{D}}(x) = \mathrm{Bias}^2 + V_{\mathcal{D}} + V_{\theta_0} + V_{\mathcal{D},\theta_0}.$$
Here we have defined
$$\mathrm{Bias}^2 = \langle (\langle f^*_{\theta_0,\mathcal{D}}(x)\rangle_{\theta_0,\mathcal{D}} - y)^2\rangle_{x,y}, \qquad V_{\mathcal{D}} = \mathbb{E}_x \operatorname{Var}_{\mathcal{D}} \mathbb{E}_{\theta_0}[f^*_{\theta_0,\mathcal{D}}(x)\,|\,\mathcal{D}] = \mathbb{E}_x \operatorname{Var}_{\mathcal{D}} f^*_{\mathcal{D}}(x),$$
$$V_{\theta_0} = \mathbb{E}_x \operatorname{Var}_{\theta_0} \mathbb{E}_{\mathcal{D}}[f^*_{\theta_0,\mathcal{D}}(x)\,|\,\theta_0], \qquad V_{\mathcal{D},\theta_0} = \mathbb{E}_x \operatorname{Var}_{\theta_0,\mathcal{D}} f^*_{\theta_0,\mathcal{D}}(x) - V_{\theta_0} - V_{\mathcal{D}}.$$
$V_{\mathcal{D}}$ and $V_{\theta_0}$ give the components of the variance explained by variance in $\mathcal{D}$ and $\theta_0$ respectively; $V_{\mathcal{D},\theta_0}$ is the remaining variance not explained by either source alone. As in the main text, $f^*_{\mathcal{D}}(x)$ is the ensemble average of the trained predictors over initializations, and $\mathbb{E}_{\mathcal{D}}[f^*_{\theta_0,\mathcal{D}}(x)\,|\,\theta_0]$ is commonly referred to as the bagged predictor. In the next subsection we study these terms empirically.
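Given the 20 × 20 grid of trained predictors described in section A, the decomposition above can be estimated directly from a prediction tensor; a minimal numpy sketch, where the array shapes and names are our own:

```python
import numpy as np

def variance_decomposition(preds, y):
    """Decompose test error into Bias^2 + V_D + V_theta0 + V_{D,theta0}.

    preds: array of shape (n_init, n_data, n_test) of predictions f*_{theta0,D}(x).
    y:     array of shape (n_test,) of targets."""
    mean_all = preds.mean(axis=(0, 1))          # <f>_{theta0, D}
    ens = preds.mean(axis=0)                    # f*_D(x): ensemble over inits
    bag = preds.mean(axis=1)                    # E_D[f | theta0]: bagged predictor
    bias2 = np.mean((mean_all - y) ** 2)
    V_D = np.mean(ens.var(axis=0))              # variance over datasets of ensemble
    V_th = np.mean(bag.var(axis=0))             # variance over inits of bagged pred.
    V_tot = np.mean(preds.reshape(-1, preds.shape[-1]).var(axis=0))
    return bias2, V_D, V_th, V_tot - V_D - V_th
```

For a purely additive model of initialization and dataset effects, the interaction term $V_{\mathcal{D},\theta_0}$ vanishes exactly, which gives a quick sanity check of the estimator.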

B.2 EMPIRICAL STUDY OF DATASET VARIANCE

Using the network simulations, we find that the bagged predictor does not have substantially lower generalization error in the regimes of interest. This implies that most of the variance driving higher generalization error is due to variance over initializations. In Figure 7, we show phase plots of the fraction of $E_g$ arising from variance due to initialization, variance over datasets, and total variance at width 1000. These can be obtained by computing the ensembled predictor, the bagged predictor, and the ensembled-and-bagged predictor respectively.

Making use of the fact that, at leading order, the eNTK$_f$ (in either the rich or lazy regime) of a trained network has $\theta_0$-dependent fluctuations with variance $1/N$, one can write the kernel Gram matrices as
$$[K_{\theta_0}]_{\mu\nu} = [K_\infty]_{\mu\nu} + \frac{1}{\sqrt N}[\delta K_{\theta_0}]_{\mu\nu} + O(1/N), \qquad [k_{\theta_0}(x)]_\mu = [k_\infty(x)]_\mu + \frac{1}{\sqrt N}[\delta k_{\theta_0}(x)]_\mu + O(1/N).$$
Here $\delta K_{\theta_0}, \delta k_{\theta_0}$ are the leading-order fluctuations around the infinite-width kernel; with this normalization, their variance is $O(1)$ with respect to $N$. Using perturbation theory (Dyer & Gur-Ari, 2020), one can show that these leading-order terms have mean zero around the infinite-width limit. The predictor for the eNTK$_0$ (or for a sufficiently large-$\alpha$ neural network) on a training set with targets $y$ is then
$$f^*_{\theta_0}(x) = k_{\theta_0}(x)^\top K_{\theta_0}^{-1} y = f_\infty(x) + \frac{1}{\sqrt N}\delta k_{\theta_0}(x)^\top K_\infty^{-1} y - \frac{1}{\sqrt N} k_\infty(x)^\top K_\infty^{-1}\delta K_{\theta_0} K_\infty^{-1} y + O(N^{-1}). \tag{16}$$
This implies (Geiger et al., 2020a):
$$\langle (f^*_{\theta_0}(x) - f_\infty(x))^2\rangle_x = O(N^{-1}). \tag{17}$$
Upon taking the ensemble, because of the mean-zero property of the deviations, we get
$$\langle f^*_{\theta_0}(x)\rangle_{\theta_0} = f_\infty(x) + O(N^{-1}) \;\Rightarrow\; \langle(\langle f^*_{\theta_0}(x)\rangle_{\theta_0} - f_\infty(x))^2\rangle_x = O(N^{-2}). \tag{18}$$
We can now bound the generalization error of the ensemble of networks in terms of the infinite-width generalization error:
$$\langle(\langle f^*_{\theta_0}(x)\rangle_{\theta_0} - y)^2\rangle_{x,y} = \langle(f_\infty(x) - y)^2\rangle_{x,y} + \langle(\langle f^*_{\theta_0}(x)\rangle_{\theta_0} - f_\infty(x))^2\rangle_x - 2\langle(f_\infty(x) - y)(f_\infty(x) - \langle f^*_{\theta_0}(x)\rangle_{\theta_0})\rangle_{x,y}.$$
By equation 18, the second term yields a positive contribution of order $O(N^{-2})$. The last term can be bounded by Cauchy–Schwarz:
$$\left|\langle(f_\infty(x) - y)(f_\infty(x) - \langle f^*_{\theta_0}(x)\rangle_{\theta_0})\rangle_{x,y}\right| \le \sqrt{\langle(f_\infty(x) - y)^2\rangle_{x,y}\,\langle(f_\infty(x) - \langle f^*_{\theta_0}(x)\rangle_{\theta_0})^2\rangle_x} = \sqrt{E^\infty_g(P)\, c_1 N^{-2}}. \tag{20}$$
After we enter the variance-limited regime by taking $P > P_{1/2}$, we have $E^\infty_g \le O(1/N)$, so this last term is bounded by $O(N^{-3/2})$. Consequently, the difference in generalization error between the infinite-width NTK and an ensemble of lazy networks or eNTK$_0$ predictors is subleading in $1/N$ compared to the generalization gap of a single network, which goes as $N^{-1}$.

The same argument extends to any predictor that concentrates around an infinite-width limit. In particular, Bordelon & Pehlevan (2022) show that the fluctuations of the eNTK$_f$ of any mean-field network are asymptotically mean zero with variance $N^{-1}$. The argument above then applies to the predictor obtained by ensembling networks that have learned features. This implies that in the variance-limited regime, ensemble averages of feature-learning networks have the same generalization as the infinite-width mean-field solutions up to a term of order $N^{-3/2}$ or smaller.

C FEATURE LEARNING C.1 CONTROLLING FEATURE LEARNING THROUGH INITIALIZATION SCALE

Given the feed-forward network defined in equation 1, one can see that the components of the activations satisfy $h^{(\ell)}_i = O(\sigma\, h^{(\ell-1)}_i)$, and consequently that the output satisfies $h^{(L)}_1 = O(\sigma^L)$. Because of the way the network is parameterized, the output derivatives $\frac{\partial f}{\partial\theta}$ also scale as $O(\sigma^L)$. This implies that the eNTK at any given time scales as
$$K_\theta(x, x') = \sum_\theta \frac{\partial f(x)}{\partial\theta}\frac{\partial f(x')}{\partial\theta} = O(\sigma^{2L}).$$
After appropriately rescaling the learning rate to $\eta = \sigma^{-2L}$, we get
$$\frac{df(x)}{dt} = -\eta\sum_\mu K_\theta(x, x^\mu)(f(x^\mu) - y^\mu).$$
Under the assumptions $\sigma^L \ll 1$ and $y^\mu = O(1)$, so that the error term is $O(1)$, the output changes in time as $df/dt = O(1)$. On the other hand, using the chain rule one can show that the features change as a product of the gradient update and the features in the prior layer, yielding the scaling
$$\frac{dh^{(\ell)}}{dt} = O\!\left(\frac{\eta\,\sigma^L}{\sqrt N}\right) = O\!\left(\frac{1}{\sigma^L\sqrt N}\right) = O\!\left((\alpha\sqrt N)^{-1}\right), \qquad \alpha := \sigma^L.$$
Thus the change in the features scales as $(\alpha\sqrt N)^{-1}$ while the change in the output scales as $O(1)$: for $\alpha\sqrt N$ sufficiently small, the features can move dramatically.

C.2 OUTPUT RESCALING WITHOUT RESCALING WEIGHTS

In the main text, we use the scale $\sigma$ at every layer to change the scale of the output function. This relies on the homogeneity of the activation function, so that $W^\ell \to \sigma W^\ell$ for all $\ell$ leads to a rescaling $f \to \sigma^L f$. This would not work for non-homogeneous activations such as $\phi(h) = \tanh(h)$. However, following Chizat et al. (2019); Geiger et al. (2020a), we note that we can set all weights to be $O_\alpha(1)$ and introduce $\alpha$ only in the definition of the neural network function:
$$f = \frac{\alpha}{\sqrt N}\sum_{i=1}^N w^{L+1}_i \phi(h^L_i), \qquad h^\ell_i = \frac{1}{\sqrt N}\sum_{j=1}^N W^\ell_{ij}\phi(h^{\ell-1}_j), \qquad h^1_i = \frac{1}{\sqrt D}\sum_j W^1_{ij} x_j.$$
We note that all preactivations $h^\ell$ have scale $O_\alpha(1)$ for any choice of nonlinearity, but that $f = \Theta_\alpha(\alpha)$. Several works have established that the scaling $\alpha \sim \frac{1}{\sqrt N}$ allows feature learning even as the network approaches infinite width (Mei et al., 2018; Yang & Hu, 2020; Bordelon & Pehlevan, 2022). This is known as the mean-field or $\mu$-limit.
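A minimal sketch of this parameterization, in plain numpy rather than the JAX training code, with tanh as the example nonlinearity; the function names are ours. The key properties are that the preactivations are independent of $\alpha$ and the output is exactly linear in $\alpha$.

```python
import numpy as np

def init_params(D, N, L, rng):
    """All weight entries are O(1) standard Gaussians; the width and input
    scalings live entirely in forward()."""
    Ws = [rng.standard_normal((N, D))] + \
         [rng.standard_normal((N, N)) for _ in range(L - 1)]
    w_out = rng.standard_normal(N)
    return Ws, w_out

def forward(x, Ws, w_out, alpha, phi=np.tanh):
    """f(x) = (alpha / sqrt(N)) w_out . phi(h_L), with 1/sqrt(fan-in)
    preactivation scaling so that every h^l is O(1) and f = Theta(alpha)."""
    h = Ws[0] @ x / np.sqrt(len(x))          # h^1 = W^1 x / sqrt(D)
    for W in Ws[1:]:
        h = W @ phi(h) / np.sqrt(W.shape[1]) # h^l = W^l phi(h^{l-1}) / sqrt(N)
    N = len(w_out)
    return alpha / np.sqrt(N) * (w_out @ phi(h))
```

Because $\alpha$ only multiplies the readout, rescaling $\alpha$ rescales $f$ exactly, with no assumption of homogeneity on $\phi$.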

C.3 KERNEL ALIGNMENT

In this section we comment on our choice of kernel-alignment metric
$$A(K) := \frac{y^\top K y}{\operatorname{Tr} K\, |y|^2}.$$
For kernels that are diagonally dominant, such as those encountered in our experiments, this metric is related to another alignment metric,
$$A_F(K) := \frac{y^\top K y}{|K|_F\, |y|^2},$$
where $|K|_F$ is the Frobenius norm of the Gram matrix of the kernel. This metric was used extensively in Baratin et al. (2021). The advantage of the first metric over the second is that one can quickly estimate the denominator of $A(K)$ via Monte Carlo estimation of $\langle u^\top K u\rangle_{u\sim\mathcal N(0,I)} = \operatorname{Tr} K$.

We use $A(K_f)$ as a measure of feature learning, as we have found that it captures elements of feature learning more finely than other related metrics. We list several metrics we tried that did not work. One option for a representation-learning metric is the magnitude of the change between the initial and final kernels $K_i, K_f$:
$$\Delta K := |K_f - K_i|_F.$$
However, this is more sensitive to the raw parameter change than to any task-relevant structure. If one instead normalizes the kernels to unit Frobenius norm at the beginning and the end, one obtains the modified metric
$$\widetilde{\Delta K} := \left|\frac{K_f}{|K_f|_F} - \frac{K_i}{|K_i|_F}\right|_F.$$
This metric, however, remains remarkably flat over the whole range of $\alpha, P$, as does the centered kernel alignment (CKA) of Cortes et al. (2012),
$$\mathrm{CKA}(K_i, K_f) = \frac{\operatorname{Tr}[K^c_i K^c_f]}{|K^c_i|_F\, |K^c_f|_F}, \qquad K^c = CKC, \qquad C = I - \frac{1}{P}\vec 1\vec 1^\top.$$
Here $C$ is the centering matrix that subtracts off the mean components of a $P\times P$ kernel. This alignment metric has been shown to be useful in comparing neural representations (Kornblith et al., 2019). For our task, however, because the signal is low-dimensional, only a small set of eigenspaces of the kernel aligns to the task. As a result, the CKA, which weights all eigenspaces equally, appears to be too coarse to capture the low-dimensional feature learning that is happening.
On the other hand, we find that $A(K_f)$ (with $K_f$ given by the eNTK$_f$ evaluated on a test set) can very finely detect alignment along the task-relevant directions. This produces a clear signal of feature learning at small $\alpha$ and large $P$, as shown in Figure 2c. $A(K_f)$ can be related to the centered kernel alignment between the eNTK$_f$ and the (mean-zero) task kernel $yy^\top$, where $y$ is a vector of draws from the population distribution $p(x,y)$.
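The metrics above are straightforward to compute from a Gram matrix; a numpy sketch, with function names of our own choosing:

```python
import numpy as np

def alignment(K, y):
    """A(K) = y^T K y / (Tr K * |y|^2)."""
    return (y @ K @ y) / (np.trace(K) * (y @ y))

def alignment_frob(K, y):
    """A_F(K) = y^T K y / (|K|_F * |y|^2)."""
    return (y @ K @ y) / (np.linalg.norm(K) * (y @ y))

def cka(K1, K2):
    """Centered kernel alignment between two P x P Gram matrices."""
    P = K1.shape[0]
    C = np.eye(P) - np.ones((P, P)) / P      # centering matrix
    K1c, K2c = C @ K1 @ C, C @ K2 @ C
    return np.trace(K1c @ K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))
```

When only kernel-vector products are available, the denominator of $A(K)$ can be estimated as the Monte Carlo average of $u^\top K u$ over standard Gaussian probes $u$, which is the advantage noted in the text.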

C.4 RELATIONSHIP BETWEEN TRAINED NETWORK AND FINAL KERNEL

In general, the learned function contains contributions from the instantaneous NTKs at every point in training. Concretely, following Atanasov et al. (2021), we have the following formula for the final network predictor:
$$f(x) = \int_0^\infty dt\; k(x,t)\cdot \exp\left(-\int_0^t ds\, K(s)\right) y,$$
where $[k(x,t)]_\mu = K(x, x^\mu, t)$, $[K(s)]_{\mu\nu} = K(x^\mu, x^\nu, s)$, and $[y]_\mu = y^\mu$. In general there are contributions from earlier kernels $k(x,t)$ with $t < \infty$, so the function $f$ cannot always be written as a linear combination of the final NTK $K_f$ on the training data, $f = \sum_\mu \alpha_\mu K_f(x, x^\mu)$. However, as Vyas et al. (2022); Atanasov et al. (2021) have shown, the final predictions of the network are often well modeled by regression with the final NTK. We verify this for our task in section 3.2.
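Regression with the final kernel, as used in section 3.2, amounts to solving one linear system in the Gram matrix; a minimal sketch, where the optional ridge term is our own numerical-stability choice (set to zero for pure interpolation):

```python
import numpy as np

def kernel_regression(K_train, k_test, y, ridge=0.0):
    """Kernel regression predictor f(x) = k_f(x)^T (K_f + ridge I)^{-1} y.

    K_train: (P, P) Gram matrix of the kernel on training points.
    k_test:  (P_test, P) cross-kernel between test and training points.
    y:       (P,) training targets."""
    P = K_train.shape[0]
    # alpha_coef plays the role of the coefficients alpha_mu in the text
    alpha_coef = np.linalg.solve(K_train + ridge * np.eye(P), y)
    return k_test @ alpha_coef
```

With `ridge=0` and a positive-definite Gram matrix, the predictor interpolates the training targets exactly, which mirrors training the kernel machine to the interpolation threshold.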

D GENERIC RANDOM FEATURE MODEL D.1 SETTING UP THE PROBLEM: FEATURE DEFINITIONS

For a random kernel $K(x, x'; \theta)$, we first compute its Mercer decomposition
$$\int dx\, p(x) K(x, x'; \theta)\phi_k(x) = \lambda_k \phi_k(x').$$
From the eigenvalues $\lambda_k$ and eigenfunctions $\phi_k$, we can construct the square-root kernel
$$K^{1/2}(x, x'; \theta) = \sum_k \sqrt{\lambda_k}\,\phi_k(x)\phi_k(x').$$
Lastly, using $K^{1/2}$, we can get a feature map by projecting against a static orthonormal basis $\{b_k\}$:
$$\psi_k(x) = \int dx'\, p(x') K^{1/2}(x, x'; \theta) b_k(x').$$
These features reproduce the kernel, $K(x, x'; \theta) = \sum_k \psi_k(x)\psi_k(x')$. This follows from
$$\psi_k(x) = \sum_\ell \sqrt{\lambda_\ell}\,\phi_\ell(x) U_{\ell k}, \qquad U_{\ell k} = \langle \phi_\ell(x) b_k(x)\rangle \tag{34}$$
$$\Rightarrow \sum_k \psi_k(x)\psi_k(x') = \sum_{\ell, m}\sqrt{\lambda_\ell\lambda_m}\,\phi_\ell(x)\phi_m(x')\sum_k U_{\ell k}U_{mk} = \sum_\ell \lambda_\ell\,\phi_\ell(x)\phi_\ell(x'), \tag{35}$$
where the last equality follows from the orthogonality of $U$ and recovers $K(x, x'; \theta)$.
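In a finite sample, the construction above has a direct matrix analogue: the eigendecomposition of the Gram matrix plays the role of the Mercer decomposition, and a random orthonormal matrix plays the role of the static basis $\{b_k\}$. A numpy sketch under those assumptions (PSD input, names ours):

```python
import numpy as np

def kernel_features(K, rng=None):
    """Build features psi with psi psi^T = K from a PSD Gram matrix K.

    Discrete analogue of eqs. (34)-(35): take the matrix square root of K
    and rotate it against a random orthonormal basis U (standing in for b_k);
    orthogonality of U guarantees the kernel is reproduced."""
    lam, phi = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)                       # guard tiny negatives
    K_half = (phi * np.sqrt(lam)) @ phi.T               # K^{1/2}
    n = K.shape[0]
    rng = rng or np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))    # static orthonormal basis
    return K_half @ U
```

Any orthonormal $U$ works, which is exactly the freedom in choosing the basis $\{b_k\}$ in the text.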

D.2 DECOMPOSITION OF FINITE WIDTH FEATURES

We now attempt to characterize the variance in the features over the sample distribution. We first consider a fixed realization of $\theta_0$ before providing a typical-case analysis over random $\theta_0$. For a fixed initialization $\theta_0$, we define the covariance matrix
$$\Sigma_M = \langle \psi_M(x)\psi_M(x)^\top\rangle \in \mathbb{R}^{M\times M},$$
where $\psi_M$ are the truncated (but deterministic) features induced by the deterministic infinite-width kernel. We will mainly be interested in the case where $M\to\infty$ and where the target function can be expressed as a linear combination $y(x) = w_*\cdot\psi_M(x)$ of these features. For example, in the case of our experiments on the sphere, $\psi_M$ could be the spherical harmonics. Further, in the $M\to\infty$ limit, we can express the finite-width features $\psi$ as linear combinations of the features $\psi_M$:
$$\psi(x, \theta_0) = A(\theta_0)\psi_M(x), \qquad A(\theta_0)\in\mathbb{R}^{N_H\times M}.$$
The matrix $A(\theta_0)$ contains the coefficients of the decomposition, which vary over initializations. Crucially, $A(\theta_0)$ projects to the subspace of dimension $N_H$ in which the finite-width features have variance over $x$. The population risk for this $\theta_0$ has an irreducible component:
$$E_g(\theta_0) = \langle(w_*\cdot\psi_M - w\cdot\psi)^2\rangle \ge w_*^\top\left[\Sigma_M - \Sigma_M A(\theta_0)^\top\left(A(\theta_0)\Sigma_M A(\theta_0)^\top\right)^{-1}A(\theta_0)\Sigma_M\right]w_*,$$
where the bound is tight for the optimal weights $w = \left(A(\theta_0)\Sigma_M A(\theta_0)^\top\right)^{-1}A(\theta_0)\Sigma_M w_*$. The irreducible error is determined by the projection matrix $I - A(\theta_0)^\top\left(A(\theta_0)\Sigma_M A(\theta_0)^\top\right)^{-1}A(\theta_0)\Sigma_M$. In general, this preserves some fraction of the variance in the target function: some variance in the target will not be expressible by linear combinations of the features $\psi(x, \theta_0)$. We expect that random finite-width-$N$ neural networks will have unexplained variance in the target function of order $\sim 1/N$.

D.3 GAUSSIAN COVARIATE MODEL

Following prior works on learning curves for kernel regression (Bordelon et al., 2020; Canatar et al., 2021; Loureiro et al., 2021), we approximate the learning problem with a Gaussian covariate model with matching second moments. The features $\psi_M(x)$ are treated as Gaussian over random draws of data points, and we assume centered features. We decompose the features in the orthonormal basis $b(x)$, which we approximate as a Gaussian vector $b\sim\mathcal N(0, I)$:
$$f = \psi(\theta_0)\cdot w, \qquad y = \psi_M\cdot w_*, \qquad \psi_M = \Sigma_M^{1/2} b, \qquad \psi(\theta_0) = A(\theta_0)^\top\psi_M + \Sigma_\epsilon^{1/2}\epsilon, \qquad b\sim\mathcal N(0, I), \quad \epsilon\sim\mathcal N(0, I).$$
This is a special case of the Gaussian covariate model introduced by Loureiro et al. (2021), and it subsumes the popular two-layer random feature models (Mei & Montanari, 2022; Adlam & Pennington, 2020b) as a special case. In a subsequent section, we go beyond Loureiro et al. (2021) by computing typical-case learning curves over Gaussian $A(\theta_0)$ matrices. In particular, for the two-layer random feature model in the proportional asymptotic limit $P, N, D\to\infty$ with $P/D = O(1)$ and $P/N = O(1)$, with $\psi(x) = \phi(Fx)$ for a fixed feature matrix $F\in\mathbb{R}^{N\times D}$, nonlinearity $\phi$, and $x = b\sim\mathcal N(0, D^{-1}I)$, we have
$$\Sigma_M = I, \qquad \Sigma_\epsilon = c_*^2 I, \qquad A^\top = c_1 F, \qquad c_1 = \langle z\phi(z)\rangle_{z\sim\mathcal N(0,1)}, \qquad c_*^2 = \langle\phi(z)^2\rangle_{z\sim\mathcal N(0,1)} - c_1^2.$$
We refer readers to Hu & Lu (2020) for a discussion of this equivalence between random feature regression and the Gaussian covariate model.
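The Gaussian covariate model is easy to simulate, which is useful for checking the theory curves by Monte Carlo. A sketch for the isotropic case $\Sigma_M = I$, $\Sigma_\epsilon = \texttt{noise}^2 I$; the function names and normalizations here are our own illustrative choices:

```python
import numpy as np

def draw_data(P, A, w_star, noise, rng):
    """One draw from the Gaussian covariate model with Sigma_M = I and
    Sigma_eps = noise^2 I: psi_M = b ~ N(0, I), psi = A b + noise * eps."""
    M = w_star.shape[0]
    b = rng.standard_normal((P, M))
    psi = b @ A.T + noise * rng.standard_normal((P, A.shape[0]))
    return psi, b @ w_star

def ridge_fit(psi, y, lam):
    """Minimizer of sum_mu (psi_mu . w - y_mu)^2 + lam |w|^2."""
    return np.linalg.solve(psi.T @ psi + lam * np.eye(psi.shape[1]), psi.T @ y)
```

Estimating the generalization error on a fresh draw with the same $(A, w_*)$ gives a Monte Carlo point on the learning curve: with a full-rank $A$ and no feature noise the error vanishes at large $P$, while a rank-deficient $A$ leaves an irreducible error, matching section D.2.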

D.4 REPLICA CALCULATION OF THE LEARNING CURVE

To analyze the typical-case performance of kernel regression, we define the following partition function:
$$Z[\mathcal D, \theta_0] = \int dw\,\exp\left(-\frac{\beta}{2\lambda}\sum_{\mu=1}^P\left[w\cdot\psi^\mu - w_*\cdot\psi_M^\mu\right]^2 - \frac{\beta}{2}|w|^2 - \frac{J\beta M}{2}E_g(w)\right),$$
$$E_g(w) = \frac{1}{M}\left|\Sigma_M^{1/2}w_* - \Sigma_M^{1/2}A(\theta_0)w\right|^2 + \frac{1}{M}w^\top\Sigma_\epsilon w.$$
For proper normalization, we assume that $\langle\psi_M\psi_M^\top\rangle = \frac{1}{M}\Sigma_M$ and $\langle\epsilon\epsilon^\top\rangle = \frac{1}{M}\Sigma_\epsilon$. We note that in the $\beta\to\infty$ limit, the partition function is dominated by the unique minimizer of the regularized least squares objective (Canatar et al., 2021; Loureiro et al., 2021). Further, for a fixed realization of $\theta_0$, the average generalization error over datasets $\mathcal D$ can be computed by differentiating with respect to the source $J$:
$$-\frac{2}{\beta M}\frac{\partial}{\partial J}\Big|_{J=0}\langle\ln Z[\mathcal D, \theta_0]\rangle_{\mathcal D} = \left\langle\frac{1}{Z}\int dw\,\exp\left(-\frac{\beta}{2\lambda}\sum_{\mu=1}^P\left[w\cdot\psi^\mu - w_*\cdot\psi_M^\mu\right]^2 - \frac{\beta}{2}|w|^2\right)E_g(w)\right\rangle_{\mathcal D}.$$
Thus the $\beta\to\infty$ limit of this quantity gives the expected generalization error of the risk minimizer. We see the need to average $\ln Z$ over realizations of the dataset $\mathcal D$. For this, we resort to the replica trick,
$$\langle\ln Z\rangle = \lim_{n\to 0}\frac{1}{n}\ln\langle Z^n\rangle.$$
We compute the moments $\langle Z^n\rangle$ for integer $n$ and then analytically continue the resulting expressions to $n\to 0$ under a symmetry ansatz. The replicated partition function has the form
$$\langle Z^n\rangle = \int\prod_{a=1}^n dw^a\,\mathbb{E}_{\{b^\mu, \epsilon^\mu\}}\exp\left(-\frac{\beta}{2\lambda}\sum_{\mu=1}^P\sum_{a=1}^n\left[w^a\cdot\psi^\mu - w_*\cdot\psi_M^\mu\right]^2 - \frac{\beta}{2}\sum_{a=1}^n|w^a|^2\right)\exp\left(-\frac{J\beta M}{2}\sum_{a=1}^n E_g(w^a)\right).$$
We now perform the average over random realizations of the data $\mathcal D = \{b^\mu, \epsilon^\mu\}$. The scalar quantities $h^a_\mu = w^a\cdot\psi^\mu - w_*\cdot\psi_M^\mu$ are Gaussian with mean zero and covariance $\langle h^a_\mu h^b_\nu\rangle = \delta_{\mu\nu}Q^{ab}$,
$$Q^{ab} = \frac{1}{M}\left(A(\theta_0)w^a - w_*\right)^\top\Sigma_M\left(A(\theta_0)w^b - w_*\right) + \frac{1}{M}w^{a\top}\Sigma_\epsilon w^b.$$
We further see that the generalization error in replica $a$ is $E_g(w^a) = Q^{aa}$. Performing the Gaussian integral over $\{h^a_\mu\}$ gives
$$\langle Z^n\rangle \propto \int\prod_a dw^a\prod_{ab}dQ^{ab}\,d\hat Q^{ab}\,\exp\left(-\frac{P}{2}\ln\det(\lambda I + \beta Q) - \frac{J\beta M}{2}\operatorname{Tr}Q - \frac{\beta}{2}\sum_a|w^a|^2\right)$$
$$\times\exp\left(\frac{1}{2}\sum_{ab}\hat Q^{ab}\left[MQ^{ab} - \left(A(\theta_0)w^a - w_*\right)^\top\Sigma_M\left(A(\theta_0)w^b - w_*\right) - w^{a\top}\Sigma_\epsilon w^b\right]\right).$$
We introduced the Lagrange multipliers $\hat Q$, which enforce the definition of the order parameters $Q$. We now integrate over $W = \mathrm{Vec}\{w^a\}_{a=1}^n$. Letting $\hat\Sigma_s = A^\top\Sigma_M A$,
$$\int dW\exp\left(-\frac{1}{2}W^\top\left[\beta I + \hat Q\otimes(\hat\Sigma_s + \Sigma_\epsilon)\right]W + W^\top\left[\hat Q\otimes I\right]\left[\vec 1\otimes A^\top\Sigma_M w_*\right]\right)$$
$$= \exp\left(\frac{1}{2}\left[\vec 1\otimes A^\top\Sigma_M w_*\right]^\top\left[\hat Q\otimes I\right]\left[\beta I + \hat Q\otimes(\hat\Sigma_s + \Sigma_\epsilon)\right]^{-1}\left[\hat Q\otimes I\right]\left[\vec 1\otimes A^\top\Sigma_M w_*\right] - \frac{1}{2}\ln\det\left[\beta I + \hat Q\otimes(\hat\Sigma_s + \Sigma_\epsilon)\right]\right).$$
To take the $n\to 0$ limit, we make the replica-symmetry ansatz
$$\beta Q = qI + q_0\vec 1\vec 1^\top, \qquad \beta^{-1}\hat Q = \hat qI + \hat q_0\vec 1\vec 1^\top,$$
which is well motivated since this is a convex optimization problem. Letting $\alpha = P/M$, we find that under the RS ansatz the replicated partition function has the form
$$\langle Z^n\rangle = \int dq\,dq_0\,d\hat q\,d\hat q_0\,\exp\left(\frac{nM}{2}S[q, q_0, \hat q, \hat q_0]\right),$$
$$S = q\hat q + q_0\hat q + q\hat q_0 - \alpha\left[\ln(\lambda + q) + \frac{q_0}{\lambda + q}\right] - \frac{\beta}{M}w_*^\top[\hat q\Sigma_M]w_* + \frac{\beta}{M}w_*^\top[\hat q\Sigma_s]A^\top GA[\hat q\Sigma_s]w_* + \frac{1}{M}\ln\det G - \frac{1}{M}\hat q_0\operatorname{Tr}G[\hat\Sigma_s + \Sigma_\epsilon] - J(q + q_0),$$
$$G = \left[I + \hat q(\hat\Sigma_s + \Sigma_\epsilon)\right]^{-1}.$$
In the limit where $\alpha = P/M$ is $O(1)$, $S$ is intensive, $S = O_M(1)$. We can thus appeal to saddle-point integration (the method of steepest descent) to compute the set of order parameters with dominant contribution to the free energy:
$$\langle Z^n\rangle = \int dq\,dq_0\,d\hat q\,d\hat q_0\,\exp\left(\frac{nM}{2}S[q, q_0, \hat q, \hat q_0]\right) \sim \exp\left(\frac{nM}{2}S[q^*, q_0^*, \hat q^*, \hat q_0^*]\right) \implies \langle\ln Z\rangle = \frac{M}{2}S[q^*, q_0^*, \hat q^*, \hat q_0^*].$$
The order parameters $q^*, q_0^*, \hat q^*, \hat q_0^*$ are defined via the saddle-point equations $\frac{\partial S}{\partial q} = \frac{\partial S}{\partial q_0} = \frac{\partial S}{\partial\hat q} = \frac{\partial S}{\partial\hat q_0} = 0$. For our purposes, it suffices to analyze two of these equations:
$$\frac{\partial S}{\partial q_0} = \hat q - \frac{\alpha}{\lambda + q} - J = 0, \qquad \frac{\partial S}{\partial\hat q_0} = q - \frac{1}{M}\operatorname{Tr}G\left[\hat\Sigma_s + \Sigma_\epsilon\right] = 0.$$
We can now take the zero-temperature ($\beta\to\infty$) limit to solve for the generalization error:
$$E_g = -\frac{\partial}{\partial J}\Big|_{J=0}\lim_{\beta\to\infty}\frac{1}{\beta}F = \frac{1}{M}\partial_J\left[w_*^\top\left(\hat q\Sigma_M - \hat q^2\Sigma_M A^\top GA\Sigma_M\right)w_*\right].$$
We see that we need to compute the $J$ derivatives of $\hat q$. We let $\kappa = \lambda + q$ and note
$$\partial_J\hat q = -\alpha\kappa^{-2}\partial_J\kappa + 1, \qquad \partial_J\kappa = -\partial_J\hat q\,\frac{1}{M}\operatorname{Tr}G^2\left[\hat\Sigma_s + \Sigma_\epsilon\right]^2 = -\left(-\alpha\kappa^{-2}\partial_J\kappa + 1\right)\frac{1}{M}\operatorname{Tr}G^2\left[\hat\Sigma_s + \Sigma_\epsilon\right]^2.$$
Solving the equation for $\partial_J\kappa$ gives
$$\partial_J\kappa = -\frac{\kappa^2}{\alpha}\frac{\gamma}{1 - \gamma}, \qquad \gamma = \frac{\alpha}{\kappa^2}\frac{1}{M}\operatorname{Tr}G^2\left[\hat\Sigma_s + \Sigma_\epsilon\right]^2.$$
With this definition, we have $\partial_J\hat q = 1 + \frac{\gamma}{1-\gamma} = \frac{1}{1-\gamma}$, so
$$E_g = \frac{1}{1-\gamma}\frac{1}{M}w_*^\top\Sigma_M^{1/2}\left[I - 2\hat q\,\Sigma_s^{1/2}A^\top GA\Sigma_s^{1/2} + \hat q^2\,\Sigma_s^{1/2}A^\top G\left[\hat\Sigma_s + \Sigma_\epsilon\right]GA\Sigma_s^{1/2}\right]\Sigma_M^{1/2}w_*$$
$$= \frac{1}{1-\gamma}\frac{1}{M}w_*^\top\Sigma_M^{1/2}\left[I - \hat q\,\Sigma_s^{1/2}A^\top GA\Sigma_s^{1/2} - \hat q\,\Sigma_s^{1/2}A^\top G^2A\Sigma_s^{1/2}\right]\Sigma_M^{1/2}w_*.$$
This reproduces the derived expression of Loureiro et al. (2021). The matching-covariance case $\hat\Sigma_s = \Sigma_M$ with zero feature noise $\Sigma_\epsilon = 0$ recovers the prior results of Bordelon et al. (2020); Canatar et al. (2021); Simon et al. (2021). In general, this error will asymptote to the irreducible error
$$\lim_{P\to\infty}E_g = \frac{1}{M}w_*^\top\left[\Sigma_M - \Sigma_s A^\top\left(\hat\Sigma_s + \Sigma_\epsilon\right)^{-1}A\Sigma_s\right]w_*,$$
which is the minimal possible error in the $P\to\infty$ limit. The derived learning curves depend on the instance of the random initial condition $\theta_0$. To get the average-case performance, we take an additional average of this expression over $\theta_0$:
$$\mathbb{E}_{\theta_0}E_g(\theta_0) = \mathbb{E}_{\theta_0}\left[\frac{1}{1-\gamma}\frac{1}{M}w_*^\top\Sigma_M^{1/2}\left[I - \hat q\,\Sigma_s^{1/2}A^\top GA\Sigma_s^{1/2} - \hat q\,\Sigma_s^{1/2}A^\top G^2A\Sigma_s^{1/2}\right]\Sigma_M^{1/2}w_*\right]. \tag{55}$$
This average is complicated since $\gamma, \hat q, G$ all depend on $\theta_0$. In the next section we go beyond this analysis and perform the average-case analysis for random Gaussian $A$.

D.5 QUENCHED AVERAGE OVER GAUSSIAN A

In this section we define a distribution of features which allows an exact asymptotic prediction over random realizations of the disorder $\theta_0$ and datasets $\mathcal D$. This is a nontrivial extension of the result of Loureiro et al. (2021), since the number of saddle-point equations to be solved doubles from two to four. This more complicated theory, however, allows us to exactly compute the expectation in equation 55 under an ansatz for the random matrix $A$. We construct our features as
$$\psi\,|\,A = \frac{1}{\sqrt N}A^\top\psi_M + \Sigma_\epsilon^{1/2}\epsilon, \qquad A_{ij}\sim\mathcal N(0, \sigma^2).$$
We now perform an average over both datasets $\mathcal D$ and realizations of $A$:
$$\langle Z^n\rangle = \int\prod_{a=1}^n dw^a\,\mathbb{E}_{\{b^\mu,\epsilon^\mu,A\}}\exp\left(-\frac{\beta}{2\lambda}\sum_{\mu=1}^P\sum_{a=1}^n\left[w^a\cdot\psi^\mu - w_*\cdot\psi_M^\mu\right]^2 - \frac{\beta}{2}\sum_{a=1}^n|w^a|^2\right)\exp\left(-\frac{JM\beta}{2}\sum_{a=1}^n E_g(w^a)\right). \tag{56}$$
As before, we first average over $b^\mu, \epsilon^\mu\,|\,A$ and define order parameters $Q^{ab}$ as before:
$$\langle Z^n\rangle = \int\prod_a dw^a\prod_{ab}dQ^{ab}\,d\hat Q^{ab}\,\exp\left(-\frac{P}{2}\ln\det(\lambda I + \beta Q) - \frac{J\beta M}{2}\operatorname{Tr}Q - \frac{\beta}{2}\sum_a|w^a|^2\right)\mathbb{E}_{\{g^a\}}\exp\left(\frac{1}{2}\sum_{ab}\hat Q^{ab}\left[MQ^{ab} - (g^a - w_*)^\top\Sigma_M(g^b - w_*) - w^{a\top}\Sigma_\epsilon w^b\right]\right),$$
where we defined the fields $g^a = \frac{1}{\sqrt N}Aw^a$, which are mean-zero Gaussian with covariance $\langle g^a g^{b\top}\rangle = V^{ab}I$, $V^{ab} = \frac{\sigma^2}{N}w^a\cdot w^b$. Performing the Gaussian integral over $\mathbf G = \mathrm{Vec}\{g^a\}$, we find
$$\int\prod_a dg^a\exp\left(-\frac{1}{2}\mathbf G^\top\left[I\otimes V^{-1} + \Sigma_M\otimes\hat Q\right]\mathbf G + \left(\Sigma_M w_*\otimes\hat Q\vec 1\right)^\top\mathbf G - \frac{1}{2}\ln\det(I\otimes V)\right)$$
$$= \exp\left(\frac{1}{2}\left(\Sigma_M w_*\otimes\hat Q\vec 1\right)^\top\left[I\otimes V^{-1} + \Sigma_M\otimes\hat Q\right]^{-1}\left(\Sigma_M w_*\otimes\hat Q\vec 1\right) - \frac{1}{2}\ln\det\left[I + \Sigma_M\otimes\hat QV\right]\right).$$
Next, we need to integrate over $W = \mathrm{Vec}\{w^a\}$, which gives
$$\int dW\exp\left(-\frac{1}{2}W^\top\left[\beta I + \sigma^2 I\otimes\hat V + \Sigma_\epsilon\otimes\hat Q\right]W\right) = \exp\left(-\frac{1}{2}\ln\det\left[\beta I + \sigma^2 I\otimes\hat V + \Sigma_\epsilon\otimes\hat Q\right]\right).$$
Now the replicated partition function has the form
$$\langle Z^n\rangle = \int dQ\,d\hat Q\,dV\,d\hat V\,\exp\left(\frac{M}{2}\operatorname{Tr}\left[Q\hat Q + \eta V\hat V\right] - \frac{J\beta M}{2}\operatorname{Tr}Q - \frac{P}{2}\ln\det[\lambda I + \beta Q]\right)$$
$$\times\exp\left(-\frac{1}{2}(w_*\otimes\vec 1)^\top\left[\Sigma_M\otimes\hat Q\right](w_*\otimes\vec 1)\right)\times\exp\left(\frac{1}{2}\left(\Sigma_M w_*\otimes\hat Q\vec 1\right)^\top\left[I\otimes V^{-1} + \Sigma_M\otimes\hat Q\right]^{-1}\left(\Sigma_M w_*\otimes\hat Q\vec 1\right)\right)$$
$$\times\exp\left(-\frac{1}{2}\ln\det\left[I + \Sigma_M\otimes\hat QV\right] - \frac{1}{2}\ln\det\left[\beta I + \sigma^2 I\otimes\hat V + \Sigma_\epsilon\otimes\hat Q\right]\right).$$
Now we make a replica-symmetry ansatz on the order parameters $Q, \hat Q, V, \hat V$:
$$\beta Q = qI + q_0\vec 1\vec 1^\top, \qquad \beta V = vI + v_0\vec 1\vec 1^\top, \qquad \beta^{-1}\hat Q = \hat qI + \hat q_0\vec 1\vec 1^\top, \qquad \beta^{-1}\hat V = \hat vI + \hat v_0\vec 1\vec 1^\top.$$
We introduce the shorthand $\operatorname{tr}G = \frac{1}{M}\operatorname{Tr}G$ for the normalized trace of a matrix $G$. Under the replica-symmetry ansatz, we find the following free energy:
$$\frac{2}{M}\langle\ln Z\rangle = q\hat q + q_0\hat q + q\hat q_0 + \eta(v\hat v + v_0\hat v + v\hat v_0) - J(q + q_0) - \alpha\left[\ln(\lambda + q) + \frac{q_0}{\lambda + q}\right] - \frac{\beta}{M}w_*^\top[\hat q\Sigma_M]w_* + \frac{\beta}{M}w_*^\top[\hat q\Sigma_M]\left[\hat v^{-1}I + \hat q\Sigma_M\right]^{-1}[\hat q\Sigma_M]w_*$$
$$-\operatorname{tr}\log\left[I + \hat qv\Sigma_M\right] - (q_0\hat v + \hat qv_0)\operatorname{tr}\left[I + \hat qv\Sigma_M\right]^{-1}\Sigma_M - \operatorname{tr}\log\left[I + \sigma^2\hat vI + \Sigma_\epsilon\hat q\right] - \operatorname{tr}\left[I + \sigma^2\hat vI + \Sigma_\epsilon\hat q\right]^{-1}\left[\hat v_0\sigma^2 I + \hat q_0\Sigma_\epsilon\right].$$
Letting $F = 2M^{-1}\langle\ln Z\rangle$, the saddle-point equations read
$$\frac{\partial F}{\partial q_0} = \hat q - \frac{\alpha}{\lambda + q} - J = 0, \qquad \frac{\partial F}{\partial\hat q_0} = q - v\operatorname{tr}\left[I + \hat qv\Sigma_M\right]^{-1}\Sigma_M - \operatorname{tr}\left[I + \sigma^2\hat vI + \Sigma_\epsilon\hat q\right]^{-1}\Sigma_\epsilon = 0,$$
$$\frac{\partial F}{\partial\hat v_0} = \eta v - \hat q\operatorname{tr}\left[I + \hat qv\Sigma_M\right]^{-1}\Sigma_M = 0, \qquad \frac{\partial F}{\partial v_0} = \eta\hat v - \sigma^2\operatorname{tr}\left[I + \sigma^2\hat vI + \Sigma_\epsilon\hat q\right]^{-1} = 0.$$
Now the generalization error can be determined from
$$E_g = -\frac{\partial}{\partial J}\lim_{\beta\to\infty}\frac{2}{\beta M}\langle\ln Z\rangle = \partial_J\left[\frac{1}{M}w_*^\top\left(\hat q\Sigma_M - (\hat q\Sigma_M)\left[\hat v^{-1}I + \hat q\Sigma_M\right]^{-1}(\hat q\Sigma_M)\right)w_*\right].$$
We see that it is necessary to compute $\partial_J\hat q$ and $\partial_J\hat v$ in order to obtain the final result. For simplicity, we set $\sigma^2 = 1$. The equations for the source derivatives are
$$\partial_J\hat q = -\frac{\alpha}{(\lambda + q)^2}\partial_J q + 1,$$
$$\partial_J q = -\operatorname{tr}\left[I + v\hat q\Sigma_M\right]^{-2}\left[-\partial_J\hat v\,I + v^2\partial_J\hat q\,\Sigma_M\right]\Sigma_M - \operatorname{tr}\left[I + \hat vI + \hat q\Sigma_\epsilon\right]^{-2}\Sigma_\epsilon\left[\partial_J\hat v\,I + \partial_J\hat q\,\Sigma_\epsilon\right],$$
$$\eta\partial_J\hat v = -\operatorname{tr}\left[I + v\hat q\Sigma_M\right]^{-2}\Sigma_M\left[-\partial_J\hat q\,I + \hat q^2\partial_J v\,\Sigma_M\right], \qquad \eta\partial_J v = -\operatorname{tr}\left[I + \hat vI + \Sigma_\epsilon\hat q\right]^{-2}\left[\partial_J\hat v\,I + \partial_J\hat q\,\Sigma_\epsilon\right].$$
Once the values of the order parameters $(q, \hat q, v, \hat v)$ have been determined, these source derivatives can be obtained by solving a $4\times 4$ linear system. Examples of these solutions are provided in Figure 16.

D.5.1 ASYMPTOTICS IN UNDERPARAMETERIZED REGIME

We can compute the asymptotic ($\alpha\to\infty$) generalization error due to the random projection $A$ in the limit $\Sigma_\epsilon = 0$. First, note that if $\hat v = O_\alpha(1)$, the asymptotic error would be zero. Therefore, we assume $\hat v\sim a\alpha^c$ for some $a, c > 0$. The saddle-point equations give the following asymptotic conditions:
$$\hat q\sim\frac{\alpha}{\lambda}, \qquad \eta\sim\hat q\operatorname{tr}\left[\hat vI + \hat q\Sigma_M\right]^{-1}\Sigma_M \implies \eta = \operatorname{tr}\left[\lambda a\alpha^{c-1}I + \Sigma_M\right]^{-1}\Sigma_M.$$
For $0 < \eta < 1$, this equation can only be satisfied as $\alpha\to\infty$ if $c = 1$, so that $\hat v$ has the same scaling with $\alpha$ as $\hat q$: if $c < 1$ we would get the equation $\eta = 1$, and if $c > 1$ the equation would give $\eta = 0$. The constant $a$ solves
$$\eta = \operatorname{tr}\left[\lambda aI + \Sigma_M\right]^{-1}\Sigma_M.$$
Using this fact, the order parameters satisfy the following large-$\alpha$ scalings:
$$\hat q\sim\frac{\alpha}{\lambda}, \qquad q\sim 0, \qquad \hat v\sim a\alpha, \qquad v\sim 0.$$
The source-derivative equations simplify to $\partial_J\hat q\sim 1$, $\partial_J q\sim 0$, and
$$\eta\partial_J\hat v \sim (\hat v\partial_J\hat q + \hat q\partial_J\hat v)\operatorname{tr}\left[\hat vI + \hat q\Sigma_M\right]^{-1}\Sigma_M - \hat q\hat v\operatorname{tr}\left[\hat vI + \hat q\Sigma_M\right]^{-2}\left[\partial_J\hat v\,\Sigma_M + \partial_J\hat q\,\Sigma_M^2\right] \sim \hat q^2\operatorname{tr}\left[\hat vI + \hat q\Sigma_M\right]^{-2}\Sigma_M^2\,\partial_J\hat v + \hat v^2\operatorname{tr}\left[\hat vI + \hat q\Sigma_M\right]^{-2}\Sigma_M$$
$$\Rightarrow \partial_J\hat v \sim \frac{\operatorname{tr}\left[I + a^{-1}\lambda^{-1}\Sigma_M\right]^{-2}\Sigma_M}{\eta - \operatorname{tr}\left[a\lambda I + \Sigma_M\right]^{-2}\Sigma_M^2}.$$
We note that $\partial_J\hat v$ depends only on the product $a\lambda$, which is an implicit function of $\eta$ and $\Sigma_M$. The generalization error is
$$E_g = \frac{1}{M}\partial_J\left[\hat q(1 + \hat v)\right]w_*^\top\left[(1 + \hat v)I + \hat q\Sigma_M\right]^{-1}\Sigma_M w_*$$
$$E_g \sim \frac{1}{M}w_*^\top\left[I + a^{-1}\lambda^{-1}\Sigma_M\right]^{-2}\Sigma_M w_* + \frac{1}{M}w_*^\top\left[\lambda aI + \Sigma_M\right]^{-2}\Sigma_M^2 w_*\times\frac{\operatorname{tr}\left[I + a^{-1}\lambda^{-1}\Sigma_M\right]^{-2}\Sigma_M}{\eta - \operatorname{tr}\left[a\lambda I + \Sigma_M\right]^{-2}\Sigma_M^2}.$$
We see that in the generic case, the asymptotic error has a nontrivial dependence on the task $w_*$ and the correlation structure $\Sigma_M$. To gain more intuition, we now consider the special case of isotropic features $\Sigma_M = I$. In this case, $\eta = \frac{1}{1 + \lambda a}$, so that $\lambda a = \frac{1-\eta}{\eta}$. This results in the generalization error
$$E_g \sim \frac{1}{M}|w_*|^2(1-\eta)^2 + \frac{1}{M}|w_*|^2\,\eta^2\frac{(1-\eta)^2}{\eta - \eta^2} \sim \frac{1}{M}|w_*|^2(1-\eta).$$
We see that as $\eta = \frac{N_H}{M}\to 1$, the asymptotic error converges to zero, since all information in the original features is preserved.
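The isotropic $P\to\infty$ result $E_g \sim \frac{1}{M}|w_*|^2(1-\eta)$ can be checked directly: for $\Sigma_M = I$ and $\Sigma_\epsilon = 0$, the irreducible error of section D.2 is the squared norm of $w_*$ outside the row space of $A$, and for a random Gaussian $A$ this is on average a fraction $1-\eta$ of $|w_*|^2$. A numpy sketch with a hypothetical helper name:

```python
import numpy as np

def irreducible_error(A, w_star):
    """Asymptotic (P -> infinity) error for Sigma_M = I, Sigma_eps = 0:
    the squared norm of the component of w_star orthogonal to the row
    space of A (the part of the target the features cannot express)."""
    proj = A.T @ np.linalg.solve(A @ A.T, A @ w_star)  # projection onto rowspace(A)
    return np.sum((w_star - proj) ** 2)
```

Averaging over random Gaussian $A\in\mathbb{R}^{N_H\times M}$, the error concentrates around $(1 - N_H/M)\,|w_*|^2$, matching the $(1-\eta)$ factor derived above.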

D.5.2 SIMPLIFIED ISOTROPIC FEATURE NOISE

We can simplify the above expressions somewhat in the case where $\sigma^2 = 1$ and $\Sigma_\epsilon = \sigma_\epsilon^2 I$. In this case, the order parameters become
$$\eta\hat v = \eta\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)^{-1} \implies \hat v = \frac{1}{1 + \hat v + \sigma_\epsilon^2\hat q}$$
$$\implies \eta v = \hat q\operatorname{tr}\left[I + \frac{\hat q}{1 + \hat v + \sigma_\epsilon^2\hat q}\Sigma_M\right]^{-1}\Sigma_M = \hat q\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)\operatorname{tr}\left[\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)I + \hat q\Sigma_M\right]^{-1}\Sigma_M,$$
$$q = \operatorname{tr}\left[\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)I + \hat q\Sigma_M\right]^{-1}\Sigma_M + \frac{\eta\sigma_\epsilon^2}{1 + \hat v + \sigma_\epsilon^2\hat q}.$$
Letting $G = \left[\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)I + \hat q\Sigma_M\right]^{-1}$, the source derivatives have the form
$$\partial_J\hat q = 1 - \frac{\alpha}{(\lambda + q)^2}\partial_J q = 1 + \frac{\alpha}{(\lambda + q)^2}\left[\operatorname{tr}G^2\Sigma_M\left[\partial_J\hat v\,I + \partial_J\hat q\,\Sigma_M\right] + \frac{\eta\sigma_\epsilon^2}{\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)^2}\left(\partial_J\hat v + \sigma_\epsilon^2\partial_J\hat q\right)\right], \tag{72}$$
$$\eta\partial_J v = \left(\left(1 + \hat v + 2\sigma_\epsilon^2\hat q\right)\partial_J\hat q + \hat q\partial_J\hat v\right)\operatorname{tr}G\Sigma_M - \hat q\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)\operatorname{tr}G^2\Sigma_M\left[\left(\partial_J\hat v + \sigma_\epsilon^2\partial_J\hat q\right)I + \partial_J\hat q\,\Sigma_M\right] \tag{73}$$
$$= \left(\partial_J\hat q\right)\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)^2\operatorname{tr}G^2\Sigma_M + \left(\partial_J\hat v + \sigma_\epsilon^2\partial_J\hat q\right)\hat q^2\operatorname{tr}G^2\Sigma_M^2.$$
This is a $2\times 2$ linear system:
$$\begin{pmatrix} 1 - \frac{\alpha}{(\lambda+q)^2}\left[\operatorname{tr}G^2\Sigma_M^2 + \frac{\eta\sigma_\epsilon^4}{(1+\hat v+\sigma_\epsilon^2\hat q)^2}\right] & -\frac{\alpha}{(\lambda+q)^2}\left[\operatorname{tr}G^2\Sigma_M + \frac{\eta\sigma_\epsilon^2}{(1+\hat v+\sigma_\epsilon^2\hat q)^2}\right] \\ -(1+\hat v+\sigma_\epsilon^2\hat q)^2\operatorname{tr}G^2\Sigma_M - \sigma_\epsilon^2\hat q^2\operatorname{tr}G^2\Sigma_M^2 & \eta - \hat q^2\operatorname{tr}G^2\Sigma_M^2 \end{pmatrix}\begin{pmatrix}\partial_J\hat q \\ \partial_J\hat v\end{pmatrix} = \begin{pmatrix}1 \\ 0\end{pmatrix}.$$
For each $\alpha$, we can solve for $\partial_J\hat q$ and $\partial_J\hat v$ to get the final generalization error from
$$E_g = \partial_J\left[\frac{1}{M}w_*^\top\left(1 + \hat v + \sigma_\epsilon^2\hat q\right)\hat q\Sigma_M G\,w_*\right] = \frac{1}{M}w_*^\top\left[\partial_J\left(\hat q + \hat q\hat v + \sigma_\epsilon^2\hat q^2\right)\Sigma_M G - \left(1 + \hat v + \sigma_\epsilon^2\hat q\right)\hat q\Sigma_M G^2\left(\partial_J\hat v\,I + \sigma_\epsilon^2\partial_J\hat q\,I + \partial_J\hat q\,\Sigma_M\right)\right]w_*.$$
An example of these solutions can be found in Figure 16, where we show good agreement between theory and experiment.

E RESNET ON CIFAR EXPERIMENTS

Deeper networks in the rich regime can more easily outperform the infinite-width network over a larger range of P. Also, for larger L it is easier to deviate from the lazy regime at a given α. By contrast, on this task the shallower NTK∞ outperforms deeper NTK∞s. As before, ensembled lazy networks approach NTK∞ and the variance rises with P.

Figure 15: a) E_g for the centered predictor $\tilde f_\theta(x) - \tilde f_{\theta_0}(x)$ (solid) compared to the generalization of the uncentered predictor $\tilde f_\theta(x)$ (dashed).
At small α, the difference is negligible, while at large α the uncentered predictor does worse and does not approach eNTK_0. The worse generalization can be understood as $\tilde f_{\theta_0}(x)$ effectively adding initialization-dependent noise to the target y. b) The effect of ensembling is less beneficial for uncentered lazy networks. c) Color plot of E_g. The lazy regime differs from the eNTK_0 generalization (c.f. Figure 9). For the NTK∞, the generalization curves sum to give the mixed-mode curve, as observed in Bordelon et al. (2020). We see that this also holds for the eNTK_0 for sufficiently lazy networks, as predicted by the simple random feature model considered in section 4 of this paper. b) The variance curves for the same task. Again, for sufficiently lazy networks the variance is a sum of the variances of the individual pure-mode tasks, as predicted by our random feature model.



Figure 1: Generalization errors of depth L = 3 neural networks across a range of α values compared to NTK∞. The regression for NTK∞ was calculated using the Neural Tangents package (Novak et al., 2020). The exact scaling of NTK∞ is known to go asymptotically as P^{-2} for this task. a) Lazy networks perform strictly worse than NTK∞, while rich networks can outperform it for an intermediate range of P before their performance is also limited. b) Ensembling 20 networks substantially improves lazy network and eNTK_0 generalization, as well as asymptotic rich network generalization. This indicates that at sufficiently large P, these neural networks become limited by variance due to initialization. The error bars in a) and c) denote the variance due to both training set and initialization. The error bars in b), d) denote the variance due to the train set.

Figure 2: Phase plots in the P, α plane of a) the log generalization error $\log_{10} E_g(f^\star)$, b) the fraction of generalization error removed by ensembling, $1 - E_g(\bar f^\star)/E_g(f^\star)$, c) kernel-task alignment measured by $\frac{y^\top K_f y}{|y|^2\operatorname{Tr}K_f}$, where y and K_f are evaluated on test data. We have plotted 'x' markers in a) to show the points where the NNs were trained.

Figure 3: Kernel regression with eNTK f reproduces the learning curves of the NN with high fidelity. (a) Learning curves across different laziness settings α in a width 1000 network. The solid black curve is the infinite width network. Colored curves are the NN generalizations. Stars represent the eNTK f s, and lie on top of the corresponding NN learning curves. (b) The agreement of generalizations between NNs and eNTK f s across different N and α. Here the colors denote different α values while the dot, triangle and star markers denote networks of N = {177, 421, 1000} respectively.

Figure 4: Critical sample size P 1/2 measures the onset of the variance limited regime as a function of α at fixed N . (a) More feature learning (small α) delays the transition to the variance limited regime. (b) P 1/2 as a function of N for fixed α has roughly P 1/2 ∼ √ N scaling.

Figure5: The random feature model suggests three possible types of ensembling: averaging the output function f (x, θ), averaging eNTK f K(x, x ′ ; θ), and averaging the induced features ψ(x, θ). We analyze these ensembling methods for a k = 1 task with a width N = 100 ReLU network. (a) While all ensembling methods improve generalization, averaging either the kernel ⟨K⟩ or features ⟨ψ⟩ gives a better improvement to generalization than averaging the output function ⟨f ⟩. Computing final kernels for many richly trained networks and performing regression with this averaged kernel gives the best performance. (b) We plot the relative error of each ensembling method against the single init neural network. The gap between ensembling and the single init NN becomes evident for sufficiently large P ∼ P 1/2 . For small α, all ensembling methods perform comparably, while for large α ensembling the kernel or features gives much lower E g than averaging the predictors.

Figure 6: A toy model of noisy features reproduces the qualitative dependence of learning curves on kernel fluctuations and feature learning. (a) Empirical learning curves for networks of varying width N at large α. (b) Noisy kernel regression learning curves with noise Σ ϵ = σ 2 ϵ Σ M , where A is a projection matrix preserving the 20 − k top eigenmodes of Σ M , computed from the NTK ∞ of a depth-3 ReLU network. (c) This toy model reproduces the approximate scaling of the transition sample size P 1/2 ∼ N 1/2 if σ 2 ϵ ∼ N −1 . (d) NNs trained with varying richness α; small α improves both the early learning curve and the asymptotic behavior. (e) Theory curves for a kernel with the target eigenfunction's eigenvalue amplified, λ k → λ k + ∆λ k . This amplification mimics the effect of enhanced kernel alignment in the low-α regime; larger amplification improves generalization. (f) P -dependent alignment with ∆λ k ∼ √ P gives a better qualitative match to (d).
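A minimal version of the noisy-feature toy model can be simulated directly: the student performs kernel regression with features corrupted by Gaussian noise whose covariance is σ 2 ϵ Σ M . The sketch below uses a synthetic power-law spectrum in place of the NTK-derived Σ M and omits the projection A for brevity; all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

P, D = 60, 200                        # training samples, feature dimension
eigs = 1.0 / np.arange(1, D + 1)**2   # synthetic power-law stand-in for Sigma_M
Phi = rng.normal(size=(P, D)) * np.sqrt(eigs)  # clean Gaussian features
y = Phi[:, 2]                         # target: a single eigenfunction (k = 3)

def noisy_regression_error(sigma_eps, n_test=500):
    """Test error of ridgeless kernel regression when the student's
    features are psi = phi + eps, with eps ~ N(0, sigma_eps^2 Sigma_M)."""
    Phi_te = rng.normal(size=(n_test, D)) * np.sqrt(eigs)
    y_te = Phi_te[:, 2]
    noise_tr = rng.normal(size=(P, D)) * np.sqrt(eigs) * sigma_eps
    noise_te = rng.normal(size=(n_test, D)) * np.sqrt(eigs) * sigma_eps
    Psi_tr, Psi_te = Phi + noise_tr, Phi_te + noise_te
    K = Psi_tr @ Psi_tr.T
    alpha = np.linalg.solve(K + 1e-8 * np.eye(P), y)
    pred = Psi_te @ Psi_tr.T @ alpha
    return float(np.mean((pred - y_te)**2))
```

Sweeping sigma_eps (with σ 2 ϵ ∼ 1/N ) and P traces out learning curves like those in panel (b): larger feature noise degrades generalization, and the degradation sets in at smaller P .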

Figure 7: Phase plots of the fraction of the generalization error due to the initialization variance, the dataset variance, and their combined contribution. The columns correspond to polynomial regression tasks with degree 2, 3, and 4 polynomials. The neural network has width 1000 and depth 3. Note that the initialization variance dominates in the large-P , large-α regime.

Figure 8: A Wide ResNet (Zagoruyko & Komodakis, 2017) trained on a superclassed CIFAR task comparing animate vs. inanimate objects. Each learning curve is averaged over 5 different samples of the train set, yielding the means and error bars shown in the figures. a) Generalization error E g . The dashed lines are the error of a 20-fold ensemble over different values of α. Across all P , lazy networks attain worse generalization error. As with the MLP task, the best performing networks are ensembles of rich networks. b) The accuracy shows the same trend: richer networks perform better, and ensembling helps lazy networks more. c) Once P is large enough, lazier networks tend to benefit more from ensembling. d) Very lazy networks transition to variance-limited behavior earlier. For ResNets on this task, we see that rich, feature-learning networks eventually begin reducing their variance. Further details of the experiment are given in section A.1.

Figure 10: A fine-grained view of the generalization error across different datasets and ensembles. Solid curves are depth-3 neural networks; dashed curves are the infinite-width NTK (which has variance only over datasets). Each color is a set of networks trained on the same dataset but different initializations. Different colors correspond to different datasets indexed by d ∈ {0, . . . , 9}.

Figure 12: Empirical plot of the scaling of the variance of the eNTK 0 with N , with the variance taken over 10 initializations and averaged over 10 different datasets.

Figure 13: Sweep over depth L ∈ {2, 3, 4}. Deeper networks in the rich regime more easily outperform the infinite-width network over a larger range of P . Also, for larger L it is easier to deviate from the lazy regime at a given α. By contrast, on this task the shallower NTK ∞ outperforms the deeper NTK ∞ s. As before, ensembled lazy networks approach the NTK ∞ , and the variance rises with P .

Figure 14: Sweep over input dimension D ∈ {5, 25, 50}. At larger input dimensions, rich networks more easily outperform the NTK ∞ . This is a consequence of the task depending only on the low-dimensional projection β · x.

[Figure: uncentered color plot of E g ]

Figure 16: Verification of the Gaussian A model. Solid lines are theory; dots are experiments. (a) The effect of changing the student's RKHS dimension N H : double-descent overfitting peaks occur at P = N H . (b) The effect of additive noise in the student features, Σ ϵ = σ 2 ϵ Σ M . (c) Learning curves for fitting the k-th eigenfunction; all mode errors exhibit a double-descent peak at P = N H regardless of the task. (d) Regularization can prevent the overfitting peak.
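The overfitting peak at P = N H in panel (a) is easy to reproduce with a minimal Gaussian random-feature regression: a teacher with a power-law spectrum, and a student that sees the features through a random Gaussian projection A of rank N H . The spectrum, dimensions, and function name below are illustrative stand-ins, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_feature_error(P, N_H, D=300, trials=20):
    """Average test error of ridgeless regression with N_H Gaussian
    student features; the error peaks near the interpolation
    threshold P = N_H (double descent)."""
    eigs = 1.0 / np.arange(1, D + 1)**2   # teacher power-law spectrum
    errs = []
    for _ in range(trials):
        A = rng.normal(size=(D, N_H)) / np.sqrt(D)       # student projection
        Phi = rng.normal(size=(P, D)) * np.sqrt(eigs)    # teacher features
        Phi_te = rng.normal(size=(500, D)) * np.sqrt(eigs)
        y, y_te = Phi[:, 0], Phi_te[:, 0]                # top-eigenmode target
        Psi, Psi_te = Phi @ A, Phi_te @ A                # what the student sees
        w, *_ = np.linalg.lstsq(Psi, y, rcond=None)      # min-norm / least squares
        errs.append(np.mean((Psi_te @ w - y_te)**2))
    return float(np.mean(errs))
```

Sweeping P for fixed N H traces out the double-descent curve; adding a ridge penalty to the solve suppresses the peak, as in panel (d).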


Figure 17: Width-500, depth-3 MLP learning a D = 25 mixture of linear and cubic polynomials. a) Generalization error of the NTK ∞ (solid black) and the MLP (solid colored lines) on the mixed-mode task. The dashed lines are convex combinations of the generalization curves for the pure-mode k = 1 and k = 3 tasks. For the NTK ∞ , the pure-mode generalization curves sum to give the mixed-mode curve, as observed in Bordelon et al. (2020). We see that this also holds for the eNTK 0 for sufficiently lazy networks, as predicted by the simple random feature model considered in section 4 of this paper. b) The variance curves for the same task. Again, for sufficiently lazy networks the variance is a sum of the variances of the individual pure-mode tasks, as predicted by our random feature model.

). Each superclass consists of 20,000 training examples and 4,000 test examples drawn from the CIFAR-10 dataset. On subsets of this dataset, we train wide residual networks (ResNets) (Zagoruyko & Komodakis, 2017) of width 64 and block size 1 with the NTK parameterization (Jacot et al., 2018), using mini-batch gradient descent with a batch size of 256 and MSE loss. Step sizes are governed by the Adam optimizer Kingma & Ba (

