WHEN DOES PRECONDITIONING HELP OR HURT GENERALIZATION?

Abstract

While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the implicit bias of optimizers affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of preconditioned ridgeless regression in the overparameterized regime, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal preconditioner P for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between first- and second-order updates. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioning can lead to a more efficient decrease in the population risk. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.

1. INTRODUCTION

We study the generalization properties of an estimator θ̂ obtained by minimizing the empirical risk (or the training error) L(f_θ) via a preconditioned gradient update with preconditioner P:

θ_{t+1} = θ_t − η P(t) ∇_{θ_t} L(f_{θ_t}), t = 0, 1, . . .  (1.1)

Setting P = I recovers gradient descent (GD). Choices of P which exploit second-order information include the inverse Fisher information matrix, which gives natural gradient descent (NGD) (Amari, 1998); the inverse Hessian, which leads to Newton's method (LeCun et al., 2012); and diagonal matrices estimated from past gradients, which include various adaptive gradient methods (Duchi et al., 2011; Kingma & Ba, 2014). These preconditioners often alleviate the effect of pathological curvature and speed up optimization, but their impact on generalization has been under debate: Wilson et al. (2017) reported that in neural network optimization, adaptive or second-order methods generalize worse than gradient descent (GD), whereas other empirical studies showed that second-order methods achieve comparable, if not better, generalization (Xu et al., 2020). The generalization properties of optimizers relate to the discussion of implicit bias (Gunasekar et al., 2018a), i.e. preconditioning may lead to a different converged solution (with potentially the same training loss), as illustrated in Figure 1. While many explanations have been proposed, our starting point is the well-known observation that GD often implicitly regularizes the parameter ℓ2 norm. For instance, in overparameterized least squares regression, GD and many first-order methods find the minimum ℓ2-norm solution from zero initialization (without explicit regularization), but preconditioned updates may not.
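The update (1.1) is simple to state concretely. Below is a minimal NumPy sketch for overparameterized least squares (the dimensions, covariance, step size, and iteration count are illustrative assumptions, not values from the paper); the only difference between the two optimizers is the fixed preconditioner P:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                                   # overparameterized: d > n
Sigma = np.diag(np.linspace(0.5, 2.0, d))        # population covariance (= population Fisher)
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
y = rng.standard_normal(n)

def precond_gd(P, eta=0.1, steps=10000):
    """Update (1.1): theta <- theta - eta * P * grad of the empirical risk."""
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n
        theta = theta - eta * P @ grad
    return theta

theta_gd  = precond_gd(np.eye(d))                # P = I: gradient descent
theta_ngd = precond_gd(np.linalg.inv(Sigma))     # P = F^{-1}: natural gradient descent
# Both drive the training error to zero, but at different parameter vectors.
```

Both runs interpolate the training data; the gap between the two converged parameter vectors is exactly the implicit-bias difference discussed below.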
This being said, while the minimum ℓ2-norm solution can generalize well in the overparameterized regime (Bartlett et al., 2019), it is unclear whether preconditioning leads to inferior solutions: even in the simple setting of overparameterized linear regression, quantitative understanding of how preconditioning affects generalization is largely lacking. Motivated by the observations above, in Section 3 we start with overparameterized (unregularized) least squares regression and analyze the stationary solution (t → ∞) of update (1.1) under a time-invariant preconditioner. Extending previous analysis in the proportional limit (Hastie et al., 2019), we consider a more general random design setting and derive the exact population risk in its bias-variance decomposition. We characterize the optimal P within a general class of preconditioners for both the bias and the variance, and focus on the comparison between GD, for which P is the identity, and NGD, for which P is the inverse population Fisher information matrix. We find that the comparison of generalization is affected by the following factors:
1. Label Noise: Additive noise in the labels leads to the variance term in the risk. We prove that NGD achieves the optimal variance among a general class of preconditioned updates.
2. Model Misspecification: Under misspecification, there does not exist a perfect f_θ that recovers the true function (target). We argue that this factor acts similarly to additional label noise, and thus NGD may also be beneficial when the model is misspecified.
3. Data-Signal Alignment: Alignment describes how the target signal distributes among the input features. We show that GD achieves lower bias when the signal is isotropic, whereas NGD is preferred under "misalignment", i.e. when the target function focuses on small feature directions.
Beyond the decomposition of the stationary risk, our findings in Sections 4 and 5 are summarized as:
• In Sections 4.1 and 4.2 we discuss how the bias-variance tradeoff can be realized by different choices of preconditioner P (e.g. interpolating between GD and NGD) or by early stopping.
• In Section 4.3 we extend our analysis to regression in the RKHS and show that under early stopping, a preconditioned update interpolating between GD and NGD achieves the minimax optimal convergence rate in far fewer steps, and thus reduces the population risk faster than GD.
• In Section 5 we empirically test how well our findings in linear models carry over to neural networks: under a student-teacher setup, we compare the generalization of GD with preconditioned updates and illustrate the influence of all aforementioned factors. The performance of neural networks under a variety of manipulations results in trends that align with our theoretical analysis.

2. BACKGROUND AND RELATED WORKS

Natural Gradient Descent. NGD is a second-order method originally proposed in Amari (1997). Consider a data distribution p(x) on the space X, a function f_θ : X → Z parameterized by θ, and a loss function L(X, f_θ) = (1/n) Σ_{i=1}^n l(y_i, f_θ(x_i)), where l : Y × Z → R. Also suppose a probability distribution p(y|z) = p(y|f_θ(x)) is defined on the space of labels. Then, the natural gradient is defined as ∇̃_θ L(X, f_θ) = F^{−1} ∇_θ L(X, f_θ), where F = E[∇_θ log p(x, y|θ) ∇_θ log p(x, y|θ)^T] is the Fisher information matrix, or simply the (population) Fisher. Note that expectations in the Fisher are taken under the joint distribution of the model p(x, y|θ) = p(x)p(y|f_θ(x)). In the literature, the Fisher is sometimes defined under the empirical data distribution {x_i}_{i=1}^n (Amari et al., 2000). We instead refer to this quantity as the sample Fisher, the properties of which influence optimization and have been studied in Karakida et al. (2018); Kunstner et al. (2019); Thomas et al. (2020). We remark that in linear and kernel regression under the squared loss, sample Fisher-based updates give the same stationary solution as GD (see Section 3), whereas population Fisher-based updates may not. While the population Fisher is typically difficult to obtain, extra unlabeled data can be used in its estimation, which empirically improves generalization (Pascanu & Bengio, 2013). Moreover, under structural assumptions, parametric approaches to estimating F can be more sample-efficient (Martens & Grosse, 2015; Ollivier, 2015), and thus close the gap between the sample and population Fisher. When the per-instance loss is the negative log-probability of an exponential family, the sample Fisher coincides with the generalized Gauss-Newton matrix (Martens, 2014). In least squares regression, which is the focus of this work, this quantity also coincides with the Hessian.
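For a linear model with Gaussian likelihood, the two Fisher matrices above take an explicit form: the sample Fisher is the empirical input covariance XᵀX/n and the population Fisher is the population covariance Σ_X, so the population Fisher can be estimated from unlabeled inputs alone. A small sketch (the diagonal Σ_X and the sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Sigma = np.diag(np.arange(1.0, d + 1))   # population Fisher of a linear-Gaussian model

def sample_fisher(n):
    """Sample Fisher X^T X / n, computed from n (unlabeled) inputs."""
    X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
    return X.T @ X / n

# More unlabeled data gives a better estimate of the population Fisher.
err_1e2 = np.linalg.norm(sample_fisher(100) - Sigma)
err_1e5 = np.linalg.norm(sample_fisher(100_000) - Sigma)
```

The estimation error shrinks at the usual O(n^{-1/2}) rate, which is the sense in which extra unlabeled data closes the gap between the sample and population Fisher.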
We thus take NGD as a representative example of preconditioned updates, and we expect our findings to also translate to other second-order methods (not including adaptive gradient methods) in regression problems.
Analysis of Preconditioned Gradient Descent. While Wilson et al. (2017) outlined one example under fixed training data where GD generalizes better than adaptive methods, in the online learning setting, for which optimization speed relates to generalization, several works have shown the advantage of preconditioning (Levy & Duchi, 2019; Zhang et al., 2019a). In addition, Zhang et al. (2019b); Cai et al. (2019) established convergence and generalization guarantees of sample Fisher-based updates for neural networks in the kernel regime. Lastly, the generalization of different optimizers relates to the notion of "sharpness" (Keskar et al., 2016; Dinh et al., 2017), and it has been argued that second-order updates tend to find sharper minima (Wu et al., 2018). We note that two concurrent works also discussed the generalization performance of preconditioned updates. Wadia et al. (2020) connected second-order methods with data whitening in linear models, and qualitatively showed that whitening (and thus second-order updates) harms generalization in certain cases. Vaswani et al. (2020) analyzed the complexity of the maximum P-margin solution in linear classification problems. We emphasize that instead of upper bounding the risk (e.g. via Rademacher complexity), which may not determine the optimal P for the generalization error, we compute the exact risk for least squares regression, which allows us to precisely compare different preconditioners.

3. ASYMPTOTIC RISK OF RIDGELESS INTERPOLANTS

In this section we consider the following setup: given n training samples {x_i}_{i=1}^n labeled by a teacher model (target function) f* : R^d → R with additive noise, y_i = f*(x_i) + ε_i, we learn a linear student model f_θ by minimizing the squared loss L(X, f_θ) = (1/n) Σ_{i=1}^n (y_i − x_i^T θ)^2. We assume a random design: x_i = Σ_X^{1/2} z_i, where z_i ∈ R^d is an i.i.d. vector with zero-mean, unit-variance entries with finite 12th moment, and ε is i.i.d. noise independent of z with mean 0 and variance σ^2. Our goal is to compute the population risk R(f) = E_x[(f*(x) − f(x))^2] in the proportional asymptotic limit:
• (A1) Overparameterized Proportional Limit: n, d → ∞, d/n → γ ∈ (1, ∞).
(A1) entails that the number of features (or parameters) is larger than the number of samples, so there exist multiple empirical risk minimizers with potentially different generalization properties. Denote by X = [x_1, ..., x_n]^T ∈ R^{n×d} the data matrix and by y ∈ R^n the corresponding label vector. We optimize the parameters θ via a preconditioned gradient flow with preconditioner P(t) ∈ R^{d×d}:

∂θ(t)/∂t = −P(t) ∂L(θ(t))/∂θ(t) = (1/n) P(t) X^T (y − Xθ(t)), θ(0) = 0.  (3.1)

In this linear setup, many common choices of preconditioner do not change through time: under a Gaussian likelihood, the sample Fisher (and also the Hessian) corresponds to the sample covariance X^T X/n up to variance scaling, whereas the population Fisher corresponds to the population covariance F = Σ_X. We thus limit our analysis to fixed preconditioners of the form P(t) =: P. Write the parameters at time t under update (3.1) with fixed P as θ_P(t). For positive definite P, the stationary solution is given as θ̂_P := lim_{t→∞} θ_P(t) = P X^T (X P X^T)^{−1} y. One may check that the discrete-time gradient descent update (with appropriate step size) and other variants that do not alter the span of the gradient (e.g. stochastic gradient or momentum) converge to the same solution as well.
Intuitively speaking, if the data distribution (blue contour in Figure 2) is not isotropic, then certain directions will be more "important" than others. In this case the uniform ℓ2 shrinkage which GD implicitly provides may not be most desirable, and a P that takes the data geometry into account may lead to better generalization instead. The above intuition will be made rigorous in this section.
Remark. θ̂_P is the minimum ‖θ‖_{P^{−1}}-norm interpolant: θ̂_P = argmin_θ ‖θ‖_{P^{−1}} s.t. Xθ = y, for positive definite P. For GD this translates to the parameter ℓ2 norm, whereas for NGD (P = F^{−1}), the implicit bias is the ‖θ‖_{Σ_X} norm. Since E[f(x)^2] = ‖θ‖²_{Σ_X}, NGD finds an interpolating function with the smallest norm under the data distribution. We empirically observe this divide between small parameter norm and small function norm in neural networks as well (see Figure 1 and Appendix A.1). We highlight the following choices of P and the corresponding stationary solution θ̂_P as t → ∞.
• Identity: P = I_d recovers GD, which converges to the min ℓ2-norm interpolant (also true for momentum GD and SGD); we write this as θ̂_I := X^T (XX^T)^{−1} y and refer to it as the GD solution.
• Population Fisher: P = F^{−1} = Σ_X^{−1} leads to the estimator θ̂_{F^{−1}}, which we refer to as the NGD solution.
• Sample Fisher: since the sample Fisher is rank-deficient, we may add a damping term, P = (X^T X + λI_d)^{−1}, or take the pseudo-inverse, P = (X^T X)^†. In both cases the gradient is still spanned by X, and thus the update finds the same min ℓ2-norm solution θ̂_I (also true for full-matrix Adagrad (Agarwal et al., 2018)), although the trajectory differs (see Figure 3).
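The stationary solution and its minimum-norm characterization are easy to verify numerically. A sketch under an arbitrary positive definite P (all values here are illustrative): θ̂_P interpolates the data, and perturbing it within the null space of X yields other interpolants with strictly larger ‖·‖_{P^{-1}} norm:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
P = np.diag(rng.uniform(0.5, 2.0, d))     # an arbitrary positive definite preconditioner

theta_P = P @ X.T @ np.linalg.solve(X @ P @ X.T, y)   # stationary solution of (3.1)
resid = np.linalg.norm(X @ theta_P - y)               # ~0: theta_P interpolates

def pinv_norm_sq(theta):
    """Squared ||theta||_{P^{-1}} norm."""
    return theta @ np.linalg.solve(P, theta)

# Another interpolant: shift theta_P along the null space of X.
null_dir = (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.standard_normal(d)
other = theta_P + null_dir
```

With P = I this reduces to the min ℓ2-norm (GD) solution, and with P = Σ_X^{-1} to the NGD solution.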

Remark.

The above choices reveal a gap between sample- and population-based P: while the sample Fisher accelerates optimization (Zhang et al., 2019b), the following sections demonstrate generalization properties possessed only by the population Fisher.
[Figure 3 caption fragment: risk over time for preconditioners including Σ_X^{−1} (blue) and (X^T X)^† (cyan); time is rescaled differently for each curve (convergence speeds are not comparable). Observe that GD and sample NGD give the same stationary risk.]
We compare the population risk of the GD solution θ̂_I and the NGD solution θ̂_{F^{−1}} in its bias-variance decomposition w.r.t. the label noise (Hastie et al., 2019) and discuss the two components separately:

R(θ̂) = E_x[(f*(x) − ⟨x, E_ε[θ̂]⟩)^2] + tr(Cov(θ̂) Σ_X),

where the first term is the bias B(θ̂) and the second term the variance V(θ̂). Note that the bias does not depend on the label noise ε, and the variance does not depend on the teacher model f*. Additionally, given that f* can be independently decomposed into a linear component on the features x and a residual, f*(x) = ⟨x, θ*⟩ + f*_c(x), we can separate the bias term into a well-specified component ‖θ* − E θ̂‖²_{Σ_X}, which captures the difficulty in learning θ*, and a misspecified component, which corresponds to the error due to fitting f*_c (beyond what the student can represent).
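For a fixed design X, the decomposition can be checked numerically: the stationary solution is linear in y, so the bias and variance have closed forms, and a Monte Carlo average of the risk over the label noise should reproduce their sum. A sketch using the GD solution (dimensions, noise level, and covariance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 30, 60, 0.25
Sigma = np.diag(np.linspace(0.5, 2.0, d))
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
theta_star = rng.standard_normal(d) / np.sqrt(d)       # well-specified linear teacher

A = X.T @ np.linalg.solve(X @ X.T, np.eye(n))          # theta_hat = A y (GD solution)
mean_est = A @ (X @ theta_star)                        # E_eps[theta_hat]
bias = (theta_star - mean_est) @ Sigma @ (theta_star - mean_est)
variance = sigma2 * np.trace(A @ A.T @ Sigma)          # tr(Cov(theta_hat) Sigma)

# Monte Carlo estimate of the risk over fresh label-noise draws:
risks = []
for _ in range(4000):
    y = X @ theta_star + np.sqrt(sigma2) * rng.standard_normal(n)
    diff = theta_star - A @ y
    risks.append(diff @ Sigma @ diff)
mc_risk = np.mean(risks)                               # should match bias + variance
```

Replacing A with P Xᵀ(XPXᵀ)^{-1} for any other P gives the corresponding decomposition for a preconditioned update.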

3.1. THE VARIANCE TERM: NGD IS OPTIMAL

We first characterize the stationary variance, which is independent of the teacher model f*. We restrict ourselves to preconditioners satisfying the following assumption on the spectral distribution:
• (A2) Converging Eigenvalues: P is positive definite and, as n, d → ∞, the spectral distribution of Σ_XP := P^{1/2} Σ_X P^{1/2} converges weakly to H_XP supported on [c, C] for c, C > 0.
The following theorem characterizes the asymptotic variance and the corresponding optimal P.
Theorem 1. Given (A1)-(A2), the asymptotic variance is given as

V(θ̂_P) →_p σ^2 lim_{λ→0+} [m′(−λ) m(−λ)^{−2} − 1],  (3.2)

where m(z) > 0 is the Stieltjes transform of the limiting eigenvalue distribution of (1/n) X P X^T, satisfying the self-consistent equation m(z)^{−1} = −z + γ ∫ τ (1 + τ m(z))^{−1} dH_XP(τ). Furthermore, under (A1)-(A2), V(θ̂_P) ≥ σ^2 (γ − 1)^{−1}, and the equality is achieved by P = F^{−1}.
Formula (3.2) is a direct extension of (Hastie et al., 2019, Theorem 4), and can be obtained from the general results of Ledoit & Péché (2011); Dobriban et al. (2018). Theorem 1 implies that preconditioning with the inverse population Fisher F^{−1} results in the optimal stationary variance, which is supported by Figure 5(a). In other words, when the labels are noisy so that the risk is dominated by the variance, we expect NGD to generalize better upon convergence. We emphasize that this advantage is only present when the population Fisher is used, but not its sample-based counterpart (which gives θ̂_I). In Appendix A.3 we discuss the sample complexity of estimating F from unlabeled data.
Misspecification ≈ Label Noise. Under model misspecification there does not exist a linear student that perfectly recovers the teacher f*, which we may decompose as f*(x) = x^T θ* + f*_c(x).
In the simple case where f*_c is an independent linear function on unobserved features (Hastie et al., 2019, Section 5): y_i = x_i^T θ* + x_{c,i}^T θ_c + ε_i, where x_{c,i} ∈ R^{d_c} is independent of x_i, we can show that the additional error in the bias term due to misspecification is analogous to the variance term above:
Corollary 2. Under (A1)-(A2), for the above unobserved-features model with E[x_c x_c^T] = Σ_X^c and E[θ_c θ_c^T] = d_c^{−1} Σ_θ^c, the additional error due to misspecification can be written as B_c(θ̂_P) = d_c^{−1} tr(Σ_X^c Σ_θ^c)(V(θ̂_P) + 1), where V(θ̂_P) is the variance in (3.2).
In this case, misspecification can be interpreted as additional label noise, for which NGD is optimal by Theorem 1. While Corollary 2 describes one specific example of misspecification, we may expect similar outcomes in broader settings. In particular, (Mei & Montanari, 2019, Remark 5) indicates that for many nonlinear f*_c, the misspecified bias is the same as the variance due to label noise. We empirically observe similar findings under general covariance in Figure 4, in which f*_c is a quadratic function. Observe that NGD leads to lower bias compared to GD as we further misspecify the teacher model.

3.2. THE BIAS TERM: ALIGNMENT AND "DIFFICULTY" OF LEARNING

We now analyze the bias term when the teacher model is linear on the input features (hence well-specified): f*(x) = x^T θ*. Extending the random effects hypothesis in Dobriban et al. (2018), we consider a more general prior on θ*: E[θ* θ*^T] = d^{−1} Σ_θ, and assume the following joint relations:
• (A3) Joint Convergence: Σ_X and P share the same eigenvector matrix U. The empirical distributions of the elements of (e_x, e_θ, e_xp) jointly converge to random variables (υ_x, υ_θ, υ_xp) supported on [c′, C′] for c′, C′ > 0, where e_x, e_xp are the eigenvalues of Σ_X, Σ_XP, and e_θ = diag(U^T Σ_θ U).
We remark that when P = I_d, previous works (Hastie et al., 2019; Xu & Hsu, 2019) considered the special case of an isotropic prior Σ_θ = I_d. Our assumption thus allows for analysis of the bias term under much more general Σ_θ, which gives rise to interesting phenomena that are not captured by simplified settings, such as non-monotonic bias and variance for γ > 1 (see Figure 15) and the epoch-wise double descent phenomenon (see Appendix A.5). Under this general setup, we have the following characterization of the asymptotic bias and the corresponding optimal preconditioner:
Theorem 3. Under (A1), (A3), the expected bias B̄(θ̂_P) := E_{θ*}[B(θ̂_P)] is given as

B̄(θ̂_P) →_p lim_{λ→0+} m′(−λ) m(−λ)^{−2} E[υ_x υ_θ (1 + υ_xp m(−λ))^{−2}],  (3.3)

where the expectation is taken over υ and m(z) is the Stieltjes transform defined in Theorem 1. Furthermore, among all P satisfying (A3), the optimal bias is achieved by P = U diag(e_θ) U^T.
Note that the optimal P depends on the "orientation" of the teacher model Σ_θ, which is usually not known in practice. This result can thus be interpreted as a no-free-lunch characterization of choosing an optimal preconditioner for the bias term a priori. As a consequence of the theorem, when the true parameters θ* have roughly equal magnitude (isotropic), GD achieves lower bias (see Figure 5(b), where Σ_θ = I_d). On the other hand, NGD leads to lower bias when Σ_X is "misaligned" with Σ_θ, i.e. when θ* focuses on the least varying directions of the input features (see Figure 5(c), where Σ_θ = Σ_X^{−1}), in which case learning is intuitively difficult since the features are not useful.
Connection to Source Condition. The "difficulty" of learning above relates to the source condition in the RKHS literature (Cucker & Smale, 2002) (i.e., E‖Σ_X^{r/2} θ*‖ < ∞; see (A4) in Section 4.3), in which the coefficient r can be interpreted as a measure of "misalignment".
To elaborate on this connection, we consider the setting Σ_θ = Σ_X^r: note that as r decreases, the teacher θ* focuses more on input features with small magnitude, so the learning problem becomes harder, and vice versa. In this case we can show a clear transition in r for the comparison between GD and NGD.
Proposition 4 (Informal). When Σ_θ = Σ_X^r, there exists a transition point r* ∈ (−1, 0) such that GD achieves lower (resp. higher) stationary bias than NGD when r > r* (resp. r < r*).
The above proposition confirms that for the stationary bias (well-specified), NGD outperforms GD in the misaligned setting (i.e., small r), whereas GD has the advantage when the signal is aligned (large r). For a formal statement and more discussion of the transition point r*, see Appendix A.2.
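Both comparisons in this section, the optimal variance of NGD (Theorem 1) and the alignment-dependent bias (Theorem 3, Proposition 4), can be probed at large finite size, since for a fixed design X the stationary variance and expected bias have closed forms. A sketch (the anisotropic spectrum, dimensions, and the two values of r are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma2 = 300, 600, 1.0                 # gamma = d / n = 2
Sigma = np.diag(np.linspace(0.2, 5.0, d))    # anisotropic population Fisher
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
P_gd, P_ngd = np.eye(d), np.linalg.inv(Sigma)

def variance(P):
    """Stationary variance tr(Cov(theta_hat) Sigma) with theta_hat = P X^T (X P X^T)^{-1} y."""
    A = P @ X.T @ np.linalg.solve(X @ P @ X.T, np.eye(n))
    return sigma2 * np.trace(A @ A.T @ Sigma)

def expected_bias(P, r):
    """Expected stationary bias under the prior E[theta* theta*^T] = Sigma^r / d."""
    M = P @ X.T @ np.linalg.solve(X @ P @ X.T, X)
    R = np.eye(d) - M
    return np.trace(R.T @ Sigma @ R @ np.diag(np.diag(Sigma) ** r)) / d

# Theorem 1: NGD should approach the optimal variance sigma2 / (gamma - 1) (= 1 here).
v_gd, v_ngd = variance(P_gd), variance(P_ngd)

# Proposition 4: GD wins on bias for the isotropic prior (r = 0),
# NGD wins for the misaligned prior (r = -1).
b_gd_iso, b_ngd_iso = expected_bias(P_gd, 0.0), expected_bias(P_ngd, 0.0)
b_gd_mis, b_ngd_mis = expected_bias(P_gd, -1.0), expected_bias(P_ngd, -1.0)
```

At this size the finite-sample quantities already sit close to their asymptotic limits.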

4. BIAS-VARIANCE TRADEOFF

Our characterization of the stationary risk suggests that the preconditioners achieving the optimal bias and the optimal variance are generally different. This section discusses how a bias-variance tradeoff can be realized by interpolating between preconditioners or by early stopping. Additionally, we analyze the nonparametric least squares setting and show that by balancing the bias and variance, a preconditioned update that interpolates between GD and NGD decreases the population risk faster than GD.

4.1. INTERPOLATING BETWEEN PRECONDITIONERS

Depending on the orientation of the teacher model, we may expect a bias-variance tradeoff in choosing P. Intuitively, given P_1 that minimizes the bias and P_2 that minimizes the variance, it is possible that a preconditioner interpolating between P_1 and P_2 can balance the bias and variance and thus generalize better. The following proposition confirms this intuition in a setup with general Σ_X and isotropic Σ_θ, for which GD (P = I_d) achieves the optimal stationary bias and NGD (P = F^{−1}) achieves the optimal variance.
Proposition 5 (Informal). Let Σ_X ≠ I_d and Σ_θ = I_d. Consider the following interpolation schemes: (i) P_α = αΣ_X^{−1} + (1−α)I_d; (ii) P_α = (αΣ_X + (1−α)I_d)^{−1}; (iii) P_α = Σ_X^{−α}. The stationary variance monotonically decreases with α ∈ [0, 1] for all three choices. For (i), the stationary bias monotonically increases with α ∈ [0, 1], whereas for (ii) and (iii), the bias monotonically increases with α in a certain range that depends on Σ_X.
In other words, as the signal-to-noise ratio (SNR) decreases, one can increase α, which makes the update closer to NGD, to improve generalization, and vice versa. Indeed, as shown in Figures 7 and 16(c), at a certain SNR, interpolating between Σ_X^{−1} and Σ_θ can improve the stationary risk. Remark.
Two of the aforementioned interpolation schemes reflect common practical choices: the additive interpolation (ii) corresponds to the damping term used to stably invert the Fisher, whereas the geometric interpolation (iii) resembles the "conservative" square-root scaling in adaptive gradient methods.
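The variance part of Proposition 5 can be probed numerically: for scheme (i), sweeping α from 0 (GD) to 1 (NGD) should monotonically lower the stationary variance. A sketch with illustrative dimensions and spectrum (the α grid is an assumption for the example):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma2 = 200, 400, 1.0
Sigma = np.diag(np.linspace(0.2, 5.0, d))
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
Sigma_inv = np.linalg.inv(Sigma)

def variance(P):
    """Stationary variance tr(Cov(theta_hat) Sigma)."""
    A = P @ X.T @ np.linalg.solve(X @ P @ X.T, np.eye(n))
    return sigma2 * np.trace(A @ A.T @ Sigma)

# Scheme (i): additive interpolation P_alpha = alpha * Sigma^{-1} + (1 - alpha) * I.
alphas = np.linspace(0.0, 1.0, 6)
vs = [variance(a * Sigma_inv + (1 - a) * np.eye(d)) for a in alphas]
# Proposition 5 predicts the stationary variance decreases as alpha moves toward NGD.
```

Replacing the line defining P_α implements schemes (ii) and (iii); for the bias one would substitute the closed form from the previous section.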

4.2. THE ROLE OF EARLY STOPPING

Thus far we have considered the stationary solution of the unregularized objective. It is known that the bias-variance tradeoff can also be controlled by explicit or algorithmic regularization. We briefly comment on the effect of early stopping, starting from the monotonicity of the variance term.
Proposition 6. For all P satisfying (A2), the variance V(θ_P(t)) monotonically increases with time.
The proposition confirms the intuition that early stopping reduces overfitting. Variance reduction can benefit GD in its comparison to NGD, which achieves the lowest stationary variance: indeed, Figures 3 and 19 show that under early stopping, GD may be favored even if NGD has lower stationary risk. On the other hand, early stopping may not improve the bias term. While a complete analysis is difficult, partially due to the potential non-monotonicity of the bias term (see Appendix A.5), we speculate that previous findings for the stationary bias also translate to early stopping. As a concrete example, we consider well-specified settings in which either GD or NGD achieves the optimal stationary bias, and demonstrate that such optimality is also preserved under early stopping:
Proposition 7. Given (A1), denote the optimal early stopping bias as B_opt(θ) = inf_{t≥0} B(θ(t)). When Σ_θ = Σ_X^{−1}, we have B_opt(θ_P) ≥ B_opt(θ_{F^{−1}}) for all P satisfying (A3); whereas when Σ_θ = I_d, we have B_opt(θ_{F^{−1}}) ≥ B_opt(θ_I).
Figure 19 illustrates that the observed trend in the stationary bias (well-specified) is indeed preserved under optimal early stopping: GD or NGD achieves lower early stopping bias under the isotropic or misaligned teacher model, respectively. We leave a more precise characterization to future work.
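Proposition 6 can be checked directly from the gradient-flow solution of (3.1): θ_P(t) = P Xᵀ(XPXᵀ)^{-1}(I − e^{−tB}) y with B = XPXᵀ/n, so the variance at any time t is computable in closed form. A sketch for GD (dimensions and the time grid are illustrative; any positive definite P works the same way):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, sigma2 = 100, 200, 1.0
Sigma = np.diag(np.linspace(0.5, 2.0, d))
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
P = np.eye(d)                              # gradient descent

B = X @ P @ X.T / n
w, U = np.linalg.eigh(B)                   # eigendecomposition to form e^{-tB}
C = P @ X.T @ np.linalg.solve(X @ P @ X.T, np.eye(n))

def variance_at(t):
    """Variance tr(Cov(theta_P(t)) Sigma) of the gradient-flow iterate at time t."""
    A_t = C @ (U @ np.diag(1.0 - np.exp(-t * w)) @ U.T)
    return sigma2 * np.trace(A_t @ A_t.T @ Sigma)

ts = [0.5, 1.0, 2.0, 5.0, 20.0, 100.0]
vs = [variance_at(t) for t in ts]
# Proposition 6: the variance increases monotonically toward its stationary value.
```

The monotonicity here is exact for any fixed design, since the factors (1 − e^{−tw})² grow in t for every mode.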

4.3. FAST DECAY OF POPULATION RISK

Our previous analysis suggests that certain preconditioners can achieve lower population risk, but it does not address which method decreases the risk more efficiently. Knowing that preconditioned updates may accelerate optimization, one natural question to ask is: is this speedup also present for generalization under a fixed dataset? We answer this question in the affirmative in a slightly different model: we study least squares regression in the RKHS, and show that a preconditioned update that interpolates between GD and NGD achieves the minimax optimal rate in far fewer steps than GD. We provide a brief outline and defer the details to Appendix D.9.1.
Let H be an RKHS included in L²(P_X) equipped with a bounded kernel function k, and let K_x ∈ H be the Riesz representation of the kernel function. Define S as the canonical operator from H to L²(P_X), and write Σ = S*S and L = SS*. We aim to learn the teacher model f* under the following standard regularity conditions:
• (A4) Source Condition: ∃ r ∈ (0, ∞), M > 0 s.t. f* = L^r h* for h* ∈ L²(P_X) and ‖f*‖_∞ ≤ M.
• (A5) Capacity Condition: ∃ s > 1 s.t. tr(Σ^{1/s}) < ∞ and 2r + s^{−1} > 1.
• (A6) Regularity of RKHS: ∃ μ ∈ [s^{−1}, 1], C_μ > 0 s.t. sup_{x∈supp(P_X)} ‖Σ^{1/2−1/μ} K_x‖_H ≤ C_μ.
Note that in the source condition (A4), the coefficient r controls the complexity of the teacher model and relates to the notion of model misalignment in Section 3.2: a large r indicates a smoother teacher model which is "easier" to learn, and vice versa (Steinwart et al., 2009). Given training points {(x_i, y_i)}_{i=1}^n, we consider the following preconditioned update on the student model f_t ∈ H:

f_t = f_{t−1} − η (Σ̂ + αI)^{−1} (Σ̂ f_{t−1} − Ŝ*Y), f_0 = 0,  (4.1)

where Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i} and Ŝ*Y = (1/n) Σ_{i=1}^n y_i K_{x_i}.
In this setup, the population Fisher corresponds to the covariance operator Σ, and thus (4.1) can be interpreted as an additive interpolation between GD and NGD: the update with large α behaves like GD, and with small α like NGD. Related to our update is the FALKON algorithm (Rudi et al., 2017), a preconditioned gradient method for kernel ridge regression. The key difference is that we consider optimizing the original objective (instead of a regularized version as in FALKON) under early stopping. Importantly, since we aim to understand how preconditioning affects generalization, explicit regularization should not be taken into account. The following theorem shows that with appropriately chosen α, the preconditioned update (4.1) leads to a more efficient decrease in the population risk, due to faster decay of the bias term.
Theorem 8 (Informal). Under (A4)-(A6), the population risk of f_t can be bounded as R(f_t) = ‖Sf_t − f*‖²_{L²(P_X)} ≤ B(t) + V(t), with B(t), V(t) defined in Appendix D.9. Given r ≥ 1/2 or μ ≤ 2r, the preconditioned update (4.1) with α = n^{−2s/(2rs+1)} achieves the minimax optimal convergence rate R(f_t) = Õ(n^{−2rs/(2rs+1)}) in t = Θ(log n) steps, whereas ordinary gradient descent requires t = Θ(n^{2rs/(2rs+1)}) steps.
We comment that the optimal interpolation coefficient α and stopping time t are chosen to balance the bias B(t) and variance V(t). Note that α depends on the teacher model in the following way: for n > 1, α decreases as r becomes smaller, which corresponds to a non-smooth and "difficult" f*, and vice versa. This agrees with our previous observation that NGD is advantageous when the teacher model is difficult to learn. We defer empirical verification of this result to Appendix C.
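Update (4.1) is straightforward to implement in coefficient form: writing f_t = Σ_j c_j k(x_j, ·), applying Σ̂ reduces to multiplying by the kernel matrix K/n, and applying (Σ̂ + αI)^{-1} to a function in the span of the K_{x_i} reduces to solving with K/n + αI. A sketch on synthetic 1-d data (the RBF kernel, bandwidth, target function, and step counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)       # RBF kernel matrix

def run(alpha, eta=1.0, steps=50):
    """Update (4.1) in coefficient form; small alpha behaves like NGD."""
    c = np.zeros(n)
    Pre = np.linalg.inv(K / n + alpha * np.eye(n))      # applies (Sigma_hat + alpha I)^{-1}
    for _ in range(steps):
        c = c - eta * Pre @ (K @ c - y) / n
    return np.linalg.norm(K @ c - y)                    # training residual

def run_gd(eta=1.0, steps=50):
    """Plain kernel gradient descent on the same objective."""
    c = np.zeros(n)
    for _ in range(steps):
        c = c - eta * (K @ c - y) / n
    return np.linalg.norm(K @ c - y)

res_pre = run(alpha=1e-3)      # closer to NGD: fast decrease of the training residual
res_gd  = run_gd()             # plain GD after the same number of steps
```

Per mode, the preconditioned contraction factor α/(λ+α) is never larger than the GD factor 1 − ηλ here, which is the mechanism behind the Θ(log n) versus polynomial step counts in Theorem 8.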

5. NEURAL NETWORK EXPERIMENTS

Protocol. We compare the generalization performance of GD and NGD in neural network settings and illustrate the influence of the following three factors: (i) label noise; (ii) misspecification; (iii) signal misalignment. We also verify the potential advantage of interpolating between GD and NGD. We consider the MNIST and CIFAR-10 datasets. To create a student-teacher setup, we split the training set into two halves, one of which (the pretrain split) along with the original labels is used to pretrain the teacher, and the other (the distill split) along with the teacher's labels is used to distill (Hinton et al., 2015) the student. We take the teacher to be either a two-layer fully-connected ReLU network or a ResNet (He et al., 2016), and the student is a two-layer ReLU network. We normalize the teacher's logits following Ba & Caruana (2014) before potentially adding label noise, and fit the student by minimizing the L2 loss. Student models are trained on a subset of the distill split with full-batch updates. We implement NGD using Hessian-free optimization (Martens, 2010). We use 100k unlabeled data points (possibly applying data augmentation) to estimate the population Fisher. We report the test error when the training error is below 0.2% of its initial value as a proxy for the stationary risk. We defer the detailed setup to Appendix E and additional results to Appendix C.
Label Noise. We pretrain the teacher with the full pretrain split and use 1024 examples from the distill split to fit the student. For both the student and the teacher, we use a two-layer ReLU net with 80 hidden units. We corrupt the labels with isotropic Gaussian noise of varying magnitude. Figure 8(a) shows that as the noise level increases (and the variance term begins to dominate), the stationary risk of both NGD and GD worsens, with GD worsening faster, which aligns with our observation in Figure 5.

5.1. EMPIRICAL FINDINGS

Misspecification. We use a ResNet-20 teacher and the same two-layer ReLU student from the label noise experiment. We control the misspecification level by varying the amount of pretraining of the teacher. Intuitively, large teacher models that are trained longer should be more complex and thus likely to fall outside the set of functions that a small two-layer student can represent (hence the problem becomes misspecified). Indeed, Figure 8(b) shows that NGD eventually achieves better generalization as the number of training steps for the teacher increases. In Appendix A.4 we report a heuristic measure of model misspecification that relates to the NTK (Jacot et al., 2018), and confirm that the quantity increases as more label noise is added or as the teacher model is trained longer.
Misalignment. We set the student and teacher to be the same two-layer ReLU network. We construct the teacher model by perturbing the student's initialization, the direction of which is given by F^r, where F is the Fisher of the student model and r ∈ [−1, 0]. Intuitively, as r decreases, the important parameters of the teacher (i.e. the larger update directions) become misaligned with the student's gradient, and thus learning becomes more "difficult". While this analogy is rather superficial due to the non-convexity of neural network optimization, Figure 8(c) shows that as r becomes smaller (the setup is more misaligned), NGD begins to generalize better than GD (in terms of stationary risk).
Interpolating between Preconditioners. We validate our observations in Sections 3 and 4 on the difference between the sample Fisher and the population Fisher, and the potential benefit of interpolating between GD and NGD, in neural network experiments. Figure 9(a) shows that as we decrease the number of unlabeled data points used in estimating the Fisher, which renders the preconditioner closer to the sample Fisher, the stationary risk becomes more akin to that of GD, especially in the large noise setting.
This agrees with our remark on sample vs. population Fisher in Section 3 and Appendix A.1. In particular, we interpret the left end of the figure to correspond to the bias-dominant regime (due to the same architecture for the student and teacher), and the right end to be the variance-dominant regime (due to large label noise). Observe that at certain SNR, a preconditioner that interpolates (additively or geometrically) between GD and NGD can achieve lower stationary risk. 
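The effect of interpolating between GD and NGD can be illustrated in the idealized linear model of Section 3, where the stationary solution of the preconditioned update has the closed form θ̂_P = P X^T (X P X^T)^{-1} y. The sketch below is a minimal illustration, not the neural-network pipeline above; the sizes, two-point-mass spectrum, and noise level are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 200, 0.5

# Anisotropic population covariance (two point masses, as in the paper's examples).
eigs = np.concatenate([np.full(d // 2, 4.0), np.full(d // 2, 0.25)])
Sigma_X = np.diag(eigs)

theta_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d)) * np.sqrt(eigs)       # rows have covariance Sigma_X
y = X @ theta_star + sigma * rng.normal(size=n)

def stationary_solution(P):
    """theta_P = P X^T (X P X^T)^{-1} y, the limit of preconditioned gradient flow."""
    return P @ X.T @ np.linalg.solve(X @ P @ X.T, y)

def risk(theta):
    """Excess population risk E_x (x^T theta - x^T theta*)^2 under Sigma_X."""
    delta = theta - theta_star
    return delta @ Sigma_X @ delta

# Additive interpolation between GD (alpha = 0) and NGD (alpha = 1).
F_inv = np.diag(1.0 / eigs)                       # inverse population Fisher
risks = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    P = (1 - alpha) * np.eye(d) + alpha * F_inv
    theta = stationary_solution(P)
    assert np.allclose(X @ theta, y)              # every P yields an interpolant
    risks[alpha] = risk(theta)
print(risks)
```

Every choice of positive definite P interpolates the training data; only the implicit bias (and hence the risk) changes with α.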

6. DISCUSSION AND CONCLUSION

We analyzed the generalization properties of a general class of preconditioned gradient descent methods in overparameterized least squares regression, with particular emphasis on natural gradient descent. We identified three factors that affect the comparison of generalization performance of different optimizers, the influence of which we also empirically observed in neural networks. We then determined the optimal preconditioner for each factor. While the optimal P is usually not known in practice, we provided justification for common algorithmic choices by discussing the bias-variance tradeoff. Note that our current theoretical setup is limited to fixed preconditioners or those that do not alter the span of the gradients, and thus does not cover many adaptive gradient methods; understanding these optimizers in a similar setting would be an interesting future direction. Another important problem is to further characterize the interplay between preconditioning and explicit regularization (e.g., weight decay) or algorithmic regularization (e.g., large step size and gradient noise).

A DISCUSSION ON ADDITIONAL RESULTS

A.1 IMPLICIT BIAS OF GD VS. NGD

It is known that gradient descent is the steepest descent with respect to the ℓ2 norm, i.e., the update direction is constructed to decrease the loss under small changes in the parameters measured by the ℓ2 norm (Gunasekar et al., 2018a). Following this analogy, NGD is the steepest descent with respect to the KL divergence on the predictive distributions (Martens, 2014); this can be interpreted as a proximal update which penalizes how much the predictions change on the data distribution. Intuitively, the above discussion suggests that GD tends to find solutions close to the initialization in the Euclidean distance between parameters, whereas NGD prefers solutions close to the initialization in terms of function outputs on the data distribution. This observation turns out to be exact in the case of the ridgeless interpolant under the squared loss, as remarked in Section 3. Moreover, Figures 1 and 10 confirm the same trend in the optimization of overparameterized neural networks. In particular,
• GD results in small changes in the parameters, whereas NGD results in small changes in the function.
• Preconditioning with the pseudo-inverse of the sample Fisher, i.e., P = (J^T J)^†, leads to implicit bias similar to that of GD (also noted in Zhang et al. (2019b)), but different from NGD with the population Fisher.
• "Interpolating" between GD and NGD (green) results in properties in between those of GD and NGD.

Remark. Qualitatively speaking, the small change in the function output is the essential reason that NGD performs well under noisy labels in the interpolation setting: NGD seeks to interpolate the training data by changing the function only "locally", so that memorizing the noisy labels has small impact on the "global" shape of the learned function (see Figure 1).

Figure 10: Training a two-layer ReLU network with 50 hidden units towards a teacher model of the same architecture on Gaussian input. The x-axis is rescaled for each optimizer such that the final training error is below 10^-3. GD finds a solution with small changes in the parameters, whereas NGD finds a solution with small changes in the function. Note that the sample Fisher (cyan) has implicit bias similar to GD and does not resemble NGD (population Fisher).

We note that the above observation also implies that wide neural networks trained with NGD (population Fisher) are less likely to stay in the kernel regime: the distance traveled from the initialization can be large (see Figure 10(a)) and thus the Taylor expansion around the initialization is no longer accurate. In other words, the analogy between a wide neural net and its linearized kernel model (which we partially employed in Section 5) may not be valid for models trained by NGD.

Implicit Bias of Interpolating Preconditioners. We also expect that as we interpolate from GD to NGD, the distance traveled in the parameter space gradually increases, and the distance traveled in the function space decreases. Figure 11 demonstrates that this is indeed the case for neural networks: we use the same two-layer MLP setup on MNIST as in Section 5. Observe that updates closer to GD result in smaller changes in the parameters, and ones closer to NGD lead to smaller changes in the function outputs.

Figure 11: Illustration of the implicit bias of preconditioned gradient descent that interpolates between GD and NGD on MNIST. Note that as the update becomes more similar to NGD (smaller damping or larger α), the distance traveled in the parameter space increases, whereas the distance traveled in the output space decreases.

Two Kinds of Second-order Optimizers. Note that the optimal preconditioner derived in Section 3 requires knowledge of population second-order statistics, which we empirically approximate using extra unlabeled data. Consequently, our results suggest that different "types" of second-order information (sample vs. population) may affect generalization differently. Broadly speaking, there are two types of practical approximate second-order optimizers for neural networks. Some algorithms, such as Hessian-free optimization (Martens, 2010; Martens & Sutskever, 2012; Desjardins et al., 2013), approximate second-order matrices (typically the Hessian or the Fisher) using the exact matrix on finite training examples. In high-dimensional problems, this sample-based approximation can be very different from the population quantity (e.g., it is degenerate in the overparameterized regime). Other algorithms fit a parametric approximation to the Fisher, such as diagonal (Duchi et al., 2011; Kingma & Ba, 2014), quasi-diagonal (Ollivier, 2015), or Kronecker-factored (Martens & Grosse, 2015) approximations. If the parametric assumption is accurate, these approximations are more statistically efficient and may thus lead to a better approximation of the population Fisher. Our analysis reveals a difference (in generalization properties) between sample- and population-based preconditioned updates, which may also suggest a separation between the two kinds of approximate second-order optimizers. As future work, we intend to investigate this discrepancy in real-world problems.
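In the linear ridgeless setting this dichotomy is exact: among interpolants started from θ0, the GD solution minimizes the Euclidean distance to the initialization, while the NGD solution (P = Σ_X^{-1}) minimizes the Σ_X-weighted distance, i.e., the change in function outputs. A minimal numerical check (the sizes and the diagonal spectrum are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 160
eigs = np.exp(rng.uniform(-2, 2, size=d))         # anisotropic spectrum
Sigma = np.diag(eigs)
X = rng.normal(size=(n, d)) * np.sqrt(eigs)
y = rng.normal(size=n)
theta0 = rng.normal(size=d) / np.sqrt(d)          # nonzero initialization

def solve(P):
    # stationary point of preconditioned gradient flow started at theta0
    return theta0 + P @ X.T @ np.linalg.solve(X @ P @ X.T, y - X @ theta0)

theta_gd  = solve(np.eye(d))                      # GD:  P = I
theta_ngd = solve(np.diag(1.0 / eigs))            # NGD: P = Sigma_X^{-1}

param_dist = lambda t: np.linalg.norm(t - theta0)            # distance in parameter space
func_dist  = lambda t: (t - theta0) @ Sigma @ (t - theta0)   # E_x (f(x) - f0(x))^2

assert param_dist(theta_gd) <= param_dist(theta_ngd)   # GD moves least in parameters
assert func_dist(theta_ngd) <= func_dist(theta_gd)     # NGD moves least in function space
```

Both assertions follow from the minimum ‖θ − θ0‖_{P^{-1}} characterization of the stationary solution (Appendix D.1), applied with P = I and P = Σ_X^{-1} respectively.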

A.2 BIAS TERM UNDER SPECIFIC SOURCE CONDITION

Motivated by the connection between the notion of "alignment" and the source condition in Section 3.2, we consider a specific case of θ*: Σ_θ = Σ_X^r, where r controls the extent of misalignment, and Theorem 3 implies that the optimal preconditioner for the bias term (well-specified) is P = Σ_X^r. Note that smaller r corresponds to a more misaligned and thus "difficult" problem, and vice versa. In this setup we have the following comparison between GD and NGD.

Proposition (Formal Statement of Proposition 4). Consider the setting of Theorem 3 with Σ_θ = Σ_X^r. Then for all r ≤ -1 we have B(θ̂_{F^{-1}}) ≤ B(θ̂_I), whereas for all r ≥ 0 we have B(θ̂_{F^{-1}}) ≥ B(θ̂_I); equality is achieved when the features are isotropic (i.e., Σ_X = c I_d).

The proposition confirms the intuition that when the parameters of the teacher model θ* are more "aligned" with the features x than in the isotropic setting (r ≥ 0), GD achieves lower bias than NGD; on the other hand, when Σ_θ is more "misaligned" than the Σ_X^{-1} case (r ≤ -1), NGD is guaranteed to be advantageous for the bias term. We therefore expect a transition from the NGD-dominated to the GD-dominated regime at some r ∈ (-1, 0). The exact value of r for this transition depends on the spectral distribution of Σ_X and varies case by case (one would need to evaluate the equality case of (D.9)). To give a concrete example, when Σ_X has a simple block structure, we can explicitly determine the transition point r* ∈ (-1, 0), as shown by the following corollary.

Corollary 9. Assume Σ_θ = Σ_X^r, and that the eigenvalues of Σ_X come from two equally weighted point masses with κ := λ_max(Σ_X)/λ_min(Σ_X); WLOG take tr(Σ_X)/d = 1. Then given r* = -ln c_{κ,γ} / ln κ (see Appendix D.10 for the definition), we have B(θ̂_I) ≥ B(θ̂_{F^{-1}}) if and only if r ≤ r*.

Remark. When γ = 2, the transition happens at r* = -1/2, independent of the condition number κ, as indicated by the dashed line in Figure 12. For other γ > 1, however, r* generally depends on both γ and κ. Our characterization above is supported by Figure 12, in which we plot the bias term under varying extents of misalignment (controlled by r) in the setting of Corollary 9. Observe that as we construct the teacher model to be more "misaligned" (and thus harder to learn) by decreasing r, NGD (blue) achieves lower bias compared to GD (red), and vice versa.
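Proposition 4 can also be spot-checked by direct Monte Carlo simulation of the expected bias; the sketch below is illustrative (the sizes n = 200, d = 400 and the two-point-mass spectrum with κ = 16 are our own choices, and r = -2 / r = 1 are representative points in the two regimes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 400                                   # overparameterization ratio gamma = 2
eigs = np.concatenate([np.full(d // 2, 4.0), np.full(d // 2, 0.25)])
Sigma = np.diag(eigs)

def bias(P, r, trials=3):
    """Monte Carlo estimate of the expected bias (1/d) tr(Sigma_theta M^T Sigma_X M),
    where M = I - P X^T (X P X^T)^{-1} X and Sigma_theta = Sigma_X^r."""
    Sigma_theta = np.diag(eigs ** r)
    total = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * np.sqrt(eigs)
        M = np.eye(d) - P @ X.T @ np.linalg.solve(X @ P @ X.T, X)
        total += np.trace(Sigma_theta @ M.T @ Sigma @ M) / d
    return total / trials

P_gd, P_ngd = np.eye(d), np.diag(1.0 / eigs)
b_gd_neg, b_ngd_neg = bias(P_gd, -2.0), bias(P_ngd, -2.0)   # misaligned: r <= -1
b_gd_pos, b_ngd_pos = bias(P_gd, 1.0), bias(P_ngd, 1.0)     # aligned:    r >= 0
print(b_gd_neg, b_ngd_neg, b_gd_pos, b_ngd_pos)
```

With this spectrum the asymptotic gap is large in both regimes, so even a few trials at moderate dimension reproduce the predicted ordering.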

A.3 ESTIMATING THE POPULATION FISHER

Our analysis of the linear model considers the idealized setup with access to the exact population Fisher, which can be estimated using additional unlabeled data. In this section we discuss how our results in Section 3 and Section 4 are affected when the population covariance is approximated from N i.i.d. (unlabeled) samples X_u ∈ R^{N×d}. For the ridgeless interpolant we have the following result on the substitution error incurred by replacing the true population covariance with the sample estimate.

Proposition 10. Given (A1)(A3) and N/d → ψ > 1 as d → ∞, let Σ̂_X = X_u^T X_u / N. Then
(a) ‖Σ_X - Σ̂_X‖_2 = O(ψ^{-1/2}) almost surely.
(b) Denote the stationary bias and variance of NGD (with the exact population Fisher) as B* and V*, respectively, and the bias and variance of the preconditioned update using the approximate Fisher F̂ = Σ̂_X as B̂ and V̂, respectively. Let 0 < ε < 1 be the desired accuracy. Then ψ = Θ(ε^{-2}) suffices to achieve |B* - B̂| < ε and |V* - V̂| < ε.

Proposition 10 entails that when the preconditioner is a sample estimate of the Fisher (based on unlabeled data), we can approach the stationary bias and variance of the population Fisher at a rate of ψ^{-1/2} as we increase the number of unlabeled data N linearly in the dimensionality d. In other words, any non-vanishing accuracy can be achieved with finite ψ (to push ε → 0, an additional logarithmic factor is required, e.g. N = O(d log d), which is beyond the proportional limit). On the other hand, for our result in Section 4.3, (Murata & Suzuki, 2020, Lemma A.5) directly implies that setting N = Θ(n^s log n) is sufficient to approximate the population covariance operator (i.e., so that ‖Σ^{1/2} Σ_{N,λ}^{-1/2}‖ = O(1)). Finally, we remark that our analysis above does not impose any structural assumptions on the estimated matrix. When the Fisher exhibits certain structures (e.g. a Kronecker factorization (Martens & Grosse, 2015)), the estimation can be more sample-efficient.
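The ψ^{-1/2} scaling of claim (a) is easy to observe empirically; a minimal sketch (the dimension and spectrum are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 200
eigs = np.exp(rng.uniform(-1.0, 1.0, size=d))     # an arbitrary anisotropic spectrum
Sigma = np.diag(eigs)

def est_error(psi, trials=3):
    """Spectral-norm error of the sample covariance built from N = psi * d points."""
    N = int(psi * d)
    err = 0.0
    for _ in range(trials):
        Xu = rng.normal(size=(N, d)) * np.sqrt(eigs)
        err += np.linalg.norm(Sigma - Xu.T @ Xu / N, ord=2)
    return err / trials

errs = [est_error(psi) for psi in (2, 8, 32)]
print(errs)                                       # shrinks roughly like psi^{-1/2}
```

Quadrupling ψ should roughly halve the spectral-norm error, matching the O(ψ^{-1/2}) rate.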
For analyses of such structured approximations of the Fisher, see Karakida & Osawa (2020).

A.4 A HEURISTIC MEASURE OF MISSPECIFICATION

As a heuristic measure of model misspecification, in Figure 13 we report y^T K^{-1} y / n studied in Arora et al. (2019b), where y is the label vector and K is the NTK matrix (Jacot et al., 2018) of the student model. This quantity relates to kernel-based alignment measures (Cristianini et al., 2001), and in the context of neural network optimization, it can be interpreted as a proxy for how the signal and noise are distributed along the eigendirections of the NTK (e.g., see Li et al. (2019); Dong et al. (2019); Su & Yang (2019)). Roughly speaking, large y^T K^{-1} y / n implies that the problem is "difficult" for the student model to learn via GD, and vice versa. Here we give a heuristic argument on how this quantity relates to label noise and misspecification. For the ridgeless regression model considered in Section 3, write y_i = f*(x_i) + f_c(x_i) + ε_i, where f*(x) = x^T θ*, f_c is the misspecified component, and ε_i is the label noise. Then we have

E[y^T K^{-1} y] = E‖(X X^T)^{-1/2} (f*(X) + f_c(X) + ε)‖_2^2 (i)≈ tr(θ* θ*^T X^T (X X^T)^{-1} X) + (σ^2 + σ_c^2) tr((X X^T)^{-1}),   (A.1)

where in (i) we heuristically replaced the misspecified component with i.i.d. noise of the same variance σ_c^2. The first term of (A.1) resembles an RKHS norm of the target θ*, whereas the second term is small when the feature matrix is well-conditioned or when the level of label noise σ^2 and misspecification σ_c^2 is small (note that these are conditions under which GD achieves good generalization). We may expect similar behavior for neural networks close to the kernel regime. This provides a non-rigorous explanation of the trend observed in Figure 13: as we increase the level of label noise or model misspecification (by pretraining the teacher for more steps), the quantity of interest becomes larger.
Non-monotonicity of the Bias. Many previous works on the high-dimensional characterization of linear regression assumed a random effects model with an isotropic prior on the true parameters (Dobriban et al., 2018; Hastie et al., 2019), which may not hold in practice. As an example of the limitation of this assumption, note that when Σ_θ = I_d, it can be shown that the expected bias B(θ̂(t)) monotonically decreases through time (see the proof of Proposition 7). In contrast, when the target parameters do not follow an isotropic prior, the bias of GD can exhibit non-monotonicity, which gives rise to the "epoch-wise double descent" phenomenon also observed in deep learning (Nakkiran et al., 2019; Ishida et al., 2020). We empirically demonstrate this non-monotonicity when the model is close to the interpolation threshold in Figure 14: we set the eigenvalues of Σ_X to be two point masses with κ_X = 32, Σ_θ = Σ_X^{-1}, and γ = 16/15. Note that the GD trajectory (red) exhibits non-monotonicity in the bias term, whereas for NGD the bias is monotonically decreasing through time (which we confirm in the proof of Proposition 7). We remark that this mechanism of epoch-wise double descent may not relate to the empirical findings in deep neural networks (the robustness of which is also largely unknown), in which it is typically speculated that the variance term exhibits non-monotonicity.
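The non-monotone bias of GD can be reproduced directly from the gradient-flow solution θ(t) = X^T (X X^T)^{-1} (I - e^{-t X X^T / n}) X θ* with θ* integrated out in closed form. The sketch below uses the stated parameters (κ_X = 32, γ = 16/15, Σ_θ = Σ_X^{-1}); the random seed, time grid, and trace normalization are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 450, 480                                   # gamma = d/n = 16/15
eigs = np.concatenate([np.full(d // 2, 32.0), np.full(d // 2, 1.0)])
eigs /= eigs.mean()                               # kappa_X = 32, tr(Sigma_X)/d = 1
sq = np.sqrt(eigs)
ts = np.logspace(-1, 5, 14)

def gd_bias_curve(trials=2):
    """Expected bias of the GD flow, B(t) = (1/d) tr(Sigma_theta M_t^T Sigma_X M_t),
    averaged over draws of X, under the misaligned prior Sigma_theta = Sigma_X^{-1}."""
    B = np.zeros(len(ts))
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * sq
        w, U = np.linalg.eigh(X @ X.T / n)
        W = U.T @ X
        for j, t in enumerate(ts):
            c = (1.0 - np.exp(-t * w)) / (n * w)  # spectral filter of gradient flow
            M = np.eye(d) - (W.T * c) @ W         # I - X^T (XX^T)^{-1}(I - e^{-tXX^T/n}) X
            S = (sq[:, None] * M) / sq[None, :]   # Sigma_X^{1/2} M Sigma_theta^{1/2}
            B[j] += np.sum(S**2) / d
    return B / trials

B_gd = gd_bias_curve()
print(np.round(B_gd, 3))
```

The printed curve should rise and fall over the time grid rather than decrease monotonically, mirroring the red GD trajectory in Figure 14.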

B ADDITIONAL RELATED WORKS

Implicit Regularization in Optimization. In overparameterized linear models, GD finds the minimum ℓ2 norm solution under many loss functions. For the more general mirror descent, the implicit bias is determined by the Bregman divergence of the update (Gunasekar et al., 2018b; Suggala et al., 2018; Azizan et al., 2019). Under the exponential or logistic loss, recent works showed that GD finds the max-margin direction in various models (Ji & Telgarsky, 2018; 2019; Soudry et al., 2018; Lyu & Li, 2019; Chizat & Bach, 2020). The implicit bias of Adagrad has been analyzed in a similar setting (Qian & Qian, 2019). Implicit regularization also relates to the model architecture; examples include matrix factorization (Gunasekar et al., 2017; Saxe et al., 2013; Gidel et al., 2019; Arora et al., 2019a) and certain stylized neural networks (Li et al., 2017; Gunasekar et al., 2018b; Williams et al., 2019; Woodworth et al., 2020). For wide networks in the kernel regime (Jacot et al., 2018), the implicit bias of GD relates to properties of the neural tangent kernel (NTK) (Xie et al., 2016; Arora et al., 2019b; Bietti & Mairal, 2019). Finally, we note that the implicit bias of GD is not always explained by the minimum norm property (Razin & Cohen, 2020; Dauber et al., 2020).

Asymptotics of Interpolating Estimators. In Section 3 we analyzed overparameterized estimators that interpolate the training data. Recent works have shown that interpolation need not lead to overfitting (Liang & Rakhlin, 2018; Belkin et al., 2018c;b; Bartlett et al., 2019), and the optimal risk may be achieved under no regularization and extreme overparameterization (Belkin et al., 2018a; Xu & Hsu, 2019).
The asymptotic risk of overparameterized models has been characterized in various settings, such as linear regression (Karoui, 2013; Dobriban et al., 2018; Hastie et al., 2019), random features regression (Mei & Montanari, 2019; Gerace et al., 2020; Dhifallah & Lu, 2020; Adlam & Pennington, 2020), max-margin classification (Montanari et al., 2019; Deng et al., 2019), and certain neural networks (Louart et al., 2018; Ba et al., 2020). Our analysis is based on random matrix theory results developed in Rubio & Mestre (2011); Ledoit & Péché (2011). Similar tools can also be used to study the gradient descent dynamics of linear regression (Liao & Couillet, 2018; Ali et al., 2019).

Figure caption (Appendix C): The x-axis has been rescaled for each curve and thus convergence speed is not directly comparable. Note that (a) large λ (i.e., a GD-like update) is beneficial when r is large, and (b) small λ (i.e., an NGD-like update) is beneficial when r is small.

C.3 NEURAL NETWORK

Label Noise. In Figure 21(a) we observe the same phenomenon on CIFAR-10: NGD generalizes better as more label noise is added, and vice versa. Figure 21(b) shows that in all cases with varying amounts of label noise, the early stopping risk of NGD is nevertheless worse than that of GD. This agrees with the observation in Section 4 and Figure 19(a) that early stopping can favor GD due to the reduced variance.

Misalignment. In Figure 21(c)(d) we confirm the finding of Proposition 7 and Figure 18(b) in neural networks on synthetic data: we consider 50-dimensional Gaussian input, and both the teacher and the student are two-layer ReLU networks with 50 hidden units. We construct the teacher by perturbing the initialization of the student as described in Section 5. As r approaches -1 (the problem becomes more "misaligned"), NGD achieves lower early stopping risk (Figure 21(d)), whereas GD dominates the early stopping risk in the less misaligned setting (Figure 21(c)). We note that this phenomenon is difficult to observe in practical neural network training on real-world data, which may be partially due to the fragility of the analogy between neural nets and linear models, especially under NGD (discussed in Appendix A.1).

Figure 21 caption (partial): Both the student and the teacher are two-layer ReLU networks with 50 hidden units. The x-axis and the learning rate have been rescaled for each curve (i.e., optimization speed is not comparable). When r is sufficiently small, NGD achieves lower early stopping risk, whereas GD is beneficial for larger r.

D PROOFS AND DERIVATIONS

D.1 MISSING DERIVATIONS IN SECTION 3

Gradient Flow of Preconditioned Updates. Given positive definite P and γ > 1, one may check that the gradient flow solution at time t can be written as

θ_P(t) = P X^T [I_n - exp(-(t/n) X P X^T)] (X P X^T)^{-1} y.

Taking t → ∞ yields the stationary solution θ̂_P = P X^T (X P X^T)^{-1} y. We remark that the damped inverse of the sample Fisher, P = (X^T X + λ I_d)^{-1}, leads to the same minimum-norm solution as GD, θ̂_I = X^T (X X^T)^{-1} y, since P X^T and X^T share the same eigenvectors. On the other hand, when P is the pseudo-inverse of the sample Fisher, (X^T X)^†, which is not full-rank, the trajectory can be obtained via the variation of constants formula:

θ(t) = (t/n) Σ_{k=0}^∞ [1/(k+1)!] (-(t/n) X^T (X X^T)^{-1} X)^k X^T (X X^T)^{-1} y,

for which taking the large-t limit also yields the minimum-norm solution X^T (X X^T)^{-1} y.

Minimum ‖θ‖_{P^{-1}} Norm Interpolant. For positive definite P and the corresponding stationary solution θ̂_P = P X^T (X P X^T)^{-1} y, note that given any other interpolant θ', we have (θ̂_P - θ')^T P^{-1} θ̂_P = 0 because both θ̂_P and θ' achieve zero empirical risk. Hence

‖θ'‖²_{P^{-1}} - ‖θ̂_P‖²_{P^{-1}} = ‖θ' - θ̂_P‖²_{P^{-1}} ≥ 0.

This confirms that θ̂_P is the unique minimum ‖θ‖_{P^{-1}} norm solution.
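These closed-form identities are easy to verify numerically; the sketch below checks both the equivalence of the damped sample-Fisher inverse with GD and the minimum ‖θ‖_{P^{-1}} norm property (all sizes and the choice of damping 0.1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 90
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def stationary(P):
    """theta_P = P X^T (X P X^T)^{-1} y."""
    return P @ X.T @ np.linalg.solve(X @ P @ X.T, y)

theta_gd = stationary(np.eye(d))

# Damped inverse of the sample Fisher gives the same solution as GD,
# since P X^T and X^T share the same eigenvectors (push-through identity).
P_damped = np.linalg.inv(X.T @ X + 0.1 * np.eye(d))
assert np.allclose(stationary(P_damped), theta_gd)

# Minimum ||theta||_{P^{-1}} norm property: theta_P beats any other interpolant.
P = np.diag(np.exp(rng.uniform(-1, 1, size=d)))   # a generic positive definite P
theta_P = stationary(P)
P_inv = np.linalg.inv(P)
# another interpolant: perturb theta_P within the null space of X
other = theta_P + (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.normal(size=d)
assert np.allclose(X @ other, y)
assert theta_P @ P_inv @ theta_P <= other @ P_inv @ other + 1e-10
```

The second check instantiates the argument above: any other interpolant differs from θ̂_P by a null-space component, which can only increase the P^{-1}-weighted norm.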

D.2 PROOF OF THEOREM 1

Proof. By the definition of the variance term and the stationary solution θ̂,

V(θ̂) = tr(Cov(θ̂) Σ_X) = σ² tr(P X^T (X P X^T)^{-2} X P Σ_X).

Write X̃ = X P^{1/2} and define Σ_XP = P^{1/2} Σ_X P^{1/2}. The equation above then simplifies to

V(θ̂_P) = σ² tr(X̃^T (X̃ X̃^T)^{-2} X̃ Σ_XP).

The analytic expression of the variance term follows from a direct application of (Hastie et al., 2019, Theorem 4), in which the conditions on the population covariance are satisfied by (A2). Taking the derivative of m(-λ) yields

m'(-λ) = [1/m²(-λ) - γ ∫ τ²/(1 + τ m(-λ))² dH_XP(τ)]^{-1}.

Plugging this quantity into the expression of the variance (omitting the scaling σ² and the constant shift),

m'(-λ)/m²(-λ) = [1 - γ m²(-λ) ∫ τ²/(1 + τ m(-λ))² dH_XP(τ)]^{-1}.

From the monotonicity of x ↦ x/(1+x) on x > 0, or Jensen's inequality, we know that

1 - γ ∫ (τ m(-λ)/(1 + τ m(-λ)))² dH_XP(τ) ≤ 1 - γ (∫ τ m(-λ)/(1 + τ m(-λ)) dH_XP(τ))².

Taking λ → 0 and omitting the scalar σ², the right-hand side evaluates to 1 - 1/γ. We thus arrive at the lower bound V ≥ (γ - 1)^{-1}. Note that equality is achieved only when H_XP is a point mass, i.e., P = Σ_X^{-1}. In other words, the minimum variance is achieved by NGD. As a verification, the variance of the NGD solution θ̂_{F^{-1}} agrees with the calculation for the case where the features are isotropic (Hastie et al., 2019, A.3).
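The lower bound σ²/(γ - 1), and its attainment by NGD, can be checked by simulating the variance trace directly (the sizes and two-point-mass spectrum below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 400                                   # gamma = d/n = 2
eigs = np.concatenate([np.full(d // 2, 4.0), np.full(d // 2, 0.25)])

def variance(P, trials=3):
    """V(theta_P)/sigma^2 = tr(P X^T (X P X^T)^{-2} X P Sigma_X)."""
    total = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * np.sqrt(eigs)
        G = P @ X.T @ np.linalg.inv(X @ P @ X.T)          # theta_P = G y
        total += np.sum((np.sqrt(eigs)[:, None] * G)**2)  # tr(G^T Sigma_X G)
    return total / trials

gamma = d / n
V_gd  = variance(np.eye(d))
V_ngd = variance(np.diag(1.0 / eigs))             # NGD: H_XP is a point mass
print(V_gd, V_ngd, 1 / (gamma - 1))
```

For NGD the preconditioned features are isotropic, so the simulated variance concentrates near 1/(γ - 1), while GD on these anisotropic features sits strictly above the bound.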

D.3 PROOF OF COROLLARY 2

Proof. Via a calculation similar to (Hastie et al., 2019, Section 5), the bias can be decomposed as

E[B(θ̂_P)] = E_{x, x_c, θ*, θ_c} [ ( x^T P X^T (X P X^T)^{-1} (X θ* + X_c θ_c) - (x^T θ* + x_c^T θ_c) )² ]
(i)= E_{x, θ*} [ ( x^T P X^T (X P X^T)^{-1} X θ* - x^T θ* )² ] + E_{x_c, θ_c} [ (x_c^T θ_c)² ] + E_{x, θ_c} [ x^T P X^T (X P X^T)^{-1} X_c θ_c θ_c^T X_c^T (X P X^T)^{-1} X P x ]
(ii)→ B_θ(θ̂_P) + (1/d_c) tr(Σ_X^c Σ_θ^c) (1 + V(θ̂_P)),

where we used the independence of x, x_c and θ*, θ_c in (i), and (A1-3) as well as the definitions of the well-specified bias B_θ(θ̂_P) and variance V(θ̂_P) in (ii).

D.4 PROOF OF THEOREM 3

Proof. By the definition of the bias term (note that Σ_X, Σ_θ, P are all positive semi-definite),

B(θ̂_P) = E_θ* ‖P X^T (X P X^T)^{-1} X θ* - θ*‖²_{Σ_X}
        = (1/d) tr[ Σ_θ (I_d - P X^T (X P X^T)^{-1} X) Σ_X (I_d - P X^T (X P X^T)^{-1} X) ]
(i)     = (1/d) tr[ Σ_{θ/P} (I_d - X̃^T (X̃ X̃^T)^{-1} X̃) Σ_XP (I_d - X̃^T (X̃ X̃^T)^{-1} X̃) ]
(ii)    = lim_{λ→0+} (λ²/d) tr[ Σ_{θ/P} ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_XP ((1/n) X̃^T X̃ + λ I_d)^{-1} ]
(iii)   = lim_{λ→0+} (λ²/d) tr[ ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})^{-2} Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2} ],

where we utilized (A3) and defined X̃ = X P^{1/2}, Σ_XP = P^{1/2} Σ_X P^{1/2}, and Σ_{θ/P} = P^{-1/2} Σ_θ P^{-1/2} in (i), applied the identity A^T (A A^T)^† = lim_{λ→0} (A^T A + λ I)^{-1} A^T in (ii), and defined X̄ = X̃ Σ_{θ/P}^{-1/2} in (iii). To proceed, we first assume that Σ_{θ/P} is invertible (i.e., λ_min(Σ_{θ/P}) is bounded away from 0) and observe the following relation via a leave-one-out argument similar to that in Xu & Hsu (2019):

(1/d) tr[ (1/n) X̄^T X̄ ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})^{-2} ]     (D.1)
(i)= (1/d) Σ_{i=1}^n [ (1/n) x̄_i^T ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})_{¬i}^{-2} x̄_i ] / [ 1 + (1/n) x̄_i^T ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})_{¬i}^{-1} x̄_i ]²
(ii)→_p [ (1/d) tr( ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})^{-2} Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2} ) ] / [ 1 + (1/n) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_XP ) ]²,     (D.2)

where (i) is due to the Woodbury identity, with ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})_{¬i} := (1/n) X̄^T X̄ - (1/n) x̄_i x̄_i^T + λ Σ_{θ/P}^{-1}, which is independent of x̄_i (see (Xu & Hsu, 2019, Eq. 58) for details); in (ii) we used (A3), the convergence to the trace (Ledoit & Péché, 2011, Lemma 2.1) and its stability under low-rank perturbation (e.g., see (Ledoit & Péché, 2011, Eq. 18)), which we elaborate below. In particular, denoting Σ̄ = (1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1}, for the denominator we have

sup_i | (λ/n) tr( Σ̄^{-1} Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2} ) - (λ/n) tr( Σ̄_{¬i}^{-1} Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2} ) |
  ≤ (λ/n) ‖Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2}‖_2 sup_i tr( Σ̄^{-1} (Σ̄ - Σ̄_{¬i}) Σ̄_{¬i}^{-1} )
  ≤ (λ/n) ‖Σ_{θ/P}^{-1/2} Σ_XP Σ_{θ/P}^{-1/2}‖_2 ‖Σ̄^{-1}‖_2 sup_i ‖Σ̄_{¬i}^{-1}‖_2 tr( Σ̄ - Σ̄_{¬i} ) (i)→ O_p(1/n),

where (i) is due to the definition of Σ̄_{¬i} and (A1)(A3). The result for the numerator can be obtained via a similar calculation, the details of which we omit.

We first note that the denominator can be evaluated by previous results (e.g., (Dobriban et al., 2018, Theorem 2.1)) as follows:

(1/n) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_XP ) →_{a.s.} 1/(λ m(-λ)) - 1.     (D.3)

On the other hand, following the same derivation as Dobriban et al. (2018); Hastie et al. (2019), (D.1) can be decomposed as

(1/d) tr[ (1/n) X̄^T X̄ ((1/n) X̄^T X̄ + λ Σ_{θ/P}^{-1})^{-2} ]
  = (1/d) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_{θ/P} ) - (λ/d) tr( ((1/n) X̃^T X̃ + λ I_d)^{-2} Σ_{θ/P} )
  = (1/d) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_{θ/P} ) + (λ/d) (d/dλ) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_{θ/P} ).     (D.4)

We employ (Rubio & Mestre, 2011, Theorem 1) to characterize (D.4). In particular, for any deterministic sequence of matrices Θ_n ∈ R^{d×d} with finite trace norm, as n, d → ∞ we have

tr[ Θ_n ((1/n) X̃^T X̃ - z I_d)^{-1} ] - tr[ Θ_n (c_n(z) Σ_XP - z I_d)^{-1} ] →_{a.s.} 0,

in which c_n(z) → -z m(z) for z ∈ C \ R_+ and m(z) is defined in Theorem 1, due to the dominated convergence theorem. By (A3) we are allowed to take Θ_n = (1/d) Σ_{θ/P}. Thus we have

(λ/d) tr( Σ_{θ/P} ((1/n) X̃^T X̃ + λ I_d)^{-1} ) → (λ/d) tr( Σ_{θ/P} (λ m(-λ) Σ_XP + λ I_d)^{-1} ) (i)= E[ υ_x υ_θ υ_xp^{-1} / (1 + m(-λ) υ_xp) ],  ∀λ > -c_l,     (D.5)

in which (i) is due to (A3), the fact that the LHS is almost surely bounded for λ > -c_l, where c_l is the lowest non-zero eigenvalue of (1/n) X̃^T X̃, and an application of the dominated convergence theorem. Differentiating (D.5) (note that the derivative is also bounded on λ > -c_l) yields

(λ/d) (d/dλ) tr( ((1/n) X̃^T X̃ + λ I_d)^{-1} Σ_{θ/P} ) → E[ m'(-λ) υ_x υ_θ / (1 + m(-λ) υ_xp)² - υ_x υ_θ υ_xp^{-1} / (λ (1 + m(-λ) υ_xp)) ].     (D.6)

Note that the numerator of (D.2) is the quantity of interest. Combining (D.1)(D.2)(D.3)(D.4)(D.5)(D.6) and taking λ → 0 yields the formula of the bias term. Finally, when Σ_{θ/P} is not invertible, observe that if we increment all its eigenvalues by some small ε > 0 to ensure invertibility, Σ̃_{θ/P} = Σ_{θ/P} + ε I_d, then (3.3) is bounded and decreasing w.r.t. ε. Thus by the dominated convergence theorem we take ε → 0 and obtain the desired result.

We remark that a similar (but less general) characterization can also be derived based on (Ledoit & Péché, 2011, Theorem 1.2) when the eigenvalues of Σ_XP and Σ_{θ/P} exhibit certain relations. To show that P = U diag(e_θ) U^T achieves the lowest bias, first note that under the definition of the random variables in (A3), our claimed optimal preconditioner is equivalent to υ_xp =_{a.s.} υ_x υ_θ. We therefore define an interpolation υ_α = α υ_x υ_θ + (1 - α) ῡ for some ῡ, and write the corresponding Stieltjes transform as m_α(-λ) and the bias term as B_α. We aim to show that argmin_{α∈[0,1]} B_α = 1. For notational convenience define g_α := m_α(0) υ_x υ_θ and h_α := m_α(0) υ_α. One can check that

B_α = E[ υ_x υ_θ / (1 + h_α)² ] ( E[ h_α / (1 + h_α)² ] )^{-1};   dm_α(-λ)/dα |_{λ→0} = m_α(0) E[ (h_α - g_α)/(1 + h_α)² ] / ( (1 - α) E[ h_α/(1 + h_α)² ] ).

We now verify that the derivative of B_α w.r.t. α is non-positive for α ∈ [0, 1]. A standard simplification of the derivative yields

dB_α/dα ∝ -2 E[ (g_α - h_α)²/(1 + h_α)³ ] ( E[ h_α/(1 + h_α)² ] )² - 2 ( E[ (g_α - h_α)/(1 + h_α)² ] )² E[ h_α²/(1 + h_α)³ ] + 4 E[ h_α (g_α - h_α)/(1 + h_α)³ ] E[ (g_α - h_α)/(1 + h_α)² ] E[ h_α/(1 + h_α)² ]
(i)≤ -4 ( E[ (g_α - h_α)²/(1 + h_α)³ ] E[ h_α²/(1 + h_α)³ ] )^{1/2} | E[ (g_α - h_α)/(1 + h_α)² ] | E[ h_α/(1 + h_α)² ] + 4 E[ h_α (g_α - h_α)/(1 + h_α)³ ] E[ (g_α - h_α)/(1 + h_α)² ] E[ h_α/(1 + h_α)² ]
(ii)≤ 0,

where (i) is due to AM-GM and (ii) to Cauchy-Schwarz applied to the first term. Note that both equalities hold when g_α = h_α, from which one can easily deduce that the optimum is achieved when υ_xp =_{a.s.} υ_x υ_θ, and thus we know that the proposed P is the optimal preconditioner that is co-diagonalizable with Σ_X.

D.5 PROOF OF PROPOSITION 4

Proof. Since Σ_θ = Σ_X^r, we can simplify the expressions by defining υ_x := h and thus υ_θ = h^r. From Theorem 3 we have the following derivation of the GD bias under (A1)(A3):

B(θ̂_I) → (m'₁/m₁²) E[ h · h^r / (1 + h m₁)² ] = E[ h^{1+r}/(1 + h m₁)² ] / ( 1 - γ E[ (h m₁)²/(1 + h m₁)² ] ),     (D.7)

where m₁ = lim_{λ→0+} m(-λ), and m satisfies 1/m(-λ) = λ + γ E[ h/(1 + h m(-λ)) ]. Similarly, for NGD (P = Σ_X^{-1}) we have

B(θ̂_{F^{-1}}) → (m'₂/m₂²) E[ h · h^r / (1 + m₂)² ] = E[ h^{1+r}/(1 + m₂)² ] / ( 1 - γ E[ m₂²/(1 + m₂)² ] ) = E[h^{1+r}] / ( (1 + m₂)² - γ m₂² ),     (D.8)

where a standard calculation yields m₂ = (γ - 1)^{-1}, and thus B(θ̂_{F^{-1}}) → (1 - γ^{-1}) E[h^{1+r}]. To compare the magnitudes of (D.7) and (D.8), observe the following equivalences:

B(θ̂_I) ≶ B(θ̂_{F^{-1}}) ⇔ E[ h^{1+r}/(1 + h m₁)² ] · γ/(γ - 1) ≶ ( 1 - γ E[ (h m₁)²/(1 + h m₁)² ] ) E[h^{1+r}]
(i)⇔ E[ ζ^{1+r}/(1 + ζ)² ] E[ ζ/(1 + ζ) ] ≶ E[ ζ/(1 + ζ)² ] E[ζ^{1+r}] E[ 1/(1 + ζ) ],     (D.9)

where (i) follows from the definition of m₁ and we defined ζ := h m₁. Note that when r ≤ -1 and h is not a point mass, we have

E[ ζ^{1+r}/(1 + ζ)² ] E[ ζ/(1 + ζ) ] > E[ ζ/(1 + ζ)² ] E[ ζ^{1+r}/(1 + ζ) ] > E[ ζ/(1 + ζ)² ] E[ζ^{1+r}] E[ 1/(1 + ζ) ].

On the other hand, when r ≥ 0, following the exact same procedure we get

E[ ζ^{1+r}/(1 + ζ)² ] E[ ζ/(1 + ζ) ] < E[ ζ/(1 + ζ)² ] E[ζ^{1+r}] E[ 1/(1 + ζ) ].

Combining the two cases completes the proof.
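The comparison (D.9) can be evaluated in closed form once the Stieltjes fixed point m₁ is known. The sketch below solves the λ → 0 self-consistent equation for a two-point-mass spectrum by fixed-point iteration and checks both directions of (D.9); the specific masses (κ = 16, γ = 2) and exponents r = -2, 1 are illustrative choices:

```python
import numpy as np

h = np.array([4.0, 0.25])     # two equally weighted point masses for the spectrum
gamma = 2.0

# Solve 1/m = gamma * E[h / (1 + h m)] (the lambda -> 0 equation) by iteration.
m = 1.5
for _ in range(300):
    m = 1.0 / (gamma * np.mean(h / (1.0 + h * m)))

zeta = h * m                  # zeta = h * m_1, as in (D.9)

def lhs_rhs(r):
    """Both sides of the comparison (D.9) at source exponent r."""
    lhs = np.mean(zeta**(1 + r) / (1 + zeta)**2) * np.mean(zeta / (1 + zeta))
    rhs = np.mean(zeta / (1 + zeta)**2) * np.mean(zeta**(1 + r)) * np.mean(1 / (1 + zeta))
    return lhs, rhs

l_neg, r_neg = lhs_rhs(-2.0)  # r <= -1: expect LHS > RHS, i.e. B(GD) > B(NGD)
l_pos, r_pos = lhs_rhs(1.0)   # r >= 0:  expect LHS < RHS, i.e. B(GD) < B(NGD)
print(m, l_neg, r_neg, l_pos, r_pos)
```

Note that for this particular spectrum the fixed point is m₁ = 1, so ζ coincides with the eigenvalues themselves.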

D.6 PROOF OF PROPOSITION 5

Proof. We first outline a more general setup where P α = f (Σ X ; α) for continuous and differentiable function of α and f applied to eigenvalues of Σ x . For any interval I ⊆ [0, 1], we claim (a) Suppose all four functions 1 xf (x;α) , f (x; α), ∂f (x;α) ∂α /f (x; α) and x ∂f (x;α) ∂α are decreasing functions of x on the support of v x for all α ∈ I. In addition, ∂f (x;α) ∂α ≥ 0 on the support of v x for all α ∈ I. Then the stationary bias is an increasing function of α on I. (b) For all α ∈ I, suppose xf (x; α) is a monotonic function of x on the support of v x and ∂f (x;α) ∂α /f (x; α) is a decreasing function of x on the support of v x . Then the stationary variance is a decreasing function of α on I. Let us verify the three choices of P α in Proposition 5 one by one. • When P α = (1 -α)I d + α(Σ X ) -1 , the corresponding f (x; α) is (1 -α) + αx. This satisfies all conditions in (a) and (b) for all α ∈ [0, 1]. Hence, the stationary variance is a decreasing function and the stationary bias is an increasing function of α ∈ [0, 1]. • When P α = (Σ X ) -α , the corresponding f (x; α) is x -α . It is clear that it satisfies all conditions in (a) and (b) for all α ∈ [0, 1] except for the condition that x ∂f (x;α) ∂α = -x 1-α ln x is a decreasing function of x. Note that x ∂f (x;α) ∂α = -x 1-α ln x is a decreasing function of x on the support of v x only for α ≥ ln(κ)-1 ln(κ) where κ = sup v x / inf v x . Hence, the stationary variance is a decreasing function of α ∈ [0, 1] and the stationary bias is an increasing function of α ∈ [max(0, ln(κ)-1 ln(κ) ), 1]. • When P α = (αΣ X +(1-α)I d ) -1 , the corresponding f (x; α) is 1/(αx + (1 -α)) , which satisfies all conditions in (a) and (b) for all α ∈ [0, 1] except for the condition that x ∂f (x;α) ∂α = x(1-x) (αx+(1-α)) 2 is a decreasing function of x. Note that x ∂f (x;α) ∂α = x(1-x) (αx+(1-α)) 2 is a decreasing function of x on the support of v x only for α ≥ κ-2 κ-1 . 
Hence, the stationary variance is a decreasing function of α ∈ [0, 1] and the stationary bias is an increasing function of α ∈ [max(0, κ-2 κ-1 ), 1]. To show (a) and (b), note that under the conditions on Σ x and Σ θ assumed in Proposition 5, the stationary bias B( θP α ) and the stationary variance V ( θP α ) can be simplified to B( θP α ) = m α (0) m 2 α (0) E v x (1 + v x f (v x ; α)m α (0)) 2 and V ( θP α ) = σ 2 • m α (0) m 2 α (0) -1 , where m α (z) and m α (z) satisfy 1 = -zm α (z) + γE v x f (v x ; α)m α (z) 1 + v x f (v x ; α)m α (z) (D.10) m α (z) m 2 α (z) = 1 1 -γE f (vx;α)mα(z) 1+f (vx;α)mα(z) 2 . (D.11) For notation convenience, let f α := v x f (v x ; α). From (D.11), we have the following equivalences. B( θP α ) = E vx (1+fαmα(0)) 2 1 -γE fαmα(0) 1+fαmα(0) 2 , (D.12) V ( θP α ) = σ 2    1 1 -γE fαmα(0) 1+fαmα(0) 2 -1   . (D.13) We first show that (b) holds. Note that from (D.13), we have ∂V ( θP α ) ∂α = γσ 2    1 1-γE fαmα(0) 1+fαmα(0) 2    2 E 2f α m α (0) (1+f α m α (0)) 3 f α ∂m α (z) ∂α z=0 + ∂f α ∂α m α (0) . (D.14) To calculate ∂mα(z) ∂α z=0 , we take derivatives with respect to α on both sides of (D.10), 0 = γE 1 (1 + f α m α (0)) 2 • f α ∂m α (z) ∂α z=0 + ∂f α ∂α m α (0) . (D.15) Therefore, plugging (D.15) into (D.14) yields ∂V ( θP α ) ∂α =2γσ 2    m α (0) 1 -γE fαmα(0) 1+fαmα(0) 2    2 E f α (1 + f α m α (0)) 2 -1 × E f α ∂fα ∂α (1 + f α m α (0)) 3 E f α (1 + f α m α (0)) 2 -E f 2 α (1 + f α m α (0)) 3 E ∂fα ∂α (1 + f α m α (0)) 2 Thus showing V ( θP α ) is a decreasing function of α is equivalent to showing that E f 2 α (1 + f α m α (0)) 3 E ∂fα ∂α (1 + f α m α (0)) 2 ≥ E f α ∂fα ∂α (1 + f α m α (0)) 3 E f α (1 + f α m α (0)) 2 . (D.16) Let µ x be the probability measure of v x . We define a new measure μx = fαµx (1+fαmα(0)) 2 , and let ṽx follow the new measure. 
Since (∂f(x;α)/∂α)/f(x;α) is a decreasing function of x and x f(x;α) is a monotonic function of x, the Chebyshev association inequality (for oppositely monotone functions) gives

E[ ṽ_x f(ṽ_x;α)/(1 + ṽ_x f(ṽ_x;α) m_α(0)) ] · E[ (∂(ṽ_x f(ṽ_x;α))/∂α) / (ṽ_x f(ṽ_x;α)) ] ≥ E[ (∂(ṽ_x f(ṽ_x;α))/∂α) / (1 + ṽ_x f(ṽ_x;α) m_α(0)) ].

Changing ṽ_x back to v_x, we arrive at (D.16) and thus (b).

For the bias term B(θ̂_{P_α}), note that from (D.10) and (D.12) we have

∂B(θ̂_{P_α})/∂α = (1/γ) ( 1/γ − E[ (f_α m_α(0)/(1 + f_α m_α(0)))² ] )^{−2} × { −E[ (2v_x/(1 + f_α m_α(0))³) ( f_α ∂m_α(z)/∂α|_{z=0} + (∂f_α/∂α) m_α(0) ) ] · E[ f_α m_α(0)/(1 + f_α m_α(0))² ] + E[ v_x/(1 + f_α m_α(0))² ] · E[ (2 f_α m_α(0)/(1 + f_α m_α(0))³) ( f_α ∂m_α(z)/∂α|_{z=0} + (∂f_α/∂α) m_α(0) ) ] }.   (D.17)

Similarly, we combine (D.15) and (D.17) and simplify the expression. To verify that B(θ̂_{P_α}) is an increasing function of α, we need to show that

0 ≤ { E[ v_x f_α m_α(0)/(1 + f_α m_α(0))³ ] E[ (∂f_α/∂α)/(1 + f_α m_α(0))² ] − E[ v_x (∂f_α/∂α)/(1 + f_α m_α(0))³ ] E[ f_α m_α(0)/(1 + f_α m_α(0))² ] } · E[ f_α m_α(0)/(1 + f_α m_α(0))² ]
− E[ v_x/(1 + f_α m_α(0))² ] · { E[ (f_α m_α(0))²/(1 + f_α m_α(0))³ ] E[ (∂f_α/∂α)/(1 + f_α m_α(0))² ] − E[ f_α m_α(0)(∂f_α/∂α)/(1 + f_α m_α(0))³ ] E[ f_α m_α(0)/(1 + f_α m_α(0))² ] }.   (D.18)

Let h_α := f_α m_α(0) = v_x f(v_x;α) m_α(0) and g_α := ∂f_α/∂α = v_x ∂f(v_x;α)/∂α. Then (D.18) can be further expanded into

0 ≤ E[v_x h_α/(1+h_α)³] E[g_α/(1+h_α)³] E[h_α/(1+h_α)³] − E[v_x/(1+h_α)³] E[g_α/(1+h_α)³] E[h_α²/(1+h_α)³]   (part 1)
+ E[v_x/(1+h_α)³] E[g_α h_α/(1+h_α)³] E[h_α/(1+h_α)³] − E[v_x g_α/(1+h_α)³] E[h_α/(1+h_α)³] E[h_α/(1+h_α)³]   (part 2)
+ 2 E[v_x h_α/(1+h_α)³] E[g_α h_α/(1+h_α)³] E[h_α/(1+h_α)³] − 2 E[v_x g_α/(1+h_α)³] E[h_α²/(1+h_α)³] E[h_α/(1+h_α)³]   (part 3)
+ E[v_x h_α/(1+h_α)³] E[g_α h_α/(1+h_α)³] E[h_α²/(1+h_α)³] − E[v_x g_α/(1+h_α)³] E[h_α²/(1+h_α)³] E[h_α²/(1+h_α)³]   (part 4).   (D.19)

Note that under the conditions of (a), both h_α and v_x/h_α are increasing functions of v_x, and both g_α/h_α and g_α are decreasing functions of v_x.
Hence, with calculations similar to (D.16), parts 1-4 in (D.19) are all non-negative, and therefore (D.19) holds.
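As a numerical sanity check on Proposition 5 (this is our own finite-dimensional illustration, not part of the proof; the asymptotic closed forms are replaced by an exact finite-sample computation with assumed toy dimensions), the sketch below forms the min-norm preconditioned interpolator θ̂ = P X'(XPX')^{-1}y and evaluates its exact conditional variance term as α moves from GD (α = 0) to the NGD-like endpoint (α = 1) along P_α = (1−α)I_d + αΣ_X^{-1}:

```python
import numpy as np

def variance_term(X, P, Sigma, sigma2=1.0):
    """Exact (conditional on X) variance of the preconditioned interpolator
    theta_hat = P X^T (X P X^T)^{-1} y under label noise with variance sigma2:
    sigma2 * tr[(XPX^T)^{-1} X P Sigma P X^T (XPX^T)^{-1}]."""
    S = X @ P @ X.T                       # n x n, invertible a.s. when d > n
    A = np.linalg.solve(S, X @ P)         # (XPX^T)^{-1} X P
    return sigma2 * np.trace(A @ Sigma @ A.T)

rng = np.random.default_rng(0)
n, d, kappa = 100, 200, 20.0              # overparameterized: gamma = d/n = 2
# two equally weighted point masses for the eigenvalues of Sigma_X
eigs = np.where(np.arange(d) < d // 2, 1.0, kappa)
eigs *= d / eigs.sum()                    # normalize so tr(Sigma_X) = d
Sigma = np.diag(eigs)
X = rng.standard_normal((n, d)) * np.sqrt(eigs)   # rows ~ N(0, Sigma_X)

alphas = np.linspace(0.0, 1.0, 6)
V = [variance_term(X, (1 - a) * np.eye(d) + a * np.diag(1.0 / eigs), Sigma)
     for a in alphas]
print(V)   # variance should shrink as we move from GD toward NGD
```

With anisotropic features (κ = 20), the computed variance decreases along the interpolation path, matching the monotonicity claim for the variance in part (b).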

Remark.

The above characterization provides sufficient but not necessary conditions for the monotonicity of the bias term. In general, the expression of the bias is rather opaque, and determining the sign of its derivative can be tedious, except in certain special cases (e.g., γ = 2 with the eigenvalues of Σ_X being two equally weighted point masses). We conjecture that the bias is monotone in α ∈ [0, 1] for a much wider class of Σ_X, as shown in Figure 22.

D.7 PROOF OF PROPOSITION 6

Proof. Taking the derivative of V(θ_P(t)) with respect to time yields (omitting the scalar σ²)

dV(θ_P(t))/dt = d/dt ‖ Σ_X^{1/2} P X' ( I_n − exp(−(t/n) XPX') ) (XPX')^{−1} ‖²_F
(i)= (1/n) tr[ Σ_{XP} X̄' S_P exp(−(t/n) S_P) S_P^{−2} ( I_n − exp(−(t/n) S_P) ) X̄ ] (ii)> 0,

where in (i) we defined X̄ = XP^{1/2}, Σ_{XP} = P^{1/2}Σ_X P^{1/2}, and S_P = XPX'; the matrix multiplying Σ_{XP} inside the trace is positive semi-definite, so (ii) follows from (A2-3) together with the inequality tr(AB) ≥ λ_min(A) tr(B) for positive semi-definite A and B.

D.8 PROOF OF PROPOSITION 7

Proof. Recall the definition of the (well-specified) bias of θ_P(t):

B(θ_P(t)) (i)= (1/d) tr[ Σ_θ (I_d − PX'W_P(t)S_P^{−1}X)' Σ_X (I_d − PX'W_P(t)S_P^{−1}X) ]
(ii)= (1/d) tr[ Σ_{θ/P} (I_d − X̄'W_P(t)S_P^{−1}X̄)' Σ_{XP} (I_d − X̄'W_P(t)S_P^{−1}X̄) ]
(iii)≥ (1/d) tr[ ( Σ_{XP}^{1/2} (I_d − X̄'W_P(t)S_P^{−1}X̄) Σ_{θ/P}^{1/2} )² ],   (D.20)

where we defined S_P = XPX' and W_P(t) = I_n − exp(−(t/n)S_P) in (i), X̄ = XP^{1/2} with Σ_{XP} := P^{1/2}Σ_X P^{1/2} and Σ_{θ/P} := P^{−1/2}Σ_θ P^{−1/2} in (ii), and (iii) is due to the inequality tr(A'A) ≥ tr(A²).

When Σ_X = Σ_θ^{−1}, i.e. when NGD achieves the lowest stationary bias, (D.20) simplifies to

B(θ_P(t)) ≥ (1/d) tr[ (I_d − X̄'W_P(t)S_P^{−1}X̄)² ] = 1 − 1/γ + (1/d) Σ_{i=1}^n ( exp(−(t/n)λ̄_i) )²,   (D.21)

where λ̄_i are the eigenvalues of S_P. On the other hand, since F = Σ_X, for the NGD iterate θ_{F^{−1}}(t),

B(θ_{F^{−1}}(t)) = (1/d) tr[ (I_d − X̃'W_{F^{−1}}(t)S_{F^{−1}}^{−1}X̃)² ] = 1 − 1/γ + (1/d) Σ_{i=1}^n ( exp(−(t/n)λ̃_i) )²,   (D.22)

where X̃ = XΣ_X^{−1/2} and λ̃_i are the eigenvalues of S_{F^{−1}} = X̃X̃'. Comparing (D.21) and (D.22), we see that given θ_P(t) at a fixed t, if we run NGD for time T > (λ̄_max/λ̃_min)t (note that T/t = O(1) by (A2-3)), then B(θ_P(t)) ≥ B(θ_{F^{−1}}(T)) for any P satisfying (A3). This implies that B_opt(θ_P) ≥ B_opt(θ_{F^{−1}}).

On the other hand, when Σ_θ = I_d, we can show that the bias term of GD is monotonically decreasing through time by taking its derivative:

d/dt B(θ_I(t)) = (1/d) d/dt tr[ (I_d − X'W_I(t)S_I^{−1}X)' Σ_X (I_d − X'W_I(t)S_I^{−1}X) ]
= −(1/(nd)) tr[ Σ_X X' S_I exp(−(t/n)S_I) S_I^{−1} X (I_d − X'W_I(t)S_I^{−1}X) ] < 0,   (D.23)

where the matrix multiplying Σ_X inside the trace is positive semi-definite. Similarly, one can verify that the expected bias of NGD is monotonically decreasing for all choices of Σ_X and Σ_θ satisfying (A2-4):

d/dt tr[ Σ_θ (I_d − F^{−1}X'W_{F^{−1}}(t)S_{F^{−1}}^{−1}X)' Σ_X (I_d − F^{−1}X'W_{F^{−1}}(t)S_{F^{−1}}^{−1}X) ]
= d/dt tr[ Σ_{Xθ} (I_d − X̃'W_{F^{−1}}(t)S_{F^{−1}}^{−1}X̃)(I_d − X̃'W_{F^{−1}}(t)S_{F^{−1}}^{−1}X̃)' ] (i)< 0,

where (i) follows from a calculation similar to (D.23). Since the expected bias decreases through time for both GD and NGD when Σ_θ = I_d, and from Theorem 3 we know that B(θ̂_I) ≤ B(θ̂_{F^{−1}}), we conclude that B_opt(θ_I) ≤ B_opt(θ_{F^{−1}}).
D.9 PROOF OF THEOREM 8

D.9.1 SETUP AND MAIN RESULT

We restate the setting and assumptions. H is an RKHS included in L²(P_X), equipped with a bounded kernel function k satisfying sup_{x∈supp(P_X)} k(x, x) ≤ 1. K_x ∈ H is the Riesz representation of the kernel function k(x, ·), that is, k(x, y) = ⟨K_x, K_y⟩_H. S is the canonical embedding operator from H to L²(P_X). We write Σ = S*S : H → H and L = SS*. Note that the boundedness of the kernel gives ‖Sf‖_{L²(P_X)} ≤ sup_x |f(x)| = sup_x |⟨K_x, f⟩| ≤ ‖K_x‖_H ‖f‖_H ≤ ‖f‖_H. Hence we know ‖Σ‖ ≤ 1 and ‖L‖ ≤ 1. Our analysis is made under the following regularity assumptions.

• (A4). There exist r ∈ (0, ∞) and M > 0 s.t. f* = L^r h* for some h* ∈ L²(P_X) and ‖f*‖_∞ ≤ M.

• (A5). There exists s > 1 s.t. tr(Σ^{1/s}) < ∞ and 2r + s^{−1} > 1.

• (A6). There exist µ ∈ [s^{−1}, 1] and C_µ > 0 s.t. sup_{x∈supp(P_X)} ‖Σ^{1/2−1/µ} K_x‖_H ≤ C_µ.

(A5) and (A6) are standard regularity assumptions in the literature that provide capacity control of the RKHS (e.g., see Caponnetto & De Vito (2007); Pillaud-Vivien et al. (2018)). For (A4), it is worth noting that previous works mostly consider r ≥ 1/2, which implies f* ∈ H. The training data is generated as y_i = f*(x_i) + ε_i, where the ε_i are i.i.d. noise satisfying |ε_i| ≤ σ almost surely. Let y ∈ R^n be the label vector. We identify R^n with L²(P_n) and define

Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i} : H → H,   Ŝ*Y = (1/n) Σ_{i=1}^n Y_i K_{x_i}   (Y ∈ L²(P_n)).

We consider the following preconditioned update on f_t ∈ H:

f_t = f_{t−1} − η(Σ + λI)^{−1}( Σ̂ f_{t−1} − Ŝ*Y ),   f_0 = 0.

We briefly comment on how our analysis differs from Rudi et al. (2017), which showed that a preconditioned update (the FALKON algorithm) for kernel ridge regression can also achieve accelerated convergence in the population risk. We emphasize the following differences.

• The two algorithms optimize different objectives, as highlighted by the different roles of the "ridge" coefficient λ. In FALKON, λ turns the objective into kernel ridge regression, whereas in our (4.1), λ controls the interpolation between GD and NGD. As we aim to study how the preconditioner affects generalization, it is important that we look at the objective in its original (rather than regularized) form.

• To elaborate on the first point: since FALKON minimizes a regularized objective, it would not overfit even after a large number of gradient steps, but it is then unclear how preconditioning impacts generalization (i.e., any preconditioner may generalize well with proper regularization). In contrast, we consider the unregularized objective, and thus early stopping plays a crucial role in avoiding overfitting; this differs from most standard analyses of GD.

• Algorithm-wise, the two updates employ different preconditioners.
FALKON involves inverting the kernel matrix K defined on the training points, whereas we consider the population covariance operator Σ, which is consistent with our earlier discussion of the population Fisher in Section 3.

• In terms of the theoretical setup, our analysis allows for r < 1/2, whereas Rudi et al. (2017) and many other previous works assumed r ∈ [1/2, 1], as commented in Section 4.3.

We aim to show the following theorem.

Theorem (Formal Statement of Theorem 8). Given (A4-6), if the sample size n is sufficiently large so that 1/(nλ) ≪ 1, then for η < ‖Σ‖ with ηt ≥ 1, 0 < δ < 1 and 0 < λ < 1, it holds with probability 1 − 3δ that

‖Sf_t − f*‖²_{L²(P_X)} ≤ C( B(t) + V(t) ),

where C is a constant and

B(t) := exp(−ηt) ∨ (λ/(ηt))^{2r},
V(t) := V_1(t) + (1 + ηt)[ ( λ^{−1}B(t) + σ² tr(Σ^{1/s}) λ^{−1/s} )/n + λ^{−1}( σ + M + (1 + tη)λ^{−(1/2−r)_+} )²/n² ] log(1/δ)²,

in which

V_1(t) := [ exp(−ηt) ∨ (λ/(ηt))^{2r} + (tη)²( β(1 ∨ λ^{2r−µ}) tr(Σ^{1/s}) λ^{−1/s}/n + β²(1 + λ^{−µ}(1 ∨ λ^{2r−µ}))/n² ) ](1 + tη)²,

for β = log( 28 C_µ² (2^{2r−µ} ∨ λ^{−µ+2r}) tr(Σ^{1/s}) λ^{−1/s} / δ ).

When r ≥ 1/2, if we set λ = n^{−s/(2rs+1)} =: λ* and t = Θ(log n), the overall convergence rate becomes

‖Sf_t − f*‖²_{L²(P_X)} = Õ_p( n^{−2rs/(2rs+1)} ),

which is the minimax optimal rate (Õ_p(·) hides a poly-log(n) factor). On the other hand, when r < 1/2, the bound is also Õ_p(n^{−2rs/(2rs+1)}) except for the term V_1(t). For vanilla GD, the corresponding upper bound translates to

‖Sf_t − f*‖²_{L²(P_X)} ≤ C( (ηt)^{−2r} + (1/n) tr(Σ^{1/s}) (ηt)^{1/s} + (ηt/n)( σ² + (1/(ηt))^{2r} + M² ) + (ηt)^{−(2r−1)}/n ),

with high probability. In other words, under the condition η = O(1), GD needs t = Θ(n^{2rs/(2rs+1)}) steps to sufficiently diminish the bias term. In contrast, the preconditioned update (4.1) that interpolates between GD and NGD only requires t = O(log n) steps to make the bias term negligible. This is because the NGD-like preconditioner amplifies the high-frequency components and rapidly captures the detailed "shape" of the target function f*.

D.9.2 PROOF OF MAIN RESULT

Proof.
We follow the proof strategy of Lin & Rosasco (2017) . First we define a reference optimization problem with iterates ft that directly minimize the population risk: ft = ft-1 -η(Σ + λI) -1 (Σ ft-1 -S * f * ), f0 = 0. Note that E[f t ] = ft . In addition, we define the degrees of freedom and its related quantity as N ∞ (λ) := E x [ K x , Σ -1 λ K x H ] = tr ΣΣ -1 λ , F ∞ (λ) := sup x∈supp(P X ) Σ -1/2 λ K x 2 H . We can see that the risk admits the following bias-variance decomposition Sf t -f * 2 L2(P X ) ≤ 2( Sf t -S ft 2 L2(P X ) V (t), variance + ft -f * 2 L2(P X ) B(t), bias ). We upper bound the bias and variance separately. Bounding the bias term B(t): Note that by the update rule (4.1), it holds that S ft -f * = S ft-1 -f * -ηS(Σ + λI) -1 (Σ ft-1 -S * f * ) ⇔ S ft -f * = (I -ηS(Σ + λI) -1 S * )(S ft-1 -f * ). Hence, unrolling the recursion gives S ft -f * = (I -ηS(Σ+λI) -1 S * ) t (S f0 -f * ) = (I -ηS(Σ+ λI) -1 S * ) t (-f * ) = -(I -ηS(Σ + λI) -1 S * ) t L r h * . Write the spectral decomposition of L as L = ∞ j=1 σ j φ j φ * j for φ j ∈ L 2 (P X ) for σ j ≥ 0. We have (I -ηS(Σ+λI) -1 S * ) t L r h * L2(P X ) = ∞ j=1 (1 -η σj σj +λ ) 2t σ 2r j h 2 j , where h = ∞ j=1 h j φ j . We then apply Lemma 11 to obtain B(t) ≤ exp(-ηt) j:σj ≥λ h 2 j + 2r e λ ηt 2r j:σj <λ h 2 j ≤ C exp(-ηt) ∨ λ ηt 2r h * 2 L2(P X ) , where C is a constant depending only on r. Bounding the variance term V (t): We now handle the variance term V (t). For notational convenience, we write A λ := A + λI for a linear operator A from a Hilbert space H to H. By the definition of f t , we know f t = (I -η(Σ + λI) -1 Σ)f t-1 + η(Σ + λI) -1 Ŝ * Y = t-1 j=0 (I -η(Σ + λI) -1 Σ) j η(Σ + λI) -1 Ŝ * Y = Σ -1/2 λ η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   Σ -1/2 λ Ŝ * Y =: Σ -1/2 λ G t Σ -1/2 λ Ŝ * Y, where we defined G t := η t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j . 
Accordingly, we decompose V (t) as Sf t -S ft 2 L2(P X ) ≤2( S(f t -Σ -1/2 λ G t Σ -1/2 λ Σ ft ) 2 L2(P X ) (a) + S(Σ -1/2 λ G t Σ -1/2 λ Σ ft -ft ) 2 L2(P X ) ). We bound (a) and (b) separately. Step 1. Bounding (a). Decompose (a) as S(f t -Σ -1/2 λ G t Σ -1/2 λ Σ ft ) 2 L2(P X ) = SΣ -1/2 λ G t Σ -1/2 λ ( Ŝ * Y -Σ ft ) 2 L2(P X ) ≤ SΣ -1/2 λ 2 G t Σ -1/2 λ Σλ Σ -1/2 λ 2 Σ 1/2 λ Σ-1 λ Σ 1/2 λ 2 Σ -1/2 λ ( Ŝ * Y -Σ ft ) 2 H . We bound the terms in the RHS individually. (i) SΣ -1/2 λ 2 = Σ -1/2 λ ΣΣ -1/2 λ ≤ 1. (ii) Note that Σ -1/2 λ Σλ Σ -1/2 λ = I -Σ -1/2 λ (Σ -Σ)Σ -1/2 λ (1 -Σ -1/2 λ (Σ -Σ)Σ -1/2 λ )I. Proposition 6 of Rudi & Rosasco (2017) and its proof implies that for λ ≤ Σ and 0 < δ < 1, Σ -1/2 λ (Σ -Σ)Σ -1/2 λ ≤ 2βF ∞ (λ) n + 2β(1 + F ∞ (λ)) 3n =: Ξ n , (D.24) with probability 1 -δ, where β = log 4tr(ΣΣ -1 λ ) δ = log 4N∞(λ) δ . By Lemma 14, β ≤ log 4tr(Σ 1/s )λ -1/s δ and F ∞ (λ) ≤ λ -1 . Therefore, if λ = o(n -1 log(n)) and λ = Ω(n -1/s ), the RHS can be smaller than 1/2 for sufficiently large n, i.e. Ξ n = O( log(n)/(nλ)) ≤ 1/2; thus, Σ -1/2 λ Σλ Σ -1/2 λ 1 2 I. We denote this event as E 1 . (iii) Note that G t Σ -1/2 λ Σλ Σ -1/2 λ = η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   Σ -1/2 λ Σλ Σ -1/2 λ = η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   (Σ -1/2 λ ΣΣ -1/2 λ + λΣ -1 λ ). Thus, by Lemma 12 we have G t Σ -1/2 λ Σλ Σ -1/2 λ ≤ η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   Σ -1/2 λ ΣΣ -1/2 λ ≤1 (due to Lemma 12) + η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   λΣ -1 λ ≤1 + η t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j λΣ -1 λ ≤ 1 + ηt. (iv) Note that Σ -1/2 λ ( Ŝ * Y -Σ ft ) 2 H ≤ 2( Σ -1/2 λ [( Ŝ * Y -Σ ft ) -(S * f * -Σ ft )] 2 H + Σ -1/2 λ (S * f * -Σ ft ) 2 H ). First we bound the first term of the RHS. Let ξ i = Σ -1/2 λ [K xi y i -K xi ft (x i ) -(S * f * -Σ ft )]. Then, {ξ i } n i=1 is an i.i.d. sequence of zero-centered random variables taking value in H and thus Σ -1/2 λ [( Ŝ * Y -Σ ft ) -(S * f * -Σ ft )] 2 H = 1 n n i=1 ξ i 2 H . 
The RHS can be bounded by using Bernstein's inequality in Hilbert space Caponnetto & De Vito (2007) . To apply the inequality, we need to bound the variance and sup-norm of the random variable. The variance can be bounded as E[ ξ i 2 H ] ≤ E (x,y) Σ -1/2 λ (K x (f * (x) -ft (x)) + K ξ ) 2 H ≤ 2 E (x,y) Σ -1/2 λ (K x (f * (x) -ft (x)) 2 H + Σ -1/2 λ (K x ) 2 H ≤ 2 sup x∈supp(P X ) Σ -1/2 λ K x 2 f * -S ft 2 L2(P X ) + σ 2 tr Σ -1 λ Σ ≤ 2 F ∞ (λ)B(t) + σ 2 tr Σ -1 λ Σ ≤ 2 λ -1 B(t) + σ 2 tr Σ -1 λ Σ , The sup-norm can be bounded as follows. Observe that ft ∞ ≤ ft H , and thus by Lemma 13, ξ i H ≤ 2 sup x∈supp(P X ) Σ -1/2 λ K x H (σ + f * ∞ + ft ∞ ) F 1/2 ∞ (λ)(σ + M + (1 + tη)λ -(1/2-r)+ ) λ -1/2 (σ + M + (1 + tη)λ -(1/2-r)+ ). Hence, for 0 < δ < 1, Bernstein's inequality (Proposition 2 of Caponnetto & De Vito (2007) ) gives 1 n n i=1 ξ i 2 H ≤ C   λ -1 B(t) + σ 2 tr Σ -1 λ Σ n + λ -1/2 (σ + M + (1 + tη)λ -(1/2-r)+ ) n   2 log(1/δ) 2 with probability 1 -δ where C is a universal constant. We define this event as E 2 . For the second term Σ -1/2 λ (S * f * -Σ ft ) 2 H we have Σ -1/2 λ (S * f * -Σ ft ) 2 H ≤ Σ 1/2 λ (f * -Sf t ) 2 H = f * -S ft 2 L2(P X ) ≤ B(t) . Combining these evaluations, on the event E 2 where P (E 2 ) ≥ 1 -δ for 0 < δ < 1 we have Σ -1/2 λ ( Ŝ * Y -Σ ft ) 2 H (i) ≤C   λ -1 B(t) + σ 2 tr Σ -1 λ Σ n + λ -1/2 (σ + M + (1 + tη)λ -(1/2-r)+ ) n   2 log(1/δ) 2 + B(t). where we used Lemma 14 in (i). Step 2. Bounding (b). On the event E 1 , the term (b) can be evaluated as S(Σ -1/2 λ G t Σ -1/2 λ Σ ft -ft ) 2 L2(P X ) ≤ Σ 1/2 (Σ -1/2 λ G t Σ -1/2 λ Σ ft -ft ) 2 H ≤ Σ 1/2 Σ -1/2 λ (G t Σ -1/2 λ ΣΣ -1/2 λ -I)Σ 1/2 λ ft 2 H ≤ Σ 1/2 Σ -1/2 λ (G t Σ -1/2 λ ΣΣ -1/2 λ -I)Σ 1/2 λ ft 2 H ≤ (G t Σ -1/2 λ ΣΣ -1/2 λ -I)Σ 1/2 λ ft 2 H . (D.25) where we used Lemma 13 in the last inequality. The term (G t Σ -1/2 λ ΣΣ -1/2 λ -I)Σ 1/2 λ f t H can be bounded as follows. 
First, note that (G t Σ -1/2 λ ΣΣ -1/2 λ -I)Σ 1/2 λ =    η   t-1 j=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) j   Σ -1/2 λ ΣΣ -1/2 λ -I    Σ 1/2 λ = (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) t Σ 1/2 λ . Therefore, the RHS of (D.25) can be further bounded by Next, as in the (D.24), by applying the Bernstein inequality for asymmetric operators (Corollary 3.1 of Minsker (2017) with the argument in its Section 3.2), it holds that (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) t Σ 1/2 λ ft H = (I -ηΣ -1/2 λ ΣΣ -1/2 λ + ηΣ -1/2 λ (Σ -Σ)Σ -1/2 λ ) t Σ 1/2 λ ft H = t-1 k=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) k (ηΣ -1/2 λ (Σ -Σ)Σ -1/2 λ )(I -ηΣ -1 λ Σ) t-k-1 Σ 1/2 λ ft -(I -ηΣ -1 λ Σ) t Σ 1/2 λ ft H (i) ≤ (I -ηΣ -1 λ Σ) t Σ 1/2 λ ft H + η t-1 k=0 (I -ηΣ -1/2 λ ΣΣ -1/2 λ ) k Σ -1/2 λ (Σ -Σ)Σ -1/2+r λ (I -ηΣ -1 λ Σ) t-k-1 Σ 1/2-r λ ft H ≤ (I -ηΣ -1 λ Σ) t Σ 1/2 λ ft H + tη Σ -1/2 λ (Σ -Σ)Σ -1/2+r λ Σ 1/2-r λ ft H = (I -ηΣ -1 λ Σ) t Σ r λ Σ 1/2-r λ ft H + tη Σ -1/2 λ (Σ -Σ)Σ -1/2+r λ Σ 1/2-r λ ft H (I -ηΣ -1 λ Σ) t Σ r λ + tη Σ -1/2 λ (Σ -Σ)Σ Σ -1/2 λ (Σ -Σ)Σ -1/2+r λ ≤C β C 2 µ (2 2r-µ ∨ λ 2r-µ )N ∞ (λ) n + β ((1 + λ) r + C 2 µ λ -µ/2 (2 2r-µ ∨ λ r-µ/2 ) n =: Ξ n , with probability 1 -δ, where C is a universal constant and β ≤ log 28C 2 µ (2 2r-µ ∨λ -µ+2r )tr(Σ 1/s )λ -1/s δ . We also used the following bounds on the sup-norm and the second order moments: (sup-norm) Σ -1/2 λ (K x K * x -Σ)Σ -1/2+r λ ≤ Σ -1/2 λ K x K * x Σ -1/2+r λ + Σ r λ ≤ Σ -µ/2 λ Σ µ/2-1/2 λ K x K * x Σ -1/2+µ/2 λ Σ r-µ/2 λ + Σ r λ ≤ C 2 µ λ -µ/2 (2 r-µ/2 ∨ λ r-µ/2 ) + (1 + λ) r (a.s.), (2nd order moment 1) E x [Σ -1/2 λ (K x K * x -Σ)Σ -1+2r λ (K x K * x -Σ)Σ -1/2 λ ] ≤ Σ -1/2 λ ΣΣ -1/2 λ sup x∈supp(P X ) [K * x Σ -1/2+µ/2 λ Σ -µ+2r λ Σ -1/2+µ/2 λ K x ] ≤ C 2 µ (2 2r-µ ∨ λ 2r-µ ), (2nd order moment 2) E x [Σ -1/2+r λ (K x K * x -Σ)Σ -1/2 λ Σ -1/2 λ (K x K * x -Σ)Σ -1/2+r λ ] ≤ E x [Σ -1/2+r λ K x K * x Σ -1 λ K x K * x Σ -1/2+r λ ] ≤ C 2 µ (2 2r-µ ∨ λ 2r-µ )E x [K * x Σ -1 λ K x ] = C 2 µ (2 2r-µ ∨ λ 2r-µ )tr ΣΣ -1 λ = C 2 µ (2 2r-µ ∨ λ 2r-µ )N ∞ (λ). 
We define this event as E_3. The RHS of (D.26) can then be bounded accordingly. Finally, note that when λ = λ* and 2r ≥ µ,

Ξ̃²_n = Õ( λ*^{2r−µ−1/s}/n + λ*^{2(r−µ)}/n² ) ≤ Õ( n^{−s(4r−µ)/(2rs+1)} + n^{−(s(4r−2µ)+2)/(2rs+1)} ) ≤ Õ( n^{−2rs/(2rs+1)} ).

Step 3. Combining the calculations in Steps 1 and 2 leads to the desired result.

Lemma 12. For t ∈ N, 0 < η, and 0 ≤ σ such that ησ < 1, it holds that η Σ_{j=0}^{t−1} (1 − ησ)^j σ ≤ 1.

Proof. If σ = 0, the statement is obvious. Assume σ > 0; then

Σ_{j=0}^{t−1} (1 − ησ)^j σ = σ (1 − (1 − ησ)^t)/(1 − (1 − ησ)) = (1/η)[1 − (1 − ησ)^t] ≤ η^{−1}.

This yields the desired claim.

Lemma 13. Under (A4-6), for any 0 < λ < 1 and q ≤ r, it holds that

‖Σ_λ^{−q} f̃_t‖_H ≲ ( 1 + λ^{−(1/2+(q−r))_+} + λtη λ^{−(3/2+(q−r))_+} ) ‖h*‖_{L²(P_X)}.

Proof. Recall that

f̃_t = (I − η(Σ + λI)^{−1}Σ) f̃_{t−1} + η(Σ + λI)^{−1} S*f* = Σ_{j=0}^{t−1} (I − η(Σ + λI)^{−1}Σ)^j η(Σ + λI)^{−1} S*f*.

Therefore, we obtain the following:

‖Σ_λ^{−q} f̃_t‖_H = ‖ η Σ_{j=0}^{t−1} (I − ηΣ_λ^{−1}Σ)^j Σ_λ^{−1−q} S*L^r h* ‖_H
= ‖ η Σ_{j=0}^{t−1} (I − ηΣ_λ^{−1}Σ)^j Σ_λ^{−1}(Σ + λI) Σ_λ^{−q−1} S*L^r h* ‖_H
≤ ‖ η Σ_{j=0}^{t−1} (I − ηΣ_λ^{−1}Σ)^j Σ_λ^{−1}Σ ‖ · ‖ Σ_λ^{−q−1} S*L^r h* ‖_H + λη ‖ Σ_{j=0}^{t−1} (I − ηΣ_λ^{−1}Σ)^j Σ_λ^{−1} ‖ · ‖ Σ_λ^{−q−1} S*L^r h* ‖_H
≤ ‖ Σ_λ^{−q−1} S*L^r h* ‖_H + λtη ‖ Σ_λ^{−1} Σ_λ^{−q−1} S*L^r h* ‖_H
≤ ‖ S* L_λ^{−q−1+r} h* ‖_H + λtη ‖ S* L_λ^{−q−2+r} h* ‖_H
= √⟨ h*, L_λ^{−q−1+r} L L_λ^{−q−1+r} h* ⟩_{L²(P_X)} + λtη √⟨ h*, L_λ^{−q−2+r} L L_λ^{−q−2+r} h* ⟩_{L²(P_X)}
≤ ( λ^{−1/2−(q−r)} + λtη λ^{−3/2−(q−r)} ) ‖h*‖_{L²(P_X)} ≤ (1 + tη) λ^{−1/2−(q−r)} ‖h*‖_{L²(P_X)}.

Lemma 14. Given (A4-6) and λ ∈ (0, 1), we have N_∞(λ) ≤ tr(Σ^{1/s}) λ^{−1/s} and F_∞(λ) ≤ 1/λ.

Proof. For the first inequality, we have

N_∞(λ) = tr( ΣΣ_λ^{−1} ) = tr( Σ^{1/s} Σ^{1−1/s} Σ_λ^{−(1−1/s)} Σ_λ^{−1/s} ) ≤ tr( Σ^{1/s} Σ^{1−1/s} Σ_λ^{−(1−1/s)} ) λ^{−1/s} ≤ tr( Σ^{1/s} ) λ^{−1/s}.

As for the second inequality, note that

F_∞(λ) = sup_x ⟨K_x, Σ_λ^{−1} K_x⟩_H ≤ sup_x λ^{−1} ⟨K_x, K_x⟩_H = λ^{−1} sup_x k(x, x) ≤ λ^{−1}.
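To make update (4.1) concrete, the finite-dimensional sketch below (our own toy instantiation with assumed dimensions, using an explicit feature map in place of the RKHS, and using the empirical covariance for both the gradient and the preconditioner for simplicity) runs w_t = w_{t−1} − η(Σ̂ + λI)^{−1}(Σ̂ w_{t−1} − Φ'y/n) and compares a heavily damped (GD-like) run against a lightly damped (NGD-like) run on the same ill-conditioned problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 40
# ill-conditioned design: fast-decaying feature scales
scales = np.logspace(0, -3, d)                 # covariance condition number ~ 1e6
Phi = rng.standard_normal((n, d)) * scales
w_star = rng.standard_normal(d)
y = Phi @ w_star                               # noiseless labels, for clarity

Sigma_hat = Phi.T @ Phi / n
b = Phi.T @ y / n

def run(lmbda, eta, steps):
    """Preconditioned gradient descent:
    w <- w - eta * (Sigma_hat + lmbda I)^{-1} (Sigma_hat w - b).
    Large lmbda ~ (rescaled) GD; tiny lmbda ~ NGD-like update."""
    M = Sigma_hat + lmbda * np.eye(d)
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * np.linalg.solve(M, Sigma_hat @ w - b)
    return np.mean((Phi @ w - y) ** 2)         # training risk

steps = 50
res_gd = run(lmbda=10.0, eta=1.0, steps=steps)    # GD-like (heavy damping)
res_ngd = run(lmbda=1e-8, eta=1.0, steps=steps)   # NGD-like (tiny damping)
print(res_gd, res_ngd)
```

Since the preconditioned iteration contracts each eigendirection by 1 − ησ/(σ + λ), a tiny λ drives the training residual toward zero in a few steps regardless of the conditioning, while the damped (GD-like) run is bottlenecked by the smallest eigenvalues — mirroring the t = O(log n) versus t = Θ(n^{2rs/(2rs+1)}) contrast above.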

D.10 PROOF OF COROLLARY 9

Proof. Note that in this setting v_x takes the values 2/(1+κ) and 2κ/(1+κ) with probability 1/2 each. From (D.8) in the proof of Proposition 4, one can easily verify that for NGD,

B(θ̂_{F^{−1}}) → ( 2^r (1 + κ^{1+r}) / (1+κ)^{1+r} ) ( 1 − 1/γ ).

For GD, the bias formula (D.7) simplifies to

B(θ̂_I) → (1/γ) · [ (2/(1+κ))^r / (1+κ+2m_1)² + κ (2κ/(1+κ))^r / (1+κ+2κm_1)² ] · [ m_1/(1+κ+2m_1)² + κ m_1/(1+κ+2κm_1)² ]^{−1},

where m_1 is the Stieltjes transform defined after Equation (D.7). A standard numerical calculation shows that when γ > 1 and κ ≥ 1,

m_1 = (κ+1) [ √( γ²(κ+1)² + 4(1−γ)(κ−1)² ) + (2−γ)(κ+1) ] / ( 8(γ−1)κ ).

Setting B(θ̂_I) = B(θ̂_{F^{−1}}) and solving for r yields (D.27), where we defined

c_1 = (1 − 1/γ) / (κ+1),   c_2 = (1 − 1/γ) κ / (κ+1),
c_3 = (1/γ) · (1+κ+2κm_1)² / [ m_1(1+κ+2κm_1)² + κ m_1(1+κ+2m_1)² ],
c_4 = (1/γ) · κ(1+κ+2m_1)² / [ m_1(1+κ+2κm_1)² + κ m_1(1+κ+2m_1)² ].

Hence, Proposition 4 (from which we know r* ∈ (−1, 0)) and the uniqueness of the solution to (D.27) imply that when r ≥ r*, B(θ̂_I) ≤ B(θ̂_{F^{−1}}), and vice versa. Finally, observe that in the special case γ = 2, m_1 = (κ+1)/(2√κ). Therefore, one can check that the constants in (D.27) simplify to

c_1 − c_3 = (1 − √κ) / (2(κ+1)),   c_2 − c_4 = √κ(√κ − 1) / (2(κ+1)),

which implies that r* = −1/2.
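The closing special case can be verified numerically. The check below (our own sanity check, not part of the derivation) evaluates m_1, the constants c_1-c_4, and r* = −ln(c_{κ,γ})/ln κ from (D.27) at γ = 2 for several condition numbers κ, confirming m_1 = (κ+1)/(2√κ) and r* = −1/2:

```python
import math

def r_star(gamma, kappa):
    """Evaluate m_1, then r* = -ln(c_{kappa,gamma})/ln(kappa) with
    c = (c4 - c2)/(c1 - c3), following the constants in the proof of Corollary 9."""
    m1 = (kappa + 1) * (
        math.sqrt(gamma**2 * (kappa + 1)**2 + 4 * (1 - gamma) * (kappa - 1)**2)
        + (2 - gamma) * (kappa + 1)
    ) / (8 * (gamma - 1) * kappa)
    denom = m1 * (1 + kappa + 2 * kappa * m1)**2 + kappa * m1 * (1 + kappa + 2 * m1)**2
    c1 = (1 - 1 / gamma) / (kappa + 1)
    c2 = (1 - 1 / gamma) * kappa / (kappa + 1)
    c3 = (1 / gamma) * (1 + kappa + 2 * kappa * m1)**2 / denom
    c4 = (1 / gamma) * kappa * (1 + kappa + 2 * m1)**2 / denom
    c = (c4 - c2) / (c1 - c3)
    return m1, -math.log(c) / math.log(kappa)

for kappa in (2.0, 5.0, 20.0):
    m1, r = r_star(2.0, kappa)
    print(kappa, m1, r)   # expect m1 = (kappa+1)/(2 sqrt(kappa)) and r = -1/2
```

At γ = 2 the ratio (c_4 − c_2)/(c_1 − c_3) collapses to √κ, so r* = −ln(√κ)/ln κ = −1/2 for every κ > 1.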

D.11 PROOF OF PROPOSITION 10

Proof. Part (a) is a simple combination of (Bai & Yin, 2008, Theorem 2) and assumption (A3), which together imply that ‖Σ_X‖₂ and ‖Σ_X^{−1}‖₂ are both finite. For part (b), the substitution error for the variance term (ignoring the scalar σ²) can be bounded as

|V* − V̂| = | tr[ F^{−1}X'(XF^{−1}X')^{−2}XF^{−1}Σ_X ] − tr[ F̂^{−1}X'(XF̂^{−1}X')^{−2}XF̂^{−1}Σ_X ] |
(i)≤ O(1)( ‖F^{−1} − F̂^{−1}‖₂ tr[ X'S^{−2}X F Σ_X ] + √d ‖X'S^{−2}X‖₂ ‖Σ_X‖ ‖F^{−1} − F̂^{−1}‖_F + tr[ X'F^{−1}Σ_X F^{−1}X ] ‖S^{−2} − Ŝ^{−2}‖₂ )
(ii)= O(ε).
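Part (a) — that ψ = Θ(ε^{−2}) unlabeled samples yield an ε-accurate spectral-norm estimate of the Fisher — can be illustrated with a quick simulation (our own toy check with assumed dimensions; Gaussian features, so that the population Fisher coincides with Σ_X):

```python
import numpy as np

def fisher_error(n_u, d, eigs, rng):
    """Spectral-norm error of the plug-in Fisher estimate
    F_hat = (1/n_u) X_u^T X_u against F = Sigma_X (diagonal here)."""
    Xu = rng.standard_normal((n_u, d)) * np.sqrt(eigs)
    F_hat = Xu.T @ Xu / n_u
    return np.linalg.norm(F_hat - np.diag(eigs), 2)

rng = np.random.default_rng(3)
d = 50
eigs = np.linspace(0.5, 2.0, d)       # bounded spectrum, in the spirit of (A2)
err_small = fisher_error(500, d, eigs, rng)     # psi = n_u / d = 10
err_large = fisher_error(50000, d, eigs, rng)   # psi = n_u / d = 1000
print(err_small, err_large)   # error shrinks roughly like 1/sqrt(psi)
```

Increasing the unlabeled-to-dimension ratio ψ by 100x shrinks the spectral error by roughly 10x, consistent with the ψ = Θ(ε^{−2}) scaling.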



From now on we use NGD to denote the population Fisher-based update, and we write "sample NGD" when P is the inverse or pseudo-inverse of the sample Fisher; see Section for discussion.
Note that (A2)(A3) cover many common choices of preconditioner, such as the population Fisher and variants of the sample Fisher (which is degenerate but leads to the same minimum ℓ2-norm solution as GD).
Two concurrent works (Wu & Xu, 2020; Richards et al., 2020) also considered a similar relaxation of Σ_θ in the context of ridge regression.
Note that this reduces to the random effects model studied in Dobriban et al. (2018).
We remark that most previous works considered the case r ≥ 1/2, which implies f* ∈ H.
We note, however, that observations in linear models may not translate to neural networks: many works have illustrated such a gap (e.g., Ghorbani et al. (2019); Allen-Zhu & Li (2019); Suzuki (2020); Yang & Hu (2020)).
In a companion work (Wu & Xu, 2020) we characterized the impact of ℓ2 regularization in similar settings.
Note that this gap is only present when the population Fisher is used; previous works have shown NTK-type global convergence for sample-Fisher-related updates (Zhang et al., 2019b; Cai et al., 2019).



Figure 1: 1D illustration of different implicit biases: two-layer sigmoid network trained with preconditioned GD.

Figure 2: Geometric illustration (2D) of how the interpolating θ P depends on the preconditioner.

Figure 3: Population risk of preconditioned linear regression vs. time with the following P: I (red), Σ_X^{−1} (blue) and (X'X)† (cyan). Time is rescaled differently for each curve (convergence speed is not comparable). Observe that GD and sample NGD give the same stationary risk.

well-specified bias (misaligned).

Figure 5: We set the eigenvalues of Σ_X as two point masses with κ_X = 20 and ‖Σ_X‖²_F = d; empirical values (dots) are computed with n = 300. (a) NGD (blue) achieves minimum variance. (b) GD (red) achieves lower bias under isotropic signal: Σ_θ = I_d. (c) NGD achieves lower bias under "misalignment": Σ_X = Σ_θ^{−1}.

Figure 4: Misspecified bias with Σ_θ = I_d (favors GD) and f*_c(x) = α(x'x − tr(Σ_X)), where α controls the extent of nonlinearity. Predictions are generated by matching σ² with the second moment of f*_c.

Figure 6: Illustration of isotropic and misaligned θ * .

Figure 7: Bias-variance tradeoff with κX = 25, Σ θ = I d and SNR=32/5. As we additively (ii) or geometrically (iii) interpolate from GD to NGD (left to right), the stationary bias (blue) increases and the stationary variance (orange) decreases.

Figure 8: Comparison between NGD and GD. Error bar is one standard deviation away from mean over five independent runs. Numbers in parentheses denote amount of unlabeled examples for estimating the Fisher.

Figure 9(b)(c) supports the bias-variance tradeoff discussed in Section 4.1 in neural network settings. In particular, we interpret the left end of the figure as corresponding to the bias-dominant regime (due to the same architecture for the student and teacher), and the right end as the variance-dominant regime (due to large label noise). Observe that at a certain SNR, a preconditioner that interpolates (additively or geometrically) between GD and NGD can achieve lower stationary risk.

additive interpolation (ii) between GD and NGD (MNIST).

geometric interpolation (iii) between GD and NGD (MNIST).

Figure 9: (a) numbers in parentheses indicate the amount of unlabeled data used in estimating the Fisher F; we expect the estimated Fisher to be closer to the sample Fisher when the amount of unlabeled data is small. (b) additive interpolation P = (F + αI_d)^{−1}; a larger damping parameter yields an update closer to GD. (c) geometric interpolation P = F^{−α}; a larger α yields an update closer to that of NGD (blue).

A.1 Bias of GD vs. NGD
A.2 Bias Term Under Specific Source Condition
A.3 Estimating the Population Fisher
A.4 Interpretation of y'K^{−1}y/n
A.5 Non-monotonicity of Bias Term w.r.t. Time
B Additional Related Works
C Additional Figures
C.1 Overparameterized Linear Regression
C.2 RKHS Setting
C.3 Neural Network
D Proofs and Derivations
D.1 Missing Derivations in Section 3
D.2 Proof of Theorem 1
D.3 Proof of Corollary 2
D.4 Proof of Theorem 3
D.5 Proof of Proposition 4
D.6 Proof of Proposition 5
D.7 Proof of Proposition 6
D.8 Proof of Proposition 7
D.9 Proof of Theorem 8
D.10 Proof of Corollary 9
D.11 Proof of Proposition 10
E Experiment Setup
E.1 Processing the Datasets
E.2 Setup and Implementation for Optimizers
E.3 Other Details

Difference in Function Values.

Figure 10: Illustration of implicit bias of GD and NGD. We set n = 100, d = 50, and regress a two-layer

Figure 12: We set Σ_θ = Σ_X^r, γ = 2, κ = 20, and plot the stationary bias (well-specified) under varying r.

Figure 13: y'K^{−1}y/n on a two-layer neural network (CIFAR-10).

Figure 14: Epoch-wise double descent.

Figure 20: Population risk of the preconditioned update in the RKHS that interpolates between GD and NGD. We use the IMQ kernel and set n = 1000, d = 5, N = 2500, σ² = 5 × 10^{−4}. The x-axis has been rescaled for each curve, so convergence speed is not directly comparable. Note that (a) large λ (i.e., a GD-like update) is beneficial when r is large, and (b) small λ (i.e., an NGD-like update) is beneficial when r is small.

Figure 21: (a)(b) Additional label noise experiment on CIFAR-10. (c)(d) Population risk of two-layer neural networks in the misalignment setup with synthetic Gaussian data. We set n = 200, d = 50, the damping coefficient λ = 10^{−6}, and both the student and the teacher are two-layer ReLU networks with 50 hidden units. The x-axis and the learning rate have been rescaled for each curve (i.e., optimization speed is not comparable). When r is sufficiently small, NGD achieves lower early stopping risk, whereas GD is beneficial for larger r.

Figure 22: Illustration of the monotonicity of the bias term under Σ_θ = I_d. We consider two distributions of eigenvalues for Σ_X: two equally weighted point masses (circle) and a uniform distribution (star), and vary the condition number κ_X and the overparameterization level γ. In all cases the bias is monotone in α ∈ [0, 1].

is due to exchangeability of Σ_λ and Σ. By Lemma 11, for the RHS we have ‖(I − ηΣ_λ^{−1}Σ)^t Σ_λ^r‖ ≤ exp(−ηt/2) ∨

tηΞ n (1 + tη) h * L2(P X ) .

For t ∈ N, 0 < η < 1, 0 < σ ≤ 1 and 0 ≤ λ, it holds that: when σ ≥ λ, we have

(1 − ησ/(σ+λ))^t σ^r ≤ (1 − ησ/(2σ))^t σ^r = (1 − η/2)^t σ^r ≤ exp(−tη/2) σ^r ≤ exp(−tη/2),

due to σ ≤ 1. On the other hand, note that

(1 − ησ/(σ+λ))^t σ^r ≤ exp(−ηt σ/(σ+λ)) σ^r,   with   sup_{x>0} exp(−x) x^r = (r/e)^r.

r* = −ln(c_{κ,γ})/ln κ,   where   c_{κ,γ} = (c_4 − c_2)/(c_1 − c_3).   (D.27)

ACKNOWLEDGEMENT

The authors would like to thank Murat A. Erdogdu, Fartash Faghri, Ryo Karakida, Yiping Lu, Jiaxin Shi, Shengyang Sun, Yusuke Tsuzuku, Guodong Zhang, Michael Zhang, Tianzong Zhang, and the anonymous ICLR reviewers for helpful feedback. The authors are also grateful to Tomoya Murata for his contribution to preliminary studies on the nonparametric least squares setting. JB and RG were supported by the CIFAR AI Chairs program. JB and DW were partially supported by LG Electronics and NSERC. AN was partially supported by JSPS Kakenhi (19K20337) and JST-PRESTO. TS was partially supported by JSPS Kakenhi (26280009, 15H05707 and 18H03201), Japan Digital Design and JST-CREST. JX was supported by a Cheung-Kong Graduate School of Business Fellowship. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

C ADDITIONAL FIGURES C.1 OVERPARAMETERIZED LINEAR REGRESSION

Non-monotonicity of the Risk. Under our generalized (anisotropic) assumption on the covariance of the features and the target, both the bias and the variance term can exhibit non-monotonicity w.r.t. the overparameterization level γ > 1: in Figure 15 we observe two peaks in the bias term and three peaks in the variance term. In contrast, it is known that when Σ_X = I_d, both the bias and the variance are monotone in the overparameterized regime (e.g., Hastie et al. (2019)).

Figure 15: Illustration of the "multiple-descent" curve of the risk for γ > 1. We take n = 300, the eigenvalues of Σ_X as three equally-spaced point masses with κ_X = 5000 and ‖Σ_X‖²_F = d, and Σ_θ = Σ_X^{−1} (misaligned). Note that for GD, both the bias and the variance are highly non-monotonic for γ > 1.

Additional Figures for Section 3 and 4. We include additional figures on (a) the well-specified bias when Σ_θ = I_d (GD is optimal); (b) the misspecified bias under unobserved features (predicted by Corollary 2); (c) the bias-variance tradeoff obtained by interpolating between preconditioners (SNR = 5). As shown in Figures 16 and 17, in all cases the experimental values match the theoretical predictions.

Figure 17: We construct the eigenvalues of Σ_X with a polynomial decay, λ_i(Σ_X) = i^{−1}, and then rescale them such that κ_X = 500 and ‖Σ_X‖²_F = d.

Early Stopping Risk. Figure 18 compares the stationary risk with the optimal early stopping risk under varying misalignment level. We set Σ_θ = Σ_X^r and vary r from 0 to −1. As discussed in Section 3.2, smaller r entails a more "misaligned" teacher, and vice versa. Note that as the problem becomes more misaligned, NGD achieves lower stationary and early stopping risk. Figure 19 reports the optimal early stopping risk under misspecification (the same trend can be obtained when the x-axis is label noise). In contrast to the stationary risk (Figure 4), GD can be advantageous under early stopping even with a large extent of misspecification (for an isotropic teacher).
This aligns with our finding in Section 4.2 that early stopping reduces the variance and the misspecified bias.

Figure 18: Well-specified bias against different extents of "alignment". We set n = 300, the eigenvalues of Σ_X as two point masses with κ_X = 20, take Σ_θ = Σ_X^r, and vary r from -1 to 0. (a) GD achieves lower bias when Σ_θ is isotropic, NGD dominates when Σ_θ is misaligned, and the interpolated preconditioner (between GD and NGD) is advantageous in between. (b) The optimal early stopping bias follows a similar trend as the stationary bias.
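The non-monotone GD risk curves above can be probed with a short Monte Carlo sketch: in overparameterized ridgeless regression, GD from zero initialization converges to the minimum-ℓ2-norm interpolator, so its excess risk can be estimated by sampling. The eigenvalue construction (three equally-spaced point masses, trace normalized to d) and the misaligned teacher Σ_θ ∝ Σ_X^{-1} mirror the Figure 15 setup, but the overall scale of the teacher covariance is an illustrative assumption.

```python
import numpy as np

def minnorm_risk(n, d, kappa=5000.0, trials=20, seed=0):
    """Monte Carlo estimate of the excess risk of the minimum-l2-norm (GD)
    interpolator under anisotropic features; toy sketch of the Figure 15 setup."""
    rng = np.random.default_rng(seed)
    # eigenvalues of Sigma_X: three equally-spaced point masses with
    # condition number kappa, rescaled so that the trace equals d
    lam = np.repeat(np.linspace(1.0, kappa, 3), d // 3 + 1)[:d]
    lam *= d / np.sum(lam)
    risks = []
    for _ in range(trials):
        # misaligned teacher: Sigma_theta proportional to Sigma_X^{-1}
        # (the 1/sqrt(d) scale is an illustrative normalization)
        theta = rng.standard_normal(d) / np.sqrt(lam * d)
        X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows ~ N(0, Sigma_X)
        y = X @ theta                                    # noiseless labels
        theta_hat = np.linalg.pinv(X) @ y                # min-norm interpolator
        err = theta_hat - theta
        risks.append(err @ (lam * err))  # excess risk: err^T Sigma_X err
    return float(np.mean(risks))
```

Sweeping d (hence γ = d/n) and plotting `minnorm_risk` traces out the multiple-descent shape; averaging more trials smooths the curve.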

C.2 RKHS SETTING

We simulate the optimization in the coordinates of the RKHS via a finite-dimensional approximation (using extra unlabeled data). In particular, we consider a teacher model of the form f*(x) = Σ_{i=1}^N h_i μ_i^r φ_i(x) for square-summable {h_i}_{i=1}^N, in which r controls the "difficulty" of the learning problem. We find {μ_i}_{i=1}^N and {φ_i}_{i=1}^N by solving the eigenfunction problem for some kernel k. The student model takes the form f(x) = Σ_{i=1}^N a_i √μ_i φ_i(x), and we optimize the coefficients {a_i}_{i=1}^N via the preconditioned update (4.1). We set n = 1000, d = 5, N = 2500 and consider the inverse multiquadratic (IMQ) kernel.

Recall that Theorem 8 suggests that for small r, i.e. a "difficult" problem, the damping coefficient λ would need to be small (which makes the update NGD-like), and vice versa. This result is (qualitatively) supported by Figure 20, from which we can see that small λ is beneficial when r is small, and vice versa. We remark that this observed trend is rather fragile and sensitive to various hyperparameters, and we leave a comprehensive characterization of this observation as future work.

where we defined S = XF^{-1}X^T and Ŝ = XF̂^{-1}X^T in (i) and applied tr(AB) ≤ λ_max(A + A^T) tr(B) for positive semi-definite B, as well as tr(AB) ≤ √d ‖A‖_2 ‖B‖_F, and (ii) is due to (A3), ψ > 1, (Wedin, 1973, Theorem 4.1) and the following estimate, where (i) again follows from (Wedin, 1973, Theorem 4.1), and (ii) is due to (A1)(A3) and ψ > 1 (which implies ‖n_u S^{-1}‖_2 and ‖n_u Ŝ^{-1}‖_2 are bounded a.s.). Finally, from part (a) we know that ψ = Θ(ε^{-2}) suffices to achieve an ε-accurate approximation of F in spectral norm. The substitution error for the bias term can be derived from a similar calculation, the details of which we omit.
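The λ tradeoff suggested by Theorem 8 can be illustrated with a simplified diagonal sketch of the damped preconditioned update on the population risk. We assume a polynomial eigendecay μ_i = i^{-2} and teacher coefficients a*_i = μ_i^r; both are hypothetical choices standing in for the IMQ eigensystem, not the paper's finite-sample simulation.

```python
import numpy as np

def damped_update_risk(r=0.5, lam=1e-2, N=500, eta=0.1, steps=2000):
    """Population risk after running the damped preconditioned update
    a <- a - eta * (F + lam I)^{-1} grad in the (diagonal) eigenbasis,
    where F = diag(mu) is the population Fisher of this linear model."""
    i = np.arange(1, N + 1)
    mu = i ** -2.0            # assumed polynomial eigendecay of the kernel
    a_star = mu ** r          # teacher coefficients; small r = "difficult"
    a = np.zeros(N)
    for _ in range(steps):
        grad = mu * (a - a_star)          # gradient of 0.5*sum mu_i (a_i - a*_i)^2
        a -= eta * grad / (mu + lam)      # small lam: NGD-like; large lam: GD-like
    return float(np.sum(mu * (a - a_star) ** 2))
```

Consistent with the qualitative trend in Figure 20, for small r a small λ drives down all modes at a uniform rate, while a large λ leaves the small-eigenvalue modes (which carry most of the signal when r is small) nearly untouched.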

E EXPERIMENT SETUP

E.1 PROCESSING THE DATASETS

To obtain extra unlabeled data to estimate the Fisher, we zero-pad the pixels on the borders of each image before randomly cropping; a random horizontal flip is also applied for CIFAR10 images (Krizhevsky et al., 2009). We preprocess all images by dividing the pixel values by 255 and then subtracting 1/2, so that they lie in [-0.5, 0.5]. For CIFAR10, we downsample the original images using a max pooling layer with kernel size 2 and stride 2.
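A minimal numpy sketch of this pipeline follows; the pad and crop sizes are illustrative assumptions rather than the paper's exact values.

```python
import numpy as np

def preprocess(img, pad=4, crop=None, flip_p=0.5, rng=None):
    """Sketch of the augmentation described above: normalize to [-0.5, 0.5],
    zero-pad the borders, random crop, random horizontal flip.
    `img` is an HxWxC uint8 array; pad/crop sizes are assumptions."""
    rng = rng or np.random.default_rng()
    h, w, c = img.shape
    crop = crop or (h, w)
    x = img.astype(np.float32) / 255.0 - 0.5           # pixels in [-0.5, 0.5]
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))    # zero-pad borders
    top = rng.integers(0, x.shape[0] - crop[0] + 1)    # random crop offsets
    left = rng.integers(0, x.shape[1] - crop[1] + 1)
    x = x[top:top + crop[0], left:left + crop[1]]
    if rng.random() < flip_p:                          # random horizontal flip
        x = x[:, ::-1]
    return x

def maxpool2(x):
    """2x2 max pooling with stride 2 (the CIFAR10 downsampling step)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))
```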

E.2 SETUP AND IMPLEMENTATION FOR OPTIMIZERS

In all settings, GD uses a learning rate of 0.01 that is exponentially decayed by a factor of 0.999 every 1k updates. For NGD, we use a fixed learning rate of 0.03. Since inverting a parameter-by-parameter-sized Fisher estimate at every iteration would be costly, we adopt the Hessian-free approach (Martens, 2010), which computes approximate matrix-inverse-vector products using the conjugate gradient (CG) method (Hestenes et al., 1952). For the first run of CG, we initialize the vector from a standard Gaussian and run CG for 5k iterations; each subsequent approximate inversion runs CG for 200 iterations starting from the solution returned by the previous CG run. The precise number of CG iterations and the initialization heuristic roughly follow Martens & Sutskever (2012). To ensure invertibility, we apply a very small amount of damping (10^{-5}) in most scenarios. For the geometric interpolation experiments between GD and NGD, we use the singular value decomposition to compute the −α power of the Fisher, as CG is not applicable in this scenario.
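A minimal sketch of the CG solve underlying the Hessian-free updates; `mvp` stands for a damped Fisher-vector product, which in practice is computed without ever materializing the Fisher. The warm start from the previous solution and the 200-iteration budget mirror the description above.

```python
import numpy as np

def cg_solve(mvp, b, x0=None, iters=200, tol=1e-10):
    """Conjugate gradient for (F + lam*I) x = b, given only the
    matrix-vector product `mvp` (Hessian-free style sketch)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()  # warm start if available
    r = b - mvp(x)          # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:        # squared residual norm is small enough
            break
        Ap = mvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Subsequent calls would pass the previous solution as `x0`, matching the initialization heuristic of Martens & Sutskever (2012).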

E.3 OTHER DETAILS

For experiments in the label noise and misspecification sections, we pretrain the teacher using the Adam optimizer (Kingma & Ba, 2014) with its default hyperparameters and a learning rate of 0.001. For experiments in the misalignment section, we downsample all images twice using max pooling with kernel size 2 and stride 2. Moreover, only for experiments in this section, we implement natural gradient descent by exactly computing the Fisher on a large batch of unlabeled data, inverting the matrix with PyTorch's torch.inverse, and then multiplying the inverse with the gradient.
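For the geometric interpolation experiments, the −α power of the Fisher can be sketched as follows; since the (damped) Fisher is symmetric positive semi-definite, an eigendecomposition plays the role of the SVD. The damping value below is illustrative.

```python
import numpy as np

def fractional_precondition(F, grad, alpha, damping=1e-5):
    """Apply (F + damping*I)^{-alpha} to a gradient via eigendecomposition.
    alpha = 0 recovers the GD direction, alpha = 1 the (damped) NGD direction,
    and intermediate alpha interpolates geometrically between the two."""
    w, V = np.linalg.eigh(F + damping * np.eye(F.shape[0]))  # F symmetric PSD
    return V @ ((w ** -alpha) * (V.T @ grad))
```

In practice F would be the Fisher estimated on a large unlabeled batch; here it is any symmetric PSD matrix.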

