WHEN DOES PRECONDITIONING HELP OR HURT GENERALIZATION?

Abstract

While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the implicit bias of optimizers affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of preconditioned ridgeless regression in the overparameterized regime, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal preconditioner P for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between first- and second-order updates. We then extend our analysis to regression in the reproducing kernel Hilbert space (RKHS) and demonstrate that preconditioning can lead to a more efficient decrease in the population risk. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.

1. INTRODUCTION

We study the generalization property of an estimator θ obtained by minimizing the empirical risk (the training error) L(f_θ) via a preconditioned gradient update with preconditioner P:

θ_{t+1} = θ_t − η P(t) ∇_{θ_t} L(f_{θ_t}),  t = 0, 1, . . .  (1.1)

Setting P = I recovers gradient descent (GD). Choices of P that exploit second-order information include the inverse Fisher information matrix, which gives natural gradient descent (NGD) (Amari, 1998); the inverse Hessian, which leads to Newton's method (LeCun et al., 2012); and diagonal matrices estimated from past gradients, which include various adaptive gradient methods (Duchi et al., 2011; Kingma & Ba, 2014). These preconditioners often alleviate the effect of pathological curvature and speed up optimization, but their impact on generalization has been under debate: Wilson et al. (2017) reported that in neural network optimization, adaptive or second-order methods generalize worse than gradient descent, whereas other empirical studies showed that second-order methods achieve comparable, if not better, generalization (Xu et al., 2020). The generalization properties of optimizers relate to the discussion of implicit bias (Gunasekar et al., 2018a): preconditioning may lead to a different converged solution (with potentially the same training loss), as illustrated in Figure 1. While many explanations have been proposed, our starting point is the well-known observation that GD often implicitly regularizes the ℓ2 norm of the parameters. For instance, in overparameterized least squares regression, GD and many first-order methods find the minimum ℓ2-norm solution from zero initialization (without explicit regularization), but preconditioned updates may not.
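This implicit-bias gap can be seen directly in overparameterized least squares. The sketch below (a minimal NumPy simulation; the synthetic data and the diagonal preconditioner are illustrative choices of ours, not from the paper) runs update (1.1) with a fixed P from zero initialization: both runs interpolate the training data, but only GD lands on the minimum ℓ2-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                  # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def stationary_solution(P, lr=0.05, steps=20000):
    """Run update (1.1) with a fixed preconditioner P from zero initialization."""
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= lr * P @ (X.T @ (X @ theta - y) / n)
    return theta

theta_gd = stationary_solution(np.eye(d))       # GD: P = I
P = np.diag(rng.uniform(0.5, 2.0, size=d))      # an arbitrary SPD preconditioner
theta_pre = stationary_solution(P)

# Both stationary solutions interpolate the training data (zero training error)...
print(np.max(np.abs(X @ theta_gd - y)), np.max(np.abs(X @ theta_pre - y)))
# ...but only GD recovers the minimum l2-norm interpolator pinv(X) @ y.
theta_min = np.linalg.pinv(X) @ y
print(np.allclose(theta_gd, theta_min, atol=1e-6))
print(np.linalg.norm(theta_pre) > np.linalg.norm(theta_gd))
```

The preconditioned run still reaches zero training loss, but it converges to a different interpolator with strictly larger ℓ2 norm, which is exactly the ambiguity that the risk analysis in Section 3 resolves.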
This being said, while the minimum ℓ2-norm solution can generalize well in the overparameterized regime (Bartlett et al., 2019), it is unclear whether preconditioning leads to inferior solutions: even in the simple setting of overparameterized linear regression, a quantitative understanding of how preconditioning affects generalization is largely lacking. Motivated by the observations above, in Section 3 we start with (unregularized) overparameterized least squares regression and analyze the stationary solution (t → ∞) of update (1.1) under a time-invariant preconditioner. Extending previous analyses in the proportional limit (Hastie et al., 2019), we consider a more general random design setting and derive the exact population risk in its bias-variance decomposition. We characterize the optimal P within a general class of preconditioners for both the bias and the variance, and focus on the comparison between GD, for which P is the identity, and NGD, for which P is the inverse population Fisher information matrix. We find that the comparison of generalization is affected by the following factors:

1. Label noise: additive noise in the labels leads to the variance term in the risk. We prove that NGD achieves the optimal variance among a general class of preconditioned updates.
2. Model misspecification: under misspecification, there does not exist a perfect f_θ that recovers the true function (target). We argue that this factor acts similarly to additional label noise, and thus NGD may also be beneficial when the model is misspecified.
3. Data-signal alignment: alignment describes how the target signal is distributed among the input features. We show that GD achieves lower bias when the signal is isotropic, whereas NGD is preferred under "misalignment", i.e. when the target function concentrates on small feature directions.
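The alignment factor can be illustrated in the Gaussian linear regression setting, where the population Fisher is proportional to the feature covariance Σ, so NGD preconditions by Σ^{-1}. The hedged sketch below uses clean labels (so the risk is pure bias) and the closed form of the stationary solution of preconditioned GD from zero, namely the interpolator minimizing θ^T P^{-1} θ; the problem sizes, spectrum, and signal directions are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 300
lam = np.arange(1, d + 1) ** -3.0            # decaying feature spectrum: Sigma = diag(lam)
Z = rng.standard_normal((n, d))
X = Z * np.sqrt(lam)                         # rows x_i ~ N(0, Sigma)

def stationary_risk(P, theta_star):
    """Bias of the interpolator reached by preconditioned GD from zero init,
    i.e. the minimum theta^T P^{-1} theta solution of X theta = y (clean labels)."""
    y = X @ theta_star
    theta = P @ X.T @ np.linalg.solve(X @ P @ X.T, y)
    err = theta - theta_star
    return err @ (lam * err)                 # (theta - theta*)^T Sigma (theta - theta*)

P_gd = np.eye(d)                             # GD
P_ngd = np.diag(1.0 / lam)                   # NGD: inverse population Fisher ~ Sigma^{-1}
aligned = np.eye(d)[0]                       # signal on the largest feature direction
misaligned = np.eye(d)[-1]                   # signal on the smallest feature direction

# aligned signal: GD attains lower bias; misaligned signal: NGD attains lower bias
print(stationary_risk(P_gd, aligned), stationary_risk(P_ngd, aligned))
print(stationary_risk(P_gd, misaligned), stationary_risk(P_ngd, misaligned))
```

Intuitively, GD estimates large-eigenvalue directions accurately but shrinks small ones heavily, while NGD whitens the features and spreads its bias isotropically, which pays off exactly when the signal sits in the small directions.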
Beyond the decomposition of the stationary risk, our findings in Sections 4 and 5 are summarized as follows:

• In Sections 4.1 and 4.2 we discuss how the bias-variance tradeoff can be realized by different choices of the preconditioner P (e.g. interpolating between GD and NGD) or by early stopping.
• In Section 4.3 we extend our analysis to regression in the RKHS and show that under early stopping, a preconditioned update interpolating between GD and NGD achieves the minimax optimal convergence rate in far fewer steps, and thus reduces the population risk faster than GD.
• In Section 5 we empirically test how well our findings in the linear model carry over to neural networks: under a student-teacher setup, we compare the generalization of GD with that of preconditioned updates and illustrate the influence of all the aforementioned factors. The performance of neural networks under a variety of manipulations exhibits trends that align with our theoretical analysis.

2. BACKGROUND AND RELATED WORKS

Natural Gradient Descent. NGD is a second-order method originally proposed in Amari (1997). Consider a data distribution p(x) on the input space X, a function f_θ : X → Z parameterized by θ, and a loss function L(X, f_θ) = (1/n) Σ_{i=1}^n l(y_i, f_θ(x_i)), where l : Y × Z → R. Also suppose a probability distribution p(y|z) = p(y|f_θ(x)) is defined on the space of labels. The natural gradient is then defined as

∇̃_θ L(X, f_θ) = F^{-1} ∇_θ L(X, f_θ),

where F = E[∇_θ log p(x, y|θ) ∇_θ log p(x, y|θ)^T] is the Fisher information matrix, or simply the (population) Fisher. Note that the expectation in the Fisher is taken under the joint distribution of the model, p(x, y|θ) = p(x) p(y|f_θ(x)). In the literature, the Fisher is sometimes defined under the empirical data distribution {x_i}_{i=1}^n (Amari et al., 2000). We instead refer to this quantity as the sample Fisher, whose properties influence optimization and have been studied in Karakida et al. (2018); Kunstner et al. (2019); Thomas et al. (2020). We remark that in linear and kernel regression under the squared loss, sample Fisher-based updates give the same stationary solution as GD (see Section 3), whereas the population Fisher-based update may not. While the population Fisher is typically difficult to obtain, extra unlabeled data can be used in its estimation, which empirically improves generalization (Pascanu & Bengio, 2013). Moreover, under structural assumptions, parametric approaches to estimating F can be more sample-efficient (Martens & Grosse, 2015; Ollivier, 2015), thus closing the gap between the sample and population Fisher.
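The remark that sample Fisher-based updates reach the same stationary solution as GD can be checked in linear regression, where (under the squared loss, up to a constant noise-variance factor) the sample Fisher is X^T X / n. The snippet below is an illustrative check on synthetic data, not code from the paper; since the sample Fisher is rank-deficient in the overparameterized regime, we use its pseudo-inverse, and with step size 1 the preconditioned update is a Gauss-Newton step that lands on the minimum ℓ2-norm interpolator immediately.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 60                                 # overparameterized linear regression
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Sample Fisher under squared loss (up to a constant): X^T X / n. It has rank
# n < d, so we precondition with its pseudo-inverse.
P_sample = np.linalg.pinv(X.T @ X / n)

def run(P, lr, steps):
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= lr * P @ (X.T @ (X @ theta - y) / n)
    return theta

theta_gd = run(np.eye(d), lr=0.05, steps=20000)
theta_sngd = run(P_sample, lr=1.0, steps=1)   # one full Gauss-Newton step

theta_min = np.linalg.pinv(X) @ y             # minimum l2-norm interpolator
print(np.allclose(theta_gd, theta_min, atol=1e-5))
print(np.allclose(theta_sngd, theta_min, atol=1e-6))
```

Both runs agree with the pseudo-inverse solution, which is why the comparison in the paper hinges on the population Fisher rather than the sample Fisher.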



From now on we use NGD to denote the population Fisher-based update, and we write "sample NGD" when P is the inverse or pseudo-inverse of the sample Fisher; see Section for discussion.



Figure 1: 1D illustration of different implicit biases: two-layer sigmoid network trained with preconditioned GD.

