PARAMETER AVERAGING FOR SGD STABILIZES THE IMPLICIT BIAS TOWARDS FLAT REGIONS

Anonymous authors
Paper under double-blind review

Abstract

Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies demonstrated that this success is attributable to the implicit bias of the method, which prefers a flat minimum, and developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and converge more stably to a flat minimum than vanilla stochastic gradient descent. In our work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff coming from the stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that the averaged stochastic gradient descent can get closer to a solution of the sharpness-penalized objective than the vanilla stochastic gradient descent using the same step size, under certain conditions. In experiments, we verify our theory and show that this learning scheme significantly improves performance.

1. INTRODUCTION

Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is a powerful learning method for training modern deep neural networks. To further improve the performance, a great deal of SGD variants, such as adaptive gradient methods, have been developed. However, SGD is still the workhorse because it often generalizes better than these variants even when they achieve much faster convergence with respect to the training loss (Keskar & Socher, 2017; Wilson et al., 2017; Luo et al., 2019). Therefore, the study of the implicit bias of SGD, explaining why it works so well, is nowadays an active research subject. Among such studies, flat minima (Hochreiter & Schmidhuber, 1997) have been recognized as an important notion relevant to the generalization performance of deep neural networks, and SGD has been considered to have a bias towards a flat minimum. Hochreiter & Schmidhuber (1997) and Keskar et al. (2017) suggested the correlation between flatness (sharpness) and generalization, that is, flat minima generalize better than sharp minima, and Neyshabur et al. (2017) rigorously supported this correlation under ℓ2-regularization by using the PAC-Bayesian framework (McAllester, 1998; 1999). Furthermore, through large-scale experiments, Jiang et al. (2020) verified that flatness measures reliably capture the generalization performance and are the most relevant among 40 complexity measures. In parallel, Keskar et al. (2017) empirically demonstrated that SGD prefers a flat minimum due to its own stochastic gradient noise, and subsequent studies (Kleinberg et al., 2018; Zhou et al., 2020) proved this implicit bias based on the smoothing effect of the noise and on stochastic differential equations, respectively. Along this line of research, there are endeavors to enhance this bias in order to improve performance.
Especially, stochastic weight averaging (SWA) (Izmailov et al., 2018) and sharpness-aware minimization (SAM) (Foret et al., 2020) achieved significant improvements in generalization performance over SGD. SWA is a cyclic averaging scheme for SGD, which includes the averaged SGD (Ruppert, 1988; Polyak & Juditsky, 1992) as a special case. Averaged SGD with an appropriately small step size, or with a step size diminishing to zero, is well known to be an efficient method that achieves statistically optimal convergence rates for convex optimization problems (Bach & Moulines, 2011; Lacoste-Julien et al., 2012; Rakhlin et al., 2012). However, such a small step-size strategy does not seem useful for training deep neural networks. The success of using a large step size can be attributed to the strong implicit bias, as discussed in Izmailov et al. (2018). SGD with a large step size cannot stay in sharp regions because of the amplified stochastic gradient noise, and thus it moves to another region. After a long run, SGD finally oscillates according to an invariant distribution covering a flat region. Then, by taking the average, we obtain the mean of this distribution, which is located inside the flat region. Although this provides a good insight into how the averaged SGD with a large step size behaves, the theoretical understanding remains elusive. Hence, the research problem we aim to address is: Why does the averaged SGD with a large step size converge to a flat region more stably than SGD? In our work, we address this question via the convergence analysis of both SGD and averaged SGD.
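This escape-then-average dynamic can be sketched on a toy one-dimensional objective. The construction below is our own illustration (not the paper's experiment): a sharp well whose curvature makes large-step SGD unstable sits next to a flat well where the iterates settle, so the averaged iterate lands near the flat minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy objective (hypothetical, not from the paper):
# f(x) = min(A*(x+1)^2, B*(x-2)^2) has a sharp minimum at x = -1
# (curvature 2A = 100) and a flat minimum at x = 2 (curvature 2B = 1).
A, B = 50.0, 0.5

def stochastic_grad(x):
    # Gradient of the active branch plus uniform gradient noise.
    if A * (x + 1.0) ** 2 < B * (x - 2.0) ** 2:
        g = 2.0 * A * (x + 1.0)   # inside the sharp well
    else:
        g = 2.0 * B * (x - 2.0)   # inside the flat well
    return g + rng.uniform(-2.0, 2.0)

eta = 0.05     # "large" step: eta * 100 > 2, so the sharp well is unstable
x = -1.05      # start right next to the sharp minimum
trace = []
for _ in range(4000):
    x -= eta * stochastic_grad(x)
    trace.append(x)

tail_avg = float(np.mean(trace[-1000:]))  # averaged iterate over the tail
print(f"last iterate: {trace[-1]:.3f}, tail average: {tail_avg:.3f}")
```

With this step size the linearized dynamics in the sharp well expand distances by a factor |1 - eta * 100| = 4 per step, so SGD is ejected within a few iterations and ends up oscillating in the flat well; the tail average then sits near the flat well's center, in line with the intuition above.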

1.1. CONTRIBUTIONS

We first explain the idea behind our study. Our analysis builds upon the alternative view of SGD (Kleinberg et al., 2018), which suggests that SGD implicitly optimizes the smoothed objective function obtained by convolving the objective with the stochastic gradient noise (see the left of Figure 1). Since, as pointed out later, the smoothed objective is essentially an objective penalized by the sharpness, with a penalty strength depending on the step size, more precise optimization of the smoothed objective under a large step size implies convergence to a flatter region. At the same time, the step size is known to control the optimization accuracy of SGD; that is, we need to take a small step size in the final phase of training to converge. These observations indicate a bias-optimization tradeoff that comes from the stochastic gradient noise and is controlled by the step size: a large step size amplifies the bias towards a flat region but makes the optimization of the smoothed objective inaccurate, whereas a small step size weakens the bias but makes the optimization accurate.

In our work, we prove that the averaged SGD can improve the above tradeoff; that is, it can optimize the smoothed objective more precisely than SGD under the same step size. Specifically, we prove that, as long as the smoothed objective satisfies one-point strong convexity at the solution and some regularity conditions, SGD using the step size η converges to a distance O(√η) from the solution (Theorem 1), whereas the averaged SGD using the same step size converges to a distance O(η) (Theorem 2). We remark that a large step size in our study means a step size with which SGD oscillates and performs poorly but the averaged SGD works. Clearly, a larger step size regardless of the condition of
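The gap between the last iterate and the average can be probed numerically on a toy quadratic. This is our own illustration under simplifying assumptions (one dimension, symmetric uniform noise), not the paper's proof; in this symmetric quadratic case the smoothing bias vanishes, so averaging does even better than the general O(η) guarantee, while the last iterate still fluctuates at scale O(√η).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy (hypothetical construction): SGD on the quadratic
# f(x) = x^2 / 2 with additive uniform gradient noise. The last iterate
# fluctuates around the minimizer x* = 0 at scale O(sqrt(eta)); the
# Polyak average of the same trajectory lands far closer.
def run_sgd(eta, n_steps=20000, x0=2.0):
    x = x0
    xs = np.empty(n_steps)
    for k in range(n_steps):
        x -= eta * (x + rng.uniform(-1.0, 1.0))  # stochastic gradient step
        xs[k] = x
    # (last iterate, averaged iterate, empirical fluctuation scale)
    return xs[-1], xs.mean(), xs.std()

results = {eta: run_sgd(eta) for eta in (0.5, 0.1, 0.02)}
for eta, (last, avg, std) in results.items():
    print(f"eta={eta:5.2f}  |last-x*|={abs(last):.4f}  "
          f"|avg-x*|={abs(avg):.4f}  fluctuation={std:.4f}")
```

The printed fluctuation scale of the iterates shrinks only like √η as the step size decreases, while the averaged iterate stays close to the minimizer across all three step sizes, matching the tradeoff described above: averaging recovers optimization accuracy without requiring the small step size that would weaken the bias.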



Figure 1: We run SGD and averaged SGD 500 times with uniform stochastic gradient noise for two objective functions (top and bottom). Figure (a) depicts the objective function f (green, η = 0) and the smoothed objectives F (red and blue, η > 0). Figures (b) and (c) plot the convergent points of SGD and averaged SGD as histograms, respectively.

