PARAMETER AVERAGING FOR SGD STABILIZES THE IMPLICIT BIAS TOWARDS FLAT REGIONS

Anonymous authors
Paper under double-blind review

Abstract

Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies demonstrated that this success is attributed to the implicit bias of the method, which prefers a flat minimum, and developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and converge more stably to a flat minimum than vanilla stochastic gradient descent. In our work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff coming from the stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that averaged stochastic gradient descent can get closer to a solution of a sharpness-penalized objective than vanilla stochastic gradient descent using the same step size, under certain conditions. In experiments, we verify our theory and show that this learning scheme significantly improves performance.

1. INTRODUCTION

Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is a powerful learning method for training modern deep neural networks. To further improve performance, a great number of SGD variants, such as adaptive gradient methods, have been developed. However, SGD remains the workhorse because it often generalizes better than these variants even when they achieve much faster convergence in training loss (Keskar & Socher, 2017; Wilson et al., 2017; Luo et al., 2019). Therefore, the study of the implicit bias of SGD, which explains why it generalizes so well, is nowadays an active research subject. Among such studies, flat minima (Hochreiter & Schmidhuber, 1997) have been recognized as an important notion relevant to the generalization performance of deep neural networks, and SGD has been considered to have a bias towards flat minima. Hochreiter & Schmidhuber (1997); Keskar et al. (2017) suggested a correlation between flatness (sharpness) and generalization, that is, flat minima generalize better than sharp minima, and Neyshabur et al. (2017) rigorously supported this correlation under ℓ2-regularization by using the PAC-Bayesian framework (McAllester, 1998; 1999). Furthermore, through large-scale experiments, Jiang et al. (2020) verified that flatness measures reliably capture generalization performance and are the most relevant among 40 complexity measures. In parallel, Keskar et al. (2017) empirically demonstrated that SGD prefers a flat minimum due to its own stochastic gradient noise, and subsequent studies (Kleinberg et al., 2018; Zhou et al., 2020) proved this implicit bias based on the smoothing effect of the noise and on a stochastic differential equation, respectively. Along this line of research, there are endeavors to enhance the bias aiming to improve performance.
Especially, stochastic weight averaging (SWA) (Izmailov et al., 2018) and sharpness-aware minimization (SAM) (Foret et al., 2020) achieved significant improvements in generalization performance over SGD. SWA is a cyclic averaging scheme for SGD, which includes averaged SGD (Ruppert, 1988; Polyak & Juditsky, 1992) as a special case. Averaged SGD with an appropriately small step size, or a step size diminishing to zero, is well known to be an efficient method that achieves statistically optimal convergence rates for convex optimization problems (Bach & Moulines, 2011; Lacoste-Julien et al., 2012; Rakhlin et al., 2012). However, such a small step size strategy does not seem useful for training deep neural networks, and Izmailov et al. (2018) found that averaged SGD with a large rather than small step size works quite well. The success of using a large step size can be attributed to the strong implicit bias, as discussed in Izmailov et al. (2018). SGD with a large step size cannot stay in sharp regions because of the amplified stochastic gradient noise, and thus it moves to another region. After a long run, SGD will finally oscillate according to an invariant distribution covering a flat region. Then, by taking the average, we can get the mean of this distribution, which is located inside a flat region. Although this provides a good insight into how averaged SGD with a large step size behaves, the theoretical understanding remains elusive. Hence, the research problem we aim to address is: Why does averaged SGD with a large step size converge to a flat region more stably than SGD? In our work, we address this question via a convergence analysis of both SGD and averaged SGD.

1.1. CONTRIBUTIONS

We first explain the idea behind our study. Our analysis builds upon the alternative view of SGD (Kleinberg et al., 2018), which suggested that SGD implicitly optimizes the smoothed objective function obtained by convolution with the stochastic gradient noise (see the left of Figure 1). Since, as pointed out later, the smoothed objective is essentially an objective penalized by sharpness, with a strength that depends on the step size, more precise optimization of the smoothed objective with a large step size implies convergence to a flatter region. At the same time, the step size is known to control the optimization accuracy of SGD; that is, we need to take a small step size in the final phase of training to converge. These observations indicate a bias-optimization tradeoff, coming from the stochastic gradient noise and controlled by the step size: a large step size amplifies the bias towards a flat region but makes the optimization of the smoothed objective inaccurate, whereas a small step size weakens the bias but makes the optimization accurate. In our work, we prove that averaged SGD can improve this tradeoff; that is, it can optimize the smoothed objective more precisely than SGD under the same step size. Specifically, we prove that as long as the smoothed objective satisfies one-point strong convexity at the solution and some regularity conditions, SGD using the step size η converges to a distance O(√η) from the solution (Theorem 1), whereas averaged SGD using the same step size converges to a distance O(η) (Theorem 2). We remark that a "large" step size in our study means a step size with which SGD oscillates and performs poorly but averaged SGD works. Clearly, a step size that is too large will cause divergence regardless of the condition of the objective, so it should still be appropriately small to achieve sufficient optimization. The better dependence of O(η) compared to O(√η) means that averaged SGD can work well with a wider range of step sizes than SGD.
Although a too small step size hardly biases the solution, because the deviation of the solution is O(η²), the above difference in order can create a separation between SGD and averaged SGD with an appropriately chosen step size depending on the problem. As a result, we can expect an improvement by averaged SGD for datasets for which the stronger implicit bias induced by an appropriately larger step size is useful. The separation between SGD and averaged SGD regarding the bias can occur even in a simple setup, as seen in Figure 1, which depicts the parameters obtained by running SGD and averaged SGD 500 times in two cases. We observe that (a) both methods with a small step size can get stuck at sharp valleys or at the edge of a flat region because of the weak bias and accurate optimization, (b) SGD with a large step size amplifies the bias and reaches a flat region but is unstable, and (c) averaged SGD with a large step size can converge stably to a nearby biased solution which minimizes the smoothed objective. The behavior of averaged SGD in an asymmetric valley (the top of Figure 1), where the parameter is biased toward the flat side away from the edge of the region, is also known to be a preferable property for generalization, as are flat minima (see Izmailov et al. (2018); He et al. (2019)). We note that this phenomenon is certainly captured by our theory. Indeed, Figure 2 shows that the convergent point of averaged SGD is almost the minimizer of the smoothed objective for each step size. Our findings are summarized below:

• SGD and averaged SGD implicitly optimize the smoothed objective, whose smoothing strength depends on the step size, up to O(√η) and O(η) errors in Euclidean distance from the solution, respectively. This explains why these methods reach a flat region with an appropriate step size, since smoothing eliminates sharp minima.
• This means that averaged SGD can optimize the smoothed objective more precisely than SGD under the same step size, as long as the required conditions hold uniformly with respect to the step size, resulting in a stronger bias towards a flat region. In other words, averaged SGD controls the bias-optimization tradeoff better than SGD.
• Hence, parameter averaging yields an improvement for difficult datasets for which the stronger implicit bias induced by a larger step size is useful. This suggests using, for such datasets, a step size large enough that SGD itself is unstable but averaged SGD converges stably, to effectively bring out the implicit bias.

Technical difference from Kleinberg et al. (2018). The proof idea of Proposition 1 relies on the alternative view of SGD (Kleinberg et al., 2018), which shows the existence of an associated SGD for the smoothed objective. However, since its stochastic gradient is a biased estimator, they showed convergence not to the solution but to a point at which a sort of one-point strong convexity holds, thereby avoiding the treatment of a biased estimator. Hence, optimization of the smoothed objective is not guaranteed by their theory. In contrast, optimization accuracy is the key in our theory, so we need a nontrivial refinement of the proof under a normal one-point strong convexity at the solution.

2. PRELIMINARY -STOCHASTIC GRADIENT DESCENT

In this section, we introduce the problem setup and stochastic gradient descent (SGD) in a general form that includes the standard SGD for the risk minimization problems appearing in machine learning. Let f : R^d → R be a smooth nonconvex objective function to be minimized. For simplicity, we assume f is nonnegative. A stochastic gradient descent, randomly initialized at w_0, for optimizing f is described as follows: for t = 0, 1, 2, ...,

w_{t+1} = w_t − η (∇f(w_t) + ϵ_{t+1}(w_t)),    (1)

where η > 0 is the step size and ϵ_{t+1} : R^d → R^d is a random field corresponding to the stochastic gradient noise, i.e., for any w ∈ R^d, {ϵ_{t+1}(w)}_{t=0}^∞ is a sequence of zero-mean random variables taking values in R^d. A typical instance of the above is empirical/expected risk minimization in machine learning.

Example 1 (Risk Minimization). Let ℓ(w, z) be a loss function consisting of the hypothesis function parameterized by w ∈ R^d and the data z ∈ R^p. Let μ be an empirical/true data distribution over the data space and Z be a random variable following μ. Then, the objective function is defined by f(w) = E_{Z∼μ}[ℓ(w, Z)]. Given i.i.d. random variables {Z_{t+1}}_{t=0}^∞ with the same distribution as Z, the standard stochastic gradient at the t-th iterate w_t is defined as ∇_w ℓ(w_t, Z_{t+1}). In this setting, the stochastic noise ϵ_{t+1} can be ϵ_{t+1}(w) = ∇_w ℓ(w, Z_{t+1}) − ∇f(w). Note that we can further include ℓ2-regularization in the objective f and perturbation by data augmentation in the distribution μ.

As this example satisfies, we suppose {ϵ_{t+1}}_{t=0}^∞ are independent copies of each other. That is, there is a measurable map from a probability space, Ω ∋ z ↦ ϵ(w, z) ∈ R^d, and ϵ_{t+1} can be written as a measurable map from a product probability space, Ω^{Z≥0} ∋ {z_{s+1}}_{s=0}^∞ ↦ ϵ(w, z_{t+1}) ∈ R^d, when explicitly representing the noises as measurable maps. Moreover, we make the following assumptions on the objective function and stochastic gradient noise.

Assumption 1.
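As a concrete illustration, the general update above can be sketched in a few lines of NumPy. The quadratic objective and Gaussian noise field below are hypothetical stand-ins for f and ϵ_{t+1}, chosen only to make the recursion runnable; they are not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(w):
    # Full gradient of a toy objective f(w) = ||w||^2 / 2 (stands in for the risk)
    return w

def noise(w):
    # Zero-mean, state-dependent noise field eps_{t+1}(w); Gaussian for illustration
    return 0.5 * rng.standard_normal(w.shape)

def sgd(w0, eta, T):
    # The update (1): w_{t+1} = w_t - eta * (grad f(w_t) + eps_{t+1}(w_t))
    w = w0.copy()
    for _ in range(T):
        w = w - eta * (grad_f(w) + noise(w))
    return w

w_final = sgd(np.ones(2), eta=0.1, T=5000)
```

After many iterations the iterate oscillates around the minimizer at a scale set by η and the noise level, which is exactly the behavior the analysis below quantifies.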
(A1) f : R^d → R is nonnegative, twice continuously differentiable, and its Hessian is bounded, i.e., there is a constant L > 0 such that for any w ∈ R^d, −LI ⪯ ∇²f(w) ⪯ LI.
(A2) The random fields {ϵ_{t+1}}_{t=0}^∞ are independent copies of each other and each ϵ_{t+1}(w) is differentiable in w. Moreover, for any w ∈ R^d, E[ϵ_{t+1}(w)] = 0, and there are σ_1, σ_2 > 0 such that for any w ∈ R^d, E[∥ϵ_{t+1}(w)∥²] ≤ σ_1² and E[∥J_{ϵ_{t+1}}^⊤(w)∥²] ≤ σ_2², where J_{ϵ_{t+1}} is the Jacobian of ϵ_{t+1}.

Remark. The smoothness and boundedness conditions (A1) on the objective function and the zero-mean and bounded-variance conditions (A2) on the stochastic gradient noise are commonly assumed in convergence analyses of stochastic optimization methods. Moreover, if the Hessian satisfies −LI ⪯ ∇²_w ℓ(w, z) ⪯ LI in Example 1, then the last condition on J_{ϵ_{t+1}} also holds with at least σ_2 = 2L, because J_{ϵ_{t+1}}(w) = ∇²_w ℓ(w, Z_{t+1}) − ∇²f(w).

3. ALTERNATIVE VIEW OF STOCHASTIC GRADIENT DESCENT

An alternative view (Kleinberg et al., 2018) of SGD is the key in our analysis relating it to an implicit bias towards a flat minimum. We introduce this view with a refined convergence analysis and see the bias-optimization tradeoff caused by the stochastic gradient noise together with the step size. The alternative view of SGD considers iterations {v_t}_{t=0}^∞ associated with {w_t}_{t=0}^∞, which approximately minimize a smoothed objective function obtained via the stochastic gradient noise. We define v_t as the parameter obtained by an exact gradient descent step from w_t, that is, v_t = w_t − η∇f(w_t), and we analyze the update of v_t instead of w_t. Since w_{t+1} = v_t − ηϵ_{t+1}(w_t), we get v_{t+1} = v_t − ηϵ_{t+1}(w_t) − η∇f(v_t − ηϵ_{t+1}(w_t)). As shown in Appendix A.1, under a specific setting given later, w ↦ v = w − η∇f(w) becomes a smooth invertible injection whose inverse is differentiable; thus, we identify ϵ'_{t+1}(v) with ϵ_{t+1}(w) through the map w ↦ v. Then, we get an update rule for v_t:

v_{t+1} = v_t − ηϵ'_{t+1}(v_t) − η∇f(v_t − ηϵ'_{t+1}(v_t)).    (2)

For convenience, we refer to rule (2) as the implicit stochastic gradient descent in this paper. Since the conditional expectation of ϵ'_{t+1}(v_t) given v_t is zero, we expect that the implicit SGD (2) minimizes the following smoothed objective function:

F(v) = E[f(v − ηϵ'(v))],    (3)

where ϵ' is an independent copy of ϵ'_1, ϵ'_2, .... However, we note that this implicit SGD is not a standard SGD because ∇f(v_t − ηϵ'_{t+1}(v_t)) is a biased estimate of ∇F(v) (i.e., ∇F(v) ≠ E[∇f(v − ηϵ'(v))]) in general, and thus we need a detailed convergence analysis. The function (3) is actually a smoothed version of f, obtained by convolution with the stochastic gradient noise ηϵ', and the level of smoothness is controlled by the step size η, as seen in the left of Figure 1, which depicts the original objective f (corresponding to η = 0) and smoothed objectives F.
In this figure, we can observe how a nonconvex function is smoothed and its sharp local minima are eliminated by an appropriately large step size (the bottom-left figure), and how the solution is biased toward the flat side in an asymmetric valley (the top-left figure). Hence, we expect that stochastic gradient descent avoids sharp minima and converges to a flat region. Indeed, by taking a Taylor expansion of f, we see that F(v) approximates f(v) plus a penalty on the high (positive) curvature of f along the noise direction in expectation:

F(v) = f(v) + (η²/2) Tr(∇²f(v) E[ϵ'(v)ϵ'(v)^⊤]) + O(η³).    (4)

The above observation indicates that it is reasonable to impose some sort of convexity condition at the solution of the smoothed objective F(v) rather than of the original objective f(w). In this paper, we make the following one-point strong convexity assumption at the solution v* to show the convergence of F(v_t). Let v* = argmin_{v∈R^d} F(v). We note that F and v* depend on the value of η, but we do not denote this dependency explicitly for simplicity.

Assumption 2. (A3) There is c > 0 such that for any v ∈ R^d, ∇F(v)^⊤(v − v*) ≥ c∥v − v*∥².

For instance, this assumption holds for the function in the bottom-left of Figure 1 with sufficiently large η, and for the function in the top-left figure with any η in a certain interval (0, η_0].
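The elimination of sharp minima by the smoothing (3) can be checked numerically. The toy 1-D objective and Gaussian noise below are hypothetical stand-ins, with the convolution estimated by Monte Carlo; with a large enough noise scale, the flat minimum becomes the global minimum of the smoothed objective even though the sharp minimum is the global minimum of the original:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w):
    # Toy objective: a sharp minimum at w = 0 and a flat minimum near w = 3
    return np.minimum(5.0 * w**2, 0.3 * (w - 3.0)**2 + 0.05)

def F(v, eta, n_samples=20_000):
    # Monte Carlo estimate of the smoothed objective F(v) = E[f(v - eta * eps)]
    eps = rng.standard_normal(n_samples)
    return f(v - eta * eps).mean()

# Without smoothing, the sharp minimum at 0 is the global one,
# but with eta = 1 the flat minimum near 3 attains the lower smoothed value.
sharp_wins_unsmoothed = f(0.0) < f(3.0)
flat_wins_smoothed = F(3.0, eta=1.0) < F(0.0, eta=1.0)
```

Both flags come out True under this setup: smoothing penalizes the high curvature around the sharp minimum, exactly as the expansion of F suggests.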

Assumption (A3) is a normal one-point strong convexity, whereas Kleinberg et al. (2018) assumed a different condition, E[∇f(v − ηϵ'(v))]^⊤(v − v°) ≥ c∥v − v°∥² at some parameter v°, and showed convergence to v°. If ∇F(v) = E[∇f(v − ηϵ'(v))], then v° should be v* and both assumptions coincide. However, as noted above, ∇F(v) ≠ E[∇f(v − ηϵ'(v))] in general, and hence v° is not necessarily v*. Our aim is to clarify how precisely SGD and averaged SGD can minimize F(v). That is why we impose the normal one-point strong convexity at v* and need a much more detailed analysis. Moreover, our proof allows a larger step size than that of Kleinberg et al. (2018) because of the different proof techniques.

Theorem 1. Under Assumptions (A1), (A2), and (A3), run SGD for T iterations with the step size η ≤ 1/(2L). Then the sequence {v_t}_{t=0}^∞ of the implicit SGD satisfies the following inequality:
\[
\frac{1}{T+1}\sum_{t=0}^{T} \mathbb{E}[\|v_t - v^*\|^2] \le O(T^{-1}) + \frac{2\eta\sigma_1^2}{c} + \frac{8\eta^2\sigma_1^2 L}{3c}\Bigl(1 + \frac{2\eta\sigma_2^2}{c}\Bigr).
\]

Remark. If σ_1 = 0, then SGD is nothing but deterministic gradient descent and f = F because of the absence of stochastic gradient noise. Hence, SGD converges to a minimizer of f according to classical optimization theory, which is recovered by Theorem 1 with σ_1 = 0.

This theorem shows the convergence of SGD to the minimum of the smoothed objective F up to a distance O(√η) from v*, as long as F satisfies the required assumptions, even if the original objective f has local minima. This is also true for w_t since ∥w_t − v_t∥ = O(η). Thus, convergence to a flatter region is expected through the explicit expression as a regularized objective (4). Moreover, we can see from the theorem that the optimization becomes more accurate when using a smaller step size, for problems where the required conditions hold uniformly in η. On the other hand, a small step size clearly weakens the bias.
Thus, the step size η controls the bias-optimization tradeoff coming from the stochastic gradient noise.

4. AVERAGED SGD WITH LARGE STEP SIZE

Izmailov et al. (2018) empirically demonstrated that averaged SGD converges to a flat region and achieves better generalization even when SGD oscillates with a relatively large step size. We theoretically attribute this phenomenon to the fact that averaged SGD can get closer to v* than SGD using the same step size under certain settings. In other words, parameter averaging can improve the bias-optimization tradeoff and bring out the implicit bias more effectively. In averaged SGD, we run the normal SGD (1) and take the average as follows:

w̄_{T+1} = (1/(T+1)) Σ_{t=1}^{T+1} w_t.

Our aim is to show that lim_{T→∞} w̄_T can be closer to v* than {w_t}_{t=0}^∞ and {v_t}_{t=0}^∞ by clarifying the dependency of this limit on the step size η. Conveniently, the implicit SGD (2) is useful in analyzing averaged SGD because the average v̄_T = (1/(T+1)) Σ_{t=0}^T v_t is consistent with w̄_{T+1}, as confirmed below. By definition, we see

w̄_{T+1} = v̄_T − (η/(T+1)) Σ_{t=0}^T ϵ_{t+1}(w_t),

where the noise term is zero in expectation and its variance is upper bounded by η²σ_1²/(T+1) under Assumption (A2). Hence, w̄_{T+1} − v̄_T converges to zero in probability by Chebyshev's inequality: for any r > 0, P[∥w̄_{T+1} − v̄_T∥ > r] ≤ η²σ_1²/((T+1)r²) → 0 as T → ∞, and the analysis of lim_{T→∞} w̄_T reduces to that of lim_{T→∞} v̄_T. We make the following additional assumptions on the smoothed objective F : R^d → R and give the theorem showing the convergence of averaged SGD.

Assumption 3.
(A4) There is M > 0 such that for any v ∈ R^d, ∥∇F(v) − ∇²F(v*)(v − v*)∥ ≤ M∥v − v*∥².
(A5) ∇²F(v*) is positive definite, i.e., there is μ > 0 such that ∇²F(v*) ⪰ μI.

Remark. (A4) is used to show the superiority of the averaging scheme. This condition can be derived from the boundedness of the third-order derivative assumed in Dieuleveut et al. (2020).
The positive definiteness of the Hessian (A5) is required only at v*, which is consistent with nonconvexity. For instance, the examples in Figure 1 satisfy (A5).

Theorem 2. Under Assumptions (A1)-(A5), run averaged SGD for T iterations with the step size η ≤ 1/(2L). Then the average v̄_T satisfies the following inequality:
\[
\|\mathbb{E}[\bar{v}_T] - v^*\| \le O(T^{-\frac{1}{2}}) + \frac{4\sigma_1\sigma_2\eta^{\frac{3}{2}}L^{\frac{1}{2}}}{\sqrt{3}\mu} + \frac{2\eta\sigma_1^2 M}{c\mu} + \frac{8\eta^2\sigma_1^2 LM}{3c\mu}\Bigl(1 + \frac{2\eta\sigma_2^2}{c}\Bigr).
\]

The variance of the averaged parameter v̄_T is typically small; hence we evaluate the distance of E[v̄_T] to v*. Indeed, this is reasonable because the central limit theorem holds for averaged SGD under mild conditions even for nonconvex problems (Yu et al., 2020). Theorem 2 says that averaged SGD can optimize the smoothed objective F with a better accuracy of O(η) than the O(√η) achieved by SGD using the same step size, as long as the required conditions (one-point strong convexity at the minimizer and regularity of the smoothed objective) are satisfied uniformly for η in a certain interval (0, η_0] (∃η_0 < 1). These uniform requirements hold for valleys like the top-left case of Figure 1 and likely hold in the final phase of training deep neural networks, based on the observation that the parameter eventually falls into a better-shaped valley (see Figure 4). Therefore, we expect averaged SGD to outperform normal SGD in such cases, and we recommend the tail-averaging scheme for deep learning, as adopted in SWA (Izmailov et al., 2018), whose benefit is well known even in convex optimization (Rakhlin et al., 2012; Mücke et al., 2019).
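The O(η)-versus-O(√η) separation of Theorems 1 and 2 can be illustrated in a simplified 1-D experiment, not the paper's setting: a smooth nonconvex objective with additive Gaussian gradient noise, for which the smoothed objective happens to have a closed form (here we smooth with the same Gaussian noise, glossing over the ϵ'-coordinate change). The tail average lands much closer to the smoothed minimizer v* than a typical SGD iterate does:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma, T = 0.1, 1.0, 200_000

def grad_f(w):
    # Toy nonconvex objective f(w) = w^2/2 + sin(w) (bounded third derivative)
    return w + np.cos(w)

# SGD with additive Gaussian gradient noise; keep the tail of the trajectory
w, tail = 0.0, []
for t in range(T):
    w -= eta * (grad_f(w) + sigma * rng.standard_normal())
    if t >= T // 2:  # tail averaging, as in SWA
        tail.append(w)
tail = np.array(tail)

# For Gaussian noise the smoothed objective has a closed form:
# E[sin(v - a*xi)] = exp(-a^2/2) * sin(v) for xi ~ N(0, 1), so
# F(v) = v^2/2 + exp(-(eta*sigma)^2/2) * sin(v) + const, and v* solves
# v + exp(-(eta*sigma)^2/2) * cos(v) = 0 (found by fixed-point iteration).
v_star = 0.0
for _ in range(100):
    v_star = -np.exp(-(eta * sigma) ** 2 / 2) * np.cos(v_star)

dist_iter = np.abs(tail - v_star).mean()  # typical iterate distance: O(sqrt(eta))
dist_avg = abs(tail.mean() - v_star)      # tail-average distance: O(eta)
```

Under this setup dist_avg comes out far below dist_iter, mirroring the two theorems; the constants and the closed-form smoothing are of course specific to this toy.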

5. EXPERIMENTS

We evaluate the empirical performance of SGD and averaged SGD on image classification tasks using the CIFAR10 and CIFAR100 datasets. To evaluate the usefulness of parameter averaging for other methods, we also compare SAM (Foret et al., 2020) with its averaging variant. We employ the tail-averaging scheme, where the average is taken over the last phase of training, and we employ weight decay with the coefficient 0.05. Moreover, we use the multi-step strategy for the step size, which decays the step size by a factor once the number of epochs reaches one of the given milestones. To see the dependence on the step size, we use two decay schedules for parameter averaging. Table 1 summarizes the milestones, labeled by the symbols 's', 'm', and 'l'. The initial step size and the decay factor of the step size are set to 0.1 and 0.2 in all cases. The averages are taken from 300 epochs for the schedules 's' and 'l', and from 160 epochs for the schedule 'm'. These hyperparameters were tuned based on the validation sets. For a fair comparison, we run (averaged) SGD for 400 epochs and (averaged) SAM for 200 epochs, because SAM requires two gradients per iteration; the milestones and the starting epoch for taking averages are halved accordingly for (averaged) SAM. We evaluate each method 5 times for ResNet-50 and WRN-28-10, and 3 times for the Pyramid network. The averages of classification accuracies are listed in Table 2, with standard deviations in brackets. We observe from the table that parameter averaging for SGD improves classification accuracy in all cases, especially on the CIFAR100 dataset. Eventually, averaged SGD achieves comparable or better performance than SAM. Moreover, we also observe an improvement from parameter averaging for SAM in most cases, which is consistent with the observations in Kaddour et al. (2022).
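The multi-step schedule described above can be sketched as follows; the initial step size 0.1 and decay factor 0.2 follow the text, while the milestone epochs here are hypothetical placeholders for the actual 's'/'m'/'l' milestones given in Table 1:

```python
def multistep_lr(epoch, init_lr=0.1, decay=0.2, milestones=(150, 250)):
    """Step size at a given epoch: decay by a fixed factor at each milestone.

    init_lr and decay follow the paper (0.1 and 0.2); the milestones
    are placeholders for the schedules summarized in Table 1.
    """
    lr = init_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay
    return lr
```

With these placeholder milestones the step size is 0.1 up to epoch 149, 0.02 from epoch 150, and 0.004 from epoch 250 onward.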
Comparing results on CIFAR100 and CIFAR10, the large step size is better and the small step size is relatively poor on the CIFAR100 dataset, whereas the small step size generally works on the CIFAR10 dataset. If we use the step-size strategy 'l' for CIFAR10, the improvement becomes small (see Appendix B for this result). We hypothesize that this is because the strong bias with a large step size is useful for difficult datasets, whereas the weak bias with a small step size is sufficient for simple datasets on which normal SGD already achieves high accuracy. Moreover, we note that averaged SGD on CIFAR100 works quite well with the large step-size schedule 'l', but SGD itself does not converge and performs poorly under this schedule, as seen in Figure 3. The accuracy of SGD temporarily increases at 300 epochs because of the decay of the step size, but decreases thereafter. However, the average of such parameters achieves significantly higher accuracy, as expected by our theory. Finally, we observe in Figure 4 that the loss landscape around the convergent point is in better shape and forms an asymmetric valley. Therefore, we expect that the loss function around the solution uniformly satisfies the conditions required by our theory. Specifically, Figure 4 depicts the section of the train and test loss functions across the parameters obtained by averaged SGD and SGD. The middle figure is a close-up view at the edge and plots each parameter. The right figure depicts the smoothed objectives with Gaussian noise, in addition to train and test losses, in log scale. We observe in Figure 4 the phenomenon that SGD converges to an edge while averaged SGD converges to the flat side. This phenomenon can be explained by our theory, because the minimizer of the smoothed asymmetric valley is shifted to the flat side, as confirmed in a synthetic setting (Figure 2) and a deep learning setting (the right of Figure 4).
Moreover, the right figure indicates the possibility that the smoothed objective with appropriate stochastic gradient noise approximates the test loss well, although we employ artificial (Gaussian) noise to depict the graphs for simplicity. Finally, we observe that averaged SGD achieves a lower test loss, which yields about a 2% improvement in classification error on the CIFAR100 dataset. These observations are also consistent with the experiments conducted in He et al. (2019).

6. RELATED LITERATURE AND DISCUSSION

Flat Minimum. Keskar et al. (2017) and Hochreiter & Schmidhuber (1997) showed that a flat minimum generalizes well and a sharp minimum generalizes poorly. However, flatness alone cannot explain generalization because it can be easily manipulated (Dinh et al., 2017). He et al. (2019) argued that averaged SGD tends to converge to an asymmetric valley. Several works (Kleinberg et al., 2018; Zhou et al., 2020) studied the stochastic gradient noise to theoretically prove the existence of an implicit bias towards a flat region or an asymmetric valley. Moreover, many works (Izmailov et al., 2018; Foret et al., 2020; Damian et al., 2021; Orvieto et al., 2022) studied techniques to further bring out the bias of SGD. In particular, SAM and SWA achieved a significant improvement in generalization performance. In our paper, we show that parameter averaging stabilizes the convergence to a flat region or an asymmetric valley, and we suggest the usefulness of its combination with a large step size for difficult datasets that need a stronger regularization.

Markov Chain Interpretation of SGD. Dieuleveut et al. (2020); Yu et al. (2020) provided Markov chain interpretations of SGD. They showed that the marginal distribution of the SGD parameter converges to an invariant distribution for convex and nonconvex optimization problems, respectively. Moreover, Dieuleveut et al. (2020) showed that the mean of the invariant distribution, attained by averaged SGD, is at distance O(η) from the minimizer of the objective function, whereas SGD itself oscillates at distance O(√η), in the convex optimization setting. Izmailov et al. (2018) also attributed the success of SWA to such a phenomenon. That is, Izmailov et al. (2018) explained that SGD travels on a hypersphere, because of convergence to a Gaussian distribution and its concentration on the sphere under a simplified setting, and thus the averaging scheme allows us to go inside the sphere, which may be flat.
We can say that our contribution is to theoretically justify this intuition by extending the result of Dieuleveut et al. (2020) to a nonconvex optimization setting. In the proof, we utilize the alternative view of SGD (Kleinberg et al., 2018) in a non-asymptotic way, under conditions imposed not on the original objective but on the smoothed objective function. Combination with the Markov chain view for nonconvex objectives (Yu et al., 2020) may be helpful for more detailed analyses.

Step Size and Minibatch. SGD with a large step size often suffers from stochastic gradient noise and becomes unstable. This is the reason why we should take a smaller step size so that SGD converges. In this sense, minibatching of stochastic gradients clearly plays the same role as the step size and sometimes brings additional gains. For instance, Smith et al. (2017) empirically demonstrated that the number of parameter updates can be reduced, while maintaining the learning curves on both training and test datasets, by increasing the minibatch size instead of decreasing the step size. We remark that our analysis can incorporate minibatches by dividing σ_1² and σ_2² in Theorems 1 and 2 by the minibatch size, and we can see certain improvements in optimization accuracy as well. Then, both SGD and averaged SGD share the same dependency on the minibatch size, and thus controlling the step size seems more beneficial for parameter averaging.

Edge of Stability. Recently, Cohen et al. (2021) showed that deterministic gradient descent for deep neural networks enters an Edge of Stability phase. In traditional optimization theory, the step size is set smaller than 1/L to ensure stable convergence, and we also make such a restriction. On the other hand, the Edge of Stability phase appears when using a step size higher than 2/L. In this phase, the training loss behaves non-monotonically and the sharpness finally stabilizes around 2/η.
This can be explained as follows (Lewkowycz et al., 2020): if the sharpness around the current parameter is large compared to the step size, then gradient descent cannot stay in such a region and moves to a flatter region that can accommodate the large step size. There are works (Arora et al., 2022; Ahn et al., 2022) which attempted to rigorously justify the Edge of Stability phase. Interestingly, their analyses are based on a similar intuition to ours, but we consider a different regime of step sizes, and a different factor (stochastic noise, rather than a step size larger than 2/L) brings the implicit bias towards flat regions. We believe establishing a unified theory is an interesting direction for future research.

Averaged SGD. Averaged SGD (Ruppert, 1988; Polyak & Juditsky, 1992) is a popular variant of SGD, which returns the average of the parameters obtained by SGD, aiming to stabilize convergence. Because of its better generalization performance, many works conducted convergence rate analyses in the expected risk minimization setting and derived the asymptotically optimal rates O(1/√T) and O(1/T) for non-strongly convex and strongly convex problems (Nemirovski et al., 2009; Bach & Moulines, 2011; Rakhlin et al., 2012; Lacoste-Julien et al., 2012). However, the schedule of the step size is basically designed to optimize the original objective function, and hence the implicit bias coming from the large step size will eventually disappear. When applying a non-diminishing step size schedule, a non-zero optimization error basically remains. What we do in this paper is characterize it as the implicit bias toward a flat region.
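Returning to the minibatch remark above: dividing σ_1² and σ_2² by the minibatch size is the usual variance scaling of an averaged estimator, which a quick simulation confirms (the Gaussian per-example noise here is an illustrative assumption, not a model of real gradient noise):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # per-example gradient-noise standard deviation (illustrative)

def minibatch_noise_var(batch_size, n_trials=200_000):
    # Empirical variance of a minibatch-averaged noise term:
    # averaging B i.i.d. zero-mean noises divides the variance by B
    g = rng.normal(0.0, sigma, size=(n_trials, batch_size)).mean(axis=1)
    return g.var()
```

With sigma = 2, the empirical variance is close to 4 for batch size 1 and close to 4/16 = 0.25 for batch size 16, which is exactly the σ²/B scaling used when substituting minibatch variances into Theorems 1 and 2.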

CONCLUSION

In this paper, we showed that parameter averaging improves the bias-optimization tradeoff caused by the stochastic gradient noise. Specifically, we proved that averaged SGD optimizes the smoothed objective function up to an O(η) error, whereas SGD itself optimizes it up to an O(√η) error, in terms of Euclidean distance from the solution, where η is the step size. Therefore, parameter averaging significantly stabilizes the implicit bias toward a flat region, and we can expect improved performance for difficult datasets for which the stronger bias induced by a larger step size is helpful. Finally, we observed the consistency of our theory with experiments on image classification tasks. Beyond the above discussion, another interesting research direction is to investigate what type of noise is strongly related to generalization performance.

REFERENCES

Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Chu Hong Hoi, et al. Towards theoretically understanding why SGD generalizes better than Adam in deep learning. Advances in Neural Information Processing Systems, 33:21285-21296, 2020.

Finally, we get
\[
\left\| \int J_{\epsilon'(\cdot,z)}^\top(v)\,\nabla f\bigl(v-\eta\epsilon'(v,z)\bigr)\,dP(z) \right\|
\le \left( \int \bigl\|J_{\epsilon'(\cdot,z)}^\top(v)\bigr\|^2\,dP(z) \right)^{\frac{1}{2}} \left( \int \bigl\|\nabla f\bigl(v-\eta\epsilon'(v,z)\bigr)\bigr\|^2\,dP(z) \right)^{\frac{1}{2}}
\le 2\sigma_2 \sqrt{\mathbb{E}\bigl[\|\nabla f(v-\eta\epsilon'(v))\|^2\bigr]}.
\]
This finishes the proof.

A. 2 PROOF OF THEOREM 1

The following proposition is a restatement of the well-known convergence result to a stationary point, written in the coordinate \(v\).

Proposition A. Under Assumptions (A1), (A2), and \(\eta \le \frac{1}{2L}\), we have
\[
\sum_{t=0}^{T} \mathbb{E}\big[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2\big] \le \frac{4}{3\eta}\mathbb{E}[f(w_0)] + \frac{2}{3}\eta\sigma_1^2 L(T+2).
\]

Proof. It is known that (A1) implies the following (Nesterov, 2004): for any \(w, w' \in \mathbb{R}^d\),
\[
f(w') \le f(w) + \nabla f(w)^\top (w' - w) + \frac{L}{2}\|w' - w\|^2.
\]
Substituting the update Eq. (1) into this inequality with \(w' = w_{t+1}\) and \(w = w_t\), and taking the conditional expectation \(\mathbb{E}[\cdot \mid \mathcal{F}_t]\), we get
\[
\mathbb{E}[f(w_{t+1}) \mid \mathcal{F}_t]
\le f(w_t) - \eta\|\nabla f(w_t)\|^2 + \frac{\eta^2 L}{2}\,\mathbb{E}\big[\|\nabla f(w_t) + \epsilon_{t+1}(w_t)\|^2 \mid \mathcal{F}_t\big]
= f(w_t) - \eta\Big(1 - \frac{\eta L}{2}\Big)\|\nabla f(w_t)\|^2 + \frac{\eta^2 L}{2}\,\mathbb{E}\big[\|\epsilon_{t+1}(w_t)\|^2 \mid \mathcal{F}_t\big]
\le f(w_t) - \frac{3\eta}{4}\|\nabla f(w_t)\|^2 + \frac{\eta^2\sigma_1^2 L}{2}.
\]
Thus, we have \(\mathbb{E}[f(w_{t+1})] \le \mathbb{E}[f(w_t)] - \frac{3\eta}{4}\mathbb{E}[\|\nabla f(w_t)\|^2] + \frac{\eta^2\sigma_1^2 L}{2}\). Summing up this inequality and using the nonnegativity of \(f\), we get
\[
\sum_{t=0}^{T+1} \mathbb{E}[\|\nabla f(w_t)\|^2] \le \frac{4}{3\eta}\mathbb{E}[f(w_0)] + \frac{2}{3}\eta\sigma_1^2 L(T+2).
\]
Dropping the \(t = 0\) term of the sum on the left-hand side and using \(w_{t+1} = v_t - \eta\epsilon'_{t+1}(v_t)\), we finally get
\[
\sum_{t=0}^{T} \mathbb{E}\big[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2\big] \le \frac{4}{3\eta}\mathbb{E}[f(w_0)] + \frac{2}{3}\eta\sigma_1^2 L(T+2).
\]

Using the above results, we prove Theorem 1, which is restated below.

Theorem A. Under Assumptions (A1), (A2), and (A3), run the stochastic gradient descent for \(T\) iterations with the step size \(\eta \le \frac{1}{2L}\). Then the implicit SGD satisfies the following inequality:
\[
\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}[\|v_t - v^*\|^2]
\le \frac{\mathbb{E}[\|v_0 - v^*\|^2]}{c\eta(T+1)} + \frac{8}{3c(T+1)}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\mathbb{E}[f(w_0)] + \frac{2\eta\sigma_1^2}{c} + \frac{8\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)
= O(T^{-1}) + \frac{2\eta\sigma_1^2}{c} + \frac{8\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big).
\]

Proof of Theorem A. To evaluate \(\|v_{t+1} - v^*\|^2\) for the implicit SGD (2), we first give several bounds.
By Assumption (A3), Young's inequality, and Lemma B, we get
\[
-2(v_t - v^*)^\top \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]
= -2(v_t - v^*)^\top \nabla F(v_t) + 2(v_t - v^*)^\top\big(\nabla F(v_t) - \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]\big)
\le -2c\|v_t - v^*\|^2 + c\|v_t - v^*\|^2 + \frac{1}{c}\big\|\nabla F(v_t) - \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]\big\|^2
\le -c\|v_t - v^*\|^2 + \frac{4\eta^2\sigma_2^2}{c}\,\mathbb{E}\big[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t\big].
\]
By Assumption (A2) and Young's inequality again, we get
\[
\mathbb{E}\big[\|\epsilon'_{t+1}(v_t) + \nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t\big]
\le 2\,\mathbb{E}[\|\epsilon'_{t+1}(v_t)\|^2 \mid \mathcal{F}_t] + 2\,\mathbb{E}[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t]
\le 2\sigma_1^2 + 2\,\mathbb{E}[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t].
\]
Combining the above two inequalities, we get
\[
\mathbb{E}[\|v_{t+1} - v^*\|^2 \mid \mathcal{F}_t]
= \mathbb{E}\big[\|v_t - \eta\epsilon'_{t+1}(v_t) - \eta\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) - v^*\|^2 \mid \mathcal{F}_t\big]
= \|v_t - v^*\|^2 - 2\eta(v_t - v^*)^\top \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t] + \eta^2\,\mathbb{E}[\|\epsilon'_{t+1}(v_t) + \nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t]
\le (1 - c\eta)\|v_t - v^*\|^2 + 2\eta^2\sigma_1^2 + 2\eta^2\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\mathbb{E}\big[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t\big].
\]
Taking the expectation with respect to the whole history and summing up over \(t = 0, 1, \ldots, T\), we get
\[
c\eta\sum_{t=0}^{T}\mathbb{E}[\|v_t - v^*\|^2]
\le \mathbb{E}[\|v_0 - v^*\|^2] - \mathbb{E}[\|v_{T+1} - v^*\|^2] + 2\eta^2\sigma_1^2(T+1) + 2\eta^2\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\sum_{t=0}^{T}\mathbb{E}\big[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2\big]
\le \mathbb{E}[\|v_0 - v^*\|^2] - \mathbb{E}[\|v_{T+1} - v^*\|^2] + 2\eta^2\sigma_1^2(T+1) + \frac{8}{3}\eta\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\mathbb{E}[f(w_0)] + \frac{4}{3}\eta^3\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\sigma_1^2 L(T+2),
\]
where we used Proposition A. Therefore, we conclude
\[
\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}[\|v_t - v^*\|^2]
\le \frac{\mathbb{E}[\|v_0 - v^*\|^2]}{c\eta(T+1)} + \frac{8}{3c(T+1)}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\mathbb{E}[f(w_0)] + \frac{2\eta\sigma_1^2}{c} + \frac{8\eta^2}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\sigma_1^2 L
= O(T^{-1}) + \frac{2\eta\sigma_1^2}{c} + \frac{8\eta^2}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\sigma_1^2 L.
\]
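The bound of Proposition A can be sanity-checked numerically. The sketch below is our own check, not part of the paper: it runs SGD on the one-dimensional quadratic \(f(w) = w^2/2\), so that \(\nabla f(w) = w\) and \(L = 1\), with noise uniform on \([-r, r]\) (hence \(\sigma_1^2 \le r^2\)), and compares the accumulated squared gradient norms against the right-hand side of the bound; the function name and constants are illustrative.

```python
import random

def proposition_a_check(eta, r, T, w0=1.0, seed=0):
    """SGD on f(w) = w^2/2 (grad f(w) = w, L = 1) with noise uniform on [-r, r].
    Returns (sum_{t=1}^{T+1} ||grad f(w_t)||^2, right-hand side of Proposition A)."""
    rng = random.Random(seed)
    w, total = w0, 0.0
    for _ in range(T + 1):
        w = w - eta * (w + rng.uniform(-r, r))  # SGD step with stochastic gradient
        total += w * w                          # ||grad f(w_t)||^2 = w_t^2
    L, sigma1_sq = 1.0, r * r                   # sigma1^2 <= r^2 for uniform noise
    rhs = (4.0 / (3.0 * eta)) * (0.5 * w0 * w0) \
        + (2.0 / 3.0) * eta * sigma1_sq * L * (T + 2)
    return total, rhs

# eta = 0.1 satisfies eta <= 1/(2L) = 0.5 as required by the proposition.
empirical, bound = proposition_a_check(eta=0.1, r=0.5, T=2000)
```

The bound holds in expectation; on a single trajectory with these constants it is satisfied with a comfortable margin, partly because \(\sigma_1^2\) is over-estimated by \(r^2\).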

A. 3 PROOF OF THEOREM 2

We give several statements used to prove Theorem 2.

Lemma C. Under the same assumptions as in Theorem A, run the stochastic gradient descent for \(T\) iterations with the step size \(\eta \le \frac{1}{2L}\). Then the implicit SGD satisfies the following inequalities:
\[
\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\Big]\Big\|
\le \frac{1}{\eta}O(1) + \frac{1}{\eta}\sqrt{(T+1)\Big(\frac{4\eta\sigma_1^2}{c} + \frac{16\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\Big)},
\]
\[
\mathbb{E}\Big[\sum_{t=0}^{T}\big\|\nabla F(v_t) - \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]\big\|\Big]
\le 2\sigma_2\eta^{\frac{1}{2}}O(T^{\frac{1}{2}}) + 2\sigma_1\sigma_2\eta^{\frac{3}{2}}\sqrt{\tfrac{2}{3}L(T+1)(T+2)}.
\]

Proof of Lemma C. By a simple calculation, we get
\[
\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\Big]\Big\|
= \Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\big(\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) + \epsilon'_{t+1}(v_t)\big)\Big]\Big\|
= \frac{1}{\eta}\|\mathbb{E}[v_0 - v_{T+1}]\|
\le \frac{1}{\eta}\mathbb{E}[\|v_0 - v_{T+1}\|]
\le \frac{1}{\eta}\sqrt{\mathbb{E}[\|v_0 - v_{T+1}\|^2]}
\le \frac{1}{\eta}\sqrt{2\,\mathbb{E}\big[\|v_0 - v^*\|^2 + \|v_{T+1} - v^*\|^2\big]}
\le \frac{1}{\eta}\sqrt{2\sum_{t=0}^{T+1}\mathbb{E}[\|v_t - v^*\|^2]}
\le \frac{1}{\eta}\sqrt{O(1) + (T+1)\Big(\frac{4\eta\sigma_1^2}{c} + \frac{16\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\Big)}
\le \frac{1}{\eta}O(1) + \frac{1}{\eta}\sqrt{(T+1)\Big(\frac{4\eta\sigma_1^2}{c} + \frac{16\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\Big)},
\]
where we used Theorem A and Jensen's inequality. Next, we show the second inequality by using Lemma B as follows:
\[
\mathbb{E}\Big[\sum_{t=0}^{T}\big\|\nabla F(v_t) - \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]\big\|\Big]
\le \mathbb{E}\Big[\sum_{t=0}^{T}2\eta\sigma_2\sqrt{\mathbb{E}[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2 \mid \mathcal{F}_t]}\Big]
\le 2\eta\sigma_2\sum_{t=0}^{T}\sqrt{\mathbb{E}[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2]}
\le 2\eta\sigma_2\sqrt{(T+1)\sum_{t=0}^{T}\mathbb{E}[\|\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\|^2]}
\le 2\eta\sigma_2\sqrt{(T+1)\Big(\frac{4}{3\eta}\mathbb{E}[f(w_0)] + \frac{2}{3}\eta\sigma_1^2 L(T+2)\Big)}
\le 2\sigma_2\eta^{\frac{1}{2}}O(T^{\frac{1}{2}}) + 2\sigma_1\sigma_2\eta^{\frac{3}{2}}\sqrt{\tfrac{2}{3}L(T+1)(T+2)},
\]
where we used Proposition A in the second-to-last step.

Proposition B. Under the same assumptions as in Theorem A, run the stochastic gradient descent for \(T\) iterations with the step size \(\eta \le \frac{1}{2L}\). Then the implicit SGD satisfies
\[
\frac{1}{T+1}\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla F(v_t)\Big]\Big\| \le O(T^{-\frac{1}{2}}) + \frac{4}{\sqrt{3}}\sigma_1\sigma_2\eta^{\frac{3}{2}}L^{\frac{1}{2}}.
\]
Proof of Proposition B. Using Lemma C, we get
\[
\frac{1}{T+1}\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla F(v_t)\Big]\Big\|
\le \frac{1}{T+1}\mathbb{E}\Big[\sum_{t=0}^{T}\big\|\nabla F(v_t) - \mathbb{E}[\nabla f(v_t - \eta\epsilon'_{t+1}(v_t)) \mid \mathcal{F}_t]\big\|\Big] + \frac{1}{T+1}\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla f(v_t - \eta\epsilon'_{t+1}(v_t))\Big]\Big\|
\le 2\sigma_2\eta^{\frac{1}{2}}O(T^{-\frac{1}{2}}) + 2\sigma_1\sigma_2\eta^{\frac{3}{2}}\sqrt{\tfrac{2}{3}L\,\tfrac{T+2}{T+1}} + \frac{1}{\eta}O(T^{-1}) + \frac{1}{\eta}\sqrt{\frac{1}{T+1}\Big(\frac{4\eta\sigma_1^2}{c} + \frac{16\eta^2\sigma_1^2 L}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big)\Big)}
\le O(T^{-\frac{1}{2}}) + \frac{4}{\sqrt{3}}\sigma_1\sigma_2\eta^{\frac{3}{2}}L^{\frac{1}{2}}.
\]

We now prove Theorem 2, which is restated below.

Theorem B. Under Assumptions (A1)–(A5), run the averaged SGD for \(T\) iterations with the step size \(\eta \le \frac{1}{2L}\). Then the average \(\bar{v}_T\) satisfies the following inequality:
\[
\|\mathbb{E}[\bar{v}_T] - v^*\| \le O(T^{-\frac{1}{2}}) + \frac{4\sigma_1\sigma_2\eta^{\frac{3}{2}}L^{\frac{1}{2}}}{\sqrt{3}\,\mu} + \frac{2\eta\sigma_1^2 M}{c\mu} + \frac{8\eta^2\sigma_1^2 LM}{3c\mu}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big).
\]

Proof. We define \(R(v) = \nabla F(v) - \nabla^2 F(v^*)(v - v^*)\). Then, by (A4), we see \(\|R(v)\| \le M\|v - v^*\|^2\). By taking the average of \(R(v_t)\) over \(t \in \{0, 1, \ldots, T\}\) and rearranging terms, we get
\[
\nabla^2 F(v^*)(\bar{v}_T - v^*) = \frac{1}{T+1}\sum_{t=0}^{T}\nabla F(v_t) - \frac{1}{T+1}\sum_{t=0}^{T}R(v_t).
\]
Therefore, we get
\[
\mu\|\mathbb{E}[\bar{v}_T] - v^*\|
\le \|\nabla^2 F(v^*)(\mathbb{E}[\bar{v}_T] - v^*)\|
\le \frac{1}{T+1}\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla F(v_t)\Big]\Big\| + \frac{1}{T+1}\mathbb{E}\Big[\sum_{t=0}^{T}\|R(v_t)\|\Big]
\le \frac{1}{T+1}\Big\|\mathbb{E}\Big[\sum_{t=0}^{T}\nabla F(v_t)\Big]\Big\| + \frac{M}{T+1}\mathbb{E}\Big[\sum_{t=0}^{T}\|v_t - v^*\|^2\Big].
\]
The former and latter terms can be bounded by Proposition B and Theorem A, respectively. Thus, we finally get
\[
\mu\|\mathbb{E}[\bar{v}_T] - v^*\| \le O(T^{-\frac{1}{2}}) + \frac{4\sigma_1\sigma_2\eta^{\frac{3}{2}}L^{\frac{1}{2}}}{\sqrt{3}} + \frac{2\eta\sigma_1^2 M}{c} + \frac{8\eta^2\sigma_1^2 LM}{3c}\Big(1 + \frac{2\eta\sigma_2^2}{c}\Big),
\]
and dividing both sides by \(\mu\) completes the proof.

We also run SGD and averaged SGD on the CIFAR10 dataset with the step-size strategy 'l' under the same settings as in Section 5. Table 3 lists the results including this case. We observe that the large step size 'l' does not work as well on CIFAR10 as the other schedules. We hypothesize this is because CIFAR10 is not a particularly difficult dataset and does not require the stronger bias induced by a larger step size.

B ADDITIONAL EXPERIMENTS

We also validate the cosine annealing strategy for the step size, which is frequently used due to its excellent performance. We use the symbols 's', 'm', and 'l' for the cosine annealing schedules whose final step sizes are set to 0, 0.004, and 0.02, respectively. The parameter averaging for averaged SGD is taken over the last quarter of training. From the table, we observe the usefulness of parameter averaging for the cosine annealing schedule as well.

Finally, we run SGD, SGD with a large step size, and averaged SGD to train a standard convolutional neural network on the Fashion MNIST dataset to confirm how efficiently the sharpness and the classification accuracy can be optimized by each method. We note that the large step size used for SGD is the same as that for averaged SGD. We plot the trace of the Hessian \(\nabla^2 f(w)\) and the test loss in Figure 5. From this figure, we observe that averaged SGD converges to a flatter region and achieves the highest classification accuracy on the test dataset, as expected from our theory.

C.1 MOTIVATING EXAMPLE

In this section, we present a motivating example that verifies the convergence to a flat minimum and a certain separation between SGD and averaged SGD. We consider a one-dimensional objective function \(f : \mathbb{R} \to \mathbb{R}\) defined as follows: for \(p, \delta > 0\),
\[
f(w) = \frac{1}{2}(w - p)^2 + g_\delta(w),
\]
where \(g_\delta(w) = \delta g_1(w/\delta)\) is a scaled mollifier. The bound obtained for averaged SGD means that averaged SGD gets closer to \(v^* = p\) as long as SGD approaches a neighborhood of \(v^*\). According to these results, both SGD and averaged SGD converge to a flat region when \(\delta\) is small, and averaged SGD converges even when \(\delta\) is relatively large. We empirically observe this phenomenon in Figure 7, in which we run SGD and averaged SGD on problems with small \(\delta = 0.1\) and relatively large \(\delta = 0.5\).
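The qualitative behavior of this motivating example can be reproduced with a short script. We use a standard compactly supported bump in place of the paper's exact mollifier and numerical differentiation for simplicity; \(p\), \(\delta\), the dip depth, and all function names below are our own illustrative choices, so treat this as a sketch of the setup rather than the precise construction.

```python
import math
import random

def bump(w, delta):
    """Smooth bump supported on [-delta, delta] (classical mollifier shape)."""
    x = w / delta
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def objective(w, p=1.0, delta=0.1, depth=2.0):
    """Quadratic centered at the flat minimum p, plus a sharp dip near the origin."""
    return 0.5 * (w - p) ** 2 - depth * bump(w, delta)

def grad(w, h=1e-6):
    """Central-difference gradient of the objective."""
    return (objective(w + h) - objective(w - h)) / (2.0 * h)

def run(average, eta=0.2, r=1.0, steps=3000, w0=0.0, seed=0):
    """SGD with uniform noise on [-r, r]; return the last iterate, or the
    average over the last quarter of the trajectory (as in the experiments)."""
    rng = random.Random(seed)
    w, tail = w0, []
    for t in range(steps):
        w = w - eta * (grad(w) + rng.uniform(-r, r))
        if t >= 3 * steps // 4:
            tail.append(w)
    return sum(tail) / len(tail) if average else w

w_avg = run(average=True)    # concentrates near the flat minimum p = 1
w_last = run(average=False)  # fluctuates at the noise floor around p
```

Even though the sharp dip at the origin is the global minimum of the unsmoothed objective, the noise smooths it away, and the tail-averaged iterate concentrates tightly around the flat minimum while the last iterate keeps fluctuating.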

C. 3 ESTIMATION OF CONSTANTS

We verify the estimations of the constants in (11). \(L\), \(\sigma_1^2\), and \(\sigma_2\) are already obtained; thus \(\mu\), \(c\), and \(M\) remain.

Minimum and estimation of \(\mu\). We first see that, under our problem setting, the local minimum around the origin is eliminated and \(p\) is the optimal solution of \(F\), i.e., \(v^* = p\). The smoothed function \(G_\delta(v) \stackrel{\mathrm{def}}{=} \mathbb{E}[g_\delta(v - \eta\epsilon')]\) is given by
\[
G_\delta(v) = \int_{-r}^{r} g_\delta(v - \eta t)\,\frac{1}{2r}\,dt,
\]
and its derivative \(G'_\delta(v)\) is obtained by differentiating under the integral sign. Hence, \(v^* = p\) is the unique local minimum (i.e., the optimal solution) of \(F\), and since \(F''(p) = 1\) we can conclude \(\mu = 1\).

Estimation of \(c\). From the above argument, we get
\[
F'(v)(v - p) = (v - p)^2 + G'_\delta(v)(v - p) \ge
\begin{cases}
(v - p)^2 & (v \in [-\eta r - \delta, -\eta r + \delta]),\\[2pt]
(v - p)^2 + \frac{pC_1\delta}{\eta r}(v - p) \ge (v - p)^2 + \frac{p}{2}(v - p) & (v \in [\eta r - \delta, \eta r + \delta]),\\[2pt]
(v - p)^2 & (\text{otherwise}).
\end{cases}
\]
Clearly, \(p/2 \le 2(p - v)/3\) for \(v \le \eta r + \delta \le p/4\). Thus, \(F'(v)(v - p) \ge (v - p)^2/3\) on \(v \in [\eta r - \delta, \eta r + \delta]\), and we conclude \(c = 1/3\).

Estimation of \(M\). Noting \(v^* = p\) and \(F''(p) = 1\), we have
\[
|F'(v) - F''(v^*)(v - v^*)| = |(v - p) + G'_\delta(v) - (v - p)| = |G'_\delta(v)|.
\]
This concludes \(M = \frac{8}{9p}\).



By the construction, there is a probability space \((\Omega, \mathcal{F}, P)\) such that \(\epsilon'(v)\) can be represented as a measurable map \(\Omega \ni z \mapsto \epsilon'(v, z)\). Then, \(F(v) = \mathbb{E}[f(v - \eta\epsilon'(v))] = \int f(v - \eta\epsilon'(v, z))\,dP(z)\) and \(\nabla F(v) = \mathbb{E}[\nabla(f(v - \eta\epsilon'(v)))] = \int \nabla(f(v - \eta\epsilon'(v, z)))\,dP(z)\), whereas \(\nabla f(v - \eta\epsilon'(v))\) in Eq. (2) means \(\nabla f(w)|_{w = v - \eta\epsilon'(v)}\), which does not involve the derivative of \(\epsilon'(v)\). Therefore, \(\nabla F(v) \ne \mathbb{E}[\nabla f(v - \eta\epsilon'(v))]\) in general.



Figure 1: We run SGD and averaged SGD 500 times with the uniform stochastic gradient noise for two objective functions (top and bottom). Figure (a) depicts the objective function f (green, η = 0) and the smoothed objectives F (red and blue, η > 0). Figures (b) and (c) plot the convergent points of SGD and averaged SGD with histograms, respectively.


Figure 2: The figure plots the original objective (green), the smoothed objectives (blue, darker is smoother), and the convergent points obtained by averaged SGD, which is run on the asymmetric valley objective 500 times for each step size η ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.

Figure 3: Test accuracies achieved by SGD and averaged SGD on CIFAR100 dataset with ResNet-50 and WRN-28-10.

Figure 4: Sections of the train (red) and test (blue) loss landscapes across the parameters obtained by averaged SGD (distance=0) and SGD (distance=1) for ResNet-50 with CIFAR100 dataset. SGD is run with a small step size after running averaged SGD with a large step size. The middle figure is the close-up view at the edge. The triangle and circle markers represent convergent parameters by SGD and averaged SGD, respectively. The right figure plots smoothed train loss functions (green, darker is smoother) with Gaussian noises in addition to train and test losses. The blank circles are the minimizers of smoothed objectives.

Figure 5: The figure depicts the trace of the Hessian ∇²f(w) and the test loss curves achieved by SGD, SGD with a large step size, and averaged SGD. Each algorithm is run to train a standard convolutional neural network on the Fashion MNIST dataset.

Figure 6: The left figure plots the mollifier g_δ (blue) and the smoothed mollifier G_δ (orange); the right figure plots the objective f (blue) and the smoothed objective F (orange). The constants are δ = 0.1, r = 2.0, and p = 1.0.

Here, \(g_\delta(w) = \delta g_1(w/\delta)\) is a scaling of the well-known mollifier \(g_1\), an infinitely differentiable function with compact support. That is, \(g_\delta\) is a smooth function whose support is \([-\delta, \delta]\). Because of the coefficient \(p\) of \(g_\delta\), the function \(f(w)\) has a local minimum in \([-\delta, \delta]\), which can be the global minimum. See Figure 6 (right). The maximum values of the first and second derivatives of \(g_1\) are bounded. Thus, we define constants \(C_1, C_2\) by
\[
C_1 = \max\Big\{1, \frac{1}{p}\max_w |g'_1(w)|\Big\}, \qquad C_2 = \frac{1}{p}\max_w |g''_1(w)|.
\]

Figure 7: The figures plot the convergent points of SGD and averaged SGD for problems with δ = 0.1 and δ = 0.5.

Taking into account \(\mathrm{supp}(g_\delta) = [-\delta, \delta]\), the smoothed objective \(G_\delta(v)\) is constant on \(\{|v| \le \eta r - \delta\} \cup \{|v| \ge \eta r + \delta\}\), and thus \(G'_\delta\) is non-zero only on \(\mathrm{supp}(G'_\delta) = [-\eta r - \delta, -\eta r + \delta] \cup [\eta r - \delta, \eta r + \delta]\). See Figure 6 (left). Since \(\eta r + \delta < p/4 < p\) under (10), \(v = p\) is still a local minimum of \(F\). We evaluate the bound on \(G'_\delta\) over \(\mathrm{supp}(G'_\delta)\) below. For \(v \in [\eta r - \delta, \eta r + \delta]\), the support of \(g'_\delta(v - \eta t)\) in \(t \in \mathbb{R}\) is \([(v - \delta)/\eta, (v + \delta)/\eta]\), and \(|g'_\delta(v)| = |g'_1(v/\delta)| \le pC_1\). A bound on \([-\eta r - \delta, -\eta r + \delta]\) is obtained in the same way. Thus, we see
\[
\begin{cases}
-\frac{pC_1\delta}{\eta r} \le G'_\delta(v) \le 0 & (v \in [-\eta r - \delta, -\eta r + \delta]),\\[2pt]
0 \le G'_\delta(v) \le \frac{pC_1\delta}{\eta r} & (v \in [\eta r - \delta, \eta r + \delta]),\\[2pt]
G'_\delta(v) = 0 & (\text{otherwise}).
\end{cases}
\]
If there were additional stationary points of \(F\), they would have to lie in \([\eta r - \delta, \eta r + \delta] = \mathrm{supp}(G'_\delta) \setminus [-\eta r - \delta, -\eta r + \delta]\) because of the sign of \(G'_\delta\) and \(\mathrm{supp}(G'_\delta) \subset (-\infty, p/4)\). However, since \(\eta r + \delta \le p/4\) and \(\frac{pC_1\delta}{\eta r} \le p/2\) under (10), we see
\[
\max_{v \in [\eta r - \delta, \eta r + \delta]} F'(v) \le (\eta r + \delta) - p + \frac{pC_1\delta}{\eta r} \le \frac{p}{4} - p + \frac{p}{2} = -\frac{p}{4} < 0,
\]
so no additional stationary points exist.

Because of the problem setup, it is enough to verify that \(M = \frac{8}{9p}\) satisfies \(|G'_\delta(v)| \le M|v - p|^2\) on \(v \in [\eta r - \delta, \eta r + \delta]\). Since \(|G'_\delta(v)| \le p/2\) and \(v \le p/4\) on this interval, we have
\[
|G'_\delta(v)| \le \frac{p}{2} = \frac{8}{9p}\cdot\frac{9p^2}{16} \le \frac{8}{9p}(p - v)^2 = M|v - p|^2,
\]
where we used \((p - v)^2 \ge (3p/4)^2 = 9p^2/16\).

Comparison of test classification accuracies on the CIFAR100 and CIFAR10 datasets. We use ResNet-50, WideResNet-28-10 (WRN-28-10), and the Pyramid Network (Han et al., 2017) with 272 layers and widening factor 200. In all settings, we use the standard data augmentations: horizontal flip, normalization, padding by four pixels, random crop, and cutout (DeVries & Taylor, 2017).

Comparison of test classification accuracies on CIFAR10 dataset. All methods adopt the multi-step strategy for the step size schedule.

Comparison of test classification accuracies on CIFAR100 and CIFAR10 datasets. All methods adopt cosine annealing for the step-size schedule.


Lemma A. Under Assumption (A1) and \(\eta \le \frac{1}{2L}\), the function \(\varphi\) is injective and invertible, and its inverse \(\varphi^{-1}\), defined on \(\mathrm{Im}\,\varphi\), is differentiable.

Proof. For \(w, w' \in \mathbb{R}^d\), suppose \(\varphi(w) = \varphi(w')\). Then, using the \(L\)-Lipschitz continuity of \(\nabla f\) due to (A1), it follows that \(w = w'\), so \(\varphi\) is an injection. Moreover, \(J_\varphi(w) = I - \eta\nabla^2 f(w) \succeq (1 - \eta L)I \succeq \frac{1}{2}I\). Thus, \(\varphi\) is invertible and \(\varphi^{-1}\), defined on \(\mathrm{Im}\,\varphi\), is differentiable because of the injectivity and the inverse map theorem.

Using \(\varphi\), we see \(\epsilon'(v) = \epsilon(\varphi^{-1}(v))\) for \(v \in \mathrm{Im}\,\varphi\). Let \((\Omega, \mathcal{F}, P)\) be a probability space such that \(\epsilon'(v)\) can be represented as a measurable map \(z \in \Omega \mapsto \epsilon'(v, z)\). Note that we use \(\epsilon'(v)\) and \(\epsilon'(v, z)\) depending on the situation. For a function \(g : \mathbb{R}^d \to \mathbb{R}^d\), we denote by \(J_g(w)\) the Jacobian of \(g\), i.e., \(J_g(w) = (\partial g_i(w)/\partial w_j)_{i,j=1}^d\).

Lemma B. Under Assumptions (A1) and (A2), for any \(v \in \mathrm{Im}\,\varphi\),
\[
\nabla F(v) = \mathbb{E}\big[\big(I - \eta J_{\epsilon'(\cdot)}^\top(v)\big)\nabla f(v - \eta\epsilon'(v))\big]
\quad\text{and}\quad
\big\|\nabla F(v) - \mathbb{E}[\nabla f(v - \eta\epsilon'(v))]\big\| \le 2\eta\sigma_2\sqrt{\mathbb{E}[\|\nabla f(v - \eta\epsilon'(v))\|^2]}.
\]

Proof. The first equality of the statement can be confirmed by direct calculation. Next, we evaluate the last term. By the chain rule and the inverse map theorem, \(J_{\epsilon'(\cdot,z)}(v) = J_{\epsilon(\cdot,z)}(\varphi^{-1}(v))\,J_{\varphi^{-1}}(v)\). Note that, from the assumption, \(\|J_{\epsilon(\cdot,z)}(w)\| \le \sigma_2\) for any \(w\), and \(\|J_{\varphi^{-1}}(v)\| \le 2\) by Lemma A, so \(\|J_{\epsilon'(\cdot,z)}(v)\| \le 2\sigma_2\).

Since \(g''_\delta(w) = \frac{1}{\delta}g''_1(w/\delta)\), the second derivative of \(g_\delta\) is bounded by \(C_2 p\delta^{-1}\). Hence, \(f\) is Lipschitz smooth (has a bounded Hessian) with constant \(L = (\delta + C_2 p)/\delta\).

Next, we consider the uniform noise on the interval \([-r, r]\) for \(r > 0\), i.e., \(\epsilon \sim U[-r, r]\), and suppose \(\epsilon(w, z) = \epsilon(z)\ (= \epsilon'(v, z))\), where \(\Omega \ni z \mapsto \epsilon(w, z)\) is an explicit representation of the random noise. In other words, the noise distribution does not change in \(w\). In this case, we see \(\sigma_1^2 = \mathbb{E}[\epsilon^2] \le r^2\) and \(\sigma_2 = 0\). The smoothed objective \(F\) with the noise \(\epsilon'\) and step size \(\eta\) is
\[
F(v) = \mathbb{E}[f(v - \eta\epsilon')] = \int_{-r}^{r} f(v - \eta t)\,\frac{1}{2r}\,dt.
\]
We consider the problem setup given by the inequalities (9). Note that we can choose arbitrarily small \(\delta > 0\) and large \(r\) which satisfy these inequalities. For appropriate smoothing, we choose the step size \(\eta\) so that the condition (10) holds. A step size \(\eta\) satisfying (10) exists, and it also satisfies \(\eta \le 1/L = \delta/(\delta + C_2 p)\), as required in the theory.
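To visualize the smoothing effect of the uniform noise concretely, \(F(v) = \mathbb{E}[f(v - \eta\epsilon')]\) can be estimated by a one-dimensional quadrature. The snippet below is our own illustration: the sharp-dip objective and all constants are stand-ins for the paper's mollifier construction, and it shows that the sharp global minimum of \(f\) near the origin disappears in the smoothed objective while the flat minimum at \(p\) survives.

```python
import math

def sharp_objective(w, p=1.0, delta=0.1, depth=2.0):
    """Quadratic centered at p plus a sharp dip of width delta near the origin."""
    x = w / delta
    dip = math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0
    return 0.5 * (w - p) ** 2 - depth * dip

def smoothed_objective(v, eta, r, n=2001):
    """Midpoint-rule estimate of F(v) = E[f(v - eta*eps)] with eps ~ U[-r, r]."""
    total = 0.0
    for i in range(n):
        t = -r + 2.0 * r * (i + 0.5) / n  # midpoint of the i-th subinterval
        total += sharp_objective(v - eta * t)
    return total / n

eta, r = 0.5, 1.0
# In f itself, the dip at 0 lies below the flat minimum at p = 1 ...
gap_original = sharp_objective(0.0) - sharp_objective(1.0)
# ... but after smoothing, the dip is averaged away and p is preferred.
gap_smoothed = smoothed_objective(0.0, eta, r) - smoothed_objective(1.0, eta, r)
```

Since the averaging window \(\eta r\) is much wider than the dip width \(\delta\), the dip contributes only a negligible amount to the smoothed value, matching the elimination of the sharp minimum described above.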

C. 2 CONVERGENCE OF SGD AND AVERAGED SGD

Under the above setup (8)–(10), we can estimate the constants appearing in the convergence results of SGD and averaged SGD as follows (for the details see the next subsection):
\[
L = \frac{\delta + C_2 p}{\delta}, \quad \sigma_1^2 \le r^2, \quad \sigma_2 = 0, \quad \mu = 1, \quad c = \frac{1}{3}, \quad M = \frac{8}{9p}. \tag{11}
\]
Moreover, the minimizer of the smoothed objective is \(v^* = p\): the sharp minimum (\(\sim 0\)) is eliminated by smoothing. Therefore, for SGD, Theorem 1 gives a bound on \(\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}[\|v_t - p\|^2]\). From this inequality, \(\eta_* = \frac{2C_1\delta}{r}\) is the best choice of the step size, where we apply Jensen's inequality to derive the bound on the \(L_1\)-norm. This result means that SGD avoids the sharp minimum (i.e., \(v \sim 0\) for small \(\delta > 0\)) and converges to the flat minimum \(v^* = p\), while a too large noise level affects the convergence to \(v^*\) under our step-size policy. Moreover, for averaged SGD, Theorem 2 gives the corresponding bound on \(\|\mathbb{E}[\bar{v}_T] - p\|\).

