SALR: SHARPNESS-AWARE LEARNING RATES FOR IMPROVED GENERALIZATION

Abstract

In an effort to improve generalization in deep learning, we propose SALR: a sharpness-aware learning rate update technique designed to recover flat minimizers. Our method dynamically updates the learning rate of gradient-based optimizers based on the local sharpness of the loss function. This allows optimizers to automatically raise learning rates at sharp valleys, increasing the chance of escaping them. We demonstrate the effectiveness of SALR when adopted by various algorithms over a broad range of networks. Our experiments indicate that SALR improves generalization, converges faster, and drives solutions to significantly flatter regions.

1. INTRODUCTION

Figure 1: A conceptual sketch of flat and sharp minima (Keskar et al., 2017).

Generalization in deep learning has recently been an active area of research. Efforts to improve generalization over the past two decades have brought about many cornerstone advances and techniques, be it dropout (Gal & Ghahramani, 2016), batch normalization (Ioffe & Szegedy, 2015), data augmentation (Shorten & Khoshgoftaar, 2019), weight decay (Loshchilov & Hutter, 2019), adaptive gradient-based optimization (Kingma & Ba, 2015), architecture design and search (Radosavovic et al., 2020), and ensembles and their Bayesian counterparts (Garipov et al., 2018; Izmailov et al., 2018), amongst many others. Recently, however, researchers have discovered that the concept of sharpness/flatness plays a fundamental role in generalization. Though sharpness was first discussed in the context of neural networks in the early work of Hochreiter & Schmidhuber (1997), it was only brought to the forefront of deep learning research by the seminal paper of Keskar et al. (2017). While investigating the decreased generalization performance of large batch sizes (LeCun et al., 2012) in stochastic gradient descent (SGD), Keskar et al. (2017) noticed that this phenomenon can be explained by the ability of smaller batches to reach flat minimizers. Such flat minimizers, in turn, generalize well, as they are robust to low-precision arithmetic or noise in the parameter space (Dinh et al., 2017; Kleinberg et al., 2018), as shown in Figure 1. Since then, the generalization ability of flat minimizers has been repeatedly observed in many recent works (Neyshabur et al., 2017a; Goyal et al., 2017; Li et al., 2018; Izmailov et al., 2018). Indeed, flat minimizers can potentially tie together many of the aforementioned approaches aimed at generalization.
For instance, (1) higher gradient variance, when batches are small, increases the probability of avoiding sharp regions (the same can be said for SGD compared to GD) (Kleinberg et al., 2018); (2) averaging over multiple hypotheses leads to wider optima in ensembles and Bayesian deep learning (Izmailov et al., 2018); (3) regularization techniques such as dropout or over-parameterization can adjust the loss landscape into one that allows first-order methods to favor wide valleys (Chaudhari et al., 2019; Allen-Zhu et al., 2019). In this paper we study the direct problem of developing an algorithm that converges to flat minimizers. Specifically, we introduce SALR: a sharpness-aware learning rate designed to explore the loss surface of an objective function and avoid undesired sharp local minima. SALR dynamically updates the learning rate based on the sharpness of the neighborhood of the current solution. The idea is simple: automatically increase the learning rate at relatively sharp valleys in an effort to escape them. One of the key features of SALR is that it can be fitted into any gradient-based method, such as Adagrad (Duchi et al., 2011) and ADAM (Kingma & Ba, 2015), and also into recent approaches for escaping sharp valleys such as Entropy-SGD (Chaudhari et al., 2019).

1.1. RELATED WORK

From a theoretical perspective, the generalization of deep learning solutions has been explained through multiple lenses. One such lens is uniform stability (Bottou & Le Cun, 2005; Bottou & Bousquet, 2008; Hardt et al., 2016; Gonen & Shalev-Shwartz, 2017; Bottou et al., 2018). An algorithm is uniformly stable if, for all data sets differing in only one element, nearly the same outputs are produced (Bousquet & Elisseeff, 2002). Hardt et al. (2016) show that SGD satisfies this property and derive a generalization bound for models learned with SGD. From a different viewpoint, Choromanska et al. (2015), Kawaguchi (2016), Poggio et al. (2017), and Mohri et al. (2018) attribute generalization to the complexity of the hypothesis space. Using measures like Rademacher complexity (Mohri & Rostamizadeh, 2009) and the Vapnik-Chervonenkis (VC) dimension (Sontag, 1998), these works show that deep hypothesis spaces are typically more advantageous for representing complex functions. Beyond that, the importance of flatness for generalization has been theoretically highlighted through PAC-Bayes bounds (Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Wang et al., 2018). These papers highlight the ability to derive non-vacuous generalization bounds based on the sharpness of a model class, while arguing that relatively flat solutions yield tight bounds. From an algorithmic perspective, approaches to recover flat minima are still limited. Most notably, Chaudhari et al. (2019) developed the Entropy-SGD algorithm, which defines a local-entropy-based objective that smooths the energy landscape based on its local geometry; this in turn allows SGD to attain flatter solutions. Indeed, this approach was motivated by earlier work in statistical physics (Baldassi et al., 2015; 2016), which proves the existence of non-isolated solutions that generalize well in networks with discrete weights. Such non-isolated solutions correspond to flat minima in continuous settings.
The authors then propose a set of approaches based on ensembles and replicas of the loss to favor wide solutions. Relatedly, recent methods in Bayesian deep learning (BDL) have also shown potential to recover flat minima. BDL averages over multiple hypotheses weighted by their posterior probabilities, with ensembles being a special case (Izmailov et al., 2018). One example is the stochastic weight averaging (SWA) algorithm proposed by Izmailov et al. (2018), which simply averages multiple points along the trajectory of SGD to find flatter solutions than SGD. Another example is SWA-Gaussian (SWAG), which defines a Gaussian posterior approximation over neural network weights and then draws samples from the approximate distribution to perform Bayesian model averaging (Maddox et al., 2019). Here we also note the recent work by Patel (2017), which partially motivates our method. Building on the aforementioned observations of Keskar et al. (2017), Patel (2017) shows that the learning-rate lower-bound threshold for the divergence of batch SGD, run on quadratic optimization problems, increases with the batch size. In general non-convex settings, given a problem with N local minimizers, one can compute N lower-bound thresholds for the local divergence of batch SGD; the number of minimizers to which batch SGD can converge is non-decreasing in the batch size. This explains the tendency of small-batch SGD to converge to flatter minimizers than large-batch SGD. The former result links the choice of batch size, and its effect on generalization, to the choice of the learning rate. Since the learning rate is a tunable parameter, it is natural to ask whether it can be chosen dynamically; to our knowledge, a dynamic choice of the learning rate that targets convergence to flat minimizers has not been studied before.

2. GENERAL FRAMEWORK

In this paper, we propose a framework that dynamically chooses a Sharpness-Aware Learning Rate to promote convergence to flat minimizers. More specifically, our proposed method locally approximates the sharpness at the current iterate and adjusts the learning rate accordingly. In sharp regions, relatively large learning rates are used to increase the chance of escaping the region. In contrast, when the current iterate belongs to a flat region, our method returns a relatively small learning rate to guarantee convergence. Our framework can be adopted by any local-search descent method and is detailed in Algorithm 1.

Algorithm 1: Sharpness-Aware Learning Rate (SALR) Framework
Input: starting point $\theta_0$, initial learning rate $\eta_0$, number of iterations $K$.
for $k = 0, 1, \ldots, K$ do
    Estimate $S_k$, the local sharpness around the current iterate $\theta_k$;
    Set $\eta_k = \eta_0 \, S_k / \mathrm{median}\{S_i\}_{i=1}^{k}$;
    Compute $\theta_{k+1}$ using some local-search descent method (gradient descent, stochastic gradient descent, ADAM, ...);
end
Return $\theta_K$.

As detailed in Algorithm 1, at every iteration $k$ we compute the learning rate as a function of the local sharpness estimates $\{S_i\}_{i=1}^{k}$. The main intuition is to make the current learning rate an increasing function of the current estimated sharpness. Since the scale of the sharpness at different points can vary across networks and datasets (Dinh et al., 2017), we normalize the estimated sharpness by dividing by the median of the sharpness values of previous iterates. For instance, in Figure 2, despite having a similar sharpness measure, we consider the minimizer around $\theta = 1$ to be sharp relative to the blue plot and flat relative to the red plot. Normalization resolves this issue by giving our sharpness measure scale-invariant properties. One can think of the median of previous sharpness values as a global sharpness parameter the algorithm is trying to learn.
When k is sufficiently large, the variation in the global sharpness parameter among different iterates will be minimal. From an algorithmic perspective, SALR exploits a neighborhood around the current iterate to dynamically compute a desired learning rate while simultaneously exploring the sharpness of the landscape to refine this global sharpness parameter. 
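As a concrete illustration, the loop in Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumed interface: `sharpness_fn` stands in for whatever local sharpness estimator is used (e.g., Definition 1 later in the paper), and plain gradient descent plays the role of the inner local-search method.

```python
import numpy as np

def salr_gd(grad, theta0, sharpness_fn, eta0=0.1, K=100):
    """Sketch of Algorithm 1 with gradient descent as the inner method.

    At every iteration, the step size is eta0 * S_k / median(S_1, ..., S_k),
    so relatively sharp regions get relatively large steps.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    history = []                                  # past sharpness estimates
    for k in range(K):
        s_k = sharpness_fn(theta)                 # estimate local sharpness S_k
        history.append(s_k)
        eta_k = eta0 * s_k / np.median(history)   # normalize by the running median
        theta = theta - eta_k * grad(theta)       # any descent update works here
    return theta
```

Any other inner update (SGD, ADAM) can replace the last line, which is what makes the framework optimizer-agnostic.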

3. SHARPNESS MEASURE

Several sharpness/flatness measures have been defined in the recent literature (Rangamani et al., 2019; Keskar et al., 2017; Hochreiter & Schmidhuber, 1997). For instance, Hochreiter & Schmidhuber (1997) compute flatness by measuring the size of the connected region in parameter space where the objective remains approximately constant. In a more recent paper, Rangamani et al. (2019) propose a scale-invariant flatness measure based on the quotient manifold. Computing such notions for complex non-convex landscapes can be intractable in practice. In addition, Keskar et al. (2017) quantify flatness as the difference between the maximum value of the loss function within a small neighborhood around a given point and its current value. More specifically, they define sharpness as

$$\phi(\varepsilon, \theta) \triangleq \frac{S(\varepsilon, \theta)}{1 + f(\theta)} \quad \text{with} \quad S(\varepsilon, \theta) = \max_{\theta' \in \mathbb{B}_\varepsilon(\theta)} f(\theta') - f(\theta), \qquad (1)$$

where $\mathbb{B}_\varepsilon(\theta)$ is a Euclidean ball with radius $\varepsilon$ centered at $\theta$ and $1 + f(\theta)$ is a normalizing coefficient. One drawback of equation 1 is that the sharpness value around a maximizer is nearly zero. To resolve this issue, one can simply modify the sharpness measure in equation 1 as follows:

$$\bar{S}(\varepsilon, \theta) \triangleq \max_{\theta' \in \mathbb{B}_\varepsilon(\theta)} f(\theta') - \min_{\theta' \in \mathbb{B}_\varepsilon(\theta)} f(\theta'). \qquad (2)$$

It can easily be shown that if $\theta$ is a local minimizer, equation 2 is equivalent to equation 1. Both measures in equations 1 and 2 require solving a possibly non-convex optimization problem, which is in general NP-hard. For computational feasibility, we provide a sharpness approximation obtained by running $n_1$ gradient-descent and $n_2$ gradient-ascent steps; the resulting solutions are used to approximate the minimization and maximization problems. Here we note that our definition of sharpness does not include a normalizing coefficient, as $\mathrm{median}\{S_i\}_{i=1}^{k}$ in Algorithm 1 plays this role. The details of the approximation are given in Definition 1.
Definition 1. Given $\theta_k \in \mathbb{R}^n$, iteration numbers $n_1$ and $n_2$, and step size $\gamma$, we define the sharpness measure

$$S(\theta_k) \triangleq \big[f(\theta_{k,+}^{(n_2)}) - f(\theta_k)\big] + \big[f(\theta_k) - f(\theta_{k,-}^{(n_1)})\big] = f(\theta_{k,+}^{(n_2)}) - f(\theta_{k,-}^{(n_1)}),$$

where $\theta_{k,+}^{(0)} = \theta_{k,-}^{(0)} = \theta_k$,

$$\theta_{k,-}^{(n_1)} = \theta_k - \sum_{i=0}^{n_1-1} \gamma \frac{\nabla f(\theta_{k,-}^{(i)})}{\|\nabla f(\theta_{k,-}^{(i)})\|}, \quad \text{and} \quad \theta_{k,+}^{(n_2)} = \theta_k + \sum_{i=0}^{n_2-1} \gamma \frac{\nabla f(\theta_{k,+}^{(i)})}{\|\nabla f(\theta_{k,+}^{(i)})\|}.$$

Remark 2. In contrast to the measures defined in equations 1 and 2, Definition 1 does not require a ball radius $\varepsilon$. However, our definition requires specifying the step size $\gamma$ and the numbers of ascent and descent iterations.

Remark 3. Running gradient descent/ascent with a fixed step size near a minimizer can return a very small sharpness value even if the minimizer is sharp, due to the small gradient norm around a minimizer. To resolve this issue, we normalize the gradient at every descent/ascent step. Moreover, normalizing by the norm of the gradient makes explicit the radius of the ball containing the iterates $\{\theta_{k,-}^{(j)}\}_{j=1}^{n_1}$ and $\{\theta_{k,+}^{(j)}\}_{j=1}^{n_2}$.

Figure 3 shows the plots of the three sharpness measures defined in this section, computed for the function $f(\theta) = 0.5\,\theta \sin(3\theta) + 1$. Notice that the blue plot, corresponding to the sharpness measure $\phi(\cdot)$, attains a zero value at local maximizers, compared to a positive value for the other two sharpness plots. Moreover, the sharpness value in all three plots is small near the local minimizer. This can be explained by our choice of radius $\varepsilon = 0.1$, which limits the neighborhood being exploited. Increasing the radius for $\phi(\varepsilon, \cdot)$ and $\bar{S}(\varepsilon, \cdot)$ (increasing $n_1$ and $n_2$ for $S$) yields higher values around the minimizer. We next show that, using the sharpness measure in Definition 1, gradient descent within the SALR framework of Algorithm 1, denoted GD-SALR, escapes sharp local minima.
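A minimal sketch of the approximation in Definition 1: run a few normalized gradient-ascent and gradient-descent steps from the current point and take the difference in function values. The callable interface (`f`, `grad`) is our own assumption, not the paper's implementation.

```python
import numpy as np

def sharpness(f, grad, theta, n1=5, n2=5, gamma=0.01, eps=1e-12):
    """Approximate S(theta) = f(theta_+^(n2)) - f(theta_-^(n1)) (Definition 1).

    Gradients are normalized at every step, so the ascent/descent iterates
    stay within a ball of radius max(n1, n2) * gamma around theta.
    """
    up = np.asarray(theta, dtype=float).copy()
    for _ in range(n2):                                   # ascend toward a local max
        g = grad(up)
        up = up + gamma * g / (np.linalg.norm(g) + eps)
    down = np.asarray(theta, dtype=float).copy()
    for _ in range(n1):                                   # descend toward a local min
        g = grad(down)
        down = down - gamma * g / (np.linalg.norm(g) + eps)
    return f(up) - f(down)
```

On two quadratics $a\theta^2$ with different curvatures $a$, the normalized steps trace out the same neighborhood, so the returned value scales with the curvature — exactly the scale sensitivity that the median normalization in Algorithm 1 is meant to cancel.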

4. THEORETICAL RESULTS

In this section, we focus on analyzing the convergence of vanilla GD when adopting the sharpness-aware learning rate framework in Algorithm 1. We show that GD-SALR escapes any given neighborhood of a sharp local minimum by choosing a sufficiently large step size. Throughout this section, we make the following assumptions, which are standard in the convergence theory of gradient descent methods.

Assumption 1. The objective function $f$ is twice continuously differentiable and $L_0$-Lipschitz continuous. The gradient $\nabla f(\cdot)$ is Lipschitz continuous with Lipschitz constant $L$. Furthermore, the gradient norm is bounded, i.e. there exists a scalar constant $g_{\max} > 0$ such that $\|\nabla f(\theta)\| \leq g_{\max}$ for all $\theta$.

The next theorem shows that GD with a sufficiently large step size escapes a given strongly convex neighborhood around a local minimum $\theta^*$.

Theorem 4. Suppose that $f$ is $\mu$-strongly convex in a neighborhood $\mathbb{B}_\delta(\theta^*)$ around a local minimum $\theta^*$, i.e. $\lambda_{\min}(\nabla^2 f(\theta)) \geq \mu$ for all $\theta \in \mathbb{B}_\delta(\theta^*) \triangleq \{\theta \mid \|\theta - \theta^*\|_2 \leq \delta\}$. Running vanilla GD with $\theta_0 \in \mathbb{B}_\delta(\theta^*)$ and learning rate $\eta_k \geq \frac{2 + \varepsilon}{\mu}$ for some fixed $\varepsilon > 0$, there exists $k$ with $\theta_k \notin \mathbb{B}_\delta(\theta^*)$.

Proof. The proof of the theorem is relegated to Appendix B.

Our next result shows that GD-SALR escapes sharp local minima by dynamically choosing a sufficiently large step size in a locally strongly convex region.

Theorem 5. Suppose that $f$ is $\mu$-strongly convex in a neighborhood $\mathbb{B}_\delta(\theta^*)$ around a local minimum $\theta^*$, i.e. $\lambda_{\min}(\nabla^2 f(\theta)) \geq \mu$ for all $\theta \in \mathbb{B}_\delta(\theta^*) \triangleq \{\theta \mid \|\theta - \theta^*\|_2 \leq \delta\}$.
Under Assumption 1, run GD-SALR (gradient descent with the step-size choice of Algorithm 1 and Definition 1) with

$$n_1 \geq \frac{a_1}{\left[\log\!\left(1 + \frac{\mu\, g_{\min}}{L g_{\max} - \mu\, g_{\min}}\right)\right]^2} - \frac{1}{a_1}, \qquad n_2 \geq \frac{a_2}{\left[\log\!\left(1 + \frac{\mu\, g_{\min}}{L g_{\max}}\right)\right]^2} - \frac{1}{a_2}, \qquad \gamma = \frac{g_{\min}}{L},$$

where

$$a_1 = \frac{(2 + \varepsilon) L_0}{\eta_0 (g_{\max} L - g_{\min}\mu)} + \frac{g_{\min}\mu}{2(g_{\max} L - g_{\min}\mu)}, \qquad a_2 = \frac{(2 + \varepsilon) L_0}{\eta_0\, g_{\max} L} + \frac{\mu^2 g_{\min}}{2 L^2 g_{\max}},$$

$\eta_0 > 0$, and $g_{\min} > 0$ is a lower bound satisfying $\min\big\{\|\nabla f(\theta_{k,-}^{(n_1 - 1)})\|, \|\nabla f(\theta_k)\|\big\} \geq g_{\min}$. If $\delta > \max\{n_1, n_2\}\gamma$, then there exists $k$ with $\theta_k \notin \mathbb{B}_\delta(\theta^*)$.

Proof. The proof of the theorem is relegated to Appendix C.

Remark 6. In the context of machine learning, our theorem shows that our algorithm can potentially escape sharp regions even when all the data are used (full batch). Patel (2017) shows that larger batch sizes require a higher learning rate to escape sharp minima. This provides insight into the favorable empirical results presented in Section 6 when running SGD-SALR. Moreover, our dynamic choice of a high learning rate in sharp regions can potentially allow running SGD with larger batch sizes while still escaping sharp minimizers, which in turn provides an avenue for improved parallelism (Dean et al., 2012; Das et al., 2016). In the next section, we generalize our proposed framework to the stochastic setting.
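Theorem 4 can be checked numerically on the simplest strongly convex example. The sketch below (our own toy setup, not from the paper) runs GD on $f(\theta) = \frac{\mu}{2}\theta^2$, whose Hessian is exactly $\mu$, with the prescribed step size $\eta = (2 + \varepsilon)/\mu$; each step multiplies the distance to the minimizer by $|1 - \eta\mu| = 1 + \varepsilon$, so the iterate leaves any ball $\mathbb{B}_\delta(0)$ after $O(\log(\delta/|\theta_0|))$ steps.

```python
mu, eps_, delta = 2.0, 0.5, 1.0
eta = (2.0 + eps_) / mu          # Theorem 4's step-size threshold
theta = 1e-3                     # start deep inside B_delta(0)
steps = 0
while abs(theta) <= delta:
    theta = theta - eta * mu * theta   # gradient of (mu/2) * theta^2 is mu * theta
    steps += 1
# |theta_k| grows like (1 + eps_)^k = 1.5^k, so escape takes
# roughly log(delta / 1e-3) / log(1.5) iterations
```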

5. STOCHASTIC APPROXIMATION OF SHARPNESS

The concept of generalization is most relevant in machine learning settings. Under the empirical risk minimization framework, training a machine learning model can be formulated as the optimization problem

$$\min_{\theta \in \mathbb{R}^n} f(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} f_i(\theta),$$

where $f_i$ is the loss, parameterized by $\theta$, on data point $i \in \{1, 2, \ldots, m\}$. The most popular algorithm for solving such optimization problems is stochastic gradient descent, which iteratively updates the parameters via

$$\theta_{k+1} = \theta_k - \eta_k \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(\theta_k),$$

where $B_k$ is the batch sampled at iteration $k$ and $\eta_k$ is the learning rate. To apply our framework in stochastic settings, we provide a stochastic procedure for computing the sharpness measure at a given iterate; details are provided in Algorithm 2. By adopting Algorithm 2, our framework can be applied to numerous popular algorithms such as SGD, ADAM, and Entropy-SGD. Detailed implementations of SGD-SALR and ADAM-SALR can be found in Algorithms 3 and 4 in Appendix A.

Algorithm 2: Calculating the stochastic sharpness at iteration $k$
Data: batch $B_k$, base learning rate $\gamma$, current iterate $\theta_k$, iteration numbers $n_1$, $n_2$.
Set $\theta_{k,+}^{(0)} = \theta_{k,-}^{(0)} = \theta_k$;
for $i = 0 : n_1 - 1$ do
    $\theta_{k,-}^{(i+1)} = \theta_{k,-}^{(i)} - \gamma \frac{1}{|B_k|} \sum_{j \in B_k} \frac{\nabla f_j(\theta_{k,-}^{(i)})}{\|\nabla f_j(\theta_{k,-}^{(i)})\|}$;
end
for $i = 0 : n_2 - 1$ do
    $\theta_{k,+}^{(i+1)} = \theta_{k,+}^{(i)} + \gamma \frac{1}{|B_k|} \sum_{j \in B_k} \frac{\nabla f_j(\theta_{k,+}^{(i)})}{\|\nabla f_j(\theta_{k,+}^{(i)})\|}$;
end
Return $\hat{S}_k = \frac{1}{|B_k|} \sum_{j \in B_k} f_j(\theta_{k,+}^{(n_2)}) - \frac{1}{|B_k|} \sum_{j \in B_k} f_j(\theta_{k,-}^{(n_1)})$;
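A sketch of Algorithm 2 in NumPy. The per-example loss/gradient interface (`loss_i(theta, j)`, `grad_i(theta, j)`) is a hypothetical stand-in for a framework's per-sample autograd; as in the algorithm, each per-example gradient is normalized before averaging over the batch.

```python
import numpy as np

def stochastic_sharpness(loss_i, grad_i, theta, batch,
                         gamma=0.002, n1=5, n2=5, eps=1e-12):
    """Estimate the sharpness at theta from a single minibatch (Algorithm 2)."""
    def step_dir(t):
        # average of normalized per-example gradients over the batch
        gs = [grad_i(t, j) for j in batch]
        return np.mean([g / (np.linalg.norm(g) + eps) for g in gs], axis=0)

    down = np.asarray(theta, dtype=float).copy()
    for _ in range(n1):                       # descent leg
        down = down - gamma * step_dir(down)
    up = np.asarray(theta, dtype=float).copy()
    for _ in range(n2):                       # ascent leg
        up = up + gamma * step_dir(up)
    batch_loss = lambda t: np.mean([loss_i(t, j) for j in batch])
    return batch_loss(up) - batch_loss(down)  # the estimate S_hat_k
```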

6. EMPIRICAL RESULTS

In this section, we present experimental results on image classification and text prediction datasets. We show that our SALR framework can be adopted by many optimization methods and achieves notable improvements over a broad range of networks. We compare SALR with Entropy-SGD (Chaudhari et al., 2019) and SWA (Izmailov et al., 2018). Besides those benchmarks, we also use conventional SGD and ADAM (Kingma & Ba, 2015) as baseline references. All aforementioned methods are trained with batch normalization (Ioffe & Szegedy, 2015) and dropout with probability 0.5 after each layer (Gal & Ghahramani, 2016). We replicate each experiment 30 times to obtain the mean and standard deviation of the testing errors. In our experiments, we do not tune any hyperparameters. We consider some typical networks such as mnistfc (Ioffe & Szegedy, 2015), ResNet (He et al., 2016), DenseNet (Iandola et al., 2014), MobileNetV2 (Sandler et al., 2018), and RegNetX (Radosavovic et al., 2020).

6.1. MNIST/CIFAR-10

We run Algorithm 3 (SGD-SALR) for 20 epochs. We collect the sharpness measure every c = 2 iterations and set n_1 = n_2 = 5 as detailed in Algorithm 2. The experimental settings for the other benchmark models are as follows: (1) SGD: we run SGD for 100 epochs using decaying learning rates. (2) SWA: the setting is the same as SGD; in the SWA stage, we switch to a cyclic learning rate schedule as suggested in Izmailov et al. (2018). (3) Entropy-SGD: following the setting in Chaudhari et al. (2019), we train Entropy-SGD for 20 epochs and set the number of Langevin iterations to L = 5. (4) Entropy-SGD-SALR: the setting is the same as Entropy-SGD; however, we update the learning rate of Entropy-SGD using Algorithm 2. Further details on each benchmark setting can be found in Appendix D. The results are reported in Table 1.

Table 3: Classification accuracy on CIFAR-10 using ADAM

6.2. TEXT PREDICTION

We train an LSTM network on the Penn Tree Bank (PTB) dataset for word-level text prediction. This dataset contains about one million words. Following the guidelines in [1] and [2], we train PTB-LSTM with 66 million weights. SGD and SWA are trained for 55 epochs. Entropy-SGD (L = 5) and SALR (c = 2) are trained for 11 epochs. Overall, all methods use the same number of gradient calls (i.e., comparable wall-clock time). We then train an LSTM to perform character-level text prediction on War and Peace (WP), following the procedures in [2] and [3]. We train ADAM/SWA and Entropy-SGD/SALR for 50 and 10 epochs, respectively. We report the perplexity on the test set in Table 5.

6.3. ANALYSIS

Based on Tables 1-5, we can draw some important insights. First, methods adopting SALR show superior performance over their benchmarks. This increase in classification accuracy (or decrease in perplexity) is consistent across datasets and a range of network structures. We also observe that SGD-SALR tends to outperform the other SALR-based methods in most settings while achieving comparable results in the rest. Second, and more interestingly, this superior performance is achieved with five times fewer epochs than SGD, ADAM, and SWA. The caveat, however, is that SALR and Entropy-SGD respectively require (n_1 + n_2)/c = 5 and L = 5 additional gradient calls at each iteration, making their total computational cost the same as that of ADAM, SGD, and SWA. Third, as shown in Table 4, SALR drives solutions to significantly flatter regions. This highlights the effectiveness of dynamically adjusting learning rates based on the relative sharpness of the current iterate. To further demonstrate the advantage of the SALR framework, we plot the testing error curves for SGD, Entropy-SGD, Entropy-SALR, and SALR in Figure 4 (Left). Interestingly, the overall trend when adding SALR to SGD changes drastically, with performance increasing smoothly and more consistently. This can potentially be explained by the capability of SALR to quickly escape sharp regions relative to SGD, and hence attain larger and more consistent rates of improvement across epochs. We also plot the change of sharpness/learning rates over some SALR iterations in the last epoch in Figure 4 (Right). This figure highlights the dynamics of both the learning rate and the sharpness, showing that the median sharpness tends to stabilize, leading to a proportional relationship between the learning rate and the sharpness.

7. DISCUSSION & OPEN PROBLEMS

In this paper we introduce SALR: an optimization tool that aims to recover flat minima by dynamically updating the learning rate based on the current solution's relative sharpness. SALR can be readily plugged into any gradient-based method. Experiments show that SALR delivers promising improvements over a range of optimization methods and network structures. In light of this work, we hope researchers further investigate landscape-dependent learning rates, as they offer a potential alternative or unifying framework for many of the aforementioned attempts to achieve improved generalization. Also, quasi-Newton approximations of $S_k$ instead of first-order methods, as well as generalization bounds for SALR, remain open problems worth investigating.

B PROOF OF THEOREM 4

Proof. According to the update rule of vanilla gradient descent and the mean value theorem, there exists $\tilde{\theta} \in [\theta_k, \theta^*]$ such that

$$\theta_{k+1} - \theta^* = \theta_k - \theta^* - \eta_k \nabla f(\theta_k) = \theta_k - \theta^* - \eta_k \big(\nabla f(\theta^*) + \nabla^2 f(\tilde{\theta})(\theta_k - \theta^*)\big) = \big(I - \eta_k \nabla^2 f(\tilde{\theta})\big)(\theta_k - \theta^*),$$

where the last equality holds since $\theta^*$ is a local minimum (so $\nabla f(\theta^*) = 0$). Taking norms,

$$\|\theta_{k+1} - \theta^*\| = \big\|\big(I - \eta_k \nabla^2 f(\tilde{\theta})\big)(\theta_k - \theta^*)\big\| \geq |1 - \eta_k \mu| \, \|\theta_k - \theta^*\|,$$

where the inequality holds by our local strong convexity assumption, the fact that $\tilde{\theta} \in \mathbb{B}_\delta(\theta^*)$, and our choice of $\eta_k$. The choice $\eta_k \geq \frac{2 + \varepsilon}{\mu}$ also implies

$$\|\theta_{k+1} - \theta^*\| \geq (\eta_k \mu - 1)\|\theta_k - \theta^*\| \geq (1 + \varepsilon)\|\theta_k - \theta^*\|,$$

which yields $\|\theta_k - \theta^*\| \geq (1 + \varepsilon)^k \|\theta_0 - \theta^*\|$. Let $\|\theta_0 - \theta^*\| = D$ and $k = \left\lceil \frac{1}{\log(1 + \varepsilon)} \log \frac{\delta}{D} \right\rceil$; then $\|\theta_k - \theta^*\| \geq \delta$, which completes the proof.

C PROOF OF THEOREM 5

Proof. According to Theorem 4, running vanilla gradient descent with $\eta_k \geq \frac{2 + \varepsilon}{\mu}$ for some fixed $\varepsilon > 0$ escapes the neighborhood $\mathbb{B}_\delta(\theta^*)$. Hence, to complete the proof, it suffices to show that GD-SALR dynamically chooses a sufficiently large step size. We first compute a lower bound for our local sharpness approximation in locally strongly convex regions. By definition,

$$S_k = f(\theta_{k,+}^{(n_2)}) - f(\theta_{k,-}^{(n_1)}) = \big[f(\theta_{k,+}^{(n_2)}) - f(\theta_k)\big] + \big[f(\theta_k) - f(\theta_{k,-}^{(n_1)})\big]. \qquad (3)$$

We start with a lower bound for $f(\theta_k) - f(\theta_{k,-}^{(n_1)})$. By the descent lemma (Bertsekas, 1997),

$$f(\theta_{k,-}^{(i+1)}) \leq f(\theta_{k,-}^{(i)}) + \big\langle \nabla f(\theta_{k,-}^{(i)}),\, \theta_{k,-}^{(i+1)} - \theta_{k,-}^{(i)} \big\rangle + \frac{L}{2}\big\|\theta_{k,-}^{(i+1)} - \theta_{k,-}^{(i)}\big\|^2 = f(\theta_{k,-}^{(i)}) - \gamma \big\|\nabla f(\theta_{k,-}^{(i)})\big\| + \frac{L\gamma^2}{2}.$$

Summing over the $n_1$ iterations,

$$f(\theta_k) - f(\theta_{k,-}^{(n_1)}) \geq \gamma \sum_{i=0}^{n_1-1} \big\|\nabla f(\theta_{k,-}^{(i)})\big\| - \frac{n_1 L \gamma^2}{2}. \qquad (4)$$

By the mean value theorem, there exists $z_{k,-}^{(i)} \in [\theta_{k,-}^{(i)}, \theta_{k,-}^{(i+1)}]$ with

$$\nabla f(\theta_{k,-}^{(i+1)}) = \nabla f(\theta_{k,-}^{(i)}) + \nabla^2 f(z_{k,-}^{(i)})\big(\theta_{k,-}^{(i+1)} - \theta_{k,-}^{(i)}\big) = \nabla f(\theta_{k,-}^{(i)}) - \gamma \nabla^2 f(z_{k,-}^{(i)}) \frac{\nabla f(\theta_{k,-}^{(i)})}{\|\nabla f(\theta_{k,-}^{(i)})\|},$$

which yields

$$\big\|\nabla f(\theta_{k,-}^{(i+1)})\big\| \leq \big(1 - \gamma\mu/g_{\max}\big)\big\|\nabla f(\theta_{k,-}^{(i)})\big\| = \frac{L g_{\max} - \mu g_{\min}}{L g_{\max}} \big\|\nabla f(\theta_{k,-}^{(i)})\big\|.$$

Here the inequality holds by local strong convexity, the fact that $z_{k,-}^{(i)} \in \mathbb{B}_\delta(\theta^*)$ due to our choice of $\delta$, the upper bound on the gradient norm, and our choice of $\gamma$. Substituting back into equation 4,

$$f(\theta_k) - f(\theta_{k,-}^{(n_1)}) \geq \gamma \sum_{i=0}^{n_1-1} \left(\frac{L g_{\max} - \mu g_{\min}}{L g_{\max}}\right)^{-i} \big\|\nabla f(\theta_{k,-}^{(n_1-1)})\big\| - \frac{n_1 L \gamma^2}{2} = \left(\frac{g_{\max}}{\mu} - \frac{g_{\min}}{L}\right)\left[\left(\frac{L g_{\max}}{L g_{\max} - \mu g_{\min}}\right)^{n_1} - 1\right]\big\|\nabla f(\theta_{k,-}^{(n_1-1)})\big\| - \frac{n_1 L \gamma^2}{2}. \qquad (5)$$

By our choice of $n_1$, we have

$$\left(1 + \frac{\mu g_{\min}}{L g_{\max} - \mu g_{\min}}\right)^{n_1} - 1 \geq n_1 a_1. \qquad (6)$$

The inequality holds since

$$\log\left(1 + \frac{\mu g_{\min}}{L g_{\max} - \mu g_{\min}}\right)^{n_1} = n_1 \log\left(1 + \frac{\mu g_{\min}}{L g_{\max} - \mu g_{\min}}\right) \geq \frac{n_1 a_1}{\sqrt{n_1 a_1 + 1}} \geq \log(1 + n_1 a_1),$$

where the first inequality holds by our choice of $n_1$ and the second uses the bound $\log(1 + x) \leq x/\sqrt{x + 1}$.

By substituting equation 6 into equation 5 and using our assumption that $\min\{\|\nabla f(\theta_{k,-}^{(n_1-1)})\|, \|\nabla f(\theta_{k,+}^{(0)})\|\} \geq g_{\min}$, we get

$$f(\theta_k) - f(\theta_{k,-}^{(n_1)}) \geq \left(\frac{g_{\max}}{\mu} - \frac{g_{\min}}{L}\right) n_1 a_1 g_{\min} - \frac{n_1 g_{\min}^2}{2L} \geq n_1 \frac{(2 + \varepsilon) L_0}{\mu \eta_0} \frac{g_{\min}}{L}. \qquad (7)$$

We now compute a lower bound for $f(\theta_{k,+}^{(n_2)}) - f(\theta_k)$. By the local strong convexity of $f$,

$$f(\theta_{k,+}^{(i+1)}) \geq f(\theta_{k,+}^{(i)}) + \big\langle \nabla f(\theta_{k,+}^{(i)}),\, \theta_{k,+}^{(i+1)} - \theta_{k,+}^{(i)} \big\rangle + \frac{\mu}{2}\big\|\theta_{k,+}^{(i+1)} - \theta_{k,+}^{(i)}\big\|^2 = f(\theta_{k,+}^{(i)}) + \gamma\big\|\nabla f(\theta_{k,+}^{(i)})\big\| + \frac{\mu \gamma^2}{2}.$$

Summing over the $n_2$ iterations,

$$f(\theta_{k,+}^{(n_2)}) - f(\theta_k) \geq \gamma \sum_{i=0}^{n_2-1} \big\|\nabla f(\theta_{k,+}^{(i)})\big\| + \frac{n_2 \mu \gamma^2}{2}. \qquad (8)$$

By the mean value theorem, there exists $z_{k,+}^{(i)} \in [\theta_{k,+}^{(i)}, \theta_{k,+}^{(i+1)}]$ with $\|\nabla f(\theta_{k,+}^{(i+1)})\| \geq (1 + \gamma\mu/g_{\max})\|\nabla f(\theta_{k,+}^{(i)})\|$, where the inequality holds by local strong convexity, the fact that $z_{k,+}^{(i)} \in \mathbb{B}_\delta(\theta^*)$, the upper bound on the gradient norm, and our choice of $\gamma$. Substituting back into equation 8,

$$f(\theta_{k,+}^{(n_2)}) - f(\theta_k) \geq \gamma \sum_{i=0}^{n_2-1} \big(1 + \gamma\mu/g_{\max}\big)^i \big\|\nabla f(\theta_{k,+}^{(0)})\big\| + \frac{n_2 \mu \gamma^2}{2} = \frac{g_{\max}}{\mu}\left[\left(1 + \frac{\mu g_{\min}}{L g_{\max}}\right)^{n_2} - 1\right]\big\|\nabla f(\theta_{k,+}^{(0)})\big\| + \frac{n_2 \mu \gamma^2}{2}. \qquad (9)$$

By our choice of $n_2$, we have

$$\left(1 + \frac{\mu g_{\min}}{L g_{\max}}\right)^{n_2} - 1 \geq n_2 a_2. \qquad (10)$$

The inequality holds since

$$\log\left(1 + \frac{\mu g_{\min}}{L g_{\max}}\right)^{n_2} = n_2 \log\left(1 + \frac{\mu g_{\min}}{L g_{\max}}\right) \geq \frac{n_2 a_2}{\sqrt{n_2 a_2 + 1}} \geq \log(1 + n_2 a_2),$$

where the first inequality holds by our choice of $n_2$ and the second uses the bound $\log(1 + x) \leq x/\sqrt{x + 1}$. By substituting equation 10 into equation 9 and using the lower bound $\|\nabla f(\theta_{k,+}^{(0)})\| \geq g_{\min}$, we get

$$f(\theta_{k,+}^{(n_2)}) - f(\theta_k) \geq \frac{g_{\max}}{\mu} n_2 a_2 g_{\min} + \frac{n_2 \mu g_{\min}^2}{2L^2} \geq n_2 \frac{(2 + \varepsilon) L_0}{\mu \eta_0} \frac{g_{\min}}{L}. \qquad (11)$$

By adding equation 7 and equation 11, we obtain

$$S_k \geq (n_1 + n_2) \frac{(2 + \varepsilon) L_0}{\mu \eta_0} \frac{g_{\min}}{L}. \qquad (12)$$

We now provide an upper bound for $\bar{S} = \mathrm{Median}\{S_i\}$. Using the Lipschitz continuity of $f$,

$$f(\theta_k) - f(\theta_{k,-}^{(n_1)}) = \sum_{i=0}^{n_1-1} \big[f(\theta_{k,-}^{(i)}) - f(\theta_{k,-}^{(i+1)})\big] \leq L_0 \sum_{i=0}^{n_1-1} \big\|\theta_{k,-}^{(i)} - \theta_{k,-}^{(i+1)}\big\| = n_1 L_0 \gamma, \qquad (13)$$

$$f(\theta_{k,+}^{(n_2)}) - f(\theta_k) = \sum_{i=0}^{n_2-1} \big[f(\theta_{k,+}^{(i+1)}) - f(\theta_{k,+}^{(i)})\big] \leq L_0 \sum_{i=0}^{n_2-1} \big\|\theta_{k,+}^{(i+1)} - \theta_{k,+}^{(i)}\big\| = n_2 L_0 \gamma. \qquad (14)$$

Combining equations 13 and 14,

$$\bar{S} = \mathrm{Median}\{S_i\} \leq (n_1 + n_2) L_0 \gamma = (n_1 + n_2) \frac{g_{\min} L_0}{L}. \qquad (15)$$

According to the definition of our learning rate, combining equations 12 and 15 results in the following inequality:

$$\eta_k = \eta_0 \frac{S_k}{\bar{S}} \geq \frac{2 + \varepsilon}{\mu}. \qquad (16)$$

The proof is concluded using Theorem 4.



Figure 2: Sharp/Flat minimizers relative to the landscape.

Figure 3: Plots of the sharpness measures $\phi(\varepsilon, \cdot)$, $\bar{S}(\varepsilon, \cdot)$, and $S(\cdot)$ on the function $f$.

Figure 4: (Left) All-CNN-BN: Change of Testing Errors over Epochs. (Right) Change of sharpness and learning rate over iterations.

Classification accuracy on MNIST

We increase the number of training epochs for Entropy-SGD/Entropy-SGD-SALR/SGD-SALR to 40 and the training epochs of SGD/SWA to 200. Experimental results are reported in Table 2.

Classification accuracy on CIFAR-10

To illustrate the flexibility of our framework, we change the base optimizer from SGD to ADAM and re-run all the experiments under a setting similar to that of Table 2. Results are reported in Table 3. Finally, in Table 4, we report the sharpness measure of the final solution obtained by each optimization approach. We also conduct some experiments on CIFAR-100; results are deferred to the Appendix.

Sharpness of final solutions (CIFAR-10, SGD)

Perplexity on PTB/WP

A EXAMPLES OF SALR-BASED ALGORITHMS

The SALR framework can be fitted into many optimization algorithms; in this section we provide some examples. In Algorithms 3 and 4, we present the SGD-SALR algorithm and the ADAM-SALR algorithm, respectively.

Algorithm 3: The SGD-SALR Algorithm
Data: base learning rate η_0, number of iterations K, frequency c, initial weight θ_0.
Result: weight vector θ.

Algorithm 4: The ADAM-SALR Algorithm
Data: base learning rate η_0, exponential decay rates for the moment estimates β_1, β_2 ∈ [0, 1), number of iterations K, frequency c, initial weight θ_0, perturbation.
Result: weight vector θ.
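A sketch of how Algorithm 3 (SGD-SALR) composes the pieces: plain SGD whose learning rate is refreshed every c iterations from the stochastic sharpness estimate, normalized by the running median. `grad_batch` and `sharpness_batch` are assumed helpers (the latter corresponding to Algorithm 2), not the paper's exact interface.

```python
import numpy as np

def sgd_salr(grad_batch, sharpness_batch, theta0, batches, eta0=0.01, c=2):
    """SGD with a sharpness-aware learning rate (sketch of Algorithm 3)."""
    theta = np.asarray(theta0, dtype=float).copy()
    eta, history = eta0, []
    for k, B in enumerate(batches):
        if k % c == 0:                     # refresh the rate every c iterations
            history.append(sharpness_batch(theta, B))
            eta = eta0 * history[-1] / np.median(history)
        theta = theta - eta * grad_batch(theta, B)
    return theta
```

ADAM-SALR (Algorithm 4) would keep the same rate-refresh logic and swap the last line for an ADAM update.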

D EXPERIMENTAL SETTING

In this section, we provide the detailed experimental settings. All methods are trained with batch normalization and dropout with probability 0.5 after each layer. The batch size is 128 and the base learning rate is 0.01. Both the batch size and the learning rate can be adjusted, but their ratio should remain constant, as suggested in Smith & Le (2017) and Smith et al. (2018).

D.1 MNIST

1. SGD: we train SGD for 100 epochs with a learning rate of 0.01 that drops by a factor of 10 every 30 epochs.
2. SWA: in the first 75 epochs, we run regular SGD. We then switch to a cyclic learning rate schedule with α_1 = 5 × 10^-3 and α_2 = 1 × 10^-4, where α_1 is the initial learning rate within a cycle and α_2 is the ending learning rate within a cycle.
3. Entropy-SGD: we train Entropy-SGD for 20 epochs with L = 5. The learning rate for the stochastic gradient Langevin dynamics (SGLD) is set to 0.1 with thermal noise 10^-4. The initial value of the scope is set to 0.03 and increases by a factor of 1.001 after each parameter update.
4. Entropy-SGD-SALR: the learning rate for Entropy-SGD is updated based on Algorithm 2 (c = 2, γ = 0.002).
5. SGD-SALR: we use a base learning rate of 0.01 and set c = 2, γ = 0.002.
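The two learning-rate schedules referenced above can be written down directly. This is a sketch of the common forms; the exact cyclic shape used in the SWA stage may differ from this linear decay.

```python
def step_decay_lr(epoch, base_lr=0.01, drop=10.0, every=30):
    """SGD baseline schedule: divide the rate by `drop` every `every` epochs."""
    return base_lr / (drop ** (epoch // every))

def cyclic_lr(t, alpha1=5e-3, alpha2=1e-4, cycle_len=1.0):
    """SWA-stage schedule: decay linearly from alpha1 to alpha2 within each cycle."""
    frac = (t % cycle_len) / cycle_len   # position within the current cycle
    return (1.0 - frac) * alpha1 + frac * alpha2
```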

D.2 CIFAR-10

1. SGD: we train SGD for 200 epochs with a learning rate of 0.01 that drops by a factor of 10 every 30 epochs.
2. SWA: in the first 150 epochs, we run regular SGD. We then switch to a cyclic learning rate schedule with α_1 = 5 × 10^-3 and α_2 = 1 × 10^-4.
3. Entropy-SGD: we train Entropy-SGD for 40 epochs with L = 5. The learning rate for the SGLD is set to 0.1 with thermal noise 10^-4. The initial value of the scope is set to 0.03 and increases by a factor of 1.001 after each parameter update.
4. Entropy-SGD-SALR: the learning rate for Entropy-SGD is updated based on Algorithm 2 (c = 2, γ = 0.002).
5. SGD-SALR: we use a base learning rate of 0.01 and set c = 2, γ = 0.002.

E MORE EXPERIMENTS

In this section, we add more experimental results to illustrate the performance of SALR. 1. We train an LSTM network on the Penn Tree Bank (PTB) dataset for word-level text prediction, following the guideline in Zaremba et al. (2014)

