LARGE LEARNING RATE MATTERS FOR NON-CONVEX OPTIMIZATION

Abstract

When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of experiments that stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects also in the noiseless case, i.e., for full-batch GD. We formally prove that GD with a large step size, on certain non-convex function classes, follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Finally, we also demonstrate the difference between the trajectories for small and large learning rates on real neural networks, again observing that large learning rates allow escaping from a local minimum, confirming that this behavior is indeed relevant in practice.

1. INTRODUCTION

While using variants of gradient descent (GD), namely stochastic gradient descent (SGD), has become standard for optimizing neural networks, the reasons behind their success and the effects of various hyperparameters are not yet fully understood. One example is the practical observation that using a large learning rate in the initial phase of training is necessary for obtaining well-performing models (Li et al., 2019). Though this behavior has been widely observed in practice, it is not fully captured by existing theoretical frameworks. Recent investigations of SGD's success (Kleinberg et al., 2018; Pesme et al., 2021) have focused on understanding the implicit bias induced by the stochasticity. Note that the effective variance of the trajectory due to the stochasticity of the gradient is moderated by the learning rate (see Appendix F for more intuition). Therefore, using a larger learning rate amplifies the stochasticity and the implicit bias it induces, which provides one possible explanation for the need for larger learning rates. We show that this explanation is incomplete by demonstrating cases where stochasticity of arbitrary magnitude combined with a small learning rate cannot guarantee convergence to a global minimum, whereas a large learning rate can. Furthermore, we provide a practical method to increase stochasticity without changing the learning rate when training neural networks, and observe that the increased stochasticity cannot replace the effects of a large learning rate. Therefore, it is important to study how a larger learning rate affects the trajectory beyond increasing the stochasticity. To that end, in this work we show that randomly initialized gradient descent with a high learning rate provably escapes local minima and converges to the global minimum over a class of non-convex functions. In contrast, when using a small learning rate, GD over these functions can converge to a local minimum instead.
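The moderation of the effective noise by the learning rate can be made concrete with a small simulation of our own construction (not taken from the paper): for SGD on a one-dimensional quadratic with additive Gaussian gradient noise, the stationary variance of the iterates scales roughly linearly with the learning rate.

```python
import random

def sgd_variance(lr, curvature=1.0, noise_std=1.0, steps=200_000, seed=0):
    """Empirical stationary variance of SGD iterates on f(x) = 0.5 * curvature * x^2
    with additive Gaussian gradient noise (a toy model, not the paper's setting)."""
    rng = random.Random(seed)
    x, second_moment = 0.0, 0.0
    for _ in range(steps):
        g = curvature * x + noise_std * rng.gauss(0.0, 1.0)  # noisy gradient
        x -= lr * g
        second_moment += x * x
    return second_moment / steps

# stationary variance is approximately lr * noise_std**2 / (2 * curvature),
# so a 10x larger learning rate yields roughly 10x larger trajectory variance
var_small = sgd_variance(lr=0.01)
var_large = sgd_variance(lr=0.1)
```

This is the sense in which a larger learning rate amplifies the stochasticity: the same gradient noise produces a proportionally wider stationary distribution around a minimum.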
We note that for brevity, we focus our results on full-batch GD. We further show the positive effect of using a high learning rate for increasing the chance of completely avoiding undesirable regions of the landscape, such as a local minimum. Note that this behavior does not occur when using the continuous version of GD, i.e., gradient flow, which corresponds to using infinitesimal step sizes. The difference remains even after adding the implicit regularization term identified in (Smith et al., 2021) in order to bring the trajectories of gradient flow and gradient descent closer. We note that throughout the paper, we sometimes abuse the terms "global" and "local" minimum to refer to desirable and undesirable minima, respectively. For example, when discussing generalization, a desirable minimum might not have the lowest objective value but may enjoy properties such as flatness. Finally, to show the relevance of our theoretical results in practice, we demonstrate that similar effects can occur in neural network training by showing evidence of an escape from a local minimum when applying GD with a high learning rate to a commonly used neural network architecture. Our observations signify the importance of considering the effects of high learning rates for understanding the success of GD. Overall, our contributions can be summarized as follows:

• Demonstrating the exclusive effects of large learning rates even in the stochastic setting, both in theory and in practice, showing that they cannot be reproduced by increasing stochasticity and establishing the importance of analyzing them.

• Capturing the distinct trajectories of large-learning-rate GD and small-learning-rate GD in theory on a class of functions, demonstrating the empowering effect of a large learning rate for escaping from local minima.

• Providing experimental evidence that gradient descent escapes from local minima in neural network training when using a large learning rate, establishing the relevance of our theoretical results in practice.
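As a toy illustration of the escape phenomenon discussed above (a construction of ours, not the function class analyzed in the paper), consider a one-dimensional landscape with a sharp local minimum next to a wide, deeper global minimum. A step size above the stability limit of the sharp basin overshoots it and settles in the wide basin, while a small step size converges to the local minimum.

```python
import math

def grad(x):
    # gradient of a toy landscape (hypothetical): a sharp well at x = +1
    # (curvature ~100) next to a wide, deeper well at x = -1 (curvature ~8)
    g_sharp = math.exp(-(x - 1.0) ** 2 / 0.02) * (x - 1.0) / 0.01
    g_wide = 8.0 * (x + 1.0) * math.exp(-(x + 1.0) ** 2 / 0.5)
    return g_sharp + g_wide

def run_gd(lr, x0=1.05, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_small = run_gd(lr=0.005)  # below 2/100: settles into the sharp local minimum near +1
x_large = run_gd(lr=0.1)    # above 2/100 but below 2/8: overshoots the sharp basin
                            # and converges to the wide global minimum near -1
```

The two step sizes follow genuinely different trajectories on the same deterministic landscape, with no noise involved, which mirrors the full-batch setting studied in the paper.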

2. RELATED WORK

Extensive literature exists on the effect of stochastic noise on the convergence of GD. Several works have focused on the smoothing effect of injected noise (Chaudhari et al., 2017; Kleinberg et al., 2018; Orvieto et al., 2022; Wang et al., 2021a). In (Vardhan & Stich, 2022) it has been shown that by perturbing the parameters at every step (called perturbed GD) it is possible to converge to the minimum of a function f while receiving gradients of f + g, assuming certain bounds on g. Other works use different models for the stochastic noise in SGD and use them to obtain convergence bounds or to show that SGD prefers certain types of minima, usually flat ones (Wu et al., 2018; Xie et al., 2021). In order to better understand the effect of various hyperparameters on convergence, Jastrzebski et al. (2018; 2019) show that the learning rate (and its ratio to the batch size) plays an important role in determining the minima found by SGD. In (Pesme et al., 2021) it was shown that SGD has an implicit bias in comparison with gradient flow and that its magnitude depends on the learning rate. While this shows one benefit of using large learning rates, in this work we provide evidence that the effect of the learning rate on optimization goes beyond controlling the amount of induced stochastic noise. Prior work has also experimentally established the existence of different phases during the training of a neural network. Cohen et al. (2021) show that the Hessian eigenvalues initially tend to grow until reaching the convergence threshold for the learning rate in use, a state they call "Edge of Stability". This growth is also reported in (Lewkowycz et al., 2020) for the maximum eigenvalue of the Neural Tangent Kernel (Jacot et al., 2018), where it has also been observed that this value decreases later in training, leading to convergence.
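The stability threshold underlying the "Edge of Stability" picture can be seen directly on a quadratic; this is a standard textbook fact rather than a result of the paper. GD on f(x) = λx²/2 multiplies the iterate by (1 - ηλ) each step, so it contracts exactly when the learning rate η is below 2/λ.

```python
def gd_quadratic(lr, sharpness, x0=1.0, steps=50):
    # GD on f(x) = 0.5 * sharpness * x^2:
    # each update multiplies x by (1 - lr * sharpness)
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x
    return x

# with sharpness 4.0 the classical stability threshold is 2 / 4.0 = 0.5
stable = gd_quadratic(lr=0.4, sharpness=4.0)    # |1 - 1.6| = 0.6 < 1: converges to 0
unstable = gd_quadratic(lr=0.6, sharpness=4.0)  # |1 - 2.4| = 1.4 > 1: oscillates, diverges
```

"Sharpness" here plays the role of the largest Hessian eigenvalue; when training drives this eigenvalue up toward 2/η, GD sits at the boundary between the two regimes above.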
Recent works have also investigated GD's behavior at the edge of stability in some settings (Arora et al., 2022), obtaining insights such as its effect on balancing the norms of the layers of a two-layer ReLU network (Chen & Bruna, 2022). In our results, GD is above the conventional stability threshold while it is escaping from a local minimum, but returns to stability once the escape is finished. In (Elkabetz & Cohen, 2021) it is conjectured that gradient descent and gradient flow have close trajectories for neural networks. However, the aforementioned observations suggest that gradient descent with a large learning rate visits a different set of points in the landscape than GD with a small learning rate. Therefore, this conjecture might not hold for general networks. The difference in trajectories is also supported by the practical observation that a large learning rate leads to a better model (Li et al., 2019). To bridge this gap, by comparing gradient flow and gradient descent trajectories, Barrett & Dherin (2021) identify an implicit regularization term on the gradient norm induced by taking discrete steps. Still, this term is not enough to remove a local minimum from the landscape. Other implicit regularization terms specific to various problems have also been proposed in the literature (Ma et al., 2020; Razin & Cohen, 2020; Wang et al., 2021b). In this paper, we provide experimental evidence

