THE LARGE LEARNING RATE PHASE OF DEEP LEARNING

Abstract

The choice of initial learning rate can have a profound effect on the performance of deep networks. We present empirical evidence that networks exhibit sharply distinct behaviors at small and large learning rates. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates, we find that networks exhibit qualitatively distinct phenomena that cannot be explained by existing theory: the loss grows during the early part of training, and optimization eventually converges to a flatter minimum. Furthermore, we find that the optimal performance is often found in the large learning rate phase. To better understand this behavior we analyze the dynamics of a two-layer linear network and prove that it exhibits these different phases. We find good agreement between our analysis and the training dynamics observed in realistic deep learning settings.

1. INTRODUCTION

Deep learning has shown remarkable success across a variety of tasks. At the same time, our theoretical understanding of deep learning methods remains limited. In particular, the interplay between training dynamics, properties of the learned network, and generalization remains a largely open problem. In tackling this problem, much progress has been made by studying deep neural networks whose hidden layers are wide. In the limit of infinite width, connections have been made between the stochastic gradient descent (SGD) dynamics of neural networks, compositional kernels, and linear models. These connections hold when the learning rate is sufficiently small; a theory of the dynamics of deep networks operating outside this regime remains largely open.

In this work, we present evidence that SGD dynamics change significantly when the learning rate is above a critical value η_crit, determined by the local curvature of the loss landscape at initialization. These dynamics are stable above the critical learning rate, up to a maximum learning rate η_max. Training at these large learning rates produces different signatures than observed for learning rates η < η_crit: the loss initially increases and peaks before decreasing again, and the local curvature drops significantly early in training. We typically find that the best performance is obtained when training above the critical learning rate. Empirically, we find these two learning rate regimes are robust, holding across a variety of architectural and data settings. Figure 1 highlights our key findings. We now describe the main contributions of this work.

1.1. TRAINING WITH A LARGE LEARNING RATE LEADS TO A CATAPULT EFFECT

Consider a deep network defined by the network function f(θ, x), where θ are the model parameters and x the input. We define the curvature λ_t at training step t to be the maximum eigenvalue of the Fisher information matrix F_t := E_x[∇_θ f(θ_t, x) ∇_θ f(θ_t, x)^T] (Amari et al., 2000; Karakida et al., 2018); equivalently, λ_t is the maximum eigenvalue of the Neural Tangent Kernel (Jacot et al., 2018). We notice the following effects, which occur when the learning rate is above the critical value η_crit = 2/λ_0, where λ_0 is the curvature at initialization (note that η_crit therefore depends on the scale of initialization through λ_0).

1. During the first few steps of training, the loss grows significantly compared to its initial value before it begins decreasing. We call this the catapult effect.

2. Over the same time frame, the curvature decreases until it is below 2/η.

We can build intuition for these effects using loss landscape considerations. Consider a linear model where the curvature of the loss landscape is given by λ_0; here, curvature means the largest eigenvalue of the linear model's kernel. The model can be trained using gradient descent as long as the learning rate obeys η < 2/λ_0. When η > 2/λ_0, the loss diverges and optimization fails. Next, consider a deep network. If we train the model with learning rate η > η_crit, we may again expect the loss to grow initially, assuming the curvature is approximately constant in the neighborhood of the initial point in parameter space. This is the effect observed in Figure 2. However, unlike the linear case, optimization may still succeed if gradient descent is able to navigate to a region of the landscape with lower curvature λ, such that η < 2/λ. This is indeed what we observe in practice. In Figure 1 we show that optimal performance typically occurs when a network is trained in the large learning rate regime. As discussed further in Section 2, this holds even when the compute budget for smaller learning rates is increased to account for the smaller step size.
This is consistent with previous observations in the literature, which showed a correlation between performance and the flatness of the minimum (Keskar et al., 2016) .
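The linear-model intuition above can be checked directly: gradient descent on a quadratic loss with curvature λ converges if and only if η < 2/λ. A minimal sketch (our own toy example, not the paper's code; the function name `run_gd` is our choice):

```python
# Toy illustration: gradient descent on the quadratic loss
# L(theta) = (lam / 2) * theta**2, whose curvature is lam.
# The update is theta <- theta * (1 - eta * lam), so |theta| shrinks
# iff |1 - eta * lam| < 1, i.e. eta < 2 / lam. This is the bound that
# a deep network escapes via the catapult effect.

def run_gd(eta, lam=1.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta -= eta * lam * theta  # gradient step on the quadratic
    return abs(theta)

print(run_gd(eta=1.5))  # eta < 2/lam: |theta| decays toward 0
print(run_gd(eta=2.5))  # eta > 2/lam: |theta| blows up
```

For a linear model there is no lower-curvature region to catapult into, so the η > 2/λ run simply diverges.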

1.2. AT LARGE WIDTH, A SHARP DISTINCTION BETWEEN LEARNING RATE REGIMES

The large width limit of deep networks has been shown to lead to simplified training dynamics that are amenable to theoretical study, as in the case of the Neural Tangent Kernel (Jacot et al., 2018). In this work we show that the distinction between small and large learning rates becomes sharply defined at large width. This can be seen in Figures 2c and 2f, which show the curvature of sufficiently wide networks after the initial part of training, as a function of the learning rate. When η < η_crit the curvature is approximately independent of the learning rate, while for η > η_crit the curvature is lower than 2/η.
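The curvature λ_0, and hence η_crit, can be estimated at initialization from the empirical NTK Gram matrix. A sketch for the two-layer linear network f(x) = v^T W x / √m (our own construction under these assumptions, not the paper's code; the gradients are computed analytically for this model):

```python
import numpy as np

# Estimate lambda_0 as the top eigenvalue of the empirical NTK Gram
# matrix at initialization, then set eta_crit = 2 / lambda_0.
rng = np.random.default_rng(0)
m, d, n = 512, 10, 20                 # width, input dim, sample count
W = rng.standard_normal((m, d))       # first layer
v = rng.standard_normal(m)            # second layer
X = rng.standard_normal((n, d)) / np.sqrt(d)

# For f(x) = v^T W x / sqrt(m): grad_v f = W x / sqrt(m) and
# grad_W f = v x^T / sqrt(m), so
# NTK(x, x') = ((W x).(W x') + (v.v)(x.x')) / m.
gram = ((v @ v) * (X @ X.T) + X @ W.T @ W @ X.T) / m
lambda0 = np.linalg.eigvalsh(gram).max()
eta_crit = 2.0 / lambda0
print(eta_crit)
```

For general architectures the same quantity can be obtained by power iteration on Jacobian-vector products, avoiding materializing the full Gram matrix.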




Figure 2 shows the results of training several deep networks with mean squared error (MSE) loss using SGD with a range of learning rates. The loss and curvature are measured at every step during the early part of training.
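The catapult dynamics can be reproduced in the two-layer linear model analyzed in the paper. Below is a minimal sketch under our own assumptions (a single training example x = 1 with target y = 0, width m = 1000, and η chosen 20% above η_crit; variable names are ours): the loss first grows, the curvature falls monotonically, and training settles with curvature below 2/η.

```python
import numpy as np

# Catapult demo for f(x) = (u . v) x / sqrt(m) with MSE loss on one
# example. For this model the curvature (top NTK eigenvalue) is
# lambda = (|u|^2 + |v|^2) x^2 / m, so eta_crit = 2 / lambda_0.
rng = np.random.default_rng(1)
m = 1000
u = rng.standard_normal(m)
v = rng.standard_normal(m)
x, y = 1.0, 0.0                       # single training point, target 0

def forward(u, v):
    return (u @ v) * x / np.sqrt(m)

def curvature(u, v):
    return (u @ u + v @ v) * x * x / m

eta = 1.2 * 2.0 / curvature(u, v)     # 20% above the critical value
losses, curvs = [], []
for _ in range(200):
    f = forward(u, v)
    losses.append(0.5 * (f - y) ** 2)
    curvs.append(curvature(u, v))
    grad = (f - y) * x / np.sqrt(m)   # shared scalar in both gradients
    u, v = u - eta * grad * v, v - eta * grad * u  # simultaneous update

print(max(losses) > losses[0])        # loss catapults upward first
print(curvs[-1] < 2.0 / eta)          # ends in a flatter region
```

With η below η_crit the same code converges monotonically, and well above η_max = 4/λ_0 (the stability edge for this model) it diverges, matching the phases described in the abstract.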

