ROBUST LEARNING RATE SELECTION FOR STOCHASTIC OPTIMIZATION VIA SPLITTING DIAGNOSTIC

Abstract

This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. The method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, whenever the iterates are likely to bounce around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost beyond standard SGD. Through a series of extensive experiments, we show that this method is appropriate both for convex problems and for training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a single set of default parameters across a wide range of problems and, moreover, yields better generalization performance than other adaptive gradient methods such as Adam.

1. INTRODUCTION

Many machine learning problems boil down to finding a minimizer $\theta^* \in \mathbb{R}^d$ of a risk function taking the form
$$F(\theta) = \mathbb{E}[f(\theta, Z)], \qquad (1)$$
where $f$ denotes a loss function, $\theta$ is the model parameter, and the random data point $Z = (X, y)$ contains a feature vector $X$ and its label $y$. In the case of a finite population, for example, this problem reduces to empirical risk minimization. The touchstone method for minimizing (1) is stochastic gradient descent (SGD). Starting from an initial point $\theta_0$, SGD updates the iterates according to
$$\theta_{t+1} = \theta_t - \eta_t \cdot g(\theta_t, Z_{t+1}) \quad \text{for } t \ge 0, \qquad (2)$$
where $\eta_t$ is the learning rate, $\{Z_t\}_{t=1}^{\infty}$ are i.i.d. copies of $Z$, and $g(\theta, Z)$ is the (sub-)gradient of $f(\theta, Z)$ with respect to $\theta$. The noisy gradient $g(\theta, Z)$ is an unbiased estimate of the true gradient $\nabla F(\theta)$ in the sense that $\mathbb{E}[g(\theta, Z)] = \nabla F(\theta)$ for any $\theta$. The convergence rate of SGD crucially depends on the learning rate, often recognized as "the single most important hyper-parameter" in training deep neural networks (Bengio, 2012), and, accordingly, there is a vast literature on how to decrease this fundamental tuning parameter for improved convergence performance. In the pioneering work of Robbins and Monro (1951), the learning rate $\eta_t$ is set to $O(1/t)$ for convex objectives. Later, it was recognized that a slowly decreasing learning rate in conjunction with iterate averaging leads to a faster rate of convergence for strongly convex and smooth objectives (Ruppert, 1988; Polyak and Juditsky, 1992). More recently, extensive effort has been devoted to incorporating preconditioning/Hessian information into learning rate selection rules (Duchi et al., 2011; Dauphin et al., 2015; Tan et al., 2016). Among numerous proposals, a simple yet widely employed approach is to repeatedly halve the learning rate after performing a pre-determined number of iterations (see, for example, Bottou et al., 2018).
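To fix ideas, the SGD recursion in (2) can be sketched in a few lines. The least-squares loss, the data model, and the particular $O(1/t)$ schedule below are our own illustrative choices, not part of the paper:

```python
import numpy as np

def sgd(grad_fn, theta0, data, lr_schedule):
    """Plain SGD: theta_{t+1} = theta_t - eta_t * g(theta_t, Z_{t+1})."""
    theta = np.asarray(theta0, dtype=float)
    for t, z in enumerate(data):
        theta = theta - lr_schedule(t) * grad_fn(theta, z)
    return theta

# Illustrative setup: least-squares loss f(theta, (x, y)) = (x @ theta - y)^2 / 2,
# whose stochastic gradient with respect to theta is (x @ theta - y) * x.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])
X = rng.normal(size=(5000, 2))
y = X @ theta_star + 0.1 * rng.normal(size=5000)

grad = lambda theta, z: (z[0] @ theta - z[1]) * z[0]
# Robbins-Monro style O(1/t) learning rate schedule.
theta_hat = sgd(grad, np.zeros(2), list(zip(X, y)), lr_schedule=lambda t: 0.5 / (t + 10))
```

With the decreasing schedule the iterates approach $\theta^* = (1, -2)$; replacing `lr_schedule` with a constant produces the constant-learning-rate regime whose stationary behavior is discussed next.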
In this paper, we introduce SplitSGD, a new variant of SGD with a novel learning rate selection rule. At a high level, our method is motivated by the following fact: an optimal learning rate should be adaptive to the informativeness of the noisy gradient $g(\theta_t, Z_{t+1})$. Roughly speaking, the informativeness is higher if the true gradient $\nabla F(\theta_t)$ is relatively large compared with the noise $\nabla F(\theta_t) - g(\theta_t, Z_{t+1})$, and vice versa. On the one hand, if the learning rate is too small with respect to the informativeness of the noisy gradient, SGD makes rather slow progress. On the other hand, if the learning rate is too large with respect to the informativeness, the iterates bounce around a region containing an optimum of the objective. The latter case corresponds to a stationary phase in stochastic optimization (Murata, 1998; Chee and Toulis, 2018), which necessitates reducing the learning rate for better convergence. Specifically, let $\pi_\eta$ be the stationary distribution of $\theta$ when the learning rate is constant and equal to $\eta$. From (2), one has $\mathbb{E}_{\theta \sim \pi_\eta}[g(\theta, Z)] = 0$, and consequently
$$\mathbb{E}\left[\langle g(\theta^{(1)}, Z^{(1)}),\, g(\theta^{(2)}, Z^{(2)}) \rangle\right] = 0 \quad \text{for } \theta^{(1)}, \theta^{(2)} \overset{\text{i.i.d.}}{\sim} \pi_\eta,\ Z^{(1)}, Z^{(2)} \text{ i.i.d. copies of } Z. \qquad (3)$$
[Figure 1: Normalized dot product of averaged noisy gradients over 100 iterations. Stationarity depends on the learning rate: $\eta = 1$ corresponds to stationarity (purple), while $\eta = 0.1$ corresponds to non-stationarity (orange). Details in Section 2.]
SplitSGD differs from other stochastic optimization procedures in its robust stationary-phase detection, which we refer to as the Splitting Diagnostic. In short, this diagnostic runs two SGD threads initialized at the same iterate using independent data points (the $Z_{t+1}$ in (2)), and then performs a hypothesis test to determine whether the learning rate leads to a stationary phase or not.
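The identity (3) can be checked numerically. The sketch below (our own illustration) uses the one-dimensional quadratic $F(\theta) = \theta^2/2$ with Gaussian gradient noise, for which the constant-step recursion has an explicit Gaussian stationary distribution $\pi_\eta$, and verifies by Monte Carlo that the product of gradients at two independent draws averages to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
eta, sigma = 0.5, 1.0

# For F(theta) = theta^2 / 2 with g(theta, Z) = theta + Z, Z ~ N(0, sigma^2),
# the recursion theta <- (1 - eta) * theta - eta * Z has a Gaussian stationary
# distribution pi_eta with variance v solving v = (1 - eta)^2 v + eta^2 sigma^2,
# i.e. v = eta * sigma^2 / (2 - eta).
var_stat = eta * sigma**2 / (2 - eta)

n = 100_000
theta1 = rng.normal(scale=np.sqrt(var_stat), size=n)  # theta^(1) ~ pi_eta
theta2 = rng.normal(scale=np.sqrt(var_stat), size=n)  # theta^(2) ~ pi_eta
z1 = rng.normal(scale=sigma, size=n)                  # Z^(1)
z2 = rng.normal(scale=sigma, size=n)                  # Z^(2)

# Monte Carlo estimate of E[<g(theta^(1), Z^(1)), g(theta^(2), Z^(2))>] in (3).
inner = np.mean((theta1 + z1) * (theta2 + z2))
```

Since $\mathbb{E}_{\theta \sim \pi_\eta}[g(\theta, Z)] = 0$ and the two draws are independent, the empirical average concentrates around zero; evaluating both gradients at an iterate far from the optimum would instead yield a strongly positive inner product.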
The effectiveness of the Splitting Diagnostic is illustrated in Figure 1, which reveals different patterns of dependence between the two SGD threads under different learning rates. Loosely speaking, in the stationary phase (in purple), the two SGD threads behave as if they were independent due to a large learning rate, and SplitSGD subsequently decreases the learning rate by some factor. In contrast, strong positive dependence is exhibited in the non-stationary phase (in orange), and thus the learning rate remains the same after the diagnostic. In essence, the robustness of the Splitting Diagnostic is attributed to its adaptivity to the local geometry of the objective, thereby making SplitSGD a tuning-insensitive method for stochastic optimization. Its strength is confirmed by our experimental results in both convex and non-convex settings. In the latter, SplitSGD shows robustness with respect to the choice of the initial learning rate, and remarkable success in improving test accuracy and avoiding overfitting compared to classic optimization procedures.
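A simplified sketch of the splitting idea follows. The window length, the number of windows, and the majority-sign decision rule are our own illustrative choices, not the paper's exact test:

```python
import numpy as np

def splitting_diagnostic(grad_fn, theta0, sample_fn, lr, t_steps=100, windows=4):
    """Run two SGD threads from the same iterate on independent data and test
    stationarity via the signs of inner products of window-summed gradients."""
    theta1 = np.asarray(theta0, dtype=float).copy()
    theta2 = theta1.copy()
    signs = []
    for _ in range(windows):
        g1_sum, g2_sum = np.zeros_like(theta1), np.zeros_like(theta2)
        for _ in range(t_steps):
            g1 = grad_fn(theta1, sample_fn())  # each thread draws its own data
            g2 = grad_fn(theta2, sample_fn())
            theta1 -= lr * g1
            theta2 -= lr * g2
            g1_sum += g1
            g2_sum += g2
        signs.append(np.sign(g1_sum @ g2_sum))
    # Near stationarity the two threads are nearly independent, so the inner
    # products fluctuate around zero; while still converging, both threads'
    # gradients point toward the optimum and the inner products are positive.
    # Illustrative rule: declare stationarity if at most half are positive.
    return bool(np.mean(np.array(signs) > 0.0) <= 0.5)

# Illustrative quadratic: F(theta) = ||theta||^2 / 2, g(theta, z) = theta + z.
rng = np.random.default_rng(2)
grad = lambda theta, z: theta + z
sample = lambda: rng.normal(size=2)

# Far from the optimum with a small learning rate: the threads are still
# converging, so the diagnostic should not flag stationarity.
is_stat = splitting_diagnostic(grad, np.array([5.0, 5.0]), sample, lr=0.01)
```

Rerunning from an iterate near the optimum with a much larger learning rate tends to flag stationarity, since the window-summed gradients then carry mostly independent noise; in that case SplitSGD would decrease the learning rate.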

1.1. RELATED WORK

There is a long history of detecting stationarity or non-stationarity in stochastic optimization to improve convergence rates (Yin, 1989; Pflug, 1990; Delyon and Juditsky, 1993; Murata, 1998; Pesme et al., 2020). Perhaps the most relevant work in this vein to the present paper is Chee and Toulis (2018), which builds on Pflug (1990) for general convex functions. Specifically, that work uses the running sum of inner products of successive stochastic gradients for stationarity detection. However, this approach does not take into account the strong correlation between consecutive gradients and, moreover, is not sensitive to the local curvature at the current iterates due to unwanted influence from prior gradients. In contrast, the splitting strategy, which is akin to HiGrad (Su and Zhu, 2018), allows SplitSGD to concentrate on the current gradients and leverage the regained independence of gradients to test stationarity. More recently, Yaida (2019) and Lang et al. (2019) derived a stationarity detection rule based on mini-batch gradients to tune the learning rate in SGD with momentum. From a different angle, another related line of work is concerned with the relationship between the informativeness of gradients and the mini-batch size (Keskar et al., 2016; Yin et al., 2017; Li et al., 2017; Smith et al., 2017). Among others, it has been recognized that the optimal mini-batch size should be adaptive to the local geometry of the objective function and the noise level of the gradients, inspiring a growing line of work that leverages the mini-batch gradient variance for learning rate selection (Byrd et al., 2012; Balles et al., 2016; Balles and Hennig, 2017; De et al., 2017; Zhang and Mitliagkas, 2017; McCandlish et al., 2018).

