ROBUST LEARNING RATE SELECTION FOR STOCHASTIC OPTIMIZATION VIA SPLITTING DIAGNOSTIC

Abstract

This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, whenever the iterates are likely to bounce around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost compared with standard SGD. Through a series of extensive experiments, we show that this method is appropriate both for convex problems and for training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters across a wide range of problems and, moreover, yields better generalization performance than other adaptive gradient methods such as Adam.

1. INTRODUCTION

Many machine learning problems boil down to finding a minimizer $\theta^* \in \mathbb{R}^d$ of a risk function taking the form
$$F(\theta) = \mathbb{E}\left[f(\theta, Z)\right], \qquad (1)$$
where $f$ denotes a loss function, $\theta$ is the model parameter, and the random data point $Z = (X, y)$ contains a feature vector $X$ and its label $y$. In the case of a finite population, for example, this problem reduces to empirical risk minimization. The touchstone method for minimizing (1) is stochastic gradient descent (SGD). Starting from an initial point $\theta_0$, SGD updates the iterates according to
$$\theta_{t+1} = \theta_t - \eta_t \cdot g(\theta_t, Z_{t+1}), \qquad t \ge 0,$$
where $\eta_t$ is the learning rate, $\{Z_t\}_{t=1}^{\infty}$ are i.i.d. copies of $Z$, and $g(\theta, Z)$ is the (sub-)gradient of $f(\theta, Z)$ with respect to $\theta$. The noisy gradient $g(\theta, Z)$ is an unbiased estimate of the true gradient $\nabla F(\theta)$ in the sense that $\mathbb{E}[g(\theta, Z)] = \nabla F(\theta)$ for any $\theta$. The convergence rate of SGD crucially depends on the learning rate, often recognized as "the single most important hyper-parameter" in training deep neural networks (Bengio, 2012), and, accordingly, there is a vast literature on how to decrease this fundamental tuning parameter for improved convergence. In the pioneering work of Robbins and Monro (1951), the learning rate $\eta_t$ is set to $O(1/t)$ for convex objectives. Later, it was recognized that a slowly decreasing learning rate in conjunction with iterate averaging leads to a faster rate of convergence for strongly convex and smooth objectives (Ruppert, 1988; Polyak and Juditsky, 1992). More recently, extensive effort has been devoted to incorporating preconditioning/Hessian information into learning rate selection rules (Duchi et al., 2011; Dauphin et al., 2015; Tan et al., 2016). Among numerous proposals, a simple yet widely employed approach is to repeatedly halve the learning rate after a pre-determined number of iterations (see, for example, Bottou et al., 2018).
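As a minimal sketch of the SGD update with the Robbins-Monro $O(1/t)$ schedule and Polyak-Ruppert iterate averaging discussed above, consider the toy quadratic objective $F(\theta) = \|\theta\|^2/2$ (this objective and the noise model are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(theta, z):
    # Noisy gradient of F(theta) = 0.5 * ||theta||^2; the true gradient is theta,
    # and z is zero-mean additive noise, so E[g(theta, Z)] = grad F(theta).
    return theta + z

theta = np.ones(2)   # initial point theta_0
avg = np.zeros(2)    # Polyak-Ruppert running average of the iterates
T = 5000
for t in range(1, T + 1):
    z = 0.1 * rng.standard_normal(2)   # i.i.d. gradient noise
    eta = 1.0 / t                      # Robbins-Monro O(1/t) learning rate
    theta = theta - eta * g(theta, z)  # SGD update: theta_{t+1} = theta_t - eta_t * g
    avg += (theta - avg) / t           # running mean of theta_1, ..., theta_t

# Both the last iterate and the averaged iterate approach the minimizer theta* = 0.
```

On strongly convex problems such as this one, the averaged iterate typically exhibits lower variance than the last iterate, which is the motivation for the Ruppert (1988) and Polyak and Juditsky (1992) averaging scheme.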
In this paper, we introduce a new variant of SGD that we term SplitSGD, with a novel learning rate selection rule. At a high level, our new method is motivated by the following fact: an optimal learning rate should be adaptive to the informativeness of the noisy gradient $g(\theta_t, Z_{t+1})$. Roughly speaking, the informativeness is higher if the true gradient $\nabla F(\theta_t)$ is relatively large compared with the noise.
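The splitting diagnostic described in the abstract can be illustrated in one dimension: run two SGD threads from the same point and compare the sign of the product of their averaged gradients. This is only a toy illustration under assumed settings (quadratic objective, unit gradient noise), not the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_grad(theta):
    # Gradient of F(theta) = 0.5 * theta^2 plus unit Gaussian noise;
    # the true gradient is theta itself.
    return theta + rng.standard_normal()

def split_diagnostic(theta0, eta, steps):
    """Run two independent SGD threads from theta0 and return the product of
    their averaged gradients; a negative value suggests a stationary phase."""
    avg_grads = []
    for _ in range(2):
        theta, gsum = theta0, 0.0
        for _ in range(steps):
            grad = noisy_grad(theta)
            gsum += grad
            theta -= eta * grad
        avg_grads.append(gsum / steps)
    return avg_grads[0] * avg_grads[1]

# Far from the minimum the true gradient dominates the noise, so both threads'
# averaged gradients share its sign and the product is large and positive.
far = split_diagnostic(theta0=50.0, eta=0.01, steps=100)

# Near the minimum the iterates bounce around theta* = 0; the two threads'
# averaged gradients are nearly independent noise, so the product is small
# and frequently negative.
near = split_diagnostic(theta0=0.0, eta=0.01, steps=100)
```

When the product is consistently small or negative, the thread is likely in a stationary phase, which is the signal SplitSGD uses to trigger a learning rate decrease.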

