EXPECTIGRAD: FAST STOCHASTIC OPTIMIZATION WITH ROBUST CONVERGENCE PROPERTIES

Abstract

Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on any instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods are. Testing Expectigrad on several high-dimensional machine learning tasks, we find that it often compares favorably with state-of-the-art methods while requiring little hyperparameter tuning.
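To make the abstract's description concrete, the following is an illustrative sketch of an Expectigrad-style update, reconstructed only from the verbal description above (an unweighted mean of squared historical gradients in the denominator, with a bias-corrected momentum term applied jointly to the normalized step). The function name, state layout, and default hyperparameters are assumptions for illustration, not the paper's exact pseudocode.

```python
import numpy as np

def expectigrad_step(theta, grad, state, lr=1e-3, beta=0.9, eps=1e-8):
    """One illustrative Expectigrad-style update (a sketch, not the
    paper's exact algorithm). The denominator uses the unweighted
    arithmetic mean of all squared gradients seen so far, and momentum
    is applied jointly to the normalized step (numerator and
    denominator together), then bias-corrected."""
    state["t"] += 1
    state["s"] += grad ** 2                         # running sum of squared gradients
    mean_sq = state["s"] / state["t"]               # unweighted historical mean
    step = grad / (eps + np.sqrt(mean_sq))          # normalized step
    state["m"] = beta * state["m"] + (1 - beta) * step  # joint momentum
    m_hat = state["m"] / (1 - beta ** state["t"])   # bias correction
    return theta - lr * m_hat

# Example usage on a 2-dimensional parameter vector
state = {"t": 0, "s": np.zeros(2), "m": np.zeros(2)}
theta = np.array([1.0, -2.0])
theta = expectigrad_step(theta, np.array([0.5, -0.5]), state)
```

Unlike an EMA, the unweighted mean gives every historical gradient equal weight, so the denominator cannot be dominated by a few recent large or small gradients.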

1. INTRODUCTION

Efficiently training deep neural networks has proven crucial for achieving state-of-the-art results in machine learning (e.g. Krizhevsky et al., 2012; Graves et al., 2013; Karpathy et al., 2014; Mnih et al., 2015; Silver et al., 2016; Vaswani et al., 2017; Radford et al., 2019; Schrittwieser et al., 2019; Vinyals et al., 2019). At the core of these successes lies the backpropagation algorithm (Rumelhart et al., 1986), which provides a general procedure for computing the gradient of a loss measure with respect to the parameters of an arbitrary network. Because exact gradient computation over an entire dataset is expensive, training is often conducted using randomly sampled minibatches of data instead.¹ Consequently, training can be modeled as a stochastic optimization problem where the loss is minimized in expectation. A natural algorithmic choice for this type of optimization problem is Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951) due to its relatively cheap computational cost and its reliable convergence when the learning rate is appropriately annealed. A major drawback of SGD is that its convergence rate is highly dependent on the condition number of the loss function (Boyd & Vandenberghe, 2004). Ill-conditioned loss functions are nearly inevitable in deep learning due to the high-dimensional nature of the models; pathological features such as plateaus, sharp nonlinearities, and saddle points become increasingly probable as the number of model parameters grows, all of which can interfere with learning (Pascanu et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2016; Goh, 2017).
Enhancements to SGD such as momentum (Polyak, 1964) and Nesterov momentum (Nesterov, 1983; Sutskever et al., 2013) can help, but they still largely suffer from the same major shortcoming: namely, any particular choice of hyperparameters typically does not generalize well to a variety of different network topologies, and therefore costly hyperparameter searches must be conducted. This problem has motivated significant research into adaptive methods for deep learning, which dynamically modify learning rates on a per-component basis with the goal of accelerating learning without tuning hyperparameters. AdaGrad (Duchi et al., 2011) was an early success in this area that (in its simplest form) divides each step by the square root of a running sum of squared gradients, but this sum grows without bound, which can cause AdaGrad's empirical performance to degrade noticeably in the presence of dense gradients. Later methods such as ADADELTA (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014) remedied this by instead normalizing stepsizes by an exponential moving average (EMA).
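The contrast between the two normalization schemes can be sketched as follows. With dense, persistent gradients, AdaGrad's running-sum denominator grows without bound and drives the effective stepsize toward zero, while an RMSProp-style EMA denominator stays bounded and responsive. The function names and the decay rate `rho=0.9` below are illustrative assumptions, not taken from any specific implementation.

```python
import numpy as np

def adagrad_denom(grads):
    """AdaGrad-style denominator: square root of the running sum of
    squared gradients. For dense gradients it grows without bound,
    so the effective stepsize vanishes over time."""
    return np.sqrt(np.cumsum(np.square(grads), axis=0))

def rmsprop_denom(grads, rho=0.9):
    """RMSProp-style denominator: square root of an exponential moving
    average of squared gradients. It stays bounded and tracks the
    recent gradient scale."""
    v, out = 0.0, []
    for g in grads:
        v = rho * v + (1 - rho) * g ** 2
        out.append(np.sqrt(v))
    return np.array(out)

g = np.ones(100)                 # dense, constant unit gradients
a = adagrad_denom(g)[-1]         # sqrt(100) = 10, and still growing
r = rmsprop_denom(g)[-1]         # approaches 1, the true gradient scale
```

The EMA fixes the vanishing-stepsize problem, but, as noted in the abstract, its heavy weighting of recent gradients is also what enables divergence on certain convex problems.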

¹ Training on small minibatches can also improve generalization (Wilson & Martinez, 2003).

