EXPECTIGRAD: FAST STOCHASTIC OPTIMIZATION WITH ROBUST CONVERGENCE PROPERTIES

Abstract

Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on any instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods are. Testing Expectigrad on several high-dimensional machine learning tasks, we find that it often compares favorably to state-of-the-art methods with little hyperparameter tuning.

1. INTRODUCTION

Efficiently training deep neural networks has proven crucial for achieving state-of-the-art results in machine learning (e.g. Krizhevsky et al., 2012; Graves et al., 2013; Karpathy et al., 2014; Mnih et al., 2015; Silver et al., 2016; Vaswani et al., 2017; Radford et al., 2019; Schrittwieser et al., 2019; Vinyals et al., 2019). At the core of these successes lies the backpropagation algorithm (Rumelhart et al., 1986), which provides a general procedure for computing the gradient of a loss measure with respect to the parameters of an arbitrary network. Because exact gradient computation over an entire dataset is expensive, training is often conducted using randomly sampled minibatches of data instead.¹ Consequently, training can be modeled as a stochastic optimization problem where the loss is minimized in expectation. A natural algorithmic choice for this type of optimization problem is Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951) due to its relatively cheap computational cost and its reliable convergence when the learning rate is appropriately annealed. A major drawback of SGD is that its convergence rate is highly dependent on the condition number of the loss function (Boyd & Vandenberghe, 2004). Ill-conditioned loss functions are nearly inevitable in deep learning due to the high-dimensional nature of the models; pathological features such as plateaus, sharp nonlinearities, and saddle points become increasingly probable as the number of model parameters grows, all of which can interfere with learning (Pascanu et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2016; Goh, 2017).
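The minibatch training scheme described above can be made concrete with a minimal sketch. All names here are illustrative: a toy least-squares model is trained by SGD on random minibatches, using the sample-average gradient as a stochastic estimate of the full-dataset gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: l(x, xi) = 0.5 * (a . x - b)^2 for a sample xi = (a, b).
A = rng.normal(size=(1000, 5))                      # features for 1000 samples
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
b = A @ w_true + 0.01 * rng.normal(size=1000)       # noisy targets

x = np.zeros(5)
lr = 0.05
for step in range(2000):
    idx = rng.integers(0, 1000, size=32)            # sample a random minibatch
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / 32    # stochastic gradient estimate
    x -= lr * grad                                  # SGD update

print(np.round(x, 2))                               # close to w_true
```

With a small constant learning rate the iterates settle into a noise ball around the minimizer, which is why annealing the learning rate is needed for exact convergence.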
Enhancements to SGD such as momentum (Polyak, 1964) and Nesterov momentum (Nesterov, 1983; Sutskever et al., 2013) can help, but they still largely suffer from the same major shortcoming: any particular choice of hyperparameters typically does not generalize well to a variety of different network topologies, and therefore costly hyperparameter searches must be conducted. This problem has motivated significant research into adaptive methods for deep learning, which dynamically modify learning rates on a per-component basis with the goal of accelerating learning without tuning hyperparameters. AdaGrad (Duchi et al., 2011) was an early success in this area that (in its simplest form) divides each step by the square root of a running sum of squared gradients, but this can cause its empirical performance to degrade noticeably in the presence of dense gradients. Later methods such as ADADELTA (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014) remedied this by instead normalizing stepsizes by an exponential moving average (EMA). Such methods are able to increase their learning rates after encountering regions of small gradients and have enjoyed widespread adoption due to their consistent empirical performance. Unfortunately, the EMA has recently been shown to cause divergence on a certain convex optimization problem (Reddi et al., 2019) that we refer to as the Reddi Problem. This finding has severe implications because it points to an underlying flaw shared by the most widely used adaptive methods. Recent attempts to resolve this EMA-divergence issue have been unsatisfying. Proposed methods invariably begin with Adam and then apply a minor adjustment aimed at suppressing divergence. Specifically, they either suddenly or gradually transition from Adam to SGD during training (Keskar & Socher, 2017; Luo et al., 2019), or clip certain quantities in the Adam update rule to limit the EMA's sensitivity (Zaheer et al., 2018; Reddi et al., 2019).
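The contrast between AdaGrad's running sum and the EMA used by RMSProp and Adam can be seen in a small sketch (function names here are illustrative, not from any library). After a burst of large gradients followed by many small ones, AdaGrad's denominator stays large forever, while the EMA "forgets" the burst and shrinks again:

```python
import numpy as np

def adagrad_denom(grads, eps=1e-8):
    # AdaGrad: running sum of squared gradients. The denominator only
    # grows, so effective stepsizes shrink monotonically.
    s = np.zeros_like(grads[0])
    for g in grads:
        s += g ** 2
    return np.sqrt(s) + eps

def ema_denom(grads, beta=0.9, eps=1e-8):
    # RMSProp/Adam: exponential moving average of squared gradients.
    # The denominator can decrease after a region of small gradients.
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta * v + (1 - beta) * g ** 2
    return np.sqrt(v) + eps

# Five large gradients followed by fifty small ones.
grads = [np.array([10.0])] * 5 + [np.array([0.1])] * 50
print(adagrad_denom(grads), ema_denom(grads))
```

The EMA's responsiveness is exactly what makes these methods fast in practice, and, per the Reddi Problem, exactly what can make them diverge: informative but rare large gradients are forgotten too quickly.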
While these methods technically do prevent divergence on the Reddi Problem, they fail to address the fundamental issue that the EMA can be unreliable, and they do not advance our theoretical understanding of alternatives that are inherently robust to divergence. Furthermore, by intentionally reducing the responsiveness of Adam, these modifications do not always translate into better empirical performance on problems of practical interest, yet they come at the expense of increased complexity. The principal objective of this work is therefore to develop a novel adaptive method that provides stronger convergence guarantees than EMA-based methods while retaining fast learning speed. Toward this end, we propose the Expectigrad algorithm, which introduces two major innovations: (1) normalization by the arithmetic mean instead of the EMA, and (2) "outer momentum," in which bias-corrected momentum is applied jointly to the numerator and denominator. Expectigrad provably converges on all instances of the Reddi Problem that cause Adam to diverge, and minimizes the function significantly faster than related methods using the same hyperparameters. We also derive a regret bound for Expectigrad that establishes a convergence rate comparable to the best known rate for Adam. Our bounds also indicate that Expectigrad is less susceptible to noisy gradients. Finally, we test Expectigrad by training various neural network architectures with millions of learnable parameters; we show that it consistently outperforms Adam and is competitive with other state-of-the-art methods.
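The exact update rule is presented later in the paper; purely to fix intuition for the two innovations named above, the following is an illustrative sketch under two stated assumptions: that the "unweighted mean" is taken over squared gradients (as in other adaptive denominators), and that "outer momentum" means smoothing the entire normalized step rather than the raw gradient. The function name and hyperparameter defaults are hypothetical.

```python
import numpy as np

def expectigrad_sketch(grad_fn, x0, steps, lr=0.1, beta=0.9, eps=1e-8):
    """Illustrative Expectigrad-style update (NOT the paper's exact rule):
    normalize by the arithmetic mean of all historical squared gradients,
    then apply bias-corrected momentum to the whole normalized step."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)   # running sum of squared gradients
    m = np.zeros_like(x)   # momentum on the normalized step ("outer momentum")
    for t in range(1, steps + 1):
        g = grad_fn(x)
        s += g ** 2
        step = g / (eps + np.sqrt(s / t))   # arithmetic-mean normalization
        m = beta * m + (1 - beta) * step    # momentum over numerator/denominator jointly
        x = x - lr * m / (1 - beta ** t)    # bias-corrected update
    return x

# Minimize f(x) = x^2 starting from x0 = 3.
x_final = expectigrad_sketch(lambda x: 2 * x, np.array([3.0]), steps=200)
print(x_final)
```

Unlike an EMA, the arithmetic mean s/t assigns every historical gradient equal weight, so a rare large gradient is never forgotten; it merely decays in influence at a 1/t rate as more samples arrive.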

2. PRELIMINARIES

We begin with some notational remarks. We always typeset vectors x, y in boldface to avoid confusion with scalars x, y. We denote the j-th component of a vector x_i by x_{i,j}. The Euclidean norm is denoted by ‖x‖ and the inner product by ⟨x, y⟩. All arithmetic operations can be assumed to be element-wise: e.g. x ± y for addition and subtraction, xy for multiplication, x/y for division, x^a for exponentiation by a scalar, √x for square root, and so on.

We now consider the optimization setting that is the focus of this paper. Let l : R^d × R^m → R be a (nonconvex) function that we seek to (locally) minimize in expectation over some distribution P on Ξ ⊂ R^m. Precisely, we must locate a point x* ∈ R^d with the property ∇f(x*) = 0, where f(x) := E[l(x, ξ)]. All expectations in our work are implicitly taken over ξ ∼ P(Ξ). Direct computation of ∇f(x) is assumed to be infeasible, but repeated calculation of ∇l(x, ξ) is permitted. We also assume that l has the following properties:

Assumption 1. l is bounded below.

Assumption 2. l is Lipschitz continuous: |l(x, ξ) − l(y, ξ)| ≤ L‖x − y‖, ∀x, y ∈ R^d, ∀ξ ∈ Ξ.

Assumption 3. l is Lipschitz smooth: ‖∇l(x, ξ) − ∇l(y, ξ)‖ ≤ L‖x − y‖, ∀x, y ∈ R^d, ∀ξ ∈ Ξ.

Assumption 1 guarantees the existence of a well-defined set of minima (Nocedal & Wright, 2006). While Assumptions 2 and 3 need not share the same Lipschitz constant L > 0 in general, we assume that L is sufficiently large to satisfy both criteria simultaneously. These conditions are often met in practice, making our assumptions amenable to the deep learning setting that is the focus of this paper.² In Appendix B, we prove three lemmas from Assumptions 2 and 3 that are necessary for our later theorems. With these results, we are ready to present Expectigrad in the following section.
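A toy instance of this setting may help fix ideas. Assume (for illustration only) l(x, ξ) = ½(x − ξ)² with ξ ~ N(2, 1), so that f(x) = E[l(x, ξ)] has gradient ∇f(x) = x − 2. As in the setting above, ∇f is treated as unavailable, and we may only evaluate ∇l(x, ξ) on samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# l(x, xi) = 0.5 * (x - xi)^2 with xi ~ N(2, 1)  =>  grad f(x) = x - 2.
def stochastic_grad(x, batch_size):
    xi = rng.normal(loc=2.0, size=batch_size)
    return float(np.mean(x - xi))   # minibatch estimate of grad f(x)

x = 5.0
true_grad = x - 2.0                 # known here only because the toy f is analytic
estimate = stochastic_grad(x, batch_size=100_000)
print(true_grad, round(estimate, 2))
```

The minibatch average is an unbiased estimator of ∇f(x) whose variance shrinks with the batch size; the convergence analysis later in the paper quantifies how sensitive each method is to this residual variance.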

3. ALGORITHM

Our motivation for Expectigrad comes from the observation that successful adaptive methods are normalized by an EMA of past gradient magnitudes. This normalization process serves two important



¹ Training on small minibatches can also improve generalization (Wilson & Martinez, 2003).

² We note that the commonly used ReLU activation is unfortunately not Lipschitz smooth, but smooth approximations such as the softplus function (Dugas et al., 2001) can be substituted. This limitation is not specific to our work but affects convergence results for all first-order methods, including SGD.
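The softplus substitution in the second footnote can be checked numerically. Softplus(x) = log(1 + e^x) tracks ReLU(x) = max(0, x) closely away from zero, but its derivative is the sigmoid, which varies continuously, whereas ReLU's derivative jumps at 0 (the helper functions below are illustrative):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), written in an overflow-safe form.
    x = np.asarray(x, dtype=float)
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def sigmoid(x):
    # d/dx softplus(x): continuous everywhere, so softplus is Lipschitz smooth.
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# Far from zero the two activations nearly agree...
print(softplus(5.0), softplus(-5.0))
# ...but softplus has a well-defined continuous slope everywhere, e.g. at x = 0.
print(sigmoid(0.0))
```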

