APOLLO: AN ADAPTIVE PARAMETER-WISE DIAGONAL QUASI-NEWTON METHOD FOR NONCONVEX STOCHASTIC OPTIMIZATION

Abstract

In this paper, we introduce APOLLO, a quasi-Newton method for nonconvex stochastic optimization that dynamically incorporates the curvature of the loss function by approximating the Hessian with a diagonal matrix. Importantly, the update and storage of this diagonal Hessian approximation are as efficient as those of adaptive first-order optimization methods, with linear time and memory complexity. To handle nonconvexity, we replace the Hessian with its rectified absolute value, which is guaranteed to be positive definite. Experiments on three vision and language tasks show that APOLLO achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance. The implementation of the algorithm is available at anonymous link.

1. INTRODUCTION

Nonconvex stochastic optimization is of core practical importance in many fields of machine learning, in particular for training deep neural networks (DNNs). First-order gradient-based optimization algorithms, conceptually attractive due to their linear time and memory complexity, have led to tremendous progress and impressive successes. A number of advanced first-order algorithms have emerged over the years in pursuit of fast and stable convergence, among which stochastic gradient descent (SGD) (Robbins & Monro, 1951; LeCun et al., 1998), equipped with momentum (Rumelhart et al., 1985; Qian, 1999; Bottou & Bousquet, 2008), has stood out for its simplicity and effectiveness across a wide range of applications (Hinton & Salakhutdinov, 2006; Hinton et al., 2012; Graves, 2013). One disadvantage of SGD, however, is that it scales the gradient uniformly in all directions, which can limit convergence speed and make the choice of learning rate sensitive; this has spawned considerable recent interest in accelerating SGD from both algorithmic and practical perspectives.

Recently, many adaptive first-order optimization methods have been proposed to achieve rapid training progress with element-wise scaled learning rates; we mention only a few here due to space limits. In their pioneering work, Duchi et al. (2011) proposed AdaGrad, which scales the gradient by the square root of the accumulated squared gradients from the first iteration. While AdaGrad works well in sparse settings, its performance degrades significantly in dense settings, primarily due to the monotonic growth of the accumulation. Subsequently, several methods have been proposed that limit the accumulation to a small window of past iterations, in particular by exponentially decaying the weight of earlier iterations.
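The contrast between AdaGrad's full accumulation and an exponentially decayed window can be sketched as follows. This is an illustrative sketch, not code from any of the cited works; `adagrad_step` and `ema_scaled_step` are hypothetical helper names.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad-style update: accumulate squared gradients from the first
    iteration; the accumulator grows monotonically, so effective step
    sizes shrink over time."""
    accum = accum + grad ** 2
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

def ema_scaled_step(param, grad, avg_sq, lr=0.1, beta=0.9, eps=1e-8):
    """Exponential-moving-average update (the intuition behind RMSProp-like
    methods): earlier iterations are exponentially down-weighted, limiting
    the accumulation to a small effective window of past iterations."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    param = param - lr * grad / (np.sqrt(avg_sq) + eps)
    return param, avg_sq
```

On a one-dimensional quadratic, both rules shrink the parameter toward the minimizer, but only the EMA variant keeps the per-coordinate scale from decaying monotonically.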
Notable works incorporating this method are RMSProp (Tieleman & Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma & Ba, 2015), among which Adam has become the default optimization algorithm across many deep learning applications because of its fast convergence speed and relatively consistent selection of hyper-parameters (Ruder, 2016; Zhang et al., 2020). However, it has been observed that these adaptive methods may converge to bad/suspicious local optima, resulting in worse generalization than their non-adaptive counterparts (Wilson et al., 2017), or may fail to converge due to unstable and extreme learning rates (Luo et al., 2019).

Quasi-Newton methods have been widely used in solving convex optimization problems, due to their efficient computation and fast convergence rate (Broyden, 1967; Dennis & Moré, 1977). However, the stochastic, high-dimensional and nonconvex nature of many machine learning tasks, such as training deep neural networks, has rendered many classical quasi-Newton methods ineffective and/or inefficient (Keskar & Berahas, 2016; Wang et al., 2017; Yao et al., 2020). Indeed, in many natural language processing (NLP) and computer vision (CV) tasks (He et al., 2016; Ma & Hovy, 2016; Luo et al., 2019), SGD (with momentum) is chosen as the optimizer, benefiting from its stable and efficient training and outstanding generalization.

In this work, we develop APOLLO, a quasi-Newton method for nonconvex stochastic optimization that simultaneously tackles the aforementioned challenges of stochastic variance, nonconvexity and inefficiency. Algorithmically, APOLLO dynamically incorporates the curvature of the objective function with a diagonally approximated Hessian. It requires only first-order gradients, and updates the diagonal Hessian approximation so that it satisfies a parameter-wise version of the weak secant condition (Wolfe, 1959).
To handle nonconvexity, we replace the Hessian with its rectified absolute value, whose computation is also efficient under our diagonal approximation, yielding an optimization algorithm with linear time and memory complexity (§3). Experimentally, through three tasks in CV and NLP with popular deep neural networks, including ResNets (He et al., 2016), LSTMs (Hochreiter & Schmidhuber, 1997) and Transformers (Vaswani et al., 2017), we demonstrate that APOLLO significantly outperforms SGD and variants of Adam in terms of both convergence speed and generalization performance (§4).
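The overall idea can be sketched as follows. This is a minimal illustrative sketch, not the exact update rule derived in §3: it maintains a diagonal curvature estimate `B` fitted to a weak-secant-style condition sᵀBs ≈ sᵀy, and preconditions the gradient with the rectified absolute value of `B`; all names and the particular correction term are our assumptions for illustration.

```python
import numpy as np

def diag_quasi_newton_step(theta, grad, prev_theta, prev_grad, B,
                           lr=0.5, eps=1e-4):
    """Illustrative diagonal quasi-Newton step (not the exact APOLLO
    update): fit a diagonal Hessian approximation B to a weak-secant-style
    condition, then rectify it to guarantee positive curvature."""
    s = theta - prev_theta            # parameter displacement
    y = grad - prev_grad              # gradient difference
    # Correct B along diag(s^2) so that s^T B s moves toward s^T y
    # (one simple choice of diagonal correction).
    alpha = (s @ y - s @ (B * s)) / (np.sum(s ** 4) + eps)
    B = B + alpha * s ** 2
    # Nonconvexity: the rectified absolute value of B is positive,
    # so the preconditioned direction is always a descent direction.
    D = np.maximum(np.abs(B), eps)
    return theta - lr * grad / D, B
```

On a convex quadratic, the rectification is inactive and the step behaves like a diagonal Newton step; near negative curvature, |B| flips the sign of the curvature estimate instead of following an ascent direction.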

2. BACKGROUND

In this section, we set up notation for nonconvex stochastic optimization, briefly review (quasi-)Newton methods, and discuss the problems that arise when applying quasi-Newton methods to nonconvex stochastic optimization, which we study in the rest of the paper.

2.1. NONCONVEX STOCHASTIC OPTIMIZATION

In this paper, we consider the following stochastic optimization problem:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) = \mathbb{E}\left[l(\theta; \Gamma)\right] \quad (1)$$

where $l : \mathbb{R}^d \times \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable (and possibly nonconvex) function, $\theta \in \mathbb{R}^d$ denotes the parameter to be optimized, $\Gamma \in \mathbb{R}^n$ denotes a random variable with distribution function $P$, and $\mathbb{E}[\cdot]$ denotes the expectation w.r.t. $\Gamma$. Intuitively, $\Gamma$ incorporates noise in $f$, leading to a stochastic objective function. A special case of (1) that arises frequently in machine learning is the empirical risk minimization problem:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) = \frac{1}{N} \sum_{i=1}^{N} l_i(\theta) \quad (2)$$

where $l_i : \mathbb{R}^d \to \mathbb{R}$ is the loss function corresponding to the $i$-th data sample, and $N$ is the number of data samples, which is assumed to be extremely large. Objective functions may also have sources of noise other than data subsampling, such as dropout (Srivastava et al., 2014) in deep neural networks.

Decoupled Parameters. In this work, we consider a setting of decoupled parameters: $\theta = \{\theta^{(l)}, \ l = 1, \ldots, L\}$. Intuitively, under this setting the parameter $\theta$ is decoupled into a sequence of parameters serving different functionalities. For example, in neural network training, the parameters of a neural network can be naturally decoupled into the parameters of different layers or modules.
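The connection between the empirical risk above and the stochastic gradients actually used in training can be illustrated with a toy example. This is a sketch under assumed per-example quadratic losses $l_i(\theta) = \frac{1}{2}(\theta - x_i)^2$; the data and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical risk f(theta) = (1/N) * sum_i l_i(theta) over N data points,
# with hypothetical per-example losses l_i(theta) = 0.5 * (theta - x_i)^2.
N = 10_000
x = rng.normal(loc=3.0, scale=1.0, size=N)

def full_gradient(theta):
    """Exact gradient of the empirical risk: averages over all N samples."""
    return np.mean(theta - x)

def stochastic_gradient(theta, batch_size=32):
    """Minibatch gradient: subsamples the data, giving a noisy but
    unbiased estimate of the full gradient at O(batch_size) cost."""
    idx = rng.integers(0, N, size=batch_size)
    return np.mean(theta - x[idx])
```

Averaging many minibatch gradients recovers the full gradient, which is why subsampling noise can be modeled by the random variable $\Gamma$ in the stochastic objective.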

2.2. NEWTON AND QUASI-NEWTON METHODS

Newton's method usually employs the following update to solve (1):

$$\theta_{t+1} = \theta_t - H_t^{-1} g_t$$

where $g_t = \nabla f(\theta_t)$ is the gradient at $\theta_t$ and $H_t = \nabla^2 f(\theta_t)$ is the Hessian matrix. The convergence rate of Newton's method is quadratic under standard assumptions (Nocedal & Wright, 2006). However, major challenges with this method are i) the expensive computation of the inverse Hessian at every iteration and the corresponding quadratic memory complexity; and ii) the limitation to convex functions (nonconvexity results in negative curvature of $H_t$ and misleads the update directions).
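A single Newton update can be written down directly. The sketch below assumes exact gradient and Hessian oracles (`grad_fn` and `hess_fn` are illustrative names); the dense linear solve is exactly the per-iteration cost that makes the method impractical for large $d$.

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    """One full Newton update: theta - H^{-1} g. Forming the Hessian
    takes O(d^2) memory and solving the linear system O(d^3) time,
    which is prohibitive for high-dimensional models."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    # Solve H p = g rather than explicitly inverting H (cheaper, more stable).
    return theta - np.linalg.solve(H, g)
```

For a convex quadratic $f(\theta) = \frac{1}{2}\theta^\top A \theta - b^\top \theta$, a single step lands exactly on the minimizer; for nonconvex $f$, negative eigenvalues of $H_t$ flip components of the step toward ascent directions, which is the nonconvexity issue noted above.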

