APOLLO: AN ADAPTIVE PARAMETER-WISE DIAGONAL QUASI-NEWTON METHOD FOR NONCONVEX STOCHASTIC OPTIMIZATION

Abstract

In this paper, we introduce APOLLO, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian with a diagonal matrix. Importantly, updating and storing this diagonal approximation of the Hessian is as efficient as in adaptive first-order optimization methods, with linear time and memory complexity. To handle nonconvexity, we replace the Hessian with its rectified absolute value, which is guaranteed to be positive definite. Experiments on three vision and language tasks show that APOLLO achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance. The implementation of the algorithm is available at an anonymous link.
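To make the kind of update described above concrete, below is a minimal sketch of one parameter-wise diagonal quasi-Newton step with a rectified absolute value. The weak secant fit used to update the diagonal, the function and variable names, and the constant `eps` are illustrative assumptions, not the exact APOLLO algorithm, which involves additional details such as gradient momentum and bias correction.

```python
import numpy as np

def diagonal_quasi_newton_step(param, grad, prev_grad, prev_step, B,
                               lr=0.01, eps=1e-4):
    """One illustrative diagonal quasi-Newton update (hypothetical sketch).

    B holds the diagonal Hessian approximation, one entry per parameter,
    so storage and update cost are linear, as with first-order methods.
    The diagonal is fitted to the weak secant condition s^T B s = s^T y
    (s: previous parameter step, y: change in gradient), then replaced by
    its rectified absolute value max(|B|, eps), so the preconditioner is
    positive definite even on nonconvex losses.
    """
    s, y = prev_step, grad - prev_grad
    denom = np.sum(s ** 4)  # equals s^T diag(s^2) s
    if denom > 0:
        # Least-change correction of the diagonal under the weak secant
        # condition: B <- B + lambda * s^2 with the Lagrange multiplier
        # lambda = (s^T y - s^T B s) / sum_i s_i^4.
        B = B + (np.dot(s, y) - np.dot(s, B * s)) / denom * s ** 2
    D = np.maximum(np.abs(B), eps)  # rectified absolute value
    step = -lr * grad / D           # diagonally preconditioned step
    return param + step, B, step
```

In a training loop, this would be called once per mini-batch, carrying `B`, the previous gradient, and the previous step between iterations; on the first iteration one can initialize `B` to ones and pass a zero previous step, in which case the secant fit is skipped and the step reduces to a plain scaled-gradient step.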

1. INTRODUCTION

Nonconvex stochastic optimization is of core practical importance in many fields of machine learning, in particular for training deep neural networks (DNNs). First-order gradient-based optimization algorithms, conceptually attractive due to their linear time and memory complexity, have led to tremendous progress and impressive successes. A number of advanced first-order algorithms have emerged over the years in pursuit of fast and stable convergence, among which stochastic gradient descent (SGD) (Robbins & Monro, 1951; LeCun et al., 1998), equipped with momentum (Rumelhart et al., 1985; Qian, 1999; Bottou & Bousquet, 2008), has stood out for its simplicity and effectiveness across a wide range of applications (Hinton & Salakhutdinov, 2006; Hinton et al., 2012; Graves, 2013). However, SGD scales the gradient uniformly in all directions, which limits its convergence speed and makes the choice of learning rate sensitive; this shortcoming has spawned considerable recent interest in accelerating SGD from both algorithmic and practical perspectives.

Recently, many adaptive first-order optimization methods have been proposed to achieve rapid training progress with element-wise scaled learning rates; we can mention only a few here due to space limits. In their pioneering work, Duchi et al. (2011) proposed AdaGrad, which scales the gradient by the square root of the accumulated squared gradients from the first iteration. While AdaGrad works well in sparse settings, its performance degrades significantly in dense settings, primarily due to the monotonic growth of the accumulation. Subsequently, several methods have been proposed that limit the accumulation to a small window of past iterations, in particular by exponentially reducing the weight of earlier iterations. Notable methods built on this idea are RMSProp (Tieleman & Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma & Ba, 2015), among which Adam has become the default optimization algorithm across many deep learning applications because of its fast convergence speed and relatively consistent hyper-parameter choices (Ruder, 2016; Zhang et al., 2020). However, it has been observed that these adaptive methods may converge to bad or suspicious local optima, resulting in worse generalization than their non-adaptive counterparts (Wilson et al., 2017), or may fail to converge due to unstable and extreme learning rates (Luo et al., 2019).

Quasi-Newton methods have been widely used in solving convex optimization problems, due to their efficient computation and fast convergence rate (Broyden, 1967; Dennis & Moré, 1977). However, the stochastic, high-dimensional and nonconvex nature of many machine learning tasks, such as

