STOCHASTIC GRADIENT METHODS WITH PRECONDITIONED UPDATES

Abstract

This work considers non-convex finite-sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new 'scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL condition are assumed. Because our adaptively scaled methods use approximate partial second order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.

1. INTRODUCTION

This work considers the following, possibly nonconvex, finite-sum optimization problem:

$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $w \in \mathbb{R}^d$ is the model/weight parameter and the loss functions $f_i : \mathbb{R}^d \to \mathbb{R}$, $\forall i \in [n] := \{1, \dots, n\}$, are smooth and twice differentiable. Throughout this work it is assumed that (1) has an optimal solution, with a corresponding optimal value, denoted by $w^*$ and $P^* = P(w^*)$, respectively. Problems of the form (1) cover a plethora of applications, including empirical risk minimization, deep learning, and supervised learning tasks such as regularized least squares or logistic regression (Shalev-Shwartz & Ben-David, 2014). Stochastic gradient-type methods, such as SARAH (Nguyen et al., 2017), to name just a few, are natural candidates for solving (1). In general, these methods are simple, have low per-iteration computational costs, and are often able to find an ε-optimal solution to (1) quickly when ε > 0 is not too small. However, they often have several hyper-parameters that can be difficult to tune, they can struggle when applied to ill-conditioned problems, and many iterations may be required to find a high accuracy solution.

Non-convex instances of the optimization problem (1) (for example, arising from deep neural networks (DNNs)) have been attracting the attention of researchers of late, and new algorithms are being developed to fill this gap (Ghadimi & Lan, 2013; Ghadimi et al., 2016; Lei et al., 2017; Li et al., 2021b). Of particular relevance to this work is the PAGE algorithm presented in Li et al. (2021a). The algorithm is conceptually simple, involving only a single loop and a small number of parameters, and can be applied to non-convex problems (1). The main update involves either a minibatch SGD direction, or the previous gradient with a small adjustment (similar to that in SARAH (Nguyen et al., 2017)). The Loopless SVRG (L-SVRG) method (Hofmann et al., 2015; Qian et al., 2021) is also of particular interest here.
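The SARAH-style recursive estimator mentioned above (which PAGE also builds on) can be sketched on a toy least-squares finite sum. The problem data, step size, and restart schedule below are illustrative choices of ours, not values from this work:

```python
import numpy as np

# Toy finite sum: P(w) = (1/n) sum_i 0.5 * (a_i^T w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_fi(w, i):
    # Gradient of the i-th component loss f_i.
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    # Gradient of the full objective P.
    return A.T @ (A @ w - b) / n

lr = 0.01
w_prev = np.zeros(d)
v = full_grad(w_prev)              # anchor the estimator at a full gradient
w = w_prev - lr * v
for t in range(1, 1000):
    if t % 100 == 0:
        v = full_grad(w)           # periodic full-gradient restart
    else:
        i = rng.integers(n)
        # SARAH-style recursion: v_t = v_{t-1} + grad_fi(w_t) - grad_fi(w_{t-1})
        v = v + grad_fi(w, i) - grad_fi(w_prev, i)
    w_prev, w = w, w - lr * v

# The full gradient norm should have shrunk substantially from the start.
```

The recursion reuses the previous estimate and corrects it with a single cheap stochastic gradient difference, which is the "previous gradient with a small adjustment" described above.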
It is a simpler 'loopless' variant of SVRG which, unlike PAGE, involves an unbiased estimator of the gradient, and it can be applied to non-convex instances of problem (1).

For problems that are poorly scaled and/or ill-conditioned, second order methods that incorporate curvature information, such as Newton or quasi-Newton methods (Dennis Jr. & Moré, 1977; Fletcher, 1987; Nocedal & Wright, 2006), can often outperform first order methods. Unfortunately, they can also be prohibitively expensive, in terms of both computational and storage costs. Several works have tried to reduce the potentially high cost of second order methods by using only approximate, or partial, curvature information. Some of these stochastic second order and quasi-Newton methods (Jahani et al., 2021a; 2020) have shown good practical performance for some machine learning problems, although, possibly due to noise in the Hessian approximation, they sometimes perform similarly to first order variants.

An alternative approach to enhancing search directions is to use a preconditioner. In RMSProp, for instance, at each iteration the diagonal preconditioner is updated as a convex combination (using a momentum parameter β2) of the (square of the) previous preconditioner and the Hadamard product of the current gradient with itself. So, gradient information from all the previous iterates is included in the preconditioner, but there is a preference for more recent information. Adam (Kingma & Ba, 2015) combines the positive features of Adagrad and RMSProp, but it also uses a first moment estimate of the gradient, providing a kind of additional momentum. Adam performs well in practice, and is among the most popular algorithms for DNN training. Recently, second order preconditioners that use approximate and/or partial curvature information have been developed and studied.
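The RMSProp-style preconditioner update just described can be sketched as follows; the values of β2, the warm-up length, and the damping constant are illustrative choices, not values from this work:

```python
import numpy as np

beta2 = 0.999  # momentum parameter for the preconditioner

def rmsprop_precond_update(v, g):
    # Convex combination of the previous (squared) preconditioner and
    # the Hadamard product of the current gradient with itself.
    return beta2 * v + (1.0 - beta2) * g * g

g = np.array([10.0, 0.1])   # badly scaled gradient: coordinates differ by 100x
v = np.zeros_like(g)
for _ in range(100):
    v = rmsprop_precond_update(v, g)

# The preconditioned direction divides elementwise by sqrt(v); the small
# constant guards against division by zero.
step = g / (np.sqrt(v) + 1e-8)
# Both coordinates now receive comparable effective step sizes, even
# though the raw gradient components differ by two orders of magnitude.
```

This equalizing effect is exactly why such preconditioners help on badly scaled problems: the per-coordinate scaling adapts to the observed gradient magnitudes.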
The approach in AdaHessian (Yao et al., 2020) was to use a diagonal preconditioner motivated by Hutchinson's approximation to the diagonal of the Hessian (Bekas et al., 2007), but that also stayed close to some of the approximations used in existing methods such as Adam (Kingma & Ba, 2015) and Adagrad (Duchi et al., 2011). Because of this, the approximation often differed markedly from the true diagonal of the Hessian, and therefore it did not always capture good enough curvature information to be helpful. The work OASIS (Jahani et al., 2021b) proposed a preconditioner that was closely based upon Hutchinson's approach, provided a more accurate estimate of the diagonal of the Hessian, and correspondingly led to improved numerical behaviour in practice. The preconditioner presented in Jahani et al. (2021b) is adopted here.
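Hutchinson's estimator approximates the diagonal of a symmetric matrix H via diag(H) ≈ (1/m) Σ_k z_k ⊙ (H z_k), where the z_k are Rademacher probe vectors. The sketch below uses an explicit toy matrix of our choosing; in practice H z is obtained from a Hessian-vector product without ever forming H:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.normal(size=(d, d))
H = B.T @ B                    # symmetric PSD stand-in for a Hessian

est = np.zeros(d)
m = 2000                       # number of Rademacher probes
for _ in range(m):
    z = rng.choice([-1.0, 1.0], size=d)   # Rademacher probe vector
    est += z * (H @ z)                    # Hadamard product z ⊙ (H z)
est /= m

# est approximates diag(H): the diagonal terms survive in expectation,
# while the off-diagonal cross terms average out (E[z_i z_j] = 0, i != j).
```

The estimator is unbiased, and its per-coordinate variance depends only on the off-diagonal mass of H, which is why a modest number of probes often suffices.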

1.1. NOTATION AND ASSUMPTIONS

Given a Positive Definite (PD) matrix $D \in \mathbb{R}^{d \times d}$, the weighted Euclidean norm is defined to be $\|x\|_D^2 = x^\top D x$, where $x \in \mathbb{R}^d$. The symbol $\odot$ denotes the Hadamard product, and $\mathrm{diag}(x)$ denotes the $d \times d$ diagonal matrix whose diagonal entries are the components of the vector $x \in \mathbb{R}^d$. Recall that problem (1) is assumed to have an optimal (possibly not unique) solution $w^*$, with corresponding optimal value $P^* = P(w^*)$. As is standard for stochastic algorithms, the convergence guarantees presented in this work will develop a bound on the number of iterations $T$ required to push the expected squared norm of the gradient below some error tolerance $\varepsilon > 0$, i.e., to find a $\hat{w}_T$ satisfying

$$\mathbb{E}\big[\|\nabla P(\hat{w}_T)\|_2^2\big] \le \varepsilon^2. \qquad (2)$$

A point $\hat{w}_T$ satisfying (2) is referred to as an ε-optimal solution. Importantly, $\hat{w}_T$ is some iterate generated in the first $T$ iterations of each algorithm, but it is not necessarily the $T$-th iterate. Throughout this work we assume that each $f_i : \mathbb{R}^d \to \mathbb{R}$ and $P : \mathbb{R}^d \to \mathbb{R}$ are twice differentiable and also $L$-smooth. This is formalized in the following assumption.

Assumption 1.1 ($L$-smoothness). For all $i \in [n]$, $f_i$ and $P$ are assumed to be twice differentiable and $L$-smooth, i.e., $\forall i \in [n]$, $\forall w, w' \in \mathrm{dom}(f_i)$ we have $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|$, and $\forall w, w' \in \mathrm{dom}(P)$ we have $\|\nabla P(w) - \nabla P(w')\| \le L \|w - w'\|$.
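The notation above can be checked numerically; the vectors and matrices below are toy values of our choosing:

```python
import numpy as np

# Weighted norm ||x||_D^2 = x^T D x for a positive definite diagonal D,
# and the Hadamard product (elementwise multiplication).
x = np.array([1.0, -2.0, 3.0])
D = np.diag([2.0, 1.0, 0.5])

weighted_sq_norm = x @ D @ x          # 2*1 + 1*4 + 0.5*9 = 10.5
hadamard = x * x                      # x ⊙ x = (1, 4, 9)

# For diagonal D the two notions are linked: x^T D x = <diag(D), x ⊙ x>.
assert np.isclose(weighted_sq_norm, np.dot(np.diag(D), hadamard))

# L-smoothness sanity check for a quadratic f(w) = 0.5 w^T H w, whose
# gradient is H w and whose Lipschitz constant is the largest eigenvalue.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.max(np.linalg.eigvalsh(H))
w1, w2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert np.linalg.norm(H @ w1 - H @ w2) <= L * np.linalg.norm(w1 - w2) + 1e-12
```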



There are several methods for problems of the form (1) which use what we call a 'first order preconditioner' (a preconditioner built using gradient information), including Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2015). Adagrad (Duchi et al., 2011) incorporates a diagonal preconditioner that is built using accumulated gradient information from the previous iterates. The preconditioner allows every component of the current gradient to be scaled adaptively, but it has the disadvantage that the elements of the preconditioner tend to grow rapidly as iterations progress, leading to a quickly decaying learning rate. A method that maintains the ability to adaptively scale elements of the gradient, but overcomes the drawback of a rapidly decreasing learning rate, is RMSProp. It does this by including a momentum parameter β2 in the update for the diagonal preconditioner.
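The contrast drawn above, Adagrad's ever-growing accumulator versus RMSProp's momentum-damped one, can be seen numerically. The recurring gradient and hyper-parameter values below are illustrative, not taken from any of the cited works:

```python
import numpy as np

g = np.array([1.0, 0.5])        # suppose the same gradient recurs each step
v_adagrad = np.zeros(2)
v_rmsprop = np.zeros(2)
beta2 = 0.9                     # RMSProp momentum parameter
for _ in range(1000):
    v_adagrad += g * g                                   # sums without decay
    v_rmsprop = beta2 * v_rmsprop + (1 - beta2) * g * g  # damped average

lr = 0.1
step_adagrad = lr * g / (np.sqrt(v_adagrad) + 1e-8)
step_rmsprop = lr * g / (np.sqrt(v_rmsprop) + 1e-8)
# Adagrad's effective step has shrunk like 1/sqrt(T) after T steps,
# while RMSProp's has settled at a stable scale.
```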

