STOCHASTIC GRADIENT METHODS WITH PRECONDITIONED UPDATES

Abstract

This work considers non-convex finite-sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new 'scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because our adaptively scaled methods use approximate partial second-order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.
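The diagonal preconditioner mentioned above rests on Hutchinson's identity: for a Rademacher vector z (entries ±1), E[z ⊙ Hz] = diag(H). The sketch below is a minimal illustration on a toy quadratic, not the paper's implementation; the test matrix, sample count, and the finite-difference approximation of the Hessian-vector product Hz are all illustrative choices.

```python
import numpy as np

# Toy quadratic P(w) = 0.5 * w^T A w, so the Hessian is A.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0, 100.0])          # deliberately badly scaled

def grad(w):
    return A @ w

def hutchinson_diag(grad, w, n_samples=200, eps=1e-5, rng=rng):
    """Estimate diag(Hessian) at w via Hutchinson's method:
    average z * (H z) over Rademacher probes z, where H z is
    approximated by a central finite difference of the gradient."""
    d = w.shape[0]
    est = np.zeros(d)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=d)                       # Rademacher probe
        hz = (grad(w + eps * z) - grad(w - eps * z)) / (2 * eps)  # ~ H z
        est += z * hz
    return est / n_samples

w = rng.standard_normal(3)
print(hutchinson_diag(grad, w))   # close to [1, 10, 100]
```

In a scaled method, the (damped, positive-clipped) reciprocal of this estimate multiplies the gradient step, so poorly scaled coordinates receive proportionally larger steps.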

1. INTRODUCTION

This work considers the following, possibly non-convex, finite-sum optimization problem:

$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $w \in \mathbb{R}^d$ is the model/weight parameter and the loss functions $f_i : \mathbb{R}^d \to \mathbb{R}$ for all $i \in [n] := \{1, \dots, n\}$ are smooth and twice differentiable. Throughout this work it is assumed that (1) has an optimal solution and a corresponding optimal value, denoted by $w^*$ and $P^* = P(w^*)$, respectively. Many stochastic gradient methods have been developed for problems of this form. In general, these methods are simple, have low per-iteration computational costs, and are often able to find an $\varepsilon$-optimal solution to (1) quickly when $\varepsilon > 0$ is not too small. However, they often have several hyper-parameters that can be difficult to tune, they can struggle when applied to ill-conditioned problems, and many iterations may be required to find a high-accuracy solution.

Non-convex instances of the optimization problem (1) (for example, arising from deep neural networks (DNNs)) have been attracting the attention of researchers of late, and new algorithms are being developed to fill this gap (Ghadimi & Lan, 2013; Ghadimi et al., 2016; Lei et al., 2017; Li et al., 2021b). Of particular relevance to this work is the PAGE algorithm presented in Li et al. (2021a). The algorithm is conceptually simple, involving only one loop and a small number of parameters, and can be applied to non-convex problems (1). The main update involves either a minibatch SGD direction, or the previous gradient with a small adjustment (similar to that in SARAH (Nguyen et al., 2017)). The Loopless SVRG (L-SVRG) method (Hofmann et al., 2015; Qian et al., 2021) is also of particular interest here. It is a simpler 'loopless' variant of SVRG, which, unlike for PAGE, involves
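The PAGE-style update described above can be sketched as follows. This is an illustrative toy on a least-squares finite sum, not the authors' implementation; the step size, minibatch size, and switching probability p are arbitrary choices for demonstration.

```python
import numpy as np

# Least-squares finite sum: f_i(w) = 0.5 * (x_i^T w - y_i)^2  (toy problem).
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_batch(w, idx):
    """Average gradient of f_i over the indices in idx."""
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

def page(w, steps=500, lr=0.05, b=8, p=0.1):
    """PAGE-style loop: with probability p refresh with a full-batch
    gradient, otherwise reuse the previous estimate plus a minibatch
    correction (the SARAH-style adjustment)."""
    g = grad_batch(w, np.arange(n))            # start from the full gradient
    for _ in range(steps):
        w_prev, w = w, w - lr * g
        if rng.random() < p:
            g = grad_batch(w, np.arange(n))    # full-gradient refresh
        else:
            idx = rng.choice(n, size=b, replace=False)
            g = g + grad_batch(w, idx) - grad_batch(w_prev, idx)
    return w

w = page(np.zeros(d))
print(np.linalg.norm(X @ w - y))   # residual is driven toward zero
```

The single loop and the two-way branch (full gradient with probability p, cheap correction otherwise) are the features that make PAGE conceptually simple; an L-SVRG loop has a similar loopless structure but keeps a separately stored reference point.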

