STOCHASTIC GRADIENT METHODS WITH PRECONDITIONED UPDATES

Abstract

This work considers non-convex finite-sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new 'scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because our adaptively scaled methods use approximate partial second order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.

1. INTRODUCTION

This work considers the following, possibly nonconvex, finite-sum optimization problem:
$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$
where $w \in \mathbb{R}^d$ is the model/weight parameter and the loss functions $f_i : \mathbb{R}^d \to \mathbb{R}$, $\forall i \in [n] := \{1, \dots, n\}$, are smooth and twice differentiable. Throughout this work it is assumed that (1) has an optimal solution, with a corresponding optimal value, denoted by $w^*$ and $P^* = P(w^*)$, respectively. Problems of the form (1) cover a plethora of applications, including empirical risk minimization, deep learning, and supervised learning tasks such as regularized least squares or logistic regression (Shalev-Shwartz & Ben-David, 2014). This minimization problem can be difficult to solve, particularly when the number of training samples $n$, or the problem dimension $d$, is large, or when the problem is nonconvex. Stochastic Gradient Descent (SGD) is one of the most widely known methods for problem (1), and its origins date back to the 1950s with the work of Robbins & Monro (1951). The explosion of interest in machine learning has led to an immediate need for reliable and efficient algorithms for solving (1). Motivated by, and aiming to improve upon, vanilla SGD, many novel methods have been developed for convex and/or strongly convex instances of (1), including SAG/SAGA (Le Roux et al., 2012; Defazio et al., 2014), SDCA (Shalev-Shwartz & Zhang, 2013), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), S2GD (Konečný & Richtárik, 2017) and SARAH (Nguyen et al., 2017), to name just a few. In general, these methods are simple, have low per-iteration computational costs, and are often able to find an ε-optimal solution to (1) quickly when ε > 0 is not too small. However, they often have several hyper-parameters that can be difficult to tune, they can struggle when applied to ill-conditioned problems, and many iterations may be required to find a high accuracy solution.
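As a concrete illustration of the setup, a vanilla SGD loop for a finite-sum least-squares instance of (1) can be sketched as follows; this is our own toy example (all names and data are hypothetical, not from the paper):

```python
import numpy as np

# Sketch: minimize P(w) = (1/n) * sum_i f_i(w) with vanilla SGD, where
# f_i(w) = 0.5 * (x_i^T w - y_i)^2 is a single-sample least-squares loss.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)           # noiseless targets, so P* = 0

def grad_fi(w, i):
    """Gradient of the single-sample loss f_i at w."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
eta = 0.01                               # constant learning rate
for _ in range(20000):
    i = rng.integers(n)                  # sample i uniformly from [n]
    w -= eta * grad_fi(w, i)             # SGD step

P = 0.5 * np.mean((X @ w - y) ** 2)      # the finite-sum objective P(w)
```

On this well-scaled toy problem SGD drives the objective toward its optimal value; the methods introduced later target the badly scaled regime where such plain steps struggle.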
Non-convex instances of the optimization problem (1) (for example, arising from deep neural networks (DNNs)) have been attracting the attention of researchers of late, and new algorithms are being developed to fill this gap (Ghadimi & Lan, 2013; Ghadimi et al., 2016; Lei et al., 2017; Li et al., 2021b). Of particular relevance to this work is the PAGE algorithm presented in Li et al. (2021a). The algorithm is conceptually simple, involving only one loop and a small number of parameters, and can be applied to non-convex instances of (1). The main update involves either a minibatch SGD direction, or the previous gradient with a small adjustment (similar to that in SARAH (Nguyen et al., 2017)). The Loopless SVRG (L-SVRG) method (Hofmann et al., 2015; Qian et al., 2021) is also of particular interest here. It is a simpler 'loopless' variant of SVRG which, unlike PAGE, involves an unbiased estimator of the gradient, and it can be applied to non-convex instances of problem (1). For problems that are poorly scaled and/or ill-conditioned, second order methods that incorporate curvature information, such as Newton or quasi-Newton methods (Dennis Jr. & Moré, 1977; Fletcher, 1987; Nocedal & Wright, 2006), can often outperform first order methods. Unfortunately, they can also be prohibitively expensive, in terms of both computational and storage costs. Several works have tried to reduce the potentially high cost of second order methods by using only approximate, or partial, curvature information. Some of these stochastic second order and quasi-Newton (Jahani et al., 2021a; 2020) methods have shown good practical performance on some machine learning problems, although, possibly due to the noise in the Hessian approximation, they sometimes perform similarly to first order variants. An alternative approach to enhancing search directions is to use a preconditioner.
There are several methods for problems of the form (1) that use what we call a 'first order preconditioner', i.e., a preconditioner built using gradient information, including Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2015). Adagrad (Duchi et al., 2011) incorporates a diagonal preconditioner built from accumulated gradient information over the previous iterates. The preconditioner allows every component of the current gradient to be scaled adaptively, but it has the disadvantage that the elements of the preconditioner tend to grow rapidly as iterations progress, leading to a quickly decaying learning rate. A method that maintains the ability to adaptively scale elements of the gradient, but overcomes the drawback of a rapidly decreasing learning rate, is RMSProp. It does this by including a momentum parameter β₂ in the update for the diagonal preconditioner. In particular, at each iteration the updated diagonal preconditioner is taken to be a convex combination (with momentum parameter β₂) of the square of the previous preconditioner and the Hadamard product of the current gradient with itself. Thus, gradient information from all previous iterates is included in the preconditioner, but with a preference for more recent information. Adam (Kingma & Ba, 2015) combines the positive features of Adagrad and RMSProp, but it also uses a first moment estimate of the gradient, providing a kind of additional momentum. Adam performs well in practice, and is among the most popular algorithms for DNN training. Recently, second order preconditioners that use approximate and/or partial curvature information have been developed and studied.
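The diagonal recursions just described can be sketched in a few lines. This is our own illustration (helper names are hypothetical) of the Adagrad and RMSProp preconditioner states and the elementwise scaled step they induce:

```python
import numpy as np

# v is the diagonal preconditioner state, g the current stochastic gradient.
def adagrad_state(v, g):
    # Accumulates squared gradients, so the effective step shrinks quickly.
    return v + g * g

def rmsprop_state(v, g, beta2=0.999):
    # Convex combination: full history retained but biased toward recent g.
    return beta2 * v + (1.0 - beta2) * g * g

g = np.array([0.5, -2.0])
v_ada = adagrad_state(np.zeros(2), g)
v_rms = rmsprop_state(np.zeros(2), g)

# Both methods then scale each gradient component elementwise.
eps = 1e-8
step = g / (np.sqrt(v_rms) + eps)
```

Adam additionally maintains an exponential moving average of the gradient itself (a first moment estimate) and applies bias corrections to both moments.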
The approach in AdaHessian (Yao et al., 2020) was to use a diagonal preconditioner motivated by Hutchinson's approximation to the diagonal of the Hessian (Bekas et al., 2007), but that also stayed close to some of the approximations used in existing methods such as Adam (Kingma & Ba, 2015) and Adagrad (Duchi et al., 2011). Because of this, the approximation often differed markedly from the true diagonal of the Hessian, and therefore it did not always capture good enough curvature information to be helpful. The work OASIS (Jahani et al., 2021b) proposed a preconditioner that was closely based upon Hutchinson's approach; it provided a more accurate estimate of the diagonal of the Hessian, and correspondingly led to improved numerical behaviour in practice. The preconditioner presented in Jahani et al. (2021b) is adopted here.

1.1. NOTATION AND ASSUMPTIONS

Given a Positive Definite (PD) matrix $D \in \mathbb{R}^{d \times d}$, the weighted Euclidean norm is defined as $\|x\|_D^2 = x^T D x$, where $x \in \mathbb{R}^d$. The symbol $\odot$ denotes the Hadamard product, and $\mathrm{diag}(x)$ denotes the $d \times d$ diagonal matrix whose diagonal entries are the components of the vector $x \in \mathbb{R}^d$. Recall that problem (1) is assumed to have an optimal (possibly not unique) solution $w^*$, with corresponding optimal value $P^* = P(w^*)$. As is standard for stochastic algorithms, the convergence guarantees presented in this work develop a bound on the number of iterations $T$ required to push the expected squared norm of the gradient below some error tolerance $\varepsilon > 0$, i.e., to find a $\hat w_T$ satisfying
$$\mathbb{E}\big[\|\nabla P(\hat w_T)\|_2^2\big] \le \varepsilon^2. \qquad (2)$$
A point $\hat w_T$ satisfying (2) is referred to as an ε-optimal solution. Importantly, $\hat w_T$ is some iterate generated in the first $T$ iterations of each algorithm, but it is not necessarily the $T$th iterate. Throughout this work we assume that each $f_i : \mathbb{R}^d \to \mathbb{R}$ and $P : \mathbb{R}^d \to \mathbb{R}$ are twice differentiable and $L$-smooth. This is formalized in the following assumption.

Assumption 1.1 ($L$-smoothness). For all $i \in [n]$, $f_i$ and $P$ are assumed to be twice differentiable and $L$-smooth, i.e., $\forall i \in [n]$, $\forall w, w' \in \mathrm{dom}(f_i)$ we have $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|$, and $\forall w, w' \in \mathrm{dom}(P)$ we have $\|\nabla P(w) - \nabla P(w')\| \le L \|w - w'\|$.

For some of the results in this work, it will also be assumed that the function $P$ satisfies the PL-condition. Note that the PL-condition does not imply convexity (see Footnote 1 in Li et al. (2021a)).

Assumption 1.2 (Polyak-Łojasiewicz condition). A function $P : \mathbb{R}^d \to \mathbb{R}$ satisfies the PL-condition if there exists $\mu > 0$ such that $\|\nabla P(w)\|^2 \ge 2\mu\,(P(w) - P^*)$, $\forall w \in \mathbb{R}^d$.
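To make Assumption 1.2 concrete, a commonly cited one-dimensional example (see, e.g., Karimi et al. (2016)) of a non-convex function satisfying the PL-condition is the following; the constant µ = 1/32 is taken from that reference:

```latex
% P(w) = w^2 + 3\sin^2(w) is non-convex, since
% P''(w) = 2 + 6\cos(2w) changes sign, yet it satisfies the PL-condition:
P(w) = w^2 + 3\sin^2(w), \qquad P'(w) = 2w + 3\sin(2w),
\qquad |P'(w)|^2 \ \ge\ 2\mu\,\big(P(w) - P^*\big)
\ \text{ with } \mu = \tfrac{1}{32},\ P^* = 0.
```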

1.2. CONTRIBUTIONS

Table 1: Comparison of scaled methods for non-convex problems. Notation: ε denotes solution accuracy (2). The 'Tuning of β₂' column shows whether it is easy ('+') or difficult ('−') to tune β₂. Our preconditioner (5) works with any β₂, whereas Adam only supports certain large β₂ ≈ 1 (Défossez et al., 2020; Reddi et al., 2019).

Method        | Reference               | Convergence                                      | Tuning of β₂
--------------|-------------------------|--------------------------------------------------|-------------
Adagrad       | (Duchi et al., 2011)    | ε^{-4} (Zou et al., 2018; Défossez et al., 2020) | +
RMSProp       | (Tieleman et al., 2012) | no theory                                        |
Adam          | (Kingma & Ba, 2015)     | ε^{-4} (Défossez et al., 2020)                   | −
AdaHessian    | (Yao et al., 2020)      | no theory                                        |
OASIS         | (Jahani et al., 2021b)  | ε^{-4}                                           | +
Scaled SARAH  | This work               | ε^{-2}                                           | +
Scaled L-SVRG | This work               | ε^{-2}                                           | +

Table 2: A summary of the main results of this work. Complexities for Scaled SARAH and Scaled L-SVRG are given for non-convex problems (first row, NC), and under the PL assumption (second row). Notation: L = smoothness constant, µ = PL constant, ε = solution accuracy (2), ∆_0 = P(w_0) − P^*, n = data size, and Γ, α are upper and lower bounds on the Hessian approximation.

   | Scaled SARAH                          | Scaled L-SVRG
---|---------------------------------------|------------------------------------------
NC | O(n + (Γ/α)·√n·(L∆_0/ε²))             | O(n + (Γ/α)·n^{2/3}·(L∆_0/ε²))
PL | O(max{n, (Γ/α)·√n·(L/µ)}·log(∆_0/ε))  | O(max{n, (Γ/α)·n^{2/3}·(L/µ)}·log(∆_0/ε))

The main contributions of this work are stated below, and are summarized in Tables 1 and 2.

• Scaled SARAH. The Scaled SARAH algorithm is presented, which couples SARAH with the diagonal preconditioner of Jahani et al. (2021b). The inclusion of the preconditioner results in adaptive scaling of every element of the search direction (negative gradient), which leads to improved practical performance, particularly on ill-conditioned and poorly scaled problems. The algorithm is simple (a single loop) and is easy to tune (Section 3).

• Scaled L-SVRG. The Scaled L-SVRG algorithm is also presented, which is similar to L-SVRG, but with the addition of the diagonal preconditioner of Jahani et al. (2021b). Again, the preconditioner allows all elements of the gradient to be scaled adaptively, the algorithm uses a single loop structure, and for this algorithm an unbiased estimate of the gradient is used. The inclusion of adaptive local curvature information via the preconditioner leads to improvements in practical performance (Appendix A).

• Convergence guarantees. Theoretical guarantees show that both Scaled SARAH and Scaled L-SVRG converge, and we present an explicit bound on the number of iterations required by each algorithm to obtain an iterate that is ε-optimal.
Convergence is guaranteed for both Scaled SARAH and Scaled L-SVRG under a smoothness assumption on the functions f_i. If both smoothness and the PL-condition hold, then improved iteration complexity results for Scaled SARAH and Scaled L-SVRG are obtained, which show that the expected function value gap converges to zero at a linear rate (see Theorems 3.2 and A.2). Our scaled methods achieve the best known rates among all methods with preconditioning for non-convex deterministic and stochastic problems, and Scaled SARAH and Scaled L-SVRG are the first preconditioned methods that achieve a linear rate of convergence under the PL assumption. See a detailed comparison in Section 4.

• Numerical experiments. Extensive numerical experiments were performed (Section 5 and Appendix B) under various parameter settings to investigate the practical behaviour of our new scaled algorithms. The inclusion of preconditioning in Scaled SARAH and Scaled L-SVRG led to improvements in performance compared with no preconditioning in several of the experiments, and Scaled SARAH and Scaled L-SVRG were competitive with, and often outperformed, Adam.

Paper outline. This paper is organised as follows. In Section 2 we describe the diagonal preconditioner used in this work. In Section 3 we describe the new Scaled SARAH algorithm and present theoretical convergence guarantees. In Section 4 we discuss our results for the Scaled SARAH method and compare it with other methods. In Appendix A we introduce the Scaled L-SVRG algorithm, which adapts the L-SVRG algorithm to include a preconditioner. We present numerical experiments demonstrating the practical performance of our proposed methods in Section 5. All proofs (Appendices E and F), additional numerical experiments (Appendix B), and further details and discussion can be found in the appendix.

2. DIAGONAL PRECONDITIONER

In this section we describe the diagonal preconditioner used in this work. The paper of Bekas et al. (2007) described Hutchinson's approximation to the diagonal of the Hessian, and this provided motivation for the diagonal preconditioner proposed in Jahani et al. (2021b), which is adopted here. In particular, given an initial approximation $D_0$ (to be described shortly) and a Hessian approximation momentum parameter $\beta \in (0,1)$ (equivalent to the second moment hyperparameter $\beta_2$ in Adam (Kingma & Ba, 2015)), for all $t \ge 1$,
$$D_t = \beta D_{t-1} + (1-\beta)\,\mathrm{diag}\big(z_t \odot \nabla^2 P_{J_t}(w_t) z_t\big), \qquad (3)$$
where $z_t$ is a random vector with Rademacher distribution, $J_t$ is an index set randomly sampled from $[n]$, and $\nabla^2 P_{J_t}(w_t) = \frac{1}{|J_t|}\sum_{j \in J_t} \nabla^2 f_j(w_t)$. Finally, for $\alpha > 0$ (where the parameter $\alpha$ is equivalent to the corresponding parameter in Adam (Kingma & Ba, 2015) and AdaHessian (Yao et al., 2020)), the diagonal preconditioner is
$$(\hat D_t)_{i,i} = \max\{\alpha, |D_t|_{i,i}\}. \qquad (5)$$
The expression (5) ensures that the preconditioner $\hat D_t$ is always PD, so it is well-defined and results in a descent direction. The absolute values are necessary because the objective function is potentially nonconvex, so the batch Hessian approximation could be indefinite. In fact, even if the Hessian is PD, $D_t$ in (3) may still contain negative elements due to the sampling strategy used. The preconditioner (5) is a good estimate of the diagonal of the (batch) Hessian because Hutchinson's updating formula (3) is used (see also Figures 12 and 13). Hence, it captures accurate curvature information, which is helpful for poorly scaled and ill-conditioned problems. Because the preconditioner is diagonal, it is easy and inexpensive to apply its inverse, and the associated storage costs are low. The preconditioner (3)+(5) depends on the parameter $\beta$: if $\beta = 1$ then the preconditioner is fixed for all iterations, whereas if $\beta = 0$ then the preconditioner is simply a kind of sketched batch Hessian.
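Updates (3) and (5) can be sketched as follows. This is our own toy illustration in which the Hessian is an explicit fixed matrix, so that the estimate can be compared against the true diagonal; in practice only Hessian-vector products are required:

```python
import numpy as np

# Hutchinson-style diagonal Hessian estimate with momentum and truncation.
rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T / d                               # fixed symmetric "Hessian"

def hutch_sample():
    z = rng.choice([-1.0, 1.0], size=d)       # Rademacher vector z_t
    return z * (H @ z)                        # diagonal of diag(z ⊙ Hz)

beta, alpha = 0.999, 1e-5
# Warm start: average several independent samples to initialize D_0.
D = np.mean([hutch_sample() for _ in range(100)], axis=0)
for _ in range(2000):
    D = beta * D + (1.0 - beta) * hutch_sample()   # update (3)
D_hat = np.maximum(alpha, np.abs(D))               # truncation (5): keeps PD

rel_err = np.linalg.norm(D_hat - np.diag(H)) / np.linalg.norm(np.diag(H))
```

Each sample z ⊙ Hz is an unbiased estimate of diag(H), and the momentum average concentrates around it, which is what makes the truncated estimate (5) a faithful diagonal preconditioner.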
Taking $0 < \beta < 1$ gives a convex combination of the previous approximation and the current approximation, thereby ensuring that the entire history is included in the preconditioner, but damped by $\beta$, while the most recent information is also present. See Appendix G for a detailed discussion of the choice of $\beta$. The main computational cost of the approximation in (3) is the (batch) Hessian-vector product $\nabla^2 P_{J_t}(w_t) z_t$. Fortunately, this can be efficiently calculated using two rounds of back propagation. Moreover, the preconditioner is matrix-free: it simply needs an oracle returning the Hessian-vector product, and does not need explicit access to the batch Hessian itself; see Appendix B in Jahani et al. (2021b). Therefore, the costs (both computational and storage) for this preconditioner are not burdensome. As previously mentioned, the approximation (3) requires an initial estimate $D_0$ of the diagonal of the Hessian, and this is critical to the success of the preconditioner. In particular, one must take
$$D_0 = \frac{1}{m} \sum_{j=1}^{m} \mathrm{diag}\big(z_j \odot \nabla^2 P_{J_j}(w_0) z_j\big),$$
where the $J_j$ denote sampled batches and the vectors $z_j$ are generated from a Rademacher distribution. This ensures that $\hat D_t$ does indeed approximate the diagonal of the Hessian; see Section 3.3 in Jahani et al. (2021b). The following lemma confirms that the diagonal preconditioner is both PD and bounded.

Lemma 2.1 (See Remark 4.10 in Jahani et al. (2021b)). For any $t \ge 1$, we have $\alpha I \preceq \hat D_t \preceq \Gamma I$, where $0 < \alpha \le \Gamma = \sqrt{d}\,L$. Note that Remark 4.10 is proved incorrectly in Jahani et al. (2021b).

3. SCALED SARAH

Here we propose a new algorithm, Scaled SARAH, for finite-sum optimization (1). Our algorithm is similar to the SARAH algorithm (Nguyen et al., 2017) and the PAGE algorithm (Li et al., 2021a), but a key difference is that Scaled SARAH includes the option of a preconditioner $\hat D_t$, for all $t \ge 0$, with a preconditioned approximate gradient step. Scaled SARAH is presented now as Algorithm 1.
Algorithm 1 Scaled SARAH
1: Input: initial point w_0, learning rate η, preconditioner D̂_0, probability p
2: v_0 = ∇P(w_0)
3: for t = 0, 1, 2, ... do
4:   w_{t+1} = w_t − η D̂_t^{-1} v_t
5:   Generate independently batches i_{t+1} for v_{t+1} and J_t for D̂_{t+1}
6:   v_{t+1} = ∇P(w_{t+1}) with probability p, or v_t + ∇f_{i_{t+1}}(w_{t+1}) − ∇f_{i_{t+1}}(w_t) with probability 1 − p
7:   Update the preconditioner D̂_{t+1}
8: end for
9: Output: ŵ_T chosen uniformly from {w_t}_{t=0}^T

In each iteration of Algorithm 1 an update is computed in Step 4. The point w_t is adjusted by taking a step of fixed size η in the direction −D̂_t^{-1} v_t. The vector v_t approximates the gradient, and the preconditioner scales that direction. A key difference between Scaled SARAH and PAGE/SARAH is the inclusion of the preconditioner D̂_t^{-1} in this step. Step 6 defines the next gradient estimator v_{t+1}, for which there are two options. With probability p the full gradient is used. Alternatively, with probability 1 − p, the new gradient estimate is the previous gradient approximation v_t, with an adjustment term that involves the difference between the gradient of f_i evaluated at w_{t+1} and at w_t. The search direction computed in Scaled SARAH contains gradient information, while the preconditioner described in Section 2 contains approximate second order information. When this preconditioner is applied to the gradient estimate, each dimension is scaled adaptively depending on the corresponding curvature. Intuitively, this amplifies dimensions with low curvature (shallow loss surfaces), while damping directions with high curvature (sharper loss surfaces). The aim is for D̂_t^{-1} v_t to point in a better, adjusted direction compared with v_t. Scaled SARAH is a single loop algorithm, so it is conceptually simple. If p = 1 then the algorithm always picks the first option in Step 6, so that Scaled SARAH reduces to a preconditioned GD method. On the other hand, if p = 0, then only the second option in Step 6 is used.
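A minimal sketch of Algorithm 1 follows; this is our own toy instantiation on a logistic-regression problem, with the preconditioner frozen at the identity so that the SARAH-style estimator logic stands out (substituting the Hutchinson preconditioner of Section 2 for `D_inv` gives the scaled method; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

def grad_fi(w, i):
    s = 1.0 / (1.0 + np.exp(-y[i] * (X[i] @ w)))   # sigmoid(y_i x_i^T w)
    return -(1.0 - s) * y[i] * X[i]

def full_grad(w):
    return np.mean([grad_fi(w, i) for i in range(n)], axis=0)

eta, p, T = 0.2, 1.0 / (n + 1), 5000
D_inv = np.ones(d)                      # stand-in for the inverse preconditioner
w = np.zeros(d)
v = full_grad(w)                        # Step 2
for _ in range(T):
    w_next = w - eta * D_inv * v        # Step 4: preconditioned step
    if rng.random() < p:
        v = full_grad(w_next)           # Step 6, first option (prob. p)
    else:
        i = rng.integers(n)             # Step 6, second option (prob. 1 - p)
        v = v + grad_fi(w_next, i) - grad_fi(w, i)
    w = w_next

g_norm = np.linalg.norm(full_grad(w))
```

With p = 1/(n+1), the full gradient is recomputed roughly once per effective data pass while each intermediate iteration costs only two single-sample gradients.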
Notice that Scaled SARAH is a combination of both the PAGE and SARAH algorithms, coupled with a preconditioner. SARAH (Nguyen et al., 2017) is a double loop algorithm, where the inner loop is defined in the same way as the update in Step 6. PAGE (Li et al., 2021a) is based upon SARAH, but PAGE uses a single loop structure and allows minibatches to be used in the gradient approximation v_{t+1} (rather than the single component as in Step 6). Scaled SARAH shares the same single loop structure as PAGE, but also shares the same single component update for the gradient estimator as SARAH (no minibatches). However, different from both PAGE and SARAH, Scaled SARAH uses a preconditioner in Step 4. In the remainder of this work we focus on a particular instance of Scaled SARAH, which uses a fixed probability p_t = p, and uses the diagonal preconditioner presented in Section 2. These choices have been made because a central goal of this work is to understand the impact that a well chosen preconditioner has on poorly scaled problems. Convergence guarantees and the results of numerical experiments will be presented using this setup. Theoretical results for Scaled SARAH are presented now. In particular, we present complexity bounds on the number of iterations required by Scaled SARAH to obtain an ε-optimal solution for the non-convex problem (1) (recall Section 1.1 and (2)). The first result holds under Assumption 1.1, while the second theorem holds under both the smoothness and PL assumptions. First, we define the following step-size bound:
$$\bar\eta = \frac{\alpha}{L\sqrt{1 + \frac{1-p}{p}}}. \qquad (7)$$

Theorem 3.1. Suppose that Assumption 1.1 holds, let ε > 0, let p denote the probability, and let the step-size satisfy η ≤ η̄ in (7). Then, the number of iterations performed by Scaled SARAH, starting from an initial point w_0 ∈ R^d with ∆_0 = P(w_0) − P^*, required to obtain an ε-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\frac{\Gamma}{\alpha} \cdot \frac{\Delta_0 L}{\varepsilon^2} \sqrt{1 + \frac{1-p}{p}}\right).$$

Theorem 3.2.
Suppose that Assumptions 1.1 and 1.2 hold, let ε > 0, and let the step-size satisfy η ≤ η̄ in (7). Then the number of iterations performed by Scaled SARAH sufficient for finding an ε-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\max\left\{\frac{1}{p},\ \frac{L}{\mu} \cdot \frac{\Gamma}{\alpha}\sqrt{1 + \frac{1-p}{p}}\right\} \log \frac{\Delta_0}{\varepsilon}\right).$$
Note that this last theorem shows that Scaled SARAH exhibits a linear rate of convergence under both the smoothness assumption and the PL-condition. Algorithm 1 calls the full gradient at the beginning (Step 2) and then uses pn + (1 − p) stochastic gradients per iteration in expectation (Step 6). Thus, the number of stochastic gradient computations (i.e., the gradient complexity) is n + T[pn + (1 − p)], and the following corollaries of Theorems 3.1 and 3.2 hold.

Corollary 3.3. Suppose that Assumption 1.1 holds, let ε > 0, let p = 1/(n+1), and let the step-size satisfy η ≤ η̄ in (7). Then, the stochastic gradient complexity of Scaled SARAH, starting from an initial point w_0 ∈ R^d with ∆_0 = P(w_0) − P^*, required to obtain an ε-approximate solution of the non-convex finite-sum problem (1) can be bounded by O(n + (Γ/α)·(∆_0 L/ε²)·√n).

Corollary 3.4. Suppose that Assumptions 1.1 and 1.2 hold, let ε > 0, and let the step-size satisfy η ≤ η̄ in (7). Then the stochastic gradient complexity of Scaled SARAH sufficient for finding an ε-approximate solution of the non-convex finite-sum problem (1) can be bounded by O(n + (L/µ)·(Γ/α)·√n·log(∆_0/ε)).

In Appendix A we give our second method, Scaled L-SVRG. It has weaker convergence guarantees than Scaled SARAH; it is investigated in the experiments, but it is not included in the next section's discussion and comparison with competing results.
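The gradient-complexity accounting above can be checked directly; the short snippet below (our own) verifies that with p = 1/(n+1) the expected per-iteration cost stays O(1):

```python
# Step 2 costs n stochastic gradients; each iteration costs p*n + (1 - p)
# in expectation (Step 6), so the total is n + T*(p*n + (1 - p)).
# With p = 1/(n + 1): p*n + (1 - p) = n/(n+1) + n/(n+1) = 2n/(n+1) < 2.
n = 10_000
p = 1.0 / (n + 1)
per_iter = p * n + (1.0 - p)

def total_cost(T):
    return n + T * per_iter
```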

4. DISCUSSION

In this section, we discuss the results obtained for Scaled SARAH, including how they relate to corresponding results for other scaled methods, as well as for methods without preconditioning. For convenience, Table 3 summarises all the results.

• Although in Section 2 we consider scaling based on Hutchinson's approximation, with our analysis one can obtain similar estimates for Scaled SARAH with Adam preconditioning. In particular, we can prove an analogue of Lemma 2.1 (see Appendix D) by additionally assuming boundedness of the stochastic gradients for all w: ‖∇f_i(w)‖ ≤ M (a similar assumption is made in Défossez et al. (2020)). We present these results in Table 3 for comparison with the current best results for Adam.

• In the deterministic case our results are significantly superior to those from Défossez et al. (2020): in terms of the accuracy of the solution, our estimates give O(ε^{-2}) dependence, while the guarantees from Défossez et al. (2020) are O(ε^{-4}). Compared to OASIS in the deterministic case, we have the same dependence on ε, but our bounds are much better in terms of d, L, and α. It is also an interesting detail that our estimates for Scaled SARAH with the Adam preconditioner are independent of d, and with Hutchinson's preconditioner depend only on √d, which is important for high-dimensional problems.

• In the stochastic case, our convergence guarantees are also the best among scaled methods, primarily in terms of ε. This is mainly due to the fact that we use the stochastic finite-sum setting typical for machine learning.

• Unfortunately, our estimates are inferior to the bounds for the unscaled methods SARAH (the base method for our method) and SGD (the best known method for minimization problems). As one can see in Table 3, all results for methods with preconditioning share this problem; it reflects the current level of theoretical development in this field.
It seems that our results are able, in some sense, to reduce this gap between scaled and unscaled methods by decreasing the additional multiplier.

Table 3: Comparison of deterministic and stochastic methods for non-convex problems in the general case and under the Polyak-Łojasiewicz condition. In the stochastic case, the table is divided into two parts: the bounded-variance and finite-sum setups. Notation: σ² = variance of the stochastic gradients, M = uniform bound on the (stochastic) gradients; the rest of the notation is the same as that introduced earlier in the paper.

Deterministic methods:

Method and reference | Non-convex | Polyak-Łojasiewicz
SGD (Secs. B.2 and C.2 from Li & Richtárik (2020)) | O(L∆_0/ε²) | O((L/µ)·log(1/ε))
SARAH (Sec. 3.2 from Pham et al. (2020); Sec. 5 from Li et al. (2021a)) | O(L∆_0/ε²) | O((L/µ)·log(1/ε))
Adagrad (Th. 1 from Défossez et al. (2020)) | Õ((dM²/ε²)·(L∆_0/ε²)) | --
Adam (Sec. 4.3 from Défossez et al. (2020)) | Õ((dM²/ε²)·(L∆_0/ε²)) | --
OASIS (Ths. 4.17 and 4.18 from Jahani et al. (2021b)) | O((dL²/α²)·(L∆_0/ε²)) | O((dL³/(α²µ))·(L/µ)·log(1/ε)) (1)
Scaled SARAH with Hutchinson's preconditioner (ours) | O((√d·L/α)·(L∆_0/ε²)) | O((√d·L/α)·(L/µ)·log(1/ε))
Scaled SARAH with Adam preconditioner (ours) | O((M/α)·(L∆_0/ε²)) | O((M/α)·(L/µ)·log(1/ε))

Stochastic methods, bounded variance:

Method and reference | Non-convex | Polyak-Łojasiewicz
SGD (Secs. B.2 and C.2 from Li & Richtárik (2020)) | O((L∆_0/ε²) + (σ²/ε²)·(L∆_0/ε²)) | O((L/µ)·log(1/ε) + Lσ²/(µ²ε))
Adagrad (Th. 1 from Défossez et al. (2020)) | Õ((dM²/ε²)·(L∆_0/ε²) + d·(σ²/ε²)·(L∆_0/ε²)) | --
Adam (Sec. 4.3 from Défossez et al. (2020)) | Õ((dM²/ε²)·(L∆_0/ε²) + d·(σ²/ε²)·(L∆_0/ε²)) | --
OASIS (Ths. 4.17 and 4.18 from Jahani et al. (2021b)) | O((dL²/α²)·(L∆_0/ε²) + (dL²/α²)·(σ²/ε²)·(L∆_0/ε²)) | O((dL³/(α²µ))·(L/µ)·log(1/ε) + (dL²/α²)·(Lσ²/(µ²ε))) (1)

Stochastic methods, finite sum:

Method and reference | Non-convex | Polyak-Łojasiewicz
SARAH (Sec. 3.2 from Pham et al. (2020); Sec. 5 from Li et al. (2021a)) | O(n + √n·(L∆_0/ε²)) | O(n + √n·(L/µ)·log(1/ε))
Scaled SARAH with Hutchinson's preconditioner (ours) | O((√d·L/α)·√n·(L∆_0/ε²)) | O(n + (√d·L/α)·√n·(L/µ)·log(1/ε))
Scaled SARAH with Adam preconditioner (ours) | O((M/α)·√n·(L∆_0/ε²)) | O(n + (M/α)·√n·(L/µ)·log(1/ε))

(1) For strongly convex problems.

To sum up, our results improve upon the estimates already given in the literature for scaled methods. Taking into account that algorithms with preconditioning are strongly attractive from the point of view of real-world learning problems, we thus prove the best results currently available for this practically important class of methods. Meanwhile, our estimates are still worse than those for unscaled methods; in Appendix G.1 we give some reasoning as to why these estimates appear hard to improve.
In Section 5 we then present experiments which make clear that real problems are not necessarily worst-case instances; on the contrary, it is precisely on practical problems that the method from Sections 2 and 3 shows its greatest strength.

5. NUMERICAL EXPERIMENTS

The purpose of these numerical experiments is to study the practical performance of our new Scaled SARAH and Scaled L-SVRG algorithms, and hence to understand the advantages that the proposed diagonal preconditioner brings to SARAH and L-SVRG. These results will also be compared with SGD, both with and without the preconditioner described in Section 2, as well as with the state-of-the-art (first order) preconditioned optimizer Adam. We test these algorithms on problem (1) with two loss functions: (1) the logistic regression loss, which is convex, and (2) the non-linear least squares loss, which is nonconvex. The loss functions are described in detail below. For further details and experimental results that support the findings of this section, please see Appendix B. Note that all the experiments were initialized at the point w_0 = 0, and each experiment was run with 10 different random seeds.

5.1. LOSS FUNCTIONS

Let $P(w)$ be the empirical risk on a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. Then the logistic regression loss is
$$P_{\text{logistic}}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i x_i^T w}\big), \qquad (8)$$
whereas for $y_i \in \{0, 1\}$ the non-linear least squares loss (NLLSQ) is
$$P_{\text{nllsq}}(w) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - 1/(1 + e^{-x_i^T w})\big)^2. \qquad (9)$$
We consider these two different loss functions to test our algorithms in both convex and nonconvex settings.
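For concreteness, the losses (8) and (9) can be implemented as follows; this is our own sketch (hypothetical names), with the logistic gradient validated against finite differences:

```python
import numpy as np

def logistic_loss(w, X, y):            # y in {-1, +1}, loss (8)
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def logistic_grad(w, X, y):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = sigmoid(-y_i x_i^T w)
    return -(X.T @ (s * y)) / len(y)

def nllsq_loss(w, X, y):               # y in {0, 1}, loss (9)
    return np.mean((y - 1.0 / (1.0 + np.exp(-(X @ w)))) ** 2)

# Finite-difference check of the logistic gradient on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = np.sign(rng.standard_normal(20))
w = rng.standard_normal(4)
eps = 1e-6
g = logistic_grad(w, X, y)
g_fd = np.array([(logistic_loss(w + eps * e, X, y)
                  - logistic_loss(w - eps * e, X, y)) / (2 * eps)
                 for e in np.eye(4)])
nllsq_val = nllsq_loss(w, X, (y + 1) / 2)   # map labels to {0, 1}
```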

5.2. BINARY CLASSIFICATION ON LIBSVM DATASETS

We train the optimizers on three binary classification LibSVM datasets, namely w8a, rcv1, and real-sim. We also consider feature-scaled versions of these datasets, where the scaling is done as follows: we choose a minimum exponent k_min and a maximum exponent k_max, and scale the features by values ranging from 10^{k_min} to 10^{k_max}, in equal steps in the exponent according to the number of features, and in random order. The setting (k_min, k_max) = (0, 0) corresponds to the original, unscaled version of the datasets. We consider combinations of k_min ∈ {0, −3} and k_max ∈ {0, 3}. This scaling is done to check the robustness and overall effectiveness of the diagonal preconditioner in comparison with Adam. Figure 1 shows the results of the first experiment, and presents three types of line plots for each of the datasets of interest, where the loss function is the logistic regression loss (8). Figure 2 shows the same for the NLLSQ loss (9). The first row corresponds to the loss, the second to the squared norm of the gradient, and the third to the error. Tuning was performed in order to select the best hyperparameters (those that minimize either the loss, the squared gradient norm, or the error). The hyperparameter search grid is reported in Table 4 in the appendix, and a thorough discussion is given below. We fixed the batch size to 128, in order to narrow the fine-tuned variables down to η, β, and α. Figure 1 shows the performance when minimizing the error on the unscaled datasets, (k_min, k_max) = (0, 0). Experiments on scaled datasets can be found in Appendix B.2. Consider the first column in Figure 1, which uses the w8a dataset. While preconditioned SGD performs better than SGD, for SARAH and L-SVRG this is not the case. Notice that Adam performs well initially, but then further effective data passes lead to little improvement (for all three metrics).
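The feature-scaling scheme described above can be sketched as follows (our own illustration with hypothetical names):

```python
import numpy as np

def scale_features(X, k_min, k_max, rng):
    """Multiply feature j by 10**k_j, where the exponents k_j run from
    k_min to k_max in equal steps over the d features, in random order."""
    d = X.shape[1]
    exponents = np.linspace(k_min, k_max, d)
    rng.shuffle(exponents)              # random assignment to features
    return X * (10.0 ** exponents)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
X_scaled = scale_features(X, -3, 3, rng)
# (k_min, k_max) = (0, 0) leaves the data unchanged.
```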
SARAH and L-SVRG perform best on this dataset; after approximately 35 passes for Scaled L-SVRG and 60 passes for Scaled SARAH, their performance overtakes that of Adam. For the remaining three datasets, preconditioning helps in all cases, with Scaled L-SVRG and Scaled SARAH performing the best. In order to understand the main factors that affect the preconditioner, we ran comparative studies, studying the parameters β and α, the initialization of the preconditioner D_0 (including a warm-up period), and how the number of samples of z impacts performance. First, there did not appear to be any significant improvement from averaging across more samples of z per minibatch, neither in the initialization nor in the update step. We also observed that initializing D_0 with a batch size of 100 was sufficiently good for non-sparse problems, and consistently resulted in a relative error within 0.1 of the true diagonal. However, increasing the number of warm-up samples, proportionally to the number of features, led to observable improvements in convergence for sparse datasets. We also investigated the role that β in (3) played in algorithm performance. We found that larger values lead to slightly slower but more stable convergence. The best β depends strongly on the dataset, but the value 0.999 appeared to be a good starting point in general. To ensure a fair comparison, we also optimized Adam's momentum parameter β₂ over the same range. Aside from the batch size and learning rate, we found that, for ill-conditioned problems, the choice of α (recall (5) and Lemma 2.1) played an important role in determining the quality of the solution, the convergence speed, and stability (which is not obvious from Figure 1). For example, if the features were scaled with k_min = −3 and k_max = 0, the best α is often around 10^{-7} (very small), whereas if we scaled with k_min = 0 and k_max = 3, the best α becomes 10^{-1} (relatively large).
Therefore, finding the best α might require some additional fine-tuning, depending on the choice of η and β. However, we noticed that once we had tuned the learning rate for one scaled version of a dataset, the same learning rate transferred well to all of the other scaled versions. In general, the optimal learning rate of our Scaled algorithms is very robust to feature scaling, provided that α is chosen well, whereas Adam's learning rate depends more heavily upon how ill-conditioned the problem is, so it requires fine-tuning across a potentially much wider range. In our case, tuning α and β is straightforward, so we obtained state-of-the-art performance with minimal parameter tuning. The choice β_t = 1 - 1/(t + 1) is investigated in Appendices B.5 and G.3. Preliminary results suggest that this choice virtually removes the dependence on α, and is competitive with a β fine-tuned across a large number of values. This makes fine-tuning much easier, even under strong feature scaling.

A SCALED L-SVRG

SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014) is a variance reduced stochastic gradient method that is very popular for finite sum optimization problems. However, the algorithm has a double loop structure, and careful tuning of hyper-parameters is required for good practical performance. Recently, Hofmann et al. (2015) proposed a Loopless SVRG (L-SVRG) variant, which has a simpler, single loop structure, and which can be applied to problem (1) in the convex and smooth case. This was extended in Qian et al. (2021) to cover the composite case with an arbitrary sampling scheme. With its single loop structure, and consequently fewer hyperparameters to tune, coupled with the fact that, unlike PAGE (recall Section 1), L-SVRG uses an unbiased estimate of the gradient, L-SVRG is a versatile and competitive algorithm for problems of the form (1).
However, as for the other previously mentioned gradient based methods, L-SVRG can perform poorly when the problem is badly scaled and/or ill-conditioned. This provides the motivation for the Scaled L-SVRG method that we propose in this work. Our Scaled L-SVRG algorithm combines the positive features of L-SVRG with a preconditioner, to give a method that is loopless, has few hyperparameters to tune, uses an unbiased estimate of the gradient, and adaptively scales the search direction depending upon the local curvature. The Scaled L-SVRG method is presented now as Algorithm 2.

Algorithm 2 Scaled L-SVRG
1: Input: initial point w_0, learning rate η, preconditioner D̂_0, probability p
2: z_0 = w_0, v_0 = ∇P(w_0)
3: for t = 0, 1, 2, . . . do
4:   w_{t+1} = w_t - η D̂_t^{-1} v_t
5:   z_{t+1} = w_t with probability p, and z_{t+1} = z_t with probability 1 - p
6:   Generate independent batches i_{t+1} for v_{t+1} and J_t for D̂_{t+1}
7:   v_{t+1} = ∇f_{i_{t+1}}(w_{t+1}) - ∇f_{i_{t+1}}(z_{t+1}) + ∇P(z_{t+1})
8:   Update the preconditioner D̂_{t+1}
9: end for
10: Output: ŵ_T chosen uniformly from {w_t}_{t=0}^T

Scaled L-SVRG can be described, in words, as follows. The algorithm is initialized with an initial point w_0, a learning rate η, an initial preconditioner D̂_0, and a probability p. At each iteration t ≥ 0 of Scaled L-SVRG (Algorithm 2) a search direction v_t is generated, made up of the full gradient at the anchor point z_t plus a small adjustment. Next, the new point w_{t+1} is obtained by taking a step of size η from w_t in the scaled direction D̂_t^{-1} v_t. The new anchor z_{t+1} is either the current point w_t, with probability p, or the (unchanged) previous anchor z_t, with probability 1 - p. Finally, the preconditioner is updated and the next iteration begins. The output, denoted by ŵ_T, is chosen uniformly from the points w_t, for t = 0, . . . , T, generated by Scaled L-SVRG (Algorithm 2).
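To make Algorithm 2 concrete, the following is a minimal NumPy sketch of Scaled L-SVRG with a batch size of one and a single Hutchinson sample per step. The function names (`grad_i`, `full_grad`, `diag_sample`, `scaled_lsvrg`), the caching of the anchor gradient, and the single-sample preconditioner initialization are our choices for this sketch, not the paper's implementation:

```python
import numpy as np

def scaled_lsvrg(grad_i, full_grad, diag_sample, w0, eta, p, alpha, beta, n, T,
                 seed=0):
    """Sketch of Scaled L-SVRG (Algorithm 2), batch size 1.

    grad_i(i, w): gradient of f_i at w; full_grad(w): gradient of P at w;
    diag_sample(w, rng): one Hutchinson sample z * (Hessian @ z) at w.
    """
    rng = np.random.default_rng(seed)
    w, z = w0.copy(), w0.copy()
    gz = full_grad(z)                          # cached anchor gradient grad P(z_t)
    v = gz.copy()
    D_raw = diag_sample(w, rng)                # crude D_0 from a single sample
    D_hat = np.maximum(np.abs(D_raw), alpha)   # clip the diagonal below at alpha
    for _ in range(T):
        w_new = w - eta * v / D_hat            # Step 4: scaled gradient step
        if rng.random() < p:                   # Step 5: refresh anchor to w_t
            z, gz = w.copy(), full_grad(w)
        i = rng.integers(n)                    # Step 6: fresh index for v_{t+1}
        v = grad_i(i, w_new) - grad_i(i, z) + gz                      # Step 7
        D_raw = beta * D_raw + (1 - beta) * diag_sample(w_new, rng)   # Step 8
        D_hat = np.maximum(np.abs(D_raw), alpha)
        w = w_new
    return w

# Usage on a small least-squares problem P(w) = (1/2n) ||Aw - b||^2.
rng = np.random.default_rng(0)
n, d = 40, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
P = lambda w: 0.5 * np.mean((A @ w - b) ** 2)
grad_i = lambda i, w: A[i] * (A[i] @ w - b[i])
full_grad = lambda w: A.T @ (A @ w - b) / n

def diag_sample(w, rng):
    zv = rng.choice([-1.0, 1.0], size=d)       # Rademacher probe
    return zv * (A.T @ (A @ zv)) / n           # z * (Hessian @ z)

w0 = np.zeros(d)
w_out = scaled_lsvrg(grad_i, full_grad, diag_sample, w0, eta=0.05, p=0.05,
                     alpha=0.5, beta=0.99, n=n, T=4000, seed=1)
```

On this toy problem the loss decreases by orders of magnitude; in practice the preconditioner would be initialized from a warm-up minibatch as discussed in Appendix B.4.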
Note that a key difference between L-SVRG (Hofmann et al., 2015; Qian et al., 2021) and our new Scaled L-SVRG is the inclusion of the preconditioner in Step 4; recall that a competitive preconditioner is described in Section 2. The following theorem presents a complexity bound on the number of iterations required by Scaled L-SVRG to obtain an ε-optimal solution for the non-convex problem (1).

Theorem A.1. Suppose that Assumption 1.1 holds, let $\varepsilon > 0$, let $p$ denote the probability, and let the step-size satisfy
$$\eta \le \min\left\{\frac{\alpha}{4L},\ \frac{\sqrt{p}\,\alpha}{\sqrt{24}\,L},\ \frac{p^{2/3}}{144^{2/3}}\cdot\frac{\alpha}{L}\right\}.$$
Given an initial point $w_0 \in \mathbb{R}^d$, let $\Delta_0 = P(w_0) - P^*$. Then the number of iterations performed by Scaled L-SVRG, starting from $w_0$, required to obtain an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\frac{\Gamma}{\alpha}\cdot\frac{L\Delta_0}{p^{2/3}\varepsilon^2}\right).$$
While the previous theorem held under a smoothness assumption only, the next result gives a complexity bound for Scaled L-SVRG under both smoothness and PL assumptions.

Theorem A.2. Suppose that Assumptions 1.1 and 1.2 hold, let $\varepsilon > 0$, let $p$ denote the probability, and let the step-size satisfy
$$\eta \le \min\left\{\frac{p\Gamma}{6\mu},\ \frac{1}{4}\cdot\frac{\alpha}{L},\ \Big(\frac{p}{6}\Big)^{1/2}\frac{\alpha}{L},\ \Big(\frac{p}{6}\Big)^{2/3}\frac{\alpha}{L}\right\}.$$
Then the number of iterations performed by Scaled L-SVRG sufficient for finding an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\max\left\{\frac{1}{p},\ \frac{\Gamma}{\alpha}\cdot\frac{L}{p^{2/3}\mu}\right\}\log\frac{\Delta_0}{\varepsilon}\right).$$
We next give corollaries stating the corresponding stochastic gradient complexities.

B ADDITIONAL NUMERICAL EXPERIMENTS

Here we provide additional details related to our experimental set-up, as well as presenting the results of additional numerical experiments. The search grid (Table 4) also covers whether preconditioning is used (True, False; except for Adam), the dataset (a9a, w8a, rcv1, real-sim), and the feature scaling (k_min, k_max) ∈ {(0, 0), (0, 3), (-3, 0), (-3, 3)}.

B.1 BEST PERFORMANCE GIVEN A FIXED PARAMETER

The remaining entries of Table 4 are: loss function (logistic, NLLSQ); random seed (0, . . . , 9); batch size (128, 512); η ∈ {2^{-20}, 2^{-18}, . . . , 2^{4}}; α ∈ {10^{-1}, 10^{-3}, 10^{-7}}; and β ∈ {0.95, 0.99, 0.995, 0.999, avg}. We ran extensive experiments in order to find the best performing set of parameters for each optimizer on each dataset. For our parameter search, we fixed one parameter (e.g., α = 10^{-1}), and then found the best choice of the remaining parameters, given that fixed value. This allowed us to understand how robust the algorithm's optimal performance is with respect to each parameter. In other words, 'how does changing one parameter degrade the quality of the solution or the overall performance of the algorithm?'. We also report the best overall performances. Here, we consider fixing one of three parameters, η, β, and α, and then plot the trajectory of the squared gradient norm $\|\nabla P(w_t)\|^2$ for the setting that minimizes the error with respect to the other parameters. The values 'α = none' and 'β = none' indicate non-preconditioned trajectories, and the value β = avg indicates the choice β_t = 1 - 1/(t + 1) (or, more precisely, the choice described in Appendix B.5). See Figures 3, 4, 5, and 6.

B.2 CORRUPTING SCALE OF FEATURES

We consider settings where the features of the data are corrupted with a given logarithmic scale. We show the best overall performances on the four datasets under the scalings (k_min, k_max) ∈ {(-3, 0), (0, 3), (-3, 3)}. This allows us to understand the impact of preconditioning versus no preconditioning on poorly scaled problems. The results are shown in Figures 7, 8, and 9.

B.3 CONVEX VS. NON-CONVEX LOSS

We test our algorithms on the non-linear least squares (NLLSQ) loss, which is a non-convex loss function. We show the best performance on the unscaled datasets, as well as on datasets scaled with (k_min, k_max) = (-3, 3). See Figures 10 and 11.

B.4 RELATIVE ERROR OF D 0

For the preconditioner described in Section 2, the initial preconditioner D_0 must be chosen appropriately. Note that, if diag(H_0) is the true Hessian diagonal at some initial point w_0, then the relative error of the approximation D_0 to diag(H_0) is
$$\frac{\|D_0 - \mathrm{diag}(H_0)\|}{\|\mathrm{diag}(H_0)\|}.$$
For our numerical experiments we used this relative error to determine whether the initial preconditioner D_0 was sufficiently accurate. In particular, we noticed that, if a minibatch of size 100 was used, then the resulting D_0 almost always reached a relative error below 0.1 for dense datasets. For sparse datasets, more warm-up samples can improve convergence and the quality of the solution, but in either case a minibatch of size 100 was sufficient to establish convergence. Thus, we initialized D_0 with a size 100 minibatch in most of our experiments. See Figures 12, 13, and 14.
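The initialization and accuracy check described here can be sketched as follows. The helper names are hypothetical, and a random PSD matrix stands in for the minibatch Hessian at w_0:

```python
import numpy as np

def hutchinson_diag(H, m, rng):
    """Average of m Hutchinson samples z * (H z), z Rademacher:
    an unbiased estimate of diag(H)."""
    d = H.shape[0]
    est = np.zeros(d)
    for _ in range(m):
        z = rng.choice([-1.0, 1.0], size=d)
        est += z * (H @ z)
    return est / m

def relative_error(D0, H):
    """||D0 - diag(H)|| / ||diag(H)||, the accuracy measure used above."""
    diag = np.diag(H)
    return np.linalg.norm(D0 - diag) / np.linalg.norm(diag)

rng = np.random.default_rng(0)
d = 50
M = rng.standard_normal((d, d))
H = M.T @ M / d                       # random PSD stand-in for the Hessian
err_small = relative_error(hutchinson_diag(H, 5, rng), H)
err_large = relative_error(hutchinson_diag(H, 200, rng), H)
# More warm-up samples shrink the relative error of D_0.
```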

B.5 CHOOSING β t

We show plots where we fine-tune α and β over a wider range of parameters. Similarly to Section B.1, we optimize the other parameters by fixing the parameter of interest at the chosen value, and report the optimal performance metric. We only consider Scaled SARAH in this experiment. We show plots where we minimize either the error or the gradient norm, under either the logistic loss or the NLLSQ loss. We also show the optimal performance when choosing β_t = 1 - 1/(t + t_0), where t_0 is the number of warm-up samples. We run experiments for 10 random seeds and show the confidence intervals for each hyperparameter setting. We see that the suggested choice of β_t is consistently competitive with the best constant β, and is robust to feature scaling as well. See Figures 15, 16, and 17.

B.6 DEEP NEURAL NETS ON MNIST

In order to demonstrate that our algorithms are indeed practical and competitive with the state-of-the-art, we test them on deep neural nets. The setting of our experiment is widely accessible and well-known: train the LeNet-5 model (LeCun et al., 1998) on the MNIST dataset with the cross entropy loss. For this experiment, we believe that it suffices to test Scaled L-SVRG vs. Adam. For L-SVRG, we use p = 0.999. We run this experiment in the PyTorch framework (Paszke et al., 2019). The hyperparameter search grid is slightly reduced for this experiment: learning rates larger than 2^{-6} or smaller than 2^{-14} are omitted, and we only consider β ∈ {avg, 0.99, 0.999}. The results are shown in Figures 18 and 19. We observe that the performance of Scaled L-SVRG is indeed competitive with Adam on this simple deep learning problem. We omit trajectories that diverge in these plots, which is why the value α = 10^{-7} is not seen in Figure 19. In the future, we plan on running more sophisticated deep learning experiments, and on exploring ways to adapt our algorithms to these settings in particular.

C LEMMA 2.1

In this section we prove Lemma 2.1.

Proof: To begin, note that by constructing the matrices $\widehat D_t$ according to (3)-(6), $\widehat D_t$ is a diagonal matrix with positive diagonal elements, each of which is at least α. Hence, $\alpha I \preceq \widehat D_t$. To prove that $\widehat D_t \preceq \Gamma I$, we show that no element of $\widehat D_t$ exceeds Γ. Based on (3), it suffices to show that $\|\mathrm{diag}(z_t \circ \nabla^2 P_{J_t}(w_t) z_t)\|_\infty \le \Gamma$ for all t. Indeed,
$$\|\mathrm{diag}(z_t \circ \nabla^2 P_{J_t}(w_t) z_t)\|_\infty^2 \le \|z_t \circ \nabla^2 P_{J_t}(w_t) z_t\|_\infty^2 \le \Big(\max_i \sum_{j=1}^d \big|(\nabla^2 P_{J_t}(w_t))_{ij}\big|\Big)^2 \le \max_i\, d\sum_{j=1}^d (\nabla^2 P_{J_t}(w_t))_{ij}^2 \le d\,\|\nabla^2 P_{J_t}(w_t)\|_2^2 \le dL^2,$$
where the third inequality is Cauchy-Schwarz, the fourth holds because every row norm of a symmetric matrix is bounded by its spectral norm, and the last uses the L-smoothness of the $f_i$ (Assumption 1.1). Finally, we have $\|\mathrm{diag}(z_t \circ \nabla^2 P_{J_t}(w_t) z_t)\|_\infty \le \sqrt{d}L = \Gamma$. □
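The key estimate in the proof above, $\|\mathrm{diag}(z \circ Hz)\|_\infty \le \sqrt{d}\,\|H\|_2$, can be checked numerically on random symmetric matrices. This sanity check is ours, not part of the paper:

```python
import numpy as np

# Check ||diag(z * (H @ z))||_inf <= sqrt(d) * ||H||_2 for random symmetric H
# and Rademacher z, as in the proof of Lemma 2.1.
rng = np.random.default_rng(1)
d = 30
worst_ratio = 0.0
for _ in range(200):
    M = rng.standard_normal((d, d))
    H = (M + M.T) / 2                      # random symmetric "Hessian"
    z = rng.choice([-1.0, 1.0], size=d)    # Rademacher probe vector
    lhs = np.max(np.abs(z * (H @ z)))      # ||diag(z o Hz)||_inf
    rhs = np.sqrt(d) * np.linalg.norm(H, 2)
    worst_ratio = max(worst_ratio, lhs / rhs)
```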

D LEMMA 2.1 FOR AD A M

In this section we prove Lemma 2.1, but for Adam preconditioning. The rules for calculating the D-matrix for the Adam method are as follows:
$$D_t^2 = \beta_t D_{t-1}^2 + (1-\beta_t)H_{J_t}^2, \qquad D_0^2 = \frac{1}{|J_0|}\sum_{j\in J_0}\mathrm{diag}\big(\nabla f_j(w_0)\circ\nabla f_j(w_0)\big),$$
where
$$\beta_t = \frac{\beta-\beta^{t+1}}{1-\beta^{t+1}}, \quad \beta\in(0,1), \qquad H_{J_t}^2 = \frac{1}{|J_t|}\sum_{j\in J_t}\mathrm{diag}\big(\nabla f_j(w_t)\circ\nabla f_j(w_t)\big).$$
Proof: To begin, note that by constructing the matrices $\widehat D_t$ according to these rules, $\widehat D_t$ is a diagonal matrix with positive diagonal elements, each of which is at least α. Hence, $\alpha I \preceq \widehat D_t$. Moreover, using the diagonal structure of $H_{J_t}^2$, one can note that
$$\|H_{J_t}^2\|_\infty = \Big\|\frac{1}{|J_t|}\sum_{j\in J_t}\mathrm{diag}\big(\nabla f_j(w_t)\circ\nabla f_j(w_t)\big)\Big\|_\infty \le \frac{1}{|J_t|}\sum_{j\in J_t}\big\|\mathrm{diag}\big(\nabla f_j(w_t)\circ\nabla f_j(w_t)\big)\big\|_\infty = \frac{1}{|J_t|}\sum_{j\in J_t}\|\nabla f_j(w_t)\|_\infty^2 \le \frac{1}{|J_t|}\sum_{j\in J_t}\|\nabla f_j(w_t)\|_2^2 \le M^2.$$
Finally, we have $\|H_{J_t}^2\|_\infty \le M^2$; since $\beta_t \in (0,1)$, it follows by induction that $\|D_t^2\|_\infty \le M^2$, and then $\alpha I \preceq \widehat D_t \preceq MI = \Gamma I$. □
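As a numerical counterpart to this lemma, the Adam-style recursion above can be simulated with gradients of norm exactly M, checking that the clipped preconditioner stays in the interval [α, M]. The set-up below, including the fixed pool of gradients, is our illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, alpha, beta, M = 10, 8, 1e-3, 0.999, 1.0
G = rng.standard_normal((n, d))
G = G / np.linalg.norm(G, axis=1, keepdims=True) * M   # ||grad f_j||_2 = M
D2 = np.mean(G ** 2, axis=0)                           # D_0^2 from batch J_0
lo, hi = np.inf, -np.inf
for t in range(1, 300):
    beta_t = (beta - beta ** (t + 1)) / (1 - beta ** (t + 1))
    J = rng.choice(n, size=4, replace=False)           # minibatch J_t
    H2 = np.mean(G[J] ** 2, axis=0)                    # squared-gradient average
    D2 = beta_t * D2 + (1 - beta_t) * H2               # Adam-style recursion
    D_hat = np.maximum(np.sqrt(D2), alpha)             # clip below at alpha
    lo, hi = min(lo, D_hat.min()), max(hi, D_hat.max())
```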

E SCALED SARAH

In this section we present the proofs of the main theoretical complexity results for Scaled SARAH from Section 3.

Lemma E.1 (Descent Lemma). Suppose that the function P satisfies Assumption 1.1 and that Algorithm 1 generates the sequence $\{w_t\}_{t\ge0}$. Then, for any $t \ge 0$ and any $\eta > 0$,
$$P(w_{t+1}) \le P(w_t) + \frac{\eta}{2\alpha}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t} - \frac{\eta}{2}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}}.$$
Proof: Using the L-smoothness of P and $I \preceq \frac{1}{\alpha}\widehat D_t$,
$$P(w_{t+1}) \le P(w_t) + \langle\nabla P(w_t), w_{t+1}-w_t\rangle + \frac{L}{2}\|w_{t+1}-w_t\|^2 \le P(w_t) + \langle\nabla P(w_t), w_{t+1}-w_t\rangle + \frac{L}{2\alpha}\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
With the update of Scaled SARAH, $w_{t+1} = w_t - \eta\widehat D_t^{-1}v_t$, i.e. $v_t = \frac{1}{\eta}\widehat D_t(w_t-w_{t+1})$, we get
$$P(w_{t+1}) \le P(w_t) + \big\langle\nabla P(w_t)-v_t,\, -\eta\widehat D_t^{-1}v_t\big\rangle + \frac{1}{\eta}\big\langle\widehat D_t(w_t-w_{t+1}),\, w_{t+1}-w_t\big\rangle + \frac{L}{2\alpha}\|w_{t+1}-w_t\|^2_{\widehat D_t}$$
$$= P(w_t) + \eta\|\nabla P(w_t)-v_t\|^2_{\widehat D_t^{-1}} - \eta\big\langle\nabla P(w_t)-v_t,\, \nabla P(w_t)\big\rangle_{\widehat D_t^{-1}} - \left(\frac{1}{\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
Let us define $\hat w_{t+1} = w_t - \eta\widehat D_t^{-1}\nabla P(w_t)$, so that $w_{t+1}-\hat w_{t+1} = \eta\widehat D_t^{-1}(\nabla P(w_t)-v_t)$ and $w_t-\hat w_{t+1} = \eta\widehat D_t^{-1}\nabla P(w_t)$. Then
$$-\eta\big\langle\nabla P(w_t)-v_t,\, \nabla P(w_t)\big\rangle_{\widehat D_t^{-1}} = -\frac{1}{\eta}\big\langle w_{t+1}-\hat w_{t+1},\, w_t-\hat w_{t+1}\big\rangle_{\widehat D_t} = -\frac{1}{2\eta}\Big(\|w_{t+1}-\hat w_{t+1}\|^2_{\widehat D_t} + \|w_t-\hat w_{t+1}\|^2_{\widehat D_t} - \|w_{t+1}-w_t\|^2_{\widehat D_t}\Big)$$
$$= -\frac{1}{2\eta}\Big(\eta^2\|\nabla P(w_t)-v_t\|^2_{\widehat D_t^{-1}} + \eta^2\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} - \|w_{t+1}-w_t\|^2_{\widehat D_t}\Big).$$
Combining the last two displays and using $\|\nabla P(w_t)-v_t\|^2_{\widehat D_t^{-1}} \le \frac{1}{\alpha}\|\nabla P(w_t)-v_t\|^2$ gives the result. □

Lemma E.2. Suppose that the function P satisfies Assumptions 1.1 and 1.2 and that Algorithm 1 generates the sequence $\{w_t\}_{t\ge0}$. Then, for any $t \ge 0$ and any $\eta > 0$,
$$P(w_{t+1})-P^* \le \left(1-\frac{\eta\mu}{\Gamma}\right)\big(P(w_t)-P^*\big) + \frac{\eta}{2\alpha}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
Proof: According to Lemma E.1, we have
$$P(w_{t+1}) \le P(w_t) + \frac{\eta}{2\alpha}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t} - \frac{\eta}{2}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}}.$$
Then, with $\widehat D_t^{-1} \succeq \frac{1}{\Gamma}I$ and the PL-condition (Assumption 1.2), we have
$$P(w_{t+1})-P^* \le P(w_t)-P^* + \frac{\eta}{2\alpha}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t} - \frac{\eta}{2\Gamma}\|\nabla P(w_t)\|^2$$
$$\le P(w_t)-P^* + \frac{\eta}{2\alpha}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}\right)\|w_{t+1}-w_t\|^2_{\widehat D_t} - \frac{\eta\mu}{\Gamma}\big(P(w_t)-P^*\big). \quad\square$$

Lemma E.3. Suppose that Assumption 1.1 holds. Then
$$\mathbb{E}\|v_{t+1}-\nabla P(w_{t+1})\|^2 \le (1-p)\,\mathbb{E}\|v_t-\nabla P(w_t)\|^2 + \frac{(1-p)L^2}{\alpha}\,\mathbb{E}\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
Proof: Lemma 3 from Li et al. (2021a) gives
$$\mathbb{E}\|v_{t+1}-\nabla P(w_{t+1})\|^2 \le (1-p)\,\mathbb{E}\|v_t-\nabla P(w_t)\|^2 + (1-p)L^2\,\mathbb{E}\|w_{t+1}-w_t\|^2,$$
and using $I \preceq \frac{1}{\alpha}\widehat D_t$ finishes the proof. □

Theorem E.4 (Theorem 3.1). Suppose that Assumption 1.1 holds, let $\varepsilon > 0$, let $p$ denote the probability, and let the step-size satisfy
$$\eta \le \frac{\alpha}{L\left(1+\sqrt{\frac{1-p}{p}}\right)}.$$
Then the number of iterations performed by Scaled SARAH, starting from an initial point $w_0 \in \mathbb{R}^d$ with $\Delta_0 = P(w_0)-P^*$, required to obtain an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\frac{\Gamma}{\alpha}\cdot\frac{\Delta_0 L}{\varepsilon^2}\left(1+\sqrt{\frac{1-p}{p}}\right)\right).$$
Proof: Combining Lemmas E.1 and E.3, we have
$$\mathbb{E}\left[P(w_{t+1})-P^*+\frac{\eta}{2\alpha p}\|\nabla P(w_{t+1})-v_{t+1}\|^2\right] \le \mathbb{E}\left[P(w_t)-P^*+\frac{\eta}{2\alpha p}\|\nabla P(w_t)-v_t\|^2\right] - \frac{\eta}{2}\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}-\frac{(1-p)L^2\eta}{2\alpha^2 p}\right)\mathbb{E}\|w_{t+1}-w_t\|^2_{\widehat D_t},$$
where we used $\frac{\eta}{2\alpha}+\frac{(1-p)\eta}{2\alpha p} = \frac{\eta}{2\alpha p}$. Choosing $\eta \le \frac{\alpha}{L(1+\sqrt{(1-p)/p})}$ makes the last bracket non-negative. Defining $\Psi_t = P(w_t)-P^*+\frac{\eta}{2\alpha p}\|\nabla P(w_t)-v_t\|^2$, we have
$$\mathbb{E}[\Psi_{t+1}] \le \mathbb{E}[\Psi_t] - \frac{\eta}{2}\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}}.$$
Summing over $t = 0, \dots, T-1$, we obtain
$$\sum_{t=0}^{T-1}\frac{\eta}{2}\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} \le \mathbb{E}[\Psi_0]-\mathbb{E}[\Psi_T] \le \Delta_0,$$
since $\Psi_0 = P(w_0)-P^* = \Delta_0$ (recall that $v_0 = \nabla P(w_0)$). Using that $\hat w_T$ is chosen uniformly from $\{w_t\}_{t=0}^{T-1}$,
$$\mathbb{E}\|\nabla P(\hat w_T)\|^2 \le \Gamma\,\mathbb{E}\|\nabla P(\hat w_T)\|^2_{\widehat D_t^{-1}} \le \frac{2\Delta_0\Gamma}{\eta T}.$$
Then
$$T = O\left(\frac{\Delta_0\Gamma}{\eta\varepsilon^2}\right) = O\left(\frac{L\Delta_0}{\varepsilon^2}\cdot\frac{\Gamma}{\alpha}\left(1+\sqrt{\frac{1-p}{p}}\right)\right). \quad\square$$

Theorem E.5 (Theorem 3.2).
Suppose that Assumptions 1.1 and 1.2 hold, let $\varepsilon > 0$, and let the step-size satisfy
$$\eta \le \min\left\{\frac{p\Gamma}{2\mu},\ \frac{\alpha}{2L\left(1+\sqrt{\frac{1-p}{p}}\right)}\right\}.$$
Then the number of iterations performed by Scaled SARAH sufficient for finding an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\max\left\{\frac{1}{p},\ \frac{L}{\mu}\cdot\frac{\Gamma}{\alpha}\left(1+\sqrt{\frac{1-p}{p}}\right)\right\}\log\frac{\Delta_0}{\varepsilon}\right).$$
Proof: Combining Lemmas E.2 and E.3, for any $B > 0$ we have
$$\mathbb{E}\big[P(w_{t+1})-P^*+B\|\nabla P(w_{t+1})-v_{t+1}\|^2\big] \le \left(1-\frac{\eta\mu}{\Gamma}\right)\mathbb{E}[P(w_t)-P^*] + \left(\frac{\eta}{2\alpha}+B(1-p)\right)\mathbb{E}\|\nabla P(w_t)-v_t\|^2 - \left(\frac{1}{2\eta}-\frac{L}{2\alpha}-\frac{(1-p)L^2}{\alpha}B\right)\mathbb{E}\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
We need to choose $\eta$ and $B$ such that
$$\frac{\eta}{2\alpha}+B(1-p) \le \left(1-\frac{\eta\mu}{\Gamma}\right)B \qquad\text{and}\qquad \frac{1}{2\eta}-\frac{L}{2\alpha}-\frac{(1-p)L^2}{\alpha}B \ge 0.$$
If we take $B = \frac{\eta}{p\alpha}$ and $\eta \le \min\big\{\frac{p\Gamma}{2\mu},\ \frac{\alpha}{2L(1+\sqrt{(1-p)/p})}\big\}$, both conditions hold, and we get
$$\mathbb{E}\big[P(w_{t+1})-P^*+B\|\nabla P(w_{t+1})-v_{t+1}\|^2\big] \le \left(1-\frac{\eta\mu}{\Gamma}\right)\mathbb{E}\big[P(w_t)-P^*+B\|\nabla P(w_t)-v_t\|^2\big],$$
and then, unrolling the recursion,
$$\mathbb{E}\big[P(w_T)-P^*+B\|\nabla P(w_T)-v_T\|^2\big] \le \left(1-\frac{\eta\mu}{\Gamma}\right)^T\Delta_0.$$
Finally,
$$T = O\left(\frac{\Gamma}{\eta\mu}\log\frac{\Delta_0}{\varepsilon}\right) = O\left(\max\left\{\frac{1}{p},\ \frac{L}{\mu}\cdot\frac{\Gamma}{\alpha}\left(1+\sqrt{\frac{1-p}{p}}\right)\right\}\log\frac{\Delta_0}{\varepsilon}\right). \quad\square$$
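As a sanity check on the Descent Lemma (Lemma E.1): for a quadratic $P(w) = \frac{1}{2}w^\top A w$ the lemma uses only L-smoothness and exact algebra, so the inequality must hold for any direction $v$, any step-size $\eta > 0$, and any diagonal $\widehat D \succeq \alpha I$. The following numerical verification is ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 6, 0.5
max_violation = -np.inf
for _ in range(50):
    M = rng.standard_normal((d, d))
    A = M.T @ M                           # Hessian of P(w) = 0.5 * w' A w
    L = np.linalg.eigvalsh(A)[-1]         # smoothness constant of P
    Dd = alpha + rng.random(d) * 2.0      # diagonal of D_hat, entries >= alpha
    w = rng.standard_normal(d)
    v = rng.standard_normal(d)            # arbitrary gradient estimate v_t
    g = A @ w                             # grad P(w)
    P = lambda x: 0.5 * x @ A @ x
    for eta in (0.01, 0.3, 1.5):
        w_next = w - eta * v / Dd         # w_{t+1} = w_t - eta * D^{-1} v_t
        delta = w_next - w
        rhs = (P(w)
               + eta / (2 * alpha) * np.sum((g - v) ** 2)
               - (1 / (2 * eta) - L / (2 * alpha)) * np.sum(Dd * delta ** 2)
               - eta / 2 * np.sum(g ** 2 / Dd))
        max_violation = max(max_violation, P(w_next) - rhs)
```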

F SCALED L-SVRG

Here we provide proofs of our theoretical results for Scaled L-SVRG.

Lemma F.1. Suppose that Algorithm 2 generates the sequences $\{w_t\}_{t\ge0}$ and $\{z_t\}_{t\ge0}$. Then, for any $t \ge 0$, any $\eta > 0$ and any $B > 0$,
$$\mathbb{E}_t\|w_{t+1}-z_{t+1}\|^2 \le \frac{\eta^2}{\alpha}\mathbb{E}_t\|v_t\|^2_{\widehat D_t^{-1}} + (1-p)(1+\eta B)\|w_t-z_t\|^2 + (1-p)\frac{\eta}{\alpha B}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}}.$$
Proof: Using the definition of $z_{t+1}$, the update $w_{t+1} = w_t - \eta\widehat D_t^{-1}v_t$, and $\widehat D_t^{-1} \preceq \frac{1}{\alpha}I$, we have
$$\mathbb{E}\big[\mathbb{E}_{z_{t+1}}\|w_{t+1}-z_{t+1}\|^2\big] = p\,\mathbb{E}\|w_{t+1}-w_t\|^2 + (1-p)\,\mathbb{E}\|w_{t+1}-z_t\|^2 \le \frac{p\eta^2}{\alpha}\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}} + (1-p)\,\mathbb{E}\|w_{t+1}-z_t\|^2.$$
Next, we estimate $\mathbb{E}\|w_{t+1}-z_t\|^2$:
$$\mathbb{E}\|w_{t+1}-z_t\|^2 = \mathbb{E}\|w_t-\eta\widehat D_t^{-1}v_t-z_t\|^2 \le \mathbb{E}\|w_t-z_t\|^2 + \frac{\eta^2}{\alpha}\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}} - 2\eta\,\mathbb{E}\big\langle\widehat D_t^{-1}v_t,\, w_t-z_t\big\rangle.$$
Using the unbiasedness of $v_t$ and Young's inequality with parameter $B > 0$,
$$\mathbb{E}\|w_{t+1}-z_t\|^2 \le (1+\eta B)\,\mathbb{E}\|w_t-z_t\|^2 + \frac{\eta}{\alpha B}\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \frac{\eta^2}{\alpha}\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}}.$$
Combining the two displays finishes the proof. □

Lemma F.2. Suppose that the function P satisfies Assumption 1.1 and Algorithm 2 generates the sequence $\{w_t\}_{t\ge0}$. Then, for any $t \ge 0$,
$$\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}} \le 3\,\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \frac{6L^2}{\alpha}\mathbb{E}\|w_t-z_t\|^2.$$
Proof: Using the definition of $v_t$,
$$\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}} = \mathbb{E}\|\nabla f_{i_t}(w_t)-\nabla f_{i_t}(z_t)+\nabla P(z_t)\|^2_{\widehat D_t^{-1}} \le 3\,\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + 3\,\mathbb{E}\|\nabla P(w_t)-\nabla P(z_t)\|^2_{\widehat D_t^{-1}} + 3\,\mathbb{E}\|\nabla f_{i_t}(w_t)-\nabla f_{i_t}(z_t)\|^2_{\widehat D_t^{-1}} \le 3\,\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \frac{6L^2}{\alpha}\mathbb{E}\|w_t-z_t\|^2,$$
where we used Assumption 1.1 and $\widehat D_t^{-1} \preceq \frac{1}{\alpha}I$. □

Theorem F.3 (Theorem 3.1). Suppose that Assumption 1.1 holds, let $\varepsilon > 0$, let $p$ denote the probability, and let the step-size satisfy
$$\eta \le \min\left\{\frac{\alpha}{4L},\ \frac{\sqrt{p}\,\alpha}{\sqrt{24}\,L},\ \frac{p^{2/3}}{144^{2/3}}\cdot\frac{\alpha}{L}\right\}.$$
Given an initial point $w_0 \in \mathbb{R}^d$, let $\Delta_0 = P(w_0)-P^*$. Then the number of iterations performed by Scaled L-SVRG, starting from $w_0$, required to obtain an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\frac{\Gamma}{\alpha}\cdot\frac{L\Delta_0}{p^{2/3}\varepsilon^2}\right).$$
Proof: Using the L-smoothness of P and $I \preceq \frac{1}{\alpha}\widehat D_t$, we have
$$\mathbb{E}[P(w_{t+1})] \le \mathbb{E}[P(w_t)] + \mathbb{E}[\langle\nabla P(w_t),\, w_{t+1}-w_t\rangle] + \frac{L}{2\alpha}\mathbb{E}\|w_{t+1}-w_t\|^2_{\widehat D_t}.$$
Taking into account the update of Scaled L-SVRG and the unbiasedness of $v_t$, we obtain
$$\mathbb{E}[P(w_{t+1})-P^*] \le \mathbb{E}[P(w_t)-P^*] - \eta\,\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \frac{L\eta^2}{2\alpha}\mathbb{E}\|v_t\|^2_{\widehat D_t^{-1}}.$$
Let us define $\Phi_{t+1} = P(w_{t+1})-P^* + A\|w_{t+1}-z_{t+1}\|^2$ for some $A > 0$. Using Lemmas F.1 and F.2, we have
$$\mathbb{E}[\Phi_{t+1}] \le \mathbb{E}[P(w_t)-P^*] - \eta\left(1-\frac{A(1-p)}{\alpha B}\right)\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \eta^2\left(\frac{L}{2\alpha}+\frac{A}{\alpha}\right)\left(2\,\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + \frac{2L^2}{\alpha}\mathbb{E}\|w_t-z_t\|^2\right) + A(1-p)(1+\eta B)\,\mathbb{E}\|w_t-z_t\|^2$$
$$\le \mathbb{E}[P(w_t)-P^*] - \eta\left(1-\frac{A(1-p)}{\alpha B}-\frac{2A}{\alpha}\eta-\frac{L}{\alpha}\eta\right)\mathbb{E}\|\nabla P(w_t)\|^2_{\widehat D_t^{-1}} + A\left((1-p)(1+\eta B)+\eta^2\left(\frac{L}{A}+2\right)\frac{L^2}{\alpha^2}\right)\mathbb{E}\|w_t-z_t\|^2.$$
We need to choose $A$, $\eta$ and $B$ such that
$$1-\frac{A(1-p)}{\alpha B}-\frac{2A}{\alpha}\eta-\frac{L}{\alpha}\eta \ge \frac{1}{4} \qquad\text{and}\qquad (1-p)(1+\eta B)+\eta^2\left(\frac{L}{A}+2\right)\frac{L^2}{\alpha^2} \le 1.$$
Thus, taking $A = 3\eta^2 L$

G THE ROLE OF β IN THE PRECONDITIONER

The purpose of this section is to better understand the role that β plays in the convergence theory and practical performance of our algorithms. Recall that β is the momentum parameter for the preconditioner (3): it controls the weighting/trade-off between the past curvature history and the current minibatch Hessian. Note that the analysis presented in Appendices E and F does not impose any additional assumptions on β, which demonstrates a kind of universality of our method and convergence theory. Note that Adam (Défossez et al., 2020) and OASIS do not have such universality. In particular, for Adam, the parameter β_2 (corresponding to β in this work) is critical: for Adam to converge at the rate 1/√T in the non-convex setting, one must choose β_2 = 1 - 1/√T, while for small β_2 the method diverges (Reddi et al., 2019). The dependence of OASIS on β is only presented in the adaptive case, but this dependence can negatively impact, or destroy, convergence. This section aims to better understand the choice of β. Consider the following quadratic function:
$$f(x,y) = \frac{L}{4}(x+y)^2 + \frac{\mu}{4}(x-y)^2.$$
The corresponding Hessian of this function is
$$\nabla^2 f(x,y) = \begin{pmatrix}\frac{L+\mu}{2} & \frac{L-\mu}{2}\\[2pt] \frac{L-\mu}{2} & \frac{L+\mu}{2}\end{pmatrix}, \qquad \mathrm{diag}\big(z_t\circ\nabla^2 f\, z_t\big) = \begin{pmatrix}\frac{L+\mu}{2}+\frac{L-\mu}{2}z_t^1 z_t^2 & 0\\[2pt] 0 & \frac{L+\mu}{2}+\frac{L-\mu}{2}z_t^1 z_t^2\end{pmatrix},$$
where $z_t$ is drawn from a Rademacher distribution. One can note that, with probability 1/2 each, our approximation $\mathrm{diag}(z_t\circ\nabla^2 f\, z_t)$ is
$$\begin{pmatrix}L & 0\\ 0 & L\end{pmatrix} \quad\text{or}\quad \begin{pmatrix}\mu & 0\\ 0 & \mu\end{pmatrix}. \tag{13}$$

G.1 THE WORST CASE

Intuitively, with a good understanding of β it should be possible to obtain better practical performance. For example, our convergence analysis depends upon the factor Γ/α (recall Remark 2.1), and a good choice of β could shrink this factor. However, here we present a worst case, and show that for the function described above such an improvement is, unfortunately, not possible.
Choose α > µ, so that $\widehat D_t$ is one of two possible matrices, each with probability 1/2:
$$\begin{pmatrix}L & 0\\ 0 & L\end{pmatrix} \quad\text{or}\quad \begin{pmatrix}\alpha & 0\\ 0 & \alpha\end{pmatrix}.$$
These are scaled identity matrices, so applying the preconditioner is equivalent to simply dividing the gradient by a constant. This scaling can dramatically change the size of the gradient, especially in the early iterations, or when β = 0. In particular, when we work with the second matrix we scale the gradient by 1/α, so it is then natural to take a step-size η ∼ α/L (the factor α compensates for the scaling, while 1/L is the usual step-size for GD-type methods). On the other hand, whenever we work with the first matrix, the additional factor L enters the convergence rate through Γ.

G.2 HOW TO CHOOSE THE CONSTANT β

$D_t$ is a linear combination of the identically distributed, independent matrices (13). In such a situation it is natural to consider reducing the variance of $D_t$. Note that
$$D_t = \beta^t\,\mathrm{diag}\big(z_0\circ\nabla^2 f\, z_0\big) + \sum_{\tau=1}^{t}\beta^{t-\tau}(1-\beta)\,\mathrm{diag}\big(z_\tau\circ\nabla^2 f\, z_\tau\big),$$
and notice that, by (13), every $\mathrm{diag}(z_\tau\circ\nabla^2 f\, z_\tau)$ has the same distribution. The variance of $D_t$ can therefore be computed as
$$\mathrm{Var}[D_t] = \Big(\beta^{2t} + \sum_{\tau=1}^{t}\beta^{2(t-\tau)}(1-\beta)^2\Big)C,$$
where $C$ is some constant that does not depend on β. Minimizing the bracketed expression with respect to β, the optimality conditions give
$$\beta^{2t-1}\big(2t-1+2t\beta+2\big) = 1.$$
If we have a limit of $T$ iterations, and we want to minimize the variance of the final preconditioner, we can take
$$\beta = \frac{1}{\sqrt[2T-1]{2T-1+2T\beta+2}} \sim \frac{1}{\sqrt[2T]{2T}} \xrightarrow[T\to\infty]{} 1.$$
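The two-outcome behaviour of the Hutchinson sample in (13) can be checked exhaustively over the four Rademacher vectors (this check is ours, with hypothetical values for L and µ):

```python
import numpy as np

Lc, mu = 10.0, 0.1                         # stand-ins for L and mu
H = np.array([[(Lc + mu) / 2, (Lc - mu) / 2],
              [(Lc - mu) / 2, (Lc + mu) / 2]])
outcomes = set()
for z in ([1, 1], [1, -1], [-1, 1], [-1, -1]):
    z = np.array(z, dtype=float)
    est = z * (H @ z)                      # Hutchinson sample of diag(Hessian)
    # Both coordinates agree: L when z1*z2 = 1, mu when z1*z2 = -1.
    outcomes.add(round(float(est[0]), 12))
```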

G.3 IT IS BETTER TO CONSIDER VARYING β t

As in the previous subsection, here we also want to reduce the variance of $D_t$, but now we allow $\beta_t$ to vary:
$$D_t = \beta_t D_{t-1} + (1-\beta_t)\,\mathrm{diag}\big(z_t\circ\nabla^2 f\, z_t\big).$$
To understand how to choose $\beta_t$, note that at each iteration the sample $\mathrm{diag}(z_t\circ\nabla^2 f\, z_t)$ is one of the two matrices given in (13). To reduce the variance of $D_t$, one can take $D_t$ to be the running average
$$D_t = \frac{1}{t+1}\sum_{\tau=0}^{t}\mathrm{diag}\big(z_\tau\circ\nabla^2 f\, z_\tau\big) = \frac{t}{t+1}D_{t-1} + \frac{1}{t+1}\mathrm{diag}\big(z_t\circ\nabla^2 f\, z_t\big),$$
so that
$$\beta_t = 1 - \frac{1}{t+1}.$$
Since the $\mathrm{diag}(z_\tau\circ\nabla^2 f\, z_\tau)$ are independent and identically distributed, this choice of $\beta_t$ gives a smaller variance than the constant β of the previous subsection.
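The claim that $\beta_t = 1 - 1/(t+1)$ turns the preconditioner update into a plain running average can be verified directly (our check, with arbitrary stand-in samples):

```python
import numpy as np

rng = np.random.default_rng(4)
samples = rng.standard_normal((100, 3))   # stand-ins for diag(z_t o Hess z_t)
D = samples[0].copy()                     # D_0
for t in range(1, 100):
    beta_t = 1 - 1 / (t + 1)              # = t / (t + 1)
    D = beta_t * D + (1 - beta_t) * samples[t]
# D now equals the arithmetic mean of all 100 samples.
```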



Footnotes: (i) i.e., the components of the $z_t$ are ±1 with equal probability; (ii) note that, while PAGE allows minibatches for either option in the update Step 6, most of the theoretical results presented in Li et al. (2021a) require the full gradient to be computed as the first option in Step 6; (iii) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/




Figure 1: Best performances of the optimizers, including Adam, on the (unscaled) LibSVM datasets using the logistic loss. The Scaled variants are shown as dashed lines sharing the same color.

Figure 2: Best performances on the unscaled LibSVM datasets using the NLLSQ loss.

Corollary A.3. Suppose that Assumption 1.1 holds, let $\varepsilon > 0$, let $p = \frac{1}{n+1}$ and let the step-size satisfy the condition of Theorem A.1. Given an initial point $w_0 \in \mathbb{R}^d$, let $\Delta_0 = P(w_0)-P^*$. Then the stochastic gradient complexity of Scaled L-SVRG, starting from $w_0$, required to obtain an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$O\left(n + \frac{\Gamma}{\alpha}\cdot\frac{L\Delta_0}{\varepsilon^2}\,n^{2/3}\right).$$
Corollary A.4. Suppose that Assumptions 1.1 and 1.2 hold, let $\varepsilon > 0$, let $p = \frac{1}{n+1}$ and let the step-size satisfy the condition of Theorem A.2. Then the stochastic gradient complexity of Scaled L-SVRG sufficient for finding an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$O\left(\Big(n + \frac{\Gamma}{\alpha}\cdot\frac{L}{\mu}\,n^{2/3}\Big)\log\frac{\Delta_0}{\varepsilon}\right).$$


Figure 3: Performance of the best parameters minimizing the error using the logistic loss.

Figure 6: Trajectory of $\|\nabla P(w_t)\|^2$ for the best parameters minimizing the error, given α, under the logistic loss.

Figure 12: The diagonal values of the true Hessian H_0 vs. the estimate D_0. The three α levels in our hyperparameter search grid are shown as dashed pink horizontal lines. Increasing the number of warm-up samples is beneficial, with slightly diminishing benefits as the batch size decreases. This can be seen from the larger number of light pink points around the diagonal for batch size 1024 on real-sim.

Figure 13: The diagonal values of the true Hessian H_0 vs. the estimate D_0, given (k_min, k_max) = (-3, 3).

Figure 14: The relative error of the estimated diagonal with respect to the diagonal of the true Hessian, plotted against the number of warm-up samples. The value corrupt indicates the feature scaling. For sparse datasets, it is more difficult to decrease the relative error below 0.1.

Figure 18: Best performance of Scaled L-SVRG and Adam on MNIST.

Figure 19: Best error given η on MNIST.

$(\widehat D_t)_{ii} = \max\{\alpha, |D_t|_{i,i}\}$. Lemma D.1. For any $t \ge 1$, we have $\alpha I \preceq \widehat D_t \preceq \Gamma I$, where $0 < \alpha \le \Gamma = M$.

Table 4 states the hyperparameters that were used in our numerical experiments.

Table 4: Hyperparameter search grid.

Trajectory of $\|\nabla P(w_t)\|^2$ for the best parameters minimizing the error, given β, under the logistic loss.

Theorem F.4 (Theorem A.2). Suppose that Assumptions 1.1 and 1.2 hold, let $\varepsilon > 0$, let $p$ denote the probability, and let the step-size satisfy
$$\eta \le \min\left\{\frac{p\Gamma}{6\mu},\ \frac{1}{4}\cdot\frac{\alpha}{L},\ \Big(\frac{p}{6}\Big)^{1/2}\frac{\alpha}{L},\ \Big(\frac{p}{6}\Big)^{2/3}\frac{\alpha}{L}\right\}.$$
Then the number of iterations performed by Scaled L-SVRG sufficient for finding an $\varepsilon$-approximate solution of the non-convex finite-sum problem (1) can be bounded by
$$T = O\left(\max\left\{\frac{1}{p},\ \frac{\Gamma}{\alpha}\cdot\frac{L}{p^{2/3}\mu}\right\}\log\frac{\Delta_0}{\varepsilon}\right).$$
Proof: The proof follows that of Theorem F.3, additionally using $\widehat D_t^{-1} \succeq \frac{1}{\Gamma}I$ and Assumption 1.2 to obtain a contraction of the form $\mathbb{E}[\Phi_{t+1}] \le \mathbb{E}[P(w_t)-P^*] - \cdots$

