SP2: A SECOND ORDER STOCHASTIC POLYAK METHOD

Abstract

Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equations by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses local second-order approximations of the model. Our resulting method SP2 uses Hessian-vector products to speed up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory for sums of quadratics.

1. INTRODUCTION

Consider the problem

$$ w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w), \qquad (1) $$

where $f$ is twice continuously differentiable and the set of minimizers is nonempty. Let the optimal value of equation 1 be $f^* \in \mathbb{R}$, and let $w^0$ be a given initial point. Here each $f_i(w)$ is the loss of a model parametrized by $w \in \mathbb{R}^d$ over the $i$-th data point. Our discussion, and forthcoming results, also hold for a loss given as an expectation $f(w) = \mathbb{E}_{\xi \sim \mathcal{D}}[f_\xi(w)]$, where $\xi \sim \mathcal{D}$ is the data-generating process and $f_\xi(w)$ is the loss over this sampled data point; but for simplicity we use the $f_i(w)$ notation. Contrary to classic statistical modeling, there is now a growing trend of using overparametrized models that are able to interpolate the data Ma et al. (2018); that is, models for which the loss is minimized over every data point, as described in the following assumption.

Assumption 1. We say that the interpolation condition holds when the loss is nonnegative, $f_i(w) \geq 0$, and there exists $w^* \in \mathbb{R}^d$ such that

$$ f(w^*) = 0. \qquad (2) $$

Consequently, $f_i(w^*) = 0$ for $i = 1, \ldots, n$.

Overparameterized deep neural networks are the most notorious example of models that satisfy Assumption 1. Indeed, with sufficiently more parameters than data points, we are able to simultaneously minimize the loss over all data points. If we admit that our model can interpolate the data, then our optimization problem equation 1 is equivalent to solving the system of nonlinear equations

$$ f_i(w) = 0, \quad \text{for } i = 1, \ldots, n. \qquad (3) $$

Since we assume $f_i(w) \geq 0$, any solution to the above is a solution to our original problem. Recently, it was shown in Berrada et al. (2020); Gower et al. (2021b) that the Stochastic Polyak step size (SP) method Loizou et al. (2020); Polyak (1987)

$$ w^{t+1} = w^t - \frac{f_i(w^t)}{\|\nabla f_i(w^t)\|^2} \nabla f_i(w^t) \qquad (4) $$

directly solves the interpolation equations. Indeed, at each iteration SP samples a single $i$-th equation from equation 3, then projects the current iterate $w^t$ onto the linearization of this constraint, that is,

$$ w^{t+1} = \arg\min_{w \in \mathbb{R}^d} \|w - w^t\|^2 \quad \text{s.t.} \quad f_i(w^t) + \langle \nabla f_i(w^t), w - w^t \rangle = 0. \qquad (5) $$
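For concreteness, the SP update of equation 4 can be sketched in a few lines of NumPy. The least-squares losses and the 20-by-5 Gaussian data below are hypothetical, chosen only so that the interpolation condition of Assumption 1 holds exactly; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def sp_step(w, f_i, grad_i, eps=1e-12):
    """One Stochastic Polyak step: w - f_i(w) / ||grad f_i(w)||^2 * grad f_i(w)."""
    g = grad_i(w)
    return w - f_i(w) / (g @ g + eps) * g

# Hypothetical interpolated problem: f_i(w) = 0.5 * (<a_i, w> - b_i)^2,
# with b = A @ w_star so that every loss vanishes at w_star (Assumption 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
w_star = rng.standard_normal(5)
b = A @ w_star

w = np.zeros(5)
for t in range(5000):
    i = rng.integers(20)
    f_i = lambda w, i=i: 0.5 * (A[i] @ w - b[i]) ** 2
    grad_i = lambda w, i=i: (A[i] @ w - b[i]) * A[i]
    w = sp_step(w, f_i, grad_i)

print(float(np.max(np.abs(A @ w - b))))  # residuals shrink toward zero
```

Because each step solves the sampled interpolation equation along the gradient direction, no step size tuning is needed; the method inherits its scale from the loss itself.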
Here we take one step further: instead of projecting onto the linearization of $f_i(w)$, we use the local quadratic expansion. That is, as a proxy for setting $f_i(w) = 0$, we set the quadratic expansion of $f_i(w)$ around $w^t$ to zero:

$$ f_i(w^t) + \langle \nabla f_i(w^t), w - w^t \rangle + \frac{1}{2} \langle \nabla^2 f_i(w^t)(w - w^t), w - w^t \rangle = 0. \qquad (6) $$

The above quadratic constraint can have infinitely many solutions, a unique solution, or no solution at all [1]. Indeed, for example, if $\nabla^2 f_i(w^t)$ is positive definite, there may exist no solution; this occurs when $f_i$ is convex, which is the most studied setting for second-order methods. But if the loss $f_i$ is positive and the Hessian has at least one negative eigenvalue, then equation 6 always has a solution. If equation 6 has solutions, then, analogously to the SP method, we can choose one using a projection step [2]:

$$ w^{t+1} \in \arg\min_{w \in \mathbb{R}^d} \frac{1}{2} \|w - w^t\|^2 \quad \text{s.t.} \quad f_i(w^t) + \langle \nabla f_i(w^t), w - w^t \rangle + \frac{1}{2} \langle \nabla^2 f_i(w^t)(w - w^t), w - w^t \rangle = 0. \qquad (7) $$

We refer to equation 7 as the SP2 method. Using a quadratic expansion has several advantages. First, quadratic expansions are more accurate than linearizations, which allows us to take larger steps. Furthermore, using the quadratic expansion leads to convergence rates that are independent of how well conditioned the Hessian matrices are, as we show later in Proposition 1. Our SP2 method occupies a unique position in the literature on stochastic second-order methods, since it is incremental and in no way relies on convexity or positive semi-definite Hessian matrices. Indeed, as we show in our non-convex experiments in Section 6.1 and matrix completion experiments in Appendix B, SP2 excels at minimizing non-convex problems that satisfy interpolation. In contrast, Newton-based methods often converge to stationary points other than the global minima. We also relax the interpolation assumption, and develop analogous quadratic methods for finding $w$ and the smallest possible $s \in \mathbb{R}$ such that

$$ f_i(w) \leq s, \quad \text{for } i = 1, \ldots, n. \qquad (8) $$
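To make the projection of equation 7 concrete, here is a hedged one-dimensional sketch, not the paper's general Hessian-vector-product implementation. In one dimension the constraint of equation 6 is a scalar quadratic in the step $\Delta = w - w^t$, with discriminant $f_i'(w^t)^2 - 2 f_i''(w^t) f_i(w^t)$, and the projection simply picks the real root of smallest magnitude. The loss $f(w) = 1 - \cos(w)$ is our own example; note it is nonnegative and has negative curvature at the starting point, yet the quadratic still has real roots, as claimed above.

```python
import numpy as np

def sp2_step_1d(f, df, d2f, w):
    """One SP2 step in one dimension: zero the quadratic model of f at w
    (equation 6) and move to the root of smallest magnitude, i.e. the
    least-norm projection of equation 7. Falls back to an SP (linearized)
    step when the quadratic term vanishes or no real root exists."""
    fw, g, h = f(w), df(w), d2f(w)
    if abs(h) < 1e-12:           # quadratic term vanishes: model is affine
        return w - fw / g
    disc = g * g - 2.0 * h * fw  # discriminant of fw + g*d + 0.5*h*d^2 = 0
    if disc < 0.0:               # no real root: fall back to the SP step
        return w - fw / g
    r = np.sqrt(disc)
    steps = np.array([(-g + r) / h, (-g - r) / h])
    return w + steps[np.argmin(np.abs(steps))]   # least-norm root

# Hypothetical nonnegative loss with minima f = 0 at w = 2*pi*k.
f = lambda w: 1.0 - np.cos(w)
df = lambda w: np.sin(w)
d2f = lambda w: np.cos(w)

w = 2.0   # here d2f(2.0) < 0, yet equation 6 still has real roots
for _ in range(12):
    w = sp2_step_1d(f, df, d2f, w)
print(abs(w))  # rapidly approaches the interpolating solution w* = 0
```

Near the solution the quadratic model matches the loss almost exactly, so each SP2 step nearly solves the sampled equation outright; this is the sense in which the quadratic expansion permits larger steps than the linearization.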
We refer to these as the slack interpolation equations, which were introduced in Crammer et al. (2006) for linear models. If the interpolation assumption holds, then $s = 0$ and the above is equivalent to solving equation 3. When interpolation does not hold, equation 8 is still an upper approximation of equation 1, as detailed in Gower et al. (2022). The rest of this paper is organized as follows. We introduce related work in Section 2. We present the proposed SP2 methods in Section 3 and the corresponding convergence analysis in Section 4. In Section 5, we relax the interpolation condition and develop a variety of quadratic methods to solve the slack version of the problem. We test the proposed methods in a series of experiments in Section 6. Finally, we conclude and discuss future directions in Section 7.
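Returning to the slack interpolation equations 8, the upper-approximation claim can be seen in one line. This derivation is our own illustration, consistent with the statement above: minimizing the smallest feasible slack is the same as minimizing the worst-case loss, which upper-bounds the average loss of equation 1.

```latex
\min_{w \in \mathbb{R}^d,\; s \in \mathbb{R}} \; s
\quad \text{s.t.} \quad f_i(w) \le s, \;\; i = 1, \dots, n
\;=\; \min_{w} \, \max_{1 \le i \le n} f_i(w)
\;\ge\; \min_{w} \, \frac{1}{n} \sum_{i=1}^n f_i(w) \;=\; f^*.
```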

2. RELATED WORK

Since it became clear that Stochastic Gradient Descent (SGD), with appropriate step size tuning, is an efficient method for solving the training problem equation 1, there has been a search for an efficient second-order counterpart. The hope, and our objective here, is to find a second-order stochastic method that is incremental, that is, it can work with mini-batches; that requires little to no tuning, since it would depend less on how well scaled or conditioned the data is; and that also applies to non-convex problems. To date there is a vast literature on stochastic second-order methods, yet none achieve all of the above. The subsampled Newton methods such as (Roosta-Khorasani & Mahoney, 2019; Bollapragada et al., 2018; Liu & Roosta, 2021; Erdogdu & Montanari, 2015; Kohler & Lucchi, 2017; Jahani et al., 2017)



[1] Or even two solutions in the 1d case.
[2] Note that there could be more than one solution to this projection, and we can choose one either with least norm or arbitrarily.




