SP2: A SECOND ORDER STOCHASTIC POLYAK METHOD

Abstract

Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equations by using local linearizations of the model. We take this a step further and develop a method for solving the interpolation equations that uses a local second-order approximation of the model. The resulting method, SP2, uses Hessian-vector products to speed up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory for sums of quadratics.

1. INTRODUCTION

Consider the problem

    w* ∈ argmin_{w ∈ R^d} f(w) := (1/n) ∑_{i=1}^n f_i(w),    (1)

where f is twice continuously differentiable and the set of minimizers is nonempty. Let f* ∈ R be the optimal value of equation 1, and let w_0 be a given initial point. Each f_i(w) is the loss of a model parametrized by w ∈ R^d over the i-th data point. Our discussion, and forthcoming results, also hold for a loss given as an expectation f(w) = E_{ξ∼D}[f_ξ(w)], where ξ ∼ D is the data-generating process and f_ξ(w) is the loss over this sampled data point. But for simplicity we use the f_i(w) notation.

Contrary to classic statistical modeling, there is now a growing trend of using overparametrized models that are able to interpolate the data (Ma et al., 2018); that is, models for which the loss is minimized over every data point, as described in the following assumption.

Assumption 1. We say that the interpolation condition holds when the loss is nonnegative, f_i(w) ≥ 0, and there exists w* ∈ R^d such that

    f(w*) = 0.    (2)

Consequently, f_i(w*) = 0 for i = 1, . . . , n.

Overparameterized deep neural networks are the most notable example of models that satisfy Assumption 1. Indeed, with sufficiently more parameters than data points, we are able to simultaneously minimize the loss over all data points. If we admit that our model can interpolate the data, then our optimization problem equation 1 is equivalent to solving the system of nonlinear equations

    f_i(w) = 0,  for i = 1, . . . , n.    (3)

Since we assume f_i(w) ≥ 0, any solution to the above is a solution to our original problem.

Recently, it was shown in Berrada et al. (2020); Gower et al. (2021b) that the Stochastic Polyak step size (SP) method (Loizou et al., 2020; Polyak, 1987), given by

    w_{t+1} = w_t − (f_i(w_t) / ‖∇f_i(w_t)‖²) ∇f_i(w_t),    (4)

is a competitive adaptive method for setting the step sizes of SGD.
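The SP update in equation 4 can be illustrated on a small interpolating problem. The following is a minimal sketch (not from the paper, and the problem instance is our own choice): an overparametrized least-squares model with f_i(w) = ½(a_i⊤w − b_i)², for which the SP step coincides with the well-known randomized Kaczmarz update.

```python
import numpy as np

# Overparametrized least-squares: d > n, and the labels b are generated by
# some w*, so interpolation (Assumption 1) holds: f_i(w*) = 0 for every i.
rng = np.random.default_rng(0)
n, d = 20, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)       # consistent system => f(w*) = 0

def f_i(w, i):
    return 0.5 * (A[i] @ w - b[i]) ** 2

def grad_f_i(w, i):
    return (A[i] @ w - b[i]) * A[i]

w = np.zeros(d)
for _ in range(10_000):
    i = rng.integers(n)              # sample one data point uniformly
    g = grad_f_i(w, i)
    g_sq = g @ g
    if g_sq > 1e-12:                 # skip the step if the sampled gradient vanishes
        # SP step (equation 4): w <- w - f_i(w) / ||grad f_i(w)||^2 * grad f_i(w)
        w -= (f_i(w, i) / g_sq) * g

loss = np.mean([f_i(w, i) for i in range(n)])
print(f"average loss after SP: {loss:.2e}")
```

Note that no step size is tuned anywhere: the Polyak step f_i(w_t)/‖∇f_i(w_t)‖² adapts automatically, and the average loss is driven to (numerical) zero because the interpolation equations (3) are solvable.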

