SP2: A SECOND ORDER STOCHASTIC POLYAK METHOD

Abstract

Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equations by using local linearizations of the model. Here we take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show that SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory for sums-of-quadratics.

1. INTRODUCTION

Consider the problem

$$w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^n f_i(w), \qquad (1)$$

where f is twice continuously differentiable and the set of minimizers is nonempty. Let the optimal value of equation 1 be f^* ∈ R, and let w_0 be a given initial point. Here each f_i(w) is the loss of a model parametrized by w ∈ R^d over the i-th data point. Our discussion, and forthcoming results, also hold for a loss given as an expectation f(w) = E_{ξ∼D}[f_ξ(w)], where ξ ∼ D is the data-generating process and f_ξ(w) the loss over this sampled data point; but for simplicity we use the f_i(w) notation. Contrary to classic statistical modeling, there is now a growing trend of using overparametrized models that are able to interpolate the data Ma et al. (2018); that is, models for which the loss is minimized over every data point, as described in the following assumption.

Assumption 1. We say that the interpolation condition holds when the loss is nonnegative, f_i(w) ≥ 0, and there exists w^* ∈ R^d such that

$$f(w^*) = 0. \qquad (2)$$

Consequently, f_i(w^*) = 0 for i = 1, ..., n.

Overparameterized deep neural networks are the most prominent example of models that satisfy Assumption 1. Indeed, with sufficiently more parameters than data points, we are able to simultaneously minimize the loss over all data points. If we admit that our model can interpolate the data, then our optimization problem equation 1 is equivalent to solving the system of nonlinear equations

$$f_i(w) = 0, \quad \text{for } i = 1, \ldots, n. \qquad (3)$$

Since we assume f_i(w) ≥ 0, any solution to the above is a solution to our original problem. Recently, it was shown in Berrada et al. (2020); Gower et al. (2021b) that the Stochastic Polyak step size (SP) method Loizou et al. (2020); Polyak (1987)

$$w_{t+1} = w_t - \frac{f_i(w_t)}{\|\nabla f_i(w_t)\|^2}\,\nabla f_i(w_t) \qquad (4)$$

directly solves the interpolation equations.
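The SP step (equation 4) above can be sketched in a few lines. The following is a minimal illustration on a toy interpolating least-squares problem; the problem data and function names are our own and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interpolating problem: f_i(w) = 0.5*(x_i^T w - y_i)^2 with a consistent
# linear system, so every f_i can be driven to zero simultaneously.
n, d = 20, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)   # planted solution, so interpolation holds
y = X @ w_star

def f_i(w, i):
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

def sp_step(w, i):
    """Stochastic Polyak step (equation 4): w - f_i(w)/||grad f_i(w)||^2 * grad f_i(w)."""
    g = grad_i(w, i)
    gnorm2 = g @ g
    if gnorm2 == 0.0:             # f_i is already at its minimum
        return w
    return w - (f_i(w, i) / gnorm2) * g

w = np.zeros(d)
for t in range(2000):
    w = sp_step(w, rng.integers(n))

print(np.linalg.norm(X @ w - y))  # residual shrinks towards 0
```

For this squared loss the SP step is half a Kaczmarz projection onto the sampled hyperplane, which is one way to see that SP is solving the interpolation equations 3.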
Indeed, at each iteration SP samples a single i-th equation from equation 3, then projects the current iterate w_t onto the linearization of this constraint, that is,

$$w_{t+1} = \arg\min_{w \in \mathbb{R}^d} \|w - w_t\|^2 \quad \text{s.t.} \quad f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle = 0. \qquad (5)$$

Here we take one step further, and instead of projecting onto the linearization of f_i(w) we use the local quadratic expansion. That is, as a proxy for setting f_i(w) = 0, we set the quadratic expansion of f_i(w) around w_t to zero:

$$f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle + \tfrac{1}{2}\langle \nabla^2 f_i(w_t)(w - w_t), w - w_t\rangle = 0. \qquad (6)$$

The above quadratic constraint could have infinitely many solutions, a unique solution, or no solution at all. For example, if ∇²f_i(w_t) is positive definite there may exist no solution; this occurs when f_i is convex, which is the most studied setting for second order methods. But if the loss f_i is positive and the Hessian has at least one negative eigenvalue, then equation 6 always has a solution. If equation 6 has solutions, then analogously to the SP method we can choose one using a projection step:

$$w_{t+1} \in \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|w - w_t\|^2 \quad \text{s.t.} \quad f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle + \tfrac{1}{2}\langle \nabla^2 f_i(w_t)(w - w_t), w - w_t\rangle = 0. \qquad (7)$$

We refer to equation 7 as the SP2 method. Using a quadratic expansion has several advantages. First, quadratic expansions are more accurate than linearizations, which allows us to take larger steps. Furthermore, using the quadratic expansion leads to convergence rates that are independent of how well conditioned the Hessian matrices are, as we show later in Proposition 1. Our SP2 method occupies a unique position in the literature on stochastic second order methods, since it is incremental and in no way relies on convexity or positive semi-definite Hessian matrices. Indeed, as we show in our non-convex experiments in Section 6.1 and the matrix completion experiments in Appendix B, SP2 excels at minimizing non-convex problems that satisfy interpolation.
In contrast, Newton based methods often converge to stationary points other than the global minima. We also relax the interpolation assumption, and develop analogous quadratic methods for finding w and the smallest possible s ∈ R such that

$$f_i(w) \leq s, \quad \text{for } i = 1, \ldots, n. \qquad (8)$$

We refer to these as the slack interpolation equations, which were introduced in Crammer et al. (2006) for linear models. If the interpolation assumption holds, then s = 0 and the above is equivalent to solving equation 3. When interpolation does not hold, equation 8 is still an upper approximation of equation 1, as detailed in Gower et al. (2022). The rest of this paper is organized as follows. We introduce related work in Section 2. We present the proposed SP2 methods in Section 3 and the corresponding convergence analysis in Section 4. In Section 5, we relax the interpolation condition and develop a variety of quadratic methods to solve the slack version of this problem. We test the proposed methods with a series of experiments in Section 6. Finally, we conclude our work and discuss future directions in Section 7.

2. RELATED WORK

Since it became clear that Stochastic Gradient Descent (SGD), with appropriate step size tuning, was an efficient method for solving the training problem equation 1, there has been a search for an efficient second order counterpart. The hope, and our objective here, is to find a second order stochastic method that is incremental, that is, it can work with mini-batches; that requires little to no tuning, since it would depend less on how well scaled or conditioned the data is; and that also applies to non-convex problems. To date there is a vast literature on stochastic second order methods, yet none achieve all of the above. The subsampled Newton methods such as (Roosta-Khorasani & Mahoney, 2019; Bollapragada et al., 2018; Liu & Roosta, 2021; Erdogdu & Montanari, 2015; Kohler & Lucchi, 2017; Jahani et al., 2017) require large batch sizes in order to guarantee that the subsampled Newton direction is close to the full Newton direction with high probability. As such, they are not incremental. Other examples of large-sample based methods include the stochastic quasi-Newton methods (Byrd et al., 2011; Mokhtari & Ribeiro, 2015; Moritz et al., 2016; Gower et al., 2016; Wang et al., 2017; Berahas et al., 2016), stochastic cubic Newton Tripuraneni et al. (2018), SDNA (Qu et al., 2016), Newton sketch (Pilanci & Wainwright, 2017) and Lissa (Agarwal et al., 2017), since these require a large mini-batch or full gradient evaluations. The only incremental second order methods we are aware of are IQN (Incremental Quasi-Newton) (Mokhtari et al., 2018), SNM (Stochastic Newton Method) (Kovalev et al., 2019; Rodomanov & Kropotov, 2016) and, very recently, SAN (Stochastic Average Newton) Chen et al. (2021). IQN and SNM enjoy fast local convergence, but their computational and memory costs per iteration are O(d²), making them prohibitive in large dimensions.
Handling non-convexity in second order methods is particularly challenging because most second order methods rely on convexity in their design. For instance, the classic Newton iterate is the minimizer of the local quadratic approximation when this approximation is convex. If it is not convex, the Newton step can be meaningless, or worse, a step uphill. Quasi-Newton methods maintain a positive definite approximation of the Hessian matrix, and thus are also problematic when applied to non-convex problems Wang et al. (2017), for which the Hessian is typically indefinite. Furthermore, the incremental Newton methods IQN, SNM and SAN rely on the convexity of f_i in their design. Indeed, without convexity, the iterates of IQN, SNM and SAN are not well defined. In contrast, our approach of finding roots of the local quadratic approximation equation 7 in no way relies on convexity; it relies solely on the fact that the local quadratic approximation around w_t is good if we are not far from w_t. But our approach does introduce a new problem: the need to solve a system of quadratic equations. We propose a series of methods to solve this in Sections 3 and 5. Solving quadratic equations has been heavily studied. There are even dedicated methods for solving

$$w^+ = \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|w - \hat{w}\|^2 \quad \text{s.t.} \quad Q(w) = 0, \qquad (9)$$

where Q(w) = ½wᵀHw + bᵀw + c for a given point ŵ, where H is a nonzero symmetric (not necessarily PSD) matrix, and the level set {w : Q(w) = 0} is nonempty. Note that since Q(w) is a quadratic function, problem equation 9 is nonconvex. Yet despite this non-convexity, so long as there exists a feasible point, the projection equation 9 can be solved in polynomial time by re-writing the projection as a semi-definite program, or by using the S-procedure, which involves computing the eigenvalue decomposition of H and using a line search, as proposed in Park & Boyd (2017) and detailed here in Section A. But this approach is too costly when the dimension d is large.
An alternative iterative method is proposed in Sosa & Raupp (2020), but only asymptotic convergence is guaranteed. In Dai (2006), the authors consider the similar problem of projecting a point onto a general ellipsoid, which is again a problem of solving quadratic equations; however, they require the matrix H to be positive definite. Problems equation 7 and equation 9 are also instances of a quadratically constrained quadratic program (QCQP). Although the QCQP in equation 7 has no closed form solution in general, we show in the next section that there is a closed form solution for generalized linear models (GLMs), which holds for convex and non-convex GLMs alike. For general non-linear models we propose in Section 3.2 an approximate solution to equation 7 that iteratively linearizes the quadratic constraint and projects onto the linearization.

3. THE SP2 METHOD

Next we give a closed form solution to equation 7 for GLMs. We then provide an approximate solution to equation 7 for more general models.

3.1. SP2 - GENERALIZED LINEAR MODELS

Consider when f_i is the loss over a linear model,

$$f_i(w) = \phi_i(x_i^\top w - y_i), \qquad (10)$$

where φ_i : R → R is a loss function and (x_i, y_i) ∈ R^{d+1} is an input-output pair. Consequently

$$\nabla f_i(w) = \phi_i'(x_i^\top w - y_i)\, x_i =: a_i x_i, \qquad \nabla^2 f_i(w) = \phi_i''(x_i^\top w - y_i)\, x_i x_i^\top =: h_i x_i x_i^\top. \qquad (11)$$

The quadratic constraint problem equation 7 can be solved exactly for GLMs equation 10, as we show next.

Lemma 1 (SP2). Assume f_i(w) is the loss of a generalized linear model equation 10 and is nonnegative. Let f_i = f_i(w_t) for short, and let a_i := φ_i'(x_i^⊤ w_t − y_i) and h_i := φ_i''(x_i^⊤ w_t − y_i). If

$$a_i^2 - 2 h_i f_i \geq 0, \qquad (12)$$

then the optimal solution of equation 7 is

$$w_{t+1} = w_t - \frac{a_i}{h_i}\left(1 - \frac{\sqrt{a_i^2 - 2h_i f_i}}{|a_i|}\right)\frac{x_i}{\|x_i\|^2}. \qquad (13)$$

Alternatively, if equation 12 does not hold then, since f_i ≥ 0, we necessarily have h_i > 0, and consequently a Newton step gives the minimum of the local quadratic, that is,

$$w_{t+1} = w_t - \frac{a_i}{h_i}\,\frac{x_i}{\|x_i\|^2}. \qquad (14)$$

The proof of the above lemma, and all subsequent missing proofs, can be found in the appendix. Lemma 1 establishes a condition, equation 12, under which we should not take a full Newton step. Interestingly, this condition holds when the square root of the loss function has negative curvature, as we show in the next lemma.

Lemma 2. Let φ be a non-negative function which is twice differentiable at all t with φ(t) ≠ 0. The condition equation 12, in other words φ'(t)² ≥ 2φ(t)φ''(t), holds when √φ(t) is concave away from its roots, i.e. when (d²/dt²)√φ(t) ≤ 0 for all t with φ(t) ≠ 0.

Examples of such loss functions include φ(t) = tanh²(t) and φ(t) = t^p with 0 ≤ p ≤ 2 (see Figure (a)). In conclusion to this section, SP2 has a closed form solution for GLMs, and this closed form covers many non-convex loss functions.
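The closed-form SP2 update of Lemma 1 can be sketched as follows. This is our own illustration (function names and test data are assumptions, not from the paper); as a check, we verify that on the non-convex loss φ(t) = tanh²(t) the step zeroes the local quadratic model, which is exactly what equation 13 is designed to do.

```python
import math
import numpy as np

def sp2_glm_step(w, x, y, phi, dphi, d2phi):
    """One SP2 step for a GLM loss f(w) = phi(x^T w - y), following Lemma 1.

    a = phi'(t), h = phi''(t) at t = x^T w - y. If a^2 - 2*h*f >= 0 the local
    quadratic model has a root and we take the smaller-magnitude root
    (equation 13); otherwise h > 0 and we fall back to the Newton step on the
    local quadratic (equation 14)."""
    t = x @ w - y
    f, a, h = phi(t), dphi(t), d2phi(t)
    if f == 0.0:
        return w                              # this loss is already interpolated
    disc = a * a - 2.0 * h * f
    if abs(h) < 1e-12:                        # locally linear: plain SP step
        tau = -f / a
    elif disc >= 0.0:                         # equation 13
        tau = (-a + math.copysign(math.sqrt(disc), a)) / h
    else:                                     # equation 14: Newton step
        tau = -a / h
    return w + (tau / (x @ x)) * x

# Check on phi(t) = tanh(t)^2 that the update zeroes f + a*tau + 0.5*h*tau^2.
phi = lambda t: math.tanh(t) ** 2
dphi = lambda t: 2.0 * math.tanh(t) / math.cosh(t) ** 2
d2phi = lambda t: (2.0 - 4.0 * math.sinh(t) ** 2) / math.cosh(t) ** 4

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
y = 0.2
t = x @ w - y
f, a, h = phi(t), dphi(t), d2phi(t)
w_new = sp2_glm_step(w, x, y, phi, dphi, d2phi)
tau = x @ (w_new - w)
model_value = f + a * tau + 0.5 * h * tau ** 2
print(model_value)   # ~0 whenever a^2 - 2*h*f >= 0
```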

3.2. SP2+ - LINEARIZING AND PROJECTING

In general, there is no closed form solution to equation 7; indeed, a solution may not even exist. Inspired by the fact that computing a Hessian-vector product can be done with a single backpropagation at the same cost as computing a gradient Christianson (1992), we make use of cheap Hessian-vector products to derive an approximate solution to equation 7. Instead of solving equation 7 exactly, here we propose to take two steps towards solving equation 7 by projecting onto linearized constraints. To describe this method, let

$$q(w) := f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle + \tfrac{1}{2}\langle \nabla^2 f_i(w_t)(w - w_t), w - w_t\rangle. \qquad (15)$$

In the first step we linearize the quadratic constraint equation 15 around w_t and project onto this linearization:

$$w_{t+1/2} = \arg\min_{w\in\mathbb{R}^d} \tfrac{1}{2}\|w - w_t\|^2 \quad \text{s.t.} \quad f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle = 0. \qquad (16)$$

The closed form update of this first step is given by

$$w_{t+1/2} = w_t - \frac{f_i(w_t)}{\|\nabla f_i(w_t)\|^2}\nabla f_i(w_t), \qquad (17)$$

which is a Stochastic Polyak step equation 4. For the second step, we once again linearize the quadratic constraint equation 15, but this time around the point w_{t+1/2}, and set this linearization to zero, that is,

$$w_{t+1} = \arg\min_{w\in\mathbb{R}^d} \tfrac{1}{2}\|w - w_{t+1/2}\|^2 \quad \text{s.t.} \quad q(w_{t+1/2}) + \langle \nabla q(w_{t+1/2}), w - w_{t+1/2}\rangle = 0. \qquad (18)$$

The closed form update of this second step is given by

$$w_{t+1} = w_{t+1/2} - \frac{q(w_{t+1/2})}{\|\nabla q(w_{t+1/2})\|^2}\nabla q(w_{t+1/2}). \qquad (19)$$

We refer to the resulting method as SP2+, summarized in the following lemma.

Lemma 3 (SP2+). Let g_t ≡ ∇f_i(w_t) and H_t ≡ ∇²f_i(w_t). The combined update of equation 17 and equation 19 is given by

$$w_{t+1} = w_t - \frac{f_i(w_t)}{\|g_t\|^2} g_t - \frac{1}{2}\frac{f_i(w_t)^2}{\|g_t\|^4}\frac{\langle H_t g_t, g_t\rangle}{\|v_{t+1}\|^2} v_{t+1}, \quad \text{where} \quad v_{t+1} = \Big(I - \frac{f_i(w_t)}{\|g_t\|^2}H_t\Big) g_t. \qquad (20)$$

In equation 20 we can see that SP2+ applies a second order correction to the SP step. SP2+ is equivalent to two steps of a Newton-Raphson method applied to finding a root of q(w).
If we apply multiple steps of the Newton-Raphson method, as opposed to two, the resulting method converges to a root of q; see Theorem 2 in the appendix. Theorem 2 shows that this multi-step version of SP2+ converges when q belongs to a large class of non-convex functions known as star-convex functions. Star-convexity, a generalization of convexity, includes several non-convex loss functions Hinder et al. (2019).
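The linearize-and-project iteration behind SP2+ can be sketched using only gradient and Hessian-vector products. This is our own sketch (names and the toy quadratic are assumptions): with steps=2 the loop reproduces the SP2+ update of equations 17 and 19, and with more steps it gives the multi-step Newton-Raphson variant.

```python
import numpy as np

def sp2_plus_step(w_t, f, grad, hvp, steps=2):
    """Approximate the SP2 projection (equation 7) by repeatedly linearizing the
    quadratic model q around the current point and projecting onto the
    linearization. f, grad and hvp are callables for the sampled loss f_i, its
    gradient, and a Hessian-vector product at w_t; only HVPs are needed."""
    f_t, g_t = f(w_t), grad(w_t)

    def q(w):                     # local quadratic model of f_i around w_t
        d = w - w_t
        return f_t + g_t @ d + 0.5 * d @ hvp(d)

    def q_grad(w):
        return g_t + hvp(w - w_t)

    w = w_t
    for _ in range(steps):
        qv, qg = q(w), q_grad(w)
        w = w - (qv / (qg @ qg)) * qg  # Polyak-type projection onto q's linearization
    return w

# Smoke test on a convex quadratic f(w) = 0.5 (w - w*)^T H (w - w*),
# where the exact HVP is just H @ v (in practice it would come from backprop).
H = np.array([[2.0, 0.3], [0.3, 1.0]])
w_star = np.array([1.0, -2.0])
f = lambda w: 0.5 * (w - w_star) @ H @ (w - w_star)
grad = lambda w: H @ (w - w_star)
hvp = lambda v: H @ v

w_t = np.array([3.0, 3.0])
w_next = sp2_plus_step(w_t, f, grad, hvp, steps=2)
```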

4. CONVERGENCE THEORY

Here we provide a convergence theory for SP2 and SP2+ for when f(w) is an average of quadratic functions. Fix w^* ∈ R^d and let the loss over the i-th data point be given by

$$f_i(w) = \langle H_i (w - w^*), w - w^*\rangle, \qquad (21)$$

where H_i ∈ R^{d×d} is a symmetric positive semi-definite matrix for i = 1, ..., n. Note that we assume that our algorithms have access to the second order terms H_i, but not to the minimizer w^*. Consequently f_i(w^*) = 0 = f(w^*) = min_{w∈R^d} f(w), and interpolation holds.

Proposition 1. Consider the loss functions in equation 21. The SP2 method converges linearly:

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \leq \rho\,\mathbb{E}\|w_t - w^*\|^2, \quad \text{where} \quad \rho = \lambda_{\max}\Big(I - \tfrac{1}{n}\sum_{i=1}^n H_i H_i^+\Big) < 1. \qquad (22)$$

The rate of convergence of SP2 in equation 22 can be orders of magnitude better than that of SGD. Indeed, since equation 21 is convex, smooth, and interpolation holds, we have from Needell et al. (2016) that SGD converges at the rate

$$\rho_{SGD} = 1 - \frac{1}{2n}\,\frac{\lambda_{\min}\big(\sum_{i=1}^n H_i\big)}{\max_{i=1,\ldots,n}\lambda_{\max}(H_i)}. \qquad (23)$$

To compare equation 23 to the rate ρ in Proposition 1, consider the case where all H_i are invertible. In this case H_i H_i^+ = I, thus ρ = 0 and SP2 converges in one step. Indeed, even if a single H_i is invertible, after sampling i the SP2 method will reach w^*, solving the system exactly. Of course, the method is more interesting when the H_i's are low rank. In contrast, the SGD method is still at the mercy of the spectra of the H_i matrices, and its rate depends on how well conditioned these matrices are. Even in the extreme case where all H_i are well conditioned, for example H_i = i × I, the rate of convergence of SGD can be very slow; in this case ρ_SGD = 1 − 1/(2n²).

Proposition 2. Consider the loss functions in equation 21. The SP2+ method equation 20 converges linearly:

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \leq \rho_{SP2^+}^2\,\mathbb{E}\|w_t - w^*\|^2, \quad \text{where} \quad \rho_{SP2^+} = 1 - \frac{1}{2n}\sum_{i=1}^n \frac{\lambda_{\min}(H_i)}{\lambda_{\max}(H_i)}. \qquad (24)$$

The proof of Proposition 2 follows from Corollary 5.7 in Gower et al. (2021a). The rate of convergence of SP2+ depends on the average condition number of the H_i matrices. The rate in equation 24 can be greater or smaller than the rate of SGD in equation 23.
For instance, if one H_j has a condition number much greater than the rest, then the rate in equation 24 will be smaller than that in equation 23. On the other hand, if all the H_j's share the same largest eigenvalue, that is, max_{i=1,...,n} λ_max(H_i) = λ_max(H_j) for j = 1, ..., n, then equation 23 is a smaller rate than equation 24. Note also that the rate of SP2+ appears squared in equation 24, while the rate ρ_SGD of SGD is not squared. But this difference accounts for the fact that each step of SP2+ is at least twice the cost of an SGD step, since each step of SP2+ comprises two gradient-type steps, see equation 17 and equation 19. Thus we can neglect the apparent advantage of the rate ρ_SP2+ being squared.
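The one-step behavior of SP2 on the quadratic losses equation 21 can be simulated directly. For a PSD quadratic f_i, the zero set {w : f_i(w) = 0} is the affine subspace w* + null(H_i), and the SP2 projection onto it can be rewritten (our own rearrangement) as a pseudo-inverse Newton step w − (∇²f_i)⁺∇f_i(w). The sketch below plants a solution w* to build the synthetic gradients; a real implementation would obtain ∇f_i from autodiff without knowing w*.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 10
w_star = rng.standard_normal(d)
# Rank-one PSD Hessian terms H_i = x_i x_i^T, so that
# f_i(w) = <H_i (w - w*), w - w*> = (x_i^T (w - w*))^2, as in equation 21.
xs = rng.standard_normal((n, d))
Hs = [np.outer(x, x) for x in xs]

def f(w):
    return sum((x @ (w - w_star)) ** 2 for x in xs) / n

def sp2_quadratic_step(w, H):
    """For f_i(w) = <H(w - w*), w - w*> with H PSD, projecting onto the zero
    set w* + null(H) is the pseudo-inverse Newton step w - pinv(Hess) @ grad,
    with Hess = 2H and grad = 2H(w - w*)."""
    grad = 2.0 * H @ (w - w_star)   # synthetic stand-in for what autodiff returns
    hess = 2.0 * H
    return w - np.linalg.pinv(hess) @ grad

w = np.zeros(d)
for sweep in range(500):
    for H in Hs:                    # cyclic pass over the data
        w = sp2_quadratic_step(w, H)

print(f(w))   # ~0: the iterates converge to w_star since the x_i span R^d
```

Note that if any single H_i were invertible, the very first step with that H_i would land exactly on w*, matching the ρ = 0 discussion above.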

5. QUADRATIC WITH SLACK

Here we depart from the interpolation Assumption 1 and design variants of SP2+ that can be applied to models that are close to interpolation. Instead of trying to set all the losses to zero, we now try to find the smallest slack variable s for which f_i(w) ≤ s for i = 1, ..., n. If interpolation holds, then s = 0 is a solution; outside of interpolation, s may be non-zero. There are two natural ways of translating the slack constraint above into a penalty term: one can consider either the associated ℓ2 or ℓ1 loss. We explore both in turn.

5.1. L2 SLACK FORMULATION

To make s as small as possible, we consider the following problem:

$$\min_{s\in\mathbb{R},\, w\in\mathbb{R}^d} \tfrac{1}{2}s^2 \quad \text{subject to} \quad f_i(w) \leq s, \ \text{for } i = 1, \ldots, n, \qquad (25)$$

which we call the L2 slack formulation. This type of slack problem was introduced in Crammer et al. (2006) to derive variants of the passive-aggressive method that apply to linear models on non-separable data; in other words, when the models cannot interpolate the data. To solve equation 25 we again project onto a local quadratic approximation of the constraint. Let

$$q_{i,t}(w) := f_i(w_t) + \langle \nabla f_i(w_t), w - w_t\rangle + \tfrac{1}{2}\langle \nabla^2 f_i(w_t)(w - w_t), w - w_t\rangle, \qquad (26)$$

and let ∆_t = ‖w − w_t‖² + (s − s_t)². Consider the iterative method given by

$$w_{t+1}, s_{t+1} = \arg\min_{s\geq 0,\, w\in\mathbb{R}^d} \frac{1-\lambda}{2}\Delta_t + \frac{\lambda}{2}s^2 \quad \text{s.t.} \quad q_{i,t}(w) \leq s, \qquad (27)$$

where λ ∈ [0, 1] is a regularization parameter that trades off between having a small s and using the previous iterates as a regularizer. The resulting projection problem in equation 27 has a quadratic inequality constraint, and thus in most cases has no closed form solution, despite always being feasible. So instead of solving equation 27 exactly, we propose an approximate solution by iteratively linearizing and projecting onto the constraints. Our approximate solution has two steps, the first being

$$w_{t+1/2}, s_{t+1/2} = \arg\min_{s\geq 0,\, w\in\mathbb{R}^d} \frac{1-\lambda}{2}\Delta_t + \frac{\lambda}{2}s^2 \quad \text{s.t.} \quad q_{i,t}(w_t) + \langle \nabla q_{i,t}(w_t), w - w_t\rangle \leq s. \qquad (28)$$

The second step projects onto the linearization around w_{t+1/2}:

$$w_{t+1}, s_{t+1} = \arg\min_{s\geq 0,\, w\in\mathbb{R}^d} \frac{1-\lambda}{2}\Delta_{t+1/2} + \frac{\lambda}{2}s^2 \quad \text{s.t.} \quad q_{i,t}(w_{t+1/2}) + \langle \nabla q_{i,t}(w_{t+1/2}), w - w_{t+1/2}\rangle \leq s, \qquad (29)$$

where ∆_{t+1/2} = ‖w − w_{t+1/2}‖² + (s − s_{t+1/2})². The closed form solution of this two step method is given in Lemma 7 of Appendix C.7. We refer to this method as SP2L2+.
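Each of the two steps above is a small convex QP with a single linear inequality. As an illustration, the first step (equation 28) can be solved via its KKT conditions. The derivation and names below are ours, not the paper's Lemma 7; we exploit that, since s_t ≥ 0 and the multiplier is nonnegative, the s ≥ 0 bound never binds.

```python
import numpy as np

def sp2l2_first_step(w_t, s_t, q0, q_grad, lam):
    """First SP2L2+ projection (equation 28), solved via KKT (our derivation).

    min_{s>=0, w} (1-lam)/2 * (||w - w_t||^2 + (s - s_t)^2) + lam/2 * s^2
    s.t.  q0 + <q_grad, w - w_t> <= s,
    where q0 = q_{i,t}(w_t) and q_grad = grad q_{i,t}(w_t). Assumes s_t >= 0."""
    alpha = 1.0 - lam
    # Multiplier of the linear constraint; zero when the constraint is slack.
    mu = max(0.0, (q0 - alpha * s_t) / (1.0 + (q_grad @ q_grad) / alpha))
    w = w_t - (mu / alpha) * q_grad
    s = alpha * s_t + mu   # >= 0 automatically since s_t >= 0 and mu >= 0
    return w, s

w_t = np.array([1.0, 2.0]); s_t = 0.0
q0, g, lam = 3.0, np.array([1.0, -1.0]), 0.5
w_half, s_half = sp2l2_first_step(w_t, s_t, q0, g, lam)
# When the multiplier is active, the linearized constraint holds with equality:
print(q0 + g @ (w_half - w_t) - s_half)   # ~0 here
```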

5.2. L1 SLACK FORMULATION

To make s as small as possible, we can also solve the following L1 slack formulation:

$$\min_{s\geq 0,\, w\in\mathbb{R}^d} s \quad \text{s.t.} \quad f_i(w) \leq s, \ \text{for } i = 1, \ldots, n. \qquad (30)$$

We can again project onto a local quadratic approximation of the constraint. That is, let λ ∈ [0, 1] be a regularization parameter that trades off between having a small s and using the previous iterates as a regularizer, and consider the iterative method given by

$$w_{t+1}, s_{t+1} = \arg\min_{s\geq 0,\, w\in\mathbb{R}^d} \frac{1-\lambda}{2}\Delta_t + \frac{\lambda}{2}s \quad \text{s.t.} \quad q_{i,t}(w) \leq s.$$

To approximately solve equation 30, we again propose an approximate two step method analogous to equation 28 and equation 29. The closed form solution of the two step method is given in Lemma 9 of Appendix C.8. We refer to this method as SP2L1+.

5.3. DROPPING THE SLACK REGULARIZATION

Note that the objective function in equation 30 contains a regularization term (s − s_t)², which forces s to be close to s_t. If we allow s to be far from s_t, we can instead solve the following unregularized problem:

$$w_{t+1}, s_{t+1} = \arg\min_{s\geq 0,\, w\in\mathbb{R}^d} \frac{1-\lambda}{2}\|w - w_t\|^2 + \frac{\lambda}{2}s \quad \text{s.t.} \quad q_{i,t}(w) \leq s, \qquad (31)$$

where λ ∈ [0, 1] is again a regularization parameter that trades off between having a small s and using the previous iterates as a regularizer. We call the resulting method in equation 31 the SP2max method, since it is a second order variant of the SPmax method Loizou et al. (2020). For a GLM equation 10, if f_i = f_i(w_t) is non-negative, then the iterates of equation 31 have a closed form solution given by

$$w_{t+1} = w_t + c\, x_i, \qquad s_{t+1} = \max\{\bar{s}, 0\},$$

$$c = \begin{cases} 0, & \text{if } f_i = 0, \\[1mm] \dfrac{-\bar{\lambda} a_i}{1 + \ell\bar{\lambda} h_i}, & \text{if } f_i > 0 \text{ and } \bar{s} \geq 0, \\[2mm] \dfrac{-a_i + \sqrt{a_i^2 - 2 h_i f_i}}{\ell h_i}, & \text{otherwise}, \end{cases}$$

where

$$\bar{s} = f_i - \frac{\bar{\lambda}\ell a_i^2}{1 + \ell\bar{\lambda} h_i} + \frac{h_i \bar{\lambda}^2 \ell^2 a_i^2}{2(1 + \ell\bar{\lambda} h_i)^2}, \qquad \ell = \|x_i\|^2, \qquad \bar{\lambda} = \frac{\lambda}{2(1-\lambda)}.$$

To approximately solve equation 31 in general, we again propose an approximate two step method. The closed form solution of the two step method is given in Lemma 10 of Appendix C.9. We refer to this method as SP2max+. Further results are given in Figures 6 and 7 in the appendix, where for a baseline we compare against SGD (yellow), Adam (pink) and Newton (red).

6. EXPERIMENTS

6.1. NON-CONVEX EXPERIMENTS

The two experiments with the Levy N. 13 and Rosenbrock functions are detailed in Section D.1.2 of the appendix. All of these functions are 2D sums-of-terms of the form equation 1 and satisfy the interpolation Assumption 1. To compute the SP2 update we used ten steps of the Newton-Raphson method, as detailed in Section C.4. We consistently find across these non-convex problems that SP2 and SP2+ are very competitive, with SP2 converging in under 20 epochs. SP2 also converges faster in terms of time taken (see the middle of Figures 1 and 2). Here we can clearly see that SP2 converges to a high precision solution (like most second order methods) and, unlike other second order methods, is not attracted to local maxima or saddle points. In contrast, Newton's method converges to a local maximum on all problems excluding the Rosenbrock function in Figure 7 in the appendix. For instance, on the right of Figure 2 we can see the red dot of Newton's method stuck on a local maximum. When increasing the dimension to 100, these non-convex problems become very challenging (see Figure 3), and often only SP2, SP2+ and SP are able to find the global minima. Note that we no longer compare to Newton's method because it exceeded our maximum allocated time.

6.2. MATRIX COMPLETION

Assume we are given a set of known values {a_{i,j}}_{(i,j)∈Ω}, where Ω is the set of known entries of a matrix, and we want to determine the missing entries. One approach is to solve the matrix completion problem

$$\min_{U,V} \sum_{(i,j)\in\Omega} \tfrac{1}{2}\big(u_i^\top v_j - a_{i,j}\big)^2, \qquad (32)$$

where A = [a_{i,j}] ∈ R^{m×n}, U = [u_i]_{i=1,...,m} ∈ R^{r×m} and V = [v_j]_{j=1,...,n} ∈ R^{r×n}. After solving equation 32, we use the matrix UᵀV as an approximation of the complete matrix A. Despite equation 32 being a non-convex problem, if there exists an interpolating solution to equation 32, one where u_i^⊤ v_j = a_{i,j} for (i, j) ∈ Ω, then the SP2 method can solve equation 32. Indeed, SP2 can be applied to equation 32 by sampling a single pair (i, j) ∈ Ω uniformly, then projecting onto the quadratic constraint (the solution of which is detailed in Theorem 1 in Appendix B):

$$u_i^{k+1}, v_j^{k+1} = \arg\min_{u,v} \tfrac{1}{2}\|u - u_i^k\|^2 + \tfrac{1}{2}\|v - v_j^k\|^2 \quad \text{subject to} \quad u^\top v = a_{i,j}. \qquad (33)$$

We compared our method equation 33 to a specialized variant of SGD for online matrix completion described in Jin et al. (2016); see Figure 4. To compare the two methods we generated a rank r = 2 matrix A ∈ R^{100×50}. We selected a subset of entries with probability p = 0.1, 0.2 or 0.3 to form our set Ω_init, which was used to obtain an initial estimate U_0, V_0 using the rank-k SVD method described in Jin et al. (2016). We extensively tuned the step size of SGD using a grid search, and the method labelled Non-convex SGD is the resulting run of SGD with the best step size. We also show how sensitive SGD is to this step size by including runs of SGD with step sizes that were only a factor of 2 to 4 away from the optimal, which greatly degrades the performance of SGD. In contrast, SP2 worked with no tuning, matches the performance of SGD with the optimal step size in the p = 0.1 experiment, and outperforms SGD when more measurements are available in the p = 0.2 and p = 0.3 figures.
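The projection equation 33 reduces to a one-dimensional root-finding problem via the bisection argument from Appendix B. The sketch below is our own illustration of that step (variable names are assumptions); it finds the multiplier γ ∈ (−1, 1) by bisection and checks that the returned pair satisfies the constraint.

```python
import numpy as np

def sp2_mc_step(u, v, a, iters=60):
    """SP2 projection for matrix completion (equation 33): the closest (u', v')
    to (u, v) satisfying u'^T v' = a. Following the derivation in Appendix B,
    the KKT conditions give u' = (u - g*v)/(1 - g^2), v' = (v - g*u)/(1 - g^2),
    where the multiplier g in (-1, 1) is a root of phi, found by bisection."""
    def phi(g):
        return (1 + g * g) * (u @ v) - g * (u @ u + v @ v) - (1 - g * g) ** 2 * a

    lo, hi = -1.0, 1.0   # phi(-1) = ||u+v||^2 >= 0 and phi(1) = -||u-v||^2 <= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    g = 0.5 * (lo + hi)
    return (u - g * v) / (1 - g * g), (v - g * u) / (1 - g * g)

rng = np.random.default_rng(3)
u, v = rng.standard_normal(4), rng.standard_normal(4)
a = 1.5
u2, v2 = sp2_mc_step(u, v, a)
print(u2 @ v2)   # ~1.5, i.e. the sampled entry is now interpolated
```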
Figure 5: The gradient norm ‖∇f‖² against effective passes on the colon-cancer data set with momentum 0.3, for regularizations σ ∈ {0, 10^{-3}, 8×10^{-3}}, comparing SGD, SP, SP2+, SP2L2+, SP2L1+, SP2max+ and ADAM.

6.3. CONVEX CLASSIFICATION

Here we compare our proposed methods to SGD, SP and ADAM on a logistic regression problem. In particular, we consider the problem of minimizing the loss function

$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\sigma}{2}\|w\|_2^2, \quad \text{where} \quad f_i(w) = \phi_i(x_i^\top w) \ \text{with} \ \phi_i(t) = \ln(1 + e^{-y_i t}).$$

Here {(x_i, y_i) ∈ R^{d+1}}_{i=1}^n are the feature-label pairs and σ ≥ 0 is the regularization parameter. We control how far each problem is from interpolation by increasing σ. When σ > 0 the problem cannot interpolate, and thus we expect to see a benefit of the slack methods of Section 5 over SP2+. We used two data sets: colon-cancer ((n, d) = (62, 2000)) Alon et al. (1999) and mushrooms ((n, d) = (8124, 112)) West et al. (2001), both of which interpolate when σ = 0. We compare the proposed methods SP2+ equation 20, SP2L2+ (Lemma 7), SP2L1+ (Lemma 9) and SP2max+ (Lemma 10) with SGD, SP equation 4 and ADAM on both data sets with three regularizations σ ∈ {0, 0.001, 0.008} and with momentum set to 0.3. For the SGD method we use a learning rate of L_max/√t at the t-th iteration, where L_max = (1/4) max_{i=1,...,n} ‖x_i‖² denotes the smoothness constant of the loss function. We chose λ for SP2L2+, SP2L1+ and SP2max+ using a grid search over λ ∈ {0.1, 0.2, ..., 0.9}; the details are in Section D. The gradient norm evaluated at each epoch is presented in Figures 5 and 16 (see Appendix D). We see that the SP2 methods converge much faster than the classical methods (e.g., SGD, SP, ADAM) and need fewer epochs to achieve the tolerance when σ is small (left and middle plots). However, they can all fail when the problem is far from interpolation, e.g., when σ = 8 × 10^{-3}. The running time needed by each algorithm to achieve the tolerance on both data sets is presented in Figure 17 (see Appendix D).
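For this logistic loss, the GLM scalars a_i = φ_i'(t) and h_i = φ_i''(t) that drive the SP2-family updates (equation 11) have simple closed forms. The sketch below (our own helper, not the paper's code) computes them and checks the gradient ∇f_i = a_i x_i against finite differences.

```python
import numpy as np

def logistic_quantities(w, x, y):
    """GLM scalars used by the SP2 family for the logistic loss
    phi(t) = ln(1 + exp(-y*t)) with t = x^T w: the loss f, and a = phi'(t),
    h = phi''(t), so that grad f = a*x and Hessian = h*x*x^T (equation 11)."""
    s = y * (x @ w)
    f = np.log1p(np.exp(-s))
    sig = 1.0 / (1.0 + np.exp(s))   # = sigmoid(-s)
    a = -y * sig                    # phi'(t), using y^2 = 1
    h = sig * (1.0 - sig)           # phi''(t), always in (0, 1/4]
    return f, a, h

# Finite-difference check of grad f = a * x.
rng = np.random.default_rng(7)
w = rng.standard_normal(3); x = rng.standard_normal(3); y = 1.0
f, a, h = logistic_quantities(w, x, y)
eps = 1e-6
num_grad = np.array([
    (logistic_quantities(w + eps * e, x, y)[0]
     - logistic_quantities(w - eps * e, x, y)[0]) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(num_grad - a * x)))   # small finite-difference error
```

Note that h is bounded by 1/4, which is where the smoothness constant L_max = (1/4) max_i ‖x_i‖² quoted above comes from.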

7. CONCLUSION

We have proposed new second order methods aimed at overparameterized models that can interpolate (or nearly interpolate) the data. In contrast to previous incremental second order methods, ours do not rely on convexity. Quite the opposite: the SP2 method can benefit from the Hessian having at least one negative eigenvalue. Consequently, the SP2 method excels at optimizing non-convex models that interpolate, as we demonstrated in Section 6.1 and Appendix B. We also provided a convergence result in Proposition 1 showing that SP2 can converge significantly faster than SGD for sums-of-quadratics. In Section 5, we then developed second order methods that solve a relaxed version of the interpolation equations by allowing some slack. We showed that these methods still perform well on problems that are close to interpolation in Section 6.3. In future work, it would be interesting to develop specialized variants of SP2 for optimizing Deep Neural Networks (DNNs). DNNs are particularly well suited since modern deep models often interpolate, are non-convex, and enjoy fast gradient and Hessian-vector computations via back-propagation.

A. PROJECTION ONTO A QUADRATIC CONSTRAINT

Lemma 5. Consider the projection problem

$$\min_{w\in\mathbb{R}^d} \tfrac{1}{2}\|w - z\|^2 \quad \text{s.t.} \quad r + \langle q, w - z\rangle + \tfrac{1}{2}\langle P(w - z), w - z\rangle = 0. \qquad (34)$$

Let λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_d be the eigenvalues of P and let QΛQᵀ = P be the eigenvalue decomposition of P, where Λ = diag(λ_i) and QQᵀ = I. Let q̃ = Qᵀq. If the quadratic constraint in equation 34 is feasible, then there exists a solution to equation 34. We now give the three candidate solutions.

1. If r = 0, then the solution is given by w = z.

2. Now assume r ≠ 0. Let

$$\nu = \max_{i:\lambda_i \neq 0} -\frac{1}{\lambda_i} = \max\left\{-\frac{1}{\lambda_1}, -\frac{1}{\lambda_d}\right\}, \qquad (35)$$

$$i^* \in \arg\max_{i:\lambda_i \neq 0} -\frac{1}{\lambda_i}, \qquad (36)$$

$$N = \{i : \lambda_i = \lambda_{i^*}\}, \qquad (37)$$

and let

$$x^* = -\nu (I + \nu\Lambda)^{\dagger} \tilde{q}. \qquad (38)$$

If

$$2\nu r + \nu\langle \tilde{q}, x^*\rangle - \|x^*\|^2 + \frac{\nu^2}{4}\sum_{i\in N}\tilde{q}_i^2 \geq 0, \qquad (39)$$

then the solution is given by w = z + Q(x^* + n), where n ∈ R^d with

$$n_i = \frac{\nu}{2}\tilde{q}_i + \frac{1}{\sqrt{|N|}}\sqrt{2\nu r + \nu\langle \tilde{q}, x^*\rangle - \|x^*\|^2 + \frac{\nu^2}{4}\sum_{i\in N}\tilde{q}_i^2}, \ \text{for } i \in N, \qquad n_i = 0, \ \text{for } i \notin N. \qquad (40)$$

3. Alternatively, if equation 39 does not hold, then the solution is given by

$$w = z - \nu Q(I + \nu\Lambda)^{\dagger} \tilde{q}, \qquad (41)$$

where ν is the solution to the nonlinear equation

$$\frac{\nu}{2}\sum_i \frac{\tilde{q}_i^2(2 + \nu\lambda_i)}{(1 + \nu\lambda_i)^2} = r. \qquad (42)$$

Proof. First note that there exists a solution to equation 34 since the constraint set is closed and assumed nonempty. Let QΛQᵀ = P be the eigenvalue decomposition of P, where QQᵀ = I. By the change of variables x = Qᵀ(w − z), problem equation 34 is equivalent to

$$\arg\min_{x\in\mathbb{R}^d} \tfrac{1}{2}\|x\|^2 \quad \text{s.t.} \quad r + \langle \tilde{q}, x\rangle + \tfrac{1}{2}\langle \Lambda x, x\rangle = 0, \qquad (43)$$

where q̃ = Qᵀq. The Lagrangian of equation 43 is given by

$$L(x, \nu) = \tfrac{1}{2}\|x\|^2 + \nu\left(r + \langle \tilde{q}, x\rangle + \tfrac{1}{2}\langle \Lambda x, x\rangle\right) = \tfrac{1}{2}x^\top(I + \nu\Lambda)x + \nu\left(r + \langle \tilde{q}, x\rangle\right). \qquad (44)$$

Thus the KKT conditions are given by

$$\nabla_x L(x, \nu) = (I + \nu\Lambda)x + \nu\tilde{q} = 0, \qquad (45)$$

$$\nabla_\nu L(x, \nu) = r + \langle \tilde{q}, x\rangle + \tfrac{1}{2}\langle \Lambda x, x\rangle = 0. \qquad (46)$$

Since we are guaranteed that the projection has a solution, a necessary condition is that the solution satisfies ∇²_x L(x, ν) = (I + νΛ) ⪰ 0; see Theorem 12.5 in Wright & Nocedal (1999). Consequently, either (I + νΛ) ≻ 0 or (I + νΛ) has a zero eigenvalue. Consider the case where (I + νΛ) ≻ 0. From equation 45 we have that

$$x = -\nu(I + \nu\Lambda)^{-1}\tilde{q}. \qquad (47)$$

Now note that if ν = 0 then x = 0, and by the constraint we must have r = 0. Conversely, if r ≠ 0 then ν ≠ 0. Assume now ν ≠ 0. Substituting the above into equation 46 and letting Λ = diag(λ_i) gives

$$\begin{aligned} \nabla_\nu L(x, \nu) &= r + \langle \tilde{q}, x\rangle + \frac{1}{2\nu}\langle \nu\Lambda x, x\rangle \\ &= r + \langle \tilde{q}, x\rangle + \frac{1}{2\nu}\langle (I + \nu\Lambda)x, x\rangle - \frac{1}{2\nu}\|x\|^2 \\ &= r + \frac{1}{2}\langle \tilde{q}, x\rangle - \frac{1}{2\nu}\|x\|^2 \qquad \text{(using equation 45)} \\ &= r - \frac{\nu}{2}\langle \tilde{q}, (I + \nu\Lambda)^{-1}\tilde{q}\rangle - \frac{\nu}{2}\|(I + \nu\Lambda)^{-1}\tilde{q}\|^2 \qquad \text{(using equation 47)} \\ &= r - \frac{\nu}{2}\sum_i \left(\frac{\tilde{q}_i^2}{1 + \nu\lambda_i} + \frac{\tilde{q}_i^2}{(1 + \nu\lambda_i)^2}\right). \end{aligned}$$

Setting this to zero gives

$$\frac{\nu}{2}\sum_i \frac{\tilde{q}_i^2(2 + \nu\lambda_i)}{(1 + \nu\lambda_i)^2} = r. \qquad (48)$$
Upon finding the solution ν to the above, our final solution is given by w = z + Qx, that is,

$$w = z - \nu Q(I + \nu\Lambda)^{-1}\tilde{q}. \qquad (49)$$

Alternatively, suppose that (I + νΛ) ⪰ 0 is singular. Positive semi-definiteness implies that

$$1 + \nu\lambda_i \geq 0, \quad \text{for } i = 1, \ldots, d. \qquad (50)$$

For (I + νΛ) to be singular, at least one of the above inequalities must hold with equality. Recall that the eigenvalues are arranged in increasing order λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_d. For one of the inequalities equation 50 to hold with equality we need

$$\nu = \max_{i:\lambda_i \neq 0} -\frac{1}{\lambda_i} = \max\left\{-\frac{1}{\lambda_1}, -\frac{1}{\lambda_d}\right\}.$$

Since (I + νΛ) is singular with this ν, the solutions to equation 45 are given by

$$x = -\nu(I + \nu\Lambda)^{\dagger}\tilde{q} + n =: x^* + n, \quad \text{with } \langle x^*, n\rangle = 0, \qquad (51)$$

where † denotes the pseudo-inverse and n is in the kernel of (I + νΛ), in other words (I + νΛ)n = 0. It remains to determine n, which we can do with equation 46. Indeed, substituting equation 51 into equation 46 gives

$$\begin{aligned}\nabla_\nu L(x,\nu) &= r + \tfrac{1}{2}\langle \tilde{q}, x\rangle - \tfrac{1}{2\nu}\|x\|^2 \qquad \text{(using equation 45)}\\ &= r + \tfrac{1}{2}\langle \tilde{q}, x^* + n\rangle - \tfrac{1}{2\nu}\|n\|^2 - \tfrac{1}{2\nu}\|x^*\|^2. \qquad \text{(using equation 51)}\end{aligned}$$

Setting this to zero and completing the square in n, we have

$$\frac{1}{2\nu}\left\|n - \frac{\nu}{2}\tilde{q}\right\|^2 = r + \frac{1}{2}\langle \tilde{q}, x^*\rangle - \frac{1}{2\nu}\|x^*\|^2 + \frac{\nu}{8}\|\tilde{q}\|^2. \qquad (52)$$

To characterize the solutions in n of the above, first note that n has only a few non-zero elements. To see this, let i^* ∈ arg max_{i:λ_i≠0} −1/λ_i, and note that (I + νΛ) has as many zeros on the diagonal as the multiplicity of the eigenvalue λ_{i^*}; that is, it has zero diagonal elements exactly on the indices in N = {i : λ_i = λ_{i^*}}, and thus the non-zero elements of n lie in N. Because of this observation we can re-write equation 52 as

$$\sum_{i\in N}\left(n_i - \frac{\nu}{2}\tilde{q}_i\right)^2 = 2\nu r + \nu\langle \tilde{q}, x^*\rangle - \|x^*\|^2 + \frac{\nu^2}{4}\|\tilde{q}\|^2 - \sum_{i\notin N}\frac{\nu^2}{4}\tilde{q}_i^2 = 2\nu r + \nu\langle \tilde{q}, x^*\rangle - \|x^*\|^2 + \frac{\nu^2}{4}\sum_{i\in N}\tilde{q}_i^2.$$
Consequently, if the right-hand side above is non-negative, then there exist solutions, of which
n_i = (ν/2) q̃_i + (1/√|N|) √( 2νr + ν⟨q̃, x*⟩ − ‖x*‖² + (ν²/4) Σ_{j ∈ N} q̃_j² ), for i ∈ N,  (53)
is one. Consequently, the final solution is given by w = z + Q(x* + n), where x* is given in equation 51.
Corollary 1. If r > 0 and P has at least one negative eigenvalue, there always exists a solution to the projection equation 34.
Proof. We only need to prove that there exists a solution to the quadratic constraint in equation 34, after which Lemma 5 guarantees the existence of a solution.
Alternatively, if γ ≠ ±1, then isolating u and v in equation 63 and equation 64, respectively, gives
u = (u_i^k − γ v_j^k) / (1 − γ²),  (68)
v = (v_j^k − γ u_i^k) / (1 − γ²).  (69)
To determine γ, we use the third constraint in equation 62 and the above two equations, which gives
⟨u, v⟩ = ⟨u_i^k − γ v_j^k, v_j^k − γ u_i^k⟩ / (1 − γ²)² = ( (1 + γ²)⟨u_i^k, v_j^k⟩ − γ(‖u_i^k‖² + ‖v_j^k‖²) ) / (1 − γ²)² = a_{i,j}.
Let
φ(γ) = (1 + γ²)⟨u_i^k, v_j^k⟩ − γ(‖u_i^k‖² + ‖v_j^k‖²) − (1 − γ²)² a_{i,j}.
Can we now find an interval which will contain the solution in γ? Note that
φ(−1) = 2⟨u_i^k, v_j^k⟩ + ‖u_i^k‖² + ‖v_j^k‖² = ‖u_i^k + v_j^k‖² ≥ 0,
φ(1) = 2⟨u_i^k, v_j^k⟩ − ‖u_i^k‖² − ‖v_j^k‖² = −‖u_i^k − v_j^k‖² ≤ 0.
Thus it suffices to search for γ ∈ (−1, 1), which can be done efficiently with bisection.
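The bisection just described can be sketched as follows; here u, v and a_ij stand for u_i^k, v_j^k and a_{i,j}, and the function name is ours, not from the paper's code:

```python
import numpy as np

def find_gamma(u, v, a_ij, tol=1e-12):
    """Bisection for a root of phi on (-1, 1)."""
    def phi(g):
        return (1 + g**2) * (u @ v) - g * (u @ u + v @ v) - (1 - g**2)**2 * a_ij
    # phi(-1) = ||u + v||^2 >= 0 and phi(1) = -||u - v||^2 <= 0,
    # so a root is bracketed inside [-1, 1].
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since φ is a polynomial, each bisection step costs only a few inner products, and the bracket [−1, 1] is guaranteed by the sign pattern above.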

C PROOFS OF IMPORTANT LEMMAS C.1 PROOF OF LEMMA 1

Let us first describe the solution set of the given constraint. We need to have
f_i + a_i x_iᵀΔ + (1/2) h_i Δᵀ x_i x_iᵀ Δ = 0,  (70)
where Δ = w − w^t is unknown. If we denote τ_i = x_iᵀΔ, then equation 70 reduces to
f_i + a_i τ_i + (1/2) h_i τ_i² = 0.  (71)
This quadratic equation 71 has a solution if
a_i² − 2h_i f_i ≥ 0.  (72)
If the condition above holds, then the solutions for τ_i form the set
T* := { (−a_i + √(a_i² − 2h_if_i))/h_i , (−a_i − √(a_i² − 2h_if_i))/h_i }.  (73)
Recall that the problem equation 7 now reduces to
min_Δ ‖Δ‖², such that x_iᵀΔ ∈ T*.  (74)
Note that because we want to minimize ‖Δ‖², we choose the element of T* with the smallest possible absolute value; hence the problem equation 74 is equivalent to
min_Δ ‖Δ‖², such that x_iᵀΔ = τ_i*,
where
τ_i* = (−a_i + √(a_i² − 2h_if_i))/h_i if a_i > 0, and τ_i* = (−a_i − √(a_i² − 2h_if_i))/h_i otherwise.
In other words,
τ_i* = −a_i/h_i + sign(a_i) √(a_i² − 2h_if_i)/h_i = (a_i/h_i)( √(a_i² − 2h_if_i)/|a_i| − 1 ).
The final solution is hence
Δ* = (τ_i*/‖x_i‖²) x_i, and therefore w* = w^t + (τ_i*/‖x_i‖²) x_i.
In the case when equation 72 is not satisfied, and because we assumed that the loss function is non-negative, we necessarily have h_i > 0. A natural choice of τ_i is then the one that minimizes f_i + a_i τ_i + (1/2) h_i τ_i². From the first-order optimality conditions we obtain τ_i* = −a_i/h_i, which leads to equation 14.
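The case analysis above can be written out directly; below f, a, h stand for f_i, a_i, h_i, and the function name is ours:

```python
import numpy as np

def glm_sp2_step(w, x, f, a, h):
    # Solve f + a*tau + 0.5*h*tau^2 = 0 for the smallest-magnitude root;
    # if no real root exists (then h > 0 since f >= 0), fall back to the
    # minimizer tau = -a/h of the quadratic.
    disc = a**2 - 2 * h * f
    if h == 0:
        tau = -f / a                     # linear case: plain SP step
    elif disc >= 0:
        root = np.sqrt(disc)
        tau = (-a + root) / h if a > 0 else (-a - root) / h
    else:
        tau = -a / h
    # The minimal-norm displacement is along x_i.
    return w + (tau / (x @ x)) * x
```

The h = 0 branch is our own addition for the degenerate linear case and is not part of the lemma. When a real root exists, the constraint is satisfied exactly at the new point.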

C.2 PROOF OF LEMMA 2

Proof. If φ(t) = 0 then the condition holds trivially. For t such that φ(t) ≠ 0, the map t ↦ √φ(t) is differentiable, and we have
d²/dt² √φ(t) = −(1/4) φ(t)^{−3/2} φ′(t)² + (1/2) φ(t)^{−1/2} φ″(t) = (1/4) φ(t)^{−3/2} ( 2φ(t)φ″(t) − φ′(t)² ),
which is non-positive precisely when φ′(t)² ≥ 2φ(t)φ″(t).

C.3 PROOF OF LEMMA 3

Note that
q(w^{t+1/2})    (by equation 17 and equation 15)
= f_i(w^t) − ⟨∇f_i(w^t), (f_i(w^t)/‖∇f_i(w^t)‖²) ∇f_i(w^t)⟩ + (1/2)⟨∇²f_i(w^t)(f_i(w^t)/‖∇f_i(w^t)‖²)∇f_i(w^t), (f_i(w^t)/‖∇f_i(w^t)‖²)∇f_i(w^t)⟩
= (1/2) ( f_i(w^t)²/‖∇f_i(w^t)‖⁴ ) ⟨∇²f_i(w^t)∇f_i(w^t), ∇f_i(w^t)⟩.
Furthermore,
∇q(w^{t+1/2})    (by equation 15)
= ∇f_i(w^t) + ∇²f_i(w^t)(w^{t+1/2} − w^t)    (by equation 17)
= ( I − (f_i(w^t)/‖∇f_i(w^t)‖²) ∇²f_i(w^t) ) ∇f_i(w^t).
Thus the second step equation 19 is given by
w^{t+1} = w^{t+1/2} − [ (1/2)( f_i(w^t)²/‖∇f_i(w^t)‖⁴ ) ⟨∇²f_i(w^t)∇f_i(w^t), ∇f_i(w^t)⟩ / ‖( I − (f_i(w^t)/‖∇f_i(w^t)‖²)∇²f_i(w^t) )∇f_i(w^t)‖² ] ( I − (f_i(w^t)/‖∇f_i(w^t)‖²)∇²f_i(w^t) )∇f_i(w^t).  (76)
Putting the first step equation 17 and the second step equation 76 together gives equation 20. This gives a second-order correction of the Polyak step that only requires computing a single Hessian-vector product, which can be done efficiently using one additional backward pass of the function. We call this method SP2.
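As a concrete sanity check, the two steps above can be written with the Hessian-vector product passed in explicitly (in our code torch.autograd.functional.hvp supplies it; here we use plain NumPy and our own function name):

```python
import numpy as np

def sp2_step(w, f, g, Hg):
    """One SP2 update from f = f_i(w), g = grad f_i(w), Hg = Hessian-vector product."""
    step1 = f / (g @ g)
    w_half = w - step1 * g                   # first Polyak step (equation 17)
    q_half = 0.5 * step1**2 * (g @ Hg)       # model value at the half step
    gq_half = g - step1 * Hg                 # model gradient at the half step
    return w_half - (q_half / (gq_half @ gq_half)) * gq_half
```

On the interpolating quadratic f(w) = ½⟨Hw, w⟩ the iterates drive the loss to zero, as an illustration of the convergence claims on sums of quadratics.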

C.4 CONVERGENCE OF MULTI-STEP SP2 +

If we apply multiple steps of SP2+, as opposed to two steps, the method converges to a solution of equation 15. This follows because each step of SP2+ is a step of the Newton–Raphson (NR) method applied to solving the nonlinear equation
q(w) := f_i(w^t) + ⟨∇f_i(w^t), w − w^t⟩ + (1/2)⟨∇²f_i(w^t)(w − w^t), w − w^t⟩ = 0.
Indeed, starting from w^0 = w^t, the iterates of the NR method are given by
w^{i+1} = w^i − ∇q(w^i)† q(w^i) = w^i − ( q(w^i)/‖∇q(w^i)‖² ) ∇q(w^i),  (77)
where M† denotes the pseudo-inverse of the matrix M. The NR iterates in equation 77 can also be written in a variational form, given by
w^{i+1} = arg min_{w ∈ R^d} ‖w − w^i‖²  s.t.  q(w^i) + ⟨∇q(w^i), w − w^i⟩ = 0.  (78)
Comparing the above with the first step equation 16 and the second step equation 18 shows that they are indeed two steps of the NR method. Further, we can see that equation 78 is indeed the multi-step version of SP2+. This method equation 77 is also known as gradient descent with a Polyak step size, or SP for short. It is this connection we will use to prove the convergence of equation 77 to a root of q(w). We assume that q(w) has at least one root. Let w*_q ∈ R^d be a least-norm root of q(w), that is,
w*_q = arg min ‖w‖² subject to q(w) = 0.  (79)
It follows from Theorem 3.2 of Sosa & Raupp (2020) that the optimization problem equation 79 has a solution if and only if the matrix
B = ( ∇f_i(w^t) − ∇²f_i(w^t)w^t )( ∇f_i(w^t) − ∇²f_i(w^t)w^t )ᵀ + 2( −f_i(w^t) + ⟨∇f_i(w^t), w^t⟩ − (1/2)⟨∇²f_i(w^t)w^t, w^t⟩ ) ∇²f_i(w^t)  (80)
has at least one non-negative eigenvalue.
Theorem 2. Assume that the matrix B defined in equation 80 has at least one non-negative eigenvalue. If q(w) is star-convex with respect to w*_q, that is, if
⟨∇²f_i(w^t)(w^i − w*_q), w^i − w*_q⟩ ≥ 0, for all i,  (81)
then it follows that
min_{i=0,…,T−1} q(w^i) ≤ ( σ_max(∇²f_i(w^t)) / 2T ) ‖w^0 − w*_q‖².  (82)
Proof. To apply Corollary D.3 in Gower et al. (2021b), we need to verify that q is an L-smooth function and star-convex.
To verify smoothness, we need to find L > 0 such that
q(w) ≤ q(y) + ⟨∇q(y), w − y⟩ + (L/2)‖w − y‖²,
which holds with L = σ_max(∇²q(y)) = σ_max(∇²f_i(w^t)) since q is a quadratic function. Furthermore, for q to be star-convex along the iterates w^i, we need to verify that
q(w*_q) ≥ q(w^i) + ⟨∇q(w^i), w*_q − w^i⟩.  (85)
Since q is a quadratic, we have that
q(w*_q) = q(w^i) + ⟨∇q(w^i), w*_q − w^i⟩ + (1/2)⟨∇²q(w^i)(w*_q − w^i), w*_q − w^i⟩.
Using this in equation 85 gives
0 ≤ ⟨∇²q(w^i)(w*_q − w^i), w*_q − w^i⟩ = ⟨∇²f_i(w^t)(w*_q − w^i), w*_q − w^i⟩,
which is equivalent to our assumption equation 81. We can now apply the result in Corollary D.3 in Gower et al. (2021b), which states that
min_{i=0,…,T−1} ( q(w^i) − q(w*_q) ) ≤ (L/2T)‖w^0 − w*_q‖².
Finally, using q(w*_q) = 0 and that L = σ_max(∇²f_i(w^t)) gives the result.
To simplify notation, we will omit the dependency on w^t and denote c = f_i(w^t), g = ∇f_i(w^t) and H = ∇²f_i(w^t); thus
q(w) = c + ⟨g, w − w^t⟩ + (1/2)⟨H(w − w^t), w − w^t⟩,   ∇q(w) = g + H(w − w^t),   ∇²q(w) = H.  (86)
Lemma 6. If g ∈ Range(H) and w^0 ∈ Range(H), then w^i, ∇q(w^i) ∈ Range(H) for all i, and w*_q ∈ Range(H).
Proof. First, note that since g ∈ Range(H) and since ∇q(w) = g + H(w − w^t) (see equation 86), we have that ∇q(w) ∈ Range(H) for all w. Consequently, by induction, if w^i ∈ Range(H) then by equation 77 we have that w^{i+1} ∈ Range(H), since it is a linear combination of ∇q(w^i) and w^i. Finally, let w*_q = w^t + w_H + w_H^⊥, where w_H ∈ Range(H) and w_H^⊥ ∈ Range(H)^⊥. It follows that q(w*_q) = q(w^t + w_H). Furthermore, by orthogonality and Pythagoras' theorem, ‖w*_q‖² = ‖w^t + w_H‖² + ‖w_H^⊥‖². Consequently, since w*_q is the least-norm solution, we must have that w_H^⊥ = 0 and thus w*_q ∈ Range(H).
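A sketch of the NR/SP iteration equation 77 on a fixed quadratic model q (centered so that w^t = 0, with c, g, H as in equation 86; the function name is ours):

```python
import numpy as np

def multistep_sp2(c, g, H, w0, max_steps=200, tol=1e-12):
    # SP / Newton-Raphson iterations on q(w) = c + <g, w> + 0.5 <H w, w>.
    w = w0.copy()
    for _ in range(max_steps):
        q = c + g @ w + 0.5 * w @ (H @ w)
        grad = g + H @ w
        if abs(q) < tol or grad @ grad < tol:
            break
        w = w - (q / (grad @ grad)) * grad
    return w
```

Each step projects onto the linearized constraint in equation 78; when q has a root with non-vanishing gradient, the iterates converge to it.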

C.5 PROOF OF PROPOSITION 1

First we repeat the proposition for ease of reference.
Proposition 3. Consider the loss functions given in equation 21. The SP2 method equation 7 converges according to
E‖w^{t+1} − w*‖² ≤ ρ E‖w^t − w*‖²,  (87)
where
ρ = λ_max( I − (1/n) Σ_{i=1}^n H_i H_i† ) < 1.  (88)
Proof. Consider the iterates of SP2, which applied to equation 21 are given by
w^{t+1} = arg min_{w ∈ R^d} ‖w − w^t‖²  s.t.  ‖w − w*‖²_{H_i} = 0.
Thus every solution to the constraint must satisfy
w ∈ w* + N_i α,  (89)
where N_i ∈ R^{d×d} is a basis for the null space of H_i and α ∈ R^d. Substituting into the objective, we have the resulting linear least-squares problem
min_{α ∈ R^d} ‖w* + N_i α − w^t‖².
The minimal-norm solution in α is thus α = N_i†(w^t − w*), which when substituted into equation 89 gives
w^{t+1} = w* + N_i N_i† (w^t − w*).  (90)
Note that P_i := N_i N_i† is the orthogonal projector onto Null(H_i). Subtracting w* from both sides of equation 90 and applying the squared norm, we have that
‖w^{t+1} − w*‖² = ‖P_i(w^t − w*)‖² = ⟨P_i(w^t − w*), w^t − w*⟩,  (91)
where we used that P_iᵀP_i = P_i because P_i is an orthogonal projection matrix. Now taking expectation conditioned on w^t, we have
E[ ‖w^{t+1} − w*‖² | w^t ] = ⟨E[P_i](w^t − w*), w^t − w*⟩ ≤ λ_max(E[P_i]) ‖w^t − w*‖².
Since the null space of H_i is the orthogonal complement of its range (H_i is symmetric), we have P_i = I − H_i H_i†. Thus taking expectation again gives the result equation 87. Finally, the rate of convergence ρ in equation 88 is always smaller than one because, by Jensen's inequality and the convexity of λ_max over symmetric matrices, we have
0 < λ_max( E[H_i H_i†] ) ≤ E[ λ_max(H_i H_i†) ] = 1,  (92)
where the strict positivity follows since there must exist H_i ≠ 0; otherwise the result still holds and the method converges in one step (with ρ = 0). Now multiplying equation 92 by −1 and then adding 1 gives
1 > λ_max( I − E[H_i H_i†] ) ≥ 0.  (93)
Remark 1 (Connection to the Block Kaczmarz proof).
The proof of Proposition 3 follows the same proof technique as the block Kaczmarz method of Needell & Tropp (2014). To see this, first note that since interpolation holds, to find the minima we can solve the equations
Find w ∈ R^d :  ∇f_i(w) = 0,  for i = 1, …, n.  (94)
Since each f_i(w) is a quadratic, this is equivalent to
Find w ∈ R^d :  H_i(w − w*) = 0,  for i = 1, …, n.  (95)
We can now interpret the above as one big linear system that we need to solve. If we sample one block of rows, say H_i(w − w*) = 0 for one i, and project w^t onto this smaller system, the resulting method is the block Kaczmarz method applied to equation 95. This also coincides with the SP2 method. Furthermore, our convergence analysis is now equivalent to applying the convergence analysis of block Kaczmarz to solving equation 95.
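The projection update w^{t+1} = w* + P_i(w^t − w*) can be simulated directly on synthetic data; the random rank-deficient H_i below are our own test setup, not from the paper, chosen so that the null spaces intersect only at zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3
w_star = rng.standard_normal(d)
# Rank-2 PSD "Hessians" H_i, all sharing the interpolation point w_star.
Hs = []
for _ in range(n):
    A = rng.standard_normal((2, d))
    Hs.append(A.T @ A)

w0 = w_star + rng.standard_normal(d)
w = w0.copy()
for _ in range(2000):
    H = Hs[rng.integers(n)]
    P = np.eye(d) - H @ np.linalg.pinv(H)   # orthogonal projector onto Null(H_i)
    w = w_star + P @ (w - w_star)           # the SP2 / block Kaczmarz projection
```

Because the combined ranks exceed d, the projectors generically contract the error at the rate ρ of equation 88.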

C.6 PROOF OF PROPOSITION 2

For convenience we repeat the statement of the proposition here.
Proposition 4. Consider the loss functions given in equation 21. The SP2+ method equation 20 converges according to
E‖w^{t+1} − w*‖² ≤ ρ²_{SP2+} E‖w^t − w*‖²,  (96)
where
ρ_{SP2+} = 1 − (1/2n) Σ_{i=1}^n λ_min(H_i)/λ_max(H_i).  (97)
Proof. The proof follows simply by observing that, for quadratic functions, SP2+ is equivalent to applying two steps of the SP method equation 5. Indeed, in Section 3.2, SP2+ applies two steps of the SP method to the local quadratic approximation of the function we wish to minimize. But in this case, since our function is quadratic, it is itself equal to its local quadratic approximation. Consequently, we can apply the convergence theory of SP for smooth, strongly convex functions that satisfy the interpolation condition, such as Corollary 5.7.1 in Gower et al. (2021b), which states that SP converges at the rate in equation 97.

C.7 PROOF OF LEMMA 7

The following lemma gives the two-step update for SP2L2+.
Lemma 7 (SP2L2+). The w^{t+1} and s^{t+1} updates of equation 28–equation 29 are given by
w^{t+1} = w^t − (Γ_1 + Γ_2)∇f_i(w^t) + Γ_2 Γ_1 ∇²f_i(w^t)∇f_i(w^t),
s^{t+1} = (1 − λ)( (1 − λ)(s^t + Γ_1) + Γ_2 ),
where
Γ_1 := ( f_i(w^t) − (1 − λ)s^t )_+ / ( 1 − λ + ‖∇f_i(w^t)‖² ),
Γ_2 := ( f_i(w^t) − Γ_1‖∇f_i(w^t)‖² + (1/2)Γ_1²⟨∇²f_i(w^t)∇f_i(w^t), ∇f_i(w^t)⟩ − (1 − λ)²(s^t + Γ_1) )_+ / ( 1 − λ + ‖∇f_i(w^t) − Γ_1∇²f_i(w^t)∇f_i(w^t)‖² ),
and where we denote (x)_+ = x if x ≥ 0 and (x)_+ = 0 otherwise.
We will use the following lemma, which has been proven in Lemma C.2 of Gower et al. (2022), to prove Lemma 7.
Lemma 8 (L2 Unidimensional Inequality Constraint). Let δ > 0, c ∈ R, s_0 ∈ R and w_0, a ∈ R^d. The closed-form solution to
(w′, s′) = arg min_{w ∈ R^d, s ∈ R} ‖w − w_0‖² + δ(s − s_0)²  s.t.  ⟨a, w − w_0⟩ + c ≤ s,  (98)
is given by
w′ = w_0 − δ ( (c − s_0)_+ / (1 + δ‖a‖²) ) a,   s′ = s_0 + (c − s_0)_+ / (1 + δ‖a‖²),  (99)
where we denote (x)_+ = x if x ≥ 0 and (x)_+ = 0 otherwise. We are now in a position to prove Lemma 7.
Note that 1 -λ 2 (s -s t ) 2 + λ 2 s 2 = 1 -λ 2 s 2 -(1 -λ)ss t + λ 2 s 2 + 1 -λ 2 (s t ) 2 = 1 2 s 2 -(1 -λ)ss t + 1 -λ 2 (s t ) 2 = 1 2 (s -(1 -λ)s t ) 2 + λ -λ 2 2 (s t ) 2 . ( ) Consequently equation 28 is equivalent to w t+1/2 , s t+1/2 = arg min s≥0, w∈R d w -w t 2 + 1 1 -λ (s -(1 -λ)s t ) 2 s.t. q i,t (w t ) + ∇q i,t (w t ), w -w t ≤ s. ( ) It follows from Lemma 8 that the closed form solution is w t+1/2 = w t - 1 1 -λ (q i,t (w t ) -(1 -λ)s t ) + 1 + 1 1-λ ∇q i,t (w t ) 2 ∇q i,t (w t ), ( ) s t+1/2 = (1 -λ)s t + (q i,t (w t ) -(1 -λ)s t ) + 1 + 1 1-λ ∇q i,t (w t ) 2 , ( ) where we denote (x) + = x if x ≥ 0 0 otherwise . Note that q i,t (w t ) = f i (w t ) and ∇q i,t (w t ) = ∇f i (w t ). To simplify the notation, we also denote Γ 1 = 1 1 -λ (f i (w t ) -(1 -λ)s t ) + 1 + 1 1-λ ∇f i (w t ) 2 . With this notation we have that w t+1/2 = w t - 1 1 -λ (f i (w t ) -(1 -λ)s t ) + 1 + 1 1-λ ∇f i (w t ) 2 ∇f i (w t ) (105) = w t -Γ 1 ∇f i (w t ), s t+1/2 = (1 -λ)s t + (f i (w t ) -(1 -λ)s t ) + 1 + 1 1-λ ∇f i (w t ) 2 (107) = (1 -λ)(s t + Γ 1 ). In a completely analogous way, the closed form solution to equation 29 is w t+1 = w t+1/2 - 1 1 -λ (q i,t (w t+1/2 ) -(1 -λ)s t+1/2 ) + 1 + 1 1-λ ∇q i,t (w t+1/2 ) 2 • ∇q i,t (w t+1/2 ), s t+1 = (1 -λ)s t+1/2 + (q i,t (w t+1/2 ) -(1 -λ)s t+1/2 ) + 1 + 1 1-λ ∇q i,t (w t+1/2 ) 2 . ( ) Note that q i,t (w t+1/2 ) = f i (w t ) -∇f i (w t ), Γ 1 ∇f i (w t ) + 1 2 ∇ 2 f i (w t )Γ 1 ∇f i (w t ), Γ 1 ∇f i (w t ) = f i (w t ) -Γ 1 ∇f i (w t ) 2 + 1 2 Γ 2 1 ∇ 2 f i (w t )∇f i (w t ), ∇f i (w t ) and ∇q i,t (w t+1/2 ) = ∇f i (w t ) + ∇ 2 f i (w t )(w t+1/2 -w t ) = ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ). 
Denoting Γ 2 as in the statement of the lemma we conclude that w t+1 = w t+1/2 - 1 1 -λ (q i,t (w t+1/2 ) -(1 -λ)s t+1/2 ) + 1 + 1 1-λ ∇q i,t (w t+1/2 ) 2 • ∇q i,t (w t+1/2 ) = w t -Γ 1 ∇f i (w t ) -Γ 2 ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) , s t+1 = (1 -λ)s t+1/2 + (q i,t (w t+1/2 ) -(1 -λ)s t+1/2 ) + 1 + 1 1-λ ∇q i,t (w t+1/2 ) 2 = (1 -λ) (1 -λ)(s t + Γ 1 ) + Γ 2 C.8 PROOF OF LEMMA 9 The following Lemma gives a closed form for the two-step update for SP2L1 + . Lemma 9. (SP2L1 + ) The w t+1 and s t+1 update is given by w t+1 = w t -(Γ 4 + Γ 6 )∇f i (w t ) + Γ 6 Γ 4 ∇ 2 f i (w t )∇f i (w t ), s t+1 = s t - λ 2(1 -λ) + Γ 3 + - λ 2(1 -λ) + Γ 5 + , where Γ 3 = f i (w t ) -s t -λ 2(1-λ) + 1 + ∇f i (w t ) 2 , Γ 4 = min Γ 3 , f i (w t ) ∇f i (w t ) 2 , Γ 5 = Λ 1 -s t -λ 2(1-λ) + 1 + ∇f i (w t ) -Γ 4 ∇ 2 f i (w t )∇f i (w t ) 2 , Γ 6 = min Γ 5 , Λ 1 ∇f i (w t ) -Γ 4 ∇ 2 f i (w t )∇f i (w t ) 2 , Λ 1 = f i (w t ) -Γ 4 ∇f i (w t ) 2 + 1 2 Γ 2 4 ∇ 2 f i (w t )∇f i (w t ), ∇f i (w t ) . To solve equation 30, we consider the following two-step method similar to equation 28 and equation 29: w t+1/2 , s t+1/2 = arg min s≥0, w∈R d 1 -λ 2 ∆ t + λ 2 s (112) s.t. q i,t (w t ) + ∇q i,t (w t ), w -w t ≤ s. w t+1 , s t+1 = arg min s≥0, w∈R d 1 -λ 2 ∆ t+ 1 2 + λ 2 s (113) s.t. q i,t (w t+1/2 ) + ∇q i,t (w t+1/2 ), w -w t+1/2 ≤ s. Note that 1 -λ 2 (s -s t ) 2 + λ 2 s = 1 -λ 2 s -s t - λ 2(1 -λ) 2 + constants w.r.t. w and s. Then, equation 112 is equivalent to solving w t+1/2 , s t+1/2 = arg min s≥0, w∈R d w -w t 2 + s -s t - λ 2(1 -λ) 2 (114) s.t. q i,t (w t ) + ∇q i,t (w t ), w -w t ≤ s. It follows from Lemma C.4 in Gower et al. (2022) that the closed form solution to equation 112 is w t+1/2 = w t -min      q i,t (w t ) -s t -λ 2(1-λ) + 1 + ∇q i,t (w t ) 2 , q i,t (w t ) ∇q i,t (w t ) 2      ∇q i,t (w t ), s t+1/2 =    s t - λ 2(1 -λ) + q i,t (w t ) -s t -λ 2(1-λ) + 1 + ∇q i,t (w t ) 2    + , where we denote (x) + = x if x ≥ 0 0 otherwise . 
Note that q i,t (w t ) = f i (w t ) and ∇q i,t (w t ) = ∇f i (w t ). To simplify the notation, denote Γ 3 = f i (w t ) -s t -λ 2(1-λ) + 1 + ∇f i (w t ) 2 , and Γ 4 = min Γ 3 , f i (w t ) ∇f i (w t ) 2 . Then, we have w t+1/2 = w t -Γ 4 ∇f i (w t ), s t+1/2 = s t - λ 2(1 -λ) + Γ 3 + . In a similar way, we can get the closed form solution to equation 113, which is given as w t+1 = w t+1/2 -min      q i,t (w t+1/2 ) -s t+1/2 -λ 2(1-λ) + 1 + ∇q i,t (w t+1/2 ) 2 , q i,t (w t+1/2 ) ∇q i,t (w t+1/2 ) 2      ∇q i,t (w t+1/2 ), s t+1 =    s t+1/2 - λ 2(1 -λ) + q i,t (w t+1/2 ) -s t -λ 2(1-λ) + 1 + ∇q i,t (w t+1/2 ) 2    + . Note that q i,t (w t+1/2 ) = f i (w t ) -∇f i (w t ), Γ 4 ∇f i (w t ) + 1 2 ∇ 2 f i (w t )Γ 4 ∇f i (w t ), Γ 4 ∇f i (w t ) = f i (w t ) -Γ 4 ∇f i (w t ) 2 + 1 2 Γ 2 4 ∇ 2 f i (w t )∇f i (w t ), ∇f i (w t ) Λ 1 and ∇q i,t (w t+1/2 ) = ∇f i (w t ) + ∇ 2 f i (w t )(w t+1/2 -w t ) = ∇f i (w t ) -Γ 4 ∇ 2 f i (w t )∇f i (w t ). Again, to simplify the notation, we denote Γ 5 = q i,t (w t+1/2 ) -s t -λ 2(1-λ) + 1 + ∇q i,t (w t+1/2 ) 2 = Λ 1 -s t -λ and Γ 6 = min Γ 5 , q i,t (w t+1/2 ) ∇q i,t (w t+1/2 ) 2 = min Γ 5 , Λ 1 ∇f i (w t ) -Γ 4 ∇ 2 f i (w t )∇f i (w t ) 2 . Then, we have w t+1 = w t+1/2 -Γ 6 ∇q i,t (w t+1/2 ) = w t -Γ 4 ∇f i (w t ) -Γ 6 ∇f i (w t ) -Γ 4 ∇ 2 f i (w t )∇f i (w t ) = w t -(Γ 4 + Γ 6 )∇f i (w t ) + Γ 6 Γ 4 ∇ 2 f i (w t )∇f i (w t ), s t+1 = s t+1/2 - λ 2(1 -λ) + Γ 5 + = s t - λ 2(1 -λ) + Γ 3 + - λ 2(1 -λ) + Γ 5 + . C.9 PROOF OF LEMMA 10 The following lemma gives a closed form for two step method SP2max + . Lemma 10. 
(SP2max + ) The w t+1 and s t+1 update is given by w t+1 = w t -(Γ 1 + Γ 3 ) ∇f i (w t ) + Γ 3 Γ 1 ∇ 2 f i (w t )∇f i (w t ), s t+1 = max Γ 2 - λ 2(1 -λ) ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) 2 , 0 , where Γ 1 = min f i (w t ) ∇f i (w t ) 2 , λ 2(1 -λ) , Γ 2 = f i (w t ) -Γ 1 ∇f i (w t ) 2 + 1 2 Γ 2 1 ∇ 2 f i (w t )∇f i (w t ), ∇f i (w t ) , Γ 3 = min Γ 2 ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) 2 , λ 2(1 -λ) . To solve equation 31, we again consider a two step method similar to equation 112 and equation 113: w t+1/2 , s t+1/2 = arg min s≥0, w∈R d 1 -λ 2 w -w t 2 + λ 2 s (115) s.t. q i,t (w t ) + ∇q i,t (w t ), w -w t ≤ s. w t+1 , s t+1 = arg min s≥0, w∈R d 1 -λ 2 w -w t+1/2 2 + λ 2 s (116) s.t. q i,t (w t+1/2 ) + ∇q i,t (w t+1/2 ), w -w t+1/2 ≤ s. Note that equation 115 is equivalent to solving w t+1/2 , s t+1/2 = arg min s≥0, w∈R d 1 2 w -w t 2 + λ 2(1 -λ) s s.t. q i,t (w t ) + ∇q i,t (w t ), ww t ≤ s. It follows from Lemma D.2 in Gower et al. (2022) that the closed form solution to equation 117 is w t+1/2 = w t -min q i,t (w t ) ∇q i,t (w t ) 2 , λ 2(1 -λ) ∇q i,t (w t ) = w t -min f i (w t ) ∇f i (w t ) 2 , λ 2(1 -λ) ∇f i (w t ) = w t -Σ 1 ∇f i (w t ), s t+1/2 = max q i,t (w t ) - λ 2(1 -λ) ∇q i,t (w t ) 2 , 0 = max f i (w t ) - λ 2(1 -λ) ∇f i (w t ) 2 , 0 , where we denote Γ 1 = min f i (w t ) ∇f i (w t ) 2 , λ 2(1 -λ) . Note that q i,t (w t+1/2 ) = f i (w t ) -Γ 1 ∇f i (w t ) 2 + 1 2 Γ 2 1 ∇ 2 f i (w t )∇f i (w t ), ∇f i (w t ) := Γ 2 , and q i,t (w t+1/2 ) = ∇f i (w t ) + ∇ 2 f i (w t )(w t+1/2 -w t ) = ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ). 
Similarly, we have the closed form solution to equation 116 given as w t+1 = w t+1/2 -min q i,t (w t+1/2 ) ∇q i,t (w t+1/2 ) 2 , λ 2(1 -λ) ∇q i,t (w t+1/2 ) = w t+1/2 -min Γ 2 ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) 2 , λ 2(1 -λ) ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) = w t+1/2 -Γ 3 ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) = w t -(Γ 1 + Γ 3 ) ∇f i (w t ) + Γ 3 Γ 1 ∇ 2 f i (w t )∇f i (w t ) s t+1 = max q i,t (w t+1/2 ) - λ 2(1 -λ) ∇q i,t (w t+1/2 ) 2 , 0 = max Γ 2 - λ 2(1 -λ) ∇f i (w t ) -Γ 1 ∇ 2 f i (w t )∇f i (w t ) 2 , 0 . C.10 PROOF OF LEMMA 4 In GLMs, the unregularized problem equation 31 becomes w t+1 , s t+1 = arg min s≥0, w∈R d 1 2 w -w t 2 + λs s.t. f i + a i x i , w -w t + 1 2 h i x i x i (w -w t ), w -w t ≤ s, where λ := λ 2(1-λ) , and we denote f i = f i (w t ), a i := φ i (x i wy i ), h i := φ i (x i wy i ) for short. Denote := ww t . Then, problem equation 118 reduces to min s≥0, ∈R d 1 2 2 + λs s.t. f i + a i x i + 1 2 h i x i x i ≤ s. Note that we want to minimize 2 . Together with the above constraint, we can conclude that must be a multiple of x i since any other component will not help satisfy the constraint but increase 2 . Let = cx i and = x i 2 , then problem equation 119 becomes min s≥0, c∈R 1 2 c 2 + λs s.t. f i + a i c + 1 2 h i 2 c 2 ≤ s. The corresponding Lagrangian function is then given as L(s, c, ν 1 , ν 2 ) = 1 2 c 2 + λs + ν 1 (f i + a i c + 1 2 h i 2 c 2 -s) -ν 2 s, where ν 1 , ν 2 ≥ 0 are the Lagrangian multipliers. The KKT conditions are thus f i + a i c + 1 2 h i 2 c 2 -s ≤ 0, s ≥ 0, ν 1 ≥ 0, ν 2 ≥ 0, ν 1 (f i + a i c + 1 2 h i 2 c 2 -s) = 0, ν 2 s = 0, λ -ν 1 -ν 2 = 0, c + ν 1 a i + ν 1 h i 2 c = 0. By checking the complementary conditions, the solution to the above KKT equations has three cases, which are summarized below. Case I: The Lagragian multiplier ν 2 = 0. 
In this case ν_1 = λ̃, ν_2 = 0 and, recalling ℓ = ‖x_i‖², c̃ = −λ̃a_i/(1 + λ̃h_iℓ), so that
s̃ = f_i + a_iℓc̃ + (1/2)h_iℓ²c̃² = f_i − λ̃a_i²ℓ/(1 + λ̃h_iℓ) + h_iλ̃²a_i²ℓ²/( 2(1 + λ̃h_iℓ)² ),
which is feasible if s̃ ≥ 0. The resulting objective value (1/2)ℓc̃² + λ̃s̃ is then ≥ 0.
Case II: the Lagrange multiplier ν_1 = 0. In this case ν_1 = 0, ν_2 = λ̃, c̃ = 0 and s̃ = 0, which is feasible if f_i = 0. The objective value is 0 in this case, and the variable w is unchanged since w − w^t = c̃x_i = 0.
Case III: neither Lagrange multiplier is zero. In this case there are two possible solutions for c̃, given by
c̃ = ( −a_i ± √(a_i² − 2h_if_i) ) / (h_iℓ),   ν_1 = −c̃/(a_i + h_iℓc̃),   ν_2 = λ̃ + c̃/(a_i + h_iℓc̃),   s̃ = 0.
Note that a_i + h_iℓc̃ = ±√(a_i² − 2h_if_i). Consequently, to guarantee that the Lagrange multipliers ν_1 and ν_2 are non-negative, we must have c̃ = ( −a_i + √(a_i² − 2h_if_i) ) / (h_iℓ), and in this case the objective value equals (1/2)ℓc̃² ≥ 0.
In summary, if f_i = 0, Case II is the optimal solution. Alternatively, if f_i > 0 and
s̃ = f_i − λ̃a_i²ℓ/(1 + λ̃h_iℓ) + h_iλ̃²a_i²ℓ²/( 2(1 + λ̃h_iℓ)² )
is non-negative, then Case I is the optimal solution. Otherwise, Case III with c̃ = ( −a_i + √(a_i² − 2h_if_i) ) / (h_iℓ) is the optimal solution. Therefore, the optimal solution to equation 118 is
w^{t+1} = w^t + c̃x_i,   s^{t+1} = max{ s̃, 0 },
where
c̃ = 0 if f_i = 0;   c̃ = −λ̃a_i/(1 + λ̃h_iℓ) if f_i > 0 and s̃ ≥ 0;   and c̃ = ( −a_i + √(a_i² − 2h_if_i) ) / (h_iℓ) otherwise.
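The three cases can be assembled into a single update rule; the following is our own sketch of this closed form, with f, a, h standing for f_i, φ′_i, φ″_i and ℓ = ‖x_i‖²:

```python
import numpy as np

def sp2max_glm_step(w, x, f, a, h, lam):
    lt = lam / (2 * (1 - lam))        # lambda-tilde
    ell = x @ x
    if f == 0:
        return w, 0.0                 # Case II: w unchanged, zero slack
    denom = 1 + lt * h * ell
    s = f - lt * a**2 * ell / denom + h * lt**2 * a**2 * ell**2 / (2 * denom**2)
    if s >= 0:
        c = -lt * a / denom           # Case I
    else:
        c = (-a + np.sqrt(a**2 - 2 * h * f)) / (h * ell)   # Case III
    return w + c * x, max(s, 0.0)
```

In Case III the constraint is active with zero slack, so the new point zeroes the local quadratic model along x_i.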

D ADDITIONAL NUMERICAL EXPERIMENTS D.1 NON-CONVEX PROBLEMS

For the non-convex experiments, we used the Python package pybenchfunction, available on GitHub as Python_Benchmark_Test_Optimization_Function_Single_Objective. We compared SP2 equation 7 (blue curve) and SP2+ equation 20 (green curve) on the non-convex problems PermDβ+, Rastrigin, Levy N. 13 and Rosenbrock (Jamil & Yang, 2013). As baselines we compared against SGD (yellow), Adam (pink) and Newton (red). For the implementation of Newton's method, we found that a regularized Newton method was best suited to these non-convex experiments, where
w^{t+1} = w^t − γ(εI + ∇²f(w^t))^{-1}∇f(w^t),
ε is the regularization parameter, which we set to ε = 10^{-8}, and γ is a learning rate which we tuned. Specifically, for all the experiments in this section we set the learning rate of every method to 0.25, except for SGD and Adam, both of which had to be tuned by a grid search over [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.25, 1] on the first 100 epochs, and which proved much harder to tune. We note here that the PermDβ+ function as implemented is given by
PermDβ+(x) := Σ_{i=1}^d Σ_{j=1}^d ( (j^i + β)( (x_j/j)^i − 1 ) )²,
which is different from the standard PermDβ function, given by
PermDβ(x) := Σ_{i=1}^d ( Σ_{j=1}^d (j^i + β)( (x_j/j)^i − 1 ) )².
We believe this is a small mistake, which is why we have introduced the plus sign in PermDβ+ to distinguish this function from the standard PermDβ function. Yet PermDβ+ is still an interesting non-convex problem, and thus we have used it in our experiments despite this small alteration.
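A minimal sketch of this regularized Newton baseline, with generic grad and hess callables (the function name is ours):

```python
import numpy as np

def regularized_newton_step(w, grad, hess, gamma=0.25, eps=1e-8):
    d = w.shape[0]
    # Solve (eps*I + H) p = grad(w) rather than forming the inverse explicitly.
    p = np.linalg.solve(eps * np.eye(d) + hess(w), grad(w))
    return w - gamma * p
```

Note that a tiny ε only handles exact singularity; on strongly non-convex points the matrix εI + ∇²f(w) may still be indefinite, which is one reason this baseline can converge to local maxima.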

D.1.2 THE LEVY N. 13 AND ROSENBROCK PROBLEMS

Here we provide two additional experiments on the non-convex functions Levy N. 13 and Rosenbrock that complement the findings in Section 6.1. For the Levy N. 13 function in Figure 6 we have that, again, SP2 converges in 10 epochs to the global minimum. In contrast, Newton's method converges immediately to a local maximum, which can easily be seen on the surface plot on the right of Figure 6. The one problem where SP2 was not the fastest was the Rosenbrock function; see Figure 7. Here Newton's method was equally fast, but note that this problem was designed to emphasize the advantages of Newton's method over gradient descent. For completeness, we also give the function value versus time plot in Figure 8.
Here we increase the dimension of the problems to appraise the effect that dimension has on the convergence of the methods; see Figure 9 for a comparison in terms of epochs, and Figure 3 in the main text for a comparison in terms of time taken. We no longer compare to Newton's method because it exceeded our maximum allocated time. We find that, when increasing the dimension, these problems become extremely challenging. To select the learning rate of each method, we used a grid search over the first 100 epochs. For the SP2, SP2+ and SP methods we used the grid [0.005, 0.01, 0.05, 0.1]. For Adam and SGD we used the larger grid [0.00001, 0.00005, 0.0001, 0.005, 0.01, 0.05, 0.1], thus giving SGD and Adam a competitive advantage.

D.2 ADDITIONAL CONVEX EXPERIMENTS

We set the desired tolerance for each algorithm to 0.01, and set the maximum number of epochs to 200 for colon-cancer and 30 for mushrooms. To choose an optimal slack parameter λ for SP2L2+, SP2L1+ and SP2max+, we tested these three methods on a uniform grid λ ∈ {0.1, 0.2, …, 0.9} with σ = 0.001. The gradient norm and loss evaluated at each epoch are presented in Figures 10–13 (see Appendix D). It can be seen that SP2L2+ performs best with λ = 0.9 on colon-cancer and λ = 0.1 on mushrooms, while SP2L1+ and SP2max+ perform best with λ = 0.1 on both data sets. Therefore, we set λ = 0.9 for SP2L2+ on colon-cancer and fix λ = 0.1 in all other cases. Under the same setting as in Section 6.



Or even two solutions in the 1d case.
Note that there could be more than one solution to this projection, and we can choose one either with least norm or arbitrarily.
We use torch.autograd.functional.hvp in our code to compute this Hessian-vector product.
Thanks to an anonymous reviewer: the proof technique we use for Proposition 1 is the same proof technique used for the Block Kaczmarz method Needell & Tropp (2014); see Remark 1.
For instance, w = w^t and s = f_i(w^t) is feasible.
We used the Python Package pybenchfunction available on github Python_Benchmark_Test_Optimization_Function_Single_Objective. We also note that the PermDβ+ implemented in this package is a modified version of the PermDβ function, as we detail in Section D.1.1.
See Figure 9 for a comparison with respect to epochs.
Thanks to an anonymous reviewer at ICLR for pointing out this connection.



Figure (a): Non-convex loss function for which condition equation 12 holds.

); Gower et al. (2022). The advantage of SP2max is that it has a closed-form solution for GLMs of the form equation 10. We give the closed form in the following lemma, which is proved in Appendix C.10. Lemma 4 (SP2max). Consider the GLM model given in equation 10 and equation 11. If the loss

Figure 1: The 2D PermDβ+ function with β = 0.5. We plot, Left: f(x) vs epochs; Middle: f(x) vs time; Right: level set plot and surface plot.

Figure4: Recovery error for matrix completion. Left: Using 10%, Middle: using 20% and Right: using 30% of the entries of A to form Ω, respectively. Shaded region corresponds to 5 repeated runs.

Figure 5: Colon-cancer: gradient norm at each epoch with momentum set at 0.3. Left: σ = 0, Middle: σ = 0.001 and Right σ = 0.008.

The proof follows by applying the convergence Theorem 4.4 in Gower et al. (2020), or equivalently Corollary D.3 in Gower et al. (2021b). This result first appeared as Theorem 4.4 in Gower et al. (2020), but we apply Corollary D.3 in Gower et al. (2021b) since it is a bit simpler.

D.1.1 PERMDβ+ IS AN INCORRECT IMPLEMENTATION OF PERMDβ

Figure 6: The Levy N. 13 function, where Left: we plot f(x) across epochs; Middle: level set plot; Right: surface plot. SP2 is in blue, SP2+ is in green, SGD is in yellow and Newton is in red.

Figure 9: Function value vs epochs for the PermDβ+ function (d = 10) with β = 0.5 (Left), the Rastrigin function (d = 100) (Middle) and the Rosenbrock function (d = 100) (Right).

Figure 11: Colon-cancer: loss at each epoch with different λ.

Figure 12: Mushrooms: gradient norm at each epoch with different λ.

Figure 13: Mushrooms: loss at each epoch with different λ.

Figure 14: Colon-cancer: gradient norm at each epoch with different λ.


Figure 15: Colon-cancer: loss at each epoch with different λ.

Figure 16: Mushrooms: gradient norm and loss at each epoch with momentum being 0.3.

thus the interpolation condition holds.


Figure 8: The time plots for the Rosenbrock function (Left) and the Levy N. 13 function (Right). SP2 is in blue, SP2+ is in green, SGD is in yellow, and Adam is in pink.

B MATRIX COMPLETION

The projection equation 33 can be solved as shown in the following theorem. Theorem 1. The solution to equation 33 is given by one of the following cases. 3. Finally, if none of the above holds, then, where γ ∈ (−1, 1) is the solution to the depressed quartic equation. Proof. The Lagrangian of equation 33 is given by, where γ ∈ R is the unknown Lagrange multiplier. Thus the KKT equations are. Subtracting γ times the second equation from the first equation and, analogously, subtracting the first equation from the second, gives. If γ = 1, then necessarily u_i^k = v_j^k, and furthermore from the first equation in equation 62 we have that (65). Substituting u out of the original projection problem equation 33, we have that, consequently, for every v that satisfies the constraint, the objective value is invariant and equal to −a_{i,j}. Hence there are infinitely many solutions. To find one such solution, we complete the squares of the constraint and find. The above only has solutions if (1/4)‖v_j^k‖² − a_{i,j} ≥ 0. One solution to the above is given by equation 55.

