A NOVEL FAST EXACT SUBPROBLEM SOLVER FOR STOCHASTIC QUASI-NEWTON CUBIC REGULARIZED OPTIMIZATION

Abstract

In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of second-order optimization strategies that use a cubic-regularization (CR) term in place of trust-regions or line-searches. Solving the CR subproblem exactly requires Newton's method, yet by exploiting the internal structure of LQN matrices, we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing very large speedups. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide empirical results for different LQN matrices and find that our proposed method matches or exceeds all tested optimizers with minimal tuning.

1. INTRODUCTION

Scalable second-order methods for training deep learning models have shown great potential, yet those that build on Hessian-vector products may be prohibitively expensive to use. In this paper, we focus on algorithms that require information similar to Stochastic Gradient Descent (SGD) Ruder (2016), namely, stochastic gradients calculated on mini-batches of data. Quasi-Newton (QN) methods are a natural higher-order alternative to first-order methods, in that they seek to model curvature information dynamically from past steps based on available gradient information. Thus, they can work out of the box in the same settings as SGD with little model-specific coding required. However, this comes with possible instability of the step size. The step size can be controlled using line-searches along a given direction s or using trust-regions to find the best s for a given step size. A relatively recent alternative to these approaches is cubic regularization Nesterov and Polyak (2006); Cartis et al. (2011); Tripuraneni et al. (2018), which shows very promising results. In detail, we study the minimization problem

    minimize_{s ∈ R^n} m_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s + (1/3) σ_k ∥s∥^3,    (1)

for a given x_k, where g_k := ∇f(x_k), B_k is a Hessian approximation, σ_k is an iteratively chosen adaptive regularization parameter, and f(x_k) is the objective function evaluated at x_k. Equation 1 is also known as the CR subproblem. Cubic regularization shows promise because it can be shown that if ∇²f is Lipschitz continuous with constant L, then f(x_k + s) ≤ m_k(s) whenever σ_k ≥ L and B_k s = ∇²f(x_k)s Nesterov and Polyak (2006). Thus, if the Hessian approximation B_k behaves like ∇²f(x_k) along the search direction s, the model function m_k(s) becomes an upper bound on the objective f(x_k + s).
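The cubic model can be made concrete with a short sketch. The snippet below is illustrative only: it evaluates m_k(s) with a dense B_k, whereas the point of this paper is that B_k is never formed explicitly; the function name is ours.

```python
import numpy as np

def cubic_model(f_xk, g, B, sigma, s):
    """Evaluate the cubic-regularized model m_k(s) of Equation 1.

    f_xk  : scalar objective value f(x_k)
    g     : gradient at x_k
    B     : Hessian approximation (dense here, for illustration only)
    sigma : adaptive regularization parameter sigma_k
    s     : trial step
    """
    return (f_xk
            + s @ g
            + 0.5 * s @ (B @ s)
            + (sigma / 3.0) * np.linalg.norm(s) ** 3)
```

For sigma = 0 this reduces to the familiar quadratic model; the cubic term penalizes long steps, with sigma playing the role the trust-region radius plays elsewhere.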
In such cases a line-search is not needed, as reduction in m_k(s) translates directly into reduction in f(x + s), removing the risk that the computational work spent minimizing m_k(s) is wasted. We propose an efficient exact solver for Equation 1 using Newton's method that remains tractable in large-scale optimization problems under near-identical conditions to those in which SGD itself is commonly applied. As Newton's method accounts for much of the computational overhead when solving Equation 1, a dense approach such as that described in Cartis et al. (2011) would be prohibitive. However, by exploiting properties of LQN methods described in Erway and Marcia (2015) and Burdakov et al. (2017) (and further applied in other papers: Chen et al. (2014); pei Lee et al.; Lee et al. (2022)), we can instead perform Newton's method in a reduced subspace such that the cost per Newton iteration is reduced from O(mn) to O(m), where n is the problem dimension and m is the history size, commonly chosen to be an integer between 5 and 20. The full-space solution to Equation 1 can then be recovered for a cost identical to that of classic LQN methods. To the best of our knowledge, all previous attempts to use LQN methods in the context of the ARC framework have had to change the definition of m_k(s) in order to find an approximate solution Andrei (2021); Liu et al. (2021); Ranganath et al. (2022). Remarkably, we present a mechanism for minimizing m_k(s) with computational effort similar to a single matrix inversion of a shifted LQN matrix, which is itself a lower bound on the complexity of traditional LQN approaches. Further, we show that by applying Newton's method in the reduced subspace, we can achieve speed improvements of more than 100x over a naive (LQN inversion-based) implementation.
In the numerical results section we further show that this modification permits the application of LQN matrices with exact cubic regularization as a practical optimizer for large DNNs.

2. RELATED WORK

Second-order methods in machine learning are steadily growing more common Berahas et al. (2021); Brust et al. (2017); Chen et al. (2020); Goldfarb et al. (2020); Ma (2020); Ramamurthy and Duffy (2016); Yao et al. (2021). Limited-memory SR1 (LSR1) updates are studied in the context of ARC methodology using a "memory-less" variant in Andrei (2021). As we describe in Section 3, many QN methods iteratively update B_k matrices with pairs (s_k, y_k) such that B_k s_k = y_k, where y_k denotes the difference of the corresponding gradients of the objective function. In Andrei (2021), y_k is instead a difference of gradients of m_k(s). Liu et al. (2021) approximately minimize m_k(s) by solving shifted systems of the form (B_k + σ∥s_{k-1}∥I) s_k = -g, where the norm of the previous step is used as an estimate of ∥s_k∥ to define the optimal shift. As described in Theorem 1 in Section 3, the optimal solution necessarily satisfies a condition of the form (B_k + σ∥s_k∥I) s_k = -g. Since ∥s_k∥ may vary greatly between iterations, this solution is a noisy approximation. They further simplify the subproblem by using only the diagonal of B_k + σ∥s_{k-1}∥I when generating s_k. Ranganath et al. (2022) solve a modified version of the problem using a shape-changing norm as the cubic overestimation, which provides an analytical solution to Equation 1. They transform m_k(s) using strategies similar to those advocated in this paper. However, this norm definition depends on the matrix B_k and thus makes the definition of the target Lipschitz constant, L, dependent on B_k as well. A nontrivial distinction between our approaches is that theirs requires a QR factorization of matrices of size n × m. This may be prohibitive for deep learning problems, which may have billions of parameters. Bergou et al. (2017) explores a similar idea of making the norm dependent on the QN matrix. In Park et al.
(2020), the ARC framework with stochastic gradients is used with a Hessian-based approach first advocated by Martens et al. (2010). In this case, ∇²f(x) is approximated within a Krylov-based subspace using Hessian-vector products with batched estimates of ∇²f(x). They then minimize m_k(s) within this small-dimensional subspace. An alternative to ARC methods is the use of trust-regions or line-searches. Though fundamentally different approaches, we can often borrow technology from the trust-region subproblem solver space and adapt it to the ARC context. For example, Brust et al. (2017) outlines mechanisms for efficiently computing (B_k + λI)^{-1} g and an implicit eigendecomposition of B_k + λI when solving the trust-region subproblem of minimizing q_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s subject to ∥s∥ ≤ δ. Burdakov et al. (2017) significantly reduces the complexity and memory cost of such algebraic operations while solving the same problem. We thus adopt select operations developed therein when applicable, to adapt the method of Cartis et al. (2011) to the LQN context. Unlike the approach advocated in Burdakov et al. (2017), we avoid inversions of potentially ill-conditioned systems to improve the stability of the approach while simultaneously reducing computational overhead. Note that we further extend Cartis et al. (2011) to the stochastic optimization setting. Thus we also share relation to past stochastic QN approaches. Erway et al. (2020) use the tools described in Brust et al. (2017) to create a stochastic trust-region solver using LSR1 updates. Schraudolph et al. (2007) generalizes BFGS and LBFGS to the online convex optimization setting. Mokhtari and Ribeiro (2014) studies BFGS applied to the stochastic convex case and develops a regularization scheme to prevent the BFGS matrix from becoming singular. Sohl-Dickstein et al.
(2014) explores domain-specific modifications to SGD and BFGS for sum-of-functions minimization, where the objective function is composed of the sum of multiple differentiable subfunctions. Byrd et al. (2016) considers not using simple gradient differencing for the BFGS update, but instead more carefully building (s_k, y_k) pairs using Hessian-vector products; Berahas et al. (2021) also explores a similar idea of carefully choosing s_k and y_k. Wang et al. (2016) tries to prevent ill-conditioning of B_k for BFGS updates, similar to Mokhtari and Ribeiro (2014), but explicitly for the nonconvex case. Keskar and Berahas (2016) present an optimizer designed specifically for RNNs that builds on Byrd et al. (2016).

Our Contributions.
1. A fast O(mn) approach for exactly solving the cubic regularization problem for any limited-memory quasi-Newton approximation that lends itself to an efficient eigendecomposition, such as LBFGS and LSR1.
2. A hybrid first- and second-order stochastic Quasi-Newton ARC framework that is competitive with current SOTA optimizers.
3. Convergence theory that proves convergence in the nonconvex case.
4. Strong empirical results of this optimizer applied to real-life nonconvex problems.

3. ALGORITHM

In this section we describe the proposed algorithm. We first provide a brief introduction to LQN matrices, then describe how to exactly and efficiently solve Equation 1 when B_k is defined by an LQN matrix. We will demonstrate that the computational complexity of ARCLQN (detailed in Algorithm 3) is similar to that of classical LQN solvers. Later in this section we describe how to solve the nonlinear optimization problem (Algorithm 2) using this subproblem solver. Until Section 3.2, for simplicity, we motivate the problem by largely considering full-batch gradient descent. However, the techniques developed in this paper will largely be applied in the stochastic setting. Popular Quasi-Newton updates such as BFGS, DFP, and SR1 are based on iteratively updating an initial matrix B_0 = γI with rank-one or rank-two corrections with pairs (s_k, y_k) such that the property B_k s_k = y_k is maintained at each update (Nocedal and Wright, 2006). For example, the popular SR1 update formula is given by the recursive relation

    B_{k+1} ← B_k + (y_k - B_k s_k)(y_k - B_k s_k)^T / (s_k^T (y_k - B_k s_k)),    (2)

where y_k := g_k - g_{k-1} and s_k := x_k - x_{k-1}. To verify that the update is well-defined,

    |s_k^T (y_k - B_k s_k)| > ϵ ∥s_k∥ ∥y_k - B_k s_k∥    (3)

is checked with a small number ϵ. If condition 3 is not satisfied, B_{k+1} ← B_k. This helps ensure that B_k remains bounded. While for much of this paper we focus on the SR1 update, we stress that the exact subproblem solver proposed in this section holds for all QN variants described in Erway and Marcia (2015). We discuss many such variants later in Section 3.2. Note that if B_k is explicitly formed, the computational and memory costs are at least O(n²); as such, for large-scale problems, limited-memory variants are popular. In such cases, only the most recent m ≪ n pairs (s, y) are stored in n × m matrices S_k := (s_{k-m+1}, …, s_k) and Y_k := (y_{k-m+1}, …, y_k).
In the limited memory case, B k is never explicitly formed, and operations using B k are performed using only γ, S, and Y using O(mn) operations. How this is done specifically for the cubic-regularized case will become clearer later in this section. Before proceeding, we will next briefly describe the approach used by Cartis et al. (2011) for the case where B k is dense. Later we describe how to adapt their dense approach to the limited-memory case.
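As an illustration of this bookkeeping, the following sketch maintains the (S, Y) history with the well-definedness check of Equation 3. The helper name, the deque-based storage, and the `B_mv` callable are our assumptions for illustration, not the paper's code.

```python
import numpy as np
from collections import deque

def lsr1_update(S, Y, s, y, B_mv, m=5, eps=1e-8):
    """Append an (s, y) pair to the limited-memory history, applying the
    SR1 well-definedness check of Equation 3.

    S, Y : deques holding at most m past steps / gradient differences
    B_mv : callable computing B_k @ v for the current approximation
    The pair is skipped when |s^T (y - Bs)| <= eps * ||s|| * ||y - Bs||,
    which keeps B_k bounded (the SR1 update is otherwise ill-defined).
    """
    r = y - B_mv(s)
    if abs(s @ r) > eps * np.linalg.norm(s) * np.linalg.norm(r):
        S.append(s)
        Y.append(y)
        if len(S) > m:        # keep only the m most recent pairs
            S.popleft()
            Y.popleft()
    return S, Y
```

Replacing the oldest column when the history is full is exactly the "one column changes per iteration" property exploited later to update Ψ^T Ψ incrementally.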

3.1. SOLVING THE CUBIC REGULARIZED SUB-PROBLEM

In this section we focus on efficiently finding a global solution to the cubic regularized subproblem given in Equation 1, restated here for convenience:

    minimize_{s ∈ R^n} m_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s + (1/3) σ_k ∥s∥^3.    (1)

We start by describing a Newton-based approach proven to be convergent in Cartis et al. (2011). Though their approach targets dense matrices B_k where Cholesky factorizations are viable, we subsequently show in this section how to efficiently extend this approach to large-scale limited-memory QN matrices. The Newton-based solver for Equation 1 is based on the following theorem:

Theorem 1 ((Cartis et al., 2011)). Let B_k(λ) := B_k + λI, let λ_1 denote the smallest eigenvalue of B_k, and u_1 its corresponding eigenvector. A step s*_k is a global minimizer of m_k(s) if there exists a λ* ≥ max(0, -λ_1) such that

    B_k(λ*) s*_k = -g_k,    (4)
    ∥s*_k∥ = λ*/σ_k,    (5)

implying B_k(λ*) is positive semidefinite. Further, only if B_k is indefinite, u_1^T g_k = 0, and ∥(B_k - λ_1 I)† g_k∥ ≤ -λ_1/σ_k, then λ* = -λ_1.

For simplicity we define s(λ) := -(B + λI)^{-1} g, where the pseudo-inverse is used in the case λ = -λ_1. We can then see that for the case where λ* ≥ -λ_1, s*_k is given by the solution to the secular equation

    ϕ_1(λ) := 1/∥s(λ)∥ - σ/λ = 0.    (6)

Note the authors of Cartis et al. (2011) show that when B_k is indefinite and u_1^T g = 0, the solution s*_k is given by s*_k = s(-λ_1) + α u_1, where α is a solution to the equation -λ_1 = σ∥s(-λ_1) + α u_1∥. That is, whenever Equation 6 fails to have a solution (the "hard case"), s*_k is obtained by adding a multiple of the direction of greatest negative curvature to the minimum two-norm solution of Equation 4 so that Equation 5 is satisfied. The authors of (Cartis et al., 2011) thus apply Newton's method to ϕ_1(λ), resulting in Algorithm 1. This corresponds to Algorithm (6.1) of Cartis et al. (2011).
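As an independent sanity check (not taken from the paper), Theorem 1's conditions can be verified numerically on a tiny dense instance: a grid search for the global minimizer of the cubic model should approximately satisfy (B + λ*I)s* = -g with λ* = σ∥s*∥. All constants below are arbitrary illustrative choices.

```python
import numpy as np

# A tiny 2-D instance of Equation 1 (constants are arbitrary).
B = np.diag([1.0, 3.0])
g = np.array([1.0, 1.0])
sigma = 2.0

def m(s):
    # cubic model with the constant term f(x_k) dropped
    return s @ g + 0.5 * s @ (B @ s) + (sigma / 3.0) * np.linalg.norm(s) ** 3

# Crude grid search for the global minimizer of m.
grid = np.linspace(-2.0, 2.0, 201)
best = min(((m(np.array([a, b])), a, b) for a in grid for b in grid))
s_star = np.array([best[1], best[2]])

# Theorem 1: (B + lam* I) s* = -g with lam* = sigma * ||s*||.
lam_star = sigma * np.linalg.norm(s_star)
residual = np.linalg.norm((B + lam_star * np.eye(2)) @ s_star + g)
# residual is small, up to the grid resolution
```

The residual shrinks with the grid spacing, confirming that the stationarity condition and the norm condition are satisfied simultaneously at the global minimizer.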
Algorithm 1 Newton's method to find s* by solving ϕ_1(λ) = 0.

    if B indefinite and u_1^T g = 0 then
        if ∥s(-λ_1)∥ < -λ_1/σ then
            Solve -λ_1 = σ∥s(-λ_1) + α u_1∥ for α
            s* ← s(-λ_1) + α u_1
        else
            s* ← s(-λ_1)
        end if
    else
        Let λ > max(0, -λ_1).
        while ϕ_1(λ) ≠ 0 do
            Solve (B + λI)s = -g for s.    (7)
            Factor B + λI = LL^T and solve Lw = s for w.    (8)
            Compute the Newton correction
                Δλ^N := λ (∥s∥ - λ/σ) / (∥s∥ + (λ/σ)(λ∥w∥²/∥s∥²)).    (9)
            Let λ ← λ + Δλ^N.
        end while
        s* ← s(λ)
    end if

At first glance, Algorithm 1 may not look feasible, as Equation 8 requires the Cholesky factor L, which is only cheaply obtained for small dense systems. Looking closer, we note that to execute Algorithm 1 one need not form s and w, as only their norms are needed to compute Δλ^N. Relevantly, it has been demonstrated that matrices in the Quasi-Newton family have compact matrix representations of the form

    B = γI + Ψ M^{-1} Ψ^T,    (10)

further detailed in Byrd et al. (2016). For example, for LSR1, Ψ = Y - γS and M = (E - γS^T S), where E is a symmetric approximation of the matrix S^T Y whose lower-triangular elements equal those of S^T Y (Erway and Marcia, 2015). They further show that for matrices of this class, an O(mn) calculation may be used to implicitly form the spectral decomposition B = UΛU^T, where U is never formed but stored implicitly, and Λ satisfies

    Λ = [ γI  0 ]
        [ 0   Λ̂ ],    (11)

where Λ̂ ∈ R^{m×m} is the diagonal matrix defined in Erway and Marcia (2015) as diag(λ̂_1, …, λ̂_m). Thus B will have a cluster of n - m eigenvalues equal to γ. We can exploit this property to further reduce the computational complexity of Algorithm 1. Note again that Δλ^N in Equation 9 can be computed as long as ∥s∥ and ∥w∥ are known. Using the eigendecomposition of B_k, we get

    ∥s∥² = g^T U (Λ + λI)^{-2} U^T g,
    ∥w∥² = s^T L^{-T} L^{-1} s = s^T (B + λI)^{-1} s = g^T U (Λ + λI)^{-3} U^T g.

Note here that λ denotes the parameter optimized in Algorithm 1 and not a diagonal value of Λ.
If we then define U block-wise, we can define the components ĝ_1 and ĝ_2 as follows:

    U = (U_1  U_2)  ⇒  U^T g = (U_1^T g; U_2^T g) = (ĝ_1; ĝ_2).

Thus we can compute ∥s∥ and ∥w∥ in O(m) operations assuming ĝ is stored, giving

    ∥s∥² = ∥ĝ_1∥²/(λ + γ)² + Σ_{i=1}^m ĝ_2(i)²/(λ̂_i + λ)²,    (13)
    ∥w∥² = ∥ĝ_1∥²/(λ + γ)³ + Σ_{i=1}^m ĝ_2(i)²/(λ̂_i + λ)³.    (14)

Using this, the computational cost of Newton's method is reduced to an arguably inconsequential amount, assuming that ∥ĝ_1∥ and ĝ_2, required by Equations 13 and 14, can be efficiently computed. We dub this optimization the "norm-trick". We additionally note that Y and S each change by only one column per iteration of Algorithm 3 (defined later). Thus, with negligible overhead, we can cheaply update the matrix Ψ^T Ψ ∈ R^{m×m} each iteration by retaining previously computed values that do not depend on the new (s, y) pair. Making the following two assumptions, we can then show that Algorithm 1 can be executed with negligible overhead compared to classical LQN approaches.

Assumption 1. The matrix T = Ψ^T Ψ is stored and updated incrementally. That is, if Ψ has one column replaced, then only one row and column of T is updated.

Assumption 2. The vector ū = Ψ^T g is computed once each iteration of Algorithm 3 and stored.

We note that classic LQN methods at each iteration must update B_k and then solve a system of the form s = -(B_k + λI)^{-1} g for some λ ≥ 0. This creates an O(mn) computational lower bound that we aim to likewise achieve when generating an optimal step for Equation 1. In contrast to the approach described in Ranganath et al. (2022), which uses a QR factorization of Ψ, we use an approach analogous to that described in Burdakov et al. (2017) for trust-region methods, performing the majority of the required calculations on matrices of size m × m in place of n × m. This saves significantly on both storage and computational overhead. Note that unlike Burdakov et al.
(2017), we do not explicitly compute M^{-1}, as we have found this matrix can periodically become ill-conditioned. Using the techniques detailed here (and proven in Section D of the appendix), we can form a very efficient solver for Equation 1 (using a modified version of Algorithm 1). Using the norm-trick, we avoid explicitly forming s and w, reducing the complexity of a Newton iterate from O(mn) to O(m). Using Equations 13 and 14 and Assumptions 1-2, we can thus solve λ* = σ∥s*∥ from Algorithm 1 in O(m³) additional operations once T and ū are formed. Finally, once λ* is recovered, we can form s* in O(mn) operations via a single inversion of a shifted system, the same complexity as classical LQN approaches. The full derivation and proof are available in Section D of the appendix.
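Under Assumptions 1-2, one possible rendering of the reduced-subspace Newton iteration is sketched below for the easy (positive-definite) case. The function name, initialization constant, and placement of the stopping rule are our choices; the hard case, which requires the eigenvector u_1, is omitted.

```python
import numpy as np

def solve_cubic_secular(gamma, lam_hat, g1_norm_sq, g2, sigma,
                        tol=1e-10, max_iter=100):
    """Newton's method on phi_1(lambda) = 1/||s(lambda)|| - sigma/lambda
    using only the implicit spectrum of B (the "norm-trick").

    gamma      : eigenvalue repeated on the large cluster of B
    lam_hat    : array of the m distinct eigenvalues of B
    g1_norm_sq : ||U_1^T g||^2 (mass of g on the gamma-cluster)
    g2         : U_2^T g (components of g on the m eigenvectors)
    Each norm evaluation below costs O(m) (Equations 13 and 14).
    Easy (positive-definite) case only.
    """
    lam_min = min(gamma, lam_hat.min())
    lam = max(0.0, -lam_min) + 1e-8            # start just above the lower bound
    for _ in range(max_iter):
        s_sq = g1_norm_sq / (lam + gamma) ** 2 + np.sum(g2**2 / (lam_hat + lam) ** 2)
        w_sq = g1_norm_sq / (lam + gamma) ** 3 + np.sum(g2**2 / (lam_hat + lam) ** 3)
        s_norm = np.sqrt(s_sq)
        if abs(s_norm - lam / sigma) < tol:    # stopping rule as in Section B.1
            break
        # Newton correction of Equation 9, computed from norms alone
        dlam = lam * (s_norm - lam / sigma) / (
            s_norm + (lam / sigma) * (lam * w_sq / s_sq))
        lam += dlam
    return lam
```

Once λ* is returned, a single O(mn) shifted solve recovers s* = -(B + λ*I)^{-1} g, matching the complexity claim above.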

3.2. SOLVING THE NONLINEAR OPTIMIZATION PROBLEM

In this section we focus on solving the problem

    min_{x ∈ R^n} f(x) = Σ_{i=1}^N f_i(x),    (15)

where f_i(x) is the loss for the i-th datapoint, using the subproblem solver defined in Section 3.1. We follow the ARC framework as described in Cartis et al. (2011), stated here as Algorithm 2. A benefit of Algorithm 2 is that first-order convergence is proven if B_k remains bounded and f(x) ∈ C¹(R^n). Thus the condition that B_k = ∇²f(x) is greatly relaxed relative to predecessors such as Nesterov and Polyak (2006).

Algorithm 2 Adaptive Regularization using Cubics (ARC). The fallback branch of Equation 18 is our modification to default to an SGD-like step on failure.

Given x_0, σ_0 > 0, γ_2 ≥ γ_1 > 1, η_2 > η_1 > 0, α > 0, for k = 0, 1, …, until convergence:

1. Compute an update s*_k such that m_k(s*_k) ≤ m_k(s^c_k), where the Cauchy point s^c_k = -υ^c_k g_k and υ^c_k = argmin_{υ ∈ R_+} m_k(-υ g_k).    (16)

2. Compute the ratio of the actual reduction to the predicted reduction:
       ρ_k ← (f(x_k) - f(x_k + s*_k)) / (f(x_k) - m_k(s*_k)).    (17)

3. Update
       x_{k+1} ← x_k + s*_k   if ρ_k ≥ η_1,
       x_{k+1} ← x_k - α g_k  otherwise.    (18)

4. Set
       σ_{k+1} ∈ (0, σ_k]           if ρ_k > η_2,
       σ_{k+1} ∈ [σ_k, γ_1 σ_k]     if η_2 ≥ ρ_k ≥ η_1,
       σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k] otherwise.    (19)

In Algorithm 2, we first solve the CR subproblem (Equations 1 and 16; Algorithm 1) to find our step s*_k. We then determine whether the step is accepted by examining whether the ratio between the decrease in the objective, f(x_k) - f(x_k + s*_k), and the predicted decrease in the objective, f(x_k) - m_k(s*_k), is large enough (Equations 17-18). Then, depending on ρ_k, η_1, and η_2, we adjust the regularization parameter σ_k: the 'better' the step, the more we decrease σ_{k+1}, and the worse it is, the more we increase it (Equation 19). The amounts of increase and decrease are governed by the two hyperparameters γ_1 and γ_2.
We note one important modification to the ARC framework: if we find that ρ_k < η_1, we take an SGD step instead of simply setting x_{k+1} ← x_k (Equation 18). While, empirically, rejected steps are not common, we find that reverting to SGD on failure can save time in cases where B_k is ill-conditioned. One may note that we have no guarantee that f(x_k) - f(x_k - αg_k) > 0, which may seem to contradict the ARC pattern detailed in Cartis et al. (2011), which only accepts steps that improve the loss. However, Chen et al. (2018) proves that in a trust-region framework, if all steps are accepted, ρ_k need only be positive half of the time for almost-sure convergence (Paquette and Scheinberg (2020) prove a similar result for first-order methods). It has also been shown that noisy SGD steps improve the final solution quality (Zhang et al., 2017; Zou et al., 2021). Implementation details regarding Algorithm 2 can be found in Section B of the appendix and in Algorithm 3. Joining the optimizations presented in Section 3.1 with the modifications in Section 3.2, we form the full ARCLQN algorithm, explicitly described in Algorithm 3. It is worth noting that while much of the above discussion assumes our Hessian approximation B_k is an LQN matrix with a compact representation, this is not required. Indeed, any Hessian approximation that lends itself to a fast eigendecomposition and inversion may be applied to this modified ARC framework, with the caveat that Algorithm 1 may be slower if the norm-trick cannot be used. We explore this potential extension in Section 4.3, where we use the positive-definite Hessian approximation proposed in Ma (2020). We also provide theoretical analysis of the proposed framework in Section E of the appendix, where we prove that under moderate assumptions ARCLQN converges in the nonconvex case.

Algorithm 3 ARCLQN, our proposed algorithm for solving Algorithm 2 under memory constraints.
Require: x_0 : initial parameter vector
Require: 0 < η_1 < η_2 : hyperparameters measuring the level of success of a step
Require: D, q : dataset and minibatch size, respectively
Require: σ_0 : starting regularization parameter
Require: ϵ, δ : tolerance parameters
Require: f(x, b) : objective function with input parameters x and minibatch b
Require: α_1, α_2 : learning rates

 1: Initialize B_0 = I.
 2: for k = 1, 2, … do
 3:     Let b_k be a minibatch sampled randomly from D of size q
 4:     g_k ← ∇_x f(x_{k-1}, b_k)
 5:     Calculate λ_1 of B_{k-1}
 6:     Let λ ← max(-λ_1, 0) + ϵ
 7:     Compute s*_k (using Algorithm 1)
 8:     Calculate ρ (as in Equation 17)
 9:     if ρ ≥ η_1 then
10:         x_k ← x_{k-1} + α_1 s*_k
11:         y ← ∇_x f(x_k, b_k) - g_k
12:         Update B_k using B_{k-1}, α_1 s*_k, y if the update and resulting B_k are well-defined
13:         if ρ ≥ η_2 then
14:             σ_k ← max(σ_{k-1}/2, δ)
15:         end if
16:     else
17:         σ_k ← 2 · σ_{k-1}
18:         x_k ← x_{k-1} - α_2 g_k
19:         y ← ∇_x f(x_k, b_k) - g_k
20:         Update B_k using B_{k-1}, -α_2 g_k, y if the update and resulting B_k are well-defined
21:     end if
22: end for
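Putting Equations 17-19 and the SGD fallback together, a single outer iteration of the modified framework might be sketched as follows. The function signature and helper callables are illustrative assumptions, not the paper's implementation; the σ doubling/halving mirrors lines 14 and 17 of Algorithm 3.

```python
def arc_outer_step(x, f, grad, solve_subproblem, m_k, sigma,
                   eta1=0.05, eta2=0.6, alpha=0.005, sigma_min=1e-8):
    """One outer iteration of the modified ARC loop (Equations 17-19),
    with the paper's SGD-like fallback on rejection.

    solve_subproblem(g, sigma) -> s : minimizer of the cubic model
    m_k(s) -> float                 : model value at the current iterate
    eta1, eta2, alpha follow the values listed in Section C.
    """
    g = grad(x)
    s = solve_subproblem(g, sigma)
    fx = f(x)
    rho = (fx - f(x + s)) / (fx - m_k(s))     # Equation 17
    if rho >= eta1:
        x = x + s                              # accept the cubic step
        if rho >= eta2:                        # very successful step
            sigma = max(sigma / 2.0, sigma_min)
    else:
        sigma = 2.0 * sigma                    # unsuccessful: regularize more
        x = x - alpha * g                      # SGD-like fallback (Equation 18)
    return x, sigma
```

On a 1-D quadratic with an exact Newton subproblem solver, a single call drives the iterate to the minimizer and halves σ, since ρ = 1.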

4. NUMERICAL RESULTS

4.1. COMPARISON TO SR1

We start by benchmarking the optimized CR subproblem solver alone, without integration into the larger ARCLQN optimizer. These results are summarized in Table 1. All timing information is reported as the average across 10 runs. We see that the dense SR1 solver fails to scale beyond 10,000 variables. We also see that the traditional LSR1 solver becomes computationally prohibitive in higher dimensions. For example, when n = 10^8, the positive-definite test case takes 274 seconds to converge for the inversion-based solver, whereas following the steps outlined in Section 3.1, this is reduced to 2.33 seconds, a speedup of over 100x. Considering that the CR subproblem represents the bulk of the computation of any given optimization step, this performance improvement greatly increases the scalability of the algorithm. In the next section, we use the enhancements highlighted here to provide preliminary results using Algorithm 3.

Table 1 (excerpt): solve time in seconds for the positive-definite test case as the problem dimension n grows (rightmost column: n = 10^8).

    LSR1 (inversion-based): 7.85e-3, 1.64e-2, 9.09e-2, 1.48e-1, 2.12e0, 3.06e1, 2.74e2
    ARCLQN (this work):     3.64e-3, 6.03e-3, 6.45e-3, 7.90e-3, 2.68e-2, 2.70e-1, 2.33e0

4.2. AUTOENCODING

We also experiment with using ARCLQN as an optimizer for an autoencoder, as detailed in Goodfellow et al. (2016). Hyperparameters are detailed in Section C of the appendix. For this experiment, as our Hessian approximation, we use an LSR1 matrix Ramamurthy and Duffy (2016). Results are summarized in Figure 1 and Table 2. All numbers reported are averaged across 10 runs. It is worth noting that while LBFGS converges more rapidly (by number of steps), it is over twice as slow as our approach (by wall-clock time) and suffers from numerical stability issues: of the 10 runs performed, 2 failed due to NaN loss.

4.3. IMAGENET

Due to a lack of computational resources, we resize ImageNet to 32x32. Following Ma (2020), we use a modified version of ResNet-18 (dubbed ResNet-110) adapted for smaller image sizes. Additionally, we use the best hyperparameters from Ma (2020), namely the learning rate, epsilon, and momentum, for all optimizers. Here we use the positive-definite diagonal Hessian approximation presented by Ma (2020) as our B_k. Unlike the other optimizers, which received extensive hyperparameter searches, ARCLQN achieves strong results using the same hyperparameters as in the CIFAR-10 experiments. This is significant: our proposed method outperforms or compares to all optimizers considered without expensive hyperparameter tuning or hacking. It is worth noting that Apollo, the optimizer proposed by Ma (2020), requires a long warmup period for good performance. Our approach has no such requirement. We theorize this result is due to a combination of defaulting to SGD on failure and the ARC framework preventing steps that would otherwise degrade performance. Finally, in Table 3, we can see that ARCLQN is associated with both the highest Top-1 accuracy and the lowest computational cost.

5. CONCLUSION AND DISCUSSION

Conclusion. We have introduced a new family of optimizers, referred to as ARCLQN, which utilizes a novel fast large-scale solver for the CR subproblem. We demonstrate very large speedups over a baseline implementation, and we find that ARCLQN is competitive with modern first-order and second-order optimizers on real-world nonconvex problems with minimal tuning. To the best of our knowledge, ARCLQN is the first extension of the ARC framework to the limited-memory case without major modification of the core framework. We additionally expand upon ARC, explicitly incorporating first-order updates into our methodology. Finally, we provide convergence analysis of the modified framework, which proves convergence even in the nonconvex case. Limitations. While we have introduced an optimization framework that is applicable to any Hessian approximation with a fast eigendecomposition, we do not consider Hessian approximations for which this information is not readily available. Additionally, if our Hessian approximation's eigenvalues differ greatly between steps, this can lead to oscillation in the calculated ∥s_k∥. Future work may also include a wider variety of evaluated Hessian approximations, as there are many that were not tested here.

Reproducibility Statement

In this paper we provide all details to recreate our implementation, including algorithms matching how the optimizer is written in code. These are located in Section 3. We also provide all implementation details and hyperparameters in Section B and Section C respectively.

A ETHICAL CONSIDERATIONS

With the ever-increasing utilization and adoption of more powerful models, it is increasingly important for authors to consider the ethical aspects of their work. Our work is very general in nature, as ARCLQN can be used as an optimizer for any function where gradient information is easily available. A major positive impact of this paper and subsequent research may be substantial reductions in power consumption, as seen preliminarily in the significantly reduced runtime in Table 3. Additionally, as a general-purpose optimizer, this work may help progress societally beneficial research (such as in medicine). However, it also holds the potential to be misused (e.g., being used to train unethical models that discriminate based on protected personal attributes). This potential for misuse is inherent to all general-purpose optimization research; avoiding it entirely is impractical and beyond the scope of this work. In our numerical results section (Section 4), we use two image-based datasets: CIFAR-10 and ImageNet. Improper or careless use of datasets that contain sensitive information should be avoided where possible. We believe both CIFAR-10 and the version of ImageNet we used pose very little risk, as both datasets are at 32x32 resolution, hiding most sensitive information. The research done in this paper abides by the licenses provided by the authors of the datasets.

B IMPLEMENTATION DETAILS

Second-order methods can at times be unstable. To achieve good performance and stable training, it is important to use heuristics to prevent or alleviate this instability. For the sake of complete transparency, we share all used heuristics and modifications not explicitly detailed elsewhere. It is worth explicitly noting that none of the parameters below have been tuned for performance, and instead have been chosen either arbitrarily, or via test-runs on toy problems. There may be significant room for improvement with tuning of these parameters, and we leave that to future work.

B.1 GENERAL DETAILS

An important detail is that we do not check whether ϕ_1(λ) = 0 exactly when using Newton's method in Algorithm 1. Instead, we repeat the while loop until |∥s∥ - λ/σ| < ν. For CIFAR-10 and ImageNet experiments, we set ν = 1e-5. For the comparison to SR1, we set ν = 1e-7. We empirically find that for larger-scale problems, ν can be set higher, as ϕ_1(λ) does not change much in the final iterations of Algorithm 1. For CIFAR-10 experiments, we do not take an actual SGD step, but instead use an Adam step Kingma and Ba (2015). Finally, we also bound σ from above, as in rare cases ρ < η_1 can occur multiple times in a row, which can lead to many very small steps being taken with little effect on performance.

B.2 LSR1 SPECIFIC

If we repeatedly take very similar or very small steps, we can run into issues with S_k being singular or B_k being ill-conditioned. We use two heuristics to detect and fix this. First, before updating B_k on lines 12 and 20 of Algorithm 3, we set y ← y/max(∥s∥, κ) and s ← s/max(∥s∥, κ). This prevents B_k from becoming ill-conditioned when ∥s∥ is very small. Additionally, we reset B_k if the minimum eigenvalue of S_k^T S_k is less than κ. When we 'reset' B_k, we drop the first and last columns of S_k and Y_k instead of setting B_k ← I; this prevents the reset from destroying too much curvature information. We set κ = 1e-7.
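These two heuristics can be sketched as follows (helper names are ours; `kappa` plays the role of κ = 1e-7):

```python
import numpy as np

def safeguard_pair(s, y, kappa=1e-7):
    """Rescale an (s, y) pair before the LSR1 update, as in Section B.2.

    Dividing both vectors by max(||s||, kappa) leaves the curvature
    information of the pair unchanged (both are scaled identically)
    while preventing near-zero steps from making B_k ill-conditioned.
    """
    scale = max(np.linalg.norm(s), kappa)
    return s / scale, y / scale

def needs_reset(S, kappa=1e-7):
    """Detect a near-singular step history: reset when the smallest
    eigenvalue of S^T S falls below kappa (S is n x m, columns = steps)."""
    return np.linalg.eigvalsh(S.T @ S).min() < kappa
```

Because the SR1 update depends on (s, y) only through ratios of these vectors, the joint rescaling changes the conditioning of the stored history without altering the secant information it encodes.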

C HYPERPARAMETERS

This section details optimizer settings not otherwise explicitly mentioned in the paper. For all experiments, default dataset splits are used. C.1 CIFAR-10. The hyperparameters used to generate Figure 1 can be found in Table 4. For CIFAR-10 experiments, α_2 = 0.005 for ARCLQN was set arbitrarily. For ARCLQN, we use η_1 = 0.05, η_2 = 0.6; these hyperparameters were not tuned, but instead set to be similar to Cartis et al. (2011). For all optimizers with momentum, β values were left at their defaults and not tuned. Since the Apollo paper emphasizes that warmup is extremely important for their method, we use linear warmup over 500 steps Ma (2020). We performed a hyperparameter search over learning rates α ∈ {.001, .01, .1, .75, 1} for all optimizers. For applicable optimizers, we also searched for the optimal ϵ ∈ {1e-4, 1e-8, 1e-16}. For each optimizer, we chose the hyperparameter settings that led to the lowest test loss after 10 epochs. For limited-memory methods, we fixed the history size at m = 5. We use a convolutional neural network Goodfellow et al. (2016) with 3 convolutional layers followed by 3 transposed convolutional layers. For all layers, padding and stride are set to 1 and 2, respectively. Between all layers (except the middle one), SeLU Klambauer et al. (2017) activations are used; the final layer uses a sigmoid activation. Layers have 3, 12, 24, 48, 24, and 12 input channels, respectively; minibatch size is fixed at 128. We use binary cross-entropy as our loss function. The hyperparameters used to generate Figures 2-3 and Table 3 can be found in Table 5. For ARCLQN, we use the same hyperparameters as in Table 4. We use the cosine learning rate annealing scheduler Loshchilov and Hutter (2017), in line with Ma (2020), from which we take many of our hyperparameters.

D PROOF OF SUBPROBLEM SOLVER COMPLEXITY

This section contains full proofs of many of the claims made in the main paper. We restate the assumptions made there, then proceed with the proofs.

Assumption 1. The matrix T = Ψ^T Ψ is stored and updated incrementally. That is, if Ψ has one column replaced, then only one row and column of T is updated.

Assumption 2. The vector ū = Ψ^T g is computed once each iteration of Algorithm 3 and stored.

Unlike trust-region steps, cubic-regularization search steps can sometimes grow quickly when negative curvature is found. To stabilize the approach, we assume a safeguard is used (as in line-search methods Zhou et al. (2017)) that uniformly bounds the second-order correction matrix away from singularity.

Assumption 3. The final search direction has the form s = -U^T Λ̂^{-1} U g.

From the results below, Algorithm 3 will iteratively reduce f(x) with probability one when α_k goes to 0. Thus, it is expected that Assumptions 8 and 9 can be satisfied. In Algorithm 3, α_1 and α_2 are used as the learning rates. To simplify notation, α_k is used in this section to denote either of them at a given iteration k. The following assumption is then given.

Assumption 10. The sequence of learning rates α_k in Algorithm 3 is chosen such that:
1. Σ_{i=1}^{+∞} α_i = +∞
2. Σ_{i=1}^{+∞} α_i^2 < +∞

The first theorem below ensures that Algorithm 1 solves problem 1; it can be found in Cartis et al. (2011).

Theorem 5. (Cartis et al. (2011)) Algorithm 1 converges to the global solution of problem 1 whenever the initial λ satisfies max(0, -λ_1) < λ < σ∥s∥, where λ_1 denotes the smallest eigenvalue of B_k.

Note that an initial λ for the preceding theorem is easily found by choosing λ suitably close to its lower bound. The following lemma can be found in Berahas et al.
(2021)) Suppose that x_k is generated by Algorithm 3 and Assumption 5 holds, and that B_k is the Hessian approximation updated by Equation 2 when the new curvature pair satisfies Equation 3. Then there exists a constant c_1 > 0 such that ∥B_k∥ ≤ c_1.

Using Lemma 6, the following important theorem can be derived.

Theorem 7. Suppose that x_k is generated by Algorithm 3 and Assumption 5 holds, and that B_k is the Hessian approximation generated by Equation 2 when the new curvature pair satisfies Equation 3. Then there exists a constant c_2 > 0 such that ∥B_k + σ_k∥s_k∥I∥ ≤ c_2.

Proof. We first show that σ_k∥s_k∥ is bounded for all k. Let B_k = U^T Λ U, where Λ is a diagonal matrix and U^T U = I. Then we have

B_k + σ_k∥s_k∥I = U^T (Λ + σ_k∥s_k∥I) U.

Therefore,

(B_k + σ_k∥s_k∥I)(B_k + σ_k∥s_k∥I) = U^T (Λ + σ_k∥s_k∥I)(Λ + σ_k∥s_k∥I) U.

Because of Lemma 6, there exist λ_k^min and λ_k^max such that

(σ_k∥s_k∥ + λ_k^min)^2 ≤ [x^T (B_k + σ_k∥s_k∥I)(B_k + σ_k∥s_k∥I) x] / (x^T x) ≤ (σ_k∥s_k∥ + λ_k^max)^2   (20)

for all nonzero x. Note that λ_k^min and λ_k^max are bounded by c_1. Setting x = s_{k+1} in Equation 20, and because s_{k+1} is a solution of problem 1, we have

(σ_k∥s_k∥ + λ_k^min)^2 ≤ (g_k^T g_k)/(s_{k+1}^T s_{k+1}) ≤ (σ_k∥s_k∥ + λ_k^max)^2.   (21)

There are now two scenarios.

1. (g_k^T g_k)/(s_{k+1}^T s_{k+1}) is bounded. Because of Equation 21, we can conclude that σ_k∥s_k∥ is bounded, and the theorem follows.

2. (g_k^T g_k)/(s_{k+1}^T s_{k+1}) is unbounded. That is, there exists M_k → ∞ such that (g_k^T g_k)/(s_{k+1}^T s_{k+1}) ≥ M_k. Because of Assumption 8, g_k^T g_k is bounded, so s_k^T s_k → 0. Using a Taylor expansion, we note from Equation 17 that

ρ_k = [f(x_k) - f(x_k + s_k)] / [f(x_k) - m_k(s_k)] = [f(x_k) - m_k(s_k) - O(∥s_k∥^3)] / [f(x_k) - m_k(s_k)].   (22)

But as ∥g_k∥ ≥ M_k∥s_k∥, we have

f(x_k) - m_k(s_k) = -g_k^T s_k - s_k^T B_k s_k ≥ M_k∥s_k∥^2 - c_1∥s_k∥^2   (23)

when ∥s_k∥ is small. Therefore, when s_k is very small, combining Equations 22 and 23 we have

|ρ_k - 1| = O(∥s_k∥^3) / [f(x_k) - m_k(s_k)] ≤ O(∥s_k∥^3) / [(M_k - c_1)∥s_k∥^2].

Because M_k → ∞, we can now conclude that ρ_k → 1 as ∥s_k∥ → 0. This means that σ_k is bounded. Therefore, when (g_k^T g_k)/(s_{k+1}^T s_{k+1}) is unbounded, we have that ∥s_k∥ → 0 and σ_k is bounded, so σ_k∥s_k∥ is bounded. We can then conclude that ∥B_k + σ_k∥s_k∥I∥ is bounded. ∎

We now focus on the proof of the convergence of Algorithm 3.

Lemma 8. Suppose that x_k is generated by Algorithm 3 and Assumptions 3, 5, 6, 7, 8, and 9 hold. Then there exists c_4 > 0 such that

E[f(x_{k+1})] ≤ E[f(x_k)] - (α_k/c_3) E[∥∇f(x_k)∥^2] + (α_k^2/2) c_4.   (24)

Proof. Because of Assumption 3, there exists a c_3 > 0 such that

s^T (B_k + σ_k∥s_k∥I) s ≥ c_3 ∥s∥^2   (25)

holds for all s and all iterations k. Now suppose x_{k+1} = x_k + α_k s_k. By Taylor's theorem, there exists θ_k such that

f(x_{k+1}) = f(x_k) + α_k s_k^T ∇f(x_k) + (α_k^2/2) s_k^T H(θ_k) s_k.   (26)

There are now two scenarios.

1. ρ ≥ η_1, so s_k is the solution of the cubic-regularized subproblem. Writing λ_k for the optimal shift σ_k∥s_k∥, we have s_k = -(B_k + λ_k I)^{-1} g_k, and Equation 26 becomes

f(x_{k+1}) = f(x_k) - α_k g_k^T (B_k + λ_k I)^{-1} ∇f(x_k) + (α_k^2/2) s_k^T H(θ_k) s_k.   (27)

Let ∇f(x_k) = g_k + ξ_k. By Assumption 7, we have E(ξ_k | x_k) = 0.
So from Equation 27, we now have

f(x_{k+1}) = f(x_k) - α_k ∇f(x_k)^T (B_k + λ_k I)^{-1} ∇f(x_k) + α_k ξ_k^T (B_k + λ_k I)^{-1} ∇f(x_k) + (α_k^2/2) s_k^T H(θ_k) s_k.   (28)

Note that because of Equation 25 and Assumption 8, we have

c_3 ∥s_k∥^2 ≤ s_k^T (B_k + λ_k I) s_k ≤ ∥s_k∥ ∥g_k∥ ≤ L_g ∥s_k∥,

so s_k is bounded, with ∥s_k∥ ≤ L_g/c_3. Because of Equation 28, we further have

f(x_{k+1}) ≤ f(x_k) - (α_k/c_3) ∥∇f(x_k)∥^2 + α_k ∇f(x_k)^T (B_k + λ_k I)^{-1} ξ_k + (α_k^2/2) s_k^T H(θ_k) s_k.   (29)

Because of Assumptions 6 and 9, this becomes

f(x_{k+1}) ≤ f(x_k) - (α_k/c_3) ∥∇f(x_k)∥^2 + α_k ∇f(x_k)^T (B_k + λ_k I)^{-1} ξ_k + (α_k^2/2)(L_2 ∥s_k∥ + L_H) ∥s_k∥^2.   (30)

We now take the expected value of both sides of the above inequality. Because E(ξ_k | x_k) = 0, we have

E[f(x_{k+1}) | x_k] ≤ f(x_k) - (α_k/c_3) ∥∇f(x_k)∥^2 + (α_k^2/2)(L_2 (L_g/c_3) + L_H)(L_g/c_3)^2.   (31)

2. ρ < η_1, that is, the SGD direction is used as s_k. Similarly, we have

E[f(x_{k+1}) | x_k] ≤ f(x_k) - α_k ∥∇f(x_k)∥^2 + (α_k^2/2) L_g^2.   (32)

Combining Equations 31 and 32, there exists c_4 > 0 such that

E[f(x_{k+1}) | x_k] ≤ f(x_k) - (α_k/c_3) ∥∇f(x_k)∥^2 + (α_k^2/2) c_4.   (33)

Thus Lemma 8 holds. ∎

Theorem 9. Suppose that x_k is generated by Algorithm 3 and Assumptions 3, 4, 5, 6, 7, 8, and 9 hold. Then we have

lim_{k→∞} E[∥∇f(x_k)∥] = 0.   (34)

Proof. Because of Lemma 8, we have

Σ_{k=1}^{N} E[f(x_{k+1})] ≤ Σ_{k=1}^{N} E[f(x_k)] - Σ_{k=1}^{N} (α_k/c_3) E[∥∇f(x_k)∥^2] + Σ_{k=1}^{N} (α_k^2/2) c_4.   (35)
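The step-size conditions of Assumption 10, which drive Theorem 9, are met for instance by the classical Robbins-Monro schedule α_k = α/k. This worked check is our example, not part of the paper:

```latex
% Example: \alpha_k = \alpha / k satisfies both parts of Assumption 10.
\sum_{k=1}^{+\infty} \alpha_k
  = \alpha \sum_{k=1}^{+\infty} \frac{1}{k} = +\infty
  \quad \text{(the harmonic series diverges),}
\qquad
\sum_{k=1}^{+\infty} \alpha_k^2
  = \alpha^2 \sum_{k=1}^{+\infty} \frac{1}{k^2}
  = \frac{\alpha^2 \pi^2}{6} < +\infty .
```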



All experiments in this section were run on two Intel Xeon Gold 6150 processors. All experiments in this section were run on a single NVIDIA V100 GPU over 2 days. All experiments in this section were run on a single NVIDIA A100 GPU over 3 days.



We use CIFAR-10 Krizhevsky (2009) as our dataset and compare against a number of recent optimizers Kingma and Ba (2015); Ruder (2016); Ma (2020); Yao et al. (2021).

Figure 1: Test set loss of the trained CIFAR-10 autoencoder, evaluated at the end of each epoch.

Figure 2: Average training loss for each epoch.
Figure 3: Test set accuracy for each epoch.


Timing information for solving the CR subproblem, Equation 1. A hyphen indicates that the test did not terminate within 300 seconds. SR1 corresponds to a dense SR1 implementation. LSR1 corresponds to ARCLQN without the norm trick. For limited-memory experiments, m = 3 was used. Cases are detailed in Section 3. All other columns correspond to the problem dimension, and entries give the time (in seconds) required to find the global minimizer s* on CPU.
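The cost being timed above is dominated by a one-dimensional root-find for the optimal shift λ* = σ∥s*∥. A hedged sketch of that root-find follows, assuming the eigenbasis of B is available (eigenvalues `eigvals` and rotated gradient `g_hat`, as recovered in Theorem 2) and ignoring the hard case; the function name is ours, and we use safeguarded bisection in place of the paper's Newton iteration.

```python
import numpy as np


def solve_shift(eigvals, g_hat, sigma, tol=1e-12, max_iter=200):
    """Find lambda* satisfying sigma * ||s(lambda*)|| = lambda*, where
    ||s(lambda)||^2 = sum_i g_hat_i^2 / (eigvals_i + lambda)^2.

    psi(lambda) = sigma*||s(lambda)|| - lambda is strictly decreasing
    on (max(0, -lambda_1), inf), so the root is unique and bisection
    is safe once the sign flips.
    """
    def norm_s(lam):
        return np.sqrt(np.sum(g_hat ** 2 / (eigvals + lam) ** 2))

    lo = max(0.0, -eigvals.min()) + 1e-14   # Theorem 5 lower bound
    hi = max(2.0 * lo, 1.0)
    while sigma * norm_s(hi) - hi > 0:      # grow until psi(hi) < 0
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if sigma * norm_s(mid) - mid > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Each evaluation of `norm_s` costs O(m) in the limited-memory setting, which is why the shift can be found in negligible time compared with forming the factorization itself.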


Hyperparameter settings for the optimizers used in the CIFAR-10 autoencoding experiments. A dash indicates that the optimizer does not have a given hyperparameter.

Hyperparameter settings for the optimizers used in the ImageNet classification experiments. A dash indicates that the optimizer does not have a given hyperparameter.

E CONVERGENCE ANALYSIS

Several useful assumptions are given to establish the global convergence of Algorithm 3. Let b below denote a minibatch.

Assumption 4. The function f(x) is bounded below by a scalar L_f.

Assumption 5. ∇f(x) is Lipschitz continuous for all x. That is, ∥∇f(x) - ∇f(y)∥ ≤ L_1∥x - y∥.

Assumption 6. H(x) is Lipschitz continuous for all x. That is, ∥H(x) - H(y)∥ ≤ L_2∥x - y∥.

Assumption 7. For any iteration k, we have that E[g(x_k, b)] = ∇f(x_k).

Assumption 8. For any iteration k, the gradient for that minibatch g(x_k, b) is bounded for all x_k. That is, ∥g(x_k, b)∥ ≤ L_g.

Assumption 9. For any iteration k, the Hessian for that minibatch H(x_k, b) is bounded for all x_k. That is, ∥H(x_k, b)∥ ≤ L_H.


where Λ̂_ii = max(τ, Λ_ii + λ*) for a small positive constant τ, where Λ is given by Equation 11 and λ* denotes the optimal shift found by Algorithm 1.

The following theorem shows that, given T and ū as defined in Assumptions 1 and 2, the optimal λ* can be cheaply obtained by Algorithm 1 using O(m^3) operations.

Theorem 2. Suppose that B def= γI + ΨM^{-1}Ψ^T as defined in Equation 10, and that (V, Λ) solves the generalized eigenvalue problem Mv = λTv. Then U_2 as defined in Equation 12 is given by U_2 = ΨV. Further, the corresponding eigenvalues Λ̂ from Equation 11 are given by Λ̂ = γI + Λ^{-1}. We can then recover ĝ_2 = V^T ū and ∥ĝ_1∥ = √(g^T g - ĝ_2^T ĝ_2).

Proof. Rather than inverting M, we can simply solve the generalized eigenvalue problem [V, Λ] = eig(M, Ψ^T Ψ), where Λ is the diagonal matrix of generalized eigenvalues for the system Mv = λΨ^TΨv. Then we have U_2 = ΨV, implying

B U_2 = γΨV + ΨM^{-1}(Ψ^TΨ)V = ΨV(γI + Λ^{-1}) = U_2(γI + Λ^{-1}).

Thus we can set Λ̂ from Equation 11 as Λ̂ = γI + Λ^{-1}. Further, we can recover ĝ_2 = U_2^T g = V^T Ψ^T g = V^T ū, and ∥ĝ_1∥^2 = g^T g - ĝ_2^T ĝ_2. ∎

Using the previous theorems and Equations 13 and 14, we can thus obtain λ* = σ∥s*∥ from Algorithm 1 in O(m^3) additional operations once T and ū are formed. We now show how to efficiently recover the optimal s* for Equation 1 using O(mn) operations.

Theorem 3. Using the same assumptions and definitions as in Theorem 2, given any λ > max(0, -λ_1), the solution s = -(B + λI)^{-1}g is given by

s = -(1/(λ + γ)) (g - ΨVr),

where r can be formed with O(m^2) computations.

Proof. Note that at the very end we must form the search direction by solving the system (B + λI)s = -g with the optimal value of λ. Expanding (B + λI)^{-1} in the eigenbasis of B, this implies

s = -U_2(Λ̂ + λI)^{-1}U_2^T g - (1/(γ + λ))(I - U_2U_2^T)g = -(1/(λ + γ))(g - ΨVr),

with r = (I - (γ + λ)(Λ̂ + λI)^{-1}) ĝ_2, which can be formed with O(m^2) computations. Note that we can further save on computation by storing Ψ^T g for this final calculation. ∎

Theorem 4. Let (λ_1, u_1) denote the eigenpair corresponding to the most negative eigenvalue of the matrix B. Then, if γ < min(diag(Λ̂)), u_1 can be formed as u_1 = r̃/∥r̃∥, where r̃ = (I - U_2U_2^T)r for any vector r ∈ R^n such that ∥r̃∥ > 0. Otherwise, u_1 = Ψv_k, where v_k is the column of V that corresponds to the smallest eigenvalue of Λ̂.

Proof. Note that (I - U_2U_2^T) is the projection matrix onto the subspace span(U_1), implying U_2^T r̃ = 0. Then B r̃ = γU_1U_1^T r̃ = γ(I - U_2U_2^T)r̃ = γr̃, since r̃ has already been projected. Thus r̃ is an eigenvector of B with eigenvalue γ. If γ is not the smallest eigenvalue of B, then by design u_1 can be obtained as U_2 e_1, assuming the eigenvalues of Λ̂ are sorted smallest to largest. ∎

We conclude with the remainder of the proof of Theorem 9. Rearranging Equation 35, we have

Σ_{k=1}^{N} (α_k/c_3) E[∥∇f(x_k)∥^2] ≤ E[f(x_1)] - E[f(x_{N+1})] + Σ_{k=1}^{N} (α_k^2/2) c_4.

Note that, by Assumption 4,

E[f(x_{N+1})] ≥ L_f.   (36)

So, from Equations 35 and 36, we have

Σ_{k=1}^{N} (α_k/c_3) E[∥∇f(x_k)∥^2] ≤ f(x_1) - L_f + (c_4/2) Σ_{k=1}^{N} α_k^2.

That is, letting N → ∞, Σ_{k=1}^{+∞} α_k E[∥∇f(x_k)∥^2] < +∞. Because of Assumptions 4 and 10, we have Σ_{k=1}^{+∞} α_k = +∞ while this weighted sum is finite, which forces E[∥∇f(x_k)∥^2] → 0; by Jensen's inequality, E[∥∇f(x_k)∥] → 0 as well. Thus, the theorem follows. ∎
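Assumption 1 and Theorem 2 together can be checked numerically. The sketch below is ours (hypothetical function names, NumPy assumed): `replace_column` patches the Gram matrix T = Ψ^TΨ in O(mn) when one column of Ψ changes, and `lqn_eigen` recovers the eigenpairs of B = γI + ΨM^{-1}Ψ^T from the small generalized eigenvalue problem, using a Cholesky reduction of T so only m×m factorizations are needed (assuming T is well conditioned, which the κ safeguard of Appendix B maintains).

```python
import numpy as np


def replace_column(Psi, T, j, psi_new):
    """Replace column j of Psi and patch T = Psi^T Psi in O(mn),
    touching only row j and column j of T (Assumption 1)."""
    Psi = Psi.copy()
    T = T.copy()
    Psi[:, j] = psi_new
    col = Psi.T @ psi_new            # new j-th row/column of T
    T[:, j] = col
    T[j, :] = col
    return Psi, T


def lqn_eigen(Psi, M, gamma):
    """Eigenpairs of B = gamma*I + Psi M^{-1} Psi^T via the
    generalized problem M v = lambda (Psi^T Psi) v (Theorem 2)."""
    T = Psi.T @ Psi
    L = np.linalg.cholesky(T)        # T assumed SPD and well conditioned
    Linv = np.linalg.inv(L)
    # Reduce M v = lambda T v to the symmetric problem
    # (L^-1 M L^-T) w = lambda w with v = L^-T w.
    lam, W = np.linalg.eigh(Linv @ M @ Linv.T)
    V = Linv.T @ W                   # generalized eigenvectors
    U2 = Psi @ V                     # eigenvectors of B on span(Psi)
    Lambda_hat = gamma + 1.0 / lam   # corresponding eigenvalues
    return U2, Lambda_hat
```

The identity being exercised is exactly the one in the proof: if Mv = λTv, then BΨv = γΨv + ΨM^{-1}Tv = (γ + 1/λ)Ψv.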

