A NOVEL FAST EXACT SUBPROBLEM SOLVER FOR STOCHASTIC QUASI-NEWTON CUBIC REGULARIZED OPTIMIZATION

Abstract

In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of second-order optimization strategies that utilize a cubic-regularization (CR) term in place of trust-regions or line-searches. Solving the CR subproblem exactly requires Newton's method, yet using properties of the internal structure of LQN matrices, we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing very large speedups. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide empirical results for different LQN matrices and find that our proposed method matches or exceeds all tested optimizers with minimal tuning.

1. INTRODUCTION

Scalable second-order methods for training deep learning models have shown great potential, yet those that build on Hessian-vector products may be prohibitively expensive to use. In this paper, we focus on algorithms that require information similar to Stochastic Gradient Descent (SGD) Ruder (2016), namely, stochastic gradients computed on mini-batches of data. Quasi-Newton (QN) methods are a natural higher-order alternative to first-order methods, in that they seek to model curvature information dynamically from past steps based on available gradient information. Thus, they can work out of the box in the same settings as SGD with little model-specific coding required. However, this comes with possible instability of the step size. Controlling the step size can be done using line-searches along a given direction s or using trust-regions to find the best s for a given step size. A relatively recent alternative to these approaches is known as cubic regularization Nesterov and Polyak (2006); Cartis et al. (2011); Tripuraneni et al. (2018), which shows very promising results. In detail, we study the minimization problem

    minimize_{s ∈ R^n}  m_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s + (1/3) σ_k ||s||^3,    (1)

for a given x_k, where g_k := ∇f(x_k), B_k is a Hessian approximation, σ_k is an iteratively chosen adaptive regularization parameter, and f(x_k) is the objective function to minimize evaluated at x_k. Equation 1 is also known as the CR subproblem. Cubic regularization shows promise because it can be shown that if ∇²f is Lipschitz continuous with constant L, then f(x_k + s) ≤ m_k(s) whenever σ_k ≥ L and B_k s = ∇²f(x_k)s Nesterov and Polyak (2006). Thus, if the Hessian approximation B_k behaves like ∇²f(x_k) along the search direction s, the model function m_k(s) becomes an upper bound on the objective f(x_k + s). In such cases a line-search is not needed, as reduction in m_k(s) translates directly into reduction in f(x_k + s), removing the risk that the computational work performed minimizing m_k(s) is wasted.

We propose an efficient exact solver for Equation 1 using Newton's method which is tractable in large-scale optimization problems under near-identical conditions to those in which SGD itself is commonly applied. As Newton's method accounts for much of the computational overhead when solving Equation 1, a dense approach such as that described in Cartis et al. (2011) would be prohibitive. However, by exploiting properties of LQN methods described in Erway and Marcia (2015) and Burdakov et al. (2017) (and further applied in other papers Chen et al. (2014); pei Lee et al.; Lee et al. (2022)), we can instead perform Newton's method in a reduced subspace such that the cost per Newton iteration is reduced from O(mn) to O(m), where n is the problem dimension and m is the history size, commonly chosen to be an integer between 5 and 20. The full-space solution to Equation 1 can then be recovered for a cost identical to that of classic LQN methods. To the best of our knowledge, all previous attempts to use LQN methods in the context of the ARC framework have necessarily had to change the definition of m_k(s) in order to find an approximate solution Andrei (2021); Liu et al. (2021); Ranganath et al. (2022). Remarkably, we present a mechanism for minimizing m_k(s) using computational effort similar to a single matrix inversion of a shifted LQN matrix, which itself is a lower bound on the complexity of traditional LQN approaches. Further, we show that by applying Newton's method in the reduced subspace, we can achieve speed improvements of more than 100x over a naive (LQN inversion-based) implementation. In the numerical results section we further show that this modification permits the application of LQN matrices with exact cubic regularization as a practical optimizer for large DNNs.
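To make the subproblem solve concrete, the sketch below (our own illustration; `solve_cr_eig` and the explicit-eigenvalue setting are assumptions, not the paper's implementation) minimizes Equation 1 when B_k is available through an eigendecomposition B_k = QΛQ^T. Writing ḡ := Q^T g, the optimality conditions collapse to the scalar secular equation λ = σ||s(λ)|| with s(λ) = -(Λ + λI)^{-1} ḡ, which Newton's method drives to machine precision in a few iterations; the so-called hard case (ḡ orthogonal to the leftmost eigenvector) is ignored for brevity.

```python
import math

def solve_cr_eig(eigvals, gbar, sigma, tol=1e-12, max_iter=100):
    """Exactly minimize  gbar^T s + 1/2 s^T diag(eigvals) s + (sigma/3)*||s||^3
    (Equation 1 expressed in the eigenbasis of B_k) via Newton's method on
    the scalar secular equation  phi(lam) = lam/sigma - ||s(lam)|| = 0,
    where s(lam) = -(diag(eigvals) + lam*I)^{-1} gbar.
    Assumes gbar has a nonzero leftmost component (no hard case)."""
    def s_norm(lam):
        return math.sqrt(sum((gi / (ei + lam)) ** 2
                             for ei, gi in zip(eigvals, gbar)))

    # lam must keep B_k + lam*I positive definite: lam > max(0, -lambda_min)
    lam = max(0.0, -min(eigvals)) + 1e-8
    for _ in range(max_iter):
        ns = s_norm(lam)
        phi = lam / sigma - ns
        # d||s||/dlam = -sum_i gbar_i^2/(eig_i+lam)^3 / ||s||  (always < 0)
        dns = -sum(gi * gi / (ei + lam) ** 3
                   for ei, gi in zip(eigvals, gbar)) / ns
        step = phi / (1.0 / sigma - dns)   # Newton step on phi
        lam -= step
        if abs(step) < tol:
            break
    # recover the minimizer (still expressed in the eigenbasis)
    return lam, [-gi / (ei + lam) for ei, gi in zip(eigvals, gbar)]
```

In an LQN setting the eigenvalues would be obtained implicitly from the compact representation of B_k rather than formed explicitly, so each Newton iteration touches only m-dimensional projected quantities, which is where the O(mn) to O(m) per-iteration reduction comes from.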

2. RELATED WORK

Second-order methods in machine learning are steadily growing more common Berahas et al. (2021); Brust et al. (2017); Chen et al. (2020); Goldfarb et al. (2020); Ma (2020); Ramamurthy and Duffy (2016); Yao et al. (2021). Limited-memory SR1 (LSR1) updates are studied in the context of ARC methodology using a "memory-less" variant in Andrei (2021). As we will describe in Section 3, many QN methods iteratively update B_k matrices with pairs (s_k, y_k) such that B_k s_k = y_k, where y_k denotes the difference of the corresponding gradients of the objective function. In Andrei (2021), y_k is a difference of the gradients of m_k(s). Liu et al. (2021) approximately minimize m_k(s) by solving shifted systems of the form (B_k + σ||s_{k-1}||I)s_k = -g, where the norm of the previous step is used as an estimate of ||s_k|| to define the optimal shift. As described in Theorem 1 in Section 3, the optimal solution necessarily satisfies a condition of the form (B_k + σ||s_k||I)s_k = -g. Since ||s_k|| may vary greatly between iterations, this solution is a noisy approximation. They further simplify the subproblem by using only the diagonals of B_k + σ||s_{k-1}||I when generating s_k. Ranganath et al. (2022) solve a modified version of the problem using a shape-changing norm as the cubic overestimation that provides an analytical solution to Equation 1. They transform m_k(s) using similar strategies to those advocated in this paper. However, this norm definition is dependent on the matrix B_k and thus makes the definition of the target Lipschitz constant, L, dependent as well. A nontrivial distinction between our approaches is that theirs requires a QR factorization of matrices of size n × m. This may be prohibitive for deep learning problems, which may have billions of parameters. Bergou et al. (2017) explore a similar idea of making the norm dependent on the QN matrix. In Park et al. (2020), the ARC framework with stochastic gradients is used with a Hessian-based approach first advocated by Martens et al. (2010). In this case, ∇²f(x) is approximated within a Krylov-based subspace using Hessian-vector products with batched estimates of ∇²f(x). They then minimize m_k(s) within this small-dimensional subspace.
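For illustration, the diagonal variant described above reduces to one division per coordinate; the following sketch (our own hypothetical rendering, not code from Liu et al. (2021)) makes the reuse of the previous step's norm explicit:

```python
def diag_shifted_step(diag_B, g, sigma, prev_step_norm):
    """Approximate CR step in the spirit described above: solve
    (diag(B_k) + sigma*||s_{k-1}||*I) s = -g coordinate-wise, reusing
    the previous step's norm as a stand-in for the unknown ||s_k||.
    Assumes every diagonal entry plus the shift is nonzero."""
    shift = sigma * prev_step_norm
    return [-gi / (di + shift) for di, gi in zip(diag_B, g)]
```

The gap between the shift σ||s_{k-1}|| used here and the σ||s_k|| demanded by the optimality condition is precisely the source of noise discussed above.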


An alternative to ARC methods is the use of trust-regions or line-searches. Though fundamentally different approaches, we can often borrow technology from the trust-region subproblem solver space and adapt it to the ARC context. For example, Brust et al. (2017) outlines mechanisms for efficiently computing (B_k + λI)^{-1} g and an implicit eigendecomposition of B_k + λI when solving the trust-region subproblem of minimizing q_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s subject to ||s|| ≤ δ. Burdakov et al. (2017) significantly reduces the complexity and memory cost of such algebraic operations while solving the same problem. We thus adopt select operations developed therein when applicable, to adapt the method of Cartis et al. (2011) to the LQN context. Unlike the approach advocated in Burdakov et al. (2017), we avoid inversions of potentially ill-conditioned systems to improve the stability of the approach while simultaneously reducing computational overhead. Note that we further extend Cartis et al. (2011) to the stochastic optimization setting. Thus we also share a relation to past stochastic QN approaches. Erway et al. (2020) use the tools described in Brust et al. (2017) to create a stochastic trust-region solver using LSR1 updates. Schraudolph et al. (2007) generalizes BFGS and LBFGS to the online convex optimization setting. Mokhtari and Ribeiro (2014) studies BFGS applied to the stochastic convex case and develops a regularization scheme to prevent
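To give a flavor of the matrix-free algebra involved, the following minimal sketch (our own; the function and variable names are assumptions, and the compact representations in Brust et al. (2017) and Burdakov et al. (2017) generalize this from rank 1 to rank 2m) applies the Sherman-Morrison formula to solve the shifted system (B + λI)x = -g for a single SR1 update B = γI + vv^T/(v^T s), v = y - γs, in O(n) time without ever forming an n × n matrix:

```python
def shifted_sr1_solve(s, y, gamma, lam, g):
    """Solve (B + lam*I) x = -g for a single SR1 update
    B = gamma*I + v v^T / (v^T s),  v = y - gamma*s,
    via the Sherman-Morrison formula in O(n), matrix-free.
    Assumes v^T s != 0 (the usual SR1 safeguard) and a nonsingular shift."""
    v = [yi - gamma * si for yi, si in zip(y, s)]
    vts = sum(vi * si for vi, si in zip(v, s))   # SR1 denominator v^T s
    delta = gamma + lam                          # diagonal of B + lam*I
    b = [-gi for gi in g]                        # right-hand side
    vtb = sum(vi * bi for vi, bi in zip(v, b))
    vtv = sum(vi * vi for vi in v)
    # Sherman-Morrison: (delta*I + v v^T/vts)^{-1} b
    #   = b/delta - v * (v^T b) / (delta * (delta*vts + v^T v))
    denom = delta * (delta * vts + vtv)
    return [bi / delta - vi * vtb / denom for bi, vi in zip(b, v)]
```

The same identity, applied blockwise through the Woodbury formula, is what makes shifted solves with limited-memory matrices cheap enough to embed inside an iterative subproblem solver.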





