A NOVEL FAST EXACT SUBPROBLEM SOLVER FOR STOCHASTIC QUASI-NEWTON CUBIC REGULARIZED OPTIMIZATION

Abstract

In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of second-order optimization strategies that use a cubic-regularization (CR) term in place of trust regions or line searches. Solving the CR subproblem exactly requires Newton's method, yet by exploiting the internal structure of LQN matrices we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing very large speedups. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide empirical results for different LQN matrices and find that our proposed method matches or exceeds all tested optimizers with minimal tuning.

1. INTRODUCTION

Scalable second-order methods for training deep learning models have shown great potential, yet those that rely on Hessian-vector products may be prohibitively expensive to use. In this paper, we focus on algorithms that require information similar to Stochastic Gradient Descent (SGD) (Ruder, 2016), namely, stochastic gradients calculated on mini-batches of data. Quasi-Newton (QN) methods are a natural higher-order alternative to first-order methods, in that they seek to model curvature information dynamically from past steps based on available gradient information. Thus, they can work out of the box in the same settings as SGD with little model-specific coding required. However, this comes with possible instability of the step size. The step size can be controlled using a line search along a given direction s, or using a trust region to find the best s for a given step length. A relatively recent alternative to these approaches is cubic regularization (Nesterov and Polyak, 2006; Cartis et al., 2011; Tripuraneni et al., 2018), which shows very promising results. In detail, we study the minimization problem

    minimize_{s ∈ R^n}  m_k(s) := f(x_k) + s^T g_k + (1/2) s^T B_k s + (1/3) σ_k ||s||^3,    (1)

for a given x_k, where g_k := ∇f(x_k), B_k is a Hessian approximation, σ_k is an iteratively chosen adaptive regularization parameter, and f is the objective function to minimize. Equation 1 is also known as the CR subproblem. Cubic regularization shows promise because it can be shown that if ∇^2 f is Lipschitz continuous with constant L, then f(x_k + s) ≤ m_k(s) whenever σ_k ≥ L and B_k s = ∇^2 f(x_k) s (Nesterov and Polyak, 2006). Thus, if the Hessian approximation B_k behaves like ∇^2 f(x) along the search direction s, the model function m_k(s) becomes an upper bound on the objective f(x + s). In such cases a line search would not be needed, as reduction in m_k(s) translates directly into reduction in f(x + s), removing the risk that the computational work performed minimizing m_k(s) is wasted.

We propose an efficient exact solver for Equation 1 using Newton's method which is tractable in large-scale optimization problems under near-identical conditions to those in which SGD itself is commonly applied. As Newton's method corresponds to much of the computational overhead when solving Equation 1, a dense approach such as that described in Cartis et al. (2011) would be intractable at this scale.
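To make the CR subproblem concrete, the following is a minimal dense sketch of an exact solver for Equation 1, not the matrix-free LQN solver developed in this paper. It eigendecomposes B_k and applies Newton's method to the scalar secular equation λ = σ ||s(λ)|| arising from the first-order condition (B_k + λI) s = -g_k. The function name and safeguard constants are illustrative assumptions, and the so-called hard case (where the gradient has no component along the leftmost eigenvector) is not handled.

```python
import numpy as np

def solve_cr_subproblem(B, g, sigma, tol=1e-10, max_iter=100):
    """Dense exact CR subproblem solver (illustrative sketch).

    Minimizes m(s) = g^T s + 0.5 s^T B s + (sigma/3) ||s||^3
    (the constant f(x_k) does not affect the minimizer) via the
    first-order condition (B + lam*I) s = -g with lam = sigma*||s||,
    applying Newton's method to the scalar secular equation in lam.
    """
    lams, Q = np.linalg.eigh(B)       # B = Q diag(lams) Q^T, ascending
    gt = Q.T @ g                      # gradient in the eigenbasis
    lam_lo = max(0.0, -lams[0])       # lam >= lam_lo keeps B + lam*I PSD
    lam = lam_lo + 1e-8
    for _ in range(max_iter):
        d = lams + lam
        s_norm = np.sqrt(np.sum((gt / d) ** 2))   # ||s(lam)||
        phi = sigma * s_norm - lam                # root gives optimal lam
        if abs(phi) <= tol * max(1.0, lam):
            break
        # d||s||/dlam = -sum(gt^2 / d^3) / ||s||  (always negative)
        dnorm = -np.sum(gt**2 / d**3) / s_norm
        lam = max(lam + phi / (1.0 - sigma * dnorm), lam_lo + 1e-12)
    s = -Q @ (gt / (lams + lam))      # s = -(B + lam*I)^{-1} g
    return s, lam
```

The eigendecomposition costs O(n^3), which illustrates why a dense solver of this kind is unsuitable at deep learning scale and motivates exploiting LQN structure instead.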

