ISAAC NEWTON: INPUT-BASED APPROXIMATE CURVATURE FOR NEWTON'S METHOD

Abstract

We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based only on the input to a respective layer, without substantial computational overhead. The proposed method allows for effective training even in small-batch stochastic regimes, which makes it competitive with first-order as well as second-order methods.

1. INTRODUCTION

While second-order optimization methods are traditionally much less explored than first-order methods in large-scale machine learning (ML) applications due to their memory requirements and prohibitive per-iteration computational cost, they have recently become more popular in ML, mainly due to their fast convergence properties when compared to first-order methods [1]. The expensive computation of an inverse Hessian (also known as the pre-conditioning matrix) in the Newton step has also been tackled via estimating the curvature from the change in gradients. Loosely speaking, these algorithms are known as quasi-Newton methods; for a comprehensive treatment, see Nocedal & Wright [2]. Various approximations to the pre-conditioning matrix have been proposed in the recent literature [3]-[6]. From a theoretical perspective, second-order optimization methods are not nearly as well understood as first-order methods, and filling this gap is an active research direction [7], [8]. Motivated by the task of training neural networks, and by the observation that invoking local curvature information associated with neural network objective functions can achieve much faster progress per iteration than standard first-order methods [9]-[11], several methods have been proposed. One method that has received significant attention is Kronecker-factored Approximate Curvature (K-FAC) [12], whose main ingredient is a sophisticated approximation to the generalized Gauss-Newton matrix and the Fisher information matrix quantifying the curvature of the underlying neural network objective function, which can then be inverted efficiently. Inspired by the K-FAC approximation and the Tikhonov regularization of the Newton method, we introduce a novel two-parameter regularized Kronecker-factorized Newton update step.
The proposed scheme disentangles the classical Tikhonov regularization and, in a specific limit, allows us to condition the gradient using selected second-order information with an asymptotically vanishing computational overhead. While this case makes the presented method highly attractive from a computational complexity perspective, we demonstrate that its empirical performance on high-dimensional machine learning problems remains comparable to existing state-of-the-art methods. The contributions of this paper can be summarized as follows: (i) we propose a novel two-parameter regularized K-FAC-approximated Gauss-Newton update step; (ii) we prove that, for an arbitrary pair of regularization parameters, the proposed update direction is always a direction of decreasing loss; (iii) in the limit, as one regularization parameter grows, we obtain an efficient and effective conditioning of the gradient with an asymptotically vanishing overhead; (iv) we empirically analyze the method and find that our efficient conditioning method maintains the performance of its more expensive counterpart; (v) we demonstrate the effectiveness of the method in small-batch stochastic regimes and observe performance competitive with first-order as well as quasi-Newton methods.

2. PRELIMINARIES

In this section, we review aspects of second-order optimization, with a focus on generalized Gauss-Newton methods. In combination with Kronecker factorization, this leads us to a new regularized update scheme.

We consider the training of an L-layer neural network f(x; θ) defined recursively as

    z_i ← a_{i-1} W^{(i)}   (pre-activations),        a_i ← φ(z_i)   (activations),        (1)

where a_0 = x is the vector of inputs and a_L = f(x; θ) is the vector of outputs. Unless noted otherwise, we assume these vectors to be row vectors (i.e., in R^{1×n}), as this allows for a direct extension to the batch-vectorized case (i.e., in R^{b×n}) introduced later. For any layer i, let W^{(i)} ∈ R^{d_{i-1}×d_i} be a weight matrix and let φ be an element-wise nonlinear function. We consider a convex loss function L(y, y′) that measures the discrepancy between y and y′. The training optimization problem is then

    arg min_θ E_{x,y}[ L(f(x; θ), y) ],        (2)

where θ = (θ^{(1)}, ..., θ^{(L)}) with θ^{(i)} = vec(W^{(i)}). The classical Newton method for solving (2) is expressed as the update rule

    θ′ = θ − η H_θ^{−1} ∇_θ L(f(x; θ), y),        (3)

where η > 0 denotes the learning rate and H_θ is the Hessian corresponding to the objective function in (2). The stability and efficiency of an estimation problem solved via the Newton method can be improved by adding a Tikhonov regularization term [13], leading to the regularized Newton method

    θ′ = θ − η (H_θ + λI)^{−1} ∇_θ L(f(x; θ), y),        (4)

where λ > 0 is the so-called Tikhonov regularization parameter. It is well known [14], [15] that, under the assumption of approximating the model f with its first-order Taylor expansion, the Hessian corresponds to the so-called generalized Gauss-Newton (GGN) matrix G_θ, and hence (4) can be expressed as

    θ′ = θ − η (G_θ + λI)^{−1} ∇_θ L(f(x; θ), y).        (5)

A major practical limitation of (5) is the computation of the inverse term.
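To make the damped update (4) concrete, the following is a minimal toy sketch (not the paper's implementation; problem sizes and the regularization value are illustrative assumptions) that applies the Tikhonov-regularized Newton step to a convex least-squares objective, where the Hessian is available in closed form:

```python
import numpy as np

# Toy least-squares objective: L(theta) = 0.5 * ||A @ theta - y||^2,
# with gradient A.T @ (A @ theta - y) and constant Hessian H = A.T @ A.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
y = rng.normal(size=20)
H = A.T @ A  # Hessian of the quadratic objective

def gradient(theta):
    return A.T @ (A @ theta - y)

def regularized_newton_step(theta, lam=1e-2, eta=1.0):
    # theta' = theta - eta * (H + lam*I)^{-1} grad, cf. the damped Newton update (4).
    # Solving the linear system is preferred over forming the explicit inverse.
    d = np.linalg.solve(H + lam * np.eye(H.shape[0]), gradient(theta))
    return theta - eta * d

theta = np.zeros(5)
losses = [0.5 * np.linalg.norm(A @ theta - y) ** 2]
for _ in range(20):
    theta = regularized_newton_step(theta)
    losses.append(0.5 * np.linalg.norm(A @ theta - y) ** 2)
# For this convex quadratic, every damped step decreases the loss,
# and the iterates converge to the least-squares solution.
```

For a quadratic objective the damping λ only slows convergence slightly; its role in the neural-network setting is to keep the (stochastic, possibly ill-conditioned) curvature estimate invertible and the step size controlled.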
A method that alleviates this difficulty is Kronecker-Factored Approximate Curvature (K-FAC) [12], which approximates the block-diagonal (i.e., layer-wise) empirical Hessian or GGN matrix. Inspired by K-FAC, other works have discussed approximations of G_θ and its inverse [15]. In the following, we discuss a popular approach that allows for (moderately) efficient computation. The generalized Gauss-Newton matrix G_θ is defined as

    G_θ = E[ (J_θ f(x; θ))^⊤ ∇²_f L(f(x; θ), y) J_θ f(x; θ) ],        (6)

where J and ∇² denote the Jacobian and Hessian operators, respectively. Correspondingly, the diagonal block of G_θ corresponding to the weights W^{(i)} of the i-th layer is

    G_{W^{(i)}} = E[ (J_{W^{(i)}} f(x; θ))^⊤ ∇²_f L(f(x; θ), y) J_{W^{(i)}} f(x; θ) ].

According to the backpropagation rule J_{W^{(i)}} f(x; θ) = J_{z_i} f(x; θ) ⊗ a_{i-1}, the identity a^⊤ b = a ⊗ b for row vectors (identifying a matrix with its vectorization), and the mixed-product property of the Kronecker product, we can rewrite G_{W^{(i)}} as

    G_{W^{(i)}} = E[ (J_{z_i} f(x; θ) ⊗ a_{i-1})^⊤ (∇²_f L(f(x; θ), y))^{1/2} (∇²_f L(f(x; θ), y))^{1/2} (J_{z_i} f(x; θ) ⊗ a_{i-1}) ]        (7)
                = E[ (ḡ ⊗ a_{i-1})^⊤ (ḡ ⊗ a_{i-1}) ] = E[ (ḡ^⊤ ḡ) ⊗ (a_{i-1}^⊤ a_{i-1}) ],        (8)

where

    ḡ = (∇²_f L(f(x; θ), y))^{1/2} J_{z_i} f(x; θ).        (9)

Remark 1 (Monte-Carlo Low-Rank Approximation of ḡ^⊤ ḡ). As ḡ is a matrix of shape m × d_i, where m is the dimension of the output of f, ḡ is generally expensive to compute. Therefore, [12] use a low-rank Monte-Carlo approximation to estimate ∇²_f L(f(x; θ), y) and thereby ḡ^⊤ ḡ. For this, we need to use the distribution underlying the probabilistic model of our loss L (e.g., a Gaussian for the MSE loss, or a categorical distribution for the cross-entropy loss). Specifically, by sampling from this distribution
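The Kronecker structure in (8) can be verified numerically. The following sketch (toy dimensions, NumPy only; ḡ and a_{i-1} are filled with random values standing in for the quantities defined above) builds the exact layer-wise GGN block from the Jacobian ḡ ⊗ a_{i-1} and checks that it equals the Kronecker product of the two small factors, i.e., the mixed-product property:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_in, d_out = 3, 4, 5  # output dim of f, layer input dim, layer output dim (toy sizes)

g_bar = rng.normal(size=(m, d_out))  # stands in for (Hess_f L)^{1/2} @ J_{z_i} f, shape m x d_i
a = rng.normal(size=(1, d_in))       # stands in for the layer input a_{i-1} (row vector)

# Jacobian of f w.r.t. vec(W^{(i)}) is the Kronecker product g_bar (x) a.
J = np.kron(g_bar, a)                # shape m x (d_out * d_in)

# Exact GGN block vs. its Kronecker factorization (mixed-product property):
# (g_bar (x) a)^T (g_bar (x) a) = (g_bar^T g_bar) (x) (a^T a)
G_exact = J.T @ J
G_kron = np.kron(g_bar.T @ g_bar, a.T @ a)
print(np.allclose(G_exact, G_kron))  # True
```

The payoff of this factorization is inversion cost: using (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}, the factored block can be inverted at the cost of the two small factors, roughly O(d_i³ + d_{i-1}³), instead of O((d_i · d_{i-1})³) for the full block.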

