ISAAC NEWTON: INPUT-BASED APPROXIMATE CURVATURE FOR NEWTON'S METHOD

Abstract

We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that a good conditioner can be computed from only the input to the respective layer, without substantial computational overhead. The proposed method allows effective training even in small-batch stochastic regimes, which makes it competitive with both first-order and second-order methods.

1. INTRODUCTION

While second-order optimization methods are traditionally much less explored than first-order methods in large-scale machine learning (ML) applications due to their memory requirements and prohibitive per-iteration computational cost, they have recently gained popularity in ML, mainly owing to their fast convergence compared to first-order methods [1]. The expensive computation of an inverse Hessian (also known as the pre-conditioning matrix) in the Newton step has also been tackled by estimating the curvature from the change in gradients. Loosely speaking, these algorithms are known as quasi-Newton methods; for a comprehensive treatment, see Nocedal & Wright [2]. Various approximations to the pre-conditioning matrix have been proposed in the recent literature [3]-[6]. From a theoretical perspective, second-order optimization methods are not nearly as well understood as first-order methods, and filling this gap is an active research direction [7], [8]. Motivated by the task of training neural networks, and by the observation that invoking local curvature information of neural network objective functions can achieve much faster progress per iteration than standard first-order methods [9]-[11], several methods have been proposed. One method that has received significant attention is Kronecker-factored Approximate Curvature (K-FAC) [12]. Its main ingredient is a sophisticated approximation to the generalized Gauss-Newton matrix and the Fisher information matrix, which quantify the curvature of the underlying neural network objective function and can then be inverted efficiently. Inspired by the K-FAC approximation and the Tikhonov regularization of the Newton method, we introduce a novel two-parameter regularized Kronecker-factorized Newton update step.
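The efficiency of the K-FAC approximation rests on a standard Kronecker-product identity: if the curvature of a layer factorizes as A ⊗ G, where A is the input (activation) covariance and G is the output-gradient covariance, then the preconditioned gradient can be obtained from the small inverses of A and G alone, without ever forming the full matrix. The following is a minimal NumPy sketch of this identity on a toy layer; the shapes and the damping value 1e-3 are illustrative assumptions, not the specific choices of any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, b = 4, 3, 8  # toy layer: input/output width, batch size

a = rng.normal(size=(b, n_in))    # inputs (activations) to the layer
g = rng.normal(size=(b, n_out))   # back-propagated output gradients

# Kronecker factors of the Fisher / Gauss-Newton approximation (damped).
A = a.T @ a / b + 1e-3 * np.eye(n_in)    # input covariance
G = g.T @ g / b + 1e-3 * np.eye(n_out)   # output-gradient covariance

grad_W = g.T @ a / b                     # weight gradient, shape (n_out, n_in)

# Naive route: invert the full (n_in*n_out) x (n_in*n_out) Kronecker product.
full = np.linalg.solve(np.kron(A, G), grad_W.ravel(order="F"))

# K-FAC route: (A (x) G)^{-1} vec(grad) = vec(G^{-1} grad A^{-1}),
# which only requires the two small inverses.
fact = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

assert np.allclose(full, fact.ravel(order="F"))
```

The identity is what turns inverting an (n_in·n_out)-dimensional curvature matrix into two inverses of size n_in and n_out.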
The proposed scheme disentangles the classical Tikhonov regularization into two parameters and, in a specific limit, allows us to condition the gradient using selected second-order information with an asymptotically vanishing computational overhead. While this limit makes the presented method highly attractive from a computational-complexity perspective, we demonstrate that its empirical performance on high-dimensional machine learning problems remains comparable to existing state-of-the-art methods. The contributions of this paper can be summarized as follows:
(i) we propose a novel two-parameter regularized K-FAC-approximated Gauss-Newton update step;
(ii) we prove that, for an arbitrary pair of regularization parameters, the proposed update direction is always a direction of decreasing loss;
(iii) in the limit, as one regularization parameter grows, we obtain an efficient and effective conditioning of the gradient with an asymptotically vanishing overhead;
(iv) we empirically analyze the method and find that the efficient conditioning maintains the performance of its more expensive counterpart;
(v) we demonstrate the effectiveness of the method in small-batch stochastic regimes and observe performance competitive with both first-order and quasi-Newton methods.
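To illustrate why an input-based conditioner can have vanishing overhead when the batch size b is smaller than the number of neurons n, note that a damped input-covariance inverse can be rewritten with the Woodbury identity so that only a b x b matrix is ever inverted. The sketch below is a hedged illustration of this general algebraic fact, not the paper's exact update rule; the damping value, shapes, and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b, lam = 256, 16, 0.1   # n neurons, batch size b << n, damping lam

x = rng.normal(size=(b, n))          # batch of inputs to the layer
grad_W = rng.normal(size=(n, 64))    # a weight gradient for this layer

# Naive input-based conditioning: solve with an n x n matrix, O(n^3).
A = x.T @ x / b + lam * np.eye(n)
cond_naive = np.linalg.solve(A, grad_W)

# Woodbury identity:
#   (lam*I + x^T x / b)^{-1} = (1/lam) * (I - x^T (lam*b*I + x x^T)^{-1} x),
# which requires only a b x b solve -- cheap when b << n.
M = lam * b * np.eye(b) + x @ x.T    # b x b system
cond_fast = (grad_W - x.T @ np.linalg.solve(M, x @ grad_W)) / lam

assert np.allclose(cond_naive, cond_fast)
```

The dominant cost of the fast route is forming x @ x.T and the b x b solve, which is negligible next to the forward/backward pass itself when b is small relative to n; this is the sense in which the overhead vanishes asymptotically.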

