NEWTON LOSSES: EFFICIENTLY INCLUDING SECOND-ORDER INFORMATION INTO GRADIENT DESCENT

Abstract

We present Newton losses, a method for incorporating second-order information of loss functions by locally approximating them with quadratic functions. The method is applied only to the loss function and still trains the neural network itself with gradient descent. As loss functions are usually substantially cheaper to compute than the neural network, Newton losses can be used at a relatively small additional cost. We find that they yield superior performance, especially when applied to non-convex, hard-to-optimize loss functions such as algorithmic losses, which have been popularized in recent research.

1. INTRODUCTION

Neural network training has gained a tremendous amount of attention in machine learning in recent years. This is primarily due to the success of backpropagation and stochastic gradient descent for first-order optimization. However, there has also been a strong line of work on second-order optimization for neural network training; see [1] and references therein. While these second-order optimization methods (such as Newton's method and natural gradient descent) exhibit improved convergence rates and therefore require fewer training steps, they have two major limitations [2]: (i) computing the curvature (or its inverse) for a large and deep neural network is computationally substantially more expensive than simply computing the gradient with backpropagation, which makes second-order methods practically inapplicable in most cases; and (ii) networks trained with second-order information exhibit reduced generalization capabilities [3]. In this work, we propose a novel method for incorporating second-order information of the loss function into training, while training the actual neural network with gradient descent. As loss functions are usually substantially cheaper to evaluate than the neural network itself, the idea is to apply second-order optimization to the loss function while training the network with first-order optimization. For this, we decompose the original iterative optimization problem into a two-stage iterative optimization problem, which leads to Newton losses. This is especially interesting for intrinsically hard-to-optimize loss functions, i.e., those for which second-order optimization of the inputs to the loss is superior to first-order optimization.
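To make this two-stage idea concrete, the following is a minimal NumPy sketch (all names, the toy loss, and the finite-difference derivatives are our own illustrative choices, not the paper's experimental setup). Stage one applies a Newton step to a cheap, non-convex loss with respect to the network outputs; stage two moves the network toward the resulting targets with a plain first-order step. A simple positive clamp on the curvature stands in for proper regularization of the Hessian:

```python
import numpy as np

# Toy setup: a linear "network" f(x; theta) = x @ theta with scalar outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))          # batch of N = 8 inputs
theta = rng.normal(size=(3, 1))      # network weights
y_true = rng.normal(size=(8, 1))     # ground truth, baked into the loss

def loss(y):
    # A non-convex, elementwise loss on the outputs
    # (stands in for a hard-to-optimize algorithmic loss).
    return float(np.sum(1.0 - np.cos(y - y_true)))

def grad_hess(y, eps=1e-5):
    # Finite-difference gradient and (diagonal) Hessian of the loss w.r.t. y;
    # cheap, since only the loss (not the network) is evaluated repeatedly.
    g = np.zeros_like(y)
    h = np.zeros_like(y)
    for i in range(y.size):
        e = np.zeros_like(y)
        e.flat[i] = eps
        g.flat[i] = (loss(y + e) - loss(y - e)) / (2 * eps)
        h.flat[i] = (loss(y + e) - 2 * loss(y) + loss(y - e)) / eps**2
    return g, h

# Stage 1: Newton step on the loss inputs y.
y = x @ theta
g, h = grad_hess(y)
y_star = y - g / np.maximum(h, 0.1)   # clamp curvature to keep it positive

# Stage 2: first-order step on the network, i.e., gradient descent on the
# surrogate 0.5 * ||f(x; theta) - y_star||^2.
lr = 0.1
grad_theta = x.T @ (y - y_star) / len(x)
theta -= lr * grad_theta
```

The key point of the decomposition is visible in the code: the expensive curvature computation only ever involves the loss and the (small) output space, never the network weights.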
Such loss functions have recently become increasingly popular, as they allow for solving more specialized tasks such as inverse rendering [4]-[6], learning-to-rank [7]-[13], self-supervised learning [14], differentiation of optimizers [15], [16], and top-k supervision [9], [11], [17]. In this paper, we summarize these loss functions, which go beyond typical classification and regression, under the umbrella of algorithmic losses [18], as they introduce algorithmic knowledge into the training objective. We evaluate the proposed Newton losses for various algorithmic losses on two popular benchmarks: the four-digit MNIST sorting benchmark and the Warcraft shortest-path benchmark. We find that Newton losses improve performance in the case of hard-to-optimize losses and maintain the original performance in the case of easy-to-optimize losses.

Contributions. The contributions of this work are (i) introducing a mathematical framework for splitting iterative optimization methods into two-stage schemes, which we show to be equivalent to the original optimization methods; (ii) introducing Newton losses as combinations of first-order and second-order optimization methods; and (iii) an extensive empirical evaluation on two algorithmic supervision benchmarks using an array of algorithmic losses.

1.1. RELATED WORK

The related work comprises algorithmic supervision losses and second-order optimization methods. To the best of our knowledge, this is the first work combining second-order optimization of loss functions with first-order optimization of neural networks, especially for algorithmic losses.

Algorithmic Losses. Algorithmic losses, i.e., losses that contain some kind of algorithmic component, have become quite popular in recent machine learning research. In the domain of recommender systems, early learning-to-rank works already appeared in the 2000s [7], [8], [19], but more recently [20] propose differentiable ranking metrics, and [12] propose PiRank, which relies on differentiable sorting. For differentiable sorting, an array of methods has been proposed in recent years, including NeuralSort [21], SoftSort [22], Optimal Transport Sort [9], differentiable sorting networks (DSN) [11], and the relaxed Bubble Sort algorithm [18]. Other works explore differentiable sorting-based top-k for applications such as differentiable image patch selection [23], differentiable k-nearest-neighbors [17], [21], top-k attention for machine translation [17], and differentiable beam search methods [17], [24]. But algorithmic losses are not limited to sorting: other works have considered learning shortest paths [15], [16], [18], learning 3D shapes from images and silhouettes [4]-[6], [18], [25], [26], learning with combinatorial solvers for NP-hard problems [16], learning to classify handwritten characters based on editing distances between strings [18], learning with differentiable physics simulations [27], and learning protein structure with a differentiable simulator [28], among many others. The methods used to make algorithms differentiable can be broadly categorized into those that estimate gradients via sampling (e.g., [15]) and those that have analytical closed-form gradient estimates (e.g., [18]). In this work, we specifically focus on the tasks of ranking supervision and shortest-path supervision, and discuss the methods that we consider in detail in Section 4.

Second-Order Optimization. Second-order methods have recently gained popularity in machine learning due to their fast convergence properties compared to first-order methods [1]. One alternative to the vanilla Newton method are quasi-Newton methods, which, instead of computing an inverse Hessian in the Newton step (which is expensive), approximate the curvature from the change in gradients [2], [29], [30]. In addition, a number of new approximations to the pre-conditioning matrix have been proposed in the recent literature, i.a., [31]-[33]. While the vanilla Newton method relies on the Hessian, there are variants that use the empirical Fisher information matrix, which can coincide with the Hessian in specific cases but generally exhibits somewhat different behavior. For an overview and discussion of Fisher-based methods (sometimes referred to as natural gradient descent), see [34], [35].

Decomposition Methods. When working with complex and non-standard loss functions, the optimization problem for neural network training is often challenging. In these cases, a natural approach is to look for a decomposition, i.e., to break the optimization problem up into two (or more) subproblems, which are then solved sequentially. The idea of decomposing an optimization problem is not new; see [36] for an overview of decomposition methods. A decomposition method that has become particularly popular in machine learning is the alternating direction method of multipliers [37]. There are other decomposition methods known as operator splitting methods [38], which include methods of multipliers with Gauss-Seidel passes [37], coordinate descent-type methods with linear coupling constraints [2], and consensus-based optimization schemes [39], [40]. These methods typically derive optimization problems via Lagrangian duality that are then solved sequentially and to their respective (global) optimality, which requires loss functions that can be optimized at relatively low cost. Moreover, convergence guarantees can be obtained if the underlying optimization problem is convex and the coupling constraints are linear. In this work, we consider problems where neither of these properties holds.

2. A TWO-STAGE OPTIMIZATION METHOD

We consider the training of a neural network f (x; θ), where x ∈ R n is the vector of inputs, θ ∈ R d is the vector of weights, and y = f (x; θ) ∈ R m is the vector of outputs. We assume that we have access to a data set of N samples drawn from the input distribution, which describes the empirical input matrix x = [x 1 , . . . , x N ] ⊤ ∈ R N ×n . Using vectorized notation, we denote by y = f (x; θ) ∈ R N ×m the matrix of outputs of the neural network corresponding to the empirical inputs. Further, let ℓ : R N ×m → R denote the loss function, and let the ground truth output be implicitly encoded in ℓ (because for many algorithmic losses, it is not simply a label, but could, e.g., be ordinal information).
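The notation above can be checked with a small NumPy sketch (the dimensions, the stand-in network, and the target inside the loss are hypothetical choices of ours, used only to illustrate the shapes):

```python
import numpy as np

N, n, m = 5, 4, 3                  # batch size, input dim, output dim

rng = np.random.default_rng(1)
x = rng.normal(size=(N, n))        # empirical inputs, x in R^{N x n}
W = rng.normal(size=(n, m))        # theta: here a single linear layer, d = n*m

def f(x, W):
    # A stand-in network; y = f(x; theta) in R^{N x m}.
    return np.tanh(x @ W)

def ell(y):
    # The loss maps all N outputs jointly to a scalar; the ground truth
    # is implicitly encoded in ell itself (here: a fixed target matrix).
    target = np.ones_like(y)
    return float(np.mean((y - target) ** 2))

y = f(x, W)                        # y has shape (N, m)
value = ell(y)                     # a single scalar over the whole batch
```

Note that ℓ acts on the entire output matrix at once rather than sample-wise; this matters for algorithmic losses such as sorting losses, whose value depends on relations between samples.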





