NEWTON LOSSES: EFFICIENTLY INCLUDING SECOND-ORDER INFORMATION INTO GRADIENT DESCENT

Abstract

We present Newton losses, a method for incorporating the second-order information of loss functions by locally approximating them with quadratic functions. The method is applied only to the loss function, so the neural network itself can still be trained with gradient descent. As loss functions are usually substantially cheaper to compute than the neural network, Newton losses come at a relatively small additional cost. We find that they yield superior performance, especially when applied to non-convex and hard-to-optimize loss functions such as algorithmic losses, which have been popularized in recent research.

1. INTRODUCTION

Neural network training has gained tremendous attention in machine learning in recent years. This is primarily due to the success of backpropagation and stochastic gradient descent for first-order optimization. However, there has also been a strong line of work on second-order optimization for neural network training; see [1] and references therein. While these second-order optimization methods (such as Newton's method and natural gradient descent) exhibit improved convergence rates and therefore require fewer training steps, they have two major limitations [2]: (i) computing the curvature (or its inverse) of a large and deep neural network is computationally far more expensive than computing the gradient with backpropagation, which makes second-order methods practically inapplicable in most cases; and (ii) networks trained with second-order information exhibit reduced generalization capabilities [3].

In this work, we propose a novel method for incorporating second-order information of the loss function into training, while optimizing the actual neural network with gradient descent. As loss functions are usually substantially cheaper to evaluate than a neural network, the idea is to apply second-order optimization to the loss function and first-order optimization to the network. For this, we decompose the original iterative optimization problem into a two-stage iterative optimization problem, which leads to Newton losses. This is especially interesting for intrinsically hard-to-optimize loss functions, i.e., those for which second-order optimization of the inputs to the loss is superior to first-order optimization.
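As a rough illustration of this idea, one can replace the raw gradient of the loss with respect to the network outputs by a (regularized) Newton step on those outputs, while the network weights are still updated by ordinary backpropagation. The following NumPy sketch is only illustrative: the function names, the Tikhonov regularizer, and the toy ill-conditioned quadratic loss are our assumptions, not the exact construction developed later in the paper.

```python
import numpy as np

def newton_loss_grad(z, loss_grad, loss_hess, tikhonov=1e-3):
    """Replace the raw loss gradient by a Newton step on the loss inputs z.

    z          : current network outputs fed into the loss
    loss_grad  : function returning dL/dz
    loss_hess  : function returning d2L/dz2 (a dim x dim matrix)
    tikhonov   : regularizer ensuring the Hessian is invertible
    """
    g = loss_grad(z)
    H = loss_hess(z)
    # Newton step on the loss inputs: z_star = z - (H + eps*I)^{-1} g
    z_star = z - np.linalg.solve(H + tikhonov * np.eye(len(z)), g)
    # Gradient of an MSE-style surrogate pulling z toward z_star;
    # backpropagation through the network would use this instead of g.
    return z - z_star

# Toy ill-conditioned quadratic loss L(z) = 0.5 * z^T A z.
A = np.diag([100.0, 1.0])
loss_grad = lambda z: A @ z
loss_hess = lambda z: A

z = np.array([1.0, 1.0])
g_newton = newton_loss_grad(z, loss_grad, loss_hess)
# Unlike the raw gradient (100, 1), the Newton-adjusted gradient
# rescales both coordinates comparably.
```

For this quadratic loss the Newton step recovers the minimizer in one step, so the surrogate gradient points from z directly toward the optimum regardless of the conditioning of A, whereas the raw gradient is dominated by the stiff coordinate.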
Such loss functions have recently become increasingly popular, as they allow for solving more specialized tasks such as inverse rendering [4]-[6], learning-to-rank [7]-[13], self-supervised learning [14], differentiation of optimizers [15], [16], and top-k supervision [9], [11], [17]. In this paper, we summarize these loss functions, which go beyond typical classification and regression, under the umbrella of algorithmic losses [18], as they introduce algorithmic knowledge into the training objective.

We evaluate the proposed Newton losses for various algorithmic losses on two popular benchmarks: the four-digit MNIST sorting benchmark and the Warcraft shortest-path benchmark. We find that Newton losses improve performance in the case of hard-to-optimize losses and maintain the original performance in the case of easy-to-optimize losses.

Contributions. The contributions of this work are (i) introducing a mathematical framework for splitting iterative optimization methods into two-stage schemes, which we show to be equal to the original optimization methods; (ii) introducing Newton losses as combinations of first-order and second-order optimization methods; and (iii) an extensive empirical evaluation on two algorithmic supervision benchmarks using an array of algorithmic losses.

