NEWTON LOSSES: EFFICIENTLY INCLUDING SECOND-ORDER INFORMATION INTO GRADIENT DESCENT

Abstract

We present Newton losses, a method for incorporating second-order information of losses by approximating them with quadratic functions. The presented method is applied only to the loss function and allows training the neural network with gradient descent. As loss functions are usually substantially cheaper to compute than the neural network, Newton losses can be used at a relatively small additional cost. We find that they yield superior performance, especially when applied to non-convex and hard-to-optimize loss functions such as algorithmic losses, which have been popularized in recent research.

1. INTRODUCTION

Neural network training has gained a tremendous amount of attention in machine learning in recent years. This is primarily due to the success of backpropagation and stochastic gradient descent for first-order optimization. However, there has also been a strong line of work on second-order optimization for neural network training; see [1] and references therein. While second-order optimization methods (such as Newton's method and natural gradient descent) exhibit improved convergence rates and therefore require fewer training steps, they have two major limitations [2]: (i) computing the curvature (or its inverse) for a large and deep neural network is computationally substantially more expensive than computing the gradient with backpropagation, which renders second-order methods practically inapplicable in most cases; and (ii) networks trained with second-order information tend to exhibit reduced generalization capabilities [3]. In this work, we propose a novel method for incorporating second-order information of the loss function into training while optimizing the actual neural network with gradient descent. As loss functions are usually substantially cheaper to evaluate than a neural network, the idea is to apply second-order optimization to the loss function while training the network with first-order optimization. For this, we decompose the original iterative optimization problem into a two-stage iterative optimization problem, which leads to Newton losses. This is especially interesting for intrinsically hard-to-optimize loss functions, i.e., those for which second-order optimization of the inputs to the loss is superior to first-order optimization. Such loss functions have recently become increasingly popular, as they allow for solving more specialized tasks such as inverse rendering [4]-[6], learning-to-rank [7]-[13], self-supervised learning [14], differentiation of optimizers [15], [16], and top-k supervision [9], [11], [17].
In this paper, we summarize these loss functions, which go beyond typical classification and regression, under the umbrella of algorithmic losses [18], as they introduce algorithmic knowledge into the training objective. We evaluate the proposed Newton losses for various algorithmic losses on two popular benchmarks: the four-digit MNIST sorting benchmark and the Warcraft shortest-path benchmark. We find that Newton losses improve performance in the case of hard-to-optimize losses and maintain the original performance in the case of easy-to-optimize losses.

Contributions. The contributions of this work are (i) introducing a mathematical framework for splitting iterative optimization methods into two-stage schemes, which we show to be equal to the original optimization methods; (ii) introducing Newton losses as combinations of first-order and second-order optimization methods; and (iii) an extensive empirical evaluation on two algorithmic supervision benchmarks using an array of algorithmic losses.

1.1. RELATED WORK

The related work comprises algorithmic supervision losses and second-order optimization methods. To the best of our knowledge, this is the first work combining second-order optimization of loss functions with first-order optimization of neural networks, especially for algorithmic losses.

Algorithmic Losses. Algorithmic losses, i.e., losses that contain some kind of algorithmic component, have become quite popular in recent machine learning research. In the domain of recommender systems, early learning-to-rank works appeared already in the 2000s [7], [8], [19]; more recently, [20] propose differentiable ranking metrics, and [12] propose PiRank, which relies on differentiable sorting. For differentiable sorting, an array of methods has been proposed in recent years, including NeuralSort [21], SoftSort [22], Optimal Transport Sort [9], differentiable sorting networks (DSN) [11], and the relaxed Bubble Sort algorithm [18]. Other works explore differentiable sorting-based top-k for applications such as differentiable image patch selection [23], differentiable k-nearest-neighbor [17], [21], top-k attention for machine translation [17], and differentiable beam search methods [17], [24]. But algorithmic losses are not limited to sorting: other works have considered learning shortest paths [15], [16], [18], learning 3D shapes from images and silhouettes [4]-[6], [18], [25], [26], learning with combinatorial solvers for NP-hard problems [16], learning to classify handwritten characters based on editing distances between strings [18], learning with differentiable physics simulations [27], and learning protein structure with a differentiable simulator [28], among many others. The methods used to make algorithms differentiable can be broadly categorized into those that estimate gradients via sampling (e.g., [15]) and those that have analytical closed-form gradient estimates (e.g., [18]).
In this work, we specifically focus on the tasks of ranking supervision and shortest-path supervision and discuss the methods we consider in detail in Section 4.

Second-Order Optimization. Second-order methods have recently gained popularity in machine learning due to their fast convergence properties compared to first-order methods [1]. One alternative to the vanilla Newton method are quasi-Newton methods, which, instead of computing an inverse Hessian in the Newton step (which is expensive), approximate the curvature from the change in gradients [2], [29], [30]. In addition, a number of new approximations to the pre-conditioning matrix have been proposed in the recent literature, inter alia [31]-[33]. While the vanilla Newton method relies on the Hessian, there are variants that use the empirical Fisher information matrix, which can coincide with the Hessian in specific cases but generally exhibits somewhat different behavior. For an overview and discussion of Fisher-based methods (sometimes referred to as natural gradient descent), see [34], [35].

Decomposition Methods. When working with complex and non-standard loss functions, the optimization problem for neural network training is often challenging. In these cases, a natural approach is to look for a decomposition, i.e., to break the optimization problem up into two (or more) subproblems, which are then solved sequentially. The idea of decomposing an optimization problem is not new; see [36] for an overview of decomposition methods. A decomposition method that has become particularly popular in machine learning is the alternating direction method of multipliers [37]. Other decomposition methods are known as operator splitting methods [38], which include the methods of multipliers with Gauss-Seidel passes [37], coordinate descent-type methods with linear coupling constraints [2], and consensus-based optimization schemes [39], [40].
These methods typically derive, via Lagrangian duality, optimization problems that are then solved sequentially and to their respective (global) optimality, which requires loss functions that can be optimized at relatively low cost. Moreover, convergence guarantees can be obtained if the underlying optimization problem is convex and the coupling constraints are linear. In this work, we consider problems where neither of these properties holds.

2. A TWO-STAGE OPTIMIZATION METHOD

We consider the training of a neural network f(x; θ), where x ∈ R^n is the vector of inputs, θ ∈ R^d is the vector of weights, and y = f(x; θ) ∈ R^m is the vector of outputs. We assume access to a data set of N samples drawn from the input distribution, collected into the empirical input x = [x_1, ..., x_N]^⊤ ∈ R^{N×n}. By vectorization, we denote by y = f(x; θ) ∈ R^{N×m} the matrix of network outputs corresponding to the empirical inputs. Further, let ℓ : R^{N×m} → R denote the loss function, where the ground truth is implicitly encoded in ℓ (for many algorithmic losses, it is not simply a label but, e.g., ordinal information). In this general setting, the training of a neural network can be expressed as the optimization problem

arg min_{θ∈Θ} ℓ(f(x; θ)),    (1)

where Θ ⊆ R^d is the domain of the parameters θ, and f and ℓ are such that the minimum in (1) exists. Note that formulation (1) is extremely general and includes, e.g., optimization of non-decomposable loss functions (i.e., losses not composed of individual terms per training sample), which is relevant for some algorithmic losses like ranking losses. Typically, the optimization problem (1) is solved with an iterative algorithm like gradient descent (or Newton's method), updating the weights θ by repeatedly applying the following step:

θ ← one optimization step of ℓ(f(x; θ)) wrt. θ.    (2)

In this work, however, we consider decomposing the optimization problem (1) into two problems, which may be solved by applying the following two updates in an alternating fashion:

z⋆ ← one optimization step of ℓ(z) wrt. z, initialized at z = f(x; θ),    (3a)
θ ← one optimization step of 1/2 ∥z⋆ − f(x; θ)∥₂² wrt. θ.    (3b)

This split later allows us to use two different iterative optimization algorithms for (3a) and (3b), respectively.
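The alternating scheme (3a)+(3b) can be sketched in a few lines. The following is a minimal NumPy sketch under illustrative assumptions (a toy linear "network" f(x; θ) = xθ, a quadratic loss, and the chosen step sizes are ours, not from the paper). With a unit step on z, one pass of (3a)+(3b) reduces to a plain gradient step on the composite loss, mirroring Theorem 2 below:

```python
import numpy as np

def loss(z, target):
    # toy quadratic loss; the ground truth is encoded inside the loss
    return 0.5 * np.sum((z - target) ** 2)

def loss_grad(z, target):
    return z - target

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))           # empirical inputs
theta = np.zeros((4, 1))               # network weights
target = x @ rng.normal(size=(4, 1))   # implicit ground truth

eta = 0.01
for _ in range(500):
    y = x @ theta
    # (3a): one unit-step gradient step on ell(z) wrt. z, starting at z = f(x; theta)
    z_star = y - loss_grad(y, target)
    # (3b): one gradient step on 0.5 * ||z_star - f(x; theta)||^2 wrt. theta
    grad_theta = x.T @ (y - z_star)
    theta -= eta * grad_theta

assert loss(x @ theta, target) < 1e-6  # the alternating scheme converges
```

Note that here grad_theta = x.T @ loss_grad(y, target), i.e., exactly the gradient of the composite objective, illustrating the equivalence of (2) and (3a)+(3b) for gradient descent.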
This is especially interesting for optimization problems where the loss function ℓ is non-convex and its minimization is a difficult optimization problem in itself, i.e., problems where a stronger optimization method exhibits a superior rate of convergence. We can express individual optimization steps (corresponding to (2) and (3a)) via

θ ← arg min_{θ′∈Θ} ℓ(f(x; θ′)) + Ω(θ′, θ)  and  z⋆ ← arg min_{z∈Y} ℓ(z) + Ω(z, f(x; θ)),    (4)

where Ω is a regularizer such that one step of the respective optimization method corresponds to the global optimum of the regularized optimization problems in (4). The regularizer Ω has the standard property that Ω(a, b) = 0 for any a = b. Note that the explicit form of the regularizer Ω does not need to be known; nevertheless, in Supplementary Material C, we discuss explicit choices of Ω. This allows us to express the set of points of convergence for the iterative optimization methods. Recall that an iterative optimization method has converged if it has arrived at a fixed point, i.e., the parameters do not change when applying an update. The set of points of convergence for (2) is

A = { θ | θ ∈ arg min_{θ′} ℓ(f(x; θ′)) + Ω(θ′, θ) },    (5)

i.e., those points at which the update does not change θ. For the two-stage optimization method (3), the set of points of convergence is

B = { θ | f(x; θ) = z⋆ ∈ arg min_z ℓ(z) + Ω(z, f(x; θ)) },    (6)

as the method has converged if the update (3a) yields z⋆ = z, because the subsequent update (3b) then leaves θ unchanged: z⋆ = z = f(x; θ) already holds, and thus 1/2 ∥z⋆ − f(x; θ)∥₂² = 0. We now show that the iterative method (2) and the alternating method (3) lead to the same sets of convergence points.

Lemma 1 (Equality of the Sets of Convergence Points). The set A of points of convergence obtained by the iterative optimization method (2) is equal to the set B of points of convergence obtained by the two-stage iterative optimization method (3).

Proof.
(A ⊂ B) First, we show that any point in A also lies in B. By definition, for each point in A, the optimization step (2) does not change θ, i.e., θ′ = θ. Thus, f(x; θ) = f(x; θ′) ∈ arg min_z ℓ(z) + Ω(z, f(x; θ)), and therefore θ ∈ B.
(B ⊂ A) Second, we show that any point in B also lies in A. For each θ ∈ B we know, by definition, that f(x; θ) = z⋆ ∈ arg min_z ℓ(z) + Ω(z, f(x; θ)); therefore θ ∈ arg min_{θ′} ℓ(f(x; θ′)) + Ω(f(x; θ′), f(x; θ)), where Ω(f(x; θ), f(x; θ)) = 0 = Ω(θ, θ), and therefore θ ∈ A. □

While Lemma 1 states the equivalence of the original training (2) and its counterpart (3) wrt. their possible points of convergence (i.e., solutions) for an arbitrary choice of iterative method, the two approaches are also equal when applying standard first-order or second-order optimization schemes. In other words, running a gradient descent step according to (2) coincides with two gradient steps of the alternating scheme, namely one step for (3a) and one step for (3b).

Theorem 2 (Gradient Descent Step Equality between (2) and (3a)+(3b)). A gradient descent step according to (2) with arbitrary step size η coincides with two gradient descent steps, one according to (3a) and one according to (3b), where the optimization over θ has step size η and the optimization over z has unit step size. Proof deferred to SM A.

Theorem 3 (Newton Step Equality between (2) and (3a)+(3b) for m = 1). In the case of m = 1, a Newton step according to (2) with arbitrary step size η coincides with two Newton steps, one according to (3a) and one according to (3b), where the optimization over θ has step size η and the optimization over z has unit step size. Proof deferred to SM A.

3. NEWTON LOSSES

In this section, we focus on the two-stage optimization method (3) and propose optimizing (3a) with Newton's method while optimizing (3b) with stochastic gradient descent. Let us begin by considering the quadratic approximation of the loss function at the location y = f(x; θ), i.e.,

ℓ̂_y(z) = ℓ(y) + (z − y)⊤ ∇_y ℓ(y) + 1/2 (z − y)⊤ ∇²_y ℓ(y) (z − y).    (7)

To find the location z⋆ of the minimum of ℓ̂_y(z), we set its derivative to 0:

∇_z ℓ̂_y(z⋆) = 0  ⇔  ∇_y ℓ(y) + ∇²_y ℓ(y)(z⋆ − y) = 0    (8)
⇔  ∇_y ℓ(y) = −∇²_y ℓ(y)(z⋆ − y)  ⇔  −(∇²_y ℓ(y))⁻¹ ∇_y ℓ(y) = z⋆ − y.

Thus, the minimum of ℓ̂_y(z) is

z⋆ = arg min_z ℓ̂_y(z) = y − (∇²_y ℓ(y))⁻¹ ∇_y ℓ(y).    (9)

When ℓ is quadratic, it can be readily seen that z⋆ is independent of the choice of y. For non-quadratic functions, we heuristically assume independence, as z⋆ is the projected optimum / goal of the function ℓ. In implementations, this independence can be achieved using .detach() or .stop_gradient(). Using z⋆, we can derive the Newton loss ℓ⋆ as

ℓ⋆_{z⋆}(y) = 1/2 (z⋆ − y)⊤ (z⋆ − y) = 1/2 ∥z⋆ − y∥₂²,  where z⋆ = arg min_z ℓ̂_y(z),    (10)

and its derivative as ∇_y ℓ⋆_{z⋆}(y) = y − z⋆. With this construction, we obtain the Newton loss ℓ⋆_{z⋆}, a new convex loss whose gradient corresponds to the Newton step of the original loss. Note that (10) is an instance of (3) for a regularization term describing the quadratic approximation error, i.e., Ω(z, f(x; θ)) = ℓ̂_{f(x;θ)}(z) − ℓ(z). As Ω is already implicitly part of the Newton step, we do not need to evaluate it. In general, ℓ⋆_{z⋆} exhibits more desirable behavior than ℓ, as a single gradient descent step can solve any quadratic problem, and it inherits the convergence properties of the Newton method when optimizing y. In the case of non-convex ℓ, as is common for many algorithmic losses, incorporating second-order information also substantially improves empirical performance.
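The construction of z⋆ and the Newton loss (10) can be sketched for a loss with a known closed-form gradient and Hessian. The following is a hedged NumPy sketch; the quadratic loss ℓ(y) = 1/2 y⊤Ay − b⊤y and the values of A and b are illustrative choices, not from the paper:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian of ell (positive definite)
b = np.array([1.0, 4.0])

def grad(y):
    return A @ y - b                       # gradient of ell at y

y = np.array([5.0, -3.0])                  # current network output f(x; theta)
z_star = y - np.linalg.solve(A, grad(y))   # Newton step on the loss; treated as a
                                           # constant (cf. .detach())

# For a quadratic loss, z_star is the global minimizer, independent of y:
assert np.allclose(z_star, np.linalg.solve(A, b))

newton_loss = 0.5 * np.sum((z_star - y) ** 2)
# Its gradient y - z_star is exactly the Newton direction H^{-1} grad(y):
assert np.allclose(y - z_star, np.linalg.solve(A, grad(y)))
```

Gradient descent on this Newton loss therefore moves y along the Newton direction of the original loss, as stated above.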
Note that, in non-convex or ill-conditioned settings, Tikhonov regularization [41] stabilizes ℓ⋆_{z⋆}. In the following, we define the Newton losses and use x and z⋆ to denote samples / rows of x and z⋆.

Definition 1 (Element-wise Hessian-based Newton losses). For a loss function ℓ and a given current parameter vector θ, we define the element-wise Hessian-based Newton loss as ℓ⋆_{z⋆}(y) = 1/2 ∥z⋆_E − y∥₂², where z⋆_E = ȳ − (∇²_ȳ ℓ(ȳ))⁻¹ ∇_ȳ ℓ(ȳ) and ȳ = f(x; θ).

However, instead of the element-wise Hessian-based Newton loss, it is typically more stable to use the empirical Hessian-based Newton loss.

Definition 2 (Empirical Hessian-based Newton losses). For a loss function ℓ and a given current parameter vector θ, we define the empirical Hessian-based Newton loss as ℓ⋆_{z⋆}(y) = 1/2 ∥z⋆_H − y∥₂², where z⋆_H = ȳ − (E_ȳ[∇²_ȳ ℓ(ȳ)])⁻¹ ∇_ȳ ℓ(ȳ) and ȳ = f(x; θ).

Instead of the Hessian, it is also possible to use the Fisher information matrix as second-order information. While this coincides with the Hessian in certain cases, in most cases it yields different results. The Fisher-based Newton loss can be seen as using natural gradient descent for optimizing the loss, while using regular gradient descent for optimizing the neural network.

Definition 3 (Fisher-based Newton losses). For a loss function ℓ and a given current parameter vector θ, we define the Fisher-based Newton loss as ℓ⋆_{z⋆}(y) = 1/2 ∥z⋆_F − y∥₂², where z⋆_F = ȳ − (E_ȳ[∇_ȳ ℓ(ȳ) ∇_ȳ ℓ(ȳ)⊤])⁻¹ ∇_ȳ ℓ(ȳ) and ȳ = f(x; θ).

Remark 1 (Computational Considerations). The Hessian of the loss function ∇²_y ℓ(y) may be approximated using the empirical Fisher matrix F = E_y[∇_y ℓ(y) ∇_y ℓ(y)⊤]. However, as only the Hessian of the loss function (and not the Hessian of the neural network) needs to be computed, computing the exact Hessian ∇²_y ℓ(y) is usually also fast.

Remark 2 (Derivative of the Newton Loss).
The derivative of the Newton loss is

∂/∂y (1/2 ∥z⋆ − y∥₂²) = y − z⋆.    (11)

Note that the derivative of z⋆ wrt. y is zero because z⋆ is the projected optimum of the original loss.
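The batch-averaged curvature targets of Definitions 2 and 3 (with the Tikhonov regularization mentioned above) can be sketched as follows. This is an illustrative NumPy sketch: the per-sample loss, the batch of outputs Y, and the regularization weight lam are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(64, 3))      # batch of network outputs (rows y_bar)

def grad(y):
    # per-sample gradient of an illustrative loss ell(y) = sum(log(cosh(y)))
    return np.tanh(y)

def hess(y):
    # per-sample Hessian of the same loss (diagonal here)
    return np.diag(1.0 - np.tanh(y) ** 2)

lam = 1e-3                        # Tikhonov regularization for stability
G = np.stack([grad(y) for y in Y])

# Empirical Hessian-based target (Def. 2): curvature averaged over the batch.
H_bar = np.mean(np.stack([hess(y) for y in Y]), axis=0)
Z_H = Y - G @ np.linalg.inv(H_bar + lam * np.eye(3))

# Fisher-based target (Def. 3): average outer product of per-sample gradients.
F_bar = np.mean(np.stack([np.outer(g, g) for g in G]), axis=0)
Z_F = Y - G @ np.linalg.inv(F_bar + lam * np.eye(3))

newton_loss_H = 0.5 * np.mean(np.sum((Z_H - Y) ** 2, axis=1))
```

Both variants share the averaged inverse curvature across the batch while keeping per-sample gradients, matching the definitions above (the inverse matrices are symmetric, so applying them row-wise via a right-multiplication is valid).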

3.1. EXAMPLES

We have seen in (10) how a given loss function ℓ induces a corresponding Newton loss ℓ⋆. For specific loss functions, the Newton loss can be computed explicitly. We begin with the trivial example of the MSE loss. For notational simplicity, we often drop the subscript z⋆ in the definition (10) of the Newton loss.

Example 1 (MSE loss). Consider the classical MSE loss, i.e., ℓ(y) = 1/2 ∥y − y⋆∥₂², where y⋆ denotes the ground truth. Then z⋆ = y⋆, and accordingly the Newton loss is given as ℓ⋆_{z⋆}(y) = 1/2 ∥z⋆ − y∥₂² = 1/2 ∥y⋆ − y∥₂² = ℓ(y). Therefore, the MSE loss ℓ and its induced Newton loss ℓ⋆ are equivalent.

A popular loss function for classification is the softmax cross-entropy (SMCE) loss, defined as

ℓ_SMCE(y) = − Σ_{i=1}^{k} p_i log q_i,  where  q_i = exp(y_i) / Σ_{j=1}^{k} exp(y_j).    (12)

Example 2 (Softmax cross-entropy loss). For the SMCE loss, the induced Newton loss is given as ℓ⋆_SMCE(y) = 1/2 ∥z⋆ − y∥₂², where the element-wise Hessian variant is z⋆_E = −(diag(q) − qq⊤)⁻¹ (q − p) + y, the empirical Hessian variant is z⋆_H = −(E_q[diag(q) − qq⊤])⁻¹ (q − p) + y, and the empirical Fisher variant is z⋆_F = −(E_q[(q − p)(q − p)⊤])⁻¹ (q − p) + y. In the experiments, we include a classification experiment with the SMCE loss.

Additional examples of Newton losses can be found in the supplementary material.
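The element-wise Hessian variant of Example 2 can be sketched directly. Note that the SMCE Hessian diag(q) − qq⊤ is singular (the logits are shift-invariant), so the sketch below adds a small Tikhonov term λI, as suggested above; the logits, target, and λ are illustrative:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))      # numerically stable softmax
    return e / e.sum()

y = np.array([2.0, 0.5, -1.0])     # logits (network output)
p = np.array([0.0, 1.0, 0.0])      # one-hot target distribution
lam = 1e-4                         # Tikhonov regularization

q = softmax(y)
g = q - p                          # gradient of SMCE wrt. logits
H = np.diag(q) - np.outer(q, q)    # Hessian of SMCE wrt. logits (singular)

z_star = y - np.linalg.solve(H + lam * np.eye(3), g)
newton_loss = 0.5 * np.sum((z_star - y) ** 2)

# The Newton-loss gradient y - z_star is the regularized Newton direction:
assert np.allclose(y - z_star, np.linalg.solve(H + lam * np.eye(3), g))
```

Since g = q − p is orthogonal to the all-ones null direction of H, the regularized solve is well behaved for small λ.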

4. ALGORITHMIC SUPERVISION LOSSES

In this section, we discuss how to derive the Newton losses for various types of algorithmic losses. Specifically, we consider SoftSort, DiffSort, AlgoVision, one-step Blackbox Differentiation, and stochastic smoothing. While all of these algorithmic losses can be used directly with Fisher-based Newton losses, the Hessian-based Newton losses require an estimation of the Hessian. We consider the task of algorithmic supervision, i.e., problems where an algorithm is applied to the predictions of a model and only the outputs of the algorithm are supervised. Specifically, we focus on the tasks of ranking supervision and shortest-path supervision. As this requires backpropagating through conventionally non-differentiable algorithms, the respective approaches make the ranking or shortest-path algorithms differentiable such that they can be used as part of the loss function.

4.1. SOFTSORT AND NEURALSORT

SoftSort [22] and NeuralSort [21] are prominent yet simple examples of differentiable sorting algorithms. In the case of ranking supervision, they take a vector of scalars and return a row-stochastic matrix P, called the differentiable permutation matrix, which is a relaxation of the argsort operator. Note that, in this case, a set of k inputs yields one scalar per image and thereby y ∈ R^k. As ground truth, a permutation matrix Q is given, and the loss between P and Q is the binary cross-entropy loss ℓ_SS(y) = BCE(P(y), Q). Minimizing this loss enforces the order of the predictions y to correspond to the true order, which is the training objective. SoftSort is defined as

P(y) = softmax( −| y⊤ ⊖ sort(y) | / τ ) = softmax( −| y⊤ ⊖ Sy | / τ ),

where τ is a temperature parameter, "sort" sorts the entries of a vector in non-ascending order (S being the corresponding hard sorting permutation), ⊖ is the element-wise broadcasting subtraction, |·| is the element-wise absolute value, and "softmax" is the row-wise softmax operator, as also used in (12) (right). NeuralSort is defined similarly and omitted for brevity. In the limit of τ → 0, SoftSort and NeuralSort converge to the exact ranking permutation matrix [21], [22]. A respective Newton loss can be implemented using automatic differentiation according to Definition 2, or via the Fisher information matrix using Definition 3.
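The SoftSort operator itself is only a few lines. A minimal NumPy sketch (the temperature and test vector are illustrative):

```python
import numpy as np

def softsort(y, tau=1.0):
    """Row-stochastic relaxation P(y) = softmax(-|sort(y) (broadcast-minus) y| / tau)."""
    y_sorted = np.sort(y)[::-1]                          # non-ascending order
    logits = -np.abs(y_sorted[:, None] - y[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)              # row-wise softmax

y = np.array([0.3, 1.7, -0.2, 0.9])
P = softsort(y, tau=0.1)

assert np.allclose(P.sum(axis=1), 1.0)                   # rows are distributions
# For small tau, row i concentrates on the index of the i-th largest entry:
assert (P.argmax(axis=1) == np.argsort(-y)).all()
```

Row i of P is a soft indicator of which input belongs at rank i, so P approaches the hard argsort permutation matrix as τ → 0.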

4.2. DIFFSORT

Differentiable sorting networks (DSN) [11], [13] offer a strong alternative to SoftSort and NeuralSort. They are based on sorting networks, a classic family of sorting algorithms that operate by conditionally swapping elements [42]. As the locations of the conditional swaps are pre-defined, they are suitable for hardware implementations, which also makes them especially suited for continuous relaxation. By perturbing a conditional swap with a distribution and solving for the expectation under this perturbation in closed form, we can differentiably sort a set of values and obtain a differentiable doubly-stochastic permutation matrix P, which can be used via the BCE loss as in Section 4.1. We can obtain the respective Newton loss either via the Hessian computed with automatic differentiation or via the Fisher information matrix.
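The building block of such networks, a relaxed conditional swap, can be illustrated with a simple sigmoid-blended variant. This is our simplified rendering of the idea under a logistic perturbation, not the paper's exact operator:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def soft_swap(a, b, tau=0.1):
    """Softly order (a, b): the first output tends toward the smaller value."""
    p = sigmoid((b - a) / tau)     # soft probability that (a, b) is already ordered
    lo = p * a + (1 - p) * b       # convex blend instead of a hard min
    hi = p * b + (1 - p) * a       # convex blend instead of a hard max
    return lo, hi

# As tau -> 0, the soft swap recovers the hard compare-and-swap:
lo, hi = soft_swap(2.0, -1.0, tau=0.01)
assert abs(lo - (-1.0)) < 1e-6 and abs(hi - 2.0) < 1e-6
```

Composing such swaps along a fixed sorting-network wiring yields a fully differentiable sorting operation.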

4.3. ALGOVISION

AlgoVision [18] is a framework for continuously relaxing arbitrary simple algorithms by perturbing all accessed variables with logistic distributions. The method approximates the expected value of the output of the algorithm in closed form and does not require sampling. For shortest-path supervision, we use a relaxation of the Bellman-Ford algorithm [43], [44] and compare the predicted shortest path with the ground-truth shortest path via an MSE loss. The input to the shortest-path algorithm is a cost embedding matrix predicted by a neural network.

4.4. STOCHASTIC SMOOTHING

Another differentiation method is stochastic smoothing [45]. This method regularizes a non-differentiable and discontinuous loss function ℓ(y) by perturbing its input with random noise ϵ (i.e., ℓ(y + ϵ)). The loss function is then approximated as ℓ(y) ≈ ℓ_ϵ(y) = E_ϵ[ℓ(y + ϵ)]. While ℓ is not differentiable, its smoothed stochastic counterpart ℓ_ϵ is differentiable, and the corresponding gradient and Hessian can be estimated via the following result.

Lemma 4 (Exponential Family Smoothing, adapted from Lemma 1.5 in Abernethy et al. [45]). Given a distribution over R^m with a probability density function μ of the form μ(ϵ) = exp(−ν(ϵ)) for some twice-differentiable ν, then

∇_y ℓ_ϵ(y) = ∇_y E_ϵ[ℓ(y + ϵ)] = E_ϵ[ ℓ(y + ϵ) ∇_ϵ ν(ϵ) ],    (15)
∇²_y ℓ_ϵ(y) = ∇²_y E_ϵ[ℓ(y + ϵ)] = E_ϵ[ ℓ(y + ϵ) ( ∇_ϵ ν(ϵ) ∇_ϵ ν(ϵ)⊤ − ∇²_ϵ ν(ϵ) ) ].    (16)

A variance-reduced form of (15) and (16) is

∇_y E_ϵ[ℓ(y + ϵ)] = E_ϵ[ (ℓ(y + ϵ) − ℓ(y)) ∇_ϵ ν(ϵ) ],
∇²_y E_ϵ[ℓ(y + ϵ)] = E_ϵ[ (ℓ(y + ϵ) − ℓ(y)) ( ∇_ϵ ν(ϵ) ∇_ϵ ν(ϵ)⊤ − ∇²_ϵ ν(ϵ) ) ].

In this work, we use this result to estimate the gradient of the shortest-path algorithm. By including the second derivative, we extend the perturbed optimizer losses to Newton losses. This also lends itself to full second-order optimization.
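The variance-reduced estimators of Lemma 4 can be checked by Monte Carlo for Gaussian noise, where ν(ϵ) = ∥ϵ∥²/2 and hence ∇_ϵ ν(ϵ) = ϵ and ∇²_ϵ ν(ϵ) = I. The quadratic test loss below is an illustrative choice for which the smoothed gradient and Hessian are known exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1.0, -2.0])

def loss(z):
    return 0.5 * np.sum(z ** 2, axis=-1)   # illustrative quadratic loss

eps = rng.normal(size=(100_000, 2))        # Gaussian noise: grad nu(eps) = eps
diff = loss(y + eps) - loss(y)             # ell(y + eps) - ell(y)

# Variance-reduced gradient estimate: E[(ell(y+eps) - ell(y)) * eps]
grad_est = np.mean(diff[:, None] * eps, axis=0)

# Variance-reduced Hessian estimate: E[(ell(y+eps) - ell(y)) * (eps eps^T - I)]
outer = eps[:, :, None] * eps[:, None, :] - np.eye(2)
hess_est = np.mean(diff[:, None, None] * outer, axis=0)

# For this quadratic loss, the smoothed gradient is y and the Hessian is I:
assert np.allclose(grad_est, y, atol=0.1)
assert np.allclose(hess_est, np.eye(2), atol=0.2)
```

Note that the estimators only require evaluations of ℓ, never its derivatives, which is what makes them applicable to discontinuous algorithmic losses.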

4.5. PERTURBED OPTIMIZERS WITH FENCHEL-YOUNG LOSSES

Blondel et al. [46] build on stochastic smoothing and Fenchel-Young losses [47] to propose perturbed optimizers with Fenchel-Young losses. For this, they use algorithms, like Dijkstra, that solve optimization problems of the type max_{w∈C} ⟨y, w⟩, where C denotes the feasible set, e.g., the set of valid paths. Blondel et al. [46] identify the argmax as the differential of the max, which allows a simplification of stochastic smoothing. By identifying similarities to Fenchel-Young losses, they find that the gradient of their loss is

∇_y ℓ(y) = E_ϵ[ arg max_{w∈C} ⟨y + ϵ, w⟩ ] − w⋆,

where w⋆ is the ground-truth solution of the optimization problem (e.g., the shortest path). This formulation allows optimizing the model without computing the actual value of the loss function. Blondel et al. [46] find that the number of samples, surprisingly, has only a small impact on performance, such that 3 samples were sufficient in many experiments, and in some cases even a single sample. In this work, we confirm this behavior and also compare it to plain stochastic smoothing. We find that for perturbed optimizers the number of samples barely impacts performance, while for stochastic smoothing more samples consistently improve performance. If only a few samples can be afforded (10 or fewer), perturbed optimizers are preferable as they are more sample efficient; when more samples are available, stochastic smoothing is superior as it utilizes them better.
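The Fenchel-Young gradient above can be sketched with a Monte Carlo perturbed argmax. The feasible set below (the standard basis, i.e., a toy one-hot polytope) and all values are illustrative stand-ins for the path polytope used with Dijkstra:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.eye(4)                        # candidate solutions w (rows of C)
y = np.array([0.1, 2.0, 0.3, -0.5])  # predicted scores
w_star = C[1]                        # ground-truth solution

def perturbed_argmax(y, n_samples=1000, sigma=1.0):
    """Monte Carlo estimate of E_eps[argmax_{w in C} <y + eps, w>]."""
    eps = sigma * rng.normal(size=(n_samples, y.size))
    idx = np.argmax((y + eps) @ C.T, axis=1)  # hard argmax per sample
    return C[idx].mean(axis=0)                # average of one-hot solutions

grad = perturbed_argmax(y) - w_star  # Fenchel-Young gradient estimate
```

Each sample only requires one call to the (hard, non-differentiable) solver, and the gradient pushes the expected solution toward w⋆; the loss value itself is never computed.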

5. EXPERIMENTS

For the experiments, we evaluate Newton losses on two applications of algorithmic supervision, i.e., problems where an algorithm is applied to the predictions of a model and the outputs of the algorithm are supervised. The first task is ranking supervision, where only the relative order of a set of samples is known, while their absolute values remain unsupervised. The second task is shortest-path supervision, where only the shortest path is supervised, while the underlying cost matrix remains unsupervised. Finally, as an ablation study, we also apply Newton losses to the trivial case of classification, where we do not expect a performance improvement, as the loss is not hard to optimize, but rather validate that the method still works.

5.1. RANKING SUPERVISION

In this section, we explore the ranking supervision setting [13], [21] with an array of differentiable sorting-based losses. For this, we choose the recently popularized four-digit MNIST sorting benchmark [9], [11], [13], [18], [21], [22]. In this setting, sets of n four-digit MNIST images are given, and the supervision is the relative order of these images corresponding to their displayed values, while the absolute values remain unsupervised. The goal is to learn a CNN that maps each image to a scalar value in an order-preserving fashion. Since these losses are harder to optimize, we can achieve substantial improvements over the baselines. We train the CNN using the Adam optimizer [48] at a learning rate of 10⁻³ for 100 000 steps with a batch size of 100. We explore NeuralSort, SoftSort, and differentiable sorting networks (DSNs) with logistic and Cauchy distributions in Table 1. For NeuralSort and SoftSort, we find that using the Newton losses, based either on the Hessian or on the Fisher matrix, improves performance substantially; here, using the Hessian performs better than using the Fisher matrix. For logistic DSNs, the improvements are likewise substantial. Monotonic differentiable sorting networks, i.e., the Cauchy DSNs, provide an improved variant of differentiable sorting networks that also has the property of quasi-convexity and has been shown to exhibit much better training behavior. Accordingly, for n = 5 and Cauchy DSNs, there is no improvement over the default loss. However, for the somewhat harder setting of n = 10, we observe an improvement of more than 1% using the Hessian-based Newton loss even in the Cauchy DSN case. In summary, we obtain strong improvements for losses that are difficult to optimize, while for well-behaved losses only small or no improvements can be achieved. This aligns with our goal of improving performance on hard-to-optimize losses.

5.2. SHORTEST-PATH SUPERVISION

In this section, we apply Newton losses to the shortest-path supervision task of the 12 × 12 Warcraft shortest-path benchmark [15], [16], [18]. Here, 12 × 12 Warcraft terrain maps are given as 96 × 96 RGB images, and the supervision is the shortest path from the top left to the bottom right according to a hidden cost embedding. The goal is to predict 12 × 12 cost embeddings of the terrain maps such that the shortest path according to the predicted embedding corresponds to the ground-truth shortest path. For this task, we explore three approaches: the relaxed AlgoVision Bellman-Ford algorithm, stochastic smoothing, and perturbed optimizers. For the relaxed AlgoVision Bellman-Ford algorithm, we explore two variants of the algorithm (an outer For loop and an outer While loop) and two losses (L1 and squared L2), i.e., a total of four settings. As computing the Hessian of the AlgoVision Bellman-Ford algorithm is too expensive with the PyTorch implementation, we restrict this case to the Fisher-based Newton loss. The variant with a For loop is easier to optimize than the While loop variant, as the algorithm contains one condition fewer. This is beneficial for regular training; however, the For loop variant yields lower-quality shortest paths due to an artifact arising from many unnecessary additional loop traversals after backtracking has already finished. This is avoided in the While loop variant, which terminates the loop. As displayed in Table 2, the Newton loss improves performance in three out of four settings, and the overall best performance is also achieved by a Newton loss.

Table 3: Shortest-path benchmark results for stochastic smoothing of the loss (including the algorithm), stochastic smoothing of the algorithm (excluding the loss), and perturbed optimizers with the Fenchel-Young loss. The metric is the percentage of perfect matches averaged over 10 seeds.
After discussing analytical relaxations, we continue with the stochastic methods, the results of which are displayed in Table 3. For stochastic smoothing of the loss function (i.e., stochastic smoothing applied to the algorithm and loss as one unit), we find that Newton losses improve performance for 10 and 30 samples, while regular training performs best if only 3 samples can be drawn. This makes sense, as the estimates of the Hessian or Fisher matrix obtained via stochastic smoothing are not accurate enough with too few samples, but with at least 10 samples they are accurate enough to improve performance. For stochastic smoothing of the algorithm (i.e., stochastic smoothing applied only to the algorithm, with the gradient of the loss afterwards computed via backpropagation), we observe very similar behavior. While estimating the Hessian is intractable in this case, we see improvements using the Fisher matrix for ≥ 10 samples. For perturbed optimizers with a Fenchel-Young loss [46], we can confirm that the number of samples drawn barely affects performance. By extending the formulation to also compute the Hessian of the Fenchel-Young loss, we can compute the Newton loss and achieve improvements of more than 2% in this case. Interestingly, we find that perturbed optimizers are more sample efficient but do not improve with more samples.

Table 4: MNIST classification learning results. The models are 5-layer fully connected ReLU networks with 100 (M1), 400 (M2), and 1 600 (M3) neurons per layer, as well as the convolutional LeNet-5 with sigmoid activations (M4) and LeNet-5 with ReLU activations (M5). The results are averaged over 20 seeds, and significance tests between regular training and the Newton methods are conducted. A better mean is indicated by a gray bold-face number, and a significantly better result is indicated by a black bold-face number.
Also, we compare stochastic smoothing of the loss (which computes a gradient) and stochastic smoothing of the algorithm (which computes a Jacobian matrix): smoothing of the loss is more sample-efficient, while smoothing of the algorithm performs better for ≥ 10 samples.
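As a rough illustration of how a Fisher-based Newton loss can be obtained from stochastic smoothing alone, the following numpy sketch estimates per-sample gradients of a black-box loss and inverts their empirical second moment. All names, the forward-difference estimator, and the damping term are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fisher_newton_target(z, loss_fn, sigma=0.1, n_samples=30, damping=1e-3):
    """Sketch: estimate the gradient and an empirical Fisher of a black-box
    loss at z via Gaussian stochastic smoothing, then take one damped
    Newton-like step to obtain the target z* of the Newton loss."""
    d = z.shape[0]
    base = loss_fn(z)
    # Per-sample gradient estimates g_i = (loss(z + sigma*u_i) - loss(z)) / sigma * u_i
    us = np.random.randn(n_samples, d)
    per_sample = np.stack([(loss_fn(z + sigma * u) - base) / sigma * u for u in us])
    g = per_sample.mean(axis=0)                  # smoothed gradient estimate
    F = per_sample.T @ per_sample / n_samples    # empirical Fisher estimate
    F += damping * np.eye(d)                     # damping for invertibility
    return z - np.linalg.solve(F, g)             # Newton-like target z*

def newton_loss(z, z_star):
    # The network is then trained with the cheap convex surrogate
    # 1/2 ||z - z*||^2, treating z* as a constant.
    return 0.5 * np.sum((z - z_star) ** 2)
```

Since only per-sample loss evaluations are reused, the Fisher estimate comes at essentially no extra cost on top of the smoothed gradient, matching the small runtime overhead reported for the stochastic methods.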

5.3. ABLATION STUDY: CLASSIFICATION

Finally, as an ablation study, we explore the utility of Newton losses for the simple case of MNIST classification with a softmax cross-entropy loss. Note that, as softmax cross-entropy is already a well-behaved objective, we cannot expect improvements; rather, the experiment demonstrates that Newton losses do not degrade performance. To facilitate a fair and extensive comparison, we benchmark training with 5 models and 2 optimizers: we use 5-layer fully connected ReLU networks with 100 (M1), 400 (M2), and 1600 (M3) neurons per layer, as well as the convolutional LeNet-5 with sigmoid activations (M4) and LeNet-5 with ReLU activations (M5). Further, we use SGD and Adam as optimizers. To evaluate both early performance and full training performance, we test after 1 and 200 epochs. As computing the Hessian inverse is trivial in these settings, we also include the element-wise Hessian (e.w. H); however, as it performs (expectedly) poorly and would also have been too expensive in the previous experiments, we did not include it for the algorithmic losses in the previous sections. We run each experiment with 20 seeds, which allows us to perform significance tests (significance level 0.05). As displayed in Table 4, we find that the element-wise Hessian (e.w. H) performs similarly to regular training. Using the empirical Hessian (H) is indistinguishable from regular training: specifically, it is better in 12 out of 20 cases and significantly better in 1 out of 20 cases, which is to be expected from equivalent methods (on average, 1 in 20 tests will be significant at a significance level of 0.05). Finally, we find that the Fisher-based Newton losses perform better than regular training. Specifically, with the SGD optimizer, they are significantly better in 9 out of 10 settings and have a higher mean in the remaining setting. Using Adam [48], both methods perform similarly.
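For the classification case, the Newton target can be computed in closed form from the softmax cross-entropy gradient and Hessian. The following sketch is an illustration under our own assumptions (function names and the damping constant are ours, not the paper's code); the softmax Hessian is singular, so a small damping term is added before inversion.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_ce_newton_target(z, t, damping=1e-3):
    """Sketch: Newton target z* = z - H^{-1} g for the softmax cross-entropy
    loss with one-hot target t, where g = softmax(z) - t and
    H = diag(p) - p p^T (damped for invertibility)."""
    p = softmax(z)
    g = p - t
    H = np.diag(p) - np.outer(p, p) + damping * np.eye(z.size)
    return z - np.linalg.solve(H, g)
```

The network logits z would then be trained toward z* with the surrogate ½‖z − z*‖², which is exactly the Newton-loss construction applied to a loss that is already well-conditioned.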

A PROOFS

Theorem 2 (Gradient Descent Step Equality between (2) and (3a)+(3b)). A gradient descent step according to (2) with arbitrary step size η coincides with two gradient descent steps according to (3a) and (3b), where the optimization over θ has a step size of η and the optimization over z has a unit step size.

Proof. Let θ ∈ Θ be the current parameter vector and let z = f(x; θ). Then the gradient descent steps according to (3a) and (3b) with step sizes 1 and η > 0 are expressed as
$$z \leftarrow z - \nabla_z \ell(z) = f(x;\theta) - \nabla_f \ell(f(x;\theta)), \tag{20}$$
$$\theta \leftarrow \theta - \eta\, \nabla_\theta \tfrac{1}{2}\|z - f(x;\theta)\|_2^2 = \theta - \eta\, \frac{\partial f(x;\theta)}{\partial \theta} \cdot \big(f(x;\theta) - z\big). \tag{21}$$
Combining (20) and (21) eventually leads to
$$\theta \leftarrow \theta - \eta\, \frac{\partial f(x;\theta)}{\partial \theta} \cdot \big(f(x;\theta) - f(x;\theta) + \nabla_f \ell(f(x;\theta))\big) = \theta - \eta\, \nabla_\theta \ell(f(x;\theta)),$$
which is exactly a gradient descent step of problem (1) starting at θ ∈ Θ with step size η. ∎

Theorem 3 (Newton Step Equality between (2) and (3a)+(3b) for m = 1). In the case of m = 1, a Newton step according to (2) with arbitrary step size η coincides with two Newton steps according to (3a) and (3b), where the optimization over θ has a step size of η and the optimization over z has a unit step size.

Proof. Let θ ∈ Θ be the current parameter vector and let z = f(x; θ).
Then applying Newton steps according to (3a) and (3b) leads to
$$z \leftarrow z - \big(\nabla_z^2 \ell(z)\big)^{-1} \nabla_z \ell(z) = f(x;\theta) - \big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1} \nabla_f \ell(f(x;\theta)), \tag{23}$$
$$\theta \leftarrow \theta - \eta \left( \nabla_\theta^2\, \tfrac{1}{2}\|z - f(x;\theta)\|_2^2 \right)^{-1} \nabla_\theta\, \tfrac{1}{2}\|z - f(x;\theta)\|_2^2 \tag{24}$$
$$= \theta - \eta \left( \frac{\partial}{\partial\theta}\left[ \frac{\partial f(x;\theta)}{\partial\theta}\,\big(f(x;\theta) - z\big) \right] \right)^{-1} \frac{\partial f(x;\theta)}{\partial\theta}\,\big(f(x;\theta) - z\big) \tag{25}$$
$$= \theta - \eta \left( \frac{\partial^2 f(x;\theta)}{\partial\theta^2}\,\big(f(x;\theta) - z\big) + \left(\frac{\partial f(x;\theta)}{\partial\theta}\right)^2 \right)^{-1} \frac{\partial f(x;\theta)}{\partial\theta}\,\big(f(x;\theta) - z\big).$$
Inserting (23), i.e., $f(x;\theta) - z = \big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1} \nabla_f \ell(f(x;\theta))$, we can rephrase the update above as
$$\theta \leftarrow \theta - \eta \left( \frac{\partial^2 f(x;\theta)}{\partial\theta^2}\,\big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1} \nabla_f \ell(f(x;\theta)) + \left(\frac{\partial f(x;\theta)}{\partial\theta}\right)^2 \right)^{-1} \frac{\partial f(x;\theta)}{\partial\theta}\,\big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1} \nabla_f \ell(f(x;\theta)). \tag{26}$$
By applying the chain rule twice, we further obtain
$$\nabla_\theta^2 \ell(f(x;\theta)) = \frac{\partial}{\partial\theta}\left[ \frac{\partial f(x;\theta)}{\partial\theta}\, \nabla_f \ell(f(x;\theta)) \right] = \frac{\partial^2 f(x;\theta)}{\partial\theta^2}\, \nabla_f \ell(f(x;\theta)) + \left(\frac{\partial f(x;\theta)}{\partial\theta}\right)^2 \nabla_f^2 \ell(f(x;\theta)),$$
which allows us to rewrite (26) as
$$\theta \leftarrow \theta - \eta \left( \big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1}\, \nabla_\theta^2 \ell(f(x;\theta)) \right)^{-1} \big(\nabla_f^2 \ell(f(x;\theta))\big)^{-1}\, \nabla_\theta \ell(f(x;\theta)) = \theta - \eta\, \big(\nabla_\theta^2 \ell(f(x;\theta))\big)^{-1} \nabla_\theta \ell(f(x;\theta)),$$
which is exactly a Newton step of problem (1) starting at θ ∈ Θ with step size η. ∎
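Both step equalities can be checked numerically on a scalar toy problem. The following sketch verifies the gradient-step identity of Theorem 2 and the Newton-step identity of Theorem 3 for m = 1; the model f(θ) = tanh(θ) and the loss ℓ(z) = e^z + z² are arbitrary illustrative choices of ours, not from the paper.

```python
import math

# Illustrative toy model and loss (assumptions, not from the paper).
def f(t):    return math.tanh(t)
def f1(t):   return 1 - math.tanh(t) ** 2                        # f'
def f2(t):   return -2 * math.tanh(t) * (1 - math.tanh(t) ** 2)  # f''
def ell1(z): return math.exp(z) + 2 * z                          # ell'  for ell(z) = e^z + z^2
def ell2(z): return math.exp(z) + 2                              # ell'' > 0

t, eta = 0.4, 0.1
z0 = f(t)

# --- Theorem 2: gradient-step equality ---
direct_gd = t - eta * f1(t) * ell1(z0)        # direct step on ell(f(theta))
z = z0 - ell1(z0)                             # unit gradient step in z
two_stage_gd = t - eta * f1(t) * (z0 - z)     # step on 1/2 (z - f(theta))^2
assert abs(direct_gd - two_stage_gd) < 1e-12

# --- Theorem 3 (m = 1): Newton-step equality ---
L1 = f1(t) * ell1(z0)                             # d/dtheta ell(f(theta))
L2 = f2(t) * ell1(z0) + f1(t) ** 2 * ell2(z0)     # d^2/dtheta^2 via chain rule
direct_newton = t - eta * L1 / L2
zN = z0 - ell1(z0) / ell2(z0)                     # unit Newton step in z
g1 = f1(t) * (z0 - zN)                            # grad of 1/2 (z - f)^2
g2 = f2(t) * (z0 - zN) + f1(t) ** 2               # Hessian of 1/2 (z - f)^2
two_stage_newton = t - eta * g1 / g2
assert abs(direct_newton - two_stage_newton) < 1e-10
```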

B FURTHER EXAMPLES FOR NEWTON LOSSES

A less trivial example is the binary cross-entropy (BCE) loss
$$\ell_{\mathrm{BCE}}(y) = \mathrm{BCE}(y, p) = -\sum_{i=1}^{m} \big( p_i \log y_i + (1 - p_i) \log(1 - y_i) \big),$$
where $p \in \Delta^m$ is a probability vector encoding of the ground truth.

Example 3 (Binary cross-entropy loss). For the BCE loss, the induced Newton loss is given as $\ell^*_{\mathrm{BCE}}(y) = \frac{1}{2}\|z^\star - y\|_2^2$, where the element-wise Hessian variant is
$$z^\star_E = -\operatorname{diag}\big( p \oslash y^2 + (1-p) \oslash (1-y)^2 \big)^{-1} \big( (1-p) \oslash (1-y) - p \oslash y \big) + y,$$
the empirical Hessian variant is
$$z^\star_H = -\operatorname{diag}\big( \mathbb{E}_y\big[ p \oslash y^2 + (1-p) \oslash (1-y)^2 \big] \big)^{-1} \big( (1-p) \oslash (1-y) - p \oslash y \big) + y,$$
and the empirical Fisher variant is
$$z^\star_F = -\mathbb{E}_y\Big[ \big( (1-p) \oslash (1-y) - p \oslash y \big)\big( (1-p) \oslash (1-y) - p \oslash y \big)^\top \Big]^{-1} \big( (1-p) \oslash (1-y) - p \oslash y \big) + y,$$
where $\oslash$ denotes element-wise division.

The BCE loss is often extended with the logistic sigmoid function to what is called the sigmoid binary cross-entropy loss (SBCE), defined as
$$\ell_{\mathrm{SBCE}}(y) = \mathrm{BCE}(\sigma(y), p), \quad\text{where}\quad \sigma(x) = \frac{1}{1 + \exp(-x)}.$$

Example 4 (Sigmoid binary cross-entropy loss). For the SBCE loss, the induced Newton loss is given as $\ell^*_{\mathrm{SBCE}}(y) = \frac{1}{2}\|z^\star - y\|_2^2$, where the element-wise Hessian variant is
$$z^\star_E = -\operatorname{diag}\big( \sigma(y) - \sigma(y)^2 \big)^{-1} \big( \sigma(y) - p \big) + y,$$
the empirical Hessian variant is
$$z^\star_H = -\operatorname{diag}\big( \mathbb{E}_y\big[ \sigma(y) - \sigma(y)^2 \big] \big)^{-1} \big( \sigma(y) - p \big) + y,$$
and the empirical Fisher variant is
$$z^\star_F = -\mathbb{E}_y\big[ (\sigma(y) - p)(\sigma(y) - p)^\top \big]^{-1} \big( \sigma(y) - p \big) + y,$$
where $\sigma$ is applied element-wise.
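The element-wise-Hessian Newton target for the SBCE loss can be sketched in a few lines of numpy; the function name and the eps guard against saturated logits are our additions for illustration, not part of the paper.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sbce_newton_target_ew(y, p, eps=1e-8):
    """Sketch of the element-wise-Hessian Newton target z*_E for SBCE:
    gradient sigma(y) - p, Hessian diagonal sigma(y) - sigma(y)^2.
    eps (our addition) guards against division by zero for saturated logits."""
    s = sigmoid(y)
    return y - (s - p) / (s - s ** 2 + eps)
```

Since the SBCE Hessian is diagonal, the element-wise and empirical-Hessian variants differ only in where the expectation over the batch is taken, and the target is a fixed point exactly when σ(y) = p.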

D RUNTIMES

In this supplementary material, we provide and discuss runtimes for the experiments. In the differentiable sorting and ranking experiment, as shown in Table 5, the runtime of the Newton loss with the Fisher is only marginally larger than that of regular training, because computing and inverting the Fisher is very inexpensive. The Newton loss with the Hessian, however, is more expensive: due to the implementation of the differentiable sorting and ranking operators, we compute the Hessian by differentiating each element of the gradient, which makes this process fairly expensive. An improved implementation could make this much faster; nevertheless, computing the Hessian always carries some overhead compared to the Fisher. In Table 6, we show the runtimes for the shortest-path experiment with AlgoVision, where the runtime overhead is very small. In Table 7, we show the runtimes for the shortest-path experiment with the stochastic methods; here, the runtime overhead is also very small, and the Hessian is cheap to compute as it is not obtained via automatic differentiation.

E CONVERGENCE PLOTS FOR THE CLASSIFICATION EXPERIMENT

In Figure 1, we provide convergence plots for the classification experiment. Here, we consider model M3 on the MNIST and CIFAR-10 data sets and show the convergence over the first 50 epochs.



CONCLUSION

In this work, we proposed Newton losses, a method for combining second-order optimization of the loss function with first-order optimization of the model. We extensively benchmarked Newton losses on multiple tasks with an array of algorithmic losses and found that they improve performance when training with non-trivial loss functions such as algorithmic losses.



Figure 1: Convergence plots of the test accuracy for the classification experiment on MNIST (top)and CIFAR-10 (bottom) for model M3. On the left, we use the SGD optimizer and on the right, we use the Adam optimizer. Results are averaged over 20 seeds and the standard deviations are marked in shaded color. We note that for CIFAR-10 / SGD (bottom left) one of the seeds for "Newton L. (Hessian)" crashed numerically, which increased the standard deviation at this point.

Table 1: Differentiable sorting results. The metric is the percentage of rankings correctly identified (and individual element ranks correctly identified) averaged over 10 seeds.

Table 2: Shortest-path benchmark results for different variants of the AlgoVision-relaxed Bellman-Ford algorithm. The displayed metric is the percentage of perfect matches averaged over 10 seeds.

Table 5: Runtimes for the differentiable sorting results corresponding to Table 1. Times of full training in seconds.

Table 6: Runtimes for the shortest-path results corresponding to Table 2. Times of full training in seconds.

Table 7: Runtimes for the shortest-path results corresponding to Table 3. Times of full training in seconds.

C REGULARIZERS Ω

In this section, we provide a more detailed discussion of how the regularization term Ω in (4) induces different iterative optimization methods. The simple setting, where (3a) represents a basic gradient descent step, i.e.,
$$z \leftarrow f(x;\theta) - \eta\, \nabla_f \ell(f(x;\theta)),$$
can be obtained by choosing a regularization term as
$$\Omega(z, f(x;\theta)) = \frac{1}{2\eta}\, \|z - f(x;\theta)\|_2^2 - \ell(z) + z^\top \nabla_f \ell(f(x;\theta)).$$
Then, the first-order optimality conditions for $\min_z\; \ell(z) + \Omega(z, f(x;\theta))$ are
$$\frac{1}{\eta}\,\big(z - f(x;\theta)\big) + \nabla_f \ell(f(x;\theta)) = 0,$$
which leads to the gradient step $z = f(x;\theta) - \eta\, \nabla_f \ell(f(x;\theta))$.

Alternatively, we can ask for the choice of the regularization term Ω corresponding to a Newton step in (3a), i.e.,
$$z \leftarrow f(x;\theta) - \eta\, \big(\nabla^2 \ell(f(x;\theta))\big)^{-1} \nabla \ell(f(x;\theta)).$$
In this case, by choosing
$$\Omega(z, f(x;\theta)) = \frac{1}{2\eta}\, \big(z - f(x;\theta)\big)^\top \nabla^2 \ell(f(x;\theta))\, \big(z - f(x;\theta)\big) - \ell(z) + z^\top \nabla \ell(f(x;\theta)),$$
the first-order optimality conditions for $\min_z\; \ell(z) + \Omega(z, f(x;\theta))$ are
$$\frac{1}{\eta}\, \nabla^2 \ell(f(x;\theta))\, \big(z - f(x;\theta)\big) + \nabla \ell(f(x;\theta)) = 0,$$
which is equivalent to a Newton step $z = f(x;\theta) - \eta\, \big(\nabla^2 \ell(f(x;\theta))\big)^{-1} \nabla \ell(f(x;\theta))$.
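The claim that such a regularized inner problem is minimized exactly at the gradient step can be checked numerically. The sketch below assumes one particular form of Ω that cancels ℓ(z) and linearizes it at f (our reconstruction for illustration, not necessarily the paper's exact choice): Ω(z, f) = 1/(2η)‖z − f‖² − ℓ(z) + zᵀ∇ℓ(f). The resulting objective is strictly convex with minimizer z = f − η∇ℓ(f).

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1
f = rng.normal(size=5)  # stand-in for the network output f(x; theta)

def ell(z):      return float(np.sum(np.log(1 + np.exp(z))))  # smooth example loss
def grad_ell(z): return 1 / (1 + np.exp(-z))                  # its gradient (sigmoid)

def objective(z):
    # ell(z) + Omega(z, f) with the assumed Omega; ell(z) cancels, leaving a
    # strictly convex quadratic plus a linear term.
    omega = np.sum((z - f) ** 2) / (2 * eta) - ell(z) + z @ grad_ell(f)
    return ell(z) + omega

z_star = f - eta * grad_ell(f)  # predicted minimizer: the gradient step
for _ in range(100):
    z_perturbed = z_star + 0.1 * rng.normal(size=5)
    assert objective(z_star) <= objective(z_perturbed) + 1e-12
```

The Newton-step variant works the same way, with the quadratic term weighted by ∇²ℓ(f) instead of the identity.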

