DIFFERENTIABLE COMBINATORIAL LOSSES THROUGH GENERALIZED GRADIENTS OF LINEAR PROGRAMS

Anonymous

Abstract

Combinatorial problems with a linear objective function play a central role in many computer science applications, and efficient algorithms for solving them are well known. However, the solutions to these problems are not differentiable with respect to the parameters specifying the problem instance; for example, the shortest distance between two nodes in a graph is not a differentiable function of the graph edge weights. Recently, attempts to integrate combinatorial and, more broadly, convex optimization solvers into gradient-trained models have resulted in several approaches for differentiating over the solution vector of the optimization problem. In many cases, however, the interest is in differentiating only over the objective value, not the solution vector, and using the existing approaches introduces unnecessary overhead. Here, we show how to perform gradient descent directly over the objective value of the solution to combinatorial problems. We demonstrate the advantage of the approach in examples involving sequence-to-sequence modeling using a differentiable encoder-decoder architecture with softmax or Gumbel-softmax, and in weakly supervised learning involving a convolutional, residual feed-forward network for image classification.

1. INTRODUCTION

Combinatorial optimization problems, such as the shortest path in a weighted directed graph, the minimum spanning tree in a weighted undirected graph, or the optimal assignment of tasks to workers, play a central role in many computer science applications. We have highly refined, efficient algorithms for solving these fundamental problems (Cormen et al., 2009; Schrijver, 2003). However, while we can easily find, for example, the minimum spanning tree in a graph, the total weight of the tree as a function of the graph edge weights is not differentiable. This hinders the use of solutions to combinatorial problems as criteria in training models that rely on differentiability of the objective function with respect to the model parameters.

Losses defined by the objective value of some feasible solution to a combinatorial problem, not the optimal one, have recently been proposed for image segmentation using deep models (Zheng et al., 2015; Lin et al., 2016). These focus on a problem where some pixels in the image have segmentation labels, and the goal is to train a convolutional network that predicts segmentation labels for all pixels. For pixels with labels, a classification loss can be used. For the remaining pixels, a criterion based on a combinatorial problem, for example the maximum-flow/minimum-cut problem in a regular lattice graph connecting all pixels (Boykov et al., 2001) or derived, higher-level super-pixels (Lin et al., 2016), is often used as a loss, in an iterative process of improving discrete segmentation labels (Zheng et al., 2015; Marin et al., 2019). In this approach, the instance of the combinatorial problem is either fixed or depends only on the input to the network; for example, similarity of neighboring pixel colors defines edge weights. The output of the neural network gives rise to a feasible, but rarely optimal, solution to that fixed instance of the combinatorial problem, and its quality is used as a loss.
For example, the pixel labeling proposed by the network is interpreted as a cut in a pre-defined graph connecting the pixels. Training the network should result in improved cuts, but no attempt is made to use a solver to find an optimal cut.

Here, we consider a different setup, in which each new output of the neural network gives rise to a new instance of a combinatorial problem. A combinatorial algorithm is then used to find the optimal solution to the problem defined by the output, and the objective value of that optimal solution is used as a loss. After each gradient update, the network will produce a new combinatorial problem instance, even for the same input sample. Iteratively, the network is expected to learn to produce combinatorial problem instances that have a low optimal objective value. For example, in sequence-to-sequence modeling, the network will output a new sentence that is supposed to closely match the desired sentence, leading to a new optimal sequence alignment problem to be solved. Initially, the optimal alignment will be poor, but as the network improves and the quality of the output sentences gets higher, the optimal alignment scores will be lower.

Recently, progress in integrating combinatorial problems into differentiable models has been made by modifying combinatorial algorithms to use only differentiable elements (Tschiatschek et al., 2018; Mensch & Blondel, 2018; Chang et al., 2019), for example a smoothed max instead of max in dynamic programming. Another approach involves executing two runs of a non-differentiable, black-box combinatorial algorithm and using the two solutions to define a differentiable interpolation (Vlastelica Pogančić et al., 2020; Rolínek et al., 2020).
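To make the sequence-to-sequence setup above concrete: each new decoder output defines a new optimal sequence alignment problem against the reference sentence, and the optimal alignment cost serves as the loss. A minimal sketch, using edit distance as a stand-in alignment objective (the function name and toy sentences are ours, purely for illustration):

```python
def alignment_loss(pred, target):
    """Optimal sequence-alignment cost (Levenshtein distance) between a
    predicted token sequence and the reference: the loss is the objective
    value of the optimal solution, recomputed for every new output."""
    m, n = len(pred), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # match / substitute
    return d[m][n]

ref = "the cat sat".split()
# As the decoder output improves, the optimal alignment cost drops.
assert alignment_loss("a dog ran".split(), ref) > \
       alignment_loss("the dog sat".split(), ref) > \
       alignment_loss("the cat sat".split(), ref)
```

The dynamic program itself is standard; the point is only that the loss is the objective value of an optimal solution to an instance freshly defined by the network output.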
Finally, differentiable linear programming and quadratic programming layers, which can be used to model many combinatorial problems, have recently been proposed (Amos & Kolter, 2017; Agrawal et al., 2019; Wilder et al., 2019; Ferber et al., 2019). The approaches above allow for differentiating through optimal solution vectors. In many cases, however, we are interested only in the optimal objective value, not the solution vector, and these approaches introduce unnecessary overhead.

We propose an approach for gradient-descent-based training of a network F(x; β) for supervised learning problems involving samples (x, y), with an objective criterion involving a loss term of the form L(β) = h(OptSolutionObjectiveValue(Π(F(x; β), y))), where h : R → R is some differentiable function, and Π is a combinatorial solver for a problem instance defined by the output of the β-parameterized network F for feature vector x and by the true label y. We show that a broad class of combinatorial problems can be integrated into models trained using variants of gradient descent. Specifically, we show that for an efficiently solvable combinatorial problem that can be efficiently expressed as an integer linear program, generalized gradients of the problem's objective value with respect to the real-valued parameters defining the problem exist and can be efficiently computed from a single run of a black-box combinatorial algorithm. Using this result, we show how generalized gradients of combinatorial problems can provide a sentence-level loss for text summarization using differentiable encoder-decoder models that involve softmax or Gumbel-softmax (Jang et al., 2016), and a multi-element loss for training classification models when only weakly supervised, bagged training data is available.
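For intuition on why a single solver run suffices: when a combinatorial problem is expressed as a linear program min_z c^T z over a fixed feasible polytope, the optimal objective value has gradient z* with respect to the cost vector c wherever the optimum z* is unique, a standard sensitivity fact; for shortest path, z* is the 0/1 incidence vector of the edges on the optimal path, which the solver produces anyway. A pure-Python sketch (the graph encoding and helper names are our illustrative choices, not the paper's implementation):

```python
import heapq

def shortest_path(weights, src, dst):
    """Dijkstra on a digraph given as {(u, v): w}; returns the optimal
    objective value and the set of edges on the optimal path."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    # Edges on the optimal path: this 0/1 indicator vector is a
    # generalized gradient of the distance w.r.t. the edge weights.
    used, node = set(), dst
    while node != src:
        used.add((prev[node], node))
        node = prev[node]
    return dist[dst], used

# Two routes from 0 to 3; the cheaper one is 0 -> 1 -> 3.
w = {(0, 1): 1.0, (1, 3): 1.0, (0, 2): 2.0, (2, 3): 2.0}
obj, grad_support = shortest_path(w, 0, 3)

# Finite-difference check: perturbing an on-path edge moves the objective
# one-for-one; perturbing an off-path edge (away from ties) does nothing.
eps = 1e-6
for e in w:
    w2 = dict(w); w2[e] += eps
    fd = (shortest_path(w2, 0, 3)[0] - obj) / eps
    assert abs(fd - (1.0 if e in grad_support else 0.0)) < 1e-4
```

The forward pass is one ordinary run of the black-box solver; the backward pass reuses the solution already computed, with no extra solver calls or perturbed reruns.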

2. DIFFERENTIABLE COMBINATORIAL LOSSES

2.1 BACKGROUND ON GENERALIZED GRADIENTS

A function f : X → R defined over a convex, bounded open set X ⊆ R^p is Lipschitz-continuous on an open set B ⊆ X if there is a finite K ∈ R such that |f(x) - f(y)| ≤ K ||x - y|| for all x, y ∈ B. A function is locally Lipschitz-continuous if for every point x0 in its domain there is a neighborhood B0, an open ball centered at x0, on which the function is Lipschitz-continuous. For such functions, a generalized gradient can be defined.

Definition 1 (Clarke, 1975). Let f : X → R be Lipschitz-continuous in a neighborhood of x ∈ X. Then the Clarke subdifferential ∂f(x) of f at x is defined as

∂f(x) = conv { lim_{x_k → x} ∇f(x_k) },

where the limit is taken over all convergent sequences (x_k) of points at which the gradient exists, and conv denotes the convex hull, that is, the smallest convex set that contains all vectors from the given set. Each element of the set ∂f(x) is called a generalized gradient of f at x.

By the Rademacher theorem (see, e.g., Evans (1992)), the gradient of any locally Lipschitz-continuous function exists almost everywhere, so such convergent sequences can always be found. In optimization algorithms, generalized gradients can be used in the same way as subgradients (Redding & Downs, 1992); that is, nondifferentiability may affect convergence in certain cases.
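Definition 1 can be illustrated with the simplest combinatorial objective: for two parallel edges of weights w1 and w2 between a pair of nodes, the shortest distance is f(w1, w2) = min(w1, w2), which is Lipschitz-continuous everywhere but not differentiable on the line w1 = w2. The toy sketch below (our example, not from the paper) recovers the two limiting gradients whose convex hull is the Clarke subdifferential at the kink:

```python
def f(w1, w2):
    # Shortest distance with two parallel edges: Lipschitz-continuous,
    # but nondifferentiable where w1 == w2.
    return min(w1, w2)

def grad(w1, w2, eps=1e-6):
    # Central differences; valid at points where f is differentiable.
    return ((f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps),
            (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps))

# Approach the kink at (1, 1) from the two smooth regions.
g1 = grad(0.999, 1.0)   # region w1 < w2: gradient is (1, 0)
g2 = grad(1.0, 0.999)   # region w2 < w1: gradient is (0, 1)

# Every convex combination t*g1 + (1-t)*g2 with t in [0, 1] is a
# generalized gradient of f at the nondifferentiable point (1, 1).
```

A gradient-descent variant can then use any element of this set in place of the gradient, exactly as a subgradient would be used.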

