BACKPROPAGATION THROUGH COMBINATORIAL ALGORITHMS: IDENTITY WITH PROJECTION WORKS

Abstract

Embedding discrete solvers as differentiable layers has given modern deep learning architectures combinatorial expressivity and discrete reasoning capabilities. The derivative of these solvers is zero or undefined, so a meaningful replacement is crucial for effective gradient-based learning. Prior works smooth the solver with input perturbations, relax the solver to a continuous problem, or interpolate the loss landscape, using techniques that typically require additional solver calls, introduce extra hyperparameters, or compromise performance. We propose a principled approach that exploits the geometry of the discrete solution space to treat the solver as a negative identity on the backward pass, and we provide a theoretical justification. Our experiments demonstrate that this straightforward, hyperparameter-free approach competes with previous, more complex methods on numerous tasks, such as backpropagation through discrete samplers, deep graph matching, and image retrieval. Furthermore, we substitute the previously proposed problem-specific and label-dependent margin with a generic regularization procedure that prevents cost collapse and increases robustness. Code is available at github.com/martius-lab/solver-differentiation-identity.

1. INTRODUCTION

Deep neural networks have achieved astonishing results in solving problems from raw inputs. However, in key domains such as planning or reasoning, deep networks need to make discrete decisions, which can be naturally formulated as constrained combinatorial optimization problems. In many settings, including shortest-path finding (Vlastelica et al., 2020; Berthet et al., 2020), optimizing rank-based objective functions (Rolínek et al., 2020a), keypoint matching (Rolínek et al., 2020b; Paulus et al., 2021), Sudoku solving (Amos and Kolter, 2017; Wang et al., 2019), and solving the knapsack problem from sentence descriptions (Paulus et al., 2021), neural models that embed optimization modules as layers achieve improved performance, data efficiency, and generalization (Vlastelica et al., 2020; Amos and Kolter, 2017; Ferber et al., 2020; P. et al., 2021). This paper explores the end-to-end training of deep neural network models with embedded discrete combinatorial algorithms (solvers, for short) and derives simple and efficient gradient estimators for these architectures. Deriving an informative gradient through the solver constitutes the main challenge, since the true gradient is, due to the discreteness, zero almost everywhere. Most notably, Blackbox Backpropagation (BB) by Vlastelica et al. (2020) yields an informative gradient by applying an informed perturbation to the solver input and calling the solver one additional time. This results in the gradient of an implicit piecewise-linear loss interpolation whose locality is controlled by a hyperparameter. We propose a fundamentally different strategy: dropping the constraints on the solver solutions and simply propagating the incoming gradient through the solver, effectively treating the discrete block as a negative identity on the backward pass.
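The negative-identity backward pass can be sketched as a custom autograd function (a minimal PyTorch-style sketch; `NegIdentitySolver` and the toy argmin solver below are illustrative names, not the authors' released code):

```python
import torch


class NegIdentitySolver(torch.autograd.Function):
    """Wraps a blackbox combinatorial solver. On the backward pass the
    solver is treated as a negative identity: the incoming gradient is
    passed through with its sign flipped, so no extra solver call is needed."""

    @staticmethod
    def forward(ctx, cost, solver):
        # solver: cost vector -> discrete solution (non-differentiable)
        with torch.no_grad():
            solution = solver(cost)
        return solution

    @staticmethod
    def backward(ctx, grad_output):
        # dL/dcost := -dL/dsolution; None for the non-tensor solver argument
        return -grad_output, None


# Toy example: the "solver" returns the one-hot argmin of the cost.
cost = torch.tensor([0.2, -1.0, 0.5], requires_grad=True)
solver = lambda c: torch.nn.functional.one_hot(torch.argmin(c), c.numel()).float()
solution = NegIdentitySolver.apply(cost, solver)
loss = (solution * torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()
# cost.grad is the negated incoming gradient: [-1., -2., -3.]
```

Note that, unlike BB, the backward pass here involves no second solver call; the sign flip reflects that decreasing the cost of a desirable solution makes the solver more likely to select it.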
While our gradient replacement is simple and cheap to compute, its naïve application can result in unstable learning behavior, as described in the following. Our considerations center on invariances of typical combinatorial problems under specific transformations of the cost vector. These transformations usually manifest as projections or normalizations; for example, as an immediate consequence of the linearity of the objective, the combinatorial solver is agnostic to the normalization of the cost vector. Such invariances, if unattended, can hinder fast convergence due to the noise of spurious, irrelevant updates, or can result in divergence and cost collapse (Rolínek et al., 2020a). We propose to exploit the knowledge of such invariances by including the respective transformations in the computation graph. On the forward pass this leaves the solution unchanged, but on the backward pass it removes the malicious part of the update. We also provide an intuitive view of this as differentiating through a relaxation of the solver. In our experiments, we show that this technique is crucial to the success of our proposed method. In addition, we improve the robustness of our method by adding noise to the cost vector, which induces a margin on the learned solutions and thereby subsumes previously proposed ground-truth-informed margins (Rolínek et al., 2020a). With these considerations taken into account, our simple method achieves strong empirical performance. Moreover, it avoids a costly call to the solver on the backward pass and, in contrast to previous methods, does not introduce additional hyperparameters.

Our contributions can be summarized as follows: (i) a hyperparameter-free method for differentiating linear-cost solvers that does not require any additional calls to the solver on the backward pass; (ii) exploiting invariances via cost projections tailored to the combinatorial problem; (iii) increasing robustness and preventing cost collapse by replacing the previously proposed informed margin with a noise perturbation; (iv) an analysis of the robustness of differentiation methods to perturbations during training.
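The cost projection and noise-based margin described above can be sketched as follows (a hedged sketch assuming a PyTorch tensor interface; the function names and the noise scale `sigma` are illustrative defaults, not values from the paper):

```python
import torch


def project_cost(cost, eps=1e-8):
    """Normalize the cost vector to unit norm inside the computation graph.

    The argmin of a linear objective is invariant to positive rescaling of
    the cost, so the solver's forward solution is unchanged; on the backward
    pass, differentiating through the normalization removes the gradient
    component along the cost direction, i.e. the spurious "scale" update."""
    return cost / (cost.norm() + eps)


def perturb_cost(cost, sigma=0.1):
    """Additive Gaussian noise on the cost during training.

    This acts as a generic, label-independent margin on the learned
    solutions and discourages cost collapse (sigma is illustrative)."""
    return cost + sigma * torch.randn_like(cost)
```

Because the Jacobian of `project_cost` is (up to `eps`) an orthogonal projection scaled by 1/||ω||, any gradient component parallel to the cost vector is filtered out before it reaches the upstream network.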

2. RELATED WORK

Optimizers as Model Building Blocks. It has been shown in various application domains that optimizing over predictions is beneficial for model performance and generalization. One such area is meta-learning, where methods backpropagate through multiple steps of gradient descent for few-shot adaptation in a multi-task setting (Finn et al., 2017; Raghu et al., 2020). Along these lines, algorithms have been proposed that embed more general optimizers into differentiable architectures, such as convex optimization layers (Agrawal et al., 2019a; Lee et al., 2019), quadratic programs (Amos and Kolter, 2017), conic optimization layers (Agrawal et al., 2019b), and more. Combinatorial Solver Differentiation. Many important problems require discrete decisions, and hence using combinatorial solvers as layers has sparked research interest (Domke, 2012; Elmachtoub and Grigas, 2022). Methods such as SPO (Elmachtoub and Grigas, 2022) and MIPaaL (Ferber et al., 2020) assume access to true target costs, a scenario we do not consider. Berthet et al. (2020) differentiate through discrete solvers by sample-based smoothing. Blackbox Backpropagation



Figure 1: Hybrid architecture with a blackbox combinatorial solver and an Identity module (green dotted line), which applies the projection of the cost ω on the forward pass and acts as a negative identity on the backward pass.

