DIFFERENTIABLE OPTIMIZATION OF GENERALIZED NONDECOMPOSABLE FUNCTIONS USING LINEAR PROGRAMS

Abstract

We propose a framework that makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures such as AUC, multi-class AUC, F-measure and others. A common feature of the optimization model that emerges from these tasks is that it involves solving a Linear Program (LP) during training, where representations learned by upstream layers influence the constraints. The constraint matrix is not only large but is also modified at each iteration. We show that adopting a set of influential ideas proposed by Mangasarian for 1-norm SVMs, which advocate solving LPs with a generalized Newton method, provides a simple and effective solution. In particular, this strategy needs little unrolling, which makes the backward pass more efficient. While a number of specialized algorithms have been proposed for the models we describe here, our module is applicable without any problem-specific adjustments or relaxations. We describe each use case, study its properties, and demonstrate the efficacy of the approach over alternatives that use surrogate lower bounds and, often, specialized optimization schemes. Frequently, we achieve superior computational behavior and performance improvements on common datasets used in the literature.

1. INTRODUCTION

Commonly used losses such as cross-entropy in deep neural network (DNN) models can be expressed as a sum over the per-sample losses incurred by the current estimate of the model. This allows the direct use of mature optimization routines and is sufficient for a majority of use cases. But in various applications, ranging from ranking/retrieval systems to class-imbalanced learning, the most suitable losses for the task do not admit a "decompose over samples" form. Examples include the area under the ROC curve (AUC), multi-class variants of AUC, F-score, precision at a fixed recall (P@R) and others. Optimizing such measures in a scalable manner can pose challenges even in the shallow setting. Since the compromise involved in falling back on a decomposable loss, when a non-decomposable objective may be more appropriate for the task at hand, can range from negligible to concerning depending on the application, the last few years have seen a number of interesting approaches proposed to efficiently deal with such structured and non-decomposable losses. To this end, algorithms for AUC maximization based on convex surrogate losses Liu et al. (2018); Natole et al. (2018), in a linear model or in conjunction with a deep neural network Liu et al. (2019), as well as stochastic and online variations (Ataman et al. (2006); Cortes & Mohri (2004); Gao et al. (2013); Liu et al. (2018; 2019)), are available. Methods for measures other than the AUC have also been studied: exact algorithms for optimizing F-score Nan et al. (2012); Dembczynski et al. (2011), direct optimization of average precision Song et al. (2016), scalable methods for non-decomposable objectives Eban et al. (2017); Venkatesh et al. (2019) and structured hinge-loss upper bounds for optimizing average precision and NDCG Mohapatra et al. (2018). Recently, the AP-Perf method Fathony & Kolter (2020) showed how custom non-decomposable performance measures can be conveniently incorporated into differentiable pipelines. It is known that a number of these non-decomposable objectives can be expressed in the form of an integer program that can be relaxed to a linear program (LP). Earlier approaches adapted methods for structured SVMs Joachims et al. (2009) or cutting-plane techniques Yue et al. (2007) and were interesting but had difficulty scaling to larger datasets. More recently, strategies have instead focused on the stochastic setting, where we operate on a mini-batch of samples. A common strategy, which is generally efficient, is to study the combinatorial form of one or more losses of interest and derive surrogate lower bounds.
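To make the non-decomposability concrete, the following minimal sketch (with hypothetical scores and labels) computes the binary AUC as a pairwise statistic: it couples every positive sample with every negative sample, so it cannot be rewritten as a sum of independent per-sample losses.

```python
import numpy as np

# Hypothetical scores and labels for illustration.
scores = np.array([0.9, 0.2, 0.3, 0.6, 0.1])
labels = np.array([1, 1, 0, 1, 0])

pos = scores[labels == 1]   # scores of positive samples
neg = scores[labels == 0]   # scores of negative samples

# AUC = fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
# The pairwise matrix below is what prevents a per-sample decomposition.
pairs = pos[:, None] - neg[None, :]
auc = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()
print(auc)  # 5 of 6 pairs are ordered correctly: 0.8333...
```

Any surrogate or LP formulation of the AUC must similarly reason over pairs (or their relaxation), which is why the resulting constraint matrices grow quickly with the mini-batch size.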
These are then tackled via specialized optimization routines. The arguably more direct alternative of optimizing the measure as a module embedded in a DNN architecture remained difficult until recently, but OptNet Amos & Kolter (2017) and CVXPY now offer support for solving certain objectives as differentiable layers within a network. Further, methods based on implicit differentiation have been applied to various problems, showing impressive performance. Our approach is based on the premise that tackling the LP form of the non-decomposable objective as a module within the DNN, one which permits forward and reverse mode differentiation and can utilize built-in support for specialized GPU hardware in modern libraries such as PyTorch, is desirable. First, as long as a suitable LP formulation for an objective is available, the module may be used directly. Second, depending on which scheme is used to solve the LP, one may be able to provide guarantees for the non-decomposable objective based on simple calculations (e.g., number of constraints, primal-dual gap). The current tools, however, do not entirely address all these requirements, as we briefly describe next. A specific characteristic of the LPs that arise from the losses mentioned above is that the constraints are modified at each iteration, as a function of the updates to the representations of the data in the upstream layers. Further, the mini-batch of samples changes at each iteration. Solvers within CVXPY are effective but, due to their general-purpose nature, rely on interior point methods. OptNet is quite efficient but designed for quadratic programs (QPs): its theoretical results and efficiency depend on factorizing the matrix in the quadratic term of the objective (which is zero/noninvertible for LPs). The primal-dual properties and implicit differentiation for QPs do not easily translate to LPs, and efficient ways of dealing with constraints that are iteratively updated are not widely available at this time.
In principle, of course, backpropagating through a convex optimization model (and in particular, an LP) is not an unsolved problem. For LPs, we can take derivatives of the optimal value (or the optimal solution) of the model with respect to the LP parameters, and this can be accomplished by calling a powerful external solver. Often, this involves running the solver on the CPU, which introduces overhead. The ideas in Meng et al. (2020) are relevant in that the optimization steps for the LP are unrolled and only involve simple linear algebra operations, but the formulation is applicable when the number of constraints is about the same as the number of variables, an assumption that does not hold for the models we will study. In §3, we show that the modified Newton's algorithm in Mangasarian (2004) can be used for deep neural network (DNN) training in an end-to-end manner without requiring an external solver, for which GPU support remains limited. Specifically, by exploiting self-concordance of the objective, we show that the algorithm converges globally without line-search strategies. On the practical side, we analyze the gradient properties and describe some modifications to improve stability during backpropagation. We show that this scheme, based on Mangasarian's parametric exterior penalty formulation of the primal LP, can be a computationally effective and scalable strategy for solving LPs with a large number of constraints.
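The mechanics of a generalized Newton iteration on an exterior penalty of an LP can be sketched in a few lines. The version below is a simplified illustration in the spirit of Mangasarian (2004), not the paper's exact formulation (which works with a parametric penalty and includes finite-termination and safeguarding arguments): it minimizes the piecewise-quadratic penalty F(x) = eps * c'x + 0.5 * ||max(Ax - b, 0)||^2 using the generalized Hessian, so only dense linear algebra (and no external solver) is involved.

```python
import numpy as np

def lp_newton_penalty(c, A, b, eps=1e-3, delta=1e-6, iters=50):
    """Approximately solve min c'x s.t. Ax <= b via a generalized Newton
    method on the smooth exterior penalty
        F(x) = eps * c'x + 0.5 * ||max(Ax - b, 0)||^2.
    Simplified sketch; eps controls the (small) penalty bias and delta
    regularizes the generalized Hessian where no constraint is active.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = A @ x - b
        g = eps * c + A.T @ np.maximum(r, 0.0)          # gradient of F
        active = (r > 0).astype(float)                  # gen. derivative of max(., 0)
        H = A.T @ (active[:, None] * A) + delta * np.eye(A.shape[1])
        step = np.linalg.solve(H, g)                    # generalized Newton direction
        x = x - step
        if np.linalg.norm(step) < 1e-10:
            break
    return x

# Toy LP: min -x1 - x2 s.t. x1 <= 1, x2 <= 1; optimum at (1, 1).
c = np.array([-1.0, -1.0])
A = np.eye(2)
b = np.ones(2)
x = lp_newton_penalty(c, A, b)
print(x)  # close to (1, 1), up to an O(eps) penalty bias
```

Every operation above (matrix products, a small linear solve, elementwise thresholding) maps directly onto GPU-friendly primitives, which is the property exploited when embedding the solver inside a DNN training loop.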

2. NONDECOMPOSABLE FUNCTIONS AND CORRESPONDING LP MODELS

We first present a standard LP form and then reparameterize several generalized nondecomposable objectives in this form, summarized in Table 5 in the appendix. We start with the binary AUC, extend it to multi-class AUC, and then show a ratio objective, the F-score. Some other objectives are described in the appendix.
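For reference, a generic instance of the standard form min_x c'x subject to Ax ≤ b, x ≥ 0 can be solved with an off-the-shelf solver as below. The specific c, A, b are hypothetical stand-ins; in our setting these quantities are built from (and updated with) the representations learned by the upstream layers.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical LP data: min c'x  s.t.  Ax <= b, x >= 0.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([2.0, 1.0])

res = linprog(c, A_ub=A, b_ub=b)  # x >= 0 via linprog's default bounds
print(res.x, res.fun)             # optimum x = (1, 1), objective -3
```

A general-purpose CPU solver like this is convenient for checking small instances, but it is exactly the component we seek to replace with the GPU-friendly scheme of §3.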

2.1. NOTATIONS AND GENERALIZED LP FORMULATION

Notations. We use the following notations: 




(i) n: number of samples used in training; (ii) X ∈ R^{n×d}: the explanatory features fed to a classifier (e.g., parameterized by w); (iii) f(x_i) (or f(i)): a score function for the classifier, where x_i ∈ X and f(X) = Xw; (iv) Y ∈ {0,1}^n: target labels, and Ŷ ∈ {0,1}^n: predicted labels for binary classification; (v) φ(·): a nonlinear function applied to f(X); (vi) A ⊗ B: the Kronecker product of matrices A and B; (vii) I_r: the identity matrix of size r, and 1: the indicator function; (viii) B_{k,·} (and B_{·,k}): the k-th row (and k-th column) of B.
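The notation above maps directly onto array operations; the toy sizes and values below are hypothetical and only illustrate the conventions.

```python
import numpy as np

n, d = 4, 3
X = np.arange(n * d, dtype=float).reshape(n, d)  # X in R^{n x d}
w = np.array([0.5, -1.0, 0.25])                  # classifier parameters
f_X = X @ w                                      # scores f(X) = Xw, one per sample

K = np.kron(np.eye(2), np.ones((2, 2)))          # Kronecker product A (x) B
row_k = K[1, :]                                  # B_{k,.}: k-th row
col_k = K[:, 1]                                  # B_{.,k}: k-th column
indicator = (f_X > 0).astype(float)              # 1[f(x_i) > 0]
```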

