DIFFERENTIABLE OPTIMIZATION OF GENERALIZED NONDECOMPOSABLE FUNCTIONS USING LINEAR PROGRAMS

Abstract

We propose a framework which makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures such as AUC, multi-class AUC, F-measure and others. A common feature of the optimization model that emerges from these tasks is that it involves solving a Linear Program (LP) during training, where representations learned by upstream layers influence the constraints. The constraint matrix is not only large but the constraints are also modified at each iteration. We show how adopting a set of influential ideas proposed by Mangasarian for 1-norm SVMs, which advocate solving LPs with a generalized Newton method, provides a simple and effective solution. In particular, this strategy needs little unrolling, which makes it more efficient during the backward pass. While a number of specialized algorithms have been proposed for the models that we describe here, our module turns out to be applicable without any specific adjustments or relaxations. We describe each use case, study its properties and demonstrate the efficacy of the approach over alternatives which use surrogate lower bounds and, often, specialized optimization schemes. Frequently, we achieve superior computational behavior and performance improvements on common datasets used in the literature.
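The central computational primitive above, solving an LP with a generalized Newton method in the style Mangasarian proposed for 1-norm SVMs, can be sketched in a few lines: one minimizes an exterior penalty of the dual LP, whose piecewise-quadratic objective admits a (generalized) Hessian, and recovers the primal solution from the penalty residual. The sketch below is a minimal illustrative NumPy version and not the paper's implementation; the function name, the toy LP in the usage note, and the parameter choices `eps` (penalty) and `delta` (Hessian regularization) are our own assumptions.

```python
import numpy as np

def lp_generalized_newton(c, A, b, eps=1e-3, delta=1e-6, tol=1e-8, max_iter=200):
    """Illustrative sketch: solve min c'x s.t. Ax >= b, x >= 0 by applying a
    generalized Newton method to an exterior penalty of the dual LP."""
    m = A.shape[0]
    u = np.zeros(m)  # dual variable

    def f(u):
        z = A.T @ u - c  # dual constraint residual
        return (-eps * b @ u
                + 0.5 * np.sum(np.maximum(z, 0.0) ** 2)    # penalize A'u <= c
                + 0.5 * np.sum(np.maximum(-u, 0.0) ** 2))  # penalize u >= 0

    for _ in range(max_iter):
        z = A.T @ u - c
        grad = -eps * b + A @ np.maximum(z, 0.0) - np.maximum(-u, 0.0)
        if np.linalg.norm(grad) < tol:
            break
        # generalized Hessian of the piecewise-quadratic penalty: only the
        # "active" pieces contribute; delta*I keeps the system nonsingular
        H = (A @ np.diag((z > 0).astype(float)) @ A.T
             + np.diag((u < 0).astype(float)) + delta * np.eye(m))
        p = np.linalg.solve(H, -grad)
        # Armijo backtracking line search
        s, fu, slope = 1.0, f(u), grad @ p
        while f(u + s * p) > fu + 1e-4 * s * slope and s > 1e-16:
            s *= 0.5
        u = u + s * p
    # recover the primal solution from the penalized dual residual
    return np.maximum(A.T @ u - c, 0.0) / eps
```

On the toy LP min x1 + 2*x2 subject to x1 + x2 >= 2, x1 >= 0.5, x >= 0, this recovers the optimum (2, 0); since every iteration is a linear solve plus a line search, only a short computation graph needs to be unrolled for the backward pass.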

1. INTRODUCTION

Commonly used losses such as cross-entropy in deep neural network (DNN) models can be expressed as a sum over the per-sample losses incurred by the current estimate of the model. This allows the direct use of mature optimization routines and is sufficient for a majority of use cases. But in various applications ranging from ranking/retrieval systems to class-imbalanced learning, the most suitable losses for the task do not admit a "decompose over samples" form. Examples include the area under the ROC curve (AUC), multi-class variants of AUC, F-score, precision at a fixed recall (P@R) and others. Optimizing such measures in a scalable manner can pose challenges even in the shallow setting. Since the compromise involved in falling back on a decomposable loss, when a non-decomposable objective may be more appropriate for the task at hand, can range from negligible to concerning depending on the application, the last few years have seen a number of interesting approaches proposed to efficiently deal with such structured and non-decomposable losses. To this end, recent algorithms for AUC maximization based on convex surrogate losses have been developed for linear models Liu et al. (2018); Natole et al. (2018) and in conjunction with deep neural networks Liu et al. (2019), and stochastic and online variants (Ataman et al. (2006); Cortes & Mohri (2004); Gao et al. (2013); Liu et al. (2018; 2019)) are also available. Measures other than the AUC have also been studied: exact algorithms for optimizing the F-score Nan et al. (2012); Dembczynski et al. (2011), direct optimization of average precision Song et al. (2016), scalable methods for non-decomposable objectives Eban et al. (2017); Venkatesh et al. (2019), and structured hinge-loss upper bounds for optimizing average precision and NDCG Mohapatra et al. (2018). Recently, the AP-Perf method Fathony & Kolter (2020) showed how custom non-decomposable performance measures can be conveniently incorporated into differentiable pipelines. It is known that a number of these non-decomposable objectives can be expressed as integer programs that can be relaxed to linear programs (LPs). Earlier approaches adapted methods for structured SVMs Joachims et al. (2009) or cutting-plane techniques Yue et al. (2007); these were interesting but had difficulty scaling to larger datasets. More recently, strategies have instead focused
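To make the decomposability distinction discussed above concrete, the following small sketch (function names are our own, illustrative choices) contrasts cross-entropy, a mean of independent per-sample terms, with AUC, which couples every positive sample with every negative one and therefore cannot be written as such a per-sample mean.

```python
import numpy as np

def cross_entropy(scores, labels):
    # decomposable: a mean over independent per-sample terms
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def auc(scores, labels):
    # non-decomposable: ranks every positive against every negative,
    # i.e., the fraction of correctly ordered positive/negative pairs
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).astype(float)
    ties = (pos[:, None] == neg[None, :]).astype(float)
    return np.mean(correct + 0.5 * ties)
```

For instance, with scores (0.9, 0.8, 0.3, 0.1) and labels (1, 0, 1, 0), three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75; no reweighting of per-sample terms reproduces this pairwise structure, which is what motivates the LP formulations surveyed above.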

