DIFFERENTIABLE OPTIMIZATION OF GENERALIZED NONDECOMPOSABLE FUNCTIONS USING LINEAR PROGRAMS

Abstract

We propose a framework which makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures such as AUC, multi-class AUC, F-measure and others. A common feature of the optimization model that emerges from these tasks is that it involves solving a Linear Program (LP) during training where representations learned by upstream layers influence the constraints. The constraint matrix is not only large but the constraints are also modified at each iteration. We show how adopting a set of influential ideas proposed by Mangasarian for 1-norm SVMs, which advocate solving LPs with a generalized Newton method, provides a simple and effective solution. In particular, this strategy needs little unrolling, which makes the backward pass more efficient. While a number of specialized algorithms have been proposed for the models that we describe here, our module turns out to be applicable without any specific adjustments or relaxations. We describe each use case, study its properties and demonstrate the efficacy of the approach over alternatives which use surrogate lower bounds and, often, specialized optimization schemes. Frequently, we achieve superior computational behavior and performance improvements on common datasets used in the literature.

1. INTRODUCTION

Commonly used losses such as cross-entropy in deep neural network (DNN) models can be expressed as a sum over the per-sample losses incurred by the current estimate of the model. This allows the direct use of mature optimization routines, and is sufficient for a majority of use cases. But in various applications ranging from ranking/retrieval systems to class-imbalanced learning, the most suitable losses for the task do not admit a "decompose over samples" form. Examples include the Area under the ROC curve (AUC), multi-class variants of AUC, F-score, Precision at a fixed recall (P@R) and others. Optimizing such measures in a scalable manner can pose challenges even in the shallow setting. Since the compromise involved in falling back on a decomposable loss, when a non-decomposable objective may be more appropriate for the task at hand, can range from negligible to concerning depending on the application, the last few years have seen a number of interesting approaches proposed to efficiently deal with such structured and non-decomposable losses. To this end, recent algorithms for AUC maximization have been developed based on convex surrogate losses Liu et al. Recently, the AP-Perf method Fathony & Kolter (2020) showed how custom non-decomposable performance measures can be conveniently incorporated into differentiable pipelines. It is known that a number of these non-decomposable objectives can be expressed in the form of an integer program that can be relaxed to a linear program (LP). Earlier approaches adapted methods for structured SVMs Joachims et al. (2009) or cutting plane techniques Yue et al. (2007) and were interesting, but had difficulty scaling to larger datasets. More recently, strategies have instead focused on the stochastic setting where we operate on a mini-batch of samples. A common strategy, which is generally efficient, is to study the combinatorial form of one or more losses of interest and derive surrogate lower bounds.
These are then tackled via specialized optimization routines. The arguably more direct alternative of optimizing the measure as a module embedded in a DNN architecture remained difficult until recently, but OptNet Amos & Kolter (2017) and CVXPY now offer support for solving certain objectives as differentiable layers within a network. Further, methods based on implicit differentiation have been applied to various problems, showing impressive performance. Our approach is based on the premise that tackling the LP form of the non-decomposable objective as a module within the DNN, one which permits forward and reverse mode differentiation and can utilize in-built support for specialized GPU hardware in modern libraries such as PyTorch, is desirable. First, as long as a suitable LP formulation for an objective is available, the module may be directly used. Second, based on which scheme is used to solve the LP, one may be able to provide guarantees for the non-decomposable objective based on simple calculations (e.g., number of constraints, primal-dual gap). The current tools, however, do not entirely address all these requirements, as we briefly describe next. A specific characteristic of the LPs that arise from the losses mentioned above is that the constraints are modified at each iteration, as a function of the updates to the representations of the data in the upstream layers. Further, the mini-batch of samples changes at each iteration. Solvers within CVXPY are effective but, due to their general-purpose nature, rely on interior point methods. OptNet is quite efficient but designed for quadratic programs (QPs): its theoretical results and efficiency depend on factorizing the matrix in the quadratic term of the objective (which is zero/noninvertible for LPs). The primal-dual properties and implicit differentiation for QPs do not easily translate to LPs, and efficient ways of dealing with constraints that are iteratively updated are not widely available at this time.
In principle, of course, backpropagating through a convex optimization model (and in particular, LPs) is not an unsolved problem. For LPs, we can take derivatives of the optimal value (or the optimal solution) of the model with respect to the LP parameters, and this can be accomplished by calling a powerful external solver. Often, this would involve running the solver on the CPU, which introduces overhead. The ideas in Meng et al. (2020) are relevant in that the optimization steps for the LP are unrolled and only involve simple linear algebra operations, but the formulation is applicable when the number of constraints is about the same as the number of variables, an assumption that does not hold for the models we will study. In §3, we show that the modified Newton's algorithm in Mangasarian (2004) can be used for deep neural network (DNN) training in an end-to-end manner without requiring an external solver (for which GPU support remains limited). Specifically, by exploiting self-concordance of the objective, we show that the algorithm converges globally without line search strategies. On the practical side, we analyze the gradient properties and describe some modifications that improve stability during backpropagation. We show that this scheme, based on Mangasarian's parametric exterior penalty formulation of the primal LP, can be a computationally effective and scalable strategy to solve LPs with a large number of constraints.

2. NONDECOMPOSABLE FUNCTIONS AND CORRESPONDING LP MODELS

We first present a standard LP form and then reparameterize several generalized nondecomposable objectives in this form, summarized in Table 5 in the appendix. We start with the binary AUC, extend it to multi-class AUC, and then show a ratio objective, the F-score. Some other objectives are described in the appendix.

2.1. NOTATIONS AND GENERALIZED LP FORMULATION

Notations. We use the following notations. LP formulation. We consider a general linear program (LP) that contains nonnegative variables as well as inequality and equality constraints. The form of the LP is

max_{u,v} g^T u + h^T v   s.t.   Eu + Fv ≤ p,   Bu + Gv = q,   u, v ≥ 0.   (1)

We can write it more succinctly with variable x = [u; v]; coefficients c = [−g; −h]; constraint matrix A = [E F; B G; −B −G] (the equality constraint is split into two opposing inequalities); and constants b = [p; q; −q]. The corresponding primal LP becomes min_x c^T x s.t. Ax ≤ b, x ≥ 0.
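The block stacking above is purely mechanical; a minimal NumPy sketch (the helper name to_standard_form is ours, not from the paper):

```python
import numpy as np

def to_standard_form(E, F, B, G, g, h, p, q):
    """Rewrite  max g^T u + h^T v  s.t.  Eu + Fv <= p, Bu + Gv = q, u, v >= 0
    as          min c^T x          s.t.  Ax <= b, x >= 0,
    splitting the equality into two opposing inequalities."""
    A = np.block([[E, F], [B, G], [-B, -G]])
    b = np.concatenate([p, q, -q])
    c = np.concatenate([-g, -h])
    return A, b, c
```

Given the blocks of (1), this returns the (A, b, c) used by the primal LP throughout the paper.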

2.2. MAXIMIZING AUC

The Area under the ROC Curve (AUC) calculates the probability that a classifier f(·) will rank a randomly chosen positive sample higher than a randomly chosen negative sample. AUC varies between 0 and 1, where 1 represents all positives being ranked above the negatives. AUC may be estimated using the Wilcoxon-Mann-Whitney (WMW) statistic, Σ_{i=1}^{|X⁺|} Σ_{j=1}^{|X⁻|} 1_{f(x_i) > f(x_j)} / (|X⁺||X⁻|), for x_i : i ∈ {1, ..., n}. Here, we follow Ataman et al. (2006) to calculate the AUC based on the WMW statistic as follows:

min_{z_ij} Σ_{i=1}^{|X⁺|} Σ_{j=1}^{|X⁻|} z_ij   s.t.   f(x_i) − f(x_j) ≥ ε − z_ij, where x_i ∈ X⁺, x_j ∈ X⁻;   z_ij ≥ 0,   (2)

where ε is a given constant. The model in (2) maximizes AUC indirectly, by minimizing the number of pairs (one each from the positive and negative classes) where the positive sample is not ranked higher than the negative sample: so, the number of zero entries in z equals the number of pairs where this failure does not occur. To compute the AUC, we do

AUC = (n̄ − ‖z‖₀) / (|X⁺||X⁻|) = Σ_i Σ_j ε⁻¹ max(0, ε − z_ij) / (|X⁺||X⁻|),

where n̄ = |X⁺||X⁻| is the number of pairs. If z_ij is 0, then ε⁻¹ max(0, ε − z_ij) equals 1. Otherwise z_ij > 0, and it follows from the first constraint in (2) that z_ij ≥ ε, so ε⁻¹ max(0, ε − z_ij) equals 0. Observe that in (2), the number of constraints is |X⁺||X⁻|, which is quadratic in n.
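As a concrete instance, the constraint data of (2) and the AUC recovery formula can be sketched as follows (helper names auc_lp_data and auc_from_slacks are ours; each slack z_ij is one LP variable):

```python
import numpy as np

def auc_lp_data(scores_pos, scores_neg, eps=0.1):
    """Constraint data of model (2): one slack z_ij per (positive, negative)
    pair, with  -z_ij <= f(x_i) - f(x_j) - eps  and  z_ij >= 0."""
    diffs = scores_pos[:, None] - scores_neg[None, :]   # f(x_i) - f(x_j)
    m = diffs.size
    A = -np.eye(m)                                      # -z <= diffs - eps
    b = (diffs - eps).ravel()
    c = np.ones(m)                                      # minimize sum_ij z_ij
    return A, b, c

def auc_from_slacks(z, eps, n_pos, n_neg):
    # each z_ij = 0 contributes 1 (pair ranked correctly with margin eps)
    return np.maximum(0.0, eps - z).sum() / (eps * n_pos * n_neg)
```

For example, scores (2, 0) for the positives and (1,) for the negative give one correctly and one incorrectly ranked pair, hence AUC 0.5.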

2.3. MAXIMIZING MULTI-CLASS AUC

An extension of AUC to the multi-class setting, AUCµ, is defined in Kleiman & Page (2019). The AUCµ objective optimizes an indicator computed on the orientation function O_{i,j}, defined as follows. Definition 2.2 (Orientation Function; Kleiman & Page (2019)). Assume we have K classes {y_1, ..., y_K}. Let f(x_i, ·) ∈ R^K indicate the model's prediction on x_i for each of the K classes (class-specific probability). Let x*_i provide the index of x_i's true class label. Let P ∈ R^{K×K} be a partition matrix where P_{k,k'} is the cost of classifying a sample as class k when it should be k'. Define v_{k,k'} = P_{k,·} − P_{k',·}, and v = v_{x*_i x*_j} ∈ R^K. Then, O_{i,j} = (v_{x*_i} − v_{x*_j})(⟨v, f(x_i, ·)⟩ − ⟨v, f(x_j, ·)⟩). Here, v_{k,k'} ranks the instances by their cost difference between assignments to class k and k'. For two classes, say indexed by 1 and 2, v_{1,2} · f = 0 is the decision boundary between the classes. The ranking is correct for x_i and x_j (with correct labels 1 and 2 resp.) if the orientation of the points w.r.t. the decision boundary is the same as that of the class labels (converted to one-hot vectors). Then AUCµ = Σ_{k<k'} g_{k,k'} Σ_{i∈D_k, j∈D_{k'}} 1_{O_{i,j} ≥ 0}, where D_k is the index set of samples having the same class label k and g is a constant derived from class sizes. In formulating the linear program, we observe that the model depends on how P is defined. One way is to set P(k, k) = 0 ∀k and 1 for all other entries. We can also define a P with arbitrary entries, or formulate AUC in a one-vs-all setting. Here, we show the model for the first case (the other two cases are discussed in Appendix A.1). Similar to our previous formulation, the goal is to minimize the sum of negative values of the orientation function, rather than maximize the sum of positive values. Let f̄(i, j) = f(x_i, x*_i) − f(x_j, x*_i) + f(x_j, x*_j) − f(x_i, x*_j). Then the LP formulation is:

AUC^bin_µ :   min_{z_ij} Σ_{i=1}^n Σ_{j: x*_i < x*_j} z_ij   s.t.   d_ij f̄(i, j) ≥ ε − z_ij, ∀ i, j : x*_i < x*_j;   z_ij ≥ 0.   (4)

Here d_ij = v_{x*_i} − v_{x*_j}. This can be seen as an extension of our binary AUC model, where z_ij is the ranking slack for a pair of points defined over multiple classes.

2.4. MAXIMIZING F-SCORE

The F-score (or F-measure) is representative of objectives expressed as ratios of some combination of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). The general form of these ratio functions and formulations for other objectives are in Appendix A.2. Specifically, the F-score is defined as follows. Definition 2.3 (F-score). F-score = 2(Precision × Recall)/(Precision + Recall) = 2TP/(2TP + FP + FN) = 2(Y^T Ŷ)/(1^T Y + 1^T Ŷ). The second equality comes from simplifying the precision (TP/(TP + FP)) and recall (TP/(TP + FN)) based formula Dembczynski et al. (2011). The last is obtained by replacing TP with Y^T Ŷ, FP with (1 − Y)^T Ŷ and FN with Y^T (1 − Ŷ), as functions of the label vector Y and the prediction Ŷ. This leads to the following integer fractional optimization model:

F-score = max_Ŷ c^T Ŷ / (1^T Ŷ + b)   s.t.   Ŷ_i ∈ {0, 1}, i = 1, ..., n,

where c = 2Y and b = Σ_{i=1}^n Y_i. To solve this, we first relax the integrality constraint on Ŷ and reformulate the model as the following LP, by introducing two variables z ∈ R^n and t ∈ R, where z = bŶ/(1^T Ŷ + b), t = b/(1^T Ŷ + b) and i ∈ {1, ..., n}:

max_{z,t} c^T z / b   s.t.   (a) 1^T z + bt = b;   (b) z_i ≤ t;   (c) φ(f(x_i)) t ≤ z_i ≤ (1 + φ(f(x_i))) t;   1 ≥ z_i, t ≥ 0.

Remark 1. (a) ensures the appropriate relation between z, t and Ŷ and is essentially a reformulation of the ratio objective as a linear function with a fixed denominator; Ŷ is recovered from the solution to the linear program by computing Ŷ_i = z_i/t when t > 0 and Ŷ_i = z_i otherwise. (b) sets an upper bound Ŷ_i ≤ 1. (c) ties the output of the previous layer φ(f(x_i)) (a classifier score for Ŷ_i, see the definition in Section 2.1) into this layer as an input. Assume φ(·) ∈ [−1, 1] (ensured if φ is sigmoid or tanh) indicates the class label by its sign. We want Ŷ_i ≥ φ(f(x_i)) and Ŷ_i ≤ 1 + φ(f(x_i)). If Ŷ_i ∈ {0, 1}, these two constraints ensure that Ŷ_i = 0 when φ(f(x_i)) ≤ 0 and Ŷ_i = 1 when φ(f(x_i)) > 0. When Ŷ_i is relaxed and replaced with z and t, we get the equivalent form (c). This model imposes 4n constraints for n samples. Since this is a maximization, a solution to the LP, O_Ŷ, is an upper bound on the integer objective opt*, and serves as the loss.
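To make the construction concrete, the following sketch evaluates the F-score identity in Definition 2.3 and assembles the constraint blocks (a)-(c) over the stacked variable x = [z; t] (helper names are ours; the bounds 0 ≤ z_i ≤ 1, t ≥ 0 are left as variable bounds):

```python
import numpy as np

def f_score(Y, Y_hat):
    # 2 TP / (2 TP + FP + FN) = 2 Y^T Y_hat / (1^T Y + 1^T Y_hat)
    return 2.0 * (Y @ Y_hat) / (Y.sum() + Y_hat.sum())

def fscore_lp_data(phi, Y):
    """Constraint blocks over x = [z; t]:
    (a) 1^T z + b t = b;  (b) z_i <= t;  (c) phi_i t <= z_i <= (1+phi_i) t."""
    n, b = len(phi), float(Y.sum())
    I = np.eye(n); one = np.ones((n, 1))
    A_eq = np.concatenate([np.ones(n), [b]])[None, :]       # (a)
    A_ub = np.vstack([
        np.hstack([I, -one]),                               # (b)  z_i - t <= 0
        np.hstack([-I, phi[:, None]]),                      # (c)  phi_i t - z_i <= 0
        np.hstack([I, -(1.0 + phi)[:, None]]),              # (c)  z_i - (1+phi_i) t <= 0
    ])
    c = np.concatenate([-2.0 * Y / b, [0.0]])               # maximize c^T z / b
    return A_eq, np.array([b]), A_ub, np.zeros(3 * n), c
```

This matches the 4n-constraint count once the n box constraints z_i ≤ 1 are included.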

3. BACKPROPAGATION VIA FAST EXTERIOR PENALTY OPTIMIZATION

Unlike traditional feedforward networks, where the output of each layer is a relatively simple (though nonlinear) function of the previous layer, an LP layer must solve a constrained optimization problem; implementing scalable and efficient backpropagation schemes that minimize computational overhead therefore requires more care and, as reviewed in §1, is an active topic of research. This problem is, of course, not unique to LPs and also manifests in differentiable sorting Mena et al.; recent work shows that we can differentiate through LP formulations of sorting/ranking exactly by using smooth approximations of projection steps. Berthet et al. (2020) describe an interesting approach where one computes approximate gradients through ranking/shortest-path problems by stochastic perturbation techniques. Remark 2. Some previous works Zeng et al. (2019) have considered LPs where the constraints are deterministic (for a fixed input dimension), i.e., do not depend on the data X, which is different from the LPs in §2.2-2.4. Note that the perturbation techniques in Berthet et al. (2020) are applicable to our LPs as well. For example, the Fenchel-Young loss as defined in Berthet et al. (2020) is attractive because there is no need to compute the Jacobian. From the implementation standpoint, one could simply think of the backward pass as a function given the input and output of the forward pass. But the gradient expression of the loss is an expectation and hence may require multiple calls to an LP solver in order to approximate it, although parallelization and warm starts were shown to alleviate this dependency by sampling in parallel. Rationale. Consider an LP with a large number of constraints m in a fixed dimension n (n ≪ m). This assumption holds in all formulations in §2: the architecture (and hence n) is fixed, whereas the minibatch size depends on the complexity of the task (e.g., larger batches when the noise in the gradient is high).
Hence, solving such LPs using off-the-shelf solvers as in Berthet et al. (2020); Agrawal et al. (2019) is not ideal here, although the latter does offer benefits over Amos & Kolter (2017) for sparse QPs. Our strategy is to run Mangasarian's Newton method on an exterior penalty function of the LP. There are two advantages: (i) during the forward pass, the quadratic local convergence of Newton's method indicates that unrolling the method may be a reasonable choice; and further, (ii) based on the relationship between dual and primal variables, and the exactness of the exterior penalty, we can show that the backward pass is independent of m. We will discuss both of these results and some modifications to deal with the discontinuous Hessian (and its inverse) that is required for Newton's method. A similar approach is proposed in Amos & Kolter (2017), in which a primal-dual interior point method with implicit differentiation is used. But the exterior penalty in (6) satisfies an interesting property: primal and dual solutions are related by a closed-form expression, which can be exploited for efficient backpropagation.

3.1. FORWARD PASS USING NEWTON'S ALGORITHM

A key requirement for fast automatic (forward or reverse mode) differentiation is that we can perform the forward pass efficiently. In our setting, we seek to solve and backpropagate through an LP. We will focus on reverse mode differentiation since it is the most suitable for DNN training. Given a primal LP, for a fixed accuracy ε > 0, Mangasarian (2004) solves an unconstrained problem,

min_y g(y) := (1/2) ‖σ(Ay − b)‖² + ε c^T y,   (6)

where σ(·) = max(·, 0) represents the elementwise relu function. A modified Newton's method can be used to solve equation 6 via the iterations

y ← y + λd,   where d = −H(y)⁻¹ ∇g(y),   H(y) := ∇²g(y) + ρI.   (7)

In large scale settings of A, b, such Newton methods are known to perform empirically better than gradient descent Mangasarian (2006); Keerthi et al. (2007). We will discuss whether this holds for our purposes shortly. Why is Newton's method applicable for minibatches? In general, the convergence of Newton's method depends strongly on initialization, i.e., we can only provide local convergence results. However, this is not the case for our problems since, in our examples, either the level sets are bounded from below or the feasible set is compact, as noted in Mangasarian (2004). There are two reasons why this result, by itself, is insufficient for our purposes: (i) it assumes that we can perform a line search satisfying the Armijo condition; (ii) even with line search, the result does not provide a rate of convergence. In DNN training, such line-search based convergence results can be prohibitively expensive. The main difficulty is handling the discontinuity in the Hessian. As a remedy, we use the self-concordance of (6) to guarantee global convergence of the iterations (7) for the exterior penalty formulation in (6). To do so, we first show a result (proof in the Appendix) that characterizes the discrepancy between the actual Hessian of (6) and the modified one in (7) when A is randomly distributed. Lemma 1.
Assume that A is a random matrix, and fix some y ∈ R^n. Then, with probability one, g in equation 6 is quadratic (determined by H and ∇g(y)) over a sufficiently small neighborhood of y. Intuitively, Lemma 1 states that with probability one, each y has a neighborhood in which the Hessian is constant. In addition, the modified Hessian is nonsingular at all points (in particular at the optimal y*), and so we can then show the following global convergence result. Theorem 2. Newton's method converges globally at a linear rate, with local quadratic convergence. Proof. (Sketch) First, we use the fact that the objective in equation 6 is piecewise quadratic, and hence self-concordant. Second, observe that the possible choices of Hessian are finite, so we can choose ρ > 0 such that there is a descent direction without line search, that is, there exists a step size λ > 0 such that λ∇g(y)^T d < 0. Finally, we use Theorem 4.1.12 in Nesterov (2013) to claim the result. Please see the Appendix for the full proof. The loss function is almost surely quadratic in the neighborhood of any y, which intuitively suggests local quadratic convergence independent of the starting point. Remark 3. Convergence in Thm. 2 is guaranteed under standard constraint qualification assumptions. The Linear Independence Constraint Qualification (LICQ) is satisfied for the AUC and multi-class AUC formulations in §2. But the F-score formulation does not satisfy LICQ, hence we need safeguarding principles in the initial iterations (until the iterates get close to the optimal solution).
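A minimal NumPy sketch of the forward pass, i.e., the iterations (7) applied to the penalty (6), assuming a fixed unit step λ = 1 and a small ridge ρ (the function and parameter names are ours):

```python
import numpy as np

def penalty_newton(A, b, c, eps=1e-2, rho=1e-4, iters=50):
    """Sketch: minimize g(y) = 0.5 * ||relu(A y - b)||^2 + eps * c^T y  (eq. (6))
    with the modified Newton iteration of eq. (7), using a unit step."""
    n = A.shape[1]
    y = np.zeros(n)
    for _ in range(iters):
        r = np.maximum(A @ y - b, 0.0)            # sigma(Ay - b)
        grad = A.T @ r + eps * c                  # gradient of g
        act = (A @ y - b) > 0                     # rows with active penalty
        H = A[act].T @ A[act] + rho * np.eye(n)   # modified Hessian
        y = y - np.linalg.solve(H, grad)          # Newton step d = -H^{-1} grad
        if np.linalg.norm(grad) < 1e-10:
            break
    return y
```

On the toy LP max x₁ + x₂ s.t. x ≤ 1, x ≥ 0, the iterates converge to (1 + ε, 1 + ε), i.e., the LP solution up to an O(ε) perturbation introduced by the penalty.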

3.2. BACKWARD PASS USING OPTIMAL DUAL VARIABLES AIDED BY UNROLLING

The advantage of optimizing the exterior penalty in (6) is that, given an iterate y_t and accuracy ε, we can extract the optimal dual solution v_t by simple thresholding, that is, v_t = (1/ε) σ(Ay_t − b). By complementary slackness, the nonzero coordinates of v_t specify the set of active constraints in Ax ≤ b. So, given an approximate solution y_t such that ∇g(y_t)^T H(y_t)⁻¹ ∇g(y_t) ≤ ε, to compute the primal solution x*, we solve the "active" linear system Ã x = b̃, where Ã denotes the active rows of A and b̃ the corresponding subvector of b. Hence, backpropagation through the layer reduces to computing derivatives of Ã⁻¹b̃, which is simple via automatic differentiation. How to choose ε? If we can successfully retrieve the active constraints at the optimal solution, we do not need to store the intermediate iterates y_t at all during the forward pass (memory efficient). However, setting ε correctly can be tricky for arbitrary polyhedra, since it depends on geometric properties such as facets and vertices that may be difficult to enumerate. One possible way to get around this is to use a "burn-in" period in which we increase ε slightly in each iteration (of deep network training) and backpropagate through the unrolled Newton iterations during this period. Once we see that the convergence profile has stabilized, we can fix ε at that value, start using the complementarity conditions, and derive the active linear system Ã⁻¹b̃ as discussed above. How to backpropagate through unrolled iterations? We assume that the chain rule is applicable up to this LP layer, i.e., we need ∂L/∂x · ∂x/∂A (for one of the parameters, A), and note that it is possible to find ∂L/∂x (either directly or using a chain rule). Therefore, we focus our attention on ∂x/∂A, which involves the LP layer. Indeed, each unrolled iteration of (7) is equivalent to a "sublayer". So, in order to backpropagate, we have to derive the partial derivatives of each operation or step w.r.t. the LP parameters A, b, and c.
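The thresholding and active-system steps can be sketched as follows (recover_primal is our name, not the paper's; we use least squares in case Ã is not square):

```python
import numpy as np

def recover_primal(A, b, y_t, eps):
    """Duals by thresholding, v = relu(A y_t - b) / eps; nonzero coordinates
    flag the active constraints. Then solve the active system A~ x = b~."""
    v = np.maximum(A @ y_t - b, 0.0) / eps
    act = v > 0
    x, *_ = np.linalg.lstsq(A[act], b[act], rcond=None)
    return x, v
```

Because only the (small) active solve appears in the computation graph, the backward pass here does not depend on the total number of constraints m.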

Our goal is to compute ∂d/∂A, where d = −Q⁻¹u, Q = H and u = ∇g(y). We can use the product rule to arrive at ∂d = (u^T Q⁻¹ ⊗ Q⁻¹) ∂Q − Q⁻¹ ∂u. To see this, note that we have used the chain rule to differentiate through the matrix inverse in the first term. The second term is easy to compute, similar to the computation of the Hessian. For each of these terms we eventually have to compute ∂Q/∂z or ∂u/∂z, where z ∈ {c, A, b}, which can also be done by another application of the chain rule. Please see Appendix A.4 for an empirical verification of the unrolled gradient against the one provided by Ã⁻¹b̃. Before proceeding, we should note an issue that comes up when differentiating each step of the unrolled algorithm, due to the fact that the Hessian is piecewise constant as a function of the input to that particular sublayer. Here, some numerical approximations are needed, as we describe below.
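The product rule expression can be sanity-checked numerically. Below, taking the Newton direction d = −Q⁻¹u with Q symmetric, the Jacobian with respect to Q is −(dᵀ ⊗ Q⁻¹) under column-major vectorization; we compare one of its columns against a finite difference (helper names are ours):

```python
import numpy as np

def newton_dir(Q, u):
    # d = -Q^{-1} u, the (modified) Newton direction
    return -np.linalg.solve(Q, u)

def dd_dQ(Q, u):
    # delta d = -Q^{-1} (delta Q) d  =>  Jacobian -(d^T kron Q^{-1}),
    # acting on vec(delta Q) stacked in column-major order
    d = newton_dir(Q, u)
    return -np.kron(d[None, :], np.linalg.inv(Q))
```

Perturbing a single entry Q_{ij} and differencing reproduces the corresponding Jacobian column, which is the check used below.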

Remark 4. Note that the diagonal matrix term in ∂Q/∂A is nondifferentiable due to the presence of the step function. However, the step function is piecewise constant and hence has zero derivative almost surely: in any bounded set S, for x ∈ S with a ball B_r(x) ⊆ S of radius r > 0, the Lebesgue measure of the set of nondifferentiable points in S is zero. Please see Appendix B.3 for a formal justification, where we show this by approximating the step function using a sequence of logistic functions with increasing slope parameter at the origin. Therefore, in this setting, Remark 4 provides a way to compute an approximate sub-gradient when using Newton's method based LP layers. The penalty function itself is piecewise quadratic and differentiable everywhere, and the inverse of the Hessian acts as a preconditioner. Summary. Our forward pass involves three steps: 1. a finite number of Newton steps, from which we 2. compute the dual variables by a thresholding operation, and 3. finally, to get the primal solution, use these dual variables to identify the active constraints and solve the resulting linear system. In order to backpropagate through these three steps, we must differentiate through each layer of our procedure, including Ã⁻¹b̃, independent of whether we use unrolling or Danskin's theorem. Using Danskin's theorem in this setting would involve differentiating through the fixed point of the Newton iterations, similar to the (regularized) gradient descent iterations considered in the iMAML work Rajeswaran et al. (2019).

4. EXPERIMENTS

In this section, we conduct experiments on commonly used benchmarks to show that our framework can be used to optimize several different objectives within deep neural networks and leads to performance gains. We start with binary AUC optimization, and then extend to multi-class AUC optimization and F-score optimization. We also show that nonnegative matrix factorization can be optimized in its linear programming form within our framework.

Optimizing Binary AUC

We follow the current state-of-the-art work on AUC optimization Liu et al. (2019). Construction of imbalanced datasets: we construct an imbalanced binary classification task by using half of the classes as the positive class and the other half as the negative class, and dropping samples from the negative class at a certain ratio, which is reflected by the positive ratio (the ratio of the majority class to the minority class) in Table 1. Experimental setting. We use a ResNet-18 He et al. (2016) as the deep neural network for all algorithms. During optimization, the batch size is set to 64. The initial learning rate is tuned in {0.1, 0.01, 0.001} and decays by 2/3 at the 2k, 10k and 25k-th iterations. We train for 40k iterations in total. The ε in Newton's method is 0.001. We use the same random seed, learning rate and total number of iterations in all of our experiments, including the multi-class AUC and F-score experiments. Results. Our method is competitive with Liu et al. (2019) and outperforms cross-entropy loss by a large margin, especially on imbalanced datasets, where the AUC objective shows its superiority over cross-entropy loss. Influence of ε. We report the influence of ε on our algorithm in Table 2, where we choose Cat&Dog as an example to test different ε. We can see that ε = 0.1 gives slightly worse performance than ε = 0.01, while ε = 0.001 performs much worse. To choose ε, we follow the approach proposed by Mangasarian (2004): if for two successive values ε₁ > ε₂ the value of the perturbed quadratic function is the same, then the solution is the least 2-norm solution of the dual. Therefore, we simply choose an ε that satisfies this property, which is 0.01 in our experiments.

Optimizing Multiclass AUC

We further demonstrate our method for optimizing multi-class AUC. Similar to the previous section on binary AUC, we construct imbalanced multi-class datasets by dividing each dataset into 3 classes, dropping samples from 2 of them, and reporting the one-versus-all AUC (denoted AUC^ova) and the AUC^bin_µ score. For STL10, we group classes 0-2, 3-5 and 6-9 into the three classes, and drop samples from the first two classes. For CIFAR100, we group classes 0-32, 33-65 and 66-99 into three classes, and also drop samples from the first two classes. Results. Results are in Table 3. In addition to the one-versus-all AUC metric, we also report performance in terms of AUCµ Kleiman & Page (2019), which is specifically designed for measuring multi-class AUC and retains nice properties of binary AUC, such as being insensitive to class skew. Our method outperforms cross-entropy loss on all four datasets and under all skew ratios. In particular, the performance gain tends to be larger as the dataset becomes more imbalanced. Optimizing F-score. We show that by directly optimizing the F-score, we can achieve better performance on this measure than when using cross-entropy loss. In addition to cross-entropy loss, we perform evaluations with two other methods that can also directly optimize the F1-score. First, we replace our solver with CVXPY-SCS Agrawal et al. (2019), a differentiable general-purpose linear programming solver; second, we perform comparisons with AP-Perf Fathony & Kolter (2020), which offers differentiable optimization of the F-score using an adversarial prediction framework. The datasets and our setup to group them into two classes remain the same as in the binary AUC section. The results in Table 4 show that our method generally yields improvements over cross-entropy loss in terms of F-score.
Note that, unlike optimizing cross-entropy loss, when we optimize the F-score directly there exists a locally optimal point where assigning all examples to the positive class leads to an F-score of 66.7%. We observe this behavior on CIFAR10, CIFAR100, and STL10.

5. CONCLUSIONS

We demonstrated that various non-decomposable objectives can be optimized within deep neural networks in a differentiable way under the same general framework of LPs, using a modified Newton's algorithm proposed by Mangasarian. A number of recent papers have studied the general problem of backpropagating through convex optimization modules, and this literature provides several effective approaches, although scalability remains a topic of active research. Our work complements these results and shows that the operations needed can be implemented to utilize the capabilities of modern deep learning libraries. While our experimental results suggest that promising results on binary AUC, multi-class AUC and F-score optimization within DNNs are achievable, we believe that the module may have other applications where the number of constraints is large and data-dependent.

A.2.1 MAXIMIZING F-SCORE

Here, P stands for the measure of precision, P(Y, Ŷ) = TP/(TP + FP), and R stands for the measure of recall, given as R(Y, Ŷ) = TP/(TP + FN). Plugging this in, and replacing the formulations for TP, FP and FN (from Eq. A.2), we get

F-score(Y, Ŷ) = 2TP/(2TP + FP + FN) = 2(Y^T Ŷ)/(Σ_{i=1}^n y_i + Σ_{i=1}^n ŷ_i) = 2(Y^T Ŷ)/(1^T Y + 1^T Ŷ),

where y_i refers to the i-th element of Y (same for ŷ_i), and 1 represents the all-ones vector in R^n. Note that in training, since Y is generally provided, we can assume 1^T Y = β is constant (the number of examples in the positive class in the ground truth). We can also represent the values of 2Y as a coefficient vector c; then the optimization problem for finding the F-score can be written as

maximize_Ŷ c^T Ŷ / (1^T Ŷ + b)   (12)
subject to Ŷ_i ∈ [0, 1], i = 1, ..., n.

A.2.2 MAXIMIZING JACCARD COEFFICIENT AND F_β

The Jaccard coefficient and Dice index lead to formulations similar to the F-score. The Jaccard coefficient can be expressed as

Jacc(Y, Ŷ) = TP/(TP + FP + FN) = (Y^T Ŷ)/(Σ_i y_i + Σ_i ŷ_i − Σ_i y_i ŷ_i) = (Y^T Ŷ)/(1^T Y + (1 − Y)^T Ŷ).

The F_β measure,

F_β(Y, Ŷ) = (1 + β²) P(Y, Ŷ) × R(Y, Ŷ) / (β² P(Y, Ŷ) + R(Y, Ŷ)),

where β is a user-specified parameter (balancing the importance of precision and recall), also permits a similar formulation. Here we simply set c = (1 + β²)Y, d = 1 and b = β² 1^T Y.

A.2.3 MAXIMIZING P@R

We begin by defining the maximum precision at fixed minimum recall problem as

P@Rα = maximize P s.t. R ≥ α = maximize (Y^T Ŷ)/(1^T Ŷ) s.t. Y^T Ŷ ≥ α 1^T Y.

This is again a linear fractional objective with a linear constraint, so we can write it as an equivalent linear program using the same transformation, with c = Y, d = 1 and b = 0. R@P, on the other hand, directly leads to a linear program.

A.3 OPTIMIZING NON-NEGATIVE MATRIX FACTORIZATION (NMF)

Nonnegative matrix factorization (NMF) is different from the other objectives we presented in that it is primarily used in unsupervised learning and does not satisfy the criteria for a metric loss function. It can still be formulated in a generalized non-decomposable form because it (i) cannot be written as a sum over individual samples, and (ii) leads to a model where the constraints depend on learned features. Note that the purpose of discussing NMF in this context is not to provide a general-purpose solver for the problem, but instead to assess whether NMF layers can serve as a regularizer or a clustering module, e.g., for learning more interpretable attributes, for co-segmentation, and as a substitute for clustering. The top r rows of F contain so-called anchor words. Recht et al. (2012) proposed a data-driven model where the most salient features in the data are used to express the remaining features, given as V ≈ CV, where C is of size s × s. Assuming V admits a rank-r separable factorization, then V = Rᵀ [I_r 0; M 0] R V = CV. To show that this factorization is possible, we need to first make F square, which is why it is zero-padded to a size s × s matrix. Let p be any vector which is used as the coefficient in the objective of the corresponding LP. We demonstrate the applicability of our strategy to NMF by performing a rank-k factorization on Convolutional Neural Network (CNN) activations as an example, following Collins et al. (2018). Recall that the activation tensor of an image at some layer in a CNN has the shape V ∈ R^{c×h×w}, where h, w are the spatial sizes and c is the number of channels. We can reshape it into V ∈ R^{c×(h·w)} and calculate a rank-k NMF of V: V = FW. Each row W_j of the resultant W ∈ R^{k×(h·w)} can be reshaped into a heat map of dimension h × w which highlights regions in the image that correspond to the factor W_j. We show an example for k = 1, 2 in Fig. 2.
We can see that the heatmap consistently captures a meaningful part/concept in the examples. Currently, our memory consumption increases quickly with c, since the constraint matrix in our LP formulation is of size O(c²) × O(c²). This restricts our method to small c on a GPU with 11GB of memory (here, we use c = 20). This scaling issue could potentially be resolved by exploiting sparsity in the constraint matrix, but sparse matrix operations are currently not well supported on mainstream deep learning platforms such as PyTorch and TensorFlow. Since our method provides backward gradients for the NMF operation, the heatmaps generated here can, in fact, be used to construct a loss function during training in order to learn more interpretable models.

[Table: general LP coefficients g, h, E, F, p, B, G, q for each objective (AUC, AUC^bin_µ, F-score). ‡: f‡_ij = d_ij(f(x_i, y_C(x_i)) − f(x_j, y_C(x_i)) + f(x_j, y_C(x_j)) − f(x_i, y_C(x_j))); §: one block for each i ∈ [1, …, n].] We do not include NMF in this table, as its formulation as a general LP is more verbose, involving vectorization of matrices and Kronecker product calculations.

A.4 VERIFICATION OF UNROLLING GRADIENT AND THE ONE PROVIDED BY Ã-1b

We use the F-score formulation as an example. For an input sample x, the neural network predicts a score f(x); the scores for a batch of samples are then used to solve the linear programming form of the F-score and to construct the loss function. We compute the gradient from the final loss back to the predicted scores and compare two approaches: in the first (the one used in our experiments), we take z = Ã^{-1}b as the solution, so the gradient can be computed through a single linear solve; in the second, we directly use the iterate y_t produced by the Newton iterations as the solution and compute gradients by unrolling those iterations. We then compute the cosine similarity between the two gradient vectors. Over 100 randomly sampled batches, the average cosine similarity is 0.9991, which shows that the two gradients are highly consistent.

B PROOFS AND ADDITIONAL CALCULATIONS FOR SECTION 3

In this section, we provide the missing proofs and additional calculations for Section 3.

B.1 PROOF OF LEMMA 1.

Lemma 1 is restated here for convenience.

Lemma 3. Assume that A ∈ R^{m×n} is a random matrix, and fix some y ∈ R^n. Then with probability one, g in equation 6 is quadratic (given by H, ∇g(y)) over a sufficiently small neighborhood of y.

Proof. Using the integral form of the second-order Taylor expansion of σ²(y) = (max(0, y))², we can show that
$$g(y + h) - g(y) - h^T \nabla g(y) = \frac{1}{2}\, h^T A^T \mathrm{diag}(d)\, A h.$$
Since y is fixed, and assuming that the entries of A are chosen from a continuous distribution such that e_j^T A is uniformly distributed over the sphere, (e_j^T A h)² follows a Beta(1/2, (n−1)/2) distribution when h is drawn uniformly at random from the unit sphere, independently of A. This means that no matter what y is, there exists a sufficiently small h such that the left-hand side of equation 20 is false with probability one, and in that neighborhood diag(d) = H.
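The distributional claim in this proof can be sanity-checked by Monte Carlo simulation. For a fixed unit vector a and h drawn uniformly from the unit sphere, (a^T h)² should follow Beta(1/2, (n−1)/2), whose mean is 1/n and variance 2(n−1)/(n²(n+2)); the dimension and sample count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000
a = rng.standard_normal(n)
a /= np.linalg.norm(a)  # a fixed unit vector, playing the role of e_j^T A

h = rng.standard_normal((trials, n))
h /= np.linalg.norm(h, axis=1, keepdims=True)  # h uniform on the unit sphere

x = (h @ a) ** 2  # should follow Beta(1/2, (n-1)/2)
# Beta(1/2, (n-1)/2) has mean 1/n and variance 2(n-1)/(n^2 (n+2))
print(abs(x.mean() - 1 / n) < 0.01)
```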

B.2 PROOF OF THEOREM 2

Theorem 4. Assume that the primal LP has a unique optimal solution, and that the level set {x : Ax ≤ b, c^T x ≤ α} is bounded for all α (for dual feasibility). Then the short-step (no line search) Newton's method converges globally at a linear rate, with local quadratic convergence.

Proof. First, the objective function is piecewise quadratic, since it is a sum of piecewise quadratic functions. In particular, it is self-concordant, since its third derivative is zero wherever it exists. Now, choosing ρ sufficiently small, we see that an approximate solution of the problem with the modified Hessian is also an approximate solution to equation 6. Moreover, since the set of possible values of H is finite, the local norm (also known as the Newton decrement) ∇g(y)^T H(y)^{-1} ∇g(y) is finite. Hence, we can choose ρ > 0 so that there is a descent direction d, that is, there exists a step size λ > 0 such that λ∇g(y)^T d < 0. Finally, we use Theorem 4.1.12 in Nesterov (2013) to obtain the desired result. The assumptions in Theorem 4 are standard: 1. uniqueness can easily be enforced by randomly perturbing the cost vector; 2. in most of our formulations, we explicitly have bound constraints on the decision variables, hence the level sets are bounded.
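To illustrate the flavor of the iteration analyzed here, the following minimal sketch runs a generalized Newton method with the modified Hessian A^T diag(d) A + ρI on the simplified piecewise-quadratic penalty g(y) = ½‖(Ay − b)₊‖², a stand-in for equation 6; the random instance and the simple backtracking rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.5  # Ay <= b is strictly feasible by construction

def g(y):
    r = np.maximum(A @ y - b, 0.0)
    return 0.5 * r @ r

rho = 1e-6  # small ridge term making the generalized Hessian invertible
y = np.zeros(n)
for _ in range(100):
    r = np.maximum(A @ y - b, 0.0)
    grad = A.T @ r
    d = (A @ y - b > 0).astype(float)  # step-function indicator of active rows
    H = A.T @ (d[:, None] * A) + rho * np.eye(n)
    delta = np.linalg.solve(H, grad)
    step = 1.0  # Newton step with simple halving backtracking
    while g(y - step * delta) > g(y) - 1e-12 and step > 1e-10:
        step *= 0.5
    y -= step * delta

print(g(y))  # the penalty is driven to (numerically) zero
```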

B.3 DIFFERENTIATING THE STEP FUNCTION IN REMARK 4

We will use a slightly modified "suffix" notation, as in Brookes (2005), in our calculations. That is, for a matrix A, →A denotes vec(A), the vectorization of A obtained by stacking its columns. The following three properties relating the Kronecker product ⊗, vectorization, and differentials will be used often:

1. Fact 1: For two vectors a, b: a ⊗ b = vec(ba^T).
2. Fact 2: If A is a p × q matrix and B is an m × n matrix, then vec(∂B) = (∂B/∂A) vec(∂A), where ∂B/∂A is the (mn) × (pq) Jacobian matrix of B with respect to A. If A or B is a column vector or a scalar, then vec has no effect.

3. Fact 3: vec(∂(AXB)) = (B^T ⊗ A) vec(∂X).

Using the above facts, we can compute all the gradients needed to backpropagate through the unrolled iterations. We will show the computation for the gradient of Q^{-1}u with respect to A ∈ R^{m×n}, for a fixed u ∈ R^n. We can apply the chain rule to the composition f_2 ∘ f_1, where f_1(A) = A^T H A + ρI and f_2(X) = X^{-1}u, to get J_{f_2∘f_1} = J_{f_2} · J_{f_1}. Now, using Fact 2 on ∂(X^{-1}) = −X^{-1}(∂X)X^{-1} and some algebraic manipulation, we obtain
$$\mathrm{vec}(J_{f_2 \circ f_1}) = -\left(u^T (A^T H A + \rho I)^{-1} \otimes (A^T H A + \rho I)^{-1}\right) \mathrm{vec}(J_{f_1}).$$
We will now compute vec(J_{f_1}). Note that H is also a function of A, so using the product rule, we can write vec(J_{f_1}) as a sum of three derivatives, with respect to each of A, A^T, and H. The derivatives with respect to A and A^T are fairly straightforward to compute, so we will focus on computing the derivative with respect to H. To that end, we will use Fact 3, and show how to compute the derivative of the step function by approximating it with the logistic function. Note that these derivatives are used in computing the derivatives of the upstream network; using distributional derivatives, another application of the chain rule to the left-hand side of equation 22 results in the Dirac delta function, which is atomic, that is, has all its mass in a set of measure zero. Hence, this calculation provides a mathematical justification that the set of nondifferentiable points has measure zero for our training purposes. It is easy to formally verify this argument using differentiable tent functions as approximations of the Heaviside step function.
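Fact 3, together with the column-stacking convention for vec, can be verified numerically on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

def vec(M):
    return M.flatten(order="F")  # column-stacking vectorization

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)   # Fact 3: vec(AXB) = (B^T kron A) vec(X)
print(np.allclose(lhs, rhs))     # True
```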



†: extension of f(·) to the multi-class setting.



al. (2018); Natole et al. (2018) in a linear model, or in conjunction with a deep neural network Liu et al. (2019), as well as stochastic and online variations (Ataman et al. (2006); Cortes & Mohri (2004); Gao et al. (2013); Liu et al. (2018; 2019)). Methods for measures other than the AUC have also been studied: exact algorithms for optimizing the F-score Nan et al. (2012); Dembczynski et al. (2011), direct optimization of average precision Song et al. (2016), scalable methods for non-decomposable objectives Eban et al. (2017); Venkatesh et al. (2019), and a structured hinge-loss upper bound for optimizing average precision and NDCG Mohapatra et al. (2018).

(i) n: the number of samples used in training. (ii) X ∈ R^{n×d}: the explanatory features fed to a classifier (e.g., parameterized by w). (iii) f(x_i) (or f(i)): the score function of the classifier for x_i ∈ X, such that f(X) = wX. (iv) Y ∈ {0, 1}: target labels and Ŷ ∈ {0, 1}: predicted labels for binary classification, both in R^n. (v) φ(·): a nonlinear function applied to f(X). (vi) A ⊗ B: the Kronecker product of matrices A and B. (vii) I_r: the identity matrix of size r; 1 is the indicator function. (viii) B_{k,·} (and B_{·,k}): the k-th row (and k-th column) of B.

al. (2018) and formulating quadratic or cone programs Amos et al. (2017). One may unroll gradient descent steps Amos et al. (2017); Goodfellow et al. (2013); Metz et al. (2016), or use projections Zeng et al. (2019). Recently, Agrawal et al. (2019) introduced a package for differentiable constrained convex programming, which includes LPs as a special case. For LPs, Meng et al. (2020) presents an unrolled scheme and Blondel et al. (

et al. (2020) may slow down the training process. The strategy in Agrawal et al. (

Figure 1: ROC curves for multi-class AUC optimization on STL10 with a 90% drop rate. We divide STL10 into 3 classes and use one as the positive class and the other two as the negative class to plot the ROC.

13) This can equivalently be written as a linear fractional program, as shown in Model (10), where c = Y, d = (1 − Y) and b = 1^T Y. The rest of the construction is similar to the F-score. Note that F_β, which is defined as

Figure 2: NMF example. Three rows correspond to original images, k = 1 and k = 2 respectively.

Ay − b))_* ds² dt. (19) See Remark 1 in Golikov & Kaporin (2019) for details. Without loss of generality, we can assume b = 0 by simply translating the origin. Following the same remark, the diagonal matrix coincides with the step-function-based diagonal in H under the following condition on h:
$$e_j^T A h \cdot e_j^T (A y) < 0 \;\Longrightarrow\; |e_j^T A h| \le |e_j^T (A y)|. \qquad (20)$$

$$\frac{\partial}{\partial A}\, \mathrm{diag}\left((Ay - b)_*\right) \;\approx\; \frac{\partial}{\partial A}\, \mathrm{diag}\left(\frac{1}{1 + \exp\left(\kappa(-Ay + b)\right)}\right), \qquad \kappa > 0.$$
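The quality of this logistic surrogate away from the kink at 0 improves rapidly with κ. The sketch below (written with a tanh form of the logistic for numerical stability; the grid and thresholds are arbitrary) measures the sup-gap between the step function and its surrogate on |t| > 0.1:

```python
import numpy as np

def step(t):
    return (t > 0).astype(float)  # Heaviside step function 1(t > 0)

def logistic(t, kappa):
    # numerically stable logistic 1/(1 + exp(-kappa t)) via tanh
    return 0.5 * (1.0 + np.tanh(0.5 * kappa * t))

t = np.linspace(-1.0, 1.0, 2001)
gaps = []
for kappa in (10.0, 100.0, 1000.0):
    gap = float(np.max(np.abs(step(t) - logistic(t, kappa))[np.abs(t) > 0.1]))
    gaps.append(gap)
    print(kappa, gap)  # the sup-gap away from 0 shrinks as kappa grows
```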

(AUC). Let n be the number of samples. Let X₊ (resp. X₋) be the set of positive (resp. negative) samples, such that |X₊| + |X₋| = n, where |·| is the cardinality of the set. Then, AUC is given as
$$\mathrm{AUC} = \frac{1}{|X_+|\,|X_-|} \sum_{i \in X_+} \sum_{j \in X_-} \mathbf{1}\left[f(x_i) > f(x_j)\right].$$
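The pairwise (Wilcoxon) form of AUC, the fraction of (positive, negative) pairs that are ranked correctly with ties counted as half, can be computed directly; a minimal sketch on a toy batch:

```python
import numpy as np

def auc(scores, labels):
    # pairwise form: fraction of (positive, negative) pairs ranked correctly
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0, 0])
scores = np.array([0.9, 0.4, 0.5, 0.2, 0.1])
print(auc(scores, labels))  # 5 of 6 pairs ordered correctly -> 0.8333...
```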

al. (2019) to conduct experiments on optimizing the AUC score directly with deep neural networks. The baselines we compare against for binary AUC are the cross-entropy loss and two algorithms (PPD-SG and PPD-AdaGrad) from Liu et al. (2019). We use four datasets: Cat&Dog, CIFAR10, CIFAR100, and STL10. Cat&Dog is a dataset from Kaggle which contains 25000 images of cats and dogs; 80% of the dataset is used as the training set and the remaining 20% as the test set. STL10 is inspired by the CIFAR-10 dataset but with some modifications: each class in STL10 has fewer labeled training examples than in CIFAR-10. We follow Liu et al.

Table 1: Binary AUC optimization results on four benchmark datasets.

Ablation study on the Cat&Dog dataset. Results. The results are shown in Table 1. We can see that our method slightly outperforms Liu et al. (

Multiclass AUC optimization results on STL10 and CIFAR100. The drop rate is the proportion of samples dropped from two of the three classes.

F -score on four datasets.

Trigeorgis et al. (2014) and Collins et al. (2018). The description in this section and the experimental validation in the following section are a proof-of-principle instantiation of this idea. We briefly review, from Arora et al. (2012) and Recht et al. (2012), how NMF is written as an LP. We know from Arora et al. (2012) that an NMF decomposition V = FW, where F is s × s, W is s × w, and V and W have rows summing to 1, is 'separable' if the rows of W are simplicial and there is a permutation matrix R ∈ R^{s×s} such that RF = [ I_r M ]

Table showing the general LP coefficients for each model.

A APPENDIX

A.1 LP FORMULATION FOR MULTI-CLASS AUC

One way to extend binary AUC to the multi-class setting is by considering multiple one-versus-all pairs. This leads us to the following formulation: In our multi-class AUC experiments, we use this one-versus-all AUC as the training loss and report performance in terms of both one-versus-all AUC and AUC^bin_µ. In addition, we can also consider the setting of AUC_µ where P is set arbitrarily. In this case, the exact terms in the orientation function O proposed by Kleiman & Page (2019) can be written as follows: Note that AUC^arbit_µ has the same number of constraints and variables as AUC^bin_µ. Once the LPs are solved, the loss function is calculated in the same way as for binary AUC.
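The one-versus-all reduction described above can be sketched directly as an evaluation metric; the score matrix and labels below are an illustrative toy batch, and the helper averages the binary AUC obtained by treating each class in turn as positive:

```python
import numpy as np

def binary_auc(pos, neg):
    # pairwise (Wilcoxon) AUC with ties counted as half
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def ova_auc(scores, labels):
    # one-versus-all: for each class c, use column c of the score matrix,
    # treating samples of class c as positive and all others as negative
    classes = np.unique(labels)
    return float(np.mean([
        binary_auc(scores[labels == c, c], scores[labels != c, c])
        for c in classes
    ]))

labels = np.array([0, 0, 1, 1, 2, 2])
scores = np.eye(3)[labels] + 0.01 * np.arange(6)[:, None]  # nearly one-hot scores
print(ova_auc(scores, labels))  # 1.0: every class is perfectly ranked
```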

A.2 FORMULATING RATIO OBJECTIVES

In this section, we study a subset of non-decomposable metrics which are typically expressed as ratios of some combination of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). These can be expressed in the general form
$$\frac{a_{11}\, TP + a_{12}}{a_{21}\, TP + a_{22}\, FP + a_{23}\, FN + a_{24}},$$
where the a_{pq} are constant coefficients; a term is absent when its coefficient is set to 0 and present otherwise. This formulation can be used to define the F-score, F_β, Jaccard, IoU and precision at fixed recall. In the following, we describe the formulation of the F-score as a representative of this approach; the other metrics can be formulated similarly. Given the ground truth Y, our goal is to compute Ŷ, both of length n, such that Ŷ aligns with Y according to the specific metric. We first show how to write TP, FP, TN and FN with respect to these vectors. The F-score (or F-measure) is routinely used as a performance metric for different types of prediction problems, including binary classification and multi-label classification. Compared to measures like the error rate in binary classification and the Hamming loss, it enforces a better balance between performance on the minority and majority classes, and is therefore more suitable in the case of imbalanced data Dembczynski et al. (2011). The F-score is defined as
$$F\text{-score}(Y, \hat{Y}) = \frac{2\, P(Y, \hat{Y}) \times R(Y, \hat{Y})}{P(Y, \hat{Y}) + R(Y, \hat{Y})},$$
where P is the measure of precision, defined as P(Y, Ŷ) = TP / (TP + FP).
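The identity 2PR/(P + R) = 2 Y^T Ŷ / (1^T Y + 1^T Ŷ), which underlies the linear fractional form of the F-score, can be checked numerically; the labels below are a random standalone example:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = (rng.random(100) > 0.7).astype(float)     # imbalanced ground truth
Yhat = (rng.random(100) > 0.5).astype(float)  # predictions

TP, FP, FN = Y @ Yhat, (1 - Y) @ Yhat, Y @ (1 - Yhat)
P, R = TP / (TP + FP), TP / (TP + FN)

f_def = 2 * P * R / (P + R)                      # definition 2PR/(P+R)
f_vec = 2 * (Y @ Yhat) / (Y.sum() + Yhat.sum())  # vectorized form used in the LP
print(np.isclose(f_def, f_vec))  # True
```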

