ALTERNATING DIFFERENTIATION FOR OPTIMIZATION LAYERS

Published as a conference paper at ICLR 2023

Abstract

The idea of embedding optimization problems into deep neural networks as optimization layers to encode constraints and inductive priors has taken hold in recent years. Most existing methods focus on implicitly differentiating Karush-Kuhn-Tucker (KKT) conditions in a way that requires expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that differentiates optimization problems (here, specifically convex optimization problems with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff substantially decreases the dimension of the Jacobian matrix, especially for optimization with large-scale constraints, and thus increases the computational speed of implicit differentiation. We show that the gradients obtained by Alt-Diff are consistent with those obtained by differentiating the KKT conditions. In addition, we propose to truncate Alt-Diff to further accelerate computation. Under some standard assumptions, we show that the truncation error of the gradients is bounded by the same order as the estimation error of the variables. Therefore, Alt-Diff can be truncated to further increase computational speed without sacrificing much accuracy. A series of comprehensive experiments validate the superiority of Alt-Diff.

1. INTRODUCTION

Recent years have seen a variety of applications in machine learning that consider optimization as a tool for inference learning Belanger & McCallum (2016); Belanger et al. (2017); Diamond et al. (2017); Amos et al. (2017); Amos & Kolter (2017); Agrawal et al. (2019a). Embedding optimization problems as optimization layers in deep neural networks can capture useful inductive bias, such as domain-specific knowledge and priors. Unlike conventional neural networks, which are defined by an explicit formulation in each layer, optimization layers are defined implicitly by solving optimization problems. They can be treated as implicit functions that map inputs to optimal solutions. However, training optimization layers together with explicit layers is not easy, since explicit closed-form solutions typically do not exist for the optimization layers. Generally, methods for computing the gradients of optimization layers fall into two main categories: implicitly differentiating the optimality conditions and applying unrolling methods. The idea of differentiating optimality conditions has been used in bilevel optimization Kunisch & Pock (2013); Gould et al. (2016) and sensitivity analysis Bonnans & Shapiro (2013). Recently, OptNet Amos & Kolter (2017) and CvxpyLayer Agrawal et al. (2019a) have extended this method to optimization layers so as to enable end-to-end learning within the deep learning structure. However, these methods inevitably require expensive computation on the Jacobian matrix; thus they are prone to instability and are often intractable, especially for large-scale optimization layers. Another direction for obtaining the gradients of optimization layers is based on unrolling methods Diamond et al. (2017); Zhang et al. (2023), where an iterative first-order gradient method is applied. However, unrolling is memory-intensive, since all the intermediate results have to be recorded.
Besides, unrolling methods are not well suited to constrained optimization problems, as expensive projection operators are needed. In this paper, we aim to develop a new method that significantly increases the computational speed of the differentiation procedure for convex optimization problems with polyhedral constraints. Motivated by the scalability of operator splitting methods for optimization problems Glowinski & Le Tallec (1989); Stellato et al. (2020), we develop a new framework, namely Alternating Differentiation (Alt-Diff), that differentiates optimization layers in a fast and recursive way. Alt-Diff first decouples the constrained optimization problem into multiple subproblems based on the well-known alternating direction method of multipliers (ADMM) Boyd et al. (2011). Then, the differentiation operators for obtaining the derivatives of the primal and dual variables w.r.t. the parameters are applied to these subproblems in an alternating manner. Accordingly, Alt-Diff substantially decreases the dimension of the Jacobian matrix, significantly increasing the computational speed of implicit differentiation, especially for large-scale problems. Unlike most existing methods, which directly differentiate the KKT conditions after obtaining the optimal solution of an optimization problem, Alt-Diff performs the forward and backward procedures simultaneously. Both procedures can be truncated without sacrificing much accuracy in terms of the gradients of optimization layers. Overall, our contributions are three-fold:

• We develop a new differentiation framework, Alt-Diff, that decouples optimization layers in an alternating way. Alt-Diff significantly reduces the dimension of the KKT matrix, especially for optimization with large-scale constraints, and thus increases the computational speed of implicit differentiation.
• We show that: 1) the gradients obtained by Alt-Diff are consistent with those obtained by differentiating the KKT conditions; 2) under some standard assumptions, the truncation error of the gradients is bounded by the same order as the estimation error of the variables. Therefore, Alt-Diff can be truncated to accelerate computation without sacrificing much accuracy.

• We conduct a series of experiments and show that Alt-Diff can achieve results comparable to state-of-the-art methods in much less time, especially for large-scale optimization problems. The fast performance of Alt-Diff comes from the dimension reduction of the KKT matrix and from the truncation capability of Alt-Diff.

Differentiating the KKT conditions directly has been implemented in solvers such as OptNet Amos & Kolter (2017); however, such methods are restricted to the differentiation of quadratic problems.

2. RELATED WORK

For more general convex optimization, Agrawal et al. (2019b) proposed CvxpyLayer, which differentiates through disciplined convex programs and uses sparse operations and LSQR Paige & Saunders (1982) to speed up the differentiation procedure. Although these have been worthwhile attempts to accelerate the differentiation procedure, they still largely focus on implicitly differentiating the KKT conditions in a way that requires expensive operations on a large-scale Jacobian matrix. Recently, a differentiable solver named JaxOpt Blondel et al. (2021) has also been put forward, based on an implicit automatic differentiation mechanism under the Jax framework.

Unrolling methods. Another direction for differentiating optimization problems is the unrolling methods Domke (2012); Monga et al. (2021). These approximate the objective function with a first-order gradient method and then incorporate the gradient of the inner-loop optimization into the training procedure Belanger & McCallum (2016); Belanger et al. (2017); Amos et al. (2017); Metz et al. (2019). This is an iterative procedure, which is usually truncated to a fixed number of iterations. Although unrolling methods are easy to implement, most of their applications are limited to unconstrained problems. If constraints are added, the unrolled solutions have to be projected onto the feasible region, significantly increasing the computational burden. By contrast, Alt-Diff only requires a very simple operation that projects the slack variable onto the nonnegative orthant. This substantially improves the efficiency of subsequent updates and reduces the method's computational complexity.

Implicit models. There has been growing interest in implicit models in recent years J. Z. Kolter & Johnson (2020); Zhang et al. (2020); Bolte et al. (2021). Implicit layers replace the traditional feedforward layers of neural networks with fixed-point iterations to compute inferences.

3. PRELIMINARY: DIFFERENTIABLE OPTIMIZATION LAYERS

We consider a parameterized convex optimization problem with polyhedral constraints:

min_x f(x; θ)  s.t.  x ∈ C(θ),    (1)

where x ∈ R^n is the decision variable, the objective function f: R^n → R is convex, and the constraint set C(θ) := {x | Ax = b, Gx ≤ h} is a polyhedron. For simplicity, we use θ to collect the parameters in the objective function and constraints. For any given θ, a solution of optimization problem (1) is a point x⋆ ∈ R^n that minimizes f(x; θ) while satisfying the constraints, i.e. x⋆ ∈ C(θ). The optimization problem (1) can then be viewed as a mapping from the parameters θ to the solution x⋆. Here, we focus on convex optimization with affine constraints due to its wide applications in control systems Guo & Wang (2010), signal processing Mattingley & Boyd (2010), communication networks Luong et al. (2017); Bui et al. (2017), etc. Since optimization problems can capture well-defined domain knowledge in a model-driven way, embedding such problems into neural networks within an end-to-end learning framework can simultaneously leverage the strengths of both model-driven and data-driven methods. Unlike conventional neural networks, which are defined through explicit expressions of each layer, an optimization layer built on optimization problem (1) is defined as follows:

Definition 3.1 (Optimization Layer). A layer in a neural network is an optimization layer if its input is the optimization parameters θ ∈ R^m and its output x⋆ ∈ R^n is the solution of the optimization problem (1). The optimization layer can be treated as an implicit function F: R^n × R^m → R^n with F(x⋆, θ) = 0.

In a deep learning architecture, optimization layers are implemented together with other explicit layers in an end-to-end framework. During training, the chain rule is used to backpropagate the gradient through the optimization layer.
Given a loss function R, the derivative of the loss w.r.t. the parameter θ of the optimization layer is

∂R/∂θ = (∂R/∂x⋆)(∂x⋆/∂θ).

The derivative ∂R/∂x⋆ can easily be obtained by automatic differentiation on explicit layers, such as fully connected and convolutional layers. However, since explicit closed-form solutions typically do not exist for optimization layers, computing ∂x⋆/∂θ brings additional difficulties. Recent work, such as OptNet Amos & Kolter (2017) and CvxpyLayer Agrawal et al. (2019a), has shown that the Jacobian ∂x⋆/∂θ can be derived by differentiating the KKT conditions of the optimization problem based on the Implicit Function Theorem Krantz & Parks (2002). This is briefly recalled in Lemma 3.2.

Lemma 3.2 (Implicit Function Theorem). Let F(x, θ) denote a continuously differentiable function with F(x⋆, θ) = 0, and suppose the Jacobian of F(x, θ) w.r.t. x is invertible at (x⋆; θ). Then the derivative of the solution x⋆ with respect to θ is

∂x⋆/∂θ = -[J_{F;x}]^{-1} J_{F;θ},

where J_{F;x} := ∇_x F(x⋆, θ) and J_{F;θ} := ∇_θ F(x⋆, θ) are the Jacobians of F(x, θ) w.r.t. x and θ, respectively.

It is worth noting that this differentiation procedure needs to calculate the optimal value x⋆ in the forward pass and involves solving a linear system with the Jacobian matrix J_{F;x} in the backward pass. Both passes are generally very computationally expensive, especially for large-scale problems. This means that differentiating the KKT conditions directly is not scalable to large optimization problems. Existing solvers have made some attempts to alleviate this issue. Specifically, CvxpyLayer adopts an LSQR technique to accelerate implicit differentiation for sparse optimization problems. However, this is not necessarily efficient for more general cases, which may not be sparse.
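As a concrete illustration of Lemma 3.2, the following minimal sketch (our own assumed example, not code from OptNet or CvxpyLayer) differentiates a small equality-constrained quadratic program through its KKT system; the (n + n_c)-dimensional linear solve in the backward pass is exactly the cost this paper aims to avoid:

```python
import numpy as np

# Assumed toy problem: min 0.5 x'Px + q'x  s.t.  Ax = b, differentiated via KKT.
rng = np.random.default_rng(0)
n, m = 4, 2
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)                 # P > 0, so the QP is strictly convex
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Forward pass: stationarity Px + q + A'lam = 0 plus feasibility Ax = b
# form one (n + m)-dimensional KKT linear system.
K = np.block([[P, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star = sol[:n]

# Backward pass (Lemma 3.2): perturbing b in the KKT equations yields
# K [dx/db; dlam/db] = [0; I], so dx/db is the top n-by-m block.
J = np.linalg.solve(K, np.vstack([np.zeros((n, m)), np.eye(m)]))
dx_db = J[:n, :]

# Finite-difference check of the first column of the Jacobian.
eps = 1e-6
b_pert = b.copy(); b_pert[0] += eps
x_pert = np.linalg.solve(K, np.concatenate([-q, b_pert]))[:n]
assert np.allclose((x_pert - x_star) / eps, dx_db[:, 0], atol=1e-4)
```

Note that the KKT matrix K grows with the number of constraints, which is the scalability issue discussed below.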
OptNet uses a primal-dual interior point method in the forward pass, which makes its backward pass very cheap, but it is suitable only for quadratic optimization problems. In this paper, our main goal is to develop a new method that increases the computational speed of the differentiation procedure, especially for large-scale constrained optimization problems.

4. ALTERNATING DIFFERENTIATION FOR OPTIMIZATION LAYERS

This section provides the details of the Alt-Diff algorithm for optimization layers. Alt-Diff decomposes the differentiation of a large-scale KKT system into smaller problems that are solved in a primal-dual alternating way (see Section 4.1), which reduces the computational complexity and significantly improves the efficiency of optimization layers. In Section 4.2, we theoretically analyze the truncation capability of Alt-Diff. Notably, Alt-Diff can be truncated to return inexact gradients, further increasing computational speed without sacrificing much accuracy.

The augmented Lagrangian of problem (1) with a quadratic penalty term is

max_{λ,ν} min_{x, s≥0} L(x, s, λ, ν; θ) = f(x; θ) + ⟨λ, Ax - b⟩ + ⟨ν, Gx + s - h⟩ + (ρ/2)(‖Ax - b‖² + ‖Gx + s - h‖²),

where s ≥ 0 is a non-negative slack variable and ρ > 0 is a hyperparameter associated with the penalty term. Accordingly, the following ADMM procedure is used to alternately update the primal, slack and dual variables:

x^{k+1} = argmin_x L(x, s^k, λ^k, ν^k; θ),           (5a)
s^{k+1} = argmin_{s≥0} L(x^{k+1}, s, λ^k, ν^k; θ),   (5b)
λ^{k+1} = λ^k + ρ(Ax^{k+1} - b),                     (5c)
ν^{k+1} = ν^k + ρ(Gx^{k+1} + s^{k+1} - h).           (5d)

Note that the primal variable x^{k+1} is updated by solving an unconstrained optimization problem, while the slack update s^{k+1} has a closed-form solution via a Rectified Linear Unit (ReLU):

s^{k+1} = ReLU(-(1/ρ)ν^k - (Gx^{k+1} - h)).

The differentiations of the slack and dual updates are trivial and can be handled by automatic differentiation. Therefore, the computational difficulty is concentrated in differentiating the primal update, which can be done using the Implicit Function Theorem.
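As a quick numerical sanity check of the closed-form slack update (5b), the sketch below (our own, with arbitrary random data; the vector c stands in for Gx^{k+1} - h) verifies that the ReLU expression minimizes the augmented Lagrangian over s ≥ 0:

```python
import numpy as np

# For fixed x and nu, the s-subproblem is  min_{s>=0}  <nu, s> + (rho/2)||c + s||^2,
# which is separable per coordinate, so projecting the unconstrained minimizer
# onto s >= 0 (a ReLU) gives the global minimizer.
rng = np.random.default_rng(2)
p, rho = 6, 1.5
c = rng.standard_normal(p)            # plays the role of Gx - h at this iterate
nu = rng.standard_normal(p)

def g(s):
    return nu @ s + 0.5 * rho * np.sum((c + s) ** 2)

s_star = np.maximum(0.0, -nu / rho - c)   # ReLU(-(1/rho)nu - (Gx - h))

# s_star should beat any other nonnegative candidate.
for _ in range(1000):
    s_rand = np.abs(rng.standard_normal(p))
    assert g(s_star) <= g(s_rand) + 1e-9
```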
Applying differentiation to procedure (5) leads to:

∂x^{k+1}/∂θ = -[∇²_x L(x^{k+1})]^{-1} ∇_{x,θ} L(x^{k+1}),                         (7a)
∂s^{k+1}/∂θ = -(1/ρ)(sgn(s^{k+1}) · 1^T) ⊙ (∂ν^k/∂θ + ρ ∂(Gx^{k+1} - h)/∂θ),      (7b)
∂λ^{k+1}/∂θ = ∂λ^k/∂θ + ρ ∂(Ax^{k+1} - b)/∂θ,                                     (7c)
∂ν^{k+1}/∂θ = ∂ν^k/∂θ + ρ ∂(Gx^{k+1} + s^{k+1} - h)/∂θ,                           (7d)

where ∇²_x L(x^{k+1}) = ∇²_x f(x^{k+1}) + ρA^T A + ρG^T G, ⊙ denotes the Hadamard product, and sgn(s^{k+1}) is applied elementwise with sgn(s_i^{k+1}) = 1 if s_i^{k+1} ≥ 0 and sgn(s_i^{k+1}) = 0 otherwise.

Algorithm 1 (Alt-Diff, sketch): initialize x^0, s^0, λ^0, ν^0 and the corresponding Jacobians; while not converged, update x^{k+1}, s^{k+1}, λ^{k+1}, ν^{k+1} by (5a)-(5d), update ∂x^{k+1}/∂θ and ∂s^{k+1}/∂θ by (7a) and (7b), update ∂λ^{k+1}/∂θ and ∂ν^{k+1}/∂θ by (7c) and (7d), and set k := k + 1; end while; return x⋆ and its gradient ∂x⋆/∂θ.

We summarize the procedure of Alt-Diff in Algorithm 1 and provide a framework of Alt-Diff in Appendix A. The convergence behavior of Alt-Diff can be inspected by checking ‖x^{k+1} - x^k‖/‖x^k‖ < ϵ, where ϵ is a predefined threshold. Notably, Alt-Diff is somewhat similar to unrolling methods such as Foo et al. (2007); Domke (2012). However, those unrolling methods were designed for unconstrained optimization; if constraints are added, the unrolled solutions have to be projected onto the feasible region, and such projection operators are generally very computationally expensive. By contrast, Alt-Diff decouples the constraints from the optimization and only involves a very simple operation that projects the slack variable s onto the nonnegative orthant s ≥ 0. This significantly improves the efficiency of subsequent updates and reduces the overall computational complexity of Alt-Diff. Besides, conventional unrolling methods usually consume more memory, as all the intermediate computational results have to be recorded. In contrast, when running the updates (7), Alt-Diff does not need to store the results of previous rounds; it only keeps the latest one, i.e., the previous ∂x^k/∂θ is overwritten by ∂x^{k+1}/∂θ at each iteration.
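To make the alternating recursion concrete, the following self-contained sketch (our own illustration under simplifying assumptions: a QP layer, θ = b so only the equality right-hand side carries gradients, and a fixed ρ; function and variable names are ours) runs the forward updates (5) and the Jacobian updates (7) in lockstep and checks the resulting ∂x/∂b against a finite difference:

```python
import numpy as np

def alt_diff_qp(P, q, A, b, G, h, rho=1.0, iters=5000, tol=1e-10):
    """Forward updates (5) and Jacobian updates (7) w.r.t. b, run in lockstep."""
    n, m, p = P.shape[0], A.shape[0], G.shape[0]
    x, s = np.zeros(n), np.zeros(p)
    lam, nu = np.zeros(m), np.zeros(p)
    dx, ds = np.zeros((n, m)), np.zeros((p, m))
    dlam, dnu = np.zeros((m, m)), np.zeros((p, m))
    H_inv = np.linalg.inv(P + rho * A.T @ A + rho * G.T @ G)  # constant for a QP
    for _ in range(iters):
        x_old = x
        # forward pass (5a)-(5d)
        x = H_inv @ (-q - A.T @ lam - G.T @ nu + rho * A.T @ b + rho * G.T @ (h - s))
        s = np.maximum(0.0, -nu / rho - (G @ x - h))
        lam = lam + rho * (A @ x - b)
        nu = nu + rho * (G @ x + s - h)
        # backward pass (7a)-(7d); -rho*A.T is d/db of the rho*A'(Ax - b) term
        dx = -H_inv @ (A.T @ dlam + G.T @ dnu - rho * A.T + rho * G.T @ ds)
        ds = (s > 0)[:, None] * (-dnu / rho - G @ dx)
        dlam = dlam + rho * (A @ dx - np.eye(m))
        dnu = dnu + rho * (G @ dx + ds)
        if np.linalg.norm(x - x_old) / (np.linalg.norm(x_old) + 1e-12) < tol:
            break
    return x, dx

rng = np.random.default_rng(3)
n, m, p = 5, 2, 3
M = rng.standard_normal((n, n)); P = M @ M.T + n * np.eye(n)
q = rng.standard_normal(n)
A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
G = rng.standard_normal((p, n))
h = G @ np.linalg.lstsq(A, b, rcond=None)[0] + 1.0   # makes the QP feasible

x, dx_db = alt_diff_qp(P, q, A, b, G, h)
# finite-difference check of one Jacobian column against re-solving the layer
eps = 1e-4
b_pert = b.copy(); b_pert[0] += eps
x_pert, _ = alt_diff_qp(P, q, A, b_pert, G, h)
assert np.allclose((x_pert - x) / eps, dx_db[:, 0], atol=1e-3)
```

Only the latest Jacobians are kept, matching the memory argument above: each `dx`, `ds`, `dlam`, `dnu` overwrites its predecessor.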
Besides, we show that the gradient obtained by Alt-Diff is consistent with the one obtained by implicitly differentiating the optimality conditions. A detailed proof, as well as the procedure for differentiating the KKT conditions, is presented in Appendix E. For clarity, we take several optimization layers as examples to show the implementation of Alt-Diff, including the Quadratic layer Amos & Kolter (2017), the constrained Sparsemax layer Malaviya et al. (2018) and the constrained Softmax layer Martins & Astudillo (2016). These optimization layers were proposed to integrate specific domain knowledge into the learning procedure. Detailed derivations of the primal differentiation procedure (7a) are provided in Appendix B.2. In the forward pass, the unconstrained optimization problem (5a) can easily be solved by Newton's method, which requires the inverse of the Hessian, i.e. [∇²_x L(x^{k+1})]^{-1}. After the forward pass, this inverse can be reused directly in the backward pass (7a) to accelerate the computation. The computation is even more efficient for quadratic programs, where [∇²_x L(x^{k+1})]^{-1} is constant and only needs to be computed once. More details are given in Appendix B.1. Since Alt-Diff decouples the objective function from the constraints, the dimension of the Hessian matrix ∇²_x L(x^{k+1}) is reduced to the number of variables n. Therefore, the complexity of Alt-Diff is O(n³), compared with O((n + n_c)³) for directly differentiating the KKT conditions, where n_c is the number of constraints. This also demonstrates the superiority of Alt-Diff for optimization problems with many constraints. To further accelerate Alt-Diff, we propose to truncate its iterations and analyze the effects of truncation. As noted in prior work, exact gradients are not necessary during the learning process of neural networks; therefore, we can truncate the iterative procedure of Alt-Diff given some threshold value ϵ.
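A small sketch of the Newton solve for the x-update (5a) in the quadratic case (our own example with placeholder random data and step size α = 1): the constant Hessian P + ρAᵀA + ρGᵀG makes a single full Newton step land exactly on the minimizer, and the same inverse can then be reused in (7a).

```python
import numpy as np

# Assumed placeholder data for one x-update of a QP layer.
rng = np.random.default_rng(4)
n, m, p, rho = 4, 2, 3, 1.0
M = rng.standard_normal((n, n)); P = M @ M.T + n * np.eye(n)
q = rng.standard_normal(n)
A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
G = rng.standard_normal((p, n)); h = rng.standard_normal(p)
s = np.zeros(p); lam = np.zeros(m); nu = np.zeros(p)

H = P + rho * A.T @ A + rho * G.T @ G     # constant Hessian of the x-subproblem

def grad_L(x):
    # first-order optimality residual of the augmented Lagrangian in x
    return (P @ x + q + A.T @ lam + G.T @ nu
            + rho * A.T @ (A @ x - b) + rho * G.T @ (G @ x + s - h))

x = np.zeros(n)
x = x - np.linalg.solve(H, grad_L(x))     # one Newton step with alpha = 1
assert np.linalg.norm(grad_L(x)) < 1e-8   # already at the stationary point
```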
Truncation in existing methods (OptNet and CvxpyLayer), which are based on interior point methods, often does not offer significant improvements. This is because interior point methods converge fast, and the computational bottleneck comes from the high dimension of the KKT matrix rather than from the number of interior point iterations. Truncation benefits Alt-Diff more, since the number of Alt-Diff iterations is closely tied to the tolerance due to the sublinear convergence of ADMM. Our simulation results in Section 5.1 and Appendix F.1 also verify this phenomenon. Next, we give a theoretical analysis of the influence of truncation.

Assumption A (L-smoothness). The first-order derivatives of the loss function R and the second-order derivatives of the augmented Lagrangian L are Lipschitz: for all x_1, x_2 ∈ R^n,

‖∇_x R(θ; x_1) - ∇_x R(θ; x_2)‖ ≤ L_1 ‖x_1 - x_2‖,     (8a)
‖∇²_x f(x_1) - ∇²_x f(x_2)‖ ≤ L_2 ‖x_1 - x_2‖,         (8b)
‖∇_{x,θ} L(x_1) - ∇_{x,θ} L(x_2)‖ ≤ L_3 ‖x_1 - x_2‖.   (8c)

Assumption B (Bounded gradients). The first- and second-order derivatives of L and R are bounded for all x ∈ R^n:

∇_x R(θ; x) ⪯ μ_1 I,  ∇²_x L(x) ⪰ μ_2 I,  ∇_{x,θ} L(x) ⪯ μ_3 I,    (9)

where μ_1, μ_2 and μ_3 are positive constants.

Assumption C (Nonsingular Hessian). The function f(x) is twice differentiable, and the Hessian of the augmented Lagrangian L is invertible at every point x ∈ R^n.

Theorem 4.1 (Error of the gradient obtained by truncated Alt-Diff). Suppose x^k is the truncated solution at the k-th iteration. Then the error between the gradient ∂x^k/∂θ obtained by truncated Alt-Diff and the true gradient ∂x⋆/∂θ is bounded as

‖∂x^k/∂θ - ∂x⋆/∂θ‖ ≤ C_1 ‖x^k - x⋆‖,

where C_1 = L_3/μ_2 + μ_3 L_2/μ_2² is a constant. Please refer to Appendix C for the proof.
The theorem shows that the primal derivative (7a) obtained by Alt-Diff has the same order of error as that introduced by truncating the iterations of (5a). The theorem also implies the convergence of Alt-Diff: by the convergence of ADMM itself, x^k converges to x⋆, and the convergence of the Jacobian follows accordingly; results from the computational perspective are shown in Appendix E.2. Moreover, we can derive the following corollary for a general loss function R.

Corollary 4.2 (Error of the inexact gradient in terms of the loss function). Under the conditions of Theorem 4.1, the error of the gradient of the loss function R w.r.t. θ is bounded as

‖∇R(θ; x^k) - ∇R(θ; x⋆)‖ ≤ C_2 ‖x^k - x⋆‖,

where C_2 = L_1 + (μ_3 L_1 + μ_1 L_3)/μ_2 + μ_1 μ_3 L_2/μ_2² is a constant. Please refer to Appendix D for the proof.

Truncating the iterative process results in fewer steps and higher computational efficiency without sacrificing much accuracy. In the next section, we demonstrate the truncation capability of Alt-Diff through several numerical experiments.

5. EXPERIMENTAL RESULTS

In this section, we evaluate Alt-Diff over a series of experiments to demonstrate its performance in terms of computational speed as well as accuracy. First, we implemented Alt-Diff for several optimization layers, including sparse and dense constrained quadratic layers and constrained softmax layers. In addition, we applied Alt-Diff to the real-world task of energy generation scheduling under a predict-then-optimize framework. Moreover, we also tested Alt-Diff with a quadratic layer in an image classification task on the MNIST dataset. To verify the truncation capability of Alt-Diff, we also ran Alt-Diff with different tolerance values. All the experiments were implemented on an Intel(R) Core i7-10700 CPU @ 2.90GHz with 16 GB of memory. Our source code for these experiments is available at https://github.com/HxSun08/Alt-Diff.

5.1. NUMERICAL EXPERIMENTS ON SEVERAL OPTIMIZATION LAYERS

In this section, we compare Alt-Diff with OptNet and CvxpyLayer on several optimization layers, including the constrained Sparsemax layer, the dense Quadratic layer and the constrained Softmax layer, which represent typical cases of sparse quadratic problems, dense quadratic problems, and problems with a general convex objective function, respectively. In the dense quadratic layer, with objective function f(x) = (1/2)x^T P x + q^T x, the parameters P, q, A, b, G, h were randomly generated from the same random seed with P ⪰ 0. The tolerance ϵ was set to 10^-3 for OptNet, CvxpyLayer and Alt-Diff. As the parameters of the optimization problems are generated randomly, we compared Alt-Diff with the "dense" mode of CvxpyLayer. All the numerical experiments were executed 5 times, with the average times reported in Table 1. It can be seen that OptNet runs much faster than CvxpyLayer for the dense quadratic layer, while Alt-Diff outperforms both; moreover, the superiority of Alt-Diff becomes more evident as the problem size increases. We also plot the trend of the Jacobian ∂x^k/∂b over the iterations of the primal update (7a) in Figure 1. As shown in Theorem 4.1, the gradient obtained by Alt-Diff gradually converges to the result obtained by differentiating the KKT conditions as the number of iterations increases; the results in Figure 1 validate this theorem. Further, we compare the truncated performance of Alt-Diff, OptNet and CvxpyLayer under different tolerance levels ϵ. As shown in Table 2, truncation does not significantly improve the computational speed of the existing methods, but it does bring significant improvements for Alt-Diff. Detailed results for the other optimization layers are provided in Appendix F.1.

5.2. ENERGY GENERATION SCHEDULING

We next apply Alt-Diff to energy generation scheduling under the predict-then-optimize framework Mandi & Guns (2020). Predict-then-optimize is an end-to-end learning model in which some unknown variables are first predicted by a machine learning method and then successively optimized.
The main idea of predict-then-optimize is to use the optimization loss (12) to guide the prediction, rather than the prediction loss as in the standard learning setting:

L(θ̂, θ) = (1/2) Σ_{i=1}^m (x⋆_i(θ̂) - x⋆_i(θ))².    (12)

We consider an energy generation scheduling task based on the actual power demand for electricity in a certain region. In this setting, a power system operator must decide the electricity generation to schedule for the next 24 hours based on historical electricity demand. We used the hourly electricity demand data of the past 72 hours to predict the real power demand of the next 24 hours. The predicted demand was then fed into the following optimization problem to schedule the power generation:

min_x Σ_{k=1}^{24} ‖x_k - P_{d_k}‖²  s.t.  |x_{k+1} - x_k| ≤ r,  k = 1, 2, ..., 23,    (13)

where P_{d_k} and x_k denote the power demand and the power generation at time slot k, respectively. Due to physical limitations, the variation of power generation within a single time slot is not allowed to exceed a threshold r. During training, we use a neural network with two hidden layers to predict the electricity demand of the next 24 hours from the data of the previous 72 hours. All the P_{d_k} have been normalized into the interval [0, 100]. The optimization problem in this task can be treated as the optimization layer considered previously. As shown in Table 4, since the constraints are sparse, OptNet runs much slower than the "lsqr" mode of CvxpyLayer. We therefore implemented Alt-Diff and CvxpyLayer to obtain the gradient of the power generation with respect to the power demand. During training, we used the Adam optimizer Kingma & Ba (2014) to update the parameters. Finally, we compared the results obtained from CvxpyLayer (with tolerance 10^-3) with those from Alt-Diff under different truncation thresholds, varying from 10^-1 and 10^-2 to 10^-3.
The results are shown in Figure 2. We can see that the losses for CvxpyLayer and for Alt-Diff with different truncation thresholds are almost the same, but the running time of Alt-Diff is much shorter than that of CvxpyLayer, especially in the truncated cases.
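For reference, the ramp constraint in problem (13) maps directly onto the polyhedral form Gx ≤ h used throughout the paper. The sketch below (our own construction, with made-up demand data standing in for the real regional dataset) builds the difference matrix and checks the equivalence:

```python
import numpy as np

# Map |x_{k+1} - x_k| <= r from (13) onto Gx <= h (illustrative data only).
T, r = 24, 5.0
D = np.zeros((T - 1, T))
for k in range(T - 1):
    D[k, k], D[k, k + 1] = -1.0, 1.0      # row k encodes x_{k+1} - x_k
G = np.vstack([D, -D])                    # stack +/- difference rows
h = np.full(2 * (T - 1), r)               # ... <= r on both sides

# A fake normalized demand curve standing in for P_d (the real data is hourly).
demand = 50.0 + 20.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, T))
x = demand                                # candidate schedule: follow demand

# Gx <= h holds exactly when every hourly ramp is at most r.
lhs = bool(np.all(G @ x <= h + 1e-12))
rhs = bool(np.all(np.abs(np.diff(x)) <= r + 1e-12))
assert lhs == rhs
```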

6. CONCLUSION

In this paper, we have proposed Alt-Diff for computationally efficient differentiation of optimization layers with convex objective functions and polyhedral constraints. Alt-Diff differentiates optimization layers in a fast and recursive way. Unlike differentiation of the KKT conditions of the optimization problem, Alt-Diff decouples the differentiation procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff substantially decreases the dimension of the Jacobian matrix, especially for large-scale constrained optimization problems, and thus increases the computational speed. We have also shown the convergence of Alt-Diff and bounded its truncation error under some general assumptions. Notably, we have shown that iteration truncation accelerates Alt-Diff more than existing methods. Comprehensive experiments have demonstrated the efficiency of Alt-Diff compared to the state-of-the-art. Apart from its advantages, we also want to highlight the potential drawbacks of Alt-Diff. One issue is that the forward pass of Alt-Diff is based on ADMM, which does not always outperform interior point methods. In addition, Alt-Diff is currently designed only for optimization layers in the form of convex objective functions with polyhedral constraints. Extending Alt-Diff to more general convex optimization layers and to nonconvex optimization layers is left for future work.

Appendix

We present some details of the proposed Alt-Diff algorithm, including the framework of Alt-Diff, the derivation of some specific layers, the proofs of the theorems, detailed experimental results and implementation notes.

A FRAMEWORK OF ALT-DIFF

The Alt-Diff framework for optimization layers within an end-to-end learning architecture is illustrated in Figure 3. According to the first-order optimality condition of the forward pass (5a), we have

∇L(x^{k+1}) = ∇f(x^{k+1}) + A^T λ^k + G^T ν^k + ρA^T(Ax^{k+1} - b) + ρG^T(Gx^{k+1} + s^k - h) = 0.    (14)

Generally, equation (14) can be solved efficiently by iterative methods such as Newton's method:

x^i_{k+1} = x^{i-1}_{k+1} - α [∇²_x L(x^{i-1}_{k+1})]^{-1} ∇L(x^{i-1}_{k+1})
          = x^{i-1}_{k+1} - α [∇²_x f(x^{i-1}_{k+1}) + ρA^T A + ρG^T G]^{-1} ∇L(x^{i-1}_{k+1}),    (15)

where x^i_{k+1} denotes the value of x_{k+1} at the i-th Newton iteration and α is the step size. Therefore, once x^i_{k+1} has converged to its optimal value x_{k+1}, i.e. after the forward pass, the inverse of the Hessian, [∇²_x L(x_{k+1})]^{-1}, can be reused directly to accelerate the computation of the backward pass (7a). In particular, when f(x) is quadratic, i.e. f(x) = (1/2)x^T P x + q^T x with P a symmetric matrix, the inverse of the Hessian becomes the constant

[∇²_x L(x_{k+1})]^{-1} = [P + ρA^T A + ρG^T G]^{-1}.    (16)

B.2 SPECIAL LAYERS

We take the Quadratic layer Amos & Kolter (2017), the constrained Sparsemax layer Malaviya et al. (2018) and the constrained Softmax layer Martins & Astudillo (2016) as special cases to show the computation of the backward pass (7). The details of these layers are provided in the first two columns of Table 3. The differentiation procedures (7b)-(7d) for these layers are exactly the same; the only difference lies in the primal differentiation procedure (7a), which is listed in the last column of Table 3.

Table 3: The primal differentiation procedure of Alt-Diff for several optimization layers.

Constrained Sparsemax layer:  min_x ‖x - y‖²₂  s.t.  1^T x = 1, 0 ≤ x ≤ u;
  primal differentiation (7a):  ∂x^{k+1}/∂θ = -[(2 + 2ρ)I + ρ·11^T]^{-1} L_{xθ}(x^{k+1}).

Quadratic layer:  min_x (1/2)x^T P x + q^T x  s.t.  Ax = b, Gx ≤ h;
  primal differentiation (7a):  ∂x^{k+1}/∂θ = -[P + ρA^T A + ρG^T G]^{-1} L_{xθ}(x^{k+1}).

Constrained Softmax layer:  min_x -y^T x + H(x)  s.t.  1^T x = 1, 0 ≤ x ≤ u;
  primal differentiation (7a):  ∂x^{k+1}/∂θ = -[diag^{-1}(x) + 2ρI + ρ·11^T]^{-1} L_{xθ}(x^{k+1}).

Here L_{xθ}(x^{k+1}) := ∂/∂θ ∇_x L(x^{k+1}, s^k, λ^k, ν^k; θ), P ⪰ 0 in the Quadratic layer, and H(x) = Σ_{i=1}^n x_i log(x_i) is the negative entropy in the constrained Softmax layer. We next give the detailed derivations for the last column of Table 3.

Proof. For the Quadratic layer and the constrained Sparsemax layer, we substitute ∇f(x^{k+1}) = Px^{k+1} + q into (14) and differentiate with respect to θ:

P ∂x^{k+1}/∂θ + ∂q/∂θ + ∂(A^T λ^k)/∂θ + ∂(G^T ν^k)/∂θ + ρ ∂(A^T(Ax^{k+1} - b))/∂θ + ρ ∂(G^T(Gx^{k+1} + s^k - h))/∂θ = 0.    (17)

Therefore the primal gradient update is

∂x^{k+1}/∂θ = -[P + ρA^T A + ρG^T G]^{-1} (∂q/∂θ + ∂(A^T λ^k)/∂θ + ∂(G^T ν^k)/∂θ - ρ ∂(A^T b - G^T(s^k - h))/∂θ)
            = -[P + ρA^T A + ρG^T G]^{-1} L_{xθ}(x^{k+1}).

Compared with (16), we find that the forward and backward passes of quadratic layers share the same Hessian matrix, which significantly reduces the computational complexity. The constrained Sparsemax layer is a special quadratic layer with sparse constraints, where P = 2I, A = 1^T and G = [-I, I]^T. Therefore the primal differentiation (7a) becomes

∂x^{k+1}/∂θ = -[(2 + 2ρ)I + ρ·11^T]^{-1} L_{xθ}(x^{k+1}).

For the constrained Softmax layer, the augmented Lagrangian is

L = -y^T x + Σ_{i=1}^n x_i log(x_i) + λ(1^T x - 1) + ν^T(Gx + s - h) + (ρ/2)(‖1^T x - 1‖² + ‖Gx + s - h‖²),

where G = [-I, I]^T and h = [0, u]^T. Taking the gradient of the first-order optimality condition (14), equation (17) becomes

diag^{-1}(x) ∂x^{k+1}/∂θ + ∂(1λ^k)/∂θ + ∂(G^T ν^k)/∂θ + ρ ∂(1(1^T x^{k+1} - 1))/∂θ + ρ ∂(G^T(Gx^{k+1} + s^k - h))/∂θ = 0.    (21)

Similarly, the primal gradient update of the constrained Softmax layer is

∂x^{k+1}/∂θ = -[diag^{-1}(x) + ρG^T G + ρ·11^T]^{-1} (∂(1λ^k)/∂θ + ∂(G^T ν^k)/∂θ + ρ ∂(G^T(s^k - h))/∂θ)
            = -[diag^{-1}(x) + 2ρI + ρ·11^T]^{-1} L_{xθ}(x^{k+1}).    (22)

From Table 3, we can see that the implementation of the primal differentiation for the constrained Sparsemax layer Malaviya et al. (2018) and the quadratic optimization layer Amos & Kolter (2017) is quite simple: the Jacobian matrix involved in the primal update stays the same throughout the iterations.
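The collapsed Hessian for the constrained Sparsemax layer can be checked numerically; the sketch below (our own check, with arbitrary n and ρ) verifies that P + ρAᵀA + ρGᵀG reduces to (2 + 2ρ)I + ρ·11ᵀ under the stated substitutions:

```python
import numpy as np

# With P = 2I, A = 1^T and G = [-I; I], the Sparsemax-layer Hessian
# P + rho*A'A + rho*G'G collapses to (2 + 2*rho)I + rho*1*1'.
n, rho = 5, 0.7
P = 2.0 * np.eye(n)
A = np.ones((1, n))
G = np.vstack([-np.eye(n), np.eye(n)])
lhs = P + rho * A.T @ A + rho * G.T @ G
rhs = (2.0 + 2.0 * rho) * np.eye(n) + rho * np.ones((n, n))
assert np.allclose(lhs, rhs)   # G'G = 2I and A'A = 1*1'
```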

C ERROR OF THE TRUNCATED ALT-DIFF

Theorem 4.1 (Error of the gradient obtained by truncated Alt-Diff). Suppose $x^k$ is the truncated solution at the $k$-th iteration. The error between the gradient $\frac{\partial x^k}{\partial \theta}$ obtained by truncated Alt-Diff and the true gradient $\frac{\partial x^\star}{\partial \theta}$ is bounded as follows:
$$\left\|\frac{\partial x^k}{\partial \theta} - \frac{\partial x^\star}{\partial \theta}\right\| \le C_1 \|x^k - x^\star\|, \quad \text{where } C_1 = \frac{L_3}{\mu_2} + \frac{\mu_3 L_2}{\mu_2^2} \text{ is a constant.}$$

Proof. In the differentiation procedure of Alt-Diff, the primal update (7a) gives
$$\begin{aligned}
\frac{\partial x^k}{\partial \theta} - \frac{\partial x^\star}{\partial \theta}
&= -\nabla_x^2 L^{-1}(x^k)\nabla_{x,\theta} L(x^k) + \nabla_x^2 L^{-1}(x^\star)\nabla_{x,\theta} L(x^\star) \\
&= -\nabla_x^2 L^{-1}(x^k)\nabla_{x,\theta} L(x^k) + \nabla_x^2 L^{-1}(x^k)\nabla_{x,\theta} L(x^\star) - \nabla_x^2 L^{-1}(x^k)\nabla_{x,\theta} L(x^\star) + \nabla_x^2 L^{-1}(x^\star)\nabla_{x,\theta} L(x^\star) \\
&= \nabla_x^2 L^{-1}(x^k)\underbrace{\left[\nabla_{x,\theta} L(x^\star) - \nabla_{x,\theta} L(x^k)\right]}_{\Delta_1} + \underbrace{\left[\nabla_x^2 L^{-1}(x^\star) - \nabla_x^2 L^{-1}(x^k)\right]}_{\Delta_2}\nabla_{x,\theta} L(x^\star).
\end{aligned}$$
By Assumption (8c), we obtain
$$\|\Delta_1\| = \|\nabla_{x,\theta} L(x^\star) - \nabla_{x,\theta} L(x^k)\| \le L_3\|x^\star - x^k\|. \quad (25)$$
Since we have computed the closed form of the Hessian of the augmented Lagrange function, $\nabla_x^2 L(x^k) = \nabla_x^2 f(x^k)^T + \rho A^T A + \rho G^T G$, by Assumption (9) and the Cauchy-Schwarz inequality we have
$$\|\Delta_2\| = \|\nabla_x^2 L^{-1}(x^\star) - \nabla_x^2 L^{-1}(x^k)\| = \left\|\nabla_x^2 L^{-1}(x^k)\left[\nabla_x^2 L(x^k) - \nabla_x^2 L(x^\star)\right]\nabla_x^2 L^{-1}(x^\star)\right\| \le \|\nabla_x^2 L^{-1}(x^k)\| \cdot \|\nabla_x^2 L(x^k) - \nabla_x^2 L(x^\star)\| \cdot \|\nabla_x^2 L^{-1}(x^\star)\| \le \frac{1}{\mu_2^2}\|\nabla_x^2 f(x^\star) - \nabla_x^2 f(x^k)\|.$$
Therefore the difference of the Jacobians satisfies
$$\left\|\frac{\partial x^k}{\partial \theta} - \frac{\partial x^\star}{\partial \theta}\right\| = \|\nabla_x^2 L^{-1}(x^k)\Delta_1 + \Delta_2\nabla_{x,\theta} L(x^\star)\| \le \|\nabla_x^2 L^{-1}(x^k)\| \cdot \|\Delta_1\| + \frac{1}{\mu_2^2}\|\nabla_{x,\theta} L(x^\star)\| \cdot \|\nabla_x^2 f(x^\star) - \nabla_x^2 f(x^k)\| \le \left(\frac{L_3}{\mu_2} + \frac{\mu_3 L_2}{\mu_2^2}\right)\|x^\star - x^k\|.$$
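The key step in bounding $\Delta_2$ is the identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$ followed by norm submultiplicativity. A hedged numerical sketch (random symmetric positive definite matrices standing in for the two Hessians $\nabla_x^2 L(x^k)$ and $\nabla_x^2 L(x^\star)$; sizes arbitrary) confirms the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Two nearby SPD "Hessians"; the small shift mimics x^k moving toward x*
M = rng.standard_normal((n, n))
H1 = M @ M.T + n * np.eye(n)
H2 = H1 + 0.01 * np.eye(n)

inv1, inv2 = np.linalg.inv(H1), np.linalg.inv(H2)
lhs = np.linalg.norm(inv1 - inv2, 2)
# Identity A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}, then submultiplicativity
rhs = np.linalg.norm(inv1, 2) * np.linalg.norm(H1 - H2, 2) * np.linalg.norm(inv2, 2)
assert lhs <= rhs + 1e-12
```

In the proof, the two inverse-norm factors are each bounded by $1/\mu_2$ via the strong-convexity assumption, giving the $1/\mu_2^2$ factor in $C_1$.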

D ERROR OF THE GRADIENT OF LOSS FUNCTION

Corollary 4.2 (Error of the inexact gradient in terms of the loss function). Following Theorem 4.1, the error of the gradient of the loss function $R$ w.r.t. $\theta$ is bounded as follows:
$$\|\nabla R(\theta; x^k) - \nabla R(\theta; x^\star)\| \le C_2 \|x^k - x^\star\|, \quad \text{where } C_2 = L_1 + \frac{\mu_3 L_1 + \mu_1 L_3}{\mu_2} + \frac{\mu_1 \mu_3 L_2}{\mu_2^2} \text{ is a constant.}$$

Proof. Firstly, the optimization layer can be reformulated as the following bilevel optimization problem Ghadimi & Wang (2018):
$$\min_\theta R(\theta; x^\star(\theta)) \quad \text{s.t.} \quad x^\star(\theta) = \arg\min_{x \in C(\theta)} f(x; \theta),$$
where $R$ denotes the loss function; we can therefore derive the following conclusions by the Implicit Function Theorem. In the outer optimization problem, we have
$$\nabla R(\theta; x) = \nabla_\theta R(\theta; x) + \nabla_x R(\theta; x)\frac{\partial x}{\partial \theta}.$$
Therefore,
$$\Delta = \nabla R(\theta; x^k) - \nabla R(\theta; x^\star) = \underbrace{\nabla_\theta R(\theta; x^k) - \nabla_\theta R(\theta; x^\star)}_{\Delta_3} + \underbrace{\nabla_x R(\theta; x^k)\frac{\partial x^k}{\partial \theta} - \nabla_x R(\theta; x^\star)\frac{\partial x^\star}{\partial \theta}}_{\Delta_4}.$$
The second term is
$$\Delta_4 = \nabla_x R(\theta; x^k)\frac{\partial x^k}{\partial \theta} - \nabla_x R(\theta; x^k)\frac{\partial x^\star}{\partial \theta} + \nabla_x R(\theta; x^k)\frac{\partial x^\star}{\partial \theta} - \nabla_x R(\theta; x^\star)\frac{\partial x^\star}{\partial \theta} = \underbrace{\left[\nabla_x R(\theta; x^k) - \nabla_x R(\theta; x^\star)\right]}_{\Delta_5}\frac{\partial x^\star}{\partial \theta} + \nabla_x R(\theta; x^k)\left[\frac{\partial x^k}{\partial \theta} - \frac{\partial x^\star}{\partial \theta}\right].$$
Combining the result in (32), Assumption (8a), and Theorem 4.1, the difference of the truncated gradient from the optimal gradient is
$$\|\Delta\| = \|\nabla R(\theta; x^k) - \nabla R(\theta; x^\star)\| \le \|\Delta_3\| + \|\Delta_5\| \cdot \left\|\frac{\partial x^\star}{\partial \theta}\right\| + \|\nabla_x R(\theta; x^k)\| \cdot \left\|\frac{\partial x^k}{\partial \theta} - \frac{\partial x^\star}{\partial \theta}\right\| \le L_1\left(1 + \frac{\mu_3}{\mu_2}\right)\|x^k - x^\star\| + \mu_1\left(\frac{L_3}{\mu_2} + \frac{\mu_3 L_2}{\mu_2^2}\right)\|x^k - x^\star\| = \left(L_1 + \frac{\mu_3 L_1 + \mu_1 L_3}{\mu_2} + \frac{\mu_1 \mu_3 L_2}{\mu_2^2}\right)\|x^k - x^\star\|.$$
Therefore, the gradient of the loss function $R(\theta; x^k)$ shares the same order of error as the truncated solution $x^k$.
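The chain rule $\nabla R = \nabla_\theta R + \nabla_x R\, \partial x/\partial\theta$ underlying the corollary can be illustrated on a toy bilevel problem. The sketch below (a hypothetical unconstrained quadratic inner problem with $q = \theta$, so $x^\star(\theta) = -P^{-1}\theta$ and $\nabla_\theta R = 0$ since $R$ depends on $\theta$ only through $x^\star$) checks the analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)          # fixed SPD curvature of the inner objective
theta = rng.standard_normal(n)

# Inner problem: x*(theta) = argmin_x 0.5 x^T P x + theta^T x  =>  x* = -P^{-1} theta
x_star = lambda th: -np.linalg.solve(P, th)
# Outer loss depends on theta only through x*
R = lambda th: 0.5 * np.dot(x_star(th), x_star(th))

# Chain rule: grad R = (dx*/dtheta)^T grad_x R, with dx*/dtheta = -P^{-1}
grad_analytic = (-np.linalg.inv(P)).T @ x_star(theta)

# Central finite-difference check of the total gradient
eps = 1e-6
grad_fd = np.array([(R(theta + eps * e) - R(theta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```

Replacing the exact $x^\star$ by a truncated iterate $x^k$ perturbs both factors of the chain rule, which is exactly what $\Delta_3$, $\Delta_5$, and the Theorem 4.1 term account for.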

E ALT-DIFF AND DIFFERENTIATING KKT

We first consider the gradient obtained by differentiating KKT conditions implicitly, and then show that it is the same as the result obtained by Alt-Diff.

E.1 DIFFERENTIATION OF KKT CONDITIONS

In convex optimization problem (1), the optimal solution is implicitly defined by the following KKT conditions:
$$\begin{cases}
\nabla f(x) + A^T(\theta)\lambda + G^T(\theta)\nu = 0 \\
A(\theta)x - b(\theta) = 0 \\
\mathrm{diag}(\nu)\left(G(\theta)x - h(\theta)\right) = 0
\end{cases} \quad (34)$$
where $\lambda$ and $\nu$ are the dual variables and $\mathrm{diag}(\cdot)$ creates a diagonal matrix from a vector. Suppose $x$ collects the optimal solutions $[x^\star, \lambda^\star, \nu^\star]$. In order to obtain the derivative of the solution $x$ w.r.t. the parameters $A(\theta)$, $b(\theta)$, $G(\theta)$, $h(\theta)$, we need to calculate $J_{F;x}$ and $J_{F;\theta}$ by differentiating the KKT conditions (34) as follows Amos & Kolter (2017); Agrawal et al. (2019a); Zhang et al. (2020):
$$\left[\, J_{F;x^\star} \;\; J_{F;\lambda} \;\; J_{F;\nu} \,\right] = \begin{bmatrix} f_{xx}^T(x^\star) & A^T(\theta) & G^T(\theta) \\ A(\theta) & 0 & 0 \\ \mathrm{diag}(\nu)G(\theta) & 0 & \mathrm{diag}\left(G(\theta)x - h(\theta)\right) \end{bmatrix}$$
$$\left[\, J_{F;A} \;\; J_{F;b} \;\; J_{F;G} \;\; J_{F;h} \,\right] = \begin{bmatrix} \lambda^T \otimes I & 0 & \nu^T \otimes I & 0 \\ I \otimes (x^\star)^T & -I & 0 & 0 \\ 0 & 0 & \mathrm{diag}(\nu)\left(I \otimes (x^\star)^T\right) & -\mathrm{diag}(\nu) \end{bmatrix} \quad (35a)$$
where $f_{xx}(x) := \nabla_x^2 f(x)$ and $\otimes$ denotes the Kronecker product.
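To make the mechanics of KKT differentiation concrete, the following hedged sketch treats only the equality-constrained case (inequalities dropped for brevity, data random; not the paper's implementation). The KKT system is then linear, and the Jacobian $\partial x^\star/\partial b$ solves the same KKT matrix with a right-hand side that selects the $b$-block, which we verify by finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# KKT system for min 0.5 x^T P x + q^T x  s.t.  Ax = b:
#   [P  A^T] [x]     [-q]
#   [A   0 ] [lam] = [ b]
K = np.block([[P, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_opt = sol[:n]

# Differentiating the KKT conditions w.r.t. b gives K @ d[x; lam]/db = [0; I]
rhs = np.vstack([np.zeros((n, m)), np.eye(m)])
dxdb = np.linalg.solve(K, rhs)[:n]           # the n x m Jacobian dx*/db

# Finite-difference check on the first coordinate of b (exact up to
# rounding, since x* is linear in b here)
eps = 1e-6
b2 = b.copy(); b2[0] += eps
x2 = np.linalg.solve(K, np.concatenate([-q, b2]))[:n]
assert np.allclose((x2 - x_opt) / eps, dxdb[:, 0], atol=1e-4)
```

With inequalities included, the KKT matrix acquires the complementarity rows of (35a) and is generally nonsymmetric, which is where the cost of the implicit approach grows.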

E.2 ILLUSTRATION OF EQUIVALENCE

In this section, we show the equivalence of the Jacobian obtained by differentiating the KKT conditions and the one obtained by Alt-Diff from a computational perspective. As shown in Theorem 4.1, this section is a further complement to that equivalence. Here we show that the results obtained by Alt-Diff are equal to those obtained by differentiating the KKT conditions. Note that the alternating procedure (5) is based on ADMM, which is convergent for convex optimization problems. Assuming the optimal solutions of the alternating procedure (5) are $[x^\star_{\mathrm{Alt}}, s^\star_{\mathrm{Alt}}, \lambda^\star_{\mathrm{Alt}}, \nu^\star_{\mathrm{Alt}}]$, we can derive the following results after the forward ADMM pass (5) converges:
$$\begin{cases}
x^\star_{\mathrm{Alt}} = \arg\min_x L(x, s^\star_{\mathrm{Alt}}, \lambda^\star_{\mathrm{Alt}}, \nu^\star_{\mathrm{Alt}}; \theta), \\
s^\star_{\mathrm{Alt}} = \mathrm{ReLU}\left(-\frac{1}{\rho}\nu^\star_{\mathrm{Alt}} - (Gx^\star_{\mathrm{Alt}} - h)\right), \\
\lambda^\star_{\mathrm{Alt}} = \lambda^\star_{\mathrm{Alt}} + \rho(Ax^\star_{\mathrm{Alt}} - b), \\
\nu^\star_{\mathrm{Alt}} = \nu^\star_{\mathrm{Alt}} + \rho(Gx^\star_{\mathrm{Alt}} + s^\star_{\mathrm{Alt}} - h).
\end{cases} \quad (36)$$
Taking the derivative with respect to the parameters $\theta$ in (36), the differentiation procedure (7) becomes
$$\begin{cases}
\frac{\partial x^\star_{\mathrm{Alt}}}{\partial \theta} = -\left(f_{xx}^T(x^\star_{\mathrm{Alt}}) + \rho A^T A + \rho G^T G\right)^{-1}\mathcal{L}_{x\theta}(x^\star_{\mathrm{Alt}}), & (37\mathrm{a}) \\
\frac{\partial s^\star_{\mathrm{Alt}}}{\partial \theta} = -\frac{1}{\rho}\left(\mathrm{sgn}(s^\star_{\mathrm{Alt}}) \cdot \mathbf{1}^T\right) \odot \left(\frac{\partial \nu^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial(Gx^\star_{\mathrm{Alt}} - h)}{\partial \theta}\right), & (37\mathrm{b}) \\
\frac{\partial \lambda^\star_{\mathrm{Alt}}}{\partial \theta} = \frac{\partial \lambda^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial(Ax^\star_{\mathrm{Alt}} - b)}{\partial \theta}, & (37\mathrm{c}) \\
\frac{\partial \nu^\star_{\mathrm{Alt}}}{\partial \theta} = \frac{\partial \nu^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial(Gx^\star_{\mathrm{Alt}} + s^\star_{\mathrm{Alt}} - h)}{\partial \theta}. & (37\mathrm{d})
\end{cases}$$
Recall that differentiating the KKT conditions at the optimal solution $[x^\star, \lambda^\star, \nu^\star]$ gives
$$\begin{cases}
f_{xx}^T(x^\star)\frac{\partial x^\star}{\partial \theta} + \frac{\partial A^T\lambda^\star}{\partial \theta} + \frac{\partial G^T\nu^\star}{\partial \theta} = 0, & (38\mathrm{a}) \\
\frac{\partial(Ax^\star - b)}{\partial \theta} = 0, & (38\mathrm{b}) \\
\frac{\partial \nu^\star}{\partial \theta}(Gx^\star - h) + \nu^\star\frac{\partial(Gx^\star - h)}{\partial \theta} = 0. & (38\mathrm{c})
\end{cases}$$
By the convergence of ADMM, it is known that $[x^\star_{\mathrm{Alt}}, \lambda^\star_{\mathrm{Alt}}, \nu^\star_{\mathrm{Alt}}]$ is equal to $[x^\star, \lambda^\star, \nu^\star]$. Since in convex optimization problems the KKT conditions (34) and their derivative (38) have unique solutions, we only need to show that (38) can be directly derived from (37).

(37a) ⟹ (38a): By (37a),
$$\left(f_{xx}^T(x^\star_{\mathrm{Alt}}) + \rho A^T A + \rho G^T G\right)\frac{\partial x^\star_{\mathrm{Alt}}}{\partial \theta} = -\frac{\partial A^T\lambda^\star_{\mathrm{Alt}}}{\partial \theta} - \frac{\partial G^T\nu^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial\left(A^T b - G^T(s^\star_{\mathrm{Alt}} - h)\right)}{\partial \theta}. \quad (39)$$
Therefore,
$$f_{xx}^T(x^\star_{\mathrm{Alt}})\frac{\partial x^\star_{\mathrm{Alt}}}{\partial \theta} + \frac{\partial A^T\lambda^\star_{\mathrm{Alt}}}{\partial \theta} + \frac{\partial G^T\nu^\star_{\mathrm{Alt}}}{\partial \theta} = -\left(\rho A^T A + \rho G^T G\right)\frac{\partial x^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial\left(A^T b - G^T(s^\star_{\mathrm{Alt}} - h)\right)}{\partial \theta} = -\rho\frac{\partial\left(A^T(Ax^\star_{\mathrm{Alt}} - b)\right)}{\partial \theta} - \rho\frac{\partial\left(G^T(Gx^\star_{\mathrm{Alt}} + s^\star_{\mathrm{Alt}} - h)\right)}{\partial \theta} = 0. \quad (40)$$
Clearly, (40) is exactly the same as (38a).

(37c) ⟹ (38b): By (37c), we can obtain $\frac{\partial(Ax^\star_{\mathrm{Alt}} - b)}{\partial \theta} = 0$, which is exactly the same as (38b).

(37b) and (37d) ⟹ (38c): By (37b) and (37d), we can obtain
$$\rho\frac{\partial(Gx^\star_{\mathrm{Alt}} - h)}{\partial \theta} = -\rho\frac{\partial s^\star_{\mathrm{Alt}}}{\partial \theta} = \left(\mathrm{sgn}(s^\star_{\mathrm{Alt}}) \cdot \mathbf{1}^T\right) \odot \left(\frac{\partial \nu^\star_{\mathrm{Alt}}}{\partial \theta} + \rho\frac{\partial(Gx^\star_{\mathrm{Alt}} - h)}{\partial \theta}\right). \quad (42)$$
We discuss this issue in two situations. Since the equality $Gx^\star_{\mathrm{Alt}} + s^\star_{\mathrm{Alt}} - h = 0$ holds by optimality, we have $\langle \nu^\star_{\mathrm{Alt}}, s^\star_{\mathrm{Alt}}\rangle = 0$ by the complementary slackness condition.

• If $s^\star_{\mathrm{Alt},i} > 0$, then $\nu^\star_{\mathrm{Alt},i} = 0$ and we obtain
$$\rho\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta} = \mathrm{sgn}(s^\star_{\mathrm{Alt},i})\left(\frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta} + \rho\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta}\right) = \frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta} + \rho\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta}. \quad (43)$$
Therefore $\frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta} = 0$ and
$$\frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta}(G_i x^\star_{\mathrm{Alt}} - h_i) + \nu^\star_{\mathrm{Alt},i}\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta} = 0. \quad (44)$$
• If $s^\star_{\mathrm{Alt},i} = 0$, then $G_i x^\star_{\mathrm{Alt}} - h_i = 0$ and
$$\rho\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta} = \mathrm{sgn}(s^\star_{\mathrm{Alt},i})\left(\frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta} + \rho\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta}\right) = 0. \quad (45)$$
Therefore
$$\frac{\partial \nu^\star_{\mathrm{Alt},i}}{\partial \theta}(G_i x^\star_{\mathrm{Alt}} - h_i) + \nu^\star_{\mathrm{Alt},i}\frac{\partial(G_i x^\star_{\mathrm{Alt}} - h_i)}{\partial \theta} = 0. \quad (46)$$
Therefore, (44) and (46) show that (38c) holds.
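The alternating procedure (5) and its differentiation (7) can be sketched end to end on a small quadratic program. The following is a hedged NumPy illustration (random data, $\theta = b$, penalty and iteration count chosen for demonstration; slack margins made generous so the inequalities stay inactive, in which case the converged Alt-Diff Jacobian must match the equality-constrained KKT Jacobian):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p, rho = 4, 2, 3, 1.0
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
G = rng.standard_normal((p, n))
x_feas = np.linalg.lstsq(A, b, rcond=None)[0]
h = G @ x_feas + 10.0                 # generous slack: inequalities stay inactive

H = P + rho * A.T @ A + rho * G.T @ G
Hinv = np.linalg.inv(H)               # computed once, reused by both passes

x, s, lam, nu = np.zeros(n), np.zeros(p), np.zeros(m), np.zeros(p)
dx, ds, dlam, dnu = (np.zeros((n, m)), np.zeros((p, m)),
                     np.zeros((m, m)), np.zeros((p, m)))

for _ in range(20000):
    # ---- forward ADMM pass (5) ----
    x = Hinv @ (-q - A.T @ lam - G.T @ nu + rho * A.T @ b + rho * G.T @ (h - s))
    s_new = np.maximum(0.0, -nu / rho - (G @ x - h))
    # ---- alternating differentiation (7), with theta = b ----
    dx = -Hinv @ (A.T @ dlam + G.T @ dnu + rho * G.T @ ds - rho * A.T)
    ds = (s_new > 0)[:, None] * (-dnu / rho - G @ dx)
    dlam = dlam + rho * (A @ dx - np.eye(m))
    dnu = dnu + rho * (G @ dx + ds)
    s = s_new
    lam = lam + rho * (A @ x - b)
    nu = nu + rho * (G @ x + s - h)

# With inactive inequalities the QP reduces to the equality-constrained case,
# whose KKT Jacobian dx*/db solves [P A^T; A 0] J = [0; I]
K = np.block([[P, A.T], [A, np.zeros((m, m))]])
J = np.linalg.solve(K, np.vstack([np.zeros((n, m)), np.eye(m)]))[:n]
assert np.allclose(dx, J, atol=1e-4)
```

Note that the derivative recursion reuses the single matrix `Hinv` from the forward pass, which is the computational point made in Appendix B: no KKT-sized linear system is ever formed.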

F EXPERIMENTAL DETAILS

In this section, we give more simulation results as well as some implementation details.

F.1 ADDITIONAL NUMERICAL EXPERIMENTS

In the constrained Sparsemax layer, the "lsqr" mode of CvxpyLayer is applied for a fair comparison. The total running time of CvxpyLayer includes canonicalization, the forward and backward passes, retrieval, and initialization. The comparison results are provided in Table 4. Alt-Diff obtains results competitive with OptNet and CvxpyLayer at a lower execution time. The tolerances of the gradient are all set to $10^{-3}$ for OptNet, CvxpyLayer, and Alt-Diff. Besides, we conduct experiments to compare the running time of Alt-Diff, OptNet, and CvxpyLayer under different tolerances ranging from $10^{-1}$ to $10^{-5}$. The results in Table 5 show that truncation does not significantly improve the computational speed of existing methods but does offer significant improvements for Alt-Diff. All numerical experiments were executed 5 times, with the average times reported as the results. In the tables, "-" represents that the solver cannot generate the gradients. From Table 4, we find that OptNet is much slower than CvxpyLayer in solving sparse quadratic programs. CvxpyLayer requires a lot of time in the initialization procedure, but once initialization is completed, calling CvxpyLayer is very fast. However, Alt-Diff only needs to compute the inverse matrix once, significantly reducing the computational cost. As a special case of a general convex objective function, we adopted $f(x) = -y^T x + \sum_{i=1}^n x_i \log x_i$, which is not a quadratic program; thus we can only compare Alt-Diff with CvxpyLayer. The constraints $Ax = b$ and $Gx \le h$ are randomly generated with dense coefficients. In Alt-Diff, each primal update $x^{k+1}$ in (5a) is carried out by Newton's method with the tolerance set to $10^{-4}$. The comparison of cosine distance and running time is in Table 6. Alt-Diff significantly outperforms CvxpyLayer in all these cases, especially for large-scale optimization problems.
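The remark above that Alt-Diff "only needs to compute the inverse matrix once" is, in practice, a one-time factorization reused across all iterations. A hedged sketch of this caching pattern (NumPy Cholesky; the paper does not specify the factorization, and `scipy.linalg.solve_triangular` would exploit the triangular structure better if SciPy is available):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)          # Hessian of the augmented Lagrangian: constant across iterations

# Factorize once (Cholesky: H = L L^T), then reuse for every primal/derivative solve
L = np.linalg.cholesky(H)
def solve_cached(rhs):
    # two triangular solves instead of a fresh O(n^3) factorization per iteration
    # (np.linalg.solve does not exploit triangularity; this is for illustration only)
    y = np.linalg.solve(L, rhs)
    return np.linalg.solve(L.T, y)

for _ in range(3):                    # stand-in for the Alt-Diff iterations
    rhs = rng.standard_normal(n)
    assert np.allclose(solve_cached(rhs), np.linalg.solve(H, rhs))
```

Amortizing the factorization this way is what makes the per-iteration cost of Alt-Diff a pair of back-substitutions rather than a KKT-system solve.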

F.2 IMAGE CLASSIFICATION

All neural networks are implemented in PyTorch with the Adam optimizer Kingma & Ba (2014). We replaced one layer of the neural network with an optimization layer and compared the running time and test accuracy on the MNIST dataset using OptNet and Alt-Diff. The batch size is set to 64 and the learning rate to $10^{-3}$. We ran 30 epochs and report the running time and test accuracy in Table 7. Since OptNet runs much faster than CvxpyLayer on dense quadratic layers, we only compare Alt-Diff with OptNet. We used two convolutional layers, each followed by a ReLU activation function and max pooling, for feature extraction. After that, we added two fully connected layers with 200 and 10 neurons, both using the ReLU activation function. Between them, an optimization layer with a 200-dimensional input and a 200-dimensional output is inserted. We set the dimensions of the inequality and equality constraints both to 50. Following settings similar to OptNet, we used the quadratic objective function $f(x) = \frac{1}{2}x^T P x + q^T x$, taking $q$ as the input of the optimization layer and the optimal $x^\star$ as the output. Finally, the 10 neurons of the last fully connected layer were fed into a Softmax layer, and the negative log-likelihood was used as the loss function. From the experimental results, we can see that Alt-Diff runs much faster and yields accuracy similar to OptNet.



ALTERNATING DIFFERENTIATION

Motivated by the scalability of operator splitting methods Glowinski & Le Tallec (1989); Stellato et al. (2020), Alt-Diff first decouples a constrained optimization problem into multiple subproblems based on ADMM. Each split operator is then differentiated to establish the derivatives of the primal and dual variables w.r.t. the parameters in an alternating fashion. The augmented Lagrange function of problem (

4.2 TRUNCATED CAPABILITY OF ALT-DIFF

As shown in recent work Fung et al. (2021); Geng et al. (

(a) The norm variation trends. (The blue dotted line is the gradient $\partial x^\star/\partial b$ obtained by CvxpyLayer.) (b) The cosine distance of the obtained gradient between Alt-Diff and CvxpyLayer.

Figure 1: The variation of primal variable gradient of dense quadratic layers. (The threshold of iteration is set as 10 -3 )

(a) The convergence behaviors. (b) The average running time.

Figure 2: The experimental results of Alt-Diff and CvxpyLayer for energy generation scheduling.

Figure 3: The model architecture of Alt-Diff for optimization layers.

Figure 4: The training and testing performance of Alt-Diff and OptNet on MNIST.

They have been responsible for substantial advances in many applications, including Neural ODEs Chen et al. (2018); Dupont et al. (2019), Deep Equilibrium Models Bai et al. (2019; 2020); Gurumurthy et al. (2021), nonconvex optimization problems Wang et al. (2019), and implicit 3D surface layers Michalkiewicz et al. (2019); Park et al. (2019), etc. Implicit layers have a computational complexity similar to optimization layers, as they also require solving a costly Jacobian-based equation derived from the Implicit Function Theorem. Recently, Fung et al. (2021) proposed a Jacobian-free backpropagation method to accelerate the training of implicit layers. However, this method is not suitable for optimization layers with complex constraints. In terms of training implicit models, Geng et al. (2021) also proposed a novel gradient estimate called the phantom gradient, which relies on fixed-point unrolling and a Neumann series to provide a new update direction; the computation of the precise gradient is forgone. Implicit models have also been extended to more complex learning frameworks, such as attention mechanisms Geng et al. (2020) and Graph Neural Networks Gu et al. (2020).

Comparison of running time (s) and cosine distances of gradients in dense quadratic layers with tolerance ϵ = 10 -3 .

Comparison of running time (s) under different tolerances for a dense quadratic layer with n = 3000, m = 1000 and p = 500.

Comparison of running time (s) and cosine distances of gradients in constrained Sparsemax layers with tolerance ϵ = 10 -3 .

Comparison of running time (s) for constrained Sparsemax layers with n = 10000 and n c = 20000 under different tolerances.

Comparison of running time (s) and cosine distances of gradients in constrained Softmax layers. Since OptNet only works for quadratic optimization problems, we cannot compare Alt-Diff with it for the constrained Softmax layers.

Comparison of OptNet and Alt-Diff on MNIST with tolerance ϵ = 10 -3 .

ACKNOWLEDGMENTS

This work was supported by the Shanghai Sailing Program (22YF1428800, 21YF1429400), the Shanghai Local College Capacity Building Program (23010503100), and the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI).

