TOWARDS ONE-SHOT NEURAL COMBINATORIAL SOLVERS: THEORETICAL AND EMPIRICAL NOTES ON THE CARDINALITY-CONSTRAINED CASE

Abstract

One-shot non-autoregressive neural networks, unlike RL-based ones, have been actively adopted for solving combinatorial optimization (CO) problems and can be trained on the objective score in a self-supervised manner. Such methods have shown their superiority in efficiency (e.g. by parallelization) and their potential for tackling predictive CO problems, i.e. decision-making under uncertainty. The discrete constraints, however, often become a bottleneck for gradient-based neural solvers and are currently handled in three typical ways: 1) adding a soft penalty to the objective, where a bounded violation of the constraints cannot be guaranteed, which is critical in many constraint-sensitive scenarios; 2) perturbing the input to generate approximate gradients in a black-box manner, where the constraints are exactly obeyed but the approximate gradients can hurt the performance on the objective score; 3) a compromise of developing soft algorithms whereby the output of the neural network obeys a relaxed constraint, where an arbitrary degree of constraint violation can still occur. Towards the ultimate goal of establishing a general framework for neural CO solvers with the ability to control an arbitrarily small degree of constraint violation, in this paper we focus on a more achievable and common setting: cardinality constraints, which in fact can be readily encoded by a differentiable optimal transport (OT) layer. Based on this observation, we propose OT-based cardinality constraint encoding for end-to-end CO learning with two variants, Sinkhorn and Gumbel-Sinkhorn, whose violation of the constraints can be exactly characterized and bounded by our theoretical results. On synthetic and real-world CO problem instances, our methods surpass the state-of-the-art CO network and are comparable to (if not superior to) the commercial solver Gurobi.
In particular, we further showcase a case study applying our approach to predictive portfolio optimization on real-world asset price data, improving the Sharpe ratio of a strong LSTM+Gurobi baseline under the classic predict-then-optimize paradigm from 1.1 to 2.0.

1. INTRODUCTION

Developing neural networks that can handle combinatorial optimization (CO) problems is a trending research topic (Vinyals et al., 2015; Dai et al., 2016; Yu et al., 2020). A family of recent CO networks (Wang et al., 2019b; Li et al., 2019; Karalias & Loukas, 2020; Bai et al., 2019) improves upon the existing reinforcement-learning-based auto-regressive CO networks (Dai et al., 2016; Lu et al., 2019) by solving the problem in one shot and relaxing the non-differentiable constraints, resulting in an end-to-end learning pipeline. The advantages of one-shot CO networks are recognized in three aspects: 1) higher efficiency, by exploiting a GPU-friendly one-shot feed-forward network, compared to CPU-based traditional solvers (Gamrath et al., 2020) and to tedious auto-regressive CO networks; 2) a natural label-free, self-supervised learning paradigm that directly optimizes the objective score, which is more practical than supervised learning (Vinyals et al., 2015) and empirically more efficient than reinforcement learning (Schulman et al., 2017); 3) an end-to-end architecture that enables tackling the important class of predictive CO problems, i.e. decision-making under uncertainty (Wilder et al., 2019; Elmachtoub & Grigas, 2022).

In this paper, we follow the general paradigm of learning to solve CO in one shot presented in the seminal work of Karalias & Loukas (2020). A neural network CO solver is built upon a problem encoder network, which accepts raw problem data and predicts the decision variables of the problem. The decision variables are then passed to a differentiable formula to estimate the objective score, and finally the objective score is treated as the self-supervised loss. All modules must be differentiable for end-to-end learning. As a CO solver, the output of the network should obey the constraints of the CO problem while still preserving gradients.
Since the input-output mappings of CO are piecewise constant, with a gradient that is zero almost everywhere and undefined where the output jumps, it is notoriously hard to encode CO constraints in neural networks. There are three typical workarounds: 1) In Karalias & Loukas (2020), the constraints are softly enforced by a penalty term, and the degree of constraint violation can hardly be theoretically characterized or controlled, which limits applicability in many constraint-critical scenarios. Moreover, in the obligatory discretization step, adding penalty terms means the algorithm must search a much larger space than if it were confined to feasible configurations, making the search less efficient and less generalizable (see Table 1). 2) Perturbation-based black-box differentiation methods (Pogančić et al., 2019; Paulus et al., 2021; Berthet et al., 2020) add perturbations to the input-output mapping of discrete functions to estimate approximate gradients, so the strict constraints are enforced by brute force, yet the approximate gradients may hurt the learning process. 3) Soft algorithms (Zanfir & Sminchisescu, 2018; Wang et al., 2019a; Sakaue, 2021) encode constraints into neural networks by developing approximate, differentiable algorithms for certain CO problems (graph matching, SAT, submodular optimization); we follow this line for its efficiency, yet an arbitrary degree of constraint violation can still occur. Towards the ultimate goal of devising a general CO network solver addressing all the above issues, in this paper we focus on developing a more practical paradigm for solving cardinality-constrained CO problems (Buchbinder et al., 2014).
Cardinality constraints ∥x∥_0 ≤ k are commonly found in a wide range of applications, such as planning facility locations in business operations (Liu, 2009), discovering the most influential seed users in social networks (Chen et al., 2021), and predicting portfolios with controllable operational costs (Chang et al., 2000). Under a cardinality constraint, we aim to find the optimal subset of size at most k. Like other discrete CO constraints, the cardinality constraint is non-trivial to differentiate through. In this paper, we propose to encode cardinality constraints into CO networks by a topk selection over a probability distribution (the output of an encoder network). An intuitive approach is to sort all probabilities and select the k largest ones; however, such a process does not offer informative gradients. Inspired by Cuturi et al. (2019) and Xie et al. (2020), we develop a soft algorithm by reformulating topk selection as an optimal transport problem (Villani, 2009) and tackle it efficiently with the differentiable Sinkhorn algorithm (Sinkhorn, 1964). With a follow-up differentiable computation of the self-supervised loss, we obtain a CO network whose output is softly cardinality-constrained and capable of end-to-end learning. However, our theoretical characterization of the Sinkhorn-based soft algorithm shows that its violation of the cardinality constraint may grow significantly if the values of the k-th and (k+1)-th probabilities are too close. Aware of the perturbation-based differentiable methods (Pogančić et al., 2019; Paulus et al., 2021; Berthet et al., 2020) and the Gumbel trick (Jang et al., 2017; Mena et al., 2018; Grover et al., 2019) for building near-discrete neural networks, we further incorporate the Gumbel trick, which is crucial for strictly bounding the constraint-violation term by an arbitrarily small number.
Our network combines the high efficiency of soft algorithms (Zanfir & Sminchisescu, 2018) with the low constraint violation of perturbation-based methods (Pogančić et al., 2019; Jang et al., 2017). A homotopy extension (Xu et al., 2016) is further developed in which the constraint-violation term is gradually tightened. Following the self-supervised learning pipeline of Karalias & Loukas (2020), our cardinality-constrained CO networks are validated on two representative deterministic CO tasks: the facility location and max covering problems. An important application to predictive CO is also addressed, where the problem parameters are unknown at decision-making time. We present a "predict-and-optimize" network that jointly learns a predictor and a neural network CO solver end-to-end over the final objective score, instead of the two-stage "predict-then-optimize" approach, which learns a predictor first and then optimizes separately, at the risk of the optimization performance being hurt by prediction errors. Specifically, for the practical and widely studied task of portfolio optimization under uncertainty, we build an end-to-end predictive portfolio optimization model. Experimental results on real-world data show that it outperforms the classic "predict-then-optimize" paradigm. The contributions include:

• New End-to-end One-shot Neural Architecture for CO Problems. We propose the first (to our best knowledge) end-to-end cardinality-constrained neural network for efficient one-shot CO problem solving, in the sense that the constraints are incorporated into the network architecture instead of being placed in the learning objective as penalty terms.

• Theoretical and Empirical Advantages of the CO Architecture. The cardinality constraint is encoded in a differentiable optimal transport layer based on the topk selection technique (Xie et al., 2020). We further introduce the idea of perturbation as used in black-box differentiable CO (Pogančić et al., 2019; Paulus et al., 2021), incorporating the Gumbel trick to reduce the constraint violation, with a violation bound strictly guaranteed by our theoretical results. Empirical results on two CO tasks, facility location and max covering, verify its competitiveness.

• Enabling the "predict-and-optimize" Paradigm. We show that our new network further enables an emerging end-to-end "predict-and-optimize" paradigm, in contrast to the traditional "predict-then-optimize" pipeline. Its potential is demonstrated by a study on predictive portfolio optimization on real-world asset price data, with an improvement of the Sharpe ratio from 1.1 to 2.0 compared with an LSTM+Gurobi baseline.

2. CARDINLIATY-CONSTRAINED COMBINATORIAL NETWORKS

An overview of our CardNN pipeline is shown in Fig. 1. Following the general paradigm (Karalias & Loukas, 2020) of tackling CO in one shot, we introduce an optimal transport (OT) cardinality layer into the neural network CO solver to enforce the constraints on the output of the problem encoder network, with advantages that can be shown both empirically and theoretically. Recall that under a cardinality constraint, the solution must have no more than k non-zero elements: min_x J(x) s.t. ∥x∥_0 ≤ k. In this paper, enforcing the cardinality constraint in networks is formulated as solving OT with differentiable layers (Cuturi, 2013). Denote by s = [s_1, ..., s_m] the probability vector predicted by the problem encoder network; our OT layer selects the k largest items of s by moving k items to one destination ("selected") and the other (m − k) items to the other destination ("not selected"). In the following, we present two embodiments of OT layers and their theoretical characteristics.
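As a concrete preview, the OT-based topk selection can be sketched in a few lines of NumPy (a minimal sketch of the entropic-regularized form solved by Sinkhorn iterations, detailed in Sec. 2.1; the value of `tau` and the fixed iteration count are illustrative choices of ours, not prescribed by the paper):

```python
import numpy as np

def sinkhorn_topk(s, k, tau=0.05, n_iter=300):
    """Soft topk selection as entropic OT between the m items of s and
    the two destinations {s_min ("not selected"), s_max ("selected")}."""
    m = len(s)
    # 2 x m distance matrix: row 0 -> distance to s_min, row 1 -> distance to s_max
    D = np.stack([s - s.min(), s.max() - s])
    r = np.array([m - k, k], dtype=float)  # marginal over the two destinations
    T = np.exp(-D / tau)                   # entropic initialization
    for _ in range(n_iter):
        T *= (r / T.sum(axis=1))[:, None]  # normalize rows to marginal r
        T /= T.sum(axis=0)                 # normalize columns to the all-ones marginal c
    return T                               # T[1, i]: soft indicator that item i is selected

s = np.array([1.0, 0.8, 0.601, 0.6, 0.4, 0.2])
T = sinkhorn_topk(s, k=3)
# T[1] puts nearly all mass on the clearly-largest items, while the near-tie
# 0.601 vs 0.6 splits its mass: exactly the failure mode analyzed in Sec. 2.1.
```

The near-tied pair illustrates why the constraint violation of the plain Sinkhorn layer degrades when the k-th and (k+1)-th values are close.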

2.1. CARDNN-S: SINKHORN LAYER FOR CARDINALITY CONSTRAINT

We follow the popular Sinkhorn method (Sinkhorn, 1964) and define the OT problem as follows. The sources are the m candidates in s and the destinations are the min/max values of s. OT moves the topk items to s_max and the others to s_min. The marginal distributions (c, r) and the distance matrix D are defined as:

c = [1, 1, ..., 1] (m items),  r = [m − k, k],
D = [ s_1 − s_min   s_2 − s_min   ...   s_m − s_min
      s_max − s_1   s_max − s_2   ...   s_max − s_m ].   (2)

Then OT can be formulated as an integer linear program:

min_T tr(T^⊤ D)  s.t.  T ∈ {0, 1}^{2×m}, T1 = r, T^⊤ 1 = c,   (3)

where T is the transportation matrix, which is also a feasible decision variable for the cardinality constraint, and 1 is a column vector whose elements are all 1s. The optimal solution T^* to Eq. (3) is equivalent to the solution obtained by first sorting all items and then selecting the topk. To make the process differentiable by a soft algorithm, the binary constraint on T is relaxed to continuous values in [0, 1], and Eq. (3) is modified with an entropic regularization:

min_{T^τ} tr(T^{τ⊤} D) + τ h(T^τ)  s.t.  T^τ ∈ [0, 1]^{2×m}, T^τ 1 = r, T^{τ⊤} 1 = c,   (4)

where h(T^τ) = Σ_{i,j} T^τ_{ij} log T^τ_{ij} is the entropic regularizer (Cuturi, 2013). Given any real-valued matrix D, Eq. (4) is solved by first applying the regularization factor τ: T^τ = exp(−D/τ). Then T^τ is row- and column-wise normalized alternately:

D_r = diag(T^τ 1 ⊘ r), T^τ = D_r^{−1} T^τ;  D_c = diag(T^{τ⊤} 1 ⊘ c), T^τ = T^τ D_c^{−1},   (5)

where ⊘ is element-wise division. We denote by T^{τ*} the converged solution, which is the optimal solution to Eq. (4). The second row of T^{τ*} is regarded as the relaxed decision variable for the cardinality constraint: T^{τ*}[2, i] is regarded as the probability that x_i should be non-zero. T^{τ*} is further fed into the objective estimator. Note that T^{τ*} is usually infeasible for the original problem, and we define the following constraint violation (CV) to measure its quality:

Proposition 2.3. The constraint violation of CardNN-S is

CV_CardNN-S = ∥T^* − T^{τ*}∥_F ≤ 2mτ log 2 / |ϕ_k − ϕ_{k+1}|.   (6)
Without loss of generality, denote by ϕ the descending-ordered sequence of s, i.e. ϕ_k, ϕ_{k+1} are the k-th and (k+1)-th largest elements of s, respectively. Proposition 2.3 is a straightforward derivation from Theorem 2 of Xie et al. (2020), and improves on Karalias & Loukas (2020), whose CV is uncontrolled. However, as Eq. (6) shows, the CV of CardNN-S grows as |ϕ_k − ϕ_{k+1}| becomes smaller and diverges in the extreme case ϕ_k = ϕ_{k+1}, meaning that its CV cannot be tightened by adjusting the hyperparameter τ. Such a divergence is not surprising, because one cannot decide whether to select ϕ_k or ϕ_{k+1} if they are equal; this would be fine if direct supervision on T^{τ*} were available, but as discussed in Remark 2.2, the importance of CV is non-negligible in self-supervised CO networks. Since working with the Sinkhorn algorithm alone reaches its theoretical bottleneck, in the following we present our improved version, which introduces random perturbations (Pogančić et al., 2019; Jang et al., 2017) to further tighten the CV.

Algorithm 1: CardNN-GS: Gumbel-Sinkhorn Layer for Cardinality Constraint
Input: list s with m items; cardinality k; Sinkhorn factor τ; noise factor σ; sample size #G.
for i ∈ {1, 2, ..., #G} do
    for all s_j: s̃_j = s_j − σ log(−log(u_j)), where u_j is drawn from the uniform distribution on (0, 1);
    D̃ = [ s̃_1 − s̃_min ... s̃_m − s̃_min ; s̃_max − s̃_1 ... s̃_max − s̃_m ]; construct c, r following Eq. (2);
    T̃_i = exp(−D̃/τ);
    while not converged do
        D_r = diag(T̃_i 1 ⊘ r); T̃_i = D_r^{−1} T̃_i;
        D_c = diag(T̃_i^⊤ 1 ⊘ c); T̃_i = T̃_i D_c^{−1};
Output: a list of transport matrices [T̃_1, T̃_2, ..., T̃_#G].

2.2. CARDNN-GS: GUMBEL-SINKHORN LAYER FOR CARDINALITY CONSTRAINT

In this section, we present our Gumbel-Sinkhorn layer for the cardinality constraint, summarized in Alg. 1, and theoretically characterize its CV. Following the reparameterization trick (Jang et al., 2017), instead of sampling from a non-differentiable distribution, we add random variables to the probabilities predicted by the neural network. The Gumbel noise is g_σ(u) = −σ log(−log(u)), where σ controls the variance and u is drawn from the uniform distribution on (0, 1). We update s and D as:

s̃_j = s_j + g_σ(u_j),  D̃ = [ s̃_1 − s̃_min   s̃_2 − s̃_min   ...   s̃_m − s̃_min
                              s̃_max − s̃_1   s̃_max − s̃_2   ...   s̃_max − s̃_m ].   (8)

Again we formulate the integer linear programming version of OT with Gumbel noise:

min_{T^σ} tr(T^{σ⊤} D̃)  s.t.  T^σ ∈ {0, 1}^{2×m}, T^σ 1 = r, T^{σ⊤} 1 = c,   (9)

whose optimal solution is denoted T^{σ*}. To make the integer linear program amenable to gradient-based deep learning, we again relax the integer constraint and add the entropic regularization term:

min_{T̃} tr(T̃^⊤ D̃) + τ h(T̃)  s.t.  T̃ ∈ [0, 1]^{2×m}, T̃1 = r, T̃^⊤ 1 = c,   (10)

which is tackled by the Sinkhorn algorithm following Eq. (5). We denote the optimal solution to Eq. (10) as T̃^*. Since T^{σ*} is the nearest feasible solution to T̃^*, we characterize the constraint violation as the expectation of ∥T^{σ*} − T̃^*∥_F, and multiple T̃s are generated in parallel in practice to overcome the randomness (note that ϕ is the descending-ordered version of s):

Proposition 2.4. With probability at least (1 − ϵ), the constraint violation of CardNN-GS is

CV_CardNN-GS = E_u ∥T^{σ*} − T̃^*∥_F ≤ (log 2) mτ Σ_{i≠j} Ω(ϕ_i, ϕ_j, σ, ϵ),

where

Ω(ϕ_i, ϕ_j, σ, ϵ) = [2σ log σ − (|ϕ_i − ϕ_j| + 2σ) log(1 − ϵ) + |ϕ_i − ϕ_j| (π/2 + arctan((ϕ_i − ϕ_j)/(2σ)))] / [(1 − ϵ)((ϕ_i − ϕ_j)^2 + 4σ^2)(1 + exp((ϕ_i − ϕ_k)/σ))(1 + exp((ϕ_{k+1} − ϕ_j)/σ))].

Proof sketch: this proposition is proven by generalizing Proposition 2.3.
We denote by ϕ_{π_k}, ϕ_{π_{k+1}} the k-th and (k+1)-th largest items after perturbation by the Gumbel noise, and our aim becomes proving an upper bound of E_u [1 / |ϕ_{π_k} + g_σ(u_{π_k}) − ϕ_{π_{k+1}} − g_σ(u_{π_{k+1}})|], where the probability density function of g_σ(u_{π_k}) − g_σ(u_{π_{k+1}}) can be bounded by f(y) = 1/(y^2 + 4). Thus we can compute the bound by integration. See Appendix C.1 for details.

We compare the CV of CardNN-S and CardNN-GS with the toy example in Fig. 2: finding the top3 of [1.0, 0.8, 0.601, 0.6, 0.4, 0.2]. We plot the CV w.r.t. different τ, σ values. The CV is tightened by larger σ and smaller τ for CardNN-GS, compared to CardNN-S, whose violation is larger and can only be controlled by τ. These empirical results are in line with Propositions 2.3 and 2.4.

Homotopy Gumbel-Sinkhorn. Corollary 2.5 suggests that the CV can be tightened by adjusting τ and σ, motivating us to develop a homotopy (Xiao & Zhang, 2013; Xu et al., 2016) Gumbel-Sinkhorn method in which the constraints are gradually tightened (i.e. by annealing the τ and σ values). In practice, annealing σ is not considered because a larger σ means increased variance, which calls for more Gumbel samples. We name the homotopy version CardNN-HGS. We also note that our CardNN-S (Sec. 2.1) and CardNN-GS (Sec. 2.2) can be unified theoretically:

Corollary 2.6. CardNN-S is a special case of CardNN-GS when σ → 0+ (proof in Appendix C.3).
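A minimal NumPy sketch of Alg. 1 is given below; the hyperparameter values (`tau`, `sigma`, the sample size, and the fixed iteration count) are illustrative assumptions of ours rather than recommended settings:

```python
import numpy as np

def sinkhorn_topk(s, k, tau=0.05, n_iter=300):
    """Entropic OT relaxation of topk selection (the Sinkhorn layer of Sec. 2.1)."""
    m = len(s)
    D = np.stack([s - s.min(), s.max() - s])  # 2 x m distance matrix of Eq. (2)
    r = np.array([m - k, k], dtype=float)
    T = np.exp(-D / tau)
    for _ in range(n_iter):
        T *= (r / T.sum(axis=1))[:, None]  # row marginals -> r
        T /= T.sum(axis=0)                 # column marginals -> all ones
    return T

def gumbel_sinkhorn_topk(s, k, tau=0.05, sigma=0.05, n_samples=16, seed=0):
    """Alg. 1 sketch: perturb s with Gumbel noise g_sigma(u) = -sigma*log(-log(u)),
    then run a Sinkhorn layer per sample; returns the list of transport matrices."""
    rng = np.random.default_rng(seed)
    Ts = []
    for _ in range(n_samples):
        u = rng.uniform(1e-12, 1.0 - 1e-12, size=len(s))
        s_pert = s - sigma * np.log(-np.log(u))  # s_j + g_sigma(u_j)
        Ts.append(sinkhorn_topk(s_pert, k, tau))
    return Ts

s = np.array([1.0, 0.8, 0.601, 0.6, 0.4, 0.2])
Ts = gumbel_sinkhorn_topk(s, k=3)
mean_sel = np.mean([T[1] for T in Ts], axis=0)  # selection row averaged over samples
```

Each perturbed sample is nearly discrete (small `tau`), so averaging over samples trades the Sinkhorn layer's deterministic tie-breaking failure for controllable sampling variance, mirroring the trade-off in Proposition 2.4.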

3. ONE-SHOT SOLVING THE DETERMINISTIC CO TASKS

In this section, we present the implementation details and experimental results for learning to solve two deterministic CO problems in one shot: the facility location problem (FLP) and the max covering problem (MCP). Deterministic CO means all problem parameters are known at decision-making time. Readers are referred to Appendix D for algorithm details.

The Facility Location Problem. Given m locations, we want to open k facilities such that goods can be stored at the nearest facility and delivered more efficiently (Liu, 2009). The objective is to minimize the sum of the distances between each location and its nearest facility.
Problem Formulation: Denote by ∆ ∈ R_{≥0}^{m×m} the distance matrix of the locations; the FLP is

min_x Σ_{j=1}^m min({∆_{i,j} | ∀ x_i = 1})  s.t.  x ∈ {0, 1}^m, ∥x∥_0 ≤ k.   (12)

Problem Encoder: For locations with 2-D coordinates, an edge is defined if two locations are closer than a threshold, e.g. 0.02. We exploit a 3-layer SplineCNN (Fey et al., 2018) to extract features.
Objective Estimator: The min operator in Eq. (12) leads to sparse gradients. Denoting by • the element-wise product of a matrix and a tiled vector, we replace min by a softmax with negative temperature −β:

Ĵ_i = sum(softmax(−β∆ • T̃_i[2, :]^⊤) • ∆),  Ĵ = mean([Ĵ_1, Ĵ_2, ..., Ĵ_#G]).

The Max Covering Problem. Given m sets and n objects, where each set may cover any number of objects and each object is associated with a value, the MCP (Khuller et al., 1999) aims to find k sets (k ≪ m) such that the covered objects have the maximum sum of values. This problem reflects real-world scenarios such as discovering influential seed users in social networks (Chen et al., 2021).
Problem Formulation: We build a bipartite graph of sets and objects, where coverings are encoded as edges. Denote by v ∈ R^n the values, by A ∈ {0, 1}^{m×n} the adjacency matrix of the bipartite graph, and by I(x) the indicator with I(x)_i = 1 if x_i ≥ 1 and I(x)_i = 0 otherwise.
We formulate the MCP as

max_x Σ_{j=1}^n I(Σ_{i=1}^m x_i A_{ij}) · v_j  s.t.  x ∈ {0, 1}^m, ∥x∥_0 ≤ k.   (13)

Problem Encoder: To encode the bipartite graph, we exploit three layers of GraphSage (Hamilton et al., 2017) followed by a fully-connected layer with sigmoid to predict the probability of selecting each set.
Objective Estimator: Based on Eq. (13), the objective value is estimated as

Ĵ_i = min(T̃_i[2, :] A, 1)^⊤ • v,  Ĵ = mean([Ĵ_1, Ĵ_2, ..., Ĵ_#G]).   (14)

Learning and Optimization. Depending on whether the problem is a minimization or a maximization, Ĵ or −Ĵ is treated as the self-supervised loss, respectively. The Adam optimizer (Kingma & Ba, 2014) is applied for training. At inference, the neural network prediction is regarded as an initialization, and we further optimize the probabilities w.r.t. the objective score by gradient steps.

Experiment Setup. We follow the self-supervised learning pipeline proposed by the state-of-the-art CO network (Karalias & Loukas, 2020), considering both synthetic and real-world data. For synthetic data, we build separate training/testing datasets with 100 samples each. We generate uniformly random locations on a unit square for FLP, and we follow the distribution in OR-LIB (Beasley, 1990) for MCP.

Figure 3: Plot of objective score and gap w.r.t. inference time on synthetic CO problems. Each scatter dot denotes a problem instance, and the average performance is marked by "×". In terms of both efficiency and efficacy, our CardNN-S outperforms the EGN CO network, whose constraint violation is uncontrolled. The efficacy is further improved by CardNN-GS and CardNN-HGS, even surpassing the state-of-the-art commercial solver Gurobi (better results with less inference time). The Gurobi solver fails to return the optimal solution within 24 hours for MCP, thus it is not reported here.

Baselines. 1) Greedy algorithms are considered because they are easy to implement yet very strong and effective.
They have the worst-case approximation ratio of (1 − 1/e) due to the submodular property (Fujishige, 1991) for both FLP and MCP. 2) Integer programming solvers, including the state-of-the-art commercial solver Gurobi 9.0 (Gurobi Optimization, LLC, 2021) and the state-of-the-art open-source solver SCIP 7.0 (Gamrath et al., 2020). The time budgets of the solvers are set higher than those of our networks. 3) For CO neural networks, we compare with the state-of-the-art Erdos Goes Neural (EGN) (Karalias & Loukas, 2020), adapted from its official implementation: https://github.com/Stalence/erdos_neu. The major difference between EGN and ours is that EGN does not enforce CO constraints through its architecture. Besides, we empirically find that all self-supervised learning methods converge within tens of minutes. Since RL methods, e.g. Khalil et al. (2017); Wang et al. (2021a), need much more training time, they are not compared.

Figure 4: Plot of optimal gap w.r.t. inference time on real-world CO problems. Our CardNN models are consistently superior to EGN and are comparable to the state-of-the-art SCIP/Gurobi solvers, sometimes even surpassing them. On the FLP-Starbucks problems, our CardNN-GS/HGS achieve a lower optimal gap with comparable time cost w.r.t. SCIP/Gurobi. On the MCP-Twitch problems, our CardNN-HGS is slower than SCIP/Gurobi, but it finds all optimal solutions.

Metrics and Results. Fig. 3 and Fig. 4 report results on the synthetic and real-world datasets, respectively. The "gap" metric is computed as gap = |J − J^*| / max(J, J^*), where J is the predicted objective and J^* is the incumbent best objective value (among all methods). If one of the integer programming solvers proves an optimal solution, we call it the "optimal gap". Considering both efficiency and efficacy, the performance ranking of the CO networks is CardNN-HGS > CardNN-GS > CardNN-S > EGN. This is in line with our theoretical result in Sec.
2: a lower constraint violation leads to better performance on CO. To justify our choice of Xie et al. (2020) as the base differentiable method, we also implement other perturbation-based differentiable methods and report the MCP results in Table 2. See Appendix E for more details on our deterministic CO experiments.
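To make the soft MCP objective of Eq. (14) concrete, here is a NumPy sketch on a tiny hypothetical instance (the matrices and values below are our own toy data, not from the paper's benchmarks):

```python
import numpy as np

def soft_mcp_objective(sel_row, A, v):
    """Soft MCP objective of Eq. (14): relax the indicator I(.) by clamping
    the soft coverage counts at 1, keeping the estimator differentiable."""
    coverage = np.minimum(sel_row @ A, 1.0)  # soft coverage of each object
    return float(coverage @ v)

# Toy instance: 3 sets, 4 objects. A[i, j] = 1 iff set i covers object j.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
v = np.array([1.0, 2.0, 3.0, 4.0])   # object values

hard = soft_mcp_objective(np.array([1.0, 0.0, 1.0]), A, v)  # pick sets 0 and 2
soft = soft_mcp_objective(np.array([0.9, 0.1, 0.9]), A, v)  # a relaxed selection row
```

With the hard selection of sets 0 and 2, objects {0, 1, 3} are covered, so the objective is 1 + 2 + 4 = 7; the soft selection yields a slightly smaller, but differentiable, estimate.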

4. ONE-SHOT SOLVING THE PREDICTIVE CO TASKS

In this section, we study the important topic of predictive CO problems, where the problem parameters are unknown at decision-making time. We consider the challenging problem of predicting the portfolio with the best trade-off between future risk and return, under a practical cardinality constraint to control operational costs. Traditionally, such a problem involves two separate steps: 1) predict the future asset prices, possibly with a deep learning model; 2) find the best portfolio by solving an optimization problem based on the prediction. However, the optimization step may be misled by unavoidable errors in the prediction model. To resolve this issue, Solin et al. (2019) propose to differentiate through unconstrained portfolio optimization via Amos & Kolter (2017), but the more practical cardinality-constrained problem is less studied.

Problem Formulation. Cardinality-constrained portfolio optimization considers a practical scenario where a portfolio must contain no more than k assets (Chang et al., 2000). A good portfolio aims for high return (measured by the mean vector µ ∈ R^m) and low risk (the covariance matrix Σ ∈ R^{m×m}). Here we maximize the Sharpe ratio (Sharpe, 1998). The problem is formulated as

max_x (µ − r_f)^⊤ x / √(x^⊤ Σ x)  s.t.  Σ_{i=1}^m x_i = 1, x ≥ 0, ∥x∥_0 ≤ k,   (15)

where x denotes the weight of each asset and r_f is the risk-free return, e.g. of U.S. treasury bonds. Note that µ, Σ are unknown at decision-making time and are predicted by a neural network.

Network Architecture. An encoder-decoder architecture of Long Short-Term Memory (LSTM) modules is adopted as the problem encoder (i.e. the price predictor). The sequence of historical daily prices is fed into the encoder module, and the decoder module outputs the predicted future prices. We append a fully-connected layer after the hidden states to learn the probabilities for the cardinality constraint, followed by our CardNN-GS layers.
Objective Estimator. Based on the network outputs µ, Σ, T̃, we estimate x by leveraging the closed-form solution of the unconstrained version of Eq. (15), x = Σ^{−1}(µ − r_f), and then enforcing the constraints: x = relu(x ⊙ T̃_i[2, :]), x = x / sum(x) (⊙ denotes element-wise product). After obtaining x, we compute the Sharpe ratio based on x and the ground-truth µ_gt, Σ_gt computed from the true prices, and use this Sharpe ratio as supervision:

Ĵ_i = (µ_gt − r_f)^⊤ x / √(x^⊤ Σ_gt x).   (16)

Figure 5: Return (left) and risk (right) of portfolios by the classic "predict-then-optimize" pipeline (LSTM for prediction and Gurobi for optimization) and by our CardNN-GS for end-to-end "predict-and-optimize" (average risk: CardNN-GS 18.8%, predict-then-optimize 19.5%, history-opt 16.2%). The portfolios proposed by our CardNN-GS have higher returns and lower risks. Since the batch of S&P500 assets violates the cardinality constraint, it is unfair to compare its risk.

Setup and Baselines. The price predictor is supervised with price labels, but the optimizer is self-supervised (no optimal-solution labels). We consider predicting the portfolio with the best Sharpe ratio over the next 120 trading days (∼24 weeks) and test on real data from the year 2021. The training set is built from the prices of 494 assets in the S&P 500 index from 2018-01-01 to 2020-12-30. We set the annual risk-free return to 3% and the cardinality constraint to k = 20. The classic "predict-then-optimize" baseline learns the same LSTM model as ours to minimize the squared prediction error of the asset prices and optimizes the portfolio with Gurobi based on the price predictions. We also consider a "history-opt" baseline, which follows the optimal portfolio on historical data.

Results. The portfolios are tested on real data from 2021-01-01 to 2021-12-30, and the results are shown in Fig. 5 and Table 3. On average, we improve the annual return of the portfolio from 24.1% to 40%.
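The objective estimator above can be sketched in NumPy on a toy 3-asset example (the numbers and the hypothetical `sel_row`, standing in for the selection row T̃_i[2, :], are ours for illustration only):

```python
import numpy as np

def soft_sharpe(mu, Sigma, sel_row, mu_gt, Sigma_gt, r_f=0.03):
    """Closed-form unconstrained weights x = Sigma^{-1}(mu - r_f), masked by the
    soft cardinality selection, renormalized onto the simplex, then scored by
    the Sharpe ratio under the ground-truth moments, as in Eq. (16)."""
    x = np.linalg.solve(Sigma, mu - r_f)   # unconstrained closed form
    x = np.maximum(x * sel_row, 0.0)       # relu(x ⊙ selection row)
    x = x / x.sum()                        # sum-to-one constraint
    return float((mu_gt - r_f) @ x / np.sqrt(x @ Sigma_gt @ x))

mu = np.array([0.10, 0.20, 0.05])          # predicted mean returns (toy)
Sigma = np.eye(3)                          # predicted covariance (toy)
sel_row = np.array([1.0, 1.0, 0.0])        # hard cardinality mask with k = 2
sharpe = soft_sharpe(mu, Sigma, sel_row, mu_gt=mu, Sigma_gt=Sigma)
```

With perfect predictions (µ_gt = µ, Σ_gt = Σ), the masked weights are [7/24, 17/24, 0] and the resulting Sharpe ratio is about 0.184; during training, the mask would be the soft Gumbel-Sinkhorn output and the gradient flows through `x` back into the predictor.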
The MSE in Table 3 denotes the mean squared error of the price predictions; note that more accurate price predictions do not necessarily lead to better portfolios. We visualize the predicted portfolios in Fig. 6 and compare them to the efficient frontier (the portfolios with the optimal risk-return trade-off). Being closer to the frontier means a better portfolio. Also note that reaching the efficient frontier is nearly impossible, as the prediction always contains errors.

5. CONCLUSIONS

Towards the ultimate goal of developing general paradigms to encode CO constraints into neural networks with controlled constraint-violation bounds, in this paper, we have presented a differentiable neural network for cardinality-constrained combinatorial optimization. We theoretically characterize the constraint-violation of the Sinkhorn network (Sec. 2.1), and we introduce the Gumbel trick to mitigate the constraint-violation issue (Sec. 2.2). Our method is validated in learning to solve deterministic CO problems (on both synthetic and real-world problems) and end-to-end learning of predictive CO problems under the important predict-and-optimize paradigm.

A RELATED WORK

CO Networks with Constraint Handling for Deterministic CO. Multi-step methods encode constraints by manually programmed action spaces, and the networks can be learned from supervised labels (Vinyals et al., 2015) or by reinforcement learning (Khalil et al., 2017; Zhang et al., 2020; Chen & Tian, 2019). Controlling constraint violation is less of an issue for supervised or reinforcement learning because the supervision signals are directly passed to the output of the neural network. One-shot CO networks construct the solution in a single forward pass and are thus more efficient. The seminal work of Karalias & Loukas (2020) aims to develop a general pipeline for one-shot CO networks by softly absorbing the violation of constraints into its final loss. However, our analysis shows that such uncontrolled constraint violation potentially harms problem-solving. There also exist embodiments of constrained CO networks for tailored problems; e.g. the constraint can be encoded as doubly-stochastic matrices in assignment problems (Nowak et al., 2018; Wang et al., 2021b; 2020). These methods can be viewed as special cases of our paradigm (yet the powerful perturbation method is not fully exploited). , 2020) with certain restrictions, such as requiring the formula to be linear. The other paradigm incorporates neural-network solvers, which are naturally differentiable. For example, for graph matching on images (Fey et al., 2020; Sarlin et al., 2020), deep feature predictors and neural matching solvers are learned end-to-end under supervised learning, and their neural solvers leverage the Sinkhorn algorithm (Cuturi, 2013) as a neural network layer. In this paper, our neural network solver is incorporated for the new predictive portfolio optimization task, and our predict-and-optimize pipeline does not require ground-truth labels for the optimization problem.

B LIMITATIONS

We are also aware of the following limitations: 1) Our theoretical analysis mainly focuses on characterizing the constraint violation. The approximation ratios of our CO networks w.r.t. the optimal solution and the optimal objective score remain an unexplored theoretical aspect, which we plan to study in future work. 2) The original EGN pipeline is relatively general for all constraints, while we restrict the scope of this paper to cardinality constraints. We see a potential direction to extend our work: the cardinality constraints are handled by our method (encoded in the network's output), while the other constraints are handled in a way similar to EGN (encoded as Lagrange multipliers or penalty terms). In this way, the cardinality constraints are handled efficiently while the generality of EGN is preserved. 3) In predictive CO tasks, the predictor may be, to some degree, coupled with the follow-up neural network solver. In our predictive portfolio optimization experiment, our price predictor does not generalize soundly to the Gurobi solver: the Sharpe ratio degenerates to 1.002 if our price predictions are passed to Gurobi.

C PROOF OF THEOREMS

Before starting the detailed proofs of the propositions and corollaries, we first recall the notation used in this paper:
• $T_* = \mathrm{TopK}(D)$ is the optimal solution of the integer linear programming form of the OT problem in Eq. (3), which is equivalent to first sorting all items and then selecting the top-$k$ items. If the $k$-th and $(k{+}1)$-th largest items are equal, the algorithm randomly selects one of them so that the cardinality constraint is strictly satisfied;
• $T^\tau_* = \mathrm{Sinkhorn}(D)$ is the optimal solution of the entropic regularized form of the OT problem in Eq. (4), solved by the Sinkhorn algorithm. It is also the output of CardNN-S;
• $T^\sigma_* = \mathrm{TopK}(\tilde{D})$ is the optimal solution of the integer linear programming form of the OT problem after perturbation by Gumbel noise in Eq. (9), which is equivalent to first adding the Gumbel noise, then sorting all items, and finally selecting the top-$k$ items. If the perturbed $k$-th and $(k{+}1)$-th largest items are equal, the algorithm randomly selects one of them so that the cardinality constraint is strictly satisfied;
• $\tilde{T}_* = \mathrm{Sinkhorn}(\tilde{D})$ is the optimal solution of the entropic regularized form of the OT problem after perturbation by Gumbel noise in Eq. (10), solved by the Sinkhorn algorithm. It is also the output of our proposed CardNN-GS.

C.1 PROOF OF PROPOSITION 2.4

We first introduce a lemma that will be referenced in the proof of Proposition 2.4:

Lemma C.1. Given real numbers $\phi_i, \phi_j$ and $u_i, u_j$ drawn i.i.d. from the uniform distribution on $(0,1)$, after Gumbel perturbation the probability that $\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j)$ is:
$$P(\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j)) = \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)}. \quad (16)$$

Proof.
Since $g_\sigma(u) = -\sigma\log(-\log u)$, $P(\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j))$ equals the probability that the following inequality holds:
$$\phi_i - \sigma\log(-\log u_i) > \phi_j - \sigma\log(-\log u_j), \quad (17)$$
which can be rearranged as
$$\phi_i - \phi_j > \sigma\log(-\log u_i) - \sigma\log(-\log u_j) \quad (18)$$
$$\frac{\phi_i - \phi_j}{\sigma} > \log\frac{\log u_i}{\log u_j} \quad (19)$$
$$\exp\left(\frac{\phi_i - \phi_j}{\sigma}\right) > \frac{\log u_i}{\log u_j}. \quad (20)$$
Since $u_j \in (0,1)$, we have $\log u_j < 0$, and multiplying both sides by $\log u_j$ flips the inequality:
$$\log u_j < \log(u_i)\exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right) \quad (21)$$
$$\log u_j < \log u_i^{\exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)} \quad (22)$$
$$u_j < u_i^{\exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)}. \quad (23)$$
Since $u_i, u_j$ are i.i.d. uniform, the probability that the above inequality holds is
$$\int_0^1 \int_0^{u_i^{\exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)}} \mathrm{d}u_j\, \mathrm{d}u_i = \int_0^1 u_i^{\exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)} \mathrm{d}u_i = \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)}. \quad (24)$$
Thus the probability that $\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j)$ after Gumbel perturbation is
$$P(\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j)) = \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)}. \quad (25)$$

In the following we present the proof of Proposition 2.4:

Proof of Proposition 2.4. Recall that we denote by $\Phi = [\phi_1, \phi_2, \phi_3, \ldots, \phi_m]$ the descending-ordered version of $s$. Perturbing it with i.i.d. Gumbel noise gives
$$\tilde{\Phi} = [\phi_1 + g_\sigma(u_1), \phi_2 + g_\sigma(u_2), \ldots, \phi_m + g_\sigma(u_m)], \quad (26)$$
where $g_\sigma(u) = -\sigma\log(-\log u)$ is the Gumbel noise modulated by the noise factor $\sigma$, and $u_1, u_2, \ldots, u_m$ are i.i.d. uniform on $(0,1)$. We define $\pi$ as the permutation that sorts $\tilde{\Phi}$ in descending order, i.e. $\phi_{\pi_1} + g_\sigma(u_{\pi_1}), \phi_{\pi_2} + g_\sigma(u_{\pi_2}), \ldots, \phi_{\pi_m} + g_\sigma(u_{\pi_m})$ are in descending order.
Recalling Proposition 2.3, for $\phi_1, \phi_2, \ldots, \phi_m$ we have
$$\|T_* - T^\tau_*\|_F \le \frac{2m\tau\log 2}{|\phi_k - \phi_{k+1}|}. \quad (27)$$
Substituting $\Phi$ with $\tilde{\Phi}$ and taking the expectation over $u$, we have
$$\mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le \mathbb{E}_u \left[\frac{2m\tau\log 2}{|\phi_{\pi_k} + g_\sigma(u_{\pi_k}) - \phi_{\pi_{k+1}} - g_\sigma(u_{\pi_{k+1}})|}\right]. \quad (28)$$
Based on Lemma C.1, for a given ordering $\pi$ of the remaining items, the probability that $\pi_k = i, \pi_{k+1} = j$ is
$$P(\pi_k = i, \pi_{k+1} = j) = \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)} \prod_{a=1}^{k-1} \frac{1}{1 + \exp\left(-\frac{\phi_{\pi_a} - \phi_i}{\sigma}\right)} \prod_{b=k+2}^{m} \frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{\pi_b}}{\sigma}\right)}, \quad (29)$$
where the first term corresponds to $\phi_i + g_\sigma(u_i) > \phi_j + g_\sigma(u_j)$, and the remaining terms cover the conditions that $(k-1)$ items are larger than $\phi_i + g_\sigma(u_i)$ and the rest are smaller than $\phi_j + g_\sigma(u_j)$.

In the following we derive an upper bound of $\mathbb{E}_u\big[\frac{1}{|\phi_{\pi_k} + g_\sigma(u_{\pi_k}) - \phi_{\pi_{k+1}} - g_\sigma(u_{\pi_{k+1}})|}\big]$. We denote by $A_{i,j}$ the set
$$u_i, u_j \in A_{i,j} \quad \text{s.t.} \quad \phi_i + g_\sigma(u_i) - \phi_j - g_\sigma(u_j) > \epsilon, \quad (30)$$
where $\epsilon$ is a sufficiently small positive number. Then we have
$$\mathbb{E}_u\left[\frac{1}{\phi_{\pi_k} + g_\sigma(u_{\pi_k}) - \phi_{\pi_{k+1}} - g_\sigma(u_{\pi_{k+1}})}\right] = \sum_{i \ne j} P(\pi_k = i, \pi_{k+1} = j)\, \mathbb{E}_{u_i, u_j \in A_{i,j}}\left[\frac{1}{|\phi_i + g_\sigma(u_i) - \phi_j - g_\sigma(u_j)|}\right] \quad (31)$$
$$= \sum_{i \ne j} \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)} \prod_{a=1}^{k-1} \frac{1}{1 + \exp\left(-\frac{\phi_{\pi_a} - \phi_i}{\sigma}\right)} \prod_{b=k+2}^{m} \frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{\pi_b}}{\sigma}\right)}\, \mathbb{E}_{u_i, u_j \in A_{i,j}}\left[\frac{1}{|\phi_i - \sigma\log(-\log u_i) - \phi_j + \sigma\log(-\log u_j)|}\right] \quad (32\text{--}33)$$
$$= \sum_{i \ne j} f(\phi_i - \phi_j, \sigma, z) \prod_{a=1}^{k-1} \frac{1}{1 + \exp\left(-\frac{\phi_{\pi_a} - \phi_i}{\sigma}\right)} \prod_{b=k+2}^{m} \frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{\pi_b}}{\sigma}\right)}, \quad (34)$$
where we denote
$$f(\delta, \sigma, z) = \frac{1}{1 + \exp\left(-\frac{\delta}{\sigma}\right)}\, \mathbb{E}_{u_i, u_j}\left[\frac{1}{|\delta - \sigma\log(-\log u_i) + \sigma\log(-\log u_j)|}\right] \quad \text{s.t.} \quad \delta - \sigma\log(-\log u_i) + \sigma\log(-\log u_j) > z > 0. \quad (35)$$
For the probability terms in Eq. (34), for all permutations $\pi$ there must exist $\pi_a, \pi_b$ such that
$$\frac{1}{1 + \exp\left(-\frac{\phi_{\pi_a} - \phi_i}{\sigma}\right)} \le \frac{1}{1 + \exp\left(-\frac{\phi_k - \phi_i}{\sigma}\right)}, \quad (36)$$
$$\frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{\pi_b}}{\sigma}\right)} \le \frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{k+1}}{\sigma}\right)}. \quad (37)$$
Thus we have
$$\text{Eq. (34)} \le \sum_{i \ne j} f(\phi_i - \phi_j, \sigma, z)\, \frac{1}{1 + \exp\left(-\frac{\phi_k - \phi_i}{\sigma}\right)} \cdot \frac{1}{1 + \exp\left(-\frac{\phi_j - \phi_{k+1}}{\sigma}\right)} \quad (38)$$
$$\le \sum_{i \ne j} \frac{f(\phi_i - \phi_j, \sigma, z)}{\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)}. \quad (39)$$

By Eq. (16) in Lemma C.1 and substituting $\phi_j - \phi_i$ by $y$, we have
$$\text{Eq. (16)} \Rightarrow P(g_\sigma(u_i) - g_\sigma(u_j) > \phi_j - \phi_i) = \frac{1}{1 + \exp\left(-\frac{\phi_i - \phi_j}{\sigma}\right)} \quad (40)$$
$$\Rightarrow P(g_\sigma(u_i) - g_\sigma(u_j) > y) = \frac{1}{1 + \exp\frac{y}{\sigma}} \quad (41)$$
$$\Rightarrow P(g_\sigma(u_i) - g_\sigma(u_j) < y) = 1 - \frac{1}{1 + \exp\frac{y}{\sigma}} = \frac{1}{1 + \exp\left(-\frac{y}{\sigma}\right)}, \quad (42)$$
where the right-hand side with $\sigma = 1$ is exactly the cumulative distribution function (CDF) of the standard Logistic distribution:
$$\mathrm{CDF}(y) = \frac{1}{1 + \exp(-y)}. \quad (43)$$
Thus $-\log(-\log u_i) + \log(-\log u_j)$ follows the standard Logistic distribution, whose probability density function (PDF) is
$$\mathrm{PDF}(y) = \frac{\mathrm{d}\,\mathrm{CDF}(y)}{\mathrm{d}y} = \frac{1}{\exp(-y) + \exp(y) + 2}, \quad (44)$$
and in this proof we exploit the upper bound
$$\mathrm{PDF}(y) = \frac{1}{\exp(-y) + \exp(y) + 2} \le \frac{1}{y^2 + 4}. \quad (45)$$
Based on the Logistic distribution, we can replace $-\sigma\log(-\log u_i) + \sigma\log(-\log u_j)$ by $\sigma y$ where $y$ follows the Logistic distribution, and derive the upper bound of $f(\delta, \sigma, z)$ as follows:
$$f(\delta, \sigma, z) = \frac{1}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \cdot \frac{\int_{-\delta/\sigma+z}^{\infty} \frac{1}{\delta + \sigma y} \cdot \frac{1}{\exp(-y) + \exp(y) + 2}\, \mathrm{d}y}{\int_{-\delta/\sigma+z}^{\infty} \frac{1}{\exp(-y) + \exp(y) + 2}\, \mathrm{d}y} \quad (46)$$
$$= \frac{1}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \cdot \frac{\int_{-\delta/\sigma+z}^{\infty} \frac{1}{\delta + \sigma y} \cdot \frac{1}{\exp(-y) + \exp(y) + 2}\, \mathrm{d}y}{1 - \frac{1}{1 + \exp(\delta/\sigma - z)}} \quad (47)$$
$$= \frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \int_{-\delta/\sigma+z}^{\infty} \frac{1}{\delta + \sigma y} \cdot \frac{1}{\exp(-y) + \exp(y) + 2}\, \mathrm{d}y \quad (48\text{--}50)$$
$$\le \frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \int_{-\delta/\sigma+z}^{\infty} \frac{1}{\delta + \sigma y} \cdot \frac{1}{y^2 + 4}\, \mathrm{d}y \quad (51)$$
$$= \frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \cdot \frac{2\sigma\log\left((z\sigma - \delta)^2 + 4\sigma^2\right) - 2\delta\arctan\frac{z - \delta/\sigma}{2} - 4\sigma\log z + \pi\delta}{4\delta^2 + 16\sigma^2}, \quad (52)$$
where Eq. (51) holds because $\frac{1}{\exp(-y) + \exp(y) + 2} \le \frac{1}{y^2 + 4}$. Bounding $(z\sigma - \delta)^2 \le (z\sigma + |\delta|)^2$, then $(z\sigma + |\delta|)^2 + 4\sigma^2 \le (z\sigma + |\delta| + 2\sigma)^2$, and merging the logarithm terms (Eqs. (53)–(57)) gives
$$f(\delta, \sigma, z) \le \frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \cdot \frac{4\sigma\log\frac{z\sigma + |\delta| + 2\sigma}{z} - 2\delta\arctan\frac{z - \delta/\sigma}{2} + \pi\delta}{4\delta^2 + 16\sigma^2}. \quad (57)$$
Since $\pi - 2\arctan\frac{z - \delta/\sigma}{2} \ge 0$, we have $\delta\left(\pi - 2\arctan\frac{z - \delta/\sigma}{2}\right) \le |\delta|\left(\pi - 2\arctan\frac{z - \delta/\sigma}{2}\right) \le |\delta|\left(\pi + 2\arctan\frac{\delta}{2\sigma}\right)$ because $z > 0$ (Eqs. (58)–(61)), so that
$$f(\delta, \sigma, z) \le \frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} \cdot \frac{4\sigma\log\frac{z\sigma + |\delta| + 2\sigma}{z} + |\delta|\left(\pi + 2\arctan\frac{\delta}{2\sigma}\right)}{4\delta^2 + 16\sigma^2}. \quad (61)$$
With probability at least $(1 - \epsilon)$, we have
$$z = \log\frac{1 + \epsilon\exp\frac{\delta}{\sigma}}{1 - \epsilon} \ge -\log(1 - \epsilon), \quad (62)$$
$$\frac{1 + \exp\left(-\frac{\delta}{\sigma} + z\right)}{1 + \exp\left(-\frac{\delta}{\sigma}\right)} = \frac{1}{1 - \epsilon}. \quad (63)$$
Thus
$$f(\delta, \sigma, z) \le \frac{1}{1 - \epsilon} \cdot \frac{4\sigma\log\frac{z\sigma + |\delta| + 2\sigma}{z} + |\delta|\left(\pi + 2\arctan\frac{\delta}{2\sigma}\right)}{4\delta^2 + 16\sigma^2} \quad (64)$$
$$\le \frac{1}{1 - \epsilon} \cdot \frac{4\sigma\log\left(\sigma - \frac{|\delta| + 2\sigma}{\log(1 - \epsilon)}\right) + |\delta|\left(\pi + 2\arctan\frac{\delta}{2\sigma}\right)}{4\delta^2 + 16\sigma^2}. \quad (65)$$
Thus we have
$$\text{Eq. (39)} \le \sum_{i \ne j} \frac{4\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\left(\pi + 2\arctan\frac{\phi_i - \phi_j}{2\sigma}\right)}{(1 - \epsilon)\left(4(\phi_i - \phi_j)^2 + 16\sigma^2\right)\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)}. \quad (66)$$
In conclusion, with probability at least $(1 - \epsilon)$, we have
$$\mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le \sum_{i \ne j} (2\log 2)m\tau\, \frac{4\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\left(\pi + 2\arctan\frac{\phi_i - \phi_j}{2\sigma}\right)}{(1 - \epsilon)\left(4(\phi_i - \phi_j)^2 + 16\sigma^2\right)\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)} \quad (67)$$
$$= \sum_{i \ne j} (\log 2)m\tau\, \frac{2\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\left(\frac{\pi}{2} + \arctan\frac{\phi_i - \phi_j}{2\sigma}\right)}{(1 - \epsilon)\left((\phi_i - \phi_j)^2 + 4\sigma^2\right)\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)} \quad (68)$$
$$= (\log 2)m\tau \sum_{i \ne j} \Omega(\phi_i, \phi_j, \sigma, \epsilon), \quad (69)$$
where we denote
$$\Omega(\phi_i, \phi_j, \sigma, \epsilon) = \frac{2\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\left(\frac{\pi}{2} + \arctan\frac{\phi_i - \phi_j}{2\sigma}\right)}{(1 - \epsilon)\left((\phi_i - \phi_j)^2 + 4\sigma^2\right)\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)}. \quad (70)$$

C.2 PROOF OF COROLLARY 2.5

Corollary 2.5 is the simplified version of Proposition 2.4 obtained by studying the dominant components.

Proof.
For $\Omega(\phi_i, \phi_j, \sigma, \epsilon)$ in Proposition 2.4, we have
$$\Omega(\phi_i, \phi_j, \sigma, \epsilon) \le \frac{2\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\left(\frac{\pi}{2} + \arctan\frac{\phi_i - \phi_j}{2\sigma}\right)}{(1 - \epsilon)\left((\phi_i - \phi_j)^2 + 4\sigma^2\right)} \quad (71)$$
$$\le \frac{2\sigma\log\left(\sigma - \frac{|\phi_i - \phi_j| + 2\sigma}{\log(1 - \epsilon)}\right) + |\phi_i - \phi_j|\pi}{(1 - \epsilon)\left((\phi_i - \phi_j)^2 + 4\sigma^2\right)} \quad (72)$$
$$= O\left(\frac{\sigma\log(\sigma + |\phi_i - \phi_j|) + |\phi_i - \phi_j|}{(\phi_i - \phi_j)^2 + \sigma^2}\right) \quad (73)$$
$$= O\left(\frac{\sigma + |\phi_i - \phi_j|}{(\phi_i - \phi_j)^2 + \sigma^2}\right), \quad (74)$$
where we regard $(1 - \epsilon)$ as a constant (i.e., assuming high probability), and $O(\cdot)$ means that the logarithm terms are ignored. Then we have
$$\mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le (\log 2)m\tau \sum_{i \ne j} O\left(\frac{\sigma + |\phi_i - \phi_j|}{(\phi_i - \phi_j)^2 + \sigma^2}\right) \quad (75)$$
$$= (\log 2)m\tau\, O\left(\left\{\frac{\sigma + |\phi_i - \phi_j|}{(\phi_i - \phi_j)^2 + \sigma^2}\right\}_{\forall i \ne j}\right) \quad (76)$$
$$= O\left(\left\{\frac{m\tau(\sigma + |\phi_i - \phi_j|)}{(\phi_i - \phi_j)^2 + \sigma^2}\right\}_{\forall i \ne j}\right). \quad (77)$$

Figure 7: Four conditions are considered in our proof (the positions of $\phi_i, \phi_j$ relative to $\phi_k, \phi_{k+1}$). It is worth noting that $\phi_i, \phi_j$ cannot lie strictly between $\phi_k$ and $\phi_{k+1}$, because we define $\phi_k, \phi_{k+1}$ as two adjacent items in the original sorted list.

C.3 PROOF AND REMARKS ON COROLLARY 2.6

In the following, we prove Corollary 2.6 and add some remarks on the relationship between the Sinkhorn and Gumbel-Sinkhorn methods: the Sinkhorn method (CardNN-S) is a special case of the Gumbel-Sinkhorn method (CardNN-GS) when we set $\sigma \to 0^+$. To address Corollary 2.6 more formally, we have the following proposition:

Proposition C.2. Assume the values of $\phi_k, \phi_{k+1}$ are unique.* With probability at least $(1 - \epsilon)$, we have
$$\lim_{\sigma \to 0^+} \mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le \frac{(\pi\log 2)m\tau}{(1 - \epsilon)|\phi_k - \phi_{k+1}|}, \quad (78)$$
which differs from the conclusion of Proposition 2.3 by only a constant factor.

Proof. As $\sigma \to 0^+$, the first term in the numerator of $\Omega(\phi_i, \phi_j, \sigma, \epsilon)$ vanishes. For the second term, we discuss the four conditions shown in Fig. 7, excluding the case $\phi_i = \phi_k, \phi_j = \phi_{k+1}$.

Condition 1. If $\phi_i \ge \phi_k$ and $\phi_j \le \phi_{k+1}$ (where the equalities do not hold simultaneously), we have $\phi_i - \phi_k > 0$ or $\phi_{k+1} - \phi_j > 0$. Then
$$\lim_{\sigma \to 0^+} \frac{1}{\left(1 + \exp\frac{\phi_i - \phi_k}{\sigma}\right)\left(1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}\right)} = 0 \;\Rightarrow\; \lim_{\sigma \to 0^+} \Omega(\phi_i, \phi_j, \sigma, \epsilon) = 0.$$

Condition 2. For any case with $\phi_i < \phi_j$, we have $\phi_i - \phi_j < 0$, thus
$$\lim_{\sigma \to 0^+} \arctan\frac{\phi_i - \phi_j}{2\sigma} = -\frac{\pi}{2} \;\Rightarrow\; \lim_{\sigma \to 0^+} \left(\frac{\pi}{2} + \arctan\frac{\phi_i - \phi_j}{2\sigma}\right) = 0 \;\Rightarrow\; \lim_{\sigma \to 0^+} \Omega(\phi_i, \phi_j, \sigma, \epsilon) = 0.$$

Condition 3. If $\phi_i \ge \phi_j \ge \phi_k$ (where the equalities do not hold simultaneously), we have $\phi_i - \phi_k > 0$. Then
$$\lim_{\sigma \to 0^+} \frac{1}{1 + \exp\frac{\phi_i - \phi_k}{\sigma}} = 0 \;\Rightarrow\; \lim_{\sigma \to 0^+} \Omega(\phi_i, \phi_j, \sigma, \epsilon) = 0.$$

Condition 4. If $\phi_{k+1} \ge \phi_i \ge \phi_j$ (where the equalities do not hold simultaneously), we have $\phi_{k+1} - \phi_j > 0$. Then
$$\lim_{\sigma \to 0^+} \frac{1}{1 + \exp\frac{\phi_{k+1} - \phi_j}{\sigma}} = 0 \;\Rightarrow\; \lim_{\sigma \to 0^+} \Omega(\phi_i, \phi_j, \sigma, \epsilon) = 0.$$

Therefore, whenever $\phi_i \ne \phi_k$ or $\phi_j \ne \phi_{k+1}$, the term $\Omega(\phi_i, \phi_j, \sigma, \epsilon)$ degenerates to 0 as $\sigma \to 0^+$. Thus we reach the following conclusion by considering only $\phi_i = \phi_k, \phi_j = \phi_{k+1}$:
$$\lim_{\sigma \to 0^+} \mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le \lim_{\sigma \to 0^+} \frac{(\log 2)m\tau\, |\phi_k - \phi_{k+1}|\left(\frac{\pi}{2} + \arctan\frac{\phi_k - \phi_{k+1}}{2\sigma}\right)}{(1 - \epsilon)(\phi_k - \phi_{k+1})^2} \le \frac{(\pi\log 2)m\tau}{(1 - \epsilon)|\phi_k - \phi_{k+1}|}.$$

Remarks. Based on the above conclusion, if $|\phi_k - \phi_{k+1}| > 0$, then as $\sigma \to 0^+$ the bound in Eq. (11) degenerates to the bound in Eq. (6) and differs only by a constant factor:
$$\lim_{\sigma \to 0^+} \mathbb{E}_u \|T^\sigma_* - \tilde{T}_*\|_F \le \frac{(\pi\log 2)m\tau}{(1 - \epsilon)|\phi_k - \phi_{k+1}|},$$
where the strong assumption $|\phi_k - \phi_{k+1}| > 0$ is required, and the bound diverges if $\phi_k = \phi_{k+1}$. Since $\phi_k, \phi_{k+1}$ are predictions of a neural network, such an assumption may not be satisfied in practice. In comparison, given $\sigma > 0$, the conclusion with Gumbel noise in Eq. (11) is bounded for any $\phi_k, \phi_{k+1}$. The strength of the theoretical results is also validated in the experiments (see Tables 4 and 5), including the homotopy version CardNN-HGS.
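As a sanity check on Lemma C.1, the closed-form probability can be compared against a Monte Carlo estimate. The snippet below is our own illustration (not the authors' code); `empirical_prob` and `logistic_prob` are hypothetical helper names.

```python
import math
import random

def gumbel(sigma, u):
    # Gumbel noise g_sigma(u) = -sigma * log(-log(u)), with u ~ Uniform(0, 1)
    return -sigma * math.log(-math.log(u))

def empirical_prob(phi_i, phi_j, sigma, n=200_000, seed=0):
    # Monte Carlo estimate of P(phi_i + g(u_i) > phi_j + g(u_j))
    rng = random.Random(seed)
    hits = sum(
        phi_i + gumbel(sigma, rng.random()) > phi_j + gumbel(sigma, rng.random())
        for _ in range(n)
    )
    return hits / n

def logistic_prob(phi_i, phi_j, sigma):
    # Closed form from Lemma C.1: 1 / (1 + exp(-(phi_i - phi_j) / sigma))
    return 1.0 / (1.0 + math.exp(-(phi_i - phi_j) / sigma))

emp = empirical_prob(0.7, 0.4, sigma=0.15)
exact = logistic_prob(0.7, 0.4, sigma=0.15)   # ~0.8808
```

The close agreement reflects that the difference of two i.i.d. Gumbel variables follows a Logistic distribution, which is exactly the property exploited in Eqs. (40)-(45).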

D ALGORITHM DETAILS FOR SOLVING DETERMINISTIC CO PROBLEMS

Due to the page limit, we do not include detailed algorithm blocks on how to solve deterministic CO problems in the main paper. Here we elaborate on our implementation for the facility location problem (FLP) in Alg. 2 and the max covering problem (MCP) in Alg. 3. In our paper, we optimize the hyperparameters by greedy search on a small subset of problem instances (~5) and use the best configuration for CardNN-S/GS/HGS. The hyperparameters of EGN (Karalias & Loukas, 2020) are tuned in the same way. The hyperparameters used to reproduce our experimental results are listed below.
• For the predictive portfolio optimization problem, we set the learning rate α = 10⁻³. For our CardNN-GS module, we set τ = 0.05, σ = 0.1, and the number of Gumbel samples #G = 1000. During inference, among all 1000 portfolio predictions, we return the best portfolio found based on the predicted prices; we empirically find this strategy beneficial for finding better portfolios on the real test set.
All experiments are run on a workstation with an i7-9700K@3.60GHz CPU, 16GB memory, and an RTX2080Ti GPU.
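The inference-time strategy above (keeping the best solution among #G Gumbel samples) can be sketched as follows. This is our own simplified illustration that uses a hard top-k per sample rather than the full Gumbel-Sinkhorn layer, and all function names are hypothetical.

```python
import numpy as np

def best_of_gumbel_samples(s, k, objective, sigma=0.15, n_samples=1000, seed=0):
    # Draw n_samples Gumbel perturbations of the scores s, take the hard
    # top-k of each perturbed score vector, and keep the best solution
    # found under the given objective (higher is better).
    rng = np.random.default_rng(seed)
    s = np.asarray(s, dtype=float)
    u = rng.random((n_samples, len(s)))
    perturbed = s[None, :] - sigma * np.log(-np.log(u))   # s + g_sigma(u)
    best_sel, best_val = None, -np.inf
    for row in perturbed:
        sel = np.argsort(-row)[:k]        # hard top-k on perturbed scores
        val = objective(sel)
        if val > best_val:
            best_sel, best_val = sel, val
    return best_sel, best_val

# Toy usage: scores are imperfect predictions of hypothetical item values w.
w = [5.0, 1.0, 4.0, 2.0, 3.0]
sel, val = best_of_gumbel_samples(
    [5.0, 4.9, 1.0, 2.0, 3.0], k=2,
    objective=lambda idx: sum(w[i] for i in idx), sigma=2.0)
```

With a sufficiently large noise factor, some samples escape the mistaken ranking of the raw scores, so the best sampled solution can beat the plain top-k choice.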

E.3 DETAILED EXPERIMENT RESULTS

In the main paper, we only plot the experiment results on the synthetic and real-world datasets due to the page limit. In Tables 4 and 5, we report the numbers from the synthetic experiments, which are in line with Fig. 3.

Some remarks about EGN on the real-world datasets. Since the sizes of our real-world problems are relatively small, we mainly adopt a transfer learning setting: the CO networks are first trained on synthetic data and then tested on the corresponding real-world datasets. All our CardNN models follow this setting. However, the transfer learning ability of EGN seems less satisfactory, and we empirically find that the performance of EGN degenerates significantly when transferred to a different dataset. In Fig. 4, we exploit the advantage of self-supervised learning for EGN: we allow EGN to be trained in a self-supervised manner on the real-world dataset. To avoid overly sparse scatter plots, we ignore the training time cost when plotting Fig. 4, since it does not affect our main conclusion (performance rank: CardNN-HGS > CardNN-GS > CardNN-S > EGN). We list the detailed experiment results on real-world problems in Tables 6–19.

First, some remarks about the selection of hyperparameters:
• #G (number of Gumbel samples): #G determines how many samples are taken during training and inference for CardNN-GS. A larger #G (i.e. more samples) is more appealing, because CardNN-GS then has a lower variance when estimating the objective score and a higher probability of discovering better solutions. However, #G cannot be arbitrarily large because the GPU has limited memory, and an overly large #G also harms efficiency. In experiments, we set an adequate #G (e.g. #G = 1000) and ensure that it fits into the GPU memory of our workstation (2080Ti, 11GB).
• τ (entropic regularization factor of Sinkhorn): theoretically, τ controls the gap between the continuous Sinkhorn solution and the discrete solution, and a smaller τ tightens this gap; this property is validated by our theoretical findings in Proposition 2.4. Unfortunately, τ cannot be arbitrarily small, because a smaller τ requires more Sinkhorn iterations to converge. Besides, a smaller τ brings the algorithm closer to its discrete version, and the gradient is then more likely to explode. Therefore, given a fixed number of Sinkhorn iterations (100) to ensure the efficiency of our algorithm, we need trial-and-error to discover a suitable τ for both CardNN-S and CardNN-GS. The grid-search results below show that our selection of τ fairly balances the performance of both CardNN-S and CardNN-GS.
• σ (Gumbel noise factor): as derived in Proposition 2.4, a larger σ is beneficial for a tighter constraint-violation term. However, σ cannot be arbitrarily large either, because our theoretical derivation only considers the expectation, not the variance: a larger σ means a larger variance, demanding more samples and bringing computational and memory burdens. In the experiments, we first determine τ and then find a suitable σ by greedy search on a small subset (~5) of problem instances.

5. Enforce the cardinality constraint by the Gumbel-Sinkhorn layer introduced in Sec. 3.2, yielding #G Gumbel samples: {T̃_i | i = 1, 2, ..., #G} = Gumbel-Sinkhorn(s).
6. Compute the weights of each asset based on the second row of T̃_i (r_f is the risk-free return, set as 3%): x_i = Σ⁻¹(μ - r_f), x_i = relu(x_i ⊙ T̃_i[2, :]), x_i = x_i / sum(x_i).
7. Based on the ground-truth prices in the future {p^gt_t | t ≥ 0}, compute the ground-truth risk and return: μ^gt = mean({p^gt_t | t ≥ 0}), Σ^gt = cov({p^gt_t | t ≥ 0}).
8. Estimate the ground-truth Sharpe ratio in the future, if we invest based on x_i: J_i = (μ^gt - r_f)^⊤ x_i / sqrt(x_i^⊤ Σ^gt x_i).
9.
The self-supervised loss is the average over all Gumbel samples: Loss = -mean(J_1, J_2, J_3, ..., J_#G).

Testing steps: follow training steps 1–6 to predict μ, Σ, {x_i | i = 1, 2, ..., #G}. Then:
7. Estimate the predicted Sharpe ratio in the future, if we invest based on x_i: J_i = (μ - r_f)^⊤ x_i / sqrt(x_i^⊤ Σ x_i).
8. Return x_best = x_i with the highest J_i, and enforce the hard cardinality constraint on x_best by hard top-k.
9. Evaluate based on the ground-truth Sharpe ratio: J = (μ^gt - r_f)^⊤ x_best / sqrt(x_best^⊤ Σ^gt x_best).
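The weight computation, the Sharpe ratio, and the hard top-k projection used in the steps above can be sketched as follows (our own illustration with hypothetical function names; r_f = 3% as in the text, and the Sharpe-ratio denominator is the portfolio standard deviation, the standard convention):

```python
import numpy as np

def portfolio_weights(mu, Sigma, soft_select, r_f=0.03):
    # Step 6: x = Sigma^{-1}(mu - r_f), masked by the (soft) selection
    # indicator via relu, then renormalized to sum to one.
    x = np.linalg.solve(Sigma, mu - r_f)
    x = np.maximum(x * soft_select, 0.0)   # relu(x ⊙ T[2, :])
    return x / x.sum()                     # assumes at least one positive entry

def sharpe_ratio(x, mu, Sigma, r_f=0.03):
    # J = (mu - r_f)^T x / sqrt(x^T Sigma x)
    return (mu - r_f) @ x / np.sqrt(x @ Sigma @ x)

def hard_topk_project(x, k):
    # Testing step 8: keep the k largest weights and renormalize, enforcing
    # the hard cardinality constraint.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.argsort(-x)[:k]
    out[keep] = x[keep]
    return out / out.sum()

mu = np.array([0.10, 0.05, 0.20])
Sigma = np.diag([0.04, 0.01, 0.09])
x = portfolio_weights(mu, Sigma, np.array([1.0, 1.0, 0.0]))
J = sharpe_ratio(x, mu, Sigma)
xh = hard_topk_project([0.4, 0.1, 0.3, 0.2], k=2)
```

Masking by the selection row before renormalization is what couples the cardinality constraint to the otherwise unconstrained mean-variance weights.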

G VISUALIZATION OF MORE PORTFOLIOS

In Fig. 8, we provide more visualizations of the portfolios predicted by our "predict-and-optimize" CardNN pipeline (blue), the traditional "predict-then-optimize" pipeline based on LSTM and Gurobi (orange), and the historical-data-based "history-opt" (purple). In general, portfolio optimization trades off risks against returns, and we can draw an efficient frontier whose portfolios are Pareto optimal with respect to risk and return: for a portfolio on the efficient frontier, one cannot achieve a higher return without accepting a higher risk. The closer a portfolio lies to the efficient frontier, the better it is. It is also worth noting that reaching the efficient frontier is nearly infeasible in predictive portfolio optimization, because our predictions of future asset prices always contain errors.
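For a mean-variance model, the efficient frontier mentioned above has a closed form when only the budget and target-return constraints are imposed (no cardinality or long-only constraints). The sketch below uses the classical Markowitz formula and is our own illustration, not part of the paper's pipeline:

```python
import numpy as np

def frontier_variance(mu, Sigma, r):
    # Minimum achievable variance x^T Sigma x subject to 1^T x = 1 and
    # mu^T x = r (Markowitz closed form: (A r^2 - 2 B r + C) / D).
    inv = np.linalg.inv(Sigma)
    ones = np.ones_like(mu)
    A = ones @ inv @ ones
    B = ones @ inv @ mu
    C = mu @ inv @ mu
    D = A * C - B * B
    return (A * r * r - 2.0 * B * r + C) / D

mu = np.array([0.1, 0.2])
v_mid = frontier_variance(mu, np.eye(2), 0.15)   # equal weights -> variance 0.5
v_top = frontier_variance(mu, np.eye(2), 0.20)   # all-in asset 2 -> variance 1.0
```

Plotting sqrt(variance) against r traces the frontier curve; a predicted portfolio is better the closer it sits to this curve, which is the visual criterion used in Fig. 8.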

H DETAILS ON USING EXISTING ASSETS

The following open-source resources are used in this paper, and we sincerely thank the authors and contributors for their great work.
• Implementation of Erdos Goes Neural. Paper: Karalias & Loukas (2020). URL: https://github.com/Stalence/erdos_neu. No open-source license is found on the GitHub webpage.
• SCIP solver. Paper: Gamrath et al. (2020). URL: https://scip.zib.de/. ZIB Academic License.
• ORLIB. Paper: Beasley (1990).



* For a compact proof, we assume that the values of ϕ_k, ϕ_{k+1} are unique. If there are duplicate values of ϕ_k, ϕ_{k+1}, the bound only differs by a constant multiplier and therefore does not affect our conclusion: the Sinkhorn method (CardNN-S) is a special case of the Gumbel-Sinkhorn method (CardNN-GS) when σ → 0⁺.

Starbucks dataset URLs (footnotes 2 and 3 in Appendix E.1):
https://www.kaggle.com/datasets/kukuroo3/starbucks-locations-worldwide-2021-version
https://www.starbucks.com/store-locator



Figure 1: Our CardNN pipeline. The problem encoder and our proposed optimal transport (OT) cardinality layer compose our CO solver network, which has the advantage of guaranteeing a theoretically bounded constraint violation. The decision variables from the CO network are then used to estimate the objective score, i.e. the self-supervised loss. The implementation of the OT cardinality layer and its theoretical characteristics are discussed in Sec. 2. The other components are problem-dependent and are discussed in Sec. 3 and Sec. 4 in the context of each problem.

Figure 2: Toy example.

Figure 3, panel (d): MCP-Synthetic (k=100, m=1000, n=2000); a higher objective score is better.

Figure 6: Visualization on 2021-03-25 data of individual assets. Larger dots mean higher weights. See more visualizations in Appendix G.

• For the Max Covering Problem (MCP), we empirically set the learning rate α = 0.1. For the hyperparameters of CardNN, we set τ = 0.05, σ = 0.15 for CardNN-GS, τ = 0.05 for CardNN-S, and τ = (0.05, 0.04, 0.03), σ = 0.15 for the homotopy version CardNN-HGS. We use #G = 1000 samples for CardNN-GS and CardNN-HGS.
• For the Facility Location Problem (FLP), we set the learning rate α = 0.1. For the hyperparameters of CardNN, we set τ = 0.05, σ = 0.25 for CardNN-GS, τ = 0.05 for CardNN-S, and τ = (0.05, 0.04, 0.03), σ = 0.25 for the homotopy version CardNN-HGS. We use #G = 500 samples for CardNN-GS and CardNN-HGS. The softmax temperature for facility location is empirically set as twice the cardinality constraint: T = 100 if k = 50, and T = 60 if k = 30.
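To make the role of τ and of the cardinality constraint concrete, here is a minimal log-domain Sinkhorn soft top-k layer in the spirit of the OT formulation above. This is our own NumPy sketch, not the paper's PyTorch implementation, and the function names are ours:

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along the given axis.
    amax = a.max(axis=axis, keepdims=True)
    return np.squeeze(amax + np.log(np.exp(a - amax).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_topk(s, k, tau, iters=1000):
    # Entropic OT between m unit-mass items and two bins with capacities
    # m - k ("unselected") and k ("selected"); returns a soft selection
    # indicator in [0, 1] whose entries sum to exactly k.
    s = np.asarray(s, dtype=float)
    m = len(s)
    C = np.stack([s, 1.0 - s], axis=1)            # transport costs to the two bins
    log_r = np.zeros(m)                           # log of row marginals (all ones)
    log_c = np.log(np.array([m - k, k], dtype=float))
    alpha, beta = np.zeros(m), np.zeros(2)
    for _ in range(iters):                        # log-domain Sinkhorn iterations
        alpha = tau * log_r - tau * _logsumexp((beta[None, :] - C) / tau, axis=1)
        beta = tau * log_c - tau * _logsumexp((alpha[:, None] - C) / tau, axis=0)
    T = np.exp((alpha[:, None] + beta[None, :] - C) / tau)
    return T[:, 1]

soft = sinkhorn_topk([0.9, 0.8, 0.1, 0.3, 0.7], k=2, tau=0.01)
```

With a small τ the soft indicator concentrates on the two largest scores while its sum equals k exactly, illustrating why the relaxed cardinality constraint is satisfied by construction and why a smaller τ tightens the gap to the hard top-k solution.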

Comparison among CO networks. Both theoretically and empirically, a smaller constraint-violation (CV) leads to better optimization results. Logarithm terms in the CV bounds are ignored.

Objective score ↑ among perturbation-based methods (Pogančić et al., 2019; Berthet et al., 2020; Amos et al., 2019) on MCP (k=50, m=500, n=1000). The baseline is Xie et al. (2020), as used in CardNN-S.

Corollary 2.5. Ignoring logarithm terms for simplicity, $\mathrm{CV}_{\text{CardNN-GS}} \le O\left(\frac{m\tau(\sigma + |\phi_i - \phi_j|)}{(\phi_i - \phi_j)^2 + \sigma^2}\right)$.

Due to the lack of large-scale datasets, real-world datasets are only considered for testing (training on synthetic data, testing on real-world data). We test FLP based on Starbucks locations in 4 cities worldwide with 166–569 stores, and we test MCP based on 6 social networks with 1912–9498 nodes collected from Twitch by Rozemberczki et al. (2021).

Our "predict-and-optimize" achieves better risk-return trade-off (Sharpe ratio) though the price prediction is less accurate (by mean square error) than "predict-then-optimize" on test set.

Liu et al. (2019) can be viewed as a multi-step optimization variant of ours, yet learning is not considered there and the constraint-violation issue is not theoretically characterized.

Objective score ↓, optimality gap ↓, and inference time (in seconds) ↓ comparison for the facility location problem, including mean and standard deviation computed over all test instances. The problem is to select k facilities from m locations.

The Twitch dataset for MCP. This social network dataset is collected by Rozemberczki et al. (2021), and the edges represent mutual friendships between streamers. The streamers are categorized by their streaming language, resulting in 6 social networks for 6 languages: DE (9498 nodes), ENGB (7126 nodes), ES (4648 nodes), FR (6549 nodes), PTBR (1912 nodes), and RU (4385 nodes). The objective is to cover more viewers, measured by the sum of the logarithmic number of viewers; we take the logarithm to encourage diversity, because top streamers usually have a dominant number of viewers. We set k=50.

E.2 IMPLEMENTATION DETAILS

Our algorithms are implemented in PyTorch, and the graph neural network modules are based on Fey & Lenssen (2019).
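The log-coverage objective described above can be written down directly. The snippet is our own illustration with hypothetical names (`sets` maps each candidate to the set of covered viewer ids, `viewers` maps each viewer id to a viewer count):

```python
import math

def mcp_log_objective(selected, sets, viewers):
    # Max covering objective: sum of log(viewer count) over all objects
    # covered by the union of the selected sets.
    covered = set()
    for i in selected:
        covered |= sets[i]
    return sum(math.log(viewers[v]) for v in covered)

# Toy instance: 3 candidate sets over 4 viewers with counts e^1, e^2, e^1, e^3.
sets = [{0, 1}, {1, 2}, {3}]
viewers = [math.e, math.e ** 2, math.e, math.e ** 3]
obj = mcp_log_objective([0, 1], sets, viewers)   # covers viewers {0, 1, 2}
```

Taking the logarithm of each viewer count before summing keeps a few top streamers from dominating the objective, which is exactly the diversity motivation stated above.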

Objective score ↑, gap ↓, and inference time (in seconds) ↓ of max covering. Under cardinality constraint k, the problem is to select from m sets to cover a fraction of the n objects. For the gray entry, the Gurobi solver fails to return the optimal solution within 24 hours and is thus reported as out-of-time.




ACKNOWLEDGMENTS

* Junchi Yan is the corresponding author. This work was in part supported by the National Key Research and Development Program of China (2020AAA0107600), NSFC (U19B2035, 62222607, 61972250), STCSM (22511105100), and the Shanghai Committee of Science and Technology (21DZ1100100).


Algorithm 2: CardNN-GS/HGS for Solving the Facility Location Problem
Input: the distance matrix ∆; learning rate α; softmax temperature β; CardNN-GS parameters k, τ, σ, #G.
1: if Training then randomly initialize the neural network weights θ;
2: if Inference then load the pretrained neural network weights θ;
3: J_best = +∞;
4: while not converged do
5:   s = SplineCNN_θ(∆);
6:   [T̃_1, T̃_2, ..., T̃_#G] = CardNN-GS(s, k, τ, σ, #G);
7:   for all i, J_i = sum(softmax(-β∆ …));
8:   if Training then update θ with respect to the gradient ∂J/∂θ and learning rate α by gradient ascent;
9:   if Inference then update s with respect to the gradient ∂J/∂s and learning rate α by gradient ascent;
10: if Homotopy Inference then shrink the value of τ and jump to line 5;
Output: the learned network weights θ (if training) / the best objective J_best (if inference).
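The softmax expression in line 7 of Alg. 2 is truncated in the extracted text; a plausible soft FLP objective consistent with a softmax temperature β is a soft-min of each client's distance over the (softly) selected facilities. The sketch below is our own surrogate, not the authors' exact formula:

```python
import numpy as np

def soft_flp_objective(delta, soft_sel, beta):
    # delta[i, j]: distance from facility i to client j; soft_sel[i]: soft
    # selection indicator in [0, 1]. Each client is softly assigned to
    # facilities via softmax(-beta * delta), with closed facilities masked.
    logits = -beta * delta + np.log(soft_sel + 1e-9)[:, None]
    logits -= logits.max(axis=0, keepdims=True)    # stabilized softmax
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)
    return float((w * delta).sum())

def hard_flp_objective(delta, open_idx):
    # Hard objective: each client is served by its nearest open facility.
    return float(delta[open_idx, :].min(axis=0).sum())

delta = np.array([[0.0, 1.0],
                  [1.0, 0.0],
                  [0.5, 0.5]])
soft = soft_flp_objective(delta, np.array([1.0, 1.0, 0.0]), beta=50.0)
hard = hard_flp_objective(delta, [0, 1])
```

As β grows, the soft objective approaches the hard one, which matches the role of the softmax temperature in the hyperparameter settings of Appendix E.2.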

E MORE DETAILS ABOUT DETERMINISTIC CO EXPERIMENT E.1 DATASET DETAILS

The Starbucks location dataset for FLP. This dataset is built from the project Starbucks Location Worldwide 2021 version 2, which is scraped from the openly accessible Starbucks store locator webpage 3. We analyze and select 4 cities with more than 100 Starbucks stores: London (166 stores), New York City (260 stores), Shanghai (510 stores), and Seoul (569 stores). The locations considered are real locations represented as latitude and longitude.

Published as a conference paper at ICLR 2023

We conduct an ablation study on the sensitivity of hyperparameters by performing an extensive grid search near the configuration used in our max covering experiments (τ = 0.05, σ = 0.15, #G = 1000). We choose the k=50, m=500, n=1000 max covering problem and obtain the following results for CardNN-GS and CardNN-S (higher is better): under the configuration used in our paper, both CardNN-S and CardNN-GS achieve relatively good results. The grid search shows that CardNN-GS is not very sensitive to σ when τ = 0.05 or 0.1, while the results for τ = 0.01 are inferior because the Sinkhorn algorithm may not converge. The results with #G = 1000 are all better than with #G = 800, suggesting that a larger #G is appealing given enough GPU memory. We also observe that CardNN-S can accept a smaller τ than CardNN-GS, possibly because adding the Gumbel noise increases the divergence among elements and thus acts similarly to decreasing τ in terms of Sinkhorn convergence.
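The grid search described above amounts to evaluating each (τ, σ) pair on a handful of instances and keeping the best average score. A minimal sketch (our own, exercised with a synthetic evaluation function rather than real CO instances):

```python
import itertools

def grid_search(evaluate, taus, sigmas, instances):
    # Evaluate every (tau, sigma) pair on a small instance subset (~5 in the
    # paper) and return the configuration with the best mean score.
    best_cfg, best_score = None, float("-inf")
    for tau, sigma in itertools.product(taus, sigmas):
        score = sum(evaluate(inst, tau, sigma) for inst in instances) / len(instances)
        if score > best_score:
            best_cfg, best_score = (tau, sigma), score
    return best_cfg, best_score

# Synthetic score peaked at the configuration reported for MCP.
best_cfg, best_score = grid_search(
    lambda inst, tau, sigma: -(tau - 0.05) ** 2 - (sigma - 0.15) ** 2,
    taus=[0.01, 0.05, 0.1], sigmas=[0.1, 0.15, 0.2], instances=[0, 1])
```

In practice `evaluate` would train/run CardNN-S or CardNN-GS on an instance and return the objective score, which is far more expensive than this toy stand-in.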

F DETAILS OF PREDICTIVE PORTFOLIO OPTIMIZATION

Some details of the portfolio optimization model are omitted due to the page limit. Here we elaborate on the entire process of portfolio optimization under the "predict-and-optimize" paradigm, with LSTM and our CardNN-GS.

Training steps:
1. Denote the index of "now" as t = 0. {p_t | t < 0} are the daily percentage price changes in history, and {p_t | t ≥ 0} are the daily percentage price changes in the future.
2. An encoder-decoder LSTM module predicts the prices in the future, where h denotes the hidden state of the LSTM.
3. Compute the risk and return for the future.
4. In the CardNN-GS module, predict s (the probability of selecting each asset) from h: s = fully-connected(h).
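Step 3 (the equations are omitted above) reduces to the empirical mean and covariance of the predicted daily percentage price changes; a sketch with a hypothetical helper name:

```python
import numpy as np

def risk_and_return(price_changes):
    # price_changes: array of shape (days, assets) of daily percentage
    # changes; returns per-asset expected return mu and covariance Sigma.
    p = np.asarray(price_changes, dtype=float)
    mu = p.mean(axis=0)
    Sigma = np.cov(p, rowvar=False)   # unbiased (ddof=1) sample covariance
    return mu, Sigma

mu, Sigma = risk_and_return([[0.01, 0.02],
                             [0.03, 0.00]])
```

In training these statistics come from the LSTM's predicted future changes (step 2), while the ground-truth counterparts μ^gt, Σ^gt used for the self-supervised loss are computed the same way from the realized future changes.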

