LEARNING CUT SELECTION FOR MIXED-INTEGER LINEAR PROGRAMMING VIA HIERARCHICAL SEQUENCE MODEL

Abstract

Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection-which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs-heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning how many cuts should be selected. Moreover, we observe from extensive empirical results that (P3) the order in which selected cuts are added has a significant impact on the efficiency of solving MILPs as well. To address these challenges, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model that learns the number of cuts that should be selected, and (2) a lower-level model-which formulates the cut selection task as a sequence-to-sequence learning problem-that learns policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM generalizes well to MILPs that are significantly larger than those seen during training.

1. INTRODUCTION

Mixed-integer linear programming (MILP) is a general optimization formulation for a wide range of important real-world applications, such as supply chain management (Paschos, 2014), production planning (Jünger et al., 2009), scheduling (Chen, 2010), facility location (Farahani & Hekmatfar, 2009), bin packing (Nair et al., 2020), etc. A standard MILP takes the form of

z* ≜ min_x {c⊤x | Ax ≤ b, x ∈ R^n, x_j ∈ Z for all j ∈ I},  (1)

where c ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, x_j denotes the j-th entry of vector x, I ⊆ {1, . . . , n} denotes the set of indices of integer variables, and z* denotes the optimal objective value of the problem in (1). However, MILPs can be extremely hard to solve, as they are NP-hard problems (Bixby et al., 2004). To solve MILPs, many modern MILP solvers (Gurobi, 2021; Bestuzheva et al., 2021; FICO Xpress, 2020) employ a branch-and-bound tree search algorithm (Land & Doig, 2010), in which a linear programming (LP) relaxation of a MILP (the problem in (1) or one of its subproblems) is solved at each node. To further enhance the performance of the tree search algorithm, cutting planes (cuts) (Gomory, 1960) are introduced to tighten the LP relaxations (Achterberg, 2007; Bengio et al., 2021). Existing work on cuts falls into two categories: cut generation and cut selection (Turner et al., 2022). Cut generation aims to generate cuts, i.e., valid linear inequalities that tighten the LP relaxations (Achterberg, 2007). However, adding all the generated cuts to the LP relaxations can pose a computational problem (Wesselmann & Stuhl, 2012). To further improve the efficiency of solving MILPs, cut selection is proposed to select a proper subset of the generated cuts (Wesselmann & Stuhl, 2012). In this paper, we focus on the cut selection problem, which has a significant impact on overall solver performance (Achterberg, 2007; Tang et al., 2020; Paulus et al., 2022).
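As a concrete illustration of formulation (1) and its LP relaxation, the following sketch brute-forces a hypothetical two-variable toy instance (not from the paper) and checks that the relaxation bound z*_LP never exceeds z*:

```python
# Toy MILP: min -x0 - 2*x1  s.t.  x0 + x1 <= 3.5, 0 <= x0 <= 2, 0 <= x1 <= 2,
# with both variables integer. We brute-force the integer grid; the LP
# relaxation bound z*_LP <= z* always holds.
from itertools import product

c = [-1.0, -2.0]

def feasible(x0, x1):
    return x0 + x1 <= 3.5 and 0 <= x0 <= 2 and 0 <= x1 <= 2

# MILP optimum over the integer grid
z_star = min(c[0] * x0 + c[1] * x1
             for x0, x1 in product(range(3), range(3))
             if feasible(x0, x1))

# LP relaxation optimum (known in closed form for this toy instance:
# x = (1.5, 2) attains the minimum of the relaxed problem)
z_lp = c[0] * 1.5 + c[1] * 2.0

assert z_lp <= z_star  # dual bound property
print(z_star, z_lp)    # -5.0 -5.5
```

The gap between z_lp and z_star is exactly what cutting planes try to close before branching.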
Cut selection heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected (Achterberg, 2007; Dey & Molinaro, 2018b). Many modern MILP solvers (Gurobi, 2021; Bestuzheva et al., 2021; FICO Xpress, 2020) tackle (P1)-(P2) by hard-coded heuristics designed by experts. However, hard-coded heuristics do not take into account underlying patterns among MILPs collected from certain types of real-world applications, e.g., day-to-day production planning, bin packing, and vehicle routing problems (Pochet & Wolsey, 2006; Laporte, 2009; Nair et al., 2020). To further improve the efficiency of MILP solvers, recent methods (Tang et al., 2020; Paulus et al., 2022; Huang et al., 2022) propose to learn cut selection policies via machine learning, especially reinforcement learning. They offer promising approaches to learn more effective heuristics by capturing underlying patterns among MILPs from specific applications (Bengio et al., 2021). However, many existing learning-based methods (Tang et al., 2020; Paulus et al., 2022; Huang et al., 2022)-which learn a scoring function to measure cut quality and select a fixed ratio/number of cuts with high scores-suffer from two limitations. First, they learn which cuts should be preferred via a scoring function, neglecting the importance of learning how many cuts should be selected (Dey & Molinaro, 2018b). Moreover, we observe from extensive empirical results that (P3) the order of selected cuts significantly impacts the efficiency of solving MILPs as well (see Section 3). Second, they do not take into account the interaction among cuts when learning which cuts should be preferred, as they score each cut independently. As a result, they struggle to select cuts that complement each other well, which could severely hinder the efficiency of solving MILPs (Dey & Molinaro, 2018b).
Indeed, we empirically show that they tend to select many similar cuts with high scores (see Experiment 4 in Section 5). To address the aforementioned challenges, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. To the best of our knowledge, HEM is the first learning-based method that can tackle (P1)-(P3) simultaneously, which it achieves through a two-level model. Specifically, HEM is comprised of (1) a higher-level model that learns the number of cuts that should be selected, and (2) a lower-level model that learns policies selecting an ordered subset with the size determined by the higher-level model. The lower-level model formulates the cut selection task as a sequence-to-sequence learning problem, leading to two major advantages. First, sequence models are well suited to capturing the underlying order information (Vinyals et al., 2016), which is critical for tackling (P3). Second, the sequence model can well capture the interaction among cuts, as it models the joint conditional probability of the selected cuts given an input sequence of candidate cuts. As a result, experiments show that HEM significantly outperforms human-designed and learning-based baselines in terms of solving efficiency on three synthetic MILP problems and seven challenging MILP problems. The challenging MILP problems include benchmarks from MIPLIB 2017 (Gleixner et al., 2021) and large-scale real-world production planning problems. Our results demonstrate the strong ability of our proposed HEM to enhance modern MILP solvers in real-world applications. Moreover, experiments demonstrate that HEM generalizes well to MILPs that are significantly larger than those seen during training.

2. BACKGROUND

Cutting planes. Given the MILP problem in (1), we drop all its integer constraints to obtain its linear programming (LP) relaxation, which takes the form of

z*_LP ≜ min_x {c⊤x | Ax ≤ b, x ∈ R^n}.  (2)

Since the problem in (2) expands the feasible set of the problem in (1), we have z*_LP ≤ z*. Any lower bound found via an LP relaxation is called a dual bound. Given the LP relaxation in (2), cutting planes (cuts) are linear inequalities that are added to the LP relaxation in an attempt to tighten it without removing any integer feasible solutions of the problem in (1). Cuts generated by MILP solvers are added in successive rounds. Specifically, each round k involves (i) solving the current LP relaxation, (ii) generating a pool of candidate cuts C_k, (iii) selecting a subset S_k ⊆ C_k, (iv) adding S_k to the current LP relaxation to obtain the next LP relaxation, and (v) proceeding to the next round. Adding all the generated cuts to the LP relaxation would maximally strengthen the LP relaxation and improve the lower bound at each round. However, adding too many cuts could lead to large models, which can increase the computational burden and introduce numerical instabilities (Wesselmann & Stuhl, 2012). Therefore, cut selection is proposed to select a proper subset of the candidate cuts, which is significant for improving the efficiency of solving MILPs (Tang et al., 2020).

Branch-and-cut. In modern MILP solvers, cutting planes are often combined with the branch-and-bound algorithm (Land & Doig, 2010), which is known as the branch-and-cut algorithm (Mitchell, 2002). Branch-and-bound performs implicit enumeration by building a search tree, in which every node represents a subproblem of the original problem in (1). The solving process begins by selecting a leaf node of the tree and solving its LP relaxation. Let x* be the optimal solution of the LP relaxation.
If x* violates the original integrality constraints, two subproblems (child nodes) of the leaf node are created by branching. Specifically, the two child nodes are created by adding the constraints x_i ≤ ⌊x*_i⌋ and x_i ≥ ⌈x*_i⌉, respectively, where x_i denotes the i-th variable, x*_i denotes the i-th entry of vector x*, and ⌊·⌋ and ⌈·⌉ denote the floor and ceiling functions. In contrast, if x* is a (mixed-)integer solution of (1), then we obtain an upper bound on the optimal objective value of (1), which we call the primal bound. In modern MILP solvers, the addition of cutting planes is alternated with the branching phase. That is, cuts are added at search tree nodes before branching to tighten their LP relaxations. Since strengthening the relaxation before starting to branch is decisive for an efficient tree search (Wesselmann & Stuhl, 2012; Bengio et al., 2021), we focus on adding cuts only at the root node, following Gasse et al. (2019); Paulus et al. (2022).

Primal-dual gap integral. We keep track of two important bounds when running branch-and-cut, i.e., the global primal and dual bounds, which are the best upper and lower bounds on the optimal objective value of (1), respectively. We define the primal-dual gap integral (PD integral) as the area between the curve of the solver's global primal bound and the curve of the solver's global dual bound. We provide more details in Appendix C.1.
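The PD integral can be illustrated with a short sketch (our simplification, not the solver's exact bookkeeping: both bound curves are treated as piecewise-constant step functions sampled at shared event times):

```python
# Sketch of the primal-dual gap integral: given time-stamped global primal
# (upper) and dual (lower) bounds, integrate the gap between the two step
# curves over the solving horizon. Assumes both bounds are recorded at the
# same event times and held constant between events.
def pd_integral(times, primal, dual):
    """times: increasing event times; primal/dual: bound values at those times."""
    total = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        total += (primal[i] - dual[i]) * dt  # gap held constant until next event
    return total

# Example: the gap shrinks from 10 to 2 to 0 over 3 seconds
times  = [0.0, 1.0, 2.0, 3.0]
primal = [10.0, 6.0, 4.0, 4.0]
dual   = [0.0, 4.0, 4.0, 4.0]
print(pd_integral(times, primal, dual))  # 12.0
```

A smaller integral means the solver closed the gap faster, which is why the paper uses it as its main metric on datasets where instances time out.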

3. MOTIVATING RESULTS

[Figure 1: Adding the same selected cuts in different orders leads to variable overall solver performance (y-axis: primal-dual gap integral, ×10^4).]

We empirically show that the order of selected cuts, i.e., the order in which the selected cuts are added to the LP relaxations, significantly impacts the efficiency of solving MILPs. Moreover, we empirically show that the ratio of selected cuts matters significantly when solving MILPs (see Appendix G.1). Please see Appendix D.2 for details of the datasets used in this section.

Order matters. Previous work (Bixby, 1992; Maros, 2002; Li et al., 2022) has shown that the order of constraints in a given linear program (LP) significantly impacts its constructed initial basis, which is important for solving the LP. As a cut is a linear constraint, adding cuts to the LP relaxations is equivalent to adding constraints to them. Therefore, the order of added cuts could have a significant impact on solving the LP relaxations as well, and thus on solving MILPs. Indeed, our empirical results show that this is the case. (1) We design a RandomAll cut selection rule, which randomly permutes all the candidate cuts and adds them to the LP relaxations in the random order. We evaluate RandomAll on five challenging datasets, namely D1, D2, D3, D4, and D5. We use SCIP 8.0.0 (Bestuzheva et al., 2021) as the backend solver, and evaluate the solver performance by the average PD integral within a time limit. We evaluate RandomAll on each dataset over ten random seeds, and each bar in Figure 1a shows the mean and standard deviation (stdev) of its performance on each dataset. As shown in Figure 1a, the performance of RandomAll on each dataset varies widely with the order of selected cuts. (2) We further design a RandomNV cut selection rule. RandomNV differs from RandomAll in that it selects a given ratio of the candidate cuts rather than all the cuts.
RandomNV first scores each cut using the Normalized Violation (Huang et al., 2022) and selects a given ratio of cuts with high scores. It then randomly permutes the selected cuts. Each bar in Figure 1b shows the mean and stdev of the performance of RandomNV with a given ratio on the same dataset. Figures 1a and 1b show that adding the same selected cuts in different orders leads to variable solver performance, which demonstrates that the order of selected cuts is important for solving MILPs.
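A sketch of the RandomNV rule described above. The exact Normalized Violation formula follows Huang et al. (2022); here we assume the simplified form violation divided by the magnitude of the right-hand side, which is our own stand-in:

```python
import random

def normalized_violation(a, b, x_lp):
    """Score a cut a^T x <= b at the LP optimum x*_LP. Assumed form:
    positive violation normalized by |b| (a simplification of the
    Normalized Violation score of Huang et al., 2022)."""
    viol = sum(ai * xi for ai, xi in zip(a, x_lp)) - b
    return max(viol, 0.0) / max(abs(b), 1e-8)

def random_nv(cuts, x_lp, ratio, rng):
    """RandomNV: keep the top `ratio` fraction of cuts by score, then
    randomly permute them (the permutation is what the experiment varies)."""
    scored = sorted(cuts, key=lambda c: -normalized_violation(c[0], c[1], x_lp))
    kept = scored[: max(1, int(len(cuts) * ratio))]
    rng.shuffle(kept)
    return kept

# Three hypothetical cuts (a, b) and an LP optimum
cuts = [([1.0, 1.0], 2.0), ([2.0, 0.0], 1.0), ([0.0, 1.0], 3.0)]
x_lp = [1.0, 1.5]
selected = random_nv(cuts, x_lp, ratio=0.67, rng=random.Random(0))
assert len(selected) == 2  # the unviolated cut ([0,1], 3.0) is dropped
```

Running this with different seeds changes only the order of `selected`, mirroring the controlled experiment in Figure 1b.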

4. LEARNING CUT SELECTION VIA HIERARCHICAL SEQUENCE MODEL

In the cut selection task, the optimal subsets that should be selected are inaccessible, but one can assess the quality of selected subsets using a solver and provide feedback to learning algorithms. Therefore, we leverage reinforcement learning (RL) to learn cut selection policies. In this section, we provide a detailed description of our proposed RL framework for learning cut selection. First, we present our formulation of cut selection as a Markov decision process (MDP) (Sutton & Barto, 2018). Then, we present a detailed description of our proposed hierarchical sequence model (HEM). Finally, we derive a hierarchical policy gradient for training HEM efficiently.

Figure 2: Illustration of our proposed RL framework for learning cut selection policies. We formulate a MILP solver as the environment and HEM as the agent. Moreover, we train HEM via a hierarchical policy gradient algorithm.

Reinforcement Learning Formulation

As shown in Figure 2, we formulate a MILP solver as the environment and our proposed HEM as the agent. We consider an MDP defined by the tuple (S, A, r, f). Specifically, we specify the state space S, the action space A, the reward function r : S × A → R, the transition function f, and the terminal state in the following. (1) The state space S. Since the current LP relaxation and the generated cuts contain the core information for cut selection, we define a state s by (M_LP, C, x*_LP). Here M_LP denotes the mathematical model of the current LP relaxation, C denotes the set of candidate cuts, and x*_LP denotes the optimal solution of the LP relaxation. To encode the state information, we follow Achterberg (2007); Huang et al. (2022) to design thirteen features for each candidate cut based on the information of (M_LP, C, x*_LP). That is, we actually represent a state s by a sequence of thirteen-dimensional feature vectors. We present details of the designed features in Appendix F.1. (2) The action space A. To take into account the ratio and order of selected cuts, we define the action space as the set of all ordered subsets of the candidate cuts C. Exploring this action space efficiently can be challenging, as its cardinality can be extremely large due to its combinatorial structure. (3) The reward function r. To evaluate the impact of the added cuts on solving MILPs, we design the reward function based on (i) measures collected at the end of solving LP relaxations, such as the dual bound improvement, or (ii) end-of-run statistics, such as the solving time and the primal-dual gap integral. For the former, the reward r(s, a) can be defined as the negative dual bound improvement at each step. For the latter, the reward r(s, a) can be defined as zero except at the last step (s_T, a_T) of a trajectory, where r(s_T, a_T) is defined as the negative solving time or the negative primal-dual gap integral. (4) The transition function f.
The transition function maps the current state s and the action a to the next state s′, where s′ represents the next LP relaxation generated by adding the selected cuts to the current LP relaxation. (5) The terminal state. There is no standard, unified criterion for deciding when to terminate the cut separation procedure (Paulus et al., 2022). Suppose we set the number of cut separation rounds to T; then the solver environment terminates cut separation after T rounds. Under the multiple-round setting (i.e., T > 1), we formulate cut selection as a Markov decision process. Under the one-round setting (i.e., T = 1), the formulation simplifies to a contextual bandit.
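The sparse, end-of-run reward design above can be sketched as follows (a mock environment of our own, with all solver internals replaced by stubs):

```python
# Minimal sketch of the cut-selection MDP with end-of-run reward: zero
# reward at intermediate separation rounds, negative final statistic
# (e.g. the PD integral) at the last round T. All solver internals are mocked.
def rollout(policy, solve_round, T, final_stat):
    """policy(state) -> ordered subset of cuts; solve_round applies them."""
    state = solve_round(None, None)        # initial LP relaxation + cut pool
    trajectory = []
    for t in range(T):
        action = policy(state)
        next_state = solve_round(state, action)
        reward = -final_stat(next_state) if t == T - 1 else 0.0
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

# Mocked environment: the state is just a round counter
traj = rollout(policy=lambda s: [0, 1],
               solve_round=lambda s, a: 0 if s is None else s + 1,
               T=3,
               final_stat=lambda s: 10.0 * s)
rewards = [r for _, _, r in traj]
print(rewards)  # [0.0, 0.0, -30.0]
```

With T = 1 the loop collapses to a single decision, which is exactly the contextual-bandit special case mentioned above.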

Hierarchical Sequence Model

Motivation. Let π denote the cut selection policy π : S → P(A), where P(A) denotes the set of probability distributions over the action space, and π(·|s) denotes the probability distribution over the action space given the state s. We emphasize that learning such policies can tackle (P1)-(P3) in cut selection simultaneously. However, directly learning such policies is challenging for the following reasons. First, it is challenging to explore the action space efficiently, as its cardinality can be extremely large due to its combinatorial structure. Second, the length of an action (i.e., an ordered subset) and its maximum length vary across different MILPs, whereas traditional RL usually deals with problems whose actions have a fixed length. Instead of directly learning the aforementioned policy, many existing learning-based methods (Tang et al., 2020; Huang et al., 2022; Paulus et al., 2022) learn a scoring function that outputs a score for each cut, and select a fixed ratio/number of cuts with high scores. However, they suffer from the two limitations mentioned in Section 1.

Policy network architecture. To tackle the aforementioned problems, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies. To promote efficient exploration, HEM leverages the hierarchical structure of the cut selection task to decompose the policy into two sub-policies, i.e., a higher-level policy π^h and a lower-level policy π^l. The policy network architecture of HEM is illustrated in Figure 2. First, the higher-level policy learns the number of cuts that should be selected by predicting a proper ratio. Suppose the length of the state is N and the predicted ratio is k; then the predicted number of cuts that should be selected is ⌊N * k⌋, where ⌊·⌋ denotes the floor function. We define the higher-level policy by π^h : S → P([0, 1]), where π^h(·|s) denotes the probability distribution over [0, 1] given the state s.
Second, the lower-level policy learns to select an ordered subset whose size is determined by the higher-level policy. We define the lower-level policy by π^l : S × [0, 1] → P(A), where π^l(·|s, k) denotes the probability distribution over the action space given the state s and the ratio k. Specifically, we formulate the lower-level policy as a sequence model, which can capture the interaction among cuts. Finally, we derive the cut selection policy via the law of total probability, i.e., π(a_k|s) = E_{k∼π^h(·|s)}[π^l(a_k|s, k)], where k denotes the given ratio and a_k denotes the action. The policy is computed as an expectation, since a_k does not uniquely determine the ratio k. For example, suppose that N = 100 and the length of a_k is 10; then the ratio k can be any number in the interval [0.1, 0.11). In practice, we sample an action from the policy π by first sampling a ratio k from π^h and then sampling an action from π^l given the ratio. For the higher-level policy, we model it as a tanh-Gaussian, i.e., a Gaussian distribution with an invertible squashing function (tanh), which is commonly used in deep reinforcement learning (Schulman et al., 2017; Haarnoja et al., 2018). The mean and variance of the Gaussian are given by neural networks. The support of the tanh-Gaussian is [-1, 1], but a ratio of selected cuts should belong to [0, 1]. Thus, we further apply a linear transformation to the tanh-Gaussian. Specifically, we define the parameterized higher-level policy by π^h_{θ1}(·|s) = 0.5 * tanh(K) + 0.5, where K ∼ N(µ_{θ1}(s), σ_{θ1}(s)). Since the sequence lengths of states vary across different instances (MILPs), we use a long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) network to embed the sequence of candidate cuts. We then use a multi-layer perceptron (MLP) (Goodfellow et al., 2016) to predict the mean and variance from the last hidden state of the LSTM. For the lower-level policy, we formulate it as a sequence model.
That is, its input is a sequence of candidate cuts, and its output is the probability distribution over ordered subsets of candidate cuts with the size determined by the higher-level policy. Specifically, given a tuple (s, k, a_k), the sequence model computes the conditional probability π^l_{θ2}(a_k|s, k) using a parametric model to estimate the terms of the probability chain rule, i.e., π^l_{θ2}(a_k|s, k) = ∏_{i=1}^{m} p_{θ2}(a_k^i | a_k^1, . . . , a_k^{i-1}, s, k). Here s = {s_1, . . . , s_N} is the input sequence, m = ⌊N * k⌋ is the length of the output sequence, and a_k = {a_k^1, . . . , a_k^m} is a sequence of m indices, each corresponding to a position in the input sequence s. Such a policy can be parametrized by the vanilla sequence model commonly used in machine translation (Sutskever et al., 2014; Vaswani et al., 2017). However, the vanilla sequence model can only be applied to learning on a single instance, as the number of candidate cuts varies across instances. To generalize across different instances, we use a pointer network (Vinyals et al., 2015; Bello* et al., 2017)-which uses attention as a pointer to select a member of the input sequence as the output at each decoder step-to parametrize π^l_{θ2} (see Appendix F.4.1 for details). To the best of our knowledge, we are the first to formulate the cut selection task as a sequence-to-sequence learning problem and apply the pointer network to cut selection. This leads to two major advantages: (1) capturing the underlying order information, and (2) capturing the interaction among cuts. This is also illustrated through an example in Appendix E.
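To make the two-level sampling concrete, the following minimal sketch (our illustration, not the authors' implementation) samples a ratio from the tanh-Gaussian higher-level policy and then an ordered subset autoregressively; a fixed score per cut stands in for the learned pointer-attention logits:

```python
import math
import random

def sample_ratio(mu, sigma, rng):
    """Higher-level policy: ratio = 0.5 * tanh(K) + 0.5, K ~ N(mu, sigma)."""
    k = rng.gauss(mu, sigma)
    return 0.5 * math.tanh(k) + 0.5

def sample_ordered_subset(scores, ratio, rng):
    """Lower-level policy sketch: autoregressively pick m = floor(N*ratio)
    indices without replacement, following the chain rule above. Real HEM
    conditions each step on a pointer-network decoder state; here a fixed
    score per cut stands in for the attention logits."""
    n = len(scores)
    m = int(n * ratio)
    remaining = list(range(n))
    chosen = []
    for _ in range(m):
        logits = [scores[i] for i in remaining]
        z = sum(math.exp(l) for l in logits)
        probs = [math.exp(l) / z for l in logits]   # softmax over remaining cuts
        idx = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(idx)
        remaining.remove(idx)
    return chosen

rng = random.Random(0)
ratio = sample_ratio(mu=0.0, sigma=0.5, rng=rng)
action = sample_ordered_subset([0.1, 2.0, -1.0, 0.5], ratio, rng)
assert 0.0 <= ratio <= 1.0
assert len(action) == len(set(action))  # ordered subset: no repeated cuts
```

Note how the `remaining` set shrinks each step: selecting without replacement is what makes the output an ordered subset rather than a sequence with repeats.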

Training: hierarchical policy gradient

For the cut selection task, we aim to find θ that maximizes the expected reward over all trajectories

J(θ) = E_{s∼µ, a_k∼π_θ(·|s)}[r(s, a_k)],  (3)

where θ = [θ1, θ2] denotes the concatenation of the two parameter vectors, π_θ(a_k|s) = E_{k∼π^h_{θ1}(·|s)}[π^l_{θ2}(a_k|s, k)], and µ denotes the initial state distribution. To train the policy with a hierarchical structure, we derive a hierarchical policy gradient following the well-known policy gradient theorem (Sutton et al., 1999a; Sutton & Barto, 2018).

Proposition 1. Given the cut selection policy π_θ(a_k|s) = E_{k∼π^h_{θ1}(·|s)}[π^l_{θ2}(a_k|s, k)] and the training objective (3), the hierarchical policy gradient takes the form of

∇_{θ1} J([θ1, θ2]) = E_{s∼µ, k∼π^h_{θ1}(·|s)}[∇_{θ1} log(π^h_{θ1}(k|s)) E_{a_k∼π^l_{θ2}(·|s,k)}[r(s, a_k)]],
∇_{θ2} J([θ1, θ2]) = E_{s∼µ, k∼π^h_{θ1}(·|s), a_k∼π^l_{θ2}(·|s,k)}[∇_{θ2} log(π^l_{θ2}(a_k|s, k)) r(s, a_k)].

We provide detailed proof in Appendix A. We use the derived hierarchical policy gradient to update the parameters of the higher-level and lower-level policies. We implement the training algorithm in a parallel manner that is closely related to the asynchronous advantage actor-critic (A3C) (Mnih et al., 2016). Due to limited space, we summarize the procedure of the training algorithm in Appendix F.3.6. Moreover, we discuss further advantages of HEM (see Appendix F.4.3 for details): (1) HEM leverages the hierarchical structure of the cut selection task, which is important for efficient exploration in complex decision-making tasks (Sutton et al., 1999b); (2) we train HEM via gradient-based algorithms, which is sample efficient (Sutton & Barto, 2018).
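Proposition 1 is a score-function (REINFORCE-style) result. The following toy Monte-Carlo check is our own illustration, with both levels reduced to Bernoulli policies over scalar parameters, verifying that the two estimators recover the true gradients of J:

```python
import random

# Monte-Carlo sketch of the hierarchical policy gradient (Proposition 1):
# sample k ~ pi_h, then a ~ pi_l(.|s, k), and weight the score functions
# grad log pi_h(k|s) and grad log pi_l(a|s, k) by the reward r(s, a).
# Toy setting: pi_h is Bernoulli(p), pi_l is Bernoulli(q); gradients are
# taken w.r.t. the scalar parameters p and q.
def hierarchical_pg_estimate(p, q, reward, n, rng):
    g_p, g_q = 0.0, 0.0
    for _ in range(n):
        k = 1 if rng.random() < p else 0           # higher-level sample
        a = 1 if rng.random() < q else 0           # lower-level sample
        r = reward(k, a)
        g_p += ((k - p) / (p * (1 - p))) * r       # d/dp log Bernoulli(p)
        g_q += ((a - q) / (q * (1 - q))) * r       # d/dq log Bernoulli(q)
    return g_p / n, g_q / n

# Reward is 1 only when both levels pick 1, so J(p, q) = p*q and the true
# gradients at p = q = 0.5 are (q, p) = (0.5, 0.5).
rng = random.Random(0)
g_p, g_q = hierarchical_pg_estimate(0.5, 0.5, lambda k, a: float(k and a),
                                    n=200000, rng=rng)
assert abs(g_p - 0.5) < 0.05 and abs(g_q - 0.5) < 0.05
```

The same decomposition carries over when p and q are replaced by the LSTM/MLP and pointer-network parameters θ1 and θ2.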

5. EXPERIMENTS

Our experiments have five main parts: Experiment 1. Evaluate our approach on three classical MILP problems and six challenging MILP problem benchmarks from diverse application areas. Experiment 2. Perform carefully designed ablation studies to provide further insight into HEM. Experiment 3. Test whether HEM can generalize to instances significantly larger than those seen during training. Experiment 4. Visualize the cuts selected by our method compared to the baselines. Experiment 5. Deploy our approach to real-world production planning problems.

Benchmarks. We evaluate our approach on nine NP-hard MILP problem benchmarks, which consist of three classical synthetic MILP problems and six challenging MILP problems from diverse application areas. We divide the nine problem benchmarks into three categories according to the difficulty of solving them using the SCIP 8.0.0 solver (Bestuzheva et al., 2021). We call the three categories easy, medium, and hard datasets, respectively. (1) Easy datasets comprise three widely used synthetic MILP problem benchmarks: Set Covering (Balas & Ho, 1980), Maximum Independent Set (Bergman et al., 2016), and Multiple Knapsack (Scavuzzo et al., 2022). We artificially generate instances following Gasse et al. (2019); Sun et al. (2020). (2) Medium datasets comprise MIK (Atamtürk, 2003) and CORLAT (Gomes et al., 2008), which are widely used benchmarks for evaluating MILP solvers (He et al., 2014; Nair et al., 2020). (3) Hard datasets include the Load Balancing problem, inspired by real-life applications of large-scale systems, and the Anonymous problem, inspired by a large-scale industrial application (Bowly et al., 2021). Moreover, hard datasets contain benchmarks from MIPLIB 2017 (MIPLIB) (Gleixner et al., 2021). Although Turner et al. (2022) has shown that directly learning over the full MIPLIB can be extremely challenging, we propose to learn over subsets of MIPLIB.
We construct two subsets, called MIPLIB mixed neos and MIPLIB mixed supportcase. Due to limited space, please see Appendix D.1 for details of these datasets.

Experimental setup. Throughout all experiments, we use SCIP 8.0.0 (Bestuzheva et al., 2021) as the backend solver, which is a state-of-the-art open-source solver widely used in research on machine learning for combinatorial optimization (Gasse et al., 2019; Huang et al., 2022; Turner et al., 2022; Nair et al., 2020). Following Gasse et al. (2019); Huang et al. (2022); Paulus et al. (2022), we only allow cutting plane generation and selection at the root node, and set the number of cut separation rounds to one. We keep all other SCIP parameters at their defaults so as to make comparisons as fair and reproducible as possible. We emphasize that all of the SCIP solver's advanced features, such as presolve and heuristics, are enabled, which ensures that our setup is consistent with practical settings. Throughout all experiments, we set the solving time limit to 300 seconds. For completeness, we also evaluate HEM with a much longer time limit of three hours; the results are given in Appendix G.6. We train HEM with ADAM (Kingma & Ba, 2014) using PyTorch (Paszke et al., 2019). Additionally, we provide another implementation using MindSpore (Chen, 2021). For simplicity, we split each dataset into training and test sets with 80% and 20% of the instances, respectively. To further improve HEM, one can construct a validation set for hyperparameter tuning. We train our model on the training set, and select the best model on the training set to evaluate on the test set. Please refer to Appendix F.3 for implementation details, hyperparameters, and hardware specifications.

Baselines. Our baselines include five widely used human-designed cut selection rules and a state-of-the-art (SOTA) learning-based method. Cut selection rules include NoCuts, Random, Normalized Violation (NV), Efficacy (Eff), and Default. NoCuts does not add any cuts.
Default denotes the default cut selection rule used in SCIP 8.0.0. For learning-based methods, we implement a slight variant of the SOTA learning-based methods (Tang et al., 2020; Huang et al., 2022), namely the score-based policy (SBP). Please see Appendix F.2 for implementation details of these baselines.

Evaluation metrics. We use two widely used evaluation metrics, i.e., the average solving time (Time, lower is better) and the average primal-dual gap integral (PD integral, lower is better). Additionally, we provide more results in terms of two further metrics, i.e., the average number of nodes and the average primal-dual gap, in Appendix G.2. Furthermore, to evaluate different cut selection methods relative to pure branch-and-bound without cutting plane separation, we propose an Improvement metric. Specifically, we define the metric by Im_M(·) = (M(NoCuts) − M(·)) / M(NoCuts), where M(NoCuts) represents the performance of NoCuts, and M(·) represents a mapping from a method to its performance. The Improvement metric represents the improvement of a given method over NoCuts. We mainly focus on the Time metric on the easy datasets, as the solver can solve all instances to optimality within the given time limit. However, HEM and the baselines cannot solve all instances to optimality within the time limit on the medium and hard datasets. As a result, the average solving time of those unsolved instances is the same, which makes it difficult to distinguish the performance of different cut selection methods using the Time metric. Therefore, we mainly focus on the PD integral metric on the medium and hard datasets. The PD integral is also a well-recognized metric for evaluating solver performance (Bowly et al., 2021; Cao et al., 2022).

Experiment 1. Comparative evaluation. The results in Table 1 suggest the following. (1) Easy datasets.
HEM significantly outperforms all the baselines on the easy datasets, especially on Maximum Independent Set and Multiple Knapsack. SBP achieves much better performance than all the rule-based baselines, demonstrating that our implemented SBP is a strong baseline. Compared to SBP, HEM improves the Time by up to 16.4% on the three datasets, demonstrating the superiority of our method over the SOTA learning-based method. (2) Medium datasets. On MIK and CORLAT, the results in Table 2 show that HEM significantly outperforms HEM w/o H and the baselines on the three datasets. The results demonstrate that the higher-level model is important for efficient exploration in complex tasks, thus significantly improving the solving efficiency.

The importance of tackling (P1)-(P3). We perform ablation studies to understand the importance of tackling (P1)-(P3) in cut selection. (1) HEM. HEM tackles (P1)-(P3) in cut selection simultaneously. (2) HEM-ratio. So as not to learn how many cuts should be selected, we remove the higher-level model of HEM and force the lower-level model to select a fixed ratio of cuts. We denote this variant by HEM-ratio. Note that HEM-ratio is different from HEM w/o H (see Appendix F.4.2). HEM-ratio tackles (P1) and (P3) in cut selection. (3) HEM-ratio-order. To further mute the effect of the order of selected cuts, we reorder the cuts selected by HEM-ratio according to the original indices of the generated cuts; we denote this variant by HEM-ratio-order. HEM-ratio-order mainly tackles (P1) in cut selection. The results in Table 3 suggest the following. HEM-ratio-order significantly outperforms Default and NoCuts, demonstrating that tackling (P1) by data-driven methods is crucial. HEM significantly outperforms HEM-ratio in terms of the PD integral, demonstrating the significance of tackling (P2). HEM-ratio outperforms HEM-ratio-order in terms of the Time and the PD integral, which demonstrates the importance of tackling (P3).
Moreover, HEM-ratio and HEM-ratio-order perform better than SBP on MIS and CORLAT, demonstrating the advantages of using the sequence model to learn cut selection over SBP. HEM-ratio and HEM-ratio-order perform on par with SBP on MIPLIB mixed neos; we provide possible reasons in Appendix G.3.1.

Experiment 4. Visualization. We visualize the cuts selected by HEM-ratio and SBP on a randomly sampled instance from Maximum Independent Set and CORLAT, respectively. We evaluate HEM-ratio rather than HEM, as HEM-ratio selects the same number of cuts as SBP. Furthermore, we perform principal component analysis on the selected cuts to reduce the cut features to a two-dimensional space. Colored points illustrate the reduced cut features. To visualize the diversity of the selected cuts, we use dashed lines to connect the points with the smallest and largest x and y coordinates; the area covered by the dashed lines thus represents the diversity. Figure 3 shows that SBP tends to select many similar, possibly redundant cuts, especially on Maximum Independent Set. In contrast, HEM-ratio selects much more diverse cuts that complement each other well. Please refer to Appendix G.5 for results on more datasets.

Experiment 5. Deployment in real-world production planning problems. To further evaluate the effectiveness of our proposed HEM, we deploy HEM to large-scale real-world production planning problems at an anonymous enterprise, one of the largest global commercial technology enterprises. Please refer to Appendix D.3 for more details of the problems. The results in Table 4 (Right) show that HEM significantly outperforms all the baselines in terms of the Time and PD integral. The results demonstrate the strong ability of our proposed HEM to enhance modern MILP solvers in real-world applications. Interestingly, Default performs worse than NoCuts, which implies that an improper cut selection policy can significantly degrade the performance of MILP solvers.
In addition, we will integrate our proposed HEM into OptVerse, i.e., the commercial solver developed by Huawei.

6. CONCLUSION

In this paper, we observe from extensive empirical results that the order of selected cuts has a significant impact on the efficiency of solving MILPs. We propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model that learns the number of cuts that should be selected, and (2) a lower-level model that formulates the cut selection task as a sequence-to-sequence learning problem and learns policies selecting an ordered subset whose size is determined by the higher-level model. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs. We believe that our proposed approach brings new insights into learning cut selection.

A PROOF

A.1 PROOF OF PROPOSITION 1

Proof. The optimization objective takes the form

$$J(\theta) = \mathbb{E}_{s\sim\mu,\,a_k\sim\pi_\theta(\cdot|s)}[r(s,a_k)] = \mathbb{E}_{s\sim\mu}\Big[\int_{k=0}^{1}\pi^h_{\theta_1}(k|s)\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\,dk\Big],$$

where $\theta=[\theta_1,\theta_2]$, with $[\theta_1,\theta_2]$ denoting the concatenation of the two vectors, $\pi_\theta(a_k|s)=\mathbb{E}_{k\sim\pi^h_{\theta_1}(\cdot|s)}[\pi^l_{\theta_2}(a_k|s,k)]$, and $\mu$ denotes the initial state distribution. We first compute the policy gradient for $\theta_1$:

$$\nabla_{\theta_1}J([\theta_1,\theta_2]) = \nabla_{\theta_1}\mathbb{E}_{s\sim\mu}\Big[\int_{k=0}^{1}\pi^h_{\theta_1}(k|s)\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\,dk\Big] = \mathbb{E}_{s\sim\mu}\Big[\nabla_{\theta_1}\int_{k=0}^{1}\pi^h_{\theta_1}(k|s)\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\,dk\Big].$$

Let $r^h(s,k,\theta_2)=\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)=\mathbb{E}_{a_k\sim\pi^l_{\theta_2}(\cdot|s,k)}[r(s,a_k)]$. Then we have

$$\nabla_{\theta_1}J([\theta_1,\theta_2]) = \mathbb{E}_{s\sim\mu}\Big[\nabla_{\theta_1}\int_{k=0}^{1}\pi^h_{\theta_1}(k|s)\,r^h(s,k,\theta_2)\,dk\Big] = \mathbb{E}_{s\sim\mu,\,k\sim\pi^h_{\theta_1}(\cdot|s)}\big[\nabla_{\theta_1}\log\pi^h_{\theta_1}(k|s)\,r^h(s,k,\theta_2)\big].$$

Therefore,

$$\nabla_{\theta_1}J([\theta_1,\theta_2]) = \mathbb{E}_{s\sim\mu,\,k\sim\pi^h_{\theta_1}(\cdot|s)}\Big[\nabla_{\theta_1}\log\pi^h_{\theta_1}(k|s)\,\mathbb{E}_{a_k\sim\pi^l_{\theta_2}(\cdot|s,k)}[r(s,a_k)]\Big].$$

We then compute the policy gradient for $\theta_2$:

$$\nabla_{\theta_2}J([\theta_1,\theta_2]) = \nabla_{\theta_2}\mathbb{E}_{s\sim\mu}\Big[\int_{k=0}^{1}\pi^h_{\theta_1}(k|s)\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\,dk\Big] = \mathbb{E}_{s\sim\mu,\,k\sim\pi^h_{\theta_1}(\cdot|s)}\Big[\nabla_{\theta_2}\sum_{a_k}\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\Big] = \mathbb{E}_{s\sim\mu,\,k\sim\pi^h_{\theta_1}(\cdot|s),\,a_k\sim\pi^l_{\theta_2}(\cdot|s,k)}\big[\nabla_{\theta_2}\log\pi^l_{\theta_2}(a_k|s,k)\,r(s,a_k)\big],$$

which completes the proof.
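As a sanity check on Proposition 1, the two score-function estimators above can be sketched with toy softmax policies. All dimensions, the reward, and the single-pick simplification of the lower-level log-likelihood are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy dimensions (all hypothetical): state dim S, K ratio choices, N candidate cuts.
S, K, N = 4, 3, 5
theta1 = rng.normal(scale=0.1, size=(K, S))      # higher-level parameters
theta2 = rng.normal(scale=0.1, size=(N, S + 1))  # lower-level parameters

def hierarchical_policy_gradient(s, reward_fn):
    """One-sample estimate of (grad_theta1 J, grad_theta2 J) as in Proposition 1."""
    # Higher level: k ~ pi^h_{theta1}(.|s), a categorical over K size choices.
    ph = softmax(theta1 @ s)
    k = rng.choice(K, p=ph)
    # Lower level: a_k ~ pi^l_{theta2}(.|s,k); a single softmax draw of (k+1)
    # cuts stands in for the pointer-network decoder.
    x = np.append(s, float(k))
    pl = softmax(theta2 @ x)
    a_k = rng.choice(N, size=k + 1, replace=False, p=pl)
    r = reward_fn(s, a_k)
    # Score-function gradient of log pi^h_{theta1}(k|s), scaled by r.
    g1 = -np.outer(ph, s)
    g1[k] += s
    # Score-function gradient of log pi^l for the first selected cut only
    # (a deliberate simplification of the full sequential log-likelihood).
    g2 = -np.outer(pl, x)
    g2[a_k[0]] += x
    return r * g1, r * g2

# Usage: average many one-sample estimates, then ascend both gradients.
s = rng.normal(size=S)
reward = lambda s, a: float(len(a))  # toy reward: prefer selecting more cuts
g1s, g2s = zip(*(hierarchical_policy_gradient(s, reward) for _ in range(256)))
theta1 += 1e-2 * np.mean(g1s, axis=0)
theta2 += 1e-2 * np.mean(g2s, axis=0)
```

With this toy reward, the averaged higher-level gradient pushes probability mass toward larger k, mirroring how the true estimator credits the size decision through the expected lower-level return.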

B RELATED WORK

Machine learning for MILP. The use of machine learning methods to improve MILP solver performance has been an active topic of significant interest in recent years (Bengio et al., 2021; Lodi & Zarpellon, 2017; Bowly et al., 2021; Gasse et al., 2019; Qu et al., 2022b; Li et al., 2023). During the solving process, many crucial decisions that significantly impact solver performance are based on heuristics (Achterberg, 2007). Recent methods propose to replace these hand-crafted heuristics with machine learning models (Bengio et al., 2021). This line of research has shown significant improvements in solver performance, including cut selection (Tang et al., 2020; Paulus et al., 2022; Turner et al., 2022; Baltean-Lugojan et al., 2019), variable selection (Khalil et al., 2016; Gasse et al., 2019; Balcan et al., 2018; Zarpellon et al., 2021; Qu et al., 2022a), node selection (He et al., 2014; Sabharwal et al., 2012), column generation (Morabit et al., 2021), and primal heuristics selection (Khalil et al., 2017; Hendel et al., 2019). In this paper, we focus on cut selection, which plays a significant role in modern MILP solvers (Dey & Molinaro, 2018a; Tang et al., 2020). For cut selection, many existing learning-based methods (Tang et al., 2020; Paulus et al., 2022; Huang et al., 2022) focus on learning which cuts should be preferred by learning a scoring function to measure cut quality. Specifically, (Tang et al., 2020) proposes a reinforcement learning approach that learns to score Gomory cuts (Gomory, 1960) and selects the Gomory cut with the best score. Furthermore, (Paulus et al., 2022) designs a lookahead selection rule that selects the cut yielding the best dual bound improvement, and proposes to learn this expert rule via imitation learning. Instead of selecting the single best cut, (Huang et al., 2022) frames cut selection as multiple instance learning to learn a scoring function and selects a fixed ratio of cuts with high scores.
However, these methods neglect the importance of learning how many cuts should be selected. Moreover, we empirically show that the order of selected cuts also has a large impact on the efficiency of solving MILPs (see Section 3). In addition, (Turner et al., 2022) proposes to learn the weightings of four existing scoring rules designed by experts. On the theoretical side, (Balcan et al., 2021) provides provable guarantees for learning cut selection policies.

Sequence model. Sequence models such as the long short-term memory (LSTM) and the Transformer have achieved outstanding performance in language tasks such as machine translation (Hochreiter & Schmidhuber, 1997; Sutskever et al., 2014; Vaswani et al., 2017). For combinatorial optimization, recent works (Vinyals et al., 2015; Bello* et al., 2017) propose a variant of the traditional sequence model, namely the pointer network, which is applied to directly finding solutions to specific combinatorial optimization problems, such as the Travelling Salesman Problem (Lenstra & Shmoys, 2009). Instead of finding solutions directly, we propose to use the pointer network for cut selection in modern MILP solvers. To the best of our knowledge, we are the first to apply the pointer network to cut selection; it not only captures the order of selected cuts, but also captures the interaction among cuts so as to select cuts that complement each other nicely.

Reinforcement learning. Reinforcement learning (RL) has achieved great success in decision-making tasks, ranging from playing video games (Mnih et al., 2015; Fan, 2021; Fan & Xiao, 2022) to controlling robots in simulators (Haarnoja et al., 2018; Yang et al., 2022). Roughly speaking, RL approaches fall into two categories: (1) model-based RL methods (Janner et al., 2019; Wang et al., 2022b; Zhou et al., 2020), and (2) model-free methods (Haarnoja et al., 2018; Wang et al., 2022a; Kuang et al., 2022; Fan et al., 2021).
In this paper, we propose a novel RL framework for learning cut selection policies.

C MORE DETAILS OF BACKGROUND C.1 MORE DETAILS OF THE PRIMAL-DUAL GAP INTEGRAL

We keep track of two important bounds when running branch-and-cut: the global primal bound and the global dual bound. The global primal bound is the objective value of the best feasible solution found so far, i.e., the best upper bound for the problem in (1). The global dual bound is the minimum dual bound across all leaves of the search tree, i.e., the best lower bound for the problem in (1). We define the primal-dual gap integral as the area between the curve of the solver's global primal bound and the curve of the solver's global dual bound. With a time limit $T$, the primal-dual gap integral is

$$\int_{t=0}^{T} \big(c^\top x^*_t - z^*_t\big)\,dt,$$

where $c$ is the objective coefficient vector as in (1), $x^*_t$ is the best feasible solution found at time $t$, and $z^*_t$ is the best dual bound at time $t$. We define the primal-dual gap as the difference between the global primal bound and the global dual bound. In SCIP 8.0.0 (Bestuzheva et al., 2021), the initial value of the primal-dual gap is set to the constant 100. The primal-dual gap integral is a well-recognized metric for evaluating solver performance; for example, it is a primary evaluation metric in the NeurIPS 2021 ML4CO competition (Bowly et al., 2021).

Algorithm 1: Pseudocode for constructing the MIPLIB datasets
1: Input: the initial instance I_0, the set of full MIPLIB M, an empty set M', an empty queue Q.
2: Initialize Q with the instance I_0.
3: while Q is not empty do
4:   n = Q.size()
5:   for i = 1, ..., n do
6:     Pull an element I' from Q.
7:     Compute the similarity scores between I' and each instance in M except I'.
8:     Select the five instances M_i with the best similarity scores.
9:     for I in M_i do
10:      if I not in M' then
11:        Push I to M'; push I to Q.
12:      end if
13:    end for
14:  end for
15: end while
16: Return M'
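Given the step-wise bound trajectories a solver reports, the primal-dual gap integral above can be sketched as follows; the function name and the piecewise-constant-bounds assumption are ours.

```python
# A minimal sketch of the primal-dual gap integral, assuming the solver reports
# piecewise-constant global bounds: at time t[i] it holds primal bound primal[i]
# and dual bound dual[i], and t[-1] equals the time limit T.
def pd_integral(t, primal, dual):
    """Integrate (primal(t) - dual(t)) over [t[0], t[-1]] for step-wise bounds."""
    total = 0.0
    for i in range(len(t) - 1):
        total += (primal[i] - dual[i]) * (t[i + 1] - t[i])
    return total
```

For a minimization problem the primal bound stays above the dual bound, so the integrand is nonnegative and a smaller value indicates faster convergence of the two bounds.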

D DETAILS OF THE DATASETS USED IN THIS PAPER D.1 THE DATASETS USED IN THE MAIN EVALUATION

Easy datasets. The SCIP 8.0.0 solver can solve the MILP instances in the easy datasets to optimality within one minute. The easy datasets comprise three synthetic MILP problems: Set Covering (Balas & Ho, 1980), Maximum Independent Set (Bergman et al., 2016), and Multiple Knapsack (Scavuzzo et al., 2022). We choose these three classes of problems for the following reasons. First, they are widely used benchmarks for evaluating MILP solvers (Gasse et al., 2019; Huang et al., 2022; Sun et al., 2020; Gupta et al., 2022). Second, they represent a wide collection of MILP problems encountered in practice. Third, for each class of these problems, the average number of generated cuts is at least twenty, which ensures that proper cut selection strategies are significant for improving solver performance. Similar to (Gasse et al., 2019; Scavuzzo et al., 2022; Sun et al., 2020; Gupta et al., 2022), we generate Set Covering instances with 500 rows and 1000 columns, Maximum Independent Set instances with graphs of 500 nodes and affinity set to 4, and Multiple Knapsack instances with 60 items and 12 knapsacks. For each benchmark, we generate a training set of 10,000 instances and a test set of 100 instances that are never seen during training. Readers can refer to https://github.com/ds4dm/learn2branch or https://github.com/lascavana/rl2branch for code to generate the easy datasets. We will also release our code once the paper is accepted to be published.

Medium datasets. The SCIP 8.0.0 solver needs at least five minutes to solve the instances in the medium datasets to optimality.

(1) Benchmarks from MIPLIB 2017. Note that MIPLIB 2017 (MIPLIB) (Gleixner et al., 2021) contains MILP instances across many different application areas and has been used as a long-standing standard benchmark for MILP solvers (Nair et al., 2020; Turner et al., 2022; Gleixner et al., 2021).
Previous work (Turner et al., 2022) has shown that directly learning over the full MIPLIB can be extremely challenging, as these instances are heterogeneous, and machine learning models have difficulty learning from heterogeneous datasets. Despite this challenge, we take a first step towards learning over subsets of MIPLIB. Specifically, we construct two subsets by selecting similar instances from MIPLIB, where we measure the similarity between instances by 100 human-designed instance features (Gleixner et al., 2021). Following Turner et al. (2022), we first discard instances from MIPLIB that satisfy any of the criteria in Table 5. This ensures that a good cut selection policy can significantly improve the dual bound on the remaining instances. Note that we only use three of the seven criteria used in (Turner et al., 2022) so as to preserve as many instances as possible. To select similar instances from MIPLIB 2017, we first choose a representative instance with knapsack constraints (neos-1456979) and a representative instance with set covering constraints (supportcase40). We then construct the dataset MIPLIB mixed neos following the procedure in Algorithm 1 with the initial instance neos-1456979, and the dataset MIPLIB mixed supportcase following the same procedure with the initial instance supportcase40. Each dataset is split into training and test sets with 80% and 20% of the instances, respectively. Specifically, MIPLIB mixed neos contains 20 instances: neos-1456979, ic97_tension, icir97_tension, l2p12, lectsched-4-obj, lectsched-5-obj, loopha13, neos-686190, neos-2294525-abba, neos-3009394-lami, neos-3046601-motu, neos-3046615-murg, neos-3610173-itata, neos-4338804-snowy, neos-5221106-oparau, neos-5260764-orauea, neos-5261882-treska, neos-5266653-tugela, neos16, and timtab1CUTS.
Moreover, MIPLIB mixed supportcase contains 40 instances: supportcase40, 30_70_45_05_100, 30_70_45_095_100, acc-tight2, acc-tight4, acc-tight5, comp07-2idx, comp08-2idx, comp12-2idx, comp21-2idx, decomp1, decomp2, gus-sch, istanbul-no-cutoff, mkc, mkc1, neos-555343, neos-555424, neos-738098, neos-872648, neos-933562, neos-933638, neos-933966, neos-935234, neos-935769, neos-983171, neos-1330346, neos-1337307, neos-1396125, neos-3209462-rhin, neos-3755335-nizao, neos-3759587-noosa, neos-4300652-rahue, neos18, physiciansched6-1, physiciansched6-2, piperout-d27, qiu, reblock354, and supportcase37.

(2) Benchmarks from the NeurIPS 2021 ML4CO competition. The Load Balancing and Anonymous problems used in the main text are from the NeurIPS 2021 ML4CO competition (Bowly et al., 2021); readers can refer to https://www.ecole.ai/2021/ml4co-competition/ for details of the competition. The competition releases three challenging datasets, but we only use two of them. The major reason is that the average number of candidate cuts on instances from the third dataset (Item Placement) is less than five, which means that cut selection has little impact on the overall solver performance.

D.1.1 DETAILED DESCRIPTION OF THE AFOREMENTIONED DATASETS

In this part, we provide a detailed description of the aforementioned datasets. Note that all datasets we use except MIPLIB 2017 are application-specific, i.e., they contain instances from only a single application. We summarize the statistics of the datasets used in this paper in Table 6. Let n, m denote the average number of variables and constraints in the MILPs, and let m × n denote the size of the MILPs. We emphasize that the largest size of our used datasets is up to two orders of magnitude larger than that used in previous learning-based cut selection methods (Tang et al., 2020; Paulus et al., 2022), which demonstrates the scalability of our proposed HEM. Moreover, we test the inference time of our proposed HEM given the average number of candidate cuts. The results in Table 6 show that the computational overhead of HEM is very low.

(Figure 4 caption: First, the sequence model captures the interaction among cuts by selecting cuts one by one. Consequently, it selects cut3 and cut1, which complement each other nicely, leading to a tighter LP relaxation. Second, it naturally captures the order of selected cuts. A better order of selected cuts may lead to a better initial basis, thus solving the LP relaxation faster (Li et al., 2022) (see Section 3).)

D.2 DATASETS USED IN SECTION 3 IN THE MAIN TEXT

In Figure 1a in the main text, we use five challenging datasets, namely D1, D2, D3, D4, and D5, respectively. Specifically, D1 represents MIPLIB mixed supportcase, D2 represents the single instance neos-1456979 from MIPLIB 2017, D3 represents MIPLIB mixed neos, D4 represents Anonymous, and D5 represents the single instance lectsched-5-obj from MIPLIB 2017. In Figure 1b in the main text, we use the dataset MIPLIB mixed neos.

D.3 LARGE-SCALE REAL-WORLD PRODUCTION PLANNING PROBLEMS

The production planning problem aims to find the optimal production plan for thousands of factories according to the daily order demand. The constraints include the production capacity of each production line in each factory, transportation limits, the order rate, etc. The optimization objective is to minimize the production cost and the lead time simultaneously. We split the dataset into training and test sets with 80% and 20% of the instances, respectively. The average size of the production planning problems is approximately 3500 × 5000 = 1.75 × 10^7, which makes them large-scale real-world problems. To support the machine learning community for MILP, we will release the dataset once the paper is accepted to be published.

E ILLUSTRATION OF ADVANTAGES OF USING A SEQUENCE MODEL

Figure 4 illustrates two major advantages of using a sequence model to learn cut selection. First, the sequence model takes into account the order of selected cuts by modeling the selected cuts as an output sequence. As shown in Figure 4, the order of cuts determined by the sequence model is better than that of the score-based method, thus leading to a better initial basis for solving the LP relaxation faster. Second, the sequence model captures the interaction among cuts, as it models the joint conditional probability of the selected cuts given an input sequence of candidate cuts. As shown in Figure 4, the sequence model selects cuts that complement each other nicely, thus leading to a tighter LP relaxation and speeding up solving the MILP.

F ALGORITHM IMPLEMENTATION AND EXPERIMENTAL SETTINGS

F.1 DESIGNED CUT FEATURES

We design thirteen cut features for the cut selection task, such as the extent to which a cut is violated by the current LP solution and the proportion of non-zero coefficients of a cut. We present a detailed description of the designed cut features in Table 7. We emphasize that we do not tune the cut features. Therefore, it is promising to further improve our method by designing better cut features or by using graph neural networks to learn better features in future work.
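As an illustration, the two features named above might be computed as follows for a cut α^T x ≤ β; this is a hedged sketch, since Table 7 defines the actual thirteen features and any normalization details, which we do not reproduce here.

```python
import numpy as np

# Sketch of two of the thirteen cut features named in the text, for a cut
# a^T x <= b evaluated at the current LP optimum x_lp. Function names are
# our own; the paper's exact feature definitions live in Table 7.
def violation_extent(a, b, x_lp):
    """How much the LP optimum violates the cut (0 if the cut is satisfied)."""
    return max(0.0, float(a @ x_lp) - b)

def nonzero_proportion(a):
    """Fraction of non-zero coefficients in the cut's row."""
    return float(np.count_nonzero(a)) / a.size

def cut_features(a, b, x_lp):
    """Stack the sketched features into a small feature vector."""
    return np.array([violation_extent(a, b, x_lp), nonzero_proportion(a)])
```

A full feature extractor would append the remaining eleven entries per cut and feed the resulting matrix to the policy networks.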

F.2 IMPLEMENTATION DETAILS OF THE BASELINES

In this part, we present a detailed description of all the baselines used in this paper. We denote a cut by α^T x ≤ β and the optimal solution of the current LP relaxation by x*_LP. Throughout all experiments, we set the ratio of selected cuts to 0.2 for all score-based rules and learning-based baselines.

Random. Random selects a fixed ratio of the candidate cuts uniformly at random. The ratio is set to 0.2 in this paper.

Normalized Violation (NV).

NV is a score-based rule. It scores each cut by the normalized violation of the cut at the current LP solution, and selects a fixed ratio of cuts with the highest scores. The normalized violation is defined by $\max\{0, \frac{\alpha^\top x^*_{LP} - \beta}{|\beta|}\}$. The ratio is set to 0.2 in this paper.

Efficacy (Eff).

Eff is a score-based rule. It scores each cut by the Euclidean distance from the current LP solution to the cut's hyperplane, and selects a fixed ratio of cuts with the highest scores. The ratio is set to 0.2 in this paper.

Default. Default is the default cut selection rule used in SCIP 8.0.0 (Bestuzheva et al., 2021). Please refer to (Achterberg, 2007) for a detailed description of SCIP's default cut selection rule. Note that Default tackles two problems in cut selection by human-designed heuristics: (1) which cuts should be preferred, and (2) how many cuts should be selected. That is, Default selects a variable ratio of cuts rather than a fixed ratio.
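The two score-based rules above can be sketched as follows. Helper names, tie-breaking, and the at-least-one-cut floor are our own assumptions, and we assume β ≠ 0 in the normalized violation, as the |β| denominator in the rule requires.

```python
import numpy as np

# Sketch of the NV and Eff scoring rules for cuts of the form a^T x <= b,
# evaluated at the current LP optimum x_lp; the 0.2 ratio follows the text.
def normalized_violation(a, b, x_lp):
    """NV score: positive part of the violation, normalized by |b| (b != 0)."""
    return max(0.0, (float(a @ x_lp) - b) / abs(b))

def efficacy(a, b, x_lp):
    """Eff score: signed Euclidean distance from x_lp to the hyperplane a^T x = b."""
    return (float(a @ x_lp) - b) / np.linalg.norm(a)

def select_top_ratio(cuts, x_lp, score, ratio=0.2):
    """Keep the highest-scoring `ratio` of cuts (at least one)."""
    scores = [score(a, b, x_lp) for a, b in cuts]
    n_keep = max(1, int(ratio * len(cuts)))
    order = np.argsort(scores)[::-1]
    return [cuts[i] for i in order[:n_keep]]
```

Both rules rank cuts independently of one another, which is exactly the limitation the sequence model in the main text is designed to address.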

Score-based policy (SBP).

Since the state-of-the-art (SOTA) reinforcement learning based method for cut selection (Tang et al., 2020) is designed for the setting of selecting the single best cut in each round, we implement a slight variant of it, namely SBP, adapted to our setting of selecting a subset of cuts in each round. Specifically, the core idea of SBP is to learn a scoring function that measures cut quality, as Tang et al. (2020); Huang et al. (2022); Paulus et al. (2022) do. For a fair comparison, SBP uses the same cut features as HEM, and we train SBP via reinforcement learning as well. Our implemented SBP is also a slight variant of the method proposed in Huang et al. (2022). We emphasize that the experiments in the main text show that our implemented SBP is a strong baseline. Specifically, we implement the scoring function with a multi-layer perceptron (MLP) that predicts the score of a given cut based on the cut's features. The MLP contains two hidden layers with 128 units each. Moreover, we train the scoring function via evolution strategies as (Tang et al., 2020) does. We will also release the code for implementing SBP once the paper is accepted to be published.

Algorithm 2 (excerpt):
9: Compute the hierarchical policy gradient using D_train as in Proposition 1.
10: Update the parameters: θ1 = θ1 + α∇θ1 J([θ1, θ2]), θ2 = θ2 + α∇θ2 J([θ1, θ2]).
11: end for

F.3 IMPLEMENTATION DETAILS AND HYPERPARAMETERS

F.3.1 HARDWARE SPECIFICATION

Throughout all experiments, we use a single machine that contains eight GPU devices (NVIDIA GeForce RTX 3090 Ti) and two Intel Gold 6246R CPUs.

F.3.2 SOLVER SETUP

For reproducibility, we emphasize that all results in the main text are obtained by averaging results over the SCIP random seeds {1, 2, 3}.

F.3.3 REWARD FUNCTION

On the easy datasets, we set the reward as the negative solving time. On the medium and hard datasets, we set the reward as the negative primal-dual gap integral within a time limit of 300 seconds. For the real-world production planning problems, we set the reward as either the negative primal-dual gap integral within a time limit of 600 seconds or the negative dual bound improvement. The results reported in the main text are achieved by HEM with the negative dual bound improvement reward. We provide the performance of HEM that uses the negative primal-dual gap integral in Table 8; the results still show that HEM significantly outperforms all the baselines in terms of the Time and PD integral. We emphasize that the reward can be set according to the objective in real-world problems. For example, if we aim to minimize the primal-dual gap within a time limit, we can set the reward as the negative primal-dual gap within the time limit.

F.3.4 POLICY NETWORK ARCHITECTURE

The higher-level model contains an LSTM encoder and an MLP. The LSTM network encodes variable-sized inputs into hidden vectors of dimension 128. The MLP contains two hidden layers with 128 units each. The lower-level model is essentially a pointer network. We keep the hyperparameters of the pointer network the same as those used in (Bello* et al., 2017).
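A downsized sketch of this higher-level architecture could look like the following; the hidden size is shrunk from 128 to 16, and all details beyond "LSTM encoder + two-hidden-layer MLP" (gating, initialization, the sigmoid output head) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

F, H = 13, 16  # 13 cut features as in Appendix F.1; sketch-sized hidden dim

def init_params():
    n = lambda *shape: rng.normal(scale=0.1, size=shape)
    return dict(W=n(4 * H, F), U=n(4 * H, H), b=np.zeros(4 * H),
                W1=n(H, H), b1=np.zeros(H), W2=n(H, H), b2=np.zeros(H),
                w3=n(H), b3=0.0)

def lstm_encode(xs, p):
    """Run a single-layer LSTM over a variable-length sequence; return last h."""
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in xs:
        z = p["W"] @ x + p["U"] @ h + p["b"]  # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

def higher_level_ratio(cut_feats, p):
    """Map an (n_cuts, F) feature matrix to a selection ratio in (0, 1)."""
    h = lstm_encode(cut_feats, p)
    relu = lambda z: np.maximum(z, 0.0)
    h = relu(p["W1"] @ h + p["b1"])
    h = relu(p["W2"] @ h + p["b2"])
    return float(1.0 / (1.0 + np.exp(-(p["w3"] @ h + p["b3"]))))
```

The predicted ratio would then be turned into a count k over the n candidate cuts, which the lower-level pointer network consumes as its decoding length.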

F.3.5 OPTIMIZATION

Throughout all experiments, we apply the Adam optimizer with learning rate α1 = 1×10^-4 for the lower-level model and learning rate α2 = 5×10^-4 for the higher-level model. For each epoch, we collect 32 samples for training, and we set the total number of epochs to 100. Perhaps surprisingly, learning a good cut selection policy does not require much data, as also shown in Tang et al. (2020). For training stability, we delay the higher-level policy update. This creates a two-timescale algorithm, as is often required for convergence in the linear setting (Fujimoto et al., 2018; Konda & Tsitsiklis, 2003). We set the delay update freq to two; that is, we first train the lower-level policy twice and then train the higher-level policy once. Additionally, the results in Appendix G.3.4 show that the convergence performance of HEM is insensitive to the hyperparameter delay update freq.
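The delayed two-timescale schedule described above amounts to the following loop; the function names are placeholders for the two gradient steps of Proposition 1.

```python
# A minimal sketch of the delayed higher-level update (delay update freq = 2):
# the lower-level policy is updated every epoch and the higher-level policy
# every `freq` epochs. `update_lower` and `update_higher` are placeholders.
def train_schedule(epochs, update_lower, update_higher, freq=2):
    log = []
    for epoch in range(epochs):
        update_lower()                 # lower-level step on every epoch
        log.append("lower")
        if (epoch + 1) % freq == 0:    # delayed higher-level step
            update_higher()
            log.append("higher")
    return log
```

With freq = 2 the schedule interleaves two lower-level steps per higher-level step, matching the "train the lower-level policy twice, then the higher-level policy once" description.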

F.3.6 THE TRAINING ALGORITHM

We provide the procedure of the training algorithm of HEM in Algorithm 2.

F.4 MORE DETAILS OF HEM

F.4.1 DETAILS OF THE POINTER NETWORK

The pointer network was first introduced by (Vinyals et al., 2015) for directly finding solutions of specific combinatorial optimization problems, such as the Travelling Salesman Problem. The pointer network architecture is illustrated in Figure 5. It consists of a Long Short-Term Memory (LSTM) encoder, an LSTM decoder, and an attention module that is used as a pointer to select a member of the input sequence as the output (Vinyals et al., 2015). Specifically, we implement the pointer network following Bello* et al. (2017); please refer to (Bello* et al., 2017) for implementation details. The major difference between our pointer network and that of (Bello* et al., 2017) is that we use the pointer network to select ordered subsets of input sequences, whereas (Bello* et al., 2017) use it to output permutations of input sequences.

F.4.2 DIFFERENCE BETWEEN HEM-RATIO AND HEM W/O H

The policy networks of HEM-ratio and HEM w/o H are both essentially pointer networks (Vinyals et al., 2015), a variant of the sequence model. The major difference between HEM-ratio and HEM w/o H is the following. HEM w/o H predicts an end token, as used in language tasks (Sutskever et al., 2014; Vaswani et al., 2017), to implicitly determine the number of cuts that should be selected. In contrast, HEM-ratio always selects a fixed ratio of cuts, i.e., it always ends decoding once the given ratio of cuts has been selected.

F.4.3 MORE ADVANTAGES OF HEM

In this part, we provide details of some more advantages of HEM. (1) Inspired by hierarchical reinforcement learning (Sutton et al., 1999b; Nachum et al., 2018), HEM leverages the hierarchical structure of the cut selection task, which is important for efficient exploration in complex decision-making tasks. (2) Previous methods (Tang et al., 2020; Huang et al., 2022) usually train cut selection policies via black-box optimization methods such as evolution strategies (Salimans et al., 2017).
In contrast, HEM is differentiable, and we train it via gradient-based algorithms, which is more sample efficient than black-box optimization methods (Sutton & Barto, 2018; Schulman et al., 2015). Although we can generate as many training samples offline as we want using a MILP solver, high sample efficiency is significant because generating samples can be extremely time-consuming in practice.
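For concreteness, the pointer mechanism that underlies the lower-level model (Appendix F.4.1) can be sketched as follows. For brevity we keep the decoder query fixed and decode greedily, whereas the actual model updates the query with an LSTM decoder and samples during training; all tensor shapes here are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the attention "pointer" of (Vinyals et al., 2015): logits
# u_j = v^T tanh(W1 e_j + W2 q) over encoder states e_j form a distribution
# over the candidate cuts, with already-selected cuts masked out so decoding
# emits an ordered subset without repetition.
def pointer_step(enc, q, W1, W2, v, selected):
    u = np.array([v @ np.tanh(W1 @ e + W2 @ q) for e in enc])
    u[list(selected)] = -np.inf          # mask already-selected cuts
    p = np.exp(u - u[np.isfinite(u)].max())
    return p / p.sum()

def decode_ordered_subset(enc, q, W1, W2, v, n_select):
    selected = []
    for _ in range(n_select):
        p = pointer_step(enc, q, W1, W2, v, selected)
        selected.append(int(np.argmax(p)))  # greedy decoding for the sketch
    return selected

# Usage on random toy tensors: 6 candidate cuts with 4-dim encodings.
enc = rng.normal(size=(6, 4))
q = rng.normal(size=4)
W1, W2, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=8)
order = decode_ordered_subset(enc, q, W1, W2, v, 3)
```

Because the mask zeroes out previously chosen indices, the returned list is an ordered subset, which is precisely the output format the lower-level model needs.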

G.1 MORE MOTIVATING RESULTS

Ratio matters. To evaluate the effect of the ratio of selected cuts on solving MILPs, we focus on the Normalized Violation (NV) cut selection method with different ratios of selected cuts. (1) We first evaluate NV selecting 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, and 80% of the candidate cuts on four datasets. The results in Figure 6a show that NV achieves better solver performance with larger ratios on CORLAT and MIPLIB mixed neos. The results demonstrate that the ratio leading to the best solver performance varies across datasets, which implies that learning dataset-dependent ratios is important. (2) We then evaluate the same NV variants on four instances from the Anonymous dataset. The results in Figure 6b show that NV achieves better solver performance with larger ratios on Anonymous 121 and Anonymous 131. The results demonstrate that the best ratio also varies across instances from the same dataset, which implies that learning instance-dependent ratios is important as well (Huang et al., 2022).

(Figure 6 caption: The results in (c) show that NV with different given ratios achieves variable normalized PD integral on four datasets. The results in (d) show that NV with different ratios achieves variable normalized PD integral on different instances from the same dataset.)

G.2 MORE RESULTS OF MAIN EVALUATION

In this section, we provide more results of the main evaluation. The results in Table 9 show the performance of HEM and the baselines in terms of the total number of nodes (Nodes) and the primal-dual gap (PD gap). (1) Easy datasets. On the easy datasets, HEM and most baselines find the optimal solution within the time limit, as the PD gap converges to zero. Additionally, HEM significantly outperforms all the baselines in terms of Nodes on the easy datasets. (2) Medium and hard datasets. In terms of the PD gap, HEM outperforms all the baselines on the medium and hard datasets, especially on CORLAT. However, the Nodes metric cannot reliably distinguish the performance of different cut selection methods on the medium and hard datasets, for two reasons. First, the solving time of the LP relaxation at each node differs, and thus the number of nodes does not directly determine the solving time (Huang et al., 2022). Second, on the instances unsolved within the time limit, Nodes is not a proper metric, as it cannot evaluate the quality of the solving process.

G.3 MORE RESULTS OF ABLATION STUDY

In this section, we provide more results of ablation studies in the main text.

G.3.1 IN-DEPTH ANALYSIS OF HEM-RATIO AND SBP

We provide possible reasons why HEM-ratio performs worse than SBP on several challenging MILP benchmarks. Fundamentally, HEM-ratio formulates the cut selection task as a sequence modeling problem, which has two main advantages over SBP: the sequence model can not only capture the underlying order information, but also capture the interaction among cuts. However, training a sequence model is more difficult than training a scoring function, as the sequence model aims to learn a much more complex task. Specifically, the scoring function only learns to score each cut individually, while the sequence model learns the joint probability of the selected cuts, which is a more challenging learning task. Moreover, we follow the reinforcement learning paradigm instead of supervised learning to train the model, which makes the training process more unstable. Therefore, the sequence model may suffer from inefficient exploration and become trapped in a local optimum. As a result, HEM-ratio can perform worse than SBP, especially on challenging MILP benchmarks.

G.3.2 CONTRIBUTION OF EACH COMPONENT

To understand the contribution of each component of HEM, we provide more results of HEM and HEM without the higher-level model (HEM w/o H) on Set Covering, Multiple Knapsack, MIK, Load Balancing, Anonymous, and MIPLIB mixed supportcase. The results in Table 10 show that HEM outperforms HEM w/o H in terms of the solving time, the primal-dual gap, and the primal-dual gap integral on several challenging datasets, demonstrating the importance of our proposed higher-level model. Moreover, HEM w/o H significantly outperforms SBP in terms of the solving time and the primal-dual gap integral.

G.3.3 THE IMPORTANCE OF TACKLING (P1)-(P3)

(1) HEM. HEM tackles (P1)-(P3) in cut selection simultaneously. (2) HEM-ratio. In order not to learn how many cuts should be selected, we remove the higher-level model of HEM and force the lower-level model to select a fixed ratio of cuts; we denote this variant by HEM-ratio. Note that HEM-ratio is different from HEM w/o H (see Appendix F). HEM-ratio tackles (P1) and (P3) in cut selection. (3) HEM-ratio-order. To further mute the effect of the order of selected cuts, we reorder the cuts selected by HEM-ratio according to the original index of the generated cuts; we denote this variant by HEM-ratio-order. HEM-ratio-order mainly tackles (P1). The results in Table 11 suggest the following. HEM-ratio-order outperforms Default and NoCuts on several datasets, demonstrating that tackling (P1) by data-driven methods is crucial. HEM significantly outperforms HEM-ratio in terms of the primal-dual gap integral, demonstrating the significance of tackling (P2). HEM-ratio outperforms HEM-ratio-order on several datasets, which demonstrates the importance of tackling (P3). Moreover, HEM-ratio performs better than SBP in terms of the solving time and the primal-dual gap integral on all six datasets except Set Covering and MIPLIB mixed supportcase, which shows the superiority of formulating cut selection as a sequence-to-sequence learning problem over formulating it as a scoring task. However, HEM-ratio and HEM-ratio-order perform a little worse than SBP on Set Covering and MIPLIB mixed supportcase.
A possible reason is that it is more difficult to train a sequence model than to train a multi-layer perceptron and thus the sequence model may suffer from inefficient exploration and be trapped to the local optimum. Please refer to Appendix G.3.1 for a detailed analysis.
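As an illustration, the two ablation variants can be sketched as simple post-processing of the lower-level model's output. The function names and the fixed ratio below are hypothetical, not the paper's implementation.

```python
def hem_ratio_select(model_order, num_candidates, ratio=0.2):
    """HEM-ratio (sketch): without the higher-level model, keep a fixed
    fraction of the candidate cuts, in the order the sequence model
    emitted them (tackles P1 and P3). The default ratio is illustrative."""
    k = max(1, int(ratio * num_candidates))
    return model_order[:k]

def hem_ratio_order(selected):
    """HEM-ratio-order (sketch): additionally discard the learned order by
    sorting the selected cuts back to their original generation indices,
    so only P1 remains."""
    return sorted(selected)

# The lower-level model emitted candidate indices in this (learned) order:
order = [7, 2, 9, 0, 5, 1, 3, 8, 6, 4]
kept = hem_ratio_select(order, num_candidates=10, ratio=0.3)  # [7, 2, 9]
print(hem_ratio_order(kept))                                  # [2, 7, 9]
```

The sketch makes the hierarchy of the three variants explicit: HEM-ratio-order keeps only the subset choice, while HEM-ratio additionally keeps the learned order.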

G.3.4 SENSITIVITY ANALYSIS

Additionally, we perform ablation studies to test the sensitivity of HEM to the hyperparameter delay update freq d. The results in Table 12 show that there is a wide range of d for which HEM achieves comparable performance on Maximum Independent Set, CORLAT, and MIPLIB mixed neos. Moreover, we emphasize that we do not tune the hyperparameter d. As shown in Table 12, d = 3 and d = 4 perform the best on CORLAT and MIPLIB mixed neos, respectively. However, we simply set d = 2 throughout all experiments in the main text.
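A minimal sketch of what a delay update frequency d could look like, assuming (as the name suggests) that one of the two levels is updated only every d-th training step; the class and counters below are illustrative, not the actual training code.

```python
class DelayedUpdateSchedule:
    """Illustrative schedule: one model level is updated every step,
    while the other is updated only every d-th step."""
    def __init__(self, d=2):
        self.d = d
        self.step = 0
        self.lower_updates = 0
        self.higher_updates = 0

    def train_step(self):
        self.step += 1
        self.lower_updates += 1        # updated every step
        if self.step % self.d == 0:    # delayed by a factor of d
            self.higher_updates += 1

sched = DelayedUpdateSchedule(d=2)
for _ in range(10):
    sched.train_step()
print(sched.lower_updates, sched.higher_updates)  # 10 5
```

With d = 2 as in the main text, the delayed level receives half as many gradient updates as the other, which is the kind of schedule the sensitivity analysis varies.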

G.4 MORE RESULTS OF GENERALIZATION

Here we provide more results of the generalization experiments on Set Covering. On Set Covering, we test HEM on two times and four times larger instances than those seen during training. The results in Table 13 show that HEM generalizes well to instances that are significantly larger than those seen during training. In particular, HEM achieves at least 70% improvement in terms of the Time compared to all the rule-based baselines. Moreover, SBP also generalizes well to large instances, demonstrating that SBP is a strong baseline.

G.5 MORE VISUALIZATION RESULTS

In this part, we provide more visualization results on Set Covering and MIK. On MIK, we visualize the cuts selected by HEM-ratio and SBP on a randomly sampled instance. We perform principal component analysis (Mohri et al., 2018) on the selected cuts to reduce the cut features to a two-dimensional space. Each colored point illustrates a reduced cut feature. To visualize the diversity of selected cuts, we use dashed lines to connect the points with the smallest and largest x, y coordinates. The results in Figure 7 show that HEM-ratio selects much more diverse cuts than SBP on MIK. However, HEM-ratio performs worse than SBP on Set Covering (see Appendix G.3.1 for a detailed analysis). Therefore, we also visualize the cuts selected by HEM and SBP on a randomly sampled instance from Set Covering. Although HEM learns the number of cuts that should be selected, we find that HEM selects much fewer cuts than SBP. Specifically, HEM selects 25 cuts, while SBP selects 158 cuts. Interestingly, the results in Figure 7 show that SBP selects 158 similar cuts with high scores, while HEM selects much more diverse cuts. These results suggest that HEM tends to select cuts that complement each other nicely.
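The visualization procedure described above can be reproduced roughly as follows. This is a generic PCA-to-2D sketch (implemented here via NumPy's SVD rather than any particular library call), with the thirteen-dimensional cut features assumed as in the paper.

```python
import numpy as np

def pca_2d(cut_features):
    """Project cut features onto the top-2 principal components."""
    X = np.asarray(cut_features, dtype=float)
    X = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # shape (n_cuts, 2)

def extreme_points(points_2d):
    """Indices of the points with the smallest/largest x and y coordinates,
    which the figures connect with dashed lines to outline diversity."""
    x, y = points_2d[:, 0], points_2d[:, 1]
    return int(x.argmin()), int(x.argmax()), int(y.argmin()), int(y.argmax())

rng = np.random.default_rng(0)
features = rng.normal(size=(25, 13))             # 25 cuts, 13 features each
reduced = pca_2d(features)
print(reduced.shape)                             # (25, 2)
```

The area spanned by the four extreme points then serves as a rough proxy for how diverse the selected cuts are in feature space.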

G.6 EVALUATION WITH A TIME LIMIT OF THREE HOURS

In this section, we aim to evaluate whether HEM can generalize well to solving problems within a much longer time limit. Specifically, we evaluate HEM on two extremely challenging MIPLIB datasets within a time limit of three hours. Note that we still train HEM with a time limit of 300 seconds, while we test HEM with a time limit of three hours. The results in Table 14 show that HEM still significantly outperforms all the baselines, especially in terms of the primal-dual gap integral on MIPLIB mixed neos and MIPLIB mixed supportcase. In terms of the primal-dual gap, HEM also outperforms the baselines. Moreover, HEM performs better than baselines in terms of the solving time on MIPLIB mixed neos, but HEM performs poorly in terms of the solving time on MIPLIB mixed supportcase. Interestingly, the primal-dual gap integral is not always consistent with the solving time. We emphasize that we train with the negative primal-dual gap integral reward. To further improve the performance of HEM in terms of the solving time, we can set the reward as the negative solving time instead of the negative primal-dual gap integral.

G.7 TRAINING CURVES

In this section, we provide the training curves of HEM on all nine datasets. The results in Figure 8 show that the solving time and the primal-dual gap integral achieved by our learned policies decrease over the training epochs, demonstrating the effectiveness of our learning process.

G.8.1 GENERALIZE TO NON-ROOT NODES

Our learned models outperform the baselines for all nodes (both root and non-root nodes) under the one round setting, as shown in Table 17. Specifically, under the one round setting with non-root cuts, our model improves the Time and Primal-dual gap integral by up to 91.29% and 29.61%, respectively.

G.9 COMPARISON WITH MORE LEARNING-BASED METHODS

We compare HEM with AdaptiveCutsel (Turner et al., 2022) and Lookahead (Paulus et al., 2022) in Table 15. The experiments demonstrate that HEM significantly outperforms the two learning-based methods by a large margin in terms of the Time (up to 11.21% improvement) and Primal-dual gap integral (up to 24.36% improvement).



Please refer to https://www.huaweicloud.com/product/modelarts/optverse.html for details of OptVerse.



Figure 1: We design two cut selection heuristics, namely RandomAll and RandomNV (see Section 3 for details), which both add the same subset of cuts in random order for a given MILP. The results in (a) and (b) show that adding the same selected cuts in different order leads to variable overall solver performance.

Figure 4: Illustration of selecting cuts using a sequence to sequence model compared to using a scoring function. The sequence model has two main advantages. First, it captures the interaction among cuts by selecting cuts one by one. Consequently, it selects cut3 and cut1, which complement each other nicely, leading to a more tightened LP relaxation. Second, it naturally captures the order of selected cuts. A better order of selected cuts may lead to a better initial basis, thus solving the LP relaxation faster (Li et al., 2022) (see Section 3).
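To make the contrast in the caption concrete, here is a toy comparison between independent top-k scoring and sequential selection conditioned on what is already chosen. The gain function is entirely hypothetical and only mimics the "complementary cuts" effect.

```python
def score_topk(scores, k):
    """Scoring baseline: rank cuts independently and keep the top k.
    Interactions among cuts and their order are ignored."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def sequential_select(gain, n, k):
    """Toy sequential policy: pick the next cut with the highest marginal
    gain given the cuts already selected, mimicking how a sequence model
    conditions each choice on its previous outputs."""
    selected = []
    for _ in range(k):
        best = max((i for i in range(n) if i not in selected),
                   key=lambda i: gain(i, selected))
        selected.append(best)
    return selected

# Hypothetical gain: similar (adjacent-index) cuts are redundant together.
base = [3.0, 2.9, 2.5]
gain = lambda i, sel: base[i] - (2.0 if any(abs(i - j) == 1 for j in sel) else 0.0)

print(score_topk(base, 2))            # [0, 1]: two similar high-score cuts
print(sequential_select(gain, 3, 2))  # [0, 2]: complementary cuts
```

The scoring baseline picks the two most similar high-score cuts, while the sequential policy trades a slightly lower individual score for a cut that complements the first choice, which is exactly the behavior the figure illustrates.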

Pseudo code for training HEM.
1: Initialize hierarchical sequence model π_[θ1,θ2], MILP instances D, training dataset D_train, batch size N_b, training epochs N_e, policy learning rate α
2: for N_e epochs do
   …
6:   Sample action (k, a_k) at state s_0 with the policy π
7:   Receive reward r and add (s_0, k, a_k, r) to D_train

Figure 5: Illustration of the pointer network architecture introduced by Vinyals et al. (2015).

F.2 DIFFERENCE BETWEEN HEM-RATIO AND HEM W/O H

Details of HEM w/o H. To implement HEM w/o H, we augment each input sequence with an end token, i.e., a thirteen-dimensional tensor with all values being one. The end token is placed at the end position of the input sequence. Once the decoder of HEM w/o H outputs the end token, the decoding ends. That is, HEM w/o H can implicitly predict the number of cuts that should be selected by predicting whether to decode the end token at the current decoding step.
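A sketch of the end-token mechanism described above. The `step_fn` stand-in replaces the actual decoder; only the all-ones thirteen-dimensional end token is taken from the text.

```python
import numpy as np

END_TOKEN = np.ones(13)   # thirteen-dimensional tensor with all values one

def augment(cut_features):
    """Append the end token at the end position of the input sequence."""
    return np.vstack([cut_features, END_TOKEN])

def decode(step_fn, sequence):
    """Greedy decoding loop: `step_fn(sequence, selected)` is a stand-in for
    the decoder and returns the next pointed-to position. Decoding stops as
    soon as the end token (the last position) is produced, which implicitly
    determines how many cuts are selected."""
    end_pos = len(sequence) - 1
    selected = []
    for _ in range(len(sequence)):
        idx = step_fn(sequence, selected)
        if idx == end_pos:
            break
        selected.append(idx)
    return selected

seq = augment(np.zeros((5, 13)))                    # 5 cuts + end token
fake_decoder = lambda s, sel: [2, 0, 5][len(sel)]   # points at end token third
print(decode(fake_decoder, seq))                    # [2, 0]
```

The loop shows why the number of selected cuts is only implicit here, in contrast to HEM, whose higher-level model predicts that number explicitly.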

NV with different ratios on Anonymous.

Figure 6: We use the Normalized Violation (NV) rule (Huang et al., 2022). The results in (c) show that NV with different given ratios achieves variable normalized PD integral on four datasets. The results in (d) show that NV with different ratios achieves variable normalized PD integral on different instances from the same dataset.

Figure 7: We perform principal component analysis on cuts selected by HEM-ratio/HEM and SBP. Each colored point illustrates a reduced cut feature. To visualize the diversity of selected cuts, we use dashed lines to connect the points with the smallest and largest x, y coordinates.

Table 14: Policy evaluation on MIPLIB mixed neos and MIPLIB mixed supportcase with a time limit of three hours.

Policy evaluation on the easy, medium, and hard datasets. The best performance is marked in bold. Let m denote the average number of constraints and n denote the average number of variables. We report the arithmetic mean (standard deviation) of the Time and PD integral.

Comparison between HEM and HEM without the higher-level model.

Comparison between HEM, HEM-ratio, and HEM-ratio-order.

HEM still outperforms all the baselines. Especially on CORLAT, HEM achieves at least 33.48% improvement in terms of the PD integral compared to the baselines. (3) Hard datasets. HEM significantly outperforms the baselines in terms of the PD integral on several problems in the hard datasets. HEM achieves outstanding performance on two challenging datasets from MIPLIB 2017 and real-world problems (Load Balancing and Anonymous), demonstrating the powerful ability of HEM to enhance MILP solvers in large-scale real-world applications. Moreover, SBP performs extremely poorly on several medium and hard datasets, which implies that it can be difficult to learn good cut selection policies on challenging MILP problems.

Left: The generalization ability of HEM. Right: Test on Production Planning problems.

On MIS, we test HEM on four times and nine times larger instances than those seen during training. The results in Table 4 (Left) show that HEM significantly outperforms the baselines in terms of the Time and the PD integral on 4× and 9× MIS, demonstrating the superiority of HEM in terms of generalization ability. Interestingly, SBP also generalizes well to large instances, demonstrating that SBP is a strong baseline. We provide more results on Set Covering in Appendix G.4.

We perform principal component analysis on the cuts selected by HEM-ratio and SBP. Colored points illustrate the reduced cut features. The area covered by the dashed lines represents the diversity of selected cuts. The results show that HEM-ratio selects much more diverse cuts than SBP.


The statistical description of used datasets. In all datasets, m denotes the average number of constraints and n denotes the average number of variables. Inference Time denotes the inference time of our proposed HEM given the average number of candidate cuts.

The designed cut features of a generated cut α^T x ≤ β. (Suppose c denotes the objective coefficient.)

Evaluation on real-world production planning problems with rewards being the negative PD integral. The results show that HEM still significantly outperforms all the baselines.

Policy evaluation on easy, medium, and hard datasets in terms of the total number of nodes and the primal-dual gap. The best performance is marked in bold.

Comparison between HEM and HEM without the higher-level model on more datasets.

Comparison between HEM, HEM-ratio, and HEM-ratio-order on more datasets.

G.3.3 THE IMPORTANCE OF TACKLING P1-P3 IN CUT SELECTION

To understand the importance of tackling P1-P3 in cut selection, we provide more results of HEM, HEM-ratio, and HEM-ratio-order on Set Covering, Multiple Knapsack, MIK, Load Balancing, Anonymous, and MIPLIB mixed supportcase. Here we recall what HEM, HEM-ratio, and HEM-ratio-order mean. (1) HEM. HEM tackles P1-P3 in cut selection simultaneously.

Sensitivity analysis of HEM to the hyperparameter delay update freq d.

Evaluate the generalization ability of HEM on Set Covering.

Policy evaluation on MIPLIB mixed neos and MIPLIB mixed supportcase with a time limit of three hours.

ACKNOWLEDGEMENTS

The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported in part by National Natural Science Foundation of China grants U19B2026, U19B2044, 61836011, 62021001, and 61836006, and the Fundamental Research Funds for the Central Universities grant WK3490000004. We gratefully acknowledge the support of MindSpore used for this research. In addition, we would like to thank all the developers of OptVerse and Huawei Cloud Solver Lab for their support of this research.


/mindspore/models/tree/master/research/l2o/hem-learning-to-cut (MindSpore version).

* Equal contribution. This work was done when Zhihai Wang was an intern at Huawei Noah's Ark Lab.


We have conducted an ablation study to show that the performance improvement achieved by HEM comes from our novel problem formulation rather than from using more powerful models. The results in Table 16 show that HEM still outperforms the score-based policy (SBP) with more powerful LSTM models, in terms of the Time (up to 80%-67.31%=12.69% improvement) and the Primal-dual gap integral (up to 30.75%-22.17%=8.58% improvement).

Setups. We implement another baseline, namely SBP with LSTM (SBP+LSTM), which parametrizes the scoring function via an LSTM encoder and a multi-layer perceptron. In terms of model parameters, the model used by SBP with LSTM (172,289 parameters) is comparable to that of HEM (212,749 parameters).

Published as a conference paper at ICLR 2023

Results. HEM significantly outperforms SBP with LSTM, as shown in Table 16. The results demonstrate that HEM significantly outperforms SBP with more powerful models, suggesting that the better performance of HEM comes from our novel problem formulation.

G.11 EXPERIMENTS WITH ADVANCED MODELS

We conduct the following experiments to demonstrate that our method is applicable to advanced models. Replacing the pointer network with the Advanced Model of Kool et al. (2018) (HEM+AM), experiments show that HEM+AM outperforms the baselines (up to 79.61% improvement), as shown in Table 18.

G.12 GENERALIZE TO OTHER SOLVERS

Our proposed methodology generalizes well to other solvers, as shown in Table 19. The results demonstrate that HEM significantly outperforms the default cut selection method in the CBC solver (Saltzman, 2002) in terms of the primal-dual gap (up to 18.67% improvement). We do not use commercial solvers, such as Gurobi (Bixby, 2007) and CPLEX (Bliek et al., 2014), as the backend solver, since they do not provide interfaces for users to customize cut selection methods. As CBC cannot generate any cut on the Maximum Independent Set dataset, we conduct the experiments on the Load Balancing dataset. We use the primal-dual gap metric rather than the primal-dual gap integral for the following reasons. (1) The primal-dual gap is a well-recognized metric for evaluating solvers as well. (2) Unlike SCIP, CBC does not provide interfaces for users to acquire the primal-dual gap integral. Due to limited time, we did not implement such an interface.
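For reference, a minimal sketch of the relative primal-dual gap metric used in this comparison. Solvers differ in the exact normalization, so the convention below (dividing by the larger magnitude) is an illustrative assumption, not CBC's or SCIP's exact formula.

```python
def primal_dual_gap(primal, dual, eps=1e-9):
    """Relative gap between the best primal (incumbent) objective and the
    best dual bound; 0.0 means optimality has been proved."""
    if primal == dual:
        return 0.0
    return abs(primal - dual) / max(abs(primal), abs(dual), eps)

print(primal_dual_gap(100.0, 80.0))  # 0.2
print(primal_dual_gap(5.0, 5.0))     # 0.0
```

Because both bounds are available from any branch-and-cut solver, this metric is easy to compute externally even when, as with CBC here, the solver exposes no gap-integral interface.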

G.13 COMPUTATIONAL ANALYSIS OF OUR PROPOSED MODEL

We provide a detailed computational analysis of our proposed model and the baselines' models in Table 20. We summarize the conclusions in the following.

G.14 EXPERIMENTS WITH SOME SPECIFIC STRUCTURED MODELS

We have analyzed the selected cuts on Multiple Knapsack (a class of problems with specific structure) to show that our learned policies can capture the underlying structure of such problems. The results in Figure 9 show that our model mainly selects three kinds of cover inequalities, i.e., lifted knapsack cover inequalities (47%) (Gu et al., 1998), lifted minimal cover inequalities (43%) (Gu et al., 1998), and flow cover inequalities (2%) (Gu et al., 1999), for solving Multiple Knapsack problems. Specifically, we analyze the types of cuts selected by our proposed HEM on Multiple Knapsack, a class of problems with specific structure. It is known that a prominent class of cuts for knapsack problems is cover inequalities (Gu et al., 1998; 1999). The results demonstrate that our learned policies select cover inequalities for solving knapsack problems, suggesting that our model can well capture the underlying structure of specific problems.

G.15 MEASURING THE PRIMAL AND DUAL INTEGRALS

We have conducted experiments to measure the Primal Integral (PI) and Dual Integral (DI), as shown in Table 21. The results show that the performance improvement of HEM comes from both the primal and dual sides. Specifically, we use the optimal objective values as the reference values to measure the PI/DI. However, it is time-consuming to obtain optimal solutions for all instances, so we conduct the experiments on three easy datasets due to limited time. Interestingly, the results demonstrate that proper cut selection policies can improve both the PI and DI. Moreover, the results show that HEM achieves more improvement from the primal side than the dual side on Set Covering and Maximum Independent Set, while HEM achieves more improvement from the dual side on Multiple Knapsack.
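A minimal sketch of how such an integral can be measured from the incumbent history, assuming (as stated above) the optimal objective value as the reference; the piecewise-constant treatment and the function signature are illustrative.

```python
def primal_integral(times, bounds, opt, t_end):
    """Integrate the primal gap |bound(t) - opt| over [times[0], t_end],
    where times[i] is the moment bounds[i] became the incumbent and the
    bound is treated as piecewise constant between incumbent updates."""
    total = 0.0
    for i, (t, b) in enumerate(zip(times, bounds)):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += abs(b - opt) * (t_next - t)
    return total

# Incumbents 10 (at t=0) and 6 (at t=1), optimum 5, horizon 3 seconds:
print(primal_integral([0.0, 1.0], [10.0, 6.0], opt=5.0, t_end=3.0))  # 7.0
```

The dual integral is measured analogously from the history of dual bounds, so the two metrics separate how much of the gap-integral improvement comes from each side.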

