IMPROVING LEARNING TO BRANCH VIA REINFORCEMENT LEARNING

Abstract

Branch-and-Bound (B&B) is a general and widely used algorithmic paradigm for solving Mixed Integer Programming (MIP). Recently, there has been a surge of interest in designing learning-based branching policies as a fast approximation of strong branching, a human-designed heuristic. In this work, we argue that strong branching is not a good expert to imitate because of its poor decision quality once the side effects of solving branch linear programs are switched off. To obtain more effective and non-myopic policies than a local heuristic, we formulate the branching process in MIP as reinforcement learning (RL) and design a novel set representation and distance function for the B&B process associated with a policy. Based on this representation, we develop a novelty search evolutionary strategy for optimizing the policy. Across a range of NP-hard problems, our trained RL agent significantly outperforms expert-designed branching rules and state-of-the-art learning-based branching methods in terms of both speed and effectiveness. Our results suggest that with carefully designed policy networks and learning algorithms, reinforcement learning has the potential to advance algorithms for solving MIPs.

1. INTRODUCTION

Mixed Integer Programming (MIP) has been applied widely to many real-world problems, such as scheduling (Barnhart et al., 2003) and transportation (Melo & Wolsey, 2012). Branch and Bound (B&B) is a general and widely used paradigm for solving MIP problems (Wolsey & Nemhauser, 1999). B&B recursively partitions the solution space into a search tree and computes relaxation bounds along the way to prune subtrees that provably cannot contain an optimal solution. This iterative process requires sequential decision making: node selection (selecting the next solution space to evaluate) and variable selection (selecting the variable by which to partition the solution space) (Achterberg & Berthold, 2009). In this work, we focus on learning a variable selection strategy, which is the core of the B&B algorithm (Achterberg & Wunderling, 2013). Very often, instances from the same MIP problem family are solved repeatedly in industry, which gives rise to the opportunity of learning to improve the variable selection policy (Bengio et al., 2020). Building on human-designed heuristics, Di Liberto et al. (2016) learn a classifier that dynamically selects an existing rule to perform variable selection; Balcan et al. (2018) consider a weighted score of multiple heuristics and analyse the sample complexity of finding such a good weight. The first step towards learning a variable selection policy was taken by Khalil et al. (2016), who learn an instance-customized policy in an online fashion, as well as Alvarez et al. (2017) and Hansknecht et al. (2018), who learn a branching rule offline on a collection of similar instances. Those methods need extensive feature engineering and require strong domain knowledge in MIP. To avoid that, Gasse et al. (2019) propose a graph convolutional neural network approach that obtains competitive performance while only requiring raw features provided by the solver.
In each case, the branching policy is learned by imitating the decisions of strong branching, as it consistently leads to the smallest B&B trees empirically (Achterberg et al., 2005). In this work, we argue that strong branching is not a good expert to imitate. The excellent performance (the smallest B&B tree) of strong branching relies mostly on the information obtained in solving branch linear programs (LPs) rather than on the decisions it makes. This factor prevents learning a good policy by imitating only the decisions made by strong branching. To obtain more effective and non-myopic policies, i.e., minimizing the total number of solving nodes rather than maximizing the immediate duality gap improvement, we use reinforcement learning (RL) and model the variable selection process as a Markov Decision Process (MDP). Though the MDP formulation for MIP has been mentioned in previous works (Gasse et al., 2019; Etheve et al., 2020), the advantage of RL has not been demonstrated clearly in the literature. The challenges of using RL are multi-fold. First, the state space is a complex search tree, which can involve hundreds or thousands of nodes (with a linear program on each node) and evolves over time. Meanwhile, the objective of MIP is to solve problems faster. Hence a trade-off between decision quality and computation time is required when representing the state and designing a policy based on this state representation. Second, learning a branching policy by RL requires rolling out on a distribution of instances. Moreover, for each instance, the solving trajectory could contain thousands of steps and actions can have long-lasting effects. These factors result in a large variance in gradient estimation. Third, each step of variable selection can have hundreds of candidates. The large action set makes exploration in MIP very hard.
In this work, we address these challenges by designing a policy network inspired by primal-dual iteration and employing a novelty search evolutionary strategy (NS-ES) to improve the policy. For the efficiency-effectiveness trade-off, the primal-dual policy ignores redundant information and makes high-quality decisions on the fly. For reducing variance, the ES algorithm is an attractive choice as its gradient estimation is independent of the trajectory length (Salimans et al., 2017). For exploration, we introduce a new representation of the B&B solving process, employed by novelty search (Conti et al., 2018) to encourage visiting new states. We evaluate our RL-trained agent over a range of problems (namely, set covering, maximum independent set, capacitated facility location). The experiments show that our approach significantly outperforms state-of-the-art human-designed heuristics (Achterberg & Berthold, 2009) as well as imitation-based learning methods (Khalil et al., 2016; Gasse et al., 2019). In the ablation study, we compare our primal-dual policy net with GCN (Gasse et al., 2019), and our novelty-based ES with vanilla ES (Salimans et al., 2017). The results confirm that both our policy network and the novelty search evolutionary strategy are indispensable for the success of the RL agent. In summary, our main contributions are as follows:
• We point out the overestimation of the decision quality of strong branching and suggest that methods other than imitating strong branching are needed to find better variable selection policies.
• We model the variable selection process as an MDP and design a novel policy net based on primal-dual iteration over the reduced LP relaxation.
• We introduce a novel set representation and optimal transport distance for the branching process associated with a policy, based on which we train our RL agent using a novelty search evolution strategy and obtain substantial improvements in empirical evaluation.

2. BACKGROUND

Mixed Integer Programming. MIP is an optimization problem, typically formulated as

min_{x∈R^n} {c^T x : Ax ≤ b, ℓ ≤ x ≤ u, x_j ∈ Z, ∀j ∈ J}   (1)

where c ∈ R^n is the objective vector, A ∈ R^{m×n} is the constraint coefficient matrix, b ∈ R^m is the constraint vector, and ℓ, u ∈ R^n are the variable bounds. The set J ⊆ {1, ..., n} is an index set for the integer variables. We denote the feasible region of x as X.

Linear Programming Relaxation. LP relaxation is an important building block for solving MIP problems, where the integer constraints are removed:

min_{x∈R^n} {c^T x : Ax ≤ b, ℓ ≤ x ≤ u}.   (2)

Algorithm 1: Branch and Bound
Input: a MIP P in the form of Equation 1
Output: an optimal solution set x* and optimal value c*
1 Initialize the problem set S := {P_LP}, where P_LP is in the form of Equation 2. Set x* = ∅, c* = ∞;
2 If S = ∅, exit by returning x* and c*;
3 Select and pop an LP relaxation Q ∈ S;
4 Solve Q with optimal solution x̂ and optimal value ĉ;
5 If ĉ ≥ c*, go to 2;
6 If x̂ ∈ X, set x* = x̂, c* = ĉ, go to 2;
7 Select variable j, split Q into two subproblems Q_j^+ and Q_j^-, add them to S and go to 3;

Branch and Bound. LP-based B&B is the most successful method for solving MIP. A typical LP-based B&B algorithm is given in Algorithm 1 (Achterberg et al., 2005). It consists of two major decisions: node selection, in line 3, and variable selection, in line 7. In this paper, we focus on variable selection. Given an LP relaxation and its optimal solution x̂, variable selection means selecting an index j. Branching then splits the current problem into two subproblems, each representing the original LP relaxation with a new constraint: x_j ≤ ⌊x̂_j⌋ for Q_j^- and x_j ≥ ⌈x̂_j⌉ for Q_j^+ respectively. This procedure can be visualized as a binary tree, commonly called the search tree. We give a simple visualization in Section A.1.

Evolution Strategy. Evolution Strategies (ES) are a class of black-box optimization algorithms (Rechenberg, 1978).
In this work, we refer to the definition in Natural Evolution Strategies (NES) (Wierstra et al., 2008). NES represents the population as a distribution over parameter vectors θ characterized by parameters φ: p_φ(θ). NES optimizes φ to maximize the expected fitness f(θ) over the population, E_{θ∼p_φ}[f(θ)]. In recent work, Salimans et al. (2017) outline a version of NES applied to standard RL benchmark problems, where θ parameterizes the policy π_θ, φ_t = (θ_t, σ) parameterizes a Gaussian distribution p_φ(θ) = N(θ_t, σ²I), and f(θ) is the cumulative reward R(θ) over a full agent interaction. At every iteration, Salimans et al. (2017) apply n additive Gaussian perturbations to the current parameter and update the population as

θ_{t+1} = θ_t + α (1/(nσ)) Σ_{i=1}^n f(θ_t + σ ε_i) ε_i   (3)

To encourage exploration, Conti et al. (2018) propose the Novelty Search Evolution Strategy (NS-ES). In NS-ES, the fitness function f(θ) = λN(θ) + (1-λ)R(θ) is a combination of a domain-specific novelty score N and the cumulative reward R, where λ is the balancing weight.
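As a concrete illustration, the NES update above can be sketched in a few lines of numpy. This is a toy sketch on a quadratic fitness, not the paper's setup: in the paper, f(θ) would be the cumulative B&B reward of a rollout, and Salimans et al. additionally use tricks such as rank normalization that are omitted here.

```python
import numpy as np

def nes_step(theta, fitness, rng, alpha=0.05, sigma=0.1, n=100):
    # One NES update: theta <- theta + alpha/(n*sigma) * sum_i f(theta + sigma*eps_i) * eps_i
    eps = rng.standard_normal((n, theta.size))
    f = np.array([fitness(theta + sigma * e) for e in eps])
    return theta + alpha / (n * sigma) * eps.T @ f

# Toy fitness, maximized at theta = (1, -1); a real run would use the
# cumulative B&B reward R(theta) of a policy rollout instead.
target = np.array([1.0, -1.0])
fitness = lambda th: -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(300):
    theta = nes_step(theta, fitness, rng)
```

After a few hundred iterations, `theta` drifts toward the maximizer, even though the update only ever evaluates f, never its gradient.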

3. WHY IMITATING STRONG BRANCHING IS NOT GOOD

Strong branching is a human-designed heuristic that solves all possible branch LPs Q_j^+, Q_j^- ahead of branching. As strong branching usually produces the smallest B&B search trees (Achterberg, 2009), many learning-based variable selection policies are trained by mimicking strong branching (Gasse et al., 2019; Khalil et al., 2016; Alvarez et al., 2017; Hansknecht et al., 2018). However, we claim that strong branching is not a good expert: the reason strong branching can produce a small search tree is the reduction obtained in solving branch LPs, rather than its decision quality. Specifically, (i) strong branching can check lines 5 and 6 of Algorithm 1 before branching. If the pruning condition is satisfied, strong branching does not need to add the subproblem into the problem set S. (ii) Strong branching can strengthen other LP relaxations in the problem set S via domain propagation (Rodosek et al., 1999) and conflict analysis (Achterberg, 2007). For example, if strong branching finds that x_1 ≥ 1 and x_2 ≥ 1 can be pruned while solving branch LPs, then any other LP relaxation containing x_1 ≥ 1 can be strengthened by adding x_2 ≤ 0. These two reductions are direct consequences of solving branch LPs, and they cannot be learned by a variable selection policy. (iii) Strong branching activates primal heuristics (Berthold, 2006) after solving LPs. To examine the decision quality of strong branching, we employ vanilla full strong branching (Gamrath et al., 2020), which takes the same decisions as full strong branching while the side effects of solving branch LPs are switched off. Experiments in Section 5.2 show that vanilla full strong branching has poor decision quality. Hence, imitating strong branching is not a wise choice for learning a variable selection policy.
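To make the mechanism concrete, the following is a minimal sketch of the scoring step that (vanilla full) strong branching performs, using `scipy`'s LP solver. It is an illustrative reconstruction, not SCIP's implementation: the product score rule and the handling of infeasible children are common conventions, and crucially only the scores are kept here, none of the side information (bound tightenings, pruning) whose loss the section argues about.

```python
import numpy as np
from scipy.optimize import linprog

def strong_branching_scores(c, A_ub, b_ub, bounds, int_idx):
    """Score each fractional integer variable by solving both child LPs.

    Sketch of *vanilla* full strong branching: the two child objective
    gains are combined by the common product rule, and nothing else from
    the child LPs (pruning, bound changes) is retained.
    """
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    x = res.x
    scores = {}
    for j in int_idx:
        if abs(x[j] - round(x[j])) < 1e-6:
            continue  # already integral: not a branching candidate
        gains = []
        for lo, hi in [(bounds[j][0], np.floor(x[j])), (np.ceil(x[j]), bounds[j][1])]:
            child = list(bounds)
            child[j] = (lo, hi)
            r = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=child, method="highs")
            # infeasible child: treat as a large (very informative) gain
            gains.append(r.fun - res.fun if r.status == 0 else 1e6)
        scores[j] = max(gains[0], 1e-6) * max(gains[1], 1e-6)  # product rule
    return scores

# Tiny knapsack in minimization form: max 5x0 + 4x1 + 3x2, 2x0 + 3x1 + x2 <= 5
scores = strong_branching_scores(
    c=[-5, -4, -3], A_ub=[[2, 3, 1]], b_ub=[5],
    bounds=[(0, 1)] * 3, int_idx=[0, 1, 2])
```

On this instance the LP relaxation is integral in x_0 and x_2, so only x_1 gets a score: each candidate costs two extra LP solves, which is exactly the expense the learned approximations try to avoid.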

4. METHOD

Due to line 5 in Algorithm 1, a good variable selection policy can significantly improve solving efficiency. To illustrate how to improve the variable selection policy, we organize this section in three parts. First, we present our formulation of the variable selection process as an RL problem. Next, we introduce the LP-relaxation-based state representation and the primal-dual-based policy network. Then, we introduce our branching process representation and the corresponding NS-ES training algorithm.

4.1. RL FORMULATION

Let the B&B algorithm and problem distribution D be the environment. The sequential decision making of variable selection can be formulated as a Markov decision process. We specify the state space S, action space A, transition P and reward r as follows:
• State Space. At iteration t, the node selection policy pops an LP relaxation P_LP from the problem set S. We set the state representation to s_t = {P_LP, J, S}, where J is the index set of integer variables.
• Action Space. At iteration t, the action space is the index set of non-fixed integer variables determined by the relaxation: A(s_t) = {j ∈ J : ℓ_j < u_j}.
• Transition. Given state s_t and action a_t, the new state is determined by the node selection policy.
• Reward. As our target is solving the problem faster, we set the reward r_t = -1 with discount γ = 1. Maximizing the cumulative reward encourages the agent to solve problems in fewer steps.
In commercial solvers, the solving process is much more complicated than the B&B stated in Algorithm 1. For example, between lines 3 and 4, primal heuristics may be used to detect feasible solutions, and cutting planes may be applied to strengthen the LP relaxation. These solver components introduce more randomness into the transition, but our formulation remains valid.
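The reward design above has a simple consequence worth making explicit: with r_t = -1 and γ = 1, the return of an episode is exactly minus the number of branching decisions, so maximizing return is minimizing tree size. The sketch below uses a hypothetical environment interface (the names `reset`/`step` and the toy environment are ours, not the paper's) just to show this accounting.

```python
def rollout(env, policy):
    """Return of one episode under r_t = -1, gamma = 1: a plain sum of -1's,
    i.e. minus the number of branching decisions taken."""
    state, done, ret = env.reset(), False, 0.0
    while not done:
        action = policy(state)                   # pick a variable index from A(s_t)
        state, reward, done = env.step(action)   # reward is always -1
        ret += reward                            # gamma = 1: undiscounted sum
    return ret

class ToyEnv:
    """Stand-in for the B&B environment: terminates after a fixed number of branchings."""
    def __init__(self, n_steps):
        self.n_steps = n_steps
    def reset(self):
        self.t = 0
        return {"candidates": [0]}
    def step(self, action):
        self.t += 1
        return {"candidates": [0]}, -1.0, self.t >= self.n_steps
```

A policy that solves an instance in 7 branchings gets return -7; a better policy that needs only 5 gets -5, so ordinary return maximization prefers it.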

4.2. PRIMAL DUAL POLICY NET

Reduced LP. In the solving process, the variable bounds keep changing due to branching. We obtain our reduced LP relaxation by the following two steps: 1) remove fixed variables x_j, where ℓ_j = u_j, and plug their values into the constraints; 2) remove trivial constraints, where max_{ℓ≤x≤u} Σ_j A_ij x_j ≤ b_i. In the view of primal-dual iteration, the LP relaxation has the Lagrangian form

min_x max_λ c^T x + λ^T (Ax - b),  s.t. ℓ ≤ x ≤ u, 0 ≤ λ   (4)

where variables and constraints naturally form a bipartite graph. In the primal-dual iteration over Equation 4, fixed variables and trivial constraints always pass zeros and have no interaction with the other variables.

PD policy net. We parameterize our policy network π_θ(a_t|s_t) as a primal-dual iteration over the reduced LP relaxation, implemented by message passing:

Y_i ← f_C(Y_i, Σ_j A_ij m_C(X_j)),   X_j ← f_V(X_j, Σ_i A_ij m_V(Y_i))   (5)

[Figure 1: three example branching processes b_1, b_2, b_3 with their leaf polytopes (A_1, B_1, ...), normalized weights p(b_i), cost matrices W(b_1, b_2), W(b_1, b_3), and optimal transport plans Γ(b_1, b_2), Γ(b_1, b_3).]

Looking ahead to the set representation of Section 4.3: focusing on MIP, a subproblem R_i can be represented by a polytope, the feasible region of its LP relaxation. We define the weight function w as the number of feasible integer points in the polytope and the distance function d as the ℓ_1 distance between the centers of mass of two polytopes. For computational efficiency, we ignore the constraints and only consider the variable bounds, so that every polytope is a box. A simple illustration of these characterizations is given in Figure 1, where we can compute the distance D(b_1, b_2) = 3/2. Note that a polytope may degenerate to a lower dimension after branching, for example B_1 in b_1; a degenerate polytope has zero volume but may still contain integer points, which is why we choose the counting weight function w in our work. This completes the definition of the novelty score. Putting everything together, we summarize the training in Algorithm 2.
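To make the shapes in Equation 5 concrete, here is a minimal numpy sketch of two message-passing rounds on the reduced bipartite graph. The randomly initialized single-layer maps are stand-ins for the paper's trained networks (f_C, f_V are two-hidden-layer and m_C, m_V one-hidden-layer MLPs there), and the width is shrunk from the paper's h = 64.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8  # hidden width (the paper uses h = 64)

def mlp(d_in, d_out):
    # Hypothetical stand-in for the paper's small MLPs: one random linear layer + ReLU.
    W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    return lambda Z: np.maximum(Z @ W, 0.0)

m_C, m_V = mlp(h, h), mlp(h, h)           # message nets
f_C, f_V = mlp(2 * h, h), mlp(2 * h, h)   # update nets, applied to [self, message]

def pd_iteration(A, X, Y):
    """One primal-dual message-passing round of Equation 5 on the reduced LP.

    A: (m, n) reduced constraint matrix; X: (n, h) variable embeddings;
    Y: (m, h) constraint embeddings.
    """
    Y = f_C(np.concatenate([Y, A @ m_C(X)], axis=1))    # constraints gather from variables
    X = f_V(np.concatenate([X, A.T @ m_V(Y)], axis=1))  # variables gather from constraints
    return X, Y

m, n = 3, 5                                # 3 constraints, 5 variables after reduction
A = rng.standard_normal((m, n))
X, Y = rng.standard_normal((n, h)), rng.standard_normal((m, h))
X, Y = pd_iteration(A, X, Y)
X, Y = pd_iteration(A, X, Y)               # the paper uses two rounds before scoring
branch_scores = X @ rng.standard_normal(h) # stand-in for the score network f_S
```

Because every aggregation is a product with the reduced A, fixed variables and trivial constraints (which are removed from A) indeed contribute nothing, matching the primal-dual view above.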

In Equation 5, f_C and f_V are two-hidden-layer neural networks, m_C and m_V are one-hidden-layer neural networks, A_ij is the entry of the reduced constraint matrix A, and X, Y are the embeddings for variables and constraints, initialized from P_LP and J. As mentioned above, the original primal-dual iteration only occurs on the reduced LP; hence, our message passing in Equation 5 is defined only on the reduced graph. For efficiency, we do not include the problem set S, which makes the problem a partially observable MDP (Astrom, 1965). After two iterations of Equation 5, the variable embedding X is passed to a two-hidden-layer score network f_S, whose output is the final score for each variable. Since both the state reduction and the message passing are inspired by primal-dual iteration, we call it the PD policy. A more detailed discussion and comparison with GCN (Gasse et al., 2019) can be found in section A.2.2.

Algorithm 2: Novelty Search Evolution Strategy
Input: initial policy θ_0, learning rate α, noise standard deviation σ, population size n, iteration number T, novelty weight λ, decay rate β, policy memory M
1 Sample validation instances Q_1, ..., Q_N ∼ D;
2 Set F_best = (1/N) Σ_{j=1}^N f(θ_0, Q_j), θ_best = θ_0, and push θ_0 into M;
3 for t = 0 to T do
4   Sample instances P_1, ..., P_M ∼ D;
5   for i = 1 to n do
6     Sample ε_i ∼ N(0, I);
7     Compute F_i = (1/M) Σ_{m=1}^M f(θ_t + σε_i, P_m);
8     Compute N_i = (1/M) Σ_{m=1}^M N(θ_t + σε_i, P_m, M);
9     Send N_i and F_i from each worker to the coordinator;
10  end
11  Set θ_{t+1} = θ_t + α (1/(nσ)) Σ_{i=1}^n [λ N_i ε_i + (1-λ) F_i ε_i];
12  Compute F^{(t+1)} = (1/N) Σ_{j=1}^N f(θ_{t+1}, Q_j);
13  if F^{(t+1)} > F_best then
14    Set F_best = F^{(t+1)}, θ_best = θ_{t+1}, λ = λ · β;
15 end

4.3. SET REPRESENTATION FOR POLICY AND OPTIMAL TRANSPORT DISTANCE

We train the RL agent using an evolution strategy similar to NSR-ES (Conti et al., 2018), which requires defining a novelty score for the B&B process. In the general B&B algorithm, the solving process can be represented by a search tree, where each leaf is a solved subproblem. Given a branching policy π and an instance Q, we define our representation b(π, Q) = {R_1, ..., R_H} as the collection of leaf subproblems of the complete search tree. Focusing on MIP, a subproblem R_i is an LP relaxation, which can be represented by its feasible region, a polytope. For each polytope R_i (leaf subproblem), we define the weight function w(·) and the distance function d(·, ·) between two polytopes R_i and R_j as
• w(R_i) := #{x ∈ R_i : x is a feasible solution for Q};
• d(R_i, R_j) := ||g_i - g_j||_1, where g_i and g_j are the centers of mass of R_i and R_j respectively.
For example, in Figure 1 we have w(A_1) = 12 and d(A_1, A_2) = 3/2. We can then map the representation b = {R_1, ..., R_H} to a simplex p(b) ∈ Δ^{H-1} by normalizing the weights, p(R_j) = w(R_j) / Σ_{i=1}^H w(R_i), and compute a cost matrix W_ij = d(R_i, R_j) (see Figure 1 for examples). Finally, we define the metric D between two representations as the Wasserstein distance (or optimal transport distance) (Villani, 2008; Peyré et al., 2019):

D(b_1, b_2) = min_Γ Σ_{i,j} Γ_ij W_ij(b_1, b_2),  s.t. Γ1 = p(b_1), Γ^T 1 = p(b_2)   (6)

For example, in Figure 1, the distances D(b_1, b_2) = 3/2 and D(b_1, b_3) = 3/4 mean that b_3 is closer to b_1 than b_2 is; hence the corresponding policy π_3 is closer to π_1 than π_2 is. This provides a concrete method to measure the distance between two solving processes. It also provides a framework for the general B&B algorithm: one can choose the weight function w and distance function d depending on the properties of the solution space, and then compute the distance between any two B&B solving processes.
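The whole recipe (box weights by integer-point counting, ℓ_1 centers, Equation 6 as a small LP) fits in a short script. This is a sketch on two hypothetical leaf collections, not the paper's Figure 1 instance, and it solves the transport problem with a generic LP solver rather than a dedicated OT library.

```python
import numpy as np
from scipy.optimize import linprog

def box_weight(lo, hi):
    # w(R): number of integer points in the axis-aligned box [lo, hi]
    return int(np.prod(np.floor(hi) - np.ceil(lo) + 1))

def box_center(lo, hi):
    return (np.asarray(lo, float) + np.asarray(hi, float)) / 2.0

def wasserstein(b1, b2):
    """Optimal transport distance of Equation 6 between two sets of boxes (lo, hi)."""
    p = np.array([box_weight(*r) for r in b1], float); p /= p.sum()
    q = np.array([box_weight(*r) for r in b2], float); q /= q.sum()
    W = np.array([[np.abs(box_center(*r) - box_center(*s)).sum() for s in b2]
                  for r in b1])
    H, K = len(p), len(q)
    # Flatten the transport plan Gamma (H x K) and impose both marginals
    A_eq = np.zeros((H + K, H * K))
    for i in range(H):
        A_eq[i, i * K:(i + 1) * K] = 1        # Gamma 1 = p(b1)
    for j in range(K):
        A_eq[H + j, j::K] = 1                 # Gamma^T 1 = p(b2)
    res = linprog(W.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Two hypothetical leaf collections over [0, 3] x [0, 3]
b1 = [((0, 0), (1, 3)), ((2, 0), (3, 3))]
b2 = [((0, 0), (3, 1)), ((2, 2), (3, 3))]
```

`wasserstein(b1, b1)` is 0 and the function is symmetric, as a metric between solving processes should be; swapping in a different w or d, as the text suggests, only changes the two helper functions.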

4.4. NOVELTY SEARCH EVOLUTIONARY STRATEGY

Equipped with the metric D between representations, we can define the novelty score following Conti et al. (2018). Given a policy memory M (a collection of older policies) and an instance Q sampled from the problem distribution D, the novelty score is computed as

N(θ, Q, M) = (1/k) Σ_{π_j ∈ kNN(M, θ)} D(b(π_θ, Q), b(π_j, Q))

where kNN(M, θ) denotes the k nearest neighbors of π_θ in M. Going back to Algorithm 1, the B&B algorithm recursively splits the feasible region and obtains a set of polytopes when it finishes solving an instance. Notice that a polytope in the set representation is invariant to the generating order, i.e., branching on x_1 and then x_2 gives the same polytope as branching on x_2 and then x_1. As a result, our metric D and novelty score N are mostly determined by the pruning behavior during the solving process. Putting everything together, we summarize the training algorithm in section A.3.
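The novelty score itself is a small computation once D is available: sort the distances to the memory and average the k smallest. The sketch below uses 2-D points with an ℓ_1 distance as a toy stand-in for the set representations and the optimal transport distance.

```python
import numpy as np

def novelty(theta_repr, memory_reprs, dist, k=3):
    """N(theta, Q, M): mean distance from theta's B&B representation to its
    k nearest neighbors among the representations of the policies in memory M."""
    d = np.sort([dist(theta_repr, m) for m in memory_reprs])
    return float(np.mean(d[:k]))

# Toy stand-in: each "policy representation" is a 2-D point, distance is L1,
# in place of b(pi, Q) and the Wasserstein distance D.
dist = lambda a, b: float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())
memory = [(0, 0), (1, 0), (0, 1), (5, 5)]

score_near = novelty((0.5, 0.5), memory, dist, k=3)  # close to most of the memory
score_far = novelty((4, 4), memory, dist, k=3)       # far from most of the memory
```

A policy whose solving behavior stays near the memory gets a low score, while one that prunes differently (and so lands far away under D) gets a high score, which is exactly what the λN(θ) term in the NS-ES fitness rewards.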

5. EXPERIMENTS

We now present comparative experiments against two competing machine learning approaches and three of SCIP's branching rules to assess the value of our RL agent, as well as an ablation study to validate our choice of policy representation and training algorithm.

5.1. SETUP

Benchmarks: We consider three classes of instances, Set Covering (Balas & Ho, 1980), Maximum Independent Set (Albert & Barabási, 2002) and Capacitated Facility Location (Cornuéjols et al., 1991), which are not only challenging for state-of-the-art solvers, but also representative of problems encountered in practice. For each class, we set up a backbone based on which we randomly generate the dataset, as many real-world problems also share the same backbone. For example, a logistics company frequently solves instances on very similar transportation networks with different customer demands. We generate set covering instances using 1000 columns. We train on instances with 500 rows and evaluate on instances with 500 rows (test), 1000 rows (medium transfer), and 1500 rows (hard transfer). We generate maximum independent set instances on graphs with 400 nodes; we train on graphs with 400 nodes and evaluate on graphs with 400 nodes (test), 1000 nodes (medium transfer), and 1500 nodes (hard transfer). We generate capacitated facility location instances with 100 facilities. We train on instances with 40 customers and evaluate on instances with 40 customers (test), 200 customers (medium transfer), and 400 customers (hard transfer). More details are provided in section A.4. Settings: Throughout all experiments, we use SCIP 7.0.1 as the backend solver, with a time limit of 1 hour. For SCIP parameters, we have two settings: clean and default. The clean setting switches off other SCIP components, such as estimate node selection, cutting planes and primal heuristics. This way, the evaluation eliminates the interference from other components of the solver on the variable selection policy. Under the clean setting, the number of solving nodes reflects the decision quality of the variable selection policy only. So, we compare the decision quality of different methods under the clean setting. The default setting of SCIP turns on all components inside SCIP and is tuned for solving real problems.
We therefore compare the ability of different methods to solve challenging problems under the default setting. Baselines: We compare against: Reliability Pseudocost Branch (RPB) (Achterberg & Berthold, 2009), the human-designed state-of-the-art branching rule, which computes strong branching in the beginning and gradually switches to pseudocost branching; Full Strong Branching (FSB); vanilla full strong branching (VFS) (Gamrath et al., 2020); and two recent machine learning policies, the support vector machine (SVM) rank approach (Khalil et al., 2016) and the GCN approach (Gasse et al., 2019). We denote our method as RL, which is the primal-dual net trained by NS-ES. Metrics. To minimize the expected solving cost, the metrics are selected as the average solving time (T_avg) over all instances and the average number of solving nodes (N_avg) over instances solved by all methods. Since MIP instances can vary a lot in difficulty, we also count the number of times each method leads the performance over the number of times each method solves the instance within the time limit (Wins) as a third, more robust metric. Implementation. Details of the implementation are provided in section A.2.

5.2. DECISION QUALITY

We evaluate variable selection quality by solving 100 test instances under the clean setting. Since we are comparing decision quality, we say a method wins in this experiment if it results in the smallest number of solving nodes. As FSB and RPB benefit a lot from branching LP information (section 3), we do not include them when counting Wins. Table 1 shows our RL agent leads in win times on all datasets, and its average solving nodes on set covering and maximum independent set are significantly better than those of the other methods.

5.3. GENERALIZATION TO LARGER INSTANCES

It is very important for RL agents to transfer to larger unseen instances, as training on large instances is very expensive in the real world. We investigate the generalization ability of our RL agent by solving 100 transfer instances under the default setting. To meet the needs of practice, we say a method wins in this experiment if it results in the fastest solving time. As VFS is not able to solve any transfer instance within the time limit, we do not list its results in Table 4. We can see that, except for RPB and SVM having comparable performance on hard set covering and hard facility location respectively, the RL agent leads the performance. On set covering (hard) and maximum independent set (hard), we do not compute the average number of nodes for full strong branching as it solves too few instances.

5.4. IMPROVEMENT ANALYSIS

Having seen the improvements brought by RL, we ask what kind of decisions our agent learns. We answer this question in two aspects: finding a lower primal bound c* and obtaining a higher dual value ĉ. We first examine the primal bound c*. Figure 2 plots the feasible solutions found during the solving process. A point (n, y) means we find a feasible solution c* = y in a subproblem containing n branch constraints. Figure 2 shows that our RL agent is able to detect a small c* at an early stage. Hence, it can prune more subproblems and solve the MIP faster. On the contrary, VFS fails to detect feasible solutions efficiently. One reason is that, traditionally, strong branching and other human-designed heuristics mainly aim at obtaining a higher ĉ. Our result suggests a new possibility for researchers: variable selection methods that are good at detecting feasible solutions. Next, we check the local dual value ĉ. To eliminate the influence of the changing primal bound c*, we initialize c* = c_opt with the optimal value, like Khalil et al. (2016). We plot the curve of average width versus depth in Figure 3. The area under the curve equals the average number of solving nodes, which we report in the legend. Also, as c* is fixed, the width-versus-depth plot characterizes how many branches are needed to increase the local dual value ĉ to c* so as to close a subproblem. A smaller width indicates that the variable selection policy closes the gap faster. VFS performs better under this setting than in Figure 2, while it is still beaten by the learning-based methods. Figure 3 shows that although our RL agent has the worst width in the beginning, it has the lowest peak and leads the overall performance. This means our RL agent successfully employs a non-myopic policy to maximize ĉ in the long term.

5.5. ABLATION STUDY

Settings (1) and (2) having larger rewards than (3) and (4) shows that the PD policy can obtain more improvement than GCN.
Also, (2) and (4) having larger rewards than (1) and (3) shows that novelty search helps find better policies. The results suggest that RL improves learning to branch, and that both the PD policy and NS-ES are indispensable to the success of the RL agent.

6. DISCUSSION

In this work, we point out the overestimation of the decision quality of strong branching. The evidence in Table 1 shows that VFS performs poorly on synthetic datasets under the clean setting. An interesting phenomenon is that GCN can easily beat VFS after imitation learning (and our PD policy obtains a similar result). One possible explanation is that the primal-dual message-passing structure naturally learns the good decisions and ignores the noise introduced by strong branching. Another possible reason is biased sampling. To keep the diversity of the samples, Gasse et al. (2019) employ a mixed policy of RPB and VFS to sample the training data. VFS probably performs well on most states but has poor decision quality when trapped in certain regions; as a result, VFS has poor overall performance. Fortunately, using the mixed policy as the behavior policy helps escape from these regions, and hence the collected data have good decision quality. More studies are needed before we can give a confident answer to this question.

7. CONCLUSION

We present an NS-ES framework to automatically learn the variable selection policy for MIP. Central to our approach are the primal-dual policy network and the set representation of the B&B process. We demonstrate that our RL agent makes high-quality variable selections across different problem types and sizes. Our results suggest that, with carefully designed policy networks and learning algorithms, reinforcement learning has the potential to advance algorithms for solving MIPs.

A APPENDIX

A.1 BRANCH AND BOUND Here we give a simple illustration of the B&B algorithm in Figure 5. Given the LP relaxation, the polytope represents the feasible region of the LP relaxation and the red arrow represents the objective vector. We first solve the LP relaxation and obtain the solution x̂, shown as the red point. Noticing it is not feasible for the MIP, we branch the LP relaxation into two subproblems. In (a) we select to split on variable x_1 and in (b) we select to split on variable x_2. The subproblems obtained after branching are displayed as the shaded purple regions. After finishing solving these two MIPs, we obtain the search trees t_1 and t_2. We can see that a wise selection of variable x_2 can solve the problem faster.

[Figure 5: in (a), branching first on x_1 (x_1 ≤ 1 / x_1 ≥ 2) requires further branching on x_2 (x_2 ≤ 2 / x_2 ≥ 3), giving leaves (x_1, x_2) = (1, 2), (x_1, x_2) = (3, 2) and an infeasible leaf; in (b), branching directly on x_2 (x_2 ≤ 2 / x_2 ≥ 3) gives (x_1, x_2) = (3, 2) and an infeasible leaf.]
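Algorithm 1 can also be illustrated in code. The following is a minimal LP-based B&B sketch: it follows the structure of Algorithm 1 but uses depth-first node selection and a most-fractional branching rule as simple stand-ins for the policies discussed in the paper, with `scipy`'s LP solver for the relaxations.

```python
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, bounds, int_idx, tol=1e-6):
    """Minimal LP-based B&B following Algorithm 1.

    Node selection: depth-first (a stack as the problem set S).
    Variable selection: most fractional (a stand-in branching rule)."""
    best_x, best_val = None, math.inf
    stack = [bounds]                          # problem set S of variable-bound lists
    while stack:
        bnds = stack.pop()
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bnds, method="highs")
        if res.status != 0 or res.fun >= best_val - tol:
            continue                          # infeasible, or pruned by bound (line 5)
        x = res.x
        frac = [(abs(x[j] - round(x[j])), j) for j in int_idx
                if abs(x[j] - round(x[j])) > tol]
        if not frac:                          # integral: update incumbent (line 6)
            best_x, best_val = x, res.fun
            continue
        _, j = max(frac)                      # branch on the most fractional variable (line 7)
        lo, hi = bnds[j]
        down, up = list(bnds), list(bnds)
        down[j] = (lo, math.floor(x[j]))      # Q_j^-: x_j <= floor(x_j)
        up[j] = (math.ceil(x[j]), hi)         # Q_j^+: x_j >= ceil(x_j)
        stack += [down, up]
    return best_x, best_val

# Small binary knapsack in minimization form: max 5x0 + 4x1 + 3x2, 2x0 + 3x1 + x2 <= 5
x_opt, v_opt = branch_and_bound(
    c=[-5, -4, -3], A_ub=[[2, 3, 1]], b_ub=[5],
    bounds=[(0, 1)] * 3, int_idx=[0, 1, 2])
```

On this instance, the sketch recovers the integer optimum (x_0, x_1, x_2) = (1, 1, 0) with value 9; swapping the `max(frac)` line for a learned scorer is exactly where the paper's policy plugs in.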

A.2.2 PD POLICY

Comparison. PD policy is similar to the GCN in Gasse et al. (2019) but has two major differences. First, we use a dynamically reduced graph in which fixed variables and trivial constraints are removed as the variable bounds change during the solving process, whereas Gasse et al. (2019) do not consider this. The reduced graph not only saves computation but also gives a more accurate description of the solving state by discarding redundant information; the ablation in Section 5.5 shows it is indispensable to the success of RL. Second, we use a simple matrix multiplication in our PD policy, while Gasse et al. (2019) use a complicated edge embedding in GCN. In some sense, GCN can be seen as an over-parameterized version of our method, and our success suggests that message passing on the LP relaxation is the truly helpful structure. Details. We implement our primal-dual policy network using dgl (Wang et al., 2019), with hidden dimension h = 64 and ReLU activations. The feature X for a variable is a 17-dimensional vector and the feature Y for a constraint is a 5-dimensional vector; we list the feature details in the corresponding table in this appendix. VFS. We use the implementation in SCIP (Gamrath et al., 2020). RPB. We use the implementation in SCIP (Gamrath et al., 2020). GCN. We tried to implement GCN in dgl (Wang et al., 2019); however, it is significantly slower than the original implementation of Gasse et al. (2019), so we use the implementation of Gasse et al. (2019). SVM. We use the implementation in Gasse et al. (2019).
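A minimal numpy sketch of one primal-dual message-passing round and of the dynamic graph reduction described above. The weight names and function signatures are illustrative assumptions; the actual network is implemented in dgl.

```python
import numpy as np

def pd_round(A, X, Y, W):
    """One message-passing round on the LP bipartite graph via plain
    matrix multiplication: A is the (m, n) constraint matrix, X the
    (n, h) variable states, Y the (m, h) constraint states."""
    relu = lambda z: np.maximum(z, 0.0)
    Y = relu(A @ X @ W["v2c"] + Y @ W["c"])    # variable -> constraint
    X = relu(A.T @ Y @ W["c2v"] + X @ W["v"])  # constraint -> variable
    return X, Y

def reduce_graph(A, lb, ub):
    """Dynamic reduction: drop variables fixed by bound changes
    (lb == ub) and constraints left with no remaining variables."""
    free = lb < ub
    A = A[:, free]
    active = np.abs(A).sum(axis=1) > 0
    return A[active], free, active
```

The contrast with an edge-embedding GCN is that here the messages are just `A @ X` and `A.T @ Y`: the LP coefficient matrix itself serves as the (fixed) edge weighting.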

A.3 TRAINING

We have two settings: clean and default. In the experiments, we always train and test under the same setting. Imitation Learning. We initialize our PD policy using imitation learning, similar to Gasse et al. (2019). The difference is that we use only 10000 training samples, 2000 validation samples, and 10 training epochs as a warm start. In our setting, a policy trained from scratch can hardly solve an instance in a reasonable time, so a warm start is necessary. Novelty Search Evolution Strategy. We improve our RL agent using Algorithm 2. The parameters are set as α = 1e-4, σ = 1e-2, n = 40, V = 200, w = 0.25, β = 0.99, T = 1000, k = 10. Maximum Independent Set. We generate maximum independent set problems on Barabási-Albert (Albert & Barabási, 2002) graphs. The problem is formulated as the following ILP:

max Σ_{v∈V} x_v
s.t. x_u + x_v ≤ 1, ∀ e_uv ∈ E
     x_v ∈ {0, 1}, ∀ v ∈ V

where V is the set of vertices and E is the set of edges. We generate the BA graph using preferential attachment with affinity coefficient 4. We first generate a BA graph G_0 with 350 nodes; then, every time we want to generate a new problem with n variables, we expand G_0 using preferential attachment. Capacitated Facility Location. We generate capacitated facility location problems following Cornuéjols et al. (1991). The problem with m customers and n facilities is formulated as the MIP given below, where x_i = 1 indicates facility i is open, and x_i = 0 otherwise; f_i is the fixed cost of opening facility i; d_j is the demand of customer j; c_ij is the transportation cost between facility i and customer j; and y_ij is the fraction of the demand of customer j filled by facility i. Following Cornuéjols et al. (1991), we first sample the locations of the facilities and customers on a two-dimensional map; c_ij is then determined by the Euclidean distance between facility i and customer j, and the other parameters are sampled from the distributions given in Cornuéjols et al. (1991).
We first generate the locations of 100 facilities and 40 customers as our backbone. Then, every time we want to generate a new problem with m customers, we generate m - 40 new customer locations and follow the pipeline described above.
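The backbone-and-expansion generation for the maximum independent set instances can be sketched as follows. This is a pure-Python preferential-attachment sketch with illustrative names, not the paper's generator.

```python
import random

def ba_expand(edges, degree_pool, new_nodes, m, rng):
    """Attach `new_nodes` extra nodes, each linking to m existing nodes
    sampled proportionally to degree (via the repeated-node pool)."""
    start = max(degree_pool) + 1 if degree_pool else 0
    for v in range(start, start + new_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(degree_pool))
        for t in targets:
            edges.append((v, t))
            degree_pool.extend([v, t])  # both endpoints gain degree
    return edges, degree_pool

def ba_graph(n, m, rng):
    """BA graph with affinity coefficient m, grown from m seed nodes."""
    edges, pool = [], list(range(m))
    return ba_expand(edges, pool, n - m, m, rng)
```

To mimic the backbone trick in the text, one would first build G_0 with `ba_graph(350, 4, rng)` and later call `ba_expand` again to grow the same graph to n variables, so all instances share the G_0 backbone.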

A.5 EXPERIMENTS ON THE BENCHMARK FROM GASSE ET AL. (2019)

We are mostly interested in improving the variable selection policy on similar problems; hence, we generate our benchmark based on a backbone. The backbone lets the instances share common structure, so that a good policy exists for the given distribution of problems. Our experiments show that NS-ES is able to learn good policies in this setting. However, it is also interesting to check the performance of our method on a more random distribution. Here, we conduct experiments on the benchmark from Gasse et al. (2019), employing the same instance generator and SCIP settings as Gasse et al. (2019). For each category, we evaluate the policy on 20 instances with 5 random seeds. We report the average solving time T_avg and the shifted geometric mean solving time T_geo over all instances, the average number of solving nodes N_avg and the shifted geometric mean number of solving nodes N_geo over the instances solved by all methods, and Wins for the number of times a method leads over the number of instances solved to optimality. As VFS is too slow to solve the challenging instances, we only report its performance on the easy instances. We can see in Table 4 that the improvement from the RL method is smaller than that in the main text. Intuitively, randomly generated instances share less structure, which leaves less room for RL to improve the policy. How to improve branching policies for randomly generated problems remains a question for future exploration.
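The shifted geometric means T_geo and N_geo can be computed as below. The shift value of 10 is a conventional choice in MIP benchmarking and is an assumption here, since the text does not state the shift used.

```python
import math

def shifted_geomean(xs, shift=10.0):
    """Geometric mean of (x + shift), minus shift; damps the influence
    of very small solving times or node counts compared with the plain
    geometric mean."""
    s = sum(math.log(x + shift) for x in xs)
    return math.exp(s / len(xs)) - shift
```

Unlike the arithmetic mean, this aggregate is not dominated by a few very hard instances, which is why it is the standard summary statistic for solver comparisons.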



The source code has been released in Gasse et al. (2019).



Figure 2: (l) characterization b_1, (m) characterization b_2, (r) distribution and the cost matrix, where kNN(M, θ) is the set of k nearest neighbors of π_θ in M. In this definition, the novelty score encourages policies whose characterizations are far from the characterizations in the policy memory.

Figure 1: (left) three policies π 1 , π 2 and π 3 produce three sets of polytopes b 1 , b 2 and b 3 respectively for the same problem Q, (right) example cost matrix W and transportation matrix Γ.

Here, b_1, b_2 and b_3 are the sets of polytopes produced by the three different policies π_1, π_2 and π_3 respectively: b_1 = {A_1, B_1} is a set of two polytopes (leaf subproblems), b_2 = {A_2, B_2, C_2} is a set of three polytopes, and b_3 = {A_3, B_3} is a set of two polytopes. For computational efficiency, we ignore the constraints and consider only the variable bounds, so that every polytope is a box.
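A simplified sketch of this box-set representation: boxes are (lower, upper) bound tuples, the set distance is a cheap greedy matching cost (an illustrative stand-in for the paper's transport distance with cost matrix W), and the novelty score averages the distances to the k nearest characterizations in the policy memory. All names here are illustrative.

```python
def box_dist(b1, b2):
    """L1 distance between two boxes, each given as a pair
    (lower bounds, upper bounds) over the same variables."""
    (l1, u1), (l2, u2) = b1, b2
    return (sum(abs(a - b) for a, b in zip(l1, l2))
            + sum(abs(a - b) for a, b in zip(u1, u2)))

def set_dist(B1, B2):
    """Symmetric greedy matching cost between two sets of boxes -- a
    cheap proxy for an optimal-transport distance between the sets."""
    d = sum(min(box_dist(b, c) for c in B2) for b in B1)
    d += sum(min(box_dist(b, c) for b in B1) for c in B2)
    return d / (len(B1) + len(B2))

def novelty(b, memory, k):
    """Novelty score: mean distance from characterization b to its
    k nearest neighbors in the policy memory."""
    ds = sorted(set_dist(b, m) for m in memory)[:k]
    return sum(ds) / len(ds)
```

A policy whose leaf boxes differ from everything in memory gets a high novelty score and is therefore favored by the novelty-search objective.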

Figure 2: Primal bounds versus the depth in the search tree (number of branch constraints) at which they are found

Figure 3: Average width versus depth

split on x1 and search tree t1

split on x2 and search tree t2

Figure 5: Illustration of splitting in B&B and the corresponding search tree

min Σ_{i=1}^n f_i x_i + Σ_{i=1}^n Σ_{j=1}^m c_ij y_ij
s.t. Σ_{i=1}^n y_ij = 1, ∀ j = 1, …, m
     Σ_{j=1}^m d_j y_ij ≤ u_i x_i, ∀ i = 1, …, n
     y_ij ≥ 0, ∀ i = 1, …, n and j = 1, …, m
     x_i ∈ {0, 1}, ∀ i = 1, …, n

Input: learning rate α, noise standard deviation σ, number of workers n, validation size V, novelty weight w, weight decay rate β, iterations T, initial parameter θ_0, policy memory M, instance distribution D. Output: best parameter θ_best
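Restating the core update of Algorithm 2 as code: one NS-ES step perturbs θ with Gaussian noise, scores each perturbation by a weighted combination of reward fitness and novelty, and applies the standard ES gradient estimate. This is a sketch under assumed interfaces (`fitness` and `novelty` as callables on parameter vectors), not the paper's implementation.

```python
import random

def nses_step(theta, fitness, novelty, alpha, sigma, n, w, rng=random):
    """One NS-ES update: sample n Gaussian perturbations of theta,
    score each by (1 - w) * fitness + w * novelty, standardize the
    scores, and take the usual ES gradient step."""
    eps = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(n)]
    scores = []
    for e in eps:
        cand = [t + sigma * x for t, x in zip(theta, e)]
        scores.append((1 - w) * fitness(cand) + w * novelty(cand))
    # standardize scores for a scale-free gradient estimate
    mu = sum(scores) / n
    sd = (sum((s - mu) ** 2 for s in scores) / n) ** 0.5 or 1.0
    scores = [(s - mu) / sd for s in scores]
    grad = [sum(s * e[j] for s, e in zip(scores, eps)) / (n * sigma)
            for j in range(len(theta))]
    return [t + alpha * g for t, g in zip(theta, grad)]
```

Setting w = 0 recovers plain ES; w > 0 trades some reward signal for exploration toward characterizations far from the policy memory.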

Policy evaluation on test instances. Wins are counted by the number of times a method results in the least number of solving nodes. The time T_avg is reported in seconds.

Policy evaluation on transfer instances. Wins are counted by the number of times a method results in the fastest solving time. The time T_avg is reported in seconds.



Feature X for variable and feature Y for constraint


A.4 DATA SET

Set Covering. We generate weighted set covering problems following Balas & Ho (1980). The problem is formulated as the following ILP:

min Σ_{S∈S} w_S x_S
s.t. Σ_{S: e∈S} x_S ≥ 1, ∀ e ∈ U
     x_S ∈ {0, 1}, ∀ S ∈ S

where U is the universe of elements, S is the universe of sets, and w is a weight vector. For any e ∈ U and S ∈ S, e ∈ S with probability 0.05, and we guarantee that every e is contained in at least two sets of S. Each w_S is sampled uniformly from the integers 1 to 100.
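The set covering generation procedure can be sketched as follows; the function and variable names are illustrative, while the probability 0.05, the coverage-by-two guarantee, and the weight range 1..100 follow the text.

```python
import random

def gen_set_cover(n_elems, n_sets, p=0.05, max_w=100, seed=0):
    """Weighted set covering instance: element e belongs to set S with
    probability p; every element is forced into at least two sets;
    weights w_S are uniform integers in 1..max_w."""
    rng = random.Random(seed)
    member = [[rng.random() < p for _ in range(n_sets)]
              for _ in range(n_elems)]
    for row in member:
        while sum(row) < 2:            # guarantee coverage by >= 2 sets
            row[rng.randrange(n_sets)] = True
    weights = [rng.randint(1, max_w) for _ in range(n_sets)]
    return member, weights
```

The coverage-by-two repair loop is what keeps the LP relaxation non-trivial: with only one covering set, the corresponding variable is forced to 1 in every feasible solution.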

