LEARNING TO GENERATE COLUMNS WITH APPLICATION TO VERTEX COLORING

Abstract

We present a new column generation approach based on Machine Learning (ML) for solving combinatorial optimization problems. The aim of our method is to generate high-quality columns that belong to an optimal integer solution, in contrast to the traditional approach that aims at solving linear programming relaxations. To achieve this aim, we design novel features to characterize a column, and develop an effective ML model to predict whether a column belongs to an optimal integer solution. We then use the ML model as a filter to select high-quality columns generated from a sampling method and use the selected columns to construct an integer solution. Our method is computationally fast compared to the traditional methods that generate columns by repeatedly solving a pricing problem. We demonstrate the efficacy of our method on the vertex coloring problem, by empirically showing that the columns selected by our ML model are significantly better, in terms of the integer solution that can be constructed from them, than those selected randomly or based only on their reduced cost. Further, we show that the columns generated by our method can be used as a warm start to boost the performance of a column generation-based heuristic.

1. INTRODUCTION

Machine Learning (ML) has been increasingly used to tackle combinatorial optimization problems (Bengio et al., 2020) , such as learning to branch in Mixed Integer Program (MIP) solvers (Khalil et al., 2016; Balcan et al., 2018; Gupta et al., 2020) , learning heuristic search algorithms (Dai et al., 2017; Li et al., 2018) , and learning to prune the search space of optimization problems (Lauri & Dutta, 2019; Sun et al., 2021b; Hao et al., 2020) . Given a large amount of historical data to learn from, ML techniques can often outperform random approaches and hand-crafted methods typically used in the existing exact and heuristic algorithms. Predicting optimal solutions for combinatorial optimization problems via ML has attracted much attention recently. A series of studies (Li et al., 2018; Lauri & Dutta, 2019; Sun et al., 2021b; Grassia et al., 2019; Fischetti & Fraccaro, 2019; Lauri et al., 2020; Sun et al., 2021a; 2022; Ding et al., 2020; Abbasi et al., 2020; Zhang et al., 2020) have demonstrated that predicting optimal solution values for individual variables can achieve a reasonable accuracy. The predicted solution can be used in various ways (e.g., to prune the search space of a problem (Lauri & Dutta, 2019; Sun et al., 2021b; Hao et al., 2020) or warm-start a search method (Li et al., 2018; Zhang et al., 2020; Sun et al., 2022) ) to facilitate the solving of combinatorial optimization problems. However, for a symmetric optimization problem, predicting optimal values for individual decision variables does not provide much benefit for solving the problem. For example, in the vertex coloring problem (VCP) (See Section 2 for a formal problem definition), a random permutation of the colors in an optimal solution results in an alternative optimal solution, and thus predicting the optimal colors for individual vertices is not very useful. On the other hand, predicting a complete optimal solution for a problem directly is too difficult. 
This is partially due to the NP-hardness of a problem and the difficulty in designing a generic representation for solutions of different sizes. In this paper, we take an intermediate step by developing an effective ML model to predict columns (or fragments (Alyasiry et al., 2019)) that belong to an optimal solution of a combinatorial optimization problem. To illustrate our method, we use the VCP as an example, in which a column is a Maximal Independent Set (MIS) (Tarjan & Trojanowski, 1977), whose vertices can share the same color in a feasible solution. The aim of our ML model is to predict which MISs belong to an optimal solution for a given problem instance. To train our ML model, we construct a training set using solved problem instances with known optimal solutions, where each training instance corresponds to a column in a training graph. Three categories of features are designed to represent a column: 1) problem-specific features computed from the graph data, 2) statistical measures computed from sample solutions, and 3) linear program (LP) features (e.g., reduced cost) computed from the LP relaxation of the MIP model. A training instance is labeled as positive if the corresponding column belongs to an optimal solution; otherwise it is labeled as negative. This is then a standard binary classification problem, and any existing classification algorithm can be used for this task. We use the trained ML model to evaluate the quality of columns, and combine it with a sampling method to generate high-quality columns for unseen problem instances. Specifically, our method starts by randomly generating a subset of columns. It then computes the features for each column in the current subset and uses the trained ML model to evaluate the quality of the columns. The low-quality columns predicted by ML are replaced by new ones generated by a sampling method.
This process is repeated for multiple iterations with low-quality columns filtered out and high-quality columns remaining. A subproblem formed by the selected columns is then solved to generate an integer solution. We call this method Machine Learning-based Column Generation (MLCG). Our MLCG method is a significant departure from the traditional CG methods (Lübbecke & Desrosiers, 2005; Mehrotra & Trick, 1996) . Firstly, the traditional methods typically generate columns to solve the LP relaxation of the MIP, while our MLCG method aims at generating columns that are included in a high-quality integer solution. Secondly, the traditional methods select columns only based on their reduced cost, while our method learns a more robust criterion via ML based on a set of features for selecting columns. Thirdly, the traditional methods typically generate columns by repeatedly solving a pricing problem, while our method samples columns on the fly and uses the trained ML model to filter out low-quality columns. To demonstrate the effectiveness of our MLCG method, we evaluate it on the VCP, though the same idea is generally applicable to other combinatorial optimization problems. We empirically show that our ML model can achieve a high accuracy in predicting which columns are part of an optimal solution on the problem instances considered. The columns selected by our ML model are significantly better, in terms of the integer solution that can be constructed from them, than those selected randomly or based purely on their reduced cost. Furthermore, we use the subset of columns generated by our method to warm-start a CG-based heuristic, the Restricted Master Heuristic (RMH) (Taillard, 1999; Bianchessi et al., 2014) , and the results show that our method combined with RMH significantly outperforms RMH alone in terms of both solution quality and run time.

2. BACKGROUND AND RELATED WORK

Vertex Coloring Problem Formulation. Given an undirected graph G(V, E), where V is the set of vertices and E is the set of edges, the objective of the VCP is to assign a color to each vertex, such that adjacent vertices have different colors and the total number of colors used is minimized. Since adjacent vertices cannot share the same color by the problem definition, the vertices that are of the same color in any feasible solution must form an Independent Set. Therefore, the VCP is equivalent to a set partitioning problem which aims to select the minimum number of Independent Sets from a graph such that each vertex is covered exactly once. This is also equivalent to a set covering problem, the objective of which is to minimize the number of Maximal Independent Sets (MISs) selected such that each vertex in the graph is covered at least once (Mehrotra & Trick, 1996). Let S denote a MIS, $\mathcal{S}$ denote the set of all MISs of a graph G, and $\mathcal{S}_v$ denote the set of MISs that contain vertex $v \in V$. We use a binary variable $x_S$ to denote whether a MIS S is selected. The set covering formulation of the VCP is defined in (1)-(3) (Mehrotra & Trick, 1996):

$$\min \sum_{S \in \mathcal{S}} x_S, \quad (1)$$
$$\text{s.t.} \quad \sum_{S \in \mathcal{S}_v} x_S \geq 1, \quad v \in V; \quad (2)$$
$$x_S \in \{0, 1\}, \quad S \in \mathcal{S}. \quad (3)$$

A variable $x_S$ corresponds to a MIS in the graph and also to a column of the constraint matrix of the MIP. As the number of MISs in a graph is potentially exponential in |V|, the number of columns of the MIP can be very large. It can be shown that the LP relaxation of (1)-(3) provides a bound at least as good as that of the compact MIP formulation (Mehrotra & Trick, 1996). Although the set covering formulation has a tight LP relaxation, solving the large LP relaxation is difficult. The traditional CG and branch-and-price (B&P) algorithms (Mehrotra & Trick, 1996; Barnhart et al., 1998; Gualandi & Malucelli, 2012) resolve this by generating a subset of columns on the fly, instead of enumerating all the columns upfront, as described in the following.
Column Generation. Let $\tilde{\mathcal{S}} \subset \mathcal{S}$ denote an arbitrary subset of MISs based on which at least one feasible LP solution can be generated, and $\tilde{\mathcal{S}}_v$ denote the set of MISs in $\tilde{\mathcal{S}}$ that contain vertex $v \in V$. To solve the LP relaxation of (1)-(3), CG starts by solving a restricted master problem (4)-(6) with $\tilde{\mathcal{S}}$:

$$\min \sum_{S \in \tilde{\mathcal{S}}} x_S, \quad (4)$$
$$\text{s.t.} \quad \sum_{S \in \tilde{\mathcal{S}}_v} x_S \geq 1, \quad v \in V; \quad (5)$$
$$0 \leq x_S \leq 1, \quad S \in \tilde{\mathcal{S}}. \quad (6)$$

The dual solution for constraint (5), $\pi = \{\pi_1, \pi_2, \ldots, \pi_{|V|}\}$ ($\pi \geq 0$), can be obtained after solving the restricted master problem, and can be used to compute the reduced cost of a variable $x_S$ for any $S \in \mathcal{S}$. Let $u = \{u_1, u_2, \ldots, u_{|V|}\}$ be the binary string representation of a MIS S, where $u_i = 1$ if the vertex $v_i \in S$, otherwise $u_i = 0$. The reduced cost of $x_S$ (for MIS S) is computed as $1 - \sum_{i=1}^{|V|} \pi_i u_i$. The reduced cost is the amount by which the objective function coefficient of $x_S$ would have to be reduced before S would be cost-effective to use (i.e., before $x_S$ would take a non-zero value in the optimal LP solution). In other words, if the reduced cost of a variable $x_S$ is negative, adding S into the subset $\tilde{\mathcal{S}}$ and solving the restricted master problem again would lead to a better LP solution, assuming nondegeneracy. If the reduced cost of $x_S$ is non-negative for every $S \in \mathcal{S}$, adding any other MIS into $\tilde{\mathcal{S}}$ would not improve the LP solution; hence, the LP has been solved to optimality with the current subset $\tilde{\mathcal{S}}$. However, as the number of MISs in a graph may be exponentially large, explicitly computing the reduced cost for every decision variable $x_S$ is often impractical. CG resolves this issue by optimizing a pricing problem (7)-(9) to compute the minimum reduced cost among all variables:

$$\min_u \; 1 - \sum_{i=1}^{|V|} \pi_i u_i, \quad (7)$$
$$\text{s.t.} \quad u_i + u_j \leq 1, \quad (v_i, v_j) \in E; \quad (8)$$
$$u_i \in \{0, 1\}, \quad v_i \in V. \quad (9)$$

The constraints (8) and (9) ensure that the solution u is a valid independent set in the graph G. The objective (7) minimizes the reduced cost among all possible MISs in the graph.
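As a concrete illustration, the reduced-cost computation above is a one-liner. This sketch assumes vertices are indexed 0..|V|-1 and a MIS is given as a collection of vertex indices; the function name is ours, not from the paper:

```python
def reduced_cost(mis, pi):
    """Reduced cost of the variable x_S for a MIS S, given duals pi.

    mis: iterable of vertex indices in the independent set S.
    pi:  list of dual values pi_v, one per vertex (pi >= 0).
    Every x_S has objective coefficient 1, so the reduced cost is
    1 minus the sum of the duals of the covered vertices.
    """
    return 1.0 - sum(pi[v] for v in mis)
```

With duals pi = [0.5, 0.1, 0.4], the MIS {0, 2} has reduced cost 1 - 0.9 = 0.1, which is non-negative, so adding it would not improve the current LP solution.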
The pricing problem is equivalent to a maximum weighted independent set problem with the dual value $\pi_i$ being the weight of the vertex $v_i$ (for $i = 1, \ldots, |V|$) (Sakai et al., 2003). If the minimum objective value of the pricing problem is negative, the optimal solution $u^*$ generated (which is a MIS) is added into $\tilde{\mathcal{S}}$, and the above process is repeated. If the minimum objective value of the pricing problem is non-negative, the LP relaxation of (1)-(3) has been solved to optimality with the current $\tilde{\mathcal{S}}$, and the optimal objective value of the restricted master problem provides a valid lower bound for (1)-(3). To generate an integer solution to the MIP (1)-(3), a sub-problem with the subset of MISs $\tilde{\mathcal{S}}$ can be solved by a MIP solver. This approach is known as RMH (Taillard, 1999; Bianchessi et al., 2014). Alternatively, CG can be applied at each node of the branch-and-bound tree to compute an LP bound, resulting in a B&P algorithm (Mehrotra & Trick, 1996; Gualandi & Malucelli, 2012). A potential issue of the CG approach is that it may require solving many instances of the pricing problem, which may be NP-hard itself. Further, the columns are generated based only on the reduced cost. Although this guarantees the optimality of the LP relaxation, the best integer solution contained in the subset may not be of high quality. This motivates us to use ML to learn a better rule from a set of features (including reduced cost), to predict which columns belong to an optimal integer solution. There have been only a few studies that use ML to improve CG and B&P algorithms. Václavík et al. (2018) developed a regression model to predict an upper bound on the optimal objective value of the pricing problem, which was then used to prune the search space of the pricing problem. Morabit et al. (2020) developed a supervised learning model to select the minimum number of columns that could lead to the best LP solution at each iteration of CG. Shen et al.
(2022) developed an ML-based pricing heuristic to boost the solving of pricing problems. These methods all remain within the scope of the traditional CG framework, which is significantly different from our method, as explained earlier. The ML model designed in (Furian et al., 2021) was used to configure the variable selection and node selection in a B&P algorithm, rather than to improve CG itself.
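For intuition, the pricing problem (7)-(9) described in this section can be solved by brute force on tiny graphs: enumerate every MIS and return the one of minimum reduced cost. This is only a didactic sketch (the paper solves pricing as a MIP with Gurobi); the function names are ours:

```python
from itertools import combinations

def all_mis(n, edges):
    """Enumerate every maximal independent set of a small graph
    with vertices 0..n-1 (exponential time; illustration only)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # all non-empty independent sets
    ind = [set(c) for k in range(1, n + 1)
           for c in combinations(range(n), k)
           if all(b not in adj[a] for a, b in combinations(c, 2))]
    # keep only the sets not strictly contained in another one
    return [s for s in ind if not any(s < t for t in ind)]

def price(n, edges, pi):
    """Return the MIS minimizing the reduced cost 1 - sum(pi[v])."""
    return min(all_mis(n, edges), key=lambda s: 1.0 - sum(pi[v] for v in s))
```

On the path graph 0-1-2 with duals (0.5, 0.4, 0.5), the MISs are {1} and {0, 2}; pricing returns {0, 2}, whose reduced cost is 0.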

3. PREDICTING OPTIMAL COLUMNS

In this section, we develop an effective ML model to predict which columns (i.e., MISs) are part of an optimal solution for the VCP.

3.1. FEATURE EXTRACTION

Let $\tilde{\mathcal{S}}$ denote a subset of MISs generated from a graph G(V, E), which contains at least one feasible solution. We extract three categories of features to characterize each MIS in $\tilde{\mathcal{S}}$. Note that if the graph is small, we can easily enumerate all the MISs in it, and $\tilde{\mathcal{S}}$ in this case is the full set of MISs. Problem-specific features. Let S be a MIS in $\tilde{\mathcal{S}}$, and |S| be the size of S. The first problem-specific feature designed to characterize S is $f_1(S) = |S| / \max_{S' \in \tilde{\mathcal{S}}} |S'|$. Generally, the larger a MIS is, the more likely it belongs to an optimal solution. This is simply because a larger MIS contains more vertices, and therefore it potentially requires fewer MISs to cover all vertices in a graph. Let $\tilde{\mathcal{S}}_v \subset \tilde{\mathcal{S}}$ denote the set of MISs that contain vertex $v \in V$, and $\alpha_v$ denote the maximum size of the MISs that contain vertex v: $\alpha_v = \max_{S \in \tilde{\mathcal{S}}_v} |S|$. The ratio $|S|/\alpha_v$ computes the relative 'payoff' of using the set S to cover vertex v, compared to using the largest MIS in $\tilde{\mathcal{S}}_v$ that can cover vertex v. The next four problem-specific features compute the maximum ($f_2$), minimum ($f_3$), average ($f_4$) and standard deviation ($f_5$) of $|S|/\alpha_v$ across the vertices in S. Let deg(v) be the degree of vertex $v \in V$, i.e., the number of neighbours of v, and $\Delta$ be the maximum degree of the vertices in V: $\Delta = \max_{v \in V} \deg(v)$. The next four features designed to characterize S are the maximum ($f_6$), minimum ($f_7$), average ($f_8$) and standard deviation ($f_9$) of the normalized degree ($\deg(v)/\Delta$) across the vertices in S. Statistical features. Let $x \in \{0,1\}^{|\tilde{\mathcal{S}}|}$ be the binary string representation of a sample solution to the VCP, where a binary variable $x_S = 1$ if and only if the corresponding MIS S is in the solution, for each $S \in \tilde{\mathcal{S}}$. We first use the method presented in Appendix A.1 to efficiently generate n sample solutions $\{x^1, x^2, \ldots, x^n\}$, where $x^i_S = 1$ if and only if S is in the $i$th sample solution. Let $y = \{y_1, y_2, \ldots, y_n\}$ be the objective values of the n solutions.
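The nine problem-specific features above can be sketched as follows, with a sampled pool of MISs standing in for $\tilde{\mathcal{S}}$ (the helper names are ours, not from the paper):

```python
import statistics

def problem_features(S, pool, adj):
    """Features f1-f9 for one MIS S relative to a pool of MISs.

    S: set of vertex indices; pool: list of sets (the subset of MISs);
    adj: adjacency given as a list of neighbor sets, one per vertex.
    """
    f1 = len(S) / max(len(t) for t in pool)
    # relative 'payoff' |S| / alpha_v for each vertex v covered by S
    payoff = [len(S) / max(len(t) for t in pool if v in t) for v in S]
    delta = max(len(adj[v]) for v in range(len(adj)))  # max degree
    ndeg = [len(adj[v]) / delta for v in S]  # normalized degrees
    agg = lambda xs: (max(xs), min(xs), statistics.mean(xs),
                      statistics.pstdev(xs))
    return (f1, *agg(payoff), *agg(ndeg))  # f1, then f2-f5, then f6-f9
```

For the path graph 0-1-2, the MIS {0, 2} gets f1 = 1 and a payoff ratio of 1 for both of its vertices, since it is the largest MIS covering each of them.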
The first statistical feature is an objective-based measure, which accumulates the 'payoff' of using S to construct solutions in terms of the objective values: $f_{obm}(S) = \sum_{i=1}^{n} x^i_S / y_i$. Because vertex coloring is a minimization problem, a MIS that frequently appears in high-quality sample solutions is expected to have a larger accumulated score. Let $r = \{r_1, r_2, \ldots, r_n\}$ denote the ranking of the n sample solutions in terms of their objective values. The next statistical feature used is the ranking-based measure originally proposed in (Sun et al., 2021b): $f_{rbm}(S) = \sum_{i=1}^{n} x^i_S / r_i$. If S frequently appears in high-quality sample solutions (those with a smaller rank), it is more likely to have a larger ranking-based score. The objective-based and ranking-based scores are normalized by their maximum values across $\tilde{\mathcal{S}}$. In addition, we compute the Pearson correlation coefficient and Spearman's rank correlation coefficient between the values of the binary decision variable $x_S$ and the objective values y over the sample solutions. If $x_S$ is highly negatively correlated with y, the sample solutions containing S generally have a smaller objective value than those without S. We present an efficient method to compute these statistical features in Appendix A.2. Linear programming features. The LP relaxation of a MIP is typically much more efficient to solve than the MIP itself. Solving the LP relaxation can provide very useful information about the importance of a decision variable. Specifically, we solve the restricted master problem (4)-(6) with the subset of MISs $\tilde{\mathcal{S}}$ using the Gurobi solver. The first LP feature extracted for a MIS $S \in \tilde{\mathcal{S}}$ is its value in the optimal LP solution. The optimal LP solution value of $x_S$ is a good indication of which binary value $x_S$ takes in the optimal integer solution; as shown in Appendix B.2, the mutual information between the LP and integer solutions is noticeable.
In general, if $x_S$ has a fractional value closer to 1 in the optimal LP solution, it is more likely to be 1 in the optimal integer solution to the sub-MIP. The next LP feature extracted for $S \in \tilde{\mathcal{S}}$ is the reduced cost (see Section 2) of the corresponding variable $x_S$. When the restricted master problem is solved to optimality for $\tilde{\mathcal{S}}$, the reduced cost of the decision variable $x_S$ is non-negative for every $S \in \tilde{\mathcal{S}}$. Furthermore, if the value of $x_S$ in the optimal LP solution is greater than 0, the reduced cost of $x_S$ must be 0. In general, the larger the reduced cost of $x_S$, the less cost-effective S is for constructing solutions. The three categories of features are designed carefully, each capturing different characteristics of a MIS. The problem-specific features focus on the local characteristics of a MIS, such as the number of vertices that a MIS can cover. Our statistical features are motivated by the observation that many optimization problems have a "backbone" structure (Wu & Hao, 2015); in other words, high-quality solutions potentially share some components with the optimal solution. The goal of our statistical features is to extract the shared components from high-quality solutions. The LP features are based on LP theory, which is widely used by state-of-the-art MIP solvers. The relevance of each category of features is investigated in Appendix B.2 due to the page limit.
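The two accumulation scores can be computed directly from the sample solutions. A minimal sketch, omitting the normalization over $\tilde{\mathcal{S}}$ and the two correlation features, with names of our choosing:

```python
def statistical_scores(col, solutions, objectives):
    """Objective-based and ranking-based scores for one column.

    col: index of the MIS; solutions: list of sets of column indices
    used by each sample solution; objectives: their objective values.
    """
    n = len(objectives)
    # rank 1 = best (smallest) objective value
    order = sorted(range(n), key=lambda i: objectives[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r + 1
    f_obm = sum(1.0 / objectives[i] for i in range(n) if col in solutions[i])
    f_rbm = sum(1.0 / rank[i] for i in range(n) if col in solutions[i])
    return f_obm, f_rbm
```

A column appearing in every sample solution accumulates 1/y_i from each of them, so columns concentrated in low-objective (good) solutions score highest.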

3.2. CLASS LABELING

To construct a training set, we need to compute the optimal solutions for a set of problem instances for class labeling. We use small graphs in which all the MISs can be enumerated upfront as our training problem instances. For each training graph G(V, E), we first use the method proposed by Tsukiyama et al. (1977) to list all MISs ($\mathcal{S}$) in the graph. This method is based on vertex sequencing and backtracking, and has a time complexity of $O(|V| \cdot |E| \cdot |\mathcal{S}|)$. The exact solver Gurobi is then used to compute the optimal solutions for the problem instance by solving the MIP formulation (1)-(3). The existing optimal solution prediction approaches in the literature typically compute only one optimal solution of an optimization problem to supervise the training of an ML model (Li et al., 2018; Sun et al., 2021b; Ding et al., 2020; Sun et al., 2021a). However, this is insufficient in our case, because there often exist multiple optimal solutions to the VCP. For example, in some of the training graphs used in our experiments, more than 90% of MISs are part of optimal solutions, indicating the existence of multiple optimal solutions. To tackle this, we present in Appendix A.3 a brute-force approach to compute, for each MIS in a graph, whether it belongs to any optimal solution. A MIS is assigned a class label of 1 if it belongs to any optimal solution, and 0 otherwise.

3.3. TRAINING AND TESTING

We use multiple graphs to construct a training set. The training graphs are small, so that all the MISs in the graphs can be enumerated easily (Tsukiyama et al., 1977; Csardi & Nepusz, 2006) . Each MIS in a training graph is used as a training instance. The features and class label of a MIS are computed as above. After the training set is constructed, an off-the-shelf classification algorithm can be trained to classify optimal and non-optimal MISs (i.e., classifying whether a MIS belongs to an optimal solution or not). We will test multiple classification algorithms including K-Nearest Neighbor (KNN) (Cover & Hart, 1967) , Decision Tree (DT) (Breiman et al.; Loh, 2011) , and Support Vector Machine (SVM) (Boser et al., 1992; Cortes & Vapnik, 1995) in our experiments. Given an unseen test problem instance, if all the MISs in the corresponding graph can be enumerated upfront, our ML model can be used as a problem reduction technique to prune the search space of the problem (Lauri & Dutta, 2019; Sun et al., 2021b) . However, the number of MISs in a test graph may be exponentially large, and hence it is often impossible to enumerate all MISs upfront, especially for large graphs. To tackle this, we develop a search method in the next section to generate a subset of high-quality MISs (without the need of listing all MISs), guided by our ML model.
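As a dependency-free stand-in for the off-the-shelf classifiers named above (the paper's actual experiments use scikit-learn, LIBSVM and LIBLINEAR), a tiny k-NN over labeled feature vectors illustrates the classification step:

```python
import math

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest labeled feature vectors.

    train: list of (feature_tuple, label) pairs; query: feature tuple.
    """
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```

In the paper's setting, each feature tuple would be the fifteen features of a MIS and the label would indicate membership in an optimal solution.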

4. GENERATING COLUMNS

This section describes our MLCG method. Our MLCG method starts with a randomly generated subset of columns (i.e., MISs) $\tilde{\mathcal{S}}$, which contains at least one feasible solution. It then goes through $n_{it}$ iterations; at each iteration, it computes the features (designed in Section 3.1) for each MIS $S \in \tilde{\mathcal{S}}$ and uses the trained ML model to evaluate the quality of each MIS. The low-quality MISs in $\tilde{\mathcal{S}}$ predicted by the ML model are replaced by newly generated ones. Finally, the MISs remaining in the subset $\tilde{\mathcal{S}}$ are returned. In this sense, our MLCG method is significantly different from the traditional CG methods that typically generate columns by repeatedly solving a pricing problem. Initializing a subset of MISs. The initial subset of MISs $\tilde{\mathcal{S}}$ is generated randomly. To increase the diversity of $\tilde{\mathcal{S}}$, we generate an equal number of MISs starting from each vertex $v \in V$. This also ensures that $\tilde{\mathcal{S}}$ contains at least one feasible solution, since all vertices in V can be covered by $\tilde{\mathcal{S}}$. To generate a MIS starting from a vertex v, we initialize the set S as {v} and the candidate vertex set C as the set of vertices excluding v and v's neighbors. The set S is then randomly expanded to a MIS. In each step of the expansion, we randomly select a vertex $v_s$ from C and add $v_s$ into S. The candidate vertex set C is then updated by removing $v_s$ and $v_s$'s neighbors (by the definition of an independent set). This process is repeated until the candidate vertex set C is empty. Evaluating the quality of MISs. We use the ML model developed in Section 3 to evaluate the quality of the MISs in the subset $\tilde{\mathcal{S}}$. To do this, we first extract the features designed in Section 3.1 for each MIS in $\tilde{\mathcal{S}}$. Note that we do not need to enumerate all the MISs in a graph, because the features can be computed based only on the subset $\tilde{\mathcal{S}}$. The ML model should be computationally efficient, as the time used in prediction is counted as part of the total run time of solving the problem.
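The random expansion used in the initialization step can be sketched as follows (the function name is ours; `adj` is the graph's adjacency as a list of neighbor sets):

```python
import random

def random_mis(start, adj, rng=random):
    """Grow a maximal independent set from the vertex `start`:
    repeatedly add a random candidate and discard its neighbors."""
    S = {start}
    cand = set(range(len(adj))) - {start} - adj[start]
    while cand:
        v = rng.choice(sorted(cand))
        S.add(v)
        cand -= {v} | adj[v]  # v's neighbors can no longer be added
    return S
```

On the path graph 0-1-2, starting from vertex 0 always yields {0, 2}, and starting from vertex 1 yields {1}; the result is maximal by construction because the loop runs until no candidate remains.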
Therefore, we only consider linear SVM and DT here, since KNN and non-linear SVM are slow in our case, where the number of training instances is large. Let w be the vector of optimal weights learned by linear SVM. The quality score predicted by linear SVM for a given MIS with a feature vector f can be computed as $p_{SVM}(f) = \sum_{i=1}^{|f|} w_i f_i$. Generally, a MIS with a larger quality score is of better quality. For a trained DT, we use the proportion of positive training instances in a leaf node as an indication of the quality of the MISs belonging to that leaf node. The quality score of a MIS with a feature vector f can be computed as $p_{DT}(f) = n_1(f)/(n_0(f) + n_1(f))$, where $n_1(f)$ and $n_0(f)$ are the numbers of positive and negative training instances in the leaf node that f belongs to. The score $p_{DT}$ is in the range [0, 1], with 1 indicating the best quality predicted by DT. Updating the MIS subset. We rank the MISs in $\tilde{\mathcal{S}}$ based on their quality scores predicted by the ML model. The top κ percent of MISs are kept, and the remaining low-quality MISs are replaced by newly generated MISs. We investigate the following two methods to generate new MISs. The first method is a random approach. To replace a low-quality MIS $S^*$ whose first vertex is $v^*$, the random approach starts from the same vertex $v^*$ and randomly expands it to a MIS. The random approach has the merit of maintaining the diversity of the MIS subset $\tilde{\mathcal{S}}$, but on the other hand it may require many iterations to generate high-quality MISs. The second method is a crossover approach inspired by the Genetic Algorithm (Goldberg, 1989), which randomly selects two high-quality MISs to create a new one. More specifically, it initializes the set S as the common vertices of the two selected high-quality MISs, and then randomly expands the set S to a MIS. Generating an integer solution.
With the subset of MISs obtained, we can use an existing optimization algorithm such as the Gurobi solver to construct a complete integer solution. To be specific, we can form a subproblem by using the subset of MISs generated by our MLCG method as input to the MIP formulation (1)-(3), and solve the subproblem using Gurobi. The size of the subproblem is significantly smaller than that of the original problem, and thus solving the subproblem is considerably easier. In addition, the integer solution generated from the subproblem is likely to be of high quality, as will be shown in our experiments. However, directly solving the subproblem does not provide any optimality guarantee. Alternatively, we can use the subset of MISs generated by our MLCG method to warm-start RMH. Specifically, we use the MISs generated by our method as the initial subset of columns for the restricted master problem, and solve the LP relaxation of (1)-(3) using a traditional CG approach (see Section 2). The optimal LP objective value provides a valid lower bound on the objective value of integer solutions. We then form a subproblem with the columns generated by our MLCG method and the traditional CG method, and solve the subproblem using an existing algorithm. By doing this, we can obtain a high-quality integer solution with a valid optimality gap. We will evaluate the efficacy of our MLCG method for seeding RMH in our experiments.
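Tying the loop together, a sketch under our own naming: `update_pool` implements one keep/replace iteration, where `score` stands in for the learned model (e.g. $f \mapsto w \cdot f$ for linear SVM) and `regenerate` is a hypothetical hook that builds a replacement MIS; `greedy_cover` is a solver-free stand-in for solving the sub-MIP over the selected columns (the paper uses Gurobi, so this greedy cover is only illustrative and gives a feasible, not necessarily optimal, coloring):

```python
def update_pool(pool, score, kappa=0.5, regenerate=None):
    """One MLCG iteration: keep the top `kappa` fraction of MISs by
    predicted quality; replace the rest via `regenerate` if given."""
    ranked = sorted(pool, key=score, reverse=True)
    keep = ranked[:int(len(ranked) * kappa)]
    if regenerate is None:
        return keep
    return keep + [regenerate(s) for s in ranked[len(keep):]]

def greedy_cover(n, columns):
    """Cover vertices 0..n-1 using the selected columns (MISs),
    always taking the column covering most uncovered vertices."""
    uncovered, chosen = set(range(n)), []
    while uncovered:
        best = max(columns, key=lambda s: len(s & uncovered))
        chosen.append(best)
        uncovered -= best
    return chosen
```

The number of columns chosen by the cover upper-bounds the number of colors obtainable from this column subset.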

5. EXPERIMENTS

This section presents the experimental results to show the efficacy of our MLCG method in generating columns for the VCP. Our algorithms are implemented in C++, and the source code is publicly available at https://github.com/yuansuny/mlcg.git. Our experiments are conducted on a high performance computing server with multiple CPUs @2.70GHz, each with 4GB memory.

5.1. PREDICTION ACCURACY

We use the first 100 problem instances (g001-g100) from MATILDA (Smith-Miles & Bowly, 2015) to construct a training set. These graphs are evolved using a Genetic Algorithm to acquire controllable characteristics such as density, algebraic connectivity, energy, and the standard deviation of the eigenvalues of the adjacency matrix. For each instance (|V| = 100), we use the algorithm implemented in the igraph library (Tsukiyama et al., 1977; Csardi & Nepusz, 2006) to enumerate all MISs ($\mathcal{S}$) in the corresponding graph. The Gurobi solver with the default parameter setting, 4 CPUs and 16GB memory is then used to compute the optimal solution value (i.e., class label) for each $S \in \mathcal{S}$. The constructed training set consists of 94,553 positive (class label 1) and 94,093 negative (class label 0) training instances (i.e., MISs). The training set is reasonably balanced, indicating that there often exist multiple optimal solutions in the VCP. The distribution of the training instances in the 2D feature space created by PCA is presented in Appendix B.1. We test multiple classifiers for this classification task, including KNN (k = 3), DT (depth = 20), linear SVM (C = 1) and nonlinear SVM with the RBF kernel (C = 1000). The implementations of KNN and DT are from the scikit-learn library (Pedregosa et al., 2011); the nonlinear SVM is from the LIBSVM library (Chang & Lin, 2011); and the linear SVM is from the LIBLINEAR library (Fan et al., 2008). The parameters of these classification algorithms are tuned by hand. All the algorithms and datasets used are publicly available. We train each classification algorithm on the training set with fifteen features, and the ten-fold cross-validation accuracies of KNN, DT, linear SVM, and nonlinear SVM are 91%, 90%, 78% and 85%, respectively. The correlation between each feature and the class label, the feature weights learned by the linear SVM, and a trained simple DT are presented and discussed in Appendix B.2.

5.2. THE EFFICACY OF GENERATING MISS

Test instances. We use 33 problem instances from MATILDA (Smith-Miles & Bowly, 2015) as our test instances, which cannot be optimally solved by Gurobi (with 4 CPUs) in 10 seconds. The number of vertices in each of these graphs is 100. For each test instance, we use our MLCG method to generate a subset of MISs, and use Gurobi to solve a sub-problem formed by the generated MISs. The quality of the best integer solution found for the sub-problem is an indication of the quality of the MISs generated by our MLCG method. Parameter setting. The size of the MIS subset $|\tilde{\mathcal{S}}|$ is set to 20|V|, where |V| is the number of vertices in a graph. In general, a larger $|\tilde{\mathcal{S}}|$ leads to a better integer solution, but at the expense of a longer runtime. The number of iterations $n_{it}$ is set to 10 for now, and will be tuned later. In each iteration of our MLCG method, 50% of the MISs are replaced: κ = 50. Linear SVM versus Decision Tree. We test two efficient classification algorithms, linear SVM and DT, for evaluating the quality of MISs. The average optimality gap generated by our MLCG method with linear SVM is 2.49%, significantly better than the 7.95% obtained with DT (see Appendix B.3). This indicates that linear SVM is more effective than DT in evaluating the quality of MISs, which is somewhat surprising because the classification accuracy of DT is much higher than that of linear SVM. In the rest of our experiments, we only use linear SVM to evaluate MISs. Random update versus crossover update. We test two approaches (random and crossover) for generating new MISs to replace low-quality MISs in each iteration of our algorithm. The two approaches perform similarly, with an average optimality gap of 2.49% for the random approach and 2.68% for the crossover approach (see Appendix B.4). The reason why the crossover approach is not more effective may be that the diversity of the generated MISs is important.
In particular, having more high-quality but similar MISs in the subset is not expected to result in a better integer solution. In the rest of the paper, we simply use the random approach to generate new MISs.

Number of iterations. We vary the number of iterations ($n_{it}$) from 1 to 100 and record the optimality gap generated by our MLCG method. The solution quality generated by our method improves dramatically in the first several iterations, and the improvement slows down later on. The smallest optimality gap is reached at around the 50th iteration and cannot be further reduced afterwards (see Appendix B.5 for the detailed results). Hence, we set $n_{it}$ = 50 hereafter. Baselines. We compare our MLCG method to four baselines for generating columns: 1) Random, which randomly generates 10 MISs starting from each vertex in a graph, resulting in 10|V| MISs in total (the same number as our method: 20|V| × 50%); 2) RC, which replaces linear SVM with the simple criterion of using the reduced cost to evaluate MISs; 3) MLF, which enumerates all MISs in a graph and uses linear SVM to select the top 10 MISs for each vertex, thus 10|V| in total; and 4) Full, which simply enumerates and uses all MISs in a graph. The MISs generated by each method are used to construct a (sub-)problem, which is then solved by Gurobi with 4 CPUs and a one-hour cutoff time. Comparison results. The optimality gap (with respect to the lower bound produced by Full) of each algorithm on each test problem instance, along with the runtime of the algorithms, is presented in Table 1. The results are averaged over 25 independent runs. Comparing our MLCG method with Full (which is an exact approach), we can clearly see that MLCG consistently generates an optimal or near-optimal solution in much less runtime. On average, our MLCG method reduces the runtime of Full by 95%, while only increasing the optimality gap from 0.91% to 1.87%. Notably, on g243, the full problem cannot be solved to optimality by Gurobi within the one-hour cutoff time. In contrast, using the columns generated by our MLCG method to construct a sub-problem, Gurobi is able to find a better integer solution in only 33.63 seconds.
Compared to Random, our MLCG method consistently generates a better solution for the test problem instances, except on g496. The average optimality gap of our MLCG method is about 5 times smaller than that of Random. Furthermore, our MLCG method significantly outperforms the RC method in terms of the optimality gap (p-value = 0.0169). This clearly demonstrates that our ML model has learned a more useful rule than the reduced cost for evaluating the quality of MISs. Surprisingly, the MLF method does not perform as well as the MLCG method in terms of the optimality gap. The reason for this is that the aim of CG is to select a diverse set of high-quality MISs that covers each vertex at least once. The ML model can evaluate the quality of individual MISs, but it cannot measure the diversity of the MISs selected. In the MLF approach, where all MISs are listed, multiple high-quality but similar MISs (i.e., sharing many common vertices) are likely to be selected. Although those MISs are individually of high quality, the combined set lacks diversity. Hence, the MLF method does not perform as well as the MLCG method.

5.3. SEEDING RESTRICTED MASTER HEURISTIC

We use the columns generated by our MLCG method to warm-start RMH (the resulting method is referred to as MLRMH), and compare against RMH with the typical random initialization. The number of initial MISs is set to $10|V|$. The restricted master problem and the pricing problem are both solved by Gurobi. Apart from the 33 MATILDA problem instances used above, we include 30 other MATILDA problem instances, in which the number of MISs is so large that they cannot be enumerated with 16GB memory. In addition, we evaluate the generalization of our method on a set of larger DIMACS problem instances, in which the number of vertices varies from 125 to 1000. The experimental results are summarized in Table 2, and the detailed experimental settings and results are given in Appendix B.6. Note that the optimal LP objective is a valid lower bound (LB) for the objective value of integer solutions, and is used to compute the optimality gap. The LB generated by MLRMH (or RMH) is very tight for most of the problem instances tested. Our MLRMH algorithm is able to prove optimality for many of the problem instances and provide an optimality gap for the remainder, making it a quasi-exact method. The average optimality gap generated by MLRMH is significantly smaller than that of RMH. Interestingly, by using the columns generated by our MLCG method to warm-start RMH, the number of CG iterations is substantially reduced. This also leads to a significant reduction in the runtime of solving the LP. Note that the time used by our MLCG method to generate columns is counted as part of the LP solving time. Our method is not confined to solving problem instances generated from a similar distribution.
In fact, the MATILDA graphs we used are evolved using a Genetic Algorithm to exhibit a variety of characteristics, such as density, algebraic connectivity, energy, and the standard deviation of the eigenvalues of the adjacency matrix (Smith-Miles & Bowly, 2015), while the DIMACS graphs are a collection of hard problem instances from multiple sources with very different characteristics and sizes. Our experimental results show that our method, trained on a subset of the MATILDA graphs, performs well across both the MATILDA and DIMACS graphs.
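The graph characteristics named above (density, algebraic connectivity, energy, and the standard deviation of the adjacency eigenvalues) are all counting or spectral quantities computable directly from the adjacency matrix. The following is a minimal numpy sketch; the function name is ours, not part of the MATILDA tooling:

```python
import numpy as np

def graph_characteristics(adj):
    """Compute the graph characteristics mentioned above from a 0/1 adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    density = adj.sum() / (n * (n - 1))            # fraction of possible edges present
    eig_a = np.linalg.eigvalsh(adj)                # adjacency spectrum (symmetric matrix)
    energy = np.abs(eig_a).sum()                   # graph energy: sum of |eigenvalues|
    eig_std = eig_a.std()                          # std of adjacency eigenvalues
    lap = np.diag(adj.sum(axis=1)) - adj           # combinatorial Laplacian D - A
    eig_l = np.linalg.eigvalsh(lap)                # ascending order
    algebraic_connectivity = eig_l[1]              # second-smallest Laplacian eigenvalue
    return density, algebraic_connectivity, energy, eig_std
```

For example, on the complete graph $K_3$ (adjacency eigenvalues 2, -1, -1) the density is 1, the algebraic connectivity is 3, and the energy is 4.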

6. CONCLUSION

In this paper, we have developed an effective CG method based on ML for the VCP. We defined a column as a Maximal Independent Set whose vertices can share the same color in a feasible solution, and developed an ML model to predict which columns belong to an optimal solution. Novel features were designed to characterize a column, including features computed from the graph data, sample solutions and the LP relaxation. We then used the ML model to evaluate the quality of columns, and incorporated it into a search method to generate a subset of high-quality columns without the need to enumerate all columns of a problem instance. We empirically showed that our ML model achieved a high accuracy in predicting which columns are optimal on the datasets considered. Furthermore, we showed that the columns generated by our method were significantly better than those generated randomly or via the reduced cost. Finally, we showed that the columns generated by our method can be used as a warm start to boost the performance of a CG-based heuristic.

Our proposed MLCG method is generic. Although we have only demonstrated the efficacy of this approach on the VCP, the same idea can be applied to other combinatorial optimization problems. Taking the vehicle routing problem as an example, in which a column is a resource-constrained shortest path, an ML model can be built to predict which paths are likely to be part of the optimal solution. This will require the design of problem-specific features to characterize a path and a specialized sampling method to efficiently generate paths. Our paper has focused on using CG as part of a heuristic approach, where only a restricted master problem is solved to obtain a final integer solution, as is commonly done in papers applying CG to large-scale problems.
However, the same approach could readily be applied in the context of an exact branch-and-price (B&P) method, where CG would be driven predominantly by the ML approach, with reduced-cost based pricing of columns only performed at the end to prove optimality of the LP at each node. Our results indicate that such an ML-boosted approach may have a positive impact on the performance of B&P as well. However, a rigorous investigation of this is beyond the scope of the current paper.

A APPENDIX

A.1 AN EFFICIENT RANDOM SAMPLING METHOD

We describe an efficient random sampling method for generating sample solutions to the VCP in Algorithm A.1. The inputs to the algorithm are the graph $G(V, E)$ and, for each $v \in V$, the set $S_v$ of MISs containing vertex $v$. The output is a set of MISs that covers every vertex in $V$ at least once (i.e., a feasible solution, denoted $S_x$). The random sampling method starts by initializing the solution $S_x$ as empty and marking each vertex $v \in V$ as not covered. It then iterates through each vertex $v \in V$; if $v$ has not been covered, it randomly selects a MIS $S$ from $S_v$ and adds $S$ to the solution $S_x$. The vertices in $S$ are then marked as covered. A feasible solution $S_x$ is obtained when all vertices in $V$ have been covered.

Algorithm A.1 Random Sampling
Input: graph $G(V, E)$; MIS sets $S_v$ for each $v \in V$.
Output: a sample solution $S_x$.
1. Initialize $S_x \leftarrow \{\}$.
2. Initialize $covered[v] \leftarrow 0$ for each $v \in V$.
3. for each $v \in V$ do
4.   if $covered[v] = 0$ then
5.     randomly select a MIS $S$ from $S_v$;
6.     add $S$ to $S_x$;
7.     for each $\tilde{v} \in S$ do $covered[\tilde{v}] \leftarrow 1$.

Proof. In the best case, the MISs selected do not share any common vertex; in other words, each vertex in the graph is covered exactly once. In this case, the number of basic operations (e.g., comparisons and assignments) performed in Algorithm A.1 is about $3|V|$, which is in $O(|V|)$. In contrast, the MISs selected may share many common vertices, so that a vertex may be covered multiple times. In the worst case, each MIS selected in Algorithm A.1 covers only one new vertex that has not yet been covered (with the other vertices in the MIS already covered). In this case, the number of assignments performed in the inner loop of Algorithm A.1 is $\sum_{i=1}^{|V|} |S_i|$.

We run Algorithm A.1 $n$ times to generate $n$ sample solutions $\{S_x^1, S_x^2, \cdots, S_x^n\}$. In each run, we randomly permute the vertex set $V$ to increase the diversity of the generated solutions.
The sample size $n$ should be large enough so that each MIS in $\bar{S}$ is expected to be sampled at least once. In our experiments, we set $n = |\bar{S}|$.
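As a concrete illustration, Algorithm A.1 can be sketched in a few lines of Python (names are ours; `mis_sets[v]` plays the role of $S_v$):

```python
import random

def random_sampling(vertices, mis_sets, rng=random):
    """One run of Algorithm A.1: greedily cover all vertices with randomly chosen MISs.

    vertices -- iterable of vertex ids
    mis_sets -- dict mapping each vertex v to the list of MISs containing v
    Returns a list of MISs (sets of vertices) covering every vertex at least once.
    """
    order = list(vertices)
    rng.shuffle(order)                      # random permutation increases sample diversity
    covered = {v: False for v in order}
    solution = []
    for v in order:
        if not covered[v]:
            mis = rng.choice(mis_sets[v])   # pick a random MIS containing v
            solution.append(mis)
            for u in mis:                   # mark every vertex of the MIS as covered
                covered[u] = True
    return solution
```

For the path graph 1-2-3 with MISs {1,3} and {2}, any run returns exactly those two MISs, covering all three vertices.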

A.2 AN EFFICIENT METHOD FOR COMPUTING STATISTICAL FEATURES

We first describe the statistical features in more detail. Let $x \in \{0,1\}^{|\bar{S}|}$ be the binary string representation of a sample solution, where the binary variable $x_S = 1$ if and only if the corresponding MIS $S$ is in the solution, for each $S \in \bar{S}$. The $n$ sample solutions can be denoted as $\{x^1, x^2, \cdots, x^n\}$, where $x_S^i = 1$ if and only if $S$ is in the $i$th sample solution. Let $y = \{y^1, y^2, \cdots, y^n\}$ be the objective values of the $n$ solutions.

The first statistical feature is an objective-based measure, which accumulates the 'payoff' of using $S$ to construct solutions in terms of the objective values: $f_{obm}(S) = \sum_{i=1}^{n} x_S^i / y^i$. Because vertex coloring is a minimization problem, a MIS that frequently appears in high-quality sample solutions (with smaller objective values) is expected to have a larger accumulated score. As the scale of the objective-based score is sensitive to the scale of the objective values and the sample size, we normalize it by the maximum objective-based score in $\bar{S}$: $f_{10}(S) = f_{obm}(S) / \max_{S' \in \bar{S}} f_{obm}(S')$.

Let $r = \{r^1, r^2, \cdots, r^n\}$ denote the ranking of the $n$ sample solutions in terms of their objective values. The next statistical feature is the ranking-based measure originally proposed by Sun et al. (2021b): $f_{rbm}(S) = \sum_{i=1}^{n} x_S^i / r^i$. If $S$ frequently appears in high-quality sample solutions (with a smaller rank), it is more likely to have a larger ranking-based score. The ranking-based score is normalized by the maximum ranking-based score in $\bar{S}$: $f_{11}(S) = f_{rbm}(S) / \max_{S' \in \bar{S}} f_{rbm}(S')$.

The Pearson correlation coefficient between the values of the binary decision variable $x_S$ and the objective values $y$ over the sample solutions can also be used to quantify the expected benefit of using $S$ to construct solutions:
$$f_{12}(S) = \frac{\sum_{i=1}^{n} (x_S^i - \bar{x}_S)(y^i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_S^i - \bar{x}_S)^2} \sqrt{\sum_{i=1}^{n} (y^i - \bar{y})^2}},$$
where $\bar{x}_S = \sum_{i=1}^{n} x_S^i / n$ and $\bar{y} = \sum_{i=1}^{n} y^i / n$.
If $x_S$ is highly negatively correlated with $y$, the sample solutions containing $S$ generally have smaller objective values than those without $S$; thus, using $S$ to construct solutions is expected to result in a smaller objective value. The last statistical measure computes the Pearson correlation coefficient between the values of the binary decision variable $x_S$ and the ranking of the sample solutions ($r$):
$$f_{13}(S) = \frac{\sum_{i=1}^{n} (x_S^i - \bar{x}_S)(r^i - \bar{r})}{\sqrt{\sum_{i=1}^{n} (x_S^i - \bar{x}_S)^2} \sqrt{\sum_{i=1}^{n} (r^i - \bar{r})^2}},$$
where $\bar{x}_S = \sum_{i=1}^{n} x_S^i / n$ and $\bar{r} = \sum_{i=1}^{n} r^i / n$. This ranking-based correlation score is equivalent to Spearman's rank correlation coefficient between $x_S$ and $y$, as the value of $x_S^i$ can be interpreted as its rank (considering tied ranks). A MIS with a highly negative ranking-based correlation score is more likely to appear in high-quality sample solutions.

Computing the statistical features from the binary string representation of the sample solutions ($x$) takes $O(n|\bar{S}|)$ time, which is computationally slow when the number of samples ($n$) and the number of MISs ($|\bar{S}|$) are large. Exploiting the fact that $x$ is a vector of binary variables, we introduce an efficient method to compute the statistical features in the following.

Lemma A.2 (Sun et al. (2021b)). For a binary variable $x_S^i$, the following equality holds: $\sigma_S^x = \sum_{i=1}^{n} (x_S^i - \bar{x}_S)^2 = n\bar{x}_S(1 - \bar{x}_S)$, where $\bar{x}_S = \sum_{i=1}^{n} x_S^i / n$.

Proof. Here, we provide a different proof of the Lemma than Sun et al. (2021b): $\sigma_S^x = \sum_{i=1}^{n} (x_S^i - \bar{x}_S)^2 = \sum_{i=1}^{n} (x_S^i)^2 - n\bar{x}_S^2$. As $x_S^i$ is binary, $\sum_{i=1}^{n} (x_S^i)^2 = \sum_{i=1}^{n} x_S^i = n\bar{x}_S$. Hence, $\sigma_S^x = n\bar{x}_S - n\bar{x}_S^2 = n\bar{x}_S(1 - \bar{x}_S)$.

Algorithm A.2 Computing Statistical Features
Input: sample solutions $\{S_x^1, \cdots, S_x^n\}$; objective values $\{y^1, \cdots, y^n\}$; rankings $\{r^1, \cdots, r^n\}$; MIS subset $\bar{S}$.
Output: statistical features $f_{obm}(S)$, $f_{rbm}(S)$, $f_{12}(S)$ and $f_{13}(S)$, for each $S \in \bar{S}$.
1. Initialize $f_{obm}(S) \leftarrow 0$, $f_{rbm}(S) \leftarrow 0$ for each $S \in \bar{S}$.
2. Initialize $\bar{x}_S \leftarrow 0$, $\sigma_S^{xy} \leftarrow 0$, $\sigma_S^{xr} \leftarrow 0$ for each $S \in \bar{S}$.
3. Compute the mean objective value: $\bar{y} \leftarrow \sum_{i=1}^{n} y^i / n$.
4. Compute the mean objective ranking: $\bar{r} \leftarrow (1 + n)/2$.
5. for $i = 1$ to $n$ do
6.   for each $S \in S_x^i$ do
7.     $f_{obm}(S) \leftarrow f_{obm}(S) + 1/y^i$; $f_{rbm}(S) \leftarrow f_{rbm}(S) + 1/r^i$;
8.     $\bar{x}_S \leftarrow \bar{x}_S + 1/n$; $\sigma_S^{xy} \leftarrow \sigma_S^{xy} + (y^i - \bar{y})$; $\sigma_S^{xr} \leftarrow \sigma_S^{xr} + (r^i - \bar{r})$.
9. Compute the objective variance: $\sigma^y \leftarrow \sum_{i=1}^{n} (y^i - \bar{y})^2$.
10. Compute the ranking variance: $\sigma^r \leftarrow n(n+1)(n-1)/12$.
11. for each $S \in \bar{S}$ do
12.   $\sigma_S^x \leftarrow \bar{x}_S(1 - \bar{x}_S)n$;
13.   $f_{12}(S) \leftarrow \sigma_S^{xy} / \sqrt{\sigma_S^x \sigma^y}$; $f_{13}(S) \leftarrow \sigma_S^{xr} / \sqrt{\sigma_S^x \sigma^r}$.

Lemma A.3. For a binary variable $x_S^i$ and any variable $y^i$, the following equality holds: $\sigma_S^{xy} = \sum_{i=1}^{n} (x_S^i - \bar{x}_S)(y^i - \bar{y}) = \sum_{i \in \{1,\cdots,n\} \wedge x_S^i = 1} (y^i - \bar{y})$, where $\bar{x}_S = \sum_{i=1}^{n} x_S^i / n$ and $\bar{y} = \sum_{i=1}^{n} y^i / n$.

Proof. Using the fact that $\sum_{i=1}^{n} (y^i - \bar{y}) = 0$, we have $\sigma_S^{xy} = \sum_{i=1}^{n} (x_S^i - \bar{x}_S)(y^i - \bar{y}) = \sum_{i=1}^{n} x_S^i (y^i - \bar{y}) - \bar{x}_S \sum_{i=1}^{n} (y^i - \bar{y}) = \sum_{i=1}^{n} x_S^i (y^i - \bar{y})$. As $x_S^i$ is binary, $\sigma_S^{xy} = \sum_{i \in \{1,\cdots,n\} \wedge x_S^i = 1} 1 \cdot (y^i - \bar{y}) + \sum_{i \in \{1,\cdots,n\} \wedge x_S^i = 0} 0 \cdot (y^i - \bar{y}) = \sum_{i \in \{1,\cdots,n\} \wedge x_S^i = 1} (y^i - \bar{y})$.

Lemma A.3 holds for any variable $y^i$. Replacing $y^i$ with $r^i$ in Lemma A.3, the covariance between the values of $x_S$ and the ranking of the objective values $r$ can be computed as $\sigma_S^{xr} = \sum_{i=1}^{n} (x_S^i - \bar{x}_S)(r^i - \bar{r}) = \sum_{i \in \{1,\cdots,n\} \wedge x_S^i = 1} (r^i - \bar{r})$. Furthermore, as the rankings $r$ are a permutation of $1$ to $n$, it is not difficult to show that $\bar{r} = \sum_{i=1}^{n} r^i / n = (n+1)/2$ and $\sigma^r = \sum_{i=1}^{n} (r^i - \bar{r})^2 = n(n+1)(n-1)/12$. With this simplification, the statistical features can be computed efficiently using the set representation of the sample solutions $\{S_x^1, \cdots, S_x^n\}$.
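A minimal Python sketch of Algorithm A.2, using the simplifications of Lemmas A.2 and A.3 (function and variable names are ours; for simplicity, ties in objective values receive distinct ranks rather than the averaged tied ranks mentioned above):

```python
import math

def statistical_features(solutions, y, all_mis):
    """One-pass computation of f_obm, f_rbm, f_12 and f_13 per Algorithm A.2.

    solutions -- list of n solutions, each a set of MIS ids (set representation)
    y         -- list of n objective values (solution i has objective y[i])
    all_mis   -- iterable of all MIS ids
    """
    n = len(solutions)
    order = sorted(range(n), key=lambda i: y[i])   # rank 1 = smallest objective
    r = [0.0] * n
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    y_bar = sum(y) / n
    r_bar = (n + 1) / 2                            # mean of ranks 1..n
    f_obm = {S: 0.0 for S in all_mis}
    f_rbm = {S: 0.0 for S in all_mis}
    x_bar = {S: 0.0 for S in all_mis}
    s_xy = {S: 0.0 for S in all_mis}
    s_xr = {S: 0.0 for S in all_mis}
    for i, sol in enumerate(solutions):            # single pass over sample solutions
        for S in sol:
            f_obm[S] += 1.0 / y[i]
            f_rbm[S] += 1.0 / r[i]
            x_bar[S] += 1.0 / n
            s_xy[S] += y[i] - y_bar                # Lemma A.3: only terms with x_S^i = 1
            s_xr[S] += r[i] - r_bar
    s_y = sum((yi - y_bar) ** 2 for yi in y)
    s_r = n * (n + 1) * (n - 1) / 12.0             # closed form for ranks 1..n
    f12, f13 = {}, {}
    for S in all_mis:
        s_x = n * x_bar[S] * (1.0 - x_bar[S])      # Lemma A.2
        if s_x * s_y == 0 or s_x * s_r == 0:       # correlation undefined: report 0
            f12[S] = f13[S] = 0.0
        else:
            f12[S] = s_xy[S] / math.sqrt(s_x * s_y)
            f13[S] = s_xr[S] / math.sqrt(s_x * s_r)
    return f_obm, f_rbm, f12, f13
```

The loop touches each MIS of each sample solution once, matching the $O(|\bar{S}| + \sum_{i=1}^{n} |S_x^i|)$ complexity claimed for Algorithm A.2, and the results agree with the direct $O(n|\bar{S}|)$ definitions of the features.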

A.3 COMPUTING CLASS LABELS

We present in Algorithm A.3 a "brute-force" approach to compute, for each MIS in a graph, whether it belongs to any optimal solution. The algorithm starts by solving a given problem instance to optimality using the exact solver Gurobi, obtaining an optimal solution $x^*$ and the optimal objective value $y^*$. The algorithm then iterates through each $S \in \bar{S}$. If $S$ does not belong to any optimal solution found so far, $x_S$ is fixed to 1, and the corresponding subproblem is solved by Gurobi. If the best objective value found ($\hat{y}$) is equal to $y^*$, there exists at least one alternative optimal solution that includes $S$; hence, all the MISs used in the newly generated optimal solution (denoted $x$) are marked as optimal. This approach may not be the most efficient way to compute the class labels for the MISs in a graph. There might exist better ways to compute all optimal solutions of a problem instance; e.g., after an optimal solution is found, a cut can be generated to prune that solution from the search space, so that a different optimal solution is generated one at a time until all have been enumerated. However, as training is conducted offline, Algorithm A.3 is adequate for our purpose.
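Algorithm A.3 relies on Gurobi to re-solve the problem with $x_S$ fixed to 1. As a solver-free illustration of the same labeling idea, the sketch below enumerates covers by increasing size on a tiny instance and marks every MIS that appears in some minimum-size cover (function name ours; tractable only for very small graphs):

```python
from itertools import combinations

def optimal_mis_labels(all_mis, vertices):
    """Label each MIS with 1 if it belongs to at least one optimal solution.

    A solver-free stand-in for Algorithm A.3: instead of re-solving a MIP with
    x_S fixed to 1, enumerate covers by increasing size and collect every MIS
    that appears in some minimum-size cover of the vertex set.
    """
    vertices = set(vertices)
    for k in range(1, len(all_mis) + 1):
        optima = [combo for combo in combinations(all_mis, k)
                  if set().union(*combo) == vertices]   # combo covers every vertex
        if optima:                                      # k is the optimal colour count
            positive = {S for combo in optima for S in combo}
            return {S: int(S in positive) for S in all_mis}, k
    raise ValueError("the given MISs do not cover all vertices")
```

For the path 1-2-3-4 with MISs {1,3}, {1,4} and {2,4}, the unique optimal cover is {{1,3}, {2,4}}, so {1,4} is labeled 0 and the other two MISs are labeled 1.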

B.2 FEATURE IMPORTANCE

To investigate the relevance of the features, we compute the Pearson correlation coefficient between each feature and the class label, which is shown in Table B.1. The reduced cost ($f_{15}$) has the highest correlation with the class label, indicating it is one of the most important features. The size of a MIS ($f_1$) and the maximum, minimum and average relative sizes of a MIS ($f_2$ to $f_4$) have a noticeable correlation with the class label. As expected, the objective-based measure ($f_{10}$) and the ranking-based measure ($f_{11}$) are positively correlated with the class label, whilst the two correlation-based measures ($f_{12}$ and $f_{13}$) are negatively correlated with the class label. We also compute the mutual information between each feature and the class label to capture any non-linear relationships between them. Since mutual information can only be computed for discrete variables, we discretize the values of each feature into five bins. The mutual information is normalized by the minimum entropy of the corresponding feature and the class label, so that it lies between 0 and 1. The LP features ($f_{14}$ and $f_{15}$) have the highest non-linear correlation with the class label, as shown in Table B.1. In addition, we present in Table B.1 the optimal weights ($w$) of the features learned by the linear SVM classifier, which can be directly used to evaluate the quality of MISs. Finally, we investigate the trained Decision Tree (DT) to identify which features are important. When the depth of the tree is set to 5, DT achieves an accuracy of 80% using all features except the min relative size ($f_3$), obj correlation ($f_{12}$) and rank correlation ($f_{13}$). When the depth of the tree is set to 3, DT achieves an accuracy of 78% using three features: reduced cost ($f_{15}$), max degree ($f_6$) and min degree ($f_7$). Hence, the reduced cost ($f_{15}$), max degree ($f_6$) and min degree ($f_7$) are the most important features identified by DT. The detailed results for the test problem instances are presented in Table B.3.
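The normalized mutual information described above (equal-width discretization into five bins, normalization by the smaller marginal entropy) can be sketched as follows. This is a hypothetical helper, not the paper's code; degenerate inputs with a single class are not handled:

```python
import numpy as np

def normalized_mutual_information(feature, label, bins=5):
    """NMI between a feature and a binary class label, as described above:
    discretize the feature into equal-width bins, compute the mutual information,
    and normalize by the smaller marginal entropy so the score lies in [0, 1]."""
    feature = np.asarray(feature, dtype=float)
    label = np.asarray(label, dtype=int)
    edges = np.linspace(feature.min(), feature.max(), bins + 1)[1:-1]
    binned = np.digitize(feature, edges)            # bin index in 0..bins-1
    joint = np.zeros((bins, 2))                     # joint counts over (bin, class)
    for b, c in zip(binned, label):
        joint[b, c] += 1.0
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)   # marginals

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    nz = p_xy > 0                                   # empty cells contribute zero MI
    mi = float((p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz])).sum())
    return mi / min(entropy(p_x), entropy(p_y))
```

A feature that determines the label perfectly scores 1, while a feature independent of the label scores 0.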
The two approaches perform equally well (without a statistically significant difference), with the average optimality gap of the random approach being 2.49% and that of the crossover approach 2.68%. The reason why the crossover approach is not more effective may be that the diversity of the generated MISs is important. In particular, having more high-quality but similar MISs in the subset is not expected to result in a better integer solution. In the rest of the paper, we simply use the random approach to generate new MISs.

B.5 EFFECTS OF NUMBER OF ITERATIONS

We investigate the effect of the number of iterations ($n_{it}$) on the performance of our MLCG method. To do so, we vary the number of iterations from 1 to 100, and present the average optimality gap generated by our method in Figure B.5. Note that the optimal LP objective is a valid lower bound (LB) for the objective value of integer solutions, and is used to compute the optimality gap in Table B.5. We can observe that the LB generated by MLRMH is very tight for most of the problem instances tested. Significantly, our MLRMH algorithm is able to prove optimality for many of the problem instances and provide an optimality gap for the remainder, making it a quasi-exact method. MLRMH consistently generates an equally good or significantly better solution than RMH for the test problem instances, except on g496. The mean optimality gap generated by MLRMH is much smaller than that of RMH. Interestingly, by using the columns generated by our MLCG method to warm-start RMH, the number of CG iterations is significantly reduced from 69.63 to 7.08 on average. This also leads to a significant reduction in the runtime of solving the LP and MIP. Note that the time used by our MLCG method to generate columns is counted as part of the LP solving time.



Lemma A.1. Given a graph $G(V, E)$, the time complexity of generating a sample solution to the VCP using Algorithm A.1 is $O(|V|)$ in the best case and $O(\alpha|V|)$ in the worst case, where $\alpha = \max_{S \in \bar{S}} |S|$ is the maximum size of the MISs in $\bar{S}$.

where $|S_i|$ is the size of the $i$th selected MIS. As $\alpha$ denotes the maximum size of the MISs in $\bar{S}$, we have $|S_i| \leq \alpha$ and $\sum_{i=1}^{|V|} |S_i| \leq \alpha|V|$. Hence, the worst-case time complexity of Algorithm A.1 is $O(\alpha|V|)$.

To visualize the distribution of the training instances (i.e., MISs), we create a 2-D feature space via principal component analysis in Figure B.1. We can observe that the positive and negative training instances are separated to some extent: most of the negative training instances are located in the left region, while the positive training instances lie mainly in the middle of the reduced feature space.
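A 2-D projection like the one in Figure B.1 can be obtained with a few lines of numpy. The sketch below (function name ours) centers the feature matrix and projects onto the top two right singular vectors, whose scores correspond to $Z_1$ and $Z_2$:

```python
import numpy as np

def pca_2d(features):
    """Project feature vectors onto their first two principal components (Z1, Z2)."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                     # center each feature column
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                        # scores on the top-2 components
```

For production use one would typically also standardize each feature to unit variance before the SVD, since the MIS features live on very different scales.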

Figure B.1: The distribution of training instances in the reduced feature space, where Z_1 and Z_2 are the first two principal components of the feature vectors. Each dot represents a training instance.

Table 1: The experimental results generated by each algorithm for solving the test problem instances. The last row presents the p-value of the t-tests between the results generated by our MLCG method and each of the other methods. The ones with statistical significance (p-value < 0.05) are in bold.

Table 2: The comparison between the MLRMH and RMH algorithms for solving the three sets of test problem instances. The results are averaged across the instances in each problem set.

The idea is to iterate through each $S$ in the sample solutions $S_x^i$ (where $i = 1, 2, \cdots, n$) and accumulate $f_{obm}(S)$, $f_{rbm}(S)$, $\bar{x}_S$, $\sigma_S^{xy}$ and $\sigma_S^{xr}$. The two correlation-based features can then be easily computed, as shown in Algorithm A.2. The time complexity of computing the statistical features using Algorithm A.2 is $O(|\bar{S}| + \sum_{i=1}^{n} |S_x^i|)$, which is much smaller than $O(n|\bar{S}|)$, because $|S_x^i|$ is much smaller than $|\bar{S}|$ in general.

Algorithm A.3 Computing Optimal Solutions for Labeling
Input: a graph $G(V, E)$; the full set of MISs $\bar{S}$.
Output: optimal solution value $x_S^*$ for each $S \in \bar{S}$.
1. $x^*, y^* \leftarrow \mathrm{Gurobi}(G, \bar{S})$.

Table B.1: The Pearson correlation coefficient (PCC) and normalized mutual information (NMI) between each feature and the class label, and the weight learned by the linear SVM for each feature.

The trained DT with depth = 3 is presented in Figure B.2.

Table B.2: The experimental results of our MLCG method when using different classifiers to evaluate the quality of columns on the test problem instances. The results are averaged across 25 independent runs. The last row presents the p-value of the t-tests between the results generated by linear SVM (with C = 1) and DT with different depths. The ones with statistical significance (p-value < 0.05) are highlighted in bold.

Figure B.3: The optimality gap (%) generated by our MLCG method at different iterations ($n_{it}$). The results are averaged over 25 independent runs and 33 MATILDA test problem instances.

Figure B.3 shows that the solution quality generated by our method improves (i.e., the optimality gap decreases) dramatically in the first several iterations, and the improvement slows down later on. The best (smallest) optimality gap is reached at around the 50th iteration, and the solution quality cannot be further improved afterwards. Note that the result presented in Figure B.3 is the optimality gap generated at each iteration, rather than the best optimality gap found so far; hence, due to randomness, the curve in Figure B.3 does not decrease monotonically. The detailed results for each test problem instance are presented in Table B.4.

B.6 DETAILED EXPERIMENTS FOR SEEDING RESTRICTED MASTER HEURISTIC

We use the columns generated by our MLCG method to warm-start RMH (referred to as MLRMH), and compare against RMH with the typical random initialization. The number of initial MISs is set to $10|V|$. The restricted master problem and the pricing problem are both solved by Gurobi. The detailed experimental results for the 33 MATILDA test problem instances are presented in Table B.5.

Table B.6: The experimental results of the MLRMH and RMH algorithms for solving the 30 MATILDA test problem instances, in which the number of MISs is so large that they cannot be enumerated upfront using 16GB memory. The last row presents the p-values of the t-tests.

Table B.7: The experimental results of the MLRMH and RMH algorithms for solving a set of larger DIMACS problem instances. The last row presents the p-values of the t-tests.

ACKNOWLEDGEMENT

This work was supported by an ARC Discovery Grant (DP180101170) from the Australian Research Council.


Table B.3: The experimental results of our MLCG method when using different methods to generate columns on the test problem instances. The results are averaged across 25 independent runs. The last row presents the p-value of the t-tests between the results generated by the two methods.

The optimality gap is computed with respect to the lower bound generated by Gurobi with 4 CPUs and a one-hour cutoff time. We can clearly see that our method using linear SVM generates a significantly better (smaller) optimality gap than that using DT, indicating that linear SVM is more effective than DT in evaluating the quality of MISs. This is somewhat surprising, because the classification accuracy of DT is much higher than that of linear SVM. Further, a DT with a larger depth ($d$) may perform even worse in evaluating the quality of MISs, although its classification accuracy improves. This indicates that a DT with a large depth may be overfitted. Another reason may be that the decision function used is not the most appropriate criterion for evaluating the quality of MISs. For example, in a large decision tree, many leaf nodes have a score of 1, and DT cannot distinguish between the MISs in those nodes, resulting in poor performance. This issue could possibly be resolved by using a different criterion to evaluate the quality of MISs, such as the "distance" from the feature vector of a MIS to the decision boundary of DT. However, as our aim is not to compare which classification algorithm performs best on this task, a further investigation along this line is beyond the scope of this paper.

So far, we have only used the 33 MATILDA problem instances from (Smith-Miles & Bowly, 2015) as our test problem instances; the MISs in those graphs can be enumerated using 16GB memory. Here, we include 30 other MATILDA problem instances from (Smith-Miles & Bowly, 2015), in which the number of MISs is so large that they cannot be enumerated with 16GB memory. We evaluate the efficacy of our MLRMH method on these problem instances.
Table B.6 presents the experimental results of our MLRMH algorithm compared against the RMH algorithm for solving the 30 MATILDA test problem instances. Our MLRMH algorithm is able to prove optimality for 11 problem instances, while the RMH algorithm is only able to prove optimality for 8. Further, the average optimality gap generated by our MLRMH method (8.96%) is statistically significantly better than that generated by the RMH algorithm (12.53%). By using the columns generated by our method to warm-start RMH, the average number of CG iterations is significantly reduced from 156.28 to 59.38. This also leads to a significant reduction in the runtime of solving the LP and MIP. This clearly demonstrates that the RMH algorithm using the columns generated by our method as a warm start performs significantly better than RMH alone.

Finally, we use a set of larger DIMACS graphs, in which the number of vertices varies from 125 to 1,000, to evaluate the generalization capability of our ML model. The cutoff time of MLRMH or RMH for solving each problem instance is set to ten hours. Table B.7 presents the experimental results (averaged over two independent runs) for the problem instances whose LP relaxation can be optimally solved via CG within the cutoff time. Note that Gurobi with a ten-hour cutoff time may not be able to solve the reduced MIPs for large problem instances (e.g., r1000.1), resulting in a large optimality gap. Further, the time used by our MLCG method to generate columns for large problem instances may be non-negligible, because the computation of features requires solving the restricted master problem in each iteration of our algorithm. However, overall, the RMH algorithm using the columns generated by our MLCG method as a warm start performs better than RMH alone.

Published as a conference paper at ICLR 2023

