UNSUPERVISED LEARNING FOR COMBINATORIAL OPTIMIZATION NEEDS META LEARNING

Abstract

A general framework of unsupervised learning for combinatorial optimization (CO) is to train a neural network whose output gives a problem solution by directly optimizing the CO objective. Albeit with some advantages over traditional solvers, current frameworks optimize an averaged performance over the distribution of historical problem instances, which misaligns with the actual goal of CO: finding a good solution to every future encountered instance. With this observation, we propose a new objective of unsupervised learning for CO where the goal of learning is to search for good initialization for future problem instances rather than give direct solutions, and we propose a meta-learning-based training pipeline for this new objective. Our method achieves strong empirical performance. We observe that even the initial solution given by our model before fine-tuning can significantly outperform the baselines under various evaluation settings, including evaluation across multiple datasets and the case with large shifts in problem scale. We conjecture the reason is that meta-learning-based training lets the model be loosely tied to each local optimum for a training instance while being more adaptive to the changes of optimization landscapes across instances.

1. INTRODUCTION

Combinatorial optimization (CO), which aims to find an optimal solution in a discrete search space, has a pivotal position in scientific and engineering fields (Papadimitriou & Steiglitz, 1998; Crama, 1997). Most CO problems are NP-complete or NP-hard. Conventional heuristics or approximation algorithms require deep insight into the particular problem. Starting from the seminal work of Hopfield & Tank (1985), researchers have applied neural networks (NNs) (Smith, 1999; Vinyals et al., 2015) to solve CO problems. The motivation is that NNs may learn heuristics through solving historical problems, which could be useful for solving similar problems in the future. Many NN-based methods (Selsam et al., 2018; Joshi et al., 2019; Hudson et al., 2021; Gasse et al., 2019; Khalil et al., 2016) require optimal solutions to the CO problem as supervision in training. However, optimal solutions are hard to get in practice and the obtained models often do not generalize well (Yehuda et al., 2020). Methods based on reinforcement learning (RL) (Mazyavkina et al., 2021; Bello et al., 2016; Khalil et al., 2017; Yolcu & Póczos, 2019; Chen & Tian, 2019; Yao et al., 2019; Kwon et al., 2020; 2021; Delarue et al., 2020; Nandwani et al., 2021) do not need labels but often suffer from notoriously unstable training. Recently, unsupervised learning methods have attracted much attention (Toenshoff et al., 2021; Amizadeh et al., 2018; Yao et al., 2019; Karalias & Loukas, 2020; Wang et al., 2022). A common strategy of these methods is to design an NN whose output gives a solution to the CO problem and then train the NN via gradient descent by directly optimizing the CO objectives over a set of training instances. This strategy is superior in its faster training, good generalization, and strong capability of dealing with large-scale problems. Despite the prominent progress, current unsupervised learning methods always optimize NNs towards an averaged good performance over training instances.
This means that even if a test instance comes from the same distribution as the training instances, the solution to this single instance may not have good quality, let alone the case when the test instance is out-of-distribution (OOD). This raises a concern when we apply NNs in practice, because practical problems often expect a good solution to every encountered instance. For example, allocating surveillance cameras is crucial for every exhibition in every art gallery; solvers applied to this problem (O'rourke, 1987; Yabuta & Kitazawa, 2008) should output a good solution every time. Traditional CO solvers are designed toward this goal. However, they are time-consuming and unable to learn heuristics from historical instances. So, can we leverage the benefit of learning from history while aiming for an instance-wise good solution instead of an averaged good solution? This motivates us to study a new formulation of unsupervised learning for CO. We regard the objective of learning from history as searching for a good initialization for each future instance rather than giving a direct solution. Since future instances are unavailable during the training stage in practice, we propose to view each training instance as a pseudo-new instance for the remaining training instances. Then, our learning objective is to learn a good initialization of the model, such that further optimization from this initialization achieves a good solution on each of these pseudo-new instances. We observe that meta learning is suitable to implement this idea and propose to adopt MAML (Finn et al., 2017) in our training pipeline as a proof of concept. Note that the step of optimization on each pseudo-new instance shares a similar spirit with fine-tuning a model over each downstream task as traditional meta learning does; however, each task in our case corresponds to optimization over each training instance.
We name our method Meta-EGN, as it extends the previous framework EGN (Karalias & Loukas, 2020) via meta learning. Our key observation is that, with this new objective, even the initial solution given by Meta-EGN (before fine-tuning on a test instance) is substantially better than the solution given by EGN and other methods that optimize the averaged performance over training instances. Our conjectured reason is that the new objective, by taking into account fine-tuning the model over new instances, trains the model to avoid being trapped in a local minimum induced by each training instance while being more adaptive to the changes of optimization landscapes across instances. We demonstrate the benefits of Meta-EGN via experiments on three benchmark CO problems (max clique, minimum vertex cover, and max independent set) over multiple synthetic graph datasets and three real-world graph datasets, with the number of nodes ranging from 100 to 5000. Meta-EGN significantly outperforms state-of-the-art learning-based baselines (Karalias & Loukas, 2020; Toenshoff et al., 2021), greedy algorithms, and the commercial CO solver Gurobi9.5 (Gurobi Optimization, 2022) in most cases. Meta-EGN also generalizes well out-of-distribution, when the training and test datasets are different or contain graphs of entirely different sizes. Moreover, Angelini & Ricci-Tersenghi (2022) have recently shown that the learning-based method in (Schuetz et al., 2022) could not achieve results comparable to the degree-based greedy algorithm (DGA) (Angelini & Ricci-Tersenghi, 2019) on the max independent set (MIS) problem over large-scale random regular graphs (RRGs), which has raised attention in the machine learning community.
We observe that the issues come from two aspects: (1) the graph neural networks (GNNs) used to encode the regular graphs suffer from a node ambiguity issue due to their limited expressive power (Xu et al., 2019); (2) the model in (Schuetz et al., 2022) did not learn from history but was directly optimized over each test case, which tends to get trapped in a local optimum. By addressing these two issues, Meta-EGN consistently outperforms DGA while maintaining the same time complexity to generate solutions. Fig. 1 shows the results.

2. RELATED WORK

In the following, we review two groups of works: unsupervised learning for CO and meta learning. Previous works on unsupervised learning for CO have studied max-cut (Yao et al., 2019) and TSP (Hudson et al., 2021), but these works depend on carefully selected problem-specific objectives. Some works have investigated satisfaction problems (Amizadeh et al., 2018; Toenshoff et al., 2019); applying these approaches to general CO problems requires problem reductions. The works most relevant to ours are (Karalias & Loukas, 2020), (Wang et al., 2022) and (Schuetz et al., 2022). Karalias & Loukas (2020) propose an unsupervised learning framework, EGN, for general CO problems based on the Erdős probabilistic method, which bounds the quality of the final solutions with probability. Wang et al. (2022) generalize EGN and prove that if the CO objective can be relaxed into an entry-wise concave form, a solution of good quality can be achieved deterministically. This further inspires the design of proxy objectives for CO problems that may not have closed-form objectives, such as those in circuit design. Schuetz et al. (2022) have recently extended EGN to large-scale max independent set problems on random regular graphs. Meta learning is proposed to learn hyper-parameters or initializations from historical tasks and achieve fast adaptation to new tasks. Finn et al. (2017) propose model-agnostic meta learning (MAML), which aims to obtain a good parameter initialization that accommodates few-shot learning with limited steps of fine-tuning. Nichol et al. (2018) accelerate MAML by adopting a first-order approximation of the gradient. Rajeswaran et al. (2019) introduce implicit-MAML, which adopts an objective with fine-tuning until reaching stationary points on new tasks; implicit-MAML does not fit our case because we try to avoid long fine-tuning. Hsu et al. (2018) study unsupervised learning under the meta learning framework, focusing exclusively on vision tasks.
To the best of our knowledge, our work is the first to apply meta learning to unsupervised learning for CO.

3. PRELIMINARIES: NOTATIONS AND PROBLEM FORMULATION

Combinatorial Optimization on Graphs. We follow the settings considered in (Karalias & Loukas, 2020; Wang et al., 2022; Schuetz et al., 2022) and study CO problems on graphs whose solutions can be represented as a subset of nodes of the input graph instance, although our method could be applied to a broader range of problems. Suppose $\mathcal{G}$ is the universe of graph instances. Let $G(V,E) \in \mathcal{G}$ denote a graph instance where $V = \{1,2,\dots,n\}$ is the node set and $E$ is the edge set. Let $X = (X_i)_{1\le i\le n} \in \{0,1\}^n$ denote the discrete optimization variables defined on $V$, where $X_i = 1$ denotes that node $i$ is selected in the output node subset. A CO problem on $G$ consists of a cost function $f(\cdot\,; G): \{0,1\}^n \to \mathbb{R}_{\ge 0}$ and a feasible set $\Omega \subseteq \{0,1\}^n$, the finite set of all feasible $X$'s, and asks to solve

$$\min_X f(X; G) \quad \text{s.t.} \quad X \in \Omega. \qquad (1)$$

Unsupervised Learning for CO. The Erdős-Goes-Neural (EGN) framework of unsupervised learning for CO proposed in (Karalias & Loukas, 2020) is reviewed as follows. Here, we use the notation system of the follow-up work (Wang et al., 2022) as it is clearer. Learning for CO is to learn an algorithm $\mathcal{A}_\theta(\cdot): \mathcal{G} \to \{0,1\}^n$, typically parameterized by an NN with parameters $\theta$, such that given a graph instance $G$, $X = \mathcal{A}_\theta(G)$ gives a solution of Eq. 1. In practice, directly optimizing the parameters $\theta$ is hard in general. Therefore, we may consider a relaxed cost function $f_r(\cdot\,; G): [0,1]^n \to \mathbb{R}_{\ge 0}$ with $f_r(X; G) = f(X; G)$ on any discrete point $X \in \{0,1\}^n$, and a relaxed constraint $g_r(\cdot\,; G): [0,1]^n \to \mathbb{R}_{\ge 0}$ such that $\{X \in \{0,1\}^n : g_r(X; G) = 0\}$ and $\{X \in \{0,1\}^n : g_r(X; G) \ge 1\}$ define the feasible set $\Omega$ and the infeasible set $\Omega^c$, respectively. Also, suppose the NN in $\mathcal{A}_\theta$ can give soft solutions $\bar{X} \in [0,1]^n$. Then, we may train $\theta$ by minimizing the label-independent loss function

$$\min_\theta\; l(\theta; G) \triangleq f_r(\bar{X}; G) + \beta g_r(\bar{X}; G), \quad \bar{X} = \mathcal{A}_\theta(G), \qquad (2)$$

for some $\beta > 0$.
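To make Eq. 2 concrete, the following is a minimal numeric sketch (not the paper's GNN pipeline) that optimizes the relaxed MVC loss $f_r + \beta g_r$ over raw node scores with hand-derived gradients. The toy path graph, the plain sigmoid parameterization, and all names are illustrative assumptions:

```python
import numpy as np

def mvc_relaxed_loss(x, edges, beta):
    """Relaxed MVC loss l = f_r + beta * g_r on soft assignments x in [0,1]^n."""
    f_r = x.sum()                                          # relaxed cover size
    g_r = sum((1 - x[i]) * (1 - x[j]) for i, j in edges)   # uncovered-edge penalty
    return f_r + beta * g_r

def mvc_loss_grad_x(x, edges, beta):
    """Hand-derived gradient of the relaxed loss w.r.t. the soft assignment x."""
    g = np.ones_like(x)                                    # d f_r / d x_i = 1
    for i, j in edges:
        g[i] -= beta * (1 - x[j])
        g[j] -= beta * (1 - x[i])
    return g

# Toy instance: path graph 0-1-2; beta exceeds the max objective value n = 3.
edges, beta = [(0, 1), (1, 2)], 4.0
theta = np.zeros(3)                             # raw scores; x = sigmoid(theta)
for _ in range(200):                            # plain gradient descent on theta
    x = 1.0 / (1.0 + np.exp(-theta))
    theta -= 0.1 * mvc_loss_grad_x(x, edges, beta) * x * (1 - x)  # chain rule
x = 1.0 / (1.0 + np.exp(-theta))
```

The scores converge so that the middle node, which alone covers both edges, is pushed toward 1 while the endpoints are pushed toward 0; rounding the soft output then recovers the optimal cover {1}.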
The significant observation made by (Wang et al., 2022), which generalizes the argument in (Karalias & Loukas, 2020), is a performance guarantee under the condition that $f_r$ and $g_r$ are entry-wise concave, which is satisfied in all the cases studied in this work: if the loss achieves $l(\theta; G) < \beta$ for some $\beta > \max_{X \in \{0,1\}^n} f(X; G)$, then the discrete solution $X$ obtained by rounding the soft solution $\bar{X} = \mathcal{A}_\theta(G)$ according to Def. 1 is feasible ($X \in \Omega$) and of good quality ($f(X; G) \le l(\theta; G)$).

Algorithm 1: Training and testing of Meta-EGN (see Sec. 4)
1: Initialize $\theta^{(0)}$; for $j = 0, 1, \dots, K-1$ do ▷ Training starts
2: Sample a mini-batch of training instances $B_j$
3: For each $G_i \in B_j$, compute the adapted parameters: $\theta_i^{(j)} = \theta^{(j)} - \alpha \nabla_{\theta^{(j)}} l(\theta^{(j)}; G_i)$
4: Update: $\theta^{(j+1)} \leftarrow \theta^{(j)} - \gamma \nabla_{\theta^{(j)}} \sum_{G_i \in B_j} l(\theta_i^{(j)}; G_i)$
5: end for
6: return $\theta \leftarrow \theta^{(K)}$ ▷ Training ends
7: For a given testing instance $G'$: ▷ Testing starts
8: if fine-tuning is allowed then
9: Fine-tune the parameters: $\theta_{G'} \leftarrow \theta - \alpha \nabla_\theta l(\theta; G')$
10: Use Def. 1 to round the relaxed solution given by $\mathcal{A}_{\theta_{G'}}(G')$ ▷ With fine-tuning
11: else
12: Use Def. 1 to round the relaxed solution given by $\mathcal{A}_\theta(G')$ ▷ Without fine-tuning
13: end if ▷ Testing ends

Definition 1 (Rounding). For a soft solution $\bar{X} \in [0,1]^n$ and an arbitrary order of the entries (w.l.o.g. $1, 2, \dots, n$), fix all the other entries unchanged and round $\bar{X}_i$ into 0 or 1 via
$$X_i = \arg\min_{j=0,1}\; f_r(X_1, \dots, X_{i-1}, j, \bar{X}_{i+1}, \dots, \bar{X}_n) + \beta g_r(X_1, \dots, X_{i-1}, j, \bar{X}_{i+1}, \dots, \bar{X}_n);$$
replace $\bar{X}_i$ with $X_i$ and repeat this operation until all the entries are discrete.
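The sequential rounding of Def. 1 can be sketched in a few lines. This toy version (hypothetical function names, a 3-node path graph, and the MVC relaxation as the loss) greedily fixes one entry at a time exactly as the definition prescribes:

```python
def mvc_loss(x, edges, beta):
    # Relaxed MVC loss f_r + beta * g_r; entry-wise linear (hence concave) in each x[i].
    return sum(x) + beta * sum((1 - x[i]) * (1 - x[j]) for i, j in edges)

def round_solution(x_soft, loss_fn):
    """Sequential rounding of Def. 1: fix entries to {0,1} one at a time,
    each time choosing the value that minimizes the relaxed penalized loss."""
    x = list(x_soft)
    for i in range(len(x)):
        x[i] = min((0.0, 1.0), key=lambda v: loss_fn(x[:i] + [v] + x[i + 1:]))
    return [int(v) for v in x]

edges = [(0, 1), (1, 2)]                        # toy path graph on 3 nodes
loss = lambda z: mvc_loss(z, edges, beta=4.0)
x_hard = round_solution([0.2, 0.9, 0.2], loss)  # a hypothetical soft output
```

Entry-wise concavity makes the loss non-increasing at every rounding step, which is exactly how the guarantee $f(X;G) \le l(\theta;G)$ survives the rounding.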

4. META LEARNING FOR ERD ÖS GOES NEURAL (META-EGN)

The above performance guarantee lays the theoretical foundation for EGN. However, the following practical issue motivates us to incorporate meta learning into EGN.

4.1. MOTIVATION: WHAT NEEDED IS LEARNING FOR INSTANCE-WISE GOOD SOLUTIONS

It is often time-consuming to perform online optimization of $l(\theta; G)$ for each encountered instance $G$. This also mismatches the goal of learning, i.e., learning heuristics from history/data. Therefore, a commonly adopted pipeline is as follows. Suppose there is a set of training instances $G_i$, $1 \le i \le m$, IID sampled from a distribution $\mathcal{P}_\mathcal{G}$. We optimize $\theta$ by following $\min_\theta \sum_{i=1}^m l(\theta; G_i)$, which is similar to empirical risk minimization (ERM) in standard supervised learning. When a test instance $G$ appears, we apply the learned $\mathcal{A}_\theta$ to get a soft solution and round it to the final solution. This pipeline cannot guarantee the solution quality for the particular instance $G$. Even if the training instances $G_i$, $1 \le i \le m$ are in large quantity (so in-distribution generalization is not a problem), and even if the test instance $G$ also follows $\mathcal{P}_\mathcal{G}$, we may not guarantee a low $l(\theta; G)$ for one particular $G$, because ERM only guarantees a low averaged loss $\mathbb{E}_{G\sim\mathcal{P}_\mathcal{G}}[l(\theta; G)]$. This issue may also violate the condition needed for the performance guarantee reviewed in Sec. 3, as that condition is instance-wise. Here, we highlight that in practice even the minimal averaged loss $\min_\theta \mathbb{E}_{G\sim\mathcal{P}_\mathcal{G}}[l(\theta; G)]$ is often strictly greater than the averaged instance-wise minimal loss $\mathbb{E}_{G\sim\mathcal{P}_\mathcal{G}}[\min_\theta l(\theta; G)]$, because practical NNs are not expressive enough to memorize the optimal solution to every instance. Unfortunately, many practical CO problems actually expect instance-wise good solutions, because every instance in practice is crucial: a terrible solution for one instance may raise a security issue (e.g., the surveillance-camera allocation problem) or cause huge economic losses (e.g., the routing problem in a transportation system). With this observation, our work addresses the problem by studying unsupervised learning for instance-wise good solutions to CO problems.
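The gap between $\min_\theta \mathbb{E}[l(\theta;G)]$ and $\mathbb{E}[\min_\theta l(\theta;G)]$ can be seen in a two-instance toy example with a single shared parameter and quadratic stand-in losses (an illustrative assumption, not the paper's GNN loss): each instance is individually solvable to loss 0, yet the ERM minimizer pays the variance across instances.

```python
import numpy as np

# Two "instances" with quadratic stand-in losses l_i(theta) = (theta - c_i)^2
# and a single shared parameter theta playing the role of the network weights.
c = np.array([-1.0, 1.0])

theta_erm = c.mean()                          # argmin_theta of the averaged loss
min_of_avg = np.mean((theta_erm - c) ** 2)    # min_theta E[l(theta; G)]  -> variance of c
avg_of_min = np.mean([(ci - ci) ** 2 for ci in c])  # E[min_theta l]: each instance solved at theta = c_i
```

Here `min_of_avg` is 1 while `avg_of_min` is 0: a model of limited capacity that must serve all instances with one parameter vector cannot match the instance-wise optima.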

4.2. TRAINING TOWARDS INSTANCE-WISE OPTIMALITY VIA META LEARNING

Our idea to address the problem is to regard the goal of learning from history as searching for a good initialization for future instances rather than giving direct solutions. Such a good initialization can be quickly fine-tuned by further optimizing the model for each instance, which ultimately gives instance-wise good solutions. However, in practice, we do not have access to any future/test instances. So, can we implement the above idea using only historical/training instances? Our strategy is to view each training instance $G_i$ as a pseudo-test instance to test and optimize the quality of the initialization given by the model. Specifically, this strategy gives us the objective

$$\min_\theta \sum_{i=1}^m \tilde{l}_i(\theta), \quad \text{where } \tilde{l}_i(\theta) = \min_{\theta_i} l(\theta_i; G_i) \text{ with } \theta_i = \theta \text{ as initialization}. \qquad (4)$$

Eq. 4 has some abuse of notation: the minimum $\tilde{l}_i(\theta)$ depends on the initialization $\theta$ because of the non-convex nature of $\min_{\theta_i} l(\theta_i; G_i)$, where the initialization $\theta_i = \theta$ matters significantly. We further simplify Eq. 4 with a practical consideration: we may not be allowed to further optimize $\theta$ with many gradient-descent steps for each instance, especially during the online test stage. As a proof of concept, we consider the case with only one-step gradient descent, which already gives good empirical results. Specifically, our training objective is

$$\text{Our Objective:} \quad \min_\theta \sum_{i=1}^m l_i(\theta), \quad \text{where } l_i(\theta) = l(\theta_i; G_i) \text{ with } \theta_i = \theta - \alpha \nabla_\theta l(\theta; G_i). \qquad (5)$$

Here, $\theta$ is to give a good initialization $\mathcal{A}_\theta(G_i)$ over each instance $G_i$, while $\theta_i$ is obtained via a one-step fine-tune to achieve a $G_i$-specific good solution $\mathcal{A}_{\theta_i}(G_i)$. The optimization in Eq. 5 can be implemented via the meta learning pipeline MAML (Finn et al., 2017). We name the obtained model Meta-EGN and summarize its training and testing in Alg. 1. In step 3, Meta-EGN performs the one-step gradient descent on each training instance. Note that we consider two testing cases, with or without fine-tuning, because the latter saves much inference time.
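Steps 3-4 of Alg. 1 can be sketched with quadratic stand-in per-instance losses (an illustrative assumption: with $l_i(\theta)=\|\theta-c_i\|^2$ the Hessian is $2I$, so the exact MAML outer gradient $(I-\alpha\nabla^2 l)\,\nabla l(\theta_i)$ reduces to a scalar factor $(1-2\alpha)$). The instance centers $c_i$ and step sizes are made up for illustration:

```python
import numpy as np

def l(theta, c):                     # quadratic stand-in for l(theta; G_i)
    return float(np.sum((theta - c) ** 2))

def grad_l(theta, c):
    return 2.0 * (theta - c)

alpha, gamma = 0.05, 0.1             # inner (alpha) and outer (gamma) step sizes
instances = [np.array([1.0, -1.0]), np.array([-1.0, 1.0]), np.array([0.5, 0.5])]

theta = np.zeros(2)
meta_loss = lambda th: sum(l(th - alpha * grad_l(th, c), c) for c in instances)
start = meta_loss(theta)
for _ in range(100):                                  # Alg. 1, steps 2-4 (full batch)
    outer = np.zeros_like(theta)
    for c in instances:
        theta_i = theta - alpha * grad_l(theta, c)    # step 3: one-step adaptation
        # Outer gradient d l(theta_i; c)/d theta = (I - alpha * H) grad_l(theta_i; c);
        # for this quadratic, H = 2I, so the correction is the scalar (1 - 2*alpha).
        outer += (1 - 2 * alpha) * grad_l(theta_i, c)
    theta -= gamma * outer                            # step 4: outer update
end = meta_loss(theta)
```

In the real pipeline the inner and outer gradients are taken through a GNN with automatic differentiation; this sketch only exposes the two-level update structure of Eq. 5.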
A simple extension of Theorem 1 in (Wang et al., 2022) gives a performance guarantee for Meta-EGN in Theorem 1 as follows. Here, for a test instance $G$, we even allow $l(\theta; G)$ to violate the original condition $l(\theta; G) < \beta$ in (Wang et al., 2022) to some extent; after the one-step fine-tuning in step 9, the performance guarantee is still achievable. The detailed proof is in Appendix A.

Theorem 1 (Performance Guarantee). Suppose the relaxations $f_r$ and $g_r$ are entry-wise concave as required in (Wang et al., 2022). Let $\theta$ denote the learned parameters after training. Given a test instance $G$, suppose $l(\cdot\,; G)$ is locally $L$-smooth at $\theta$, i.e., $\|\nabla_{\theta'} l(\theta'; G) - \nabla_\theta l(\theta; G)\| \le L\|\theta' - \theta\|$ for all $\theta'$ with $\|\theta' - \theta\| \le \epsilon$. Then, if $l(\theta; G) < \beta + \triangle$ (even if $l(\theta; G) \ge \beta$), for any $\alpha \in (0, 2/L)$ Meta-EGN with one-step fine-tuning outputs a feasible solution $X$ of good quality $f(X; G) \le l(\theta; G) - \triangle$. Here, $\triangle = \|\nabla_\theta l(\theta; G)\|\epsilon + \frac{\epsilon^2}{2L\alpha^2 - 4\alpha}$ if $\epsilon < \alpha \|\nabla_\theta l(\theta; G)\|$, and $\triangle = (\alpha - \frac{L\alpha^2}{2})\|\nabla_\theta l(\theta; G)\|^2$ otherwise.

To better understand Meta-EGN, we show its training/testing dynamics in Fig. 2. As expected, the training loss of EGN lies in-between the losses of Meta-EGN before and after the fine-tuning step. Training EGN is stabler and converges faster than training Meta-EGN. However, what is unexpected is that in validation, Meta-EGN has a much lower loss and achieves much better performance than EGN even before fine-tuning. This implies that Meta-EGN generalizes better than EGN. We conjecture the reasons are as follows. First, the optimization landscape of a CO problem is extremely non-convex (Mezard & Montanari, 2009) due to the intersected feasible-infeasible regions and the high penalty coefficient $\beta$. EGN, which reaches low losses on training instances, may give a high loss even when the optimization landscape is only slightly shifted (from training to a test instance). In contrast, the parameters of Meta-EGN are loosely tied to a local minimum for each training instance: being aware of the follow-up instance-wise fine-tuning steps, they are likely to fall into some location close to a local minimum for each instance without being trapped in any one of them, which makes the model robust to landscape shifts across instances. Second, a CO problem can vary a lot across graph instances even when they are generated from the same distribution, especially when the instances are large. So, it is reasonable to view the problem over each instance as a separate but relevant task. Meta learning has shown good generalization when data distributions shift across tasks, with empirical evidence in CV and NLP applications (Jeong & Kim, 2020; Conklin et al., 2021).

Table 1: Comparison between different frameworks for solving CO problems.

| | EGN | PI-GNN | Meta-EGN | Traditional solvers |
| --- | --- | --- | --- | --- |
| Objective | $\min_\theta \sum_{i=1}^m l(\theta; G_i)$ | $\min_\theta l(\theta; G)$ | $\min_\theta \sum_{i=1}^m l(\theta - \alpha\nabla_\theta l(\theta; G_i); G_i)$ | $\min_X f(X; G)$ s.t. $X \in \Omega$ |
| Training or not | Yes | No | Yes | No |
| Fine-tune timing | No | Long | Short/No | Long |
| Generalization | Good | - | Better | - |
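The core of the guarantee is the classical one-step descent inequality $l(\theta';G) \le l(\theta;G) + (\frac{L\alpha^2}{2}-\alpha)\|\nabla_\theta l(\theta;G)\|^2$ for $\alpha\in(0,2/L)$, which can be checked numerically on an exactly $L$-smooth quadratic (a toy stand-in for the test-instance loss, not the GNN loss itself):

```python
import numpy as np

L_smooth = 4.0
l = lambda th: 0.5 * L_smooth * float(np.dot(th, th))   # Hessian = L * I: exactly L-smooth
grad = lambda th: L_smooth * th

theta = np.array([1.0, -2.0])
g = grad(theta)
for alpha in (0.1, 1.0 / L_smooth, 0.45):               # all inside (0, 2/L) = (0, 0.5)
    theta_ft = theta - alpha * g                        # one-step fine-tuning
    bound = l(theta) + (L_smooth * alpha ** 2 / 2 - alpha) * float(np.dot(g, g))
    assert l(theta_ft) <= bound + 1e-9                  # descent inequality holds
    assert (alpha - L_smooth * alpha ** 2 / 2) > 0      # so Delta > 0: strict improvement
```

For a pure quadratic the inequality holds with equality, which makes it a tight sanity check of the $\triangle$ in the large-$\epsilon$ branch of Theorem 1.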

Table 2: The discrete objectives (Eq. 1) and their relaxations (Eq. 2) for the three problems studied.

| Problem | Discrete objective (Eq. 1) | Relaxation (Eq. 2) |
| --- | --- | --- |
| MC | $\max_X \sum_{1\le i\le n} X_i$ s.t. $(i,j) \in E$ if $X_i X_j = 1$ | $l_{\text{MC}}(\theta; G) \triangleq -(\beta+1)\sum_{(i,j)\in E} \bar{X}_i \bar{X}_j + \frac{\beta}{2}\sum_{i\neq j} \bar{X}_i \bar{X}_j$ |
| MVC | $\min_X \sum_{1\le i\le n} X_i$ s.t. $X_i + X_j \ge 1$ if $(i,j) \in E$ | $l_{\text{MVC}}(\theta; G) \triangleq \sum_{1\le i\le n} \bar{X}_i + \beta\sum_{(i,j)\in E} (1-\bar{X}_i)(1-\bar{X}_j)$ |
| MIS | $\max_X \sum_{1\le i\le n} X_i$ s.t. $X_i X_j = 0$ if $(i,j) \in E$ | $l_{\text{MIS}}(\theta; G) \triangleq -\sum_{1\le i\le n} \bar{X}_i + \beta\sum_{(i,j)\in E} \bar{X}_i \bar{X}_j$ |

As a summary, we provide a comparison between the different unsupervised frameworks for solving CO problems in Table 1. Note that PI-GNN (Schuetz et al., 2022) is directly fine-tuned on each test instance without training, so its fine-tuning time is long. Also, although PI-GNN also pursues instance-wise good solutions, its performance can be poor because it does not learn from training instances: the instance-wise solutions may be just bad local minima.
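A quick sanity check of Table 2 (on a toy 4-cycle, with a made-up $\beta$): on feasible discrete points the penalty term vanishes and the relaxed loss reproduces the discrete objective, while infeasible points pay at least $\beta$ per violated constraint. The MC relaxation is a penalized surrogate rather than a literal rewrite of its discrete objective, so the check below covers MVC and MIS:

```python
def l_mvc(x, edges, beta):   # Table 2, MVC relaxation
    return sum(x) + beta * sum((1 - x[i]) * (1 - x[j]) for i, j in edges)

def l_mis(x, edges, beta):   # Table 2, MIS relaxation
    return -sum(x) + beta * sum(x[i] * x[j] for i, j in edges)

edges, beta = [(0, 1), (1, 2), (2, 3), (3, 0)], 5.0   # a 4-cycle; beta > n

# Feasible discrete points: the penalty vanishes, loss = discrete objective.
cover = [1, 0, 1, 0]   # vertex cover {0, 2} of size 2
indep = [1, 0, 1, 0]   # independent set {0, 2} (the same set works on a 4-cycle)
assert l_mvc(cover, edges, beta) == 2     # = |cover|
assert l_mis(indep, edges, beta) == -2    # = -|independent set|

# Infeasible points: each violated edge pays beta, dominating any feasible loss.
assert l_mvc([0, 0, 0, 0], edges, beta) == 4 * beta
assert l_mis([1, 1, 1, 1], edges, beta) == -4 + 4 * beta
```

This is exactly the property that lets the guarantee of Sec. 3 transfer a small relaxed loss into a feasible, good discrete solution.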

5. EXPERIMENTS

We study three CO problems: max clique (MC), finding the largest set of nodes in which each pair of nodes is connected; minimum vertex cover (MVC), finding the smallest set of nodes such that every edge is incident to at least one node in the set; and max independent set (MIS), finding the largest set of nodes in which no two nodes are adjacent. Their objectives (Eq. 1) and relaxations (Eq. 2) are listed in Table 2. For the detailed derivations, see Appendix C.

Datasets:

We conduct experiments on the MC and MVC problems over three real-world datasets, Twitter (Leskovec & Krevl, 2014), COLLAB and IMDB (Yanardag & Vishwanathan, 2015), and two synthetic datasets with 200 and 500 nodes generated by the RB model (Xu, 2007), which we name RB200 and RB500, respectively. We make RB200 and RB500 extremely hard by setting a small hyper-parameter ρ = 0.25 of the RB model (Xu, 2007). The relationship between difficulty and ρ on the MVC problem with 500 vertices is shown in Fig. 3, where the models are pre-trained on RB graphs with uniformly sampled ρ ∈ [0.3, 1.0] and tested on different RB graphs generated with single values of ρ; all other hyper-parameters are kept the same. As ρ increases, both Meta-EGN and Gurobi9.5 tend to achieve better performance. Meta-EGN outperforms Gurobi9.5 on hard instances (ρ ∈ (0, 0.55]) while a gap remains on the easy ones. To evaluate generalization across problem scales, we also generate large-graph datasets RB1000, RB2000, and RB5000 with ρ = 0.25. As for the MIS problem, random regular graphs (RRGs) are often used as benchmarks because they are challenging. Baselines: Our baselines include unsupervised learning methods, heuristics, and traditional CO solvers. For the MC and MVC problems, we take our direct baseline EGN (Karalias & Loukas, 2020) and also RUN-CSP (Toenshoff et al., 2021). We do not consider other learning-based methods because they generally perform worse than EGN (Karalias & Loukas, 2020). As heuristic baselines, we use greedy algorithms. For traditional CO solvers, we compare against the commercial solver Gurobi9.5 (Gurobi Optimization, 2022) by converting the problems into integer programming form. We track the time t that the models use from the start of inference to the end of rounding to output feasible solutions.
We set this time t as the time budget of Gurobi9.5 for purely solving the integer program, and list the actual time usage of Gurobi9.5, which includes pre-processing plus t. As for the MIS problem, we take PI-GNN (Schuetz et al., 2022) and EGN (Karalias & Loukas, 2020) as the learning-based baselines, and the random greedy algorithm (RGA) and the degree-based greedy algorithm (DGA), as introduced in Angelini & Ricci-Tersenghi (2019), as the heuristic baselines. When we fine-tune EGN and Meta-EGN on a test instance, we use 1-step gradient descent. Implementation: For the MC and MVC problems, we use a 4-layer GIN (Xu et al., 2019) as the backbone network for both Meta-EGN and EGN, and 1e-3 as both the outer learning rate (γ) of Meta-EGN and the learning rate of EGN; the backbone and the learning rate are the same as those in (Karalias & Loukas, 2020). For the MIS problem, we use a 6-layer GIN; the outer learning rate (γ) of Meta-EGN and the learning rate of EGN are set to 1e-4. The inner learning rate (α) of Meta-EGN is always set to 5e-5. We run all experiments on a Xeon(R) Gold 6248 CPU with 26 threads and a Quadro RTX 6000 GPU. All code runs on the PyTorch platform (Paszke et al., 2019). For more details, see Appendix C.

Table 5: Scale generalization on the MC and MVC problems. ApR is the larger the better for MC and the smaller the better for MVC. All models are trained on RB500 training data. 'Fast/Medium/Accurate' denote GNNs (without fine-tuning) using 1/4/8 random single-node seed(s) per test instance. 'Fine-tuning' uses 1-step fine-tuning on the best trial among the 8 node seeds. 'Gap' is the averaged gap defined as c × (# of nodes in the optimal solution − # of nodes by the given method), where c = 1 for MC and c = −1 for MVC. 'Rank' is the average rank of solutions among the three methods. Optimal solutions are generated via Gurobi9.5 with a time limit of 3000 seconds. An approximation rate for MC larger than 1, highlighted by *, indicates that the model outperforms the Gurobi9.5 solver with a 3000s time budget. Pareto-optimal results are in bold.

Overcoming the limited expressive power of GNNs: GNNs are known to have limited expressive power (Xu et al., 2019; Morris et al., 2019). Specifically, over RRGs, the GIN backbone will associate each node with the same representation unless node representations are initialized unequally. To keep a fair comparison, for the MC and MVC problems we follow Karalias & Loukas (2020) and adopt the initialization based on a single random node seed (one selected node is initialized as 1, the others as 0). We use 8 single random node seeds for EGN and Meta-EGN in the experiments of Sec. 5.2 and report the best among the 8 trials. We try different numbers of random node seeds in the experiments of Sec. 5.3. For the large-scale MIS problem studied in Sec. 5.4, we find such single-node initialization too local to generate valid global solutions. So, we adopt initialization based on the solutions of the greedy algorithms DGA (for Figs. 1 and 2) and RGA (for Fig. 4). Then, EGN and Meta-EGN can be viewed as learning heuristics to improve the greedy solutions. Note that learning heuristics to tune these solutions is non-trivial (Andrade et al., 2012; Rahman & Virag, 2017).
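For reference, the degree-based greedy algorithm (DGA) used here as initialization can be sketched as follows (a minimal sequential version; the tie-breaking by node id and the helper name are our choices, not from Angelini & Ricci-Tersenghi (2019)):

```python
def dga_mis(n, edges):
    """Degree-based greedy for MIS: repeatedly take a minimum-degree node of the
    residual graph, add it to the set, and delete it together with its neighbors.
    Ties are broken by node id (our choice, for determinism)."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    alive, chosen = set(range(n)), []
    while alive:
        v = min(alive, key=lambda u: (len(adj[u] & alive), u))
        chosen.append(v)
        alive -= adj[v] | {v}          # remove v and its surviving neighbors
    return sorted(chosen)
```

On a 5-node path this recovers the optimal independent set {0, 2, 4}; the learned models then take such a greedy solution as input initialization and try to improve it.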

5.2. META-EGN BOOSTS THE PERFORMANCE WITHOUT DISTRIBUTION SHIFTS

We first compare the performances of different methods when the datasets used for training and testing come from the same distribution. Tables 3 and 4 show the results for the MC and MVC problems, respectively. In both problems and across the five datasets, Meta-EGN significantly outperforms EGN and RUN-CSP, both before and after the fine-tuning step. In comparison with traditional CO solvers, Meta-EGN narrows the gap to Gurobi9.5 on the small real-world graphs. For RB graphs, Meta-EGN outperforms Gurobi9.5 on RB500 for both the MC and MVC problems. We notice that both EGN and Meta-EGN perform generally well on the MC problem while being less competitive on the MVC problem. This results from the initialization of the GNN inputs: the MC problem outputs clusters that are more local, while MVC asks for global assignments, which makes the single-seed-based initialization less suitable for the MVC problem.

5.3. META-EGN BOOSTS THE PERFORMANCE WITH DISTRIBUTION SHIFTS

Problem Scale Shift: Here, we use large-scale RB graphs of 1000-5000 nodes to test EGN and Meta-EGN trained on RB500. Table 5 shows the results. Both methods generalize well, while Meta-EGN is always better. As the scale increases, Meta-EGN outperforms Gurobi9.5. For example, it takes Meta-EGN with 4 random initializations only 1.02s to beat Gurobi9.5 running for 3000 seconds on the RB5000 dataset for the MC problem. Moreover, Meta-EGN can even outperform Gurobi9.5 on the MVC problem when the problem scale becomes large. Real-Synthetic Distribution Shift: Here, we train EGN and Meta-EGN on Twitter and test them on RB500. Table 6 shows the results; compare it with Tables 3 and 4. We observe better generalization of Meta-EGN compared to EGN. For example, on the MC problem, Meta-EGN has almost the same performance whether or not there is a dataset shift (0.833 vs. 0.834 before fine-tuning, 0.876 vs. 0.878 after fine-tuning), while EGN shows a bigger performance gap under the shift (0.819 vs. 0.829 before fine-tuning, 0.831 vs. 0.864 after fine-tuning). On the MVC problem, although the performance drop of Meta-EGN is larger, it is still much smaller than that of EGN.

5.4. MAX INDEPENDENT SET ON LARGE-SCALE RANDOM REGULAR GRAPHS

We attribute the unsatisfactory performance of PI-GNN (Schuetz et al., 2022) to the two issues stated in Sec. 1: 1) PI-GNN is trained directly on each single test instance without learning from a training dataset that contains various graphs, and is thus likely to be trapped in local optima; 2) GNNs generally suffer from a node ambiguity issue on RRGs. To resolve the problem, we use the outputs of DGA and RGA as the initialization of the GNN inputs (for EGN and Meta-EGN) and expect to learn heuristics from historical data to further tune the solutions given by the greedy algorithms. We train the GNN models on RRGs with 1000 nodes with node degrees randomly sampled from {3, 7, 10, 20}, and test on larger RRGs (up to 10^5 nodes). Experiments show that Meta-EGN can further improve DGA (Fig. 1) and RGA (Fig. 4), while EGN fails to further tune DGA. Note that here EGN and Meta-EGN adopt exactly the same backbones, so we attribute the improvement to the meta-learning-based training adopted by Meta-EGN. See Table 7 in Appendix for more details of the numerical improvement by Meta-EGN. We also see that in these cases one-step fine-tuning does not contribute much to the performance of EGN or Meta-EGN, indicating that the model before fine-tuning is already very close to a local minimum.

We also check the extra time cost of running Meta-EGN to improve DGA solutions in Figs. 5 and 6 in Appendix. The extra time cost is just 1% (without fine-tuning) to 30% (with fine-tuning) of the time cost of DGA. In theory, the extra time cost without fine-tuning should be O(|E|) for GNN inference plus O(|V|) for rounding, which is of the same order as DGA, while parallel GNN inference substantially reduces the wall-clock time.

6. CONCLUSION

This work proposes an unsupervised learning framework, Meta-EGN, with the goal of optimizing NNs towards instance-wise good solutions to CO problems. Meta-EGN leverages MAML to achieve this goal: it views each training instance as a separate task and learns a good initialization for all these tasks. Meta-EGN significantly improves the performance of its baseline and shows good generalization when the data used for training and testing have different scales or distributions. In addition, Meta-EGN can learn to improve greedy heuristics at almost no extra time cost on the maximum independent set problem over large-scale random regular graphs.

A PROOF OF THEOREM 1

We first prove Theorem 1, and then specify the value of $\alpha$ to obtain Theorem 2 as a special case. The proof of Theorem 1 is divided into two parts. In part 1, we prove that if $l(\theta; G) < \beta + \Delta$ (even if $l(\theta; G) \geq \beta$), then for any $\alpha \in (0, 2/L)$, Meta-EGN with one-step fine-tuning outputs a feasible solution $X$ of good quality $f(X; G) \leq l(\theta; G) - \Delta$. Here,
$$\Delta = \|\nabla_{\theta} l(\theta; G)\|\epsilon + \frac{1}{2L\alpha^2 - 4\alpha}\epsilon^2 \;\text{ if } \epsilon < \alpha\|\nabla_{\theta} l(\theta; G)\|, \quad \Delta = \Big(\alpha - \frac{L\alpha^2}{2}\Big)\|\nabla_{\theta} l(\theta; G)\|^2 \;\text{ otherwise.}$$
In part 2, we prove that once Meta-EGN achieves the loss value $l(\theta'; G)$ after the one-step fine-tuning, the rounding process outputs a feasible $X$ whose objective satisfies $f(X; G) \leq l(\theta'; G)$.

Part 1: By local smoothness, we get
$$l(\theta'; G) \overset{(a)}{\leq} l(\theta; G) + \nabla_{\theta} l(\theta; G)^{\top}(\theta' - \theta) + \frac{L}{2}\|\theta' - \theta\|_2^2 \overset{(b)}{=} l(\theta; G) + \frac{L}{2}\|\theta' - \theta\|_2^2 - \alpha\|\nabla_{\theta} l(\theta; G)\|_2^2 = l(\theta; G) + \Big(\frac{L\alpha^2}{2} - \alpha\Big)\|\nabla_{\theta} l(\theta; G)\|_2^2,$$
where (a) is due to the local $L$-smoothness of $l(\cdot; G)$, and (b) is due to the definition of the one-step fine-tuning $\theta' = \theta - \alpha\nabla_{\theta} l(\theta; G)$.

If $\epsilon < \alpha\|\nabla_{\theta} l(\theta; G)\|$: let $\Delta = \|\nabla_{\theta} l(\theta; G)\|\epsilon + \frac{1}{2L\alpha^2 - 4\alpha}\epsilon^2$. We have
$$\min_{\epsilon}\,(-\Delta) = \min_{\epsilon}\Big[-\frac{1}{2L\alpha^2 - 4\alpha}\epsilon^2 - \|\nabla_{\theta} l(\theta; G)\|\epsilon\Big] = \Big(\frac{L\alpha^2}{2} - \alpha\Big)\|\nabla_{\theta} l(\theta; G)\|^2,$$
thus $l(\theta'; G) \leq l(\theta; G) - \Delta$.

If $\epsilon \geq \alpha\|\nabla_{\theta} l(\theta; G)\|$: let $\Delta = \big(\alpha - \frac{L\alpha^2}{2}\big)\|\nabla_{\theta} l(\theta; G)\|^2$. We directly have $l(\theta'; G) \leq l(\theta; G) - \Delta$.

This finishes the first part of the proof of Theorem 1.

Part 2: The proof in this part follows the rounding analysis in Wang et al. (2022). Consider the rounding procedure from the continuous output $\bar{X} = \mathcal{A}_{\theta}(G)$, $\bar{X} \in [0, 1]^n$, into a discrete feasible solution $X \in \{0, 1\}^n$. Let $\bar{X}_i, X_i$, $i \in \{1, \dots, n\}$, denote their entries. W.l.o.g., suppose the rounding order is from $1$ to $n$ and we have finished the rounding before the $t$-th node. We now analyze the rounding of the $t$-th node:
$$f_r([X_1, \dots, X_{t-1}, \bar{X}_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, \bar{X}_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G)$$
$$\overset{(d)}{\geq} \bar{X}_t\big(f_r([X_1, \dots, X_{t-1}, 1, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, 1, \bar{X}_{t+1}, \dots, \bar{X}_n]; G)\big) + (1 - \bar{X}_t)\big(f_r([X_1, \dots, X_{t-1}, 0, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, 0, \bar{X}_{t+1}, \dots, \bar{X}_n]; G)\big)$$
$$\geq \big(\bar{X}_t + (1 - \bar{X}_t)\big)\min_{j_t \in \{0,1\}}\big(f_r([X_1, \dots, X_{t-1}, j_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, j_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G)\big)$$
$$\overset{(e)}{=} f_r([X_1, \dots, X_{t-1}, X_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, X_t, \bar{X}_{t+1}, \dots, \bar{X}_n]; G), \qquad (10)$$
where (d) is due to the entry-wise concavity of $l_r(\theta; G)$ w.r.t. $\bar{X}$ and Jensen's inequality, and (e) is due to $X_t = \arg\min_{j \in \{0,1\}} f_r([X_1, \dots, X_{t-1}, j, \bar{X}_{t+1}, \dots, \bar{X}_n]; G) + \beta g_r([X_1, \dots, X_{t-1}, j, \bar{X}_{t+1}, \dots, \bar{X}_n]; G)$ (the definition of our rounding process). The loss value is monotonically non-increasing through the whole rounding process according to the inequality above, thus we get
$$l(\theta'; G) \geq f(X; G) + \beta g(X; G).$$
This finishes the second part of the proof.

A.1 A SPECIFIC CASE

Note that in the first part of the proof above, if we set $\alpha = 1/L$ in equation (6), we obtain
$$l(\theta'; G) \leq l(\theta; G) - \frac{\|\nabla_{\theta} l(\theta; G)\|_2^2}{2L}. \qquad (12)$$
If $\epsilon < \frac{1}{L}\|\nabla_{\theta} l(\theta; G)\|$: let $\Delta = \|\nabla_{\theta} l(\theta; G)\|\epsilon - \frac{L}{2}\epsilon^2$. We have
$$\min_{\epsilon}\,(-\Delta) = \min_{\epsilon}\Big[\frac{L}{2}\epsilon^2 - \|\nabla_{\theta} l(\theta; G)\|\epsilon\Big] = -\frac{\|\nabla_{\theta} l(\theta; G)\|^2}{2L},$$
thus $l(\theta'; G) \leq l(\theta; G) - \Delta$.

If $\epsilon \geq \frac{1}{L}\|\nabla_{\theta} l(\theta; G)\|$: let $\Delta = \frac{1}{2L}\|\nabla_{\theta} l(\theta; G)\|^2$. We directly have $l(\theta'; G) \leq l(\theta; G) - \Delta$.

By this, we obtain Theorem 2, a specific case of Theorem 1, as follows:

Theorem 2 (A specific case of Theorem 1). Suppose the relaxations $f_r$ and $g_r$ are entry-wise concave as required in Wang et al. (2022). Let $\theta$ denote the learned parameter after training. Given a test instance $G$, suppose $l(\cdot; G)$ is locally $L$-smooth at $\theta$, i.e., $\|\nabla_{\theta'} l(\theta'; G) - \nabla_{\theta} l(\theta; G)\| \leq L\|\theta' - \theta\|$ for all $\theta'$ that satisfy $\|\theta' - \theta\| \leq \epsilon$. Then, if $l(\theta; G) < \beta + \Delta$ (even if $l(\theta; G) \geq \beta$), there exists $\alpha$ such that Meta-EGN with one-step fine-tuning outputs a feasible solution $X$ of good quality $f(X; G) \leq l(\theta; G) - \Delta$. Here, $\Delta = \|\nabla_{\theta} l(\theta; G)\|\epsilon - \frac{L}{2}\epsilon^2$ if $\epsilon < \frac{1}{L}\|\nabla_{\theta} l(\theta; G)\|$, and $\Delta = \frac{1}{2L}\|\nabla_{\theta} l(\theta; G)\|^2$ otherwise.
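As an informal numerical sanity check (not part of the formal proof), the descent guarantee of Theorem 2 with $\alpha = 1/L$, i.e., $l(\theta'; G) \leq l(\theta; G) - \|\nabla_\theta l(\theta; G)\|^2/(2L)$, can be verified on a toy $L$-smooth loss. The quadratic below is our own illustrative stand-in for $l(\cdot; G)$:

```python
import numpy as np

# Toy L-smooth loss: l(theta) = 0.5 * theta^T A theta, with gradient A theta.
# L equals the largest eigenvalue of A. This only checks the descent bound
# of Theorem 2 numerically; it is not the GNN loss from the paper.
A = np.diag([1.0, 2.0, 4.0])
L = 4.0  # largest eigenvalue -> smoothness constant

def loss(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, -2.0, 0.5])
alpha = 1.0 / L                            # step size used in Theorem 2
theta_prime = theta - alpha * grad(theta)  # one-step fine-tuning

lhs = loss(theta_prime)
rhs = loss(theta) - np.linalg.norm(grad(theta)) ** 2 / (2 * L)
print(lhs <= rhs)  # True: l(theta') <= l(theta) - ||grad||^2 / (2L)
```

For this instance, $l(\theta) = 5$, $\|\nabla l(\theta)\|^2 = 21$, so the bound guarantees $l(\theta') \leq 2.375$, while the actual post-step loss is $1.28125$.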

C.2 DETAILED DERIVATION OF THE LOSS FUNCTION RELAXATION

In this part, we present the detailed loss function relaxations of the three problems in our study (the MC, the MVC, and the MIS). The basic ideas of training-loss design and relaxation follow Karalias & Loukas (2020); Wang et al. (2022). In the following derivations, we use $i, j$ to represent nodes in graphs, $X_i, X_j \in \{0, 1\}$ to denote the discrete assignments of the binary optimization variables, and $\bar{X}_i, \bar{X}_j \in [0, 1]$ to denote the relaxed soft assignments.

The maximum clique (MC): A clique is a set of nodes $S \subseteq V$ such that any two distinct nodes in the set are adjacent. The MC problem aims to find the clique with the largest number of nodes. We formulate the optimization objective as follows:
$$\max_X \sum_{1 \leq i \leq n} X_i \quad \text{s.t. } (i, j) \in E \text{ if } X_i = X_j = 1,$$
where $X_i$ denotes whether to take node $i$ into the clique set ($X_i = 1$) or not ($X_i = 0$). By setting a proper penalty coefficient $\beta$, we formulate the relaxed loss function as follows (the detailed derivation follows the corresponding case study in Karalias & Loukas (2020)):
$$l_{\text{MC}}(\theta; G) \triangleq -(\beta + 1)\sum_{(i,j) \in E} \bar{X}_i \bar{X}_j + \frac{\beta}{2}\sum_{i \neq j} \bar{X}_i \bar{X}_j.$$

The minimum vertex cover (MVC): A vertex cover is a set of nodes $S \subseteq V$ such that every edge in the graph is connected to at least one node in the set. The MVC problem aims to find the cover set with the smallest number of nodes. The optimization objective can be summarized as follows:
$$\min_X \sum_{1 \leq i \leq n} X_i \quad \text{s.t. } X_i + X_j \geq 1 \text{ if } (i, j) \in E,$$
where $X_i$ denotes whether to take node $i$ into the cover set ($X_i = 1$) or not ($X_i = 0$). We design the constraint function $g$ to count the edges that are not covered under the assignment $X$:
$$g_{\text{MVC}}(X; G) \triangleq \sum_{(i,j) \in E} (1 - X_i)(1 - X_j).$$
We then relax the constraint $g$ and add it to the training objective with a proper penalty coefficient $\beta$, following the relaxation principle in Wang et al. (2022):
$$l_{\text{MVC}}(\theta; G) \triangleq \sum_{1 \leq i \leq n} \bar{X}_i + \beta\sum_{(i,j) \in E} (1 - \bar{X}_i)(1 - \bar{X}_j).$$
Minimizing this loss minimizes the number of nodes in the cover set while enforcing the covering constraint.

The maximum independent set (MIS): An independent set is a set of nodes in which no two distinct nodes are adjacent. The MIS problem aims to find the independent set with the largest number of nodes. We formulate the objective of the MIS as follows:
$$\max_X \sum_{1 \leq i \leq n} X_i \quad \text{s.t. } X_i X_j = 0 \text{ if } (i, j) \in E,$$
where $X_i$ denotes whether to take node $i$ into the independent set ($X_i = 1$) or not ($X_i = 0$). We formulate the constraint $g$ as the number of edges whose two endpoints are both assigned to the independent set:
$$g_{\text{MIS}}(X; G) \triangleq \sum_{(i,j) \in E} X_i X_j.$$
We then relax the constraint $g$ into the continuous space and add it to the loss function with a proper penalty coefficient $\beta$, following the relaxation principle in Wang et al. (2022):
$$l_{\text{MIS}}(\theta; G) \triangleq -\sum_{1 \leq i \leq n} \bar{X}_i + \beta\sum_{(i,j) \in E} \bar{X}_i \bar{X}_j.$$
Minimizing this loss maximizes the number of nodes in the independent set while enforcing the independence constraint.
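To make the MIS relaxation and the sequential rounding concrete, the sketch below evaluates $l_{\text{MIS}}$ on a toy graph and rounds the entries one by one in the spirit of Def. 1, checking that the penalized objective never increases along the rounding trajectory. The function names, the toy graph, and the soft assignment are our own illustrations, not the released implementation.

```python
# Sketch of the relaxed MIS loss and sequential rounding (in the spirit of
# Def. 1). Names and inputs are illustrative, not from the official code.

def l_mis(x, edges, beta):
    """Relaxed MIS loss: -sum_i x_i + beta * sum_{(i,j) in E} x_i * x_j."""
    return -sum(x) + beta * sum(x[i] * x[j] for i, j in edges)

def sequential_round(x_soft, edges, beta):
    """Round entries one by one, each time picking the value in {0, 1}
    that minimizes the penalized objective (ties go to 0)."""
    x = list(x_soft)
    losses = [l_mis(x, edges, beta)]
    for t in range(len(x)):
        cand = []
        for j in (0, 1):
            x[t] = j
            cand.append((l_mis(x, edges, beta), j))
        x[t] = min(cand)[1]
        losses.append(l_mis(x, edges, beta))
    return x, losses

# Toy graph: a triangle {0,1,2} plus a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
x_soft = [0.6, 0.4, 0.7, 0.8]   # a hypothetical relaxed GNN output
x, losses = sequential_round(x_soft, edges, beta=1.1)

print(x)  # -> [0, 1, 0, 1], a feasible independent set
# The loss is monotonically non-increasing along the rounding trajectory,
# matching the analysis in the proof of Theorem 1 (Part 2):
print(all(a >= b - 1e-12 for a, b in zip(losses, losses[1:])))
```

Because $l_{\text{MIS}}$ is multilinear, it is entry-wise linear (hence concave) in each coordinate, so the value at the soft entry $\bar{X}_t$ is a convex combination of the two rounded values and can never fall below their minimum.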

C.3 SEPARATED ALGORITHM TABLES

Lines 3-4 of Alg. 2 perform the meta update: for each $G_i \in B_j$, compute the adapted parameter $\theta_i^{(j)} = \theta^{(j)} - \alpha\nabla_{\theta^{(j)}} l(\theta^{(j)}; G_i)$, and then update $\theta^{(j+1)} \leftarrow \theta^{(j)} - \gamma\nabla_{\theta^{(j)}} \sum_{G_i \in B_j} l(\theta_i^{(j)}; G_i)$.

We run all of the greedy algorithms with Python 3.8 in this paper. A potential way to reduce the time cost of these greedy algorithms is to implement them in C++.

Random Greedy Algorithm for MIS (RGA): RGA takes time linear in the problem size $n$ to reach a solution. It starts from an empty independent set $S$. At each step $1 \leq t \leq n$, a node $i$ is chosen at random from the graph $G_t$ and added to the independent set. Then $i$ and all of its neighbors are removed from $G_t$ to form a new graph $G_{t+1}$. The process iterates until $G_{t^*}$ is empty at some step $t^*$; the solution is $S$.

Degree-based Greedy Algorithm for MIS (DGA): DGA modifies RGA by sorting the nodes by degree before each iteration starts and always adding the node with the smallest degree to the independent set.

Toenshoff Greedy for MC: Toenshoff et al. (2021) convert each testing instance into its complement graph and then run DGA to solve the MIS problem on it. The solution to the MIS problem on the complement graph is taken as the solution to the MC problem on the original graph.

Greedy for MVC: The greedy algorithm for MVC starts from an empty cover set $S$. At each step $1 \leq t \leq n$, it sorts the nodes of the graph $G_t$ by degree and adds the node $i$ with the largest degree to the cover set $S$. Then all edges incident to $i$ are removed from $G_t$ to form a new graph $G_{t+1}$. The process stops when $G_{t^*}$ is empty at some step $t^*$; the solution is $S$.
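For illustration, the two MIS heuristics can be sketched in a few lines of plain Python. This is a simplified re-implementation with our own names and an adjacency-set input format; the released code may differ.

```python
import random

def rga_mis(adj, seed=0):
    """Random Greedy Algorithm (RGA): repeatedly pick a random remaining
    node, add it to the independent set, and delete it and its neighbors."""
    rng = random.Random(seed)
    remaining = set(adj)
    solution = set()
    while remaining:
        i = rng.choice(sorted(remaining))
        solution.add(i)
        remaining -= adj[i] | {i}
    return solution

def dga_mis(adj):
    """Degree-based Greedy Algorithm (DGA): like RGA, but always pick the
    remaining node with the smallest residual degree."""
    remaining = set(adj)
    solution = set()
    while remaining:
        # residual degree counts only neighbors that are still present
        i = min(remaining, key=lambda v: len(adj[v] & remaining))
        solution.add(i)
        remaining -= adj[i] | {i}
    return solution

# Toy graph: a star with center 0 and leaves 1..4; the MIS is the 4 leaves.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(dga_mis(adj))  # -> {1, 2, 3, 4}: DGA prefers the degree-1 leaves
s = rga_mis(adj)
# RGA output is always feasible: no two chosen nodes are adjacent
print(all(v not in adj[u] for u in s for v in s))
```

On the star graph, DGA recovers the optimum, while RGA may pick the center first and return a set of size 1; both are feasible, which matches the roles the two heuristics play as initializations in our experiments.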

D DISCUSSION ON LIMITATIONS

• As mentioned at the end of Sec. 5.1, both EGN and Meta-EGN perform generally well on the MC problem, whose output cliques are more local, in comparison with the vertex covers in the MVC problem that require more global assignments. The random initialization seed, with one node randomly set to 1 and the others to 0, may limit the performance of EGN and Meta-EGN on more global CO tasks.

• We use Meta-EGN and EGN to refine the solutions of DGA and RGA in the MIS problem. In addition, there are many other Monte Carlo algorithms (e.g., simulated annealing and parallel tempering) that can produce better results than DGA or RGA on RRGs (Angelini & Ricci-Tersenghi, 2022). A natural question is whether Meta-EGN could be trained to further fine-tune these more advanced Monte Carlo algorithms for the MIS problem on RRGs. We leave modifying Meta-EGN to better handle CO problems that require global assignments, and using Meta-EGN to improve other advanced Monte Carlo algorithms, as future work.



Our code is available at: https://github.com/Graph-COM/Meta_CO



Figure 1: Approximation rates of different methods in the MIS problem. Meta-EGN and EGN (Karalias & Loukas, 2020) are trained on RRGs with 1000 nodes and with node degrees randomly sampled from {3, 7, 10, 20}. Meta-EGN and EGN are evaluated on larger RRGs with $10^3 \sim 10^5$ nodes. More details about the setting are in Secs. 5.1 and 5.4. Meta-EGN outperforms DGA (Angelini & Ricci-Tersenghi, 2019) by about 0.3%-0.5% in approximation rate on average.

Figure 2: Training/validation dynamics of Meta-EGN and EGN (Karalias & Loukas, 2020) for the MIS problem. Detailed experiment settings follow Sec. 5.1.

Figure 3: Performance vs. the hyperparameter ρ of the RB model.

Figure 4: ApRs in the MIS problem on RRGs. Meta-EGN and EGN are both trained with the output of the Random Greedy Algorithm (RGA) as initialization.

Figure 5: Time cost vs. graph scale. For the MIS problem on large-scale RRGs, Angelini & Ricci-Tersenghi (2022) recently raised a concern about learning-based methods by arguing that PI-GNN (Schuetz et al., 2022) could not achieve results comparable to the heuristic algorithm DGA (Angelini & Ricci-Tersenghi, 2019). We see the reason as improper usage of learning-based methods in Schuetz et al. (2022), as stated in Sec. 1: 1) PI-GNN is trained directly on each single testing instance without learning from a training dataset that contains various graphs, which makes it likely to be trapped in local optima; 2) GNNs generally suffer from a node-ambiguity issue on RRGs. To resolve the problem, we use the outputs of DGA and RGA as the initialization of the GNN inputs (EGN, Meta-EGN) and expect the models to learn heuristics from historical data to further tune the solutions given by the greedy algorithms. We train the GNN models on RRGs with 1000 nodes and node degrees randomly sampled from {3, 7, 10, 20}, and test on larger RRGs (up to $10^5$ nodes). Experiments show that Meta-EGN can further improve DGA (Fig. 1) and RGA (Fig. 4), while EGN fails to further tune DGA. Note that EGN and Meta-EGN here adopt exactly the same backbone. We attribute the improvement to the meta-learning-based training adopted by Meta-EGN. See Table 7 in the Appendix for more details on the numerical improvement by Meta-EGN. We also see that in these cases, one-step fine-tuning does not contribute much to the performance of EGN or Meta-EGN, indicating that the model before fine-tuning is already very close to a local minimum.

SUPPLEMENTARY TIME COST VS. GRAPH SCALE IN THE MIS

We show the results for degrees 3, 7, and 10 in Fig. 6 below. They show the same time-cost vs. scale relation as that in Fig. 5. The extra time cost of the GNN is O(|E|) for inference plus O(|V|) for rounding, which is of the same order as DGA.

Figure 6: Time cost vs. graph scale for degrees 3, 7, and 10.

mini-batch training along these batches and optimizes over each mini-batch. In contrast, for each training epoch Meta-EGN randomly samples a single batch and runs the meta-learning algorithm on that batch. The batch sizes of the two methods are kept the same.

Algorithm 2 Train Meta-EGN
Require: Training instances Ξ = {G_1, G_2, ..., G_m}; Hyperparameters: α, γ.
1: Randomly initialize θ^(0)
2: for each randomly sampled mini-batch B_j ⊂ Ξ, j = 0, 1, ..., K − 1 do ▷ Training starts
3: For each G_i ∈ B_j, compute the adapted parameter: θ_i^(j) = θ^(j) − α∇_{θ^(j)} l(θ^(j); G_i)
4: Update: θ^(j+1) ← θ^(j) − γ∇_{θ^(j)} Σ_{G_i∈B_j} l(θ_i^(j); G_i)

5: end for
6: return θ ← θ^(K) ▷ Training ends

Algorithm 3 Test Meta-EGN with/without Fine-tuning
Require: Testing instance G′; Hyperparameter: α; Pre-trained parameter initialization θ.
1: For a given testing instance G′: ▷ Testing starts
2: if fine-tuning is allowed then
3: Fine-tune the parameters: θ_{G′} ← θ − α∇_θ l(θ; G′)
4: Use Def. 1 to round the relaxed solution given by A_{θ_{G′}}(G′) ▷ With fine-tuning
5: else
6: Use Def. 1 to round the relaxed solution given by A_θ(G′) ▷ Without fine-tuning
7: end if
8: return the rounded solution ▷ Testing ends

C.4 IMPLEMENTATION OF THE HEURISTICS

Algorithm 1 Train Meta-EGN and Test Meta-EGN with/without Fine-tuning
Require: Training instances Ξ = {G_1, G_2, ..., G_m}; Hyperparameters: α, γ.

Comparison between different unsupervised frameworks. $G$ denotes the test instance and $G_i$, 1 ≤ i ≤ m, are training instances. The standard EGN pipeline does not adopt any fine-tuning.

ApR (time: seconds/graph) on the MC problem. ApR is the larger the better. 'report' denotes the performance reported in Karalias & Loukas (2020), 're-impl' denotes our re-implementation, and 'f-t' stands for fine-tuning. Pareto-optimal results are in bold.

ApR (time: seconds/graph) on the MVC problem. ApR is the smaller the better. 'f-t' stands for one-step fine-tuning. Pareto-optimal results are in bold.

Generalization from Twitter to RB2000 on the MC and MVC. Pareto-optimal results are in bold.

Improvement of Meta-EGN over DGA and RGA in the MIS problem on RRGs. 'Imp in ApR' denotes the average improvement in approximation rate, and 'Imp in #Node' denotes the average number of additional nodes that Meta-EGN finds compared with the heuristics. (The 'Imp in ApR' / 'Imp in #Node' column pair repeats once per tested degree.)

TRAINING THE MODELS ON SUBSETS OF THE TRAINING DATA

We display the average approximation rates of models trained on only subsets of the original training data for the max clique problem on Twitter. The training subsets are randomly sampled from the original training dataset, and the testing dataset remains the same as in Table 3. Both methods perform worse as the number of training instances decreases, while Meta-EGN shows only a 0.6% performance drop from the full-size training dataset with 695 samples to a training subset with only 64 instances. In contrast, EGN's performance drops by 1.7%.

The number of instances in each dataset. '20/scale/degree' means that we generate 20 testing instances for each different scale-degree pair. We generate RB1000, RB2000, and RB5000 only for testing.

We separate the algorithm table of Meta-EGN into training and testing parts to make it clearer: the training procedure is shown in Alg. 2 and the testing procedure in Alg. 3.

ACKNOWLEDGEMENTS

We would like to express our deepest appreciation to Dr. Tianyi Chen for the insightful discussion on the meta-learning framework from a theoretical aspect and to Dr. Ruqi Zhang for the constructive advice on the fine-tuning strategies. We would also like to extend our deepest gratitude to Dr. Hanjun Dai and Dr. Jialin Liu for sharing their invaluable insights into the general ideas of learning for combinatorial optimization. Many thanks also to our funding sources: H. Wang and P. Li are partially supported by a 2021 JPMorgan Faculty Award and NSF award OAC-2117997.


We train the EGN and Meta-EGN models on RRGs with degree only 3 or only 20, and test them on RRGs with the remaining degrees in {3, 7, 10, 20}. The models take the output of DGA as the initialization of the graph node features. We show the performance of both models without fine-tuning in Fig. 7. When trained only on RRGs with degree 3 (the left two figures in Fig. 7), neither model generalizes well, as neither outperforms the DGA initialization; note that Meta-EGN still achieves better performance than EGN in this case. As for the models trained only on RRGs with degree 20 (the right two figures in Fig. 7), we observe that both models have relatively good generalization across degrees, and Meta-EGN still marginally outperforms EGN. We attribute this to the fact that solving the MIS on RRGs with degree 20 is much harder than on those with degree 3, and may therefore provide adequate heuristics for solving RRGs with lower degrees.

B.5 COMPARISON ON THE TRAINING TIME OF THE MODELS

We display the wall-clock training time for the two methods to converge in Table 9 (from the start to the best epoch on the validation set). We observe that Meta-EGN generally takes two to three times as long as EGN to converge, but the training time costs remain on the same order of magnitude. All code runs on the PyTorch platform 1.9.0 (Paszke et al., 2019) and the PyTorch Geometric framework 1.7.2 (Fey & Lenssen, 2019). The details of each dataset are shown in Table 10; all of the real-world datasets are publicly available, and we follow the code in Toenshoff et al. (2021) to generate the RB model. The real-world dataset split follows that in Karalias & Loukas (2020).

