LEARNING COMBINATORIAL NODE LABELING ALGORITHMS

Abstract

We present the combinatorial node labeling framework, which generalizes many prior approaches to solving hard graph optimization problems by supporting problems where solutions consist of arbitrarily many node labels, such as graph coloring. We then introduce a neural network architecture to implement this framework. Our architecture builds on a graph attention network with several inductive biases to improve solution quality and is trained using policy gradient reinforcement learning. We demonstrate our approach on both graph coloring and minimum vertex cover. Our learned heuristics match or outperform classical hand-crafted greedy heuristics and machine learning approaches while taking only seconds on large graphs. We conduct a detailed analysis of the learned heuristics and architecture choices and show that they successfully adapt to different graph structures.

[Figure 1 (left): Comparison of node labeling with other frameworks. No labels. Set (Dai et al. 2017; Li et al. 2018; Manchanda et al. 2020; Barrett et al. 2020), example tasks: max cut, MVC (§4.2), max clique. Permutation (Bello et al. 2017; Dai et al. 2017; Joshi et al. 2019; Kool et al. 2019; Cappart et al. 2020; Drori et al. 2020; Ma et al. 2020), example tasks: TSP, longest paths. Node labeling (ours), example tasks: graph coloring (§4.1), min k-cut, clique cover.]

(Dai et al., 2017). We add a label assignment step, allowing us to solve new problems. Further, the average time for picking the next vertex is significantly reduced, such that the total number of arithmetic operations is now linear in the size of the graph. We evaluate our approach (§4) and demonstrate significantly improved performance for neural graph coloring (GC) and find near-optimal solutions for minimum vertex cover (MVC). We additionally study the runtime of our models, conduct comprehensive ablation studies, and provide qualitative analyses of the learned heuristics, showing they adapt to the properties of the input graph.

Related work. We now review key related works. Figure 1 (left) provides a comparison of node labeling with other frameworks.

Supervised learning. The fundamental downsides of supervised learning for combinatorial optimization are twofold. First, it can be difficult to formulate a problem in a supervised manner, since it might have many optimal solutions (e.g., GC). Second, even if the problem admits a direct supervised formulation, we still need labeled data for training, which can be hard to generate and relies on an existing solver. In particular, supervised learning cannot easily be used for problems that have not been studied before. The advantages of supervised learning are its sample efficiency and that it can lead to overall better results. Recent approaches like Manchanda et al. (2020) and Joshi et al. (2019) obtain good results for influence maximization (IM) and the traveling salesman problem (TSP), respectively.
Both approaches use supervised learning. For IM, the approach of Manchanda et al. (2020) shows promising results on graphs much larger than those seen in training. For TSP, the approach of Joshi et al. (2019) is very efficient but does not generalize well to graphs larger than those seen in training. Li et al. (2018) also use supervised learning and produce good results on minimum vertex cover (MVC), maximum independent set, and maximal clique.

Unsupervised learning. To apply unsupervised learning, it is necessary to formulate a differentiable surrogate loss. There have been approaches for several specific combinatorial optimization problems (Nazi et al., 2019; Amizadeh et al., 2019; Tönshoff et al., 2019; Yao et al., 2019), and there has been progress toward a framework for deriving trainable losses (Karalias & Loukas, 2020). Still, significant insight into a problem is required to design suitable loss functions.

Reinforcement learning (RL). Using RL only requires a way to represent partial solutions and a way to score the cost of a (partial or final) solution. Dai et al. (2017) provide S2V-DQN, a general framework for learning problems like MVC and TSP that is trained with RL. It shows good results across different graph sizes for the covered problems, but it is not fast enough to replace existing approaches, nor does it handle arbitrary node labels (see Fig. 1). Kool et al. (2019) focus on routing problems like TSP and the vehicle routing problem. They outperform Dai et al. on TSP instances of the training size. Unfortunately, their approach does not seem to generalize to graph sizes that are very different from those used for training. Several other RL approaches have been proposed and evaluated for TSP (Bello et al., 2017; Cappart et al., 2020; Drori et al., 2020; Ma et al., 2020). Barrett et al. (2020) consider the maximum cut (MaxCut) problem. Huang et al. (2019) present a Monte Carlo tree search approach specialized only for graph coloring.
These methods do not address the general node labeling framework, but instead model the solution as a permutation of vertices (e.g., TSP, vehicle routing) or a set of nodes or edges (e.g., MVC, MaxCut). Instead, we can represent solutions where vertices are assigned to an unknown and unbounded number of partitions, which is crucial for solving tasks such as graph coloring.

2. COMBINATORIAL NODE LABELING

Many graph heuristics can be phrased as a greedy process, where vertices get assigned a problem-dependent label one after the other. For example, this label could indicate if the vertex is part of the solution set, its position in a permutation, or its membership in one of many sets. We introduce combinatorial node labeling, which frames many hard graph optimization problems, such as graph coloring (see §D for a list), as a greedy process. This generalizes previous work (Kool et al., 2019; Dai et al., 2017; Ma et al., 2020; Drori et al., 2020) to encompass problems where the number of labels is not known in advance and is unbounded (see Fig. 1). Every node labeling problem can be formulated as a (finite) Markov decision process (MDP), during which nodes are successively added to a so-called partial node labeling until a termination criterion is met. In §3, we will present a graph learning approach to optimizing such node labeling MDPs.

2.1. PRELIMINARIES

We consider an undirected, unweighted, and simple graph $G = (V, E)$ with $n$ nodes in $V$ and $m$ edges in $E$. We denote the neighbors of a node $v$ by $N(v)$. We assume w.l.o.g. that the graph is connected and hence $m = \Omega(n)$.

A node labeling is a function $c : V \to L$, where $L \subseteq \{0, \dots, n\}$. A partial node labeling is a function $c' : V' \to L'$ for a subset of nodes $V' \subseteq V$ and labels $L' \subseteq L$. A node labeling problem is subject to a feasibility condition and a real-valued cost function $f$. The cost function maps a node labeling $c$ to a real-valued cost $f(c)$. We require that the feasibility condition be expressed in terms of an efficient (polynomial-time computable) extensibility test $T : \mathcal{P}(V \times L) \times V \times L \to \{0, 1\}$, where $\mathcal{P}$ denotes the powerset. We say the extensibility test passes when it returns 1. Intuitively, given a partial node labeling $c'$, a node $v \in V$, and a label $\ell$, the extensibility test passes if and only if $c'$ can be extended by labeling node $v$ with $\ell$ such that the result can be extended into a node labeling. Formally, the extensibility test characterizes the set of feasible solutions:

Definition 2.1. A node labeling $c$ is feasible if and only if there exists a sequence of node-label pairs $(v_1, \ell_1), \dots, (v_n, \ell_n)$ such that for all $0 \le i < n$ the extensibility test $T$ satisfies $T(\{(v_1, \ell_1), \dots, (v_i, \ell_i)\}, v_{i+1}, \ell_{i+1}) = 1$.

The goal of the node labeling problem is to minimize the value of the cost function among the feasible node labelings. For consistency, an infeasible node labeling has infinite cost. Next, we present the two node labeling problems on which we focus in our evaluation.

Definition 2.2. A $k$-coloring of a graph $G = (V, E)$ is a node labeling $c : V \to \{1, 2, \dots, k\}$ such that no two neighbors have the same label, i.e., $\forall \{u, v\} \in E : c(u) \neq c(v)$. The cost function for GC is the number of distinct labels (or colors) $k$. Given a partial node labeling $c' : V' \to \{1, \dots, k\}$ and any vertex-label pair $(v, \ell)$, the extensibility test passes for $(c', v, \ell)$ if and only if the extended partial node labeling $c' \cup (v, \ell)$ is a $k$- or $(k + 1)$-coloring of the induced subgraph $G[V' \cup \{v\}]$. In particular, the test does not pass when $\ell > k + 1$. The smallest $k$ for which there is a $k$-coloring of $G$ is the chromatic number $\chi(G)$ of $G$.

Definition 2.3. A vertex cover of a graph $G = (V, E)$ is a node labeling $c : V \to \{0, 1\}$ such that every edge is incident to at least one node with label 1, i.e., $\forall \{u, v\} \in E : c(u) = 1 \lor c(v) = 1$. The cost function for MVC is the number of nodes with label 1. Given a partial node labeling $c' : V' \to \{0, 1\}$, the extensibility test passes for $(c', v, \ell)$ if and only if the extended partial node labeling $c' \cup (v, \ell)$ is a vertex cover of the induced subgraph $G[V' \cup \{v\}]$.
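The two extensibility tests of Definitions 2.2 and 2.3 can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the adjacency-set representation, the function names `gc_extensible` and `mvc_extensible`, and the assumption that the given partial labeling is itself already feasible are all ours.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # adjacency sets of an undirected simple graph


def gc_extensible(adj: Graph, partial: Dict[int, int], v: int, label: int) -> bool:
    """Extensibility test for graph coloring (colors are numbered from 1)."""
    if v in partial:
        return False  # assumption: each node is labeled at most once
    k = max(partial.values(), default=0)  # number of colors used so far
    if label < 1 or label > k + 1:
        return False  # at most one new color may be opened per step
    # No already-colored neighbor of v may carry the same color.
    return all(partial.get(u) != label for u in adj[v])


def mvc_extensible(adj: Graph, partial: Dict[int, int], v: int, label: int) -> bool:
    """Extensibility test for vertex cover (label 1 means 'in the cover')."""
    if v in partial or label not in (0, 1):
        return False
    if label == 1:
        return True  # adding a node to the cover never uncovers an edge
    # Label 0 is allowed only if every labeled neighbor is in the cover.
    return all(partial[u] == 1 for u in adj[v] if u in partial)
```

Because the partial labeling is assumed feasible, both tests only inspect the edges incident to the new node, so each call costs time proportional to the degree of that node.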

2.2. NODE LABELING MDP

We show how to construct an MDP that models a given combinatorial node labeling problem. Minimizing the cost of the combinatorial node labeling problem is equivalent to maximizing the return of this MDP. In the vast majority of reinforcement learning approaches to solve combinatorial graph optimization problems (Kool et al., 2019; Dai et al., 2017; Ma et al., 2020; Drori et al., 2020), a state corresponds to a set or sequence of nodes that are already added to a solution set. Instead, in our setting the state represents a partial node labeling. This is why, in addition to problems like MVC and TSP, we can also model problems with more than two labels (even when the number of labels is not known in advance). Graph coloring is such a problem.

Lemma 2.4. For any node labeling problem, there is an MDP whose terminal states correspond to the feasible solutions with a cost equal to the negative return.

We embed the cost function $f$ and the extensibility test into the MDP. Note that we do not require a way to measure the cost of partial node labelings. Here, we formulate the state space, action space, transition function, and reward. In §C.1, we finish the proof of Lemma 2.4.

State space.

A state $S$ represents a partial node labeling. It is a set of pairs $S \subseteq V' \times L'$ for a subset of nodes $V' \subseteq V$ and a subset of labels $L' \subseteq \{0, \dots, n\}$. A state is terminal if $V' = V$. Hence, the set of states is the powerset $\mathcal{P}(V \times \{0, \dots, n\})$ of the Cartesian product of the vertices and labels.

Action space. In state $S$, the set of legal actions consists of the pairs $(v, \ell)$ of nodes $v$ and labels $\ell$ that pass the extensibility test of the problem for the partial node labeling given by $S$ (i.e., $T(S, v, \ell) = 1$).

Transition function. In our case, the transition function $T$ is deterministic. That is, given the current state $S_t$ and an action $(v, \ell)$, $T(S_t, (v, \ell))$ yields the next state $S_{t+1} = S_t \cup \{(v, \ell)\}$.

Reward. For a terminal state $S$ representing the node labeling $c$, the reward is $-f(c)$. For all other states, the reward is 0.

A policy is a mapping from states to probabilities for each action. Note that we can turn a probabilistic policy into a deterministic greedy policy by choosing the action with the largest probability. Next, we present how to train such a policy end-to-end using policy gradients.

3. GRAPH LEARNING APPROACH

We present a graph learning approach to node labeling, which is inspired by greedy algorithms. Greedy approaches generally trade optimality for improved runtime. A greedy node labeling algorithm assigns a label in $\{0, \dots, n\}$ to one node after another based on a problem-specific heuristic. Hence, it can be seen as providing (1) an order on the nodes and (2) a rule to label the next selected node. We focus on learning an order on the nodes and pick a label that passes the extensibility test according to a fixed rule. The following two lemmas show there exists a label assignment rule that ensures the optimal solution can be found for GC and MVC (see §C.2 for the proofs):

Lemma 3.1. For every graph $G$, there exists an ordering of vertices for which choosing the smallest color that passes the extensibility test colors $G$ optimally.

Lemma 3.2. For every graph $G$, there exists an ordering of vertices for which choosing the label 1 until every edge in $G$ has one of its endpoints labeled with 1 produces a minimum vertex cover of $G$.

We expect similar results can be obtained for most other node labeling problems. Instead of a handcrafted ordering heuristic, we learn to assign weights to each node and choose the nodes according to their weights. To compute these weights, we introduce a novel spatial locality inductive bias: labeling a node should only affect the weights of its neighbors. As we will show in §4.3, this leads to better test scores compared to the alternatives of updating all or none of the weights when a node is labeled. This spatial locality bias is inspired by successful greedy heuristics. The ListRight heuristic for MVC (Delbot & Laforest, 2008) assigns a node to the vertex cover based on the assignment of its neighbors. For GC, the DSATUR strategy selects nodes according to their saturation degree (Brélaz, 1979). If a new node is selected, only the saturation degree of its neighborhood can change; the others remain unchanged.
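The fixed label rules of Lemmas 3.1 and 3.2 can be illustrated with a short sketch that takes the node ordering as given (in the paper the ordering is what the model learns). The function names and the adjacency-set representation are our own assumptions.

```python
from typing import Dict, Sequence, Set

Graph = Dict[int, Set[int]]  # adjacency sets of an undirected simple graph


def color_in_order(adj: Graph, order: Sequence[int]) -> Dict[int, int]:
    """Give each node, in the given order, the smallest color unused by its neighbors."""
    color: Dict[int, int] = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c  # never exceeds (colors used so far) + 1
    return color


def cover_in_order(adj: Graph, order: Sequence[int]) -> Dict[int, int]:
    """Label nodes 1 in the given order until every edge is covered, then label 0."""
    uncovered = {frozenset((u, v)) for u in adj for v in adj[u]}
    label: Dict[int, int] = {}
    for v in order:
        if uncovered:
            label[v] = 1
            uncovered -= {frozenset((u, v)) for u in adj[v]}
        else:
            label[v] = 0
    return label
```

On the path 0-1-2-3, the order [0, 1, 2, 3] yields two colors while [0, 3, 1, 2] yields three; on a star with center 0, starting with the center yields a cover of size one while starting with the leaves yields size three. The label rule is fixed, so solution quality depends entirely on the ordering.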

3.1. POLICY OPTIMIZATION

We train our node labeling model by policy gradients, specifically REINFORCE (Bello et al., 2017) with a greedy rollout baseline (Kool et al., 2019). The advantage of policy gradients over Q-learning is that it has stronger convergence guarantees (Sutton & Barto, 2018). At a high level, the algorithm works as follows. We begin by initializing two models, the current model and the baseline model. For each graph in the batch, the algorithm performs a probabilistic rollout of the policy. The baseline model performs a greedy rollout. The difference between the two costs determines the policy gradient update. After every epoch, we perform a (one-sided) paired t-test over the cost on a challenge dataset to check if the baseline model should be replaced with the current model. See §A.2 for more details.

[Figure 3: We show how the state embedding is updated as nodes are colored. The state embedding focuses on the last labeled node, and contains the graph embedding, and embeddings of the last labeled node and its label, which pools the embeddings of nodes with the same label.]
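The role of the greedy rollout baseline can be made concrete with a small sketch of the surrogate objective whose gradient gives the policy-gradient update. This is an assumption-laden illustration: it works on plain floats, whereas in an actual training loop the log-probabilities would be differentiable tensors from an autograd framework, and the function name is ours.

```python
from typing import Sequence


def reinforce_surrogate(sample_costs: Sequence[float],
                        baseline_costs: Sequence[float],
                        log_probs: Sequence[float]) -> float:
    """Batch mean of (sampled cost - greedy baseline cost) * log-probability.

    Minimizing this value with respect to the policy parameters lowers the
    probability of rollouts that were worse than the greedy baseline and
    raises the probability of rollouts that were better.
    """
    if not (len(sample_costs) == len(baseline_costs) == len(log_probs)):
        raise ValueError("all inputs must have one entry per graph")
    if not sample_costs:
        raise ValueError("batch must not be empty")
    total = 0.0
    for cost, base, log_p in zip(sample_costs, baseline_costs, log_probs):
        advantage = cost - base  # > 0: the sample was worse than the greedy rollout
        total += advantage * log_p
    return total / len(sample_costs)
```

A graph on which the sampled rollout ties with the greedy rollout contributes nothing to the update, which is one reason a strong baseline reduces the variance of the gradient estimate.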

3.2. GAT-CNL ARCHITECTURE

Our architecture, GAT-CNL, consists of an encoder and a decoder to learn a policy specific to the node labeling problem. The encoder learns the local structural information that is important for the problem in the form of a node embedding. The state embedding encapsulates information about the graph itself (enabling the network to adapt its actions to the graph), the last node that was labeled and its label, and a summary of prior actions (with pooling). This enables the state embedding to have constant size; adding additional nodes provided no benefit (see §4.3). This also serves as a temporal locality bias; however, note that the decoder is not Markovian, as it depends on more than just the previous decision. The decoder uses the node embeddings and the state embedding to select the next node based on attention weights between the node embeddings and the state embedding. After the decoder picks the next node $v$, the label rule (see Lemmas 3.1 and 3.2) assigns the label $\ell$ for the node. The policy then takes the action $(v, \ell)$. Then, the state embedding is updated and the decoder is invoked again until all nodes are labeled. Figure 1 (right) overviews our architecture.

Node features. Each node $v$ is associated with an input feature vector $x_v$. Our input features consist of a combination of sine and cosine functions of the node degree, similar to positional embeddings (Vaswani et al., 2017). This representation ensures that input features are bounded in magnitude even for larger graphs. We subtract the mean node degree from the degrees on the synthetic dense graph instances.

GAT encoder. We use a hidden dimension of size $d$ (unless stated otherwise, $d = 64$). The input features are first linearly transformed and then fed into a GNN, which produces, for each node $v$, a node embedding $h_v \in \mathbb{R}^d$.
We use a three-layer Graph Attention Network (GAT) (Vaswani et al., 2017; Velickovic et al., 2018; Lee et al., 2019), additive multi-head attention with four heads, batch normalization (Ioffe & Szegedy, 2015) with a skip connection (He et al., 2016) at each encoder layer, and leaky ReLU activations (Maas et al., 2013).

State embedding. The state embedding allows the decoder to condition its choice based on the graph instance and the partial node labeling. For computational reasons, we ensure it is of constant size. Denote the node labeled in step $t$ by $v^{(t)}$ and its label by $\ell^{(t)}$. Then the state embedding consists of three components concatenated together: (1) The graph embedding $h_G$, a max-pooling over all node embeddings. (2) The node embedding $h_{v^{(t-1)}}$ of the last labeled node $v^{(t-1)}$. (3) The label embedding $h_{\ell^{(t-1)}}$ of the last labeled node's label $\ell^{(t-1)}$, a max-pooling over the embeddings of all nodes with that label. In the first iteration, we use a learned parameter $h^{(0)}$ for (2) and (3). We considered including more than just the last labeled node, but found that this led to worse performance (§4.3). Hence, this induces a temporal bias by focusing on the prior node and nodes with the same label as the last labeled node. See Figure 3 for an illustration of the state embedding.

Local attention decoder. The decoder takes as input the node embeddings generated by the encoder and the state embedding and outputs the next node to label. In each time step $t$, an attention mechanism between the state embedding $g_t$ and each node embedding $h_v$ produces attention weights $a_v^{(t)}$. Here, we introduce a spatial locality bias: labeling a node can only affect the attention scores of its neighbors in the next time step. Let $V'$ be the set of nodes already labeled. The attention weight $a_v^{(t)}$ for node $v$ in time step $t$ is given by the local decoding. For a node $v \notin V'$:

$$a_v^{(t)} = \begin{cases} C \cdot \tanh\left( \dfrac{(\Theta_1 g_t)^\top (\Theta_2 h_v)}{\sqrt{d}} \right) & \text{if } v \in N(v^{(t-1)}) \text{ or } t = 0, \\ a_v^{(t-1)} & \text{if } v \notin N(v^{(t-1)}). \end{cases}$$
If $v \in V'$, then the attention weight is $a_v^{(t)} = -\infty$. In the first iteration of the decoder, we calculate the coefficients for each node in the graph. As in Bello et al. (2017), we clip the attention coefficients within a constant range $[-C, C]$. In our experiments we set $C = 10$. The learnable parameter matrices are $\Theta_1 \in \mathbb{R}^{d \times 3d}$ and $\Theta_2 \in \mathbb{R}^{d \times d}$. We use scaled dot-product attention (Vaswani et al., 2017) (instead of additive attention) to speed up the decoding. Finally, we apply a softmax over all attention weights to obtain, for each node $v$, the probability $p_v$ that node $v$ is labeled next. See Figure 2 for a visualization of the attention weight computation during decoding.

During inference, our greedy policy selects the vertex with maximum probability. Our sampling policy (for $k$ samples) runs the greedy policy once, then evaluates the learned probabilistic policy $k$ times (selecting a node $v$ with the learned probability $p_v$), returning the best result.
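One decoding step with the spatial locality bias can be sketched in plain Python as follows. This is a sketch under our own assumptions, not the authors' code: embeddings are lists of floats, the parameter matrices are nested lists, and the names `score`, `local_update`, and `softmax` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Set

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def score(g: Vector, h: Vector, theta1: Matrix, theta2: Matrix, C: float = 10.0) -> float:
    """Clipped scaled dot-product: C * tanh((theta1 g)^T (theta2 h) / sqrt(d))."""
    q = [sum(w * x for w, x in zip(row, g)) for row in theta1]
    k = [sum(w * x for w, x in zip(row, h)) for row in theta2]
    return C * math.tanh(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))


def local_update(weights: Dict[int, float], adj: Dict[int, Set[int]],
                 labeled: Set[int], last: Optional[int], g: Vector,
                 h: Dict[int, Vector], theta1: Matrix, theta2: Matrix,
                 C: float = 10.0) -> Dict[int, float]:
    """Recompute weights only for unlabeled neighbors of the last labeled node.

    With `last` set to None (first step), every node is scored.
    """
    new = dict(weights)
    for v in h:
        if v in labeled:
            new[v] = -math.inf  # labeled nodes can never be picked again
        elif last is None or v in adj[last]:
            new[v] = score(g, h[v], theta1, theta2, C)
        # all other nodes keep their weight from the previous step
    return new


def softmax(weights: Dict[int, float]) -> Dict[int, float]:
    """Probabilities over nodes; assumes at least one node is still unlabeled."""
    m = max(weights.values())
    exps = {v: math.exp(w - m) for v, w in weights.items()}  # exp(-inf) is 0.0
    z = sum(exps.values())
    return {v: e / z for v, e in exps.items()}
```

After the first step, each call scores only the neighbors of the last labeled node, so the total number of score evaluations over a full decoding run is proportional to the number of edges rather than to the square of the number of nodes.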

3.3. NUMBER OF OPERATIONS

We express the number of operations (arithmetic operations and comparisons) of the model during inference parameterized by the embedding dimension $d$, the number of nodes $n$, and the number of edges $m$. The encoder uses $O(dm + d^2 n)$ arithmetic operations and the decoder uses $O(d^2 m)$ arithmetic operations, resulting in $O(dm + d^2 n + d^2 m)$ arithmetic operations, which is linear in the size of the graph. To select the action of maximum probability (or sample a node), the decoder additionally needs $O(n^2)$ comparison operations (although this could be reduced to $O(m \log n)$ with an appropriate data structure). We empirically study the runtime in §§4.1 and 4.2; in practice, the $d^2 m$ term dominates the runtime for the evaluated graphs of up to 5000 vertices. In contrast, updating all attention weights after every labeling scales as $O(n^3)$ (see §B.5).

4. EXPERIMENTS

We evaluate our approach on established benchmarks for graph coloring and minimum vertex cover, including greedy baselines and machine learning approaches. We focus on other heuristic approaches that return an approximation in polynomial time.

Training. We use three different synthetic graph distributions to generate instances for training and validation (Albert & Barabási, 2002; Erdős & Rényi, 1960; Watts & Strogatz, 1998). We generate 20,000 graphs for training. The graphs have between 20 and 100 nodes. We use Adam with learning rate $\alpha = 10^{-4}$ (Kingma & Ba, 2015). The effective batch size is $B = 320$, which comes from using batches of 64 graphs for each node count $n$ and accumulating their gradients. We clip the L2 norm of the gradient to 1, as done in Bello et al. (2017). We selected these hyperparameters after initial experiments on the validation set. Each model took 15-20 CPU compute node hours to train on a cluster with Intel Xeon E5-2695 v4 and 64 GB memory per node. The overall time spent training was less than 2500 CPU node hours and the time spent on validation and testing was less than 300 CPU node hours. We train each model for 200 epochs with five random seeds and report the standard deviation $\sigma$ of cost w.r.t. the random seeds as $\pm\sigma$. See §A for more details.

Test scores. In addition to mean cost, we report the ratio of the solution cost to the optimal solution cost (approximation ratio). For large graphs, this cannot be computed exactly in a timely manner. In this case, we use the best solution found by an ILP solver within a compute time of one hour. To compare with baselines which return infeasible solutions (and hence have ill-defined cost), we report the percentage of wins (ties for first place count as wins) and the percentage of instances solved optimally. We refer to these metrics as "Wins" and "Optimal", respectively. We use the model with the lowest cost to compute these percentages.

Runtime. We benchmark on a c2d-standard-4 Google Cloud instance with 4 vCPUs and 16 GB RAM.

4.1. GRAPH COLORING

Greedy baselines. Largest-First greedily colors nodes in decreasing order of degree. Smallest-Last (Matula & Beck, 1983) colors the nodes in reverse degeneracy order, which guarantees that when a node is colored, it will have the smallest possible number of neighbors that have been already colored. Smallest-Last guarantees a constant number of colors for certain families of graphs, such as Barabási-Albert graphs (Albert & Barabási, 2002) and planar graphs (Matula & Beck, 1983). DSATUR (Brélaz, 1979) selects nodes based on the largest number of distinct colors in their neighborhood. DSATUR is exact on certain families of graphs, e.g., bipartite graphs (Brélaz, 1979). We use the implementations from NetworkX (Hagberg et al., 2008).

Machine learning baseline. We compare our approach with the chromatic number estimator of Lemos et al. (2019). It does not guarantee that the solution is feasible, meaning that it can both under- and overestimate the chromatic number. We use the values reported by the original paper. Note that S2V-DQN (Dai et al., 2017) cannot solve GC because of the way it embeds the state.

COLOR benchmark (Table 1). We evaluate on the same subset of the COLOR02/03 benchmark (Col, 2002) as Lemos et al. (2019), consisting of 20 instances of size between 25 and 561 nodes. Our greedy policy outperforms both Largest-First and Smallest-Last and is tied with DSATUR for the most graphs solved optimally. When sampling (100 samples) to evaluate the policy, our model outperforms all baselines in both mean cost and win percentage and is tied for the most graphs solved optimally. The approximation ratio is 1.25 and 1.13 for our greedy and sampling policies, respectively.

Results on classic graphs. We also trained our model on four families of sparse graphs: cycles, wheels, random trees, and stars. We trained on graphs up to 400 nodes and evaluated on graphs up to 10,000 nodes. The produced colorings are optimal or extremely close to optimal for all four families (Table 3).
As our model works perfectly on cycles and wheels, we conclude that the model learns to leverage local graph structure and works even when all nodes have the same degree and are completely symmetrical.

[Figure 5: Example covers from our learned heuristic. Nodes with a bold border are in the cover. Numbers indicate the labeling order. Once a cover is found, the order is irrelevant.]

[...] picks higher-degree, centrally located nodes first. However, if several nodes have the same degree, it favors coloring neighboring nodes consecutively. This happens in the WS graphs, see Figure 4b. The learned heuristic can consistently color the WS graphs with 4 colors, which matches the Smallest-Last heuristic. We conclude that the learned heuristic captures complex aspects of the graph extending beyond simple degree-based decisions and considers the graph's local neighborhood structure.

Runtime (Figure 6). We compare the runtime of our approach with the classical baselines. As we did not have access to the code of Lemos et al. (2019), we could not compare the runtime directly. Our approach is faster than DSATUR for graphs larger than 640 nodes and scales much better. As expected from §3.3, the runtime of our algorithm grows linearly with the size of the graph, similar to the simpler baselines such as Largest-First and Smallest-Last, which have better constant factors.

4.2. MINIMUM VERTEX COVER

Classic baselines. We compare with two classic algorithms. First, we use the endpoints of a maximal matching, which produces a 2-approximation (Papadimitriou & Steiglitz, 1982). Second, we compare with ListRight (Delbot & Laforest, 2008), a $\left(\frac{\sqrt{\Delta}}{2} + \frac{3}{2}\right)$-approximation algorithm for maximum degree $\Delta$.

Machine learning baselines. S2V-DQN is a Q-learning based approach (Dai et al., 2017). We use the values reported in the original paper. Li et al. (2018) present a tree-search based approach trained in a supervised way. In contrast to S2V-DQN, it samples multiple solutions, then verifies if they are feasible. The time to construct a feasible solution varies depending on the instance. We use the publicly available code and pretrained model from the authors. We run Li et al.'s code until it finds a feasible solution, and sample more solutions if it is below the time budget of 30 seconds per graph.

Results on in-distribution graphs. We evaluate and compare our approach for MVC with S2V-DQN (Dai et al., 2017) and Li et al. (2018) on the same dataset of generated graphs as Dai et al. (2017). It consists of 16,000 graphs from two distributions, Erdős-Rényi (ER) (Erdős & Rényi, 1960) and Barabási-Albert (BA) (Albert & Barabási, 2002), of sizes varying from 20 to 600 nodes. We use the results reported by Dai et al. (2017) on their model trained on 40-50 nodes, except for the graphs with fewer than 40 nodes, for which no data is available for this model. Hence, we use their model trained on 20-40 nodes on these smaller graphs. See Table 2 for the results on ER graphs and §B.3 for results on additional graphs. On ER graphs, our model achieves the lowest average approximation ratio, followed by Li et al. (2018). Li et al. (2018) has the lowest average cost. Note that the lowest approximation ratio and lowest cost need not coincide because the cost grows quickly with graph size, whereas the approximation ratio does not. Our model and Li et al. (2018) outperform the greedy baselines, while S2V-DQN is slightly outperformed by ListRight. In Table 4, we show how the approximation ratio and cost depend on the instance size. As shown in §B.3, on the BA graphs, our model is about 2.3% away from optimal. The two machine learning baselines are slightly less than 1% away from optimal. The greedy baselines are 9%-45% away from optimal.

Qualitative results. Figure 5 shows typical results of our learned MVC heuristics. See §E.2 for more examples. On the ER graphs, we can see that the heuristic does not always start with the highest-degree node. In contrast, on the BA graphs, the heuristic has a strong preference to start with the highest-degree node. Unlike the classic greedy heuristics (and our learned graph coloring heuristic), the learned MVC heuristics seldom pick neighboring nodes consecutively.

4.3. ABLATION STUDIES

Spatial locality. We test the inductive biases we made regarding locality of the decoder by comparing against a decoder variant that never updates the attention weights (static decoding) and a variant that always updates all of the attention weights (global decoding).

Static decoding never recomputes the attention weights. For a node $i$ that is not yet labeled, its weight is:

$$a_i = C \cdot \tanh\left( \frac{(\Theta_1 g_0)^\top (\Theta_2 h_i)}{\sqrt{d}} \right).$$

Static decoding uses $O(d^2 n + m + n^2)$ operations, which are fewer than those of local update decoding when $m \gg d^2 n$. With static decoding, the model is essentially a GNN with a special node-readout function.

Global decoding recomputes the attention weights in each time step $t$. For a node $i$ that is not yet labeled, its weight is:

$$a_i^{(t)} = C \cdot \tanh\left( \frac{(\Theta_1 g_t)^\top (\Theta_2 h_i)}{\sqrt{d}} \right).$$

Global decoding uses $O(d^2 n^2)$ operations, which is at least a $d^2$ factor more than local update decoding for not too dense graphs ($m \ll n^2/d^2$). When there are only two labels (as for MVC), global decoding is very similar to the Kool et al. (2019) model. The difference to Kool et al. (2019) is that they use an additional attention layer to compute a new state embedding in each step.

We train graph coloring models with both static and global decoding on the Lemos et al. (2019) subset of the COLOR challenge graphs (following §4.1). Static and global decoding achieve a mean cost of 10.74 ±0.12 and 10.71 ±0.05, respectively, both worse than when using our inductive bias (Table 1).

Architecture parameters. We varied the size of the context embedding (i.e., the number of nodes and their labels that contribute to it). Increasing the context size does not significantly improve the test score on graph coloring. For graph coloring, a context of size two and three results in a mean cost of 10.49 ±0.12 and 10.42 ±0.12, respectively, for the greedy policy. We varied the number of attention heads (among 1, 2, 4) with a per-head dimension of 16.
For graph coloring, this results in a mean validation cost of 5.29 ±0.02 , 4.98 ±0.01 , and 4.95 ±0.02 , respectively. We therefore use 4 attention heads (hidden dimension 64). In §B.4, we provide additional ablation studies for the encoder.

5. CONCLUSION

We introduced combinatorial node labeling, a framework that generalizes existing approaches to many hard graph problems, and presented a neural network architecture for it, which demonstrates excellent results on both graph coloring and minimum vertex cover problems. This serves as an important step toward replacing hand-crafted heuristics in graph algorithms with learned heuristics tailored to a particular problem and graph structure. There are many avenues for future research. While the node labeling framework is very general, other graph problems may require adjustments to the neural architecture or inductive biases for good performance. In particular, handling weighted graphs and edge labeling problems would be valuable.

Algorithm 1:

    [...]
        for batch in $D_{\text{train}}$ do
            $[\, p_{\theta,i}, \pi_i \sim M_\theta(G_i)$ for $G_i$ in batch $]$    (sample from policy)
            $[\, p^{BL}_{\theta,i}, \pi^{BL}_i \leftarrow M_{\theta^{BL}}(G_i)$ for $G_i$ in batch $]$    (greedy baseline)
            $\nabla_\theta J(\theta) = \frac{1}{B} \sum_{i=1}^{B} \left( L(\pi_i \mid G_i) - L(\pi^{BL}_i \mid G_i) \right) \nabla_\theta \log(p_{\theta,i})$    (policy gradient)
            $\theta \leftarrow \text{ADAM}(\theta, \nabla_\theta J(\theta))$
        end for
        if OneSidedPairedTTest($M_\theta$, $M_{\theta^{BL}}$, $D_{\text{challenge}}$) < 0.05 then    (challenge the baseline)
            $\theta^{BL} \leftarrow \theta$
            $D_{\text{challenge}} \leftarrow$ sample new challenge dataset
        end if
    end for

The complete training procedure is given in Algorithm 1. Algorithm 1 follows from the textbook REINFORCE with a baseline (Sutton & Barto, 2018) by factoring the probability of reaching a terminal state and using that the rewards are 0 in our MDP except when reaching a terminal state. Unlike Kool et al. (2019), we do not use warmup epochs where training starts out with an exponential moving average baseline.

B.1 VALIDATION RESULTS

We compare the cost of the learned heuristic for different parameters of the training. The validation set consists of 600 graphs with n nodes for n ∈ {20, 50, 100, 200, 400, 600}.

B.1.1 GRAPH COLORING

Table 6 shows the validation cost on the three training distributions for all evaluated configurations. With a larger learning rate of $\alpha = 10^{-3}$, the mean validation cost for graph coloring is significantly worse, namely 5.22 ±0.002. A smaller learning rate of $\alpha = 10^{-5}$ leads to a mean validation cost of 5.02 ±0.001, which is slightly worse than the cost of 4.95 ±0.02 for $\alpha = 10^{-4}$.

B.2 RESULTS BY SIZE

Table 8 shows how the cost and approximation ratio of our MVC approach vary with the instance size (on the ER test graphs). Although the approximation ratio grows slightly with instance size, it remains within about 5.5% of optimal for graphs with 500-600 nodes. Table 9 shows how the cost of GC varies on two synthetic distributions of graphs, S-ER and BA. For BA and ER graphs, the cost grows by about one color on the larger graphs.

B.3 ADDITIONAL RESULTS FOR MVC

Results on BA graphs. See Table 10 for the MVC results on BA graphs.

B.4 ABLATION STUDIES FOR THE ENCODER

We varied the number of layers in the encoder, removed the shortcut connections, and removed the normalization. The results are summarized in Table 6. Removing the shortcut connections has a very strong detrimental effect on the validation cost. Removing both shortcuts and normalization deteriorates the cost further. Using 2 layers results in a small but noticeable increase in cost, whereas a single layer has a significantly worse cost.

B.5 RUNTIME SCALABILITY

We evaluated the inference runtime on a c2d-highmem-4 (4 vCPUs, 32 GB RAM) Google Cloud machine (Environment M94 with PyTorch 1.11). Figure 7 shows the runtime scaling of our approach on GC and MVC together with a linear least squares fit. In Figure 7b, we see that for graphs up to around 5000 vertices the runtime of GC inference very closely follows a linear trend. For larger graphs, the runtime grows slightly faster than linear, as shown in Figure 7a. Similar results hold for MVC: it takes less than 0.5 seconds to compute a vertex cover for a graph with 1,000 nodes. Figure 7d and Figure 7c show the distribution of MVC runtimes for graphs with up to 5120 nodes.

B.6 DISCUSSION

It is not surprising that our machine learning approach performs best on in-distribution graphs. Whilst it is desirable to have an approach that generalizes well, if a representative sample of graphs is available for a target application, this does not pose an issue. Moreover, we have shown that the quality degrades gradually when the test distribution differs from the training distribution.

C ADDITIONAL PROOFS

C.1 THE NODE LABELING MDP

Proof of Lemma 2.4. Consider a sequence of actions $(v_1, l_1), \dots, (v_n, l_n)$ ending in a terminal state. For all $t$, the prefix $(v_1, l_1), \dots, (v_t, l_t)$ of this sequence corresponds to a partial node labeling (by viewing the sequence of node-label pairs as describing a function from nodes to labels). By construction of the MDP, labeling node $v_{t+1}$ with $l_{t+1}$ passes the extensibility test for this partial node labeling. Hence the node labeling $c$ represented by $(v_1, l_1), \dots, (v_n, l_n)$ is feasible. By construction, the return of the episode is $-f(c)$, where $f(c)$ is the cost of node labeling $c$.

Conversely, consider a feasible solution $c$ with cost $f(c)$. Then, by definition of feasibility (§4.3), there is a sequence $(v_1, l_1), \dots, (v_n, l_n)$ of node-label pairs such that for all $t \geq 0$ the partial node labeling given by $(v_1, l_1), \dots, (v_t, l_t)$ passes the extensibility test for node $v_{t+1}$ and label $l_{t+1}$. Hence, the sequence of node-label pairs is also a sequence of actions in the MDP leading to a terminal state. The return for this episode is $-f(c)$.

Note that since our tasks are episodic, the return equals the sum of the rewards (specifically the reward received in the terminal state). In particular, we do not use discounting.
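The correspondence used in this proof, between action sequences that pass the extensibility test and feasible node labelings with return $-f(c)$, can be made concrete with a small sketch. It assumes graph coloring as the task, an adjacency-list dictionary as the graph representation, and hypothetical helper names (`make_coloring_test`, `number_of_labels`, `run_episode`); none of these come from the paper's code.

```python
def make_coloring_test(adj):
    """Extensibility test for graph coloring: no neighbor of v already has the label."""
    def extensible(labeling, v, label):
        return all(labeling.get(u) != label for u in adj[v])
    return extensible

def number_of_labels(labeling):
    """Cost function f for graph coloring: the number of distinct labels used."""
    return len(set(labeling.values()))

def run_episode(adj, actions, extensible, cost):
    """Replay a sequence of (node, label) actions and return (labeling, return).

    Every action must pass the extensibility test on the partial labeling built
    so far. The only nonzero reward is -f(c) in the terminal state, so the
    undiscounted return equals -cost(labeling).
    """
    labeling = {}
    for v, label in actions:
        if v in labeling:
            raise ValueError(f"node {v} is already labeled")
        if not extensible(labeling, v, label):
            raise ValueError(f"action ({v}, {label}) fails the extensibility test")
        labeling[v] = label
    if set(labeling) != set(adj):
        raise ValueError("episode did not reach a terminal state")
    return labeling, -cost(labeling)

# Path graph 0 - 1 - 2, colored with two colors.
path = {0: [1], 1: [0, 2], 2: [1]}
labeling, episode_return = run_episode(
    path, [(0, 1), (1, 2), (2, 1)], make_coloring_test(path), number_of_labels
)
```

An action sequence that assigns the same color to two adjacent nodes is rejected before a terminal state is reached, which mirrors the statement that only feasible labelings correspond to complete episodes.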

C.2 OPTIMALITY OF THE LABELING RULE

Proof of Lemma 3.1. Let $G$ be some graph with chromatic number $\chi(G) = k$ and let $c^*$ be a mapping that colors $G$ optimally. We partition $V$ into color classes $C_i = \{v \mid c^*(v) = i\}$ such that all nodes with color $i$ are in $C_i$. Now, we build an ordering by consecutively taking all nodes from $C_1$, then all nodes from $C_2$, and so on. Choosing the smallest color that passes the extensibility test will produce an optimal coloring for such an order of nodes. The proof is by strong induction on the index $i$ of the color class. The induction hypothesis $H(i)$ is that for all nodes $v$ in $C_j$ for $j < i$, $v$ is colored with a color in $\{1, \dots, j\}$. Assume the induction hypothesis $H(i)$ holds. Now, consider a node $v$ in $C_i$. The color $i$ must be a valid color for $v$: First, assigning color $i$ does not produce any conflicts with any node $u$ in $C_j$ for $j < i$ because by the induction hypothesis node $u$ has a color strictly less than $i$. Second, assigning color $i$ to $v$ does not produce a conflict with another node $w$ in $C_i$ because otherwise $C_i$ would not be a valid color class (nodes in a color class cannot be neighbors). As we choose the smallest valid color and $i$ is valid, $v$ gets a color in $\{1, \dots, i\}$. Thus, $H(i + 1)$ holds. Note that this coloring might be different from $c^*$. This is because a node in $C_i$ might have no conflicts with some color $j < i$, and therefore this node will be assigned color $j$.

Proof of Lemma 3.2. Let $S$ be the set of nodes with label 1 in a minimum vertex cover of $G$. Order these nodes first (in an arbitrary relative order), then order the remaining nodes in $V \setminus S$ after these nodes (in an arbitrary relative order). Now, label the nodes with 1 in this order until every edge has an endpoint with label 1. After $|S|$ steps, every node in $S$ has label 1, meaning that the nodes with label 1 form a minimum vertex cover: If the nodes formed a vertex cover after fewer than $|S|$ steps, we would find a smaller vertex cover, contradicting the minimality of $S$.
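Both proofs are constructive, so they can be checked on small examples. The sketch below follows the arguments above: it colors nodes with the smallest feasible color along an order built from the color classes of an optimal coloring, and it labels nodes with 1 along an order that starts with a minimum vertex cover until all edges are covered. The function names and the adjacency-list representation are assumptions made for this illustration.

```python
def greedy_coloring(adj, order):
    """Assign each node in `order` the smallest color not used by a labeled neighbor."""
    coloring = {}
    for v in order:
        used = {coloring[u] for u in adj[v] if u in coloring}
        color = 1
        while color in used:
            color += 1
        coloring[v] = color
    return coloring

def class_order(optimal_coloring):
    """Order nodes by color class: all of C_1 first, then C_2, and so on."""
    return sorted(optimal_coloring, key=lambda v: optimal_coloring[v])

def cover_by_order(adj, order):
    """Label nodes with 1 in the given order until every edge is covered, then with 0."""
    uncovered = {frozenset((u, v)) for u in adj for v in adj[u]}
    labels = {}
    for v in order:
        if uncovered:
            labels[v] = 1
            uncovered = {e for e in uncovered if v not in e}
        else:
            labels[v] = 0
    return labels

# Path 0 - 1 - 2 - 3 with chromatic number 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
optimal = {0: 1, 1: 2, 2: 1, 3: 2}

bad = greedy_coloring(path, [0, 3, 1, 2])            # an unlucky order needs 3 colors
good = greedy_coloring(path, class_order(optimal))   # the class order needs only 2

# Star with center 0: the minimum vertex cover is {0}.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
cover = cover_by_order(star, [0, 1, 2, 3])
```

The path example also shows why the node order matters: the same labeling rule uses three colors on the order 0, 3, 1, 2 but only two on the class order.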

D LIST OF COMBINATORIAL NODE LABELING PROBLEMS

We provide an extensive list of classic graph optimization problems framed as node labeling problems. Note that there can be multiple equivalent formulations. For some problems, we consider a weighted graph $G$ with weight function $w : E \to \mathbb{R}^+$ and write $w(u, v)$ for the weight of an edge $\{u, v\}$. For a set of nodes $S$, we denote the subgraph of $G$ induced by $S$ with $G[S]$. The problems in Table 11 require a partition of the nodes as their solution. These can be represented as node labeling problems by giving each partition its unique label. For many of the problems, the number of used labels determines the cost function. The problems in Table 12 require a path (or a sequence of nodes) as their solution, which we represent as node labeling problems by having the label indicate the position in the path (or sequence).

Table 11

Problem: Balanced k-partition (Kernighan & Lin, 1970)
Extensibility Test T(V × L, v, l): There are no more than $\frac{n}{k}$ nodes with the same label and at most k labels.
Cost function f: $\sum_{\{u,v\} \in E,\ l(u) \neq l(v)} w(u, v)$

Problem: Balanced $(k, 1 + \epsilon)$-partition (Kernighan & Lin, 1970)
Extensibility Test T(V × L, v, l): There are no more than $\frac{n(1+\epsilon)}{k}$ nodes with the same label and at most k labels.
Cost function f: $\sum_{\{u,v\} \in E,\ l(u) \neq l(v)} w(u, v)$

Problem: Minimum k-cut (Karger & Stein, 1996)
Extensibility Test T(V × L, v, l): k -|V | -|V | -1 ≤ |L ∪ {v}| and |L ∪ {v}| ≤ k
Cost function f: $\sum_{\{u,v\} \in E,\ l(u) \neq l(v)} w(u, v)$

Problem: Clique cover (Karp, 1972)
Extensibility Test T(V × L, v, l): Every label induces a clique
Cost function f: Number of labels

Problem: Domatic number (Hedetniemi & Laskar, 1990)
Extensibility Test T(V × L, v, l): Every label induces a dominating set of $G[V \cup \{v\}]$
Cost function f: Negative number of labels

Problem: Graph coloring (Jensen et al., 1995)
Extensibility Test T(V × L, v, l): No neighbor of v has label l
Cost function f: Number of labels

Problem: Graph co-coloring (Jensen et al., 1995)
Extensibility Test T(V × L, v, l): The nodes with label l induce an independent set in G or the complement of G
Cost function f: Number of labels

Problem: k-defective coloring (Cowen et al., 1986)
Extensibility Test T(V × L, v, l): No node has more than k neighbors with label l

Table 12: Node labeling problems where the labels encode a permutation of nodes.

Problem: Traveling salesman problem (Dantzig et al., 1954)
Extensibility Test T(V × L, v, l): $l = \max(L) + 1$ and v is a neighbor of the node in L with label $\max(L)$
Cost function f: $\sum_{(u,v) \in E,\ l(v) = l(u) + 1} w(u, v)$

Problem: Tree decomposition (Bodlaender, 2005)
Extensibility Test T(V × L, v, l): $l = \max(L) + 1$
Cost function f: For a node $v_i$ with label i, add edges to G until $v_i$ forms a clique with its higher-labeled neighbors. The cost is the largest number of higher-labeled neighbors in the augmented graph (Bodlaender, 2005).

Problem: Longest path (Karger et al., 1997)
Extensibility Test T(V × L, v, l): $l = \max(L) + 1$
Cost function f: Maximum number of nodes with consecutive labels that induce a path

The problems in Table 13 require a set of nodes as their solution. These can be represented as node labeling problems by giving the nodes in the solution set the label 1 and the nodes not in the solution set the label 0. The cost function is closely related to the number of nodes with label 1 for most of these problems.

Table 13: Node labeling problems with binary labels. For all these problems, the extensibility test passes only if the label is 0 or 1 (and the additional requirements listed below are satisfied).

Problem: Maximum cut (Karp, 1972)
Extensibility Test T(V × L, v, l): At least one node has label 1
Cost function f: $-|\{\{u, v\} \in E,\ l(u) \neq l(v)\}|$

Problem: Sparsest cut (Arora et al., 2009)
Extensibility Test T(V × L, v, l): At least one node has label 1
Cost function f: $\frac{|\{\{u,v\} \in E,\ l(u) \neq l(v)\}|}{|\{v \in V,\ l(v) = 1\}|}$

Problem: Maximum independent set (Tarjan & Trojanowski, 1977) (Robson, 1986)
Extensibility Test T(V × L, v, l): The subgraph induced by the nodes with label 1 is an independent set
Cost function f: $-|\{v \in V,\ l(v) = 1\}|$

Problem: Minimum vertex cover (node cover) (Karp, 1972)
Extensibility Test T(V × L, v, l): The subgraph induced by the nodes with label 1 is a vertex cover of $G[V \cup \{v\}]$
Cost function f: $|\{v \in V,\ l(v) = 1\}|$

Problem: Maximum clique (Tomita & Seki, 2003)
Extensibility Test T(V × L, v, l): The subgraph induced by the nodes with label 1 is a clique
Cost function f: $-|\{v \in V,\ l(v) = 1\}|$

Problem: Minimum feedback node set (Karp, 1972)
Extensibility Test T(V × L, v, l): $G[\{u \in V \cup \{v\},\ l(u) = 0\}]$ is a forest
Cost function f: $|\{v \in V,\ l(v) = 1\}|$

Problem: Metric dimension (Harary & Melter, 1976)
Extensibility Test T(V × L, v, l): The nodes in $V \cup \{v\}$ are uniquely identified by their distances to nodes with label 1
Cost function f: $|\{v \in V,\ l(v) = 1\}|$

Problem: Minimum dominating set (Hedetniemi & Laskar, 1990)
Extensibility Test T(V × L, v, l): The nodes with label 1 form a dominating set of $G[V \cup \{v\}]$
Cost function f: $|\{v \in V,\ l(v) = 1\}|$

Problem: Minimum connected dominating set (Hedetniemi & Laskar, 1990)
Extensibility Test T(V × L, v, l): The nodes with label 1 form a connected dominating set of $G[V \cup \{v\}]$
Cost function f: $|\{v \in V,\ l(v) = 1\}|$
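To show how entries for the binary-label problems translate into code, the following sketch implements the cost function for maximum cut and the extensibility test and cost function for minimum vertex cover. It is an illustration only: the function names, the adjacency-list input, and the choice to represent a partial labeling as a dictionary from labeled nodes to labels are assumptions, and the graph is assumed to be simple (no self-loops).

```python
def edges(adj):
    """Set of undirected edges of a simple graph given as adjacency lists."""
    return {frozenset((u, v)) for u in adj for v in adj[u]}

def max_cut_cost(adj, label):
    """Negative number of edges whose endpoints carry different labels."""
    cut = 0
    for edge in edges(adj):
        u, v = tuple(edge)
        if label[u] != label[v]:
            cut += 1
    return -cut

def mvc_extensible(adj, labeling, v, lab):
    """Extensibility test for minimum vertex cover.

    Passes if `lab` is 0 or 1 and, after labeling v, the nodes with label 1
    cover every edge of the subgraph induced by the labeled nodes and v.
    """
    if lab not in (0, 1):
        return False
    extended = dict(labeling)
    extended[v] = lab
    for edge in edges(adj):
        a, b = tuple(edge)
        if a in extended and b in extended:
            if extended[a] != 1 and extended[b] != 1:
                return False
    return True

def mvc_cost(label):
    """Number of nodes with label 1."""
    return sum(1 for lab in label.values() if lab == 1)

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
full_labeling = {0: 1, 1: 1, 2: 0}
cut_cost = max_cut_cost(triangle, full_labeling)
cover_cost = mvc_cost(full_labeling)
```

In the triangle, labeling node 1 with 0 after node 0 already has label 0 fails the test because the edge between them would be left uncovered.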

E ADDITIONAL EXAMPLES

E.1 GRAPH COLORING

Figure 8 and Figure 9 show additional results of our learned coloring heuristic on in-distribution graphs. For the Erdős-Rényi and Watts-Strogatz graphs, we provide some examples in a circular layout and some with a force-directed layout. The force-directed layout emphasizes the structure of the graph, but for these two graph classes leads to many crossing edges. See Figure 10 and Figure 11 for results on cycles, wheels, and trees. Interestingly, the heuristic picks the highest degree node in a star or wheel sometimes first and sometimes last.

E.2 MINIMUM VERTEX COVER

Figure 12 shows additional example covers of our learned heuristic on in-distribution graphs.



Figure 1: Left: Venn diagram of tasks solvable with the set, permutation, and node labeling frameworks. Node labeling generalizes existing frameworks and allows solving additional tasks. Center & right: Comparison of our architecture with S2V-DQN (Dai et al., 2017). We add a label assignment step, allowing us to solve new problems. Further, the average time for picking the next vertex is significantly reduced, such that the total number of arithmetic operations is now linear in the size of the graph.

Figure 3: Temporal locality of the state embedding. We show how the state embedding is updated as nodes are colored. The state embedding focuses on the last labeled node. It contains the graph embedding and the embeddings of the last labeled node and its label, where the label embedding pools the embeddings of nodes with the same label.

Figure 4: Example colorings produced by our learned heuristic. Node borders indicate the colors. Numbers on the nodes indicate the order in which the heuristic labels them.

Figure 6: Distribution of graph coloring inference runtime on WS graphs.

Algorithm 1: Policy Training with Reinforce+Baseline
Input: number of epochs $E$, batch size $B$, datasets $D_{train}$
Initialize model $M_\theta$ and baseline model $M^{BL}_\theta$
$D_{challenge} \leftarrow$ sample new challenge dataset
for epoch $= 1, \dots, E$ do

Figure 7: Inference runtime with local decoding. The solid line indicates a linear least squares fit, the dashed orange line the mean. We report the coefficient of determination $R^2$ and standard error $S$.


Figure 8: Example colorings produced by our learned heuristic. Node borders indicate the colors. Numbers on the nodes indicate the order in which the heuristic labels them.


Figure 9: Example colorings produced by our learned heuristic. Node borders indicate the colors. Numbers on the nodes indicate the order in which the heuristic labels them.

Figure 10: Example colorings produced by our learned heuristic on cycles and wheels. The learned coloring heuristic visits nodes on the cycles in order. For the wheels, the center of the wheel is either visited first or last.


Figure 11: Example colorings produced by our learned heuristic on stars and random trees. For stars, the heuristic either labels the center first or last. On trees, the heuristic prefers to start coloring at one of the leaves and then colors nodes in a search pattern from there, coloring nodes that are neighbors of already colored nodes. It often labels the remaining leaves very late in the coloring.


Figure 12: Example covers produced by our learned heuristic on Erdős-Rényi and Barabási-Albert graphs. Black-bordered nodes are in the cover.

Graph coloring results on the Lemos et al. (2019) subset of the COLOR challenge graphs.

Comparison of MVC approaches on dense ER graphs with edge-probability 0.15.

Our approach colors simple families of graphs (near-)optimally.

Mean validation cost of our approach for graph coloring on WS graphs and cost and approximation ratio for minimum vertex cover on ER graphs (10 samples). Shaded columns are for graphs larger than during training.

As shown in §B.5, our approach takes around 0.5 seconds per sample to find a cover on the test graphs with 1,000 nodes on a CPU, and around 5 seconds for 10 samples. This is comparable to what Dai et al. (2017) reported on a GPU on a similar graph (11 seconds). Note that the time budget for Li et al. (2018) was 30 seconds and the time budget for the combinatorial solver was 1 hour.

Validation results for training on either a single distribution or a mixture, evaluated on ER and BA graphs. Training on a mixture of ER and BA graphs leads to worse validation cost on BA graphs compared to training only on BA graphs. Training exclusively on ER graphs, without BA graphs, leads to a slight cost improvement on ER graphs.

Node labeling problems which partition the nodes into 2 or more sets.

ETHICS STATEMENT

The combinatorial node labeling framework and our neural network architecture target a broad class of graph problems, and hence are very general-purpose. Downstream tasks of such problems range from compiler passes to logistics optimization to graph data mining. Hence, it is hard to identify specific cases of benefit or harm from our work, as it depends on the specific application of the downstream tasks. We nevertheless urge careful consideration of the implications of improving performance on tasks using our methods, especially ones with privacy implications (e.g., data mining).

REPRODUCIBILITY STATEMENT

We detail the combinatorial node labeling framework in §2, and describe the neural network architecture and training process we use in §3. We note that the node labeling framework is very general and alternative architectures could be used to solve it. Proofs of our theoretical claims are provided in §C and we give details of our training setup in §4 and §A. We additionally include our source code in the supplementary material to aid reproducibility.

A TRAINING

A.1 DATA GENERATION

We use four different synthetic graph distributions to generate instances for training and validation. All graphs are generated via the Python NETWORKX library (Hagberg et al., 2008).

Barabási-Albert Model (Albert & Barabási, 2002). The Barabási-Albert (BA) model generates random scale-free networks. Similar to real-world networks, BA graphs grow by preferential attachment, i.e., a new node is more likely to link to more connected nodes. The model is parameterized by one parameter $\delta$, which dictates the average degree.

Erdős-Rényi Model (Erdős & Rényi, 1960). An Erdős-Rényi (ER) graph $G(n, p)$ has $n$ nodes and each edge exists independently with probability $p$. The expected number of edges is $\binom{n}{2} p$.

Watts-Strogatz Model (Watts & Strogatz, 1998). Watts-Strogatz (WS) graphs were developed to overcome the shortcomings of ER graphs when modeling real-world graphs. In real networks we see the formation of local clusters, i.e., the neighbors of a node are more likely to be neighbors of each other. For parameters $k$ and $q$, a WS graph is built as follows: build a ring of $n$ nodes. Next, connect each node to its $k$ nearest neighbors. Finally, replace each edge $\{u, v\}$ by a new edge $\{u, w\}$ (chosen uniformly at random) with probability $q$.
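The ER and WS constructions described above can be sketched directly in plain Python. This is not the paper's data pipeline, which uses NetworkX generators; the function names are hypothetical, and the WS sketch makes the simplifying assumptions that each node is joined to `k // 2` neighbors on each side of the ring and that a rewired edge never duplicates an existing edge.

```python
import random
from itertools import combinations
from math import comb

def erdos_renyi(n, p, rng):
    """G(n, p): include each of the C(n, 2) possible edges independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def expected_er_edges(n, p):
    """Expected number of edges of G(n, p), i.e. C(n, 2) * p."""
    return comb(n, 2) * p

def watts_strogatz(n, k, q, rng):
    """Ring lattice with k // 2 neighbors per side, then rewire each edge with probability q."""
    half = k // 2
    graph = set()
    for u in range(n):
        for j in range(1, half + 1):
            graph.add(frozenset((u, (u + j) % n)))
    for u in range(n):
        for j in range(1, half + 1):
            old = frozenset((u, (u + j) % n))
            if old in graph and rng.random() < q:
                # Candidate endpoints: no self-loop and no duplicate edge.
                candidates = [w for w in range(n)
                              if w != u and frozenset((u, w)) not in graph]
                if candidates:
                    graph.remove(old)
                    graph.add(frozenset((u, rng.choice(candidates))))
    return graph

rng = random.Random(0)
er_edges = erdos_renyi(20, 0.15, rng)
ws_edges = watts_strogatz(20, 4, 0.1, rng)
```

Rewiring replaces edges one for one, so the WS graph keeps the edge count of the initial ring lattice, while the ER edge count fluctuates around `expected_er_edges(n, p)`.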

A.1.1 TRAINING SET PARAMETERS

See Table 5 for the parameters of the graph distributions used during training. Note that for BA and ER graphs, the parameters match those used in the Dai et al. (2017) test set (see Table 2 and Table 10). We also consider sparse ER graphs (S-ER), for which we set the edge probability such that graphs have expected average degree close to $\Delta = 7.5$ when $n$ is small but remain connected with high probability when $n$ is large. This means that for a small $\epsilon$, which we set to 0.2 in our experiments. The formula is derived from the connectivity threshold of ER graphs (Erdős & Rényi, 1960).

For graph coloring, we train on a hybrid dataset consisting of an equal proportion of BA, S-ER, and WS graphs. For minimum vertex cover, we train on a dataset consisting of BA graphs, a dataset consisting of ER graphs, and a hybrid dataset consisting of a combination of the two (in equal proportion). We use the in-distribution models for the evaluation on the synthetic test instances and the hybrid model for the memetracker graph. During training, we use an equal proportion of graphs with n ∈ {20, 40, 50, 70, 100} nodes.

For the results on simple graphs in Table 3, we use graphs with sizes n ∈ {10, . . . , 51, 60, 61, 70, 71, 80, 81, 90, 91, 99, 100, 200, 201, 300, 301, 400, 401} and validate on 1000 or 1001 nodes.

A.2 POLICY OPTIMIZATION

We train our model with REINFORCE with a greedy rollout baseline (Kool et al., 2019). The details follow. We denote the cost of labeling the graph $G_i$ in the order given by the sequence of nodes $\pi$ by $L(\pi, G_i)$. A model $M$ is parameterized by parameters $\theta$. On a graph $G_i$, the model returns a sequence of nodes $\pi$ and an associated probability $p_\theta$. The probability $p_\theta$ is the product of all action probabilities that led to the sequence of nodes $\pi$. We write $p_\theta, \pi \leftarrow M_\theta(G_i)$ when the policy is evaluated deterministically and $p_\theta, \pi \sim M_\theta(G_i)$ when the policy is evaluated probabilistically.

Table 5:
BA: $\delta = 4$
ER: $p = 0.15$
S-ER: $p = p_{\text{s-er}}$
WS: $k = 5$, $q = 0.1$

