GRAPH-BASED DETERMINISTIC POLICY GRADIENT FOR REPETITIVE COMBINATORIAL OPTIMIZATION PROBLEMS

Abstract

We propose an actor-critic framework for graph-based machine learning pipelines with non-differentiable blocks, and apply it to repetitive combinatorial optimization problems (COPs) under hard constraints. Repetitive COP refers to problems to be solved repeatedly on graphs of the same or slowly changing topology but rapidly changing node or edge weights. Compared to one-shot COPs, repetitive COPs often rely on fast (distributed) heuristics to solve one instance of the problem before the next one arrives, at the cost of a relatively large optimality gap. Through numerical experiments on several discrete optimization problems, we show that our approach can learn reusable policies to reduce the optimality gap of fast (distributed) heuristics for independent repetitive COPs, and can optimize the long-term objectives for repetitive COPs embedded in graph-based Markov decision processes.

1. INTRODUCTION

In a general network setting, the network state is captured by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{S})$, which comprises a vertex set $\mathcal{V}$, an edge set $\mathcal{E}$, and a matrix $\mathbf{S}$ capturing the node features. A vector $\mathbf{o} \in \mathbb{R}^{|\mathcal{V}|}$ captures the outcomes on individual nodes of a non-differentiable network process, $f_{\text{net}}(\cdot)$, as

$$\mathbf{o} = f_{\text{net}}(\mathcal{G}) . \tag{1}$$

We aim to optimize the system-level objective $f_{\text{obj}}(\mathbf{o})$, where $f_{\text{obj}} : \mathbb{R}^{|\mathbf{o}|} \to \mathbb{R}$ is a known linear combination, by improving parts of the network process $f_{\text{net}}(\cdot)$. In particular, we are interested in network processes that involve graph-based repetitive combinatorial optimization problems (R-COPs) (Kraay & Harker, 1996), i.e., COPs that need to be solved repeatedly on graphs of the same or slowly changing topology ($\mathcal{V}$ or $\mathcal{E}$) but rapidly changing node features ($\mathbf{S}$), under hard constraints that must be satisfied at all times (Kendall, 1975), which often make COPs NP-hard. R-COPs have many real-world applications, such as task scheduling (Pinedo, 2012), route planning (Vogiatzis & Pardalos, 2013), link scheduling (Joo & Shroff, 2012; Paschalidis et al., 2015; Eisen et al., 2019; Zhao et al., 2022a;b) and routing (Oliveira et al., 2011) in communication networks, and energy management in smart grids (Chau et al., 2018), where a node or edge weight captures varying cost or utility. In practice, solvers for R-COPs are often subject to restricted runtime and/or distributed execution. For example, in scheduling, network routing, and multi-object tracking in computer vision, COPs need to be solved within tens of milliseconds to a few seconds. Although memory-based approaches can avoid solving each instance from scratch for some applications (Kraay & Harker, 1996; Wang, 2021), in general, R-COPs rely on fast heuristics to meet the strict time constraints, at the cost of relatively large optimality gaps.
Moreover, in real-time networked systems, such as communication networks, smart grids, and robot swarms (Tolstaya et al., 2020), centralized solvers often suffer from the large communication overhead of gathering the full network state, high computational complexity, and the risk of a single point of failure. Therefore, distributed solutions (Moser & Tardos, 2010) are preferred for better scalability and robustness, in which nodes across the network work in parallel to collect and process the information of only their local neighborhoods for decision making. Depending on the definition of $f_{\text{obj}}$, R-COPs can be categorized as 1) independent R-COPs, i.e., $\mathbf{S}(t_1)$ and $\mathbf{S}(t_2)$ are considered independent if $t_1 \neq t_2$, and 2) R-COPs embedded in a graph-based Markov decision process (MDP). For example, in link scheduling for wireless multiple networks (Zhao et al., 2022a;b), the network state $\mathcal{G}(t) = (\mathcal{V}, \mathcal{E}, \mathbf{S}(t))$, where $\mathbf{S}(t)$ captures the packet backlogs of all the links at time $t$, depends on the schedule at $t-1$ found by solving a maximum weight independent set (MWIS) problem defined on $\mathcal{G}(t-1)$. Scheduling for maximum throughput (number of data packets transmitted on the schedule) is equivalent to optimizing each MWIS instance individually (Zhao et al., 2022a), which is formulated as an independent R-COP. However, to minimize the average backlog (packets left in the queues by the schedule) across links and over time, the transition of network states must be considered, and the scheduling is formulated as an R-COP in a graph-based MDP (Zhao et al., 2022b). Similar MDP formulations can also be found in wireless scheduling for battery lifetime (Sikandar et al., 2020), vehicle routing for waste collection (Wu et al., 2020), and inventory control in distribution networks (Çelebi, 2015). However, the second type of R-COPs has rarely been addressed, except in an ad hoc manner for link scheduling (Zhao et al., 2022b).
Therefore, a general approach to R-COPs in graph-based MDPs would be of high interest.

1.1. EXISTING APPROACHES TO COPS

Centralized solvers: The general approach for exactly solving a COP is to formulate it as a mixed-integer program, and solve it by branch-and-bound (Land & Doig, 1960) or dynamic programming. Commercial Gurobi solvers (LLC, 2020) can exactly solve COPs on graphs of hundreds of nodes in reasonable time. Although large real-world graphs with certain structural properties can be reduced (Lamm et al., 2019) to find exact solutions, problems of large scale and/or stringent time limits often rely on efficient heuristics to approximate the solutions; common strategies include greedy algorithms, local search (Wang et al., 2018), tabu search, ant colony (Jovanovic et al., 2010), and simulated annealing. In recent machine learning-based COP solvers, graph neural networks (GNNs) (Wu et al., 2021) are trained to guide an algorithmic framework, which guarantees that the constraints are always satisfied; common frameworks include branching (Khalil et al., 2016; Gasse et al., 2019; Nair et al., 2020; Zarpellon et al., 2021), tree search (Li et al., 2018), greedy algorithms (Khalil et al., 2017; Zhao et al., 2022a), and local search (Hudson et al., 2022). In these approaches, a COP instance is formulated as a finite episode of an MDP, with the residual graph of each intermediate step defined as a state, based on which an action of adding one vertex in the residual graph to the partial solution is generated by a GNN, and a solution is built from a sequence of such scalar actions. This formulation has a time complexity of at least $O(|\mathcal{V}||\mathcal{E}|)$, since a GNN of time complexity $O(|\mathcal{E}|)$ (Wu et al., 2021) is called in each intermediate step. (Drori et al., 2020; Hottung et al., 2022)

Distributed solvers: Popular distributed approaches to COPs include the Moser-Tardos algorithm (Moser & Tardos, 2010) and dynamic programming.
The complexity of distributed algorithms is typically measured by local communication complexity, which refers to the rounds of message exchanges between a node and its neighbors (Joo & Shroff, 2012; Zhao et al., 2022a) . For example, on MWIS, Moser-Tardos algorithm (Joo & Shroff, 2012; Moser & Tardos, 2010) can converge in O(log * |V |) rounds, whereas distributed dynamic programming (Paschalidis et al., 2015) in O(|V |) rounds. Distributed algorithms may have larger optimality gaps due to the lack of global information, but topological metrics, such as edge betweenness for the Steiner tree problem (Fujita et al., 2016) , can help reduce the gap. Learning-based distributed COP solvers have received less attention compared to their centralized counterparts. GCN-LGS (Zhao et al., 2022a ) is an architecture proposed for a repetitive MWIS problem, in which an L-layer graph convolutional neural network (GCNN) generates a vector z for a topology (V, E), and a greedy heuristic solves modified instances (V, E, c(t) ⊙ z), rather than the original ones (V, E, c(t)), for the next N time slots t ∈ {1, . . . , N }, assuming the topology would not change by t = N . GCN-LGS can reduce the optimality gap of greedy solvers (Joo & Shroff, 2012 ) by 1 /3 to 1 /2, with a time complexity of O(L|E|/N + |V|), or can converge in O(L/N + log |V |) iterations in a distributed setting. For large reusing factor N , the overhead of GCNN is negligible. However, the training methods in (Zhao et al., 2022a; b) are ad hoc and specific to MWIS problems, and may not be directly applicable to other types of R-COPs. In this work, we generalize the GCN-LGS architecture in (Zhao et al., 2022a; b) to a wider range of R-COPs, through a unified framework of problem formulation and actor-critic architecture. 
Unlike previous works that solve a COP instance through a sequence of scalar actions, our actor network generates a high-dimensional intermediate action to parameterize a given classical heuristic, $h'(\cdot)$, which produces vectorized decisions by solving one or multiple instances of an R-COP. This formulation allows us to pick a fast and/or distributed classical heuristic $h'(\cdot)$ that adheres to various practical restrictions, while guaranteeing that the decisions always satisfy the hard constraints. For independent R-COPs, an intermediate action encodes the underlying topology shared by $N$ instances to improve the average quality of their solutions, reducing the GNN overhead from $O(|\mathcal{E}|)$ (Drori et al., 2020; Hottung et al., 2022) to $O(|\mathcal{E}|/N)$, which is arbitrarily small for a large $N$. In R-COPs embedded in a graph-based MDP, an intermediate action for the optimal expected long-term objective is generated based on the network state in each time step, serving as the cost vector of the corresponding COP instance, which is later translated into a decision vector by $h'(\cdot)$. The challenge, however, is that our policy network contains a non-differentiable block, e.g., $h'(\cdot)$.

1.2. LEARNING IN NON-DIFFERENTIABLE PIPELINES

Training with supervised or unsupervised learning methods is challenging in the presence of non-differentiable blocks. Reinforcement learning (RL) (Sutton & Barto, 2018) addresses this problem by treating the non-differentiable block as (part of) the environment (Silver et al., 2014; Khalil et al., 2017; Zhao et al., 2022a). Compared to Q-learning (Watkins & Dayan, 1992), which is sequential, policy gradient (Silver et al., 2014) is better suited for the high-dimensional action spaces in networks of parallel and dynamic nature. A major issue in RL for networks is how to assign credit to individual elements based on the system-level reward. Zeroth-order optimization (ZOO) (Liu et al., 2020) is the last resort for non-differentiable pipelines, as it requires numerous evaluations of $f_{\text{net}}(\cdot)$, which can be computationally prohibitive. In some scenarios, soft constraints (Kendall, 1975) can avoid the use of a non-differentiable pipeline, allowing a fully differentiable policy network to be trained by primal-dual optimization (Eisen et al., 2019) or imitation learning (Ross et al., 2011). To address the aforementioned shortcomings, we propose GDPG-Twin, a graph-based deterministic policy gradient method based on the actor-critic framework (Sutton & Barto, 2018, ch. 13). GDPG-Twin trains a differentiable twin of the non-differentiable policy block in $f_{\text{net}}(\cdot)$, e.g., $h'(\cdot)$, as part of the critic network to facilitate the training of the actor network. A similar approach, differentiable approximation bridges (DAB) (Ramapuram & Webb, 2020), can approximate the immediate behavior of a non-differentiable block in standalone systems. We improve on DAB in three aspects: we use a twin network to predict the element-wise expected outcomes of the non-differentiable policy network, use a GNN to account for the permutation equivariance in network settings, and use random policy sampling for static policy parameters $\mathbf{Z}$ in a fixed topology. In addition, GDPG-Twin is more efficient than ZOO.
Although we focus on node-related problems, GDPG-Twin also applies to edge-related problems, e.g., by using simplicial neural networks (Roddenberry et al., 2021) .

1.3. CONTRIBUTIONS

In summary, our proposed framework: 1) can generalize to learning for different COPs without handcrafting the credit assignment strategies as in other schemes of network-based RL (Eisen et al., 2019; Zhao et al., 2022a;b), 2) works for R-COPs with hard constraints, 3) requires fewer evaluations of $f_{\text{net}}(\cdot)$ than ZOO (Liu et al., 2020), and 4) has the advantage of RL schemes in not relying on expensive data labeling as in supervised learning or a computationally intensive supervising algorithm as in imitation learning (Ross et al., 2011). Our contributions cover the following three aspects:

• We propose a general approach to R-COPs under hard constraints and practical restrictions. Our approach can reduce the optimality gap of fast and/or distributed heuristics for independent R-COPs, at the cost of only an upfront computation and communication overhead. Moreover, it can optimize the long-term objectives for R-COPs in a graph-based MDP by embedding the future reward into the cost vector of the COP instance at each time step.

• We propose GDPG-Twin, an actor-critic architecture for network settings. By using a twin network that learns the element-wise expected outcomes $\mathbf{o}$ of a non-differentiable policy network in $f_{\text{net}}(\cdot)$, the critic can leverage the knowledge of the linear combination of the system-level objective $f_{\text{obj}}(\mathbf{o})$ to address the challenge of credit assignment across the network, which is a major roadblock for discrete or mixed-integer network processes.

• We adopt a random policy sampling strategy in the training of the twin network, which enables optimizing a static policy for R-COPs defined on fixed topologies. Moreover, our approach requires significantly fewer evaluations of the network process $f_{\text{net}}(\cdot)$ than ZOO.

Notation. Upright bold lower-case symbols denote column vectors (e.g., $\mathbf{x}$), and upright bold upper-case symbols denote matrices (e.g., $\mathbf{X}$). $x_i$ denotes the $i$th element of vector $\mathbf{x}$, $X_{i,j}$ denotes the element at row $i$ and column $j$ of matrix $\mathbf{X}$, and $\mathbf{X}_{i*}$ (or $\mathbf{X}_{*j}$) denotes row $i$ (or column $j$) of matrix $\mathbf{X}$. Unless otherwise specified, calligraphic upper-case symbols (e.g., $\mathcal{V}$) denote sets. $(\cdot)^\top$ and $\odot$ denote transpose and element-wise product, respectively.

2. PROBLEM FORMULATION

Many binary discrete COPs have the following formulation,

\begin{align}
\mathbf{x}^* = \min_{\mathbf{x}} \;\; & \mathbf{c}^\top \mathbf{x} \tag{2a} \\
\text{s.t.} \;\; & x_i \in \{0, 1\}, \; \forall i \in \{1, \dots, |\mathcal{V}|\}, \tag{2b} \\
& \text{other problem-specific constraints}, \tag{2c}
\end{align}

where $\mathbf{x}$ and $\mathbf{c}$ are respectively the vectors of decisions and weights on nodes, and the constraints in (2c) are often defined on the graph, e.g., $x_i + x_j \le 1, \forall \{i, j\} \in \mathcal{E}$. Without loss of generality, (2b) can be of other arities and (2c) can be on a hypergraph or simplicial complex. Since many problems in (2) are NP-hard, we seek to develop efficient heuristics to approximate $\mathbf{x}^*$. Furthermore, we define the function space $\mathcal{F}$ of valid heuristics, meaning that $\mathbf{x} = f(\mathcal{V}, \mathcal{E}, \mathbf{c})$ satisfies the constraints in (2) for all $f \in \mathcal{F}$. We also define the space $\mathcal{P}$ of practical functions, i.e., functions that satisfy pre-specified practical restrictions, e.g., limited runtime and/or distributed execution.
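To make the notion of a valid heuristic concrete, the following is a minimal Python sketch, assuming the independent-set form of (2c) and a maximization instance (equivalently, minimizing the negated weights in (2a)). The function names `is_feasible` and `greedy_independent_set` are hypothetical and chosen only for illustration; the greedy routine is a generic centralized greedy, not the specific baseline used in the experiments.

```python
def is_feasible(x, edges):
    """Check (2b) and the independent-set form of (2c): x_i + x_j <= 1 on every edge."""
    binary = all(xi in (0, 1) for xi in x)
    independent = all(x[i] + x[j] <= 1 for i, j in edges)
    return binary and independent


def greedy_independent_set(num_nodes, edges, c):
    """Generic greedy: repeatedly pick the heaviest remaining node and drop its neighbors."""
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    x = [0] * num_nodes
    remaining = set(range(num_nodes))
    while remaining:
        # Ties are broken in favor of the smaller node index.
        best = max(remaining, key=lambda i: (c[i], -i))
        x[best] = 1
        remaining -= neighbors[best] | {best}
    return x


edges = [(0, 1), (1, 2), (2, 3)]  # path graph 0-1-2-3
c = [1.0, 3.0, 2.5, 1.0]
x = greedy_independent_set(4, edges, c)
print(x, is_feasible(x, edges))
```

Because the greedy routine never selects two adjacent nodes, its output is feasible for every weight vector, which is the property required of a member of $\mathcal{F}$.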

2.1. INDEPENDENT REPETITIVE COPS

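For independent R-COPs, the weight $\mathbf{c}$ of an instance is considered to be a random vector drawn from its target sampling distribution $\Omega_c$. We formulate independent R-COPs as finding a valid heuristic (policy) that optimizes the expectation of the objective function in (2a) under practical restrictions,

\begin{align}
h^* = \min_{h \in (\mathcal{F} \cap \mathcal{P})} \;\; & \mathbb{E}_{\mathbf{c} \sim \Omega_c} \left( \mathbf{c}^\top \mathbf{x} \right) \tag{3a} \\
\text{s.t.} \;\; & \mathbf{x} = h(\mathcal{V}, \mathcal{E}, \mathbf{c}) . \tag{3b}
\end{align}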

2.2. REPETITIVE COPS IN A GRAPH-BASED MARKOV DECISION PROCESS

Given the network state $\mathcal{G}(t) = (\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t))$, where $\mathbf{S}(t)$ captures features on nodes at time $t$, our objective is to find a valid heuristic $h$ for a COP in (2) and a cost function $\Psi$, which together form a policy that, subject to the practical restrictions, maps $\mathcal{G}(t)$ into $\mathbf{x}(t)$ and maximizes the expected system-level value over the time horizon $T$, as follows:

\begin{align}
h^*, \Psi^* = \max_{h \in (\mathcal{F} \cap \mathcal{P}),\, \Psi \in \mathcal{P}} \;\; & \mathbb{E}_{\mathcal{G}(1) \sim \Omega_1} \left[ f_{\text{obj}}(\mathbf{o}(1)) \right], \tag{4a} \\
\text{s.t.} \;\; & \mathbf{o}(t) = \mathbb{E}_{\{h, \Psi\}} \left[ \sum_{k=0}^{T-t} \gamma^k \mathbf{r}(t+k) \,\middle|\, \mathcal{G}(t) \right], \tag{4b} \\
& \mathbf{r}(t) = f_r(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{x}(t)), \tag{4c} \\
& \mathbf{x}(t) = h(\mathcal{V}(t), \mathcal{E}(t), \mathbf{c}(t)), \tag{4d} \\
& \mathbf{c}(t) = \Psi(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t)), \tag{4e} \\
& \mathbf{S}(t+1) = f_s(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{x}(t)) . \tag{4f}
\end{align}

In (4), $\mathbf{o}(t)$ is the value vector of the current state $\mathcal{G}(t)$ under policy $\{h, \Psi\}$, and $0 \le \gamma \le 1$ is the discount factor. (4a) states that the system objective is the expectation of a known linear combination, $f_{\text{obj}} : \mathbb{R}^{|\mathbf{o}|} \to \mathbb{R}$, of $\mathbf{o}(1)$ over the initial state distribution $\Omega_1$, capturing both average reward and start-state formulations (Sutton et al., 1999). (4d) states that a decision vector $\mathbf{x}(t)$ is generated by a valid heuristic $h \in \mathcal{F}$ of a COP based on the network topology $(\mathcal{V}(t), \mathcal{E}(t))$ and the cost vector $\mathbf{c}(t)$, which is a function of the network state $\mathcal{G}(t)$ as stated by (4e). (4c) and (4f) define the MDP by respectively stating that the reward vector $\mathbf{r}(t)$ and the next state $\mathbf{S}(t+1)$ depend on the current state $\mathbf{S}(t)$ and the decisions $\mathbf{x}(t)$. In general, $f_r(\cdot)$ and $f_s(\cdot)$ in (4) are stochastic functions, capturing some stationary random processes in the environment. The formulation in (4) can capture R-COPs with accumulative objectives, such as latency and battery lifetime in wireless scheduling, waste level in vehicle routing for waste collection, and inventory in distribution networks. Notice that if $T = t$, $\gamma = 1$, $\mathbf{S}(t) = \mathbf{c}(t)$ ($\Psi$ is bypassed), $\mathbf{r}(t) = \mathbf{c}(t) \odot \mathbf{x}(t)$, $f_{\text{obj}}(\mathbf{o}) = \mathbf{1}^\top \mathbf{o}$, and we set $f_s$ in (4f) as drawing a random vector from $\Omega_c$, then (4) boils down to (3). Thus, (4) is a generalized form of all R-COPs.
Appendix A further illustrates (3) and (4) via exemplary formulations of two wireless scheduling problems.
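As an illustration of the value vector in (4b), the sketch below computes a single-trajectory sample of $\mathbf{o}(t)$ from a list of per-node reward vectors using the backward recursion $\mathbf{o}(t) = \mathbf{r}(t) + \gamma\,\mathbf{o}(t+1)$ with $\mathbf{o}(T+1) = \mathbf{0}$. The function name `value_vectors` and the toy rewards are assumptions made for this example; the expectation in (4b) would be approximated by averaging over many such trajectories.

```python
def value_vectors(rewards, gamma):
    """Given r(1), ..., r(T) as per-node lists, return sampled o(1), ..., o(T).

    Uses o(t) = r(t) + gamma * o(t+1) with o(T+1) = 0, which unrolls to
    the discounted sum over k = 0..T-t of gamma^k * r(t+k).
    """
    num_nodes = len(rewards[0])
    nxt = [0.0] * num_nodes  # o(T+1) = 0
    values = []
    for r in reversed(rewards):
        cur = [ri + gamma * vi for ri, vi in zip(r, nxt)]
        values.append(cur)
        nxt = cur
    values.reverse()
    return values


# Two nodes, horizon T = 3 (toy numbers, for illustration only).
rewards = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = value_vectors(rewards, gamma=0.5)
system_value = sum(values[0])  # f_obj(o) = 1^T o evaluated at o(1)
print(values[0], system_value)
```

With $f_{\text{obj}}(\mathbf{o}) = \mathbf{1}^\top \mathbf{o}$, the system-level value is simply the sum of the per-node entries of $\mathbf{o}(1)$, which is what makes element-wise credit assignment possible.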

3. GRAPH-BASED DETERMINISTIC POLICY GRADIENT

Our downstream pipeline follows the existing methodology of using a neural network to guide a discrete operation, which guarantees the hard constraints, but keeps the neural network outside the iterations of the algorithmic framework for lower complexity (Zhao et al., 2022a). In this section, we provide the main solution, whereas the full procedures of optimization for Sections 3.1 and 3.2 are given by Algorithms 1 and 2 in Appendix D. Since (3) and (4) are functional optimization problems, we seek to approximately solve them by parameterizing the policy ($h$ or $\{h, \Psi\}$) and reformulating (3) and (4) as finding the optimal set of parameters. To meet the practical restrictions, we rely on manual selection (or design) of a baseline heuristic $h' \in \mathcal{F}$ and the parameterizations.

3.1. LEARNING FOR INDEPENDENT REPETITIVE COP

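For independent R-COPs, we reformulate (3) as

\begin{align}
\mathbf{Z}^* = \min_{\mathbf{Z} \in \mathbb{R}^{|\mathbf{c}| \times g}} \;\; & \mathbf{1}^\top \mathbf{o} \tag{5a} \\
\text{s.t.} \;\; & \mathbf{o} = \mathbb{E}_{\mathbf{c} \sim \Omega_c} \left( \mathbf{c} \odot \mathbf{x} \right), \tag{5b} \\
& \mathbf{x} = h'(\mathcal{V}, \mathcal{E}, \mathbf{w}), \tag{5c} \\
& w_i = f_{\text{loc}}(c_i; \mathbf{Z}_{i*}), \; \forall i \in \{1, \dots, |\mathbf{c}|\} . \tag{5d}
\end{align}

The objectives in (3a) and (5a) are equal due to the linearity of expectation, i.e.,

$$\mathbb{E}_{\mathbf{c} \sim \Omega_c} \left( \mathbf{c}^\top \mathbf{x} \right) = \mathbb{E}_{\mathbf{c} \sim \Omega_c} \left( \sum_{i=1}^{|\mathbf{c}|} c_i x_i \right) = \sum_{i=1}^{|\mathbf{c}|} \mathbb{E}_{\mathbf{c} \sim \Omega_c} \left( c_i x_i \right) = \mathbf{1}^\top \mathbf{o} .$$

Constraints (5c) and (5d) further break the parameterized policy in (3b), $h(\cdot; \mathbf{Z})$, into a baseline heuristic $h'(\cdot) \in (\mathcal{F} \cap \mathcal{P})$ given in advance and a parameterized local function $f_{\text{loc}}(\cdot; \mathbf{Z}_{i*})$, where $\mathbf{Z}_{i*} \in \mathbb{R}^{1 \times g}$ captures the $g$ local parameters for node $i \in \mathcal{V}$. The local function $f_{\text{loc}}(\cdot; \mathbf{Z}_{i*})$ can be chosen as, e.g., a multiplier, a single neuron, or even a small neural network, and depends on node-specific parameters $\mathbf{Z}_{i*}$.

To solve the problem formulated in (5), we employ deterministic policy gradient reinforcement learning, where the policy parameters are $\mathbf{Z}$. However, the gradient $\nabla_{\mathbf{Z}} \mathbf{1}^\top \mathbf{o}$ is not available since $h'(\cdot)$ is non-differentiable. To address this problem, we introduce a trainable and differentiable twin network $f_{\text{twin}}(\cdot; \Theta_c)$ to learn the element-wise expected outcome of $h(\cdot; \mathbf{Z})$. In contrast to the critic in a typical standalone setting, which directly predicts the system-level objective, the twin network works for network settings, where the policy parameters and outcome are supported on graphs. The twin can be implemented by a graph or a simplicial neural network, depending on the specific COP. The expected behavior of the twin can be described by

$$\mathbf{o} \approx \hat{\mathbf{o}} = f_{\text{twin}}(\mathcal{V}, \mathcal{E}, \bar{\mathbf{c}}, \mathbf{Z}; \Theta_c^*), \quad \text{where } \bar{\mathbf{c}} = \mathbb{E}_{\Omega_c}(\mathbf{c}) . \tag{6}$$

Based on (5a) and (6), we can estimate the system objective as $\mathbf{1}^\top \hat{\mathbf{o}}$. Based on (6) and the chain rule $\frac{\partial \mathbf{1}^\top \hat{\mathbf{o}}}{\partial \mathbf{Z}} = \mathbf{1}^\top \frac{\partial \hat{\mathbf{o}}}{\partial \mathbf{Z}}$, the policy gradient is estimated as (Silver et al., 2014)

$$\nabla_{\mathbf{Z}} \mathbf{1}^\top \mathbf{o} \approx \nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}} = \nabla_{\mathbf{Z}} f_{\text{twin}}(\mathcal{V}, \mathcal{E}, \mathbb{E}_{\Omega_c}(\mathbf{c}), \mathbf{Z}; \Theta_c) \mathbf{1} . \tag{7}$$

Given a policy learning rate $0 \le \alpha_p \le 1$, we can update the policy parameters as $\mathbf{Z} \leftarrow \mathbf{Z} - \alpha_p \nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}$.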
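For applications on static topologies, we can optimize $\mathbf{Z}$ directly with stochastic gradient descent. For R-COPs on dynamic networks, we want the policy parameters to be a function of the topology, implemented as an actor, $\mathbf{Z} = \Psi(\mathcal{V}, \mathcal{E}, \bar{\mathbf{c}}; \Theta_p)$. In this case, we can estimate the gradient $\nabla_{\Theta_p} \mathbf{1}^\top \hat{\mathbf{o}} = \nabla_{\Theta_p} \Psi(\mathcal{V}, \mathcal{E}, \bar{\mathbf{c}}; \Theta_p) \nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}$, and update the actor parameters as $\Theta_p \leftarrow \Theta_p - \alpha_p \nabla_{\Theta_p} \mathbf{1}^\top \hat{\mathbf{o}}$. From the perspective of the actor $\Psi(\cdot; \Theta_p)$, its input is $(\mathcal{V}, \mathcal{E}, \bar{\mathbf{c}})$ instead of the instantaneous network state $(\mathcal{V}, \mathcal{E}, \mathbf{c})$, and its output is an intermediate action $\mathbf{Z}$, which is used as the policy parameters. Given a learning rate $0 \le \alpha_c \le 1$, the twin network can be updated by the following gradient descent,

$$\Theta_c \leftarrow \Theta_c - \alpha_c \nabla_{\Theta_c} \ell_{\text{mse}}(\hat{\mathbf{o}}, \mathbf{o}),$$

where the loss function $\ell_{\text{mse}}(\hat{\mathbf{o}}, \mathbf{o})$ is the mean-square-error (MSE) between $\hat{\mathbf{o}}$ and $\mathbf{o}$. Since $\mathbf{o}$ is an expectation over the sampling space $\Omega_c$, we can implement the following stochastic gradient descent

$$\Theta_c \leftarrow \Theta_c - \alpha_c \nabla_{\Theta_c} \ell_{\text{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x}), \quad \ell_{\text{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x}) = \frac{1}{|\mathbf{x}|} \sum_{i=1}^{|\mathbf{x}|} \left( \hat{o}_i - c_i x_i \right)^2, \quad \mathbf{c} \sim \Omega_c, \tag{9}$$

by minimizing the MSE loss in (9) with an off-the-shelf optimizer. Notice that we need to evaluate $h'(\cdot)$ and $f_{\text{loc}}(\cdot; \mathbf{Z})$ to get $\mathbf{x}$ in (9). A detailed derivation of (9) is given in Appendix B.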
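The decomposition in (5b) to (5d) can be sketched in a few lines of Python. This is an illustrative toy, not the pipeline used in the experiments: `f_loc` is assumed to be a plain multiplier ($g = 1$), `greedy_mwis` is a generic centralized greedy standing in for $h'(\cdot)$, and the names `outcome` and `estimate_o` are hypothetical. MWIS is a maximization problem, so a larger $\mathbf{1}^\top \mathbf{o}$ is better in this example (equivalently, the weights could be negated to match the minimization in (5a)).

```python
import random


def f_loc(c_i, z_i):
    """Local function in (5d), assumed here to be a simple multiplier (g = 1)."""
    return c_i * z_i


def greedy_mwis(num_nodes, edges, w):
    """Stand-in for h'(.): pick the heaviest remaining node, then drop its neighbors."""
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    x, remaining = [0] * num_nodes, set(range(num_nodes))
    while remaining:
        best = max(remaining, key=lambda i: (w[i], -i))  # ties: smaller index wins
        x[best] = 1
        remaining -= neighbors[best] | {best}
    return x


def outcome(num_nodes, edges, c, z):
    """One sample of c ⊙ x, with x = h'(V, E, w) and w_i = f_loc(c_i; z_i)."""
    w = [f_loc(ci, zi) for ci, zi in zip(c, z)]
    x = greedy_mwis(num_nodes, edges, w)
    return [ci * xi for ci, xi in zip(c, x)]


def estimate_o(num_nodes, edges, z, sample_c, num_samples, rng):
    """Monte Carlo estimate of o = E_c[c ⊙ x] in (5b)."""
    total = [0.0] * num_nodes
    for _ in range(num_samples):
        for i, v in enumerate(outcome(num_nodes, edges, sample_c(rng), z)):
            total[i] += v
    return [t / num_samples for t in total]


edges = [(0, 1), (1, 2)]  # path graph 0-1-2
c = [1.0, 1.2, 1.0]
plain = outcome(3, edges, c, [1.0, 1.0, 1.0])   # greedy picks node 1 only
scaled = outcome(3, edges, c, [1.0, 0.5, 1.0])  # greedy picks nodes 0 and 2
print(sum(plain), sum(scaled))
```

Down-weighting the middle node steers the greedy heuristic toward the two end nodes, whose total original weight is larger. The element-wise samples returned by `outcome` are exactly the regression targets $\mathbf{c} \odot \mathbf{x}$ that the twin network is fitted to in (9).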

3.1.1. RANDOM SAMPLING AROUND CURRENT POLICY

For applications based on static topologies, i.e., $(\mathcal{V}, \mathcal{E}, \bar{\mathbf{c}})$ are constant, we no longer need an actor to generate $\mathbf{Z}$. In this case, the twin is likely to overfit if we only feed it with a static $\mathbf{Z}$ during training. To address this problem, we feed the twin $f_{\text{twin}}(\cdot)$ and $h(\cdot; \mathbf{Z})$ with random samples around the current policy parameters,

$$\mathbf{Z}^{(j)} = \mathbf{Z} + \mathbf{N}^{(j)}, \quad \mathbf{N}^{(j)}_{m,n} \in \mathbb{U}(-\epsilon, \epsilon),$$

where $\epsilon$ is the sampling radius. The loss in (9) then becomes $\ell_{\text{mse}}(\hat{\mathbf{o}}^{(j)}, \mathbf{c} \odot \mathbf{x}^{(j)})$. This random sampling strategy enables the critic, comprising the twin $f_{\text{twin}}(\cdot; \Theta_c)$ and the system-level objective function $f_{\text{obj}}(\hat{\mathbf{o}})$, to learn the loss landscape around the current $\mathbf{Z}$, thus improving the quality of the gradient in (7). In ZOO, a policy gradient is estimated from at least two random samples around the current $\mathbf{Z}$ (including $\mathbf{Z}$ itself) (Liu et al., 2020). While fewer policy samples require fewer evaluations of $h(\cdot; \mathbf{Z})$, which could be computationally expensive in many applications, they could degrade the convergence by raising the noise floor of the gradient estimate. Compared to ZOO, our twin-based critic can continuously learn, refine, and memorize the loss landscape around the current policy as new samples come in, leading to a better gradient estimate in backpropagation (improved convergence as shown in Figure 1), and higher policy sampling efficiency, as shown in Sections 4.2 and 4.3.
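The sampling rule above can be sketched as follows, assuming $\mathbf{Z}$ is stored as a list of rows and that the perturbation entries are drawn i.i.d. from the uniform distribution on $[-\epsilon, \epsilon]$. The function name `sample_policies` and the numeric values are illustrative assumptions.

```python
import random


def sample_policies(Z, eps, num_samples, rng):
    """Return Z^(j) = Z + N^(j), with each entry of N^(j) drawn from U(-eps, eps)."""
    return [
        [[z + rng.uniform(-eps, eps) for z in row] for row in Z]
        for _ in range(num_samples)
    ]


rng = random.Random(0)          # seeded only to make the sketch reproducible
Z = [[1.0], [1.0], [1.0]]       # |c| = 3 nodes, g = 1 parameter per node
samples = sample_policies(Z, eps=0.1, num_samples=4, rng=rng)
for Zj in samples:
    print(Zj)
```

Each perturbed $\mathbf{Z}^{(j)}$ would then be passed both to the non-differentiable pipeline, to obtain $\mathbf{x}^{(j)}$, and to the twin, to obtain $\hat{\mathbf{o}}^{(j)}$, so that the twin sees a neighborhood of the current policy rather than a single point.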

3.2. LEARNING FOR REPETITIVE COP IN A GRAPH-BASED MARKOV DECISION PROCESS

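For R-COPs in a graph-based MDP, we parameterize the policy in (4), $\{h, \Psi\}$, with a given baseline heuristic $h' \in \mathcal{F}$, and a parameterized cost function $\Psi(\cdot; \Theta_p)$. We then reformulate (4) as:

\begin{align}
\Theta_p^* = \max_{\Theta_p \in \mathbb{R}^{|\Theta_p|}} \;\; & \mathbb{E}_{\mathcal{G}(1) \sim \Omega_1} \left[ f_{\text{obj}}(\mathbf{o}(1)) \right], \tag{10a} \\
\text{s.t.} \;\; & \mathbf{o}(t) = \mathbb{E}_{\Theta_p} \left[ \sum_{k=0}^{T-t} \gamma^k \mathbf{r}(t+k) \,\middle|\, \mathcal{G}(t) \right], \tag{10b} \\
& \mathbf{r}(t) = f_r(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{x}(t)), \tag{10c} \\
& \mathbf{x}(t) = h'(\mathcal{V}(t), \mathcal{E}(t), \mathbf{c}(t)), \tag{10d} \\
& \mathbf{c}(t) = \Psi(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t); \Theta_p), \tag{10e} \\
& \mathbf{S}(t+1) = f_s(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{x}(t)) . \tag{10f}
\end{align}

By fixing $h$ in the policy $\{h, \Psi\}$ to a given baseline heuristic $h'$ in (10d), we only need to optimize the actor network $\Psi(\cdot; \Theta_p)$ that generates the intermediate action $\mathbf{c}(t)$ based on $\mathcal{G}(t)$ in (10e). Similar to (7), we estimate the gradient $\nabla_{\mathbf{c}(t)} \mathbb{E}_{\mathcal{G}(1) \sim \Omega_1} [f_{\text{obj}}(\mathbf{o}(1))]$ through a twin network that predicts the value vector in (10b) as

$$\mathbf{o}(t) \approx \hat{\mathbf{o}}(t) = f_{\text{twin}}(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{c}(t); \Theta_c) .$$

Based on the linearity of expectation and the policy gradient theorem (the policy gradient does not depend on the gradient of the state distribution (Sutton et al., 1999)), the policy gradient estimate is (Silver et al., 2014)

$$\nabla_{\Theta_p} \mathbb{E}_{\mathcal{G}(1) \sim \Omega_1} \left[ f_{\text{obj}}(\mathbf{o}(1)) \right] \approx \mathbb{E}_{\mathcal{G}(1) \sim \Omega_1} \left[ \nabla_{\Theta_p} \hat{\mathbf{o}}(t) \nabla_{\hat{\mathbf{o}}(t)} f_{\text{obj}}(\hat{\mathbf{o}}(1)) \right] .$$

Then, we can update the policy network as $\Theta_p \leftarrow \Theta_p + \alpha_p \nabla_{\Theta_p} \hat{\mathbf{o}}(t) \nabla_{\hat{\mathbf{o}}(t)} f_{\text{obj}}(\hat{\mathbf{o}}(1))$ with $\mathcal{G}(1)$ sampled from $\Omega_1$. According to the derivation in Appendix C, the twin network can be trained via stochastic gradient descent, i.e., by minimizing the following loss with an off-the-shelf optimizer:

$$\ell_{\text{mse}} \left( \hat{\mathbf{o}}(t), \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1) \right), \quad \text{where } \hat{\mathbf{o}}(t+1) = \mathbf{0}, \; \forall t \ge T .$$

4. NUMERICAL RESULTS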

4.1. INDEPENDENT R-COPS

We demonstrate the effectiveness of GDPG-Twin on four types of independent R-COPs by showing that it can improve the quality of solutions of fast and/or distributed heuristics with minimal overhead. These problems are maximum weighted independent set (MWIS), minimum weighted dominating set (MWDS), node weighted Steiner tree (NWST), and minimum weighted connected dominating set (MWCDS). They are all NP-hard, and need to be solved repetitively in a wide range of applications. For example, the MWIS problem appears in various schedulers (Pinedo, 2012; Joo & Shroff, 2012; Zhao et al., 2022a) and multi-object tracking in computer vision (Brendel et al., 2011). The MWDS problem is encountered in wireless network clustering (Shahraki et al., 2020). Multicast routing in communication networks involves the NWST problem (Sun et al., 2020) the lowest cost in energy consumption or security vulnerability (Oliveira et al., 2011). We refer the readers to Appendix F for more detailed descriptions of the aforementioned applications.

[Figure 1 legend: GCNN(3)-LGS-adhoc, GCNN(3)-LGS-Twin, GCNN(3)-LGS-ZOO, MP [Paschalidis15], LGS [Joo12]]

We develop four similar ML pipelines, one for each of the four R-COPs, in which the actor and twin networks are implemented by $L$-layer and 5-layer GCNNs, respectively, and the baseline heuristics $h'(\cdot)$ in (5c) are selected as centralized or distributed greedy solvers. Details of the four ML pipelines are given in Appendix E (the equations of the GCNN), Appendix F (the detailed training configurations), and Appendix H (the hyperparameters of the GCNNs and brief descriptions of the chosen baseline heuristics). We aim to demonstrate the effectiveness of our proposed learning framework in closing the optimality gaps of some widely used fast heuristics for R-COPs, at the cost of negligible overhead, e.g., an additional computational complexity of $O(L|\mathcal{E}|/N)$ or local communication complexity of $O(L/N)$ per instance on sparse graphs.
We do not claim smaller optimality gaps than the state-of-the-art (slower and centralized) heuristics, nor that GCNN is the best candidate graph neural architecture for these problems. An appealing characteristic of our framework is its modularity, which allows the GCNNs to be replaced by any other (non-convolutional) GNN while our approach remains valid. The four COPs are widely applied in various wireless networks, involving graphs of up to hundreds of nodes, and restricted runtimes from tens of milliseconds to a few seconds. Due to the dynamic nature and the lack of real-world datasets of wireless network topologies, we follow the norm of wireless research by running our experiments on four sets of synthetic random graphs: Erdős-Rényi (ER) (Erdős & Rényi, 1959), Barabási-Albert (BA) (Albert & Barabási, 2002), Gaussian Random Partition (GRP), and connected Watts-Strogatz small-world (WS) graphs. Each test dataset contains 2000 random graphs generated by a graph model with parameters detailed in Appendix F. The optimality of a heuristic is evaluated by the average approximation ratio, $\mathbb{E}_{\Omega'} \left( \mathbf{c}^\top \mathbf{x}^{(h)} / \mathbf{c}^\top \mathbf{x}^{(b)} \right)$, on a test set $\Omega'$, where $\mathbf{x}^{(h)}$ and $\mathbf{x}^{(b)}$ are respectively the solutions from the heuristic of interest and a reference algorithm. Only for MWIS are the optimal solutions from (Zhao et al., 2022a) used as the reference; for the other COPs, greedy heuristics serve as the references.

MWIS: An independent (vertex) set for a graph is a subset of vertices not connected by any edges. The MWIS problem is to find an independent set on a vertex-weighted graph that maximizes the total weight. The baseline heuristic is a distributed local greedy solver (LGS-MWIS) (Joo & Shroff, 2012), with a local communication complexity of $O(\log |\mathcal{V}|)$. We test multiple distributed MWIS solvers, including the GCNN-enhanced and vanilla LGS-MWIS, and a message passing (MP) algorithm (Paschalidis et al., 2015), on a set of 500 random ER graphs from (Zhao et al., 2022a).
Three GCNNs are respectively trained by ad hoc RL (Zhao et al., 2022a), GDPG-Twin, and ZOO (Liu et al., 2020), on a set of 6000 ER graphs. The approximation ratios of the tested solvers w.r.t. the optimal solver are shown in Figure 1. GDPG-Twin (93.1%) performs on par with the ad hoc RL (93.2%), and slightly outperforms ZOO (92.7%) with only 1/3 to 1/2 of the evaluations of $f_{\text{net}}(\cdot; \mathbf{Z})$ (see Section 4.3). The GCNN can reduce the optimality gap of LGS-MWIS (89.7%) by 1/3, beating the MP algorithm (90.7%). Figure 3 shows that GCNN-enhanced and vanilla LGS-MWIS converge in 3 to 4 rounds in this test (the MP algorithm converges in $2|\mathcal{V}|$ rounds), where the former is slightly faster for a large $N$.

MWDS: A dominating set for a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a subset $\mathcal{D}$ of $\mathcal{V}$ such that every vertex not in $\mathcal{D}$ is adjacent to at least one member of $\mathcal{D}$. In the MWDS problem, every node is associated with a non-negative weight, and the objective is to find a dominating set of minimum total weight. The baseline heuristic is a centralized greedy algorithm, Greedy-MWDS, as detailed in Appendix H. The approximation ratio of the GCNN-enhanced Greedy-MWDS w.r.t. the vanilla Greedy-MWDS on the 4 sets of random graphs described earlier is shown in Figure 2. On average, GCNN improves the performance of Greedy-MWDS by 11.42% on BA graphs, 6.8% on ER graphs, 3.0% on GRP graphs, and 2.6% on WS graphs. The improvement is more pronounced on larger graphs. As shown in Figure 4, the average runtime of the GCNN-enhanced Greedy-MWDS on a laptop with 4 CPUs and no GPU (detailed in Appendix I) ranges from 32 to 78 milliseconds per instance and is linear in the graph size; it can be further cut to 2-17 milliseconds for a large reusing factor $N$.

NWST: In the NWST problem, we are given an undirected graph $\mathcal{G}$ with node costs (non-negative weights) and a subset of nodes called terminals. The goal is to find a minimum cost subgraph of $\mathcal{G}$ that connects the terminals. In the test, the terminals are selected by randomly removing 10% to 50% of the nodes from a maximal independent set (MIS) on the graph. The baseline heuristic is a distributed greedy algorithm, Kruskal's shortest path heuristic (K-SPH-NWST) (Matsuyama, 1980; Bauer & Varma, 1996). The approximation ratio of the GCNN-enhanced K-SPH-NWST w.r.t. the vanilla K-SPH-NWST on the 4 sets of random graphs is shown in Figure 5. GCNN can improve K-SPH-NWST by 6.8% on BA graphs, 1.2% on GRP graphs, 1.4% on ER graphs, 0.9% on WS graphs, and 0.9% on a real-world dataset of Internet backbone topology (Knight et al., 2011) with $|\mathcal{V}| = 31.23$ on average. The noticeable gain on BA graphs is meaningful to large real-world networks, such as the Internet, World Wide Web, and social networks (Posfai & Barabasi, 2016, Ch. 5).

MWCDS: In the MWCDS problem, we are given an undirected and connected graph, and our goal is to find a minimum weighted dominating set that is connected. Our baseline heuristic is a distributed greedy algorithm (Dist.Greedy), whereas the reference algorithm is a centralized greedy heuristic. Both MWCDS heuristics are implemented in two steps (Sun et al., 2019): 1) find an MWDS, 2) connect the MWDS by solving an NWST problem where the terminals are the solution of step 1. The approximation ratios of the vanilla and GCNN-enhanced distributed heuristics w.r.t. the centralized greedy heuristic on 4 sets of random graphs, each with 2000 graphs, are shown in Figure 6. Despite the large gap between distributed and centralized greedy algorithms, GCNN can improve the distributed greedy by 17.8% on GRP graphs, 16.0% on ER graphs, 6.6% on WS graphs, and 4.0% on BA graphs. The average centralized runtimes of GCNN-enhanced distributed solvers for the MWIS, NWST, and MWCDS problems are respectively 0.017 to 0.05, 0.07 to 1.5, and 0.08 to 2.0 seconds per instance, as reported in Appendix I, along with their estimated computational time (not the execution time) in distributed execution, which is less than 0.18, 5.1, and 5.8 milliseconds, respectively.

Figure 7: GDPG-Twin achieves network-wide mean and median backlogs (smaller is better) similar to those of lookahead RL (Zhao et al., 2022b) in training a distributed link scheduler, using only 1/5 of its evaluations of $h(\cdot)$.

4.2. REPETITIVE MWIS IN A GRAPH-BASED MARKOV DECISION PROCESS

Next, we compare GDPG-Twin to an ad-hoc training method (Zhao et al., 2022b) in the context of delay-oriented distributed wireless link scheduling, demonstrating its effectiveness and efficiency in long-term goal-seeking for R-COPs in a graph-based MDP. The distributed scheduler contains a 1-layer GCNN and LGS in both methods, and a 5-layer GCNN is used as the twin in GDPG-Twin. The node feature matrix, $\mathbf{S}(t) = [\mathbf{q}(t); \mathbf{l}(t)]$, encloses the packet backlogs $\mathbf{q}(t)$ and stochastic data rates $\mathbf{l}(t)$ on all links (nodes in the conflict graph). The state transition in (4f) is based on $\mathbf{q}(t+1) = \mathbf{q}(t) + \mathbf{a}(t) - \min(\mathbf{q}(t), \mathbf{l}(t) \odot \mathbf{x}(t))$, where $\mathbf{a}(t)$ captures random packet arrivals, and $\mathbf{x}(t)$ captures binary scheduling decisions. The objective is to minimize the average backlog, $\mathbb{E}_{i \in \mathcal{V},\, t \le T}[q_i(t)]$; see Appendix A for the detailed formulation. As shown in Figure 7, two distributed schedulers, trained by 5-step lookahead RL (Zhao et al., 2022b) and by GDPG-Twin respectively, achieve similar performance in terms of the mean and median backlogs on conflict graphs with different levels of centralization, as indicated by the number on the x-axis. However, GDPG-Twin is 5 times faster than lookahead RL, as the former needs only 2 evaluations of h(•) per t, whereas the latter needs 10.
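The backlog transition and the average-backlog objective described above can be illustrated with a minimal sketch. The function names, the toy conflict graph, and the hand-picked schedules are illustrative assumptions, not part of the evaluated scheduler; in particular, the schedules here are supplied by hand rather than produced by LGS.

```python
def step_backlog(q, a, l, x):
    """One transition q(t+1) = q(t) + a(t) - min(q(t), l(t) * x(t)), per link."""
    return [qi + ai - min(qi, li * xi) for qi, ai, li, xi in zip(q, a, l, x)]


def is_independent(x, conflicts):
    """Check the independent-set constraint x_i + x_j <= 1 on every conflict edge."""
    return all(x[i] + x[j] <= 1 for i, j in conflicts)


def average_backlog(q0, arrivals, rates, schedules, conflicts):
    """Average backlog over links and time slots for a given schedule sequence.

    Assumption: the backlog is sampled after each transition.
    """
    q, total, count = list(q0), 0, 0
    for a, l, x in zip(arrivals, rates, schedules):
        assert is_independent(x, conflicts), "schedule violates a conflict edge"
        q = step_backlog(q, a, l, x)
        total += sum(q)
        count += len(q)
    return total / count


# Toy example: three links on a path-shaped conflict graph 0 - 1 - 2.
conflicts = [(0, 1), (1, 2)]
q_next = step_backlog([5, 0, 3], [1, 2, 0], [4, 4, 4], [1, 0, 1])
avg = average_backlog(
    [5, 0, 3],
    arrivals=[[1, 2, 0], [0, 0, 0]],
    rates=[[4, 4, 4], [4, 4, 4]],
    schedules=[[1, 0, 1], [0, 1, 0]],
    conflicts=conflicts,
)
```

A scheduled link drains at most its current backlog, which is why the `min` term appears in the transition.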

4.3. COMPARISON WITH ZEROTH-ORDER OPTIMIZATION IN TRAINING

The trajectories of the relative performance of GCNN-enhanced LGS-MWIS (as detailed in Section 4.1) w.r.t. vanilla LGS-MWIS, under different training methods, are illustrated in Figure 8, where the x-axis is the number of evaluations of the non-differentiable h(•) (LGS-MWIS). Each point on a curve is the average value over 100 instances. During training, random weights are generated on-the-fly for a training graph, whereas the weights on a validation graph are unchanged. GDPG-Twin and ZOO are configured with the same sampling radius and learning rate. In training, GDPG-Twin converges within only 1/3 to 1/2 of the evaluations of h(•) required by the fastest ZOO, which is based on 2-point gradient estimation. GDPG-Twin is therefore more sample-efficient than ZOO by a factor of 2 to 3.
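For readers unfamiliar with the ZOO baseline, the following is a minimal sketch of a textbook two-point zeroth-order gradient estimate. It assumes Gaussian perturbation directions and a black-box objective `f`; the exact ZOO configuration used in the experiments is not specified here, so the function name and defaults are illustrative only.

```python
import random


def two_point_gradient(f, z, mu=0.1, u=None, rng=random):
    """Two-point zeroth-order estimate of the gradient of f at z.

    Uses exactly two evaluations of the black-box f per estimate:
        g = (f(z + mu*u) - f(z - mu*u)) / (2*mu) * u
    If no direction u is given, a standard Gaussian direction is drawn
    (an assumption of this sketch).
    """
    if u is None:
        u = [rng.gauss(0.0, 1.0) for _ in z]
    z_plus = [zi + mu * ui for zi, ui in zip(z, u)]
    z_minus = [zi - mu * ui for zi, ui in zip(z, u)]
    slope = (f(z_plus) - f(z_minus)) / (2.0 * mu)
    return [slope * ui for ui in u]


# For a linear objective, the estimate along a coordinate direction
# recovers the corresponding partial derivative.
linear = lambda z: 3.0 * z[0] + 2.0 * z[1]
g = two_point_gradient(linear, [1.0, 1.0], mu=0.1, u=[1.0, 0.0])
```

Each estimate costs two evaluations of the objective, which is the quantity counted on the x-axis of Figure 8.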

5. CONCLUSION

We address repetitive combinatorial optimization problems under practical restrictions in runtime and/or distributed execution, by introducing a non-differentiable policy network based on a hand-picked, fast and/or distributed heuristic, which is parameterized by a continuous-valued, high-dimensional intermediate action from an actor GNN. The actor GNN is optimized by graph-based deterministic policy gradient with the help of a critic based on a twin network that can predict the node-wise expected outcomes of the policy network. Through 5 examples, we demonstrate that our approach can: 1) leverage the shared underlying topology of independent R-COPs to reduce the average optimality gap of the fast and/or distributed heuristics, and 2) optimize the long-term objectives for R-COPs in a graph-based Markov decision process. In terms of limitations, our work has not addressed the variance and worst case of the optimality gap, which are important for many real-world applications. Our evaluation is based on synthetic rather than real-world graphs. Moreover, the actor and critic networks in our framework would need further design for broader tasks, e.g., edge-related R-COPs. In addition, we do not expect our work to have any impact on social equality.

A APPENDIX: TWO EXEMPLARY FORMULATIONS OF REPETITIVE COPS

For link scheduling in wireless multihop networks (Joo & Shroff, 2012; Zhao et al., 2022a; b), the network state is $\mathcal{G}(t) = (\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t))$, where $(\mathcal{V}(t), \mathcal{E}(t))$ is the underlying topology of the conflict graph, in which a vertex $i \in \mathcal{V}(t)$ represents a wireless link, and an undirected edge $\{i, j\} \in \mathcal{E}(t)$ captures the conflicting relationship between two links $i$ and $j$. The underlying topology of the conflict graph is assumed to change slowly compared to the node features $\mathbf{S}(t)$. The node feature matrix $\mathbf{S}(t) = [\mathbf{q}(t); \mathbf{l}(t)]$ encloses the vectors of packet backlogs (queue lengths), $\mathbf{q}(t)$, and predicted link rates, $\mathbf{l}(t)$, on all the links at time $t$. The backlog vector $\mathbf{q}(t)$ evolves according to the scheduling decisions $\mathbf{x}(t)$, and the state transition function $f_s(\cdot)$ in (4f) is $\mathbf{q}(t+1) = \mathbf{q}(t) + \mathbf{a}(t) - \min(\mathbf{q}(t), \mathbf{l}(t) \odot \mathbf{x}(t))$, where $\mathbf{a}(t)$ captures random packet arrivals, and $\mathbf{l}(t)$ follows a stationary random distribution. The scheduling decision is subject to a binary constraint, $x_i \in \{0, 1\}, \forall i \in \mathcal{V}$, and an independent-set constraint, $x_i + x_j \le 1, \forall \{i, j\} \in \mathcal{E}$, corresponding to (2b) and (2c), respectively. In link scheduling for throughput maximization (Joo & Shroff, 2012; Zhao et al., 2022a), (4e) is defined as the utility function $\mathbf{c}(t) = \mathbf{q}(t) \odot \mathbf{l}(t)$ to prioritize wireless links with large backlogs and high link rates. The heuristic $h(\cdot)$ in (4d) is a distributed MWIS solver, e.g., LGS in (Joo & Shroff, 2012) and GCN-LGS in (Zhao et al., 2022a). The reward vector in (4c) is defined as $\mathbf{r}(t) = \mathbf{c}(t) \odot \mathbf{x}(t)$, and the expected return is $\mathbf{o}(t) = \mathbb{E}_{t \le T}[\mathbf{r}(t)]$ for (4b). Since throughput maximization is equivalent to optimizing MWIS instances individually without considering their inter-dependencies, (4e) and (4f) together can be simplified as a sampling process $\mathbf{c} \sim \Omega_c$. Therefore, $t$ can be dropped, the expected return becomes $\mathbf{o} = \mathbb{E}_{\mathbf{c} \sim \Omega_c}[\mathbf{c} \odot \mathbf{x}]$, and the objective function in (4a) becomes $f_{\mathrm{obj}}(\mathbf{o}) = \mathbf{1}^\top \mathbf{o} = \mathbb{E}_{\mathbf{c} \sim \Omega_c}[\mathbf{c}^\top \mathbf{x}]$. The formulation is then simplified as (3) with $\mathbf{c} = -\mathbf{q} \odot \mathbf{l}$.
For delay-oriented link scheduling (Zhao et al., 2022b), the objective is to minimize the average backlog across the network and time, $\mathbb{E}_{i \in \mathcal{V},\, t \le T}[q_i(t)]$, and the inter-dependencies between network states cannot be ignored. This objective can be formulated as an average reward objective, implemented by setting the discount factor as $\gamma = 1$ and defining $f_{\mathrm{obj}}(\mathbf{o}(t)) = \frac{1}{T} \mathbb{E}_{i \in \mathcal{V}}[o_i(t)]$. The heuristic solver $h(\cdot)$ in (4d) is LGS, and the reward vector in (4c) is $\mathbf{r}(t) = \mathbf{q}'(t) - \mathbf{q}(t)$, where $\mathbf{q}'(t)$ is the backlog vector under LGS with a baseline cost vector $\mathbf{c}'(t-1) = \mathbf{q}(t-1) \odot \mathbf{l}(t-1)$. The objective function in (4a) can then be transformed to $\mathbb{E}_{i \in \mathcal{V},\, t \le T}[q'_i(t)] - \mathbb{E}_{i \in \mathcal{V},\, t \le T}[q_i(t)]$, in which the first component is a constant. The cost vector in (4e) is obtained by a GCN as $\mathbf{c}(t) = \Psi(\mathcal{V}(t), \mathcal{E}(t), \mathbf{q}(t) \odot \mathbf{l}(t); \Theta_p)$, which is trained to maximize the objective function. These two examples show that an independent R-COP is a special case of an R-COP in a graph-based MDP, and that problems under similar settings can be formulated differently based on their objectives.

B APPENDIX: SGD FOR TWIN NETWORK IN INDEPENDENT REPETITIVE COPS

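Given the MSE loss,

$$\ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o}) = \frac{1}{|\mathbf{o}|} \sum_{i=1}^{|\mathbf{o}|} (\hat{o}_i - o_i)^2 ,$$

the partial derivative of $\ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})$ w.r.t. the parameters of the twin network is

\begin{align}
\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})}{\partial \Theta_c}
&= \frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})}{\partial \hat{\mathbf{o}}} \frac{\partial \hat{\mathbf{o}}}{\partial \Theta_c} \tag{14a} \\
&= \frac{2}{|\mathbf{o}|} \left[ \hat{\mathbf{o}} - \mathbb{E}_{\Omega_c}(\mathbf{c} \odot \mathbf{x}) \right]^\top \frac{\partial \hat{\mathbf{o}}}{\partial \Theta_c} \tag{14b} \\
&= \mathbb{E}_{\Omega_c} \left[ \frac{2}{|\mathbf{o}|} (\hat{\mathbf{o}} - \mathbf{c} \odot \mathbf{x})^\top \frac{\partial \hat{\mathbf{o}}}{\partial \Theta_c} \right] , \tag{14c}
\end{align}

where the step from (14b) to (14c) is based on the linearity of expectation. Based on (14c), by drawing $\mathbf{c} \sim \Omega_c$, we have the following unbiased stochastic gradient estimation:

$$\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})}{\partial \Theta_c} = \mathbb{E}_{\Omega_c} \left[ \widehat{\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})}{\partial \Theta_c}} \right] ,$$

where the wide hat marks the single-sample estimate,

\begin{align}
\widehat{\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o})}{\partial \Theta_c}}
&= \frac{2}{|\mathbf{o}|} (\hat{\mathbf{o}} - \mathbf{c} \odot \mathbf{x})^\top \frac{\partial \hat{\mathbf{o}}}{\partial \Theta_c} \tag{16a} \\
&= \frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x})}{\partial \Theta_c} . \tag{16b}
\end{align}

Therefore, the stochastic gradient estimation for the parameters of the twin network is

$$\widehat{\nabla}_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{o}) = \nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x}) .$$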
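The linearity argument behind (14c) can be checked numerically on a toy example. The sketch below only looks at the gradient of the MSE loss w.r.t. the prediction, since the factor that maps it to the twin parameters is common to both sides; the sample values and the function names are illustrative assumptions.

```python
def mse_grad(o_hat, target):
    """Gradient of the MSE loss w.r.t. the prediction o_hat: 2/|o| * (o_hat - target)."""
    n = len(o_hat)
    return [2.0 * (p - t) / n for p, t in zip(o_hat, target)]


def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]


# Hypothetical samples of the per-node outcome c * x for two draws of c.
samples = [[1.0, 0.0], [3.0, 2.0]]
o_hat = [1.0, 1.0]

# Left-hand side: gradient against the expected outcome o = E[c * x].
grad_of_mean = mse_grad(o_hat, mean_vector(samples))

# Right-hand side: average of the per-sample gradients.
mean_of_grads = mean_vector([mse_grad(o_hat, s) for s in samples])
```

Both quantities coincide because the gradient is affine in the target, so averaging the targets and averaging the gradients commute.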

C APPENDIX: SGD FOR TWIN NETWORK IN R-COPS IN GRAPH-BASED MDP

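We define a return vector supported on the nodes of the graph as

$$\mathbf{g}(t) = \sum_{k=0}^{T-t} \gamma^k \mathbf{r}(t+k) ,$$

which can be expressed in a recursive form

$$\mathbf{g}(t) = \mathbf{r}(t) + \gamma \mathbf{g}(t+1) , \quad \text{where } \mathbf{g}(t) = \mathbf{0} , \ \forall t > T . \tag{19}$$

The outcome (value) vector of state $\mathcal{G}(t)$ is the expected return vector under policy $\Theta_p$,

$$\mathbf{o}(t) = \mathbb{E}_{\Theta_p} [\mathbf{g}(t) \mid \mathcal{G}(t)] . \tag{20}$$

The gradient $\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))$ can be found by the following partial derivative

\begin{align}
\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \Theta_c}
&= \frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \hat{\mathbf{o}}(t)} \frac{\partial \hat{\mathbf{o}}(t)}{\partial \Theta_c} \tag{21a} \\
&= \frac{2}{|\mathbf{o}(t)|} \left[ \hat{\mathbf{o}}(t) - \mathbb{E}_{\Theta_p} [\mathbf{g}(t) \mid \mathcal{G}(t)] \right]^\top \frac{\partial \hat{\mathbf{o}}(t)}{\partial \Theta_c} \tag{21b} \\
&= \mathbb{E}_{\Theta_p} \left[ \frac{2}{|\mathbf{o}(t)|} \left[ \hat{\mathbf{o}}(t) - \mathbf{g}(t) \mid \mathcal{G}(t) \right]^\top \frac{\partial \hat{\mathbf{o}}(t)}{\partial \Theta_c} \right] , \tag{21c}
\end{align}

where the step from (21b) to (21c) is based on the linearity of expectation, and on the fact that $\hat{\mathbf{o}}(t)$ is a function of $\mathcal{G}(t)$, generated by the twin network, as defined in (12). Based on (19), we have the following unbiased estimation of the partial derivative in (21a):

$$\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \Theta_c} = \mathbb{E}_{\Theta_p} \left[ \widehat{\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \Theta_c}} \right] ,$$

where the wide hat marks the single-sample estimate,

\begin{align}
\widehat{\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \Theta_c}}
&= \frac{2}{|\mathbf{o}(t)|} \left[ \hat{\mathbf{o}}(t) - \mathbf{g}(t) \mid \mathcal{G}(t) \right]^\top \frac{\partial \hat{\mathbf{o}}(t)}{\partial \Theta_c} \tag{23a} \\
&= \frac{2}{|\mathbf{o}(t)|} \left[ \hat{\mathbf{o}}(t) - \mathbf{r}(t) - \gamma \mathbf{g}(t+1) \mid \mathcal{G}(t) \right]^\top \frac{\partial \hat{\mathbf{o}}(t)}{\partial \Theta_c} . \tag{23b}
\end{align}

In (23b), $\mathbf{r}(t)$ is the reward vector collected under the current state $\mathcal{G}(t)$, as defined in (10c), and $\mathbf{g}(t+1) \mid \mathcal{G}(t)$ is the return of the next state $\mathcal{G}(t+1)$, evolved from the current state $\mathcal{G}(t)$ according to (10f). The return of the next state $\mathcal{G}(t+1)$ is estimated as $\mathbf{g}(t+1) \approx \hat{\mathbf{o}}(t+1)$ by the twin network, as defined in (12). As a result, we have the following approximation

$$\mathbf{r}(t) + \gamma \mathbf{g}(t+1) \mid \mathcal{G}(t) \approx \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1) . \tag{24}$$

Substituting (24) into (23b), we have

$$\widehat{\frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t))}{\partial \Theta_c}} \approx \frac{\partial \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1))}{\partial \Theta_c} .$$

Therefore, the stochastic gradient estimation for the parameters of the twin network is

$$\widehat{\nabla}_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{o}(t)) \approx \nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1)) .$$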
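The recursion in (19) and the bootstrapped target in (24) can be sketched in a few lines. The reward values, the discount factor, and the function names below are illustrative assumptions; the twin network's prediction for the next state is simply passed in as a vector.

```python
def node_returns(rewards, gamma):
    """Per-node returns via g(t) = r(t) + gamma * g(t+1), with g(t) = 0 for t > T.

    rewards: list over time of per-node reward vectors r(1), ..., r(T).
    Returns the list g(1), ..., g(T).
    """
    g_next = [0.0] * len(rewards[0])
    out = []
    for r in reversed(rewards):
        g_next = [ri + gamma * gi for ri, gi in zip(r, g_next)]
        out.append(g_next)
    out.reverse()
    return out


def bootstrapped_target(r_t, o_hat_next, gamma):
    """Regression target r(t) + gamma * o_hat(t+1) used in place of the true return."""
    return [ri + gamma * oi for ri, oi in zip(r_t, o_hat_next)]


# Two nodes, three time steps, hypothetical rewards.
rewards = [[1.0, 0.0], [2.0, 1.0], [4.0, 2.0]]
g = node_returns(rewards, gamma=0.5)

# If the twin predicted the next-state return exactly, the bootstrapped
# target would coincide with the true return g(1).
target = bootstrapped_target(rewards[0], g[1], gamma=0.5)
```

The bootstrapped target only needs one transition, whereas the exact return needs the whole remaining trajectory.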

D APPENDIX: ALGORITHMIC PROCEDURES OF GDPG-TWIN

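Algorithm 1 GDPG-Twin for R-COPs with Independent Instances
Input: $\Omega_c$, $\Omega_G$, $h(\cdot)$, $\alpha_p$, $\alpha_c$, $E$, $B$, $\epsilon$
Output: $\mathbf{Z}$ or $\Theta_p$, $\Theta_c$
 1: Initialize $\mathbf{Z}$ or $\Theta_p$, $\Theta_c$ randomly or as pretrained models, $\mathbf{c} = \mathbb{E}_{\Omega_c}(\mathbf{c})$
 2: for $e \in \{1, 2, \ldots, E\}$ do
 3:   $Q_p = \emptyset$, $Q_c = \emptyset$ /* Clear gradient buffers */
 4:   for $b \in \{1, \ldots, B\}$ do
 5:     Draw $\mathcal{G}(\mathcal{V}, \mathcal{E}) \in \Omega_G$, $\mathbf{c} \in \Omega_c$ /* Draw data from training dataset or target distribution */
 6:     if Actor network is used then
 7:       $\mathbf{Z} = \Psi(\mathcal{V}, \mathcal{E}, \mathbf{c}; \Theta_p)$
 8:     end if
 9:     $\mathbf{Z}^{(j)} = \mathbf{Z} + \mathbf{N}^{(j)}$, $\mathbf{N}^{(j)} \in U(-\epsilon, \epsilon)$ /* Random policy sampling */
10:     $\hat{\mathbf{o}} = f_{\mathrm{twin}}(\mathcal{V}, \mathcal{E}, \mathbf{c}, \mathbf{Z}^{(j)}; \Theta_c^*)$
11:     $\mathbf{x} = f_{\mathrm{net}}(\mathcal{V}, \mathcal{E}, \mathbf{c}; \mathbf{Z}^{(j)})$ based on $h(\cdot)$ in (5c) and $f_{\mathrm{loc}}(\cdot)$ in (5d)
12:     Estimate gradient $\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x})$ for critic
13:     $Q_c \leftarrow Q_c \cup \{\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x})\}$
14:     Estimate policy gradient $\nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}$ based on (7)
15:     if Actor network is used then
16:       Estimate gradient $\nabla_{\Theta_p} \mathbf{1}^\top \hat{\mathbf{o}} = \nabla_{\Theta_p} \Psi(\mathcal{V}, \mathcal{E}, \mathbf{c}; \Theta_p) \nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}$ for actor
17:       $Q_p \leftarrow Q_p \cup \{\nabla_{\Theta_p} \mathbf{1}^\top \hat{\mathbf{o}}\}$
18:     else
19:       $Q_p \leftarrow Q_p \cup \{\nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}\}$
20:     end if
21:   end for
22:   $\Theta_c \leftarrow \Theta_c - \alpha_c \mathbb{E}_{Q_c}[\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}, \mathbf{c} \odot \mathbf{x})]$
23:   $\mathbf{Z} \leftarrow \mathbf{Z} - \alpha_p \mathbb{E}_{Q_p} \nabla_{\mathbf{Z}} \mathbf{1}^\top \hat{\mathbf{o}}$ or $\Theta_p \leftarrow \Theta_p - \alpha_p \mathbb{E}_{Q_p} \nabla_{\Theta_p} \mathbf{1}^\top \hat{\mathbf{o}}$
24: end for
25: Output $\mathbf{Z}$ or $\Theta_p$, $\Theta_c$

For the four demonstrated R-COPs with independent instances, we set the hyperparameters of the training procedure in Algorithm 1 as follows: $\alpha_p = \alpha_c = 0.0001$, $E = 25$, $B = 100$, $\epsilon = 0.15$.

Algorithm 2 GDPG-Twin for R-COPs in a graph-based MDP
Input: $\Omega_1$, $\Omega_G$, $h(\cdot)$, $\alpha_p$, $\alpha_c$, $T$, $E$
Output: $\Theta_p$, $\Theta_c$
 1: Initialize $\Theta_p$, $\Theta_c$ randomly or as pretrained models
 2: for $e \in \{1, 2, \ldots, E\}$ do
 3:   $Q_e = \emptyset$, $Q_p = \emptyset$, $Q_c = \emptyset$ /* Clear experience & gradient buffers */
 4:   Initialize state $\mathcal{G}(1) = (\mathcal{V}(1), \mathcal{E}(1), \mathbf{S}(1)) \sim \Omega_1$ (or $\Omega_G$ for ergodic MDP)
 5:   for $t \in \{1, \ldots, T\}$ do
 6:     $\mathbf{c}(t) = \Psi(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t); \Theta_p)$
 7:     $\hat{\mathbf{o}}(t) = f_{\mathrm{twin}}(\mathcal{V}(t), \mathcal{E}(t), \mathbf{S}(t), \mathbf{c}(t); \Theta_c^*)$
 8:     Obtain decision vector $\mathbf{x}(t)$ based on (10d)
 9:     Observe reward vector $\mathbf{r}(t)$ according to (10c)
10:     Update state feature $\mathbf{S}(t+1)$ according to (10f)
11:     Estimate stochastic policy gradient $\nabla_{\Theta_p} f_{\mathrm{obj}}(\hat{\mathbf{o}}(t))$ based on (11)
12:     $Q_p \leftarrow Q_p \cup \{\nabla_{\Theta_p} f_{\mathrm{obj}}(\hat{\mathbf{o}}(t))\}$
13:     $Q_e \leftarrow Q_e \cup \{\langle \ldots, \mathbf{r}(t), \hat{\mathbf{o}}(t) \rangle\}$
14-17: ...
18:     $Q_c \leftarrow Q_c \cup \{\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1))\}$
19:   end for
20:   $\Theta_c \leftarrow \Theta_c - \alpha_c \mathbb{E}_{Q_c}[\nabla_{\Theta_c} \ell_{\mathrm{mse}}(\hat{\mathbf{o}}(t), \mathbf{r}(t) + \gamma \hat{\mathbf{o}}(t+1))]$
21:   $\Theta_p \leftarrow \Theta_p + \alpha_p \nabla_{\Theta_p} f_{\mathrm{obj}}(\hat{\mathbf{o}}(1))$ /* Objective maximization */
22: end for
23: Output $\Theta_p$, $\Theta_c$

For the demonstrated delay-oriented link scheduling (Zhao et al., 2022b), we use the following hyperparameters in the training described in Algorithm 2: $\alpha_p = \alpha_c = 0.0001$, $T = 64$, $E = 200$. The types and mixture of the training graphs, random processes, etc., are identical to (Zhao et al., 2022b).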

E APPENDIX: GRAPH CONVOLUTIONAL NEURAL NETWORKS

An $L$-layer GCNN is implemented as follows: given the input feature $\mathbf{S}^{(0)} = \mathbf{S}$ supported on a graph $\mathcal{G}$, the output is $\mathbf{Z} = \mathbf{S}^{(L)} = f_{\mathrm{GCN}}(\mathcal{G}, \mathbf{S}; \Theta)$, where an intermediate $l$th layer of the GCNN is

$$\mathbf{S}^{l} = \sigma_l \left( \mathbf{S}^{l-1} \Theta_0^l + \mathcal{L} \mathbf{S}^{l-1} \Theta_1^l \right) , \quad l \in \{1, \ldots, L\} . \tag{27}$$

In (27), $\mathcal{L}$ is the normalized Laplacian of graph $\mathcal{G}$, $\Theta_0^l, \Theta_1^l \in \mathbb{R}^{g_{l-1} \times g_l}$ are the trainable parameters, $g_{l-1}$ and $g_l$ are the dimensions of the output features of layers $l-1$ and $l$, respectively, and $\sigma_l(\cdot)$ is an element-wise activation function. In a distributed system, (27) can be implemented by the following local operation on node $v \in \mathcal{V}$,

$$\mathbf{S}^{l}_{v*} = \sigma_l \left( \mathbf{S}^{l-1}_{v*} \Theta_0^l + \left[ \mathbf{S}^{l-1}_{v*} - \sum_{u \in \mathcal{N}(v)} \frac{\mathbf{S}^{l-1}_{u*}}{\sqrt{d(v) d(u)}} \right] \Theta_1^l \right) ,$$

where $\mathbf{S}^{l}_{v*} \in \mathbb{R}^{1 \times g_l}$ is the $v$th row of matrix $\mathbf{S}^{l}$, which captures the features on node $v$, $d(v)$ is the degree of node $v$, and $\mathcal{N}(v)$ is the set of neighboring nodes of node $v$.

F APPENDIX: CONFIGURATIONS OF RANDOM GRAPHS FOR TRAINING AND TESTING

p ∈ U(0.15, 0.35), k ∈ ⌊U(10, 30)⌉. Training graph size |V| ∈ {100, 150, 200, 250, 300}. Testing graph size |V| ∈ {100, 150, 200, 250, 300, 350, 400, 450, 500} (varies by problem).
ER: size |V|, edge probability |V|/k.
BA: size |V|, number of edges to attach from a new node to existing nodes m = ⌊pk⌉.
WS: size |V|, each node is joined with its k nearest neighbors in a ring topology, p: the probability of rewiring each edge.
GRP: size |V|, mean cluster size k, shape parameter min(7, k), probability of intra-cluster connection p, probability of inter-cluster connection max(0.1, p/3).

For the MWIS problem in Section 4.1, the test set of 500 ER graphs and the corresponding optimal solutions are from https://github.com/zhongyuanzhao/distgcn (Zhao et al., 2022a).
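As a sanity check of the GCNN layer in Appendix E, the sketch below evaluates one layer in its matrix form (27) and in its node-local form, and the two can be compared on a small graph. It is a simplification under stated assumptions: a single feature channel (so the two weight matrices reduce to scalars `theta0` and `theta1`), leaky ReLU as the activation, a normalized Laplacian of the form I - D^{-1/2} A D^{-1/2}, and no isolated nodes. All names are illustrative.

```python
import math


def leaky_relu(value, slope=0.01):
    """Element-wise leaky ReLU activation."""
    return value if value > 0 else slope * value


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a graph without isolated nodes."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for v in range(n):
        lap[v][v] = 1.0
        for u in range(n):
            if u != v and adj[v][u]:
                lap[v][u] = -1.0 / math.sqrt(deg[v] * deg[u])
    return lap


def gcnn_layer_matrix(adj, s, theta0, theta1):
    """Matrix form: sigma(S * theta0 + L S * theta1), single channel."""
    lap = normalized_laplacian(adj)
    n = len(s)
    return [
        leaky_relu(s[v] * theta0 + sum(lap[v][u] * s[u] for u in range(n)) * theta1)
        for v in range(n)
    ]


def gcnn_layer_local(adj, s, theta0, theta1):
    """Node-local form: each node only uses its own and its neighbors' features."""
    n = len(s)
    deg = [sum(row) for row in adj]
    out = []
    for v in range(n):
        agg = sum(s[u] / math.sqrt(deg[v] * deg[u]) for u in range(n) if adj[v][u])
        out.append(leaky_relu(s[v] * theta0 + (s[v] - agg) * theta1))
    return out


# Path graph 0 - 1 - 2 with hypothetical features and weights.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [1.0, -2.0, 0.5]
out_matrix = gcnn_layer_matrix(adj, features, theta0=0.3, theta1=-0.7)
out_local = gcnn_layer_local(adj, features, theta0=0.3, theta1=-0.7)
```

The local form needs one exchange of features with direct neighbors per layer, which is what makes the layer usable in distributed execution.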

G APPENDIX: APPLICATIONS OF THE FOUR EXEMPLARY R-COPS

G.1 MWIS IN SCHEDULING AND COMPUTER VISION

Definition: An independent (vertex) set for a graph is a subset of vertices not connected by any edges. The MWIS problem is to find an independent set on a vertex-weighted graph that maximizes the total weight.

MWIS for wireless scheduling: MWIS can be applied to link scheduling in wireless multihop networks with orthogonal multiple access (Joo & Shroff, 2012; Zhao et al., 2022a). In a wireless multihop network, a wireless link refers to a pair of nearby wireless transceivers that can directly talk to each other. In orthogonal multiple access, two links conflict with each other if they share the same transceiver (which can only be tuned to one link at a time), or if they would block each other when activated simultaneously (e.g., a transceiver of one link is located so close to a transceiver of the other link that it interferes with the reception of the wireless signal). In each time slot, a link scheduler (Max-Weight scheduler) determines a set of non-interfering links to be activated so as to maximize the total utility of the wireless network. Max-Weight scheduling essentially finds a MWIS on the conflict graph of the wireless network, which is defined as follows: each vertex in the conflict graph is a link in the wireless network, and an edge captures the conflict relationship between two links. For example, with a per-link utility function based on the length of backlogged data packets of each link, Max-Weight scheduling can achieve the maximum throughput (the amount of data packets transmitted in a time slot) of the wireless network. The typical length of a time slot in various wireless communication protocols ranges from 1 to 100 milliseconds, which means that a Max-Weight scheduler needs to solve an MWIS instance every 1 to 100 milliseconds.
Meanwhile, the topology of the conflict graph in wireless scheduling (determined by the topology of the wireless network and the physical locations of the transceivers) evolves at a much slower pace, such as seconds to minutes for mobile wireless networks, or remains the same if all the transceivers (such as microwave towers and wireless sensors) are static. Graph coloring problems are also applied to wireless scheduling, especially multi-channel wireless scheduling. The conflict graph is formulated as in the previous single-channel scheduling, but instead of finding a MWIS, each node in the conflict graph (link in the wireless network) is assigned a color representing a particular channel, so that neighboring nodes (conflicting links) never have the same color (channel). In Cornaz et al. (2017), four types of graph coloring problems, namely the vertex coloring problem (VCP), the equitable vertex coloring problem (ECP), Max-coloring (Max-Col), which can be seen as the weighted version of VCP, and the bin packing problem with conflicts (BPPC), can be converted to solving MWIS on an associated graph.

MWIS for multiobject tracking in computer vision: In multiobject tracking (Brendel et al., 2011), a detector first identifies a set of objects in a video frame and records the following properties of the corresponding bounding box of each object: location, size, and the histograms of color, intensity gradients, and optical flow. Next, the detected objects across different video frames need to be linked according to their properties to maintain their unique identities. The second step is formulated as finding a MWIS on a graph, in which nodes represent candidate matches (of two objects) from every two consecutive frames, referred to as tracklets; node weights encode the similarity of the corresponding matches; and edges connect nodes whose corresponding tracklets violate the hard constraint that no two matches share the same object.
If there are 10 objects in each of two consecutive frames, there would be 100 tracklets. The MWIS on such a graph is a set of matches that maximizes the total similarity of the tracklets, resulting in the most plausible tracking of multiple objects. In this formulation, a MWIS instance needs to be solved for every video frame (e.g., at 24 or 30 frames per second), while the topologies of the graphs of consecutive MWIS instances would be similar, since consecutive frames in a typical video stream remain similar.
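The tracklet graph described above can be built in a few lines, which also confirms the count of 100 tracklets for 10 objects per frame. This is a sketch under the assumption that every pair of objects from two consecutive frames is a candidate match; the function name and the integer object identifiers are illustrative.

```python
from itertools import product


def tracklet_conflict_graph(objects_a, objects_b):
    """Build the conflict graph of candidate matches between two frames.

    Nodes are candidate matches (a, b). Two nodes are connected if they
    share an object, since one object cannot take part in two matches.
    """
    nodes = list(product(objects_a, objects_b))
    edges = set()
    for i, (a1, b1) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            a2, b2 = nodes[j]
            if a1 == a2 or b1 == b2:
                edges.add((i, j))
    return nodes, edges


nodes, edges = tracklet_conflict_graph(range(10), range(10))
```

Each tracklet conflicts with the 9 other tracklets that reuse its first object and the 9 that reuse its second object, so the graph is 18-regular with 900 edges.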

G.2 MWDS IN WIRELESS SENSOR NETWORKS AND COVERING CODES

Definition: A dominating set for a graph G = (V, E) is a subset D of V such that every vertex not in D is adjacent to at least one member of D. In the MWDS problem, every node is associated with a non-negative weight, and the objective is to find a dominating set of minimum total weight.

MWDS for clustering in wireless sensor networks: Wireless sensor networks are a type of ad-hoc network for monitoring purposes. They usually include a large number of sensor nodes, which are resource-constrained (e.g., in battery power), but can connect to other nodes of the network to transmit sensed data. Each node can also forward data from its neighbors to the sink (or gateway, base station, server, etc.) (Shahraki et al., 2020). Clustering is one of the most popular techniques for the topology management of wireless sensor networks. It organizes sensor nodes into a set of groups called clusters; each cluster has one or more cluster heads, which gather data from the other members of the cluster and send the (fused) data to the sink directly or indirectly. With clustering, resource-constrained sensor nodes do not need to send their data to the sink directly, which can cause energy depletion, inefficient resource consumption, and interference. Clustering can be formulated as an MWDS problem, where the cluster heads form a dominating set so that the rest of the sensor nodes can directly reach at least one cluster head. By defining the node weight as the cost of being a cluster head, MWDS can minimize the total cost of the wireless sensor network, such as energy consumption or quality of service. To maximize the lifetime of the wireless sensor network, a different set of cluster heads may need to be selected once in a while to avoid draining the batteries of the cluster heads. The node weights would change based on the battery levels of the sensor nodes.

G.3 NWST FOR MULTICAST ROUTING IN WIRELESS NETWORKS

Definition: In the NWST problem, we are given an undirected graph G with node costs (non-negative weights) and a subset of nodes called terminals. The goal is to find a minimum-cost subgraph of G that connects the terminals.

NWST in multicast routing: A multicast route is a network route that connects more than two nodes at the same time (Oliveira et al., 2011). Multicast routing is applicable to networking scenarios where data needs to be shared by a group of users, such as when a software company sends out a security patch to the computers across the Internet on which its software is installed, or when a company pushes a notification to the smartphones on which its mobile app is installed. In these examples, the Internet is the graph, the server and the recipients are the terminals in the Steiner tree problem, and the non-terminal nodes are the routers, gateways, and other computers on the Internet. Multicast routing in wireless multihop networks can be formulated as a node-weighted Steiner tree (NWST) problem (Sun et al., 2020). The graph models the network, e.g., a node represents a wireless device and an edge represents a link between two wireless devices. The terminals in the NWST are the subset of wireless devices that need to be connected with each other, and the non-terminal nodes are the rest of the wireless devices in the network. The node weight captures the cost of a non-terminal node relaying packets, such as the consumption of energy, bandwidth, and/or CPU load. By formulating multicast routing as a NWST problem, we can connect all the terminals at minimal total cost. Whenever there is a request to establish a multicast route, a NWST problem in the network needs to be solved. In medium to large-scale wireless networks, distributed algorithms are preferred since they do not require a centralized coordinator to collect the real-time network state and compute the solution.

G.4 MWCDS IN MOBILE AD-HOC NETWORKS

Definition: In the MWCDS problem, we are given an undirected connected graph, and our goal is to find a minimum weighted dominating set that is connected.

MWCDS for virtual backbone computation: In mobile ad-hoc networks, a virtual backbone is a set of nodes that can be used to take routing decisions and that act as proxies for routing packets (Oliveira et al., 2011). It can reduce the amount of routing information shared among the nodes of the system, by requiring only a subset of nodes to be actively involved in routing decisions. Formally, we define the mobile ad-hoc network as an undirected graph, in which each node represents a mobile device, and each edge represents a link between two mobile devices. A virtual backbone needs to be a connected subgraph so that packets from one node in the virtual backbone can always reach another node in it. A node in the network needs to be either a member of the virtual backbone or adjacent to at least one member of the virtual backbone, so that it can reach the rest of the network. Therefore, a virtual backbone needs to be a dominating set of the graph. Moreover, since members of the virtual backbone take on the responsibility of routing, which comes with extra cost in battery power consumption and routing packets, we want to minimize the total cost of implementing a virtual backbone. Therefore, computing a virtual backbone amounts to finding the minimum weighted connected dominating set of the network. The virtual backbone in mobile ad-hoc networks needs to be re-established after a while, in order to adapt to the changing traffic conditions, network topology, and battery levels of the mobile devices. The network topology may change slowly, whereas the battery levels of mobile devices and the traffic conditions may change more frequently. Re-establishing the virtual backbone once in a while could avoid draining the batteries of the members of the virtual backbone. Moreover, virtual backbone computation requires a distributed algorithm to find the MWCDS, since there is no server or base station in a mobile ad-hoc network to perform such computation.

H APPENDIX: ML PIPELINES AND BASELINE HEURISTICS

H.1 MWIS PIPELINE

We adopt the GCN-LGS pipeline in (Zhao et al., 2022a) as the actor network, which comprises a 3-layer GCNN followed by a distributed local greedy solver (LGS-MWIS) (Joo & Shroff, 2012). The actor GCNN is configured as follows: the dimensions of the GCNN layers are $g_l = 32$, $l = 1, 2$, and $g_0 = g_3 = 1$; the hidden layers employ leaky ReLU activation, while the output layer employs ReLU activation. The GCNN outputs a vector $\mathbf{c} \in \mathbb{R}^{|\mathcal{V}|}$. The local function in (5d) is instantiated as a multiplier, $f_{\mathrm{loc}}(c_i, z_i) = c_i z_i$. The twin network is implemented by a 5-layer GCNN, in which the hidden layers have dimensions of $g_l = 64$, $l = 1, \ldots, 4$, and leaky ReLU activation, and the output layer has a dimension of $g_5 = 2$ and Softmax activation. The training takes about an hour on a server with 16 GB memory, 8 CPUs and 1 GPU (Nvidia 1080 Ti).

The basic greedy heuristic first sorts the nodes by their weights in decreasing order, then iteratively adds a node to the solution and removes the node and its neighbors from the sorted list, until the list is empty. The distributed heuristic LGS-MWIS (Joo & Shroff, 2012) iteratively builds a solution as follows: in an iteration, each node compares its weight with those of its neighbors; if a node has the maximal weight in its neighborhood, it marks itself as 1 and broadcasts a control message to its neighbors, who then mark themselves as -1; the unmarked nodes enter the next iteration. When all nodes are marked, the solution is the set of nodes marked as 1. LGS-MWIS has an average local communication complexity of $O(\log |\mathcal{V}|)$ on general graphs. The centralized greedy algorithm and LGS-MWIS are detailed in (Zhao et al., 2022a, Algo 1 and Appendix Algo 1).
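The two MWIS heuristics described above can be sketched as follows. This is a centralized simulation written for illustration, not the implementation of (Zhao et al., 2022a): the function names are hypothetical, ties between equal weights are broken by node index (an assumption of this sketch), and the message exchange of LGS-MWIS is simulated by synchronous rounds.

```python
def greedy_mwis(weights, neighbors):
    """Centralized greedy: take nodes by decreasing weight, drop their neighbors."""
    order = sorted(range(len(weights)), key=lambda v: -weights[v])
    removed, solution = set(), []
    for v in order:
        if v in removed:
            continue
        solution.append(v)
        removed.add(v)
        removed.update(neighbors[v])
    return solution


def lgs_mwis(weights, neighbors):
    """Round-based simulation of the local greedy solver.

    In each round, an unmarked node joins the solution if it beats all of
    its unmarked neighbors; its neighbors are then marked as excluded.
    Returns the solution set and the number of rounds.
    """
    unmarked = set(range(len(weights)))
    solution, rounds = set(), 0
    while unmarked:
        rounds += 1
        winners = {
            v for v in unmarked
            if all((weights[v], v) > (weights[u], u)
                   for u in neighbors[v] if u in unmarked)
        }
        solution |= winners
        for v in winners:
            unmarked.discard(v)
            unmarked.difference_update(neighbors[v])
    return solution, rounds


# Path graph 0 - 1 - 2 with decreasing weights.
weights = [3, 2, 1]
neighbors = [[1], [0, 2], [1]]
central = greedy_mwis(weights, neighbors)
distributed, num_rounds = lgs_mwis(weights, neighbors)
```

In this example, node 2 can only join in the second round, after its heavier neighbor 1 has been excluded by node 0, which illustrates why the number of rounds depends on the weight ordering along paths in the graph.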

H.2 MWDS PIPELINE AND GREEDY-MWIDS

We augment Greedy-MWDS with a 5-layer actor GCNN and the same local function as in the MWIS pipeline in Appendix H.1. The actor GCNN is configured as follows: the input and hidden layers have dimensions of $g_l = 32$, $l = 1, \ldots, 4$, and leaky ReLU activations, the input and output dimensions are $g_0 = g_5 = 1$, and the output layer employs ReLU activation. The twin is also implemented by a 5-layer GCNN configured the same as the twin GCNN in the MWIS pipeline in Appendix H.1, except that $g_l = 32$, $l = 1, \ldots, 4$. The GCNN is trained on randomly generated ER graphs of 100 to 300 nodes. The training takes about an hour on a server with 16 GB memory, 8 CPUs and 1 GPU (Nvidia 1080 Ti).

The greedy algorithm, Greedy-MWDS, iteratively builds a solution by adding to the solution the most cost-effective node from the vertices not yet in the solution and marking its neighbors as covered, until all nodes are either in the solution or covered. A good metric for the cost-effectiveness of node $v \in \mathcal{V}$ is (Jovanovic et al., 2010):

$$\omega(v) = \frac{c(v)}{1 + \sum_{u \in \mathcal{N}(v) \cap \mathcal{W}'} c(u)} , \quad \mathcal{W}' = \{ i \mid i \in \mathcal{V} \setminus (\mathcal{D}' \cup \mathcal{N}(\mathcal{D}')) \} ,$$

where a smaller $\omega(v)$ means better cost-effectiveness, $c(v)$ is the weight of node $v$, $\mathcal{D}'$ is the partial solution (initialized as $\emptyset$), $\mathcal{N}(\cdot)$ refers to the neighbors of a node or a vertex set, and $\mathcal{W}'$ is the set of uncovered nodes. An alternative greedy algorithm (Greedy-MWIDS) selects the most cost-effective node only from the uncovered vertices in each iteration, which builds an independent dominating set (IDS). Greedy-MWIDS can be implemented by the distributed LGS-MWIS (Joo & Shroff, 2012), and is used as part of the distributed heuristic for the MWCDS problem in Appendix H.4.
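A minimal sketch of Greedy-MWDS with the cost-effectiveness metric above is given below. It reads the metric as the node's own weight divided by one plus the total weight of its uncovered neighbors. Two details are assumptions of this sketch rather than statements from the text: candidates are restricted to nodes that would cover at least one uncovered node, and ties are broken by node index.

```python
def greedy_mwds(cost, neighbors):
    """Greedy minimum weighted dominating set using the omega metric."""
    n = len(cost)
    solution, covered = set(), set()
    while len(covered) < n:
        uncovered = set(range(n)) - covered

        def omega(v):
            # Own cost over (1 + total cost of still-uncovered neighbors).
            gain = sum(cost[u] for u in neighbors[v] if u in uncovered)
            return cost[v] / (1.0 + gain)

        # Assumption: only consider nodes that cover something new.
        candidates = [
            v for v in range(n)
            if v not in solution
            and (v in uncovered or any(u in uncovered for u in neighbors[v]))
        ]
        best = min(candidates, key=lambda v: (omega(v), v))
        solution.add(best)
        covered.add(best)
        covered.update(neighbors[best])
    return solution


def is_dominating(solution, neighbors):
    """Every node is in the solution or adjacent to a member of it."""
    return all(
        v in solution or any(u in solution for u in neighbors[v])
        for v in range(len(neighbors))
    )


# Star graph: a cheap center (node 0) with three expensive leaves.
star = [[1, 2, 3], [0], [0], [0]]
star_solution = greedy_mwds([1, 3, 3, 3], star)

# Path graph with unit costs.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
path_solution = greedy_mwds([1, 1, 1, 1, 1], path)
```

On the star graph the center has the smallest metric and dominates every leaf, so a single iteration suffices.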

H.3 NWST PIPELINE AND BASELINE HEURISTICS

The actor, twin, and local function are configured the same as in the MWDS pipeline in Appendix H.2. The input node feature for the actor and twin GCNNs is a one-hot encoded indicator of whether a node is a terminal. The GCNNs are trained on randomly generated GRP graphs with 100 to 300 nodes. The training takes 5 hours on a server with 16 GB memory, 8 CPUs and 1 GPU (Nvidia 1080 Ti).

The shortest path heuristic (SPH) (Matsuyama, 1980) initializes the terminals as a set of subtrees; starting from an arbitrary subtree, it iteratively merges the subtree with its nearest subtree through the shortest path. The distance of a path is the total cost of the nodes on it, where terminals have zero cost. The algorithm terminates when only one tree is left. The shortest path is found by Dijkstra's algorithm (Dijkstra et al., 1959). Kruskal's SPH (Bauer & Varma, 1996) (K-SPH) is a distributed variation of SPH, in which every subtree merges with its nearest subtree until only one tree is left. SPH has an approximation ratio of 2 (Matsuyama, 1980).
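The node-weighted path distance used by SPH can be sketched with a small variant of Dijkstra's algorithm in which the cost is paid on entering a node and terminals cost nothing. This covers only the shortest-path step between two terminals, not the full subtree-merging procedure; the function name and the toy graph are illustrative assumptions.

```python
import heapq


def node_weighted_shortest_path(neighbors, cost, terminals, source, target):
    """Dijkstra on node costs: a path's length is the total cost of its nodes,
    with terminals counted as zero cost. Returns (distance, path)."""

    def node_cost(v):
        return 0.0 if v in terminals else float(cost[v])

    dist = {source: node_cost(source)}
    parent = {source: None}
    heap = [(dist[source], source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        if v == target:
            break
        for u in neighbors[v]:
            nd = d + node_cost(u)
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                parent[u] = v
                heapq.heappush(heap, (nd, u))
    if target not in dist:
        return float("inf"), []
    path, v = [], target
    while v is not None:
        path.append(v)
        v = parent[v]
    return dist[target], path[::-1]


# Ring 0 - 1 - 2 - 3 - 4 - 0 with terminals 0 and 3.
ring = [[1, 4], [0, 2], [1, 3], [2, 4], [0, 3]]
costs = [9, 2, 5, 9, 3]
distance, route = node_weighted_shortest_path(ring, costs, {0, 3}, 0, 3)
```

Here the route through node 4 (cost 3) is preferred over the route through nodes 1 and 2 (cost 7), even though the terminals themselves carry large nominal costs.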

H.4 MWCDS PIPELINE, BASELINE AND REFERENCE HEURISTICS

A low-complexity heuristic for the MWCDS problem (Sun et al., 2019) can be implemented in two steps: 1) find a MWDS, 2) connect the MWDS by solving a NWST problem where the terminals are the solution of step 1. In our baseline heuristic, we choose Greedy-MWIDS in Appendix H.2 for step 1 and K-SPH-NWST in Appendix H.3 for step 2, and treat them as a single distributed greedy heuristic (Dist.Greedy). Our reference algorithm is a centralized greedy heuristic, composed of Greedy-MWDS in Appendix H.2 for step 1 and SPH-NWST in Appendix H.3 for step 2. The actor GCNN is configured as follows: the input and hidden layers have dimensions of $g_l = 32$, $l = 1, \ldots, 4$, and leaky ReLU activations, the input and output dimensions are $g_0 = 1$, $g_5 = 2$, and the output layer employs linear activation. The actor outputs $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times 2}$, and the local function is a single neuron, $f_{\mathrm{loc}}(c_i, \mathbf{Z}_{i*}) = \mathrm{ReLU}(Z_{i,1} c_i + Z_{i,2})$. The twin network is a 5-layer GCNN configured identically to the twin in Appendices H.2 and H.3, except that its input dimension fits the output dimension of its own actor GCNN. The GCNN is trained on GRP graphs of 100 to 300 nodes. The training takes 8 hours on a server with 16 GB memory, 8 CPUs and 1 GPU (Nvidia 1080 Ti).

I APPENDIX: CENTRALIZED RUNTIME IN SECONDS

We report the actual runtime of the GCNN-enhanced COP solvers demonstrated in Section 4, measured on a laptop computer (MacBook Pro, 16 GB memory, 2 GHz quad-core Intel Core i5, CPU only). Note that these runtimes are based on our Python code, whose implementation could be further optimized for speed. For the examples demonstrated by distributed COP solvers, the centralized runtime does not represent their distributed runtime. Therefore, we report both the total centralized runtime and the per-node centralized runtime for each distributed solver. The latter represents the estimated computational time in distributed execution. However, the major component of the distributed runtime for distributed solvers is the communication time rather than the computational time, which could not be measured in our setting. In general, the GCNN no longer dominates the total runtime for these distributed COP solvers. For MWIS in Figures 9(a) and 9(b), the centralized runtime of the distributed COP solvers is still linear in the graph size. For NWST (Figures 10(a) and 10(b)) and MWCDS (Figures 11(a) and 11(b)), the major component of the centralized runtime is Dijkstra's algorithm for the shortest path, which has a time complexity of $O(|\mathcal{V}|^2)$ or $O((|\mathcal{V}| + |\mathcal{E}|) \log |\mathcal{V}|)$.



lowered the time complexity of vehicle routing problems to O(|E|) (typically, |E| > |V|) by encoding the graph by a GNN only once per COP instance. However, for R-COPs, this formulation is limited by centralized execution and the complexity of a GNN, and does not apply to R-COPs in graph-based MDPs.

Figure 1: Approximation ratios (Larger is better) of the vanilla and GCNN-enhanced distributed heuristics for MWIS problem (max), w.r.t. the optimal solver.

Figure 2: Approximation ratio (Smaller is better) of the GCNN-enhanced w.r.t. the vanilla Greedy-MWDS for MWDS problem (min) on 4 sets of random graphs.

Figure 3: Average rounds of local message exchange for GCNN-enhanced and vanilla LGS-MWIS solvers to solve an instance, excluding the GCNN (N = ∞).

Figure 4: Average runtime of GCNN-enhanced Greedy-MWDS per instance by graph size, in seconds, without reuse (N = 1) and with infinite reuse (N = ∞) of Z.

Figure 8: Performance trajectories of GCNN-enhanced LGS-MWIS trained by GDPG-Twin and ZOOs with 2-point and 11-point gradient estimations. Larger is better. GDPG-Twin needs fewer evaluations of h(•).





ACKNOWLEDGMENTS

This research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-19-2-0269. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

availability

//github.com/XzrTGMu

