GRAPH-BASED DETERMINISTIC POLICY GRADIENT FOR REPETITIVE COMBINATORIAL OPTIMIZATION PROBLEMS

Abstract

We propose an actor-critic framework for graph-based machine learning pipelines with non-differentiable blocks, and apply it to repetitive combinatorial optimization problems (COPs) under hard constraints. A repetitive COP is a problem that must be solved repeatedly on graphs with the same or slowly changing topology but rapidly changing node or edge weights. Unlike one-shot COPs, repetitive COPs often rely on fast (distributed) heuristics to solve one instance of the problem before the next one arrives, at the cost of a relatively large optimality gap. Through numerical experiments on several discrete optimization problems, we show that our approach can learn reusable policies that reduce the optimality gap of fast (distributed) heuristics for independent repetitive COPs, and can optimize long-term objectives for repetitive COPs embedded in graph-based Markov decision processes.

1. INTRODUCTION

In a general network setting, the network state is captured by a graph G = (V, E, S) that comprises a vertex set V, an edge set E, and a matrix S capturing the node features. A vector o ∈ R^{|V|} captures the outcomes on individual nodes of a non-differentiable network process, f_net(·), as
o = f_net(G). (1)
We aim to optimize the system-level objective f_obj(o), where f_obj : R^{|o|} → R is a known linear combination, by improving parts of the network process f_net(·). In particular, we are interested in network processes that involve graph-based repetitive combinatorial optimization problems (R-COPs) (Kraay & Harker, 1996), i.e., COPs that need to be solved repeatedly on graphs of the same or slowly changing topology (V or E) but rapidly changing node features (S), under hard constraints that must be satisfied at all times (Kendall, 1975), which often make COPs NP-hard. R-COPs have many real-world applications, such as task scheduling (Pinedo, 2012), route planning (Vogiatzis & Pardalos, 2013), link scheduling (Joo & Shroff, 2012; Paschalidis et al., 2015; Eisen et al., 2019; Zhao et al., 2022a; b) and routing (Oliveira et al., 2011) in communication networks, and energy management in smart grids (Chau et al., 2018), where a node or edge weight captures varying cost or utility. In practice, solvers for R-COPs are often subject to restricted runtime and/or distributed execution. For example, in scheduling, network routing, and multi-object tracking in computer vision, COPs need to be solved within tens of milliseconds to a few seconds. Although memory-based approaches can avoid solving each instance from scratch for some applications (Kraay & Harker, 1996; Wang, 2021), in general, R-COPs rely on fast heuristics to meet the strict time constraints, at the cost of relatively large optimality gaps.
Moreover, in real-time networked systems, such as communication networks, smart grids, and robot swarms (Tolstaya et al., 2020), centralized solvers often suffer from the large communication overhead of gathering the full network state, high computational complexity, and the risk of a single point of failure. Therefore, distributed solutions (Moser & Tardos, 2010) are preferred for better scalability and robustness: nodes across the network work in parallel, each collecting and processing information from only its local neighborhood to make decisions. Depending on the definition of f_obj, R-COPs can be categorized as 1) independent R-COPs, in which S(t_1) and S(t_2) are considered independent if t_1 ≠ t_2, and 2) R-COPs embedded in a graph-based Markov decision process (MDP). For example, in link scheduling for wireless multihop networks (Zhao et al., 2022a; b), the network state is G(t) = (V, E, S(t)), where S(t) captures the packet backlogs of all the links at time t and depends on the schedule at time t − 1, found by solving a maximum weight independent set (MWIS) problem defined on G(t − 1). Scheduling for maximum throughput (the number of data packets transmitted on the schedule) is equivalent to optimizing each MWIS instance individually (Zhao et al., 2022a), and is therefore formulated as an independent R-COP. However, to minimize the average backlog (packets left in the queues by the schedule) across links and over time, the transition of network states must be considered, and the scheduling is formulated as an R-COP in a graph-based MDP (Zhao et al., 2022b). Similar MDP formulations can also be found in wireless scheduling for battery lifetime (Sikandar et al., 2020), vehicle routing for waste collection (Wu et al., 2020), and inventory control in distribution networks (Çelebi, 2015). However, the second type of R-COP has rarely been addressed, except in an ad hoc manner for link scheduling (Zhao et al., 2022b).
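
The link-scheduling example can be made concrete with a minimal sketch of one state transition S(t) → S(t+1). This is an illustration under stated assumptions, not the formulation of Zhao et al.: the names `is_independent` and `step`, the unit link rate (at most one packet served per scheduled link per slot), and the externally supplied arrivals are all hypothetical. It shows the hard constraint (the schedule must be an independent set of the conflict graph) and the two quantities discussed above: packets sent in a slot (throughput) and the backlog carried into the next slot.

```python
from typing import Dict, List, Set, Tuple

# Conflict graph: each node is a link, each edge marks two links that interfere.
Adjacency = Dict[int, Set[int]]


def is_independent(schedule: Set[int], adj: Adjacency) -> bool:
    """Hard constraint: no two scheduled links may be neighbors."""
    return all(adj[v].isdisjoint(schedule) for v in schedule)


def step(backlog: List[int], schedule: Set[int], arrivals: List[int],
         adj: Adjacency) -> Tuple[List[int], int]:
    """Apply one schedule; return the next backlog vector and packets sent."""
    if not is_independent(schedule, adj):
        raise ValueError("schedule violates the independence constraint")
    sent = 0
    next_backlog = []
    for v, q in enumerate(backlog):
        # Assumption: a scheduled link with a non-empty queue sends one packet.
        served = 1 if (v in schedule and q > 0) else 0
        sent += served
        next_backlog.append(q - served + arrivals[v])
    return next_backlog, sent


# Three links on a path 0 - 1 - 2: link 1 conflicts with both neighbors.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
next_backlog, sent = step([3, 5, 1], {0, 2}, [0, 1, 0], adj)
print(next_backlog, sent)
```

Because the returned backlog becomes the node features of the next instance, a schedule that maximizes `sent` in the current slot does not necessarily minimize the backlog averaged over time, which is the distinction between the two R-COP categories.
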
Therefore, a general approach to R-COPs in graph-based MDPs would be of high interest. In this work, we generalize the GCN-LGS architecture in (Zhao et al., 2022a; b) to a wider range of R-COPs, through a unified framework of problem formulation and actor-critic architecture. Unlike previous works that solve a COP instance through a sequence of scalar actions, our actor network generates a high-dimensional intermediate action to parameterize a given classical heuristic, h′(·), which produces vectorized decisions by solving one or multiple instances of an R-COP. This formula-



.1 EXISTING APPROACHES TO COPS

Centralized solvers: The general approach for exactly solving a COP is to formulate it as a mixed-integer program and solve it by branch-and-bound (Land & Doig, 1960) or dynamic programming. Commercial solvers such as Gurobi (LLC, 2020) can exactly solve COPs on graphs of hundreds of nodes in reasonable time. Although large real-world graphs with certain structural properties can be reduced (Lamm et al., 2019) to find exact solutions, problems of large scale and/or stringent time limits often rely on efficient heuristics to approximate the solutions; common strategies include greedy algorithms,

Popular distributed approaches to COPs include the Moser-Tardos algorithm (Moser & Tardos, 2010) and dynamic programming. The complexity of distributed algorithms is typically measured by local communication complexity, which refers to the number of rounds of message exchanges between a node and its neighbors (Joo & Shroff, 2012; Zhao et al., 2022a). For example, on MWIS, the Moser-Tardos algorithm (Joo & Shroff, 2012; Moser & Tardos, 2010) can converge in O(log* |V|) rounds, whereas distributed dynamic programming (Paschalidis et al., 2015) takes O(|V|) rounds. Distributed algorithms may have larger optimality gaps due to the lack of global information, but topological metrics, such as edge betweenness for the Steiner tree problem (Fujita et al., 2016), can help reduce the gap. Learning-based distributed COP solvers have received less attention than their centralized counterparts. GCN-LGS (Zhao et al., 2022a) is an architecture proposed for a repetitive MWIS problem, in which an L-layer graph convolutional neural network (GCNN) generates a vector z for a topology (V, E), and a greedy heuristic solves modified instances (V, E, c(t) ⊙ z), rather than the original ones (V, E, c(t)), for the next N time slots t ∈ {1, . . . , N}, assuming the topology does not change before t = N.
GCN-LGS can reduce the optimality gap of greedy solvers (Joo & Shroff, 2012) by 1/3 to 1/2, with a time complexity of O(L|E|/N + |V|), or can converge in O(L/N + log |V|) iterations in a distributed setting. For a large reuse factor N, the overhead of the GCNN is negligible. However, the training methods in (Zhao et al., 2022a; b) are ad hoc and specific to MWIS problems, and may not be directly applicable to other types of R-COPs.
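
The idea of solving the modified instance (V, E, c(t) ⊙ z) instead of (V, E, c(t)) can be illustrated with a small sketch. Several things here are assumptions made for illustration only: the function names are hypothetical, the greedy routine is a plain centralized stand-in rather than the local greedy solver of Joo & Shroff (2012), and the vector z is picked by hand, whereas in GCN-LGS it is produced by a GCNN.

```python
from typing import Dict, List, Set


def greedy_mwis(weights: List[float], adj: Dict[int, Set[int]]) -> Set[int]:
    """Repeatedly pick the heaviest remaining node and remove its neighbors."""
    remaining = set(range(len(weights)))
    chosen: Set[int] = set()
    while remaining:
        # Ties are broken in favor of the lower node index.
        v = max(remaining, key=lambda u: (weights[u], -u))
        chosen.add(v)
        remaining -= adj[v] | {v}
    return chosen


def solve_scaled(c: List[float], z: List[float],
                 adj: Dict[int, Set[int]]) -> Set[int]:
    """Run the greedy heuristic on the element-wise product c * z."""
    return greedy_mwis([ci * zi for ci, zi in zip(c, z)], adj)


def total_weight(nodes: Set[int], c: List[float]) -> float:
    """Score a solution with the ORIGINAL weights c, not the scaled ones."""
    return sum(c[v] for v in nodes)


# Path graph 0 - 1 - 2 with weights chosen so that plain greedy is suboptimal.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
c = [2.0, 3.0, 2.0]
z = [1.0, 0.5, 1.0]  # hand-picked for illustration, not a learned output

plain = greedy_mwis(c, adj)
scaled = solve_scaled(c, z, adj)
print(plain, total_weight(plain, c))
print(scaled, total_weight(scaled, c))
```

In this toy instance the plain greedy solution takes the middle node, while the scaled instance steers the same heuristic toward the two end nodes, which have a larger total weight under the original c. Since z depends only on the topology, it can be reused across the N time slots while c(t) changes.
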

availability

//github.com/XzrTGMu

