GRAPH-BASED DETERMINISTIC POLICY GRADIENT FOR REPETITIVE COMBINATORIAL OPTIMIZATION PROBLEMS

Abstract

We propose an actor-critic framework for graph-based machine learning pipelines with non-differentiable blocks, and apply it to repetitive combinatorial optimization problems (COPs) under hard constraints. Repetitive COPs are problems that must be solved repeatedly on graphs with the same or slowly changing topology but rapidly changing node or edge weights. Unlike one-shot COPs, repetitive COPs often rely on fast (distributed) heuristics to solve one instance of the problem before the next one arrives, at the cost of a relatively large optimality gap. Through numerical experiments on several discrete optimization problems, we show that our approach can learn reusable policies that reduce the optimality gap of fast (distributed) heuristics for independent repetitive COPs, and can optimize the long-term objectives of repetitive COPs embedded in graph-based Markov decision processes.

1. INTRODUCTION

In a general network setting, the network state is captured by a graph $G = (V, E, S)$, which comprises a vertex set $V$, an edge set $E$, and a matrix $S$ capturing the node features. A vector $o \in \mathbb{R}^{|V|}$ captures the outcomes on individual nodes of a non-differentiable network process, $f_{net}(\cdot)$, as:

$$o = f_{net}(G). \tag{1}$$

We aim to optimize the system-level objective $f_{obj}(o)$, where $f_{obj}: \mathbb{R}^{|o|} \to \mathbb{R}$ is a known linear combination, by improving parts of the network process $f_{net}(\cdot)$. In particular, we are interested in network processes that involve graph-based repetitive combinatorial optimization problems (R-COPs) (Kraay & Harker, 1996), that is, COPs that need to be solved repeatedly on graphs of the same or slowly changing topology ($V$ or $E$) but rapidly changing node features ($S$), under hard constraints that must be satisfied at all times (Kendall, 1975).

In practice, solvers for R-COPs are often subject to restricted runtime and/or distributed execution. For example, in scheduling, network routing, and multi-object tracking in computer vision, COPs need to be solved within tens of milliseconds to a few seconds. Although memory-based approaches can avoid solving each instance from scratch for some applications (Kraay & Harker, 1996; Wang, 2021), in general, R-COPs rely on fast heuristics to meet the strict time constraints, at the cost of relatively large optimality gaps. Moreover, in real-time networked systems, such as communication networks, smart grids, and robot swarms (Tolstaya et al., 2020), centralized solvers often suffer from the large communication overhead of gathering the full network state, high computational complexity, and the risk of a single point of failure. Therefore, distributed solutions (Moser & Tardos, 2010) are preferred for better scalability and robustness, in which nodes across the network work in parallel to collect and process the information of only their local neighborhoods for decision making.
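To make the setup concrete, the following is a minimal sketch of one possible R-COP instance, not the method proposed in this paper. It assumes that the network process $f_{net}$ is a greedy heuristic for the maximum weighted independent set problem, that the first column of $S$ holds the node weights, and that $f_{obj}$ is the plain sum of the node outcomes. All function and variable names (`greedy_independent_set`, `f_net`, `f_obj`, `adj`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

Adjacency = Dict[int, List[int]]


def greedy_independent_set(adj: Adjacency, weights: List[float]) -> List[int]:
    """Fast heuristic: visit nodes by decreasing weight and select a node
    unless one of its neighbors was already selected (hard constraint)."""
    n = len(weights)
    order = sorted(range(n), key=lambda v: -weights[v])  # stable for ties
    selected = [0] * n
    blocked = [False] * n
    for v in order:
        if not blocked[v]:
            selected[v] = 1
            blocked[v] = True
            for u in adj[v]:
                blocked[u] = True
    return selected


def f_net(adj: Adjacency, S: List[List[float]]) -> List[float]:
    """Non-differentiable network process (assumed): o[v] is the weight
    collected at node v, which is zero if v is not selected."""
    weights = [row[0] for row in S]  # assumption: column 0 of S is the weight
    x = greedy_independent_set(adj, weights)
    return [w * xv for w, xv in zip(weights, x)]


def f_obj(o: List[float]) -> float:
    """System-level objective (assumed): linear combination with unit
    coefficients, i.e. the total collected weight."""
    return sum(o)


# Fixed topology (path graph 0-1-2-3), node features change per instance.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
instances = [
    [[1.0], [3.0], [2.0], [2.5]],  # S(t_1)
    [[2.0], [3.0], [2.0], [0.5]],  # S(t_2)
]
for S in instances:
    print(f_obj(f_net(adj, S)))  # prints 5.5, then 3.5
```

On the second instance the greedy heuristic collects 3.5, whereas selecting nodes 0 and 2 would collect 4.0. This difference is an example of the optimality gap that fast heuristics incur on individual instances.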
Depending on the definition of $f_{obj}$, R-COPs can be categorized as 1) independent R-COPs, i.e., $S(t_1)$ and $S(t_2)$ are considered independent if $t_1 \neq t_2$, and 2) R-COPs embedded in a graph-based



, which often make COPs NP-hard. R-COPs have many real-world applications, such as task scheduling (Pinedo, 2012), route planning (Vogiatzis & Pardalos, 2013), link scheduling (Joo & Shroff, 2012; Paschalidis et al., 2015; Eisen et al., 2019; Zhao et al., 2022a;b) and routing (Oliveira et al., 2011) in communication networks, and energy management in smart grids (Chau et al., 2018), where a node or edge weight captures varying cost or utility.

availability

//github.com/XzrTGMu

