EMBEDDING A RANDOM GRAPH VIA GNN: MEAN-FIELD INFERENCE THEORY AND RL APPLICATIONS TO NP-HARD MULTI-ROBOT/MACHINE SCHEDULING

Abstract

We develop a theory for embedding a random graph using graph neural networks (GNN) and illustrate its capability to solve NP-hard scheduling problems. We apply the theory to address the challenge of developing a near-optimal learning algorithm to solve the NP-hard problem of scheduling multiple robots/machines with time-varying rewards. In particular, we consider a class of reward collection problems called Multi-Robot Reward Collection (MRRC). Such MRRC problems model ride-sharing, pickup-and-delivery, and a variety of related problems well. We consider the classic identical parallel machine scheduling problem (IPMS) in the Appendix. For the theory, we first observe that the MRRC system state can be represented as an extension of probabilistic graphical models (PGMs), which we refer to as random PGMs. We then develop a mean-field inference method for random PGMs. We prove that a simple modification of a typical GNN embedding is sufficient to embed a random graph even when the edge presence probabilities are interdependent. Our theory enables a two-step hierarchical inference for precise and transferable Q-function estimation for MRRC and IPMS. For scalable computation, we show that the transferability of Q-function estimation enables us to design a polynomial-time algorithm with a (1 − 1/e) optimality bound. Experimental results on solving NP-hard MRRC problems (and IPMS in the Appendix) highlight the near-optimality and transferability of the proposed methods.

1. INTRODUCTION

Consider a set of identical robots seeking to serve a set of spatially distributed tasks. Each task is given an initial age, which then increases linearly in time. Greater rewards are given for serving younger tasks, according to a predetermined reward rule. We focus on NP-hard scheduling problems possessing constraints such as 'no two robots may be assigned to the same task at once'. Such problems prevail in operations research, e.g., dispatching vehicles to serve customers in a city or scheduling machines in a factory. Impossibility results in asynchronous communication¹ [Fischer et al. (1985)] make these problems inherently centralized.

Learning-based scheduling methods for single-robot NP-hard problems. structure2vec (Dai et al. (2016)) is a popular graph neural network (GNN) derived from the fixed-point iteration of PGM-based mean-field inference. Recently, Dai et al. (2017) showed that structure2vec can construct a solution for the Traveling Salesman Problem (TSP). A partial solution to the TSP was considered as an intermediate state, and the state was represented using a heuristically constructed probabilistic graphical model (PGM). This GNN was used to infer the Q-function, which they exploited to select the next assignment. While their choice of PGM was entirely heuristic, their approach achieved near-optimality, and their trained single-robot scheduling algorithm transferred to new single-robot scheduling problems with an unseen number of tasks. Those successes were restricted to single-robot problems, except for special cases in which the problem can be modeled as a variant of single-robot TSP via multiple successive journeys of a single robot, c.f., (Nazari et al. (2018); Kool et al. (2018)).

Proposed methods and contributions. The present paper explores the possibility of near-optimally solving multi-robot, multi-task NP-hard scheduling problems with time-dependent rewards using a learning-based algorithm.
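For concreteness, the structure2vec-style embedding update referenced above, derived from the fixed-point iteration of mean-field inference, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the parameter names (W1, W2) and the ReLU nonlinearity are our assumptions.

```python
import numpy as np

def structure2vec_embed(adj, x, W1, W2, n_iters=4):
    """Sketch of a structure2vec-style embedding update.

    Each node embedding mu_v is updated from its own feature x_v and the
    sum of its neighbors' embeddings, mirroring one step of the mean-field
    fixed-point iteration on the underlying PGM.

    adj : (n, n) 0/1 adjacency matrix
    x   : (n, d_in) node features
    W1  : (d_in, d) feature weights       (hypothetical parameters)
    W2  : (d, d)    neighbor-aggregation weights
    """
    n, d = adj.shape[0], W1.shape[1]
    mu = np.zeros((n, d))                  # initial embeddings
    for _ in range(n_iters):
        # fixed-point step: mu <- relu(x W1 + (sum of neighbor mu) W2)
        mu = np.maximum(0.0, x @ W1 + adj @ mu @ W2)
    return mu
```

Iterating the update a small, fixed number of times (rather than to convergence) is the standard GNN reading of the fixed-point iteration.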
This is achieved by first extending the probabilistic graphical model (PGM)-based mean-field inference theory of Dai et al. (2016) to random PGMs. We next consider a seemingly naive two-step heuristic: (i) approximate each edge's presence probability and (ii) apply a typical GNN encoder with probabilistic adjustments. We subsequently provide theoretical results that justify this approach. We call structure2vec (Dai et al. (2016)) combined with this heuristic random structure2vec. After observing that each state of a robot scheduling problem can be represented as a random PGM, we use random structure2vec to design a reinforcement learning method, transferable to problems of different sizes, that is, to the best of our knowledge, the first to learn near-optimal NP-hard multi-robot/machine scheduling with time-dependent rewards. Experiments yield 97% optimality for MRRC problems in a deterministic environment with linearly varying rewards. This performance extends well to experiments with stochastic travel times.

At each decision epoch (decision epochs occur at every time step in our discrete-time model), we reassign all available robots to remaining tasks. We use k to index the decision epochs and let t_k denote the time at which epoch k occurs (in discrete time, t_k = k·∆). We assume that at each decision epoch we are newly given the duration of time required for a robot to complete each task, which we call the task completion time. Such task completion times may be constants or random variables; in either case, they are determined by the current state (e.g., the locations of the robot and the task) at each epoch.² We consider the initial tasks as nodes in a fully connected graph. We denote the edge from task p to task q by TT_{p,q}; its weight is the task completion time for a robot that has just completed task p to subsequently complete task q. Let E^{TT} and W^{TT} be the set of all edges TT_{p,q} and the set of corresponding weights, respectively.
(All elements of W^{TT} are multiples of ∆.)
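As a toy illustration of the task graph (E^{TT}, W^{TT}) above, the following sketch builds completion-time weights from task positions. The Euclidean-travel-time model and the names used here are our assumptions for illustration only; the paper allows arbitrary (possibly random) completion times.

```python
import itertools
import numpy as np

def build_task_graph(task_xy, speed=1.0, delta=1.0):
    """Toy construction of the task-to-task completion-time graph (E^TT, W^TT).

    TT[(p, q)] approximates the time for a robot that has just completed
    task p to subsequently complete task q; here it is Euclidean travel
    time rounded up to a multiple of the time step delta, matching the
    requirement that all elements of W^TT be multiples of delta.

    task_xy : dict mapping task id -> (x, y) position
    """
    TT = {}
    for p, q in itertools.permutations(task_xy, 2):
        dist = np.linalg.norm(np.subtract(task_xy[p], task_xy[q]))
        TT[(p, q)] = delta * np.ceil(dist / (speed * delta))  # multiple of delta
    return TT
```

With stochastic travel times, each weight would instead be a random variable from which the method only needs samples.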

State. The state s_{t_k} at time t_k is represented as (G_{t_k}, W^{RT}_{t_k}, α_{t_k}). G_{t_k} is a directed bipartite graph (R ∪ T_{t_k}, E^{RT}_{t_k}), where R is the set of all robots and T_{t_k} is the set of all remaining unserved tasks at time step t_k. The set E^{RT}_{t_k} consists of all directed edges from robots to unserved tasks at time t_k. To each edge is associated a weight equal to the task completion time; let W^{RT}_{t_k} denote the set of all such weights for all edges at t_k (either constants or random variables, whose values are restricted to multiples of ∆ in the DTDS system). For example, RT_{i,p} ∈ E^{RT}_{t_k} is an edge indicating that robot i is assigned to serve task p; to this edge a task completion time is assigned according to the current locations of robot i and task p. Each task is given an initial age, which increases linearly with time (by a multiple of ∆ for DTDS). Let α_{t_k} = {η^p_{t_k} ∈ ℝ | p ∈ T_{t_k}} denote the set of ages, where η^p_{t_k} indicates the age of task p at time step t_k. We denote the set of possible states by S. It is intuitively clear how MRRC can directly model problems with stationary tasks. MRRC can also model problems such as ride-sharing or package delivery problems in which the robot location at



¹ Due to this limitation, multi-agent (decentralized) methods are rarely used in industry (e.g., factories).
² In the latter case, our method only requires samples of the random variables; distributions are not required.



Figure 1: Representing a ride-sharing/pickup-and-delivery problem as an MRRC problem
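As a concrete illustration of the state tuple s_{t_k} = (G_{t_k}, W^{RT}_{t_k}, α_{t_k}) defined above, a minimal container might look like the following. The class and field names are hypothetical, and the transition shown is a simplified toy (it only removes served tasks and ages the rest by ∆), not the full MRRC dynamics.

```python
from dataclasses import dataclass

@dataclass
class MRRCState:
    """Minimal sketch of the MRRC state s_{t_k} = (G, W^RT, alpha).

    robots : robot ids (the set R)
    tasks  : unserved task ids (the set T_{t_k})
    W_RT   : (robot, task) -> task completion time, a multiple of delta
    alpha  : task -> age eta^p_{t_k}
    """
    robots: list
    tasks: list
    W_RT: dict
    alpha: dict
    delta: float = 1.0

    def advance(self, served):
        """Toy transition: drop served tasks and age the remainder by delta."""
        remaining = [p for p in self.tasks if p not in served]
        return MRRCState(
            robots=self.robots,
            tasks=remaining,
            W_RT={(i, p): w for (i, p), w in self.W_RT.items() if p in remaining},
            alpha={p: self.alpha[p] + self.delta for p in remaining},
            delta=self.delta,
        )
```

The bipartite edge set E^{RT}_{t_k} is implicit here as the keys of W_RT, mirroring how each robot-to-task edge carries its completion-time weight.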

