EMBEDDING A RANDOM GRAPH VIA GNN: MEAN-FIELD INFERENCE THEORY AND RL APPLICATIONS TO NP-HARD MULTI-ROBOT/MACHINE SCHEDULING

Abstract

We develop a theory for embedding a random graph using graph neural networks (GNNs) and illustrate its capability to solve NP-hard scheduling problems. We apply the theory to the challenge of developing a near-optimal learning algorithm for the NP-hard problem of scheduling multiple robots/machines with time-varying rewards. In particular, we consider a class of reward collection problems called Multi-Robot Reward Collection (MRRC). Such MRRC problems model ride-sharing, pickup-and-delivery, and a variety of related problems well. We treat the classic identical parallel machine scheduling problem (IPMS) in the Appendix. For the theory, we first observe that the MRRC system state can be represented as an extension of probabilistic graphical models (PGMs), which we refer to as random PGMs. We then develop a mean-field inference method for random PGMs and prove that a simple modification of a typical GNN embedding suffices to embed a random graph even when the edge presence probabilities are interdependent. Our theory enables a two-step hierarchical inference for precise and transferable Q-function estimation for MRRC and IPMS. For scalable computation, we show that the transferability of Q-function estimation allows us to design a polynomial-time algorithm with a $1 - 1/e$ optimality bound. Experimental results on solving NP-hard MRRC problems (and IPMS in the Appendix) highlight the near-optimality and transferability of the proposed methods.

1. INTRODUCTION

Consider a set of identical robots seeking to serve a set of spatially distributed tasks. Each task is given an initial age, which then increases linearly in time, and greater rewards are given to younger tasks when service is complete, according to a predetermined reward rule. We focus on NP-hard scheduling problems possessing constraints such as 'no two robots may be assigned to a task at once'. Such problems prevail in operations research, e.g., dispatching vehicles to transport customers in a city or scheduling machines in a factory. Impossibility results for asynchronous communication [Fischer et al. (1985)] make these problems inherently centralized (due to this limitation, multi-agent (decentralized) methods are rarely used in industry, e.g., in factories).

Learning-based scheduling methods for single-robot NP-hard problems. structure2vec (Dai et al. (2016)) is a popular graph neural network (GNN) derived from the fixed-point iteration of PGM-based mean-field inference. Recently, Dai et al. (2017) showed that structure2vec can construct a solution for the Traveling Salesman Problem (TSP). A partial solution to TSP was considered as an intermediate state, and the state was represented using a heuristically constructed probabilistic graphical model (PGM). This GNN was used to infer the Q-function, which they exploited to select the next assignment. While their choice of PGM was entirely heuristic, their approach achieved near-optimality and transferability of the trained single-robot scheduling algorithm to new single-robot scheduling problems with an unseen number of tasks. These successes were restricted to single-robot problems, except for special cases in which the problem can be modeled as a variant of single-robot TSP via multiple successive journeys of a single robot, c.f., (Nazari et al. (2018); Kool et al. (2018)).

Proposed methods and contributions. The present paper explores the possibility of near-optimally solving multi-robot, multi-task NP-hard scheduling problems with time-dependent rewards using a learning-based algorithm.
This is achieved by first extending the probabilistic graphical model (PGM)-based mean-field inference theory of Dai et al. (2016) to random PGMs. We next consider a seemingly naive two-step heuristic: (i) approximate each edge's presence probability and (ii) apply a typical GNN encoder with probabilistic adjustments. We subsequently provide theoretical results that justify this approach. We call structure2vec (Dai et al. (2016)) combined with this heuristic random structure2vec. After observing that each state of a robot scheduling problem can be represented as a random PGM, we use random structure2vec to design a reinforcement learning method that transfers to problems of different sizes and that is, to the best of our knowledge, the first to learn near-optimal NP-hard multi-robot/machine scheduling with time-dependent rewards. Experiments yield 97% optimality for MRRC problems in a deterministic environment with linearly decaying rewards; this performance extends well to experiments with stochastic traveling times. We formulate the multi-robot reward collection problem (MRRC) as a discrete-time, discrete-state (DTDS) sequential decision-making problem. (For a closely related continuous-time, continuous-state (CTCS) formulation of IPMS, see Appendix A.1.) In the DTDS formulation, time advances in fixed increments $\Delta$, and each such time step is a decision epoch. (In the CTCS formulation, the times when a robot arrives at a task or completes a task are the decision epochs.) At each decision epoch, we reassign all available robots to remaining tasks. We use $k$ to index the decision epochs and let $t_k$ denote the time that epoch $k$ occurs (in discrete time, $t_k = k \Delta$). We assume that at each decision epoch we are newly given the duration of time required for a robot to complete each task, which we call the task completion time.
Such task completion times may be constants or random variables; in either case, they are determined by the current state (e.g., the locations of the robot and the task) at each epoch. We consider the initial tasks as nodes in a fully connected graph and denote the edge from task $p$ to task $q$ as $TT_{p,q}$. The weight assigned to this edge is the task completion time for a robot that has just completed task $p$ to subsequently complete task $q$. Let $E_{TT}$ and $W_{TT}$ be the set of all edges $TT_{p,q}$ and the set of corresponding weights, respectively. (All elements of $W_{TT}$ are multiples of $\Delta$.)
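As a concrete sketch of the task graph just described, the following builds the fully connected graph whose edge $(p, q)$ carries the weight $TT_{p,q}$, rounded up to a multiple of the time step $\Delta$ as in the DTDS model. The function and argument names (`build_task_graph`, `travel_time`, the pickup/dropoff task representation) are our illustrative assumptions, not notation from the paper.

```python
import itertools
import math

def build_task_graph(tasks, travel_time, delta=1.0):
    """Hypothetical sketch: tasks maps task_id -> (pickup_location, dropoff_location);
    travel_time(a, b) returns the travel duration between two locations.
    Returns TT[(p, q)], the task completion time for a robot that has just
    completed task p (so it sits at p's dropoff) to then complete task q,
    restricted to multiples of delta."""
    TT = {}
    for p, q in itertools.permutations(tasks, 2):
        _, end_p = tasks[p]                  # robot finishing p is at p's dropoff
        pickup_q, dropoff_q = tasks[q]
        t = travel_time(end_p, pickup_q) + travel_time(pickup_q, dropoff_q)
        TT[(p, q)] = math.ceil(t / delta) * delta
    return TT
```

With distances matching the Figure 1 example (C → A is 4, A → B is 3), the edge $TT_{2,1}$ comes out to 7, as in the text.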

State. The state $s_{t_k}$ at time $t_k$ is represented as $(G_{t_k}, W^{RT}_{t_k}, \alpha_{t_k})$. $G_{t_k}$ is a directed bipartite graph $(R \cup T_{t_k}, E^{RT}_{t_k})$, where $R$ is the set of all robots and $T_{t_k}$ is the set of all remaining unserved tasks at time step $t_k$. The set $E^{RT}_{t_k}$ consists of all directed edges from robots to unserved tasks at time $t_k$. To each edge is associated a weight equal to the task completion time; let $W^{RT}_{t_k}$ denote the set of all such weights at $t_k$ (either constants or random variables, whose values are restricted to multiples of $\Delta$ in the DTDS system). For example, $RT_{i,p} \in E^{RT}_{t_k}$ is an edge indicating that robot $i$ is assigned to serve task $p$; to this edge a task completion time is assigned according to the current locations of robot $i$ and task $p$. Each task is given an initial age, which increases linearly with time (by a multiple of $\Delta$ for DTDS). Let $\alpha_{t_k} = \{\eta^p_{t_k} \in \mathbb{R} \mid p \in T_{t_k}\}$ denote the set of ages, where $\eta^p_{t_k}$ is the age of task $p$ at time step $t_k$. We denote the set of possible states as $S$. It is intuitively clear how MRRC directly models problems with stationary tasks. MRRC can also model problems such as ride-sharing or package delivery, in which the robot's location at the start of a task differs from its location at the end. Consider the pickup-and-delivery tasks illustrated in Figure 1. Task 1, denoted $\tau_1$, is to pick up from location A and deliver to location B. The weight assigned to the edge $TT_{2,1}$ is the task completion time for a robot that has just completed task 2, and is thus located at C, to subsequently complete task 1. The traveling distance to task 1 (C → A) is 4 and the delivery distance (A → B) is 3, so the task completion time is $TT_{2,1} = 4 + 3 = 7$. The middle image of Figure 1 depicts the state $s_{t_k}$ (robot nodes, task nodes, arcs from robots to tasks with their weights, and ages) together with the system $E_{TT}$ (arcs between task nodes) and its weights $W_{TT}$.

Joint assignment. Once a robot has reached a task, it serves that task until completion; otherwise, we allow reassignment prior to arrival. Thus, available robots can change their assignments whenever a decision epoch occurs. A joint assignment of robots to tasks at the current state $s_{t_k} = (G_{t_k}, W^{RT}_{t_k}, \alpha_{t_k})$, denoted $a_{t_k}$, must satisfy: (i) no two robots can be assigned to the same task, and (ii) a robot may remain unassigned only when the number of robots exceeds the number of remaining tasks. Thus, a joint assignment $a_{t_k}$ is the edge set of a maximal bipartite matching of the bipartite graph $G_{t_k}$. The action space $A_{t_k}$ depends upon $s_{t_k}$, as it is defined as the set of all maximal bipartite matchings in $G_{t_k}$. A policy $\pi$ is defined by $\pi(s_{t_k}) = a_{t_k}$, where $s_{t_k} \in S$ and $a_{t_k} \in A_{t_k}$.

Transition function and reward. In the hierarchical control literature, our assignment is termed a macro-action. In pursuit of a macro-action, robots may make multiple sequential micro-actions to serve the task. The transition probability associated with a macro-action is derived from the transition probabilities associated with micro-actions [Omidshafiei et al. (2017)]. For a joint macro-action, assume there is an induced joint micro-action $u_t \in U$ with associated transition probabilities $P(s_{t+1} \mid s_t, u_t) : S_t \times U_t \times S_t \to [0, 1]$. Omidshafiei et al. (2017) prove that we can calculate the corresponding 'extended transition function' $P(s_{t_{k+1}} \mid s_{t_k}, a_{t_k}) : S_{t_k} \times A_{t_k} \times S_{t_k} \to [0, 1]$. When a task is served, a reward is given according to a predetermined reward function of the task's age at the time of service. Note that $s_{t_k}$, $a_{t_k}$, and $s_{t_{k+1}}$ are thus sufficient to determine the reward at decision epoch $t_{k+1}$; as such, we denote the reward function as $R(s_{t_k}, a_{t_k}, s_{t_{k+1}}) : S_{t_k} \times A_{t_k} \times S_{t_k} \to \mathbb{R}$.
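The two joint-assignment constraints, (i) no shared tasks and (ii) maximality of the matching, can be checked mechanically. The following is a minimal sketch; the function name and the dict-based assignment representation are our assumptions for illustration.

```python
def is_valid_joint_assignment(assignment, robots, tasks):
    """assignment: dict robot -> task (or None for an idle robot).
    Checks the two MRRC constraints: (i) no two robots share a task,
    and (ii) a robot may be idle only when no unserved task remains
    unassigned (i.e., the matching is maximal)."""
    assigned = [t for t in assignment.values() if t is not None]
    if len(assigned) != len(set(assigned)):
        return False                      # (i) violated: duplicate task
    idle = [r for r in robots if assignment.get(r) is None]
    unassigned_tasks = set(tasks) - set(assigned)
    if idle and unassigned_tasks:
        return False                      # (ii) violated: matching not maximal
    return True
```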

Objective. Given an initial state $s_{t_0} \in S$, the MRRC problem seeks to maximize the sum of expected rewards over time by finding an optimal assignment policy
$$\pi^* = \operatorname*{argmax}_{\pi}\; \mathbb{E}_{\pi, P}\!\left[\sum_{k=0}^{\infty} R\big(s_{t_k}, \pi(s_{t_k}), s_{t_{k+1}}\big) \,\Big|\, s_{t_0}\right].$$

3.1. BACKGROUND ON PROBABILISTIC GRAPHICAL MODEL (PGM) AND STRUCTURE2VEC

PGM. Given random variables $\mathcal{X} = \{X_k\}$, suppose that we can factor the joint distribution as $p(\mathcal{X}) = \frac{1}{Z}\prod_i \phi_i(D_i)$, where $\phi_i(D_i)$ denotes a marginal or conditional distribution associated with a set of random variables $D_i$, and $Z$ is a normalizing constant. Then $\{X_k\}$ is called a probabilistic graphical model (PGM). In a PGM, $D_i$ is called a clique and $\phi_i(D_i)$ is called a clique potential for $D_i$; when we write simply $\phi_i$, suppressing $D_i$, $D_i$ is called the scope of $\phi_i$.

Mean-field inference with PGM. A popular use of PGMs is PGM-based mean-field inference. Suppose that $\mathcal{X} = \{\{Y_k\}, \{H_j\}\}$, where we are interested in inferring $\{H_j\}$ given $\{Y_k\}$. Our interest is in calculating $p(\{H_j = h_j\} \mid \{Y_k = y_k\})$, but this calculation might not be tractable. In mean-field inference, we instead find a surrogate distribution $q(\{H_j = h_j\}) = \prod_j q_j(h_j)$ with the smallest Kullback-Leibler divergence to $p(\{H_j = h_j\} \mid \{Y_k = y_k\})$, and then use this surrogate distribution to conduct the inference. Hereafter, for convenience, we suppress explicit mention of the random variable; for example, we write $p(h_j)$ for $p(H_j = h_j)$. Koller & Friedman (2009) show that, given a PGM, $q(\{h_j\})$ can be obtained by a fixed-point equation. Despite the usefulness of this approach, we are often not directly given the PGM.

Structure2vec. In some problems, such as molecule classification, data is given as graphs. For such special cases, Dai et al. (2016) suggest that this graph structure information may suffice for mean-field inference when combined with a graph neural network (GNN). Let us first embed $p(h_j \mid \{y_k\})$ into a vector $\tilde{\mu}_j$ via $\tilde{\mu}_j = \int_{\mathcal{H}} \phi(h_j)\, p(h_j \mid \{y_k\})\, dh_j$. Suppose that our problem has the special PGM structure that the joint distribution is proportional to the factorization $\prod_{k \in V} \phi(h_k, y_k) \prod_{i,j \in V} \phi(h_i, h_j)$, where $V$ denotes the set of vertex indices.
Then, according to Dai et al. (2016), the embedding of the fixed-point iteration of PGM-based mean-field inference corresponds to the neural network operation $\tilde{\mu}_i = \sigma\big(W_1 y_i + W_2 \sum_{j \neq i} \tilde{\mu}_j\big)$, where $\sigma$ denotes the ReLU function and $W_1, W_2$ denote neural network parameters. We can therefore use $\{\tilde{\mu}_k\}$ to solve the original inference problem instead of $p(\{h_k\} \mid \{y_j\})$ or $q(\{h_k\})$. Note that this neural network operation is similar to the network structure of Graph Convolutional Networks [Kipf & Welling (2017)], a popular GNN-based graph embedding method. This observation enables one to interpret GNN-based graph embedding methods as mean-field inference on a PGM.

Random PGM. Suppose that the set of all possible PGMs on $\mathcal{X}$, denoted $\mathcal{G}_{\mathcal{X}}$, is prior knowledge (e.g., for a robot scheduling problem, the PGM is often a specific Bayesian network; see Appendix A.2). A random PGM on $\mathcal{X}$ is then defined as $\{\mathcal{G}_{\mathcal{X}}, \mathcal{P}\}$, where $\mathcal{P} : \mathcal{G}_{\mathcal{X}} \to [0, 1]$ is the probability measure for the realization of an element of $\mathcal{G}_{\mathcal{X}}$. Inference of $\mathcal{P}$ itself would be difficult; to avoid this task, we start by defining semi-cliques. Let $C_{\mathcal{X}}$ denote the set of all possible cliques on $\mathcal{X}$. Only a few cliques in $C_{\mathcal{X}}$ will actually be realized as elements of the PGM drawn according to $\mathcal{P}$ and become real cliques, so we call the elements $D_m \in C_{\mathcal{X}}$ semi-cliques. Note that, given $\mathcal{P}$, we can easily calculate the presence probability $p_m$ of semi-clique $D_m$ as $p_m = \sum_{G \in \mathcal{G}_{\mathcal{X}}} \mathcal{P}(G)\, \mathbb{1}_{D_m \in G}$.

Mean-field inference with random PGM. The following theorem extends mean-field inference with PGMs (Koller & Friedman (2009)) to mean-field inference with random PGMs. It shows that we only need to infer the presence probability of each semi-clique in the random PGM, not $\mathcal{P}$.

Theorem 1 (Random PGM-based mean-field inference). Suppose we are given a random PGM on $\mathcal{X} = \{X_k\}$. Also, assume that we know the presence probabilities $\{p_m\}$ for all semi-cliques $D_m \in C_{\mathcal{X}}$.
The surrogate distribution $\{q_k(x_k)\}$ in mean-field inference is locally optimal only if
$$q_k(x_k) = \frac{1}{Z_k} \exp\Big( \sum_{m : X_k \in D_m} p_m\, \mathbb{E}_{(D_m \setminus \{X_k\}) \sim q}\big[\ln \phi_m(D_m \setminus \{X_k\}, x_k)\big] \Big),$$
where $Z_k$ is a normalizing constant and $\phi_m$ is the clique potential for semi-clique $D_m$. (For the proof, see Appendix A.3.)

Random structure2vec. From Theorem 1, we can develop a random structure2vec corresponding to a random PGM on $(\{H_k\}, \{Y_k\})$. That is, we can combine (i) the fixed-point equation of the mean-field approximation for $q_k(h_k)$ (Theorem 1) and (ii) the injective embedding $\tilde{\mu}_i = \int_{\mathcal{H}} \phi(h_i)\, p(h_i \mid y_i)\, dh_i$ to obtain a parameterized fixed-point equation for $\tilde{\mu}_k$ (see Figure 2). As in Dai et al. (2016), we restrict our discussion to the case where semi-cliques involve two random variables; in this case, we write $D_{ij}$ and $p_{ij}$ for $D_m$ and $p_m$.

Lemma 1 (Structure2vec for random PGM). Given a random PGM on $\mathcal{X} = (\{H_k\}, \{Y_k\})$, suppose, as in Dai et al. (2016), that the joint distribution is proportional to the factorization $\prod_k \phi(h_k, y_k) \prod_{i,j} \phi(h_i, h_j)$. Assume that the presence probabilities $\{p_{ij}\}$ for all pairwise semi-cliques $D_{ij} \in C_{\mathcal{X}}$ are given. Then the fixed-point equation of Theorem 1 for $p(\{H_k\} \mid \{y_k\})$ is embedded to generate the fixed-point equation $\tilde{\mu}_k = \sigma\big(W_1 y_k + W_2 \sum_{j \neq k} p_{kj} \tilde{\mu}_j\big)$. The proof of Lemma 1 can be found in Appendix A.4.

Remarks. Inference of $\mathcal{P}$ is in general a difficult task. One implication of Theorem 1 is that it transforms this difficult inference task into a simple one: inferring the presence probability of each semi-clique (see Appendix A.5 for the algorithm that conducts this task). In addition, Lemma 1 provides a theoretical justification for ignoring the interdependencies among edge presences when embedding a random graph using a GNN.
When graph edges are not explicitly given or are known to be random, the simplest heuristic one can imagine is to separately infer the presence probability of each edge and adjust the weights of the GNN's message propagation accordingly. According to Lemma 1, possible interdependencies among edges do not affect the quality of this heuristic's inference. As illustrated in Appendix A.2, MRRC problems with no randomness induce Bayesian networks with factorization $\prod_k \phi(h_k, y_k) \prod_{i,j} \phi(h_i, h_j)$. Therefore, by Lemma 1, we are justified in using random structure2vec to design a method that learns solutions to our MRRC problems.
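The probability-weighted fixed-point update of Lemma 1 can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions about dimensions and initialization (zero-initialized embeddings, a fixed iteration count); it is not the paper's implementation.

```python
import numpy as np

def random_structure2vec(Y, P, W1, W2, n_iters=4):
    """Iterates mu_k = ReLU(W1 y_k + W2 * sum_j p_kj mu_j), the Lemma 1 update,
    where p_kj is the inferred presence probability of semi-clique D_kj.
    Y: (n, d_in) node observations; P: (n, n) edge presence probabilities
    with P[k, k] = 0; W1: (d, d_in); W2: (d, d). Returns (n, d) embeddings."""
    n, d = Y.shape[0], W2.shape[0]
    mu = np.zeros((n, d))
    for _ in range(n_iters):
        msg = P @ mu                                   # probability-weighted aggregation
        mu = np.maximum(0.0, Y @ W1.T + msg @ W2.T)    # ReLU fixed-point step
    return mu
```

Setting every entry of `P` to 1 recovers the original structure2vec update of Dai et al. (2016).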

4.1. DESIGNING Q-FUNCTION ESTIMATOR HAVING ORDER-TRANSFERABILITY

Intuitively, local graph information around node $k$ is embedded into the structure2vec output vector $\tilde{\mu}_k$ [Dai et al. (2016)]. Using this intuition, we propose a two-step sequential and hierarchical state-embedding neural network using random structure2vec, designed to achieve what we will later call order-transferable Q-function estimation. This allows problem-size-transferable Q-learning: the neural network parameter $\theta$, trained to calculate $Q^m_\theta$ approximating the Q-function $Q^m$ for an $m$-robot scheduling problem, can be used to solve $n$-robot scheduling problems ($n \neq m$). For brevity, we assume task completion times are deterministic; for the detailed algorithm with random task completion times, see Appendix A.6. The following procedure is illustrated in Figure 3.

Step 1: Distance embedding. The first structure2vec layer embeds information about robot locations around each task $k$, i.e., the local graph structure around each task $k$ with respect to robots, into $\tilde{\mu}^1_k$ (the superscript 1 denotes the outcome of the first layer). As the input to the first structure2vec layer ($\{x_k\}$ in Lemma 1), we use only robot assignment information: if $k$ is an assigned task, we set $x_k$ to the task completion time of the assignment (a duration); if $k$ is not an assigned task, we set $x_k = 0$.

Step 2: Value embedding. The second structure2vec layer embeds how much value is likely in the local graph around task $k$ into $\tilde{\mu}^2_k$. Recall that the output vectors of the first layer, $\{\tilde{\mu}^1_k\}$, carry information about the local graph structure of robots around each task. For each task $k$, we concatenate task $k$'s age $\eta^k_{t_k}$ with $\tilde{\mu}^1_k$ and use the resulting vectors as the input ($\{x_k\}$ in Lemma 1) to the second structure2vec layer, whose outcome we denote $\{\tilde{\mu}^2_k\}$.

Step 3: Computing $Q_\theta(s_{t_k}, a_{t_k})$.
To derive $Q_\theta(s_{t_k}, a_{t_k})$, we aggregate the embedding vectors over all nodes, $\tilde{\mu}^2 = \sum_k \tilde{\mu}^2_k$, to obtain one global vector $\tilde{\mu}^2$ that embeds the value affinity of the global graph. We then use a neural network to map $\tilde{\mu}^2$ to $Q_\theta(s_{t_k}, a_{t_k})$.

Let us provide the intuition behind the problem-size transferability of this Q-learning. For Step 1, transferability is trivial: the inference problem is a scale-free task local to each node. For Step 2, consider the ratio of robots to tasks. The overall value-affinity embedding will be underestimated if this ratio in the training environment is smaller than in the testing environment, and overestimated otherwise. The intuition is that this over/under-estimation does not matter in Q-learning [van Hasselt et al. (2015)] as long as the order of Q-function values among actions is preserved. That is, as long as the best assignments chosen are the same, i.e., $\operatorname*{argmax}_{a_{t_k}} Q^n(s_{t_k}, a_{t_k}) = \operatorname*{argmax}_{a_{t_k}} Q^n_\theta(s_{t_k}, a_{t_k})$, the magnitude of the imprecision $|Q^n(s_{t_k}, a_{t_k}) - Q^n_\theta(s_{t_k}, a_{t_k})|$ does not matter. We call this property order-transferability of the Q-function estimator with parameter $\theta$.
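The three steps above can be sketched end to end as follows. This is a NumPy toy under our own assumptions (weight shapes, iteration counts, a linear readout instead of a full neural network); the parameter names in `params` are hypothetical.

```python
import numpy as np

def s2v_layer(Y, P, W1, W2, iters=3):
    """One probability-weighted structure2vec layer (Lemma 1), iterated."""
    mu = np.zeros((Y.shape[0], W2.shape[0]))
    for _ in range(iters):
        mu = np.maximum(0.0, Y @ W1.T + (P @ mu) @ W2.T)   # ReLU update
    return mu

def q_estimate(x, ages, P, params):
    """x[k]: completion time of the assignment touching task k (0 if none);
    ages[k]: age of task k; P: (n, n) edge presence probabilities among tasks.
    Step 1: distance embedding of the assignment information.
    Step 2: value embedding of age concatenated with the Step 1 output.
    Step 3: sum-pool and a linear readout to a scalar Q_theta(s, a)."""
    mu1 = s2v_layer(x[:, None], P, params['W1a'], params['W2a'])
    mu2 = s2v_layer(np.concatenate([ages[:, None], mu1], axis=1),
                    P, params['W1b'], params['W2b'])
    return float(params['w_out'] @ mu2.sum(axis=0))
```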

4.2. ORDER TRANSFERABILITY-ENABLED AUCTION FOR SCALABLE COMPUTATION

Learning-based heuristics for solving NP-hard problems have recently received attention due to their fast computation on large NP-hard problem instances [Dai et al. (2017)]. However, this advantage disappears for Q-learning methods when faced with large action spaces [Lillicrap et al. (2015)]. For multi-robot/machine scheduling problems, the action space at each decision epoch is the set of all joint robot assignments; it grows exponentially as the number of robots and tasks increases, and so does the computational requirement of the $\operatorname*{argmax}_{a_{t_k}} Q(s_{t_k}, a_{t_k})$ operation. In this section, we demonstrate how order transferability of Q-function estimation enables us to design a polynomial-time algorithm with a provable performance guarantee ($1 - 1/e$ optimality) to substitute for the argmax operation. We call this algorithm the order transferability-enabled auction-based policy (OTAP) and denote it $\pi_{Q_\theta}$, where $Q_\theta$ indicates that the Q-function estimator with the current parameter $\theta$ is used during the auction.

4.2.1. ORDER TRANSFERABILITY-ENABLED AUCTION-BASED POLICY (OTAP)

We continue to use the notation introduced in Section 4.1. Recall that the state is $s_{t_k} = (G_{t_k}, W^{RT}_{t_k}, \alpha_{t_k})$ with $G_{t_k} = (R \cup T_{t_k}, E^{RT}_{t_k})$. OTAP finds an assignment $a_{t_k}$, the edge set of a maximal bipartite matching in the bipartite graph $G_{t_k}$, after $N = \min(|R|, |T_{t_k}|)$ iterations of Bidding and Consensus phases.

Bidding phase. In the $n$-th bidding phase, all robots initially know $M^{(n-1)}_\theta$, the ordered set of $n-1$ robot-task edges in $E^{RT}_{t_k}$ determined by the previous $n-1$ iterations. An unassigned robot $i$ ignores all other unassigned robots and calculates $Q^n_\theta(s_{t_k}, M^{(n-1)}_\theta \cup \{RT_{ip}\})$ for each unassigned task $p$, as if only those $n$ robots (robot $i$ together with the robots assigned tasks in the previous $n-1$ iterations) exist in the future and will serve all remaining tasks. (Here, $RT_{ip} \in E^{RT}_{t_k}$ is the edge corresponding to assigning robot $i$ to task $p$ at decision epoch $t_k$.) If task $p$ has the highest such value, robot $i$ bids $\{RT_{ip}, Q^n_\theta(s_{t_k}, M^{(n-1)}_\theta \cup \{RT_{ip}\})\}$ to the centralized auctioneer. Since the number of ignored robots varies at each iteration, transferability of Q-function inference is crucial.

Consensus phase. In the $n$-th consensus phase, the centralized auctioneer selects the bid with the best value, say $\{RT_{i^*p^*}, Q^n_\theta(s_{t_k}, M^{(n-1)}_\theta \cup \{RT_{i^*p^*}\})\}$, where $i^*$ and $p^*$ denote the best robot-task pair. Denote $RT_{i^*p^*} =: m^{(n)}_\theta$. The centralized auctioneer then updates the shared ordered set $M^{(n)}_\theta = M^{(n-1)}_\theta \cup \{m^{(n)}_\theta\}$. These two phases iterate until we reach $M^{(N)}_\theta = \{m^{(1)}_\theta, \ldots, m^{(N)}_\theta\}$, which is chosen as the joint assignment $a^*_{t_k}$ at time step $t_k$; that is, $\pi_{Q_\theta}(s_{t_k}) = a^*_{t_k}$. The computational complexity of computing $\pi_{Q_\theta}$ is $O(|R| |T_{t_k}|)$ and thus polynomial (see Appendix A.8.1).
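The bidding/consensus loop can be sketched compactly. Here `q_fn` is a stand-in for $Q^n_\theta$ evaluated on a partial assignment extended by one robot-task edge; since each robot bids its own best task and the auctioneer keeps the single best bid, each round simply adds the globally best remaining edge, which is what the sketch does directly. Names are our assumptions.

```python
def otap(robots, tasks, q_fn):
    """Sketch of OTAP. q_fn(M, r, t) returns the estimated Q of extending the
    ordered partial assignment M with edge (r, t). Runs min(|R|, |T|) rounds;
    each round every free robot 'bids' and the auctioneer keeps the best bid."""
    M = []                                   # ordered set M_theta^(n)
    free_r, free_t = list(robots), list(tasks)
    while free_r and free_t:
        # bidding + consensus collapse into one global max over free pairs
        _, r, t = max(((q_fn(M, r, t), r, t)
                       for r in free_r for t in free_t),
                      key=lambda bid: bid[0])
        M.append((r, t))
        free_r.remove(r)
        free_t.remove(t)
    return M
```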

Provable performance bound of OTAP.

Let the true Q-functions for OTAP be $\{Q^n\}_{n=1}^N$, and denote the outcome of OTAP under these true Q-functions as $M^{(N)} = \{m^{(1)}, \ldots, m^{(N)}\}$.

Lemma 2. If the Q-function approximator has order transferability, then $M^{(N)} = M^{(N)}_\theta$.

For any decision epoch $t_k$, let $M$ denote a set of robot-task pairs (a subset of $E^{RT}_{t_k}$). For any robot-task pair $m \in E^{RT}_{t_k}$, define
$$\Delta(m \mid M) := Q^{|M \cup \{m\}|}(s_{t_k}, M \cup \{m\}) - Q^{|M|}(s_{t_k}, M)$$
as the marginal value (under the true Q-functions) of adding the robot-task pair $m \in E^{RT}_{t_k}$. Note that we allow "adding" $m \in M$ for mathematical convenience in the subsequent proof; in that case, $\Delta(m \mid M) = 0$ for $m \in M$.

Theorem 2. Suppose that the Q-function approximation with parameter value $\theta$ exhibits order transferability. Denote by $M^{(N)}_\theta$ the result of OTAP using $\{Q^n_\theta\}_{n=1}^N$, and let $M^* = \operatorname*{argmax}_{a_{t_k}} Q^{|a_{t_k}|}(s_{t_k}, a_{t_k})$. If $\Delta(m \mid M) \geq 0$ for all $M \subset E^{RT}_{t_k}$ and $m \in E^{RT}_{t_k}$, and the marginal value of adding one robot-task pair diminishes as the assignment grows, i.e., $\Delta(m \mid M) \leq \Delta(m \mid N)$ for all $N \subset M \subset E^{RT}_{t_k}$ and $m \in E^{RT}_{t_k}$, then the result of OTAP achieves at least $1 - 1/e$ of the value of an optimal assignment. That is,
$$Q^N_\theta(s_{t_k}, M^{(N)}_\theta) \geq \big(1 - 1/e\big)\, Q^{|M^*|}(s_{t_k}, M^*).$$
For the proofs of Lemma 2 and Theorem 2, see Appendices A.7 and A.8.
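The two conditions of Theorem 2 are monotonicity and diminishing returns, i.e., the true Q of a partial assignment behaves like a monotone submodular set function, for which greedy selection classically attains the $1 - 1/e$ bound. This is not the paper's proof, but a quick numerical illustration of that structure using a weighted-coverage function as a stand-in for Q:

```python
import itertools

def coverage(sets_, chosen):
    """Value of a chosen index set under coverage (monotone submodular),
    standing in for the true Q of a partial assignment."""
    return len(set().union(*(sets_[i] for i in chosen))) if chosen else 0

def greedy(sets_, k):
    """Add the element with the largest marginal gain each round,
    mirroring OTAP's one-best-edge-per-round selection."""
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(sets_)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: coverage(sets_, chosen + [i])))
    return chosen
```

On any monotone submodular instance, comparing the greedy value with a brute-force optimum confirms the $1 - 1/e$ guarantee.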

4.2.2. AUCTION-FITTED Q-ITERATION FRAMEWORK AND EXPLORATION

Auction-fitted Q-iteration. We incorporate OTAP into fitted Q-iteration; that is, we find $\theta$ that empirically minimizes
$$\mathbb{E}_{\pi_{Q_\theta},\, s_{k+1} \sim P}\Big[\big(Q_\theta(s_k, a_k) - \big[r(s_k, a_k) + \gamma\, Q_\theta(s_{k+1}, \pi_{Q_\theta}(s_{k+1}))\big]\big)^2\Big].$$
Note that a rigorous fixed-point analysis of this method is left for future research.

Exploration. How can we conduct exploration in the auction-fitted Q-iteration framework? Unfortunately, we cannot use an $\epsilon$-greedy method since (i) an arbitrary random deviation in a joint assignment often induces a catastrophic failure [Maffioli (1986)], and (ii) the joint assignment space, which is complex and combinatorial, is difficult to explore efficiently with such an arbitrary random exploration policy. In learning the parameters $\theta$ of $Q_\theta(s_k, a_k)$, we instead randomly perturb the parameters $\theta$ to actively explore the joint assignment space with OTAP. While this method was originally developed for policy-gradient-based methods [Plappert et al. (2017)], exploration in parameter space is useful in our auction-fitted Q-iteration since it generates reasonable combinations of assignments.
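The parameter-space exploration step amounts to acting greedily (via OTAP) under a noisy copy of $\theta$. A minimal sketch, with the noise scale `sigma` as an illustrative assumption rather than a value from the paper:

```python
import numpy as np

def perturb_theta(params, sigma=0.05, seed=None):
    """Parameter-space exploration sketch: instead of epsilon-greedy over the
    combinatorial joint-assignment space, add Gaussian noise to every weight
    in theta; acting greedily under the perturbed Q_theta then yields a
    coherent, non-catastrophic joint assignment for exploration."""
    rng = np.random.default_rng(seed)
    return {name: w + sigma * rng.standard_normal(w.shape)
            for name, w in params.items()}
```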

5. EXPERIMENTS AND RESULTS

We focus on the DTDS MRRC problem in the main paper and now elaborate upon our reasoning. As discussed in Section 2, the MRRC formulation assumes that task completion times are given as prior knowledge. Recall that, in a stochastic environment, task completion times are given as random variables or sets of samples. From a simulation standpoint, one must generate such a dataset of task completion times before one can evaluate algorithms for MRRC. However, it is extremely difficult to generate a reasonable distribution of task completion times in a continuous-time, continuous-state (CTCS) environment [Bertsekas (2014); Omidshafiei et al. (2017)]. For example, finding an optimal control for stochastic routing problems in a CTCS environment is in general intractable unless space and time are discretized, transforming the CTCS environment into an approximate DTDS environment [Kushner & Dupuis (2013)]. Despite this difficulty, stochastic-environment experiments are important, since one of the main benefits of learning-based heuristics is the capability to tractably solve stochastic scheduling problems [Rossi et al. (2018)]. Therefore, in this paper, we focus on experiments in a DTDS environment and aim to show that our algorithm's performance in deterministic environments extends to stochastic environments. Since there is no standard dataset for MRRC problems, we deliberately created a grid-world environment that generates nontrivial task completion time distributions while minimizing selection bias: we use the complex maze generator of Neller et al. (2010) (see Figure 3; code provided in Appendix 10) and compare against the baselines. (For a CTCS experiment in a deterministic environment, see the IPMS experiments in Appendix A.1.)
We avoided over-fitting by randomly generating a new maze for every training and testing experiment, with initial task/robot locations also randomly chosen and only the problem size fixed. To generate the task completion times, Dijkstra's algorithm and dynamic programming were used for the deterministic and stochastic environments, respectively. To minimize artificiality, the simplest MRRC problem is considered, as follows. In the deterministic environment, robots always succeed in their intended movement. In the stochastic environment, a robot makes its intended move with a certain probability (cells with a dot: success with 55%, each other direction with 15%; cells without a dot: 70% and 10%, respectively). A task is considered served when a robot reaches it. We consider two reward rules: linearly decaying rewards $f(\text{age}) = \max\{200 - \text{age}, 0\}$ and nonlinearly decaying rewards $f(\text{age}) = \lambda^{\text{age}}$ with $\lambda = 0.99$, where age is the task's age when served. The initial ages of tasks are uniformly distributed in the interval [0, 100]. Throughout, the performance measure used is $\rho$ = (rewards collected by the proposed method)/(rewards collected by the baseline). The baselines are:

• %Optimal: For problems with a deterministic environment and linear rewards, Gurobi (Gurobi Optimization (2019)) was allowed a 60-minute time limit to search for an optimal solution.

• Ekici et al.: For deterministic environments with linear rewards, an up-to-date fast heuristic for MRRC (Ekici & Retharekar (2013)) was used (it claims 93% optimality for 50 tasks and 4 robots).

• Sequential Greedy Algorithm (SGA): To our knowledge, no literature addresses MRRC with stochastic environments or exponential rewards. Instead, we construct an indirect baseline using a general-purpose multi-robot task allocation algorithm called SGA (Han-Lim Choi et al. (2009)); we report our performance divided by SGA performance as %SGA.
We will see that the %SGA ratio obtained in the deterministic linear-reward case is maintained in the other cases.

Performance test. We tested performance in four settings: deterministic/linear rewards, deterministic/nonlinear rewards, stochastic/linear rewards, and stochastic/nonlinear rewards; see Table 1. For deterministic/linear rewards, our method achieves near-optimality, collecting on average only 3% fewer rewards than optimal (the standard deviation of $\rho$ is given in parentheses). For the other settings, the %SGA ratio of the deterministic/linear case is well maintained in stochastic or nonlinear environments. Due to the dynamic-programming cost of dataset generation, we consider at most 8 robots and 50 tasks; larger problems are considered in the IPMS experiments.

Transferability test. Table 2 shows comprehensive transferability test results. Rows indicate training conditions; columns indicate testing conditions. The results in the red diagonal cells (same training and testing size) serve as baselines (direct testing). The off-diagonal results show the transferability tests, demonstrating how algorithms trained on one problem size perform on another. Downward transfer (trained on larger problems, tested on smaller ones) shows only a small loss in performance; for upward transfer (trained on smaller problems, tested on larger ones), the performance loss was up to 4 percent.

Scalability analysis. For scalability considerations, including the computational analysis of OTAP and training data complexity, see Appendix A.8.1.
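The two reward rules used in the experiments are simple enough to state directly in code:

```python
def linear_reward(age):
    """Linearly decaying rule from the experiments: f(age) = max(200 - age, 0)."""
    return max(200.0 - age, 0.0)

def exponential_reward(age, lam=0.99):
    """Nonlinearly decaying rule: f(age) = lambda ** age, with lambda = 0.99."""
    return lam ** age
```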

6. CONCLUDING REMARKS

We developed a theory of random PGM-based mean-field inference and provided a theoretical justification for a simple modification of popular GNN methods to embed a random graph. This theory was motivated by the challenge of developing a near-optimal learning-based algorithm for solving NP-hard multi-robot/machine scheduling problems. While precise inference of the Q-function is required to address this challenge, the two-layer random structure2vec embedding procedure we proposed showed empirical success. We further addressed the poor scalability of Q-learning methods for multi-robot/machine scheduling by proposing a polynomial-time assignment algorithm with a provable performance guarantee.

IPMS is a problem defined in continuous-time, continuous-state space. Once service of a task $i$ begins, it requires a deterministic duration $\tau_i$ for a machine to complete; we call this the processing time. Machines are all identical: each task's processing time is the same on every machine, while processing times differ across tasks. Before a machine can start processing a task, it must first set up for the task. In this paper, we discuss IPMS with 'sequence-dependent setup times': a machine must conduct a setup prior to serving each task, and the duration of this setup depends on the current task $i$ and the task $j$ previously served on that machine; we call this the setup time. The completion time for each task is thus the sum of the setup time and the processing time. Under this setting, we solve the IPMS problem for makespan minimization, as discussed in [Kurz et al. (2001)]; that is, we seek to minimize the total time from the start until the completion of the last task. The sequential decision-making formulation of IPMS resembles that of MRRC, with continuous time and continuous space.
That is, every time a task is finished, we make an assignment decision for the freed machine. We call these times 'decision epochs' and express them as an ordered set (t_1, t_2, ..., t_k, ...). Slightly abusing notation, we write (·)_{t_k} = (·)_k. This problem can be cast as a Markov Decision Process (MDP) whose state, action, and reward are defined as follows. State. The state s_{t_k} at time t_k is represented as (G_{t_k}, W^{RT}_{t_k}, t_k). G_{t_k} is a directed bipartite graph (R ∪ T_{t_k}, E^{RT}_{t_k}), where R is the set of all machines and T_{t_k} is the set of all tasks remaining unserved at time t_k. The set E^{RT}_{t_k} consists of all directed edges from machines to unserved tasks at time t_k. Each edge carries a weight equal to the task completion time; let W^{RT}_{t_k} denote the set of all such weights at t_k (either constants or random variables, restricted to multiples of Δ in the DTDS system). For example, RT_{i,p} ∈ E^{RT}_{t_k} is an edge indicating that machine i is assigned to serve task p; to this edge a random variable denoting the task completion time (a duration) is assigned. Each task is given an initial age, which increases linearly with time (in multiples of Δ for DTDS). Let α_{t_k} = {η^p_{t_k} ∈ ℝ | p ∈ T_{t_k}} denote the set of ages, where η^p_{t_k} is the age of task p at time t_k. We denote the set of possible states as S. Action. Defined as in MRRC, with continuous state/time space.
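As a concrete illustration, the state just described can be sketched in code. This is our own minimal rendering (the names `State` and `completion_time` are ours), not the paper's implementation:

```python
# Illustrative sketch (not the paper's code): the IPMS state s_{t_k} as a
# directed bipartite graph from machines to unserved tasks, where each edge
# (i, p) carries the task completion time = setup time + processing time.
from dataclasses import dataclass

@dataclass
class State:
    machines: list    # machine ids (the set R)
    tasks: list       # ids of remaining unserved tasks (the set T_{t_k})
    last_task: dict   # machine id -> previously served task (or None)
    proc: dict        # task id -> processing time tau_i
    setup: dict       # (prev task, next task) -> sequence-dependent setup time
    t: float = 0.0    # current decision epoch t_k

    def completion_time(self, machine, task):
        """Weight of edge RT_{i,p}: setup (sequence-dependent) + processing."""
        prev = self.last_task.get(machine)
        return self.setup.get((prev, task), 0.0) + self.proc[task]

s = State(machines=[0, 1], tasks=[10, 11],
          last_task={0: None, 1: 10},
          proc={10: 30.0, 11: 20.0},
          setup={(10, 11): 5.0})
assert s.completion_time(1, 11) == 25.0   # setup 5 + processing 20
assert s.completion_time(0, 11) == 20.0   # no prior task -> no setup recorded
```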

Reward. Denote the time between decision epochs k and k+1 as T_k = t_{k+1} - t_k. One can easily see that T_k is completely determined by s_k, a_k, and s_{k+1}; we therefore denote the reward as T(s_k, a_k, s_{k+1}). Transition probabilities. The transition probability P is defined as in the MRRC problem. Objective. We can now define an assignment policy φ as a function that maps a state s_k to an action a_k. Given an initial state s_0, IPMS with the makespan-minimization objective is the problem of finding an optimal assignment policy φ* such that

φ* = argmin_φ E_{φ,P} [ Σ_{k=0}^{∞} T(s_k, a_k, s_{k+1}) | s_0 ].

For IPMS, we test with a continuous-time, continuous-state environment. While many learning-based methods have been proposed for (single) robot scheduling problems, to the best of our knowledge ours is the first learning method to claim scalable performance on machine-scheduling problems. Hence, in this case we focus on showing comparable performance on large problems rather than attempting to show superiority over heuristics specifically designed for IPMS (in fact, no heuristic was specifically designed for our exact problem: makespan minimization with sequence-dependent setup and no restriction on setup times). For each task, the processing time is drawn uniformly from [16, 64]. For every ordered pair (task i, task j), a unique setup time is drawn uniformly from [0, 32]. As illustrated in Appendix A.1, we want to minimize the makespan. As a benchmark for IPMS, we use the Google OR-Tools library (Google (2012)). This library provides metaheuristics such as Greedy Descent, Guided Local Search, Simulated Annealing, and Tabu Search. We compare our algorithm's result with the best-performing heuristic for each experiment. We consider cases with 3, 5, 7, and 10 machines and 50, 75, and 100 jobs. The results are provided in Appendix Table 3.
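The makespan objective above telescopes: since decision epochs are task-finish events, the sum of the inter-epoch durations T_k equals the completion time of the last task. The sketch below illustrates this with a toy rollout; the greedy shortest-completion-time rule is our own stand-in for a policy φ, not the paper's learned policy:

```python
# Minimal sketch: rolling out an assignment policy and reading off the
# makespan. The greedy rule below is a hypothetical placeholder for phi.
import heapq

def rollout_makespan(n_machines, completion_time, tasks):
    """completion_time(machine, task) -> duration; returns the makespan."""
    free = [(0.0, m) for m in range(n_machines)]   # (time machine frees up, id)
    heapq.heapify(free)
    remaining = set(tasks)
    finish = 0.0
    while remaining:
        t, m = heapq.heappop(free)                 # next decision epoch t_k
        task = min(remaining, key=lambda p: completion_time(m, p))  # greedy phi
        remaining.remove(task)
        done = t + completion_time(m, task)
        finish = max(finish, done)
        heapq.heappush(free, (done, m))
    return finish

# two machines, four tasks with fixed durations (no setup, for brevity)
dur = {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0}
mk = rollout_makespan(2, lambda m, p: dur[p], dur)
assert mk == 6.0   # greedy is suboptimal here: the optimum is 5.0 (4+1 vs 3+2)
```

The final assertion also shows why a myopic rule is insufficient and a learned Q-function is needed.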
The makespan obtained by our method divided by the makespan obtained by the baseline is reported. Although our method has limitations on problems with a small number of tasks, it shows comparable performance for large numbers of tasks, demonstrating its value as the first learning-based machine-scheduling method to achieve scalable performance.

A.2 MRRC INDUCES A RANDOM BAYESIAN NETWORK

Here we illustrate that a robot scheduling problem randomly induces a random Bayesian network from state s_t. See Figure 4. Given a starting state s_t and action a_t, one can repeat the random experiment of 'sequential decision making using policy φ'. In this experiment, we can define events describing how the robots serve all remaining tasks, and in which sequence; we call such an event a 'scenario'. For example, suppose that at time step t we are given robots {A, B}, tasks {1, 2, 3, 4, 5}, and policy φ. One possible scenario S* is {robot A serves tasks 3 → 1 → 2 and robot B serves tasks 5 → 4}. Define random variables {H_k}, where H_k is a task characteristic, e.g., 'the time when task k is served'. The question is: given a scenario S*, what is the relationship among the random variables {H_k} and the inputs {y_k} of Section 4.1? Recall that in our sequential decision-making formulation we are given all the task-completion-time information in the description of s_t, and that a task's completion time depends only on the previously served task and the assigned task. In our example above, under scenario S*, 'when task 2 is served' depends only on 'when task 1 is served'; that is, P(H_2 | H_1, H_3, S*) = P(H_2 | H_1, S*). This relationship is called 'conditional independence'. Given a scenario S*, every relationship among {H_i | S*} can be expressed using relationships of this kind. A graph encoding such relationships is called a 'Bayesian network' (Koller & Friedman (2009)), a probabilistic graphical model.
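The scenario example above can be sketched directly: each task's single parent in the induced Bayesian network is the task served immediately before it on the same robot. The helper name `scenario_parents` is ours, chosen for illustration:

```python
# Toy sketch: a scenario (each robot's service sequence) induces a Bayesian
# network over the service-time variables H_k. Each task's parent is the task
# served just before it on the same robot, so in the text's example
# P(H_2 | H_1, H_3, S*) = P(H_2 | H_1, S*).
def scenario_parents(scenario):
    """scenario: robot -> ordered task list. Returns task -> parent (or None)."""
    parents = {}
    for seq in scenario.values():
        for prev, cur in zip(seq, seq[1:]):
            parents[cur] = prev
        if seq:
            parents.setdefault(seq[0], None)   # a robot's first task has no parent
    return parents

S_star = {"A": [3, 1, 2], "B": [5, 4]}
par = scenario_parents(S_star)
assert par[2] == 1      # H_2 depends only on H_1 given S*
assert par[4] == 5
assert par[3] is None   # robot A's first task
```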
Therefore, under a fixed scenario S*, the problem's joint distribution can be assumed to factor according to the PGM structure ∏_k φ(h_k | y_k) ∏_{i,j} φ(h_i | h_j), where y_k are the inputs considered in Section 4.1 and H_i denotes the time task i is served.

A.3 PROOF OF THEOREM 1.

We first state the definitions needed for the proof. Given a random PGM {G_X, P}, a PGM is chosen from G_X, the set of all possible PGMs on X. The set of semi-cliques is denoted C_X. As discussed in the main text, given P we can easily calculate the presence probability p_m of semi-clique D_m as p_m = Σ_{G ∈ G_X} P(G) 1_{D_m ∈ G}. For each semi-clique D_i in C_X, define a binary random variable V_i : F → {0, 1} taking value 0 for factorizations that do not include semi-clique D_i and value 1 for factorizations that do. Let V be the random vector V = (V_1, V_2, ..., V_{|C_X|}). Then we can express

P(X_1, ..., X_n | V) ∝ ∏_{i=1}^{|C_X|} φ_i(D_i)^{V_i}.

We write ψ_i(D_i, V_i) for φ_i(D_i)^{V_i}.

We now prove Theorem 1. In mean-field inference, we seek a fully factored distribution Q(X_1, ..., X_n) = ∏_{i=1}^n Q_i(X_i) minimizing the cross-entropy to the target distribution. Following the notation of Koller & Friedman (2009), the mean-field inference problem can be written as the optimization problem

min_Q D(∏_i Q_i ‖ P(X_1, ..., X_n | V))   s.t.   Σ_{x_i} Q_i(x_i) = 1 for all i.

Here D(∏_i Q_i ‖ P(X_1, ..., X_n | V)) = E_Q[ln(∏_i Q_i)] - E_Q[ln P(X_1, ..., X_n | V)]. Note that

E_Q[ln P(X_1, ..., X_n | V)]
= E_Q[ln((1/Z) ∏_{i=1}^{|C_X|} ψ_i(D_i, V_i))]
= E_Q[Σ_{i=1}^{|C_X|} V_i ln φ_i(D_i)] - E_Q[ln Z]
= Σ_{i=1}^{|C_X|} E_Q[V_i ln φ_i(D_i)] - E_Q[ln Z]
= Σ_{i=1}^{|C_X|} E_{V_i}[ E_Q[V_i ln φ_i(D_i) | V_i] ] - E_Q[ln Z]
= Σ_{i=1}^{|C_X|} P(V_i = 1) E_Q[ln φ_i(D_i)] - E_Q[ln Z]
= Σ_{i=1}^{|C_X|} p_i E_Q[ln φ_i(D_i)] - E_Q[ln Z].
Hence, the optimization problem above can be written as

max_Q Σ_{i=1}^{|C_X|} p_i E_Q[ln φ_i(D_i)] + H(Q)   s.t.   Σ_{x_i} Q_i(x_i) = 1 for all i,   (1)

where H(Q) = -E_Q[Σ_{i=1}^n ln Q_i] is the entropy of Q. In Koller & Friedman (2009), the fixed-point equation is derived by solving the analogue of (1) without the presence of the p_i. Theorem 1 follows by proceeding as in Koller & Friedman (2009) with straightforward accounting for the p_i.

A.4 PROOF OF LEMMA 1.

Since we assume semi-cliques involve only two random variables, we can write C_X = {D_ij} with presence probabilities {p_ij}, where i, j are node indices. Denote the set of nodes by V. From here, we follow the approach of Dai et al. (2016) and assume that the joint distribution of the random variables can be written as

p({H_k}, {X_k}) ∝ ∏_{k ∈ V} ψ(H_k | X_k) ∏_{k,i ∈ V} ψ(H_k | H_i).

Expanding the fixed-point equation of the mean-field inference from Theorem 1, we obtain

Q_k(h_k) = (1/Z_k) exp{ Σ_{ψ_i : H_k ∈ D_i} E_{(D_i \ {H_k}) ~ Q}[ ln ψ_i(H_k = h_k | D_i) ] }
         = (1/Z_k) exp{ ln φ(H_k = h_k | x_k) + Σ_{i ∈ V} p_{ki} ∫_H Q_i(h_i) ln φ(H_k = h_k | H_i = h_i) dh_i }.
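The p-weighted fixed point of Theorem 1 can be exercised numerically. The sketch below (potentials, unary term, and presence probabilities are all made up for the toy) iterates the update on a 3-node binary model with pairwise semi-cliques, scaling each log-potential ln φ_ij by its presence probability p_ij:

```python
# Numerical sketch of the presence-probability-weighted mean-field fixed point
# (Theorem 1) on a toy 3-node binary chain 0 -- 1 -- 2. All numbers are
# illustrative assumptions, not values from the paper.
import math

log_phi = [[0.9, -0.4], [-0.4, 0.9]]            # shared attractive pairwise potential
unary = [[1.0, -1.0], [0.0, 0.0], [0.0, 0.0]]   # pins node 0's preferred state
p = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 0.3, (2, 1): 0.3,
     (0, 2): 0.0, (2, 0): 0.0}                  # semi-clique presence probabilities
Q = [[0.5, 0.5] for _ in range(3)]              # mean-field marginals Q_k

for _ in range(50):                             # fixed-point iteration
    for k in range(3):
        logits = [unary[k][xk]
                  + sum(p[(k, j)] * sum(Q[j][xj] * log_phi[xk][xj]
                                        for xj in range(2))
                        for j in range(3) if j != k)
                  for xk in range(2)]
        z = sum(math.exp(v) for v in logits)
        Q[k] = [math.exp(v) / z for v in logits]

assert all(abs(sum(q) - 1.0) < 1e-9 for q in Q)   # valid marginals
assert Q[0][0] > Q[1][0] > Q[2][0] > 0.5          # influence decays with p along the chain
```

The final assertion shows the role of the p_ij: node 2, coupled to the chain only through a low-probability semi-clique (p = 0.3), is pulled toward node 0's preferred state far more weakly than node 1 is.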



In the latter case, our method requires only samples of the random variables; the distributions themselves are not required.



Figure 1: Representing a ridesharing/pickup and delivery problem as an MRRC problem

Figure 2: State representation and main inference procedure

Figure 3: Illustration of overall pipeline of our method

Figure 4: Representing MRRC as a random Bayesian Network

Performance test (50 trials of training for each case)

Transferability test (50 trials of training for each case, linear & deterministic env.)


IPMS test results for makespan minimization with deterministic task completion time (our algorithm / best Google OR-Tools result)

structure2vec update (from the algorithm listing): l_i = Σ_{j ∈ V} p_{ji} μ_j^{(t-1)}, μ_i^{(t)} = relu(W_3 l_i + W_4 x_i); the final embeddings μ_i^{(T)} are concatenated into μ_l.


This fixed-point equation for Q_k(h_k) is a function of {Q_j(h_j)}_{j ≠ k}. As in Dai et al. (2016), the equation can be expressed as a Hilbert-space embedding of the form μ_k = T(x_k, {p_kj μ_j}_{j ≠ k}), where μ_k is a vector encoding Q_k(h_k). In this paper, we use the nonlinear mapping T (in neural-network form) suggested in Dai et al. (2016):

l_i = Σ_{j ∈ V} p_{ji} μ_j^{(t-1)},   μ_i^{(t)} = relu(W_3 l_i + W_4 x_i).

The pseudocode implementation is as follows. In lines 1 and 2, the likelihood of the existence of a directed edge from each node m to node n is computed by calculating W_1 relu(W_2 u^k_{mn}) and averaging over the M samples. In lines 3 and 4, we use the softmax function to obtain p_{m,n}.

We combine the random sampling and inference procedures suggested in Section 4.1 and Figure 3. Denote the set of tasks with a robot assigned as T_A, a task in T_A as t_i, and the robot assigned to t_i as r_{t_i}. The corresponding edge in E^{RT} for this assignment is r_{t_i} t_i. The key idea is to use samples of r_{t_i} t_i to generate N sampled Q(s, a) values and average them to estimate E[Q(s, a)]. First, for l = 1, ..., N we conduct the following procedure: for each task t_i in T_A, we sample one data point e^l_{r_{t_i} t_i}; using those samples and {p_ij}, we follow the whole procedure illustrated in Section 4.1 to obtain Q(s, a)^l. Second, we average {Q(s, a)^l}_{l=1}^N to obtain the estimate of E[Q(s, a)]. The complete algorithm of Section 4.1 with task completion time as a random variable is given below.

A.8.1 SCALABILITY ANALYSIS

Computational complexity. MRRC can be formulated as a semi-MDP (SMDP)-based multi-robot planning problem (e.g., Omidshafiei et al. (2017)). This problem's complexity with R robots, T tasks, and maximum time horizon H is O((T!/(R!(T-R)!))^H). For example, Omidshafiei et al. (2017) state that a problem with only 13 task completion times ('TMA nodes' in their language) possessed a policy space with cardinality 5.622 × 10^17. In our proposed method, this complexity is addressed by a combination of two complexities: computational complexity and training complexity. The computational complexity of the joint assignment decision at each timestep is the product of factors (1)-(5), of which the last two are: (4) the number of neural-network computations for each structure2vec propagation operation, a constant C that depends only on the hyperparameter sizes of the network and does not grow with the number of robots or tasks; and (5) the number of neural-network computations for inference of the random PGM, O(|T|^2). As an offline stage, we infer the semi-clique presence probability for every possible directed edge (from a task to another task) using the algorithm introduced in Appendix 6; this inference has complexity O(|T|^2).

Training data efficiency. Training efficiency is also required for scalability. To quantify it, we measured the training time required to achieve 93% optimality. As before, we consider a deterministic environment with linear rewards and compare with the exact optimum. The complete code used for the experiments is available at the Google Drive link below.
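The sample-and-average Q estimation described above can be sketched as follows; `q_fn` and the sampler are hypothetical placeholders standing in for the full Section 4.1 inference pipeline and the completion-time distributions:

```python
# Sketch of sample-and-average Q estimation: draw N samples of the random
# task-completion times, run the (now deterministic) inference pass on each
# draw, and average. q_fn is a hypothetical stand-in for the Section 4.1
# pipeline; sample_times stands in for drawing e^l for each assigned edge.
import random

def estimate_q(q_fn, sample_times, N=1000, seed=0):
    rng = random.Random(seed)
    return sum(q_fn(sample_times(rng)) for _ in range(N)) / N

# toy check: if Q is linear in a single Uniform(2, 4) completion time,
# the estimate should concentrate near 10 - E[t] = 7
est = estimate_q(lambda t: 10.0 - t["edge"],
                 lambda rng: {"edge": rng.uniform(2.0, 4.0)},
                 N=20000)
assert abs(est - 7.0) < 0.1
```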

