LEARNING A TRANSFERABLE SCHEDULING POLICY FOR VARIOUS VEHICLE ROUTING PROBLEMS BASED ON GRAPH-CENTRIC REPRESENTATION LEARNING

Abstract

Reinforcement learning has been used to solve various routing problems. However, most algorithms are restricted to finding an optimal routing strategy for only a single vehicle. In addition, a policy trained on a specific target routing problem cannot solve different types of routing problems with different objectives and constraints. This paper proposes a reinforcement learning approach to solve the min-max capacitated multi-vehicle routing problem (mCVRP), which seeks to minimize the total completion time for multiple vehicles, whose one-time traveling distance is constrained by their fuel levels, serving geographically distributed customer nodes. The method represents the relationships among vehicles, customers, and fuel stations using relationship-specific graphs to capture their topological relationships, and employs a graph neural network (GNN) to extract graph embeddings used to make routing actions. We train the proposed model using random mCVRP instances with different numbers of vehicles, customers, and refueling stations. We then validate that the trained policy solves not only new mCVRP problems of different complexity (weak transferability) but also different routing problems (CVRP, mTSP, TSP) with different objectives and constraints (strong transferability).

1. INTRODUCTION

The Vehicle Routing Problem (VRP), a well-known NP-hard problem, has been studied extensively since it was introduced by Dantzig & Ramser (1959). There have been numerous attempts to compute exact (optimal) or approximate solutions for various types of vehicle routing problems using mixed integer linear programming (MILP), mostly via the branch-and-price algorithm of Desrochers et al. (1992) or column generation (Chabrier, 2006), or using heuristics (Cordeau et al., 2002; Clarke & Wright, 1964; Gillett & Miller, 1974; Gendreau et al., 1994). However, these approaches typically require substantial computational time to find near-optimal solutions. For more information on VRP, see the survey papers by Cordeau et al. (2002) and Toth & Vigo (2002). There have also been attempts to solve vehicle routing problems using learning-based approaches, which can be categorized into supervised-learning-based and reinforcement-learning-based approaches (Bengio et al., 2020); supervised-learning approaches try to map a target VRP to a solution or to solve sub-problems that appear during the optimization procedure, while reinforcement learning (RL) approaches learn to solve routing problems without supervision (i.e., without given solutions), using only repeated trials and the associated reward signal. The RL approaches can be further categorized into improvement heuristics and construction heuristics (Mazyavkina et al., 2020); improvement heuristics learn to modify the current solution into a better one, while construction heuristics learn to construct a solution within a sequential decision-making framework. The current study focuses on RL-based construction heuristics for solving various routing problems.
Various RL-based solution construction approaches have been employed to solve the traveling salesman problem (TSP) (Bello et al., 2016; Khalil et al., 2017; Nazari et al., 2018; Kool et al., 2018) or the capacitated vehicle routing problem (CVRP) (Nazari et al., 2018; Kool et al., 2018). Bello et al. (2016), Nazari et al. (2018), and Kool et al. (2018) use an encoder-decoder structure to sequentially generate routing schedules, while Khalil et al. (2017) use graph-based embeddings to determine the next assignment action. Although these approaches have shown that RL can learn to solve some types of routing problems, they have two major limitations: (1) they focus only on routing a single vehicle over cities to minimize the total traveling distance (i.e., a min-sum problem), and (2) a policy trained for a specific routing problem cannot be used to solve other routing problems with different objectives and constraints (these studies only show that the trained policy can solve the same type of routing problem with different problem sizes). In this study, we propose the Graph-centric RL-based Transferable Scheduler (GRLTS) for various vehicle routing problems. GRLTS is composed of graph-centric representation learning and RL-based scheduling policy learning. GRLTS is mainly designed to solve the min-max capacitated multi-vehicle routing problem (mCVRP), which seeks to minimize the total completion time for multiple vehicles, whose one-time traveling distance is constrained by their fuel levels, serving geographically distributed customer nodes. The method represents the relationships among vehicles, customers, and fuel stations using relationship-specific graphs to capture their topological relationships, and employs a graph neural network (GNN) to extract graph embeddings used to make routing actions.
To effectively train the policy to minimize the total completion time while satisfying the fuel constraints, we use a specially designed reward signal in the RL framework. The graph representation learning and the decision-making policy are trained in an end-to-end fashion within a multi-agent RL (MARL) framework. In addition, to effectively explore the joint combinatorial action space, we employ curriculum learning while controlling the difficulty (complexity) of the target problem. The proposed GRLTS resolves the two issues raised for other RL-based routing algorithms:
• GRLTS learns to coordinate multiple vehicles to minimize the total completion time (makespan). This resolves the first issue, so GRLTS can be used to solve practical routing problems that schedule multiple vehicles simultaneously. Kang et al. (2019) also employed graph-based embeddings (random graph embedding) to solve the identical parallel machine scheduling problem, which seeks to minimize the makespan by scheduling multiple machines; however, our approach is more general in that it can consider capacity constraints and uses faster and more scalable node embedding strategies.
• GRLTS transfers the scheduling policy trained on random mCVRP instances to solve not only new mCVRP problems of different complexity but also different routing problems (CVRP, mTSP, TSP) with different objectives and constraints.

2.1. MIN-MAX SOLUTION FOR MCVRP

We define the set of vehicles V_V = {1, ..., N_V}, the set of customers V_C = {1, ..., N_C}, and the set of refueling stations V_R = {1, ..., N_R}, where N_V, N_C, and N_R are the numbers of vehicles, customers, and refueling stations, respectively. The objective of min-max mCVRP is to minimize the makespan, i.e., the longest traveling distance among all vehicles, min max_{i∈V_V} L_i, with L_i being the traveling distance of vehicle i, while each vehicle's one-time traveling distance is constrained by its remaining fuel. The detailed mathematical formulation using mixed integer linear programming (MILP) is provided in the Appendix.
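For concreteness, the min-max objective can be computed from a candidate set of routes as in the sketch below; this is an illustrative helper of ours (not from the paper), with each route given as a list of planar coordinates.

```python
import math

def route_length(route):
    """Total Euclidean length of a route given as a list of (x, y) points."""
    return sum(math.dist(route[k], route[k + 1]) for k in range(len(route) - 1))

def makespan(routes):
    """Min-max objective: the longest traveling distance over all vehicles."""
    return max(route_length(r) for r in routes)
```

The min-sum objective used by single-vehicle formulations would instead sum `route_length` over routes; the min-max version is what couples the vehicles' decisions.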

2.2. DEC-MDP FORMULATION FOR MCVRP

We seek to sequentially construct an optimal solution. Thus, we frame the solution construction procedure as a decentralized Markov decision process (Dec-MDP) as follows.

2.2.1. STATE

We define the vehicle state s^v_t, ∀v ∈ V_V, the customer state s^c_t, ∀c ∈ V_C, and the refueling station state s^r_t, ∀r ∈ V_R as follows:
• State of vehicle v: s^v_t = (x^v_t, f^v_t, q^v_t), where x^v_t is the node allocated to vehicle v to visit, f^v_t is the current fuel level, and q^v_t is the number of customers served by vehicle v so far.
• State of customer c: s^c_t = (x^c, v^c), where x^c is the (static) location of customer node c and the visit indicator v^c ∈ {0, 1} becomes 1 if customer c has been visited and 0 otherwise.
• State of refueling station r: s^r_t = x^r, where x^r is the (static) location of refueling station r.
The global state is then s_t = ({s^v_t}_{v=1}^{N_V}, {s^c_t}_{c=1}^{N_C}, {s^r_t}_{r=1}^{N_R}).

2.2.2. ACTIONS & STATE TRANSITION

Action a^v_t for vehicle v at time t indicates the node to be visited by vehicle v at time t + 1, that is, a^v_t = x^v_{t+1} ∈ V_C ∪ V_R. The next state of vehicle v then becomes s^v_{t+1} = (x^v_{t+1}, f^v_{t+1}, q^v_{t+1}), where f^v_{t+1} and q^v_{t+1} are determined deterministically by the action a^v_t as follows:
• Fuel level update: f^v_{t+1} = F^v if a^v_t ∈ V_R, and f^v_{t+1} = f^v_t - d(x^v_t, a^v_t) otherwise.
• Customer visit count update: q^v_{t+1} = q^v_t if a^v_t ∈ V_R, and q^v_{t+1} = q^v_t + 1 otherwise.
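The deterministic transition above can be sketched as a small function; the `dist` helper and the integer node encoding are our assumptions for illustration, and F defaults to the paper's F^v = 10.

```python
def step_vehicle(x, f, q, action, refuel_nodes, dist, F=10.0):
    """Deterministic transition for one vehicle.

    x: current node, f: fuel level, q: customers served so far.
    action: next node to visit; refuel_nodes: set of refueling stations.
    dist(i, j): travel distance between nodes (hypothetical helper).
    """
    if action in refuel_nodes:
        f_next = F                    # refueling restores the full tank F^v
        q_next = q                    # visit count unchanged at a station
    else:
        f_next = f - dist(x, action)  # fuel drops by the traveled distance
        q_next = q + 1                # one more customer served
    return action, f_next, q_next
```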

2.2.3. REWARDS

The goal of mCVRP is for all agents to coordinate to finish the distributed tasks quickly while satisfying the fuel constraints. To achieve this global goal in a distributed manner, we use a specially designed independent reward for each agent:
• Visiting reward: To encourage vehicles to visit customer nodes faster, in turn minimizing the makespan, we define the customer visit reward r^v_visit = q^v_t. This reward is provided when an agent visits a customer; the more customer nodes a vehicle visits, the greater the reward it earns.
• Refueling reward: To induce strategic refueling, we introduce the refuel reward r^v_refuel = q^v_t × ((F^v - f^v_t) / (F^v - 1))^α. We define the refuel reward as an opportunity cost: vehicles with sufficient fuel do not need to refuel (small reward), whereas refueling a vehicle that lacks fuel is worth as much as visiting customers. In this study, we set F^v = 10 (equivalent to the total distance that vehicle v can travel with a fully loaded fuel tank) and α = 2.
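The two reward terms follow directly from the definitions above; this sketch uses the paper's settings F^v = 10 and α = 2 (the function names are ours).

```python
def visit_reward(q):
    """Reward for visiting a customer: the running visit count q^v_t."""
    return q

def refuel_reward(q, f, F=10.0, alpha=2.0):
    """Opportunity-cost refuel reward: q^v_t * ((F - f) / (F - 1))^alpha.

    Near-empty tanks (f -> 0, with alpha = 2) make refueling worth roughly
    as much as a customer visit; a full tank (f = F) yields zero reward.
    """
    return q * ((F - f) / (F - 1.0)) ** alpha
```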

2.3. RELATIONSHIPS WITH OTHER CLASS OF VRPS

mCVRP, the target problem of this study, has three key properties: (1) it seeks to minimize the total completion time by forcing all vehicles to coordinate (in a distributed manner), (2) it imposes fuel capacity constraints requiring the vehicles to visit the refueling stations strategically, and (3) it considers multiple refueling depots (revisits allowed). If some of these requirements are relaxed, min-max mCVRP degenerates into simpler conventional routing problems:
• TSP: a single vehicle serves every customer while minimizing the total traveling distance; the vehicle may or may not return to the depot. This problem has no capacity constraint.
• CVRP (capacity-constrained TSP): a single vehicle serves every customer while minimizing the total traveling distance and satisfying the capacity constraint; the vehicle must return to the depot to recharge.
• mTSP (multi-agent TSP): multiple vehicles must serve all customers as quickly as possible. This problem has no capacity constraint.
• mCVRP (multi-agent, capacity-constrained TSP): our target problem, which has the properties of both mTSP and CVRP; additionally, it includes more than one refueling depot.
The mathematical formulations for these problems are provided in the Appendix. We train the policy using random mCVRP instances with varying numbers of agents and customers and employ the trained policy, without parameter changes, to solve TSP, CVRP, and mTSP to test its domain transferability.

3. METHOD

This section explains how the proposed model, given a state (a partial solution), assigns an idle vehicle to the next node to visit under the sequential decision-making framework (see Figure 2).

3.1. STATE REPRESENTATION USING RELATIONSHIP-SPECIFIC MULTIPLE GRAPHS

The proposed model represents the global state s_t as a weighted graph G_t = G(V, E, w), where V = {V_V, V_C, V_R}, E is the set of edges between nodes i, j ∈ V, and w is the weight of edge (i, j) (here, the distance d_ij). Each node corresponding to a vehicle, customer, or refueling station is initialized with its associated state defined earlier. Although we could assume that all nodes are connected with each other regardless of type and distance, we restrict edge connections to neighboring nodes to reduce the computational cost. Specifically, each type of node defines its connectivity range and connects an edge to any node located within that range (see Figure 2):

e_vj = 1, ∀v ∈ V_V, if d(v, j) ≤ R_V = f^v_t   (1)
e_cj = 1, ∀c ∈ V_C, if d(c, j) ≤ R_C   (2)
e_rj = 1, ∀r ∈ V_R, if d(r, j) ≤ R_R = max_{v∈V_V} F^v   (3)

That is, a vehicle node v ∈ V_V connects edges to nodes located within its traveling distance (i.e., its current fuel level f^v_t). A customer node c ∈ V_C connects edges to nodes located within the constant range R_C; we set R_C = 5 following a typical hyperparameter selection procedure. Finally, a refueling node r ∈ V_R connects edges to nodes within the maximum distance that the vehicle with the largest fuel capacity can travel on a full tank (in this study, F^v = 10 for all vehicles). Note that the target node j connected to each node can be of any type.
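The type-specific connectivity rules (1)-(3) might be implemented as in the following sketch; the node encoding (a dict of type tag, position, and optional fuel level) is our assumption, not the paper's data structure.

```python
import math

def build_edges(nodes, R_C=5.0, F_max=10.0):
    """Connect each node i to targets j within its type-specific range.

    nodes: dict id -> (kind, (x, y), fuel); fuel is None except for vehicles.
    Ranges follow Eqs. (1)-(3): a vehicle reaches its fuel level f^v_t,
    a customer reaches R_C, a station reaches the largest full-tank range.
    """
    edges = set()
    for i, (kind, pos_i, fuel) in nodes.items():
        if kind == "vehicle":
            r = fuel          # R_V = current fuel level f^v_t
        elif kind == "customer":
            r = R_C           # fixed customer range
        else:                 # refueling station
            r = F_max         # R_R = max full-tank traveling distance
        for j, (_, pos_j, _) in nodes.items():
            if i != j and math.dist(pos_i, pos_j) <= r:
                edges.add((i, j))
    return edges
```

Note the resulting graph is directed: a low-fuel vehicle may be reachable from a station without being able to reach any node itself.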

3.2. NODE EMBEDDING USING GNN

The proposed model employs a graph neural network (GNN) (Scarselli et al., 2008) to compute the node embeddings for all nodes. The node embedding procedure starts by constructing the graph G_t from the global state s_t = ({s^v_t}_{v=1}^{N_V}, {s^c_t}_{c=1}^{N_C}, {s^r_t}_{r=1}^{N_R}). The method then computes the initial node embeddings h_i and edge embeddings h_ij for all nodes and edges of G_t using an encoder network. The sequel explains how the GNN layers update these node and edge embeddings.

3.2.1. EDGE UPDATE

The edge feature h^τ_ij at iteration τ is updated using the edge updater function φ_E as

h^τ_ij = φ_E(h^{τ-1}_ij, h^{τ-1}_i, h^{τ-1}_j), ∀i ∈ V, ∀j ∈ N_i   (4)

where h^{τ-1}_i and h^{τ-1}_j are the embedding vectors of nodes i and j at iteration τ - 1.

3.2.2. EDGE FEATURE AGGREGATION

The updated edge feature h^τ_ij can be thought of as a message sent from node j to node i. Node i aggregates these messages from its neighboring nodes j ∈ {N_V(i), N_C(i), N_R(i)}, where N_V(i) denotes the neighboring vehicle nodes of node i (and similarly for N_C(i) and N_R(i)), as

(h̃^τ_{i,V}, h̃^τ_{i,C}, h̃^τ_{i,R}) = ( Σ_{j∈N_V(i)} α_ij h^τ_ij, Σ_{j∈N_C(i)} α_ij h^τ_ij, Σ_{j∈N_R(i)} α_ij h^τ_ij )   (5)

where the attention weight α_ij = softmax_i(e_ij), with e_ij = f_e(s_i, s_j; w_e), scores the significance of node j to node i. Note that message aggregation is conducted separately for the different node types, and the aggregated messages per type are concatenated.

3.2.3. NODE FEATURE UPDATE

The aggregated edge embeddings (h̃^τ_{i,V}, h̃^τ_{i,C}, h̃^τ_{i,R}), one per neighborhood type, are then used to update the node embedding vector h^τ_i using the node update function φ_V as

h^τ_i = φ_V(h^{τ-1}_i, (h̃^τ_{i,V}, h̃^τ_{i,C}, h̃^τ_{i,R}))

The node embedding procedure is repeated H (hop) times for all nodes, and the final node embeddings h^H = {h^H_i}_{i=1}^{N} are used to determine the next assignment action of an idle vehicle.
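One full hop (edge update, per-type attention aggregation, node update) might look like the following toy sketch in pure Python; the learned networks φ_E, φ_V and the attention scorer f_e are replaced by simple hand-written functions purely for illustration, so this shows only the dataflow of Eqs. (4)-(5) and the node update.

```python
import math

def one_gnn_hop(h, h_edge, neighbors, kinds):
    """One message-passing hop with type-wise aggregation (toy sketch).

    h: node id -> embedding (list of floats); h_edge: (i, j) -> edge embedding
    (one entry per (i, j) with j in neighbors[i]); kinds: node id -> type tag.
    """
    def phi_E(e, hi, hj):              # toy stand-in for the edge updater
        return [(a + b + c) / 3.0 for a, b, c in zip(e, hi, hj)]

    def score(hi, hj):                 # toy stand-in for the attention logit
        return sum(a * b for a, b in zip(hi, hj))

    new_edge = {(i, j): phi_E(h_edge[(i, j)], h[i], h[j]) for (i, j) in h_edge}
    new_h, D = {}, len(next(iter(h.values())))
    for i in h:
        agg = []                       # concatenation of the 3 typed messages
        for kind in ("vehicle", "customer", "station"):
            nbrs = [j for j in neighbors[i] if kinds[j] == kind]
            if nbrs:                   # softmax attention within this type
                logits = [score(h[i], h[j]) for j in nbrs]
                m = max(logits)
                w = [math.exp(l - m) for l in logits]
                z = sum(w)
                msg = [sum(w[k] * new_edge[(i, j)][d]
                           for k, j in enumerate(nbrs)) / z for d in range(D)]
            else:
                msg = [0.0] * D
            agg.extend(msg)
        # toy node updater phi_V: average old embedding with typed messages
        new_h[i] = [0.5 * h[i][d] + 0.5 * sum(agg[d::D]) / 3.0
                    for d in range(D)]
    return new_h, new_edge
```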

3.3. DECISION MAKING USING NODE EMBEDDING

When a vehicle node i reaches its assigned customer node, an event occurs: vehicle node i computes its node embedding h^H_i and selects one of its feasible actions, i.e., one of the unvisited customer nodes or refueling nodes around vehicle node i. The probability for agent i to choose node j, a^i_t = j, is computed by the parameterized actor π(a^i_t = j | s_t; φ) as

π(a^i_t = j | s_t; φ) = exp(F(h^H_i, h^H_j; φ)) / Σ_{k∈N_C(i)∪N_R(i)} exp(F(h^H_i, h^H_k; φ))

where F(h^H_i, h^H_j; φ) is the fitness function evaluating the goodness for agent i of choosing node j as the next action (i.e., a^i_t = j).
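The softmax over fitness scores can be sketched as follows; the fitness function here is a placeholder passed in by the caller, standing in for the learned F(·, ·; φ).

```python
import math

def action_probs(h, i, feasible, fitness):
    """Softmax over fitness scores F(h_i, h_j) for the feasible next nodes j.

    h: node id -> final embedding h^H; feasible: candidate node ids
    (unvisited customers and reachable refueling nodes for vehicle i).
    """
    logits = {j: fitness(h[i], h[j]) for j in feasible}
    m = max(logits.values())                 # subtract max for stability
    exps = {j: math.exp(v - m) for j, v in logits.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}
```

During training the next node would be sampled from this distribution; at test time one can instead take the argmax greedily.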

3.4. TRAINING GRLTS

We train the proposed model using random mCVRP instances with varying numbers of vehicles and customers. We employ an actor-critic algorithm to train the parameters of the GNN, the critic, and the actor (GRLTS). We first approximate the action-value function Q^π(s, a) ≈ Q̂^π(s, a; θ) using a neural network with critic parameters θ. The parameters θ of the centralized critic are optimized to minimize the loss

L(θ) = E_{o,a,r,o'∼D} [ (Q̂^π(s, a; θ) - y)^2 ] = E_{o,a,r,o'∼D} [ (Q̂^π(h^{N_hop}; θ) - y)^2 ]   (7)

where y = r + γ Q̂^π(h^{N_hop}; θ') is the target value evaluated with the target actor π' = π(a|s; φ') and the target critic Q̂^π(s, a; θ') using the target parameters φ' and θ', and D is a state transition memory. To train the actor network, we use the PPO method (Schulman et al., 2017) to maximize J(φ). PPO maximizes the clipped surrogate objective

J^CLIP(φ) = E_t [ min(R_t(φ) Â_t, clip(R_t(φ), 1 - ε, 1 + ε) Â_t) ]   (8)

where the probability ratio is

R_t(φ) = π_φ(a_t|s_t) / π_{φ_old}(a_t|s_t)   (9)

We compute the advantage estimator Â_t by running the policy for T time steps as

Â_t = δ_t + γ δ_{t+1} + ... + γ^{T-1} δ_{t+T-1},  where δ_t = r_t + γ V(h^{N_hop}_{t+1}; θ) - V(h^{N_hop}_t; θ)   (10)

Because the critic and actor share parameters, the objective in equation (8) is augmented with a value function error term and an entropy bonus for sufficient exploration:

J^CLIP_t(φ) = E_t [ J^CLIP_t(φ) - c_1 (V_θ(s_t) - V_target)^2 + c_2 H(s_t, π_φ(·)) ]   (11)

where V_θ(s_t) ≈ V(h^{N_hop}_t; θ) comes from the centralized critic, H denotes the entropy bonus, and c_1 and c_2 are hyperparameters. To update the centralized critic parameters θ and the decentralized (but shared) actor parameters φ, we use Monte-Carlo simulation with the sequential update rule

θ_{k+1}, φ_{k+1} ← arg max_{θ,φ} E_{o,a∼π_{θ_k}} [ J^CLIP_t(φ) - c_1 (V_θ(s_t) - V_target)^2 + c_2 H(s_t, π_φ(·)) ]   (12)

Algorithm 1 summarizes the sequence of parameter updates using the above equations.
Algorithm 1 Parameter update in decentralized actors with PPO and Monte-Carlo simulation
1: for agent i = 1, 2, ..., N do
2:   for update iteration = 1, 2, ..., K do
3:     Evaluate policy π_θ_old in the environment for an episode
4:     Compute R_t(φ) in Equation (9)
5:     Compute advantage estimates Â_1, ..., Â_T in Equation (10)
6:     Optimize the surrogate objective in Equation (11) with batch size T
7:     θ_old ← θ, φ_old ← φ following Equation (12)
8:   end for
9: end for
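The advantage estimator and the clipped surrogate used in steps 4-6 can be sketched numerically; these helpers are illustrative (names ours) and operate on scalar trajectories rather than network outputs.

```python
def advantages(rewards, values, gamma=0.99):
    """T-step advantage estimates from TD residuals.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has length T + 1.
    A_hat_t = delta_t + gamma * delta_{t+1} + ... (discounted tail sum).
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv, running = [0.0] * T, 0.0
    for t in reversed(range(T)):       # accumulate the discounted tail
        running = deltas[t] + gamma * running
        adv[t] = running
    return adv

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO term for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)
```

The clipping keeps the update conservative: a ratio far from 1 cannot increase the objective beyond what the clipped ratio allows, in either advantage sign.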

4. EXPERIMENTS

The trained policy is used to solve various types of vehicle routing problems (mCVRP, mTSP, CVRP, and TSP) without changing its parameters. For each type except mCVRP, we use two types of data sets: (1) random instances with varying numbers of vehicles and customers whose locations are sampled randomly, and (2) benchmark problems obtained from public libraries (mTSP, CVRP, and TSP libraries). To validate the proposed GRLTS, we develop mixed integer linear programming (MILP) formulations and compute the solutions using CPLEX 12.9 (all the optimization formulations are provided in the Appendix). We also use Google OR-Tools (Perron & Furnon) as a representative heuristic solver. For the problems that other deep RL-based approaches have tried to solve (TSP and CVRP), we compare the performance of the proposed approach to those of deep RL baselines. All experiments are conducted on Windows 10 with an Intel(R) Core(TM) i9-9900K CPU at 3.6 GHz and 32 GB RAM; GPU acceleration is not used in testing. We apply the trained policy (GRLTS) to solve mCVRP with different numbers of vehicles N_v ∈ {2, 3, 5} and customers N_c ∈ {25, 50, 100}. The numbers of refueling stations are set to 4, 5, and 10 for N_c = 25, N_c = 50, and N_c = 100, respectively. For every combination of N_v and N_c, we randomly generate 100 mCVRP instances, each with randomly located customer and refueling nodes; the x and y coordinates of each node are sampled from the uniform distributions x ∼ U[0, 1] and y ∼ U[0, 1]. We employ GRLTS and ORT to solve the same 100 random mCVRP instances and compute the average makespan and the computational time required to solve each instance. Table 1 compares the average makespans and computational times.

4.1. PERFORMANCE COMPARISON OF MCVRPS

ORT (Google OR-Tools) produces the best results for the small-sized problems with reasonable computational time; however, it requires extensive time or even fails to compute any feasible solution for the large-scale problems (∞ means that ORT cannot find a feasible solution). Notably, GRLTS achieves better performance than ORT for large-scale problems with significantly shorter computational time. To validate the scalability of the proposed model, we further conduct test experiments on four cases: N_c = 25 with N_v = 5, N_c = 100 with N_v = 10, N_c = 400 with N_v = 20, and N_c = 2,500 with N_v = 50. Table 2 shows how the makespan and the computational time increase with the problem size. By comparing the ratio of the numbers of customers each vehicle needs to serve (5 : 10 : 20 : 50) with the ratio of the makespans (1.20 : 1.88 : 4.09 : 11.71), we can roughly confirm that the trained model performs reasonably well even on large problems that were never experienced during training. For all sizes of the mCVRP problem, we also compare the performance of the proposed method with the solution computed by the CPLEX solver. Because CPLEX typically takes a long time to compute a near-optimal solution, we only compare a single mCVRP case; the result is provided in the Appendix.

4.2. PERFORMANCE COMPARISON ON MTSP

We apply the trained network (without parameter changes) to solve mTSP, which seeks to minimize the total completion time of multiple vehicles (min-max mTSP). This problem is a relaxed version of mCVRP in that it has no capacity constraints (we keep the fuel level at maximum during execution). We solve 100 randomly generated mTSP instances for every combination of N_c ∈ {50, 100, 400} and N_v ∈ {2, 4, 6, 8} (or N_v ∈ {10, 20, 30, 40}). Table 7 compares the average makespan and computational time. The trained model outperforms ORT on the large-size problems (see the N_c = 100 and N_c = 400 cases in Table 7) in terms of both makespan and computational time. For large-scale problems, GRLTS achieves a roughly 28% shorter makespan than ORT on average with significantly reduced computational time (50% reduction). Similarly, for all sizes of mTSP, we compare the performance of GRLTS to the solution computed by CPLEX in the Appendix. We also employed the trained model to solve mTSP benchmark problems in MTSPLib.

4.3. PERFORMANCE COMPARISON ON CVRP

We employ the trained model to solve 100 randomly generated CVRP instances with N_c = 20, 50, 100 and a single capacitated vehicle. All nodes are randomly scattered in the unit square [0, 1] × [0, 1]. To make the trained policy compatible with the CVRP setting, we fix the number of vehicles to one and use a single refueling node as if it were the depot in the CVRP environment. As a result, the single vehicle needs to revisit the depot due to the payload capacity (which is treated as equivalent to the fuel constraint). The payload capacity is set to 30, 40, and 50 for CVRP20, CVRP50, and CVRP100, respectively, and demands are uniformly sampled from {1, ..., 9}. These test cases have been widely used in studies developing RL-based solvers for CVRP. Table 4 summarizes the results for the random CVRP environment.
L2I and LKH3 are well-known best-performing heuristic algorithms from the OR community and thus serve as oracles for comparing performance. We also consider other RL-based approaches (Kool et al., 2018; Nazari et al., 2018; Chen & Tian, 2019). In general, our model does not outperform the other RL-based approaches or OR-Tools. However, all RL-based approaches except ours are trained in the same CVRP environment (i.e., the training and test instances are sampled from the same distribution), whereas our model is trained in the completely different mCVRP environment and tested on CVRP (strong transferability). We also employed the trained policy to solve the CVRP benchmark instances from CVRPLib (Uchoa et al., 2017). Table 10 in the Appendix provides all performance results compared with other RL baseline models; they show that our model performs better than one of the state-of-the-art RL-based approaches (Kool et al., 2018). Lastly, we employ the trained model to solve 100 randomly generated TSP instances with different numbers of customers N_c = 20, 50, 100 and a single vehicle. Table 5 summarizes the results. Although GRLTS does not outperform other RL-based scheduling methods, it shows reasonable performance that is comparable to some well-known heuristic algorithms. Given that GRLTS is trained in the mCVRP environment and has never seen TSP instances, this result validates that GRLTS can be transferred to TSP as well.

4.4. PERFORMANCE COMPARISON OF TSP

We also employed the trained policy to solve the TSP benchmark instances from TSPLib (Reinhelt, 2014). Table 8 in the Appendix provides all performance results compared with other RL baseline models. The results show that our model has comparable performance with the state-of-the-art RL-based approaches (GPN (Ma et al., 2019) and S2V-DQN (Khalil et al., 2017)) and heuristic algorithms, especially for large-scale TSP problems.

5. CONCLUSION

We proposed the Graph-centric RL-based Transferable Scheduler (GRLTS), which uses graph-centric state representation to solve various types of vehicle routing problems such as mCVRP, mTSP, CVRP, and TSP. The transferability is achieved by graph-centric representation learning that generalizes well over the various relationships among vehicles, customers, and refueling stations (depots). GRLTS is computationally efficient for solving very large-scale vehicle routing problems with complex constraints, which suggests that such an RL-based scheduler can be used for large-scale realistic applications in logistics, transportation, and manufacturing.

A APPENDIX A.1 MILP FORMULATIONS FOR ROUTING PROBLEMS

This section provides the MILP formulations for mCVRP, mTSP, and TSP. All problems are formulated to find the optimal vehicle routes that minimize the makespan. For the TSP formulation, makespan minimization is equivalent to total distance minimization due to the single-vehicle operation. Although mTSP here is the same as multi-vehicle VRP by definition, we denote the problem as mTSP because we focus on makespan minimization with multi-vehicle operation.

A.1.1 TSP

A fundamental formulation in the context of routing problems is the TSP formulation of Miller-Tucker-Zemlin (MTZ) (Miller et al., 1960):

minimize Σ_{i∈V} Σ_{j∈V, i≠j} d_ij x_ij
subject to
  Σ_{i∈V} x_ij = 1, ∀j ∈ V : i ≠ j,
  Σ_{j∈V} x_ij = 1, ∀i ∈ V : i ≠ j,
  u_i - u_j + |V| x_ij ≤ |V| - 1, ∀i, j ∈ V \ vStart,   (3)
  0 ≤ u_i ≤ |V| - 1, ∀i ∈ V \ vStart,   (4)
  x_ij ∈ {0, 1}, ∀i, j ∈ V,  u_i ∈ Z, ∀i ∈ V \ vStart.

The routing problem is defined on a graph G(V, E, w), where V, E, and w are the nodes, edges, and weights (distance d or cost c). x_ij is 1 if the edge between node i and node j is used and 0 otherwise; the corresponding distance of the edge is d_ij. Constraints (3)-(4), with the dummy variables u_i, eliminate subtours. By setting d_ij = 0, ∀j ∈ vStart, the problem can adopt an arbitrary-end assumption.
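The subtour-elimination role of constraints (3)-(4) can be illustrated with a small checker of ours (not part of the formulation): a single Hamiltonian cycle admits valid ordering values u_i, while a solution split into two disjoint subtours does not.

```python
def mtz_feasible(n, edges, start=0):
    """Check whether edge set {(i, j)} with x_ij = 1 admits MTZ u-values.

    Assigns u_i = position of node i in the walk from `start` and verifies
    u_i - u_j + n * x_ij <= n - 1 for every chosen edge not entering start.
    """
    succ = dict(edges)                   # x_ij = 1 means j follows i
    u, node = {}, start
    for pos in range(n):                 # follow the walk from the start node
        u[node] = pos
        node = succ.get(node)
        if node is None:
            break
    if set(u) != set(range(n)):          # some node never reached: subtour
        return False
    return all(u[i] - u[j] + n <= n - 1 for (i, j) in edges if j != start)
```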

A.1.2 MTSP

The goal of mTSP is to find the sub-routes of multiple vehicles that minimize the makespan. Thus, the decision variables are expanded to the multi-vehicle setting, and the objective function of A.1.1 is modified to the makespan minimization (min-max) setting:

minimize Q
subject to
  Σ_{i∈V} Σ_{j∈V} d_ij x_aij ≤ Q, ∀a ∈ A : i ≠ j,   (1)
  Σ_{j∈V} x_aij = 1, ∀a ∈ A, i ∈ vStart_a : i ≠ j,   (2)
  Σ_{a∈A} Σ_{i∈G} x_aij = 1, ∀j ∈ vTask : i ≠ j,   (3)
  Σ_{i∈G} x_aij - Σ_{i∈G} x_aji = 0, ∀a ∈ A, j ∈ V : i ≠ j,   (4)
  u_ai - u_aj + |V| x_aij ≤ |V| - 1, ∀a ∈ A, j ∈ V \ vStart_a : i ≠ j,   (5)
  0 ≤ u_ai ≤ |V| - 1, ∀a ∈ A, i ∈ V \ vStart_a,   (6)
  x_aij ∈ {0, 1}, ∀a ∈ A, ∀i, j ∈ V,   (7)
  u_ai ∈ Z, ∀a ∈ A, i ∈ V \ vStart_a.   (8)

Here Q denotes the makespan, i.e., the longest traveling distance among the vehicles; minimizing this maximum traveling distance minimizes the makespan. We also allow each vehicle to start its tour at any starting node vStart_a (constraint (2)).

A.1.3 MCVRP

We extend the mTSP formulation of A.1.2 to include the fuel constraint (i.e., an allowable traveling distance per sub-tour). In this formulation, vehicles can start their tours at arbitrary locations.

minimize Q
subject to constraints (1)-(8) of the mTSP formulation in A.1.2, and
  f_ai = F_a, ∀a ∈ A, i ∈ vRefuel,   (9)
  f_ai - d_ij x_aij ≥ 0, ∀a ∈ A, ∀i, j ∈ vRefuel,   (10)
  f_aj - f_ai + d_ij x_aij ≥ F_a (1 - x_aij), ∀a ∈ A, ∀i, j ∈ vTask.   (11)

Constraint (9) states that a vehicle refuels up to its maximum fuel capacity F_a. Constraint (10) requires that a vehicle have enough remaining fuel to reach a refueling node, and constraint (11) states that a vehicle cannot consume (travel) more than its fuel capacity allows. We allow vehicles to visit refueling nodes as many times as needed by introducing a sufficient number of dummy variables.

A.1.4 CVRP

Similar to mCVRP with its fuel constraint, we also solve CVRP, which is constrained by payload capacity (not fuel) and uses a single vehicle (m = 1 so that |A| = 1). We consider a single depot where the vehicle starts its mission and unloads; therefore, vRefuel in the mCVRP MILP coincides with vStart_a. Fuel consumption equal to the distance d_ij between nodes i and j is replaced by payload consumption equal to the demand d_j at node j. As a result, constraints (9)-(11) of the mCVRP model become:

  c_ai = C_a, ∀a ∈ A, i ∈ vStart,   (9)
  c_aj - c_ai + d_j x_aij ≥ C_a (1 - x_aij), ∀a ∈ A, ∀i, j ∈ vTask.   (10)

Constraint (10) of the mCVRP model is dropped because the payload capacity does not constrain returning to the starting depot.
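For a fixed route, the fuel constraints (9)-(11) of the mCVRP model amount to a simple feasibility check along the route, which can be sketched as below (a toy checker with a hypothetical `dist` helper, not the MILP itself):

```python
def fuel_feasible(route, refuel_nodes, dist, F=10.0):
    """Check that fuel never goes negative along a route.

    The vehicle starts with a full tank F; visiting a node in refuel_nodes
    resets the level to F (constraint (9)); each leg consumes dist(i, j)
    and must leave a non-negative level (constraints (10)-(11)).
    """
    f = F
    for i, j in zip(route, route[1:]):
        f -= dist(i, j)
        if f < 0:
            return False
        if j in refuel_nodes:
            f = F
    return True
```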

A.2 COMPARISON WITH CPLEX SOLUTIONS

For single mCVRP and mTSP instances, we compare the performance of the proposed approach with the exact solution computed by CPLEX. This experiment shows how the solution computed by the proposed method compares to the near-optimal solution from a powerful optimization solver. Note that the mCVRP instance for this experiment was generated in a grid environment, so the distances between customer nodes differ from those in the experiments for Table 1. We apply the already trained model to solve mCVRP with different numbers of vehicles N_v ∈ {2, 3, 5} and customers N_c ∈ {25, 50, 100}. The numbers of refueling stations are set to 4, 5, and 10 for N_c = 25, N_c = 50, and N_c = 100, respectively. For all cases, we compute the total completion time and the computational time (the number in parentheses) required to construct a schedule. We set the computational time limit to 18,000 seconds for all cases. The symbol ∞ indicates that the time limit was reached and the solution at that moment is reported. The hyphen (-) indicates that the algorithm could not find any feasible solution.

A.2.1 MCVRP

First, CPX (CPLEX) can only solve mCVRPs with N_c = 25; for the larger problems, it cannot compute any feasible solution. ORT (Google OR-Tools) produces the best results for the small-sized problems with reasonable computational time; however, for the large-scale problems it requires an extensive amount of time or even fails to compute any feasible solution. Notably, the proposed GRLTS achieves good performance with the least computational time.

A.2.2 MTSP

For the grid environment, a single mTSP instance was generated and used to compare the performance of GRLTS with that of CPLEX and ORT.

A.3 EXPERIMENT RESULTS ON BENCHMARK PROBLEMS

We apply the trained network to solve benchmark problems for TSP, mTSP and CVRP.

A.3.1 TSP

Table 8 shows results on TSPLib (Reinhelt, 2014) compared with RL-based approaches (GPN (Ma et al., 2019) and S2V-DQN (Khalil et al., 2017)) and heuristics. The average performance of the proposed model does not outperform the state-of-the-art RL approaches. However, we observed that the proposed method is scalable in that its performance on large-scale problems does not degrade as much as that of the other approaches. We compare the performance by problem instance size (eil51 ∼ tsp225 vs. pr226 ∼ pcb442). Our model shows a 1.1 percentage-point reduction in the optimality gap (15.4% → 14.3%), whereas the other RL-based approaches show much larger increases in the optimality gap (Drori et al. (2020): 3.1% → 10.0%, GPN: 14.1% → 33.1%, Drori et al. (2020): 4.7% → 10.6%).

Table 9 shows results from mTSPLib for MinMax mTSP. Our model produces tours about 3% longer than ORT's; however, its computational time is, on average, about 30 seconds (about 45%) faster. In addition, we compute the optimality gap for $N_v = 2, 3, 5, 7$; our model performs better as $N_v$ increases ($N_v = 2$: 9.8%, $N_v = 3$: 8.9%, $N_v = 5$: 2.8%, $N_v = 7$: -7.5%).

Figure (6) shows how the trained policy performs on the three sets of validation problems while the policy is being trained on the random mCVRP instances. The three validation problems are 1) 100 customers covered by 5 vehicles, 2) 100 customers covered by 10 vehicles, and 3) 400 customers covered by 20 vehicles. The first row shows the performance on the random training instances, while the second, third, and fourth rows show the performances on the three validation problems. As training progresses, the trained model gradually becomes more efficient; that is, the fleet of vehicles visits more customers faster. After about 200 training epochs, the trained model converges, although some randomness remains due to the nature of the policy-gradient structure.
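The optimality gaps quoted above follow the usual convention of a relative excess over the best-known (optimal) value. A minimal sketch, assuming that convention:

```python
def optimality_gap(value, optimum):
    """Relative optimality gap (%) of a solution value w.r.t. the optimum:
    100 * (value - optimum) / optimum."""
    return 100.0 * (value - optimum) / optimum
```

For example, a tour of length 1143.0 against an optimum of 1000.0 yields a 14.3% gap, matching the magnitude of the gaps reported for the large TSPLib instances.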
Although the makespan curves in the second column of the training case (the first row) seem relatively constant, the level of difficulty increases as the curriculum becomes harder (the number of refueling stations also increases as the curriculum becomes harder). For training, we randomly select the numbers of vehicles, customers, and refueling stations and randomly place these entities over the grid world, where the distance between adjacent grid cells is 1. When transforming a snapshot of the mCVRP into a graph, we use the Manhattan distance between two cells as the edge weight. In addition, the speed of each vehicle is set to 1; thus, each vehicle moves one cell during a single time tick. Although the world is represented as a discrete grid, the mCVRP problem's state transition is event-based: whenever an agent reaches its assigned customer, an event occurs and all edge distances are updated. This grid world was initially developed to apply the trained policy to search-and-rescue problems, in which victims distributed over the grid world must be found. Because each vehicle (drone) can search a specific zone within a certain amount of time, a cell-based grid world is a reasonable choice. Once the policy is trained, it can be used in both a discrete world and a continuous world, because either environment can be transformed into a graph in the same way. We validated the proposed method on CVRP and TSP instances defined over a continuous state space to compare its performance with other deep baseline models, all of which use the continuous state. To boost training, we vary the difficulty level of the random mCVRP instances during the training process (see Algorithm (2)). For the curriculum instance generation in line 9, we compute $N_{curr}$, the number of customers assumed to have been visited already, as $N_{curr} = N_c \times (1 - 2 \times currLevel / 10)$.
Then, we choose $N_{curr}$ tasks at random, mark them as visited, and distribute them over all agents' visit counts $q_t$.

Simulation: episode generation. As a medium for the interaction between the environment and the GRLTS, the simulation executes the agents' actions and stores the state transitions that are used to update the GRLTS. Starting from the problem instance generation (Algorithm (2)), the simulation assigns agent $agt$'s next destination $a^t_{agt}$ using the action probability $\pi(\cdot \mid o^t_{agt})$ computed by the proposed GRLTS. The simulation computes agent $agt$'s transition time $\Delta_{agt}$ as $\Delta_{agt} = Dist(x^t_{agt}, a^t_{agt}) / v_{agt} = Dist(x^t_{agt}, a^t_{agt})$, where the velocity of every agent is constant at 1 and the distance $Dist(i, j)$ between nodes $i$ and $j$ can be the Manhattan or the Euclidean distance.
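The curriculum step and the event-based transition time described above can be sketched as follows; this is a minimal illustration of the stated formulas, and the function names are illustrative, not the paper's implementation.

```python
import random

def make_curriculum_instance(customers, curr_level):
    """Curriculum step sketch (line 9 of Algorithm 2; assumes currLevel in 0..5):
    N_curr = N_c * (1 - 2 * currLevel / 10) customers are pre-marked as visited,
    so a higher currLevel leaves more customers to serve (a harder instance)."""
    n_curr = int(len(customers) * (1 - 2 * curr_level / 10))
    pre_visited = set(random.sample(customers, n_curr))
    remaining = [c for c in customers if c not in pre_visited]
    return pre_visited, remaining

def manhattan(p, q):
    """Manhattan distance between two grid cells."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def transition_time(agent_pos, target_pos, velocity=1.0, dist=manhattan):
    """Delta_agt = Dist(x_agt, a_agt) / v_agt; with v_agt = 1 the transition
    time equals the distance (Manhattan here; Euclidean is also possible)."""
    return dist(agent_pos, target_pos) / velocity
```

At `curr_level = 0` every customer is pre-visited (the easiest instance), and at `curr_level = 5` none are, so the full instance must be solved.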



www.infoiasi.ro/mtsplib



Figure (1) (left) shows a snapshot of an mCVRP state, and Figure (1) (right) shows a feasible solution of the mCVRP.

Figure 1: An example of mCVRP. (Left) A snapshot of the environment state at time t; the circular range around each vehicle indicates its reachable range given the current fuel level. (Right) A feasible solution for this environment.

Figure 2: Sequential decision-making framework with trained GRLTS.

Figure 4: N c = 20, N v = 2 case of mTSP

Figure 6: Performance curves of training (1st row) and testing (2nd, 3rd, and 4th rows). Note that the 1st column plots a reward curve for training and cover-ratio curves for testing, respectively.

Algorithm 2: Training instance generation
1: Generate random problem instances $N_v, N_c, N_r, x^0_v, x^0_r$
2: $currLevel \leftarrow 0$
3: for $i_{training} = 1, \ldots, N_{training}$ do
4:   Sample a random number $p_{curr} \sim U[0, 1]$
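Only a fragment of the instance-generation listing survives in the text, but its outer loop can be sketched as follows. The branching threshold and the schedule for raising `curr_level` are assumptions for illustration, not values from the paper:

```python
import random

def training_loop_instances(n_training, threshold=0.5):
    """Sketch of the training-instance generation loop: each iteration draws
    p_curr ~ U[0, 1] to choose between a curriculum instance and a
    non-curriculum (fully random) instance, while the curriculum level
    currLevel is raised as training progresses (assumed schedule: one bump
    every n_training // 5 iterations, capped at level 5)."""
    curr_level = 0
    kinds = []
    for i in range(n_training):
        p_curr = random.random()
        kinds.append("curriculum" if p_curr < threshold else "random")
        if (i + 1) % max(1, n_training // 5) == 0:
            curr_level = min(curr_level + 1, 5)
    return kinds, curr_level
```

The mixture of curriculum and fully random instances keeps the policy exposed to the full difficulty range while the average difficulty rises.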

Performance comparison of mCVRPs on random instances ($N_v = 2, 3, 5, 10$)

Scalability test of the trained GRLTS with large-scale mCVRPs

Performance comparison on mTSP for random instances

Performance comparison on CVRP for random instances

Performance comparison on TSP for random instances



Performance comparison on TSP library

Performance comparison on mTSP library

Summary of hyperparameters used in the training process.

Comparing case #1 and case #2 yields two interpretations: 1) case #1 shows that the model is trained to visit all customers, and 2) in case #2, the model achieves consistent performance as training iterations progress, with the visit ratio staying at 1 from around epoch 100 to epoch 350; moreover, the makespans at these two epochs are very close. Case #2 is a relatively easier environment than case #1 in that more vehicles are deployed to serve the same number of customers.


Green grids denote refueling nodes, blue circles denote vehicles (their color darkens toward black as the fuel level diminishes), and grey grids denote unvisited customers. Once a customer is visited by a vehicle, its color becomes red or white: red grids indicate victims, while white grids indicate no victim in that cell. The red grids have no effect on the performance measure, because visiting all customers (grey grids) is the top priority.

RL network: GRLTS. Most of the algorithmic components of the RL model are explained in the main paper, so here we focus on the interaction between the simulation and the GRLTS network, along with its internal message exchanges.

Algorithm 4: GRLTS
1: Receive an observation $\Pi_{agt} o^t_{agt}$ at time $t$ from the simulation (line 6 of Algorithm 3)
2: Generate $G^{t,(0)}$ from the observations $\Pi_i o^t_i$
3: for node embedding iteration $\tau = 1, \ldots, N_{hop}$ do
4:   Compute the connectivity $C_{ij}$ for all nodes $i, j$: for refueling nodes $r$, $C_{rj} = F, \forall j \in V$; for task nodes $t$, $C_{tj} = 5, \forall j \in V$
5:   Store the node embedding vector $h^A_i$ in memory
6:   Update the connected edges following Equation (3)
7:   Aggregate the incoming edge features at each node following Equation (4)
8:   Update the node features with the node update function $\phi_v$
9: end for
10: Compute $Q(h^A_i, a)$ for agent $i$, where $a \in A(o_i) = \{v \in V_R \cup V_T \mid C_{iv} = 1\}$
11: Compute $\pi(a \mid o_i) = \exp(Q(h^A_i, a)) / \sum_{a' \in A(o_i)} \exp(Q(h^A_i, a'))$
12: Send the action probability $\pi(a \mid o_i)$ to the simulation
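The policy computation in line 11 of Algorithm 4 is a softmax over Q-values restricted to the feasible action set. A minimal sketch, with names chosen for illustration:

```python
import math

def action_probabilities(q_values, feasible):
    """Masked softmax (line 11 of Algorithm 4):
    pi(a|o_i) = exp(Q(h_i, a)) / sum_{a' in A(o_i)} exp(Q(h_i, a')).

    q_values: dict mapping action -> Q-value;
    feasible: the action set A(o_i), i.e., nodes v with C_iv = 1.
    Infeasible actions receive zero probability (they are excluded).
    """
    exps = {a: math.exp(q) for a, q in q_values.items() if a in feasible}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}
```

Masking before normalization guarantees that the agent never assigns probability mass to nodes it cannot reach with its current fuel level.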

