SCHEDULENET: LEARN TO SOLVE MINMAX MULTIPLE TRAVELLING SALESMAN PROBLEM

Abstract

There has been continuous effort to learn to solve famous combinatorial optimization (CO) problems such as the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP) using reinforcement learning (RL). Although these approaches have shown good optimality and computational efficiency, they have been limited to scheduling a single vehicle. MinMax mTSP, the focus of this study, seeks to minimize the total completion time for multiple workers to complete geographically distributed tasks. Solving MinMax mTSP with RL raises significant challenges because one needs to train a distributed scheduling policy that induces cooperative strategic routings using only a single delayed and sparse reward signal (the makespan). In this study, we propose ScheduleNet, which can solve mTSP instances with any number of salesmen and cities. ScheduleNet represents a state (a partial solution to mTSP) as a set of graphs and employs type-aware graph node embeddings to derive a cooperative and transferable scheduling policy. Additionally, to effectively train ScheduleNet with the sparse and delayed reward (makespan), we propose an RL training scheme, Clipped REINFORCE with a "target net," which significantly stabilizes training and improves generalization performance. We empirically show that the proposed method achieves performance comparable to Google OR-Tools, a highly optimized meta-heuristic baseline.

1. INTRODUCTION

There have been numerous approaches to solve combinatorial optimization (CO) problems using machine learning. Bengio et al. (2020) have categorized these approaches into demonstration and experience. In the demonstration setting, supervised learning is employed to mimic the behavior of an existing expert (e.g., exact solvers or heuristics). On the other hand, in the experience setting, reinforcement learning (RL) is typically employed to learn a parameterized policy that can solve newly given target problems without direct supervision. While a demonstration policy cannot outperform its guiding expert, an RL-based policy can outperform the expert because it improves its policy using a reward signal. Concurrently, Mazyavkina et al. (2020) have further categorized the RL approaches into improvement and construction heuristics. Improvement heuristics start from an arbitrary (complete) solution of the CO problem and iteratively improve it with the learned policy until the improvement stops (Chen & Tian, 2019; Ahn et al., 2019). On the other hand, construction heuristics start from the empty solution and incrementally extend the partial solution using a learned sequential decision-making policy until it becomes complete. There has been continuous effort to learn to solve famous CO problems such as the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP) using RL-based construction heuristics (Bello et al., 2016; Kool et al., 2018; Khalil et al., 2017; Nazari et al., 2018). Although they have shown good optimality and computational efficiency, these approaches have been limited to scheduling a single vehicle. The multi-agent extensions of these routing problems, such as multiple TSP and multiple VRP, are underrepresented in the deep learning research community, even though they capture a broader set of real-world problems and pose a more significant scientific challenge.
The multiple traveling salesmen problem (mTSP) aims to determine a set of subroutes, one for each salesman, given m salesmen, N cities that must each be visited by one of the salesmen, and a depot where the salesmen are initially located and to which they return. The objective of mTSP is either to minimize the sum of subroute lengths (MinSum) or to minimize the length of the longest subroute (MinMax). In general, the MinMax objective is more practical, as one seeks to visit all cities as soon as possible (i.e., total completion time minimization). In contrast, the MinSum formulation generally leads to highly imbalanced solutions in which one salesman visits most of the cities, resulting in a longer total completion time (Lupoaie et al., 2019). In this study, we propose a learning-based decentralized and sequential decision-making algorithm for solving the MinMax mTSP; the trained policy, which is a construction heuristic, can be employed to solve mTSP instances with any number of salesmen and cities. Learning a transferable mTSP solver in a construction heuristic framework is significantly more challenging than its single-agent variants (TSP and CVRP) because (1) we need a state representation that is flexible enough to represent any number of salesmen and cities; (2) we need to induce coordination among multiple agents to complete the geographically distributed tasks as quickly as possible using a sequential and decentralized decision-making strategy; and (3) we need to learn such a decentralized cooperative policy using only a delayed and sparse reward signal, the makespan, which is revealed only at the end of the episode. To tackle this challenging task, we formulate mTSP as a semi-MDP and derive a decentralized decision-making policy in a multi-agent reinforcement learning framework using only a sparse and delayed episodic reward signal.
The major components of the proposed method and their importance are summarized as follows:

• Decentralized cooperative decision-making strategy: Decentralizing the scheduling policy is essential to ensure the learned policy can schedule mTSP problems of any size in a scalable manner; the decentralized policy maps the local observation of each idle salesman to one of its feasible individual actions, while the joint policy maps the global state to the joint scheduling actions.

• State representation using type-aware graph attention (TGA): The proposed method represents a state (a partial solution to mTSP) as a set of graphs, each of which captures specific relationships among workers, cities, and a depot. It then employs TGA to compute embeddings for all nodes (salesmen and cities), which are used to sequentially assign idle salesmen to unvisited cities.

• Training a decentralized policy using a single delayed shared reward signal: Training a decentralized cooperative strategy with a single sparse and delayed reward is extremely difficult, because the credit of a single scalar reward (the makespan) must be distributed over time and over agents. To resolve this, we propose a stable MARL training scheme that significantly stabilizes training and improves generalization performance.

We empirically show that the proposed method achieves performance comparable to Google OR-Tools, a highly optimized meta-heuristic baseline; it outperforms OR-Tools in many cases on in-training and out-of-training problem distributions, as well as on real-world problem instances. We also verify that ScheduleNet can provide an efficient routing service to customers.

2. RELATED WORK

Construction RL approaches A seminal body of work focused on the construction approach in the RL setting for solving CO problems (Bello et al., 2016; Nazari et al., 2018; Kool et al., 2018; Khalil et al., 2017). These approaches utilize an encoder-decoder architecture that first encodes the problem structure into a hidden embedding and then autoregressively decodes the complete solution. Bello et al. (2016) utilized an LSTM-based (Hochreiter & Schmidhuber, 1997) encoder and decoded the complete solution (tour) using the Pointer Network (Vinyals et al., 2015) scheme. Since routing tasks are often represented as graphs, Nazari et al. (2018) proposed an attention-based encoder while keeping an LSTM decoder. Recently, Kool et al. (2018) proposed a Transformer-like architecture (Vaswani et al., 2017) to solve several variants of TSP and single-vehicle CVRP. In contrast, Khalil et al. (2017) use not an encoder-decoder architecture but a single graph embedding model, structure2vec (Dai et al., 2016), which embeds a partial solution of the TSP and outputs the next city in the (sub)tour. Kang et al. (2019) extended structure2vec to random graphs and employed this random graph embedding to solve identical parallel machine scheduling problems, which seek to minimize the makespan by scheduling multiple machines.

Learned mTSP solvers Machine learning approaches for solving mTSP date back to Hopfield & Tank (1985). However, these approaches require per-instance training (Hopfield & Tank, 1985; Wacholder et al., 1989; Somhom et al., 1999). Among recent learning methods, Kaempfer & Wolf (2018) encode MinSum mTSP with a set-specialized variant of the Transformer architecture that uses permutation-invariant pooling layers. To obtain a feasible solution, they use a combination of the softassign method (Gold & Rangarajan, 1996) and a beam search. Their model is trained in a supervised setting using mTSP solutions obtained by an Integer Linear Programming (ILP) solver. Hu et al. (2020) utilize a GNN encoder and a self-attention (Vaswani et al., 2017) policy that outputs an assignment probability for each salesman per city. Once cities are assigned to specific salesmen, they use an existing TSP solver, OR-Tools (Perron & Furnon), to obtain each worker's subroute. Their method shows impressive scalability in the number of cities, as they present results for mTSP instances with 1000 cities and ten workers. However, the trained model is not scalable in the number of workers and can only solve mTSP problems with a pre-specified, fixed number of workers.

3. PROBLEM FORMULATION

We define the set of m salesmen indexed by V_T = {1, 2, ..., m}, and the set of N cities indexed by V_C = {m+1, m+2, ..., m+N}. Following mTSP conventions, we define the first city as the depot. We also denote the 2D coordinates of entity i (a salesman, a city, or the depot) by p_i. The objective of MinMax mTSP is to minimize the length of the longest subtour of the salesmen, subject to the subtours covering all cities and every subtour ending at the depot. For clarity of explanation, we will refer to salesmen as workers and cities as tasks.
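The MinMax objective described above can be sketched concretely. The following minimal Python snippet (the function and variable names are our own, illustrative choices, not part of the paper's method) evaluates the MinMax and MinSum objectives for a candidate set of subtours, each implicitly starting and ending at the depot:

```python
import math

def tour_length(points):
    """Total Euclidean length of a closed tour that starts and ends at points[0]."""
    return sum(math.dist(points[i], points[(i + 1) % len(points)])
               for i in range(len(points)))

def minmax_objective(depot, subtours):
    """MinMax mTSP objective: the length of the longest subtour.
    Each subtour is the list of city coordinates visited by one salesman;
    every subtour implicitly starts and ends at the depot."""
    return max(tour_length([depot] + subtour) for subtour in subtours)

def minsum_objective(depot, subtours):
    """MinSum objective, for comparison: the total length of all subtours."""
    return sum(tour_length([depot] + subtour) for subtour in subtours)
```

For instance, with a depot at the origin and two salesmen serving one city each, the MinMax objective is the longer of the two out-and-back trips, while MinSum adds them up.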

3.1. MDP FORMULATION FOR MINMAX MTSP

In this paper, the objective is to construct an optimal solution with a construction RL approach. Thus, we cast the solution construction process of MinMax mTSP as a Markov decision process (MDP). The components of the proposed MDP are as follows.

Transition The proposed MDP transits based on events. We define an event as the case where any worker reaches its assigned city. We enumerate events with the index τ to avoid confusion with the elapsed time of the mTSP problem; t(τ) is the function that returns the time of event τ. In the proposed event-based transition setup, the state transitions coincide with the sequential expansion of the partial scheduling solution.

State Each entity i has its own state s^i_τ = (p^i_τ, 1^active_τ, 1^assigned_τ) at the τ-th event. The coordinates p^i_τ are time-dependent for workers and static for tasks and the depot. The indicator 1^active_τ describes whether the entity is active; in the case of a worker, inactive means that the worker has returned to the depot. Similarly, 1^assigned_τ indicates whether a worker is assigned to a task or not. We also define the environment state s^env_τ, which contains the current time of the environment and the sequence of tasks visited by each worker, i.e., the partial solution of the mTSP. The state of the MDP at the τ-th event becomes s_τ = ({s^i_τ}^{m+N}_{i=1}, s^env_τ). The first state s_0 corresponds to the empty solution of the given problem instance, i.e., no cities have been visited and all salesmen are at the depot. The terminal state s_T corresponds to a complete solution of the given mTSP instance, i.e., when every task has been visited and every worker has returned to the depot (see Figure 1).

Action A scheduling action a_τ is defined as a worker-to-task assignment, i.e., the salesman has to visit the assigned city.

Reward We formulate the problem in a delayed reward setting.
Specifically, the sparse reward function is defined as r(s_τ) = 0 for all non-terminal events and r(s_T) = t(T), where T is the index of the terminal state. In other words, a single reward signal, obtained only at the terminal state, equals the makespan of the problem instance.
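The event-based transition described above can be sketched as follows. This is a minimal illustration under our own simplified representation (a dict of worker records with `pos` and `target` fields, both hypothetical names): all assigned workers move at unit speed, and the MDP advances exactly to the moment the first of them arrives:

```python
import math

def next_event(workers):
    """Advance the event-based MDP to the next event: the moment any assigned
    worker reaches its target city (unit speed => travel time equals distance).
    `workers` maps worker id -> dict with 'pos' and 'target' (None if idle).
    Returns (elapsed time, id of the arriving worker)."""
    dists = {w: math.dist(s['pos'], s['target'])
             for w, s in workers.items() if s['target'] is not None}
    arriving = min(dists, key=dists.get)   # the first worker to reach its city
    dt = dists[arriving]
    for w, s in workers.items():           # move every assigned worker forward by dt
        if s['target'] is None:
            continue
        d = math.dist(s['pos'], s['target'])
        if d <= dt:                        # this worker arrives exactly at the event
            s['pos'], s['target'] = s['target'], None
        else:                              # interpolate along the straight segment
            frac = dt / d
            s['pos'] = tuple(p + frac * (t - p) for p, t in zip(s['pos'], s['target']))
    return dt, arriving
```

In this sketch, the distance traveled by each worker between two consecutive events equals the elapsed time, matching the unit-speed assumption stated in Appendix A.1.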

4. SCHEDULENET

Given the MDP formulation for MinMax mTSP, we propose ScheduleNet, which recommends a scheduling action a_τ given the current state G_τ represented as a graph, i.e., π_θ(a_τ|G_τ). ScheduleNet first represents a state (a partial solution of mTSP) as a set of graphs, each of which captures specific relationships among workers, tasks, and a depot. It then employs type-aware graph attention (TGA) to compute the node embeddings and uses them to determine the next assignment action (see Figure 2).

4.1. WORKER-TASK GRAPH REPRESENTATION

Whenever an event occurs and the global state s_τ of the MDP is updated at τ, ScheduleNet constructs a directed complete graph G_τ = (V, E) out of s_τ, where V = V_T ∪ V_C is the set of nodes and E is the set of edges. We drop the time index τ to simplify notation, since the following operations apply to a single time step. The nodes, edges, and their associated features are defined as:

• v_i denotes the node corresponding to entity i in the mTSP problem. The node feature x_i of v_i is equal to the state s^i_τ of entity i. In addition, k_i denotes the type of node v_i. For instance, if entity i is a worker and its 1^active_τ = 1, then k_i becomes the active-worker type.

• e_ij denotes the edge between source node v_j and destination node v_i, representing the relationship between the two. The edge feature w_ij is equal to the Euclidean distance between the two nodes.
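The graph construction above can be sketched in a few lines. This is a hypothetical minimal representation (the dict layout and the `pos`/`type` field names are our own assumptions) showing the directed complete graph with Euclidean edge weights:

```python
import math
from itertools import permutations

def build_graph(state):
    """Build the directed complete worker-task graph G_tau = (V, E) from the
    global state. `state` maps node id -> dict with 'pos' and 'type'
    (e.g. 'active-worker', 'unassigned-city', 'depot'). The node feature x_i
    is the entity state; each edge feature w_ij is the Euclidean distance."""
    nodes = {i: {'x': s, 'k': s['type']} for i, s in state.items()}
    # complete directed graph: one edge for every ordered pair of distinct nodes
    edges = {(i, j): math.dist(state[i]['pos'], state[j]['pos'])
             for i, j in permutations(state, 2)}
    return nodes, edges
```

With n entities this yields n(n-1) directed edges, one per ordered pair, consistent with treating e_ij (from source v_j to destination v_i) and e_ji as distinct edges.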

4.2. TYPE-AWARE GRAPH ATTENTION EMBEDDING

In this section, we describe the type-aware graph attention (TGA) embedding procedure. We denote h_i and h_ij as the node and edge embeddings, respectively, at a given time step, and h'_i and h'_ij as the embeddings updated by TGA. A single iteration of TGA embedding consists of three phases: (1) edge update, (2) message aggregation, and (3) node update.

Type-aware edge update Given the node embeddings h_i for v_i ∈ V and the edge embeddings h_ij for e_ij ∈ E, ScheduleNet computes the updated edge embedding h'_ij and the attention logit z_ij as:

h'_ij = TGA_E([h_i, h_j, h_ij], k_j),  z_ij = TGA_A([h_i, h_j, h_ij], k_j)    (1)

where TGA_E and TGA_A are the type-aware edge update function and the type-aware attention function, respectively, both conditioned on the type k_j of the source node v_j. The updated edge feature h'_ij can be thought of as the message from the source node v_j to the destination node v_i, and the attention logit z_ij will be used to compute the importance of this message. In computing the updated edge feature (message), TGA_E and TGA_A first compute a "type-aware" edge encoding u_ij, which can be seen as a dynamic edge feature that varies depending on the source node type, to effectively model the complex type-aware relationships among the nodes. Using u_ij, these two functions then compute the updated edge feature and attention logit through a multiplicative interaction (MI) layer (Jayakumar et al., 2019). The MI layer significantly reduces the number of parameters to learn without sacrificing the expressiveness of the embedding procedure. The detailed architectures of TGA_E and TGA_A are provided in Appendix A.4.

Type-aware message aggregation The distribution of node types in mTSP graphs is highly imbalanced, i.e., there are far more task-type nodes than worker-type nodes.
This imbalance is problematic during the message aggregation of a GNN, since permutation-invariant aggregation functions are prone to ignoring messages from few-but-important nodes in the graph. To alleviate this issue, we propose the following type-aware message aggregation scheme. We first define the type-k neighborhood of node v_i as the set of k-typed source nodes that are connected to the destination node v_i, i.e., N_k(i) = {v_l | k_l = k, ∀v_l ∈ N(i)}, where N(i) is the in-neighborhood of node v_i, containing the nodes connected to v_i by incoming edges. Node v_i aggregates messages from source nodes of the same type separately. For example, the aggregated message m^k_i from k-type source nodes is computed as:

m^k_i = Σ_{j ∈ N_k(i)} α_ij h'_ij    (2)

where α_ij is the attention score computed from the attention logits as:

α_ij = exp(z_ij) / Σ_{l ∈ N_k(i)} exp(z_il)    (3)

Finally, the per-type aggregated messages are concatenated to produce the total aggregated message m_i for node v_i:

m_i = concat({m^k_i | k ∈ K})    (4)

Type-aware node update The aggregated message m_i for node v_i is then used to compute the updated node embedding h'_i using the type-aware node update function TGA_V:

h'_i = TGA_V([h_i, m_i], k_i)    (5)

4.3. ASSIGNMENT PROBABILITY COMPUTATION

The ScheduleNet model consists of two type-aware graph embedding layers that utilize the embedding procedure explained above. The first embedding layer, raw-2-hid, encodes the initial node and edge features x_i and w_ij of the (full) graph G_τ to obtain the initial hidden node and edge features h^(0)_i and h^(0)_ij, respectively. We define the target subgraph G^s_τ as the subgraph of the (full) graph G_τ that only includes the target-worker (unassigned-worker) node and all unassigned-city nodes. The second embedding layer, hid-2-hid, embeds the target subgraph G^s_τ, H times.
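The two-stage encoding just described can be sketched as a skeleton. Here `raw_to_hid` and `hid_to_hid` are stand-ins for the two TGA embedding layers (their internals are abstracted away), and the subgraph restriction and H-fold iteration follow the text:

```python
def schedulenet_encode(graph, target_subgraph, raw_to_hid, hid_to_hid, H=2):
    """Two-stage ScheduleNet encoding sketch. `raw_to_hid` lifts the raw
    node/edge features of the full graph into hidden embeddings; `hid_to_hid`
    is then applied H times to the target subgraph (the unassigned worker plus
    all unassigned cities). Both layers are placeholders for the TGA embedding
    layers described above; H is illustrative."""
    h_nodes, h_edges = raw_to_hid(graph)            # raw-2-hid on the full graph
    # restrict to the target subgraph G^s_tau
    h_nodes = {i: h_nodes[i] for i in target_subgraph}
    h_edges = {e: h_edges[e] for e in h_edges
               if e[0] in target_subgraph and e[1] in target_subgraph}
    for _ in range(H):                              # hid-2-hid, H iterations
        h_nodes, h_edges = hid_to_hid(h_nodes, h_edges)
    return h_nodes, h_edges
```

The skeleton only fixes the data flow (full-graph encoding, subgraph restriction, H repeated embeddings); any concrete layer implementing the TGA update can be plugged in.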
After H iterations, the final hidden node and edge embeddings h^(H)_i and h^(H)_ij are obtained, and these final embeddings are used to make the worker-to-task assignment decision. Specifically, the probability of assigning target worker i to task j is computed as:

y_ij = MLP_actor([h^(H)_i ; h^(H)_j ; h^(H)_ij]),  p_ij = softmax({y_ij}_{j ∈ A(G_τ)})

where h^(H)_i, h^(H)_j, and h^(H)_ij are the final hidden node and edge embeddings, respectively, and A(G_τ) denotes the set of feasible actions, defined as {v_j | k_j = "unassigned-task", ∀j ∈ V}.
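The feasibility-masked softmax over A(G_τ) can be sketched as follows (a minimal illustration; the raw scores y_ij are assumed to come from the actor network, which is abstracted away here):

```python
import numpy as np

def assignment_probs(scores, node_types):
    """Turn raw actor scores y_ij into assignment probabilities p_ij with a
    softmax restricted to the feasible action set A(G_tau): only nodes of type
    'unassigned-task'. `scores` maps candidate node id -> raw score y_ij."""
    feasible = [j for j in scores if node_types[j] == 'unassigned-task']
    y = np.array([scores[j] for j in feasible])
    p = np.exp(y - y.max())          # numerically stable softmax
    p = p / p.sum()
    return dict(zip(feasible, p))
```

Infeasible nodes (assigned tasks, the depot, other workers) simply never enter the softmax, so the policy always outputs a valid assignment.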

5. TRAINING SCHEDULENET

In this section, we describe the training scheme of ScheduleNet. First, we explain the reward normalization scheme used to reduce the variance of the reward. Second, we introduce a stable RL training scheme that significantly stabilizes the training process.

Makespan normalization As mentioned in Section 3.1, we use the makespan of the mTSP as the only reward signal for training the RL agent. We denote the makespan attained by a policy π as M(π). We observe that the makespan M(π) is highly volatile, depending on the problem size (number of cities and salesmen), the topology of the map, and the policy. To reduce the variance of the reward, we propose the following normalization scheme:

m(π, π_b) = (M(π_b) − M(π)) / M(π_b)

where π and π_b are the evaluation and baseline policies, respectively. The normalized makespan m(π, π_b) is similar to the baseline of Kool et al. (2018), but we additionally divide the performance difference by the makespan of the baseline policy, which further reduces the variance induced by the size of the mTSP instance. From the normalized terminal reward m(π, π_b), we compute the normalized return as:

G_τ(π, π_b) := γ^(T−τ) m(π, π_b)    (8)

where T is the index of the terminal state and γ is the discount factor. The normalized return G_τ(π, π_b) becomes smaller and converges to (near) zero as τ decreases. From the perspective of the RL agent, this allows the agent to regard its current policy as neutral relative to the baseline policy during the early phase of the MDP trajectory. This is natural, since judging the relative goodness of a policy is hard in the early phase of the MDP.

Stable RL training It is well known that the solution quality of CO problems, including the makespan of mTSP, is extremely sensitive to action selection, which prevents stable policy learning. To address this problem, we propose clipped REINFORCE, a variant of PPO without the learned value function.
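The normalized makespan and the resulting returns can be computed directly from the two makespans; the sketch below follows the formulas above (the function name is our own):

```python
def normalized_returns(makespan, baseline_makespan, T, gamma=0.99):
    """Normalized makespan reward and its per-event discounted returns.
    m(pi, pi_b) = (M(pi_b) - M(pi)) / M(pi_b), and for each event index tau,
    G_tau = gamma^(T - tau) * m(pi, pi_b): near zero early in the episode,
    approaching the full (dis)advantage signal toward the terminal event."""
    m = (baseline_makespan - makespan) / baseline_makespan
    return [gamma ** (T - tau) * m for tau in range(T + 1)]
```

For example, a policy with makespan 8 against a baseline of 10 yields m = 0.2, and the returns grow geometrically from near zero at τ = 0 up to 0.2 at the terminal event.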
We empirically found that it is hard to train the value function (we discuss this further in the experiment section); thus we use the normalized returns G_τ(π_θ, π_b) directly. The objective of clipped REINFORCE is then given as:

L(θ) = E_{π_θ}[ Σ^{T}_{τ=0} min( clip(ρ_τ, 1−ε, 1+ε) G_τ(π_θ, π_b), ρ_τ G_τ(π_θ, π_b) ) ]

where

ρ_τ = π_θ(a_τ|G_τ) / π_θ_old(a_τ|G_τ),

(G_τ, a_τ) ∼ π_θ is the state-action marginal following π_θ, and π_θ_old is the old policy.

Training detail We use the greedy version of the current policy as the baseline policy π_b. After updating the policy π_θ, we smooth the parameters of π_θ with the Polyak average (Polyak & Juditsky, 1992) to further stabilize policy training. The pseudocode of the training procedure and the network architecture are given in Appendix A.5.1.

Table 1: mTSP uniform results. We report the mean and standard deviation of the makespans for different uniform maps. We evaluate 500 independent maps for each (N, m) pair. The gaps between OR-Tools and ScheduleNet are measured per instance.
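As a concrete sketch, the clipped REINFORCE objective can be evaluated from the per-event log-probabilities and the normalized returns (a hypothetical minimal implementation with numpy, not the paper's actual training code):

```python
import numpy as np

def clipped_reinforce_objective(logp_new, logp_old, returns, eps=0.2):
    """Clipped REINFORCE: PPO's clipped surrogate with the learned value
    baseline replaced by the normalized returns G_tau used directly.
    `logp_new`/`logp_old` are per-event log-probabilities of the taken actions
    under pi_theta and pi_theta_old; `returns` are the normalized returns."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # importance ratios
    g = np.asarray(returns)
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * g
    # element-wise min keeps the pessimistic of the clipped/unclipped terms
    return float(np.minimum(clipped, rho * g).sum())           # maximize this
```

As in PPO, the element-wise minimum makes the objective pessimistic: for a positive return the ratio's contribution is capped at 1+ε, while for a negative return the unclipped (worse) term is kept.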

6. EXPERIMENTS

We train ScheduleNet using mTSP instances whose number of workers m and number of tasks N are sampled from m ∼ U(2, 4) and N ∼ U(10, 20), respectively. This trained ScheduleNet policy is then evaluated on various datasets, including randomly generated uniform mTSP datasets, mTSPLib (mTS), randomly generated uniform TSP datasets, TSPLib, and TSP (dai). See the Appendix for further training details.

Random mTSP results

We first investigate the generalization performance of ScheduleNet on randomly generated uniform maps with varying numbers of tasks and workers. We report the results of OR-Tools and the 2Phase heuristics: 2Phase Nearest Insertion (NI), 2Phase Farthest Insertion (FI), 2Phase Random Insertion (RI), and 2Phase Nearest Neighbor (NN). The 2Phase heuristics construct subtours by (1) clustering cities with a clustering algorithm and (2) applying a TSP heuristic within each cluster. Implementation details are provided in the appendix. Table 1 shows that ScheduleNet overall produces a slightly longer makespan than OR-Tools, even for large-sized mTSP instances. As the complexity of the target mTSP instance increases, the gap between ScheduleNet and OR-Tools decreases, and there are even cases where ScheduleNet outperforms OR-Tools. Indeed, ScheduleNet can beat OR-Tools on both small and large instances, as shown in Figure 3. This result empirically shows that ScheduleNet, even though trained on small mTSP instances, can solve large-scale problems well. Notably, on large-scale maps the 2Phase heuristics are effective due to the uniformity of the city positions. This naturally motivates us to consider more realistic problems, as discussed in the following section.

mTSPLib results

The trained ScheduleNet is employed to solve the benchmark problems in mTSPLib, without additional training, to validate the generalization capability of ScheduleNet on unseen mTSP instances whose problem structure can be completely different from the instances used during training. Table 2 compares the performance of ScheduleNet to other baseline models, including CPLEX (optimal solutions), OR-Tools, and other meta-heuristics (Lupoaie et al., 2019): Self-Organizing Map (SOM), Ant Colony Optimization (ACO), and Evolutionary Algorithm (EA). We report the best-known upper bound for CPLEX whenever the optimal solution is not known. OR-Tools generally shows promising results. Interestingly, OR-Tools even discovers solutions better than the known upper bounds (e.g., eil76 with m = 5, 7 and rat99 with m = 5). This is possible because, for large cases, the search space of the exact method, CPLEX, quickly becomes prohibitively large. Our method shows the second-best performance after OR-Tools. The heuristic methods that win on uniform maps, 2Phase-NI/RI, show drastic performance degradation on the mTSPLib maps. It is noteworthy that our method, even in the zero-shot setting, performs better than the meta-heuristic methods, which perform optimization for each benchmark problem.

Computational times

The mixed-integer linear programming (MILP) formulation of mTSP quickly becomes intractable due to the exponential growth of the search space, notably the subtour elimination constraints (SEC), as the number of workers increases. The computational gain of (meta-)heuristics, including the proposed method and OR-Tools, originates from effective heuristics that prune the set of candidate tours. The computational time of ScheduleNet increases linearly with the number of workers m for a fixed number of tasks N, due to the event-based MDP formulation of mTSP. In contrast, we find that the computation time of OR-Tools depends on m, N, and also on the graph topology. As a result, ScheduleNet becomes faster than OR-Tools for large instances, as shown in Figure 6.

6.2. EFFECTIVENESS OF THE PROPOSED TRAINING SCHEME

Figure 5 compares the training curves of ScheduleNet and its variants. We first show the effectiveness of the proposed sparse reward compared to two dense reward functions: the distance reward and the distance-utilization reward. The distance reward is defined as the negative distance between the current worker position and the assigned city; this reward function is often used for solving TSP (Dai et al., 2016). The distance-utilization reward is defined as the distance reward divided by the number of active workers; it aims to minimize the (sub)tour distances while maximizing the utilization of the workers. The proposed sparse reward is the only reward function that trains ScheduleNet stably and achieves the minimal gaps, as shown in Figure 5 [Left].
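The two dense baseline rewards described above are simple to state in code (a sketch following the definitions in the text; function names are ours):

```python
import math

def distance_reward(worker_pos, city_pos):
    """Dense distance reward: the negative Euclidean distance between the
    current worker position and the assigned city (as used for TSP)."""
    return -math.dist(worker_pos, city_pos)

def distance_utilization_reward(worker_pos, city_pos, num_active_workers):
    """Dense distance-utilization reward: the distance reward divided by the
    number of active workers, encouraging both short subtours and high
    worker utilization."""
    return distance_reward(worker_pos, city_pos) / num_active_workers
```

Both give per-assignment feedback, unlike the sparse makespan reward, yet the experiments show the sparse reward is the only one that trains ScheduleNet stably.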

7. CONCLUSION

We proposed ScheduleNet for solving MinMax mTSP, the problem of minimizing the total completion time for multiple workers to complete geographically distributed tasks. The use of type-aware graphs and the specially designed TGA node embedding allows the trained ScheduleNet policy to induce coordinated strategic subroutes for the workers and to transfer well to unseen mTSP instances with any number of workers and tasks. We have empirically shown that the proposed method achieves performance comparable to Google OR-Tools, a highly optimized meta-heuristic baseline. All in all, this study shows the potential of ScheduleNet to schedule multiple vehicles in large-scale, practical, real-world applications.

A.4 DETAILS OF TYPE-AWARE GRAPH ATTENTION

Type-aware edge update The edge update scheme is designed to reflect the complex type relationships among the entities while updating edge features. First, the context embedding c_ij of edge e_ij is computed from the source node type k_j:

c_ij = MLP_etype(k_j)    (11)

where MLP_etype is the edge type encoder, which embeds the source node type into the context embedding c_ij. Next, the type-aware edge encoding u_ij is computed using the Multiplicative Interaction (MI) layer (Jayakumar et al., 2019):

u_ij = MI_edge([h_i ; h_j ; h_ij], c_ij)    (12)

where MI_edge is the edge MI layer. We utilize the MI layer, which dynamically generates its parameters depending on the context c_ij and produces the "type-aware" edge encoding u_ij, to effectively model the complex type relationships among the nodes. The encoding u_ij can be seen as a dynamic edge feature that varies with the source node type. The updated edge embedding h'_ij and its attention logit z_ij are then obtained as:

h'_ij = MLP_edge(u_ij)    (13)
z_ij = MLP_attn(u_ij)    (14)

where MLP_edge and MLP_attn are the edge updater and the logit function, respectively.
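The context-dependent parameter generation of the MI layer can be sketched as follows. This is a minimal illustration of the idea only (the bilinear form, dimensions, and initialization are our own simplifications, not the exact parameterization of Jayakumar et al. (2019) or of ScheduleNet): the context vector c generates the weight matrix and bias applied to the input x, so one shared set of hyper-parameters serves every node/edge type instead of a separate MLP per type:

```python
import numpy as np

rng = np.random.default_rng(0)

class MILayer:
    """Minimal multiplicative-interaction (MI) layer sketch: the context c
    generates a weight matrix W(c) and bias b(c) that are applied to x."""
    def __init__(self, x_dim, c_dim, out_dim):
        self.U = rng.normal(0, 0.1, (c_dim, out_dim * x_dim))  # context -> weights
        self.V = rng.normal(0, 0.1, (c_dim, out_dim))          # context -> bias
        self.x_dim, self.out_dim = x_dim, out_dim

    def __call__(self, x, c):
        W = (c @ self.U).reshape(self.out_dim, self.x_dim)     # context-generated W(c)
        b = c @ self.V                                         # context-generated b(c)
        return W @ x + b
```

Two edges with the same features but different source node types thus receive different effective transformations, while the learned tensors U and V are shared across all types.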
The edge updater and logit function produce the updated edge embedding and logit from the type-aware edge encoding. The computation steps of equations 11, 12, and 13 constitute TGA_E; similarly, the computation steps of equations 11, 12, and 14 constitute TGA_A.

Message aggregation First, we define the type-k neighborhood of node v_i as N_k(i) = {v_l | k_l = k, ∀v_l ∈ N(i)}, where N(i) is the in-neighborhood set of node i; i.e., the type-k neighborhood collects the edges heading to node i whose source nodes have type k. The proposed type-aware message aggregation procedure computes the attention score α_ij for edge e_ij, which starts from node j and heads to node i:

α_ij = exp(z_ij) / Σ_{l ∈ N_kj(i)} exp(z_il)    (15)

Intuitively, the proposed attention scheme normalizes the attention logits of incoming edges within each type, so the attention scores sum to 1 over each type-k neighborhood. Next, the type-k neighborhood message m^k_i for node v_i is computed as:

m^k_i = Σ_{j ∈ N_k(i)} α_ij h'_ij    (16)

In this aggregation step, the incoming messages of node i are aggregated type-wise. Finally, all type-neighborhood messages are concatenated to produce the (inter-type) aggregated message m_i for node v_i:

m_i = concat({m^k_i | k ∈ K})    (17)

Node update Similar to the edge update phase, the context embedding c_i is first computed for each node v_i:

c_i = MLP_ntype(k_i)    (18)

where MLP_ntype is the node type encoder. Then, the updated hidden node embedding h'_i is computed as:

h'_i = MLP_node([h_i, u_i])    (19)

where u_i = MI_node(m_i, c_i) is the type-aware node encoding produced by the MI_node layer from the aggregated message m_i and the context embedding c_i. The computation steps of equations 18 and 19 constitute TGA_V. The overall TGA computation procedure is illustrated in Figure 7.
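The per-type softmax and concatenation of the aggregation phase can be sketched numerically (a minimal illustration with our own data layout; messages and logits are assumed precomputed by the edge update):

```python
import numpy as np

def type_aware_aggregate(dst, edges, node_types):
    """Type-aware message aggregation for destination node `dst`.
    `edges` maps (dst, src) -> (message h'_ij, attention logit z_ij);
    `node_types` maps node id -> type string. Attention is normalized
    *within* each source-node type, so messages from rare node types
    (e.g. workers) are not drowned out by the many city nodes."""
    by_type = {}
    for (i, j), (msg, logit) in edges.items():
        if i == dst:
            by_type.setdefault(node_types[j], []).append((np.asarray(msg), logit))
    per_type = []
    for k in sorted(by_type):                       # deterministic type order
        msgs, logits = zip(*by_type[k])
        logits = np.asarray(logits, dtype=float)
        alpha = np.exp(logits - logits.max())       # softmax over the type-k neighborhood
        alpha = alpha / alpha.sum()
        per_type.append(sum(a * m for a, m in zip(alpha, msgs)))
    return np.concatenate(per_type)                 # m_i = concat over types
```

With equal logits, the two city messages below are simply averaged, while the single worker message passes through with attention weight 1, illustrating why the lone worker node is not averaged away.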



Note that the value function would be trained to predict the makespan of a state so as to serve as an advantage estimator. Due to the combinatorial nature of the mTSP, the target of the value function, the makespan, is highly volatile, which makes training the value function hard. We further discuss this in the experiment section.



Figure 1: The mTSP MDP. The black-lined balls indicate the events of the mTSP MDP. The empty, dashed, and filled rectangles represent the unassigned, assigned, and inactive cities, respectively. The circles represent the workers, and the positions of the circles show the 2D coordinates of the workers. The orange and blue lines show the subtours of the orange and blue workers, respectively.

Figure 2: Assignment action determination step of ScheduleNet


Figure 3: Gap distributions of the random mTSP instances. [Left] Small-size random instance results, [Center] Medium-size random instance results, [Right] Large-size random instance results.

Figure 5: [Left] Effect of reward function. The blue, orange, and green curves show the training gap of the sparse reward, the distance reward, and the distance-utilization reward over the training steps, respectively. The shaded regions visualize one standard deviation from the mean trends. We replicate 5 experiments per reward setup. [Right] Effect of training method. The orange, green, and red curves show the training gap of the PPO model with the proposed sparse, distance, and distance-utilization rewards, respectively. We also plot the results of clipped REINFORCE with the sparse reward (blue curve) for a clear comparison. The shaded regions visualize one standard deviation from the mean trends. We replicate 5 experiments per setup.

Figure 4: makespan distribution

Figure 6: Computation time of ScheduleNet and OR-Tools. The figures show the computation time (in seconds) of ScheduleNet and OR-Tools as a function of the number of cities (left) and the number of workers (right). We sample 10 mTSP instances per N-m combination for OR-Tools because its runtime varies with the underlying instance's topology. We measure all times on a CPU machine equipped with an AMD Ryzen Threadripper 2990WX, without any parallelization.

Figure 7: Type-aware graph attention embedding. We omit the type-aware edge update for clarity of visualization.

A.5 DETAILS OF SCHEDULENET TRAINING

A.5.1 TRAINING PSEUDO CODE

In this section, we present the pseudocode for training ScheduleNet.

Algorithm 1: ScheduleNet Training
Input: Training policy π_θ
Output: Smoothed policy π_φ
1: Initialize the smoothed policy with parameters φ ← θ
2: for each update step do
3:   Generate a random mTSP instance I
4:   for each episode do
5:     Construct the mTSP MDP from the instance I
6:     π_b ← arg max(π_θ)
7:     Collect samples with π_θ and π_b from the mTSP MDP
8:   θ_old ← θ
9:   for K inner updates do
10:     θ ← θ + α∇_θ L(θ)
11:   φ ← βφ + (1 − β)θ
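The per-sample parameter update (line 10) and the "target net" smoothing (line 11) can be sketched in a few lines of numpy. The PPO-style clipped surrogate below is our interpretation of "Clipped REINFORCE", and all function names are hypothetical, not from the paper's code:

```python
import numpy as np

def clipped_reinforce_update(theta, grad_logp, ratio, advantage, alpha=1e-3, eps=0.2):
    """One clipped policy-gradient step for a single sample.

    ratio = pi_theta(a|s) / pi_theta_old(a|s); since ratio = exp(logp - logp_old),
    d(ratio)/d(theta) = ratio * grad_logp.
    """
    surr1 = ratio * advantage
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    if surr1 <= surr2:
        # unclipped branch is the pessimistic minimum: full gradient flows
        grad = advantage * ratio * grad_logp
    else:
        # clipped branch is constant w.r.t. theta: gradient is zero
        grad = np.zeros_like(grad_logp)
    return theta + alpha * grad  # gradient ascent on the surrogate

def polyak_update(phi, theta, beta=0.95):
    """Smoothed 'target net' parameters: phi <- beta * phi + (1 - beta) * theta."""
    return beta * phi + (1 - beta) * theta
```

With β = 0.95 (the value used in A.5.2), the smoothed policy π_φ changes slowly relative to π_θ, which is what stabilizes training against the sparse makespan reward.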

mTSPLib results. The CPLEX results marked with * are optimal solutions; otherwise, the best-known upper bounds found by CPLEX are reported. Meta-heuristic results are reproduced from Lupoaie et al. (2019). The results of the two leading heuristic algorithms are provided here. The full results are given in Appendix A.7.

A APPENDIX

A.1 DETAILS OF MDP TRANSITION AND GRAPH FORMULATION

Event-based MDP transition The formulated semi-MDP for ScheduleNet is event-based. Thus, whenever all workers are assigned to cities, the environment transits in time until one of the workers arrives at its city (i.e., completes the task). The arrival of a worker at a city is the event trigger; meanwhile, the other assigned workers are still on the way to their assigned cities. We assume that each worker moves toward its assigned city with unit speed in the 2D Euclidean space, i.e., the distance travelled by each worker equals the time elapsed between two consecutive MDP events.

Graph formulation In total, our graph formulation includes seven mutually exclusive node types: (1) assigned-worker, (2) unassigned-worker, (3) inactive-worker, (4) assigned-city, (5) unassigned-city, (6) inactive-city, and (7) depot. Here, the set of active workers (cities) is defined as the union of assigned and unassigned workers (cities). An inactive-city node refers to a city that has already been visited, while an inactive-worker node refers to a worker that has finished its route and returned to the depot.
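The event-based transition above can be sketched as follows. `advance_to_next_event` is a hypothetical helper name; we assume worker positions and assigned-city coordinates are given as 2D numpy arrays, and that unit speed makes elapsed time equal to travelled distance:

```python
import numpy as np

def advance_to_next_event(worker_pos, targets):
    """Advance the semi-MDP to the next event: all assigned workers move with
    unit speed toward their assigned cities, and time jumps forward until the
    first worker arrives (the event trigger)."""
    # remaining distance of each worker; with unit speed, distance == time
    dists = np.linalg.norm(targets - worker_pos, axis=1)
    dt = float(dists.min())            # elapsed time until the next event
    arrived = int(dists.argmin())      # index of the event-triggering worker
    # move every worker dt units along the straight line to its target
    direction = (targets - worker_pos) / np.maximum(dists[:, None], 1e-12)
    new_pos = worker_pos + dt * direction
    return new_pos, arrived, dt
```

Workers that have not yet arrived end up part-way along their straight-line paths, which is exactly the "still on the way" state described above.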

A.2 DETAILS OF IMPLEMENTATION

2-phase mTSP heuristics The 2-phase heuristics for mTSP are an extension of well-known TSP heuristics to the m > 1 case. First, we perform K-means spatial clustering of the cities in the mTSP instance, with K = m. Next, we apply a TSP insertion heuristic (Nearest Insertion, Farthest Insertion, Random Insertion, or Nearest Neighbour Insertion) to each cluster of cities. It should be noted that the performance of the 2-phase heuristics depends highly on the spatial distribution of the cities on the map. Thus, the 2-phase heuristics perform particularly well on uniformly distributed random instances, where K-means clustering can obtain clusters with approximately the same number of cities per cluster.

Proximal Policy Optimization Our implementation of PPO closely follows the standard implementation of PPO2 from stable-baselines (Hill et al., 2018) with default hyperparameters, with modifications to allow for distributed training with a Parameter Server.
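A minimal sketch of the cluster-first, route-second pipeline described above, assuming a plain Lloyd's K-means and a nearest-neighbour construction per cluster (the insertion variants would replace `nearest_neighbour_tour`; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means on 2D city coordinates; returns cluster labels."""
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each city to its nearest center, then recenter
        d = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def nearest_neighbour_tour(depot, cities):
    """Greedy nearest-neighbour route over one cluster, starting at the depot."""
    tour, remaining, pos = [], list(range(len(cities))), depot
    while remaining:
        nxt = min(remaining, key=lambda i: np.linalg.norm(cities[i] - pos))
        tour.append(nxt)
        pos = cities[nxt]
        remaining.remove(nxt)
    return tour

def two_phase_mtsp(depot, cities, m):
    """Phase 1: K-means with K = m. Phase 2: one TSP tour per cluster."""
    labels = kmeans(cities, m)
    routes = []
    for j in range(m):
        idx = np.where(labels == j)[0]
        order = nearest_neighbour_tour(depot, cities[idx])
        routes.append([int(idx[i]) for i in order])
    return routes
```

On uniform random instances the clusters are roughly balanced, so each worker receives a tour of similar length, which is why this baseline is competitive there.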

A.3 COMPUTATION TIME

Figure 6 shows the computation time curves as a function of the number of cities (left) and the number of workers (right). Overall, ScheduleNet is faster than OR-Tools, and the difference in computation speed only increases with the problem size. Additionally, ScheduleNet's computation time depends only on the problem size (N + m), whereas the computation time of OR-Tools depends on both the size of the problem and the topology of the underlying mTSP instance. In other words, the number of solutions searched by OR-Tools varies depending on the underlying problem.

Another computational and practical advantage of ScheduleNet is its scalability with the number of workers: its computational complexity increases only linearly with the number of workers. On the other hand, the search space of meta-heuristic algorithms increases drastically with the number of workers, possibly due to the exponentially increasing number of Subtour Elimination Constraints (SECs). In particular, we observed that OR-Tools reduces its search space by deactivating some of the workers, i.e., not utilizing all possible partial solutions (subtours). As a result, Figure 6 shows that the computation time of OR-Tools actually decreases due to the deactivation of some workers, at the expense of decreasing solution quality.

A.4 DETAILS OF TYPE-AWARE GRAPH ATTENTION EMBEDDING

In this section, we thoroughly describe the type-aware graph embedding procedure. As in the main body, we overload notation for simplicity, denoting both the input node and edge features and the embedded node and edge features as h_i and h_ij, respectively. The proposed graph embedding step consists of three phases: (1) type-aware edge update, (2) type-aware message aggregation, and (3) type-aware node update.
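To make the three phases concrete, here is a small numpy sketch. The per-type "MLPs" are stand-ins (a single random linear layer with ReLU), and the exact conditioning of each MLP on node types is our assumption for illustration, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim):
    """Tiny stand-in for an MLP: one random linear layer followed by ReLU."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W, 0.0)

D = 8                                              # embedding dimension
edge_mlp = {t: mlp(3 * D, D) for t in range(7)}    # one edge MLP per node type
node_mlp = {t: mlp(2 * D, D) for t in range(7)}    # one node MLP per node type
attn_mlp = mlp(D, 1)                               # scalar attention logit per edge

def tga_layer(h, e, types, edges):
    """One type-aware graph attention step:
    (1) type-aware edge update, (2) attention-weighted message aggregation,
    (3) type-aware node update."""
    # (1) edge update, here conditioned on the receiving node's type
    e_new = {}
    for (i, j) in edges:
        e_new[(i, j)] = edge_mlp[types[j]](np.concatenate([h[i], h[j], e[(i, j)]]))
    # (2) aggregate incoming messages per receiver with softmax attention
    h_new = np.zeros_like(h)
    for i in range(len(h)):
        inc = [e_new[(j, k)] for (j, k) in edges if k == i]
        if not inc:
            h_new[i] = h[i]
            continue
        logits = np.array([attn_mlp(msg)[0] for msg in inc])
        w = np.exp(logits - logits.max())
        w /= w.sum()
        agg = sum(wk * msg for wk, msg in zip(w, inc))
        # (3) node update conditioned on the receiver's own type
        h_new[i] = node_mlp[types[i]](np.concatenate([h[i], agg]))
    return h_new, e_new
```

Keeping one MLP per node type (rather than one shared MLP) is what makes the update "type-aware": a depot node and an assigned-worker node transform their messages differently even when their raw features coincide.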

A.5.2 HYPERPARAMETERS

In this section, we fully specify the hyperparameters of ScheduleNet.

Network Architecture We use the same hyperparameters for the raw-2-hid TGA layer and the hid-2-hid TGA layer. MLP_etype and MLP_ntype each have one hidden layer with 32 neurons, and their output dimensions are both 32. Both MI layers have 64-dimensional outputs. MLP_edge, MLP_attn, and MLP_node each have 2 hidden layers with 32 neurons. MLP_actor has 2 hidden layers with 128 neurons each. We use ReLU activation functions for all hidden layers. The number of hidden graph embedding steps H is two.

Training We use a discount factor γ of 0.7. We use Adam (Kingma & Ba, 2014) with a learning rate of 0.001. We set the clipping parameter to 0.2. We sample 40 independent mTSP trajectories per gradient update. We clip the gradient whenever its norm is larger than 0.5. The number of inner update steps K is three. The smoothing parameter β is 0.95.
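For convenience, the hyperparameters listed above can be collected into a single configuration object. The dict below is our own sketch; the key names are ours, not from the paper's code, but the values are the ones stated in this section:

```python
# ScheduleNet hyperparameters from A.5.2, gathered into one config dict.
schedulenet_hparams = {
    # network architecture (shared by raw-2-hid and hid-2-hid TGA layers)
    "mlp_etype_hidden": [32], "mlp_etype_out": 32,
    "mlp_ntype_hidden": [32], "mlp_ntype_out": 32,
    "mi_out_dim": 64,
    "mlp_edge_hidden": [32, 32],
    "mlp_attn_hidden": [32, 32],
    "mlp_node_hidden": [32, 32],
    "mlp_actor_hidden": [128, 128],
    "activation": "relu",
    "num_embedding_steps_H": 2,
    # training
    "discount_gamma": 0.7,
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "clip_epsilon": 0.2,
    "trajectories_per_update": 40,
    "max_grad_norm": 0.5,
    "inner_updates_K": 3,
    "smoothing_beta": 0.95,
}
```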

A.6 TRANSFERABILITY TEST ON TSP (m = 1)

The trained ScheduleNet has been employed to solve random TSP instances. Because ScheduleNet can schedule any number m of workers, setting m = 1 allows it to solve TSP instances without further training. Table 3 shows the results of this transferability experiment: the trained ScheduleNet performs reasonably well on random TSP instances, even though it has never been exposed to such instances during training. Note that as the size of the TSP increases, the gap between ScheduleNet and the other models becomes smaller. If ScheduleNet were trained on TSP instances with m = 1, its performance could be further improved. However, we deliberately did not run that experiment, in order to check ScheduleNet's transferability across different types of routing problems with different objectives.

