SCHEDULENET: LEARN TO SOLVE MINMAX MULTIPLE TRAVELLING SALESMAN PROBLEM

Abstract

There has been continuous effort to learn to solve famous combinatorial optimization (CO) problems, such as the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP), using reinforcement learning (RL). Although these approaches have shown good optimality and computational efficiency, they have been limited to scheduling a single vehicle. MinMax mTSP, the focus of this study, seeks to minimize the total completion time for multiple workers to complete geographically distributed tasks. Solving MinMax mTSP using RL raises significant challenges because one needs to train a distributed scheduling policy that induces cooperative strategic routings using only a single delayed and sparse reward signal (the makespan). In this study, we propose ScheduleNet, which can solve mTSP instances with any number of salesmen and cities. ScheduleNet represents a state (a partial solution to mTSP) as a set of graphs and employs type-aware graph node embeddings to derive a cooperative and transferable scheduling policy. Additionally, to effectively train ScheduleNet with this sparse and delayed reward, we propose an RL training scheme, Clipped REINFORCE with a "target net," which significantly stabilizes training and improves generalization performance. We empirically show that the proposed method achieves performance comparable to Google OR-Tools, a highly optimized meta-heuristic baseline.

1. INTRODUCTION

There have been numerous approaches to solving combinatorial optimization (CO) problems using machine learning. Bengio et al. (2020) have categorized these approaches into demonstration and experience. In the demonstration setting, supervised learning has been employed to mimic the behavior of an existing expert (e.g., exact solvers or heuristics). In the experience setting, by contrast, reinforcement learning (RL) is typically employed to learn a parameterized policy that can solve newly given target problems without direct supervision. While a demonstration policy cannot outperform its guiding expert, an RL-based policy can, because it improves itself using a reward signal. Concurrently, Mazyavkina et al. (2020) have further categorized the RL approaches into improvement and construction heuristics. Improvement heuristics start from an arbitrary (complete) solution of the CO problem and iteratively improve it with the learned policy until improvement stops (Chen & Tian, 2019; Ahn et al., 2019). Construction heuristics, on the other hand, start from an empty solution and incrementally extend the partial solution using a learned sequential decision-making policy until it becomes complete. There has been continuous effort to learn to solve famous CO problems such as TSP and VRP using RL-based construction heuristics (Bello et al., 2016; Kool et al., 2018; Khalil et al., 2017; Nazari et al., 2018). Although they have shown good optimality and computational efficiency, these approaches have been limited to scheduling a single vehicle. The multi-agent extensions of these routing problems, such as multiple TSP and multiple VRP, are underrepresented in the deep learning research community, even though they capture a broader set of real-world problems and pose a more significant scientific challenge.
The multiple traveling salesmen problem (mTSP) aims to determine a set of subroutes, one per salesman, given m salesmen, N cities that each need to be visited by one of the salesmen, and a depot where the salesmen are initially located and to which they return. The objective of mTSP is either minimizing the sum of subroute lengths (MinSum) or minimizing the length of the longest subroute (MinMax). In general, the MinMax objective is more practical, as one seeks to visit all cities as soon as possible (i.e., total completion time minimization). In contrast, the MinSum formulation generally leads to highly imbalanced solutions where one of the salesmen visits most of the cities, resulting in a longer total completion time (Lupoaie et al., 2019). In this study, we propose a learning-based decentralized and sequential decision-making algorithm for solving the MinMax mTSP; the trained policy, which is a construction heuristic, can be employed to solve mTSP instances with any number of salesmen and cities. Learning a transferable mTSP solver in a construction heuristic framework is significantly more challenging than for its single-agent variants (TSP and CVRP) because (1) we need a state representation flexible enough to represent any arbitrary number of salesmen and cities, (2) we need to induce coordination among multiple agents to complete the geographically distributed tasks as quickly as possible using a sequential and decentralized decision-making strategy, and (3) we need to learn such a decentralized cooperative policy using only a delayed and sparse reward signal, the makespan, which is revealed only at the end of the episode. To tackle this challenging task, we formulate mTSP as a semi-MDP and derive a decentralized decision-making policy in a multi-agent reinforcement learning framework using only a sparse and delayed episodic reward signal.
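To make the distinction between the two objectives concrete, the following is a minimal sketch (not part of the paper's method) that evaluates the MinSum and MinMax objectives of a candidate mTSP solution given as a set of subroutes over 2D city coordinates, with the depot at index 0:

```python
import math

def route_length(route, coords):
    """Euclidean length of one closed subroute: depot -> cities -> depot."""
    tour = [0] + list(route) + [0]  # depot is node 0
    return sum(math.dist(coords[tour[i]], coords[tour[i + 1]])
               for i in range(len(tour) - 1))

def minsum(subroutes, coords):
    """MinSum objective: total travelled distance over all salesmen."""
    return sum(route_length(r, coords) for r in subroutes)

def minmax(subroutes, coords):
    """MinMax objective (makespan): length of the longest subroute."""
    return max(route_length(r, coords) for r in subroutes)
```

For example, with the depot at (0, 0), cities at (1, 0), (2, 0), and (0, 1), and two salesmen assigned subroutes [1, 2] and [3], the MinSum value is 6.0 while the MinMax value (makespan) is 4.0; a solver optimizing MinSum may happily lengthen one salesman's route if it shortens the total, whereas MinMax penalizes exactly that imbalance.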
The major components of the proposed method and their importance are summarized as follows:

• Decentralized cooperative decision-making strategy: decentralization of the scheduling policy is essential to ensure the learned policy can be employed to schedule mTSP problems of any size in a scalable manner; the decentralized policy maps the local observation of each idle salesman to one of its feasible individual actions, while the joint policy maps the global state to joint scheduling actions.

• State representation using type-aware graph attention (TGA): the proposed method represents a state (a partial solution to mTSP) as a set of graphs, each of which captures specific relationships among workers, cities, and a depot. The proposed method then employs TGA to compute the node embeddings for all nodes (salesmen and cities), which are used to sequentially assign an idle salesman to an unvisited city.

• Training a decentralized policy using a single delayed shared reward signal: training a decentralized cooperative strategy using a single sparse and delayed reward is extremely difficult in that we need to distribute the credit of a single scalar reward (the makespan) over time and agents. To resolve this, we propose a stable MARL training scheme which significantly stabilizes training and improves generalization performance.

We have empirically shown that the proposed method achieves performance comparable to Google OR-Tools, a highly optimized meta-heuristic baseline. The proposed approach outperforms OR-Tools in many cases on in-training and out-of-training problem distributions, as well as on real-world problem instances. We also verified that ScheduleNet can provide an efficient routing service to customers.
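The exact form of "Clipped REINFORCE with a target net" is not specified in this excerpt; a plausible reading, sketched below purely as an assumption, is a PPO-style clipped surrogate objective in which the behavior policy is a slowly updated target copy of the current policy, and the advantage is derived from the single shared episodic reward (the makespan). The function name and signature are illustrative, not the paper's API:

```python
import math

def clipped_reinforce_loss(logp, logp_target, advantage, eps=0.2):
    """Hypothetical clipped surrogate for one scheduling decision.

    logp        -- log-probability of the taken action under the current policy
    logp_target -- log-probability under the slowly updated "target net"
    advantage   -- credit assigned from the shared makespan reward (assumed)
    eps         -- clipping range, as in PPO-style objectives
    """
    ratio = math.exp(logp - logp_target)          # importance ratio vs. target policy
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Take the pessimistic surrogate and negate it to obtain a loss to minimize.
    return -min(unclipped, clipped)
```

Under this reading, clipping bounds how far a single gradient step can move the policy away from the target net, which is one mechanism consistent with the stability claim above; the true training scheme may differ in its details.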

2. RELATED WORK

Construction RL approaches A seminal body of work focused on the construction approach in the RL setting for solving CO problems (Bello et al., 2016; Nazari et al., 2018; Kool et al., 2018; Khalil et al., 2017). These approaches utilize an encoder-decoder architecture, which first encodes the problem structure into a hidden embedding and then autoregressively decodes the complete solution. Khalil et al. (2017) employ structure2vec (Dai et al., 2016), which embeds a partial solution of the TSP and outputs the next city in the (sub)tour. Kang et al. (2019) extended structure2vec to random graphs and employed this random graph embedding to solve identical parallel machine scheduling problems, which seek to minimize the makespan by scheduling multiple machines.

Kool et al. (2018) proposed using a Transformer-like architecture (Vaswani et al., 2017) to solve several variants of the TSP and single-vehicle CVRP. In contrast, Khalil et al. (2017) do not use an encoder-decoder architecture, but a single graph embedding model.

