SOLVING NP-HARD PROBLEMS ON GRAPHS WITH EXTENDED ALPHAGO ZERO

Abstract

There have been increasing attempts to solve combinatorial optimization problems by machine learning. Khalil et al. (NeurIPS 2017) proposed an end-to-end reinforcement learning framework that automatically learns graph embeddings to construct solutions to a wide range of problems. However, it sometimes performs poorly on graphs whose characteristics differ from those of the training graphs. To improve its generalization ability to various graphs, we propose a novel learning strategy based on AlphaGo Zero, a Go engine that achieved superhuman play without domain knowledge of the game. We redesign AlphaGo Zero for combinatorial optimization problems, taking into account several differences from two-player games. In experiments on five NP-hard problems, including MINIMUMVERTEXCOVER and MAXCUT, our method, using only a policy network, generalizes better than the previous method to various instances that are not used for training, including random graphs, synthetic graphs, and real-world graphs. Furthermore, our method is significantly enhanced by a test-time Monte Carlo Tree Search that makes full use of the policy network and value network. We also compare recently developed graph neural network (GNN) models, yielding insight into a suitable choice of GNN model for each task.

1. INTRODUCTION

No polynomial-time algorithm is known for NP-hard problems [7], yet they often arise in real-world optimization tasks. A variety of algorithms have therefore been developed over a long history, including approximation algorithms [2, 14], meta-heuristics based on local search such as simulated annealing and evolutionary computation [15, 10], general-purpose exact solvers such as CPLEX 1 and Gurobi [16], and problem-specific exact solvers [1, 25]. Recently, machine learning approaches have been actively investigated for combinatorial optimization, with the expectation that the combinatorial structure of a problem can be learned automatically without complicated hand-crafted heuristics. In the early stage, many of these approaches focused on solving specific problems [17, 5] such as the traveling salesperson problem (TSP). Khalil et al. [19] proposed a general framework to solve combinatorial problems by a combination of reinforcement learning and graph embedding, which attracted attention for two reasons: it does not require any knowledge of graph algorithms other than greedy selection based on network outputs, and it learns algorithms without any training dataset. Thanks to these advantages, the framework can be applied to a diverse range of problems over graphs, and it also performs much better than previous learning-based approaches. However, we observed poor empirical performance on some graphs having different characteristics (e.g., synthetic graphs and real-world graphs) than the random graphs used for training, possibly because of the limited exploration space of their Q-learning method. In this paper, to overcome this weakness, we propose a novel solver, named CombOpt Zero. CombOpt Zero is inspired by AlphaGo Zero [33], a superhuman Go engine that conducts Monte Carlo Tree Search (MCTS) to train deep neural networks.
AlphaGo Zero was later generalized to AlphaZero [34] so that it can handle other games; however, its range of applications is limited to two-player games whose outcome is win/lose (or possibly draw). We extend AlphaGo Zero to a broad class of combinatorial problems by a simple normalization technique based on random sampling. In the same way as AlphaGo Zero, CombOpt Zero automatically learns a policy network and value network by self-play based on MCTS. We train our networks on five NP-hard tasks and test on different instances, including standard random graphs (e.g., the Erdős–Rényi model [11] and the Barabási–Albert model [3]), benchmark graphs, and real-world graphs. We show that, with only a greedy selection on the policy network, CombOpt Zero generalizes to a wider variety of graphs than the existing method, which indicates that the MCTS-based training strengthens the exploration of various actions. When more computation time is allowed, using MCTS at test time, making full use of both the policy network and value network, significantly improves the performance. Furthermore, we combine our framework with several graph neural network models [21, 39, 28] and experimentally demonstrate that an appropriate choice of model improves the performance by a significant margin.
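As a concrete illustration of the random test instances named above, the following is a minimal standard-library sketch of sampling an Erdős–Rényi graph G(n, p), in which each possible edge is included independently with probability p. The function name and parameters are our own for illustration; this is not the paper's experimental setup.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Sample an Erdős–Rényi graph G(n, p) as an edge list.

    Each of the C(n, 2) candidate edges is kept independently
    with probability p; a fixed seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = erdos_renyi(20, 0.3)
print(len(edges))  # number of sampled edges (roughly p * C(20, 2) = 57)
```

The Barabási–Albert model, by contrast, grows a graph by preferential attachment, yielding the heavy-tailed degree distributions typical of real-world networks; this is one reason instances from the two models can stress a learned solver differently.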

2. BACKGROUND

In this section, we introduce the background on which our work is based.

2.1. MACHINE LEARNING FOR COMBINATORIAL OPTIMIZATION

Machine learning approaches for combinatorial optimization problems have been studied in the literature, starting from Hopfield & Tank [17], who applied a variant of neural networks to small instances of the Traveling Salesperson Problem (TSP). With the success of deep learning, more and more studies followed, including Bello et al. [5] and Kool et al. [23] for TSP and Wang et al. [37] for MAXSAT. Khalil et al. [19] proposed an end-to-end reinforcement learning framework, S2V-DQN, which attracted attention because of its promising results on a wide range of problems over graphs, such as MINIMUMVERTEXCOVER and MAXCUT. Another advantage of this method is that it requires neither domain knowledge of specific algorithms nor any training dataset. It optimizes a deep Q-network (DQN) in which the Q-function is approximated by a graph embedding network called structure2vec (S2V) [9]. The DQN is based on their reinforcement learning formulation, where each action is selecting a node and each state represents the sequence of actions so far. In each step, a partial solution S ⊂ V, i.e., the current state, is expanded to (S, v*) by the selected vertex v* = arg max_{v ∈ V(h(S))} Q(h(S), v), where h(·) is a fixed function, determined by the problem, that maps a state to a graph so that the selection of v will not violate the problem constraints. For example, in MAXIMUMINDEPENDENTSET, h(S) is the subgraph of the input graph G = (V, E) induced by V \ (S ∪ N(S)), where N(S) is the open neighborhood of S. The immediate reward is the change in the objective function. The Q-network, i.e., S2V, learns a fixed-dimensional embedding for each node. In this work, we mitigate the issue of S2V-DQN's limited generalization ability. We follow the idea of their reinforcement learning setting, with a different formulation, and replace their Q-learning with a novel learning strategy inspired by AlphaGo Zero.
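The greedy construction loop for MAXIMUMINDEPENDENTSET can be sketched as follows. A hand-written degree-based score stands in for the learned Q-network (S2V-DQN scores nodes with the trained S2V embedding instead); the function and parameter names are ours.

```python
def max_independent_set_greedy(n, edges, score):
    """Greedily build an independent set, shrinking h(S) after each pick.

    `score(v, adj, remaining)` is a stand-in for the learned Q-function.
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    solution, remaining = [], set(range(n))  # S and the node set of h(S)
    while remaining:
        v_star = max(remaining, key=lambda v: score(v, adj, remaining))
        solution.append(v_star)
        # h(S): drop the chosen node and its open neighborhood,
        # so no future pick can violate independence.
        remaining -= {v_star} | adj[v_star]
    return solution

# Placeholder score: prefer nodes of low degree within the remaining subgraph.
score = lambda v, adj, rem: -len(adj[v] & rem)
print(max_independent_set_greedy(5, [(0, 1), (1, 2), (2, 3), (3, 4)], score))
```

On the length-5 path above this returns an independent set of size 3, the optimum for that instance; the structure of the loop (score, select, shrink h(S)) is the same regardless of which scoring function is plugged in.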
Note that although some studies combine classic heuristic algorithms with learning-based approaches (using datasets) to achieve state-of-the-art performance [26, 12], we stick to learning without domain knowledge or datasets, in the same way as S2V-DQN.

2.2. ALPHAGO ZERO

AlphaGo Zero [35] is a well-known superhuman engine for the game of Go. It trains a deep neural network f_θ with parameters θ by reinforcement learning. Given a state s (a game board), the network outputs f_θ(s) = (p, v), where p is a probability vector over moves and v ∈ [-1, 1] is a scalar state value. If v is close to 1, the player to move from state s is very likely to win. The fundamental idea of AlphaGo Zero is to improve its own network by self-play. For this self-play, a special version of Monte Carlo Tree Search (MCTS) [22], which we describe later, is used. The network is trained so that the policy imitates the MCTS-enhanced policy π and the value imitates the actual reward z from self-play (i.e., z = 1 if the player wins and z = -1 otherwise).



1 www.cplex.com

