POLICY-BASED SELF-COMPETITION FOR PLANNING PROBLEMS

Abstract

AlphaZero-type algorithms may stop improving on single-player tasks when the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performances and to reshape an episode's reward into a binary output indicating whether the baseline has been exceeded or not. However, this baseline carries only limited information for the agent about strategies for how to improve. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process, rather than its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach on two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.

1. INTRODUCTION

One of the reasons for the success of AlphaZero (Silver et al., 2017) is the use of a policy and value network to guide the Monte Carlo tree search (MCTS), reducing the search tree's width and depth. Trained on state-outcome pairs, the value network develops an 'intuition' for telling from a single game position which player might win. By normalizing values in the tree to handle changing reward scales (Schadd et al., 2008; Schrittwieser et al., 2020), AlphaZero's mechanisms can be applied to single-agent (or single-player) tasks. Although powerful, the MCTS relies on value approximations which can be hard to predict (van Hasselt et al., 2016; Pohlen et al., 2018). Furthermore, without proper normalization, it can be difficult for value function approximators to adapt to small improvements in later stages of training. In recent years there has been increasing interest in learning sequential solutions for combinatorial optimization problems (COPs) from zero knowledge via deep reinforcement learning. Particularly strong results have been achieved with policy gradient methods using variants of self-critical training (Rennie et al., 2017; Kool et al., 2018; Kwon et al., 2020), which avoid learning a value function altogether. By baselining the gradient estimate with the outcome of rolling out a current or historical policy, actions are reinforced according to how much better (or worse) an episode is compared to the rollouts. Something similar is achieved in MCTS-based algorithms for single-player tasks by computing a scalar baseline from the agent's historical performance. The reward of the original task is reshaped into a binary ±1 outcome indicating whether an episode has exceeded this baseline or not (Laterre et al., 2018; Schmidt et al., 2019; Mandhane et al., 2022). This self-competition brings the original single-player task closer to a two-player game and bypasses the need for in-tree value scaling during training.
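The two baselining schemes above can be contrasted in a few lines of code. The following is a minimal, self-contained sketch (not taken from the paper's implementation): a hypothetical one-step 'episode' with a made-up reward table stands in for a full rollout, the self-critical advantage is the sampled return minus the greedy-rollout return, and the scalar self-competition signal reshapes the raw return into ±1 against an assumed historical baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([1.0, 2.0, 3.0])  # hypothetical per-action returns


def rollout(logits, greedy=False):
    """Stand-in for a full episode rollout: pick one action, return its reward."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = int(np.argmax(probs)) if greedy else int(rng.choice(len(probs), p=probs))
    return REWARDS[action]


logits = np.array([0.1, 0.2, 0.3])

# Self-critical training: baseline the sampled episode's return with the
# return of a greedy rollout of the same policy (Rennie et al., 2017).
r_sample = rollout(logits)
r_greedy = rollout(logits, greedy=True)
advantage = r_sample - r_greedy  # positive iff the sample beat its own greedy rollout

# Scalar self-competition: reshape the raw return into a binary +/-1 outcome
# against a historical performance baseline (here an assumed constant).
historical_baseline = 2.0
z = 1.0 if r_sample > historical_baseline else -1.0
```

Note that `z` discards everything about the episode except whether the threshold was beaten, which is exactly the limited-information issue the next paragraph raises.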
The scalar baseline against which the agent is planning must be chosen carefully, as it should be neither too difficult nor too easy to outperform. Additionally, in complex problems, a single scalar value holds limited information about the instance at hand and about the agent's strategies for reaching the threshold performance. In this paper, we follow the idea of self-competition in AlphaZero-style algorithms for deterministic single-player sequential planning problems. Inspired by the original powerful 'intuition' of

