POLICY-BASED SELF-COMPETITION FOR PLANNING PROBLEMS

Abstract

AlphaZero-type algorithms may stop improving on single-player tasks when the value network guiding the tree search cannot approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performances and to reshape an episode's reward into a binary output indicating whether the baseline has been exceeded or not. However, this baseline carries only limited information for the agent about how to improve. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process instead of its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.

1. INTRODUCTION

One of the reasons for the success of AlphaZero (Silver et al., 2017) is the use of a policy and value network to guide the Monte Carlo tree search (MCTS) and decrease the search tree's width and depth. Trained on state-outcome pairs, the value network develops an 'intuition' that tells from single game positions which player might win. By normalizing values in the tree to handle changing reward scales (Schadd et al., 2008; Schrittwieser et al., 2020), AlphaZero's mechanisms can be applied to single-agent (or single-player) tasks. Although powerful, the MCTS relies on value approximations which can be hard to predict (van Hasselt et al., 2016; Pohlen et al., 2018). Furthermore, without proper normalization, it can be difficult for value function approximators to adapt to small improvements in later stages of training. In recent years there has been increasing interest in learning sequential solutions for combinatorial optimization problems (COPs) from zero knowledge via deep reinforcement learning. Particularly strong results have been achieved with policy gradient methods using variants of self-critical training (Rennie et al., 2017; Kool et al., 2018; Kwon et al., 2020), which avoid learning a value function altogether. By baselining the gradient estimate with the outcome of rolling out a current or historical policy, actions are reinforced by how much better (or worse) an episode is compared to the rollouts. Something similar is achieved in MCTS-based algorithms for single-player tasks by computing a scalar baseline from the agent's historical performance. The reward of the original task is reshaped to a binary ±1 outcome indicating whether an episode has exceeded this baseline or not (Laterre et al., 2018; Schmidt et al., 2019; Mandhane et al., 2022). This self-competition brings the original single-player task closer to a two-player game and bypasses the need for in-tree value scaling during training.
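The reward reshaping described above can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the cited works; the function name and the way the baseline is maintained (an exponential moving average of past episode returns) are assumptions.

```python
def reshape_reward(episode_return: float, baseline: float) -> int:
    """Reshape the original task reward into a binary +/-1 outcome:
    +1 if the episode exceeded the historical baseline, -1 otherwise."""
    return 1 if episode_return > baseline else -1


def update_baseline(baseline: float, episode_return: float, alpha: float = 0.01) -> float:
    """One possible way to track the agent's historical performance:
    an exponential moving average over episode returns (an assumption
    for illustration; papers differ in how the baseline is computed)."""
    return (1.0 - alpha) * baseline + alpha * episode_return
```

The agent thus only ever sees whether it beat its past self, which keeps the training signal on a fixed ±1 scale regardless of the original reward magnitudes.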
The scalar baseline against which the agent is planning must be carefully chosen, as it should be neither too difficult nor too easy to outperform. Additionally, in complex problems, a single scalar value holds limited information about the instance at hand and the agent's strategies for reaching the threshold performance. In this paper, we follow the idea of self-competition in AlphaZero-style algorithms for deterministic single-player sequential planning problems. Inspired by AlphaZero's original powerful 'intuition' for evaluating board positions, we propose to evaluate states in the value function not by comparing them against a scalar threshold but directly against states at similar timesteps coming from a historical version of the agent. The agent learns by reasoning about potential strategies of its past self. We summarize our contributions as follows: (i) We assume that in a self-competitive framework, the scalar outcome of a trajectory ζ under some baseline policy is less informative for tree-based planning than ζ's intermediate states. Our aim is to put the flexibility of rollouts in self-critical training into a self-competitive framework while maintaining the policy's information in intermediate states of the rollout. We motivate this setup from the viewpoint of advantage baselines, show that policy improvements are preserved, and arrive at a simple instance of gamification where two players start from the same initial state, take actions in turn, and aim to find a better trajectory than the opponent. (ii) We propose the algorithm GAZ Play-to-Plan (GAZ PTP) based on Gumbel AlphaZero (GAZ), the latest addition to the AlphaZero family, introduced by Danihelka et al. (2022). An agent plays the above game against a historical version of itself to improve a policy for the original single-player problem. The idea is to allow only one player in the game to employ MCTS and to compare its states to the opponent's to guide the search.
Policy improvements obtained through GAZ's tree search propagate from the game to the original task. We show the superiority of GAZ PTP over single-player variants of GAZ on two COP classes, the Traveling Salesman Problem and the standard Job-Shop Scheduling Problem. We compare GAZ PTP with different single-player variants of GAZ, with and without self-competition. We consistently outperform all competitors even when granting GAZ PTP only half of the simulation budget for search. In addition, we reach competitive results for both problem classes compared to benchmarks in the literature.

2. RELATED WORK

Gumbel AlphaZero In GAZ, Danihelka et al. (2022) redesigned the action selection mechanisms of AlphaZero and MuZero (Silver et al., 2017; Schrittwieser et al., 2020). At the root node, actions to explore are sampled without replacement using the Gumbel-Top-k trick (Vieira; Kool et al., 2019b). Sequential halving (Karnin et al., 2013) is used to distribute the search simulation budget among the sampled actions. The single action remaining after the halving procedure is selected as the action to be taken in the environment. At non-root nodes, action values of unvisited nodes are completed by a value interpolation, yielding an updated policy. An action is then selected deterministically by matching this updated policy to the visit count distribution (Grill et al., 2020). This procedure theoretically guarantees a policy improvement for correctly estimated action values, both for the root action selection and the updated policy at non-root nodes. Consequently, the principled search of GAZ works well even for a small number of simulations, as opposed to AlphaZero, which might perform poorly if not all actions are visited at the root node. Similarly to MuZero, GAZ normalizes action values with a min-max normalization based on the values found during the tree search to handle changing and unbounded reward scales. However, if, for example, a node value is overestimated, the probability of a good action might be reduced even when all simulations reach the end of the episode. Additionally, value function approximation might be challenging if the magnitude of rewards changes over time or must be approximated over long time horizons (van Hasselt et al., 2016; Pohlen et al., 2018). Rennie et al. (2017) introduce self-critical training, a policy gradient method that baselines the REINFORCE (Williams, 1992) gradient estimator with the reward obtained by rolling out the current policy greedily.
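The two root-node mechanisms of GAZ described above can be sketched as follows. This is a simplified illustration under assumed interfaces (a list of logits and a `simulate` callback returning a value estimate for an action), not GAZ's actual implementation; real GAZ also mixes in completed Q-values when re-ranking candidates between phases.

```python
import math
import random


def gumbel_top_k(logits, k):
    """Gumbel-Top-k trick: adding i.i.d. Gumbel(0, 1) noise to each logit
    and taking the top-k is equivalent to sampling k actions without
    replacement from the softmax of the logits."""
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    scored = sorted(
        ((logit + g, a) for a, (logit, g) in enumerate(zip(logits, gumbels))),
        reverse=True,
    )
    return [a for _, a in scored[:k]]


def sequential_halving(actions, simulate, budget):
    """Distribute `budget` simulations over the sampled actions: in each
    phase, simulate all surviving candidates equally often, then keep the
    better-valued half, until a single action remains."""
    totals = {a: 0.0 for a in actions}
    visits = {a: 0 for a in actions}
    candidates = list(actions)
    phases = max(1, math.ceil(math.log2(len(candidates))))
    for _ in range(phases):
        sims_per_action = max(1, budget // (phases * len(candidates)))
        for a in candidates:
            for _ in range(sims_per_action):
                totals[a] += simulate(a)
                visits[a] += 1
        candidates.sort(key=lambda a: totals[a] / visits[a], reverse=True)
        if len(candidates) == 1:
            break
        candidates = candidates[: max(1, len(candidates) // 2)]
    return candidates[0]
```

Because the budget is split across halving phases, even a handful of simulations already produces a principled comparison among the sampled root actions, which is why GAZ degrades gracefully at small simulation counts.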
As a result, trajectories outperforming the greedy policy are given positive weight while inferior ones are suppressed. Self-critical training eliminates the need for learning a value function approximator (and thus its inherent training challenges) and reduces the variance of the gradient estimates. The agent is further provided with an automatically controlled curriculum to keep improving. Self-critical methods have shown great success in neural combinatorial optimization. Kool et al. (2018) use a greedy rollout of a periodically updated best-so-far policy as a baseline for routing problems. Kwon et al. (2020) and Kool et al. (2019a) bundle the returns of multiple sampled trajectories to baseline the estimator, applied to various COPs. The idea of using rollouts to control the agent's learning curriculum is hard to transfer one-to-one to MCTS-based algorithms, as introducing a value network to avoid full Monte Carlo rollouts in the tree search was exactly one of AlphaZero's great strengths (Silver et al., 2017).
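The self-critical gradient estimate underlying these methods can be written compactly. The sketch below is a generic illustration of the principle from Rennie et al. (2017), not code from any of the cited works; the inputs (a sampled trajectory's return, the greedy rollout's return, and the log-probabilities of the sampled actions) are assumed to come from the surrounding training loop.

```python
def self_critical_loss(sampled_return, greedy_return, log_probs):
    """REINFORCE loss with the greedy rollout's return as baseline.
    The advantage is positive exactly when the sampled trajectory
    beats the greedy policy, so such trajectories are reinforced and
    inferior ones are suppressed; no value network is required."""
    advantage = sampled_return - greedy_return
    # Minimizing this loss increases the log-probability of actions
    # in trajectories with positive advantage.
    return -advantage * sum(log_probs)
```

Note how the baseline doubles as a curriculum: as the greedy policy improves, only trajectories that beat the now-stronger baseline receive positive weight.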

