CONVEX REGULARIZATION IN MONTE-CARLO TREE SEARCH

Abstract

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large-scale sequential decision-making problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off exploitation of good actions and exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by studying the benefit of convex regularization in Monte-Carlo Tree Search (MCTS) to drive exploration efficiently and to improve policy updates, as already observed in RL. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the first regret analysis of regularized MCTS and showing that it guarantees an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and on the Tsallis entropy of the policy. We provide an intuitive demonstration of the effect of each regularizer, empirically verifying the consequences of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated into AlphaGo and AlphaZero, and we empirically demonstrate the superiority of convex regularization over representative baselines on well-known RL problems and across several Atari games.

1. INTRODUCTION

Monte-Carlo Tree Search (MCTS) is a well-known algorithm for solving decision-making problems through the combination of Monte-Carlo planning with an incremental tree structure (Coulom, 2006). Although standard MCTS is only suitable for problems with discrete state and action spaces, recent advances have shown how to enable MCTS in continuous problems (Silver et al., 2016; Yee et al., 2016). Most remarkably, AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017b;a) couple MCTS with neural networks trained using Reinforcement Learning (RL) (Sutton & Barto, 1998) methods, e.g., Deep Q-Learning (Mnih et al., 2015), to speed up learning in large-scale problems with continuous state spaces. In particular, one neural network computes value estimates of states as a replacement for time-consuming Monte-Carlo rollouts, while another estimates policies that serve as a probability prior for PUCT, the action selection method introduced therein, a variant of the well-known UCT sampling strategy commonly used in MCTS for exploration (Kocsis et al., 2006); both selection rules are recalled below. Despite AlphaGo and AlphaZero achieving state-of-the-art performance in games with a high branching factor, like Go (Silver et al., 2016) and Chess (Silver et al., 2017a), both methods suffer from poor sample-efficiency, mostly due to the polynomial convergence rate of PUCT (Xiao et al., 2019). This problem, combined with the high computational cost of evaluating the deep neural networks, significantly hinders the applicability of both methodologies.

In this paper, we provide a unified theory of the use of convex regularization in MCTS, which has proved to be an efficient solution for driving exploration and stabilizing learning in RL (Schulman et al., 2015; 2017a; Haarnoja et al., 2018; Buesing et al., 2020). In particular, we show how a regularized objective function in MCTS can be seen as an instance of the Legendre-Fenchel transform, similar to previous findings on the use of duality in RL (Mensch & Blondel, 2018; Geist et al., 2019; Nachum & Dai, 2020) and game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007). Building on this theoretical framework, we derive the first regret analysis of regularized MCTS, and prove that a generic convex regularizer guarantees an exponential convergence rate to the solution of the regularized objective.
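For reference, the two selection rules differ only in their exploration bonus. In standard (and here, our own) notation, with $N(s)$ the visit count of state $s$, $n(s,a)$ the visit count of action $a$ in $s$, $P(a|s)$ the prior given by the policy network, and $c > 0$ an exploration constant, UCT and PUCT respectively select

$$a_{\mathrm{UCT}} = \arg\max_a \left[ Q(s,a) + c \sqrt{\frac{\log N(s)}{n(s,a)}} \right], \qquad a_{\mathrm{PUCT}} = \arg\max_a \left[ Q(s,a) + c \, P(a|s) \, \frac{\sqrt{N(s)}}{1 + n(s,a)} \right].$$

The prior $P(a|s)$ focuses exploration on actions that the policy network already deems promising, which is what makes PUCT effective in games with a high branching factor.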
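To preview the framework, let $\Omega$ be a strongly convex regularizer on the probability simplex $\Delta_{\mathcal{A}}$ and $\tau > 0$ a temperature (the notation here is ours and may differ from later sections). The regularized value backup at a state $s$ is the Legendre-Fenchel transform (convex conjugate) of $\tau\Omega$ evaluated at the action-value vector $Q(s,\cdot)$:

$$\Omega^*\big(Q(s,\cdot)\big) = \max_{\pi \in \Delta_{\mathcal{A}}} \; \langle \pi, Q(s,\cdot) \rangle - \tau \, \Omega(\pi).$$

For instance, with the negative Shannon entropy, $\Omega(\pi) = \sum_a \pi(a) \log \pi(a)$, the conjugate has the closed form of a maximum-entropy (log-sum-exp) backup,

$$\Omega^*\big(Q(s,\cdot)\big) = \tau \log \sum_a \exp\big(Q(s,a)/\tau\big),$$

attained by the softmax policy $\pi(a|s) \propto \exp\big(Q(s,a)/\tau\big)$.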
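As a minimal numerical sanity check of this conjugacy (a sketch using our own function names, not an implementation from the paper), the following Python snippet verifies that the softmax policy attains the log-sum-exp value:

import numpy as np

def regularized_value(q, tau=1.0):
    # tau * log sum_a exp(q_a / tau), via a numerically stabilized log-sum-exp.
    # This is the convex conjugate of the negative Shannon entropy on the
    # simplex, evaluated at the action-value vector q with temperature tau.
    z = q / tau
    m = z.max()
    return tau * (m + np.log(np.exp(z - m).sum()))

def softmax_policy(q, tau=1.0):
    # Maximizer of <pi, q> + tau * H(pi) over the simplex: softmax(q / tau).
    z = q / tau
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([1.0, 2.0, 0.5])   # hypothetical action values at a tree node
tau = 0.5
pi = softmax_policy(q, tau)
entropy = -(pi * np.log(pi)).sum()
attained = pi @ q + tau * entropy            # <pi, q> + tau * H(pi)
assert np.isclose(attained, regularized_value(q, tau))
print(f"regularized value: {attained:.4f}")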

