CONVEX REGULARIZATION IN MONTE-CARLO TREE SEARCH

Abstract

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large-scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off the exploitation of good actions and the exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by studying the benefit of convex regularization in Monte-Carlo Tree Search (MCTS) to drive exploration efficiently and to improve policy updates, as already observed in RL. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the first regret analysis of regularized MCTS and showing that it guarantees an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and on the Tsallis entropy of the policy. We provide an intuitive demonstration of the effect of each regularizer, empirically verifying the consequences of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated into AlphaGo and AlphaZero, and we empirically show the superiority of convex regularization w.r.t. representative baselines on well-known RL problems and across several Atari games.

1. INTRODUCTION

Monte-Carlo Tree Search (MCTS) is a well-known algorithm for solving decision-making problems through the combination of Monte-Carlo planning with an incremental tree structure (Coulom, 2006). Although standard MCTS is only suitable for problems with discrete state and action spaces, recent advances have shown how to enable MCTS in continuous problems (Silver et al., 2016; Yee et al., 2016). Most remarkably, AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017b;a) couple MCTS with neural networks trained using Reinforcement Learning (RL) (Sutton & Barto, 1998) methods, e.g., Deep Q-Learning (Mnih et al., 2015), to speed up learning on large-scale problems with continuous state spaces. In particular, one neural network is used to compute value function estimates of states, replacing time-consuming Monte-Carlo rollouts, and another neural network is used to estimate policies as a probability prior for the therein-introduced PUCT action selection method, a variant of the well-known UCT sampling strategy commonly used in MCTS for exploration (Kocsis et al., 2006). Although AlphaGo and AlphaZero achieve state-of-the-art performance in games with a high branching factor, like Go (Silver et al., 2016) and Chess (Silver et al., 2017a), both methods suffer from poor sample-efficiency, mostly due to the polynomial convergence rate of PUCT (Xiao et al., 2019). This problem, combined with the high computational cost of evaluating the deep neural networks, significantly hinders the applicability of both methodologies. In this paper, we provide a unified theory of the use of convex regularization in MCTS, which has proved to be an efficient solution for driving exploration and stabilizing learning in RL (Schulman et al., 2015; 2017a; Haarnoja et al., 2018; Buesing et al., 2020).
In particular, we show how a regularized objective function in MCTS can be seen as an instance of the Legendre-Fenchel transform, similar to previous findings on the use of duality in RL (Mensch & Blondel, 2018; Geist et al., 2019; Nachum & Dai, 2020) and game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007). Building on this theoretical framework, we derive the first regret analysis of regularized MCTS and prove that a generic convex regularizer guarantees an exponential convergence rate to the solution of the regularized objective function, which improves on the polynomial rate of PUCT. These results provide a theoretical ground for the use of arbitrary entropy-based regularizers in MCTS, until now limited to maximum entropy (Xiao et al., 2019); among these, we specifically study the relative entropy of policy updates, drawing on similarities with trust-region and proximal methods in RL (Schulman et al., 2015; 2017b), and the Tsallis entropy, used for enforcing the learning of sparse policies (Lee et al., 2018). Moreover, we provide an empirical analysis of the toy problem introduced in Xiao et al. (2019) to intuitively illustrate the practical consequences of our theoretical results for each regularizer. Finally, we empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing complexity, from classic RL problems to an extensive analysis of Atari games, confirming the benefit of our novel operators compared to maximum entropy and, in general, the superiority of convex regularization in MCTS w.r.t. classic methods.
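To give a concrete numerical intuition for this duality: for the (temperature-scaled) Shannon entropy regularizer, the Legendre-Fenchel transform of the regularized objective admits the well-known log-sum-exp closed form, with the softmax policy as maximizer. The following is a minimal sketch of this standard fact, not code from the paper; the function name and example values are ours.

```python
import numpy as np

def legendre_fenchel_entropy(q, tau=1.0):
    """Legendre-Fenchel transform of the scaled negative Shannon entropy:
    Omega*(q) = max_pi  <pi, q> + tau * H(pi),  with H(pi) = -sum_a pi_a log pi_a.
    Closed form: the log-sum-exp, whose gradient (the maximizer) is the softmax."""
    z = np.exp(q / tau)
    value = tau * np.log(z.sum())   # Omega*(q), a smooth upper bound on max(q)
    policy = z / z.sum()            # maximizing (softmax) policy
    return value, policy

q = np.array([1.0, 2.0, 0.5])       # example action values (illustrative)
v, pi = legendre_fenchel_entropy(q, tau=0.5)
```

As the temperature `tau` tends to 0, the regularized value `v` approaches the unregularized maximum and the softmax policy concentrates on the greedy action, which is the smoothing effect exploited by entropy-regularized backups.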

2. PRELIMINARIES

2.1. MARKOV DECISION PROCESSES

We consider the classical definition of a finite-horizon Markov Decision Process (MDP) as a 5-tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the finite discrete action space, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition kernel, and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi \in \Pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the probability distribution of executing an action $a$ in a state $s$. A policy $\pi$ induces an action-value function corresponding to the expected cumulative discounted reward collected by the agent when executing action $a$ in state $s$ and following the policy $\pi$ thereafter: $Q^{\pi}(s, a) \triangleq \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{i+k+1} \mid s_{i} = s, a_{i} = a, \pi\right]$, where $r_{i+1}$ is the reward obtained after the $i$-th transition. An MDP is solved by finding the optimal policy $\pi^{*}$, i.e., the policy that maximizes the expected cumulative discounted reward. The optimal policy satisfies the optimal Bellman equation (Bellman, 1954) $Q^{*}(s, a) \triangleq \int_{\mathcal{S}} \mathcal{P}(s'|s, a) \left[\mathcal{R}(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right] ds'$, and is the fixed point of the optimal Bellman operator $\mathcal{T}^{*} Q(s, a) \triangleq \int_{\mathcal{S}} \mathcal{P}(s'|s, a) \left[\mathcal{R}(s, a, s') + \gamma \max_{a'} Q(s', a')\right] ds'$. Additionally, we define the Bellman operator under the policy $\pi$ as $\mathcal{T}^{\pi} Q(s, a) \triangleq \int_{\mathcal{S}} \mathcal{P}(s'|s, a) \left[\mathcal{R}(s, a, s') + \gamma \int_{\mathcal{A}} \pi(a'|s') Q(s', a') \, da'\right] ds'$, the optimal value function $V^{*}(s) \triangleq \max_{a \in \mathcal{A}} Q^{*}(s, a)$, and the value function under the policy $\pi$ as $V^{\pi}(s) \triangleq \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s, a)\right]$.
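As an illustration of these definitions, the sketch below applies the optimal Bellman operator $\mathcal{T}^{*}$ repeatedly on a toy two-state, two-action MDP until it (approximately) reaches its fixed point $Q^{*}$; since $\mathcal{T}^{*}$ is a $\gamma$-contraction, the iteration converges geometrically. The transition kernel and rewards are made up for this example.

```python
import numpy as np

# Toy MDP: P[s, a, s'] is the transition kernel, R[s, a, s'] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9

def optimal_bellman_operator(Q):
    """T* Q(s, a) = sum_{s'} P(s'|s, a) [R(s, a, s') + gamma * max_{a'} Q(s', a')]."""
    target = R + gamma * Q.max(axis=1)        # broadcast max_{a'} Q(s', a') over s'
    return np.einsum('sat,sat->sa', P, target)

Q = np.zeros((2, 2))
for _ in range(200):                          # gamma-contraction: error shrinks by 0.9 per step
    Q = optimal_bellman_operator(Q)
V_star = Q.max(axis=1)                        # V*(s) = max_a Q*(s, a)
```

After 200 iterations the residual $\|\mathcal{T}^{*}Q - Q\|_\infty$ is on the order of $\gamma^{200}$, so `Q` is numerically indistinguishable from the fixed point $Q^{*}$.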

2.2. MONTE-CARLO TREE SEARCH AND UPPER CONFIDENCE BOUNDS FOR TREES

Monte-Carlo Tree Search (MCTS) is a planning strategy based on a combination of Monte-Carlo sampling and tree search to solve MDPs. MCTS builds a tree where the nodes are the visited states of the MDP, and the edges are the actions executed in each state. MCTS converges to the optimal policy (Kocsis et al., 2006; Xiao et al., 2019), iterating over a loop composed of four steps:

1. Selection: starting from the root node, a tree-policy is executed to navigate the tree until a node with unvisited children, i.e., an expandable node, is reached;

2. Expansion: the reached node is expanded according to the tree-policy;

3. Simulation: a rollout, e.g., a Monte-Carlo simulation, is run from the visited child of the current node to the end of the episode;

4. Backup: the collected reward is used to update the action-values $Q(\cdot)$ of the nodes visited in the trajectory from the root node to the expanded node.

The tree-policy used to select the action to execute in each node needs to balance the use of already known good actions and the visitation of unknown states. The Upper Confidence bounds for Trees (UCT) sampling strategy (Kocsis et al., 2006) extends the well-known UCB1 sampling strategy for multi-armed bandits (Auer et al., 2002) to MCTS. Considering each node corresponding to a state $s \in \mathcal{S}$ as a different bandit problem, UCT selects an action $a \in \mathcal{A}$ by applying an upper confidence bound to the action-value function

$$\text{UCT}(s, a) = Q(s, a) + \epsilon \sqrt{\frac{\log N(s)}{N(s, a)}},$$

where $N(s)$ is the number of visits of state $s$, $N(s, a)$ is the number of times action $a$ has been selected in $s$, and $\epsilon$ is an exploration constant.
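The selection step driven by UCT can be sketched as follows. The node representation (a dict holding the visit count $N(s)$ and, per action, $N(s, a)$ and the cumulative return $W(s, a)$) and the constant $c = \sqrt{2}$ are our illustrative assumptions; AlphaGo's PUCT variant additionally weights the exploration bonus by a learned policy prior.

```python
import math

def uct_select(node, c=math.sqrt(2)):
    """Pick the action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    `node['N']` is the visit count N(s); `node['children'][a]` holds the
    visit count N(s, a) and the cumulative return W(s, a) of action a.
    """
    best_action, best_score = None, -math.inf
    for action, child in node['children'].items():
        if child['N'] == 0:
            return action  # unvisited actions are expanded first
        q = child['W'] / child['N']                              # empirical Q(s, a)
        bonus = c * math.sqrt(math.log(node['N']) / child['N'])  # exploration term
        if q + bonus > best_score:
            best_action, best_score = action, q + bonus
    return best_action
```

Note how a rarely-visited action can outrank a well-explored one with higher empirical value: the bonus shrinks as $N(s, a)$ grows, which is exactly the exploitation-exploration trade-off the tree-policy must balance.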

