CONVEX REGULARIZATION IN MONTE-CARLO TREE SEARCH

Abstract

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large-scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off the exploitation of good actions and the exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by studying the benefit of convex regularization in Monte-Carlo Tree Search (MCTS) to drive exploration efficiently and to improve policy updates, as already observed in RL. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the first regret analysis of regularized MCTS and showing that it guarantees an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and on the Tsallis entropy of the policy. We provide an intuitive demonstration of the effect of each regularizer, empirically verifying the consequences of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated in AlphaGo and AlphaZero, and we empirically show the superiority of convex regularization w.r.t. representative baselines on well-known RL problems and several Atari games.

1. INTRODUCTION

Monte-Carlo Tree Search (MCTS) is a well-known algorithm to solve decision-making problems through the combination of Monte-Carlo planning with an incremental tree structure (Coulom, 2006). Although standard MCTS is only suitable for problems with discrete state and action spaces, recent advances have shown how to enable MCTS in continuous problems (Silver et al., 2016; Yee et al., 2016). Most remarkably, AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017b;a) couple MCTS with neural networks trained using Reinforcement Learning (RL) (Sutton & Barto, 1998) methods, e.g., Deep Q-Learning (Mnih et al., 2015), to speed up learning in large-scale problems with continuous state space. In particular, a neural network is used to compute value function estimates of states as a replacement of time-consuming Monte-Carlo rollouts, and another neural network is used to estimate policies as a probability prior for the therein introduced PUCT action selection method, a variant of the well-known UCT sampling strategy commonly used in MCTS for exploration (Kocsis et al., 2006). Despite AlphaGo and AlphaZero achieving state-of-the-art performance in games with high branching factor, like Go (Silver et al., 2016) and Chess (Silver et al., 2017a), both methods suffer from poor sample-efficiency, mostly due to the polynomial convergence rate of PUCT (Xiao et al., 2019). This problem, combined with the high computational cost of evaluating the deep neural networks, significantly hinders the applicability of both methodologies. In this paper, we provide a unified theory of the use of convex regularization in MCTS, which has proved to be an efficient solution for driving exploration and stabilizing learning in RL (Schulman et al., 2015; 2017a; Haarnoja et al., 2018; Buesing et al., 2020).
In particular, we show how a regularized objective function in MCTS can be seen as an instance of the Legendre-Fenchel transform, similar to previous findings on the use of duality in RL (Mensch & Blondel, 2018; Geist et al., 2019; Nachum & Dai, 2020) and game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007). Building on this theoretical framework, we derive the first regret analysis of regularized MCTS, and prove that a generic convex regularizer guarantees an exponential convergence rate to the solution of the regularized objective function, which improves on the polynomial rate of PUCT. These results provide a theoretical ground for the use of arbitrary entropy-based regularizers in MCTS, until now limited to maximum entropy (Xiao et al., 2019); among them, we specifically study the relative entropy of policy updates, drawing on similarities with trust-region and proximal methods in RL (Schulman et al., 2015; 2017b), and the Tsallis entropy, used for enforcing the learning of sparse policies (Lee et al., 2018). Moreover, we provide an empirical analysis of the toy problem introduced in Xiao et al. (2019) to intuitively illustrate the practical consequences of our theoretical results for each regularizer. Finally, we empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing complexity, from classic RL problems to an extensive analysis of Atari games, confirming the benefit of our novel operators compared to maximum entropy and, in general, the superiority of convex regularization in MCTS w.r.t. classic methods.

2. PRELIMINARIES

2.1. MARKOV DECISION PROCESSES

We consider the classical definition of a finite-horizon Markov Decision Process (MDP) as a 5-tuple M = ⟨S, A, R, P, γ⟩, where S is the state space, A is the finite discrete action space, R : S × A × S → R is the reward function, P : S × A → S is the transition kernel, and γ ∈ [0, 1) is the discount factor. A policy π ∈ Π : S × A → R is a probability distribution over the actions a executable in a state s. A policy π induces a value function corresponding to the expected cumulative discounted reward collected by the agent when executing action a in state s, and following the policy π thereafter:

Q^π(s, a) ≜ E[Σ_{k=0}^∞ γ^k r_{i+k+1} | s_i = s, a_i = a, π],

where r_{i+1} is the reward obtained after the i-th transition. An MDP is solved by finding the optimal policy π*, i.e. the policy maximizing the expected cumulative discounted reward. The optimal policy is the one satisfying the optimal Bellman equation (Bellman, 1954)

Q*(s, a) ≜ ∫_S P(s'|s, a) [R(s, a, s') + γ max_{a'} Q*(s', a')] ds',

and is the fixed point of the optimal Bellman operator

T*Q(s, a) ≜ ∫_S P(s'|s, a) [R(s, a, s') + γ max_{a'} Q(s', a')] ds'.

Additionally, we define the Bellman operator under the policy π as

T^π Q(s, a) ≜ ∫_S P(s'|s, a) [R(s, a, s') + γ ∫_A π(a'|s') Q(s', a') da'] ds',

the optimal value function V*(s) ≜ max_{a∈A} Q*(s, a), and the value function under the policy π as V^π(s) ≜ E_{a∼π(·|s)}[Q^π(s, a)].
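As a concrete illustration of the operator T* above, the following is a minimal NumPy sketch on a tiny made-up MDP (the transition and reward tensors are illustrative, not from the paper); iterating the backup converges to the fixed point Q*.

```python
import numpy as np

def bellman_optimal_backup(Q, P, R, gamma):
    """One application of the optimal Bellman operator T*Q for a finite MDP.

    P[s, a, s2] are transition probabilities, R[s, a, s2] rewards,
    Q[s, a] the current action-value estimates."""
    V = Q.max(axis=1)                         # V(s') = max_a' Q(s', a')
    return (P * (R + gamma * V)).sum(axis=2)  # expectation over next states s'

# Tiny 2-state, 2-action MDP with reward 1 everywhere (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.ones((2, 2, 2))
Q = np.zeros((2, 2))
for _ in range(200):                          # value iteration towards Q*
    Q = bellman_optimal_backup(Q, P, R, gamma=0.9)
# With constant reward 1, the fixed point is Q* = 1 / (1 - γ) = 10.
```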

2.2. MONTE-CARLO TREE SEARCH AND UPPER CONFIDENCE BOUNDS FOR TREES

Monte-Carlo Tree Search (MCTS) is a planning strategy based on a combination of Monte-Carlo sampling and tree search to solve MDPs. MCTS builds a tree where the nodes are the visited states of the MDP, and the edges are the actions executed in each state. MCTS converges to the optimal policy (Kocsis et al., 2006; Xiao et al., 2019), iterating over a loop composed of four steps:

1. Selection: starting from the root node, a tree-policy is executed to navigate the tree until a node with unvisited children, i.e. an expandable node, is reached;
2. Expansion: the reached node is expanded according to the tree-policy;
3. Simulation: a rollout, e.g. a Monte-Carlo simulation, is run from the visited child of the current node to the end of the episode;
4. Backup: the collected reward is used to update the action-values Q(·) of the nodes visited in the trajectory from the root node to the expanded node.

The tree-policy used to select the action to execute in each node needs to balance the use of already known good actions and the visitation of unknown states. The Upper Confidence bounds for Trees (UCT) sampling strategy (Kocsis et al., 2006) extends the well-known UCB1 sampling strategy for multi-armed bandits (Auer et al., 2002) to MCTS. Considering each node corresponding to a state s ∈ S as a different bandit problem, UCT selects an action a ∈ A applying an upper bound to the action-value function

UCT(s, a) = Q(s, a) + ε √(log N(s) / N(s, a)),

where N(s, a) is the number of executions of action a in state s, N(s) = Σ_a N(s, a), and ε is a constant parameter to tune exploration. UCT asymptotically converges to the optimal action-value function Q*, for all states and actions, with the probability of executing a suboptimal action at the root node approaching 0 at a polynomial rate O(1/t), for a simulation budget t (Kocsis et al., 2006; Xiao et al., 2019).
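The UCT selection rule above can be sketched as follows (a minimal illustrative snippet; the dictionary-based statistics and the toy numbers are ours, and unvisited actions get an infinite bound so they are tried first):

```python
import math

def uct_select(Q, N, eps):
    """Pick argmax_a Q(s,a) + eps * sqrt(log N(s) / N(s,a))."""
    n_s = sum(N.values())  # N(s): total visits of the node
    def bound(a):
        if N[a] == 0:      # unvisited actions are explored first
            return math.inf
        return Q[a] + eps * math.sqrt(math.log(n_s) / N[a])
    return max(Q, key=bound)

Q = {"left": 0.4, "right": 0.5}
N = {"left": 10, "right": 1}
a = uct_select(Q, N, eps=0.7)  # the rarely tried arm wins thanks to the bonus
```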

3. REGULARIZED MONTE-CARLO TREE SEARCH

The success of RL methods based on entropy regularization comes from their ability to achieve state-of-the-art performance in decision making and control problems, while enjoying theoretical guarantees and ease of implementation (Haarnoja et al., 2018; Schulman et al., 2015; Lee et al., 2018). However, the use of entropy regularization in MCTS is still mostly unexplored, although its advantages for exploration and value function estimation would be desirable to reduce the detrimental effect of the high branching factor in AlphaGo and AlphaZero. To the best of our knowledge, the MENTS algorithm (Xiao et al., 2019) is the first and only method to combine MCTS and entropy regularization. In particular, MENTS uses a maximum entropy regularizer in AlphaGo, proving an exponential convergence rate to the solution of the respective softmax objective function and achieving state-of-the-art performance in some Atari games (Bellemare et al., 2013). In the following, motivated by the success in RL and the promising results of MENTS, we derive a unified theory of regularization in MCTS based on the Legendre-Fenchel transform (Geist et al., 2019), which generalizes the use of maximum entropy in MENTS to an arbitrary convex regularizer. Notably, our theoretical framework enables us to rigorously motivate the advantages of using maximum entropy and other entropy-based regularizers, such as relative entropy or Tsallis entropy, drawing connections with their RL counterparts TRPO (Schulman et al., 2015) and Sparse DQN (Lee et al., 2018), as MENTS does with Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

3.1. LEGENDRE-FENCHEL TRANSFORM

Consider an MDP M = ⟨S, A, R, P, γ⟩, as previously defined. Let Ω : Π → R be a strongly convex function. For a policy π_s = π(·|s) and Q_s = Q(s, ·) ∈ R^A, the Legendre-Fenchel transform (or convex conjugate) of Ω is Ω* : R^A → R, defined as:

Ω*(Q_s) ≜ max_{π_s∈Π_s} π_s^⊤ Q_s − τΩ(π_s),

where the temperature τ specifies the strength of regularization. Among the several properties of the Legendre-Fenchel transform, we use the following (Mensch & Blondel, 2018; Geist et al., 2019).

Proposition 1 Let Ω be strongly convex.
• Unique maximizing argument: ∇Ω* is Lipschitz and satisfies ∇Ω*(Q_s) = arg max_{π_s∈Π_s} π_s^⊤ Q_s − τΩ(π_s).
• Boundedness: if there are constants L_Ω and U_Ω such that L_Ω ≤ Ω(π_s) ≤ U_Ω for all π_s ∈ Π_s, then max_{a∈A} Q_s(a) − τU_Ω ≤ Ω*(Q_s) ≤ max_{a∈A} Q_s(a) − τL_Ω.
• Contraction: for any Q¹, Q² ∈ R^{S×A}, ‖Ω*(Q¹) − Ω*(Q²)‖_∞ ≤ γ‖Q¹ − Q²‖_∞.

Although the Legendre-Fenchel transform Ω* applies to every strongly convex function, for the purpose of this work we only consider a representative set of entropic regularizers.
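For the maximum-entropy regularizer (see Table 1), Ω* is a temperature-scaled log-sum-exp and ∇Ω* a softmax; the sketch below (our illustration, not the authors' code) also checks the boundedness property numerically, with L_Ω = −log|A| and U_Ω = 0.

```python
import numpy as np

def omega_star_maxent(q, tau):
    """Ω*(Q_s) = τ log Σ_a exp(Q(s,a)/τ), computed with a stable log-sum-exp."""
    z = q / tau
    return tau * (z.max() + np.log(np.exp(z - z.max()).sum()))

def grad_omega_star_maxent(q, tau):
    """∇Ω*(Q_s): the unique maximizing policy, a softmax over Q(s,·)/τ."""
    p = np.exp((q - q.max()) / tau)
    return p / p.sum()

q, tau = np.array([1.0, 2.0, 0.5]), 0.1
v = omega_star_maxent(q, tau)
pi = grad_omega_star_maxent(q, tau)
# Boundedness: max_a Q_s(a) - τ·U_Ω ≤ Ω*(Q_s) ≤ max_a Q_s(a) - τ·L_Ω
assert q.max() <= v <= q.max() + tau * np.log(len(q))
```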

3.2. REGULARIZED BACKUP AND TREE POLICY

In MCTS, each node of the tree represents a state s ∈ S and contains a visitation count N(s, a) for each action. Let s_0, a_0, s_1, a_1, ..., s_T be the state-action trajectory in a simulation, where n(s_T) is the leaf node of the tree corresponding to the reached state s_T. Whenever a node n(s_T) is expanded, the respective action-values (Equation 6) are initialized as Q_Ω(s_T, a) = 0 and N(s_T, a) = 0, for all a ∈ A. For all nodes in the trajectory, the visitation count is updated by N(s_t, a_t) = N(s_t, a_t) + 1, and the action-values by

Q_Ω(s_t, a_t) = { r(s_t, a_t) + γρ                       if t = T,
                { r(s_t, a_t) + γΩ*(Q_Ω(s_{t+1})/τ)      if t < T,    (6)

where Q_Ω(s_{t+1}) ∈ R^A is the vector with components Q_Ω(s_{t+1}, a), ∀a ∈ A, and ρ is an estimate returned by an evaluation function computed in s_T, e.g. a discounted cumulative reward averaged over multiple rollouts, or the value of node n(s_T) returned by a value-function approximator, e.g. a neural network pretrained with deep Q-learning (Mnih et al., 2015), as done in (Silver et al., 2016; Xiao et al., 2019). We revisit the E2W sampling strategy, limited to maximum entropy regularization (Xiao et al., 2019), and, through the use of the convex conjugate in Equation (6), we derive a novel sampling strategy that generalizes to any convex regularizer:

π_t(a_t|s_t) = (1 − λ_{s_t}) ∇Ω*(Q_Ω(s_t)/τ)(a_t) + λ_{s_t}/|A|,    (7)

where λ_{s_t} = ε|A|/log(Σ_a N(s_t, a) + 1), with ε > 0 as an exploration parameter, and ∇Ω* depends on the regularizer in use (see Table 1 for maximum, relative, and Tsallis entropy). We call this sampling strategy Extended Empirical Exponential Weight (E3W) to highlight the extension of E2W from maximum entropy to a generic convex regularizer.
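The E3W sampling strategy described above can be sketched as follows (an illustrative NumPy snippet: the clamping of λ and the guard against log(1) = 0 at an unvisited node are our additions, and a softmax stands in for ∇Ω*):

```python
import numpy as np

def e3w_policy(q, visits, tau, eps, grad_omega_star):
    """Mix the regularized greedy policy ∇Ω*(Q/τ) with uniform exploration."""
    n_actions = len(q)
    total = visits.sum()
    # λ_s = ε|A| / log(Σ_a N(s,a) + 1), clamped to [0, 1]; at an unvisited
    # node the guard makes λ = 1, i.e. purely uniform exploration.
    lam = min(1.0, eps * n_actions / max(np.log(total + 1), 1e-8))
    pi = grad_omega_star(q / tau)
    return (1 - lam) * pi + lam / n_actions

softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
pi = e3w_policy(np.array([1.0, 2.0, 0.5]), np.array([3, 5, 2]),
                tau=0.1, eps=0.1, grad_omega_star=softmax)
action = np.random.choice(len(pi), p=pi)  # sample the next action to expand
```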

3.3. CONVERGENCE RATE TO REGULARIZED OBJECTIVE

We show that the regularized value V_Ω can be effectively estimated at the root state s ∈ S, under the assumption that each node in the tree has a σ²-subgaussian distribution. This result extends the analysis provided in (Xiao et al., 2019), which is limited to the use of maximum entropy.

Theorem 1 At the root node s, where N(s) is the number of visitations and V_Ω(s) is the estimated value, for ε > 0 and constants C and Ĉ, we have

P(|V_Ω(s) − V*_Ω(s)| > ε) ≤ C exp{−N(s)ε / (Ĉσ(log(2 + N(s)))²)},

where V_Ω(s) = Ω*(Q_s) and V*_Ω(s) = Ω*(Q*_s). From this theorem, we obtain that the convergence rate of choosing the best action a* at the root node, when using the E3W strategy, is exponential.

Theorem 2 Let a_t be the action returned by E3W at step t. For large enough t and constants C, Ĉ,

P(a_t ≠ a*) ≤ Ct exp{−t / (Ĉσ(log(t))³)}.    (9)

4. ENTROPY-REGULARIZATION BACKUP OPERATORS

From the introduction of a unified view of generic strongly convex regularizers as backup operators in MCTS, we narrow the analysis to entropy-based regularizers. For each entropy function, Table 1 shows the Legendre-Fenchel transform and the maximizing argument, which can be respectively replaced in our backup operation (Equation 6) and sampling strategy E3W (Equation 7). Using maximum entropy retrieves the maximum entropy MCTS problem introduced in the MENTS algorithm (Xiao et al., 2019). This approach closely resembles the maximum entropy RL framework used to encourage exploration (Haarnoja et al., 2018; Schulman et al., 2017a). We introduce two novel MCTS algorithms, based respectively on the minimization of the relative entropy of the policy update, inspired by trust-region (Schulman et al., 2015) and proximal optimization methods (Schulman et al., 2017b) in RL, and on the maximization of Tsallis entropy, which has been more recently introduced in RL as an effective solution to enforce the learning of sparse policies (Lee et al., 2018). We call these algorithms RENTS and TENTS. Contrary to maximum and relative entropy, the definition of the Legendre-Fenchel transform and maximizing argument of Tsallis entropy is non-trivial, being

Ω*(Q_t) = τ · spmax(Q_t(s, ·)/τ),    (10)

∇Ω*(Q_t) = max{Q_t(s, a)/τ − (Σ_{a∈K} Q_t(s, a)/τ − 1)/|K|, 0},    (11)

where spmax is defined for any function f : S × A → R as

spmax(f(s, ·)) ≜ Σ_{a∈K} (f(s, a)²/2 − (Σ_{a∈K} f(s, a) − 1)²/(2|K|²)) + 1/2,

and K is the set of actions satisfying 1 + i f(s, a_i) > Σ_{j=1}^i f(s, a_j), with a_i indicating the action with the i-th largest value of f(s, a) (Lee et al., 2018).

Table 1: List of entropy regularizers with Legendre-Fenchel transforms and maximizing arguments.
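The Tsallis maximizing argument above is the sparsemax projection of Q(s,·)/τ onto the probability simplex (Lee et al., 2018); a small NumPy sketch of the sort-based computation of the support K (our illustration):

```python
import numpy as np

def sparsemax(z):
    """Tsallis-entropy maximizing argument ∇Ω* with z = Q(s,·)/τ:
    Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]          # values in decreasing order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    # K: indices i with 1 + i·z_(i) > Σ_{j≤i} z_(j)
    k = ks[1 + ks * z_sorted > cumsum].max()
    threshold = (cumsum[k - 1] - 1) / k  # (Σ_{a∈K} z_a − 1) / |K|
    return np.maximum(z - threshold, 0.0)

pi = sparsemax(np.array([2.0, 1.1, 1.0, -3.0]))  # → [0.95, 0.05, 0.0, 0.0]
```

Only two actions receive non-zero probability here, illustrating the sparse policies that TENTS exploits.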

Entropy | Regularizer Ω(π_s) | Legendre-Fenchel Ω*(Q_s) | Maximizing argument ∇Ω*(Q_s)
Maximum | Σ_a π(a|s) log π(a|s) | τ log Σ_a e^{Q(s,a)/τ} | e^{Q(s,a)/τ} / Σ_b e^{Q(s,b)/τ}
Relative | D_KL(π_t(·|s) ‖ π_{t−1}(·|s)) | τ log Σ_a π_{t−1}(a|s) e^{Q_t(s,a)/τ} | π_{t−1}(a|s) e^{Q_t(s,a)/τ} / Σ_b π_{t−1}(b|s) e^{Q_t(s,b)/τ}
Tsallis | ½(‖π(·|s)‖²₂ − 1) | Equation (10) | Equation (11)

4.1. REGRET ANALYSIS

At the root node, let each child node i be assigned a random variable X_i, with mean value V_i, while the quantities related to the optimal branch are denoted by *, e.g. mean value V*. At each timestep n, the mean value of variable X_i is V_{i_n}. The pseudo-regret (Coquelin & Munos, 2007) at the root node, at timestep n, is defined as R^UCT_n = nV* − Σ_{t=1}^n V_{i_t}. Similarly, we define the regret of E3W at the root node of the tree as

R_n = nV* − Σ_{t=1}^n V_{i_t} = nV* − Σ_{t=1}^n Σ_i I(i_t = i) V_i = nV* − Σ_i V_i Σ_{t=1}^n π̂_t(a_i|s),

where π̂_t(·) is the policy at time step t, and I(·) is the indicator function.

Theorem 3 Let κ_i = ∇Ω*(a_i|s) + L_p √(Ĉσ² log(C/δ)/(2n)) and χ_i = ∇Ω*(a_i|s) − L_p √(Ĉσ² log(C/δ)/(2n)), where ∇Ω*(·|s) is the policy with respect to the mean value vector V(·) at the root node s. For any δ > 0, with probability at least 1 − δ, there exist constants L, p, C, Ĉ such that the pseudo-regret R_n satisfies

nV* − n Σ_i V_i κ_i + L_p τ(U_Ω − L_Ω)/(1 − γ) ≤ R_n ≤ nV* − n Σ_i V_i χ_i − L_p τ(U_Ω − L_Ω)/(1 − γ).

This theorem provides bounds for the regret of E3W using a generic convex regularizer Ω; thus, we can easily retrieve from it the regret bound for each entropy regularizer. Let m = min_a ∇Ω*(a|s).

Corollary 1 Maximum entropy: nV* − n Σ_i V_i κ_i + Lτ log|A|/(1 − γ) ≤ R_n ≤ nV* − n Σ_i V_i χ_i − Lτ log|A|/(1 − γ).

Corollary 2 Relative entropy: nV* − n Σ_i V_i κ_i + Lτ(log|A| − log(1/m))/(1 − γ) ≤ R_n ≤ nV* − n Σ_i V_i χ_i − Lτ(log|A| − log(1/m))/(1 − γ).

Corollary 3 Tsallis entropy: nV* − n Σ_i V_i κ_i + L((|A| − 1)/(2|A|)) τ/(1 − γ) ≤ R_n ≤ nV* − n Σ_i V_i χ_i − L((|A| − 1)/(2|A|)) τ/(1 − γ).

Remarks.
The regret bound of UCT and its variance have already been analyzed for non-regularized MCTS with binary trees (Coquelin & Munos, 2007). In contrast, our regret bound analysis in Theorem 3 applies to generic regularized MCTS. From the specialized bounds in the corollaries, we observe that maximum and relative entropy share similar results, although the bounds for relative entropy are slightly smaller due to the 1/m term. Remarkably, the bounds for Tsallis entropy become tighter for an increasing number of actions, which translates into limited regret in problems with high branching factor. This result establishes the advantage of Tsallis entropy in complex problems w.r.t. other entropy regularizers, as empirically confirmed by the positive results in several Atari games described in Section 5.

4.2. ERROR ANALYSIS

We analyse the error of the regularized value estimate at the root node n(s) w.r.t. the optimal value: ε_Ω = V_Ω(s) − V*(s).

Theorem 4 For any δ > 0 and generic convex regularizer Ω, with constants C, Ĉ, with probability at least 1 − δ, ε_Ω satisfies

−√(Ĉσ² log(C/δ)/(2N(s))) − τ(U_Ω − L_Ω)/(1 − γ) ≤ ε_Ω ≤ √(Ĉσ² log(C/δ)/(2N(s))).    (14)

To give a better understanding of the effect of each entropy regularizer in Table 1, we specialize the bound in Equation (14) to each of them. From (Lee et al., 2018), we know that for maximum entropy Ω(π_t) = Σ_a π_t log π_t, we have −log|A| ≤ Ω(π_t) ≤ 0; for relative entropy Ω(π_t) = KL(π_t ‖ π_{t−1}), if we define m = min_a π_{t−1}(a|s), then we can derive 0 ≤ Ω(π_t) ≤ −log|A| + log(1/m); and for Tsallis entropy Ω(π_t) = ½(‖π_t‖²₂ − 1), we have −(|A| − 1)/(2|A|) ≤ Ω(π_t) ≤ 0. Then,

Corollary 4 Maximum entropy error: −√(Ĉσ² log(C/δ)/(2N(s))) − τ log|A|/(1 − γ) ≤ ε_Ω ≤ √(Ĉσ² log(C/δ)/(2N(s))).

Corollary 5 Relative entropy error: −√(Ĉσ² log(C/δ)/(2N(s))) − τ(log|A| − log(1/m))/(1 − γ) ≤ ε_Ω ≤ √(Ĉσ² log(C/δ)/(2N(s))).

Corollary 6 Tsallis entropy error: −√(Ĉσ² log(C/δ)/(2N(s))) − ((|A| − 1)/(2|A|)) τ/(1 − γ) ≤ ε_Ω ≤ √(Ĉσ² log(C/δ)/(2N(s))).

These results show that, when the number of actions |A| is large, TENTS enjoys the smallest error; moreover, we also see that the lower bound of RENTS is always smaller than that of MENTS.
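A quick numeric comparison of the range widths U_Ω − L_Ω entering these bounds (our illustration): the maximum/relative entropy term log|A| grows without bound, while the Tsallis term (|A| − 1)/(2|A|) stays below 1/2.

```python
import numpy as np

# Width U_Ω − L_Ω of each regularizer as the action space grows: this is the
# term multiplying τ/(1−γ) in the error bounds above.
for n_actions in [2, 10, 100, 1000]:
    max_ent = np.log(n_actions)                  # unbounded in |A|
    tsallis = (n_actions - 1) / (2 * n_actions)  # always < 1/2
    print(f"|A|={n_actions:5d}  log|A|={max_ent:6.3f}  Tsallis={tsallis:.4f}")
```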

5. EMPIRICAL EVALUATION

In this section, we empirically evaluate the benefit of the proposed entropy-based MCTS regularizers. First, we complement our theoretical analysis with an empirical study of the synthetic tree toy problem introduced in Xiao et al. (2019), which serves as a simple scenario to give an interpretable demonstration of the effects of our theoretical results in practice. Second, we compare to AlphaGo and AlphaZero (Silver et al., 2016; 2017a), recently introduced to enable MCTS to solve large-scale problems with high branching factor. Our implementation is a simplified version of the original algorithms, where we remove various tricks in favor of better interpretability. For the same reason, we do not compare with the most recent state-of-the-art variant of AlphaZero, known as MuZero (Schrittwieser et al., 2019), as this is a slightly different solution, highly tuned to maximize performance, and a detailed description of its implementation is not available.

5.1. SYNTHETIC TREE

For a fair comparison, we use fixed τ = 0.1 and ε = 0.1 across all algorithms. Figures 1 and 2 show how UCT and each regularizer behave for different configurations of the tree. We observe that, while RENTS and MENTS converge slower for increasing tree sizes, TENTS is robust w.r.t. the size of the tree and almost always converges faster than all other methods to the respective optimal value. Notably, the optimal value of TENTS seems to be very close to the one of UCT, i.e. the optimal value of the unregularized objective, and is also reached faster than the one estimated by UCT, while MENTS and RENTS are considerably further from this value. In terms of regret, UCT explores less than the regularized methods and is less prone to high regret, at the cost of slower convergence. Nevertheless, the regret of TENTS is the smallest among the regularized methods, which otherwise seem to explore too much.
These results show a general superiority of TENTS in this toy problem, also confirming our theoretical findings about the advantage of TENTS in terms of approximation error (Corollary 6) and regret (Corollary 3), in problems with many actions.

5.2. ENTROPY-REGULARIZED ALPHAZERO

In its standard form, AlphaZero (Silver et al., 2017a) uses the PUCT sampling strategy, a variant of UCT (Kocsis et al., 2006) that samples actions according to the policy

PUCT(s, a) = Q(s, a) + c P(s, a) √N(s) / (1 + N(s, a)),

where P is a prior probability on action selection, and c is an exploration constant. A value network and a policy network are used to compute, respectively, the action-value function Q and the prior policy P. We use a single neural network with 2 hidden layers of 128 ELU units, and two output layers, respectively for the action-value function and the policy. We run 500 AlphaZero episodes, where each episode is composed of 300 steps. A step consists of running 32 MCTS simulations from the root node, as defined in Section 2, using the action-value function computed by the value network instead of Monte-Carlo rollouts. At the end of each cycle, the average action-value of the root node is computed and stored, the tree is expanded using the given sampling strategy, and the root node is updated with the reached node. At the end of the episode, a minibatch of 32 samples is built from the 300 stored action-values, and the network is trained with one step of gradient descent using RMSProp with learning rate 0.001. The entropy-regularized variants of AlphaZero can be simply derived by replacing the average backup operator with the desired entropy function, and by replacing PUCT with E3W using the respective maximizing argument and ε = 0.1.

Cartpole and Acrobot. Figure 3 shows the cumulative reward of standard AlphaZero based on PUCT, and of the three entropy-regularized variants, on the Cartpole and Acrobot discrete control problems (Brockman et al., 2016). While standard AlphaZero clearly lacks good convergence and stability, the entropy-based variants behave differently according to the problem.
First, although not significantly superior, RENTS exhibits the most stable learning and fastest convergence, confirming the benefit of relative entropy in control problems, as already known for trust-region methods in RL (Schulman et al., 2015). Second, considering the small number of discrete actions in these problems, TENTS cannot benefit from the learning of sparse policies and shows slightly unstable learning in Cartpole, even though the overall performance is satisfying in both problems. Last, MENTS solves the problems slightly slower than RENTS, but reaches the same final performance. Although the results on these simple problems are not conclusive to assert the superiority of one method over the others, they definitely confirm the advantage of regularization in MCTS, and hint at the benefit of relative entropy in control problems. Further analysis on more complex control problems would be desirable (e.g. MuJoCo (Todorov et al., 2012)), but the need to account for continuous actions, a non-trivial setting for MCTS, places it outside the scope of this paper.

5.3. ENTROPY-REGULARIZED ALPHAGO

The learning time of AlphaZero can be slow in problems with high branching factor, due to the need for a large number of MCTS simulations to obtain good estimates of the randomly initialized action-values. To overcome this problem, AlphaGo (Silver et al., 2016) initializes the action-values using values retrieved from a pretrained network, which is kept fixed during training.

Atari. Atari 2600 (Bellemare et al., 2013) is a popular benchmark for testing deep RL methodologies (Mnih et al., 2015; Van Hasselt et al., 2016; Bellemare et al., 2017), but is still relatively disregarded in MCTS. We use a Deep Q-Network, pretrained using the same experimental setting of Mnih et al. (2015), to initialize the action-value function of each node after expansion as Q_init(s, a) = (Q(s, a) − V(s))/τ, for MENTS and TENTS, as done in Xiao et al. (2019). For RENTS, we initialize Q_init(s, a) = log P_prior(a|s) + (Q(s, a) − V(s))/τ, where P_prior is the Boltzmann distribution induced by the action-values Q(s, ·) computed by the network. Each experimental run consists of 512 MCTS simulations. The temperature τ is optimized for each algorithm and game via grid-search between 0.01 and 1. The discount factor is γ = 0.99, and for PUCT the exploration constant is c = 0.1. Table 2 shows the performance, in terms of cumulative reward, of standard AlphaGo with PUCT and of our three regularized versions, on 22 Atari games. Moreover, we also test AlphaGo using the MaxMCTS backup (Khandelwal et al., 2016) for further comparison with classic baselines. We observe that regularized MCTS dominates the other baselines; in particular, TENTS achieves the highest scores in all 22 games, showing that sparse policies are more effective in Atari. This can be explained by Corollary 6, which shows that Tsallis entropy can lead to a lower error at the root node than relative or maximum entropy, even with a high number of actions.
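The node initialization schemes above can be sketched as follows (an illustrative snippet; the function name and argument layout are ours, and the Boltzmann prior for RENTS is built from the same Q values, as described in the text):

```python
import numpy as np

def q_init(q, tau, prior=None):
    """Initialize a newly expanded node's action-values from a pretrained DQN."""
    v = q.max()                   # DQN value estimate: V(s) = max_a Q(s,a)
    base = (q - v) / tau          # MENTS / TENTS: Q_init = (Q - V)/τ
    if prior is None:
        return base
    return np.log(prior) + base   # RENTS: add log of the prior P_prior(·|s)

q = np.array([1.0, 2.0])
prior = np.exp(q) / np.exp(q).sum()     # Boltzmann prior induced by Q(s,·)
ments_init = q_init(q, tau=0.1)
rents_init = q_init(q, tau=0.1, prior=prior)
```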

6. CONCLUSION

We introduced a theory of convex regularization in Monte-Carlo Tree Search (MCTS) based on the Legendre-Fenchel transform. Exploiting this theoretical framework, we studied the regret of MCTS when using a generic strongly convex regularizer, and we proved that it guarantees an exponential convergence rate. We used these results to motivate the use of entropy regularization in MCTS, particularly considering maximum, relative, and Tsallis entropy. Finally, we tested regularized MCTS algorithms on discrete control problems and Atari games, showing their advantages over other methods.

A RELATED WORK

Entropy regularization is a common tool for controlling exploration in Reinforcement Learning (RL) and has led to several successful methods (Schulman et al., 2015; Haarnoja et al., 2018; Schulman et al., 2017a; Mnih et al., 2016). Typically, specific forms of entropy are utilized, such as maximum entropy (Haarnoja et al., 2018) or relative entropy (Schulman et al., 2015). This approach is an instance of the more general duality framework, commonly used in convex optimization theory. Duality has been extensively studied in game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007) and more recently in RL, for instance considering mirror descent optimization (Montgomery & Levine, 2016; Mei et al., 2019), drawing the connection between MCTS and regularized policy optimization (Grill et al., 2020), or formalizing the RL objective via Legendre-Rockafellar duality (Nachum & Dai, 2020). Recently, Geist et al. (2019) introduced regularized Markov Decision Processes, formalizing the RL objective with a generalized form of convex regularization based on the Legendre-Fenchel transform. In this paper, we provide a novel study of convex regularization in MCTS, and derive relative entropy (KL-divergence) and Tsallis entropy regularized MCTS algorithms, i.e. RENTS and TENTS respectively. Note that the recent maximum entropy MCTS algorithm MENTS (Xiao et al., 2019) is a special case of our generalized regularized MCTS. Unlike MENTS, RENTS can take advantage of any action distribution prior; in our experiments, the prior is derived using Deep Q-learning (Mnih et al., 2015). On the other hand, TENTS allows for sparse action exploration, and thus scales to higher-dimensional action spaces than MENTS. In our experiments, both RENTS and TENTS outperform MENTS. Several works focus on modifying classical MCTS to improve exploration. UCB1-tuned (Auer et al., 2002) modifies the upper confidence bound of UCB1 to account for variance in order to improve exploration.
Tesauro et al. (2012) propose a Bayesian version of UCT, which obtains better estimates of node values and uncertainties given limited experience. Many heuristic approaches based on specific domain knowledge have been proposed, such as adding a bonus term to value estimates (Gelly & Wang, 2006; Teytaud & Teytaud, 2010; Childs et al., 2008; Kozelek, 2009; Chaslot et al., 2008) or prior knowledge collected during policy search (Gelly & Silver, 2007; Helmbold & Parker-Wood, 2009; Lorentz, 2010; Tom, 2010; Hoock et al., 2010). Khandelwal et al. (2016) formalize and analyze different on-policy and off-policy complex backup approaches for MCTS planning based on RL techniques. Vodopivec et al. (2017) propose an approach called SARSA-UCT, which performs the dynamic programming backups using SARSA (Rummery, 1995). Both Khandelwal et al. (2016) and Vodopivec et al. (2017) directly borrow value backup ideas from RL to estimate the value at each tree node, but neither provides a proof of convergence.

B PROOFS

Let r̂ and r̄ be, respectively, the average and the expected reward at the leaf node, and let the reward distribution at the leaf node be σ²-sub-Gaussian.

Lemma 1 For the stochastic bandit problem, E3W guarantees that, for t ≥ 4,

P(‖r̄ − r̂_t‖_∞ ≥ 2σ/log(2 + t)) ≤ 4|A| exp{−t/(log(2 + t))³}.

Proof 1 Let us define N_t(a) as the number of times action a has been chosen until time t, and N̄_t(a) = Σ_{k=1}^t π_k(a), where π_k(a) is the E3W policy at time step k. By choosing λ_k = |A|/log(1 + k), it follows that, for all a and t ≥ 4,

N̄_t(a) = Σ_{k=1}^t π_k(a) ≥ Σ_{k=1}^t 1/log(1 + k) ≥ Σ_{k=1}^t (1/log(1 + k) − (k/(k + 1))/(log(1 + k))²) ≥ ∫_1^{1+t} (1/log(1 + k) − (k/(k + 1))/(log(1 + k))²) dk = (1 + t)/log(2 + t) − 1/log 2 ≥ t/(2 log(2 + t)).

From Theorem 2.19 in Wainwright (2019), we have the following concentration inequality:

P(|N_t(a) − N̄_t(a)| > ε) ≤ 2 exp{−ε²/(2 Σ_{k=1}^t σ_k²)} ≤ 2 exp{−2ε²/t},

where σ_k² ≤ 1/4 is the variance of a Bernoulli distribution with p = π_k(a) at time step k. We define the event E_ε = {∀a ∈ A, |N̄_t(a) − N_t(a)| ≤ ε}, and consequently

P(∃a ∈ A, |N̄_t(a) − N_t(a)| ≥ ε) ≤ 2|A| exp{−2ε²/t}.    (16)

Conditioned on the event E_ε, for ε = √(2σ² log(2/δ)/N_t(a)), we have P(|r̄(a) − r̂_t(a)| > 2σ/log(2 + t)) ≤ 2 exp{−t/(log(2 + t))³}.

Under review as a conference paper at ICLR 2021

Therefore, for t ≥ 2,

P(‖r̄ − r̂_t‖_∞ > 2σ/log(2 + t)) ≤ P(‖r̄ − r̂_t‖_∞ > 2σ/log(2 + t) | E_ε) + P(E_ε^c)
≤ Σ_{a∈A} P(|r̄(a) − r̂_t(a)| > 2σ/log(2 + t) | E_ε) + P(E_ε^c)
≤ 2|A| exp{−t/(log(2 + t))³} + 2|A| exp{−t/(log(2 + t))³} = 4|A| exp{−t/(log(2 + t))³}.

Lemma 2 Given two policies π^(1) = ∇Ω*(r^(1)) and π^(2) = ∇Ω*(r^(2)), there exists L such that ‖π^(1) − π^(2)‖_p ≤ L‖r^(1) − r^(2)‖_p.

Proof 2 This comes directly from the fact that π = ∇Ω*(r) is Lipschitz continuous in the ℓ_p-norm. Note that p takes different values according to the choice of regularizer; refer to Niculae & Blondel (2017) for a discussion of each norm for the Shannon and Tsallis entropy regularizers. Relative entropy shares the same properties as Shannon entropy.

Lemma 3 Consider the E3W policy applied to a tree.
At any node s of the tree with depth d, Let us define N * t (s, a) = π * (a|s).t, and Nt (s, a) = t s=1 π s (a|s), where π k (a|s) is the policy at time step k. There exists some C and Ĉ such that P | Nt (s, a) -N * t (s, a)| > Ct log t ≤ Ĉ|A|t exp{- t (log t) 3 }. Proof 3 We denote the following event, E r k = { r(s , .) -rk (s , .) ∞ < 2σ log(2 + k) }. Thus, conditioned on the event  λ k ≤ L t k=1 Qk (s , .) -Q(s , .) p + t k=1 λ k (Lemma 2) ≤ L|A| 1 p t k=1 Qk (s , .) -Q(s , .) ∞ + t k=1 λ k ( Property of p-norm) ≤ L|A| 1 p γ d t k=1 rk (s , .) -r(s , .) ∞ + t k=1 λ k (Contraction 3.1) ≤ L|A| 1 p γ d t k=1 2σ log(2 + k) + t k=1 λ k ≤ L|A| 1 p γ d t k=0 2σ log(2 + k) dk + t k=0 |A| log(1 + k) dk ≤ Ct log t . for some constant C depending on |A|, p, d, σ, L, and γ . Finally, P(| Nt (s, a) -N * t (s, a)| ≥ Ct log t ) ≤ t i=1 P(E c rt ) = t i=1 4|A| exp(- t (log(2 + t)) 3 ) ≤ 4|A|t exp(- t (log(2 + t)) 3 ) = O(t exp(- t (log(t)) 3 )). Lemma 4 Consider the E3W policy applied to a tree. At any node s of the tree, Let us define N * t (s, a) = π * (a|s).t, and N t (s, a) as the number of times action a have been chosen until time step t. There exists some C and Ĉ such that P |N t (s, a) -N * t (s, a)| > Ct log t ≤ Ĉt exp{- t (log t) 3 }. Proof 4 Based on the result from Lemma 3, we have P |N t (s, a) -N * t (s, a)| > (1 + C) t log t ≤ Ct exp{- t (log t) 3 } ≤ P | Nt (s, a) -N * t (s, a)| > Ct log t + P |N t (s, a) -Nt (s, a)| > t log t ≤ 4|A|t exp{- t (log(2 + t)) 3 } + 2|A| exp{- t (log(2 + t)) 2 }(Lemma 3 and (16)) ≤ O(t exp(- t (log t) 3 )). Theorem 1 At the root node s of the tree, defining N (s) as the number of visitations and V Ω * (s) as the estimated value at node s, for > 0, we have P(|V Ω (s) -V * Ω (s)| > ) ≤ C exp{- N (s) Ĉ(log(2 + N (s))) 2 }. Proof 5 We prove this concentration inequality by induction. 
When the depth of the tree is D = 1, from Proposition 1, we get |V Ω (s) -V * Ω (s)| = Ω * (Q Ω (s, .)) -Ω * (Q * Ω (s, .)) ∞ ≤ γ r -r * ∞ (Contraction) where r is the average rewards and r * is the mean reward. So that P(|V Ω (s) -V * Ω (s)| > ) ≤ P(γ r -r * ∞ > ). From Lemma 1, with = 2σγ log(2+N (s)) , we have P(|V Ω (s) -V * Ω (s)| > ) ≤ P(γ r -r * ∞ > ) ≤ 4|A| exp{- N (s) 2σγ(log(2 + N (s))) 2 } = C exp{- N (s) Ĉ(log(2 + N (s))) 2 }. Let assume we have the concentration bound at the depth D -1, Let us define V Ω (s a ) = Q Ω (s, a), where s a is the state reached taking action a from state s. then at depth D -1 P(|V Ω (s a ) -V * Ω (s a )| > ) ≤ C exp{- N (s a ) Ĉ(log(2 + N (s a ))) 2 }. Now at the depth D, because of the Contraction Property, we have |V Ω (s) -V * Ω (s)| ≤ γ Q Ω (s, .) -Q * Ω (s, .) ∞ = γ|Q Ω (s, a) -Q * Ω (s, a)|. Theorem 2 Let a t be the action returned by algorithm E3W at iteration t. Then for t large enough, with some constants C, Ĉ, P(a t = a * ) ≤ Ct exp{- t Ĉσ(log(t)) 3 }. Proof 7 Let us define event E s as in Lemma 5. Let a * be the action with largest value estimate at the root node state s. The probability that E3W selects a sub-optimal arm at s is δ /2n, where ∇Ω * (.|s) is the policy with respect to the mean value vector V (•) at the root node s. For any δ > 0, with probability at least 1 -δ, ∃ constant L, p, C, Ĉ so that the pseudo regret R n satisfies P(a t = a * ) ≤ a P(V Ω (s a )) > V Ω (s a * )|E s ) + P(E c s ) = a P((V Ω (s a ) -V * Ω (s a )) -(V Ω (s a * ) -V * Ω (s a * )) ≥ V * Ω (s a * ) -V * Ω (s a )|E s ) + P(E c s ). Let us define ∆ = V * Ω (s a * ) -V * Ω ( nV * -n i V i κ i + L p τ (U Ω -L Ω ) 1 -γ ≤ R n ≤ nV * -n i V i χ i - L p τ (U Ω -L Ω ) 1 -γ . Proof 8 From Lemma 2 given two policies π (1) = ∇Ω * (r (1) ) and π (2) = ∇Ω * (r (2) ), ∃L, such that π (1) -π (2) p ≤ L r (1) -r (2) p ≤ L 1 p r (1) -r (2) ∞ . 
From (13), we have the regret R n = nV * - i V i n t=1 πt (a i |s), where πt (•) is the policy at time step t, and I(•) is the indicator function. V * is the optimal branch at the root node, V i is the mean value function of the branch with respect to action i, V (•) is the |A| vector of value function at the root node. V (•) is the |A| estimation vector of value function at the root node. π(.|s) = ∇Ω * (V (•)) is the policy with respect to the V (•) vector at the root node. Then for any δ > 0, with probability at least 1 -δ, we have so that In case of Maximum Entropy and Relative Entropy p = 1, because π (1) -π (2) ∞ ≤ L r (1) -r (2) ∞ . So that we have for MENTS R n = nV * - i V i n t=1 πt (a i |s) ≤ nV * - i V i n t=1 π(a i |s) - L p τ (U Ω -L Ω ) 1 -δ + Ĉσ 2 log C δ 2n R n ≤ nV * - i V i n t=1 π(a i |s) - L p τ (U Ω -L Ω ) 1 -δ + Ĉσ 2 log C δ 2n R n ≤ nV * -n i V i π(a i |s) - L p τ (U Ω -L Ω ) 1 -δ + Ĉσ 2 log C δ 2n nV * -n i V i κ i + L τ log |A| 1 -γ ≤ R n ≤ nV * -n i V i χ i -L τ log |A| 1 -γ . For RENTS, we have nV * -n i V i κ i + L τ (log |A| -1 m ) 1 -γ ≤ R n ≤ nV * -n i V i χ i -L τ (log |A| -1 m ) 1 -γ where m = min a π(a|s). In case of Tsallis Entropy p = 2 ( Niculae & Blondel (2017) ), so that nV * -n i V i κ i + L 2 |A| -1 2|A| τ 1 -γ ≤ R n ≤ nV * -n i V i χ i - L 2 |A| -1 2|A| τ 1 -γ Before derive the next theorem, we state here the Theorem 2 in Geist et al. (2019) • Boundedness: for two constants L Ω and U Ω such that for all π ∈ Π, we have L Ω ≤ Ω(π) ≤ U Ω , then V * (s) - τ (U Ω -L Ω ) 1 -γ ≤ V * Ω (s) ≤ V * (s).
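The regret bounds above distinguish MENTS ($p = 1$, Shannon entropy) from TENTS ($p = 2$, Tsallis entropy). As a minimal numerical sketch (our own illustration, not the paper's code; the function names are ours), the corresponding maps $\nabla\Omega^*$ are the softmax and the sparsemax (the Euclidean projection onto the simplex), and the regularizer ranges $U_\Omega - L_\Omega$ entering the bounds are $\log|A|$ and $\frac{|A|-1}{2|A|}$:

```python
import math

def softmax(q, tau=1.0):
    """Shannon-entropy-regularized policy (MENTS): pi proportional to exp(q/tau)."""
    m = max(q)
    e = [math.exp((x - m) / tau) for x in q]
    z = sum(e)
    return [x / z for x in e]

def sparsemax(q, tau=1.0):
    """Tsallis-entropy-regularized policy (TENTS): Euclidean projection of
    q/tau onto the probability simplex; may assign exactly zero mass."""
    z = sorted((x / tau for x in q), reverse=True)
    cumsum, k, csum_k = 0.0, 1, z[0]
    for i, zi in enumerate(z, start=1):
        cumsum += zi
        if 1 + i * zi > cumsum:  # support condition of the simplex projection
            k, csum_k = i, cumsum
    theta = (csum_k - 1.0) / k
    return [max(x / tau - theta, 0.0) for x in q]

def shannon_range(n_actions):
    # U_Omega - L_Omega for the Shannon entropy (enters the MENTS bound).
    return math.log(n_actions)

def tsallis_range(n_actions):
    # U_Omega - L_Omega for the Tsallis entropy (enters the TENTS bound).
    return (n_actions - 1) / (2 * n_actions)
```

For $|A| = 4$, `shannon_range(4)` is about 1.386 while `tsallis_range(4)` is 0.375, consistent with the smaller regularizer range of the Tsallis entropy in the bounds above; note also that the sparsemax can put exactly zero probability on clearly sub-optimal actions, while the softmax cannot.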



The value of the standard deviation is not provided in Xiao et al. (2019). After trying different values, we observed that our results match those of Xiao et al. (2019) when using σ = 0.05.
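The $\sigma^2$-sub-Gaussian tail used throughout the proofs can be probed numerically. The sketch below is our own illustration (only the value σ = 0.05 comes from the footnote; the mean, sample size, and function name are hypothetical):

```python
import math
import random

def subgaussian_bound(sigma, n, delta):
    # Tail bound for the mean of n sigma^2-sub-Gaussian samples:
    # P(|mean - mu| > bound) <= delta.
    return math.sqrt(2 * sigma**2 * math.log(2 / delta) / n)

random.seed(0)
sigma, mu, n = 0.05, 0.3, 1000
mean = sum(random.gauss(mu, sigma) for _ in range(n)) / n
# With delta = 1e-6 the empirical mean should fall within the bound.
assert abs(mean - mu) <= subgaussian_bound(sigma, n, delta=1e-6)
```

A Gaussian with standard deviation σ is exactly σ²-sub-Gaussian, so this check is the simplest instance of the assumption on leaf-node rewards.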



Figure 1: For each algorithm, we show the convergence of the value estimate at the root node to the respective optimal value (top), to the UCT optimal value (middle), and the regret (bottom).

Figure 3: Cumulative rewards of AlphaZero with UCT and entropy-based operators, in CartPole (a) and Acrobot (b). Results are averaged over 5 and 10 seeds and show 95% confidence intervals.




Average score in Atari over 100 seeds per game. Bold denotes no statistically significant difference to the highest mean (t-test, p < 0.05). The bottom row shows, for each method, the number of games with no statistically significant difference to the highest mean.
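The significance test mentioned in the caption can be sketched as follows. This is our own illustrative implementation, not the paper's evaluation code; it uses Welch's t statistic with a normal approximation to the t distribution, which is adequate for samples of around 100 seeds:

```python
import math
from statistics import NormalDist

def welch_t(xs, ys):
    """Two-sample Welch t statistic and a normal-approximation
    two-sided p-value; suitable for large per-game seed counts."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    t = (mx - my) / math.sqrt(vx / nx + vy / ny)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p
```

Under the caption's convention, a method's score would be bolded whenever the p-value against the highest-mean method exceeds 0.05.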

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354-359, 2017b.

Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

Gerald Tesauro, VT Rajan, and Richard Segal. Bayesian inference in monte-carlo tree search. arXiv preprint arXiv:1203.3519, 2012.

Fabien Teytaud and Olivier Teytaud. On the huge benefit of decisive moves in monte-carlo tree search algorithms. In Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, pp. 359-364. IEEE, 2010.

E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012.

David Tom. Investigating uct and rave: Steps towards a more robust method, 2010.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Tom Vodopivec, Spyridon Samothrakis, and Branko Ster. On monte carlo tree search and reinforcement learning. Journal of Artificial Intelligence Research, 60:881-936, 2017.


Lemma 5 At any node $s$ of the tree, let $N(s)$ be the number of visitations. We define the event
$$E_s = \left\{ \forall a \in A,\ |N(s,a) - N^*(s,a)| < \frac{N^*(s,a)}{2} \right\},$$
where $\epsilon > 0$ and $V_\Omega(s)$ is the estimated value at node $s$. We have
$$\mathbb{P}\left( |V_\Omega(s) - V^*_\Omega(s)| > \epsilon \,\middle|\, E_s \right) \leq C \exp\left\{ -\frac{N(s)\epsilon}{\hat{C}(\log(2+N(s)))^2} \right\}.$$

Proof 6 The proof follows that of Theorem 1; we prove the concentration inequality by induction. When the depth of the tree is $D = 1$, from Proposition 1 we get
$$|V_\Omega(s) - V^*_\Omega(s)| \leq \gamma \| \hat{r} - r^* \|_\infty \quad \text{(contraction)},$$
where $\hat{r}$ is the vector of average rewards and $r^*$ the vector of mean rewards. So
$$\mathbb{P}\left( |V_\Omega(s) - V^*_\Omega(s)| > \epsilon \,\middle|\, E_s \right) \leq \mathbb{P}\left( \gamma \| \hat{r} - r^* \|_\infty > \epsilon \,\middle|\, E_s \right).$$
From Lemma 1, with $\epsilon = \frac{2\sigma\gamma}{\log(2+N(s))}$ and given $E_s$, we have
$$\mathbb{P}\left( |V_\Omega(s) - V^*_\Omega(s)| > \epsilon \,\middle|\, E_s \right) \leq C \exp\left\{ -\frac{N(s)\epsilon}{\hat{C}(\log(2+N(s)))^2} \right\}.$$
Let us assume the concentration bound holds at depth $D-1$, and define $V_\Omega(s_a) = Q_\Omega(s,a)$, where $s_a$ is the state reached by taking action $a$ from state $s$; then at depth $D-1$,
$$\mathbb{P}\left( |V_\Omega(s_a) - V^*_\Omega(s_a)| > \epsilon \,\middle|\, E_s \right) \leq C_a \exp\left\{ -\frac{N(s_a)\epsilon}{\hat{C}_a(\log(2+N(s_a)))^2} \right\}.$$
Now, at depth $D$, because of the contraction property and given $E_s$, the same argument as in Proof 5 concludes the induction.

Theorem 4 For any $\delta > 0$, with probability at least $1-\delta$, the error $\epsilon_\Omega = V_\Omega(s) - V^*(s)$ satisfies
$$-\sqrt{\frac{\hat{C}\sigma^2\log\frac{C}{\delta}}{2N(s)}} - \frac{\tau(U_\Omega - L_\Omega)}{1-\gamma} \leq \epsilon_\Omega \leq \sqrt{\frac{\hat{C}\sigma^2\log\frac{C}{\delta}}{2N(s)}},$$
where $\tau$ is the temperature and $\gamma$ is the discount factor.

Proof 9 From Theorem 2, let us define $\delta = C \exp\left\{ -\frac{2N(s)\epsilon^2}{\hat{C}\sigma^2} \right\}$, so that $\epsilon = \sqrt{\frac{\hat{C}\sigma^2\log\frac{C}{\delta}}{2N(s)}}$. Then, for any $\delta > 0$, with probability at least $1-\delta$, we have
$$|V_\Omega(s) - V^*_\Omega(s)| \leq \sqrt{\frac{\hat{C}\sigma^2\log\frac{C}{\delta}}{2N(s)}}.$$
From Proposition 1, we have
$$V^*(s) - \frac{\tau(U_\Omega - L_\Omega)}{1-\gamma} \leq V^*_\Omega(s) \leq V^*(s),$$
and combining the two bounds yields the statement.
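The counting argument used in Proof 1, namely that the exploration mass $\sum_{s=1}^{t} \frac{1}{\log(1+s)}$ dominates $\frac{t}{2\log(2+t)}$, admits a quick numerical sanity check (our own sketch; the function name is ours):

```python
import math

def exploration_mass(t):
    # Lower bound on N_hat_t(a): each E3W step mixes in at least
    # lambda_s / |A| = 1 / log(1 + s) of uniform exploration.
    return sum(1.0 / math.log(1 + s) for s in range(1, t + 1))

# Verify the bound from Proof 1 over a range of horizons t >= 4.
for t in range(4, 301):
    assert exploration_mass(t) >= t / (2 * math.log(2 + t))
```

The check also illustrates why every action keeps being visited under E3W: the per-step exploration probability decays only logarithmically, so the cumulative mass grows almost linearly in $t$.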

