MCTRANSFORMER: COMBINING TRANSFORMERS AND MONTE-CARLO TREE SEARCH FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Recent studies have explored framing reinforcement learning as a sequence modeling problem and then using Transformers to generate effective solutions. In this study, we introduce MCTransformer, a framework that combines Monte-Carlo Tree Search (MCTS) with Transformers. Our approach uses an actor-critic setup in which the MCTS component is responsible for navigating previously-explored states, aided by input from the Transformer. The Transformer controls the exploration and evaluation of new states, enabling an effective and efficient evaluation of various strategies. In addition to producing highly effective strategies, our setup enables more efficient sampling than existing MCTS-based solutions. MCTransformer is therefore able to learn effectively from a small number of simulations for each newly-explored node without degrading performance. An evaluation conducted on the challenging and well-known problem of SameGame shows that MCTransformer outscores Transformer-only and MCTS-only solutions by a factor of three or more.

1. INTRODUCTION

Transformers have recently been shown to be very effective in the field of reinforcement learning (RL) (Chen et al., 2021; Janner et al., 2021). This was achieved by converting offline RL into a classification problem, which enables the use of the advanced sequence modeling abilities of Transformers. At evaluation time, the Transformer functions as an autoregressive model, generating sequences of future actions. The main shortcoming of this approach is the absence of exploration during online evaluation; the Transformer model is therefore limited in its ability to adjust to novel circumstances. While online fine-tuning of the model was recently proposed (Zheng et al., 2022a), it requires a relatively large number of training samples. Moreover, a one-time fine-tuning approach is not suitable for highly volatile problems that require continuous exploration.

We introduce MCTransformer, an RL framework that enables cost-effective exploration of planning problems. Our approach combines Monte-Carlo Tree Search (MCTS) with the Transformer architecture in an actor-critic setup. The MCTS component of MCTransformer is tasked with balancing the exploration/exploitation trade-off that arises in most RL tasks, while the Transformer component is tasked with predicting the utility of previously-unexplored nodes. Additionally, we use the Transformer to carry out the simulation phase of MCTS (i.e., the rollout policy), where its advanced and effective modeling allows us to use a very small number of simulations, thus keeping our approach efficient.

We evaluate the proposed approach on SameGame, a challenging and well-known problem. The game is considered challenging due to the high variance of its initial states: game boards are randomly initialized, which forces any solver to perform a great deal of exploration from the first step.
Another factor that adds complexity to the planning process is the large bonus score that is awarded only when the board is fully cleared. The ability to correctly assess whether the board can be cleared has a significant effect on the planner's behavior. Our evaluation shows that MCTransformer significantly outperforms top-performing methods in a budget-based setting.
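The interaction between the two components can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: `model` is a hypothetical stand-in for the Transformer that returns action priors and a scalar utility estimate for a newly-expanded node, replacing the long random rollouts of standard MCTS, and the PUCT-style selection rule is one common way of combining learned priors with visit statistics.

```python
import math


def mcts_search(root_state, legal_actions, apply_action, model, n_iter=100, c=1.4):
    """Minimal MCTS where a learned model replaces long random rollouts.

    `model(state)` is a hypothetical stand-in for the Transformer: it returns
    (priors, value) -- a dict mapping actions to priors, and a scalar utility
    estimate for the state.
    """
    # Each node stores: visit count N, total value W, prior P, children, state.
    root = {"N": 0, "W": 0.0, "P": 1.0, "children": {}, "state": root_state}

    def select(node):
        """Descend from `node` to a leaf, balancing value and exploration."""
        path = [node]
        while node["children"]:
            total = sum(ch["N"] for ch in node["children"].values()) + 1
            # PUCT-style score: mean value + prior-weighted exploration bonus
            node = max(
                node["children"].values(),
                key=lambda ch: ch["W"] / (ch["N"] + 1e-9)
                + c * ch["P"] * math.sqrt(total) / (1 + ch["N"]),
            )
            path.append(node)
        return path

    for _ in range(n_iter):
        path = select(root)
        leaf = path[-1]
        priors, value = model(leaf["state"])  # the model evaluates the new node
        for a in legal_actions(leaf["state"]):  # expansion
            leaf["children"][a] = {
                "N": 0,
                "W": 0.0,
                "P": priors.get(a, 1e-3),
                "children": {},
                "state": apply_action(leaf["state"], a),
            }
        for node in path:  # backpropagation
            node["N"] += 1
            node["W"] += value

    # Return the most-visited action at the root.
    return max(root["children"], key=lambda a: root["children"][a]["N"])
```

In a toy one-step game where the model's value estimate simply equals the state, the search concentrates its visits on the action leading to the higher-value state, which is exactly the behavior the exploration/exploitation balance is meant to produce.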

2. PRELIMINARIES

2.1 BACKGROUND - SAMEGAME

Game rules. SameGame is a single-player game, played on a rectangular board of height H and width W. The board is randomly filled with tiles of C different colors¹. Two tiles are considered adjacent if they are connected either vertically or horizontally. A block of tiles is a group of two or more adjacent tiles of the same color; a tile with no adjacent tiles of the same color is a singleton. At each turn, the player selects a single block (singletons cannot be selected), which is then removed from the board. When a block is removed, the board is reorganized as follows: a) tiles above the removed tiles "fall down" (a physics-based model); b) when an entire column is removed, all the columns to its right shift left. The game continues until no more blocks exist on the board, i.e., the board is empty or contains only singletons.

Reward function. The reward at each turn is (n_i - 2)^2, where n_i is the size of the block chosen by the player at step i. If the board is empty when the game concludes, the player receives an additional bonus of 1000 points. If tiles remain on the board, a penalty of Σ_i (n_i - 2)^2 is exacted, where n_i is the number of remaining tiles of color i. The score is the sum of all the rewards. This setup creates two (potentially conflicting) goals for the player: the first is to create blocks that are as large as possible; the second is to empty the board, which may require a larger number of steps due to the need to select smaller blocks.

Complexity. A SameGame board is defined as solvable if a sequence of actions exists so that the board can be emptied. As shown in Schadd et al. (2008) and Takes & Kosters (2009), determining whether a randomly-initialized board is solvable is NP-complete. Therefore, finding a sequence of actions that maximizes the score, regardless of whether the board is solvable, is also NP-complete. The main reasons for this difficulty are the game's high branching factor (around 17, initially) and the fact that an average game in our setup is approximately 27 moves long.
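The rules and reward above can be sketched in code. The following Python fragment is illustrative only (it is not from the paper); the board representation and all function names are our own choices. Storing each column bottom-to-top, with empty cells simply absent, makes both gravity and the left-shift of emptied columns fall out naturally.

```python
def find_blocks(board):
    """Group adjacent same-color tiles (4-connectivity) into blocks.

    `board` is a list of columns, each a list of color ints ordered bottom
    to top; empty cells are absent. Returns only blocks of size >= 2
    (singletons are not selectable), as lists of (col, row) coordinates.
    """
    seen = set()
    blocks = []
    for c, col in enumerate(board):
        for r, color in enumerate(col):
            if (c, r) in seen:
                continue
            # Flood fill from (c, r) over same-colored neighbors.
            stack, block = [(c, r)], []
            seen.add((c, r))
            while stack:
                x, y = stack.pop()
                block.append((x, y))
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                    if (0 <= nx < len(board) and 0 <= ny < len(board[nx])
                            and (nx, ny) not in seen
                            and board[nx][ny] == color):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            if len(block) >= 2:
                blocks.append(block)
    return blocks


def remove_block(board, block):
    """Remove a block; remaining tiles fall and emptied columns shift left."""
    cells = set(block)
    new_board = []
    for c, col in enumerate(board):
        new_col = [t for r, t in enumerate(col) if (c, r) not in cells]
        if new_col:  # dropping emptied columns implements the left shift
            new_board.append(new_col)
    return new_board


def step_reward(block_size):
    """Per-move score for removing a block of `block_size` tiles."""
    return (block_size - 2) ** 2


def terminal_reward(board):
    """+1000 bonus for an empty board; otherwise a per-color penalty."""
    if not any(board):
        return 1000
    counts = {}
    for col in board:
        for t in col:
            counts[t] = counts.get(t, 0) + 1
    return -sum((n - 2) ** 2 for n in counts.values())
```

For example, on a 2x2 board `[[1, 1], [2, 1]]` the three color-1 tiles form the only selectable block; removing it empties the first column, so the lone color-2 tile shifts into a single remaining column, and the game ends with a leftover-tile penalty rather than the clearing bonus.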



¹ Our description refers to the version of SameGame used in this study, which is one of the most common.



Figure 1: The proposed MCTransformer approach

