MCTRANSFORMER: COMBINING TRANSFORMERS AND MONTE-CARLO TREE SEARCH FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Recent studies explored framing reinforcement learning as a sequence modeling problem and using Transformers to generate effective solutions. In this study, we introduce MCTransformer, a framework that combines Monte-Carlo Tree Search (MCTS) with Transformers. Our approach uses an actor-critic setup, where the MCTS component is responsible for navigating previously-explored states, aided by input from the Transformer. The Transformer controls the exploration and evaluation of new states, enabling an effective and efficient evaluation of various strategies. In addition to producing highly effective strategies, our setup enables more efficient sampling compared to existing MCTS-based solutions. MCTransformer is therefore able to learn effectively from a small number of simulations for each newly-explored node, without degrading performance. Evaluation conducted on the challenging and well-known problem of SameGame shows that MCTransformer outscores Transformer-only and MCTS-only solutions by a factor of three or more.

1. INTRODUCTION

Transformers have recently been shown to be very effective in the field of reinforcement learning (RL) Chen et al. (2021); Janner et al. (2021). This was achieved by converting offline RL into a classification problem, which facilitates advanced sequence modeling using Transformers. At evaluation time, the Transformer functions as an autoregressive model, generating sequences of future actions. The main shortcoming of this approach is the absence of exploration during online evaluation; the Transformer model is therefore limited in its ability to adjust to novel circumstances. While online fine-tuning of the model was recently proposed Zheng et al. (2022a), it requires a relatively large number of training samples. Moreover, a one-time fine-tuning approach is not suitable for highly volatile problems that require continuous exploration.

We introduce MCTransformer, an RL framework that enables cost-effective exploration of planning problems. Our approach combines Monte-Carlo Tree Search (MCTS) with the Transformer architecture in an actor-critic setup. The MCTS component of MCTransformer is tasked with balancing the exploration/exploitation trade-off needed in most RL tasks, while the Transformer component is tasked with predicting the utility of previously-unexplored nodes. Additionally, we use the Transformer to carry out the simulation phase of the MCTS (i.e., the rollout policy), where the former's advanced and effective modeling allows us to use a very small number of simulations, thus keeping our approach efficient.

We evaluate the proposed approach on SameGame, a challenging and well-known problem. This game is considered challenging due to the high variance of its initial states: game boards are randomly initialized, which forces any solver to perform a great deal of exploration from the first step.
Another factor that adds complexity to the planning process is the large bonus score that is awarded only when the board is fully cleared. The ability to correctly assess whether the board can be cleared has a significant effect on the planner's behavior. Our evaluation shows that MCTransformer significantly outperforms top-performing methods in a budget-based setting.
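The actor-critic interplay described above can be sketched as a standard UCT loop in which each newly expanded node is scored by a learned value function rather than by long random rollouts. In this minimal sketch, `value_fn` stands in for the Transformer critic and the toy environment, function names, and constants are all illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

class Node:
    """A search-tree node holding visit statistics for UCT."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def ucb_score(self, child, c=1.4):
        # Unvisited children are selected first.
        if child.visits == 0:
            return float("inf")
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore

def mcts(root_state, step_fn, actions_fn, value_fn, n_iters=200, n_rollouts=1):
    """UCT search where expanded leaves are evaluated with only `n_rollouts`
    calls to a learned value estimate (the Transformer's role in the text)."""
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend via UCB until a node with untried actions.
        while node.children and len(node.children) == len(actions_fn(node.state)):
            node = max(node.children.values(), key=lambda ch: node.ucb_score(ch))
        # 2. Expansion: add one untried action, if the state is non-terminal.
        untried = [a for a in actions_fn(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            child = Node(step_fn(node.state, a), parent=node)
            node.children[a] = child
            node = child
        # 3. Evaluation: a few value estimates instead of many long rollouts.
        value = sum(value_fn(node.state) for _ in range(n_rollouts)) / n_rollouts
        # 4. Backpropagation: propagate the estimate up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Act greedily by visit count at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

# Toy problem (illustrative): reach TARGET exactly by stepping +1 or +2.
TARGET = 10
step = lambda s, a: s + a
actions = lambda s: [1, 2] if s < TARGET else []   # terminal at/beyond target
value = lambda s: 1.0 if s == TARGET else -abs(TARGET - s) / TARGET

random.seed(0)
best = mcts(8, step, actions, value, n_iters=200)
```

From state 8, the search should prefer the action that hits the target exactly (stepping by 2), since the critic assigns it the highest value.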
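To make the clearing-bonus issue concrete, consider a common SameGame scoring variant in which removing a group of n same-colored tiles scores (n - 2)^2 points and a fixed bonus is added only when the board is fully cleared. The exact point formula and bonus value below are assumptions for illustration, not the paper's specification.

```python
def samegame_score(group_sizes, board_cleared, clear_bonus=1000):
    """Score an episode under an assumed SameGame scoring variant:
    (n - 2)^2 points per removed group of n tiles, plus a fixed bonus
    only if the board ends fully cleared."""
    score = sum((n - 2) ** 2 for n in group_sizes)
    if board_cleared:
        score += clear_bonus
    return score
```

Under these assumed numbers, removing groups of 5 and 3 tiles yields only 10 points, but 1010 if the final removal also clears the board, which is why a planner's ability to judge clearability early so strongly shapes its behavior.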

