ON THE ROLE OF PLANNING IN MODEL-BASED DEEP REINFORCEMENT LEARNING

ABSTRACT

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero [58], a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

INTRODUCTION

Model-based reinforcement learning (MBRL) has seen much interest in recent years, with advances yielding impressive gains over model-free methods in data efficiency [12, 15, 25, 76], zero- and few-shot learning [16, 37, 60], and strategic thinking [3, 62, 63, 64, 58]. These methods combine planning and learning in a variety of ways, with planning specifically referring to the process of using a learned or given model of the world to construct imagined future trajectories or plans.
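To make this notion of planning concrete, the following is a minimal sketch of Monte-Carlo rollout planning with a given model. It is an illustration under stated assumptions, not MuZero's search procedure: the toy environment `chain_model` and the helpers `rollout_return` and `plan` are hypothetical names introduced here, and the rollout policy is simply uniform random.

```python
import random


def chain_model(state, action):
    """Hypothetical *given* model: a 1-D chain with an absorbing goal at position 5.

    Returns (next_state, reward). Reward 1.0 is paid once, on first reaching the goal.
    """
    if state == 5:  # goal is absorbing and pays nothing further
        return 5, 0.0
    next_state = max(0, min(5, state + action))
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward


def rollout_return(model, state, first_action, actions, depth, discount, rng):
    """Discounted return of one imagined trajectory that starts with first_action."""
    total, scale, action = 0.0, 1.0, first_action
    for _ in range(depth):
        state, reward = model(state, action)
        total += scale * reward
        scale *= discount
        action = rng.choice(actions)  # assumed rollout policy: uniform random
    return total


def plan(model, state, actions, depth=10, num_rollouts=200, discount=0.9, seed=0):
    """Score each candidate action by its average imagined return; pick the best."""
    rng = random.Random(seed)
    values = {}
    for action in actions:
        returns = [
            rollout_return(model, state, action, actions, depth, discount, rng)
            for _ in range(num_rollouts)
        ]
        values[action] = sum(returns) / num_rollouts
    best_action = max(values, key=values.get)
    return best_action, values


best_action, values = plan(chain_model, state=4, actions=(-1, 1))
```

Here the agent never acts in the real environment while choosing: every trajectory is imagined by querying the model. Replacing `chain_model` with a learned function would give the "learned model" variant of the same loop.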
Some have suggested that models will play a key role in generally intelligent artificial agents [14, 50, 55, 56, 57, 67], with such arguments often appealing to model-based aspects of human cognition as proof of their importance [24, 26, 28, 41]. While the recent successes of MBRL methods lend evidence to this hypothesis, there is huge variance in the algorithmic choices made to support such advances. For example, planning can be used to select actions at evaluation time [e.g., 12] and/or for policy learning [e.g., 34]; models can be used within discrete search [e.g., 58] or gradient-based planning [e.g., 25, 29]; and models can be given [e.g., 45] or learned [e.g., 12]. Worryingly, some works even come to contradictory conclusions, such as that long rollouts can hurt performance due to compounding model errors in some settings [e.g., 34], while performance continues to increase with search depth in others [58]. Given the inconsistencies and non-overlapping choices across the literature, it can be hard to get a clear picture of the full MBRL space. This in turn makes it difficult for practitioners to decide which form of MBRL, if any, is best for a given problem.

The aim of this paper is to assess the strengths and weaknesses of recent advances in MBRL to help clarify the state of the field. We systematically study the role of planning and its algorithmic design choices in a recent state-of-the-art MBRL algorithm, MuZero [58]. Beyond its strong performance, MuZero's use of multiple canonical MBRL components (e.g., search-based planning, a learned model, value estimation, and policy optimization) makes it a good candidate for building intuition about the roles of these components and other methods that use them. Moreover, as discussed in the

