ON THE ROLE OF PLANNING IN MODEL-BASED DEEP REINFORCEMENT LEARNING

ABSTRACT

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero [58], a state-of-the-art MBRL algorithm with strong connections to, and overlapping components with, many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

1. INTRODUCTION

Model-based reinforcement learning (MBRL) has seen much interest in recent years, with advances yielding impressive gains over model-free methods in data efficiency [12, 15, 25, 76], zero- and few-shot learning [16, 37, 60], and strategic thinking [3, 62, 63, 64, 58]. These methods combine planning and learning in a variety of ways, with planning specifically referring to the process of using a learned or given model of the world to construct imagined future trajectories or plans. Some have suggested that models will play a key role in generally intelligent artificial agents [14, 50, 55, 56, 57, 67], with such arguments often appealing to model-based aspects of human cognition as proof of their importance [24, 26, 28, 41].

While the recent successes of MBRL methods lend evidence to this hypothesis, there is huge variance in the algorithmic choices made to support such advances. For example, planning can be used to select actions at evaluation time [e.g., 12] and/or for policy learning [e.g., 34]; models can be used within discrete search [e.g., 58] or gradient-based planning [e.g., 25, 29]; and models can be given [e.g., 45] or learned [e.g., 12]. Worryingly, some works even come to contradictory conclusions, such as that long rollouts can hurt performance due to compounding model errors in some settings [e.g., 34], while performance continues to increase with search depth in others [58].

Given the inconsistencies and non-overlapping choices across the literature, it can be hard to get a clear picture of the full MBRL space. This in turn makes it difficult for practitioners to decide which form of MBRL is best for a given problem (if any). The aim of this paper is to assess the strengths and weaknesses of recent advances in MBRL to help clarify the state of the field. We systematically study the role of planning and its algorithmic design choices in a recent state-of-the-art MBRL algorithm, MuZero [58].
Beyond its strong performance, MuZero's use of multiple canonical MBRL components (e.g., search-based planning, a learned model, value estimation, and policy optimization) makes it a good candidate for building intuition about the roles of these components and about other methods that use them. Moreover, as discussed in the next section, MuZero has direct connections with many other MBRL methods, including Dyna [67], MPC [11], and policy iteration [33]. To study the role of planning, we evaluate the overall reward obtained by MuZero across a wide range of standard MBRL environments: the DeepMind Control Suite [70], Atari [8], Sokoban [51], Minipacman [22], and 9x9 Go [42]. Across these environments, we consider three questions. (1) For what purposes is planning most useful? Our results show that planning, which can be used separately for policy improvement, for generating the distribution of experience to learn from, and for acting at test time, is most useful in the learning process for computing learning targets and generating data. (2) What design choices in the search procedure contribute most to the learning process? We show that deep, precise planning is often unnecessary to achieve high reward in many domains, with two-step planning exhibiting surprisingly strong performance even in Go. (3) Does planning assist in generalization across variations of the environment, a common motivation for model-based reasoning? We find that while planning can help make up for small amounts of distribution shift given a good enough model, it is not capable of inducing strong zero-shot generalization on its own.

2. BACKGROUND AND RELATED WORK

Model-based reinforcement learning (MBRL) [9, 26, 47, 49, 74] involves both learning and planning. For our purposes, learning refers to deep learning of a model, policy, and/or value function. Planning refers to using a learned or given model to construct trajectories or plans. In most MBRL agents, learning and planning interact in complex ways, with better learning usually resulting in better planning, and better planning resulting in better learning. Here, we are interested in understanding how differences in planning affect both the learning process and test-time behavior.

MBRL methods can be broadly classified into decision-time planning, which uses the model to select actions, and background planning, which uses the model to update a policy [68]. For example, model-predictive control (MPC) [11] is a classic decision-time planning method that uses the model to optimize a sequence of actions starting from the current environment state (see the first sketch below). Decision-time planning methods often feature robustness to uncertainty and fast adaptation to new scenarios [e.g., 76], though they may be insufficient in settings which require long-term reasoning, such as sparse-reward tasks or strategic games like Go. Conversely, Dyna [67] is a classic background planning method which uses the model to simulate data on which to train a policy via standard model-free methods like Q-learning or policy gradient (see the second sketch below). Background planning methods often feature improved data efficiency over model-free methods [e.g., 34], but exhibit the same drawbacks as model-free approaches, such as brittleness to out-of-distribution experience at test time.

A number of works have adopted hybrid approaches combining both decision-time and background planning. For example, Guo et al. [23] and Mordatch et al. [48] distill the results of a decision-time planner into a policy. Silver et al. [61] and Tesauro & Galperin [71] do the opposite, allowing a policy to guide the behavior of a decision-time planner. Other works do both, incorporating the distillation or imitation step into the learning loop by allowing the distilled policy from the previous iteration to guide planning on the next iteration, as illustrated by the "update" arrow in Figure 1. This results in a form of approximate policy iteration which can be implemented using either single-step [40, 54] or multi-step [17, 18] updates, the latter of which is also referred to as expert iteration [3] or dual policy iteration [66]. Such algorithms have succeeded in board games [3, 4, 63, 64], discrete-action MDPs [27, 51, 58], and continuous control [43, 45, 65].

In this paper, we focus our investigation on MuZero [58], a state-of-the-art member of the approximate policy iteration family. MuZero is a useful testbed for our analysis not just because of its strong performance, but also because it exhibits important connections to many other works in the MBRL literature. For example, MuZero implements a form of approximate policy iteration [18], as illustrated in Figure 1.
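To make the distinction above concrete, the first sketch below shows decision-time planning in the style of random-shooting MPC. This is a minimal illustration under assumptions introduced here, not the method of [11]: in particular, the learned one-step model interface (model.step), the horizon, and the candidate count are hypothetical.

    import numpy as np

    def mpc_action(model, state, action_space, horizon=10, n_candidates=256):
        # Random-shooting MPC: sample candidate action sequences, roll each out
        # in the learned model, and execute the first action of the best one.
        best_return, best_first_action = -np.inf, None
        for _ in range(n_candidates):
            actions = [action_space.sample() for _ in range(horizon)]
            s, total_return = state, 0.0
            for a in actions:
                s, r = model.step(s, a)  # assumed model interface: (s, a) -> (s', r)
                total_return += r
            if total_return > best_return:
                best_return, best_first_action = total_return, actions[0]
        return best_first_action  # replan from scratch at the next real state

Because the plan is recomputed at every environment step, this style of planner adapts quickly to disturbances and model error, which underlies the robustness noted above.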
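The second sketch shows background planning in the style of tabular Dyna-Q [67]: real transitions update the value function and populate a memorized model, which then generates additional simulated updates. The gym-style environment interface and all hyperparameters are illustrative assumptions.

    import random
    from collections import defaultdict

    def dyna_q(env, n_steps=10000, n_planning=10, alpha=0.1, gamma=0.99, eps=0.1):
        # Tabular Dyna-Q: Q-learning on real experience, plus extra updates on
        # transitions replayed from a memorized model (background planning).
        Q, model, seen = defaultdict(float), {}, []
        n_actions = env.action_space.n
        s = env.reset()
        for _ in range(n_steps):
            # epsilon-greedy action selection in the real environment
            if random.random() < eps:
                a = env.action_space.sample()
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s2, r, done, _ = env.step(a)  # assumed gym-style interface
            # model-free Q-learning update on the real transition
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # record the transition in the model
            model[(s, a)] = (s2, r, done)
            seen.append((s, a))
            # background planning: extra updates on simulated transitions
            for _ in range(n_planning):
                ps, pa = random.choice(seen)
                ms, mr, mdone = model[(ps, pa)]
                ptarget = mr + (0.0 if mdone else gamma * max(Q[(ms, a_)] for a_ in range(n_actions)))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = env.reset() if done else s2
        return Q

The simulated updates are what buy the improved data efficiency noted above; the resulting policy is still an ordinary model-free object, hence the shared brittleness to out-of-distribution experience at test time.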



Figure 1: Model-based approximate policy iteration. The agent updates its policy using targets computed via planning and optionally acts via planning during training, at test time, or both.
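As a rough illustration of the loop in Figure 1, the sketch below runs one episode of model-based approximate policy iteration. The search, policy, and update interfaces are hypothetical stand-ins: search might be MCTS over a learned model with the policy as a prior (as in MuZero, whose actual update also includes value and reward losses omitted here).

    import numpy as np

    def policy_iteration_episode(env, model, policy, search, update,
                                 act_with_planner=True):
        # One cycle of Figure 1: plan with the current policy as a prior, act
        # (via the planner or the raw policy), then distill the search output
        # back into the policy so that better policies yield better searches.
        s, done, trajectory = env.reset(), False, []
        while not done:
            pi_search = search(model, policy, s)  # improved action distribution
            probs = pi_search if act_with_planner else policy(s)
            a = np.random.choice(len(probs), p=probs)
            trajectory.append((s, pi_search))     # state paired with its planning target
            s, r, done, _ = env.step(a)           # assumed gym-style interface
        # learning: move the policy toward the planner's output, e.g., by
        # minimizing cross-entropy between policy(s) and pi_search
        update(policy, trajectory)
        return trajectory

Toggling whether the planner is used for acting (here, act_with_planner) and whether it is also used at test time roughly corresponds to the axes we intervene on in our experiments.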

