ON TRADE-OFFS OF IMAGE PREDICTION IN VISUAL MODEL-BASED REINFORCEMENT LEARNING Anonymous

Abstract

Model-based reinforcement learning (MBRL) methods have shown strong sample efficiency and performance across a variety of tasks, including when faced with high-dimensional visual observations. These methods learn to predict the environment dynamics and expected reward from interaction and use this predictive model to plan and perform the task. However, MBRL methods vary in their fundamental design choices, and there is no strong consensus in the literature on how these design decisions affect performance. In this paper, we study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. A big exception to this finding is that predicting future observations (i.e., images) leads to significant task performance improvement compared to only predicting rewards. We also empirically find that image prediction accuracy, somewhat surprisingly, correlates more strongly with downstream task performance than reward prediction accuracy. We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks (that require exploration) will perform the same as the best-performing models when trained on the same training data. Simultaneously, in the absence of exploration, models that fit the data better usually perform better on the downstream task as well, but surprisingly, these are often not the same models that perform the best when learning and exploring from scratch. These findings suggest that performance and exploration place important and potentially contradictory requirements on the model.

1. INTRODUCTION

The key component of any model-based reinforcement learning (MBRL) methods is the predictive model. In visual MBRL, this model predicts the future observations (i.e., images) that will result from taking different actions, enabling the agent to select the actions that will lead to the most desirable outcomes. These features enable MBRL agents to perform successfully with high dataefficiency (Deisenroth & Rasmussen, 2011) in many tasks ranging from healthcare (Steyerberg et al., 2019 ), to robotics (Ebert et al., 2018) , and playing board games (Schrittwieser et al., 2019) . More recently, MBRL methods have been extended to settings with high-dimensional observations (i.e., images), where these methods have demonstrated good performance while requiring substantially less data than model-free methods without explicit representation learning (Watter et al., 2015; Finn & Levine, 2017; Zhang et al., 2018; Hafner et al., 2018; Kaiser et al., 2020) . However, the models used by these methods, also commonly known as World Models (Ha & Schmidhuber, 2018), vary in their fundamental design. For example, some recent works only predict the expected reward (Oh et al., 2017) or other low-dimensional task-relevant signals (Kahn et al., 2018) , while others predict the images as well (Hafner et al., 2019) . Along a different axis, some methods model the dynamics of the environment in the latent space (Hafner et al., 2018) , while some other approaches model autoregressive dynamics in the observation space (Kaiser et al., 2020) . Unfortunately, there is little comparative analysis of how these design decisions affect performance and efficiency, making it difficult to understand the relative importance of the design decisions that have been put forward in prior work. The goal of this paper is to understand the trade-offs between the design choices of model-based agents. One basic question that we ask is: does predicting images actually provide a benefit for MBRL methods? A tempting alternative to predicting observations is to simply predict future rewards, which, in principle, gives a sufficient signal to infer all task-relevant information. However, as we will see, predicting images has clear and quantifiable benefits -in fact,

