ON TRADE-OFFS OF IMAGE PREDICTION IN VISUAL MODEL-BASED REINFORCEMENT LEARNING

Anonymous

Abstract

Model-based reinforcement learning (MBRL) methods have shown strong sample efficiency and performance across a variety of tasks, including when faced with high-dimensional visual observations. These methods learn to predict the environment dynamics and expected reward from interaction, and use this predictive model to plan and perform the task. However, MBRL methods vary in their fundamental design choices, and there is no strong consensus in the literature on how these decisions affect performance. In this paper, we study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. A notable exception is that predicting future observations (i.e., images) leads to significant task performance improvement compared to only predicting rewards. We also find, somewhat surprisingly, that image prediction accuracy correlates more strongly with downstream task performance than reward prediction accuracy. We show how this phenomenon is related to exploration: some of the lower-scoring models on standard benchmarks (that require exploration) perform on par with the best-performing models when trained on the same training data. At the same time, in the absence of exploration, models that fit the data better usually perform better on the downstream task, but, surprisingly, these are often not the same models that perform best when learning and exploring from scratch. These findings suggest that performance and exploration place important and potentially contradictory requirements on the model.

1. INTRODUCTION

The key component of any model-based reinforcement learning (MBRL) method is the predictive model. In visual MBRL, this model predicts the future observations (i.e., images) that will result from taking different actions, enabling the agent to select the actions that will lead to the most desirable outcomes. This capability enables MBRL agents to perform successfully with high data efficiency (Deisenroth & Rasmussen, 2011) in many tasks, ranging from healthcare (Steyerberg et al., 2019), to robotics (Ebert et al., 2018), to playing board games (Schrittwieser et al., 2019). More recently, MBRL methods have been extended to settings with high-dimensional observations (i.e., images), where these methods have demonstrated good performance while requiring substantially less data than model-free methods without explicit representation learning (Watter et al., 2015; Finn & Levine, 2017; Zhang et al., 2018; Hafner et al., 2018; Kaiser et al., 2020). However, the models used by these methods, also commonly known as World Models (Ha & Schmidhuber, 2018), vary in their fundamental design. For example, some recent works only predict the expected reward (Oh et al., 2017) or other low-dimensional task-relevant signals (Kahn et al., 2018), while others predict the images as well (Hafner et al., 2019). Along a different axis, some methods model the dynamics of the environment in a latent space (Hafner et al., 2018), while other approaches model autoregressive dynamics in the observation space (Kaiser et al., 2020). Unfortunately, there is little comparative analysis of how these design decisions affect performance and efficiency, making it difficult to understand the relative importance of the design decisions that have been put forward in prior work. The goal of this paper is to understand the trade-offs between the design choices of model-based agents. One basic question that we ask is: does predicting images actually provide a benefit for MBRL methods?
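To make the planning-based use of a predictive model concrete, the following is a minimal sketch of model-predictive control via random shooting: candidate action sequences are scored under the learned model, and only the first action of the best sequence is executed. The `model.predict_rewards` interface is a hypothetical placeholder for rolling the model forward; practical methods typically refine candidates with CEM rather than uniform sampling.

```python
import numpy as np

def plan_action(model, obs, action_dim, horizon=12, n_candidates=1000, rng=None):
    """Pick the first action of the best candidate action sequence.

    `model.predict_rewards(obs, candidates)` is an assumed interface that
    rolls the learned model forward from `obs` for each candidate sequence
    and returns the total predicted reward per sequence.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    # Score every sequence under the learned model: shape (n_candidates,).
    returns = model.predict_rewards(obs, candidates)
    best = candidates[np.argmax(returns)]
    # Model-predictive control: execute only the first action, then replan.
    return best[0]
```

After executing the returned action in the environment, the agent observes the next image and replans from the new observation, so model errors are corrected at every step.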
A tempting alternative to predicting observations is to simply predict future rewards, which, in principle, provides a sufficient signal to infer all task-relevant information. However, as we will see, predicting images has clear and quantifiable benefits: in fact, we observe that accuracy in predicting observations correlates more strongly with control performance than accuracy in predicting rewards. Our goal is to analyze the design trade-offs in the models themselves, decoupling this as much as possible from confounding differences in the algorithm. While a wide range of different algorithms have been put forward in the literature, we restrict our analysis to arguably the simplest class of MBRL methods, which train a model and then use it for planning without any explicit policy. While this limits the scope of our conclusions, it allows us to draw substantially clearer comparisons. The main contributions of this work are two-fold. First, we provide a coherent conceptual framework for the high-level design decisions involved in creating models. Second, we investigate how each of these choices and their variations affects performance across multiple tasks. We find that:

1. Predicting future observations (i.e., images) leads to significant task performance improvement compared to only predicting rewards. Somewhat surprisingly, image prediction accuracy correlates more strongly with downstream task performance than reward prediction accuracy.

2. This phenomenon is related to exploration: some of the lower-scoring models on standard benchmarks that require exploration perform on par with the best-performing models when trained on the same training data. In the absence of exploration, models that fit the data better usually perform better on the downstream task, but surprisingly, these are often not the same models that perform best when learning and exploring from scratch.

3. A range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.

These findings suggest that performance and exploration place important and potentially contradictory requirements on the model. We will open-source our implementation, which can be used to reproduce the experiments and can be extended to other environments and models.
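The central design axis above, predicting rewards only versus predicting images as well, amounts to a choice of training objective. The following is an illustrative sketch under simplifying assumptions: both variants share a squared-error reward term, and the image-predicting variant adds a pixel reconstruction term. The function name, the simple squared-error form, and the `image_weight` coefficient are illustrative, not the exact losses used in the compared methods.

```python
import numpy as np

def model_loss(pred_reward, true_reward, pred_image=None, true_image=None,
               image_weight=1.0):
    """Training objective for the two model classes being compared.

    Reward-only models minimize only the reward-prediction error; an
    image-predicting model additionally reconstructs future observations.
    """
    # Both variants: squared error on the predicted reward.
    loss = np.mean((np.asarray(pred_reward) - np.asarray(true_reward)) ** 2)
    if pred_image is not None:
        # Image-predicting variant: add a pixel reconstruction term.
        loss += image_weight * np.mean(
            (np.asarray(pred_image) - np.asarray(true_image)) ** 2)
    return loss
```

The extra reconstruction term is what forces the model's representation to capture the full observation, rather than only the features needed to regress the scalar reward.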

2. RELATED WORK

MBRL is commonly used for applications where sample efficiency is essential, such as real physical systems (Deisenroth & Rasmussen, 2011; Deisenroth et al., 2013; Levine et al., 2016) or healthcare (Raghu et al., 2018; Yu et al., 2019; Steyerberg et al., 2019). In this work, our focus is on settings with a high-dimensional observation space (i.e., images). Scaling MBRL to this setting has proven to be challenging (Zhang et al., 2018), but has seen recent successes (Hafner et al., 2019; Watter et al., 2015; Levine et al., 2016; Finn et al., 2016b; Banijamali et al., 2017; Oh et al., 2017; Zhang et al., 2018; Ebert et al., 2018; Hafner et al., 2018; Dasari et al., 2019; Kaiser et al., 2020). Besides these methods, a number of works have studied MBRL methods that do not predict pixels, and instead directly predict future rewards (Oh et al., 2017; Liu et al., 2017; Schrittwieser et al., 2019; Sekar et al., 2020), other reward-based quantities (Gelada et al., 2019), or features that correlate with the reward or task (Dosovitskiy & Koltun, 2016; Kahn et al., 2018). It might appear that predicting rewards is sufficient to perform the task, and a reasonable question is whether image prediction accuracy actually correlates with better task performance. One of our key findings is that predicting images improves the performance of the agent, suggesting a way to improve these methods. Visual MBRL methods must make a number of architectural choices in structuring the predictive model. Some methods investigate how to make the sequential high-dimensional prediction problem easier by transforming pixels (Finn et al., 2016a; De Brabandere et al., 2016; Liu et al., 2017) or decomposing motion and content (Tulyakov et al., 2017; Denton et al., 2017; Hsieh et al., 2018; Wichers et al., 2018; Amiranashvili et al., 2019).
Other methods investigate how to incorporate stochasticity through latent variables (Xue et al., 2016; Babaeizadeh et al., 2018; Denton & Fergus, 2018; Lee et al., 2018; Villegas et al., 2019), autoregressive models (Kalchbrenner et al., 2017; Reed et al., 2017; Weissenborn et al., 2019), flow-based approaches (Kumar et al., 2019), and adversarial methods (Lee et al., 2018). However, whether prediction accuracy actually contributes to MBRL performance has not been verified in detail on image-based tasks. We find a strong correlation between image prediction accuracy and downstream task performance, suggesting that video prediction is likely a fruitful area of research for improving visual MBRL.

