SIMPLIFYING MODEL-BASED RL: LEARNING REPRESENTATIONS, LATENT-SPACE MODELS, AND POLICIES WITH ONE OBJECTIVE

Abstract

While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which rely on assumptions about policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best prior model-based and model-free RL methods. While sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.¹

1. INTRODUCTION

While RL algorithms that learn an internal model of the world can learn more quickly than their model-free counterparts (Hafner et al., 2018; Janner et al., 2019), figuring out exactly what these models should predict has remained an open problem: the real world and even realistic simulators are too complex to model accurately. Although model errors may be rare under the training distribution, an RL agent will often seek out precisely those states where an otherwise accurate model makes mistakes (Jafferjee et al., 2020). Simply training the model with maximum likelihood will not, in general, produce a model that is good for model-based RL (MBRL). The discrepancy between the policy objective and the model objective is called the objective mismatch problem (Lambert et al., 2020), and remains an active area of research. The objective mismatch problem is especially important in settings with high-dimensional observations, which are challenging to predict with high fidelity.



Prior model-based methods have coped with the difficulty of modeling high-dimensional observations by learning the dynamics of a compact representation of observations, rather than the dynamics of the raw observations. Depending on their learning objective, these representations might still be hard to predict or might not contain task-relevant information. Moreover, the accuracy of prediction depends not just on the model's parameters, but also on the states visited by the policy. Hence, another way of reducing prediction errors is to optimize the policy to avoid transitions where the model is inaccurate, while still achieving high returns. In the end, we want to train the model, representations, and policy to be self-consistent: the policy should only visit states where the model is accurate, and the representation should encode information that is task-relevant and predictable. Can we design a model-based RL algorithm that automatically learns compact yet sufficient representations for model-based reasoning?

¹Project website with code: https://alignedlatentmodels.github.io/
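To make the self-consistency idea concrete, the following is a toy numerical sketch, not the paper's actual objective or architecture: the linear encoder, linear latent dynamics, reward head, and the `alpha` trade-off weight are all illustrative assumptions. A single scalar objective rewards latents that predict high return while penalizing disagreement between the latent-space model's prediction and the encoding of the actual next observation:

```python
import numpy as np

# Illustrative toy instantiation (all names and shapes are assumptions,
# not the paper's method): a linear encoder, a linear latent-space model,
# and a linear reward head, scored by one joint objective.
rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 3

W_enc = rng.normal(size=(latent_dim, obs_dim))       # encoder: z = W_enc @ s
W_model = rng.normal(size=(latent_dim, latent_dim))  # latent dynamics: z' ~ W_model @ z
w_reward = rng.normal(size=latent_dim)               # reward head: r(z) = w_reward @ z

def joint_objective(s, s_next, alpha=1.0):
    """One-sample estimate of a joint self-consistency objective:
    predicted reward minus a penalty for disagreement between the
    model's predicted next latent and the encoded next observation."""
    z = W_enc @ s
    z_next_pred = W_model @ z        # model's prediction, made in latent space
    z_next_enc = W_enc @ s_next      # encoder's view of the observed next state
    consistency = np.sum((z_next_pred - z_next_enc) ** 2)
    return w_reward @ z - alpha * consistency

s, s_next = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
value = joint_objective(s, s_next)
```

Maximizing such a quantity jointly over the encoder, model, and policy pushes all three components toward self-consistency: the penalty is small exactly on transitions the model predicts well, so the policy is steered toward states where the model is accurate and the encoder toward features that are predictable.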



Figure 1: (Left) Most model-based RL methods learn the representations, latent-space model, and policy using three different objectives. (Right) We derive a single objective for all three components, which is a lower bound on expected returns. Based on this objective, we develop a practical deep RL algorithm.

