MODEL-BASED REINFORCEMENT LEARNING VIA LATENT-SPACE COLLOCATION

Abstract

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad and general capabilities. However, realistic tasks may require handling sparse rewards and performing temporally extended reasoning, and cannot be solved with only myopic, short-sighted planning. Recent work in model-based reinforcement learning (RL) has shown impressive results using heavily shaped reward functions that require only short-horizon reasoning. In this work, we study how techniques from trajectory optimization can enable more effective long-horizon reasoning. We draw on the idea of collocation-based planning and adapt it to the image-based setting by leveraging probabilistic latent variable models, resulting in an algorithm that optimizes trajectories over latent variables. Our latent collocation method (LatCo) provides a general and effective approach to longer-horizon visual planning. Empirically, our approach significantly outperforms prior model-based approaches on challenging visual control tasks with sparse rewards and long-term goals.

1. INTRODUCTION

In order for autonomous agents to perform complex tasks in open-world settings, they must be able to process high-dimensional sensory inputs, such as images, and reason over long horizons about the potential effects of their actions. Recent work in model-based reinforcement learning (RL) has shown impressive results in autonomous skill acquisition directly from image inputs, demonstrating benefits such as learning efficiency and improved generalization. While these advancements have been largely fueled by improvements on the modeling side, from better uncertainty estimates and incorporation of temporal information (Ebert et al., 2017) to the explicit learning of latent representation spaces (Hafner et al., 2019), they leave much room for improvement on the planning and optimization side. Most of the current best-performing deep learning approaches for vision-based planning use only gradient-free action sampling as the underlying optimizer, and are typically applied to settings where a dense and well-shaped reward signal is available. In this work, we argue that more powerful planners are necessary for longer-horizon reasoning, and we aim to extend the myopic planning behavior of existing visual planning methods. Whether it is to avoid local minima due to short-sightedness, or to reason further into the future in order to solve multi-step or sparse-reward tasks, the ability to perform long-horizon planning is critical.

Figure 1: Collocation-based planning. Each image shows a full plan at that optimization step. Collocation jointly optimizes dynamics satisfaction constraints as well as rewards. This ability to violate dynamics allows for the rapid discovery of high-reward regions (where the object is next to the goal), while the subsequent refinement of the planned trajectory focuses on feasibly achieving it.

Many of the current state-of-the-art visual planning approaches use gradient-free sampling-based optimization methods such as shooting (Ebert et al., 2018; Nagabandi et al., 2020); as shown in Figure 6, these approaches can get stuck when they must reason further into the future. Here, the curse of dimensionality, in conjunction with a lack of shaped reward signal, can prevent greedy planning methods from succeeding. Instead, we look to the gradient-based optimization approach of collocation, in which, crucially, dynamics satisfaction constraints are optimized jointly with the rewards. As shown in Figure 1, this ability to violate dynamics constraints and imagine even impossible trajectories enables collocation to explore much more effectively and take shortcuts through the optimization landscape, learning about high-reward regions before figuring out how to reach them. In contrast to shooting approaches, or even to gradient-based approaches such as backpropagating rewards through model predictions, this ability to violate dynamics greatly helps to prevent getting stuck in local minima while still avoiding the need for specially designed dense or shaped reward signals.

Collocation, as introduced above, can provide many benefits over other optimization techniques, but it has thus far been demonstrated (Ratliff et al., 2009; Mordatch et al., 2012; Schulman et al., 2014) mostly in conjunction with known dynamics models and when performing optimization over states. In this work, we are interested in autonomous skill acquisition directly from image inputs, where both the underlying states and the underlying dynamics are unknown. Naïvely applying collocation to this visual planning setting would lead to intractable optimization, due to the high-dimensional and partially observed nature of images, in addition to the fact that only a thin manifold in the pixel space constitutes valid images.
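To make the soft-constraint idea concrete, the snippet below is a minimal sketch of penalty-based collocation on a toy 1-D point mass (our own illustrative example, not the paper's implementation): both states and actions are free decision variables, the known dynamics s[t+1] = s[t] + a[t] are enforced only through a quadratic penalty on the per-step defects, and plain gradient descent jointly reduces the terminal cost and the dynamics violations, so intermediate iterates may be dynamically infeasible.

```python
import numpy as np

def collocate(goal=5.0, T=10, lam=10.0, steps=4000, lr=0.005):
    """Toy collocation for a 1-D point with dynamics s[t+1] = s[t] + a[t],
    starting from s[0] = 0. States and actions are decision variables; the
    dynamics are enforced only softly via the penalty lam * defect^2, so
    early iterates may "teleport" the final state toward the goal."""
    s = np.zeros(T + 1)  # planned states; s[0] is clamped to the start state
    a = np.zeros(T)      # planned actions
    for _ in range(steps):
        defect = s[1:] - s[:-1] - a          # dynamics violation per step
        grad_s = np.zeros(T + 1)
        grad_s[1:] += 2 * lam * defect       # d(penalty)/d s[t+1]
        grad_s[:-1] -= 2 * lam * defect      # d(penalty)/d s[t]
        grad_s[-1] += 2 * (s[-1] - goal)     # terminal reward, as a cost
        grad_a = -2 * lam * defect           # d(penalty)/d a[t]
        s[1:] -= lr * grad_s[1:]             # s[0] is never updated
        a -= lr * grad_a
    return s, a

s, a = collocate()
# At convergence the plan is (near-)feasible and the final state is near 5.0.
print(s[-1], np.abs(s[1:] - s[:-1] - a).max())
```

In a full method the penalty weight would typically be increased over the course of optimization (or handled with Lagrange multipliers), so that early iterations explore freely while the final plan is dynamically consistent.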
Instead, we draw on the representation learning literature and leverage latent dynamics models, which learn a latent representation of the observations that is not only Markovian but also compact and lends itself well to planning. In this learned latent space, we propose to perform collocation over states and actions with the joint objective of maximizing rewards and minimizing dynamics violations, bridging the control theory literature with modern deep-RL techniques for learning from images in order to enable effective long-horizon planning. The main contribution of this work is latent-space collocation (LatCo), an efficient model-based RL algorithm that solves challenging planning tasks directly from image observations by performing collocation in a learned latent space. To the best of our knowledge, our paper is the first to scale collocation to visual observations, thus enabling longer-horizon reasoning in model-based RL. In our experiments, we analyze various aspects of our algorithm and demonstrate that our approach significantly outperforms prior model-based approaches on visual control tasks that require longer-horizon planning with sparse rewards.
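As a rough sketch of what collocation over latent variables looks like, the snippet below runs the same penalty-based joint optimization over latent states and actions. The linear maps A and B are hypothetical stand-ins for a learned latent dynamics model (an assumption for illustration only; the actual method would use the transition function of a latent variable model trained on images), and the first latent z[0] is clamped to the encoding of the current observation.

```python
import numpy as np

# Stand-ins for a learned latent model: purely for illustration, we assume a
# 2-D latent space whose "learned" dynamics happen to be linear, z' = Az + Ba.
A = np.eye(2)          # latent transition map (assumed, not learned here)
B = 0.5 * np.eye(2)    # action effect in latent space (assumed)

def latco_plan(z0, goal, T=10, lam=10.0, steps=4000, lr=0.005):
    """Collocation in latent space: jointly optimize latent states z[1..T]
    and actions a[0..T-1]; the latent dynamics enter only as a soft penalty."""
    z = np.tile(z0, (T + 1, 1))                   # z[0] = current encoding
    a = np.zeros((T, 2))
    for _ in range(steps):
        defect = z[1:] - z[:-1] @ A.T - a @ B.T   # latent dynamics violation
        gz = np.zeros_like(z)
        gz[1:] += 2 * lam * defect
        gz[:-1] -= 2 * lam * defect @ A
        gz[-1] += 2 * (z[-1] - goal)              # latent-space reward term
        ga = -2 * lam * defect @ B
        z[1:] -= lr * gz[1:]                      # z[0] stays clamped
        a -= lr * ga
    return a

z0, goal = np.zeros(2), np.array([2.0, -1.0])
plan = latco_plan(z0, goal)
z = z0
for a_t in plan:                                  # roll the plan forward
    z = A @ z + B @ a_t
print(np.linalg.norm(z - goal))                   # small: the plan is feasible
```

In practice the reward would be a learned function of the latent state rather than a distance to a known goal latent, and the agent would replan after executing the first few actions, model-predictive-control style.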

2. RELATED WORK

Model-based reinforcement learning. Recent work has scaled model-based reinforcement learning to complex systems by leveraging powerful neural network dynamics models (Chua et al., 2018; Nagabandi et al., 2020), showing significant data efficiency improvements over model-free agents. Further, these neural network models can be scaled to high-dimensional image observations using convolutional architectures (Ebert et al., 2018; Hafner et al., 2019). However, despite these successes in building better predictive models, planning with black-box neural network dynamics remains a challenge. While this prior work used simple trajectory optimization techniques such as derivative-free shooting, we propose to leverage the more powerful collocation methods. Other work explored more complex approaches based on mixed-integer linear programming (Say et al., 2017) or gradient descent with input-convex neural networks (Chen et al., 2019), but it is unclear whether these approaches scale to visual observations.

Latent planning. Other works have considered different optimization methods such as the iterative Linear-Quadratic Regulator (iLQR) (Watter et al., 2015; Zhang et al., 2019). However, these approaches require specialized locally-linear predictive models, and still rely on shooting and local search in the space of actions, which is prone to local minima. Instead, our collocation approach can be used with any latent state model and is able to optimize in the state space, which we show often enables us to escape local minima and plan better trajectories. Another line of work relied on graph-based optimization (Kurutach et al., 2018; Savinov et al., 2018; Eysenbach et al., 2019; Liu et al., 2020), tree search (Schrittwieser et al., 2019; Parascandolo et al., 2020), or other symbolic planners (Asai & Fukunaga, 2018), while we use continuous optimization, which is more suitable for continuous control.
Recent work has designed hierarchical planning methods that plan over extended

