MODEL-BASED REINFORCEMENT LEARNING VIA LATENT-SPACE COLLOCATION

Abstract

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad and general capabilities. However, realistic tasks may require handling sparse rewards and performing temporally extended reasoning, and cannot be solved with only myopic planning. Recent work in model-based reinforcement learning (RL) has shown impressive results using heavily shaped reward functions that require only short-horizon reasoning. In this work, we study how techniques from trajectory optimization can enable more effective long-horizon reasoning. We draw on the idea of collocation-based planning and adapt it to the image-based setting by leveraging probabilistic latent variable models, resulting in an algorithm that optimizes trajectories directly over latent variables. Our latent collocation method (LatCo) provides a general and effective approach to longer-horizon visual planning. Empirically, our approach significantly outperforms prior model-based approaches on challenging visual control tasks with sparse rewards and long-term goals.

1. INTRODUCTION

In order for autonomous agents to perform complex tasks in open-world settings, they must be able to process high-dimensional sensory inputs, such as images, and reason over long horizons about the potential effects of their actions. Recent work in model-based reinforcement learning (RL) has shown impressive results in autonomous skill acquisition directly from image inputs, demonstrating benefits such as learning efficiency and improved generalization. While these advancements have been largely fueled by improvements on the modeling side, from better uncertainty estimates and incorporation of temporal information (Ebert et al., 2017) to the explicit learning of latent representation spaces (Hafner et al., 2019), they leave much room for improvement on the planning and optimization side. Most of the current best-performing deep learning approaches for vision-based planning use only gradient-free action sampling as the underlying optimizer, and are typically applied to settings where a dense and well-shaped reward signal is available. In this work, we argue that more powerful planners are necessary for longer-horizon reasoning, and we aim to extend the myopic planning behavior of existing visual planning methods. Whether it is to avoid local minima due to short-sightedness, or to reason further into the future in order to solve multi-step or sparse-reward tasks, the ability to perform long-horizon planning is critical. Many of the current state-of-the-art visual planning approaches use gradient-free sampling-based optimization methods such as shooting (Ebert et al., 2018; Nagabandi



Figure 1: Collocation-based planning. Each image shows a full plan at that optimization step. Collocation jointly optimizes dynamics satisfaction constraints as well as rewards. This ability to violate dynamics allows for the rapid discovery of high-reward regions (where the object is next to the goal), while the subsequent refinement of the planned trajectory focuses on feasibly achieving it.
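The collocation idea illustrated in Figure 1 can be sketched in a few lines: treat both the states and the actions as free decision variables, and trade off reward against a penalty on dynamics violations, so that early iterates may be infeasible but high-reward. The sketch below is a minimal, hypothetical illustration with an assumed known linear latent dynamics model (`A`, `B`) and a goal-reaching reward; it is not the paper's actual learned model or optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup (all names here are illustrative assumptions, not the paper's model):
# latent dynamics z_{t+1} = A z_t + B a_t, reward = -||z_T - z_goal||^2.
T, dz, da = 10, 2, 2
A, B = np.eye(dz), 0.1 * np.eye(da)
z0, z_goal = np.zeros(dz), np.array([1.0, -1.0])
lam = 100.0  # dynamics-penalty weight (annealed upward in practice; fixed here)

def cost_and_grad(x):
    # Decision variables: latent states z[0..T-1] AND actions a[0..T-1].
    z = x[: T * dz].reshape(T, dz)
    a = x[T * dz :].reshape(T, da)
    z_prev = np.vstack([z0, z[:-1]])
    resid = z - (z_prev @ A.T + a @ B.T)  # per-step dynamics violation
    cost = np.sum((z[-1] - z_goal) ** 2) + lam * np.sum(resid ** 2)
    # Analytic gradients of the penalized objective.
    gz = 2 * lam * resid
    gz[:-1] -= 2 * lam * (resid[1:] @ A)  # z[t] also appears in resid[t+1]
    gz[-1] += 2 * (z[-1] - z_goal)
    ga = -2 * lam * (resid @ B)
    return cost, np.concatenate([gz.ravel(), ga.ravel()])

x0 = np.zeros(T * (dz + da))
res = minimize(cost_and_grad, x0, jac=True, method="L-BFGS-B")
z = res.x[: T * dz].reshape(T, dz)
print("final distance to goal:", np.linalg.norm(z[-1] - z_goal))
```

Because the states are optimized directly, the planner can first place the final state near the goal and then tighten the dynamics residuals, mirroring the two-phase behavior the caption describes; a shooting method, by contrast, can only reach the goal through a feasible action sequence from the start.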

