MODEL-BASED VISUAL PLANNING WITH SELF-SUPERVISED FUNCTIONAL DISTANCES

Abstract

A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model and a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely from offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: https://sites.google.

1. INTRODUCTION

Designing general-purpose robots that can perform a wide range of tasks remains an open problem in AI and robotics. Reinforcement learning (RL) represents a particularly promising tool for learning robotic behaviors when skills can be learned one at a time from user-defined reward functions. However, general-purpose robots will likely require large and diverse repertoires of skills, and learning individual tasks one at a time from manually-specified rewards is onerous and time-consuming. How can we design learning systems that can autonomously acquire general-purpose knowledge that allows them to solve many different downstream tasks? To address this problem, we must resolve three questions. (1) How can the robot be commanded to perform specific downstream tasks? A simple and versatile choice is to define tasks in terms of desired outcomes, such as an example observation of the completed task. (2) What types of data should this robot learn from? In settings where modern machine learning attains the best generalization results (Deng et al., 2009; Rajpurkar et al., 2016; Devlin et al., 2018) , a common theme is that excellent generalization is achieved by learning from large and diverse task-agnostic datasets. In the context of RL, this means we need offline methods that can use all sources of prior data, even in the absence of reward labels. As collecting new experience on a physical robot is often expensive, offline data is often more practical to use in real-world settings (Levine et al., 2020) . (3) What should the robot learn from this data to enable goal-reaching? Similar to prior work (Botvinick & Weinstein, 2014; Watter et al., 2015; Finn & Levine, 2017; Ebert et al., 2018b) , we note that policies and value functions are specific to a particular task, while a predictive model captures the physics of the environment independently of the task, and thus can be used for solving almost any task. 
This makes model learning particularly effective for learning from large and diverse datasets, which do not necessarily contain successful behaviors.

Figure 1: The robot must find actions that quickly achieve the desired goal. State transitions and the true optimal distances between states are unknown, so our method learns an approximate shortest-path distance function and dynamics model directly on images. These models allow the robot to find the shortest path to the goal at test time.

While model-based approaches have demonstrated promising results, including for vision-based tasks on real-world robotic systems (Ebert et al., 2018a; Finn & Levine, 2017), such methods face two major challenges. First, predictive models on raw images are only effective over short horizons, as uncertainty accumulates far into the future (Denton & Fergus, 2018; Finn et al., 2016; Hafner et al., 2019b; Babaeizadeh et al., 2017). Second, using such models for planning toward goals requires a notion of similarity between images. While prior methods have utilized latent variable models (Watter et al., 2015; Nair et al., 2018), pixel-space distance (Nair & Finn, 2020), and other heuristic measures of similarity (Ebert et al., 2018b), these metrics only capture visual similarity. To enable reliable control with predictive models, we instead need distances that are aware of dynamics. In this paper, we propose Model-Based RL with Offline Learned Distances (MBOLD), which aims to address both of these challenges by learning predictive models together with image-based distance functions that reflect functionality, from offline, unlabeled data. The learned distance function estimates the number of steps that the optimal policy would take to transition from one state to another, incorporating not just visual appearance but also an understanding of dynamics.
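Such a shortest-path dynamical distance can be illustrated on a toy problem. The sketch below is our own illustration, not the paper's implementation: it learns tabular distances on a small 1-D chain from offline random transitions with no reward labels, using the Bellman-style update d(s, a, g) = 1 + min over a' of d(s', a', g), with the bootstrap term zeroed when the transition lands on the goal. The chain environment, dataset size, and learning rate are all arbitrary choices for illustration.

```python
import numpy as np

# Toy illustration (not the paper's implementation) of learning
# shortest-path "dynamical distances" with a Q-learning-style Bellman
# update, from offline random transitions and no reward labels.
# Update: d(s, a, g) <- 1 + (0 if s' == g else min_a' d(s', a', g))

N = 8                         # states of a 1-D chain, 0..N-1
ACTIONS = [-1, +1]            # move left / move right

def step(s, a):
    return min(max(s + a, 0), N - 1)

# Offline dataset of random (s, a, s') transitions.
rng = np.random.default_rng(0)
dataset = [
    (s, a_idx, step(s, ACTIONS[a_idx]))
    for s, a_idx in zip(rng.integers(N, size=5000), rng.integers(2, size=5000))
]

# Tabular distance d[s, a, g], fit by approximate dynamic programming
# rather than supervised regression on (possibly suboptimal) path lengths.
d = np.zeros((N, 2, N))
for _ in range(50):
    for s, a_idx, s_next in dataset:
        for g in range(N):
            target = 1.0 + (0.0 if s_next == g else d[s_next, :, g].min())
            d[s, a_idx, g] += 0.5 * (target - d[s, a_idx, g])

dist = d.min(axis=1)          # d(s, g) = min_a d(s, a, g)
print(dist[0, 5], dist[2, 7]) # both approach the true shortest path, 5 steps
```

Even though the random data contains no optimal trajectories, the dynamic-programming backup recovers the shortest-path distances, which is the motivation given above for preferring it over supervised regression on observed path lengths.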
However, when learning dynamical distances from task-agnostic data, supervised regression leads to overestimation, since the paths in the data are not optimal for any particular task. Instead, we utilize approximate dynamic programming for distance estimation. While prior work has studied such methods to learn goal-conditioned policies in online model-free RL settings (Eysenbach et al., 2019; Florensa et al., 2019), we extend them to the offline setting and show that approximate dynamic programming techniques derived from Q-learning-style Bellman updates can learn effective shortest-path dynamical distances. Although this procedure resembles model-free reinforcement learning, we find empirically that it does not by itself produce useful policies. Instead, our method (Fig. 1) combines the strengths of dynamics models and distance functions, using the predictive model to plan over short horizons, and using the learned distances to provide a global cost that captures progress toward distant goals. The primary contribution of this work is an offline, self-supervised approach for solving arbitrary goal-reaching tasks by combining planning with predictive models and learned dynamical distances. To our knowledge, our method is the first to directly combine predictive models on images with dynamical distance estimators on images, trained entirely from random, offline data without reward labels. Through our experimental evaluation on challenging robotic object manipulation tasks, including simulated object relocation and real-world drawer manipulation, we find that our method outperforms previously introduced reward specification methods for visual model-based control with a relative performance improvement of at least 50% across all tasks, and compares favorably to prior work in model-based and model-free RL. We also find that combining Q-functions with planning improves dramatically over policies directly learned with model-free RL.
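The combination described above, short-horizon planning under the model with the learned distance as the cost, can be pictured as a minimal model-predictive control loop. The sketch below is our own toy illustration, not the paper's code: `model` and `distance` are hand-coded stand-ins for the learned video-prediction model and the learned dynamical distance network, and the 1-D world, horizon, and action set are arbitrary.

```python
import itertools

# Minimal sketch of the planning loop (our illustration, not the paper's
# code): roll candidate action sequences through a (learned) dynamics
# model, score each rollout with a (learned) distance to the goal, and
# execute only the first action of the best sequence, replanning each step.

def model(state, action):                  # stand-in learned dynamics
    return min(max(state + action, 0), 9)  # 1-D world, states 0..9

def distance(state, goal):                 # stand-in learned distance
    return abs(state - goal)

def plan_action(state, goal, horizon=3):
    def cost(seq):
        s, c = state, 0.0
        for a in seq:                      # short-horizon model rollout
            s = model(s, a)
            c += distance(s, goal)         # distance supplies the global cost
        return c
    candidates = itertools.product([-1, 0, 1], repeat=horizon)
    return min(candidates, key=cost)[0]    # MPC: keep only the first action

state, goal = 0, 7
trajectory = [state]
for _ in range(10):
    state = model(state, plan_action(state, goal))
    trajectory.append(state)
print(trajectory)                          # the plan walks the state to the goal
```

The model alone only looks a few steps ahead (horizon 3 here), yet the planner still reaches a goal 7 steps away, because the distance function scores intermediate states by their remaining distance and so provides gradient toward goals beyond the planning horizon.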

2. RELATED WORK

Offline and Model-based RL: A number of prior works have studied the problem of learning behaviors from existing offline datasets. While recent progress has been made in applying model-free RL techniques to this problem of offline or batch RL (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2019; 2020; Nair et al., 2020b), one approach that has shown promise is offline model-based RL (Lowrey et al., 2019; Kidambi et al., 2020; Yu et al., 2020; Argenson & Dulac-Arnold, 2020), where the agent learns a predictive model of the world from data. Such model-based methods have seen success in both the offline and online RL settings, and have a rich history of being effective for planning (Deisenroth & Rasmussen, 2011; Watter et al., 2015; McAllister & Rasmussen, 2016; Chua et al., 2018; Amos et al., 2018; Hafner et al., 2019b; Nagabandi et al., 2018; Kahn et al., 2020; Dong et al., 2020) or policy optimization (Sutton, 1991; Weber et al., 2017; Ha & Schmidhuber, 2018; Janner et al., 2019; Wang & Ba, 2019; Hafner et al., 2019a). However, the vast majority of

