DISCRETE PREDICTIVE REPRESENTATION FOR LONG-HORIZON PLANNING

Abstract

Discrete representations have for decades been key in enabling robots to plan at more abstract levels and solve temporally extended tasks efficiently. However, they typically require expert specification. Deep reinforcement learning, by contrast, aims to solve tasks end-to-end, but struggles with long-horizon tasks. In this work, we propose Discrete Object-factorized Representations for Planning (DORP), which learns temporally abstracted discrete representations from exploratory video data in an unsupervised fashion via a mutual-information-maximization objective. DORP plans a sequence of abstract states for a low-level model-predictive controller to follow. In our experiments, we show that DORP robustly solves unseen long-horizon tasks. Interestingly, it discovers independent representations per object and binary properties such as a key-and-door relationship.

1. INTRODUCTION

In the future, we hope that robots will operate in unstructured environments such as homes and hospitals, endowed with long-horizon planning ability. Despite successes in deep reinforcement learning (RL) from raw observations, much of this progress relies on the availability of shaped rewards to guide learning (Ng et al., 1999; Mirza et al., 2020). Over the past decades, on the other hand, task and motion planning has been shown to solve much longer-horizon goal-directed tasks, such as making a cup of coffee from torque control (Kaelbling & Lozano-Pérez, 2011; Srivastava et al., 2014; Toussaint, 2015; Wang et al., 2018). However, these methods often require pre-specified discrete abstract states, task representations, and transition models, e.g., whether the robot is holding a cup and which actions (or perturbations) change such an abstract state. In this paper, we aim to learn discrete representations for high-level abstract planning from video interaction data, combined with a learned short-horizon controller.

Learning discrete representations from unsupervised data for planning is challenging for two reasons. First, the relationship between the optimization objective and the true task objective is not well defined. Second, optimizing a model with a discrete layer is difficult with standard deep learning techniques. Recent methods approach the first problem using reconstruction or contrastive objectives (Watter et al., 2015; Anand et al., 2019; Ha & Schmidhuber, 2018; Kurutach et al., 2018; Hafner et al., 2019; Srinivas et al., 2020); however, the learned representations live in a continuous latent space that is unstructured and difficult to combine with high-level abstract planning. Other methods show promise in learning discrete representations, but they have not been applied to temporally extended RL tasks (Oord et al., 2018; Razavi et al., 2019; Risi & Stanley, 2019; Asai & Fukunaga, 2017; Stratos & Wiseman, 2020).
In this work, we propose Discrete Object-factorized Representations for Planning (DORP), a novel framework for visual planning and control that learns discrete representations together with a low-level controller. DORP learns discrete representations of image features that change slowly over time, such as whether the agent holds a key or which room the agent is in, along with a low-level predictive model for control. These slow features enable the agent to plan at a low frequency in longer-horizon tasks. More specifically, DORP represents an abstract state as a set of one-hot vectors and optimizes its encoder by maximizing a lower bound on the mutual information between the current representation and future observations (Oord et al., 2018). To train through the discrete layer, we apply the Gumbel-Softmax reparametrization trick (Jang et al., 2016; Maddison et al., 2016). Using abstract states as nodes, we build an approximate feasibility graph from observed transition data. When provided with new start and goal images, the agent plans the shortest abstract path. Using the next abstraction as a waypoint, model-predictive control, together with a trained video prediction model, maximizes an objective that is 1 if the target abstraction is reached and 0 otherwise.

Figure 1: Learning discrete representations for planning. We approach long-horizon planning over high-dimensional image observations by learning a discrete representation and planning in the discrete space. (a) We learn multiple one-hot encodings for each observation, fully unsupervised, with contrastive learning. Each one-hot encoding corresponds to a temporally abstract state of one freely moving entity in the observation, such as the room the agent is in or whether a key has been picked up. (b) With the discrete encoding, we build a graph that connects the current encoding to the goal encoding and run a graph search on it to plan efficiently for long-horizon tasks.
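The high-level planning step (a feasibility graph over abstract states, followed by a shortest-path search) can be sketched as follows. All function names and the toy key-and-room example are illustrative assumptions, not the authors' code:

```python
# Sketch of a DORP-style high-level planner (illustrative, not the paper's
# implementation): abstract states are tuples of one-hot indices, edges are
# transitions observed in exploratory data, and planning is a BFS shortest path.
from collections import defaultdict, deque

def build_feasibility_graph(abstract_trajectories):
    """Connect consecutive distinct abstract states seen in the data."""
    graph = defaultdict(set)
    for traj in abstract_trajectories:
        for s, s_next in zip(traj, traj[1:]):
            if s != s_next:
                graph[s].add(s_next)
    return graph

def plan_abstract_path(graph, start, goal):
    """BFS over abstract states; returns the shortest waypoint sequence."""
    frontier = deque([(start, [start])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for nxt in graph[state]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # goal unreachable under observed transitions

# Toy example: abstract state = (room, has_key); the data only ever shows the
# agent entering room 1 after picking up the key.
trajs = [[(0, 0), (0, 1), (1, 1)], [(1, 1), (0, 1)]]
g = build_feasibility_graph(trajs)
print(plan_abstract_path(g, (0, 0), (1, 1)))  # [(0, 0), (0, 1), (1, 1)]
```

Each waypoint on the returned path would then be handed to the low-level model-predictive controller as a target abstraction.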
Unlike other subgoal-planning work (Savinov et al., 2018; Nasiriany et al., 2019; Laskin et al., 2020; Liu et al., 2020), DORP follows abstract waypoints and thus avoids unnecessary steps spent matching exact waypoint states. In a set of experiments, we demonstrate that DORP learns temporally consistent and object-factorized representations suitable for planning. We show that these representations enable DORP to handle unseen long-horizon tasks more successfully than state-of-the-art visual planning methods. Interestingly, we observe that the latent representations exhibit object-level factorization, such as a key-and-door pair.

2. PRELIMINARIES

We present background material on unsupervised representation learning and discrete optimization that our method builds on, followed by our problem formulation.

Contrastive Predictive Coding (CPC) (Oord et al., 2018) learns low-dimensional representations that are most predictive of future high-dimensional sequential data. A non-linear encoder $q_\theta : O \to \mathbb{R}^l$, parametrized by $\theta$, encodes the observation $o_t \in O$ to a latent $l$-dimensional vector representation $z_t$. Define a similarity score $f_k(z_t, o_{t+k}) = \exp(z_t^\top \psi\, q_\theta(o_{t+k}))$, where $\psi$ is a trainable $l \times l$ similarity matrix and $o_{t+k}$ is a future observation $k$ steps ahead of $o_t$. Given the query observation $o_t$, we aim to classify the key observation $o_{t+k}$ as positive and other samples $\tilde{o}$ from the dataset as negative. Formally, we optimize the loss $L_{CPC} = -\mathbb{E}_{o_t, o_{t+k}}\big[\log f_k(z_t, o_{t+k}) - \log \sum_{o_j \in X} f_k(z_t, o_j)\big]$ with respect to $\theta$ and $\psi$, where $X$ contains the positive $o_{t+k}$ and the negative samples. This also corresponds to maximizing a lower bound on the mutual information between the latent representation $z_t$ and the future observation $o_{t+k}$.

Gumbel-Softmax (GS) (Jang et al., 2016) is a discrete optimization technique for computing gradients through samples from a categorical distribution $\pi$. GS first applies a reparametrization trick to rewrite a sample as $z = \mathrm{one\_hot}(\arg\max_j (g_j + \log \pi_j))$, where the $g_j$ are i.i.d. samples from Gumbel(0, 1) (Gumbel, 1935). GS then approximates the arg max with a softmax, making the discrete stochastic layer differentiable, i.e., $z_i = \exp\big((g_i + \log \pi_i)/\tau\big) / \sum_j \exp\big((g_j + \log \pi_j)/\tau\big)$. As the temperature parameter $\tau$ approaches 0, the samples converge to the true categorical distribution. Empirically, $\tau$ starts high and is annealed toward 0 according to a schedule.
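A minimal NumPy sketch of the two ingredients above, written as plain functions rather than a trained model; the toy inputs and the variable names are assumptions for illustration only:

```python
# Illustrative sketch (not the paper's implementation) of a Gumbel-Softmax
# sample and the CPC / InfoNCE classification loss from the preliminaries.
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable relaxation of a one-hot sample from softmax(logits)."""
    g = rng.gumbel(size=logits.shape)   # i.i.d. Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())             # numerically stable softmax
    return y / y.sum()

def cpc_loss(z_t, future_feats, psi, positive_idx):
    """-log of the softmax similarity of the true future among all candidates."""
    scores = future_feats @ psi @ z_t   # log f_k(z_t, o_j) for each candidate j
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[positive_idx]

# With tau near 0, the relaxed sample approaches a hard one-hot vector.
sample = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])), tau=0.01)
print(sample.round(2))

# The loss is small when the positive future scores higher than the negatives.
z = np.array([1.0, 0.0])
psi = np.eye(2)
futures = np.array([[1.0, 0.0], [-1.0, 0.0]])  # row 0 is the true future
loss = cpc_loss(z, futures, psi, positive_idx=0)
print(float(loss))
```

In DORP the two pieces are composed: the encoder's output passes through the Gumbel-Softmax layer to produce one-hot codes, and the CPC loss is applied on top of those codes.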

2.1. PROBLEM STATEMENT: GOAL-DIRECTED VISUAL PLANNING AND CONTROL

We define an unknown, fully observable, stochastic dynamical system f that maps an input observation o_t ∈ O and action a_t ∈ A to the next observation o_{t+1}. Under this dynamical system, we assume a simple exploration policy π_rand that can collect data characterizing the dynamics of the system; such data is known as self-supervised data or play data (Agrawal et al., 2016; Lynch et al., 2020). We consider high-dimensional observations such as images.
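Collecting play data with π_rand amounts to rolling out a uniform-random policy and storing transition tuples; a minimal sketch under an assumed toy environment interface (the `ToyEnv` class and all names here are hypothetical):

```python
# Hypothetical sketch of self-supervised "play data" collection with a
# uniform-random exploration policy pi_rand.
import random

def collect_play_data(env, num_episodes, horizon, action_space):
    """Roll out a random policy and store (o_t, a_t, o_{t+1}) tuples."""
    data = []
    for _ in range(num_episodes):
        obs = env.reset()
        for _ in range(horizon):
            action = random.choice(action_space)
            next_obs = env.step(action)
            data.append((obs, action, next_obs))
            obs = next_obs
    return data

class ToyEnv:
    """1-D chain world standing in for the unknown dynamics f."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        return self.state

data = collect_play_data(ToyEnv(), num_episodes=2, horizon=5,
                         action_space=[-1, 1])
print(len(data))  # 10 transitions
```

In the visual setting of the paper, the observations would be images rather than integers, but the collection loop is the same.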

