ENVIRONMENT PREDICTIVE CODING FOR EMBODIED AGENTS

Abstract

We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.

1. INTRODUCTION

In visual navigation tasks, an intelligent embodied agent must move around a 3D environment using its stream of egocentric observations to sense objects and obstacles, typically without the benefit of a pre-computed map. Significant recent progress on this problem can be attributed to the availability of large-scale, visually rich 3D datasets (Chang et al., 2017; Xia et al., 2018; Straub et al., 2019), developments in high-quality 3D simulators (Anderson et al., 2018b; Kolve et al., 2017; Savva et al., 2019a; Xia et al., 2020), and research on deep memory-based architectures that combine geometry and semantics for learning representations of the 3D world (Gupta et al., 2017; Henriques & Vedaldi, 2018; Chen et al., 2019; Fang et al., 2019; Chaplot et al., 2020b;c).

Deep reinforcement learning approaches to visual navigation often suffer from sample inefficiency, overfitting, and instability in training. Recent contributions work toward overcoming these limitations for various navigation and planning tasks. The key ingredients are learning good image-level representations (Das et al., 2018; Gordon et al., 2019; Lin et al., 2019; Sax et al., 2020) and using modular architectures that combine high-level reasoning, planning, and low-level navigation (Gupta et al., 2017; Chaplot et al., 2020b; Gan et al., 2019; Ramakrishnan et al., 2020a). Prior work uses supervised image annotations (Mirowski et al., 2016; Das et al., 2018; Sax et al., 2020) and self-supervision (Gordon et al., 2019; Lin et al., 2019) to learn image representations that are transferable and improve sample efficiency for embodied tasks. While promising, such learned image representations encode the scene only in the agent's immediate locality. However, embodied agents also need higher-level semantic and geometric representations of their history of observations, grounded in 3D space, in order to reason about the larger environment around them.
Therefore, a key question remains: how should an agent moving through a visually rich 3D environment encode its series of egocentric observations? Prior navigation methods build environment-level representations of observation sequences via memory models such as recurrent neural networks (Wijmans et al., 2020), maps (Henriques & Vedaldi, 2018; Chen et al., 2019; Chaplot et al., 2020b), episodic memory (Fang et al., 2019), and topological graphs (Savinov et al., 2018; Chaplot et al., 2020c). However, these approaches typically use hand-coded representations such as occupancy maps (Chen et al., 2019; Chaplot et al., 2020b; Ramakrishnan et al., 2020a; Karkus et al., 2019; Gan et al., 2019) and semantic labels (Narasimhan et al., 2020; Chaplot et al., 2020a), or specialize them by learning end-to-end for a specific task (Wijmans et al., 2020; Henriques & Vedaldi, 2018; Parisotto & Salakhutdinov, 2018; Cheng et al., 2018; Fang et al., 2019).

Figure 1: Environment Predictive Coding. During self-supervised learning, our model is given video walkthroughs of various 3D environments. We mask portions out of the trajectory (dotted lines) and learn to infer them from the unmasked parts (in red). We specifically mask out all overlapping views in a local neighborhood to limit the content shared with the unmasked views. The resulting EPC encoder builds environment-level representations of the seen content that are predictive of the unseen content (marked with a "?"), conditioned on the camera poses. The agent then uses this learned encoder in multiple navigational tasks in novel environments.

In this work, we introduce environment predictive coding (EPC), a self-supervised approach to learn flexible representations of 3D environments that are transferable to a variety of navigation-oriented tasks.
The key idea is to learn to encode a series of egocentric observations in a 3D environment so as to be predictive of visual content that the agent has not yet observed. For example, consider an agent that has just entered the living room of an unfamiliar house and is searching for a refrigerator. It must be able to predict where the kitchen is and reason that it is likely to contain a refrigerator. The proposed EPC model aims to learn representations that capture these natural statistics of real-world environments in a self-supervised fashion, by watching videos recorded by other agents. See Fig. 1.

To this end, we devise a self-supervised zone prediction task in which the model learns environment embeddings from pre-collected egocentric videos of other agents navigating in 3D environments. Specifically, we segment each video into zones of visually and geometrically connected views, while ensuring limited overlap across zones in the same video. Then, we randomly mask out zones and predict the masked views conditioned on both the unmasked zones' views and the masked zones' camera poses. Intuitively, to perform this task successfully, the model must reason about the geometry and semantics of the environment to figure out what is missing. We devise a transformer-based model to infer the masked visual features. Our general strategy can be viewed as a context prediction task in sequential data (Devlin et al., 2018; Sun et al., 2019b; Han et al., 2019), but, very differently, it is aimed at representing high-level semantic and geometric priors in 3D environments to aid embodied agents who act in them.

Through extensive experiments on Gibson and Matterport3D, we show that our method achieves good improvements on multiple navigation-oriented tasks compared to state-of-the-art models and baselines that learn image-level embeddings.
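The zone prediction setup described above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: zone segmentation is reduced to fixed-length chunks of the trajectory (in place of the visibility-based grouping), the transformer is replaced by a single pose-conditioned scaled dot-product attention step with random (untrained) projections, and the helper names (`segment_into_zones`, `predict_masked_zone`) and dimensions are hypothetical.

```python
import numpy as np

def segment_into_zones(features, zone_len):
    """Split a trajectory of per-view features into contiguous 'zones'.
    (Hypothetical stand-in for the paper's visibility-based segmentation.)"""
    return [features[i:i + zone_len] for i in range(0, len(features), zone_len)]

def predict_masked_zone(unmasked_feats, unmasked_poses, masked_poses, d_k=16):
    """Predict features for a masked zone from the unmasked views,
    conditioned on camera poses, via one scaled dot-product attention step.
    Queries come from the masked zone's poses; keys from the unmasked poses;
    values are the unmasked visual features."""
    rng = np.random.default_rng(0)
    # Stand-ins for learned projections (random here, for illustration only).
    W_q = rng.standard_normal((unmasked_poses.shape[1], d_k))
    W_k = rng.standard_normal((unmasked_poses.shape[1], d_k))
    q = masked_poses @ W_q                       # (M, d_k)
    k = unmasked_poses @ W_k                     # (N, d_k)
    scores = q @ k.T / np.sqrt(d_k)              # (M, N)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over unmasked views
    return attn @ unmasked_feats                 # (M, feat_dim)

# Toy trajectory: 12 views, 32-dim visual features, 3-dof poses (x, y, yaw).
rng = np.random.default_rng(1)
feats = rng.standard_normal((12, 32))
poses = rng.standard_normal((12, 3))

zones = segment_into_zones(feats, zone_len=4)    # three zones of 4 views each
masked = slice(4, 8)                             # mask out the middle zone
keep = np.r_[0:4, 8:12]                          # indices of unmasked views
pred = predict_masked_zone(feats[keep], poses[keep], poses[masked])
print(pred.shape)  # one predicted feature vector per masked view
```

In training, the predicted features would be scored against the true masked-zone features (e.g., with a contrastive loss), which is why conditioning the queries on the masked zone's camera poses matters: the pose tells the model *where* the missing content is before it predicts *what* it looks like.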

2. RELATED WORK

Self-supervised visual representation learning: Prior work leverages self-supervision to learn image and video representations from large collections of unlabelled data. Image representation methods use proxy tasks such as inpainting (Pathak et al., 2016) and instance discrimination (Oord et al., 2018; Chen et al., 2020; He et al., 2020), while video representation learning leverages signals such as temporal consistency (Wei et al., 2018; Fernando et al., 2017; Kim et al., 2019) and contrastive predictions (Han et al., 2019; Sun et al., 2019a). VideoBERT (Sun et al., 2019a;b) jointly learns video and text representations from unannotated videos by filling in masked-out information. Dense Predictive Coding (Han et al., 2019; 2020) learns video representations that capture the slow-moving semantics in videos. Whereas these methods focus on capturing human activity for video recognition, we aim to learn geometric and semantic cues in 3D spaces for embodied agents. Accordingly, unlike the existing video models (Sun et al., 2019a;b; Han et al., 2019), our approach is grounded in the 3D relationships between views.

Representation learning via auxiliary tasks for RL: Reinforcement learning approaches often suffer from high sample complexity, sparse rewards, and unstable training. Prior work tackles these

