ENVIRONMENT PREDICTIVE CODING FOR EMBODIED AGENTS

Abstract

We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.

1. INTRODUCTION

In visual navigation tasks, an intelligent embodied agent must move around a 3D environment using its stream of egocentric observations to sense objects and obstacles, typically without the benefit of a pre-computed map. Significant recent progress on this problem can be attributed to the availability of large-scale visually rich 3D datasets (Chang et al., 2017; Xia et al., 2018; Straub et al., 2019), developments in high-quality 3D simulators (Anderson et al., 2018b; Kolve et al., 2017; Savva et al., 2019a; Xia et al., 2020), and research on deep memory-based architectures that combine geometry and semantics for learning representations of the 3D world (Gupta et al., 2017; Henriques & Vedaldi, 2018; Chen et al., 2019; Fang et al., 2019; Chaplot et al., 2020b;c).

Deep reinforcement learning approaches to visual navigation often suffer from sample inefficiency, overfitting, and instability in training. Recent contributions work towards overcoming these limitations for various navigation and planning tasks. The key ingredients are learning good image-level representations (Das et al., 2018; Gordon et al., 2019; Lin et al., 2019; Sax et al., 2020), and using modular architectures that combine high-level reasoning, planning, and low-level navigation (Gupta et al., 2017; Chaplot et al., 2020b; Gan et al., 2019; Ramakrishnan et al., 2020a). Prior work uses supervised image annotations (Mirowski et al., 2016; Das et al., 2018; Sax et al., 2020) and self-supervision (Gordon et al., 2019; Lin et al., 2019) to learn good image representations that are transferable and improve sample efficiency for embodied tasks. While promising, such learned image representations only encode the scene in the nearby locality. However, embodied agents also need higher-level semantic and geometric representations of their history of observations, grounded in 3D space, in order to reason about the larger environment around them.
Therefore, a key question remains: how should an agent moving through a visually rich 3D environment encode its series of egocentric observations? Prior navigation methods build environment-level representations of observation sequences via memory models such as recurrent neural networks (Wijmans et al., 2020), maps (Henriques & Vedaldi, 2018; Chen et al., 2019; Chaplot et al., 2020b), episodic memory (Fang et al., 2019), and topological graphs (Savinov et al., 2018; Chaplot et al., 2020c). However, these approaches typically use hand-coded representations such as occupancy maps (Chen et al., 2019; Chaplot et al., 2020b; Ramakrishnan et al., 2020a; Karkus et al., 2019; Gan et al., 2019) and semantic labels (Narasimhan et al., 2020; Chaplot et al., 2020a), or specialize them by learning end-to-end for solving a specific task (Wijmans et al., 2020; Henriques & Vedaldi, 2018; Parisotto & Salakhutdinov, 2018; Cheng et al., 2018; Fang et al., 2019).
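At its core, the zone prediction setup can be viewed as a data-preparation step over an agent's trajectory: mask out a contiguous span ("zone") of frames, keep the unmasked (feature, pose) pairs as context, and ask a model to predict the masked observation features given only their camera poses. The following NumPy sketch illustrates this split; the function and variable names are ours for illustration and do not come from the paper, and the model that consumes these tensors is omitted.

```python
import numpy as np

def make_zone_prediction_example(features, poses, zone_mask):
    """Split one trajectory into unmasked context and a masked "zone" target.

    features:  (T, D) per-frame egocentric observation features
    poses:     (T, P) per-frame camera poses
    zone_mask: (T,) boolean array, True for frames the model must predict
    """
    zone_mask = np.asarray(zone_mask, dtype=bool)
    context_feats = features[~zone_mask]   # visible observations (model input)
    context_poses = poses[~zone_mask]      # poses of visible observations
    query_poses = poses[zone_mask]         # poses of masked frames (also given to the model)
    target_feats = features[zone_mask]     # features the model is trained to predict
    return (context_feats, context_poses), query_poses, target_feats

# Example: an 8-step trajectory with a 3-frame zone masked out.
rng = np.random.default_rng(0)
T, D, P = 8, 16, 3
features = rng.normal(size=(T, D))
poses = rng.normal(size=(T, P))
zone_mask = np.zeros(T, dtype=bool)
zone_mask[3:6] = True  # mask a contiguous zone of the trajectory

(ctx_f, ctx_p), query_p, target_f = make_zone_prediction_example(features, poses, zone_mask)
```

A self-supervised loss would then compare the model's pose-conditioned predictions against `target_f`, so no manual annotation of the trajectory is needed.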

