MULTI-VIEW MASKED AUTOENCODERS FOR VISUAL CONTROL

Abstract

This paper investigates how to leverage data from multiple cameras to learn representations beneficial for visual control. To this end, we present the Multi-View Masked Autoencoder (MV-MAE), a simple and scalable framework for multi-view representation learning. Our main idea is to mask multiple viewpoints from video frames at random and train a video autoencoder to reconstruct pixels of both masked and unmasked viewpoints. This allows the model to learn representations that capture not only useful information about the current viewpoint but also cross-view information from different viewpoints. We evaluate MV-MAE on challenging RLBench visual manipulation tasks by training a reinforcement learning agent on top of frozen representations. Our experiments demonstrate that MV-MAE significantly outperforms other multi-view representation learning approaches. Moreover, we show that the number of cameras can differ between the representation learning phase and the behavior learning phase: by training a single-view control agent on top of multi-view representations from MV-MAE, we achieve a 62.3% success rate, while the single-view representation learning baseline achieves 42.3%.
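In simplified form, the viewpoint-masking objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mask_viewpoints`, the array shapes, and the masking granularity (whole views at a single time step) are assumptions made for clarity.

```python
import numpy as np

def mask_viewpoints(frames, num_masked, rng):
    """Randomly mask whole camera viewpoints (hypothetical helper).

    frames: array of shape (V, H, W, C) -- one time step observed from
    V cameras. The encoder only sees the unmasked views, while the
    decoder is trained to reconstruct pixels of ALL views.
    """
    num_views = frames.shape[0]
    masked = np.zeros(num_views, dtype=bool)
    # Choose which viewpoints to drop, without replacement.
    masked[rng.choice(num_views, size=num_masked, replace=False)] = True
    visible = frames[~masked]   # encoder input: unmasked viewpoints only
    targets = frames            # reconstruction targets: masked AND unmasked views
    return visible, targets, masked

# Example: 4 cameras, mask 2 viewpoints at random.
rng = np.random.default_rng(0)
frames = np.random.rand(4, 64, 64, 3)
visible, targets, masked = mask_viewpoints(frames, num_masked=2, rng=rng)
```

A pixel-level reconstruction loss (e.g. mean squared error) over `targets` would then supervise the model on both visible and masked viewpoints, encouraging each view's representation to carry cross-view information.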

1. INTRODUCTION

Recent self-supervised learning approaches have been successful at learning useful representations from multiple views of the data, including different channels (Zhang et al., 2017) or patches (Oord et al., 2018) of an image, vision-sound modalities (Owens et al., 2016), vision-language modalities (Radford et al., 2021; Alayrac et al., 2022), and frames of a video (Wang & Gupta, 2015). The main underlying idea of these approaches is to utilize information about the same data from different perspectives as supervision for representation learning. Notably, Zhang et al. (2017) trained an autoencoder that predicts a subset of the image channels from another subset, and Radford et al. (2021) trained a vision-language model that matches image-text pairs with contrastive learning. Promising results from these approaches suggest that data diversity can play a key role in representation learning. In the context of visual control, the camera is an easily accessible instrument that can increase data diversity by providing information about the same scene from different viewpoints. For instance, roboticists have widely used multiple cameras to solve complex manipulation tasks (Akkaya et al., 2019; Akinola et al., 2020; Hsu et al., 2022; James et al., 2022; Jangir et al., 2022). Yet these works mostly focus on the improved performance obtained by using multi-view observations as inputs, rather than investigating the effectiveness of representation learning with diverse data from multiple cameras. A notable exception is the work of Sermanet et al. (2018), which learns view-invariant representations via contrastive learning. However, enforcing viewpoint invariance assumes that all viewpoints share similar information and thus requires careful curation of positive and negative pairs, similar to other contrastive approaches that often depend on complex design choices about sampling such pairs (Arora et al., 2019).
We present the Multi-View Masked Autoencoder (MV-MAE), a simple and scalable framework for visual representation learning with diverse data from multiple cameras. Our main idea is to mask randomly selected viewpoints and train an autoencoder that reconstructs pixels of both masked and unmasked viewpoints. This allows the model to learn representations of each viewpoint that capture not only visual information of the current viewpoint but also cross-view information from other viewpoints. To further encourage cross-view representation learning, we propose to train a video autoencoder by masking multiple viewpoints from video frames at random. Because the model can utilize information

