MULTI-VIEW MASKED AUTOENCODERS FOR VISUAL CONTROL

Abstract

This paper investigates how to leverage data from multiple cameras to learn representations beneficial for visual control. To this end, we present the Multi-View Masked Autoencoder (MV-MAE), a simple and scalable framework for multi-view representation learning. Our main idea is to mask viewpoints from video frames at random and train a video autoencoder to reconstruct the pixels of both masked and unmasked viewpoints. This encourages the model to learn representations that capture not only useful information from the current viewpoint but also cross-view information from other viewpoints. We evaluate MV-MAE on challenging RLBench visual manipulation tasks by training a reinforcement learning agent on top of frozen representations. Our experiments demonstrate that MV-MAE significantly outperforms other multi-view representation learning approaches. Moreover, we show that the number of cameras can differ between the representation learning phase and the behavior learning phase: by training a single-view control agent on top of multi-view representations from MV-MAE, we achieve a 62.3% success rate, while the single-view representation learning baseline achieves 42.3%.

1. INTRODUCTION

Recent self-supervised learning approaches have been successful at learning useful representations from multiple views of the data, including different channels (Zhang et al., 2017) or patches (Oord et al., 2018) of an image, vision-sound modalities (Owens et al., 2016), vision-language modalities (Radford et al., 2021; Alayrac et al., 2022), and frames of a video (Wang & Gupta, 2015). The main underlying idea of these approaches is to utilize information about the same data from different perspectives as supervision for representation learning. Notably, Zhang et al. (2017) trained an autoencoder that predicts a subset of the image channels from another subset, and Radford et al. (2021) trained a vision-language model that matches image-text pairs with contrastive learning. Promising results from these approaches suggest that data diversity can play a key role in representation learning. In the context of visual control, the camera is an easily accessible instrument that can increase data diversity by providing information about the same scene from different viewpoints. For instance, it has been a widely used technique for roboticists to utilize multiple cameras when solving complex manipulation tasks (Akkaya et al., 2019; Akinola et al., 2020; Hsu et al., 2022; James et al., 2022; Jangir et al., 2022). Yet these works mostly focus on the improved performance obtained by using multi-view observations as inputs, rather than investigating the effectiveness of representation learning with diverse data from multiple cameras. A notable exception is the work of Sermanet et al. (2018), which learns view-invariant representations via contrastive learning. However, enforcing viewpoint invariance assumes that all viewpoints share similar information and thus requires careful curation of positive and negative pairs, similar to other contrastive approaches that often depend on complex design choices about sampling such pairs (Arora et al., 2019).
We present Multi-View Masked Autoencoders (MV-MAE), a simple and scalable framework for visual representation learning with diverse data from multiple cameras. Our main idea is to mask randomly selected viewpoints and train an autoencoder that reconstructs the pixels of both masked and unmasked viewpoints. This allows the model to learn representations of each viewpoint that capture not only visual information from the current viewpoint but also cross-view information from other viewpoints. To further encourage cross-view representation learning, we propose to train a video autoencoder by masking multiple viewpoints from video frames at random. Because the model can utilize information not only from the current frame but also from unmasked frames of the target view, we find that our approach helps the model focus on predicting important details, e.g., gripper poses. We then utilize the learned representations for visual control by training a reinforcement learning agent that learns a world model on top of frozen representations and utilizes it for behavior learning (Seo et al., 2022a).

Contributions. We highlight the contributions of our paper below:

• We present MV-MAE, a simple and scalable framework that can leverage diverse data from multiple cameras for visual representation learning. The main idea of MV-MAE is to train a video masked autoencoder (Tong et al., 2022; Feichtenhofer et al., 2022) with a view-masking strategy that encourages the model to learn spatial dependencies between viewpoints.

• We provide an empirical evaluation of MV-MAE on challenging visual manipulation tasks from RLBench (James et al., 2020). Unlike other multi-view representation learning baselines that enforce invariance between multiple viewpoints (Sermanet et al., 2018; Assran et al., 2022), MV-MAE consistently outperforms a single-view representation learning baseline under a challenging experimental setup with multiple cameras of diverse types.

• We demonstrate that data diversity from multiple viewpoints can play a crucial role in representation learning for visual control. In particular, by training a single-view control agent on top of multi-view representations from MV-MAE, we find that our approach significantly outperforms a single-view representation learning baseline, e.g., we achieve a 62.3% success rate on six visual manipulation tasks while the baseline achieves 42.3%.
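The view-masking strategy described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the function name, signature, and hyperparameter values (e.g., one fully masked view per frame, a 0.5 patch-mask ratio) are assumptions chosen for clarity. It builds a boolean mask over (frame, view, patch) feature tokens, masking every patch of randomly chosen viewpoints and a random fraction of patches in the remaining viewpoints.

```python
import numpy as np

def view_mask(num_frames, num_views, num_patches,
              num_masked_views, patch_mask_ratio, rng):
    """Build a boolean mask over (frame, view, patch) tokens.

    True = masked (hidden from the encoder). For each frame, all
    patches of `num_masked_views` randomly chosen viewpoints are
    masked; a `patch_mask_ratio` fraction of patches in each of the
    remaining viewpoints is masked as well.
    """
    mask = np.zeros((num_frames, num_views, num_patches), dtype=bool)
    num_masked_patches = int(patch_mask_ratio * num_patches)
    for f in range(num_frames):
        # Mask entire viewpoints, chosen independently per frame.
        masked_views = rng.choice(num_views, size=num_masked_views,
                                  replace=False)
        mask[f, masked_views, :] = True
        # Additionally mask random patches of the remaining views.
        for v in range(num_views):
            if v in masked_views:
                continue
            masked_patches = rng.choice(num_patches,
                                        size=num_masked_patches,
                                        replace=False)
            mask[f, v, masked_patches] = True
    return mask

# Example: 4 frames, 3 viewpoints, 16 patch tokens per viewpoint.
rng = np.random.default_rng(0)
m = view_mask(num_frames=4, num_views=3, num_patches=16,
              num_masked_views=1, patch_mask_ratio=0.5, rng=rng)
```

Because the fully masked viewpoint is resampled per frame, the encoder must rely on unmasked frames of the same view and on the other viewpoints of the same frame to reconstruct it, which is what drives the cross-view representation learning.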

2. RELATED WORK

Unsupervised visual representation learning. Self-supervised learning on large-scale unlabeled datasets has been actively studied in the domain of computer vision (Noroozi & Favaro, 2016; Chen et al., 2020; Grill et al., 2020; He et al., 2021) . A line of research that has been successful is contrastive learning, which learns representations by maximizing the mutual information between



Figure 1: Illustration of our representation learning scheme. We extract features from each viewpoint with convolutional networks (CNN) and mask all features from randomly selected viewpoints of video frames. We also mask randomly selected features from remaining viewpoints to encourage the autoencoder to learn information of unmasked frames. A vision transformer (ViT; Dosovitskiy et al., 2021) encoder processes visible features to fuse information from multiple views and frames. Then a ViT decoder concatenates mask tokens for each view and processes inputs to reconstruct frames. We note that the autoencoder reconstructs all frames at the same time.
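The token flow in the caption — encoder sees only visible features, decoder re-inserts a shared learnable mask token at every masked position before reconstruction — can be sketched as follows. This is an illustrative stand-in, not the paper's code: the function name is hypothetical, and the ViT encoder is replaced by an identity map to keep the sketch self-contained.

```python
import numpy as np

def split_and_reassemble(tokens, mask, mask_token):
    """Drop masked tokens before the encoder, then rebuild the full
    sequence for the decoder.

    tokens:     (num_tokens, dim) CNN features for one sequence.
    mask:       (num_tokens,) boolean, True = masked.
    mask_token: (dim,) vector standing in for the learned embedding.
    """
    visible = tokens[~mask]       # encoder input: visible features only
    encoded = visible             # identity stand-in for the ViT encoder
    # Decoder input: a copy of the mask token at every position...
    full = np.tile(mask_token, (tokens.shape[0], 1))
    # ...with encoded tokens scattered back to their visible positions.
    full[~mask] = encoded
    return visible, full
```

The efficiency benefit of this asymmetric design is that the encoder's cost scales with the number of visible tokens only, while the (typically lighter) decoder processes the full sequence to predict pixels for all frames at once.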

