MULTIMODAL MASKED AUTOENCODERS LEARN TRANSFERABLE REPRESENTATIONS

Abstract

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT, whose standard masking ratio is 15%, due to the joint training of the two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data and unpaired data.

1. INTRODUCTION

With the rapid advances in neural architectures (Vaswani et al., 2017) and hardware performance, self-supervised pre-training has made tremendous progress in natural language processing (NLP) and vision (He et al., 2021; Devlin et al., 2018; Bao et al., 2021; Brown et al., 2020). The underlying idea, often referred to as masked token prediction, is conceptually simple: the model learns to predict a removed portion of the data. Masked token prediction has enabled highly successful pre-training methods in NLP and vision, including the Transformer (Vaswani et al., 2017), GPT (Brown et al., 2020), BERT (Devlin et al., 2018), and MAE (He et al., 2021). These pre-trained representations have been shown to generalize well to various downstream tasks. The cornerstone of these successes is that the methods effectively leverage large and diverse datasets. Indeed, as data diversity and model capacity scale up, there is still no sign of a plateau in generalization to downstream tasks (Devlin et al., 2018; He et al., 2021).

Driven by these successes in NLP and vision, there has been significant interest in improving visual representation learning by training on large and diverse multimodal datasets that contain both images and text. Such datasets, e.g., CC12M (Changpinyo et al., 2021) and YFCC100M (Thomee et al., 2015), are often far more scalable than explicitly labeled datasets such as ImageNet (Deng et al., 2009), and the diverse language data can provide rich supervision for training more generalizable representations. The dominant paradigm for multimodal pre-training is cross-modal contrastive learning, as in CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021). These methods show that cross-modal contrastive models, trained on giant corpora of paired image-and-text data, can generalize well to various downstream tasks.
Despite this progress, a major limitation of contrastive learning is that it requires paired image-and-text data and therefore cannot leverage widely available unpaired data. In addition, contrastive learning methods use separate encoders for image and text, making it difficult for the model to access information from both modalities at the same time; this separation of encoders hinders the joint understanding of image and text.

To address these limitations of visual representation learning, we propose a simple and scalable architecture, the Multimodal Masked Autoencoder (M3AE), which learns a single unified model on large image and language data without modality-specific encoders or contrastive learning. Based on MAE (He et al., 2021), M3AE is trained purely via masked token prediction. Our key idea is to treat an image-and-text pair as one long sequence of tokens consisting of the embeddings of image patches and text. M3AE is trained simply by masking random patches of the input image and random language tokens, and learning to reconstruct the masked pixels and text.

In this paper, we provide an empirical study of M3AE trained on the multimodal CC12M dataset (Changpinyo et al., 2021), and find that M3AE learns generalizable representations that transfer well to downstream tasks such as image classification and out-of-distribution detection. We find that multimodal pre-training of M3AE on CC12M achieves significantly higher performance on the ImageNet-1k linear classification benchmark (Russakovsky et al., 2014) than pre-training on images only (MAE). Our strong results for M3AE demonstrate the generalization benefits of multimodal training for learning transferable representations across datasets.
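The joint-sequence masking idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, patch size, and mask ratios are assumptions chosen for the example, and embeddings are replaced by raw pixel and token-id "tokens" for simplicity.

```python
import numpy as np

def make_m3ae_inputs(image, text_ids, patch=16,
                     image_mask_ratio=0.75, text_mask_ratio=0.75, rng=None):
    """Illustrative sketch: patchify an image, pair the patches with text
    tokens as one long sequence, and mask a random subset of each modality.
    The encoder would see only the visible tokens; the decoder is trained
    to reconstruct the masked ones."""
    if rng is None:
        rng = np.random.default_rng(0)
    text_ids = np.asarray(text_ids)
    h, w, c = image.shape
    # Flatten the image into (num_patches, patch*patch*c) pixel "tokens".
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

    def split(n, ratio):
        # Keep a random (1 - ratio) fraction visible; mask the rest.
        keep = max(1, int(round(n * (1 - ratio))))
        perm = rng.permutation(n)
        return np.sort(perm[:keep]), np.sort(perm[keep:])

    img_keep, img_masked = split(len(patches), image_mask_ratio)
    txt_keep, txt_masked = split(len(text_ids), text_mask_ratio)
    visible = {"patches": patches[img_keep], "text": text_ids[txt_keep]}
    targets = {"patches": patches[img_masked], "text": text_ids[txt_masked]}
    return visible, targets
```

For a 224x224 RGB image with 16x16 patches, this produces 196 patch tokens, of which 49 remain visible at a 75% image mask ratio; the masked patches and masked text tokens become the reconstruction targets.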
Surprisingly, we find that M3AE performs best when we apply a high mask ratio (75%) to language, whereas language models like BERT (Devlin et al., 2018) conventionally use a low mask ratio (15%) because language data are highly semantic and information-dense. We hypothesize that M3AE benefits from a higher text mask ratio because it enforces a better joint understanding of vision and language during masked token prediction. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Furthermore, we demonstrate the scalability of M3AE with larger model size and longer training time, as well as its flexibility to train on both paired image-text data and unpaired data.
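A rough arithmetic illustration of why the text mask ratio matters (the 12-token caption length here is hypothetical, not a statistic from the paper):

```python
def visible_text_tokens(num_tokens, mask_ratio):
    """Number of text tokens left visible to the encoder after masking."""
    return num_tokens - int(round(num_tokens * mask_ratio))

# For a hypothetical 12-token caption: a BERT-style 15% ratio leaves
# 10 tokens visible, so the caption can largely be completed from the
# text alone; a 75% ratio leaves only 3 visible tokens, pushing the
# model to rely on the image patches to reconstruct the caption.
```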

2. RELATED WORK

Self-supervised representation learning via reconstruction. Since the introduction of the Transformer (Vaswani et al., 2017), self-supervised language modeling has made substantial progress. After pre-training on large amounts of unlabeled data with a reconstruction loss, large-scale Transformer language models like BERT (Devlin et al., 2018) and GPT (Brown et al., 2020) learn representations that generalize well to a wide range of downstream tasks. Taking inspiration from these successes in NLP, researchers have proposed a wide variety of self-supervised methods for vision (Chen et al., 2020a; Dosovitskiy et al., 2020; Bao et al., 2021; He et al., 2021). iGPT (Chen et al., 2020a) operates on sequences of pixels and reconstructs the unknown pixels. ViT (Dosovitskiy et al., 2020)



Figure 1: The Multimodal Masked Autoencoder (M3AE) consists of an encoder that maps language tokens and image patches to a shared representation space, and a decoder that reconstructs the original image and language from that representation.

