MASKED VISION AND LANGUAGE MODELING FOR MULTI-MODAL REPRESENTATION LEARNING

Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: both the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training image-text pairs. Moreover, our method outperforms competitors by a significant margin in limited-data scenarios.

1. INTRODUCTION

Vision and language (V+L) representation learning has gained significant attention due to the transferability of the learned representations to a diverse set of downstream tasks such as zero- or few-shot visual recognition (Jia et al., 2021; Radford et al., 2021; Tsimpoukelli et al., 2021), object detection (Cai et al., 2022; Kamath et al., 2021), information retrieval (Li et al., 2022; 2021), and multi-modal generation (Ramesh et al., 2022; 2021). This success is mainly driven by large-scale pre-training on paired image and text data. Current V+L pre-training techniques focus in particular on representation learning that characterizes the association between vision and language, and they are largely inspired by self-supervised learning techniques (Devlin et al., 2018; He et al., 2020) in uni-modal learning. Masked signal modeling is a popular self-supervised pre-training task (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Bao et al., 2021; Xie et al., 2021; He et al., 2022), which aims at reconstructing the masked signals from the unmasked ones. It has been explored independently in the domains of natural language processing (NLP) and computer vision. For example, in NLP, BERT (Devlin et al., 2018) and several follow-up works (Liu et al., 2019; Yang et al., 2019) utilize masked language modeling (MLM), where the model is expected to predict the masked text tokens from the unmasked ones. They have shown that MLM leads to strong generalization performance across diverse NLP tasks. In computer vision, as shown in Figure 1 (top-left), masked image modeling (MIM) aims to predict masked pixels or image patches using the unmasked portions of the images. MIM has been shown to be an effective pre-training task for learning visual representations (Bao et al., 2021; Xie et al., 2021; He et al., 2022).
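The masking objective described above can be illustrated with a minimal sketch. This is not the authors' implementation; the `MASK_ID` value and the default 15% masking rate are assumptions following BERT conventions, and the same idea applies to image patches in MIM by masking patch embeddings instead of token ids.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id (103 in BERT's vocabulary)
IGNORE = -100   # label value that the reconstruction loss skips

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking: hide a random fraction of tokens and build
    reconstruction targets; unmasked positions are excluded from the loss."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # the model must predict tok from context
            labels.append(tok)      # supervise only this masked position
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # no supervision on visible tokens
    return inputs, labels

# Example: mask a toy sequence of token ids.
inputs, labels = mask_tokens(list(range(1000, 1020)), mask_prob=0.3)
```

During pre-training, a Transformer consumes `inputs` and is trained to predict the original token at every position where `labels` is not `IGNORE`.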
While MLM and MIM have been actively explored in their respective domains, existing works do not fully leverage masked multi-modal signal modeling in the V+L domain. For example, as shown in Figure 1 (bottom-left), several approaches rely only on MLM with unmasked images and do not model the masked images (Duan et al., 2022; Li et al., 2022; 2021; 2019; Yang et al., 2022). In this case, the distribution of text given image, p(T|I), can be learned, but the distribution of image given text, p(I|T), cannot. This can lead to biased performance in cross-modal retrieval tasks such as image-to-text or text-to-image retrieval, as shown in our experiments. Although there

