MASKED VISION AND LANGUAGE MODELING FOR MULTI-MODAL REPRESENTATION LEARNING

Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other modality also implicitly learns the cross-modal alignment between language tokens and image patches. Our experiments on a variety of V+L tasks show that the proposed method, together with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training examples. Moreover, we outperform the other competitors by a significant margin in limited-data scenarios.

1. INTRODUCTION

Vision and language (V+L) representation learning has gained significant attention due to the transferability of the learned representations to a diverse set of downstream tasks such as zero- or few-shot visual recognition (Jia et al., 2021; Radford et al., 2021; Tsimpoukelli et al., 2021), object detection (Cai et al., 2022; Kamath et al., 2021), information retrieval (Li et al., 2022; 2021), and multi-modal generation (Ramesh et al., 2022; 2021). This success is mainly driven by large-scale pre-training on paired image and text data. Current V+L pre-training techniques focus in particular on representation learning that characterizes the association between vision and language, and they are largely inspired by self-supervised learning techniques (Devlin et al., 2018; He et al., 2020) in uni-modal learning. Masked signal modeling is a popular self-supervisory pre-training task (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Bao et al., 2021; Xie et al., 2021; He et al., 2022), which aims at reconstructing the masked signals from the unmasked ones. It has been explored independently in natural language processing (NLP) and computer vision. For example, in NLP, BERT (Devlin et al., 2018) and several follow-up works (Liu et al., 2019; Yang et al., 2019) utilize masked language modeling (MLM), where the model is expected to predict the masked text tokens using the unmasked tokens. They have shown that MLM leads to strong generalization performance across diverse NLP tasks. In computer vision, as shown in Figure 1 (top-left), masked image modeling (MIM) predicts masked pixels or image patches using the unmasked portions of the image. MIM has been shown to be an effective pre-training task for learning visual representations (Bao et al., 2021; Xie et al., 2021; He et al., 2022).
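The two masking schemes above share the same recipe: corrupt a random subset of the input and train the model to reconstruct it. The sketch below illustrates that recipe for both modalities; the masking ratios, the `[MASK]` symbol, and the 14x14 patch grid are illustrative conventions from the MLM/MIM literature, not values taken from this paper.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_id="[MASK]", seed=0):
    """MLM-style corruption: replace a fraction of text tokens with a mask
    symbol and return the corrupted sequence plus the indices to reconstruct."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [mask_id if i in masked_idx else t for i, t in enumerate(tokens)]
    return corrupted, masked_idx

def mask_patches(num_patches, mask_ratio=0.6, seed=0):
    """MIM-style corruption: choose image-patch indices to blank out; vision
    masking typically uses a much higher ratio than text masking."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), int(num_patches * mask_ratio)))

tokens = "a dog chases a green ball".split()
corrupted, masked_idx = mask_tokens(tokens)
patch_idx = mask_patches(num_patches=196)  # e.g. a 14x14 grid of 16px patches
```

In uni-modal MLM or MIM the reconstruction target is predicted from the unmasked remainder of the same modality only; the joint formulation discussed next additionally conditions on the other, unmasked modality.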
While MLM and MIM have been actively explored in their respective domains, existing works do not fully leverage masked multi-modal signal modeling in the V+L domain. For example, as shown in Figure 1 (bottom-left), several approaches rely only on MLM with unmasked images and do not model the masked images (Duan et al., 2022; Li et al., 2022; 2021; 2019; Yang et al., 2022). In this case, the distribution of text given image, p(T|I), can be learned, but the distribution of image given text, p(I|T), cannot. This can lead to biased performance in cross-modal retrieval tasks such as image-to-text or text-to-image retrieval, as shown in our experiments. Although there exist works that mask the signals of both modalities, they either use a frozen object detector to extract region-based visual features (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Su et al., 2019; Tan & Bansal, 2019) or mask image tokens from a pre-trained image tokenizer instead of the raw RGB pixels (Dou et al., 2022; Fu et al., 2021; Singh et al., 2022). These frozen object detectors and image tokenizers not only require additional data for training but also prevent the V+L interactions from being learned end-to-end. In this paper, we propose joint masked V+L modeling, where the original signal is reconstructed from its masked input and the corresponding unmasked input from the other modality. As illustrated in Figure 1 (right part), although we exploit random masking, the dog face in the image can be used to predict the masked text token "dog", and the text "green ball" can be used to reconstruct the corresponding masked patches in the image. To ensure that the model uses information from both modalities, we explicitly enforce the model to utilize cross-attention to generate the joint representations. Compared with the aforementioned existing works, our approach models both conditional distributions, p(I|T) and p(T|I).
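The cross-attention mechanism that enforces this cross-modal conditioning can be sketched in a few lines. The single-head, projection-free formulation below is a deliberate simplification of a standard transformer cross-attention block (a real model would add learned query/key/value projections, multiple heads, and residual layers); the tensor sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: each query position (the masked modality)
    gathers information from the other, unmasked modality."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (Lq, Lkv) similarity
    attn = softmax(scores, axis=-1)                # rows sum to 1
    return attn @ keys_values                      # (Lq, d) fused features

rng = np.random.default_rng(0)
d = 8
text_states = rng.normal(size=(6, d))    # states of the (masked) text tokens
image_states = rng.normal(size=(16, d))  # unmasked image patch features

# Text branch reconstructs masked tokens conditioned on the image, i.e. p(T|I);
# swapping the roles conditions the image branch on text, i.e. p(I|T).
text_from_image = cross_attention(text_states, image_states)
image_from_text = cross_attention(image_states, text_states)
```

Because each masked position can only fill itself in by attending across modalities, minimizing the reconstruction error implicitly aligns language tokens with the image patches that describe them.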
Also, the model is trained end-to-end, without frozen bottleneck components that disturb learning the interactions between V+L. By reconstructing the signal of one modality from the corresponding signal of the other modality (e.g., reconstructing the text "dog" from the visual signals of the dog face), the model implicitly learns the alignment between V+L. In addition, we observe that the model trained for joint masked V+L modeling becomes noticeably effective when the training data is limited. Overall, our contributions are summarized below:
1. We propose a joint masked V+L modeling task. We show that models pre-trained with the proposed task, along with common V+L alignment tasks such as image-text matching, achieve state-of-the-art performance on a broad range of V+L tasks.
2. We provide a probabilistic interpretation of the proposed method and highlight the difference between ours and existing approaches in terms of V+L joint distribution estimation.
3. We achieve significantly better performance than other V+L models in the regimes of limited training data, and only ∼40% of the data used by the state-of-the-art models is sufficient to match their performance.
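Concretely, a joint masked V+L objective pairs a token-level classification loss for the text branch with a pixel-level regression loss for the image branch (consistent with masking raw RGB pixels rather than tokenizer outputs), plus an alignment loss such as image-text matching. The sketch below is a minimal numpy illustration under those assumptions; the specific loss forms and the equal loss weights are placeholders, not values from the paper.

```python
import numpy as np

def mlm_loss(logits, target_ids, masked_idx):
    """Cross-entropy over the vocabulary, evaluated only at masked text positions."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(-np.mean([np.log(probs[i, target_ids[i]]) for i in masked_idx]))

def mim_loss(pred_patches, target_patches, masked_idx):
    """Mean-squared error on the raw pixel values of masked image patches."""
    diff = pred_patches[masked_idx] - target_patches[masked_idx]
    return float(np.mean(diff ** 2))

def total_loss(l_mlm, l_mim, l_align, w=(1.0, 1.0, 1.0)):
    """Weighted sum of both masked-modeling losses and an alignment loss
    (e.g. image-text matching); the weights here are illustrative."""
    return w[0] * l_mlm + w[1] * l_mim + w[2] * l_align
```

Computing both `mlm_loss` and `mim_loss` on every image-text pair is what trains the two conditionals p(T|I) and p(I|T) simultaneously, in contrast to MLM-only pre-training.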

2. RELATED WORK

Vision and language representation learning. The methods in V+L representation learning can be categorized by how information is fused between the modalities to obtain the joint representations. We group the fusion techniques into three categories: 1) transformers with attention across modalities, 2) contrastive learning with large-scale pre-training data, and 3) a hybrid form of learning with cross-attention and a contrastive loss. Attention across modalities has been widely used with image features extracted from off-the-shelf object detectors and text features obtained from transformer encoders (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Tan & Bansal, 2019; Zhang et al., 2021; Li et al., 2020b; Su et al., 2019; Li et al., 2019). While cross-attention effectively aligns V+L representations, it is computationally expensive since all possible pairs of images



Figure 1: An overview of masked vision and language modeling. The left side shows existing approaches and the right side highlights our proposed approach.

