MASKED VISION AND LANGUAGE MODELING FOR MULTI-MODAL REPRESENTATION LEARNING

Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: both the image and the text convey almost the same information, but in different formats. The masked signal reconstruction of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training examples. Moreover, it outperforms the other competitors by a significant margin in limited data scenarios.

1. INTRODUCTION

Vision and language (V+L) representation learning has gained significant attention due to the transferability of the representations to a diverse set of downstream tasks such as zero- or few-shot visual recognition (Jia et al., 2021; Radford et al., 2021; Tsimpoukelli et al., 2021), object detection (Cai et al., 2022; Kamath et al., 2021), information retrieval (Li et al., 2022; 2021), and multi-modal generation (Ramesh et al., 2022; 2021). This success is mainly driven by large-scale pre-training with paired image and text data. The current V+L pre-training techniques particularly focus on representation learning that characterizes the association between vision and language, and they are largely inspired by self-supervised learning techniques (Devlin et al., 2018; He et al., 2020) in uni-modal learning. Masked signal modeling is a popular self-supervisory pre-training task (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Bao et al., 2021; Xie et al., 2021; He et al., 2022), which aims at reconstructing the masked signals from the unmasked ones. It has been independently explored in the domains of natural language processing (NLP) and computer vision (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Bao et al., 2021; Xie et al., 2021; He et al., 2022). For example, in the domain of NLP, BERT (Devlin et al., 2018) and several follow-up works (Liu et al., 2019; Yang et al., 2019) utilize masked language modeling (MLM), where the model is expected to predict the masked text tokens using the unmasked tokens. They have shown that MLM leads to powerful generalization performance across diverse NLP tasks. In computer vision, as shown in Figure 1 (top-left), masked image modeling (MIM) predicts masked pixels or image patches using the unmasked portions of the images. MIM has been shown to be an effective pre-training task for learning visual representations (Bao et al., 2021; Xie et al., 2021; He et al., 2022).
While MLM and MIM have been actively explored in each domain, existing works do not fully leverage masked multi-modal signal modeling in the domain of V+L. For example, as shown in Figure 1 (bottom-left), several approaches rely only on MLM with unmasked images and do not model the masked images (Duan et al., 2022; Li et al., 2022; 2021; 2019; Yang et al., 2022). In this case, the distribution of text given image, p(T|I), can be learned, but the distribution of image given text, p(I|T), cannot. This can lead to biased performance in cross-modal retrieval tasks such as image-to-text or text-to-image retrieval, as shown in our experiments. Although there

exist works that use both modality signals masked, they either use a frozen object detector to extract region-based visual features (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Su et al., 2019; Tan & Bansal, 2019) or mask image tokens from a pre-trained image tokenizer instead of the raw RGB pixels (Dou et al., 2022; Fu et al., 2021; Singh et al., 2022). These frozen object detectors and image tokenizers not only require additional data for training but also prevent the V+L interactions from being learned end-to-end. In this paper, we propose joint masked V+L modeling where the original signal is reconstructed by using its masked input and the corresponding unmasked input from the other modality. As illustrated in Figure 1 (right), although we exploit random masking, the dog face in the image can be used to predict the masked text token "dog", and the text "green ball" can be used to reconstruct the corresponding masked patches in the image. To ensure that the model uses information from both modalities, we explicitly enforce the model to utilize cross-attention to generate the joint representations. Compared with the aforementioned existing works, our approach models both conditional distributions, p(I|T) and p(T|I). Also, the model is trained end-to-end, without frozen bottleneck components that hinder learning the interactions between V+L. By reconstructing the signal of one modality from the corresponding signal of the other modality (e.g., reconstructing the text "dog" from the visual signals of the dog face), the model implicitly learns the alignment between V+L.

Figure 1 (panels: Masked Image Modeling; Masked Language Modeling in V+L Learning; Masked Vision and Language Modeling): An overview of masked vision and language modeling. The left side shows existing approaches and the right side highlights our proposed approach.
In addition, we observe that the model trained for joint masked V+L modeling becomes noticeably effective when the training data is limited. Overall, our contributions are summarized below:
1. We propose a joint masked V+L modeling task. We show that models pre-trained with the proposed task, along with common V+L alignment tasks such as image-text matching, achieve state-of-the-art performance on a broad range of V+L tasks.
2. We provide a probabilistic interpretation of the proposed method and highlight the difference between ours and existing approaches in terms of V+L joint distribution estimation.
3. We achieve significantly better performance than other V+L models in the regime of limited training data; only ∼40% of the data used by the state-of-the-art models is sufficient to match their performance.

2. RELATED WORK

Vision and language representation learning: The methods in V+L representation learning can be categorized based on how the information is fused between the modalities to obtain the joint representations. We group the fusion techniques into three categories: 1) transformers with attention across modalities, 2) contrastive learning with large-scale pre-training data, and 3) a hybrid form of learning with cross-attention and a contrastive loss. Attention across modalities has been widely used with image features extracted from off-the-shelf object detectors and text features obtained from transformer encoders (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Tan & Bansal, 2019; Zhang et al., 2021; Li et al., 2020b; Su et al., 2019; Li et al., 2019). While cross-attention effectively aligns V+L representations, it is computationally expensive since all possible pairs of images and texts need to be processed. In contrast, the authors in (Jia et al., 2021; Radford et al., 2021; Mokady et al., 2021; Shen et al., 2021; Yuan et al., 2021) show that contrastive learning with uni-modal encoders and millions of image-text pairs can achieve powerful zero-shot performance in diverse V+L tasks. The contrastive learning-based approaches do not rely on computationally expensive cross-attention but require an excessively large amount of training data. Hence, a combination of a contrastive loss and cross-attention is utilized in (Li et al., 2021; 2022; Yang et al., 2022; Duan et al., 2022), complementing the limitations of both approaches. In particular, only image and text pairs that are assigned high similarity by the uni-modal encoders are processed by the cross-attention layers, which reduces the computational burden and improves the alignment. Masked signal modeling is a commonly used pre-training objective in the aforementioned V+L models. It has been explored independently in each of the vision and language domains.
In the NLP domain, BERT and its variants (Devlin et al., 2018; Liu et al., 2019) learn representations that generalize to a broad range of NLP tasks through MLM. Autoregressive language models (Radford et al., 2018; 2019), which predict future tokens, have also been shown to be effective self-supervised learners. The success of these language models has led to several MIM techniques. BEiT (Bao et al., 2021) is trained to recover masked visual tokens obtained from a discrete variational autoencoder (dVAE). In SimMIM (Xie et al., 2021) and MAE (He et al., 2022), transformers are trained to recover masked patches in an end-to-end fashion. The authors in (Chen et al., 2020a) propose to autoregressively predict unknown pixels to learn visual representations. In (Bachmann et al., 2022), data from multiple vision modalities are masked and reconstructed to learn visual representations. In the domain of V+L learning, (Arici et al., 2021) explores MIM and MLM for catalog data with short text attributes. V+L models with an object detector often aim at recovering only bounding-box visual features (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Tan & Bansal, 2019; Su et al., 2019). Several V+L models focus on predicting future text tokens without MIM (Wang et al., 2021; Yu et al., 2022; Alayrac et al., 2022). While both MIM and MLM are explored in (Geng et al., 2022), the trained model is evaluated only on vision tasks. In (Dou et al., 2022; Fu et al., 2021; Singh et al., 2022; Wang et al., 2022), the models mask and predict image tokens defined by image tokenizers, which are trained with an additional 250 million images (Ramesh et al., 2021) or by distillation from the CLIP model (Radford et al., 2021), itself trained with 400 million image-text pairs (Peng et al., 2022).

3. METHOD

Our method has two types of pre-training objectives, which are 1) masked vision and language modeling and 2) multi-modal alignment. We explain each pre-training objective in this section.

3.1. MASKED VISION AND LANGUAGE MODELING

The overall framework of masked vision and language modeling is shown in Figure 2. We use transformer-based encoders (Vaswani et al., 2017) for both the image and text streams. Given an image-text pair (I, T), an image encoder, f_im, is used to extract features, v = {v_cls, v_1, ..., v_N}, from the image input I. N is the number of image patches and v_cls is the encoding of the image class token, [CLS]. The text encoder, f_txt, extracts features, w = {w_cls, w_1, ..., w_M}, from the text input, T. M is the number of text tokens and w_cls is the encoding of the start token of a sentence, [START]. The image and text encoders each consist of 12 self-attention blocks, as shown in Figure 3 (a). The image and text features are further processed by image and text cross-modality encoders. The cross-modality encoders have 3 cross-attention blocks, as illustrated in Figure 3 (b). The image (text) cross-modality encoder uses text (image) features to generate attentions. These cross-modality encoders can enhance the representation of one modality by interacting with the other modality (Lu et al., 2020; Tan & Bansal, 2019).

Image and Text Masking: For text masking, we follow BERT (Devlin et al., 2018) with minor modifications. In BERT, the original tokens are replaced with either the [MASK] token or random tokens. We use only the [MASK] token to replace tokens to be masked (Wettig et al., 2022). For image masking, we follow (He et al., 2022; Xie et al., 2021) and use random masking of raw image patches with a masking patch size of 32 × 32. Given that 224 × 224 images are divided into 16 × 16 patches for the image encoder, the larger masking patch prevents the model from simply copying neighboring pixels for reconstruction (Xie et al., 2021).

Joint Reconstruction: We reconstruct the original signals of one modality from its masked input conditioned on the unmasked input of the other modality.
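To make the block-wise image masking above concrete, the following is a minimal NumPy sketch (the function name and the choice of zeroing masked pixels are ours; the actual model operates on patch embeddings inside the training pipeline):

```python
import numpy as np

def mask_image(img, mask_patch=32, ratio=0.6, seed=0):
    """Randomly mask `ratio` of the 32x32 blocks of an HxWx3 image.

    Blocks are larger than the encoder's 16x16 patches so the model
    cannot trivially copy neighboring pixels (Xie et al., 2021).
    Returns the masked image and a boolean mask (True = masked).
    """
    h, w, _ = img.shape
    gh, gw = h // mask_patch, w // mask_patch       # mask grid, e.g. 7x7
    n_mask = int(round(gh * gw * ratio))
    rng = np.random.default_rng(seed)
    chosen = rng.permutation(gh * gw)[:n_mask]      # which blocks to mask
    grid = np.zeros(gh * gw, dtype=bool)
    grid[chosen] = True
    grid = grid.reshape(gh, gw)
    # upsample the block-level mask to pixel resolution
    mask = np.repeat(np.repeat(grid, mask_patch, 0), mask_patch, 1)
    masked = img.copy()
    masked[mask] = 0.0                              # zeroing is for illustration
    return masked, mask

img = np.random.rand(224, 224, 3)                   # stand-in for a real image
masked, mask = mask_image(img)                      # mask.mean() == 29/49 ≈ 0.59
```

With a 224 × 224 image and 32 × 32 mask blocks, the grid is 7 × 7, so a 60% ratio masks 29 of the 49 blocks.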
Specifically, an original image, I, and a masked text, T_m, are used to reconstruct the original text, T; similarly, a masked image, I_m, and an original text, T, are used to reconstruct the original image, I. For image reconstruction, (I_m, T) is first given to the image and the text encoders to obtain masked image features, v_m, and unmasked text features, w. Following (Xie et al., 2021), we use both masked and unmasked patches to obtain v_m. (v_m, w) are further processed by the image cross-modality encoder, g_im, where w is used to compute cross-attentions. The output of g_im is mapped back to the original RGB image space by an image cross-modality decoder, g_im^de, which consists of 3 cross-attention blocks followed by a fully connected (FC) layer. Although existing work exploits a light-weight transformer decoder with only self-attention (He et al., 2022) or a simple linear mapping (Xie et al., 2021) for the image decoder, we use joint information between the modalities to allow further interactions in decoding. For masked text reconstruction, a token classifier, g_txt^de, which consists of an FC layer followed by softmax, is applied to the output of the text cross-modality encoder, g_txt, for token prediction. The masked V+L modeling loss, L_MVLM, is defined as

L_MVLM = E_{(I,T)∼D} [ H(y_T^M, φ_txt^M(I, T_m)) + (1/Ω(I^M)) ‖I^M − φ_im^M(I_m, T)‖_1 ],

where the first term is the MLM loss and the second term is the MIM loss, with φ_txt = g_txt^de(g_txt(f_im(I), f_txt(T_m))) and φ_im = g_im^de(g_im(f_im(I_m), f_txt(T))). The loss is computed only for masked pixels and text tokens; hence, the superscript M denotes data or features corresponding to the masked signals. A pair of I and T is sampled from the training dataset D, H denotes cross-entropy, and y_T^M is a matrix that contains one-hot row vectors for the ground truth of the masked text tokens. Ω(·) is the number of pixels. When minimizing L_MVLM, the model is enforced to reconstruct the original signals by attending to the other modality's signals.
Cross-attending for reconstruction enables the model to learn interaction between V+L modalities.
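The masked V+L modeling loss above can be sketched numerically as follows (a simplified NumPy version that assumes the predictions and masks have already been computed; the function name is ours, and in the real model the predictions come from the cross-modality encoders and decoders described above):

```python
import numpy as np

def mvlm_loss(token_logits, token_targets, token_mask,
              pixel_pred, pixel_target, pixel_mask):
    """Masked V+L modeling loss: cross-entropy over masked text tokens
    plus L1 over masked pixels, mirroring L_MVLM = MLM + MIM.

    token_logits:  (M, V) logits for M tokens over a vocabulary of size V
    token_targets: (M,) ground-truth token ids
    token_mask:    (M,) True where the token was masked
    pixel_pred, pixel_target: arrays of predicted / original pixel values
    pixel_mask:    boolean array, True where pixels were masked
    """
    # MLM: softmax cross-entropy, averaged over masked tokens only
    z = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(token_targets)), token_targets]
    mlm = nll[token_mask].mean()

    # MIM: mean absolute error over masked pixels only (1/Ω(I^M) · ||·||_1)
    mim = np.abs(pixel_pred - pixel_target)[pixel_mask].mean()
    return mlm + mim
```

Restricting both terms to the masked positions matches the paper's statement that the loss is computed only for masked pixels and text tokens.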

3.2. MULTI-MODAL ALIGNMENT

In addition to the masked signal modeling tasks, we adopt two additional tasks to explicitly learn multi-modal alignment. The first is image-text contrastive (ITC) learning (Radford et al., 2021; Jia et al., 2021). For the k-th pair of image and text features out of the image and text encoders, two separate FC layers are used to project the image [CLS] token features and the text [START] token features to the same dimensional feature space with unit norm, z_im^k and z_txt^k, respectively. The loss, L_ITC, is computed as

L_ITC = −(1/N) Σ_{k=1}^{N} [ log( exp(z_im^k · z_txt^k / τ) / Σ_{n=1}^{N} exp(z_im^k · z_txt^n / τ) ) + log( exp(z_im^k · z_txt^k / τ) / Σ_{n=1}^{N} exp(z_im^n · z_txt^k / τ) ) ],

where N and τ are the batch size and the temperature scaling parameter, respectively. The second task is image-text matching (ITM) (Chen et al., 2020b; Li et al., 2021; 2020b), which predicts whether an image and a text are aligned. The [CLS] and [START] token features from the image and text cross-modality encoders are z_im^cross and z_txt^cross, respectively. To fuse these two features, we compute the element-wise product z_im^cross * z_txt^cross, and an FC layer followed by softmax is applied to obtain the final prediction (Lu et al., 2019). For training, we use y_ITM = 1 when z_im^cross and z_txt^cross are a pair; otherwise, y_ITM = 0. The loss, L_ITM, is defined as

L_ITM = E_{(I,T)∼D} [ H(y_ITM, g_itm(z_im^cross * z_txt^cross)) ].

Following (Li et al., 2021), we sample in-batch hard negatives based on the distribution of the cosine similarity between z_im and z_txt. The overall pre-training loss, L, is defined as L = L_MVLM + L_ITC + L_ITM. We term our model trained with loss L MaskVLM (Masked Vision and Language Modeling).
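A minimal sketch of the symmetric ITC loss above, assuming unit-normalized projections z_im and z_txt with matching pairs sharing the same row index (the function name and NumPy implementation are ours):

```python
import numpy as np

def itc_loss(z_im, z_txt, tau=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss sketch.

    z_im, z_txt: (N, d) unit-normalized [CLS]/[START] projections.
    The matching pair of row k is row k of the other modality;
    tau is the temperature scaling parameter.
    """
    sim = z_im @ z_txt.T / tau                      # (N, N) similarity logits

    def xent(logits):
        # cross-entropy with targets on the diagonal (the true pairs)
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # image→text term (softmax over texts) + text→image term (over images)
    return xent(sim) + xent(sim.T)
```

When the pairs are well aligned the loss approaches zero; mismatched pairs produce a large loss, which is what drives the uni-modal encoders toward alignment.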

3.3. PROBABILISTIC INTERPRETATION

We differentiate MaskVLM from the existing V+L models that use masked signal modeling from the perspective of likelihood estimation. The training objective of masked signal modeling on uni-modal signals, X, focuses on learning the data distribution p(X), which is formulated by the law of total probability as

p(X) = Σ_{X_m ∈ M_X} p(X_m) · p(X|X_m),

where X_m is an instance of a masked signal from the set of all possible masked signals, M_X. MIM or MLM learns the data distribution by maximizing Σ_{X_m ∈ M_X} p(X|X_m) (Bengio et al., 2013). In V+L representation learning, the ultimate goal is to learn the joint distribution of multi-modal signals, p(I, T). However, the authors in (Sohn et al., 2014) pointed out that directly maximizing the likelihood of the joint distribution is challenging because of the heterogeneous multi-modal data distributions. Instead, they show that minimizing the variation of information, defined as −E_{(I,T)∼D}[log p(I|T) + log p(T|I)], is sufficient to estimate the joint distribution. From the perspective of variation of information, the limitations of existing works can be better understood. Several existing works attempted to approximate the joint distribution using MLM with unmasked images (Duan et al., 2022; Li et al., 2021; 2019; Yang et al., 2022). In other words, p(T|I, T_m) is maximized to learn the conditional distribution, p(T|I), but p(I|T) is not modeled. In other existing works (Chen et al., 2020b; Li et al., 2020a; Lu et al., 2020; Su et al., 2019; Tan & Bansal, 2019), where both modalities are masked, the visual masking is limited to masking the visual features extracted from a frozen object detector, ψ(·), instead of the raw image pixels. In this case, the distributions p(ψ(I)|T) and p(T|ψ(I)) are modeled instead of p(I|T) and p(T|I). This frozen feature extractor can bottleneck the direct estimation of the underlying data distribution.
MaskVLM is trained end-to-end to estimate both conditional distributions, p(I|T) and p(T|I), which directly minimizes the variation of information. We hypothesize that this modeling of the conditional distributions of both modalities could lead to superior performance in both large-scale and limited data training scenarios, which we empirically demonstrate in Section 4.

4. EXPERIMENTS

4.1. PRE-TRAINING DATASETS AND DOWNSTREAM TASKS

We use the union of four datasets for pre-training so that we can perform a fair comparison with existing state-of-the-art methods (Chen et al., 2020b; Li et al., 2021). These datasets are Conceptual Captions (CC) (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2017), and COCO Captions (Lin et al., 2014). While VG and COCO contain captions annotated by humans, CC and SBU Captions are automatically collected from the web. The total numbers of unique images and image-text pairs in the four datasets are 4.1M and 5.2M, respectively. We refer to this pre-training dataset as the 4M dataset. We validate the pre-trained model on the following four downstream tasks:

Image-Text Retrieval: We perform text-to-image and image-to-text retrieval. We use the ITC and ITM losses of Section 3.2 for finetuning, and the finetuned models are evaluated on COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015). In addition, since COCO is used for pre-training, zero-shot retrieval performance is reported on Flickr30k. In (Li et al., 2021), the model finetuned on COCO is used for the zero-shot evaluation on Flickr30k. Although it may result in better performance, we believe that using the finetuned model does not validate the zero-shot capability of the pre-trained model. Therefore, we use the pre-trained model directly for zero-shot evaluation. Following (Li et al., 2021), we first retrieve the top-k candidates using the similarity scores from the image and text encoders. The top-k candidates are further processed by the cross-modality encoders to obtain the final retrieval results.

Table 1: Finetuned image-text retrieval performance (R@1/R@5/R@10 for image and text retrieval) on MSCOCO (5K) and Flickr30k (1K); the second column lists the number of unique pre-training images per method, e.g., ImageBERT (Qi et al., 2020).

Visual Question Answering (VQA): Given an image and question pair, the model should generate a correct answer. The model is evaluated on VQA v2 (Goyal et al., 2017). We adopt the answer generation framework (Cho et al., 2021) and finetune the base model with a fusion encoder and an answer decoder. The model architectures are visualized in Figure 6 (a) of the Appendix. The fusion encoder consists of one cross-attention block, shown in Figure 3 (b). The output from the text cross-modality encoder is used as queries, and the image cross-modality encoder output is utilized to create attentions in the fusion encoder. The architecture of the answer decoder is the same as that of the text cross-modality encoder, but it is trained with a language modeling loss to generate the answers. Specifically, the output of the fusion encoder is used for computing attentions, and the answer tokens are autoregressively predicted. During inference, the [START] token is used as the initial token to generate the following answer tokens.

Natural Language for Visual Reasoning (NLVR): This task involves binary classification over a triplet, (text, image1, image2). The goal is to predict whether the text describes the pair of images. For finetuning, we feed (text, image1) and (text, image2) forward separately to extract the features, as shown in Figure 6 (b). The [CLS] token features of image1 and image2 from the image encoder are denoted as v_1 and v_2, respectively. The [START] token text features from the text encoder are denoted as w.
These features are processed by the cross-modality encoders. The outputs of the image and text cross-modality encoders are fused by element-wise multiplication. The fused features for both images are concatenated, and a classifier with two linear layers predicts whether the text is aligned with the image pair. NLVR2 (Suhr et al., 2018) is used for the evaluation.

Visual Entailment (VE): Given an image-text pair, the task is to classify the relationship between the image and the text into one of three categories: entailment, neutral, or contradictory. The element-wise product of the outputs from the image and text cross-modality encoders is forwarded to a classifier of two linear layers for prediction. SNLI-VE (Xie et al., 2019) is used for evaluation.
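The NLVR fusion-and-classify step above can be sketched as follows (a NumPy sketch; the weight matrices are hypothetical stand-ins for the two linear layers, which are trained end-to-end in the actual model):

```python
import numpy as np

def nlvr_head(z_txt1, z_im1, z_txt2, z_im2, W1, b1, W2, b2):
    """NLVR2 prediction head sketch: fuse each (text, image) pass by
    element-wise product, concatenate both fused vectors, then apply
    a two-layer classifier for the binary aligned / not-aligned decision.

    z_*: (d,) cross-modality encoder outputs for each forward pass.
    W1: (h, 2d), W2: (2, h) hypothetical classifier weights.
    """
    fused = np.concatenate([z_txt1 * z_im1, z_txt2 * z_im2])  # (2d,)
    h = np.maximum(0.0, W1 @ fused + b1)                      # hidden layer (ReLU assumed)
    return W2 @ h + b2                                        # 2-way logits
```

The element-wise product mirrors the ITM fusion of Section 3.2, and concatenation lets the classifier see both (text, image) passes jointly.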

4.2. IMPLEMENTATION DETAILS

We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020) pre-trained on ImageNet (Deng et al., 2009) and a pre-trained RoBERTa (Liu et al., 2019) to initialize the image and text encoders, respectively. We pre-train the model for 50 epochs when the 4M dataset is used and for 30 epochs in all other experiments. A batch size of 512 is used with 16 NVIDIA Tesla V100 GPUs. All parameters are optimized using AdamW (Loshchilov & Hutter, 2017) with a weight decay of 0.05. Following (Xie et al., 2021), we use an image masking ratio of 60%. While a 15% masking ratio is used for text in language models (Devlin et al., 2018; Liu et al., 2019), we use 30% since the paired image can provide additional information for text reconstruction. During pre-training, the learning rate is warmed up to 3 × 10^-4 in the first 5 epochs and decayed to 3 × 10^-5 using a cosine scheduler. The learning rates for the image encoder and the text encoder are set to 1 × 10^-5, which is less than that of the cross-modality encoders. More details can be found in the Appendix.

Table 2: Zero-shot image-text retrieval performance (R@1/R@5/R@10 for image and text retrieval) on Flickr30k; methods compared include ImageBERT (Qi et al., 2020).
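The warmup-plus-cosine schedule described above can be sketched as follows (a minimal sketch with the paper's peak and final rates; evaluating the schedule per step rather than per epoch is our assumption):

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    Mirrors the pre-training schedule: warm up over the first
    `warmup_steps`, then follow a half-cosine to the final rate.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear warmup
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The last warmup step reaches the peak rate exactly, and the final step lands on the minimum rate.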

4.3. EVALUATION ON IMAGE-TEXT RETRIEVAL, VQA, NLVR, AND VE

We compare the finetuned image-text retrieval performance of the proposed MaskVLM with state-of-the-art methods in Table 1. The second column is the number of unique images used for pre-training, and the retrieval performance is evaluated in terms of Recall@k (R@k). We do not directly compare with ALIGN (Jia et al., 2021) since it is trained with more than 300 times the data used for MaskVLM. However, we still highlight the small performance gap between MaskVLM and ALIGN. We achieve the best performance in all Recall@k metrics on both COCO and Flickr30k except for image retrieval R@10 and text retrieval R@5 on Flickr30k. Compared to ALIGN, we even achieve higher R@1 for image retrieval on COCO and text retrieval on Flickr30k. Table 2 shows the zero-shot retrieval performance of the state-of-the-art methods on Flickr30k. MaskVLM achieves a significant improvement over the second best method, ALBEF (Li et al., 2021), by 6.8 points at R@1 for image retrieval. Given that ALBEF is not trained with MIM, we hypothesize that ALBEF achieves performance biased toward text retrieval, while MaskVLM achieves the significant improvement in image retrieval through the additional MIM, which models p(I|T). While FLAVA exploits both MLM and MIM with a pre-trained image tokenizer, using 13 times more data than MaskVLM, MaskVLM still outperforms FLAVA by 9.8 and 19.3 points at R@1 for image and text retrieval, respectively. Compared with CLIP (Radford et al., 2021), which is trained with at least 76 times more data than MaskVLM, we still achieve higher R@1 for image retrieval by 6.3 points. In general, MaskVLM achieves state-of-the-art performance in both finetuning and zero-shot experiments. We also report the accuracies on VQA, NLVR, and VE; MaskVLM achieves the best performance in all these tasks except for the validation split of NLVR2. In particular, MaskVLM is better than the second best method by 0.43, 1.14, and 0.27 points on the test splits of VQA, NLVR2, and SNLI-VE, respectively.
Compared to the base model of SimVLM, we narrow the accuracy gaps to 2.74% and 3.48% in the test-std split of VQA and the test split of VE, respectively. MaskVLM achieves higher accuracy than SimVLM-base on the test split of NLVR2 by 0.21%.

4.4. EVALUATION WITH LIMITED PRE-TRAINING DATA

Figure 4: R@1 plots for image retrieval (left) and text retrieval (right) on COCO using limited pre-training data.

We highlight the performance of MaskVLM in limited data scenarios. In particular, we create three subsets of the 4M pre-training data by sampling 50%, 25%, and 10% of CC and combining them with COCO. The numbers of image-text pairs in the subsets are around 39%, 25%, and 16%, respectively, of the 4M pre-training data, which contains 5.2M pairs. We pre-train models with these subsets of the data and analyze the downstream task performance in comparison with state-of-the-art methods. The results are reported in Table 4. Image and text retrieval R@1 performance on COCO is also visualized in Figure 4. We compare MaskVLM with the most recent state-of-the-art methods, ALBEF (Li et al., 2021) and Codebook (Duan et al., 2022). In Table 4, as the size of the pre-training data decreases from CC 50% + COCO to CC 10% + COCO, the performance gap between MaskVLM and ALBEF increases from 6.39 to 8.71 at R@1 in COCO image retrieval (IR), from 7.46 to 9.04 at R@1 in COCO text retrieval (TR), from 1.17 to 1.79 in VQA, and from 0.31 to 1.24 on the test set of SNLI-VE. In NLVR2 and VQA, MaskVLM trained with CC 10% + COCO achieves higher accuracy than ALBEF trained with CC 50% + COCO, which contains more than twice the image-text pairs of CC 10% + COCO. In Figure 4, while Codebook shows competitive recall performance compared to MaskVLM on the 4M dataset (5.2M pairs), the R@1 differences in image and text retrieval, respectively, increase from 1.4 and 1.0 on the 4M dataset to 3.15 and 3.80 on CC 50% + COCO. Our model trained with CC 25% + COCO outperforms Codebook trained with CC 50% + COCO by 1.90 and 2.76 points in image and text retrieval R@1, respectively.
Since one of the main differences of MaskVLM compared to ALBEF and Codebook is the additional MIM, we believe that the joint modeling of V+L contributes to the better performance in limited data scenarios.

4.5. ABLATION STUDY

We perform an ablation study using different combinations of loss components to highlight the contribution of masked V+L modeling. We compare six models with the same architecture but with different loss functions for pre-training. We pre-train all models on the CC 50% + COCO dataset and compare finetuned and zero-shot retrieval performance on Flickr30k in Table 5. We note that zero-shot evaluation of the MLM + MIM model cannot be performed because the FC layers used to compute ITM and ITC are not trained during pre-training. ITC and ITM are closely related to the retrieval task since they are used for finetuning and for measuring the similarity between images and texts. However, MLM + MIM still achieves significantly better finetuned and zero-shot performance than ITC, which shows that MLM + MIM alone learns meaningful V+L representations. Compared to ITC + ITM in finetuned performance, ITC + ITM + MLM achieves R@1 improved by 0.38 in image retrieval and degraded by 0.3 in text retrieval. When MIM alone is added to ITC + ITM, finetuned R@1 is improved by 0.16 and degraded by 0.8 for image and text retrieval, respectively, over ITC + ITM. On the other hand, when ITC + ITM + MLM + MIM is used, the model achieves a significant improvement in finetuned performance over ITC + ITM + MLM, by 0.92 and 2.10 at R@1 for image and text retrieval, respectively. ITC + ITM + MLM + MIM also obtains the best performance in zero-shot retrieval. This result further supports the advantage of joint modeling of masked V+L signals.

Table 5: Ablation of loss combinations: finetuned and zero-shot image retrieval (IR) and text retrieval (TR) R@1/R@5 on Flickr30k; e.g., ITC alone achieves 65.10/89.88 finetuned IR R@1/R@5.

4.6. QUALITATIVE RESULTS

We perform a qualitative analysis to show the role of multi-modal information in the reconstruction of masked signals by our model. Specifically, we illustrate the prediction of masked text tokens with and without the corresponding images. This illustration highlights how MaskVLM effectively utilizes information from both modalities to complete the masked signal modeling task. Figure 5 shows the reconstruction of masked texts using original images ("Recon (org)") and masked images ("Recon (mask)"). In the top example, when the model is given a masked text and a masked image that does not contain the dog, the reconstruction is performed using only the available information, such as the image patches of green grass. Thus, the prediction is limited to "a brown of motion" or "the green forest". However, when the original image is used for reconstruction, both "a brown and white dog" and "white fence" are reconstructed by accurately attending to the image. In the bottom example, the visible patches of the masked image contain a few people but lack background information. Consequently, the reconstruction with the masked image does not contain any background information, whereas the background "cave" is reflected in the reconstruction with the original image. These examples confirm that MaskVLM has learned to perform masked signal modeling using both V+L information.

5. CONCLUSION

We propose masked vision and language modeling as a pre-training task for learning V+L representations. We provide a probabilistic interpretation to highlight the contribution of the proposed method and validate its advantages in both large-scale and limited data regimes. We consistently achieve state-of-the-art performance on a broad range of V+L tasks.

A APPENDIX

A.1 DETAILS ON FINETUNING FOR DOWNSTREAM TASKS

We explain the implementation details for each of the downstream tasks. For all downstream tasks, we use AdamW (Loshchilov & Hutter, 2017) with a weight decay of 0.05 and the cosine scheduler. An image size of 384 × 384 with RandAugment (Cubuk et al., 2020) is utilized, and the positional encoding is interpolated following (Dosovitskiy et al., 2020). Except for the VQA task, we use the model that achieves the best performance on the validation set to report the performance on the test set. We use the last-epoch model for the VQA evaluation.

Image-Text Retrieval: COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015) are used to report the performance. Specifically, we follow the data splits proposed in (Karpathy & Fei-Fei, 2015), and the average recall over image and text retrieval is used to select the best model on the validation set. The pre-trained model is finetuned for 15 epochs with a batch size of 256 and a learning rate of 1 × 10^-5.

Visual Question Answering (VQA): For a fair comparison with existing methods (Chen et al., 2020b; Li et al., 2021), we use the training and validation sets of VQA v2.0 (Goyal et al., 2017) together with a subset of VQA samples from Visual Genome (Krishna et al., 2017) for training. We report performance on both the test-dev and test-std splits of VQA v2.0. Following (Li et al., 2021), we weight the loss for each answer based on its occurrence among all the answers. The model is finetuned for 15 epochs with a batch size of 256. We use a learning rate of 2 × 10^-5 for the image and text cross-modality encoders, the fusion encoder, and the answer decoder. For the image and text encoders, a learning rate of 1 × 10^-5 is used. The fusion encoder and the answer decoder are initialized with the last block and all three blocks of the pre-trained text cross-modality encoder, respectively.
Natural Language for Visual Reasoning (NLVR): The data splits proposed in (Suhr et al., 2018) are used for finetuning and evaluation. The model is finetuned for 5 epochs with a batch size of 128. Since the classifier is newly added for finetuning, we use a learning rate of 1 × 10^-4 for the classifier and 1 × 10^-5 for the remaining parts of the model. Different from (Duan et al., 2022; Li et al., 2021; Yang et al., 2022), where the models require an additional text-assignment pre-training step before finetuning, we finetune directly for simplicity.

Visual Entailment (VE): We follow the data splits proposed in SNLI-VE (Xie et al., 2019). We finetune the model with a batch size of 256 for 5 epochs. Similar to the NLVR task, a learning rate of 1 × 10^-4 is used for the classifier and 1 × 10^-5 is used for the remaining parts of the model.
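The two-speed learning rates used for NLVR and VE (1 × 10^-4 for the newly added classifier, 1 × 10^-5 elsewhere) amount to splitting parameters into optimizer groups. A minimal sketch, assuming parameters can be identified by a name prefix; the helper name and prefix are hypothetical:

```python
def build_param_groups(named_params, classifier_prefix="classifier",
                       classifier_lr=1e-4, backbone_lr=1e-5):
    """Split (name, parameter) pairs into two optimizer groups so the
    freshly initialized classifier trains with a 10x larger learning rate
    than the pre-trained backbone."""
    classifier, backbone = [], []
    for name, param in named_params:
        (classifier if name.startswith(classifier_prefix) else backbone).append(param)
    return [{"params": classifier, "lr": classifier_lr},
            {"params": backbone, "lr": backbone_lr}]
```

The returned list is in the shape most optimizers (e.g. AdamW) accept as per-group settings.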

A.2 REPRODUCIBILITY

We provide more details of MaskVLM for reproducibility. We use the ImageNet pre-trained ViT (vit_base_patch16_224) from (Wightman, 2019) and the pre-trained RoBERTa (roberta-base) from Hugging Face (Wolf et al., 2020). The detailed model architectures are visualized in Figure 7. Following (Dosovitskiy et al., 2020), the image encoder applies layer normalization (Ba et al., 2016) before each multi-head attention block (pre norm), while the text encoder applies layer normalization after each multi-head attention block (post norm). The image (text) cross-modality encoder adopts the post norm and uses the outputs of the text (image) encoder as keys and values to compute cross-attention. To compute MIM and MLM, the self-attention outputs of the masked image features, v_m, are used as queries and the unmasked text features, w, are used as keys and values in the image cross-modality encoder. For the text cross-modality encoder, the masked text features, w_m, are used as queries and the unmasked image features, v, are used as keys and values. To keep the framework simple, we do not use any loss weighting for each loss term or layer decay during finetuning.
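The query/key/value wiring described above can be illustrated with a single-head cross-attention sketch; the real encoders use multi-head attention with learned projections, which this simplified version omits:

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head cross-attention: queries come from the masked modality,
    keys and values from the unmasked other modality. Scaled dot-product
    scores are softmax-normalized over the key dimension."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output row is a convex combination of the other modality's features
    return weights @ keys_values

# MLM direction: masked text features attend over unmasked image features
w_m = np.random.randn(4, 8)   # masked text features (queries)
v = np.random.randn(16, 8)    # unmasked image features (keys and values)
out = cross_attention(w_m, v)
```

The MIM direction simply swaps the roles: masked image features v_m as queries, unmasked text features w as keys and values.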

A.3 ABLATION STUDY ON MASKING STRATEGIES

We study different masking strategies for computing the MIM and MLM losses. In particular, we compare MaskVLM with one modality masked and the other modality unmasked for reconstruction (MaskVLM (one)) against MaskVLM with both modalities masked at the same time for reconstruction (MaskVLM (both)). We compare these two MaskVLM models with the state-of-the-art method, ALBEF, in Table 6. We follow the experimental setup described in Section 4.1 and report the finetuning performance on image-text retrieval. Masking one modality at a time (MaskVLM (one)) performs slightly better than masking both modalities at the same time (MaskVLM (both)). However, both reconstruction strategies are still effective, as they achieve higher R@1 for image and text retrieval on both COCO and Flickr30k compared to ALBEF.
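The two masking strategies can be sketched as follows; the masking ratios, the "[MASK]" placeholder, and token-level masking are illustrative simplifications (the model masks image patches and text tokens internally):

```python
import random

def mask_tokens(tokens, ratio, mask_token="[MASK]", rng=None):
    """Randomly replace a fraction of tokens with a mask token."""
    rng = rng or random.Random(0)
    out = list(tokens)
    n_mask = max(1, int(len(out) * ratio))
    for i in rng.sample(range(len(out)), n_mask):
        out[i] = mask_token
    return out

def make_masked_pair(image_patches, text_tokens, strategy="one",
                     image_ratio=0.6, text_ratio=0.3):
    """'one': mask a single modality per reconstruction pass, so each masked
    modality is reconstructed conditioned on a fully unmasked one.
    'both': mask the two modalities simultaneously in a single pass."""
    if strategy == "one":
        return [(mask_tokens(image_patches, image_ratio), text_tokens),
                (image_patches, mask_tokens(text_tokens, text_ratio))]
    return [(mask_tokens(image_patches, image_ratio),
             mask_tokens(text_tokens, text_ratio))]
```

Under "one", each pass keeps a clean conditioning signal from the other modality, which matches the slightly better ablation result reported above.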

A.4 ABLATION STUDY ON MASKING RATIO

We perform an ablation study using different masking ratios for masked vision and language modeling. In particular, we pre-train MaskVLM with several combinations of image and text masking ratios on the CC 50% + COCO dataset and report the finetuned image-text retrieval performance on Flickr30k in Table 7. We also report the average of R@k for image and text retrieval. When only the image masking ratio is changed in the first three rows of the table, the difference between the maximum and the minimum of the average recall is 0.26 for image retrieval and 0.10 for text retrieval. This shows that MaskVLM achieves stable performance across the tested image masking ratios. Comparing the second row with the last row, we observe that increasing the text masking ratio from 0.15 to 0.3 leads to higher recall for both image and text retrieval.

We evaluate the image recognition performance of MaskVLM. Following CLIP (Radford et al., 2021), we perform image classification directly using the pre-trained MaskVLM on various image recognition datasets, including UC Merced Land Use (Yang & Newsam, 2010), MIT-67 (Quattoni & Torralba, 2009), CUB-200 (Wah et al., 2011), Oxford Flowers (Nilsback & Zisserman, 2008), Caltech-256 (Griffin et al., 2007), and ImageNet-1K (Deng et al., 2009). We compare the Top-1 accuracy of MaskVLM with ALBEF (Li et al., 2021) in Table 8. Both models are pre-trained with the 4M dataset. Since both MaskVLM and ALBEF are initialized with ImageNet pre-trained weights during the pre-training stage, the evaluation on ImageNet is not strictly zero-shot, but the evaluation on the other datasets is. We formulate image classification as image-to-text retrieval, where the similarity scores between a query image and all the class names are computed to retrieve the top-1 class name.
The similarity scores can be obtained using either ITC or ITM scores, and we report them separately. We also report results with prompt engineering as in CLIP.

We present additional examples for the qualitative analysis of MaskVLM in Figure 8. Similar to Figure 5, masked text tokens are reconstructed with masked images ("Recon (mask)") and original images ("Recon (org)"). We highlight that MaskVLM utilizes both V+L information to reconstruct the text corresponding to the given image.

In Table 9, we report the statistics of the 4M pre-training dataset that MaskVLM is trained on. We note that some data URLs provided in the web datasets can become invalid, which may lead to slightly different numbers of image-text pairs depending on when the datasets are downloaded.
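Classification as image-to-text retrieval can be sketched as follows; `text_embed_fn`, the prompt template, and the cosine-similarity scoring are stand-ins for the pre-trained text encoder and the ITC score, and are assumptions of this sketch:

```python
def classify_image(image_embedding, class_names, text_embed_fn,
                   prompt="a photo of a {}."):
    """Zero-shot classification as image-to-text retrieval: embed each
    (prompted) class name with the text side of the model, score it against
    the image embedding, and return the top-1 class name."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    scores = {c: cosine(image_embedding, text_embed_fn(prompt.format(c)))
              for c in class_names}
    return max(scores, key=scores.get)
```

Swapping the plain class name for the prompted sentence is all that prompt engineering changes in this formulation.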



Figure 2: A framework of joint modeling of masked vision and language. The blue and green lines demonstrate the information flow for image and text reconstruction, respectively. The dotted lines indicate the cross-modal input of unmasked signals for generating attention.

Figure 5: Masked language modeling examples using masked and original images. "Recon (mask)" and "Recon (org)" denote reconstructed text from the masked image and the original image, respectively.

Figure 6: An illustration of model architectures for VQA and NLVR.

Figure 7: Model architectures of (a) Image encoder, (b) Text encoder and (c) Image (text) cross-modality encoder. The dotted lines in (c) denote key and value from the other modality for cross-attention.

Figure 8: Additional masked language modeling examples using masked and original images. "Recon (mask)" and "Recon (org)" denote reconstructed text using the masked image and the original image, respectively.

Comparison with finetuned state-of-the-art methods on image-text retrieval. The gray row indicates that the model is trained with significantly more data than MaskVLM.

Zero-shot image-text retrieval performance on Flickr30k. The gray row indicates that the model is trained with significantly more data than MaskVLM.



Except for SimVLM, whose pre-training data is more than 300 times larger than that of MaskVLM, we consistently achieve the best performance.

Downstream task performance with limited pre-training data.

                       80.10  96.90  55.08  80.90  68.40  90.00
ITC + ITM              79.96  95.56  92.30  98.90  69.50  89.54  82.40  96.60
MLM                    80.34  95.82  92.00  99.30  70.74  90.92  84.40  97.10
ITC + ITM + MIM        80.12  95.56  91.50  99.00  69.26  90.30  82.90  97.20
ITC + ITM + MLM + MIM  81.26  96.00  94.10  99.60  71.18  91.12  85.60  97.50

Image-text retrieval evaluation on Flickr30k with different loss functions for pre-training.

Comparison of finetuned MaskVLMs with different masking strategies and ALBEF on image-text retrieval. (MaskVLM (one): masking one modality at a time for computing the MLM and MIM losses. MaskVLM (both): masking both modalities at the same time for reconstruction.)

Finetuned image-text retrieval performance on Flickr30k with different masking ratios for masked vision and language modeling.

A.5 EVALUATION ON IMAGE RECOGNITION

MaskVLM consistently outperforms ALBEF across all the datasets. In particular, prompt engineering improves the average accuracy across all the datasets for MaskVLM but ALBEF achieves lower average accuracy with prompt engineering. This shows that MaskVLM can better align images with variants of text than ALBEF, which results in stronger V+L representations of MaskVLM for the image recognition task.

Top-1 accuracy of pre-trained MaskVLM and ALBEF on image recognition. ITC and ITM denote the alignment scores utilized to perform image classification.

Statistics of the 4M pre-training dataset.

ACKNOWLEDGMENTS

We thank Jiali Duan for providing results of the Codebook (Duan et al., 2022) with limited pretraining data in Figure 4 .

