VOLTA: VISION-LANGUAGE TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT

Abstract

Figure 1: We introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a VLP paradigm trained with graph optimal transport (GOT) based image-text matching. VoLTA learns fine-grained local visual representations using only global image-caption pairs, eliminating the use of expensive grounding annotations. This figure shows how different words in captions attend to relevant image regions, as produced by the GOT module of VoLTA pre-trained on COCO.

1. INTRODUCTION

Inspired by the escalating unification of transformer-based modeling in the vision (Dosovitskiy et al., 2021; Liu et al., 2021; Chen et al., 2021a) and language (Devlin et al., 2019; Liu et al., 2019) domains, coupled with readily available large-scale image-caption pair data, vision-language pre-training (VLP) (Lu et al., 2019; Li et al., 2020a; Kim et al., 2021; Kamath et al., 2021; Zhang et al., 2021) has recently been receiving ever-growing attention. VLP has not only proven to be the de-facto standard for several VL tasks, but it has also been beneficial for traditional vision-only tasks, such as image classification and object detection. Such wide-ranging applications of VLP can broadly be categorized into two groups: (i) tasks requiring image-level understanding, e.g., image classification; and (ii) tasks requiring region-level understanding, e.g., object detection and segmentation.

One line of work keeps separate image and language encoders and only fuses their representations in the representation space. CLIP (Radford et al., 2021), UniCL (Yang et al., 2022a) and ALIGN (Jia et al., 2021) use an image-text contrastive loss to learn aligned representations. SLIP (Mu et al., 2021) combines self-supervised visual representation learning and contrastive multi-modal learning. M3AE (Geng et al., 2022) and FLAVA (Singh et al., 2022) combine masked image modeling and masked language modeling. Another line of work uses cross-attention to fuse vision and language information at an early stage (Kamath et al., 2021; Dou et al., 2022b; Lu et al., 2019; Li et al., 2020b; Kiela et al., 2019; Kim et al., 2021; Zhang et al., 2021; Li et al., 2022b; Wang et al., 2022c). These works focus on learning semantic-level aligned vision-language representations. In addition, UniTAB (Yang et al., 2022c), OFA (Wang et al., 2022b), GLIP (Li et al., 2022c), and FIBER (Dou et al., 2022a) use expensive grounding image-text-box annotations to learn fine-grained aligned representations.
Our work uses both representation-space alignment and cross-attention fusion, but we do not use any box annotations to learn robust feature-level alignments.

Unsupervised Representation Alignment: Unsupervised multi-modal alignment typically relies on specific metrics. Wasserstein distance (WD) (Peyré et al., 2019), a.k.a. Earth Mover's distance (EMD)-based optimal transport (OT) algorithms have been widely adopted for various domain alignment tasks, including sequence-to-sequence learning (Chen et al., 2019), few-shot learning (Zhang et al., 2020), knowledge distillation (Balaji et al., 2019), unsupervised domain adaptation (Balaji et al., 2019), generative networks (Han et al., 2015; Genevay et al., 2018; Mroueh et al., 2018; 2019), and multi-modal learning (Yuan et al., 2020; Chen et al., 2020d; Kim et al., 2021; Li et al., 2022d; Pramanick et al., 2022). Previous VLP methods (Chen et al., 2020d; Kim et al., 2021) that use OT-based patch-word alignment only utilize WD. However, we argue that jointly modeling Gromov-Wasserstein distance (GWD) (Peyré et al., 2016) and Wasserstein distance results in superior multi-modal alignment for intricate images. To the best of our knowledge, this is the first work to apply WD- and GWD-based optimal transport for feature-level alignment in VLP.

3. METHOD

In this section, we present our proposed framework, VoLTA, which contains three broad modules - (i) intra- and inter-modality redundancy reduction, (ii) weakly-supervised cross-modal alignment (CMA) of local features, and (iii) cross-modal attention fusion (CMAF). Next, we introduce the fine-tuning strategies for the various uni- and multi-modal downstream tasks supported by VoLTA. An overview of the different modules of VoLTA is depicted in Figure 2.

3.1.1. INTRA-AND INTER-MODALITY REDUNDANCY REDUCTION

We use Barlow Twins (BT) (Zbontar et al., 2021), a non-contrastive covariance-regularization objective, as the foundation of VoLTA. The recent success of contrastive vision-language pre-training (VLP) (Radford et al., 2021; Li et al., 2021b; Jia et al., 2021; Kim et al., 2021; Yang et al., 2022a; Dou et al., 2022a;b) has already shown that, compared to a single modality, image-caption pairs offer significantly higher-level abstract and semantic concepts about the training samples. However, common contrastive VLP objectives, like InfoNCE (Oord et al., 2018), are data-hungry, as they require large batch sizes and well-mined hard negatives. In contrast, the BT objective operates on the dimensions of the embeddings across the two views of the training samples. Hence, it has proven robust to batch size and can be trained with lower memory resources. In this work, we extend the BT objective to the multi-modal setup.

The original BT algorithm, which operates on joint embeddings of distorted samples, was proposed only for the image modality. Specifically, for each image in a batch X, two distorted views are obtained using a distribution of data augmentations T with disparate probabilities. These distorted images are then fed into a shared image encoder, which contains a feature extraction network (e.g., ResNet (He et al., 2016)) cascaded with trainable linear projection layers, producing a batch of parallel embeddings z^A and z^B. The BT loss computed on the encoded embeddings is:

$$\mathcal{L}_{BT} \triangleq \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2, \qquad \mathcal{C}_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}} \qquad (1)$$

where λ is a positive weighting factor; C is the cross-correlation matrix computed between z^A and z^B along the batch dimension; b indexes the samples in a batch; and i, j index the dimensions of z^A and z^B. The first term in Equation 1 is the invariance term, which attempts to equate the diagonal elements of the cross-correlation matrix C to 1, whereas the second term is the redundancy-reduction term, which pushes the off-diagonal elements of C to 0.

In this work, we propose to use BT for image-caption pairs. Specifically, we use stochastic data augmentations for both images and texts, and directly apply the BT objective to all 2 × 2 pairs, resulting in additional supervision. Note that this simple, straightforward, and intuitive extension enables us to apply redundancy reduction both within and across modalities, which intuitively results in superior visual representations. Moreover, in this bi-modal setting, we can pre-train a text encoder in parallel with the image encoder and thus generalize our system to a wider range of uni- and multi-modal downstream applications.

Intra-modal Objective: The intra-modal objective refers to applying the BT loss between pairs of image embeddings and between pairs of text embeddings. Given an image-caption pair, we first obtain two augmented views (I, I′) of the image and two augmented views (T, T′) of the text. Then, we apply Equation 1 individually to the image and text pairs:

$$\mathcal{L}^k_{BT} \triangleq \sum_i \left(1 - \mathcal{C}^k_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \left(\mathcal{C}^k_{ij}\right)^2, \quad \forall k \in \{II', TT'\} \qquad (2)$$

Inter-modal Objective: The inter-modal objective refers to applying the BT loss across the image and text embeddings. Since the image and text encoders can output features with different shapes, we design the projector layers with the same output dimension. Hence, in addition to the original BT loss between (I, I′) in Zbontar et al. (2021), we get three more loss terms: (T, T′), (I, T′), and (I′, T), leading to 3× diverse, high-quality additional supervision.
The inter-modal BT losses can also be computed directly following Equation 1:

$$\mathcal{L}^k_{BT} \triangleq \sum_i \left(1 - \mathcal{C}^k_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \left(\mathcal{C}^k_{ij}\right)^2, \quad \forall k \in \{IT', I'T\}$$

The resulting bi-modal BT loss is $\mathcal{L}_{BT} = \sum_k \mathcal{L}^k_{BT}, \; \forall k \in \{II', TT', IT', I'T\}$.
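The bi-modal objective above can be made concrete with a small NumPy sketch. The function names (`barlow_twins_loss`, `bimodal_bt_loss`) are ours, and the embeddings are standardized along the batch dimension, following the original Barlow Twins implementation, which makes C a correlation matrix as in Equation 1:

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """BT loss (Eq. 1) for one pair of embedding batches of shape (batch, dim)."""
    # Standardize each dimension along the batch so C is a correlation matrix.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    n = z_a.shape[0]
    c = (z_a.T @ z_b) / n                                # cross-correlation C
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()            # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy-reduction term
    return on_diag + lam * off_diag

def bimodal_bt_loss(z_i, z_i2, z_t, z_t2, lam=0.005):
    """Sum the BT loss over the four pairs {II', TT', IT', I'T} used by VoLTA."""
    pairs = [(z_i, z_i2), (z_t, z_t2), (z_i, z_t2), (z_i2, z_t)]
    return sum(barlow_twins_loss(a, b, lam) for a, b in pairs)
```

In practice the inputs would be the outputs of the global projector heads; here any `(batch, dim)` arrays work.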

3.1.2. ALIGNMENT OF LOCAL FEATURES

Though inter-modal redundancy reduction provides high-quality semantic supervision, it is computed on the global image and text features and thus only simulates implicit, non-interpretable multi-modal alignment. However, fine-grained region-level downstream applications like detection, segmentation, and referring expression comprehension require local visual feature descriptors with specific spatial information. To achieve this, most existing top-performing VLP methods, including UniTAB (Yang et al., 2022c), OFA (Wang et al., 2022b), GLIP (Li et al., 2022c), and FIBER (Dou et al., 2022a), use high-resolution image-text-box data for fine-grained pre-training. However, bounding box annotations are expensive to collect and use for supervision. Hence, we seek an alternate weakly-supervised solution for local feature-level alignment using global image-caption annotations. Recently, Wasserstein distance (WD) (Peyré et al., 2019), a.k.a. Earth Mover's distance (EMD)-based optimal transport (OT), has been used for weakly-supervised patch-word alignment in VLP (Chen et al., 2020d; Kim et al., 2021). Such OT-based learning methods are optimized for distribution matching by minimizing the cost of a transport plan. We pose patch-word alignment as a more structured graph-matching problem and use the graph optimal transport (GOT) algorithm, which utilizes Gromov-Wasserstein distance (GWD) (Peyré et al., 2016) in conjunction with WD to preserve topological information during cross-modal alignment. More specifically, we obtain patch- and token-level features from the last layers of the corresponding visual and textual transformer encoders and use these encoded local feature vectors to construct modality-specific dynamic graphs - G_x(V_x, E_x) for image patches and G_y(V_y, E_y) for text tokens.
Each node i ∈ {V_x, V_y} in these graphs is represented by its corresponding feature vector, and the intermediate edges e ∈ {E_x, E_y} by thresholded cosine similarity.

Importance of GOT in patch-word alignment: As mentioned previously, GOT adopts two types of OT distances - WD for node matching and GWD for edge matching. In contrast, previous vision-language pre-training algorithms (Chen et al., 2020d; Kim et al., 2021) using OT for patch-word alignment considered only WD. However, we argue that intricate images containing multiple similar objects with different shapes and colors require both WD and GWD for accurate, fine-grained matching. For example, in Figure 3, there are multiple "men" present in the image. WD can only match nodes in the graph; it treats all "men" entities as identical and ignores neighbouring relations like "in blue shirt" and "holding the scissors". By using proper edge matching with GWD, however, we can preserve the graph's topological structure and correctly identify which "man" in the image the sentence is referring to. Hence, we couple WD and GWD in a mutually beneficial manner and use a joint transport plan for accurate patch-word matching. Once G_x and G_y are computed, we follow Chen et al. (2020a) to compute WD and GWD.

Wasserstein Distance (WD) calculates the pairwise distances between two sets of cross-domain node embeddings. Consider two discrete distributions, ϕ ∈ P(X) and ψ ∈ P(Y), where $\phi = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\psi = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ the Dirac delta function centered on x. Since ϕ and ψ are both probability distributions, the weight vectors sum to 1: $\sum_i u_i = 1 = \sum_j v_j$. The WD between ϕ and ψ is defined as:

$$\mathcal{D}_{w}(\phi, \psi) = \min_{\mathbf{T} \in \Pi(\mathbf{u}, \mathbf{v})} \sum_{i} \sum_{j} \mathbf{T}_{ij} \cdot c(x_i, y_j)$$

where $\Pi(\mathbf{u}, \mathbf{v}) = \{\mathbf{T} \in \mathbb{R}^{n \times m}_{+} \mid \mathbf{T}\mathbf{1}_m = \mathbf{u}, \; \mathbf{T}^{\top}\mathbf{1}_n = \mathbf{v}\}$, $c(x_i, y_j)$ is the cosine distance metric, and T is the transport plan, interpreting the amount of mass shifted from $\phi_i$ to $\psi_j$.
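The graph construction described above (nodes as local features, edges as thresholded cosine similarity) can be sketched as follows; the function name and the threshold value are our own placeholders, not the paper's exact setting:

```python
import numpy as np

def build_graph(features, threshold=0.1):
    """Build a dynamic intra-modal graph for GOT.

    features: (num_nodes, dim) patch or token embeddings.
    Returns the cosine-similarity adjacency matrix with edges whose
    similarity falls below `threshold` pruned to zero.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                               # pairwise cosine similarity
    adj = np.where(sim > threshold, sim, 0.0)   # thresholded edges
    np.fill_diagonal(adj, 0.0)                  # no self-loops
    return adj
```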
Under review as a conference paper at ICLR 2023

Gromov-Wasserstein Distance (GWD) assists in edge matching and preserves graph topology by calculating distances between pairs of nodes within each domain and measuring how these distances compare to those of the counter domain. In the same discrete graph-matching setting, the GWD between ϕ and ψ can be written as:

$$\mathcal{D}_{gw}(\phi, \psi) = \min_{\mathbf{T} \in \Pi(\mathbf{u}, \mathbf{v})} \sum_{i, i', j, j'} \mathbf{T}_{ij}\, \mathbf{T}_{i'j'}\, \mathcal{L}(x_i, y_j, x_{i'}, y_{j'})$$

where the intra-graph structural similarity between the two node pairs $(x_i, x_{i'})$ and $(y_j, y_{j'})$ is $\mathcal{L}(x_i, y_j, x_{i'}, y_{j'}) = \left\| c_1(x_i, x_{i'}) - c_2(y_j, y_{j'}) \right\|$, with $c_i$ the cosine similarity between a node pair in graph $G_i$. The transport plan T is periodically updated to align the edges of the graphs belonging to disparate modalities. We further follow Chen et al. (2020a) to combine the WD and GWD transport plans, leading to a unified GOT objective:

$$\mathcal{L}_{GOT}(\phi, \psi) = \gamma \mathcal{D}_{w}(\phi, \psi) + (1 - \gamma) \mathcal{D}_{gw}(\phi, \psi)$$

where γ regulates the relative importance of the two terms.
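A toy NumPy sketch of the GOT objective, assuming an entropy-regularized Sinkhorn solver for the transport plan. The paper follows Chen et al. (2020a), which combines the WD and GWD plans; for brevity this sketch reuses a single plan computed from the node cost, and all names are ours:

```python
import numpy as np

def sinkhorn(cost, u, v, eps=0.1, iters=200):
    """Entropy-regularized OT plan T in Pi(u, v) for a given cost matrix."""
    K = np.exp(-cost / eps)
    a = np.ones_like(u)
    for _ in range(iters):              # Sinkhorn fixed-point iterations
        a = u / (K @ (v / (K.T @ a)))
    b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]  # T = diag(a) K diag(b)

def got_distance(x, y, gamma=0.5, eps=0.1):
    """gamma * WD + (1 - gamma) * GWD for patch features x (n, d) and
    token features y (m, d), with uniform node weights u, v."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-8)
    n, m = len(x), len(y)
    u, v = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    c = 1.0 - x @ y.T                   # node cost: cosine distance
    T = sinkhorn(c, u, v, eps)
    wd = (T * c).sum()
    cx, cy = x @ x.T, y @ y.T           # intra-graph cosine similarities
    # GWD term: |c1(x_i, x_i') - c2(y_j, y_j')| weighted by T_ij T_i'j'
    gwd = sum(T[i, j] * T[ip, jp] * abs(cx[i, ip] - cy[j, jp])
              for i in range(n) for j in range(m)
              for ip in range(n) for jp in range(m))
    return gamma * wd + (1.0 - gamma) * gwd
```

The quadruple loop is O(n²m²) and only meant for small toy inputs; the actual implementation vectorizes this computation.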

3.1.3. CROSS-MODAL ATTENTION FUSION (CMAF)

BT and GOT losses are computed in a dual-encoder setting, which contains no cross-modal interaction and is not sufficient for complex multi-modal feature representation. Most existing methods, including UNITER (Chen et al., 2020d), ViLT (Kim et al., 2021), METER (Dou et al., 2022b) and GLIP (Li et al., 2022c), design cross-modal fusion by stacking additional transformer layers on top of the uni-modal encoders, introducing a large number of added parameters during pre-training. We follow the more efficient solution proposed by FIBER (Dou et al., 2022a), which inserts cross-modal fusion into the uni-modal backbones with a gating mechanism. Specifically, at the top M transformer layers of the vision and language backbones, cross-attention signals, weighted by a gating scalar α, are added to self-attention:

$$\hat{x} = \text{Self-Att}(x), \quad x = x + \hat{x} + \alpha \cdot \text{Cross-Att}(x, y), \quad x = x + \text{FFN}(x) \qquad (7)$$

where α is a trainable parameter initialized to 0. Following existing literature (Li et al., 2021a; Wang et al., 2021a; Dou et al., 2022b;a), we use Masked Language Modeling (MLM) and Image-Text Matching (ITM) to pre-train the cross-attention parameters. For MLM, we randomly mask 15% of the text tokens, and the loss aims to reconstruct the masked tokens. For ITM, we feed the network randomly sampled image-caption pairs, and the loss predicts whether they are matched. The gating mechanism is a good choice for CMAF because (i) the cross-attention parameters can easily be switched off by setting the gating scalar α to 0 when computing the BT and GOT losses; thus, we can learn the cross-attention parameters without affecting the original computational flow of the uni-modal backbones; and (ii) the gating mechanism is more lightweight and memory-efficient than adding fusion-specific layers (GLIP and METER use 4× more fusion parameters than FIBER). Overall, the VoLTA training pipeline can be summarized in the following three steps:

• BT & GOT: CMAF is switched off (α = 0), VoLTA acts as a dual encoder, and L_BT and L_GOT are computed.
• MLM & ITM: CMAF is switched on (α ≠ 0), VoLTA acts as a fusion encoder, and L_MLM and L_ITM are computed.
• Back-propagation: the four losses are added, giving $\mathcal{L}_{total} = \mathcal{L}_{BT} + w_{GOT} \cdot \mathcal{L}_{GOT} + \mathcal{L}_{MLM} + \mathcal{L}_{ITM}$, which is back-propagated through the network.
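The gated fusion step of Equation 7 can be illustrated with a toy single-head NumPy version; in the real model, `attention` stands in for the backbones' multi-head attention, the tanh is a stand-in for the FFN block, and `alpha` is trainable rather than fixed:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (toy stand-in)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fused_layer(x, y, alpha):
    """One gated fusion step (Eq. 7): self-attention plus an
    alpha-weighted cross-attention residual, then an FFN residual."""
    x_hat = attention(x, x, x)                   # Self-Att(x)
    x = x + x_hat + alpha * attention(x, y, y)   # gated Cross-Att(x, y)
    return x + np.tanh(x)                        # stand-in for FFN residual
```

With `alpha = 0`, the layer reduces exactly to the uni-modal computation regardless of the other modality, which is what allows the BT/GOT pass and the MLM/ITM pass to share the same backbones.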

3.2. FINETUNING FOR DOWNSTREAM TASKS

We adapt VoLTA to a wide range of vision and vision-language downstream tasks. For vision-only tasks, we switch off the inserted cross-attention modules and use the image encoder alone. For vision-language tasks, following Dou et al. (2022a), we utilize the learned cross-attention parameters as required. For example, VQA and visual reasoning employ all cross-attention modules, whereas captioning requires only image-to-text cross-attention. Again, during IRTR, we switch off all cross-attentions and use VoLTA in a dual-encoder setting. We keep all cross-attention parameters during multi-modal object detection and referring expression comprehension and train an object detection head from scratch using the language-aware image features.
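The per-task gating described above can be captured in a small configuration table; the task keys and flag names are our own (hypothetical), while the on/off pattern follows the prose:

```python
# Which cross-attention direction stays active when fine-tuning each
# downstream task (dict layout and key names are illustrative only).
CROSS_ATTN_CONFIG = {
    "image_classification":  {"img2txt": False, "txt2img": False},  # vision-only
    "vqa":                   {"img2txt": True,  "txt2img": True},
    "visual_reasoning":      {"img2txt": True,  "txt2img": True},
    "captioning":            {"img2txt": True,  "txt2img": False},  # image-to-text only
    "retrieval":             {"img2txt": False, "txt2img": False},  # dual encoder
    "referring_expression":  {"img2txt": True,  "txt2img": True},
    "object_detection_lang": {"img2txt": True,  "txt2img": True},
}
```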

4. EXPERIMENTS

Pre-training & downstream datasets: First, we pre-train VoLTA on the image-caption pairs of COCO2017 (Lin et al., 2014). For downstream evaluation, we use a variety of uni- and multi-modal tasks, including referring expression comprehension on RefCOCO, RefCOCO+, and RefCOCOg (Kazemzadeh et al., 2014; Yu et al., 2016), and language-conditioned object detection on COCO and LVIS (Gupta et al., 2019). We exclude any overlap between our pre-training and downstream validation/test splits. Detailed statistics of all downstream datasets are given in Appendix C.

Network architectures: Following FIBER (Dou et al., 2022a), we adopt Swin-Base (Liu et al., 2021) and RoBERTa-Base (Liu et al., 2019) as our vision and text encoders, which are initialized with weights from uni-modal pre-training. We collect patch and token features from the last transformer layers and feed them into a local projector network to compute the GOT loss. Furthermore, we apply AvgPool on the patch and token features and feed them into a global projector network to compute the BT loss.

4.1. RESULTS ON UNI-MODAL TASKS

We first experiment on three uni-modal tasks - classification, object detection, and instance segmentation. For a direct comparison with existing ResNet50 and Swin-T baselines, we re-train identical encoders with the VoLTA pipeline. Furthermore, since the uni-modal tasks do not utilize the cross-attention parameters, we perform an ablation by dropping the CMAF module from VoLTA.

Image Classification:

Object Detection & Instance Segmentation: Next, we perform two uni-modal region-level tasks - object detection on VOC07+12 and COCO2017, and instance segmentation on COCO2017. As shown in Table 2, VoLTA yields state-of-the-art performance on both tasks across the majority of metrics. It is worth noting that VoLTA, pre-trained with 123k image-caption pairs, achieves better performance than baselines pre-trained with the 1.3M images of ImageNet, proving the efficacy of VLP with fine-grained patch-token alignment over vision-only pre-training.

4.2. RESULTS ON COARSE-GRAINED VISION-LANGUAGE TASKS

Next, we perform image-level multi-modal downstream tasks - visual question answering (VQA), visual reasoning, retrieval, and captioning.

VQA & Visual Reasoning: As reported in Table 3, VoLTA achieves the best performance on VQA and visual reasoning among the baselines pre-trained with a comparable amount of data. Moreover, on VQA, VoLTA beats LXMERT, which is trained with 2× more data. These results demonstrate the efficacy of our method even when utilizing a mid-scale pre-training corpus.

Retrieval: Most existing VLP methods use a fusion encoder for image and text retrieval and feed every image-text pair into the model. Though such fine-tuning often results in higher performance, it introduces a quadratic time cost and is not scalable. Following Dou et al. (2022a), we adopt a more efficient strategy: we drop the cross-attention parameters for this task and compute the dot product of image and text features extracted separately in the dual-encoder setting. As shown in Table 3, even with this approach, VoLTA produces superior performance among the baselines trained with a similar amount of data, beating all three baselines by a significant margin. It is worth mentioning that, besides achieving superior results to all baselines using a comparable amount of data on multi-modal coarse-grained tasks, VoLTA also outperforms multiple methods pre-trained using orders of magnitude more data. These results, shown in Table F.1, indicate the effectiveness and generalizability of VoLTA across these tasks. Next, we perform region-level multi-modal downstream tasks - referring expression comprehension (REC) and language-guided object detection.
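The dual-encoder retrieval strategy above amounts to a single matrix product over independently extracted features; a minimal NumPy sketch (function name and feature shapes are our own):

```python
import numpy as np

def retrieve(img_feats, txt_feats):
    """Dual-encoder retrieval: similarity is a plain dot product of
    L2-normalized image and text features extracted separately, so the
    cost is one matrix multiply instead of a fusion-encoder forward
    pass per image-text pair."""
    i = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = i @ t.T                 # (num_images, num_texts) similarities
    return sim.argmax(axis=1)     # top-1 text index per image
```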

4.3. RESULTS ON FINE-GRAINED VISION-LANGUAGE TASKS

REC: This task aims to localize target objects in an image described by a referring expression phrased in natural language and thus perfectly evaluates the fine-grained feature representation capability of VoLTA. As depicted in Table 4, VoLTA beats the larger UNITER-L and VILLA-L models on the challenging testB split of both RefCOCO and RefCOCO+. Moreover, VoLTA performs comparably with MDETR and UniTAB, even without being trained on grounding data. These results indicate our model's efficacy in learning fine-grained local visual features.

Object Detection: We evaluate VoLTA on two challenging language-conditioned object detection benchmarks - COCO and LVIS. Note that all existing baselines for this task are pre-trained on fine-grained image-text-box data, whereas VoLTA only utilizes image-caption pairs. As shown in Table 5, VoLTA performs competitively with these strong baselines. Notably, VoLTA beats Mask R-CNN, MDETR, and GLIP-B on LVIS APr, which denotes average precision on rare objects. Thus, we conclude that VoLTA achieves impressive localization ability and robustness, even without utilizing any grounding annotations.

5. CONCLUSION

We present VoLTA, a unified VLP paradigm that utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly supervised patch-token alignment and produces an explicit, self-normalized, and interpretable low-level matching criterion. Extensive experiments demonstrate the effectiveness of VoLTA on a wide range of coarse-and fine-grained tasks. In the future, we plan to pre-train VoLTA on larger-scale datasets.

A PSEUDO-CODE OF VOLTA

The training pseudo-code for VoLTA is as follows:

Algorithm 1: PyTorch-style pseudocode for VoLTA.
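The following PyTorch-style pseudocode sketches one training step, consistent with the three steps summarized in Section 3.1.3; all helper names (`augment`, `set_gates`, `w_got`, etc.) are placeholders rather than the actual implementation:

```
# PyTorch-style pseudocode for one VoLTA training step (sketch)
for (img, txt) in loader:
    I, I2 = augment(img), augment(img)       # two distorted image views
    T, T2 = augment(txt), augment(txt)       # two augmented text views

    # Step 1: dual-encoder pass, fusion gates off (alpha = 0)
    set_gates(model, 0)
    zI, zI2 = model.project(I), model.project(I2)
    zT, zT2 = model.project(T), model.project(T2)
    loss_bt  = sum(BT(a, b) for a, b in [(zI, zI2), (zT, zT2), (zI, zT2), (zI2, zT)])
    loss_got = GOT(model.patches(I), model.tokens(T))

    # Step 2: fusion-encoder pass, gates back on (trainable alpha)
    restore_gates(model)
    loss_mlm = MLM(model, I, mask_tokens(T))            # reconstruct masked tokens
    loss_itm = ITM(model, I, T)                         # matched-pair prediction

    # Step 3: add the four losses and back-propagate
    loss = loss_bt + w_got * loss_got + loss_mlm + loss_itm
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```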

B OVERVIEW OF VISION-LANGUAGE PRE-TRAINING MODELS

Vision-Language Pre-trained (VLP) models have proven extremely beneficial for multi-modal tasks in recent years. Earlier works were predominantly focused on using pre-trained object detectors to extract patch (region) level information from the corresponding images (Lu et al., 2019; Li et al., 2020a; Tan & Bansal, 2019; Chen et al., 2020d; Su et al., 2019). In some of these models, such as ViLBERT (Lu et al., 2019) and LXMERT (Tan & Bansal, 2019), multi-modality fusion is achieved via co-attention, which utilizes a third transformer containing fused information independently obtained from the respective vision and language encoder transformers. On the contrary, VisualBERT (Li et al., 2020a), VL-BERT (Su et al., 2019) and UNITER (Chen et al., 2020d) employ a merged-attention strategy to fuse both image patches and text features together into a unified transformer through corresponding image and text embedders. In addition to image patches and texts in a unified transformer, OSCAR (Li et al., 2020b) uses object tags as inputs. VinVL (Zhang et al., 2021) follows a similar strategy to that of OSCAR, the only difference being its novel 3-way contrastive loss, which optimizes the training objectives used for VQA and text-image matching. VL-T5 (Cho et al., 2021) exploits bounding-box coordinates, image IDs, and region IDs along with RoI features for visual embedding. Encoded visual and textual features are fed into a bi-directional multi-modal encoder and an auto-regressive text decoder for pre-training. In all the aforementioned methods, pre-trained object detectors are usually frozen during training, and extracting region-level features from images can be tedious. To address these shortcomings, end-to-end pre-training methods have been developed. PixelBERT (Huang et al., 2020) uses a convolutional neural network (CNN)-based visual encoder and a sentence encoder to obtain image and text representations, respectively.
These representations are then fed to a subsequent transformer via cross-modality alignment. SOHO (Huang et al., 2021) uses grid-feature discretization via a learned vision dictionary, which is then fed into a cross-modal module. SimVLM (Wang et al., 2021b) uses a CNN and text token embeddings for image and text feature extraction, along with a unified encoder-decoder transformer trained on a PrefixLM objective. Finally, MDETR (Kamath et al., 2021) uses a CNN and RoBERTa (along with corresponding projection layers) for image and text feature extraction; the features are concatenated before passing through a unified transformer trained on 1.3M Image-Text-Box (I-T-B) annotated data. In recent years, the rise of Vision Transformers (ViT) (Dosovitskiy et al., 2021) has motivated the research community to build all-transformer frameworks by incorporating ViTs (instead of CNN backbones) in VLP models. Image patch features and text token embeddings are fed directly into a ViT model for pre-training in ViLT (Kim et al., 2021). Visual Parsing (Xue et al., 2021), ALBEF (Li et al., 2021a) and METER (Dou et al., 2022b) use ViTs as vision encoders for image feature generation. For multi-modal fusion, ALBEF and METER use co-attention in their pre-training frameworks. Another class of VLP models, in the form of CLIP (Radford et al., 2021), DeCLIP (Li et al., 2021b) and ALIGN (Jia et al., 2021), has been introduced lately. Although known for their impressive zero-shot recognition ability and excellent transferability to downstream tasks, these models typically rely on huge amounts of image-text pairs for pre-training. Contrastive loss forms the core component of the pre-training objectives in these VLP models. In such models (e.g., CLIP (Radford et al., 2021), DeCLIP (Li et al., 2021b)), separate encoders are used for each modality.
On the contrary, Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) (You et al., 2022) leverages knowledge distribution across multiple modalities (image and text) through parameter sharing. In this unified framework, the parameters shared between the two modalities include the attention and feed-forward modules and the LayerNorm layers. GLIP (Li et al., 2022c) and GLIPv2 (Zhang et al., 2022) use a localization loss along with a word-region alignment loss for pre-training the corresponding encoders using image-text-box annotations. BLIP (Li et al., 2022b) employs image and text encoders connected through cross-modality multi-head attention, pre-trained on image-text pairs using contrastive and language modeling objectives. OmniVL (Wang et al., 2022a) utilizes a unified image (and video) encoder and a text encoder pre-trained on image-text, image-label, video-text and video-label pairs using unified vision-language contrastive, vision-language matching and language modeling losses. Furthermore, a visual-grounded alignment decoder is also present to facilitate better learning and alignment between the various modalities. X-VLM (Zeng et al., 2022) employs a vision transformer to extract features from the subsets of patches representing images/regions/objects. These patch features are then paired with associated text features for contrastive learning, matching, and MLM. Additionally, image and text pairings are also used for bounding-box prediction, which locates visual concepts in the image. CMAL (Ma et al., 2022) proposes interactions between features (obtained from the respective image and text encoders) via cross-modal associative mappings, which help in fine-grained semantic alignment between the learned representations.
LOUPE (Li et al., 2022a) implements token-level and semantics-level Shapley interaction modeling with a global image-text contrastive loss (in a dual-encoder setting) for explicit learning of fine-grained semantic alignment between visual regions and textual phrases without using expensive bounding-box annotations. FILIP (Yao et al., 2022) removes the need for cross-modality attention fusion by modeling the fine-grained semantic alignment between visual and textual tokens via a novel cross-modal late-interaction mechanism in the contrastive loss. TCL (Yang et al., 2022b) introduces triple contrastive learning, combining cross-modal and intra-modal contrastive objectives. FIBER (Dou et al., 2022a) fuses the vision and language encoder backbones through merged co-attention, pre-trained on 4M data in two stages (coarse- and fine-grained). Image-text pairs are used in the coarse-grained pre-training stage, which is then followed by a fine-grained pre-training stage with image-text-box annotations. However, these bounding-box annotations come with extra overheads. Therefore, in our model, VoLTA, we propose an alternate solution for optimal-transport-based local feature-level alignment using global image-caption annotations, which performs well not only on coarse-grained tasks (such as VQA and Image Captioning), but also on fine-grained tasks (such as Referring Expression Comprehension and Object Detection). Table B.1 encapsulates an overview of all the aforementioned methods.

C DOWNSTREAM DATASETS

Our downstream tasks can be categorized into three groups: uni-modal, multi-modal coarse-grained, and multi-modal fine-grained. Uni-modal: For uni-modal tasks, we fine-tune (and validate) our pre-trained model on ImageNet-1k (Deng et al., 2009) for image classification, VOC07+12 (Everingham et al., 2010) for image classification and object detection, and COCO (Lin et al., 2014) for image classification, object detection and instance segmentation. Multi-modal Coarse-grained: Here, we fine-tune (and validate) our pre-trained model on VQAv2 (Antol et al., 2015) for visual question answering, NLVR2 (Suhr et al., 2019) for visual reasoning, Flickr30k (Plummer et al., 2015) for image and text retrieval, and COCO (Lin et al., 2014) for image captioning. Multi-modal Fine-grained: For these tasks, we fine-tune (and validate) our pre-trained model on RefCOCO, RefCOCO+, and RefCOCOg (Kazemzadeh et al., 2014; Yu et al., 2016) for referring expression comprehension, and COCO (Lin et al., 2014) and LVIS Mini (Gupta et al., 2019) for language-conditioned object detection. It is to be noted that several multi-modal downstream tasks are built on the COCO dataset, with their validation and test splits scattered across the raw COCO splits. Therefore, when pre-training our model, we carefully select the portion of the COCO dataset that does not overlap with the validation/test splits of these multi-modal downstream tasks.

D IMPLEMENTATION DETAILS & HYPER-PARAMETER VALUES

D.1 DATA AUGMENTATION

We use ResNet50/Swin-T/Swin-B (He et al., 2016; Liu et al., 2021) as image encoders and RoBERTa (Liu et al., 2019) as the text encoder. Each encoder is followed by a corresponding projector network, a 3-layer MLP with the configuration [d-2048-2048-1024], where d represents the embedding dimension of the encoder's output.

Image Augmentations: Two sets of random transformations sampled from an augmentation pool are applied to each input image to generate two disparate distorted views. The augmentation policy is composed of RandomResizedCrop, RandomHorizontalFlip, ColorJitter, RandomGrayscale, GaussianBlur, and Solarization, where RandomResizedCrop is applied with a probability of 1.0, whilst the remaining ones are applied randomly with varying probabilities following Zbontar et al. (2021), as outlined in Table D.1.

Our proposed VLP model, VoLTA, comprises a vision encoder and a language encoder with merged co-attention for cross-modality fusion. While conducting experiments, we have considered two types of vision encoder backbones - ResNet-50 (He et al., 2016) and Swin Transformer (Liu et al., 2021). It is to be noted that, for fair comparisons with related works (Dou et al., 2022b;a), the input image resolution for the ResNet-50 backbone is kept at 224 × 224, whereas for Swin-B it is 384 × 384. The output embedding dimension of the image encoder in both cases is 1024. Similarly, to be consistent with Dou et al. (2022b;a), we select RoBERTa as the language encoder with a vocabulary size of 50265, the 'roberta-base' tokenizer, a maximum input text length of 30, and an output embedding dimension of 768 (please refer to Table D.2 for more details).
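Since Table D.1 is not reproduced here, the per-view probabilities below are stated from the original asymmetric Barlow Twins recipe of Zbontar et al. (2021), which the text says is followed; treat them as assumed values rather than VoLTA's own table:

```python
# Per-view augmentation probabilities (view 1, view 2), following the
# asymmetric recipe of Zbontar et al. (2021); assumed, not copied from
# Table D.1.
AUG_PROBS = {
    "random_resized_crop": (1.0, 1.0),
    "horizontal_flip":     (0.5, 0.5),
    "color_jitter":        (0.8, 0.8),
    "grayscale":           (0.2, 0.2),
    "gaussian_blur":       (1.0, 0.1),
    "solarization":        (0.0, 0.2),
}
```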
Vision and language encoders are individually followed by projector heads, each consisting of 3 linear layers with 2048 output units (except for the last one, which has 1024 output units), with a Batch Normalization layer and ReLU activation after the first two layers (for the exact configuration, please refer to Table D.2). The final projected output is the feature representation of the input (image and/or text) used for downstream tasks. To learn these representations, the embeddings (i.e., outputs from the respective encoders) are fed to the loss function of VoLTA, which includes four components: multi-modal Barlow Twins for intra- and inter-modality redundancy reduction, GOT for alignment of local features, and MLM and ITM together for encouraging cross-modal attention fusion. For MLM, we randomly mask 15% of the input tokens (the MLM probability in Table D.2), and the model is trained to reconstruct the original tokens. For ITM, the model predicts whether a given image-text pair is matched.

For optimization, we follow the same protocol as described in Zbontar et al. (2021): we use the LARS (You et al., 2017) optimizer to train our model for 20 epochs with a batch size of 256. A base LR of 0.1 is used for the weights and 0.0048 for the biases and batch normalization parameters, which are then multiplied by a factor of 2. We employ a linear learning-rate warm-up for 2 epochs followed by a cosine decay schedule that reduces the LR by a factor of 1000. A weight decay of 1e-6 is used, excluding the biases and batch normalization parameters.

Vision-Language Classification (VQAv2 and NLVR2): The vision-language classification task encompasses VQAv2 and NLVR2, whose hyper-parameter setup is taken from METER (Dou et al., 2022b) and FIBER (Dou et al., 2022a).
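The warm-up plus cosine schedule described above can be sketched as follows. This is an illustrative helper under stated assumptions (per-step granularity; `learning_rate` is our name), not the exact Zbontar et al. (2021) implementation:

```python
import math

def learning_rate(step, total_steps, warmup_steps, base_lr, end_factor=1000):
    """Linear warm-up followed by cosine decay that reduces the LR
    by `end_factor` (1000x here) by the end of training."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    end_lr = base_lr / end_factor
    return end_lr + 0.5 * (base_lr - end_lr) * (1 + math.cos(math.pi * progress))

# e.g., 20 epochs total with the first 2 epochs as warm-up, base LR 0.1
```

With this shape, the LR rises linearly to the base value during warm-up, then follows a half-cosine down to base_lr / 1000.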
Model fine-tuning is done with peak learning rates of 2e-5 for the backbones, 1e-4 for the cross-modal parameters, and 1e-3 for the head layer, for 10 epochs with a batch size of 512. The image resolutions are set to 576 for VQAv2 and 384 for NLVR2, and the models are evaluated with VQA-Score for VQAv2 and accuracy for NLVR2 (Table C.1).

Image-Text Retrieval (IRTR): We follow Dou et al. (2022a) for the IRTR setup on the Flickr30k dataset, where the cross-attention layers in the backbones are removed during IRTR fine-tuning and evaluation. The peak learning rates are set to 2e-5 for the backbones and 1e-4 for the head layer, a batch size of 1024 is used, and the image resolution is set to 576. We evaluate with the Recall@1 metric for both text and image retrieval, as outlined in Table C.1.

Image Captioning: For image captioning, only the image-to-text attentions are kept for cross-modality attention fusion, and the model is converted into a standard seq2seq model (Dou et al., 2022a). We use a causal mask on the decoding side, and the outputs are predicted auto-regressively (Dou et al., 2022a). Models are trained with the cross-entropy loss for 5 epochs with peak learning rates of 5e-5 for the backbones and 2.5e-4 for the rest of the parameters, followed by a two-stage fine-tuning. In the first stage, fine-tuning with GOLD (Pang & He, 2021) is done for 5 epochs with a peak learning rate of 1e-5 for the backbones, since it is efficient and has been proven effective when the model input can correspond to different outputs. The second stage involves CIDEr optimization, where the learning rate is further reduced to 1e-6 and the model is trained for 3 epochs. A batch size of 512 is used in both stages, and a beam size of 5 is used during inference. Evaluation metrics include BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016) scores, as shown in Table C.1.

E ABLATION

We have conducted ablation studies on the pre-training objectives, the GOT loss weight, and the projector dimensions, which are summarized below.

E.3 ABLATION ON GOT LOSS WEIGHT

In our loss formulation, we introduce a GOT loss weight w_GOT which regulates the alignment of local features through the GOT loss. By conducting a grid search on uni-modal downstream classification tasks, we assessed the impact of w_GOT, as shown in Table E.4, and experimentally found its best value to be 100 in our case. Note that a very high value of w_GOT considerably degrades performance on downstream tasks.
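In our notation, the overall pre-training objective combining the four losses can be written as follows (the unit weights on the MLM and ITM terms are an assumption for this sketch; γ is the GOT-internal mixing coefficient from the pseudo code in Appendix A):

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{BT}}
  + w_{\mathrm{GOT}}\,\mathcal{L}_{\mathrm{GOT}}
  + \mathcal{L}_{\mathrm{MLM}}
  + \mathcal{L}_{\mathrm{ITM}},
\qquad
\mathcal{L}_{\mathrm{GOT}}
  = \gamma\,\mathcal{L}_{\mathrm{GWD}} + (1-\gamma)\,\mathcal{L}_{\mathrm{WD}},
```

with w_GOT = 100 chosen by the grid search described above.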

E.4 ABLATION ON PROJECTOR DIMENSION

The design of the projector head plays a pivotal role in the downstream performance of the model (Garrido et al., 2022). To investigate the impact of the hidden and feature (projector output) dimensions, we tested 4 different configurations on uni-modal downstream classification tasks. As Table E.5 shows, an increase in the number of parameters in the projector head does not necessarily lead to an increase in performance.

Language-conditioned Object Detection: Object detection is an indispensable constituent of several multi-modal understanding systems. However, the conventional object detection pipeline, employed as a black-box tool, predicts all possible objects in the image. For better apprehension of combinations of these objects in free-form texts, the language-conditioned object detection task is considered instead (Kamath et al., 2021; Dou et al., 2022a). We use pre-trained VoLTA for fine-tuning and evaluation on the COCO and LVIS datasets for the text-conditioned object detection task. As illustrated in Figure G.3, VoLTA predicts bounding boxes relevant to the text prompts (captions) and labels them with the corresponding spans from the text.



Image and text augmentation details can be found in Appendix D.1.

Following BERT, we decompose this 15% into 10% random words, 10% unchanged, and 80% replaced with the special [MASK] token.
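The 15% masking with the 80/10/10 decomposition can be sketched as follows. This is a toy illustration with a hypothetical word-level vocabulary; the real implementation operates on RoBERTa token ids:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "ball", "park"]  # toy vocabulary for illustration

def mask_tokens(tokens, mlm_prob=0.15, seed=0):
    """Select `mlm_prob` of positions as prediction targets; replace 80%
    of them with [MASK], 10% with a random word, and leave 10% unchanged."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            targets[i] = tok  # the model must reconstruct the original token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: keep the token unchanged
    return out, targets
```

The model is then trained to predict `targets[i]` at every selected position i, regardless of whether the input there was masked, replaced, or left intact.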



Figure 2: Computation of four different objectives, LBT, LGOT, LMLM, and LITM, by the proposed framework, VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment). Inspired by Dou et al. (2022a), VoLTA inserts cross-modal attention fusion (CMAF) inside uni-modal backbones with a gating mechanism. During VoLTA pre-training, every forward iteration consists of three steps: (i) CMAF is switched off, VoLTA acts as a dual encoder, and LBT and LGOT are computed. (ii) CMAF is switched on, VoLTA acts as a fusion encoder, and an image-masked caption pair is fed into the model to compute LMLM. (iii) CMAF is kept on, and a randomly sampled image-caption pair is fed into the model to compute LITM. Such a fusion strategy results in a lightweight and flexible model compared to using fusion-specific transformer layers.

Figure 3: Example of an intricate image containing multiple similar entities, and the visual attention map corresponding to the marked word in the caption, produced by the GOT module of VoLTA.

The resulting loss is back-propagated into the model end-to-end. An ablation on w_GOT is given in Appendix E.3. The overall VoLTA pipeline for the computation of the different training objectives is shown in Figure 2. The pseudo code for VoLTA can be found in Appendix A.
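The WD term inside GOT requires solving an optimal transport problem between image-patch and word nodes. As a self-contained illustration, here is a generic entropic-regularized Sinkhorn solver over uniform marginals; the actual GOT module may use a different (e.g., IPOT-style) solver and cost construction, so this is a stand-in, not the paper's implementation:

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=100):
    """Entropic-regularized OT between uniform marginals given a cost
    matrix C (n x m). Returns the transport plan and the transport cost."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):       # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]  # transport plan
    return T, float((T * C).sum())   # plan and Wasserstein-style cost
```

Given a cosine-distance cost matrix between patch and word embeddings, the resulting plan T encodes the soft patch-word matching visualized in Figures 1 and G.5.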

# f_I: Image Encoder, f_T: Text Encoder
# task_names: string containing task names
# I: Image input, T: Text input, N: Batch size, D: Projector dim
# BT: Barlow Twins loss function
# WD, GWD: Wasserstein and Gromov-Wasserstein loss functions
# MLM, ITM: MLM and ITM loss functions, respectively
# gamma: coefficient of GWD loss in GOT
# w_GOT: weight of GOT loss

def GOT(x_1, x_2, f_1, f_2):
    # compute embeddings
    z_A, z_B = f_1(x_1), f_2(x_2)  # N x D
    # normalize representations along the batch dimension
    z_A_norm = (z_A - z_A.mean(dim=0)) / z_A.std(dim=0)
    z_B_norm = (z_B - z_B.mean(dim=0)) / z_B.std(dim=0)
    # cosine distance matrix
    c = cosine_dist_matrix(z_A_norm, z_B_norm)
    # Wasserstein distance (node matching)
    loss_w = WD(c, z_A.size(0), z_A.size(1), z_B.size(1))
    # Gromov-Wasserstein distance (edge matching)
    loss_gw = GWD(z_A.transpose(2, 1), z_B.transpose(2, 1))
    return gamma * torch.mean(loss_gw) + (1 - gamma) * torch.mean(loss_w)

def VoLTA(loader):
    total_loss = torch.tensor(0.)
    for I, T in loader:  # load a batch with N samples
        # two augmented versions of I, T
        I1, I2 = augment_image(I)
        T1, T2 = augment_text(T)
        if "BTGOT" in task_names:
            # BT loss: intra- and inter-modal terms
            intra_loss = BT(I1, I2, f_I) + BT(T1, T2, f_T)
            inter_loss = BT(I1, T1, f_I, f_T) + BT(I2, T2, f_I, f_T)
            total_loss += intra_loss + inter_loss
            # GOT loss
            GOT_loss = GOT(I1, T1, f_I, f_T) + GOT(I2, T2, f_I, f_T)
            total_loss += w_GOT * GOT_loss
        # cross-attention is enabled
        if "MLM" in task_names:
            ...
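The pseudo code above invokes BT(...) without spelling it out. A minimal NumPy sketch of the Barlow Twins cross-correlation objective (Zbontar et al., 2021) looks as follows; the off-diagonal weight `lam` is an assumed value in the spirit of that paper, not a constant quoted from ours:

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective: push the cross-correlation matrix of two
    N x D embedding batches toward the identity (invariance on the
    diagonal, redundancy reduction off the diagonal)."""
    n = z_a.shape[0]
    # normalize each dimension along the batch
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.T @ z_b / n  # D x D cross-correlation matrix
    on_diag = ((np.diagonal(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag
```

In VoLTA, the same objective is applied both intra-modally (two views of an image, two views of a caption) and inter-modally (an image view paired with a caption view).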

uses global cross-modal alignment, intra-modal alignment, and local mutual-information maximization losses along with Masked Language Modeling and Image-Text Matching to learn robust image-text representations during pre-training. UniCL (Yang et al., 2022a) utilizes a unified learning method with a two-way contrastive loss (image-to-text and text-to-image) in the image-text-label space, which can learn representations from either image-label or image-text data, or both. UniTAB (Yang et al., 2022c) employs a transformer-based encoder-decoder framework that can jointly output open-ended text and boxes, encouraging alignment between words and boxes.

F RESULTS ON COARSE-GRAINED VISION-LANGUAGE TASKS: COMPARISON WITH METHODS USING MORE PRE-TRAINING DATA

Table F.1 presents a comparison of VoLTA on multi-modal coarse-grained tasks with state-of-the-art methods pre-trained using an order of magnitude more data. On VQA, VoLTA beats ViLBERT, UNITER-B, VILLA-B, UNIMO-B, and ViLT-B, each pre-trained on 3-4M samples. Note that VoLTA is trained only on COCO and VG, whereas the other methods use a combination of the COCO, VG, CC, and SBU datasets. Such strong performance demonstrates the generalizability of VoLTA. On captioning, VoLTA beats Unified VLP, OSCAR, UFO-B, ViTCAP, VinVL-B, METER-CLIP-B, and XGPT. However, for IRTR and NLVR, VoLTA cannot yield better performance than these baselines. We assume that the large domain difference between the pre-training and downstream datasets is the reason behind the limited performance on IRTR and NLVR.

G QUALITATIVE RESULTS

Visual Question Answering and Visual Reasoning: Visual question answering (VQA) is a widely recognized multi-modal task which infers an answer in response to a text-based question about an image. In Figure G.1 we demonstrate several example image-question pairs along with the corresponding answers predicted by VoLTA on the VQAv2 validation set. The primary aim of the visual reasoning task is to ascertain the veracity of a natural language statement against an associated image pair. Figure G.2 displays examples of responses (True/False) predicted by VoLTA on the NLVR2 validation set.

Figure G.1: Examples on Visual Question Answering from VQAv2 validation dataset. We display a variety of examples (e.g., number of items, color of objects, type of objects, events and actions) with respective answers predicted by VoLTA.

Figure G.2: Examples on Visual Reasoning from the NLVR2 validation dataset. For each statement (text prompt), 2 images are shown alongside each other, and VoLTA predicts whether the given statement is True (green box) or False (red box).

Figure G.3: Examples of Object Detection from COCO validation dataset with various text prompts. Our model predicts boxes relevant to the text (caption) and labels them with the corresponding spans.

Figure G.4: Examples of Referring Expression Comprehension from RefCOCO (top), RefCOCO+ (middle) and RefCOCOg (bottom) validation datasets. The expressions in RefCOCOg typically have florid and longer constructions as compared to RefCOCO and RefCOCO+. The model has access to the entire text and uses it to disambiguate amongst different objects in the image.

Figure G.5: This figure shows how different words in captions attend relevant image regions, produced by the GOT module of VoLTA pre-trained on COCO. Extension of Figure 1. All image-caption pairs are taken from the COCO2017 train split.

Uni-modal downstream: linear image classification. We benchmark the learned representations on the image classification task by training linear classifiers on fixed features. We report top-1 accuracy on the ImageNet-1k validation set, classification mAP on VOC07, and per-class (PC) and overall (O) F1 scores on COCO. Numbers with † are re-implemented by Yuan et al. (2021), and numbers with ‡ are re-implemented by us. Methods trained with significantly larger datasets are colored gray. Best results are in bold.

Uni-modal downstream: object detection and instance segmentation with fine-tuning. We benchmark the learned representations on the VOC07+12 object detection task using Faster R-CNN (Ren et al., 2015), and on COCO2017 object detection and instance segmentation using Mask R-CNN (He et al., 2017), both with the C4 backbone variant (Wu et al., 2019). Best results are in bold.

Multi-modal coarse-grained downstream tasks: visual question answering, visual reasoning, retrieval, and captioning. We only compare with methods pre-trained on a comparable amount of data. For captioning, 4 metrics are reported - B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE. Best results are in bold. VoLTA-B denotes the Swin-B backbone.

Both local and global projector networks have 3 linear layers with dimensions 2048-2048-1024, with batch normalization and ReLU after the first two layers. An ablation on the projector dimension is given in Appendix E.4. During downstream tasks, we use the image and text features after the AvgPool layer. For CMAF, we insert cross-attention into the top 6 blocks of the vision and text encoders. Moreover, for direct comparison with existing uni-modal baselines, we re-train VoLTA with ResNet50 (He et al., 2016) and Swin-Tiny image encoders.

Multi-modal fine-grained downstream: referring expression comprehension. Methods pre-trained on Im-Txt-Box data or a significantly larger amount of Im-Txt data are colored gray. Best comparable results are in bold. VoLTA-B denotes the Swin-B backbone.

We perform captioning on the COCO dataset to evaluate whether VoLTA can adapt to a generation task. We integrate GOLD (Pang & He, 2021) into VoLTA during fine-tuning as it produces significant improvements. As shown in Table 3, our approach maintains superior captioning performance over all baselines pre-trained with comparable data. Using CIDEr optimization further improves performance.

Multi-modal fine-grained downstream: language-conditioned object detection on COCO and LVIS. All available baselines are pre-trained on Im-Txt-Box data, and are colored gray. VoLTA-B denotes Swin-B backbone.

1: Overview of VLP models. OD: object detector. Xformer: transformer. Emb.: embedding. MLM/MIM: masked language/image modeling. ITM: image-text matching. WRA: word-region alignment. ITC: image-text contrastive learning. Grnd: Grounding. Cap: Captioning. TP: Token Prediction. CA: Contrastive Alignment. NNS: Nearest Neighbour Supervision. MVS: Multiview Supervision. SL: Sim-siam Loss. MHA: Multi-head attn. LM: Language Modeling. UniVLC: Unified Vision Language Contrastive. VLM:

1: Dataset statistics for uni-modal and multi-modal downstream tasks.

1: Image and text augmentation details.

Text Augmentations: Two sets of random transformations are applied to the input text using EDA (Wei & Zou, 2019), including synonym replacement, random insertion, random swap, and random deletion with different probabilities, as outlined in Table D.1.

D.2 PRE-TRAINING SETUP

Table D.2 shows the details of the hyper-parameters used during training.

1: Ablation study on different losses of the training objective of VoLTA for multi-modal coarse-grained downstream tasks. Each model is pre-trained on 231k samples from COCO2017 and VG.

We follow Dou et al. (2022a) for training and evaluation on 3 different datasets (RefCOCO, RefCOCO+, and RefCOCOg), where the models are fine-tuned with a batch size of 16 for 20 epochs. A warmup of 2000 steps with a peak LR of 1e-5 is used for both the OD head and the rest of the model's parameters. The LR drops twice, once at 67% and again at 89% of the total number of steps. Horizontal-flip augmentation is turned off during REC training because it was observed in Dou et al. (2022a) that horizontal flip adversely affected performance, particularly on the RefCOCO dataset. Accuracy is used as the evaluation metric in this case (Table C.1). We follow the training and evaluation setup of Dou et al. (2022a) for text-conditioned (multi-modal) object detection. For both the COCO and LVIS datasets, the model is fine-tuned for 24 epochs with a batch size of 32, a LR of 1e-5, and two learning-rate drops, once at 67% and again at 89% of the total number of steps. AP scores are used for model evaluation (Table C.1).
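The two-drop schedule described here can be sketched as follows. The drop factor is not stated above; a factor of 10, the common detectron2-style convention, is assumed for illustration, and `stepped_lr` is our name for the helper:

```python
def stepped_lr(step, total_steps, base_lr=1e-5, drops=(0.67, 0.89), factor=10):
    """LR schedule with two drops (at 67% and 89% of training), each
    dividing the LR by `factor` (assumed 10 here)."""
    lr = base_lr
    for frac in drops:
        if step >= frac * total_steps:
            lr /= factor
    return lr
```

A step before 67% of training sees the base LR; after 67% it is divided once, and after 89% twice.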

2: Ablation study on different losses of the training objective of VoLTA for referring expression comprehension tasks. Each model is pre-trained on 231k samples from COCO2017 and VG.

Table E.3: Ablation study on the intra- and inter-modal Barlow Twins objective for multi-label image classification on VOC07 and COCO. We report classification mAP on VOC07, and per-class (PC) and overall (O) F1 scores on COCO. Each model is pre-trained on 123k train-val samples from COCO2017.

We analyze the effectiveness of different pre-training objectives through an ablation on coarse- and fine-grained downstream tasks (Tables E.1 and E.2). First, we pre-train VoLTA only with the multi-modal BT loss. In this setup, VoLTA only acts as a dual encoder; thus, the cross-attention parameters are not pre-trained. Next, we add the MLM and ITM losses, which help the model learn cross-modal information via attention fusion. Finally, we add the GOT pre-training objective. Note that GOT adopts two types of OT distances: WD for node matching and GWD for edge matching. As shown in Table E.2, GWD helps to improve the performance of referring expression comprehension across the RefCOCO, RefCOCO+, and RefCOCOg datasets. Specifically, on RefCOCOg, adding L_gw yields a significant 4.0% boost on the challenging test set. Since this dataset contains intricate images with multiple similar objects of different shapes and colors, GWD is crucial in distinguishing between them. Overall, this set of experiments demonstrates that all of the objectives are necessary for our model to obtain good performance on different coarse- and fine-grained multi-modal tasks.

E.2 ABLATION ON INTRA- AND INTER-MODAL BARLOW TWINS LOSS

We verify the effectiveness of the multi-modal Barlow Twins (BT) objective by ablating the intra- and inter-modal terms. The first row of Table E.3 is identical to the original image-only BT objective. Next, we introduce the text branch and add the same BT objective between the two views of the caption.
Afterward, we add the inter-modal BT objectives. As shown in Table E.3, each loss term improves the image classification performance, demonstrating the importance of both the intra- and inter-modal objectives.

4: Ablation study on the value of wGOT, the weight of the GOT loss in L_total in the objective of VoLTA, for multi-label image classification on VOC07 and COCO. We report classification mAP on VOC07, and per-class (PC) and overall (O) F1 scores on COCO. Each model is pre-trained on 123k train-val samples from COCO2017.

Table E.5: Ablation study on the dimension of the local and global projector networks of VoLTA for multi-label image classification on VOC07 and COCO. We report classification mAP on VOC07, and per-class (PC) and overall (O) F1 scores on COCO. Each model is pre-trained on 123k train-val samples from COCO2017.

For example, a projector configuration of 8192-8192-256 has roughly 8 times more parameters than 2048-2048-1024, although the latter performs better in downstream tasks (Table E.5), indicating that the output (feature) dimension of the projector plays a crucial role in the final performance of the model.
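A quick back-of-envelope check of the parameter comparison above, counting only linear weights and biases, ignoring BatchNorm, and assuming the image-branch input dimension d = 1024 (the exact ratio is closer to 9x under these assumptions, consistent with the "roughly 8 times" figure):

```python
def projector_params(dims, bias=True):
    """Weight (and optional bias) count for an MLP with layer sizes `dims`;
    BatchNorm parameters are ignored for this back-of-envelope comparison."""
    return sum(i * o + (o if bias else 0) for i, o in zip(dims, dims[1:]))

d = 1024  # assumed image-encoder embedding dimension
big = projector_params([d, 8192, 8192, 256])    # wide hidden, small output
small = projector_params([d, 2048, 2048, 1024]) # the configuration we use
ratio = big / small
```

Despite the roughly 9x parameter gap, Table E.5 shows the smaller configuration with the larger 1024-dimensional output performing better.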

REPRODUCIBILITY STATEMENT

The pre-training code of VoLTA can be found in the supplementary material. We also provide a detailed implementation setup for pre-training and downstream experiments in Appendix D.2 and D.3, respectively. After publication, we will provide pre-trained checkpoints and open-source the code on a public repository.

We conducted a grid search for the GOT loss hyper-parameter (w_GOT) and empirically found its best value to be 100.

Linear Evaluation: For ImageNet, the linear classifier is trained for 100 epochs with a batch size of 256, a LR of 0.3, and a cosine LR schedule. The cross-entropy loss is minimized with the SGDM optimizer (momentum of 0.9) and a weight decay of 1e-6. For both COCO and VOC, the linear classifier is trained for 100 epochs with the AdamW optimizer, a batch size of 256, a LR of 5e-2, and a weight decay of 1e-6.

Object Detection: For training the detection models, the detectron2 library (Wu et al., 2019) is used. The backbone networks for Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) are initialized with our pre-trained model. For VOC07+12, we use the trainval set comprising 16K images to train a Faster R-CNN C-4 backbone for 24K iterations with a batch size of 16 across 8 GPUs using SyncBatchNorm. The initial learning rate is 0.15, reduced by a factor of 10 after 18K and 22K iterations. Linear warmup (Goyal et al., 2017) with a slope of 0.333 is used for the first 1000 iterations. For COCO, a Mask R-CNN with a C-4 backbone is trained on the COCO2017 train split, and results are reported on the val split. A learning rate of 0.03 is used, and the other parameters are kept the same as in the 1x schedule in detectron2 (Wu et al., 2019).

The predicted boxes are labeled with the corresponding spans from the text; for example, the top-middle image in Figure G.3 has 4 objects, yet our model predicts boxes only for person and cup based on the text prompt.

Referring Expression Comprehension (REC):

The objective of REC is to align the entire referring expression (text) with the corresponding box by disambiguating among the several occurrences of objects belonging to the same category; therefore, one box per expression is to be predicted. For example, the bottom-left image in Figure G.4 depicts VoLTA's box prediction for the corresponding referring expression: the slice of cake on the left.

