VOLTA: VISION-LANGUAGE TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT

Abstract

Figure 1: We introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a VLP paradigm trained with graph optimal transport (GOT)-based image-text matching. VoLTA learns fine-grained local visual representations using only global image-caption pairs, eliminating the use of expensive grounding annotations. This figure shows how different words in captions attend to relevant image regions, produced by the GOT module of VoLTA pre-trained on COCO.

1. INTRODUCTION

Inspired by the escalating unification of transformer-based modeling in the vision (Dosovitskiy et al., 2021; Liu et al., 2021; Chen et al., 2021a) and language (Devlin et al., 2019; Liu et al., 2019) domains, coupled with readily available large-scale image-caption pair data, vision-language pre-training (VLP) (Lu et al., 2019; Li et al., 2020a; Kim et al., 2021; Kamath et al., 2021; Zhang et al., 2021) has recently been receiving ever-growing attention. VLP has not only proven to be the de facto standard for several VL tasks, but it has also been beneficial for traditional vision-only tasks, such as image classification and object detection. Such wide-ranging applications of VLP can broadly be categorized into two groups: (i) tasks requiring image-level understanding, e.g., image classification, image & text retrieval (Plummer et al., 2015), and visual question answering (Antol et al., 2015); and (ii) tasks requiring region-level understanding, e.g., object detection, instance segmentation, and referring expression comprehension (Kazemzadeh et al., 2014; Yu et al., 2016). Most existing VLP methods support only one of these two groups, leaving open the question of a generalizable and unified VL framework. Traditional VLP methods with image-level understanding (Li et al., 2021a; Wang et al., 2021b; Dou et al., 2022b) utilize large-scale image-caption pair datasets and are commonly trained with image-text contrastive objectives computed on global features. Hence, it is not trivial to extend such methods to region-level applications. On the other hand, VLP methods with region-level understanding (Kamath et al., 2021; Li et al., 2022c; Zhang et al., 2022) depend on box-level grounding supervision and do not readily transfer to image-level tasks. Subsequently, we focus on achieving region-level fine-grained understanding by weakly-supervised alignment of image patches and text tokens.
Previous VLP methods (Chen et al., 2020d; Kim et al., 2021) in this direction use the Wasserstein distance (WD) (Peyré et al., 2019), a.k.a. Earth Mover's distance (EMD)-based optimal transport (OT), for this alignment problem. However, we argue that WD is not optimal for intricate images with multiple similar entities. Thus, we propose to jointly utilize the Gromov-Wasserstein distance (GWD) (Peyré et al., 2016) and the Wasserstein distance (WD) in a setup known as graph optimal transport (GOT) (Chen et al., 2020a). Moreover, instead of the commonly deployed contrastive objective, we propose to utilize redundancy reduction from Barlow Twins (Zbontar et al., 2021), which is less data-intensive and does not require hard-negative mining. We also follow Dou et al. (2022a) in incorporating deep multi-modal fusion into the uni-modal backbones, removing the need for costly fusion-specific transformer layers. To this end, we introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a unified VLP paradigm that utilizes only image-caption annotations yet achieves fine-grained, region-level image understanding, eliminating the use of expensive box annotations. Figure 1 visualizes the feature-level image-text alignment: VoLTA attends text tokens to the corresponding visual patches without relying on low-level supervision. In summary, our contributions are three-fold. (i) We propose to use graph optimal transport for weakly-supervised, feature-level patch-token alignment. (ii) We introduce VoLTA, a unified VLP paradigm for image-level and region-level applications, pre-trained using only image-caption pairs. VoLTA is memory-, compute-, and time-efficient and can easily be scaled up with readily available large-scale image-caption data harvested from the web.
(iii) We perform a wide range of vision-only and vision-language, coarse- and fine-grained downstream experiments to demonstrate the effectiveness of VoLTA compared to strong baselines pre-trained with significantly more caption and box annotations.
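To make the GOT objective concrete, the following is a minimal NumPy sketch of the two transport terms it combines: a Wasserstein term that matches individual patch and token embeddings, and a Gromov-Wasserstein term that matches the intra-modal similarity graphs of the two modalities. This is an illustrative simplification, not the paper's implementation; the function names, the entropic Sinkhorn solver, and all hyperparameters (eps, lam, iteration counts) are assumptions for exposition.

```python
import numpy as np

def sinkhorn(C, eps=0.1, n_iters=50):
    """Entropy-regularized OT: transport plan for cost C with uniform marginals."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-C / eps)                      # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # T = diag(u) K diag(v)

def cosine_cost(X, Y):
    """Pairwise cost 1 - cosine similarity between rows of X and Y."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - X @ Y.T

def got_distance(patches, tokens, lam=0.5, gw_iters=5):
    """Graph optimal transport (sketch): lam * WD + (1 - lam) * GWD."""
    # Wasserstein term: transport individual patches to individual tokens.
    C = cosine_cost(patches, tokens)
    wd = np.sum(sinkhorn(C) * C)

    # Gromov-Wasserstein term: align the two intra-modal similarity graphs.
    Cx = cosine_cost(patches, patches)        # image "graph"
    Cy = cosine_cost(tokens, tokens)          # text "graph"
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    T = np.outer(a, b)                        # uniform initial plan
    for _ in range(gw_iters):
        # Pseudo-cost of the squared-loss GW objective (Peyré et al., 2016).
        L = (Cx**2 @ a)[:, None] + (Cy**2 @ b)[None, :] - 2 * Cx @ T @ Cy.T
        T = sinkhorn(L)                       # projected update of the plan
    gwd = np.sum(T * L)
    return lam * wd + (1 - lam) * gwd
```

The learned transport plan is what yields the word-to-patch correspondences of the kind visualized in Figure 1; lam plays the role of the weighting between the two distances.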

2. RELATED WORKS

Uni-modal Self-supervised Pre-training: In recent years, the machine learning community has seen a boom in self-supervised pre-training. In the language domain, representations learned by BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have become the default starting point for downstream tasks. Generative models such as GPT (Radford et al., 2019; Brown et al., 2020) have also achieved impressive few-shot/zero-shot performance on novel applications. SimCSE (Gao et al., 2021) uses contrastive learning to learn useful sentence representations. In the vision domain, a series of contrastive/joint-embedding methods (He et al., 2020; Chen et al., 2020c; 2021b; 2020b; Grill et al., 2020; Chen & He, 2021; Caron et al., 2021; Zbontar et al., 2021; Bardes et al., 2022; Assran et al., 2022) have outperformed their supervised counterparts. Recently, generative models such as BEiT (Bao et al., 2021) and MAE (He et al., 2022) have also achieved impressive performance with greater potential for scaling.

Vision-language Pre-training: Vision-language pre-training (VLP) mainly relies on image-text pair datasets to learn joint visual-language representations. One line of work is to train separate vision and language encoders and align their global representations with a contrastive objective.
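As an illustration of the redundancy-reduction idea from Barlow Twins (Zbontar et al., 2021), cited among the joint-embedding methods above and adopted by VoLTA in place of a contrastive objective, the loss drives the cross-correlation matrix of two embedding views toward the identity. Below is a minimal NumPy sketch; the function name and the off-diagonal weight lam are illustrative, not the paper's code.

```python
import numpy as np

def barlow_twins_loss(za, zb, lam=5e-3):
    """Redundancy reduction: push the cross-correlation of two embedding
    views toward the identity matrix. za, zb: (batch, dim) embeddings."""
    n = za.shape[0]
    za = (za - za.mean(axis=0)) / za.std(axis=0)     # standardize per dimension
    zb = (zb - zb.mean(axis=0)) / zb.std(axis=0)
    c = za.T @ zb / n                                # (dim, dim) cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)        # invariance term
    off_diag = np.sum(c**2) - np.sum(np.diag(c)**2)  # redundancy term
    return on_diag + lam * off_diag
```

Note that, unlike a contrastive objective, this loss involves no negative pairs, which is why it requires no hard-negative mining.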



Another line of work targets region-level understanding (Kamath et al., 2021; Li et al., 2022c; Zhang et al., 2022): these methods use image-text-box grounding data and are designed to predict bounding boxes during pre-training. Consequently, they do not support image-level tasks. Furthermore, accurate bounding box annotations require high-resolution input images, which are often expensive to collect, annotate, and use for pre-training at scale. Recently, FIBER (Dou et al., 2022a) addressed the problem of such unified VLP and proposed a two-stage pre-training algorithm requiring fewer box annotations than previous region-level pre-training methods. Moving a step forward, we aim to eliminate the use of costly box annotations entirely and ask the challenging but natural question: Can we attain region-level understanding from global image-caption annotations and unify image- and region-level tasks in a single VL framework?

