VOLTA: VISION-LANGUAGE TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT

Abstract

Figure 1: We introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a VLP paradigm trained with graph optimal transport (GOT) based image-text matching. VoLTA learns fine-grained local visual representations using only global image-caption pairs, eliminating the need for expensive grounding annotations. This figure shows how different words in captions attend to relevant image regions, as produced by the GOT module of VoLTA pre-trained on COCO.
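To make the alignment idea in Figure 1 concrete, the sketch below shows an entropic optimal-transport (Sinkhorn) matching between word embeddings and image-region embeddings, the core operation underlying GOT-style image-text matching. This is a minimal illustration, not the authors' implementation; the names `sinkhorn`, `word_feats`, and `region_feats` are hypothetical, and the toy features stand in for real transformer outputs.

```python
# Hedged sketch: entropic OT alignment of words to image regions,
# in the spirit of VoLTA's GOT matching (not the paper's actual code).
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropic optimal transport: returns a transport plan T
    whose row T[i] softly assigns word i to image regions."""
    n, m = cost.shape
    # Uniform marginals over the n words and m regions.
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)  # Gibbs kernel from the cost matrix
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        # Alternate scaling updates: v = nu / (K^T u), u = mu / (K v).
        u = mu / (K @ (nu / (K.t() @ u)))
    v = nu / (K.t() @ u)
    return u[:, None] * K * v[None, :]  # transport plan T (n x m)

# Toy example: 4 word embeddings vs. 9 patch/region embeddings.
word_feats = torch.nn.functional.normalize(torch.randn(4, 64), dim=-1)
region_feats = torch.nn.functional.normalize(torch.randn(9, 64), dim=-1)
cost = 1.0 - word_feats @ region_feats.t()  # cosine distance
T = sinkhorn(cost)
# Each row of T gives the word-to-region attention visualized in Fig. 1;
# the OT distance below can serve as a matching loss between the pair.
ot_distance = (T * cost).sum()
```

Note that the transport plan is learned from the global image-caption pair alone: no box or region annotation enters the cost matrix, which is the sense in which the local alignment is weakly supervised.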

1. INTRODUCTION

Inspired by the escalating unification of transformer-based modeling in the vision (Dosovitskiy et al., 2021; Liu et al., 2021; Chen et al., 2021a) and language (Devlin et al., 2019; Liu et al., 2019) domains, coupled with readily available large-scale image-caption pair data, vision-language pre-training (VLP) (Lu et al., 2019; Li et al., 2020a; Kim et al., 2021; Kamath et al., 2021; Zhang et al., 2021) has recently been receiving ever-growing attention. VLP has not only proven to be the de facto approach for several VL tasks, but it has also been beneficial for traditional vision-only tasks, such as image classification and object detection. Such wide-ranging applications of VLP can broadly be categorized into two groups: (i) tasks requiring image-level understanding, e.g., image classification, image &

