UNIVERSAL VISION-LANGUAGE DENSE RETRIEVAL: LEARNING A UNIFIED REPRESENTATION SPACE FOR MULTI-MODAL RETRIEVAL

Abstract

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a shared embedding space and searches candidates from different modalities within it. To learn this unified embedding space, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves the state-of-the-art on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval models on its two subtasks, text-text retrieval and text-image retrieval. These results demonstrate that universal multi-modal search can replace the divide-and-conquer pipeline with a unified model while also benefiting single/cross-modality tasks.

1. INTRODUCTION

Although search engines primarily focus on textual data (Singhal et al., 2001), multi-media content is necessary to satisfy user needs during retrieval. A user query can be answered by information in various formats, such as a text document or a picture. The growth of multi-media content has been one of the most notable trends on the internet (Mei et al., 2014), and various studies have shown that users prefer more vivid multi-media content in search results (Datta et al., 2008). Current multi-media search systems often employ a divide-and-conquer approach. As shown in Figure 1(a), they first conduct search in individual modalities, including text, image, video, etc. (Bajaj et al., 2016; Grubinger et al., 2008; Kwiatkowski et al., 2019; Awad et al., 2021), and then fuse the retrieval results from these verticals, e.g., by building another ranking layer on top of the single/cross-modality retrievers (Escalante et al., 2008; Grubinger et al., 2008). Relevance modeling and retrieval result fusion are usually entwined to achieve accurate multi-modal retrieval results. However, due to the modality gap, divide-and-conquer can only model them as a pipeline, making it challenging to fuse retrieval results from different modalities.

In this paper, we explore the potential of universal multi-modal retrieval, which builds an end-to-end model to retrieve multi-modality documents for user queries. As illustrated in Figure 1(b), universal multi-modal retrieval maps queries and multi-modality resources to one universal embedding space and retrieves multi-modality candidates via KNN search. As a result, relevance modeling, cross-modality matching, and retrieval result fusion are done by one model. More specifically, we propose the Universal Vision-Language Dense Retrieval (UniVL-DR) model to represent queries, texts, and images and to learn a tailored vision-language embedding space for multi-modal retrieval.
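The single-space retrieval idea above can be illustrated with a minimal sketch: once queries, texts, and images are all encoded into one embedding space, retrieval across modalities reduces to a single KNN search. The function name, embedding dimension, and toy corpus below are illustrative, not part of the paper's implementation.

```python
import numpy as np

def knn_search(query_emb, candidate_embs, k=5):
    """Return indices of the top-k candidates by dot-product similarity."""
    scores = candidate_embs @ query_emb          # one score per candidate
    return np.argsort(-scores)[:k]

# Toy unified space: rows 0-2 stand in for text embeddings, rows 3-5 for images.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(6, 8)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # cosine via dot product
query = corpus[4] + 0.01 * rng.normal(size=8).astype(np.float32)
query /= np.linalg.norm(query)

top = knn_search(query, corpus, k=3)
# Candidates from both modalities compete in one ranked list; no fusion step.
```

Because all candidates live in the same space, the ranked list can freely mix modalities, which is exactly what the divide-and-conquer pipeline cannot do without a separate fusion layer.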
UniVL-DR optimizes the vision-language embedding space using hard negatives (Xiong et al., 2021a) and balances the modalities of these negatives to alleviate the modality preference of multi-modal retrievers. Furthermore, UniVL-DR introduces an image verbalization method, which regards language as a kind of mentalese (Cavanagh, 2021) and mitigates the modality gap between images and texts. Our image verbalization method first aligns the semantics of image captions and figure pixels (Huang et al., 2021a), and then paraphrases the image facts. It helps bridge the language and vision understanding modules of UniVL-DR via natural language.

To build a multi-modal retrieval benchmark, we leverage the multi-modal question answering (QA) benchmark WebQA (Chang et al., 2022) and convert it to a standard open-domain setting: retrieving multi-modality candidates from text and image collections for a user query. Divide-and-conquer is an intuitive way to build a multi-modal retrieval system, and we pre-route queries to the oracle modality to show the upper-bound performance of such a system. Compared with the divide-and-conquer system, UniVL-DR addresses the retrieval result fusion challenge, achieves state-of-the-art multi-modal retrieval performance, and brings more than 5% improvement in single/cross-modality retrieval. Our experiments show that UniVL-DR learns an effective embedding space for multi-modal retrieval by separating texts and images into different areas and guiding queries to return candidates from the corresponding modalities. Our further analyses show that UniVL-DR alleviates overfitting to single-modality signals by balancing hard negatives during training and bridges the modality gap between vision and language by verbalizing images.
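The modality-balancing idea can be sketched as a sampling step: for each query, draw an equal number of hard negatives from the text pool and the image pool before forming the contrastive batch. This is a simplified reading of the strategy; the helper name, pool sizes, and per-modality count are assumptions for illustration.

```python
import numpy as np

def balanced_hard_negatives(hard_neg_ids, modality_of, m_per_modality=2, seed=0):
    """Sample an equal number of text and image hard negatives for one query.

    hard_neg_ids: candidate ids mined by an earlier retrieval round.
    modality_of:  dict mapping candidate id -> "text" or "image".
    """
    rng = np.random.default_rng(seed)
    text_pool = [i for i in hard_neg_ids if modality_of[i] == "text"]
    image_pool = [i for i in hard_neg_ids if modality_of[i] == "image"]
    picked = []
    for pool in (text_pool, image_pool):
        take = min(m_per_modality, len(pool))
        picked += list(rng.choice(pool, size=take, replace=False))
    return picked

# Toy mined negatives: three texts and two images for one query.
modality_of = {0: "text", 1: "image", 2: "text", 3: "image", 4: "text"}
negs = balanced_hard_negatives([0, 1, 2, 3, 4], modality_of, m_per_modality=2)
# negs holds 2 text ids and 2 image ids, so neither modality dominates training.
```

The design intent, per the paper, is that an unbalanced negative pool would let the retriever learn a shortcut preference for one modality; forcing parity in the negatives removes that shortcut.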
All experimental results show that learning one universal representation space is starting to benefit single-modality tasks: pretraining representation models on multiple modalities and applying our techniques can capture additional signals across modalities, overcome the modality boundary, and provide convincing gains in single- and multi-modality tasks.

2. RELATED WORK

Document retrieval is a typical single-modality retrieval task, which aims to return related documents for user queries and can be tackled with dense retrievers (Xiong et al., 2021b; Lewis et al., 2020; Zhan et al., 2021; Li et al., 2021b; Yu et al., 2021). Dense retrievers encode queries and documents with pretrained language models (Devlin et al., 2019) and map them into an embedding space to conduct an efficient search. The query and document encoders are usually contrastively trained with in-batch negatives, BM25-retrieved negatives, and hard negatives (Karpukhin et al., 2020; Xiong et al., 2021a). Recently, much work has focused on multi-modal retrieval tasks, which retrieve texts and images to satisfy the multi-modality information needs of users (Hannan et al., 2020; Singh et al., 2021; Talmor et al., 2021; Chang et al., 2022). WebQA (Chang et al., 2022), an open-domain multi-modal question answering benchmark, is built to encourage follow-up work to represent multi-modal knowledge in a unified space and answer user queries with information from the appropriate modalities. It is a more realistic setting, which avoids synthesizing queries with templates (Talmor et al., 2021) and downplays the role of modality disambiguation (Hannan et al., 2020) in multi-modality modeling. To search information from large-scale multi-modality sources, WebQA (Chang et al., 2022) employs a divide-and-conquer pipeline to search text and image candidates with BM25 and CLIP (Radford et al., 2021) and then fuses these retrieval results using a vision-language model. However, single-
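The contrastive training recipe described above for dense retrievers, where each query's paired document is the positive and the other documents in the batch serve as in-batch negatives, can be sketched as an InfoNCE-style loss. The function name and temperature value are illustrative, not taken from any of the cited systems.

```python
import numpy as np

def in_batch_contrastive_loss(q, d, tau=0.05):
    """InfoNCE over a batch: d[i] is the positive for q[i]; every other
    document in the batch acts as an in-batch negative."""
    sims = (q @ d.T) / tau                      # [B, B] similarity matrix
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # NLL of the diagonal positives

rng = np.random.default_rng(1)
q = rng.normal(size=(4, 16))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss_random = in_batch_contrastive_loss(q, rng.normal(size=(4, 16)))
loss_aligned = in_batch_contrastive_loss(q, q)  # positives identical to queries
```

Hard-negative training (Xiong et al., 2021a) extends this by appending mined negatives as extra columns of the similarity matrix; UniVL-DR's contribution is to balance the modalities of those appended columns.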
[Figure 1(a) panel, "Divide-and-Conquer Multi-Media Search", shows example text candidates about the Xanadu house in Kissimmee, the Booker T. Washington Memorial Half Dollar, and the National Air and Space Museum.]

Figure 1: Different Architectures of Multi-Modal Retrieval Systems.

