UNIVERSAL VISION-LANGUAGE DENSE RETRIEVAL: LEARNING A UNIFIED REPRESENTATION SPACE FOR MULTI-MODAL RETRIEVAL

Abstract

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a shared embedding space and searches candidates across modalities in that space. To learn a unified embedding space for multi-modal retrieval, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves the state-of-the-art on the multi-modal open-domain question answering benchmark, WebQA, and outperforms all retrieval models on its two subtasks, text-text retrieval and text-image retrieval. These results demonstrate that a unified model can feasibly replace the divide-and-conquer pipeline for multi-modal search, while also benefiting single- and cross-modality retrieval tasks.

1. INTRODUCTION

Although search engines primarily focus on textual data (Singhal et al., 2001), multi-media content is necessary to satisfy user needs during retrieval. A user query can be answered by information in various formats, such as a text document or a picture. The growth of multi-media content has been one of the most notable trends on the internet (Mei et al., 2014), and various studies have shown that users prefer more vivid multi-media content in search results (Datta et al., 2008). Current multi-media search systems often employ a divide-and-conquer approach. As shown in Figure 1 (a), they first conduct search in individual modalities, including text, image, video, etc. (Bajaj et al., 2016; Grubinger et al., 2008; Kwiatkowski et al., 2019; Awad et al., 2021), and then fuse the retrieval results from these verticals, e.g., by building another ranking layer on top of the single/cross-modality retrievers (Escalante et al., 2008; Grubinger et al., 2008). Relevance modeling and retrieval result fusion are usually intertwined in achieving accurate multi-modal retrieval results. However, due to the modality gap, the divide-and-conquer approach can only model them as a pipeline, making it challenging to fuse retrieval results from different modalities.

In this paper, we explore the potential of universal multi-modal retrieval: building an end-to-end model that retrieves multi-modality documents for user queries. As illustrated in Figure 1 (b), universal multi-modal retrieval maps queries and multi-modality resources to one universal embedding space and retrieves multi-modality candidates via KNN search. As a result, relevance modeling, cross-modality matching, and retrieval result fusion are handled by a single model. More specifically, we propose a Universal Vision-Language Dense Retrieval (UniVL-DR) model to obtain representations of queries, texts, and images and to learn a tailored vision-language embedding space for multi-modal retrieval.
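The unified retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `encode` function here is a hypothetical stand-in (seeded random unit vectors) for the trained vision-language encoders, and the key point is that texts and images share one index searched by a single KNN pass.

```python
import numpy as np

DIM = 8  # toy embedding dimension

def encode(item):
    # Hypothetical encoder: in UniVL-DR this would be a trained
    # vision-language model mapping queries, texts, and images into
    # ONE shared space; here we fake it with a per-item random unit vector.
    v = np.random.default_rng(abs(hash(item)) % (2**32)).standard_normal(DIM)
    return v / np.linalg.norm(v)

# A single corpus mixing modalities -- no per-vertical index, no fusion layer.
corpus = [
    "text: a document about domestic cats",
    "image: a photo of a cat on a sofa",
    "text: a document about stock markets",
]
index = np.stack([encode(d) for d in corpus])

def knn_search(query, k=2):
    q = encode(query)
    scores = index @ q            # cosine similarity (all vectors are unit-norm)
    top = np.argsort(-scores)[:k]  # exact KNN; a real system would use ANN search
    return [corpus[i] for i in top]

print(knn_search("query: cats"))
```

Because every candidate lives in the same space, text and image results are ranked against each other directly by embedding similarity, which is exactly what removes the separate fusion stage of the divide-and-conquer pipeline.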
UniVL-DR optimizes the vision-language embedding space using hard negatives (Xiong et al., 2021a) and balances the modalities of these negatives to alleviate
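The idea of contrastive training with modality-balanced hard negatives can be sketched as below. This is a simplified NumPy illustration of an InfoNCE-style loss, not the paper's exact training recipe: the function names and the enforced one-to-one text/image negative ratio are assumptions made for clarity.

```python
import numpy as np

def balanced_info_nce(q, pos, text_negs, img_negs, tau=0.05):
    """InfoNCE loss with modality-balanced hard negatives (sketch).

    q, pos: (d,) unit vectors for the query and its relevant document.
    text_negs, img_negs: (n, d) unit vectors of hard negatives, one set
    per modality. Requiring equal counts is a simple way to keep either
    modality from dominating the contrastive signal.
    """
    assert len(text_negs) == len(img_negs), "use the same number of negatives per modality"
    # Candidate list: positive first, then negatives from both modalities.
    cands = np.vstack([pos[None, :], text_negs, img_negs])
    logits = cands @ q / tau
    # Numerically stable log-softmax; loss is the NLL of the positive.
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())
    return -(logits[0] - lse)
```

When the positive is close to the query and both modalities' negatives are far, the loss is near zero; a hard negative from either modality that scores high pushes the loss up, so the model is penalized equally for confusing the query with the wrong text or the wrong image.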

