CROSS-PROBE BERT FOR EFFICIENT AND EFFECTIVE CROSS-MODAL SEARCH

Anonymous

Abstract

Inspired by the great success of BERT in NLP tasks, many text-vision BERT models have emerged recently. Benefiting from cross-modal attention, text-vision BERT models have achieved excellent performance on many vision-language tasks, including text-image retrieval. Nevertheless, the cross-modal attention used in text-vision BERT models incurs a computation cost that is prohibitively expensive for text-vision retrieval, which is impractical for large-scale search. In this work, we develop a novel architecture, cross-probe BERT. It relies on devised text and vision probes, and cross-modal attention is conducted on these text and vision probes. It incurs a lightweight computation cost while still effectively exploiting cross-modal attention. Systematic experiments conducted on two public benchmarks demonstrate the state-of-the-art effectiveness and efficiency of the proposed method.

1. INTRODUCTION

Traditional text-to-image retrieval tasks are tackled by joint-embedding methods. They map text queries and reference images into the same feature space so that queries and images can be compared directly. Basically, they adopt a two-tower architecture, in which one tower extracts features of text queries and the other tower extracts features of reference images. In the training phase, the parameters of the two towers are optimized so that a query and its relevant images are close in the feature space, whereas the distance between the query and its irrelevant images is large. Since the two towers generate image and query features independently, image features can be extracted offline and cached in the database. In the search phase, the cached image features can be directly compared with the query's feature, so retrieval is efficient. Due to this high efficiency, joint-embedding methods based on the two-tower structure have been widely used in large-scale cross-modal retrieval.

Inspired by the great success achieved by the self-attention mechanism of the Transformer (Vaswani et al. (2017)) and BERT (Devlin et al. (2019)) in NLP tasks, several text-vision BERT models (Lu et al. (2019); Li et al. (2020)) have emerged. They take a query-image pair as input and extend the original text-only self-attention to multi-modal self-attention. A text-vision BERT effectively models the interactions between image features and query features, provides contextual encoding for both, and achieves significantly better retrieval accuracy than its two-tower counterpart. Despite the high effectiveness achieved by text-vision BERT, the extremely high computation cost brought by the pairwise input limits its practical usefulness, especially for large-scale cross-modal retrieval in industrial applications. Given a query and N reference images, it needs to feed N query-image pairs into the text-vision BERT to obtain N relevance scores.
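The efficiency contrast above can be made concrete. The sketch below uses random vectors as stand-ins for the two towers; the embedding dimension, corpus size, and variable names are illustrative assumptions, not the paper's actual models. It shows why two-tower retrieval amortizes: the N image features are computed offline, and serving a query costs one encoding plus a single matrix-vector product.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings so cosine similarity reduces to a dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 128

# Offline: encode and cache all N reference images (any image tower would do).
cached_image_feats = l2_normalize(rng.standard_normal((10_000, d)))

# Online: encode the query exactly once.
query_feat = l2_normalize(rng.standard_normal(d))

# One matrix-vector product scores the query against all N cached images.
scores = cached_image_feats @ query_feat
top5 = np.argsort(-scores)[:5]
```

A pairwise text-vision BERT, by contrast, would need 10,000 full forward passes for the same ranking.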
That is, it must repeatedly encode the query N times. In a large-scale cross-modal retrieval task, N is extremely large, making text-vision BERT prohibitively slow for obtaining relevance scores against all reference images. In contrast, a two-tower encoder only needs to encode the query once, and the N reference image features can be pre-computed offline and cached in the database. Thus, it obtains relevance scores between the query and the reference images very efficiently by computing cosine similarities between the query feature and the pre-computed reference image features.

Though inefficient pairwise attention limits the usefulness of text-vision BERT in large-scale cross-modal retrieval, there are few works on speeding it up. In fact, the inefficiency caused by pairwise input is a general problem that is also encountered in other retrieval tasks such as query-to-document retrieval (Humeau et al. (2020)) and question answering (Cao et al. (2020); Zhang et al. (2020)). In these tasks, there are similarly two mainstream encoders for obtaining the relevance score. The first type, the Bi-encoder (Dinan et al. (2019); Mazare et al. (2018)), is based on the two-tower architecture. Since the query/question and the document are independently encoded, the document features can be pre-computed and cached. In this case, the relevance between the query and each document can be determined by the cosine similarity between the query/question's feature and the document's cached feature. It achieves high efficiency but relatively low retrieval accuracy. In contrast, the Cross-encoder (Urbanek et al.) takes a question-answer pair or a query-document pair as input, exploiting cross-attention like text-vision BERT; it achieves high retrieval accuracy but is inefficient. To balance effectiveness and efficiency, existing methods (Cao et al. (2020); Zhang et al. (2020)) adopt the two-tower architecture in the lower layers and the cross-attention architecture in the upper layers. We term this architecture the "split-merge" encoder. In that case, features from the lower two-tower layers can be extracted offline and cached.
Then question-answer or query-document attention can be conducted in the upper layers. Since the number of upper cross-attention layers is small, efficiency is boosted. Similarly, the Poly-Encoder (Humeau et al. (2020)) uses the two-tower architecture for feature extraction and adds a cross-attention layer on top to obtain the similarities between the query and reference items.

In this paper, we propose a novel architecture, cross-probe (CP) BERT, for effective and efficient cross-modal retrieval. Motivated by the great success of "split-merge" style encoders in query-document retrieval, we extend text-vision BERT to adopt this design to speed up the computation. In particular, we devise several vision probes and text probes alongside the image's local features and the query's word features. In the lower few layers, we adopt the two-tower architecture: the vision probes and the image's local features are concatenated and fed into the vision tower, which generates the attended vision probes; in parallel, the text probes and the query's word features are concatenated and fed into the text tower, which generates the attended text probes. After that, the attended vision probes and text probes are concatenated and fed into a series of cross-attention layers to exploit cross-modal attention. Since the number of text probes is considerably smaller than the number of words in the query, and the number of vision probes is smaller than the number of local features of the image, the cost of our CP BERT in computing cross-modal attention is significantly lower than that of a standard text-vision BERT. Meanwhile, cross-modal attention is only exploited in the upper few layers, making our CP BERT even more efficient. Systematic experiments conducted on two public benchmarks demonstrate the excellent effectiveness and efficiency of the proposed CP BERT.
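A minimal single-head sketch of this probe mechanism follows, under our reading of the architecture: the probe counts, feature shapes, absence of learned projections and layer stacking, and the final scoring head are all illustrative simplifications, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Single-head scaled dot-product attention (no learned projections,
    # for brevity).
    d = queries.shape[-1]
    w = softmax(queries @ keys.T / np.sqrt(d))
    return w @ values

rng = np.random.default_rng(0)
d = 64
word_feats = rng.standard_normal((20, d))    # query word features
region_feats = rng.standard_normal((36, d))  # image local features
text_probes = rng.standard_normal((4, d))    # learnable in practice
vision_probes = rng.standard_normal((4, d))

# Lower "two-tower" stage: each probe set attends within its own modality,
# summarizing 20 words / 36 regions into 4 tokens each.
t = attend(text_probes, np.vstack([text_probes, word_feats]),
           np.vstack([text_probes, word_feats]))
v = attend(vision_probes, np.vstack([vision_probes, region_feats]),
           np.vstack([vision_probes, region_feats]))

# Upper cross-modal stage: attention runs over 8 probe tokens instead of
# all 56 raw tokens, so the pairwise attention cost shrinks accordingly.
joint = np.vstack([t, v])
joint = attend(joint, joint, joint)
score = float(joint[:4].mean(0) @ joint[4:].mean(0))
```

Only the upper stage depends on the query-image pair; the lower stage's image side can still be pre-computed and cached, which is the point of the split-merge design.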

2. RELATED WORK

Traditional cross-modal retrieval, e.g., text-image retrieval, relies on joint embedding. It maps texts and images from the two modalities into a common feature space through two encoders. Texts and images can then be compared directly, with the distance between their global features in the common feature space measuring their similarity. Early joint-embedding methods (Gong et al. (2012); Rasiwasia et al. (2010)) utilize canonical correlation analysis (CCA) to project hand-crafted text and image features into a joint CCA space. More recently, inspired by the great progress achieved by deep neural networks, methods based on deep learning have emerged. VSE++ (Faghri et al. (2017)) obtains the image feature through a convolutional neural network (CNN) and encodes the text with a gated recurrent unit (GRU). The CNN and the GRU are trained end-to-end with a triplet loss, which seeks to minimize the distance between relevant text-image pairs and maximize the distance between irrelevant pairs.

The merit of joint-embedding methods is their simplicity and efficiency. A text query, as well as an image, is represented by a single global feature, so the relevance between an image and a text can be efficiently obtained by computing the cosine similarity of their features. Moreover, global text and image features are amenable to approximate nearest neighbor (ANN) search techniques such as hashing and inverted indexing, making large-scale retrieval efficient. Nevertheless, the global features used in joint-embedding methods have limitations. In many cases, the relevance between a text and an image is determined by very few words in the text and some small regions in the image, so the relevant words and regions may be distracted by irrelevant ones when using global features. Methods based on local features are therefore proposed to overcome these limitations.
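The triplet loss used by such joint-embedding methods can be sketched as follows; this is a minimal NumPy version with VSE++-style hardest-negative mining, where the function and argument names are our own.

```python
import numpy as np

def triplet_loss(query, pos_image, neg_images, margin=0.2):
    """Hinge-based triplet loss on cosine similarity. Following VSE++,
    only the hardest (most similar) negative contributes to the loss."""
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, p, n = norm(query), norm(pos_image), norm(neg_images)
    s_pos = float(q @ p)          # similarity to the relevant image
    s_neg = float((n @ q).max())  # similarity to the hardest negative
    return max(0.0, margin - s_pos + s_neg)
```

With `margin=0.2`, a well-matched pair whose negatives are all dissimilar yields zero loss, while a pair whose hardest negative is closer than the positive is penalized.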
In DVS (Karpathy & Fei-Fei (2014)), an image is represented by a set of bounding-box features extracted by an object detector, R-CNN. Meanwhile, the text is represented by a sequence of word features extracted by an RNN. The bounding-box features and word features are then aligned to obtain the similarity between the image and the text. The alignment operation can effectively alleviate the distraction from irrelevant word-region pairs. Similarly, SCAN (Lee et al. (2018)) relies on bounding-box features from a Faster R-CNN and word features from a GRU. It conducts the alignment through soft attention and optimizes its loss function through hard negative mining. Nevertheless, both DVS
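The soft-attention alignment used by methods such as SCAN can be sketched as follows; the temperature value and the mean-pooled scoring choice here are illustrative simplifications, not SCAN's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_score(word_feats, region_feats, temperature=9.0):
    """SCAN-style text-to-image alignment sketch: each word softly attends
    over the image regions, and the image-text relevance is the mean cosine
    similarity between each word and its attended region context."""
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    w, r = norm(word_feats), norm(region_feats)
    attn = softmax(temperature * (w @ r.T), axis=-1)  # (n_words, n_regions)
    context = norm(attn @ r)                          # one context per word
    return float(np.mean(np.sum(w * context, axis=-1)))
```

Because each word is scored against its own attended context rather than a single global image vector, irrelevant word-region pairs contribute little to the final score.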


