CAMVR: CONTEXT-ADAPTIVE MULTI-VIEW REPRESENTATION LEARNING FOR DENSE RETRIEVAL

Abstract

The recently proposed MVR (Multi-View Representation) model achieves remarkable performance in open-domain dense retrieval. By encoding a document into multiple representations, MVR allows the document to match queries from multiple views. However, these representations tend to collapse into a single one when the percentage of documents answering multiple queries in the training data is low. In this paper, we propose CAMVR (Context-Adaptive Multi-View Representation), a learning framework that explicitly avoids the collapse problem by aligning each viewer token with a different document snippet. Since the answers to different-view queries may be scattered across one document, CAMVR places a viewer token before each snippet so that each view captures both local and global information. In addition, the view of the snippet containing the answer is used to explicitly supervise the learning process, which makes the view representations interpretable. Extensive experiments show that CAMVR outperforms existing models and achieves state-of-the-art results.

1. INTRODUCTION

Dense retrieval approaches based on pre-trained language models (Devlin et al., 2019; Liu et al., 2019) achieve significant retrieval improvements over sparse bag-of-words representation approaches (Jones, 1972; Robertson et al., 2009). A typical dense retriever encodes the document and the query into two separate vector-based representations through a dual-encoder architecture (Karpukhin et al., 2020; Lee et al., 2019; Qu et al., 2021; Xiong et al., 2020), and relevant documents are then selected according to the similarity scores between these representations. Accordingly, most approaches improve retrieval performance by learning higher-quality document representations, for example through hard negative mining (Zhan et al., 2021; Xiong et al., 2020; Qu et al., 2021) or task-specific pre-training (Gao & Callan, 2021b; Oguz et al., 2022). Intuitively, the capacity of a single-vector representation is limited (Luan et al., 2020) when the document is long and corresponds to multi-view queries. Recently, Zhang et al. (2022) proposed the MVR (Multi-View Representation) model, which improves representation capacity by encoding a document into multiple representations; the similarity score between a query and a document is the maximum of the scores computed with the individual dense representations. However, when the percentage of documents answering multiple queries in the training data is low, the multiple vector-based representations of each document tend to collapse into the same one. For brevity, we use the average number of queries (AVQ) per document to quantify this percentage. As shown in Table 2, when the AVQ value is small (1.0), the multi-view representations of MVR collapse into a single one and performance deteriorates sharply.
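To make the MVR scoring rule concrete, the following minimal sketch computes the query-document score as the maximum dot product over a document's view vectors. The function name, the toy dimensions, and the values are illustrative assumptions, not taken from the MVR implementation:

```python
import numpy as np

def mvr_score(query_vec, doc_views):
    """Similarity between a query and a multi-view document:
    the maximum dot product over the document's view vectors."""
    scores = doc_views @ query_vec  # one score per view, shape (n_views,)
    return float(scores.max())

# Toy example: a 4-dim query and a document with 3 view vectors.
q = np.array([1.0, 0.0, 1.0, 0.0])
views = np.array([
    [0.9, 0.1, 0.0, 0.0],  # view 1
    [0.1, 0.8, 0.1, 0.9],  # view 2
    [1.0, 0.0, 1.0, 0.0],  # view 3 (best match for this query)
])
print(mvr_score(q, views))  # 2.0
```

Note that only the best-matching view contributes to the score; when all views collapse into the same vector, the max degenerates to a single-vector dual-encoder score, which is exactly the failure mode discussed above.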
The answers to different queries may be scattered across different snippets of one document, while the viewer tokens in the MVR model are all placed at the front of the document, which makes it difficult for them to perceive information from the snippets that contain the scattered answers. As shown in Figure 1, the two answers are located in the 2nd and 4th snippets in Example 1, while in Example 2 they are located in the 5th and 6th snippets. Since the positions of its viewer tokens are fixed, MVR cannot adaptively capture the different answer information, whereas well-learned multi-view representations are expected to capture information from different snippets. In this paper, we propose CAMVR (Context-Adaptive Multi-View Representation), a learning framework that explicitly avoids the collapse problem by aligning viewer tokens with snippets (as illustrated in Figure 1).

(Figure 1 appears here: two worked examples comparing the MVR and CAMVR model inputs; see the caption below.)

Firstly, to allow each view to adaptively capture the information of the answer snippets in a document, we place each viewer token directly before its snippet. Each view representation is therefore able to aggregate local and global information via the self-attention mechanism of the Transformer (Vaswani et al., 2017). Furthermore, we use the view corresponding to the answer snippet of a given query to explicitly supervise the learning task, which allows the view to attend to its local snippet and thereby increases view interpretability.

The contributions of this paper are as follows:

• Unlike the MVR model, whose representations collapse into the same one when the percentage of documents answering multiple queries in the training data is low, our proposed CAMVR learning framework adaptively aligns each viewer token with a document snippet to capture fine-grained information from local snippets as well as the global context, which effectively avoids the collapse problem.

• We use the answer snippet of each positive document to supervise the learning process, allowing each representation to attend to its local snippet, which provides interpretability for each view representation.

• Extensive experiments evaluate our proposed CAMVR in terms of overall retrieval performance, the collapse problem, view interpretability, the number of views, etc. The experimental results on open-domain retrieval datasets show the effectiveness of our proposed model, which achieves state-of-the-art performance.
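The viewer-token placement and the answer-snippet supervision signal can be sketched schematically as follows. The token strings, function names, and the substring-matching heuristic for locating the answer are our illustrative assumptions, not the paper's exact preprocessing:

```python
def build_camvr_input(title, snippets):
    """Interleave a viewer token before each snippet, a schematic
    sketch of the CAMVR input layout shown in Figure 1."""
    tokens = ["[CLS]", title]
    for i, snippet in enumerate(snippets, start=1):
        tokens.append(f"[VIE{i}]")  # viewer token aligned with snippet i
        tokens.append(snippet)
    return tokens

def supervising_view(snippets, answer):
    """Index of the view whose snippet contains the answer; this view's
    representation receives the explicit supervision signal."""
    for i, snippet in enumerate(snippets, start=1):
        if answer in snippet:
            return i
    return None

snips = [
    "Brazil have won five times, and they are the only team to have played in every tournament.",
    "The other World Cup winners are Germany and Italy, with four titles each.",
]
seq = build_camvr_input("FIFA World Cup", snips)
print(supervising_view(snips, "four"))  # 2
```

Because each `[VIE i]` sits immediately before its snippet, self-attention lets it specialize on that local span while still attending to the rest of the document, rather than all viewer tokens competing for the same front position as in MVR.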

2. RELATED WORK

In this section, we review three existing model architectures for dense retrieval: the dual encoder, late interaction, and multi-vector models. As shown in Figure 2, the late-interaction and multi-vector models are variants of the dual encoder.
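The three architectures differ mainly in how the final query-document score is computed from the encoder outputs. The sketch below contrasts the three scoring functions on toy embeddings; the names, dimensions, and values are purely illustrative:

```python
import numpy as np

# Toy 2-dim embeddings (illustrative values only).
q_vec = np.array([1.0, 0.0])                     # one vector per query
d_vec = np.array([0.5, 0.5])                     # one vector per document
q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])    # per-token query vectors
d_tokens = np.array([[1.0, 0.0], [0.0, 0.5]])    # per-token document vectors
d_views = np.array([[0.2, 0.9], [0.8, 0.1]])     # a few document-level views

def dual_encoder_score(q, d):
    """Dual encoder: one vector per text, scored by a single dot product."""
    return float(q @ d)

def late_interaction_score(q_toks, d_toks):
    """Late interaction (ColBERT-style): for each query token, take the max
    similarity over document tokens, then sum over query tokens."""
    return float((q_toks @ d_toks.T).max(axis=1).sum())

def multi_vector_score(q, views):
    """Multi-vector model: max score over several document-level vectors."""
    return float((views @ q).max())

print(dual_encoder_score(q_vec, d_vec))            # 0.5
print(late_interaction_score(q_tokens, d_tokens))  # 1.5
print(multi_vector_score(q_vec, d_views))          # 0.8
```

Late interaction keeps every token vector at search time and is therefore the most expensive; the multi-vector family (including MVR and CAMVR) keeps only a handful of document-level vectors, retaining the dual encoder's indexing efficiency.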



Figure 1: Two examples from the Natural Questions dataset and a comparison of the inputs of MVR and CAMVR. [VIE i] is a viewer token and the answers are marked in red. Compared with the MVR model, where all viewer tokens are placed before the document, our CAMVR model adaptively captures the information of the answer snippets by aligning each viewer token with a snippet.

The dual encoder (Karpukhin et al., 2020) (Figure 2(a)) is widely used in first-stage document retrieval to obtain relevant documents from a large-scale corpus. In general, the pre-trained model BERT (Devlin et al., 2019) is used as the base model of the dual encoder, given the text-representation ability it learns from massive data. To improve retrieval performance, Lee et al. (2019) and Chang et al. (2020) further pre-train BERT with the inverse cloze task, body first selection, and wiki link prediction. To make language models ready for dense retrieval tasks, Gao & Callan (2021a;b) propose the pre-training architectures coCondenser and Condenser, where a contrastive learn-

