CAMVR: CONTEXT-ADAPTIVE MULTI-VIEW REPRESENTATION LEARNING FOR DENSE RETRIEVAL

Abstract

The recently proposed MVR (Multi-View Representation) model achieves remarkable performance in open-domain dense retrieval. In MVR, a document can match multi-view queries because it is encoded into multiple representations. However, these representations tend to collapse into the same one when the percentage of documents answering multiple queries in the training data is low. In this paper, we propose a CAMVR (Context-Adaptive Multi-View Representation) learning framework, which explicitly avoids the collapse problem by aligning each viewer token with a different document snippet. Since the answers to different view queries may be scattered across one document, CAMVR places each viewer token before its snippet so that each view captures both local and global information. In addition, the view of the snippet containing the answer is used to explicitly supervise the learning process, which makes each view representation interpretable. Extensive experiments show that CAMVR outperforms existing models and achieves state-of-the-art results.

1. INTRODUCTION

Dense retrieval approaches based on pre-trained language models (Devlin et al., 2019; Liu et al., 2019) achieve significant retrieval improvements compared with sparse bag-of-words representation approaches (Jones, 1972; Robertson et al., 2009). A typical dense retriever encodes the document and query into two separate vector-based representations through a dual encoder architecture (Karpukhin et al., 2020; Lee et al., 2019; Qu et al., 2021; Xiong et al., 2020), and relevant documents are then selected according to the similarity scores between their representations. To this end, most approaches improve retrieval performance by learning a high-quality document representation, including hard negative mining (Zhan et al., 2021; Xiong et al., 2020; Qu et al., 2021) and task-specific pre-training (Gao & Callan, 2021b; Oguz et al., 2022). Intuitively, the capacity of a single-vector representation is limited (Luan et al., 2020) when the document is long and corresponds to multi-view queries. Recently, Zhang et al. (2022) proposed the MVR (Multi-View Representation) model, which improves the representation capacity by encoding a document into multiple representations; the similarity score between a query and a document is then the maximum score calculated over those dense representations. However, when the percentage of documents answering multiple queries in the training data is low, the multiple vector-based representations of each document tend to collapse into the same one. For brevity, we use the average number of queries (AVQ) per document to measure this percentage. As shown in Table 2, when the AVQ value is small (1.0), the multi-view representations collapse into the same one and the performance of MVR deteriorates sharply.
Considering that the answers to different queries may be scattered across different snippets of one document, and that the viewer tokens in the MVR model are all placed in front of the document, it is difficult for these tokens to perceive information from the snippets that contain the scattered answers. As shown in Figure 1, the two answers are located in the 2nd and 4th snippets in example 1, while the two answers are located in the 5th and 6th snippets in example 2. Since the positions of its viewer tokens are fixed, MVR cannot adaptively capture the scattered answer information, whereas well-learned multi-view representations are expected to capture information from different snippets. In this paper, we propose a CAMVR (Context-Adaptive Multi-View Representation) learning framework, which explicitly avoids the collapse problem by aligning viewer tokens with snippets (as illustrated in Figure 1). Firstly, to allow views to adaptively capture the information of answer snippets in a document, we place each viewer token before each snippet; each view representation is therefore able to aggregate local and global information via the self-attention mechanism of the transformer (Vaswani et al., 2017). Furthermore, we use the view corresponding to the answer snippet of a given query to explicitly supervise the learning task, which allows the view to attend to the corresponding local snippet, thereby increasing view interpretability.

Figure 1: Two example documents with multi-view queries. Example 1, "The Lord of the Rings: The Fellowship of the Ring", answers "When was the first Lord of the Rings made?" (2001, in the 2nd snippet) and "Who is the bad guy in Lord of the Rings?" (Sauron, in the 4th snippet). Example 2, "FIFA World Cup", answers "The only country in the world to have played in every tournament of World Cup football?" (Brazil, in the 5th snippet) and "Italy has won the World Cup how many times?" (four, in the 6th snippet).

Figure 1 also contrasts the model inputs: MVR places all viewer tokens [VIE_1] ... [VIE_n] in front of the document, while CAMVR interleaves each viewer token with its snippet (e.g., [VIE_2] before the snippet containing "2001" and [VIE_4] before the snippet containing "Sauron").

The contributions of this paper are as follows:

• Different from the MVR model, whose representations collapse into the same one when the percentage of documents answering multiple queries in the training data is low, our proposed CAMVR learning framework adaptively aligns each viewer token with a document snippet to capture fine-grained information of local snippets and the global context, which effectively avoids the collapse problem.

• We specify the answer snippet of each positive document to supervise the learning process, which allows the representation to attend to local snippets and provides interpretability for each view representation.

• Extensive experiments evaluate our proposed CAMVR in terms of overall retrieval performance, the collapse problem, view interpretability, the number of views, etc. The experimental results on open-domain retrieval datasets show the effectiveness of our proposed model, which achieves state-of-the-art performance.

2. RELATED WORK

In this section, we review three existing model architectures for dense retrieval: the dual encoder, late interaction, and multi-vector models. As shown in Figure 2, the late interaction and multi-vector models are variants of the dual encoder. The dual encoder (Karpukhin et al., 2020) (Figure 2(a)) is widely used in first-stage document retrieval to obtain relevant documents from a large-scale corpus. In general, the pre-trained model BERT (Devlin et al., 2019) is used as the base model of a dual encoder, considering the text representation ability it has learned from massive data. To improve retrieval performance, Lee et al. (2019) and Chang et al. (2020) further pre-train BERT with the inverse cloze task, body first selection, and wiki link prediction. To make language models ready for dense retrieval tasks, Gao & Callan (2021a;b) propose the pre-training architectures Condenser and coCondenser, where an additional contrastive learning task is considered in coCondenser. Besides pre-training, negative sampling is adopted to improve retrieval performance by mining high-quality negatives. DPR (Karpukhin et al., 2020) picks out hard negatives with a BM25 retriever. Yang et al. (2017) and Xiong et al. (2020) generate hard negatives dynamically with the newest checkpoint during training. Qu et al. (2021) and Ren et al. (2021) use a cross encoder to mine hard negatives.

Figure 2: The comparison of different model architectures designed for dense retrieval: (a) dual encoder (DPR), (b) late interaction (ColBERT), and (c) multi-vector model (MVR).
The interaction between each query and document is also used to improve retrieval performance by exploiting their token-level relevance (Nogueira & Cho, 2019; MacAvaney et al., 2019; Dai & Callan, 2019; Jiang et al., 2020). However, such token-level interaction is computation-intensive, which is impractical for first-stage retrieval over a large-scale document collection. The late interaction paradigm (Figure 2(b)) is therefore proposed to improve computational efficiency: each query and document is encoded independently, and their similarity is computed through a fine-grained interaction mechanism. In ColBERT (Khattab & Zaharia, 2020; Santhanam et al., 2022), the relevance score is the sum over query tokens of the maximum similarity to any document token. Poly-Encoder (Humeau et al., 2020) provides an attention mechanism to learn global features of a document, which is more efficient than directly using all token-level features. COIL (Gao et al., 2021) only calculates the scores of tokens that overlap between the query and document, reducing computational costs. Since the representation capacity of a single vector is limited for a long document, and late interaction between all tokens still incurs high computational costs, multi-vector models (Figure 2(c)) are proposed. ME-BERT (Luan et al., 2020) uses the first m token embeddings as the document representation; however, this loses useful information in the latter part of the document. DRPQ (Tang et al., 2021) encodes a document into class centroids by clustering all token embeddings. MVR (Zhang et al., 2022) encodes a document into multi-view representations, each of which can answer a different view query.
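The ColBERT-style sum-of-max score mentioned above can be illustrated with a minimal NumPy sketch (the function name is ours; token embeddings are assumed to be precomputed):

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style relevance: for each query token embedding, take its
    maximum inner product over all document token embeddings, then sum.
    query_vecs: (num_q_tokens, dim); doc_vecs: (num_d_tokens, dim)."""
    sim = query_vecs @ doc_vecs.T        # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens
```

Because each query token is matched independently, this score captures fine-grained relevance at the cost of storing one vector per document token.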
However, these multi-view representations collapse into the same one when the percentage of documents answering multiple queries in the training data is low. Though DCSR (Hong et al., 2022) also produces multiple vectors per document, there is a gap between its training and inference tasks, which limits its retrieval performance.

3. METHOD

3.1. PRELIMINARY

The dual encoder is widely used in open-domain dense retrieval to encode a query (by the query encoder) and a document (by the document encoder) into single vectors, as shown in Figure 2(a). Given a query q and a document d, the similarity score function is

$$f(q, d) = \mathrm{sim}(E_Q(q), E_D(d)),$$

where $E_Q(\cdot)$ is the query encoder, $E_D(\cdot)$ is the document encoder, and $\mathrm{sim}(\cdot)$ is the inner product. The loss function of the dual encoder is defined as

$$\mathcal{L} = -\log \frac{\exp(f(q, d^+)/\tau)}{\exp(f(q, d^+)/\tau) + \sum_{l}\exp(f(q, d_l^-)/\tau)},$$

where $d^+$ and $d_l^-$ are the positive and negative documents for q, and the hyper-parameter $\tau$ is a scaling factor for model optimization (Sachan et al., 2021).

The MVR model takes the dual encoder architecture to encode the query and document into vectors, as shown in Figure 2(c). For the query encoder, the [CLS] token is replaced with a viewer token [VIE] to produce the query representation:

$$E(q) = E_Q([VIE] \circ q \circ [SEP]),$$

where $E(q)$ is given by the vector of the viewer token [VIE], [SEP] is a special token in BERT, and $\circ$ is the concatenation operation. For the document encoder, viewer tokens $[VIE_i]\ (i = 1, \ldots, n)$ are placed at the beginning of the document to produce multi-view representations:

$$E_1(d), E_2(d), \ldots, E_n(d) = E_D([VIE_1] \cdots [VIE_n] \circ d \circ [SEP]),$$

where the i-th view representation $E_i(d)$ is given by the vector of $[VIE_i]$. The similarity between the query q and the i-th view of the document is defined as $f_i(q, d) = \mathrm{sim}(E(q), E_i(d))$, and the overall similarity score of query q and document d is aggregated as $f(q, d) = \max_i\{f_i(q, d)\}$. Different from the dual encoder loss, MVR further adds a regularization over the views to alleviate the collapse problem.
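The max-over-views aggregation $f(q, d) = \max_i f_i(q, d)$ can be sketched as follows (a minimal NumPy sketch operating on precomputed encoder outputs; the function name is ours):

```python
import numpy as np

def mvr_score(query_vec, doc_view_vecs):
    """MVR similarity: inner product of the query vector with each of the
    n document view vectors, aggregated by a max over views.
    query_vec: (dim,); doc_view_vecs: (n_views, dim).
    Returns the overall score f(q, d) and the index of the selected view."""
    per_view = doc_view_vecs @ query_vec  # f_i(q, d) for i = 1..n
    return float(per_view.max()), int(per_view.argmax())
```

The selected view index is what MVR implicitly uses during training, whereas CAMVR (Section 3.2) supervises it explicitly with the answer snippet.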
Formally, the final loss function is given as

$$\mathcal{L} = -\log \frac{\exp(f(q, d^+)/\tau)}{\exp(f(q, d^+)/\tau) + \sum_{l}\exp(f(q, d_l^-)/\tau)} - \lambda \log \frac{\exp(f(q, d^+)/\tau)}{\sum_{i}\exp(f_i(q, d^+)/\tau)},$$

where $\lambda$ is a hyper-parameter.
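The loss above can be computed from precomputed similarity scores as in this sketch (NumPy only; the function name and default hyper-parameter values are ours, not from the paper):

```python
import numpy as np

def mvr_loss(pos_view_scores, neg_scores, tau=1.0, lam=0.1):
    """MVR training loss (a sketch). pos_view_scores: f_i(q, d+) for all
    views of the positive document; neg_scores: f(q, d_l^-) for negatives.
    tau is the temperature, lam the regularization weight."""
    f_pos = np.max(pos_view_scores)                      # f(q, d+) = max_i f_i(q, d+)
    logits = np.concatenate(([f_pos], neg_scores)) / tau
    # Contrastive term: positive against positive + all negatives.
    contrastive = -(f_pos / tau - np.log(np.sum(np.exp(logits))))
    # Regularizer: sharpens the distribution over views toward the selected one.
    view_logits = np.asarray(pos_view_scores) / tau
    reg = -(f_pos / tau - np.log(np.sum(np.exp(view_logits))))
    return contrastive + lam * reg
```

With `lam=0` this reduces to the plain dual-encoder contrastive loss; the regularizer only adds pressure for the views of the positive document to stay distinguishable.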

3.2. CONTEXT-ADAPTIVE MULTI-VIEW REPRESENTATION

In the MVR model, the input of a viewer token is given as

$$e_i = v_i + p_0 + o, \quad i = 1, \ldots, n,$$

where n is the number of viewer tokens, $v_i$ is the token embedding of $[VIE_i]$, and $p_0$ and $o$ are the position and segment embeddings, respectively. Since $p_0$ and $o$ are identical for all viewer tokens, this equation shows that all differences between the view representations come from the token embeddings. From another perspective, it can be equivalently rewritten as

$$e_i = \tilde{v}_0 + \tilde{p}_i + o, \quad i = 1, \ldots, n,$$

where $\tilde{v}_0 = p_0$ and $\tilde{p}_i = v_i$ for $i = 1, \ldots, n$. That is, the viewer token embedding $\tilde{v}_0$ is fixed while the position embedding varies. However, as shown in Figure 1, the answer snippets for different queries are scattered across a document. Views with n fixed position embeddings cannot capture the fine-grained information of these scattered answer snippets. When the percentage of documents answering multiple queries in the training data is low, there is insufficient signal to align the document views with different queries; the multi-view representations therefore tend to collapse into the same one and performance deteriorates sharply.

In this paper, we propose a CAMVR (Context-Adaptive Multi-View Representation) learning framework in which the position embeddings of the viewer tokens adapt to the context of a document. As shown in Figure 1, it is crucial for different views to perceive fine-grained information in different answer snippets. Since each document snippet may answer some queries, we place a viewer token before each snippet, aiming to capture the information of the local snippet and the global context.
Figure 3 depicts the architecture: the query encoder produces a single [VIE] representation, the document encoder produces one [VIE_i] representation per snippet, and training matches the query against the answer view of the positive document (sim+) and against the max-pooled views of negative documents (maxsim−).

To guarantee the sentence completeness of each snippet, we design a heuristic method to partition a document into snippets; the details are described in Algorithm 1. For a document d, we obtain the n snippets $[s_1, \ldots, s_n]$ with the document splitting algorithm. The input contexts of the query and document are then given as

$$q: [VIE] \circ q \circ [SEP], \qquad d: [VIE_1] \circ s_1 \circ \cdots \circ [VIE_n] \circ s_n \circ [SEP].$$

We treat viewer tokens and document tokens equally in position embedding, and let the viewer token embeddings be the same as those of the MVR model. Thus, the input of $[VIE_i]$ is given as

$$e_i = v_i + p_{a_i} + o, \quad i = 1, \ldots, n; \ a_i \in [0, L],$$

where $a_i$ is the position of the i-th viewer token in the sequence and L is the maximum sequence length. The position embedding is the same as that of the vanilla pre-trained model, mainly considering its capability of perceiving local and global information, which is proven in the MLM (Masked Language Model) task. In addition, distinct viewer token embeddings help to distinguish adjacent views, making their representations complementary and avoiding the collapse problem. We use the last hidden states of the dual encoder as the representations of the query and document:

$$E(q) = E_Q([VIE] \circ q \circ [SEP]),$$
$$E_1(d), E_2(d), \ldots, E_n(d) = E_D([VIE_1] \circ s_1 \circ \cdots \circ [VIE_n] \circ s_n \circ [SEP]),$$

where $E(q)$ is the query representation and $E_i(d)\ (i = 1, \ldots, n)$ is the i-th view representation of the document. The definition of the similarity function between query and document follows the MVR model. In our method, each view representation contains the information of the local snippet and the global context. To make it attend more to the local snippet, we specify the view of the answer snippet to align with the query during training, as shown in Figure 3.
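The interleaved input construction above can be sketched as follows (token lists stand in for the tokenizer output; the special-token strings and function name are ours):

```python
def build_camvr_inputs(query_tokens, snippets):
    """Assemble CAMVR encoder inputs (a sketch; [VIE_i]/[SEP] stand for
    special tokens added to the model vocabulary).
    Query:    [VIE] q [SEP]
    Document: [VIE_1] s_1 ... [VIE_n] s_n [SEP]"""
    query_input = ["[VIE]"] + list(query_tokens) + ["[SEP]"]
    doc_input = []
    for i, snippet_tokens in enumerate(snippets, start=1):
        doc_input.append(f"[VIE_{i}]")   # viewer token placed before its snippet
        doc_input.extend(snippet_tokens)
    doc_input.append("[SEP]")
    return query_input, doc_input
```

Because each `[VIE_i]` immediately precedes its snippet, its position embedding shifts with the document's context, which is exactly the context-adaptive property described above.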
Let the i-th view be the answer view; the loss function of CAMVR is then defined as

$$\mathcal{L} = -\log \frac{\exp(f_i(q, d^+)/\tau)}{\exp(f_i(q, d^+)/\tau) + \sum_{l}\exp(f(q, d_l^-)/\tau)},$$

where $f_i(q, d^+)$, the similarity between the query and the answer view of the positive document, serves as the query-positive similarity during training. At inference, the similarity score is obtained by maximizing over the similarities between the query representation and the document's multi-view representations. We first use the document encoder to encode all documents in the retrieval corpus into multi-view representations and build an index over every document view. We then retrieve the relevant documents for a given query, accelerating the retrieval process with the ANN (Approximate Nearest Neighbor) technique (Johnson et al., 2019). The time complexity of CAMVR is the same as that of MVR, and lower than that of other multi-vector models such as DRPQ (Tang et al., 2021) and DCSR (Hong et al., 2022).
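The inference procedure — index every view vector, then max-pool view scores per document — can be illustrated with a brute-force stand-in for the ANN index (NumPy only; in practice an ANN library such as Faiss would replace the exhaustive dot product):

```python
import numpy as np

def retrieve(query_vec, doc_view_matrix, doc_ids, top_k=5):
    """Brute-force multi-view retrieval sketch. Every view vector of every
    document is a row of doc_view_matrix (num_views_total, dim), and
    doc_ids[i] names the document that owns row i. A document's score is
    the max over its views; the top_k documents by that score are returned."""
    view_scores = doc_view_matrix @ query_vec
    best = {}
    for did, score in zip(doc_ids, view_scores):  # max-pool views per document
        best[did] = max(best.get(did, -np.inf), score)
    return sorted(best, key=best.get, reverse=True)[:top_k]
```

Since each document contributes n rows rather than one, the index is n times larger than a single-vector index, but the per-query search logic is unchanged.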

4. EXPERIMENTS

4.1. DATASETS

SQuAD (Rajpurkar et al., 2016) is a crowdsourced reading comprehension dataset. The version used in this paper is created by DPR (Karpukhin et al., 2020) and contains 70k training examples. Since the test data is not released, we perform the retrieval evaluations in terms of top-5/20/100 accuracy on the development set.

Natural Question (NQ) (Kwiatkowski et al., 2019) is a large dataset for open-domain QA. All queries are scraped from the Google search engine and anonymized, and the documents are collected from Wikipedia articles. Following DPR (Karpukhin et al., 2020), 60k training examples of NQ are used.

TriviaQA (Joshi et al., 2017) is a reading comprehension dataset containing over 650k query-answer-document triples. The documents are collected from Wikipedia and the Web. In our experiments, we leverage the version released by DPR (Karpukhin et al., 2020), which contains 60k training examples.

The retrieval corpus used in our experiments contains 21,015,324 documents. Following DPR (Karpukhin et al., 2020), all documents are non-overlapping chunks of 100 words.

4.2. IMPLEMENTATION DETAILS

We use 2 NVIDIA Tesla A100 GPUs (with 40G RAM) to train the MVR and CAMVR models with a batch size of 56 per GPU. To save computational resources, the maximum lengths of the query and document are set to 128 and 256, respectively. In addition, automatic mixed precision and gradient checkpointing (Chen et al., 2016) are used. We use the Adam optimizer with a learning rate of 1e-5, and the dropout rate is set to 0.1. Other hyper-parameter settings follow Zhang et al. (2022). To make a fair comparison with baseline models, we follow the mined hard negatives strategy adopted in coCondenser (Gao & Callan, 2021a) and MVR (Zhang et al., 2022) in Section 4.3. Note that we use the pre-trained coCondenser as the base model without any further pre-training.

4.3. RETRIEVAL PERFORMANCE

We compare our CAMVR model with previous state-of-the-art models, which fall into two categories according to their document representations: single-vector models and multi-vector models. The single-vector models we compare against include ANCE (Xiong et al., 2020), RocketQA (Qu et al., 2021), Condenser (Gao & Callan, 2021b), DPR-PAQ (Oguz et al., 2022), and coCondenser (Gao & Callan, 2021a). For multi-vector models, we consider DRPQ (Tang et al., 2021) and MVR (Zhang et al., 2022). As shown in Table 1, our CAMVR model outperforms all existing models. Specifically, CAMVR achieves 0.7 (SQuAD), 0.4 (NQ), and 3.0 (Trivia QA) points of top-5 accuracy improvement over the MVR model. The improvement is most significant on Trivia QA; the reason is that the AVQ value of Trivia QA (1.2) is small, so the performance of MVR decreases as its multi-view representations tend to collapse into the same one. In contrast, the multi-view representations of our CAMVR model capture the information of the local snippet and the global context. In conclusion, the improvements on the SQuAD, NQ, and Trivia QA datasets suggest that our CAMVR model performs well across different AVQ values, especially small ones.

4.4. COLLAPSE ANALYSIS

In this part, we study the collapse problem from which multi-vector models may suffer. We construct a new training set with an AVQ value of 1 by extracting samples from the NQ dataset, and then compare the retrieval performance of MVR and CAMVR when each view representation is used alone. As shown in Table 2, the performance of the MVR model with each single view representation is nearly identical to its overall accuracy, which indicates that all view representations collapse into the same one. In contrast, the performances of the individual view representations in our CAMVR are distinguishable, and the overall performance is significantly better than that of any single view, since the multiple view representations complement each other.

To learn multiple distinguishable view representations, the attention probability vectors of the views are expected to differ; that is, different views should attend to different contextual information.

Table 3: The KL divergences between attention vectors of viewer tokens for MVR and CAMVR.

4.5. VIEW INTERPRETABILITY ANALYSIS

In this section, we conduct a view interpretability analysis of our CAMVR. In the MVR model, the views are not interpretable because the answer view is selected through the max-pooling operation during training, as shown in Figure 2(c). For CAMVR, we directly specify the answer view of the positive document for each $(q, d^+)$ pair to supervise the learning process, which forces each view representation to capture more information from its corresponding snippet. To analyse the interpretability of each view, we compare the overall retrieval accuracy with the accuracy achieved exactly by the answer view. For a given query $q_i$, we denote the top-K retrieved documents as $[d_1, \ldots, d_K]$ and their corresponding matched snippets as $\hat{S}_K = [ms_1, \ldots, ms_K]$; the answer view accuracy is then defined as

$$ACC^K_{ans} = \frac{1}{N_q}\sum_{i=1}^{N_q} \mathbb{I}(\text{at least one snippet in } \hat{S}_K \text{ contains the answer of } q_i),$$

where $N_q$ is the number of test queries and $\mathbb{I}(\cdot)$ is the indicator function. As shown in Table 4, the gaps between the overall performance and the answer view performance on the three datasets at top-5/20/100 are small, which indicates that the answer view representation successfully captures the information of the corresponding snippet in which the answer is located.
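The $ACC^K_{ans}$ metric can be computed as in this sketch (the function name is ours; answer matching is simplified to substring containment):

```python
def answer_view_accuracy(matched_snippets_per_query, answers):
    """ACC^K_ans: fraction of queries whose K matched (answer-view) snippets
    include at least one snippet containing the gold answer string.
    matched_snippets_per_query[i]: list of K snippet strings for query i;
    answers[i]: the answer string of query i."""
    hits = sum(
        any(answer in snippet for snippet in snippets)
        for snippets, answer in zip(matched_snippets_per_query, answers)
    )
    return hits / len(answers)
```

A small gap between this value and the overall top-K accuracy means the answer view, not some other view, is doing the retrieval.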

To further illustrate the interpretability of the view representations, we take an example from the NQ dataset and evaluate retrieval with MVR and CAMVR. As shown in Figure 4, the answers of the two queries are located in the 3rd and 5th snippets, respectively. For the MVR model, the 3rd view is selected for both queries (a time view and a number view), which indicates that MVR cannot distinguish the two view representations. In contrast, our CAMVR answers both queries correctly, with the 3rd and 5th answer views successfully selected.

4.6. VIEW NUMBER ANALYSIS

We conduct experiments on the three datasets to test the impact of the number of views on retrieval performance. The results with the view number varying from 1 to 12 are shown in Table 5. When the view number increases from 1 to 8, the performance improves greatly; in particular, the top-5 accuracy improvements on SQuAD, Natural Question, and Trivia QA are up to 7.7, 2.7, and 1.8 points. The reason is that the average number of sentences per document in the retrieval corpus is 6.5, so 1 or 4 views are insufficient to capture the fine-grained information of all snippets. The retrieval performance improves only slightly when the view number increases from 8 to 12, since the snippet information answering multi-view queries has almost entirely been captured by the first 8 views.


5. CONCLUSION

In this paper, we propose a context-adaptive multi-view representation learning framework for dense retrieval. Different from the MVR model, whose representations collapse into the same one when the percentage of documents answering multiple queries in the training data is low, our proposed CAMVR learning framework avoids the collapse problem by adaptively aligning each viewer token with a document snippet. In addition, we supervise the learning process by specifying the answer snippets of positive documents and allowing the representations to attend to local snippets, which additionally provides interpretability for each view representation. The experimental results show that our proposed CAMVR avoids the collapse problem and achieves state-of-the-art retrieval performance.

A PERFORMANCE SIGNIFICANCE

To demonstrate that the difference in retrieval performance between CAMVR and MVR is significant, we implement both models based on pre-trained BERT and coCondenser on the three datasets (SQuAD, Natural Question, and Trivia QA), with 5 independent runs for each experimental setting. Due to the constraints of time and computational resources, all models are trained without the mined hard negatives strategy. The experimental results are shown in Table 6. The top-5 accuracy of CAMVR is significantly better than that of MVR at a significance level of 0.05, and the top-20 and top-100 accuracies are also better than or comparable to MVR.


B VIEW CAPACITY ANALYSIS

To evaluate the capacity of each view to capture global information, we investigate the retrieval performance when each view alone is used as the document representation. The results in Table 7 demonstrate that each view of CAMVR also captures document-level information. Specifically, the top-20 accuracies of views 1, 2, 3, and 5 are comparable to coCondenser (a single-vector model) and surpass DPR by at least 9.8 points.



Figure 3: The architecture of the context-adaptive multi-view representation learning framework. sim+ denotes the similarity score between the query representation and the answer view representation of the positive document; maxsim− denotes the similarity score between the query and one negative document.

Algorithm 1 Document Splitting
Input: A document d.
Parameters: View number n and empty sentence b.
Output: A sequence S containing n snippets.
1: Split d into k sentences S = [s_1, s_2, ..., s_k] by the sent_tokenize function in the NLTK toolkit.
2: if k ≤ n then
3:   Add n−k empty sentences b to S and obtain the final snippets S = [s_1, s_2, ..., s_k, b, ..., b].
4: else
5:   while len(S) > n do
6:     Identify the shortest sentence s_i in S.
7:     Merge s_i with its shorter adjacent sentence s_j into a new snippet s'_i.
8:     s_i ← s'_i.
9:     Remove s_j from S.
10:  end while
11: end if
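Algorithm 1 can be rendered in Python as follows (a sketch: the paper splits sentences with NLTK's `sent_tokenize`, which is assumed to have been applied already; sentence length is measured in characters here, whereas the paper does not state the unit):

```python
def split_document(sentences, n):
    """Partition a list of sentences into exactly n snippets (Algorithm 1).
    Pads with empty snippets when there are too few sentences; otherwise
    repeatedly merges the shortest sentence with its shorter neighbour."""
    snippets = list(sentences)
    if len(snippets) <= n:
        return snippets + [""] * (n - len(snippets))  # pad with empty sentences b
    while len(snippets) > n:
        i = min(range(len(snippets)), key=lambda k: len(snippets[k]))  # shortest s_i
        # Pick the shorter adjacent sentence as the merge partner s_j.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(snippets)]
        j = min(neighbours, key=lambda k: len(snippets[k]))
        lo, hi = sorted((i, j))
        snippets[lo] = snippets[lo] + " " + snippets[hi]  # merge into s'_i
        del snippets[hi]                                  # remove s_j from S
    return snippets
```

Each merge reduces the snippet count by one, so the loop terminates after exactly k − n iterations and every snippet remains a sequence of whole sentences.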

Figure 4: Example of the document retrieved by MVR and CAMVR. Correct answers and selected views are written in bold.

Table 1: Retrieval performance on the SQuAD dev, Natural Question test, and Trivia QA test sets. The best-performing models are marked in bold, and unavailable results are left blank. The underlined coCondenser is reproduced by Zhang et al. (2022).

Table 2: Retrieval performance of each view on the Natural Question test set for MVR and CAMVR when the AVQ value of the training data is 1.

Table 4: The comparison of overall performance and answer view performance for CAMVR on the SQuAD, Natural Question, and Trivia QA datasets.

Figure 4 (document excerpt): This is a record of Argentina's results at the FIFA World Cup. [VIE_3] Argentina is one of the most successful national football teams in the world, having won two World Cups in 1978 and 1986. [VIE_4] Argentina has been runners-up three times, in 1930, 1990 and 2014. [VIE_5] The team was present in all but four of the World Cups, being behind only Brazil, Italy and Germany in number of appearances. [VIE_6] Argentina has also won the Copa America 14 times, one less than Uruguay. [VIE_7] Moreover, Argentina has won the Confederations Cup and the gold medal. [VIE_8]

Table 5: Performance with different numbers of views in CAMVR.

Table 6: The performance comparison of MVR and CAMVR based on pre-trained coCondenser and BERT. The asterisk indicates that the result is significant at a significance level of 0.05 according to a standard t-test.


We further study the collapse problem by assessing the distance between each pair of attention vectors in the last layer of the document encoder, using the Kullback-Leibler (KL) divergence as the distance metric. Let $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iL})$ be the attention probability vector of $[VIE_i]$. The KL divergence between $\alpha_i$ and $\alpha_j$ is defined as

$$D_{KL}(i, j) = \sum_{l=1}^{L} \alpha_{il} \log \frac{\alpha_{il}}{\alpha_{jl}},$$

which satisfies $D_{KL}(i, j) \geq 0$, with $D_{KL}(i, j) = 0$ if and only if $\alpha_i = \alpha_j$. We randomly select 1000 documents from the retrieval corpus and calculate the average KL divergence between each pair of attention vectors. The results are shown in Table 3. All averaged KL divergence values between viewer tokens are 0 in MVR, which indicates that all viewer tokens attend to each token with nearly the same weights. In contrast, all averaged KL divergence values of our CAMVR are greater than 0, which indicates that the viewer tokens attend to different contextual information.
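The pairwise divergence above is straightforward to compute from attention vectors (a NumPy sketch; the small epsilon is our addition to keep the logarithm finite when a weight is exactly zero):

```python
import numpy as np

def kl_divergence(alpha_i, alpha_j, eps=1e-12):
    """D_KL between two attention probability vectors over L positions.
    eps guards against zero attention weights (an implementation detail,
    not part of the paper's definition)."""
    p = np.asarray(alpha_i, dtype=float) + eps
    q = np.asarray(alpha_j, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))
```

Averaging this quantity over all viewer-token pairs of a document gives the per-document values reported in Table 3: identical attention patterns yield 0, distinguishable ones yield strictly positive values.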

C HARD NEGATIVES ANALYSIS

Since negative sampling is crucial to improving retrieval performance, as demonstrated in the existing literature, we conduct experiments to understand how hard negatives improve retrieval. To reduce the computational cost of negative sampling, we adopt the two-round training pipeline of coCondenser (Gao & Callan, 2021a). We first train a model using negative samples obtained via BM25, then use the trained model to sample hard negatives for each query. Next, we retrain a model with the augmented training data containing both the BM25 negatives and the hard negatives. As shown in Table 8, the retrieval performance on all three datasets improves significantly with hard negatives, especially the top-5 accuracy.


