APPROXIMATE NEAREST NEIGHBOR NEGATIVE CONTRASTIVE LEARNING FOR DENSE TEXT RETRIEVAL

Abstract

Conducting text retrieval in a learned dense representation space has many intriguing advantages. Yet dense retrieval (DR) often underperforms word-based sparse retrieval. In this paper, we first theoretically show that the bottleneck of dense retrieval is the domination of uninformative negatives sampled in mini-batch training, which yield diminishing gradient norms, large gradient variances, and slow convergence. We then propose Approximate nearest neighbor Negative Contrastive Learning (ANCE), which selects hard training negatives globally from the entire corpus. Our experiments demonstrate the effectiveness of ANCE on web search, question answering, and in a commercial search engine, showing that ANCE dot-product retrieval nearly matches the accuracy of a BERT-based cascade IR pipeline. We also empirically validate our theory that negative sampling with ANCE better approximates the oracle importance sampling procedure and improves learning convergence.

1. INTRODUCTION

Many language systems rely on text retrieval as their first step to find relevant information. For example, search ranking (Nogueira & Cho, 2019), open-domain question answering (OpenQA) (Chen et al., 2017), and fact verification (Thorne et al., 2018) all first retrieve relevant documents for their later-stage reranking, machine reading, and reasoning models. These later-stage models enjoy the advancements of deep learning techniques (Rajpurkar et al., 2016; Wang et al., 2019), while the first-stage retrieval still mainly relies on matching discrete bag-of-words, e.g., BM25, which has become the pain point of many systems (Nogueira & Cho, 2019; Luan et al., 2020; Zhao et al., 2020).

Dense Retrieval (DR) aims to overcome this sparse retrieval bottleneck by matching in a continuous representation space learned via neural networks (Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020). It has many desired properties: a fully learnable representation, easy integration with pretraining, and efficiency support from approximate nearest neighbor (ANN) search (Johnson et al., 2017). These grant dense retrieval an intriguing potential to fundamentally overcome some intrinsic limitations of sparse retrieval, for example, vocabulary mismatch (Croft et al., 2009).

One challenge in dense retrieval is to construct proper negative instances when learning the representation space (Karpukhin et al., 2020). Unlike in reranking (Liu, 2009), where the training and testing negatives are both irrelevant documents from previous retrieval stages, in first-stage retrieval the DR model needs to distinguish all irrelevant documents in a corpus with millions or billions of entries. As illustrated in Fig. 1, these negatives are quite different from those retrieved by sparse models.


Figure 1: t-SNE (Maaten & Hinton, 2008) representations of a query, relevant documents, negative training instances from BM25 (BM25 Neg) or random sampling (Rand Neg), and testing negatives (DR Neg) in dense retrieval.

In this paper, we first theoretically analyze the convergence of dense retrieval training with negative sampling. Using the variance reduction framework (Alain et al., 2015; Katharopoulos & Fleuret, 2018), we show that, under conditions commonly met in dense retrieval, local in-batch negatives lead to diminishing gradient norms, resulting in high stochastic gradient variances and slow training convergence: local negative sampling is the bottleneck of dense retrieval's effectiveness.

Based on our analysis, we propose Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a new contrastive representation learning mechanism for dense retrieval. Instead of random or in-batch local negatives, ANCE constructs global negatives using the being-optimized DR model to retrieve from the entire corpus. This fundamentally aligns the distribution of negative samples in training with that of the irrelevant documents to separate in testing. From the variance reduction perspective, these ANCE negatives lift the upper bound of the per-instance gradient norm, reduce the variance of the stochastic gradient estimation, and lead to faster learning convergence.

We implement ANCE using an asynchronously updated ANN index of the corpus representations. Similar to Guu et al. (2020), we maintain an Inferencer that computes the document encodings in parallel with a recent checkpoint of the being-optimized DR model, and refreshes the ANN index used for negative sampling once it finishes, to keep up with the model training. Our experiments demonstrate the advantage of ANCE in three text retrieval scenarios: standard web search (Craswell et al., 2020), OpenQA (Rajpurkar et al., 2016; Kwiatkowski et al., 2019), and the retrieval system of a commercial search engine.
We also empirically validate our theory: the gradient norms on ANCE-sampled negatives are much bigger than those on local negatives, which improves the convergence of dense retrieval models.

2. PRELIMINARIES

In this section, we discuss the preliminaries of dense retrieval and its representation learning.

Task Definition: Given a query q and a corpus C, the first-stage retrieval is to find a set of documents relevant to the query, D⁺ = {d₁, ..., dᵢ, ..., dₙ}, from C (|D⁺| ≪ |C|), which then serve as input to later, more complex models (Croft et al., 2009). Instead of using sparse term matches and an inverted index, Dense Retrieval calculates the retrieval score f() using similarities in a learned embedding space (Lee et al., 2019; Luan et al., 2020; Karpukhin et al., 2020):

f(q, d) = sim(g(q; θ), g(d; θ)), (1)

where g() is the representation model that encodes the query or document to dense embeddings. The encoder parameter θ provides the main capacity. The similarity function sim() is often simply cosine or dot product, to leverage efficient ANN retrieval (Johnson et al., 2017; Guo et al., 2020).

BERT-Siamese Model: A standard instantiation of Eqn. 1 is the BERT-Siamese/two-tower/dual-encoder model (Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020):

f(q, d) = BERT(q) · BERT(d) = MLP([CLS]_q) · MLP([CLS]_d). (2)

It encodes the query and document separately, with BERT as the encoder g() using the last layer's [CLS] token representation, and applies dot product (·) on them. This enables offline pre-computing of the document encodings and efficient first-stage retrieval. In comparison, the BERT reranker (Nogueira et al., 2019) applies BERT on the concatenation of each to-rerank query-document pair, BERT(q ∘ d), which has explicit access to term-level interactions between query and document through transformer attentions, but is often infeasible in first-stage retrieval, as enumerating all documents in the corpus for each query is too costly.

Learning with Negative Sampling: The effectiveness of DR resides in learning a good representation space that maps queries and relevant documents together, while separating irrelevant ones.
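The dot-product retrieval of Eqn. 1-2 can be sketched as follows. This is a minimal, self-contained illustration, not the paper's model: `encode` stands in for the BERT encoder with a deterministic hashed bag-of-words projection, so the document embeddings can be precomputed and scored against a query by a single matrix product.

```python
import zlib

import numpy as np

def encode(texts, dim=768):
    """Stand-in for g(.; theta): in the paper this is BERT's last-layer
    [CLS] vector through an MLP; here a deterministic hashed bag-of-words
    projection keeps the sketch self-contained."""
    emb = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            emb[i, zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return emb

def score(q_emb, d_emb):
    """f(q, d) = sim(g(q), g(d)) with dot-product similarity (Eqn. 1)."""
    return q_emb @ d_emb.T

corpus = ["dense retrieval with learned embeddings",
          "sparse retrieval with bag of words",
          "cooking pasta at home"]
d_emb = encode(corpus)                 # precomputed offline in a real system
q_emb = encode(["learned dense retrieval"])
scores = score(q_emb, d_emb)[0]
best = int(np.argmax(scores))          # index of the highest-scored document
```

The key property this sketch preserves is that documents are encoded independently of the query, so the corpus side can be computed offline and served from an ANN index.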
The learning of this representation often follows standard learning to rank (Liu, 2009): given a query q, a set of its relevant documents D⁺_q, and irrelevant ones D⁻_q, find the best θ*:

θ* = argmin_θ Σ_q Σ_{d⁺ ∈ D⁺_q} Σ_{d⁻ ∈ D⁻_q} l(f(q, d⁺), f(q, d⁻)). (3)

The loss l() can be binary cross entropy (BCE), hinge loss, or negative log likelihood (NLL). A unique challenge in dense retrieval, which targets first-stage retrieval, is that the irrelevant documents to separate are drawn from the entire corpus (D⁻_q = C \ D⁺_q). This often leads to millions of negative instances, which have to be sampled in training:

θ* = argmin_θ Σ_q Σ_{d⁺ ∈ D⁺} Σ_{d⁻ ∈ D̂⁻} l(f(q, d⁺), f(q, d⁻)). (4)

From here on we omit the subscript q in D_q; all D⁺ and D⁻ are query dependent. A natural choice is to sample negatives D̂⁻ from the top documents retrieved by BM25. However, they may bias the DR model to merely mimic sparse retrieval (Luan et al., 2020). Another way is to sample negatives from local mini-batches, e.g., as in contrastive learning (Oord et al., 2018); however, these local negatives do not significantly outperform BM25 negatives (Karpukhin et al., 2020; Luan et al., 2020).
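The NLL form of the loss l() in Eqn. 3-4 can be sketched as a softmax over the positive and the sampled negatives. This is a toy numpy illustration (the scores are made up), but it shows the property exploited later: negatives already far below the positive contribute almost no loss.

```python
import numpy as np

def nll_loss(pos_score, neg_scores):
    """Negative log likelihood of the positive: l = -log softmax over the
    scores of d+ and the sampled negatives (one choice of l() in Eqn. 4)."""
    logits = np.concatenate(([pos_score], neg_scores))
    logits -= logits.max()                     # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

# uninformative negatives: already far below the positive -> near-zero loss
easy = nll_loss(5.0, np.array([-3.0, -4.0]))
# informative (hard) negatives: scored close to the positive -> large loss
hard = nll_loss(5.0, np.array([4.8, 4.9]))
```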

3. ANALYSES ON THE CONVERGENCE OF DENSE RETRIEVAL TRAINING

In this section, we theoretically analyze the convergence of dense retrieval training. We first show the connections between learning convergence and gradient norms (Sec. 3.1), then we discuss how non-informative negatives in dense retrieval yield less optimal convergence (Sec. 3.2).

3.1. ORACLE NEGATIVE SAMPLING ACCORDING TO PER-INSTANCE GRADIENT-NORM

Let l(d⁺, d⁻) = l(f(q, d⁺), f(q, d⁻)) be the loss function on the training triple (q, d⁺, d⁻), P_{D⁻} the negative sampling distribution for the given (q, d⁺), and p_{d⁻} the sampling probability of negative instance d⁻. A stochastic gradient descent (SGD) step with importance sampling (Alain et al., 2015) is:

θ_{t+1} = θ_t − η (1 / (N p_{d⁻})) ∇_{θ_t} l(d⁺, d⁻), (5)

with θ_t the parameters at the t-th step, θ_{t+1} those after, and N the total number of negatives. The scaling factor 1/(N p_{d⁻}) ensures Eqn. 5 is an unbiased estimator of the non-stochastic gradient on the full data. Then we can characterize the convergence rate of this SGD step as the movement toward the optimal θ*. Following derivations in variance reduction (Katharopoulos & Fleuret, 2018; Johnson & Guestrin, 2018), let g_{d⁻} = (1 / (N p_{d⁻})) ∇_{θ_t} l(d⁺, d⁻) be the weighted gradient; the convergence rate is:

E[Δ_t] = ||θ_t − θ*||² − E_{P_{D⁻}}(||θ_{t+1} − θ*||²) (6)
       = ||θ_t||² − 2θ_tᵀθ* − E_{P_{D⁻}}(||θ_t − η g_{d⁻}||²) + 2θ*ᵀ E_{P_{D⁻}}(θ_t − η g_{d⁻}) (7)
       = −η² E_{P_{D⁻}}(||g_{d⁻}||²) + 2η θ_tᵀ E_{P_{D⁻}}(g_{d⁻}) − 2η θ*ᵀ E_{P_{D⁻}}(g_{d⁻}) (8)
       = 2η E_{P_{D⁻}}(g_{d⁻})ᵀ (θ_t − θ*) − η² E_{P_{D⁻}}(||g_{d⁻}||²) (9)
       = 2η E_{P_{D⁻}}(g_{d⁻})ᵀ (θ_t − θ*) − η² E_{P_{D⁻}}(g_{d⁻})ᵀ E_{P_{D⁻}}(g_{d⁻}) − η² Tr(V_{P_{D⁻}}(g_{d⁻})). (10)

This shows we can obtain a better convergence rate by sampling from a distribution P_{D⁻} that minimizes the variance of the stochastic gradient estimator E_{P_{D⁻}}(||g_{d⁻}||²), or equivalently Tr(V_{P_{D⁻}}(g_{d⁻})), as the estimator is unbiased. The variance reflects how well the stochastic gradient from negative sampling represents the full gradient on all negatives: the latter is ideal but infeasible. Intuitively, we prefer the stochastic estimator to be stable and have small variance. A well-known result in importance sampling (Alain et al., 2015; Johnson & Guestrin, 2018) is that there exists an optimal distribution:

p*_{d⁻} = argmin_{p_{d⁻}} Tr(V_{P_{D⁻}}(g_{d⁻})) ∝ ||∇_{θ_t} l(d⁺, d⁻)||₂. (11)

To prove this, one can apply Jensen's inequality on the gradient variance and verify that Eqn. 11 achieves the minimum; detailed derivations can be found in Johnson & Guestrin (2018). Eqn. 11 shows that the convergence rate can be improved by sampling negatives proportionally to their per-instance gradient norms (though these are too expensive to calculate exactly). Intuitively, a negative instance with a larger gradient norm is more likely to reduce the non-stochastic training loss, and thus should be sampled more frequently than those with diminishing gradients. The correlation between larger gradient norms and better training convergence is also observed in BERT fine-tuning (Mosbach et al., 2021).
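The effect of Eqn. 11 can be checked numerically. The sketch below uses a made-up long-tail gradient profile and scalar gradients as a surrogate for ||∇_θ l(d⁺, d⁻)||, comparing the estimator variance under uniform sampling versus sampling proportional to per-instance gradient norm (with scalar gradients the oracle variance is exactly zero).

```python
import numpy as np

def estimator_variance(grad_norms, p):
    """Variance of the importance-sampled estimate g = grad_i / (N * p_i),
    using scalar gradients as a surrogate for the per-instance gradient."""
    n = len(grad_norms)
    values = grad_norms / (n * p)
    mean = np.sum(p * values)                  # unbiased: equals the true mean
    return np.sum(p * values ** 2) - mean ** 2

# long-tail profile: two informative negatives, many near-zero gradients
grads = np.array([5.0, 4.0, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
uniform = np.full(len(grads), 1.0 / len(grads))
oracle = grads / grads.sum()                   # p* proportional to grad norm (Eqn. 11)

v_uniform = estimator_variance(grads, uniform)
v_oracle = estimator_variance(grads, oracle)   # the provable minimum
```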

3.2. UNINFORMATIVE IN-BATCH NEGATIVES AND THEIR DIMINISHING GRADIENTS

Diminishing Gradients of Uninformative Negatives: Though a closed form of the gradient norm often does not exist, Katharopoulos & Fleuret (2018) derive the following upper bound:

||∇_{θ_t} l(d⁺, d⁻)||₂ ≤ L ρ ||∇_{φ_L} l(d⁺, d⁻)||₂, (12)

where L is the number of layers, ρ is composed of pre-activation weights and gradients in intermediate layers, and ||∇_{φ_L} l(d⁺, d⁻)||₂ is the gradient on the last layer. This upper bound is derived for multi-layer perceptrons of any depth with any Lipschitz-continuous activation function (Katharopoulos & Fleuret, 2018). On more complicated neural networks, whose intermediate layers are regulated by various normalizations, the upper bound often holds empirically (Sec. 6.3). In addition, for many loss functions, for example BCE loss and pairwise hinge loss, we can verify that when the loss goes to zero, the gradient norm of the last layer also goes to zero (Katharopoulos & Fleuret, 2018):

l(d⁺, d⁻) → 0 ⇒ ||∇_{φ_L} l(d⁺, d⁻)||₂ → 0.

Putting everything together, using uninformative negative samples with near-zero loss results in the following chain of undesirable properties:

||∇_{φ_L} l(d⁺, d⁻)||₂ → 0 (low upper bound)
⇒ ||∇_{θ_t} l(d⁺, d⁻)||₂ → 0 (diminishing gradient norm)
⇒ Tr(V_{P_{D⁻}}(g_{d⁻})) ↑ (large stochastic variance)
⇒ E[Δ_t] ↓ (slow convergence). (13)

Uninformative negative samples yield diminishing gradient norms, larger variance of the stochastic gradient estimator, and slower learning convergence.
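The vanishing last-layer gradient is easy to see for a pairwise logistic loss, one instance of the loss family above (a toy illustration, not the exact training loss): both the loss and its last-layer gradient magnitude shrink together as the negative is pushed below the positive.

```python
import numpy as np

def pairwise_loss_and_grad(margin):
    """Pairwise logistic loss l = log(1 + exp(s_neg - s_pos)) and the
    magnitude of its last-layer gradient |dl/d(margin)| = sigmoid(margin)."""
    return float(np.log1p(np.exp(margin))), float(1.0 / (1.0 + np.exp(-margin)))

# a negative already separated from the positive by a wide margin
loss_easy, grad_easy = pairwise_loss_and_grad(-10.0)
# a hard negative scored on par with the positive
loss_hard, grad_hard = pairwise_loss_and_grad(0.0)
```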

4. APPROXIMATE NEAREST NEIGHBOR NOISE CONTRASTIVE ESTIMATION

Our analyses show the importance, if not necessity, of constructing negatives globally from the corpus to avoid uninformative negatives and obtain better learning convergence. In this section, we propose Approximate nearest neighbor Negative Contrastive Estimation (ANCE), which selects negatives from the entire corpus using an asynchronously updated ANN index. ANCE samples negatives from the top documents retrieved by the DR model from the ANN index:

θ* = argmin_θ Σ_q Σ_{d⁺ ∈ D⁺} Σ_{d⁻ ∈ D⁻_ANCE} l(f(q, d⁺), f(q, d⁻)),

with D⁻_ANCE = ANN_{f(q,d)} \ D⁺, where ANN_{f(q,d)} denotes the top documents retrieved by f() from the ANN index. ANCE can pair with many DR models. For simplicity, we use BERT-Siamese (Eqn. 2), with encoder weights shared between q and d, and the negative log likelihood (NLL) loss (Luan et al., 2020).

Asynchronous Index Refresh: During stochastic training, the DR model f() is updated every mini-batch. Maintaining an up-to-date ANN index to select fresh ANCE negatives is challenging, as an index update requires two operations: 1) Inference: refresh the representations of all documents in the corpus with the updated DR model; 2) Index: rebuild the ANN index using the updated representations. Although Index is efficient (Johnson et al., 2017), Inference is too expensive to run per batch, as it requires a forward pass over a corpus much bigger than a training batch. Thus we implement an asynchronous index refresh similar to Guu et al. (2020) and update the ANN index once every m batches, i.e., with checkpoint f_k. As illustrated in Fig. 2, besides the Trainer, we run an Inferencer that takes the latest checkpoint (e.g., f_k) and recomputes the encodings of the entire corpus. In parallel, the Trainer continues its stochastic learning using D⁻_{f_{k-1}} from ANN_{f_{k-1}}. Once the corpus is re-encoded, the Inferencer updates the index (ANN_{f_k}) and feeds it to the Trainer, e.g., through a shared file system.
In this process, the ANCE negatives (D⁻_ANCE) are asynchronously updated to "catch up" with the stochastic training, with an async-gap determined by the computing resources allocated to the Inferencer. Our experiment in Sec. 6.4 studies the influence of this async-gap on learning convergence.
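The Trainer/Inferencer interplay can be sketched as a single-process simulation. Everything here is a stand-in under stated assumptions: brute-force search replaces the ANN index, a linear map replaces the DR encoder, and the gradient step is a placeholder; the real system runs the Inferencer in parallel and swaps the index through a shared file system.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 8))        # toy document features

def encode(docs, theta):
    return docs @ theta                     # stand-in for the DR encoder g()

def refresh_index(theta):
    """Inferencer: re-encode the whole corpus with a recent checkpoint and
    rebuild the index (brute-force search stands in for the ANN index)."""
    return encode(corpus, theta)

def ance_negative(index, q_emb, positives, k=10):
    """Sample one negative from the top-k retrieved documents, excluding D+."""
    top = np.argsort(-(index @ q_emb))[:k]
    candidates = [int(d) for d in top if int(d) not in positives]
    return int(rng.choice(candidates))

theta = rng.normal(size=(8, 8))
index = refresh_index(theta)                   # built from checkpoint f_{k-1}
refresh_every = 50                             # "once every m batches"
for step in range(1, 201):                     # Trainer loop
    q_emb = rng.normal(size=8)
    neg = ance_negative(index, q_emb, positives={0})
    theta += 0.001 * rng.normal(size=(8, 8))   # placeholder parameter update
    if step % refresh_every == 0:              # Inferencer finished: swap index
        index = refresh_index(theta)           # index lags by at most m batches
```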

5. EXPERIMENTAL METHODOLOGIES

This section describes our experimental setups. More details can be found in Appendix A.1 and A.2.

Benchmarks:

The web search experiments use the TREC 2019 Deep Learning (DL) Track (Craswell et al., 2020). It is a standard ad hoc retrieval benchmark: given web queries from Bing, retrieve passages or documents from the MS MARCO corpus (Bajaj et al., 2016). We use the official setting and focus on first-stage retrieval, but also show results when reranking the top-100 BM25 candidates. The OpenQA experiments use Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (TQA) (Joshi et al., 2017), following the exact settings of Karpukhin et al. (2020). The metrics are Coverage@20/100, which evaluate whether the top-20/100 retrieved passages include the answer. We also evaluate whether ANCE's better retrieval propagates to better answer accuracy, by running state-of-the-art systems' readers on top of ANCE retrieval. The readers are RAG-Token (Lewis et al., 2020b) on NQ and the DPR Reader on TQA, using their suggested settings. We also study the effectiveness of ANCE in the first-stage retrieval of a commercial search engine's production system: we change the training of a production-quality DR model to ANCE and evaluate the offline gains under various corpus sizes, encoding dimensions, and exact/approximate search.

Baselines: In TREC DL, we include the best runs in the relevant categories and refer to Craswell et al. (2020) for more baseline scores. We implement various DR baselines using the same BERT-Siamese (Eqn. 2) but different training negative constructions: random sampling in batch (Rand Neg), random sampling from the BM25 top-100 (BM25 Neg) (Lee et al., 2019; Gao et al., 2020b), and a 1:1 combination of BM25 and random negatives (BM25 + Rand Neg) (Karpukhin et al., 2020). We also compare with contrastive learning/Noise Contrastive Estimation, which uses the hardest negatives in batch (NCE Neg) (Gutmann & Hyvärinen, 2010; Oord et al., 2018; Chen et al., 2020a). In OpenQA, we compare with DPR, BM25, and their combinations (Karpukhin et al., 2020).
Implementation Details: In TREC DL, recent research found the MARCO passage training labels cleaner (Yan et al., 2019) and that BM25 negatives help train dense retrieval (Karpukhin et al., 2020; Luan et al., 2020). Thus, we include a "BM25 Warm Up" setting (BM25 → *), where the DR models are first trained using the MARCO official BM25 negatives. ANCE is also warmed up with BM25 negatives. All DR models in TREC DL are fine-tuned from RoBERTa base (Liu et al., 2019). In OpenQA, we warm up ANCE using the released DPR checkpoints (Karpukhin et al., 2020). To fit long documents into BERT-Siamese, ANCE uses the two settings from Dai & Callan (2019b): FirstP, which uses the first 512 tokens of the document, and MaxP, where the document is split into 512-token passages (maximum 4) and the passage-level scores are max-pooled. The max-pooling operation is natively supported in ANN search.

The ANN search uses the Faiss IndexFlatIP index (Johnson et al., 2017). We use batch size 8 and gradient accumulation step 2 on 4 V100 32GB GPUs. For each positive, we uniformly sample one negative from the ANN top-200 (weighted sampling and/or sampling from the top-100 also work well). We measured ANCE efficiency using one 32GB V100 GPU, an Intel(R) Xeon(R) Platinum 8168 CPU, and 650GB of RAM. In asynchronous training, we allocate equal amounts of GPUs to the Trainer and the Inferencer, often four or eight each. The Trainer produces a model checkpoint every 5k or 10k training batches. The Inferencer loads the most recent checkpoint and computes the embeddings of the corpus in parallel. Once the embedding calculation finishes, a new ANN index is built and the Trainer switches to it for negative construction. All communication is through a shared file system. On MS MARCO, the ANN negative index is refreshed about every 10k training steps.
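The MaxP document scoring described above amounts to a max over passage-level dot products. A short sketch with random vectors (in the real system the passage embeddings come from BERT-Siamese):

```python
import numpy as np

def maxp_score(q_emb, passage_embs):
    """MaxP: split a long document into up-to-4 passages of 512 tokens and
    max-pool the passage-level dot products into one document score."""
    return float(np.max(passage_embs @ q_emb))

rng = np.random.default_rng(1)
q = rng.normal(size=16)
passages = rng.normal(size=(4, 16))          # embeddings of one document's passages
doc_score = maxp_score(q, passages)
firstp_score = float(passages[0] @ q)        # FirstP: first 512 tokens only
```

Because a document's score is the max over its passages, indexing every passage and taking the best hit per document reproduces MaxP inside a standard ANN search, which is why max-pooling is "natively supported" there.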
Table 4: End-to-end OpenQA answer accuracy on NQ and TQA.

Model                                           NQ    TQA
T5-11B (Closed) (Roberts et al., 2020)          34.5   -
T5-11B + SSM (Closed) (Roberts et al., 2020)    36.6   -
REALM (Guu et al., 2020)                        40.4   -
DPR (Karpukhin et al., 2020)                    41.5  56.8
DPR + BM25 (Karpukhin et al., 2020)             39.0  57.0
RAG-Token (Lewis et al., 2020b)                 44.1  55.2
RAG-Sequence (Lewis et al., 2020b)              44.5  56.1
ANCE + Reader                                   46.0  57.5

6. EVALUATION RESULTS

In this section, we first evaluate the effectiveness and efficiency of ANCE. Then we empirically study the convergence of ANCE training and the influence of the asynchronous gap. More comparisons of dense and sparse retrieval, hyperparameter study, and case study are in Appendix.

6.1. EFFECTIVENESS

In web search (Table 1), ANCE significantly outperforms all sparse retrieval baselines, including the BERT-based DeepCT (Dai et al., 2019). Among DR models with different training strategies, ANCE is the only one that robustly exceeds sparse methods in document retrieval. In OpenQA, ANCE outperforms DPR and its fusion with BM25 (DPR+BM25) in retrieval accuracy (Table 2). It also improves end-to-end QA accuracy, using the same readers as previous state-of-the-art systems but with the ANCE retriever (Table 4). ANCE's effectiveness is even more pronounced in real production (Table 3). Among all DR models, ANCE has the smallest gap between its retrieval and reranking accuracy, showing the importance of global negatives when training retrieval models. ANCE retrieval nearly matches the accuracy of the cascade IR pipeline with an interaction-based BERT reranker (Nogueira & Cho, 2019), even though BERT-Siamese does not explicitly capture term-level interactions. With ANCE, we can learn a representation space that effectively captures the finesse of search relevance.

6.2. EFFICIENCY

The efficiency of ANCE (FirstP) in TREC DL document retrieval is shown in Table 5. In serving, we measure the online latency to retrieve/rerank 100 documents per query, with queries batched. DR is 100x faster than BERT Rerank, a natural benefit of BERT-Siamese, where the document encodings are computed offline and separately from the query. In comparison, the interaction-based BERT reranker runs BERT once per query-document pair. The bulk of the training compute is in encoding the corpus for ANCE negative construction, which is mitigated by making the index refresh asynchronous.

6.3. EMPIRICAL ANALYSES ON TRAINING CONVERGENCE

We first show the long-tail distribution of search relevance in dense retrieval. As plotted in Fig. 3, each query has a few instances with significantly higher retrieval scores, while the majority form a long tail. In retrieval/ranking, the key challenge is to distinguish the relevant documents among those highest-scored ones; the rest are trivially irrelevant. We also empirically measure the probability that local in-batch negatives include informative negatives (D⁻_*), by their overlap with the top-100 highest-scored negatives. This probability, whether using NCE Neg or Rand Neg, is zero, as our theory predicts. In comparison, the overlap between BM25 Neg and the top dense-retrieved negatives is 15%, while that of ANCE negatives starts at 63% and converges to 100% by design.

We then empirically validate our theory that local negatives lead to lower loss, bounded gradient norms, non-ideal importance sampling, and thus slow convergence (Eqn. 13). The training loss and pre-clip gradient norms during DR training are plotted in Fig. 4. As expected, the uninformative local negatives result in near-zero gradient norms, while ANCE global negatives maintain higher gradient norms. The gradient norm of the last layer of BERT-Siamese during ANCE training (black dotted lines in Fig. 4) is consistently bigger than those of the other layers, empirically aligning with the upper bound in Eqn. 12. Also, as our theory suggests, the gradient norms on local negatives are bounded close to zero, while those on ANCE global negatives are bigger by orders of magnitude. This confirms that ANCE better approximates the oracle importance sampling distribution (p*_{d⁻} ∝ ||∇_{θ_t} l(d⁺, d⁻)||₂) and improves learning convergence.
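The overlap measurement above can be sketched in a few lines. The document ids below are synthetic and merely chosen to mirror the reported 0%/15%/63% overlaps; they are not data from the experiments.

```python
def overlap_at_k(train_negatives, test_top_k):
    """Fraction of the top-k highest-scored testing negatives that also
    appear in a training negative pool."""
    return len(set(train_negatives) & set(test_top_k)) / len(test_top_k)

test_top = list(range(100))                    # ids of the top-100 scored negatives
rand_neg = list(range(5000, 5100))             # random/in-batch: disjoint from the head
bm25_neg = list(range(15)) + list(range(200, 285))
ance_neg = list(range(63)) + list(range(300, 337))
```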

6.4. IMPACT OF ASYNCHRONOUS GAP

The efficiency constraints enforce an asynchronous gap (async-gap) in ANCE training: the negatives are selected using encodings from an earlier state of the being-optimized DR model. The async-gap is determined by the target index refreshing rate, which in turn depends on the allocation of computing resources between the Trainer and the Inferencer, as well as on the learning rate. This experiment studies the impact of this async-gap. The training curves and testing NDCG of different configurations are plotted in Fig. 5. A too-large async-gap, either from a large learning rate (Fig. 5 In many scenarios, using the same amount of extra GPUs for ANCE as a one-time training cost is a good return on investment. The efficiency bottleneck in production is often in inference and serving.

7. RELATED WORK

In early research on neural information retrieval (Neu-IR) (Mitra & Craswell, 2018), a common belief was that interaction models, those that explicitly model term-level matches, are more effective though more expensive (Guo et al., 2016; Xiong et al., 2017; Nogueira & Cho, 2019). Many techniques have been developed to reduce their cost, for example, distillation (Gao et al., 2020a) and caching (Humeau et al., 2020; Khattab & Zaharia, 2020; MacAvaney et al., 2020). ANCE shows that a properly trained representation-based BERT-Siamese can be effective as well. This finding will motivate many new research explorations in Neu-IR.

Deep learning has been used to improve various components of sparse retrieval, for example, term weighting (Dai & Callan, 2019b), query expansion (Zheng et al., 2020), and document expansion (Nogueira et al., 2019). Dense Retrieval chooses a different path and conducts retrieval purely in the embedding space via ANN search (Lee et al., 2019; Chang et al., 2020; Karpukhin et al., 2020; Luan et al., 2020). This work demonstrates that a simple dense retrieval system can achieve state-of-the-art accuracy, while behaving dramatically differently from sparse retrieval. The recent advancement in dense retrieval may give rise to a new generation of search systems.

Recent research in contrastive representation learning also shows the benefits of sampling negatives from a larger candidate pool. In computer vision, He et al. (2020) decouple the negative sampling pool size from the training batch size by maintaining a negative candidate pool of recent batches and updating their representations with momentum. This enlarged negative pool significantly improves unsupervised visual representation learning (Chen et al., 2020b). A parallel work (Xiong et al., 2021) improves DPR by sampling negatives from a memory bank (Wu et al., 2018), in which the representations of negative candidates are frozen so more candidates can be stored. Instead of a bigger local pool, ANCE goes all the way along this trajectory and constructs negatives globally from the entire corpus, using an asynchronously updated ANN index.

Besides being a real-world application itself, dense retrieval is also a core component in many other language systems, for example, to retrieve relevant information for grounded language models (Khandelwal et al., 2020; Guu et al., 2020), extractive/generative QA (Karpukhin et al., 2020; Lewis et al., 2020b), and fact verification (Xiong et al., 2021), or to find paraphrase pairs for pretraining (Lewis et al., 2020a). In those systems, the dense retrieval models are either frozen or optimized indirectly by signals from their end tasks. ANCE is orthogonal to those lines of research and focuses on representation learning for dense retrieval. Its better retrieval accuracy can benefit many other language systems.

8. CONCLUSION

In this paper, we first provide theoretical analyses on the convergence of representation learning in dense retrieval. We show that, under common conditions in text retrieval, local in-batch negatives are uninformative, yield diminishing gradient norms, and slow down convergence, and we propose ANCE, which constructs global negatives from the entire corpus to address this bottleneck.

A.1 MORE EXPERIMENTAL DETAILS

More Details on TREC DL Benchmarks: The training data come from MS MARCO (Bajaj et al., 2016). The document corpus was post-constructed by back-filling the body texts of the passages' URLs, and document labels were inherited from the passages (Craswell et al., 2020). The testing sets were labeled by NIST assessors on the top-10 ranked results from past Track participants (Craswell et al., 2020). TREC DL official metrics include NDCG@10 on test and MRR@10 on MARCO Passage Dev. MARCO Document Dev is noisy, and recall on the DL Track test is less meaningful due to low label coverage on DR results.

There is a two-year gap between the construction of the passage training data and the back-filling of the full document content. Some original documents were no longer available, there was a decent amount of content change in those documents during the two-year gap, and many no longer contain their passages. This back-filling is perhaps why many Track participants found the passage training data more effective than the inherited document labels. Note that the TREC testing labels are not affected, as the annotators judged the same document contents. All TREC DL runs are trained on these training data. Their inference results on the testing queries of the document and passage retrieval tasks were evaluated by NIST assessors using the standard TREC-style pooling technique (Voorhees, 2000). The pooling depth is 10, that is, the top-10 ranked results from all participating runs were assessed, and these labels are released as the official TREC DL benchmarks for the passage and document retrieval tasks.
More Details on Baselines: The most representative sparse retrieval baselines in TREC DL include the standard BM25 ("bm25base" or "bm25base_p"), Best TREC Sparse Retrieval ("bm25tuned_rm3" or "bm25tuned_prf_p") with tuned query expansion (Lavrenko & Croft, 2017) , and Best DeepCT ("dct_tp_bm25e2", doc only), which uses BERT to estimate the term importance for BM25 (Dai & Callan, 2019a) . These three runs represent the standard sparse retrieval, best classical sparse retrieval, and the recent progress of using BERT to improve sparse retrieval. We also include the standard cascade retrieval-and-reranking systems BERT Reranker ("bm25exp_marcomb" or "p_exp_rm3_bert"), which is the best run using standard BERT on top of query/doc expansion, from the groups with multiple top MARCO runs (Nogueira & Cho, 2019; Nogueira et al., 2019) .


BERT-Siamese Configurations: We follow the network configurations of Luan et al. (2020) in all Dense Retrieval methods, which we found provide the most stable results. More specifically, we initialize the BERT-Siamese model with RoBERTa base (Liu et al., 2019) and add a 768 × 768 projection layer on top of the last layer's "[CLS]" token, followed by a layer norm. We use BERT-Siamese, NLL loss, and dot product to be consistent with recent research. We have obtained better accuracy with more vectors per document, BCE loss, and cosine similarity, but that is not the focus of this paper.
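The embedding head described above can be sketched as follows. This is a numpy stand-in with random weights (`embedding_head` is a hypothetical name; the real head sits on top of RoBERTa and is trained end-to-end), showing only the 768 × 768 projection followed by layer norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def embedding_head(cls_vec, W, b):
    """768x768 linear projection on the [CLS] vector, then layer norm,
    as in the configuration described above (weights here are random)."""
    return layer_norm(cls_vec @ W + b)

dim = 768
W = rng.normal(scale=0.02, size=(dim, dim))
b = np.zeros(dim)
cls = rng.normal(size=dim)                  # stand-in for RoBERTa's [CLS] output
emb = embedding_head(cls, W, b)             # the final query/document embedding
```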

Implementation Details:

The training often takes about 1-2 hours per ANCE epoch; that is, whenever a new set of ANCE negatives is ready, it immediately replaces the existing negatives in training, without waiting. Training converges in about 10 epochs, similar to other DR baselines. The optimization uses the LAMB optimizer, learning rate 5e-6 for document retrieval and 1e-6 for passage retrieval, and linear warm-up and decay after 5000 steps. More detailed hyperparameter settings can be found in our code release.

A.2 OVERLAP WITH SPARSE RETRIEVAL IN TREC 2019 DL TRACK

As a nature of TREC-style pooling evaluation, only documents ranked in the top 10 by the 2019 TREC participating systems were labeled. As a result, documents not in the pool, and thus not labeled, are all considered irrelevant, even though there may be relevant ones among them. When reusing TREC-style relevance labels, it is very important to keep track of the "hole rate" of the evaluated systems, i.e., the fraction of their top-K ranked results without TREC labels (not in the pool). A larger hole rate indicates that an evaluated method is very different from the systems that contributed to the pool, so more of its unlabeled results may in fact be relevant.
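The hole rate can be computed as below. A small sketch: the run and judgment ids are made up for illustration.

```python
def hole_rate(ranked_ids, judged_ids, k=10):
    """Fraction of a system's top-k results without a TREC label
    (i.e., outside the judgment pool)."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d not in judged_ids) / len(top)

judged = {"d1", "d2", "d3", "d4", "d5"}        # the pooled, labeled documents
run = ["d1", "d9", "d2", "d8", "d3", "d7", "d4", "d6", "d5", "d0"]
rate = hole_rate(run, judged)                  # half of this top-10 is unjudged
```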



Our code and trained models are available at http://aka.ms/ance.
DPR: https://github.com/facebookresearch/DPR.
RAG: https://huggingface.co/transformers/master/model_doc/rag.html



Figure 2: ANCE Asynchronous Training. The Trainer learns the representation using negatives from the ANN index. The Inferencer uses a recent checkpoint to update the representations of the documents in the corpus and, once finished, refreshes the ANN index with the most up-to-date encodings.

In-Batch Negatives: We argue that, when training DR models, in-batch local negatives are unlikely to provide informative samples, due to two properties of text retrieval. Let D⁻_* be the set of informative negatives that are hard to distinguish from D⁺, and b the batch size. We have: (1) b ≪ |C|, the batch size is far smaller than the corpus size; (2) |D⁻_*| ≪ |C|, only a few negatives are informative and the majority of the corpus is trivially unrelated. Both conditions hold in most dense retrieval scenarios. Together, they make the probability that a random mini-batch includes meaningful negatives, p ≈ b |D⁻_*| / |C|, close to zero. Selecting negatives from local training batches is thus unlikely to provide optimal training signals for dense retrieval.
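For intuition, the probability above can be evaluated with MARCO-like numbers. The batch size and |D⁻_*| here are illustrative assumptions, not measurements from the paper.

```python
def informative_batch_prob(batch_size, n_informative, corpus_size):
    """p ~= b * |D^-_*| / |C|: the rough chance that a batch of b uniformly
    drawn in-batch negatives hits the informative set D^-_*."""
    return batch_size * n_informative / corpus_size

# MARCO-like corpus scale; b and |D^-_*| are illustrative assumptions
p = informative_batch_prob(batch_size=8, n_informative=100, corpus_size=8_800_000)
```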

ANCE Negatives: ANCE selects as negatives D^-_ANCE the top documents retrieved by f(·) from the ANN index, excluding the positives D^+. By definition, D^-_ANCE are the hardest negatives for the current DR model: D^-_ANCE ≈ D^-_*. In theory, these more informative negatives have higher training loss, elevate the upper bound on the gradient norms (the first component of Eqn 13), and prevent the slow convergence indicated in Eqn 13.
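A minimal brute-force stand-in for this selection step, assuming dot-product scoring (a real system would query an ANN library such as FAISS instead of scoring the whole corpus; the names here are ours):

```python
import numpy as np

def select_ance_negatives(query_vec, doc_vecs, positive_ids, k=5):
    """Top-k documents by dot product with the query, excluding positives.

    Brute-force stand-in for the ANN lookup: D^-_ANCE = top results of
    the current encoder f(.) minus the positives D^+.
    """
    scores = doc_vecs @ query_vec
    ranked = np.argsort(-scores)  # highest dot product first
    negatives = [int(i) for i in ranked if int(i) not in positive_ids]
    return negatives[:k]
```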

Figure 3: The top DR scores for ten random TREC DL testing queries. The x-axes are their ranking order. The y-axes are their retrieval scores minus corpus average. All models are warmed up by BM25 Neg. The percentages are the overlaps between the testing and training negatives near convergence.

Figure 4: The loss and gradient norms during DR training (after BM25 warm up). The gradient norms are the per-layer average of the bottom (1-4), middle (5-8), and top (9-12) transformer layers. Black dotted lines are the grad norm of the last layer in ANCE (FirstP). The x-axes are training steps.

Figure 5: Training loss and testing NDCG of ANCE (FirstP) on documents. The sub-captions list the ANN index refreshing rate (e.g., per 10k batches), Trainer:Inferencer GPU allocation (e.g., 4:4), and learning rate (e.g., 1e-5). The x-axes are the training steps.

A large async-gap, caused by a high learning rate (Fig. 5(a)) or a low refreshing rate (Fig. 5(b)), makes the training unstable, perhaps because the refreshed index changes too dramatically, as indicated by the peaks in training loss and dips in testing NDCG. The async-gap is not significant when we allocate an equal number of GPUs to index refreshing and to training (Fig. 5(d)). Further reducing the gap (Fig. 5(c)) does not improve learning convergence.

Details on OpenQA Experiments: All DPR-related experimental settings, baseline systems, and the DPR Reader are based on their open-source library². The RAG-Token reader uses its open-source release in huggingface³. The RAG-Seq release in huggingface was not yet stable at the time of our experiments, so we use RAG-Token in our OpenQA experiments. RAG only releases NQ models, so we use the DPR Reader on TriviaQA. We feed the top 20 passages from ANCE to RAG-Token on NQ and the top 100 passages to DPR's BERT Reader, following the guidelines in their open-source code.

Results in the TREC 2019 Deep Learning Track. Results not available are marked as "n.a."; not applicable are marked as "-". Best results in each category are marked bold. Dense Retrieval baselines use the same BERT-Siamese model but different training strategies.

Retrieval results (Answer Coverage at Top-20/100) on Natural Questions (NQ) and Trivia QA (TQA) in the setting from Karpukhin et al. (2020).

OpenQA Test Scores in the Single Task Setting. ANCE+Reader switches the retriever of the OpenQA systems from DPR to ANCE and keeps their QA components: RAG-Token on Natural Questions (NQ) and the DPR Reader on Trivia QA (TQA). T5 results are "closed-book"; the others are open-book.

Efficiency of ANCE Serving and Training.

In this paper, we first theoretically show that, in dense retrieval, the local negatives used in DR training are uninformative, yield low gradient norms, and contribute little to learning convergence. We then propose ANCE to eliminate this bottleneck by constructing training negatives globally from the entire corpus. Our experiments demonstrate the advantage of ANCE in web search, OpenQA, and the production environment of a commercial search engine. Our studies empirically validate our theory that ANCE negatives have much bigger gradient norms, reduce the stochastic gradient variance, and improve training convergence.

More Details on TREC DL Benchmarks: There are two tasks in the TREC DL 2019 Track: document retrieval and passage retrieval. The training and development sets are from MS MARCO, which includes passage-level relevance labels for one million Bing queries.

9. ACKNOWLEDGMENTS

We thank Di He for discussions on learning theories and Safoora Yousefi for feedback in writing.


Because dense retrieval systems differ greatly from the systems that participated in the Track and contributed to the pool, the evaluation results are not perfect. Note that the hole rate does not necessarily reflect the accuracy of a system, only its difference from the pooled systems.

In the TREC 2019 Deep Learning Track, all participating systems were based on sparse retrieval. Dense retrieval methods often differ considerably from sparse retrieval and in general retrieve many new documents. This is confirmed in Table 6: all DR methods have very low overlap with the official BM25 in their top 100 retrieved documents. At most, only 25% of the documents retrieved by DR are also retrieved by BM25. This makes the hole rate quite high and the recall metric not very informative. It also suggests that DR methods might benefit more in this year's TREC 2020 Deep Learning Track if participants contribute DR-based systems.

The MS MARCO ranking labels were not constructed by pooling sparse retrieval results. They came from Bing (Bajaj et al., 2016), which uses many signals beyond term overlap. This makes the recall metric in MS MARCO more robust, as it reflects how well a single model can recover a complex online system.
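The top-100 overlap reported in Table 6 is a simple intersection of two ranked lists; a sketch with our own function name:

```python
def topk_overlap(run_a, run_b, k=100):
    """Fraction of run_a's top-k documents that also appear in run_b's top-k."""
    a, b = set(run_a[:k]), set(run_b[:k])
    return len(a & b) / k
```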

A.3 HYPERPARAMETER STUDIES

We show the results of several hyperparameter configurations in Table 7. The cost of training with BERT makes extensive hyperparameter exploration difficult, and a failed configuration often leads to divergence early in training; the learning configurations in Table 7 are about all the explorations we did. Our DR model architecture is kept consistent with recent parallel work. Most hyperparameter choices were decided using the training loss curve, and otherwise by the loss on the MARCO Dev set. We found that the training loss, validation NDCG, and testing performance align well in our (limited) hyperparameter explorations.

A.4 CASE STUDIES

In this section, we show win/loss case studies between ANCE and BM25. Among the 43 TREC 2019 DL Track evaluation queries in the document task, ANCE outperforms BM25 on 29 queries, loses on 13 queries, and ties on the remaining query. The winning examples are shown in Table 8 and the losing ones in Table 9. Their corresponding ANCE-learned (FirstP) representations are illustrated by t-SNE in Fig. 6 and Fig. 7 (qids and queries are listed in the sub-captions).

In general, we found that ANCE better captures the semantics of the documents and their relevance to the query. The winning cases show the intrinsic limitations of sparse retrieval. For example, BM25 exactly matches "most popular food" in the query "what is the most popular food in Switzerland", but the document is about Mexico; the term "Switzerland" does match the document, but only in its related-questions section.

The losing cases in Table 9 are also quite interesting. Often it is not that DR fails completely and retrieves documents unrelated to the query's information need, which was a big concern when we started research in DR. The errors ANCE makes include retrieving documents that are related but not exactly relevant to the query, for example, "yoga pose" for "bow in yoga". In other cases, ANCE retrieved wrong documents due to a lack of domain knowledge: the pretrained language model may not know that "active margin" is a geographical term, not a financial one (which we did not know either, and took some time to figure out when conducting this case study). There are also cases where the densely retrieved documents make sense to us but were labeled irrelevant.

The t-SNE plots in Fig. 6 and Fig. 7 show many interesting patterns in the learned representation space. The ANCE winning cases often correspond to clear separations of different document groups. For the losing cases the representation space is more mixed, or there are too few relevant documents, which may cause variance in model performance. We include the t-SNE plots for all 43 TREC DL Track queries in our open-source repo; further analyses of the patterns in the learned representation space may provide more insights on dense retrieval.
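Projections like those in Fig. 6 and Fig. 7 can be produced with a standard t-SNE implementation; the sketch below uses scikit-learn on random stand-in embeddings, and the dimensionality and perplexity are illustrative choices of ours, not the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for ANCE (FirstP) document embeddings around one query.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((60, 32))  # 60 documents, 32-dim

# Project to 2-D for plotting; perplexity must stay below the point count.
points_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
```

The resulting 2-D points can then be scattered and colored by relevance label to inspect the separation of relevant and irrelevant documents.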

