APPROXIMATE NEAREST NEIGHBOR NEGATIVE CONTRASTIVE LEARNING FOR DENSE TEXT RETRIEVAL

Abstract

Conducting text retrieval in a learned dense representation space has many intriguing advantages. Yet dense retrieval (DR) often underperforms word-based sparse retrieval. In this paper, we first theoretically show that the bottleneck of dense retrieval is the domination of uninformative negatives sampled in mini-batch training, which yield diminishing gradient norms, large gradient variances, and slow convergence. We then propose Approximate nearest neighbor Negative Contrastive Learning (ANCE), which selects hard training negatives globally from the entire corpus. Our experiments demonstrate the effectiveness of ANCE on web search, question answering, and in a commercial search engine, showing that ANCE dot-product retrieval nearly matches the accuracy of a BERT-based cascade IR pipeline. We also empirically validate our theory that negative sampling with ANCE better approximates the oracle importance sampling procedure and improves learning convergence.

1. INTRODUCTION

Many language systems rely on text retrieval as their first step to find relevant information. For example, search ranking (Nogueira & Cho, 2019), open domain question answering (OpenQA) (Chen et al., 2017), and fact verification (Thorne et al., 2018) all first retrieve relevant documents for their later-stage reranking, machine reading, and reasoning models. All these later-stage models enjoy the advancements of deep learning techniques (Rajpurkar et al., 2016; Wang et al., 2019), while the first-stage retrieval still mainly relies on matching discrete bag-of-words, e.g., BM25, which has become the pain point of many systems (Nogueira & Cho, 2019; Luan et al., 2020; Zhao et al., 2020).
Dense Retrieval (DR) aims to overcome the sparse retrieval bottleneck by matching in a continuous representation space learned via neural networks (Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020). It has many desirable properties: fully learnable representation, easy integration with pretraining, and efficiency support from approximate nearest neighbor (ANN) search (Johnson et al., 2017). These grant dense retrieval an intriguing potential to fundamentally overcome some intrinsic limitations of sparse retrieval, for example, vocabulary mismatch (Croft et al., 2009).
One challenge in dense retrieval is to construct proper negative instances when learning the representation space (Karpukhin et al., 2020). Unlike in reranking (Liu, 2009), where the training and testing negatives are both irrelevant documents from previous retrieval stages, in first-stage retrieval DR models need to distinguish the relevant documents from all irrelevant ones in a corpus with millions or billions of documents. As illustrated in Fig. 1, these negatives are quite different from those retrieved by sparse models.
Recent research explored various ways to construct negative training instances for dense retrieval (Karpukhin et al., 2020), e.g., using contrastive learning (Oord et al., 2018; He et al., 2020; Chen et al., 2020a) to select hard negatives in current or recent mini-batches. However, as observed in recent research (Karpukhin et al., 2020), the in-batch local negatives, though effective in learning word or visual representations, are not significantly better than sparse-retrieved negatives in representation learning for dense retrieval. In addition, the accuracy of dense retrieval models often falls below that of BM25, especially on documents (Gao et al., 2020b; Luan et al., 2020).
In this paper, we first theoretically analyze the convergence of dense retrieval training with negative sampling. Using the variance reduction framework (Alain et al., 2015; Katharopoulos & Fleuret, 2018), we show that, under conditions commonly met in dense retrieval, local in-batch negatives lead to diminishing gradient norms, resulting in high stochastic gradient variances and slow training convergence. Local negative sampling is thus the bottleneck of dense retrieval's effectiveness.
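The diminishing-gradient effect can be illustrated numerically. The sketch below is not the analysis of this paper: it assumes a pairwise logistic loss on dot-product scores and measures the gradient only with respect to the query embedding, and all vectors (`q`, `d_pos`, `easy_neg`, `hard_neg`) are made-up toy values.

```python
import numpy as np

def query_grad(q, d_pos, d_neg):
    """Gradient of log(1 + exp(-(q.d_pos - q.d_neg))) with respect to q.

    The loss and the dot-product scoring are assumptions of this sketch.
    """
    margin = q @ d_pos - q @ d_neg
    weight = 1.0 / (1.0 + np.exp(margin))  # sigmoid(-margin), always in (0, 1)
    return -weight * (d_pos - d_neg)

# Toy embeddings (hypothetical values chosen for illustration only).
q = np.array([4.0, 0.0, 0.0])
d_pos = np.array([2.0, 0.0, 0.0])
easy_neg = np.array([-2.0, 0.5, 0.0])  # already scored far below the positive
hard_neg = np.array([1.9, 0.5, 0.0])   # scored almost as high as the positive

easy_norm = np.linalg.norm(query_grad(q, d_pos, easy_neg))
hard_norm = np.linalg.norm(query_grad(q, d_pos, hard_neg))
print(f"easy negative grad norm: {easy_norm:.2e}")
print(f"hard negative grad norm: {hard_norm:.2e}")
```

Under these assumptions, the negative that the model already separates well contributes a gradient several orders of magnitude smaller than the hard negative, which is the behavior the analysis attributes to uninformative negatives.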

Based on our analysis, we propose Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a new contrastive representation learning mechanism for dense retrieval. Instead of random or in-batch local negatives, ANCE constructs global negatives using the being-optimized DR model to retrieve from the entire corpus. This fundamentally aligns the distribution of negative samples in training with that of the irrelevant documents to be separated at test time. From the variance reduction perspective, these ANCE negatives lift the upper bound of the per-instance gradient norm, reduce the variance of the stochastic gradient estimation, and lead to faster learning convergence.
We implement ANCE using an asynchronously updated ANN index of the corpus representation. Similar to Guu et al. (2020), we maintain an Inferencer that computes the document encodings in parallel with a recent checkpoint from the being-optimized DR model, and refreshes the ANN index used for negative sampling once it finishes, to keep up with the model training.
Our experiments demonstrate the advantage of ANCE in three text retrieval scenarios: standard web search (Craswell et al., 2020), OpenQA (Rajpurkar et al., 2016; Kwiatkowski et al., 2019), and in a commercial search engine's retrieval system. We also empirically validate our theory that the gradient norms on ANCE sampled negatives are much larger than those on local negatives, thus improving the convergence of dense retrieval models.
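The global negative construction described above can be sketched as follows. This is a minimal illustration rather than the released implementation: brute-force dot-product search stands in for the ANN index, the embeddings are random toy vectors, and the function names, the `top_k` cutoff, and the uniform sampling among retrieved candidates are all assumptions of the sketch.

```python
import numpy as np

def ann_search(index_embs, query_emb, k):
    """Brute-force stand-in for an ANN index lookup (assumption of this sketch)."""
    scores = index_embs @ query_emb
    return np.argsort(-scores, kind="stable")[:k]

def sample_global_negatives(query_emb, index_embs, positive_ids, top_k, n_neg, rng):
    """Retrieve top_k documents from the whole corpus, drop known positives,
    and sample n_neg of the remaining documents as training negatives."""
    retrieved = ann_search(index_embs, query_emb, top_k)
    candidates = [int(i) for i in retrieved if int(i) not in positive_ids]
    if not candidates:
        return []
    picked = rng.choice(candidates, size=min(n_neg, len(candidates)), replace=False)
    return [int(i) for i in picked]

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 16))  # toy document encodings
# Snapshot held by the index. In training, it would be re-encoded from a
# recent checkpoint and swapped in asynchronously, so it may lag the model.
index_embs = doc_embs.copy()

query_emb = doc_embs[7] + 0.1 * rng.normal(size=16)  # query close to document 7
positives = {7}
negs = sample_global_negatives(query_emb, index_embs, positives,
                               top_k=20, n_neg=4, rng=rng)
print(negs)
```

Because the negatives come from what the index snapshot currently retrieves for the query, they follow the distribution of documents the model would confuse with the positive at test time, up to the staleness of the snapshot.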

2. PRELIMINARIES

In this section, we discuss the preliminaries of dense retrieval and its representation learning.
Task Definition: Given a query $q$ and a corpus $C$, the first-stage retrieval is to find a set of documents relevant to the query, $D^+ = \{d_1, ..., d_i, ..., d_n\}$, from $C$ ($|D^+| \ll |C|$), which then serve as input to later, more complex models (Croft et al., 2009). Instead of using sparse term matches and an inverted index, Dense Retrieval calculates the retrieval score $f()$ using similarities in a learned embedding space (Lee et al., 2019; Luan et al., 2020; Karpukhin et al., 2020):
$$f(q, d) = \text{sim}(g(q; \theta), g(d; \theta)), \tag{1}$$
where $g()$ is the representation model that encodes the query or document to dense embeddings. The encoder parameter $\theta$ provides the main capacity. The similarity function $\text{sim}()$ is often simply cosine or dot product, to leverage efficient ANN retrieval (Johnson et al., 2017; Guo et al., 2020).
BERT-Siamese Model: A standard instantiation of Eqn. 1 is to use the BERT-Siamese/two-tower/dual-encoder model (Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020):
$$f(q, d) = \text{BERT}(q) \cdot \text{BERT}(d) = \text{MLP}(\text{[CLS]}_q) \cdot \text{MLP}(\text{[CLS]}_d). \tag{2}$$
It encodes the query and document separately with BERT as the encoder $g()$, using their last layer's [CLS] token representation, and applies the dot product ($\cdot$) to them. This enables offline precomputing of the document encodings and efficient first-stage retrieval. In comparison, the BERT reranker (Nogueira et al., 2019) applies BERT on the concatenation of each to-rerank query-document pair, $\text{BERT}(q \circ d)$, which has explicit access to term-level interactions between query and document through transformer attention, but is often infeasible in first-stage retrieval, as enumerating all documents in the corpus for each query is too costly.
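The scoring function and the offline precomputation it enables can be sketched as follows. This is only an illustration: the `encode` function is a hypothetical stand-in for $g()$ built from normalized token counts, not a BERT encoder, and the corpus, names, and exhaustive top-$k$ search are assumptions of the sketch.

```python
import numpy as np

def build_vocab(texts):
    """Map each distinct token to a column index (toy helper, not from the paper)."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(texts, vocab):
    """Toy stand-in for g(.): unit-length count vectors; unknown tokens are ignored."""
    emb = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for tok in text.lower().split():
            if tok in vocab:
                emb[row, vocab[tok]] += 1.0
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)

def retrieve(query_emb, doc_embs, k):
    """Score every document with f(q, d) = g(q) . g(d) and return the top k."""
    scores = doc_embs @ query_emb
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores[top]

corpus = [
    "dense retrieval with bert",
    "sparse bm25 matching",
    "cooking pasta at home",
]
vocab = build_vocab(corpus)
doc_embs = encode(corpus, vocab)  # computed once, offline, independent of any query

query_emb = encode(["dense retrieval"], vocab)[0]  # only the query is encoded online
top_ids, top_scores = retrieve(query_emb, doc_embs, k=2)
print(top_ids, top_scores)
```

The query and the documents never interact inside the encoder, so the document matrix can be built ahead of time and the exhaustive scoring in `retrieve` can be replaced by an ANN index. A reranker that encodes the concatenated pair would instead need one encoder pass per query-document pair.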



Our code and trained models are available at http://aka.ms/ance.



Figure 1: t-SNE (Maaten & Hinton, 2008) representations of a query, its relevant documents, negative training instances from BM25 (BM25 Neg) or randomly sampled (Rand Neg), and testing negatives (DR Neg) in dense retrieval.

