ZERO-SHOT RETRIEVAL WITH SEARCH AGENTS AND HYBRID ENVIRONMENTS

Abstract

Learning to search is the task of building artificial agents that learn to autonomously use a search box to find information. So far, it has been shown that current language models can learn symbolic query reformulation policies, in combination with traditional term-based retrieval, but fall short of outperforming neural retrievers. We extend the previous learning-to-search setup to a hybrid environment, which accepts discrete query refinement operations after a first-pass retrieval step via a dual encoder. Experiments on the BEIR task show that search agents, trained via behavioral cloning, outperform the underlying search system, which combines a dual encoder retriever and a cross encoder reranker. Furthermore, we find that simple heuristic Hybrid Retrieval Environments (HRE) can improve baseline performance by several nDCG points. The search agent based on HRE (HARE) matches state-of-the-art performance, balanced across both zero-shot and in-domain evaluations, via interpretable actions, and at twice the speed.

1. INTRODUCTION

Transformer-based dual encoders for retrieval, and cross encoders for ranking (cf. e.g., Karpukhin et al. (2020); Nogueira & Cho (2019)), have redefined the architecture of choice for information search systems. However, sparse term-based inverted index architectures still hold their ground, especially in out-of-domain, or zero-shot, evaluations. On the one hand, neural encoders are prone to overfitting on training artifacts (Lewis et al., 2021). On the other, sparse methods such as BM25 (Robertson & Zaragoza, 2009) may implicitly benefit from term-overlap bias in common datasets (Ren et al., 2022). Recent work has explored various forms of dense-sparse hybrid combinations to strike better bias-variance tradeoffs (Khattab & Zaharia, 2020; Formal et al., 2021b; Chen et al., 2021; 2022). Rosa et al. (2022) evaluate a simple hybrid design which removes the dual encoder altogether and simply applies a cross encoder reranker to the top documents retrieved by BM25. This solution couples the better generalization properties of BM25 with high-capacity cross encoders, setting the current SOTA on BEIR by reranking 1000 documents. However, this is not very practical, as reranking is computationally expensive. More fundamentally, it is not easy to gain insight into why results are reranked the way they are; thus, the inherent opacity of neural systems is not addressed.

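The retrieve-then-rerank design can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation of any system above: `bm25_score` is a textbook Okapi BM25 over whitespace tokens, and the transformer cross encoder is replaced by a caller-supplied placeholder scoring function.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a whitespace-tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d.split()) for d in corpus) / N
    tokens = doc.split()
    tf = Counter(tokens)
    score = 0.0
    for term in query.split():
        df = sum(1 for d in corpus if term in d.split())
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * len(tokens) / avgdl))
    return score

def retrieve_then_rerank(query, corpus, rerank_fn, k=3):
    """First-pass BM25 retrieval of the top-k documents, then reorder them
    with a second-stage scoring function (a cross encoder in a real system)."""
    first_pass = sorted(corpus, key=lambda d: bm25_score(query, d, corpus),
                        reverse=True)[:k]
    return sorted(first_pass, key=lambda d: rerank_fn(query, d), reverse=True)
```

In the SOTA configuration discussed above, `k` would be 1000 and `rerank_fn` a large cross encoder, which is precisely where the computational cost concentrates.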
We propose a novel hybrid design based on the Learning to Search (L2S) framework (Adolphs et al., 2022). In L2S the goal is to learn a search agent that autonomously interacts with the retrieval environment to improve results. By iteratively leveraging pseudo relevance feedback (Rocchio, 1971) and language models' understanding, search agents engage in a goal-oriented traversal of the answer space, which aspires to model the ability of human searchers to 'rabbit hole' (Russell, 2019). The framework is also appealing because of the interpretability of the agent's actions. Adolphs et al. (2022) show that search agents based on large language models can learn effective symbolic search policies in a sparse retrieval environment, but fail to outperform neural retrievers. We extend L2S to a dense-sparse hybrid agent-environment framework structured as follows. The environment relies on both a state-of-the-art dual encoder, GTR (Ni et al., 2021), and BM25, which separately access the document collection. Results are combined and sorted by a transformer cross encoder reranker (Jagerman et al., 2022). We call this a Hybrid Retrieval Environment (HRE). Our search agent (HARE) interacts with HRE by iteratively refining the query via search operators and aggregating the best results. HARE matches state-of-the-art results on the BEIR dataset (Thakur et al., 2021) while reranking one order of magnitude fewer documents than the SOTA system (Rosa et al., 2022).

Figure 1 shows an example of a search session in which the HARE search agent applies structured query refinement operations. The agent adds two successive filtering actions to the query 'what is the weather like in germany in june' (data from MS MARCO (Nguyen et al., 2016)). In the first step it restricts results to documents containing the term 'temperatures', which occurs in the first set of results. In the second step, results are further limited to documents containing the term 'average'. This fully solves the original query, yielding an nDCG@10 score of 1.
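The filtering actions and the resulting metric can be made concrete with a small sketch. The data below is a toy stand-in, `apply_filter` is a simplified version of a '+term' refinement operator, and relevance labels are binary; none of this reproduces the actual agent or MS MARCO judgments.

```python
import math

def apply_filter(results, term):
    """'+term' refinement: keep only documents containing the term,
    mirroring the agent's successive filtering actions."""
    return [doc for doc in results if term in doc.split()]

def ndcg_at_10(ranked, relevance):
    """nDCG@10 with binary relevance labels (1 = relevant, 0 = not)."""
    gains = [relevance.get(doc, 0) for doc in ranked[:10]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Applying the two filters 'temperatures' and then 'average' in sequence prunes the result list down to the relevant document, which is what drives the nDCG@10 score to 1 in the example above.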

2. RELATED WORK

Classic retrieval systems such as BM25 (Robertson & Zaragoza, 2009) use term frequency statistics to determine the relevance of a document for a given query. Recently, neural retrieval models have become more popular and started to outperform classic systems on multiple search tasks. Karpukhin et al. (2020) use a dual-encoder setup based on BERT (Devlin et al., 2019), called DPR, to encode queries and documents separately, and use maximum inner product search (Shrivastava & Li, 2014) to find a match. They use this model to improve recall and answer quality on multiple open-domain question-answering datasets. Large encoder-decoder models such as T5 (Raffel et al., 2020) are now preferred as the basis for dual encoding, as they outperform encoder-only retrievers (Ni et al., 2021). It has been observed that dense retrievers can fail to catch trivial query-document syntactic matches involving n-grams or entities (Karpukhin et al., 2020; Xiong et al., 2021; Sciavolino et al., 2021). ColBERT (Khattab & Zaharia, 2020) gives more importance to individual terms by means of a late-interaction multi-vector representation framework, in which individual term embeddings are accounted for in the computation of the query-document relevance score. This is expensive, as many more vectors need to be stored for each indexed object. ColBERTv2 (Santhanam et al., 2022) combines late interaction with more lightweight token representations. SPLADE (Formal et al., 2021b) is another approach that relies on sparse representations, this time induced from a transformer's masked heads. SPLADEv2 (Formal et al., 2021a; 2022) further improves performance by introducing hard-negative mining and distillation. Chen et al. (2021) propose to close the gap with sparse methods on phrase matching, and to generalize better, by combining a dense retriever with a dense lexical model trained to mimic the output of a sparse retriever (BM25). Ma et al. (2021) combine single hybrid vectors and data augmentation via question generation. In Section 3 (Table 2b) we evaluate our search environment and some of the methods above.

The application of large LMs to retrieval and ranking presents significant computational costs, for which model distillation (Hinton et al., 2015) is one solution, e.g., DistilBERT (Sanh et al., 2019). The generalization capacity of dual encoders has been scrutinized recently in QA and IR tasks (Lewis et al., 2021; Zhan et al., 2022; Ren et al., 2022). Zhan et al. (2022) claim that dense
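ColBERT's late-interaction (MaxSim) relevance score has a simple functional form: for each query term embedding, take the maximum dot product against the document's term embeddings, then sum over query terms. The sketch below uses tiny hand-written vectors in place of learned contextual embeddings, purely to illustrate the scoring rule.

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query term embeddings of
    the maximum dot product with any document term embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

The per-term max is what lets individual terms influence the score directly, and the need to store one vector per document token is the storage cost noted above.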



Figure 1: Sequential query refinements combining pseudo relevance feedback and search operators.

Compared to the SOTA system, HARE reduces latency by 50%. Furthermore, HARE does not sacrifice in-domain performance. The agent's actions are interpretable and dig deep into HRE's rankings.

