HYPER: MULTITASK HYPER-PROMPTED TRAINING ENABLES LARGE-SCALE RETRIEVAL GENERALIZATION

Abstract

Recently, large-scale text retrieval has made impressive progress, facilitating both information retrieval and downstream knowledge-intensive tasks (e.g., open-domain QA and dialogue). With a moderate amount of data, a neural text retriever can outperform traditional methods such as BM25 by a large margin. However, when applied to out-of-domain data, the performance of a neural retriever degrades considerably. Therefore, enabling a retriever to perform more robustly across different domains or tasks, and even to show strong zero-shot transfer ability, is critical for building scalable IR systems. To this end, we propose HYPER, a hyper-prompted training mechanism that enables uniform retrieval across tasks of different domains. Specifically, our approach jointly trains the query encoder with a shared prompt-based parameter pool and a prompt synthesizer that dynamically composes a hyper-prompt for encoding each query from different tasks or domains. In addition, to avoid mode collapse of the prompt attention distribution across different queries, we design a contrastive prompt regularization that encourages the prompt attention distributions to be aligned and uniform. Through multi-task hyper-prompted training, our retriever learns to dynamically represent different types of queries and to transfer knowledge across different domains and tasks. Extensive experiments show that our model attains better retrieval performance across different tasks and better zero-shot transfer ability than various previous methods.

* Work done during the internship at Microsoft.

1. INTRODUCTION

Large-scale retrieval aims to retrieve relevant documents from millions to billions of documents according to a given query, which is the so-called first-stage retrieval (Cai et al., 2021). It significantly benefits various knowledge-intensive tasks (Guu et al., 2020; Lewis et al., 2020), since the retrieved documents contain explicit world knowledge (Petroni et al., 2021). Traditional term-matching methods, including tf-idf and BM25 (Yang et al., 2017), retrieve effectively by building an inverted index and perform fairly well regardless of domain. However, recent neural retrievers outperform them by a large margin given a moderate amount of task-specific data (Karpukhin et al., 2020; Formal et al., 2021b; Khattab & Zaharia, 2020). For neural retrieval, a common approach is to use pre-trained language models (e.g., BERT) (Devlin et al., 2019) to encode queries and documents into vectors separately, which is known as a Bi-Encoder.

Although neural retrievers can be optimized effectively with samples of specific tasks, in real-world applications the formats of queries differ and the expected priorities of query vectors vary considerably from task to task. For example, in the Natural Questions dataset (Kwiatkowski et al., 2019), a query such as "what was the first capital city of Australia" is a simple question, whereas in the Wizard of Wikipedia dataset (Dinan et al., 2018), a query such as "...Snoop Dogg is so awesome, he's a great rapper and does a lot for his community as well..." contains multiple declarative sentences with an implicit retrieval target. Besides the difference in query formats, different tasks also require query vectors with different richness or intents. In the HotpotQA dataset (Yang et al., 2018), an input query such as "which game was published first, Near and Far or King of Tokyo?"
expects a query vector that can retrieve documents relevant to both mentioned items, which is quite different from the queries in Natural Questions, which require retrieving specific facts about only one item. Such differences between tasks cause significant performance degradation when a model is applied to different tasks. Moreover, recently popular tasks also suffer from a data sparsity problem (Almeida & Matos, 2020), which calls for better generalization of a neural retriever (Thakur et al., 2021).

To resolve the above challenges, we aim to build a universal model that is capable of processing queries uniformly regardless of the differences between tasks, including the varying formats of input queries and the unique features of query vectors for specific tasks. Meanwhile, we expect our model to obtain stronger generalization abilities, reflected by promising zero-shot and few-shot performance in large-scale retrieval. Specifically, the first problem is how to enable universal query processing. For a neural retriever, the ability to resolve a specific task corresponds to a set of parameters trained on this task. Although one can train a different model for each task (Karpukhin et al., 2020) or simply use a shared encoder in a multi-task setting (Maillard et al., 2021), the first method leads to a heavy parameter cost while the second results in potentially mediocre generalization.

To this end, we propose HYPER, a multi-task HYPEr-prompted training mechanism that can be combined with any transformer-based neural Retriever. HYPER consists of two key components. The first component is the Query-conditional Prompt Synthesizer (QPS), which leverages an attention module to synthesize suitable parameters of the query encoder for different queries. Through multi-task training, this enables our query encoder to dynamically represent different types of queries and to transfer learned parameters across different tasks and domains.
Nevertheless, we find that merely applying QPS results in a mode collapse of the attention score distributions, which causes our query encoder to fail to learn different abilities for processing the queries of different tasks. To deal with this problem, we propose the Contrastive Prompt Regularization (CPR), which encourages the parameter synthesis for queries of the same task to become similar for better training effectiveness, while pushing our query encoder to distinguish queries of different tasks and thus avoid mode collapse. Through the above multi-task hyper-prompted training, HYPER learns to dynamically represent different types of queries and to transfer knowledge across different domains and tasks. Therefore, HYPER can enable large-scale retrieval generalization in the zero-shot and few-shot scenarios.

To conclude, our contributions are three-fold: i) We present HYPER, a multi-task hyper-prompted training mechanism that enables a neural retriever to dynamically process different types of queries with different hyper-prompts and to transfer learned knowledge across different domains and tasks. ii) To enable uniform retrieval in model construction and optimization, we propose the Query-conditional Prompt Synthesizer (QPS) along with the Contrastive Prompt Regularization (CPR) to synthesize suitable prompts for different queries. iii) Experiments on zero-shot in-domain and cross-domain retrieval tasks reflect the superior generalization provided by HYPER, and the strong multi-task performance indicates that uniform retrieval is achieved.

2. METHOD

Task Formulation For large-scale text retrieval, we aim to find a document $d^+$ containing relevant knowledge from a large collection of documents $D$ to answer the query $q$. Although input queries vary from task to task, we propose employing only one retriever to process them uniformly. Specifically, given in-domain datasets $C = \{T_1, T_2, \ldots, T_t\}$ and out-of-domain datasets $\bar{C} = \{T_{t+1}, T_{t+2}, \ldots, T_{t+k}\}$, where $t$ and $k$ are the numbers of tasks with and without training samples, respectively, the goal is to learn a neural retriever $P(d^+ \mid q, D; \theta)$ (where $\theta$ denotes the parameters of the model) from $C$ that performs well on these in-domain tasks, while transferring the learned knowledge to process a new query $q$ from the out-of-domain datasets $\bar{C}$. Thus, given any query in $C \cup \bar{C}$, one can find the proper knowledge documents $d^+$ following $P(d^+ \mid q, D; \theta)$.

[Figure 1: Overview of HYPER. Recoverable panel labels: Key, Value, query-conditional prompt ($p_I$), prefix encoder, prefix.]

An overview of HYPER is provided in Figure 1. HYPER first leverages the instance encoder $\theta_I$ to generate the Query representation of a query, while the hyper-prompts $P$ are used as the Key and Value of the attention module. Therefore, we can dynamically generate different prompts for different types of queries. In addition, our proposed contrastive prompt regularization is used to avoid the mode collapse problem, which is crucial for learning different parameters for different types of queries.

2.1. PROMPT SYNTHESIZING

There are three main components in our query encoding process: an instance encoder $\theta_I$, shared basic prompts $P = \{p_i \mid p_i \in \mathbb{R}^{m \times d}, i \in \{1, \cdots, N\}\}$, where $N$ is the number of shared prompts and $m$ is the length of each basic prompt, and the prefix encoder $\theta_p$. To enable our model to process different tasks uniformly, we store the learned knowledge for processing different queries in the shared prompts and synthesize prompts for different queries dynamically through the attention module. Moreover, we introduce a contrastive prompt regularization to prevent mode collapse of the attention scores, which is crucial for HYPER to effectively learn diverse knowledge and synthesize different prompts for queries of different tasks. In the following, we first describe how HYPER generates the corresponding prefix for different queries. Then, we explain the mode collapse problem in our attention module and how CPR resolves it.

Query-conditional Prompt Synthesizing We intend to generate dynamic hyper-prompts from the global semantics of a query, which enables a neural retriever to adapt to different types of queries across different domains. To generate a query representation for our prompt attention module, we first map the input query $q$ into its word embedding representation $X = [w_1, w_2, \cdots, w_l] \in \mathbb{R}^{l \times d}$, where $l$ is the length of the query and $d$ is the dimension of the word embeddings. Then we apply max pooling along the sequence axis of $X$ and obtain $\bar{X} = \mathrm{MaxPooling}(X)$. Finally, we use a non-linear transformation to generate the incipient query representation as follows:

$$H_I = \mathrm{GELU}(W_1 \bar{X}) W_2^{\top}, \qquad Q = \mathrm{LayerNorm}(H_I).$$

Here, $W_{\{1,2\}} \in \mathbb{R}^{d_h \times d}$ are the transformation matrices and $d_h$ is the dimension of the hidden variable. GELU is the Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016) and LayerNorm is layer normalization (Ba et al., 2016).
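The query representation above can be illustrated with a small, dependency-free sketch. It is only a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the weights are toy values, and the second projection is applied so that the output has dimension $d$ (an assumption about the intended shapes, chosen so that $Q$ can later be compared with pooled prompts of dimension $d$).

```python
import math


def max_pool(rows):
    """Element-wise max over the first (sequence) axis of a list of vectors."""
    return [max(column) for column in zip(*rows)]


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and approximately unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def query_representation(X, W1, W2):
    """Compute Q from word embeddings X (l x d); W1 and W2 have shape (d_h x d)."""
    x_bar = max_pool(X)  # shape (d,)
    # Hidden representation GELU(W1 x_bar), shape (d_h,).
    hidden = [gelu(sum(w * x for w, x in zip(row, x_bar))) for row in W1]
    # Assumption: project back to d dimensions with W2 (shape d_h x d).
    d = len(W2[0])
    h_i = [sum(hidden[k] * W2[k][c] for k in range(len(W2))) for c in range(d)]
    return layer_norm(h_i)  # shape (d,)


# Toy usage: a query with l = 2 tokens, d = 3 and d_h = 2 (hypothetical values).
X = [[0.1, 0.5, -0.2], [0.4, -0.3, 0.9]]
W1 = [[0.2, -0.1, 0.3], [0.0, 0.5, -0.4]]
W2 = [[0.1, 0.2, -0.3], [0.6, -0.5, 0.4]]
Q = query_representation(X, W1, W2)
```

Max pooling keeps the strongest activation per embedding dimension, so the resulting $Q$ has a fixed size regardless of the query length.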
Similar to query encoding, we apply max pooling along the prompt length axis to transform each prompt into $\bar{p}_i = \mathrm{MaxPooling}(p_i)$, $i \in \{1, \cdots, N\}$. Thus, we can use $Q \bar{p}_i^{\top}$ to indicate the fitness of different parameters for an input query. We apply softmax to normalize these scores and form a prompt attention distribution $A \in \mathbb{R}^N$, which is further used for synthesizing the final query-conditional prompt $p_I$. The process is formally described as:

$$\alpha_i = \frac{\exp(Q \bar{p}_i^{\top} / \tau)}{\sum_{j=1}^{N} \exp(Q \bar{p}_j^{\top} / \tau)}, \qquad p_I = \sum_{i=1}^{N} \alpha_i p_i,$$

where $\alpha_i$ is the normalized attention score of the $i$-th prompt, $p_I \in \mathbb{R}^{m \times d}$ is the generated query-conditional prompt, and $\tau$ is a pre-defined temperature. To further improve the effectiveness of our method, we transform the generated prompt into a prefix (Liu et al., 2021a), which has better representational ability. Although one can simply employ an up projection to generate the prefix for each layer of the retriever, we find that this approach considerably increases the total number of parameters, which may degrade generalization. Therefore, we employ a parameter-efficient transformation (Stickland & Murray, 2019; Houlsby et al., 2019) as the prefix encoder $\theta_p = \{W_{dp}, W_{up}\}$ to generate the prefix, which can be described as follows:

$$h = W_{dp}\, p_I, \qquad P_L = W_{up} \tanh(h),$$

where $W_{dp} \in \mathbb{R}^{d_p \times d}$ and $W_{up} \in \mathbb{R}^{Ld \times d_p}$ are projection matrices, $d_p$ is the down-projection dimension of $p_I$, $L$ is the number of layers of the query encoder $\theta_q$, $\tanh$ is the hyperbolic tangent function, $h$ is the intermediate low-dimensional representation, and $P_L \in \mathbb{R}^{Ld}$ denotes the dynamically generated parameters, which can be split into prefixes for each layer of the neural retriever.

Contrastive Prompt Regularization Although the above mechanisms can generate query-conditional prefixes and share parameters across different tasks, we find that they result in the so-called mode collapse problem of the attention score distributions.
Specifically, the attention score distributions of different queries become very similar, which may cause our module to degenerate into plain prefix-tuning. Moreover, we expect queries belonging to the same task to generate similar attention score distributions, while queries belonging to different tasks should have dissimilar attention score distributions. To simultaneously learn representations of hyper-prompts and a clustering of different types of queries across different domains or tasks, we propose the Contrastive Prompt Regularization (CPR), which employs a soft constraint to cluster the attention scores implicitly and thus avoids the mode collapse problem. CPR can be formally described as follows:

$$L_{\mathrm{CPR}}(C) = \sum_{B \in C} \sum_{q_i \in B} \Big[ \underbrace{\sum_{q_j \in B,\, I_{q_i} = I_{q_j}} D_f\big(A(q_i), A(q_j)\big)}_{\text{alignment}} - \underbrace{\sum_{q_k \in B,\, I_{q_i} \neq I_{q_k}} D_f\big(A(q_i), A(q_k)\big)}_{\text{uniformity}} \Big].$$

Here, $B$ is a mini-batch of training samples randomly selected from $C$; $A(q_*) \sim P(z) = \alpha_z$, $z \in \{1, 2, \cdots, N\}$, is the attention score distribution of the input query $q_*$; $I_{q_*}$ is an indicator function that identifies the dataset (task) the query $q_*$ belongs to; and $D_f$ is a divergence that measures the discrepancy between two distributions. Inspired by contrastive learning (Wang & Isola, 2020), the first term, which contains $D_f(A(q_i), A(q_j))$, can be viewed as an alignment regularization that encourages similar queries to be processed with similar prefixes, and the second term, which contains $D_f(A(q_i), A(q_k))$, can be viewed as a uniformity regularization that prevents mode collapse of the attention score distributions. In our implementation, we use the Jensen-Shannon divergence (Manning & Schütze, 2002), since it has fixed upper and lower bounds, which avoids numeric overflow during optimization.
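The prompt attention of this section and the CPR term can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the loss is computed for a single mini-batch, natural logarithms are used for the Jensen-Shannon divergence, and self-pairs are skipped because their divergence is zero.

```python
import math


def softmax(scores, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_attention(Q, pooled_prompts, tau=1.0):
    """alpha_i is proportional to exp(Q . p_bar_i / tau) over the N pooled prompts."""
    scores = [sum(q * p for q, p in zip(Q, p_bar)) for p_bar in pooled_prompts]
    return softmax(scores, tau)


def synthesize_prompt(alpha, prompts):
    """p_I = sum_i alpha_i * p_i, where each prompt is an m x d nested list."""
    m, d = len(prompts[0]), len(prompts[0][0])
    return [[sum(a * p[r][c] for a, p in zip(alpha, prompts)) for c in range(d)]
            for r in range(m)]


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded in [0, ln 2] with natural logs."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def cpr_loss(attentions, task_ids):
    """Alignment minus uniformity over all ordered pairs in one mini-batch."""
    loss = 0.0
    for i, a_i in enumerate(attentions):
        for j, a_j in enumerate(attentions):
            if i == j:
                continue  # D_f(A, A) = 0, so self-pairs can be skipped
            divergence = js_divergence(a_i, a_j)
            if task_ids[i] == task_ids[j]:
                loss += divergence   # alignment: same task, pull together
            else:
                loss -= divergence   # uniformity: different tasks, push apart
    return loss
```

Because the Jensen-Shannon divergence is bounded, the subtracted uniformity term cannot drive the loss to negative infinity, which is consistent with the motivation given in the text for choosing it.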

2.2. UNIFORM RETRIEVAL WITH QUERY-CONDITIONAL PROMPT

Lexicon-weighted Retriever HYPER is compatible with any deep neural text retriever based on the Transformer (Vaswani et al., 2017) architecture. However, we find that the lexicon-weighted retrieval method is more promising for attaining better zero-shot retrieval. Therefore, we adopt the lexicon-weighted language model SPLADE (Formal et al., 2021b), denoted $\mathrm{LM}(\cdot, \cdot; \theta_q)$, as our backbone network. Combined with the generated dynamic prefix, the text of a query is represented as follows:

$$v_q = \mathrm{MaxPooling}\big(\log(1 + \mathrm{ReLU}(\mathrm{LM}(P_L, X; \theta_q)))\big) \in \mathbb{R}^{d},$$

where ReLU is the Rectified Linear Unit. Following common practice, we adopt the contrastive loss (Karpukhin et al., 2020; Khattab & Zaharia, 2020; Formal et al., 2021b), which uses a limited number of positive documents $d^+$ and negative documents $d^-$ for each query during training. Specifically, we employ BM25 to retrieve the top-100 documents for a query and use them as negative samples $d^-$, excluding those that also contain answers to the query. To encode positive and negative documents into the corresponding vector representations $v_{d^+}$ and $v_{d^-}$, we employ a document encoder $\theta_d$ but skip the prefix generation:

$$v_d = \mathrm{MaxPooling}\big(\log(1 + \mathrm{ReLU}(\mathrm{LM}(d; \theta_d)))\big) \in \mathbb{R}^{d}.$$

Thus, the likelihood distribution can be formulated as follows:

$$P(d^+ \mid v_q, d^-; \theta_q, \theta_d) = \frac{\exp(v_q^{\top} v_{d^+})}{\sum_{d' \in d^+ \cup d^-} \exp(v_q^{\top} v_{d'})}.$$

Following the common practice of training lexical retrievers (Formal et al., 2021b), we add regularization terms on the number of floating-point operations (FLOPS) (Paria et al., 2020) to reduce the computation cost of the retrieval process and improve retrieval effectiveness. The loss function of our method is then defined as

$$L_q(C) = \sum_{q \in C} -\log P(d^+ \mid v_q, d^-; \theta_q, \theta_d) + \lambda_q \mathrm{FLOPS}(v_q) + \lambda_d \mathrm{FLOPS}(v_d),$$

where $\mathrm{FLOPS}(\cdot)$ is the regularization term used by Formal et al. (2021b), and the hyperparameters $\lambda_q$ and $\lambda_d$ adjust the sparsity of the representations $v_q$ and $v_d$, respectively.
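The three ingredients of this subsection (log-saturated max pooling, the contrastive likelihood, and the FLOPS regularizer) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the language model outputs are replaced by toy per-token score lists, the contrastive term is assumed to be the negative log-likelihood of the softmax above, and the FLOPS term follows the form in Paria et al. (2020), i.e., the sum over dimensions of the squared batch-mean absolute activation.

```python
import math


def splade_pool(token_scores):
    """Per-dimension max over tokens of log(1 + ReLU(score))."""
    return [max(math.log1p(max(0.0, s)) for s in column)
            for column in zip(*token_scores)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_nll(v_q, v_pos, v_negs):
    """-log softmax probability of the positive among positive + negatives."""
    scores = [dot(v_q, v_pos)] + [dot(v_q, v_neg) for v_neg in v_negs]
    peak = max(scores)
    log_partition = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_partition - scores[0]


def flops_regularizer(batch_vectors):
    """Sum over dimensions of (batch-mean absolute activation) squared."""
    n = len(batch_vectors)
    return sum((sum(abs(x) for x in column) / n) ** 2
               for column in zip(*batch_vectors))


def retrieval_loss(v_q, v_pos, v_negs, lambda_q=0.3, lambda_d=0.1):
    """Single-query version of L_q: NLL plus the two sparsity terms."""
    docs = [v_pos] + list(v_negs)
    return (contrastive_nll(v_q, v_pos, v_negs)
            + lambda_q * flops_regularizer([v_q])
            + lambda_d * flops_regularizer(docs))
```

The default weights mirror the values of $\lambda_q$ and $\lambda_d$ reported in the experimental setup; in practice the regularizers are computed over a whole batch rather than a single query.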

Model Training

In training, we adopt a multi-task training paradigm with mini-batch mixup, meaning that we randomly sample from all tasks to compose each training batch. For in-domain testing, we train the entire model on KILT for superior performance. For cross-domain zero-shot retrieval, we freeze the parameters of the backbone network and only tune the parameters of our proposed components. The objective function can be described as:

$$\min_{P,\, \theta_I,\, \theta_p,\, (\theta_q, \theta_d)} L_q(C) + \lambda_c L_{\mathrm{CPR}}(C),$$

where $C = \bigcup_{i=1}^{t} T_i$ is the mixed data of the different tasks and $\lambda_c$ is a fixed, pre-defined weight that controls the strength of the CPR regularization.
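A minimal sketch of the mini-batch mixup and the combined objective is given below. The helper names and the toy task data are hypothetical, and the shuffling strategy (pool all tasks, shuffle once, slice into batches) is one simple way to realize "randomly sample from all tasks", not necessarily the one used by the authors.

```python
import random


def mixed_batches(task_to_samples, batch_size, seed=42):
    """Yield batches of (task_name, sample) pairs drawn from the union of all tasks."""
    pool = [(task, sample)
            for task, samples in task_to_samples.items()
            for sample in samples]
    rng = random.Random(seed)  # local generator keeps the shuffle reproducible
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


def total_objective(l_q, l_cpr, lambda_c=0.1):
    """Combined training objective: L_q + lambda_c * L_CPR."""
    return l_q + lambda_c * l_cpr


# Toy usage with two hypothetical tasks.
toy_tasks = {"NQ": ["q1", "q2", "q3"], "WoW": ["q4", "q5"]}
batches = list(mixed_batches(toy_tasks, batch_size=2))
```

Keeping the task name next to each sample is what allows the CPR term to tell same-task pairs from different-task pairs inside a mixed batch.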

3. EXPERIMENTS

Benchmark Datasets We use two publicly available retrieval benchmarks for evaluation, KILT (Petroni et al., 2021) and BEIR (Thakur et al., 2021). KILT is a benchmark for knowledge-intensive tasks that require retrieving additional knowledge from Wikipedia; we select all 7 datasets that contain training samples to train our model in a multi-task setting. Specifically, we use a variety of tasks including fact-checking (FEVER), entity linking (AY2), slot filling (zsRE), question answering (NQ, TQA, HoPo), and dialogue (WoW). KILT provides the provenances of all tasks in one Wikipedia corpus, which enables us to train models with a shared passage encoder (Maillard et al., 2021). BEIR is a widely known zero-shot information retrieval benchmark, and we employ it to evaluate the transfer learning ability provided by different methods. For a fair comparison, we remove the datasets that are contained in KILT, which results in 10 tasks including TREC-COVID (TC), NFCorpus (NFC), ArguAna (Argu), Touché-2020 (Touche), FiQA-2018 (FiQA), DBPedia (DBP), SCIDOCS (SciD), and Climate-FEVER (CFever).

Evaluation Metrics When evaluating on KILT, we adopt R-Precision as the retrieval metric, which is the primary metric used in the KILT paper (Petroni et al., 2021). R-Precision is calculated as $r / R$, where $R$ is the number of Wikipedia pages inside each provenance set and $r$ is the number of relevant pages among the top-$R$ retrieved pages. For experiments on BEIR, Normalized Discounted Cumulative Gain (nDCG) (Gysel & de Rijke, 2018) is reported to represent the performance of different methods. Specifically, we use the official TREC evaluation tool (Gysel & de Rijke, 2018) and compute nDCG@10 for all datasets.

Experiment Setups We train our model on KILT in a multi-task learning paradigm.
Since our method can be combined with any transformer-based neural retriever, we adopt SPLADEv2 (Formal et al., 2021b) and DPR (Karpukhin et al., 2020), as provided by the original papers, as our backbone models. The learning rate of the backbone network is set to $2 \times 10^{-5}$ following Formal et al. (2021b), and the learning rate of the HYPER modules is set to $1 \times 10^{-3}$, selected from $\{10^{-1}, 10^{-2}, 10^{-3}\}$. The temperature $\tau$ is set to e, $\lambda_q$ is set to 0.3, and $\lambda_d$ is set to 0.1. $d_q$ is set to 400 and $d_p$ is set to 100. We train for up to 10 epochs; both the maximum document length and the maximum query length are set to 512 to accommodate tasks with very long queries, and the batch size is set to 256. For each query, we provide 1 positive sample and 19 negative samples for training. We set the sequence length of each basic prompt to 100, selected from $\{10, 50, 100\}$. The weight $\lambda_c$ and the number of shared basic prompts $N$ are tuned, and we finally select 0.1 and 20, respectively. We fix the random seed to 42, and all experiments are conducted on eight A100 GPUs.

Table 1: Page-level R-Precision on KILT. All models in the multi-tasking part are trained on the 7 tasks of KILT, while the models in the zero-shot part are trained in a leave-one-out setting that excludes the dataset used for testing from training. Model names that end with "zero" are tested directly without training, and the suffix "-prefix" after a model name means that the corresponding model is trained through prefix-tuning. * indicates results from Maillard et al. (2021).
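The two evaluation metrics used in this section, R-Precision and nDCG@10, can be sketched as follows. This is a simplified illustration, not the official evaluation tool: the function names are hypothetical, and the nDCG variant shown uses the raw relevance grade as gain with a $\log_2$ rank discount, which is an assumption about the exact formulation.

```python
import math


def r_precision(ranked_ids, relevant_ids):
    """r / R: fraction of the R relevant pages found in the top-R results."""
    big_r = len(relevant_ids)
    if big_r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:big_r] if doc_id in relevant_ids)
    return hits / big_r


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (rank from 1)."""
    return sum(gain / math.log2(position + 2)
               for position, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc id to a graded label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy usage with hypothetical page ids.
print(r_precision(["a", "b", "c"], {"a", "c"}))       # top-2 contains 1 of 2 relevant pages
print(ndcg_at_k(["x", "a"], {"a": 1}))                # relevant page ranked second
```

Both functions score a single query; benchmark numbers are obtained by averaging the per-query scores over each dataset.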

3.1. MAIN EVALUATION

Supervised and Zero-shot Performance on KILT We conduct both supervised and zero-shot experiments on KILT, and the results are shown in Table 1. Since our hyper-prompted training mechanism is applied to a lexicon-weighted retrieval method (i.e., SPLADE), we name the resulting model SPLADE-HYPER. We also test the effectiveness of HYPER on a dense retrieval method (i.e., DPR), which results in DPR-HYPER. First, SPLADE provides better performance than the dense representation method (DPR) in both the supervised and zero-shot settings on KILT. Notably, the performance gap is especially large in the zero-shot setting, where the dense retrieval method (DPR) only achieves results comparable to traditional BM25, while the lexicon-weighted retrieval method significantly outperforms both BM25 and dense retrieval. Second, compared with fine-tuning the entire model or prefix-tuning (Liu et al., 2021b), SPLADE-HYPER obtains better performance on most tasks in the supervised setting, which demonstrates the advantage of HYPER in sharing and transferring knowledge across different retrieval tasks or domains. Meanwhile, HYPER also improves performance on all unseen tasks, which indicates that our method can effectively transfer learned knowledge from previous tasks and adapt to different types of queries across different domains. The experimental results in the few-shot setting (shown in Appendix A) further support the effectiveness of the proposed HYPER.

Zero-shot Performance on BEIR We also directly test the performance of SPLADE-HYPER without tuning on BEIR, which is a widely known zero-shot IR benchmark.
Following the most recent works (Xu et al., 2022; Wang et al., 2022), we compare our method with a variety of methods, including DocT5 (Nogueira et al., 2019a), ColBERT (Khattab & Zaharia, 2020), DPR (Karpukhin et al., 2020), ANCE (Xiong et al., 2020), GenQ (Thakur et al., 2021), TAS-B (Reimers & Gurevych, 2019), MoDIR (Xin et al., 2021), and LapraDOR (Xu et al., 2022). The results are shown in Table 2. HYPER achieves the best performance on 4 of 9 tasks of BEIR and also attains the best average performance. In addition, our method consistently outperforms the backbone network SPLADE, which indicates that HYPER enables a model to better transfer learned knowledge across different domains and thus improves its generalization ability. Moreover, SPLADE-HYPER is better than SPLADE-prefix, which shows the advantage of the proposed query-conditional prompt synthesizing and the strong ability of dynamic parameterization to adapt to different types of queries across different domains.

3.2. ANALYSES

Ablation Study To verify the effectiveness of our proposed mechanisms, we construct several variants of our model for comparison. Specifically, to evaluate the effectiveness of QPS, we replace the generated attention score distribution with a fixed uniform distribution, which results in a variant of our model without the attention module. To verify the effectiveness of CPR, we separately remove the alignment regularization and the uniformity regularization to form two further variants. The results are shown in Table 3. Removing any module of our mechanism causes a significant decrease in performance. The comparison between HYPER and HYPER w/o query conditional indicates that the attention module in QPS can successfully generate suitable prompts for different queries and thus improve performance. Moreover, removing either regularization term of CPR degrades performance. Therefore, we conclude that both the alignment and the uniformity regularization are crucial for enabling effective query-conditional prompt generation for different queries.

Prompt Attention Similarity vs Task Similarity We conduct additional experiments to investigate whether the similarities of prompt attention distributions reflect the similarities between different tasks. Specifically, we calculate the similarities between the mean attention score distributions $A$ of the queries of each task, and the results are shown in Figure 2. There are two clearly distinct groups of similar attention score distributions, AY2 and WoW in the top-left corner and the others in the bottom-right, which are bounded by green lines. After reviewing the data, we find that the two groups of datasets can be distinguished by the lengths of their input queries. In particular, queries of AY2 and WoW usually consist of multiple sentences, while queries of the other datasets contain only one or a few sentences.
This implies our mechanism can recognize the different lengths of different queries, which is a prerequisite for dynamically adopting different methods to process different queries, such as extracting important information from long queries or predicting implicit relations in short queries. Moreover, we take a closer look and observe the sub-groups of attention score distributions annotated by blue lines. Comparing the queries of these two groups, we find the group composed of TQA, HoPo, and FEVER requires more inference skills than the group composed of zsRE and NQ. This implies our query-conditional module can even distinguish more subtle differences in queries, which qualitatively reflects the effectiveness of QPS. 

Performance vs Efficiency

We further investigate the efficiency of our method, since real-world applications require not only better retrieval results but also low retrieval latency. To measure latency, we adopt Queries Per Second (QPS) as the metric; a higher QPS means that more queries can be processed in the same time. We evaluate both dense and sparse retrievers with and without HYPER on KILT, and the results are shown in Figure 4. Compared with the backbone models, HYPER obtains better performance and higher QPS, which demonstrates the better efficiency of our method. Meanwhile, although BM25 is more efficient, neural methods significantly outperform it on the retrieval metric.

Impact of Parameters

To better understand the effect of our method, we vary the number of basic prompts ($N$) and the weight of CPR ($\lambda_c$) and test these variants on both KILT and BEIR. Specifically, we vary the number of shared prompts in {2, 5, 10, 20, 30, 50} with the weight of CPR fixed to 0.1. Likewise, we vary the weight of CPR in {0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.5} with the number of shared prompts fixed to 20. The average scores of the different variants are shown in Figure 5. As the number of shared prompts grows, the performance first increases and then decreases. This implies that more shared prompts enable the model to learn more of the patterns that exist in the data, while too many shared prompts suffer from the sparsity of the representation space and thus lead to insufficiently trained combinations of shared prompts. Similarly, we find that a moderate weight of CPR leads to the best performance. This indicates that the proposed CPR benefits the query-conditional module and thus improves performance, while a too large $\lambda_c$ causes the main objective $L_q$ to be neglected.

4. RELATED WORK

Information retrieval can be generally defined as searching for documents relevant to a short text given as an input query. The major challenge of IR comes from the huge number of candidate documents, which forces the matching function between query and documents to be extremely simple for the sake of efficiency. Therefore, we need superior encoding methods for queries and documents to improve retrieval accuracy. Traditional methods such as tf-idf and BM25 rely on term matching and build high-dimensional, sparse vectors (Yang et al., 2017; Robertson & Zaragoza, 2009) for effective retrieval. Although they are effective across various tasks of different domains (Chen et al., 2017; Yang et al., 2017), they fail to adapt to more specific tasks and are largely outperformed by the recently popular neural text retrievers when sufficient training samples are available.

Neural text retrievers are based on pre-trained language models such as BERT (Devlin et al., 2019) and can be classified into two types, dense-vector retrievers (Xiong et al., 2021) and sparse-vector retrievers. A dense-vector retriever (Karpukhin et al., 2020; Gao & Callan, 2021) encodes both queries and documents into low-dimensional vectors and then calculates the relevance scores between them. Although dense vectors are more effective for semantic retrieval (Lin & Lin, 2022), compressing texts into them may result in a loss of information. In contrast, a sparse-vector retriever (Formal et al., 2021b; Khattab & Zaharia, 2020) encodes both queries and documents into high-dimensional, sparse vectors and then calculates the co-occurrence (Nogueira et al., 2019a) or top coordinate terms (Formal et al., 2021b) of words. Therefore, it can achieve effective lexicon matching, and the varying number of active dimensions in the vectors relieves the information loss in encoding, although it introduces an additional quantization process to avoid the instability of real-valued vector entries.
Although neural retrievers can perform well with a moderate amount of data, in real-world applications the data of target tasks can be considerably scarce (Thakur et al., 2021). Hence, zero-shot and few-shot settings for retrieval tasks receive more and more attention, and various methods have been proposed to improve model performance under these settings, including unsupervised pre-training (Xu et al., 2022; Wang et al., 2022), data augmentation (Nogueira et al., 2019b; Thakur et al., 2021; Dai et al., 2022), and enhanced training (Reimers & Gurevych, 2019; Xiong et al., 2020). However, methods utilizing transfer learning have only been investigated preliminarily, and there is still large room for improvement.

Technically, HYPER is similar to applying prefix tuning to IR tasks as in Tam et al. (2022) and discrete prompt tuning to natural language tasks as in Sanh et al. (2022), but it goes beyond a comparison of existing mechanisms and focuses on dynamically generating query-conditional prompts, enabling a neural retriever to process queries of different tasks uniformly. In addition, HYPER is inspired by HyperPrompt (He et al., 2022), which explores prompt-based task-conditioning of self-attention in Transformers. Nonetheless, HYPER dynamically generates the prompt according to each query itself rather than necessarily relying on task-level information, which enables our model to obtain superior generalization and transferability. Moreover, HYPER employs the prefix-tuning method to utilize the dynamic prompts rather than concatenating prompts into the self-attention layer directly.

5. CONCLUSION

In this paper, we propose to process queries of different tasks uniformly, regardless of the differences in query format and the varying priorities of query vectors. Specifically, we present HYPER, a hyper-prompted training mechanism that can be easily combined with any transformer-based neural retriever. In HYPER, to enable the uniform processing of queries, we propose Query-conditional Prompt Synthesizing (QPS) to dynamically synthesize different parameter combinations for different queries. Moreover, to resolve the mode collapse problem of the attention score distributions in QPS, we propose the Contrastive Prompt Regularization (CPR) to simultaneously learn representations of hyper-prompts and a clustering of different types of queries across different domains or tasks. We conduct extensive experiments, which demonstrate that our method can significantly improve the in-domain and out-of-domain zero-shot retrieval performance of a neural retriever. In-depth analyses reveal how our mechanism enables the uniform processing of queries.

A FEW-SHOT EVALUATION ON KILT

In Table 4, we continue training models on a new task in a few-shot setting. Our method consistently outperforms several strong baselines, which demonstrates that it helps the model adapt more quickly to different tasks in the same domain by sharing knowledge among tasks.

B COMPARISON OF THE TASK-SPECIFIC FINE-TUNING MODEL

We also test the performance of directly fine-tuning SPLADE on each task of KILT, which results in 7 different retrievers (SPLADE-FT). The results are shown in Table 5. SPLADE-FT achieves a significantly better AVG score than SPLADE-MT, although multi-task training brings improvement on 4 out of 7 tasks (i.e., FEVER, zsRE, NQ, and TQA). Besides, by incorporating the proposed HYPER, SPLADE-HypeR achieves an AVG score comparable to that of SPLADE-FT and even outperforms SPLADE-FT on 4 out of 7 tasks (i.e., FEVER, zsRE, NQ, TQA, and WoW). Notably, SPLADE-FT trains a separate model for each task, which results in roughly 6 times more parameters than our model.

We further investigate how the prefix length influences the computation cost of query encoding and the retrieval performance. To this end, we vary the prefix length in {5, 10, 20, 50, 100, 200} and record both the average time of encoding a query and the retrieval performance on KILT. We compare our method with Prefix-Tuning, and the results are shown in Figure 6. The inference time is measured on a machine with an Intel Xeon E5-2678 CPU. As we can see, HypeR obtains better average retrieval scores than Prefix-Tuning for all prefix lengths, and both Prefix-Tuning and HypeR achieve their best performance when the prefix length is 100. Notably, HypeR costs 4% and 25% more encoding time than Prefix-Tuning when the prefix length is 10 and 100, respectively.
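The efficiency measurement above (average per-query encoding time for each prefix length, and the relative overhead of one method over another) could be scripted roughly as follows. This is only a sketch: `toy_encode` is a placeholder whose cost grows with the prefix length, not the paper's encoder, and the helper names and the `repeats` parameter are hypothetical.

```python
import time

PREFIX_LENGTHS = [5, 10, 20, 50, 100, 200]

def average_encoding_time(encode, queries, prefix_len, repeats=3):
    """Average wall-clock seconds per query for one prefix length."""
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            encode(q, prefix_len)
    return (time.perf_counter() - start) / (repeats * len(queries))

def relative_overhead(t_method, t_baseline):
    """Extra encoding time of a method over a baseline, in percent."""
    return 100.0 * (t_method - t_baseline) / t_baseline

def toy_encode(query, prefix_len):
    # Placeholder encoder (NOT the paper's model): work grows with input length.
    n = prefix_len + len(query.split())
    return sum(i * i for i in range(n * 50))

queries = ["what was the first capital city of Australia"] * 20
timings = {p: average_encoding_time(toy_encode, queries, p) for p in PREFIX_LENGTHS}
```

With real encoders, calling `relative_overhead` on the HypeR and Prefix-Tuning timings at the same prefix length would yield percentages of the kind reported above.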

D CASE STUDY

To provide an intuitive understanding of our method, we show some similar and dissimilar queries based on prompt attention distributions in Table 6. Specifically, we randomly sample 3 queries belonging to different tasks as base queries from the test data of all KILT datasets. Then, for each base query, we randomly sample 2 queries from those whose attention scores are among the top 5% most similar. We also randomly sample 2 queries belonging to different datasets from those whose attention scores are among the 5% least similar, since we find there are too many trivial candidates from the same datasets.

Have you ever read anything by John Grisham? Yes, I have read his very first novel "A Time to Kill" which was published in June 1989 after he took four years to write it! I haven't read anything by him but I remember the movies for both a time to kill and the firm.

Dissimilar Query 2 Aidayago2

Multinational commander going back to east Zaire. Jonathan Wright NAIROBI 1996-12-06 The Canadian general in charge of a multinational force for eastern Zaire said on Friday he was going back to Zaire for more information about the plight of about 165,000 Rwandan refugees adrift in the countryside. Lieutenant-General Maurice Baril told a news conference in Nairobi his main concern was for a large group of about 150,000 refugees living off the land in a valley about 65 km (40 miles) west of the eastern city of Goma. If he decided it was necessary and safe for the aircrew, he would not hesitate to order airdrops of food for the refugees, even against the wishes of the government in Kinshasa and the [START ENT] Zairean [END ENT] rebels who control much of eastern Zaire, he said. "Tomorrow I'm going into Rwanda and my intention is to go across into eastern Zaire and try to find out for the second time what the situation is on the ground," he said. General Baril saw rebel leader Laurent Kabila in Goma last week but the rebels told him the crisis was over because most of the Rwandan refugees have already gone home. The rebels do not want the multinational force to deploy on the ground, for fear it might help the Zairean army regain control of the area.



We also report evaluation results using the dense model (e.g., DPR) as the backbone network in Table 1.

We adopt the default setting of Anserini for BM25, where k1 = 0.9 and b = 0.4.

Using dynamic representations of different documents requires rebuilding the index in real time, which incurs a heavy computation cost.

Both models are pre-trained on MS-MARCO (Nguyen et al., 2016) and can provide superior initial performance on KILT.

The effects of different hyperparameters are investigated in Section 3.2.

Our code is available at https://github.com/oklen/HypeR.

The observation is consistent with several previous studies (Maillard et al., 2021; Thakur et al., 2021).
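As a reference for the BM25 setting noted above (k1 = 0.9, b = 0.4), the sketch below scores documents with the textbook Okapi BM25 formula. It is an illustration only: Anserini relies on Lucene's BM25 implementation, which may differ in details such as constant factors and length encoding, and the tiny corpus and whitespace tokenization here are made up for the example.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Textbook Okapi BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # b controls length normalization, k1 controls term-frequency saturation.
        norm = tf[term] + k1 * (1.0 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score

# Illustrative toy corpus (not from the paper).
corpus = [
    "canberra is the capital city of australia".split(),
    "melbourne was the first seat of government".split(),
    "snoop dogg is a rapper".split(),
]
query = "first capital city of australia".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

A smaller b, as in this setting, weakens the penalty on long documents, while a smaller k1 makes repeated occurrences of a term saturate sooner.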



Figure 1: An illustration of the HYPER architecture in multi-task training.

Figure 2: Similarities of the attention score distributions of different tasks.

Figure 3: Visualization of attention weight embeddings for different tasks.

Figure 4: Performance versus QPS (latency).

Figure 5: Effect of the number of shared basic prompts (N) and the weight of CPR (λ_c). The two figures on the left vary N, while the two on the right vary λ_c.

Figure 6: Efficiency comparison between HypeR and Prefix-Tuning. Multi-task fine-tuning can be viewed as the origin.

Zero-shot generalization evaluated on 9 datasets of BEIR. * indicates results from Thakur et al. (2021). † indicates results from Xu et al. (2022).

Ablation study on KILT.

Table 4: Page-level R-precision on KILT in the few-shot setting.

Table 5: Comparison of the task-specific fine-tuning model on KILT. SPLADE-prefix denotes the model trained with prefix-tuning.

Table 6: Queries with similar and dissimilar prompt attention distributions.

Ariz. 1996-12-06 Action Performance Cos Inc said Friday it has agreed to acquire Motorsport Traditions Ltd and Creative Marketing & Promotions Inc for about $13 million in cash and stock. The two firms to be acquired have about $25 million in annual revenues from the design, manufacture and sale and distribution of licensed motorsports products. The deal is expected to close by the end of the year subject to due diligence and other customary closing conditions.

Base Query (WoW): Do you shop online for clothes? Yes quite often. It allows me to directly buy goods from a seller over the internet without having to leave the comfort of my home. What about you? Yeah same I love visiting websites of different retailers directly to see product availability and the best prices. Yeah me too. I use amazon a lot to buy stuff. Like even this computer!

Similar Query 1 (WoW): Online shopping is the better experience to choose the the product on our desire.Yes! I much prefer shopping online at home on the Internet rather than fighting crowds in an actual store. what did you often purchase? Anything and everything! Housewares, clothing, electronics, you name it! I like the convenience of browsing stores on my laptop, tablet computer and smartphone! yes, it includes lot of choices I especially like the functionality of websites like Amazon for shopping. The have a great search feature that allows me to find specific models, brands and items. Such website, improved a lot for customer convenience.

Similar Query 2 (WoW): I like to shop online, probably a bit too much. Lol me too! What's your favorite online store? I love to shop on amazon. So you shop online a lot too? Where do you shop? Amazon as well. I also shop a lot online at Old Navy and Gap. They have great sales. Do they really? I mainly use only amazon. Do you use Amazon Kindle, the series of e-readers designed and marketed by Amazon. No I have never used a kindle. My dad has one of those though and it looks cool if you love books.
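The sampling procedure used for the case study in Appendix D (similar queries drawn from the top 5% and dissimilar queries from the lowest 5% by similarity of prompt attention) can be sketched as follows. Several details are assumptions made for illustration: cosine similarity as the similarity measure, the reading that dissimilar queries must come from a dataset other than the base query's, the function name `sample_case_study`, and the synthetic attention data.

```python
import numpy as np

def sample_case_study(attn, datasets, base_idx, rng, k=2, frac=0.05):
    """Pick k similar and k dissimilar queries for one base query.

    attn:     (n, N) prompt attention distribution of each query
    datasets: length-n list with the dataset name of each query
    """
    unit = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    sims = unit @ unit[base_idx]          # cosine similarity (an assumption)
    order = np.argsort(-sims)             # most similar first
    order = order[order != base_idx]      # never return the base query itself
    cut = max(k, int(round(frac * len(order))))
    top, bottom = order[:cut], order[-cut:]
    # Assumption: dissimilar examples come from a dataset other than the base's.
    bottom = np.array([i for i in bottom if datasets[i] != datasets[base_idx]])
    similar = rng.choice(top, size=k, replace=False)
    dissimilar = rng.choice(bottom, size=k, replace=False)
    return similar.tolist(), dissimilar.tolist()

# Synthetic data: two datasets whose queries attend to different basic prompts.
rng = np.random.default_rng(0)
n, n_prompts = 200, 8
datasets = ["NQ"] * 100 + ["WoW"] * 100
pattern = np.zeros((n, n_prompts))
pattern[:100, :2] = 0.5
pattern[100:, -2:] = 0.5
attn = pattern + 0.05 * rng.random((n, n_prompts))
attn /= attn.sum(axis=1, keepdims=True)   # each row is a distribution
similar, dissimilar = sample_case_study(attn, datasets, base_idx=0, rng=rng)
```

On this synthetic data the similar queries share the base query's dataset and the dissimilar ones do not, which mirrors the clustering behavior that Table 6 illustrates with real queries.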

