LEARNING LISTWISE DOMAIN-INVARIANT REPRESENTATIONS FOR RANKING

Abstract

Domain adaptation aims to transfer the knowledge acquired by models trained on (data-rich) source domains to (low-resource) target domains, and a popular approach is invariant representation learning. While such methods have been studied extensively for problems including classification and regression, how they apply to ranking problems, where the data and metrics follow a list structure, is not well understood. Theoretically, we establish a domain adaptation generalization bound for ranking under listwise metrics such as MRR and NDCG that naturally suggests an adaptation method via learning listwise invariant feature representations. Empirically, we demonstrate the benefits of listwise invariant representations through unsupervised domain adaptation experiments on real-world ranking tasks, including passage reranking. The main novelty of our results is that they are tailored to listwise ranking: the invariant representations are learned at the list level rather than at the item level.

1. INTRODUCTION

Learning to rank applies machine learning to ranking problems, which are at the core of many everyday products and applications, including but not limited to search engines and recommendation systems (Liu, 2009). The availability of ever-increasing amounts of training data has enabled larger and larger models to improve the state of the art on more ranking tasks. A prominent example is text retrieval and ranking, where neural language models with billions of parameters easily outperform traditional ranking models such as BM25 (Nogueira et al., 2020). But the need for abundant data means that large neural models may not benefit tasks with little to no annotated data, where they can actually fare worse than baselines such as gradient boosted decision trees (Qin et al., 2021).

Techniques for extending the benefits of large models to low-resource domains include zero-shot learning and domain adaptation. In the former, instead of directly optimizing for the task of interest with limited data, referred to as the target domain, the model is trained on a data-rich source domain with a similar data distribution. The latter considers the scenario where (abundant unlabeled) data from the target domain is available, which can be leveraged to estimate the domain shift and improve transferability, e.g., by learning invariant feature representations. This setting and its algorithms have been studied extensively for problems including classification and regression (Blitzer et al., 2008; Ganin et al., 2016; Zhao et al., 2018). For ranking problems, however, existing methods are mostly limited to specific tasks and applications. In fact, due to the inherent list structure of the metrics and data, theoretical explorations of domain adaptation for ranking are only nascent.

To this end, we provide the first analysis of domain adaptation for listwise ranking via domain-invariant representations (Section 3), building on the foundational work by Ben-David et al. (2007) for domain adaptation in the binary classification setting. One result of our theory is that, when the domain shift is small in terms of the Wasserstein distance, a ranking model optimized on the source is transferable to the target domain, and its performance under metrics such as MRR and NDCG can be bounded.

Inspired by our theory, we propose an adversarial training method for learning listwise domain-invariant representations, called ListDA (Section 4), which minimizes the distributional shift between the source and target domains in the feature space to improve generalization on the target domain. Unlike traditional classification and regression settings, in ranking each input follows a list structure, containing the items to be ranked. The main technical novelty of ListDA is that the invariant representations it learns are of each list as a whole rather than of the individual items it contains.

Empirically, we evaluate ListDA for unsupervised domain adaptation on two ranking tasks (Section 5 and Appendix C), including passage reranking, a fundamental task in information retrieval (Craswell et al., 2019), where the goal is to rerank a list of candidate documents retrieved by a first-stage retrieval model in response to a search query. We adapt T5 neural rerankers (Raffel et al., 2020) fine-tuned on the general-domain MS MARCO dataset (Bajaj et al., 2018) to two specialized domains: biomedical and news articles. Our results demonstrate the benefits of invariant representations for the transferability of rankers trained with ListDA.

2. PRELIMINARIES

Learning to Rank. A ranking problem is defined by a joint distribution over lists X ∈ X of items and nonnegative relevance scores Y = (Y_1, ..., Y_ℓ) ∈ R^ℓ_≥0. We assume that all lists have length ℓ and that the ground-truth scores are a function of the lists, y(X), so that a ranking problem is equivalently defined by a marginal distribution μ^X of lists along with a scoring function y : X → R^ℓ_≥0. The goal is to train a ranker f : X → S_ℓ that maps each list x to rank assignments r := f(x) ∈ S_ℓ, where r_i is the predicted rank of item i and S_ℓ denotes the set of permutations on {1, 2, ..., ℓ}, recovering the descending ordering of the relevance scores y_i, i.e., y_i > y_j ⟺ r_i < r_j for all i ≠ j. The more common setup is to train a scoring function h : X → R^ℓ that outputs a list of ranking scores such that s_i := h(x)_i correlates with y_i in each list, and its ordering agrees with that of the ground-truth scores. Rank assignments can be obtained from the ranking scores by taking their descending ordering or via probabilistic models (Section 3).

The quality of the predicted ranks is measured by ranking metrics u : S_ℓ × R^ℓ_≥0 → R_≥0, functions that take as inputs the rank assignments of a list along with the ground-truth relevance scores and output a utility score. Popular listwise metrics in information retrieval include reciprocal rank and normalized discounted cumulative gain (Voorhees, 1999; Järvelin & Kekäläinen, 2002):

Definition 1 (RR). Suppose the ground-truth relevance scores y ∈ {0, 1}^ℓ are binary, then the reciprocal rank of the rank assignments r ∈ S_ℓ is RR(r, y) = max({r_i^{-1} : 1 ≤ i ≤ ℓ, y_i = 1} ∪ {0}). The expectation of RR over the dataset, E[RR(f(X), y(X))], is called mean reciprocal rank (MRR).

Definition 2 (NDCG). The discounted cumulative gain (DCG) and the normalized DCG (with identity gain functions, w.l.o.g.) of the rank assignments r ∈ S_ℓ are
DCG(r, y) = Σ_{i=1}^ℓ y_i / log_2(r_i + 1) and NDCG(r, y) = DCG(r, y) / IDCG(y),
where IDCG(y) = max_{r′∈S_ℓ} DCG(r′, y), the ideal DCG, is the maximum DCG value of a list and is attained by the descending ordering of the ground-truth y_i's.

Domain Adaptation. The present work studies the adaptation of a scorer from a source domain (μ^X_S, y_S) to a target domain (μ^X_T, y_T). When the domain shift is small, i.e., μ^X_S ≈ μ^X_T and y_S ≈ y_T, scorers trained on the source are expected to be transferable to the target without the explicit need for labeled target data. Indeed, the target performance is bounded by the source performance in such cases. As an example, for binary classification, we have the following generalization bound for randomized classifiers (Shen et al., 2018):

Theorem 1. Let binary classification problems on a source and a target domain be given by joint distributions μ_S, μ_T over inputs and labels (X, Y) ∈ X × {0, 1}. Let F ⊂ [0, 1]^X be a class of L-Lipschitz predictors, and define the error rate of f ∈ F by E(f) := E_{(X,Y)∼μ}[1(Y ≠ Ŷ)] = E_{(X,Y)∼μ}[(1 − f(X)) · 1(Y = 1) + f(X) · 1(Y = 0)], meaning that the output classifications are probabilistic according to P(Ŷ = 1 | X = x) = f(x). Define λ* := min_f (E_S(f) + E_T(f)). Then for all f ∈ F,
E_T(f) ≤ E_S(f) + 2L · W_1(μ^X_S, μ^X_T) + λ*,
where μ^X denotes the marginal distribution of the input X.

The domain shift in Theorem 1 is measured by the Wasserstein-1 distance between the source and target marginal input distributions, whose Kantorovich-Rubinstein dual form is given by (Edwards, 2011):

Definition 3 (Wasserstein-1). Let p, q be probability measures on a metric space (X, d_X). Their Wasserstein-1 distance is W_1(p, q) = sup_{f∈Lip(1)} (∫_X f(x) dp(x) − ∫_X f(x) dq(x)), where the supremum is taken over 1-Lipschitz functionals f : X → R:

Definition 4 (Lipschitz). Let (X, d_X), (X′, d_X′) be metric spaces. A function f : X → X′ is L-Lipschitz, denoted by f ∈ Lip(L), if d_X′(f(x_1), f(x_2)) ≤ L · d_X(x_1, x_2) for all x_1, x_2 ∈ X.
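As a concrete reference for Definitions 1 and 2, here is a small NumPy sketch of RR and NDCG (the function names and the base-2 log convention for DCG are ours; rank 1 denotes the top position):

```python
import numpy as np

def reciprocal_rank(r, y):
    """RR: 1 / (rank of the best-ranked relevant item), or 0 if none is relevant.

    r : (l,) predicted rank assignments, a permutation of 1..l (rank 1 = best).
    y : (l,) binary ground-truth relevance labels.
    """
    relevant_ranks = r[y == 1]
    return 1.0 / relevant_ranks.min() if relevant_ranks.size else 0.0

def ndcg(r, y):
    """NDCG with identity gain: DCG(r, y) / IDCG(y); assumes some y_i > 0."""
    dcg = np.sum(y / np.log2(r + 1))
    # Ideal ranks: the descending ordering of y attains the maximum DCG.
    ideal_r = np.empty_like(r)
    ideal_r[np.argsort(-y)] = np.arange(1, len(y) + 1)
    idcg = np.sum(y / np.log2(ideal_r + 1))
    return dcg / idcg
```

For instance, with r = (2, 1, 3) and y = (0, 1, 1), the top-ranked item is relevant, so RR = 1, while NDCG is strictly below 1 because the two relevant items are not ranked 1st and 2nd.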

3. DOMAIN ADAPTATION GENERALIZATION BOUND FOR RANKING

We establish the first domain adaptation generalization bound for ranking problems under listwise ranking metrics. Specifically, we consider the setting of learning scoring functions and (transferable) representations, where the end-to-end scorer f = h ∘ g is a composition of a shared feature map g : X → Z and a scoring function h : Z → R^ℓ on the learned list representations. For instance, if the end-to-end scorer is an m-layer MLP, we could treat the first (m − 1) layers as g and the last as h. For the bound, we let the rank assignments r ∈ S_ℓ be generated from the output scores s := h ∘ g(x) probabilistically via a Plackett-Luce model (Plackett, 1975; Luce, 1959), with the exponentiated scores exp(s_i) as its parameters (Cao et al., 2007; Guiver & Snelson, 2009).

Definition 5 (P-L model). A Plackett-Luce model with parameters v ∈ R^ℓ_>0 specifies a distribution over S_ℓ whose probability mass function, denoted p_v, is defined for all r ∈ S_ℓ by
p_v(r) = Π_{i=1}^ℓ ( v_{I(r)_i} / Σ_{j=i}^ℓ v_{I(r)_j} ),
where I(r)_i is the index of the item ranked at position i, so that r_{I(r)_i} = i for all i.

With the above probabilistic procedure for generating rank assignments, the performance (or utility) of a scorer f w.r.t. a ranking metric u : S_ℓ × R^ℓ_≥0 → R_≥0 is evaluated by
E(f) := E_{X∼μ^X}[ max_{r∈S_ℓ} u(r, y(X)) − E_{R∼p_exp(f(X))}[u(R, y(X))] ],
which computes its suboptimality relative to the maximum attainable utility. The randomization of the rank assignments using the P-L model is analogous to that of the classifications (using the Bernoulli model) in Theorem 1, and its purpose, together with the exponentiation, is to make the ranking metrics continuous w.r.t. the raw scores output by the model. Our analysis, however, differs from that of Theorem 1 due to difficulties arising from the list structure of listwise ranking metrics. For instance, in the realizable setting, the tightness of Theorem 1 hinges on the uniqueness of the optimal classifier. The optimal ranker on ranking problems, in contrast, is generally nonunique. Consider a list with scores y = (1, 1, 0) as an example: the maximum utility is attained by both rank assignments r = (1, 2, 3) and (2, 1, 3).

To proceed, we require the following Lipschitz assumptions.

Assumption 1. The ranking metric u : S_ℓ × R^ℓ_≥0 → R_≥0 is bounded by B and (Euclidean) L_u-Lipschitz in the second argument, the ground-truth relevance scores y.

Assumption 2. The ground-truth scoring function y : X → R^ℓ_≥0 is L_y-Lipschitz (Euclidean on the output space).

We will show that RR and NDCG satisfy Assumption 1. Assumption 2 says that similar lists (i.e., lists close in X) should have similar ground-truth scores. It is satisfied, for instance, when X is finite and the scores are bounded (this argument is used in Corollary 3); this setup is typical with text data, where the inputs are sequences of one-hot vocabulary encodings.

Assumption 3. The space of input lists X is a metric space, and the class H of scoring functions h : Z → R^ℓ is L_h-Lipschitz (Euclidean on the output space).

Assumption 4. The feature space Z is a metric space, and the class G of feature maps g : X → Z is such that, for all g ∈ G, the restrictions of g to the supports of μ^X_S and μ^X_T, g|_supp(μ^X_S) and g|_supp(μ^X_T) respectively, are both invertible with L_g-Lipschitz inverses.

Assumption 3 is standard in generalization and complexity analyses and can be enforced with, e.g., L2-regularization (Anthony & Bartlett, 1999; Bartlett et al., 2017). The last assumption is technical: it says that the original inputs are recoverable from their feature representations via a Lipschitz g^{-1}, meaning that the feature map g should retain as much information from the inputs as possible on each domain. Note that this assumption does not hinder domain-invariant representation learning; as long as G is sufficiently expressive, there exists g ∈ G satisfying μ^Z_S = μ^Z_T.

We are now ready to state our domain adaptation generalization bound for learning to rank:

Theorem 2. Under Assumptions 1 to 4, for any g ∈ G, define λ*_g := min_h (E_S(h ∘ g) + E_T(h ∘ g)). Then for all h ∈ H,
E_T(h ∘ g) ≤ E_S(h ∘ g) + 2(2 L_u L_y L_g + B L_h √ℓ) · W_1(μ^Z_S, μ^Z_T) + λ*_g,
where μ^Z denotes the marginal distribution of the feature Z, μ^Z(z) := μ^X(g^{-1}(z)).

The bound says that if the features computed by g are domain-invariant, i.e., μ^Z_S = μ^Z_T, and the optimal joint risk λ*_g remains low, then a scorer h optimized for the source domain will also perform well on the target. This suggests that domain adaptation for ranking can be achieved via listwise invariant representation learning, where we optimize the feature map g using (unlabeled) source and target domain data to minimize W_1(μ^Z_S, μ^Z_T) by aligning the list feature distributions, and simultaneously optimize h on the source domain with labeled data. We propose such a method in Section 4. Indeed, generalization bounds of the type of Theorems 1 and 2 (originally derived by Ben-David et al. (2007)) form the basis of a family of domain adaptation methods based on invariant representation learning, which has seen empirical success in fields ranging from vision (Zhao et al., 2022) to language (Ramponi & Plank, 2020).

The key distinction of our bound is that the features Z are defined on each list X as a whole, rather than on the individual items contained in the lists. To illustrate, suppose the feature representation of each list is a stack of ℓ k-dimensional vectors, g(x) = z = (u_1, ..., u_ℓ) ∈ R^{ℓ×k}, where u_i ∈ R^k corresponds to the i-th item in the list. For invariant representation learning, we aim for the distributional alignment of not just the items but also the lists, μ^Z_S = μ^Z_T. Concretely, denote the distribution over item feature vectors by μ^U(u) := P_{Z∼μ^Z}(u ∈ Z); then μ^Z_S = μ^Z_T ⟹ μ^U_S = μ^U_T, but the converse is not true! We demonstrate this point empirically in Section 5: listwise invariant representations are more appropriate for domain adaptation for listwise ranking, where the data and metrics all follow a list structure, than the pointwise method of learning invariant representations of items.

Lastly, to instantiate our bound on MRR and NDCG, we simply verify their Lipschitzness:

Corollary 3 (Bound for MRR). RR is 1-Lipschitz in y, thereby
E_T[RR(h ∘ g)] ≥ E_S[RR(h ∘ g)] − 2(2 L_y L_g + L_h √ℓ) · W_1(μ^Z_S, μ^Z_T) − λ*_g,
where for brevity we write E[RR(h ∘ g)] := E_{X∼μ^X, R∼p_exp(h∘g(X))}[RR(R, y(X))].

Corollary 4 (Bound for NDCG). Suppose U_min ≤ IDCG(y) ≤ U_max for some U_min, U_max ∈ (0, ∞) and all y ∈ y_S(supp(μ^X_S)) ∪ y_T(supp(μ^X_T)). Then NDCG is O(√ℓ)-Lipschitz in y, thereby
E_T[NDCG(h ∘ g)] ≥ E_S[NDCG(h ∘ g)] − O(√ℓ (L_y L_g + L_h)) · W_1(μ^Z_S, μ^Z_T) − λ*_g,
where for brevity we write E[NDCG(h ∘ g)] := E_{X∼μ^X, R∼p_exp(h∘g(X))}[NDCG(R, y(X))].
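To make Definition 5 and the P-L randomization of rank assignments concrete, the following NumPy sketch (our own illustrative code, not the authors' implementation) computes p_v(r) for v = exp(scores) and samples rank assignments by sequentially picking items with probability proportional to their parameters:

```python
import numpy as np

def plackett_luce_pmf(scores, r):
    """p_v(r) for a P-L model with parameters v = exp(scores) (Definition 5)."""
    v = np.exp(scores)
    order = np.argsort(r)  # I(r): indices of the items ranked 1st, 2nd, ...
    prob = 1.0
    for i in range(len(v)):
        # Item ranked at position i is chosen among the items not yet ranked.
        prob *= v[order[i]] / v[order[i:]].sum()
    return prob

def plackett_luce_sample(scores, rng):
    """Draw rank assignments R ~ p_exp(scores) by sequential sampling."""
    v = np.exp(scores)
    remaining = list(range(len(v)))
    r = np.empty(len(v), dtype=int)
    for rank in range(1, len(v) + 1):
        p = v[remaining] / v[remaining].sum()
        idx = rng.choice(len(remaining), p=p)
        r[remaining.pop(idx)] = rank
    return r
```

With all scores equal, every permutation has probability 1/ℓ!; as a score s_i grows, mass concentrates on rankings that place item i first, which is what makes the expected utility in E(f) a smooth function of the raw scores.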

4. LEARNING LISTWISE DOMAIN-INVARIANT REPRESENTATIONS

As discussed in Section 3, Theorem 2 suggests that domain adaptation for ranking can be achieved with listwise invariant representations by learning a feature map g ∈ G that minimizes the distributional shift between source and target in the feature space Z, as measured by W_1(μ^Z_S, μ^Z_T). Specifically, we consider the setup where the feature representation z := g(x) of each input list of ℓ items, x = (x_1, ..., x_ℓ), is a stack of ℓ feature vectors, i.e., Z = R^{ℓ×k} and z := (u_1, ..., u_ℓ), where u_i ∈ R^k is the learned feature vector of the i-th item in the list. This setup is standard in many learning to rank implementations; e.g., in neural text ranking, each feature vector is an embedding of the input text computed by a language model (Guo et al., 2020).

Learning invariant representations is similar to generative modeling, and a well-known technique from the GAN literature is adversarial training (Goodfellow et al., 2014; Ganin et al., 2016), which solves a minimax problem max_g min_{f_ad} L_ad(g, f_ad) between two players: the feature map g and an adversary f_ad : Z → R. The objective is defined with an adversarial loss function ℓ_ad : R × {0, 1} → R, whose inputs are the adversary output â := f_ad(z) and the domain identity a (set to 1 for target):
L_ad(g, f_ad) := E_{x∼μ^X_S}[ℓ_ad(f_ad ∘ g(x), 0)] + E_{x∼μ^X_T}[ℓ_ad(f_ad ∘ g(x), 1)].
The adversarial loss corresponds to probability metrics between μ^Z_S and μ^Z_T under specific choices of ℓ_ad (Goodfellow et al., 2014; Arjovsky et al., 2017). With the 0-1 loss ℓ_ad(â, a) = (1 − a) · 1(â ≥ 0) + a · 1(â < 0), L_ad becomes the (balanced total) classification error of f_ad as a domain discriminator predicting the domain identities,
L_ad(g, f_ad) = P_{x∼μ^X_S}(f_ad ∘ g(x) ≥ 0) + P_{x∼μ^X_T}(f_ad ∘ g(x) < 0),
and it gives an upper bound on W_1(μ^Z_S, μ^Z_T) under optimality of f_ad:

Proposition 5. Denote the metric on R^{ℓ×k} by d, and define B := sup_{(z,z′)∈supp(μ^Z_S × μ^Z_T)} d(z, z′). If ℓ_ad is the 0-1 loss, then W_1(μ^Z_S, μ^Z_T) ≤ B(1 − min_{f_ad} L_ad(g, f_ad)).

In practice, the 0-1 loss is replaced by a surrogate loss when training f_ad to minimize the classification error, for which we use the logistic loss in our experiments:
ℓ_ad(â, a) = log(1 + e^{(1−2a)â}).   (1)
Finally, by parameterizing the function class F_ad ∋ f_ad, e.g., with neural networks, the minimax problem can be optimized with gradient descent-ascent (w.r.t. f_ad and g, respectively), implemented with a gradient reversal layer on top of g (Ganin et al., 2016). To prevent g from converging to trivial solutions such as x ↦ 0 and causing information loss in the learned features (thereby increasing the minimum achievable E_S and λ*_g), g is optimized together with the scorer h under a joint objective:
min_{h∈H, g∈G} L_joint(h, g), where L_joint(h, g) = L_rank(h ∘ g) − λ · min_{f_ad∈F_ad} L_ad(g, f_ad),
L_rank is a ranking loss of choice (surrogate to the ranking metric), and λ > 0 is a hyperparameter controlling the strength of domain-invariant feature learning.

Choosing F_ad. The only missing piece is the choice of the discriminator function class. Unlike in prior work, the distributions μ^Z_S, μ^Z_T being modeled here are defined not over items but over lists (or sets, to be more precise), z = (u_1, ..., u_ℓ), so common choices such as MLPs may not be appropriate. Indeed, a possible design choice is to flatten each list into a single ℓk-dimensional vector and feed it into an MLP discriminator, but this does not capture the permutation-invariance property of the lists z; a list and its permutations are perceived as distinct inputs by the MLP discriminator under this design, despite being identical as far as the ranker h is concerned. Without permutation-invariance built in, the optimization of f_ad is data-inefficient. Therefore, tailored to list-like inputs, we use transformers (without positional encoding) with mean-pooling for F_ad as a novelty in our implementation (Vaswani et al., 2017), which is permutation-invariant, continuously differentiable, and has good expressive power. Fig. 1 includes a block diagram of our method instantiated on the RankT5 model in Section 5, referred to as ListDA.
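The permutation-invariance argument can be checked directly: self-attention without positional encodings is permutation-equivariant, so mean-pooling its outputs yields a permutation-invariant logit. Below is a toy single-head, single-layer NumPy sketch of such a discriminator (weights and names are our own illustration; the paper's f_ad stacks three full transformer blocks):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_discriminator(z, Wq, Wk, Wv, w_out):
    """Toy permutation-invariant discriminator: one self-attention layer
    (no positional encoding) followed by mean-pooling and a linear logit.

    z : (l, k) stacked item feature vectors of one list.
    """
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    h = attn @ v                      # permutation-EQUIVARIANT in the list dim
    return h.mean(axis=0) @ w_out     # mean-pooling makes it INVARIANT
```

Permuting the rows of z permutes the rows of h identically, and the mean-pool erases that ordering, so the domain logit is unchanged; a flattened-input MLP has no such guarantee.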

5. EXPERIMENTS ON PASSAGE RERANKING

In this section, we evaluate ListDA on the passage reranking task, where given a text query, the goal is to rank the passages in a retrieved candidate set by their relevance to the query. In Appendix C, we additionally evaluate ListDA on the ranking task from the Yahoo! Learning to Rank Challenge. Reranking is employed in scenarios where the corpus is too large for all (millions of) documents to be exhaustively ranked by more accurate but expensive models, such as SOTA cross-attention rankers based on language models; instead, a simpler but efficient first-stage model, such as a sparse or dense retriever (e.g., BM25 or DPR), retrieves a candidate set of (hundreds or thousands of) passages (Robertson & Zaragoza, 2009; Karpukhin et al., 2020), whose ranks are then refined by a more sophisticated reranking model.

We consider unsupervised domain adaptation. Training neural rerankers requires large amounts of queries and document-relevance annotations. While such data can be obtained relatively easily from search engines under weak supervision for general text domains, annotations in specialized domains such as scientific literature are costly to acquire. Since unannotated documents are almost always readily available, this makes for a suitable candidate for unsupervised domain adaptation: specialized rerankers are adapted from ones trained on general domains.

Models. We use BM25 as the first-stage retriever for simplicity and focus on the adaptation of the reranker, a cross-attention model based on T5 Base with 250 million parameters (Zhuang et al., 2022). Given a query q and documents d_1, ..., d_ℓ retrieved by BM25, the list is formed by concatenating the query and each document (with the title, if available), x = ([q, d_1], [q, d_2], ..., [q, d_ℓ]). We follow the setup in Section 4 and use the T5 encoder as the feature map g, so that the feature representation of the list is the first-token output embeddings that T5 computes on each query-document (q-d) pair, z = (u_1, ..., u_ℓ) = (T5(x_1), ..., T5(x_ℓ)) ∈ R^{ℓ×1024}. The ranking scores are then obtained by projecting each q-d embedding with a dense layer, s = (h(u_1), ..., h(u_ℓ)). We train the ranking model by minimizing the listwise softmax cross-entropy ranking loss:
ℓ_rank(s, y) = −Σ_{i=1}^ℓ y_i log( exp(s_i) / Σ_{j=1}^ℓ exp(s_j) ).

ListDA adversarial training also follows Section 4. The discriminator f_ad is a stack of three transformer blocks with the same architecture as those of the T5 encoder. Given a list feature z = (u_1, ..., u_ℓ), we obtain its domain prediction by feeding all vectors u_i through the transformer blocks at once as a sequence, taking the mean-pool of the outputs, and projecting it to a logit with a dense layer. We use an ensemble of five discriminators, as in Elazar & Goldberg (2018), to reduce sensitivity to randomness in initialization and training, replacing L_ad with Σ_{i=1}^5 L_ad(g, f_ad^(i)).

Datasets. The source domain in our experiments is MS MARCO for passage ranking, a large-scale dataset containing 8 million passages from the web that covers a wide range of topics, and 532,761 search query and relevant passage pairs (Bajaj et al., 2018). The target domains are biomedical (TREC-COVID, BioASQ) and news articles (Robust04) (Voorhees et al., 2021; Tsatsaronis et al., 2015; Voorhees, 2005). The data are collected and preprocessed as in the BEIR benchmark (Thakur et al., 2021); their paper also contains statistics of the datasets. Recall from above that the inputs to our cross-attention model are q-d pairs. However, no training queries are available on two of the three target domains, so following Ma et al. (2021), we synthesize training queries on all target domains in a zero-shot manner via a T5 XL query generator (QGen) trained on MS MARCO relevant q-d pairs. QGen synthesizes each query as a seq-to-seq task given a passage from the target corpus as input, whereby we expect the synthesized queries to be related to the input passages. See Table 7 for samples of QGen q-d pairs.

Baseline Methods. We compare ListDA to three baseline methods. In zero-shot learning, the reranker is trained on MS MARCO only and directly evaluated on the targets. In QGen PL, we treat target-domain QGen synthesized q-d as relevant pairs and train the reranker on both MS MARCO and QGen q-d pairs (PL as in these q-d pairs are "pseudolabeled" by QGen). This method underlies several recent works on domain adaptation of text retrievers and rerankers (Ma et al., 2021; Sun et al., 2021; Wang et al., 2022). All adaptation methods, including ListDA, are applied to the target domains separately, i.e., we train a model for each source-target pair. Lastly, prior work applying invariant representation learning to domain adaptation performs feature alignment at the item level (Cohen et al., 2018; Tran et al., 2019; Xin et al., 2022) instead of the list level (our ListDA). Yet the key message of this paper is that the former is not suitable for listwise ranking. To verify this claim, we also compare ListDA to ItemDA, which learns pointwise invariant representations with a three-layer MLP discriminator (no improvements from going larger) whose inputs are the q-d embeddings individually (similar to the DANN model of Ganin et al. (2016)). Due to space constraints, experiment details including hyperparameter settings and the construction of training example lists are relegated to Appendices B.1 and B.2.
We also include case studies and additional results with the pairwise logistic ranking loss and the hybrid method of ListDA + QGen PL in Appendix B.

[Fig. 3 caption: on the left, η_ad = 0.004 is fixed and λ varies; on the right, λ = 0.02 is fixed and η_ad varies.]
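For reference, the listwise softmax cross-entropy ranking loss used to train the reranker has a direct NumPy implementation (a sketch with our own naming; the actual training applies it to the T5 reranker's scores):

```python
import numpy as np

def softmax_cross_entropy_rank_loss(s, y):
    """Listwise softmax CE: -sum_i y_i * log(softmax(s)_i).

    s : (l,) predicted ranking scores for one list.
    y : (l,) nonnegative ground-truth relevance scores.
    """
    m = s.max()
    log_softmax = s - m - np.log(np.exp(s - m).sum())  # numerically stable
    return -(y * log_softmax).sum()
```

With a single relevant item, the loss reduces to ordinary cross-entropy for picking that item out of the list; with graded relevance, the y_i act as per-item weights, and the loss decreases as the scores of relevant items rise relative to the rest.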

5.1. RESULTS

The main results are presented in Table 1. We report metrics that are commonly used in the literature (e.g., TREC-COVID uses NDCG@20) and evaluate rank assignments given by the descending ordering of the ranking scores. Since TREC-COVID and Robust04 are annotated with 3-level relevancy, the scores are binarized for mean average precision (MAP) and MRR as follows: for TREC-COVID, 0 (not relevant) and 1 (partially relevant) are negative, and 2 (fully relevant) is positive; for Robust04, 0 (not relevant) is negative, and 1 (relevant) and 2 (highly relevant) are positive.

Across all three datasets, ListDA has the best performance, and the fact that it uses the same resources as QGen PL demonstrates the benefits of invariant representations. Furthermore, the favorable comparison of ListDA to ItemDA supports the discussion in Section 3: for domain adaptation for listwise ranking, the invariant representations the model learns should be of each list as a whole, and not of the items individually.

Quality of QGen. An explanation for why QGen PL underperforms ListDA despite sharing resources is that the negative sampling of irrelevant q-d pairs could lead to false negatives in the training data. Sun et al. (2021) observed that queries synthesized by QGen lack specificity and could be relevant to many documents. QGen PL trains on the QGen pseudolabels and the randomly sampled irrelevant documents by treating them as ground truth; ListDA makes no such assumption and is thereby less likely to be affected by false negatives, or by false positives arising when synthesized queries turn out to be irrelevant to the input passages (see Table 9 for samples). While out of scope for this work, improving query generation should boost the performance of both QGen PL and ListDA.

5.2. ANALYSIS OF LISTDA

Size of Target Data. Unsupervised domain adaptation requires sufficient unlabeled data, but not all domains have the same amount: BioASQ has 14 million documents (also the total number of QGen queries, as we synthesize one per document), but Robust04 has only 528,155 and TREC-COVID 171,332. In Fig. 2, we plot the performance of ListDA under varying numbers of target QGen queries (also the number of target training lists). Surprisingly, on Robust04 and TREC-COVID, using just ~100 target QGen queries (0.03% and 0.06% of all, respectively) is sufficient for ListDA to achieve full performance! Although the number of queries is small, since 1,000 documents are retrieved per query, the total number of distinct target documents is still substantial: up to 100,000, or 29.5% and 60.7% of the respective corpora. Performance begins to drop when reduced to ~10 queries, capping the number of documents at 10,000 (2.7% and 5.8%, respectively). The same data efficiency, however, is not observed on BioASQ, likely due to the hardness of the dataset from, e.g., the extensive use of specialized biomedical terms (Tables 7 to 9).

Sensitivity to Hyperparameters. ListDA introduces two main hyperparameters for the discriminator f_ad: the learning rate η_ad and the strength of invariant feature learning λ. We plot in Fig. 3 the sensitivity to their settings by fixing one and varying the other. A balanced choice of λ is needed to elicit the best performance from ListDA, but the same choice largely works well across datasets. We set η_ad to multiples of the reranker learning rate η_rank, and the results show that each dataset prefers a different setting of η_ad, probably due to their distinct domain characteristics.

6. RELATED WORK

Learning to Rank and Text Ranking. Traditional learning to rank concerns tabular datasets with numerical features, for which a wide array of models has been developed (Liu, 2009), ranging from SVMs (Joachims, 2006) and gradient boosted decision trees (Burges, 2010) to neural rankers (Burges et al., 2005; Pang et al., 2020; Qin et al., 2021). Another research direction is the design of ranking losses (surrogate to ranking metrics), which are categorized into pointwise, pairwise, and listwise approaches (Cao et al., 2007; Bruch et al., 2020; Zhu & Klabjan, 2020; Jagerman et al., 2022a). Recent advances in large neural language models have spurred interest in applying them to text ranking tasks (Lin et al., 2022), leading to cross-attention models (Han et al., 2020; Nogueira & Cho, 2020; Nogueira et al., 2020; Pradeep et al., 2021) and generative models based on query likelihood (dos Santos et al., 2020; Zhuang & Zuccon, 2021; Zhuang et al., 2021; Sachan et al., 2022). A different line of work is neural text retrieval models, which emphasize efficiency, including dual-encoder (Karpukhin et al., 2020; Zhan et al., 2021), late-interaction (Khattab & Zaharia, 2020; Hui et al., 2022), and models based on transformer memory (Tay et al., 2022).

Domain Adaptation. Following Ben-David et al. (2007) and Blitzer et al. (2008), a family of domain adaptation methods is based on learning (adversarial) domain-invariant feature representations (Long et al., 2015; Ganin et al., 2016; Courty et al., 2017). These methods are applied in fields including NLP, on tasks ranging from cross-domain sentiment analysis and question-answering (Li et al., 2017; Vernikos et al., 2020) to unsupervised cross-lingual learning and machine translation (Xian et al., 2022; Lample et al., 2018). Our method also belongs to this family, but to the best of our knowledge, no prior work considers learning domain-invariant representations of lists/sets. Recently, Zhao et al.
(2019) point out that for classification, Theorem 1 admits a lower bound under perfect feature alignment and source accuracy when there are distribution shifts in class priors between the source and target domains. Their result does not apply to ranking problems, and we leave investigation of the effects of shifts in the marginal distributions of relevance scores to future work.

Domain Adaptation for Information Retrieval. Work on this subject is categorized into supervised and unsupervised domain adaptation. The former assumes access to labeled source data and a small amount of (few-shot) labeled target data (Sun et al., 2021). The present work focuses on the latter, assuming access only to unannotated target documents. Cohen et al. (2018) apply invariant representation learning to unsupervised domain adaptation for text ranking, followed by Tran et al. (2019) for enterprise email search and Xin et al. (2022) for dense retrieval. However, their approaches learn invariant representations of items, not lists. Another family of approaches is based on query generation (Ma et al., 2021; Wang et al., 2022), initially proposed for dense retrieval.

7. CONCLUSION

We analyze domain adaptation for ranking theoretically and establish a generalization bound under listwise ranking metrics. Our bound leads to an adaptation method based on learning listwise domain-invariant representations, called ListDA, which we empirically demonstrate to be effective for unsupervised domain adaptation on a variety of ranking tasks. The novelty of our results is that they are tailored to listwise ranking, where the data and the metrics both follow a list structure. A key message from our theory and experiments is that when applying invariant representation learning to domain adaptation for ranking problems, the representations should be learned at the list level rather than at the item level. We believe our theoretical and empirical contributions provide a foundation for future studies on domain adaptation for ranking.

A OMITTED PROOFS

Before proving the generalization bounds for binary classification (Theorem 1) and ranking (Theorem 2), recall the following properties of Lipschitz functions.

Fact 6 (Properties of Lipschitz functions). 1. If $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then it is (Euclidean) $L$-Lipschitz if and only if $\|\nabla f\|_2 \le L$. 2. If $f : \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz and $g : \mathcal{X} \to \mathbb{R}$ is $M$-Lipschitz, then $af + bg$ is $(|a|L + |b|M)$-Lipschitz, and $\max(f, g)$ is $\max(L, M)$-Lipschitz. 3. If $f : \mathcal{X} \to \mathcal{Y}$ is $L$-Lipschitz and $g : \mathcal{Y} \to \mathcal{Z}$ is $M$-Lipschitz, then $g \circ f$ is $LM$-Lipschitz.

Proof. For the first statement, suppose the gradient norm is bounded by $L$. By the mean value theorem, there exists $t \in [0, 1]$ such that $f(y) - f(x) = \nabla f(z)^\top (y - x)$ with $z := (1-t)x + ty$, so by Cauchy-Schwarz, $|f(y) - f(x)| \le \|\nabla f(z)\|_2 \|y - x\|_2 \le L \|y - x\|_2$. Conversely, suppose $f$ is $L$-Lipschitz. By differentiability, $\nabla f(x)^\top z = f(x + z) - f(x) + o(\|z\|_2)$. Setting $z := t \nabla f(x)$, we have
$$t \|\nabla f(x)\|_2^2 = f(x + t\nabla f(x)) - f(x) + o(t \|\nabla f(x)\|_2) \le Lt \|\nabla f(x)\|_2 + o(t \|\nabla f(x)\|_2),$$
and the result follows by dividing both sides by $t \|\nabla f(x)\|_2$ and taking $t \to 0$.

For the second, $|af(x) + bg(x) - (af(y) + bg(y))| \le |a| |f(x) - f(y)| + |b| |g(x) - g(y)| \le (|a|L + |b|M) d_{\mathcal{X}}(x, y)$. Next, assume w.l.o.g. that $\max(f(x), g(x)) - \max(f(y), g(y)) \ge 0$. Then
$$|\max(f(x), g(x)) - \max(f(y), g(y))| = \begin{cases} f(x) - \max(f(y), g(y)) \le f(x) - f(y) \le L\, d_{\mathcal{X}}(x, y) & \text{if } \max(f(x), g(x)) = f(x), \\ g(x) - \max(f(y), g(y)) \le g(x) - g(y) \le M\, d_{\mathcal{X}}(x, y) & \text{otherwise}, \end{cases}$$
which is at most $\max(L, M)\, d_{\mathcal{X}}(x, y)$.

For the third, $d_{\mathcal{Z}}(g \circ f(x), g \circ f(y)) \le M\, d_{\mathcal{Y}}(f(x), f(y)) \le LM\, d_{\mathcal{X}}(x, y)$.

We first prove Theorem 1, as it shares the same organization of the arguments with Theorem 2.

Proof of Theorem 1. Define the random variable $\eta := \mathbb{1}(Y = 1)$; then $\mathcal{E}(f) = \mathbb{E}_{(X,Y)\sim\mu}[\eta - (2\eta - 1) f(X)]$. Note that
$$\mathcal{E}(f) - \mathcal{E}(f') = \mathbb{E}_{(X,Y)\sim\mu}[\eta - (2\eta - 1) f(X)] - \mathbb{E}_{(X,Y)\sim\mu}[\eta - (2\eta - 1) f'(X)] = \mathbb{E}_{(X,Y)\sim\mu}[(2\eta - 1)(f'(X) - f(X))] \le \mathbb{E}_{X\sim\mu^X}[|f(X) - f'(X)|]$$
because $2\eta - 1 = \pm 1$.
On the other hand,
$$\mathbb{E}_{X\sim\mu^X}[|f(X) - f'(X)|] = \mathbb{E}_{(X,Y)\sim\mu}[|(2\eta - 1)(f(X) - f'(X)) - \eta + \eta|] \le \mathbb{E}_{(X,Y)\sim\mu}[|(2\eta - 1) f(X) - \eta|] + \mathbb{E}_{(X,Y)\sim\mu}[|{-(2\eta - 1) f'(X)} + \eta|] = \mathcal{E}(f) + \mathcal{E}(f').$$
Then by Fact 6, the fact that taking the absolute value is 1-Lipschitz, and Definition 3, for all $f, f' \in \mathcal{F}$,
$$\begin{aligned} \mathcal{E}_T(f) &= \mathcal{E}_S(f) + (\mathcal{E}_T(f) - \mathcal{E}_T(f')) - (\mathcal{E}_S(f) + \mathcal{E}_S(f')) + (\mathcal{E}_S(f') + \mathcal{E}_T(f')) \\ &\le \mathcal{E}_S(f) + \mathbb{E}_{X\sim\mu_T^X}[|f(X) - f'(X)|] - \mathbb{E}_{X\sim\mu_S^X}[|f(X) - f'(X)|] + (\mathcal{E}_S(f') + \mathcal{E}_T(f')) \\ &\le \mathcal{E}_S(f) + \sup_{q \in \mathrm{Lip}(2L)} \left( \mathbb{E}_{X\sim\mu_T^X}[q(X)] - \mathbb{E}_{X\sim\mu_S^X}[q(X)] \right) + (\mathcal{E}_S(f') + \mathcal{E}_T(f')) \\ &\le \mathcal{E}_S(f) + 2L \cdot W_1(\mu_S^X, \mu_T^X) + (\mathcal{E}_S(f') + \mathcal{E}_T(f')), \end{aligned}$$
and the result follows by taking the min over $f'$.

Next, we prove Theorem 2. The main idea behind the proof is that under our setup and assumptions, we can write $\mathcal{E}_S$ and $\mathcal{E}_T$ as expectations of Lipschitz functions of $Z \sim \mu_S^Z$ and $Z \sim \mu_T^Z$, respectively, so that by Definition 3 their difference can be upper bounded by $W_1(\mu_S^Z, \mu_T^Z)$. While omitted, Theorem 2 can be extended to the cutoff version of the ranking metric $u$ with a simple modification of the proof. Also, a finite-sample generalization bound could be obtained using, e.g., Rademacher complexity, assuming Lipschitzness of the end-to-end scorer (Blitzer et al., 2008; Shalev-Shwartz & Ben-David, 2014).

Proof of Theorem 2. Fix $g \in \mathcal{G}$, which has an $L_g$-Lipschitz inverse $g^{-1}$ on $\mathrm{supp}(\mu_S^X)$ by Assumption 4. For any given $h : \mathcal{Z} \to \mathbb{R}^\ell$, define $\ell_{h,g} : \mathcal{Z} \to \mathbb{R}_{\ge 0}$ by
$$\ell_{h,g}(z) := \max_{r \in \mathfrak{S}_\ell} u(r, y_S \circ g^{-1}(z)) - \mathbb{E}_{R \sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z))] = \max_{r \in \mathfrak{S}_\ell} u(r, y_S \circ g^{-1}(z)) - \sum_{r \in \mathfrak{S}_\ell} u(r, y_S \circ g^{-1}(z)) \prod_{i=1}^{\ell} \frac{\exp(h(z)_{I(r)_i})}{\sum_{j \ge i} \exp(h(z)_{I(r)_j})},$$
and note that $\mathcal{E}_S(h \circ g) = \mathbb{E}_{X\sim\mu_S^X}[\ell_{h,g}(g(X))] = \mathbb{E}_{Z\sim\mu_S^Z}[\ell_{h,g}(Z)]$. An analogous identity holds for $\mathcal{E}_T$. We show that $\ell_{h,g}$ as written above is a Lipschitz function of $z$ if $h$ is Lipschitz.
For the first term, because $u$ is $L_u$-Lipschitz in $y_S \circ g^{-1}(z)$ and $y_S \circ g^{-1}(z)$ is $L_y L_g$-Lipschitz in $z$, $u$ is $L_u L_y L_g$-Lipschitz in $z$, and so is $z \mapsto \max_{r \in \mathfrak{S}_\ell} u(r, y_S \circ g^{-1}(z))$ by Fact 6.

Now we bound the second term. We show that it is Lipschitz in both $y_S \circ g^{-1}(z) =: y$ and $h(z) =: s$ under the Euclidean distance. By Jensen's inequality,
$$\|\nabla_y\, \mathbb{E}_{R\sim p_{\exp(s)}}[u(R, y)]\|_2 = \|\mathbb{E}_{R\sim p_{\exp(s)}}[\nabla_y u(R, y)]\|_2 \le \mathbb{E}_{R\sim p_{\exp(s)}}[\|\nabla_y u(R, y)\|_2] \le L_u.$$
Next,
$$\|\nabla_s\, \mathbb{E}_{R\sim p_{\exp(s)}}[u(R, y)]\|_2 = \sqrt{\sum_{k=1}^{\ell} \left( \frac{\partial}{\partial s_k} \sum_{r \in \mathfrak{S}_\ell} u(r, y) \prod_{i=1}^{\ell} \frac{\exp(s_{I(r)_i})}{\sum_{j \ge i} \exp(s_{I(r)_j})} \right)^2} \le B\sqrt{\ell},$$
where the inequality follows from the product rule, the softmax derivative identity $\frac{\mathrm{d}}{\mathrm{d}x_i}\mathrm{softmax}(x)_j = \mathrm{softmax}(x)_i(\mathbb{1}(i = j) - \mathrm{softmax}(x)_j)$, and the bound $u \le B$, which together yield for each coordinate $k$
$$\left| \frac{\partial}{\partial s_k} \sum_{r \in \mathfrak{S}_\ell} u(r, y) \prod_{i=1}^{\ell} \frac{\exp(s_{I(r)_i})}{\sum_{j \ge i} \exp(s_{I(r)_j})} \right| \le B \sum_{r \in \mathfrak{S}_\ell} \prod_{i=1}^{\ell} \frac{\exp(s_{I(r)_i})}{\sum_{j \ge i} \exp(s_{I(r)_j})} = B,$$
since the Plackett-Luce probabilities sum to one over $r \in \mathfrak{S}_\ell$. Suppose $h \in \mathrm{Lip}(L_h)$. Then $z \mapsto \mathbb{E}_{R\sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z))]$ is Lipschitz because
$$\begin{aligned} &\left| \mathbb{E}_{R\sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z))] - \mathbb{E}_{R\sim p_{\exp(h(z'))}}[u(R, y_S \circ g^{-1}(z'))] \right| \\ &\quad\le \left| \mathbb{E}_{R\sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z))] - \mathbb{E}_{R\sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z'))] \right| + \left| \mathbb{E}_{R\sim p_{\exp(h(z))}}[u(R, y_S \circ g^{-1}(z'))] - \mathbb{E}_{R\sim p_{\exp(h(z'))}}[u(R, y_S \circ g^{-1}(z'))] \right| \\ &\quad\le L_u \|y_S \circ g^{-1}(z) - y_S \circ g^{-1}(z')\|_2 + B\sqrt{\ell}\, \|h(z) - h(z')\|_2 \le (L_u L_y L_g + B L_h \sqrt{\ell})\, d_{\mathcal{Z}}(z, z'). \end{aligned}$$
Putting everything together, $\ell_{h,g}$ is $(2 L_u L_y L_g + B L_h \sqrt{\ell})$-Lipschitz in $z$ for any $h \in \mathrm{Lip}(L_h)$.
Then by Fact 6 and Definition 3, for all $g \in \mathcal{G}$ and $h, h' \in \mathcal{H}$,
$$\begin{aligned} \mathcal{E}_T(h \circ g) &= \mathcal{E}_S(h \circ g) + (\mathcal{E}_T(h \circ g) - \mathcal{E}_T(h' \circ g)) - (\mathcal{E}_S(h \circ g) + \mathcal{E}_S(h' \circ g)) + (\mathcal{E}_S(h' \circ g) + \mathcal{E}_T(h' \circ g)) \\ &\le \mathcal{E}_S(h \circ g) + (\mathcal{E}_T(h \circ g) - \mathcal{E}_T(h' \circ g)) - (\mathcal{E}_S(h \circ g) - \mathcal{E}_S(h' \circ g)) + (\mathcal{E}_S(h' \circ g) + \mathcal{E}_T(h' \circ g)) \\ &= \mathcal{E}_S(h \circ g) + \mathbb{E}_{Z\sim\mu_T^Z}[\ell_{h,g}(Z) - \ell_{h',g}(Z)] - \mathbb{E}_{Z\sim\mu_S^Z}[\ell_{h,g}(Z) - \ell_{h',g}(Z)] + (\mathcal{E}_S(h' \circ g) + \mathcal{E}_T(h' \circ g)) \\ &\le \mathcal{E}_S(h \circ g) + \sup_{q \in \mathrm{Lip}(2(2 L_u L_y L_g + B L_h \sqrt{\ell}))} \left( \mathbb{E}_{Z\sim\mu_T^Z}[q(Z)] - \mathbb{E}_{Z\sim\mu_S^Z}[q(Z)] \right) + (\mathcal{E}_S(h' \circ g) + \mathcal{E}_T(h' \circ g)) \\ &\le \mathcal{E}_S(h \circ g) + 2(2 L_u L_y L_g + B L_h \sqrt{\ell}) \cdot W_1(\mu_S^Z, \mu_T^Z) + (\mathcal{E}_S(h' \circ g) + \mathcal{E}_T(h' \circ g)), \end{aligned}$$
and the result follows by taking the min over $h'$. Finally, we verify the Lipschitz conditions for RR and NDCG.

Proof of Corollary 3. It suffices to verify that $y \mapsto \mathrm{RR}(r, y)$ is 1-Lipschitz, which follows from the facts that $\mathrm{RR} \le 1$ and $\|y - y'\|_2 \ge 1$ for all $y, y' \in \{0, 1\}^\ell$ with $y \ne y'$.

Proof of Corollary 4. It suffices to verify that
$$y \mapsto \mathrm{NDCG}(r, y) := \frac{\mathrm{DCG}(r, y)}{\mathrm{IDCG}(y)} = \left( \sum_{i=1}^{\ell} \frac{y_i}{\log(r_i^* + 1)} \right)^{-1} \sum_{i=1}^{\ell} \frac{y_i}{\log(r_i + 1)}$$
is Lipschitz. Note that $\mathrm{IDCG}(y) = \max_r \mathrm{DCG}(r, y)$, a max of continuous functions, is piecewise continuous in $y$, where each piece is defined by an $r \in \mathfrak{S}_\ell$: $\{y : r = \arg\max_{r'} \mathrm{DCG}(r', y)\}$. Let $r \in \mathfrak{S}_\ell$ and $y, y' \in \mathbb{R}^\ell$ be such that $\arg\max_{r'} \mathrm{DCG}(r', y) = \arg\max_{r'} \mathrm{DCG}(r', y') =: r^*$, i.e., they lie on the same piece of IDCG. Then
$$\left| \frac{\partial}{\partial y_k} \mathrm{NDCG}(r, y) \right| = \left| \mathrm{IDCG}(y)^{-1} \cdot \log(r_k + 1)^{-1} - \mathrm{DCG}(r, y) \cdot \log(r_k^* + 1)^{-1} \cdot \mathrm{IDCG}(y)^{-2} \right| \le \mathrm{IDCG}(y)^{-1} \log(2)^{-1} + \mathrm{DCG}(r, y) \log(2)^{-1}\, \mathrm{IDCG}(y)^{-2} \le U_{\min}^{-1} + U_{\max} \log(\ell + 1)^2,$$
so NDCG is $\sqrt{\ell}\,(U_{\min}^{-1} + U_{\max} \log(\ell + 1)^2)$-Lipschitz by combining the above with the fact that IDCG is continuous.

Proof of Proposition 5.
First,
$$\begin{aligned} W_1(\mu_S^Z, \mu_T^Z) &= \inf_{\gamma \in \Gamma(\mu_S^Z, \mu_T^Z)} \int_{\mathcal{Z}\times\mathcal{Z}} d(z, z')\, \mathrm{d}\gamma(z, z') \le B \inf_{\gamma \in \Gamma(\mu_S^Z, \mu_T^Z)} \int_{\mathcal{Z}\times\mathcal{Z}} \mathbb{1}(z \ne z')\, \mathrm{d}\gamma(z, z') \\ &= B \left( 1 - \sup_{\gamma \in \Gamma(\mu_S^Z, \mu_T^Z)} \int_{\mathcal{Z}\times\mathcal{Z}} \mathbb{1}(z = z')\, \mathrm{d}\gamma(z, z') \right) = B \left( 1 - \int_{\mathcal{Z}} \min\left( \mu_S^Z(z), \mu_T^Z(z) \right) \mathrm{d}z \right) \\ &= B \int_{\mathcal{Z}} \max\left( 0, \mu_T^Z(z) - \mu_S^Z(z) \right) \mathrm{d}z = \frac{B}{2} \int_{\mathcal{Z}} \left| \mu_T^Z(z) - \mu_S^Z(z) \right| \mathrm{d}z, \end{aligned}$$
because $\int_{\mathcal{Z}} (\mu_T^Z(z) - \mu_S^Z(z))\, \mathrm{d}z = 0$. Note that $\frac{1}{2}\int_{\mathcal{Z}} |\mu_T^Z(z) - \mu_S^Z(z)|\, \mathrm{d}z$ is the total variation distance between $\mu_S^Z$ and $\mu_T^Z$. On the other hand, define $\hat{Y}(z) := \mathbb{1}(f_{\mathrm{ad}}(z) \ge 0)$. Then the balanced total error rate of $\hat{Y}$ in predicting the domain identities is
$$\mathrm{Err}(\hat{Y}) := \int_{\mathcal{Z}} \hat{Y}(z)\, \mu_S^Z(z) + (1 - \hat{Y}(z))\, \mu_T^Z(z)\, \mathrm{d}z = 1 + \int_{\mathcal{Z}} \left( \hat{Y}(z) - \frac{1}{2} \right) \left( \mu_S^Z(z) - \mu_T^Z(z) \right) \mathrm{d}z.$$
This quantity is minimized by $\hat{Y}^*(z) = \mathbb{1}(\mu_T^Z(z) \ge \mu_S^Z(z))$, whereby
$$\mathrm{Err}(\hat{Y}^*) = 1 - \frac{1}{2} \int_{\mathcal{Z}} \left| \mu_S^Z(z) - \mu_T^Z(z) \right| \mathrm{d}z \le 1 - \frac{1}{B} W_1(\mu_S^Z, \mu_T^Z).$$
The result follows from an algebraic rearrangement of the terms.
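To make the metrics concrete, the listwise utilities analyzed above can be computed as follows (a minimal sketch with our own function names; `ranks[i]` is the 1-indexed position assigned to item `i`, and we use the standard log2 discount with the identity gain, matching the setting in our experiments):

```python
import math

def reciprocal_rank(ranks, labels):
    """RR: inverse of the best (smallest) rank among relevant items."""
    return 1.0 / min(r for r, y in zip(ranks, labels) if y == 1)

def dcg(ranks, gains):
    """DCG with identity gain and log2 rank discount."""
    return sum(g / math.log2(r + 1) for g, r in zip(gains, ranks))

def ndcg(ranks, gains):
    """NDCG: DCG normalized by the ideal DCG (largest gains at top ranks)."""
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg(ranks, gains) / idcg
```

Both utilities are bounded by 1, consistent with the boundedness assumption $u \le B$ used in the proofs.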

B ADDITIONAL EXPERIMENTS ON PASSAGE RERANKING AND DETAILS

In this section, we provide additional experimental results for unsupervised domain adaptation on the passage reranking task considered in Section 5, along with case studies on ListDA vs. zero-shot and QGen PL (Tables 8 and 9), hyperparameter settings (Appendix B.1), and details on the construction of training example lists (Appendix B.2).

Pairwise Logistic Ranking Loss. We experiment with the pairwise logistic ranking loss in place of the listwise softmax cross-entropy loss (Eq. (1)) used in Section 5, defined as
$$\ell_{\mathrm{rank}}(s, y) = -\sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \mathbb{1}(y_i > y_j) \log \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}.$$
The results on the Robust04 dataset are provided in Table 2a. Since the pairwise logistic loss does not perform better than softmax cross-entropy (cf. Table 1; also see (Jagerman et al., 2022b)), we do not pursue further experiments with this ranking loss. As an implementation remark, in this set of experiments we do not perform pairwise comparisons during inference to obtain the predicted rank assignments, due to their high time complexity (this cost is avoided during training by truncating the training lists; see Appendix B.2). However, our theory is applicable to any (black-box) model that produces list-level representations, and our ListDA method is more generally applicable to any model whose list-level representations are stackings of feature vectors, i.e., $z = (u_1, u_2, \cdots, u_L)$ with $u_i \in \mathbb{R}^k$ (here $L = \ell$, but $L$ can be arbitrary). For instance, while not pursued in this work, ListDA could be instantiated on DuoT5 (Pradeep et al., 2021), a SOTA pairwise ranking model, by treating the stackings of pairwise q-d_i-d_j embeddings as the list feature representations (thereby $L = \ell(\ell-1)/2$).

ListDA + QGen PL Method and Signal-1M Dataset. We experiment with supplementing ListDA with QGen pseudolabels by (uniformly) combining the training objectives of the ListDA and QGen PL methods (ListDA + QGen PL).
The results on the three datasets considered in Section 5 are included in Table 2b. This method is also applied to the Signal-1M (RT) dataset (Suarez et al., 2018), with results in Table 3. We note that reranking with neural rerankers transferred from the MS MARCO source domain does not give better performance than BM25 on Signal-1M, which is also observed in prior work (Thakur et al., 2021; Liang et al., 2022). This does not mean that neural rerankers are worse than BM25, but rather that MS MARCO is not a good choice of source domain when Signal-1M is the target, because of the arguably large domain shift between tweet retrieval and MS MARCO web search; qualitatively, Table 7 shows that the text styles and task semantics of the two domains are very different. Hence, the following discussion of the Signal-1M results focuses on the reranking models. On Signal-1M, QGen PL improves upon the zero-shot baseline but ListDA does not, likely because the large domain shift between MS MARCO and Signal-1M prevents ListDA from finding the correct source-target feature alignment without supervision. With QGen pseudolabels, improved ListDA performance is observed with + QGen PL, which could have benefited from the QGen q-d pairs acting as anchor points for ListDA to find the correct correspondence between source and target. Overall, ListDA + QGen PL is the only method that consistently improves upon the zero-shot baseline on all four datasets including Signal-1M, although it underperforms ListDA on the other three. Further improvements to this method may be possible with better strategies for balancing the constituent training objectives of ListDA and QGen PL.

Case Studies. In Tables 8 and 9, we include examples where the ranks (top-1) predicted by models trained with ListDA achieve higher utilities vs. the zero-shot and QGen PL results, respectively, to provide a qualitative understanding of the benefits and advantages of ListDA.
In zero-shot learning, the reranker is trained on MS MARCO general-domain data only and directly evaluated on the target domains. When the target domain differs from MS MARCO stylistically or consists of passages containing domain-specific words, as in the TREC-COVID and BioASQ datasets for biomedical retrieval, the zero-shot model, which is barely exposed to the specialized language usage, may resort to keyword matching. Examples of such cases are presented in Table 8, where the top passages returned by the zero-shot model contain keywords from the query but are irrelevant. In QGen PL, the reranker is trained on the pseudolabels generated during the query synthesis procedure, treating the passages from which the queries are generated as relevant. However, as remarked in Section 5, because QGen is deployed in a zero-shot manner, the pseudolabels are not guaranteed to be valid, meaning that they can be false positives. This is observed in the cases presented in Tables 7 and 9. One specific scenario where false positives hurt transfer performance is when the synthesized queries (of false-positive documents) coincide with queries from the evaluation set, which is indeed the case on TREC-COVID and BioASQ in Table 9. Since ListDA does not use the pseudolabels in its training objective, it does not suffer the same pitfalls as QGen PL.

B.1 HYPERPARAMETER SETTINGS

We discuss the model hyperparameter settings for our passage reranking experiments in this section. For BM25, we use the Anserini implementation (Yang et al., 2017), setting k1 = 0.82 and b = 0.68 on the MS MARCO source domain, and k1 = 0.9 and b = 0.4 on all target domains without tuning. As in (Thakur et al., 2021), titles are indexed as a separate field with equal weight to the document body, when available. For the T5 reranker, the model is fine-tuned from the T5 v1.1 Base checkpoint on a Dragonfish TPU with 8x8 topology for 100,000 steps with a batch size of 32 (each example is a list containing 31 items) per domain. We tune the learning rate η_rank ∈ {5e-5, 1e-4, 2e-4} and select the value that gives the best zero-shot performance to use on all models for each dataset (see Fig. 4 for zero-shot sweep results). We apply a learning rate schedule on η_rank that decays (exponentially) by a factor of 0.7 every 5,000 steps. Each concatenated query-document text input is truncated to 512 tokens. For the domain discriminators, there are two hyperparameters: the strength of invariant feature learning λ and the discriminator learning rate η_ad. We tune both by sweeping λ ∈ {0.01, 0.02} and η_ad ∈ {10, 20, 40} × η_rank, i.e., multiples of the reranker learning rate (see Fig. 3 in Section 5 for ListDA sweep results). The tuned hyperparameter settings for η_rank, η_ad, and λ used in our experiments are included in Table 4. As a remark on running time, ListDA and ItemDA take roughly the same time to train as QGen PL because the overhead of the domain discriminators is small. Compared to zero-shot, they double the training time due to data loading: the adaptation methods are trained on target domain data in addition to source domain data.

B.2 TRAINING EXAMPLE LIST CONSTRUCTION

Recall from the description in Section 5 that in listwise ranking, the inputs are defined over lists and the invariant representations for domain adaptation are learned at the list level. In other words, the ranking loss and the adversarial loss need to be computed on ranking scores and feature representations that the model outputs on lists of documents (containing both relevant and irrelevant ones) for each query. Under our reranking setup, each list would be the top-r documents retrieved by BM25 for a query, and we set r = 1000 in our experiments. However, there are two complications. The first is that, due to memory constraints, it is not feasible to feed all 1000 documents from each list through the T5 encoder simultaneously during training. The second is that the source domain MS MARCO dataset only contains annotations of one relevant document per query, meaning that out of the 1000 documents retrieved by BM25 for each query, we would only know that one of them is relevant; the ground-truth relevance scores for the remaining 999 documents are unknown.

Example List Construction with Negative Sampling. To address both issues, we truncate the list to length ℓ = 31 during training and perform random negative sampling from the BM25 results to gather irrelevant documents. On the MS MARCO source domain, given a query q and the top 1000 documents retrieved by BM25, d_1, ..., d_1000, we construct the example list from the one document d* labeled as relevant in the dataset along with 30 randomly sampled documents d_{N_1}, ..., d_{N_30} treated as irrelevant, and get x = ([q, d*], [q, d_{N_1}], ..., [q, d_{N_30}]) and y = (1, 0, ..., 0). For the target domains, we perform the same procedure.
Given a QGen-synthesized query and the top 1000 documents retrieved by BM25, the example list consists of the pseudolabeled document d (i.e., the document from which the query was synthesized) and 30 randomly sampled irrelevant documents, so that x = ([q, d], [q, d_{N_1}], ..., [q, d_{N_30}]) and y = (1, 0, ..., 0). Note that the pseudolabels y are used by QGen PL but discarded by ListDA.

Reducing MS MARCO False Negatives. One potential problem that arises with negative sampling is that the constructed lists may contain false negatives (i.e., relevant documents that are incorrectly marked as irrelevant); in fact, false negatives are prevalent in the MS MARCO dataset. Such false negatives are relatively harmless for source domain supervised training, because the true positive document to which they are similar is labeled and trained on, so their effects cancel out; however, they have a larger impact on source-target feature alignment for invariant representation learning. One potential negative impact is that the duplicates in the lists (which have identical feature vectors) will cause ListDA to collapse distinct documents on the target domain to the same feature vector to achieve alignment, an artifact that causes information loss in the target domain feature representations. Another is that the inclusion of duplicates alters the marginal distribution of scores on the source domain (Zhao et al., 2019). To lower the chance of selecting false negatives on MS MARCO, we rerank the BM25-retrieved results using a ranker pre-trained on MS MARCO and sample negatives from documents ranked at position 300 or worse, since the duplicates and relevant documents will be concentrated at the top (Qu et al., 2021). We only apply this method when constructing source domain lists for feature alignment (namely, in ListDA and ItemDA), and it only affects the computation of the adversarial loss.
This sampling method is not applicable to target domains because we do not have reliable pre-trained rankers for them. It is also not used for source domain supervised training (i.e., the computation of ranking loss), as decreased performance was observed in our preliminary experiments with this method, likely due to the exclusion of "hard" negatives.
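The list construction described above can be sketched as follows (a minimal sketch with hypothetical helper and parameter names; the optional rank cutoff mirrors the false-negative reduction applied to source domain alignment lists):

```python
import random

def build_example_list(query, positive_doc, bm25_docs,
                       num_negatives=30, negative_rank_cutoff=None, seed=None):
    """Build one training list: the labeled positive plus sampled negatives.

    bm25_docs: documents in BM25 rank order (best first).
    negative_rank_cutoff: if set (e.g., 300), sample negatives only from
    documents ranked at that position or worse, to reduce false negatives.
    """
    rng = random.Random(seed)
    pool = [d for d in bm25_docs if d != positive_doc]
    if negative_rank_cutoff is not None:
        pool = pool[negative_rank_cutoff:]
    negatives = rng.sample(pool, num_negatives)
    # x pairs each document with the query; y marks only the positive as relevant
    x = [(query, positive_doc)] + [(query, d) for d in negatives]
    y = [1] + [0] * num_negatives
    return x, y
```

On the target domains the same routine applies with the pseudolabeled document in place of the annotated positive, with the resulting y used by QGen PL and discarded by ListDA.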

C EXPERIMENTS ON YAHOO! LETOR DATASET

Our method is also evaluated on the ranking task from the Yahoo! Learning to Rank Challenge v2.0 (Chapelle & Chang, 2011), referred to as the Yahoo! LETOR dataset. This is a web search ranking dataset in numerical format, where each item is represented by a 700-d vector with values in the range [0, 1]. It has two subsets, called "Set 1" and "Set 2", whose data originate from the US and an Asian country, respectively. Among the 700 features, 415 are defined on both sets (shared), and the other 285 are defined on only one of the two sets (disjoint); we hence write each item x := [x_shared, x_disjoint] as a concatenation of shared features x_shared ∈ R^415 and disjoint ones x_disjoint ∈ R^285. We consider unsupervised domain adaptation from Set 1 to Set 2. Our code is implemented with the Hugging Face Transformers library (Wolf et al., 2020).

Models. Our models have the same setup as in the passage reranking experiments of Section 5, except that the T5 reranker is replaced by a three-hidden-layer MLP following (Zhuang et al., 2020), and we treat the list of 256-d outputs of the second layer as the feature representations:
$$g(x)_i = u_i = \mathrm{ReLU}\left( W_3 \begin{bmatrix} \mathrm{ReLU}(W_2\, \mathrm{ReLU}(W_1 x_{i,\mathrm{shared}} + b_1) + b_2) \\ \mathrm{ReLU}(W_2'\, \mathrm{ReLU}(W_1' x_{i,\mathrm{disjoint}} + b_1') + b_2') \end{bmatrix} + b_3 \right), \qquad h(x)_i = s_i = W_4 u_i + b_4,$$
where $W_1 \in \mathbb{R}^{1024\times415}$, $W_1' \in \mathbb{R}^{1024\times285}$, $W_2, W_2' \in \mathbb{R}^{256\times1024}$, $W_3 \in \mathbb{R}^{256\times512}$, and $W_4 \in \mathbb{R}^{1\times256}$, all randomly initialized. Each model in the ensemble of five domain discriminators is a stack of three T5 encoder transformer blocks, with 4 attention heads (num_heads), size-32 key, query, and value projections per attention head (d_kv), and size-1024 intermediate feedforward layers (d_ff).

Results. The results are presented in Table 5. Considering the small dataset size and number of training steps, each method is evaluated with an ensemble of five separately trained models to reduce the variance in the results due to randomness in initialization and training.
Since Yahoo! LETOR is annotated with 5-level relevance, the scores are binarized for the MAP and MRR metrics by mapping 0 (bad) and 1 (fair) to negative, and 2 (good), 3 (excellent), and 4 (perfect) to positive. Because of the availability of labeled data on Set 2, we also include Supervised results, where the model is trained on labeled data from both Set 1 and 2, serving as an upper bound for ListDA. The best unsupervised transfer performance is achieved by ListDA. In particular, the favorable comparison of ListDA to ItemDA again corroborates our discussion in Section 3 that listwise invariant representation learning is more appropriate for listwise ranking, although their gap is smaller than in the passage reranking results (Table 1), which we suspect is because the contextual (query) information defining the list structure is weak in this numerical dataset.

Hyperparameters. The model is trained from scratch on an NVIDIA A6000 GPU for 5,000 steps with a batch size of 32 (each example is a list containing between one and 140 items) per domain. We apply a learning rate schedule on η_rank that decays (exponentially) by a factor of 0.7 every 500 steps. We exhaustively tune the ranker and domain discriminator learning rates and the strength of feature alignment over η_rank ∈ {1e-5, 2e-5, 4e-5, 8e-5, 1e-4, 2e-4, 4e-4, 8e-4, 1e-3}, η_ad ∈ {0.2, 0.4, 0.8, 1, 2, 4, 8, 10} × η_rank, and λ ∈ {0.01, 0.02, 0.04, 0.08, 0.1, 0.2}. The tuned settings are included in Table 6.
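The two-branch scorer above can be sketched in NumPy as follows (a minimal sketch of the forward pass only; the class and attribute names are ours, whether the biases are shared across branches is our assumption, and initialization/training details are omitted):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class TwoBranchMLPRanker:
    """Forward pass of the 3-hidden-layer MLP scorer with shared/disjoint branches."""

    def __init__(self, rng):
        # Shared-feature branch (415-d input) and disjoint-feature branch (285-d input)
        self.W1, self.b1 = 0.01 * rng.standard_normal((1024, 415)), np.zeros(1024)
        self.W1p, self.b1p = 0.01 * rng.standard_normal((1024, 285)), np.zeros(1024)
        self.W2, self.b2 = 0.01 * rng.standard_normal((256, 1024)), np.zeros(256)
        self.W2p, self.b2p = 0.01 * rng.standard_normal((256, 1024)), np.zeros(256)
        self.W3, self.b3 = 0.01 * rng.standard_normal((256, 512)), np.zeros(256)
        self.W4, self.b4 = 0.01 * rng.standard_normal((1, 256)), np.zeros(1)

    def features(self, x):
        """Map a 700-d item vector to its 256-d feature representation u_i."""
        xs, xd = x[:415], x[415:]  # shared / disjoint split
        hs = relu(self.W2 @ relu(self.W1 @ xs + self.b1) + self.b2)
        hd = relu(self.W2p @ relu(self.W1p @ xd + self.b1p) + self.b2p)
        return relu(self.W3 @ np.concatenate([hs, hd]) + self.b3)

    def score(self, x):
        """Ranking score s_i from the final linear layer."""
        return float(self.W4 @ self.features(x) + self.b4)
```

The per-item features returned by `features` are what get stacked into the list-level representation consumed by the domain discriminators.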



In this paper, by listwise ranking, we mean that the data are defined over lists and the ranking metric is listwise. It is not a statement about the ranking model, e.g., whether the predicted rank assignments are obtained in a pointwise, pairwise, or listwise manner (Joachims, 2006; Cao et al., 2007), or whether the interactions between items in the same list are modeled (Pang et al., 2020).

A natural and common choice for the space of lists X is the ℓ-fold Cartesian product of R^k, i.e., R^{ℓ×k}, meaning that each list x = (x_1, ..., x_ℓ) is a stack of items represented by k-dimensional feature vectors. More generally and abstractly, the lists need not follow this structure; e.g., X can also be R^d (not scaling with ℓ), provided there is a mechanism to represent lists by fixed-length vectors.

The terms document, text, and passage are used interchangeably here.

Under our setup, the feature vectors are computed on each item independently, but generally and ideally g would also model the interactions between items in the same list (Pang et al., 2020).



Figure 1: Block diagram of ListDA instantiated on the cross-attention T5 ranker for text ranking.

Figure 3: ListDA performance under different hyperparameter settings for λ and η_ad. The grey horizontal line is zero-shot. On the left, η_ad = 0.004 is fixed and λ varies; on the right, λ = 0.02 is fixed and η_ad varies.

With pairwise logistic ranking loss.

Figure 4: Zero-shot performance under different hyperparameter settings for η rank .

Transfer performance of T5 reranker on top 1000 BM25-retrieved passages.

Transfer performance of T5 reranker on top 1000 BM25-retrieved passages, in addition to Table 1 results. (a) With pairwise logistic ranking loss in place of softmax cross-entropy on Robust04. Source domain is MS MARCO. Gain function in NDCG is the identity map. * Improves upon zero-shot baseline with statistical significance (p ≤ 0.05) under the two-tailed Student's t-test. † Improves upon QGen PL. ‡ Improves upon ItemDA. (b) With the ListDA + QGen PL domain adaptation method.

Transfer performance of T5 reranker on top 1000 BM25-retrieved passages on Signal-1M.

Hyperparameter settings of T5 reranker and domain discriminators.

Transfer performance of 3-layer MLP ranker on Yahoo! LETOR (Set 2). Source domain is Yahoo! LETOR (Set 1). Gain function in NDCG is the identity map. Bold indicates the best unsupervised result. Results are from ensembles of five models. * Improves upon zero-shot baseline with statistical significance (p ≤ 0.05) under the two-tailed Student's t-test. ‡ Improves upon ItemDA. Significance tests are not performed on supervised results.

Hyperparameter settings of 3-layer MLP ranker and domain discriminators.

Samples of test set relevant q-d pairs and QGen synthesized q-d pairs from domains considered in Section 5 passage reranking experiments. Truncated or omitted texts are indicated by "[...]".

Samples of passage reranking results where ListDA achieves higher utilities v.s. zero-shot. Truncated or omitted texts are indicated by "[...]". Little knowledge goes a long way -Cancer of the prostate need not be a killer / Health Check. Earlier this year, 13-year-old [...] died from cancer of the bladder and prostate. His death is a grim reminder that no male should consider himself immune from waterworks trouble. The prostate, a gland about the size and shape of a chestnut, lies deep in the pelvis just below the bladder. Because it surrounds the urethra, it has the potential to block the flow of urine completely. [...] TREC-COVID Q: What are the longer-term complications of those who recover from COVID-19? D: [...] Our previous experience with members of the same corona virus family (SARS and MERS) which have caused two major epidemics in the past albeit of much lower magnitude, has taught us that the harmful effect of such outbreaks are not limited to acute complications alone. Long term cardiopulmonary, glucometabolic and neuropsychiatric complications have been documented following these infections. [...] D: Up to 20-30% of patients hospitalized with coronavirus disease (COVID-19) have evidence of myocardial involvement. Acute cardiac injury in patients hospitalized with COVID-19 is associated with higher morbidity and mortality. There are no data on how acute treatment for COVID-19 may affect convalescent phase or long-term cardiac recovery and function. Myocarditis from other viral pathogens can evolve into overt or subclinical myocardial dysfunction, [...] BioASQ Q: What is the interaction between WAPL and PDS5 proteins? D: Pds5 and Wpl1 act as anti-establishment factors preventing sister-chromatid cohesion until counteracted in S-phase by the cohesin acetyl-transferase Eso1. [...] 
Here, we show that Pds5 is essential for cohesin acetylation by Eso1 and ensures the maintenance of cohesion by promoting a stable cohesin interaction with replicated chromosomes. The latter requires Eso1 only in the presence of Wapl, indicating that cohesin stabilization relies on Eso1 only to neutralize the antiestablishment activity. [...] D: [...] Here, we show that cohesin suppresses compartments but is required for TADs and loops, that CTCF defines their boundaries, and that the cohesin unloading factor WAPL and its PDS5 binding partners control the length of loops. In the absence of WAPL and PDS5 proteins, cohesin forms extended loops, presumably by passing CTCF sites, accumulates in axial chromosomal positions (vermicelli), and condenses chromosomes. [...]

Samples of passage reranking results where ListDA achieves higher utilities v.s. QGen PL. Truncated or omitted texts are indicated by "[...]". : [...] 3. Care of Patients with Tracheostomy 4. Suctioning of Respiratory Tract Secretions III. Modifying Host Risk for Infection A. Precautions for Prevention of Endogenous Pneumonia 1. Prevention of Aspiration 2. Prevention of Gastric Colonization B. Prevention of Postoperative Pneumonia C. Other Prophylactic Procedures for Pneumonia 1. Vaccination of Patients 2. Systemic Antimicrobial Prophylaxis 3. Use of Rotating "Kinetic" Beds Prevention and Control of Legionnaires' Disease [...] QGen: what kind of precautions are used to prevent pneumonia D: [...] LEGIONNAIRE'S DISEASE STRIKES 16 AT REUNION IN COLORADO; 3 DIE. An outbreak of legionnaire's disease at a 50th high school reunion was blamed Thursday for the deaths of three elderly celebrants and the pneumonia-like illness of 13 others. State health officials contacted 250 other people from 21 states who attended the Lamar High School reunion for the classes of 1937 through 1941 but found no new cases, Dr. Ellen Mangione, a Colorado Department of Health epidemiologist, said. [...] TREC-COVID Q: what drugs have been active against SARS-CoV or SARS-CoV-2 in animal studies? D: Different treatments are currently used for clinical management of SARS-CoV-2 infection, but little is known about their efficacy yet. Here we present ongoing results to compare currently available drugs for a variety of diseases to find out if they counteract SARS-CoV-2-induced cytopathic effect in vitro. [...] We will provide results as soon as they become available, [...] QGen: what is the treatment for sars D: [...] the antiviral efficacies of lopinavirritonavir, hydroxychloroquine sulfate, and emtricitabine-tenofovir for SARS-CoV-2 infection were assessed in the ferret infection model. [...] 
all antiviral drugs tested marginally reduced the overall clinical scores of infected ferrets but did not significantly affect in vivo virus titers. Despite the potential discrepancy of drug efficacies between animals and humans, these preclinical ferret data should be highly informative to future therapeutic treatment of COVID-19 patients. BioASQ Q: What is the function of the Spt6 gene in yeast? D: As a means to study surface proteins involved in the yeast to hypha transition, human monoclonal antibody fragments (single-chain variable fragments, scFv) have been generated that bind to antigens expressed on the surface of Candida albicans yeast and/or hyphae. [...] To assess C. albicans SPT6 function, expression of the C. albicans gene was induced in a defined S. cerevisiaespt6 mutant. Partial complementation was seen, confirming that the C. albicans and S. cerevisiae genes are functionally related in these species. QGen: what is the function of spt6 gene in candida albicans D: Spt6 is a highly conserved histone chaperone that interacts directly with both RNA polymerase II and histones to regulate gene expression. [...] Our results demonstrate dramatic changes to transcription and chromatin structure in the mutant, including elevated antisense transcripts at >70% of all genes and general loss of the +1 nucleosome. Furthermore, Spt6 is required for marks associated with active transcription, including trimethylation of histone H3 on lysine 4, previously observed in humans but not Saccharomyces cerevisiae, and lysine 36. [...]

