LEARNING LISTWISE DOMAIN-INVARIANT REPRESENTATIONS FOR RANKING

Abstract

Domain adaptation aims to transfer the knowledge acquired by models trained on (data-rich) source domains to (low-resource) target domains, for which a popular method is invariant representation learning. While such methods have been studied extensively for problems including classification and regression, how they apply to ranking problems, where the data and metrics follow a list structure, is not well understood. Theoretically, we establish a domain adaptation generalization bound for ranking under listwise metrics such as MRR and NDCG, which naturally suggests an adaptation method via learning listwise invariant feature representations. Empirically, we demonstrate the benefits of listwise invariant representations through experiments on unsupervised domain adaptation for real-world ranking tasks, including passage reranking. The main novelty of our results is that they are tailored to listwise ranking: the invariant representations are learned at the list level rather than at the item level.

1. INTRODUCTION

Learning to rank applies machine learning to ranking problems that are at the core of many everyday products and applications, including but not limited to search engines and recommendation systems (Liu, 2009). The availability of ever-increasing amounts of training data has enabled larger and larger models to improve the state of the art on more ranking tasks. A prominent example is text retrieval and ranking, where neural language models with billions of parameters easily outperform traditional ranking models such as BM25 (Nogueira et al., 2020). But the need for abundant data means that large neural models may not benefit tasks with little to no annotated data, where they can actually fare worse than baselines such as gradient boosted decision trees (Qin et al., 2021).

Techniques for extending the benefits of large models to low-resource domains include zero-shot learning and domain adaptation. In the former, instead of directly optimizing for the task of interest with limited data, referred to as the target domain, the model is trained on a data-rich source domain that has a similar data distribution. The latter considers the scenario where (abundant unlabeled) data from the target domain is available, which can be leveraged to estimate the domain shift and improve transferability, e.g., by learning invariant feature representations. This setting and its algorithms have been studied extensively for problems including classification and regression (Blitzer et al., 2008; Ganin et al., 2016; Zhao et al., 2018). For ranking problems, however, existing methods are mostly limited to specific tasks and applications. In fact, due to the inherent list structure of the metrics and data, theoretical explorations of domain adaptation for ranking are only nascent.

To this end, we provide the first analysis of domain adaptation for listwise ranking¹ via domain-invariant representations (Section 3), building on the foundational work by Ben-David et al. (2007) for domain adaptation in the binary classification setting. One of the results of our theory is that, when the domain shift is small in terms of the Wasserstein distance, a ranking model optimized on the source is transferable to the target domain, and its performance under metrics such as MRR and NDCG can be bounded.

Inspired by our theory, we propose an adversarial training method for learning listwise domain-invariant representations, called ListDA (Section 4), which minimizes the distributional shift between the source and target domains in the feature space to improve generalization on the target domain. Unlike traditional classification and regression settings, in ranking each input follows a list structure, containing the items to be ranked. The main technical novelty of ListDA is that the invariant representations it learns are of each list as a whole rather than of the individual items it contains.

Empirically, we evaluate ListDA for unsupervised domain adaptation on two ranking tasks (Section 5 and Appendix C), including passage reranking, a fundamental task in information retrieval (Craswell et al., 2019), where the goal is to rerank a list of candidate documents retrieved by a first-stage retrieval model in response to a search query. We adapt T5 neural rerankers (Raffel et al., 2020) fine-tuned on the general-domain MS MARCO dataset (Bajaj et al., 2018) to two specialized domains: biomedical and news articles. Our results demonstrate the benefits of invariant representations on the transferability of rankers trained with ListDA.
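To build intuition for the Wasserstein distance that appears in the theory, consider a toy illustration (not part of the paper's method): for two one-dimensional empirical distributions with the same number of samples, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples. The function name below is illustrative.

```python
def wasserstein1_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    In one dimension, the optimal transport plan matches sorted values,
    so W1 reduces to the mean absolute difference of order statistics.
    """
    assert len(a) == len(b), "equal sample sizes assumed in this sketch"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# A pure location shift by c moves every quantile by c, so W1 equals |c|.
source = [0.1, 0.5, 0.9, 1.3]
target = [x + 0.2 for x in source]
```

This matches the intuition behind the bound: the farther apart the source and target distributions are in transport cost, the looser the guarantee on target performance.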

2. PRELIMINARIES

Learning to Rank. A ranking problem is defined by a joint distribution over lists² $X \in \mathcal{X}$ of items and nonnegative relevance scores $Y = (Y_1, \dots, Y_\ell) \in \mathbb{R}_{\geq 0}^{\ell}$. We assume that all lists have length $\ell$ and that the ground-truth scores are a function of the lists, $y(X)$, so that a ranking problem is equivalently defined by a marginal distribution $\mu^X$ of lists along with a scoring function $y : \mathcal{X} \to \mathbb{R}_{\geq 0}^{\ell}$. The goal is to train a ranker $f : \mathcal{X} \to \mathcal{S}_\ell$ that maps each list $x$ to rank assignments $r := f(x) \in \mathcal{S}_\ell$, where $r_i$ represents the predicted rank of item $i$ and $\mathcal{S}_\ell$ denotes the set of permutations on $\{1, 2, \dots, \ell\}$, such that $f$ recovers the descending ordering of the relevance scores $y_i$, i.e., $y_i > y_j \iff r_i < r_j$ for all $i \neq j$.

The more common setup is to train a scoring function $h : \mathcal{X} \to \mathbb{R}^{\ell}$ whose output is a list of ranking scores, such that $s_i := h(x)_i$ correlates with $y_i$ in each list and its ordering agrees with that of the ground-truth scores. Rank assignments can be obtained from the ranking scores by taking their descending ordering or via probabilistic models (Section 3). The quality of the predicted ranks is measured by ranking metrics $u : \mathcal{S}_\ell \times \mathbb{R}_{\geq 0}^{\ell} \to \mathbb{R}_{\geq 0}$, functions that take as inputs the rank assignments of a list along with the ground-truth relevance scores and output a utility score. Popular listwise metrics in information retrieval include reciprocal rank and normalized discounted cumulative gain (Voorhees, 1999; Järvelin & Kekäläinen, 2002):

Definition 1 (RR). Suppose the ground-truth relevance scores $y \in \{0, 1\}^{\ell}$ are binary. Then the reciprocal rank of the rank assignments $r \in \mathcal{S}_\ell$ is $\mathrm{RR}(r, y) = \max\left(\{r_i^{-1} : 1 \leq i \leq \ell,\ y_i = 1\} \cup \{0\}\right)$. The expectation of RR over the dataset, $\mathbb{E}[\mathrm{RR}(f(X), y(X))]$, is called the mean reciprocal rank (MRR).

Definition 2 (NDCG). The discounted cumulative gain (DCG) and the normalized DCG (with identity gain functions, w.l.o.g.) of the rank assignments $r \in \mathcal{S}_\ell$ are $\mathrm{DCG}(r, y) = \sum_{i=1}^{\ell} \frac{y_i}{\log_2(r_i + 1)}$ and $\mathrm{NDCG}(r, y) = \frac{\mathrm{DCG}(r, y)}{\mathrm{IDCG}(y)}$, where $\mathrm{IDCG}(y) = \max_{r' \in \mathcal{S}_\ell} \mathrm{DCG}(r', y)$, the ideal DCG, is the maximum DCG value of a list and is attained by the descending ordering of the ground-truth $y_i$'s.

Domain Adaptation. The present work studies the adaptation of a scorer from a source domain $(\mu_S^X, y_S)$ to a target domain $(\mu_T^X, y_T)$. When the domain shift is small, i.e., $\mu_S^X \approx \mu_T^X$ and $y_S \approx y_T$, scorers trained on the source are expected to be transferable to the target without the explicit need for labeled target data. Indeed, the target performance is bounded by the source performance in such cases. As an example, for binary classification, we have the following generalization bound for randomized classifiers (Shen et al., 2018):
if every hypothesis $h$ under consideration is $K$-Lipschitz, then
$$\varepsilon_T(h) \leq \varepsilon_S(h) + 2K \cdot W_1(\mu_S^X, \mu_T^X) + \lambda,$$
where $\varepsilon_S$ and $\varepsilon_T$ denote the expected errors on the source and target domains, $W_1$ is the Wasserstein-1 distance between the marginals, and $\lambda$ is the error of the ideal joint hypothesis minimizing $\varepsilon_S + \varepsilon_T$.
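To make the ranking setup concrete, the following sketch (helper names are illustrative, not from the paper) derives rank assignments from predicted scores by descending sort and evaluates them with RR and NDCG as defined above:

```python
import math

def rank_assignments(scores):
    """Descending ordering of scores: ranks[i] is the rank (1 = best) of item i."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def reciprocal_rank(ranks, y):
    """RR(r, y) = max({1/r_i : y_i = 1} ∪ {0}) for binary relevance y."""
    return max([1.0 / r for r, rel in zip(ranks, y) if rel == 1] + [0.0])

def dcg(ranks, y):
    """DCG with identity gains: sum_i y_i / log2(r_i + 1)."""
    return sum(rel / math.log2(r + 1) for r, rel in zip(ranks, y))

def ndcg(ranks, y):
    # The ideal DCG is attained by the descending ordering of y itself.
    ideal = dcg(rank_assignments(y), y)
    return dcg(ranks, y) / ideal if ideal > 0 else 0.0
```

For example, with predicted scores $(0.2, 0.9, 0.4)$ and binary relevance $(1, 0, 0)$, the relevant item is ranked third, giving RR $= 1/3$ and NDCG $= (1/\log_2 4) / (1/\log_2 2) = 0.5$.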
¹ In this paper, by listwise ranking, we mean that the data are defined over lists and the ranking metric is listwise. It is not a statement about the ranking model, e.g., whether the predicted rank assignments are obtained in a pointwise, pairwise, or listwise manner (Joachims, 2006; Cao et al., 2007), or whether the interactions between items in the same list are modeled (Pang et al., 2020).

² A natural and common choice for the space of lists $\mathcal{X}$ is the $\ell$-times Cartesian product of $\mathbb{R}^k$, i.e., $\mathbb{R}^{\ell \times k}$, meaning that each list $x = (x_1, \dots, x_\ell)$ is a stack of items represented by $k$-dimensional feature vectors. More generally and abstractly, however, the lists need not follow this structure; e.g., $\mathcal{X}$ can also be $\mathbb{R}^d$ (not scaling with $\ell$) provided there is a mechanism to represent lists by fixed-length vectors.
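As a small illustration of the two list representations discussed in the footnote (the names below are illustrative, not from the paper), a list of $\ell$ items with $k$ features each can be kept as an $\ell \times k$ stack, or collapsed into a single fixed-length vector, here by mean pooling as one simple choice:

```python
# A list of ℓ = 3 items, each a k = 4 dimensional feature vector,
# stored as an ℓ × k stack (nested lists stand in for a matrix here).
x_stack = [
    [0.1, 0.0, 0.3, 0.2],
    [0.5, 0.1, 0.0, 0.4],
    [0.2, 0.2, 0.1, 0.0],
]

def mean_pool(stack):
    """Collapse an ℓ × k stack into one k-dimensional vector by mean pooling,
    giving a fixed-length list representation whose size does not scale with ℓ."""
    ell = len(stack)
    return [sum(item[j] for item in stack) / ell for j in range(len(stack[0]))]

x_pooled = mean_pool(x_stack)  # length-k vector, independent of ℓ
```

Mean pooling is only one such mechanism; any encoder producing a fixed-length vector per list (e.g., a learned aggregator) fits the abstract view of $\mathcal{X} = \mathbb{R}^d$.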

