ARE NEURAL RANKERS STILL OUTPERFORMED BY GRADIENT BOOSTED DECISION TREES?

Abstract

Despite the success of neural models on many major machine learning problems, their effectiveness on traditional Learning-to-Rank (LTR) problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets. This gap has unfortunately been overlooked in recent neural LTR papers. We then investigate why existing neural LTR models under-perform and identify several of their weaknesses. Furthermore, we propose a unified framework comprising counter strategies to ameliorate the existing weaknesses of neural models. Our models are the first to perform on par with the best tree-based baseline, while outperforming recently published neural LTR models by a large margin. Our results can also serve as a benchmark to facilitate future improvement of neural LTR models.

1. INTRODUCTION

Neural approaches have been dominating many major machine learning domains, such as computer vision (He et al., 2015), natural language processing (Devlin et al., 2019), and speech recognition (Hannun et al., 2014). However, the effectiveness of neural approaches in traditional Learning-to-Rank (LTR), the long-established inter-disciplinary research area at the intersection of machine learning and information retrieval (Liu, 2009), is not widely acknowledged (Yang et al., 2019), especially on benchmark datasets that have only numerical features. Historically, a series of LTR models was developed by researchers at Microsoft, starting with RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007), both based on neural networks, and culminating in LambdaMART (Wu et al., 2010), which is based on Gradient Boosted Decision Trees (GBDT); Burges (2010) provides an overview of this evolution. There are two publicly available implementations of LambdaMART: one provided by the RankLib library that is part of the Lemur Project (henceforth referred to as λMART RankLib); and the LightGBM implementation provided by Microsoft (Ke et al., 2017) (henceforth referred to as λMART GBM). As we will show in Section 3, λMART GBM substantially outperforms λMART RankLib. There is strong and continuing interest in neural ranking models, with numerous papers published in the last few years alone. Most of these papers treat RankNet and LambdaRank as weak baselines (Pang et al., 2020; Bruch et al., 2019b) and LambdaMART as the "state-of-the-art" (Bruch et al., 2019b; Li et al., 2019; Zhu & Klabjan, 2020; Hu et al., 2019). However, when examining these papers, we note that they either acknowledge their under-performance relative to λMART GBM, or claim state-of-the-art performance by comparing against the weaker λMART RankLib implementation.
The inconsistency of performance evaluation on benchmark datasets in this field has made it difficult to measure progress (Lipton & Steinhardt, 2018). It therefore remains an open question whether neural LTR models are as effective as they claim to be, and how to improve them if that is not the case. In this paper, we first conduct a benchmark to show that λMART GBM outperforms recently published neural models, as well as λMART RankLib, by a large margin. While the neural paradigm remains appealing in a myriad of ways, such as being composable, flexible, and able to benefit from a plethora of new advances (Vaswani et al., 2017; Devlin et al., 2019), research progress in neural ranking models could be hindered by their inferior performance relative to tree models. It thus becomes critical to understand the pitfalls of building neural rankers and to boost their performance on benchmark datasets. Specifically, we investigate why neural LTR approaches under-perform on standard LTR datasets and identify three major weaknesses that are typically ignored by recent work. First, neural models are not as adept at performing effective feature transformations and scaling, which is one major benefit of tree-based methods (Saberian et al., 2019). Since ranking data is typically long-tailed, this can be a serious handicap. Second, standard feed-forward networks are ineffective at generating higher-order features, as noted by recent papers (Wang et al., 2017b; Beutel et al., 2018); more effective network architectures for neural LTR models are needed. Third, recent neural LTR work on benchmark datasets does not employ high-capacity networks, a key success factor of many neural models (Devlin et al., 2019), possibly because the small scale of the training data causes overfitting.
On the other hand, there are several potential benefits of neural approaches over LambdaMART for LTR, such as their flexibility to model listwise data and the existence of many techniques to mitigate data sparsity. To that end, we propose a new framework that ameliorates the weaknesses of existing neural LTR approaches and improves almost all major network components. In the proposed framework, we make several technical contributions: (1) We provide empirical evidence that a simple log1p transformation on the input features is very helpful. (2) We use data augmentation (DA) to make the most out of high-capacity neural models; to our knowledge, this is the first work in the LTR literature to do so. We show that adding simple Gaussian noise helps, but only when the model capacity is appropriately augmented (which probably explains why there is no prior work on such a simple idea). (3) We use self-attention (SA) to model the listwise ranking data as context, and propose to use latent cross (LC) to effectively generate the interaction of each item with its listwise context. We conduct experiments on three widely used public LTR datasets. Our neural models are trained with listwise ranking losses. On all datasets, our framework outperforms recent neural LTR methods by a large margin. When comparing with the strong LambdaMART implementation, λMART GBM, we achieve equally good results, if not better. Our work can also serve as a benchmark for neural ranking models, which we believe can lay fertile ground for future neural LTR research, as rigorous benchmarks on datasets such as ImageNet (Russakovsky et al., 2015) and GLUE (Wang et al., 2018a) do in their respective fields.

2. BACKGROUND

We provide some background on LTR, including its formulation and common metrics. We review LambdaMART and highlight its two popular implementations, whose conflation is a cause of inconsistent evaluations in the recent literature.

2.1. LEARNING TO RANK

LTR methods are supervised techniques whose training data can be represented as a set Ψ = {(x, y) ∈ χ^n × R^n}, where x is a list of n items x_i ∈ χ and y is a list of n relevance labels y_i ∈ R for 1 ≤ i ≤ n. We use χ as the universe of all items. In traditional LTR problems, each x_i corresponds to a query-item pair and is represented as a feature vector in R^k, where k is the number of feature dimensions. With slight abuse of notation, we also use x_i for the feature vector and write x ∈ R^{n×k}. The objective is to learn a function that produces an ordering of the items in x so that the utility of the ordered list is maximized. Most LTR algorithms formulate the problem as learning a ranking function to score and sort the items in a list. As such, the goal of LTR boils down to finding a parameterized ranking function s(·; Θ): χ^n → R^n, where Θ denotes the set of parameters, that minimizes the empirical loss

L(s) = (1 / |Ψ|) Σ_{(x, y) ∈ Ψ} l(y, s(x)),

where l(·) is the loss function on a single list. LTR algorithms differ primarily in how they parameterize s and how they define l. There are many ranking metrics, such as NDCG and MAP, used in LTR problems. A common property of these metrics is that they are rank-dependent and place more emphasis on the top-ranked items. For example, the commonly adopted NDCG metric is defined as

NDCG(π_s, y) = DCG(π_s, y) / DCG(π*, y),

where π_s is the ranked list induced by the ranking function s on x, π* is the ideal list (x sorted by y), and DCG is defined as

DCG(π, y) = Σ_{i=1}^n (2^{y_i} − 1) / log_2(1 + π(i)) = Σ_{i=1}^n G_i / D_i,

where π(i) is the rank of item i, G_i the gain, and D_i the discount. In practice, the truncated version that only considers the top-k ranked items, denoted NDCG@k, is often used.
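The metric definitions above can be made concrete with a short pure-Python sketch (the function names are ours, not from the paper):

```python
import math

def dcg_at_k(ordered_labels, k):
    # DCG over an already-ordered list: sum_i (2^{y_i} - 1) / log2(1 + rank_i).
    return sum((2.0 ** y - 1.0) / math.log2(1.0 + rank)
               for rank, y in enumerate(ordered_labels[:k], start=1))

def ndcg_at_k(scores, labels, k):
    # Rank items by descending score, normalize by the DCG of the ideal ordering pi*.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # convention when all labels are zero
    return dcg_at_k([labels[i] for i in order], k) / ideal
```

A scoring function that reproduces the ideal ordering attains NDCG@k of exactly 1.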

2.2. LAMBDAMART

LTR models have evolved from linear models (Joachims, 2002), to neural networks (Burges et al., 2005), and then to decision trees (Burges, 2010) over the past two decades. LambdaMART, proposed about ten years ago (Wu et al., 2010; Burges, 2010), is still treated as the "state-of-the-art" for LTR problems in recent papers (Bruch et al., 2019b; Zhu & Klabjan, 2020). It is based on Gradient Boosted Decision Trees (GBDT). During each boosting step, the loss is dynamically adjusted based on the ranking metric under consideration. For example, ΔNDCG(i, j) is defined as the absolute difference between the NDCG values when two documents i and j swap positions in the list ranked by the ranking function obtained so far:

ΔNDCG(i, j) = |G_i − G_j| · |1/D_i − 1/D_j|.

LambdaMART then uses a pairwise logistic loss and adapts it by re-weighting each item pair in each iteration, with s(x)|_i being the score of item i and α a hyperparameter:

l(y, s(x)) = Σ_{y_i > y_j} ΔNDCG(i, j) log_2(1 + e^{−α(s(x)|_i − s(x)|_j)}).

There are two popular public implementations of LambdaMART, namely λMART GBM and λMART RankLib. λMART GBM is more recent than λMART RankLib and has more advanced features, leveraging novel data sampling and feature bundling techniques (Ke et al., 2017). However, recent neural LTR papers either use the weaker λMART RankLib implementation (Pang et al., 2020; Wang et al., 2017a; Ai et al., 2018; 2019), or acknowledge the inferior performance of neural models compared with λMART GBM (Bruch et al., 2019b). Such inconsistency makes it hard to determine whether neural models are indeed more effective than tree-based models.
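A minimal sketch of the ΔNDCG weight and the reweighted pairwise logistic loss above (helper names are ours; `ranks` are the 1-based positions of items in the current ranked list):

```python
import math

def delta_ndcg(labels, ranks, i, j):
    # |G_i - G_j| * |1/D_i - 1/D_j|: un-normalized NDCG change when items
    # i and j swap positions in the current ranked list.
    gi, gj = 2.0 ** labels[i] - 1.0, 2.0 ** labels[j] - 1.0
    di, dj = math.log2(1.0 + ranks[i]), math.log2(1.0 + ranks[j])
    return abs(gi - gj) * abs(1.0 / di - 1.0 / dj)

def lambdamart_loss(labels, scores, ranks, alpha=1.0):
    # sum over pairs with y_i > y_j of DeltaNDCG(i, j) * log2(1 + e^{-alpha (s_i - s_j)}).
    n = len(labels)
    return sum(delta_ndcg(labels, ranks, i, j)
               * math.log2(1.0 + math.exp(-alpha * (scores[i] - scores[j])))
               for i in range(n) for j in range(n) if labels[i] > labels[j])
```

Scoring the relevant document above the irrelevant one yields a strictly smaller loss than the reverse ordering, which is what the boosting steps exploit.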

3. BENCHMARKING EXISTING METHODS

To resolve this inconsistency, we perform a benchmark on three popular LTR benchmark datasets to show that: 1) there is a large gap between the two LambdaMART implementations, λMART GBM and λMART RankLib; 2) recent neural LTR methods are generally significantly worse than the stronger implementation. We then discuss several weaknesses of recent neural LTR approaches and point out promising directions, which lay the foundation for our proposed framework.

[Table 2: NDCG@{1, 5, 10} of all compared methods on Web30K, Yahoo, and Istella, with a column indicating whether each method reranks an initial list; results are discussed in Section 3.2.]

3.1. DATASETS

The three datasets used in our experiments are public benchmarks widely adopted by the research community: the LETOR dataset from Microsoft (Qin & Liu, 2013), Set1 from the Yahoo LTR challenge (Chapelle & Chang, 2011), and Istella (Dato et al., 2016). We call them Web30K, Yahoo, and Istella, respectively. All of them are web search ranking datasets and the largest publicly available datasets for LTR algorithms. The relevance labels of documents for each query are rated by humans in the form of multilevel graded relevance. See Qin & Liu (2013) for an example list of features, such as the number of URL clicks or the BM25 scores of different page sections. An overview of the three datasets is shown in Table 1.
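These datasets are typically distributed in a LibSVM-like text format, one `label qid:<id> <idx>:<val> …` line per query-document pair; a minimal parser sketch (the function name and the 1-based index convention are assumptions on our part, not from the paper):

```python
def parse_letor_line(line, num_features):
    # "2 qid:10 1:0.5 3:1.25 ..." -> (label, qid, dense feature vector).
    tokens = line.strip().split()
    label = float(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    feats = [0.0] * num_features
    for tok in tokens[2:]:
        if tok.startswith("#"):  # some releases carry trailing comments
            break
        idx, val = tok.split(":", 1)
        feats[int(idx) - 1] = float(val)  # feature indices assumed 1-based
    return label, qid, feats
```

Grouping parsed rows by `qid` then yields the per-query lists that the listwise losses in Section 4 operate on.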

3.2. COMPARISON

We compare a comprehensive list of methods in Table 2. λMART GBM (Ke et al., 2017) and λMART RankLib are the two LambdaMART implementations. RankSVM (Joachims, 2006) is a classic pairwise learning-to-rank model built on SVMs. GSF (Ai et al., 2019) is a neural model using a groupwise scoring function and fully connected layers. ApproxNDCG (Bruch et al., 2019b) is a neural model with fully connected layers and a differentiable loss that approximates NDCG (Qin et al., 2010). DLCM (Ai et al., 2018) is an RNN-based neural model that uses listwise context information to rerank a list of documents produced by λMART RankLib, as in the original paper. SetRank (Pang et al., 2020) is a neural model using self-attention to encode the entire list and perform joint scoring. SetRank re (Pang et al., 2020) is SetRank plus ordinal embeddings based on the initial document ranking generated by λMART RankLib, as in the original paper. We choose these methods because they are either popular or recent. The neural models already leverage advanced neural techniques, such as modeling the entire ranking list, which is difficult for tree-based models to achieve. We reproduced the results for λMART RankLib, λMART GBM, RankSVM, GSF, and ApproxNDCG with extensive hyperparameter tuning (details in Appendix A). Results for the DLCM and SetRank methods are from their respective papers, where the authors did their own tuning. Note that the test set is fixed for all datasets, so the numbers are comparable. From Table 2, we can see the following. 1) λMART GBM is a more appropriate "state-of-the-art" LambdaMART baseline, as it significantly outperforms λMART RankLib. 2) Recent neural LTR methods, though they sometimes outperform λMART RankLib, are inferior to λMART GBM by a large margin, sometimes by as much as 15% relative. These results expose the inconsistency of existing comparisons and validate the concerns about the current practice of neural LTR models.

4. NEURAL LTR MODELS

A natural question is: why do neural models under-perform on LTR benchmark datasets compared with LambdaMART, despite their success in many machine learning research areas? We first identify a few weaknesses of the neural LTR models and then propose our methods to address them.

4.1. WEAKNESSES

By reviewing recent papers and the strengths of tree-based models, we form the following hypotheses:

Feature transformation. Neural networks are sensitive to input feature scales and transformations (Saberian et al., 2019). LTR datasets consist of features of diverse scales with long-tail distributions, such as the number of clicks of an item. Tree-based models are known to partition the feature space effectively, which is beneficial for datasets (such as LTR datasets) with only numeric features. Some recent work already shows the benefits of input feature transformations better than Gaussian normalization (Saberian et al., 2019; Zhuang et al., 2020). Unfortunately, neither the pioneering neural LTR papers (Burges et al., 2005; 2007) nor the most recent ones discuss the impact of feature transformation.

Network architecture. Unless the focus is the neural architecture itself, neural LTR papers typically use a standard feed-forward network consisting of a stack of fully connected layers. However, fully connected layers are known to be ineffective at generating higher-order feature interactions. This problem has been widely studied in areas such as ads prediction (Wang et al., 2017b) and recommender systems (Beutel et al., 2018), but has not received enough attention in LTR.

Data sparsity. Recent neural LTR models are small and do not employ high-capacity networks (Bruch et al., 2019b; Pang et al., 2020), possibly due to overfitting. While large datasets are a key factor in many recent successes of neural models in other domains (He et al., 2015; Devlin et al., 2019), the publicly available LTR datasets are comparatively small. Techniques such as data augmentation are commonly used to mitigate overfitting in high-capacity networks in other areas (Perez & Wang, 2017), but it is less obvious how to do data augmentation for LTR datasets, compared with, e.g., rotating a cat image in computer vision.

4.2. IMPROVEMENTS

We introduce our proposed neural LTR framework, which addresses the above-mentioned concerns. Figure 1 summarizes our DASALC framework, which stands for Data Augmented Self-Attentive Latent Cross ranking network.

4.2.1. EXPLICIT FEATURE TRANSFORMATION AND DATA AUGMENTATION

Features in LTR datasets are diverse and can be of different scales. Of the three datasets we consider, only the Yahoo dataset has already been normalized (we leave it untransformed). It is well known that neural networks are sensitive to input data scale, and we apply a simple "log1p" transformation to every element of x, which empirically works well for the Web30K and Istella datasets:

x' = log_e(1 + |x|) ⊙ sign(x),

where ⊙ is the element-wise multiplication operator. We also use a very simple data augmentation technique on LTR datasets: we add random Gaussian noise independently to every element of the input vector x:

x' = x + N(0, σ²I),   (7)

where σ is a scalar hyperparameter. The random noise is added after the log1p transformation in an online fashion during training (i.e., different perturbations are added to the same data point seen in different batches). A single scalar σ for every feature is reasonable because the feature distributions are normalized by log1p. Data augmentation is also applied after input Batch Normalization (BN) when applicable; since the random noise is added independently to every element, (later) BN will not cancel it away. We find this simple data augmentation technique works well in our framework but, as shown in the experiments, only when the capacity of the network is properly augmented as described in the next section.

[Figure 1: The DASALC architecture — document list features f_1 … f_n pass through the log1p transform and FC-ReLU-BN blocks, self-attention produces contextual embeddings a_i, and the latent cross yields scores s_1 … s_n.]

For notational simplicity, we combine the log1p feature transformation and data augmentation into a single function f: R^{n×k} → R^{n×k}:

f(x) = log_e(1 + |x|) ⊙ sign(x) + N(0, σ²I).   (8)
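A plain-Python sketch of the transformation pipeline of Eq. (8), applied to one feature vector (function names are ours; the paper applies this element-wise to the whole feature matrix):

```python
import math
import random

def log1p_transform(x):
    # Symmetric log1p: log(1 + |x_j|) * sign(x_j), element-wise.
    return [math.log1p(abs(v)) * (1 if v > 0 else -1 if v < 0 else 0) for v in x]

def augment(x, sigma, rng=random):
    # Add i.i.d. Gaussian noise after the log1p transform (train time only).
    return [v + rng.gauss(0.0, sigma) for v in log1p_transform(x)]
```

With σ = 0 the augmentation reduces to the pure log1p transform, which is what is used at evaluation time.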

4.2.2. LEVERAGING LISTWISE CONTEXT

For the LTR problem, the list of documents can be leveraged in neural models; this is the key basis for enhancing the network architecture for LTR. We leverage the multi-head self-attention (MHSA) mechanism (Vaswani et al., 2017) to encode ranking list information. More specifically, we generate a contextual embedding a_i for each item i by considering the similarity between document i and every document in the list. For the multi-head self-attention mechanism, we have the input f ∈ R^{n×k}, and project f into a query (in the context of the attention mechanism) matrix Q = f W^Q, a key matrix K = f W^K, and a value matrix V = f W^V, with trainable projection matrices W^Q, W^K, W^V ∈ R^{k×z}, where z is the attention head size. A self-attention (SA) head then computes a weighted sum of the transformed values V as SA(f) = Softmax(S(f)) V, where the similarity matrix between Q and K is S(f) = QK^T / √z. For each layer, the results of the H heads are concatenated to form the output of multi-head self-attention, MHSA(f) = concat_{h ∈ [H]}[SA_h(f)] W^out + b^out, where W^out ∈ R^{Hz×z} and b^out ∈ R^{n×z} are trainable parameters. We apply L ≥ 1 layers of multi-head self-attention followed by layer normalization (Ba et al., 2016), similarly to Vaswani et al. (2017). Treating a_i as the listwise contextual embedding for item i, we further leverage the simple latent cross idea (Beutel et al., 2018) to effectively generate feature interactions: h_i^cross = (1 + a_i) ⊙ h_out(x_i), where ⊙ is the element-wise multiplication operator (a_i goes through a linear projection when the dimensions do not match, omitted in the equation), and h_out(x_i) is the output of the final hidden layer of the regular network. Learning to rank can be seen as learning to induce an order over a set of items.
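A single-head, single-layer NumPy sketch of the listwise scoring path described above (toy dimensions and variable names are ours; the real model uses multiple heads and layers, and layer normalization is omitted):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sa_head(f, Wq, Wk, Wv):
    # SA(f) = Softmax(Q K^T / sqrt(z)) V with Q = f Wq, K = f Wk, V = f Wv.
    Q, K, V = f @ Wq, f @ Wk, f @ Wv
    return softmax(Q @ K.T / np.sqrt(Wq.shape[1])) @ V

def latent_cross(a, h):
    # h_i^cross = (1 + a_i) * h_out(x_i), element-wise.
    return (1.0 + a) * h

rng = np.random.default_rng(0)
n, k, z = 4, 6, 3                      # list size, feature dim, head size
f = rng.normal(size=(n, k))            # transformed input features
Wq, Wk, Wv = (rng.normal(size=(k, z)) for _ in range(3))
a = sa_head(f, Wq, Wk, Wv)             # listwise contextual embeddings, (n, z)
h = rng.normal(size=(n, z))            # stand-in for the final hidden layer output
crossed = latent_cross(a, h)           # (n, z), fed to the scoring head
```

When a_i = 0 the latent cross degenerates to the plain feed-forward output, so the context acts as a multiplicative residual correction.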
One desirable property for ranking approaches that use listwise context is permutation equivariance: applying a permutation to the input items leads to the same permutation of the output scores. DASALC satisfies this property. Proposition 1. Let π be a permutation of the indices [1, ..., n] and x ∈ R^{n×k} be the input item representation. DASALC is permutation equivariant for the scores generated over the input items, i.e., s_DASALC(π(x)) = π(s_DASALC(x)). See the proof in Appendix C.

4.3. REMARKS

We compared several popular pointwise, pairwise, and listwise ranking losses. We report all results based on the softmax cross entropy loss

l(y, s(x)) = −Σ_{i=1}^n y_i log_e( e^{s_i} / Σ_j e^{s_j} ),

since it is simple and empirically robust in general, as demonstrated in Appendix B.2. We provide a general framework that enhances neural LTR models in many components. For each component, we purposefully use simple or well-known techniques, because the scope of the current research is to identify the possible reasons why neural LTR under-performs when compared with the best traditional tree-based methods. Clearly, each component could use more advanced techniques, such as learning a more flexible data transformation (Zhuang et al., 2020) or a data augmentation policy (Cubuk et al., 2019), which we leave as future work.
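For concreteness, a numerically stable pure-Python sketch of this loss (the function name is ours):

```python
import math

def softmax_cross_entropy(labels, scores):
    # l(y, s) = -sum_i y_i * log(exp(s_i) / sum_j exp(s_j)),
    # computed via a max-shifted log-sum-exp for stability.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - lse) for y, s in zip(labels, scores))
```

Raising the score of a relevant item while holding the others fixed strictly decreases the loss, which is the listwise behavior the framework relies on.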

5. EXPERIMENTS

We conduct experiments on the three LTR datasets (introduced in Sec. 3.1) with our proposed framework and compare with the methods in Sec. 3. All our neural network approaches are implemented using the TF-Ranking library (Pasumarthi et al., 2019). We use two variants of our proposed approach. DASALC is a model trained in our proposed framework. DASALC-ens is an ensemble of DASALC: since LambdaMART is itself an ensemble method based on boosting, we leverage the randomness of neural model training and simply use the average score of 3-5 models (tuned on the validation set) from different runs as the final score. Main result. The results are summarized in Table 3. We focus on the comparison with λMART GBM and also include SetRank to highlight the difference from recent neural LTR models; readers can refer to Table 2 for more results. We tune hyperparameters on the validation sets, with more details in Appendix A. We make the following observations: (1) DASALC can sometimes achieve comparable or better results than λMART GBM, and outperforms recent neural LTR methods by a large margin. (2) DASALC-ens, though simple, achieves neutral or significantly better results than λMART GBM on all datasets and metrics. (3) The results on the Yahoo dataset are weaker than on the other two datasets. Note that the Yahoo dataset was already normalized upon release; given the importance of input feature transformation, the provided normalization may not be ideal for neural models, so we encourage releasing LTR datasets with raw feature values.
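The DASALC-ens scoring rule is plain score averaging across independent runs; as a minimal sketch (the function name is ours):

```python
def ensemble_scores(runs):
    # DASALC-ens: average per-document scores from independent training runs.
    # `runs` is a list of score lists, one per trained model, same doc order.
    return [sum(col) / len(runs) for col in zip(*runs)]
```

The averaged scores are then sorted to produce the final ranking, exactly as for a single model.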

6. RELATED WORK

We focus on traditional LTR problems where only numeric features and human ratings are available. Some works (Mitra & Craswell, 2018; Nogueira et al., 2019; Han et al., 2020) on document matching and ranking leverage neural components such as word2vec and BERT when raw text is available; there the major benefit comes from semantic modeling of highly sparse input, and tree-based methods become less relevant due to their limitations in handling sparse features. The pioneering neural LTR models are RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007). They use feed-forward networks on dense features as their scoring functions and became less favored than tree-based LambdaMART (Burges, 2010). Recent neural LTR models have explored new model architectures (Pang et al., 2020; Qin et al., 2020b), differentiable losses (Bruch et al., 2019b), and leveraging more auxiliary information (Ai et al., 2018). However, little work specifically analyzes and addresses the weaknesses of neural LTR, and a benchmark against a strong tree-based baseline is missing. In this work, we show that relatively simple components aimed at the weaknesses of neural models can outperform recent methods significantly. The idea of generating new data for LTR has been explored in a few recent works, but their focus is to train more discriminative ranking models, not to mitigate the data sparsity problem for high-capacity neural models. For example, Yu & Lam (2019) use a separate autoencoder model to generate data and then feed it into tree-based models; this can be treated as orthogonal to our data augmentation technique. Several LTR papers have leveraged neural sequence modeling based on LSTMs (Ai et al., 2018) or self-attention (Pang et al., 2020; Pasumarthi et al., 2020), which is not easy for tree-based approaches to model. We also leverage listwise context via self-attention to show that neural LTR models are easily extendable.
The combination of self-attention based listwise context and latent cross in our work, which specifically mitigates the ineffectiveness of neural models at generating higher-order feature interactions, has not been explored in the literature. Our work is mostly orthogonal to another line of LTR research, namely unbiased learning to rank from implicit feedback data such as clicks (Joachims et al., 2017; Hu et al., 2019; Qin et al., 2020a; Zhuang et al., 2021). There are also papers that try to reproduce tree models using neural architectures for tabular data (Saberian et al., 2019; Lee & Jaakkola, 2020); our motivation is different in that our goal is to identify and mitigate weaknesses of neural approaches in general.

7. CONCLUSION AND DISCUSSION

In this paper, we first showed the inconsistency of performance comparisons between neural rankers and GBDT models, and verified the inferior performance of neural models. We then identified weaknesses in multiple components of neural rankers and proposed methods to address them. Our proposed framework performs competitively with the strong tree-based baselines. We believe our general framework and rigorous benchmarking provide a critical contribution to facilitate future neural LTR research. In particular, neural models are powerful at modeling complex relations (e.g., the attention mechanism (Vaswani et al., 2017)) and raw text features (e.g., BERT (Devlin et al., 2019)). Moreover, the active research on neural networks in other domains continuously advances neural techniques (e.g., optimizers (Kingma & Ba, 2014)). All of these can be studied in the LTR setting, and our work paves the way to avoid pitfalls when leveraging such techniques.

B.2 RANKING LOSSES

Much recent progress in neural LTR concerns ranking losses, especially listwise ranking losses (Bruch et al., 2019b; a; 2020; Grover et al., 2019). For example, it is attractive to devise differentiable versions of ranking metrics for end-to-end learning. Here we benchmark different ranking losses with regular DNN models on different datasets to show that (1) listwise ranking losses are superior choices to the pointwise or pairwise losses normally used for non-neural LTR models; (2) the performances of state-of-the-art listwise ranking losses are comparable; (3) the softmax cross entropy loss is a simple but robust choice. We consider the following ranking losses:

• SigmoidCrossEntropy: a widely used pointwise loss: l(y, s(x)) = Σ_{i=1}^n −y_i s_i + log_e(1 + e^{s_i}).

• RankNet (Burges et al., 2005): a popular pairwise loss: l(y, s(x)) = Σ_{y_i > y_j} log_e(1 + e^{s_j − s_i}).

• LambdaRank (Burges et al., 2007; Wang et al., 2018b): the pairwise loss with ΔNDCG weights, a direct implementation of the LambdaMART loss in Eq. (5).

• Softmax (Cao et al., 2007; Bruch et al., 2019a): a popular listwise loss: l(y, s(x)) = −Σ_{i=1}^n y_i log_e(e^{s_i} / Σ_j e^{s_j}).

• ApproxNDCG (Qin et al., 2010; Bruch et al., 2019b): a listwise loss that is a differentiable approximation of the NDCG metric: l(y, s(x)) = −(1 / DCG(π*, y)) Σ_{i=1}^n (2^{y_i} − 1) / log_2(1 + π_s(i)), where π_s(i) = 1/2 + Σ_j sigmoid((s_j − s_i) / T) with T a smoothness parameter.

• GumbelApproxNDCG (Bruch et al., 2019b; 2020): a listwise loss with a stochastic treatment of ApproxNDCG: the scores s in the above loss are substituted by s_i + g_i, with gumbel noise g_i = −log_e(−log_e U_i) for U_i uniformly sampled in [0, 1].
• NeuralSortNDCG (Grover et al., 2019): a listwise loss that approximates the NDCG metric with the NeuralSort trick: l(y, s(x)) = −(1 / DCG(π*, y)) Σ_{i,r=1}^n (2^{y_i} − 1) P̂^s_{ri} / log_2(1 + r), where P̂^s is an approximate permutation matrix (P̂^s_{ri} approximating the probability that item i is placed at rank r), obtained by the NeuralSort trick: the r-th row is P̂^s_{r,:} = softmax[((n + 1 − 2r) s − A_s 1) / T], where (A_s)_{ij} = |s_i − s_j| and T is a smoothness parameter.

• GumbelNeuralSortNDCG: a listwise loss with a stochastic treatment of NeuralSortNDCG, replacing the scores s in the NeuralSort permutation matrix by s_i + g_i, where g_i is again sampled from the gumbel distribution. This is new in the literature but not the major focus of this work.
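To make the listwise-loss shape concrete, here is a pure-Python sketch of ApproxNDCG following the formula above (names are ours; note the j = i term contributes sigmoid(0) = 1/2, so a well-separated top item receives approximate rank 1):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def approx_ndcg_loss(labels, scores, T=1.0):
    # Smooth rank pi_s(i) = 1/2 + sum_j sigmoid((s_j - s_i) / T), plugged into
    # -1/DCG(pi*, y) * sum_i (2^{y_i} - 1) / log2(1 + pi_s(i)).
    n = len(labels)
    ranks = [0.5 + sum(sigmoid((scores[j] - scores[i]) / T) for j in range(n))
             for i in range(n)]
    ideal_dcg = sum((2.0 ** y - 1.0) / math.log2(1.0 + r)
                    for r, y in enumerate(sorted(labels, reverse=True), start=1))
    return -sum((2.0 ** labels[i] - 1.0) / math.log2(1.0 + ranks[i])
                for i in range(n)) / ideal_dcg
```

With well-separated scores in the correct order, the smooth ranks approach the true ranks and the loss approaches −1, i.e., minus a perfect NDCG.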

[Table 6: NDCG@{1, 5, 10} for each ranking loss on Web30K and Istella.]

The results are summarized in Table 6. For each ranking loss, we grid-search over optimizers and learning rates: for the Adam optimizer, we scan learning rates in {10^-4, 10^-3, 10^-2}; for the Adagrad optimizer, we scan learning rates in {0.01, 0.1, 0.5}. When the smoothness parameter T is applicable, we also scan it over {0.1, 1, 10}. We report the results based on the best NDCG@5 for each loss. As stated above, we find that: (1) models trained with listwise losses perform significantly better than models trained with pointwise or pairwise losses; (2) different state-of-the-art listwise losses perform comparably.
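The tuning loop described above can be sketched as follows; `train_and_eval` is a hypothetical callback returning validation NDCG@5 for one configuration (its name and signature are our assumptions, not part of the paper):

```python
import itertools

# The grids as described in the text.
GRID = {"adam": [1e-4, 1e-3, 1e-2], "adagrad": [0.01, 0.1, 0.5]}
T_GRID = [0.1, 1.0, 10.0]  # smoothness parameter, when the loss uses one

def grid_search(train_and_eval, use_T=True):
    # Exhaustively evaluate every (optimizer, learning rate, T) configuration
    # and keep the one with the best validation NDCG@5.
    best, best_cfg = float("-inf"), None
    for opt, lrs in GRID.items():
        for lr, T in itertools.product(lrs, T_GRID if use_T else [None]):
            ndcg5 = train_and_eval(opt, lr, T)
            if ndcg5 > best:
                best, best_cfg = ndcg5, (opt, lr, T)
    return best, best_cfg
```

Each call to `train_and_eval` would train one model to convergence, so the full sweep costs up to 18 training runs per loss.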

B.5 PERFORMANCE ON CATBOOST

We mainly compared with λMART RankLib and λMART GBM in the main text since they are the most popular baselines used in recent papers. Other GBDT implementations can also be used for the LTR task. CatBoost (Prokhorenkova et al., 2018) is a recently popular GBDT implementation for various tasks, and we also evaluate its performance on the three LTR datasets. Note that CatBoost is not specific to ranking and, to the best of our knowledge, does not have a standard LambdaMART implementation. We try both the QueryRMSE loss and the YetiRank loss, which are the best performing losses on most existing CatBoost benchmarks. The results are reported in Table 10.

[Table 10: NDCG@{1, 5, 10} of CatBoost and the LambdaMART implementations on Web30K, Yahoo, and Istella.]

We can see that CatBoost produces very decent results, clearly outperforming λMART RankLib, but its comparison with λMART GBM is mixed. We encourage researchers to also consider different implementations such as CatBoost in future LTR work.

B.6 LAMBDAMART ENSEMBLE

We showed that a simple ensemble of neural rankers can bring meaningful gains by leveraging the stochastic nature of neural network training. On the other hand, LambdaMART is itself an ensemble algorithm using boosting, but it is still interesting to see the effect of ensembling multiple LambdaMART models. We conducted additional experiments on this front using λMART GBM and have two major observations: 1) Running LambdaMART multiple times with the same configuration generates very similar results, and ensembling in this setting does not help, whereas neural rankers benefit even from such a simple setting. 2) In Table 11 we show the effect of ensembling LambdaMART models with different configurations (e.g., different numbers of trees, numbers of leaves, and learning rates) on the Istella dataset; we ensemble five LambdaMART models chosen on the validation set. The results on the other datasets are similar.

Method           @1     @5     @10
λMART GBM        74.92  71.24  76.07
λMART GBM-ens    75.04  71.40  76.28

Table 11: Results on the Istella dataset using LambdaMART ensembles.

We can see that the improvement from ensembling LambdaMART is smaller than that for neural rankers (see Table 3). Our hypothesis is that model ensembles tend to be more effective for neural rankers, whose training is more strongly stochastic; exploring advanced model ensemble methods with neural rankers is an interesting future direction.

C PERMUTATION EQUIVARIANCE ANALYSIS

For any general scoring function s(x): R^{n×k} → R^n and a permutation π over the indices [1, ..., n], we call s permutation equivariant iff s(π(x)) = π(s(x)). The scoring function of the proposed approach, DASALC, can be written as a combination of the feature transformation and data augmentation function f, the output of multi-head self-attention a := MHSA_L(f), and the output of the final layer of the regular network h_out(x):

s_DASALC(x) = W_FC^T ReLU((1 + a(x)) ⊙ h_out(x)).   (13)

Note that per-item transformations, which we refer to as univariate transformations, are trivially permutation equivariant. The composition of two permutation equivariant functions is also permutation equivariant, since the permutation operator commutes with permutation equivariant functions. Hence the linear projection, the ReLU activation, and f (as a function of x) are permutation equivariant. Multi-head self-attention is known to be permutation equivariant (Pang et al., 2020). Hence, applying a permutation π to the proposed scoring function, we see that it satisfies the permutation equivariance property.



Footnotes:
RankLib: https://sourceforge.net/p/lemur/wiki/RankLib/
LightGBM: https://github.com/microsoft/LightGBM
We use Fold1 of Web30K to be consistent with the setup of Yahoo and Istella. Some of the reported results on Web30K were based on 5-fold cross-validation (CV). We verified on λMART RankLib that the difference between Fold1 and CV is small and does not affect our conclusions.



$$\begin{aligned}
\pi(s_{\mathrm{DASALC}}(x)) &= \pi\big(W_{FC}^{\top}\,\mathrm{ReLU}\big((1 + a(x)) \odot h_{\mathrm{out}}(x)\big)\big) \\
&= W_{FC}^{\top}\,\mathrm{ReLU}\big(\pi\big((1 + a(x)) \odot h_{\mathrm{out}}(x)\big)\big) \\
&= W_{FC}^{\top}\,\mathrm{ReLU}\big((1 + \pi(a(x))) \odot \pi(h_{\mathrm{out}}(x))\big) \\
&= W_{FC}^{\top}\,\mathrm{ReLU}\big((1 + a(\pi(x))) \odot h_{\mathrm{out}}(\pi(x))\big) \\
&= s_{\mathrm{DASALC}}(\pi(x))
\end{aligned}$$
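The equivariance argument above can be checked numerically on a toy scorer. The sketch below is purely illustrative: mean pooling stands in for self-attention (both are permutation equivariant set functions over the list), scalar features replace vectors, and all names are ours.

```python
import random

def toy_score(xs):
    """Toy DASALC-style scorer on scalar features: a univariate transform
    h(x_i) = 2*x_i + 1 combined with a permutation-equivariant listwise
    context term a(x_i) = x_i - mean(xs) via the latent cross (1 + a) * h."""
    ctx = sum(xs) / len(xs)  # listwise context: a symmetric set function
    return [(1.0 + (x - ctx)) * (2.0 * x + 1.0) for x in xs]

random.seed(0)
x = [random.random() for _ in range(6)]
perm = list(range(6))
random.shuffle(perm)

lhs = toy_score([x[i] for i in perm])   # s(pi(x))
rhs = [toy_score(x)[i] for i in perm]   # pi(s(x))
# Permuting inputs permutes outputs identically (up to float rounding).
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```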

The statistics of the three largest public benchmark datasets for LTR models.

All numbers are significantly worse than the corresponding numbers from λMART GBM at the p < 0.05 level using a two-tailed t-test. The best-performing numbers are in bold.

Results on the Web30K, Yahoo, and Istella datasets. ↑ means a significantly better result compared with λMART GBM at the p < 0.05 level using a two-tailed t-test. The last row is the relative difference of DASALC-ens over λMART GBM.

NDCG@5 on Istella when different components are added.

Ablation study. We provide ablation study results in Table 4 to highlight the effectiveness of each component in our framework. Components are added cumulatively from left to right in the table. We can see that each component helps and that the best performance is achieved when all components are combined. More detailed ablation studies are provided in Appendix B: Appendix B.1 gives more results on the effect of the log1p transformation; Appendix B.2 compares different loss functions and shows that listwise ranking losses perform better; Appendix B.3 shows the benefit of effective listwise context modeling; Appendix B.4 shows the effect of data augmentation in different model architectures.

Results on the Web30K and Istella datasets with a standard feed-forward network architecture.

Comparison of CatBoost with other methods on the Web30K, Yahoo, and Istella datasets.

A HYPERPARAMETER TUNING

For λMART GBM, we do a grid search over the number of trees ∈ {300, 500, 1000}, the number of leaves ∈ {200, 500, 1000}, and the learning rate ∈ {0.01, 0.05, 0.1, 0.5}. For our neural models, the main hyperparameters are the hidden layer size ∈ {256, 512, 1024, 2048, 3072, 4096} and the number of layers ∈ {3, 4, 5, 6} for the regular DNN, the data augmentation noise σ ∈ [0, 5.0] searched with step 0.1 using binary search, the number of attention layers ∈ {3, 4, 5, 6}, and the number of attention heads ∈ {2, 3, 4, 5}. The same parameter sweep is applied to the baselines we tried, where applicable. One notable difference between our work and existing work is that we tried hidden layer sizes as large as 4096 and found that larger models work better in general when data augmentation is enabled. We are in the process of releasing the code and trained models in an open-source software package.
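The GBDT grid above can be enumerated mechanically before training each configuration and selecting the best on the validation set. A minimal sketch (the helper names are ours, not from any released code):

```python
from itertools import product

# Hyperparameter grid for lambdaMART_GBM as described above.
gbdt_grid = {
    "num_trees": [300, 500, 1000],
    "num_leaves": [200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1, 0.5],
}

def grid_configs(grid):
    """Enumerate every configuration (Cartesian product) in a hyperparameter grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = grid_configs(gbdt_grid)
# 3 * 3 * 4 = 36 configurations: each would be trained once and the best
# model chosen by validation NDCG.
```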

B ABLATION STUDIES AND ANALYSIS

B.1 EFFECT OF LOG1P INPUT TRANSFORMATION

We first show that the simple log1p transform can improve performance on the Web30K and Istella datasets (the Yahoo dataset is already normalized). The results in Table 5 are based on regular DNN models using the softmax cross-entropy loss; the trends are similar for other configurations. We also observed that the results are in general slightly better than with Gaussian normalization, due to the long-tail nature of LTR dataset features; we omit those results here.
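As an illustrative sketch, one symmetric variant of the log1p transform that compresses long-tailed magnitudes while handling negative feature values (the exact variant applied per dataset may differ; the function name is ours):

```python
import math

def log1p_transform(features):
    """Symmetric log1p: x -> sign(x) * log(1 + |x|). Compresses long-tailed
    feature magnitudes while preserving sign and keeping zero fixed, which
    suits LTR features that can be large, skewed, or negative."""
    return [math.copysign(math.log1p(abs(v)), v) for v in features]

raw = [0.0, 5.0, 1234.0, -7.0]
transformed = log1p_transform(raw)
```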

Table 5: Results on the Web30K and Istella datasets (NDCG@1, @5, and @10) with and without the log1p transformation.

We can see that this simple transformation brings meaningful gains. In all following sections, we use the log1p transformation by default.

B.2 EFFECT OF RANKING LOSSES

Listwise losses are generally comparable, and we found that the softmax cross-entropy loss performs consistently well across different models and different datasets; it is thus used in our main results and in the following sections. (3) LambdaRank does not work well for neural models. On the other hand, previous work (Bruch et al., 2019a) shows that tree-based models with the softmax loss are not as good as LambdaMART, demonstrating that tree-based models and neural LTR models behave differently under different loss functions. This encourages future work to design ranking losses specific to neural LTR.
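A minimal sketch of a listwise softmax cross-entropy loss of the kind discussed above: normalized relevance labels form a target distribution over the list, and the score distribution is penalized by its cross-entropy against it. This is one common formulation; the exact variant used in any given implementation may differ.

```python
import math

def softmax_cross_entropy_loss(scores, labels):
    """Listwise softmax cross-entropy over one query's document list."""
    m = max(scores)
    exp_s = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exp_s)
    total = sum(labels)
    return -sum(
        (y / total) * math.log(e / z) for y, e in zip(labels, exp_s) if y > 0
    )

labels = [2.0, 1.0, 0.0]  # graded relevance for three documents
good = softmax_cross_entropy_loss([3.0, 1.0, 0.0], labels)
bad = softmax_cross_entropy_loss([0.0, 1.0, 3.0], labels)
# Scoring relevant documents higher yields a lower loss (good < bad).
```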

B.3 EFFECT OF LISTWISE CONTEXT

We study the effect of leveraging listwise context with self-attention, with and without the latent cross (without it, the item feature and the context feature are simply concatenated) (Pasumarthi et al., 2020), on the Web30K and Istella datasets. Results are shown in Table 7. We can see that using a neural approach to model listwise context, which is difficult for tree-based models, is quite beneficial. The latent cross, though simple, helps leverage the listwise context more effectively.
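A minimal sketch of the latent cross combination, assuming an elementwise product between item and context features as in Eq. (13); the names and values are illustrative:

```python
def latent_cross(item_hidden, context):
    """Latent cross: combine per-item hidden features with a listwise
    context vector multiplicatively, h' = (1 + a) * h elementwise,
    rather than by concatenation (sketch after Pasumarthi et al., 2020)."""
    return [(1.0 + a) * h for a, h in zip(context, item_hidden)]

h = [0.5, -1.0, 2.0]   # per-item hidden features (hypothetical)
a = [0.2, 0.0, -0.5]   # context features from self-attention (hypothetical)
combined = latent_cross(h, a)
```

Note that when the context feature is zero, the item representation passes through unchanged, which is one reason this multiplicative form is a gentle way to inject listwise context.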

Table 7: Results on the Web30K and Istella datasets (NDCG@1, @5, and @10) for the DNN and its variants with listwise context.

B.4 EFFECT OF DATA AUGMENTATION

The performance of the DNN starts to drop as soon as we start to add noise. For DASALC, however, data augmentation helps, and the performance is robust across different levels of noise, peaking around σ = 1.5. The optimal σ needs tuning for each dataset, but the general trends are similar on the other datasets. We leave the study of the exact mechanism by which data augmentation helps DASALC, and the application of more sophisticated data augmentation techniques, to future work. We also tried adding noise to λMART GBM and observed results similar to the DNN: as shown in Table 9 for the Yahoo dataset, adding noise leads to worse accuracy.

Table 9: Results on the Yahoo dataset (NDCG@1, @5, and @10) for λMART GBM with different noise levels σ.
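The augmentation itself is simple; a sketch assuming i.i.d. Gaussian input noise added to each (already transformed) feature, with function names that are ours:

```python
import random

def augment(features, sigma, rng):
    """Data augmentation for LTR inputs: add i.i.d. Gaussian noise with
    standard deviation sigma to each feature value. A fresh noisy copy
    is drawn on every call (i.e., per training step)."""
    return [v + rng.gauss(0.0, sigma) for v in features]

rng = random.Random(42)
x = [0.1, 1.5, 3.2]
x_noisy = augment(x, sigma=1.5, rng=rng)
```

Setting sigma = 0 recovers the original features, which corresponds to the λMART GBM(0.0) baseline row in Table 9.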

