ARE NEURAL RANKERS STILL OUTPERFORMED BY GRADIENT BOOSTED DECISION TREES?

Abstract

Despite the success of neural models on many major machine learning problems, their effectiveness on traditional Learning-to-Rank (LTR) problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets. This fact has largely been overlooked in recent neural LTR papers. We then investigate why existing neural LTR models under-perform and identify several of their weaknesses. Furthermore, we propose a unified framework comprising counter strategies that ameliorate these weaknesses. Our models are the first to perform on par with the best tree-based baseline, while outperforming recently published neural LTR models by a large margin. Our results can also serve as a benchmark to facilitate future improvement of neural LTR models.

1. INTRODUCTION

Neural approaches have been dominating many major machine learning domains, such as computer vision (He et al., 2015), natural language processing (Devlin et al., 2019), and speech recognition (Hannun et al., 2014). However, the effectiveness of neural approaches in traditional Learning-to-Rank (LTR), the long-established inter-disciplinary research area at the intersection of machine learning and information retrieval (Liu, 2009), is not widely acknowledged (Yang et al., 2019), especially on benchmark datasets that have only numerical features. Historically, a series of LTR models were developed by researchers at Microsoft, starting with RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007), both based on neural networks, and culminating in LambdaMART (Wu et al., 2010), which is based on Gradient Boosted Decision Trees (GBDT); Burges (2010) provides an overview of this evolution. There are two publicly available implementations of LambdaMART: one provided by the RankLib [1] library that is part of the Lemur Project (henceforth referred to as λMART RankLib); and the LightGBM [2] implementation provided by Microsoft (Ke et al., 2017) (henceforth referred to as λMART GBM). As we will show in Section 3, λMART GBM substantially outperforms λMART RankLib. There is strong and continuing interest in neural ranking models, with numerous papers published in the last few years alone. Most of these papers treat RankNet and LambdaRank as weak baselines (Pang et al., 2020; Bruch et al., 2019b) and LambdaMART as the "state-of-the-art" (Bruch et al., 2019b; Li et al., 2019; Zhu & Klabjan, 2020; Hu et al., 2019). However, when examining these papers, we note that they either acknowledge their under-performance relative to λMART GBM or claim state-of-the-art performance by comparing against the weaker λMART RankLib implementation.
The inconsistency of performance evaluation on benchmark datasets in this field has made it difficult to measure progress (Lipton & Steinhardt, 2018). It therefore remains an open question whether neural LTR models are as effective as they claim to be, and how to improve them if that is not the case. In this paper, we first conduct a benchmark to show that λMART GBM outperforms recently published neural models, as well as λMART RankLib, by a large margin. While the neural paradigm is still appealing in a myriad of ways, such as being composable, flexible, and able to benefit from a plethora of new advances (Vaswani et al., 2017; Devlin et al., 2019), research progress in neural ranking models could be hindered by their inferior performance relative to tree models. It thus becomes critical to understand the pitfalls of building neural rankers and to boost their performance on benchmark datasets. Specifically, we investigate why neural LTR approaches under-perform on standard LTR datasets and identify three major weaknesses that are typically ignored by recent work. First, neural models are not as adept at performing effective feature transformations and scaling, which is one major benefit of tree-based methods (Saberian et al., 2019). For ranking data, which is typically long-tailed, this can be a serious limitation. Second, standard feed-forward networks are ineffective at generating higher-order features, as noted by recent papers (Wang et al., 2017b; Beutel et al., 2018). More effective network architectures for neural LTR models are needed. Third, recent neural LTR work on benchmark datasets does not employ high-capacity networks, a key success factor of many neural models (Devlin et al., 2019), possibly because the small scale of the training data causes overfitting.
On the other hand, neural approaches have several potential benefits over LambdaMART for LTR, such as their flexibility in modeling listwise data and the existence of many techniques to mitigate data sparsity. To that end, we propose a new framework that ameliorates the weaknesses of existing neural LTR approaches and improves almost all major network components. In the proposed framework, we make several technical contributions: (1) We present empirical evidence that a simple log1p transformation of the input features is very helpful. (2) We use data augmentation (DA) to make the most of high-capacity neural models; surprisingly, ours is the first work in the LTR literature to do so. We show that adding simple Gaussian noise helps, but only when the model capacity is appropriately augmented (which probably explains why there is no prior work on such a simple idea). (3) We use self-attention (SA) to model the listwise ranking data as context, and propose to use latent cross (LC) to effectively generate interactions between each item and its listwise context. We conduct experiments on three widely used public LTR datasets. Our neural models are trained with listwise ranking losses. On all datasets, our framework outperforms recent neural LTR methods by a large margin. When compared with the strong LambdaMART implementation, λMART GBM, we achieve equally good results, if not better. Our work can also serve as a benchmark for neural ranking models, which we believe can lay fertile ground for future neural LTR research, as rigorous benchmarks on datasets such as ImageNet (Russakovsky et al., 2015) and GLUE (Wang et al., 2018a) do in their respective fields.
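To make the first two strategies concrete, the sketch below shows one way to implement them in NumPy. The symmetric log1p variant (which preserves the sign of negative features) and the noise level σ are illustrative choices, not necessarily the exact settings used in our experiments:

```python
import numpy as np

def log1p_transform(x):
    # Symmetric log1p: compresses long-tailed feature magnitudes
    # while preserving sign and keeping zeros fixed.
    return np.sign(x) * np.log1p(np.abs(x))

def augment_with_noise(x, sigma=0.1, seed=None):
    # Data augmentation: add zero-mean Gaussian noise to the
    # (already transformed) features at training time only.
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape)

# Long-tailed raw features (e.g. counts, BM25-style scores).
x = np.array([[0.0, 3.0, 1e6],
              [-2.0, 50.0, 0.5]])
x_t = log1p_transform(x)              # tamed dynamic range
x_aug = augment_with_noise(x_t, 0.1)  # perturbed copy for training
```

Because the noise is applied after the log1p transformation, a fixed σ corresponds to a roughly multiplicative perturbation in the original feature space, which matches the long-tailed nature of ranking features.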

2. BACKGROUND

We provide some background on LTR, including its formulation and common metrics. We review LambdaMART and highlight its two popular implementations, which are a source of the inconsistency in evaluations in the recent literature.

2.1. LEARNING TO RANK

LTR methods are supervised techniques, and the training data can be represented as a set Ψ = {(x, y) ∈ χ^n × R^n}, where x is a list of n items x_i ∈ χ and y is a list of n relevance labels y_i ∈ R for 1 ≤ i ≤ n. We use χ to denote the universe of all items. In traditional LTR problems, each x_i corresponds to a query-item pair and is represented as a feature vector in R^k, where k is the number of feature dimensions. With slight abuse of notation, we also use x_i as the feature vector and say x ∈ R^{n×k}. The objective is to learn a function that produces an ordering of items in x so that the utility of the ordered list is maximized.
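As a concrete example of such a utility, a standard choice in LTR is NDCG. Writing π(i) for the rank position of item x_i under the produced ordering, one widely used formulation (the exact variant may differ slightly across datasets and papers) is:

```latex
\mathrm{DCG@}k(\pi, y) = \sum_{i:\, \pi(i) \le k} \frac{2^{y_i} - 1}{\log_2\!\big(1 + \pi(i)\big)},
\qquad
\mathrm{NDCG@}k(\pi, y) = \frac{\mathrm{DCG@}k(\pi, y)}{\max_{\pi'} \mathrm{DCG@}k(\pi', y)}
```

The denominator is the DCG of the ideal ordering (items sorted by decreasing label), so NDCG@k lies in [0, 1] and is comparable across queries with different label distributions.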



[1] https://sourceforge.net/p/lemur/wiki/RankLib/
[2] https://github.com/microsoft/LightGBM

