ARE NEURAL RANKERS STILL OUTPERFORMED BY GRADIENT BOOSTED DECISION TREES?

Abstract

Despite the success of neural models on many major machine learning problems, their effectiveness on traditional Learning-to-Rank (LTR) problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets. This gap has unfortunately been overlooked in recent neural LTR papers. We then investigate why existing neural LTR models under-perform and identify several of their weaknesses. Furthermore, we propose a unified framework comprising counter-strategies that address these weaknesses. Our models are the first to perform on par with the best tree-based baseline while outperforming recently published neural LTR models by a large margin. Our results can also serve as a benchmark to facilitate future improvement of neural LTR models.

1. INTRODUCTION

Neural approaches dominate many major machine learning domains, such as computer vision (He et al., 2015), natural language processing (Devlin et al., 2019), and speech recognition (Hannun et al., 2014). However, the effectiveness of neural approaches in traditional Learning-to-Rank (LTR), the long-established inter-disciplinary research area at the intersection of machine learning and information retrieval (Liu, 2009), is not widely acknowledged (Yang et al., 2019), especially on benchmark datasets that have only numerical features.

Historically, a series of LTR models were developed by researchers at Microsoft, starting with RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007), both based on neural networks, and culminating in LambdaMART (Wu et al., 2010), which is based on Gradient Boosted Decision Trees (GBDT); Burges (2010) provides an overview of this evolution. There are two publicly available implementations of LambdaMART: one provided by the RankLib¹ library, which is part of the Lemur Project (henceforth referred to as λMART RankLib); and the LightGBM² implementation provided by Microsoft (Ke et al., 2017) (henceforth referred to as λMART GBM). As we will show in Section 3, λMART GBM substantially outperforms λMART RankLib.

There is strong and continuing interest in neural ranking models, with numerous papers published in the last few years alone. Most of these papers treat RankNet and LambdaRank as weak baselines (Pang et al., 2020; Bruch et al., 2019b) and LambdaMART as the "state-of-the-art" (Bruch et al., 2019b; Li et al., 2019; Zhu & Klabjan, 2020; Hu et al., 2019). However, when examining these papers, we note that they either acknowledge that their models under-perform λMART GBM or claim state-of-the-art performance by comparing against the weaker λMART RankLib implementation. This inconsistency in performance evaluation on benchmark datasets has made it difficult to measure progress in the field (Lipton & Steinhardt, 2018). It therefore remains an open question whether neural LTR models are as effective as they claim to be, and how to improve them if they are not.



¹ https://sourceforge.net/p/lemur/wiki/RankLib/
² https://github.com/microsoft/LightGBM

