UNCERTAINTY IN GRADIENT BOOSTING VIA ENSEMBLES

Abstract

For many practical, high-risk applications, it is essential to quantify uncertainty in a model's predictions to avoid costly mistakes. While predictive uncertainty is widely studied for neural networks, the topic appears to be under-explored for models based on gradient boosting, even though gradient boosting often achieves state-of-the-art results on tabular data. This work examines a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. We conducted experiments on a range of synthetic and real datasets and investigated the applicability of ensemble approaches to gradient boosting models that are themselves ensembles of decision trees. Our analysis shows that ensembles of gradient boosting models successfully detect anomalous inputs while having limited ability to improve the predicted total uncertainty. Importantly, we also propose the concept of a virtual ensemble, which obtains the benefits of an ensemble from a single gradient boosting model and thereby significantly reduces computational complexity.

1. INTRODUCTION

Gradient boosting (Friedman, 2001) is a widely used machine learning algorithm that achieves state-of-the-art results on tasks with heterogeneous features, complex dependencies, and noisy data: web search, recommendation systems, weather forecasting, and many others (Burges, 2010; Caruana & Niculescu-Mizil, 2006; Richardson et al., 2007; Roe et al., 2005; Wu et al., 2010; Zhang & Haghani, 2015). Gradient boosting based on decision trees (GBDT) underlies such well-known libraries as XGBoost, LightGBM, and CatBoost. In this paper, we investigate the estimation of predictive uncertainty in GBDT models.

Uncertainty estimation is crucial for avoiding costly mistakes in high-risk applications such as autonomous driving, medical diagnostics, and financial forecasting. For example, a self-driving car must know when its AI-pilot is confident in its ability to drive and when it is not, so as to avoid a fatal collision. In financial forecasting and medical diagnostics, mistakes by an AI forecasting or diagnostic system could lead to large financial or reputational losses, or even to loss of life. Crucially, both financial and medical data are often represented in heterogeneous tabular form, exactly the kind of data to which GBDTs are typically applied, which highlights the relevance of our work on obtaining uncertainty estimates for GBDT models.

Approximate Bayesian approaches for uncertainty estimation have been extensively studied for neural network models (Gal, 2016; Malinin, 2019). Bayesian methods for tree-based models (Chipman et al., 2010; Linero, 2017) have also been widely studied in the literature, but this research did not explicitly focus on uncertainty estimation and its applications. Some related work was done by Coulston et al. (2016) and Shaker & Hüllermeier (2020), who examined quantifying predictive uncertainty for random forests.
However, the area has otherwise remained relatively under-explored, especially for GBDT models, which are widely used in practice and known to outperform other approaches based on tree ensembles. While for classification problems GBDT models already return a distribution over class labels, for regression tasks they typically yield only point predictions. Recently, this problem was addressed by the NGBoost algorithm (Duan et al., 2020), where a GBDT model is trained to return the mean and variance of a normal distribution over the target variable y for a given feature vector. However, such models capture only data uncertainty (Gal, 2016; Malinin, 2019), also known as aleatoric uncertainty, which arises due to inherent class overlap or noise in the data. They do not quantify uncertainty due to the model's inherent lack of knowledge about inputs from regions either far from the training data or sparsely covered by it, known as knowledge uncertainty, or epistemic uncertainty (Gal, 2016; Malinin, 2019).

One class of approaches for capturing knowledge uncertainty is Bayesian ensemble methods, which have recently become popular for estimating predictive uncertainty in neural networks (Depeweg et al., 2017; Gal & Ghahramani, 2016; Kendall et al., 2018; Lakshminarayanan et al., 2017; Maddox et al., 2019; Smith & Gal, 2018). A key feature of ensemble approaches is that they allow overall uncertainty to be decomposed into data uncertainty and knowledge uncertainty within an interpretable probabilistic framework (Depeweg et al., 2017; Gal, 2016; Malinin, 2019). Ensembles are also known to yield improvements in predictive performance.

This work examines ensemble-based uncertainty estimation for GBDT models. The contributions are as follows. First, we consider generating ensembles using both classical Stochastic Gradient Boosting (SGB) and the recently proposed Stochastic Gradient Langevin Boosting (SGLB) (Ustimenko & Prokhorenkova, 2020).
Importantly, SGLB allows us to guarantee that the models are asymptotically sampled from the true Bayesian posterior. Second, we show that using SGLB we can construct a virtual ensemble from only one gradient boosting model, significantly reducing the computational complexity. Third, to understand the properties of ensemble-based uncertainty estimation in GBDT models, we conduct an extensive analysis on several synthetic datasets. Finally, we evaluate the proposed approach on a range of real regression and classification datasets. Our results show that this approach successfully enables the detection of anomalous out-of-domain inputs. Importantly, our solution is easy to combine with any implementation of GBDT. Our methods have been implemented within the open-source CatBoost library. The code of our experiments is publicly available at https://github.com/yandex-research/GBDT-uncertainty.
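The virtual-ensemble idea can be illustrated without any library support: a trained GBDT model with T trees implicitly contains T truncated sub-models, so several truncations of one boosting trajectory can serve as ensemble members at no additional training cost. Below is a minimal NumPy sketch of this construction; the per-tree outputs are fabricated stand-ins for a real model's trees, and the particular truncation points are illustrative only, not the scheme used in the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated per-tree outputs of a single boosted regression model with
# T = 1000 trees on 5 inputs; the full model prediction is the sum over trees.
tree_outputs = rng.normal(scale=0.1, size=(1000, 5))

# Running prediction after each boosting iteration (the boosting trajectory).
trajectory = np.cumsum(tree_outputs, axis=0)

# A virtual ensemble: treat the model truncated at several late iterations
# as distinct members, skipping the early (not yet converged) iterations.
member_iters = [600, 700, 800, 900, 999]
members = trajectory[member_iters]  # shape: (n_members, n_inputs)

# Disagreement between the truncated models estimates knowledge uncertainty.
ensemble_mean = members.mean(axis=0)
knowledge_uncertainty = members.var(axis=0)
```

For probabilistic regression models that also predict a noise variance, the average predicted variance across members would additionally estimate data uncertainty, with the two terms combining via the law of total variance.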

2. PRELIMINARIES

Uncertainty Estimation via Bayesian Ensembles. In this work, we consider uncertainty estimation within the standard Bayesian ensemble-based framework (Gal, 2016; Malinin, 2019). Here, model parameters θ are treated as random variables, and a prior p(θ) is placed over them to compute a posterior p(θ|D) via Bayes' rule:

p(θ|D) = p(D|θ)p(θ) / p(D),    (1)

where D = {(x^(i), y^(i))}_{i=1}^{N} is the training dataset. Each set of parameters can be considered a hypothesis, or explanation, of how the world works. Samples from the posterior should yield explanations consistent with the observations of the world contained within the training data D. However, on data far from D, different sets of parameters can yield different predictions. Therefore, estimates of knowledge uncertainty can be obtained by examining the diversity of predictions.

Consider an ensemble of probabilistic models {P(y|x; θ^(m))}_{m=1}^{M} sampled from the posterior p(θ|D). Each model P(y|x; θ^(m)) yields a different estimate of data uncertainty, represented by the entropy of its predictive distribution (Malinin, 2019). Uncertainty in predictions due to knowledge uncertainty is expressed as the level of spread, or "disagreement", of models in the ensemble (Malinin, 2019). Note that exact Bayesian inference is often intractable, and it is common to consider either an explicit or implicit approximation q(θ) to the true posterior p(θ|D). While a range of approximations has been explored for neural network models (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Maddox et al., 2019), to the best of our knowledge, limited work



A full overview is available in (Ashukha et al., 2020; Ovadia et al., 2019).
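The decomposition described above is straightforward to compute from an ensemble's outputs. The sketch below uses fabricated predictions to show both cases: for classification, total uncertainty is the entropy of the ensemble-averaged distribution, expected data uncertainty is the average per-model entropy, and their difference (a mutual information between the prediction and the parameters) measures the disagreement that signals knowledge uncertainty; for regression with per-model predicted means and variances, the law of total variance gives the analogous split. All numbers are invented for illustration.

```python
import math
import statistics

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# --- Classification: fabricated predictions of M = 3 models over 3 classes ---
ensemble = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],
]
M = len(ensemble)
# Ensemble predictive distribution: average of the members.
p_bar = [sum(p[c] for p in ensemble) / M for c in range(3)]

total = entropy(p_bar)                        # total uncertainty
data = sum(entropy(p) for p in ensemble) / M  # expected data uncertainty
knowledge = total - data                      # mutual information: disagreement

# --- Regression: fabricated per-model means and variances of p(y|x) ---
means = [2.1, 1.9, 2.3, 2.0]
variances = [0.40, 0.35, 0.50, 0.45]
data_var = statistics.fmean(variances)       # expected data uncertainty
knowledge_var = statistics.pvariance(means)  # spread of the predicted means
total_var = data_var + knowledge_var         # law of total variance
```

In the classification example the three models confidently disagree, so the knowledge-uncertainty term is large relative to the per-model entropies; for an in-domain input the members would agree and the term would shrink toward zero.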

