GRADIENT BOOSTING PERFORMS GAUSSIAN PROCESS INFERENCE

Abstract

This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain convergence to the posterior mean of a Gaussian process, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior, providing better knowledge uncertainty estimates through Monte Carlo estimation of the posterior variance. We empirically show that the proposed sampler yields better knowledge uncertainty estimates, leading to improved out-of-domain detection.

1. INTRODUCTION

Gradient boosting (Friedman, 2001) is a classic machine learning algorithm successfully used for web search, recommendation systems, weather forecasting, and other problems (Roe et al., 2005; Caruana & Niculescu-Mizil, 2006; Richardson et al., 2007; Wu et al., 2010; Burges, 2010; Zhang & Haghani, 2015). In a nutshell, gradient boosting methods iteratively combine simple models (usually decision trees) to minimize a given loss function. Despite the recent success of neural approaches in various areas, gradient-boosted decision trees (GBDT) are still state-of-the-art algorithms for tabular datasets containing heterogeneous features (Gorishniy et al., 2021; Katzir et al., 2021). This paper aims at a better theoretical understanding of GBDT methods for regression problems, assuming the widely used RMSE loss function. First, we show that gradient boosting with regularization can be reformulated as an optimization problem in some Reproducing Kernel Hilbert Space (RKHS) with an implicitly defined kernel structure. Having obtained this connection between GBDT and kernel methods, we introduce a technique for sampling from the prior Gaussian process distribution with the same kernel that defines the RKHS, so that the final output converges to a sample from the Gaussian process posterior. Without this technique, the output of GBDT can be viewed as the mean function of the Gaussian process posterior. Importantly, our theoretical analysis assumes the regularized gradient boosting procedure (Algorithm 2) without any simplifications: we only need the decision trees to be symmetric (oblivious) and properly randomized (Algorithm 1). These assumptions are non-restrictive and are satisfied in some popular gradient boosting implementations, e.g., CatBoost (Prokhorenkova et al., 2018).
Our experiments confirm that the proposed sampler from the Gaussian process posterior outperforms previous approaches (Malinin et al., 2021), providing better knowledge uncertainty estimates and improved out-of-domain detection.

2. BACKGROUND

Assume that we are given a distribution D over X × Y, where X ⊂ R^d is called a feature space and Y ⊂ R a target space. Further, assume that we are given a dataset z = {(x_i, y_i)}_{i=1}^N ⊂ X × Y of size N ≥ 1 sampled i.i.d. from D. Let us denote by ρ(dx) = ∫_Y D(dx, dy) the marginal distribution of x. W.l.o.g., we also assume that X = supp ρ = {x ∈ R^d : ∀ε > 0, ρ({x' ∈ R^d : ‖x' − x‖ < ε}) > 0}, which is a closed subset of R^d. Moreover, for technical reasons, we assume that (1/(2N)) Σ_{i=1}^N y_i² ≤ R² for some constant R > 0 almost surely, which can always be enforced by clipping. Throughout the paper, we also denote by x^N and y^N the matrix of all feature vectors and the vector of all targets, respectively.
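The boundedness assumption above can be enforced mechanically: clipping every target to [−R√2, R√2] guarantees (1/(2N)) Σ y_i² ≤ (1/(2N)) · N · 2R² = R². A minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def enforce_target_bound(y, R):
    """Clip each target into [-R*sqrt(2), R*sqrt(2)], which ensures
    (1/(2N)) * sum(y_i**2) <= (1/(2N)) * N * 2 * R**2 = R**2."""
    c = R * np.sqrt(2.0)
    return np.clip(y, -c, c)
```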

2.1. GRADIENT BOOSTED DECISION TREES

Given a loss function L : R² → R, a classic gradient boosting algorithm (Friedman, 2001) iteratively combines weak learners (usually decision trees) to reduce the average loss over the training set z: L(f) = E_z[L(f(x), y)]. At each iteration τ, the model is updated as f_τ(x) = f_{τ−1}(x) + ε w_τ(x), where w_τ(·) ∈ W is a weak learner chosen from some family of functions W and ε > 0 is a learning rate. The weak learner w_τ is usually chosen to approximate the negative gradient of the loss function −g_τ(x, y) := −∂L(s, y)/∂s |_{s = f_{τ−1}(x)}:

w_τ = argmin_{w ∈ W} E_z (−g_τ(x, y) − w(x))².  (1)

The family W usually consists of decision trees. In this case, the algorithm is called GBDT (Gradient Boosted Decision Trees). A decision tree is a model that recursively partitions the feature space into disjoint regions called leaves. Each leaf R_j of the tree is assigned a value, which is the estimated response y in the corresponding region. Thus, we can write w(x) = Σ_{j=1}^d θ_j 1_{x ∈ R_j}, so the decision tree is a linear function of the leaf values θ_j.

A recent paper by Ustimenko & Prokhorenkova (2021) proposes a modification of classic stochastic gradient boosting (SGB) called Stochastic Gradient Langevin Boosting (SGLB). SGLB combines gradient boosting with stochastic gradient Langevin dynamics to achieve global convergence even for non-convex loss functions. As a result, the obtained algorithm provably converges to a stationary distribution (invariant measure) concentrated near the global optimum of the loss function. We mention this method because it samples from a distribution similar to that of our method, but with a different kernel.
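To make the iteration concrete: for the squared-error loss, the negative gradient −g_τ is simply the residual y − f_{τ−1}(x), so each weak learner fits the current residuals. Below is a bare-bones sketch with depth-1 trees (stumps) on a single feature; it illustrates only the generic update f_τ = f_{τ−1} + ε w_τ, not the regularized, oblivious-tree procedure analyzed in this paper, and the function names are ours:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (one split) to residuals r by
    minimizing squared error over candidate thresholds of feature x."""
    best = (np.inf, None, None, None)  # (loss, threshold, left_val, right_val)
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        lv, rv = left.mean(), right.mean()
        loss = ((left - lv) ** 2).sum() + ((right - rv) ** 2).sum()
        if loss < best[0]:
            best = (loss, t, lv, rv)
    return best[1:]

def boost(x, y, n_iter=100, lr=0.1):
    """Gradient boosting for squared error: each weak learner w_tau
    approximates the negative gradient, i.e. the residual y - f(x)."""
    f = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(n_iter):
        t, lv, rv = fit_stump(x, y - f)        # w_tau ~ -g_tau
        pred = np.where(x <= t, lv, rv)
        f = f + lr * pred                      # f_tau = f_{tau-1} + eps * w_tau
        stumps.append((t, lv, rv))
    return f, stumps
```

On a simple step-function target, the training error shrinks geometrically with the number of iterations, since each stump removes a fixed fraction of the residual.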

2.2. ESTIMATING UNCERTAINTY

In addition to predictive quality, it is often important to detect when the system is uncertain and may make mistakes. For this, different measures of uncertainty can be used. There are two main sources of uncertainty: data uncertainty (a.k.a. aleatoric uncertainty) and knowledge uncertainty (a.k.a. epistemic uncertainty). Data uncertainty arises due to the inherent complexity of the data, such as additive noise or overlapping classes. For instance, if the target is distributed as y|x ∼ N(f(x), σ²(x)), then σ(x) reflects the level of data uncertainty. This uncertainty can be assessed if the model is probabilistic. Knowledge uncertainty arises when the model receives an input from a region that is either sparsely covered by the training data or far from it. Since the model does not have enough data in such a region, it is likely to make a mistake.

A standard approach to estimating knowledge uncertainty is based on ensembles (Gal, 2016; Malinin, 2019). Assume that we have trained an ensemble of several independent models. If all the models understand an input (low knowledge uncertainty), they will give similar predictions. However, for out-of-domain examples (high knowledge uncertainty), the models are likely to provide diverse predictions. For regression tasks, one can estimate knowledge uncertainty by measuring the variance of the predictions provided by multiple models (Malinin, 2019). Such ensemble-based approaches are standard for neural networks (Lakshminarayanan et al., 2017). Recently, ensembles were also tested for GBDT models (Malinin et al., 2021). The authors consider two ways of generating ensembles: ensembles of independent SGB models and ensembles of independent SGLB models. While the methods are empirically very similar, SGLB has better theoretical properties: the convergence of parameters to the stationary distribution allows one to sample models
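The ensemble-based estimate described above reduces to a one-line computation: stack the ensemble members' predictions and take the variance across models at each input. A minimal sketch (the function name is ours; any callable regressors can play the role of ensemble members):

```python
import numpy as np

def knowledge_uncertainty(models, x):
    """Estimate knowledge (epistemic) uncertainty at inputs x as the
    variance of predictions across independently trained models."""
    preds = np.stack([m(x) for m in models])  # shape: (n_models, n_inputs)
    return preds.var(axis=0)
```

For inputs where the models agree (in-domain), the variance is near zero; for inputs far from the training data, independently trained models tend to extrapolate differently, and the variance grows.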

