GRADIENT BOOSTING PERFORMS GAUSSIAN PROCESS INFERENCE

Abstract

This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain convergence to the posterior mean of a Gaussian process, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior and to estimate knowledge uncertainty through Monte Carlo estimation of the posterior variance. We show that the proposed sampler provides better knowledge uncertainty estimates, leading to improved out-of-domain detection.

2. BACKGROUND

Assume that we are given a distribution $\mathcal{D}$ over $X \times Y$, where $X \subset \mathbb{R}^d$ is called a feature space and $Y \subset \mathbb{R}$ a target space. Further assume that we are given a dataset $z = \{(x_i, y_i)\}_{i=1}^N \subset X \times Y$ of size $N \ge 1$ sampled i.i.d. from $\mathcal{D}$. Let us denote by $\rho(dx) = \int_Y \mathcal{D}(dx, dy)$ the marginal distribution of features. W.l.o.g., we also assume that $X = \operatorname{supp} \rho = \{x \in \mathbb{R}^d : \forall \varepsilon > 0,\ \rho(\{x' \in \mathbb{R}^d : \|x' - x\|_{\mathbb{R}^d} < \varepsilon\}) > 0\}$, which is a closed subset of $\mathbb{R}^d$. Moreover, for technical reasons, we assume that $\frac{1}{2N}\sum_{i=1}^N y_i^2 \le R^2$ for some constant $R > 0$ almost surely, which can always be enforced by clipping. Throughout the paper, we denote by $x_N$ the matrix of all feature vectors and by $y_N$ the vector of all targets.

2.1. GRADIENT BOOSTED DECISION TREES

Given a loss function $L : \mathbb{R}^2 \to \mathbb{R}$, a classic gradient boosting algorithm (Friedman, 2001) iteratively combines weak learners (usually decision trees) to reduce the average loss over the train set $z$: $\mathcal{L}(f) = \mathbb{E}_z[L(f(x), y)]$. At each iteration $\tau$, the model is updated as $f_\tau(x) = f_{\tau-1}(x) + \epsilon\, w_\tau(x)$, where $w_\tau(\cdot) \in \mathcal{W}$ is a weak learner chosen from some family of functions $\mathcal{W}$, and $\epsilon$ is a learning rate. The weak learner $w_\tau$ is usually chosen to approximate the negative gradient of the loss function $-g_\tau(x, y) := -\frac{\partial L(s, y)}{\partial s}\big|_{s = f_{\tau-1}(x)}$:

$w_\tau = \arg\min_{w \in \mathcal{W}} \mathbb{E}_z \big(-g_\tau(x, y) - w(x)\big)^2. \quad (1)$

The family $\mathcal{W}$ usually consists of decision trees. In this case, the algorithm is called GBDT (Gradient Boosted Decision Trees). A decision tree is a model that recursively partitions the feature space into disjoint regions called leaves. Each leaf $R_j$ of the tree is assigned a value, which is the estimated response $y$ in the corresponding region. Thus, we can write $w(x) = \sum_j \theta_j \mathbf{1}_{\{x \in R_j\}}$, so the decision tree is a linear function of the leaf values $\theta_j$. A recent paper by Ustimenko & Prokhorenkova (2021) proposes a modification of classic stochastic gradient boosting (SGB) called Stochastic Gradient Langevin Boosting (SGLB). SGLB combines gradient boosting with stochastic gradient Langevin dynamics to achieve global convergence even for non-convex loss functions. As a result, the obtained algorithm provably converges to a stationary distribution (invariant measure) concentrated near the global optimum of the loss function. We mention this method because it samples from a distribution similar to the one targeted by our method, but with a different kernel.
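To make the update rule concrete, here is a minimal sketch (not the paper's implementation) of gradient boosting for the squared loss, where each weak learner is a depth-1 stump fit to the current residuals, i.e., the negative gradients:

```python
import numpy as np

def fit_stump(x, r):
    """Find the threshold split of a 1-D feature that best fits the
    residuals r in the least-squares sense (a depth-1 decision tree)."""
    best = None
    for t in np.unique(x)[:-1]:
        left = x <= t
        pred = np.where(left, r[left].mean(), r[~left].mean())
        err = ((r - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, r[left].mean(), r[~left].mean())
    _, t, vl, vr = best
    return lambda q: np.where(q <= t, vl, vr)

def boost(x, y, n_iter=200, lr=0.1):
    """Classic gradient boosting for the squared loss: each stump is fit
    to the current residuals y - f, the negative gradient direction."""
    stumps = []
    f = np.zeros_like(y)
    for _ in range(n_iter):
        r = y - f                      # negative gradient of 1/2 (f - y)^2
        s = fit_stump(x, r)
        stumps.append(s)
        f = f + lr * s(x)
    return lambda q: lr * sum(s(q) for s in stumps)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 200)
y_train = np.sin(4 * x_train)
model = boost(x_train, y_train)
train_mse = ((model(x_train) - y_train) ** 2).mean()
```

Since each stump is a least-squares fit of the residuals, the training error is non-increasing for shrinkage rates in $(0, 2)$.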

2.2. ESTIMATING UNCERTAINTY

In addition to the predictive quality, it is often important to detect when the system is uncertain and can be mistaken. For this, different measures of uncertainty can be used. There are two main sources of uncertainty: data uncertainty (a.k.a. aleatoric uncertainty) and knowledge uncertainty (a.k.a. epistemic uncertainty). Data uncertainty arises due to the inherent complexity of the data, such as additive noise or overlapping classes. For instance, if the target is distributed as $y|x \sim \mathcal{N}(f(x), \sigma^2(x))$, then $\sigma(x)$ reflects the level of data uncertainty. This uncertainty can be assessed if the model is probabilistic. Knowledge uncertainty arises when the model gets an input from a region either sparsely covered by or far from the training data. Since the model does not have enough data in this region, it will likely make a mistake. A standard approach to estimating knowledge uncertainty is based on ensembles (Gal, 2016; Malinin, 2019). Assume that we have trained an ensemble of several independent models. If all the models understand an input (low knowledge uncertainty), they will give similar predictions. However, for out-of-domain examples (high knowledge uncertainty), the models are likely to provide diverse predictions. For regression tasks, one can obtain knowledge uncertainty by measuring the variance of the predictions provided by multiple models (Malinin, 2019). The SGLB algorithm discussed above is convenient here, since its models can be viewed as samples from a particular posterior distribution. One drawback of SGLB is that its convergence rate is unknown, as its convergence proof is asymptotic. The rate can be upper bounded by that of Stochastic Gradient Langevin Dynamics for log-concave functions, e.g., Zou et al. (2021), which is not dimension-free. In contrast, our rate is dimension-free and scales linearly with the inverse precision.
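As an illustration of the ensemble recipe (with hypothetical stand-in models rather than trained ones), knowledge uncertainty for regression can be taken as the variance of the member predictions:

```python
import numpy as np

# Three hypothetical regression 'models' that agree near small x
# (in-domain) and drift apart far from it (out-of-domain).
models = [lambda x, a=a: np.sin(a * x) for a in (3.9, 4.0, 4.1)]

def knowledge_uncertainty(x):
    """Variance of ensemble member predictions at points x."""
    preds = np.stack([m(x) for m in models])
    return preds.var(axis=0)

u_in = knowledge_uncertainty(np.array([0.1]))[0]   # members agree: small
u_out = knowledge_uncertainty(np.array([5.0]))[0]  # members diverge: large
```

The same variance-of-predictions statistic is what the experiments in Section 5 use as the uncertainty score.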

2.3. GAUSSIAN PROCESS INFERENCE

In this section, we briefly describe the basics of Bayesian inference with Gaussian processes, which is closely related to our analysis. A random variable $f$ with values in $L^2(\rho)$ is said to be a Gaussian process $f \sim \mathrm{GP}(f_0, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$ with covariance defined via a kernel $K(x, x') = \mathrm{cov}(f(x), f(x'))$ and mean value $f_0 \in L^2(\rho)$ iff for all $g \in L^2(\rho)$ we have

$\int_X f(x)g(x)\rho(dx) \sim \mathcal{N}\Big(\int_X f_0(x)g(x)\rho(dx),\ \sigma^2 \int_{X \times X} g(x)K(x, x')g(x')\,\rho(dx) \otimes \rho(dx') + \delta^2 \|g\|^2_{L^2}\Big).$

A typical Gaussian process setup is to assume that the target $y|x$ is distributed as $\mathrm{GP}(0_{L^2(\rho)}, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$ for some kernel function¹ $K(x, x')$ with scales $\sigma > 0$ and $\delta > 0$: $y|x = f(x)$, $f \sim \mathrm{GP}(0_{L^2(\rho)}, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$. The posterior distribution $f(x)|x, x_N, y_N$ is again a Gaussian process $\mathrm{GP}(f_*, \sigma^2 \hat{K} + \delta^2 \mathrm{Id}_{L^2})$ with mean and covariance given by (see Rasmussen & Williams (2006)):

$f^\lambda_*(x) = K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} y_N,$
$\hat{K}(x, x) = K(x, x) - K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} K(x_N, x),$

with $\lambda = \frac{\delta^2}{\sigma^2}$. To estimate the posterior mean $f^\lambda_*(x) = \int_{\mathbb{R}} f\, p(f|x, x_N, y_N)\, df$, we can use the maximum a posteriori probability (MAP) estimate, the unique solution of the following Kernel Ridge Regression (KRR) problem:

$\mathcal{L}(f) = \frac{1}{2N}\sum_{i=1}^N \big(f(x_i) - y_i\big)^2 + \frac{\lambda}{2N}\|f\|^2_{\mathcal{H}} \to \min_{f \in \mathcal{H}},$

where $\mathcal{H} = \overline{\mathrm{span}}\{K(\cdot, x) \mid x \in X\} \subset L^2(\rho)$ is the reproducing kernel Hilbert space (RKHS) for the kernel $K(\cdot, \cdot)$, and $\mathcal{L}(f) \to \min_{f \in \mathcal{H}}$ means that we are seeking minimizers of the function $\mathcal{L}(f)$. We refer to Appendix D for the details on KRR and RKHS. To solve the KRR problem, one can apply the gradient descent (GD) method in the functional space:

$f_{\tau+1} = \Big(1 - \frac{\epsilon\lambda}{N}\Big) f_\tau - \frac{\epsilon}{N}\sum_{i=1}^N \big(f_\tau(x_i) - y_i\big) K_{x_i}, \quad f_0 = 0_{L^2(\rho)}, \quad (2)$

where $\epsilon > 0$ is a learning rate. Since this objective is strongly convex due to the regularization, gradient descent rapidly converges to the unique optimum: $f^\lambda_* = \lim_{\tau \to \infty} f_\tau = K(\cdot, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} y_N$; see Appendices C and E for the details.
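Since every GD iterate stays in the span of $\{K(\cdot, x_i)\}$, the functional update reduces to an update on coefficient vectors. A small numerical sketch (with an RBF kernel standing in for $K$, an assumption for illustration, not the boosting kernel of this paper) shows the iterates approaching the closed-form KRR solution:

```python
import numpy as np

def rbf(a, b, gamma=10.0):
    """RBF kernel matrix between 1-D point sets a and b (a stand-in kernel)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(4 * x) + 0.1 * rng.normal(size=50)

N, lam, eps = len(x), 0.1, 0.5
K = rbf(x, x)

# Writing f = sum_i alpha_i K(., x_i), the functional GD update
#   f <- (1 - eps*lam/N) f - (eps/N) sum_i (f(x_i) - y_i) K_{x_i}
# becomes a linear update of the coefficient vector alpha.
alpha = np.zeros(N)
for _ in range(5000):
    alpha = (1 - eps * lam / N) * alpha - (eps / N) * (K @ alpha - y)

alpha_star = np.linalg.solve(K + lam * np.eye(N), y)   # closed-form KRR
gap = np.max(np.abs(K @ alpha - K @ alpha_star))       # agreement at x_N
```

Strong convexity from the ridge term makes the coefficient iteration a contraction, so `gap` shrinks geometrically with the number of steps.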
Gradient descent guides $f_\tau$ to the posterior mean $f^\lambda_*$ of the Gaussian process with kernel $\sigma^2 K + \delta^2 \mathrm{Id}_{L^2}$. To obtain the posterior variance estimate $\hat{K}(x, x)$ for any $x$, one can use sampling and introduce a source of randomness in the above iterative scheme as follows:

1. sample $f_{\mathrm{init}} \sim \mathrm{GP}(0_{L^2(\rho)}, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$;
2. set new labels $y^{\mathrm{new}}_N = y_N - f_{\mathrm{init}}(x_N)$;
3. fit GD $f_\tau(\cdot)$ on $y^{\mathrm{new}}_N$ assuming $f_0(\cdot) = 0_{L^2(\rho)}$;
4. output $f_{\mathrm{init}}(\cdot) + f_\tau(\cdot)$ as the final model.

This method is known as Sample-then-Optimize (Matthews et al., 2017) and is widely adopted for Bayesian inference. As $\tau \to \infty$, we get $f_{\mathrm{init}} + f_\infty$ distributed as a Gaussian process posterior with the desired mean and variance. Formally, the following result holds:

Lemma 2.1. $f_{\mathrm{init}} + f_\infty$ follows the Gaussian process posterior $\mathrm{GP}(f^\lambda_*, \sigma^2 \hat{K} + \delta^2 \mathrm{Id}_{L^2})$ with:

$f^\lambda_*(x) = K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} y_N,$
$\hat{K}(x, x) = K(x, x) - K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} K(x_N, x).$

¹ $K(x, x')$ is a kernel function if the matrix $\big(K(x_i, x_j)\big)_{i,j=1}^N$ is positive semi-definite for any $N \ge 1$ and any $x_i \in X$ almost surely.
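A quick Monte Carlo check of the sample-then-optimize scheme (again with an RBF stand-in kernel, and with the GD limit replaced by its closed form, which the iterations converge to): averaging many randomized solutions recovers the posterior mean.

```python
import numpy as np

def rbf(a, b, gamma=10.0):
    """Stand-in prior kernel K (an assumption for illustration)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)          # training inputs x_N
y = np.sin(4 * x)                  # training targets y_N
xs = np.linspace(0, 1, 15)         # query points

sigma, delta = 1.0, 0.3
lam = delta**2 / sigma**2
A = np.linalg.inv(rbf(x, x) + lam * np.eye(len(x)))
post_mean = rbf(xs, x) @ A @ y     # closed-form posterior mean at xs

# Sample-then-optimize: draw f_init from the prior jointly at train and
# query points, refit on y_N - f_init(x_N), and output f_init + fit.
xa = np.concatenate([x, xs])
prior_cov = sigma**2 * rbf(xa, xa) + delta**2 * np.eye(len(xa))
samples = []
for _ in range(2000):
    f_init = rng.multivariate_normal(np.zeros(len(xa)), prior_cov)
    fit = rbf(xs, x) @ A @ (y - f_init[:len(x)])    # GD limit = KRR form
    samples.append(f_init[len(x):] + fit)
samples = np.array(samples)
mc_mean = samples.mean(axis=0)     # should approach post_mean
```

The per-point sample variance of `samples` likewise estimates the posterior variance of Lemma 2.1, which is how the paper's method turns a point predictor into an uncertainty estimator.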

3.1. PRELIMINARIES

In our analysis, we assume that we are given a finite set $\mathcal{V}$ of weak learners used for the gradient boosting. Each $\nu \in \mathcal{V}$ corresponds to a decision tree that defines a partition of the feature space into disjoint regions (leaves). For each $\nu \in \mathcal{V}$, we denote the number of leaves in the tree by $L_\nu \ge 1$. Also, let $\phi_\nu : X \to \{0, 1\}^{L_\nu}$ be a mapping that maps $x$ to the vector indicating its leaf index in the tree $\nu$. This mapping defines a decomposition of $X$ into the disjoint union $X = \sqcup_{j=1}^{L_\nu} \{x \in X \mid \phi^{(j)}_\nu(x) = 1\}$. Having $\phi_\nu$, we define a weak learner associated with it as $x \mapsto \langle \theta, \phi_\nu(x)\rangle_{\mathbb{R}^{L_\nu}}$ for any choice of $\theta \in \mathbb{R}^{L_\nu}$, which we refer to as 'leaf values'. In other words, $\theta$ corresponds to the predictions assigned to each region of the space defined by $\nu$. Let us define a linear space $\mathcal{F} \subset L^2(\rho)$ of all possible ensembles of trees from $\mathcal{V}$: $\mathcal{F} = \mathrm{span}\{\phi^{(j)}_\nu(\cdot) : X \to \{0, 1\} \mid \nu \in \mathcal{V},\ j \in \{1, \ldots, L_\nu\}\}$. We note that the space $\mathcal{F}$ can be data-dependent since $\mathcal{V}$ may depend on $z$, but we omit this dependence in the notation for simplicity. Note that we do not take the closure w.r.t. the topology of $L^2(\rho)$, since we assume that $\mathcal{V}$ is finite, and therefore $\mathcal{F}$ is finite-dimensional and thus topologically closed.

3.2. GBDT ALGORITHM UNDER CONSIDERATION

Our theoretical analysis holds for the classic GBDT algorithm discussed in Section 2.1 equipped with the regularization from Ustimenko & Prokhorenkova (2021). The only requirement we need is that the procedure of choosing each new tree has to be properly randomized. Let us discuss the tree selection algorithm that we assume in our analysis. Each new tree approximates the gradients of the loss function with respect to the current predictions of the model. Since we consider the RMSE loss function, the gradients are proportional to the residuals $r_j = y_j - f(x_j)$, where $f$ is the currently built model. The tree structure is defined by the features and the corresponding thresholds used to split the space. The analysis in this paper assumes the SampleTree procedure (see Algorithm 1), which is a classic approach equipped with proper randomization. SampleTree builds an oblivious decision tree (Prokhorenkova et al., 2018), i.e., all nodes at a given level share the same splitting criterion (feature and threshold).³ To limit the number of candidate splits, each feature is quantized into $n + 1$ bins. In other words, for each feature, we have $n$ thresholds that can be chosen arbitrarily.⁴ The maximum tree depth is limited by $m$. Recall that we denote the set of all possible tree structures by $\mathcal{V}$. We build the tree in a top-down greedy manner. At each step, we choose one split among all the remaining candidates based on the following score defined for $\nu \in \mathcal{V}$ and residuals $r$:

$D(\nu, r) := \frac{1}{N} \sum_{j=1}^{L_\nu} \frac{\big(\sum_{i=1}^N \phi^{(j)}_\nu(x_i)\, r_i\big)^2}{\sum_{i=1}^N \phi^{(j)}_\nu(x_i)}. \quad (3)$

Algorithm 1 SampleTree(r; m, n, β)
input: residuals $r = (r_i)_{i=1}^N$
output: oblivious tree structure $\nu \in \mathcal{V}$
hyper-parameters: number of feature splits $n$, max. tree depth $m$, random strength $\beta \in [0, \infty)$
definitions: $S = \{(j, k) \mid j \in \{1, \ldots, d\},\ k \in \{1, \ldots, n\}\}$ — indices of all possible splits
instructions:
initialize $i = 0$, $\nu_0 = \emptyset$, $S^{(0)} = S$
repeat
  sample $(u_i(s))_{s \in S^{(i)}} \sim U[0, 1]^{nd - i}$
  choose the next split as $\{s_{i+1}\} = \arg\max_{s \in S^{(i)}} \big[ D\big((\nu_i, s), r\big) - \beta \log(-\log u_i(s)) \big]$
  update the tree: $\nu_{i+1} = (\nu_i, s_{i+1})$
  update the candidate splits: $S^{(i+1)} = S^{(i)} \setminus \{s_{i+1}\}$
  $i = i + 1$
until $i \ge m$ or $S^{(i)} = \emptyset$
return: $\nu_i$

Algorithm 2 TrainGBDT(z; ε, T, m, n, β, λ)
input: dataset $z = (x_N, y_N)$
hyper-parameters: learning rate $\epsilon > 0$, regularization $\lambda > 0$, iterations of boosting $T$, parameters of SampleTree $m$, $n$, $\beta$
instructions:
initialize $\tau = 0$, $f_0(\cdot) = 0_{L^2(\rho)}$
repeat
  $r_\tau = y_N - f_\tau(x_N)$ — compute the residuals
  $\nu_\tau = \mathrm{SampleTree}(r_\tau; m, n, \beta)$ — construct a tree
  $\theta_\tau = \Big( \frac{\sum_{i=1}^N \phi^{(j)}_{\nu_\tau}(x_i)\, r^{(i)}_\tau}{\sum_{i=1}^N \phi^{(j)}_{\nu_\tau}(x_i)} \Big)_{j=1}^{L_{\nu_\tau}}$ — set the values in the leaves
  $f_{\tau+1}(\cdot) = \big(1 - \frac{\epsilon\lambda}{N}\big) f_\tau(\cdot) + \epsilon \langle \phi_{\nu_\tau}(\cdot), \theta_\tau \rangle_{\mathbb{R}^{L_{\nu_\tau}}}$ — update the model
  $\tau = \tau + 1$
until $\tau \ge T$
return: $f_T(\cdot)$

In classic gradient boosting, one builds a tree recursively by choosing the split $s$ that maximizes the score $D((\nu_i, s), r)$ (Ibragimov & Gusev, 2019).⁵ Random noise is often added to the scores to improve generalization. In SampleTree, we choose a split that maximizes $D((\nu_i, s), r) + \varepsilon_s$, where $\varepsilon_s \sim \mathrm{Gumbel}(0, \beta)$. Here $\beta$ is the random strength: $\beta = 0$ gives the standard greedy approach, while $\beta \to \infty$ gives the uniform distribution among all possible split candidates. To sum up, SampleTree is a classic oblivious tree construction but with added random noise. We do this to make the distribution of trees regular in a certain sense: roughly speaking, the distributions should stabilize with iterations by converging to some fixed distribution. Given the algorithm SampleTree, the gradient boosting procedure assumed in our analysis is described in Algorithm 2. It is a classic GBDT algorithm but with the update rule $f_{\tau+1}(\cdot) = (1 - \epsilon\lambda/N)\, f_\tau(\cdot) + \epsilon\, w_\tau(\cdot)$. In other words, we shrink the model at each iteration, which serves as regularization (Ustimenko & Prokhorenkova, 2021).
Such shrinkage is available, e.g., in the CatBoost library.
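The split-selection loop of SampleTree can be sketched as follows (a simplified illustration, not the paper's code: thresholds are taken at feature quantiles here, and the leaf bookkeeping is kept minimal):

```python
import numpy as np

def score(leaf_ids, r):
    """The score D(nu, r): sum over leaves of (sum of residuals)^2 / |leaf|,
    divided by N (cf. Eq. (3))."""
    s = 0.0
    for leaf in np.unique(leaf_ids):
        m = leaf_ids == leaf
        s += (r[m].sum() ** 2) / m.sum()
    return s / len(r)

def sample_oblivious_tree(X, r, depth=3, n_bins=8, beta=0.1, rng=None):
    """Greedy oblivious-tree construction with Gumbel-perturbed scores,
    in the spirit of SampleTree (a sketch under stated assumptions)."""
    rng = rng or np.random.default_rng()
    N, d = X.shape
    # candidate splits: quantile-based thresholds per feature
    cands = [(j, t) for j in range(d)
             for t in np.quantile(X[:, j], np.linspace(0, 1, n_bins + 2)[1:-1])]
    leaf_ids = np.zeros(N, dtype=int)
    splits = []
    for _ in range(depth):
        noise = -beta * np.log(-np.log(rng.uniform(size=len(cands))))  # Gumbel(0, beta)
        scores = np.array([score(2 * leaf_ids + (X[:, j] > t), r)
                           for j, t in cands])
        k = int(np.argmax(scores + noise))         # perturbed greedy argmax
        j, t = cands.pop(k)                        # S^(i+1) = S^(i) \ {s}
        splits.append((j, t))
        leaf_ids = 2 * leaf_ids + (X[:, j] > t)    # oblivious: same split at every node
    return splits, leaf_ids

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
res = rng.normal(size=100)
splits, leaf_ids = sample_oblivious_tree(X, res, rng=rng)
```

Setting `beta=0` recovers the plain greedy choice, while large `beta` makes the perturbed argmax essentially uniform over the remaining candidates, matching the two limiting regimes described above.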

3.3. DISTRIBUTION OF TREES

The SampleTree algorithm induces a local family of distributions $p(\cdot|f, \beta)$ for each $f \in \mathcal{F}$: $p(d\nu|f, \beta) = \mathbb{P}\big(\mathrm{SampleTree}(y_N - f(x_N); m, n, \beta) \in d\nu\big)$.

Remark 3.1. Lemma D.5 ensures that this distribution coincides with the one where we use $f_*(x_N)$ instead of $y_N$. This is due to the fact that $D\big(\nu, y_N - f(x_N)\big) = D\big(\nu, f_*(x_N) - f(x_N)\big)$ for all $\nu \in \mathcal{V}$, $f \in \mathcal{F}$.

The next lemma (see Appendix F for the proof) describes the distribution $p(d\nu|f, \beta)$ as a sum over all permutations $\varsigma \in \mathcal{P}_m$, where $\nu_{\varsigma,i} = (s_{\varsigma(1)}, \ldots, s_{\varsigma(i)})$ and $\nu = (s_1, \ldots, s_m)$. Let us define the stationary distribution of trees as $\pi(\nu) = \lim_{\beta \to \infty} p(\nu|f, \beta) = \frac{(nd - m)!\, m!}{(nd)!}$.
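If the stationary probability is $(nd - m)!\, m!/(nd)!$, it equals $1/\binom{nd}{m}$, i.e., $\pi$ is the uniform distribution over unordered sets of $m$ distinct splits drawn from the $nd$ candidates. A quick check with hypothetical sizes $nd = 12$, $m = 4$:

```python
from math import comb, factorial

nd, m = 12, 4   # hypothetical: nd candidate splits, tree depth m
pi = factorial(nd - m) * factorial(m) / factorial(nd)
uniform = 1 / comb(nd, m)   # uniform over m-subsets of nd splits
```
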

3.4. RKHS STRUCTURE

In this section, we describe the evolution of GBDT in a certain Reproducing Kernel Hilbert Space (RKHS). Even though the problem is finite-dimensional, treating it as functional regression is more beneficial: the dimension of the space of ensembles grows rapidly, and we want dimension-free constants, which is impossible if we treat it as a finite-dimensional optimization problem. Let us start with defining the necessary kernels. For convenience, we also provide a diagram illustrating the introduced kernels and the relations between them in Appendix A.

Definition 3.4. A weak learner's kernel $k_\nu(\cdot, \cdot)$ is a kernel function associated with a tree structure $\nu \in \mathcal{V}$, defined as

$k_\nu(x, x') = \sum_{j=1}^{L_\nu} w^{(j)}_\nu\, \phi^{(j)}_\nu(x)\, \phi^{(j)}_\nu(x'), \quad w^{(j)}_\nu = \frac{N}{\max\{N^{(j)}_\nu, 1\}}, \quad N^{(j)}_\nu = \sum_{i=1}^N \phi^{(j)}_\nu(x_i).$

This weak learner's kernel is a building block for any other kernel appearing in boosting and is used to define the iterations of the boosting algorithm analytically.

Definition 3.5. We also define a greedy kernel of the gradient boosting algorithm as follows: $K_f(x, x') = \sum_{\nu \in \mathcal{V}} k_\nu(x, x')\, p(\nu|f, \beta)$.

The greedy kernel is the kernel that guides the GBDT iterations: we can think of each iteration as SGD with the kernel from Definition 3.5, where the kernel from Definition 3.4 serves as a stochastic estimator of the Fréchet derivative in the RKHS defined by the kernel from Definition 3.5.

Definition 3.6. Finally, there is a stationary kernel $K(x, x')$ that is independent of $f$: $K(x, x') = \sum_{\nu \in \mathcal{V}} k_\nu(x, x')\, \pi(\nu)$, which we call a prior kernel of the gradient boosting.

This kernel defines the limiting solution since the gradient projection on the RKHS converges to zero, and thus the kernel from Definition 3.5 converges to the one from Definition 3.6. Note that $\mathcal{F} = \mathrm{span}\{K(\cdot, x) \mid x \in X\}$. Having the space of functions $\mathcal{F}$, we define an RKHS structure $\mathcal{H} = (\mathcal{F}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ on it using a scalar product defined by $\langle f, K(\cdot, x)\rangle_{\mathcal{H}} = f(x)$. Now, let us define the empirical error of a model $f$:

$\mathcal{L}(f, \lambda) = \frac{1}{2N}\|y_N - f(x_N)\|^2_{\mathbb{R}^N} + \frac{\lambda}{2N}\|f\|^2_{\mathcal{H}}.$

Then, we define $V(f, \lambda) = \mathcal{L}(f, \lambda) - \inf_{f' \in \mathcal{F}} \mathcal{L}(f', \lambda)$. Let us also define the following functions: $f^\lambda_* \in \arg\min_{f \in \mathcal{F}} V(f, \lambda)$ and $f_* = \lim_{\lambda \to 0} f^\lambda_* \in \arg\min_{f \in \mathcal{F}} V(f)$, where $V(f) = V(f, 0)$. Such an $f_*$ exists and is unique since the set of all solutions is convex, and therefore there is a unique minimizer of the norm $\|\cdot\|_{\mathcal{H}}$. Finally, Lemma 3.7 gives the formula for the GBDT iteration in terms of kernels, which will be useful in proving our results. See Appendix D for the proofs.

Lemma 3.7. The iterations $f_\tau$ of gradient boosting (Algorithm 2) can be written in the form:

$f_{\tau+1} = \Big(1 - \frac{\epsilon\lambda}{N}\Big) f_\tau + \frac{\epsilon}{N}\, k_{\nu_\tau}(\cdot, x_N)\big(f_*(x_N) - f_\tau(x_N)\big), \quad \nu_\tau \sim p(\nu|f_\tau, \beta).$
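The identity behind Lemma 3.7 — that a tree's leaf-averaging step equals $\frac{1}{N} k_\nu(\cdot, x_N)\, r$ under Definition 3.4 — can be checked numerically on a toy leaf assignment (an illustrative sketch with made-up leaves, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
leaf = rng.integers(0, 3, size=N)      # made-up leaf index of each point
phi = np.eye(3)[leaf]                  # phi_nu(x_i): one-hot leaf indicators
r = rng.normal(size=N)                 # residuals f_*(x_N) - f_tau(x_N)

# Leaf values as in Algorithm 2: average residual per leaf.
counts = phi.sum(axis=0)
theta = (phi.T @ r) / np.maximum(counts, 1)
tree_pred = phi @ theta

# Weak learner's kernel (Definition 3.4): k_nu = sum_j w_j phi_j phi_j',
# with w_j = N / max(N_j, 1).
w = N / np.maximum(counts, 1)
k = phi @ np.diag(w) @ phi.T
kernel_pred = (k @ r) / N              # equals the tree's prediction
```

The weights $w^{(j)}_\nu = N/\max\{N^{(j)}_\nu, 1\}$ are chosen exactly so that applying the kernel to the residual vector and dividing by $N$ reproduces the per-leaf averaging of Algorithm 2.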

3.5. KERNEL GRADIENT BOOSTING CONVERGENCE TO KRR

Consider the sequence $\{f_\tau\}_{\tau \in \mathbb{N}}$ generated by the gradient boosting algorithm. Its evolution is described by Lemma 3.7. The following theorem estimates the expected (w.r.t. the randomness of tree selection) empirical error of $f_T$ relative to the best possible ensemble. The full statement of the theorem and its proof can be found in Appendix G.

Theorem 3.8. Assume that $\beta$ and $T_1$ are sufficiently large and $\epsilon$ is sufficiently small (see Appendix G). Then, for all $T \ge T_1$,

$\mathbb{E} V(f_T, \lambda) \le O\Big(e^{-O(\epsilon(T - T_1)/N)} + \frac{\epsilon\lambda^2}{N^2} + \epsilon + \frac{\lambda}{\beta N^2}\Big).$

Corollary 3.9 (Convergence to the solution of the KRR problem). Under the assumptions of the previous theorem, we have the following dimension-free bound:

$\mathbb{E}\|f_T - f^\lambda_*\|^2_{L^2} \le O\Big(e^{-O(\epsilon(T - T_1)/N)} + \frac{\epsilon\lambda^2}{N} + \epsilon N + \frac{\lambda}{\beta N}\Big).$

This bound is dimension-free thanks to the functional treatment, and it decays exponentially to a small value with iterations; it therefore justifies the rapid convergence of gradient boosting algorithms observed in practice, even though the dimension of the space $\mathcal{H}$ is enormous.

4. GAUSSIAN INFERENCE

So far, the main result of the paper, proved in Section 3.5, shows that Algorithm 2 solves the Kernel Ridge Regression problem, which can be interpreted as learning the Gaussian process posterior mean $f^\lambda_*$ under the assumption that $f \sim \mathrm{GP}(0, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$, where $\lambda = \frac{\delta^2}{\sigma^2}$. However, Algorithm 2 does not give us the posterior variance. Still, as mentioned earlier, we can estimate the posterior variance through Monte Carlo sampling in a sample-then-optimize way. For that, we need to somehow sample from the prior distribution $\mathrm{GP}(0, \sigma^2 K + \delta^2 \mathrm{Id}_{L^2})$.

4.1. PRIOR SAMPLING

We introduce Algorithm 3 for sampling from the prior distribution. SamplePrior generates an ensemble of random trees (with random splits and random values in leaves). Note that while being random, the tree structure depends on the dataset features $x_N$ since the candidate splits are based on $x_N$. We first note that the process $h_T(\cdot)$ is centered with covariance operator $K$:

$\mathbb{E}\, h_T(x) = 0 \quad \forall x \in X, \qquad \mathbb{E}\, h_T(x)\, h_T(y) = K(x, y) \quad \forall x, y \in X. \quad (5)$

Then, we show that $h_T(\cdot)$ converges to the Gaussian process in the limit.

Lemma 4.1. The following convergence holds almost surely in $x \in X$: $h_T(\cdot) \xrightarrow{T \to \infty} \mathrm{GP}(0_{L^2(\rho)}, K)$.

Algorithm 3 SamplePrior(T, m, n)
hyper-parameters: number of iterations $T$, parameters of SampleTree $m$, $n$
instructions:
initialize $\tau = 0$, $h_0(x) = 0$
repeat
  $\nu_\tau = \mathrm{SampleTree}(0_{\mathbb{R}^N}; m, n, 1)$ — sample a random tree
  $\theta_\tau \sim \mathcal{N}\Big(0_{\mathbb{R}^{L_{\nu_\tau}}}, \mathrm{diag}\big(\frac{N}{\max\{N^{(j)}_{\nu_\tau}, 1\}} : j \in \{1, \ldots, L_{\nu_\tau}\}\big)\Big)$ — generate random values in the leaves
  $h_{\tau+1}(\cdot) = h_\tau(\cdot) + \frac{1}{\sqrt{T}} \langle \phi_{\nu_\tau}(\cdot), \theta_\tau \rangle_{\mathbb{R}^{L_{\nu_\tau}}}$ — update the model
  $\tau = \tau + 1$
until $\tau \ge T$
return: $h_T(\cdot)$
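A simplified numerical sketch of SamplePrior (with random depth-1 trees standing in for SampleTree, which is an assumption for illustration): each tree gets independent Gaussian leaf values of variance $N/N_j$, and the $1/\sqrt{T}$ scaling keeps the total variance bounded as trees accumulate.

```python
import numpy as np

def random_stump(x, rng):
    """A random depth-1 'tree': one-hot leaf-indicator matrix (N x 2)
    for a uniformly drawn threshold (stand-in for SampleTree)."""
    t = rng.uniform(0.2, 0.8)
    left = (x <= t).astype(float)
    return np.stack([left, 1 - left], axis=1)

def sample_prior(x, T, rng):
    """Sketch of SamplePrior: sum of T random trees with Gaussian leaf
    values of variance N / N_j, scaled by 1/sqrt(T)."""
    h = np.zeros(len(x))
    for _ in range(T):
        phi = random_stump(x, rng)
        nj = np.maximum(phi.sum(axis=0), 1)
        theta = rng.normal(0.0, np.sqrt(len(x) / nj))   # leaf values
        h += phi @ theta / np.sqrt(T)
    return h

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
draws = np.array([sample_prior(x, 50, rng) for _ in range(300)])
# draws.mean() should be near 0; the per-point variance estimates the
# prior kernel diagonal E[N / N_leaf(x)], as in Eq. (5).
```
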

4.2. POSTERIOR SAMPLING

Now we are ready to introduce Algorithm 4 for sampling from the posterior. The procedure is simple: we first perform $T_0$ iterations of SamplePrior to obtain a function $h_{T_0}(\cdot)$, and then we train a standard GBDT model $f_{T_1}(\cdot)$ approximating $y_N - \sigma h_{T_0}(x_N) + \mathcal{N}(0_N, \delta^2 I_N)$. Our final model is $\sigma h_{T_0}(\cdot) + f_{T_1}(\cdot)$. We further refer to this procedure as SamplePosterior or KGB (Kernel Gradient Boosting) for brevity. Denote $h_\infty = \lim_{T_0 \to \infty} h_{T_0}$ and $f_\infty = \lim_{T_1 \to \infty} f_{T_1}$, where the first limit is with respect to the point-wise convergence of stochastic processes and the second one with respect to $L^2(\rho)$ convergence. The following theorem shows that KGB indeed samples from the desired posterior. The proof directly follows from Lemmas 4.1 and 2.1.

Algorithm 4 SamplePosterior(z; ε, T₁, T₀, m, n, β, σ, δ)
input: dataset $z = (x_N, y_N)$
hyper-parameters: learning rate $\epsilon > 0$, boosting iterations $T_1$, SamplePrior iterations $T_0$, parameters of SampleTree $m$, $n$, $\beta$, kernel scale $\sigma > 0$ (default: $\sigma = 1$), noise scale $\delta > 0$ (default: $\delta = 0.01$)
instructions:
$h_{T_0}(\cdot) = \mathrm{SamplePrior}(T_0, m, n)$
$y^{\mathrm{new}}_N = y_N - \sigma h_{T_0}(x_N) + \mathcal{N}(0_N, \delta^2 I_N)$
$f_{T_1}(\cdot) = \mathrm{TrainGBDT}\big((x_N, y^{\mathrm{new}}_N); \epsilon, T_1, m, n, \beta, \frac{\delta^2}{\sigma^2}\big)$
return: $\sigma h_{T_0}(\cdot) + f_{T_1}(\cdot)$

Theorem 4.2. In the limit, the output of Algorithm 4 follows the Gaussian process posterior: $\sigma h_\infty(\cdot) + f_\infty(\cdot) + \mathcal{N}(0, \delta^2) \sim \mathrm{GP}(\bar{f}, \bar{K})$ with mean $\bar{f}(x) = K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} y_N$ and covariance $\bar{K}(x, x) = \delta^2 + \sigma^2\big(K(x, x) - K(x, x_N)\big(K(x_N, x_N) + \lambda I_N\big)^{-1} K(x_N, x)\big)$.

5. EXPERIMENTS

This section empirically evaluates the proposed KGB algorithm and shows that it indeed allows for better knowledge uncertainty estimates.

Synthetic experiment

To illustrate the KGB algorithm in a controllable setting, we first conduct a synthetic experiment. For this, we define the feature distribution as uniform over $D = \{(x, y) \in [0, 1]^2 : \frac{1}{10} \le (x - \frac{1}{2})^2 - (y - \frac{1}{2})^2 \le \frac{1}{4} \wedge (x \le \frac{2}{5} \vee x \ge \frac{3}{5}) \wedge (y \le \frac{2}{5} \vee y \ge \frac{3}{5})\}$. We sample 10K points from $U([0, 1]^2)$ and take into the train set only those that fall into $D$. The target is defined as $f(x, y) = x + y$. Figure 1(a) illustrates the training dataset colored with the target values. For evaluation, we take the same 10K points without restricting them to $D$. For KGB, we fix $\epsilon = 0.3$, $T_0 = 100$, $T_1 = 900$, $\sigma = 10^{-2}$, $\delta = 10^{-4}$, $\beta = 0.1$, $m = 4$, $n = 64$, and sample 100 KGB models. Figure 1(b) shows the posterior mean $\mu$ estimated by Monte Carlo. Figure 1(c) shows $\log \sigma$, where $\sigma^2$ is the posterior variance estimated by Monte Carlo. One can see that the posterior variance is small in-domain and grows when we move outside the dataset $D$, as desired.

Experiment on real datasets. Uncertainty estimates for GBDTs have been previously analyzed by Malinin et al. (2021). Our experiments on real datasets closely follow their setup, and we compare the proposed KGB with SGB, SGLB, and their ensembles. For the experiments, we use several standard regression datasets (Gal & Ghahramani, 2016). The implementation details can be found in Appendix H. The code of our experiments can be found on GitHub. We note that in our setup, we cannot compute likelihoods: the kernel $K$ is defined implicitly, and its evaluation requires summing over all possible tree structures, whose number grows as $(nd)^m$, which is infeasible; moreover, inverting the kernel requires $O(N^{2+\omega})$ operations, which additionally rules out the applicability of classical Gaussian process methods with our kernel. Therefore, a typical Bayesian setup is not applicable, and we resort to the uncertainty estimation setup described in Malinin et al. (2021).
Also, the intractability of the kernel does not allow us to treat $\sigma$ and $\delta$ in a fully Bayesian way, as that would require estimating the likelihood. Therefore, we fix them as constants; we note that this does not affect the evaluation metrics in our setup, as they are scale- and translation-invariant. First, we compare KGB with SGLB since they both sample from similar posterior distributions. Thus, this comparison allows us to find out which of the algorithms samples better from the posterior and thus provides more reliable estimates of knowledge uncertainty. Moreover, we consider the SGB approach as the most "straightforward" way to generate an ensemble of models. In Table 1, we compare the predictive performance of the methods. Interestingly, we obtain improvements on almost all the datasets. Here we perform cross-validation to estimate statistical significance with a paired t-test and highlight the approaches that are insignificantly different from the best one (p-value > 0.05). Then, we check whether uncertainty measured as the variance of the model's predictions can be used to detect errors and out-of-domain inputs. Error detection can be evaluated via the Prediction-Rejection Ratio (PRR) (Malinin, 2019; Malinin et al., 2020). PRR measures how well uncertainty estimates correlate with errors and rank-order them. Out-of-domain (OOD) detection is usually assessed via the area under the ROC curve (AUC-ROC) for the binary task of detecting whether a sample is OOD (Hendrycks & Gimpel, 2017). For this, one needs an OOD test set. We use the same OOD test sets (sampled from other datasets) as Malinin et al. (2021). The results of this experiment are given in Table 2. We can see that the proposed method significantly outperforms the baselines for out-of-domain detection. These improvements can be explained by the theoretical soundness of KGB: its convergence properties are theoretically grounded and non-asymptotic. In contrast, for SGB, there are no general results applicable in our setting, while for SGLB the guarantees are asymptotic. In summary, these results show that our approach is superior to SGB and SGLB, achieving smaller RMSE values and better knowledge uncertainty estimates.
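For reference, AUC-ROC for OOD detection reduces to a pairwise comparison of uncertainty scores (a minimal self-contained computation; the scores below are made up for illustration):

```python
import numpy as np

def auc_roc(scores_in, scores_out):
    """AUC-ROC for OOD detection: the probability that a random OOD sample
    gets a higher uncertainty score than a random in-domain sample
    (ties counted as 1/2)."""
    s_in = np.asarray(scores_in, dtype=float)[:, None]
    s_out = np.asarray(scores_out, dtype=float)[None, :]
    return (s_out > s_in).mean() + 0.5 * (s_out == s_in).mean()

# hypothetical predictive-variance scores from an ensemble
auc = auc_roc([0.1, 0.2, 0.15], [0.8, 0.9, 0.05])
```

An AUC of 0.5 corresponds to uncertainty scores that are useless for separating OOD from in-domain inputs, while 1.0 means perfect separation.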

6. CONCLUSION

This paper theoretically analyzes the classic gradient boosting algorithm. In particular, we show that GBDT converges to the solution of a certain Kernel Ridge Regression problem. We also introduce a simple modification of the classic algorithm that allows one to sample from the Gaussian process posterior. The proposed method gives much better knowledge uncertainty estimates than the existing approaches. We highlight the following important directions for future research. First, one can explore how to control the kernel and use it for better knowledge uncertainty estimates. Second, we do not analyze generalization in the current work, which is another important research topic. Finally, establishing a universal approximation property would further justify the need for the functional formalism.

A NOTATION USED IN THE PAPER

For convenience, let us list some frequently used notation:
• $X \subset \mathbb{R}^d$ — feature space;
• $d$ — dimension of feature vectors;
• $Y \subset \mathbb{R}$ — target space;
• $\rho$ — distribution of features;
• $N$ — number of samples;
• $z = (x_N, y_N)$ — dataset;
• $\mathcal{V}$ — set of all possible tree structures;
• $L_\nu : \mathcal{V} \to \mathbb{N}$ — number of leaves for $\nu \in \mathcal{V}$;
• $D(\nu, r)$ — score used to choose a split (3);
• $S$ — indices of all possible splits;
• $n$ — number of borders in our implementation of SampleTree;
• $m$ — depth of the tree in our implementation of SampleTree;
• $\beta$ — random strength;
• $\epsilon$ — learning rate;
• $\lambda$ — regularization;
• $\mathcal{F}$ — space of all possible ensembles of trees from $\mathcal{V}$;
• $\phi_\nu : X \to \{0, 1\}^{L_\nu}$ — tree structure;
• $\phi^{(j)}_\nu$ — indicator of the $j$-th leaf;
• $V(f)$ — empirical error of a model $f$ relative to the best possible $f \in \mathcal{F}$;
• $k_\nu(\cdot, \cdot)$ — single tree kernel;
• $K(\cdot, \cdot)$ — stationary kernel of the gradient boosting;
• $p(\cdot|f, \beta)$ — distribution of trees, $f \in \mathcal{F}$;
• $\pi(\cdot) = \lim_{\beta \to \infty} p(\cdot|f, \beta) = p(\cdot|f_*, \beta)$ — stationary distribution of trees;
• $\sigma$ — kernel scale.

A diagram in Appendix A illustrates the kernels introduced in the paper: the weak learner's kernel $k_{\nu_\tau} : X \times X \to \mathbb{R}_+$, the iteration kernel $K_{f_\tau} : X \times X \to \mathbb{R}_+$ (its expected value), and the stationary kernel $K : X \times X \to \mathbb{R}_+$ (its limit as $\tau \to \infty$), the latter being used in the dot product of $\mathcal{H} = (\mathcal{F}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$.

B ADDITIONAL RELATED WORK

Let us briefly discuss some additional related work. The Mondrian forest method (Balog et al., 2016) and Generalized Random Forests (Athey et al., 2019), besides having links to kernel methods, in fact have an underlying limiting RKHS that is much smaller than the space of all possible ensembles built on the same weak learners, due to the independence of the trees added to the ensemble. Therefore, there is an issue of high bias when comparing plain gradient boosting with the plain random forest method.
Also, these two methods are built from scratch to admit an RKHS interpretation, while we provide a link between the existing standard gradient boosting approaches and kernel methods: we do not create a novel gradient boosting algorithm but rather show that the existing ones already have such a link, which we use to derive convergence rates and to obtain a formal Gaussian process interpretation of gradient boosting, yielding uncertainty estimates with well-established gradient boosting libraries. Let us mention that there are approaches that study kernels induced by tree ensembles through the perspective of the Neural Tangent Kernel (Kanoh & Sugiyama, 2022), though this analysis is not applicable to classical gradient boosting, while ours is. Let us also briefly discuss the papers on the Neural Tangent Kernel, e.g., Jacot et al. (2018); Bayesian inference there (Lee et al., 2020; 2018; Yang, 2019; Cho & Saul, 2009) is likewise of the sample-then-optimize kind. For boosting, we achieve this only by introducing Algorithm 4 relying on Algorithm 3, which is, in essence, a random initialization for gradient boosting. The classic gradient boosting (Algorithm 2) can be considered as the mean value of the Gaussian process, which has no analogs in the world of deep learning, where convergence to the posterior mean requires averaging over many trained models. This can be considered an advantage of gradient boosting over deep learning that follows from our analysis.

C CONVEX OPTIMIZATION IN FUNCTIONAL SPACES

In this section, we formulate basic definitions of differentiability in functional spaces and a theorem on the convergence of gradient descent in functional spaces. For the proof of the theorem and further details on convex optimization in functional spaces, the reader can consult Luenberger (1969). We consider $\mathcal{H}$ to be a Hilbert space with some scalar product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.

Definition C.1. We say that $F : \mathcal{H} \to \mathbb{R}$ is Fréchet differentiable if for any $f \in \mathcal{H}$ there exists a bounded linear functional $L_f : \mathcal{H} \to \mathbb{R}$ such that for all $h \in \mathcal{H}$: $F(f + h) = F(f) + L_f[h] + o(\|h\|)$. The value of $L_f$ is denoted by $D_f F(f)$ and is called the Fréchet differential of $F$ at the point $f$. So, the Fréchet differential is a functional $D_f F : \mathcal{H} \to B(\mathcal{H}, \mathbb{R})$, where $B(X, Y)$ denotes the normed space of bounded linear functionals from $X$ to $Y$.

Definition C.2. Let $F : \mathcal{H} \to \mathbb{R}$ be Fréchet differentiable with a Fréchet differential $D_f F(f)$ that is a bounded linear functional. Then, by the Riesz theorem, there exists a unique $h_f$ such that $D_f F(f)[h] = \langle h_f, h \rangle_{\mathcal{H}}$ for all $h \in \mathcal{H}$. We call this element the gradient of $F$ in $\mathcal{H}$ at $f \in \mathcal{H}$ and denote it by $\nabla_{\mathcal{H}} F(f) = h_f \in \mathcal{H}$.

Definition C.3. $F : \mathcal{H} \to \mathbb{R}$ is said to be twice Fréchet differentiable if $D_f F$ is Fréchet differentiable.

Let $I \in B(\mathcal{H} \times \mathcal{H}, \mathbb{R})$ be the linear operator defined as $I(g, h) = \langle g, h \rangle_{\mathcal{H}}$.

Theorem C.6. Let $F$ be a bounded below and twice Fréchet differentiable functional on a Hilbert space $\mathcal{H}$ such that $D^2_f F(f)$ satisfies $0 \prec mI \preceq D^2_f F(f) \preceq \mu I$. Then the gradient descent scheme $f_{k+1} = f_k - \epsilon \nabla_{\mathcal{H}} F(f_k)$ (for a suitable learning rate $\epsilon > 0$) converges to the $f_*$ that minimizes $F$.

Proof. For the proof, see Luenberger (1969).

D KERNEL RIDGE REGRESSION AND RKHS

Definition D.1. $K : X \times X \to \mathbb{R}$ is called a kernel function if it is positive semi-definite, i.e., for all $N \in \mathbb{N}_+$ and all $x_N \in X^N$: $K(x_N, x_N) \succeq 0$.

Definition D.2. For any kernel function, we can define a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}(K) = \overline{\mathrm{span}}\{K(\cdot, x) \mid x \in X\}$ with a scalar product such that $\langle f, K(\cdot, x)\rangle_{\mathcal{H}(K)} = f(x)$.

Consider the following Kernel Ridge Regression problem:

$V(f, \lambda) = \frac{1}{2N}\|y_N - f(x_N)\|^2_{\mathbb{R}^N} + \frac{\lambda}{2N}\|f\|^2_{\mathcal{H}(K)} - \min_{f' \in \mathcal{H}(K)}\Big[\frac{1}{2N}\|y_N - f'(x_N)\|^2_{\mathbb{R}^N} + \frac{\lambda}{2N}\|f'\|^2_{\mathcal{H}(K)}\Big], \quad V(f, \lambda) \to \min_{f \in \mathcal{H}(K)},$

and the following Kernel Ridgeless Regression problem:

$V(f) = \frac{1}{2N}\|y_N - f(x_N)\|^2_{\mathbb{R}^N} - \min_{f' \in \mathcal{H}(K)} \frac{1}{2N}\|y_N - f'(x_N)\|^2_{\mathbb{R}^N}, \quad V(f) \to \min_{f \in \mathcal{H}(K)}.$

Lemma D.3. $\min_{\mathcal{H}(K)} V(f, \lambda)$ has the unique solution $f^\lambda_* = K(\cdot, x_N)\big(K(x_N, x_N) + \lambda I\big)^{-1} y_N$.

Proof. First, let us show that $f^\lambda_* \in \mathrm{span}\{K(\cdot, x_i)\}$. Let $\mathcal{H}(K) = \mathrm{span}\{K(\cdot, x_i)\} \oplus \mathrm{span}\{K(\cdot, x_i)\}^\perp$ and consider the projector $P : \mathcal{H}(K) \to \mathcal{H}(K)$ onto the space $\mathrm{span}\{K(\cdot, x_i)\}$. It is easy to show that $P(f)(x_N) = f(x_N)$ for any $f \in \mathcal{H}(K)$. Indeed, $(f - P(f))(x_N) = \langle f - P(f), K(\cdot, x_N)\rangle = 0$. If $f^\lambda_*$ does not lie in $\mathrm{span}\{K(\cdot, x_i)\}$, then $\|f^\lambda_*\|_{\mathcal{H}(K)} > \|P(f^\lambda_*)\|_{\mathcal{H}(K)}$ and $V(P(f^\lambda_*), \lambda) < V(f^\lambda_*, \lambda)$; we get a contradiction with the minimality of $f^\lambda_*$. Now, let us prove the existence of $f^\lambda_*$. Consider $f = K(\cdot, x_N)c$, where $c \in \mathbb{R}^N$. Then, we find the optimal $c$ by taking the derivative of $V(f, \lambda)$ with respect to $c$ and equating it to zero: $K(x_N, x_N)\big(K(x_N, x_N)c - y_N\big) + \lambda K(x_N, x_N)c = 0$. Then, $c_v = \big(K(x_N, x_N) + \lambda I\big)^{-1}(y_N + v)$, where $v \in \ker K(x_N, x_N)$. Note that all $K(\cdot, x_N)c_v$ are equal. Then, we have the unique solution of the KRR problem: $f^\lambda_* = K(\cdot, x_N)\big(K(x_N, x_N) + \lambda I\big)^{-1} y_N$.

Lemma D.4. $\min_{\mathcal{H}(K)} V(f)$ has a unique solution in $\mathrm{span}\{K(\cdot, x_i)\}$, and it is the solution of minimum RKHS norm: $f_* = K(\cdot, x_N) K(x_N, x_N)^\dagger y_N$.

Moreover,

$\frac{1}{2} D^2_f V(f^\lambda_*, \lambda)[f - f^\lambda_*, f - f^\lambda_*] = \frac{1}{2N}\|f^\lambda_*(x_N) - f(x_N)\|^2_{\mathbb{R}^N} + \frac{\lambda}{2N}\|f^\lambda_* - f\|^2_{\mathcal{H}(K)}.$

The explicit formula for the Fréchet derivative of $V(f, \lambda)$ can be found in Appendix E.
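Lemma D.4's minimum-norm ridgeless solution can be illustrated numerically via the Moore–Penrose pseudo-inverse (an RBF stand-in kernel, a sketch only, not the paper's boosting kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
y = np.sin(4 * x)
K = np.exp(-100.0 * (x[:, None] - x[None, :]) ** 2)   # stand-in kernel matrix

c = np.linalg.pinv(K) @ y    # coefficients of f_* = K(., x_N) K(x_N, x_N)^+ y_N
fit = K @ c                  # f_* interpolates the data when K is full rank
```

Among all coefficient vectors that achieve the same fit at $x_N$, the pseudo-inverse picks the one of minimum norm, which corresponds to the minimum-RKHS-norm solution of the ridgeless problem.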

E GAUSSIAN PROCESS INFERENCE

In this section, we prove Lemma 2.1 from Section 2.3 of the main text. First, consider the following regularized error functional:

V(f, λ) = (1/2N) Σ_{i=1}^N (f(x_i) − y_i)² + (λ/2N)‖f‖²_H − min_{f∈H(K)} [(1/2N) Σ_{i=1}^N (f(x_i) − y_i)² + (λ/2N)‖f‖²_H].

With this functional, we consider the optimization problem min_{f∈H(K)} V(f, λ), which is called Kernel Ridge Regression. We will show that this functional satisfies the conditions needed for Theorem C.6. We will also derive the formula for the gradient of V in order to show that gradient descent takes the form (2).

Lemma E.1. V(f, λ) is Fréchet differentiable with the differential given by:

D_f V(f, λ) = (λ/N)⟨f, ·⟩_{H(K)} + (1/N) Σ_{i=1}^N (f(x_i) − y_i) ev_{x_i},

where ev_{x_i} : H(K) → R is the bounded linear functional such that ev_{x_i}(f) = f(x_i) = ⟨f, K(x_i, ·)⟩_{H(K)}.

Proof. As the Fréchet differential is linear, we only need to find the Fréchet differential of (f(x_i) − y_i)². Note that (f(x_i) − y_i)² is a composition of two functions: F : H(K) → R, F = ev_{x_i} − y_i, and G : R → R, G(x) = x². The differential of the composition can be found as:

D_f G(F(f)) = (∂/∂x)G(F(f)) · D_f F(f) = 2(f(x_i) − y_i) ev_{x_i},

where D_f F(f) = ev_{x_i} because ev_{x_i}(f + h) − y_i = (ev_{x_i}(f) − y_i) + ev_{x_i}(h).

Lemma E.2. The gradient of V(f, λ), i.e., the Riesz representative of the functional above, is given by:

∇_f V(f, λ) = (λ/N)f + (1/N) Σ_{i=1}^N (f(x_i) − y_i) K_{x_i}.

Proof. Follows from the previous lemma.

Lemma E.3. V(f, λ) is twice Fréchet differentiable with the second differential given by:

D²_f V : H(K) → B(H(K), B(H(K), R)),   D²_f V(f, λ)[h] = (λ/N)⟨h, ·⟩_{H(K)} + (1/N) Σ_{i=1}^N h(x_i) ev_{x_i}.

Proof. Due to the linearity of the Fréchet differential and Lemma E.1, we only need to find the Fréchet differential of S(f) = (f(x_i) − y_i) ev_{x_i}. That is, we need to find V_f ∈ B(H(K), B(H(K), R)) such that S(f + h) = S(f) + V_f[h] + o(‖h‖).

It is easy to show that

h ↦ h(x_i) ev_{x_i} ∈ B(H(K), B(H(K), R))   and   S(f + h) = S(f) + h(x_i) ev_{x_i}.

Thus, we get D_f S(f)[h] = h(x_i) ev_{x_i}, from which the statement of the lemma follows.

Given all the above lemmas, as a corollary of Theorem C.6, we have the following.

Corollary E.4. Gradient descent, defined by the iterative scheme

f_{τ+1} = (1 − ελ/N) f_τ − (ε/N) Σ_{i=1}^N (f_τ(x_i) − y_i) K_{x_i},   f_0 = 0_{L2(ρ)},

converges to the optimum of V(f, λ). Thus,

f^λ_* = lim_{τ→∞} f_τ = K(·, x_N)(K(x_N, x_N) + λI_N)^{−1} y_N.

Proof. By Lemma E.2, our update rule has the form f_{τ+1} = f_τ − ε∇_H V(f_τ, λ). We then find m, μ such that 0 ≺ mI ⪯ D²_f V(f, λ) ⪯ μI. By Lemma E.3,

D²_f V(f, λ)[g, h] = (λ/N)⟨g, h⟩_{H(K)} + (1/N) g(x_N)^T h(x_N).

Then, we can take m = λ/N. Let us also write

D²_f V(f, λ)[g, g] = (λ/N)‖g‖²_{H(K)} + (1/N)‖g(x_N)‖²_{R^N} = (λ/N)‖g‖²_{H(K)} + (1/N) Σ_{i=1}^N ⟨g, K(·, x_i)⟩²_{H(K)} ≤ (λ/N + max_{x∈X} K(x, x)) ‖g‖²_{H(K)}.

Then, we can take μ = λ/N + max_{x∈X} K(x, x). The corollary follows from Theorem C.6 and Lemma D.3.

Lemma E.5. Consider the gradient descent

f_{τ+1} = (1 − ελ/N) f_τ − (ε/N) Σ_{i=1}^N (f_τ(x_i) − y_i) K_{x_i},   f_0 = 0_{L2(ρ)},   f_∞ = lim_{τ→∞} f_τ,

and the following randomization scheme:

1. sample f_init ∼ GP(0_{L2(ρ)}, σ²K + δ²Id_{L2});
2. set new labels y^new_N = y_N − f_init(x_N);
3. fit GD f_τ(·) on y^new_N assuming f_0(·) = 0_{L2(ρ)};
4. output f̂(·) = f_init(·) + f_∞(·) as the final model.

Then, f̂ from the scheme above follows the Gaussian Process posterior with the following mean and covariance:

E f̂(x) = K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} y_N,
cov(f̂(x)) = δ² + σ²[K(x, x) − K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} K(x_N, x)].

Proof. f_∞ = K(·, x_N)(K(x_N, x_N) + λI_N)^{−1} y^new_N = K(·, x_N)(K(x_N, x_N) + λI_N)^{−1}(y_N − f_init(x_N)). Let us find the distribution of f̂ at an arbitrary collection of points x. It can be easily seen that:

E f̂(x) = K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} y_N.
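The randomize-then-fit scheme of Lemma E.5 is easy to simulate. In the sketch below (our own illustrative code and names), we replace step 3's gradient descent by its known limit f_∞ = K(·, x_N)(K(x_N, x_N) + λI)^{−1}y^new_N from Corollary E.4, draw f_init jointly at the train and test points, and check the posterior mean by Monte-Carlo.

```python
import numpy as np

def sample_posterior(X, y, X_star, kernel, lam, sigma=1.0, delta=0.1,
                     n_samples=2000, seed=0):
    """Draw samples via the randomize-then-fit scheme of Lemma E.5:
    f_init ~ GP(0, sigma^2 K + delta^2 Id) jointly at (X, X_star),
    then output f_init(X_star) + K(X_star, X)(K + lam*I)^-1 (y - f_init(X))."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, X_star])
    n = len(X)
    C = sigma**2 * kernel(Z, Z) + delta**2 * np.eye(len(Z))   # prior covariance
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(Z)))
    K = kernel(X, X)
    Ks = kernel(X_star, X)
    inv = np.linalg.solve(K + lam * np.eye(n), np.eye(n))
    out = []
    for _ in range(n_samples):
        f_init = L @ rng.standard_normal(len(Z))
        y_new = y - f_init[:n]                   # shifted labels
        f_inf = Ks @ (inv @ y_new)               # closed-form GD limit
        out.append(f_init[n:] + f_inf)
    return np.array(out)
```

The sample mean matches the KRR solution for any λ; the sample covariance matches the GP posterior when λ = δ²/σ², as in the lemma's algebra.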
Let us now calculate the covariance:

cov(f̂(x)) = E[(f̂(x) − E f̂(x))(f̂(x) − E f̂(x))^T]
= E[(f_init(x) − K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} f_init(x_N)) (f_init(x) − K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} f_init(x_N))^T]
= E f_init(x)f_init(x)^T − E f_init(x)f_init(x_N)^T (K(x_N, x_N) + λI_N)^{−1} K(x_N, x) − K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} E f_init(x_N)f_init(x)^T + K(x, x_N)(K(x_N, x_N) + λI_N)^{−1} E f_init(x_N)f_init(x_N)^T (K(x_N, x_N) + λI_N)^{−1} K(x_N, x)
= δ² + σ²K(x, x) − 2σ²K(x, x_N)(K(x_N, x_N) + λI_N)^{−1}K(x_N, x) + σ²K(x, x_N)(K(x_N, x_N) + λI_N)^{−1}K(x_N, x)
= δ² + σ²[K(x, x) − K(x, x_N)(K(x_N, x_N) + λI_N)^{−1}K(x_N, x)],

where in the fourth equality we used E f_init(x_N)f_init(x_N)^T = σ²K(x_N, x_N) + δ²I_N = σ²(K(x_N, x_N) + λI_N), i.e., λ = δ²/σ². This is exactly what we need.

F DISTRIBUTION OF TREES

Lemma F.1 (Lemma 3.2 in the main text).

p(ν|f, β) = Σ_{ς∈P_m} Π_{i=1}^m [e^{D(ν_{ς,i}, r)/β} / Σ_{s∈S\ν_{ς,i−1}} e^{D((ν_{ς,i−1}, s), r)/β}],

where the sum is over all permutations ς ∈ P_m, ν_{ς,i} = (s_{ς(1)}, ..., s_{ς(i)}), and ν = (s_1, ..., s_m).

Proof. Let us fix some permutation ς ∈ P_m. W.l.o.g., let ς = id, i.e., ς(i) = i for all i. It remains to derive the formula for the fixed permutation. The probability of adding the next split given the previously built tree is:

P(ν_{i−1} ∪ s_i | ν_{i−1}) = e^{D(ν_i, r)/β} / Σ_{s∈S\ν_{i−1}} e^{D((ν_{i−1}, s), r)/β},

which comes from (4) and the Gumbel-Softmax trick. Then, we decompose the probability P(ν) of a tree as P(ν) = Π_{i=1}^m P(ν_{i−1} ∪ s_i | ν_{i−1}), and so for the fixed permutation we have

P(ν) = Π_{i=1}^m [e^{D(ν_i, r)/β} / Σ_{s∈S\ν_{i−1}} e^{D((ν_{i−1}, s), r)/β}].

Summing over all permutations, the lemma follows.

Now, let us define the following value, indicating how different the distributions of trees are for f and f_*:

Γ_β(f) = max( max_{ν∈V} p(ν|f_*, β)/p(ν|f, β),  max_{ν∈V} p(ν|f, β)/p(ν|f_*, β) ).

Lemma F.2. The following bound relates the distributions: Γ_β(f) ≤ e^{2mV(f)/β}.

Proof. Consider π = p(·|f_*, β) and the following expression P(ν, ς):

P(ν, ς) := Π_{i=1}^m [e^{D(ν_{ς,i}, r)/β} / Σ_{s∈S\ν_{ς,i−1}} e^{D((ν_{ς,i−1}, s), r)/β}].

Then,

Σ_{ς∈P_m} P(ν, ς) ≤ Σ_{ς∈P_m} e^{(m/β)D(ν, r)} Π_{i=1}^m [1 / Σ_{s∈S\ν_{ς,i−1}} 1] ≤ e^{2mV(f)/β} π(ν),

where in the second inequality we used D(ν, r) ≤ 2V(f), which follows directly from the definition. Now note that the probabilities remain unchanged if we shift D(·, r) ← D(·, r) − 2V(f), which makes the score everywhere non-positive and allows us to apply the trick above once more, in the reverse manner: formally replacing D with the shifted function and repeating the steps with the inequalities reversed (which is valid since the new function is everywhere non-positive) yields

Σ_{ς∈P_m} P(ν, ς) ≥ Σ_{ς∈P_m} e^{Σ_{i=1}^m D(ν_{ς,i}, r)/β − (2m/β)V(f)} Π_{i=1}^m [1 / Σ_{s∈S\ν_{ς,i−1}} 1] ≥ e^{−2mV(f)/β} π(ν),

and the lemma follows.
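The Gumbel-Softmax split sampling underlying Lemma F.1 can be sketched in a few lines: adding independent Gumbel(0, β) noise to the scores D(·, r) and taking the argmax draws a split with probability proportional to e^{D/β}. The function name below is our own.

```python
import numpy as np

def sample_split(scores, beta, rng):
    """Gumbel-argmax trick: argmax over (D_s + Gumbel(0, beta)) selects split s
    with probability exp(D_s / beta) / sum_s' exp(D_s' / beta)."""
    g = rng.gumbel(scale=beta, size=len(scores))
    return int(np.argmax(np.asarray(scores) + g))
```

Taking β → 0 recovers the greedy argmax split of classic GBDT, while large β approaches a uniform choice over candidate splits.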
G PROOF OF THEOREM 3.8

G.1 RKHS STRUCTURE

In Section 3.4, we defined the RKHS structure on F via ⟨f, K(·, x)⟩_{H(K)} = f(x) and introduced the kernels k_ν, K_f, K_π. Let us also define a kernel K_p(·, ·) = Σ_{ν∈V} k_ν(·, ·) p(ν) for an arbitrary distribution p on V. This way, taking p to be δ_ν(·), p(ν | f, β), or π(·), we get K_p equal to k_ν, K_f, or K, respectively. For each kernel, we define the associated operator, denoted similarly: K_p : F → F, f ↦ ∫_X K_p(·, x) f(x) ρ(dx).

Lemma G.1. Consider two positive semi-definite operators A : V → V and B : V → V on a finite-dimensional vector space V such that A ⪰ B. Then, Im A ⊇ Im B.

Lemma G.2. K_p : F → F is invertible for p non-vanishing on V.

Now we need to show that λ_min(Σ_p) ≥ 1/N. Consider the following formula:

Σ_p = (1/N) Σ_{i=1}^N K_p(·, x_i) ⊗_{H(K_p)} K_p(·, x_i) = (1/N) Σ_{i=1}^N K_p(x_i, x_i) [K_p(·, x_i)/√K_p(x_i, x_i)] ⊗_{H(K_p)} [K_p(·, x_i)/√K_p(x_i, x_i)],

where (a ⊗_{H(K_p)} b)[c] = ⟨b, c⟩_{H(K_p)} a. If a = b and ‖a‖_{H(K_p)} = 1, then 1 and 0 are the only eigenvalues of a ⊗_{H(K_p)} a. Denote S = span{K_p(·, x_i) | i = 1, ..., N} ⊂ H(K_p), m = dim S, n = dim H(K_p). Then,

λ_min(Σ_p) = λ_{n−m+1}(Σ_p) = min_{dim U = n−m+1} max_{x∈U} R_{Σ_p}(x),   where R_{Σ_p}(x) = ⟨Σ_p x, x⟩_{H(K_p)} / ⟨x, x⟩_{H(K_p)}.

As dim U = n − m + 1, we have U ∩ S ≠ {0}. Suppose K_p(·, x_i) ∈ U ∩ S; then

max_{x∈U} R_{Σ_p}(x) ≥ R_{Σ_p}(K_p(·, x_i)/√K_p(x_i, x_i)) (*) ≥ K_p(x_i, x_i)/N ≥ 1/N,

where (*) holds since each a ⊗_{H(K_p)} a is a positive semi-definite operator, and the last inequality follows from K_p(x, x) ≥ 1 for all x ∈ X.

G.2 NORM MAJORIZATION

The following lemmas relate the norms L2, H, and R^N to each other. Indeed, by these lemmas, we can use the chain of bounds ‖·‖_{L2} ≲ ‖·‖_H ≤ ‖·‖_{R^N}, where the first relation holds up to the constant (max_{x∈X} K(x, x))^{1/2}. In the main theorems, we use these relations extensively.

Corollary G.8. ‖f(x_N)‖_{R^N} ≥ ‖f‖_H for f ∈ span{K(·, x_i) | i = 1, ..., N}.

Proof. (1/N)‖f(x_N)‖²_{R^N} = ⟨Σf, f⟩_H ≥ (1/N)‖f‖²_H, as Σ ⪰ (1/N)I on span{K(·, x_i) | i = 1, ..., N}.

Lemma G.9. λ_max(K) ≤ max_{x∈X} K(x, x).

Proof. Consider K as an operator on (F, L2(ρ)). We prove that ‖K‖_{B((F, L2(ρ)))} ≤ max_{x∈X} K(x, x), from which the lemma follows. Consider the inequality K[f](x) = ⟨K_x, f⟩_{L2} ≤ ‖K_x‖_{L2}‖f‖_{L2}. Then, ‖K[f]‖_{L2} ≤ max_{x∈X} ‖K_x‖_{L2} ‖f‖_{L2}. Note also that K(x, x') ≤ min(K(x, x), K(x', x')), which is easily seen from the definition. Then, ‖K_x‖_{L2} ≤ K(x, x), and the lemma follows.

Corollary G.10 (Expected squared norm majorization by the RKHS norm). The following bound holds for all f ∈ H: ‖f‖²_{L2(ρ)} ≤ max_{x∈X} K(x, x) ‖f‖²_H.

Proof. We have λ_max(K)‖f‖²_H ≥ ⟨Kf, f⟩_H = ⟨f, f⟩_{L2} = ‖f‖²_{L2}. The bound then follows from the previous lemma.

Next, we use the trick f(x) = ⟨K(·, x), f⟩_H. It allows us to rewrite:

⟨K_p f, g⟩_{L2} = ∫_{X×X} K_p(x, y) ⟨K(·, x), f⟩_H ⟨K(·, y), g⟩_H ρ(dx)ρ(dy).

From this, it immediately follows that

K_p = ∫_{X×X} K_p(x, y) K(·, x) ⊗_H K(·, y) ρ(dx)ρ(dy).

This shows that K_p is indeed symmetric w.r.t. the dot product of H, since both K(·, x) ⊗_H K(·, y) and K(·, y) ⊗_H K(·, x) appear with the same weight K_p(x, y)ρ(dx)ρ(dy) = K_p(y, x)ρ(dy)ρ(dx).

G.4 ITERATIONS OF GRADIENT BOOSTING

Lemma G.14. For any ν ∈ V, we have k_ν(·, x_N)[y_N − f_*(x_N)] = 0.

Proof. Follows from Lemma D.5.

Lemma G.15 (Lemma 3.7 in the main text). Iterations f_τ of gradient boosting (Algorithm 2) can be written in the form:

f_{τ+1} = (1 − ελ/N) f_τ + (ε/N) k_{ν_τ}(·, x_N)[y_N − f_τ(x_N)] = (1 − ελ/N) f_τ + (ε/N) k_{ν_τ}(·, x_N)[f_*(x_N) − f_τ(x_N)],   ν_τ ∼ p(ν|f_τ, β).

Proof. According to Algorithm 2:

f_{τ+1}(·) = (1 − ελ/N) f_τ(·) + ε⟨φ_{ν_τ}(·), θ_τ⟩_{R^{L_{ν_τ}}}   for   θ_τ = ( Σ_{i=1}^N φ^{(j)}_{ν_τ}(x_i) r^{(i)}_τ / Σ_{i=1}^N φ^{(j)}_{ν_τ}(x_i) )_{j=1}^{L_{ν_τ}}.

Thus,

f_{τ+1} = (1 − ελ/N) f_τ + (ε/N) Σ_{j=1}^{L_{ν_τ}} ω^j_{ν_τ} φ^j_{ν_τ} Σ_{i : φ^j_{ν_τ}(x_i)=1} r^i_τ.

Now note that k_{ν_τ}(·, x_i) = ω^j_{ν_τ} φ^j_{ν_τ}(·), where j is such that φ^j_{ν_τ}(x_i) = 1. From this, the lemma follows.

From Lemmas G.15 and D.4, it is easy to show that f_τ, f_* ∈ span{K(·, x_i) | i = 1, ..., N}. Hence, hereafter we can use Corollary G.8 to bound the H norm by the R^N norm.

Lemma G.16. The iterations of gradient boosting can be represented as:

ε^{−1} E[f_τ − f_{τ+1} | f_τ] = K_{f_τ} D[f_τ − f_*] + (λ/N) f_τ,

where D : F → F is the bounded linear operator defined as the Riesz representative, with respect to the L2 scalar product, of the bilinear form (1/N) f(x_N)^T h(x_N) = ⟨Df, h⟩_{L2}. A similar decomposition holds for ∇_H V(f, λ) = KD[f − f_*] + (λ/N) f.

Proof. First, observe that E[f_{τ+1} | f_τ] = f_τ − ε∇V(f_τ, λ), where the gradient is taken with respect to H(K_{f_τ}). Keep in mind that in the definition of V(f_τ, λ), the norm in the regularizer term is taken with respect to H(K_{f_τ}) instead of H(K). Thus, we only need to prove that ∇_H V(f, λ) = KD[f − f_*] + (λ/N) f. Consider the Fréchet differential D_f V(f)[h] = (1/N)(f(x_N) − f_*(x_N))^T h(x_N) = ⟨D[f − f_*], h⟩_{L2}. By Lemma G.4, we deduce ⟨D[f − f_*], h⟩_{L2} = ⟨KD[f − f_*], h⟩_H, which implies ∇V(f) = KD[f − f_*], and the lemma follows.
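A minimal NumPy sketch of one iteration from Lemma G.15 on the training set (our own illustrative code). It assumes the leaf weights are normalized as ω_j = N/|leaf j|, so that the kernel update (ε/N) k_ν(·, x_N)[y_N − f_τ(x_N)] reduces to adding ε times the per-leaf mean residual.

```python
import numpy as np

def boosting_step(f, y, leaf_ids, lr, lam):
    """One step of Lemma G.15 at the training points: shrink by (1 - lr*lam/N),
    then add lr times the mean residual of each leaf. This equals
    (lr/N) k_nu(., x_N)[y - f(x_N)] for k_nu(x, x') = (N/|leaf(x)|) 1{leaf(x)=leaf(x')}."""
    N = len(y)
    res = y - f
    out = (1 - lr * lam / N) * f
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        out[mask] += lr * res[mask].mean()
    return out
```

Iterating this step with trees resampled as ν_τ ∼ p(ν|f_τ, β) is the stochastic kernel gradient descent analyzed in Section G.6.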

G.5 MAIN LEMMAS

Lemma G.17. Let A, B ∈ B(H, H) be two PSD operators such that ξB − A and ξA − B are PSD for some ξ ∈ (1, ∞). Let g, h ∈ H be two arbitrary vectors and λ ∈ R_{++} be a constant. Then,

⟨A[g] + λξh, B[g] + λh⟩_H ≥ (1/2ξ)‖A[g] + λξh‖²_H − ξ(ξ² − 1)λ²‖h‖²_H.

Proof. Consider the following equality:

⟨A[g] + λξh, B[g] + λh⟩_H = ξ⟨ξ^{−1}A[g] + λh, B[g] + λh⟩_H = (ξ/2)[‖ξ^{−1}A[g] + λh‖²_H + ‖B[g] + λh‖²_H − ‖(B − ξ^{−1}A)[g]‖²_H],

which is basically the classical decomposition of the dot product ⟨x, y⟩ = (1/2)(‖x‖² + ‖y‖² − ‖x − y‖²). Then, we note that ξ^{−1}A − ξ^{−2}B = ξ^{−2}(ξA − B) is PSD by assumption, and since B − ξ^{−1}A is PSD, this implies B − ξ^{−1}A ⪯ (1 − ξ^{−2})B, which implies:

⟨A[g] + λξh, B[g] + λh⟩_H ≥ (ξ/2)[‖ξ^{−1}A[g] + λh‖²_H + ‖B[g] + λh‖²_H − (1 − ξ^{−2})²‖B[g]‖²_H].

Finally, note that ξ²/(2 − ξ^{−2}) − 1 = ξ²(1 − ξ^{−2})²/(2 − ξ^{−2}) ≤ ξ² − 1. The result then directly follows from the equality

‖B[g] + λh‖²_H − (1 − ξ^{−2})²‖B[g]‖²_H = ‖ξ^{−1}√(2 − ξ^{−2}) B[g] + (λ / (ξ^{−1}√(2 − ξ^{−2}))) h‖²_H − λ²(ξ²/(2 − ξ^{−2}) − 1)‖h‖²_H.

Denote κ(A, B) = ‖Id_H − BA^{−1}‖_{B(H,H)} = ‖(B − A)A^{−1}‖_{B(H,H)}.

Lemma G.18. If ξ^{−1}K ⪯ K_f ⪯ ξK for ξ > 1, then κ(K, K_f) ≤ ξ − 1.

Proof. First, we note that both operators are symmetric positive semi-definite in L2. Now, let us look at the Rayleigh quotient:

‖(K − K_f)K^{−1}‖_{B(H,H)} = max_{f∈F\{0}} ‖(K − K_f)K^{−1}f‖_H / ‖f‖_H = max_{f∈F\{0}} ‖K^{−1/2}(K − K_f)K^{−1/2}f‖_{L2} / ‖f‖_{L2}.

In the last equality, we used the fact that K is symmetric positive definite, so K^{1/2} is too; hence we can substitute f ← K^{1/2}f and use the explicit formula for the dot product in H via the product in L2. Now we observe that

−(ξ − 1)Id_{L2} = K^{−1/2}(K − ξK)K^{−1/2} ⪯ K^{−1/2}(K − K_f)K^{−1/2} ⪯ K^{−1/2}(K − ξ^{−1}K)K^{−1/2} = (1 − ξ^{−1})Id_{L2} ⪯ (ξ − 1)Id_{L2},

which implies that the spectral radius ρ(K^{−1/2}(K − K_f)K^{−1/2}) is bounded by ξ − 1. Therefore, we obtain:

max_{f∈F\{0}} ‖K^{−1/2}(K − K_f)K^{−1/2}f‖_{L2} / ‖f‖_{L2} = ρ(K^{−1/2}(K − K_f)K^{−1/2}) ≤ ξ − 1.

Lemma G.19. Let A, B ∈ B(H, H). Then, the following inequality holds:

⟨A[g] + λh, B[g] + λh⟩_H ≥ (1/2 − κ(A, B))‖A[g] + λh‖²_H − κ²(A, B)(λ²/2)‖h‖²_H.

Proof. Let us rewrite the left-hand side as

⟨A[g] + λh, B[g] + λh⟩_H = ⟨A[g] + λh, [Id − (Id − BA^{−1})](A[g] + λh) + λ(Id − BA^{−1})h⟩_H
= ‖A[g] + λh‖²_H − ⟨A[g] + λh, (Id − BA^{−1})(A[g] + λh)⟩_H + ⟨A[g] + λh, λ(Id − BA^{−1})h⟩_H.

Then, we use the equality ⟨a, b⟩ = (1/2)(‖a‖² + ‖b‖² − ‖a − b‖²) together with

⟨A[g] + λh, (Id − BA^{−1})(A[g] + λh)⟩_H ≤ ‖A[g] + λh‖_H ‖(Id − BA^{−1})(A[g] + λh)‖_H ≤ κ(A, B)‖A[g] + λh‖²_H

to obtain

⟨A[g] + λh, B[g] + λh⟩_H ≥ (3/2 − κ(A, B))‖A[g] + λh‖²_H + (λ²/2)‖(Id − BA^{−1})h‖²_H − (1/2)‖(A[g] + λh) − λ(Id − BA^{−1})h‖²_H
≥ (3/2 − κ(A, B))‖A[g] + λh‖²_H − ‖A[g] + λh‖²_H − (λ²/2)‖(Id − BA^{−1})h‖²_H
≥ (1/2 − κ(A, B))‖A[g] + λh‖²_H − κ²(A, B)(λ²/2)‖h‖²_H.

The following lemma holds.

Lemma G.20. If ε(λ/N + 1) < 1 and f_0 = 0_H, then for all τ the following holds almost surely: ‖f_τ − f_*‖ ≤ ‖f_*‖ for each of the norms ‖·‖_{L2}, ‖·‖_H, and ‖·‖_{R^N}.

Proof. Note that

f_{τ+1} − f_* = [Id_{L2} − ε((λ/N)Id_{L2} + Σ_{ν_τ})](f_τ − f_*) − ε(λ/N)f_*.

Now observe that S := Id_{L2} − ε((λ/N)Id_{L2} + Σ_{ν_τ}) is symmetric with eigenvalues 0 < λ_i(S) ≤ 1 − ελ/N; therefore, its operator norm in B(L2), B(H), and R^{N×N} is at most 1 − ελ/N. Taking the norm of both sides and using the sub-additivity of the norm, we obtain:

‖f_{τ+1} − f_*‖ ≤ (1 − ελ/N)‖f_τ − f_*‖ + ε(λ/N)‖f_*‖.

Since ‖f_0 − f_*‖ = ‖f_*‖, this recurrent relation inductively yields the statement of the lemma.

Corollary G.21. Under the same conditions, ‖f_τ‖ ≤ 2‖f_*‖.

G.6 MAIN THEOREMS

Let us denote R = ‖f_*‖_{R^N}. We argue that R is a constant: the kernel K and f_* are convergent as N → ∞, which makes R bounded by some constant with probability arbitrarily close to one. By Lemma F.2, Γ_β(f) ≤ e^{2mV(f)/β}. Then,

Γ_β(f_τ) ≤ e^{(2m/β)(1/2N)‖f_τ − f_*‖²_{R^N}} ≤ e^{mR²/(Nβ)},

and we denote M_β = e^{mR²/(Nβ)} > 1.

Theorem G.22. Consider an arbitrary learning rate ε with 0 < ε(λ/N + 1) < 1 and ε ≤ (1 + M_βλ)/(4M_βN). The following inequality holds:

EV(f_T) ≤ (R²/2N) e^{−ε((1+M_βλ)/(2M_βN))T} + M_βλ[1/(2N) + (2/(1 + M_βλ))(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))]R².

Proof. To prove the theorem, we bound V(f) ≤ V(f, M_βλ) + const, which allows us to invoke Lemma G.17. After that, using strong convexity, we obtain a bound on EV(f_τ, M_βλ), and then a bound on EV(f_τ) follows straightforwardly. To get the result for V(f_τ, M_βλ), we expand V(f_{τ+1}, M_βλ) by substituting the formula for f_{τ+1}, first dealing with the term V(f_{τ+1}):

(1/2N)‖f_{τ+1}(x_N) − f_*(x_N)‖²_{R^N}
= (1/2N)‖f_τ(x_N) − f_*(x_N)‖²_{R^N} + (1/N)⟨f_{τ+1}(x_N) − f_τ(x_N), f_τ(x_N) − f_*(x_N)⟩_{R^N} + (1/2N)‖f_{τ+1}(x_N) − f_τ(x_N)‖²_{R^N}
= V(f_τ) − ε⟨Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*]⟩_H + (ε²/2N)‖(Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ)(x_N)‖²_{R^N}
≤ V(f_τ) − ε⟨Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*]⟩_H + 2ε²V(f_τ) + (ε²λ²/N³)‖f_τ(x_N)‖²_{R^N},

where we used the inequality ‖a + b‖² ≤ 2‖a‖² + 2‖b‖² and Lemma G.7, which allows us to bound ‖Σ_{ν_τ}[f_τ − f_*]‖_{R^N} ≤ ‖f_τ − f_*‖_{R^N}. Then, we analyze the regularization term:

(M_βλ/2N)‖f_{τ+1}‖²_H = (M_βλ/2N)‖f_τ‖²_H + ⟨f_{τ+1} − f_τ, (M_βλ/N)f_τ⟩_H + (ε²M_βλ/2N)‖Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ‖²_H
≤ (M_βλ/2N)‖f_τ‖²_H − ε(M_βλ/N)⟨f_τ, Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ⟩_H + (ε²M_βλ/N)‖f_τ − f_*‖²_H + (ε²M_βλ³/N³)‖f_τ‖²_H
≤ (M_βλ/2N)‖f_τ‖²_H − ε(M_βλ/N)⟨f_τ, Σ_{ν_τ}[f_τ − f_*] + (λ/N)f_τ⟩_H + (ε²M_βλ/N)(1 + 4λ²/N²)‖f_*‖²_H,

where in the first inequality we used ‖Σ_{ν_τ}[f_τ − f_*]‖_H ≤ ‖f_τ − f_*‖_H, which is due to Lemma G.7, and in the second Lemma G.20 and Corollary G.21.

Summing up the expectations of these two expressions, we obtain:

EV(f_{τ+1}, M_βλ) = EV(f_{τ+1}) + (M_βλ/2N)E‖f_{τ+1}‖²_H − C_{M_βλ}
≤ EV(f_τ) − εE⟨Σ_{f_τ}[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (M_βλ/N)f_τ⟩_H + 2ε²EV(f_τ) + (ε²λ²/N³)E‖f_τ(x_N)‖²_{R^N} + (M_βλ/2N)E‖f_τ‖²_H + (ε²M_βλ/N)(1 + 4λ²/N²)‖f_*‖²_H − C_{M_βλ}
≤ (1 + 2ε²)[EV(f_τ) + (M_βλ/2N)E‖f_τ‖²_H − C_{M_βλ}] − εE⟨Σ_{f_τ}[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (M_βλ/N)f_τ⟩_H + (ε²λ²/N³)E‖f_τ(x_N)‖²_{R^N} + (ε²M_βλ/N)(1 + 4λ²/N²)‖f_*‖²_H + 2ε²C_{M_βλ}
≤ (1 + 2ε²)EV(f_τ, M_βλ) − εE⟨K_{f_τ}D[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (M_βλ/N)f_τ⟩_H + (ε²λ/N)(2λ/N² + M_β(1 + 2λ²/N²))R².

Here we used C_λ := inf_{f∈F} L(f, λ) − inf_{f∈F} L(f) ≤ L(f_*, λ) − L(f_*) = (λ/2N)‖f_*‖²_H ≤ (λ/2N)R².

Then, applying Lemma G.16 (so that Σ_{f_τ} = K_{f_τ}D) and Lemma G.17 with ξ = M_β, A = K, B = K_{f_τ}, g = D[f_τ − f_*], h = f_τ (and λ/N in place of λ), we obtain the following bound:

−E⟨K_{f_τ}D[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (M_βλ/N)f_τ⟩_H ≤ −(1/2M_β)E‖∇_H V(f_τ, M_βλ)‖²_H + M_β(M_β² − 1)(λ²/N²)E‖f_τ‖²_H ≤ −(1/2M_β)E‖∇_H V(f_τ, M_βλ)‖²_H + 4M_β(M_β² − 1)(λ²/N²)R².

Then, by the Polyak-Łojasiewicz inequality (1/2)‖∇V‖²_H ≥ μV for the μ-strongly convex function V (restricted to span{K(·, x_i) | i = 1, ..., N}) with μ ≥ (1 + M_βλ)/N, which is due to Corollary G.8, we obtain:

−E⟨K_{f_τ}D[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (M_βλ/N)f_τ⟩_H ≤ −((1 + M_βλ)/(M_βN))EV(f_τ, M_βλ) + 4M_β(M_β² − 1)(λ²/N²)R².

Substituting this into the bound on EV(f_{τ+1}, M_βλ) gives:

EV(f_{τ+1}, M_βλ) ≤ (1 − ε(1 + M_βλ)/(M_βN) + 2ε²)EV(f_τ, M_βλ) + (ελ/N)(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))R²
≤ (1 − ε(1 + M_βλ)/(2M_βN))EV(f_τ, M_βλ) + (ελ/N)(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))R²,

where we used 2ε² ≤ ε(1 + M_βλ)/(2M_βN). Iterating the bound yields

EV(f_T, M_βλ) ≤ (R²/2N) e^{−ε((1+M_βλ)/(2M_βN))T} + (2M_βλ/(1 + M_βλ))(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))R²,

where we used the bound V(f_0, M_βλ) = V(0, M_βλ) = (1/2N)‖f_*(x_N)‖²_{R^N} − C_{M_βλ} ≤ R²/(2N).

Next, we use the following inequality:

V(f) = L(f) − min L(f) ≤ [L(f, M_βλ) − min_f L(f, M_βλ)] + [min_f L(f, M_βλ) − min_f L(f)] ≤ V(f, M_βλ) + L(f_*, M_βλ) − L(f_*) = V(f, M_βλ) + (M_βλ/2N)R²,

which finally gives us the claimed bound on EV(f_T):

EV(f_T) ≤ (R²/2N) e^{−ε((1+M_βλ)/(2M_βN))T} + M_βλ[1/(2N) + (2/(1 + M_βλ))(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))]R².

Theorem G.23 (Theorem 3.8 in the main text). Let

C = M_βλ[1/(2N) + (2/(1 + M_βλ))(4M_β(M_β² − 1)(λ/N) + 2λ/N² + M_β(1 + 2λ²/N²))]R².

Assume that 0 < ε(λ/N + 1) < 1, ε ≤ (1 + M_βλ)/(4M_βN), and e^{4mC/β} ≤ 5/4 (the latter can be achieved by taking β arbitrarily large), and define T_1 = (2M_βN/(ε(1 + M_βλ))) log(R²/(2CN)) + 1. Then, for all T ≥ T_1, it holds that

EV(f_T, λ) ≤ 2(C + (λ/N)R²) e^{−ε((1+λ)/(4N))(T−T_1)} + (8λ/(1 + λ))(λM_β²/N + 1 + 2λ(1 + λ)/N²)R².

Proof. First, we apply the previous theorem to obtain a bound on EV(f_τ), which we use to claim that the kernels K_{f_τ} and K are close to each other in expectation. With T_1 as defined above, the following inequalities hold for all τ ≥ T_1:

EV(f_τ) ≤ 2C,   EV(f_τ, λ) ≤ 2(C + (λ/N)R²).

Then, analogously to the previous theorem, we estimate:

EV(f_{τ+1}, λ) ≤ (1 + 2ε²)EV(f_τ, λ) − εE⟨K_{f_τ}D[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (λ/N)f_τ⟩_H + (ε²λ/N)(1 + 2λ(1 + λ)/N²)R².

Further, we bound the inner-product term by Lemma G.19 instead of Lemma G.17, which we used in the previous theorem: for all τ ≥ T_1,

−E⟨K_{f_τ}D[f_τ − f_*] + (λ/N)f_τ, KD[f_τ − f_*] + (λ/N)f_τ⟩_H
≤ E[((e^{2mV(f_τ)/β} − 1) − 1/2)‖∇_H V(f_τ, λ)‖²_H] + E[(e^{2mV(f_τ)/β} − 1)²(λ²/2N²)‖f_τ‖²_H]
≤ 2(e^{2mEV(f_τ, λ)/β} − 3/2)((1 + λ)/N)EV(f_τ, λ) + (2λ²M_β²/N²)R²
≤ 2((1 + λ)/N)(e^{4mC/β} − 3/2)EV(f_τ, λ) + (2λ²M_β²/N²)R²
≤ −((1 + λ)/(2N))EV(f_τ, λ) + (2λ²M_β²/N²)R².

Substituting this into the formula above, we get, for all τ ≥ T_1:

EV(f_{τ+1}, λ) ≤ (1 − ε(1 + λ)/(2N) + 2ε²)EV(f_τ, λ) + (2ελ/N)(λM_β²/N + 1 + 2λ(1 + λ)/N²)R²
≤ (1 − ε(1 + λ)/(4N))EV(f_τ, λ) + (2ελ/N)(λM_β²/N + 1 + 2λ(1 + λ)/N²)R²,

using 2ε² ≤ ε(1 + λ)/(4N). Iterating the bound yields

EV(f_T, λ) ≤ EV(f_{T_1}, λ) e^{−ε((1+λ)/(4N))(T−T_1)} + (8λ/(1 + λ))(λM_β²/N + 1 + 2λ(1 + λ)/N²)R²
≤ 2(C + (λ/N)R²) e^{−ε((1+λ)/(4N))(T−T_1)} + (8λ/(1 + λ))(λM_β²/N + 1 + 2λ(1 + λ)/N²)R².

Corollary G.24 (Convergence to the solution of KRR / convergence to the Gaussian Process posterior mean function). Under the assumptions of both previous theorems, we have:

E‖f_T − f^λ_*‖²_{L2} ≤ max_{x∈X} K(x, x) [4N(C + (λ/N)R²) e^{−ε((1+λ)/(4N))(T−T_1)} + (16Nλ/(1 + λ))(λM_β²/N + 1 + 2λ(1 + λ)/N²)R²].

Proof. By Lemma D.7, V(f, λ) = (1/2N)‖f(x_N) − f^λ_*(x_N)‖²_{R^N} + (λ/2N)‖f − f^λ_*‖²_H. Then, by the previous theorem, we get a bound on (1/2N)E‖f_T(x_N) − f^λ_*(x_N)‖²_{R^N}, and by Lemmas G.10 and G.8 we majorize the L2 norm by the R^N norm. The corollary then follows.

Lemma G.25 (Lemma 4.1 in the main text). The following convergence holds almost surely in x ∈ X: h_T(·) → GP(0_{L2(ρ)}, K) as T → ∞.

Proof. From (5), the covariance of h_T is K independently of T. Thus, it remains to show that the limit is Gaussian, which holds by the central limit theorem almost surely in x ∈ X:

h_T(x) = (1/√T) Σ_{i=1}^T h_{T,i}(x) → N(0, K(x, x)),

where the individual trees h_{T,i} are centered and i.i.d. (with the same distribution as h_1).

H IMPLEMENTATION DETAILS

In the experiments, we fix σ = 10⁻² (scale of the kernel) and δ = 10⁻⁴ (scale of the noise), which theoretically can be taken arbitrarily. As a hyperparameter (estimated on the validation set), we consider β ∈ {10⁻², 10⁻¹, 1}. We use the standard CatBoost library and add the Gumbel noise term to the "L2" scoring function when selecting trees; this scoring function is implemented in CatBoost out of the box but is not used by SGB and SGLB since it is not the library default. Moreover, we do not consider subsampling of the data (as is also the case for SGLB), and, differently from SGB and SGLB, we disable the "boost-from-average" option. Finally, we set the l2-leaf-reg value to 0, as SGLB does.
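The standard CatBoost options mentioned above can be collected into a configuration sketch. This is hedged: the Gumbel-noise scoring itself lives in the authors' fork (see the GitHub link in the footnotes), and β, σ, δ are the paper's symbols rather than CatBoost parameter names; only the options below are stock CatBoost.

```python
from catboost import CatBoostRegressor

# Sketch of the experimental settings described above (assumption: the
# Gumbel-noise split scoring is added separately, in the authors' fork).
model = CatBoostRegressor(
    score_function="L2",       # scoring function the Gumbel noise is added to
    bootstrap_type="No",       # no data subsampling, matching SGLB here
    boost_from_average=False,  # disabled, differently from SGB/SGLB defaults
    l2_leaf_reg=0.0,           # set to zero, as SGLB does
    random_seed=0,
    verbose=False,
)
```

With this configuration, the per-tree leaf values reduce to plain mean residuals, matching the kernel form used in the analysis.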



Footnotes:
- The finiteness of V is important for our analysis, and it usually holds in practice; see Section 3.2.
- In fact, the procedure can be extended to arbitrary trees, but this would over-complicate the formulation of the algorithm and would not change the space of tree ensembles, as any non-symmetric tree can be represented as a sum of symmetric ones.
- A standard approach is to quantize the feature such that all n + 1 buckets have approximately the same number of training samples.
- Maximizing (3) is equivalent to minimizing the squared error between the residuals and the mean values in the leaves.
- Note that for oblivious decision trees, changing the order of splits does not affect the obtained partition. Hence, we assume that each tree is defined by an unordered set of splits.
- Code: https://github.com/TakeOver/Gradient-Boosting-Performs-Gaussian-Process-Inference
- We further use the notation K_{x_i} := K(·, x_i).
- Note that ‖·‖_{R^N} indeed becomes a norm once we restrict our space to span{K(·, x_i) | i = 1, ..., N}.
- When N → ∞, K converges to a certain kernel. Thus, max_{x∈X} K(x, x) can be estimated by a constant with probability arbitrarily close to one.




Figure 1: KGB on a synthetic dataset.

π(·) = lim_{β→∞} p(·|f, β). It follows from Remark 3.1 that we also have π(·) = p(·|f_*, β).



al. (2018); Li & Liang (2018); Allen-Zhu et al. (2019); Du et al. (2019), which study deep learning convergence through the perspective of kernel methods. Though such works share similarities with ours, there are fundamental differences. First, our work is not in the over-parametrization regime: our kernel-method correspondence holds for tree ensembles with fixed parameters, and the correspondence is achieved as the number of iterations goes to infinity. It is worth noting that the kernel-method perspective on deep learning essentially establishes that each trained deep learning model is a sample from a Gaussian Process posterior.

where the definition of the Fréchet differential is analogous to Definition C.1, with the only difference that D_f F takes values in B(H, R). The second Fréchet differential is denoted by D²_f.


Proof. Consider f = K(·, x_N)c, where c ∈ R^N. Now consider V(f) and differentiate it with respect to c. Equating the derivative to zero, we get K(x_N, x_N)c − (y_N + v) = 0 for some v ∈ ker K(x_N, x_N). Note that (y_N + ker K(x_N, x_N)) ∩ Im K(x_N, x_N) ≠ ∅. Then, for any v such that y_N + v ∈ Im K(x_N, x_N), there exists a solution c_v ∈ K(x_N, x_N)^†(y_N + v) + ker K(x_N, x_N). This follows from the fact that K(x_N, x_N)K(x_N, x_N)^† is an orthoprojector onto Im K(x_N, x_N). The existence and uniqueness of f_* then follow.

Now, consider a linear space F ⊂ L2(ρ) of all possible ensembles of trees from V, and define the unique function f_* as above. Then, the following two lemmas hold. We have, for small enough α > 0, a contradiction with the definition of f_*.

Lemma D.6.

Proof. We need to prove it only for V(f) without regularization, as for the regularized objective it follows immediately (note that f_* is the minimizer of the non-regularized objective). By definition, and by the previous lemma, the last equality follows.

Proof. It is sufficient to check the claim on the basis, so we take g = K_p(·, x). Then, the second equality holds by the definition of the operator K_p.

For a weak learner ν, we define a covariance operator Σ_ν. Also, for an arbitrary probability distribution p over V, we denote Σ_p = Σ_{ν∈V} Σ_ν p(ν). These operators are typically referred to as covariance operators. Let us formulate and prove several lemmas about the RKHS structure and the operators Σ, Σ_f, Σ_ν.

Lemma G.5 (Courant-Fischer). Let A be an n × n real symmetric matrix and λ_1 ≤ ... ≤ λ_n its eigenvalues.

Here 1_{n×n} is the matrix of size n × n consisting of ones. We note that ‖1_{n×n}‖ = n, and the statement of the lemma follows.

Lemma G.7 (Covariance majorization). The following operator inequality holds for probability distributions p, p' over V, where p is arbitrary and p' does not vanish at any ν ∈ V.

Proof. Consider the following operators. As AB and BA have the same spectra, we further study the spectrum of K_N. We have λ_max(K_N) ≤ 1 by Lemma G.6, and then λ_max(Σ_p) ≤ 1 follows.

G.3 SYMMETRY OF OPERATORS

In this section, we establish the symmetry of various operators with respect to the norms L2, H, and R^N. These results are mainly required to claim that the spectral radii of these operators coincide with their operator norms in the respective spaces, though we use the symmetry of the operators not only in this way.

Lemma G.11 (Universal symmetry of covariance operators in H). The operator Σ_p for any p is symmetric w.r.t. the dot product of H(K_{p'}) for any non-singular p'.

Proof. First, let us prove the statement for a non-singular distribution p. Using the trick f(x) = ⟨K(·, x), f⟩_H, we can rewrite the relevant quantity, and it immediately follows that Σ_p is indeed symmetric w.r.t. the dot product of H(K_p). Finally, we can use the continuity argument (available due to the intrinsic finite dimension) to conclude that the symmetry must hold for arbitrary distributions, in particular for p = δ_ν, which corresponds to Σ_ν.

Lemma G.12 (Universal symmetry of covariance operators in L2). The operator Σ_p for any p is symmetric w.r.t. the dot product of L2.

Proof. We proceed similarly, using the trick ⟨K_p(·, x_i), f⟩_{H(K_p)} = ⟨K_p^{−1}K_p(·, x_i), f⟩_{L2}. It allows us to rewrite the relevant quantity, from which it immediately follows that Σ_p is indeed symmetric w.r.t. the dot product of L2.

Lemma G.13 (Universal symmetry of kernel operators in H). The operator K_p for any p is symmetric w.r.t. the dot product of H.

Proof. We decompose ⟨K_p f, g⟩_{L2} as ∫∫ K_p(x, y) f(x) g(y) ρ(dx)ρ(dy), and the symmetry follows from K_p(x, y) = K_p(y, x).

