ANALYZING TREE ARCHITECTURES IN ENSEMBLES VIA NEURAL TANGENT KERNEL

Abstract

A soft tree is an actively studied variant of a decision tree that updates its splitting rules via the gradient method. Although soft trees can take various architectures, the impact of the architecture is not theoretically well understood. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that, in ensemble learning with an infinite number of trees, only the number of leaves at each depth is relevant to the tree architecture. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees such as decision lists does not degenerate as they become infinitely deep. This is in contrast to perfect binary trees, whose NTK is known to degenerate, leading to worse generalization performance for deeper trees.

1. INTRODUCTION

Ensemble learning is one of the most important machine learning techniques used in real-world applications. By combining the outputs of multiple predictors, it is possible to obtain robust results for complex prediction problems. Decision trees are often used as weak learners in ensemble learning (Breiman, 2001; Chen & Guestrin, 2016; Ke et al., 2017), and they can have a variety of structures, such as different tree depths and symmetric or asymmetric shapes. In the training process of tree ensembles, even a decision stump (Iba & Langley, 1992), a decision tree of depth 1, is known to be able to achieve zero training error as the number of trees increases (Freund & Schapire, 1996). However, generalization performance varies depending on the weak learners (Liu et al., 2017), and the theoretical properties of this impact are not well known, so the structure of the weak learners has to be tuned empirically by trial and error.

In this paper, we focus on a soft tree (Kontschieder et al., 2015; Frosst & Hinton, 2017) as a weak learner. A soft tree is a variant of a decision tree that inherits characteristics of neural networks. Instead of searching for splitting rules with a greedy method (Quinlan, 1986; Breiman et al., 1984), soft trees make the decision rules soft and update all model parameters simultaneously with the gradient method. Soft trees have been actively studied in recent years in terms of predictive performance (Kontschieder et al., 2015; Popov et al., 2020; Hazimeh et al., 2020), interpretability (Frosst & Hinton, 2017; Wan et al., 2021), and potential techniques in real-world applications such as pre-training and fine-tuning (Ke et al., 2019; Arik & Pfister, 2019). In addition, a soft tree can be interpreted as a Mixture-of-Experts (Jordan & Jacobs, 1993; Shazeer et al., 2017; Lepikhin et al., 2021), a practical technique for balancing computational cost and prediction performance.
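To illustrate how a soft tree differs from a greedily grown one, the following sketch (our own hypothetical construction, not code from the paper) trains a single soft decision stump by gradient descent: the splitting rule is a sigmoid gate whose weights are updated jointly with the leaf values, instead of being fixed by an exhaustive split search.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy 1-D regression data: a step function with the jump at x = 0.3.
X = rng.uniform(-1, 1, size=(64, 1))
y = np.where(X[:, 0] > 0.3, 1.0, -1.0)

# Soft decision stump: f(x) = s * pi[0] + (1 - s) * pi[1], s = sigmoid(w.x + b).
w, b = rng.standard_normal(1), 0.0
pi = rng.standard_normal(2)

lr = 0.5
for _ in range(2000):
    s = sigmoid(X @ w + b)            # soft routing probability to the left leaf
    pred = s * pi[0] + (1 - s) * pi[1]
    err = pred - y                    # squared-loss residual
    # All parameters (splitting rule and leaf values) are updated simultaneously.
    ds = err * (pi[0] - pi[1]) * s * (1 - s)
    w -= lr * (X.T @ ds) / len(X)
    b -= lr * ds.mean()
    pi[0] -= lr * (err * s).mean()
    pi[1] -= lr * (err * (1 - s)).mean()

mse = float(((pred - y) ** 2).mean())
print(mse)  # the stump fits the step without enumerating candidate splits
```

Note that the learned boundary is soft: points near the threshold are routed to both leaves with intermediate probability, which is what makes the model end-to-end differentiable.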
To theoretically analyze soft tree ensembles, Kanoh & Sugiyama (2022) introduced the Neural Tangent Kernel (NTK) (Jacot et al., 2018) induced by them. The NTK framework analytically describes the behavior of ensemble learning with infinitely many soft trees, which leads to several non-trivial properties such as global convergence of training and the effect of parameter sharing in an oblivious tree (Popov et al., 2020; Prokhorenkova et al., 2018). However, their analysis is limited to a specific type of trees, perfect binary trees, and the theoretical properties of other tree architectures remain unrevealed.

Figure 1 illustrates representative tree architectures and their associated space partitioning in the case of a two-dimensional space. Note that the partitions are not axis-parallel, as we are considering soft trees. Not only symmetric trees, as shown in (a) and (b), but also asymmetric trees (Rivest, 1987), as shown in (c), are often used in practical applications (Tanno et al., 2019). Moreover, the structure in (d) corresponds to rule set ensembles (Friedman & Popescu, 2008), a combination of rules used to obtain predictions, which can be viewed as a variant of trees. Although each of these architectures partitions the space differently and is used in practice, it is not theoretically clear whether such architectural differences affect the resulting predictive performance in ensemble learning.

In this paper, we study the impact of tree architectures of soft tree ensembles from the NTK viewpoint. We analytically derive the NTK that characterizes the training behavior of soft tree ensembles with arbitrary tree architectures and theoretically analyze the generalization performance. Our contributions can be summarized as follows:

• The NTK of soft tree ensembles is characterized only by the number of leaves at each depth. We derive the NTK induced by an infinite rule set ensemble (Theorem 2). Using this kernel, we obtain a formula for the NTK induced by an infinite ensemble of trees with arbitrary architectures (Theorem 3), which subsumes (Kanoh & Sugiyama, 2022, Theorem 1) as a special case (perfect binary trees). Interestingly, the kernel is determined by the number of leaves at each depth, which means that non-isomorphic trees can induce the same NTK (Corollary 1).

• Decision boundary sharing does not affect the generalization performance. Since the kernel is determined by the number of leaves at each depth, infinite ensembles of the trees and rule sets shown in Figure 1(a) and (d) induce exactly the same NTK. This means that the way decision boundaries are shared does not change the model behavior in the limit of an infinite ensemble (Corollary 2).

• Kernel degeneracy does not occur in deep asymmetric trees. The NTK induced by perfect binary trees degenerates as the trees get deeper: the kernel values become almost identical for deep trees even when the inner products between input pairs differ, resulting in poor performance in numerical experiments. In contrast, we find that the NTK does not degenerate for trees that grow in only one direction (Proposition 1); hence generalization performance does not worsen even if the trees become infinitely deep (Proposition 2, Figure 8).
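The leaves-per-depth claim can be checked numerically. The sketch below (our own construction, not code from the paper) compares the empirical NTK Θ(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩ of two large finite ensembles that have the same number of leaves at each depth but non-isomorphic architectures: depth-2 perfect binary trees (as in Figure 1(a)) and rule sets of four independent length-2 rules (as in Figure 1(d)). The initialization choices (standard normal weights and leaf values, plain sigmoid gates, 1/√M output scaling) are our assumptions; the paper's scaled decision function is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ntk_perfect_binary_depth2(x, xp, M, F, rng):
    """Empirical NTK of M depth-2 perfect binary soft trees (3 internal
    nodes, 4 leaves), with the ensemble output scaled by 1/sqrt(M)."""
    w = rng.standard_normal((M, 3, F))   # internal-node weights
    pi = rng.standard_normal((M, 4))     # leaf values

    def feats(v):
        s = sigmoid(w @ v)               # (M, 3) gate activations
        s1, s2, s3 = s[:, 0], s[:, 1], s[:, 2]
        # leaf-arrival probabilities (root = node 0, children = nodes 1, 2)
        p = np.stack([s1 * s2, s1 * (1 - s2),
                      (1 - s1) * s3, (1 - s1) * (1 - s3)], axis=1)
        # scalar factors of the gradient w.r.t. each internal node's weights
        g1 = s1 * (1 - s1) * (s2 * pi[:, 0] + (1 - s2) * pi[:, 1]
                              - s3 * pi[:, 2] - (1 - s3) * pi[:, 3])
        g2 = s1 * s2 * (1 - s2) * (pi[:, 0] - pi[:, 1])
        g3 = (1 - s1) * s3 * (1 - s3) * (pi[:, 2] - pi[:, 3])
        return p, np.stack([g1, g2, g3], axis=1)

    p, g = feats(x)
    pp, gp = feats(xp)
    # <grad f(x), grad f(x')>: leaf part + internal part (carries the x.x' factor)
    return ((p * pp).sum() + float(x @ xp) * (g * gp).sum()) / M

def ntk_rule_set_depth2(x, xp, M, F, rng):
    """Empirical NTK of M rule sets, each with 4 independent length-2 rules:
    the same 4 leaves at depth 2, but no shared decision boundaries."""
    w = rng.standard_normal((M, 4, 2, F))  # 4 rules x 2 gates each
    pi = rng.standard_normal((M, 4))

    def feats(v):
        s = sigmoid(w @ v)                 # (M, 4, 2)
        p = s[..., 0] * s[..., 1]          # rule firing probability
        g1 = pi * s[..., 0] * (1 - s[..., 0]) * s[..., 1]
        g2 = pi * s[..., 0] * s[..., 1] * (1 - s[..., 1])
        return p, np.stack([g1, g2], axis=-1)

    p, g = feats(x)
    pp, gp = feats(xp)
    return ((p * pp).sum() + float(x @ xp) * (g * gp).sum()) / M

M, F = 20000, 2
x = np.array([0.6, 0.8])
xp = np.array([1.0, 0.0])
k_tree = ntk_perfect_binary_depth2(x, xp, M, F, np.random.default_rng(1))
k_rule = ntk_rule_set_depth2(x, xp, M, F, np.random.default_rng(2))
print(k_tree, k_rule)  # approximately equal for large M
```

Despite different parameterizations (3 shared internal nodes versus 8 independent ones), the two kernel estimates agree up to Monte Carlo error, consistent with the claim that only the number of leaves at each depth matters.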

2. PRELIMINARY

We formulate soft trees, which we use as weak learners in ensemble learning, and review the basic properties of the NTK and the existing result for perfect binary trees. We consider weak learners with N internal nodes and L leaf nodes, where internal nodes and leaf nodes are indexed from 1 to N and from 1 to L, respectively. For simplicity, we assume that N and L are the same across different weak learners throughout the paper.



Figure 1: Schematic image of decision boundaries in an input space split by the (a) perfect binary tree, (b) oblivious tree, (c) decision list, and (d) rule set.

SOFT TREES

Let us perform regression through an ensemble of M soft trees. Given a data matrix x ∈ R^{F×N} composed of N training samples x_1, ..., x_N with F features, each weak learner, indexed by m ∈ [M] = {1, ..., M}, has a parameter matrix w_m ∈ R^{F×N} for internal nodes and π_m ∈ R^{1×L} for leaf nodes. They are defined in the following format:

    x = (x_1, ..., x_N),  w_m = (w_{m,1}, ..., w_{m,N}),  π_m = (π_{m,1}, ..., π_{m,L}),
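To make this formulation concrete, here is a minimal, hypothetical sketch of the forward pass of a single soft tree with an arbitrary architecture: the probability of reaching a leaf is the product, over its ancestor nodes, of a sigmoid gate or its complement, and the output is the probability-weighted sum of leaf values π. The `leaf_paths` encoding of architectures and the plain (unscaled) sigmoid gate are our own simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_output(x, w, pi, leaf_paths):
    """Forward pass of one soft tree with an arbitrary architecture.

    x          : (F,) single input
    w          : (F, N) weights, one column per internal node
    pi         : (L,) leaf values
    leaf_paths : for each leaf, a list of (node_index, direction) pairs over
                 its ancestors; +1 means the sigmoid gate, -1 its complement.
    """
    s = sigmoid(w.T @ x)                   # (N,) gate activations
    out = 0.0
    for leaf, path in enumerate(leaf_paths):
        p = 1.0
        for node, direction in path:       # product over ancestor gates
            p *= s[node] if direction == +1 else 1.0 - s[node]
        out += pi[leaf] * p                # soft (probabilistic) routing
    return out

# A depth-2 decision list (asymmetric tree, cf. Figure 1(c)): leaf 0 exits at
# the root's "-" branch; leaves 1 and 2 sit under internal node 1.
leaf_paths = [[(0, -1)],
              [(0, +1), (1, +1)],
              [(0, +1), (1, -1)]]
rng = np.random.default_rng(0)
F, N, L = 3, 2, 3
w, pi = rng.standard_normal((F, N)), rng.standard_normal(L)
x = rng.standard_normal(F)
print(soft_tree_output(x, w, pi, leaf_paths))
```

Since the leaf-arrival probabilities of any architecture sum to one, setting all leaf values to 1 returns exactly 1, which is a convenient sanity check for a `leaf_paths` encoding.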

