ANALYZING TREE ARCHITECTURES IN ENSEMBLES VIA NEURAL TANGENT KERNEL

Abstract

A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, the impact of the architecture is not well understood theoretically. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to perfect binary trees, whose NTK is known to degenerate, leading to worse generalization performance for deeper trees.

1. INTRODUCTION

Ensemble learning is one of the most important machine learning techniques used in real-world applications. By combining the outputs of multiple predictors, it is possible to obtain robust results for complex prediction problems. Decision trees are often used as weak learners in ensemble learning (Breiman, 2001; Chen & Guestrin, 2016; Ke et al., 2017), and they can have a variety of structures, such as different tree depths and symmetric or asymmetric shapes. In the training process of tree ensembles, even a decision stump (Iba & Langley, 1992), a decision tree with depth 1, is known to achieve zero training error as the number of trees increases (Freund & Schapire, 1996). However, generalization performance varies depending on the weak learners (Liu et al., 2017), and the theoretical properties of their impact are not well known, which necessitates empirical trial-and-error adjustment of the structure of weak learners.

In this paper, we focus on a soft tree (Kontschieder et al., 2015; Frosst & Hinton, 2017) as a weak learner. A soft tree is a variant of a decision tree that inherits characteristics of neural networks. Instead of searching splitting rules greedily (Quinlan, 1986; Breiman et al., 1984), soft trees make the decision rules soft and update the entire model parameters simultaneously using the gradient method. Soft trees have been actively studied in recent years in terms of predictive performance (Kontschieder et al., 2015; Popov et al., 2020; Hazimeh et al., 2020), interpretability (Frosst & Hinton, 2017; Wan et al., 2021), and techniques useful in real-world applications such as pre-training and fine-tuning (Ke et al., 2019; Arik & Pfister, 2019). In addition, a soft tree can be interpreted as a Mixture-of-Experts (Jordan & Jacobs, 1993; Shazeer et al., 2017; Lepikhin et al., 2021), a practical technique for balancing computational cost and prediction performance.
To theoretically analyze soft tree ensembles, Kanoh & Sugiyama (2022) introduced the Neural Tangent Kernel (NTK) (Jacot et al., 2018) induced by them. The NTK framework analytically describes the behavior of ensemble learning with infinitely many soft trees, which leads to several non-trivial properties such as global convergence of training and the effect of parameter sharing in an oblivious tree (Popov et al., 2020; Prokhorenkova et al., 2018). However, their analysis is limited to a specific type of tree, the perfect binary tree, and the theoretical properties of other tree architectures remain unrevealed.

Figure 1 illustrates representative tree architectures and their associated space partitionings in the case of a two-dimensional space. Note that the partitions are not axis-parallel, as we are considering soft trees. Not only symmetric trees, as shown in (a) and (b), but also asymmetric trees (Rivest, 1987), as shown in (c), are often used in practical applications (Tanno et al., 2019). Moreover, the structure in (d) corresponds to rule set ensembles (Friedman & Popescu, 2008), combinations of rules used to obtain predictions, which can be viewed as a variant of trees. Although each of these architectures induces a different space partitioning and is used in practice, it is not theoretically clear whether such architectural differences affect the resulting predictive performance in ensemble learning.

In this paper, we study the impact of tree architectures on soft tree ensembles from the NTK viewpoint. We analytically derive the NTK that characterizes the training behavior of soft tree ensembles with arbitrary tree architectures and theoretically analyze the generalization performance. Our contributions can be summarized as follows:

• The NTK of soft tree ensembles is characterized only by the number of leaves at each depth. We derive the NTK induced by an infinite rule set ensemble (Theorem 2). Using this kernel, we obtain a formula for the NTK induced by an infinite ensemble of trees with arbitrary architectures (Theorem 3), which subsumes (Kanoh & Sugiyama, 2022, Theorem 1) as a special case (perfect binary trees). Interestingly, the kernel is determined by the number of leaves at each depth, which means that non-isomorphic trees can induce the same NTK (Corollary 1).

• Decision boundary sharing does not affect the generalization performance. Since the kernel is determined by the number of leaves at each depth, infinite ensembles of the trees and rule sets shown in Figure 1 (a) and (d) induce exactly the same NTKs. This means that the way decision boundaries are shared does not change the model behavior in the limit of an infinite ensemble (Corollary 2).

• Kernel degeneracy does not occur in deep asymmetric trees. The NTK induced by perfect binary trees degenerates as the trees get deeper: the kernel values become almost identical for deep trees even when the inner products between input pairs differ, resulting in poor performance in numerical experiments. In contrast, we find that the NTK does not degenerate for trees that grow in only one direction (Proposition 1); hence the generalization performance does not worsen even as trees become infinitely deep (Proposition 2, Figure 8).

2. PRELIMINARY

We formulate soft trees, which we use as weak learners in ensemble learning, and review the basic properties of the NTK and the existing result for perfect binary trees.

2.1. SOFT TREES

Let us perform regression through an ensemble of M soft trees. Given a data matrix x ∈ R^{F×N} composed of N training samples x_1, ..., x_N with F features, each weak learner, indexed by m ∈ [M] = {1, ..., M}, has a parameter matrix w_m ∈ R^{F×N} for internal nodes and π_m ∈ R^{1×L} for leaf nodes. They are defined in the following format:

x = (x_1, ..., x_N),   w_m = (w_{m,1}, ..., w_{m,N}),   π_m = (π_{m,1}, ..., π_{m,L}),

where each x_i and w_{m,n} is a column vector, internal nodes are indexed from 1 to N, and leaf nodes from 1 to L. With a slight abuse of notation, N denotes both the number of training samples and the number of internal nodes; which is meant is always clear from context. For simplicity, we assume that N and L are the same across different weak learners throughout the paper.

2.1.1. INTERNAL NODES

In a soft tree, the splitting operation at an internal node n ∈ [N] = {1, ..., N} is not completely binary. To formulate the probabilistic splitting operation, we introduce the notation ℓ ↙ n (resp. n ↘ ℓ), a binary relation that is true if a leaf ℓ ∈ [L] = {1, ..., L} belongs to the left (resp. right) subtree of a node n, and false otherwise. We also use an indicator function 1_Q on an argument Q; that is, 1_Q = 1 if Q is true and 1_Q = 0 otherwise. Every leaf node ℓ ∈ [L] holds the probability that data reach it, formulated as a function µ_{m,ℓ}: R^F × R^{F×N} → [0, 1] defined as

µ_{m,ℓ}(x_i, w_m) = ∏_{n=1}^{N} σ(w_{m,n}^⊤ x_i)^{1_{ℓ↙n}} (1 − σ(w_{m,n}^⊤ x_i))^{1_{n↘ℓ}},   (1)

where the first factor captures the flow to the left, the second the flow to the right, and σ: R → [0, 1] represents a softened Boolean operation at internal nodes. The obtained value µ_{m,ℓ}(x_i, w_m) is the probability of a sample x_i reaching a leaf ℓ in a soft tree m with its parameter matrix w_m. If the output of the decision function σ takes only 0.0 or 1.0, this operation realizes the hard splitting used in typical decision trees. We do not explicitly use a bias term for simplicity, as it can be technically treated as an additional feature. Internal nodes use a sigmoid-like decision function such as the scaled error function

σ(p) = (1/2) erf(αp) + 1/2 = (1/2) (2/√π) ∫_0^{αp} e^{−t²} dt + 1/2,

the two-class sparsemax function σ(p) = sparsemax([αp, 0]) (Martins & Astudillo, 2016), or the two-class entmax function σ(p) = entmax([αp, 0]) (Peters et al., 2019). More precisely, any continuous function is admissible if it is rotationally symmetric about the point (0, 1/2) and satisfies lim_{p→∞} σ(p) = 1, lim_{p→−∞} σ(p) = 0, and σ(0) = 0.5. Therefore, the theoretical results presented in this paper hold for a variety of sigmoid-like decision functions. When the scaling factor α ∈ R_+ (Frosst & Hinton, 2017) is infinitely large, sigmoid-like decision functions become step functions and represent the (hard) Boolean operation.

Equation 1 applies to arbitrary binary tree architectures. Moreover, if the flow to the right node (1 − σ(w_{m,n}^⊤ x_i)) is replaced with 0, the resulting model corresponds to a rule set (Friedman & Popescu, 2008), which can be represented as a linear graph. Note that ∑_{ℓ=1}^{L} µ_{m,ℓ}(x_i, w_m) = 1 is always guaranteed for any soft tree, while it is not guaranteed for rule sets.
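As a concrete illustration, the product in Equation 1 can be evaluated directly. The sketch below is ours, not the paper's code: it computes µ_{m,ℓ} for a single depth-2 perfect binary tree with the scaled error function as σ, and shows that the leaf probabilities sum to 1, as noted above.

```python
from math import erf

def sigma(p, alpha=2.0):
    # scaled error function: a sigmoid-like decision function
    return 0.5 * erf(alpha * p) + 0.5

def leaf_probabilities(x, w):
    # depth-2 perfect binary tree: node 1 is the root, node 2 (resp. 3)
    # is its left (resp. right) child; leaves are ordered left to right
    s1, s2, s3 = (sigma(sum(wf * xf for wf, xf in zip(wn, x))) for wn in w)
    return [s1 * s2,              # left of node 1, left of node 2
            s1 * (1 - s2),        # left of node 1, right of node 2
            (1 - s1) * s3,        # right of node 1, left of node 3
            (1 - s1) * (1 - s3)]  # right of node 1, right of node 3

x = [0.3, -0.7]
w = [[1.0, 0.2], [-0.5, 0.4], [0.1, 0.9]]  # one weight vector per internal node
mu = leaf_probabilities(x, w)
print(sum(mu))  # 1.0 up to floating point, for any x and w
```

Replacing the right-flow factors `(1 - s)` with 0 would give the rule-set variant, for which the probabilities no longer sum to 1.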

2.1.2. LEAF NODES

The prediction for each x_i from a weak learner m parameterized by w_m and π_m, represented as a function f_m: R^F × R^{F×N} × R^{1×L} → R, is given by

f_m(x_i, w_m, π_m) = ∑_{ℓ=1}^{L} π_{m,ℓ} µ_{m,ℓ}(x_i, w_m),

where π_{m,ℓ} denotes the response of a leaf ℓ of the weak learner m. In other words, the prediction is the average of the leaf values π_{m,ℓ} weighted by µ_{m,ℓ}(x_i, w_m), the probability of assigning the sample x_i to the leaf ℓ. In this model, w_m and π_m are updated during training with a gradient method. If µ_{m,ℓ}(x_i, w_m) takes the value 1.0 for one leaf and 0.0 for all other leaves, the behavior of the soft tree is equivalent to that of a typical decision tree.
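For intuition, here is a minimal example of ours (not from the paper) of the simplest weak learner, a depth-1 soft tree (a soft stump): its prediction interpolates between the two leaf responses, and with a very large scaling factor α it reduces to a hard decision-tree prediction.

```python
from math import erf

def sigma(p, alpha):
    return 0.5 * erf(alpha * p) + 0.5

def stump_predict(x, w, leaf, alpha):
    # depth-1 soft tree: f = pi_1 * sigma(w.x) + pi_2 * (1 - sigma(w.x))
    s = sigma(sum(wf * xf for wf, xf in zip(w, x)), alpha)
    return leaf[0] * s + leaf[1] * (1 - s)

x = [0.8, -0.6]
w = [1.0, 0.5]      # w.x = 0.5 > 0, so the sample flows left as alpha grows
leaf = [2.0, -3.0]  # leaf responses pi_1, pi_2
print(stump_predict(x, w, leaf, alpha=1e6))  # → 2.0 (hard decision-tree limit)
print(stump_predict(x, w, leaf, alpha=1.0))  # a soft mixture of both leaves
```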

2.1.3. AGGREGATION

When aggregating the outputs of multiple weak learners in ensemble learning, we divide the sum of the outputs by the square root of the number of weak learners:

f(x_i, w, π) = (1/√M) ∑_{m=1}^{M} f_m(x_i, w_m, π_m).

This 1/√M scaling is known to be essential in the NTK literature for applying the weak law of large numbers (Jacot et al., 2018). Each model parameter w_{m,n} and π_{m,ℓ} is initialized with zero-mean i.i.d. Gaussians with unit variance. We refer to such an initialization as the NTK initialization.
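A small simulation of ours illustrates why the 1/√M scaling matters: under the NTK initialization the ensemble output stays O(1) no matter how many weak learners are aggregated. We use soft stumps (depth 1) with the scaled error function and α = 2, and, anticipating the notation of Section 2.2, compare the empirical standard deviation with √(2 T(x, x)) computed from the closed form in Theorem 1.

```python
import numpy as np
from math import erf, asin, pi

alpha = 2.0

def ensemble_output(rng, x, M):
    # M independent soft stumps under the NTK initialization,
    # aggregated with the 1/sqrt(M) scaling
    w = rng.standard_normal((M, x.size))   # internal-node parameters
    p = rng.standard_normal((M, 2))        # leaf parameters
    s = np.array([0.5 * erf(alpha * z) + 0.5 for z in w @ x])
    return float(np.sum(p[:, 0] * s + p[:, 1] * (1 - s)) / np.sqrt(M))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])  # unit-norm input
outs = [ensemble_output(rng, x, 2000) for _ in range(500)]
emp_std = float(np.std(outs))
# Var f = 2 T(x, x), independent of M; T from Equation 6 with Sigma = 1
theo_std = (2 * (asin(alpha**2 / (alpha**2 + 0.5)) / (2 * pi) + 0.25)) ** 0.5
print(emp_std, theo_std)  # close to each other
```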

2.2. NEURAL TANGENT KERNEL

For any learning model function g, the NTK induced by g at a training time τ is formulated as a matrix H*_τ ∈ R^{N×N} whose (i, j) ∈ [N] × [N] component is defined as

[H*_τ]_{ij} := Θ*_τ(x_i, x_j) := ⟨∂g(x_i, θ_τ)/∂θ_τ, ∂g(x_j, θ_τ)/∂θ_τ⟩,   (4)

where Θ*_τ: R^F × R^F → R, the bracket ⟨·, ·⟩ denotes the inner product, and θ_τ ∈ R^P is a concatenated vector of all trainable parameters at the training time τ. The asterisk "*" indicates that the model is arbitrary: the model function g: R^F × R^P → R used in Equation 4 is not limited to neural networks and can be a variety of models. If we use soft trees introduced in Section 2.1 as weak learners, the NTK is formulated as

∑_{m=1}^{M} ∑_{n=1}^{N} ⟨∂f(x_i, w, π)/∂w_{m,n}, ∂f(x_j, w, π)/∂w_{m,n}⟩ + ∑_{m=1}^{M} ∑_{ℓ=1}^{L} ⟨∂f(x_i, w, π)/∂π_{m,ℓ}, ∂f(x_j, w, π)/∂π_{m,ℓ}⟩.

If the NTK does not change from its initial value during training, one can describe the behavior of functional gradient descent with an infinitesimal step size under the squared loss using kernel ridge-less regression with the NTK (Jacot et al., 2018; Lee et al., 2019), which leads to a theoretical understanding of the training behavior. Such a property also gives a data-dependent generalization bound (Bartlett & Mendelson, 2003), which is important in the context of over-parameterization. The kernel does not change from its initial value during gradient descent with an infinitesimal step size when considering an infinitely wide neural network (Jacot et al., 2018) or an infinite ensemble of soft perfect binary trees (Kanoh & Sugiyama, 2022) under the NTK initialization. Models with the same limiting NTK, which is the NTK induced by a model with infinite width or infinitely many weak learners, have exactly equivalent training behavior in function space. The NTK induced by a soft tree ensemble with infinitely many perfect binary trees, that is, the NTK when M → ∞, is known to be obtained in closed form at initialization:

Theorem 1 (Kanoh & Sugiyama, 2022).
Let u ∈ R^F be a column vector sampled from zero-mean i.i.d. Gaussians with unit variance. The NTK for an ensemble of soft perfect binary trees with tree depth D converges in probability to the following deterministic kernel as M → ∞:

Θ^{(D,PB)}(x_i, x_j) := lim_{M→∞} Θ^{(D,PB)}_0(x_i, x_j)
= 2^D D Σ(x_i, x_j) (T(x_i, x_j))^{D−1} Ṫ(x_i, x_j)  [contribution from internal nodes]
+ (2 T(x_i, x_j))^D  [contribution from leaves],   (5)

where Σ(x_i, x_j) := x_i^⊤ x_j, T(x_i, x_j) := E[σ(u^⊤ x_i) σ(u^⊤ x_j)], and Ṫ(x_i, x_j) := E[σ̇(u^⊤ x_i) σ̇(u^⊤ x_j)]. Moreover, when the decision function is the scaled error function, T(x_i, x_j) and Ṫ(x_i, x_j) are analytically obtained in closed form as

T(x_i, x_j) = (1/2π) arcsin( α² Σ(x_i, x_j) / √((α² Σ(x_i, x_i) + 0.5)(α² Σ(x_j, x_j) + 0.5)) ) + 1/4,   (6)

Ṫ(x_i, x_j) = (α²/π) · 1 / √((1 + 2α² Σ(x_i, x_i))(1 + 2α² Σ(x_j, x_j)) − 4α⁴ Σ(x_i, x_j)²).   (7)

Here, "PB" stands for a perfect binary tree. The dot in σ̇(u^⊤ x_i) denotes the first derivative, and E[·] denotes the expectation. The scalar π in Equation 6 and Equation 7 is the circular constant, and u corresponds to w_{m,n} at any internal node. We can derive the formula of the limiting kernel by treating the number of trees in an ensemble like the width of a neural network, although neural networks and soft tree ensembles appear to be different models.
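The closed forms in Equations 6 and 7 are easy to check numerically. The sketch below is our code (function names are ours): it implements T, Ṫ, and the kernel of Theorem 1, and compares T against a Monte Carlo estimate with u ~ N(0, I) for two unit-norm inputs at a 60-degree angle.

```python
import numpy as np
from math import erf, asin, pi

alpha = 2.0

def T(sij, sii, sjj):
    # Equation 6: closed form of E[sigma(u.x_i) sigma(u.x_j)], scaled error function
    return asin(alpha**2 * sij
                / ((alpha**2 * sii + 0.5) * (alpha**2 * sjj + 0.5)) ** 0.5) / (2 * pi) + 0.25

def Tdot(sij, sii, sjj):
    # Equation 7: closed form of E[sigma'(u.x_i) sigma'(u.x_j)]
    return (alpha**2 / pi) / ((1 + 2 * alpha**2 * sii) * (1 + 2 * alpha**2 * sjj)
                              - 4 * alpha**4 * sij**2) ** 0.5

def ntk_pb(D, sij, sii, sjj):
    # Theorem 1: limiting NTK of an infinite ensemble of depth-D perfect binary trees
    t, td = T(sij, sii, sjj), Tdot(sij, sii, sjj)
    return 2**D * D * sij * t**(D - 1) * td + (2 * t)**D

# Monte Carlo check of Equation 6 with u ~ N(0, I)
rng = np.random.default_rng(0)
xi = np.array([1.0, 0.0])
xj = np.array([0.5, np.sqrt(3) / 2])  # 60-degree angle, so Sigma(x_i, x_j) = 0.5
U = rng.standard_normal((400_000, 2))
sig = lambda z: 0.5 * erf(alpha * z) + 0.5
t_mc = float(np.mean([sig(a) * sig(b) for a, b in zip(U @ xi, U @ xj)]))
print(abs(t_mc - T(0.5, 1.0, 1.0)))  # small Monte Carlo error
```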

3. THEORETICAL RESULTS

We first consider rule set ensembles shown in Figure 1 (d) and provide its NTK in Section 3.1. This becomes the key component to introduce the NTKs for trees with arbitrary architectures in Section 3.2. Due to space limitations, detailed proofs are given in the Appendix. 

3.1. NTK FOR RULE SETS

We prove that the NTK induced by a rule set ensemble is obtained in closed form as M → ∞ at initialization:

Theorem 2. The NTK for an ensemble of M soft rule sets with depth D converges in probability to the following deterministic kernel as M → ∞:

Θ^{(D,Rule)}(x_i, x_j) := lim_{M→∞} Θ^{(D,Rule)}_0(x_i, x_j)
= D Σ(x_i, x_j) (T(x_i, x_j))^{D−1} Ṫ(x_i, x_j)  [contribution from internal nodes]
+ (T(x_i, x_j))^D  [contribution from leaves].   (8)

We can see that the limiting NTK induced by an infinite ensemble of 2^D rules coincides with the limiting NTK of the perfect binary tree in Theorem 1: 2^D Θ^{(D,Rule)}(x_i, x_j) = Θ^{(D,PB)}(x_i, x_j). Here, 2^D corresponds to the number of leaves in a perfect binary tree. Figure 2 gives an intuition: by duplicating internal nodes, we can always construct rule sets that correspond to a given tree by decomposing the paths from the root to the leaves, where the number of rules in the rule set equals the number of leaves in the tree.
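The coincidence 2^D Θ^{(D,Rule)} = Θ^{(D,PB)} can be verified mechanically. Below is a sketch of ours, assuming unit-norm inputs so that Σ(x_i, x_i) = Σ(x_j, x_j) = 1 and using the closed forms of Equations 6 and 7:

```python
from math import asin, pi

alpha = 2.0  # scaling factor of the scaled error function

def T(s):     # Equation 6 with Sigma(x_i, x_i) = Sigma(x_j, x_j) = 1
    return asin(alpha**2 * s / (alpha**2 + 0.5)) / (2 * pi) + 0.25

def Tdot(s):  # Equation 7 under the same assumption
    return (alpha**2 / pi) / ((1 + 2 * alpha**2)**2 - 4 * alpha**4 * s**2) ** 0.5

def ntk_rule(D, s):  # Theorem 2: rule set of depth D
    return D * s * T(s)**(D - 1) * Tdot(s) + T(s)**D

def ntk_pb(D, s):    # Theorem 1: perfect binary tree of depth D
    return 2**D * D * s * T(s)**(D - 1) * Tdot(s) + (2 * T(s))**D

for D in (1, 2, 3, 5, 8):
    assert abs(2**D * ntk_rule(D, 0.7) - ntk_pb(D, 0.7)) < 1e-12
print("2^D * rule-set NTK matches the perfect-binary-tree NTK")
```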

3.2. NTK FOR TREES WITH ARBITRARY ARCHITECTURES

Using our interpretation that a tree is a combination of multiple rule sets, we generalize Theorem 1 to arbitrary architectures such as the asymmetric tree shown in the right panel of Figure 2.

Theorem 3. Let Q: N → N ∪ {0} be a function that receives a depth and returns the number of leaves attached to internal nodes at that depth. For any tree architecture, the NTK for an ensemble of soft trees converges in probability to the following deterministic kernel as M → ∞:

Θ^{(ArbitraryTree)}(x_i, x_j) := lim_{M→∞} Θ^{(ArbitraryTree)}_0(x_i, x_j) = ∑_{d=1}^{D} Q(d) Θ^{(d,Rule)}(x_i, x_j).

This formula covers the limiting NTK for perfect binary trees, 2^D Θ^{(D,Rule)}(x_i, x_j), as a special case by letting Q(D) = 2^D and Q(d) = 0 otherwise. Kanoh & Sugiyama (2022) used mathematical induction to prove Theorem 1; however, that technique is limited to perfect binary trees. Consequently, we devised an alternative way of deriving the limiting NTK: treating a tree as a combination of independent rule sets, using the symmetric properties of the decision function and the statistical independence of the leaf parameters. It is also possible to show that the limiting kernel does not change during training:

Theorem 4. Let λ_min and λ_max be the minimum and maximum eigenvalues of the limiting NTK. Assume ∥x_i∥_2 = 1 for all i ∈ [N] and x_i ≠ x_j (i ≠ j). For ensembles of arbitrary soft trees with the NTK initialization trained under gradient flow with a learning rate η < 2/(λ_min + λ_max) and a positive finite scaling factor α, we have, with high probability,

sup_τ |Θ^{(ArbitraryTree)}_τ(x_i, x_j) − Θ^{(ArbitraryTree)}_0(x_i, x_j)| = O(1/√M).

Each rule set corresponds to a path to a leaf in a tree, as shown in Figure 2. Therefore, the depth of a rule set corresponds to the depth at which a leaf is present. Since Theorem 3 tells us that, with respect to the tree architecture, the limiting NTK depends only on the number of leaves at each depth, the following holds:

Corollary 1. The same limiting NTK can be induced from trees that are not isomorphic.

For example, for the two trees illustrated in Figure 3, Q(1) = 0, Q(2) = 2, and Q(3) = 4. Therefore, the limiting NTKs are identical for ensembles of these trees and equal 2Θ^{(2,Rule)}(x_i, x_j) + 4Θ^{(3,Rule)}(x_i, x_j). Since they have the same limiting NTK, their training behavior in function space and their generalization performance are exactly equivalent for infinite ensembles, although the trees are not isomorphic and might be expected to have different properties. To observe this phenomenon empirically, we trained two types of ensembles: one composed of soft trees with the left architecture in Figure 3, and the other with the right architecture. We tried two settings, M = 16 and M = 4096, to see the effect of the number of trees (weak learners). The decision function is the scaled error function with α = 2.0. Figure 4 shows trajectories during full-batch gradient descent with a learning rate of 0.1; outputs at initialization are shifted to zero (Chizat et al., 2019). There are 10 randomly generated training points and 10 randomly generated test points with dimension F = 5. Each line corresponds to a data point, and solid and dotted lines denote ensembles of the left and right architectures, respectively. The two trajectories (solid and dotted lines of each color) become similar as M grows, showing that the property in Corollary 1 is empirically effective.

When we compare a rule set and a tree with the same number of leaves, as shown in Figure 1 (a) and (d), the rule set clearly has larger representation power, as it has more internal nodes and no decision boundaries are shared. However, when the collection of paths from the root to the leaves in a tree coincides with the corresponding rule set, as shown in Figure 2, their limiting NTKs are equivalent. Therefore, the following corollary holds:

Corollary 2. Sharing decision boundaries through parameter sharing does not affect the limiting NTK.

This generalizes the result in (Kanoh & Sugiyama, 2022) that the kernel induced by an oblivious tree, as shown in Figure 1 (b), converges to the same kernel as a non-oblivious one, as shown in Figure 1 (a), in the limit of infinitely many trees.
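Corollary 1 can be illustrated concretely. The sketch below is ours; the two tree shapes are an assumed concrete pair consistent with the profile Q(1) = 0, Q(2) = 2, Q(3) = 4 stated for Figure 3 (unit-norm inputs assumed). Non-isomorphic trees with the same leaf-depth profile yield identical limiting NTKs via Theorem 3.

```python
from math import asin, pi

alpha = 2.0

def T(s): return asin(alpha**2 * s / (alpha**2 + 0.5)) / (2 * pi) + 0.25
def Tdot(s): return (alpha**2 / pi) / ((1 + 2 * alpha**2)**2 - 4 * alpha**4 * s**2) ** 0.5
def ntk_rule(d, s): return d * s * T(s)**(d - 1) * Tdot(s) + T(s)**d

def leaf_depths(tree, depth=0):
    # a tree is either the string "leaf" or a (left, right) pair of subtrees
    if tree == "leaf":
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)

def ntk_tree(tree, s):
    # Theorem 3: sum of rule-set kernels over leaf depths, i.e. sum_d Q(d) ntk_rule(d)
    return sum(ntk_rule(d, s) for d in leaf_depths(tree))

L = "leaf"
# two non-isomorphic shapes sharing the profile Q(1)=0, Q(2)=2, Q(3)=4
tree_a = (((L, L), (L, L)), (L, L))
tree_b = (((L, L), L), ((L, L), L))

print(sorted(leaf_depths(tree_a)) == sorted(leaf_depths(tree_b)))  # → True
print(ntk_tree(tree_a, 0.5), ntk_tree(tree_b, 0.5))  # equal up to floating point
```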

4. CASE STUDY: DECISION LIST

As a typical example of asymmetric trees, we consider a tree that grows in only one direction, as shown in Figure 5, often called a decision list (Rivest, 1987) and commonly used in practical applications (Letham et al., 2015). In this architecture, one leaf exists at each depth, except at the final depth, where there are two leaves.

Figure 6: Convergence of Θ^{(5,PB)}_0(x_i, x_j) and Θ^{(5,DL)}_0(x_i, x_j) to the fixed limits Θ^{(5,PB)}(x_i, x_j) and Θ^{(5,DL)}(x_i, x_j) as M increases. The kernel induced by finite trees is numerically calculated and plotted 10 times with parameter re-initialization.

4.1. NTK FOR DECISION LISTS

We show that the NTK induced by decision lists is formulated in closed form as M → ∞ at initialization:

Proposition 1. The NTK for an ensemble of soft decision lists with depth D converges in probability to the following deterministic kernel as M → ∞:

Θ^{(D,DL)}(x_i, x_j) := lim_{M→∞} Θ^{(D,DL)}_0(x_i, x_j)
= Θ^{(1,Rule)}(x_i, x_j) + Θ^{(2,Rule)}(x_i, x_j) + ··· + Θ^{(D−1,Rule)}(x_i, x_j) + 2Θ^{(D,Rule)}(x_i, x_j)
= Σ(x_i, x_j) Ṫ(x_i, x_j) ( ∑_{d=1}^{D} d (T(x_i, x_j))^{d−1} + D (T(x_i, x_j))^{D−1} )  [contribution from internal nodes]
+ ∑_{d=1}^{D} (T(x_i, x_j))^d + (T(x_i, x_j))^D  [contribution from leaves].   (11)

In Proposition 1, "DL" stands for a decision list. The first equality comes from Theorem 3. We numerically demonstrate the convergence of the kernels for perfect binary trees and decision lists in Figure 6 as the number M of trees increases. We use two simple inputs: x_i = (1, 0) and x_j = (cos β, sin β) with β ∈ [0, π]. The scaled error function is used as the decision function. The kernel induced by finite trees is numerically calculated 10 times with parameter re-initialization for each of M = 16, 64, 256, 1024, and 4096. We empirically observe that the kernels induced by sufficiently many soft trees converge to the limiting kernels given in Equation 5 and Equation 11, shown by the dotted lines in Figure 6. The kernel values induced by a finite ensemble are already close to the limiting NTK when the number of trees exceeds several hundred, which is a typical order of the number of trees in practical applications (Popov et al., 2020). This indicates that our NTK analysis is also effective for practical applications with finite ensembles.
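The two expressions in Proposition 1, the sum of rule-set kernels and the expanded form of Equation 11, can be checked against each other numerically. The sketch below is ours, assuming unit-norm inputs:

```python
from math import asin, pi

alpha = 2.0

def T(s): return asin(alpha**2 * s / (alpha**2 + 0.5)) / (2 * pi) + 0.25
def Tdot(s): return (alpha**2 / pi) / ((1 + 2 * alpha**2)**2 - 4 * alpha**4 * s**2) ** 0.5
def ntk_rule(d, s): return d * s * T(s)**(d - 1) * Tdot(s) + T(s)**d

def ntk_dl_sum(D, s):
    # Proposition 1, first equality: one rule set per depth, two at the final depth
    return sum(ntk_rule(d, s) for d in range(1, D)) + 2 * ntk_rule(D, s)

def ntk_dl_closed(D, s):
    # Proposition 1, expanded form (Equation 11)
    t, td = T(s), Tdot(s)
    nodes = s * td * (sum(d * t**(d - 1) for d in range(1, D + 1)) + D * t**(D - 1))
    leaves = sum(t**d for d in range(1, D + 1)) + t**D
    return nodes + leaves

for D in (2, 3, 6, 10):
    print(D, ntk_dl_sum(D, 0.5) - ntk_dl_closed(D, 0.5))  # ~0 for each depth
```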

4.2. DEGENERACY

Next, we analyze the effect of tree depth on the kernel values. Overly deep soft perfect binary trees are known to induce the degeneracy phenomenon (Kanoh & Sugiyama, 2022), and we analyze whether this phenomenon also occurs in asymmetric trees like decision lists. Since 0 < T(x_i, x_j) < 0.5, replacing the summations in Equation 11 with infinite series yields a closed-form formula for the limit D → ∞ in the case of decision lists:

Proposition 2. The NTK for an ensemble of soft decision lists with infinite depth converges in probability to the following deterministic kernel as M → ∞:

lim_{D→∞} Θ^{(D,DL)}(x_i, x_j) = Σ(x_i, x_j) Ṫ(x_i, x_j) / (1 − T(x_i, x_j))²  [contribution from internal nodes]
+ T(x_i, x_j) / (1 − T(x_i, x_j))  [contribution from leaves].

Thus the limiting NTK Θ^{(D,DL)} of decision lists neither degenerates nor diverges as D → ∞. Figure 7 shows how the kernel changes with depth. For the perfect binary tree, the kernel value sticks to zero as the inner product of the inputs moves away from 1.0 (Kanoh & Sugiyama, 2022): deep perfect binary trees cannot distinguish between vectors differing in angle by 90 degrees and vectors differing by 180 degrees. For the decision list, in contrast, the kernel does not degenerate even at infinite depth, as shown by the dotted line in the right panel of Figure 7. This implies that a deterioration in generalization performance is unlikely even if the model becomes infinitely deep. We can understand this behavior intuitively as follows. When the depth of a perfect binary tree goes to infinity, all splitting regions become infinitely small, so every data point falls into its own leaf. In contrast, when a decision list is used, large splitting regions remain, so not all data points are separated. This avoids the degenerate situation in which all separated data points become equally distant.
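Proposition 2 and the degeneracy contrast can both be seen numerically. The sketch below is ours (unit-norm inputs, scaled error function with α = 2): the decision-list kernel at depth 200 is already indistinguishable from its infinite-depth limit, while the perfect-binary-tree kernel at depth 64 has collapsed toward zero for inputs with inner product 0.5.

```python
from math import asin, pi

alpha = 2.0

def T(s): return asin(alpha**2 * s / (alpha**2 + 0.5)) / (2 * pi) + 0.25
def Tdot(s): return (alpha**2 / pi) / ((1 + 2 * alpha**2)**2 - 4 * alpha**4 * s**2) ** 0.5
def ntk_rule(d, s): return d * s * T(s)**(d - 1) * Tdot(s) + T(s)**d

def ntk_dl(D, s):
    # Proposition 1 (sum form): one leaf per depth, two at the final depth
    return sum(ntk_rule(d, s) for d in range(1, D)) + 2 * ntk_rule(D, s)

def ntk_dl_inf(s):
    # Proposition 2: closed-form infinite-depth limit (uses 0 < T < 0.5)
    t, td = T(s), Tdot(s)
    return s * td / (1 - t)**2 + t / (1 - t)

def ntk_pb(D, s):
    # Theorem 1, for contrast: degenerates as D grows when s < 1
    return 2**D * D * s * T(s)**(D - 1) * Tdot(s) + (2 * T(s))**D

s = 0.5  # inner product of two unit-norm inputs
print(ntk_dl(200, s) - ntk_dl_inf(s))  # ~0: finite depth approaches the limit
print(ntk_pb(64, s), ntk_dl_inf(s))    # PB kernel collapses; DL kernel does not
```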

4.3. NUMERICAL EXPERIMENTS

We experimentally examined the effects of the degeneracy phenomenon discussed in Section 4.2.

Setup. We used 90 classification tasks in the UCI database (Dua & Graff, 2017), each of which has fewer than 5000 data points, as in (Arora et al., 2020). We performed kernel regression using the limiting NTKs defined in Equation 5 and Equation 11, equivalent to infinite ensembles of perfect binary trees and decision lists. We used D in {2, 4, 8, 16, 32, 64, 128} and α in {1.0, 2.0, 4.0, 8.0, 16.0, 32.0}. The scaled error function is used as the decision function. To consider the ridge-less situation, the regularization strength is fixed to 1.0 × 10⁻⁸. We report four-fold cross-validation performance with random data splitting, as in Arora et al. (2020) and Fernández-Delgado et al. (2014). Other details are provided in the Appendix.

Performance. Figure 8 shows the averaged classification accuracy on the 90 datasets. The generalization performance decreases as the tree depth increases when perfect binary trees are used as weak learners, whereas no significant deterioration occurs when decision lists are used. This is consistent with the degeneracy properties discussed in Section 4.2. The performance of decision lists becomes almost identical to their infinite-depth limit once the depth reaches around 10, suggesting that deeper decision lists no longer change the output significantly. For small α, asymmetric trees often perform better than symmetric trees, but the relationship reverses for large α.

Computational complexity of the kernel. Let U = ∑_{d=1}^{D} 1_{Q(d)>0} be the number of depths at which leaves are attached. In general, the complexity of computing each kernel value for a pair of samples is O(U). However, in some cases the complexity reduces to O(1), such as for an infinitely deep decision list as in Proposition 2, although U = ∞ there.
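The regression pipeline described above can be sketched in a few lines. Below is our toy reproduction, not the paper's code: kernel ridge regression with regularization 10⁻⁸ using the infinite-depth decision-list NTK of Proposition 2 on unit-norm toy data (α = 2 assumed).

```python
import numpy as np
from math import asin, pi

alpha = 2.0

def T(s): return asin(alpha**2 * s / (alpha**2 + 0.5)) / (2 * pi) + 0.25
def Tdot(s): return (alpha**2 / pi) / ((1 + 2 * alpha**2)**2 - 4 * alpha**4 * s**2) ** 0.5

def ntk_dl_inf(s):
    # infinitely deep decision list (Proposition 2); unit-norm inputs assumed
    t = T(s)
    return s * Tdot(s) / (1 - t)**2 + t / (1 - t)

# toy regression: 8 distinct points on the unit circle, target = first coordinate
angles = np.linspace(0, 2 * pi, 9)[:-1]
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)
y = X[:, 0]
G = X @ X.T                                           # pairwise inner products
K = np.vectorize(ntk_dl_inf)(G)                       # limiting-NTK Gram matrix
coef = np.linalg.solve(K + 1e-8 * np.eye(len(y)), y)  # ridge-less regression
pred = K @ coef
print(np.max(np.abs(pred - y)))  # near-interpolation of the training targets
```

With an essentially ridge-less regularization, the fitted function interpolates the training targets whenever the Gram matrix is well-conditioned, which is the regime the NTK analysis describes.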

5. DISCUSSIONS

Application to Neural Architecture Search (NAS). Arora et al. (2019) proposed using the NTK for performance estimation in Neural Architecture Search (NAS) (Elsken et al., 2019), and such studies have been active in recent years (Chen et al., 2021; Xu et al., 2021; Mok et al., 2022). Our findings allow us to significantly reduce the number of candidate tree architectures: Theorem 3 identifies redundant architectures that need not be explored in NAS, and the numerical experiments in Figure 8 suggest that extremely deep tree structures need not be explored even for asymmetric architectures.

Analogy between decision lists and residual networks. Huang et al. (2020) showed that although the multi-layer perceptron without skip-connections exhibits the degeneracy phenomenon, the multi-layer perceptron with skip-connections (He et al., 2016) does not. This parallels our situation: skip-connections in multi-layer perceptrons correspond to asymmetric structures in soft trees such as decision lists. Moreover, Veit et al. (2016) interpreted residual networks as a collection of many paths of differing lengths, which is again similar to our case, because a decision list can be viewed as a collection of paths, i.e., rule sets, of different lengths. Our findings therefore suggest that there may be a common reason why performance does not deteriorate easily as depth increases.

6. CONCLUSIONS

We have introduced and studied the NTK induced by arbitrary tree architectures. Our theoretical analysis via this kernel provides new insights into the behavior of infinite ensembles of soft trees: for different soft trees, if the number of leaves at each depth is equal, the training behavior of their infinite ensembles in function space matches exactly, even if the tree architectures are not isomorphic. We have also shown, theoretically and empirically, that deepening asymmetric trees like decision lists does not necessarily induce the degeneracy phenomenon, although it occurs in symmetric perfect binary trees.

is used. Here, the subscript "→" beneath a term indicates the value to which its expected value converges. Similarly, for leaves,

∂f^{(D,Rule)}(x_i, w, π)/∂π_{m,1} = (1/(π_{m,1}√M)) f^{(D,Rule)}_m(x_i, w_m, π_m),   (A.5)

resulting in

Θ^{(D,Rule,leaves)}(x_i, x_j) = (T(x_i, x_j))^D.   (A.6)

Combining Equation A.3 and Equation A.6, we obtain Equation 8.

A.2 PROOF OF THEOREM 3

Theorem 3 (restated). Let Q: N → N ∪ {0} be a function that receives a depth and returns the number of leaves attached to internal nodes at that depth. For any tree architecture, the NTK for an ensemble of soft trees converges in probability to the following deterministic kernel as M → ∞:

Θ^{(ArbitraryTree)}(x_i, x_j) := lim_{M→∞} Θ^{(ArbitraryTree)}_0(x_i, x_j) = ∑_{d=1}^{D} Q(d) Θ^{(d,Rule)}(x_i, x_j).

Proof. We separate the leaf and internal node contributions.

Contribution from Internal Nodes. For a soft Boolean operation, the following equations hold:

E_m[(1 − σ(w_{m,n}^⊤ x_i))(1 − σ(w_{m,n}^⊤ x_j))] = E_m[1 − σ(w_{m,n}^⊤ x_i) − σ(w_{m,n}^⊤ x_j) + σ(w_{m,n}^⊤ x_i) σ(w_{m,n}^⊤ x_j)] = E_m[σ(w_{m,n}^⊤ x_i) σ(w_{m,n}^⊤ x_j)],   (A.7)

since E_m[σ(w_{m,n}^⊤ x_i)] = E_m[σ(w_{m,n}^⊤ x_j)] = 0.5, and

E_m[⟨∂(1 − σ(w_{m,n}^⊤ x_i))/∂w_{m,n}, ∂(1 − σ(w_{m,n}^⊤ x_j))/∂w_{m,n}⟩] = E_m[x_i^⊤ x_j σ̇(w_{m,n}^⊤ x_i) σ̇(w_{m,n}^⊤ x_j)] = E_m[⟨∂σ(w_{m,n}^⊤ x_i)/∂w_{m,n}, ∂σ(w_{m,n}^⊤ x_j)/∂w_{m,n}⟩].   (A.8)

Since each σ(w_{m,n}^⊤ x_i) is 0.5 in expectation, although the term 1 − σ(w_{m,n}^⊤ x_i) is used instead of σ(w_{m,n}^⊤ x_i) for the rightward flow in the tree, exactly the same limiting NTK is obtained by treating 1 − σ(w_{m,n}^⊤ x_i) as σ(w_{m,n}^⊤ x_i). As for the internal node contribution, the derivative is obtained as

∂f^{(ArbitraryTree)}(x_i, w, π)/∂w_{m,n} = (1/√M) ∑_{ℓ=1}^{L} π_{m,ℓ} ∂µ_{m,ℓ}(x_i, w_m)/∂w_{m,n} = (1/√M) ∑_{ℓ=1}^{L} π_{m,ℓ} S_{n,ℓ}(x_i, w_m) x_i σ̇(w_{m,n}^⊤ x_i),   (A.9)

where

S_{n,ℓ}(x_i, w_m) := ∏_{n′=1}^{N} σ(w_{m,n′}^⊤ x_i)^{1_{(ℓ↙n′)&(n≠n′)}} (1 − σ(w_{m,n′}^⊤ x_i))^{1_{(n′↘ℓ)&(n≠n′)}} (−1)^{1_{n↘ℓ}},

C.2 COMPARISON TO THE GRADIENT BOOSTING DECISION TREE

For reference, we show experimental results for the gradient boosting decision tree. The experimental procedure is the same as that in Section 4.3. We used scikit-learn for the implementation. As for hyperparameters, we used max_depth in {2, 4, 6}, subsample in {0.6, 0.8, 1.0}, learning_rate in {0.1, 0.01, 0.001}, and n_estimators (the number of trees) in {100, 300, 500}. Other parameters were set to the default values of the library. Figure A.5 shows the averaged accuracy over the 90 datasets. We used five random seeds {0, 1, 2, 3, 4}, and their mean, minimum, and maximum performances are reported. With the best hyperparameters, the averaged accuracy is 0.8010, slightly better than the performance of the infinitely deep decision list, 0.7889 with α = 4.0, shown by the dotted line in Figure A.5. When we look at each dataset, however, the infinitely deep decision list is superior to the gradient boosting decision tree on 35 out of 90 datasets. Neither is uniformly better than the other, and their inductive biases may or may not be appropriate for a given dataset.



http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html



Figure 1: Schematic image of decision boundaries in an input space split by the (a) perfect binary tree, (b) oblivious tree, (c) decision list, and (d) rule set.

Figure 2: Correspondence between rule sets and binary trees. The top shows the corresponding rule sets for the bottom tree architectures.

Therefore, we can analyze the training behavior based on kernel regression.

Figure 3: Non-isomorphic tree architectures used in ensembles that induce the same limiting NTK.

Figure 5: Decision list: a binary tree that grows in only one direction.

Figure 7: Depth dependency of (Left) Θ^(D,PB)(x_i, x_j) and (Right) Θ^(D,DL)(x_i, x_j). For decision lists, the limit of infinite depth is indicated by the dotted line.

Figure 8: Averaged accuracy over 90 datasets. Horizontal dotted lines show the accuracy of decision lists with the infinite depth. The statistical significance is assessed in the Appendix.

Figure A.1: P-values of the Wilcoxon signed rank test for results on perfect binary trees and decision lists with different parameters.

ACKNOWLEDGEMENT

This work was supported by JSPS, KAKENHI Grant Number JP21H03503, Japan and JST, CREST Grant Number JPMJCR22D3, Japan.

ETHICS STATEMENT

We believe that theoretical analysis of the NTK does not lead to harmful applications.

REPRODUCIBILITY STATEMENT

Proofs are provided in the Appendix. For numerical experiments and figures, reproducible source code is shared in the supplementary material.

A PROOFS

A.1 PROOF OF THEOREM 2

Theorem 2. The NTK for an ensemble of M soft rule sets with the depth D converges in probability to the following deterministic kernel as M → ∞:

$$\Theta^{(D,\mathrm{Rule})}(x_i, x_j) := \lim_{M \to \infty} \Theta^{(D,\mathrm{Rule})}_0(x_i, x_j) = \underbrace{D\,\Sigma(x_i, x_j)\,\left(\mathcal{T}(x_i, x_j)\right)^{D-1} \dot{\mathcal{T}}(x_i, x_j)}_{\text{contribution from internal nodes}} + \underbrace{\left(\mathcal{T}(x_i, x_j)\right)^{D}}_{\text{contribution from leaves}}.$$

Proof. We consider the contribution from internal nodes Θ^(D,Rule,nodes) and the contribution from leaves Θ^(D,Rule,leaves) separately, such that

$$\Theta^{(D,\mathrm{Rule})}(x_i, x_j) = \Theta^{(D,\mathrm{Rule,nodes})}(x_i, x_j) + \Theta^{(D,\mathrm{Rule,leaves})}(x_i, x_j). \quad \text{(A.1)}$$

Contribution from Internal Nodes. As for internal nodes, we consider the derivative with respect to a node t, where w_{m,−t} denotes the internal node parameter matrix except for the parameters of the node t. Since there are D possible locations for t, we obtain the sum over these locations, where & is a logical conjunction. Since π_{m,ℓ} is initialized with zero-mean i.i.d. Gaussians with unit variances, the inner node contribution to the limiting NTK follows.

Suppose leaf ℓ is connected to an internal node of depth d. With Equation A.7 and Equation A.8, we obtain the contribution of this leaf. Therefore, considering all leaves, the claim follows, where Θ^(d,Rule,nodes) is introduced in Equation A.3.

Contribution from Leaves. As for the contribution from leaves, the derivative is obtained directly. Since w_{m,n} used in µ_{m,ℓ}(x_i, w_m) is initialized with zero-mean i.i.d. Gaussians, the contribution from leaves to the limiting NTK induced by an arbitrary tree architecture follows, where Θ^(d,Rule,leaves) is introduced in Equation A.6.

A.3 PROOF OF THEOREM 4

Theorem 4. Let λ_min and λ_max be the minimum and maximum eigenvalues of the limiting NTK. Assume that the limiting NTK is positive definite, i.e., λ_min > 0. For ensembles of arbitrary soft trees with the NTK initialization trained under gradient flow with a learning rate η < 2/(λ_min + λ_max) and a positive finite scaling factor α, we have, with high probability, that the limiting kernel does not change during training.

Proof.
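The closed form in Theorem 2 can be evaluated numerically. Below is a minimal Monte Carlo sketch, under the assumptions that Σ(x_i, x_j) is the linear kernel of the inputs, T and Ṫ are Gaussian expectations over the node parameter w, and the splitting function is a logistic sigmoid; the function name and sample count are our illustrative choices, not the paper's.

```python
import numpy as np

def limiting_rule_ntk(x_i, x_j, depth, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the limiting rule-set NTK in Theorem 2 (sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, len(x_i)))  # w ~ N(0, I) at NTK initialization

    def sigma(z):
        # Assumed logistic splitting function; its derivative is σ(z)(1 − σ(z)).
        return 1.0 / (1.0 + np.exp(-z))

    zi, zj = W @ x_i, W @ x_j
    si, sj = sigma(zi), sigma(zj)
    T = np.mean(si * sj)                            # T(x_i, x_j) = E[σ(w·x_i) σ(w·x_j)]
    T_dot = np.mean(si * (1 - si) * sj * (1 - sj))  # Ṫ(x_i, x_j) = E[σ'(w·x_i) σ'(w·x_j)]
    Sigma = float(x_i @ x_j)                        # Σ(x_i, x_j) = x_i · x_j (assumed)

    # D·Σ·T^(D−1)·Ṫ (internal nodes) + T^D (leaves), as in Theorem 2.
    return depth * Sigma * T ** (depth - 1) * T_dot + T ** depth

k = limiting_rule_ntk(np.array([1.0, 0.0]), np.array([0.6, 0.8]), depth=3)
```

For D = 1 the expression reduces to Σ·Ṫ + T, so the depth dependency enters only through the powers of T, which is what drives the degeneracy analysis for deep trees.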
To prove that the kernel does not move during training, we need to show the positive definiteness of the kernel and the local Lipschitzness of the model Jacobian at initialization J(x, θ), whose (i, j) entry is ∂f(x_i, θ)/∂θ_j, where θ_j is the j-th component of θ.

Lemma 1 (Lee et al. (2019)). Assume that the limiting NTK induced by any model architecture is positive definite for the input set x, such that the minimum eigenvalue of the NTK satisfies λ_min > 0. For models with a locally Lipschitz Jacobian trained under gradient flow with a learning rate η < 2/(λ_min + λ_max), we have, with high probability, that the kernel remains unchanged during training.

The local Lipschitzness of the soft tree ensemble's Jacobian at initialization has already been proven, even for arbitrary tree architectures:

Lemma 2 (Kanoh & Sugiyama (2022)). For soft tree ensemble models with the NTK initialization and a positive finite scaling factor α, there is K > 0 such that for every C > 0, with high probability, the local Lipschitzness of the Jacobian holds.

As for the positive definiteness, Θ^(D,PB)(x_i, x_j) is known to be positive definite:

Lemma 3 (Kanoh & Sugiyama (2022)). For infinitely many perfect binary soft trees with any depth and the NTK initialization, the limiting NTK is positive definite if ∥x_i∥_2 = 1 for all i ∈ [N] and

Since Θ^(D,PB)(x_i, x_j) is equivalent to Θ^(D,Rule)(x_i, x_j) up to a constant multiple, Θ^(D,Rule)(x_i, x_j) is positive definite under the same assumption. Moreover, since Θ^(ArbitraryTree)(x_i, x_j) is represented by a summation of Θ^(D,Rule)(x_i, x_j) kernels as in Theorem 3, Θ^(ArbitraryTree)(x_i, x_j) is also positive definite under the same assumption.

These results show that the limiting NTK induced by an arbitrary tree architecture does not change during training.
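To make the role of the eigenvalues concrete, here is a small numerical sketch of the stability condition in Lemma 1 on a toy Gram matrix; the matrix values are our own illustration, not taken from the paper.

```python
import numpy as np

# Toy 2x2 limiting-NTK Gram matrix (illustrative values).
ntk = np.array([[1.0, 0.3],
                [0.3, 1.0]])

# Eigenvalues of a symmetric matrix, sorted ascending.
eigvals = np.linalg.eigvalsh(ntk)
lam_min, lam_max = eigvals[0], eigvals[-1]

# Positive definiteness required by Lemma 1.
assert lam_min > 0

# Largest admissible learning rate: η must satisfy η < 2 / (λ_min + λ_max).
eta_max = 2.0 / (lam_min + lam_max)
```

For this toy matrix the eigenvalues are 0.7 and 1.3, so any learning rate below 2/(0.7 + 1.3) = 1.0 satisfies the condition of the lemma.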

B DETAILS OF NUMERICAL EXPERIMENTS B.1 DATASET ACQUISITION

We used the UCI datasets (Dua & Graff, 2017) preprocessed by Fernández-Delgado et al. (2014). We selected 90 datasets with fewer than 5000 data points, as in Arora et al. (2020) and Kanoh & Sugiyama (2022).

B.2 MODEL SPECIFICATIONS

We used scikit-learn to perform kernel regression. The regularization strength is set to a tiny value (1.0 × 10⁻⁸) so that the model is almost ridge-less regression.
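The setup above can be sketched with scikit-learn's `KernelRidge` and a precomputed Gram matrix. The data and the RBF stand-in kernel below are illustrative assumptions; in the paper the Gram matrix would be the limiting NTK itself.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 3))
X_test = rng.standard_normal((10, 3))
y_train = rng.standard_normal(50)

def gram(A, B):
    # Stand-in for a precomputed limiting-NTK Gram matrix (illustrative RBF kernel).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists)

# Tiny regularization strength, so the fit is almost ridge-less.
model = KernelRidge(alpha=1.0e-8, kernel="precomputed")
model.fit(gram(X_train, X_train), y_train)
y_pred = model.predict(gram(X_test, X_train))
```

With `kernel="precomputed"`, `fit` receives the train-train Gram matrix and `predict` receives the test-train Gram matrix, which matches how a closed-form NTK would be plugged into kernel regression.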

B.3 COMPUTATIONAL RESOURCE

We ran all experiments on a 2.20 GHz Intel Xeon E5-2698 CPU with 252 GB of memory, running Ubuntu Linux (kernel version 4.15.0-117-generic).

B.4 STATISTICAL SIGNIFICANCE

We conducted a Wilcoxon signed rank test over the 90 datasets to check the statistical significance of the differences between performances of a perfect binary tree and a decision list. Figure A.1 shows the p-values. Statistically significant differences can be observed in areas where the differences appear large in Figure 8, such as when α is small and D is large. We used Bonferroni correction to account for multiple testing, and the resulting significance level of the p-value is about 0.0012 for the 5 percent significance level. An asterisk "*" is placed where the difference is significant after correction. For cases where the symmetry of the tree does not produce a large difference, such as at the depth of 2, the difference in performance is often not statistically significant.
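The test above can be sketched as follows. The per-dataset accuracy arrays are synthetic placeholders, and the number of comparisons (42) is our assumption chosen so that 0.05/42 ≈ 0.0012 matches the corrected threshold stated in the text.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-dataset accuracies for the two models over 90 datasets (placeholders).
acc_perfect_binary = rng.uniform(0.6, 0.9, size=90)
acc_decision_list = acc_perfect_binary + rng.normal(0.02, 0.01, size=90)

# Paired, non-parametric test on the per-dataset differences.
stat, p_value = wilcoxon(acc_perfect_binary, acc_decision_list)

# Bonferroni correction over the grid of (α, D) settings tested;
# 42 comparisons is an assumption, giving 0.05 / 42 ≈ 0.0012.
n_comparisons = 42
alpha_corrected = 0.05 / n_comparisons
significant = p_value < alpha_corrected
```

The Wilcoxon signed rank test is appropriate here because the same 90 datasets are evaluated under both models, giving naturally paired samples without assuming normality of the accuracy differences.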

