ANALYZING TREE ARCHITECTURES IN ENSEMBLES VIA NEURAL TANGENT KERNEL

Abstract

A soft tree is an actively studied variant of a decision tree that updates its splitting rules with the gradient method. Although soft trees can take various architectures, the impact of architecture is not theoretically well understood. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that, in ensemble learning with an infinite number of trees, only the number of leaves at each depth is relevant to the tree architecture. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees, such as decision lists, does not degenerate as they become infinitely deep. This is in contrast to perfect binary trees, whose NTK is known to degenerate and to lead to worse generalization performance for deeper trees.

1. INTRODUCTION

Ensemble learning is one of the most important machine learning techniques used in real-world applications. By combining the outputs of multiple predictors, it is possible to obtain robust results for complex prediction problems. Decision trees are often used as weak learners in ensemble learning (Breiman, 2001; Chen & Guestrin, 2016; Ke et al., 2017), and they can have a variety of structures, such as different tree depths and symmetric or asymmetric architectures. In the training process of tree ensembles, even a decision stump (Iba & Langley, 1992), a decision tree of depth 1, is known to achieve zero training error as the number of trees increases (Freund & Schapire, 1996). However, generalization performance varies depending on the weak learners (Liu et al., 2017), and the theoretical properties of their impact are not well known, which necessitates empirical trial-and-error tuning of the weak learner structure.

In this paper, we focus on a soft tree (Kontschieder et al., 2015; Frosst & Hinton, 2017) as a weak learner. A soft tree is a variant of a decision tree that inherits characteristics of neural networks. Instead of using a greedy method (Quinlan, 1986; Breiman et al., 1984) to search for splitting rules, soft trees make decision rules soft and simultaneously update all model parameters using the gradient method. Soft trees have been actively studied in recent years in terms of predictive performance (Kontschieder et al., 2015; Popov et al., 2020; Hazimeh et al., 2020), interpretability (Frosst & Hinton, 2017; Wan et al., 2021), and potential techniques in real-world applications such as pre-training and fine-tuning (Ke et al., 2019; Arik & Pfister, 2019). In addition, a soft tree can be interpreted as a Mixture-of-Experts (Jordan & Jacobs, 1993; Shazeer et al., 2017; Lepikhin et al., 2021), a practical technique for balancing computational cost and prediction performance.
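To make the soft splitting mechanism concrete, the following is a minimal sketch of a depth-1 soft tree (a soft decision stump): instead of a hard threshold, an internal node routes an input to its left child with probability given by a sigmoid of a linear function, and the output is the expectation of the leaf values under this routing distribution. The function and variable names here are illustrative choices, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_stump(x, w, b, leaves):
    """Depth-1 soft tree: route left with probability sigmoid(w.x + b),
    and return the routing-weighted average of the two leaf values.
    Because sigmoid is differentiable, w, b, and the leaves can all be
    updated jointly by gradient descent, unlike a hard decision stump."""
    p_left = sigmoid(x @ w + b)
    return p_left * leaves[0] + (1.0 - p_left) * leaves[1]

x = np.array([0.5, -1.0])
w = np.array([1.0, 2.0])
pred = soft_stump(x, w, b=0.0, leaves=np.array([1.0, -1.0]))
```

As the scale of `w` grows, the sigmoid sharpens and the soft stump approaches the corresponding hard decision stump.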
To theoretically analyze soft tree ensembles, Kanoh & Sugiyama (2022) introduced the Neural Tangent Kernel (NTK) (Jacot et al., 2018) induced by them. The NTK framework analytically describes the behavior of ensemble learning with infinitely many soft trees, which leads to several non-trivial properties, such as global convergence of training and the effect of parameter sharing in oblivious trees (Popov et al., 2020; Prokhorenkova et al., 2018). However, their analysis is limited to a specific type of tree, the perfect binary tree, and the theoretical properties of other tree architectures remain unexplored.
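The NTK underlying this analysis is, for a model f with parameters theta, the kernel Theta(x, x') = <grad_theta f(x), grad_theta f(x')>. As a hedged illustration (not the paper's derivation), the empirical NTK of the soft decision stump from above can be computed with its analytic parameter gradients; the helper names are again illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stump_grad(x, w, b, leaves):
    """Gradient of a soft stump's output f(x) = p*l0 + (1-p)*l1,
    p = sigmoid(w.x + b), with respect to all parameters (w, b, l0, l1),
    stacked into a single vector."""
    p = sigmoid(x @ w + b)
    dp = p * (1.0 - p)                      # derivative of the sigmoid
    diff = leaves[0] - leaves[1]
    return np.concatenate([diff * dp * x,   # df/dw
                           [diff * dp],     # df/db
                           [p, 1.0 - p]])   # df/dl0, df/dl1

def empirical_ntk(x1, x2, w, b, leaves):
    """Empirical NTK: inner product of the parameter gradients at x1, x2."""
    return stump_grad(x1, w, b, leaves) @ stump_grad(x2, w, b, leaves)

w = np.array([1.0, 2.0])
leaves = np.array([1.0, -1.0])
k12 = empirical_ntk(np.array([0.5, -1.0]), np.array([-0.3, 0.8]), w, 0.0, leaves)
k11 = empirical_ntk(np.array([0.5, -1.0]), np.array([0.5, -1.0]), w, 0.0, leaves)
```

In the infinite-width (here, infinitely-many-trees) limit, averaging such kernels over random initializations yields the deterministic limiting NTK studied in the paper.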

