THE POWER OF CHOICES IN DECISION TREE LEARNING

Abstract

We propose a simple and natural generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These classic algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the "best" attribute. We augment these algorithms with an additional greediness parameter k and our resulting algorithm, Top-k, considers the k best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a sharp greediness hierarchy theorem showing that for every k ∈ N, Top-(k + 1) can be much more powerful than Top-k: there are data distributions for which the former achieves accuracy 1 − ε, whereas the latter only achieves accuracy 1/2 + ε. We then show, through extensive experiments, that Top-k compares favorably with the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-k consistently enjoys significant accuracy gains over the greedy algorithms across a wide range of benchmarks, at the cost of only a mild training slowdown. On the other hand, Top-k is markedly more scalable than optimal decision tree algorithms, and is able to handle dataset and feature set sizes that remain beyond the reach of these algorithms. Taken together, our results highlight the potential practical impact of the power of choices in decision tree learning.

1. INTRODUCTION

Decision trees are a fundamental workhorse in machine learning. Their logical and hierarchical structure makes them easy to understand and their predictions easy to explain. Decision trees are therefore the most canonical example of an interpretable model: in his influential survey (Breiman, 2001b), Breiman writes "On interpretability, trees rate an A+"; much more recently, the survey of Rudin et al. (2022) lists decision tree optimization as the very first of 10 grand challenges for the field of interpretable machine learning. Decision trees are also at the heart of modern ensemble methods such as random forests (Breiman, 2001a) and XGBoost (Chen & Guestrin, 2016), which achieve state-of-the-art accuracy for a wide range of tasks.

Greedy algorithms such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman et al., 1984) have long been the standard approach to decision tree learning. These algorithms build a decision tree from labeled data in a top-down manner, growing the tree by iteratively splitting on the "best" attribute as measured with respect to a certain potential function (e.g. information gain). Owing to their simplicity, these algorithms are highly efficient, scale gracefully to massive datasets and feature set sizes, and continue to be widely employed in practice, where they enjoy significant empirical success. For the same reasons, they are also part of the standard curriculum in introductory machine learning and data science courses.

The trees produced by these greedy algorithms are often reasonably accurate, but can nevertheless be suboptimal. There has therefore been a separate line of work, which we overview in Section 2, on algorithms that optimize for accuracy, and in fact seek to produce optimally accurate decision trees.
These algorithms employ a variety of optimization techniques (including dynamic programming, integer programming, and SAT solvers) and are completely different from the simple greedy algorithms discussed above. Since the problem of finding an optimal decision tree has long been known to be NP-hard (Hyafil & Rivest, 1976), any algorithm must suffer from the inherent combinatorial explosion when the instance size becomes sufficiently large (unless P = NP). Therefore, while this line of work has made great strides in improving the scalability of algorithms for optimal decision trees, dataset and feature set sizes in the high hundreds understandably remain out of reach.

This state of affairs raises a natural question:

Can we design decision tree learning algorithms that improve significantly on the accuracy of classic greedy algorithms and yet inherit their simplicity and scalability?

In this work, we propose a new approach and make the case that it provides a strong affirmative answer to the question above. We further show that it opens up several new avenues for exploration in both the theory and practice of decision tree learning.

1.1. OUR CONTRIBUTIONS

1.1.1. TOP-k: A SIMPLE AND EFFECTIVE GENERALIZATION OF CLASSIC GREEDY DECISION TREE ALGORITHMS

We introduce an easily interpretable greediness parameter to the class of all greedy decision tree algorithms, a broad class that encompasses ID3, C4.5, and CART. This parameter, k, represents the number of features that the algorithm considers as candidate splits at each step. Setting k = 1 recovers the fully greedy classical approaches, and increasing k allows the practitioner to produce more accurate trees at the cost of only a mild training slowdown. The focus of our work is on the regime where k is a small constant, since preserving the efficiency and scalability of greedy algorithms is a primary objective of our work; we mention, though, that by setting k to be the dimension d, our algorithm produces an optimal tree. Our overall framework can thus be viewed as interpolating between greedy algorithms at one extreme and "optimal decision tree" algorithms at the other, precisely the two main and previously disparate approaches to decision tree learning discussed above.

We will now describe our framework. A feature scoring function H takes as input a dataset over d binary features and a specific feature i ∈ [d], and returns a number quantifying the "desirability" of this feature as the root of the tree. The greedy algorithm corresponding to H selects as the root of the tree it builds the feature that has the largest score under H; our generalization will instead consider the k features with the k highest scores.

Definition 1 (Feature scoring function). A feature scoring function H takes as input a labeled dataset S over a d-dimensional feature space and a feature i ∈ [d], and returns a score ν_i ∈ [0, 1].

See Section 3.1 for a discussion of the feature scoring functions that correspond to the standard greedy algorithms ID3, C4.5, and CART. Pseudocode for Top-k is provided in Figure 1. We note that from the perspective of interpretability, the trained model looks the same regardless of what k is: during training, the algorithm considers more splits, but only one split is eventually used at each node.
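To make the framework concrete, the following is a minimal Python sketch of Top-k in the spirit of the pseudocode in Figure 1. The helper names and the choice of information gain as the scoring function H are our illustrative assumptions, not the paper's exact implementation: at each node, the k highest-scoring features are tried as splits, the subtrees are built recursively, and the candidate with the best training accuracy is kept.

```python
import math
from collections import Counter

def information_gain(S, i):
    """Score feature i on a labeled dataset S of (x, y) pairs with binary
    features and binary labels. Illustrative choice of scoring function H."""
    def entropy(rows):
        if not rows:
            return 0.0
        p = sum(y for _, y in rows) / len(rows)
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    left = [(x, y) for x, y in S if x[i] == 0]
    right = [(x, y) for x, y in S if x[i] == 1]
    return (entropy(S) - len(left) / len(S) * entropy(left)
                       - len(right) / len(S) * entropy(right))

def majority_leaf(S):
    return ("leaf", Counter(y for _, y in S).most_common(1)[0][0])

def predict(tree, x):
    while tree[0] == "node":
        _, i, left, right = tree
        tree = right if x[i] == 1 else left
    return tree[1]

def accuracy(tree, S):
    return sum(predict(tree, x) == y for x, y in S) / len(S)

def top_k_tree(S, depth, k, score=information_gain):
    """Grow a tree of depth at most `depth`, exploring the k best-scoring
    splits at each node and keeping the candidate with the highest
    training accuracy. k = 1 recovers the classic greedy algorithm."""
    if depth == 0 or len({y for _, y in S}) <= 1:
        return majority_leaf(S)
    d = len(S[0][0])
    candidates = sorted(range(d), key=lambda i: score(S, i), reverse=True)[:k]
    best_tree = majority_leaf(S)
    best_acc = accuracy(best_tree, S)
    for i in candidates:
        left = [(x, y) for x, y in S if x[i] == 0]
        right = [(x, y) for x, y in S if x[i] == 1]
        if not left or not right:
            continue
        subtree = ("node", i,
                   top_k_tree(left, depth - 1, k, score),
                   top_k_tree(right, depth - 1, k, score))
        acc = accuracy(subtree, S)
        if acc > best_acc:
            best_tree, best_acc = subtree, acc
    return best_tree

# XOR of two bits is fit exactly by a depth-2 tree.
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(accuracy(top_k_tree(S, depth=2, k=2), S))  # → 1.0
```

Note that the returned model is an ordinary decision tree whatever the value of k; only the search during training differs, consistent with the interpretability remark above.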

1.1.2. THEORETICAL RESULTS ON THE POWER OF TOP-k

The search space of Top-(k + 1) is larger than that of Top-k, and therefore its training accuracy is certainly at least as high. The first question we consider is: is the test accuracy of Top-(k + 1) only marginally better than that of Top-k, or are there examples of data distributions for which even a single additional choice provably leads to huge gains in test accuracy?

Our first main theoretical result is a sharp greediness hierarchy theorem, showing that this parameter can have dramatic impacts on accuracy, thereby illustrating its power:

Theorem 1 (Greediness hierarchy theorem). For every ε > 0 and k, h ∈ N, there is a data distribution on which Top-(k + 1) achieves at least 1 − ε accuracy with a depth budget of h, but Top-k achieves at most 1/2 + ε accuracy with a depth budget of h.

Theorem 1 is a special case of a more general result that we show: for all k < K, there are data distributions on which Top-K achieves maximal accuracy gains over Top-k, even if Top-k is allowed a larger depth budget.

Theorem 2 (Generalization of Theorem 1). For every ε > 0 and k, K, h ∈ N with k < K, there is a data distribution on which Top-K achieves at least 1 − ε accuracy with a depth budget of h, but Top-k achieves at most 1/2 + ε accuracy even with a depth budget of h + (K − k − 1).
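The intuition behind such separations can be seen in a small toy example (this construction is ours, simpler than the distributions used in the paper's proofs, and uses information gain as the scoring function H): the label is x0 XOR x1, and a "distractor" feature x2 merely correlates with the label. A fully greedy learner roots the tree at the distractor, which under a tight depth budget is fatal, whereas two choices suffice to find the right root.

```python
import math

def entropy(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gain(S, i):
    """Information gain of splitting dataset S of (x, y) pairs on feature i."""
    left = [y for x, y in S if x[i] == 0]
    right = [y for x, y in S if x[i] == 1]
    ys = [y for _, y in S]
    return (entropy(ys) - len(left) / len(S) * entropy(left)
                        - len(right) / len(S) * entropy(right))

# Toy hard dataset: label = x0 XOR x1; the distractor x2 agrees with the
# label on 3 out of every 4 examples.
S = []
for x0 in (0, 1):
    for x1 in (0, 1):
        y = x0 ^ x1
        S += [((x0, x1, y), y)] * 3   # copies where x2 matches the label
        S += [((x0, x1, 1 - y), y)]   # one copy where x2 disagrees

print([round(gain(S, i), 3) for i in range(3)])  # → [0.0, 0.0, 0.189]
```

Individually, x0 and x1 have zero gain while x2 has positive gain, so a k = 1 learner roots the tree at x2; with a depth budget of 2 it then cannot represent x0 XOR x1 and tops out at 75% accuracy, while Top-2 may also try x0 as the root and recover a perfect depth-2 tree.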

