THE POWER OF CHOICES IN DECISION TREE LEARNING

Abstract

We propose a simple and natural generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These classic algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the "best" attribute. We augment these algorithms with an additional greediness parameter k: our resulting algorithm, Top-k, considers the k best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a sharp greediness hierarchy theorem showing that for every k ∈ N, Top-(k + 1) can be much more powerful than Top-k: there are data distributions for which the former achieves accuracy 1 − ε, whereas the latter only achieves accuracy 1/2 + ε. We then show, through extensive experiments, that Top-k compares favorably with the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-k consistently enjoys significant accuracy gains over the greedy algorithms across a wide range of benchmarks, at the cost of only a mild training slowdown. On the other hand, Top-k is markedly more scalable than optimal decision tree algorithms, and is able to handle dataset and feature set sizes that remain beyond the reach of these algorithms. Taken together, our results highlight the potential practical impact of the power of choices in decision tree learning.

1. INTRODUCTION

Decision trees are a fundamental workhorse in machine learning. Their logical and hierarchical structure makes them easy to understand and their predictions easy to explain. Decision trees are therefore the most canonical example of an interpretable model: in his influential survey (Breiman, 2001b), Breiman writes "On interpretability, trees rate an A+"; much more recently, the survey of Rudin et al. (2022) lists decision tree optimization as the very first of 10 grand challenges for the field of interpretable machine learning. Decision trees are also at the heart of modern ensemble methods such as random forests (Breiman, 2001a) and XGBoost (Chen & Guestrin, 2016), which achieve state-of-the-art accuracy for a wide range of tasks.

Greedy algorithms such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman et al., 1984) have long been the standard approach to decision tree learning. These algorithms build a decision tree from labeled data in a top-down manner, growing the tree by iteratively splitting on the "best" attribute as measured with respect to a certain potential function (e.g., information gain). Owing to their simplicity, these algorithms are highly efficient, scale gracefully to massive datasets and feature set sizes, and continue to be widely employed in practice with significant empirical success. For the same reasons, they are also part of the standard curriculum in introductory machine learning and data science courses.

The trees produced by these greedy algorithms are often reasonably accurate, but can nevertheless be suboptimal. There has therefore been a separate line of work, which we overview in Section 2, on algorithms that optimize for accuracy and, in fact, seek to produce optimally accurate decision trees.
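To make the split-selection step concrete, the following is a minimal, hypothetical Python sketch of the Top-k idea for binary features: rank attributes by information gain, recurse on each of the k highest-scoring candidates rather than only the single best one, and keep the most accurate resulting tree. All function names, the depth bound, and the tree representation are our own illustrative choices, not the authors' implementation; setting k = 1 recovers ordinary greedy growth.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy of a nonempty label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(X, y, attr):
    """Information gain of splitting (X, y) on a binary attribute."""
    left = [yi for xi, yi in zip(X, y) if xi[attr] == 0]
    right = [yi for xi, yi in zip(X, y) if xi[attr] == 1]
    if not left or not right:
        return 0.0
    n = len(y)
    return entropy(y) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)


def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, a, left, right = tree
    return predict(left if x[a] == 0 else right, x)


def accuracy(tree, X, y):
    return sum(predict(tree, xi) == yi for xi, yi in zip(X, y)) / len(y)


def top_k_tree(X, y, k, depth):
    """Grow a depth-bounded tree, branching over the k best candidate splits."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == 0 or len(set(y)) == 1:
        return ("leaf", majority)
    # Rank all attributes by information gain; keep the k best as candidates.
    attrs = sorted(range(len(X[0])), key=lambda a: info_gain(X, y, a), reverse=True)[:k]
    best_tree = ("leaf", majority)
    best_acc = accuracy(best_tree, X, y)
    for a in attrs:
        part_l = [(xi, yi) for xi, yi in zip(X, y) if xi[a] == 0]
        part_r = [(xi, yi) for xi, yi in zip(X, y) if xi[a] == 1]
        if not part_l or not part_r:
            continue
        left = top_k_tree([x for x, _ in part_l], [yy for _, yy in part_l], k, depth - 1)
        right = top_k_tree([x for x, _ in part_r], [yy for _, yy in part_r], k, depth - 1)
        tree = ("split", a, left, right)
        acc = accuracy(tree, X, y)
        if acc > best_acc:  # keep the most accurate of the k candidate subtrees
            best_tree, best_acc = tree, acc
    return best_tree
```

For example, `top_k_tree(X, y, k=2, depth=2)` on the four XOR examples over two binary features fits the data exactly; the point of the sketch is the branching over k candidates, which is what distinguishes Top-k from the purely greedy k = 1 case.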
These algorithms employ a variety of optimization techniques (including dynamic programming, integer programming, and SAT solvers) and are completely different from the simple greedy algorithms discussed above. Since the problem of finding an optimal decision tree has long been known to be NP-hard (Hyafil & Rivest, 1976), any algorithm must suffer from the inherent combinatorial

