SUCCINCT EXPLANATIONS WITH CASCADING DECISION TREES

Abstract

Classic decision tree learning is a binary classification algorithm that constructs models with first-class transparency: every classification has a directly derivable explanation. However, learning decision trees on modern datasets generates large trees, which in turn generate decision paths of excessive depth, obscuring the explanation of classifications. To improve the comprehensibility of classifications, we propose a new decision tree model that we call Cascading Decision Trees. Cascading Decision Trees shorten the explanations of classifications without sacrificing overall model performance. Our key insight is to separate the notion of a decision path from that of an explanation path. Using this insight, instead of building one monolithic decision tree, we build several smaller decision subtrees and cascade them in sequence. Our cascading decision subtrees are designed to specifically target explanations for positive classifications: each subtree identifies the smallest set of features that can classify as many positive samples as possible without misclassifying any negative samples. Applying cascading decision trees to new samples yields a significantly shorter and more succinct explanation whenever one of the subtrees detects a positive classification. In that case, we immediately stop and report the decision path of only the current subtree to the user as an explanation for the classification. We evaluate our algorithm on standard datasets as well as new real-world applications and find that our model shortens the explanation depth by over 40.8% for positive classifications compared to the classic decision tree model.

1. INTRODUCTION

Binary classification is the process of assigning each sample in a given input set to one of two classes based on some classification criteria. Binary classification is widely used in everyday life: a typical application is determining whether a patient has some disease by analyzing their comprehensive medical record. Existing work on binary classification mainly uses prediction accuracy as the main criterion for evaluating model performance. However, for a model to be useful in real-world applications, it is imperative that users be able to understand and explain the logic underlying model predictions. Model comprehensibility¹ in some real-world applications, especially in the medical and scientific domains, is of the utmost importance. In these cases, users need to understand the classification model to scientifically explain the reasons behind a classification, or even rely on the model itself to discover a possible solution to the target problem. It is difficult to provide explainability without sacrificing classification accuracy using current models. "Black-box" models such as deep neural networks, random forests, and ensembles of classifiers tend to have the highest accuracy in binary classification Freitas (2014); Došilović et al. (2018). However, their opaque structure hinders understandability, making the logic behind their predictions difficult to trace. This lack of transparency may further discourage users from using the model Augasta & Kathirvalavakumar (2012); Van Assche & Blockeel (2007). Decision tree models, on the other hand, have transparent decision-making steps: the traversal of features on the decision path from the root to a leaf node is presented to users as a rule. Therefore, compared to other models, the decision tree model has historically been characterized as having high comprehensibility Freitas (2014); Došilović et al. (2018).
However, whether models generated by classic decision trees provide enough comprehensibility has been challenged: "decision trees [...] can grow so large that no human can understand them or verify their correctness" Caruana et al. (1999), or they may contain subtrees with redundant attribute conditions, leading to potential misinterpretation of the root cause of model predictions Freitas (2014). The work presented in this paper introduces an algorithm for deriving succinct explanations for positive classifications while maintaining overall prediction accuracy. To this end, we introduce a novel cascading decision tree model. A cascading decision tree is a sequence of several smaller decision trees (subtrees), each with a predefined depth. Every subtree is sequentially followed by another subtree, mimicking a cascading waterfall, hence the name. The sequence ends when a subtree does not contain any leaves describing positively classified samples. Fig. 2 depicts one such cascading decision tree. The main idea behind cascading decision trees is that most algorithms for constructing decision trees are greedy and try to classify as many samples as early as possible; such classification results in long explanation paths for the samples in the lower levels of the tree. Instead, we construct a subtree of the predefined depth. That subtree contains a short explanation for the samples it manages to classify. A subtree of such shallow depth will, however, misclassify some samples. We therefore repeat the process on the training set with the previously positively classified samples removed. This way, the samples classified as positive in the second subtree will have a much shorter explanation path than they would in the original decision tree. In the cascading decision tree model, the explanation path for a positively classified sample is the path that starts at the root of the corresponding subtree.
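The training loop described above can be sketched as follows. This is a minimal illustration built on scikit-learn's depth-limited trees, not the authors' implementation; the concrete stopping rule (stop when no pure-positive leaf remains) and the helper names `train_cascade` and `predict_one` are our assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_cascade(X, y, max_depth=2):
    """Sketch of cascading decision tree training: repeatedly fit a shallow
    tree, remove the samples that land in pure-positive leaves (short, exact
    explanations), and stop when no such leaf remains."""
    subtrees = []
    X_rem, y_rem = np.asarray(X, dtype=float), np.asarray(y)
    while y_rem.sum() > 0:  # positives still unexplained
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=0).fit(X_rem, y_rem)
        leaf_ids = tree.apply(X_rem)
        # Leaves containing only positive samples never misclassify a negative.
        pure_pos = {l for l in np.unique(leaf_ids)
                    if (y_rem[leaf_ids == l] == 1).all()}
        if not pure_pos:
            break  # no positive leaf in this subtree: end of the cascade
        subtrees.append((tree, pure_pos))
        keep = ~np.isin(leaf_ids, list(pure_pos))
        X_rem, y_rem = X_rem[keep], y_rem[keep]
    return subtrees

def predict_one(subtrees, x):
    """Return (label, subtree_index); the explanation is the decision path
    inside that single subtree only."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    for i, (tree, pure_pos) in enumerate(subtrees):
        if tree.apply(x)[0] in pure_pos:
            return 1, i  # positive: stop and explain with this subtree
    return 0, None

# Tiny hypothetical dataset: positive iff Feature1=T or Feature2=T.
X, y = [[0, 0], [1, 0], [0, 1], [1, 1]], [0, 1, 1, 1]
cascade = train_cascade(X, y, max_depth=1)
print(len(cascade), predict_one(cascade, [1, 0]))
```

Each iteration removes at least one sample (every pure-positive leaf is non-empty), so the loop terminates; a sample that no subtree classifies positively falls through the whole cascade and is labeled negative.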
We target explanations for positive classifications only, based on real-world motivation. In the medical domain, a positive classification result indicates that a person has the disease for which the test is being done NIH (2020). A positive classification is also combined with the additional testing needed for a full diagnosis CDC (2020). Note that, should a practical application arise, our cascading decision tree model could easily be changed to target negative classifications instead. Reducing the size and depth of a decision tree to improve comprehensibility has been studied from both a theoretical and a practical perspective. However, constructing such optimally small decision trees is an NP-complete problem Hyafil & Rivest (1976), and the main drawback of these approaches is that the models are computationally too expensive to train. Even when using state-of-the-art libraries Aglin et al. (2020); Verwer & Zhang (2019), we observed this computational cost. For example, to learn a model on the Ionosphere dataset (from the UCI Machine Learning Repository), the BinOCT tool needs approximately 10 minutes, while our approach completes this task in 1.1 seconds. We demonstrate the applicability of the cascading decision tree model in two ways. First, we use our model to perform standard binary classification on three numerical datasets from the UCI Machine Learning Repository. Second, we apply our model to a new application of binary classification, namely continuous integration (CI) build status prediction Santolucito et al. (2018). Overall, we report that compared to the classic decision tree algorithm, our approach shortens the explanation depth for positive classifications by more than 40.8% while maintaining prediction accuracy.

2. MOTIVATING EXAMPLES

In this section we demonstrate how cascading decision trees can generate a shorter and more succinct explanation. The following simple synthetic example is contrived to illustrate our tool's basic functionality. Given the dataset in Table 1, a classic decision tree will construct the model shown in Fig. 1. Using the same dataset, our cascading decision tree algorithm generates a model with three subtrees, shown in Fig. 2. Let us assume that there is a new sample, "Sample11", with the feature vector (F, F, F, T). Both models classify "Sample11" with the same prediction result, "Positive". However, the explanations extracted from the two models differ. In the classic decision tree model (Fig. 1), "Sample11" falls into node (9). Thus, the explanation path is "Feature1 = F", "Feature2 = F", and "Feature3 = F", with an explanation depth of three.
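Extracting such an explanation path from a fitted tree can be sketched as below. This is a generic scikit-learn helper of our own, shown on a hypothetical two-feature boolean dataset (T=1, F=0), not on the paper's Table 1 data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def explanation_path(clf, feature_names, x):
    """List the root-to-leaf conditions a fitted sklearn tree applies to
    sample x -- the 'explanation path' in the sense used above."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    # Node ids are assigned in DFS preorder, so they increase along any path.
    node_ids = np.sort(clf.decision_path(x).indices)
    feat, thr = clf.tree_.feature, clf.tree_.threshold
    steps = []
    for nid in node_ids[:-1]:  # the last visited node is the leaf itself
        op = "<=" if x[0, feat[nid]] <= thr[nid] else ">"
        steps.append(f"{feature_names[feat[nid]]} {op} {thr[nid]:.2f}")
    return steps

# Hypothetical dataset: positive only for (T, T).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(explanation_path(clf, ["Feature1", "Feature2"], [1, 1]))  # depth-2 path
print(explanation_path(clf, ["Feature1", "Feature2"], [0, 0]))  # depth-1 path
```

The explanation depth is simply the length of the returned list; in the cascading model the same helper would be applied to the single subtree that produced the positive classification.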



¹ In this paper, comprehensibility and interpretability are used interchangeably.

