GENERALIZING TREE MODELS FOR IMPROVING PREDICTION ACCURACY

Abstract

Can we generalize and improve the representation power of tree models? Tree models are often favored over deep neural networks due to their interpretable structures in problems where interpretability is required, such as the classification of feature-based data where each feature is meaningful. However, most tree models have low accuracy and easily overfit to training data. In this work, we propose Decision Transformer Network (DTN), a highly accurate and interpretable tree model based on decision transformers, our generalized framework of tree models. Decision transformers allow us to describe tree models in the context of deep learning. DTN improves the generalizable components of the decision transformer, which increases the representation power of tree models while preserving the inherent interpretability of the tree structure. Our extensive experiments on 121 feature-based datasets show that DTN outperforms state-of-the-art tree models and even deep neural networks.

1. INTRODUCTION

Can we generalize and improve the representation power of tree models? Tree models learn a structure where the decision process is easy to follow (Breiman et al., 1984). Due to this attractive property, various efforts have been made to improve their performance or to utilize them as subcomponents of deep learning models (Kontschieder et al., 2015; Shen et al., 2018). Compared to typical deep neural networks, the main characteristic of tree models is that input data are propagated through layers without a change in their representations; the internal nodes calculate the probability of an input x arriving at the leaf nodes. The decision process that determines the membership of a training example, i.e., the process of constructing the subsets of training data at each leaf, is the core operation in tree models, which we generalize and improve in this work.

Previous works on tree models consider the decisions at different nodes as separate operations, i.e., each node performs its own decision independently of the other nodes in the same layer, based on the arrival probability of an input x at the node. The independence of decisions simplifies learning. However, to obtain a comprehensive view, the decisions of multiple nodes must be considered simultaneously. Furthermore, a typical tree model of depth L requires b^L − 1 decision functions, where b is the branching factor, which makes it intractable to construct a deep tree model. In this work, we show that many tree models can be generalized to what we term the decision transformer (Proposition 1). A decision transformer generalizes existing tree models by treating the decisions of each layer as a single operation involving all nodes in that layer. More specifically, the decision transformer views each layer as a stochastic decision (Def. 2) that linearly transforms the membership of each training example by a learned stochastic matrix (Def. 1).
The aggregated layerwise view allows the decision transformer to reduce the complexity of analysis to O(L), where L is the tree depth. Furthermore, formulating tree models in a probabilistic context allows them to be understood within the deep learning framework and provides a theoretical foundation for deeply understanding the advantages and limitations of existing tree models in a single framework. This in-depth understanding of tree models as decision transformers allows us to propose Decision Transformer Network (DTN), a novel architecture that inherits the interpretability of tree models while improving their representation power. DTN is an extension of tree models into deep networks that creates multiple paths between nodes based on stochastic decisions. We propose two versions of DTN: DTN-D with fully dense connections and DTN-S with sparse connections where the edges are connected in a locality-sensitive form. Our experiments show that DTN outperforms previous tree models and deep neural networks on 121 tabular datasets.

Our contributions are summarized as follows. First, we introduce a generalized framework for tree models called the decision transformer and perform theoretical analysis on its generalizability and interpretability (Section 3). Second, we propose dense and sparse versions of Decision Transformer Network (DTN), our novel decision-based model that is both interpretable and accurate (Section 4). Third, we conduct extensive experiments on 121 tabular datasets to show that DTN outperforms state-of-the-art tree models and deep neural networks (Section 5).
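To make the layerwise view concrete, the following is a minimal sketch of propagating membership probabilities through one layer of a binary soft tree, where each node's decision becomes one column of a left stochastic matrix. The function and variable names (`layer_transform`, `gate_probs`) are illustrative and not taken from the paper.

```python
import numpy as np

def layer_transform(pi, gate_probs):
    """Propagate membership probabilities through one binary tree layer.

    pi: membership probabilities over the current layer's nodes (sums to 1).
    gate_probs: per-node probability of routing an example to the left child.
    Returns membership over the next layer's 2 * len(pi) nodes.
    """
    n = len(pi)
    # Build the left stochastic matrix T: column j splits node j's
    # probability mass between its two children, so each column sums to 1.
    T = np.zeros((2 * n, n))
    for j, g in enumerate(gate_probs):
        T[2 * j, j] = g          # left child
        T[2 * j + 1, j] = 1 - g  # right child
    return T @ pi

pi = np.array([1.0])                  # root: all mass at a single node
pi = layer_transform(pi, [0.7])       # depth 1: two nodes
pi = layer_transform(pi, [0.5, 0.2])  # depth 2: four nodes
```

Note that each layer is a single matrix-vector product, so analyzing a depth-L tree reduces to L such transformations, matching the O(L) complexity claimed above.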

2. RELATED WORKS

A tree model refers to a parameterized function that passes an example through its tree-structured layers without changing its representation. We categorize tree models based on whether they utilize linear decisions over raw features or neural networks in combination with representation learning.

Tree Models with Linear Decisions

Traditional tree models perform linear decisions over raw features. Given an example x, decision trees (DT) (Breiman et al., 1984) compare a single selected element to a learned threshold at each branch to make a decision. Soft decision trees (SDT) (Irsoy et al., 2012) generalize DTs to use all elements of x through a logistic classifier, making a probabilistic decision at each node. SDTs have been widely studied and applied to various problems due to their simplicity (Irsoy & Alpaydin, 2015; Frosst & Hinton, 2017; Linero & Yang, 2018; Yoo & Sael, 2019). Deep neural decision trees (Yang et al., 2018) also extend DTs to multi-branched trees by splitting each feature element into multiple bins. These models provide a direct interpretation of predictions and decision processes. There are also well-known ensembles of such linear trees, such as random forests (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016). These ensemble models boost the performance of tree models at the expense of interpretability.
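The contrast between hard DT branches and soft SDT branches can be sketched as follows. This is a simplified illustration under the descriptions above, not the exact formulation of either paper; the parameter names (`w`, `b`, `threshold`) are assumptions.

```python
import numpy as np

def hard_decision(x, feature, threshold):
    """Classic DT branch: compare one selected element of x to a
    learned threshold; returns 0 (left) or 1 (right)."""
    return 0 if x[feature] <= threshold else 1

def soft_decision(x, w, b):
    """SDT branch: a logistic classifier over all elements of x gives
    the probability of routing to the left child."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```

A hard decision sends an example down exactly one path, whereas a soft decision distributes it over both children with probabilities that sum to one, which is what makes SDTs differentiable and trainable by gradient descent.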

Tree Models with Representation Learning

There are recent works combining tree models with deep neural networks, which add the ability of representation learning to the hierarchical decisions of tree models. A popular approach is to use abstract representations learned by convolutional neural networks (CNNs) as inputs of the decision functions of tree models (Kontschieder et al., 2015; Roy & Todorovic, 2016; Shen et al., 2017; 2018; Wan et al., 2020), sometimes with multilayer perceptrons instead of CNNs (Bulò & Kontschieder, 2014). Another approach is to categorize raw examples by hierarchical decisions and then apply different classifiers to the separate clusters (Murthy et al., 2016). There are also approaches that improve the performance of deep neural networks by inserting hierarchical decisions as differentiable operations into a neural network, instead of building a complete tree model (Murdock et al., 2016; McGill & Perona, 2017; Brust & Denzler, 2019; Tanno et al., 2019). In this work, we focus on models having complete trees in their structure.

3. GENERALIZING TREE MODELS

We first introduce a decision transformer, our unifying framework that represents a tree model as a series of linear transformations over probability vectors. We then analyze the properties of decision transformers: the generalizability to existing tree models and the interpretability.

3.1. DECISION TRANSFORMERS

Definition 1 (Stochastic matrix). A matrix T is (left) stochastic if T ≥ 0 and ∑ᵢ aᵢⱼ = 1 for all j, where aᵢⱼ denotes the (i, j)-th entry of T, i.e., every column sums to one. A vector p is stochastic if it satisfies the same condition as a matrix of size |p| × 1. We denote the set of all stochastic vectors and matrices as P, i.e., we write T ∈ P or p ∈ P.

Definition 2 (Stochastic decision). A function f is a stochastic decision if it linearly transforms an input vector π by a left stochastic matrix as f(π) = Tπ, where T ∈ P is independent of π.

Corollary 1. Given a stochastic decision f and a vector π, f(π) ∈ P if π ∈ P.
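Corollary 1 can be checked numerically: applying any column-stochastic matrix to a stochastic vector yields another stochastic vector. The sketch below builds a random instance of Definitions 1 and 2 (the variable names are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# A left (column-)stochastic matrix T: nonnegative, each column sums to 1.
T = rng.random((4, 3))
T /= T.sum(axis=0, keepdims=True)

# A stochastic vector pi: nonnegative entries summing to 1.
pi = rng.random(3)
pi /= pi.sum()

# The stochastic decision f(pi) = T @ pi is again stochastic (Corollary 1):
# each entry is a nonnegative combination, and the total mass is preserved
# because every column of T sums to 1.
out = T @ pi
assert np.all(out >= 0) and np.isclose(out.sum(), 1.0)
```

This mass-preservation property is what lets a decision transformer interpret the output of every layer as a valid probability distribution over that layer's nodes.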

