RRL: A SCALABLE CLASSIFIER FOR INTERPRETABLE RULE-BASED REPRESENTATION LEARNING

Abstract

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to tackle these issues, but they sacrifice model interpretability. In this paper, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. An improved design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize continuous features end-to-end. Exhaustive experiments on 9 small and 4 large data sets show that RRL outperforms the competitive approaches, has low complexity close to that of simple decision trees, and that its main technical contributions are well justified.

1. INTRODUCTION

Although Deep Neural Networks (DNNs) have achieved impressive results in various machine learning tasks (Goodfellow et al., 2016), rule-based models, benefiting from their transparent inner structures and good model expressivity, still play an important role in domains demanding high model interpretability, such as medicine, finance, and politics (Doshi-Velez & Kim, 2017). In practice, rule-based models can easily provide explanations for users to earn their trust and help protect their rights (Molnar, 2019; Lipton, 2016). By analyzing the learned rules, practitioners can understand the decision mechanism of models and use their knowledge to improve or debug the models (Chu et al., 2018). Moreover, even if post-hoc methods can provide interpretations for DNNs, the interpretations from rule-based models are more faithful and specific (Murdoch et al., 2019). However, conventional rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures, which limits their application scope. To take advantage of rule-based models in more scenarios, we urgently need to improve their scalability.

Studies in recent years provide some solutions for improving conventional rule-based models in different aspects. Ensemble methods and soft/fuzzy rules are proposed to improve the performance and scalability of rule-based models, but at the cost of model interpretability (Ke et al., 2017; Breiman, 2001; Irsoy et al., 2012). The Bayesian framework is also leveraged to more reasonably restrict and adjust the structures of rule-based models (Letham et al., 2015; Wang et al., 2017; Yang et al., 2017). However, due to the non-differentiable model structure, these methods have to rely on techniques like MCMC or Simulated Annealing, which can be time-consuming for large models. Another way to improve rule-based models is to let a high-performance but complex model (e.g., a DNN) teach a rule-based model (Frosst & Hinton, 2017; Ribeiro et al., 2016).
However, learning from the complex model requires soft rules, otherwise the fidelity of the student model is not guaranteed. Wang et al. (2020) try to extract hierarchical rule sets from a tailored neural network. Although the extracted rules could behave differently from the neural network when the network is large, this line of work, combined with binarized networks (Courbariaux et al., 2015), inspires us to search for the discrete solution of rule-based models in a continuous space and to leverage optimization methods like gradient descent. In this paper, we propose a novel rule-based model, named Rule-based Representation Learner (RRL) (see Figure 1a), which has three key technical contributions: (i) To achieve model transparency, RRL is formulated as a hierarchical model, with layers supporting both conjunction and disjunction operations. This paves the way for automatically learning interpretable non-fuzzy rules for data representation and classification. (ii) To facilitate training effectiveness, RRL exploits a novel gradient-based discrete model training method, Gradient Grafting, that directly optimizes the discrete model and uses the gradient information at both continuous and discrete points to suit more scenarios. (iii) To ensure data scalability, RRL utilizes improved logical activation functions to handle high-dimensional features. By further combining the improved logical activation functions with a tailored feature binarization layer, RRL realizes continuous feature discretization in an end-to-end manner. We conduct experiments on 9 small data sets and 4 large data sets to validate the advantages of our model over other representative classification models. The benefits of the model's key components are also verified by experiments.
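To make the conjunction and disjunction layers concrete, the following is a minimal sketch of product-based logical activation functions of the kind commonly used for such layers; the function names are hypothetical, and RRL's improved activations differ in detail (they are designed to remain trainable in high dimensions):

```python
import numpy as np

def conjunction(h, w):
    """Product-based conjunction node.

    h: (n,) input activations in [0, 1]; w: (n,) binary weights in {0, 1}.
    The term 1 - w_j * (1 - h_j) equals h_j when w_j selects input j,
    and 1 (the identity of AND) otherwise, so with binary inputs the
    product reduces to an exact logical AND over the selected inputs.
    """
    return np.prod(1.0 - w * (1.0 - h))

def disjunction(h, w):
    """Dual disjunction node: high when any selected input is high.

    With binary inputs this reduces to an exact logical OR over the
    inputs selected by w.
    """
    return 1.0 - np.prod(1.0 - w * h)
```

With binary weights and binarized input features, every node in such a layer therefore corresponds to a readable non-fuzzy rule, which is what makes the learned representation interpretable.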

2. RELATED WORK

Rule-based Models. Decision trees, rule lists, and rule sets are the widely used structures in rule-based models. Due to their discrete parameters and non-differentiable structures, they are typically trained with heuristic methods (Quinlan, 1993; Breiman, 2017; Cohen, 1995), which may not find the globally best solution or one with close performance, or with search algorithms (Wang et al., 2017; Angelino et al., 2017), which can take too much time on large data sets. In recent studies, Bayesian frameworks are leveraged to restrict and adjust the model structure more reasonably (Letham et al., 2015; Wang et al., 2017; Yang et al., 2017). Lakkaraju et al. (2016) learn independent if-then rules with smooth local search. Using algorithmic bounds and efficient data structures, Angelino et al. (2017) try to accelerate the learning of certifiably optimal rule lists. However, apart from heuristic methods, most existing rule-based models need frequent itemset mining and/or long-time searching, which limits their applications. Moreover, it is hard for these rule-based models to achieve performance comparable to complex models like Random Forest. Ensemble models like Random Forest (Breiman, 2001) and Gradient Boosted Decision Trees (Chen & Guestrin, 2016; Ke et al., 2017) have better performance than a single rule-based model. However, because the decision is made by hundreds of models, ensemble models are commonly not considered interpretable (Hara & Hayashi, 2016). Soft or fuzzy rules are also used to improve model performance (Irsoy et al., 2012; Ishibuchi & Yamamoto, 2005), but non-discrete rules are much harder to understand than discrete ones. Deep Neural Decision Tree (Yang et al., 2018) is a tree model realized by neural networks with the help of a soft binning function and the Kronecker product. However, due to the use of the Kronecker product, it is not scalable with respect to the number of features.
Other studies try to teach the rule-based model by a complex model, e.g., DNN, or extract rule-based models from complex models (Frosst & Hinton, 2017; Ribeiro et al., 2016; Wang et al., 2020) . However, the fidelity of the student model or extracted model is not guaranteed.

Gradient-based Discrete Model Training. Gradient-based discrete model training methods are mainly proposed to train binary or quantized neural networks for network compression and acceleration. Courbariaux et al. (2015; 2016) propose the Straight-Through Estimator (STE) for binary neural network training and achieve empirical success. However, STE requires gradient information at discrete points, which limits its applications. ProxQuant (Bai et al., 2018) formulates quantized network training as a regularized learning problem and optimizes it via the prox-gradient method. ProxQuant can use gradients at non-discrete points, but cannot directly optimize the discrete model. The Random Binarization (RB) method (Wang et al., 2020) trains a neural network with random binarization of its weights to ensure that the discrete and continuous models behave in the same way. However, when the model is large, differences between the discrete and continuous models are inevitable. The Gumbel-Softmax estimator (Jang et al., 2016) draws differentiable samples from a categorical distribution. However, since it is a biased estimator, it can hardly handle a large number of variables, e.g., the weights of binary networks. Our method, Gradient Grafting, differs from all the aforementioned works in that it uses gradient information from both the discrete and the continuous model in each backpropagation step.
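For reference, the STE scheme discussed above can be sketched in a few lines: the forward pass binarizes the continuous weights, while the backward pass passes the incoming gradient straight through as if binarization were the identity (here with the common clipping variant). The function names are illustrative, not from any particular library:

```python
import numpy as np

def binarize_forward(w):
    # Forward pass: hard sign binarization of continuous weights
    return np.where(w >= 0.0, 1.0, -1.0)

def ste_backward(grad_output, w, clip=1.0):
    # Backward pass (STE): treat binarization as identity, so the
    # gradient flows through unchanged; optionally zero it where the
    # continuous weight has left the clipping interval [-clip, clip]
    return grad_output * (np.abs(w) <= clip)
```

Note that the gradient here is taken at the discrete point produced by the forward pass, which is exactly the limitation mentioned above; Gradient Grafting instead combines gradient information from both the discrete and the continuous model.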

