RRL: A SCALABLE CLASSIFIER FOR INTERPRETABLE RULE-BASED REPRESENTATION LEARNING

Abstract

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to tackle these issues, but at the cost of model interpretability. In this paper, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. An improved design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize continuous features end-to-end. Exhaustive experiments on 9 small and 4 large data sets show that RRL outperforms the competitive approaches, has low complexity close to that of simple decision trees, and rests on well-justified technical contributions.

1. INTRODUCTION

Although Deep Neural Networks (DNNs) have achieved impressive results in various machine learning tasks (Goodfellow et al., 2016), rule-based models, benefiting from their transparent inner structures and good model expressivity, still play an important role in domains demanding high model interpretability, such as medicine, finance, and politics (Doshi-Velez & Kim, 2017). In practice, rule-based models can easily provide explanations for users to earn their trust and help protect their rights (Molnar, 2019; Lipton, 2016). By analyzing the learned rules, practitioners can understand the decision mechanism of a model and use their knowledge to improve or debug it (Chu et al., 2018). Moreover, even though post-hoc methods can provide interpretations for DNNs, the interpretations from rule-based models are more faithful and specific (Murdoch et al., 2019). However, conventional rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures, which limits their application scope. To take advantage of rule-based models in more scenarios, their scalability urgently needs to be improved.

Studies in recent years provide several solutions for improving conventional rule-based models. Ensemble methods and soft/fuzzy rules improve the performance and scalability of rule-based models, but at the cost of model interpretability (Ke et al., 2017; Breiman, 2001; Irsoy et al., 2012). The Bayesian framework has also been leveraged to restrict and adjust the structures of rule-based models in a more principled way (Letham et al., 2015; Wang et al., 2017; Yang et al., 2017). However, because the model structure is non-differentiable, these methods have to rely on techniques like MCMC or Simulated Annealing, which can be time-consuming for large models. Another way to improve rule-based models is to let a high-performance but complex model (e.g., a DNN) teach a rule-based student model (Frosst & Hinton, 2017; Ribeiro et al., 2016).
However, learning from the complex model either requires soft rules or cannot guarantee the fidelity of the student model. Wang et al. (2020) extract hierarchical rule sets from a tailored neural network. Although the extracted rules may behave differently from the network when the network is large, this work, together with binarized networks (Courbariaux et al., 2015), inspires us to search for the discrete solution of a rule-based model in a continuous space and to leverage optimization methods like gradient descent.
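To make the binarized-network idea concrete, the sketch below illustrates the generic straight-through-estimator scheme from Courbariaux et al. (2015): keep a continuous "shadow" copy of the parameters, run the forward pass on their discrete binarization, and apply the gradient computed at the discrete point directly to the continuous copy. This is only a minimal toy illustration of that general scheme, not the paper's Gradient Grafting method or RRL itself; the objective and all names here are hypothetical.

```python
# Toy sketch of straight-through optimization of a discrete model
# (generic binarized-network scheme, Courbariaux et al., 2015).
# NOT the paper's Gradient Grafting; objective and names are illustrative.

def binarize(w):
    """Discrete projection: continuous weights -> {0, 1} rule selections."""
    return [1.0 if wi > 0 else 0.0 for wi in w]

def loss_and_grad(b, target=2.0):
    """Toy objective (sum(b) - target)^2 and its gradient w.r.t. b."""
    err = sum(b) - target
    return err * err, [2.0 * err for _ in b]

# Continuous shadow parameters; the discrete model is binarize(w).
w = [0.5, 0.3, 0.8, 0.9]
lr = 0.1
for _ in range(50):
    b = binarize(w)                      # discrete forward pass
    loss, grad_b = loss_and_grad(b)      # gradient at the discrete point
    # Straight-through estimator: treat d(binarize)/dw as the identity,
    # so the continuous weights receive the discrete model's gradient.
    w = [wi - lr * gi for wi, gi in zip(w, grad_b)]
```

After a few updates the discrete model selects exactly two rules (matching the toy target), even though the loss itself is a step function of `w` and has zero gradient almost everywhere; the continuous copy is what makes gradient descent applicable.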

