ROUTE, INTERPRET, REPEAT: BLURRING THE LINE BETWEEN POST HOC EXPLAINABILITY AND INTERPRETABLE MODELS

Abstract

The current approach to ML model design is either to choose a flexible BlackBox model and explain it post hoc or to start with an interpretable model. BlackBox models are flexible but difficult to explain, whereas interpretable models are designed to be explainable. However, developing interpretable models necessitates extensive ML knowledge, and the resulting models tend to be less flexible, offering potentially subpar performance compared to their BlackBox equivalents. This paper aims to blur the distinction between post hoc explanation of a BlackBox and the construction of interpretable models. We propose beginning with a flexible BlackBox model and gradually carving out a mixture of interpretable models and a residual network. Our design identifies a subset of samples and routes them through the interpretable models. The remaining samples are routed through a flexible residual network. We adopt First Order Logic (FOL) as the interpretable models' backbone, which provides basic reasoning on concepts retrieved from the BlackBox model. On the residual network, we repeat the method until the proportion of data explained by the residual network falls below a desired threshold. Our approach offers several advantages. First, the mixture of interpretable models and a flexible residual network results in almost no compromise in performance. Second, the route, interpret, and repeat approach yields a highly flexible, interpretable model. Our extensive experiments demonstrate the performance of the model on various datasets. We show that by editing the FOL model, we can fix shortcuts learned by the original BlackBox model. Finally, our method provides a framework for a hybrid symbolic-connectionist network that is simple to train and adaptable to many applications.

1. INTRODUCTION

Model explainability is essential in high-stakes applications of AI, such as healthcare. While BlackBox models (e.g., Deep Learning) offer flexibility and modular design, post hoc explanations are prone to confirmation bias Wan et al. (2022), lack of fidelity to the original model Adebayo et al. (2018), and insufficient mechanistic explanation of the decision-making process Rudin (2019). Interpretable-by-design models do not suffer from these issues but tend to be less flexible than BlackBox models and demand substantial expertise to design and fine-tune. Using a post hoc explanation or adopting an interpretable model is a mutually exclusive decision made at the initial phase of AI model design. This paper aims to blur the line of that dichotomous model design.

The literature on post hoc explainable AI is extensive. Methods such as model attribution (e.g., Saliency Maps Simonyan et al. (2013); Selvaraju et al. (2017)), counterfactual approaches Abid et al. (2021); Singla et al. (2019), and distillation methods Alharbi et al. (2021); Cheng et al. (2020) are examples of post hoc explainability approaches. These methods either identify the input features that contribute the most to the network's output Shrikumar et al. (2016), generate perturbations to the input that flip the network's output Samek et al. (2016); Montavon et al. (2018), or estimate simpler functions that locally approximate the network's output. The advantage of post hoc explainability methods is that they do not compromise the flexibility and performance of the BlackBox. However, they suffer from several significant drawbacks, such as a lack of fidelity and of a mechanistic explanation of the network's output Rudin (2019). Without a mechanistic explanation, recourse to a model's undesirable behavior is unclear. Interpretable models are alternative designs to the BlackBox that do not suffer from many of these drawbacks, and they have a long history in statistics and machine learning Letham et al. (2015); Breiman et al. (1984). Several families of interpretable models exist, such as rule-based approaches and generalized additive models Hastie & Tibshirani (1987).

Many of these methods focus on tabular or categorical data and less on high-dimensional structured data such as images. Interpretable models for structured data rely mostly on projecting to a lower-dimensional concept or symbolic space that is understandable to humans Koh et al. (2020). Aside from a few exceptions Ciravegna et al. (2021); Barbiero et al. (2022), current State-Of-The-Art (SOTA) designs do not model the interaction between concepts and symbols, hence offering limited reasoning capabilities and less robustness. Furthermore, current designs are not as flexible as BlackBox models, which may compromise their performance. We aim to achieve the best of both worlds: the flexibility of the BlackBox and the mechanistic explainability of interpretable models. The general idea is that a single interpretable model may not be sufficiently powerful to explain all samples, and several interpretable models might be hidden inside the BlackBox model. We construct a hybrid neuro-symbolic model by progressively carving out a mixture of interpretable models and a residual network. Our design identifies a subset of samples and routes them through the interpretable models. The remaining samples are routed through a flexible residual network. We adopt First Order Logic (FOL) as the interpretable models' backbone, which provides basic reasoning on concepts retrieved from the BlackBox model. A FOL formula is a logical function that accepts predicates (concept presence/absence) as input and returns a True/False output that is a logical expression of the predicates. The logical expression, built from AND, OR, negation, and parentheses, can be written in the so-called Disjunctive Normal Form (DNF): a disjunction (OR) of conjunctions (AND), also known as a "sum of products." On the residual network, we repeat the method until the proportion of data explained by the residual network falls below a desired threshold.

The experimental results across various computer vision and medical imaging datasets reveal that our method accounts for the diversity of the explanation space and has minimal impact on the BlackBox's performance. Additionally, we use our method's explanations to detect shortcuts in computer vision and successfully eliminate the bias from the BlackBox's representation.

Post hoc explanations

Simonyan et al. (2013); Selvaraju et al. (2017); Smilkov et al. (2017) discuss post hoc explanation methods that use saliency maps to explain a convolutional neural network. These methods aim to highlight the pixels in the input image that contributed to the network's prediction. Adebayo et al. (2018); Kindermans et al. (2019) demonstrate that saliency maps highlight the correct regions in the image even when the backbone's representation is arbitrarily perturbed. Additionally, in LIME Ribeiro et al. (2016), given a superpixel, a surrogate linear function attempts to mimic the prediction of the BlackBox in the neighborhood of that superpixel. SHAP Lundberg & Lee (2017) utilizes a prominent game-theoretic strategy, Shapley values, to estimate a feature's importance in the BlackBox's final prediction by considering all permutations of adding and removing that feature. As these explanations are expressed in terms of pixel intensities, they do not correspond to the high-level, human-understandable attributes (concepts). In this paper, we aim to provide a post hoc explanation of the BlackBox in terms of interpretable concepts rather than pixel intensities.

Interpretable models

In this class of methods, researchers design inherently interpretable models to eliminate the need for post hoc explanations. In the literature, we find interpretable models based on Generalized Additive Models (GAM) Hastie & Tibshirani (1987) or on logic formulas, as in Decision Trees Breiman et al. (1984) or Bayesian Rule Lists (BRL) Letham et al. (2015). However, most of these methods work well on categorical datasets rather than continuous data such as images. Additionally, Chen et al. (2019); Nauta et al. (2021) introduce a "case-based reasoning" technique that first dissects an image into prototypical parts and then classifies it by combining evidence from the pre-defined prototypes. This method is highly sensitive to the choice of prototypes.
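To make the FOL backbone described in the introduction concrete, the sketch below evaluates a DNF formula over binary concept predicates. The concepts and the rule itself are hypothetical stand-ins, not rules learned by the actual model.

```python
# Hypothetical illustration: a DNF (disjunction of conjunctions) rule over
# binary concept predicates, the "sum of products" form used by the
# FOL-based interpretable experts. Concepts and rule are made up.

def dnf_rule(concepts: dict) -> bool:
    """Toy rule for class 'zebra': (stripes AND hooves) OR (stripes AND NOT urban)."""
    c = concepts
    return (c["stripes"] and c["hooves"]) or (c["stripes"] and not c["urban"])

sample = {"stripes": True, "hooves": False, "urban": False}
print(dnf_rule(sample))  # True: the second conjunction fires
```

Because the formula is a plain disjunction of conjunctions, each positive prediction can be traced to the specific conjunction that fired, which is what enables the rule editing used to remove shortcuts.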

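The route-interpret-repeat procedure described in the abstract and introduction can be sketched as an iterative carving loop. Everything below is a hypothetical stand-in for brevity: the "samples" are 1-D numbers, and the selector is a simple threshold rather than a learned router.

```python
# Minimal sketch of route-interpret-repeat, with hypothetical stand-ins:
# a selector routes a subset of samples to an interpretable FOL expert,
# the rest continue to a residual; repeat until the residual covers
# less than a desired fraction of the data.
import random

random.seed(0)
data = [random.random() for _ in range(1000)]  # stand-in samples

def fit_selector(samples, frac):
    """Hypothetical selector: accept the `frac` 'easiest' (lowest-valued) samples."""
    cutoff = sorted(samples)[int(len(samples) * frac) - 1]
    return lambda x: x <= cutoff

experts, remaining, threshold = [], data, 0.1
while len(remaining) / len(data) > threshold:       # repeat
    selector = fit_selector(remaining, frac=0.5)    # route a coverable subset
    experts.append((selector, f"expert_{len(experts)}"))  # interpret that subset
    remaining = [x for x in remaining if not selector(x)]  # residual carries on

print(len(experts), len(remaining) / len(data))  # → 4 0.063
```

Each iteration halves the unexplained pool here, so four experts suffice to drive the residual below 10% of the data; in the paper's setting the per-iteration coverage is decided by the learned selectors rather than a fixed fraction.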

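To illustrate the permutation view of Shapley values mentioned under Post hoc explanations, the toy sketch below computes exact Shapley values for a tiny additive model by brute-force enumeration of feature orderings; it is a didactic illustration, not the SHAP library, and the model and baseline are hypothetical.

```python
# Brute-force Shapley values: a feature's importance is its average marginal
# contribution over all orders in which features are added to a baseline.
# The toy additive "BlackBox" and baseline input are hypothetical.
from itertools import permutations

def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]

baseline = [0.0, 0.0, 0.0]   # reference input: all features "removed"
x = [1.0, 1.0, 1.0]          # input to explain
n = len(x)

phi = [0.0] * n
for order in permutations(range(n)):
    current = list(baseline)
    prev = model(current)
    for i in order:                      # add features one by one
        current[i] = x[i]
        phi[i] += model(current) - prev  # marginal contribution of feature i
        prev = model(current)
phi = [p / 6 for p in phi]               # average over 3! = 6 orderings

print(phi)  # → [2.0, 1.0, 0.0]: attributions recover the additive weights
```

For an additive model the attributions coincide with the per-feature contributions, which is a standard sanity check; SHAP's contribution is estimating these averages efficiently for real BlackBoxes.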