ROUTE, INTERPRET, REPEAT: BLURRING THE LINE BETWEEN POST HOC EXPLAINABILITY AND INTERPRETABLE MODELS

Abstract

The current approach to ML model design is either to choose a flexible Blackbox model and explain it post hoc or to start with an interpretable model. Blackbox models are flexible but difficult to explain, whereas interpretable models are designed to be explainable. However, developing interpretable models requires extensive ML knowledge, and the resulting models tend to be less flexible, offering potentially subpar performance compared to their Blackbox equivalents. This paper aims to blur the distinction between post hoc explanation of a Blackbox and the construction of interpretable models. We propose beginning with a flexible Blackbox model and gradually carving out a mixture of interpretable models and a residual network. Our design identifies a subset of samples and routes them through the interpretable models; the remaining samples are routed through a flexible residual network. We adopt First Order Logic (FOL) as the interpretable models' backbone, which provides basic reasoning over concepts retrieved from the Blackbox model. On the residual network, we repeat the method until the proportion of data explained by the residual network falls below a desired threshold. Our approach offers several advantages. First, the mixture of interpretable models and a flexible residual network results in almost no compromise in performance. Second, the route, interpret, and repeat approach yields a highly flexible interpretable model. Our extensive experiments demonstrate the performance of the model on various datasets. We show that by editing the FOL model, we can fix shortcuts learned by the original Blackbox model. Finally, our method provides a framework for a hybrid symbolic-connectionist network that is simple to train and adaptable to many applications.
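The carving procedure described above can be illustrated with a minimal sketch. The names here (`route_interpret_repeat`, `can_explain`) are hypothetical and stand in for the paper's actual components: `can_explain(k, x)` plays the role of the k-th selector deciding whether sample `x` is covered by the k-th FOL-based interpretable model, and the loop repeats on the residual until its share of the data drops below a threshold.

```python
# Illustrative sketch of the route-interpret-repeat loop; not the authors'
# actual implementation or API.

def route_interpret_repeat(samples, can_explain, max_iters=5, threshold=0.1):
    """Iteratively carve interpretable experts out of a Blackbox.

    samples: list of inputs.
    can_explain: callable(k, x) -> bool, True if the k-th interpretable
        model (e.g., a FOL rule over extracted concepts) covers sample x.
    Returns (coverage, residual): coverage[k] holds the samples routed to
    expert k; residual holds the samples left to the residual network.
    """
    residual = list(samples)
    coverage = []
    for k in range(max_iters):
        # Route: split the current residual into covered and uncovered.
        covered = [x for x in residual if can_explain(k, x)]
        residual = [x for x in residual if not can_explain(k, x)]
        coverage.append(covered)
        # Repeat until the residual's share of the data is small enough.
        if len(residual) / len(samples) < threshold:
            break
    return coverage, residual
```

For instance, with a toy rule that covers a disjoint fifth of the data at each iteration, `route_interpret_repeat(list(range(20)), lambda k, x: x // 5 == k)` carves out four experts of five samples each and leaves an empty residual.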

1. INTRODUCTION

Model explainability is essential in high-stakes applications of AI, such as healthcare. While Blackbox models (e.g., deep learning) offer flexibility and modular design, post hoc explanation is prone to confirmation bias Wan et al. (2022), lack of fidelity to the original model Adebayo et al. (2018), and insufficient mechanistic explanation of the decision-making process Rudin (2019). Interpretable-by-design models do not suffer from these issues but tend to be less flexible than Blackbox models and demand substantial expertise to design and fine-tune. Using a post hoc explanation or adopting an interpretable model is a mutually exclusive decision made at the initial phase of AI model design. This paper aims to blur the line in that dichotomous model design.

The literature on post hoc explainable AI is extensive. Methods such as model attribution (e.g., Saliency Maps Simonyan et al. (2013); Selvaraju et al. (2017)), counterfactual approaches Abid et al. (2021); Singla et al. (2019), and distillation methods Alharbi et al. (2021); Cheng et al. (2020) are examples of post hoc explainability approaches. These methods either identify the input features that contribute most to the network's output Shrikumar et al. (2016), generate perturbations of the input that flip the network's output Samek et al. (2016); Montavon et al. (2018), or estimate simpler functions that locally approximate the network output. The advantage of post hoc explainability methods is that they do not compromise the flexibility and performance of the Blackbox. However, they suffer from significant drawbacks, such as a lack of fidelity and mechanistic explanation of the network output Rudin (2019).

