AN ADDITIVE INSTANCE-WISE APPROACH TO MULTI-CLASS MODEL INTERPRETATION

Abstract

Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpreting methods focus on identifying explanatory input features, which generally fall into two main categories: attribution and selection. A popular attribution-based approach is to exploit local neighborhoods for learning instance-specific explainers in an additive manner. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework, and are thus capable of leveraging global information from other inputs. However, they can only interpret single-class predictions, and many suffer from inconsistency across different settings due to a strict reliance on a pre-defined number of selected features. This work exploits the strengths of both approaches and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness with more compact and comprehensible explanations. We also demonstrate the capacity to select stable and important features through extensive experiments on various datasets and black-box model architectures.

1. INTRODUCTION

Black-box machine learning systems enjoy remarkable predictive performance at the cost of interpretability. This trade-off has motivated a number of interpreting approaches for explaining the behavior of these complex models. Such explanations are particularly useful for high-stakes applications such as healthcare (Caruana et al., 2015; Rich, 2016), cybersecurity (Nguyen et al., 2021) or criminal investigation (Lipton, 2018). While model interpretation can be done in various ways (Mothilal et al., 2020; Bodria et al., 2021), our discussion focuses on the feature-importance or saliency-based approach; that is, assigning relative importance weights to individual features w.r.t. the model's prediction on an input example. Features here refer to input components interpretable to humans; for high-dimensional data such as texts or images, features can be a bag of words/phrases or a group of pixels/super-pixels (Ribeiro et al., 2016). Explanations are generally made by selecting the top K features with the highest weights, signifying the K features most important to the black-box's decision. Note that this work tackles feature selection locally for an input data point, instead of generating global explanations for an entire dataset.

An abundance of interpreting works follows the removal-based explanation approach (Covert et al., 2021), which quantifies the importance of features by removing them from the model. Based on how feature influence is summarized into an explanation, methods in this line of work can be broadly categorized as feature attribution and feature selection. In general, attribution methods produce relative importance scores for each feature, whereas selection methods directly identify the subset of features most relevant to the model behavior being explained. One popular approach to learning attributions is through an additive model (Ribeiro et al., 2016; Zafar & Khan, 2019; Zhao et al., 2021).
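To make the removal-based idea concrete, the following is a minimal sketch that scores each feature by how much the black-box output drops when that feature is removed (here simply zeroed out), then selects the top K features. The callable `predict_fn` and the zero baseline are illustrative assumptions for this sketch, not the procedure of any specific method cited above.

```python
import numpy as np

def removal_attribution(x, predict_fn, baseline=0.0):
    """Score each feature by the drop in prediction when it is removed.

    x          : 1-D input feature vector
    predict_fn : callable mapping a batch of inputs to scalar scores
                 (a hypothetical stand-in for the black-box model)
    baseline   : value used to 'remove' a feature; other removal
                 techniques substitute noise or marginalize instead
    """
    full_score = predict_fn(x[None, :])[0]
    # One perturbed copy per feature, with that feature set to the baseline
    perturbed = np.tile(x, (x.size, 1))
    np.fill_diagonal(perturbed, baseline)
    # Importance = how much the score falls without the feature
    return full_score - predict_fn(perturbed)

def top_k(attributions, k):
    """Indices of the K features with the highest importance weights."""
    return np.argsort(attributions)[::-1][:k]
```

On a toy linear model `X @ [2, -1, 0]`, removing feature 0 from an all-ones input drops the score by 2, so feature 0 ranks first; this mirrors how removal-based explanations summarize feature influence.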
The underlying principle was originally proposed by LIME (Ribeiro et al., 2016), which learns a regularized linear model for each input example wherein each coefficient represents a feature importance score. The LIME explainer takes the form of a linear model w.z, where z denotes neighboring examples sampled heuristically around the input [1]. Though highly interpretable themselves, additive methods are inefficient since they optimize an individual explainer for every input. As opposed to the instance-specific nature of additive models, most feature selection methods are developed instance-wisely (Chen et al., 2018; Bang et al., 2021; Yoon et al., 2019; Jethani et al., 2021a). Instance-wise frameworks entail global training of a model approximating the local distributions over subsets of input features. Post-hoc explanations can thus be obtained simultaneously for multiple instances.

Contributions. In this work, we propose a novel strategy integrating both approaches into an additive instance-wise framework that simultaneously tackles all issues discussed above. The framework consists of two main components: an explainer and a feature selector. The explainer first learns the local attributions of features across the space of the response variable via a multi-class explanation module denoted as W. This module interacts with the input vector in an additive manner, forming a linear classifier that locally approximates the black-box decision. To support the learning of local explanations, the feature selector constructs local distributions that can generate high-quality neighboring samples on which the explainer can be trained effectively. Both components are jointly optimized via backpropagation. Unlike such works as (Chen et al., 2018; Bang et al., 2021) that are sensitive to the choice of K as a hyper-parameter, our learning process eliminates this reliance (see Appendix G for a detailed analysis of why this is necessary).
Our contributions are summarized as follows:
• We introduce AIM, an Additive Instance-wise approach to Multi-class model interpretation. Our model explainer inherits merits from both families of methods, model-agnosticism and flexibility, while supporting efficient interpretation for multiple decision classes. To the best of our knowledge, we are the first to integrate additive and instance-wise approaches into an end-to-end amortized framework that produces such a multi-class explanation facility.
• Our model explainer is shown to produce remarkably faithful explanations of high quality and compactness. Through quantitative and human assessment results, we achieve superior performance over the baselines on different datasets and architectures of the black-box model.
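As background for the additive formulation that this work builds on, the sketch below illustrates a LIME-style local linear surrogate: sample binary neighbor masks z around an input, query the black box on the perturbed inputs, and fit a proximity-weighted linear model whose coefficients serve as attributions. The sampling scheme, proximity kernel, and `black_box` callable are simplifying assumptions for illustration, not LIME's exact implementation.

```python
import numpy as np

def lime_style_explain(x, black_box, n_samples=500, sigma=0.25, seed=0):
    """Fit a local linear surrogate w.z around input x (LIME-style sketch).

    x         : 1-D feature vector to explain
    black_box : callable returning the model's score for a batch of inputs
                (a hypothetical stand-in for the real black-box model)
    Returns one importance weight per feature.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # z: binary vectors marking which features are kept in each neighbor
    z = rng.integers(0, 2, size=(n_samples, d))
    neighbors = z * x                       # absent features are zeroed out
    y = black_box(neighbors)                # black-box scores on neighbors
    # Proximity kernel: neighbors keeping more of x receive larger weight
    weights = np.exp(-((1 - z.mean(axis=1)) ** 2) / sigma)
    # Weighted least squares: solve y ~ z @ w; w holds feature attributions
    sw = np.sqrt(weights)[:, None]
    w, *_ = np.linalg.lstsq(z * sw, y * sw.ravel(), rcond=None)
    return w
```

Because the surrogate is linear in the binary indicators z, the prediction decomposes into a sum of per-feature weights, which is exactly the additivity property described above; note, too, that the explainer must be re-fit for every input, the inefficiency that motivates the instance-wise direction taken here.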

2. RELATED WORK

Early interpreting methods are gradient-based, in which gradient values are used to estimate attribution scores that quantify how much a change in an input feature affects the black-box's prediction in infinitesimal regions around the input. This approach originally involves back-propagation for calculating the gradients of the output neuron w.r.t. the input features (Simonyan et al., 2014). This early approach, however, suffers from vanishing gradients during the backward pass through ReLU layers, which can downgrade important features. Several methods have been proposed to improve the propagation rule (Bach et al., 2015; Springenberg et al., 2014; Shrikumar et al., 2017; Sundararajan et al., 2017). Since explanations based on raw gradients tend to be noisy, highlighting meaningless variations, a refined approach is the sampling-based gradient, in which sampling is done according to a prior distribution for computing the gradients of the probability (Baehrens et al., 2010) or expectation function (Smilkov et al., 2017; Adebayo et al., 2018). Functional Information (FI) (Gat et al., 2022) is the state of the art in this line of research, applying functional entropy to compute feature attributions. FI is shown to work on auditory, visual and textual modalities, whereas most of the previous gradient-based methods are solely applicable to images.

A burgeoning body of works in recent years can be broadly categorized as removal-based explanation (Covert et al., 2021). Common removal techniques include replacing feature values with neutral or user-defined values such as zeros or Gaussian noise (Zeiler & Fergus, 2014; Dabkowski & Gal, 2017; Fong & Vedaldi, 2017; Petsiuk et al., 2018; Fong et al., 2019), marginalization of distributions over input features (Lundberg & Lee, 2017; Covert et al., 2020; Datta et al., 2016), or substituting held-out feature values with samples from the same distribution (Ribeiro et al., 2016). The output explanations are often either attribution-based or selection-based. In addition to additive models

[1] z is a binary representation vector of an input indicating the presence/absence of features. The dot-product operation is equivalent to summing up the feature weights given by the weight vector w, giving rise to additivity.

