AN ADDITIVE INSTANCE-WISE APPROACH TO MULTI-CLASS MODEL INTERPRETATION

Abstract

Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpreting methods focus on identifying explanatory input features, which generally fall into two main categories: attribution and selection. A popular attribution-based approach is to exploit local neighborhoods for learning instance-specific explainers in an additive manner. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework, thereby being capable of leveraging global information from other inputs. However, they can only interpret single-class predictions and many suffer from inconsistency across different settings, due to a strict reliance on a pre-defined number of features selected. This work exploits the strengths of both methods and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness with more compact and comprehensible explanations. We also demonstrate the capacity to select stable and important features through extensive experiments on various data sets and black-box model architectures.

1. INTRODUCTION

Black-box machine learning systems enjoy a remarkable predictive performance at the cost of interpretability. This trade-off has motivated a number of interpreting approaches for explaining the behavior of these complex models. Such explanations are particularly useful for high-stakes applications such as healthcare (Caruana et al., 2015; Rich, 2016), cybersecurity (Nguyen et al., 2021) or criminal investigation (Lipton, 2018). While model interpretation can be done in various ways (Mothilal et al., 2020; Bodria et al., 2021), our discussion will focus on the feature importance or saliency-based approach, that is, assigning relative importance weights to individual features w.r.t. the model's prediction on an input example. Features here refer to input components interpretable to humans; for high-dimensional data such as texts or images, features can be a bag of words/phrases or a group of pixels/super-pixels (Ribeiro et al., 2016). Explanations are generally made by selecting the top K features with the highest weights, signifying the K most important features to a black-box's decision. Note that this work tackles feature selection locally for an input data point, instead of generating global explanations for an entire dataset.

An abundance of interpreting works follows the removal-based explanation approach (Covert et al., 2021), which quantifies the importance of features by removing them from the model. Based on how feature influence is summarized into an explanation, methods in this line of work can be broadly categorized as feature attribution and feature selection. In general, attribution methods assign relative importance scores to each feature, whereas selection methods directly identify the subset of features most relevant to the model behavior being explained. One popular approach to learn attribution is through an additive model (Ribeiro et al., 2016; Zafar & Khan, 2019; Zhao et al., 2021).
The underlying principle is originally proposed by LIME (Ribeiro et al., 2016), which learns a regularized linear model for each input example wherein each coefficient represents a feature importance score. The LIME explainer takes the form of a linear model $w \cdot z$, where $z$ denotes neighboring examples sampled heuristically around the input. Though highly interpretable themselves, additive methods are inefficient since they optimize individual explainers for every input. As opposed to the instance-specific nature of the additive model, most feature selection methods are developed in an instance-wise manner (Chen et al., 2018; Bang et al., 2021; Yoon et al., 2019; Jethani et al., 2021a). Instance-wise frameworks entail global training of a model approximating the local distributions over subsets of input features. Post-hoc explanations can thus be obtained simultaneously for multiple instances.

Contributions. In this work, we propose a novel strategy integrating both approaches into an additive instance-wise framework that simultaneously tackles all issues discussed above. The framework consists of two main components: an explainer and a feature selector. The explainer first learns the local attributions of features across the space of the response variable via a multi-class explanation module denoted as $W$. This module interacts with the input vector in an additive manner, forming a linear classifier that locally approximates the black-box decision. To support the learning of local explanations, the feature selector constructs local distributions that can generate high-quality neighboring samples on which the explainer can be trained effectively. Both components are jointly optimized via backpropagation. Unlike works such as Chen et al. (2018) and Bang et al. (2021), which are sensitive to the choice of K as a hyper-parameter, our learning process eliminates this reliance (see Appendix G for a detailed analysis of why this is necessary).
Our contributions are summarized as follows:
• We introduce AIM, an Additive Instance-wise approach to Multi-class model interpretation. Our model explainer inherits merits from both families of methods: model-agnosticism, flexibility while supporting efficient interpretation for multiple decision classes. To the best of our knowledge, we are the first to integrate additive and instance-wise approaches into an end-to-end amortized framework that produces such a multi-class explanation facility.
• Our model explainer is shown to produce remarkably faithful explanations of high quality and compactness. Through quantitative and human assessment results, we achieve superior performance over the baselines on different datasets and architectures of the black-box model.

2. RELATED WORK

Early interpreting methods are gradient-based, in which gradient values are used to estimate attribution scores that quantify how much a change in an input feature affects the black-box's prediction in infinitesimal regions around the input. This approach originally involves back-propagation for calculating the gradients of the output neuron w.r.t. the input features (Simonyan et al., 2014). It however suffers from vanishing gradients during the backward pass through ReLU layers, which can downgrade important features. Several methods are proposed to improve the propagation rule (Bach et al., 2015; Springenberg et al., 2014; Shrikumar et al., 2017; Sundararajan et al., 2017). Since explanations based on raw gradients tend to be noisy, highlighting meaningless variations, a refined approach is sampling-based gradient, in which sampling is done according to a prior distribution for computing the gradients of a probability (Baehrens et al., 2010) or expectation function (Smilkov et al., 2017; Adebayo et al., 2018). Functional Information (FI) (Gat et al., 2022) is the state-of-the-art in this line of research, applying functional entropy to compute feature attributions. FI is shown to work on auditory, visual and textual modalities, whereas most of the previous gradient-based methods are solely applicable to images.

A burgeoning body of works in recent years can be broadly categorized as removal-based explanation (Covert et al., 2021). Common removal techniques include replacing feature values with neutral or user-defined values such as zero or Gaussian noise (Zeiler & Fergus, 2014; Dabkowski & Gal, 2017; Fong & Vedaldi, 2017; Petsiuk et al., 2018; Fong et al., 2019), marginalization of distributions over input features (Lundberg & Lee, 2017; Covert et al., 2020; Datta et al., 2016), or substituting held-out feature values with samples from the same distribution (Ribeiro et al., 2016).
The output explanations are often either attribution-based or selection-based. In addition to additive models that estimate feature importance via the coefficients of the linear model, feature attributions can be calculated using Shapley values (Datta et al., 2016; Lundberg & Lee, 2017; Covert et al., 2020) or directly obtained by measuring the changes in the predictive probabilities or prediction losses when adding or excluding certain features (Zeiler & Fergus, 2014; Schwab & Karlen, 2019). On the other hand, selection-based works straightforwardly determine which subset of features is important or unimportant to the model behavior under analysis (Chen et al., 2018; Yoon et al., 2019; Bang et al., 2021; Jethani et al., 2021a; Nguyen et al., 2021; 2022). Explanations are made by either selecting features with the highest logit scores obtained from the learned feature distribution, or specifying a threshold between 0 and 1 to decide on the most probable features. Most selection methods adopt amortized optimization, thus post-hoc inference of features for multiple inputs can be done very efficiently. In contrast, attribution-based approaches are mostly less efficient since they process input examples individually. There have been methods focusing on improving the computational cost of these models (Dabkowski & Gal, 2017; Schwab & Karlen, 2019; Jethani et al., 2021b). Recently, there is an emerging interest in integrating the instance-wise property into an additive framework to better exploit global information.

For a given input, while earlier methods generate a single $d$-dimensional weight vector $w_x$ assigning importance weights to each feature, we define an explainer $E : \mathbb{R}^d \to \mathbb{R}^{d \times C}$ mapping an input $x$ to a weight matrix $W_x \in \mathbb{R}^{d \times C}$, with the entry $W_x^{i,j}$ representing the relative weight of the $i$th feature of $x$ to the predicted label $j \in \{1, \ldots, C\}$: $E(x) = W_x$.
Given a training batch, LIME (Ribeiro et al., 2016) in particular trains separate explainers for every input, thus cannot take advantage of the global information from the entire dataset. In line with the instance-wise motivation, our explainer E is trained globally over all training examples to produce local explanations with respect to individual inputs simultaneously, which also seeks to effectively enable global behavior (e.g., two similar instances should have similar explanations). As E is expected to be locally faithful to the black-box model, we optimize E on the local neighborhood around the input x. This region is constructed via a feature selection module. We now explain how this is done.

3.2. TRAINING OBJECTIVES

Let $z \in \{0, 1\}^d$ be a random variable with the entry $z^i = 1$ indicating that the $i$th feature is important to the black-box's predictions. With respect to $x$, we employ a selector $S : \mathbb{R}^d \to [0, 1]^d$ that outputs $S(x) = \pi_x$ such that $\pi_x^i := P(z^i = 1 \mid X = x)$, $i = 1, \ldots, d$. Through the probability vector $\pi_x$, the selector helps define a local distribution on a local space of samples $z_x \odot x$ with $z_x \sim \text{MultiBernoulli}(\pi_x)$ and element-wise product $\odot$. The selector $S$ is also a learnable module, and we want it to generate well-behaved local samples that focus more on valuable features/attributions of $x$. Intuitively, if the feature $i$ of $x$ contributes more to the predictions of the black-box model, i.e., $\pi_x^i \approx 1$, the explainer is expected to give higher assignments to the row vector $W_x^{i,:}$. To mimic how the black-box model behaves towards different attributions, we propose to minimize the cross-entropy loss between the prediction of the black-box model on local examples $z_x \odot x$ and the prediction of the explainer on binary vectors $z_x$ via the weight matrix $W_x$ as

$$L_1 = \mathbb{E}_x \mathbb{E}_{z_x} \left[ \mathrm{CE}\left( \tilde{y}_m, \mathrm{softmax}(W_x^T z_x) \right) \right], \tag{2}$$

where CE is the cross-entropy function and $\tilde{y}_m = \arg\max_c P_m(Y = c \mid z_x \odot x)$. To make the process continuous and differentiable for training, the temperature-dependent Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) is applied for relaxing the Bernoulli variables $z_x^i$. In particular, the continuous representation $\tilde{z}_x^i$ is sampled from the Concrete distribution as $\tilde{z}_x^i, 1 - \tilde{z}_x^i \sim \text{Concrete}(\pi_x^i, 1 - \pi_x^i)$:

$$\tilde{z}_x^i = \frac{\exp\{(\log \pi_x^i + G_{i1})/\tau\}}{\exp\{(\log(1 - \pi_x^i) + G_{i0})/\tau\} + \exp\{(\log \pi_x^i + G_{i1})/\tau\}},$$

with temperature $\tau$ and random noises $G_{i0}$ and $G_{i1}$ independently drawn from the Gumbel distribution $G_t = -\log(-\log u_t)$, $u_t \sim \text{Uniform}(0, 1)$.
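The relaxation above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `sample_gumbel` and `relaxed_bernoulli` and the probability vector `pi_x` are hypothetical, and the only ingredients taken from the text are the Gumbel noise definition, the temperature, and the two-term softmax.

```python
import math
import random

def sample_gumbel(rng):
    """Draw G = -log(-log u) with u ~ Uniform(0, 1)."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def relaxed_bernoulli(pi, tau, rng):
    """Relaxed sample in [0, 1] for a Bernoulli variable with P(z = 1) = pi.

    Assumes 0 < pi < 1 so that both logarithms are defined.
    """
    g1, g0 = sample_gumbel(rng), sample_gumbel(rng)
    a1 = (math.log(pi) + g1) / tau
    a0 = (math.log(1.0 - pi) + g0) / tau
    m = max(a0, a1)  # subtract the max for numerical stability
    e1, e0 = math.exp(a1 - m), math.exp(a0 - m)
    return e1 / (e0 + e1)

rng = random.Random(0)
pi_x = [0.9, 0.5, 0.1]  # hypothetical selector output for d = 3 features
z_tilde = [relaxed_bernoulli(p, tau=0.2, rng=rng) for p in pi_x]
```

At a low temperature such as 0.2 the relaxed samples concentrate near 0 or 1, so they behave almost like binary masks while remaining differentiable with respect to the selector output.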

Given the corresponding prediction $\tilde{y}_m = \arg\max_c P_m(Y = c \mid \tilde{z}_x \odot x)$, $L_1$ now becomes

$$L_1 = \mathbb{E}_x \mathbb{E}_{\tilde{z}_x} \left[ \mathrm{CE}\left( \tilde{y}_m, \mathrm{softmax}(W_x^T \tilde{z}_x) \right) \right].$$

Since $z_x$ is a binary vector indicating the absence/presence of features, $z_x \odot x$ indeed acts as a local perturbation, which generally concurs with the principle of the LIME model. However, different from LIME, we amortize the explainer $E$ to produce the weight matrices $W_x$ locally approximating the black-box model with linear classifiers operating on input neighborhoods. Furthermore, we replace LIME's uniform sampling strategy with a learnable local distribution offered by the selector $S$. We argue that heuristic sampling is inadequate for our purpose. As $d$ gets large, realizing the space of $2^d$ possible binary patterns is infeasible. Given the fact that the number of binary patterns that actually approximate the original prediction is arbitrarily small, it is thus very difficult for such a simple linear separator as the one used in LIME to learn useful patterns within finite sampling rounds. While diversity in these samples is desirable for learning attributions for individual decision classes, we also want the explainer $E$ to focus more on features relevant to the original prediction $y_m$. To encourage the selector to yield more of the samples that contain the features that best approximate the model behavior on the original input, we propose the following information-theoretic approach. Let $x_S$ denote the sub-vector formed by the subset of $K$ most important features $S = \{i_1, \ldots, i_K\} \subset \{1, \ldots, d\}$ ($i_1 < i_2 < \cdots < i_K$). Thus, $\pi_x^i$ can now be viewed as the probability that the $i$th feature of $x$ appears in $S$. Given a random vector $X_S \in \mathbb{R}^K$, we maximize the mutual information.

$$I(X_S; Y) = \mathbb{E}\left[ \log \frac{P_m(Y \mid X_S)}{P_m(Y)} \right] = \mathbb{E}_X \mathbb{E}_{S \mid X} \mathbb{E}_{Y \mid X_S} \left[ \log P_m(Y \mid X_S) \right] + \text{Constant}. \tag{3}$$

Based on the following inequality, we can obtain a variational lower bound for $I(X_S; Y)$ via a generic choice of conditional distribution $Q_S(Y \mid X_S)$:

$$\mathbb{E}_{Y \mid X_S}[\log P_m(Y \mid X_S)] = \mathbb{E}_{Y \mid X_S}[\log Q_S(Y \mid X_S)] + \mathrm{KL}\left( P_m(Y \mid X_S), Q_S(Y \mid X_S) \right) \geq \mathbb{E}_{Y \mid X_S}[\log Q_S(Y \mid X_S)],$$

where KL represents the Kullback-Leibler divergence. It is worth noting that the purpose of using the mutual information in L2X (Chen et al., 2018) [...] Maximizing the mutual information in Eq. (3) can therefore be relaxed to maximizing the variational lower bound $\mathbb{E}_X \mathbb{E}_{S \mid X} \mathbb{E}_{Y \mid X_S}[\log Q_S(Y \mid X_S)]$. We parametrize $Q$ with a function approximator $G$ such that $Q_S(x_S) := G(x_S)$. Notice that we can now use the element-wise product $\tilde{z}_x \odot x$ to approximate $x_S$. If $x$ contains discrete features (e.g., words), we embed a feature (e.g., a selected word) in $S$ with a learnable embedding vector, wherein a feature not in $S$ is replaced with a zero vector. With the prediction $y_m = \arg\max_c P_m(Y = c \mid X = x)$, our second objective is given as

$$L_2 = \mathbb{E}_x \mathbb{E}_{\tilde{z}_x} \left[ \mathrm{CE}\left( y_m, G(\tilde{z}_x \odot x) \right) \right]. \tag{4}$$

The final objective. We parametrize $E$, $S$ and $G$ with three neural networks of appropriate capacity. All networks $E$, $S$ and $G$ are jointly optimized over total parameters $\theta$ and globally on the training set. We further introduce a regularization term over $W$ to encourage sparsity and accordingly compact explanations. The final objective function is now given as

$$\min_\theta \; L_1 + \alpha L_2 + \beta \, \mathbb{E}_x \left[ \|W_x\|_{2,1} \right], \tag{5}$$

where $\|\cdot\|_{2,1}$ is the $\ell_{2,1}$ group norm, and $\alpha$, $\beta$ are balancing coefficients on the loss terms. $\alpha$ and $\beta$ are subject to tuning since a highly compressed representation can cause information loss and harm faithfulness. Figure 1 summarizes our framework.

Figure 1: An illustration of the AIM pipeline.
Left: Given an input $x$, the explainer $E$ produces a local multi-class explanation module $W_x$ in which each entry $W_x^{i,j}$ represents the relative weight of the $i$th feature of $x$ to the predicted label $j \in \{1, \ldots, C\}$. $E$ is optimized on a local space of perturbations around $x$. Such a space is constructed via the feature selector $S$ that is simultaneously optimized to generate a high-quality local distribution containing well-behaved neighboring samples. The binary sample $z_x \sim \text{MultiBernoulli}(\pi_x)$ is passed through a Gumbel-Softmax sampler for relaxation. We end up with the explanation matrix $W_x$ and relaxed samples $\tilde{z}_x$. Right: The figure illustrates how these output components interact with each other and the input $x$ to form the first and second loss objectives given in Eq. (2) and (4). The final objective in Eq. (5) combines $L_1$ and $L_2$ with an additional sparsity term to induce compactness. CE is the cross-entropy function. $\odot$, $\cdot$, and $\sigma$ denote the element-wise product, inner product and softmax operation respectively.
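For a single input and a single relaxed sample, the final objective can be evaluated as in the rough sketch below. Several things here are assumptions rather than details from the paper: the helper names are hypothetical, the group norm is taken to sum the Euclidean norms of the rows of W (one group per feature), the expectations are replaced by one sample, and the output of the network G is passed in as a ready-made probability vector `g_probs`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, probs):
    """Cross-entropy between a hard label and a predicted distribution."""
    return -math.log(probs[label])

def explainer_logits(W, z):
    """Compute W^T z for a d x C matrix W (list of rows) and a length-d vector z."""
    num_classes = len(W[0])
    return [sum(W[i][c] * z[i] for i in range(len(W))) for c in range(num_classes)]

def group_norm_21(W):
    """Sum over features (rows) of the L2 norm of each row (assumed grouping)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def aim_objective(W, z, y_tilde, g_probs, y_m, alpha, beta):
    l1 = cross_entropy(y_tilde, softmax(explainer_logits(W, z)))  # explainer term
    l2 = cross_entropy(y_m, g_probs)                              # selector term
    return l1 + alpha * l2 + beta * group_norm_21(W)

W = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]  # d = 3 features, C = 2 classes
loss = aim_objective(W, z=[1.0, 0.0, 1.0], y_tilde=0,
                     g_probs=[0.5, 0.5], y_m=0, alpha=1.0, beta=0.1)
```

A row of W that is entirely zero contributes nothing to the penalty, which is how the regularizer pushes whole features out of the explanation rather than individual entries.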

3.3. INFERENCE

A standard inference strategy is to choose the top K features with the highest weights, with K determined in advance. In our framework, the explainer outputs a weight matrix $W_x$ of size $d \times C$ (recall that $d$ is the number of features and $C$ is the number of target classes). We obtain the black-box's predicted label $j = y_m = \arg\max_c P_m(Y = c \mid X = x)$ and select the corresponding column $W_x^{:,j}$ as the weight vector. Features can then be derived accordingly. Though it is intuitive to use $\pi_x$ directly for the explanation, doing this may require specifying a certain threshold $\theta \in [0, 1]$. Since $\pi_x$ represents local distributions, choosing the thresholds individually for each input is daunting, while setting a global threshold for all inputs is sub-optimal. Moreover, the selection of the $i$th feature using $\pi_x$ (i.e., $\pi_x^i \geq \theta$) is independent for each feature, so when combined, the selections do not guarantee that the resulting subsets of features can well approximate the black-box's decisions. On the other hand, the explainer looks at all input features at once to settle on good subsets of features. Appendix D provides evidence that inference according to $W_x$ is the optimal strategy.
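The column-selection step can be sketched as follows. The function name `top_k_features`, the toy tokens and the weight values are all made up for illustration; only the procedure (pick the column of the predicted class, keep the K largest entries) comes from the text.

```python
def top_k_features(W, class_index, k):
    """Indices of the k largest entries in column `class_index` of W."""
    column = [row[class_index] for row in W]
    ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    return ranked[:k]

# Hypothetical 6-token input with a 6 x 2 explanation matrix W_x.
tokens = ["the", "plot", "was", "awful", "and", "boring"]
W_x = [[0.0, 0.0], [0.1, 0.3], [0.0, 0.0], [2.1, 0.0], [0.0, 0.1], [1.4, 0.2]]

y_m = 0  # assumed black-box prediction for this input
selected = [tokens[i] for i in top_k_features(W_x, y_m, k=2)]
# selected == ["awful", "boring"]
```

Because the ranking is taken within one column, the same matrix can be reused for another class simply by changing `class_index`.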

4. EXPERIMENTS

We conducted experiments on various machine learning classification tasks. In the main paper, we focus on NLP classifiers since we believe text data is the most challenging modality. In the following, we discuss the experimental design for textual data.
• Sentiment Analysis: The Large Movie Review Dataset IMDB (Maas et al., 2011) consists of 50,000 movie reviews with positive and negative sentiments. The black-box classifier is a bidirectional GRU (Chen et al., 2018) that achieves an 85.4% test accuracy.
• Hate Speech Detection: HateXplain is an annotated dataset of Twitter and Gab posts for hate speech detection (Mathew et al., 2021). The task is to classify a post either to be normal or to contain hate/offensive speech. The black-box model is a bidirectional LSTM (Gers et al., 2000) stacked under a standard Transformer encoder layer (Vaswani et al., 2017) of 4 attention heads. The best test accuracy obtained is 69.6%.
• Topic Classification: AG is a collection of more than 1 million news articles. The AG News corpus (Zhang et al., 2015) is constructed by selecting the 4 largest classes from the original dataset: World, Sports, Business, and Sci/Tech. We train a word-level convolutional neural network (CNN) (LeCun et al., 1995) as a black-box model. It obtains 89.7% accuracy on the test set.
See Appendix A for additional details on our experimental setup and model design. Appendix E further demonstrates the remarkable capability of AIM for generalizing on images and tabular data. Code and data for reproducing our experiments are published at https://github.com/isVy08/AIM/.

4.1. PERFORMANCE METRICS & BASELINE METHODS

The task of a saliency-based explainer is to find the subset of input features S that best mimics the black-box's predictions on the original input. Following the suggestions from Robnik-Šikonja & Bohanec (2018) on desiderata of explanations and related works (Ribeiro et al., 2016; Chen et al., 2018; Schwab & Karlen, 2019; Situ et al., 2021; Gat et al., 2022) , Table 1 presents the metrics for quantitative evaluation of word-level explanations. See Appendix B for implementation details. For text classification tasks, we compare our method against baselines that have done extensive experiments on textual data: L2X (Chen et al., 2018) , LIME (Ribeiro et al., 2016) , VIBI (Bang et al., 2021) and FI (Gat et al., 2022) . Regarding model architectures, note that AIM has been intentionally designed to match those of L2X and VIBI to assure fair comparison. For each method, we tune the remaining hyper-parameters over a wide range of settings and report the results for which Faithfulness is highest (See Appendix H).

4.2. RESULTS

We compare the performance of methods by assessing the extent to which the set of 10 best features satisfies the criteria discussed in Table 1. Except for AIM and FI that do not treat K as a hyperparameter, all the other baselines are optimized at K = 10. Table 2 reports the average results over 5 model initializations. We here show that our method AIM consistently outperforms the baselines on all metrics while achieving a remarkably high level of faithfulness of over 90% across datasets. AIM effectively approximates the black-box predictions with only 10 features, which demonstrates the sufficiency of the selected feature sets. Given the vast combinatorial space of possible subsets of features, we believe the capacity to efficiently search for a sufficient set of features is what makes AIM stand out from the existing works. Examining ∆ log-odds, it is observed that our top 10 features are deemed more important since removing them causes the largest drops in confidence of the black-box model in the original prediction (on the full document). Given an input containing only important features, interestingly there is even a slight increase in confidence when the black-box models make that prediction.

Table 1: Metrics for quantitative evaluation of word-level explanations.
• Metric: Faithfulness. Definition: Degree of agreement between the prediction given the full document and the prediction given the selected words in S. A higher value means the explanations are strongly relevant to the black-box's prediction.
• Property: Brevity. Question: How concise is the explanation? Metric: Brevity. Definition: Number of clusters of duplicates or semantically related words formed over S. A lower value means the tokens are less semantically polarizing and more compact.
• Property: Comprehensibility. Question: How well do humans understand the explanation? Metric: Purity. Definition: Proportion of stopwords and punctuation included in S. A lower proportion is equivalent to a more meaningful feature set.
• Property: Stability. Question: How similar are the explanations for similar instances? Metric: Intersection over Union (IoU). Definition: Proportion of overlapping words in the explanations of two similar documents. We expect the selected features for two such examples to overlap in great quantity.
• Property: Degree of importance. Question: How well does the explanation reflect the importance of features or parts of the explanation? Metric: Positive ∆ log-odds. Definition: Difference in the confidence of the black-box model in a prediction before and after masking important words given in S. A higher value indicates S contains important features.
• Property: Degree of importance. Question: How well does the explanation reflect the importance of features or parts of the explanation? Metric: Negative ∆ log-odds. Definition: Difference in the confidence of the black-box model in a prediction before and after masking unimportant words, i.e., words not in S. A lower value indicates features not contained in S are unimportant.

negative | negative | this may just be the worst movie ever produced. worst plot, worst acting, worst special effects...be prepared if you want to watch this. the only way to get enjoyment out of it is to light a match and burn the tape of it, knowing it will never fall into the hands of any sane person again.

positive | negative | to me, "anatomie" is certainly one of the better movies i have seen. i don't think "anatomie" was primarily intended to be a horror movie but a movie questioning the ethics of science. if you watch it with that in mind, it turns into a really good film. the only annoying bit was the awful voice dubbing for the english version. how can you expect any non-german person to listen to these unbearable german accents for two hours ? let native english speakers do the talking or use subtitles instead!!

negative | positive | i have seen this movie several times, it sure is one of the cheapest action flicks of the eighties. so, i think many viewers would definitely change the channel when they come across this one. but, if you are into great trash, "dragon hunt" is made for you. the main characters (the mcnamara twins) are sporting great moustaches and look so ridiculous in their camouflage dresses. one of the best scenes is when one of then gets shot in the leg and is still kicking his enemies into nirvana. this movie is really awful, but then again, it is a great party tape!

4.3. HUMAN EVALUATION

We additionally conduct a human experiment to evaluate whether the words selected as an explanation convey sufficient information about the original document to human users. We ask 30 university students to infer the sentiments of 50 IMDB movie reviews, given only 10 key words obtained from an explainer for each review. To avoid confusion, only examples where the black-box model predicts correctly are considered (see Appendix F for the setup). We assess whether the sentiment inferred by humans is consistent with the actual label of a movie review: human accuracy. Some reviews are judged as "neutral / can't decide", because the selected key words are neutral, or because positive and negative words are comparable in quantity. We exclude these neutral examples when computing the average accuracy for a participant, but record the proportion of such examples as a proxy measure for purity. The final accuracy is averaged over all participants and reported in Table 4. It is consistent with our quantitative results that explanations from AIM are perceived to be more informative and contain fewer neutral features, thus being more comprehensible to human users.

4.4. MULTI-CLASS EXPLANATION

A novel contribution of our work is the capability of simultaneously explaining multiple decision classes from a single matrix $W_x$. Whereas existing methods often require re-training or re-optimization to explain a different class, our explainer produces class-specific explanations in a single forward pass: given a learned $W_x$, select the column $W_x^{:,j}$ corresponding to the target class $j$ to be explained. To the best of our knowledge, we are the first to propose an explanation module with such a facility. We assess the quality of multi-class explanations via two modifications of Faithfulness and IoU. The former metric, Class-specific Faithfulness, measures whether the black-box prediction on the explanations aligns with the class being interpreted. The latter, Pairwise IoU, evaluates the overlapping ratio of words in the explanations for a pair of decision classes. Table 5 provides the average results for these metrics, in comparison with LIME and FI. AIM performs surprisingly well on binary classification tasks, with the selected feature sets nearly distinctive to each class, i.e., overlapping words account for less than 4%. A Faithfulness of 98.09% on the first class, for example, means that given the explanations, the black-box model predicts label 0 for 98.09% of testing examples. Meanwhile, the performance of LIME and FI seems to be no better than random and sensitive to the distribution of classes in the datasets. However, the task gets more challenging as more classes are involved. Since AG News is a dataset of news articles from 4 topics, it is sometimes difficult to clearly distinguish a text between two classes, which we suspect leads to a higher overlapping ratio, thereby harming faithfulness. Regardless, the success on IMDB and HateXplain demonstrates the potential of supporting counterfactual explanations that seek to determine which features a black-box classifier attends to when predicting a certain class.
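As a small illustration of the Pairwise IoU idea, the sketch below extracts a top-K feature set for every class from one matrix and compares each pair of classes. The names `top_k` and `pairwise_iou` and the toy matrix are hypothetical, and the evaluation protocol used in the paper may differ in details such as token filtering.

```python
from itertools import combinations

def top_k(W, class_index, k):
    """Set of the k feature indices with the largest weight for one class."""
    column = [row[class_index] for row in W]
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    return set(order[:k])

def pairwise_iou(W, k):
    """Top-k sets per class and the IoU of every pair of classes (k >= 1)."""
    num_classes = len(W[0])
    per_class = [top_k(W, c, k) for c in range(num_classes)]
    scores = {}
    for a, b in combinations(range(num_classes), 2):
        union = per_class[a] | per_class[b]
        scores[(a, b)] = len(per_class[a] & per_class[b]) / len(union)
    return per_class, scores

# Hypothetical explanation matrix for d = 6 features and C = 2 classes.
W_x = [[0.0, 0.0], [0.1, 0.3], [0.0, 0.0], [2.1, 0.0], [0.0, 0.1], [1.4, 0.2]]
per_class, scores = pairwise_iou(W_x, k=2)
```

In this toy case the two classes share one of their two selected features, so the pair (0, 1) gets an IoU of one third; a value close to zero would indicate explanations that are nearly distinctive to each class.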

5. CONCLUSION AND FUTURE WORK

We developed AIM, a novel model interpretation framework that integrates local additivity with instance-wise feature selection. The approach focuses on learning attributions across the target output space, based on which to derive important features maximally faithful to the black-box model being explained. We provide empirical evidence supporting the quality of our explanations: compact yet comprehensive, distinctive to each decision class and comprehensible to human users. Exploring causal or counterfactual explanations, especially within our multi-class module, is a potential research avenue. Though extension to regression problems and other modalities such as audio or graphical data is straightforward, our future work will conduct thorough experiments on these modalities along with comprehensive comparisons with related baselines. Furthermore, our paper currently focuses on word-level explanations, so there is a chance of discarding positional or phrasal information (e.g., idioms, phrasal verbs). This can be resolved through chunk-level or sentence-level explanations, which we will tackle in future work.

A EXPERIMENTAL DESIGN

We now discuss the model design for each component in our framework. We parametrize E, S and G by three deep neural network functions. Since our input X is discrete, every network contains a learnable embedding layer. The explainer E passes the embedded inputs into three 250-dimensional dense layers and outputs W after applying ReLU non-linearity. The selector S is composed of one bidirectional LSTM of 100 dimension and three dense layers of the same size. Each layer is stacked between a Dropout layer and an activation. The upper layers use ReLU while Sigmoid is a natural choice for the final one. Regarding the network G, after feeding the inputs into its own embedding layer, we process the outputs through a 250-dimensional convolutional layer with kernel size 3, followed by a max-pooling layer over the sequence length. The last layer is a dense layer of dimension 250 together with Softmax activation. We use the same architecture for all tasks and train our model with Adam optimizer at τ = 0.2 and a learning rate of 0.001. We tune the coefficients α, β via grid search to achieve an adequate balance of faithfulness and compression. 

B PERFORMANCE METRICS

We here discuss the implementation details of our quantitative metrics for text explanation tasks. Recall that saliency-based approaches produce explanations in the form of a subset of K most important features S given by the weight vector.

B.1 PURITY

For text classification tasks, we observe that an explainer sometimes selects stopwords or punctuation as important features, which are incomprehensible from a human user's perspective. An effective explainer should reduce the likelihood of picking such "contaminated" features. Purity quantifies the proportion of stopwords and punctuation included in S. We obtain the collection of stopwords and punctuation via NLTK package (Bird, 2006) .
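A minimal sketch of the Purity computation is given below. To stay self-contained it uses a tiny hand-written stopword set (`STOPWORDS`, hypothetical) instead of the NLTK collection mentioned above, and it treats a token as punctuation when every character belongs to Python's `string.punctuation`.

```python
import string

# Hypothetical stand-in for the NLTK stopword list used in the paper.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}

def purity(selected_tokens, stopwords=STOPWORDS):
    """Proportion of stopwords and punctuation among the selected tokens."""
    if not selected_tokens:
        return 0.0
    noisy = 0
    for token in selected_tokens:
        is_stopword = token.lower() in stopwords
        is_punct = token != "" and all(ch in string.punctuation for ch in token)
        if is_stopword or is_punct:
            noisy += 1
    return noisy / len(selected_tokens)

score = purity(["worst", "the", "!", "acting"])  # 2 of 4 tokens are noisy
```

A lower score corresponds to a cleaner explanation, matching the reading of Purity in Table 1.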

B.2 BREVITY

Given a subset S, we define an explainer achieving brevity if the subset contains closely related features. For textual data, we expect the chosen features to contain a large number of duplicates and/or synonyms. We introduce cluster ratio to quantify brevity. Specifically, we first collect a database of semantically related words through WordNet (Miller, 1995) . We group tokens in S into clusters of synonyms (including duplicates), then calculate the average number of clusters formed over K tokens.
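One way to compute such a cluster ratio is sketched below with a small union-find over synonym pairs. This is an interpretation rather than the authors' code: the synonym pairs are supplied by hand in place of WordNet lookups, and the ratio is assumed to be the number of clusters divided by the number of selected tokens.

```python
def cluster_ratio(tokens, synonym_pairs):
    """Number of synonym clusters divided by the number of tokens.

    Duplicate tokens share one dictionary key, so they fall into the
    same cluster automatically.
    """
    if not tokens:
        return 0.0
    parent = {t: t for t in tokens}

    def find(t):
        # Follow parent links to the cluster representative (path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in synonym_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {find(t) for t in tokens}
    return len(clusters) / len(tokens)

selected = ["worst", "worst", "awful", "bad", "plot"]
pairs = [("awful", "bad"), ("worst", "bad")]  # hypothetical synonym relations
ratio = cluster_ratio(selected, pairs)  # clusters: {worst, awful, bad}, {plot}
```

Here five tokens collapse into two clusters, giving a ratio of 0.4; a set of ten unrelated words would score 1.0.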

B.3 FAITHFULNESS

Faithfulness measures the degree of agreement between the black-box's prediction given the explanation and the prediction given the original input. When fed into the black-box model, the explanation, a set of discrete features, is reconstructed into a representation vector similar to the original input, where features in S are retained and those not in S are masked by zero paddings. Faithfulness is a standard criterion to evaluate the quality of textual explanations and is commonly adopted in the literature, including our baseline papers (Ribeiro et al., 2016; Chen et al., 2018; Situ et al., 2021; Gat et al., 2022).
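The masking-and-agreement procedure can be sketched as follows. The keyword classifier `toy_black_box`, the `PAD` token standing in for zero padding, and the example documents are all hypothetical; a real evaluation would call the trained black-box model instead.

```python
PAD = "<pad>"  # stand-in for the zero padding applied to masked positions

def mask_outside(tokens, keep_positions):
    """Keep the tokens at the selected positions and pad everything else."""
    keep = set(keep_positions)
    return [t if i in keep else PAD for i, t in enumerate(tokens)]

def faithfulness(black_box, documents, explanations):
    """Fraction of documents whose prediction is unchanged after masking."""
    agree = 0
    for tokens, keep in zip(documents, explanations):
        if black_box(tokens) == black_box(mask_outside(tokens, keep)):
            agree += 1
    return agree / len(documents)

def toy_black_box(tokens):
    """Hypothetical keyword classifier: label 1 is positive, 0 is negative."""
    score = sum((t in {"great", "good"}) - (t in {"awful", "bad"}) for t in tokens)
    return 1 if score > 0 else 0

docs = [["a", "great", "film"], ["awful", "but", "great", "great"]]
explanations = [[1], [0]]  # selected token positions per document
score = faithfulness(toy_black_box, docs, explanations)
```

The first explanation preserves the prediction while the second flips it, so the score is 0.5 on this toy pair.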

B.4 STABILITY

One desirable property of a good model explainer is Stability, the ability to produce the same explanations given similar examples. In the context of text explanations, the subsets of selected important words are expected to overlap substantially for two similar documents. We evaluate explanation stability through a simplified implementation of the Intersection over Union (IoU) measure originally proposed by Situ et al. (2021). Given an example x in the test set, we first search for its nearest neighbors N(x). The neighboring documents are defined to (1) have the same (black-box predicted) label and (2) be either lexically or semantically similar. We adopt the ratio of overlapping tokens as a proxy metric for lexical similarity. Semantic similarity is measured via cosine similarity of their BERT representations, obtained by summing over the token representations of the last hidden state produced by a pre-trained uncased BERT base model open-sourced by Hugging Face (Wolf et al., 2019). We then select a set of 20 distinct neighbors, consisting of the top 10 semantically and top 10 lexically similar documents. Let $v_x$ and $v_{x'}$ respectively denote the subsets of top K tokens selected for the instance $x$ and its neighbor $x'$. IoU is given as

$$\mathrm{IoU}(x) = \frac{1}{|N(x)|} \sum_{x' \in N(x)} \frac{|v_x \cap v_{x'}|}{|v_x \cup v_{x'}|}.$$

To eliminate the effect of poor initialization, for each model explainer, we evaluate the model initialization with the highest faithfulness and compare the stability of top 10 explanations. Since explainers sometimes favor a large number of stopwords, which may inflate the measure, we exclude such tokens from the feature sets when computing Stability.
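The IoU average over neighbors can be sketched as below. The function name and the toy token sets are illustrative only; neighbor retrieval and stopword filtering are assumed to have happened beforehand.

```python
def stability_iou(selected, neighbor_selections):
    """Mean Intersection over Union between v_x and each neighbor's v_x'."""
    a = set(selected)
    scores = []
    for other in neighbor_selections:
        b = set(other)
        union = a | b
        # Two empty selections share nothing, so score them as 0.
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


v_x = ["great", "acting", "plot"]
neighbors = [
    ["great", "plot", "fun"],     # 2 shared of 4 distinct tokens: 0.5
    ["acting", "great", "plot"],  # identical sets: 1.0
    ["boring"],                   # no overlap: 0.0
]
iou = stability_iou(v_x, neighbors)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```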

B.5 ∆ LOG-ODDS

Given an example $x$, $\Delta$ log-odds$(x)$ measures the change in the confidence of the black-box's prediction before and after masking the features in an explanation (Schwab & Karlen, 2019; Situ et al., 2021). Given the original input vector $x$, the black-box model outputs the predictive distribution $P_m(Y = c \mid x)$ with label $c \in \{1, ..., C\}$. Let $y_m = \operatorname{argmax}_c P_m(Y = c \mid x)$ denote the predicted label. Then

$$\Delta\text{log-odds}(x) = \text{log-odds}\big(P_m(y_m \mid x)\big) - \text{log-odds}\big(P_m(y_m \mid \tilde{x})\big),$$

where $\text{log-odds}(P) = \log \frac{P}{1-P}$ and $\tilde{x}$ denotes the masked representation. Positive $\Delta$ log-odds refers to the input version where we mask important features, i.e., features in S. Negative $\Delta$ log-odds refers to the input version where we mask unimportant features, i.e., all features not in S. This is also the input version used to evaluate Faithfulness.
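A small numerical sketch of the ∆ log-odds computation follows. The clipping constant `eps` is an added assumption to keep the logarithm finite when the black-box outputs a probability of exactly 0 or 1; the probabilities in the example are made up.

```python
import math


def log_odds(p, eps=1e-12):
    """log(p / (1 - p)), with p clipped away from 0 and 1 (clipping is an assumption)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def delta_log_odds(p_original, p_masked):
    """Change in log-odds of the predicted class y_m between x and its masked version."""
    return log_odds(p_original) - log_odds(p_masked)


# The black-box assigns 0.9 to y_m on the full input and 0.5 after masking.
drop = delta_log_odds(0.9, 0.5)  # log(9) - log(1), roughly 2.197
```

A large positive value after masking the features in S indicates that those features carried most of the black-box's confidence.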

C QUALITATIVE COMPARISON

This section presents 12 additional qualitative examples to examine the quality of explanations of all model explainers. These examples are randomly selected from the outputs of the model initialization with the best faithfulness. Examples 9-12 are dedicated to illustrating multi-class explanations. Across all examples, we again demonstrate that our explanations are strongly consistent with the black-box's predictions, highly compact (by covering duplicates and synonyms) and distinctive to each decision class.

1. Original document: this movie was a pleasant surprise for me. in all honesty, the previews looked horrible, up until the point where emma thompson and alan rickman appeared. so i rented it with reservation, but i thoroughly enjoyed this movie. it had great acting, a few good plot twists, and, of course, emma thompson and alan rickman. its definitely worth checking out.

Original document: i could not believe the original rating i found when i looked up this film, 9.5? unfortunately it looks like i am not alone. the film, is slow and boring really, one of the sad things is that if the film had been given a realistic rating of around 5 or 6 then the expectation would not have been so high. unfortunately, this was not the case, so when watching the film, and seeing the poor story and acting, i am left giving it a 3/10 score. vinnie jones is superb in lock stock, and also snatch, and he plays a great hard man, however, he should stick to this role. its a bit like when stallone and schwarzenegger have done comedy films, they just don't work. neither can he play lead actor, he plays better as supporting or otherwise. when he plays lead, his acting talents are too 'in view' and shown up as not really very good. mean machine is another good example of this.

12. Original document: the finest short i've ever seen. some commentators suggest it might have been lengthened, due to the density of insight it offers. there's irony in that comment and little merit. the acting is all up to noonan and he carries his thankless character perfectly. i might have preferred that the narrator be less "recognizable", but the gravitas lent is pitch perfect. this is a short for people who read, for those whose "bar" is set high and for those who recognize that living in a culture that celebrates stupidity and banality can forge contrary and bitter defenders of beauty. a beautiful short film. fwiw: i was pleased at the picasso reference, since i once believed that picasso was just another art whore with little talent; like, i assume, most people -until the day i saw some drawings he made when he was 12. picasso was a finer draftsman and a brilliant artist at that age than many artists will ever become in a lifetime. i understood immediately why he had to make the art he became known for.

D ABLATION STUDY

Generally, our framework involves both an explainer and a feature selector. The explainer E aims to produce a multi-class explanation module $W_x$ that is directly used to infer features. The role of the selector S is to learn good local distributions that generate high-quality local perturbations to train E. Here we study various setups for AIM to demonstrate that the proposed method yields the best performance. We seek to answer the following questions:

1. Does inference from the probability vector $\pi_x$ of the selector S give a better result than inference from $W_x$ of the explainer E?
2. Is the explainer E a necessary component?
3. Is using samples from learnable local distributions better than using heuristic samples?

To validate these hypotheses, we analyze 3 different approaches on the IMDB dataset. The ablation table compares the quality of explanations produced under these setups with the performance level achieved under our proposed method on 3 metrics: Faithfulness, Purity and Brevity.

D.1 INFERENCE FROM SELECTOR

In the original framework, we experiment with the strategy of inferring features from the output probability vector $\pi_x$. Following the same approach, we rank features by $\pi_x$ in decreasing order and, by convention, select the top 10. We initially argue that the selector operates on each feature independently, and thus does not guarantee good performance when combining features into a single explanation. Table 7 supports this argument in that this approach leads to a nearly 20% drop in Faithfulness.

D.2 TRAINING SELECTOR ONLY

The L2X and VIBI frameworks only contain a feature selector, from which explanations are inferred. We investigate whether training the explainer jointly is necessary or whether the selector alone suffices. Recall our final objective function $\min_\theta \; \mathcal{L}_1 + \alpha \mathcal{L}_2 + \beta\, \mathbb{E}_x \lVert W_x \rVert_{2,1}$, where $\mathcal{L}_1$ is used to train the explainer and optimize $W_x$. Removing the role of the explainer, we omit $\mathcal{L}_1$ and the third loss term. We then only train the selector according to $\mathcal{L}_2$ with the support of the approximator G. It can be seen from Table 7 that AIM again under-performs under this setup and the performance does not differ much from the first scenario above. We further note that our method employs the Gumbel trick for Bernoulli sampling, while L2X applies it for Categorical sampling over K features. This explains why, despite having the selector only, L2X can still search for better combinations of features.
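To make the Gumbel trick for Bernoulli sampling concrete, the sketch below draws a relaxed (continuous) Bernoulli sample per feature using the standard binary Concrete construction: logistic noise is added to the logit of each selection probability and the result is squashed by a sigmoid at temperature τ. This is an illustration under that assumption, not necessarily the exact parametrization used in AIM; the function name and the example probabilities are hypothetical.

```python
import math
import random


def relaxed_bernoulli(pi, tau, rng):
    """One relaxed Bernoulli sample in [0, 1] per feature probability in `pi`."""
    eps = 1e-9
    sample = []
    for p in pi:
        p = min(max(p, eps), 1 - eps)
        u = min(max(rng.random(), eps), 1 - eps)
        # Logit of p plus logistic noise, scaled by the temperature.
        logit = (math.log(p) - math.log(1 - p)
                 + math.log(u) - math.log(1 - u)) / tau
        # Numerically stable sigmoid.
        if logit >= 0:
            z = 1.0 / (1.0 + math.exp(-logit))
        else:
            e = math.exp(logit)
            z = e / (1.0 + e)
        sample.append(z)
    return sample


rng = random.Random(0)
z = relaxed_bernoulli([0.9, 0.1, 0.5], tau=0.2, rng=rng)
```

Thresholding each entry at 0.5 recovers an exact Bernoulli draw with the given probability, while the continuous values keep the sampling step differentiable with respect to the probabilities. Each feature is sampled independently, in contrast with a Categorical draw over K features.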

D.3 TRAINING EXPLAINER ONLY

Lastly, we analyze the importance of learning local distributions via the selector compared to using heuristic sampling. In our framework, we expect the selector to help mitigate the risk of ill-conditioned local samples observed in LIME. To validate this hypothesis, we modify our framework to mimic LIME: we first exclude S and G. To train E, we uniformly sample local perturbations, denoted as $\hat{z}_x$, with the number of non-zero elements also uniformly drawn at random. E is now optimized purely on $\mathcal{L}_1$, which is modified as $\mathcal{L}_1 = \mathbb{E}_x \mathbb{E}_{\hat{z}_x} \left[ \mathrm{CE}\left( \tilde{y}_m, \mathrm{softmax}(W_x^\top \hat{z}_x) \right) \right]$. Table 7 shows that this heuristic approach significantly worsens the explainer performance. This supports the effectiveness of optimizing the selector S together with the explainer E, so that S can assist E in learning feature attributions that are faithful and distinctive across decision classes.

, 2017). To keep it consistent with these baselines, we consider each pixel to be a feature and the goal is to find the optimal local subset of pixels S for each example x that can approximate the black-box prediction on the full image. We compare Faithfulness scores of top K selected features with K ∈ {200, 300, 400}. We again approximate the explanation with the input variant where the features not selected are masked by zeros. The results are averaged over 5 model initializations. AIM has been shown to be superior on texts, and Table 8 here demonstrates that the AIM framework can also work reasonably well on images.

E.1.2 SUPERPIXEL-BASED EXPLANATION

We now consider groups of pixels as features. We split each image into 4 × 4 patches of size 7 × 7, resulting in a total of 16 features.

We here demonstrate the extensibility of the AIM framework to tabular data. We experimented with two real-world datasets: Admission (Acharya et al., 2019) (classifying whether a graduate application is successful on 12 features) and Adult (Kohavi et al., 1996) (predicting whether income exceeds 50K/year on 7 features). For every test example, the task is to select the top features most relevant to the original prediction from a black-box classifier.

Evaluation metrics. We want to evaluate how well the top K important features can approximate the prediction on the full input. In the main paper, this is done via the Faithfulness metric: the consistency between the black-box predictions on the full input and the input variant where non-selected features are masked. For texts and images, masking is often done by zero replacement. However, for tabular data, zero values do not indicate the absence of features. We therefore implement mean masking and noise masking strategies: the former replaces unimportant features with the mean value of each feature over the test set; the latter replaces them with random uniform noise ϵ ∈ [-1, 1]. We additionally report the Positive ∆ Log-odds scores, which measure the drop in the black-box confidence scores before and after masking the top K features. A selection of features is deemed important when the Faithfulness and Positive ∆ Log-odds metrics are both high.

Results. We evaluate the top K = 6 and K = 3 features (half the number of features) respectively for the Admission and Adult datasets. The following table summarizes the average results over 5 model initializations and 10 initializations of random noise. The black-box architectures are given in parentheses. We compare AIM against popular tabular baselines: LIME (Ribeiro et al., 2016), INVASE (Yoon et al., 2019) and MAPLE (Plumb et al., 2018).

Here we demonstrate that the features selected by AIM do encode sufficient information to yield a prediction consistent with the full input while being more robust to perturbations.
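The mean masking and noise masking strategies described for tabular data can be sketched as follows. The helper names and the toy rows are hypothetical, and features are assumed to be numeric and already scaled so that noise in [-1, 1] is meaningful.

```python
import random


def feature_means(rows):
    """Per-feature mean over a set of rows (here standing in for the test set)."""
    d = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(d)]


def mean_mask(x, selected, means):
    """Keep the selected features and replace the rest by their feature means."""
    keep = set(selected)
    return [v if j in keep else means[j] for j, v in enumerate(x)]


def noise_mask(x, selected, rng):
    """Keep the selected features and replace the rest by uniform noise in [-1, 1]."""
    keep = set(selected)
    return [v if j in keep else rng.uniform(-1.0, 1.0) for j, v in enumerate(x)]


test_rows = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
means = feature_means(test_rows)                  # [2.0, 3.0, 4.0]
masked = mean_mask(test_rows[0], [0], means)      # [1.0, 3.0, 4.0]
noisy = noise_mask(test_rows[0], [0], random.Random(0))
```

Either masked variant is then passed to the black-box in place of the zero-masked input used for texts and images.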

F HUMAN EVALUATION

We ask 30 university students to infer the sentiments of 50 IMDB movie reviews, each of which is represented only by 10 key words obtained from each explainer. Each participant is presented with 3 sections, containing output examples from AIM, L2X and LIME respectively. Each section displays 50 sets of 10 key words corresponding to 50 different movie reviews. The information on which section belongs to which method is hidden and the ordering of examples within a section is randomized.

The following table reports the performance of L2X trained at different values of K. It highlights the fact that a careful choice of K as a hyperparameter is crucial, and that a larger K does not necessarily yield better results. This is undesirable since a larger K increases the chance of selecting meaningless features (lower purity) without guaranteeing that faithfulness will go up accordingly. We also provide qualitative examples showing inconsistencies in selecting top K features, i.e., the rankings of features vary across settings. For instance, though qualitatively considered an important feature, the word amazing in example 1 is selected in top 5 but does not appear in top 10 and even ranks ninth in top 20. The same pattern can be observed across examples.

We here would like to demonstrate that our framework is not highly sensitive to hyperparameters. As reported, β is chosen at 1e-3 across most text and image datasets, while α can vary within {0.1, 0.5, 0.8}. The following table reports the performance of our models under various settings of α (averaged over 5 initializations). It can be seen that there is no significant variation in performance compared to our reported results (highlighted in bold). For the purpose of clarity, we only display the results for crucial metrics: Faithfulness and ∆ Log-odds scores. The table below lists all of the remaining hyper-parameters subject to tuning and their corresponding model performance.
We tune all baselines via grid search over the following ranges and report the average results over 5 initializations. For the purpose of clarity, we only display the results for 3 metrics: Purity, Brevity and Faithfulness. When there is a trade-off among these metrics, Faithfulness is chosen to be the deciding criterion. The best settings for each method are presented in bold and their corresponding results are reported in the main paper.

H.2.1 L2X

For L2X, we tune τ, which is the Gumbel-Softmax temperature. L2X has only one loss term, so no loss term coefficient needs tuning.

The relevant hyper-parameters of LIME for text explanations include kernel width (used to define the proximity function) and the number of sampling perturbations N. As shown in Figure 5, increasing N leads to better faithfulness, yet at the cost of an exponential climb in computing time. At our maximum capacity, we follow the authors' suggestion of setting N = 5000 for all experiments. We examine kernel width in {15, 20, 25, 30, 35}.



z is a binary representation vector of an input indicating the presence/absence of features. The dot-product operation is equivalent to summing up feature weights given by the weight vector w, giving rise to additivity.

VIBI (Bang et al., 2021) measures fidelity via the prediction accuracy of the approximator model, whereas we conduct a post-hoc comparison of the black-box's predictions on the original and masked input.

All qualitative examples presented in our work are randomly selected from the outputs of the model initialization with the best Faithfulness.
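The additivity property stated in the first note above can be checked numerically: with a binary z, the dot product with the weights is exactly the sum of the weights of the features that are present. The sketch below uses a multi-class weight matrix W (one column per class) as an assumed generalization of the single weight vector w; the numbers are made up.

```python
def additive_scores(W, z):
    """Return W^T z: one score per class, for a d x C weight matrix and binary z."""
    num_classes = len(W[0])
    return [sum(W[i][c] * z[i] for i in range(len(z))) for c in range(num_classes)]


# Three features, two classes; features 0 and 2 are present.
W = [[0.5, 0.0],
     [0.25, 1.0],
     [0.0, 2.0]]
z = [1, 0, 1]
scores = additive_scores(W, z)  # [0.5, 2.0]

# Same result obtained by summing the weight rows of the present features only.
present_sum = [sum(W[i][c] for i in range(len(z)) if z[i]) for c in range(2)]
```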



of this paper, we limit the current discussion to classification problems. Consider a data set of pairs (X, Y ) where X ∼ P X (•) is the input random variable and Y is characterized by the conditional distribution P m (Y | X) obtained as the predictions of a pre-trained black-box classifier for the response variable. The notation m stands for model, indicating the predictive distribution of the black-box model, to be differentiated from the ground-truth distribution. We denote x ∈ R d as an input realization with d interpretable features and predicted label Y = c ∈ {1, ..., C}. Given an input x, we obtain the hard prediction from the black-box model as y m = argmax c P m (Y = c | X = x).

Figure 4: Human Evaluation Interface

Plumb et al. (2018); Yoon et al. (2022) in particular learn a surrogate model assigning weights to training examples such that those more similar or relevant to the input are given higher weights.

Description of quantitative evaluation metrics.

e.g., the features picked by a model trained on top 5 may be considered irrelevant by one trained on top 10. AIM strictly avoids such inconsistency as our framework is not sensitive to K.

additionally provides 4 examples of the features chosen by AIM in IMDB dataset. This helps shed light on why the black-box model makes a certain prediction, especially the wrong one. A comprehensive qualitative comparison with the baselines on multiple examples can further be found in Appendix C.3 While explanations from additive models (LIME and FI) are contaminated with a larger volume of neutral words, instance-wise methods (L2X and VIBI) tend to select more meaningful features. AIM stands out with the strongest compactness by picking up all duplicates and synonyms without compromising predictive performance. Note that LIME suffers from low brevity mainly because its algorithm extracts unique words as explanations. This also means LIME's feature sets tend to be more diverse than the other methods and thus should be more faithful. Our experiment nevertheless shows that this is not the case.

Ground-truth labels and labels predicted by the black-box model on IMDB movie reviews are given in the first two columns. 10 most relevant words selected by AIM are highlighted in yellow.

Human evaluation results on IMDB dataset of AIM, L2X, and LIME.

Quality of multi-class explanations from AIM, LIME and FI.


details data splits and best hyperparameters used in our experiments for 3 text classification (IMDB / HateXplain / AG News) and 2 image recognition tasks (MNIST / Fashion-MNIST). α and β are the balancing coefficients on the loss terms in the final training objective. For every dataset, we tune α and β via grid search with values in {0.1, 0.5, 1, 1.5, 1.8, 2} and {1e-2, 1e-3, 1e-4} respectively, and the setting that yields the highest Faithfulness is selected. In Table 12, we provide detailed empirical results showing that our superior performance is insensitive to hyper-parameter choices.
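The selection rule described above amounts to an exhaustive search over the two grids, keeping the pair with the highest Faithfulness. In the sketch below, `train_and_score` is a hypothetical callback that would train AIM with the given coefficients and return validation Faithfulness; the toy scoring function exists only so that the example runs.

```python
import math
from itertools import product

ALPHAS = [0.1, 0.5, 1, 1.5, 1.8, 2]
BETAS = [1e-2, 1e-3, 1e-4]


def grid_search(train_and_score, alphas=ALPHAS, betas=BETAS):
    """Return the (alpha, beta) pair with the highest score from `train_and_score`."""
    return max(product(alphas, betas), key=lambda ab: train_and_score(*ab))


def toy_score(alpha, beta):
    """Hypothetical stand-in for 'train AIM, then measure Faithfulness'."""
    return -(abs(alpha - 0.5) + abs(math.log10(beta) + 3))


best_alpha, best_beta = grid_search(toy_score)  # (0.5, 0.001) for this toy score
```

With 6 values of α and 3 values of β, this evaluates 18 configurations per dataset.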

Dataset statistics and hyperparameters.

FI: worst, tape, plot, worst, be, never, burn, worst

3. Original document: to me, "anatomie" is certainly one of the better movies i have seen. i dont think "anatomie" was primarily intended to be a horror movie but a movie questioning the ethics of science. if you watch it with that in mind, it turns into a really good film. the only annoying bit was the awful voice dubbing for the english version. how can you expect any non-german person to listen to these unbearable german accents for two hours? let native english speakers do the talking or use subtitles instead!!

Original document: i have seen this movie several times, it sure is one of the cheapest action flicks of the eighties. so, i think many viewers would definitely change the channel when they come across this one. but, if you are into great trash, "dragon hunt" is made for you. the main characters (the mcnamara twins) are sporting great moustaches and look so ridiculous in their camouflage dresses. one of the best scenes is when one of then gets shot in the leg and is still kicking his enemies into nirvana. this movie is really awful, but then again, it is a great party tape!

Original document: fda oks scientist publishing vioxx data (ap) ap -the food and drug administration has given a whistle-blower scientist permission to publish data indicating that as many as 139,000 people had heart attacks that may be linked to vioxx, the scientists lawyer said monday.

Original document: tivo net loss widens; subscribers grow tivo inc.(tivo.o: quote, profile, research) , maker of digital television recorders, on monday said its quarterly net loss widened as it boosted spending to acquire customers, but subscribers to its fee-based tv service rose.

Ablation study of AIM on IMDB dataset. * Proposed method.

Table 9 reports faithfulness of superpixel-based explanations on MNIST and Fashion-MNIST test sets, averaged over 5 model initializations. Randomly selected examples of various scenarios are additionally presented for qualitative investigation. We find that the selected features are particularly useful to explain wrong decisions in terms of what spurious signals the black-box model relies on to make predictions e.g., the round shape to predict digit 0, or the rectangular pattern at the bottom to predict a Trouser instead of a T-shirt.

Performance of AIM, LIME, MAPLE and INVASE on tabular datasets

Performance of L2X when trained at 3 values of K for all datasets. Performance of AIM under the same setup is reported for comparison.

Original document: i saw Riverdance -the new show and loved it from the very first moment! it is an energetic tribute to Irish dance filled with brilliant dancing, music and choreography! the leads, Jean Butler and Colin Dunne had me captivated with their exquisite dancing! may they always keep shining and keep dancing. their on stage chemistry was amazing, and the unity between them on stage was obvious. they look like they were made to dance with each other! this show is my absolute favourite, and probably always will be. long live Riverdance!

Original document: how much longer will the west continue to put up with all of this shit from retarded third worlders and liberal cucks before we revert to a frontier mentality and just start the real violence .

Yahoo and SBC extend partnership and plan new services Yahoo and SBC communications have agreed to collaborate to extend some of the online services and content they currently provide to PC users to mobile phones and home entertainment devices.

Original document: riding high on the success of "rebel without a cause", came a tidal wave of teen movies. arguably this is one of the best. a very young Mcarthur excels here as the not really too troubled teen. the story concentrates more on perceptions of delinquency, than any traumatic occurrence. the supporting cast is memorable, Frankenheimer directs like an old pro. just a story of a young man that finds others take his actions much too seriously.

In our experiments, the only hyperparameters subject to tuning are loss coefficients α and β. While α seeks to balance exploration and exploitation of local samples as discussed in the previous sections, β controls the magnitude of ||W||_{2,1} for stable backpropagation.

AIM Hyper-parameters Tuning.

This section provides performance results of the baseline methods under different hyper-parameter settings. Note that our black-box architecture for the IMDB dataset is different from the ones reported in Chen et al. (2018), Bang et al. (2021) and Gat et al. (2022): L2X adopts CNN while VIBI and FI opt for LSTM. Since AIM is not GRU-based either, our black-box model is chosen to be a bidirectional GRU in order to examine whether these models can explain different kinds of black-box architectures.

L2X Hyper-parameters Tuning.

LIME Hyper-parameters Tuning.

VIBI Hyper-parameters Tuning.


FI Hyper-parameters Tuning.

ACKNOWLEDGMENTS

Trung Le and Dinh Phung were supported by the US Air Force grant FA2386-21-1-4049. Trung Le was also supported by the ECR Seed grant of Faculty of Information Technology, Monash University.


Published as a conference paper at ICLR 2023 

H.2.3 VIBI

In terms of model architecture, VIBI offers multiple options for the approximator. In our experiments, LSTM approximator gives the highest accuracies for both IMDB and AG News, while CNN works best on HateXplain. For the remaining hyper-parameters, we tune the Gumbel-Softmax temperature τ within {0.2, 0.5, 0.7}. The objective function of VIBI further has two loss terms where β is the weight of the second one, controlling brevity of the explanation. We explore β within {0.1, 0.3, 0.5, 1.0, 1.5, 2.0}.

H.2.4 FI

For FI, the relevant hyper-parameter is the number of sampling perturbations N. Since FI is very time-consuming, we tune N over 3 values {50, 100, 200}.

