EVALUATIONS AND METHODS FOR EXPLANATION THROUGH ROBUSTNESS ANALYSIS

Abstract

Feature-based explanations, which provide the importance of each feature toward the model prediction, are arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis. In contrast to existing evaluations, which require specifying some way to "remove" features that inevitably introduces biases and artifacts, we make use of the subtler notion of smaller adversarial perturbations. By optimizing towards our proposed evaluation criteria, we obtain new explanations that are loosely necessary and sufficient for a prediction. We further extend the explanation to extract the set of features that would move the current prediction to a target class, by adopting targeted adversarial attacks for the robustness analysis. Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and our derived explanations.

1. INTRODUCTION

There is increasing interest in making machine learning models credible, fair, and more generally interpretable (Doshi-Velez & Kim, 2017). Researchers have explored various notions of model interpretability, ranging from trustability (Ribeiro et al., 2016) and fairness of a model (Zhao et al., 2017) to characterizing the model's weak points (Koh & Liang, 2017; Yeh et al., 2018). Even though the goals of these interpretability tasks vary, the vast majority of them use so-called feature-based explanations, which assign importance to individual features (Baehrens et al., 2010; Simonyan et al., 2013; Zeiler & Fergus, 2014; Bach et al., 2015; Ribeiro et al., 2016; Lundberg & Lee, 2017; Ancona et al., 2018; Sundararajan et al., 2017; Zintgraf et al., 2017; Shrikumar et al., 2017; Chang et al., 2019). There has also been a slew of recent evaluation measures for feature-based explanations, such as completeness (Sundararajan et al., 2017), sensitivity-n (Ancona et al., 2018), infidelity (Yeh et al., 2019), the causal local explanation metric (Plumb et al., 2018), and, most relevant to the current paper, removal- and preservation-based criteria (Samek et al., 2016; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017; Petsiuk et al., 2018). A common thread in all these evaluation measures is that for a good feature-based explanation, the most salient features should be necessary, in that removing them should lead to a large difference in the prediction score, and also sufficient, in that removing the non-salient features should not lead to a large difference in the prediction score. Thus, common evaluations, and indeed even methods, for feature-based explanations involve measuring the function difference after "removing features", which in practice is done by setting the feature values to some reference value (also sometimes called a baseline value).
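As a toy illustration of this removal-based style of evaluation, the following NumPy sketch zeroes out the top-k features according to an attribution and records the resulting drop in the predicted class score. This is a generic deletion-curve sketch, not any of the cited implementations; `removal_curve` and the linear "model" are our own hypothetical names.

```python
import numpy as np

def removal_curve(model, x, attribution, baseline=0.0, steps=10):
    """Deletion-style evaluation: set the most important features
    (per the attribution) to the baseline value in increasing
    fractions and record the score of the originally predicted class."""
    order = np.argsort(-attribution.ravel())      # most important first
    x_flat = x.ravel().copy()
    target = int(np.argmax(model(x_flat)))        # originally predicted class
    scores = []
    for frac in np.linspace(0.0, 1.0, steps + 1):
        k = int(round(frac * x_flat.size))
        x_pert = x_flat.copy()
        x_pert[order[:k]] = baseline              # "remove" the top-k features
        scores.append(float(model(x_pert)[target]))
    return np.array(scores)

# Toy linear "model" returning two class scores.
W = np.array([[1.0, -2.0, 3.0], [-1.0, 2.0, -3.0]])
model = lambda z: W @ z
x = np.array([1.0, 1.0, 1.0])
attr = W[0] * x                                   # gradient*input attribution
scores = removal_curve(model, x, attr, steps=3)   # score as more is removed
```

A faster drop in `scores` for an attribution, relative to a random feature ordering, is taken as evidence that the attribution correctly ranks feature importance.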
However, this favors feature values that are far away from the baseline value (since these correspond to large perturbations, and hence are likely to lead to a function value difference), causing an intrinsic bias in these methods and evaluations. For example, if we set the feature value to black in RGB images, this introduces a bias favoring bright pixels: explanations that optimize such evaluations often omit important dark objects such as a dark-colored dog. An alternative approach to "remove features" is to sample from some predefined distribution or a generative model (Chang et al., 2019). This nevertheless incurs in turn the bias inherent to the generative model, and accurate generative models that approximate the data distribution well may not be available in all domains. In this work, instead of defining prediction changes through "removal" of features (which, as we argued, introduces biases), we instead consider the use of small but adversarial perturbations. It is natural to assume that adversarial perturbations on irrelevant features should be ineffective, while those on relevant features should be effective. We can thus measure the necessity of a set of relevant features, provided by an explanation, by measuring the consequences of adversarially perturbing their feature values: if the features are indeed relevant, this should lead to an appreciable change in the predictions.

Figure 1: Illustration of our explanation highlighting both pertinent positive and negative features that support the prediction of "2". The blue circled region corresponds to pertinent positive features whose values, when perturbed (from white to black), make the digit resemble "7"; the green and yellow circled regions correspond to pertinent negative features that, when turned on (black to white), shape the digit into "0", "8", or "9".
Complementarily, we could measure the sufficiency of the set of relevant features by measuring the consequences of adversarially perturbing its complementary set of irrelevant features: if the perturbed features are indeed irrelevant, this should not lead to an appreciable change in the predictions. We emphasize that under our definition of "important features", our method naturally identifies both pertinent positive and pertinent negative features (Dhurandhar et al., 2018), since both are the most susceptible to adversarial perturbations; we demonstrate the idea in Figure 1. While exactly computing such an effectiveness measure is NP-hard (Katz et al., 2017), we can leverage recent results from test-time robustness (Carlini & Wagner, 2017; Madry et al., 2017), which entail that perturbations computed by adversarial attacks can serve as reasonably tight upper bounds for our proposed evaluation. Given this adversarial effectiveness evaluation measure, we further design feature-based explanations that optimize it. To summarize our contributions:

• We define new evaluation criteria for feature-based explanations by leveraging robustness analysis involving small adversarial perturbations. These reduce the bias inherent in recent evaluation measures that "remove features" via large perturbations to some reference values, or by sampling from some reference distribution.

• We design efficient algorithms that generate explanations optimizing the proposed criteria by incorporating game-theoretic notions, and demonstrate the effectiveness and interpretability of our proposed explanations on image and language datasets: via our proposed evaluation metric, additional objective metrics, qualitative results, and a user study.
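The robustness-based criteria above can be sketched as follows: attack only the coordinates that an explanation marks relevant (or only their complement) and record how small a perturbation suffices to flip the prediction. This is a minimal NumPy illustration on a toy linear classifier with a hand-rolled restricted gradient attack; `restricted_attack_norm` and all other names here are our own hypothetical constructions, not the authors' released code, and a real evaluation would use a stronger attack such as PGD (Madry et al., 2017).

```python
import numpy as np

def restricted_attack_norm(grad_fn, predict, x, mask, step=0.05, iters=100):
    """Gradient-ascent attack that may only perturb coordinates where
    mask == 1. Returns the L2 norm of the smallest perturbation seen
    that flips the prediction, or np.inf if no iterate flips it."""
    label = predict(x)
    delta = np.zeros_like(x)
    best = np.inf
    for _ in range(iters):
        delta += step * grad_fn(x + delta) * mask   # restricted perturbation
        if predict(x + delta) != label:
            best = min(best, float(np.linalg.norm(delta)))
    return best

# Toy linear classifier: predict 1 if w @ z > 0, else 0.
w = np.array([1.0, 0.0, 2.0])
predict = lambda z: int(w @ z > 0)
grad_fn = lambda z: -w            # ascent direction that shrinks the margin
x = np.array([1.0, 1.0, 1.0])

relevant = np.array([1.0, 0.0, 1.0])   # features an explanation marks salient
r_rel = restricted_attack_norm(grad_fn, predict, x, relevant)
r_irr = restricted_attack_norm(grad_fn, predict, x, 1.0 - relevant)
```

A small `r_rel` (perturbing the salient features suffices to flip the prediction, so they are necessary) together with a large or infinite `r_irr` (perturbing the rest cannot flip it, so the salient set is sufficient) supports the explanation.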

2. RELATED WORK

Our work defines a pair of new objective evaluation criteria for feature-based explanations; existing measures can be roughly categorized into two families. The first family of explanation evaluations is based on measuring the fidelity of the explanation to the model. Here, the feature-based explanation is mapped to a simplified model, and the fidelity evaluations measure how well this simplified model corresponds to the actual model. A common setting is one where the feature vectors are locally binarized at a given test input, to indicate presence or "removal" of a feature. A linear model with the explanation weights as coefficients would then equal the sum of attribution values for all present features. Completeness (or "sum to delta") requires the sum of all attributions to equal the prediction difference between the original input and a baseline input (Sundararajan et al., 2017; Shrikumar et al., 2017), while sensitivity-n further generalizes this by requiring the sum of any subset of attribution values to equal the prediction difference between the baseline and the input with the corresponding features present or absent (Ancona et al., 2018). Local accuracy (Ribeiro et al., 2016; Lundberg & Lee, 2017) measures the fidelity of the local linear regression model corresponding to the explanation weights, while infidelity is a framework that encompasses the instances above (Yeh et al., 2019).
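As a concrete toy check of the fidelity measures above (our illustration, not any cited implementation; the identities hold exactly here only because the model is linear and the baseline is zero, where gradient-times-input attribution is exact):

```python
import numpy as np

# Toy linear model f(z) = w @ z + b.
w, b = np.array([0.5, -1.0, 2.0]), 0.3
f = lambda z: float(w @ z + b)

x = np.array([2.0, 1.0, -1.0])
baseline = np.zeros_like(x)
attr = w * (x - baseline)          # gradient*input; exact for a linear model

# Completeness: all attributions sum to f(x) - f(baseline).
completeness_ok = np.isclose(attr.sum(), f(x) - f(baseline))

# Sensitivity-n for one subset S: attributions over S sum to the
# prediction difference when the features in S are set to the baseline.
S = [0, 2]
x_S_removed = x.copy()
x_S_removed[S] = baseline[S]
sensitivity_ok = np.isclose(attr[S].sum(), f(x) - f(x_S_removed))
```

For nonlinear models these identities generally fail for gradient-times-input, which is exactly what evaluation measures like sensitivity-n are designed to quantify.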



Code available at https://github.com/ChengYuHsieh/explanation_robustness.

