GANMEX: ONE-VS-ONE ATTRIBUTIONS USING GAN-BASED MODEL EXPLAINABILITY

Abstract

Attribution methods have shown promise for identifying the key features that led to a learned model's predictions. While most existing attribution methods rely on a baseline input for performing feature perturbations, limited research has been conducted on the baseline selection issue. Poor choices of baselines prevent one-vs-one explanations for multi-class classifiers, i.e., the attribution methods cannot explain why an input belongs to its original class rather than another specified target class. One-vs-one explanations are crucial when certain classes are more similar than others, e.g., two bird types among multiple animals, because they focus on the key differentiating features rather than features shared across classes. In this paper, we present GANMEX, a novel approach that applies Generative Adversarial Networks (GANs) by incorporating the to-be-explained classifier as part of the adversarial network. Our approach effectively selects the baseline as the closest realistic sample belonging to the target class, which allows attribution methods to provide true one-vs-one explanations. We show that GANMEX baselines improve saliency maps and lead to stronger performance on perturbation-based evaluation metrics compared with existing baselines. Existing attribution results are known to be insensitive to model randomization, and we demonstrate that GANMEX baselines lead to better outcomes under cascading randomization of the model.

1. INTRODUCTION

Modern Deep Neural Network (DNN) designs have been advancing the state-of-the-art performance of numerous machine learning tasks with the help of increasing model complexity, which at the same time reduces model transparency. Explainable decisions are crucial for earning the trust of decision makers, are required for regulatory purposes (Goodman & Flaxman (2017)), and are extremely useful for development and maintainability. For these reasons, various attribution methods have been developed to explain DNN decisions by attributing an importance weight to each input feature. At a high level, most attribution methods, such as Integrated Gradients (IG) (Sundararajan et al. (2017)), DeepSHAP (Lundberg & Lee (2017)), DeepLIFT (Shrikumar et al. (2017)) and Occlusion (Zeiler & Fergus (2013)), alter the features between the original values and the values of some baseline instance, and accordingly highlight the features that impact the model's decision. While extensive research has been conducted on attribution algorithms, research on the selection of baselines is rather limited, and it is typically treated as an afterthought. Most existing methodologies by default apply a uniform-value baseline, which can dramatically impact the validity of the feature attributions (Sturmfels et al. (2020)); as a result, existing attribution methods show rather unperturbed output even after complete randomization of the DNN (Adebayo et al. (2018)). In a multi-class classification setting, existing baseline choices do not allow specifying a target class, and this limits the ability to provide a class-targeted, or one-vs-one, explanation, i.e., explaining why the input belongs to class A and not a specific class B. Such explanations are crucial when certain classes are more similar than others, as often happens when the classes form a hierarchy.
For example, in a classification task over apples, oranges and bananas, a model decision for apples vs. oranges should be based on color rather than shape, since both apples and oranges are round. Intuitively, this only happens when asking for an explanation of 'why apple and not orange' rather than 'why apple'. In this paper, we present GAN-based Model EXplainability (GANMEX), a novel methodology for generating one-vs-one explanations leveraging GANs. In a nutshell, we use a GAN to produce a baseline image that is a realistic instance of the target class and resembles the original instance. A naive use of GANs can be problematic because the generated explanation would not be specific to the to-be-explained DNN. We lay out a well-tuned recipe that avoids these problems by incorporating the classifier as a static part of the adversarial network and adding a similarity loss function to guide the generator. We show in an ablation study that both swapping in the DNN and adding the similarity loss are critical for producing correct explanations. To the best of our knowledge, GANMEX is the first method to apply GANs to explaining DNN decisions, and furthermore the first to provide a realistic baseline image rather than an ad-hoc null instance. We show that GANMEX baselines can be used with a variety of attribution methods, including IG, DeepLIFT, DeepSHAP and Occlusion, to produce one-vs-one attributions superior to existing approaches. GANMEX outperformed the existing baseline choices on perturbation-based evaluation metrics and showed more desirable behavior under the sanity checks of randomizing DNNs. Beyond its obvious advantage for one-vs-one explanations, we show that by replacing only the baselines, without changing the attribution algorithms, GANMEX greatly improves saliency maps for binary classifiers, where one-vs-one and one-vs-all are equivalent.
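The recipe above, a frozen to-be-explained classifier steering the generator toward the target class plus a similarity loss keeping the baseline close to the input, can be illustrated as a scoring function for candidate baselines. The following is a minimal numpy sketch, not the paper's implementation; the discriminator term of the GAN is omitted, and the weights `w_cls` and `w_sim` are hypothetical names for the loss-balancing coefficients:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ganmex_generator_loss(x, x_base, clf, target_class, w_cls=1.0, w_sim=0.1):
    """Score a candidate baseline x_base for input x.

    Keeps the two GANMEX-specific terms: (1) the frozen classifier clf
    should assign x_base to the target class, and (2) x_base should
    stay close to the original input x.
    """
    probs = softmax(clf(x_base))
    # Cross-entropy term pushing the baseline into the target class.
    cls_loss = -np.log(probs[target_class] + 1e-12)
    # L1 similarity term keeping the baseline near the original input.
    sim_loss = np.abs(x_base - x).mean()
    return w_cls * cls_loss + w_sim * sim_loss
```

In the full method this objective would be minimized by the generator's parameters; here it simply makes explicit why a candidate baseline that is both on-target and input-like scores best.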
2. RELATED WORKS

ATTRIBUTION METHODS AND SALIENCY MAPS

Attribution methods and their visual form, saliency maps, have been commonly used for explaining DNNs. Given an input $x = [x_1, ..., x_N] \in \mathbb{R}^N$ and model output $S(x) = [S_1(x), ..., S_C(x)] \in \mathbb{R}^C$, an attribution method for output class $c$ assigns a contribution $A_{S,c} = [a_1, ..., a_N]$ to each input feature. There are two major attribution method families: local attribution methods, which are based on infinitesimal feature perturbations, such as gradient saliency (Simonyan et al. (2014)) and gradient*input (Shrikumar et al. (2016)), and global attribution methods, which are based on feature perturbations with respect to a baseline input (Ancona et al. (2018)).

We focus on global attribution methods since they tackle the gradient discontinuity issue of local attribution methods, and they are known to be more effective at explaining the marginal effect of a feature's existence (Ancona et al. (2018)). In this paper, we discuss five popular global attribution methods:

Integrated Gradients (IG) (Sundararajan et al. (2017)) calculates a path integral of the model gradient from a baseline input $\bar{x}$ to the input $x$: $IG_i = (x_i - \bar{x}_i) \int_{\alpha=0}^{1} \partial_{x_i} S(\bar{x} + \alpha(x - \bar{x})) \, d\alpha$. The baseline is commonly chosen to be the zero input, and the integration path is the straight path between the baseline and the input.

DeepLIFT (Shrikumar et al. (2017)) addresses the discontinuity issue by performing backpropagation and assigning a score $C_{\Delta x_i \Delta t}$ to each neuron in the network based on the input difference to the baseline, $\Delta x_i = x_i - \bar{x}_i$, and the difference between the activation and that of the baseline, $\Delta t = t(x) - t(\bar{x})$, satisfying the summation-to-delta property $\sum_i C_{\Delta x_i \Delta t} = \Delta t$.

Occlusion (Zeiler & Fergus (2013); Ancona et al. (2018)) applies full-feature perturbations by removing each feature and calculating the impact on the DNN output. Feature removal is performed by replacing the feature's value with zero, meaning an all-zero input is implicitly used as the baseline.

DeepSHAP (Chen et al. (2019); Lundberg & Lee (2017); Shrikumar et al. (2017)) was built upon the framework of DeepLIFT but connects the multipliers of the attribution rule (rescale rule) to SHAP values, which are computed by 'erasing' features. The operation of erasing one or more features requires the notion of a background, which is defined by either a distribution (e.g. the uniform distribution over the training set) or a single baseline instance. For practical reasons, it is common to choose a single baseline instance to avoid having to store the entire training set in memory.

Expected Gradients (Erion et al. (2019)) is a variant of IG that calculates the expected attribution over a prior distribution of baseline inputs, usually approximated by the training set $X_T$: $EG_i(x) = \mathbb{E}_{\bar{x} \sim X_T,\, \alpha \sim U(0,1)} \left[ (x_i - \bar{x}_i) \, \partial_{x_i} S(\bar{x} + \alpha(x - \bar{x})) \right]$.
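The IG path integral is typically approximated in practice by a Riemann sum over the straight path between the baseline and the input. A minimal self-contained sketch, assuming a generic scalar-valued `model` callable and using finite-difference gradients (a real implementation would use autodiff):

```python
import numpy as np

def integrated_gradients(model, x, baseline, steps=50, eps=1e-5):
    """Approximate IG_i = (x_i - b_i) * integral of dS/dx_i along the
    straight path from the baseline b to the input x."""
    grad_sum = np.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        # Central-difference gradient of the model at this path point.
        g = np.zeros_like(x)
        for i in range(x.size):
            hi, lo = point.copy(), point.copy()
            hi[i] += eps
            lo[i] -= eps
            g[i] = (model(hi) - model(lo)) / (2 * eps)
        grad_sum += g
    return (x - baseline) * grad_sum / steps
```

For well-behaved models the attributions satisfy completeness: they sum (approximately) to $S(x) - S(\bar{x})$, which is why the choice of baseline directly shapes the resulting explanation.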

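Occlusion as described above can be sketched in a few lines; making the replacement value a parameter exposes the implicit all-zero baseline explicitly. A sketch assuming a generic scalar-valued `model` callable:

```python
import numpy as np

def occlusion_attributions(model, x, baseline_value=0.0):
    """Attribute each feature by the output drop observed when that
    feature is replaced with the baseline value (zero by default,
    i.e. an implicit all-zero baseline)."""
    base_score = model(x)
    attr = np.zeros_like(x)
    for i in range(x.size):
        perturbed = x.copy()
        perturbed[i] = baseline_value
        attr[i] = base_score - model(perturbed)
    return attr
```

On images the same idea is usually applied to patches rather than single pixels; the single-feature loop here keeps the baseline dependence easy to see.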

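Expected Gradients replaces IG's single fixed baseline with baselines sampled from the training set, estimating the expectation by Monte Carlo over baseline/interpolation pairs. A self-contained sketch under the same assumptions as above (generic `model` callable, finite-difference gradients; names hypothetical):

```python
import numpy as np

def expected_gradients(model, x, train_set, samples=100, eps=1e-5, seed=0):
    """Monte-Carlo Expected Gradients: sample a baseline from the
    training set and an interpolation coefficient alpha ~ U(0,1),
    then average the IG-style attribution terms."""
    rng = np.random.default_rng(seed)
    attr = np.zeros_like(x)
    for _ in range(samples):
        ref = train_set[rng.integers(len(train_set))]
        alpha = rng.uniform()
        point = ref + alpha * (x - ref)
        for i in range(x.size):
            hi, lo = point.copy(), point.copy()
            hi[i] += eps
            lo[i] -= eps
            grad_i = (model(hi) - model(lo)) / (2 * eps)
            attr[i] += (x[i] - ref[i]) * grad_i
    return attr / samples
```

Note the trade-off this sketch makes concrete: the explanation no longer depends on a single hand-picked baseline, but it requires access to (a sample of) the training set at explanation time.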