GANMEX: ONE-VS-ONE ATTRIBUTIONS USING GAN-BASED MODEL EXPLAINABILITY

Abstract

Attribution methods have been shown as promising approaches for identifying key features that led to learned model predictions. While most existing attribution methods rely on a baseline input for performing feature perturbations, limited research has been conducted to address the baseline selection issues. Poor choices of baselines limit the ability of one-vs-one explanations for multi-class classifiers, which means the attribution methods were not able to explain why an input belongs to its original class but not the other specified target class. Achieving one-vs-one explanation is crucial when certain classes are more similar than others, e.g. two bird types among multiple animals, by focusing on key differentiating features rather than shared features across classes. In this paper, we present GANMEX, a novel approach applying Generative Adversarial Networks (GAN) by incorporating the to-be-explained classifier as part of the adversarial networks. Our approach effectively selects the baseline as the closest realistic sample belong to the target class, which allows attribution methods to provide true one-vs-one explanations. We showed that GANMEX baselines improved the saliency maps and led to stronger performance on perturbation-based evaluation metrics over the existing baselines. Existing attribution results are known for being insensitive to model randomization, and we demonstrated that GANMEX baselines led to better outcome under the cascading randomization of the model.

1. INTRODUCTION

Modern Deep Neural Network (DNN) designs have been advancing the state-of-the-art performance of numerous machine learning tasks with the help of increasing model complexities, which at the same time reduces model transparency. The need for explainable decision is crucial for earning trust of decision makers, required for regulatory purposes Goodman & Flaxman (2017) , and extremely useful for development and maintainability. Due to this, various attribution methods were developed to explain the DNNs decisions by attributing an importance weight to each input feature. In high level, most attribution methods, such as integrated gradient (IG) (Sundararajan et al. (2017) ), DeepSHAP (Lundberg & Lee (2017)), DeepLift (Shrikumar et al. (2017) ) and Occlusion (Zeiler & Fergus (2013) ), alter the features between the original values and the values of some baseline instance, and accordingly highlight the features that impacts the model's decision. While extensive research has been conducted on the attribution algorithms, research regarding the selection of baselines is rather limited, and it is typically treated as an afterthought. Most existing methodologies by default apply a uniform-value baseline, which can dramatically impact the validity of the feature attributions (Sturmfels et al. (2020) ), and as a result, existing attribution methods showed rather unperturbed output even after complete randomization of the DNN (Adebayo et al. (2018) ). In a multi-class classification setting, existing baseline choices do not allow specifying a target class, and this has limited the ability for providing a class-targeted or one-vs-one explanation, meaning explaining why the input belongs to class A and not a specific class B. These explanations are crucial when certain classes are more similar than others, as often happens for example when the classes have a hierarchy among them. For example, in a classification task of apples, oranges and bananas, a model decision for apples vs oranges should be based on their color rather than the shape since both an apple and orange are round. This would intuitively only happen when asking for an explanation of 'why apple and not orange' rather than 'why apple'. In this paper, we present GAN-based Model EXplainability (GANMEX), a novel methodology for generating one-vs-one explanations leveraging GAN. In a nutshell, we use GANs to produce a baseline image which is a realistic instance from a target class that resembles the original instance. A naive use of GANs can be problematic because the explanation generated would not be specific to the to-be-explained DNN. We lay out a well-tuned recipe that avoids these problems by incorporating the classifier as a static part of the adversarial networks and adding a similarity loss function for guiding the generator. We showed in the ablation study that both swapping in the DNN and adding the similarity loss are critical for resulting the correct explanations. To the best of our knowledge, GANMEX is the first to apply GAN for explaining DNN decisions, and furthermore the first to provide a realistic baseline image, rather than an ad-hoc null instance. We showed that GANMEX baselines can be used with a variety of attribution methods, including IG, DeepLIFT, DeepSHAP and Occlusion, to produce one-vs-one attribution superior compared with existing approaches. GANMEX outperformed the existing baseline choices on perturbation-based evaluation metrics and showed more desirable behavior under the sanity checks of randomizing DNNs. Other than its obvious advantage for one-vs-one explanations, we show that by replacing only the baselines and without changing the attribution algorithms, GANMEX greatly improves the saliency maps for binary classifiers, where one-vs-one and one-vs-all are equivalent.

2. RELATED WORKS

2.1 ATTRIBUTION METHODS AND SALIENCY MAPS Attribution methods and their visual form, saliency maps, have been commonly used for explaining DNNs. Given an input x = [x 1 , ..., x N ] ∈ R N and model output S(x) = [S 1 (x), ..., S C (x)] ∈ R C , an attribution method for output i assign contribution to each pixel A S,c = [a 1 , ..., a N ]. There are two major attribution method families: Local attribution methods that are based on infinitesimal feature perturbations, such as gradient saliency (Simonyan et al. (2014) ) and gradient*input (Shrikumar et al. (2016) ), and global attribution methods that are based on feature perturbation with respect to a baseline input (Ancona et al. (2018) ). We focus on global attribution methods since they tackle the gradient discontinuity issue in local attribution methods, and they are known to be more effective on explaining the marginal effect of a feature's existence (Ancona et al. (2018) ). In this paper, we discussed five popular global attribution methods below: Integrated Gradient (IG) (Sundararajan et al. (2017) ) calculates a path integral of the model gradient from a baseline image x to the input image x: IG i = (x i -xi ) 1 α=0 ∂ xi S(x + α(x -x))dα. The baseline is commonly chosen to be the zero input and the integration path is selected as the straight path between the baseline and the input. DeepLIFT (Shrikumar et al. (2017) ) addressed the discontinuity issue by performing backpropagation and assigns a score C ∆xi∆t to each neuron in the networks based on the input difference to the baseline ∆x i = x i -xi and the difference in the activation to that of the baseline ∆t = t(x) -t(x), that satisfies the summation-to-delta property i C ∆xi∆t = ∆t. Occlusion (Zeiler & Fergus (2013) ; Ancona et al. (2018) ) applies full-feature perturbations by removing each feature and calculating the impacts on the DNN output. The feature removal was performed by replacing its value with zero, meaning an all zero input was implicitly used as the baseline. DeepSHAP (Chen et al. (2019) ; Lundberg & Lee (2017); Shrikumar et al. (2017) ) was built upon the framework of DeepLIFT but connecting the multipliers of attribution rule (rescale rule) to SHAP values, which are computed by 'erasing features'. The operation of erasing one or more features require the notion of a background, which is defined by either a distribution (e.g. uniform distribution over the training set) or single baseline instance. For practical reasons, it is common to choose a single baseline instance to avoid having to store the entire training set in memory. Expected Gradient (Erion et al. (2019) ) is a variant of IG that calculates the expected attribution over a prior distribution of baseline input, usually approximated by the training set X T , meaning EG i = Ex∼X T α∼U (0,1) (x -x) i ∂ xi S(x + α(x -x)) where U is the uniform distribution. In other words, the baseline of IG is replaced with a uniform distribution over the samples in the training set. A crucial property of the above methods is their need for a baseline, which is either explicitly or implicitly defined. In what follows we show that these methods are greatly improved by modifying their baseline to that chosen by GANMEX.

2.2. THE BASELINE SELECTION PROBLEM

Limited research has been done on the problem of baseline selection so far. A simple "most natural input", such as zero values of all numerical features is commonly chosen as the baseline. For image inputs, uniform images with all pixels set to the max/min/medium values are commonly chosen. The static baselines frequently cause the attribution to only focus on or even overly highlight the area where the feature values are different from the baseline values, and hide the feature importance where the input values are close to the baseline values (Sundararajan & Taly (2018) ; Adebayo et al. (2018) ; Kindermans et al. (2017) ; Sturmfels et al. (2020) ). Several none-static baselines have been proposed in the past, but each of them suffered from its own downsides (Sturmfels et al. (2020) ). Fong & Vedaldi (2017) used blurred images as baselines, but the results are biased toward highlighting high-frequency information from the input. Bach et al. (2015) make use of the training samples by finding the training example belonging to the target class closest to the input in Euclidean distance. Even though the concept of minimum distance is highly desirable, but in practice, the nearest neighbor selection in high dimensional space can frequently lead to poor outcome, and most of the nearest neighbors are rather distant from the original input. Along the same concept, expected gradient simply samples over all training instances instead of identifying the closest instance (Erion et al. (2019) ). Expected gradient benefits from ensembling in a way similar to that of SmoothGrad, which averages over multiple saliency maps produced by imposing Gaussian noise on the original image (Smilkov et al. (2017) ; Hooker et al. (2019) ). We claim however that averaging over the training set does not solve the issue; for example, due to the foreground being located in different sections of the images, the average image would often resemble a uniform baseline.

2.3. ONE-VS-ONE AND ONE-VS-ALL ATTRIBUTION

In multi-class settings, while one-vs-all explanation A S,co (x) was designed to explain why the input x belong to its original class c o and not the others, one-vs-one explanations aim to provide an attribution A S,co→ct (x) ∈ R N that explains why x belong to c o and not the specified target class c t . Most existing attribution methods were primarily designed for one-vs-all explanation, but was proposed to extend to one-vs-one by simply calculating the attribution with respect to the difference of the original class probability to the target class probability S diff (x) = S co (x)-S ct (x) (Bach et al. (2015) ; Shrikumar et al. (2017) ). It is easy to think of examples where this somewhat naive formulation will not provide correct one-vs-one explanation. Taking the example of fruit classification, for both apples and oranges the explanation could easily be the round shape, and taking the difference between those will result in an arbitrary attribution. We claim that without a class-targeted baseline, the modified attributions will still omit the "vs-one" aspect of the one-vs-one explanation. Take IG for example, A S,diff (x) = A S,co (x)-A S,ct (x). With zero baseline, the target class score S ct (x) and its gradient will likely stay close to zero along the straight path from the input to the zero baseline, meaning that A S,ct (x) ≈ 0 because the instance never belongs to the target class. With this in mind the one-vs-one explanation is not very informative with respect to the target class c t . Few class-targeted baselines were proposed in the past. The minimum distance training sample (MDTS) described in Section 2.2 is class-targeted as the sample was selected from the designated target class. While the original expected gradient was defined for one-vs-all explanation only, we extended the method to one-vs-one by sampling the baselines only from the target class. However, as mentioned in Section 2.2, MDTS is frequently hindered by the sparsity of the training set in the high dimensional space, and expected gradient suffers from undesired effects caused by uncorrelated training samples. The problem of baseline selection, especially for the one-vs-one explainability, has presented a challenging problem, because the ideal baseline choice can simply be absent from the training set.

2.4. GAN AND IMAGE-TO-IMAGE TRANSLATION

Image-to-Image Translation is a family of GAN originally introduced by Isola et al. (2017) for creating mappings between two domains of data. While the corresponding pairs of images are rare in most real-world dataset, Zhu et al. (2017) has made the idea widely applicable by introducing a reconstruction loss to tackle the tasks with unpaired training dataset. Since then, more efficient and better performing approaches have been developed to improve few-shot performance (Liu et al. (2019) ) and output diversity (Choi et al. (2020) ). Nevertheless, we found the StarGAN variant proposed by Choi et al. (2017) specifically applicable to the baseline selection problem because of its standalone class discriminator in the adversarial networks as well as the deterministic mapping that preserve the styles of the translated images (Choi et al. (2020) ). GANs have not been applied for explaining DNNs in the past to our best knowledge. Prior to our work, Chang et al. (2018) proposed the fill-in the dropout region (FIDO) methods and suggested generators including CA- GAN Yu et al. (2018) for filling in the masked area. However, the CA-GAN generation was designed for calculating the smallest sufficient region and smallest destroying region Dabkowski & Gal (2017) that only produced 1-vs-all explanations. FIDO is computationally expensive as an optimization task is required for each attribution map. The fill-in method requires an unmasked area for reference, hence only works for a small subset of attribution methods. More importantly, the FIDO is highly dependent on the generator's capability of recreating the image based on partially masked features. With pre-trained generators like CA-GAN, we argue that the resulting saliency map is more associated with the pre-trained generator instead of the classifier itself.

3. GAN-BASED MODEL EXPLAINABILITY

It has been previously established that attribution methods are more sensitive to features where the input values are the same as the baseline values and less sensitive to those where the input values and the baseline values are different (Adebayo et al. (2018) ; Sundararajan & Taly (2018) ). Therefore, we expect a well-chosen baseline to differ from the input only on the key features. Good candidates for achieving this would be the sample in the target class but with minimum distance to the input. Formally, for a one-vs-one attribution problem A S,co→ct (x), We define the class-targeted baseline to be the closest point in the input space (not limited to the train set) that belongs to the target class B ct (x) = arg min x∈Gc t x -x Here, G ct is the set of realistic examples in the target class, and • is the Euclidean distance. By using this baseline we have A S,co→ct (x, B ct (x)) providing the explanations as to why input x belongs to its original class c o and not class c t . Now, since it isn't realistic to optimize within the actual set G ct we work with a softer version of Equation 1: B ct (x) = arg min x∈R N ( x - x -log T (x, c t )). where T (x, c t ) represent the probability of x belonging to the target class, meaning x ∈ G ct . Given a classifier S(x) = [S 1 (x), ..., S C (x)], we have the estimate S c (x) to the probability of a realistic image x to be in class c. In order to make use of this we decompose T (x, c t ) = R(x)S ct (x) where R(x) indicates the probability of x being a realistic image. We end up with the following objective for the baseline instance. B ct (x) = arg min x∈R N ( x -x -log R(x) -log S ct (x))

3.1. APPLYING STARGAN TO THE CLASS-TARGETED BASELINE

Here we introduce GAN-based Model EXplainability (GANMEX) that uses GAN to generate the class-targeted baselines. Given an input x and a target class c t , GANMEX aims to generate a classtargeted baseline G(x, c t ) that achieve the three following objectives: 1. The baseline belongs to the target class (with respect to the classifier). 2. The baseline is a realistic sample. 3. The baseline is close to the input. To further explain the need for all 3 objectives, we point the reader to Figure 1 . The GANMEX baseline represents the "closest and realistic target class baseline". Without the assistance of GANs, the selected baseline can easily either fall into the domain of unrealistic image. A naive fix will choose a realistic image from the training set, but that will not be close to the input. Finally, for correct one-vs-one explainability we need the baseline to belong to a specific target class. We have provided more intuitions behind the baseline selection requirements in Appendix G. We chose StarGAN (Choi et al. (2017) ) as the method for computing the above T or rather R function. Although many Image-to-Image translation methods could be applied to do so, StarGAN inherently works with multi-class problems, and allows for a natural way of using the already trained classifier S as a discriminator, rather than having us train a different discriminator. StarGAN provides a scalable image-to-image translation approach by introducing (1) a single generator G(x, c) accepting an instance x and a class c, and producing a realistic example x in the target class c, (2) two separate discriminators: D src (x) for distinguishing between real and fake images, and D cls (x, c) for distinguishing whether x belongs to class c. It introduced following loss functions L adv = E x [log(D src (x))] + E x,c [log(1 -D src (G(x, c)))] (3) L r cls = E c ,x∈c [-log(D cls (c |x))] L f cls = E x,c [-log(D cls (c|G(x, c)))] (5) L rec = E c,c ,x∈c [ x -G(G(x, c), c ) 1 ] Here, E • [] defines the average over the variables in the subscript, where x is an example in the training set, and c, c are classes. L adv is the standard adversarial loss function between the generator and the discriminators, L r cls and L f cls are domain classification loss functions for real images and fake images, respectively, and L rec is the reconstruction loss commonly used for unpaired image-to-image translation to make sure two opposite generation action will lead to the original input. The combined loss functions for the generator and the discriminator are L D = -L adv + λ r cls L r cls (7) L G = L adv + λ f cls L f cls + λ rec L rec (8) The optimization procedure for StarGAN alternates between modifying the discriminators D src (x), D cls (x, c) to minimize L D , and the generator G to minimize L G . Equation 8 is almost analogical to equation 2. The term L adv corresponds to -log R(x) and the term λ f cls L f cls corresponds to the term log S ct (x). There is a mismatch between the term λ rec L rec and x -x . One forces the generator to be invertible, while the other forces the generated image to be close to the original. We found that the L rec term is useful to encourage the convergence of the GAN. However, a similarity term x -G(x, c) is also needed in order for the baseline image to be close to the origin -this allows for better explainability. We show in what follows (Figure 7 .B, Appendix D) that without this similarty term, the created image can indeed be farther away from the origin. Other than the added similarity term, for GANMEX we replace the discriminator D cls (c|x) with the classifier S c (x), since as mentioned above, this way the generator provides a baseline adapted to our classifier. Concluding, we optimize the following term for the generator L f G = log(1 -D src (x)) -λ f cls log(S c (x)) + λ rec x -G(x, c ) 1 + λ sim x -x 1 (9) where x is short for G(x, c). Notice that we used L1 distance rather than L2 for the similarity loss, because L2 distance leads to blurring outputs for image-to-image translation algorithms (Isola et al. (2017) ). Other image-to-image translation approaches can potentially select baselines satisfying the criteria (2) and (3) above, but they lack the replaceable class discriminator component, that is crucial for explaining the already trained classifier. We provide several ablation studies in Appendix D where we show that without incorporating the to-be-explained classifier to the adversarial networks, the GAN generated baselines will fail the randomization sanity checks. We provide more implementation details including hyper-parameters in Appendix A.2. 

4.1. ONE-VS-ONE ATTRIBUTION FOR MULTI-CLASS CLASSIFIERS

We tested the one-vs-one attribution on two multi-class datasets -MNIST, SVHN, and CIFAR10. As shown in Figure 2 .A, the GANMEX baseline successfully identified the closest transformed image in the target class as the baseline. Take explaining why 0 and not 6 for example, the ideal baseline would keep the "C"-shape part unchanged, and only erase the top-right corner and complete the lower circle, which was achieved by GANMEX. Limited by the training space, MDTS baselines were generally more different from the input image. Therefore, the explanation made with respect to GANMEX baselines were more focused on the key features compared to that of the MDTS baseline and expected gradient. We observed the same trends across more numbers, where GANMEX helps IG, DeepLIFT, Occlusion and DeepSHAP disregard the common strokes between the original and targeted digits, and focusing only on the key differences. The out-performance of GANMEX was even more obvious in the SVHN datasets, where the numbers can have any font, color, and background, and in CIFAR10, which has more complexity and diversity. Notice that both the zero baseline and training set baseline cause the explanation to have more focus on the background, and in contrast, the GANMEX example focuses only on the key features that would cause the digit to change. Zero baselines, on the other hand, were generally unsuccessful in making one-vs-one explanations. The attributions on MNIST look similar to the original input and ignores everything in the background, and the attributions on SVHN were rather noisy. As shown in Figure 8 , attributions based on zero baselines only changed marginally with different target classes. This shows that purposely designed class-targeted baselines are required for meaningful one-vs-one explanation.

Mis-classification Analysis

We next demonstrated how one-vs-one saliency maps can be used for trouble-shooting mis-classification cases. For an input x that belongs to class c o but was misclassified as class c m . A S,cm→co (x) = A S,cm→co (x, B co (x)) provides explanation to why x belongs to c m and not c o with respect to the trained classifier, and this will help human understand how the classifier has led to the incorrect decision. We provided examples in Figure 2 .B where the samples were mis-classified. For MNIST and SVHN samples, the mis-classification mostly happened when the digits were presented in a non-typical way. The GANMEX baseline B co (x) show how a more typical digit should have been written according to the trained classifier, and the attribution A S,cm→co (x) highlights the area that led to the mis-classification. CIFAR10 presented more complex classification challenges, and the classifier can easier confused ship with an airplane because of the pointy front and the lack of sea horizon, or a dog with a cat because of the shape of the ears and the nose, and those areas were highlighted in the one-vs-one saliency maps. Perturbation-based evaluation We followed the perturbation-based evaluation suggested by Bach et al. (2015) that flips input features starting from the ones with the highest saliency values and evaluates the cumulative impacts on the score delta S co -S ct as proposed by Shrikumar et al. (2017) . Flipping a feature means to provide with a value of 1 -x where x is its original value, assuming all features are normalized to x ∈ [0, 1]. A wanted behavior from the attribution map is that the score delta will decrease as rapidly as possible as we flip the features one by one. We provide in Figure 3 .A-C the perturbation curves for both MNIST, SVHN and CIFAR10, plotting the score delta as a function of the number of flipped features. It is painfully clear that that by using a GANMEX baseline rather than the alternative zero baseline, the descent of the curve is much faster, meaning that we successfully capture the most important features using GANMEX. This holds true for all attribution methods. As a side note, notice that in SVHN when we flip all features the score delta goes back to where it was in the beginning as opposed to going down to zero. This is due to the fact that once all features are flipped, we are back to having the same digit as before. Gini-index We calculated the Gini Index representing the sparseness of the saliency maps as proposed by Chalasani et al. (2018) , where a larger score means sparser saliency map, which is a desired property. Figure 3 .D-F shows our experiments comparing the different techniques with a zero baseline vs. GANMEX. Other than IG/SVHN where GANMEX has a visible advantage, the results are roughly the same; we suspect that the sparseness of zero baseline attribution was benefited from incorrectly hiding key features, as shown in Figure 2 .A and 3.A-C. Expected gradient, on the other hand, consistently under-performs its counterpart (IG+GANMEX) in both datasets. Cascading randomization We performed the sanity checks proposed by Adebayo et al. (2018) that performs cascading randomization from top to bottom layers of the DNN and observe the changes in the saliency maps. Specifically, layer by layer we replace the model weights with Gaussians random variables scaled to have the same norm. For meaningful model explanations, we would expect the attributions to be gradually randomized during the cascading randomization process. In contrast, unperturbed saliency maps during the model randomization would suggest that the attributions were based on general features of the input and not specifically based on the trained model. Figure 3 .G shows the experiment on MNIST data with the network layers named (input to output) CNN1, CNN2, FC, Output. It shows that even though the saliency maps generated by the original IG, DeepLIFT and Occlusion were rather unperturbed (still showing the shape of the digit) after the model randomization, with the help of GANMEX, both the baselines and the saliency maps were perturbed over the cascading randomization. Expected gradient, while showing more randomization compared to the zero baseline saliency maps, still roughly shows the shape of the digit throughout the sanity check.

4.2. ATTRIBUTION FOR BINARY CLASSIFIERS

In addition to the one-vs-one aspect, GANMEX generally improves the quality of the saliency maps compared with the existing baselines, and this can be tested on binary datasets where the one-vs-one explanations and the one-vs-all explanations are equivalent. For apple2orange dataset, conceptually apples and oranges both have round shapes but have different colors, so we would expect the saliency maps on a reasonably performing classifier to highlight the colors of the fruit, but not the shapes, and definitely not the background. In Figure 4 and Figure 9 we compared the saliency map generated by DeepLIFT and IG with the zero input, max input, blurred image, with those generated with the GANMEX baselines. In all non-GANMEX baselines (Zero, Max, Blur) we commonly observe one of two errors in the saliency map. The first error consists of highlighting the background. The second highlights only the edge of the apple(s) providing the false indication that the model is basing its decisions on the shape of the object rather than its color. It is quite clear that neither of these errors occur when using the GANMEX baselines as the background is never present and the full shape of the apple(s)/orange(s) is highlighted.

5. CONCLUSION AND FUTURE WORK

We have proposed GAN-based model explainability, a novel approach for generating one-vs-one explanation baselines without being constrained by the training set. We used the GANMEX baselines in conjunction with IG, DeepLIFT, SHAP, and Occlusion, and to our surprise, the baseline replacement was all it takes to address the common downside of the existing attribution methods (blind to certain input values and fail to randomize with the model randomization) and significantly improve the one-vs-one explainability. The out-performance was demonstrated through perturbation-based evaluation, sparseness measures, and cascading randomization sanity checks. The one-vs-one explanation achieved by GANMEX opens up possibilities for obtaining more insights about how DNNs differentiate similar classes. While GANMEX showed promising results on explaining binary classifiers, where one-vs-all and one-vs-one explanations are directly comparable, open questions remain on how to apply GANMEX to one-vs-all explainability for multi-class classifiers, and how to best optimize the GAN component to effective generate baselines for classification tasks with large number of classes.

A IMPLEMENTATION DETAILS

A.1 DATASETS AND CLASSIFIERS MNIST (LeCun & Cortes ( 2010)) The classifier consists of two 6x6 CNN layers with a stride of 2, followed by a 256-unit fully connected layer, a dropout layer with p = 0.5, and the 10 output neurons. As shown in Springenberg et al. (2015) the stride>1 CNN achieved comparable performance with pooling layers. The classifier was trained for 50 epochs and achieve a test accuracy of 99.3%. 2017)), and then followed by a dense layer of 1024 neurons and a dropout layer of p = 0.5 before the output neurons. The classifier was trained for 50 epochs and achieve a test accuracy of 87.7%.

A.2 BASELINE GENERATION WITH GANMEX

Our baseline generation process is based on StarGAN (Choi et al. (2017) ). We used the Tensorflow-GAN implementation (https://github.com/tensorflow/gan) and made the following two modifications (Equation 9): 1. The class discriminator D cls is replace by the target classifier S to be explained. 2. A similarity loss L sim is added to the training objective function. We train the GANMEX model for 100k steps for the MNIST and apple2orange datasets, 300k steps for the SVHN dataset, and 400k steps for the CIFAR10 dataset. Only the train split is used for training, and the attribution results and evaluation were done on the test split of the dataset.

A.3 ATTRIBUTION METHODS

We used DeepExplain (https://github.com/marcoancona/DeepExplain) for generating saliency maps with IG, DeepLIFT, and Occlusion. We modified the code base to use the score delta (S co -S ct ) instead of the original class score (S ct ) and allowing replacing the zero baseline (see Section 2.1) by custom baselines from GANMEX and MDTS. Expected gradient was separately implemented according to the formulation in Erion et al. (2019) . We set the number of sampling steps to 200 for both IG and Expected Gradient, and used Occlusion-1 that only perturb the pixel itself (as supposed to perturbing the whole neighboring patch of pixels). The DeepSHAP saliency maps were calculated using SHAP (https://github.com/slundberg/shap). We made similar modification to replace the original class score by the score delta and feed in the custom baseline instances. In all saliency maps shown in the paper, blue color in indicates positive values and red color indicates negative values. We skipped Occlusion for large images (apple2orange) and also skipped SHAP for full dataset evaluations due to the computation resource constraints.

B BASELINE DISTANCE ANALYSIS

To measure how various baseline selection approaches satisfy the minimum distance requirements in Equation 1, we calculated D(x, x) = x-x for (1) GANMEX, (2) MDTS, (3) a randomly selected sample in the target class as baseline and (4) zero baseline. GANMEX was on-par with MDTS on the MNIST dataset, but on SVHN and CIFAR10 dataset that have more degrees of freedom (object Table 1 : Baseline distance analysis comparing the average intra-class distance (Dintra, sampled) and the average inter-class distance (Dinter, sampled), with the average distance from the instance to the baseline input generated by GANMEX (GAN), MDTS, random selection (RAND), and zero inputs (Zero). size, color, orientation, background, ...), GANMEX was significantly better in identifying minimum distance baselines compared to the in-sample search. The high dataset complexity of was supported by the average intra-class distance, the average distance between any two instances within the same class, which was higher than that of MNIST. Note that the resulting sample to baseline distance D(x, xGANMEX ) is much higher in MNIST than in SVHN, because there were more boundary values (0s and 1s) in MNIST. We further evaluated the similarity distance on the vertical edge area (D edge (x, x)) of the SVHN images (Figure5.A). Empirically, we observed that the digits of interest were rarely present in the vertical edge, and therefore, we would expect a closest baseline choice will lead to minimal D edge (x, x) under the minimum distance requirements. We provided a histogram in Figure 5 .B for comparing the distribution of D edge (x, x) for MDTS and GANMEX, and we presented sampled success/failure cases in Figure 5 .C-E. Overall, GANMEX leads to baselines that are closer to the original samples.

C ADDITIONAL METRICS

Here we evaluated the different baseline choices with additional metrics (Table 2 ). AOPC L measures the area over the perturbation curve within the first L perturbation steps (Samek et al. (2017) ; Tomsett et al. (2019) ). One potential downside of AOPC L is that the metric is only sensitive to the top L features in the saliency map and not the rest. Therefore, in addition to AOPC L=100 , we calculated AOPC all , the area over the perturbation curve across all feature. The gradient family of IG and expected gradient generally outperformed DeepLIFT and occlusion on the AOPC metrics, with IG+GANMEX performed the best overall. The sparsity (Chalasani et al. (2018) ) measured by the Gini index is a desirable property for onevs-one attribution. We expect a good one-vs-one explanation to highlight only the differentiating features. Compared with one-vs-all saliency maps, one-vs-one saliency maps should highlight a smaller subset of features, especially when the target classes are similar to the original classes. Therefore, one would expect more sparse one-vs-one saliency maps are more likely to be correct. Our results showed that saliency maps generated by IG+GANMEX and IG+MDTS have higher Gini index and therefore are more sparse compared to other methods (Table 2 ). We also measured the faithfulness reported by Alvarez-Melis & Jaakkola (2018); Tomsett et al. (2019) and monotonicity suggested by phi Nguyen & Martínez (2020). Instead of measuring the cumulative effect of alternating a set of features, both faithfulness and monotonicity measure the impacts on alternating single features. We found that expected gradient and Occlusion+zero baseline are the best performers on those two metrics. Lastly, we designed an inverse localization metric to measure whether the explanation is localized in the focus area. We observed that the digits mostly have < 1 aspect ratios, meaning that their widths are smaller then their heights. As a results, the areas at the two vertical edges are generally not covered by the primary numbers, and instead, they usually show the background or the neighboring numbers. Therefore, we can reasonably expect the saliency map sensitivity to be location in the center area (area excluding the vertical edges), and not the vertical edge area (Figure 5 .A). Based on this observation, we designed the inverse localization metric, L(A(x)) = ( 1 card(Sedge) i∈Sedge |A i (x)|)/( 1 card(Scenter) i∈Scenter |A i (x)|), that calculates the ratio of the average absolute sensitivity between the vertical edge area and the center area. S center and S edge represent the feature set in the center area and the edge area, respectively. card(.) measures the cardinality of feature sets, and x and A(x) are the sample and the corresponding saliency map. A lower L(A(x)) would mean that the saliency map A(x) is more localized in the focus area. As shown in Table 2 , we see a consistent trend of the saliency maps with GANMEX baselines being more localized (lower inverse localization) compared to MDTS baselines across all attribution methods, and expected gradient and the zero baselines generally lead to the worst results. The saliency maps produced by occlusion+GANMEX was the most localized among all the methods tested. To summarize, we evaluated multiple attribution methods and baseline combinations with metrics that assess different properties of the saliency maps. While the results of faithfulness and monotinicity were less consistent, GANMEX has consistently lead to better metrics on inverse localization, AOPC L , AOPC all and sparsity, especially when compared with the non-class-targeted zero baseline that were widely used in the field.

D ABLATION STUDIES

Here we analyzed the possibilities of using other GAN models. Different from StarGAN, most other image-to-image translation algorithms do not have a stand-alone class discriminator that can be swapped with a trained classifier. To simulate such restrictions, we trained a similar GAN model but with the class discriminator trained jointly with the generator from scratch. Figure 7 .A shows that while the stand-alone GAN yields similar baseline with GANMEX, both the baselines and saliency maps of the stand-alone GAN remains unperturbed under cascading randomization of the model. This indicates that the class-wise explanations provided by stand-alone GAN were not specific to the to-be-explained classifier. The importance of the similarity loss in Equation 9 can be demonstrated on a colored-MNIST dataset, where we randomly assigned the digits with one of the three colors {red, green, blue}, with labels of the instances remain unchanged from the original MNIST labels of {0, ..., 9}. The classifier was trained with the same model architecture and training process as for MNIST. The dataset demonstrated different modes (colors in this case) that are irrelevant to the labels, and we would expect the class-targeted baseline for x would be another instance that has the same color as x. Figure 7 .B shows that the similarity loss is the crucial component for ensuring that the baseline has the same color with the input. Without the similarity loss, the generated baseline instance can easily have a different color with the original image. The reconstruction loss itself does not provide the same-mode constraint because a mapping of G(red) → green and G(green) → red does not get penalized by the reconstruction loss. While the reconstruction loss was not required for GANMEX and the same-mode constraint, we observed that some degrees of reconstruction loss help GANs converge faster.

E HYPER-PARAMETER ANALYSIS

We tested how the generated baselines change with respect to the hyperparemeters in the GANMEX loss function. The hyper-parameters, λ f cls , λ rec , and λ sim , presented in Equation 9 control the degrees of the classification loss, reconstruction loss, and similarity loss, respectively. We performed the hyper-parameter scan on the SVHN dataset as it has enough complex and yet simple enough for visually assessing the attribution. Table 3 : Run Time Analysis comparing the attribution computation time for IG, expected gradient (EG), DeepLIFT (DL) and Occlusion (Occ), as well as the baseline generation time for EG, GANMEX (GAN), and MDTS. The computation was performed on a single Tesla V100 GPU, and the compute time was measure in seconds on calculation over all samples for the dataset, and the baseline generation time is the additional compute time on top of the attribution methods. ( §) We selected the same sampling number for IG and EG, and the baseline selection time of EG was estimated by the complexity difference of EG and IG. ( †) The attribution inference time for CIFAR10 was measured in 10 separate batches due to the memory constraint. ( ‡) MDTS search was performed on CPU instead of GPU. Classification Loss (λ f cls ) Low classification loss tended to make some transformation unsuccessful, and high classification loss introduced additional noise that make the images unrealistic. Similarity Loss (λ rec ) Similarity loss is the key component for minimum distance optimization. As we have shown in Section D and Figure 7 .B, at zero similarity loss, the generator is only constraint by the reconstruction loss and can lead to incorrect font colors and background. High similarity loss, on the other hand, makes the baselines to be too similar to the original images. Reconstruction Loss (λ sim ) As we have mentioned in Section D and Figure 7 .B, reconstruction loss is not required for GANMEX, but it slightly helps GAN to converged. In contrast, high reconstruction loss can lead to incorrect outputs.

F COMPUTE TIME ANALYSIS

In Table 3 , we measure the GANMEX compute time compared with various attribution methods. While the GAN component takes 5-23 hours to train depending on the datasets, the inference step only requires one single forward operation, and the compute time (MNIST: 9.3 ms, SVHN: 15.4 ms, CIFAR10: 96.0 ms) is at the same order with IG and Occlusion. More experiment details are provided in the caption of Table 3 .

G INTUITIONS BEHIND THE MINIMUM DISTANCE REQUIREMENTS

Here we present the intuitions behind the baseline selection criteria from Section 3.1 using a simplified formulation. Assuming a transformation ρ projecting from a set of high-level concept variables V to a sample x, with x = ρ(V ), and we can separate V into three groups V = {V dis , V con , V irr }. Here, V dis are the discriminating variables that leads to the model decision, for example, the color the fruits in our apply/orange dataset; V con are the contingent variables that are independent to the model decision but correlate with how the discriminating variables are presented (eg. size and location of the fruits); V irr are the irrelevant variables that are both independent to the model decision and uncorrelated to the presentation of the discriminating variables (eg. background colors of the images). The expected one-vs-all explanation under the formulation would be A co (x) = A co (ρ(V )) = A(ρ(V dis co , V con , V irr )) = α co (V dis co , V con ) (10) with α being a transformation from the underlying concept variables to the explanation. Here we assumed a correct mapping α should be independent of V irr because the variable set has no impact on the discriminating variables themselves or how discriminating variables are presented. Now if we apply the concept of one-vs-one, we expect a one-vs-one explanation to produce where A c0→ct and B ct were defined in Section 3, and V dis co and V dis ct are the discriminating variables for class c o and c t . In the apple to orange example, V dis co would represent red color, and V dis ct would represent orange color. A c0→ct (x, B ct (x)) = α c0→ct (V dis co , V dis ct , V con ) For a baseline gerenative function B ct (x) = B ct (ρ(V dis , V con , V irr )) = ρ( Ṽ dis , Ṽ con , Ṽ irr ), where Ṽ dis , Ṽ con and Ṽ irr are the concept variables for generated baseline input. We can write out A c0→ct (x, B ct (x)) = A c0→ct (ρ(V dis , V con , V irr ), ρ( Ṽ dis , Ṽ con , Ṽ irr )) Although A c0→ct can be designed to be independent of V con and Ṽ con , we still need additional constraint to make Equation 12 satisfy the form of Equation 11. While one can apply additional constraints on A c0→ct to satisfy such requirements, an alternative approach would be requiring B ct to ensure the following. Ṽ dis = V dis ct (13) Ṽ con = V con (14) Ṽ irr = V irr (15) Equation 13 requires the baseline to belong to the target class c t , and this implies that without additional constraints to A c0→ct , a class-targeted one-vs-one baseline is required for correct one-vsone explanations. Equation 14 and Equation 15 combined have led to the closest input requirements in Section 3.1. Assuming a smooth transformation (ρ), minimizing the distance of x -B ct (x) provides an effective way of ensuring Equation 14and Equation 15. Going back to the apple/orange example, a baseline satisfying Equation 13-15 for an apple image input would be an image with an orange fruit of the same size, at the same location, with the same background to the original input, and all of the above can be achieved by the minimum distance sample described in Section 3.1. 



Figure 1: Intuition of using GANs for generating class-targeted baselines in SVHN dataset. Without GANs, a closest target class sample can easily be unrealistic (a), while the GAN helps confine the sample in the realistic sample space (b). The MDTS baseline and other training samples used in expected gradient can be very different from the input (c). (d) shows the zero baseline that is the most commonly used.

Figure 2: (A) Saliency maps for multi-class datasets (MNIST, SVHN, CIFAR10) generated with various baselines, including zero baseline (Zero), MDTS and GANMEX, with classes co → ct indicated for each example. (B) Mis-classification analysis showing with the (mis-classified class, correct class) pairs. The baseline columns show the expected images generated by GANMEX for the correct classes, and the saliency maps show the explanation of "why not the correct class" produce by IG, DeepLIFT (DL) and occlusion (Occ).

Figure 3: (A-C) Perturbation-based evaluation plots for MNIST and SVHN, respectively. The dashed lines represent the non-class-targeted baselines and the solid lines represent class-targeted baselines. (D-F) Gini indices, with the yellow bars represent saliency maps with zero baselines and the green bars represent that of GANMEX baselines. (G) Sanity checks showing the original saliency maps (Orig) and saliency maps under cascading randomization over the four layers: output layer (Output), fully connected (FC), and two CNN layers (CNN1, CNN2).4 EXPERIMENTSIn what follows we experiment with the datasets MNIST (LeCun & Cortes (2010)), Street-View House Numbers (SVHN)(Netzer et al. (2011)), CIFAR10(Krizhevsky (2009)) and apple2orange(Zhu et al. (2017)). Further details about the datasets and classifiers are given in Appendix A.1.Our techniques are designed to improve any global attribution method by providing an improved baseline. In our experiments we consider four attribution methods -IG, DeepLIFT, Occlusion, and DeepSHAP. The baselines we consider include the zero baseline (the default baseline in all four methods), minimum distance training sample (MDTS), and the GANMEX baseline. We also compared our results with a modified version of expected gradient aimed to provide 1-vs-1 explanations, which runs IG over a randomly chosen target class image from the training set, as opposed to a random image from the training set.

Figure 4: Saliency maps for the classifier on the apple2orange dataset with four baseline choices: zero baseline (Zero), maximum value baseline (Max), blurred baseline (Blur), and GANMEX baseline (GANMEX).

Figure 5: (A) Vertical edge area in an SVHN image. (B) Histogram of sample to baseline distance (D edge (x, x)) in the vertical edge area. (C-E) Samples comparing GANMEX and MDTS baselines with D edge (x, x) indicated on the top of the baseline images. (C) Easy cases for GANMEX (D edge (x, x) ≈ 1). (D) Difficult cases for GANMEX (D edge (x, x) ≈ 3). (E) Difficult cases for GANMEX (D edge (x, x) ≈ 6).

Figure 6: (A) Cascading randomization on baselines generated by a stand-alone GAN lead to little randomization on the saliency maps. (B) Colored-MNIST dataset. GAN baselines generated with both similarity loss and reconstruction loss (S+R), similarity loss only (S), reconstruction loss only (R), and none of those (NA). Only S+R and S successfully constrained the baselines in the same modes (colors) with the inputs.

Figure 7: GANMEX baselines generated with various weights for the (A) classification loss, (B) similarity loss, and (C) reconstruction loss.

Figure8: One-vs-one saliency maps using class-targeted baselines (GANMEX) vs non-class-targeted baselines (zero baselines). One-vs-one saliency maps generated using zero baselines show almost the same attributions regardless of the target class, making the one-vs-one saliency maps (columns with target labels) similar to the one-vs-all saliency maps (the "Avg" columns that show the averaged saliency maps over all target classes). GANMEX baselines corrected the behavior for both IG, DeepLIFT and DeepSHAP, and produced different attributions depending on the target classes.

Figure 9: Additional examples of saliency maps for the classifier on the apple2orange dataset with four baseline choices: zero baseline (Zero), maximum value baseline (Max), blurred baseline (Blur), and GANMEX baseline (GANMEX).

-View House Numbers (SVHN)(Netzer et al. (2011)) We tested our models on the cropped version of SVHN and used the same model architecture with that of MNIST and achieved a test accuracy of 90.3% after 50 epochs of training.

Additional metrics for attribution methods using the zero baseline (Zero), MDTS, and GANMEX (GAN).

