DECOY-ENHANCED SALIENCY MAPS

Abstract

Saliency methods can make deep neural network predictions more interpretable by identifying a set of critical features in an input sample, such as the pixels that contribute most strongly to a prediction made by an image classifier. Unfortunately, recent evidence suggests that many saliency methods perform poorly, especially in situations where gradients are saturated, inputs contain adversarial perturbations, or predictions rely upon inter-feature dependence. To address these issues, we propose a framework that improves the robustness of saliency methods via a two-step procedure. First, we introduce a perturbation mechanism that subtly varies the input sample without changing its intermediate representations. Using this approach, we can gather a corpus of perturbed data samples while ensuring that the perturbed and original input samples follow the same distribution. Second, we compute saliency maps for the perturbed samples and propose a new method to aggregate them. By design, this aggregation offsets the influence of gradient saturation on the interpretation. From a theoretical perspective, we show that the aggregated saliency map not only captures inter-feature dependence but, more importantly, is robust against previously described adversarial perturbation methods. Consistent with our theoretical analysis, we present experimental results suggesting that, both qualitatively and quantitatively, our saliency method outperforms existing methods in a variety of applications.

1. INTRODUCTION

Deep neural networks (DNNs) deliver remarkable performance in an increasingly wide range of application domains, but they often do so in an inscrutable fashion, delivering predictions without accompanying explanations. In a practical setting such as automated analysis of pathology images, if a patient sample is classified as malignant, then the physician will want to know which parts of the image contribute to this diagnosis. Thus, in general, a DNN that delivers interpretations alongside its predictions will enhance the credibility and utility of its predictions for end users (Lipton, 2016). In this paper, we focus on a popular branch of explanation methods, often referred to as saliency methods, which aim to find input features (e.g., image pixels or words) that strongly influence the network predictions (Simonyan et al., 2013; Selvaraju et al., 2016; Binder et al., 2016; Shrikumar et al., 2017; Smilkov et al., 2017; Sundararajan et al., 2017; Ancona et al., 2018). Saliency methods typically rely on back-propagation from the network's output to its input to assign a saliency score to each feature, with higher scores indicating higher importance to the output prediction. Despite attracting increasing attention, saliency methods suffer from several fundamental limitations:
• Gradient saturation (Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017) may cause the gradients of important features to have small magnitudes, breaking the implicit assumption that important features generally correspond to large gradients. This issue arises when the DNN output is flat in the vicinity of important features.
• Importance isolation (Singla et al., 2019) refers to the problem that gradient-based saliency methods evaluate feature importance in an isolated fashion, implicitly assuming that the other features are fixed.
• Perturbation sensitivity (Ghorbani et al., 2017; Kindermans et al., 2017; Levine et al., 2019) refers to the observation that even imperceptible random perturbations or a simple shift transformation of the input may lead to a large change in the resulting saliency scores.
In this paper, we tackle these limitations by proposing a decoy-enhanced saliency score. At a high level, our method generates the saliency score of an input by aggregating the saliency scores of multiple perturbed copies of that input. Specifically, given an input sample of interest, our method first generates a population of perturbed samples, referred to as decoys, that perfectly mimic the neural network's intermediate representation of the original input. These decoys model the variation of an input sample originating from either sensor noise or adversarial attacks. The decoy construction procedure draws inspiration from knockoffs, proposed by Barber & Candès (2015) in the setting of error-controlled feature selection, where the core idea is to generate knockoff features that perfectly mimic the empirical dependence structure among the original features. In brief, this paper makes three primary contributions. First, we propose a framework that perturbs input samples to produce corresponding decoys that preserve the input distribution, in the sense that the intermediate representations of the original input data and the decoys are indistinguishable. We formulate decoy generation as an optimization problem, applicable to diverse deep neural network architectures. Second, we develop a decoy-enhanced saliency score by aggregating the saliency maps of the generated decoys. By design, this score naturally offsets the impact of gradient saturation. From a theoretical perspective, we show how the proposed score can simultaneously reflect the joint effects of other dependent features and achieve robustness to adversarial perturbations.
Third, we demonstrate empirically that the decoy-enhanced saliency score outperforms existing saliency methods, both qualitatively and quantitatively, on three real-world applications. We also quantify our method's advantage over existing saliency methods in terms of robustness against various adversarial attacks.

2. RELATED WORK

A variety of saliency methods have been proposed in the literature. Some, such as edge detectors and Guided Backpropagation (Springenberg et al., 2014), are independent of the predictive model (Nie et al., 2018; Adebayo et al., 2018). Others are designed only for specific architectures (e.g., Grad-CAM (Selvaraju et al., 2016) for CNNs, DeConvNet for CNNs with ReLU activations (Zeiler & Fergus, 2014)). In this paper, instead of exhaustively evaluating all saliency methods, we apply our method to the three saliency methods that do depend on the predictor (i.e., pass the sanity checks in Adebayo et al. (2018) and Sixt et al. (2020)) and are applicable to diverse DNN architectures:
• The vanilla gradient method (Simonyan et al., 2013) simply calculates the gradient of the class score with respect to the input $x$: $E_{grad}(x; F_c) = \nabla_x F_c(x)$.
• The SmoothGrad method (Smilkov et al., 2017) seeks to reduce noise in the saliency map by averaging over explanations of noisy copies of an input: $E_{sg}(x; F_c) = \frac{1}{N}\sum_{i=1}^{N} E_{grad}(x + g_i; F_c)$, with noise vectors $g_i \sim \mathcal{N}(0, \sigma^2)$.
• The integrated gradient method (Sundararajan et al., 2017) starts from a baseline input $x_0$ and sums the gradients with respect to scaled versions of the input ranging from the baseline to the observed input: $E_{ig}(x; F_c) = (x - x_0) \times \int_0^1 \nabla_x F_c(x_0 + \alpha(x - x_0))\,d\alpha$.
We do not empirically compare to several other categories of methods. Counterfactual-based methods work under the same setup as saliency methods, providing explanations for the predictions of a pretrained DNN model (Sturmfels et al., 2020).
These methods identify the important subregions within an input image by perturbing the subregions (by adding noise, rescaling (Sundararajan et al., 2017), blurring (Fong & Vedaldi, 2017), or inpainting (Chang et al., 2019)) and measuring the resulting changes in the predictions (Ribeiro et al., 2016; Lundberg & Lee, 2017; Chen et al., 2018; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017; Chang et al., 2019; Yousefzadeh & O'Leary, 2019; Goyal et al., 2019). Although these methods do identify meaningful subregions in practice, they exhibit several limitations. First, counterfactual-based methods implicitly assume that the regions containing the object contribute most to the prediction (Fan et al., 2017); however, Moosavi-Dezfooli et al. (2017) showed that counterfactual-based methods are also vulnerable to adversarial attacks, which force them to output unrelated background rather than meaningful objects as the important subregions. Second, the counterfactual images may lie far from the training distribution, causing ill-defined classifier behavior (Burns et al., 2019; Hendrycks & Dietterich, 2019). Beyond these limitations, counterfactual-based methods and our decoy-based method are fundamentally different in three ways. First, the former seeks the minimum set of features to exclude in order to minimize the prediction score, or to include in order to maximize it (Fong & Vedaldi, 2017), whereas our approach aims to characterize the influence of each feature on the prediction score. Second, counterfactual-based methods explicitly consider the decision boundary by comparing each image to the closest image on the other side of the boundary, whereas the proposed method considers the decision boundary only implicitly, through variants of the gradient.
Third, unlike counterfactual images, which can be out-of-distribution, decoys are plausible by construction, in the sense that their intermediate representations are indistinguishable from those of the original input data by design. Because of these limitations and differences, we do not compare our method with counterfactual-based methods. In addition to saliency methods and counterfactual-based methods, several other types of interpretation methods have been proposed that either aim for a different goal or assume a different setup. For example, recent research (e.g., Ribeiro et al. (2016); Lundberg & Lee (2017); Chen et al. (2018; 2019b)) designed techniques to explain a black-box model whose internal weights are inaccessible. Koh & Liang (2017) and follow-up work (Yeh et al., 2018; Koh et al., 2019) sought the training points that are most influential for a given test sample. Other efforts train a more interpretable DNN classifier (Fan et al., 2017; Zołna et al., 2019; Alvarez-Melis & Jaakkola, 2018; Toneva & Wehbe, 2019), synthesize samples that represent the model predictions (Ghorbani et al., 2019; Chen et al., 2019a), or identify noise-tolerant features (Ikeno & Hara, 2018; Schulz et al., 2020). Due to these task and setup differences, we do not consider these methods in this paper.
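For concreteness, the three gradient-based baselines above can be sketched in a few lines. This is a minimal NumPy sketch, assuming a scalar class-score function `f` and using central finite differences as a stand-in for framework backpropagation; it is an illustration under those assumptions, not the implementation used in the paper.

```python
import numpy as np

def grad_saliency(f, x, eps=1e-4):
    """Vanilla gradient: finite-difference estimate of df/dx
    (a stand-in for backpropagation in a real framework)."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def smoothgrad(f, x, n=50, sigma=0.1, seed=0):
    """SmoothGrad: average vanilla gradients over noisy copies of x."""
    rng = np.random.default_rng(seed)
    return np.mean([grad_saliency(f, x + rng.normal(0, sigma, x.shape))
                    for _ in range(n)], axis=0)

def integrated_gradients(f, x, x0=None, steps=50):
    """Integrated gradients: (x - x0) times the path-averaged gradient."""
    x0 = np.zeros_like(x, dtype=float) if x0 is None else x0
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    avg = np.mean([grad_saliency(f, x0 + a * (x - x0)) for a in alphas], axis=0)
    return (x - x0) * avg
```

For a linear score function, all three reduce to the familiar weight-times-input attributions, and integrated gradients satisfies the completeness property (attributions sum to $F_c(x) - F_c(x_0)$).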

3.1. PROBLEM SETUP

Consider a multi-class classification task in which a pre-trained neural network model implements a function $F: \mathbb{R}^d \to \mathbb{R}^C$ that maps a given input $x \in \mathbb{R}^d$ to scores for $C$ predicted classes. The score for each class $c \in \{1, \cdots, C\}$ is $F_c(x)$, and the predicted class is the one with maximum score, i.e., $\arg\max_{c \in \{1, \cdots, C\}} F_c(x)$. A saliency method aims to assign a saliency score to each feature, encoded in a saliency map $E(\cdot; F_c): \mathbb{R}^d \to \mathbb{R}^d$, in which features with higher scores have higher "importance" relative to the final prediction. Given a pre-trained neural network model $F$ with $L$ layers, an input $x$, and a saliency method $E$ such that $E(x; F)$ is a saliency map of the same dimensions as $x$, the proposed scores are obtained in two steps: generating decoys and aggregating the saliency maps of the decoys (see Fig. 1(A), which illustrates the workflow of creating decoy-enhanced saliency maps).

3.2. DECOY DEFINITION

Say that $F^{\ell}: \mathbb{R}^d \to \mathbb{R}^{d_{\ell}}$ is the function instantiated by the first $\ell$ layers of the given network, mapping an input $x \in \mathbb{R}^d$ to its intermediate representation $F^{\ell}(x) \in \mathbb{R}^{d_{\ell}}$ at layer $\ell \in \{1, 2, \cdots, L\}$. A vector $\tilde{x} \in \mathbb{R}^d$ is said to be a decoy of $x \in \mathbb{R}^d$ at a specified layer $\ell$ if the following swappable condition is satisfied:

$F^{\ell}(x) = F^{\ell}(x_{swap(\tilde{x}, K)})$, for swappable features $K \subset \{1, \cdots, d\}$. (1)

Here, the $swap(\tilde{x}, K)$ operation swaps features between $x$ and $\tilde{x}$ based on the elements in $K$. In this work, $K$ represents a small meaningful feature set, such as a small region/segment in an image or a group of words (embeddings) in a sentence. Take an image recognition task for example: if $K = \{10\}$ and $\tilde{x}$ is a zero matrix, then $x_{swap(\tilde{x}, K)}$ denotes a new image that is identical to $x$ except that the tenth pixel is set to zero. An illustration of the swap operator is shown in Fig. 1(B). The swappable condition ensures that the original image $x$ and its decoy $\tilde{x}$ are indistinguishable in terms of the intermediate representation at layer $\ell$. Note in particular that the construction of decoys relies solely on the first $\ell$ layers of the neural network, $F^1, F^2, \cdots, F^{\ell}$, and is independent of the succeeding layers $F^{\ell+1}, \cdots, F^L$. As such, $\tilde{x}$ is conditionally independent of the classification output $F(x)$ given the input $x$; i.e., $\tilde{x} \perp F(x) \mid x$.
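The swap operator and the swappable condition of Eqn. 1 can be sketched as follows. In this NumPy sketch, `F_l` stands for any function computing the layer-$\ell$ representation; the toy max-pooling function used below is an assumption for illustration only.

```python
import numpy as np

def swap(x, x_tilde, K):
    """x_{swap(x_tilde, K)}: a copy of x whose features indexed by K
    are replaced with the corresponding features of x_tilde."""
    out = x.copy()
    out[K] = x_tilde[K]
    return out

def is_decoy(F_l, x, x_tilde, K, eps=1e-6):
    """Swappable condition (Eqn. 1): the layer-l representations of x
    and x_{swap(x_tilde, K)} agree, up to a numerical tolerance eps."""
    return np.max(np.abs(F_l(x) - F_l(swap(x, x_tilde, K)))) <= eps
```

With a max-pooling layer, for instance, a pixel can be perturbed freely as long as it does not exceed the maximum of its pooling window, so the perturbed input still satisfies the swappable condition.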

3.3. DECOY GENERATION

To identify decoys satisfying the swappable condition, we solve the following optimization problem:

$\text{maximize}_{\tilde{x} \in [x_{min}, x_{max}]^d} \; \|((\tilde{x} - x) \circ s)_{+}\|_1, \quad \text{s.t.} \; \|F^{\ell}(\tilde{x}) - F^{\ell}(x)\|_{\infty} \leq \epsilon, \;\; (\tilde{x} - x) \circ (1 - M) = 0$. (2)

Here, $(\cdot)_{+} = \max(\cdot, 0)$, $\circ$ denotes entry-wise multiplication, and $\|\cdot\|_1$ and $\|\cdot\|_{\infty}$ correspond to the $L_1$ and $L_{\infty}$ norms, respectively. $M \in \{0, 1\}^d$ is a specified binary mask, and the value of each feature in the decoy $\tilde{x}$ is restricted to lie in a legitimate value range $[x_{min}, x_{max}]$ (e.g., a pixel value should lie in $[0, 255]$). The constraint $\|F^{\ell}(\tilde{x}) - F^{\ell}(x)\|_{\infty} \leq \epsilon$ ensures that the generated decoy satisfies the swappable condition described in Eqn. 1, and the mask constraint ensures that $x$ and $\tilde{x}$ are identical except for the swappable features indicated by the mask (i.e., $x_{swap(\tilde{x}, K)} = \tilde{x}$). As shown later in Section 3.4, our decoy-enhanced saliency score is defined to capture the empirical range of the decoy saliencies, so we first need to estimate the upper and lower ends of the legitimate decoys. To achieve this, in Eqn. 2 we maximize the deviation between $x$ and $\tilde{x}$ in both the positive and negative directions, i.e., $s = +1$ and $s = -1$. With this objective function, for each mask $M$ we compute two decoys: one for the positive deviation ($s = +1$) and one for the negative deviation ($s = -1$). To solve the optimization problem in Eqn. 2, we combine three commonly adopted techniques, namely the Lagrange multiplier, projected gradient, and change-of-variable methods (Carlini & Wagner, 2017), to transform the original objective function into the following form:

$\text{minimize}_{\hat{x}} \; -\|\max((\tfrac{1}{2}(\tanh(\hat{x}) + 1) - x) \circ s, 0)\|_1 + \lambda \cdot \|(|F^{\ell}(\tfrac{1}{2}(\tanh(\hat{x}) + 1)) - F^{\ell}(x)| - \tau)_{+}\|_2^2$, (3)

where $\lambda > 0$ is the Lagrange multiplier, $\hat{x}_i = \operatorname{arctanh}(2x_i - 1)$ for all $i \in \{1, 2, \cdots, d\}$, and $\tau > 0$ is introduced to approximate the $L_{\infty}$ norm in Eqn. 2. After obtaining $\hat{x}$ by solving Eqn. 3, we compute $\tilde{x}$ and map it back to the original feature value range $[x_{min}, x_{max}]$. More details about how to transform Eqn. 2 into Eqn. 3 can be found in Section A6.
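The optimization above can be sketched as follows. This is a simplified NumPy sketch assuming features already scaled to $[0, 1]$, a fixed $\lambda$ and $\tau$ (rather than the schedules of Section A6), and numerical gradients in place of autograd; `F_l` and the toy pooling function in the usage below are illustrative assumptions.

```python
import numpy as np

def decoy_objective(x_hat, x, F_l, s, lam, tau, mask):
    """Objective of Eqn. 3 (to be minimized). The change of variable
    x_tilde = 0.5 * (tanh(x_hat) + 1) keeps features in [0, 1], and
    `mask` (M in Eqn. 2) confines deviations to the swappable features."""
    x_tilde = x + (0.5 * (np.tanh(x_hat) + 1) - x) * mask
    deviation = -np.sum(np.maximum((x_tilde - x) * s, 0.0))   # -||((x~ - x) o s)_+||_1
    gap = np.abs(F_l(x_tilde) - F_l(x))                       # representation gap
    penalty = lam * np.sum(np.maximum(gap - tau, 0.0) ** 2)   # soft L_inf constraint
    return deviation + penalty

def generate_decoy(x, F_l, s, mask, lam=50.0, tau=0.0, lr=0.1, steps=400):
    """Crude first-order solver for Eqn. 3 via gradient descent, using
    central finite differences purely for self-containment."""
    x_hat = np.arctanh(np.clip(2 * x - 1, -0.999, 0.999))
    for _ in range(steps):
        g = np.zeros_like(x_hat)
        for i in range(x_hat.size):
            e = np.zeros_like(x_hat)
            e[i] = 1e-4
            g[i] = (decoy_objective(x_hat + e, x, F_l, s, lam, tau, mask)
                    - decoy_objective(x_hat - e, x, F_l, s, lam, tau, mask)) / 2e-4
        x_hat = x_hat - lr * g
    return x + (0.5 * (np.tanh(x_hat) + 1) - x) * mask
```

For example, under a toy max-pooling "layer", a masked-in feature that sits below its pooling maximum is pushed upward until the representation-gap penalty stops it near that maximum, while all other features remain untouched.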

3.4. DECOY-ENHANCED SALIENCY SCORES

Given an input sample $x$ and a swappable patch of size $P$, we can obtain $(\sqrt{d} - P + \text{stride})^2$ unique masks by sliding the swappable patch across the input with a certain stride. For computational efficiency, we aggregate $m$ masks into one decoy sample and optimize these masks jointly, solving Eqn. 3 once per decoy sample. We can then generate $2n$ decoys for that sample, denoted $\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_{2n}$, where $n = \lceil (\sqrt{d} - P + \text{stride})^2 / m \rceil$. Applying a given saliency method $E$ to these decoys yields the corresponding decoy saliency maps $E(\tilde{x}_1; F), E(\tilde{x}_2; F), \cdots, E(\tilde{x}_{2n}; F)$. With these decoy saliency maps in hand, for each feature $x_i$ in $x$, we characterize its saliency score variation by the population of saliency scores $\tilde{E}_i = \{E(\tilde{x}_1; F_c)_i, E(\tilde{x}_2; F_c)_i, \cdots, E(\tilde{x}_{2n}; F_c)_i\}$. We define the decoy-enhanced saliency score $Z_i$ for each feature $x_i$ as

$Z_i = \max(\tilde{E}_i) - \min(\tilde{E}_i)$. (4)

Here, $Z_i$ is determined by the empirical range of the decoy saliency scores. Ideally, important features will have large values and unimportant ones will have small values. Note that the proposed method is designed specifically for nonlinear models in need of interpretation; as discussed in Section A7, it cannot output meaningful saliency maps for linear models. It should also be noted that by sliding the swappable patch across the input and ensembling the resulting decoy-enhanced saliency maps, we capture the saliency of every feature. The motivation for manipulating at the patch level, rather than the entire input, is to capture the local dependency structure and to enable batch operations for better efficiency.
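Assuming the per-decoy saliency maps have already been computed (e.g., by any baseline method of Section 2), the aggregation of Eqn. 4 and the decoy count reduce to a few lines; `num_decoy_pairs` is a hypothetical helper name introduced here for illustration.

```python
import math
import numpy as np

def decoy_saliency(saliency_maps):
    """Eqn. 4: per-feature empirical range over the 2n decoy saliency
    maps; input has shape (2n, d)."""
    E = np.asarray(saliency_maps, dtype=float)
    return E.max(axis=0) - E.min(axis=0)

def num_decoy_pairs(d, P, stride, m):
    """n = ceil((sqrt(d) - P + stride)^2 / m); the total decoy count is 2n."""
    return math.ceil((math.isqrt(d) - P + stride) ** 2 / m)
```

For a 28 × 28 input ($d = 784$) with $P = 3$, stride 1, and $m = 26$ masks per decoy, this gives $n = \lceil 676 / 26 \rceil = 26$, i.e., 52 decoys in total.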

3.5. THEORETICAL INSIGHTS

In this section, we analyze the saliency score method from a theoretical perspective. In particular, we take a convolutional neural network with the ReLU activation function as an example to discuss why the proposed interpretation method can account for inter-feature dependence while also improving explanatory robustness. It should be noted that, while we conduct our theoretical analysis in the setting of CNNs with a specific activation function, the conclusions can easily be extended to other feed-forward neural architectures and other activation functions (e.g., sigmoid and tanh). For analysis of other neural architectures, see Section A9. Consider a CNN with $L$ hidden blocks, each containing a convolutional layer with a filter of size $\sqrt{s} \times \sqrt{s}$ and a max pooling layer with pooling size $\sqrt{s} \times \sqrt{s}$. (We set the pooling size equal to the kernel size in each block for simplicity.) The input to this CNN is $x \in \mathbb{R}^d$, unrolled from a $\sqrt{d} \times \sqrt{d}$ matrix. Similarly, we unroll each convolutional filter into $g^{\ell} \in \mathbb{R}^s$, indexed as $(g^{\ell})_j$ for $j \in J$. Here, $J$ corresponds to the index shifts, in matrix form, from the top-left to the bottom-right element. For example, a $3 \times 3$ convolutional filter (i.e., $s = 9$) is indexed by $J = \{-\sqrt{d} - 1, -\sqrt{d}, -\sqrt{d} + 1, -1, 0, 1, \sqrt{d} - 1, \sqrt{d}, \sqrt{d} + 1\}$. The output of the network is the probability vector $p \in \mathbb{R}^C$ generated by the softmax function, where $C$ is the total number of classes. Such a network can be represented as

$m^{\ell} = \text{pool}(\text{relu}(g^{\ell} * m^{\ell-1}))$ for $\ell = 1, 2, \cdots, L$, $\quad o = W_{L+1}^T m^L + b_{L+1}$, $\quad p = \text{softmax}(o)$,

where $\text{relu}(\cdot)$ and $\text{pool}(\cdot)$ denote the ReLU and pooling operators, $m^{\ell} \in \mathbb{R}^{d_{\ell}}$ is the output of the $\ell$-th block ($m^0 = x$), and $g^{\ell} * m^{\ell-1} \in \mathbb{R}^{d_{\ell-1}}$ represents a convolution applied to that block's input. We assume for simplicity that the convolution retains the input shape. Consider an input $x$ and its decoy $\tilde{x}$, generated by swapping features in $K$.
For each feature $i \in K$, we have the following theorem for the decoy-enhanced saliency score $Z_i$:

Theorem 1. In the aforementioned setting, $Z_i$ is bounded by
$\left| Z_i - \frac{1}{2}\sum_{k \in K} (\tilde{x}^{+}_k - \tilde{x}^{-}_k)(H_x)_{k,i} \right| \leq C_1$.
Here, $C_1 > 0$ is a bounded constant and $H_x$ is the Hessian of $F_c(x)$ at $x$, where $(H_x)_{i,k} = \frac{\partial^2 F_c}{\partial x_i \partial x_k}$. $\tilde{x}^{+}$ and $\tilde{x}^{-}$ refer to the decoys that maximize and minimize $E(\tilde{x}; F_c)$, respectively. See Section A7 for the proof.

Theorem 1 implies that the proposed saliency score is determined by the second-order Hessian terms $(H_x)_{i,k}$ within the same swappable feature set. The score explicitly models the feature dependencies in the swappable feature set via this second-order Hessian, potentially capturing meaningful patterns such as edges, texture, etc. In addition to enabling representation of inter-feature dependence, Theorem 1 sheds light on the robustness of the proposed saliency score against adversarial attack. To illustrate the robustness improvement of our method, we introduce the following proposition. The proof of this proposition, as well as an in-depth analysis, can be found in Section A8.

Proposition 1. Given an input $x$ and the corresponding adversarial sample $x'$, if both $|x_i - \tilde{x}^{+}_i| \leq C_2 \delta_i$ and $|x_i - \tilde{x}^{-}_i| \leq C_2 \delta_i$ hold, where $C_2 > 0$ is a bounded constant and $\delta_i = |E(x, F)_i - E(x', F)_i|$, then the following relation is guaranteed: $|(Z_x)_i - (Z_{x'})_i| \leq |E(x, F)_i - E(x', F)_i|$.

Given an adversarial sample $x'$ (i.e., a perturbed $x$), we say a saliency method is not robust against $x'$ if the deviation of the corresponding explanation, $\delta_i = |E(x, F)_i - E(x', F)_i|$ for $i \in \{1, 2, \cdots, d\}$, is large. According to the proposition above, the deviation of our decoy-enhanced saliency score is never larger than that of the underlying saliency method when the stated condition is satisfied. This indicates that, when the condition holds, our saliency method guarantees stronger resistance to adversarial perturbation.
To ensure that the conditions $|x_i - \tilde{x}^{+}_i| \leq C_2 \delta_i$ and $|x_i - \tilde{x}^{-}_i| \leq C_2 \delta_i$ are satisfied, we introduce them as an additional constraint in Eqn. 2. In the following section, unless otherwise specified, the saliency scores used in our evaluation are all derived with this constraint imposed.

4. EXPERIMENTS

To evaluate the effectiveness of our proposed method, we perform extensive experiments on deep learning models that target three tasks: image classification, sentiment analysis, and network intrusion detection. The performance of our approach is assessed both qualitatively and quantitatively. The results show that our proposed method identifies intuitively more coherent saliency maps than the state-of-the-art saliency methods alone. The method also achieves quantitatively better alignment to truly important features and demonstrates stronger robustness to adversarial manipulation. The description of the datasets and experimental setup can be found in Section A10.

4.1. SALIENCY BENCHMARK

As mentioned in Section 2, we apply our decoy enhancement method to three saliency methods: vanilla gradient, SmoothGrad, and integrated gradient. Here, we applied the default setup for the integrated gradient (a zero baseline) and for SmoothGrad. Section A14 shows that our method can also improve the performance of variants of integrated gradient/SmoothGrad as well as Grad-CAM. In each case, the decoy-enhanced saliency scores are post-processed in the following way before qualitative and quantitative evaluation. To rule out bias introduced by the raw saliency values and ensure a fair comparison, we constructed a binary saliency map by retaining only the top-K features ranked by each method. Specifically, we set the saliency value of the selected features to 1 and of the remaining features to 0. In this section, we choose K as the top 20 percent of all features. Note that Section A16 shows that subtly varying K does not influence the experimental conclusions. To demonstrate that all three methods, when enhanced with decoys, still depend on the predictor, we carried out a sanity check on the ImageNet dataset. The results show that our decoy-enhanced saliency methods pass the sanity check (see Section A11 for details).
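The top-K binarization step above can be sketched as follows; this is a minimal NumPy sketch of the post-processing, with the 20 percent default taken from the text.

```python
import numpy as np

def binarize_top_k(saliency, frac=0.2):
    """Binary saliency map: set the top `frac` fraction of features
    (ranked by saliency) to 1 and the remaining features to 0."""
    s = np.asarray(saliency, dtype=float).ravel()
    k = max(1, int(round(frac * s.size)))
    out = np.zeros_like(s)
    out[np.argsort(s)[::-1][:k]] = 1.0       # indices of the k largest scores
    return out.reshape(np.shape(saliency))
```

Binarizing before comparison removes any scale differences between methods, so only the ranking of features matters.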

4.2. PERFORMANCE IN VARIOUS APPLICATIONS

To comprehensively evaluate our proposed approach against the baselines mentioned above, we focus on two criteria. First, we aim for qualitative coherence of the identified saliency map. Intuitively, we prefer a saliency method that highlights features aligning closely with the predictions (e.g., highlighting the object of interest in an image or the words indicating the sentiment of a sentence). Second, we aim to quantify the correctness of the saliency maps produced by each method. To do so, we use the fidelity metric (Dabkowski & Gal, 2017), defined as:

$SF(E(\cdot; F_c), x) = -\log F_c(E(x; F_c) \circ x)$, (8)

where $c$ indicates the predicted class of input $x$, and $E(x; F_c)$ is the top-K-retained binary saliency map described above. $E(x; F_c) \circ x$ performs entry-wise multiplication between $E(x; F_c)$ and $x$, encoding the overlap between the object of interest and the concentration of the saliency map. The rationale behind this metric is as follows: viewing the saliency score of a feature as its contribution to the predicted class, a good saliency method will highlight more important features and thus give rise to higher predicted class scores and lower metric values.
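The fidelity metric of Eqn. 8 is a one-liner once the binary map is available. The sigmoid-style score function in the usage below is a toy assumption standing in for the network's class probability $F_c$.

```python
import numpy as np

def fidelity(F_c, x, binary_map):
    """Eqn. 8: SF = -log F_c(E(x; F_c) o x), where `binary_map` is the
    top-K binary saliency map; lower values indicate better saliency."""
    return -np.log(F_c(binary_map * x))
```

Masking in the truly predictive features keeps the class probability high (small SF), while masking in irrelevant features drives the probability down and SF up.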

4.2.1. PERFORMANCE ON THE IMAGENET DATASET

We applied our decoy-enhanced saliency score to randomly sampled images from the ImageNet dataset (Russakovsky et al., 2015), with a pretrained VGG16 model (Simonyan & Zisserman, 2014). See Section A12 for the applicability of our method to diverse CNN architectures such as AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016). The 3 × 3 image patches are treated as swappable features in generating decoys. A side-by-side comparison (Fig. 2(A)) suggests that decoys consistently help to reduce noise and produce more visually coherent saliency maps. For example, the original integrated gradient method highlights the region of the dog's head in a scattered fashion, which is also revealed by the difference plot. In contrast, the decoy-enhanced integrated gradient method not only highlights the missing body but also identifies the dog's head with more detail, such as the ears, cheek, and nose (see Section A18 for more visualization examples). The visual coherence is also quantitatively supported by the saliency fidelity score. To further evaluate the necessity of the two steps in our method (i.e., decoy generation and aggregation), we carried out a control experiment in which we replaced either step with alternatives. Specifically, as alternatives to decoy generation (Eqn. 2), we used an image in which all pixel values are either replaced with a single mean pixel value or contaminated with Gaussian white noise; as an alternative to decoy aggregation (Eqn. 4), we calculated the mean saliency score. As shown in Fig. 2(B), our method, which incorporates both steps, achieves the best performance, validating the effectiveness of each design choice (see Tab. A5 for more statistics about the performance differences). Recall that the number of decoys $n$ is determined by the patch size ($P$), the stride, and the number of masks aggregated into one decoy ($m$). Fig. 2(C) shows that our method achieves stable fidelity scores across substantial variation in the number of decoys. Sensitivity tests for the other hyperparameters can be found in Section A16.

4.2.2. PERFORMANCE ON THE STANFORD SENTIMENT TREEBANK (SST) DATASET

We also applied our decoy-enhanced saliency score to randomly sampled sentences from the Stanford Sentiment Treebank (SST) (Socher et al., 2013). We trained a two-layer CNN (Kim, 2014) that takes pretrained word embeddings as input (Pennington et al., 2014) (see Section A10 for experimental details). As suggested by Guan et al. (2019), the average saliency value across all dimensions of a word embedding is taken as the word-level saliency value. The word embeddings are treated as swappable features when generating decoys. As shown in Fig. 3(A), a side-by-side comparison suggests that decoys consistently help to produce semantically more meaningful saliency maps. For example, in a sentence with negative sentiment, keywords associated with negation, such as 'no' and 'not', are more strongly highlighted by decoy-enhanced saliency methods. The semantic coherence is also quantitatively supported by the saliency fidelity (Fig. 3(B)). We also tested the alternatives mentioned above: constant perturbation (replacing decoy generation with the mean embedding of the whole dictionary) and noise perturbation, each with range aggregation, as well as decoys with mean aggregation. Fig. 3(B) shows that our method outperforms these alternatives.
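The word-level reduction described above (following Guan et al. (2019)) is a simple average over embedding dimensions; a minimal NumPy sketch:

```python
import numpy as np

def word_saliency(embed_saliency):
    """Collapse a (num_words, embed_dim) saliency array to one score
    per word by averaging over the embedding dimensions."""
    return np.asarray(embed_saliency, dtype=float).mean(axis=1)
```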

4.3. ROBUSTNESS TO ADVERSARIAL ATTACKS

Next, we investigate the robustness of our method to adversarial manipulations of images. In particular, we focus on three popular adversarial attacks (Ghorbani et al., 2017): (1) the top-k attack, which seeks to decrease the scores of the top k most important features; (2) the target attack, which aims to increase the importance of a pre-specified region in the input image; and (3) the mass-center attack, which aims to spatially shift the center of mass of the original saliency map. Here, we specify the bottom-right 4 × 4 region of the original image for the target attack and select k = 5000 for the top-k attack. We use the sensitivity metric (Alvarez-Melis & Jaakkola, 2018) to quantify the robustness of a saliency method $E$ to adversarial attack, defined as:

$SS(E(\cdot, F_c), x, x') = \frac{\|E(x, F_c) - E(x', F_c)\|_2}{\|x - x'\|_2}$, (9)

where $x'$ is the perturbed version of $x$. A small sensitivity value means that similar inputs do not lead to substantially different saliency maps. As shown in Fig. 4(A), a side-by-side comparison suggests that decoys consistently yield low sensitivity scores and help to produce more visually coherent saliency maps, mitigating the impact of various adversarial attacks. More examples can be found in Section A18. The visual coherence and robustness to adversarial attacks are also quantitatively supported by Fig. 4(B)∼(D). As mentioned above, we also ran experiments on an MLP trained on a network intrusion dataset; the results, shown in Section A13, are consistent with those on CNNs, confirming our method's applicability to widely used feed-forward networks.
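The sensitivity metric of Eqn. 9 can be sketched directly; `E` here is any saliency function of the form `E(x, F_c)`, and the linear toy saliency in the usage below is an assumption for illustration.

```python
import numpy as np

def sensitivity(E, F_c, x, x_adv):
    """Eqn. 9: norm of the saliency change divided by the norm of the
    input perturbation; smaller values indicate a more robust method."""
    num = np.linalg.norm(E(x, F_c) - E(x_adv, F_c))
    den = np.linalg.norm(x - x_adv)
    return num / den
```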

5. DISCUSSION AND CONCLUSION

In this work, we propose a method for computing, from a given saliency method, decoy-enhanced saliency scores that yield more accurate and robust saliency maps. We formulate decoy generation as an optimization problem applicable to diverse deep neural network architectures. We demonstrate the superior performance of our method relative to three standard saliency methods, both qualitatively and quantitatively, even in the presence of various adversarial perturbations to the image. From a theoretical perspective, by deriving a closed-form solution, we show that the proposed score provably compensates for the limitations of existing saliency methods by reflecting the joint effects of other dependent features and maintaining robustness to adversarial perturbations.

Fig. 2(C) shows that our method achieves decent performance with only a small number of decoys. Section A15 further shows that the runtime of generating one decoy is marginal compared to that of existing saliency methods. This indicates that our technique can improve existing saliency methods without introducing much computational overhead. With parallel computing across multiple GPUs, our method can be made much faster, which further decreases the overhead and increases its practicality (see Section A15 for a detailed runtime analysis). Our method mainly introduces three hyperparameters: the swappable feature size $K$, the network layer $\ell$, and the initial Lagrange multiplier $\lambda$. In Section A16, we show that our method is insensitive to substantial variation of these hyperparameters.

We generate decoys by using Eqn. 2. While there are other widely used perturbation methods (e.g., random noise, blurring, and inpainting), they are not suitable for generating decoys. First, Section 4 shows that some general perturbations (i.e., random noise and constant perturbation) obtain worse fidelity than decoys. Second, without ensuring the swappable condition in Eqn. 1, they cannot provide a theoretical guarantee of robustness improvement. Third, methods like blurring and inpainting are not well defined for applications beyond computer vision.

This work points to several promising directions for future research. First, $(E(x; F_c) \odot x)$ may be out-of-distribution and thus fail our fidelity metric. We will investigate more rigorous metrics and use other benchmark datasets (e.g., BAM (Yang & Kim, 2019)) for evaluation. Second, a possible extension is to customize our method to recurrent neural networks and to inputs with categorical/discrete features. Third, recent work (Bansal et al., 2020; Chen et al., 2019c) shows that adversarial training can improve the interpretability of a DNN model. It is worth exploring whether our method could further enhance the quality of saliency maps derived from these adversarially retrained classifiers. A fourth promising direction is to reframe interpretability as hypothesis testing and use decoys to deliver a set of salient features, subject to false discovery rate control at some pre-specified level (Burns et al., 2019; Lu et al., 2018).

To keep each decoy within the valid feature range, we optimize over an auxiliary variable $w$ and parameterize the decoy as $\tilde{x} = \frac{1}{2}(\tanh(w) + 1)$. Because $\tanh(w_i) \in [-1, 1]$ implies $\tilde{x}_i \in [0, 1]$, any solution for $w$ yields a valid decoy. Other transformations for this step are also possible but were not explored in this paper. Putting these ideas together, we minimize the following objective function:

$$\min_{w} \; -\Big\| \Big( \tfrac{1}{2}(\tanh(w) + 1) - x \Big) \odot s \Big\|_1 + \lambda \cdot \Big\| F_\ell\Big(\tfrac{1}{2}(\tanh(w) + 1)\Big) - F_\ell(x) \Big\|_\infty ,$$

where $\lambda > 0$ is initialized small and repeatedly doubled until the optimization succeeds. Because the $L_\infty$ norm is not fully differentiable, we adopt the approximation trick introduced by Carlini & Wagner (2017) and solve the following formulation:

$$\min_{w} \; -\Big\| \max\Big( \Big(\tfrac{1}{2}(\tanh(w) + 1) - x\Big) \odot s, \; 0 \Big) \Big\|_1 + \lambda \cdot \Big\| \Big( \Big| F_\ell\Big(\tfrac{1}{2}(\tanh(w) + 1)\Big) - F_\ell(x) \Big| - \tau \Big)_+ \Big\|_2^2 , \tag{3}$$

where $\tau > 0$. In this paper, we follow the selection strategy proposed by Carlini & Wagner (2017) and initialize $\tau = 1$. After each iteration, if the second term is zero, we reduce $\tau$ by a factor of 0.95 and repeat; otherwise, we terminate the optimization. After obtaining $w$, we compute $\tilde{x}$ and map it back to the original feature value range $[x_{\min}, x_{\max}]$. Note that Eqn. 3 can be efficiently solved by any first-order optimization method without introducing much computational overhead. In practice, the average runtime of solving it is 62.3% shorter than that of the fastest saliency method, the vanilla gradient.
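To make the optimization above concrete, the following is a minimal numerical sketch of the relaxed objective in Eqn. 3, assuming a toy linear "hidden representation" $F_\ell(x) = Wx$ and a finite-difference gradient descent loop; the names `mask`, `lam`, and `generate_decoy` are illustrative, not from the actual implementation:

```python
import numpy as np

def decoy_objective(w, x, mask, W, lam, tau):
    """C&W-style relaxation of Eqn. 3: push masked features of the decoy
    away from x while keeping the toy hidden representation W @ x within
    tau of the original."""
    decoy = 0.5 * (np.tanh(w) + 1.0)          # change of variables keeps decoy in [0, 1]
    gain = np.maximum((decoy - x) * mask, 0.0).sum()
    drift = np.maximum(np.abs(W @ decoy - W @ x) - tau, 0.0)
    return -gain + lam * (drift ** 2).sum()

def generate_decoy(x, mask, W, lam=10.0, tau=0.1, lr=0.05, steps=200):
    """Minimize the objective with finite-difference gradient descent
    (analytic gradients would be used at real scale)."""
    w = np.arctanh(np.clip(2 * x - 1, -0.999, 0.999))  # initialize at x
    eps = 1e-5
    for _ in range(steps):
        grad = np.zeros_like(w)
        for i in range(w.size):                # numerical gradient, toy scale only
            d = np.zeros_like(w)
            d[i] = eps
            grad[i] = (decoy_objective(w + d, x, mask, W, lam, tau)
                       - decoy_objective(w - d, x, mask, W, lam, tau)) / (2 * eps)
        w -= lr * grad
    return 0.5 * (np.tanh(w) + 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=4)             # input with features in [0, 1]
mask = np.array([1.0, 1.0, 0.0, 0.0])         # swappable patch: first two features
W = rng.standard_normal((3, 4))               # toy stand-in for layer l
decoy = generate_decoy(x, mask, W)
print(decoy.min() >= 0.0 and decoy.max() <= 1.0)  # tanh parameterization keeps it valid
```

The tanh parameterization guarantees validity of the decoy by construction, so no projection step is needed during the descent.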

A7 PROOF OF THEOREM 1

Before proving Theorem 1, we first state and prove the following lemma.

Lemma 1. Consider an input $x$ and its decoy $\tilde{x}$, generated by replacing the original features with swappable features in $\mathcal{K}$, $|\mathcal{K}| = K$. The partial derivative of $F_c(\tilde{x})$ with respect to $\tilde{x}_i$ for $i \in \mathcal{K}$ satisfies

$$\Big| (\nabla_{\tilde{x}} F_c(\tilde{x}))_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}_k - x_k)(H_{\tilde{x}})_{i,k} \Big| \le C .$$

Proof. The second-order Taylor expansion of the prediction $F_c(x)$ for target class $c$ around $\tilde{x}$ is

$$F_c(x) \approx F_c(\tilde{x}) + \nabla_{\tilde{x}} F_c(\tilde{x})^T \Delta + \frac{1}{2} \Delta^T H_{\tilde{x}} \Delta , \quad \text{where } \Delta = x - \tilde{x} .$$

By the definition of the decoys in Section 2.2 (i.e., $F_c(\tilde{x}) = F_c(x)$), the following relation holds:

$$\nabla_{\tilde{x}} F_c(\tilde{x})^T \Delta \approx -\frac{1}{2} \Delta^T H_{\tilde{x}} \Delta . \tag{14}$$

From this equation, we can see that for a linear model, the linearity zeroes out the Hessian term and hence the gradient of the decoys, causing our method to output zero saliency scores for all input features. As clarified in Section 3.4, our method is mainly defined for complicated non-linear models.

Given a swappable patch of size $K \times 1$ starting from position $i_1$, we have $\Delta = [0, \ldots, x_{i_1} - \tilde{x}_{i_1}, \ldots, x_{i_K} - \tilde{x}_{i_K}, 0, \ldots, 0]$. As such, we have

$$\nabla_{\tilde{x}} F_c(\tilde{x})^T \Delta = \sum_{i \in \mathcal{K}} (\nabla_{\tilde{x}} F_c(\tilde{x}))_i (x_i - \tilde{x}_i) , \qquad \Delta^T H_{\tilde{x}} \Delta = \sum_{i \in \mathcal{K}} (x_i - \tilde{x}_i) \sum_{k \in \mathcal{K}} (H_{\tilde{x}})_{i,k} (x_k - \tilde{x}_k) . \tag{15}$$

Plugging Eqn. (15) into Eqn. (14), we have

$$\sum_{i \in \mathcal{K}} \Big[ (\nabla_{\tilde{x}} F_c(\tilde{x}))_i + \frac{1}{2} \sum_{k \in \mathcal{K}} (H_{\tilde{x}})_{i,k} (x_k - \tilde{x}_k) \Big] (x_i - \tilde{x}_i) = 0 .$$

From this, we can derive

$$\Big| (\nabla_{\tilde{x}} F_c(\tilde{x}))_i + \frac{1}{2} \sum_{k \in \mathcal{K}} (x_k - \tilde{x}_k)(H_{\tilde{x}})_{i,k} \Big| \le C , \quad \text{i.e.,} \quad \Big| (\nabla_{\tilde{x}} F_c(\tilde{x}))_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}_k - x_k)(H_{\tilde{x}})_{i,k} \Big| \le C .$$

Specifically, $|x_i - \tilde{x}_i|$ is bounded by $2\max(x_{\max}, |x_{\min}|)$, and $|x_i - \tilde{x}_i| > 0$ because we can always find a small perturbation to each feature in $x$ such that $\| F_\ell(x) - F_\ell(\tilde{x}) \|_\infty \le \epsilon$. In addition, both the gradient and the Hessian are bounded by some Lipschitz constant (Szegedy et al., 2013).⁴ As a result, we can always find a constant $C$ such that

$$C \ge \Bigg| \frac{ - \sum_{k_1 \in \mathcal{K} \setminus i} \big[ (\nabla_{\tilde{x}} F_c(\tilde{x}))_{k_1} + \frac{1}{2} \sum_{k_2 \in \mathcal{K}} (H_{\tilde{x}})_{k_1,k_2} (x_{k_2} - \tilde{x}_{k_2}) \big] (x_{k_1} - \tilde{x}_{k_1}) }{ x_i - \tilde{x}_i } \Bigg| .$$

For the case $K = 1$, we have $(\nabla_{\tilde{x}} F_c(\tilde{x}))_i = \frac{1}{2} (H_{\tilde{x}})_{i,i} (\tilde{x}_i - x_i)$.

Now we prove Theorem 1 from Section 3.5.
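Before moving on to the theorem, the key cancellation identity in the lemma (Eqn. 14) can be checked numerically on a toy quadratic model $F_c(x) = \|x\|^2$, for which the second-order Taylor expansion is exact; the rotation used here to build a same-prediction "decoy" is an illustrative device, not our generation procedure:

```python
import numpy as np

# Toy model: F_c(x) = ||x||^2, so grad F_c(x) = 2x and the Hessian is H = 2I exactly.
x = np.array([1.0, 2.0])

# Build a "decoy" on the same level set (F_c(x_tilde) = F_c(x)) by rotating x.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_tilde = R @ x
assert np.isclose(x_tilde @ x_tilde, x @ x)   # same prediction, as the decoy condition requires

# Eqn. 14: grad F_c(x_tilde)^T (x - x_tilde) = -1/2 (x - x_tilde)^T H (x - x_tilde)
delta = x - x_tilde
lhs = (2 * x_tilde) @ delta                   # gradient term at the decoy
H = 2 * np.eye(2)
rhs = -0.5 * delta @ H @ delta                # Hessian term
print(np.isclose(lhs, rhs))                   # exact equality for a quadratic model
```

For a quadratic model the approximation in Eqn. 14 holds with equality, which is why the gradient at the decoy is fully determined by the Hessian term.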
Consider a CNN with $L$ hidden blocks, each containing a convolutional layer with a filter of size $\sqrt{s} \times \sqrt{s}$ and a max-pooling layer with pooling size $\sqrt{s} \times \sqrt{s}$. The input to this CNN is $x \in \mathbb{R}^d$, unrolled from a $\sqrt{d} \times \sqrt{d}$ matrix. Similarly, we unroll each convolutional filter into $g_\ell \in \mathbb{R}^s$, where $g_\ell$ is indexed as $(g_\ell)_j$ for $j \in \mathcal{J}$. Here, $\mathcal{J}$ corresponds to the index shift in matrix form from the top-left to the bottom-right element. The output of the network is the probability vector $p \in \mathbb{R}^C$ generated by the softmax function, where $C$ is the total number of classes. Such a network can be represented as

$$m_\ell = \mathrm{pool}(\mathrm{relu}(g_\ell * m_{\ell-1})) \;\; \text{for } \ell = 1, 2, \ldots, L , \qquad o = W_{L+1}^T m_L + b_{L+1} , \qquad p = \mathrm{softmax}(o) ,$$

where $\mathrm{relu}(\cdot)$ and $\mathrm{pool}(\cdot)$ denote the ReLU and pooling operators, $m_\ell \in \mathbb{R}^{d_\ell}$ is the output of the $\ell$-th block ($m_0 = x$), and $(g_\ell * m_{\ell-1}) \in \mathbb{R}^{d_{\ell-1}}$ represents a convolution on that block.

Consider an input $x$ and its decoy $\tilde{x}$, generated by swapping features in $\mathcal{K}$. For each feature $i \in \mathcal{K}$, we have the following theorem for the decoy-enhanced saliency score $Z_i$:

Theorem 1. In the aforementioned setting, $Z_i$ is bounded by

$$\Big| Z_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{k,i} \Big| \le C_1 .$$

Proof. Using the denominator-layout notation for the derivative of a vector, the gradient of $p_c$ with respect to $x$ can be written as

$$\nabla_x p_c = \prod_{\ell=1}^{L} \frac{\partial m_\ell}{\partial m_{\ell-1}} \cdot \frac{\partial o}{\partial m_L} \cdot \frac{\partial p_c}{\partial o} , \tag{20}$$

where $\frac{\partial o}{\partial m_L} = W_{L+1}$, and

$$\frac{\partial p_c}{\partial o_{c'}} = \begin{cases} p_c - p_c^2 & \text{if } c' = c , \\ -p_c \, p_{c'} & \text{otherwise} . \end{cases}$$

Then we can write $\frac{\partial p_c}{\partial o}$ as $\frac{\partial p_c}{\partial o} = P_{\cdot c}$, where $P_{\cdot c}$ corresponds to the $c$-th column of $P$ and $P = \mathrm{diag}(p) - p p^T$. We then define $B_\ell = \frac{\partial m_\ell}{\partial m_{\ell-1}}$, with $B_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$. In the following, we compute $B_\ell$. First, we have

$$\frac{\partial (m_\ell)_j}{\partial (\mathrm{relu}(g_\ell * m_{\ell-1}))_n} = \begin{cases} 1 & \text{if } \hat{j} - n \in \mathcal{J} \text{ and } n = \mathrm{argmax}_{n' \in \hat{j} + \mathcal{J}} (g_\ell * m_{\ell-1})_{n'} , \\ 0 & \text{otherwise} , \end{cases} \tag{24}$$

where $\hat{j}$ represents the center of the pooling patch in $\mathrm{relu}(g_\ell * m_{\ell-1})$ that produces $(m_\ell)_j$. Then we can compute

$$\frac{\partial (\mathrm{relu}(g_\ell * m_{\ell-1}))_n}{\partial (m_{\ell-1})_i} = \begin{cases} (a_\ell)_n (g_\ell)_{n-i} & \text{if } n - i \in \mathcal{J} , \\ 0 & \text{otherwise} , \end{cases} \tag{25}$$

where $(a_\ell)_n = \mathbb{1}\{(\mathrm{relu}(g_\ell * m_{\ell-1})_n) \ge 0\}$. If we change the activation function to either sigmoid or tanh, then $(a_\ell)_n$ in Eqn. 25 is replaced with the derivative of that function. For the sigmoid activation $\sigma(x)$, the derivative is $\sigma(x)(1 - \sigma(x))$, with range $[0, \frac{1}{4}]$; for the tanh activation, the derivative is $1 - \tanh(x)^2$, with range $[0, 1]$. We conclude that the derivatives of both sigmoid and tanh are bounded by a value no larger than 1. Combining Eqn. 24 with Eqn. 25, we have

$$(B_\ell)_{ij} = \frac{\partial (m_\ell)_j}{\partial (m_{\ell-1})_i} = \begin{cases} (a_\ell)_n (g_\ell)_{n-i} & \text{if } n - i \in \mathcal{J} , \; \hat{j} - n \in \mathcal{J} , \text{ and } n = \mathrm{argmax}_{n' \in \hat{j} + \mathcal{J}} (g_\ell * m_{\ell-1})_{n'} , \\ 0 & \text{otherwise} . \end{cases}$$

For simplicity, we rewrite the non-zero condition as $n \in \hat{\mathcal{J}}_\ell$. Plugging $B_\ell$, $\ell = 1, \ldots, L$, into Eqn. 20, we obtain the partial derivative $\nabla_x p_c$. Further, we compute each element of the Hessian matrix $H_{ij}$ as

$$H_{ij} = \nabla_{x_i} (\nabla_{x_j} p_c) = \frac{\partial \big( \prod_{\ell=1}^{L} B_\ell \big)_{j \cdot} W_{L+1} P_{\cdot c}}{\partial x_i} = \Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j \cdot} W_{L+1} \frac{\partial P_{\cdot c}}{\partial x_i} = \sum_{n_L=1}^{d_L} \Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} (W_{L+1})_{n_L \cdot} \frac{\partial P_{\cdot c}}{\partial x_i} , \tag{27}$$

and

$$\frac{\partial P_{c' c}}{\partial x_i} = \begin{cases} (1 - 2 p_c) \nabla_{x_i} p_c & \text{if } c' = c , \\ -(p_{c'} \nabla_{x_i} p_c + p_c \nabla_{x_i} p_{c'}) & \text{otherwise} . \end{cases} \tag{28}$$

Now we compute $\big( \prod_{\ell=1}^{L} B_\ell \big)_{j n_L}$ as

$$\Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} = (B_1)_{j \cdot} \Big( \prod_{\ell=2}^{L-1} B_\ell \Big) (B_L)_{\cdot n_L} , \tag{29}$$

where

$$(B_1)_{j \cdot} B_2 = \Big[ 0, \ldots, C_{n_2} (a_2)_{n_2} \sum_{n_1 \in \hat{\mathcal{J}}_1} (a_1)_{n_1} , \ldots, 0 \Big] ,$$

and where $C_{n_2}$ is a coefficient determined by the filter weights $g_1$ and $g_2$. Here, we define $\hat{\mathcal{J}}_1$ as the set of indices such that $(B_1)_{j n_1} \ne 0$ for $n_1 \in \hat{\mathcal{J}}_1$. As such, we can compute

$$(B_1)_{j \cdot} \prod_{\ell=2}^{L-1} B_\ell = \Big[ 0, \ldots, C_{n_{L-1}} (a_{L-1})_{n_{L-1}} \prod_{\ell=1}^{L-2} \sum_{n_\ell \in \hat{\mathcal{J}}_\ell} (a_\ell)_{n_\ell} , \ldots, 0 \Big] . \tag{31}$$

Plugging Eqn. 31 into Eqn. 29, we have

$$\Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} = (C_L)_{n_L} (a_L)_{n_L} \prod_{\ell=1}^{L-1} \sum_{n_\ell \in \hat{\mathcal{J}}_\ell} (a_\ell)_{n_\ell} . \tag{32}$$

Plugging Eqn. 32 into Eqn. 27, we have

$$H_{ij} = \Big( C_j \prod_{\ell=1}^{L} \sum_{n_\ell \in \hat{\mathcal{J}}_\ell} (a_\ell)_{n_\ell} \Big) \frac{\partial P_{\cdot c}}{\partial x_i} , \tag{33}$$

where $C_j$ is a linear combination of $g_1, \ldots, g_L, W_{L+1}$, which is bounded.
$H_{ij}$ equals the product of two components: the summation over neurons activated by $x$, and the gradient $\frac{\partial P_{\cdot c}}{\partial x_i}$. Given that the total number of neurons in a CNN is a constant (denoted by $C_T$), we have $0 \le \prod_{\ell=1}^{L} \sum_{n_\ell \in \hat{\mathcal{J}}_\ell} (a_\ell)_{n_\ell} \le C_T$. Then we have $|(H_x)_{ij}| \le C_T \big| C_j \frac{\partial P_{\cdot c}}{\partial x_i} \big|$. Since the derivatives of both sigmoid and tanh are no larger than 1, this inequality also applies to networks with these two activation functions. Similarly, for the Hessian $(H_{\tilde{x}})_{ij}$ of a decoy $\tilde{x}$, we have $|(H_{\tilde{x}})_{ij}| \le C_T \big| C_j \frac{\partial \tilde{P}_{\cdot c}}{\partial \tilde{x}_i} \big|$. Given these two inequalities, we obtain

$$|(H_{\tilde{x}})_{ij} - (H_x)_{ij}| \le 2 C_T \max\Big( \Big| C_j \frac{\partial \tilde{P}_{\cdot c}}{\partial \tilde{x}_i} \Big| , \Big| C_j \frac{\partial P_{\cdot c}}{\partial x_i} \Big| \Big) ,$$

where $\frac{\partial P_{\cdot c}}{\partial x_i}$ is given by Eqn. 28. Recalling that $p_c$ lies within $[0, 1]$ and the gradient $\nabla_{x_i} p_c$ is bounded by some Lipschitz constant (Szegedy et al., 2013), we obtain that $\frac{\partial P_{\cdot c}}{\partial x_i}$ is bounded by some constant. Finally, we can derive that $|(H_{\tilde{x}})_{ij} - (H_x)_{ij}| \le C_C$, where $C_C$ represents the upper bound.⁵

Now, we derive the decoy-enhanced saliency score $Z_i$ for $x_i$, given a population of saliency scores $\tilde{E}_i = \{ E(\tilde{x}^1; F)_i, E(\tilde{x}^2; F)_i, \ldots, E(\tilde{x}^{2n}; F)_i \}$. Let $\tilde{x}^+, \tilde{x}^- \in \{ \tilde{x}^1, \tilde{x}^2, \ldots, \tilde{x}^{2n} \}$ denote the decoys that maximize and minimize $E(\tilde{x}; F)_i$, respectively. According to Lemma 1, the partial derivative $\nabla_{\tilde{x}_i} p_c$ satisfies

$$\Big| (\nabla_{\tilde{x}} F_c(\tilde{x}))_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}_k - x_k)(H_{\tilde{x}})_{i,k} \Big| \le C . \tag{34}$$

Applying this to $\tilde{x}^+$ and $\tilde{x}^-$, we have

$$\frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - x_k)(H_{\tilde{x}^+})_{i,k} - C \le (\nabla_{\tilde{x}^+} F_c(\tilde{x}^+))_i \le \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - x_k)(H_{\tilde{x}^+})_{i,k} + C , \tag{35}$$

$$-\frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^-_k - x_k)(H_{\tilde{x}^-})_{i,k} - C \le -(\nabla_{\tilde{x}^-} F_c(\tilde{x}^-))_i \le -\frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^-_k - x_k)(H_{\tilde{x}^-})_{i,k} + C . \tag{36}$$

Then we have

$$Z_i = (\nabla_{\tilde{x}^+} F_c(\tilde{x}^+))_i - (\nabla_{\tilde{x}^-} F_c(\tilde{x}^-))_i \le \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - x_k)(H_{\tilde{x}^+})_{i,k} - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^-_k - x_k)(H_{\tilde{x}^-})_{i,k} + 2C$$
$$\le \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - x_k)\big( (H_x)_{i,k} + C_C \big) - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^-_k - x_k)\big( (H_x)_{i,k} - C_C \big) + 2C$$
$$\le \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{i,k} + \frac{1}{2} C_C \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k) + 2C , \tag{37}$$

and

$$Z_i = (\nabla_{\tilde{x}^+} F_c(\tilde{x}^+))_i - (\nabla_{\tilde{x}^-} F_c(\tilde{x}^-))_i \ge \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - x_k)\big( (H_x)_{i,k} - C_C \big) - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^-_k - x_k)\big( (H_x)_{i,k} + C_C \big) - 2C$$
$$\ge \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{i,k} - \frac{1}{2} C_C \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k) - 2C . \tag{38}$$

Combining Eqn. 37 with Eqn. 38, we have

$$\Big| Z_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{k,i} \Big| \le C_1 . \tag{39}$$

Recall that $(\tilde{x}^+_k - \tilde{x}^-_k)$ is bounded from above, so there exists a constant $C_1$ such that $C_1 \ge \frac{1}{2} C_C \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k) + 2C$. Note that this upper bound is data-specific, and we leave the exploration of its tightness to future work.
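The aggregation used in the theorem, i.e., taking per feature the range between the decoy that maximizes and the decoy that minimizes the base saliency score, can be sketched as follows (the array shapes are illustrative):

```python
import numpy as np

def decoy_enhanced_scores(saliency_maps):
    """Range aggregation: for each feature i, Z_i = E(x+; F)_i - E(x-; F)_i,
    where x+ / x- are the decoys attaining the max / min base saliency score.
    `saliency_maps` has shape (num_decoys, num_features)."""
    return saliency_maps.max(axis=0) - saliency_maps.min(axis=0)

# Toy population of base saliency maps for 2n = 4 decoys over 3 features.
maps = np.array([[0.1, 0.5, 0.2],
                 [0.3, 0.4, 0.2],
                 [0.2, 0.9, 0.1],
                 [0.0, 0.6, 0.3]])
print(decoy_enhanced_scores(maps))  # -> [0.3 0.5 0.2]
```

The per-feature max and min select $\tilde{x}^+$ and $\tilde{x}^-$ implicitly, so no explicit search over the decoy population is needed.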

A8 PROOF OF PROPOSITION 1

Proposition 1. Given an input $x$ and its corresponding adversarial sample $\hat{x}$, if both $|x_i - \tilde{x}_i| \le C_2 \delta_i$ and $|\hat{x}_i - \tilde{\hat{x}}_i| \le C_2 \delta_i$ hold, where $C_2 > 0$ is a bounded constant and $\delta_i = |E(x, F)_i - E(\hat{x}, F)_i|$, then the following relation is guaranteed:

$$|(Z_x)_i - (Z_{\hat{x}})_i| \le |E(x, F)_i - E(\hat{x}, F)_i| .$$

Proof. Recall that the goal of an attack against saliency maps is to subtly perturb an input sample such that the added perturbation does not change the output of the classifier (Ghorbani et al., 2017) but forces a saliency method to output a less meaningful saliency map (i.e., highlighting features that are irrelevant to the classifier's prediction). To achieve this goal, when generating an adversarial sample $\hat{x}$ from the given input $x$, an attacker must impose the constraint $\| \hat{x} - x \|_\infty \le \epsilon$. Suppose we have an adversarial sample $\hat{x}$ satisfying this constraint. Then we can write $(\hat{x} - x)_i = \hat{\epsilon}_i$, where $|\hat{\epsilon}_i| \le \epsilon$, for $i = 1, 2, \ldots, d$. In addition, we can compute saliency maps $E(x, F)$ and $E(\hat{x}, F)$ for $x$ and $\hat{x}$ using an existing saliency method.⁶ Given both saliency maps, the difference between them is

$$(E(\hat{x}, F) - E(x, F))_i = (\nabla_{\hat{x}} F_c(\hat{x}) - \nabla_x F_c(x))_i \approx (H_x (\hat{x} - x))_i = \sum_{j=1}^{d} (H_x)_{ij} \, \hat{\epsilon}_j .$$

Based on Eqn. 2 in Section 3.3, when generating the decoys $\tilde{x}$, we ensure that the classifier's predictions for the decoys are the same as that for $x$. In this work, we achieve this by bounding the difference between the hidden representations of $x$ and $\tilde{x}$. As discussed in Section A7, to preserve the same prediction $c$ for $x$ and $\tilde{x}$, one has to ensure that $|F_c(x) - F_c(\tilde{x})|$ is bounded. This implies that the difference between $x_i$ and $\tilde{x}_i$ is bounded within some $\epsilon_i$, where $\epsilon_i$ represents the maximum difference between $x_i$ and $\tilde{x}_i$ at the $i$-th dimension. As mentioned above, the adversarial sample $\hat{x}$ does not change the classifier's prediction. Therefore, we have $\hat{\epsilon}_i \le \epsilon_i$ for $i = 1, 2, \ldots, d$.

Now, suppose we obtain a set of decoys for $x$ together with their saliency maps $\{ E(\tilde{x}^1; F)_i, E(\tilde{x}^2; F)_i, \ldots, E(\tilde{x}^{2n}; F)_i \}$. Let $\tilde{x}^+ \in \{ \tilde{x}^1, \ldots, \tilde{x}^{2n} \}$ denote the decoy that maximizes $E(\tilde{x}; F)_i$, and let $\tilde{x}^-$ denote the decoy that minimizes it. Similarly, we have the corresponding decoys $\tilde{\hat{x}}^+$ and $\tilde{\hat{x}}^-$ for the adversarial sample $\hat{x}$, as well as their saliency maps. With the decoys and saliency maps for both $x$ and $\hat{x}$, the difference between $(Z_x)_i$ and $(Z_{\hat{x}})_i$ is

$$(Z_x)_i - (Z_{\hat{x}})_i = \big[ E(\tilde{x}^+, F)_i - E(\tilde{x}^-, F)_i \big] - \big[ E(\tilde{\hat{x}}^+, F)_i - E(\tilde{\hat{x}}^-, F)_i \big]$$
$$= (H_x(\tilde{x}^+ - x))_i - (H_x(\tilde{x}^- - x))_i - (H_{\hat{x}}(\tilde{\hat{x}}^+ - \hat{x}))_i + (H_{\hat{x}}(\tilde{\hat{x}}^- - \hat{x}))_i$$
$$\approx \sum_{j=1}^{d} (H_x)_{ij} \big[ (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big] .$$

To guarantee an improvement in robustness against the adversarial perturbation, we must ensure that $|(Z_x)_i - (Z_{\hat{x}})_i| - |(E(x, F) - E(\hat{x}, F))_i| \le 0$ for $i = 1, 2, \ldots, d$. That is,

$$\Big| \sum_{j=1}^{d} (H_x)_{ij} \big[ (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big] \Big| \le \Big| \sum_{j=1}^{d} (H_x)_{ij} \, \hat{\epsilon}_j \Big| . \tag{43}$$

As discussed in Section A7, $|(H_x)_{ij}| \le C_C$. With this, we have

$$\Big| \sum_{j=1}^{d} (H_x)_{ij} \big[ (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big] \Big| \le \sum_{j=1}^{d} |(H_x)_{ij}| \, \big| (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big| \le \sum_{j=1}^{d} C_C \, \big| (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big| . \tag{44}$$

By plugging Eqn. 44 into Eqn. 43, we conclude that as long as

$$\big| (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big| \le \frac{1}{C_C \, d} \Big| \sum_{j=1}^{d} (H_x)_{ij} \, \hat{\epsilon}_j \Big| ,$$

our method is guaranteed to improve robustness against the adversarial perturbation. Let $\delta_i = |E(x, F)_i - E(\hat{x}, F)_i|$. If we can ensure that $|x_i - \tilde{x}_i| \le \frac{1}{4 C_C d} \delta_i$ and $|\hat{x}_i - \tilde{\hat{x}}_i| \le \frac{1}{4 C_C d} \delta_i$, then $|\tilde{x}^+_j - \tilde{x}^-_j| \le \frac{1}{2 C_C d} \delta_i$ and $|\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j| \le \frac{1}{2 C_C d} \delta_i$. Thus, the condition above is satisfied, i.e., $\big| (\tilde{x}^+_j - \tilde{x}^-_j) - (\tilde{\hat{x}}^+_j - \tilde{\hat{x}}^-_j) \big| \le \frac{1}{C_C d} \delta_i$. By setting $C_2 = \frac{1}{4 C_C d}$, we obtain the robustness conditions in Proposition 1.

A9 COROLLARY 1

Consider a multilayer perceptron (MLP) with $L$ fully-connected hidden layers and a decoy swappable size of $K \times 1$. The input to this MLP is $x \in \mathbb{R}^d$. Each hidden layer uses the ReLU activation function. Similar to the CNN above, the output of this MLP is $p \in \mathbb{R}^C$. The network can be represented as

$$m_\ell = \mathrm{relu}(W_\ell^T m_{\ell-1} + b_\ell) \;\; \text{for } \ell = 1, 2, \ldots, L , \qquad o = W_{L+1}^T m_L + b_{L+1} , \qquad p = \mathrm{softmax}(o) ,$$

where $W_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$ for $\ell \in \{1, \ldots, L+1\}$.

Corollary 1. For the above MLP, $Z_i$ is also bounded:

$$\Big| Z_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{k,i} \Big| \le C_2 .$$

Proof. Based on the proof of Theorem 1, the gradient of $p_c$ with respect to $x$ can be written as

$$\nabla_x p_c = \Big( \prod_{\ell=1}^{L} B_\ell \Big) W_{L+1} P_{\cdot c} ,$$

where $B_\ell = \frac{\partial m_\ell}{\partial m_{\ell-1}}$, $B_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$, and $P_{\cdot c}$ is defined as before, with $P = \mathrm{diag}(p) - p p^T$. In the following, we compute $B_\ell$. First, we compute $(B_1)_{ij}$:

$$(B_1)_{ij} = \frac{\partial (m_1)_j}{\partial x_i} = \frac{\partial (W_1^T x + b_1)_j}{\partial x_i} \cdot \frac{\partial (m_1)_j}{\partial (W_1^T x + b_1)_j} = (W_1)_{ij} (a_1)_j ,$$

where $(a_1)_j = \mathbb{1}\{ (W_1^T x + b_1)_j \ge 0 \}$. Similarly, for $\ell = 2, 3, \ldots, L$,

$$(B_\ell)_{ij} = (W_\ell)_{ij} (a_\ell)_j , \quad \text{where } (a_\ell)_j = \mathbb{1}\{ (W_\ell^T m_{\ell-1} + b_\ell)_j \ge 0 \} .$$

Then we compute each element of the Hessian matrix $H_{ij}$. Specifically, based on Eqn. 27, we have

$$H_{ij} = \sum_{n_L=1}^{d_L} \Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} (W_{L+1})_{n_L \cdot} \frac{\partial P_{\cdot c}}{\partial x_i} ,$$

where $\frac{\partial P_{\cdot c}}{\partial x_i}$ is the same as in Eqn. 28. Now we compute $\big( \prod_{\ell=1}^{L} B_\ell \big)_{j n_L}$ as

$$\Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} = (B_1)_{j \cdot} \Big( \prod_{\ell=2}^{L-1} B_\ell \Big) (B_L)_{\cdot n_L} ,$$

where $(B_1)_{j \cdot} = [(W_1)_{j1} (a_1)_1, (W_1)_{j2} (a_1)_2, \ldots, (W_1)_{j d_1} (a_1)_{d_1}]$ and

$$(B_1)_{j \cdot} B_2 = \Big[ (a_2)_1 \sum_{n_1=1}^{d_1} (C_2)_{1 n_1} (a_1)_{n_1} , \; \ldots, \; (a_2)_{d_2} \sum_{n_1=1}^{d_1} (C_2)_{d_2 n_1} (a_1)_{n_1} \Big] ,$$

where $(C_2)_{n_2 n_1} = (W_2)_{n_1 n_2} (W_1)_{j n_1}$. For simplicity, we rewrite $\sum_{n_1=1}^{d_1} (C_2)_{n_2 n_1} (a_1)_{n_1}$ as $(C_2)_{n_2} \sum_{n_1=1}^{d_1} (a_1)_{n_1}$. Then we have

$$(B_1)_{j \cdot} B_2 = \Big[ (C_2)_1 (a_2)_1 \sum_{n_1=1}^{d_1} (a_1)_{n_1} , \; \ldots, \; (C_2)_{d_2} (a_2)_{d_2} \sum_{n_1=1}^{d_1} (a_1)_{n_1} \Big] .$$

As such, we can compute

$$(B_1)_{j \cdot} \prod_{\ell=2}^{L-1} B_\ell = \Big[ (C_{L-1})_1 (a_{L-1})_1 \prod_{\ell=1}^{L-2} \sum_{n_\ell=1}^{d_\ell} (a_\ell)_{n_\ell} , \; \ldots, \; (C_{L-1})_{d_{L-1}} (a_{L-1})_{d_{L-1}} \prod_{\ell=1}^{L-2} \sum_{n_\ell=1}^{d_\ell} (a_\ell)_{n_\ell} \Big] . \tag{54}$$

Plugging Eqn. 54 into the product above, we have

$$\Big( \prod_{\ell=1}^{L} B_\ell \Big)_{j n_L} = (B_1)_{j \cdot} \Big( \prod_{\ell=2}^{L-1} B_\ell \Big) (B_L)_{\cdot n_L} = (C_L)_{n_L} (a_L)_{n_L} \prod_{\ell=1}^{L-1} \sum_{n_\ell=1}^{d_\ell} (a_\ell)_{n_\ell} .$$

Finally, we obtain

$$H_{ij} = \sum_{n_L=1}^{d_L} (C_L)_{n_L} (a_L)_{n_L} \prod_{\ell=1}^{L-1} \sum_{n_\ell=1}^{d_\ell} (a_\ell)_{n_\ell} (W_{L+1})_{n_L \cdot} \frac{\partial P_{\cdot c}}{\partial x_i} = C_j \prod_{\ell=1}^{L} \sum_{n_\ell=1}^{d_\ell} (a_\ell)_{n_\ell} \frac{\partial P_{\cdot c}}{\partial x_i} ,$$

where $C_j$ is a linear combination of the elements of $(W_1)_{j \cdot}, W_2, \ldots, W_{L+1}$. Note that the Hessian derived for the MLP has a form similar to that derived for the CNN in Eqn. 33, i.e., the summation over neurons activated by $x$ multiplied by the gradient. Here, the summation over activated neurons is again bounded by the total number of neurons in the network, and the gradient $\frac{\partial P_{\cdot c}}{\partial x_i}$ is bounded by a Lipschitz constant. Similarly, we have the inequality $|(H_{\tilde{x}})_{ij} - (H_x)_{ij}| \le C_M$.

As in Theorem 1, let $\tilde{x}^+, \tilde{x}^- \in \{ \tilde{x}^1, \tilde{x}^2, \ldots, \tilde{x}^{2n} \}$ denote the decoys that maximize and minimize $E(\tilde{x}; F)_i$, respectively. Following Eqn. 34 to Eqn. 39, we have

$$\Big| Z_i - \frac{1}{2} \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{k,i} \Big| \le C_2 , \quad \text{with } C_2 \ge \frac{1}{2} C_M \sum_{k \in \mathcal{K}} (\tilde{x}^+_k - \tilde{x}^-_k) + 2C . \tag{57}$$

Slightly different from a CNN, an MLP is sometimes used to process inputs that do not have strong local dependency. In this case, we can set the swappable patch size $K = 1$, and Eqn. 57 can be reformulated as

$$\Big| Z_i - \frac{1}{2} (\tilde{x}^+_i - \tilde{x}^-_i)(H_x)_{i,i} \Big| \le C_2 .$$

As we can observe from this equation, our proposed saliency score is still able to compensate for the gradient saturation problem.

A10 DATASETS AND EXPERIMENT SETUP

In this section, we introduce the datasets used in our experiments and the neural network trained on each dataset, followed by our choices of hyper-parameters when explaining each model.

ImageNet. We randomly selected a subset of samples from the ImageNet validation set, which can be downloaded from the following link: http://www.image-net.org/. We adopted the most widely used preprocessing method for the selected images. Specifically, for each image, we resized it to 227 × 227, converted it to BGR format, and subtracted the per-channel mean values [103.939, 116.779, 123.68] from the image. Rather than training our own networks, we downloaded a pretrained VGG16 model, AlexNet model, and ResNet_v1_50 model from the following links: https://github.com/tensorflow/models/tree/master/research/slim and http://www.cs.toronto.edu/~guerzhoy/tf_alexnet/. We applied our proposed method to explain the predictions of these networks on the selected samples.

SST. We downloaded the Stanford Sentiment Treebank (SST1) from the following link: https://github.com/harvardnlp/sent-conv-torch/tree/master/data. The data is split into a training set of 76,961 samples and a testing set of 1,821 samples. We used pretrained GloVe embeddings to represent each word in the sentences (samples); the embedding of each word is a vector of 100 dimensions. The pretrained embedding matrix can be downloaded from the following link: http://nlp.stanford.edu/data/wordvecs/glove.6B.zip. We trained a two-layer CNN with the embeddings as inputs. The model achieves about 80% accuracy on the testing set. The preprocessed testing data and the pretrained model can be downloaded from the following link: https://tinyurl.com/y9noqj6l. We ran our explanation method on the pretrained model with the testing samples.
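The ImageNet preprocessing steps above (resize to 227 × 227, RGB-to-BGR conversion, per-channel mean subtraction) can be sketched as follows; the nearest-neighbor resize is an illustrative stand-in for whatever resize routine is used in practice:

```python
import numpy as np

BGR_MEANS = np.array([103.939, 116.779, 123.68])  # B, G, R channel means

def preprocess(img_rgb, size=227):
    """img_rgb: float array of shape (H, W, 3) with values in [0, 255]."""
    h, w, _ = img_rgb.shape
    # Nearest-neighbor resize to size x size (illustrative stand-in).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = img_rgb[rows][:, cols]
    bgr = resized[..., ::-1]                  # RGB -> BGR channel order
    return bgr - BGR_MEANS                    # subtract per-channel mean

img = np.full((300, 400, 3), 128.0)           # dummy gray image
out = preprocess(img)
print(out.shape)  # -> (227, 227, 3)
```

Because the means are given in BGR order, the channel flip must happen before the subtraction.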

Network intrusion detection (IDS).

We use a subset of the CSE-CIC-IDS2018 dataset (Sharafaldin et al., 2018; Canadian Institute for Cybersecurity, 2018), a network intrusion dataset that contains benign network traffic traces and malicious traces generated by three types of attacks: Denial of Service (DoS)-Hulk, SSH-BruteForce, and Infiltration. The training set contains 88,661 samples and the testing set 22,165 samples. Each sample is represented as a vector of 83 dimensions, where each feature represents a statistic of the network traffic flows (e.g., number of packets, number of bytes, length of packets). The features are normalized within [0, 1] using the scikit-learn MinMaxScaler function. We trained a two-layer MLP to classify whether an input is benign traffic or an attack (intrusion). Note that the legends have the same meaning as the column labels in Fig. 2. The model reaches 99% accuracy on the testing set. After training the model, we randomly sampled a subset of 2,000 testing samples and used our method to derive explanations for the model predictions on this subset. The dataset, model, and descriptions of each feature can be found at https://tinyurl.com/y9noqj6l.

Hyper-parameter choices. The hyper-parameter choices of the proposed method on the three datasets are shown in Table A1. In the table, ℓ is the index of the layer within the target model that is selected to generate the decoy images. The Lagrange multiplier λ controls the weight of ‖F_ℓ(x̃) − F_ℓ(x)‖_∞. The patch_size and stride control the size and the stride step of each decoy patch. τ is introduced by Eqn. 3 in Section A6. Note that we set the swappable patch size of the SST and IDS data to 1, because their features may not have strong local correlation. For the ImageNet data, we selected the swappable patch size as the widely used convolutional kernel size 3, with stride size 1. We set the number of patches (masks) per decoy, m, to 100 for ImageNet and 1 for SST and IDS.
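The min-max normalization applied to the IDS features is equivalent to the following numpy re-implementation of the scikit-learn MinMaxScaler fit/transform (training-set statistics are recorded once and reused for test data):

```python
import numpy as np

def minmax_fit(train):
    """Record per-feature minimum and range from the training set."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0                      # guard against constant features
    return lo, span

def minmax_transform(x, lo, span):
    """Scale features into [0, 1] using the training-set statistics."""
    return (x - lo) / span

train = np.array([[0.0, 10.0],
                  [5.0, 30.0],
                  [10.0, 20.0]])
lo, span = minmax_fit(train)
print(minmax_transform(train, lo, span))  # -> [[0. 0.] [0.5 1.] [1. 0.5]]
```

Fitting on the training set only (and reusing `lo`/`span` at test time) avoids leaking test-set statistics into the model.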
When generating adversarial attack images, we applied the code released by the corresponding work (Ghorbani et al., 2017) and followed their default setup in our implementation.

A11 SANITY CHECK FOR DECOY-ENHANCED SALIENCY MAPS

As suggested by Adebayo et al. (2018), any valid saliency method should pass a sanity check, in the sense that its output should depend on the learned parameters of the predictive model rather than acting as an edge detector or other generic feature detector. We performed the model parameter randomization test (Adebayo et al., 2018) on the ImageNet dataset by comparing the output of the proposed saliency method on a pretrained VGG16 network with its output on a weight-randomized VGG16 network. If the proposed saliency method indeed depends on the learned parameters of the model, the outputs in the two cases should differ substantially. Following the cascading randomization strategy (Adebayo et al., 2018), the weights of the pretrained VGG16 network are randomized from the top to the bottom layers in a cascading fashion. This procedure is designed to destroy the learned weights successively. As illustrated in Fig. A5, the cascading randomization qualitatively destroys the decoy-enhanced saliency maps combined with all three existing saliency methods. This conclusion is also supported by a quantitative comparison measured by the structural similarity index (SSIM), shown in Fig. A6. Here, the column labels are the same as those in Fig. 2, and the difference figures share the same colorbar as those in Fig. 2.

Table A2: Quantitative comparison of our method and baselines on the network intrusion dataset. We report the means and standard errors of the fidelity scores.
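The cascading randomization procedure can be sketched generically: starting from the top layer, replace each layer's weights with random values of the same shape and recompute the saliency map after every step (here `layers` is a toy list of weight arrays, and the saliency recomputation itself is left abstract):

```python
import numpy as np

def cascading_randomization(weights, seed=0):
    """Yield copies of `weights` (ordered bottom -> top) in which the top
    k layers have been re-initialized, for k = 1, ..., len(weights).
    A saliency method that passes the sanity check should change
    substantially at every step."""
    rng = np.random.default_rng(seed)
    current = [w.copy() for w in weights]
    for k in range(len(weights) - 1, -1, -1):  # randomize top to bottom, cumulatively
        current[k] = rng.standard_normal(current[k].shape)
        yield [w.copy() for w in current]

layers = [np.ones((2, 2)), np.ones((3, 2)), np.ones((1, 3))]
stages = list(cascading_randomization(layers))
print(len(stages))  # -> 3: one stage per randomized layer
```

Because the randomization is cumulative, the final stage is equivalent to a fully re-initialized network, and the saliency maps at that stage should bear no resemblance to the originals.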

A12 APPLICABILITY TO OTHER CNN ARCHITECTURES

In addition to the VGG16 model, we generated saliency maps for AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016) trained on the ImageNet dataset. We visualize their saliency maps in Fig. A7. We observe that our method consistently outperforms the baseline methods, both quantitatively and qualitatively. Together with the results in Section 4, these results suggest that our decoy-enhanced saliency methods can be applied to various feed-forward network architectures with consistent performance.

A13 PERFORMANCES ON THE NETWORK INTRUSION DATASET.

Rather than visualizing the saliency scores through heatmaps, we compare the saliency scores obtained by different methods qualitatively as follows. We ranked the features based on their saliency scores and compared the ranking obtained by the existing methods with that obtained by our decoy-enhanced method. "Minimum size of packet in forward direction", "Minimum length of a packet", and "Minimum time between two packets sent in the forward direction" are ranked higher by our method than by the baselines. These features can capture the differences between benign and malicious traffic: attackers usually tend to rapidly send small packets to discover backdoors in the victim network system, while benign users may send much larger packets with longer intervals between them. On the contrary, features that are not useful for intrusion detection (e.g., timestamp, download/upload ratio) are wrongly pinpointed by the existing methods, whereas our method correctly assigns lower importance to them.

Table A2 shows the fidelity comparisons of different saliency methods. We observe that our decoy-enhanced methods outperform the original saliency methods. These results show that our method pinpoints more accurate features and achieves higher fidelity than the baselines. We also evaluated three alternatives used in Section 4: constant perturbation with range aggregation, noise perturbation with range aggregation, and decoy generation with mean aggregation. The results in Table A2 are consistent with those in Fig. 2 and Fig. 3, i.e., our method outperforms these baselines. In summary, the results on this dataset align with those on the other datasets, confirming our method's applicability to multilayer perceptrons.

A14 DECOYS ON OTHER BASELINES.

In Section 4, we evaluated our method on three state-of-the-art saliency methods. Recent research (Sturmfels et al., 2020; Hooker et al., 2019) suggests variants that improve the performance of these baseline methods. (Figure caption: Fidelity comparison of saliency maps obtained by original saliency methods and our decoy-enhanced versions. "ExpGrad" refers to Expected Gradient, "SGradRange" stands for SmoothGrad with range aggregation, and "IntUniform" represents integrated gradient with uniform baseline. See Tab. A7 for more statistics about the performance differences.) Here, using ImageNet data, we evaluate whether our decoy method can further improve these variants and another widely used saliency method. Specifically, we consider two variants of the integrated gradient: integrated gradient with uniform baseline (Sturmfels et al., 2020) and Expected Gradient (Sturmfels et al., 2020); two variants of SmoothGrad: VarGrad (Hooker et al., 2019) and SmoothGrad with range aggregation; and one existing saliency method: Grad-CAM (Selvaraju et al., 2016). For the variants of the integrated gradient and SmoothGrad, we kept the number of samples the same as in the original versions and used the default number suggested by existing works, 25 (see https://github.com/PAIR-code/saliency). We will investigate whether increasing the number of samples improves the existing saliency methods' fidelity and robustness in future work.

Fig. A8 and Fig. A9 show the qualitative and quantitative comparisons of each method with and without decoys. As depicted in Fig. A8, our method helps knock off the noise and improves the visual quality of the saliency maps. Fig. A9 further demonstrates the advantage of our method in explanation fidelity. Together with the results in Section 4, these results demonstrate the generalizability of our technique to different saliency methods. Note that our method yields only a minor improvement on Grad-CAM, both qualitatively and quantitatively. As part of future work, we will explore how to customize our method for Grad-CAM and investigate the effectiveness of applying our technique to more saliency methods.

A15 RUNTIME OF DECOY GENERATIONS

To evaluate the computational cost of our decoy generation, we compared the runtime of optimizing one decoy against that of computing three types of saliency maps, repeated 500 times with respect to different patch masks. As illustrated in Fig. A10a, on average, optimizing one decoy is 62.3% faster than the fastest saliency method, the vanilla gradient; relative to the other methods, the optimization is even less expensive. As mentioned in Section 3.4, our decoy generation and the saliency map computation can be run in parallel in batch mode. In the optimal case, where we have enough resources to compute each decoy saliency map (i.e., E(x̃^i; F)) in parallel, the overall runtime of generating one decoy-enhanced saliency map is the time of generating one decoy plus that of computing one saliency map with the existing methods. Figure A10 reports the runtime of decoy generation; the comparison is conducted on the same CPU/GPU to ensure fairness ("Grad", "IntGrad", and "SGrad" stand for the vanilla gradient, the integrated gradient, and SmoothGrad, respectively). Fig. A10b shows the comparison between the optimal decoy-enhanced saliency map generation time and the original saliency map generation time across the three saliency methods. As we can observe from the figure, our method introduces negligible computational overhead over the existing methods. Even when users have limited resources and cannot run the decoy-enhanced saliencies in a fully parallel fashion, Fig. 2 shows that our method is not sensitive to variations in the number of generated decoys; more specifically, we can obtain decent performance by solving for only 16 decoys on the ImageNet dataset.

In the worst-case scenario, where a user cannot run decoy generation in parallel at all, our method's computational overhead over the baselines is 24X for the vanilla gradient, 16X for the integrated gradient, and 16X for SmoothGrad. In most cases, where users can afford partial parallelism, this overhead decreases linearly with the available computational resources. For example, with 4 GPUs, the overhead drops to 6X for the vanilla gradient and 4X for the integrated gradient and SmoothGrad. We argue that for an ensemble method, this overhead is acceptable. Besides, saliency generation is much lighter-weight than training deep neural networks: even with 4X-6X overhead, the time to compute saliency maps is still much less than the network training time. In addition, our method can be even faster on more powerful machines, which increases its practicality.

A16 HYPER-PARAMETER SENSITIVITY

We also conduct experiments on VGG16 to understand the impact of hyper-parameter choices on the performance of our optimization-based decoy generation method. Specifically, we focus on three hyper-parameters: the network layer ℓ, the initial Lagrange multiplier λ, and the patch size. We first varied the value of ℓ for VGG16 and compared the differences among the decoy saliencies generated from the three aforementioned saliency methods. In particular, we set ℓ to range from the first convolutional layer to the last pooling layer and show the generated decoy saliencies in Fig. A19. Note that, according to our design, only the convolutional layers and the pooling layers can be used to generate decoy images. For each saliency method, Fig. A19 demonstrates that the decoy saliencies generated from different layers for the same image are of similar quality. Fig. A19 also shows the mean and standard deviation of the SF scores for each saliency method. These quantitative results also support the conclusion that our approach is not sensitive to the choice of layer ℓ. This is likely because, as previous research has shown (Chan et al., 2015; Saxe et al., 2011), the final classification results of a DNN are not highly related to the hidden representations.

For each image, we first calculated the gradient-based saliency maps with and without using decoys, based on the pretrained model. Following the preprocessing steps suggested by Dabkowski & Gal (2017) and Fong & Vedaldi (2017), we then obtained a bounding box from each calculated saliency map based on certain thresholds. Specifically, we investigated three thresholding strategies suggested by Fong & Vedaldi (2017): value thresholding, energy thresholding, and mean thresholding. Following the evaluation protocol of Dabkowski & Gal (2017) and Fong & Vedaldi (2017), we then computed the Intersection over Union (IoU) of the extracted box and the ground truth. If an IoU is greater than 0.5, the corresponding box is marked as correct. Table A3 shows that decoy-enhanced saliency maps achieve higher accuracy than those of the vanilla gradient.
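The IoU computation used in this localization protocol is standard; a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x2 strip: IoU = 2 / (4 + 4 - 2) = 1/3.
score = iou((0, 0, 2, 2), (1, 0, 3, 2))
print(score)   # ~0.333; the box would be marked correct only if IoU > 0.5
```

The `max(0.0, ...)` clamps make disjoint boxes yield zero intersection rather than a negative area.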

A19 STATISTICS OF THE PERFORMANCE DIFFERENCES

In Section 4, Section A14, and Section A16, we varied the choice of K in the top-K normalizations, compared our method with each baseline approach, and showed the fidelity/sensitivity of each approach in box plots. To demonstrate the advantage of our method over the baselines, we further compared the fidelity/sensitivity difference between our method and the corresponding baseline approach. To be more specific, given two sets of fidelity/sensitivity scores (s_our and s_base) obtained from our method and a baseline approach respectively, we first computed their difference, i.e., diff = s_our − s_base. Then, we computed summary statistics of diff: the mean, the standard error, and the p-value of a paired t-test. For the paired t-test, our null hypothesis is H0: E[diff] ≥ 0. This means that if the p-value is larger than a threshold, we cannot reject the null hypothesis and must conclude that our method cannot outperform the corresponding baseline approach. As presented in Table A4 through Table A11, the overall experimental results align with those shown in the box plots, demonstrating the superiority of our method over the baselines.
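The paired comparison above reduces to a one-sample t-statistic on the per-input differences; the following numpy sketch computes the statistic (the p-value would come from the Student-t CDF, e.g. scipy.stats.t, omitted here; the toy scores are illustrative):

```python
import numpy as np

def paired_t_statistic(s_our, s_base):
    """t-statistic for the paired differences diff = s_our - s_base:
    t = mean(diff) / (std(diff) / sqrt(n)), using the sample std (ddof=1)."""
    diff = np.asarray(s_our) - np.asarray(s_base)
    n = diff.size
    se = diff.std(ddof=1) / np.sqrt(n)         # standard error of the mean difference
    return diff.mean() / se

# Toy fidelity scores on five inputs: our method scores consistently higher.
s_our = [0.82, 0.75, 0.90, 0.68, 0.77]
s_base = [0.70, 0.69, 0.81, 0.60, 0.72]
t = paired_t_statistic(s_our, s_base)
print(t > 0)  # -> True: the mean difference favors our method
```

Pairing the scores per input removes between-input variance from the comparison, which is why the paired test is more powerful here than a two-sample test on the pooled scores.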



Sixt et al. (2020) show that LRP (Binder et al., 2016) is independent of the parameters of certain layers. Ancona et al. (2018) show that the input gradient and DeepLIFT (Shrikumar et al., 2017) are strongly related to the integrated gradient; as such, we only select the integrated gradient. Note that we developed the theoretical properties by using the vanilla gradient as the base saliency method. Following other works that also utilized Lipschitz continuity to analyze DNNs (Szegedy et al., 2013; Ghorbani et al., 2017), we assume that F_ℓ is locally Lipschitz continuous around x, for ℓ = 1, 2, ..., L. Note that this inequality cannot be directly obtained from the Lipschitz inequality, because the gradient may not be continuous. For simplicity, we use the vanilla gradient method; the conclusion can be generalized to the other saliency methods considered in this paper.



Figure 1: The overview of the proposed method. (A) The workflow of creating decoy-enhanced saliency maps. (B) The operation of swapping image patches between original and decoy images.

Figure 2: Performance evaluation on ImageNet. (A) Visualization of saliency maps on foreground and background objects. (B) Fidelity comparison of the original saliency method (i.e., "Without decoys"), our method (i.e., "Decoys w/ range aggregation"), and its alternatives: replacing the decoy generation (Eqn. 2) with constant perturbation (i.e., "Constant w/ range aggregation") or noise perturbation (i.e., "Noise w/ range aggregation"); replacing the decoy aggregation (Eqn. 4) with mean aggregation (i.e., "Decoys w/ mean aggregation") (See Tab. A4 for more statistics about the performance differences between our method and the baselines). (C) Performance with regard to varying patch sizes and numbers of decoys.

Figure 3: Evaluation results obtained from the SST dataset. (A) Visualization of saliency maps for each word, where the normalized saliency values are shown for better distinction. (B) Fidelity comparison of the original saliency method, our method, and its alternatives. Here, the alternative methods represent the practice of replacing the decoy generation (Eqn. 2) with constant perturbation or noise perturbation as well as the practice of replacing the decoy aggregation (Eqn. 4) with mean aggregation (See Tab. A5 for more statistics about the performance differences).

Figure 4: Robustness to adversarial attacks on images. (A) Visualization of saliency maps under adversarial attacks. (B)∼(D) The decoy-enhanced saliency score is compared to the original saliency score under adversarial attacks, evaluated by sensitivity (See Tab. A6 for more statistics about the performance differences).

represents the weights of the neural network, and b_ℓ ∈ R^{d_ℓ} represents the biases, where d_0 = d and d_{L+1} = C. m_ℓ ∈ R^{d_ℓ} is the output of each hidden layer, with m_0 = x, and o ∈ R^C is the logits. The entry-wise softmax operator for target class c is defined as p_c = e^{o_c} / Σ_{c'=1}^{C} e^{o_{c'}}, for c ∈ {1, 2, ..., C}.
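For concreteness, the softmax operator defined above can be implemented as follows (a standard numerically stable variant that subtracts the maximum logit before exponentiation, which leaves the result unchanged):

```python
import numpy as np

def softmax(o):
    """Entry-wise softmax over logits o in R^C:
    p_c = exp(o_c) / sum_{c'} exp(o_{c'})."""
    z = o - o.max()   # shift for numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()
```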

Figure A5: Cascading randomization on the VGG16 network. The figure shows the original saliency map (first column) for the terrier. Progression from left to right corresponds to complete randomization of the pretrained VGG16 network weights from the top layer to the bottom layer. Note that, here, we followed the visualization method in Adebayo et al. (2018) to show the saliency maps, i.e., 0-1 normalization. The row labels share the same meanings as the column labels in Fig. 2.

Saliency maps generated on ResNet.

Figure A7: Visualization of saliency maps under different CNN architectures. Here, the column labels are the same as those in Fig. 2. The difference figures share the same colorbar as those in Fig. 2.

Table A2: Quantitative comparison of our method and baselines on the network intrusion dataset. We report the means and standard errors of the fidelity scores.

Figure A8: Visualization of saliency maps obtained by the original saliency methods and our decoy-enhanced versions. "ExpGrad" refers to Expected Gradient, "SGradRage" stands for SmoothGrad with range aggregation, and "IntUniform" represents integrated gradient with uniform baseline. The difference figures share the same colorbar as those in Fig. 2.

(a) Run time to optimize one decoy and calculate saliency map with the existing methods. (b) Run time to compute saliency maps with and without optimizing one decoy.

Figure A11: Visualization of saliency maps optimized using different initial λ.

Figure A14: Sensitivity comparison when selecting top 10% features on ImageNet (See Tab. A10 for more statistics about the performance differences).

Figure A15: Sensitivity comparison when selecting top 40% features on ImageNet (See Tab. A11 for more statistics about the performance differences).

Fig. A17, Fig. A16, and Fig. A18 provide more results of the fidelity and robustness evaluation. These results are consistent with those shown in Section 4.

Figure A18: Visualization of saliency maps on the perturbed images generated by using three attacks on VGG16. The column labels are the same as those in Fig. 2(A).

The mean and standard deviation of the SF score for gradient, integrated gradient, and SmoothGrad are: (10.23, 0.29), (10.37, 0.84), (9.34, 0.51). The mean and standard deviation of the SF score for gradient, integrated gradient, and SmoothGrad are: (0.07, 0.02), (0.01, 0.003), (0.06, 0.007).

The mean and standard deviation of the SF score for gradient, integrated gradient, and SmoothGrad are: (2.15, 0.50), (0.97, 0.56), (0.19, 0.06).

Figure A19: Demonstrations of decoy-enhanced saliency maps generated from each convolutional and pooling layer in VGG16.

The hyper-parameter choices of the proposed method on different target models.

Saliency maps generated on AlexNet.

ImageNet localization accuracy on VGG16 network using different thresholding strategies.


A6 IMPLEMENTATION DETAILS

The optimization function proposed to generate decoys is non-differentiable and very difficult to solve; hence, we instead solve an alternate formulation with the help of the following tricks. First, we introduce a Lagrange multiplier λ > 0 and augment the first constraint in the optimization function as a penalty in the objective function. This rules out the hyper-parameter in Eqn. 2. Second, we use projected gradient descent during the optimization to eliminate the mask constraint (i.e., (x̃ − x) ⊙ (1 − M) = 0). Specifically, after each standard gradient descent step, we enforce x̃ = x̃ ⊙ M + x ⊙ (1 − M). Third, we use the change-of-variable trick (Carlini & Wagner, 2017) to eliminate the feature value constraint (i.e., x̃ ∈ [x_min, x_max]^d). Instead of directly optimizing x̃, we first normalize it to [0, 1] and introduce an auxiliary variable x̂ satisfying x̃_i = (tanh(x̂_i) + 1)/2, for all i ∈ {1, 2, ..., d}.

We also varied the initial Lagrange multiplier λ over 10^1, 10^2, 10^3, 10^4, 10^5 and compared the differences among the generated decoy saliencies. Fig. A11 depicts the quantitative and qualitative comparison results. As shown in the figure, the different choices of initial λ all produce similar saliency maps, indicating a negligible influence upon our method. Then, we fixed m and increased the patch size over {3, 5, 7, 9, 11} and show the generated decoy saliencies in Fig. A12. The results show that varying the patch size within a certain range imposes only a negligible influence upon our method.

Recall that in Section 3.4, we mention that decoy masks are generated by sliding the swappable patch across a given input. With a constant stride of 1, the number of sliding windows is equal to (√d − P + 1)^2. In our implementation, to enable batch computing, we introduce m, which controls the number of sliding windows in each decoy.
Then, the number of decoys is 2(√d − P + 1)^2 / m. Fig. A12 shows the results of fixing m at 100 and varying P. In Fig. 2(C), we substantially varied both P and m and showed that our method is insensitive to variations in the number of decoys n. Note that the boxes with the same color in Fig. 2(C) are drawn by fixing P and varying m. Their slight difference indicates the robustness of our method to variations in m. The results in Fig. 2(C), A19, A11, and A12 indicate that we can expect to obtain stable decoy saliencies when the hyper-parameters are subtly varied. This is a critical characteristic because users do not need to worry about setting very precise hyper-parameters to obtain a desired saliency map.

In addition to the hyper-parameters introduced by our method, we also test the sensitivity of the fidelity evaluation results to the choice of K in the top-K normalization. Specifically, we varied K to select the top 10% and 40% important features and redrew the fidelity/sensitivity comparison figures in Fig. 2(B).
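The projection and change-of-variable steps described above can be sketched as follows. The function names are illustrative, and the loss term carrying the Lagrange penalty λ is omitted; only the two constraint-handling tricks are shown:

```python
import numpy as np

def project_to_mask(x_tilde, x, mask):
    """Projection step after each gradient descent update: enforce the mask
    constraint (x_tilde - x) * (1 - mask) = 0, so only entries with
    mask == 1 may differ from the original input x."""
    return x_tilde * mask + x * (1 - mask)

def to_pixel_range(x_hat):
    """Change-of-variable trick: optimize the unconstrained variable x_hat
    and map it into [0, 1] via x_tilde_i = (tanh(x_hat_i) + 1) / 2."""
    return (np.tanh(x_hat) + 1.0) / 2.0
```

In a full implementation, each optimization step would update x_hat by gradient descent on the penalized objective, map it back with `to_pixel_range`, and then apply `project_to_mask`.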

A17 OBJECT LOCALIZATION

We compare our method and the vanilla gradient on the object localization task (Dabkowski & Gal, 2017; Fong & Vedaldi, 2017), where the model was trained with class labels only, without access to any localization data. We carried out the ImageNet ILSVRC'14 localization task (Russakovsky et al., 2015).

