ROBUST ATTRIBUTIONS REQUIRE RETHINKING ROBUSTNESS METRICS

Abstract

For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either the attribution methods or the model training. Existing works measure attributional robustness by metrics such as top-k intersection, Spearman's rank-order correlation (or Spearman's ρ) or Kendall's rank-order correlation (or Kendall's τ) to quantify the change in feature attributions under input perturbation. However, we show that these metrics are fragile. That is, under such metrics, a simple random perturbation attack can seem to be as significant as more principled attributional attacks. We instead propose Locality-sENSitive (LENS) improvements of the above metrics, namely, LENS-top-k, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order. Our locality-sensitive metrics provide tighter bounds on attributional robustness and do not disproportionately penalize attribution methods for reasonable local changes. We show that the robust attribution methods proposed in recent works also reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field. Our empirical results on well-known benchmark datasets using well-known models and attribution methods support our observations and conclusions in this work.

1. INTRODUCTION

There has been an explosive increase in the use of deep neural network (DNN)-based models for many applications in recent years, which has resulted in a correspondingly growing interest in finding ways to interpret the decisions made by these models. Interpretability is an important aspect of responsible and trustworthy AI, and attribution methods are important for explaining and debugging real-world AI/ML systems. Attribution methods are used across application domains today (see (Gade et al., 2020) for a general discussion and (Tang et al.; Yap et al.; Oviedo et al., 2022; Oh & Jeong, 2020) for some examples), despite their limitations. These methods (Zeiler et al., 2010; Simonyan et al., 2014; Bach et al., 2015; Selvaraju et al., 2017; Chattopadhyay et al., 2018; Sundararajan et al., 2017; Shrikumar et al., 2016; Smilkov et al., 2017; Lundberg & Lee, 2017) seek to explain the decisions made by these models, instead of treating them as black boxes. With growing numbers of explanation methods (see (Lipton, 2018; Samek et al., 2019; Fan et al., 2021; Zhang et al., 2020; Zhang & Zhu, 2018) for surveys), there have also been recent concerted efforts on analyzing and proposing methods to ensure the robustness of DNN model explanations. This requires that the model explanations (also known as attributions) do not change with human-imperceptible changes in input (Chen et al., 2019; Sarkar et al., 2021). For example, an explanation code for a credit card failure should not change significantly for a small human-imperceptible change in input features, and the saliency maps explaining the risk prediction of a chest X-ray should not change significantly with a minor human-imperceptible change in the image. This is referred to as attributional robustness. From another perspective, DNN-based models are known to be vulnerable to imperceptible adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2014), which make them misclassify input images.
These small imperceptible perturbations are constructed using techniques like the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD) (Madry et al., 2018). Adversarial training with PGD is a well-known solution to obtain better adversarial robustness to attacks like FGSM and PGD. While adversarial robustness has received significant attention over the last few years (Ozdag, 2018; Silva & Najafirad, 2020), attributional robustness has received less attention. In an early effort, Ghorbani et al. (2019) provided a method to construct a small imperceptible perturbation which, when added to the input x, causes a drop in the commonly used similarity measures (top-k intersection, Spearman's rank-order correlation and Kendall's rank-order correlation) that quantify the change between the explanation map of the original image and that of the perturbed image (see Figure 1). In both cases shown in Figure 1, the intersection between the top-1000 pixels before and after perturbation is less than 0.16; thus, as a metric, it cannot really distinguish the two. Defenses against such attributional attacks have been proposed recently in (Chen et al., 2019; Singh et al., 2020; Wang et al., 2020; Sarkar et al., 2021), which focus on regularizing the loss function (Chen et al., 2019; Singh et al., 2020; Sarkar et al., 2021) or use smoothing techniques on gradients (Wang et al., 2020) to reduce the impact of the attack. Across all the efforts so far (Ghorbani et al., 2019; Chen et al., 2019; Singh et al., 2020; Wang et al., 2020; Sarkar et al., 2021), the robustness of attributions to input perturbation is measured using metrics such as top-k intersection, and rank correlations like Spearman's ρ and Kendall's τ, to estimate the quality of the attack.
While such metrics give a reasonable estimate when there are significant changes in attributions (see Figure 1 row 1), they are highly sensitive to minor local changes in attributions, even by one or a few pixel coordinate locations (see Figure 1 row 2). We, in fact, show (Section 3.1) that under such metrics, a random perturbation is as strong an attack as existing benchmark methods such as (Ghorbani et al., 2019). This may not be a true indicator of the robustness of attributions of a model, and can thereby mislead research efforts that build on current observations. Beyond highlighting this important issue, we propose locality-sensitive improvements of the above metrics that incorporate the locality of attributions along with their rank order. We show that such a locality-sensitive distance is upper-bounded by a metric based on symmetric set difference. Our key contributions are summarized below:
• We first observe that existing robustness metrics for model attributions overpenalize minor drifts in attribution, leading to a false sense of fragility. We go on to show that under such existing metrics, a random perturbation is as good an attack as principled methods like (Ghorbani et al., 2019).
• In order to address this issue, we propose Locality-sENSitive (LENS) improvements of existing metrics, namely, LENS-top-k, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order. Our locality-sensitive metrics do not disproportionately penalize attribution methods for reasonable local changes.
• We show that our proposed LENS variants are well-motivated by metrics defined on the space of attributions, and they provide tighter bounds on the attributional robustness of already known improvements in attribution methods and model training designed for better attributions.
• Our comprehensive empirical results on benchmark datasets and models used in existing work clearly support our aforementioned observations, and support the need for the LENS variants of the metrics.
• We also show that existing robust attribution methods implicitly reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field.

2. BACKGROUND AND RELATED WORK

We herein discuss background literature related to our work from three different perspectives: a brief summary of explanation/attribution methods in general, a review of recent work in attributional robustness (both attacks and defenses), and other recent related work. Attribution Methods. Existing efforts on explainability in DNN models can be broadly categorized as: local and global methods, model-agnostic and model-specific methods, or as post-hoc and ante-hoc (intrinsically interpretable) methods (Molnar, 2019; Lecue et al., 2021). Almost all the popular methods in use today - including methods to visualize weights and neurons (Simonyan et al., 2014; Zeiler & Fergus, 2014), guided backpropagation (Springenberg et al., 2015), CAM (Zhou et al.), GradCAM (Selvaraju et al., 2017), Grad-CAM++ (Chattopadhyay et al., 2018), LIME (Ribeiro et al., 2016), DeepLIFT (Shrikumar et al., 2016; 2017), LRP (Bach et al., 2015), Integrated Gradients (Sundararajan et al., 2017), SmoothGrad (Smilkov et al., 2017), DeepSHAP (Lundberg & Lee, 2017) and TCAV (Kim et al., 2018) - are post-hoc methods, which are used on top of a pre-trained DNN model as a separate layer/module to explain its predictions. We focus on such post-hoc attribution methods in this work. For a more detailed survey of explainability methods for DNN models, please see (Lecue et al., 2021; Molnar, 2019; Samek et al., 2019).

Robustness of Attributions.

With the growing numbers of attribution methods, there has also been a recent focus on identifying the desirable characteristics of such methods (Alvarez-Melis & Jaakkola, 2018; Adebayo et al., 2018; Yeh et al., 2019; Chalasani et al., 2020; Tomsett et al., 2020; Boggust et al., 2022; Agarwal et al., 2022). A key desired trait that has been highlighted by many of these efforts is robustness or stability of attributions, i.e., the explanation should not vary significantly within a small local neighborhood of the input (Alvarez-Melis & Jaakkola, 2018; Chalasani et al., 2020). Ghorbani et al. (2019) showed that well-known methods such as gradient-based attributions, DeepLIFT (Shrikumar et al., 2017) and Integrated Gradients (IG) (Sundararajan et al., 2017) are vulnerable, and provided an algorithm to construct a small imperceptible perturbation which, when added to the input, results in a change in the attribution. Slack et al. (2020) showed that methods like LIME (Ribeiro et al., 2016) and DeepSHAP (Lundberg & Lee, 2017) too are vulnerable to such manipulations. The identification of such vulnerabilities and attacks has subsequently led to multiple research efforts that have attempted to make a model's attributions robust. Chen et al. (2019) proposed a regularization-based approach, where an explicit regularizer term is added to the loss function to maintain the model gradient across input (IG, in particular) while training the DNN model. This was subsequently extended by (Sarkar et al., 2021; Singh et al., 2020; Wang et al., 2020), all of whom provide different training strategies and regularizers to improve attributional robustness of models. Each of these methods, including (Ghorbani et al., 2019), measures change in attribution before and after input perturbation using top-k intersection, and/or rank correlations like Spearman's ρ and Kendall's τ.
Such metrics have recently, in fact, further been used to understand issues surrounding attributional robustness (Wang & Kong, 2022). Other efforts that quantify stability of attributions in tabular data also use Euclidean distance or its variants (Alvarez-Melis & Jaakkola, 2018; Yeh et al., 2019; Agarwal et al., 2022). Each of these metrics looks for dimension-wise correlation or pixel-level matching between attribution maps before and after perturbation, and thus penalizes even a minor change in attribution (say, even by one pixel coordinate location). This may not be reasonable, and could even be misleading. In this work, we highlight the need to revisit such metrics, and propose a locality-sensitive variant that can be easily integrated into all existing metrics. Other Related Work. In other related efforts that have studied similar properties of attribution-based explanations, Carvalho et al. (2019) and Bhatt et al. (2020) stated that stable explanations should not vary too much between similar input samples, unless the model's prediction changes drastically. All the above-mentioned attributional attacks and defenses (Ghorbani et al., 2019; Sarkar et al., 2021; Singh et al., 2020; Wang et al., 2020) maintain this property, since they focus on input perturbations that change the attribution without changing the model prediction itself. Similarly, Arun et al. (2020) and Fel et al. (2022) introduced the notions of repeatability/reproducibility and generalizability respectively, both of which focus on the desired property that a trustworthy explanation must point to similar evidence across similar input images. In this work, we provide a practical metric to study this notion of similarity by considering locality-sensitive metrics.

3. ATTRIBUTIONAL ROBUSTNESS METRICS ARE FRAGILE

We first discuss the fragility of existing metrics, before presenting our locality-sensitive variants. The robustness of an attribution method has generally been measured in earlier work (Ghorbani et al., 2019; Chen et al., 2019) by computing the similarity between the attribution of an original input and the attribution of the same input with perturbations, and averaging this similarity across multiple input images. Similar to adversarial perturbations (Madry et al., 2018), such "attributional perturbations" are carefully constructed attack vectors of small ℓ∞ norm that do not change the model prediction but only the attributions on the perturbed inputs (Ghorbani et al., 2019). Using this, Ghorbani et al. (2019) show that many popular attribution methods are fragile. The common similarity measures between attributions are top-k intersection and rank correlation coefficients such as Spearman's ρ and Kendall's τ. Note that all of the above similarity measures depend only on the rank order of features in the attributions (e.g., rank order of pixels in images).

3.1. RANDOM VECTORS ARE ATTRIBUTIONAL ATTACKS UNDER EXISTING METRICS

Random vectors of small ℓ∞ norm are often used as baselines of input perturbations (both in the adversarial robustness (Silva & Najafirad, 2020) and the attributional robustness (Ghorbani et al., 2019) literature), since predictions of neural network models are known to be resilient to random perturbations of inputs. Previous work by Ghorbani et al. (2019) has shown random perturbations to be a reasonable baseline to compare against their attributional attack. Extending it further, we show that a single input-agnostic random perturbation happens to be an effective universal attributional attack if we measure attributional robustness using a weak metric based on top-k intersection.
In other words, considering even a random perturbation happens to be a good attributional attack under such metrics, we show that existing metrics for attributional robustness such as top-k intersection are extremely fragile, i.e., they would unfairly deem many attribution methods as fragile. Integrated Gradients (IG) is a well-known attribution method based on well-defined axiomatic foundations (Sundararajan et al., 2017) , which is commonly used in attributional robustness literature (Chen et al., 2019; Sarkar et al., 2021) . We take a naturally trained CNN model on MNIST and perturb the images using a random perturbation (an independent random perturbation per input image) as well as a single, input-agnostic or universal random perturbation for all images. Figure 2 shows a sample image from the MNIST dataset and the visual difference between the IG of the original image, the IG after adding a random perturbation, and the IG after adding a universal random perturbation. The IG after the universal random attack (Figure 2d ) is visually more dissimilar to the IG of the original image (Figure 2b ) than the IG of a simple random perturbation (Figure 2c ). (Note that top-k intersection between Figure 2b and 2c is only 0.62, although the two look similar. As stated in the caption, a locality-sensitive metric shows them to be closer in attribution however.) Similarly, Table 1 shows that under existing metrics to quantify attributional robustness of IG on a naturally trained CNN model, even a single, input-agnostic or universal random perturbation can sometimes be a more effective attributional attack than using an independent random perturbation for each input. The detailed description of our experimental setup can be found in Appendix B.
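The sensitivity of top-k intersection to small local changes can be seen on a toy example. The following is a minimal Python sketch, not the experimental code of this paper: the helper names `top_k_positions` and `top_k_intersection` and the synthetic 6 × 6 attribution maps are assumptions made purely for illustration. It moves a one-pixel-wide stroke of high attribution by a single column and compares the top-k sets before and after the move.

```python
def top_k_positions(attr, k):
    """Return the set of (row, col) positions holding the k largest values.

    `attr` is a list of rows (a 2-D attribution map). Ties are broken by
    position so that the result is deterministic.
    """
    cells = [(value, (i, j))
             for i, row in enumerate(attr)
             for j, value in enumerate(row)]
    cells.sort(key=lambda cell: (-cell[0], cell[1]))
    return {position for _, position in cells[:k]}


def top_k_intersection(attr_a, attr_b, k):
    """Fraction of the top-k positions shared by two attribution maps."""
    shared = top_k_positions(attr_a, k) & top_k_positions(attr_b, k)
    return len(shared) / k


# Hypothetical toy maps: a vertical stroke of high attribution in column 2,
# and the same stroke moved one pixel to the right (column 3).
n = 6
original = [[1.0 if j == 2 else 0.0 for j in range(n)] for _ in range(n)]
shifted = [[1.0 if j == 3 else 0.0 for j in range(n)] for _ in range(n)]

same_map_score = top_k_intersection(original, original, k=6)
shifted_score = top_k_intersection(original, shifted, k=6)
print(same_map_score, shifted_score)
```

In this toy setting the shifted map shares none of its top-6 positions with the original, so a one-pixel drift receives the same score that two unrelated maps could receive.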

4. LOCALITY-SENSITIVE METRICS FOR ATTRIBUTIONAL ROBUSTNESS

Section 3 raises the need to look beyond current metrics used to study attributional robustness. Current analyses of attributional attacks use similarity measures such as top-k intersection and rank correlation metrics such as Spearman's ρ, Kendall's τ that ignore where the top attributions are located in the image. As an aside, these are not metrics in the mathematical sense but theoretically interesting metrics have been derived from them in the ranking literature (Fagin et al., 2003) .

4.1. DEFINING LOCALITY-SENSITIVE METRICS FOR ATTRIBUTIONS

We propose a natural way to extend the existing similarity measures to further incorporate the locality of pixel attributions in images to derive more robust, locality-sensitive measures of attributional robustness. Let $a_{ij}(x)$ denote the attribution value or importance assigned to the $(i, j)$-th pixel in an input image $x$, and let $S_k(x)$ denote the set of $k$ pixel positions with the largest attribution values. Let $N_w(i, j) = \{(p, q) : i - w \le p \le i + w,\ j - w \le q \le j + w\}$ be the neighboring pixel positions within a $(2w + 1) \times (2w + 1)$ window around the $(i, j)$-th pixel. By a slight abuse of notation, we use $N_w(S_k(x))$ to denote $\bigcup_{(i,j) \in S_k(x)} N_w(i, j)$, that is, the set of all pixel positions that lie in the union of $(2w + 1) \times (2w + 1)$ windows around the top-$k$ pixels. For a given attributional perturbation $\mathrm{Att}(\cdot)$, let $T_k(x) = S_k(x + \mathrm{Att}(x))$ denote the top-$k$ pixels in attribution values after applying the attributional perturbation $\mathrm{Att}(x)$. The top-$k$ intersection metric outputs $|S_k(x) \cap T_k(x)| / k$. We propose Locality-sENSitive top-$k$ metrics (LENS-top-k) as $|N_w(S_k(x)) \cap T_k(x)| / k$ and $|S_k(x) \cap N_w(T_k(x))| / k$, along the lines of precision and recall used to evaluate ranking. We define Locality-sENSitive Spearman's ρ (LENS-Spearman) and Locality-sENSitive Kendall's τ (LENS-Kendall) metrics as rank correlation coefficients for the smoothed ranking orders according to the $\tilde{a}_{ij}(x)$'s and $\tilde{a}_{ij}(x + \mathrm{Att}(x))$'s, respectively, where $\tilde{a}$ denotes the $w$-smoothed attribution defined later in this section. These can be used to compare two different attributions for the same image, the same attribution method on two different images, or even two different attributions on two different images, as long as the attribution vectors lie in the same space, e.g., images of the same dimensions where attributions assign importance values to pixels. We show some theoretically interesting properties of locality-sensitive measures below.
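The two LENS-top-k quantities defined above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`top_k_positions`, `neighborhood`, `lens_top_k`) and the toy maps are hypothetical, and the labels "prec" and "recall" follow the naming used for the similarity measures in Section 4.2.

```python
def top_k_positions(attr, k):
    """Set S_k: positions of the k largest values of a 2-D attribution map."""
    cells = [(value, (i, j))
             for i, row in enumerate(attr)
             for j, value in enumerate(row)]
    cells.sort(key=lambda cell: (-cell[0], cell[1]))
    return {position for _, position in cells[:k]}


def neighborhood(positions, w):
    """N_w(S): union of (2w+1) x (2w+1) windows around every position in S.

    Positions falling outside the image are kept; they can never match a
    pixel position, so they do not affect the intersections below.
    """
    return {(p, q)
            for (i, j) in positions
            for p in range(i - w, i + w + 1)
            for q in range(j - w, j + w + 1)}


def lens_top_k(attr_orig, attr_pert, k, w):
    """Return (w-LENS-prec@k, w-LENS-recall@k) for two attribution maps."""
    s_k = top_k_positions(attr_orig, k)
    t_k = top_k_positions(attr_pert, k)
    lens_prec = len(s_k & neighborhood(t_k, w)) / k    # |S_k ∩ N_w(T_k)| / k
    lens_recall = len(neighborhood(s_k, w) & t_k) / k  # |N_w(S_k) ∩ T_k| / k
    return lens_prec, lens_recall


# Hypothetical toy maps: a stroke in column 2 versus the same stroke in column 3.
n = 6
original = [[1.0 if j == 2 else 0.0 for j in range(n)] for _ in range(n)]
shifted = [[1.0 if j == 3 else 0.0 for j in range(n)] for _ in range(n)]

exact = lens_top_k(original, shifted, k=6, w=0)  # w = 0 is plain top-k intersection
local = lens_top_k(original, shifted, k=6, w=1)  # 3 x 3 windows absorb the shift
print(exact, local)
```

With $w = 0$ both quantities reduce to the plain top-k intersection; with $w = 1$ the one-pixel shift in this toy example is fully absorbed by the 3 × 3 windows.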
Let $a_1$ and $a_2$ be two attribution vectors for two images, and let $S_k$ and $T_k$ be the sets of top-$k$ pixels in these images according to $a_1$ and $a_2$, respectively. We define a locality-sensitive top-$k$ distance between two attribution vectors $a_1$ and $a_2$ as $d_k^{(w)}(a_1, a_2) \overset{\mathrm{def}}{=} \mathrm{prec}_k^{(w)}(a_1, a_2) + \mathrm{recall}_k^{(w)}(a_1, a_2)$, where $\mathrm{prec}_k^{(w)}(a_1, a_2) \overset{\mathrm{def}}{=} |S_k \setminus N_w(T_k)| / k$ and $\mathrm{recall}_k^{(w)}(a_1, a_2) \overset{\mathrm{def}}{=} |T_k \setminus N_w(S_k)| / k$, similar to precision and recall used in the ranking literature, the key difference being the inclusion of neighborhood items based on locality. We state below a monotonicity property of $d_k^{(w)}(a_1, a_2)$ and upper bound it in terms of the symmetric set difference of top-$k$ attributions. Proposition 1. For any $w_1 \le w_2$, we have $d_k^{(w_2)}(a_1, a_2) \le d_k^{(w_1)}(a_1, a_2) \le |S_k \triangle T_k| / k$, where $\triangle$ denotes the symmetric set difference, i.e., $A \triangle B = (A \setminus B) \cup (B \setminus A)$. Combining $d_k^{(w)}(a_1, a_2)$ across different values of $k$ and $w$, we can define a distance $d(a_1, a_2) = \sum_{k=1}^{\infty} \alpha_k \sum_{w=0}^{\infty} \beta_w \, d_k^{(w)}(a_1, a_2)$, where $\alpha_k$ and $\beta_w$ are non-negative weights, monotonically decreasing in $k$ and $w$, respectively, such that $\sum_k \alpha_k < \infty$ and $\sum_w \beta_w < \infty$. We show that the distance defined above is upper bounded by a metric that is similar to the metrics proposed in Fagin et al. (2003) based on symmetric set difference of top-$k$ ranks to compare two rankings. Proposition 2. $d(a_1, a_2)$ defined above is upper bounded by $u(a_1, a_2)$ given by $u(a_1, a_2) = \sum_{k=1}^{\infty} \alpha_k \sum_{w=0}^{\infty} \beta_w \, |S_k \triangle T_k| / k$, and $u(a_1, a_2)$ defines a bounded metric on the space of attribution vectors. Note that top-k intersection, Spearman's ρ and Kendall's τ do not take the attribution values $a_{ij}(x)$ into account but only the rank order of pixels according to these values. We also define a locality-sensitive $w$-smoothed attribution as follows.
$\tilde{a}^{(w)}_{ij}(x) = \frac{1}{(2w + 1)^2} \sum_{(p,q) \in N_w(i,j),\ 1 \le p,q \le n} a_{pq}(x)$
We show that the $w$-smoothed attribution leads to a contraction in the $\ell_2$ norm commonly used in theoretical analysis of simple gradients as attributions. Proposition 3. For any inputs $x, y$ and any $w \ge 0$, $\|\tilde{a}^{(w)}(x) - \tilde{a}^{(w)}(y)\|_2 \le \|a(x) - a(y)\|_2$. Thus, any theoretical bounds on the attributional robustness of simple gradients in $\ell_2$ norm proved in previous works continue to hold for locality-sensitive $w$-smoothed gradients. For example, Wang et al. (2020) show the following Hessian-based bound on simple gradients. For an input $x$ and a classifier or model defined by $f$, let $\nabla_x(f)$ and $\nabla_y(f)$ be the simple gradients w.r.t. the inputs at $x$ and $y$. Theorem 3 in Wang et al. (2020) upper bounds the $\ell_2$ distance between the simple gradients of nearby points $\|x - y\|_2 \le \delta$ as $\|\nabla_x(f) - \nabla_y(f)\|_2 \lesssim \delta \, \lambda_{\max}(H_x(f))$, where $H_x(f)$ is the Hessian of $f$ w.r.t. the input at $x$ and $\lambda_{\max}(H_x(f))$ is its maximum eigenvalue. By Proposition 3 above, the same continues to hold for $w$-smoothed gradients, i.e., $\|\tilde{\nabla}^{(w)}_x(f) - \tilde{\nabla}^{(w)}_y(f)\|_2 \lesssim \delta \, \lambda_{\max}(H_x(f))$. The proofs of all the propositions above are included in Appendix A.
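The $w$-smoothed attribution and the contraction property of Proposition 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `smooth` and `l2_distance`, the 8 × 8 map size and the random attribution maps are hypothetical and stand in for real attributions. As in the definition above, window entries outside the image are dropped while the normalizer stays $(2w + 1)^2$.

```python
import math
import random


def smooth(attr, w):
    """w-smoothed attribution: window sum divided by (2w + 1)^2.

    Window entries outside the image are skipped, but the normalizer is
    always (2w + 1)^2, matching the definition in the text.
    """
    n_rows, n_cols = len(attr), len(attr[0])
    window_area = (2 * w + 1) ** 2
    smoothed = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            total = 0.0
            for p in range(max(0, i - w), min(n_rows, i + w + 1)):
                for q in range(max(0, j - w), min(n_cols, j + w + 1)):
                    total += attr[p][q]
            row.append(total / window_area)
        smoothed.append(row)
    return smoothed


def l2_distance(map_a, map_b):
    """Euclidean distance between two attribution maps of equal shape."""
    return math.sqrt(sum((u - v) ** 2
                         for row_a, row_b in zip(map_a, map_b)
                         for u, v in zip(row_a, row_b)))


# Hypothetical random attribution maps standing in for a(x) and a(y).
rng = random.Random(0)
n = 8
a_x = [[rng.random() for _ in range(n)] for _ in range(n)]
a_y = [[rng.random() for _ in range(n)] for _ in range(n)]

raw_distance = l2_distance(a_x, a_y)
smoothed_distances = [l2_distance(smooth(a_x, w), smooth(a_y, w))
                      for w in range(4)]
# Proposition 3: smoothing never increases the l2 distance (up to rounding).
contraction_holds = all(d <= raw_distance + 1e-12 for d in smoothed_distances)
print(raw_distance, smoothed_distances, contraction_holds)
```

For $w = 0$ the smoothed map equals the original map, so the two distances coincide; larger windows can only shrink the distance, which is what the proposition states.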

4.2. LOCALITY-SENSITIVE (LENS) ATTRIBUTIONAL ROBUSTNESS METRICS ARE STRONGER

The top-k intersection is a measure of similarity instead of distance. Therefore, in our experiments for attributional robustness, we use locality-sensitive similarity measures w-LENS-prec@k and w-LENS-recall@k to denote $1 - \mathrm{prec}_k^{(w)}(a_1, a_2)$ and $1 - \mathrm{recall}_k^{(w)}(a_1, a_2)$, respectively, where $a_1$ is the attribution of the original image and $a_2$ is the attribution of the perturbed image. For rank correlation coefficients such as Kendall's τ and Spearman's ρ, we compute w-LENS-Kendall and w-LENS-Spearman as the same Kendall's τ and Spearman's ρ but computed on the locality-sensitive w-smoothed attribution map $\tilde{a}^{(w)}$ instead of the original attribution map $a$. We also study how these similarity measures and their resulting attributional robustness measures change as we vary k and w. In this section, we measure the attributional robustness of Integrated Gradients (IG) on naturally trained models as top-k intersection, w-LENS-prec@k and w-LENS-recall@k between the IG of the original images and the IG of their perturbations obtained by various attacks. The attacks we consider are the top-t attack and the mass-center attack of Ghorbani et al. (2019) as well as random perturbation. All perturbations have ℓ∞ norm bounded by δ = 0.3 for MNIST, δ = 0.1 for Fashion MNIST, and δ = 8/255 for GTSRB and Flower datasets.
Comparison of top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k. Figure 3 shows that top-k intersection penalizes IG even for small, local changes. 1-LENS-prec@k and 1-LENS-recall@k values are always higher in comparison across all datasets in our experiments. Moreover, on both MNIST and Fashion MNIST, 1-LENS-prec@k is roughly 2x higher (above 90%) compared to top-k intersection (near 40%). In other words, an attack may appear stronger under a weaker measure of attributional robustness, if it ignores locality.
The effect of varying k. Figure 4 shows a large disparity between top-k intersection and 1-LENS-prec@k even when k is large.
Figure 4 shows that top-k intersection can be very low even when the IG of the original and the IG of the perturbed images are locally very similar, as indicated by high 1-LENS-prec@k. Our observation holds for the perturbations obtained by the top-t attack of (Ghorbani et al., 2019) as well as a random perturbation across all datasets in our experiments.
w-LENS-prec@k for varying w. Figure 5 shows that w-LENS-prec@k increases as we increase w to consider larger neighborhoods around the pixels with top attribution values. This holds for multiple perturbations, namely, the top-t attack and the mass-center attack by Ghorbani et al. (2019) as well as a random perturbation. Notice that the top-t attack of Ghorbani et al. (2019) is constructed specifically for the top-t intersection objective, and perhaps as a result, shows a larger change when we increase locality-sensitivity by increasing w in the robustness measure.
Comparison of Spearman's ρ and Kendall's τ with 1-LENS-Spearman and 1-LENS-Kendall. Figure 6 compares Spearman's ρ and Kendall's τ with 1-LENS-Spearman and 1-LENS-Kendall measures for attributional robustness. We observe that 1-smoothing of attribution maps increases the corresponding Kendall's τ and Spearman's ρ measures of attributional robustness, and this observation holds across all datasets in our experiments. As a result, we believe that 1-LENS-Spearman and 1-LENS-Kendall result in better or tighter attributional robustness measures than Spearman's ρ and Kendall's τ.
Since the attack of Ghorbani et al. (2019) is modifiable for any similarity objective, we use 1-LENS-prec@k to construct a new attributional attack for the 1-LENS-prec@k objective based on the k × k neighborhood of pixels. Surprisingly, we notice that it leads to a worse attributional attack, if we measure its effectiveness using the top-k intersection; see Figure 7. In other words, attributional attacks against locality-sensitive measures of attributional robustness are non-trivial and may require fundamentally different ideas.
Appendix E contains additional results with similar conclusions when Simple Gradients are used instead of Integrated Gradients (IG) for obtaining the attributions. A common approach to get robust attributions is to keep the attribution method unchanged but train the models differently in a way that the resulting attributions are more robust to small perturbations of inputs. Chen et al. (2019) proposed the first defense against the attributional attack of Ghorbani et al. (2019). Wang et al. (2020) also find that IG-NORM based training of Chen et al. (2019) gives models that exhibit attributional robustness against the top-k attack of Ghorbani et al. (2019) along with adversarially trained models. Figure 8 shows a sample image from the Flower dataset, where the Integrated Gradients (IG) of the original image and its perturbation by the top-k attack are visually similar for models that are either adversarially trained (trained using Projected Gradient Descent or PGD-trained, as proposed by (Madry et al., 2018)) or IG-SUM-NORM trained as in Chen et al. (2019). In other words, these differently trained models guard the sample image against the attributional top-k attack. Recent work by Nourelahi et al. (2022) has empirically studied the effectiveness of adversarially (PGD) trained models in obtaining better attributions, e.g., Figure 8b shows sharper attributions to features highlighting the ground-truth class.
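To make the w-LENS-Spearman and w-LENS-Kendall measures used in this section concrete, the following Python sketch computes both rank correlations on w-smoothed attribution maps. It is a hedged illustration, not the code used for the experiments: all function names are hypothetical, the tie handling (average ranks for Spearman's ρ, the tau-b variant for Kendall's τ) is an assumption that the text does not specify, and the toy maps are synthetic.

```python
import math


def smooth(attr, w):
    """w-smoothed attribution map (window sum divided by (2w + 1)^2)."""
    n_rows, n_cols = len(attr), len(attr[0])
    area = (2 * w + 1) ** 2
    return [[sum(attr[p][q]
                 for p in range(max(0, i - w), min(n_rows, i + w + 1))
                 for q in range(max(0, j - w), min(n_cols, j + w + 1))) / area
             for j in range(n_cols)]
            for i in range(n_rows)]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for idx in order[start:end + 1]:
            ranks[idx] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(u, v):
    """Spearman's rho as the Pearson correlation of average ranks."""
    ru, rv = average_ranks(u), average_ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    return cov / math.sqrt(var_u * var_v)


def kendall_tau_b(u, v):
    """Kendall's tau-b via an O(n^2) pair count (fine for small maps)."""
    concordant = discordant = ties_u = ties_v = 0
    n = len(u)
    for i in range(n):
        for j in range(i + 1, n):
            du, dv = u[i] - u[j], v[i] - v[j]
            ties_u += du == 0
            ties_v += dv == 0
            if du != 0 and dv != 0:
                if (du > 0) == (dv > 0):
                    concordant += 1
                else:
                    discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt(
        (n_pairs - ties_u) * (n_pairs - ties_v))


def lens_rank_correlations(attr_orig, attr_pert, w):
    """Return (w-LENS-Spearman, w-LENS-Kendall) for two attribution maps."""
    u = [value for row in smooth(attr_orig, w) for value in row]
    v = [value for row in smooth(attr_pert, w) for value in row]
    return spearman_rho(u, v), kendall_tau_b(u, v)


# Hypothetical toy maps: a stroke in column 2 versus the same stroke in column 3.
n = 6
original = [[1.0 if j == 2 else 0.0 for j in range(n)] for _ in range(n)]
shifted = [[1.0 if j == 3 else 0.0 for j in range(n)] for _ in range(n)]

rho_0, tau_0 = lens_rank_correlations(original, shifted, w=0)  # plain rho, tau
rho_1, tau_1 = lens_rank_correlations(original, shifted, w=1)  # 1-LENS variants
print(rho_0, tau_0, rho_1, tau_1)
```

On this toy pair the plain coefficients are negative, while the 1-smoothed coefficients are positive, which mirrors the direction of the effect reported for Figure 6 without reproducing any of its numbers.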

5. CONNECTION TO ROBUST ATTRIBUTION TRAINING METHODS

Figure 9 shows that PGD-trained and IG-SUM-NORM trained models have more robust Integrated Gradients (IG) in comparison to their naturally trained counterparts, and this holds for the previously used measures of attributional robustness (e.g., top-k intersection) as well as the new locality-sensitive measures we propose (e.g., 1-LENS-prec@k, 1-LENS-recall@k) across all datasets in our experiments. The top-k attack of Ghorbani et al. (2019) is not a threat to IG if we simply measure its effectiveness using 1-LENS-prec@k (Figure 9(a-c) for MNIST, Fashion MNIST and GTSRB), and moreover, use IG on PGD-trained or IG-SUM-NORM trained models (Figure 9(d) for Flower).
Figure 9: Average top-k intersection, 1-LENS-prec@k, 1-LENS-recall@k measured between IG(original image) and IG(perturbed image) for models that are naturally trained, PGD-trained and IG-SUM-NORM trained. The perturbation used is the top-t attack of Ghorbani et al. (2019). Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and (d) Flower datasets.
The above observation about robustness of Integrated Gradients (IG) for PGD-trained and IG-SUM-NORM trained models holds even when we use 1-LENS-Spearman and 1-LENS-Kendall measures to quantify the attributional robustness to the top-k attack of Ghorbani et al. (2019), and it holds across all datasets used in our experiments; see Figure 10. Moreover, the 1-LENS-Kendall and 1-LENS-Spearman values in Figure 10 are always higher than the corresponding Kendall's τ and Spearman's ρ values, which further strengthens the conclusions from previous papers that IG on PGD-trained and IG-SUM-NORM trained models gives better attributions. Chalasani et al. (2020) show theoretically that ℓ∞-adversarial training (PGD-training) leads to stable Integrated Gradients (IG) under the ℓ1 norm. They also show empirically that PGD-training leads to sparse attributions (IG and DeepSHAP) when sparseness is measured indirectly as the change in the Gini index.
Our empirical results extend their theoretical observation about stability of IG for PGD-trained models, as we measure local stability in terms of both the attribution values as well as their corresponding positions in the image.

6. CONCLUSION AND FUTURE WORK

We show that the fragility of attributions is an effect of using fragile robustness metrics such as top-k intersection that only look at the rank order of attributions and fail to capture the closeness of pixel positions with high attributions. We highlight the need for locality-sensitive metrics for attributional robustness and propose some natural locality-sensitive extensions of the existing metrics. Theoretical understanding of locality-sensitive metrics of attributional robustness, constructing stronger attributional attacks for these metrics, and using them to build attributionally robust models are some important future directions. Reproducibility Statement. The anonymous code for the paper can be found at this anonymous link.

SUPPLEMENTARY MATERIAL

The Appendix contains proofs, additional experiments to show that the trends hold across different datasets, and other ablation studies which could not be included in the main paper due to space constraints.

A PROOFS FROM SECTION 4

We restate and prove Proposition 1 below.
Proposition 4. For any $w_1 \le w_2$, we have $d_k^{(w_2)}(a_1, a_2) \le d_k^{(w_1)}(a_1, a_2) \le |S_k \triangle T_k| / k$, where $\triangle$ denotes the symmetric set difference, i.e., $A \triangle B = (A \setminus B) \cup (B \setminus A)$.
Proof. The inequalities follow immediately using $T \subseteq N_{w_1}(T) \subseteq N_{w_2}(T)$ for any $T$, and hence $|S \setminus N_{w_2}(T)| \le |S \setminus N_{w_1}(T)| \le |S \setminus T|$ for any $S$ and $T$. Applying this to both terms of $d_k^{(w)}$ and using $|S_k \setminus T_k| + |T_k \setminus S_k| = |S_k \triangle T_k|$ gives the claim.
We restate and prove Proposition 2 below.
Proposition 5. $d(a_1, a_2)$ defined above is upper bounded by $u(a_1, a_2)$ given by $u(a_1, a_2) = \sum_{k=1}^{\infty} \alpha_k \sum_{w=0}^{\infty} \beta_w \, |S_k \triangle T_k| / k$, and $u(a_1, a_2)$ defines a bounded metric on the space of attribution vectors.
Proof. The proof follows from Proposition 1 and the fact that the symmetric set difference satisfies the triangle inequality.
We restate and prove Proposition 3 below.
Proposition 6. For any inputs $x, y$ and any $w \ge 0$, $\|\tilde{a}^{(w)}(x) - \tilde{a}^{(w)}(y)\|_2 \le \|a(x) - a(y)\|_2$.
Proof.
$\|\tilde{a}^{(w)}(x) - \tilde{a}^{(w)}(y)\|_2^2 = \sum_{1 \le i,j \le n} \left( \tilde{a}^{(w)}_{ij}(x) - \tilde{a}^{(w)}_{ij}(y) \right)^2$
$= \sum_{1 \le i,j \le n} \frac{1}{(2w + 1)^4} \left( \sum_{(p,q) \in N_w(i,j),\ 1 \le p,q \le n} \left( a_{pq}(x) - a_{pq}(y) \right) \right)^2$
$\le \sum_{1 \le i,j \le n} \frac{(2w + 1)^2}{(2w + 1)^4} \sum_{(p,q) \in N_w(i,j),\ 1 \le p,q \le n} \left( a_{pq}(x) - a_{pq}(y) \right)^2$ (by the Cauchy-Schwarz inequality)
$= \frac{1}{(2w + 1)^2} \sum_{1 \le i,j \le n} \sum_{(p,q) \in N_w(i,j),\ 1 \le p,q \le n} \left( a_{pq}(x) - a_{pq}(y) \right)^2$
$\le \frac{(2w + 1)^2}{(2w + 1)^2} \sum_{1 \le p,q \le n} \left( a_{pq}(x) - a_{pq}(y) \right)^2$ (because each $(p, q)$ appears in at most $(2w + 1)^2$ possible $N_w(i, j)$'s)
$= \|a(x) - a(y)\|_2^2$.
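As a numerical sanity check of the monotonicity and upper bound in Proposition 1 (restated above as Proposition 4), the following Python sketch evaluates $d_k^{(w)}$ for increasing $w$ on two random attribution maps. It is an illustration under stated assumptions rather than part of the proof: the function names, the 8 × 8 map size, $k = 10$ and the seeded random maps are all hypothetical choices.

```python
import random


def top_k_positions(attr, k):
    """Set of positions of the k largest values of a 2-D attribution map."""
    cells = [(value, (i, j))
             for i, row in enumerate(attr)
             for j, value in enumerate(row)]
    cells.sort(key=lambda cell: (-cell[0], cell[1]))
    return {position for _, position in cells[:k]}


def neighborhood(positions, w):
    """N_w(S): union of (2w+1) x (2w+1) windows around the positions in S."""
    return {(p, q)
            for (i, j) in positions
            for p in range(i - w, i + w + 1)
            for q in range(j - w, j + w + 1)}


def lens_distance(attr_1, attr_2, k, w):
    """d_k^(w) = (|S_k \\ N_w(T_k)| + |T_k \\ N_w(S_k)|) / k."""
    s_k = top_k_positions(attr_1, k)
    t_k = top_k_positions(attr_2, k)
    missed_by_t = len(s_k - neighborhood(t_k, w))  # numerator of prec_k^(w)
    missed_by_s = len(t_k - neighborhood(s_k, w))  # numerator of recall_k^(w)
    # Add the integer counts before dividing to avoid rounding artifacts.
    return (missed_by_t + missed_by_s) / k


def symmetric_difference_bound(attr_1, attr_2, k):
    """Upper bound |S_k (symmetric difference) T_k| / k from Proposition 1."""
    s_k = top_k_positions(attr_1, k)
    t_k = top_k_positions(attr_2, k)
    return len(s_k ^ t_k) / k


# Hypothetical random attribution maps standing in for a_1 and a_2.
rng = random.Random(0)
n, k = 8, 10
a_1 = [[rng.random() for _ in range(n)] for _ in range(n)]
a_2 = [[rng.random() for _ in range(n)] for _ in range(n)]

distances = [lens_distance(a_1, a_2, k, w) for w in range(4)]
bound = symmetric_difference_bound(a_1, a_2, k)
proposition_1_holds = (
    all(distances[w + 1] <= distances[w] for w in range(3))
    and distances[0] <= bound
)
print(distances, bound, proposition_1_holds)
```

At $w = 0$ the neighborhood of a set is the set itself, so the distance coincides with the symmetric-difference bound; larger windows can only lower it.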

B DETAILS OF EXPERIMENTAL SETUP

This section gives a detailed description of the setup used in our experiments.

Datasets: For all the datasets used in this work, we use the standard, publicly available benchmark train-test split. MNIST dataset consists of 70,000 images of 28 × 28 size, divided into 10 classes: 60,000 used for training and 10,000 for testing.

Adversarial training: We use the standard setup as used by Chen et al. (2019). We perform PGD-based adversarial training with the provided budget using the following settings (number of steps, step size) for PGD: MNIST (40, 0.01), Fashion MNIST (20, 0.01), GTSRB (7, 8/255), Flower (7, 2/255).

Training for Attributional Robustness: We use the IG-SUM-NORM objective function of Chen et al. (2019) for training on all the datasets, with the exact settings given in their paper and code (https://github.com/jfc43/robust-attribution-regularization).

Hardware Configuration: We used a server with 4 Nvidia GeForce GTX 1080i GPUs and a server with 8 Nvidia Tesla V100 GPUs to run the experiments in the paper.

Figure 12: Attributional robustness of IG on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the IG of the original images and the IG of their perturbations obtained by the random attack (Ghorbani et al., 2019) across different datasets.

Figure 13: Attributional robustness of IG on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the IG of the original images and the IG of their perturbations obtained by the mass-center attack (Ghorbani et al., 2019) across different datasets.

[Plot labels: y-axis Average top-k Intersection; legend top-k, 1-LENS-prec@k, 2-LENS-prec@k, 3-LENS-prec@k; panel (c) IG: center of mass.]

Figure 14: Attributional robustness of IG on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection and w-LENS-prec@k between the IG of the original images and the IG of their perturbations. Perturbations are obtained by the top-t attack and the mass-center attack (Ghorbani et al., 2019) as well as a random perturbation. The plots show the effect of varying w on Flower dataset.

E EXPERIMENTS WITH SIMPLE GRADIENTS

We observe that our conclusions about Integrated Gradients (IG) continue to hold qualitatively, even if we replace IG with Simple Gradients as our attribution method.

[Plot labels: legend top-k, 1-LENS-recall@k, 1-LENS-prec@k; panel (d) Flower.]

Figure 16: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the random attack (Ghorbani et al., 2019) across different datasets.

Figure 17: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the mass-center attack (Ghorbani et al., 2019) across different datasets.

[Plot labels: y-axis Average top-k Intersection; legend top-k, 1-LENS-prec@k, 2-LENS-prec@k, 3-LENS-prec@k; panel (c) SG: center of mass.]

Figure 18: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection and w-LENS-prec@k between the IG of the original images and the IG of their perturbations. Perturbations are obtained by the top-t attack and the mass-center attack (Ghorbani et al., 2019) as well as a random perturbation. The plots show the effect of varying w on Flower dataset.

F ADDITIONAL RESULTS FOR PGD-TRAINED AND IG-SUM-NORM TRAINED MODELS

Figure 21 and Figure 19 show the impact of k in top-k for the adversarially (PGD) trained and attributionally (IG-SUM-NORM) trained networks, respectively. An important point is that even with a small number of features, LENS is able to cross 70-80%, which supports the observation of sparsity and stability of attributions achieved by adversarially (PGD) trained models by Chalasani et al. (2020). Similarly, the experiments with different values of w for w-LENS-top-k in Figure 20 clearly indicate that, due to the stability properties, at lower window sizes LENS is able to cross the 80% intersection quickly. This supports that our metric captures local stability well. While the above considers only the top-k version of LENS, Figure 10 extends the observation to LENS-Spearman and LENS-Kendall, showing that with LENS using a smoothing of 3 × 3, the maps from adversarially and attributionally robust models have a higher top-k intersection, above 70%, in comparison to the naturally trained model.
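The smoothing-based variant mentioned above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes LENS-Spearman is Spearman's ρ computed on attribution maps after $(2w+1) \times (2w+1)$ box smoothing (so $w = 1$ gives the 3 × 3 smoothing), with the same normalization as in Proposition 6. The function names `smooth`, `average_ranks`, `spearman` and `lens_spearman`, and the toy maps, are hypothetical.

```python
import math


def smooth(a, w):
    """(2w+1) x (2w+1) box smoothing; out-of-range pixels are dropped (assumed)."""
    n = len(a)
    norm = (2 * w + 1) ** 2
    return [
        [
            sum(
                a[p][q]
                for p in range(max(0, i - w), min(n, i + w + 1))
                for q in range(max(0, j - w), min(n, j + w + 1))
            ) / norm
            for j in range(n)
        ]
        for i in range(n)
    ]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(u, v):
    """Spearman's rho as the Pearson correlation of average ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    ru, rv = average_ranks(u), average_ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((x - mu) * (y - mv) for x, y in zip(ru, rv))
    var_u = sum((x - mu) ** 2 for x in ru)
    var_v = sum((y - mv) ** 2 for y in rv)
    return cov / math.sqrt(var_u * var_v)


def flatten(m):
    """Flatten a 2-D map into a list in row-major order."""
    return [x for row in m for x in row]


def lens_spearman(a1, a2, w):
    """Spearman's rho between the two maps after smoothing both with window w."""
    return spearman(flatten(smooth(a1, w)), flatten(smooth(a2, w)))


# Toy maps: a row-major ramp and its transpose (illustrative data only).
size = 6
ramp = [[float(i * size + j) for j in range(size)] for i in range(size)]
ramp_t = [[ramp[j][i] for j in range(size)] for i in range(size)]

print("plain Spearman:", spearman(flatten(ramp), flatten(ramp_t)))
print("1-LENS-Spearman:", lens_spearman(ramp, ramp_t, 1))
```

With $w = 0$ the smoothing is the identity and `lens_spearman` reduces to the plain rank correlation. An analogous variant for Kendall's τ would replace `spearman` with a τ computation on the same smoothed maps.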



https://github.com/jfc43/robust-attribution-regularization



Figure 1: Attributional attack on Flower dataset using Ghorbani et al. (2019) method on a ResNet model. Columns 1 and 3 show the image before and after an imperceptible perturbation; Columns 2 and 4 show the corresponding attributions. Note the change in attributions despite no perceptible change in input. Row 1 shows a distinct change in top-k pixels with the highest attribution; Row 2 shows only a local change in top-k pixels with the highest attribution still within the object. The intersection between the top-1000 pixels before and after perturbation is less than 0.16 in both cases; thus, as a metric it cannot really distinguish the two.

Figure 2: Sample image from MNIST shows that the Integrated Gradients (IG) after a universal random perturbation are more dissimilar than IG after a simple, independent random perturbation for each input. All perturbations have random ±1 coordinates, scaled down to have ℓ∞ norm = 0.3. (c) has a top-k intersection of 0.68, while (d) has a top-k intersection of 0.62. With our locality-sensitive metric, (c) has LENS-top-k of 0.99 and (d) has LENS-top-k of 1.0.

Figure 3: Attributional robustness of IG on naturally trained models measured as average top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between IG(original image) and IG(perturbed image) obtained by the top-t attack (Ghorbani et al., 2019) across different datasets.

Figure 4: Attributional robustness of IG on naturally trained models measured as average top-k intersection and 1-LENS-prec@k between IG(original image) and IG(perturbed image). Perturbations are obtained by the top-t attack (Ghorbani et al., 2019) and random perturbation. The plots show how the above measures change with varying k across different datasets.

Figure 5: Attributional robustness of IG on naturally trained models measured as average top-k intersection and w-LENS-prec@k between IG(original image) and IG(perturbed image). Perturbations are obtained by the top-t attack and the mass-center attack (Ghorbani et al., 2019) as well as random perturbation. The plots show the effect of varying w on Flower dataset.

Figure 7: Average top-k intersection between IG(original image) and IG(perturbed image) on naturally trained models where the perturbation is obtained by incorporating 1-LENS-prec@k objective in the Ghorbani et al. (2019) attack.

Modifying the attack of Ghorbani et al. (2019) for 1-LENS-prec@k objective. A natural question is whether the original top-k attack of Ghorbani et al. (2019) seems weaker under locality-sensitive robustness measures only because the attack was specifically constructed for a corresponding top-k intersection objective. Since the construction of the attack in Ghorbani et al. (2019) is modifiable for any similarity objective, we use 1-LENS-prec@k to construct a new attributional attack for 1-LENS-prec@k objective based on the k × k neighborhood of pixels. Surprisingly, we notice that it leads to a worse attributional attack, if we measure its effectiveness using the top-k intersection; see Figure 7. In other words, attributional attacks against locality-sensitive measures of attributional robustness are non-trivial and may require fundamentally different ideas.
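The difference between top-k intersection and a locality-sensitive overlap can be illustrated on toy maps. The sketch below is not the authors' implementation. It assumes that $N_w(T)$ is the set of in-range pixels within a $(2w+1) \times (2w+1)$ window of some pixel in $T$, and that the locality-sensitive overlap is $|S_k \cap N_w(T_k)| / k$. Which direction the paper calls precision and which recall is not fixed here, and the names `top_k`, `neighborhood`, `top_k_intersection`, `lens_overlap` and `make_map` are hypothetical.

```python
def top_k(attr, k):
    """Set of (row, col) positions of the k largest attribution values."""
    cells = [(attr[i][j], i, j) for i in range(len(attr)) for j in range(len(attr[0]))]
    cells.sort(key=lambda c: -c[0])  # stable sort: ties keep row-major order
    return {(i, j) for _, i, j in cells[:k]}


def neighborhood(pixels, w, height, width):
    """N_w(S): in-range pixels within a (2w+1) x (2w+1) window of a pixel in S."""
    out = set()
    for i, j in pixels:
        for p in range(max(0, i - w), min(height, i + w + 1)):
            for q in range(max(0, j - w), min(width, j + w + 1)):
                out.add((p, q))
    return out


def top_k_intersection(a1, a2, k):
    """Fraction of top-k pixels shared exactly by the two maps."""
    return len(top_k(a1, k) & top_k(a2, k)) / k


def lens_overlap(a1, a2, k, w):
    """Fraction of top-k pixels of a1 that lie within distance w of a top-k pixel of a2."""
    height, width = len(a1), len(a1[0])
    s, t = top_k(a1, k), top_k(a2, k)
    return len(s & neighborhood(t, w, height, width)) / k


def make_map(hot, size=6):
    """Toy attribution map: 1.0 on the given pixels, 0.0 elsewhere."""
    return [[1.0 if (i, j) in hot else 0.0 for j in range(size)] for i in range(size)]


original = make_map({(1, 1), (1, 2)})
local_shift = make_map({(2, 1), (2, 2)})  # top pixels move by one row
far_shift = make_map({(5, 4), (5, 5)})    # top pixels move to a far corner

for name, other in [("local shift", local_shift), ("far shift", far_shift)]:
    print(
        name,
        "top-k:", top_k_intersection(original, other, 2),
        "1-LENS overlap:", lens_overlap(original, other, 2, 1),
    )
```

In this toy case both shifts have top-k intersection 0, while the locality-sensitive overlap with $w = 1$ separates the local shift (1.0) from the far shift (0.0), mirroring the two rows of Figure 1. The overlap is also non-decreasing in $w$, in line with Proposition 4.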

Figure 8: Sample image from Flower whose IG for PGD-trained (top) and IG-SUM-NORM trained (bottom) models seem robust to perturbations by the top-k attack of Ghorbani et al. (2019).

Figure 10: Average Kendall's τ, Spearman's ρ, 1-LENS-Kendall and 1-LENS-Spearman used to measure the attributional robustness of IG on naturally trained, PGD-trained and IG-SUM-NORM trained models. The perturbation used is the top-k attack of Ghorbani et al. (2019). Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and (d) Flower datasets.

Figure 11: Sample top-k highlighting using Flower.

Figure 15: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the top-k attack (Ghorbani et al., 2019) across different datasets.

Figure 19: Comparison of top-k intersection and 1-LENS-prec@k for different values of k used in the top-k evaluation, while the attack uses a fixed k, for the IG-SUM-NORM trained network.

Figure 20: Attributional robustness of IG on naturally, PGD and IG-SUM-NORM trained models measured as top-k intersection and w-LENS-prec@k between the IG of the original images and the IG of their perturbations. Perturbations are obtained by the top-t attack and the mass-center attack (Ghorbani et al., 2019) as well as a random perturbation. The plots show the effect of varying w on Flower dataset.

Attributional robustness of IG on naturally trained models measured using average top-k intersection, Spearman's ρ and Kendall's τ between IG(original image) and IG(perturbed image). k = 100 for MNIST, Fashion MNIST, GTSRB and k = 1000 for Flower. Bold entries indicate where the universal random perturbation is a stronger attributional attack.

Fashion MNIST dataset consists of 70,000 images of 28 × 28 size, divided into 10 classes: 60,000 used for training and 10,000 for testing. GTSRB dataset consists of 51,739 images of 32 × 32 size, divided into 43 classes: 34,699 used for training, 4,410 for validation and 12,630 for testing. Flower dataset consists of 1,360 images of 128 × 128 size, divided into 17 classes: 1,224 used for training and 136 for testing. GTSRB and Flower datasets were preprocessed exactly as given in Chen et al. (2019) [Appendix C] for consistency of results.

Architectures: For MNIST, Fashion MNIST, GTSRB and Flower datasets we use the exact architectures as used by Chen et al. (2019).

We use the same comparison metrics as used by Ghorbani et al. (2019) and Chen et al. (2019), namely top-k pixels intersection, Spearman's ρ and Kendall's τ rank correlation, to compare attribution maps of the original and perturbed images. The k value for the top-k attack, along with settings like step size, number of steps and number of times the attack is applied, is as used by Chen et al. (2019) for the attack construction: MNIST(200,0.

Sample sizes for attribution robustness evaluations:

IG-based experiments: For MNIST, Fashion MNIST and Flower, with fixed top-k attack similar to Chen et al. (2019), the complete test set was used to obtain the results. For GTSRB, a random sample of size 1000 was used for all the experiments.

Simple gradient based experiments: For MNIST and Fashion MNIST, a random sample of 2500/1000 from the test set was used. For GTSRB, a random sample of size 1000 was used, and for the Flower dataset, the complete test set.

