ROBUST ATTRIBUTIONS REQUIRE RETHINKING ROBUSTNESS METRICS

Abstract

For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either the attribution methods or the model training. Existing works measure attributional robustness by metrics such as top-k intersection, Spearman's rank-order correlation (Spearman's ρ) or Kendall's rank-order correlation (Kendall's τ) to quantify the change in feature attributions under input perturbation. However, we show that these metrics are themselves fragile: under such metrics, a simple random perturbation attack can appear as significant as more principled attributional attacks. We instead propose Locality-sENSitive (LENS) improvements of the above metrics, namely LENS-top-k, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order. Our locality-sensitive metrics provide tighter bounds on attributional robustness and do not disproportionately penalize attribution methods for reasonable local changes. We show that the robust attribution methods proposed in recent works also reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field. Our empirical results on benchmark datasets using well-known models and attribution methods support our observations and conclusions in this work.

1. INTRODUCTION

There has been an explosive increase in the use of deep neural network (DNN)-based models across applications in recent years, accompanied by a correspondingly increasing interest in interpreting the decisions these models make. Interpretability is an important aspect of responsible and trustworthy AI, and attribution methods are important for explaining and debugging real-world AI/ML systems. Attribution methods are used across application domains today (see (Gade et al., 2020) for a general discussion and (Tang et al.; Yap et al.; Oviedo et al., 2022; Oh & Jeong, 2020) for some examples), despite their limitations. These methods (Zeiler et al., 2010; Simonyan et al., 2014; Bach et al., 2015; Selvaraju et al., 2017; Chattopadhyay et al., 2018; Sundararajan et al., 2017; Shrikumar et al., 2016; Smilkov et al., 2017; Lundberg & Lee, 2017) explain the decisions made by DNN models instead of treating them as black boxes.

With a growing number of explanation methods (see (Lipton, 2018; Samek et al., 2019; Fan et al., 2021; Zhang et al., 2020; Zhang & Zhu, 2018) for surveys), there have also been recent concerted efforts to analyze and ensure the robustness of DNN model explanations. Robustness here requires that the model explanations (also known as attributions) do not change under human-imperceptible changes in the input (Chen et al., 2019; Sarkar et al., 2021). For example, an explanation code for a credit card failure should not change significantly for a small human-imperceptible change in input features, and the saliency map explaining the risk prediction of a chest X-ray should not change significantly with a minor human-imperceptible change in the image. This property is referred to as attributional robustness.

From another perspective, DNN-based models are known to be vulnerable to imperceptible adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2014), which make them misclassify input images. These small imperceptible perturbations are constructed using techniques such as the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD) (Madry et al., 2018); adversarial training with PGD is a well-known way to obtain better adversarial robustness against attacks like FGSM and PGD. While adversarial robustness has received significant attention over the last few years (Ozdag, 2018; Silva & Najafirad, 2020), attributional robustness has received less attention. In an early effort, Ghorbani et al. (2019) provided a method to construct a small imperceptible perturbation which, when added to the input x, causes a sharp drop in the commonly used correlation measures, such as top-k intersection, Spearman's rank-order correlation, or Kendall's rank-order correlation, that quantify the change between the explanation map of the original image and that of the perturbed image (see Figure 1). Across all the efforts so far (Ghorbani et al., 2019; Chen et al., 2019; Singh et al., 2020; Wang et al., 2020; Sarkar et al., 2021), the robustness of attributions under input perturbation is measured using metrics such as top-k intersection and rank correlations like Spearman's ρ and Kendall's τ to estimate the quality of the attack. While such metrics give a reasonable estimate when there are significant changes in attributions (see Figure 1, row 1), they are highly sensitive to minor local changes in attributions, even by one or a few pixel coordinate locations (see Figure 1, row 2).
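To make these measures concrete, the following is a minimal sketch (assuming NumPy/SciPy and 2D attribution maps; the function names and flattening convention are ours, not a reference implementation) of how the three commonly used metrics are typically computed between an original and a perturbed attribution map.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def topk_intersection(attr_a, attr_b, k=1000):
    """Fraction of the k highest-attribution pixels shared by two maps."""
    a = np.argsort(attr_a.ravel())[-k:]   # indices of top-k pixels in map A
    b = np.argsort(attr_b.ravel())[-k:]   # indices of top-k pixels in map B
    return len(np.intersect1d(a, b)) / k

def rank_correlations(attr_a, attr_b):
    """Spearman's rho and Kendall's tau over flattened attribution maps."""
    a, b = attr_a.ravel(), attr_b.ravel()
    rho, _ = spearmanr(a, b)
    tau, _ = kendalltau(a, b)
    return rho, tau
```

In the setting of Figure 1, a call like topk_intersection(orig_attr, pert_attr, k=1000) returns a value below 0.16 for both rows, which is precisely why the plain metric cannot separate a genuine attribution shift from a small local drift.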
We, in fact, show (Section 3.1) that under such metrics, a random perturbation is as strong an attack as existing benchmark methods such as Ghorbani et al. (2019). This may not be a true indicator of the robustness of a model's attributions, and is thereby misleading for research efforts that build on current observations. Beyond highlighting this important issue, we propose locality-sensitive improvements of the above metrics that incorporate the locality of attributions along with their rank order. We show that such a locality-sensitive distance is upper-bounded by a metric based on symmetric set difference. Our key contributions are summarized below:

• We first observe that existing robustness metrics for model attributions overpenalize minor drifts in attribution, leading to a false sense of fragility. We go on to show that under such existing metrics, a random perturbation is as good an attack as principled methods like Ghorbani et al. (2019).
• To address this issue, we propose Locality-sENSitive (LENS) improvements of existing metrics, namely LENS-top-k, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order (a concrete sketch follows this list). Our locality-sensitive metrics do not disproportionately penalize attribution methods for reasonable local changes.
• We show that our proposed LENS variants are well-motivated by metrics defined on the space of attributions, and that they provide tighter bounds on the attributional robustness of known improvements in attribution methods and model training designed for better attributions.
• Our comprehensive empirical results on benchmark datasets and models used in existing work clearly support our observations above, and support the need for the LENS variants of the metrics.
• We also show that existing robust attribution methods implicitly reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field.
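The exact definitions of the LENS metrics are developed later in the paper; to fix intuition, here is a minimal illustrative sketch (our own assumption, not the paper's definition) of a locality-sensitive top-k intersection, in which a top-k pixel of the perturbed attribution counts as matched if it lies within an r-pixel neighborhood of some top-k pixel of the original attribution. The function name and the Chebyshev-neighborhood choice are hypothetical choices made here for illustration.

```python
import numpy as np

def lens_topk_intersection(attr_a, attr_b, k=1000, r=2):
    """Locality-sensitive top-k intersection (illustrative sketch).

    A top-k pixel of attr_b is counted as matched if it lies within
    Chebyshev distance r of some top-k pixel of attr_a, so a small
    local drift of salient pixels is not penalized as a full mismatch.
    Assumes 2D attribution maps of identical shape.
    """
    h, w = attr_a.shape
    top_a = np.argsort(attr_a.ravel())[-k:]
    top_b = np.argsort(attr_b.ravel())[-k:]

    # Boolean mask of A's top-k pixels, dilated by an r-neighborhood.
    mask = np.zeros((h, w), dtype=bool)
    ya, xa = np.unravel_index(top_a, (h, w))
    for y, x in zip(ya, xa):
        mask[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1] = True

    yb, xb = np.unravel_index(top_b, (h, w))
    return mask[yb, xb].mean()   # fraction of B's top-k matched locally
```

Setting r = 0 recovers the ordinary top-k intersection, which makes explicit that the locality-sensitive variant is a relaxation of the standard metric rather than a different quantity.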

2. BACKGROUND AND RELATED WORK

We herein discuss background literature related to our work from three perspectives: a brief summary of explanation/attribution methods in general, a review of recent work in attributional robustness (both attacks and defenses), and other recent related work.



Figure 1: Attributional attack on the Flower dataset using the method of Ghorbani et al. (2019) on a ResNet model. Columns 1 and 3 show the image before and after an imperceptible perturbation; Columns 2 and 4 show the corresponding attributions. Note the change in attributions despite no perceptible change in the input. Row 1 shows a distinct change in the top-k pixels with the highest attribution; Row 2 shows only a local change in the top-k pixels with the highest attribution, still within the object. The intersection between the top-1000 pixels before and after perturbation is less than 0.16 in both cases; thus, the metric cannot distinguish the two.

