ROBUST ATTRIBUTIONS REQUIRE RETHINKING ROBUSTNESS METRICS

Abstract

For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either the attribution methods or the model training. Existing works measure attributional robustness by metrics such as top-k intersection, Spearman's rank-order correlation (or Spearman's ρ) or Kendall's rank-order correlation (or Kendall's τ) to quantify the change in feature attributions under input perturbation. However, we show that these metrics are fragile. That is, under such metrics, a simple random perturbation attack can seem to be as significant as more principled attributional attacks. We instead propose Locality-sENSitive (LENS) improvements of the above metrics, namely, LENS-top-k, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order. Our locality-sensitive metrics provide tighter bounds on attributional robustness and do not disproportionately penalize attribution methods for reasonable local changes. We show that the robust attribution methods proposed in recent works also reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field. Our empirical results on standard benchmark datasets, models, and attribution methods support our observations and conclusions in this work.
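The three existing metrics named above can be illustrated concretely. The sketch below (an illustration, not the paper's LENS variants; the synthetic attribution vectors and the choice of k = 10 are assumptions for demonstration) compares the attributions of an input and its perturbed counterpart using top-k intersection, Spearman's ρ, and Kendall's τ:

```python
import numpy as np
from scipy import stats

def topk_intersection(a, b, k):
    """Fraction of the k highest-attribution indices shared by a and b."""
    top_a = set(np.argsort(a)[-k:])
    top_b = set(np.argsort(b)[-k:])
    return len(top_a & top_b) / k

rng = np.random.default_rng(0)
attr = rng.random(100)                               # attributions on the original input (synthetic)
attr_pert = attr + 0.05 * rng.standard_normal(100)   # attributions after a small input perturbation (synthetic)

rho, _ = stats.spearmanr(attr, attr_pert)            # Spearman's rank-order correlation
tau, _ = stats.kendalltau(attr, attr_pert)           # Kendall's rank-order correlation
inter = topk_intersection(attr, attr_pert, k=10)     # top-k intersection
```

All three scores compare only the global rank order (or top-k membership) of attribution values; none accounts for where in the input the changed attributions lie, which is the locality that the LENS variants incorporate.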

1. INTRODUCTION

There has been an explosive increase in the use of deep neural network (DNN)-based models for many applications in recent years, which has resulted in a correspondingly increasing interest in finding ways to interpret the decisions made by these models. Interpretability is an important aspect of responsible and trustworthy AI, and attribution methods are important for explaining and debugging real-world AI/ML systems. Attribution methods are used across application domains today (see (Gade et al., 2020) for a general discussion and (Tang et al.; Yap et al.; Oviedo et al., 2022; Oh & Jeong, 2020) for some examples), despite their limitations. These methods (Zeiler et al., 2010; Simonyan et al., 2014; Bach et al., 2015; Selvaraju et al., 2017; Chattopadhyay et al., 2018; Sundararajan et al., 2017; Shrikumar et al., 2016; Smilkov et al., 2017; Lundberg & Lee, 2017) explain the decisions made by these models rather than treating them as black boxes. With a growing number of explanation methods (see (Lipton, 2018; Samek et al., 2019; Fan et al., 2021; Zhang et al., 2020; Zhang & Zhu, 2018) for surveys), there have also been recent concerted efforts on analyzing and ensuring the robustness of DNN model explanations. This requires that the model explanations (also known as attributions) do not change with human-imperceptible changes in input (Chen et al., 2019; Sarkar et al., 2021). For example, an explanation code for a credit card failure cannot change significantly for a small human-imperceptible change in input features, and the saliency maps explaining the risk prediction of a chest X-ray should not change significantly with a minor human-imperceptible change in the image. This is referred to as attributional robustness. From another perspective, DNN-based models are known to be vulnerable to imperceptible adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2014), which make them misclassify input images.
These small imperceptible perturbations are constructed using techniques like the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD) (Madry et al., 2018). Adversarial training with PGD is a well-known solution to obtain
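The FGSM perturbation mentioned above admits a one-line sketch: step the input in the direction of the sign of the loss gradient, bounded in L-infinity norm by a budget ε (the function name and the gradient argument are illustrative; in practice the gradient comes from backpropagation through the model's loss):

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon=0.03):
    """Fast Gradient Sign Method (Goodfellow et al., 2015):
    x_adv = x + epsilon * sign(dL/dx), an L-infinity-bounded step."""
    return x + epsilon * np.sign(grad)

# Illustration on a toy input and a hand-written loss gradient.
x = np.zeros(4)
grad = np.array([1.0, -2.0, 0.5, -0.1])
x_adv = fgsm_perturb(x, grad, epsilon=0.1)  # each coordinate moves by +/- 0.1
```

PGD iterates such steps, projecting back into the ε-ball after each one, which is why it yields stronger attacks and is the basis for adversarial training.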

