ANALYZING THE EFFECTS OF CLASSIFIER LIPSCHITZ-NESS ON EXPLAINERS

Abstract

Machine learning methods are becoming increasingly accurate, but at the same time more complicated and less transparent. As a result, explainers are often relied on to provide interpretability for these black-box prediction models. As crucial diagnostic tools, these explainers must themselves be reliable. In this paper we focus on one particular aspect of reliability: an explainer should give similar explanations for similar inputs. We formalize this notion by introducing and defining explainer astuteness, analogous to the astuteness of classifiers. Our formalism is inspired by the concept of probabilistic Lipschitzness, which captures the probability that a function is locally smooth. For a variety of explainers (e.g., SHAP, RISE, CXPlain), we provide lower-bound guarantees on the astuteness of these explainers given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.

1. INTRODUCTION

Machine learning models have steadily improved at prediction and classification, especially with advances in deep learning and the availability of large amounts of data. These gains in predictive power have often been achieved using increasingly complex, black-box models. This has led to significant interest in, and a proliferation of, explainers that provide explanations for the predictions made by these black-box models. Given the crucial role of these explainers, it is imperative to understand what makes them reliable. In this paper we focus on explainer robustness. A robust explainer is one for which similar inputs result in similar explanations (Alvarez-Melis & Jaakkola, 2018). For example, consider two patients given the same diagnosis in a medical setting. If the two patients share identical symptoms and are demographically very similar, a diagnostician would expect the factors influencing the model's decision to be similar as well. Prior work on explainer robustness suggests that this expectation does not always hold (Alvarez-Melis & Jaakkola, 2018; Ghorbani et al., 2019); small changes to the input can result in large shifts in explanation. For this reason we investigate the theoretical underpinnings of explainer robustness. Specifically, we focus on the connection between explainer robustness and the smoothness of the black-box function being explained. We propose and formally define explainer astuteness, a property of explainers which captures the probability that a given method provides similar explanations to similar data points. This definition allows us to evaluate the robustness of a given explainer over the entire dataset and helps tie explainer robustness to the probabilistic Lipschitzness of classifiers. We then provide a theoretical connection between this explainer astuteness and the probabilistic Lipschitzness of the black-box function being explained.
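To make these notions concrete, probabilistic Lipschitzness and the analogous astuteness property can be sketched as follows. The notation here ($f$ for the predictor, $E$ for the explanation map, and the constants $L$, $\lambda$, $r$, $\alpha$) is illustrative and not necessarily the paper's exact statement:

```latex
% Probabilistic Lipschitzness of a predictor f: with probability at least
% 1 - alpha, inputs within distance r have proportionally close outputs.
\[
\Pr_{x, x' \sim \mathcal{D}}\Big[\, \|f(x) - f(x')\| \le L \,\|x - x'\|
  \;\Big|\; \|x - x'\| \le r \,\Big] \;\ge\; 1 - \alpha
\]
% Explainer astuteness is the analogous property of the explanation map E:
\[
\Pr_{x, x' \sim \mathcal{D}}\Big[\, \|E(x) - E(x')\| \le \lambda \,\|x - x'\|
  \;\Big|\; \|x - x'\| \le r \,\Big] \;\ge\; 1 - \alpha_E
\]
```

The paper's theorems then take the form of lower bounds on $1 - \alpha_E$ in terms of $1 - \alpha$ and $L$.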
Since probabilistic Lipschitzness is a measure of the probability that a function is smooth in a local neighborhood, our results demonstrate how the smoothness of the black-box function itself impacts the astuteness of the explainer. This implies that enforcing smoothness on black-box functions lends them to more robust explanations.

Related Work. A wide variety of explainers have been proposed in the literature (Guidotti et al., 2018; Arrieta et al., 2020). Explainers can broadly be categorized as feature attribution or feature selection explainers. Feature attribution explainers assign continuous-valued importance scores to each of the input features, while feature selection explainers provide binary decisions on whether a feature is important or not. Some popular feature attribution explainers, such as SHAP (Lundberg & Lee, 2017), LIME (Ribeiro et al., 2016), and LIFT (Shrikumar et al., 2016), can be viewed through the lens of Shapley values. Other models, such as CXPlain (Schwab & Karlen, 2019), PredDiff (Zintgraf et al., 2017), and feature ablation explainers (Lei et al., 2018), calculate feature attributions by simulating individual feature removal, while methods such as RISE (Petsiuk et al., 2018) attribute importance to a feature by calculating the mean effect of its presence. In contrast, feature selection methods include individual selector approaches such as L2X (Chen et al., 2018) and INVASE (Yoon et al., 2018), and group-wise selection approaches such as gI (Masoomi et al., 2020). While seemingly diverse, these models have been shown to have striking underlying similarities; for example, Lundberg & Lee (2017) unify six different explainers under a single framework, and Covert et al. (2020) went a step further, combining 25 existing methods under the overall class of removal-based explainers.

Similarly, there has been a recent increase in research analyzing the behaviour of explainers themselves, much as classification models have been analyzed. Yin et al. (2021) propose stability and sensitivity as measures of faithfulness of explainers to the decision-making process of the black-box model and empirically demonstrate the usefulness of these measures. Li et al. (2020) explore connections between local explainability and model generalization. Ghorbani et al. (2019) test the robustness of explainers through systematic and adversarial perturbations. Agarwal et al. (2022) define and discuss theoretical guarantees around faithfulness and stability in the context of Graph Neural Networks. Our definition of astuteness is related to what they call stability, but it is defined as a probability over all available instances in such a way that the connection to the probabilistic Lipschitzness of the classifier becomes clear. Alvarez-Melis & Jaakkola (2018) empirically show that robustness, in the sense that explainers should provide similar explanations for similar inputs, is a desirable property, and that enforcing this property yields better explanations. Recently, Agarwal et al. (2021) explored the robustness of LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017), proving that for these two methods robustness is related to the maximum value of the gradient of the predictor function. Our work is closely related to Alvarez-Melis & Jaakkola (2018) and Agarwal et al. (2021) on explainer robustness. However, instead of enforcing explainers to be robust themselves (Alvarez-Melis & Jaakkola, 2018), our theoretical results suggest that the robustness of explanations also depends on the smoothness of the black-box function being explained.

Our results are complementary to those of Agarwal et al. (2021) in that our theorems cover a wider variety of explainers, as compared to only Continuous LIME and SmoothGrad (see contributions below). We further relate robustness to the probabilistic Lipschitzness of black-box models, a quantity that can be empirically estimated. Additionally, there has been recent work on estimating upper bounds of the Lipschitz constant for neural networks (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Gouk et al., 2021), and on enforcing Lipschitz continuity during neural network training with an eye towards improving classifier robustness (Gouk et al., 2021; Aziznejad et al., 2020; Fawzi et al., 2017; Alemi et al., 2016). Fel et al. (2022) empirically demonstrated that 1-Lipschitz networks are better suited as predictors that are more explainable and trustworthy. Our work provides crucial additional motivation for that line of research; i.e., it provides theoretical reasons to improve the Lipschitzness of neural networks from the perspective of enabling more robust explanations.

Contributions:

• We formalize and define explainer astuteness, which captures the probability that a given explainer provides similar explanations to similar points. This formalism allows us to theoretically analyze robustness properties of explainers.

• We provide theoretical results that connect the astuteness of explainers to the smoothness of the black-box function they explain. Our results suggest that smooth black-box functions result in more astute explanations. While this statement is intuitive, proving it is non-trivial and requires additional assumptions for different explainers (see Section 3.2).

• Specifically, we prove this result for three classes of explainers: (1) Shapley-value-based explainers (e.g., SHAP), (2) explainers that simulate the mean effect of features (e.g., RISE), and (3) explainers that simulate individual feature removal (e.g., CXPlain). Formally, our theorems establish a lower bound on explainer astuteness that depends on the Lipschitzness of the black-box function.
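Since probabilistic Lipschitzness is a statement about how often nearby inputs yield proportionally close outputs, it can be estimated empirically by checking the Lipschitz condition over pairs of samples within a given radius. Below is a minimal sketch of such an estimator; the function name, the toy data, and the pairwise sampling scheme are our own illustration, not the paper's experimental protocol:

```python
import numpy as np

def empirical_lipschitzness(f, X, L, r):
    """Fraction of sample pairs within distance r whose outputs satisfy
    ||f(x) - f(x')|| <= L * ||x - x'||, i.e. an empirical estimate of
    probabilistic Lipschitzness at constant L and radius r."""
    held, total = 0, 0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = np.linalg.norm(X[i] - X[j])
            if 0 < d <= r:
                total += 1
                if np.linalg.norm(f(X[i]) - f(X[j])) <= L * d:
                    held += 1
    return held / total if total else float("nan")

X = np.array([[0.0], [0.1], [0.2], [1.0]])

# A globally 0.5-Lipschitz map satisfies the condition for every pair.
smooth = lambda x: 0.5 * x
print(empirical_lipschitzness(smooth, X, L=1.0, r=0.5))  # 1.0

# A step function violates it for pairs straddling the jump at 0.15,
# so the estimated probability drops below 1.
step = lambda x: np.sign(x - 0.15)
print(empirical_lipschitzness(step, X, L=1.0, r=0.5))
```

The same estimator applied to an explanation map in place of `f` gives an empirical check of explainer astuteness, which is how the paper's lower bounds can be compared against observed behaviour.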

