EXPLANATION UNCERTAINTY WITH DECISION BOUNDARY AWARENESS

Anonymous authors
Paper under double-blind review

Abstract

Post-hoc explanation methods have become increasingly relied upon for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations. While numerous explanation methods have been proposed, recent works have shown that many existing methods can be inconsistent or unstable. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. There is therefore a pressing need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. We introduce a novel uncertainty quantification method parameterized by a Gaussian Process model, which combines the uncertainty approximation of existing methods with a novel geodesic-based similarity that captures the complexity of the target black-box decision boundary. The proposed framework is highly flexible: it can be used with any black-box classifier and feature attribution method to amortize uncertainty estimates for explanations. We show theoretically that our proposed geodesic-based kernel similarity increases with the complexity of the decision boundary. Empirical results on multiple tabular and image datasets show that our decision boundary-aware uncertainty estimate improves understanding of explanations as compared to existing methods.

1. INTRODUCTION

Machine learning models are becoming increasingly prevalent in a wide variety of industries and applications. In many such applications, the best-performing model is opaque; post-hoc explainability methods are among the crucial tools by which we understand and diagnose the model's predictions. Recently, many explainability methods, termed explainers, have been introduced in the category of local feature attribution methods: methods that return a real-valued score for each feature of a given data sample, representing the feature's relative importance with respect to the sample's prediction. These explanations are local in that each data sample may have a different explanation. Local feature attribution methods therefore help users better understand nonlinear and complex black-box models, since these models are not limited to using the same decision rules throughout the data distribution.

Recent works have shown that existing explainers can be inconsistent or unstable. For example, given similar samples, explainers might provide different explanations (Alvarez-Melis & Jaakkola, 2018; Slack et al., 2020). When working in high-stakes applications, it is imperative to provide the user with an understanding of whether an explanation is reliable, potentially problematic, or even misleading. One way to guide users regarding an explainer's reliability is to provide corresponding uncertainty quantification estimates. Explainers can be viewed as function approximators; as such, standard techniques for quantifying the uncertainty of estimators can be used to quantify the uncertainty of explainers. This is the strategy employed by existing methods for producing uncertainty estimates of explainers (Slack et al., 2021; Schwab & Karlen, 2019).
However, we observe that for explainers this is not sufficient: in addition to uncertainty arising from the explainer's function approximation, explainers must also contend with uncertainty due to the complexity of the decision boundary (DB) of the black-box model in the local region being explained. Consider the following example: we are using a prediction model for a medical diagnosis using two features, level of physical activity and body mass index (BMI) (Fig. 1). In order to understand the prediction and give actionable recommendations to the patient, we use a feature attribution method to evaluate the relative importance of each feature. Because of the nonlinearity of the prediction model, patients A and B show very similar symptoms but are given very different recommendations. Note that while this issue is related to the notion of explainer uncertainty, measures of uncertainty that only consider the explainer would not capture this phenomenon. This suggests that any notion of uncertainty is incomplete without capturing information related to the local behavior of the model. Therefore, the ability to quantify uncertainty for DB-related explanation instability is desirable. We approach this problem from the perspective of similarity: given two samples and their respective explanations, how closely related should the explanations be? Following the intuition above, we define this similarity based on a geometric perspective of the DB complexity between the two points. Specifically, we propose a novel geodesic-based kernel similarity metric, which we call the Weighted Exponential Geodesic (WEG) kernel. The WEG kernel encodes our expectation that two samples close in Euclidean space may not actually be similar if the DB within a local neighborhood of the samples is highly complex.
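To make this intuition concrete, the sketch below illustrates one simple way a boundary-aware geodesic similarity could be approximated: build a k-NN graph over the data, inflate the weight of any edge that crosses the classifier's decision boundary, and exponentiate the resulting shortest-path distances into a kernel. This is an illustrative assumption, not the paper's WEG construction; the function names (`weg_kernel`, `boundary_crossing`), the `penalty` factor, and the graph-based geodesic approximation are all hypothetical simplifications.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def boundary_crossing(f, a, b, n=10):
    """Check whether the classifier's prediction changes along segment a-b."""
    ts = np.linspace(0.0, 1.0, n)
    preds = np.array([f(a + t * (b - a)) for t in ts])
    return np.any(preds != preds[0])

def weg_kernel(X, f, lam=1.0, k=3, penalty=5.0):
    """Toy boundary-aware exponential geodesic kernel.

    Edges of a k-NN graph that cross the decision boundary of classifier f
    are inflated by `penalty`, so graph-geodesic distances, and hence the
    kernel, reflect local boundary complexity as well as Euclidean distance.
    """
    n = X.shape[0]
    D = cdist(X, X)
    W = np.full((n, n), np.inf)  # inf = no edge (scipy dense convention)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:  # k nearest neighbors of i
            w = D[i, j]
            if boundary_crossing(f, X[i], X[j]):
                w *= penalty  # crossing the boundary lengthens the path
            W[i, j] = W[j, i] = min(W[i, j], w)
    G = shortest_path(W, method="D", directed=False)  # geodesic distances
    return np.exp(-lam * G)
```

On this toy construction, two points separated by a boundary crossing receive a much smaller similarity than an equally distant same-side pair, mirroring the behavior the WEG kernel is designed to encode.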
Using our similarity formulation, we propose the Gaussian Process Explanation UnCertainty (GPEC) framework, which is an instance-wise, model-agnostic, and explainer-agnostic method to quantify the uncertainty of explanations. The proposed notion of uncertainty is complementary to existing quantification methods. Existing methods primarily estimate the uncertainty related to the choice of model parameters and the fitting of the explainer, which we call function approximation uncertainty, but do not capture uncertainty related to the DB. GPEC can combine the DB-based uncertainty with function approximation uncertainty derived from any local feature attribution method. In summary, we make the following contributions:

• We introduce a geometric perspective on capturing explanation uncertainty and define a novel geodesic-based similarity between explanations. We prove theoretically that the proposed similarity captures the complexity of the decision boundary of a given black-box classifier.

• We propose a novel Gaussian Process-based framework that combines A) uncertainty from decision boundary complexity and B) explainer-specific uncertainty to generate uncertainty estimates for any given feature attribution method and black-box model.

• Empirical results show that GPEC uncertainty improves understanding of feature attribution methods.
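As a generic illustration of how a Gaussian Process can combine the two uncertainty sources, the sketch below computes the standard GP posterior predictive variance, with per-sample explainer variance (e.g., from re-running a stochastic explainer) entering as heteroscedastic noise on the diagonal, while the kernel would encode the boundary-aware similarity. The function name and the RBF stand-in kernel are assumptions for illustration; GPEC's exact formulation is given later in the paper.

```python
import numpy as np

def gp_explanation_variance(K_train, k_star, k_ss, explainer_var):
    """Posterior predictive variance of a GP fit to explanation values.

    K_train:       kernel matrix between training samples (in GPEC this
                   would be the boundary-aware WEG kernel)
    k_star:        kernel vector between the test sample and training samples
    k_ss:          prior variance k(x*, x*)
    explainer_var: per-sample variance of the explainer output; this injects
                   function-approximation uncertainty as heteroscedastic noise
    """
    A = K_train + np.diag(explainer_var)      # noisy Gram matrix
    v = np.linalg.solve(A, k_star)            # A^{-1} k_*
    return k_ss - k_star @ v                  # k_** - k_*^T A^{-1} k_*
```

At a training sample with negligible explainer noise the variance collapses toward zero; far from all training samples, or where a boundary-aware similarity decays because the DB is complex, it reverts toward the prior variance, flagging the explanation as uncertain.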

2. RELATED WORKS

Explanation Methods. A wide variety of methods have been proposed for the purpose of improving transparency for pre-trained black-box prediction models (Guidotti et al., 2018; Barredo Arrieta et al., 2020). Within this category of post-hoc methods, many methods focus on local explanations, that is, explaining individual predictions rather than the entire model. Some of these methods generate explanations through local feature selection (Chen et al., 2018; Masoomi et al., 2020). In this



Figure 1: Illustrative example of how similar data samples can result in very different feature importance scores under a black-box model with a nonlinear decision boundary. Here, two similar patients with similar predictions are given opposing feature importance scores, which could result in misguided recommendations. We define a similarity based on the geometry of the decision boundary between any two given samples (red line). While the two patients are close together in the Euclidean sense, they are dissimilar under the proposed WEG kernel similarity. GPEC would return a high uncertainty measure for these explanations, flagging the results for further investigation.

