IMPROVING EXPLANATION RELIABILITY THROUGH GROUP ATTRIBUTION

Abstract

Input attribution methods are a mainstream approach to interpreting the predictions of deep neural networks (DNNs), but the non-linearity of DNNs often makes the attributed scores unreliable in explaining a given prediction, deteriorating the faithfulness of the explanation. This challenge can be mitigated by attributing to groups of explanatory components rather than to individual components, since interaction among the components can be reduced through appropriate grouping. However, a group attribution does not explain component-wise contributions, so its component-interpreted attribution is less reliable than the original component attribution, revealing a trade-off between the two reliabilities. In this work, we first introduce generalized definitions of reliability loss and group attribution to formulate the reliability trade-off as an optimization problem. We then specialize our formulation to Shapley value attribution and propose the optimization method G-SHAP. Finally, we demonstrate the explanatory benefits of our method through experiments on image classification tasks.

1. INTRODUCTION

Advances in deep neural networks (DNNs) enable models to learn high-level semantic features in a variety of fields, but the intrinsic difficulty of explaining DNN predictions remains a primary barrier to real-world applications, especially in domains that require trustworthy reasoning about model predictions. Various approaches have been proposed to tackle this challenge, including deriving the global behavior or knowledge of a trained model (Kim et al., 2018), explaining the semantics of a target neuron in a model (Ghorbani et al., 2019; Simonyan et al., 2013; Szegedy et al., 2015), and introducing self-interpretable models (Zhang et al., 2018; Dosovitskiy et al., 2020; Touvron et al., 2020; Arik & Pfister, 2019). Among them, input-attribution methods have become the mainstream of post-hoc explanation: they explain a model prediction by assigning a scalar score to each explanatory component (feature) of the input, yielding a straightforward explanation for end-users through data-aligned visualization such as a heatmap. However, because each explanatory component is summarized by a single scalar score, the non-linearity of DNNs makes these scores less reliable in explaining the model's prediction. This results in a discrepancy between the explained and the actual model behavior, deteriorating the faithfulness of the explanation. As this is an inherent challenge of input attribution methods, it has been studied from various perspectives: Grabisch & Roubens (1999) formalize axiomatic interactions for cooperative games; Tsang et al. (2018) explain statistical interactions between input features from the learned weights of a DNN; Kumar et al. (2021) introduce Shapley Residuals to quantify the unexplained contribution of Shapley values; and Janizek et al. (2021) extend Integrated Gradients (Sundararajan et al., 2017) to Integrated Hessians to explain interactions between input features.
While these approaches improve explainability with respect to the DNN's non-linearity, their explanation scores in many cases do not correspond to individual explanatory components, reducing the interpretability of the explanations.

Figure 1: Trade-off of the dual reliability losses of a group attribution for a simple non-linear function. Grouping x_1, x_2 resolves their interaction, reducing the reliability loss of the group {x_1, x_2} but increasing those of the component-interpreted scores. Here the attribution score and its reliability loss are defined as the input gradient and the expected L2 error of its tangent approximation, i.e., φ_i = ∂f(x)/∂x_i and E_{t∼N(0,1)}[(f(x + t e_i) - f(x) - t φ_i)^2], respectively.

Instead, the problem can be alleviated by explaining a model's prediction in terms of groups of explanatory components rather than individuals, which we term group attribution. Appropriate grouping can weaken the interaction among the components, yielding a more reliable explanation. However, a group attribution does not attribute scores to the individual components, so interpreting a group attribution at the component level results in a less reliable explanation than the original component attribution. Therefore, both the group-wise and the component-interpreted attribution reliability should be considered when deriving a group attribution, implying a trade-off optimization problem. Figure 1 illustrates this problem with a simple non-linear function. In this paper, we present our work as follows: In Section 2, we introduce generalized definitions of reliability loss and group attribution to formulate the reliability trade-off as an optimization problem. In Section 3, we integrate our formulation with Shapley value attribution (Lundberg & Lee, 2017) and propose the grouping algorithm G-SHAP.
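The gradient-based reliability loss used in Figure 1 can be made concrete with a short numerical sketch. The toy function and values below are our own illustrative example, not the paper's code: the attribution score is the analytic input gradient, and the reliability loss is a Monte-Carlo estimate of the expected squared error of the tangent approximation along e_i.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-linear function (our own example): x3 enters quadratically,
# so a single scalar score cannot explain it exactly.
def f(x):
    return x[..., 0] + 2.0 * x[..., 1] + x[..., 2] ** 2

def grad_f(x):
    # Analytic input gradient, used as the attribution score phi_i.
    return np.array([1.0, 2.0, 2.0 * x[2]])

def reliability_loss(x, i, phi_i, n_samples=200_000):
    """Monte-Carlo estimate of E_{t~N(0,1)}[(f(x + t*e_i) - f(x) - t*phi_i)^2],
    the expected squared error of the tangent approximation along e_i
    (the definition used in Figure 1)."""
    t = rng.standard_normal(n_samples)
    X = np.tile(x, (n_samples, 1))
    X[:, i] += t
    errs = f(X) - f(x) - t * phi_i
    return float(np.mean(errs ** 2))

x = np.array([1.0, 1.0, 1.0])
phi = grad_f(x)

# x1 enters f linearly, so its tangent explanation is exact: loss = 0.
print(reliability_loss(x, 0, phi[0]))   # ~ 0.0
# x3 is quadratic: the residual is t^2, so the loss is E[t^4] = 3.
print(reliability_loss(x, 2, phi[2]))   # ~ 3.0
```

The non-zero loss for x3 is exactly the kind of discrepancy between explained and actual model behavior that the paper's generalized reliability loss quantifies.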
We choose the Shapley value as our scoring policy for two reasons: 1) it is a popular attribution method owing to its model-agnostic character and well-founded axiomatic properties; 2) it becomes less reliable when there are strong interactions among the explanatory components' contributions, as it aggregates the contributions over all coalition states. In Section 4, we show the explanatory benefits of our method through experiments on image classification tasks as follows: 1) we verify the grouping effect of G-SHAP through quantitative and visual analysis; 2) we validate our grouping approach by comparing it with several baseline grouping methods that would yield similar grouping results to ours; 3) we show the improvement in local explainability of a prediction through the estimation game, which utilizes the deletion game (Petsiuk et al., 2018; Wagner et al., 2019) to measure the error of model output changes.

Our contributions are summarized as follows:

1. We introduce two novel concepts to improve the limited reliability of input attribution methods: reliability loss, which quantifies the discrepancy between the explained and the actual model behavior for a prediction, and group attribution, which explains a prediction in terms of groups of explanatory components. Since a group attribution becomes less reliable in explaining component-wise contributions, we formulate an optimization problem to resolve the reliability trade-off. While we choose the Shapley value as our scoring policy, our formulation consists of generalized terms and is applicable to other input attribution methods.

2. We propose G-SHAP, a grouping algorithm for Shapley value attribution. We empirically show that G-SHAP has better local explainability of a model prediction than SHAP. We also validate the effectiveness of our grouping approach by comparing it with several baseline grouping methods that would yield similar grouping results to ours.
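As a concrete illustration of why grouping helps Shapley value attribution, the following sketch (our own toy example, not the paper's G-SHAP algorithm) computes exact Shapley values by enumerating all coalitions, then treats an interacting pair as a single player. The value function v and the player names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions.
    value(S) maps a frozenset of players to the coalition's payoff."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S | {p}) - value(S))
        phi[p] = total
    return phi

# Toy value function (our own example): 'a' and 'b' only pay off together,
# i.e. they interact strongly; 'c' contributes independently.
def v(S):
    return (2.0 if 'a' in S and 'b' in S else 0.0) + (1.0 if 'c' in S else 0.0)

# Component attribution: the a-b interaction is split between them,
# even though neither contributes anything alone.
print(shapley(['a', 'b', 'c'], v))          # {'a': 1.0, 'b': 1.0, 'c': 1.0}

# Group attribution: treat {a, b} as one player 'ab', so the
# interaction stays inside the group instead of being split.
def v_grouped(S):
    members = {'a', 'b'} if 'ab' in S else set()
    if 'c' in S:
        members.add('c')
    return v(frozenset(members))

print(shapley(['ab', 'c'], v_grouped))      # {'ab': 2.0, 'c': 1.0}
```

In the grouped game the score of 'ab' is its full joint contribution, which is the intuition behind why an appropriate grouping yields a more reliable group-wise explanation, at the cost of no longer separating 'a' from 'b'.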

