CAN WE FAITHFULLY REPRESENT MASKED STATES TO COMPUTE SHAPLEY VALUES ON A DNN?

Abstract

Masking some input variables of a deep neural network (DNN) and computing the change of the output on the masked sample is a typical way to compute attributions of input variables. An input variable is usually masked with its baseline value. However, there is no theory to examine whether a baseline value faithfully represents the absence of an input variable, i.e., the removal of all signals from that variable. Fortunately, recent studies (Ren et al., 2023a; Deng et al., 2022a) show that the inference score of a DNN can be strictly disentangled into a set of causal patterns (or concepts) encoded by the DNN. Therefore, we propose to use causal patterns to examine the faithfulness of baseline values. More crucially, it is proven that causal patterns can be explained as the elementary rationale of the Shapley value. Furthermore, we propose a method to learn optimal baseline values, and experimental results demonstrate its effectiveness.

1. INTRODUCTION

Many attribution methods (Zhou et al., 2016; Selvaraju et al., 2017; Lundberg and Lee, 2017; Shrikumar et al., 2017) have been proposed to estimate the attribution (importance) of input variables to the model output, which represents an important direction in explainable AI. In this direction, many studies (Lundberg and Lee, 2017; Ancona et al., 2019; Fong et al., 2019) mask some input variables of a deep neural network (DNN) and use the change of the network output on the masked samples to estimate attributions of input variables. As Fig. 1 shows, different types of baseline values have been used to represent the absence of input variables. Theoretically, the trustworthiness of attributions highly depends on whether the baseline value can really remove the signal of the input variable without bringing in new out-of-distribution (OOD) features. However, there is no criterion to evaluate the signal removal of masking methods. To this end, we first need to break two kinds of blind faith: that seemingly reasonable baseline values faithfully represent the absence of input variables, and that seemingly OOD baseline values necessarily cause abnormal features. In fact, because a DNN may have complex inference logic, seemingly OOD baseline values do not necessarily generate OOD features.

Concept/causality-emerging phenomenon. The core challenge in theoretically guaranteeing or examining whether a baseline value removes all or part of the signal of an input variable is to explicitly define the signal/concept/knowledge encoded by a DNN in a countable manner. To this end, Ren et al. (2022a) have surprisingly discovered that when a DNN is sufficiently trained, sparse and symbolic concepts emerge. Thus, we use such concepts as a new perspective to define the optimal baseline value for the absence of input variables. As Fig. 1 shows, each concept represents an AND relationship between a specific set S of input variables.
The co-appearance of these input variables makes a numerical contribution U_S to the network output. Thus, we can consider such a concept as a causal pattern of the network output, and U_S is termed its causal effect. For example, the concept of a rooster's head consists of the forehead, eyes, beak, and crown, i.e., S = {forehead, eyes, beak, crown} = {f, e, b, c} for short. Only if the input variables f, e, b, and c co-appear is the causal pattern S triggered, making an effect U_S on the confidence of the head classification. Otherwise, the absence of any input variable in the causal pattern S removes the effect.
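The causal effect U_S of such an AND pattern can be computed as a Harsanyi dividend over the network outputs on masked samples. The following is a minimal sketch on a hypothetical toy "rooster head" model (the model `v`, the variable names, and the effect value `E` are illustrative assumptions, not the paper's actual network):

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of the iterable s, as tuples."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Hypothetical toy model: the "head" score fires only when the
# forehead (f), eyes (e), beak (b), and crown (c) all co-appear.
VARIABLES = ("f", "e", "b", "c")
E = 5.0  # assumed causal effect of the full pattern S = {f, e, b, c}

def v(present):
    """Network output when only the variables in `present` are unmasked."""
    return E if set(present) == set(VARIABLES) else 0.0

def causal_effect(S):
    """Harsanyi dividend U_S = sum over T subseteq S of (-1)^(|S|-|T|) v(T)."""
    return sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))

# Only the full AND pattern carries a non-zero effect; every sub-pattern is zero.
print(causal_effect(("f", "e", "b", "c")))  # 5.0
print(causal_effect(("e", "b")))            # 0.0
```

The inclusion-exclusion sum cancels every contribution except the part of the output that genuinely requires all variables in S to co-appear, which is exactly the AND semantics described above.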

Ren et al. (2023a) have extracted a set of sparse causal patterns (concepts) encoded by the DNN.

More importantly, the following finding proves that such causal patterns can be considered the elementary inference logic used by the DNN. Specifically, given an input sample with n variables, we can generate 2^n different masked samples. A relatively small number of causal patterns can accurately mimic the network outputs on all 2^n masked samples, which guarantees the faithfulness of causal patterns.

Defining optimal baseline values based on causal patterns. From the above perspective of causal patterns, whether baseline values look reasonable and fit human intuition is no longer the key factor determining their trustworthiness. Instead, we evaluate the faithfulness of baseline values by using causal patterns. Because the baseline value is supposed to represent the absence of an input variable, we find that setting an optimal baseline value usually generates the most simplified explanation of the DNN, i.e., we can extract a minimum number of causal patterns to explain the DNN. Such an explanation is the most reliable according to Occam's Razor.

• We prove that using incorrect baseline values makes a single causal pattern be explained as an exponential number of redundant causal patterns. Consider the following toy example, where the DNN contains a causal pattern S = {f, e, b, c} with a considerable causal effect E on the output. If an incorrect baseline value b_f of the variable f (forehead) merely blurs the image patch, rather than fully removing its appearance, then masking the variable f cannot remove the whole score E. The remaining score E − U_{f,e,b,c} will be explained as redundant causal patterns U_{e,b}, U_{e,c}, U_{e,b,c}, etc.

• Furthermore, incorrect baseline values may also generate new patterns. For example, if the baseline values of {f, e, b, c} are set as black regions, then masking all four regions may generate a new pattern of a black square, which is a new causal pattern that influences the network output.
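Both claims above can be illustrated numerically on a toy game: the dividends reconstruct all 2^n masked outputs exactly, and an imperfect baseline for f spawns a redundant pattern. This is a sketch under assumed values (the model `v`, the residual strength 0.5, and the effect E = 5.0 are illustrative, not from the paper's experiments):

```python
from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

VARIABLES = ("f", "e", "b", "c")
E = 5.0  # assumed effect of the single true pattern S = {f, e, b, c}

def v(present, residual_f=0.0):
    """Output on a masked sample. With a faithful baseline (residual_f = 0),
    the AND pattern fires only when all four variables co-appear; with an
    imperfect baseline for f, masking f leaves a residual signal behind."""
    x_f = 1.0 if "f" in present else residual_f
    others_on = all(i in present for i in ("e", "b", "c"))
    return E * x_f * (1.0 if others_on else 0.0)

def dividends(value):
    """Causal effects U_S for every S, via the Harsanyi dividend."""
    return {S: sum((-1) ** (len(S) - len(T)) * value(T) for T in subsets(S))
            for S in subsets(VARIABLES)}

# Faithfulness: the patterns reproduce the outputs on all 2^n masked
# samples, i.e. v(T) = sum over S subseteq T of U_S.
U = dividends(lambda T: v(T))
for T in subsets(VARIABLES):
    reconstructed = sum(U[S] for S in subsets(T))
    assert abs(reconstructed - v(T)) < 1e-9

# An imperfect baseline (masking f removes only half the signal) splits
# the single true pattern into extra, redundant patterns such as U_{e,b,c}.
U_bad = dividends(lambda T: v(T, residual_f=0.5))
print(U_bad[("f", "e", "b", "c")])  # 2.5  (true pattern, weakened)
print(U_bad[("e", "b", "c")])       # 2.5  (redundant pattern appears)
```

With the faithful baseline, U is non-zero only on the full set; with the blurred baseline, the same inference is spread over additional patterns, which is exactly the inflation of the explanation described in the first bullet.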
Therefore, we consider that the optimal baseline value, which faithfully reflects the true inference logic, usually simplifies the set of causal patterns. That is, it reduces the overall strength of existing causal effects the most without introducing new causal effects. However, we find that most existing masking methods are not satisfactory from this perspective (see Section 3.2 and Table 1), although the masking method based on the conditional distribution of input variables (Covert et al., 2020b; Frye et al., 2021) performs a bit better. In particular, we notice that Shapley values can also be derived from causal patterns in theory, i.e., causal patterns are proven to be the elementary effects of Shapley values. Therefore, we propose a new method to learn optimal baseline values for Shapley values, which removes the causal effects of the masked input variables and avoids introducing new causal effects.

Contributions of this paper can be summarized as follows. (1) We propose a metric to examine whether the masking approach in attribution methods faithfully represents the absence of input variables. Based on this metric, we find that most previous masking methods are not reliable.
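The claim that causal patterns are the elementary effects of Shapley values can be checked numerically: the Shapley value of variable i equals the sum of U_S/|S| over all patterns S containing i. The sketch below verifies this identity on a hypothetical two-pattern game (the game `v` and its effect values are illustrative assumptions):

```python
from itertools import chain, combinations
from math import factorial

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

N = ("f", "e", "b", "c")

def v(present):
    """Hypothetical game: an AND pattern over {f, e, b, c} with effect 5.0
    plus a pairwise pattern over {e, b} with effect 2.0."""
    score = 0.0
    if all(i in present for i in N):
        score += 5.0
    if "e" in present and "b" in present:
        score += 2.0
    return score

def shapley_direct(i):
    """Classic Shapley value, enumerating all coalitions not containing i."""
    n = len(N)
    total = 0.0
    for T in subsets([j for j in N if j != i]):
        w = factorial(len(T)) * factorial(n - len(T) - 1) / factorial(n)
        total += w * (v(T + (i,)) - v(T))
    return total

def shapley_from_patterns(i):
    """phi_i = sum over patterns S containing i of U_S / |S|."""
    U = {S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
         for S in subsets(N)}
    return sum(u / len(S) for S, u in U.items() if i in S)

# The two computations agree for every variable.
for i in N:
    assert abs(shapley_direct(i) - shapley_from_patterns(i)) < 1e-9
print(shapley_from_patterns("e"))  # 2.25, i.e. 5.0/4 + 2.0/2
```

Each pattern's effect is split evenly among its member variables, so an incorrect baseline that distorts the set of patterns directly distorts the resulting Shapley values, which motivates learning baseline values that keep the pattern set minimal.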



Ren et al. (2023a) have discovered a counter-intuitive concept-emerging phenomenon in a trained DNN. Although the DNN does not have a physical unit to encode explicit causality or concepts, Ren et al. (2023a); Deng et al. (

Figure 1: (Left) Previous masking methods may either introduce additional signals, or cannot remove all the old signals. (Right) The inference of the DNN on images masked by these baseline values can be well mimicked by causal patterns.

