CAN WE FAITHFULLY REPRESENT MASKED STATES TO COMPUTE SHAPLEY VALUES ON A DNN?

Abstract

Masking some input variables of a deep neural network (DNN) and computing output changes on the masked sample is a typical way to compute attributions of input variables. People usually mask an input variable with its baseline value. However, there is no theory to examine whether a baseline value faithfully represents the absence of an input variable, i.e., the removal of all signals from that variable. Fortunately, recent studies (Ren et al., 2023a; Deng et al., 2022a) show that the inference score of a DNN can be strictly disentangled into a set of causal patterns (or concepts) encoded by the DNN. Therefore, we propose to use causal patterns to examine the faithfulness of baseline values. More crucially, it is proven that causal patterns can be explained as the elementary rationale of the Shapley value. Furthermore, we propose a method to learn optimal baseline values, and experimental results demonstrate its effectiveness.

1. INTRODUCTION

Many attribution methods (Zhou et al., 2016; Selvaraju et al., 2017; Lundberg and Lee, 2017; Shrikumar et al., 2017) have been proposed to estimate the attribution (importance) of input variables to the model output, which represents an important direction in explainable AI. In this direction, many studies (Lundberg and Lee, 2017; Ancona et al., 2019; Fong et al., 2019) mask some input variables of a deep neural network (DNN) and use the change of network outputs on the masked samples to estimate attributions of input variables. As Fig. 1 shows, different types of baseline values have been used to represent the absence of input variables. Theoretically, the trustworthiness of attributions highly depends on whether the baseline value can really remove the signal of the input variable without bringing in new out-of-distribution (OOD) features. However, there is no criterion to evaluate the signal removal of masking methods. To this end, we first need to break two kinds of blind faith: that seemingly reasonable baseline values faithfully represent the absence of input variables, and that seemingly OOD baseline values necessarily cause abnormal features. In fact, because a DNN may have complex inference logic, seemingly OOD baseline values do not necessarily generate OOD features.

Concept/causality-emerging phenomenon. The core challenge in theoretically guaranteeing or examining whether a baseline value removes all or only part of the signal of an input variable is to explicitly define the signal/concept/knowledge encoded by a DNN in a countable manner. To this end, Ren et al. (2023a) discovered a counter-intuitive concept-emerging phenomenon in trained DNNs. Although a DNN has no physical unit that encodes explicit causality or concepts, Ren et al. (2023a) and Deng et al. (2022a) surprisingly discovered that sparse and symbolic concepts emerge when the DNN is sufficiently trained.
Thus, we use such concepts as a new perspective to define the optimal baseline value for the absence of input variables. As Fig. 1 shows, each concept represents an AND relationship between a specific set S of input variables. The co-appearance of these input variables makes a numerical contribution $U_S$ to the network output. Thus, we can consider such a concept as a causal pattern of the network output, and $U_S$ is termed the causal effect. For example, the concept of a rooster's head consists of the forehead, eyes, beak, and crown, i.e., S = {forehead, eyes, beak, crown} = {f, e, b, c} for short. Only if the input variables f, e, b, and c co-appear is the causal pattern S triggered, making an effect $U_S$ on the confidence of the head classification. Otherwise, the absence of any input variable in the causal pattern S removes the effect. Ren et al. (2023a) have extracted a set of sparse causal patterns (concepts) encoded by the DNN. More importantly, the following finding proves that such causal patterns can be considered the elementary inference logic used by the DNN. Specifically, given an input sample with n variables, we can generate $2^n$ different masked samples. We can use a relatively small number of causal patterns to accurately mimic the network outputs on all $2^n$ masked samples, which guarantees the faithfulness of causal patterns.

Defining optimal baseline values based on causal patterns. From the above perspective of causal patterns, whether baseline values look reasonable and fit human intuition is no longer the key factor determining their trustworthiness. Instead, we evaluate the faithfulness of baseline values by using causal patterns. Because the baseline value is supposed to represent the absence of an input variable, we find that setting an optimal baseline value usually generates the most simplified explanation of the DNN, i.e., we may extract a minimum number of causal patterns to explain the DNN.
Such an explanation is the most reliable according to Occam's Razor.

• We prove that using incorrect baseline values makes a single causal pattern be explained as an exponential number of redundant causal patterns. Let us consider the following toy example, where the DNN contains a causal pattern S = {f, e, b, c} with a considerable causal effect E on the output. If an incorrect baseline value $b_f$ of the variable f (forehead) just blurs the image patch, rather than fully removing its appearance, then masking the variable f cannot remove the entire effect E. The remaining score $E - U_{\{f,e,b,c\}}$ will be mistakenly explained as redundant causal patterns $U_{\{e,b\}}$, $U_{\{e,c\}}$, $U_{\{e,b,c\}}$, etc.

• Furthermore, incorrect baseline values may also generate new patterns. For example, if the baseline values of {f, e, b, c} are set to black regions, then masking all four regions may generate the new pattern of a black square, which is a new causal pattern that influences the network output.

Therefore, we consider that the optimal baseline value, which faithfully reflects the true inference logic, usually simplifies the set of causal patterns, i.e., it reduces the overall strength of existing causal effects the most without introducing new causal effects. However, we find that most existing masking methods are not satisfactory from this perspective (see Section 3.2 and Table 1), although the masking method based on the conditional distribution of input variables (Covert et al., 2020b; Frye et al., 2021) performs a bit better. In particular, we notice that Shapley values can also be derived from causal patterns in theory, i.e., causal patterns are proven to be the elementary effects of Shapley values. Therefore, we propose a new method to learn optimal baseline values for Shapley values, which removes the causal effects of the masked input variables and avoids introducing new causal effects. Contributions of this paper can be summarized as follows.
(1) We propose a metric to examine whether the masking approach in attribution methods could faithfully represent the absence state of input variables. Based on this metric, we find that most previous masking methods are not reliable. (2) We define and develop an approach to estimating optimal baseline values for Shapley values, which ensures the trustworthiness of the attribution.

2. EXPLAINABLE AI THEORIES BASED ON GAME-THEORETIC INTERACTIONS

This paper is a typical achievement within the theoretical system of game-theoretic interactions. In fact, our research group has developed and used the game-theoretic interaction as a new perspective to solve two challenges in explainable AI: (1) how to define and represent the implicit knowledge encoded by a DNN as explicit and countable concepts, and (2) how to use the concepts encoded by the DNN to explain its representation power or performance. More importantly, we find that the game-theoretic interaction is also a good perspective from which to analyze the common mechanism shared by previous empirical findings and explanations of DNNs.

• Explaining the knowledge/concepts encoded by a DNN. Defining interactions between input variables of a DNN in game theory is a typical research direction (Grabisch and Roubens, 1999; Sundararajan et al., 2020). To this end, we further defined the multi-variate interaction (Zhang et al., 2021a;d) and the multi-order interaction (Zhang et al., 2021b) to represent interactions of different complexities. Ren et al. (2023a) and Li and Zhang (2023) first discovered that we could consider game-theoretic interactions as the concepts encoded by a DNN, based on the following three findings. (1) A trained DNN usually encodes only very sparse and salient interactions, each of which makes a certain effect on the network output. (2) We proved that the effects of such a small number of salient interactions can well mimic/predict network outputs on an exponential number of arbitrarily masked input samples. (3) Salient interactions usually exhibit strong transferability across different samples, strong transferability across different DNNs, and strong discrimination power. These three perspectives comprise the solid foundation for considering salient interactions as the concepts encoded by a DNN. Furthermore, Cheng et al. (2021b) found that such interactions usually represent the most reliable and prototypical concepts encoded by a DNN, and Cheng et al. (2021a) further analyzed the different signal-processing behaviors of a DNN in encoding shapes and textures.

• The game-theoretic interaction is also a new perspective for investigating the representation power of a DNN. Deng et al. (2022a) proved a counter-intuitive bottleneck/difficulty of a DNN in representing interactions of intermediate complexity. Zhang et al. (2021b) explored the effects of the dropout operation on interactions to explain the generalization power of a DNN. Wang et al. (2021a;b) and Ren et al. (2021) used interactions between input variables to explain the adversarial robustness and adversarial transferability of a DNN. Zhou et al. (2023) found that complex (high-order) interactions are more likely to be over-fitted, and used the generalization power of different interaction concepts to explain the generalization power of the entire DNN. Ren et al. (2023b) proved that a Bayesian neural network (BNN) is less likely to encode complex (high-order) interactions, which avoids over-fitting.

• Game-theoretic interactions are also used to analyze the common mechanism shared by many empirical findings. Deng et al. (2022b) discovered that almost all (fourteen) attribution methods can be mathematically re-formulated as a reallocation of interactions, which enables a fair comparison between different attribution methods. Zhang et al. (2022) proved that twelve previous empirical methods for boosting adversarial transferability can be explained as reducing interactions between pixel-wise adversarial perturbations.

3. PROBLEMS WITH THE REPRESENTATION OF THE MASKED STATES

The Shapley value (Shapley, 1953) was first introduced in game theory to measure the contribution of each player in a game. People usually use Shapley values to estimate attributions of input variables of a DNN. Let the input sample x of the DNN contain n input variables, i.e., $x = [x_1, \dots, x_n]$, indexed by $N = \{1, \dots, n\}$. The Shapley value $\phi_i$ of the i-th input variable is defined as

$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \big[ v(x_{S \cup \{i\}}) - v(x_S) \big]$  (1)

where $v(x_S) \in \mathbb{R}$ denotes the model output when variables in S are present and variables in N\S are masked. Specifically, $v(x_\emptyset)$ represents the model output when all input variables are masked. The Shapley value of the variable i is computed as the weighted marginal contribution of the variable i being present w.r.t. being masked. People usually use baseline values b to mask variables to represent their absence. Specifically, given an input sample x, $x_S$ denotes a masked sample, which is generated by masking variables in the set N\S:

if $i \in S$, $(x_S)_i = x_i$; otherwise, $(x_S)_i = b_i$  (2)

We aim to learn optimal baseline values b that faithfully represent the absent states of input variables.

[Table 1: The ratio R of the remaining and newly introduced causal effects in the masked inputs, reported as $R^{(\text{zero})}$, $R^{(\text{mean})}$, $R^{(\text{blur})}$, and $R^{(\text{conditional})}$. A small value of R means that the baseline values removed most original causal effects and did not introduce many new effects.]

Decomposing a DNN's output into sparse interactions. Given a trained DNN v and an input x with n input variables, Ren et al. (2023a) have proven that the DNN output v(x) can be decomposed into effects of interactions between input variables. Specifically, let $S \subseteq N$ denote a subset of input variables. The interaction effect between variables in S is defined as the following Harsanyi dividend (Harsanyi, 1982):

$U_S \overset{\text{def}}{=} \sum_{S' \subseteq S} (-1)^{|S| - |S'|} \, v(x_{S'})$  (3)

Based on this definition, we have $v(x) = \sum_{S \subseteq N} U_S$. Sparse salient interactions can be considered as causal patterns (or concepts) encoded by the DNN.
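To make the Shapley value definition and the baseline-masking scheme above concrete, here is a minimal, self-contained sketch that computes exact Shapley values by enumerating all subsets for a toy three-variable model. The model and the zero baseline are illustrative choices, not the paper's actual networks.

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values by enumerating all subsets (O(2^n); toy sizes only).
    v(S) is the model output when the variables in S are present."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy model: variables absent from S are replaced by baseline values b.
x = [1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0]
def model(z):                # f(z) = 2*z0*z1 + z2
    return 2 * z[0] * z[1] + z[2]
def v(S):
    z = [x[i] if i in S else b[i] for i in range(3)]
    return model(z)

print([round(p, 6) for p in shapley_values(v, 3)])  # -> [1.0, 1.0, 1.0]
```

The efficiency axiom holds by construction: the attributions sum to $v(x_N) - v(x_\emptyset)$.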
Theorem 1 and Remark 1 prove that most interactions have ignorable effects $U_S \approx 0$, and people can use a few salient interactions with non-ignorable effects to well approximate the inference scores on the $2^n$ different masked samples. Thus, we can consider such interactions as causal patterns or concepts encoded by the DNN. Accordingly, we can consider the interaction effect $U_S$ as the causal effect. Besides, Remark 1 has been verified by experiments on different DNNs learned for various tasks, both in Appendix G.1 and in (Ren et al., 2023a).

Theorem 1 (Faithfulness, proven by Ren et al. (2023a) and Appendix E.1). Let us consider a DNN v and an input sample x with n input variables. We can generate $2^n$ different masked samples, i.e., $\{x_S \mid S \subseteq N\}$. The DNN's outputs on all masked samples can always be well mimicked as the sum of the triggered interaction effects in Eq. (3), i.e., $\forall S \subseteq N, \; v(x_S) = \sum_{S' \subseteq S} U_{S'}$.

Remark 1 (Sparsity). Interaction effects in most DNNs are usually very sparse: most interaction effects are almost zero, i.e., $U_S \approx 0$. A few most salient interaction effects in a set Ω (fewer than 100 interaction effects in most cases) are already enough to approximate the DNN, i.e., $\forall S \subseteq N, \; v(x_S) \approx \sum_{S' \in \Omega, S' \subseteq S} U_{S'}$, where $|\Omega| \ll 2^n$.

Each causal pattern (concept) S represents an AND relationship between input variables in S. For example, the head pattern of a rooster consists of {forehead, eyes, beak, crown}. If the forehead, eyes, beak, and crown of the rooster co-appear, then the head pattern S = {forehead, eyes, beak, crown} is triggered and makes a causal effect $U_S$ on the output. Otherwise, if any part is masked, the causal pattern S is not triggered, and the DNN's inference score v(x) does not receive the causal effect $U_S$. In sum, masking an input variable i is supposed to remove the causal effects of all AND relationships that contain the variable i. Please see Appendix F for the proof.
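A minimal numerical sketch of the Harsanyi dividend in Eq. (3) and of Theorem 1: the effects $U_S$ are computed by Möbius inversion over all subsets, and the output on every masked sample is recovered as the sum of triggered effects. The toy model here is hypothetical; a real DNN would take the place of v.

```python
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def harsanyi_dividends(v, n):
    """U_S = sum_{S' subset of S} (-1)^(|S|-|S'|) v(S')  (Moebius inversion of v)."""
    return {frozenset(S): sum((-1) ** (len(S) - len(Sp)) * v(set(Sp))
                              for Sp in powerset(S))
            for S in powerset(range(n))}

# Toy "network": v(S) masks absent variables with baseline 0.
x = [1.0, 1.0, 1.0]
def v(S):
    z = [x[i] if i in S else 0.0 for i in range(3)]
    return 2 * z[0] * z[1] + z[2]

U = harsanyi_dividends(v, 3)
# Theorem 1 (faithfulness): v(x_S) equals the sum of triggered effects.
for S in powerset(range(3)):
    assert abs(v(set(S)) - sum(u for T, u in U.items() if T <= frozenset(S))) < 1e-9
# Sparsity: only two patterns have non-zero effects, {0, 1} and {2}.
salient = {T for T, u in U.items() if abs(u) > 1e-9}
print(sorted(tuple(sorted(T)) for T in salient))  # -> [(0, 1), (2,)]
```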

3.1. EXAMINING THE FAITHFULNESS OF BASELINE VALUES USING CAUSAL PATTERNS

We use salient causal patterns (or concepts) to evaluate the faithfulness of masking methods. Specifically, we examine whether baseline values remove most causal effects depending on $x_i$, and whether baseline values generate new causal effects. The evaluation of masking methods based on salient causal effects is theoretically supported from the following three perspectives.

First, Theorem 1 and Remark 1 prove that the inference score of a DNN can be faithfully disentangled into a relatively small number of causal patterns.

Second, Theorem 2 shows that Shapley values can be explained as a re-allocation of causal effects to input variables. Therefore, reducing the effects of salient patterns means removing the elementary factors that determine Shapley values. Besides, in order to verify that the reduction of causal patterns can really represent the absence of input variables, we conducted experiments and found that white-noise inputs triggered far fewer salient patterns than normal images. Please see Appendix G.2 for details.

Theorem 2 (proven by Harsanyi (1982) and Appendix E.2). We can directly derive Shapley values from the effects $U_S$ of causal patterns. The Shapley value can be considered as uniformly allocating each causal pattern S's effect $U_S$ to all its variables, i.e., $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{1}{|S|+1} U_{S \cup \{i\}}$.

Third, an incorrect baseline value $b_i$ will make partial effects of the AND relationships of the variable i be mistakenly explained as an exponential number of additional redundant causal patterns, which significantly complicates the explanation. Therefore, the optimal baseline value is supposed to generate the sparsest causal patterns as the simplest explanation of the DNN. Compared to the dense causal patterns generated by sub-optimal baseline values, the simplest explanation removes as many existing causal effects as possible without introducing additional causal effects.
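Theorem 2 can be checked numerically on a toy function: allocating each Harsanyi dividend $U_{S \cup \{i\}}$ uniformly over its $|S|+1$ variables reproduces the classical Shapley values. The three-variable model below is an illustrative stand-in, not the paper's setup.

```python
from itertools import chain, combinations
from math import factorial

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

x, b = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
def v(S):
    z = [x[i] if i in S else b[i] for i in range(3)]
    return 2 * z[0] * z[1] + z[2]

# Harsanyi dividends U_S (Moebius inversion of v).
U = {frozenset(S): sum((-1) ** (len(S) - len(T)) * v(set(T)) for T in powerset(S))
     for S in powerset(range(3))}

# Theorem 2: phi_i = sum over S not containing i of U_{S ∪ {i}} / (|S| + 1).
phi_from_patterns = [
    sum(U[frozenset(S) | {i}] / (len(S) + 1)
        for S in powerset(set(range(3)) - {i}))
    for i in range(3)]

# Classical Shapley values, for comparison.
phi_classic = []
for i in range(3):
    acc = 0.0
    for S in powerset(set(range(3)) - {i}):
        w = factorial(len(S)) * factorial(3 - len(S) - 1) / factorial(3)
        acc += w * (v(set(S) | {i}) - v(set(S)))
    phi_classic.append(acc)

print(phi_from_patterns)  # -> [1.0, 1.0, 1.0]
assert all(abs(a - c) < 1e-9 for a, c in zip(phi_from_patterns, phi_classic))
```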
For example, if we use incorrect baseline values $b'_p = 1, b'_q = 1$, then this function will be explained as four causal patterns $\Omega = \{\emptyset, \{p\}, \{q\}, \{p,q\}\}$, i.e., $f(x) = U_\emptyset C_\emptyset + U_{\{p\}} C_{\{p\}} + U_{\{q\}} C_{\{q\}} + U_{\{p,q\}} C_{\{p,q\}}$, where $C_S$ indicates whether the pattern S is triggered, and $U_\emptyset = 2w$, $U_{\{p\}} = -4w$, $U_{\{q\}} = -3w$, and $U_{\{p,q\}} = 6w$ are computed using the incorrect baseline values. The incorrect baseline values thus introduce complicated causal patterns and lead to incorrect Shapley values $\phi_p = -w, \phi_q = 0$. In fact, most newly introduced causal patterns exist because the effects of a high-order causal pattern are not fully removed, or because incorrect baseline values cause OOD causal patterns (new OOD edges or shapes).

3.2. PROBLEMS WITH PREVIOUS MASKING METHODS

In this subsection, we compare causal patterns in the masked sample with causal patterns in the original sample to evaluate the following baseline values.

(1) Mean baseline values. As Fig. 1 shows, the baseline value of each input variable is set to the mean value of this variable over all samples (Dabkowski and Gal, 2017), i.e., $b_i = \mathbb{E}_x[x_i]$. However, this method empirically introduces additional signals to the input. For example, mean values introduce massive grey dots to images and may form new edges as abnormal causal patterns. This has been verified by the experiments in Table 1; experimental details will be introduced later.

(2) Zero baseline values. Baseline values of all input variables are set to zero (Ancona et al., 2019; Sundararajan et al., 2017), i.e., $\forall i \in N, b_i = 0$. As Fig. 1 shows, just like mean baseline values, zero baseline values also introduce additional signals (black dots) to the input (verified in Table 1).

(3) Blurring input samples. Fong and Vedaldi (2017) and Fong et al. (2019) blur image pixels $x_i$ using a Gaussian kernel to obtain their masked states. Covert et al. (2020a) and Sturmfels et al. (2020) noted that this approach only removes high-frequency signals but fails to remove low-frequency signals.

(4) Determining a different baseline value for each variable in each specific context S. Instead of fixing baseline values as constants, some studies use varying baseline values to compute $v(x_S)$ given x, determined temporarily by the context S in x. Some methods (Frye et al., 2021; Covert et al., 2020b) define $v(x_S)$ by modeling the conditional distribution of variable values in N\S given the context S, i.e., $v(x_S) = \mathbb{E}_{p(x'|x_S)}[\text{model}(x_S \sqcup x'_{N \setminus S})]$, where the operation ⊔ denotes the concatenation of x's dimensions in S and x''s dimensions in N\S. By assuming independence between input variables, the above conditional baseline values can be simplified to marginal baseline values (Lundberg and Lee, 2017), i.e.
$v(x_S) = \mathbb{E}_{p(x')}[\text{model}(x_S \sqcup x'_{N \setminus S})]$.

We conducted experiments to examine whether the above baseline values removed all causal patterns in the original input and whether they introduced new causal patterns. We used the following metric to evaluate the quality of masking:

$R = \mathbb{E}_x \Big[ \big( \textstyle\sum_{S \subseteq N} |U'_S| - \sum_{S \subseteq N} |U^{(\text{noise})}_S| \big) \, / \, \sum_{S \subseteq N} |U_S| \Big]$

We generated a set of samples based on x, where a set of input variables were masked, and $U'_S$ denotes the causal effect in such masked samples. $U_S$ denotes the causal effect in the original sample x, which is used for normalization. $U^{(\text{noise})}_S$ denotes the causal effect in a white-noise input, and it represents the unavoidable effect of huge amounts of noise patterns. Thus, we considered the $U^{(\text{noise})}_S$ term an inevitable anchor value and removed it from R for a more convincing evaluation. A masking method has two kinds of effects on causal patterns: (1) we hope it removes all existing salient patterns in the original sample, and (2) we do not expect it to introduce new salient patterns. The removal of existing salient patterns decreases the R value, while the triggering of new patterns increases it; thus, the R metric reflects both effects, and a small value of R indicates a good setting of baseline values.

We used 20 images in the MNIST dataset (LeCun et al., 1998) and 20 images in the CIFAR-10 dataset (Krizhevsky et al., 2009) to compute R, respectively. We split each MNIST image into 7 × 7 grids and each CIFAR-10 image into 8 × 8 grids. For each image, we masked the central 4 × 3 grids using the zero baseline, mean baseline, blur baseline, and the baseline based on the conditional distribution, and computed the metrics $R^{(\text{zero})}$, $R^{(\text{mean})}$, $R^{(\text{blur})}$, and $R^{(\text{conditional})}$, respectively. Table 1 shows that the ratios R obtained with previous baseline values were all large.
Although the masking method based on the conditional distribution performed better than some other baseline values, our method exhibited the best performance. This indicates that previous masking methods failed to remove most existing patterns and/or triggered new patterns.
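For clarity, the R metric of Section 3.2 can be sketched as follows for a single sample, given dictionaries that map each salient pattern to its causal effect. The pattern names and effect values here are made up purely for illustration; in the experiments, the effects come from Harsanyi dividends of real DNNs.

```python
def total_strength(U):
    """Sum of absolute causal effects, sum_S |U_S|."""
    return sum(abs(u) for u in U.values())

def masking_ratio(U_masked, U_noise, U_orig):
    """Single-sample version of R = (sum|U'_S| - sum|U^(noise)_S|) / sum|U_S|.
    Small R: masking removed the original effects and added few new ones."""
    return (total_strength(U_masked) - total_strength(U_noise)) / total_strength(U_orig)

# Hypothetical effects: masking removed the 'head' pattern but introduced
# a new 'black-square' pattern; a white-noise input triggers tiny effects.
U_orig   = {('head',): 3.0, ('tail',): 1.0}
U_masked = {('tail',): 1.0, ('black-square',): 0.5}
U_noise  = {('speckle',): 0.2}
print(masking_ratio(U_masked, U_noise, U_orig))  # -> 0.325
```

In the paper's evaluation, R is additionally averaged over samples; this sketch shows only the per-sample ratio.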

3.3. ABSENCE STATES AND OPTIMAL BASELINE VALUES

In the original scenario of game theory, the Shapley value was proposed without the need to define the absence of players. When explaining a DNN, however, we consider that the true absent state of variables should generate the most simplified causal explanation. Remark 2 and Theorem 3 show that correct baseline values usually generate the simplest causal explanation, i.e., one using the least number of causal patterns to explain the DNN. In comparison, if an incorrect baseline value $b_i$ does not fully remove all effects of the AND relationships of the variable i, then the remaining effects will be mistakenly explained as a large number of other redundant patterns. This conclusion fits Occam's Razor: the simplest causality, with the minimum number of causal patterns, is more likely to represent the essence of the DNN's inference logic. This also lets us consider the baseline values that minimize the number of salient causal patterns (i.e., achieving the simplest causality) as the optimal baseline values. Therefore, the learning of the baseline values $b^*$ can be formulated as sparsifying the causal patterns in the deep model. In particular, such baseline values are supposed to remove existing causal effects without introducing many new effects:

$b^* = \arg\min_b \sum_x |\Omega(x)|, \quad \text{subject to } \Omega(x) = \{S \subseteq N : |U_S(x|b)| > \tau\}$  (4)

where $U_S(x|b)$ denotes the causal effect computed on the sample x by setting the baseline values to b.

Table 3: Examples of generated functions and their ground-truth baseline values ($\forall i \in N, x_i \in \{0, 1\}$).
• $-0.185\,x_1(x_2 + x_3)^{2.432} - x_4 x_5 x_6 x_7$ — $b^*_i = 0$ for $i \in \{1, 2, 3, 4, 5, 6, 7\}$
• $-x_1 x_2 x_3 + \text{sigmoid}(-5 x_4 x_5 x_6 x_7 + 2.50) - x_8 x_9$ — $b^*_i = 1$ for $i \in \{4, 5, 6, 7\}$, $b^*_i = 0$ for $i \in \{1, 2, 3, 8, 9\}$
• $-\text{sigmoid}(4 x_1 - 4 x_2 + 4 x_3 - 6.00) - x_4 x_5 x_6 x_7 - x_8 x_9 x_{10}$ — $b^*_i = 1$ for $i = 2$, $b^*_i = 0$ for $i \in \{1, 3, 4, 5, 6, 7, 8, 9, 10\}$
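The objective in Eq. (4) can be illustrated with a tiny AND function: counting salient Harsanyi patterns $|\Omega(x)|$ under different baselines shows that the ground-truth baseline yields the sparsest explanation. The function, the input, and the threshold τ below are illustrative choices.

```python
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def num_salient_patterns(model, x, b, tau=0.1):
    """|Omega(x)| = number of causal patterns with |U_S(x|b)| > tau (cf. Eq. 4)."""
    n = len(x)
    def v(S):
        return model([x[i] if i in S else b[i] for i in range(n)])
    count = 0
    for S in powerset(range(n)):
        U_S = sum((-1) ** (len(S) - len(T)) * v(set(T)) for T in powerset(S))
        if abs(U_S) > tau:
            count += 1
    return count

model = lambda z: z[0] * z[1]      # a single AND pattern; ground-truth baseline is 0
x = [1.0, 1.0]
print(num_salient_patterns(model, x, b=[0.0, 0.0]))  # -> 1 (only the pattern {1, 2})
print(num_salient_patterns(model, x, b=[0.5, 0.5]))  # -> 4 (redundant patterns emerge)
```

The incorrect baseline 0.5 inflates one AND pattern into four salient patterns, matching the intuition behind Theorem 3.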

4. ESTIMATING BASELINE VALUES

Based on Theorem 3, we derived Eq. (4) to learn optimal baseline values, but the computational cost of enumerating all causal patterns is exponential. Thus, we explore an approximate solution for learning baseline values. According to Theorem 4, incorrect baseline values usually mistakenly explain high-order causal patterns as an unnecessarily large number of low-order causal patterns, where the order m of the causal effect $U_S$ is defined as the cardinality of S, i.e., m = |S|. Thus, the objective of learning baseline values is roughly equivalent to penalizing the effects of low-order causal patterns, in order to prevent learning incorrect baseline values that mistakenly represent a high-order pattern as an exponential number of low-order patterns:

$\min_b L(b), \quad \text{subject to } L(b) = \sum_x \sum_{S \subseteq N, |S| \le k} |U_S(x|b)|$  (5)

An approximate-yet-efficient solution. When each input sample contains a huge number of variables, e.g., an image, directly optimizing Eq. (5) is NP-hard. Fortunately, we find that the multi-order Shapley value and the multi-order marginal benefit in the following equations have strong connections with multi-order causal patterns (proven in Appendix H):

$\phi^{(m)}_i(x|b) \overset{\text{def}}{=} \mathbb{E}_{S \subseteq N \setminus \{i\}, |S| = m} \big[ v(x_{S \cup \{i\}}, b) - v(x_S, b) \big] = \mathbb{E}_{S \subseteq N \setminus \{i\}, |S| = m} \Big[ \textstyle\sum_{L \subseteq S} U_{L \cup \{i\}}(x|b) \Big]$

$\Delta v_i(S|x, b) \overset{\text{def}}{=} v(x_{S \cup \{i\}}, b) - v(x_S, b) = \textstyle\sum_{L \subseteq S} U_{L \cup \{i\}}(x|b)$  (6)

where $\phi^{(m)}_i(x|b)$ and $\Delta v_i(S|x, b)$ denote the m-order Shapley value and the m-order marginal benefit computed using baseline values b, respectively, with the order given by m = |S|. According to Eq. (6), high-order causal patterns $U_S$ are only contained in high-order Shapley values $\phi^{(m)}_i$ and high-order marginal benefits $\Delta v_i$. Therefore, in order to penalize the effects of low-order causal patterns, we penalize the strength of low-order Shapley values and low-order marginal benefits, respectively, as an engineering solution to boost computational efficiency. In experiments, these loss functions were optimized via SGD.
$L_{\text{Shapley}}(b) = \mathbb{E}_{m \sim \text{Unif}(0, \lambda)} \sum_{x \in X} \sum_{i \in N} \big| \phi^{(m)}_i(x|b) \big|, \quad L_{\text{marginal}}(b) = \mathbb{E}_{m \sim \text{Unif}(0, \lambda)} \sum_{x \in X} \sum_{i \in N} \mathbb{E}_{S \subseteq N \setminus \{i\}, |S| = m} \big| \Delta v_i(S|x, b) \big|$  (7)

where λ denotes the maximum order to be penalized, i.e., only orders m ≤ λ are penalized. We have conducted experiments to verify that baseline values b learned with the loss functions in Eq. (7) effectively sparsify the effects of low-order causal patterns in Eq. (5); please see Appendix G.3 for results. Most importantly, we again used the metric R in Section 3.2 to check whether the learned baseline values removed the original causal patterns in the input without introducing new patterns. The low value of $R^{(\text{ours})}$ in Table 1 shows that baseline values learned by our method successfully removed existing salient causal patterns without introducing many new salient patterns.
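As a minimal stand-in for optimizing Eq. (7) by SGD, the sketch below learns baseline values for a toy AND function $f(z) = z_1 z_2$ over a tiny dataset by penalizing order-0 marginal benefits. It uses squared (rather than absolute) differences for smoothness and finite-difference gradient descent; the dataset, learning rate, and smooth surrogate are all illustrative simplifications of the paper's actual procedure.

```python
# Toy model with ground-truth baseline b* = (0, 0): f is an AND of two variables.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

def v(x, b, S):
    z = [x[j] if j in S else b[j] for j in range(2)]
    return z[0] * z[1]

def loss(b):
    """Order-0 part of L_marginal, with squared instead of absolute values
    (an illustrative smooth surrogate): sum_x sum_i (v(x_{i}) - v(x_empty))^2."""
    return sum((v(x, b, {i}) - v(x, b, set())) ** 2
               for x in X for i in range(2))

# Projected finite-difference gradient descent as a stand-in for SGD.
b, lr, eps = [0.7, 0.4], 0.05, 1e-5
for _ in range(3000):
    grad = []
    for j in range(2):
        bp, bm = b[:], b[:]
        bp[j] += eps
        bm[j] -= eps
        grad.append((loss(bp) - loss(bm)) / (2 * eps))
    b = [min(1.0, max(0.0, bj - lr * g)) for bj, g in zip(b, grad)]

print([round(bj, 2) for bj in b])  # -> [0.0, 0.0], the ground-truth baseline
```

Minimizing the low-order penalty drives the baseline toward the value at which masking genuinely silences the AND pattern.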

5. EXPERIMENTS

5.1. VERIFICATION OF CORRECTNESS OF BASELINE VALUES AND SHAPLEY VALUES

Correctness of baseline values on synthetic functions. People usually cannot determine the ground truth of baseline values for real images, such as those in the MNIST dataset. Therefore, we conducted experiments on synthetic functions with ground-truth baseline values, in order to verify the correctness of the learned baseline values. We randomly generated 100 functions whose causal patterns and ground-truth baseline values could be easily determined. This dataset has been released at https://github.com/zzp1012/faithful-baseline-value. The generated functions were composed of addition, subtraction, multiplication, exponentiation, and sigmoid operations (see Table 3). For example, the function $y = \text{sigmoid}(3x_1x_2 - 3x_3 - 1.5) - x_4x_5 + 0.25(x_6 + x_7)^2$, $x_i \in \{0, 1\}$, contains three causal patterns (i.e., $\{x_1, x_2, x_3\}$, $\{x_4, x_5\}$, $\{x_6, x_7\}$), which are activated only if $x_i = 1$ for $i \in \{1, 2, 4, 5, 6, 7\}$ and $x_3 = 0$. In this case, the ground truth of baseline values is $b^*_i = 0$ for $i \in \{1, 2, 4, 5, 6, 7\}$ and $b^*_3 = 1$.

Correctness of baseline values on functions in (Tsang et al., 2018). Besides, we also evaluated the correctness of the learned baseline values using functions in Tsang et al. (2018). Among all 92 input variables in these functions, the ground truth of 61 variables could be determined (see Appendix G.4). Thus, we used these annotated baseline values to test the accuracy. Table 4 reports the accuracy of the learned baseline values on the above functions. In most cases, the accuracy was above 90%, showing that our method could effectively learn correct baseline values. A few functions in (Tsang et al., 2018) did not have salient causal patterns, which caused errors in the learning. Besides, we tested our method under three different initializations of baseline values (i.e., 0, 0.5, and 1). Table 4 shows that baseline values learned with different initialization settings all converged to similarly high accuracy.
Correctness of the computed Shapley values. Incorrect baseline values lead to incorrect Shapley values. We verified the correctness of the computed Shapley values on the extended Addition-Multiplication dataset (Zhang et al., 2021c), to which we added the subtraction operation to avoid all baseline values being zero. Theorem 2 considers the Shapley value as a uniform assignment of the effect of each causal pattern to its compositional variables. This enables us to determine the ground-truth Shapley value $\phi^*_i$ of each variable directly from the causal patterns, without specifying baseline values. For example, the function $f(x) = 3x_1x_2 + 5x_3x_4 + x_5$, evaluated at $x = [1, 1, 1, 1, 1]$, contains three causal patterns according to the principle of the most simplified causality. Accordingly, the ground-truth Shapley values are $\phi^*_1 = \phi^*_2 = 3/2$, $\phi^*_3 = \phi^*_4 = 5/2$, and $\phi^*_5 = 1$. See Appendix G.5 for more details. The estimated Shapley value $\phi_i$ was considered correct if $|\phi_i - \phi^*_i| \le 0.01$, and incorrect otherwise. Then, we computed the accuracy of the estimated Shapley values as the ratio of input variables with correct Shapley values.

Discussion on why the learned baseline values generated correct Shapley values. We computed Shapley values of variables in the extended Addition-Multiplication dataset using different baseline values, and compared their accuracy in Table 5. The results show that our method exhibited the highest accuracy. Table 6 shows an example of incorrect Shapley values computed using other baseline values; our method generated correct Shapley values in this example. For the variable $x_6$, due to its negative coefficient -1.98, its contribution should be negative. However, all other baseline values generated positive Shapley values for $x_6$. The term $-4.23x_7$ showed the significant effect of the variable $x_7$ on the output, but its Shapley value computed using the baseline values in SHAP was just -0.010, which was obviously incorrect.
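The ground-truth Shapley values quoted above can be reproduced directly: with zero baselines, a brute-force Shapley computation on $f(x) = 3x_1x_2 + 5x_3x_4 + x_5$ recovers the uniform allocation of each pattern's effect predicted by Theorem 2.

```python
from itertools import combinations
from math import factorial

x = [1.0] * 5  # evaluate at x = [1, 1, 1, 1, 1]

def v(S):
    # f(x) = 3*x1*x2 + 5*x3*x4 + x5, with absent variables set to baseline 0
    z = [x[j] if j in S else 0.0 for j in range(5)]
    return 3 * z[0] * z[1] + 5 * z[2] * z[3] + z[4]

phi = []
for i in range(5):
    others = [j for j in range(5) if j != i]
    acc = 0.0
    for size in range(5):
        w = factorial(size) * factorial(5 - size - 1) / factorial(5)
        for S in combinations(others, size):
            acc += w * (v(set(S) | {i}) - v(set(S)))
    phi.append(acc)

print([round(p, 6) for p in phi])  # -> [1.5, 1.5, 2.5, 2.5, 1.0]
```

Each pattern's effect (3, 5, and 1) is split uniformly over the variables it contains, exactly as Theorem 2 states.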

5.2. RESULTS AND EVALUATION ON REALISTIC DATASETS AND MODELS

Learning baseline values. We used our method to learn baseline values for MLPs, LeNet (LeCun et al., 1998), and ResNet-20 (He et al., 2016) trained on the UCI South German Credit dataset (namely, the credit dataset) (Dua and Graff, 2017), the UCI Census Income dataset (namely, the income dataset) (Dua and Graff, 2017), the MNIST dataset (LeCun et al., 1998), and the CIFAR-10 dataset (Krizhevsky et al., 2009), respectively. We learned baseline values by using either $L_{\text{Shapley}}$ or $L_{\text{marginal}}$ as the loss function. In the computation of $L_{\text{Shapley}}$, we set $v(x_S) = \log \frac{p(y^{\text{truth}}|x_S)}{1 - p(y^{\text{truth}}|x_S)}$. In the computation of $L_{\text{marginal}}$, $|\Delta v_i(S)|$ was set to $|\Delta v_i(S)| = \| h(x_{S \cup \{i\}}) - h(x_S) \|_1$, where $h(x_S)$ denotes the output feature of the penultimate layer given the masked input $x_S$, in order to boost the efficiency of learning. We set λ = 0.2n for the MNIST and CIFAR-10 datasets, and λ = 0.5n for the simpler data in the two UCI datasets. Given the baseline values, we used the sampling-based approximation (Castro et al., 2009) to estimate Shapley values. We used two ways to initialize baseline values before learning, i.e., setting baseline values to zero or to mean values over different samples, namely zero-init and mean-init, respectively. Fig. 3 (left) shows that baseline values learned with different initialization settings all converged to similar baseline values, except for very few dimensions with multiple local-minimum solutions (discussed in Appendix G.7), which demonstrates the stability of our method.

Comparison of attributions computed using different baseline values. Fig. 3 shows the learned baseline values and the computed Shapley values on the income dataset. We found that attributions generated by zero/mean baseline values conflicted with the results of all other methods. Our method found that occupation had more influence than marital status on income, which is somewhat consistent with life experience.
However, baseline values in SHAP and SAGE sometimes generated abnormal explanations. In the top-right example, the attribute capital gain was zero, which was not supposed to support the prediction of "the person made over 50K a year." However, SAGE's baseline values generated a large positive Shapley value for capital gain. In the bottom-right example, both SHAP and SAGE considered the marital status important for the prediction, and SHAP did not consider the occupation an important variable. Therefore, we considered these explanations unreliable. Attribution maps and baseline values generated on the CIFAR-10 and the MNIST datasets are provided in Appendix G.6. Compared to zero/mean/blurring baseline values, our baseline values were more likely to ignore noisy variables in the background, which were far from the foreground in images. Compared to SHAP, our method yielded more informative attributions. Besides, our method generated smoother attributions than SAGE.
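For reference, the sampling-based approximation of (Castro et al., 2009) mentioned above averages marginal contributions over random orderings of players. Below is a minimal, self-contained sketch; the toy game v, the number of permutations, and the seed are illustrative assumptions, not the paper's actual models or settings.

```python
import random

def shapley_sampling(value, n, n_perm=2000, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over random permutations (Castro et al., 2009)."""
    rng = random.Random(seed)
    phi = [0.0] * n
    players = list(range(n))
    for _ in range(n_perm):
        rng.shuffle(players)
        coalition = frozenset()
        v_prev = value(coalition)
        for i in players:                     # add players one by one
            coalition = coalition | {i}
            v_cur = value(coalition)
            phi[i] += (v_cur - v_prev) / n_perm
            v_prev = v_cur
    return phi

# toy game: an AND pattern over players {0, 1}; exact Shapley values are 0.5, 0.5, 0
v = lambda S: 1.0 if {0, 1} <= set(S) else 0.0
est = shapley_sampling(v, n=3)
```

Each permutation contributes exactly v(N) − v(∅) in total, so the estimates always satisfy the efficiency axiom, while the per-player values converge only as the number of permutations grows.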

6. CONCLUSIONS

In this paper, we have defined the absence state of input variables in terms of causality. Then, we have found that most existing masking methods cannot faithfully remove existing causal patterns without triggering new patterns. In this way, we have formulated optimal baseline values for the computation of Shapley values as those that remove most causal patterns. Then, we have proposed an approximate-yet-efficient method to learn optimal baseline values that represent the absence states of input variables. Experimental results have demonstrated the effectiveness of our method.

ETHICS STATEMENT

This paper aims to examine the masking approach in previous explaining methods. We find that previous settings of the masking approach cannot faithfully represent the absence of input variables, thereby hurting the trustworthiness of the obtained explanations. Therefore, we propose a new method to learn optimal baseline values to represent the absence of input variables. In this way, the trustworthiness of explanations of the DNN is further boosted. There are no ethical issues with this paper.

REPRODUCIBILITY STATEMENT

We have provided proofs for all theoretical results in Appendix E and Appendix H. We have also provided experimental details in Section 5 and Appendix G. Furthermore, we will release the code when the paper is accepted. 

A RELATED WORKS

No previous methods directly examined the faithfulness of masking methods. Instead, we made a survey of the larger scope of attribution methods and other explainable AI studies, and put it in the appendix. Nevertheless, we will move this section back to the main paper if the paper is accepted. In the scope of explainable AI, many methods (Simonyan et al., 2014; Yosinski et al., 2015; Mordvintsev et al., 2015; Dosovitskiy and Brox, 2016; Zhou et al., 2015) have been proposed to explain DNNs. Among all methods, the estimation of attributions for each input variable represents a classical direction (Zhou et al., 2016; Selvaraju et al., 2017; Lundberg and Lee, 2017; Shrikumar et al., 2017). In this paper, we mainly focus on attributions based on Shapley values.

Shapley values. The Shapley value (Shapley, 1953) in game theory is widely considered a fair distribution of the overall reward in a game to each player (Weber, 1988). (Sen et al., 1981) and (Grömping, 2007) used the Shapley value to attribute the correlation coefficient of a linear regression to input features. (Štrumbelj et al., 2009; Štrumbelj and Kononenko, 2014) used the Shapley value to attribute the prediction of a model to input features. (Bork et al., 2004) used the Shapley value to measure the importance of protein interactions in large, complex biological interaction networks. (Keinan et al., 2004) employed the Shapley value to measure causal effects in neurophysical models. (Sundararajan et al., 2017) proposed Integrated Gradients based on the Aumann-Shapley (Aumann and Shapley, 2015) cost-sharing technique. Besides the above local explanations, (Covert et al., 2020b) focused on global interpretability. In order to compute the Shapley value in deep models efficiently, (Lundberg and Lee, 2017) proposed various approximations for Shapley values in DNNs. (Lundberg et al., 2018) further computed the Shapley value on tree ensembles.
(Aas et al., 2021) generalized the approximation method in (Lundberg and Lee, 2017) to the case where features are correlated with each other. (Ancona et al., 2019) further formulated a polynomial-time approximation of Shapley values for DNNs.

Baseline values. In terms of baseline values for Shapley values, most studies (Covert et al., 2020a; Merrick and Taly, 2020; Sundararajan and Najmi, 2020; Kumar et al., 2020) compared the influence of baseline values on explanations, without providing any principles for setting baseline values. Shrikumar et al. (2017) proposed DeepLIFT to estimate attributions of input variables, and also discussed the choice of baseline values. Besides, Agarwal and Nguyen (2021) and Frye et al. (2021) used generative models to alleviate the out-of-distribution problem caused by baseline values. Unlike previous studies, we rethink and formulate baseline values from the perspective of game-theoretic causality. We define the absence state of input variables, and propose a method to learn optimal baseline values based on the number of causal patterns.

B SHAPLEY VALUES ON A DECISION TREE FOR CLASSIFICATION

In order to quantitatively evaluate Shapley values computed with different baseline values on the MNIST dataset, we constructed an And-Or decision tree following (Harradon et al., 2018), whose structure directly provided the ground-truth Shapley value for each input variable. Then, we used different attribution methods to explain the decision tree. Table 7 shows that our method generated more accurate Shapley values than other baseline values. We constructed a decision tree (Song et al., 2013) for each category in the MNIST dataset. Specifically, for each category (digit), we first computed the average image over all training samples in this category. Let x^(c) ∈ R^n denote the average image of the c-th category. Then, we built a decision tree by considering each pixel as an internal node. The splitting rule was designed as follows. Given an input x in the category c, the splitting criterion at the pixel (node) x_i was (x^(c)_i > 0.5) & (x_i > 0.5).² If (x^(c)_i > 0.5) & (x_i > 0.5) = True, then the pixel value x_i was added to the output; otherwise, x_i was ignored. In this way, the output of the decision tree was f(x) = Σ_{i∈V} x_i, where V = {i ∈ N | (x^(c)_i > 0.5) & (x_i > 0.5) = True} denotes the set of all pixels that satisfied the above criterion. For inference, the probability of x belonging to the category c was p(c|x) = sigmoid(γ(f(x) − β)), where γ = 40 was a constant and β ∝ Σ_{i∈N} 1_{x^(c)_i > 0.5}. In this case, we defined v(x_N) = log [p(c|x) / (1 − p(c|x))]. Thus, the co-appearance of pixels in V formed a causal pattern that contributed to v(x_N). In other words, because ∀i ∈ N, x_i ≥ 0, the absence of any pixel in V might deactivate this pattern by leading to a small probability p(c|x) < 0.5 and a small v. This pattern can also be understood as an AND node in the And-Or decision tree (Song et al., 2013). In the above decision tree, the ground-truth Shapley values of input variables (pixels) were easy to determine.
The above decision tree ensured that the absence of any variable in V would deactivate the causal pattern. Therefore, according to Theorem 2 in the paper, the output probability should be fairly assigned to the pixels in V, i.e., they shared the same Shapley value φ*_i = v(x_N)/|V|. For other pixels that did not contribute to the output, the ground-truth Shapley values were zero. We estimated Shapley values of input variables in the above decision tree by using zero baseline values, mean baseline values, baseline values in SHAP, and the baseline values learned by our method, respectively. Let φ_i denote the estimated Shapley value of the variable i. If |φ_i − φ*_i| ≤ 0.01, we considered the estimated Shapley value φ_i correct; otherwise, incorrect. In this way, we computed the accuracy of the estimated Shapley values, and Table 7 shows that our method achieved the highest accuracy.
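The inference of the And-Or decision tree described above can be sketched as follows. The pixel values, the class-average image, and the constant beta in this snippet are toy assumptions (the paper only specifies γ = 40 and β ∝ Σ_i 1_{x^(c)_i > 0.5}); the snippet merely illustrates how v(x_N) equals the log-odds γ(f(x) − β).

```python
import math

def tree_output(x, x_avg, gamma=40.0, beta=2.0):
    """And-Or tree inference: a pixel contributes only if both the class-average
    image and the input exceed 0.5; p(c|x) = sigmoid(gamma * (f(x) - beta))."""
    V = [i for i in range(len(x)) if x_avg[i] > 0.5 and x[i] > 0.5]
    f = sum(x[i] for i in V)                          # f(x) = sum of pixels in V
    p = 1.0 / (1.0 + math.exp(-gamma * (f - beta)))   # p(c|x)
    v = math.log(p / (1.0 - p))                       # v(x_N), the log-odds
    return f, p, v

# toy 4-pixel "image": only pixels 0 and 1 pass both criteria, so V = {0, 1}
f, p, v = tree_output(x=[0.9, 0.8, 0.2, 0.7], x_avg=[1.0, 1.0, 0.0, 0.3])
```

With f(x) = 1.7 below the threshold beta = 2, the output probability p(c|x) falls under 0.5, matching the text's observation that losing pixels of the pattern deactivates it.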

C REMOVING ADVERSARIAL PERTURBATIONS FROM THE INPUT

Let x denote the normal sample, and let x^adv = x + δ denote the adversarial example generated by (Madry et al., 2018). According to (Ren et al., 2021), the adversarial example x^adv mainly created out-of-distribution bivariate interactions with high-order contexts, which were actually related to the high-order interactions (causal patterns) in this paper. Thus, in the scenario of this study, the adversarial utility was owing to out-of-distribution high-order interactions (causal patterns). The removal of input variables was supposed to remove most high-order causal patterns. Therefore, the learned baseline values can be considered a recovery of the original sample. In this way, we used the adversarial example x^adv to initialize baseline values before learning, and used L_marginal to learn baseline values. If the learned baseline values b satisfied ‖b − x‖_1 ≤ ‖x^adv − x‖_1, we considered that our method successfully recovered the original sample to some extent. We conducted experiments using LeNet, AlexNet (Krizhevsky et al., 2012), and ResNet-20 on the MNIST dataset (‖δ‖_∞ ≤ 32/255) and the CIFAR-10 dataset (‖δ‖_∞ ≤ 8/255). Table 8 shows that our method recovered original samples from adversarial examples, which demonstrated the effectiveness of our method.

D AXIOMS OF THE SHAPLEY VALUE

The Shapley value (Shapley, 1953) was first introduced in game theory to measure the contribution of each player in a game. Given an input x with n input variables, i.e., x = [x_1, ..., x_n], we can consider a deep model as a game with n players N = {1, 2, ..., n}. Each player i is an input variable x_i (e.g., an input dimension, a pixel, or a word). In this way, the problem of fairly estimating attributions of input variables in the DNN is equivalent to the problem of fairly assigning the total reward in the game to each player. The Shapley value is widely considered a fair attribution method, because it satisfies the following four axioms (Weber, 1988).

(1) Linearity axiom: If two games can be merged into a new game u(x_S) = v(x_S) + w(x_S), then the Shapley values in the two old games can also be merged, i.e., ∀i ∈ N, φ_{i,u} = φ_{i,v} + φ_{i,w}.

(2) Dummy axiom and nullity axiom: A dummy player i is a player without any interactions with other players, i.e., satisfying ∀S ⊆ N \ {i}, v(x_{S∪{i}}) = v(x_S) + v(x_{{i}}). The dummy player's Shapley value is φ_i = v(x_{{i}}). A null player i is a player that satisfies ∀S ⊆ N \ {i}, v(x_{S∪{i}}) = v(x_S). The null player's Shapley value is φ_i = 0.

(3) Symmetry axiom: If ∀S ⊆ N \ {i, j}, v(x_{S∪{i}}) = v(x_{S∪{j}}), then φ_i = φ_j.

(4) Efficiency axiom: The overall reward of the game is equal to the sum of Shapley values of all players, i.e., v(x_N) − v(x_∅) = Σ_{i∈N} φ_i.

[Footnote 2: For Table 7, the splitting criterion was designed as (x^(c)_i > 0.5).]
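The axioms above can be checked numerically with an exact Shapley computation on a toy game; the game below (an AND pattern over two players plus a dummy player) is an illustrative assumption, and rationals are used so that the checks are exact.

```python
import itertools, math
from fractions import Fraction

def exact_shapley(value, n):
    """phi_i = sum_{S ⊆ N\{i}} |S|!(n-1-|S|)!/n! * [v(S∪{i}) - v(S)], exact in rationals."""
    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = Fraction(0)
        for r in range(len(rest) + 1):
            w = Fraction(math.factorial(r) * math.factorial(n - 1 - r), math.factorial(n))
            for S in itertools.combinations(rest, r):
                S = frozenset(S)
                total += w * (value(S | {i}) - value(S))
        phi.append(total)
    return phi

# toy game: an AND pattern {0, 1} worth 3, plus a dummy player 2 worth 5 on its own
v = lambda S: (3 if {0, 1} <= S else 0) + (5 if 2 in S else 0)
phi = exact_shapley(v, 3)
```

Here symmetry gives phi[0] == phi[1], the dummy axiom gives phi[2] == v({2}) == 5, and efficiency gives sum(phi) == v(N) − v(∅) == 8.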

E PROOFS OF THEOREMS

This section provides proofs of theorems in the main paper.

E.1 PROOF OF THEOREM 1

Theorem 1 (Faithfulness, proven by Ren et al. (2023a)) Let us consider a DNN v and an input sample x with n input variables. We can generate 2^n different masked samples, i.e., {x_S | S ⊆ N}. The DNN's outputs on all masked samples can always be well mimicked as the sum of the triggered interaction effects in Eq. (3), i.e., ∀S ⊆ N, v(x_S) = Σ_{S'⊆S} U_{S'}.

Proof: According to the definition of the Harsanyi dividend, we have, ∀S ⊆ N,

Σ_{S'⊆S} U_{S'} = Σ_{S'⊆S} Σ_{L⊆S'} (−1)^{|S'|−|L|} v(x_L)
= Σ_{L⊆S} Σ_{S'⊆S: S'⊇L} (−1)^{|S'|−|L|} v(x_L)
= Σ_{L⊆S} Σ_{s'=|L|}^{|S|} Σ_{S'⊆S: S'⊇L, |S'|=s'} (−1)^{s'−|L|} v(x_L)
= Σ_{L⊆S} v(x_L) Σ_{m=0}^{|S|−|L|} C(|S|−|L|, m) (−1)^m
= v(x_S),

since the inner alternating sum vanishes unless L = S.

E.2 PROOF OF THEOREM 2

Theorem 2 Harsanyi dividends can be considered as causal patterns of the Shapley value:

φ_i = Σ_{S⊆N\{i}} [1/(|S|+1)] U_{S∪{i}}.

In this way, the effect of a causal pattern consisting of m variables is fairly assigned to the m variables. This connection has been proven in (Harsanyi, 1982).

• Proof:

right = Σ_{S⊆N\{i}} [1/(|S|+1)] U_{S∪{i}}
= Σ_{S⊆N\{i}} [1/(|S|+1)] [Σ_{L⊆S} (−1)^{|S|+1−|L|} v(L) + Σ_{L⊆S} (−1)^{|S|−|L|} v(L∪{i})]
= Σ_{S⊆N\{i}} [1/(|S|+1)] Σ_{L⊆S} (−1)^{|S|−|L|} [v(L∪{i}) − v(L)]
= Σ_{L⊆N\{i}} Σ_{K⊆N\L\{i}} [(−1)^{|K|}/(|K|+|L|+1)] [v(L∪{i}) − v(L)]    % let K = S \ L
= Σ_{L⊆N\{i}} [Σ_{k=0}^{n−1−|L|} ((−1)^k/(k+|L|+1)) C(n−1−|L|, k)] [v(L∪{i}) − v(L)]    % let k = |K|
= Σ_{L⊆N\{i}} [|L|!(n−1−|L|)!/n!] [v(L∪{i}) − v(L)]    % by the property of combinatorial numbers
= φ_i = left.

We consider interactions of input samples that activate causal patterns. We find that when a model/function contains a single complex collaboration between multiple variables (i.e., a high-order causal pattern), incorrect baseline values usually generate a mixture of many low-order causal patterns. In comparison, ground-truth baseline values lead to sparse and high-order causal patterns.

• Theoretical proof: Without loss of generality, let us consider an input sample x with ∀j ∈ S, x_j ≠ δ_j.
Based on the ground-truth baseline values {δ_j}, we have:

(1) v(x_S) = f(x_S) = w_S Π_{j∈S} (x_j − δ_j) ≠ 0;
(2) ∀S' ⊊ S, v(x_{S'}) = w_S Π_{j∈S'} (x_j − δ_j) Π_{k∈S\S'} (δ_k − δ_k) = 0.

Accordingly, we have U_S = Σ_{S'⊆S} (−1)^{|S|−|S'|} v(x_{S'}) = v(x_S) ≠ 0. For S' ⊊ S, we have U_{S'} = Σ_{L⊆S'} (−1)^{|S'|−|L|} v(x_L) = Σ_{L⊆S'} 0 = 0.

(3) ∀S' ≠ S, let S' = L ∪ M, where L ⊆ S and M ∩ S = ∅; the remaining derivation is given in Appendix E.3.

We conducted experimental verification on the functions listed in Table 9 (f(x) = x_1 x_2 x_3 x_4 x_5 with x = [1, 1, 1, 1, 1]; f(x) = sigmoid(5 x_1 x_2 x_3 + 5 x_4 − 7.5) with x = [1, 1, 1, 1]; and f(x) = x_1 (x_2 + x_3 − x_4)^3 with x = [1, 1, 1, 0]), and the results verify our conclusion. We find that when models/functions contain complex collaborations between multiple variables (i.e., high-order causal patterns), incorrect baseline values usually generate fewer high-order causal patterns and more low-order causal patterns than ground-truth baseline values. In other words, the model/function is explained as massive low-order causal patterns. In comparison, ground-truth baseline values lead to sparse and high-order salient patterns.
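Theorems 1 and 2 above can be verified by brute force on a small game; the toy value function below is an illustrative assumption, not one of the paper's models.

```python
import itertools

def subsets(s):
    s = list(s)
    for r in range(len(s) + 1):
        for c in itertools.combinations(s, r):
            yield frozenset(c)

def harsanyi(value, n):
    """U_S = sum_{L ⊆ S} (-1)^{|S|-|L|} v(L): the Harsanyi dividend of each pattern."""
    N = frozenset(range(n))
    return {S: sum((-1) ** (len(S) - len(L)) * value(L) for L in subsets(S))
            for S in subsets(N)}

n = 3
N = frozenset(range(n))
v = lambda S: 2.0 * (0 in S) + 3.0 * ({1, 2} <= S)   # toy masked-output function
U = harsanyi(v, n)

# Theorem 1: v(x_S) = sum_{S' ⊆ S} U_{S'} for every mask S
ok1 = all(abs(v(S) - sum(U[Sp] for Sp in subsets(S))) < 1e-9 for S in subsets(N))

# Theorem 2: phi_i = sum_{S ⊆ N\{i}} U_{S∪{i}} / (|S| + 1)
phi = [sum(U[S | {i}] / (len(S) + 1) for S in subsets(N - {i})) for i in range(n)]
```

For this game, only U_{0} = 2 and U_{1,2} = 3 are non-zero, so Theorem 2 yields phi = [2, 1.5, 1.5], i.e., each pattern's effect is split equally among its variables.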

F PROVING THAT MASKING INPUT VARIABLES REMOVES CAUSAL EFFECTS

In this section, we prove that for a causal pattern S ∋ i, if the input variable i is masked, then the causal effect U_S = 0.

Proof: Let S = S' ∪ {i}. If i ∈ S is masked, then ∀L s.t. i ∉ L, x_L = x_{L∪{i}}. Therefore, v(L ∪ {i}) = v(L). According to the definition of the Harsanyi dividend (Harsanyi, 1982), we have

U_S = Σ_{L⊆S} (−1)^{|S|−|L|} v(L)
= Σ_{L⊆S'∪{i}} (−1)^{|S'|+1−|L|} v(L)
= Σ_{L⊆S'} (−1)^{|S'|+1−|L|} v(L) + Σ_{L⊆S'} (−1)^{|S'|−|L|} v(L∪{i})
= Σ_{L⊆S'} (−1)^{|S'|+1−|L|} v(L) + Σ_{L⊆S'} (−1)^{|S'|−|L|} v(L)
= Σ_{L⊆S'} [(−1)^{|S'|+1−|L|} + (−1)^{|S'|−|L|}] v(L)
= Σ_{L⊆S'} (−1+1)(−1)^{|S'|−|L|} v(L) = 0.

Note that a causal pattern not containing i will not be deactivated by the masking of i. For example, the pattern {eyes, beak} is not deactivated by the absence of the forehead, because this pattern represents the AND relationship between the eyes and the beak, and it does not contain the forehead.
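This deactivation property can be checked numerically: fixing a variable to its masked state collapses the value function to v(S \ {i}), and every Harsanyi dividend of a pattern containing i then vanishes. The toy value function below is an illustrative assumption.

```python
import itertools

def subsets(s):
    s = list(s)
    for r in range(len(s) + 1):
        for c in itertools.combinations(s, r):
            yield frozenset(c)

def harsanyi(value, n):
    """U_S = sum_{L ⊆ S} (-1)^{|S|-|L|} v(L)."""
    N = frozenset(range(n))
    return {S: sum((-1) ** (len(S) - len(L)) * value(L) for L in subsets(S))
            for S in subsets(N)}

n, i = 3, 0
v = lambda S: 2.0 * (0 in S) + 3.0 * ({1, 2} <= S) + 1.0 * ({0, 1} <= S)
# masking variable i: the masked sample never carries the signal x_i,
# so the value function collapses to v(S \ {i})
v_masked = lambda S: v(S - {i})
U = harsanyi(v_masked, n)

dead = all(abs(U[S]) < 1e-9 for S in U if i in S)   # every pattern containing i vanishes
alive = abs(U[frozenset({1, 2})] - 3.0) < 1e-9      # patterns without i survive
```

This matches the eyes/beak example in the text: the pattern {1, 2} keeps its full effect while all patterns containing the masked variable 0 are removed.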

G.1 VERIFICATION OF THE SPARSITY OF CAUSAL PATTERNS

In this subsection, we conducted experiments to verify the sparsity of causal effects, which is introduced in Remark 1. To this end, we computed the causal effects U_S of all 2^n causal patterns encoded by a DNN. Specifically, we trained a three-layer MLP on the income dataset and computed causal effects in the model. Figure 4 shows the distribution of absolute causal effects |U_S| of causal patterns in the first five samples of each category of the income dataset. These results show that most causal patterns had insignificant causal effects, U_S ≈ 0; only a few causal patterns had salient causal effects. Moreover, we also conducted experiments to demonstrate the universality of this phenomenon. We trained a five-layer MLP, a CNN, an LSTM, ResNet-32, and VGG-16 on the UCI census income dataset, the UCI TV news channel commercial detection dataset, the SST-2 dataset, and the MNIST dataset. Figure 5 shows the absolute causal effects |U_S| in descending order. These results show that various DNNs learned on different tasks could be explained by a set of sparse causal patterns.

G.2 VERIFICATION OF CAUSAL PATTERNS ON INPUTS WITH REMOVED VARIABLES

In this subsection, we conducted experiments to verify that causal patterns reflect the removal of existing patterns. Given the causal effects U_S in a normal input image and the causal effects in a white-noise input, we compared their distributions in Figure 6. Note that the white-noise input naturally contained less information for classification than the normal input image. We found that most causal effects in the white-noise input were close to zero, and there were few salient causal patterns. Besides, we computed the average strength of causal effects in the above two inputs. In the normal input, the average strength of causal effects was E_{S⊆N} |U_S| = 5.5285, while in the white-noise input, the average strength E_{S⊆N} |U_S^(noise)| was much smaller.

G.4 DISCUSSION ABOUT THE SETTING OF GROUND-TRUTH BASELINE VALUES

This section discusses the ground truth of baseline values of the synthetic functions in Section 5.1 of the main paper. In order to verify the correctness of the learned baseline values, we conducted experiments on synthetic functions with ground-truth baseline values. We randomly generated 100 functions whose causal patterns and ground truth of baseline values could be easily determined (see Table 10).

G.5 DISCUSSION ABOUT THE SETTING OF GROUND-TRUTH SHAPLEY VALUES

This section discusses the ground truth of Shapley values in the extended Addition-Multiplication dataset (Zhang et al., 2021c), which is used in Section 5.1 of the main paper. In order to verify the correctness of the Shapley values obtained by the optimal baseline values in this paper, we conducted experiments on the extended Addition-Multiplication dataset (Zhang et al., 2021c) with ground-truth Shapley values. The Addition-Multiplication dataset in (Zhang et al., 2021c) contained functions that only consisted of addition and multiplication operations. For example, f(x) = x_1 x_2 + x_3 x_4, where each input variable x_i ∈ {0, 1} was a binary variable. Given x = [1, 1, 1, 1], the function contained two salient causal patterns, i.e., {x_1, x_2} and {x_3, x_4}, and their benefits to the output were U_{x1,x2} = U_{x3,x4} = 1, respectively.
According to (Harsanyi, 1982), the Shapley value uniformly assigns the effect of a causal pattern to the variables in the pattern. Thus, the ground-truth Shapley values of variables were φ*_1 = φ*_2 = 1/2 and φ*_3 = φ*_4 = 1/2. However, if the input was x = [1, 0, 1, 1], then the pattern {x_1, x_2} was deactivated and U_{x1,x2} = 0. In this case, φ*_1 = φ*_2 = 0, while φ*_3 = φ*_4 = 1/2. According to the analysis in Appendix G.4, ground-truth baseline values in the Addition-Multiplication dataset were all zero, which would make our method equivalent to simply using zero baseline values. Therefore, in order to avoid all ground-truth baseline values being zero, we added the subtraction operation. We also added a coefficient before each term in the function to boost the diversity of functions. For example, f(x) = 3.2 x_1 x_2 + 1.5 x_3 (x_4 − 1). This function also contained two causal patterns, but the ground-truth baseline values of variables were different from those of the aforementioned function. Here, b*_1 = b*_2 = b*_3 = 0 and b*_4 = 1. Given the input x = [1, 1, 1, 0], both patterns were activated and f(x) = U_{x1,x2} + U_{x3,x4} = 3.2 + (−1.5) = 1.7. Ground-truth Shapley values of input variables were φ*_1 = φ*_2 = 3.2/2 and φ*_3 = φ*_4 = −1.5/2. However, for the input x = [1, 1, 1, 1], the pattern {x_3, x_4} was deactivated, thereby φ*_3 = φ*_4 = 0. Note that the above function could also be considered to contain three patterns (f(x) = 3.2 x_1 x_2 + 1.5 x_3 x_4 − 1.5 x_3). Following Occam's Razor, we adopt the most simplified interaction, i.e., the least number of causal patterns, to recognize causal patterns in the function. Thus, we consider the above function f(x) = 3.2 x_1 x_2 + 1.5 x_3 (x_4 − 1) to contain two salient causal patterns. Based on the extended Addition-Multiplication dataset, we randomly generated an input sample for each function in the dataset.
Each variable x_i in the input samples was independently sampled following the Bernoulli distribution, i.e., p(x_i = 1) = 0.7. Therefore, for the mean baseline, baseline values of different input variables were all 0.7. For the baseline value based on the marginal distribution, which was used in SHAP (Lundberg and Lee, 2017), x'_i ∼ Bernoulli(0.7). Then, we compared the accuracy of the computed Shapley values of input variables based on zero baseline values, mean baseline values, baseline values in SHAP, and the optimal baseline values defined in this paper, respectively. The result in Table 4 of the main paper shows that the optimal baseline values correctly generated the ground-truth attributions/Shapley values of input variables.
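The ground-truth Shapley values of the example f(x) = 3.2 x_1 x_2 + 1.5 x_3 (x_4 − 1) discussed above can be reproduced exactly; 3.2 and 1.5 are written as rationals so the arithmetic is exact, and the brute-force Shapley routine is a sketch rather than the paper's implementation.

```python
import itertools, math
from fractions import Fraction

def exact_shapley(value, n):
    """phi_i = sum_S |S|!(n-1-|S|)!/n! * [v(S∪{i}) - v(S)], exact in rationals."""
    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = Fraction(0)
        for r in range(len(rest) + 1):
            w = Fraction(math.factorial(r) * math.factorial(n - 1 - r), math.factorial(n))
            for S in itertools.combinations(rest, r):
                S = frozenset(S)
                total += w * (value(S | {i}) - value(S))
        phi.append(total)
    return phi

# f(x) = 3.2 x1 x2 + 1.5 x3 (x4 - 1); sample x = [1, 1, 1, 0], truth b* = [0, 0, 0, 1]
x, b = [1, 1, 1, 0], [0, 0, 0, 1]
f = lambda z: Fraction(16, 5) * z[0] * z[1] + Fraction(3, 2) * z[2] * (z[3] - 1)
v = lambda S: f([x[j] if j in S else b[j] for j in range(4)])
phi = exact_shapley(v, 4)   # expect [8/5, 8/5, -3/4, -3/4] = [1.6, 1.6, -0.75, -0.75]
```

The result matches the text: each pattern's effect (3.2 and −1.5) is split equally between its two variables, and the values sum to v(x_N) − v(x_∅) = 1.7.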

G.6 EXPERIMENTAL RESULTS ON THE MNIST AND THE CIFAR-10 DATASETS

Experimental results on the MNIST dataset. Experimental settings on the MNIST dataset have been introduced in Section 5.2 of the paper. Figure 9 shows the learned baseline values on the MNIST dataset.

Experimental results on the credit dataset. This section provides experimental results on the UCI South German Credit dataset (Dua and Graff, 2017). Based on the UCI datasets, we learned MLPs following the settings in (Guidotti et al., 2018). Besides, we also noticed that baseline values learned with different initialization settings (zero-init and mean-init) all converged to similar baseline values, except for very few dimensions having multiple local-minimum solutions, which proved the stability of our method. More specifically, an input variable might have different optimal baseline values in real applications.

The loss function on marginal benefits L_marginal is more fine-grained than the loss function on the multi-order Shapley value L_Shapley. We provide proofs for this claim below. First, the Shapley value φ_i can be decomposed into the sum of Shapley values of different orders φ_i^(m), i.e., into marginal benefits Δv_i(S) of different orders, as follows:

φ_i = (1/n) Σ_{m=0}^{n−1} φ_i^(m) = (1/n) Σ_{m=0}^{n−1} E_{S⊆N\{i}, |S|=m} [Δv_i(S)],

where φ_i^(m) denotes the Shapley value of the m-th order and Δv_i(S) = v(x_{S∪{i}}) − v(x_S).

Connection between multi-variate interactions and multi-order marginal benefits. The m-order marginal benefit can be decomposed as the sum of multi-variate interaction benefits. Therefore, high-order causal effects U_S are only contained in high-order marginal benefits.

Connection between multi-order interactions and multi-order Shapley values. The m-order Shapley value can also be decomposed as the sum of interaction benefits. Therefore, high-order causal effects U_S are only contained in high-order Shapley values.



Please see Appendix G.4 for more discussions about the setting of ground-truth baseline values. We used our method to learn baseline values on these functions and tested the accuracy. Note that |b_i − b*_i| ∈ [0, 1] and b*_i ∈ {0, 1}. If |b_i − b*_i| < 0.5, we considered the learned baseline value correct. We set λ = 0.5n in both L_Shapley and L_marginal. The results are reported in the table and are discussed later.



Figure 1: (Left) Previous masking methods may either introduce additional signals, or cannot remove all the old signals. (Right) The inference of the DNN on images masked by these baseline values can be well mimicked by causal patterns.

Figure 2: Causal patterns that explain the inference on a sample in the income dataset.

Remark 2 (proof in Appendix E.3) Let us consider a function with a single causal pattern, f(x_S) = w_S Π_{j∈S} (x_j − δ_j). Accordingly, the ground-truth baseline values of variables are obviously {δ_j}, because setting any variable ∀j ∈ S, x_j = δ_j will deactivate this pattern. Given the correct baseline values b*_j = δ_j, we can use a single causal pattern to regress f(x_S), i.e., U_S = f(x_S) and ∀S' ≠ S, U_{S'} = 0.

Theorem 3 (proof in Appendix E.3) For the function f(x_S) = w_S Π_{j∈S} (x_j − δ_j), if we use m' incorrect baseline values {b'_j | b'_j ≠ δ_j} to replace correct ones to compute causal effects, then the function will be explained to contain at most 2^{m'} causal patterns.

Theorem 4 (proof in Appendix E.3) If we use m' incorrect baseline values to compute causal effects in the function f(x_S) = w_S Π_{j∈S} (x_j − δ_j), a total of C(m', k−|S|+m') causal patterns of the k-th order emerge, k ≥ |S| − m'. A causal pattern of the k-th order means that the causal pattern represents the AND relationship between k variables.

Specifically, Remark 2 and Theorems 3 and 4 provide a new perspective to understand how incorrect baseline values generate new causal patterns. Remark 2 shows how correct baseline values explain a toy model that contains a single causal pattern. Theorems 3 and 4 show that incorrect baseline values will use an exponential number of redundant low-order patterns to explain a single high-order causal pattern. For example, we are given the function f(x) = w (x_p − δ_p)(x_q − δ_q) s.t. x_p = 3, x_q = 4, δ_p = 2, δ_q = 3. If we use the ground-truth baseline values {δ_p, δ_q}, then the function is explained as a single causal pattern Ω = {{p, q}}, which yields the correct Shapley values φ_p = φ_q = 0.5·w, according to Theorem 2. Otherwise, if we use incorrect baseline values {b'_p, b'_q}, redundant low-order patterns (e.g., {p} and {q}) also obtain non-zero effects, and the computed Shapley values are distorted.
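The example above can be checked by counting salient Harsanyi dividends under correct and incorrect baseline values; w = 1 and the incorrect baseline (0, 0) are illustrative assumptions, and the count includes the order-0 (empty-set) pattern, consistent with k ≥ |S| − m'.

```python
import itertools

def subsets(s):
    s = list(s)
    for r in range(len(s) + 1):
        for c in itertools.combinations(s, r):
            yield frozenset(c)

def n_salient_patterns(value, n, eps=1e-9):
    """Count subsets S with a non-negligible Harsanyi dividend U_S."""
    N = frozenset(range(n))
    return sum(1 for S in subsets(N)
               if abs(sum((-1) ** (len(S) - len(L)) * value(L) for L in subsets(S))) > eps)

# f(x) = w (x_p - delta_p)(x_q - delta_q) with w = 1, x = (3, 4), delta = (2, 3)
x, delta = (3.0, 4.0), (2.0, 3.0)
f = lambda z: (z[0] - delta[0]) * (z[1] - delta[1])

def masked_value(baseline):
    return lambda S: f([x[j] if j in S else baseline[j] for j in range(2)])

correct = n_salient_patterns(masked_value(delta), 2)      # single pattern {p, q}
wrong = n_salient_patterns(masked_value((0.0, 0.0)), 2)   # 2^{m'} = 4 patterns emerge
```

With m' = 2 incorrect baseline values, the single second-order pattern is shattered into 2^2 = 4 patterns, matching Theorem 3.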

Figure 3: The learned baseline values (left) and Shapley values computed with different baseline values (right) on the income dataset. Results on the MNIST, the CIFAR-10, and the credit datasets are shown in Appendix G.6 and G.7.

Table 9: Comparison between ground-truth baseline values and incorrect baseline values. The last column shows the ratio of causal patterns of each order, r^(m) = Σ_{S⊆N, |S|=m} |U_S| / Σ_{S⊆N, S≠∅} |U_S|.

Baseline settings compared in Table 9:

f(x) = x_1 x_2 x_3 x_4 x_5, x = [1, 1, 1, 1, 1]: truth b* = [0, 0, 0, 0, 0]; incorrect b^(1) = [0.5, 0.5, 0.5, 0.5, 0.5], b^(2) = [0.1, 0.2, 0.6, 0.0, 0.1], b^(3) = [0.7, 0.1, 0.3, 0.5, 0.1].

f(x) = sigmoid(5 x_1 x_2 x_3 + 5 x_4 − 7.5), x = [1, 1, 1, 1]: truth b* = [0, 0, 0, 0]; incorrect b^(1) = [0.5, 0.5, 0.5, 0.5], b^(2) = [0.6, 0.4, 0.7, 0.3], b^(3) = [0.3, 0.6, 0.5, 0.8].

f(x) = x_1 (x_2 + x_3 − x_4)^3, x = [1, 1, 1, 0]: truth b* = [0, 0, 0, 1]; incorrect b^(1) = [0.5, 0.5, 0.5, 0.5], b^(2) = [0.2, 0.3, 0.6, 0.1], b^(3) = [1.0, 0.3, 1.0, 0.1].

E.3 PROOF OF REMARK 2, THEOREM 3, AND THEOREM 4

Let us consider a function with a single causal pattern, f(x_S) = w_S Π_{j∈S} (x_j − δ_j). Accordingly, the ground-truth baseline values of variables are obviously {δ_j}, because setting any variable ∀j ∈ S, x_j = δ_j will deactivate this pattern. Given the correct baseline values b*_j = δ_j, we can use a single causal pattern to regress f(x_S), i.e., U_S = f(x_S) and ∀S' ≠ S, U_{S'} = 0.

Theorem 3 For the function f(x_S) = w_S Π_{j∈S} (x_j − δ_j), if we use m' incorrect baseline values {b'_j | b'_j ≠ δ_j} to replace correct ones to compute causal effects, then the function will be explained to contain at most 2^{m'} causal patterns.

Theorem 4 If we use m' incorrect baseline values to compute causal effects in the function f(x_S) = w_S Π_{j∈S} (x_j − δ_j), a total of C(m', k−|S|+m') causal patterns of the k-th order emerge, k ≥ |S| − m'. A causal pattern of the k-th order means that the causal pattern represents the AND relationship between k variables.

Figure 4: Histograms of absolute causal effects of causal patterns encoded by the three-layer MLP trained on the income dataset.

These results indicated that salient causal patterns could reflect the information encoded in the input.

Figure 6: The distribution of causal effects U_S in the normal input, and the distribution of causal effects U_S^(noise) in the white-noise input.

[Table 11 rows: functions from (Tsang et al., 2018) with their ground-truth baseline values, e.g., b*_i = 0.999 for i = 9 and b*_i = 0.001 for i ∈ {1, 2, 3, 4, 5, 6, 7, 8}; for x_1 x_2 + 2^{x_3+x_5+x_6} + 2^{x_3+x_4+x_5+x_7} + sin(x_7 sin(x_8 + x_9)) + arccos(0.9 x_10): b*_i = 0.001 for i ∈ {1, 2, 3, 4, 5, 6}; for tanh(x_1 x_2 + x_3 x_4)·|x_5| + exp(x_5 + x_6) + log((x_6 x_7 x_8)^2 + 1) + x_9 x_10 + 1/(1+|x_10|): b*_i = 0.001 for i ∈ {6, 7, 8, 9, 10}; for sinh(x_1 + x_2) + arccos(tanh(x_3 + x_5 + x_7)) + cos(x_4 + x_5) + sec(x_7 x_9): b*_i = 0.999 for i = 3 and b*_i = 0.001 for i ∈ {1, 2, 4}.]

Figure 9: The learned baseline values on the MNIST dataset (better viewed in color).

Shapley values computed using different baseline values. Just like the results on the UCI Census Income dataset, attributions (Shapley values) generated by our learned baseline values were similar to the results of the varying baseline values in SHAP and SAGE. However, the zero/mean baseline values usually generated results conflicting with all other methods.

where the Shapley value of the m-th order is φ_i^(m) = E_{S⊆N\{i}, |S|=m} [v(x_{S∪{i}}) − v(x_S)], and the marginal benefit is defined as Δv_i(S) := v(x_{S∪{i}}) − v(x_S). The overall Shapley value can be written as φ_i = Σ_{S⊆N\{i}} [|S|!(n−1−|S|)!/n!] [v(x_{S∪{i}}) − v(x_S)].
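The decomposition φ_i = (1/n) Σ_m φ_i^(m), with φ_i^(m) the average marginal benefit over contexts of size m, can be sketched on a toy game; the game itself is an illustrative assumption.

```python
import itertools
from fractions import Fraction

n = 3
# toy game: player 0 alone is worth 5; the AND pattern {0, 1} adds 2
v = lambda S: 5 * (0 in S) + 2 * ({0, 1} <= S)

i = 0
rest = [j for j in range(n) if j != i]
phi_m = []
for m in range(n):
    # phi_i^{(m)} = E_{S ⊆ N\{i}, |S| = m} [ v(S∪{i}) - v(S) ]
    deltas = [v(frozenset(S) | {i}) - v(frozenset(S))
              for S in itertools.combinations(rest, m)]
    phi_m.append(Fraction(sum(deltas), len(deltas)))
phi_i = sum(phi_m) / n        # phi_i = (1/n) * sum_m phi_i^{(m)}
```

Here the orders contribute φ_0^(0) = 5, φ_0^(1) = 6, φ_0^(2) = 7, and their average recovers the exact Shapley value φ_0 = 6, i.e., the dummy part 5 plus half of the pattern effect 2.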


Analysis about previous masking methods.

Accuracy of the learned baseline values.

Table 4: Accuracy of Shapley values on the extended Addition-Multiplication dataset using different settings of baseline values.

Table 7: Accuracy of Shapley values computed on the pre-defined decision tree, which was based on the MNIST dataset.

Table 8: The learned baseline values could recover original samples from adversarial examples.

Figure 5: Absolute causal effects of different causal patterns shown in a descending order, which shows that sparse causality is universal for various DNNs.

Table 11: Functions in (Tsang et al., 2018) and their ground-truth baseline values.

ACKNOWLEDGEMENT

This work was partially supported by the National Natural Science Foundation of China (62276165, U19B2043), the National Key R&D Program of China (2021ZD0111602), and the Shanghai Natural Science Foundation (21JC1403800, 21ZR1434600). This work was also partially supported by Huawei Technologies Inc.


Therefore, there is only one causal pattern with a non-zero effect, U_S. In comparison, if we use m' incorrect baseline values {δ'_j}, where Σ_{j∈S} 1_{δ'_j ≠ δ_j} = m', then the function will be explained to contain at most 2^{m'} causal patterns. For the simplicity of notations, let S = {1, 2, ..., m}, and δ'_1 = δ_1 + ε_1, ..., δ'_{m'} = δ_{m'} + ε_{m'}, where ε_1, ..., ε_{m'} ≠ 0. Let T = {1, 2, ..., m'}. In this case, U_{S'} ≠ 0 only if S \ T ⊆ S'. In this way, a total of C(m', k−|S|+m') causal patterns of the k-th order emerge, where the order k of a causal pattern means that the pattern represents the AND relationship between k variables. For example, for an input x with arbitrary non-zero scalars ε_i ≠ 0, we have ∀S' ⊆ T, U_{S'∪{m'+1,...,m}} = ε_1 ε_2 ⋯ ε_m ≠ 0. Besides, if {m'+1, ..., m} ⊄ S', we have U_{S'} = 0. In this way, there are in total 2^{m'} causal patterns in x.

• Experimental verification: We further conducted experiments to show that the incorrect setting of baseline values makes a model/function consisting of high-order causal patterns be mistakenly explained as a mixture of low-order and high-order causal patterns. To show this phenomenon, we compared causal patterns computed using ground-truth baseline values and incorrect baseline values.

Table 10: Examples of synthetic functions and their ground-truth baseline values (∀i ∈ N, x_i ∈ {0, 1}).

As Table 10 shows, the randomly generated functions, whose causal patterns and ground truth of baseline values could be easily determined, were composed of addition, subtraction, multiplication, exponentiation, and sigmoid operations. The ground truth of baseline values in these functions was determined based on the causal patterns between input variables.
In order to represent the absence states of variables, baseline values should activate as few salient patterns as possible, where the activation states of causal patterns were considered the most infrequent states. Thus, we first identified the activation states of the causal patterns of variables, and the ground truth of baseline values was set as values that inactivated causal patterns under different masks. We take the following examples to discuss the setting of ground-truth baseline values (in the following examples, ∀i ∈ N, x_i ∈ {0, 1} and b*_i ∈ {0, 1}).

For a term x_1 x_2 x_3 in f(x): the activation state of this causal pattern is x_1 x_2 x_3 = 1 when ∀i ∈ {1, 2, 3}, x_i = 1. In order to inactivate the causal pattern, we set ∀i ∈ {1, 2, 3}, b*_i = 0.

For a term −x_1 x_2 x_3 in f(x): the activation state of this causal pattern is −x_1 x_2 x_3 = −1 when ∀i ∈ {1, 2, 3}, x_i = 1. In order to inactivate the causal pattern, we set ∀i ∈ {1, 2, 3}, b*_i = 0.

For a term (x_1 + x_2 − x_3)^3 in f(x): the activation state of this causal pattern is (x_1 + x_2 − x_3)^3 = 8 when x_1 = x_2 = 1 and x_3 = 0. In order to inactivate the causal pattern under different masks, we set b*_1 = b*_2 = 0 and b*_3 = 1.

Ground-truth baseline values of functions in (Tsang et al., 2018). This part provides more details about the ground-truth baseline values of the functions proposed in (Tsang et al., 2018). We evaluated the correctness of the learned baseline values using these functions. Among all the 92 input variables in these functions, the ground truth of 61 variables could be determined, as reported in Table 11. Note that some variables cannot be 0 or 1 (e.g., x_8 cannot be zero in the first function), so we set ∀i ∈ N, x_i ∈ {0.001, 0.999} for variables in these functions instead. Similarly, we set the ground truth of baseline values ∀i ∈ N, b*_i ∈ {0.001, 0.999}. Some variables did not collaborate/interact with other variables (e.g., x_4 in the first function), thereby having no causal patterns.
We did not assign ground-truth baseline values to these individual variables, and they were not used for evaluation. Some variables formed more than one causal pattern with other variables and thus had different ground-truth baseline values w.r.t. different patterns. In this case, the collaboration between input variables was complex and hard to analyze, so we did not consider such input variables with conflicting patterns for evaluation, either.
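The rule used in the examples above (each b*_i takes the value opposite to x_i's value in the activation state) can be verified by brute force. The snippet below is our own illustration with hypothetical helper names, not the paper's code: it enumerates every candidate baseline b ∈ {0, 1}^3 for the term (x_1 + x_2 − x_3)^3 and tests whether some masked state still reaches the activation value 8.

```python
from itertools import product

def activation_reachable(term, x_act, b, act_value):
    # Return True if some *masked* state (at least one variable replaced by
    # its baseline value b[i]) still yields the activation value of the term.
    n = len(x_act)
    for keep in product([0, 1], repeat=n):  # 1 = keep x_act[i], 0 = mask with b[i]
        if all(keep):
            continue                         # the unmasked input is not a masked state
        z = [x_act[i] if keep[i] else b[i] for i in range(n)]
        if term(z) == act_value:
            return True
    return False

# Term (x1 + x2 - x3)^3, activated at (x1, x2, x3) = (1, 1, 0) with value 8.
term = lambda z: (z[0] + z[1] - z[2]) ** 3
results = {}
for b in product([0, 1], repeat=3):
    results[b] = not activation_reachable(term, (1, 1, 0), b, 8)

# Only b* = (0, 0, 1) keeps the pattern inactivated under every mask.
print([b for b, ok in results.items() if ok])  # [(0, 0, 1)]
```

The brute-force check confirms the intuition: if any b*_i coincides with the activation-state value of x_i, masking only that variable reproduces the activation state, so the unique valid choice flips every coordinate of the activation state.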

G.8 COMPUTATIONAL COMPLEXITY OF THE PROPOSED LOSS FUNCTIONS

In this section, we introduce both the theoretical complexity of the loss functions and the real cost of approximating them in applications.

In terms of theoretical complexity, the complexity of the loss functions L_Shapley and L_marginal in Eq. (7) is Dn · Σ_{m=0}^{n−1} C(n−1, m) = Dn · 2^{n−1}, because for each of the D input samples, computing the exact Shapley value of each of the n variables requires enumerating all 2^{n−1} contexts.

In terms of the real implementation, as mentioned in Section 5.2, we used the sampling-based approximation (Castro et al., 2009) to compute the loss functions. We randomly sampled K input variables in each epoch, and we randomly sampled T contexts of each order m w.r.t. each variable i to approximate φ^(m)_i(x|b). Thus, the real computational cost of L_Shapley and L_marginal was reduced to DK · Σ_{m=0}^{λ} T, where we set T = 100 in the implementation.

We conducted an experiment to show the time cost of learning the optimal baseline values. We used the MLP, LeNet, and ResNet-20 trained on the income dataset, the credit dataset, the MNIST dataset, and the CIFAR-10 dataset. For the MNIST and CIFAR-10 datasets, we randomly sampled K = 100 pixels as the variable set N to compute the loss L_Shapley and optimize their baseline values. Table 12 reports the time cost of optimizing L_Shapley for one epoch w.r.t. DNNs trained on different datasets. The time cost was measured using PyTorch 1.10 on Ubuntu 20.04, with an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz and one NVIDIA GeForce RTX 3090 GPU.

Furthermore, we find that high-order causal effects are only contained in high-order Shapley values φ^(m)_i and high-order marginal benefits Δv_i(S). Actually, the loss function on marginal
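The sampling-based approximation (Castro et al., 2009) mentioned above can be sketched as a permutation-sampling estimator. This is a minimal illustration under our own naming, not the paper's implementation; in particular, the actual loss additionally stratifies the sampled contexts by order m, which is omitted here.

```python
import random

def shapley_sampling(f, x, baseline, T=100, seed=0):
    # Permutation-sampling Shapley estimator (Castro et al., 2009): average
    # each variable's marginal benefit over T random orderings of variables.
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(T):
        order = list(range(n))
        rng.shuffle(order)
        z = list(baseline)       # start from the fully masked sample
        prev = f(z)
        for i in order:
            z[i] = x[i]          # unmask variable i in the current context
            cur = f(z)
            phi[i] += (cur - prev) / T
            prev = cur
    return phi

# Sanity check on a linear model, whose Shapley values equal its coefficients.
f = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_sampling(f, x=[1, 1, 1], baseline=[0, 0, 0], T=50)
print(phi)  # ~ [2.0, 3.0, -1.0]
```

Each permutation requires n forward evaluations, so T permutations cost O(nT) evaluations per sample, matching the reduction from exponential to sampled cost described above.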

