CAN WE FAITHFULLY REPRESENT MASKED STATES TO COMPUTE SHAPLEY VALUES ON A DNN?

Abstract

Masking some input variables of a deep neural network (DNN) and computing the change of the output on the masked sample is a typical way to compute attributions of input variables. People usually mask an input variable using its baseline value. However, there is no theory to examine whether a baseline value faithfully represents the absence of an input variable, i.e., the removal of all signals from that variable. Fortunately, recent studies (Ren et al., 2023a; Deng et al., 2022a) show that the inference score of a DNN can be strictly disentangled into a set of causal patterns (or concepts) encoded by the DNN. We therefore propose to use causal patterns to examine the faithfulness of baseline values. More crucially, we prove that causal patterns can be explained as the elementary rationale of the Shapley value. Furthermore, we propose a method to learn optimal baseline values, and experimental results demonstrate its effectiveness.

1. INTRODUCTION

Many attribution methods (Zhou et al., 2016; Selvaraju et al., 2017; Lundberg and Lee, 2017; Shrikumar et al., 2017) have been proposed to estimate the attribution (importance) of input variables to the model output, which represents an important direction in explainable AI. In this direction, many studies (Lundberg and Lee, 2017; Ancona et al., 2019; Fong et al., 2019) masked some input variables of a deep neural network (DNN) and used the change of the network output on the masked samples to estimate attributions of input variables. As Fig. 1 shows, there are different types of baseline values to represent the absence of input variables. Theoretically, the trustworthiness of attributions highly depends on whether the baseline value can really remove the signal of an input variable without introducing new out-of-distribution (OOD) features. However, there is no criterion to evaluate the signal removal of masking methods. To this end, we first need to break two kinds of blind faith: the faith that seemingly reasonable baseline values faithfully represent the absence of input variables, and the faith that seemingly OOD baseline values necessarily cause abnormal features. In fact, because a DNN may encode complex inference logic, seemingly OOD baseline values do not necessarily generate OOD features.

Concept/causality-emerging phenomenon. The core challenge of theoretically guaranteeing or examining whether a baseline value removes all or part of the signal of an input variable is to explicitly define the signal/concept/knowledge encoded by a DNN in a countable manner. To this end, Ren et al. (2023a) discovered a counter-intuitive concept-emerging phenomenon in trained DNNs. Although a DNN has no physical unit that encodes explicit causality or concepts, Ren et al. (2023a) and Deng et al. (2022a) surprisingly found that sparse and symbolic concepts emerge once the DNN is sufficiently trained.
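The masking-based attribution scheme discussed above can be sketched in a few lines. This is a minimal single-variable occlusion example, not the paper's method: the toy linear model, the input, and the zero baseline are all placeholder assumptions chosen only to make the mechanism concrete.

```python
import numpy as np

def occlusion_attributions(model, x, baseline):
    """Attribution of variable i = drop in the model output when x[i]
    is replaced by its baseline value (a simple masking-based scheme)."""
    out = model(x)
    attrs = np.zeros(len(x))
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = baseline[i]          # represent the "absence" of variable i
        attrs[i] = out - model(x_masked)   # output change caused by masking
    return attrs

# Toy linear model (hypothetical): f(x) = 2*x0 + 3*x1 - x2
model = lambda x: 2 * x[0] + 3 * x[1] - x[2]
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)                     # zero baseline, one common choice
print(occlusion_attributions(model, x, baseline))  # -> [ 2.  3. -1.]
```

For this linear model the attributions recover each variable's additive effect exactly; the paper's concern is precisely that for a real DNN the result depends on whether the chosen baseline truly removes the variable's signal.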
Thus, we use such concepts as a new perspective to define the optimal baseline value for representing the absence of input variables. As Fig. 1 shows, each concept represents an AND relationship among a specific set S of input variables. The co-appearance of these input variables makes a numerical contribution U_S to the network output. Thus, we can consider such a concept as a causal pattern¹ of the network output.
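To make the AND-pattern contributions U_S concrete, the sketch below computes them via Möbius inversion over a toy game v (following the Harsanyi-dividend formulation used in Ren et al., 2023a) and checks the claimed link to the Shapley value, namely that variable i's Shapley value equals the sum of U_S/|S| over all patterns S containing i. The 3-variable game v is our own illustrative assumption, not from the paper.

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as tuples."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def harsanyi_dividends(v, n):
    """U_S = sum over L subseteq S of (-1)^(|S|-|L|) * v(L) (Mobius inversion)."""
    return {S: sum((-1) ** (len(S) - len(L)) * v(frozenset(L))
                   for L in subsets(S))
            for S in subsets(range(n))}

def shapley_from_dividends(U, n):
    """Shapley value of i = sum over patterns S containing i of U_S / |S|."""
    phi = [0.0] * n
    for S, u in U.items():
        for i in S:
            phi[i] += u / len(S)
    return phi

# Toy game (hypothetical): an AND pattern over {0,1} worth 5, plus a
# single-variable effect of variable 2 worth 2.
def v(L):
    return 5.0 * (0 in L and 1 in L) + 2.0 * (2 in L)

U = harsanyi_dividends(v, 3)
print(U[(0, 1)])                      # dividend of the AND pattern {0,1} -> 5.0
print(shapley_from_dividends(U, 3))   # -> [2.5, 2.5, 2.0]
```

The dividend U_{0,1} = 5 is split evenly between variables 0 and 1, and the Shapley values sum to v(N) = 7, illustrating why these patterns can be read as the elementary rationale of the Shapley value.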



¹ Note that in this paper, the causal pattern means the extracted causal relationship between input variables and the output as encoded by the DNN, rather than the true intrinsic causal relationship hidden in the data.

