FOOL SHAP WITH STEALTHILY BIASED SAMPLING

Abstract

SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t the background distribution. In the context of fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups while remaining undetected. More precisely, experiments performed on real-world datasets showed that our attack could yield up to a 90% relative decrease in amplitude of the sensitive feature attribution. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.

1. INTRODUCTION

As Machine Learning (ML) becomes increasingly ubiquitous in high-stakes decision contexts (e.g., healthcare, finance, and justice), concerns about its potential to lead to discriminatory models are becoming prominent. The use of auditing toolkits (Adebayo et al., 2016; Saleiro et al., 2018; Bellamy et al., 2018) is becoming popular as a way to avoid the use of unfair models. However, although auditing toolkits can help model designers in promoting fairness, they can also be manipulated to mislead both the end-users and external auditors. For instance, a recent study by Fukuchi et al. (2020) has shown that malicious model designers can produce a benchmark dataset as fake "evidence" of the fairness of the model even though the model itself is unfair. Another approach to assess the fairness of ML systems is to explain their outcome in a post hoc manner (Guidotti et al., 2018). For instance, SHAP (Lundberg & Lee, 2017) has risen in popularity as a means to extract model-agnostic local feature attributions. Feature attributions are meant to convey how much the model relies on certain features to make a decision at some specific input. The use of feature attributions for fairness auditing is desirable for cases where the interest is in the direct impact of the sensitive attributes on the output of the model. One such situation is in the context of causal fairness (Chikahara et al., 2021). In some practical cases, the outputs cannot be independent from the sensitive attribute unless we sacrifice much of the prediction accuracy. For example, any decisions based on physical strength are statistically correlated to gender due to biological nature. The problem in such a situation is not the statistical bias (such as demographic parity), but whether the decision is based on physical strength or gender, i.e. the attributions of each feature.
The focus of this study is on manipulating the feature attributions so that the dependence on sensitive features is hidden and audits are misled into concluding that the model is fair even when it is not. Recently, several studies reported that such a manipulation is possible, e.g. by modifying the black-box model to be explained (Slack et al., 2020; Begley et al., 2020; Dimanov et al., 2020), by manipulating the computation algorithms of feature attributions (Aïvodji et al., 2019), and by poisoning the data distribution (Baniecki et al., 2021; Baniecki & Biecek, 2022). With these findings in mind, the current possible advice to the auditors is not to rely solely on the reported feature attributions for fairness auditing. A question then arises about what "evidence" we can expect in addition to the feature attributions, and whether they can be valid "evidence" of fairness. In this study, we show that we can craft fake "evidence" of fairness for SHAP explanations, which provides the first negative answer to the last question. In particular, we show that we can produce not only manipulated feature attributions but also a benchmark dataset as the fake "evidence" of fairness. The benchmark dataset ensures the external auditors reproduce the reported feature attributions using the existing SHAP library. In our study, we leverage the idea of stealthily biased sampling introduced by Fukuchi et al. (2020) to cherry-pick which data points are included in the benchmark. Moreover, the use of stealthily biased sampling allows us to keep the manipulation undetected by making the distribution of the benchmark sufficiently close to the true data distribution. Figure 1 illustrates the impact of our attack in an explanation scenario with the Adult Income dataset.

Figure 1: Example of our attack on the Adult Income dataset. After the attack, the feature gender moved from the most negative attribution to the 6th, hence hiding some of the model bias.
Our contributions can be summarized as follows:
• Theoretically, we formalize a notion of foreground distribution that can be used to extend Local Shapley Values (LSV) to Global Shapley Values (GSV), which can be used to decompose fairness metrics among the features (Section 2.2). Moreover, we formalize the task of manipulating the GSV as a Minimum Cost Flow (MCF) problem (Section 4).
• Experimentally (Section 5), we illustrate the impact of the proposed manipulation attack on a synthetic dataset and four popular datasets, namely Adult Income, COMPAS, Marketing, and Communities. We observed that the proposed attack can reduce the importance of a sensitive feature while keeping the data manipulation undetected by the audit.

Our results indicate that SHAP explanations are not robust and can be manipulated when it comes to explaining the difference in outcomes between groups. Even worse, our results confirm we can craft a benchmark dataset so that the manipulated feature attributions are reproducible by external audits. Hence, we alert auditors to treat post-hoc explanation methods with skepticism even if they are accompanied by some additional evidence.

2. SHAPLEY VALUES

$\sum_{i=1}^{d} \phi_i(f, x, z) = f(x) - f(z)$. Simply put, the difference between the model prediction at $x$ and the baseline $z$ is shared among the different features. Additional details on the computation of LSV are presented in Appendix B.1.
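The efficiency property above can be checked on a toy model. The following is a minimal sketch, assuming the usual baseline ("interventional") definition of Local Shapley Values, in which features inside a coalition take their value from $x$ and the others from $z$. The names `local_shapley` and `toy_model` are hypothetical and introduced only for illustration; this exhaustive enumeration is not the computation described in Appendix B.1.

```python
from itertools import combinations
from math import factorial


def local_shapley(f, x, z):
    """Baseline Shapley values of f at x relative to the baseline z.

    Assumption: features in a coalition are taken from x, the rest from z.
    Exhaustive enumeration costs O(2^d), so this is only usable for tiny d.
    """
    d = len(x)

    def value(coalition):
        point = [x[k] if k in coalition else z[k] for k in range(d)]
        return f(point)

    phi = [0.0] * d
    for i in range(d):
        others = [k for k in range(d) if k != i]
        for size in range(d):
            # Shapley weight of a coalition of this size (excluding i).
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(p):
    # Hypothetical model with an interaction term, for illustration only.
    return 2.0 * p[0] + p[1] * p[2]


x = [1.0, 2.0, 3.0]
z = [0.0, 1.0, 1.0]
phi = local_shapley(toy_model, x, z)
gap = toy_model(x) - toy_model(z)  # the attributions should sum to this gap
```

Here the three attributions add up to `gap`, which is the property stated in the equation above.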

2.2. GLOBAL SHAPLEY VALUES

LSV are local because they explain the prediction at a specific $x$ and rely on a single baseline input $z$. Since model auditing requires a more global analysis of model behavior, we must understand the predictions at multiple inputs $x \sim \mathcal{F}$ sampled from a distribution $\mathcal{F}$ called the foreground. Moreover, because the choice of baseline is somewhat ambiguous, the baselines $z \sim \mathcal{B}$ are sampled from a distribution $\mathcal{B}$ colloquially referred to as the background. Taking inspiration from Begley et al. (2020), we can compute Global Shapley Values (GSV) by averaging LSV over both foreground and background distributions.

Definition 2.1. $\Phi_i(f, \mathcal{F}, \mathcal{B}) := \mathbb{E}_{x \sim \mathcal{F},\, z \sim \mathcal{B}}\big[\phi_i(f, x, z)\big], \quad i = 1, 2, \ldots, d.$

Proposition 2.1. The GSV have the following property: $\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x \sim \mathcal{F}}[f(x)] - \mathbb{E}_{x \sim \mathcal{B}}[f(x)].$ (3)

2.3. MONTE-CARLO ESTIMATES

In practice, computing expectations w.r.t. the whole background and foreground distributions may be prohibitive, and hence Monte-Carlo estimates are used. For instance, when a dataset is used to represent a background distribution, explainers in the SHAP library such as the ExactExplainer and TreeExplainer will subsample this dataset by selecting 100 instances uniformly at random when the size of the dataset exceeds 100. More formally, let

$\mathcal{C}(S, \omega) := \sum_{x^{(j)} \in S} \omega_j \, \delta(x^{(j)})$ (4)

represent a categorical distribution over a finite set of input examples $S$, where $\delta(\cdot)$ is the Dirac probability measure, $\omega_j \geq 0 \;\forall j$, and $\sum_j \omega_j = 1$. Estimating expectations with Monte-Carlo amounts to sampling $M$ instances

$S_0 \sim \mathcal{F}^M, \quad S_1 \sim \mathcal{B}^M,$ (5)

and computing the plug-in estimate

$\hat{\Phi}(f, S_0, S_1) := \Phi\big(f, \mathcal{C}(S_0, 1/M), \mathcal{C}(S_1, 1/M)\big) = \frac{1}{M^2} \sum_{x^{(i)} \in S_0} \sum_{z^{(j)} \in S_1} \phi(f, x^{(i)}, z^{(j)}).$ (6)

When a set of samples is a singleton (e.g. $S_1 = \{z^{(j)}\}$), we shall use the convention $\hat{\Phi}(f, S_0, \{z^{(j)}\}) = \hat{\Phi}(f, S_0, z^{(j)})$ to improve readability. In Appendix B.2, $\hat{\Phi}(f, S_0, S_1)$ is shown to be a consistent and asymptotically normal estimate of $\Phi(f, \mathcal{F}, \mathcal{B})$, meaning that one can compute approximate confidence intervals around $\hat{\Phi}$ to capture $\Phi$ with high probability. In practice, the estimates $\hat{\Phi}$ are employed as the model explanation, which we see as a vulnerability. As discussed in Section 4, the Monte-Carlo estimation is the key ingredient that allows us to manipulate the GSV in favor of a dishonest entity.
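The plug-in estimate can be illustrated with a short sketch. To keep it self-contained, it assumes a linear model $f(x) = w \cdot x + b$, for which the baseline Shapley value of feature $i$ is known to be $w_i (x_i - z_i)$. The coefficients, sample sizes, and Gaussian samples below are hypothetical, and the helper names are not taken from the SHAP library.

```python
import random


def lsv_linear(w, x, z):
    """LSV of a linear model f(x) = w.x + b: phi_i = w_i * (x_i - z_i)."""
    return [wi * (xi - zi) for wi, xi, zi in zip(w, x, z)]


def gsv_estimate(w, S0, S1):
    """Plug-in estimate: average the LSV over all (foreground, background) pairs."""
    total = [0.0] * len(w)
    for x in S0:
        for z in S1:
            for i, phi_i in enumerate(lsv_linear(w, x, z)):
                total[i] += phi_i
    return [t / (len(S0) * len(S1)) for t in total]


def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


rng = random.Random(0)
w, b = [0.5, -1.0, 2.0], 0.1  # hypothetical model coefficients
M = 50
S0 = [[rng.gauss(0.0, 1.0) for _ in w] for _ in range(M)]  # foreground sample
S1 = [[rng.gauss(1.0, 1.0) for _ in w] for _ in range(M)]  # background sample

gsv_hat = gsv_estimate(w, S0, S1)
mean_f0 = sum(predict(w, b, x) for x in S0) / M
mean_f1 = sum(predict(w, b, z) for z in S1) / M
# sum(gsv_hat) matches mean_f0 - mean_f1 up to floating-point error.
```

The last comment is the sample analogue of Proposition 2.1: the estimated GSV decompose the gap in average predictions between the two samples.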

3. AUDIT SCENARIO

This section introduces an audit scenario to which the proposed attack on SHAP can apply. This scenario involves two parties: a company and an audit. The company has a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{0, 1\}$ that contains $N$ input-target tuples and also has a model $f : \mathcal{X} \to [0, 1]$ that is meant to be deployed in society. The binary feature with index $s$ (i.e. $x_s \in \{0, 1\}$) represents a sensitive feature with respect to which the model should not explicitly discriminate. Both the data $D$ and the model $f$ are highly private, so the company is very careful when providing information about them to the audit. Hence, $f$ is a black box from the point of view of the audit.

Figure 2(a): The data initially provided by the company to the audit is $f(D_0)$ and $f(D_1)$, i.e. the model predictions for all instances in the private dataset for different values of $x_s$. This dataset can later be used by the audit to assess whether or not the subsets $S'_0$, $S'_1$ provided by the company were cherry-picked. (Plot labels: $f(x)$, $\mathbb{E}[f(x) \mid x_s = 0]$, $\mathbb{E}[f(x) \mid x_s = 1]$.)

At first, the audit asks the company for the necessary data to compute fairness metrics, e.g. the Demographic Parity (Dwork et al., 2012), the Predictive Equality (Corbett-Davies et al., 2017), or the Equal Opportunity (Hardt et al., 2016). Note that our attack would apply as long as the fairness metric is a difference in model expectations over subgroups. For simplicity, the audit decides to compute the Demographic Parity

$\mathbb{E}[f(x) \mid x_s = 0] - \mathbb{E}[f(x) \mid x_s = 1],$ (7)

and therefore demands access to the model outputs for all inputs with different values of the sensitive feature: $f(D_0)$ and $f(D_1)$, where $D_0 = \{x^{(i)} : x^{(i)}_s = 0\}$ and $D_1 = \{x^{(i)} : x^{(i)}_s = 1\}$ are subsets of the input data of sizes $N_0$ and $N_1$ respectively. Doing so does not force the company to share values of features other than $x_s$, nor does it require direct access to the inner workings of the proprietary model.
Hence, this demand respects privacy requirements and the company will agree to share the model outputs across all instances, see Figure 2a. At this point, the audit confirms that the model is indeed biased in favor of $x_s = 1$ and calls into question the ability of the company to deploy such a model. Now, the company argues that, although the model exhibits a disparity in outcomes, it does not mean that the model explicitly uses the feature $x_s$ to make its decision. If such is the case, then the disparity could be explained by other features statistically associated with $x_s$. Some of these other features may be acceptable grounds for decisions. To verify such a claim, the audit decides to employ post-hoc techniques to explain the disparity. Since the model is a black box, the audit shall compute the GSV. The foreground $\mathcal{F}$ and background $\mathcal{B}$ are chosen to be the data distributions conditioned on $x_s = 0$ and $x_s = 1$ respectively:

$\mathcal{F} := \mathcal{C}(D_0, 1/N_0), \quad \mathcal{B} := \mathcal{C}(D_1, 1/N_1).$ (8)

According to Equation 3, the resulting GSV will sum up to the demographic parity (cf. Equation 7). If the sensitive feature has a large negative GSV $\Phi_s$, then this would mean that the model is explicitly relying on $x_s$ to make its decisions and the company would be forbidden from deploying the model. If the GSV has a small amplitude, however, the company could still argue in favor of deploying the model in spite of having disparate outcomes. Indeed, the difference in outcomes by the model could be attributed to more acceptable features. See Figure 2b for a toy example illustrating this reasoning. To compute the GSV, the audit demands the two datasets of inputs $D_0$ and $D_1$, as well as the ability to query the black box $f$ at arbitrary points.
Because of privacy concerns on sharing values of $x$ across the whole dataset, and because GSV must be estimated with Monte-Carlo, both parties agree that the company shall only provide subsets $S_0 \subset D_0$ and $S_1 \subset D_1$ of size $M$ to the audit so they can compute a Monte-Carlo estimate $\hat{\Phi}(f, S_0, S_1)$. The company first estimates the GSV on their own by choosing $S_0$, $S_1$ uniformly at random from $\mathcal{F}$ and $\mathcal{B}$ (cf. Equation 5) and observes that $\hat{\Phi}_s$ indeed has a large negative value. They realize they must carefully select which data points will be sent; otherwise, the audit may observe the bias toward $x_s = 1$ and the model will not be deployed. Moreover, the company understands that the audit currently has access to the data $f(D_0)$ and $f(D_1)$ representing the model predictions on the whole dataset (see Figure 2a). Therefore, if the company does not share subsets $S_0$, $S_1$ that were chosen uniformly at random from $D_0$, $D_1$, it is possible for the audit to detect this fraud by doing a statistical test comparing $f(S_0)$ to $f(D_0)$ and $f(S_1)$ to $f(D_1)$. The company needs a method to select misleading subsets $S'_0$, $S'_1$ whose GSV is manipulated in their favor while remaining undetected by the audit. Such a method is the subject of the next section.

4. FOOL SHAP WITH STEALTHILY BIASED SAMPLING

4.1. MANIPULATION

To fool the audit, the company can decide to indeed sub-sample $S'_0$ uniformly at random, $S'_0 \sim \mathcal{F}^M$. Then, given this choice of foreground data, they can repeatedly sub-sample $S'_1 \sim \mathcal{B}^M$ and choose the set $S'_1$ leading to the smallest $|\hat{\Phi}_s(f, S'_0, S'_1)|$. We shall call this method "brute-force". Its issue is that, by sub-sampling $S'_1$ from $\mathcal{B}$, it will take an enormous number of repetitions to reduce the attribution, since the GSV $\hat{\Phi}_s(f, S'_0, S'_1)$ is concentrated on the population GSV $\Phi_s(f, \mathcal{F}, \mathcal{B})$. A more clever method is to re-weight the background distribution before sampling from it, i.e. define $\mathcal{B}'_\omega := \mathcal{C}(D_1, \omega)$ with $\omega \neq 1/N_1$ and then sub-sample $S'_1 \sim {\mathcal{B}'_\omega}^{M}$. To make the model look fairer, the company needs the $\hat{\Phi}_s$ computed with these cherry-picked points to have a small magnitude.

Proposition 4.1. Let $S'_0$ be fixed, and let $\xrightarrow{p}$ represent convergence in probability as the size $M$ of the set $S'_1 \sim {\mathcal{B}'_\omega}^{M}$ increases. We have

$\hat{\Phi}_s(f, S'_0, S'_1) \xrightarrow{p} \sum_{z^{(j)} \in D_1} \omega_j \, \hat{\Phi}_s(f, S'_0, z^{(j)}).$ (9)

We note that the coefficients $\hat{\Phi}_s(f, S'_0, z^{(j)})$ in Equation 9 are tractable and can be computed and stored by the company. We discuss in more detail how to compute them in Appendix B.3. An additional requirement is that the non-uniform distribution $\mathcal{B}'_\omega$ remains similar to the original $\mathcal{B}$. Otherwise, the fraud could be detected by the audit. In this work, the notion of similarity between distributions will be captured by the Wasserstein distance in output space.

Definition 4.1 (Wasserstein Distance). Any probability measure $\pi$ over $D_1 \times D_1$ is called a coupling measure between $\mathcal{B}$ and $\mathcal{B}'_\omega$, denoted $\pi \in \Delta(\mathcal{B}, \mathcal{B}'_\omega)$, if $1/N_1 = \sum_j \pi_{ij}$ and $\omega_j = \sum_i \pi_{ij}$. The Wasserstein distance between $\mathcal{B}$ and $\mathcal{B}'_\omega$ mapped to the output space is defined as

$\mathcal{W}(\mathcal{B}, \mathcal{B}'_\omega) = \min_{\pi \in \Delta(\mathcal{B}, \mathcal{B}'_\omega)} \sum_{i,j} |f(z^{(i)}) - f(z^{(j)})| \, \pi_{ij},$

a.k.a. the cost of the optimal transport plan that distributes the mass from one distribution to the other.
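Because the model outputs are scalars, the Wasserstein distance in output space can be evaluated without solving a transport problem: in one dimension, the optimal transport cost equals the area between the two cumulative distribution functions. The sketch below relies on that standard fact. The function name `wasserstein_1d`, the outputs, and the weights are hypothetical illustrations.

```python
def wasserstein_1d(values, p, q):
    """W1 distance between two weightings p and q of the same 1-D support.

    In one dimension the optimal transport cost equals the area between
    the two cumulative distribution functions.
    """
    order = sorted(range(len(values)), key=lambda k: values[k])
    cost, cdf_gap = 0.0, 0.0
    for a, b in zip(order, order[1:]):
        cdf_gap += p[a] - q[a]                       # CDF difference on [values[a], values[b])
        cost += abs(cdf_gap) * (values[b] - values[a])
    return cost


outputs = [0.1, 0.4, 0.5, 0.9]      # hypothetical f(z^(j)) for z^(j) in D_1
uniform = [0.25] * 4                # B  = C(D_1, 1/N_1)
omega = [0.40, 0.25, 0.25, 0.10]    # B' = C(D_1, omega), illustrative weights
dist = wasserstein_1d(outputs, uniform, omega)
```

In this example the reweighting moves a mass of 0.15 from the output 0.9 to the output 0.1, so the distance is 0.15 × 0.8 = 0.12.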
We propose Algorithm 1 to compute the weights $\omega$ by minimizing the magnitude of the GSV while maintaining a small Wasserstein distance. The trade-off between attribution manipulation and proximity to the data is tuned via a hyper-parameter $\lambda > 0$. We show in Appendix A.2 that the optimization problem at line 5 of Algorithm 1 can be reformulated as a Minimum Cost Flow (MCF) and hence can be solved in polynomial time (more precisely $\tilde{O}(N_1^{2.5})$ as in Fukuchi et al. (2020)).
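The objective of Algorithm 1 can be explored on small instances without a flow solver. Substituting $\omega_j = \sum_i \pi_{ij}$ into the objective gives $\sum_{i,j} \pi_{ij} \big(\beta \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda |f(z^{(i)}) - f(z^{(j)})|\big)$ with the only remaining constraints $\pi_{ij} \geq 0$ and $\sum_j \pi_{ij} = 1/N_1$, so each source point $i$ can send its mass to its cheapest target. The sketch below uses this observation, which is ours and is not the MCF solver described in the paper. The function name and all numbers are hypothetical.

```python
def compute_weights(phi, outputs, lam):
    """Weights omega minimising beta * sum_j omega_j * phi_j + lam * W(B, B'_omega).

    phi[j]     : coefficient Phi_hat_s(f, S'_0, z^(j))
    outputs[j] : model output f(z^(j))
    Each source point i holds mass 1/N_1 and ships all of it to the target j
    with the smallest cost beta * phi[j] + lam * |f(z^(i)) - f(z^(j))|.
    This costs O(N_1^2) and is an illustrative shortcut, not the MCF solver.
    """
    n = len(phi)
    beta = 1.0 if sum(phi) >= 0 else -1.0
    omega = [0.0] * n
    for i in range(n):
        best = min(
            range(n),
            key=lambda j: beta * phi[j] + lam * abs(outputs[i] - outputs[j]),
        )
        omega[best] += 1.0 / n
    return omega


def attribution(omega, phi):
    """Limit of the manipulated estimate given by Proposition 4.1."""
    return sum(w * p for w, p in zip(omega, phi))


phi = [-0.30, -0.10, -0.20, 0.05]   # hypothetical coefficients
outputs = [0.2, 0.5, 0.6, 0.9]      # hypothetical model outputs on D_1

omega_free = compute_weights(phi, outputs, 0.0)     # no proximity penalty
omega_mid = compute_weights(phi, outputs, 0.5)      # intermediate trade-off
omega_tight = compute_weights(phi, outputs, 100.0)  # stays at the uniform weights
```

With these numbers, a large $\lambda$ leaves the background uniform, while the intermediate value shrinks the magnitude of the attribution from 0.1375 to 0.025.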

4.2. DETECTION

We now discuss ways the audit can detect manipulation of the sampling procedure. Recall that the audit has previously been given access to $f(D_0)$, $f(D_1)$ representing the model outputs across all instances in the private dataset. The audit will then be given sub-samples $S'_0$, $S'_1$ of $D_0$, $D_1$ on which they can compute the output of the model and compare with $f(D_0)$, $f(D_1)$. To assess whether or not the sub-samples provided by the company were sampled uniformly at random, the audit has to conduct statistical tests. The null hypothesis of these tests will be that $S'_0$, $S'_1$ were sampled uniformly at random from $D_0$, $D_1$. The detection Algorithm 2 with significance $\alpha$ uses both the Kolmogorov-Smirnov and Wald tests with Bonferroni corrections (i.e. the $\alpha/4$ terms in the Algorithm). The Kolmogorov-Smirnov and Wald tests are discussed in more detail in Appendix C.

Algorithm 1 Compute non-uniform weights

1: procedure COMPUTE_WEIGHTS($D_1$, $\{\hat{\Phi}_s(f, S'_0, z^{(j)})\}_j$, $\lambda$)
2:     $\beta := \mathrm{sign}\big[\sum_{z^{(j)} \in D_1} \hat{\Phi}_s(f, S'_0, z^{(j)})\big]$
3:     $\mathcal{B} := \mathcal{C}(D_1, 1/N_1)$    ▷ Unmanipulated background
4:     $\mathcal{B}'_\omega := \mathcal{C}(D_1, \omega)$    ▷ Manipulated background as a function of $\omega$
5:     $\omega = \arg\min_\omega \; \beta \sum_{z^{(j)} \in D_1} \omega_j \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \mathcal{W}(\mathcal{B}, \mathcal{B}'_\omega)$    ▷ Optimization problem
6:     return $\omega$

Algorithm 2 Detection with significance $\alpha$

1: procedure DETECT_FRAUD($f(D_0)$, $f(D_1)$, $f(S'_0)$, $f(S'_1)$, $\alpha$, $M$)
2:     for $i = 0, 1$ do
3:         $f(S_i) \sim \mathcal{C}(f(D_i), 1/N_i)^M$    ▷ Subsample without cheating
4:         p-value-KS = KS($f(S_i)$, $f(S'_i)$)    ▷ KS test comparing $f(S_i)$ and $f(S'_i)$
5:         p-value-Wald = Wald($f(S'_i)$, $f(D_i)$)    ▷ Wald test
6:         if p-value-KS $< \alpha/4$ or p-value-Wald $< \alpha/4$ then    ▷ Reject the null hypothesis
7:             return 1
8:     return 0
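The detection procedure can be sketched with the standard library only. Since Appendix C is not reproduced here, the two tests below are assumptions: the KS test uses the classical asymptotic two-sample critical value instead of a p-value, and the Wald test is a z-test on the mean of $f(S'_i)$ using the mean and standard deviation of $f(D_i)$. The helper names and the synthetic outputs are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup |ECDF_a - ECDF_b|."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    worst = 0.0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        while i < len(a) and a[i] <= t:
            i += 1
        while j < len(b) and b[j] <= t:
            j += 1
        worst = max(worst, abs(i / len(a) - j / len(b)))
    return worst


def detect_fraud(f_D0, f_D1, f_S0, f_S1, alpha, rng):
    """Return 1 if either provided subset looks non-uniform, else 0."""
    level = alpha / 4  # Bonferroni correction: 2 groups x 2 tests
    for f_D, f_S in ((f_D0, f_S0), (f_D1, f_S1)):
        m = len(f_S)
        honest = rng.choices(f_D, k=m)  # subsample without cheating
        # KS test via the asymptotic critical value for two samples of size m.
        d_stat = ks_statistic(honest, f_S)
        d_crit = math.sqrt(-math.log(level / 2) / 2) * math.sqrt(2 / m)
        # Wald-type test on the mean (assumed form, see the lead-in).
        z = (mean(f_S) - mean(f_D)) / (pstdev(f_D) / math.sqrt(m))
        p_wald = 2 * (1 - NormalDist().cdf(abs(z)))
        if d_stat > d_crit or p_wald < level:
            return 1
    return 0


rng = random.Random(0)
f_D = [rng.random() for _ in range(5000)]  # hypothetical model outputs
fair = rng.choices(f_D, k=200)             # honest subsample
biased = sorted(f_D)[-200:]                # blatant cherry-picking
flag = detect_fraud(f_D, f_D, fair, biased, 0.05, rng)
```

The blatantly cherry-picked subset is flagged here; the point of the attack in Section 4.1 is to find weights for which such tests stay silent.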

4.3. WHOLE PROCEDURE

The procedure returning the subsets $S'_0$, $S'_1$ is presented in Algorithm 3. It conducts a log-space search between $\lambda_{\min}$ and $\lambda_{\max}$ for the $\lambda$ hyper-parameter (line 6) in order to explore the possible attacks. For each value of $\lambda$, the attacker runs Algorithm 1 to obtain $\mathcal{B}'_\omega$ (line 7), then repeatedly samples $S'_1 \sim {\mathcal{B}'_\omega}^{M}$ (line 10) and attempts to detect the fraud (line 11). The attacker will choose the $\mathcal{B}'_\omega$ that minimizes the magnitude of $\hat{\Phi}_s$ while having a detection rate below some threshold $\tau$ (line 12). An example of search over $\lambda$ on a real-world dataset is presented in Figure 3.

Algorithm 3 Fool SHAP

1: procedure FOOL_SHAP($f$, $D_0$, $D_1$, $M$, $\lambda_{\min}$, $\lambda_{\max}$, $\tau$, $\alpha$)
2:     $S'_0 \sim \mathcal{C}(D_0, 1/N_0)^M$    ▷ $S'_0$ is sampled without cheating
3:     Compute $\hat{\Phi}_s(f, S'_0, z^{(j)}) \;\forall z^{(j)} \in D_1$    ▷ cf. Section B.3
4:     $\mathcal{B}^\star = \mathcal{C}(D_1, 1/N_1)$
5:     $\Phi^\star_s = \frac{1}{N_1} \sum_{z^{(j)} \in D_1} \hat{\Phi}_s(f, S'_0, z^{(j)})$    ▷ Initialize the solution
6:     for $\lambda = \lambda_{\max}, \ldots, \lambda_{\min}$ do
7:         $\omega$ = COMPUTE_WEIGHTS($D_1$, $\{\hat{\Phi}_s(f, S'_0, z^{(j)})\}_j$, $\lambda$)
8:         Detection = 0
9:         for rep $= 1, \ldots, 100$ do    ▷ Detect the manipulation
10:            $S'_1 \sim {\mathcal{B}'_\omega}^{M}$
11:            Detection += DETECT_FRAUD($f(D_0)$, $f(D_1)$, $f(S'_0)$, $f(S'_1)$, $\alpha$, $M$)
12:        if $|\sum_{z^{(j)} \in D_1} \omega_j \hat{\Phi}_s(f, S'_0, z^{(j)})| < |\Phi^\star_s|$ and Detection $< 100\tau$ then
13:            $\mathcal{B}^\star = \mathcal{B}'_\omega$
14:            $\Phi^\star_s = \sum_{z^{(j)} \in D_1} \omega_j \hat{\Phi}_s(f, S'_0, z^{(j)})$    ▷ Update the solution
15:    $S'_1 \sim {\mathcal{B}^\star}^{M}$    ▷ Cherry-pick by sampling from the non-uniform background
16:    return $S'_0$, $S'_1$

One limitation of Fool SHAP is that it manipulates a single sensitive feature. In Appendix E.4, we present a possible extension of Algorithm 1 to handle multiple sensitive features and present preliminary results of its effectiveness. A second limitation is that it only applies to "interventional" Shapley values which break feature correlations. This choice was made because most methods in the SHAP library are "interventional".
Future work should port Fool SHAP to "observational" Shapley values that use conditional expectations to remove features (Frye et al., 2020).

Figure 3: Example of search over $\lambda$ on a real-world dataset. The goal here is to reduce the amplitude of the sensitive feature (red curve).

4.4. CONTRIBUTIONS

The first technique to fool SHAP with perturbations of the background distribution was a genetic algorithm (Baniecki & Biecek, 2022). Although promising, the cross-over and mutation operations it employs to perturb data do not take into account feature correlations and can therefore generate unrealistic data. Moreover, the objective to minimize does not enforce similarity between the original and manipulated backgrounds. We show in Appendix E.3 that these limitations lead to systematic fraud detections. Hence, our contributions are two-fold. First, by perturbing the background via non-uniform weights over pre-existing instances (i.e. $\mathcal{B}'_\omega := \mathcal{C}(D_1, \omega)$) rather than a genetic algorithm, we avoid the issue of non-realistic data. Second, by considering the Wasserstein distance, we can control the similarity between the original and fake backgrounds. Since the Stealthily Biased Sampling technique introduced in Fukuchi et al. (2020) also leverages a non-uniform distribution over data points and the Wasserstein distance, it makes sense to adapt it to fool SHAP. Still, the approach of Fukuchi et al. is different from ours. Indeed, in their work, they minimize the Wasserstein distance while enforcing a hard constraint on the number of instances that land on the different bins for the target and sensitive feature. That way, they can set the Demographic Parity to any given value while staying close to the original data. In our setting of manipulating the model explanation, we leave the Demographic Parity intact and instead manipulate its feature attribution. In terms of the optimization objective, we now minimize a Shapley value with a soft constraint on the Wasserstein distance.

5.1. TOY EXPERIMENT

The task is predicting which individual will be hired for a job that requires carrying heavy objects. The causal graph for this toy data is presented in Figure 4 (Left). We observe that sex ($S$) influences height ($H$), and that both these features influence the Muscular Mass ($M$). In the end, the hiring decisions ($Y$) are only based on the two attributes relevant to the job: $H$ and $M$. Also, two noise features $N1$, $N2$ were added. More details and justifications for this causal graph are discussed in Appendix D.1. Since strength and height (two important qualifications for applicants) are correlated with sex, any model $f$ that fits the data will exhibit some disparity in hiring rates between sexes. However, if the model decisions do not rely strongly on feature $S$, the company can argue in favor of deployment. GSV are used by the audit to measure the amount by which the model relies on the sex feature, see Figure 4 (Middle). By employing Fool SHAP with $M = 100$, the company can reduce the GSV of feature $S$ considerably compared to the brute-force and genetic algorithms. More importantly, the audit is not able to detect that the provided samples $S'_0$, $S'_1$ were cherry-picked, see Figure 4 (Right). More results are presented in Appendix E.1.

Figure 4 (caption truncated in the source): ... $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$. Here the audit cannot detect the fraud using their detection algorithm.

5.2. DATASETS

We consider four standard datasets from the FAccT literature, namely COMPAS, Adult-Income, Marketing, and Communities.
• COMPAS gathers 6,150 records from criminal offenders in Florida collected from 2013-2014. This binary classification task consists in predicting who will re-offend within two years. The sensitive feature $s$ is race, with values $x_s = 0$ for African-American and $x_s = 1$ for Caucasian.
• Adult Income contains demographic attributes of 48,842 individuals from the 1994 U.S. census. It is a binary classification problem with the goal of predicting whether or not a particular person makes more than 50K USD per year. The sensitive feature $s$ in this dataset is gender, which takes values $x_s = 0$ for female and $x_s = 1$ for male.
• Marketing involves information on 41,175 customers of a Portuguese bank and the binary classification task is to predict who will subscribe to a term deposit.

Detector calibration refers to the assessment that, assuming the null hypothesis to be true, the probability of rejecting it (i.e. a false positive) should be bounded by the significance level $\alpha$. Remember that the null hypothesis of the audit detector is that the sets $S'_0$, $S'_1$ provided by the company are sampled uniformly from $D_0$, $D_1$. Hence, to test the detector, the audit can sample their own subsets $f(S_0)$, $f(S_1)$ uniformly at random from $f(D_0)$, $f(D_1)$, run the detection algorithm, and count the number of detections over 1000 repeats. Table 1 shows the false positive rates over the five train-test splits using a significance level $\alpha = 5\%$. We observe that the false positive rates are indeed bounded by $\alpha$ for all model types and datasets, implying that the detector employed by the audit is calibrated.

2020), we compute the manipulated weights multiple times using 5 bootstrap sub-samples of $D_1$ of size 2000 to obtain a set of weights $\omega^{[1]}, \omega^{[2]}, \ldots, \omega^{[5]}$ which we average to obtain the final weights $\omega$.

5.4. ATTACK RESULTS AND DISCUSSION

Results of 46 attacks with $M = 200$ are shown in Figure 5. Specific examples of the conducted attacks are presented in Appendix E.2. As a point of reference, we also show results for the brute-force and genetic algorithms. To make comparisons to our attack more meaningful, the brute-force method was only allowed to run for the same amount of time it took to search for the non-uniform weights $\omega$ (about 30-180 seconds). Also, the genetic algorithm ran for 400 iterations and was stopped early if there were 10 consecutive detections. We note that, across all datasets, Fool SHAP leads to greater reductions of the sensitive feature attribution compared to brute-force search and the genetic perturbations of the background. Now focusing on Fool SHAP, for the datasets COMPAS and Marketing, we observe median reductions in amplitudes of about 90%. This means that our attack can considerably reduce the apparent importance of the sensitive attribute. For the Adult and Communities datasets, the median reduction in amplitude is about 50%, meaning that we typically reduce by half the importance of the sensitive feature. Still, looking at the maximum reduction in amplitude for Adult-Income and Communities, we note that one attack managed to reduce the amplitude by 90%. Therefore, luck can play a part in the degree of success of Fool SHAP, which is to be expected from data-driven attacks. Finally, the audit was consistently unable to detect the fraud using statistical tests. This observation raises concerns about the risk that SHAP explanations can be attacked to return not only manipulated attributions but also non-detectable fake evidence of fairness.

6. CONCLUSION

To conclude, we proposed a novel attack on Shapley values that does not require modifying the model but rather manipulates the sampling procedure that estimates expectations w.r.t the background distribution. We show on a toy example and four fairness datasets that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups using SHAP. Crucially, the sampling manipulation is hard to detect by an audit that is given limited access to the data and model. These results raise concerns about the viability of using Shapley values to assess model fairness. We leave as future work the use of Shapley values to decompose other fairness metrics such as predictive equality and equal opportunity. Moreover, we wish to move to use cases beyond fairness, as we believe that the vulnerability of Shapley values that was demonstrated can apply to many other properties such as safety and security.

7. ETHICS STATEMENT

The main objective of this work is to raise awareness about the risk of manipulation of SHAP explanations and their undetectability. As such, it aims at exposing the potential negative societal impacts of relying on such explanations. It remains possible that malicious model producers could use this attack to mislead end users or cheat during an audit. However, we believe this paper makes a significant step toward increasing the vigilance of the community and fostering the development of trustworthy explanation methods. Furthermore, by showing how fairness can be manipulated in explanation contexts, this work contributes to the research on the certification of the fairness of automated decision-making systems.

8. REPRODUCIBILITY STATEMENT

The source code of all our experiments is available online. Moreover, experimental details are provided in Appendix D.2 for the interested reader.

A PROOFS

A.1 PROOFS FOR GLOBAL SHAPLEY VALUES (GSV)

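Proposition A.1 (Proposition 2.1). The GSV have the following property:

$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x \sim \mathcal{F}}[f(x)] - \mathbb{E}_{x \sim \mathcal{B}}[f(x)].$

Proof. As a reminder, we have defined the vector $\Phi(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x \sim \mathcal{F},\, z \sim \mathcal{B}}[\phi(f, x, z)]$, whose components sum up to

$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \sum_{i=1}^{d} \mathbb{E}_{x \sim \mathcal{F},\, z \sim \mathcal{B}}[\phi_i(f, x, z)]$ (13)
$= \mathbb{E}_{x \sim \mathcal{F},\, z \sim \mathcal{B}}\big[\sum_{i=1}^{d} \phi_i(f, x, z)\big]$ (14)
$= \mathbb{E}_{x \sim \mathcal{F},\, z \sim \mathcal{B}}[f(x) - f(z)]$ (15)
$= \mathbb{E}_{x \sim \mathcal{F}}[f(x)] - \mathbb{E}_{z \sim \mathcal{B}}[f(z)]$ (16)
$= \mathbb{E}_{x \sim \mathcal{F}}[f(x)] - \mathbb{E}_{x \sim \mathcal{B}}[f(x)],$ (17)

where at the last step we have simply renamed a dummy variable.

Proposition A.2 (Proposition 4.1). Let $S'_0$ be fixed, and let $\xrightarrow{p}$ represent convergence in probability as the size $M$ of the set $S'_1 \sim \mathcal{B}'^{M}$ increases. Then we have

$\hat{\Phi}_s(f, S'_0, S'_1) \xrightarrow{p} \sum_{j=1}^{N_1} \omega_j \, \hat{\Phi}_s(f, S'_0, z^{(j)}).$

Proof.

$\hat{\Phi}(f, S'_0, S'_1) = \frac{1}{M^2} \sum_{x^{(i)} \in S'_0} \sum_{z^{(j)} \in S'_1} \phi(f, x^{(i)}, z^{(j)}) = \frac{1}{M} \sum_{z^{(j)} \in S'_1} \Big( \frac{1}{M} \sum_{x^{(i)} \in S'_0} \phi(f, x^{(i)}, z^{(j)}) \Big) = \frac{1}{M} \sum_{z^{(j)} \in S'_1} \hat{\Phi}(f, S'_0, z^{(j)}).$

Since $S'_0$ is assumed to be fixed, the only random variable in $\hat{\Phi}_s(f, S'_0, z^{(j)})$ is $z^{(j)}$, which represents an instance sampled from $\mathcal{B}'$. Therefore, we can define $\psi(z) := \hat{\Phi}_s(f, S'_0, z)$ and we get

$\hat{\Phi}_s(f, S'_0, S'_1) = \frac{1}{M} \sum_{z^{(j)} \in S'_1} \hat{\Phi}_s(f, S'_0, z^{(j)}) = \frac{1}{M} \sum_{z^{(j)} \in S'_1} \psi(z^{(j)})$ with $S'_1 \sim \mathcal{B}'^{M}$.

By the weak law of large numbers, the following holds as $M$ goes to infinity (Wasserman, 2004, Theorem 5.6):

$\frac{1}{M} \sum_{z^{(j)} \in S'_1} \psi(z^{(j)}) \xrightarrow{p} \mathbb{E}_{z \sim \mathcal{B}'}[\psi(z)].$

Now, as a reminder, the manipulated background distribution is $\mathcal{B}' := \mathcal{C}(D_1, \omega)$ with $\omega \neq 1/N_1$. Therefore

$\hat{\Phi}_s(f, S'_0, S'_1) \xrightarrow{p} \mathbb{E}_{z \sim \mathcal{B}'}[\psi(z)] = \mathbb{E}_{z \sim \mathcal{C}(D_1, \omega)}[\psi(z)] = \sum_{j=1}^{N_1} \omega_j \, \psi(z^{(j)}) = \sum_{j=1}^{N_1} \omega_j \, \hat{\Phi}_s(f, S'_0, z^{(j)}),$ (22)

concluding the proof.

Published as a conference paper at ICLR 2023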
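The convergence stated in Proposition 4.1 can be illustrated numerically. The sketch below draws indices from the categorical distribution $\mathcal{C}(D_1, \omega)$ and compares the sample average of $\psi$ with the weighted sum in the limit. The values of `psi` and `omega` are hypothetical and chosen only for illustration.

```python
import random

rng = random.Random(0)
psi = [-0.30, -0.10, -0.20, 0.05, 0.15]  # hypothetical Phi_hat_s(f, S'_0, z^(j))
omega = [0.05, 0.30, 0.05, 0.30, 0.30]   # illustrative non-uniform weights

# Limit predicted by the proposition: sum_j omega_j * psi(z^(j)).
limit = sum(w * p for w, p in zip(omega, psi))


def estimate(M):
    """Average of psi over M i.i.d. draws from the categorical C(D_1, omega)."""
    draws = rng.choices(range(len(psi)), weights=omega, k=M)
    return sum(psi[j] for j in draws) / M


# Absolute error of the Monte-Carlo average for increasing sample sizes.
errors = {M: abs(estimate(M) - limit) for M in (100, 10_000, 200_000)}
```

For large $M$ the error is small, which is what lets the company predict the manipulated attribution from the weights alone before sampling $S'_1$.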

A.2 PROOFS FOR OPTIMIZATION PROBLEM

A.2.1 TECHNICAL LEMMAS

We provide some technical lemmas that will be essential when proving Theorem A.1. These are presented for completeness and are not intended as contributions by the authors. Let us first write the formal definition of the minimum of a function.

Definition A.1 (Minimum). Given some function $f : D \to \mathbb{R}$, the minimum of $f$ over $D$ (denoted $f^\star$) is defined as follows:

$f^\star = \min_{x \in D} f(x) \iff \exists x^\star \in D \text{ s.t. } f^\star = f(x^\star) \leq f(x) \;\forall x \in D.$

Basically, the notion of minimum coincides with the infimum $\inf f(D)$ (greatest lower bound) when this lower bound is attained for some $x^\star \in D$. By the Extreme Value Theorem, the minimum always exists when $D$ is compact and $f$ is continuous. For the rest of this appendix, we shall only study optimization problems where points of the domain set $D = \{(x, y) : x \in \mathcal{X},\, y \in \mathcal{Y}_x \subset \mathcal{Y}\}$ can be selected by the following procedure:
1. Choose some $x \in \mathcal{X}$.
2. Given the selected $x$, choose some $y \in \mathcal{Y}_x \subset \mathcal{Y}$, where the set $\mathcal{Y}_x$ is non-empty and depends on the value of $x$.

When optimizing functions over these domains, one can optimize in two steps, as highlighted in the following lemma.

Lemma A.1. Given a compact domain $D$ of the form described above and a continuous objective function $f : D \to \mathbb{R}$, the minimum $f^\star$ is attained for some $(x^\star, y^\star)$ and the following holds:

$\min_{(x, y) \in D} f(x, y) = \min_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}_x} f(x, y).$

Proof. Let $\tilde{f}(x) := \inf_{y \in \mathcal{Y}_x} f(x, y)$, which is a well-defined function on $\mathcal{X}$. We can then take its infimum $f^\star = \inf_{x \in \mathcal{X}} \tilde{f}(x)$. But is $f^\star$ an infimum of $f(D)$? By the definition of the infimum,

$f^\star \leq \tilde{f}(x) = \inf_{y \in \mathcal{Y}_x} f(x, y) \leq f(x, y) \quad \forall x \in \mathcal{X},\; \forall y \in \mathcal{Y}_x,$

so that $f^\star$ is a lower bound of $f(D)$. In fact, it is the greatest lower bound possible, so

$\inf_{(x, y) \in D} f(x, y) = \inf_{x \in \mathcal{X}} \inf_{y \in \mathcal{Y}_x} f(x, y).$ (23)

By the Extreme Value Theorem, since $D$ is compact and $f$ is continuous, there exists $(x^\star, y^\star) \in D$ s.t.

$f^\star = \inf_{(x, y) \in D} f(x, y) = \min_{(x, y) \in D} f(x, y) = f(x^\star, y^\star).$
Since the infimum is attained on the left-hand-side of Equation 23, then it must also be attained on the right-hand-side and therefore we can replace all inf with min in Equation 23, leading to the desired result. Applying Lemma A.1 with the function f px, yq :" hpxq `gpyq proves the Lemma. s t j r i apeq " β p Φ s pf, S 1 0 , z pjq q cpeq " 8 f peq " r ω j apeq " λ |f pz piq q ´f pz pjq q| cpeq " 8 f peq " r π i,j apeq " 0 cpeq " 1 f peq " 1 Figure 6 : Graph G on which we solve the MCF. Note that the total amount of flow is d " N 1 and there are N 1 left and right nodes j , r i .

A.2.2 MINIMUM COST FLOWS

Let $G = (V, E)$ be a graph with vertices $v \in V$ and directed edges $e \in E \subset V \times V$, let $c : E \to \mathbb{R}^+$ be a capacity and $a : E \to \mathbb{R}$ be a cost. Moreover, let $s, t \in V$ be two special vertices called the source and the sink respectively, and $d \in \mathbb{R}^+$ be a total flow. The Minimum-Cost Flow (MCF) problem of $G$ consists of finding the flow function $f : E \to \mathbb{R}^+$ that minimizes the total cost

$$\begin{aligned}
\min_{f}\quad & \sum_{e\in E} a(e)\, f(e)\\
\text{s.t.}\quad & 0 \le f(e) \le c(e) \quad \forall e \in E\\
& \sum_{e\in u^+} f(e) - \sum_{e\in u^-} f(e) = \begin{cases} 0 & u \in V \setminus \{s, t\}\\ d & u = s\\ -d & u = t \end{cases}
\end{aligned} \tag{24}$$

where $u^+ := \{(u, v) \in E\}$ and $u^- := \{(v, u) \in E\}$ are the outgoing and incoming edges of $u$. The terminology of flow arises from the constraint that, for vertices that are neither the source nor the sink, the outgoing flow must equal the incoming one, which is reminiscent of conservation laws in fluid dynamics. We shall refer to $f((u, v))$ as the flow from $u$ to $v$.

Now that we have introduced minimum cost flows, let us specify the graph that will be employed to manipulate GSV, see Figure 6. We label the flow going from the source $s$ to one of the left vertices $\ell_j$ as $\tilde{\omega}_j \equiv \omega_j \times N_1$, and the flow going from $\ell_j$ to $r_i$ as $\tilde{\pi}_{i,j} \equiv \pi_{i,j} \times N_1$. The required flow is fixed at $d = N_1$.

Theorem A.1. Solving the MCF of Figure 6 leads to a solution of the linear program in Algorithm 1.

Proof. We begin by showing that the flow conservation constraints in the MCF imply that $\pi$ is a coupling measure (i.e. $\pi \in \Delta(\mathcal{B}, \mathcal{B}'_\omega)$), and $\omega$ is constrained to the probability simplex $\Delta(N_1)$. Applying the conservation law on the left side of the graph leads to the conclusion that the flows entering vertices $\ell_j$ must sum up to $N_1$:

$$\sum_{j=1}^{N_1} \tilde{\omega}_j = N_1.$$

This implies that $\omega$ must be part of the probability simplex. By conservation, the amount of flow that leaves a specific vertex $\ell_j$ must also be $\tilde{\omega}_j$, hence

$$\sum_{i} \tilde{\pi}_{ij} = \tilde{\omega}_j.$$

For any edge outgoing from $r_i$ to the sink $t$, the flow must be exactly 1. This is because we have $N_1$ edges with capacity $c(e) = 1$ going into the sink and the sink must receive an incoming flow of $N_1$. As a consequence of the conservation law on a specific vertex $r_i$, the amount of flow that goes into each $r_i$ is also 1:

$$\sum_{j} \tilde{\pi}_{ij} = 1.$$

Putting everything together, from the conservation laws on $G$, we have that $\omega \in \Delta(N_1)$ and $\pi \in \Delta(\mathcal{B}, \mathcal{B}'_\omega)$.

Now, to make the parallel between the MCF and Algorithm 1, we must use Lemma A.2. Note that $\omega$ is restricted to the probability simplex, while $\pi$ is restricted to be a coupling measure. Importantly, the set of all possible coupling measures $\Delta(\mathcal{B}, \mathcal{B}'_\omega)$ is different for each $\omega$ because $\mathcal{B}'_\omega$ depends on $\omega$. Hence, the domain has the same structure as the ones tackled in Lemma A.2 (where $x \in \mathcal{X}$ becomes $\omega \in \Delta(N_1)$ and $y \in \mathcal{Y}_x$ becomes $\pi \in \Delta(\mathcal{B}, \mathcal{B}'_\omega)$). Also, the set of possible $\omega$ and $\pi$ is a bounded simplex in $\mathbb{R}^{N_1(N_1+1)}$ so it is compact, and the objective function of the MCF is linear, thus continuous. Hence, we can apply Lemma A.2 to the MCF:

$$\begin{aligned}
\min_{f} \sum_{e\in E} f(e)\, a(e) &= \min_{\tilde{\omega},\tilde{\pi}}\ \beta \sum_{j=1}^{N_1} \tilde{\omega}_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \sum_{i,j} \tilde{\pi}_{ij}\, |f(z^{(i)}) - f(z^{(j)})|\\
&= \min_{\tilde{\omega},\tilde{\pi}}\ \frac{N_1}{N_1} \Big( \beta \sum_{j=1}^{N_1} \tilde{\omega}_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \sum_{i,j} \tilde{\pi}_{ij}\, |f(z^{(i)}) - f(z^{(j)})| \Big)\\
&= N_1 \min_{\tilde{\omega},\tilde{\pi}} \Big( \beta \sum_{j=1}^{N_1} \frac{\tilde{\omega}_j}{N_1}\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \sum_{i,j} \frac{\tilde{\pi}_{ij}}{N_1}\, |f(z^{(i)}) - f(z^{(j)})| \Big)\\
&= N_1 \min_{\omega\in\Delta(N_1),\ \pi\in\Delta(\mathcal{B},\mathcal{B}'_\omega)} \Big( \beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \sum_{i,j} \pi_{i,j}\, |f(z^{(i)}) - f(z^{(j)})| \Big)\\
&= N_1 \min_{\omega\in\Delta(N_1),\ \pi\in\Delta(\mathcal{B},\mathcal{B}'_\omega)} \big( h(\omega) + g(\pi) \big)\\
&= N_1 \min_{\omega\in\Delta(N_1)} \Big( h(\omega) + \min_{\pi\in\Delta(\mathcal{B},\mathcal{B}'_\omega)} g(\pi) \Big) \qquad \text{(cf. Lemma A.2)}\\
&= N_1 \min_{\omega\in\Delta(N_1)} \Big( \beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda \min_{\pi\in\Delta(\mathcal{B},\mathcal{B}'_\omega)} \sum_{i,j} \pi_{i,j}\, |f(z^{(i)}) - f(z^{(j)})| \Big)\\
&= N_1 \min_{\omega\in\Delta(N_1)} \Big( \beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda\, \mathcal{W}(\mathcal{B}, \mathcal{B}'_\omega) \Big),
\end{aligned}$$

which (up to a multiplicative constant $N_1$) is a solution of the linear program of Algorithm 1.

B SHAPLEY VALUES

B.1 LOCAL SHAPLEY VALUES (LSV)

We introduce Local Shapley Values (LSV) more formally. First, as explained earlier, Shapley values are based on coalitional game theory where the different features work together toward a common outcome $f(x)$. In a game, the features can either be present or absent, which is simulated by replacing some features with a baseline value $z$.

Definition B.1 (The Replace Function). Let $x$ be an input of interest, $S \subseteq \{1, 2, \ldots, d\}$ be a subset of input features that are considered active, and $z$ be a baseline input, then the replace function $r_S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is defined as

$$r_S(z, x)_i = \begin{cases} x_i & \text{if } i \in S\\ z_i & \text{otherwise.} \end{cases} \tag{25}$$

We note that this function is meant to "activate" the features in $S$. Now, if we let $\pi$ be a random permutation of the $d$ features, and $\pi_i$ denote the set of all features that appear before $i$ in $\pi$, the LSV are computed via

$$\phi_i(f, x, z) := \mathbb{E}_{\pi\sim\Omega}\big[\, f(r_{\pi_i\cup\{i\}}(z, x)) - f(r_{\pi_i}(z, x)) \,\big], \quad i = 1, 2, \ldots, d, \tag{26}$$

where $\Omega$ is the uniform distribution over the $d!$ permutations. Observe that the computation of LSV scales poorly with the number of features $d$, hence model-agnostic computations are only possible with datasets with few features such as COMPAS and Adult-Income. For datasets with a larger number of features, the TreeExplainer algorithm (Lundberg et al., 2020) can be used to compute the LSV (cf. Equation 26) in polynomial time, given that one is explaining a tree-based model.
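The definition above translates directly into a brute-force computation. The following is a minimal sketch, not the SHAP implementation: it enumerates all $d!$ permutations, so it is only usable for a handful of features, and the model `f` and the inputs are hypothetical.

```python
from itertools import permutations


def replace(z, x, S):
    """Replace function r_S(z, x): x_i for active features in S, z_i otherwise."""
    return [x[i] if i in S else z[i] for i in range(len(x))]


def local_shapley_values(f, x, z):
    """Exact LSV phi_i(f, x, z) by averaging over all d! feature orderings."""
    d = len(x)
    phi = [0.0] * d
    perms = list(permutations(range(d)))
    for perm in perms:
        active = set()
        prev = f(replace(z, x, active))  # all features at their baseline value
        for i in perm:
            active.add(i)                # activate feature i after its predecessors
            curr = f(replace(z, x, active))
            phi[i] += curr - prev        # marginal contribution of feature i
            prev = curr
    return [p / len(perms) for p in phi]


# Hypothetical model with one additive term and one interaction term.
def f(v):
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]   # input of interest
z = [0.0, 0.0, 0.0]   # baseline input
phi = local_shapley_values(f, x, z)
print(phi, sum(phi), f(x) - f(z))
```

The attributions sum to $f(x) - f(z)$, which is the property used in the proof of Proposition A.1; the interaction term is split evenly between the two features involved.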

B.2 CONVERGENCE

As a reminder, we are interested in estimating the GSV $\Phi \equiv \Phi(f, \mathcal{F}, \mathcal{B})$, which requires estimating expectations w.r.t. the foreground and background distributions. Said estimations can be conducted with Monte Carlo, where we sample $M$ instances $S_0 \sim \mathcal{F}^M$, $S_1 \sim \mathcal{B}^M$ and compute the plug-in estimates

$$\hat{\Phi}(f, S_0, S_1) := \Phi(f, \mathcal{C}(S_0, 1/M), \mathcal{C}(S_1, 1/M)) = \frac{1}{M^2} \sum_{x^{(i)}\in S_0} \sum_{z^{(j)}\in S_1} \phi(f, x^{(i)}, z^{(j)}).$$

We now show that $\hat{\Phi}(f, S_0, S_1)$ is a consistent and asymptotically normal estimate of $\Phi(f, \mathcal{F}, \mathcal{B})$.

Proposition B.1. Let $f : \mathcal{X} \to [0, 1]$ be a black box, $\mathcal{F}$ and $\mathcal{B}$ be distributions on $\mathcal{X}$, and $\hat{\Phi} \equiv \hat{\Phi}(f, S_0, S_1)$ be the plug-in estimate of $\Phi \equiv \Phi(f, \mathcal{F}, \mathcal{B})$. The following holds for any $\delta \in\, ]0, 1[$ and $k = 1, 2, \ldots, d$:

$$\lim_{M\to\infty} \mathbb{P}\left( |\hat{\Phi}_k - \Phi_k| \ge \frac{F^{-1}_{\mathcal{N}(0,1)}(1 - \delta/2)}{2\sqrt{M}} \sqrt{\sigma^2_{10} + \sigma^2_{01}} \right) = \delta,$$

where $F^{-1}_{\mathcal{N}(0,1)}$ is the inverse Cumulative Distribution Function (CDF) of the standard normal distribution, $\sigma^2_{10} = \mathbb{V}_{x\sim\mathcal{F}}\big[\mathbb{E}_{z\sim\mathcal{B}}[\phi_k(f, x, z)]\big]$ and $\sigma^2_{01} = \mathbb{V}_{z\sim\mathcal{B}}\big[\mathbb{E}_{x\sim\mathcal{F}}[\phi_k(f, x, z)]\big]$.

Proof. The proof consists simply in noting that the LSV $\phi_k(f, x^{(i)}, z^{(j)})$ are a function of two independent samples $x^{(i)} \sim \mathcal{F}$ and $z^{(j)} \sim \mathcal{B}$. The model $f$ is assumed fixed and hence for any feature $k$ we can define $h(x^{(i)}, z^{(j)}) := \phi_k(f, x^{(i)}, z^{(j)})$. Now, the estimates of GSV can be rewritten

$$\hat{\Phi}_k(f, S_0, S_1) = \frac{1}{|S_0|\,|S_1|} \sum_{x^{(i)}\in S_0} \sum_{z^{(j)}\in S_1} h(x^{(i)}, z^{(j)}),$$

which we recognize as a well-known class of statistics called two-sample U-statistics. Such statistics are unbiased and asymptotically normal estimates of $\Phi_k(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}[h(x, z)]$. The asymptotic normality of two-sample U-statistics is characterized by the following theorem (Lee, 2019, Section 3.7.1).

Theorem B.1. Let $\hat{\Phi}_k \equiv \hat{\Phi}_k(f, S_0, S_1)$ be a two-sample U-statistic with $|S_0| = N$, $|S_1| = M$, and let $h(x, z)$ have finite first and second moments. Then the following holds for any $\delta \in\, ]0, 1[$:

$$\lim_{\substack{N+M\to\infty\\ \text{s.t. } N/(N+M)\to p\in(0,1)}} \mathbb{P}\left( |\hat{\Phi}_k - \Phi_k| \ge \frac{F^{-1}_{\mathcal{N}(0,1)}(1 - \delta/2)}{\sqrt{M+N}} \sqrt{\frac{\sigma^2_{10}}{p} + \frac{\sigma^2_{01}}{1-p}} \right) = \delta,$$

where $\sigma^2_{10} = \mathbb{V}_{x\sim\mathcal{F}}\big[\mathbb{E}_{z\sim\mathcal{B}}[h(x, z)]\big]$ and $\sigma^2_{01} = \mathbb{V}_{z\sim\mathcal{B}}\big[\mathbb{E}_{x\sim\mathcal{F}}[h(x, z)]\big]$.

Proposition B.1 follows from this theorem by choosing $N = M$, $p = 0.5$ and noticing that having a model with bounded outputs ($f : \mathcal{X} \to [0, 1]$) implies that $|h(x, z)| \le 1\ \forall x, z \in \mathcal{X}$, which means that $h(x, z)$ has bounded first and second moments.
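To make the two-sample U-statistic concrete, here is a small sketch using only the standard library. It relies on several assumptions: the kernel `h` is a toy stand-in for $\phi_k$, the variances $\sigma^2_{10}$ and $\sigma^2_{01}$ are replaced by plug-in estimates computed from the same samples, and the interval half-width uses the bound of Theorem B.1 rewritten with $p = N/(N+M)$, i.e. $\sqrt{\sigma^2_{10}/N + \sigma^2_{01}/M}$ times the normal quantile.

```python
import random
from statistics import NormalDist, mean, variance


def u_statistic(h, S0, S1):
    """Two-sample U-statistic: average of the kernel h over all pairs (x, z)."""
    return mean(h(x, z) for x in S0 for z in S1)


def half_width(h, S0, S1, delta=0.05):
    """Asymptotic half-width from Theorem B.1 with plug-in variance estimates."""
    g10 = [mean(h(x, z) for z in S1) for x in S0]  # estimates E_z[h(x, z)] per x
    g01 = [mean(h(x, z) for x in S0) for z in S1]  # estimates E_x[h(x, z)] per z
    s10, s01 = variance(g10), variance(g01)
    q = NormalDist().inv_cdf(1 - delta / 2)
    return q * (s10 / len(S0) + s01 / len(S1)) ** 0.5


# Hypothetical kernel: the LSV of the one-feature model f(x) = x.
def h(x, z):
    return x - z


rng = random.Random(0)
S0 = [rng.uniform(0.5, 1.0) for _ in range(200)]  # foreground sample
S1 = [rng.uniform(0.0, 0.5) for _ in range(200)]  # background sample

estimate = u_statistic(h, S0, S1)
width = half_width(h, S0, S1)
print(f"estimate = {estimate:.3f} +/- {width:.3f} (true value 0.5)")
```

In this toy setting the true value is $0.75 - 0.25 = 0.5$, so the printed estimate should fall close to it, with a half-width that shrinks as the sample sizes grow.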

B.3 COMPUTE THE LSV

Running Algorithm 1 requires computing the coefficients $\hat{\Phi}_s(f, S'_0, z^{(j)})$ for $j = 1, 2, \ldots, N_1$. To compute them, first note that they can be written in terms of LSV for all instances in $S'_0$:

$$\hat{\Phi}_s(f, S'_0, z^{(j)}) = \frac{1}{M} \sum_{x^{(i)}\in S'_0} \phi_s(f, x^{(i)}, z^{(j)}).$$

The LSV $\phi_s(f, x^{(i)}, z^{(j)})$ are computed deep inside the SHAP code and are not directly accessible using the current API. Hence, we had to access them using Monkey-Patching, i.e. we modified the ExactExplainer class so that it stores the LSV as one of its attributes. The attribute can then be accessed as seen in Figure 7. The code is provided as a fork of the SHAP repository. For the TreeExplainer, because its source code is in C++ and wrapped in Python, we found it simpler to rewrite our own version of the algorithm in C++ so that it directly returns the LSV, instead of Monkey-Patching the TreeExplainer.

The toy dataset was constructed to closely match the results of the following empirical study comparing skeletal mass distributions between men and women (Janssen et al., 2000). Firstly, the sex feature was sampled from a Bernoulli distribution, $S \sim \text{Bernoulli}(0.5)$. According to Table 1 of Janssen et al. (2000), the average height of women participants was 163 cm while it was 177 cm for men. Both height distributions had the same standard deviation of 7 cm. Hence we sampled height via

$$H \mid S = \text{man} \sim \mathcal{N}(177, 49), \qquad H \mid S = \text{woman} \sim \mathcal{N}(163, 49).$$

It was noted in Janssen et al. (2000) that there was approximately a linear relationship between height and skeletal muscle mass for both sexes. Therefore, we computed the muscle mass $M$ as

$$M \mid \{H = h, S = \text{man}\} = 0.186\,h + 5\,\epsilon, \qquad M \mid \{H = h, S = \text{woman}\} = 0.128\,h + 4\,\epsilon, \qquad \text{with } \epsilon \sim \mathcal{N}(0, 1).$$

The values of the coefficients 0.186, 0.128 and the noise levels 5 and 4 were chosen so that the distributions of $M \mid S$ would approximately match those of Table 1 in Janssen et al. (2000). Finally, the target was chosen following

$$Y \mid \{H = h, M = m\} \sim \text{Bernoulli}(P(H, M)) \quad \text{with} \quad P(H, M) = \big[\, 1 + \exp\{100 \times \mathbb{1}(H < 160) - 0.3\,(M - 28)\} \,\big]^{-1}.$$

Simply put, being hired in the past ($Y$) was impossible for individuals with a height smaller than 160 cm. Moreover, individuals with a higher skeletal muscle mass were given more chances to be hired. Yet, individuals with less muscle mass could still be given the job if they displayed sufficient determination. In the end, we generated 6000 samples leading to the following disparity in $Y$:

$$\mathbb{P}(Y = 1 \mid S = \text{man}) = 0.733, \qquad \mathbb{P}(Y = 1 \mid S = \text{woman}) = 0.110.$$

The datasets were first divided into train/test subsets with ratio $\tfrac{4}{5} : \tfrac{1}{5}$. The models were trained on the training set and evaluated on the test set. All categorical features for COMPAS, Adult, and Marketing were one-hot-encoded, which resulted in a total of 11, 40, and 61 columns for each dataset respectively. A simple 50-step random search was conducted to fine-tune the hyper-parameters with cross-validation on the training set. The resulting test set performance and demographic parities for all models and datasets, aggregated over 5 random data splits, are reported in Tables 2 and 3 respectively.

We note that beyond 90 iterations, the detector is systematically able to assert that the dataset is manipulated. The smallest value of amplitude that can be reached via the genetic algorithm without being detected is around 0.05. Figures 8 (b), (c) and (d) show the CDFs of $f(S'_1)$ where $S'_1$ is chosen via the genetic algorithm, brute-force, and Fool SHAP respectively. We observe that Fool SHAP is the method where the resulting CDF for $f(S'_1)$ is closest to the CDF for $f(D_1)$. This is why the audit is not able to detect fraud using statistical tests. The fact that Fool SHAP generates fake CDFs that are close to the data CDFs is a consequence of minimizing the Wasserstein distance. These results highlight the superiority of Fool SHAP compared to the brute-force approach and the genetic algorithm.
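The generative process of the toy dataset described earlier in this section (sex, height, muscle mass, hiring outcome) can be sketched with the standard library alone. This is an illustrative reading of the sampling equations, not the authors' script: the function name, the seed, and the interpretation of 5 and 4 as standard deviations of additive Gaussian noise are assumptions.

```python
import math
import random


def sample_individual(rng):
    """Draw one (is_man, height, muscle_mass, hired) tuple from the toy model."""
    is_man = rng.random() < 0.5                          # S ~ Bernoulli(0.5)
    height = rng.gauss(177.0 if is_man else 163.0, 7.0)  # variance 49 = 7^2
    eps = rng.gauss(0.0, 1.0)
    if is_man:
        muscle = 0.186 * height + 5.0 * eps
    else:
        muscle = 0.128 * height + 4.0 * eps
    # P(H, M) = 1 / (1 + exp(100 * 1(H < 160) - 0.3 * (M - 28)))
    logit = 100.0 * (height < 160.0) - 0.3 * (muscle - 28.0)
    p_hire = 1.0 / (1.0 + math.exp(logit))
    hired = rng.random() < p_hire                        # Y ~ Bernoulli(P(H, M))
    return is_man, height, muscle, hired


rng = random.Random(0)
data = [sample_individual(rng) for _ in range(6000)]

men = [row for row in data if row[0]]
women = [row for row in data if not row[0]]
rate_men = sum(row[3] for row in men) / len(men)
rate_women = sum(row[3] for row in women) / len(women)
print(f"P(Y=1 | man) ~ {rate_men:.3f}, P(Y=1 | woman) ~ {rate_women:.3f}")
```

The exact rates depend on the random seed, but the gap between the two groups should be of the same order as the disparity reported in the text, even though sex never enters the hiring probability directly.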

E.2 EXAMPLES OF ATTACKS

In this section, we present 8 specific examples of the attacks that were conducted on COMPAS, Adult, Marketing, and Communities.

Figure 9: Attack of RF fitted on COMPAS. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is race. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

Figure 12: Attack of RF fitted on Adults. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is gender. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

standard deviation across the 5 train/test splits employed in our main experiments. Moreover, window 20 convolutions were used to smooth the curves and make them more readable. On the Marketing and Communities datasets, we see that for both XGB and Random Forest models, the detector is quickly able to assert that the data was manipulated. We suspect the genetic algorithm cannot fool the detector on these two datasets because they contain a large number of features (Marketing has 20, Communities has 98). Such a large number of features could make it harder to perturb samples while staying close to the original data manifold. Since the model behavior is unpredictable outside of the data manifold, it is impossible for the genetic algorithm to guarantee that the CDF of $f(S'_1)$ will be close to the CDF of $f(D_1)$. For Adult-Income, the detection rate appears to be lower, but still, the largest reductions in amplitude of the sensitive feature were about 15%, even after 2.5 hours of run-time.

Contrary to the genetic algorithm, our method Fool SHAP addresses both constraints of making the fake data realistic and keeping it close to the original dataset. Indeed, our objective is tuned to make sure that the Wasserstein distance between the original and perturbed data is small. Moreover, since we do not generate new samples but rather apply non-uniform weights to pre-existing ones, we do not run the risk of generating unrealistic data.

E.4 MULTIPLE SENSITIVE ATTRIBUTES

We present preliminary results for settings where one wishes to manipulate the Shapley values of multiple sensitive features $s \in \mathcal{S}$. For example, in our experiments, we considered gender as a sensitive attribute for the Adult-Income dataset and we showed that one can diminish the attribution of this feature. Nonetheless, there are two other features in Adult-Income that share information with gender: relationship and marital-status. Indeed, relationship can take the value wife and marital-status can take the value widowed, which are both proxies of gender=female. For this reason, these two other features may be considered sensitive and decision-making that relies strongly on them may not be acceptable. Hence, we must derive a method that reduces the total attributions of the features in $\mathcal{S} = \{\text{gender}, \text{relationship}, \text{marital-status}\}$.

We first let $\beta_s := \text{sign}\big[\sum_{z^{(j)}\in D_1} \hat{\Phi}_s(f, S'_0, z^{(j)})\big]$ for any $s \in \mathcal{S}$. In our experiments, all these signs will typically be negative. The proposed approach is to minimize the $\ell_1$ norm

$$\big\|(\hat{\Phi}_s(f, S'_0, S'_1))_{s\in\mathcal{S}}\big\|_1 := \sum_{s\in\mathcal{S}} |\hat{\Phi}_s(f, S'_0, S'_1)|,$$

which we interpret as the total amount of disparity we can attribute to the sensitive attributes. Remember that $\hat{\Phi}_s(f, S'_0, S'_1)$ converges in probability to $\sum_{z^{(j)}\in D_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)})$ (cf. Proposition 4.1). Therefore, minimizing the $\ell_1$ norm will require minimizing

$$\sum_{s\in\mathcal{S}} \beta_s \sum_{z^{(j)}\in D_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)}) = \sum_{z^{(j)}\in D_1} \omega_j \sum_{s\in\mathcal{S}} \beta_s\, \hat{\Phi}_s(f, S'_0, z^{(j)}),$$

which is again a linear function of the weights. We present Algorithm 4 as an overload of Algorithm 1 that now supports taking multiple sensitive attributes as inputs.

Algorithm 4 Compute non-uniform weights for multiple sensitive attributes $s \in \mathcal{S}$
1: procedure COMPUTE_WEIGHTS($D_1$, $\{\hat{\Phi}_s(f, S'_0, z^{(j)})\}_{s,j}$, $\lambda$)
2: $\beta_s := \text{sign}\big[\sum_{z^{(j)}\in D_1} \hat{\Phi}_s(f, S'_0, z^{(j)})\big]\ \forall s \in \mathcal{S}$;
3: $\mathcal{B} := \mathcal{C}(D_1, 1/N_1)$ ▷ Unmanipulated background
4: $\mathcal{B}'_\omega := \mathcal{C}(D_1, \omega)$ ▷ Manipulated background as a function of $\omega$
5: $\omega = \arg\min_{\omega} \sum_{z^{(j)}\in D_1} \omega_j \sum_{s\in\mathcal{S}} \beta_s\, \hat{\Phi}_s(f, S'_0, z^{(j)}) + \lambda\, \mathcal{W}(\mathcal{B}, \mathcal{B}'_\omega)$
6: return $\omega$;

The only difference in the resulting MCF is that we must use the cost $a(e) = \sum_{s\in\mathcal{S}} \beta_s\, \hat{\Phi}_s(f, S'_0, z^{(j)})$ for edges $(s, \ell_j)$ in the graph $G$ of Figure 6. This new algorithm is guaranteed to diminish the $\ell_1$ norm of the attributions of all sensitive features. However, this does not imply that all sensitive attributes will diminish in amplitude. Indeed, minimizing the sum of multiple quantities does not guarantee that each quantity will diminish. For example, $4 + 7$ is smaller than $6 + 6$ although 4 is smaller than 6 and 7 is higher than 6. Still, we see reducing the $\ell_1$ norm as a natural way to hide the total amount of disparity that is attributable to the sensitive features.

Another important methodological change is the way we select the optimal hyper-parameter $\lambda$ in Algorithm 3. Now at line 12, we use the $\ell_1$ norm $\sum_{s\in\mathcal{S}} \big|\sum_{z^{(j)}\in D_1} \omega_j\, \hat{\Phi}_s(f, S'_0, z^{(j)})\big|$ as a selection criterion.

Figures 20 and 21 present preliminary results of attacks on three RFs/XGBs fitted on Adults with different train/test splits. We note that in all cases, before the attack, the three sensitive features had large negative attributions. By applying our method, we can considerably reduce the amplitude of two of the sensitive attributes. The attribution of the remaining sensitive feature remains approximately constant or becomes slightly more negative. We leave it as future work to run large-scale experiments with multiple sensitive features for various datasets.
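The change that the multiple-attribute setting brings to the linear objective can be illustrated without solving the full program. In the hedged sketch below, the coefficient table `phi` is made up, the function names are hypothetical, and only the edge costs $\sum_{s} \beta_s \hat{\Phi}_s(f, S'_0, z^{(j)})$ and the $\ell_1$ selection criterion are computed; the Wasserstein term and the minimum-cost flow solver are left out.

```python
def sign(v):
    """Sign of v as -1, 0 or +1."""
    return (v > 0) - (v < 0)


def combined_costs(phi):
    """phi[s][j] holds the coefficient of sensitive feature s for background point j.

    Returns the signs beta_s and the per-point costs sum_s beta_s * phi[s][j].
    """
    betas = {s: sign(sum(vals)) for s, vals in phi.items()}
    n = len(next(iter(phi.values())))
    costs = [sum(betas[s] * phi[s][j] for s in phi) for j in range(n)]
    return betas, costs


def l1_norm(phi, weights):
    """Selection criterion: sum_s | sum_j omega_j * phi[s][j] |."""
    return sum(abs(sum(w * v for w, v in zip(weights, vals))) for vals in phi.values())


# Hypothetical coefficients for three sensitive features and four background points.
phi = {
    "gender": [-0.20, -0.05, 0.02, -0.10],
    "relationship": [-0.08, 0.01, -0.03, -0.06],
    "marital-status": [-0.04, -0.02, 0.01, -0.05],
}
betas, costs = combined_costs(phi)
uniform = [0.25] * 4
linear_objective = sum(w * c for w, c in zip(uniform, costs))
print(betas, costs, linear_objective, l1_norm(phi, uniform))
```

At the uniform weights the linear objective coincides with the $\ell_1$ norm, since the signs $\beta_s$ are defined from the unweighted sums; for other weights the linear objective is a surrogate whose quality depends on the signs staying unchanged.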



except the TreeExplainer when no background data is provided

https://github.com/gablabc/Fool_SHAP



Two models $f_1$ and $f_2$ (decision boundaries in dashed lines) with perfect accuracy exhibit a disparity in outcomes w.r.t. groups with $x_s < 0$ and $x_s > 0$. Here, $\Phi_s(f_1, \mathcal{F}, \mathcal{B}) = -1$ while $\Phi_s(f_2, \mathcal{F}, \mathcal{B}) = 0$. Hence, $f_2$ is indirectly unfair toward $x_s$ because of correlations in the data.

Figure 2: Illustrations of the audit scenario.

Figure 3: Example of log-space search over values of $\lambda$ using an XGBoost classifier fitted on Adults. (a) The detection rate as a function of the parameter $\lambda$ of the attack. The attacker uses a detection rate threshold $\tau = 10\%$. (b) For each value of $\lambda$, the vertical slice of the 11 curves is the GSV obtained with the resulting $\mathcal{B}'_\omega$. The goal here is to reduce the amplitude of the sensitive feature (red curve).

Figure 4: Toy example. Left: Causal graph. Middle: GSV for the different attacks with $M = 100$. Right: Comparison of the CDF of the Fool SHAP subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$. Here the audit cannot detect the fraud using their detection algorithm.

Given a compact domain $D$ of the form described above and two continuous functions $h : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$, then min

Figure 7: How we extract the LSV from the ExactExplainer via Monkey-Patching.

Figure 8: Panel (b): CDFs for the genetic algorithm. Panel (d): CDFs for Fool SHAP.

Figure 10: Attack of XGB fitted on COMPAS. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is race. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

Figure 11: Attack of XGB fitted on Adults. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is gender. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

Figure 13: Attack of RF fitted on Marketing. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is age. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

Figure 14: Attack of XGB fitted on Marketing. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is age. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.

Figure 17: First two principal components of $D_1$ (Blue) and $S'_1$ (Red) returned by the genetic algorithm on XGB models.

Figure 18: Iterations of the genetic algorithm applied to 5 XGB models per dataset.

ω j p Φ s pf, S 1 0 , z pjq q " ÿ z pjq PD1 ω j ÿ sPS β s p Φ s pf, S 1 0 , z pjq q,

CpD 1 , ωq Ź Manipulated background as a function of ω 5:

Figure 20: Example of log-space search over values of $\lambda$ using RF classifiers fitted on Adults and three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter $\lambda$ of the attack. (Right) For each value of $\lambda$, the vertical slice of the 11 curves is the GSV obtained with the resulting $\mathcal{B}'_\omega$. The goal here is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.

Figure 21: Example of log-space search over values of $\lambda$ using XGB classifiers fitted on Adults and three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter $\lambda$ of the attack. (Right) For each value of $\lambda$, the vertical slice of the 11 curves is the GSV obtained with the resulting $\mathcal{B}'_\omega$. The goal here is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.

Shapley values have a long background in coalitional game theory, where multiple players collaborate toward a common outcome. In the context of explaining model decisions, the players are the input features and the common outcome is the model output $f(x)$. In coalitional games, players (features) are either present or absent. Since one cannot physically remove an input feature once the model has already been fitted, SHAP removes features by replacing them with a baseline value $z$. This leads to the Local Shapley Value (LSV) $\phi_i(f, x, z)$, which respects the so-called efficiency axiom (Lundberg & Lee, 2017)

Values of the test set accuracy and demographic parity for each model type and dataset are presented in Appendix D.2. False Positive Rates (%) of the detector i.e. the frequency at which S 0 , S 1 are considered cherry-picked when they are not. No rate should be above 5%.

The first step of the attack (line 3 of Algorithm 3) requires that the company run SHAP on their own and compute the necessary coefficients to run Algorithm 1. For the COMPAS and Adults datasets, the ExactExplainer of SHAP was used. Since Marketing and Communities contain more than 15 features, and since the ExactExplainer scales exponentially with the number of features, we were restricted to using the TreeExplainer (Lundberg et al., 2020) on these datasets. The TreeExplainer avoids the exponential cost of Shapley values but is only applicable to tree-based models such as RFs and XGBs. Therefore, we could not conduct the attack on MLPs fitted on Marketing and Communities.

The following step is to solve the MCF for various values of $\lambda$ (line 7 of Algorithm 3). As stated previously, solving the MCF can be done in polynomial time in terms of $N_1$, which was tractable for small datasets like COMPAS and Communities, but not for larger datasets like Adult and Marketing. To solve this issue, as was done in Fukuchi et al. (

Models Test Accuracy % (mean ± stddev).

Models Demographic Parity (mean ± stddev).


//github.com/slundberg/shap/blob/

C STATISTICAL TESTS

C.1 KS TEST

A first test that can be conducted is a two-sample Kolmogorov-Smirnov (KS) test (Massey Jr, 1951). We let

$$\hat{F}_S(x) = \frac{1}{|S|} \sum_{z\in S} \mathbb{1}(z \le x)$$

be the empirical CDF of observations in the set $S$. Given two sets $S$ and $S'$, the KS statistic is

$$\text{KS}(S, S') = \sup_{x\in\mathbb{R}} |\hat{F}_S(x) - \hat{F}_{S'}(x)|.$$

Under the null hypothesis $H_0 : S \sim \mathcal{D}^{|S|},\ S' \sim \mathcal{D}^{|S'|}$ for some univariate distribution $\mathcal{D}$, this statistic is expected to not be too large with high probability. Hence, when the company provides the subsets $S'_0$, $S'_1$, the audit can sample their own two subsets $f(S_0)$, $f(S_1)$ uniformly at random from $f(D_0)$, $f(D_1)$ and compute the statistics $\text{KS}(f(S_0), f(S'_0))$ and $\text{KS}(f(S_1), f(S'_1))$ to detect a fraud.
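For reference, the two-sample KS statistic is short to compute by hand. The sketch below assumes the usual definition (largest gap between the two empirical CDFs) and does not include the rejection threshold, which depends on the sample sizes and on the chosen significance level; the function names and sample values are illustrative.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)


def ks_statistic(S, S_prime):
    """Two-sample KS statistic: largest gap between the two empirical CDFs.

    Both CDFs are step functions that only jump at observed values, so it is
    enough to compare them at the pooled sample points.
    """
    points = sorted(set(S) | set(S_prime))
    return max(abs(ecdf(S, x) - ecdf(S_prime, x)) for x in points)


# Hypothetical model outputs: a uniformly drawn subset versus a provided subset.
audit_subset = [0.10, 0.35, 0.40, 0.62, 0.80, 0.91]
provided_subset = [0.05, 0.12, 0.20, 0.33, 0.41, 0.55]
print(ks_statistic(audit_subset, provided_subset))
```

A large value suggests that the provided subset was not drawn uniformly from the same outputs as the auditor's subset.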

C.2 WALD TEST

An alternative is the Wald test, which is based on the central limit theorem. If $S_1 \sim \mathcal{B}^M$, then the empirical average of the model output over $S_1$ is asymptotically normally distributed as $M$ increases:

$$\text{Wald}(f(S_1), f(\mathcal{B})) := \frac{\frac{1}{M}\sum_{z\in f(S_1)} z - \mu}{\sigma / \sqrt{M}} \rightsquigarrow \mathcal{N}(0, 1),$$

where $\mu := \mathbb{E}_{z\sim f(\mathcal{B})}[z]$ and $\sigma^2 := \mathbb{V}_{z\sim f(\mathcal{B})}[z]$ are the expected value and variance of the model output across the whole background. The same reasoning holds for $S_0$ and the foreground $\mathcal{F}$. Applying the Wald test with significance $\alpha$ would detect fraud when

$$|\text{Wald}(f(S_1), f(\mathcal{B}))| > F^{-1}_{\mathcal{N}(0,1)}(1 - \alpha/2),$$

where $F^{-1}_{\mathcal{N}(0,1)}$ is the inverse of the CDF of a standard normal variable.

Figure 15: Attack of XGB fitted on Communities. Left: GSV before and after the attack with $M = 200$. As a reminder, the sensitive attribute is PctWhite>90. Right: Comparison of the CDF of the misleading subsets $f(S'_0)$, $f(S'_1)$ and the CDF over the whole data $f(D_0)$, $f(D_1)$.
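A Wald-style check on the mean model output can be sketched as follows. Two assumptions are made here and should be read as such: the statistic is taken to be the standardized difference between the subset mean and the background mean, and the rejection rule is two-sided with quantile $1 - \alpha/2$. The function names and the synthetic outputs are illustrative.

```python
from statistics import NormalDist, fmean, pstdev


def wald_statistic(subset_outputs, background_outputs):
    """Standardized gap between the subset mean and the background mean."""
    mu = fmean(background_outputs)      # mean output over the whole background
    sigma = pstdev(background_outputs)  # standard deviation over the background
    M = len(subset_outputs)
    return (fmean(subset_outputs) - mu) / (sigma / M ** 0.5)


def wald_detects(subset_outputs, background_outputs, alpha=0.05):
    """Flag the subset when the statistic exceeds the two-sided normal quantile."""
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(wald_statistic(subset_outputs, background_outputs)) > threshold


# Hypothetical model outputs over a background of 1000 instances.
background = [i / 1000 for i in range(1000)]
evenly_spread = background[::10]     # subset that mirrors the background
cherry_picked = background[:100]     # subset containing only the lowest outputs

print(wald_detects(evenly_spread, background))
print(wald_detects(cherry_picked, background))
```

The evenly spread subset passes the check while the cherry-picked one is flagged, which is the behavior an attacker has to avoid when choosing non-uniform weights.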

E.3 GENETIC ALGORITHM

This section motivates the use of stealthily biased sampling to perturb Shapley values in place of the method of Baniecki et al. (2021), which fools SHAP by perturbing the background dataset $S'_1$ via a genetic algorithm. In said genetic algorithm, a population of $P$ fake background datasets {S

Although the use of a genetic algorithm makes the method of Baniecki et al. (2021) very versatile, its main drawback is that there is no constraint on the similarity between the perturbed background and the original one. Moreover, the mutation and cross-over operations ignore the correlations between features and hence the perturbed dataset can contain unrealistic instances. To highlight these issues, Figure 17 presents the first two principal components of $D_1$ and $S'_1$ for the XGB models used in Section 5.4. On COMPAS and Marketing especially, we see that the fake samples $S'_1$ lie in regions outside of the data manifold. For Adult-Income and Marketing, the fake data overlaps more with the original one, but this could be an artifact of only visualizing 2 dimensions.

For a more rigorous analysis of "similarity" between $S'_1$ and $D_1$, we must study the detection rate of the audit detector. To this end, Figures 18 and 19 present the amplitude reduction and the detection rate after a given number of iterations of the genetic algorithm. These curves show the average and

