BETTER SAMPLING IN EXPLANATION METHODS CAN PREVENT DIESELGATE-LIKE DECEPTION

Abstract

Machine learning models are used in many sensitive areas where, besides predictive accuracy, their comprehensibility is also important. Interpretability of prediction models is necessary to determine their biases and causes of errors, and is a prerequisite for users' confidence. For complex state-of-the-art black-box models, post-hoc, model-independent explanation techniques are an established solution. Popular and effective techniques such as IME, LIME, and SHAP use perturbation of instance features to explain individual predictions. Recently, Slack et al. (2020) put their robustness into question by showing that their outcomes can be manipulated due to the poor perturbation sampling they employ. This weakness would allow dieselgate-type cheating: owners of sensitive models could deceive inspections and hide potentially unethical or illegal biases existing in their predictive models. This could undermine public trust in machine learning models and give rise to legal restrictions on their use. We show that better sampling in these explanation methods prevents malicious manipulations. The proposed sampling uses data generators that learn the training-set distribution and generate new perturbation instances much more similar to the training set. We show that the improved sampling increases the robustness of LIME and SHAP, while the previously untested method IME turns out to be the most robust of all.

1. INTRODUCTION

Machine learning models are used in many areas where, besides predictive performance, their comprehensibility is also important, e.g., in healthcare, the legal domain, banking, insurance, and consultancy. Users in these areas often do not trust a machine learning model if they do not understand why it made a given decision. Some models, such as decision trees, linear regression, and naïve Bayes, are intrinsically easier to understand due to their simple representation. However, the complex models mostly used in practice due to their better accuracy are incomprehensible and behave like black boxes, e.g., neural networks, support vector machines, random forests, and boosting. For these models, the area of explainable artificial intelligence (XAI) has developed post-hoc explanation methods that are model-independent and determine the importance of each feature for the predicted outcome. Frequently used methods of this type are IME (Štrumbelj & Kononenko, 2013), LIME (Ribeiro et al., 2016), and SHAP (Lundberg & Lee, 2017). To determine the features' importance, these methods use perturbation sampling. Slack et al. (2020) recently noticed that the data distribution obtained in this way is significantly different from the original distribution of the training data, as we illustrate in Figure 1a. They showed that this can be a serious weakness of these methods. The possibility to manipulate post-hoc explanation methods is a critical problem for the ML community, as the reliability and robustness of explanation methods are essential for their use and public acceptance. These methods are used to interpret otherwise black-box models, help in debugging models, and reveal models' biases, thereby establishing trust in their behavior. Non-robust explanation methods that can be manipulated can lead to catastrophic consequences, as explanations cannot detect racist, sexist, or otherwise biased models if the model owner wants to hide these biases.
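The distribution mismatch underlying the attack can be illustrated with a minimal sketch. The example below is not the authors' code; it uses synthetic data with two correlated features (a stand-in for real datasets such as COMPAS) and mimics LIME-style perturbation sampling, which adds independent per-feature Gaussian noise and thereby destroys the correlation structure of the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data with two strongly correlated features.
x1 = rng.normal(0.0, 1.0, 1000)
x2 = x1 + rng.normal(0.0, 0.1, 1000)
X = np.column_stack([x1, x2])

# LIME-style perturbation: independent Gaussian noise per feature,
# ignoring the joint distribution of the training data.
perturbed = X + rng.normal(0.0, 1.0, X.shape)

corr_train = np.corrcoef(X.T)[0, 1]
corr_pert = np.corrcoef(perturbed.T)[0, 1]
# The perturbed sample is far less correlated than the training data,
# which makes it easy to detect as off-distribution.
```

Because the perturbed points break such dependencies, a simple classifier can separate them from genuine instances, which is exactly what the attack of Slack et al. (2020) exploits.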
This would enable dieselgate-like cheating, where owners of sensitive prediction models could hide the socially, morally, or legally unacceptable biases present in their models. As the schema of the attack on explanation methods in Figure 1b shows, owners of prediction models could detect when their models are being examined, return unbiased predictions in that case, and return biased predictions in normal use. This could have serious consequences in areas where the reliability and fairness of predictive models are essential, e.g., in healthcare or banking. Such weaknesses can undermine users' trust in machine learning models in general and slow down technological progress. In this work, we modify the main perturbation-based explanation methods to make them more resistant to manipulation attempts. In our solution, the problematic perturbation-based sampling is replaced with more advanced sampling, which uses modern data generators that better capture the distribution of the training dataset. We test three generators: the RBF-network-based generator (Robnik-Šikonja, 2016), the random-forest-based generator available in the R library semiArtificial (Robnik-Šikonja, 2019), as well as a generator using variational autoencoders (Miok et al., 2019). We show that the modified gLIME and gSHAP methods are much more robust than their original versions. For the IME method, which was not previously analyzed, we show that it is already quite robust. We release the modified explanation methods under an open-source license¹. In this work, we use the term robustness of an explanation method as a notion of resilience against adversarial attacks, i.e., the ability of an explanation method to recognize a biased classifier in an adversarial environment. This type of robustness could be more formally defined as the number of instances where the adversarial model's bias is correctly recognized. We focus on robustness concerning the attacks described by Slack et al. (2020).
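The attack described above can be summarized in a few lines of code. This is an illustrative sketch of the scheme from Slack et al. (2020), not their implementation: `is_ood` stands in for whatever out-of-distribution detector the attacker trains, and the constant-output models are placeholders for real biased and unbiased classifiers.

```python
class AdversarialModel:
    """Route off-distribution (explanation-time) queries to an innocuous
    model and genuine queries to the biased one."""

    def __init__(self, biased_model, unbiased_model, is_ood):
        self.biased = biased_model
        self.unbiased = unbiased_model
        self.is_ood = is_ood

    def predict(self, x):
        # Perturbation samples fall outside the data distribution, so the
        # explanation method only ever observes the unbiased model.
        return self.unbiased(x) if self.is_ood(x) else self.biased(x)


# Minimal usage: a toy detector flags points far from the training mean.
train_mean, threshold = 0.0, 3.0
is_ood = lambda x: abs(x - train_mean) > threshold
model = AdversarialModel(lambda x: 1, lambda x: 0, is_ood)
```

Our defense targets the `is_ood` step: if perturbation samples are drawn from a generator that matches the training distribution, the attacker's detector can no longer separate explanation-time queries from genuine ones.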
There are other notions of robustness of explanation methods; e.g., Alvarez-Melis & Jaakkola (2018) define the robustness of explanations in the sense that similar inputs should give rise to similar explanations. The remainder of the paper is organized as follows. In Section 2, we present the necessary background and related work on explanation methods, attacks on them, and data generators. In Section 3, we propose a defense against the described weaknesses of explanation methods, and in Section 4, we empirically evaluate the proposed solution. In Section 5, we draw conclusions and present ideas for further work.

2. BACKGROUND AND RELATED WORK

In this section, we first briefly describe the background on post-hoc explanation methods and attacks on them, followed by data generators and related works on the robustness of explanation methods.

2.1. POST-HOC EXPLANATION METHODS

The current state-of-the-art perturbation-based explanation methods, IME, LIME, and SHAP, explain predictions for individual instances. To form an explanation of a given instance, they measure the difference in prediction between the original instance and its neighboring instances, obtained



¹ https://anonymous.4open.science/r/5d550c62-5c5c-4ee3-81ef-ab96fe0838ca/



Figure 1: a) PCA-based visualization of a part of the COMPAS dataset. The blue points show the original instances, and the red points represent instances generated with the perturbation sampling used in the LIME method. The distributions are notably different. b) The idea of the attack on explanation methods, based on the difference between the distributions. The attacker's adversarial model contains both a biased and an unbiased model. The decision function that is part of the cheating model decides whether an instance is outside the original distribution (i.e., used only for explanation) or an actual instance. In the case of an actual instance, the result of the adversarial model equals the result of the biased model; otherwise, it equals the result of the unbiased model.

