SEMI-SUPERVISED COUNTERFACTUAL EXPLANATIONS

Abstract

Counterfactual explanations for machine learning models identify minimal interventions on the feature values such that the model changes its prediction to a different or target output. A valid counterfactual explanation should have likely feature values. Here, we address the challenge of generating counterfactual explanations that lie in the same data distribution as the training data and, more importantly, in the distribution of the target class. The first requirement has been addressed by incorporating an auto-encoder reconstruction loss in the counterfactual search process. Connecting the output behavior of the classifier to the latent space of the auto-encoder has further improved both the speed of the counterfactual search and the interpretability of the resulting counterfactual explanations. Continuing this line of research, we show a further improvement in the interpretability of counterfactual explanations when the auto-encoder is trained in a semi-supervised fashion with class-tagged input data. We empirically evaluate our approach on several datasets and show considerable improvement in terms of several metrics.

1. INTRODUCTION

Recently, counterfactual explanations have gained popularity as tools of explainability for AI-enabled systems. A counterfactual explanation of a prediction describes the smallest change to the feature values that changes the prediction to a predefined output. A counterfactual explanation usually takes the form of a statement like, "You were denied a loan because your annual income was 30,000. If your income had been 45,000, you would have been offered a loan." Counterfactual explanations are important in the context of AI-based decision-making systems because they provide data subjects with meaningful explanations for a given decision and with the actions necessary to receive a more favorable or desired decision in the future. The application of counterfactual explanations in financial risk mitigation, medical diagnosis, criminal profiling, and other sensitive socio-economic sectors is increasing and is highly desirable for bias reduction. Apart from the challenges of sparsity, feasibility, and actionability, the primary challenge for counterfactual explanations is their interpretability. Higher levels of interpretability lead to greater trust among data subjects and, in turn, to higher adoption of AI-enabled decision-making systems. AI models used for decision making are typically black-box models, whether because of the computational and mathematical complexity of the model or because of the proprietary nature of the technology. In this paper, we address the challenge of generating counterfactual explanations that are more likely and more interpretable. A counterfactual explanation is interpretable if it lies within, or close to, the model's training data distribution. This problem has been addressed by constraining the search for counterfactuals to lie in the training data distribution, achieved by incorporating an auto-encoder reconstruction loss in the counterfactual search process.
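The search procedure described above can be sketched as minimizing a loss with three terms: a prediction term that pushes the classifier output toward the target, a proximity term that keeps the counterfactual close to the original instance, and an auto-encoder reconstruction term that keeps it in-distribution. The following is a minimal numpy sketch under illustrative assumptions: the classifier is a toy logistic model, the "auto-encoder" is a fixed linear projection standing in for a trained network, and the weights `lam` and `gamma` are arbitrary.

```python
import numpy as np

# Toy black-box classifier: logistic regression on 2 features (illustrative weights).
w, b = np.array([1.5, -2.0]), 0.1
predict = lambda x: 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Stand-in "auto-encoder": a rank-1 linear reconstruction. Points near the line
# spanned by `v` reconstruct well, serving as a proxy for being in-distribution.
v = np.array([1.0, 1.0]) / np.sqrt(2.0)
reconstruct = lambda x: (x @ v) * v

def cf_loss(x_cf, x0, target=1.0, lam=0.5, gamma=1.0):
    pred = (predict(x_cf) - target) ** 2                      # reach the target output
    prox = lam * np.abs(x_cf - x0).sum()                      # stay close to x0 (L1)
    recon = gamma * ((x_cf - reconstruct(x_cf)) ** 2).sum()   # AE reconstruction loss
    return pred + prox + recon

def search_counterfactual(x0, steps=500, lr=0.05, eps=1e-5):
    """Gradient descent with finite-difference gradients (model is a black box)."""
    x = x0.copy()
    for _ in range(steps):
        grad = np.array([
            (cf_loss(x + eps * e, x0) - cf_loss(x - eps * e, x0)) / (2 * eps)
            for e in np.eye(len(x))
        ])
        x -= lr * grad
    return x

x0 = np.array([-1.0, 1.0])          # an instance classified toward class 0
x_cf = search_counterfactual(x0)
```

Without the `recon` term, the search is free to produce off-manifold points that flip the prediction; with it, the counterfactual is pulled toward regions the auto-encoder reconstructs well.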
However, adhering to the training data distribution is not sufficient for the counterfactual explanation to be likely. The counterfactual explanation should also belong to the feature distribution of its target class. To understand this, consider an example of predicting the risk of diabetes in individuals as high or low. A sparse counterfactual explanation to reduce the risk of diabetes might suggest a decrease in the body mass index (BMI) for an individual while leaving all other features unchanged. The model might predict low risk based on this change, and the resulting feature values might still lie in the model's overall data distribution. However, they will not lie in the data distribution of low-risk individuals, because other features relevant to low risk, such as glucose tolerance, serum insulin, and diabetes pedigree, remain unchanged.

