SEMI-SUPERVISED COUNTERFACTUAL EXPLANATIONS

Abstract

Counterfactual explanations for machine learning models identify minimal interventions on the feature values such that the model changes its prediction to a different output or a target output. A valid counterfactual explanation should have likely feature values. Here, we address the challenge of generating counterfactual explanations that lie in the same data distribution as the training data and, more importantly, in the distribution of the target class. The first requirement has been addressed through the incorporation of an auto-encoder reconstruction loss in the counterfactual search process. Connecting the output behavior of the classifier to the latent space of the auto-encoder has further improved the speed of the counterfactual search and the interpretability of the resulting counterfactual explanations. Continuing this line of research, we show that the interpretability of counterfactual explanations improves further when the auto-encoder is trained in a semi-supervised fashion with class-tagged input data. We empirically evaluate our approach on several datasets and show considerable improvement in terms of several metrics.

1. INTRODUCTION

Recently, counterfactual explanations have gained popularity as explainability tools for AI-enabled systems. A counterfactual explanation of a prediction describes the smallest change to the feature values that changes the prediction to a predefined output. It usually takes the form of a statement like: "You were denied a loan because your annual income was 30,000. If your income had been 45,000, you would have been offered a loan." Counterfactual explanations are important in the context of AI-based decision-making systems because they provide data subjects with meaningful explanations for a given decision and with the actions necessary to receive a more favorable or desired decision in the future. Counterfactual explanations are increasingly applied in financial risk mitigation, medical diagnosis, criminal profiling, and other sensitive socio-economic sectors, where they are highly desirable for bias reduction. Apart from the challenges of sparsity, feasibility, and actionability, the primary challenge for counterfactual explanations is their interpretability: higher interpretability leads to higher adoption of AI-enabled decision-making systems and greater trust by data subjects in AI-enabled decisions. AI models used for decision making are typically black-box models, either because of the computational and mathematical complexity of the model or because of the proprietary nature of the technology. In this paper, we address the challenge of generating counterfactual explanations that are more likely and more interpretable. A counterfactual explanation is interpretable if it lies within or close to the model's training data distribution. This problem has been addressed by constraining the search for counterfactuals to lie in the training data distribution, achieved by incorporating an auto-encoder reconstruction loss in the counterfactual search process.
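This style of search can be sketched as gradient descent on a combined loss: a term pushing the prediction to the target, a proximity term keeping the change small, and the reconstruction penalty keeping the counterfactual on the data manifold. The sketch below is illustrative only; the classifier, "auto-encoder", weights, and function names are toy stand-ins of our own, not the models or notation of the paper.

```python
import math

# Toy stand-ins for the trained models (weights and architecture are
# illustrative, not from the paper).
W, B = [1.5, -2.0], 0.0  # "black-box" logistic classifier

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def reconstruct(x):
    # Toy "auto-encoder": projects onto the line x0 = x1, standing in
    # for a trained reconstruction model.
    m = (x[0] + x[1]) / 2.0
    return [m, m]

def counterfactual_loss(x_cf, x_orig, target, lam=1.0, beta=0.5, gamma=0.5):
    pred = lam * (predict(x_cf) - target) ** 2                   # reach the target output
    prox = beta * sum(abs(a - b) for a, b in zip(x_cf, x_orig))  # smallest change (L1)
    rec = reconstruct(x_cf)
    ae = gamma * sum((a - b) ** 2 for a, b in zip(x_cf, rec))    # stay on the data manifold
    return pred + prox + ae

def search(x_orig, target, steps=500, lr=0.05, eps=1e-4):
    # Plain finite-difference gradient descent on the combined loss.
    x = list(x_orig)
    for _ in range(steps):
        for i in range(len(x)):
            hi, lo = list(x), list(x)
            hi[i] += eps
            lo[i] -= eps
            g = (counterfactual_loss(hi, x_orig, target)
                 - counterfactual_loss(lo, x_orig, target)) / (2 * eps)
            x[i] -= lr * g
    return x

x0 = [-1.0, 1.0]               # instance predicted as the negative class
x_cf = search(x0, target=1.0)  # counterfactual pushed toward the positive class
```

Without the reconstruction term (gamma = 0), the search is free to leave the data manifold, which is precisely the failure mode the reconstruction loss penalizes.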
However, adhering to the training data distribution is not sufficient for a counterfactual explanation to be likely: the explanation should also belong to the feature distribution of its target class. To understand this, consider a model that predicts an individual's risk of diabetes as high or low. A sparse counterfactual explanation for reducing the risk might suggest decreasing the individual's body mass index (BMI) while leaving all other features unchanged. The model might then predict low risk, and the modified features might still lie in the model's data distribution. However, they will not lie in the data distribution of individuals with low risk of diabetes, because other relevant features of low-risk individuals, such as glucose tolerance, serum insulin, and diabetes pedigree, remain unchanged. To address this issue, the authors in Van Looveren & Klaise (2019) proposed connecting the output behavior of the classifier to the latent space of the auto-encoder using prototypes. These prototypes guide the counterfactual search in the latent space and improve the interpretability of the resulting counterfactual explanations. However, the auto-encoder latent space is still unaware of the class tag information. This is highly undesirable, especially when using a prototype-guided search for counterfactual explanations on the latent space. In this paper, we propose to build a latent space that is aware of the class tag information through joint training of the auto-encoder and the classifier. The counterfactual explanations generated are thus faithful not only to the entire training data distribution but also to the data distribution of the target class. We show that considerable improvements in interpretability, sparsity, and proximity metrics can be achieved simultaneously if the auto-encoder is trained in a semi-supervised fashion with class-tagged input data.
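One natural way to write such a joint training objective (notation ours, not taken from the paper: $\mathrm{enc}$, $\mathrm{dec}$, and $\mathrm{clf}$ denote the encoder, decoder, and classifier head with parameters $\theta_e$, $\theta_d$, $\theta_c$; $\mathcal{D}$ is the full input data and $\mathcal{D}_L \subseteq \mathcal{D}$ its class-tagged subset) is:

```latex
\mathcal{L}(\theta_e, \theta_d, \theta_c)
  = \sum_{x \in \mathcal{D}} \lVert x - \mathrm{dec}(\mathrm{enc}(x)) \rVert_2^2
  \;+\; \lambda \sum_{(x, y) \in \mathcal{D}_L}
        \mathrm{CE}\!\big(\mathrm{clf}(\mathrm{enc}(x)),\, y\big)
```

The reconstruction term runs over all inputs, while the cross-entropy term runs only over class-tagged inputs; it is this second term that makes the shared latent space aware of the class tag information.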
Our approach does not rely on the availability of the training data used for the black-box classifier. It generalizes easily to a post-hoc explanation method using the semi-supervised learning framework, which relies only on the predictions of the black-box model. In the next section we present the related work. In Section 3 we present the preliminary definitions and approaches necessary to introduce our approach. In Section 4 we present our approach, and in Section 5 we evaluate it empirically.

2. RELATED WORK

Counterfactual analysis is a concept derived from causal intervention analysis. Counterfactuals refer to model outputs corresponding to certain imaginary scenarios that we have not observed or cannot observe. Recently, Wachter et al. (2017) proposed the idea of model-agnostic (without opening the black box) counterfactual explanations, obtained through simultaneous minimization of the error between the model prediction and the desired counterfactual output and of the distance between the original instance and its counterfactual. This idea has been extended to multiple scenarios by Mahajan et al. (2019), Ustun et al. (2019), and Poyiadzi et al. (2020) through the incorporation of feasibility constraints, actionability, and diversity of counterfactuals. Authors in Mothilal et al. (2020) proposed a framework for generating diverse sets of counterfactual explanations based on determinantal point processes. They argue that a wide range of suggested changes, along with proximity to the original input, improves the chances of those changes being adopted by data subjects. Causal constraints of our society do not allow data subjects to reduce their age while increasing their educational qualifications; such feasibility constraints were addressed by Mahajan et al. (2019) and Joshi et al. (2019) using a causal framework. Authors in Mahajan et al. (2019) address the feasibility of counterfactual explanations through causal relationship constraints amongst input features and present a method that uses structural causal models to generate actionable counterfactuals. Authors in Joshi et al. (2019) propose to characterize the data manifold and then provide an optimization framework to search for actionable counterfactual explanations on the data manifold via its latent representation. Authors in Poyiadzi et al. (2020) address the issues of feasibility and actionability through feasible paths, which are based on shortest-path distances defined via density-weighted metrics. An important aspect of counterfactual explanations is their interpretability: a counterfactual explanation is more interpretable if it lies within or close to the distribution of the training data of the black-box classifier. To address this issue, Dhurandhar et al. (2018) proposed the use of auto-encoders to generate counterfactual explanations that are "close" to the data manifold, incorporating an auto-encoder reconstruction loss in the counterfactual search process to penalize counterfactuals that are not true to the data manifold. This line of research was further extended by Van Looveren & Klaise (2019), who proposed connecting the output behaviour of the classifier to the latent space of the auto-encoder using prototypes.
These prototypes improved the speed of the counterfactual search process and the interpretability of the resulting counterfactual explanations. While Van Looveren & Klaise (2019) connect the output behaviour of the classifier to the latent space through prototypes, the latent space is still unaware of the class tag information. We propose to build a latent space that is aware of the class tag information through joint training of the auto-encoder and the classifier. The counterfactual explanations generated are thus faithful not only to the entire training data distribution but also to the data distribution of the target class. In a post-hoc scenario where access to the training data is not guaranteed, we propose to use the input-output pairs of the black-box classifier to jointly train the auto-encoder and classifier in the semi-supervised learning framework. Authors in Zhai & Zhang (2016) and Gogna et al. (2016) have explored the use of semi-supervised auto-encoders for sentiment analysis and for the analysis of biomedical signals. Authors
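The prototype idea of Van Looveren & Klaise (2019) can be illustrated with a small sketch: a class prototype is a representative point of that class in the latent space (here, the mean of the encoded class instances), and a latent-space distance to the target-class prototype is added to the counterfactual search objective. The encoder and data below are toy stand-ins of our own, not the paper's models.

```python
# Toy encoder: a fixed linear map standing in for a trained encoder.
def encode(x):
    return [0.5 * x[0] + 0.5 * x[1]]  # 1-D latent code

def class_prototype(instances):
    """Prototype of a class: the mean of the encoded class instances."""
    codes = [encode(x) for x in instances]
    dim = len(codes[0])
    return [sum(c[i] for c in codes) / len(codes) for i in range(dim)]

def prototype_loss(x_cf, proto):
    """Squared latent-space distance pulling the counterfactual
    toward the target-class prototype."""
    z = encode(x_cf)
    return sum((zi - pi) ** 2 for zi, pi in zip(z, proto))

# Toy target-class instances and their prototype.
target_class = [[2.0, 2.0], [3.0, 1.0], [2.5, 1.5]]
proto = class_prototype(target_class)  # -> [2.0]

x_cf = [0.0, 0.0]
# This term would be added to the counterfactual search objective.
loss = prototype_loss(x_cf, proto)     # -> 4.0
```

Because the prototype is defined in the latent space, the search is guided toward target-class-like regions without evaluating the classifier at every candidate, which is also why it speeds up the search.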

