LEARNING TO COUNTER: STOCHASTIC FEATURE-BASED LEARNING FOR DIVERSE COUNTERFACTUAL EXPLANATIONS

Abstract

Interpretable machine learning seeks to understand the reasoning processes of complex black-box systems that are notorious for their lack of explainability. One increasingly popular approach to interpretation is through counterfactual explanations, which go beyond why a system arrives at a certain decision to also provide suggestions on what a user can do to alter the outcome. A counterfactual example must counter the original prediction from the black-box classifier while also satisfying various constraints for practical applications. These constraints trade off against one another, posing radical challenges to existing works. To this end, we propose a stochastic learning-based framework that effectively balances the counterfactual trade-offs. The framework consists of a generation module and a feature selection module with complementary roles: the former aims to model the distribution of valid counterfactuals, whereas the latter serves to enforce additional constraints in a way that allows for differentiable training and amortized optimization. We demonstrate the effectiveness of our method in generating actionable and plausible counterfactuals that are more diverse than those of existing methods, and in a notably more efficient manner than counterparts of the same capacity.

1. INTRODUCTION

Recent advances in machine learning, especially the successes of deep neural networks, have promoted the use of these systems in various real-world applications. Such models provide remarkable predictive performance, yet often at a cost to transparency and interpretability. This has sparked controversy over whether to rely on algorithmic predictions for critical decision making, from graduate admission (Waters & Miikkulainen, 2014; Acharya et al., 2019), job recruitment (Ajunwa et al., 2016) to high-stakes cases of credit assessment (Lessmann et al., 2015) or criminal justice (Lipton, 2018; Gifford, 2018). Progress in interpretable machine learning offers interesting solutions for explaining the predictive mechanisms of black-box models. One useful approach to interpretation is through counterfactual examples, which shed light on what modifications to an individual's profile can counter an unfavorable decision outcome from a black-box classifier. Such explanations explore what-if scenarios that suggest possible recourses for future improvement.

Counterfactual explainability indeed has important social implications at both the personal and the organizational level. For instance, job applicants who are rejected by a company's CV screening algorithm are likely to benefit from feedback such as 'getting 1 more referral' or 'being fluent in at least 2 languages', which would help them better prepare for future applications. At the organizational level, by engaging with job candidates in this way as a form of advocating for transparency in decision making, companies can improve their employer branding and attractiveness to top talent. Internally, organizations can also validate whether any prejudice or unfairness towards a particular group has been implicitly introduced in historical data and consequently embedded in classifiers that produce biased decisions.

Related works.
Recent years have seen an explosion in the literature on counterfactual explainability, from works that initially focused on one or two specific characteristics or families of models to those that can handle multiple constraints and various model types. There have been many attempts to summarize the major themes of research and discuss open challenges in great depth. We therefore refer readers to Karimi et al. (2020b); Verma et al. (2020); Guidotti (2022) for excellent surveys of

