LEARNING TO COUNTER: STOCHASTIC FEATURE-BASED LEARNING FOR DIVERSE COUNTERFACTUAL EXPLANATIONS

Abstract

Interpretable machine learning seeks to understand the reasoning process of complex black-box systems, which are long notorious for their lack of explainability. One increasingly popular approach is counterfactual explanation, which goes beyond why a system arrives at a certain decision to further suggest what a user can do to alter the outcome. A counterfactual example must counter the original prediction of the black-box classifier while also satisfying various constraints for practical applications. These constraints often trade off against one another, posing radical challenges to existing works. To this end, we propose a stochastic learning-based framework that effectively balances these counterfactual trade-offs. The framework consists of a generation module and a feature selection module with complementary roles: the former aims to model the distribution of valid counterfactuals, whereas the latter serves to enforce additional constraints in a way that allows for differentiable training and amortized optimization. We demonstrate the effectiveness of our method in generating actionable and plausible counterfactuals that are more diverse than those of existing methods, and in particular in a more efficient manner than counterparts of the same capacity.

1. INTRODUCTION

Recent advances in machine learning, especially the successes of deep neural networks, have promoted the use of these systems in various real-world applications. Such models provide remarkable predictive performance, yet often at a cost in transparency and interpretability. This has sparked controversy over whether to rely on algorithmic predictions for critical decision making, from graduate admission (Waters & Miikkulainen, 2014; Acharya et al., 2019) and job recruitment (Ajunwa et al., 2016) to high-stakes cases of credit assessment (Lessmann et al., 2015) or criminal justice (Lipton, 2018; Gifford, 2018). Progress in interpretable machine learning offers interesting solutions for explaining the predictive mechanisms of black-box models. One useful interpreting approach is through counterfactual examples, which shed light on what modifications can be made to an individual's profile to counter an unfavorable decision outcome from a black-box classifier. Such explanations explore what-if scenarios that suggest possible recourses for future improvement. Counterfactual explainability indeed has important social implications at both the personal and the organizational level. For instance, job applicants who are rejected by the CV screening algorithm of a company are likely to benefit from feedback such as 'get 1 more referral' or 'be fluent in at least 2 languages', which would help them better prepare for future applications. At the organizational level, by engaging with job candidates in this way as a form of advocating transparency in decision making, companies can improve their employer branding and attractiveness to top talent. Internally, organizations can also validate whether any prejudice or unfairness towards a particular group has been implicitly introduced in historical data and consequently embedded in classifiers producing biased decisions.

Related works.

Recent years have seen an explosion in the literature on counterfactual explainability, from works that initially focused on one or two specific characteristics or families of models to those that can deal with multiple constraints and various model types. There have been many attempts to summarize major themes of research and discuss open challenges in great depth. We therefore refer readers to Karimi et al. (2020b); Verma et al. (2020); Guidotti (2022) for excellent surveys of methods in this area. We here focus on reviewing algorithms that can support multiple or even diverse counterfactual generations, a property that has received less attention. To deal with the combinatorial nature of the task, earlier works commonly adopt mixed integer programming (Russell, 2019), genetic algorithms (Sharma et al., 2020), or SMT solvers (Karimi et al., 2020a). Another recent popular approach is gradient-based optimization (Mothilal et al., 2020; Bui et al., 2022). In a similar fashion to adversarial learning (Goodfellow et al., 2014), it involves iteratively perturbing the input data point according to an objective function that incorporates the desired constraints. The whole idea of diversity is to explore different combinations of features and feature values that can counter the original prediction while accommodating various user needs. To support diversity, Russell (2019) in particular enforces hard constraints forcing the current generations to differ from the previous ones; such a constraint is however removed whenever the solver cannot be satisfied. Meanwhile, Mothilal et al. (2020) and Bui et al. (2022) add another loss term for diversity using Determinantal Point Processes (Kulesza et al., 2012), whereas the other works only demonstrate the capacity to generate multiple counterfactuals via empirical results. Moreover, all of these algorithms are computationally expensive. Along this line, Redelmeier et al. (2021) attempt to model the conditional likelihood of mutable features given the immutable features using the training data. They then adopt Monte Carlo sampling to generate counterfactuals from this distribution and filter out samples that do not meet counterfactual constraints.
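For context on the determinantal-point-process (DPP) style diversity terms discussed in this section, the sketch below scores a candidate set of counterfactuals by the log-determinant of a similarity kernel: the score grows as candidates become mutually dissimilar. This is purely illustrative; the RBF kernel and the function name `dpp_diversity` are our own choices, not the exact formulation of any cited method.

```python
import numpy as np

def dpp_diversity(candidates, eps=1e-6):
    """DPP-style diversity score for a set of candidate counterfactuals.

    Builds a pairwise RBF similarity kernel K; log det(K) increases as the
    candidates spread apart, and collapses toward -inf for near-duplicates.
    """
    X = np.asarray(candidates, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    K = np.exp(-d2)                                      # RBF similarity kernel
    K += eps * np.eye(len(X))                            # numerical stability
    _sign, logdet = np.linalg.slogdet(K)
    return logdet
```

Maximizing such a term alongside a validity loss encourages a set of counterfactuals that counter the prediction in genuinely different ways rather than as minor variants of one another.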
Amortized optimization emerges as a more effective strategy that explicitly models the distribution of counterfactual examples via a generative model such as a variational auto-encoder (VAE) (Mahajan et al., 2019; Pawelczyk et al., 2020; Downs et al., 2020) or a Markov decision process (MDP) under a reinforcement learning setting (Verma et al., 2022). Once such a distribution is obtained, counterfactuals can be sampled straightforwardly.

Contributions.

In this paper, we propose a learning-based framework diverging markedly from the previous approaches. We reformulate the combinatorial search task into a stochastic optimization problem that can be solved efficiently via gradient descent. Whereas previous works model the generative distributions via an MDP (Verma et al., 2022) or a VAE and VAE-based counterparts (Mahajan et al., 2019; Pawelczyk et al., 2020; Downs et al., 2020), we construct a learnable generation module G that directly models the conditional distributions of individual features such that they form a valid counterfactual distribution when combined. Another point of difference of our framework lies in the use of Bernoulli sampling to ensure that only minimal changes are introduced to the generated counterfactuals. In prior works, standard metrics such as the L1 or L2 distance are often used to penalize the distance between the counterfactual and the original data point. Verma et al. (2020) criticize this approach as non-obvious, especially for handling categorical features. Avoiding the use of distance measures, we instead optimize a feature selection module S to output a Bernoulli distribution for each feature representing the likelihood of that feature being mutated. S is a flexible module that can adapt to different user-defined constraints on the mutability of features. Similar to most works, our framework is developed to deal with heterogeneous tabular data.
However, instead of one-hot encoding every categorical feature and treating each level as an individual numerical feature, we propose the opposite strategy: to discretize the numerical features and treat them as categorical. The benefits are four-fold: (1) we can conveniently apply one functional form over all feature distributions, which then requires only one feature-wise reparameterization trick; (2) it helps expand the original input space, which may later support generalization; (3) we believe that it yields more useful explanations that are easier for human users to follow than hard requirements on specific numerical values; (4) it helps reduce privacy risks when revealing the counterfactual suggestions to the public. To facilitate end-to-end differentiable training, we employ the Gumbel-Softmax reparameterization trick for effective treatment of categorical features. This is the first time this strategy has been used in this line of research. Our contributions can be summarized as follows:

• We introduce Learning to Counter (L2C) - a stochastic feature-based learning approach for generating counterfactual explanations that addresses the desirable counterfactual properties in a single end-to-end differentiable framework.
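The sampling machinery described above can be sketched in plain NumPy. This is an illustrative toy rather than the paper's implementation: the function names (`gumbel_softmax`, `relaxed_bernoulli_mask`, `discretize`), the temperature value, and the bin edges are our own assumptions. The first function draws a relaxed one-hot sample for a categorical feature; the second draws a relaxed per-feature Bernoulli "mutate?" mask; the third turns a numerical feature into ordinal bins so every feature can receive the same categorical treatment.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Relaxed one-hot sample from categorical logits (Gumbel-Softmax trick):
    perturb the logits with Gumbel noise, then apply a temperature softmax."""
    u = rng.uniform(1e-9, 1 - 1e-9, size=np.shape(logits))
    y = (np.asarray(logits) - np.log(-np.log(u))) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))     # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

def relaxed_bernoulli_mask(logits, tau=0.5):
    """Differentiable relaxation of a per-feature Bernoulli selection mask
    (binary concrete): values near 1 mark features chosen for mutation,
    values near 0 keep the original feature, so changes stay sparse."""
    u = rng.uniform(1e-9, 1 - 1e-9, size=np.shape(logits))
    noise = np.log(u) - np.log(1 - u)                 # logistic noise
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + noise) / tau))

def discretize(values, bin_edges):
    """Map a numerical feature to ordinal bin indices (categorical levels)."""
    return np.digitize(values, bin_edges)
```

Because both relaxations are smooth in their logits, a generation module and a selection module producing these logits can be trained end to end with gradient descent; at low temperature the samples approach hard one-hot / binary choices.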




