INSPIRE: A FRAMEWORK FOR INTEGRATING INDIVIDUAL USER PREFERENCES IN RECOURSE

Abstract

Most recourse generation approaches optimize for indirect, distance-based metrics such as diversity, proximity, and sparsity, or assume a cost function shared across all users. The latter is an unrealistic assumption because users can have diverse feature preferences that they might be willing to act upon, and a change to any undesirable feature can render a recourse impractical. In this work, we propose a novel framework that incorporates the individuality of users in both the recourse generation and evaluation procedures by focusing on the cost a user incurs when opting for a recourse. To achieve this, we first propose an objective function, Expected Minimum Cost (EMC), based on two key ideas: (1) when presented with multiple options, the user should be comfortable adopting at least one solution, and (2) we can approximately optimize for users' satisfaction even when their true cost functions (i.e., the costs associated with feature changes) are unknown. EMC samples multiple plausible cost functions that reflect the diverse feature preferences in the population and then finds a recourse set with one good solution for each category of user preferences. We optimize EMC with a novel discrete optimization algorithm, Cost-Optimized Local Search (COLS), which is guaranteed to improve the quality of the recourse set over iterations. Our evaluation framework computes the fraction of satisfied users by simulating each user's cost function and then computing the cost incurred for the provided recourse set. Experimental evaluation on popular real-world datasets demonstrates that our method satisfies up to 25.9% more users than strong baselines. Moreover, a human evaluation shows that our recourses are preferred more than twice as often as those of the strongest baseline.

1. INTRODUCTION

Over the past few years, ML models have been increasingly deployed to make critical decisions related to loan approval (Siddiqi, 2012), allocation of public resources (Chouldechova et al., 2018), and hiring (Ajunwa et al., 2016). These decisions have real-life consequences for the users involved. As a result, there is a growing emphasis on explaining these models' decisions (Poulin et al., 2006; Ribeiro et al., 2018) and providing recourse for unfavorable decisions (Voigt & dem Bussche, 2018). A recourse is an actionable plan that allows a user to change the decision of a deployed model to a desired alternative (Wachter et al., 2017). Recourses are often presented to users as a set of counterfactuals (cfs), where each cf details the changes to the user's state vector (i.e., their feature vector). Recourses should be actionable and feasible. Actionable means that only features that can be changed by the user are requested to change. A recourse is feasible if it is easy for the user to adopt; in other words, it is actionable and has a low cost for the user. To achieve these objectives, prior work has used feature-distance-based objectives like proximity, sparsity, and feature diversity. For instance, Mothilal et al. (2020) and Wachter et al. (2017) encourage proximity by minimizing the distance between the user's state vector and the counterfactuals, under the assumption that proximal cfs are easier to adopt. Sparsity, in turn, quantifies the number of features that require modification to implement a recourse (Mothilal et al., 2020). In contrast to these, feature diversity (Mothilal et al., 2020; Cheng et al., 2021) provides a user with multiple cfs that change diverse subsets of features, assuming that users are then more likely to find at least one feasible solution. These objectives capture desired properties of recourses but do not account for individual user preferences, which should be the primary objective.
For instance, if a user prefers to change features f1 and f2, then providing them with recourses that change undesirable features makes those recourses infeasible even if they are proximal, sparse, and diverse. To address this, some recourse methods define a single cost function shared by all users, where C(f, i, j) denotes the cost of changing a feature f from value i to j, and optimize and evaluate for low-cost solutions under this function (Ustun et al., 2019; Rawal & Lakkaraju, 2020; Karimi et al., 2020c;d; Cui et al., 2015). We question this assumption and argue for the importance of user-specific cost functions, as a shared cost function is likely to poorly represent different users in a diverse population. Hence, these indirect objectives and global cost functions might be necessary but are not sufficient for a feasible recourse.

[Figure 1 caption: An abstract cost function space, where squares denote sampled cost functions; similar cost functions share a color and form a cluster. We aim to find a recourse set where each cf (here, {s1, s2, s3}) does well under a particular cluster of cost functions. The shaded big circles each represent a single cf si that caters to the enclosed cost functions. Here the user's hidden ground-truth cost function (grey circle) is served well by s1.]

In this work, we propose a novel framework, INSPIRE (INdividual uSer Preferences In RecoursE), that incorporates individual user preferences, via user-specific cost functions, into the generation of algorithmic recourse. INSPIRE provides each user with a recourse set containing multiple cf options such that there is at least one feasible solution adhering to the user's personal feature preferences (when possible). As noted by Rawal & Lakkaraju (2020), in most cases it is difficult for users to specify their exact feature preferences or cost functions. INSPIRE addresses this issue by focusing on and improving four major components: (1) the procedure to formalize and define individual user preferences via user-specific cost functions, (2) the recourse objective function, (3) the optimization algorithm, and (4) the evaluation procedure. We then propose a novel objective function, Expected Minimum Cost (EMC), that approximately optimizes for the cost incurred by the user under their (unknown) cost function. To do this, (1) we build on Ustun et al. (2019) to propose three distributions over cost functions, D_lin, D_perc, and D_mix, that represent diverse user preferences in a population; these distributions are based on linear and percentile changes in feature values (§3.1). (2) Next, we compute the expected minimum cost of the generated cfs with respect to multiple cost functions sampled from one of the proposed distributions (§3.2). To efficiently optimize EMC, we propose a discrete optimization method, Cost-Optimized Local Search (COLS) (§3.3). COLS guarantees a monotonic reduction in the EMC of the recourse set, leading to large empirical reductions in the user-incurred cost. Note that the EMC objective encourages diversity in the solution set with respect to the diverse feature preferences a user might possess, by ensuring that each cf is a good cf under some particular cluster of cost functions from the sampling distribution.
Hence, if the user's ground-truth cost function is well represented by any of the clusters, then some counterfactual will be feasible (actionable and low-cost) under their cost function (shown in Figure 1). To evaluate the effectiveness of EMC and COLS, we run experiments on two popular real-world datasets: Adult-Income (Dua & Graff, 2017) and COMPAS (Larson et al., 2016). We compare our method against multiple strong baseline methods: DICE (Mothilal et al., 2020), FACE (Poyiadzi et al., 2020), and Actionable Recourse (AR) (Ustun et al., 2019). We evaluate these methods on existing metrics from the literature, such as diversity, proximity, sparsity, and validity (§4.1), along with our novel cost-based evaluation framework (§3.4) and a human evaluation. In particular, we define the fraction of satisfied users based on whether their cost of recourse is below a satisfiability threshold k. We also report coverage, the fraction of users with at least one actionable recourse (Rawal & Lakkaraju, 2020). Using simulated user cost functions, we show that our method satisfies up to 25.89% more users than strong baseline methods while covering up to 22.35% more users across datasets. Furthermore, our human evaluation shows that the recourses generated by our method are preferred by humans 57% of the time, compared to 25% for our strongest baseline (Actionable Recourse), a difference of 32%. We also perform ablations to show what fraction of the performance can be attributed to the COLS optimization method versus the EMC objective. Finally, we perform a fairness analysis of all methods across demographic subgroups and show that our method is fairer than baseline methods. Our primary contributions are listed below.
1. We conceptualize a novel framework, INSPIRE, that accounts for the individuality of users while generating and evaluating recourse options. INSPIRE provides the flexibility for future researchers to further innovate on its four components.
2. We propose a new objective function, Expected Minimum Cost, that approximately optimizes for a user's true cost function by using diverse plausible cost functions from a distribution.
3. We propose a discrete optimization method, Cost-Optimized Local Search, which generates recourses that lead to higher user satisfaction. In human evaluation, we find that our recourses are preferred more than twice as often as the strongest baseline's while being fairer.
4. We propose a novel evaluation procedure that simulates users' hidden cost functions to assess individual user satisfaction using our proposed metric FS@k.

2. PROBLEM STATEMENT

Feature Types. We assume a dataset with features F = {f_1, f_2, ..., f_h}. Features can be mutable, conditionally mutable, or immutable, according to the causal processes that generate them. For example, Race is an immutable feature (Mothilal et al., 2020), Age and Education are conditionally mutable (they cannot be decreased), and the number of work hours is mutable (it can both increase and decrease). Following Ustun et al. (2019), continuous features are discretized into appropriately sized bins.

Cost Function. We assume that each user has an inherent feature preference (FP) capturing the ease of changing each feature, and different users can have different FPs. We express such differential FPs via user-specific cost functions. A cost function C(f, i, j) denotes the cost of changing feature f from value i to j and lies in [0, 1] ∪ {∞}. Here, 0 means the transition has no associated cost, 1 means it is maximally difficult, and ∞ means it is infeasible.

Transition Costs. Given a cost function C and two feature vectors s_i, s_j, the cost of the transition s_i → s_j is the sum of the costs of changing the individual features: Cost(s_i, s_j; C) = Σ_{f∈F} C(f, s_i^f, s_j^f), where s^f denotes the value of feature f in a state vector.

User Definition. A user is a tuple u = (s_u, C*_u), where s_u is the user's current state vector of length |F| containing their feature values, and C*_u is their ground-truth cost function. See Appendix Table 7 for examples of s_u and feature preferences. Next, we define the cost incurred by a user when acting on a recourse set S consisting of cfs. Since a rational user will select the least costly option, the incurred cost is the minimum transition cost across all cfs in the recourse set:

MinCost(s_u, S; C*_u) = min_{s_j∈S} Cost(s_u, s_j; C*_u), where Cost(s_u, s_j; C*_u) = Σ_{f∈F} C*_u(f, s_u^f, s_j^f). (1)

Problem Definition.
For a user u, our goal is to find a recourse set S_u that contains at least one low-cost cf with the desired outcome. If the user's ground-truth cost function C*_u were provided, then we could provide them with a good recourse by directly optimizing:

S_u = argmin_S MinCost(s_u, S; C*_u) s.t. ∃ s_i ∈ S with F(s_i) = 1, (2)

where F is the black-box ML model and 1 is the desired class. Similar to Rawal & Lakkaraju (2020), we note that in practice it is difficult for a user to precisely quantify their FP and cost function. Hence, in most practical scenarios C*_u is not provided, and we propose the EMC objective to approximately optimize for the user-incurred cost.
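To make the cost definitions above concrete, here is a minimal sketch of the transition cost and MinCost computations. The cost-function interface and the toy values are hypothetical illustrations, not the paper's actual cost functions:

```python
import math

def transition_cost(s_u, s_j, cost_fn):
    """Cost of moving from state s_u to s_j: the sum of per-feature costs
    C(f, i, j), each in [0, 1] or math.inf for an infeasible change."""
    return sum(cost_fn(f, s_u[f], s_j[f]) for f in range(len(s_u)))

def min_cost(s_u, recourse_set, cost_fn):
    """A rational user adopts the cheapest counterfactual in the set, so
    the incurred cost is the minimum transition cost over the set."""
    return min(transition_cost(s_u, s_j, cost_fn) for s_j in recourse_set)

# Toy cost function: changing feature 0 costs 0.2 per step; feature 1 is
# immutable (infinite cost for any change).
def toy_cost(f, i, j):
    if i == j:
        return 0.0
    return 0.2 * abs(j - i) if f == 0 else math.inf

user = [1, 0]
recourse = [[3, 0], [1, 1]]  # cf 1 edits feature 0; cf 2 edits the immutable one
```

Here `min_cost(user, recourse, toy_cost)` selects the first cf, since the second requires an infeasible change.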

3. INSPIRE: INCORPORATING INDIVIDUALITY OF USERS IN RECOURSE

Given a user, we first sample plausible cost functions from our proposed distributions, which capture diverse user feature preferences (FPs) in a population (§3.1). Then we run our COLS optimization method to generate a candidate recourse set (§3.3). Next, we compute our EMC objective for the candidate cfs with respect to the sampled cost functions (§3.2). We iterate this procedure to produce a final recourse set, which is then evaluated using our evaluation procedure (§3.4).

3.1. CHARACTERISING TRANSITION COST AND COST FUNCTION DISTRIBUTIONS

Our first goal is to carefully design distributions to sample cost functions that (1) adhere to fundamental notions of how the population quantifies the transition cost between two feature values, and (2) can quantify and integrate different user FPs to represent a diverse population. Recent works like Ustun et al. (2019) argue that users in a population fundamentally quantify the transition cost of changing a feature value from x to y as proportional to either (1) the difference in percentiles of x and y, or (2) the number of major steps involved in transitioning from x to y. For instance, when changing the education feature from Bachelors to Ph.D., the percentile cost might be appropriate, as very few people have Ph.D. degrees compared to Bachelors, leading to a higher cost. In contrast, when changing the number of working hours from 30 to 35, users might associate a fixed cost with every additional hour rather than with percentile differences. We recognize these different underlying phenomena for quantifying transition costs and call them percentile cost and linear cost. Even though most users quantify transition costs in the two ways described above, they can find it easier or harder to act upon certain features depending on their personal circumstances. We quantify these user FPs via preference scores p = [p_{f1}, ..., p_{fh}], which sum to 1, where each p_{fi} ∈ [0, 1] represents the willingness of user u to change feature f_i. We use these FPs to scale the transition costs (percentile and linear costs) of each feature f_i by (1 − p_{fi}), which decreases the cost of transition for preferred features and vice versa. This allows us to create a user-specific cost function that accounts for their FPs. Given FP scores p, Algorithms 2 and 3 generate cost functions based on percentile and linear costs, respectively, that adhere to the FPs.
Next, we propose three distributions, D_perc, D_lin, and D_mix, that are highly flexible and can generate cost functions modeling diverse FPs to better represent a population. D_perc and D_lin are based on percentile and linear transition costs, respectively, whereas D_mix is our most general distribution and combines both using a user-specific cost-type weight α. To generate samples from D_mix, we use Algorithm 4, which first samples FP scores p by (1) randomly sampling a subset of preferred features for a hypothetical user that are easy to act upon, and (2) sampling the FP scores p from a Dirichlet distribution whose concentration parameter is one for preferred features and zero otherwise. Then, we use the FP scores p along with a state vector s in Algorithms 2 and 3 to sample percentile and linear transition costs. We obtain the mixed costs by taking a convex combination of the percentile and linear transition costs with a randomly sampled cost-type weight α ∼ Unif(0, 1), which captures the user's fundamental way of quantifying transition cost. These mixed costs represent the transition costs for a hypothetical user with FP scores p and cost-type weight α. To capture the slight variance among users with similar preferences, we sample the final cost function from a Beta distribution with the mixed costs as mean and a small noise (std = 0.01). We highlight some desired properties of our D_mix distribution: (1) It captures all possible FP score vectors p because we randomly sample the preferred features. (2) It captures a user's predilection towards linear and percentile costs by combining them with a cost-type weight. (3) The linear and percentile costs are monotonic, i.e., more drastic changes have higher associated costs. (4) Following Watson et al. (2021), we give the user the option to specify their needs by providing their preferred features or FP scores p.
These properties allow us to represent a much larger space of plausible cost functions compared to past works that assume a shared cost function with no user FPs. Hence, these distributions better represent a population.
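A rough sketch of sampling from D_mix follows. The per-feature linear and percentile base costs are stand-ins passed as arrays, and the Beta moment-matching is one plausible way to realize "Beta with the mixed costs as mean and std = 0.01"; Algorithms 2-4 in the paper specify the full procedure:

```python
import numpy as np

def sample_fp_scores(n_features, rng):
    """Sample feature-preference scores p: pick a random subset of
    preferred features, then draw Dirichlet weights over that subset
    (so p sums to 1 and is zero for non-preferred features)."""
    k = int(rng.integers(1, n_features + 1))
    preferred = rng.choice(n_features, size=k, replace=False)
    p = np.zeros(n_features)
    p[preferred] = rng.dirichlet(np.ones(k))
    return p

def sample_mixed_cost(linear_cost, percentile_cost, p, rng, noise_std=0.01):
    """Convex combination of linear and percentile transition costs with
    cost-type weight alpha ~ Unif(0, 1), scaled by (1 - p_f), then
    perturbed via a Beta distribution whose mean is the mixed cost."""
    alpha = rng.uniform()
    mixed = (alpha * linear_cost + (1 - alpha) * percentile_cost) * (1 - p)
    mixed = np.clip(mixed, 0.01, 0.99)  # keep the Beta parameters valid
    # Beta(a, b) with mean mu and variance noise_std**2 (moment matching)
    common = mixed * (1 - mixed) / noise_std**2 - 1
    return rng.beta(mixed * common, (1 - mixed) * common)
```

Scaling by (1 − p_f) is what makes preferred features cheaper to change, mirroring the description above.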

3.2. EXPECTED MINIMUM COST (EMC) OBJECTIVE FUNCTION

As noted by Rawal & Lakkaraju (2020), in most practical scenarios the user's true cost function C*_u is hard to obtain, so we cannot exactly minimize Equation 2. Hence, we propose the Expected Minimum Cost (EMC) objective function. Given a state vector s, a recourse set S, and a distribution D_train over cost functions, we compute EMC as:

EMC(s, S) = E_{C_i ∼ D_train}[MinCost(s, S; C_i)] ≈ (1/M) Σ_{i=1}^{M} min_{s_j∈S} Cost(s, s_j; C_i). (3)

We employ Monte Carlo estimation (Robert & Casella, 2010) to approximate the expectation by sampling M cost functions {C_i}_{i=1}^{M} from D_train and then expanding MinCost using Equation 1. The distribution D_train can be any of the three distributions we proposed. For user u, we then obtain the recourse set S_u by minimizing the EMC objective:

S_u = argmin_S EMC(s_u, S). (4)

3.3. COST-OPTIMIZED LOCAL SEARCH (COLS)

For an efficient implementation, we store the costs of all N cfs with respect to all M cost functions C_i. Instead of directly comparing the EMC of the best-set-so-far and the candidate set, we evaluate whether any cf from the candidate set S_t would improve the EMC of the best set S_best if we swapped out individual cfs. Specifically, if replacing some s_j ∈ S_best with s_i ∈ S_t has a positive benefit, i.e., it reduces the EMC of S_best, then we make the replacement (see Algorithm 5). The ability to assess the benefit of each candidate cf is critical because it allows us to continually update the best set using cfs from a candidate set instead of waiting for an entire candidate set with lower EMC. For objectives like feature diversity, evaluating the benefit of an individual replacement becomes expensive (see Appendix B.1). Moreover, for COLS we can guarantee that the EMC of the best set monotonically decreases over time, which we formally state below: Theorem 3.1 (Monotonicity of COLS Algorithm).
Given the best set at iteration t−1, S^best_{t−1} ∈ R^{N×d}, the candidate set at iteration t, S_t ∈ R^{N×d}, and the matrices C^b ∈ R^{N×M} and C ∈ R^{N×M} containing the costs of their cfs under the sampled cost functions {C_i}_{i=1}^{M}, the updated best set S^best_t produced by COLS satisfies EMC(s_u, S^best_t; {C_i}_{i=1}^{M}) ≤ EMC(s_u, S^best_{t−1}; {C_i}_{i=1}^{M}). For the proof of the theorem, please refer to Appendix B.2.2.
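The EMC estimate and the swap test at the heart of COLS can be sketched with the stored cost matrix. This is a simplified illustration, not the full Algorithm 5; rows index the M sampled cost functions and columns index the cfs in a set:

```python
import numpy as np

def emc(costs):
    """costs[i, j] = Cost(s_u, cf_j; C_i) for M sampled cost functions
    (rows) and the N cfs in a set (columns). EMC averages, over cost
    samples, the cheapest cf under each sampled cost function."""
    return costs.min(axis=1).mean()

def cols_swap_step(best_costs, cand_costs):
    """One COLS-style update: try swapping each candidate cf into each
    slot of the best set, keeping a swap only if it lowers EMC.
    By construction, EMC never increases (cf. Theorem 3.1)."""
    best = best_costs.copy()
    for j in range(cand_costs.shape[1]):
        for k in range(best.shape[1]):
            trial = best.copy()
            trial[:, k] = cand_costs[:, j]
            if emc(trial) < emc(best):
                best = trial
    return best
```

Because each swap is accepted only when it strictly reduces EMC, the monotonicity guarantee holds trivially in this sketch.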

P-COLS:

The P-COLS method is a variant of COLS that launches multiple parallel runs of COLS from different initial sets. Given a computational budget, each run is allocated a fraction of the budget, and the recourse set from the run with the lowest objective value is provided to the user.
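A minimal sketch of the P-COLS budget split; `run_cols` stands in for a full COLS run and is hypothetical:

```python
def p_cols(initial_sets, run_cols, budget, objective):
    """P-COLS sketch: split the model-query budget evenly across parallel
    COLS runs started from different initial sets, then return the final
    recourse set with the lowest objective (EMC) value."""
    per_run = budget // len(initial_sets)
    finals = [run_cols(s, per_run) for s in initial_sets]
    return min(finals, key=objective)
```

Restarting from several initial sets is a standard hedge against local search stalling in a poor region of the discrete search space.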

3.4. EVALUATION PROCEDURE AND INDIVIDUAL USER BASED METRICS

Given that users' ground-truth cost functions C*_u are unknown, assessing user satisfaction would require asking the users directly, and obtaining such feedback is generally challenging. Hence, for every user u, we simulate an evaluation cost function C#_u from a distribution D_test and use Equation 1 to compute their incurred cost.

Proposed Metrics (FS@k and Coverage): We introduce a new cost-based metric that directly captures user satisfaction and is computed using each user's simulated cost function C#_u. We say that a user is satisfied by a recourse set if the best option in that set achieves a sufficiently low cost under C#_u. Formally, given a set of users U and the recourse sets {S_u}_{u∈U} provided to them, we define the fraction of users satisfied at a satisfiability threshold k as:

FS@k(U, {S_u}_{u∈U}) = (1/|U|) Σ_{u∈U} 1{MinCost(s_u, S_u; C#_u) < k}.

Although in reality k can vary from user to user, we keep k fixed across users in our experiments because the goal of any method is to find low-cost recourses regardless of k. In deployment scenarios, reasonable values of k can be estimated via a user survey. In addition to FS@k, we also measure the Population Average Cost (PAC), defined as PAC = (1/|U|) Σ_{u∈U} MinCost(s_u, S_u; C#_u). Another important measure is Coverage (Cov), the fraction of users to whom the recourse method can provide at least one actionable recourse (Rawal & Lakkaraju, 2020), defined as Cov(U, {S_u}_{u∈U}) = FS@∞ = (1/|U|) Σ_{u∈U} 1{MinCost(s_u, S_u; C#_u) < ∞}.
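Given each user's simulated minimum cost, the three metrics reduce to a few lines. In this sketch, `min_costs` holds MinCost(s_u, S_u; C#_u) per user, with `math.inf` for users with no actionable recourse:

```python
import math

def fs_at_k(min_costs, k):
    """FS@k: fraction of users whose cheapest recourse costs less than k."""
    return sum(c < k for c in min_costs) / len(min_costs)

def coverage(min_costs):
    """Coverage = FS@infinity: fraction of users with at least one
    actionable (finite-cost) recourse."""
    return fs_at_k(min_costs, math.inf)

def pac(min_costs):
    """Population Average Cost across users."""
    return sum(min_costs) / len(min_costs)
```

For example, with `min_costs = [0.4, 2.0, math.inf]`, FS@1 is 1/3 and coverage is 2/3.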

4.1. EXPERIMENTAL SETUP

Datasets: We use the Adult-Income (Dua & Graff, 2017) and COMPAS (Larson et al., 2016) datasets, which are available under the Open Data Commons PDDL license. The Adult-Income dataset is based on the 1994 US Census data and contains 12 features; the model has to predict whether an individual's income exceeds $50,000. COMPAS contains 7 features and was collected by ProPublica; it contains information about the criminal history of defendants for analyzing recidivism, and the model decides bail by predicting which applicants will recidivate in the next two years. Both datasets are anonymized to protect privacy. We preprocess both datasets based on a previous analysis in which categorical features are binarized (Pawelczyk et al., 2021). Our black-box model is a 2-layer Multi-Layer Perceptron. Please refer to Appendix Tables 9 and 10 for experiments with logistic regression, and to Appendix A.1 and Table 5 for further experimental details. Baselines: We compare our methods, COLS and P-COLS, with DICE (Mothilal et al., 2020), FACE-Knn and FACE-Epsilon (Poyiadzi et al., 2020), and Actionable Recourse (Ustun et al., 2019). Importantly, we control for compute across methods by restricting the number of forward passes to the black-box model, which are needed to decide whether a counterfactual produces the desired class; for most big models, this is the rate-limiting step. We ran our experiments on a local machine; for each user, an evaluation cost function C#_u is sampled and used for evaluation. For completeness, in Section 4.2 Q5 and Appendix A.2 Q5, Q6, we provide additional results for cases where D_train and D_test differ, to show that our method is robust to the choice of these distributions.

Distance Based Recourse Metrics:

To compare with past work, we evaluate methods on distance-based metrics, namely feature diversity, proximity, sparsity, and validity, which lie in [0, 1] with higher values being better. We report the average of these metrics, in percentage, across all users. For a single user, Proximity is defined as prox(x, S) = 1 − (1/|S|) Σ_{i=1}^{|S|} dist(x, S_i).

In this experiment, we compare different recourse methods on our cost-based evaluation framework and on the distance-based metrics. We report the average performance over five random seeds in Table 1. We observe that COLS and P-COLS, which optimize for EMC, achieve 22.64% and 25.89% higher user satisfaction while covering 19.28% and 22.42% more users than the strongest baseline on Adult-Income and COMPAS, respectively. Meanwhile, the other methods, which optimize combinations of distance-based metrics, perform worse on the user cost-based metrics that directly model user satisfaction. Interestingly, we find that COLS and P-COLS solutions also exhibit very high feature diversity, proximity, and sparsity. This implies that (1) the D_mix distribution generates cost functions that model diverse FPs, and COLS with EMC obtains the highest diversity even compared to methods that directly optimize for it, and (2) proximity, sparsity, and diversity emerge as necessary even under our cost-based evaluation framework, but they are not sufficient to satisfy users with preferences, as shown by the other methods' performance on the cost metrics.
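For reference, the distance-based metrics for one user can be computed roughly as follows. The normalized-Hamming `dist` and the exact sparsity normalization are sketched assumptions in the spirit of Mothilal et al. (2020):

```python
def hamming(x, s):
    """Normalized Hamming distance, a simple dist for discretized features."""
    return sum(a != b for a, b in zip(x, s)) / len(x)

def proximity(x, S, dist):
    """prox(x, S) = 1 - (1/|S|) * sum_i dist(x, S_i), with dist in [0, 1]."""
    return 1 - sum(dist(x, s) for s in S) / len(S)

def sparsity(x, S):
    """Average fraction of features a cf leaves unchanged (higher is
    better): 1 minus the mean fraction of modified features."""
    frac_changed = [sum(a != b for a, b in zip(x, s)) / len(x) for s in S]
    return 1 - sum(frac_changed) / len(frac_changed)
```

Both metrics lie in [0, 1] with higher values being better, matching the convention used in Table 1.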

Q2. Is the Performance Improved by the COLS Optimization Method or the EMC Objective?

We perform an ablation study to understand the impact of the COLS optimization method and the EMC objective. To do so, we run a basic local search (LS) to optimize the objectives used by other methods, i.e., feature diversity, proximity, and sparsity, along with validity. We use a basic local search because there is no simple and efficient way to guarantee reductions in the diversity objective by swapping out single elements from the solution set, which COLS requires (see Appendix B.1). To quantify the usefulness of COLS, we also optimize EMC using the basic local search. The results in Table 2 suggest that optimizing for distance metrics is sub-optimal. For the proximity, sparsity, and feature diversity objectives, the FS score and coverage are very low, even though each performs well on its own metric. The low FS score for distance metrics is expected, as they ignore user preferences and hence can edit non-preferred features, making the generated recourses infeasible under the user's cost function. We find that EMC with LS outperforms all distance objectives not only on FS and PAC but also on proximity and sparsity, suggesting that EMC is a better objective. Meanwhile, the 19% difference in performance between EMC with LS and EMC with COLS can be attributed to our cost optimization (§3.3), which allows COLS to search the solution space efficiently.

Q3. Do Recourse Methods Provide Fair Solutions Across Subgroups?

Next, we assess whether the recourse methods provide equitable solutions across subgroups based on demographic features like Gender and Race. This is important because recourse directly affects users' lives, and we want to ensure that recourse methods do not induce further bias towards any particular group. We adapt an existing fairness metric for disparate impact across population subgroups (Feldman et al., 2015) to the recourse outcomes we study, which we denote the Disparate Impact Ratio (DIR). Given a metric M, DIR is the ratio of metric scores between two subgroups: DIR-M = M(S=1) / M(S=0), where M is either Cov or FS@1. Under the DIR metric, the maximum fairness score is 1, though this might not be achievable depending on the black-box model. We run experiments on the Adult-Income dataset with a budget of 5000 model queries and |S| = 10, and present the Gender- and Race-based subgroup results in Table 4 and Appendix Table 6, respectively. We observe that our methods are typically fairer than the baselines on both Gender- and Race-based subgroups while providing recourse to a larger fraction of people in both subgroups. In particular, our method achieves scores very close to 1 on DIR-FS and DIR-Cov, implying a very high degree of fairness. We attribute this to (1) the fact that our method does not depend on the data distribution, and (2) the use of diverse cost functions to generate recourse. Condition (2) is important since other individualized methods that do not rely on the data distribution, such as FACE, can generate less fair solutions than COLS.
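The DIR computation itself is a one-liner; in this sketch, `metric` could be coverage or FS@1 from §3.4, applied to each subgroup's per-user minimum costs:

```python
def disparate_impact_ratio(metric, group0_costs, group1_costs):
    """DIR-M = M(S=1) / M(S=0): the ratio of a recourse metric between
    two demographic subgroups; values near 1 indicate parity."""
    return metric(group1_costs) / metric(group0_costs)

def fs_at_1(min_costs):
    """FS@1 over a subgroup's per-user minimum recourse costs."""
    return sum(c < 1 for c in min_costs) / len(min_costs)
```

For instance, if one subgroup has FS@1 = 0.5 and the other 1.0, DIR-FS is 2.0, signaling disparate treatment.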

Q4. Which Recourse Method Do Humans Prefer?

Next, we are interested in whether humans would consider the generated recourses reasonable for our synthetic users. We designed a small study in which we provided human annotators with the state vectors s_u and the sampled FP scores p_u for 100 users from the Adult-Income dataset (see Appendix A.1.4 for details). We presented the recourses generated by COLS and Actionable Recourse (the strongest baseline) to the annotators, anonymizing each method's name, and asked them two questions: (1) Acting as if they were the user with the provided preferences and state vector, which recourse would they prefer to adopt? (2) Does the recourse generated by each method seem reasonable to them? We collect three annotations per sample and take a majority vote for each response; annotators may indicate "no preference" between the two proposed recourses, and ties in annotation are recorded as "no preference." We found that our method was preferred 57% of the time, while AR was preferred only 25% of the time, a difference of 32% (+/- 16 points variance, Fleiss' kappa = 0.74, p = 1e-4). Furthermore, human annotators found 60% of the recourses generated by COLS to be reasonable, compared to 33% for AR, a 27% difference with p < 1e-4. This study shows that our method is preferred by humans over the baseline.

Q5. Robustness to Distribution Shifts And Other Research Questions

In this experiment, we test our framework's robustness when the train and test distributions, D_train and D_test, differ. In Appendix Figure 2, the top-left and bottom-right corners show cases where D_train = D_lin while the users' evaluation cost functions C#_u are drawn from D_test = D_perc (and vice versa). This is a complete distribution shift, and our method performs equally well in these cases, demonstrating that it is robust to distribution shift at test time. For the full experimental design and conclusions, please refer to Appendix Section A.2 and Figures 2 and 3. We also address several additional research questions in Appendix A.2, which we summarize here: (1) a larger compute budget can be used to scale up performance (Figure 4); (2) the recourse sets provide high-quality solutions using as few as 3 counterfactuals (Figure 5); (3) we can achieve high user satisfaction with as few as 20 Monte Carlo samples, rather than 1000 (Figure 6); (4) our method works for other classification models as well (Table 10); and (5) we present the computational complexity and runtimes in Appendix A.1.2. We also show qualitative examples of recourses provided by our method in Table 7.

5. RELATED WORK

Here, we distinguish our approach based on our recourse objectives, optimizer, and evaluation. We point readers to Venkatasubramanian & Alfano (2020) for a philosophical basis for algorithmic recourse and to Karimi et al. (2020b) for a comprehensive survey of existing recourse methods. Objectives: The most prominent family of objectives for recourse includes distance-based objectives (Wachter et al., 2017; Karimi et al., 2020a; Dhurandhar et al., 2018; Mothilal et al., 2020; Rasouli & Yu, 2021). These methods primarily seek recourses that are close to the original data point. In DICE, Mothilal et al. (2020) provide users with a set of counterfactuals while trading off between proximity and feature diversity. A second category of methods uses other heuristics based on the data distribution (Aguilar-Palacios et al., 2020; Gomez et al., 2020) to produce counterfactuals. FACE constructs a graph from the given data and then finds a high-density path between points to generate counterfactuals (Poyiadzi et al., 2020). Lastly, the works closest to ours use cost-based objectives, which capture feasibility in terms of the cost of recourse: (1) Cui et al. (2015) define a cost function specifically for tree-based classifiers, which compares the paths that two data points follow in a tree to obtain a classifier-dependent measure of cost. (2) Karimi et al. (2020c;d) take a causal intervention perspective on the task and define cost in terms of the normalized distance between the user state and the counterfactual. (3) Ustun et al. (2019) define cost in terms of the number of changed features and frame recourse generation as an Integer Linear Program. (4) Rawal & Lakkaraju (2020) infer a global cost function from pairwise comparisons of features drawn from simulated users.
However, they take a different approach to the recourse generation problem: they find a list of rules that can apply to any user to obtain a recourse, rather than generating recourses specifically for each user as in this work. Importantly, all of these works assume there is a single, known cost function that is shared by all users. Optimization: Several recourse methods use gradient-based optimization to generate counterfactuals close to a user's data point (Wachter et al., 2017; Mothilal et al., 2020). Some recent approaches use tree-based techniques (Rawal & Lakkaraju, 2020; von Kügelgen et al., 2020; Kanamori et al., 2020) or kernel-based methods (Dandl et al., 2020; Gomez et al., 2020; Ramon et al., 2020), while others employ heuristics (Poyiadzi et al., 2020; Aguilar-Palacios et al., 2020) to generate counterfactuals. A few works use autoencoders to generate recourses (Pawelczyk et al., 2020; Joshi et al., 2019), while Karimi et al. (2020a) and Ustun et al. (2019) utilize SAT and ILP solvers, respectively. Evaluation: Besides ensuring that recourses are classified as the desired outcome by a model (validity), the most prominent approaches to evaluating recourses rely on distance-based metrics. In DICE, Mothilal et al. (2020) evaluate recourses according to their proximity, sparsity, and feature diversity. Meanwhile, several works directly consider the cost of the recourses, using a single known cost function as a metric, meaning that all users share the same cost function. In contrast, Rawal & Lakkaraju (2020) estimate a cost function from simulated pairwise feature comparisons. For all these methods, a single cost function is used for both recourse generation and evaluation, i.e., the solutions are optimized and tested on the same cost function (Cui et al., 2015; Karimi et al., 2020c; d; Rawal & Lakkaraju, 2020).
In contrast, we evaluate recourse methods by simulating user-specific cost functions that can vary greatly across users to capture their preferences.

6. DISCUSSION AND CONCLUSION

Our novel framework INSPIRE provides a way to incorporate the individuality of the user in the recourse generation and evaluation process. INSPIRE lays a foundation for future work to build richer distributions that better represent the population, design non-linear transition costs, or modify the COLS procedure to account for causal relationships between features while still respecting individual user preferences. We show that our method achieves much higher rates of user satisfaction than comparable baselines, and we observe that diversity, proximity, and sparsity emerge as important metrics even in our framework, but are not sufficient for user satisfaction.

ETHICS STATEMENT

We hope that our recourse method is adopted by institutions seeking to provide reasonable paths to users for achieving more favorable outcomes under the decisions of black-box machine learning models or other inscrutable models. We see this as a "robust good," in line with past commentators (Venkatasubramanian & Alfano, 2020). Below, we comment on a few other ethical aspects of the algorithmic recourse problem. First, we suggest that fairness is an important value on which recourse methods should always be evaluated, but we note that such evaluations will depend heavily on the model, training algorithm, and training data. For instance, a sufficiently biased model might not even allow for suitable recourses for certain subgroups; in that case, any recourse method will fail to identify an equitable set of solutions for the population. That said, recourse methods can still be designed to be more or less fair. This much is evident from our varying results on fairness metrics across a number of recourse methods. What will be valuable in future work is to design experiments that separate the effects of the model, training algorithm, training data, and recourse algorithm on fairness. Until then, we risk blaming the recourse algorithm for the bias of a model, or vice versa. Additionally, there are possible dual-use risks from developing stronger recourse methods. For instance, malicious actors may use recourse methods when developing models in order to exclude certain groups from having available recourse, which is essentially a reversal of the objective of training models for which recourse is guaranteed (Ross et al., 2021). We view this use case as generally unlikely, but pernicious outcomes are possible.
We also note that these kinds of outcomes may be difficult to detect, and actors may make bad-faith arguments about the fairness of their deployed models based on other notions of fairness (like whether or not a model has access to protected demographic features) that distract from an underlying problem in the fairness of recourses.

REPRODUCIBILITY STATEMENT

To encourage reproducibility, we provide our source code, including all data pre-processing, model training, recourse generation, and evaluation metric scripts, as supplementary material. Details about the datasets and pre-processing are provided in Appendix A.1.1. We also provide our cost sampling procedures in Algorithms 4, 2, and 3, and our optimization method COLS in Algorithm 1. Additionally, we provide a formal proof of Theorem 3.1 from the main paper in Appendix B.2.2, along with the constructive procedure for the proof in Algorithm 1.

A APPENDIX FOR INSPIRE: A FRAMEWORK FOR INTEGRATING INDIVIDUAL USER PREFERENCES IN RECOURSE

A.1 EXPERIMENTAL SETUP

A.1.1 DATASETS AND BLACK-BOX MODEL

In our experiments, we have two versions of the dataset, one with binary categorical features and the other with non-binary categorical features. In the main paper, we show results on the binarized version (Table 1) because an important baseline, Actionable Recourse (Ustun et al., 2019), operates only with binary categorical features. 3 The data statistics for all the datasets can be found in Table 5. In our experiments, for all the datasets, the features gender and race are considered immutable (Mothilal et al., 2020), since we perform subgroup analysis with these variables, which would be rendered meaningless if users could switch subgroups. Other features can be either mutable or conditionally mutable depending on their semantics. These constraints can be incorporated into the methods by providing a schema of feature mutability criteria. Our black-box model is a multi-layer perceptron with 2 hidden layers, trained on the training set and validated on the dev set. The accuracy numbers are shown in Table 5. The test set used in the counterfactual generation experiments contains only users classified into the undesired class by the trained black-box model. Note that our framework can operate with any type of model; the only requirement is the ability to query the model for an outcome given a user's state vector.

A.1.3 RECOURSE GENERATION AND EVALUATION PIPELINE

To approximate the expectation in equation 3, our algorithm samples a set of random cost functions {C_1, ..., C_M} ∼ D_train, which are used at generation time to optimize for the user's hidden cost function. In the generation phase, we use Equation 4 as our objective. Note that this objective encourages the generated counterfactual set to contain at least one good counterfactual for each of the cost samples; hence the set satisfies a large variety of samples from D_train. This is achieved by minimizing the mean of the minimum cost incurred for each of the Monte Carlo samples (Robert & Casella, 2010). Equivalently, the objective is minimized by a set of counterfactuals S where, for each cost function, there exists an element in S which incurs the least possible cost. In practice, the size of the set S is restricted, so we may not achieve the absolute minimum cost, but the objective encourages each counterfactual in the set to have a low cost with respect to at least one Monte Carlo cost sample. The generation phase outputs a set of counterfactuals S, which is provided to the users as recourse options. Given this set S, in the evaluation phase we use the users' simulated cost functions, which are hidden during generation, to compute the cost incurred by the user, MinCost(s_u, S; C#_u), and calculate the metrics defined in Section 4.1.
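The objective described here can be sketched as follows. This is a minimal illustration assuming costs have already been evaluated into a matrix; the function names and toy numbers are ours, not from the released code.

```python
import numpy as np

def emc(cost_matrix):
    """Expected Minimum Cost over Monte Carlo cost samples.

    cost_matrix[j, i] = cost of counterfactual i under sampled cost function C_j.
    EMC = mean over cost samples of the minimum cost achievable within the set.
    """
    return float(np.mean(np.min(cost_matrix, axis=1)))

# Toy example: 3 sampled cost functions (rows), 2 counterfactuals (columns).
C = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.5, 0.4]])
# Each cost sample is served by its cheapest counterfactual: 0.2, 0.1, 0.4.
print(emc(C))  # mean of [0.2, 0.1, 0.4]
```

At evaluation time, the same per-row minimum is computed under the user's hidden cost function instead of the sampled ones.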

A.1.4 DETAILS OF HUMAN EVALUATION

For our human evaluation experiments, we recruited three undergraduate research assistants with a background in computer science. They were provided with a set of instructions on how to interpret and perform the task. Specifically, in virtual meetings, we provided them with an overview of the dataset along with the feature descriptions, a description of the task, and an overview of the recourse generation problem. Before testing, we conducted a small comprehension quiz with example problems and corrected any misunderstandings of the study procedure. For each data point, they were asked to assume that they were a hypothetical user with the given state vector and preference scores, and were then provided with the recourses generated by our method and Actionable Recourse (Ustun et al., 2019) (in a blind format with randomized ordering). In total, we collected three annotations each for 100 samples from the Adult-Income dataset. We do not foresee any participant risks from the study, as participants were asked to assume the identity of a hypothetical user and judge which recourses were better. The annotators were paid $12.50/hour, and the whole study took around 2.5 hours, for a total cost of $93.75. The instructions provided to the participants of the human study are shown in Figure 7, and a screenshot of how the study was conducted is provided in Figure 8. Results: In Figure 2, we show a heatmap which demonstrates the robustness of our method. The color of the block at Monte Carlo alpha α_mc = x and user alpha α_user = y represents the fraction of users satisfied when α_mc = x and α_user = y. This means that even if a user thought of costs only in terms of the linear steps involved while the recourse method used samples with only percentile-based costs, the recourse set can still satisfy almost the same number of users.
In Figure 2, the corners correspond to the extreme cases described above. This means that our method is robust to misspecification between the train and test distributions. The nearly uniform color of the grid means that there is only slight variation in the fraction of satisfied users when the method is tested on out-of-distribution user cost types. Q6. Are Solutions Robust to Misspecified Cost Distributions? Design: In our cost sampling procedure, we make minimal assumptions about a user's feature preferences if they are not provided by the user. When finding recourses, we select a random subset of features along with a preference score for each user. However, there are situations where user preferences may be relatively homogeneous for certain features on which people usually share common preferences. For example, to increase their income, many users might prefer to edit their occupation type or education level rather than their work hours or marital status. Given the possibility of this kind of shift in feature preferences, we want to measure how robust our method is to a mismatch between our sampling distribution and the actual cost distribution followed by users. In this experiment, we test one case of this kind of distribution shift over cost functions. For users in the Adult-Income data, we generate recourse sets using Monte Carlo samples from our standard distribution D_mix (Algorithm 4). To obtain hidden user cost functions that differ from this distribution, we first generate 500 different feature subsets indicating which features are editable, where each subset corresponds to a binary concentration vector representing a user having specific preferences for some features over others (see Sec. 3.1 and Alg. 4).
Since having different editable features induces a different distribution over cost functions, we obtain a measure of distribution shift for each of the 500 concentration vectors by taking the ℓ2 distance between the vector and its nearest neighbor in the space of concentration vectors used to generate the recourses. We use the nearest neighbor because the most outlying concentration vectors are the least likely to be satisfied by the recourse set. In other words, the likelihood that a user is satisfied depends on the minimum distance between their concentration vector and its nearest neighbor among the cost samples used at recourse generation time. Therefore, as this minimum distance increases, there is a greater distribution shift between the user's cost functions and those obtained from D_mix. Finally, we measure how many users are satisfied for a given degree of distribution shift.
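The shift measure described here can be sketched as follows; the function name and toy vectors are ours, for illustration only.

```python
import numpy as np

def shift_measure(user_vec, train_vecs):
    """Distribution-shift proxy: l2 distance from a user's concentration
    vector to its nearest neighbor among the generation-time vectors."""
    dists = np.linalg.norm(train_vecs - user_vec, axis=1)
    return float(dists.min())

train = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
# A user whose editable-feature pattern matches a generation-time vector: no shift.
print(shift_measure(np.array([1.0, 0.0, 0.0]), train))  # 0.0
# A user with an unseen pattern: positive shift.
print(shift_measure(np.array([0.0, 0.0, 1.0]), train))  # sqrt(2)
```

Users are then binned by this distance to plot satisfaction against the degree of shift.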

Results:

In Figure 3, we show a binned plot of FS@1 against our measure of distribution shift. We observe that as the distance between the distributions increases, the fraction of users satisfied decreases slightly and then plateaus. Even at the maximum distance we obtain, performance has only dropped by about 3 points. This implies that our method is robust to distribution shift in terms of which features people prefer to edit. We attribute this result to the fact that our method (1) assumes random feature preferences, which subsume these skewed preferences, and (2) provides multiple recourse options, each of which can cater to different kinds of preferences. As a result, we achieve a good covering of the cost function space (see the experiments varying the recourse set size and the number of sampled cost functions in Appendix A.2). Q7. Does Method Performance Scale with Available Compute? Design: In Figure 5, we plot the fraction of users satisfied @1 as the size of the set S is increased, averaged across 5 runs with standard deviation error bars. Result: We observe that COLS and P-COLS monotonically increase the FS@1 metric as |S| increases from 1 to 30. This is consistent with the intuition behind our methods (see Figure 1, Section 3.2, and A.1.3 for more details). It is a fundamental property of our objective that as |S| increases towards M, which is 1000 in this case, the quality of the solution set should increase and reach the best possible value attainable under the user's cost function. We note empirically that a small set size |S| between 3 and 10 is enough in most practical cases to get close to maximum performance. Additionally, even with |S| ∈ {1, 2, 3}, our methods significantly outperform all other methods in terms of the number of users satisfied. This property is useful in real-world scenarios, where a deployed recourse method can provide as few as 3 options while still satisfying a large fraction of users.
Additionally, we also see improvement for the AR and Face-Knn methods as |S| increases. Note that Random Search's performance does not change as the set size increases because the method does not take local steps from the best set and instead samples random points from a very large space; hence it is much harder for it to end up with low-cost counterfactuals. Q9. Does increasing the number of Monte Carlo samples help with user satisfaction? Design: In this experiment, we demonstrate the effect of increasing the number of Monte Carlo samples on the performance of our COLS method. We take a random subset of 100 users, a budget of 5000, and |S| = 10. We vary the number of Monte Carlo samples (M) in the set {1, 5, 10, 20, 30, 100, 200, 300, 500, 1000} and compute user satisfaction. We ran 5 runs with different Monte Carlo samples and show the average FS@1 along with the standard deviation in Figure 6. Results: We observe that as the number of Monte Carlo samples increases, the performance of the method on the FS@1 metric monotonically increases. This supports the intuition underlying our method (see Figure 1): given a user with a cost function C*_u, as we draw more samples from the cost distribution D_train, the probability of having a cost sample similar to C*_u increases, and hence the fraction of satisfied users increases. It is important to note that empirically the method approaches maximum user satisfaction with as few as 20 Monte Carlo samples. In real-world scenarios where the deployed model caters to a large population, this can lead to a small recourse generation time, making the method more practical.

Q10. Qualitative examples of the recourses generated for some of the users.

In Table 7, we show a few examples of users along with their state vectors, editable features, and preference scores, together with the recourses provided to them and their costs.

Q11. Comparison of methods on the non-binary dataset?

In Table 8, we show the results on the non-binary version of the dataset. We observe similar performance and trends in these results as well; COLS and P-COLS perform the best in terms of user satisfaction.

Q12. Robustness to black-box model architecture families and randomness? In this experiment, we demonstrate the results of our method when we train the same ANN architecture with a different random seed (Table 9) and when we change the model family to a logistic regression classifier (Table 10). The obtained results show similar trends and demonstrate the effectiveness and robustness of our methods COLS and P-COLS, which consistently cover and satisfy more users with low average population costs. In Table 9, we show the results when we train another black-box model with a different seed to see the effect of having a different trained model from the same model family.

Q13. Additional results for different values of k in FS@k

In Table 11 , we report the fraction of satisfied user metric FS@k for four different values of k ∈ {0.5, 1, 2, 3}. These results are an extension of the results presented in Table 1 .

B APPENDIX -OBJECTIVE AND OPTIMIZATION

B.1 PROPOSED METHOD

B.1.1 OTHER OBJECTIVES

To obtain a feasible counterfactual set, past works have used various objective terms. We list objectives below from the methods we compare with. 1. DICE (Mothilal et al., 2020) optimizes for a combination of distance metrics like diversity and proximity. They model diversity via Determinantal Point Processes (Kulesza & Taskar, 2012), adapted for solving subset selection problems with diversity constraints. They use the determinant of the kernel matrix given by the counterfactuals as their diversity objective, defined as dpp_diversity(S) = det(K), where K_ij = 1 / (1 + dist(s_i, s_j)). Here, dist(s_i, s_j) is the normalized distance metric defined in Wachter et al. (2017) between two state vectors. Proximity is defined in terms of the distance between the original state vector and the counterfactuals, prox(x, S) = 1 − (1/|S|) Σ_{i=1}^{|S|} dist(x, S_i), where S_i is a counterfactual. 2. Actionable Recourse (Ustun et al., 2019) uses the log-percentile shift cost, cost(a; s) = Σ_{j∈J_A} log( (1 − Q_j(s_j + a_j)) / (1 − Q_j(s_j)) ), where Q_j(·) is the cumulative distribution function of s_j in the target population, J_A is the set of actionable features, and a_j is the action performed on feature j.
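The dpp_diversity term can be sketched as follows. This is a minimal illustration; `dist` here is a placeholder (DiCE uses a feature-wise normalized distance), and the function names are ours.

```python
import numpy as np

def dist(a, b):
    # Placeholder distance for illustration; DiCE normalizes per feature.
    return float(np.sum(np.abs(a - b)))

def dpp_diversity(S):
    """det(K) with K_ij = 1 / (1 + dist(s_i, s_j)); higher means more diverse."""
    n = len(S)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = 1.0 / (1.0 + dist(S[i], S[j]))
    return float(np.linalg.det(K))

near = [np.array([0.0, 0.0]), np.array([0.1, 0.0])]
far = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
print(dpp_diversity(near) < dpp_diversity(far))  # True: spread-out sets score higher
```

Identical counterfactuals yield det(K) = 0, so the determinant penalizes redundant sets.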

Notation:

We assume that we have a dataset with features F = {f_1, f_2, ..., f_k}. Each feature is either continuous (F_con ⊂ F) or categorical (F_cat ⊂ F). Each continuous feature f_i takes values in the range [r_min_i, r_max_i], which we discretize to integer values, defining Q^(f_i) = {k ∈ Z : k ∈ [r_min_i, r_max_i]}; for a categorical feature f_i, we define Q^(f_i) = {q^{f_i}_1, q^{f_i}_2, ..., q^{f_i}_{d_i}}, where the q^{f_i} are the states that feature f_i can take. Features can be mutable (F_m), conditionally mutable (F_cm), or immutable (F_⊘), according to the real-world causal processes that generate the data. Mutable features can transition between any pair of states in Q^(f_i); conditionally mutable features can transition between pairs of states only when permitted by certain conditions; and immutable features cannot be changed under any circumstances. For example, Race is an immutable feature (Mothilal et al., 2020), Age and Education are conditionally mutable (they cannot be decreased under any circumstances), and the number of work hours is mutable (it can both increase and decrease). Lastly, while continuous features inherently define an ordering on their values, categorical features can be either ordered or unordered based on their semantic meaning. For instance, Age is an ordered feature that is conditionally mutable (it can only increase).

The sampled per-feature transition cost can be summarized as follows. If the preference score p_{f_i} = 0, then C(f_i, s_i, x) = ∞ for all x ≠ s_i and C(f_i, s_i, s_i) = 0. If f_i is ordered and can only increase, then C(f_i, s_i, x) = |getPercentile(x) − getPercentile(s_i)| for x > s_i, 0 for x = s_i, and ∞ for x < s_i; if f_i can only decrease, the cases are mirrored; and if f_i can both increase and decrease, the cost is the absolute percentile difference in either direction. If f_i is unordered, C(f_i, s_i, ·) is drawn from Uniform(0, 1). Finally, if a preference vector p is provided, the cost is scaled as C(f_i, s_i, ·) ← C(f_i, s_i, ·) · (1 − p_{f_i}).
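A minimal Python sketch of the per-feature transition costs described in this section (percentile shift for ordered features, infinite cost for disallowed directions, scaling by the preference score). The function names and toy data are ours, for illustration only.

```python
import numpy as np

def get_percentile(value, population):
    # Empirical percentile of `value` in the observed population.
    return float(np.mean(np.asarray(population) <= value))

def percentile_shift_cost(old, new, population, p_fi=None, direction="both"):
    """Percentile-shift transition cost for an ordered feature, with direction
    constraints for conditionally mutable features and preference scaling."""
    if p_fi == 0:                        # feature the user will not change
        return 0.0 if new == old else float("inf")
    if direction == "increase" and new < old:
        return float("inf")
    if direction == "decrease" and new > old:
        return float("inf")
    cost = abs(get_percentile(new, population) - get_percentile(old, population))
    return cost if p_fi is None else cost * (1 - p_fi)

def linear_cost(old, new, states):
    """Linear transition cost: number of intermediate states traversed."""
    return abs(states.index(new) - states.index(old))

hours = [20, 30, 40, 40, 40, 50, 60, 70, 80]
print(percentile_shift_cost(40, 70, hours, p_fi=0.5, direction="increase"))
print(percentile_shift_cost(40, 30, hours, direction="increase"))  # inf
print(linear_cost("HS", "Masters", ["HS", "Bachelors", "Masters"]))  # 2
```

Higher preference scores make a feature cheaper to change, matching the (1 − p_{f_i}) scaling above.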

B.2.1 HIERARCHICAL COST SAMPLING PROCEDURE

To optimize EMC, we need a plausible distribution that can model users' cost functions. We propose a hierarchical cost sampling distribution which provides cost samples that are a linear combination of percentile shift cost (Ustun et al., 2019) and linear cost, where the weights of this combination are user-specific. The percentile shift cost for ordered features is proportional to the change in a feature's percentile associated with moving from an old feature value to a new one. E.g., if a user is asked to increase their number of work hours from 40 to 70, then given the whole dataset, we can estimate the percentiles of users working 40 and 70 hours a week; the cost incurred is proportional to the difference between these percentiles. The linear cost for ordered features is proportional to the number of intermediate states a user has to go through while transitioning from their current state to the final state. E.g., if a user is asked to change their education level from High-school to Masters, there are two steps involved: first they need a Bachelors degree and then a Masters degree, so the user's cost is proportional to 2.

For many set-level objectives, it is expensive to evaluate the change in the objective function when one element of the set is replaced by a new one. To evaluate the change in objective in such cases, we need to iterate over all pairs of elements in the best and candidate sets and then evaluate the objective for the whole set again. The iteration over both sets is not the hard part, but rather the computation that must be done within it. For our objective, we can compute costs for individual recourses rather than sets, meaning we can do a trivial operation to compute the benefit of each pair replacement.
However, if we wanted to do this with diversity, then for each replacement pair we would need to compute |S| additional distances, because the distance of the new replacement vector must be computed with respect to all the other vectors, for each iteration of the nested loop. This quickly makes it infeasible to improve the best set by replacing individual candidates. In contrast, for metrics where it is easy to evaluate the effect of individual elements on the objective function, we can easily merge the best set and any other set S_t from time t to monotonically improve the objective value. In our objective function, EMC, we can compute the goodness of individual counterfactuals with respect to all the Monte Carlo samples (Robert & Casella, 2010). Given a set of counterfactuals, we can obtain a matrix of incurred costs C ∈ R^{N×M}, which specifies the cost of each counterfactual under each of the Monte Carlo samples. We can use this to update the best set S_best using elements from the perturbed set S_t at time t. This procedure is defined in Algorithm 5. It iterates over all pairs of elements s_i ∈ S_best and s_j ∈ S_t and computes the change that would occur in the objective function by replacing s_i → s_j. Note that we are not recomputing the costs here. Given S_best, S_t, C_b, and C, we can guarantee that we update the best set S_best so as to improve the mean of the minimum costs incurred across all Monte Carlo samples. This is shown in Algorithm 5, and the monotonicity of the EMC objective in this case can be formally stated as Theorem 3.1 in the main paper. Proof. To prove this theorem, we construct a procedure that ensures the EMC is monotonic, and we prove that the monotonicity of EMC holds for this procedure. See Algorithm 5 for a constructive version of the proof, which is more intuitive to understand.
We start by noting that each element C^b_{ij} is the cost of the i-th counterfactual s^b_i in the best set S^best_{t−1} with respect to the cost function C_j, given by Cost(s_u, s^b_i; C_j). Similarly, C_{ij} = Cost(s_u, s_i; C_j), where s_i is the i-th candidate counterfactual. Note that the EMC is the average of the MinCost with respect to all the sampled cost functions C_j. This means that, given a pair of counterfactuals from S^best_{t−1} × S_t and for each C_j, we can compute the change in MinCost, which we describe later. A replacement can increase the cost with respect to certain cost functions, but the overall reduction depends on the aggregate change over all cost functions. Given this, for each candidate replacement pair in S^best_{t−1} × S_t, we compute the change in EMC by summing the changes in MinCost across all cost functions C_j; we call this the cost-benefit of the replacement pair. The cost-benefit can also be negative, if the candidate counterfactual increases the cost across the cost functions. The pair with the highest positive cost-benefit is applied to construct the set S^best_t; if no pair has a positive benefit, we keep S^best_t = S^best_{t−1}. Hence, this procedure monotonically reduces EMC. We now specify how the change in MinCost can be computed, to complete the proof. To compute the change in MinCost for a single cost function C_j, we first find the counterfactuals in S^best_{t−1} with the lowest and second-lowest cost, denoted s^b_{l1} and s^b_{l2}; these are the only counterfactuals that can affect the MinCost with respect to C_j. When we replace the counterfactual s^b_{l1}, which has the lowest cost for C_j, with a new candidate counterfactual s_i, the new minimum cost for C_j is min(C^b_{l2 j}, C_{ij}). Here, C^b_{l2 j} is the cost of the second-lowest-cost counterfactual for C_j.
Note that the change in this case can increase the cost, and it depends on the second-best counterfactual: once s^b_{l1} is removed from the set, the best cost for C_j is attained either by s^b_{l2} or by s_i, so we take the minimum of those two and then take the difference from C^b_{l1 j} as the change in cost. Please refer to Algorithm 5 for a more intuitive presentation of this proof.
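The replacement step can be sketched as follows. This is a simplified illustration of the idea behind Algorithm 5, with names of our own choosing; for clarity it recomputes per-sample minima directly rather than using the lowest/second-lowest bookkeeping described in the proof, which selects the same swap.

```python
import numpy as np

def best_swap(Cb, Cc):
    """Cb: N x M costs of the current best set; Cc: K x M costs of candidates.
    Returns (i, j, benefit): swapping best-set element i for candidate j
    reduces the sum over cost samples of MinCost by `benefit` (> 0 = improve)."""
    N, M = Cb.shape
    base = Cb.min(axis=0).sum()          # current sum of per-sample minimum costs
    best = (None, None, 0.0)
    for i in range(N):
        rest = np.delete(Cb, i, axis=0)  # best set without element i
        for j in range(Cc.shape[0]):
            new_min = np.minimum(rest.min(axis=0), Cc[j]) if N > 1 else Cc[j]
            benefit = base - new_min.sum()
            if benefit > best[2]:        # keep only strictly improving swaps
                best = (i, j, benefit)
    return best

Cb = np.array([[0.9, 0.9],
               [0.8, 0.2]])              # element 0 is never the per-sample minimum
Cc = np.array([[0.1, 0.7]])              # candidate is much cheaper on sample 0
i, j, benefit = best_swap(Cb, Cc)
print(i, j)  # replace the dominated element 0 with candidate 0
```

Because a swap is applied only when its benefit is positive, the sum (and hence the mean) of per-sample minimum costs never increases, mirroring the monotonicity argument above.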

B.2.3 OTHER METHODS

In this section, we describe some of the optimization methods used by relevant baselines. 1. DICE (Mothilal et al., 2020) optimizes min_{c_1,...,c_k} (1/k) Σ_{i=1}^{k} yloss(f(c_i), y) + (λ_1/k) Σ_{i=1}^{k} dist(c_i, x) − λ_2 dpp_diversity(c_1, ..., c_k), where c_i is a counterfactual, k is the number of counterfactuals, f(·) is the black-box ML model, and yloss(·) is the metric that minimizes the distance between the model's prediction and the desired outcome y. dpp_diversity(·) is the diversity metric defined in Section B.1.1, and λ_1 and λ_2 are hyperparameters that balance the components of the objective. Please refer to Mothilal et al. (2020) for more details. 2. FACE (Poyiadzi et al., 2020) operates under the idea that, to be actionable, counterfactuals should be connected to the user's state via paths that are probable under the original data distribution, i.e., high-density paths. They construct two different types of graphs, based on k-nearest neighbors (Face-Knn) and the ϵ-graph (Face-Eps). They define a geodesic distance which trades off between path length and the density along the path. Lastly, they use the shortest-path-first algorithm (Dijkstra's algorithm) to obtain the final counterfactuals. Please refer to Poyiadzi et al. (2020) for more details. 3. Actionable Recourse (Ustun et al., 2019) tries to find an action set a for a user such that taking the action changes the black-box model's decision to the desired outcome class, denoted by +1. It minimizes the cost incurred by the user while restricting the actions to a set A(x), which imposes constraints related to the feasibility and actionability of features. They optimize the log-percentile shift objective (see Section B.1.1). Their final optimization problem is min cost(a; x) s.t. f(x + a) = +1, a ∈ A(x), which is cast as an Integer Linear Program (Mittleman, 2018) to provide users with recourses. Their publicly available implementation is limited to binary categorical features, hence we demonstrate results on the binarized version of the dataset.
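The log-percentile shift objective can be sketched as follows, implementing the formula as written in Section B.1.1. The empirical CDFs, dictionaries, and names below are illustrative assumptions, not Actionable Recourse's actual API.

```python
import math

def log_percentile_cost(s, a, Q):
    """cost(a; s) = sum over actionable features j of
    log((1 - Q_j(s_j + a_j)) / (1 - Q_j(s_j))), per Section B.1.1."""
    return sum(math.log((1 - Q[j](s[j] + a_j)) / (1 - Q[j](s[j])))
               for j, a_j in a.items())

# One actionable feature whose (toy) CDF maps 40 -> 0.5 and 70 -> 0.9.
Q = {"hours": lambda v: {40: 0.5, 70: 0.9}[v]}
s = {"hours": 40}
a = {"hours": 30}            # action: increase hours by 30
print(log_percentile_cost(s, a, Q))  # log(0.1 / 0.5) = log(0.2)
```

The ILP then searches over discretized actions a ∈ A(x) for the cost-minimizing one that flips the model's decision.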



Our code is uploaded as supplementary material. The code for the Actionable Recourse method (Ustun et al., 2019) requires binary categorical variables. The binary datasets can be downloaded from https://github.com/carla-recourse/cf-data, whereas the non-binary data can be found at https://github.com/interpretml/DiCE. Please refer to this example, where the authors mention these restricted abilities: https://github.com/ustunb/actionable-recourse/blob/master/examples/ex_01_quickstart.ipynb



Figure 1: Diagram showing the intuition behind the Expected Minimum Cost objective. This figure represents an abstract cost function space where squares denote cost function samples, colored the same if they are similar and form a cluster. We aim to find a recourse set where each cf (here, {s1, s2, s3}) does well under a particular cluster of cost functions. The shaded large circles each represent a single cf s_i that caters to the enclosed cost functions. Here, the user's hidden ground-truth cost function (grey circle) is served well by s1.

Sparsity (Mothilal et al., 2020) is defined as spar(x, S) = 1 − (1/(|S| · d)) Σ_{i=1}^{|S|} Σ_{j=1}^{d} 1{x_j ≠ S_ij}, where d is the number of features. Feature diversity (Mothilal et al., 2020) is defined as div(S) = (1/Z) Σ_{i=1}^{|S|−1} Σ_{j=i+1}^{|S|} dist(S_i, S_j), where Z is the number of terms in the double summation. Validity is defined as val(S) = |{unique s_i ∈ S : f(s_i) = +1}| / |S|.
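These metrics can be sketched as follows; this is a minimal illustration with a placeholder distance, and the function names are ours.

```python
import numpy as np

def sparsity(x, S):
    """spar(x, S) = 1 - fraction of (counterfactual, feature) entries changed."""
    return 1.0 - float(np.mean(np.asarray(S) != x))

def feature_diversity(S, dist):
    """Average pairwise distance between counterfactuals in the set."""
    pairs = [(i, j) for i in range(len(S)) for j in range(i + 1, len(S))]
    return sum(dist(S[i], S[j]) for i, j in pairs) / len(pairs)

def validity(S, f):
    """Fraction of unique counterfactuals classified as the desired outcome (+1)."""
    unique = {tuple(s) for s in S}
    return sum(f(np.array(s)) == 1 for s in unique) / len(S)

x = np.array([0, 0, 0])
S = [np.array([1, 0, 0]), np.array([0, 1, 0])]
l1 = lambda a, b: float(np.abs(a - b).sum())
print(sparsity(x, S))            # 1 - 2/6: one of three features changed per cf
print(feature_diversity(S, l1))  # 2.0: the two cfs differ in two feature slots
```

Deduplicating in `validity` mirrors the "unique" condition in the definition above, so copies of one counterfactual cannot inflate the score.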

Figure 2: This figure shows the performance of the method on FS@k when recourses are generated with Monte Carlo cost samples from a distribution with α-weight varying between 0 and 1, while the user costs follow different α-weight values, also varying between 0 and 1. Performance is robust to misspecification of α. Refer to Section A.2 for more details.

ADDITIONAL RESEARCH QUESTIONS

Q5. Robustness to misspecification between the population's true cost distribution and the proposed one? Design: Our D_mix distribution samples costs by taking an α-weighted combination of linear and percentile costs. These two costs make different underlying assumptions about how users view the cost of transitioning between states. We want to test the robustness of our method to misspecification of users' disposition toward these cost types. We perform a robustness analysis in which the users' cost functions use a different α mixing weight than the Monte Carlo samples we use to optimize EMC. This creates a shift between the user cost function distribution (D_test) and the Monte Carlo sampling distribution (D_train) used in EMC. We vary the α-weights of the user and Monte Carlo distributions within the range 0 to 1 in steps of 0.2. At the extreme values α = 0, 1, the shifts are drastic, as the underlying distribution changes completely: when the Monte Carlo α-weight is 0 and the user α-weight is 1, then D_train = D_perc and D_test = D_lin; similarly, in the opposite case, D_train = D_lin and D_test = D_perc. Note that D_lin and D_perc rest on completely different underlying principles and are two entirely different distributions; hence, the corners of the heatmap represent drastic distribution shifts.

User satisfaction for the top-left corner (D_train = D_perc and D_test = D_lin) is similar to the bottom-left corner (D_train = D_lin and D_test = D_lin). The same holds for the opposite case, given by the top-right (D_train = D_perc and D_test = D_perc) and bottom-right (D_train = D_lin and D_test = D_perc) corners. This means that even under a complete distribution shift, user satisfaction remains similar. This can be attributed to the hierarchical user-preference sampling step in the procedure: the preference values can be arbitrary, and they scale the raw percentile and linear costs, so the distribution, designed this way, models extremely diverse types of transition costs.
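The α-weighted combination underlying D_mix can be sketched as follows, with simplified stand-ins for the two components: a linear cost (preference-weighted absolute feature change) and a percentile cost (preference-weighted shift in the empirical CDF). The exact cost definitions in the paper may differ; this only illustrates the mixing.

```python
def mixed_transition_cost(x, cf, prefs, data, alpha):
    """Toy alpha-mixture of linear and percentile transition costs.
    alpha = 1 recovers the purely linear cost (D_lin end of the heatmap),
    alpha = 0 the purely percentile cost (D_perc end)."""
    n = len(data)
    cost = 0.0
    for j, p in enumerate(prefs):
        lin = abs(cf[j] - x[j])                       # raw linear change
        cdf = lambda v: sum(row[j] <= v for row in data) / n
        perc = abs(cdf(cf[j]) - cdf(x[j]))            # shift in empirical CDF
        cost += p * (alpha * lin + (1 - alpha) * perc)
    return cost
```

Sweeping `alpha` for the sampler and the simulated users separately reproduces the train/test grid of the robustness experiment.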

Figure 4: Performance of different recourse methods as the budget is increased. These are the averages across 5 different runs, along with standard-deviation error bars. For some methods the standard deviation is very low and hence not visible as bars in the plot. As the budget increases, the performance of COLS and P-COLS increases. Please refer to Section A.2 for more details.

Figure 6: Performance of the COLS method as the number of Monte Carlo samples increases. These are the averages across 5 different runs, along with standard-deviation error bars. There is a steep increase, after which the performance saturates. This implies that, in practice, we do not need a large number of samples to converge to high user satisfaction. Refer to Section A.2 for more details.

Figure 7: Instructions for Human Evaluation. Please refer to Section A.1.4 for more details.

Figure 8: Screenshots of how the human evaluation test was conducted. Please refer to Section A.1.4 for more details.

Either C^b_{l1,j} > C_{ij} or C^b_{l1,j} ≤ C_{ij}. In the case where the candidate s_i has a lower cost for C_j than C^b_{l1,j}, i.e., C^b_{l1,j} > C_{ij}, the replacement reduces the cost by C^b_{l1,j} − C_{ij}. In the case where the candidate's cost for C_j, C_{ij}, is at least the lowest cost in the best set, i.e., C^b_{l1,j} ≤ C_{ij}, the replacement increases the cost for C_j by min(C_{ij}, C^b_{l2,j}) − C^b_{l1,j}, where C^b_{l2,j} denotes the second-lowest cost in the best set under C_j.
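This case analysis is what makes a greedy swap safe: if a replacement is kept only when it lowers the objective, EMC can never worsen. A small sketch of such a swap step over cost matrices (function names are ours, and this is a simplification of the algorithm, not its full implementation):

```python
def emc(C):
    """Mean over sampled cost functions of the cheapest counterfactual's cost."""
    m = len(C[0])
    return sum(min(row[j] for row in C) for j in range(m)) / m

def try_replace(C_best, c_cand):
    """Greedily try swapping the candidate's cost row into each slot of the
    best set; a swap is kept only when it strictly lowers EMC, so EMC is
    monotonically non-increasing across iterations."""
    best = [list(r) for r in C_best]
    for l in range(len(best)):
        trial = [list(r) for r in best]
        trial[l] = list(c_cand)
        if emc(trial) < emc(best):
            best = trial
    return best
```

A candidate that is cheap under some cost functions but expensive under others is only accepted if the per-function minima, on balance, improve.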

1. DiCE (Mothilal et al., 2020) performs gradient-based optimization in this continuous space while optimizing the objective defined in Section B.1.1. Their final objective function is

C(x) = arg min_{c_1,...,c_k} (1/k) Σ_{i=1}^{k} yloss(f(c_i), y) + (λ_1/k) Σ_{i=1}^{k} dist(c_i, x) − λ_2 dpp_diversity(c_1, ..., c_k).
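Evaluating this objective for a candidate set can be sketched as follows, under simplifying assumptions of ours: a squared-error `yloss`, a mean-absolute-difference `dist`, and mean pairwise distance as a stand-in for the DPP diversity term (DiCE's actual dpp_diversity is determinant-based).

```python
def dice_objective(cfs, x, f, y=1.0, lam1=0.5, lam2=1.0):
    """Value of a DiCE-style objective for a candidate cf set (sketch only;
    gradient-based minimization over the cfs is omitted)."""
    k, d = len(cfs), len(x)
    yloss = sum((f(c) - y) ** 2 for c in cfs) / k           # pull predictions to y
    prox = lam1 / k * sum(sum(abs(ci - xi) for ci, xi in zip(c, x)) / d
                          for c in cfs)                      # proximity to the user
    pairs = [(i, j) for i in range(k - 1) for j in range(i + 1, k)]
    div = sum(sum(abs(a - b) for a, b in zip(cfs[i], cfs[j])) / d
              for i, j in pairs) / len(pairs)                # diversity stand-in
    return yloss + prox - lam2 * div
```

Lower values are better: the diversity term enters with a negative sign, so spreading the cfs apart reduces the objective.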

Recourse method performance across various cost and distance metrics (Section 4.1). The numbers reported are averaged across 5 different runs. For all metrics, higher is better, except for PAC, where lower is better.

Ablation results with search algorithms trained on different objectives.

Percentage of times each method was preferred by human annotators (Fleiss kappa=0.74 and p=1e-4).

Fairness analysis of recourse methods for gender-based subgroups. DIR: Disparate Impact Ratio; M: Male, F: Female.

Table containing data statistics and black-box model details. The binary versions of the datasets are taken from Pawelczyk et al. (2021), whereas the non-binary versions are taken from Mothilal et al. (2020).

Fairness analysis of recourse methods for subgroups with respect to Race.

Table comparing different recourse methods across various cost and distance metrics on the non-binary versions of the datasets (Section A.1.1). The numbers reported are averaged across 5 different runs. Variance values have been omitted, as 89% of them were lower than 0.05, with the maximum being 0.86. FS@1: fraction of users satisfied at k = 1. PAC: Population Average Cost. Cov: Population Coverage. For all metrics, higher is better except for PAC, where lower is better.

Table comparing different recourse methods across various cost and distance metrics for a black-box model trained with a different seed but belonging to the same model family. The numbers reported are averaged across 5 different runs.

Table comparing different recourse methods across various cost and distance metrics for a logistic regression black-box model. The numbers reported are averaged across 5 different runs.

Cost metrics for additional k values in FS@k for the results presented in the main Table 1.

Algorithm 2 Sampling procedure for Percentile Transition Costs
Input: State vector s; Optional: feature preference scores p
Output: Percentile-based transition cost functions C
function PerCost(s, p = None)
    forall f_i ∈ F do    // s_i: value of feature f_i in s

B.1 (Monotonicity of the Cost-Optimized Local Search Algorithm). Given the best set S^best_{t−1} ∈ R^{N×d}, the candidate counterfactuals at iteration t, S_t ∈ R^{N×d}, and the matrices C^b ∈ R^{N×M} and C ∈ R^{N×M} containing the incurred cost of each counterfactual in S^best_{t−1} and S_t with respect to all M sampled cost functions {C_i}^M_{i=1}, there always exists an S^best_t constructed from S^best_{t−1} and S_t such that EMC(s_u, S^best_t) ≤ EMC(s_u, S^best_{t−1}).


We provide qualitative examples for two users from the dataset. For each user, we show the state vector, the features the user is willing to edit, the preference scores for those editable features, the recourses provided, and the cost of the generated recourses. In the first example, the user strongly prefers the feature Capital Loss, and the recourse that suggests editing it has the lowest cost for the user, whereas the recourse that changes both Occupation and Capital Loss has the highest cost, since it changes multiple features. For the second user, the most preferred feature is Education-Num, but the change suggested in the recourse requires three steps (7 → 8 → 9 → 10), so the cost of that recourse is not the lowest, though still relatively low. Instead, the recourse suggesting a smaller change to Capital Loss, the second most preferred feature, has the lowest cost for the user.

Design: In this experiment on the Adult-Income dataset, we measure the change in performance of all the methods as the number of accesses to the black-box model (the budget) increases. Ideally, a good recourse method should be able to exploit these extra queries to satisfy more users. We vary the allocated budget over the set {500, 1000, 2000, 3000, 5000, 10000} and report FS@1. We run the experiment on a random subset of 100 users for 5 independent runs and report the average performance with standard-deviation error bars in Figure 4.

Results:

In Figure 4, we can see that as the allocated budget increases, the performance of COLS and P-COLS increases and then saturates. This suggests that our methods can exploit the additional black-box access to improve performance. Other methods like AR and Face-Knn also show improvement, but COLS and P-COLS consistently upper-bound their performance. Our methods satisfy approximately 70% of the users with a small budget of 500 and begin to saturate around a budget of 1000. This suggests that our methods are suitable even under tight budget constraints, as they achieve good performance rapidly; for example, when a deployed recourse method has to cater to a large population, budget constraints may require reaching good-quality solutions quickly. Lastly, for DICE and Random search, FS@1 increases by a very small margin and then stays constant, as these methods optimize for objectives that do not align well with user satisfaction, as demonstrated in Section 4.2.

Q8. Does providing more options to users help?

Design: In this experiment, we measure the effect of the flexibility to provide the user with more options, i.e., a bigger set S. The question is whether methods can effectively exploit this advantage and provide lower-cost solution sets such that overall user satisfaction improves. In this experiment on the Adult-Income dataset, we take a random subset of 100 users, fix the budget to 5000 and the number of Monte Carlo cost samples to 1000, and vary the size of the set S over {1, 2, 3, 5, 10, 20, 30}. We cap the set size at 30 because, beyond a point, it becomes hard for users to evaluate all the recourse options and decide which one to act upon.
We run 5 independent runs for all the data points and plot the mean performance along with standard-deviation error bars.

Input: State vector s; Optional: preferred features F_p, feature preference scores p, cost-type mixing weight α
Output: Preference scores p and the cost functions C
    C_p ← Beta(C^(Mix), σ = 0.01)    ▷ Beta parametrized with mean and variance
    return p, C_p
end
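The Beta step above draws a sample around the mixed cost using a Beta distribution re-parametrized by its mean and variance. A sketch of that re-parametrization follows; interpreting σ as the variance (rather than the standard deviation) is our assumption, and the function name is ours.

```python
import random

def beta_around(mean, var=0.01, rng=random):
    """Sample from a Beta distribution re-parametrized by mean and variance,
    jittering a mixed cost in [0, 1] around its value, as in the
    C_p <- Beta(C_mix, sigma) step. Requires var < mean * (1 - mean)."""
    mean = min(max(mean, 1e-3), 1 - 1e-3)      # keep the mean strictly inside (0, 1)
    var = min(var, mean * (1 - mean) / 2)      # keep the Beta parameters valid
    nu = mean * (1 - mean) / var - 1           # "sample size" parameter a + b
    return rng.betavariate(mean * nu, (1 - mean) * nu)
```

Smaller variances concentrate the samples near the mixed cost, so the sampled cost functions stay close to the α-mixture while still modeling user-level noise.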

B.2.2 MERGING COUNTERFACTUAL SETS

When searching for a good solution set, it would be useful to have the option of improving on the best set obtained so far using individual counterfactuals from the next candidate set, rather than waiting for a new, higher-scoring set to come along. For objectives like diversity, which operate over all pairs of elements in the set, it is computationally complex to

