COVARIANCE-ROBUST MINIMAX PROBABILITY MACHINES FOR ALGORITHMIC RECOURSE

Abstract

Algorithmic recourse is emerging as a prominent technique for promoting the explainability and transparency of predictive models in ethical machine learning. Existing approaches to algorithmic recourse often assume an invariant predictive model; in reality, however, the model is usually updated over time as new data arrive. Thus, a recourse that is valid with respect to the present model may become invalid for a future model. To resolve this issue, we propose a pipeline to generate a model-agnostic recourse that is robust to model shifts. Our pipeline first estimates a linear surrogate of the nonlinear (black-box) model using covariance-robust minimax probability machines (MPM); then, the recourse is generated with respect to this robust linear surrogate. We show that the covariance-robust MPM recovers popular regularization schemes, including $\ell_2$-regularization and class-reweighting. We also show that our covariance-robust MPM pushes the decision boundary in an intuitive manner, which facilitates an interpretable generation of a robust recourse. The numerical results demonstrate the usefulness and robustness of our pipeline.

1. INTRODUCTION

The recent prevalence of machine learning (ML) in supporting consequential decisions involving humans such as loan approval (Moscato et al., 2021) , job hiring (Cohen et al., 2019; Schumann et al., 2020) , and criminal justice (Brayne & Christin, 2021) urges the need of transparent ML systems with explanations and feedback to users (Doshi-Velez & Kim, 2017; Miller, 2019) . One popular and emerging approach to providing feedback is the algorithmic recourse (Ustun et al., 2019) . A recourse suggests how the input instance should be modified to alter the outcome of a predictive model. Consider a specific scenario in which an individual is rejected from receiving a loan by a financial institution's ML model. Recently, it has become a legal necessity to provide explanations and recommendations to the individual so that they can improve their situation and obtain a loan in the future (GDPR, Voigt & Von dem Bussche (2017)). For example, an explanation can be "increase the income to $5000" or "reduce the debt/asset ratio to below 20%". Leveraging the recourses, financial institutions can assess the reliability of their ML predictive models and increase user engagement through actionable feedback and acceptance guarantee if they fulfill the requirements. To construct plausible and meaningful recourses, one must assess and strike a balance between conflicting criteria. They can be: (1) validity, a recourse should effectively reverse the unfavorable prediction of the model into a favorable one, (2) proximity, recourse should be close to the original input instance to alleviate the efforts required, and thus to encourage the adoption of the recourse, (3) actionability, prescribed modifications should follow causal laws of our society (Ustun et al., 2019; Karimi et al., 2021) ; for example, one can not modify their race or decrease their age. 
Various techniques were proposed to devise algorithmic recourses for a given predictive model, extensive surveys are provided in (Karimi et al., 2020a; Stepin et al., 2021; Pawelczyk et al., 2021; Verma et al., 2020) . Wachter et al. (2017) introduced the definition of counterfactual explanations and proposed a gradient-based approach to find the nearest instance that yields a favorable outcome. Ustun et al. (2019) proposed a mixed integer programming formulation (AR) that can find recourses for a linear classifier with a flexible design of the actionability constraints. Alternatively, Karimi et al. (2021; 2020b) investigated the nearest recourse through the lens of minimal intervention to take causal relationships between features into account. Recent works including Russell (2019) and Mothilal et al. (2020) also studied the problem of generating a menu of diverse recourses to provide multiple possibilities that users might choose. The aforementioned methods rely on an assumption of an invariant predictive model. Nevertheless, machine learning models are usually re-trained or re-calibrated as new data arrive. Thus, a valid recourse at present may become invalid in the future, leading to an exemplary case where a rejected applicant may spend efforts to improve their income and reapply for a loan, but then is rejected (again) simply because the ML model has been updated. This leads to a potential inefficiency due to the waste of resources and loss of trust in the recommendation and in the ML system (Rudin, 2019) . Studying this phenomenon, Rawal et al. (2020) described several types of model shifts related to the correction, temporal, and geospatial shifts from data. They pointed out that the recourses, even constructed with state-of-the-art algorithms, are vulnerable to distributional shifts in the model's parameters. Pawelczyk et al. 
(2020) study counterfactual explanations under predictive multiplicity and its relation to the difference in the way two classifiers treat predicted individuals. Black et al. (2021) then show that the constructed recourses might be invalid even for the model retrained with different initial conditions such as weight initialization and leave-one-out variations in data. Recently, Upadhyay et al. (2021) leveraged robust optimization to propose ROAR -a framework for generating recourses that are robust to shifts in the predictive model, which is assumed to be a linear classifier. Despite the promising results, existing methods are often restricted to the linear classifiers setting to be able to introduce actionability or robustness (Ustun et al., 2019; Russell, 2019; Upadhyay et al., 2021; Rawal et al., 2020) . For non-linear classifiers, a linear surrogate method such as LIME (Ribeiro et al., 2016) is used to approximate the local decision boundary of the black-box classifiers; the recourse is then generated respectively to the (linear) surrogate model instead of the nonlinear model. LIME is well-known for explaining predictions of black-box ML models by fitting a reweighted linear regression model to the perturbed samples around an input instance. In the recourse literature, LIME is the most common linear surrogate for the local decision boundary of the black-box models (Ustun et al., 2019; Upadhyay et al., 2021) . Unfortunately, the LIME surrogate has several limitations. Firstly, Laugel et al. (2018) and White & Garcez (2019) showed that LIME may not be faithful to the underlying models because LIME might be influenced by input features at a global scale rather than a local scale. Secondly, explanations generated by perturbation-based methods are also well-known to be sensitive to the original input and the synthesized perturbations (Alvarez-Melis & Jaakkola, 2018; Ghorbani et al., 2019; Slack et al., 2020; 2021; Agarwal et al., 2021; Laugel et al., 2018) . 
Several works have been proposed to overcome these issues. Laugel et al. (2018) and Vlassopoulos et al. (2020) proposed alternative sampling procedures that generate sample instances in the neighborhood of the closest counterfactual to fit a local surrogate. White & Garcez (2019) integrated counterfactual explanations into local surrogate models to introduce a novel fidelity measure of an explanation. Later, Garreau & von Luxburg (2020) and Agarwal et al. (2021) theoretically analyzed the stability of LIME, especially in the low sample size regime. Zhao et al. (2021) leveraged Bayesian reasoning to improve the consistency of repeated explanations of a single prediction. Nevertheless, the impact and effectiveness of these surrogates on recourse generation are still unknown.
Contributions. We revisit the recourse generation scheme through surrogate models. We propose a novel model-agnostic pipeline that facilitates the generation of robust and actionable recourses. The core innovation in our pipeline is the use of the covariance-robust minimax probability machines (MPM) as a linear surrogate of the nonlinear black-box ML model. Additionally, we contribute
• to the field of MPM and robust classifiers: We propose and analyze in detail the covariance-robust MPMs, in which the set of possible perturbations of the covariance matrices is prescribed using distances on the space of positive semidefinite matrices. Motivated by the statistical distances between Gaussian distributions, we show that covariance robustness induces and connects to two prominent regularization schemes of the nominal MPM: if the distance is motivated by the Bures distance, we recover $\ell_2$-regularization; if the distance is motivated by the Fisher-Rao distance, we recover class reweighting schemes. While prior works showed that distributionally robust optimization (DRO) with optimal transport distance recovers norm regularization (Shafieezadeh-Abadeh et al., 2019b) and that $f$-divergence DRO leads to reweighting (Duchi & Namkoong, 2019), this paper extends these connections to MPMs.
• to the field of robust algorithmic recourse: We propose an intuitive and interpretable approach to generate robust recourse. We show that, by calibrating the radii of the ambiguity sets in a proper manner, the covariance-robust MPM shifts the separating hyperplane towards the favorable class. As a consequence, our recourse exhibits robustness to model shifts, and it also readily accommodates mixed-integer constraints to promote actionability.
This paper unfolds as follows. In Section 2, we delineate our explanation framework using MPM. Section 3 dives deeper into the MPM problem and its robustification. Sections 4 and 5 construct two types of covariance-robust MPM using the Bures and Fisher-Rao distances on the space of covariance matrices. In Section 6, we demonstrate empirically that the covariance-robust MPM provides a competitive approximation of the local decision boundary and improves the robustness of the recourse subject to model shifts. All proofs are relegated to the appendix.

2. RECOURSE GENERATION FRAMEWORK

Throughout this paper, we assume that the covariate space is $\mathcal{X} = \mathbb{R}^d$ and that the label space is binary, $\mathcal{Y} = \{-1, +1\}$. Without loss of generality, we assume that label $-1$ encodes the unfavorable decision, while $+1$ encodes the favorable one. Given a specific classifier and an input $x_0$ with an unfavorable predicted outcome, the goal of this paper is to find a recourse recommendation for $x_0$ that has a high probability of being classified into the favorable group, subject to possible shifts in the parameters underlying the classifier. Such a recourse is termed a robust recourse. Our robust recourse generator consists of three components (see Figure 1 for a schematic view):
(i) a local sampler: We use a procedure similar to that of Vlassopoulos et al. (2020) and Laugel et al. (2018). Given an instance $x_0$, we choose the $k$ nearest counterfactuals $x_1, \ldots, x_k$ from the training data that have the opposite label to $x_0$. For each $x_i$, we perform a line search to find a point $x_{b,i}$ that lies on the decision boundary and on the line segment between $x_0$ and $x_i$. Among the points $x_{b,i}$, we choose the point $x_b$ nearest to $x_0$ and sample uniformly in an $\ell_2$-neighborhood of radius $r_p$ around $x_b$. We then query the black-box classifier to obtain the predicted labels of the synthetic samples.
(ii) a linear surrogate using (covariance-robust) MPM: We use the synthetic samples to estimate the moment information $(\hat\mu_y, \hat\Sigma_y)$ of the covariate conditional on each predicted class $y$. We then train a covariance-robust MPM parametrized by $\theta^\phi$ to approximate the local decision boundary of the ML model.
(iii) a recourse search: In principle, we can apply any existing recourse search for linear models on top of the linear surrogate $\theta^\phi$ dictated by the covariance-robust MPM to find a robust recourse. In this paper, we use either a simple projection onto the hyperplane prescribed by $\theta^\phi$, for simplicity, or AR (Ustun et al., 2019), a MIP-based framework, to promote actionable recourses.
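The local sampler in component (i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of bisection as the line search, and the toy classifier `predict` are all assumptions made here for the example.

```python
import numpy as np

def boundary_point(predict, x0, x_cf, tol=1e-6):
    """Bisection on the segment [x0, x_cf]; assumes predict(x0) != predict(x_cf)."""
    lo, hi = 0.0, 1.0
    y0 = predict(x0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if predict(x0 + mid * (x_cf - x0)) == y0:
            lo = mid          # still on the side of x0
        else:
            hi = mid          # already on the opposite side
    return x0 + hi * (x_cf - x0)

def sample_l2_ball(center, radius, n, rng):
    """Uniform samples in an l2-ball: random direction times radius * U^(1/d)."""
    d = center.shape[0]
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return center + radii * directions

def local_samples(predict, x0, X_train, y_train, k, r_p, n, rng):
    y0 = predict(x0)
    opposite = X_train[y_train != y0]
    order = np.argsort(np.linalg.norm(opposite - x0, axis=1))
    nearest = opposite[order[:k]]                      # k nearest counterfactuals
    candidates = [boundary_point(predict, x0, xi) for xi in nearest]
    x_b = min(candidates, key=lambda p: np.linalg.norm(p - x0))
    X_syn = sample_l2_ball(x_b, r_p, n, rng)
    y_syn = np.array([predict(x) for x in X_syn])      # query the black box
    return x_b, X_syn, y_syn

# Toy black box (hypothetical): favorable when the first feature is at least 1.
predict = lambda x: 1 if x[0] >= 1.0 else -1
rng = np.random.default_rng(0)
X_train = rng.uniform(-2.0, 4.0, size=(200, 2))
y_train = np.array([predict(x) for x in X_train])
x0 = np.array([0.0, 0.0])
x_b, X_syn, y_syn = local_samples(predict, x0, X_train, y_train,
                                  k=10, r_p=0.5, n=500, rng=rng)
```

The class-conditional means and covariances of `X_syn` given `y_syn` would then serve as the moment estimates used in component (ii).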
Central to the success of our pipeline is the possibility of shifting the MPM classification hyperplane toward the region of the favorable class, which induces robust recourse with respect to model shifts in a geometrically intuitive manner (see Remark 4.6 for a detailed discussion). There is a clear distinction between our pipeline and the existing method ROAR (Upadhyay et al., 2021): ROAR uses a non-robust surrogate in Step (ii) and then formulates a min-max optimization problem in Step (iii) for the recourse search, whilst our pipeline uses a robust surrogate in Step (ii) and then employs a simple recourse search in Step (iii). Note that mixed-integer formulations can be injected in Step (iii) to generate more realistic robust recourses in our pipeline. In contrast, mixed-integer constraints are not easy to integrate into the min-max formulation of ROAR. Sections 3-5 describe in detail different methods to build robust surrogates in Step (ii). The application of the recourse search in Step (iii) is provided in the experiment section (Section 6.2).
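The simple projection used in Step (iii) can be illustrated with a few lines of code. The sketch assumes the surrogate classifies $x$ as favorable when $w^\top x - b \ge 0$; this sign convention, the function name, and the optional `margin` argument are assumptions introduced for the example.

```python
import numpy as np

def project_onto_hyperplane(x0, w, b, margin=0.0):
    """Closest point (in l2) to x0 that satisfies w @ x - b = margin."""
    w = np.asarray(w, dtype=float)
    step = (margin - (w @ x0 - b)) / (w @ w)
    return x0 + step * w

# Hypothetical surrogate parameters and input.
w = np.array([1.0, 2.0])
b = 3.0
x0 = np.array([0.0, 0.0])          # w @ x0 - b < 0: unfavorable side
x_r = project_onto_hyperplane(x0, w, b)
```

A small positive `margin` moves the recourse strictly inside the favorable half-space of the surrogate, at the price of a slightly larger distance from `x0`.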

3. (COVARIANCE-ROBUST) MPM

Figure 2: An intuitive explanation of the robustification mechanism. From left to right: as the radius $\rho_{-1}$ increases, the worst-case covariance matrix of class $-1$ is inflated and shifts the MPM boundary towards the favorable class. The projection of the input $x_0$ onto the hyperplane tends to lie deeper in the favorable region and may become more robust to model shifts.
MPM is a binary classification framework pioneered by Lanckriet et al. (2001) and extended to the Quadratic MPM in Lanckriet et al. (2003). For each class $y \in \mathcal{Y}$, MPM makes no assumption on the specific parametric form of the (conditional) distribution $P_y$ of $X | Y = y$. Instead, MPM assumes that we can identify $P_y$ only up to the first two moments, i.e., it assumes that $P_y$ has mean vector $\hat\mu_y \in \mathbb{R}^d$ and covariance matrix $\hat\Sigma_y \in \mathbb{S}^d_+$, denoted $P_y \sim (\hat\mu_y, \hat\Sigma_y)$. These moments can be estimated from the samples synthesized by the boundary sampler. The MPM then seeks a linear classifier $C_\theta$, parametrized by $\theta = (w, b)$, that minimizes the worst-case misclassification probability
$$\min_{\theta \in \Theta} \; \max_{y \in \mathcal{Y}} \; \max_{P_y \sim (\hat\mu_y, \hat\Sigma_y)} P_y(C_\theta(X) \neq y), \tag{1}$$
where we define the feasible set $\Theta \triangleq \{\theta = (w, b) \in \mathbb{R}^{d+1} : w \neq 0\}$. Notice that the constraint $w \neq 0$ eliminates trivial solutions to the classification problem. To derive the MPM, we define the set of feasible slopes
$$\mathcal{W} = \Big\{ w \in \mathbb{R}^d \setminus \{0\} : \sum_{y \in \mathcal{Y}} y\, w^\top \hat\mu_y = 1 \Big\},$$
which is a hyperplane in $\mathbb{R}^d$. The main instrument for solving (1) is the following result from Lanckriet et al. (2001, §2), which provides the form of its optimal solution.
Lemma 3.1 (Optimal solution). Let $w^\star$ be an optimal solution to the second-order cone program
$$\min_{w \in \mathcal{W}} \; \sum_{y \in \mathcal{Y}} \sqrt{w^\top \hat\Sigma_y w}, \tag{2}$$
and let $\kappa^\star = \big(\sum_{y \in \mathcal{Y}} \sqrt{(w^\star)^\top \hat\Sigma_y w^\star}\big)^{-1}$ and $b^\star = (w^\star)^\top \hat\mu_{+1} - \kappa^\star \sqrt{(w^\star)^\top \hat\Sigma_{+1} w^\star}$. Then $\theta^\star = (w^\star, b^\star)$ solves problem (1).
In this paper, we refer to the second-order cone program (2) as the nominal MPM problem, because the MPM is fully determined by the solution to (2). We next discuss the covariance-robust MPM.
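To make the nominal problem (2) concrete, the sketch below solves it with a simple majorize-minimize iteration that repeatedly replaces each square-root term by a quadratic upper bound. This is only an illustration under stated assumptions: the iteration is a swapped-in technique rather than the second-order cone solver a reader would normally use, the function name and the toy moments are hypothetical, and both covariance matrices are assumed positive definite.

```python
import numpy as np

def nominal_mpm(mu_pos, mu_neg, Sigma_pos, Sigma_neg, iters=200, eps=1e-12):
    """Approximately solve min sqrt(w'S+ w) + sqrt(w'S- w) s.t. w'(mu+ - mu-) = 1."""
    a = mu_pos - mu_neg
    w = a / (a @ a)                          # feasible start: a @ w == 1
    for _ in range(iters):
        s_pos = np.sqrt(w @ Sigma_pos @ w) + eps
        s_neg = np.sqrt(w @ Sigma_neg @ w) + eps
        # sqrt(t) <= sqrt(t0) + (t - t0) / (2 sqrt(t0)) gives a quadratic bound w'Mw
        M = Sigma_pos / s_pos + Sigma_neg / s_neg
        z = np.linalg.solve(M, a)
        w = z / (a @ z)                      # minimizer of w'Mw subject to a'w = 1
    s_pos = np.sqrt(w @ Sigma_pos @ w)
    s_neg = np.sqrt(w @ Sigma_neg @ w)
    kappa = 1.0 / (s_pos + s_neg)
    b = w @ mu_pos - kappa * s_pos           # threshold, as in the robust variants
    return w, b, kappa

# Hypothetical class-conditional moments.
mu_pos, mu_neg = np.array([2.0, 1.0]), np.array([0.0, 0.0])
Sigma_pos = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma_neg = np.array([[0.5, -0.2], [-0.2, 1.0]])
w, b, kappa = nominal_mpm(mu_pos, mu_neg, Sigma_pos, Sigma_neg)
```

By construction, the returned threshold `b` lies strictly between `w @ mu_neg` and `w @ mu_pos`, so the two class means fall on opposite sides of the hyperplane.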

3.1. QUADRATIC MPM

In a practical setting, it is likely that the covariance matrices $\hat\Sigma_y$ are misspecified, for example, due to low sample size, statistical estimation error, or corrupted data. To hedge against these mismatches, Lanckriet et al. (2003) proposed to add another layer of robustness by allowing the mean vectors and the covariance matrices of the conditional distributions to be chosen (adversarially) in a prescribed set, which we call the ambiguity set. They showed that perturbing the mean vectors does not change the optimal classifier. In this paper, we therefore perturb only the covariance matrices. More specifically, we allow the conditional distribution $P_y$ to be in the ambiguity set
$$\mathcal{U}^\phi_y(\hat P_y) = \big\{ P_y : P_y \sim (\hat\mu_y, \Sigma_y),\ \phi(\Sigma_y \,\|\, \hat\Sigma_y) \le \rho_y \big\},$$
where $\phi$ is a measure of dissimilarity between covariance matrices. The distributionally robust minimax probability machine is formulated as
$$\min_{\theta \in \Theta} \; \max_{y \in \mathcal{Y}} \; \max_{P_y \in \mathcal{U}^\phi_y(\hat P_y)} P_y(C_\theta(X) \neq y). \tag{3}$$
Previously, Lanckriet et al. (2003) considered the robust MPM with moment uncertainty, in which the covariance matrix is perturbed using the quadratic divergence.
Definition 3.2 (Quadratic divergence). Given two positive semidefinite matrices $\Sigma, \hat\Sigma \in \mathbb{S}^d_+$, the quadratic divergence between them is $Q(\Sigma \,\|\, \hat\Sigma) = \operatorname{Tr}\big[(\Sigma - \hat\Sigma)^2\big]$.
The divergence $Q$ is the squared Frobenius norm of $\Sigma - \hat\Sigma$; thus $Q$ is non-negative and vanishes if and only if $\Sigma = \hat\Sigma$, so it is a divergence on $\mathbb{S}^d_+$. The Quadratic MPM has the following form (Lanckriet et al., 2003).
Theorem 3.3 (Quadratic MPM). Suppose that $\phi \equiv Q$. Let $w^Q$ be a solution to the problem
$$\min_{w \in \mathcal{W}} \; \sum_{y \in \mathcal{Y}} \sqrt{w^\top (\hat\Sigma_y + \sqrt{\rho_y}\, I) w}. \tag{4}$$
Then $\theta^Q = (w^Q, b^Q)$ solves the distributionally robust MPM problem (3), with
$$\kappa^Q = \Big( \sum_{y \in \mathcal{Y}} \sqrt{(w^Q)^\top (\hat\Sigma_y + \sqrt{\rho_y}\, I) w^Q} \Big)^{-1}, \qquad b^Q = (w^Q)^\top \hat\mu_{+1} - \kappa^Q \sqrt{(w^Q)^\top (\hat\Sigma_{+1} + \sqrt{\rho_{+1}}\, I) w^Q}.$$
The Quadratic MPM can be considered as a regularization of the nominal problem (2): a diagonal matrix $\sqrt{\rho_y}\, I$ is added to each matrix $\hat\Sigma_y$, making the matrix better conditioned. This is equivalently known as inverse regularization, which ensures invertibility when $\hat\Sigma_y$ is low-rank and $\rho_y > 0$.
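A quick numerical check can illustrate where the term $\sqrt{\rho_y} I$ comes from and why it helps with low-rank covariances. The sketch below uses hypothetical random inputs; the candidate worst-case matrix is written down by hand here (a rank-one perturbation along $w$) and is an assumption of this example rather than a formula quoted from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rho = 4, 0.25
G = rng.normal(size=(d, 2))
Sigma_hat = G @ G.T                      # rank-2, hence singular, nominal covariance
w = rng.normal(size=d)

# Quadratic form appearing in problem (4): w'(Sigma_hat + sqrt(rho) I) w
closed_form = w @ (Sigma_hat + np.sqrt(rho) * np.eye(d)) @ w

# Candidate worst case: rank-one perturbation with squared Frobenius norm rho
Sigma_worst = Sigma_hat + np.sqrt(rho) * np.outer(w, w) / (w @ w)
divergence = np.trace((Sigma_worst - Sigma_hat) @ (Sigma_worst - Sigma_hat))
attained = w @ Sigma_worst @ w

# Random symmetric perturbations on the same Frobenius sphere never do better.
# (Positive semidefiniteness is not enforced, which only enlarges the candidates.)
best_random = -np.inf
for _ in range(2000):
    D = rng.normal(size=(d, d))
    D = (D + D.T) / 2.0
    D *= np.sqrt(rho) / np.linalg.norm(D, "fro")
    best_random = max(best_random, w @ (Sigma_hat + D) @ w)

# The regularized matrix is invertible even though Sigma_hat is not
min_eig_regularized = np.linalg.eigvalsh(Sigma_hat + np.sqrt(rho) * np.eye(d)).min()
```

The smallest eigenvalue of the regularized matrix is bounded below by $\sqrt{\rho_y}$, which is what makes the problem well conditioned for rank-deficient estimates.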

3.2. COVARIANCE-ROBUST MPM

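While Lanckriet et al. (2003) focused only on the quadratic divergence, their results can be generalized to the covariance-robust MPM with a general divergence $\phi$. For any $y \in \mathcal{Y}$, define $\tau^\phi_y : \mathbb{R}^d \to \mathbb{R}$ as
$$\tau^\phi_y(w) \triangleq \max_{\Sigma_y \in \mathbb{S}^d_+ :\ \phi(\Sigma_y \| \hat\Sigma_y) \le \rho_y} \sqrt{w^\top \Sigma_y w}.$$
We are now ready to give a generalized reformulation of problem (3).
Proposition 3.4 (Covariance-robust MPM). Let $w^\phi$ be an optimal solution to the problem
$$\min_{w \in \mathcal{W}} \; \sum_{y \in \mathcal{Y}} \tau^\phi_y(w).$$
Then $\theta^\phi = (w^\phi, b^\phi)$ solves the distributionally robust MPM problem (3), where $\kappa^\phi = \big(\sum_{y \in \mathcal{Y}} \tau^\phi_y(w^\phi)\big)^{-1}$ and $b^\phi = (w^\phi)^\top \hat\mu_{+1} - \kappa^\phi \tau^\phi_{+1}(w^\phi)$.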
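The expressions for $\kappa^\phi$ and $b^\phi$ can be motivated by a short calculation. The sketch below relies on the classical one-sided Chebyshev (Marshall-Olkin) bound used by Lanckriet et al. (2001) and on the sign convention that $C_\theta$ predicts $+1$ when $w^\top x \ge b$; both are assumptions added here for illustration, and the formal argument is the one in the appendix.

```latex
% Fix a slope w in W and a threshold b with w^T mu_{-1} <= b <= w^T mu_{+1}.
% Worst-case misclassification probability for a fixed covariance Sigma_y:
\sup_{P_y \sim (\hat\mu_y, \Sigma_y)} P_y\big(C_\theta(X) \neq y\big)
  = \frac{1}{1 + \kappa_y^2},
\qquad
\kappa_y = \frac{y\,(w^\top \hat\mu_y - b)}{\sqrt{w^\top \Sigma_y w}} .

% Maximizing over the covariance ambiguity set replaces the denominator by
% \tau^\phi_y(w). The two worst-case errors are balanced when
% \kappa_{+1} = \kappa_{-1} = \kappa, that is,
w^\top \hat\mu_{+1} - b = \kappa\, \tau^\phi_{+1}(w),
\qquad
b - w^\top \hat\mu_{-1} = \kappa\, \tau^\phi_{-1}(w).

% Adding the two equations and using \sum_y y w^T \hat\mu_y = 1 on W:
1 = \kappa \sum_{y \in \mathcal{Y}} \tau^\phi_y(w)
\;\Longrightarrow\;
\kappa = \Big( \sum_{y \in \mathcal{Y}} \tau^\phi_y(w) \Big)^{-1},
\qquad
b = w^\top \hat\mu_{+1} - \kappa\, \tau^\phi_{+1}(w).
```

Since the worst-case error $1/(1+\kappa^2)$ decreases in $\kappa$, the best slope is the one that minimizes $\sum_{y} \tau^\phi_y(w)$ over $\mathcal{W}$, which is the problem stated in Proposition 3.4.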

3.3. EQUIVALENCE UNDER GAUSSIAN ASSUMPTIONS

While the quadratic divergence $Q$ in Definition 3.2 is attractive for its tractability, it is not statistically meaningful. More specifically, it does not coincide with any distance between probability distributions with the corresponding covariance information. In this paper, we consider discrepancy measures $\phi$ that arise as a statistical distance between Gaussian distributions. To this end, we first need to show that the covariance-robust MPM is invariant under the Gaussian assumption. Define a parametric ambiguity set constructed on the space of Gaussian distributions of the form
$$\mathcal{U}^{\mathcal{N}}_y(\hat P_y) = \big\{ P_y \in \mathcal{P}(\mathcal{X}) : P_y \sim \mathcal{N}(\hat\mu_y, \Sigma_y),\ \phi(\Sigma_y \,\|\, \hat\Sigma_y) \le \rho_y \big\},$$
wherein every distribution is Gaussian. Consider the Gaussian distributionally robust MPM problem
$$\min_{\theta \in \Theta} \; \max_{y \in \mathcal{Y}} \; \max_{P_y \in \mathcal{U}^{\mathcal{N}}_y(\hat P_y)} P_y(C_\theta(X) \neq y). \tag{6}$$
Proposition 3.5 (Gaussian equivalence). The optimizer $\theta^\phi = (w^\phi, b^\phi)$ in Proposition 3.4 also solves the Gaussian parametric covariance-robust MPM problem (6).
Proposition 3.5 justifies the use of divergences induced by a distance between normal distributions. We study several constructions of the covariance-robust MPM in the subsequent sections.

4. BURES MPM

We first explore the case where $\phi$ is the Bures divergence, defined as follows.
Definition 4.1 (Bures divergence). Given two positive semidefinite matrices $\Sigma, \hat\Sigma \in \mathbb{S}^d_+$, the Bures divergence between them is
$$B(\Sigma \,\|\, \hat\Sigma) = \operatorname{Tr}\Big[ \Sigma + \hat\Sigma - 2 \big( \hat\Sigma^{\frac{1}{2}} \Sigma \hat\Sigma^{\frac{1}{2}} \big)^{\frac{1}{2}} \Big].$$
It can be shown that $B$ is symmetric and non-negative, and that it vanishes if and only if $\Sigma = \hat\Sigma$. As such, $B$ is a divergence on the space of positive semidefinite matrices. Moreover, $B$ also equals the squared type-2 Wasserstein distance between two Gaussian distributions with the same mean vector and covariance matrices $\Sigma$ and $\hat\Sigma$ (Olkin & Pukelsheim, 1982; Givens & Shortt, 1984; Gelbrich, 1990). Next, we assert the form of the Bures MPM.
Theorem 4.2 (Bures MPM). Suppose that $\phi \equiv B$. Let $w^B$ be a solution of the problem
$$\min_{w \in \mathcal{W}} \; \sum_{y \in \mathcal{Y}} \sqrt{w^\top \hat\Sigma_y w} + \sum_{y \in \mathcal{Y}} \rho_y \|w\|_2. \tag{7}$$
Then $\theta^B = (w^B, b^B)$ is an optimal solution of the distributionally robust MPM problem (3), where
$$\kappa^B = \Big( \sum_{y \in \mathcal{Y}} \sqrt{(w^B)^\top \hat\Sigma_y w^B} + \sum_{y \in \mathcal{Y}} \rho_y \|w^B\|_2 \Big)^{-1}, \qquad b^B = (w^B)^\top \hat\mu_{+1} - \kappa^B \Big( \sqrt{(w^B)^\top \hat\Sigma_{+1} w^B} + \rho_{+1} \|w^B\|_2 \Big).$$
Theorem 4.2 unveils a fundamental connection between robustness and regularization: if we construct the ambiguity sets for the covariance matrices using the Bures divergence, the resulting optimization problem (7) is an $\ell_2$-regularization of the nominal problem (2). This connection aligns with previous observations highlighting the equivalence between regularization schemes and optimal transport robustness (Shafieezadeh-Abadeh et al., 2019a; Blanchet et al., 2019). To prove Theorem 4.2, we provide a result that asserts the analytical form of $\tau^B_y(w)$. Next, we study the asymptotic form of the Bures MPM as the radii of the ambiguity sets grow. Note that problem (7) depends only on the sum of the radii, but not on the individual values of each radius. Letting $\rho = \sum_{y \in \mathcal{Y}} \rho_y$ denote their sum, it suffices to study the regime where $\rho$ grows to infinity. To this end, denote by $w^B_\rho$ the optimal solution of problem (7) parametrized by $\rho$. The next result provides the asymptotic value of $w^B_\rho$ as $\rho \to \infty$.
Proposition 4.4 (Bures asymptotic hyperplane). Fix $y \in \mathcal{Y}$, let $-y$ be its opposite class, and suppose that $\rho_{-y}$ remains constant. Let $w^B_{\rho_y}$ be the optimal solution of (7) parametrized by $\rho_y$. As $\rho_y \to \infty$,
$$w^B_{\rho_y} \to w^B_\infty \triangleq \frac{\sum_{y \in \mathcal{Y}} y \hat\mu_y}{\big\| \sum_{y \in \mathcal{Y}} y \hat\mu_y \big\|_2^2}.$$
Notice that as $\rho_y \to \infty$, $\kappa^B \tau^B_y(w^B_{\rho_y}) \to 1$. Thus, $b^B_\infty = (w^B_\infty)^\top \hat\mu_y - y$ for any $y \in \mathcal{Y}$. The asymptotic hyperplane defined by $w^B_\infty$ is thus characterized by the linear equation $(w^B_\infty)^\top x - (w^B_\infty)^\top \hat\mu_y + y = 0$, which identifies a hyperplane passing through $\hat\mu_{-y}$ since $\sum_{y \in \mathcal{Y}} y\, w^\top \hat\mu_y = 1$. Moreover, we note that the asymptotic hyperplane does not depend on the covariance matrices.
Remark 4.5 (Quadratic asymptotic hyperplane). It can be shown that the solution of problem (4) converges to $w^B_\infty$ as $\rho = \sum_{y} \rho_y$ tends to infinity. Thus, the Quadratic MPM (4) and the Bures MPM (7) are asymptotically equivalent even though they induce different regularizations of the nominal MPM (2).
Remark 4.6 (Geometric intuition for robust recourse). Figure 3 visualizes the Bures MPM hyperplanes obtained by varying the radii. Notice that the hyperplane drifts toward the favorable ($+1$) class as the uncertainty of the unfavorable ($-1$) covariance matrix increases. Thus, the recourse generated w.r.t. the green hyperplane is more robust than that generated w.r.t. the red one. By calibrating the radii, we shift the hyperplane and obtain robust recourses at different robustness-cost trade-offs.
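The Bures divergence and the asymptotic hyperplane of Proposition 4.4 are easy to evaluate numerically. The sketch below is illustrative only: the helper names and the toy means are assumptions, and the matrix square root is computed by eigendecomposition.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2.0)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def bures_divergence(Sigma, Sigma_hat):
    """Tr[Sigma + Sigma_hat - 2 (Sigma_hat^{1/2} Sigma Sigma_hat^{1/2})^{1/2}]."""
    root = psd_sqrt(Sigma_hat)
    cross = psd_sqrt(root @ Sigma @ root)
    return float(np.trace(Sigma + Sigma_hat - 2.0 * cross))

def bures_asymptotic_slope(mu_pos, mu_neg):
    """Limit slope a / ||a||^2 with a = mu_pos - mu_neg."""
    a = mu_pos - mu_neg
    return a / (a @ a)

# Commuting (diagonal) case: the divergence reduces to sum (sqrt(s_i) - sqrt(t_i))^2
A = np.diag([1.0, 4.0])
C = np.diag([9.0, 1.0])
div_AC = bures_divergence(A, C)          # (1 - 3)^2 + (2 - 1)^2

# Asymptotic hyperplane when the radius of class +1 grows: b = w' mu_pos - 1
mu_pos, mu_neg = np.array([2.0, 1.0]), np.array([0.0, -1.0])
w_inf = bures_asymptotic_slope(mu_pos, mu_neg)
b_inf = w_inf @ mu_pos - 1.0
```

In this toy case the limiting hyperplane passes through the mean of the opposite class, `mu_neg`, and does not involve any covariance matrix, in line with the discussion after Proposition 4.4.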

5. FISHER-RAO MPM

We now explore the case where $\phi$ is the Fisher-Rao distance, which is defined as follows.
Definition 5.1 (Fisher-Rao distance). Given two positive definite matrices $\Sigma, \hat\Sigma \in \mathbb{S}^d_{++}$, the Fisher-Rao distance between them is
$$F(\Sigma, \hat\Sigma) = \big\| \log\big( \hat\Sigma^{-\frac{1}{2}} \Sigma \hat\Sigma^{-\frac{1}{2}} \big) \big\|_2,$$
where $\log(\cdot)$ is the matrix logarithm.
The Fisher-Rao distance enjoys many nice properties. In particular, it is invariant to inversion and congruence, i.e., for any $\Sigma, \hat\Sigma \in \mathbb{S}^d_{++}$ and invertible $A \in \mathbb{R}^{d \times d}$,
$$F(\Sigma, \hat\Sigma) = F(\Sigma^{-1}, \hat\Sigma^{-1}) = F(A \Sigma A^\top, A \hat\Sigma A^\top).$$
Such invariances are statistically meaningful, as they imply that the results remain unchanged if we reparametrize the problem with an inverse covariance matrix (instead of the covariance matrix) or apply a change of basis to the data space $\mathcal{X}$. It is shown that $F$ is the unique Riemannian distance (up to scaling) on the cone $\mathbb{S}^d_{++}$ with such invariances (Savage, 1982). Next, we assert the form of the Fisher-Rao MPM.
Theorem 5.2 (Fisher-Rao MPM). Suppose that $\phi \equiv F$. Let $w^F$ be a solution of the problem
$$\min_{w \in \mathcal{W}} \; \sum_{y \in \mathcal{Y}} \exp\Big(\frac{\rho_y}{2}\Big) \sqrt{w^\top \hat\Sigma_y w}. \tag{8}$$
Then $\theta^F = (w^F, b^F)$ is an optimal solution of the distributionally robust MPM problem (3), where
$$\kappa^F = \Big( \sum_{y \in \mathcal{Y}} \exp\Big(\frac{\rho_y}{2}\Big) \sqrt{(w^F)^\top \hat\Sigma_y w^F} \Big)^{-1}, \qquad b^F = (w^F)^\top \hat\mu_{+1} - \kappa^F \exp\Big(\frac{\rho_{+1}}{2}\Big) \sqrt{(w^F)^\top \hat\Sigma_{+1} w^F}.$$
Theorem 5.2 reveals another foundational connection between robustness and regularization: if we construct the ambiguity sets for the covariance matrices using the Fisher-Rao distance, the resulting optimization problem (8) is a reweighted version of the nominal problem (2). Each term $(w^\top \hat\Sigma_y w)^{\frac{1}{2}}$ is assigned a weight $\exp(\rho_y / 2)$, which increases with the radius $\rho_y$. This connection aligns with previous observations highlighting the equivalence between reweighting schemes and distributional robustness (Ben-Tal et al., 2013; Bayraksan & Love, 2015; Namkoong & Duchi, 2017; Hashimoto et al., 2018). To prove Theorem 5.2, we derive an analytical expression of $\tau^F_y(w)$.
Proposition 5.3 (Fisher-Rao distance). If $\phi \equiv F$, then $\tau^F_y(w) = \exp\big(\frac{\rho_y}{2}\big) (w^\top \hat\Sigma_y w)^{\frac{1}{2}}$ for all $y \in \mathcal{Y}$.
The proof of Theorem 5.2 follows by combining the results from Proposition 3.4 and Proposition 5.3. Next, we study the asymptotic form of the Fisher-Rao MPM.
Proposition 5.4 (Fisher-Rao asymptotic hyperplane). Fix $y \in \mathcal{Y}$, let $-y$ be its opposite class, and suppose that $\rho_{-y}$ remains constant. Let $w^F_{\rho_y}$ be the optimal solution of (8) parametrized by $\rho_y$. Let $a \triangleq \sum_{y \in \mathcal{Y}} y \hat\mu_y$. Then, as $\rho_y \to \infty$,
$$w^F_{\rho_y} \to w^F_{\infty, y} \triangleq \big( a^\top \hat\Sigma_y^{-1} a \big)^{-1} \hat\Sigma_y^{-1} a.$$
Contrary to the Bures MPM, the asymptotic hyperplane of the Fisher-Rao MPM depends explicitly on the covariance matrix $\hat\Sigma_y$. The boundary prescribed by the Fisher-Rao MPM can be shifted through an appropriate calibration of the radii $\rho_y$. Thus, the Fisher-Rao MPM can generate a robust algorithmic recourse in a geometrically intuitive manner.
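The invariances in Definition 5.1 and the closed form in Proposition 5.3 can be checked numerically. The sketch below is a hedged illustration with hypothetical random inputs: it assumes the norm of the matrix logarithm is the Frobenius norm (the $\ell_2$ norm of the log-eigenvalues), and the candidate worst-case covariance is constructed by hand for this example.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** p) @ vecs.T

def fisher_rao(Sigma, Sigma_hat):
    """Norm of log(Sigma_hat^{-1/2} Sigma Sigma_hat^{-1/2}); Frobenius norm assumed."""
    inv_root = spd_power(Sigma_hat, -0.5)
    M = inv_root @ Sigma @ inv_root
    eigvals = np.linalg.eigvalsh((M + M.T) / 2.0)
    return float(np.linalg.norm(np.log(eigvals)))

rng = np.random.default_rng(0)
d, rho = 3, 0.8
G1, G2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma = G1 @ G1.T + d * np.eye(d)
Sigma_hat = G2 @ G2.T + d * np.eye(d)
# Upper-triangular matrix with nonzero diagonal, hence invertible
A = np.triu(rng.normal(size=(d, d)), k=1) + np.diag([1.0, 2.0, 3.0])

base = fisher_rao(Sigma, Sigma_hat)
inverted = fisher_rao(np.linalg.inv(Sigma), np.linalg.inv(Sigma_hat))
congruent = fisher_rao(A @ Sigma @ A.T, A @ Sigma_hat @ A.T)

# Candidate worst case at distance rho: inflate Sigma_hat along Sigma_hat^{1/2} w
w = rng.normal(size=d)
root = spd_power(Sigma_hat, 0.5)
u = root @ w
v = u / np.linalg.norm(u)
Sigma_worst = root @ (np.eye(d) + (np.exp(rho) - 1.0) * np.outer(v, v)) @ root
tau_closed_form = np.exp(rho / 2.0) * np.sqrt(w @ Sigma_hat @ w)
tau_attained = np.sqrt(w @ Sigma_worst @ w)
```

The three distances `base`, `inverted`, and `congruent` agree up to numerical error, and `Sigma_worst` sits at distance `rho` from `Sigma_hat` while attaining the value $\exp(\rho_y/2)(w^\top \hat\Sigma_y w)^{1/2}$.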

6. NUMERICAL EXPERIMENTS

We conduct comprehensive experiments to highlight the performance of our models. We first compare the fidelity and stability of our surrogates with those of LIME. We then compare the quality of our recourses against two popular baselines: ROAR (Upadhyay et al., 2021) and AR (Ustun et al., 2019).
Classifier. We use a three-layer MLP with 20, 50, and 20 nodes and ReLU activation functions in the consecutive layers. We use a sigmoid function in the last layer to produce probabilities.
Dataset. We evaluate our framework using popular real-world datasets for algorithmic recourse: German Credit (Dua & Graff, 2017; Groemping, 2019), Small Business Administration (SBA) (Li et al., 2018), and Student performance (Cortez & Silva, 2008). Each dataset contains two sets of data (the present data $D_1$ and the shifted data $D_2$). The shifted dataset $D_2$ captures the correction shift (German Credit), the temporal shift (SBA), or the geospatial shift (Student). For each dataset, we use 80% of the instances in the present data $D_1$ to train an underlying predictive model, and the remaining instances are used to generate and evaluate recourses. The shifted data $D_2$ is used to train future classifiers in Section 6.2. The main text contains the results for the German and Student datasets. Further results, including those for the SBA and synthetic data, are provided in Appendix A.

6.1. FIDELITY AND STABILITY OF THE SURROGATE MODELS

We evaluate the performance of different surrogate models with respect to the current classifier. We compare our methods, QUAD-MPM (Theorem 3.3), BW-MPM (Theorem 4.2), and FR-MPM (Theorem 5.2), against the popular linear surrogate LIME (Ribeiro et al., 2016) under the following metrics.
Stability. We use the procedure in Agarwal et al. (2021) to measure the stability of the surrogate models with respect to small perturbations in the input instance. For a given instance $x$, we draw a set $U_x$ of 10 neighbors of $x$ from $\mathcal{N}(x, 0.001 I)$ independently. We use the abovementioned methods to find the linear surrogate $\theta_{x'} = (w_{x'}, b_{x'})$ for each $x' \in U_x$. We report the maximum distance between the explanation of $x$ and those of its neighbors; precisely, the stability formula is
$$\mathrm{Stability}(w_x) = \max_{x' \in U_x} \| w_x - w_{x'} \|_2.$$
Fidelity. We use the LocalFid criterion as in Laugel et al. (2018) to measure the fidelity of a local surrogate model. For a given instance $x$, we draw a set $V_x$ of 1000 instances uniformly from an $\ell_2$-ball of radius $r_{\mathrm{fid}}$ centered on $x$. The local fidelity of the surrogate $\theta_x$ is measured as
$$\mathrm{LocalFid}(\theta_x) = \frac{1}{|V_x|} \sum_{x' \in V_x} \| f(x') - C_{\theta_x}(x') \|_2,$$
where $f(\cdot)$ is the original classifier and $C_\theta(\cdot)$ is the linear surrogate classifier. Basically, LocalFid measures the fraction of instances where the output class of $f$ and $C_{\theta_x}$ agree. Here, we set $r_{\mathrm{fid}}$ to 10% of the maximum distance between instances in the training data. Note that $V_x$ is for evaluation only and is independent of the perturbation samples used to train the local surrogate.
To generate the MPM surrogates, we choose the 10 nearest counterfactuals of $x_0$ in the training data to find $x_b$. We set the perturbation radius $r_p$ to 5% of the maximum distance between instances in the training data and set $\rho_{+1} = 0$, $\rho_{-1} = 1.0$. For LIME, we use the default parameters recommended in the LIME source code and return $\theta = (w, b - 0.5)$ as the LIME surrogate, similar to Laugel et al. (2018).
We vary the number of perturbation samples in a range of [500, 10000] to measure the fidelity and sensitivity of constructed surrogates under low sampling sizes. The results on German and Student datasets are shown in Figure 4 . The results show the superiority of MPM's surrogates compared to LIME in both local fidelity and stability metrics. Meanwhile, FR-MPM provides higher-fidelity surrogates compared to QUAD-MPM and BW-MPM.
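The two metrics can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind the reported figures: the function names, the toy classifier `f`, and the tangent-plane surrogate `fit_tangent` are hypothetical, and `local_fidelity` implements the agreement-fraction reading of LocalFid described in the text.

```python
import numpy as np

def stability(fit_surrogate, x, rng, n_neighbors=10, var=0.001):
    """Max l2 distance between the slope at x and the slopes at Gaussian neighbors."""
    w_x, _ = fit_surrogate(x)
    neighbors = rng.normal(loc=x, scale=np.sqrt(var), size=(n_neighbors, x.shape[0]))
    return max(np.linalg.norm(w_x - fit_surrogate(xp)[0]) for xp in neighbors)

def sample_l2_ball(center, radius, n, rng):
    """Uniform samples in an l2-ball around center."""
    d = center.shape[0]
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return center + radius * rng.uniform(size=(n, 1)) ** (1.0 / d) * directions

def local_fidelity(f, surrogate, x, r_fid, rng, n=1000):
    """Fraction of points in the ball where surrogate and classifier agree."""
    V = sample_l2_ball(x, r_fid, n, rng)
    return float(np.mean([f(v) == surrogate(v) for v in V]))

# Toy nonlinear classifier and a hypothetical surrogate (tangent of its boundary).
g = lambda z: z[0] + z[1] ** 2 - 1.0
f = lambda z: 1 if g(z) >= 0 else -1

def fit_tangent(x):
    w = np.array([1.0, 2.0 * x[1]])      # gradient of g at x
    b = w @ x - g(x)                     # so that w @ z - b linearizes g around x
    return w, b

rng = np.random.default_rng(0)
x = np.array([0.5, 0.7])
w_x, b_x = fit_tangent(x)
surrogate = lambda z: 1 if w_x @ z - b_x >= 0 else -1
stab = stability(fit_tangent, x, rng)
fid = local_fidelity(f, surrogate, x, r_fid=0.3, rng=rng)
```

Lower `stab` indicates a surrogate that changes little under small input perturbations, while `fid` close to 1 indicates a surrogate that mimics the classifier well inside the evaluation ball.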

6.2. ROBUSTNESS OF RECOURSES

We now study the robustness of recourses to model shifts and its trade-off against the recourse cost.
Metrics. We use the $\ell_1$-distance as the cost function. We define the current validity as the validity of the generated recourses with respect to the given classifier. To measure the validity of recourses under model shifts, we sample 80% of the instances of the shifted data $D_2$ 100 times to train 100 'future' MLP classifiers. We then report the future validity of a recourse as the fraction of the future classifiers with respect to which the recourse is valid.
Cost-validity trade-off. We fix the number of perturbation samples to 1000 and vary the ambiguity size ($\rho_{+1} = 0$, $\rho_{-1} \in [0, 10]$ for FR-MPM-PROJ and $\delta_{\max} \in [0, 0.2]$ for the uncertainty size of ROAR). We then plot the Pareto frontiers of the cost-validity trade-off in Figure 5. Generally, increasing the ambiguity size increases both the current and future validity of recourses, but at a higher cost. This result is consistent with the analysis in Rawal et al. (2020). However, the frontiers of FR-MPM-PROJ dominate the frontiers of LIME-ROAR, CLIME-ROAR, and LIMELS-ROAR. Note that we use CARLA's implementation with default parameters for Wachter; its higher cost compared to the linear-surrogate methods on the Student dataset is consistent with the results in Upadhyay et al. (2021) and Pawelczyk et al. (2021). Table 1 shows the performance of FR-MPM-PROJ at radii $\rho_{+1} = 0$, $\rho_{-1} = 10$ and of ROAR-related methods at uncertainty size $\delta_{\max} = 0.2$. Here, FR-MPM-PROJ has a future validity similar to that of LIMELS-ROAR, but at a lower cost and with higher current validity. Further results and discussion are in Appendix A.2.
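The cost and future-validity metrics can be computed as sketched below. The example is purely illustrative: the function names are hypothetical, and the 100 threshold classifiers are toy stand-ins for the 'future' MLPs retrained on subsamples of $D_2$.

```python
import numpy as np

def l1_cost(x0, x_r):
    """l1-distance between the original input and the recourse."""
    return float(np.abs(np.asarray(x_r) - np.asarray(x0)).sum())

def future_validity(x_r, future_classifiers, favorable=1):
    """Fraction of future classifiers that assign the favorable label to x_r."""
    votes = [clf(x_r) == favorable for clf in future_classifiers]
    return float(np.mean(votes))

# Toy stand-ins for retrained models: the decision threshold drifts around 1.0.
thresholds = np.linspace(0.9, 1.1, 100)
future_classifiers = [
    (lambda x, t=t: 1 if x[0] >= t else -1) for t in thresholds
]

x0 = np.array([0.0, 0.0])
x_near = np.array([1.0, 0.0])   # recourse sitting on the current boundary
x_deep = np.array([1.2, 0.0])   # recourse pushed deeper into the favorable region

validity_near = future_validity(x_near, future_classifiers)
validity_deep = future_validity(x_deep, future_classifiers)
cost_near, cost_deep = l1_cost(x0, x_near), l1_cost(x0, x_deep)
```

The toy numbers reproduce the qualitative trade-off discussed above: the deeper recourse remains valid for more future classifiers but has a larger $\ell_1$ cost.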

A EXPERIMENTS

A.1 EXPERIMENTAL DETAILS

Classifier. We use a three-layer MLP with 20, 50, and 20 nodes and a ReLU activation in each consecutive layer. We use the sigmoid function in the last layer to produce probabilities. To train this classifier, we minimize the binary cross-entropy loss using the Adam optimizer for 1000 epochs.

Datasets. We provide more details about the synthetic and real-world datasets. For synthetic data, we generate 2-dimensional data by sampling instances uniformly in the rectangle $x = (x_1, x_2) \in [-2, 4] \times [-2, 7]$. Each sample is labeled using the following function:

$f(x) = 1$ if $x_2 \ge 1 + x_1 + 2x_1^2 + x_1^3 - x_1^4 + \varepsilon$, and $f(x) = 0$ otherwise,

where $\varepsilon$ is a random noise. We generate a present dataset $D_1$ with $\varepsilon = 0$ and a shifted dataset $D_2$ with $\varepsilon \sim N(0, 1)$. The decision boundary of the MLP classifier for the current synthetic data is illustrated in Figure 6. The details of the three real-world datasets are listed below:

i. German Credit (Dua & Graff, 2017). The dataset contains the information (e.g., age, gender, financial status, ...) of 1000 customers who took a loan from a bank. The classification task is to determine the risk (good or bad) of an individual. A corrected version of this dataset, which fixes coding errors, is also available (Groemping, 2019). We use the corrected version as shifted data to capture the correction shift. The features we use in this dataset include 'duration', 'amount', 'personal status sex', and 'age'. When considering actionability constraints (Section 6.2), we set 'personal status sex' as immutable and 'age' to be non-decreasing.

ii. Small Business Administration (SBA) (Li et al., 2018). This data includes 2,102 observations with historical data of small business loan approvals from 1987 to 2014. We divide this dataset into two datasets (one with instances from 1989-2006 and one with instances from 2006-2014) to capture temporal shifts. We use the following features: 'Selected', 'Term', 'NoEmp', 'CreateJob', 'RetainedJob', 'UrbanRural', 'ChgOffPrinGr', 'GrAppv', 'SBA Appv', 'New', 'RealEstate', 'Portion', 'Recession'. When considering actionability constraints, we set 'UrbanRural' as immutable.

iii. Student performance (Cortez & Silva, 2008). This data includes the performance records of 649 students in two schools: Gabriel Pereira (GP) and Mousinho da Silveira (MS). The classification task is to determine whether their final score is above average. We split this dataset by school into two sets to capture geospatial shifts. The features we use are: 'age', 'Medu', 'Fedu', 'studytime', 'famsup', 'higher', 'internet', 'romantic', 'freetime', 'goout', 'health', 'absences', 'G1', 'G2'. When considering actionability constraints, we set 'romantic' as immutable and 'age' to be non-decreasing.

For categorical features, we use one-hot encoding to convert them to binary features, similar to Mothilal et al. (2020). We also normalize continuous features to zero mean and unit variance before training the classifier. The performance of the classifier is reported in Table 3.
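The synthetic data generation described above can be sketched as follows. This is an illustrative reading of the text, not the released code: the function names, the sample sizes, and the seeds are hypothetical, and the noise $\varepsilon$ is assumed to be drawn independently for each sample.

```python
import random


def label(x1: float, x2: float, eps: float = 0.0) -> int:
    """Label 1 iff x2 >= 1 + x1 + 2*x1^2 + x1^3 - x1^4 + eps, else 0."""
    boundary = 1 + x1 + 2 * x1**2 + x1**3 - x1**4 + eps
    return 1 if x2 >= boundary else 0


def sample_dataset(n: int, noise_std: float, seed: int):
    """Sample n points uniformly in [-2, 4] x [-2, 7] and label them.

    noise_std = 0 gives the present data D1; noise_std = 1 gives the
    shifted data D2 (assumption: one independent Gaussian draw per sample).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(-2.0, 4.0)
        x2 = rng.uniform(-2.0, 7.0)
        eps = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        data.append(((x1, x2), label(x1, x2, eps)))
    return data


d1 = sample_dataset(n=1000, noise_std=0.0, seed=0)  # present data
d2 = sample_dataset(n=1000, noise_std=1.0, seed=1)  # shifted data
```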

A.2 ADDITIONAL EXPERIMENTAL RESULTS

Local fidelity and stability experiments. Here, we provide benchmarks of local fidelity and stability (Section 6.1) on the synthetic and SBA datasets in Figure 7. We also run a different setting for the stability and local fidelity metrics to assess whether the results are sensitive to the parameter choices. Specifically, we sample 10 neighbors from the distribution $N(x, 0.0001I)$ instead of $N(x, 0.001I)$ to measure the stability. Meanwhile, we set $r_{fid}$ to 20% and the radius $r_p$ to 10% of the maximum distance between instances in the training data. The result is shown in Figure 8.

1.30 ± 0.01  1.00 ± 0.00  0.98 ± 0.01  1.63 ± 0.23  1.00 ± 0.00  0.88 ± 0.05  2.47 ± 0.29  0.95 ± 0.03  0.75 ± 0.09
QUAD-MPM-PROJ  1.09 ± 0.02  0.99 ± 0.01  0.98 ± 0.00  1.17 ± 0.11  1.00 ± 0.00  0.85 ± 0.05  1.46 ± 0.13  0.95 ± 0.03  0.54 ± 0.07
BW-MPM-PROJ  1.19 ± 0.02  1.00 ± 0.00  0.99 ± 0.00  1.67 ± 0.13  1.00 ± 0.00  0.95 ± 0.02  2.13 ± 0.18  0.95 ± 0.03  0.76 ± 0.06
FR-MPM-PROJ  1.20 ± 0.02  1.00 ± 0.00  1.00 ± 0.00  1.83 ± 0.13  1.00 ± 0.00  0.97 ± 0.01  2.51 ± 0.21  0.95 ± 0.03  0.82 ± 0.05

The cost-validity trade-off. Here, we provide more detail about the settings of the baselines and the complementary results of the experiments in Section 6.2 in the main paper. For Wachter's implementation, we use CARLA's source code⁶ (Pawelczyk et al., 2021), which employs an adaptive scheme to adjust the hyperparameter $\lambda$ if no valid recourse is found. We adopt this implementation for ROAR and set the initial $\lambda$ to 0.1, as suggested in Upadhyay et al. (2021). Regarding the surrogate models, we use the open-source code with the default settings for LIME⁷ (Ribeiro et al., 2016). We adapt this source code accordingly for CLIME's implementation (Agarwal et al., 2021). LIMELS, SVM, and MPM-related surrogates use the same boundary sampling procedure (with the same seed), in which we set the number of counterfactuals to $k = 10$ and the number of perturbation samples to 1000.
1.02 ± 0.06  0.64 ± 0.04  0.61 ± 0.02  6.06 ± 4.06  0.26 ± 0.07  0.03 ± 0.02  3.07 ± 0.23  0.43 ± 0.13  0.36 ± 0.03
MPM-AR  1.08 ± 0.04  0.68 ± 0.04  0.62 ± 0.03  5.06 ± 3.33  0.39 ± 0.18  0.08 ± 0.05  3.02 ± 0.19  0.40 ± 0.12  0.36 ± 0.02
QUAD-MPM-AR  1.65 ± 0.08  0.99 ± 0.01  0.98 ± 0.01  7.40 ± 3.36  0.97 ± 0.03  0.49 ± 0.19  5.60 ± 0.31  0.97 ± 0.04  0.69 ± 0.05
BW-MPM-AR  1.86 ± 0.09  1.00 ± 0.00  0.99 ± 0.01  9.66 ± 3.21  0.98 ± 0.04  0.62 ± 0.03  8.62 ± 0.38  1.00 ± 0.00  0.92 ± 0.05
FR-MPM-AR  2.03 ± 0.04  1.00 ± 0.00  0.99 ± 0.00  10.02 ± 3.09  1.00 ± 0.00  0.64 ± 0.03  9.63 ± 0.38  1.00 ± 0.00  0.95 ± 0.03

Figure 9 shows the Pareto frontiers of the cost-validity trade-off on the synthetic and three real-world datasets. Table 4 shows the performance of FR-MPM-PROJ at radii $\rho_{+1} = 0$, $\rho_{-1} = 10$ and of ROAR-related methods at uncertainty size $\delta_{\max} = 0.2$ on the synthetic, SBA, and Student datasets. The hyperparameter $\lambda$ is set to 0.1 for both Wachter and ROAR-related methods. In FR-MPM-PROJ and ROAR-related methods, the trade-off between cost and validity can be observed when increasing the ambiguity size ($\rho_{-1}$ for FR-MPM and $\delta_{\max}$ for ROAR-related methods), similar to the analysis in Rawal et al. (2020) and the experimental results in Upadhyay et al. (2021) and Black et al. (2021). Generally, the Pareto frontiers of FR-MPM-PROJ dominate the frontiers of ROAR-related methods on all evaluated datasets. In other words, at the same cost (or validity), our method provides recourses with a higher validity (or lower cost) than ROAR. Table 4 demonstrates that our method has similar validity but a much smaller cost than ROAR on the synthetic and German datasets. Meanwhile, our method achieves higher validity with reasonable cost on the SBA and Student datasets. Among the other baselines, LIMELS-ROAR performs slightly better than LIME-ROAR and CLIME-ROAR. Wachter provides recourses with high current validity but is vulnerable to model shifts, resulting in poor future validity.
Wachter has the lowest cost for the synthetic and German datasets but a higher cost for the SBA and Student datasets compared to the linear surrogate-based methods. This is consistent with the results in Upadhyay et al. (2021), since the objective function of Wachter might be non-convex when the classifier is an MLP network.

Comparison with ROAR using the vanilla MPM and SVM as the surrogate model. Here, we compare FR-MPM-PROJ with ROAR using the vanilla MPM and SVM as the surrogate model. Both the vanilla MPM and SVM use the same boundary sampling procedure (with the same seed) as FR-MPM. The settings are similar to the experiment in Section 6.2. The result shown in Figure 10 demonstrates the merits of our method.

Actionability. We provide the complementary result of the actionability experiment on the synthetic, SBA, and Student datasets in Table 5. Using AR with FR-MPM produces recourses with higher current and future validity than AR using other surrogates.

Ablation study. We conduct an ablation study to understand the contribution of each stage in our method. Figure 10 shows the comparison of our method with ROAR using the vanilla MPM and SVM as the local surrogate. Figure 11 shows the Pareto frontiers of FR-MPM-PROJ compared to its ablations, obtained by replacing the FR-MPM with other surrogates (LIME, MPM) or replacing the projection with Wachter. In particular, we compare our method with LIME-ROBUST-PROJ, which uses LIME as the surrogate model and then solves the robustified projection $x_r = \arg\min\{\|x - x_0\|_1 : x^\top w + b - \delta_{\max}\|x\|_2 \ge 0\}$, where $(w, b)$ are the weight and bias of LIME's surrogate and $\delta_{\max}$ is similar to the uncertainty size of ROAR (Upadhyay et al., 2021). The recourses are generated with respect to the MLP classifier on the synthetic and three real-world datasets. This result demonstrates the usefulness of the FR-MPM in promoting the generation of robust recourses.
Note that, for $\rho_{-1} = 0$ and $\rho_{+1} = 0$, the hyperplane of the FR-MPM classifier recovers the vanilla MPM's hyperplane.

Comparison with probabilistic threshold shifting. One might attempt to increase the probabilistic threshold (usually set to 0.5) at which a sample is considered 'favorable' to generate robust recourses. In Figure 12, we compare the proposed method with Wachter, LIME, and MPM with various probabilistic thresholds in the range [0.5, 0.9]. FR-MPM-PROJ consistently achieves the best performance among these methods. Interestingly, Wachter improves its future validity significantly on the synthetic and German datasets as the threshold increases. However, our method still dominates Wachter on all datasets.

Robust recourses with MPM's variants. We compare FR-MPM-PROJ with QUAD-MPM-PROJ and BW-MPM-PROJ. The settings are similar to those in the cost-validity trade-off experiments in Section 6.2.

A.3 COVARIANCE-ROBUST MPMS WITH DIFFERENT DIVERGENCES

In this section, we discuss the variants of covariance-robust MPMs with different distances and provide guidance for choosing the surrogate model in practice, especially at a low sample size. Proposition 4.4 and Remark 4.5 show that the Quadratic MPM and the Bures MPM coincide when one of the radii $\rho_y$ grows to infinity and that they are independent of the covariance matrices $\hat\Sigma_y$. Meanwhile, the asymptotic hyperplane of the Fisher-Rao MPM when $\rho_y \to \infty$ aligns with the axes of the covariance matrices $\hat\Sigma_y$ (see Proposition 5.4 and Figure 14). This suggests that the Fisher-Rao MPM is not a suitable surrogate at low sample sizes, as it relies on the estimates of the covariance matrices. On the other hand, when the number of samples is sufficient to estimate the covariance matrices accurately, the Fisher-Rao MPM would be better than the Quadratic MPM and the Bures MPM, as it takes the geometry of the data into account when robustifying the surrogate. We probe the performance of MPMs with different distances at low sample sizes to support this claim.

Local fidelity. We probe the local fidelity at low sample sizes and plot the result in Figure 15. The experiment settings are similar to those in Section 6.1. The number of samples is varied in the range [50, 1000]. We also report the average condition number of the estimated covariance matrices for both the positive and negative classes in Figure 15a. The covariance matrices are ill-conditioned at 50 samples on the SBA and Student datasets. The fidelity of FR-MPM is only slightly better than that of QUAD-MPM and BW-MPM. As the number of samples increases, FR-MPM benefits the most, and the gap between FR-MPM and QUAD-MPM (or BW-MPM) becomes more significant. This supports our claim that FR-MPM better approximates the decision boundary when the number of samples is sufficient for estimating the covariance matrices.

Robust recourses.
We revisit the recourse generation with covariance-robust MPMs using the quadratic distance (QUAD-MPM-PROJ) and the Fisher-Rao distance (FR-MPM-PROJ), where the surrogate is estimated with 50 and 1000 synthesized samples. We omit the comparison with the Bures distance to ease the presentation, as it behaves asymptotically like the MPM using the quadratic distance. The results are shown in Figure 16. The results show that QUAD-MPM-PROJ would be better at a low

For each fixed $\Sigma_y$, using elementary probability theory, we can calculate the Gaussian probability explicitly:

$P_y(y(w^\top X - b) \le 0) = 1 - \Phi\left(\frac{y(w^\top \hat\mu_y - b)}{\sqrt{w^\top \Sigma_y w}}\right)$,

where $\Phi$ is the cumulative distribution function of the standard Gaussian random variable. Therefore, problem (16) can be re-written as which is the same problem as problem (12) in the proof of Proposition 3.4. Hence, problem (6) shares the same optimal solution as problem (3). This completes the proof.

$\max\ \alpha$ s.t. $\alpha \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, $\alpha \le \max$

B.2 PROOFS OF SECTION 4

We first prove Proposition 4.3 to lay the foundation for the proof of Theorem 4.2. Proof of Proposition 4. The optimal γ can be found by using calculus, which is given by γ = w 2 2 + w Σ y w w 2 2 ρ y , with the corresponding optimal value τ B y (w) 2 = (ρ y w 2 + w Σ y w) 2 . We thus have the necessary result. We now prove Theorem 4.2. Proof of Theorem 4.2. Using the Bures divergence B, the optimization problem min w∈W y∈Y τ B y (w) becomes problem (7) by exploiting the analytical form of τ B y (w) in Proposition 4.3. By invoking Proposition 3.4, we obtain the postulated results on the optimal solution θ B for the case of the Bures divergence. Proof of Proposition 4.4. Note that problem (7) has a unique solution because the objective function is strictly convex and coercive. Moreover, the optimal solution of (7) coincides with the optimal solution w (λ) of the following second-order cone program min w∈W 1 λ y∈Y w Σ y w + w 2 , where λ = 1/ρ. By a compactification of W and applying Berge's maximum theorem (Berge, 1963, pp. 115-116) , the function w (λ) is continuous on a non-negative compact range of λ, and converges to w (0) as λ → 0. The optimal solution w (0) coincides with the solution of 



1. Throughout, "robustness" is used in the algorithmic recourse setting with respect to model shifts (Rawal et al., 2020). "Robustness" is also used to indicate the sensitivity of LIME to the sampling distribution. To avoid confusion, in what follows, we use "stability" to refer to the aforementioned sensitivity of LIME.
2. LIMELS uses the same boundary sampling algorithm as the FR-MPM but trains a ridge regression instead.
3. https://github.com/marcotcr/lime
4. https://github.com/ustunb/actionable-recourse
5. https://github.com/carla-recourse/CARLA
6. https://github.com/carla-recourse/CARLA
7. https://github.com/marcotcr/lime



Figure 1: The sampler synthesizes new instances around $x_0$ and queries the predicted labels from the classifier $f$. The moment information $(\hat\mu_y, \hat\Sigma_y)$ estimated from the synthetic pseudo-labeled data (represented by triangles and ellipsoids) serves as the input for the covariance-robust MPM. The MPM surrogate $\theta_\phi$ (red hyperplane) is the target classifier used to generate recourses (red circle).

The goal of MPM is to find a (non-trivial) linear classifier that minimizes the maximum misclassification rate among classes. To this end, we consider the family of linear classifiers parametrized by $\theta = (w, b) \in \mathbb{R}^{d+1}$, $w \neq 0$, with classification rule $C_\theta(x) = \operatorname{sign}(w^\top x - b)$, where $w$ is the slope and $b$ is the intercept. The MPM solves the min-max optimization problem $\min_{\theta \in \Theta} \max_{y \in \mathcal{Y}} \max_{P_y \sim (\hat\mu_y, \hat\Sigma_y)}$

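then $\theta^\star = (w^\star, b^\star)$ solves the MPM problem (1), where $\kappa^\star = \big(\sum_{y \in \mathcal{Y}} \sqrt{w^{\star\top} \hat\Sigma_y w^\star}\big)^{-1}$ and $b^\star = w^{\star\top} \hat\mu_{+1} - \kappa^\star \sqrt{w^{\star\top} \hat\Sigma_{+1} w^\star}$.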
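Given a slope vector, the closed-form quantities above are cheap to evaluate. The sketch below is only an illustration with hypothetical names. It uses the ratio form of $\kappa$ that appears in the proof of Proposition 3.4, $\kappa = \sum_y y\, w^\top \hat\mu_y / \sum_y \tau_y(w)$, which reduces to the reciprocal of the sum when $w$ is normalized so that $w^\top(\hat\mu_{+1} - \hat\mu_{-1}) = 1$, and it takes $\tau_y(w) = \sqrt{w^\top \hat\Sigma_y w}$ as in the vanilla MPM.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def quad_form(w, S):
    """Quadratic form w^T S w for a matrix S given as a list of rows."""
    d = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(d) for j in range(d))


def mpm_kappa_and_intercept(w, mu_pos, mu_neg, S_pos, S_neg):
    """Return (kappa, b) of the vanilla MPM hyperplane for a given slope w."""
    s_pos = math.sqrt(quad_form(w, S_pos))
    s_neg = math.sqrt(quad_form(w, S_neg))
    kappa = (dot(w, mu_pos) - dot(w, mu_neg)) / (s_pos + s_neg)
    b = dot(w, mu_pos) - kappa * s_pos
    return kappa, b


# Toy moments: two unit-covariance classes centered at (2, 0) and (0, 0).
identity = [[1.0, 0.0], [0.0, 1.0]]
kappa, b = mpm_kappa_and_intercept(
    w=[0.5, 0.0], mu_pos=[2.0, 0.0], mu_neg=[0.0, 0.0],
    S_pos=identity, S_neg=identity,
)
```

With these symmetric toy moments the hyperplane $0.5\,x_1 = b$ sits at $x_1 = 1$, halfway between the two class means.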

Proposition 4.3 (Bures divergence). If $\phi \equiv B$, then $\tau^B_y(w) = \sqrt{\rho_y}\,\|w\|_2 + \sqrt{w^\top \hat\Sigma_y w}$ for all $y \in \mathcal{Y}$.

The proof of Theorem 4.2 follows by combining Propositions 3.4 and 4.3.

Figure 3: A 2D example of the Bures MPM hyperplanes with fixed ρ +1 = 0 and ρ -1 = 0.01 (red) and ρ -1 = 10 (green). The green line is pushed towards the favorable region (predicted as +1).

Figure 4: Benchmarks of fidelity and stability on the German and Student datasets. Higher local fidelity and lower stability are better.

Figure 5: Pareto frontier of the cost-validity trade-off on the German and Student datasets.

Figure 6: An illustration of MLP's decision boundary for the synthetic data.

Figure 7: Benchmarks of the local fidelity and stability on the synthetic and SBA datasets.

Figure 8: Benchmarks of the local fidelity and stability.

Figure 10: Pareto frontiers of our method compared with ROAR using vanilla MPM and SVM as the surrogate model. The recourses are generated with respect to the MLP classifier on synthetic and three real-world datasets.

Figure 11: Ablation study: Pareto frontiers of FR-MPM-PROJ compared to its ablations by alternating the FR-MPM by common local surrogates. The recourses are generated with respect to the MLP classifier on synthetic and three real-world datasets.

Figure 13: The comparison among MPM-related methods with different distances.

(a) ρ+1 = ρ-1 = 0. (b) ρ+1 = 0, ρ-1 = 1. (c) ρ+1 = 0, ρ-1 = 10.

Figure 14: Visualization of MPM's hyperplanes with Quadratic, Bures, and Fisher-Rao distances. When ρ +1 = ρ -1 = 0, all hyperplanes coincide and recover the vanilla MPM. All hyperplanes move towards the favorable class as the radius for the unfavorable class ρ -1 increases. At ρ -1 = 10 in Subfigure (c), the hyperplanes of Quadratic and Bures MPMs come close together which is distinct from the Fisher-Rao MPM's hyperplane. Notice that the Fisher-Rao MPM in Subfigure (c) tends to position in parallel to the major axis of the unfavorable covariance matrix, which shows the dependence on Σ -1 , see Proposition 5.4. The Bures and Quad MPM hyperplanes in Subfigure (c) do not show any dependence on the covariance matrix, which aligns with the results in Proposition 4.4.

(a) The average condition number of the generated covariance matrices for positive and negative classes. (b) Local fidelity of MPM variants at low sample sizes.

Figure 15: The comparison among MPM-related methods with different distances at low sample sizes.

Figure 16: The comparison of QUAD-MPM-PROJ and FR-MPM-PROJ at low sample sizes.

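Using the same argument as in Lanckriet et al. (2001) (see equation (4) and the discussion following it in Lanckriet et al. (2001)), we can show that the optimal $\theta^\star = (w^\star, b^\star)$ must classify $\hat\mu_y$ correctly, i.e., $y = \operatorname{sign}(w^{\star\top} \hat\mu_y - b^\star)$, $\sqrt{w^\top \Sigma_y w} \le y(w^\top \hat\mu_y - b)$. As a consequence, problem (10) is equivalent to

$\max\ \alpha$ s.t. $\alpha \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, $y(w^\top \hat\mu_y - b)$

Using that $\tau^\phi_y(w) = \max_{\Sigma_y \in \mathbb{S}^d_+ : \phi(\Sigma_y \| \hat\Sigma_y) \le \rho_y} \sqrt{w^\top \Sigma_y w}$ and that $\alpha \mapsto \sqrt{\alpha / (1 - \alpha)}$ is monotone increasing, the above problem is further equivalent to

$\max\ \kappa$ s.t. $\kappa \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, $y(w^\top \hat\mu_y - b) \ge \kappa\, \tau^\phi_y(w)\ \ \forall y \in \mathcal{Y}$. (12)

From the constraints, we get

$w^\top \hat\mu_{+1} - \kappa\, \tau^\phi_{+1}(w) \ge b \ge w^\top \hat\mu_{-1} + \kappa\, \tau^\phi_{-1}(w)$. (13)

So, we can eliminate the variable $b$ and reduce problem (12) to

$\max\ \kappa$ s.t. $\kappa \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $w^\top \hat\mu_{+1} - \kappa\, \tau^\phi_{+1}(w) \ge w^\top \hat\mu_{-1} + \kappa\, \tau^\phi_{-1}(w)$.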

can eliminate the variable $\kappa$ and rewrite problem (14) as min Using the definition of $\tau^\phi_y(w)$, we can see that the above problem is homogeneous in $w$, which implies that we could further re-write it as min that from (13) and (15), at optimality, we have

$\kappa = \frac{\sum_{y \in \mathcal{Y}} y\, w^\top \hat\mu_y}{\sum_{y \in \mathcal{Y}} \tau^\phi_y(w)} = \frac{1}{\sum_{y \in \mathcal{Y}} \tau^\phi_y(w)}$, and $b = w^\top \hat\mu_{+1} - \kappa\, \tau^\phi_{+1}(w) = w^\top \hat\mu_{-1} + \kappa\, \tau^\phi_{-1}(w)$.

This completes the proof.

Proof of Proposition 3.5. First, following exactly the same arguments as in the proof of Proposition 3.4, we see that problem (6) is equivalent to

$\max\ \alpha$ s.t. $\alpha \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, $1 - \alpha \ge \max_{P_y \in U^N_y(\hat P_y)} P_y(y(w^\top X - b) \le 0)\ \ \forall y \in \mathcal{Y}$. (16)

In the proof of Proposition 3.4, we handle the maximum in the constraint by decomposing it into two layers of maximization problems (see (11)). However, because of the Gaussian assumption, in this case, we have

$\max_{P_y \in U^N_y(\hat P_y)} P_y(y(w^\top X - b) \le 0) = \max_{\Sigma_y \in \mathbb{S}^d_+ : \phi(\Sigma_y \| \hat\Sigma_y) \le \rho_y,\ P_y \sim N(\hat\mu_y, \Sigma_y)} P_y(y(w^\top X - b) \le 0)$.

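of $\Phi$, the constraints in problem (17) become

$y(w^\top \hat\mu_y - b) \ge \Phi^{-1}(\alpha) \max_{\Sigma_y \in \mathbb{S}^d_+ : \phi(\Sigma_y \| \hat\Sigma_y) \le \rho_y} \sqrt{w^\top \Sigma_y w}\ \ \forall y \in \mathcal{Y}$.

Using that $\tau^\phi_y(w) = \max_{\Sigma_y \in \mathbb{S}^d_+ : \phi(\Sigma_y \| \hat\Sigma_y) \le \rho_y} \sqrt{w^\top \Sigma_y w}$ and that $\Phi^{-1}(\alpha)$ is monotone increasing, problem (17) is further equivalent to

$\max\ \kappa$ s.t. $\kappa \in \mathbb{R}_+$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, $y(w^\top \hat\mu_y - b) \ge \kappa\, \tau^\phi_y(w)\ \ \forall y \in \mathcal{Y}$.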
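The Gaussian step in the proof of Proposition 3.5 rests on the fact that, for $X \sim N(\hat\mu_y, \Sigma_y)$, the misclassification probability equals $1 - \Phi\big(y(w^\top \hat\mu_y - b) / \sqrt{w^\top \Sigma_y w}\big)$. A minimal sketch of that evaluation follows. The function names are hypothetical, and the standard normal CDF is computed from the error function in the standard library.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard Gaussian CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_misclassification(y, w, b, mu, Sigma) -> float:
    """P(y (w^T X - b) <= 0) for X ~ N(mu, Sigma), with label y in {+1, -1}."""
    d = len(w)
    margin = y * (sum(wi * mi for wi, mi in zip(w, mu)) - b)
    variance = sum(w[i] * Sigma[i][j] * w[j] for i in range(d) for j in range(d))
    return 1.0 - std_normal_cdf(margin / math.sqrt(variance))


# Toy check: a class mean exactly on the hyperplane is misclassified half the time.
p_on_boundary = gaussian_misclassification(+1, [1.0], 0.0, [0.0], [[1.0]])
# A class mean one standard deviation inside its own half-space.
p_one_sigma = gaussian_misclassification(+1, [1.0], 0.0, [1.0], [[1.0]])
```

Because the probability decreases in the margin and increases in the variance $w^\top \Sigma_y w$, the worst case over the ambiguity set is attained at the largest admissible variance, which is how $\tau^\phi_y(w)$ enters the constraint.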

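3. By Nguyen et al. (2021, Proposition 2.8), we have

$\tau^B_y(w)^2 = \inf_{\gamma I \succ ww^\top}\ \gamma(\rho_y - \operatorname{Tr}\hat\Sigma_y) + \gamma^2 \langle (\gamma I - ww^\top)^{-1}, \hat\Sigma_y \rangle$.

Using the Sherman-Morrison formula (Bernstein, 2009, Corollary 2.8.8), we find

$(I - \frac{1}{\gamma} ww^\top)^{-1} = I + \frac{ww^\top}{\gamma - \|w\|_2^2}$.

Notice that the constraint $\gamma I \succ ww^\top$ is equivalent to $\gamma > \|w\|_2^2$ by Schur complement. Thus, we have

$\tau^B_y(w)^2 = \inf_{\gamma > \|w\|_2^2}\ \gamma \rho_y + \frac{\gamma\, w^\top \hat\Sigma_y w}{\gamma - \|w\|_2^2}$.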
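The one-dimensional infimum above can be checked numerically. Writing $a = w^\top \hat\Sigma_y w$ and $n = \|w\|_2^2$ and substituting $t = \gamma - n$, the objective becomes $t\rho_y + an/t + n\rho_y + a$, whose minimum over $t > 0$ is $(\sqrt{\rho_y n} + \sqrt{a})^2$. The sketch below compares this closed form with a ternary search. The names and the toy values of $\rho_y$, $a$, $n$ are hypothetical, and the search is only a sanity check, not part of the authors' method.

```python
import math


def bures_dual_objective(gamma: float, rho: float, a: float, n: float) -> float:
    """gamma * rho + gamma * a / (gamma - n), defined for gamma > n."""
    return gamma * rho + gamma * a / (gamma - n)


def bures_tau_squared(rho: float, a: float, n: float) -> float:
    """Closed form (sqrt(rho * n) + sqrt(a))^2 with a = w^T Sigma w, n = ||w||^2."""
    return (math.sqrt(rho * n) + math.sqrt(a)) ** 2


def ternary_search_min(f, lo: float, hi: float, iters: int = 200) -> float:
    """Minimum value of a unimodal function f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f(0.5 * (lo + hi))


rho, a, n = 0.5, 2.0, 1.5  # toy values chosen for illustration
numeric = ternary_search_min(
    lambda g: bures_dual_objective(g, rho, a, n), n + 1e-9, n + 100.0
)
closed_form = bures_tau_squared(rho, a, n)
```

The objective is convex in $\gamma$ on $(n, \infty)$, so the ternary search is a valid way to approximate its minimum.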

Euclidean projection of the origin onto the hyperplane W. An elementary computation confirms thatτ F y (w) 2 = max v Z y v : log Z y F ≤ ρ y with v = Σ 1 2 w.We now proceed to show that the above optimization problem admits the maximizerZ y = U U + exp(ρ y ) vv v 2 2 ,

Performance of competing algorithms on real datasets. For the current and future validity, higher is better. For the cost, lower is better. Bold indicates the best performance.

Baselines. The experiment in Section 6.1 suggests that the Fisher-Rao model can be used to represent the class of covariance-robust MPMs. We use FR-MPM as the linear surrogate $\theta_F = (w_F, b_F)$ and a simple projection to generate the recourse by finding $x_r = \arg\min\{\|x - x_0\|_1 : x^\top w_F + b_F \ge 0\}$.
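Without actionability constraints, the $\ell_1$ projection onto a half-space has a simple closed form: by Hölder's inequality, the cheapest move changes only the coordinate with the largest $|w_j|$. The sketch below illustrates this standard fact under that unconstrained assumption. The function name is hypothetical, and this is not necessarily how the released code computes the projection.

```python
def l1_projection_onto_halfspace(x0, w, b):
    """Solve min ||x - x0||_1 subject to x^T w + b >= 0 (no feature constraints).

    If x0 is already feasible it is returned unchanged. Otherwise only the
    coordinate with the largest |w_j| is moved, just enough to reach the
    hyperplane, which costs |x0^T w + b| / max_j |w_j|.
    """
    score = sum(wi * xi for wi, xi in zip(w, x0)) + b
    if score >= 0:
        return list(x0)
    j = max(range(len(w)), key=lambda i: abs(w[i]))
    if w[j] == 0:
        raise ValueError("w must be non-zero for a feasible recourse to exist")
    x = list(x0)
    x[j] -= score / w[j]  # afterwards x^T w + b = score - score = 0
    return x


x0 = [0.0, 0.0]
w_F, b_F = [1.0, 2.0], -4.0  # toy surrogate, chosen for illustration
x_r = l1_projection_onto_halfspace(x0, w_F, b_F)
```

Here the second feature has the larger weight, so the recourse moves only that coordinate, at an $\ell_1$ cost of 2.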

Performance of AR using different local surrogate models.

Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10-19, 2019.

Sahil Verma, John Dickerson, and Keegan Hines. Counterfactual explanations for machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.

Accuracy and AUC results of the classifiers on the synthetic and three real-world datasets.

Performance of competing algorithms on synthetic, SBA, and Student datasets.

Performance of AR using different local surrogate models.

± 0.07  0.67 ± 0.02  0.67 ± 0.01  4.50 ± 3.48  0.10 ± 0.07  0.04 ± 0.03  3.38 ± 0.24  0.45 ± 0.07  0.42 ± 0.06
CLIME-AR  1.34 ± 0.05  0.41 ± 0.02  0.43 ± 0.01  4.10 ± 3.80  0.07 ± 0.07  0.01 ± 0.01  4.34 ± 1.24  0.63 ± 0.28  0.50 ± 0.15
LIMELS-AR  1.05 ± 0.05  0.62 ± 0.06  0.59 ± 0.04  4.97 ± 3.76  0.22 ± 0.15  0.06 ± 0.07  3.15 ± 0.16  0.49 ± 0.13  0.38 ± 0.03
SVM-AR

REPRODUCIBILITY STATEMENT

In order to foster reproducibility, we have released all source code and scripts used to replicate our experimental results at https://anonymous.4open.science/r/mpm-recourse. The repository includes source code, datasets, configurations, and instructions; thus, one can reproduce our results with a few commands.

We use the original authors' implementations of LIME³ (Ribeiro et al., 2016) and AR⁴ (Ustun et al., 2019). We use CARLA's well-known implementation⁵ of Wachter (Wachter et al., 2017). Since we could not find open-source code for CLIME (Agarwal et al., 2021), LIMELS (Laugel et al., 2018), and ROAR (Upadhyay et al., 2021), we implemented them according to their papers. CLIME and LIMELS are adapted from LIME's source code, while ROAR is adapted from CARLA's source code for Wachter.

The hyperparameter configurations for our methods and the other baselines are stated in Section A, Appendix A.1 and Appendix A.2 and are also stored in the repository. Surrogates sharing the same local sampler use the same random seed and therefore the same synthesized samples. The hyperparameters that affect the baselines' performance, such as $\lambda$ and the probabilistic threshold of Wachter and ROAR, are studied in Appendix A.2.

The remaining proofs and theoretical claims are provided in Appendix B.

sample size. When increasing the number of samples, the recourses constructed with the Fisher-Rao MPM exhibit a better cost-validity trade-off. This result is consistent with our previous observation in the local fidelity experiment.

B PROOFS

where we used the classification rule that C θ (x) = sign(w x -b) if and only if y(w X -b) ≥ 0 and that the feasible set takes the form Θ {θ = (w, b) ∈ R d+1 : w = 0}. We claim that w = 0 is never optimal. To see this, take w = 0. Then, P y (y(w X -b) ≤ 0) = P y (yb ≥ 0), which is independent of the random variable X and is either 0 or 1 no matter what b we choose. Therefore, maxand hence α = 0, which is never optimal. So, the domain of w can be relaxed from R d \ {0} to R d . Problem (9) can then be further re-written asNotice that here, equivalency means the optimal solution (w , b ) of (10) will constitute the optimal solution θ = (w , b ) of the original min-max-max problem. Moreover, recall the definition of the ambiguity set U ϕ y ( P y ) = {P y : P y ∼ ( µ y , Σ y ), ϕ(Σ y Σ y ) ≤ ρ y }, where P y ∼ ( µ y , Σ y ) means that the the distribution P y has mean µ y and covariance Σ y . In other words, each element P y in the ambiguity U ϕ y ( P y ) is determined by first choosing a covariance matrix Σ y satisfying the divergence constraint ϕ(Σ y Σ y ) ≤ ρ y and then picking a distribution P y having mean µ y and covariance Σ y . Therefore, the worst-case probability admits a two-layer decomposition max Py∈U ϕ y ( Py)Using Lanckriet et al. (2001, Equation (6) ), the inner maximum value is given by max Py∼( µy,Σy).Combining the last two equalities, we can express the constraint in problem (10) aswhere U is an d × (d -1) orthonormal matrix whose columns are orthogonal to v. First, by Nguyen et al. (2019, Lemma C.1), the feasible region is compact. Since the objective function v Z Y v is continuous in Z y , an optimal solution Z y exists. Next, we first claim that the constraint holds with equality at optimality. Suppose that log Z y F < ρ y . Then, for some small δ > 0, the matrix Z y + δ vv is feasible due to the continuity of the constraint function log Z y F and has a strictly better objective value than the optimal solution Z y . 
This violates the optimality of Z y . Hence, log Z y F = ρ y for any optimal solution Z y , and the problem is equivalent towhere O(d) is the set of d × d orthogonal matrices. For any orthogonal matrix Q, the objective function v QDiag(λ)Q v ≤ λ 1 v 2 2 , the right-hand side of which can be attained by settingTherefore, our problem is further reduced toIt is then easy to see that at optimality, the optimal λ ∈ R d ++ must satisfySince λ 1 ≥ λ 2 = 1, we have log λ 1 = ρ y and hence λ 1 = exp(ρ y ). In other words,The corresponding optimal value isThis completes the proof.We are now ready to prove Theorem 5.2.Proof of Theorem 5.2. Using the Fisher-Rao divergence F, the optimization problembecomes problem (8) by exploiting the analytical form of τ F y (w) in Proposition 5.3. By invoking Proposition 3.4, we obtain the postulated results on the optimal solution θ F for the case of the Fisher-Rao divergence.Proof of Proposition 5.4. Note that problem (8) has a unique solution because the objective function is strictly convex and coercive. Also, the optimal solution of ( 8 . By a compactification of W and applying Berge's maximum theorem (Berge, 1963, pp. 115-116) , the function w (λ) is continuous on a non-negative compact range of λ, and converges to w (0) as λ → 0. The optimal solution w (0) coincides with the solution of min w∈W w Σ y w.Because the square-root function is monotonically increasing, w (0) also solves min w∈W w Σ y w, which is a convex, quadratic program with a single linear constraint. If a is defined as in the statement, then a convex optimization argument implieswhich completes the proof.

C LOGDET MPM

In this appendix, we consider the case where $\phi$ is the Log-Determinant (LogDet) divergence. The LogDet divergence is formally defined as follows.

Definition C.1 (LogDet divergence). Given two positive definite matrices $\Sigma, \hat\Sigma \in \mathbb{S}^d_{++}$, the log-determinant divergence between them is

It can be shown that $D$ is a divergence because it is non-negative, and it vanishes to zero if and only if $\Sigma = \hat\Sigma$. However, $D$ is not symmetric, and in general we have $D(\Sigma \| \hat\Sigma) \neq D(\hat\Sigma \| \Sigma)$. The LogDet divergence $D$ is related to the relative entropy: it is equal to the Kullback-Leibler divergence between two Gaussian distributions with the same mean vector and covariance matrices $\Sigma$ and $\hat\Sigma$.

We now provide the form of the LogDet MPM problem.

Theorem C.2 (LogDet MPM). Suppose that $\phi \equiv D$. Let $w^D$ be the optimal solution of the following second-order cone problem

where $c_y = -W_{-1}(-\exp(-\rho_y - 1))$ and $W_{-1}$ is the Lambert-W function for the branch $-1$. Let $\kappa^D$ and $b^D$ be calculated as

is the optimal solution of the distributionally robust MPM problem (3).

Theorem C.2 shows that the LogDet divergence induces a similar reweighting scheme as the Fisher-Rao MPM. The asymptotic analysis of the LogDet MPM follows similarly from the Fisher-Rao MPM and is omitted. The proof of Theorem C.2 follows trivially from the result below, which provides the analytical form of $\tau^D_y(w)$.

Proposition C.3 (LogDet divergence). Suppose that $\phi \equiv D$, then for any $y \in \mathcal{Y}$, we have

where $W_{-1}$ is the Lambert-W function for the branch $-1$.

Using the matrix determinant formula (Bernstein, 2009)

The first-order optimality condition for $\gamma$ is

$\rho_y - \log\left(1 - \frac{w^\top \hat\Sigma_y w}{\gamma}\right) - \frac{w^\top \hat\Sigma_y w}{\gamma - w^\top \hat\Sigma_y w} = 0$,

and the optimal solution for $\gamma$ is

$\gamma^\star = \frac{w^\top \hat\Sigma_y w}{1 + 1 / W_{-1}(-\exp(-\rho_y - 1))}$.

Replacing the value of $\gamma^\star$ into the objective function leads to the necessary result.
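The weight $c_y = -W_{-1}(-\exp(-\rho_y - 1))$ can be computed without a Lambert-W routine. Setting $c = -W$ in the defining identity $W e^{W} = -\exp(-\rho_y - 1)$ and taking logarithms shows that $c_y$ is the root $c \ge 1$ of $c - \log c - 1 = \rho_y$, which a bisection finds easily. The sketch below relies on that reformulation. The function name is hypothetical, and the bisection is a stand-in chosen here for self-containment, not the authors' implementation.

```python
import math


def logdet_weight(rho: float) -> float:
    """Return c = -W_{-1}(-exp(-rho - 1)), the root c >= 1 of c - log(c) - 1 = rho."""
    if rho < 0:
        raise ValueError("the radius rho must be non-negative")

    def residual(c: float) -> float:
        return c - math.log(c) - 1.0 - rho

    lo, hi = 1.0, 2.0
    while residual(hi) < 0:  # residual is increasing on [1, inf); grow the bracket
        hi *= 2.0
    for _ in range(200):  # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


c_zero = logdet_weight(0.0)  # no ambiguity: the weight is 1
c_one = logdet_weight(1.0)   # larger radius: the weight grows
```

The weight equals 1 at $\rho_y = 0$ and increases with the radius, in line with the reweighting interpretation stated after Theorem C.2.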

