DISTRIBUTIONALLY ROBUST RECOURSE ACTION

Abstract

A recourse action aims to explain a particular algorithmic decision by showing one specific way in which the instance could be modified to receive an alternate outcome. Existing recourse generation methods often assume that the machine learning model does not change over time. However, this assumption does not always hold in practice because of data distribution shifts, and in this case, the recourse action may become invalid. To redress this shortcoming, we propose the Distributionally Robust Recourse Action (DiRRAc) framework, which generates a recourse action that has a high probability of being valid under a mixture of model shifts. We formulate the robustified recourse setup as a min-max optimization problem, where the max problem is specified by Gelbrich distance over an ambiguity set around the distribution of model parameters. Then we suggest a projected gradient descent algorithm to find a robust recourse according to the min-max objective. We show that our DiRRAc framework can be extended to hedge against the misspecification of the mixture weights. Numerical experiments with both synthetic and three real-world datasets demonstrate the benefits of our proposed framework over state-of-the-art recourse methods.

1. INTRODUCTION

Post-hoc explanations of machine learning models are useful for understanding and making reliable predictions in consequential domains such as loan approvals, college admission, and healthcare. Recently, recourse has been rising as an attractive tool to diagnose why machine learning models have made a particular decision for a given instance. A recourse action provides a possible modification of the given instance to receive an alternate decision (Ustun et al., 2019). Consider, for example, the case of loan approvals in which a credit application is rejected. Recourse will offer the reasons for rejection by showing what the application package should have been to get approved. A concrete example of a recourse in this case may be "the monthly salary should be higher by $500" or "20% of the current debt should be reduced". A recourse action has a positive, forward-looking meaning: it lists a directive modification that a person should implement to obtain a more favorable outcome in the future. If a machine learning system can accompany each negative outcome with a corresponding recourse action, it can improve user engagement and boost interpretability at the same time (Ustun et al., 2019; Karimi et al., 2021b). Explanations thus play a central role in the future development of human-computer interaction as well as human-centric machine learning.

Despite its attractiveness, providing recourse for the negative instances is not a trivial task. For real-world implementation, designing a recourse needs to strike an intricate balance between conflicting criteria. First and foremost, a recourse action should be feasible: if the prescribed action is taken, then the prediction of a machine learning model should be flipped. Further, to avoid making a drastic change to the characteristics of the input instance, a framework for generating recourse should minimize the cost of implementing the recourse action.
An algorithm for finding recourse must make changes only to features that are actionable and should leave immutable features (relatively) unchanged. For example, we must consider the date of birth as an immutable feature; in contrast, we can consider salary or debt amount as actionable features.

Various solutions have been proposed to provide recourses for a model prediction (Karimi et al., 2021b; Stepin et al., 2021; Artelt & Hammer, 2019; Pawelczyk et al., 2021; 2020; Verma et al., 2020). For instance, Ustun et al. (2019) used an integer programming approach to obtain actionable recourses and also provide a feasibility guarantee for linear models. Karimi et al. (2020) proposed a model-agnostic approach to generate the nearest counterfactual explanations and focus on structured data. Dandl et al. (2020) proposed a method that finds the counterfactual by solving a multi-objective optimization problem. Recently, Russell (2019) and Mothilal et al. (2020) focus on finding a set of multiple diverse recourse actions, where the diversity is imposed by a rule-based approach or by internalizing a determinantal point process cost in the objective function.

These aforementioned approaches make a fundamental assumption that the machine learning model does not change over time. However, in practice this assumption rarely holds. In fact, data shifts are so common in machine learning that they have sparked the emerging fields of domain generalization and domain adaptation. Organizations usually retrain models as a response to data shifts, and this induces corresponding shifts in the parameters of the machine learning models, which in turn cause serious concerns for the feasibility of the recourse action in the future (Rawal et al., 2021). In fact, all of the aforementioned approaches design an action that is feasible only with the current model parameters, and they provide no feasibility guarantee for the future parameters.
If a recourse action fails to generate a favorable outcome in the future, then the recourse action may become less beneficial (Venkatasubramanian & Alfano, 2020), the pledge of a brighter outcome is shattered, and the trust in the machine learning system is lost (Rudin, 2019). To tackle this challenge, Upadhyay et al. (2021) proposed ROAR, a framework for generating instance-level recourses that are robust to shifts in the underlying predictive model. ROAR used a robust optimization approach that hedges against an uncertainty set containing plausible values of the future model parameters. However, it is well-known that robust optimization solutions can be overly conservative because they may hedge against a pathological parameter in the uncertainty set (Ben-Tal et al., 2017; Roos & den Hertog, 2020). A promising approach that can promote robustness while at the same time avoiding over-conservatism is the distributionally robust optimization framework (El Ghaoui et al., 2003; Delage & Ye, 2010; Rahimian & Mehrotra, 2019; Bertsimas et al., 2018). This framework models the future model parameters as random variables whose underlying distribution is unknown but is likely to be contained in an ambiguity set. The solution is designed to counter the worst-case distribution in the ambiguity set in a min-max sense. Distributionally robust optimization is also gaining popularity in many estimation and prediction tasks in machine learning (Namkoong & Duchi, 2017; Kuhn et al., 2019).

Contributions. This paper combines ideas and techniques from two principal branches of explainable artificial intelligence, counterfactual explanations and robustness, to resolve the recourse problem under uncertainty. Concretely, our main contributions are the following:

1. We propose the framework of Distributionally Robust Recourse Action (DiRRAc) for designing a recourse action that is robust to mixture shifts of the model parameters. Our DiRRAc maximizes the probability that the action is feasible with respect to a mixture shift of model parameters while at the same time confining the action to a neighborhood of the input instance. Moreover, the DiRRAc model also hedges against the misspecification of the nominal distribution using a min-max form with a mixture ambiguity set prescribed by moment information.
2. We reformulate the DiRRAc problem into a finite-dimensional optimization problem with an explicit objective function. We also provide a projected gradient descent algorithm to solve the problem.
3. We extend our DiRRAc framework along several axes to handle mixture weight uncertainty, to minimize the worst-case component probability of receiving the unfavorable outcome, and also to incorporate the Gaussian parametric information.

We first describe the recourse action problem with mixture shifts in Section 2. In Section 3, we present our proposed DiRRAc framework, its reformulation and the numerical routine for solving it. The extension to the parametric Gaussian setting is discussed in Section 4. Section 5 reports the numerical experiments showing the benefits of the DiRRAc framework and its extensions.

Notations. For each positive integer $K$, we write $[K] = \{1, \ldots, K\}$. We use $\mathbb{S}_+^d$ ($\mathbb{S}_{++}^d$) to denote the space of $d \times d$ symmetric positive semidefinite (definite) matrices. For any $A \in \mathbb{R}^{d \times d}$, the trace operator is $\mathrm{Tr}[A] = \sum_{i=1}^d A_{ii}$. If a distribution $Q_k$ has mean $\mu_k$ and covariance matrix $\Sigma_k$, we write $Q_k \sim (\mu_k, \Sigma_k)$. If additionally $Q_k$ is Gaussian, we write $Q_k \sim \mathcal{N}(\mu_k, \Sigma_k)$. Writing $Q \sim (Q_k, p_k)_{k \in [K]}$ means that $Q$ is a mixture of $K$ components, where the $k$-th component has weight $p_k$ and distribution $Q_k$.

2. RECOURSE ACTION UNDER MIXTURE SHIFTS

We consider a binary classification setting with label set $\mathcal{Y} = \{0, 1\}$, where 0 represents the unfavorable outcome while 1 denotes the favorable one. The covariate space is $\mathbb{R}^d$, and any linear classifier $C_\theta : \mathbb{R}^d \to \mathcal{Y}$ characterized by the $d$-dimensional parameter $\theta$ is of the form

$$C_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

Note that the bias term can be internalized into $\theta$ by adding an extra dimension, and thus it is omitted. Suppose that at this moment ($t = 0$), the current classifier is parametrized by $\theta_0$, and we are given an input instance $x_0 \in \mathbb{R}^d$ with unfavorable outcome, that is, $C_{\theta_0}(x_0) = 0$. One period of time from now ($t = 1$), the parameters of the predictive model will change stochastically and are represented by a $d$-dimensional random vector $\tilde{\theta}$. This paper focuses on finding a recourse action $x$ which is reasonably close to the instance $x_0$, and at the same time, has a high probability of receiving a favorable outcome in the future. Figure 1 gives a bird's eye view of the setup.

Figure 1: A canonical setup of the recourse action under mixture shifts problem.

To measure the closeness between the action $x$ and the input $x_0$, we assume that the covariate space is endowed with a non-negative, continuous cost function $c$. In addition, suppose temporarily that $\tilde{\theta}$ follows a distribution $\hat{\mathbb{P}}$. Because maximizing the probability of the favorable outcome is equivalent to minimizing the probability of the unfavorable outcome, the recourse can be found by solving

$$\min \left\{ \hat{\mathbb{P}}(C_{\tilde{\theta}}(x) = 0) \;:\; x \in \mathbb{X},\ c(x, x_0) \le \delta \right\}. \quad (1)$$

The parameter $\delta \ge 0$ in (1) governs how far a recourse action can be from the input instance $x_0$. Note that we constrain $x$ to a set $\mathbb{X}$ which captures operational constraints, for example, the highest education of a credit applicant should not decrease over time.

In this paper, we model the random vector $\tilde{\theta}$ using a finite mixture of distributions with $K$ components, where the mixture weights $p$ satisfy $\sum_{k \in [K]} p_k = 1$.
Each component in the mixture represents one specific type of model shift: the weights $p$ reflect the proportion of the shift types while the component distribution $\hat{\mathbb{P}}_k$ represents the (conditional) distribution of the future model parameters in the $k$-th shift. Further information on mixture distributions and their applications in machine learning can be found in (Murphy, 2012, §3.5). Note that the mixture model is not a strong assumption. It is well-known that the Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components (Goodfellow et al., 2016; McLachlan & Peel, 2000). Thus, our mixture models are flexible enough to hedge against distributional perturbations of the parameters under large values of $K$. The design of the ambiguity set to handle ambiguous mixture weights and under the Gaussian assumption is extensively studied in the literature on distributionally robust optimization (Hanasusanto et al., 2015; Chen & Xie, 2021).

If each $\hat{\mathbb{P}}_k$ is a Gaussian distribution $\mathcal{N}(\hat{\theta}_k, \hat{\Sigma}_k)$, then $\hat{\mathbb{P}}$ is a mixture of Gaussian distributions. The objective of problem (1) can be expressed as

$$\hat{\mathbb{P}}(C_{\tilde{\theta}}(x) = 0) = \sum_{k \in [K]} p_k \hat{\mathbb{P}}_k(C_{\tilde{\theta}}(x) = 0) = \sum_{k \in [K]} p_k \Phi\!\left( \frac{-x^\top \hat{\theta}_k}{\sqrt{x^\top \hat{\Sigma}_k x}} \right),$$

where the first equality follows from the law of total probability, the second equality holds because $\tilde{\theta}^\top x$ is Gaussian with mean $\hat{\theta}_k^\top x$ and variance $x^\top \hat{\Sigma}_k x$ under $\hat{\mathbb{P}}_k$, and $\Phi$ is the cumulative distribution function of a standard Gaussian distribution. Under the Gaussian assumption, we can solve (1) using a projected gradient descent type of algorithm (Boyd & Vandenberghe, 2004).

Remark 2.1 (Nonlinear models). Our analysis focuses on linear classifiers, which is a common setup in the literature (Upadhyay et al., 2021; Ustun et al., 2019; Rawal et al., 2021; Karimi et al., 2020; Wachter et al., 2018; Ribeiro et al., 2016). To extend to nonlinear classifiers, we can follow a similar approach as in Rawal & Lakkaraju (2020b) and Upadhyay et al. (2021) by first using LIME (Ribeiro et al., 2016) to approximate the nonlinear classifiers locally with an interpretable linear model, then subsequently applying our framework.
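The Gaussian-mixture objective of problem (1) can be evaluated directly. The following is a minimal sketch, assuming NumPy arrays for the nominal moments; the function names `std_normal_cdf` and `unfavorable_probability` are illustrative and do not come from the paper's code.

```python
import math
import numpy as np


def std_normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def unfavorable_probability(x, weights, means, covs):
    """P(theta^T x <= 0) when theta follows a Gaussian mixture.

    weights: mixture weights p_k, means: nominal means theta_hat_k,
    covs: nominal covariances Sigma_hat_k (assumed positive definite).
    """
    total = 0.0
    for p_k, theta_k, sigma_k in zip(weights, means, covs):
        mean_score = float(theta_k @ x)                 # mean of theta^T x
        std_score = math.sqrt(float(x @ sigma_k @ x))   # std of theta^T x
        total += p_k * std_normal_cdf(-mean_score / std_score)
    return total
```

With a single component whose mean score is zero, the function returns one half, as expected from the symmetry of the Gaussian distribution.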

3. DISTRIBUTIONALLY ROBUST RECOURSE ACTION FRAMEWORK

Our Distributionally Robust Recourse Action (DiRRAc) framework robustifies formulation (1) by relaxing the parametric assumption and hedging against distribution misspecification. First, we assume that the mixture components $\hat{\mathbb{P}}_k$ are specified only through moment information, and no particular parametric form of the distribution is imposed. In effect, $\hat{\mathbb{P}}_k$ is assumed to have mean vector $\hat{\theta}_k \in \mathbb{R}^d$ and positive definite covariance matrix $\hat{\Sigma}_k \succ 0$. Second, we leverage ideas from distributionally robust optimization to propose a min-max formulation of (1), in which we consider an ambiguity set which contains a family of probability distributions that are sufficiently close to the nominal distribution $\hat{\mathbb{P}}$. We prescribe the ambiguity set using the Gelbrich distance (Gelbrich, 1990).

Definition 3.1 (Gelbrich distance). The Gelbrich distance $\mathbb{G}$ between two tuples $(\theta, \Sigma) \in \mathbb{R}^d \times \mathbb{S}_+^d$ and $(\hat{\theta}, \hat{\Sigma}) \in \mathbb{R}^d \times \mathbb{S}_+^d$ amounts to

$$\mathbb{G}\big((\theta, \Sigma), (\hat{\theta}, \hat{\Sigma})\big) \triangleq \sqrt{ \|\theta - \hat{\theta}\|_2^2 + \mathrm{Tr}\Big[ \Sigma + \hat{\Sigma} - 2 \big( \hat{\Sigma}^{\frac{1}{2}} \Sigma \hat{\Sigma}^{\frac{1}{2}} \big)^{\frac{1}{2}} \Big] }.$$

It is easy to verify that $\mathbb{G}$ is non-negative and symmetric, and that it vanishes if and only if $(\theta, \Sigma) = (\hat{\theta}, \hat{\Sigma})$. Further, $\mathbb{G}$ is a distance on $\mathbb{R}^d \times \mathbb{S}_+^d$ because it coincides with the type-2 Wasserstein distance between two Gaussian distributions $\mathcal{N}(\theta, \Sigma)$ and $\mathcal{N}(\hat{\theta}, \hat{\Sigma})$ (Givens & Shortt, 1984). Distributionally robust formulations with moment information prescribed by the $\mathbb{G}$ distance are computationally tractable under mild conditions, deliver reasonable performance guarantees and also generate a conservative approximation of the Wasserstein distributionally robust optimization problem (Kuhn et al., 2019; Nguyen et al., 2021).

In this paper, we use the Gelbrich distance $\mathbb{G}$ to form a neighborhood around each $\hat{\mathbb{P}}_k$ as

$$\mathcal{B}_k(\hat{\mathbb{P}}_k) \triangleq \left\{ Q_k \;:\; Q_k \sim (\theta_k, \Sigma_k),\ \mathbb{G}\big((\theta_k, \Sigma_k), (\hat{\theta}_k, \hat{\Sigma}_k)\big) \le \rho_k \right\}.$$

Intuitively, one can view $\mathcal{B}_k(\hat{\mathbb{P}}_k)$ as a ball centered at the nominal component $\hat{\mathbb{P}}_k$ of radius $\rho_k \ge 0$ prescribed using the distance $\mathbb{G}$.
This component set $\mathcal{B}_k(\hat{\mathbb{P}}_k)$ is non-parametric, and the first two moments of $Q_k$ are sufficient to decide whether $Q_k$ belongs to $\mathcal{B}_k(\hat{\mathbb{P}}_k)$. Moreover, if $Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)$, then any distribution $Q_k'$ with the same mean vector and covariance matrix as $Q_k$ also belongs to $\mathcal{B}_k(\hat{\mathbb{P}}_k)$. Notice that even when the radius $\rho_k$ is zero, the component set $\mathcal{B}_k(\hat{\mathbb{P}}_k)$ does not collapse into a singleton. Instead, if $\rho_k = 0$ then $\mathcal{B}_k(\hat{\mathbb{P}}_k)$ still contains all distributions that share the moments $(\hat{\theta}_k, \hat{\Sigma}_k)$ with the nominal component distribution $\hat{\mathbb{P}}_k$, and consequently it retains a robustification effect against the parametric assumption on $\hat{\mathbb{P}}_k$. The component sets are utilized to construct the ambiguity set for the mixture distribution as

$$\mathcal{B}(\hat{\mathbb{P}}) \triangleq \left\{ Q \;:\; \exists Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)\ \forall k \in [K] \text{ such that } Q \sim (Q_k, p_k)_{k \in [K]} \right\}.$$

Any $Q \in \mathcal{B}(\hat{\mathbb{P}})$ is also a mixture distribution with $K$ components, with the same mixture weights $p$. Thus, $\mathcal{B}(\hat{\mathbb{P}})$ contains all perturbations of $\hat{\mathbb{P}}$ induced separately on each component by $\mathcal{B}_k(\hat{\mathbb{P}}_k)$.

We are now ready to introduce our DiRRAc model, which is a min-max problem of the form

$$\begin{array}{cl} \inf\limits_{x \in \mathbb{X}} & \sup\limits_{Q \in \mathcal{B}(\hat{\mathbb{P}})} Q(C_{\tilde{\theta}}(x) = 0) \\ \text{s.t.} & c(x, x_0) \le \delta \\ & \sup\limits_{Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)} Q_k(C_{\tilde{\theta}}(x) = 0) < 1 \quad \forall k \in [K]. \end{array} \quad (2)$$

The objective of (2) is to minimize the worst-case probability of unfavorable outcome of the recourse action. Moreover, the last constraint imposes that for each component, the worst-case conditional probability of unfavorable outcome should be strictly less than one. Put differently, this last constraint requires that the action should be able to lead to a favorable outcome for any distribution in $\mathcal{B}_k(\hat{\mathbb{P}}_k)$. By definition, each supremum subproblem in (2) is an infinite-dimensional maximization problem over the space of probability distributions, and thus it is inherently difficult. Fortunately, because we use the Gelbrich distance to prescribe the set $\mathcal{B}_k(\hat{\mathbb{P}}_k)$, we can solve these maximization problems analytically.
This consequently leads to a closed-form reformulation of the DiRRAc model into a finite-dimensional problem. Next, we will reformulate the DiRRAc problem (2), provide a sketch of the proof and propose a numerical solution routine.
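Membership in a component set depends only on the Gelbrich distance between two moment pairs, so it can be checked numerically. The sketch below assumes NumPy and computes matrix square roots by eigendecomposition; the helper names `psd_sqrt`, `gelbrich_distance` and `in_component_ball` are hypothetical.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def gelbrich_distance(theta, sigma, theta_hat, sigma_hat):
    """Gelbrich distance between (theta, sigma) and (theta_hat, sigma_hat)."""
    root_hat = psd_sqrt(sigma_hat)
    cross = psd_sqrt(root_hat @ sigma @ root_hat)
    cov_term = float(np.trace(sigma + sigma_hat - 2.0 * cross))
    mean_term = float(np.sum((theta - theta_hat) ** 2))
    return float(np.sqrt(max(mean_term + cov_term, 0.0)))


def in_component_ball(theta, sigma, theta_hat, sigma_hat, rho):
    """True if a distribution with moments (theta, sigma) lies in the ball of radius rho."""
    return gelbrich_distance(theta, sigma, theta_hat, sigma_hat) <= rho
```

For example, in two dimensions the distance between $(0, I)$ and $(0, 4I)$ is $\sqrt{\mathrm{Tr}[I + 4I - 4I]} = \sqrt{2}$.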

3.1. REFORMULATION OF DIRRAC

Each supremum in (2) is an infinite-dimensional optimization problem on the space of probability distributions. We now show that (2) can be reformulated as a finite-dimensional problem. Towards this end, let $\mathcal{X}$ be the following $d$-dimensional set:

$$\mathcal{X} \triangleq \left\{ x \in \mathbb{X} \;:\; c(x, x_0) \le \delta,\ -\hat{\theta}_k^\top x + \rho_k \|x\|_2 < 0 \quad \forall k \in [K] \right\}. \quad (3)$$

The next theorem asserts that the DiRRAc problem (2) can be reformulated as a $d$-dimensional optimization problem with an explicit, but complicated, objective function.

Theorem 3.2 (Equivalent form of DiRRAc). Problem (2) is equivalent to the finite-dimensional optimization problem

$$\inf_{x \in \mathcal{X}} \sum_{k \in [K]} p_k f_k(x)^2, \quad (4)$$

where the function $f_k$ admits the closed-form expression

$$f_k(x) = \frac{ \rho_k \hat{\theta}_k^\top x \|x\|_2 + \sqrt{x^\top \hat{\Sigma}_k x} \sqrt{ (\hat{\theta}_k^\top x)^2 + x^\top \hat{\Sigma}_k x - \rho_k^2 \|x\|_2^2 } }{ (\hat{\theta}_k^\top x)^2 + x^\top \hat{\Sigma}_k x }.$$

Next, we sketch a proof of Theorem 3.2 and a solution procedure to solve problem (4).
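The objective of problem (4) is cheap to evaluate once the nominal moments are given. The following is a minimal sketch, assuming NumPy arrays as inputs; the function names are illustrative, and points that violate the strict margin constraint in (3) are assigned worst-case probability one.

```python
import numpy as np


def worst_case_probability(x, theta_hat, sigma_hat, rho):
    """Squared closed-form expression f_k(x)^2 of Theorem 3.2 for one component."""
    a = float(theta_hat @ x)             # nominal score theta_hat^T x
    b2 = float(x @ sigma_hat @ x)        # variance term x^T Sigma_hat x
    c = rho * float(np.linalg.norm(x))   # robust margin rho * ||x||_2
    if a - c <= 0.0:
        return 1.0  # x violates the margin constraint that defines the set in (3)
    disc = a * a + b2 - c * c            # positive whenever a > c >= 0
    f = (c * a + np.sqrt(b2) * np.sqrt(disc)) / (a * a + b2)
    return float(f * f)


def dirrac_objective(x, weights, thetas, sigmas, rhos):
    """Weighted sum of the component-wise worst-case probabilities."""
    return sum(
        p * worst_case_probability(x, t, s, r)
        for p, t, s, r in zip(weights, thetas, sigmas, rhos)
    )
```

When $\rho_k = 0$, the expression reduces to $x^\top \hat{\Sigma}_k x / ((\hat{\theta}_k^\top x)^2 + x^\top \hat{\Sigma}_k x)$, which is the one-sided Chebyshev (Cantelli) bound, and the value grows as $\rho_k$ increases.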

3.2. PROOF SKETCH

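For any component $k \in [K]$, define the following worst-case probability of unfavorable outcome

$$f_k(x) \triangleq \sup_{Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)} Q_k(C_{\tilde{\theta}}(x) = 0) = \sup_{Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)} Q_k(\tilde{\theta}^\top x \le 0) \quad \forall k \in [K].$$

In this proof sketch, $f_k$ denotes the worst-case probability itself, which corresponds to the square of the closed-form expression displayed in Theorem 3.2. To proceed, we rely on the following elementary result from (Nguyen, 2019, Lemma 3.31).

Lemma 3.3 (Worst-case Value-at-Risk). For any $x \in \mathbb{R}^d$ and $\beta \in (0, 1)$, we have

$$\inf \left\{ \tau \;:\; \sup_{Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)} Q_k(\tilde{\theta}^\top x \le -\tau) \le \beta \right\} = -\hat{\theta}_k^\top x + \sqrt{\frac{1 - \beta}{\beta}} \sqrt{x^\top \hat{\Sigma}_k x} + \frac{\rho_k}{\sqrt{\beta}} \|x\|_2. \quad (6)$$

Note that the left-hand side of (6) is the worst-case Value-at-Risk with respect to the ambiguity set $\mathcal{B}_k(\hat{\mathbb{P}}_k)$. Leveraging this result, the next proposition provides the analytical form of $f_k(x)$.

Proposition 3.4 (Worst-case probability). For any $k \in [K]$ and $(\hat{\theta}_k, \hat{\Sigma}_k, \rho_k) \in \mathbb{R}^d \times \mathbb{S}_+^d \times \mathbb{R}_+$, define the following constants

$$A_k \triangleq -\hat{\theta}_k^\top x, \quad B_k \triangleq \sqrt{x^\top \hat{\Sigma}_k x}, \quad \text{and} \quad C_k \triangleq \rho_k \|x\|_2.$$

We have

$$f_k(x) \triangleq \sup_{Q_k \in \mathcal{B}_k(\hat{\mathbb{P}}_k)} Q_k(\tilde{\theta}^\top x \le 0) = \begin{cases} 1 & \text{if } A_k + C_k \ge 0, \\ \left( \dfrac{ -A_k C_k + B_k \sqrt{A_k^2 + B_k^2 - C_k^2} }{ A_k^2 + B_k^2 } \right)^2 \in (0, 1) & \text{if } A_k + C_k < 0. \end{cases}$$

The proof of Theorem 3.2 follows by noticing that the DiRRAc problem (2) can be reformulated using the elementary functions $f_k$ as

$$\min_{x \in \mathbb{X}} \left\{ \sum_{k \in [K]} p_k f_k(x) \;:\; c(x, x_0) \le \delta,\ f_k(x) < 1 \quad \forall k \in [K] \right\},$$

where the objective function follows from the definition of the set $\mathcal{B}(\hat{\mathbb{P}})$. It suffices now to combine with Proposition 3.4 to obtain the necessary result. The detailed proof is relegated to the Appendix. Next, we propose a projected gradient descent algorithm to solve problem (4).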
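The step from Lemma 3.3 to Proposition 3.4 can be made explicit by a short calculation. The derivation below is a sketch of the algebra only, under the assumption $A_k + C_k < 0$ and with $s = \sqrt{\beta}$; the rigorous argument is the one in the Appendix.

```latex
% Worst-case probability = the level \beta at which the worst-case VaR in (6) hits 0.
\begin{align*}
A_k + \sqrt{\tfrac{1-\beta}{\beta}}\, B_k + \tfrac{1}{\sqrt{\beta}}\, C_k = 0
  &\iff A_k s + B_k \sqrt{1 - s^2} + C_k = 0, \qquad s = \sqrt{\beta} \in (0, 1), \\
  &\iff B_k \sqrt{1 - s^2} = -(A_k s + C_k), \\
  &\implies (A_k^2 + B_k^2)\, s^2 + 2 A_k C_k\, s + (C_k^2 - B_k^2) = 0, \\
  &\implies s = \frac{-A_k C_k \pm B_k \sqrt{A_k^2 + B_k^2 - C_k^2}}{A_k^2 + B_k^2}.
\end{align*}
% Squaring introduced a spurious root: only the "+" root satisfies A_k s + C_k \le 0,
% because (A_k^2 - C_k^2)(A_k^2 + B_k^2) \ge 0 when |A_k| > C_k. Hence
\[
f_k(x) = \beta = s^2
       = \left( \frac{-A_k C_k + B_k \sqrt{A_k^2 + B_k^2 - C_k^2}}{A_k^2 + B_k^2} \right)^{2}.
\]
```

The discriminant $A_k^2 + B_k^2 - C_k^2$ is positive on this branch because $A_k + C_k < 0$ and $C_k \ge 0$ imply $A_k^2 > C_k^2$.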

3.3. PROJECTED GRADIENT DESCENT ALGORITHM

We consider in this section an iterative numerical routine to solve the DiRRAc problem in the equivalent form (4). First, notice that the second constraint that defines $\mathcal{X}$ in (3) is a strict inequality, thus the set $\mathcal{X}$ is not closed in general. We thus modify slightly this constraint by considering the following set

$$\mathcal{X}_\varepsilon = \left\{ x \in \mathbb{X} \;:\; c(x, x_0) \le \delta,\ -\hat{\theta}_k^\top x + \rho_k \|x\|_2 \le -\varepsilon \quad \forall k \in [K] \right\}$$

for some value $\varepsilon > 0$ sufficiently small. Moreover, if the parameter $\delta$ is too small, it may happen that the set $\mathcal{X}_\varepsilon$ becomes empty. Define $\delta_{\min} \in \mathbb{R}_+$ as the optimal value of the following problem

$$\inf \left\{ c(x, x_0) \;:\; x \in \mathbb{X},\ -\hat{\theta}_k^\top x + \rho_k \|x\|_2 \le -\varepsilon \quad \forall k \in [K] \right\}. \quad (7)$$

Then it is easy to see that $\mathcal{X}_\varepsilon$ is non-empty whenever $\delta \ge \delta_{\min}$. In addition, because $c$ is continuous and $\mathbb{X}$ is closed, the set $\mathcal{X}_\varepsilon$ is compact. In this case, we can consider problem (4) with the feasible set being $\mathcal{X}_\varepsilon$, for which the optimal solution is guaranteed to exist. Let us now define the projection operator $\mathrm{Proj}_{\mathcal{X}_\varepsilon}$ as

$$\mathrm{Proj}_{\mathcal{X}_\varepsilon}(x') \triangleq \arg\min \left\{ \|x - x'\|_2^2 \;:\; x \in \mathcal{X}_\varepsilon \right\}.$$

If $\mathbb{X}$ is convex and $c(\cdot, x_0)$ is a convex function, then $\mathcal{X}_\varepsilon$ is also convex, and the projection operation can be efficiently computed using convex optimization. In particular, suppose that $c(x, x_0) = \|x - x_0\|_2$ is the Euclidean norm and $\mathbb{X}$ is second-order cone representable, then the projection is equivalent to a second-order cone program, and can be solved using off-the-shelf solvers such as GUROBI (Gurobi Optimization, LLC, 2021) or Mosek (MOSEK ApS, 2019).

The projection operator $\mathrm{Proj}_{\mathcal{X}_\varepsilon}$ now forms the building block of a projected gradient descent algorithm with a backtracking linesearch. The details regarding the algorithm, along with the convergence guarantee, are presented in Appendix E.

To conclude this section, we visualize the geometrical intuition of our method in Figure 2. Dashed lines represent the hyperplanes $-\hat{\theta}_k^\top x = 0$ for different $k$, while elliptic curves represent the robust margins $-\hat{\theta}_k^\top x + \rho_k \|x\|_2 = 0$ with matching color.
Increasing the ambiguity size $\rho_k$ brings the elliptic curves towards the top-right corner and farther away from the dashed lines. The set $\mathcal{X}$, taken as the intersection of the elliptical and proximity constraints, will move deeper into the interior of the favorable prediction region, resulting in more robust recourses.
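The overall structure of a projected gradient descent with backtracking can be sketched as follows. This is a generic textbook variant and not the exact algorithm of Appendix E: the sufficient-decrease test, the default step sizes, and all names are assumptions. The projection is passed in as a callable, and the toy run replaces the second-order cone projection onto $\mathcal{X}_\varepsilon$ with a closed-form projection onto the unit ball.

```python
import numpy as np


def projected_gradient_descent(objective, gradient, project, x_init,
                               step_init=1.0, shrink=0.5, max_iter=500, tol=1e-8):
    """Projected gradient descent with a backtracking line search on the step size."""
    x = project(np.asarray(x_init, dtype=float))
    for _ in range(max_iter):
        f_x, g_x = objective(x), gradient(x)
        step = step_init
        while True:
            x_new = project(x - step * g_x)
            diff = x_new - x
            # Sufficient-decrease test: quadratic upper model around x.
            bound = f_x + g_x @ diff + (diff @ diff) / (2.0 * step)
            if objective(x_new) <= bound + 1e-12 or step < 1e-12:
                break
            step *= shrink
        if np.linalg.norm(diff) <= tol:
            return x_new
        x = x_new
    return x


def project_unit_ball(z):
    """Euclidean projection onto the unit ball (toy stand-in for the SOCP projection)."""
    norm = np.linalg.norm(z)
    return z if norm <= 1.0 else z / norm


# Toy problem: minimize ||x - target||^2 over the unit ball.
target = np.array([3.0, 4.0])
x_star = projected_gradient_descent(
    lambda x: float(np.sum((x - target) ** 2)),
    lambda x: 2.0 * (x - target),
    project_unit_ball,
    x_init=np.zeros(2),
)
```

On the toy problem the iterates converge to the point of the unit ball closest to `target`, namely `[0.6, 0.8]`. For DiRRAc, `objective` would be the objective of (4), `gradient` its gradient, and `project` a solver call that computes the projection onto $\mathcal{X}_\varepsilon$.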

4. GAUSSIAN DIRRAC FRAMEWORK

We here revisit the Gaussian assumption on the component distributions, and propose the parametric Gaussian DiRRAc framework. We make the temporary assumption that $\hat{\mathbb{P}}_k$ are Gaussian for all $k \in [K]$, and we will robustify against only the misspecification of the nominal mean vector and covariance matrix $(\hat{\theta}_k, \hat{\Sigma}_k)$. To do this, we first construct the Gaussian component ambiguity sets

$$\forall k: \quad \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k) \triangleq \left\{ Q_k \;:\; Q_k \sim \mathcal{N}(\theta_k, \Sigma_k),\ \mathbb{G}\big((\theta_k, \Sigma_k), (\hat{\theta}_k, \hat{\Sigma}_k)\big) \le \rho_k \right\},$$

where the superscript emphasizes that the ambiguity sets are neighborhoods in the space of Gaussian distributions. The resulting ambiguity set for the mixture distribution is

$$\mathcal{B}^{\mathcal{N}}(\hat{\mathbb{P}}) = \left\{ Q \;:\; \exists Q_k \in \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)\ \forall k \in [K] \text{ such that } Q \sim (Q_k, p_k)_{k \in [K]} \right\}.$$

The Gaussian DiRRAc problem is formally defined as

$$\begin{array}{cl} \min\limits_{x \in \mathbb{X}} & \sup\limits_{Q \in \mathcal{B}^{\mathcal{N}}(\hat{\mathbb{P}})} Q(C_{\tilde{\theta}}(x) = 0) \\ \text{s.t.} & c(x, x_0) \le \delta \\ & \sup\limits_{Q_k \in \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} Q_k(C_{\tilde{\theta}}(x) = 0) < \frac{1}{2} \quad \forall k \in [K]. \end{array} \quad (8)$$

Similar to Section 3, we will provide the reformulation of the Gaussian DiRRAc formulation and a sketch of the proof in the sequel. Note that the last constraint in (8) has margin $\frac{1}{2}$ instead of 1 as in the DiRRAc problem (2). The detailed reason will be revealed in the proof sketch in Section 4.2.

4.1. REFORMULATION OF GAUSSIAN DIRRAC

Recall that the feasible set $\mathcal{X}$ is defined as in (3). The next theorem asserts the equivalent form of the Gaussian DiRRAc problem (8).

Theorem 4.1 (Gaussian DiRRAc reformulation). The Gaussian DiRRAc problem (8) is equivalent to the finite-dimensional optimization problem

$$\min_{x \in \mathcal{X}} \; 1 - \sum_{k \in [K]} p_k \Phi(g_k(x)), \quad (9)$$

where the function $g_k$ admits the closed-form expression

$$g_k(x) = \frac{ (\hat{\theta}_k^\top x)^2 - \rho_k^2 \|x\|_2^2 }{ \hat{\theta}_k^\top x \sqrt{x^\top \hat{\Sigma}_k x} + \rho_k \|x\|_2 \sqrt{ (\hat{\theta}_k^\top x)^2 + x^\top \hat{\Sigma}_k x - \rho_k^2 \|x\|_2^2 } }.$$

Problem (9) can be solved using the projected gradient descent algorithm discussed in Section 3.3.
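The closed-form expression of $g_k$ and the objective of (9) can be evaluated as in the sketch below, which assumes NumPy arrays for the nominal moments and a feasible point $x$ from the set in (3); the function names are hypothetical.

```python
import math
import numpy as np


def std_normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_margin(x, theta_hat, sigma_hat, rho):
    """Closed-form g_k(x) of Theorem 4.1 for one mixture component."""
    a = float(theta_hat @ x)                    # nominal score theta_hat^T x
    b = math.sqrt(float(x @ sigma_hat @ x))     # standard deviation of the score
    c = rho * float(np.linalg.norm(x))          # robust margin rho * ||x||_2
    if a - c <= 0.0:
        raise ValueError("x violates the margin constraint that defines the set in (3)")
    disc = a * a + b * b - c * c
    return (a * a - c * c) / (a * b + c * math.sqrt(disc))


def gaussian_dirrac_objective(x, weights, thetas, sigmas, rhos):
    """Objective of problem (9): one minus the weighted sum of Phi(g_k(x))."""
    return 1.0 - sum(
        p * std_normal_cdf(gaussian_margin(x, t, s, r))
        for p, t, s, r in zip(weights, thetas, sigmas, rhos)
    )
```

With $\rho_k = 0$ the margin reduces to $\hat{\theta}_k^\top x / \sqrt{x^\top \hat{\Sigma}_k x}$, so the objective coincides with the nominal Gaussian mixture probability of an unfavorable outcome.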

4.2. PROOF SKETCH

The proof of Theorem 4.1 relies on the following analytical form of the worst-case Value-at-Risk (VaR) under the parametric Gaussian ambiguity set (Nguyen, 2019, Lemma 3.31).

Lemma 4.2 (Worst-case Gaussian VaR). For any $x \in \mathbb{R}^d$ and $\beta \in (0, \frac{1}{2}]$, let $t = \Phi^{-1}(1 - \beta)$. Then

$$\inf \left\{ \tau \;:\; \sup_{Q_k \in \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} Q_k(\tilde{\theta}^\top x \le -\tau) \le \beta \right\} = -\hat{\theta}_k^\top x + t \sqrt{x^\top \hat{\Sigma}_k x} + \rho_k \sqrt{1 + t^2} \, \|x\|_2. \quad (10)$$

It is important to note that Lemma 4.2 is only valid for $\beta \in (0, 0.5]$. Indeed, for $\beta > \frac{1}{2}$, evaluating the infimum problem in the left-hand side of (10) requires solving a non-convex optimization problem as $t = \Phi^{-1}(1 - \beta) < 0$. As a consequence, the last constraint of the Gaussian DiRRAc formulation (8) is capped at a probability value of 0.5 to ensure the convexity of the feasible set in the reformulation (9).

The proof of Theorem 4.1 follows a similar line of argument as for the DiRRAc formulation, with $1 - \Phi(g_k(x))$ being the worst-case Gaussian probability

$$1 - \Phi(g_k(x)) = \sup_{Q_k \in \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} Q_k(C_{\tilde{\theta}}(x) = 0) = \sup_{Q_k \in \mathcal{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} Q_k(\tilde{\theta}^\top x \le 0) \quad \forall k \in [K],$$

where $t = g_k(x)$ is the value at which the right-hand side of (10) vanishes.

To conclude this section, we provide a quick sanity check: by setting $K = 1$ and $\rho_1 = 0$, we have a special case in which $\tilde{\theta}$ follows a Gaussian distribution $\mathcal{N}(\hat{\theta}_1, \hat{\Sigma}_1)$. Thus, $\tilde{\theta}^\top x \sim \mathcal{N}(\hat{\theta}_1^\top x, x^\top \hat{\Sigma}_1 x)$ and it is easy to verify from the formula of $g_1$ in the statement of Theorem 4.1 that $g_1(x) = (\hat{\theta}_1^\top x) / (x^\top \hat{\Sigma}_1 x)^{\frac{1}{2}}$, so that $1 - \Phi(g_1(x))$ recovers the value of $\Pr(\tilde{\theta}^\top x \le 0)$ under the Gaussian distribution.

5. NUMERICAL EXPERIMENTS

We compare extensively the performance of our DiRRAc model (2) and Gaussian DiRRAc model (8) against four strong baselines: ROAR (Upadhyay et al., 2021), CEPM (Pawelczyk et al., 2020), AR (Ustun et al., 2019) and Wachter (Wachter et al., 2018). We conduct the experiments on three real-world datasets (German, SBA, Student). Appendix A provides further comparisons with more baselines: Nguyen et al. (2022), Karimi et al. (2021a) and ensemble variants of ROAR, along with the sensitivity analysis of hyperparameters. Appendix A also contains the details about the datasets and the experimental setup.

Metrics. For all experiments, we use the $\ell_1$ distance $c(x, x_0) = \|x - x_0\|_1$ as the cost function. Each dataset contains two sets of data (the present and shifted data). The present data is used to train the current classifier for which recourses are generated, while the remaining data is used to measure the validity of the generated recourses under model shifts. We choose 20% of the shifted data randomly 100 times and train 100 classifiers respectively. The validity of a recourse is computed as the fraction of the classifiers for which the recourse is valid. We then report the average of the validity of all generated recourses and refer to this value as $M_2$ validity. We also report $M_1$ validity, which is the fraction of the instances for which the recourse is valid with respect to the original classifier.

Results on real-world data. We use three real-world datasets which capture different data distribution shifts (Dua & Graff, 2017): (i) the German credit dataset, which captures a correction shift; (ii) the Small Business Administration (SBA) dataset, which captures a temporal shift; (iii) the Student performance dataset, which captures a geospatial shift. Each dataset contains original data and shifted data. We normalize all continuous features to $[0, 1]$. Similar to Mothilal et al. (2020), we use one-hot encodings for categorical features, then consider them as continuous features in $[0, 1]$. To ease the comparison, we choose $K = 1$. The choices of $K$ are discussed further in Appendix A. We split 80% of the original dataset and train a logistic classifier. This process is repeated 100 times independently to obtain 100 observations of the model parameters. Then we compute the empirical mean and covariance matrix for $(\hat{\theta}_1, \hat{\Sigma}_1)$.

To evaluate the trade-off between the $\ell_1$ cost and the $M_2$ validity of DiRRAc and ROAR, we compute the $\ell_1$ cost and the $M_2$ validity by running DiRRAc with varying values of $\delta_{\mathrm{add}}$ and ROAR with varying values of $\lambda$. We define $\delta = \delta_{\min} + \delta_{\mathrm{add}}$, where $\delta_{\min}$ is specified in (7). Figure 3 shows that the frontiers of DiRRAc dominate the frontiers of ROAR. This indicates that DiRRAc achieves a far smaller $\ell_1$ cost for the robust recourses than ROAR. Next, we evaluate the $\ell_1$ and $\ell_2$ cost, and the $M_1$ and $M_2$ validity of DiRRAc, ROAR and other baselines. The results in Table 1 demonstrate that DiRRAc has high validity in all three datasets while preserving low costs ($\ell_1$ and $\ell_2$ cost) in comparison to ROAR. Our DiRRAc framework consistently outperforms AR, Wachter, and CEPM in terms of $M_2$ validity.

Nonlinear models. Following previous work (Rawal et al., 2021; Upadhyay et al., 2021; Bui et al., 2022), we adapt our DiRRAc framework and other baselines (AR and ROAR) to non-linear models by first generating local linear approximations using LIME (Ribeiro et al., 2016). For each instance $x_0$, we first generate a local linear model for the MLP classifier 10 times using LIME, each time using 1000 perturbed samples. To estimate $(\hat{\theta}_1, \hat{\Sigma}_1)$, we compute the mean and covariance matrix of the parameters $\theta_{x_0}$ of the 10 local linear models. We randomly choose 10% of the shifted dataset and concatenate it with the training data of the original dataset 10 times, each time training a shifted MLP classifier. According to Table 2, on the German Credit and Student datasets, DiRRAc has a higher $M_2$ validity than other baselines, and a slightly lower $M_2$ validity on the SBA dataset than ROAR, while maintaining a low $\ell_1$ cost relative to ROAR and CEPM.

Concluding Remarks. In this work, we proposed the Distributionally Robust Recourse Action (DiRRAc) framework to address the problem of recourse robustness under shifts in the parameters of the classification model. We introduced a distributionally robust optimization approach for generating a robust recourse action using a projected gradient descent algorithm. The experimental results demonstrated that our framework can generate a recourse action that has a high probability of being valid under different types of data distribution shifts at a low cost. We also showed that our framework can be adapted to different model types, linear and non-linear, and allows for actionability constraints on the recourse action.

Remark 5.1 (Extensions). The DiRRAc framework can be extended to hedge against the misspecification of the mixture weights $p$. Alternatively, the objective function of DiRRAc can be modified to minimize the worst-case component probability. These extensions are explored in Section C. Corresponding extensions for the Gaussian DiRRAc framework are presented in Section D.

Remark 5.2 (Choice of ambiguity set). This paper's results rely fundamentally on the design of ambiguity sets using a Gelbrich distance on the moment space. This Gelbrich ambiguity set leads to the $\|\cdot\|_2$-regularizations of the worst-case Value-at-Risk in Lemmas 3.3 and 4.2. If we consider other moment ambiguity sets, for example, the moment bounds in Delage & Ye (2010) or the Kullback-Leibler-type sets in Taskesen et al. (2021), then these regularization equivalences are not available, and there is no trivial way to extend the results to reformulate the (Gaussian) DiRRAc framework.
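The $M_1$ and $M_2$ validity metrics, together with the empirical estimate of $(\hat{\theta}_1, \hat{\Sigma}_1)$ described in this section, can be sketched as follows. The sketch assumes linear classifiers whose bias is internalized into the parameter vector, as in Section 2; the function names and array layouts are illustrative and do not come from the released code.

```python
import numpy as np


def m1_validity(recourses, theta_current):
    """Fraction of recourses classified favorably by the current classifier."""
    return float(np.mean(recourses @ theta_current >= 0.0))


def m2_validity(recourses, shifted_thetas):
    """Average, over recourses, of the fraction of shifted classifiers that accept them."""
    scores = recourses @ shifted_thetas.T          # shape (n_recourses, n_classifiers)
    per_recourse = (scores >= 0.0).mean(axis=1)    # validity of each recourse
    return float(per_recourse.mean())


def nominal_moments(theta_samples):
    """Empirical mean and covariance of observed model parameters (one row per model)."""
    theta_hat = theta_samples.mean(axis=0)
    sigma_hat = np.cov(theta_samples, rowvar=False)
    return theta_hat, sigma_hat
```

Here `recourses` holds one recourse per row and `shifted_thetas` holds one retrained parameter vector per row, so that 100 retrained classifiers correspond to 100 rows.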

A ADDITIONAL EXPERIMENT RESULTS

Here, we provide further details about the datasets, experimental settings, and additional results. Source code can be found at https://github.com/duykhuongnguyen/DiRRAc.

A.1 DATASETS

Real-world datasets. We use three real-world datasets which are popular in the settings of robust algorithmic recourse: German credit (Dua & Graff, 2017), SBA (Li et al., 2018), and Student performance (Cortez & Silva, 2008). We select a subset of features from each dataset:

• For the German credit dataset from the UCI repository, we choose five features: Status, Duration, Credit amount, Personal Status, and Age. We found in the descriptions of the two datasets that the feature Status in the data correction shift dataset corrects the coding errors in the original dataset (Dua & Graff, 2017).
• For the SBA dataset, we follow
• For the Student Performance dataset, motivated by Cortez & Silva (2008), we choose G3 (final grade) for deciding the label pass or fail for each student. A student who has G3 < 12 is labeled 0 (failed), and 1 (passed) otherwise. For input features, we choose 9 features: Age, Study time, Famsup, Higher, Internet, Health, Absences, G1, G2. We separate the dataset into the original and the geospatial shift data by the 2 different schools.

We report the accuracy of the current classifiers and shifted classifiers for two types of models, logistic classifiers (LR) and MLP classifiers (MLPs), on each dataset in Table 3.

Synthetic data. For mean shift, we replace $\mu_0$ by $\mu_0^{\mathrm{shift}} = \mu_0 + [\alpha, 0]^\top$, where $\alpha$ is a mean shift magnitude. For covariance shift, we replace $\Sigma_0$ by $\Sigma_0^{\mathrm{shift}} = (1 + \beta) \Sigma_0$, where $\beta$ is a covariance shift magnitude. For mean and covariance shift, we replace $(\mu_0, \Sigma_0)$ by $(\mu_0^{\mathrm{shift}}, \Sigma_0^{\mathrm{shift}})$. We generate 500 samples for each class from the unshifted distribution with $\mu_0 = [-3; -3]$, $\mu_1 = [3; 3]$, and $\Sigma_0 = \Sigma_1 = I$. To visualize the decision boundaries of the linear classifiers for synthetic data, we synthesize the shifted data 100 times in total, including 33 mean shifts, 33 covariance shifts and 34 shifts of both types, then we visualize the parameters of the 100 models in a two-dimensional space in Figure 4 and Figure 5.
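The mean and covariance shifts of the synthetic data can be sketched as follows. The sketch assumes that each class is drawn from a Gaussian distribution with the stated moments, which is an assumption about the sampling procedure; the shift magnitudes in the usage lines are hypothetical placeholders, and the function names are illustrative.

```python
import numpy as np


def shift_class0(mu0, sigma0, alpha=0.0, beta=0.0):
    """Apply a mean shift of magnitude alpha and a covariance shift of magnitude beta."""
    mu_shift = mu0 + np.array([alpha, 0.0])
    sigma_shift = (1.0 + beta) * sigma0
    return mu_shift, sigma_shift


def sample_two_class_data(rng, mu0, mu1, sigma0, sigma1, n_per_class=500):
    """Draw n_per_class Gaussian samples for each class (Gaussian sampling is assumed)."""
    x0 = rng.multivariate_normal(mu0, sigma0, size=n_per_class)
    x1 = rng.multivariate_normal(mu1, sigma1, size=n_per_class)
    features = np.vstack([x0, x1])
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    return features, labels


rng = np.random.default_rng(0)
mu0, mu1, identity = np.array([-3.0, -3.0]), np.array([3.0, 3.0]), np.eye(2)

# Unshifted data, then data with both a mean and a covariance shift on class 0.
features, labels = sample_two_class_data(rng, mu0, mu1, identity, identity)
mu0_shift, sigma0_shift = shift_class0(mu0, identity, alpha=1.5, beta=0.5)
features_shift, labels_shift = sample_two_class_data(rng, mu0_shift, mu1, sigma0_shift, identity)
```

Setting only `alpha` or only `beta` to a nonzero value gives a pure mean shift or a pure covariance shift, respectively.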

A.2 EXPERIMENTAL SETTINGS

Implementation details. For all the baselines, we use the implementation of CARLA (Pawelczyk et al., 2021). We use the hyperparameters of AR, Wachter and CEPM that are provided by CARLA. For ROAR, we use the same parameters as in ROAR (Upadhyay et al., 2021).

Experimental settings. The experimental settings for the experiments in the main text are as follows:

• In Figure 3, we fix $\rho_1 = 0.1$ and vary $\delta_{\text{add}} \in [0, 2.0]$ for DiRRAc. Then we fix $\delta_{\max} = 0.1$ and vary $\lambda \in [0.01, 0.2]$ for ROAR.
• In Table 1 and Table 2, we first initialize $\rho_1 = 0.1$ and choose the $\delta_{\text{add}}$ that maximizes the $M_1$ validity. We follow the same procedure as in the original paper for ROAR (Upadhyay et al., 2021): choose $\delta_{\max} = 0.1$ and find the value of $\lambda$ that maximizes the $M_1$ validity. The detailed settings are provided in Table 4.

Table 4: Parameters for the experiments with real-world data in Table 1.

| Parameters | Values |
|---|---|
| $K$ | 1 |
| $\delta_{\text{add}}$ | 1.0 |
| $\hat p$ | [1] |
| $\rho$ | [0.1] |
| $\lambda$ | 0.7 |
| $\zeta$ | 1 |

Choice of number of components K for real-world datasets. To choose $K$ for real-world datasets, we use the same procedure as in Section 5 to obtain 100 observations of the model parameters. We then determine the number of components $K$ on these observations using K-means clustering and the Elbow method (Thorndike, 1953; Ketchen & Shook, 1996). Finally, we train a Gaussian mixture model on these observations and obtain $\hat p_k$, $\hat\theta_k$, $\hat\Sigma_k$ for the optimal number of components $K$. The Elbow method visualization for each dataset is shown in Figure 6.
Table 5: Performance of DiRRAc and Gaussian DiRRAc with K components on three real-world datasets.

| Dataset | Methods | M1 validity | M2 validity | l1 cost | l2 cost |
|---|---|---|---|---|---|
| German | DiRRAc (K = 5) | 1.00 ± 0.00 | 0.99 ± 0.07 | 1.73 ± 0.31 | 1.40 ± 0.20 |
| German | Gaussian DiRRAc (K = 5) | 1.00 ± 0.00 | 0.99 ± 0.07 | 1.73 ± 0.31 | 1.23 ± 0.23 |
| SBA | DiRRAc (K = 4) | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.83 ± 0.49 | 1.48 ± 0.29 |
| SBA | Gaussian DiRRAc (K = 4) | 1.00 ± 0.00 | 0.99 ± 0.02 | 1.67 ± 0.68 | 0.98 ± 0.42 |
| Student | DiRRAc (K = 6) | 1.00 ± 0.00 | 0.96 ± 0.09 | 1.59 ± 0.33 | 1.04 ± 0.22 |
| Student | Gaussian DiRRAc (K = 6) | 1.00 ± 0.00 | 0.75 ± 0.19 | 0.82 ± 0.30 | 0.53 ± 0.21 |

The results in Table 5 indicate that when we deploy our framework with the optimal number of components $K$, DiRRAc delivers a smaller cost in all three datasets. The $M_2$ validity of Gaussian DiRRAc slightly increases in the Student Performance dataset.

Sensitivity analysis of hyperparameters $\delta_{\text{add}}$ and $\rho_k$. Here we analyze the sensitivity of the $l_1$ cost of recourses and the $M_2$ validity of DiRRAc to the hyperparameters $\delta_{\text{add}}$ and $\rho_k$. From the results in Figure 3, we observe that as $\delta_{\text{add}}$ increases, both the cost and the robustness of the recourse increase. We study the sensitivity of the $M_2$ validity to $\rho_k$ by fixing $\delta_{\text{add}} = 0.1$ and varying $\rho_k \in [0.0, 0.5]$. According to Figure 7, as $\rho_k$ increases, the cost of recourses rises as well, yielding more robust recourses.

A.3 RESULTS ON REAL-WORLD DATA

Experiments with prior on Σ. In some cases, we may not have access to the training data. We set $\hat\theta_1 = \theta_0$, where $\theta_0$ is the parameter vector of the original classifier. Then we choose $\hat\Sigma_1 = \tau I$ with $\tau = 0.1$. We generate a recourse for each input instance and compute the $M_1$ validity using the original classifier and the $M_2$ validity using the shifted classifiers. The results in Table 6 show that our methods produce the same performance while keeping the $l_1$ and $l_2$ costs lower than ROAR in all three datasets.

Table 6: Benchmark of $M_1$ validity, $M_2$ validity, $l_1$ and $l_2$ cost using $\hat\theta_1 = \theta_0$ and $\hat\Sigma_1 = 0.1 I$ on different real-world datasets.

| Dataset | Methods | M1 validity | M2 validity | l1 cost | l2 cost |
|---|---|---|---|---|---|
| German | ROAR | 1.00 ± 0.00 | 0.94 ± 0.15 | 3.88 ± 0.54 | 1.61 ± 0.22 |
| German | DiRRAc | 1.00 ± 0.00 | 0.96 ± 0.07 | 1.48 ± 0.39 | 1.34 ± 0.41 |
| German | Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.58 ± 0.29 | 1.35 ± 0.24 |
| SBA | ROAR | 1.00 ± 0.00 | 1.00 ± 0.00 | 3.10 ± 0.72 | 1.35 ± 0.30 |
| SBA | DiRRAc | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.64 ± 0.37 | 1.27 ± 0.30 |
| SBA | Gaussian DiRRAc | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.64 ± 0.37 | 1.25 ± 0.26 |
| Student | ROAR | 1.00 ± 0.00 | 0.94 ± 0.10 | 2.02 ± 0.38 | 0.96 ± 0.18 |
| Student | DiRRAc | 1.00 ± 0.00 | 0.97 ± 0.06 | 1.81 ± 0.19 | 1.47 ± 0.13 |
| Student | Gaussian DiRRAc | 1.00 ± 0.00 | 0.88 ± 0.14 | 1.18 ± 0.26 | 0.82 ± 0.18 |

Experiments with actionability constraints. Using our two methods (DiRRAc and Gaussian DiRRAc) and the AR method (Ustun et al., 2019), we analyze how the actionability constraints affect the cost and validity of the recourse. We select a subset of features from each dataset and define each feature as immutable or non-decreasing as follows:

• In the German Credit dataset, we select Personal status as an immutable attribute because it is challenging to impose changes in an individual's status and sex. We view Age as a non-decreasing feature.
• In the SBA dataset, we select UrbanRural and Recession as two immutable attributes since it will be difficult to change these features in the near future. RetainedJob is another feature that we view as non-decreasing.
• In the Student Performance dataset, we assume that a student's intention to pursue higher education would not change, and select Higher as an immutable feature. Age and Absences are considered non-decreasing.

The above specifications are aligned with the existing numerical setup in algorithmic recourse (Ustun et al., 2019; Rawal & Lakkaraju, 2020a). For each dataset, we generate the recourse action by adding these constraints to the projected gradient descent algorithm. The experimental setup on the three real-world datasets is the same as in Section 5. The results in Table 7 indicate that the $M_2$ validity of our two methods drops in the German Credit dataset. The validity of AR on shifted data also decreases in this dataset. In the other datasets, the performance of our two methods remains the same. The $l_1$ and $l_2$ costs of DiRRAc slightly increase in the Student Performance dataset. Furthermore, a recourse exists for every input instance.
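The text does not spell out how the actionability constraints enter the projection step, so the following is only one plausible sketch under stated assumptions. It treats immutable features as fixed at their input values and non-decreasing features as bounded below by their input values, and it enforces both with a coordinate-wise clamp. The function name `apply_actionability` and the index-based interface are hypothetical.

```python
def apply_actionability(x_candidate, x_input, immutable, non_decreasing):
    """Clamp a candidate recourse so that it respects simple actionability rules.

    immutable: indices whose value must equal the input value.
    non_decreasing: indices whose value may not fall below the input value.
    This clamp is the Euclidean projection onto the set defined by these rules alone;
    combining it with the other constraints of the feasible set is left open here.
    """
    x = list(x_candidate)
    for j in immutable:
        x[j] = x_input[j]
    for j in non_decreasing:
        x[j] = max(x[j], x_input[j])
    return x

# Example with three features: feature 0 is immutable, feature 1 is non-decreasing.
x_input = [0.0, 30.0, 4.0]
x_candidate = [1.0, 20.0, 5.0]
x_actionable = apply_actionability(x_candidate, x_input, immutable=[0], non_decreasing=[1])
```

In a projected gradient loop, such a clamp would be applied after each gradient step, which is why tighter actionability rules can raise the cost or lower the validity of the resulting recourse.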
Table 7: Benchmark of $M_1$ validity, $M_2$ validity, $l_1$ and $l_2$ cost using actionability constraints on different real-world datasets.

| Dataset | Methods | M1 validity | M2 validity | l1 cost | l2 cost |
|---|---|---|---|---|---|
| German | DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 | 1.27 ± 0.20 |
| German | Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 | 1.09 ± 0.24 |
| SBA | AR | 1.00 ± 0.00 | 0.41 ± 0.18 | 0.61 ± 0.42 | 0.56 ± 0.36 |
| SBA | DiRRAc | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.74 ± 0.44 | 1.34 ± 0.40 |
| SBA | Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.02 | 1.60 ± 0.62 | 0.98 ± 0.42 |
| Student | AR | 1.00 ± 0.00 | 0.48 ± 0.19 | 0.29 ± 0.21 | 0.26 ± 0.18 |
| Student | DiRRAc | 1.00 ± 0.00 | 0.95 ± 0.09 | 1.61 ± 0.31 | 1.08 ± 0.24 |
| Student | Gaussian DiRRAc | 1.00 ± 0.00 | 0.74 ± 0.18 | 0.81 ± 0.27 | 0.55 ± 0.21 |

Comparison with RBR. Here we compare our approach in the nonlinear model setting to a more recent approach to robust recourse (Nguyen et al., 2022).

Table 8: Comparison with RBR for non-linear models on real datasets.

| Dataset | Methods | M1 validity | M2 validity | l1 cost | l2 cost |
|---|---|---|---|---|---|
| German | RBR | 0.98 ± 0.13 | 0.71 ± 0.25 | 1.11 ± 0.10 | 0.50 ± 0.07 |
| German | LIME-DiRRAc | 0.78 ± 0.42 | 0.75 ± 0.27 | 1.14 ± 0.27 | 1.02 ± 0.05 |
| German | LIME-Gaussian DiRRAc | 0.70 ± 0.46 | 0.70 ± 0.31 | 1.11 ± 0.26 | 1.00 ± 0.06 |
| SBA | RBR | 1.00 ± 0.00 | 0.97 ± 0.12 | 1.42 ± 0.45 | 0.59 ± 0.18 |
| SBA | LIME-DiRRAc | 0.93 ± 0.26 | 0.93 ± 0.26 | 1.10 ± 0.11 | 1.07 ± 0.05 |
| SBA | LIME-Gaussian DiRRAc | 0.82 ± 0.38 | 0.80 ± 0.38 | 0.64 ± 0.29 | 0.43 ± 0.32 |
| Student | RBR | 1.00 ± 0.00 | 0.90 ± 0.23 | 1.02 ± 0.53 | 0.42 ± 0.20 |
| Student | LIME-DiRRAc | 0.97 ± 0.18 | 0.97 ± 0.18 | 1.12 ± 0.23 | 1.12 ± 0.23 |
| Student | LIME-Gaussian DiRRAc | 0.69 ± 0.46 | 0.59 ± 0.46 | 0.58 ± 0.54 | 0.50 ± 0.51 |

We provide the results in Table 8: we observe that RBR has (nearly) perfect $M_1$ validity. This result is natural because RBR is designed to handle the nonlinear predictive model directly. Our methods do not have perfect $M_1$ validity because we use the LIME approximation. However, it is important to note that in the problem of robust recourse facing future model shifts, we regard the $M_2$ validity as the most crucial metric because it is the proportion of recourse instances that are valid with respect to the shifted (future) models. In terms of $l_1$ cost and $M_2$ validity, the results demonstrate that our method is competitive with the existing state-of-the-art methods. In particular, LIME-DiRRAc outperforms RBR in terms of $M_2$ validity for two datasets (German and Student). In the SBA dataset, our approach has a lower $M_2$ validity, but the cost of the recourses generated by our method is also lower. This result is consistent with our discussion about the trade-off between $l_1$ cost and $M_2$ validity in the Appendix.

Comparison with MINT on the German Credit dataset. We add a more recent baseline, MINT, proposed by Karimi et al. (2021a), for comparison purposes. MINT requires a causal graph; thus, we restrict the experiment to the German Credit dataset (the specifications of the causal graphs are not available for SBA and Student Performance). We do not consider MACE as a baseline for nonlinear model comparison because MACE is not applicable to neural network target models due to its high computational cost. We use the same set of features as in the MINT and ROAR papers (Karimi et al., 2021a; Upadhyay et al., 2021), with four features: Sex, Age, Credit Amount and Duration.
The results in Table 9 demonstrate that the recourse generated by our framework is more robust to model shifts, but it has a higher l 1 cost. 

Table 9: Comparison with MINT on German Credit dataset.

| Methods | M1 validity | M2 validity | l1 cost |
|---|---|---|---|
| MINT | 1.00 ± 0.00 | 0.87 ± 0.09 | 0.77 ± 0.23 |
| DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 |
| Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 |

Comparison with ensemble baselines. Prior work suggested that model ensembles can be effective for out-of-distribution prediction (Ovadia et al., 2019; Fort et al., 2019). We now explore two model ensemble methods to generate recourse based on ROAR. First, we follow the procedure in Section 5 to obtain 100 model parameters $\theta_i$ with $i \in \{1, \ldots, 100\}$. Then we find a recourse by solving the following problem:

$$x'' = \arg\min_{x' \in \mathcal{A}} \; \max_{\delta \in \Delta} \; \max_{i \in \{1, \ldots, 100\}} \; \ell\big(\mathcal{C}_{\theta_i^{\delta}}(x'), 1\big) + \lambda c(x_0, x'),$$

where $\ell$ is the cross-entropy loss function. Second, we use the same 100 models and generate a recourse for each model independently. Then we average the ROAR recourses across those 100 models as follows:

$$x'' = \frac{1}{100} \sum_{i=1}^{100} \arg\min_{x' \in \mathcal{A}} \; \max_{\delta \in \Delta} \; \ell\big(\mathcal{C}_{\theta_i^{\delta}}(x'), 1\big) + \lambda c(x_0, x').$$

Table 10: Benchmark of different variants of ROAR on three real-world datasets.

| Dataset | Methods | M1 validity | M2 validity | l1 cost | l2 cost |
|---|---|---|---|---|---|
| German | ROAR | 1.00 ± 0.00 | 0.94 ± 0.15 | 3.88 ± 0.54 | 1.61 ± 0.22 |
| German | ROAR-Ensemble | 1.00 ± 0.00 | 0.95 ± 0.15 | 5.11 ± 0.59 | 2.12 ± 0.24 |
| German | ROAR-Avg | 1.00 ± 0.00 | 0.95 ± 0.15 | 4.46 ± 0.36 | 2.00 ± 0.14 |
| German | DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 | 1.25 ± 0.21 |
| German | Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.06 | 1.62 ± 0.30 | 1.05 ± 0.23 |
| SBA | ROAR | 1.00 ± 0.00 | 1.00 ± 0.00 | 3.10 ± 0.72 | 1.35 ± 0.30 |
| SBA | ROAR-Ensemble | 1.00 ± 0.00 | 1.00 ± 0.00 | 4.54 ± 0.95 | 1.91 ± 0.38 |
| SBA | ROAR-Avg | 1.00 ± 0.00 | 1.00 ± 0.00 | 2.86 ± 0.70 | 1.78 ± 0.35 |
| SBA | DiRRAc | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.74 ± 0.44 | 1.34 ± 0.40 |
| SBA | Gaussian DiRRAc | 1.00 ± 0.00 | 0.99 ± 0.02 | 1.60 ± 0.62 | 0.98 ± 0.42 |
| Student | ROAR | 1.00 ± 0.00 | 0.94 ± 0.10 | 2.02 ± 0.38 | 0.96 ± 0.18 |
| Student | ROAR-Ensemble | 1.00 ± 0.00 | 0.98 ± 0.05 | 3.73 ± 0.50 | 1.43 ± 0.19 |
| Student | ROAR-Avg | 1.00 ± 0.00 | 0.97 ± 0.10 | 2.78 ± 0.31 | 1.31 ± 0.17 |
| Student | DiRRAc | 1.00 ± 0.00 | 0.95 ± 0.09 | 1.55 ± 0.34 | 1.07 ± 0.23 |
| Student | Gaussian DiRRAc | 1.00 ± 0.00 | 0.74 ± 0.18 | 0.78 ± 0.30 | 0.54 ± 0.21 |

In Table 10, we report the ROAR ensemble method as ROAR-Ensemble and the averaged ROAR recourses as ROAR-Avg. From this table, the $M_1$ and $M_2$ validity of ROAR-Ensemble and ROAR-Avg remain close to those of ROAR for all datasets. In almost every benchmark, the recourses generated by those two approaches are more costly than ROAR. In comparison, our DiRRAc and Gaussian DiRRAc methods demonstrate advantages in terms of the cost of recourses.

More discussions about cost-validity trade-off. Previous work on robust recourse has suggested that recourses become more robust at the expense of higher costs (Rawal et al., 2021; Upadhyay et al., 2021; Pawelczyk et al., 2020; Black et al., 2022). Our results with DiRRAc and Gaussian DiRRAc are consistent with this suggestion. However, our framework can achieve robust and actionable recourses with a far smaller cost than ROAR (Upadhyay et al., 2021) and CEPM (Pawelczyk et al., 2020).

Comparison of run time.

We define the adaptive mean and covariance shift magnitudes as $\alpha = \mu_{\text{adapt}} \times \text{iter}$ and $\beta = \Sigma_{\text{adapt}} \times \text{iter}$, where $\mu_{\text{adapt}}$ and $\Sigma_{\text{adapt}}$ are the data shift factors and iter is the iteration index of the synthesizing loop. For data distribution shifts, we generate mean shifts and covariance shifts 50 times each with adaptive mean and covariance shift magnitudes, using the parameters $\mu_{\text{adapt}} = \Sigma_{\text{adapt}} = 0.1$. To estimate $\hat\theta_k$ and $\hat\Sigma_k$, we define valid mixture weights $\hat p$ and generate data for each component 100 times with the same ratio as the mixture weights. We train 100 logistic classifiers to compute the empirical mean $\hat\theta_k$ and the empirical covariance matrix $\hat\Sigma_k$ for the $k$-th component.
We generate a recourse for each test instance that belongs to the negative class. In Figure 8 , we present the results of the cost-robustness analysis of DiRRAc and ROAR on synthetic data. 

B PROOFS

B.1 PROOFS OF SECTION 3

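To prove Proposition 3.4, we use the notion of Value-at-Risk, which is defined as follows.

Definition B.1 (Value-at-Risk). For any fixed distribution $\mathbb{Q}_k$ of $\tilde\theta$, the Value-at-Risk at the risk tolerance level $\beta \in (0, 1)$ of the loss $\tilde\theta^\top x$ is defined as

$$\mathbb{Q}_k\text{-VaR}_\beta(\tilde\theta^\top x) \triangleq \inf\{\tau \in \mathbb{R} : \mathbb{Q}_k(\tilde\theta^\top x \le \tau) \ge 1 - \beta\}.$$

We are now ready to provide the proof of Proposition 3.4.

Proof of Proposition 3.4. Using the definition of the Value-at-Risk in Definition B.1, we have

$$\begin{aligned}
\sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0)
&= \inf\Big\{\beta : \beta \in [0, 1],\ \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) \le \beta\Big\} \\
&= \inf\Big\{\beta : \beta \in [0, 1],\ \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k\text{-VaR}_\beta(-\tilde\theta^\top x) \le 0\Big\}.
\end{aligned}$$

By Nguyen (2019, Lemma 3.31), we can reformulate the worst-case Value-at-Risk as

$$\sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k\text{-VaR}_\beta(-\tilde\theta^\top x) = -\hat\theta_k^\top x + \sqrt{\frac{1-\beta}{\beta}} \sqrt{x^\top \hat\Sigma_k x} + \frac{\rho_k}{\sqrt{\beta}} \|x\|_2.$$

In the first case, when $-\hat\theta_k^\top x + \rho_k \|x\|_2 \ge 0$, the worst-case Value-at-Risk is nonnegative for every $\beta \in (0, 1]$, and hence $\sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) = 1$. We now consider the second case, when $-\hat\theta_k^\top x + \rho_k \|x\|_2 < 0$. By the monotonicity of the worst-case Value-at-Risk with respect to $\beta$, the minimal value of $\beta$ should satisfy

$$-\hat\theta_k^\top x + \sqrt{\frac{1-\beta}{\beta}} \sqrt{x^\top \hat\Sigma_k x} + \frac{\rho_k}{\sqrt{\beta}} \|x\|_2 = 0.$$

Using the transformation $t \leftarrow \sqrt{\beta}$ and multiplying both sides by $t$, we have

$$-\hat\theta_k^\top x \, t + \sqrt{1 - t^2} \sqrt{x^\top \hat\Sigma_k x} + \rho_k \|x\|_2 = 0.$$

By rearranging terms and then squaring both sides, we obtain the equivalent quadratic equation

$$(A_k^2 + B_k^2) t^2 + 2 A_k C_k t + C_k^2 - B_k^2 = 0$$

with $A_k \triangleq -\hat\theta_k^\top x \le 0$, $B_k \triangleq \sqrt{x^\top \hat\Sigma_k x} \ge 0$, and $C_k \triangleq \rho_k \|x\|_2 \ge 0$ as defined in the statement of the proposition. Note, moreover, that we also have $A_k^2 \ge C_k^2$. This leads to the solution

$$t = \frac{-A_k C_k + B_k \sqrt{A_k^2 + B_k^2 - C_k^2}}{A_k^2 + B_k^2} \ge 0.$$

Thus, we find

$$f_k(x) = \left(\frac{-A_k C_k + B_k \sqrt{A_k^2 + B_k^2 - C_k^2}}{A_k^2 + B_k^2}\right)^2.$$

This completes the proof.

We now provide the proof of Theorem 3.2.

Proof of Theorem 3.2. We first consider the objective function $f$ of (2), which can be re-expressed as

$$\begin{aligned}
f(x) = \sup_{\mathbb{P} \in \mathbb{B}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0)
&= \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)\ \forall k} \ \sum_{k \in [K]} \hat p_k \, \mathbb{Q}_k(\tilde\theta^\top x \le 0) \\
&= \sum_{k \in [K]} \hat p_k \times \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) \\
&= \sum_{k \in [K]} \hat p_k \times f_k(x),
\end{aligned}$$

where the equality in the second line follows from the non-negativity of $\hat p_k$, and the last equality follows from the definition of $f_k(x)$ in (5). Applying Proposition 3.4, we obtain the objective function of problem (4).

Consider now the last constraint of (2). Using the result of Proposition 3.4, this constraint is equivalent to

$$-\hat\theta_k^\top x + \rho_k \|x\|_2 < 0 \quad \forall k \in [K].$$

This leads to the feasible set $\mathcal{X}$ as defined in (3). This completes the proof.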
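The closed form in Proposition 3.4 and the weighted sum in the proof of Theorem 3.2 can be evaluated directly. The sketch below is a minimal illustration in plain Python; the function names and the list-based representation of $\hat\theta_k$ and $\hat\Sigma_k$ are assumptions, not the interface of the released code. It returns 1 in the first case of the proof and $t^2$ in the second case.

```python
import math

def worst_case_component_prob(theta_hat, sigma_hat, rho, x):
    """Evaluate f_k(x) for one component, following the two cases of the proof."""
    d = len(x)
    a = -sum(theta_hat[i] * x[i] for i in range(d))                               # A_k
    quad = sum(x[i] * sigma_hat[i][j] * x[j] for i in range(d) for j in range(d))
    b = math.sqrt(quad)                                                            # B_k
    c = rho * math.sqrt(sum(v * v for v in x))                                     # C_k
    if a + c >= 0:
        return 1.0  # first case: the worst-case probability equals 1
    # second case: root of (A^2 + B^2) t^2 + 2 A C t + C^2 - B^2 = 0, then f_k = t^2
    t = (-a * c + b * math.sqrt(a * a + b * b - c * c)) / (a * a + b * b)
    return t * t

def worst_case_mixture_prob(weights, thetas, sigmas, rhos, x):
    """Objective f(x) = sum_k p_k * f_k(x) from the proof of Theorem 3.2."""
    return sum(
        p * worst_case_component_prob(th, sg, r, x)
        for p, th, sg, r in zip(weights, thetas, sigmas, rhos)
    )

# Example with a single component in two dimensions.
identity = [[1.0, 0.0], [0.0, 1.0]]
value = worst_case_mixture_prob([1.0], [[1.0, 1.0]], [identity], [0.1], [2.0, 2.0])
```

As a sanity check, setting $\rho_k = 0$ reduces the expression to $B_k^2 / (A_k^2 + B_k^2)$, which is the familiar one-sided Chebyshev bound for a distribution with known mean and covariance.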

B.2 PROOFS OF SECTION 4

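To prove Theorem 4.1, we first define the following worst-case Gaussian component probability function

$$f_k^{\mathcal{N}}(x) \triangleq \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) = \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) \quad \forall k \in [K].$$

The next proposition provides the reformulation of $f_k^{\mathcal{N}}$.

Proposition B.2 (Worst-case probability - Gaussian). For any $x \in \mathbb{R}^d$, any $k \in [K]$ and any $(\hat\theta_k, \hat\Sigma_k, \rho_k) \in \mathbb{R}^d \times \mathbb{S}_+^d \times \mathbb{R}_+$, define the following constants $A_k \triangleq -\hat\theta_k^\top x$, $B_k \triangleq \sqrt{x^\top \hat\Sigma_k x}$, and $C_k \triangleq \rho_k \|x\|_2$. The following holds:

(i) We have $f_k^{\mathcal{N}}(x) < \frac{1}{2}$ if and only if $A_k + C_k < 0$.

(ii) If $x$ satisfies $f_k^{\mathcal{N}}(x) < \frac{1}{2}$, then

$$f_k^{\mathcal{N}}(x) = 1 - \Phi\left(\frac{A_k^2 - C_k^2}{-A_k B_k + C_k \sqrt{A_k^2 + B_k^2 - C_k^2}}\right).$$

Proof of Proposition B.2. We first prove Assertion (i). Pick any $\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)$; then $\mathbb{Q}_k$ is a Gaussian distribution $\mathbb{Q}_k \sim \mathcal{N}(\theta_k, \Sigma_k)$, and thus

$$\mathbb{Q}_k(\tilde\theta^\top x \le 0) = \Phi\left(\frac{-\theta_k^\top x}{\sqrt{x^\top \Sigma_k x}}\right).$$

Guaranteeing $f_k^{\mathcal{N}}(x) < \frac{1}{2}$ is equivalent to guaranteeing

$$\sup_{\mathbb{G}((\theta_k, \Sigma_k), (\hat\theta_k, \hat\Sigma_k)) \le \rho_k} -\theta_k^\top x < 0.$$

Note that we also have

$$\sup_{\mathbb{G}((\theta_k, \Sigma_k), (\hat\theta_k, \hat\Sigma_k)) \le \rho_k} -\theta_k^\top x = \sup_{\theta_k : \|\theta_k - \hat\theta_k\|_2 \le \rho_k} -\theta_k^\top x = -\hat\theta_k^\top x + \rho_k \|x\|_2$$

by the properties of the dual norm. This leads to the equivalent condition that $A_k + C_k < 0$.

We now prove Assertion (ii). Using the definition of the Value-at-Risk in Definition B.1, we have

$$\begin{aligned}
\sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0)
&= \inf\Big\{\beta : \beta \in [0, \tfrac{1}{2}),\ \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) \le \beta\Big\} \\
&= \inf\Big\{\beta : \beta \in [0, \tfrac{1}{2}),\ \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k\text{-VaR}_\beta(-\tilde\theta^\top x) \le 0\Big\}.
\end{aligned}$$

Using the result from Nguyen (2019, Lemma 3.31), we have

$$\sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k\text{-VaR}_\beta(-\tilde\theta^\top x) = -\hat\theta_k^\top x + t \sqrt{x^\top \hat\Sigma_k x} + \rho_k \sqrt{1 + t^2}\, \|x\|_2 = A_k + B_k t + C_k \sqrt{1 + t^2},$$

with $t = \Phi^{-1}(1 - \beta)$. Taking the infimum over $\beta$ is then equivalent to finding the root of the equation

$$A_k + t B_k + C_k \sqrt{1 + t^2} = 0.$$

Using the transformation $\tau = 1/t$, the above equation becomes

$$A_k \tau + B_k + C_k \sqrt{1 + \tau^2} = 0$$

with solution

$$\tau = \frac{-A_k B_k + C_k \sqrt{A_k^2 + B_k^2 - C_k^2}}{A_k^2 - C_k^2} > 0.$$

Notice that $A_k + C_k < 0$, and we also have $A_k^2 > C_k^2$; thus $\tau$ is well-defined. The result now follows by noticing that $f_k^{\mathcal{N}}(x) = 1 - \Phi(t) = 1 - \Phi(1/\tau)$.

We are now ready to prove Theorem 4.1.

Proof of Theorem 4.1. Problem (8) is equivalent to

$$\begin{array}{cl}
\min & \sum_{k \in [K]} \hat p_k \times f_k^{\mathcal{N}}(x) \\
\text{s.t.} & c(x, x_0) \le \delta \\
& f_k^{\mathcal{N}}(x) < \frac{1}{2} \quad \forall k \in [K].
\end{array}$$

Applying Proposition B.2, we obtain the necessary result.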
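Proposition B.2 (ii) is equally easy to evaluate numerically. The following sketch uses only the standard library, with $\Phi$ written through `math.erf`; the function names are hypothetical, and the code assumes that $B_k$ and $C_k$ are not both zero so that the denominator is positive. It raises an error when $A_k + C_k \ge 0$, since the closed form applies only when $f_k^{\mathcal{N}}(x) < \frac{1}{2}$.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_worst_case_prob(theta_hat, sigma_hat, rho, x):
    """Evaluate f_k^N(x) via Proposition B.2 (ii); valid only when A_k + C_k < 0."""
    d = len(x)
    a = -sum(theta_hat[i] * x[i] for i in range(d))                               # A_k
    quad = sum(x[i] * sigma_hat[i][j] * x[j] for i in range(d) for j in range(d))
    b = math.sqrt(quad)                                                            # B_k
    c = rho * math.sqrt(sum(v * v for v in x))                                     # C_k
    if a + c >= 0:
        raise ValueError("A_k + C_k >= 0: Proposition B.2 (ii) does not apply")
    # t = 1 / tau, with tau the positive root derived in the proof
    t = (a * a - c * c) / (-a * b + c * math.sqrt(a * a + b * b - c * c))
    return 1.0 - std_normal_cdf(t)

# Example: one component with identity covariance.
identity = [[1.0, 0.0], [0.0, 1.0]]
robust_value = gaussian_worst_case_prob([1.0, 1.0], identity, 0.1, [2.0, 2.0])
nominal_value = gaussian_worst_case_prob([1.0, 1.0], identity, 0.0, [2.0, 2.0])
```

With $\rho_k = 0$ the argument of $\Phi$ collapses to $-A_k / B_k$, so the function returns the nominal Gaussian probability $\Phi(A_k / B_k)$; a positive radius can only increase this value.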

C EXTENSIONS OF THE DIRRAC FRAMEWORK

Throughout this section, we explore two extensions of our DiRRAc framework. In Section C.1, we study an additional layer of robustification with respect to the mixture weights $\hat p$. Next, in Section C.2, we consider an alternative formulation of the objective function to minimize the worst-case component probability.

C.1 ROBUSTIFICATION AGAINST MIXTURE WEIGHT UNCERTAINTY

The DiRRAc problem considered in Section 3 only robustifies the component distributions $\hat{\mathbb{P}}_k$. We now discuss a plausible approach to robustify against the misspecification of the mixture weights $\hat p$. Because the mixture weights should form a probability vector, it is convenient to model the perturbation in the mixture weights using the φ-divergence.

Definition C.1 (φ-divergence). Let $\phi : \mathbb{R} \to \mathbb{R}$ be a convex function on the domain $\mathbb{R}_+$, with $\phi(1) = 0$, $0 \times \phi(a/0) = a \times \lim_{t \uparrow \infty} \phi(t)/t$ for $a > 0$, and $0 \times \phi(0/0) = 0$. The φ-divergence $D_\phi$ between two probability vectors $p, \hat p \in \mathbb{R}_+^K$ amounts to

$$D_\phi(p \,\|\, \hat p) \triangleq \sum_{k \in [K]} \hat p_k \times \phi(p_k / \hat p_k).$$

The family of φ-divergences contains many well-known statistical divergences such as the Kullback-Leibler divergence and the Hellinger distance. Further discussion on this family can be found in Pardo (2018). Distributionally robust optimization models with φ-divergence ambiguity sets were originally studied in decision-making problems (Ben-Tal et al., 2013; Bayraksan & Love, 2015) and have recently gained attention thanks to their successes in machine learning tasks (Namkoong & Duchi, 2017; Hashimoto et al., 2018; Duchi et al., 2021).

Let $\varepsilon \ge 0$ be a parameter indicating the uncertainty level of the mixture weights. The uncertainty set for the mixture weights is formally defined as

$$\Delta \triangleq \big\{p \in [0, 1]^K : \mathbb{1}^\top p = 1,\ D_\phi(p \,\|\, \hat p) \le \varepsilon\big\},$$

which contains all $K$-dimensional probability vectors whose φ-divergence from the nominal weights $\hat p$ is at most $\varepsilon$. The ambiguity set of the mixture distributions that hedges against the weight misspecification is

$$\mathbb{U}(\hat{\mathbb{P}}) \triangleq \big\{\mathbb{Q} : \exists p \in \Delta,\ \exists \mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)\ \forall k \in [K] \text{ such that } \mathbb{Q} \sim (\mathbb{Q}_k, p_k)_{k \in [K]}\big\},$$

where the component sets $\mathbb{B}_k(\hat{\mathbb{P}}_k)$ are defined as in Section 3. The DiRRAc problem with respect to the ambiguity set $\mathbb{U}(\hat{\mathbb{P}})$ becomes

$$\begin{array}{cl}
\min & \sup_{\mathbb{P} \in \mathbb{U}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0) \\
\text{s.t.} & c(x, x_0) \le \delta \\
& \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) < 1 \quad \forall k \in [K].
\end{array} \tag{12}$$

It is important to note at this point that the feasible set of (12) coincides with the feasible set of (2). Thus, to resolve problem (12), it suffices to analyze the objective function of (12). Given the function $\phi$, we define its conjugate function $\phi^* : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ by

$$\phi^*(s) = \sup_{t \ge 0} \{ts - \phi(t)\}.$$

The next theorem asserts that the worst-case probability under $\mathbb{U}(\hat{\mathbb{P}})$ can be computed by solving a convex program.

Theorem C.2 (Objective value). The feasible set of problem (12) coincides with $\mathcal{X}$. Further, for every $x \in \mathcal{X}$, the objective value of (12) equals the optimal value of a convex optimization problem

$$\sup_{\mathbb{P} \in \mathbb{U}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0) = \min_{\lambda \in \mathbb{R}_+,\ \eta \in \mathbb{R}} \ \eta + \varepsilon\lambda + \lambda \sum_{k \in [K]} \hat p_k\, \phi^*\!\left(\frac{f_k(x) - \eta}{\lambda}\right),$$

where $f_k(x)$ are computed using Proposition 3.4.

Proof of Theorem C.2. From the definition of the set $\mathbb{U}(\hat{\mathbb{P}})$, we can rewrite $F$ using a two-layer decomposition

$$\begin{aligned}
F(x) = \sup_{\mathbb{P} \in \mathbb{U}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0)
&= \sup_{p \in \Delta} \ \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)\ \forall k} \ \sum_{k \in [K]} p_k\, \mathbb{Q}_k(\tilde\theta^\top x \le 0) \\
&= \sup_{p \in \Delta} \ \sum_{k \in [K]} p_k \times \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\tilde\theta^\top x \le 0) \\
&= \sup_{p \in \Delta} \ \sum_{k \in [K]} p_k \times f_k(x),
\end{aligned}$$

where the equality in the second line follows from the non-negativity of $p_k$, and the last equality follows from the definition of $f_k(x)$ in (5). By applying the result from Ben-Tal et al. (2013, Corollary 4.2), we have

$$F(x) = \left\{\begin{array}{cl}
\min & \eta + \varepsilon\lambda + \lambda \sum_{k \in [K]} \hat p_k\, \phi^*\!\left(\dfrac{f_k(x) - \eta}{\lambda}\right) \\
\text{s.t.} & \lambda \in \mathbb{R}_+,\ \eta \in \mathbb{R}.
\end{array}\right.$$

The proof is complete.

From the result of Theorem C.2, we can derive the gradient of the objective function of (12) using Danskin's theorem (Shapiro et al., 2009, Theorem 7.21), or simply using auto-differentiation. Furthermore, $\phi^*$ is convex, and thus the minimization problem in Theorem C.2 can be solved efficiently using convex optimization algorithms.
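To make Theorem C.2 concrete, consider the Kullback-Leibler divergence as one member of the φ-divergence family. This choice is an illustration and is not singled out by the text. With $\phi(t) = t \log t - t + 1$, the conjugate is $\phi^*(s) = e^s - 1$, the minimization over $\eta$ has the closed form $\eta = \lambda \log \sum_k \hat p_k e^{f_k(x)/\lambda}$, and the dual reduces to a one-dimensional problem in $\lambda$. The sketch below solves it with a simple grid search; the function name, the grid range, and the grid size are assumptions, and the nominal weights are assumed to be strictly positive.

```python
import math

def worst_case_mixture_prob_kl(f_vals, p_hat, eps, n_grid=20000):
    """Worst case of sum_k p_k f_k over weights p with KL(p || p_hat) <= eps.

    Uses the reduced dual  min_{lam > 0} lam * eps + lam * log(sum_k p_hat_k * exp(f_k / lam)),
    evaluated on a log-spaced grid of lam values (a crude but dependency-free solver).
    """
    f_max = max(f_vals)
    best = f_max  # the worst case can never exceed the largest component value
    for i in range(n_grid):
        lam = 10.0 ** (-4.0 + 6.0 * i / (n_grid - 1))
        # log-sum-exp shifted by f_max so that every exponent is non-positive
        lse = math.log(sum(p * math.exp((f - f_max) / lam) for p, f in zip(p_hat, f_vals)))
        best = min(best, lam * eps + f_max + lam * lse)
    return best

# Example: two components with worst-case probabilities 0.1 and 0.4.
robust_objective = worst_case_mixture_prob_kl([0.1, 0.4], [0.5, 0.5], eps=0.05)
```

For two components the result can be checked against a brute-force search over $p = (q, 1 - q)$, which should agree with the dual value up to the grid resolution. A proper implementation would replace the grid search with a one-dimensional convex solver.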

C.2 MINIMIZING THE WORST-CASE COMPONENT PROBABILITY

Instead of minimizing the (total) probability of an unfavorable outcome, we can consider an alternative formulation where the recourse action minimizes the worst-case conditional probability of an unfavorable outcome over all $K$ components. Mathematically, if we opt for the component ambiguity sets $\mathbb{B}_k(\hat{\mathbb{P}}_k)$ constructed in Section 3, then we can solve

$$\begin{array}{cl}
\min & \max_{k \in [K]} \ \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) \\
\text{s.t.} & c(x, x_0) \le \delta \\
& \sup_{\mathbb{Q}_k \in \mathbb{B}_k(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) < 1 \quad \forall k \in [K].
\end{array} \tag{13a}$$

Interestingly, problem (13a) does not involve the mixture weights $\hat p$. As a consequence, a trivial advantage of this model is that it hedges automatically against the misspecification of $\hat p$. For completeness, we provide its equivalent finite-dimensional form.

Corollary C.3 (Component Probability DiRRAc). Problem (13a) is equivalent to

$$\min_{x \in \mathcal{X}} \ \max_{k \in [K]} \ \left(\frac{\rho_k\, \hat\theta_k^\top x\, \|x\|_2 + \sqrt{x^\top \hat\Sigma_k x}\, \sqrt{(\hat\theta_k^\top x)^2 + x^\top \hat\Sigma_k x - \rho_k^2 \|x\|_2^2}}{(\hat\theta_k^\top x)^2 + x^\top \hat\Sigma_k x}\right)^2.$$

D EXTENSIONS OF THE GAUSSIAN DIRRAC FRAMEWORK

In this section, we leverage the results in Section C to extend the Gaussian DiRRAc framework to (i) handle the uncertainty of the mixture weights and (ii) minimize the worst-case modal probability. Recall that each individual mixture ambiguity set $\mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)$ is of the form

$$\mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k) = \big\{\mathbb{Q}_k : \mathbb{Q}_k \sim \mathcal{N}(\theta_k, \Sigma_k),\ \mathbb{G}((\theta_k, \Sigma_k), (\hat\theta_k, \hat\Sigma_k)) \le \rho_k\big\},$$

which is a ball in the space of Gaussian distributions.

D.1 HANDLING MIXTURE WEIGHT UNCERTAINTY -GAUSSIAN DIRRAC

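Following the notations in Section C.1, we define the set of possible mixture weights as

$$\Delta = \big\{p \in [0, 1]^K : \mathbb{1}^\top p = 1,\ D_\phi(p \,\|\, \hat p) \le \varepsilon\big\}$$

and the ambiguity set with Gaussian information as

$$\mathbb{U}^{\mathcal{N}}(\hat{\mathbb{P}}) = \big\{\mathbb{Q} : \exists p \in \Delta,\ \exists \mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)\ \forall k \in [K] \text{ such that } \mathbb{Q} \sim (\mathbb{Q}_k, p_k)_{k \in [K]}\big\}.$$

The distributionally robust problem with respect to the ambiguity set $\mathbb{U}^{\mathcal{N}}(\hat{\mathbb{P}})$ is

$$\begin{array}{cl}
\inf & \sup_{\mathbb{P} \in \mathbb{U}^{\mathcal{N}}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0) \\
\text{s.t.} & c(x, x_0) \le \delta \\
& \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) < \frac{1}{2} \quad \forall k \in [K].
\end{array} \tag{14}$$

Following the results in Section 4, the feasible set of (14) coincides with the set $\mathcal{X}$. It now suffices to provide the reformulation for the objective function of (14).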

D.2 MINIMIZING WORST-CASE COMPONENT PROBABILITY

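We now consider the Gaussian DiRRAc that minimizes the worst-case modal probability of infeasibility. More concretely, we consider the recourse action obtained by solving

$$\begin{array}{cl}
\inf & \max_{k \in [K]} \ \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) \\
\text{s.t.} & c(x, x_0) \le \delta \\
& \sup_{\mathbb{Q}_k \in \mathbb{B}_k^{\mathcal{N}}(\hat{\mathbb{P}}_k)} \mathbb{Q}_k(\mathcal{C}_{\tilde\theta}(x) = 0) < \frac{1}{2} \quad \forall k \in [K].
\end{array} \tag{15a}$$

The next corollary provides the equivalent form of the above optimization problem.

Corollary D.2. Problem (15a) is equivalent to

$$\inf_{x \in \mathcal{X}} \ \max_{k \in [K]} \ \left\{1 - \Phi\left(\frac{(\hat\theta_k^\top x)^2 - \rho_k^2 \|x\|_2^2}{\hat\theta_k^\top x\, \sqrt{x^\top \hat\Sigma_k x} + \rho_k \|x\|_2 \sqrt{(\hat\theta_k^\top x)^2 + x^\top \hat\Sigma_k x - \rho_k^2 \|x\|_2^2}}\right)\right\}.$$

E PROJECTED GRADIENT DESCENT ALGORITHM

The pseudocode of the algorithm is presented in Algorithm 1. The convergence guarantee for Algorithm 1 follows from Beck (2017, Theorem 10.15), and is distilled in the next theorem.

Algorithm 1 Projected gradient descent algorithm with backtracking line-search
  Input: Input instance $x_0$, feasible set $\mathcal{X}_\varepsilon$ and objective function $f$
  Line search parameters: $\lambda \in (0, 1)$, $\zeta > 0$ (Default values: $\lambda = 0.7$, $\zeta = 1$)
  Initialization: Set $x^0 \leftarrow \mathrm{Proj}_{\mathcal{X}_\varepsilon}(x_0)$
  for $t = 0, \ldots, T - 1$ do
    Find the smallest integer $i \ge 0$ such that
      $$f\big(\mathrm{Proj}_{\mathcal{X}_\varepsilon}(x^t - \lambda^i \zeta \nabla f(x^t))\big) \le f(x^t) - \frac{1}{2 \lambda^i \zeta} \big\|x^t - \mathrm{Proj}_{\mathcal{X}_\varepsilon}(x^t - \lambda^i \zeta \nabla f(x^t))\big\|_2^2.$$
    Set $x^{t+1} = \mathrm{Proj}_{\mathcal{X}_\varepsilon}(x^t - \lambda^i \zeta \nabla f(x^t))$.
  end for
  Output: $x^T$

Theorem E.1 (Convergence guarantee). Let $\{x^t\}_{t = 0, 1, \ldots, T}$ be the sequence generated by Algorithm 1. Then, all limit points of the sequence $\{x^t\}_{t = 0, 1, \ldots}$ are stationary points of problem (4) with the modified feasible set $\mathcal{X}_\varepsilon$. Furthermore, there exists some constant $C > 0$ such that for any $T \ge 1$, we have

$$\min_{t = 0, 1, \ldots, T} \ \frac{\big\|x^t - \mathrm{Proj}_{\mathcal{X}_\varepsilon}(x^t - \zeta \nabla f(x^t))\big\|_2}{\zeta} \le \frac{C}{\sqrt{T}}.$$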
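The loop in Algorithm 1 can be sketched generically as follows. This is an illustrative version and not the released implementation: the objective, gradient, and projection are passed in as callables, and the toy problem at the bottom (a quadratic objective over the unit Euclidean ball) is a stand-in for the DiRRAc objective and the set $\mathcal{X}_\varepsilon$, whose projection is not reproduced here. The cap `max_backtracks` is an added safeguard that Algorithm 1 does not contain.

```python
import math

def projected_gradient_descent(f, grad, proj, x_init, lam=0.7, zeta=1.0, T=100, max_backtracks=50):
    """Projected gradient descent with backtracking line search (sketch of Algorithm 1)."""
    x = proj(x_init)
    for _ in range(T):
        g = grad(x)
        for i in range(max_backtracks):
            step = (lam ** i) * zeta
            cand = proj([xj - step * gj for xj, gj in zip(x, g)])
            dist2 = sum((xj - cj) ** 2 for xj, cj in zip(x, cand))
            # sufficient decrease condition from Algorithm 1
            if f(cand) <= f(x) - dist2 / (2.0 * step):
                break
        x = cand  # last candidate tried if the cap on backtracks is reached
    return x

# Toy stand-in problem: minimize 0.5 * ||x - target||^2 over the unit ball.
target = [2.0, 0.0]

def toy_objective(x):
    return 0.5 * sum((xj - tj) ** 2 for xj, tj in zip(x, target))

def toy_gradient(x):
    return [xj - tj for xj, tj in zip(x, target)]

def project_unit_ball(x):
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= 1.0 else [v / norm for v in x]

solution = projected_gradient_descent(toy_objective, toy_gradient, project_unit_ball, [0.0, 0.0])
```

On this toy problem the iterates reach the boundary point closest to the target after the first step and stay there, which is the stationary point that Theorem E.1 guarantees in the limit.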



Figure 2: The feasible set $\mathcal{X}$ in (3) is shaded in blue. The circular arc represents the proximity boundary $c(x, x_0) = \delta$ with $c$ being the Euclidean distance. Dashed lines represent the hyperplanes $-\hat\theta_k^\top x = 0$ for different $k$, while elliptic curves represent the robust margins $-\hat\theta_k^\top x + \rho_k \|x\|_2 = 0$ with matching color. Increasing the ambiguity size $\rho_k$ brings the elliptic curves towards the top-right corner and farther away from the dashed lines. The set $\mathcal{X}$, taken as the intersection of the elliptical and proximity constraints, will move deeper into the interior of the favorable prediction region, resulting in more robust recourses.

Figure 3: Comparison of $M_2$ validity as a function of the $l_1$ distance between the input instance and the recourse for our DiRRAc method and ROAR on real datasets.

Table 1: Benchmark of $M_1$ and $M_2$ validity, $l_1$ and $l_2$ cost for linear models on real datasets.

Figure 4: Synthetic data shifts and the corresponding model parameter shifts (decision boundaries).

Figure 6: Elbow method for determining the optimal number of components for parameter shifts. Dashed lines represent the optimal K for three real-world datasets. German Credit: Elbow at K = 5. SBA: Elbow at K = 4. Student Performance: Elbow at K = 6.

Figure 7: Sensitivity analysis of hyperparameters ρ k to l 1 cost and M 2 validity of DiRRAc.

Figure 8: Comparison of M 2 validity as a function of the l 1 distance between input instance and the recourse for our DiRRAc method and ROAR on synthetic data.

Figure 9: Impact of distribution shifts to the empirical validity. Left: mean shifts parametrized by α; Center: covariance shifts parametrized by β; Right: Mean and covariance shifts with α = β.

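Corollary D.1. For any $x \in \mathcal{X}$, we have

$$\sup_{\mathbb{P} \in \mathbb{U}^{\mathcal{N}}(\hat{\mathbb{P}})} \mathbb{P}(\mathcal{C}_{\tilde\theta}(x) = 0) = \left\{\begin{array}{cl}
\min & \eta + \varepsilon\lambda + \lambda \sum_{k \in [K]} \hat p_k\, \phi^*\!\left(\dfrac{f_k^{\mathcal{N}}(x) - \eta}{\lambda}\right) \\
\text{s.t.} & \lambda \in \mathbb{R}_+,\ \eta \in \mathbb{R},
\end{array}\right.$$

where the values $f_k^{\mathcal{N}}(x)$ are obtained in Proposition B.2.

Corollary D.1 follows from Theorem C.2 by replacing the quantities $f_k(x)$ by $f_k^{\mathcal{N}}(x)$ to take into account the Gaussian parametric information. The proof of Corollary D.1 is omitted.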

Benchmark of M 1 and M 2 validity, l 1 and l 2 cost for linear models on real datasets.

Table 2: Benchmark of $M_1$ and $M_2$ validity, $l_1$ and $l_2$ cost for non-linear models on real datasets.

Table 3: Accuracy of the underlying classifiers.

Table 5: Performance of DiRRAc and Gaussian DiRRAc with K components on three real-world datasets.

Table 7: Benchmark of $M_1$ validity, $M_2$ validity, $l_1$ and $l_2$ cost using actionability constraints on different real-world datasets.

Table 8: Comparison with RBR for non-linear models on real datasets.

Table 9: Comparison with MINT on German Credit dataset.

Table 11 reports the average run time: we observe that Wachter has the smallest run time, and our (Gaussian) DiRRAc has a smaller run time than ROAR in all datasets.

Table 11: Average runtime (seconds).


Acknowledgments. Viet Anh Nguyen acknowledges the generous support from the CUHK's Improvement on Competitiveness in Hiring New Faculties Funding Scheme.

