UNDERSTANDING AND MITIGATING ACCURACY DISPARITY IN REGRESSION

Abstract

With the widespread deployment of large-scale prediction systems in high-stakes domains, e.g., face recognition and criminal justice, disparities in prediction accuracy between demographic subgroups have called for a fundamental understanding of their source and for algorithmic interventions to mitigate them. In this paper, we study the accuracy disparity problem in regression. We first propose an error decomposition theorem, which decomposes the accuracy disparity into the distance between label populations and the distance between conditional representations, to help explain why such disparity appears in practice. Motivated by this decomposition and the general idea of distribution alignment with statistical distances, we then propose algorithms to reduce this disparity and analyze the game-theoretic optima of the proposed objective functions. We conduct experiments on four real-world datasets. The experimental results suggest that our proposed algorithms effectively mitigate accuracy disparity while maintaining the predictive power of the regression models.

1. INTRODUCTION

Recent progress in machine learning has led to its widespread use in many high-stakes domains, such as criminal justice, healthcare, student loan approval, and hiring. Meanwhile, it has been widely observed that accuracy disparity can occur inadvertently in a variety of practical scenarios (Barocas & Selbst, 2016). For example, errors are more likely to occur for individuals from certain underrepresented demographic groups (Kim, 2016). Buolamwini & Gebru (2018) showed that notable accuracy disparity gaps exist across racial and gender subgroups in several real-world image classification systems. Moreover, Bagdasaryan et al. (2019) found that differentially private training can even enlarge such gaps. These accuracy disparity gaps across demographic subgroups not only raise concerns in high-stakes applications but can also be exploited by malicious parties to cause information leakage (Yaghini et al., 2019). Despite the pressing need for accuracy parity, most prior work limits its scope to binary classification settings (Hardt et al., 2016; Zafar et al., 2017b; Zhao et al., 2019; Jiang et al., 2019). In a seminal work, Chen et al. (2018) analyzed the impact of data collection on accuracy disparity in general learning models. They provided a descriptive analysis of such parity gaps and advocated collecting more training examples and introducing more predictive variables. While this suggestion is feasible in applications where data collection and labeling are cheap, it is not applicable in domains where collecting more data is time-consuming, expensive, or even infeasible, e.g., autonomous driving or education. Our Contributions In this paper, we provide a prescriptive analysis of accuracy disparity and aim at providing algorithmic interventions to reduce the disparity gap between demographic subgroups in the regression setting.
To start with, we formally characterize why accuracy disparity appears in regression problems by depicting the feasible region of the underlying group-wise errors. We provide a lower bound on the joint error and a complementary upper bound on the error gap across groups. Based on these results, we illustrate why regression models that minimize the global loss will inevitably exhibit accuracy disparity if the input distributions or decision functions differ across groups (see Figure 1a). We further propose an error decomposition theorem that decomposes the accuracy disparity into the distance between label populations and the distance between conditional representations. To mitigate such disparities, we propose two algorithms that reduce accuracy disparity via joint distribution alignment, with the total variation distance and the Wasserstein distance, respectively. Furthermore, we analyze the game-theoretic optima of the objective functions and illustrate the principle of our algorithms from a game-theoretic perspective (see Figure 1b). To corroborate the effectiveness of our proposed algorithms in reducing accuracy disparity, we conduct experiments on four real-world datasets. Experimental results suggest that our proposed algorithms help mitigate accuracy disparity while maintaining the predictive power of the regression models. We believe our theoretical results contribute to the understanding of why accuracy disparity occurs in machine learning models, and the proposed algorithms provide an alternative intervention in real-world scenarios where accuracy parity is desired but collecting more data/features is time-consuming or infeasible.

2. PRELIMINARIES

Notation We use $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ to denote the input and output space. We use $X$ and $Y$ to denote random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Lower-case letters $x$ and $y$ denote instantiations of $X$ and $Y$. We use $H(X)$ to denote the Shannon entropy of the random variable $X$, $H(X \mid Y)$ to denote the conditional entropy of $X$ given $Y$, and $I(X; Y)$ to denote the mutual information between $X$ and $Y$. To simplify the presentation, we use $A \in \{0, 1\}$ as the sensitive attribute, e.g., gender, race, etc. Let $\mathcal{H}$ be the hypothesis class of regression models; i.e., each $h \in \mathcal{H}$ is a predictor $h: \mathcal{X} \to \mathcal{Y}$. Note that even if the predictor does not explicitly take the sensitive attribute $A$ as an input variable, the prediction can still be biased due to correlations with other input variables. In this work we study the stochastic setting in which there is a joint distribution $\mathcal{D}$ over $X$, $Y$, and $A$ from which the data are sampled. For $a \in \{0, 1\}$ and $y \in \mathbb{R}$, we use $\mathcal{D}_a$ to denote the conditional distribution of $\mathcal{D}$ given $A = a$ and $\mathcal{D}^y$ to denote the conditional distribution of $\mathcal{D}$ given $Y = y$. For an event $E$, $\mathcal{D}(E)$ denotes the probability of $E$ under $\mathcal{D}$. Given a feature transformation function $g: \mathcal{X} \to \mathcal{Z}$ that maps instances from the input space $\mathcal{X}$ to the feature space $\mathcal{Z}$, we define $g_\sharp\mathcal{D} := \mathcal{D} \circ g^{-1}$ to be the induced (pushforward) distribution of $\mathcal{D}$ under $g$; i.e., for any event $E' \subseteq \mathcal{Z}$, $g_\sharp\mathcal{D}(E') := \mathcal{D}(\{x \in \mathcal{X} \mid g(x) \in E'\})$. Given a joint distribution $\mathcal{D}$, the error of a predictor $h$ under $\mathcal{D}$ is defined as $\mathrm{Err}_{\mathcal{D}}(h) := \mathbb{E}_{\mathcal{D}}[(Y - h(X))^2]$. To make the notation more compact, we may drop the subscript $\mathcal{D}$ when it is clear from the context. Furthermore, we use $\mathrm{MSE}_{\mathcal{D}}(\widehat{Y}, Y)$ to denote the mean squared loss between the predicted variable $\widehat{Y} = h(X)$ and the true label $Y$ over the joint distribution $\mathcal{D}$. Similarly, we use $\mathrm{CE}_{\mathcal{D}}(\widehat{A} \,\|\, A)$ to denote the cross-entropy loss between the predicted variable $\widehat{A}$ and the true label $A$ over the joint distribution $\mathcal{D}$.
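To make the notation concrete, the following small sketch (a hypothetical toy distribution and predictor of our own choosing, not from the paper) estimates the pushforward distribution of a feature map and the squared error $\mathrm{Err}_{\mathcal{D}}(h)$ from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a toy joint distribution D over (X, Y): Y is the first
# coordinate of X plus noise (a hypothetical construction for illustration).
X = rng.normal(size=(1000, 2))
Y = X[:, 0] + 0.1 * rng.normal(size=1000)

def g(x):
    # A feature transformation g: X -> Z (here Z is one-dimensional).
    return x @ np.array([1.0, 0.5])

def h(x):
    # A predictor h: X -> Y that uses only the first coordinate.
    return x[:, 0]

# Empirical pushforward: the samples Z = g(X) represent the induced
# distribution of D under g.
Z = g(X)

# Err_D(h) = E_D[(Y - h(X))^2], estimated by a sample average; here it is
# close to the noise variance 0.1^2 = 0.01.
err = float(np.mean((Y - h(X)) ** 2))
print(err)
```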
Throughout the paper, we make the following standard assumption in regression problems: Assumption 2.1. There exists $M > 0$ such that for any hypothesis $h \in \mathcal{H}$, $\|h\|_\infty \le M$ and $|Y| \le M$. Problem Setup We study the fair regression problem: the goal is to learn a regressor whose errors are approximately equal across the groups given by the sensitive attribute $A$. We assume that the sensitive attribute $A$ is only available to the learner during the training phase and is not visible during the inference phase. We would like to point out that there are many other important definitions of fairness (Narayanan, 2018), even within the sub-category of group fairness, and our discussion is by no means comprehensive. For example, two frequently used definitions in the literature are statistical parity (Dwork et al., 2012) and equalized odds (Hardt et al., 2016). Nevertheless, throughout this paper we mainly focus on accuracy parity as our fairness notion, since machine learning systems have been shown to exhibit substantial accuracy disparities between demographic subgroups (Barocas & Selbst, 2016; Kim, 2016; Buolamwini & Gebru, 2018). This observation has attracted considerable public attention (e.g., see New York Times, The Verge, and Insurance Journal) and calls for machine learning systems that (at least approximately) satisfy accuracy parity. Formally, accuracy parity is defined as follows: Definition 2.1 (Accuracy Parity). Given a joint distribution $\mathcal{D}$, a predictor $h$ satisfies accuracy parity if $\mathrm{Err}_{\mathcal{D}_0}(h) = \mathrm{Err}_{\mathcal{D}_1}(h)$. The violation of accuracy parity is also known as disparate mistreatment (Zafar et al., 2017a). In practice, exact equality of accuracy between two groups is often hard to ensure, so we define the error gap to measure how well a model satisfies accuracy parity: Definition 2.2 (Error Gap).
Given a joint distribution $\mathcal{D}$, the error gap of a hypothesis $h$ is $\Delta_{\mathrm{Err}}(h) := |\mathrm{Err}_{\mathcal{D}_0}(h) - \mathrm{Err}_{\mathcal{D}_1}(h)|$. By definition, if a model satisfies accuracy parity, $\Delta_{\mathrm{Err}}(h)$ is zero. Next we introduce two distance metrics that will be used in our theoretical analysis and algorithm design: • Total variation distance: it measures the largest possible difference between the probabilities that two probability distributions can assign to the same event $E$. We use $d_{\mathrm{TV}}(\mathcal{P}, \mathcal{Q})$ to denote the total variation distance: $d_{\mathrm{TV}}(\mathcal{P}, \mathcal{Q}) := \sup_E |\mathcal{P}(E) - \mathcal{Q}(E)|$. • Wasserstein distance: the Wasserstein distance between two probability distributions is $W_1(\mathcal{P}, \mathcal{Q}) := \sup_{f: \|f\|_L \le 1} \int_\Omega f \, d\mathcal{P} - \int_\Omega f \, d\mathcal{Q}$, where $\|f\|_L$ is the Lipschitz semi-norm of a real-valued function $f$ and $\Omega$ is the sample space over which the two probability distributions $\mathcal{P}$ and $\mathcal{Q}$ are defined. By the Kantorovich-Rubinstein duality theorem (Villani, 2008), we recover the primal form of the Wasserstein distance, $W_1(\mathcal{P}, \mathcal{Q}) := \inf_{\gamma \in \Gamma(\mathcal{P}, \mathcal{Q})} \int d(X, Y) \, d\gamma(X, Y)$, where $\Gamma(\mathcal{P}, \mathcal{Q})$ denotes the collection of all couplings of $\mathcal{P}$ and $\mathcal{Q}$, and $X$ and $Y$ denote random variables with laws $\mathcal{P}$ and $\mathcal{Q}$, respectively. Note that we use the $L_1$ distance for $d(\cdot, \cdot)$ throughout the paper, but the extension to other distances, e.g., the $L_2$ distance, is straightforward.
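As a quick concrete check of the two definitions (toy distributions of our own choosing, not from the paper): for discrete distributions, $d_{\mathrm{TV}}$ equals half the $L_1$ distance between the probability vectors, and for one-dimensional empirical distributions with equal sample sizes and $L_1$ ground cost, $W_1$ is obtained by matching sorted samples:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation between two discrete distributions:
    sup_E |P(E) - Q(E)| = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def w1_empirical(xs, ys):
    """W1 between two 1-D empirical distributions with the same sample
    size: the optimal coupling matches sorted samples (L1 ground cost)."""
    return np.abs(np.sort(xs) - np.sort(ys)).mean()

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(tv_distance(p, q))        # 0.5

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])   # b is a shifted by 1
print(w1_empirical(a, b))       # 1.0 -- the W1 of a pure shift is the shift size
```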

3. MAIN RESULTS

In this section, we first characterize why accuracy disparity arises in regression models. More specifically, given a hypothesis $h \in \mathcal{H}$, we first describe the feasible region of $\mathrm{Err}_{\mathcal{D}_0}$ and $\mathrm{Err}_{\mathcal{D}_1}$ by proving a lower bound on the joint error and an upper bound on the error gap. Then, we give a geometric interpretation that visualizes this feasible region and illustrates how the error gap arises when learning a hypothesis $h$ that minimizes the global squared error. We further analyze accuracy disparity by decomposing it into the distance between label populations and the distance between conditional representations. Motivated by this decomposition, we propose two algorithms to reduce accuracy disparity, connect the game-theoretic optima of their objective functions with our theorems, and describe the practical implementations of the algorithms. Due to the space limit, we defer all detailed proofs to the appendix.

3.1. BOUNDS ON CONDITIONAL ERRORS AND ACCURACY DISPARITY GAP

When we learn a predictor, the prediction function induces the chain $X \xrightarrow{h} \widehat{Y}$, where $\widehat{Y}$ is the predicted target variable given by hypothesis $h$. Hence for any distribution $\mathcal{D}_0$ ($\mathcal{D}_1$) of $X$, the predictor also induces a pushforward distribution $h_\sharp\mathcal{D}_0$ ($h_\sharp\mathcal{D}_1$) of $\widehat{Y}$. Recall that the Wasserstein distance is a metric, hence the following chain of triangle inequalities holds:
$$W_1(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) \le W_1(\mathcal{D}_0(Y), h_\sharp\mathcal{D}_0) + W_1(h_\sharp\mathcal{D}_0, h_\sharp\mathcal{D}_1) + W_1(h_\sharp\mathcal{D}_1, \mathcal{D}_1(Y)).$$
Intuitively, $W_1(\mathcal{D}_0(Y), h_\sharp\mathcal{D}_0)$ and $W_1(h_\sharp\mathcal{D}_1, \mathcal{D}_1(Y))$ measure the distance between the true label distribution and the predicted one in the $A = 0$ and $A = 1$ cases, respectively. This distance is related to the prediction error of $h$ conditioned on $A = a$: Lemma 3.1. Let $\widehat{Y} = h(X) \in \mathbb{R}$; then for $a \in \{0, 1\}$, $W_1(\mathcal{D}_a(Y), h_\sharp\mathcal{D}_a) \le \sqrt{\mathrm{Err}_{\mathcal{D}_a}(h)}$. With the above results, we obtain the following theorem, which characterizes a lower bound on the joint error over the two groups: Theorem 3.1. Let $\widehat{Y} = h(X) \in \mathbb{R}$; then
$$\mathrm{Err}_{\mathcal{D}_0}(h) + \mathrm{Err}_{\mathcal{D}_1}(h) \ge \frac{1}{2}\left[W_1(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - W_1(h_\sharp\mathcal{D}_0, h_\sharp\mathcal{D}_1)\right]_+^2.$$
In Theorem 3.1, we see that if the difference between the label distributions across groups is large while the predicted distributions are closely aligned, then this statistical disparity could potentially lead to a large joint error. Moreover, Theorem 3.1 can be extended to give a lower bound on the overall error incurred by $h$ as well: Corollary 3.1. Let $\widehat{Y} = h(X) \in \mathbb{R}$ and $\alpha = \mathcal{D}(A = 0) \in [0, 1]$; then
$$\mathrm{Err}_{\mathcal{D}}(h) \ge \frac{1}{2}\min\{\alpha, 1 - \alpha\}\left[W_1(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - W_1(h_\sharp\mathcal{D}_0, h_\sharp\mathcal{D}_1)\right]_+^2.$$
Next, we upper bound the error gap to gain more insight into accuracy disparity. For $a \in \{0, 1\}$, define the conditional variance $\mathrm{Var}_{\mathcal{D}_a}[Y \mid X] := \mathbb{E}_{\mathcal{D}_a}[(Y - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])^2 \mid X]$; it is the irreducible error of predicting $Y$ from the knowledge of $X$ alone. Recall also that the optimal decision function conditioned on $A = a$ under mean squared error is $\mathbb{E}_{\mathcal{D}_a}[Y \mid X]$. The following theorem characterizes the upper bound of the error gap between the two groups: Theorem 3.2.
For any hypothesis $h \in \mathcal{H}$, if Assumption 2.1 holds, then:
$$\Delta_{\mathrm{Err}}(h) \le 8M^2 d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \left|\mathbb{E}_{\mathcal{D}_0}[\mathrm{Var}_{\mathcal{D}_0}[Y \mid X]] - \mathbb{E}_{\mathcal{D}_1}[\mathrm{Var}_{\mathcal{D}_1}[Y \mid X]]\right| + 4M \min\{\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|], \mathbb{E}_{\mathcal{D}_1}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|]\}.$$
Remark Theorem 3.2 upper bounds the error gap across groups by three terms: the first term corresponds to the distance between the input distributions across groups, the second term is the difference in noise (variance), and the third term is the discrepancy between the optimal decision functions of the two groups. In an ideal setting where both distributions are noiseless and the optimal decision functions are insensitive to group membership, Theorem 3.2 implies that a sufficient condition for accuracy parity is to find group-invariant representations that minimize $d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X))$. Geometric Interpretation Based on Theorem 3.1 and Theorem 3.2, Figure 1a visually illustrates how accuracy disparity arises given the data distribution and a learned hypothesis that minimizes the global squared error. In Figure 1a, given the hypothesis class $\mathcal{H}$, we use the line $\mathrm{Err}_{\mathcal{D}_0} + \mathrm{Err}_{\mathcal{D}_1} = B$ to denote the lower bound in Theorem 3.1 and the two lines $|\mathrm{Err}_{\mathcal{D}_0} - \mathrm{Err}_{\mathcal{D}_1}| = A$ to denote the upper bound in Theorem 3.2. These three lines form a feasible region (the green area) of $\mathrm{Err}_{\mathcal{D}_0}$ and $\mathrm{Err}_{\mathcal{D}_1}$ under the hypothesis class $\mathcal{H}$. Any hypothesis $h$ designed solely to minimize the overall error can at best intersect one of the two bottom vertices of this region. For example, the hypotheses (the red and blue dotted lines) minimizing the overall error intersect the two vertices of the region that achieve the smallest $\mathrm{Err}_{\mathcal{D}_0}$-intercept ($\mathrm{Err}_{\mathcal{D}_1}$-intercept), due to the imbalance between the two groups.
However, since these two vertices do not lie on the diagonal of the feasible region, there is no guarantee that such a hypothesis satisfies accuracy parity ($\mathrm{Err}_{\mathcal{D}_0} = \mathrm{Err}_{\mathcal{D}_1}$), unless we can shrink the width of the green area to zero. Conditional Distribution Alignment Reduces Accuracy Disparity Theorem 3.2 shows how accuracy disparity arises in regression models due to noise, the distance between representations, and the distance between decision functions. However, it is nearly impossible to collect noiseless data with group-invariant input distributions. Moreover, there is no guarantee that the upper bound decreases if we learn group-invariant representations that minimize $d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X))$ alone, since the learned representation could potentially increase the variance. In this regard, we prove a novel upper bound that is free of the noise term, which motivates aligning conditional distributions to mitigate the error disparity across groups. To do so, we relate the error gap to the label distributions and the predicted distributions conditioned on $Y = y$: Theorem 3.3. If Assumption 2.1 holds, then for all $h \in \mathcal{H}$, letting $\widehat{Y} = h(X)$, the following inequality holds:
$$\Delta_{\mathrm{Err}}(h) \le 8M^2 d_{\mathrm{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) + 3M \min\{\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0^Y}[\widehat{Y}] - \mathbb{E}_{\mathcal{D}_1^Y}[\widehat{Y}]|], \mathbb{E}_{\mathcal{D}_1}[|\mathbb{E}_{\mathcal{D}_0^Y}[\widehat{Y}] - \mathbb{E}_{\mathcal{D}_1^Y}[\widehat{Y}]|]\}.$$
Remark The error gap is upper bounded by two terms: the distance between the label distributions and the discrepancy between the conditional predicted distributions across groups. Note that this differs from the decomposition in Theorem 3.2, where the marginal distribution is over $X$ instead of $Y$. Given a dataset, the distance between the label distributions is a constant since the label distributions are fixed. For the second term, if we can minimize the discrepancy between the conditional predicted distributions across groups, we obtain a model that is free of accuracy disparity whenever the label distributions are well aligned.
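The bounds of this subsection can be sanity-checked numerically. The sketch below (a toy Gaussian construction of our own; the empirical one-dimensional $W_1$ is computed by matching sorted samples) verifies Lemma 3.1 and the lower bound of Theorem 3.1 for a single shared predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

def w1(xs, ys):
    # Empirical 1-D W1 with equal sample sizes: match sorted samples.
    return np.abs(np.sort(xs) - np.sort(ys)).mean()

# Two groups with different label distributions (hypothetical toy setup).
x0 = rng.normal(0.0, 1.0, n); y0 = x0          # group 0: Y = X
x1 = rng.normal(2.0, 1.0, n); y1 = x1 + 1.0    # group 1: Y = X + 1

def h(x):
    # A single shared predictor, deliberately between the two groups'
    # decision functions, so both groups incur some error.
    return x + 0.75

yhat0, yhat1 = h(x0), h(x1)
err0 = np.mean((y0 - yhat0) ** 2)   # Err_{D_0}(h): exactly 0.75^2
err1 = np.mean((y1 - yhat1) ** 2)   # Err_{D_1}(h): about 0.25^2

# Lemma 3.1: W1(D_a(Y), h#D_a) <= sqrt(Err_{D_a}(h)).
assert w1(y0, yhat0) <= np.sqrt(err0) + 1e-9
assert w1(y1, yhat1) <= np.sqrt(err1) + 1e-9

# Theorem 3.1 (lower bound): Err_0 + Err_1 >=
#   0.5 * max(W1(D_0(Y), D_1(Y)) - W1(h#D_0, h#D_1), 0)^2.
gap = max(w1(y0, y1) - w1(yhat0, yhat1), 0.0)
assert err0 + err1 >= 0.5 * gap ** 2
print("both bounds hold")
```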

3.2. ALGORITHM DESIGN

Inspired by Theorem 3.3, we can mitigate the error gap by aligning the group distributions, i.e., minimizing the distance between the conditional distributions across groups. However, it is intractable to do so explicitly in regression problems since $Y$ can take infinitely many values in $\mathbb{R}$. Next we present two algorithms that approximately solve the problem through adversarial representation learning. Given a Markov chain $X \xrightarrow{g} Z \xrightarrow{h} \widehat{Y}$, we are interested in learning group-invariant conditional representations so that the discrepancy between the induced conditional distributions $\mathcal{D}_0^Y(Z = g(X))$ and $\mathcal{D}_1^Y(Z = g(X))$ is minimized; in this case, the second term of the upper bound in Theorem 3.3 is minimized. This is in general not feasible directly, since $Y$ is a continuous random variable. Instead, we propose to learn representations $Z$ that minimize the discrepancy between the joint distributions $\mathcal{D}_0(Z = g(X), Y)$ and $\mathcal{D}_1(Z = g(X), Y)$. In Theorem 3.4 and Theorem 3.5 we show that the distance between the conditional predicted distributions $\mathcal{D}_0^Y(Z = g(X))$ and $\mathcal{D}_1^Y(Z = g(X))$ is minimized when the distance between the joint distributions $\mathcal{D}_0(Z = g(X), Y)$ and $\mathcal{D}_1(Z = g(X), Y)$ is minimized. To proceed, we first consider using the total variation distance to measure the distance between two distributions. In particular, we can learn a binary discriminator $f: \mathcal{Z} \times \mathcal{Y} \to \mathcal{A}$ that achieves the minimum binary classification error in discriminating between points sampled from the two distributions. In practice, we use the cross-entropy loss as a convex surrogate. Formally, we consider the following minimax game between $g$ and $f$:
$$\min_{f \in \mathcal{F}} \max_{g} \ \mathrm{CE}_{\mathcal{D}}(A \,\|\, f(g(X), Y)). \quad (1)$$
Next we show that in the above game, the optimal feature transformation $g$ is the one that induces invariant conditional feature distributions. Theorem 3.4. Consider the minimax game in (1). The equilibrium $(g^*, f^*)$ of the game is attained when (i) $Z = g^*(X)$ is independent of $A$ conditioned on $Y$, and (ii) $f^*(Z, Y) = \mathcal{D}(A = 1 \mid Y, Z)$. Since at the equilibrium of the game $Z$ is independent of $A$ conditioned on $Y$, the optimal $f^*(Z, Y)$ can equivalently be written as $f^*(Z, Y) = \mathcal{D}(A = 1 \mid Y)$; i.e., the only information useful to the discriminator at equilibrium is the external information $Y$. In Theorem 3.4, the minimum cross-entropy loss the discriminator can achieve at the equilibrium of the game is $H(A \mid Z, Y)$ (see Proposition A.1 in Appendix A). By the basic properties of conditional entropy, we have
$$\min_{f \in \mathcal{F}} \mathrm{CE}_{\mathcal{D}}(A \,\|\, f(g(X), Y)) = H(A \mid Z, Y) = H(A \mid Y) - I(A; Z \mid Y).$$
Since $H(A \mid Y)$ is a constant given the data distribution, the maximization over $g$ in (1) is equivalent to the minimization $\min_{Z = g(X)} I(A; Z \mid Y)$, and it follows that the optimal strategy for the transformation $g$ is the one that induces conditionally invariant features, i.e., $I(A; Z \mid Y) = 0$. Formally, we arrive at the following minimax problem:
$$\min_{h, g} \max_{f \in \mathcal{F}} \ \mathrm{MSE}_{\mathcal{D}}(h(g(X)), Y) - \lambda \cdot \mathrm{CE}_{\mathcal{D}}(A \,\|\, f(g(X), Y)). \quad (2)$$
In the above formulation, the first term corresponds to the minimization of the prediction loss of the target task and the second term is the loss incurred by the adversary $f$. As a whole, the minimax optimization problem expresses a trade-off (controlled by the hyper-parameter $\lambda > 0$) between accuracy and accuracy disparity through the representation learning function $g$. Wasserstein Variant Similarly, if we choose to align the joint distributions by minimizing the Wasserstein distance, the following theorem holds. Theorem 3.5. Let $g^* := \arg\min_g W_1(\mathcal{D}_0(g(X), Y), \mathcal{D}_1(g(X), Y))$; then $\mathcal{D}_0^Y(Z = g^*(X)) = \mathcal{D}_1^Y(Z = g^*(X))$ almost surely. One notable advantage of using the Wasserstein distance instead of the TV distance is that the Wasserstein distance is a continuous functional of both the feature map $g$ and the discriminator $f$ (Arjovsky et al., 2017).
Furthermore, if both $g$ and $f$ are continuous functions of their model parameters (which is the case for the models we use in experiments), the objective function is continuous in both sets of parameters. This property makes the Wasserstein distance more favorable from an optimization perspective. Using the dual formulation, we can equivalently learn a Lipschitz function $f: \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}$ as a witness function:
$$\min_{h, g} \max_{f: \|f\|_L \le 1} \ \mathrm{MSE}_{\mathcal{D}}(h(g(X)), Y) + \lambda \left(\mathbb{E}_{Z_0 \sim g_\sharp\mathcal{D}_0}[f(Z_0, Y)] - \mathbb{E}_{Z_1 \sim g_\sharp\mathcal{D}_1}[f(Z_1, Y)]\right). \quad (3)$$
Game-Theoretic Interpretation To make our algorithms easier to follow, we provide a game-theoretic interpretation in Figure 1b. Consider Alice (encoder) and Bob (discriminator) participating in a two-player game: upon receiving a set of inputs $X$, Alice applies a transformation to generate the corresponding features $Z$ and sends them to Bob. Besides the features sent by Alice, Bob also has access to external information $Y$, the labels corresponding to those features. Having both the features $Z$ and the labels $Y$, Bob's goal is to guess the group membership $A$ of each feature sent by Alice as accurately as possible. Alice's goal, on the other hand, is to compete with Bob, i.e., to find a transformation that confuses Bob as much as she can. Different from a traditional game without external information, here, due to the external information $Y$ that Bob has access to, Alice cannot hope to fully fool Bob, since Bob can gain some insight about the group membership $A$ from the external label information.
Nevertheless, Theorem 3.4 and Theorem 3.5 both state that whether Bob uses a binary discriminator or a Wasserstein discriminator to predict $A$, the best Alice can do is to learn a transformation $g$ such that the transformed representation $Z$ is insensitive to the value of $A$ conditioned on any value of $Y$.
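To illustrate the alignment principle end to end, here is a deliberately minimal sketch (our own toy construction, not the authors' implementation) of training on objective (3) with two simplifications: the inner maximization over the witness $f$ is replaced by the closed-form empirical one-dimensional $W_1$ between the two groups' features, and we align the marginal feature distributions rather than the joint $(Z, Y)$ distributions. The second input coordinate is a pure group indicator that is useless for predicting $Y$; the alignment penalty drives its weight toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Toy data (our construction): feature 0 predicts Y identically in both
# groups; feature 1 is a pure group indicator, useless for predicting Y.
x0 = np.column_stack([rng.normal(size=n), -np.ones(n)])   # group A = 0
x1 = np.column_stack([rng.normal(size=n), +np.ones(n)])   # group A = 1
y0, y1 = x0[:, 0], x1[:, 0]

def w1(a, b):
    # Closed-form empirical 1-D Wasserstein distance (equal sample sizes).
    return np.abs(np.sort(a) - np.sort(b)).mean()

def objective(theta, lam=2.0):
    w, v = theta[:2], theta[2]        # encoder z = w.x, regressor yhat = v*z
    z0, z1 = x0 @ w, x1 @ w
    mse = 0.5 * (np.mean((y0 - v * z0) ** 2) + np.mean((y1 - v * z1) ** 2))
    return mse + lam * w1(z0, z1)     # MSE + lambda * alignment penalty

def grad(theta, eps=1e-5):
    # Central finite differences: slower than backprop, but hard to get wrong.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (objective(theta + e) - objective(theta - e)) / (2 * eps)
    return g

theta = np.array([1.0, 1.0, 1.0])     # initially the group feature is used
gap_before = w1(x0 @ theta[:2], x1 @ theta[:2])
for _ in range(500):
    theta -= 0.02 * grad(theta)
gap_after = w1(x0 @ theta[:2], x1 @ theta[:2])
print(gap_before > gap_after, abs(theta[1]) < 0.3)
```

The same loop with minibatches, neural networks for $g$ and $h$, and a learned Lipschitz-constrained witness over $(Z, Y)$ recovers the spirit of WASSERSTEINNET; the analogous variant with a logistic discriminator and the cross-entropy objective corresponds to CENET.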

4. EXPERIMENTS

Inspired by our theoretical results that decompose accuracy disparity into the distance between label populations and the distance between conditional representations, we propose two algorithms to mitigate it. In this section, we conduct experiments to evaluate the effectiveness of our proposed algorithms in reducing the accuracy disparity.

Datasets

We conduct experiments on four real-world benchmark datasets: the Adult dataset (Dua & Graff, 2017), the COMPAS dataset (Dieterich et al., 2016), the Law School dataset (Wightman & Ramsey, 1998), and the Communities and Crime dataset (Dua & Graff, 2017). All datasets contain binary sensitive attributes (e.g., male/female, white/non-white). We refer readers to Appendix B for detailed descriptions of the datasets and the data pre-processing pipelines. Methods We term our two proposed algorithms CENET and WASSERSTEINNET, respectively. For each dataset, we perform controlled experiments by fixing the regression neural network architecture to be the same, and we train the regression nets with the mean squared loss. Note that although the Adult and COMPAS datasets are binary classification tasks, we can still treat them as regression tasks with two distinct ordinal values. To the best of our knowledge, no previous study aims to minimize accuracy disparity in regression using representation learning. However, there are other related fairness notions and mitigation techniques proposed for regression, which we add as baselines: (1) Bounded group loss (BGL) (Agarwal et al., 2019), which asks that the prediction error of every group remain below a pre-defined level; (2) Coefficient of determination (COD) (Komiyama et al., 2018), which asks that the coefficient of determination between the sensitive attributes and the predictions remain below a pre-defined level. Figure 2: Overall results: $R^2$ regression scores and error gaps on the different datasets. Our goal is to achieve high $R^2$ scores with small error-gap values (i.e., points located in the upper-left corner). For all methods, we vary the trade-off parameter ($\lambda$ in CENET and WASSERSTEINNET, and the pre-defined level in BGL and COD) and report the corresponding $R^2$ scores and error-gap values. For each experiment, we average the results over ten random seeds.
We refer readers to Appendix B for the detailed parameter and hyper-parameter settings in our experiments. We defer additional experimental results and analyses of how the trade-off parameters affect the performance of the different algorithms to Appendix C.
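For reference, the two quantities plotted in Figure 2 can be computed as follows (a generic sketch with our own variable names and toy numbers, not the paper's released evaluation code):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination R^2 = 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def error_gap(y_true, y_pred, group):
    # Definition 2.2: |Err_{D_0}(h) - Err_{D_1}(h)| with squared error.
    g = np.asarray(group)
    err0 = np.mean((y_true[g == 0] - y_pred[g == 0]) ** 2)
    err1 = np.mean((y_true[g == 1] - y_pred[g == 1]) ** 2)
    return abs(err0 - err1)

# Tiny worked example (hypothetical numbers).
y    = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.0, 2.0, 3.0, 3.0])   # off by 1 on the last point
a    = np.array([0,   0,   1,   1])
print(round(r2_score(y, yhat), 3))      # 0.8
print(error_gap(y, yhat, a))            # 0.5
```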

4.2. RESULTS AND ANALYSES

The overall results are visualized in Figure 2. The following summarizes our observations and analyses: (1) Overall, trade-offs exist between the predictive power of the regressors and accuracy parity: for each method tested, the general trend is that as the error gaps decrease, the $R^2$ values also decrease. The exception is CENET on the Adult and Crime datasets, since training CENET is unstable when $\lambda$ is large; we provide more details in Appendix C. (2) Our proposed methods WASSERSTEINNET and CENET are effective in reducing the error gaps while keeping the $R^2$ scores relatively high on the Adult, COMPAS, and Crime datasets. On the Law dataset, our methods decrease the error gaps at a high cost in utility. (3) Between our proposed methods, WASSERSTEINNET achieves better accuracy/accuracy-disparity trade-offs, while CENET suffers significant accuracy loss and may fail to decrease the error gaps on the Adult and Crime datasets. The reason is that the minimax optimization in the training of CENET can lead to an unstable training process in the presence of a noisy approximation to the optimal discriminator (Arjovsky & Bottou, 2017; Arjovsky et al., 2017). (4) Compared to our proposed methods, BGL and COD can also decrease the error gaps to a certain extent. This is because (i) BGL aims to keep the error of each group relatively low, which helps reduce accuracy disparity, and (ii) COD aims to reduce the correlation between the sensitive attributes and the predictions (or the inputs) in the feature space, which may also reduce the dependence between the distributions of these two variables. In comparison, our proposed methods do better at mitigating the error gaps.

5. RELATED WORK

Algorithmic Fairness In the literature, two main notions of fairness, i.e., group fairness and individual fairness, have been widely studied (Dwork et al., 2012; Zemel et al., 2013; Feldman et al., 2015; Hardt et al., 2016; Zafar et al., 2017b; Madras et al., 2019; Khani & Liang, 2019). In particular, Chen et al. (2018) analyzed the impact of data collection on discrimination (e.g., false positive rate, false negative rate, and zero-one loss) from the perspective of a bias-variance-noise decomposition, and they suggested collecting more training examples and additional predictive variables to reduce discrimination. In comparison, our work precisely characterizes the disparity in predictive accuracy in terms of the distance between label populations and the distance between conditional representations, and proposes algorithms to reduce accuracy disparity across groups in regression. Fair Regression A series of works focus on fairness in regression problems (Calders et al., 2013; Johnson et al., 2016; Berk et al., 2018; Komiyama et al., 2018; Chzhen et al., 2020; Bigot, 2020; Zink & Rose, 2020; Mary et al., 2019; Narasimhan et al., 2020). To the best of our knowledge, no previous study has aimed to minimize accuracy disparity in regression through representation learning. However, other related fairness notions have been proposed for regression: Agarwal et al. (2019) proposed fair regression with bounded group loss (i.e., requiring the prediction error of every protected group to remain below some pre-defined level) and used an exponentiated-gradient approach to satisfy BGL; Komiyama et al. (2018) aimed to reduce the coefficient of determination between the sensitive attributes and the predictions to some pre-defined level and used an off-the-shelf convex optimizer to solve the problem. In contrast, we trace the root of accuracy disparity through the lens of information theory and reduce it via distribution alignment in a minimax game.
Fair Representation A line of work focuses on building fair decision-making systems using adversarial techniques to learn fair representations (Edwards & Storkey, 2015; Beutel et al., 2017; Adel et al., 2019; Zhao et al., 2019). The main idea is to learn a representation of the data with which the data owner can maximize accuracy while removing information related to the sensitive attribute. Madras et al. (2018) proposed a generalized framework to learn adversarially fair and transferable representations and suggested using the label information in the adversary to learn equalized-odds or equal-opportunity representations in the classification setting. Apart from adversarial representations, recent work has also proposed using distance metrics, e.g., the maximum mean discrepancy (Louizos et al., 2015) and the Wasserstein distance (Jiang et al., 2019), to remove group-related information. Compared to these works, we propose to align (conditional) distributions across groups to reduce accuracy disparity using minimax optimization, and we analyze the game-theoretic optima of the minimax game in the regression setting.

6. CONCLUSION

In this paper, we theoretically and empirically study accuracy disparity in regression problems. Specifically, we prove an information-theoretic lower bound on the joint error and a complementary upper bound on the error gap across groups to depict the feasible region of group-wise errors. Our theoretical results indicate that accuracy disparity inevitably occurs when the label distributions differ across groups. To reduce such disparity, we further propose to achieve accuracy parity by learning conditionally group-invariant representations using statistical distances. The game-theoretic optima of the objective functions in our proposed methods are achieved when the accuracy disparity is minimized. Our empirical results on four real-world datasets demonstrate that the proposed algorithms effectively reduce accuracy disparity. We believe our results take an important step towards a better understanding of accuracy disparity in machine learning models.

A. OMITTED PROOFS

Lemma A.2. If Assumption 2.1 holds, then the following inequality holds:
$$\left|\mathbb{E}_{\mathcal{D}_0}[(h(X) - \mathbb{E}_{\mathcal{D}_0}[Y \mid X])^2] - \mathbb{E}_{\mathcal{D}_0}[(h(X) - \mathbb{E}_{\mathcal{D}_1}[Y \mid X])^2]\right| \le 4M \,\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|].$$
Proof.
$$\begin{aligned}
&\left|\mathbb{E}_{\mathcal{D}_0}[(h(X) - \mathbb{E}_{\mathcal{D}_0}[Y \mid X])^2] - \mathbb{E}_{\mathcal{D}_0}[(h(X) - \mathbb{E}_{\mathcal{D}_1}[Y \mid X])^2]\right| \\
&= \left|\mathbb{E}_{\mathcal{D}_0}[h^2(X) - 2h(X)\mathbb{E}_{\mathcal{D}_0}[Y \mid X] + \mathbb{E}_{\mathcal{D}_0}^2[Y \mid X] - h^2(X) + 2h(X)\mathbb{E}_{\mathcal{D}_1}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}^2[Y \mid X]]\right| \\
&\le 2M \,\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|] + 2M \,\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|] \quad \text{(Assumption 2.1)} \\
&= 4M \,\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|].
\end{aligned}$$
Theorem 3.2. For any hypothesis $h \in \mathcal{H}$, if Assumption 2.1 holds, then:
$$\Delta_{\mathrm{Err}}(h) \le 8M^2 d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \left|\mathbb{E}_{\mathcal{D}_0}[\mathrm{Var}_{\mathcal{D}_0}[Y \mid X]] - \mathbb{E}_{\mathcal{D}_1}[\mathrm{Var}_{\mathcal{D}_1}[Y \mid X]]\right| + 4M \min\{\mathbb{E}_{\mathcal{D}_0}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|], \mathbb{E}_{\mathcal{D}_1}[|\mathbb{E}_{\mathcal{D}_0}[Y \mid X] - \mathbb{E}_{\mathcal{D}_1}[Y \mid X]|]\}.$$
Proof. First, we show that for $a \in \{0, 1\}$,
$$\begin{aligned}
\mathrm{Err}_{\mathcal{D}_a}(h) &= \mathbb{E}_{\mathcal{D}_a}[(h(X) - Y)^2] = \mathbb{E}_{\mathcal{D}_a}[(h(X) - \mathbb{E}_{\mathcal{D}_a}[Y \mid X] + \mathbb{E}_{\mathcal{D}_a}[Y \mid X] - Y)^2] \\
&= \mathbb{E}_{\mathcal{D}_a}[(h(X) - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])^2] + \mathbb{E}_{\mathcal{D}_a}[(Y - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])^2] - 2\,\mathbb{E}_{\mathcal{D}_a}[(h(X) - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])(Y - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])] \\
&= \mathbb{E}_{\mathcal{D}_a}[(h(X) - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])^2] + \mathbb{E}_{\mathcal{D}_a}[(Y - \mathbb{E}_{\mathcal{D}_a}[Y \mid X])^2].
\end{aligned}$$
Note that the last equality holds since
$$\begin{aligned}
\mathbb{E}_{D_a}[(h(X)-\mathbb{E}_{D_a}[Y|X])(Y-\mathbb{E}_{D_a}[Y|X])]&=\mathbb{E}_{D_a(X)}\big[\mathbb{E}_{D_a(Y|X)}[(h(X)-\mathbb{E}_{D_a}[Y|X])(Y-\mathbb{E}_{D_a}[Y|X])\mid X]\big]\\
&=\mathbb{E}_{D_a(X)}\big[(h(X)-\mathbb{E}_{D_a}[Y|X])\,\mathbb{E}_{D_a(Y|X)}[Y-\mathbb{E}_{D_a}[Y|X]\mid X]\big]\\
&=\mathbb{E}_{D_a(X)}\big[(h(X)-\mathbb{E}_{D_a}[Y|X])(\mathbb{E}_{D_a}[Y|X]-\mathbb{E}_{D_a}[Y|X])\big]=0.
\end{aligned}$$
Next we bound the error gap:
$$\begin{aligned}
|\mathrm{Err}_{D_0}(h)-\mathrm{Err}_{D_1}(h)|&=\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]+\mathbb{E}_{D_0}[(Y-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(Y-\mathbb{E}_{D_1}[Y|X])^2]\big|\\
&\le\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big| && \text{(Triangle inequality)}\\
&\quad+\big|\mathbb{E}_{D_0}[\mathrm{Var}_{D_0}[Y|X]]-\mathbb{E}_{D_1}[\mathrm{Var}_{D_1}[Y|X]]\big|.
\end{aligned}$$
Now it suffices to bound:
$$\begin{aligned}
&\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\\
&=\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]+\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\\
&\le\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big| && \text{(Triangle inequality)}\\
&\quad+\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|.
\end{aligned}$$
Invoking Lemma A.2 and Lemma A.1 to bound the two terms, respectively, we obtain:
$$\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\le 4M\,\mathbb{E}_{D_0}\big[|\mathbb{E}_{D_0}[Y|X]-\mathbb{E}_{D_1}[Y|X]|\big]+8M^2 d_{\mathrm{TV}}(D_0(X),D_1(X)).$$
By symmetry (swapping the roles of D_0 and D_1 in the decomposition above), we also have:
$$\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\le 4M\,\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0}[Y|X]-\mathbb{E}_{D_1}[Y|X]|\big]+8M^2 d_{\mathrm{TV}}(D_0(X),D_1(X)).$$
Combining the two inequalities above, we have:
$$\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_0}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\le 8M^2 d_{\mathrm{TV}}(D_0(X),D_1(X))+4M\min\big\{\mathbb{E}_{D_0}[|\mathbb{E}_{D_0}[Y|X]-\mathbb{E}_{D_1}[Y|X]|],\;\mathbb{E}_{D_1}[|\mathbb{E}_{D_0}[Y|X]-\mathbb{E}_{D_1}[Y|X]|]\big\}.$$
Incorporating the two variance terms back into the bound on the error gap then completes the proof. Theorem 3.3.
If Assumption 2.1 holds, then for all h ∈ H, letting Ŷ = h(X), the following inequality holds:
$$\Delta_{\mathrm{Err}}(h)\le 8M^2 d_{\mathrm{TV}}(D_0(Y),D_1(Y))+3M\min\big\{\mathbb{E}_{D_0}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big],\;\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big]\big\}.$$

Proof. First, we show that for a ∈ {0, 1}:
$$\mathrm{Err}_{D_a}(h)=\mathbb{E}_{D_a}[(h(X)-Y)^2]=\mathbb{E}_{D_a}[h^2(X)-2Yh(X)+Y^2]=\mathbb{E}_{D_a}[h^2(X)-2Yh(X)]+\mathbb{E}_{D_a}[Y^2].$$
Next, we bound the error gap:
$$\begin{aligned}
|\mathrm{Err}_{D_0}(h)-\mathrm{Err}_{D_1}(h)|&=\big|\mathbb{E}_{D_0}[h^2(X)-2Yh(X)]+\mathbb{E}_{D_0}[Y^2]-\mathbb{E}_{D_1}[h^2(X)-2Yh(X)]-\mathbb{E}_{D_1}[Y^2]\big|\\
&\le\big|\mathbb{E}_{D_0}[h^2(X)-2Yh(X)]-\mathbb{E}_{D_1}[h^2(X)-2Yh(X)]\big|+\big|\mathbb{E}_{D_0}[Y^2]-\mathbb{E}_{D_1}[Y^2]\big|. && \text{(Triangle inequality)}
\end{aligned}$$
For the second term, we can easily show that
$$\big|\mathbb{E}_{D_0}[Y^2]-\mathbb{E}_{D_1}[Y^2]\big|=\Big|\int Y^2\,(\mathrm{d}D_0-\mathrm{d}D_1)\Big|\le\|Y^2\|_\infty\,\|\mathrm{d}D_0-\mathrm{d}D_1\|_1\le 2M^2 d_{\mathrm{TV}}(D_0(Y),D_1(Y)),$$
where the first inequality follows from Hölder's inequality and the last from the definition of the total variation distance. Now it suffices to bound the remaining term:
$$\begin{aligned}
&\big|\mathbb{E}_{D_0}[h^2(X)-2Yh(X)]-\mathbb{E}_{D_1}[h^2(X)-2Yh(X)]\big|\\
&=\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x,y)-\int h(x)(h(x)-2y)\,\mathrm{d}\mu_1(x,y)\Big|\\
&\le\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_0(y)-\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_1(y)\Big| && \text{(Triangle inequality)}\\
&\quad+\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_1(x|y)\,\mathrm{d}\mu_1(y)-\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_1(y)\Big|.
\end{aligned}$$
We upper bound the first term:
$$\begin{aligned}
&\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_0(y)-\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_1(y)\Big|\\
&\le\int\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\Big|\,|\mathrm{d}\mu_0(y)-\mathrm{d}\mu_1(y)|\\
&\le\int\sup_x|h(x)|\int|h(x)-2y|\,\mathrm{d}\mu_0(x|y)\,|\mathrm{d}\mu_0(y)-\mathrm{d}\mu_1(y)|\\
&\le M\int\mathbb{E}_{D_0}[|h(X)-2Y|\mid Y=y]\,|\mathrm{d}\mu_0(y)-\mathrm{d}\mu_1(y)| && \text{(Assumption 2.1)}\\
&\le 3M^2\int|\mathrm{d}\mu_0(y)-\mathrm{d}\mu_1(y)| && \text{(Assumption 2.1)}\\
&\le 6M^2 d_{\mathrm{TV}}(D_0(Y),D_1(Y)),
\end{aligned}$$
where the last inequality follows from the definition of the total variation distance. For the second term, we have:
$$\begin{aligned}
&\Big|\int h(x)(h(x)-2y)\,\mathrm{d}\mu_1(x|y)\,\mathrm{d}\mu_1(y)-\int h(x)(h(x)-2y)\,\mathrm{d}\mu_0(x|y)\,\mathrm{d}\mu_1(y)\Big|\\
&\le\Big|\int h^2(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\,\mathrm{d}\mu_1(y)\Big|+\Big|\int 2y\,h(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\,\mathrm{d}\mu_1(y)\Big| && \text{(Triangle inequality)}\\
&\le 3M\,\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big].
\end{aligned}$$
To prove the last inequality, we first see that:
$$\begin{aligned}
\Big|\int h^2(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\,\mathrm{d}\mu_1(y)\Big|&\le\sup_x|h(x)|\int\Big|\int h(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\Big|\,\mathrm{d}\mu_1(y)\\
&\le M\int\big|\mathbb{E}_{D_0}[h(X)\mid Y=y]-\mathbb{E}_{D_1}[h(X)\mid Y=y]\big|\,\mathrm{d}\mu_1(y) && \text{(Assumption 2.1)}\\
&=M\,\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big].
\end{aligned}$$
Similarly, we also have:
$$\begin{aligned}
\Big|\int 2y\,h(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\,\mathrm{d}\mu_1(y)\Big|&\le 2\sup_y|y|\int\Big|\int h(x)\,(\mathrm{d}\mu_1(x|y)-\mathrm{d}\mu_0(x|y))\Big|\,\mathrm{d}\mu_1(y)\\
&\le 2M\int\big|\mathbb{E}_{D_0}[h(X)\mid Y=y]-\mathbb{E}_{D_1}[h(X)\mid Y=y]\big|\,\mathrm{d}\mu_1(y) && \text{(Assumption 2.1)}\\
&=2M\,\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big].
\end{aligned}$$
By symmetry, we can also see that:
$$\big|\mathbb{E}_{D_0}[h^2(X)-2Yh(X)]-\mathbb{E}_{D_1}[h^2(X)-2Yh(X)]\big|\le 6M^2 d_{\mathrm{TV}}(D_0(Y),D_1(Y))+3M\,\mathbb{E}_{D_0}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big].$$
Combining the above two inequalities yields:
$$\big|\mathbb{E}_{D_0}[h^2(X)-2Yh(X)]-\mathbb{E}_{D_1}[h^2(X)-2Yh(X)]\big|\le 6M^2 d_{\mathrm{TV}}(D_0(Y),D_1(Y))+3M\min\big\{\mathbb{E}_{D_0}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big],\;\mathbb{E}_{D_1}\big[|\mathbb{E}_{D_0^y}[\hat{Y}]-\mathbb{E}_{D_1^y}[\hat{Y}]|\big]\big\}.$$
Incorporating the terms back into the upper bound on the error gap then completes the proof.

Theorem 3.4. Consider the minimax game in (1). The equilibrium (g*, f*) of the game is attained when 1) Z = g*(X) is independent of A conditioned on Y; and 2) f*(Z, Y) = D(A = 1 | Y, Z).

Proof. To prove Theorem 3.4, we first give Proposition A.1.

Proposition A.1. For any feature map g : X → Z, assume that F contains all the randomized binary classifiers f : Z × Y → A. Then
$$\min_{f\in\mathcal{F}}\mathrm{CE}_D(A\,\|\,f(g(X),Y))=H(A\mid Z,Y).$$

Proof. By the definition of the cross-entropy loss, we have:
$$\begin{aligned}
\mathrm{CE}_D(A\,\|\,f)&=-\mathbb{E}_D[\mathbb{I}(A=0)\log(1-f(g(X),Y))+\mathbb{I}(A=1)\log f(g(X),Y)]\\
&=-\mathbb{E}_{g_\sharp D}[\mathbb{I}(A=0)\log(1-f(Z,Y))+\mathbb{I}(A=1)\log f(Z,Y)]\\
&=-\mathbb{E}_{Z,Y}\,\mathbb{E}_{A|Z,Y}[\mathbb{I}(A=0)\log(1-f(Z,Y))+\mathbb{I}(A=1)\log f(Z,Y)]\\
&=-\mathbb{E}_{Z,Y}[D(A=0\mid Z,Y)\log(1-f(Z,Y))+D(A=1\mid Z,Y)\log f(Z,Y)]\\
&=\mathbb{E}_{Z,Y}[D_{\mathrm{KL}}(D(A\mid Z,Y)\,\|\,f(Z,Y))]+H(A\mid Z,Y)\ge H(A\mid Z,Y),
\end{aligned}$$
where D_KL(·‖·) denotes the KL divergence between two distributions.
From the above inequality, it is also clear that the minimum of the cross-entropy loss is achieved when f(Z, Y) equals the conditional probability D(A = 1 | Z, Y), i.e., f*(Z, Y) = D(A = 1 | Z = g(X), Y).

Proposition A.1 states that the minimum cross-entropy loss the discriminator can achieve is H(A | Z, Y), attained when f is the conditional distribution D(A = 1 | Z = g(X), Y). By the basic properties of conditional entropy, we have:
$$\min_{f\in\mathcal{F}}\mathrm{CE}_D(A\,\|\,f(g(X),Y))=H(A\mid Z,Y)=H(A\mid Y)-I(A;Z\mid Y).$$
Note that H(A | Y) is a constant given the distribution D, so maximizing over g is equivalent to solving min_{Z=g(X)} I(A; Z | Y), and it follows that the optimal strategy for the transformation g is the one that induces conditionally invariant features, i.e., I(A; Z | Y) = 0. On the other hand, if g* plays optimally, then the optimal response of the discriminator f is given by f*(Z, Y) = D(A = 1 | Z = g*(X), Y) = D(A = 1 | Y).

Theorem 3.5. Let g* := argmin_g W_1(D_0(g(X), Y), D_1(g(X), Y)); then D_0^Y(Z = g*(X)) = D_1^Y(Z = g*(X)) almost surely.

Proof. By the definition of the Wasserstein distance, we have:
$$\begin{aligned}
W_1(D_0(Z,Y),D_1(Z,Y))&=\inf_{\gamma\in\Gamma(D_0,D_1)}\int d((z_0,y_0),(z_1,y_1))\,\mathrm{d}\gamma((z_0,y_0),(z_1,y_1))\\
&=\inf_{\gamma\in\Gamma(D_0,D_1)}\iint d((z_0,y_0),(z_1,y_1))\,\mathrm{d}\gamma(z_0,z_1\mid y_0,y_1)\,\mathrm{d}\gamma(y_0,y_1)\\
&=\inf_{\gamma\in\Gamma(D_0,D_1)}\iint\big(\|z_0-z_1\|_1+|y_0-y_1|\big)\,\mathrm{d}\gamma(z_0,z_1\mid y_0,y_1)\,\mathrm{d}\gamma(y_0,y_1)\\
&\ge\inf_{\gamma\in\Gamma(D_0,D_1)}\int|y_0-y_1|\,\mathrm{d}\gamma(y_0,y_1)\\
&=\inf_{\gamma\in\Gamma(D_0(Y),D_1(Y))}\int|y_0-y_1|\,\mathrm{d}\gamma(y_0,y_1)=W_1(D_0(Y),D_1(Y)).
\end{aligned}$$
To finish the proof, we next show that the lower bound is achieved when D_0^Y(Z = g*(X)) = D_1^Y(Z = g*(X)): it is easy to see that $W_1(D_0^Y(Z),D_1^Y(Z))=\int\|z_0-z_1\|_1\,\mathrm{d}\gamma(z_0,z_1\mid y_0,y_1)=0$ when the conditional distributions are equal. In this case, when the Wasserstein distance is minimized, Z is conditionally independent of A given Y.
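Proposition A.1 can be sanity-checked numerically on a small discrete example. The following sketch (not part of the paper's method; the joint distribution below is an arbitrary illustrative choice) verifies that the expected cross-entropy of any discriminator is lower-bounded by the conditional entropy H(A | Z, Y), attained at f*(z, y) = D(A = 1 | z, y):

```python
import numpy as np

# Random discrete joint distribution over (Z, Y, A), each binary.
rng = np.random.default_rng(1)
pmf = rng.random((2, 2, 2))      # axes: z, y, a
pmf /= pmf.sum()

p_zy = pmf.sum(axis=2)           # marginal D(z, y)
p1 = pmf[:, :, 1] / p_zy         # conditional D(A = 1 | z, y)

def ce(f):
    """Expected cross-entropy E[-log f(A | Z, Y)] under the joint pmf."""
    return -(pmf[:, :, 1] * np.log(f) + pmf[:, :, 0] * np.log(1 - f)).sum()

h_cond = ce(p1)                  # equals the conditional entropy H(A | Z, Y)
for _ in range(100):             # random alternative discriminators
    f = rng.uniform(0.01, 0.99, size=(2, 2))
    assert ce(f) >= h_cond - 1e-12
```

The minimum is attained exactly at `p1` because the excess loss is the (nonnegative) expected KL divergence, mirroring the last step of the proof.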

B EXPERIMENTAL DETAILS

Adult The Adult dataset contains 48,842 examples for income prediction. The task is to predict whether the annual income of an individual is greater or less than 50K/year based on the attributes of the individual, such as education level, age, occupation, etc. In our experiment, we use gender (binary) as the sensitive attribute. The target variable (income) is an ordinal binary variable: 0 if income < 50K/year and 1 otherwise. After data pre-processing, the dataset contains 30,162/15,060 training/test instances, where the input dimension of each instance is 113. We show the data distributions for different demographic subgroups in Table 1. To preprocess the dataset, we first filter out the data records that contain missing values. We then remove the sensitive attribute from the input features and normalize the input features with their means and standard deviations. Note that we use one-hot encoding for the categorical attributes. For our proposed methods, we use a three-layer neural network with ReLU as the activation function of the hidden layers and the sigmoid function as the output function for the prediction task (we take the first two layers as the feature mapping). The number of neurons in the hidden layers is 60. We train the neural networks with the ADADELTA algorithm with a learning rate of 0.1 and a batch size of 512. The models are trained for 50 epochs. For the adversary networks in CENET and WASSERSTEINNET, we use a two-layer neural network with ReLU as the activation function. The number of neurons in the hidden layers of the adversary networks is 60. The adversary network in CENET also uses the sigmoid function as its output function. The weight clipping norm in the adversary network of WASSERSTEINNET is 0.005. We use the gradient reversal layer (Ganin et al., 2016) to implement the gradient descent-ascent (GDA) algorithm for the optimization of the minimax problem, since it makes the training process more stable (Daskalakis & Panageas, 2018).
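The gradient reversal layer trains the predictor and the adversary by simultaneous gradient descent-ascent. The following toy sketch (the quadratic objective, step size, and initial point are illustrative choices, not the paper's actual loss) shows the scheme converging to the saddle point of a simple minimax problem:

```python
import numpy as np

# Simultaneous gradient descent-ascent (GDA) on the toy minimax objective
#   L(theta, phi) = (theta - 1)^2 + lam * theta * phi - phi^2,
# which is convex in theta and strongly concave in phi. The predictor
# parameter theta descends on L while the adversary parameter phi ascends,
# exactly the update pattern a gradient reversal layer induces.
lam, lr = 1.0, 0.05
theta, phi = 2.0, 1.0
for _ in range(2000):
    g_theta = 2 * (theta - 1) + lam * phi   # dL/dtheta
    g_phi = lam * theta - 2 * phi           # dL/dphi
    theta -= lr * g_theta                   # descent step (predictor)
    phi += lr * g_phi                       # ascent step (adversary)

# The unique saddle point solves 2(theta - 1) + phi = 0 and phi = theta / 2,
# i.e., (theta*, phi*) = (0.8, 0.4) for lam = 1.
```

In the actual networks, the same effect is obtained by inserting the reversal layer between the feature extractor and the adversary, so a single backward pass produces the two opposing updates.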
For the rest of the datasets used in our experiments, we also use the gradient reversal layer to implement our algorithms. We use the Fairlearn toolkit (Bird et al., 2020) to implement BGL: we use the exponentiated-gradient algorithm with the default settings as the mitigator and vary the upper bound ε ∈ {0.07, 0.1, 0.2, 0.5} of the bounded group loss constraint. For each value of ε, we run ten random seeds and compute the means and standard deviations.

COMPAS The COMPAS dataset contains 6,172 instances for predicting whether a criminal defendant will recidivate within two years. It contains attributes such as age, race, etc. In our experiment, we use race (white or non-white) as the sensitive attribute and recidivism as the target variable. We split the dataset into training and test sets with a ratio of 7/3. We show the data distributions for different demographic subgroups in Table 2. For all methods, we use a two-layer neural network with ReLU as the activation function of the hidden layers and the sigmoid function as the output function for the prediction task (we take the first layer as the feature mapping). The number of neurons in the hidden layers is 60. We train the neural networks with the ADADELTA algorithm with a learning rate of 1.0 and a batch size of 512. The models are trained for 50 epochs. For the adversary networks in CENET and WASSERSTEINNET, we use a two-layer neural network with ReLU as the activation function. The number of neurons in the hidden layers of the adversary networks is 10. The adversary network in CENET also uses the sigmoid function as its output function. The weight clipping norm in the adversary network of WASSERSTEINNET is 0.05. We use the Fairlearn toolkit to implement BGL: we use the exponentiated-gradient algorithm with the default settings as the mitigator and vary the upper bound ε ∈ {0.1, 0.2, 0.3, 0.5} of the bounded group loss constraint. For each value of ε, we run ten random seeds and compute the means and standard deviations.
As for COD, we follow the source implementation. We use the same hyper-parameter settings as Komiyama et al. (2018): we use the kernelized optimization with random Fourier features and the RBF kernel (we vary the hyper-parameter γ of the RBF kernel over {0.1, 1.0, 10, 100}) and report the best results with minimal MSE loss each time we change the fairness budget ε. We also vary ε ∈ {0.01, 0.1, 0.5, 1.0}, run ten random seeds, and compute the means and standard deviations.

Communities and Crime We show the data distributions for different demographic subgroups in Figure 3b. To preprocess the dataset, we first remove the non-predictive attributes and sensitive attributes from the input features. Note that all features in the dataset have already been normalized to [0, 1], so we do not perform additional normalization. We then replace the missing values with the mean values of the corresponding attributes. For all methods, we use a two-layer neural network with ReLU as the activation function of the hidden layers and the sigmoid function as the output function for the prediction task (we take the first layer as the feature mapping). The number of neurons in the hidden layers is 50. We train the neural networks with the ADADELTA algorithm with a learning rate of 0.1 and a batch size of 256. The models are trained for 100 epochs. For the adversary networks in CENET and WASSERSTEINNET, we use a two-layer neural network with ReLU as the activation function. The number of neurons in the hidden layers of the adversary networks is 100. The adversary network in CENET also uses the sigmoid function as its output function. The weight clipping norm in the adversary network of WASSERSTEINNET is 0.002. We use the Fairlearn toolkit to implement BGL: we use the exponentiated-gradient algorithm with the default settings as the mitigator and vary the upper bound ε ∈ {0.01, 0.02, 0.03, 0.05} of the bounded group loss constraint. For each value of ε, we run ten random seeds and compute the means and standard deviations.
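The mean-imputation step described above for the missing attribute values can be sketched as follows (the small matrix is an illustrative stand-in for the feature matrix):

```python
import numpy as np

# Replace each missing value with the mean of the corresponding attribute,
# computed over the observed (non-missing) entries of that column.
X = np.array([[0.1, np.nan, 0.5],
              [0.3, 0.4,    np.nan],
              [0.5, 0.6,    0.7]])
col_means = np.nanmean(X, axis=0)               # per-attribute means, NaNs ignored
X_imp = np.where(np.isnan(X), col_means, X)     # fill NaNs with the column mean
```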
As for COD, we follow the same hyper-parameter settings as Komiyama et al. (2018): we use the kernelized optimization with random Fourier features and the RBF kernel (we vary the hyper-parameter γ of the RBF kernel over {0.1, 1.0, 10, 100}) and report the best results with minimal MSE loss each time we change the fairness budget ε. We also vary ε ∈ {0.01, 0.1, 0.5, 1.0}, run ten random seeds, and compute the means and standard deviations.

Law School The Law School dataset contains 1,823 records of law students who took the bar passage study for Law School Admission. The features in the dataset include variables such as undergraduate GPA, LSAT score, full-time status, family income, gender, etc. In our experiment, we use gender as the sensitive attribute and undergraduate GPA as the target variable. We split the dataset into training and test sets with a ratio of 8/2. We show the data distributions for different demographic subgroups in Figure 3a. For all methods, we use a two-layer neural network with ReLU as the activation function of the hidden layers and the sigmoid function as the output function for the prediction task (we take the first layer as the feature mapping). The number of neurons in the hidden layers is 10. We train the neural networks with the ADADELTA algorithm with a learning rate of 0.1 and a batch size of 256. The models are trained for 100 epochs. For the adversary networks in CENET and WASSERSTEINNET, we use a two-layer neural network with ReLU as the activation function. The number of neurons in the hidden layers of the adversary networks is 10. The adversary network in CENET also uses the sigmoid function as its output function. The weight clipping norm in the adversary network of WASSERSTEINNET is 0.2.
We use the Fairlearn toolkit to implement BGL: we use the exponentiated-gradient algorithm with the default settings as the mitigator and vary the upper bound ε ∈ {0.01, 0.02, 0.03, 0.05} of the bounded group loss constraint. For each value of ε, we run ten random seeds and compute the means and standard deviations. As for COD, we follow the same hyper-parameter settings as Komiyama et al. (2018): we use the kernelized optimization with random Fourier features and the RBF kernel (we vary the hyper-parameter γ of the RBF kernel over {0.1, 1.0, 10, 100}) and report the best results with minimal MSE loss each time we change the fairness budget ε. We also vary ε ∈ {0.01, 0.1, 0.5, 1.0}, run ten random seeds, and compute the means and standard deviations.

C ADDITIONAL EXPERIMENTAL RESULTS AND ANALYSES

In this section, we provide additional experimental results and analyses.

C.1 IMPACT OF FAIRNESS TRADE-OFF PARAMETERS

We present additional experimental results and analyses to gain more insight into how the fairness trade-off parameters (e.g., λ and ε) affect the predictive performance and accuracy disparity of each method. Table 3 shows the R² regression scores and error gaps as λ changes in CENET and WASSERSTEINNET. We see that the error gap gradually decreases as the trade-off parameter λ increases in most scenarios, with small accuracy loss (except for CENET on the Adult and Crime datasets when λ is large), which demonstrates the overall effectiveness of our proposed algorithms. In addition, increasing λ generally destabilizes the training process, yielding larger variances of both R² and the error gap. Compared with WASSERSTEINNET, CENET mitigates accuracy disparity better while achieving similar or better accuracy on the COMPAS and Law datasets. On the Adult and Crime datasets, when λ is small, CENET also reduces the error gap more than WASSERSTEINNET with similar accuracy loss. These results are consistent with the fact that minimizing the total variation distance between two continuous distributions also controls the Wasserstein distance (Gibbs & Su, 2002). However, when λ increases, WASSERSTEINNET achieves a better accuracy-disparity trade-off, while CENET suffers significant accuracy loss and may fail to decrease the error gap. This is not surprising, since estimating the total variation distance in minimax optimization can lead to an unstable training process (Arjovsky & Bottou, 2017; Arjovsky et al., 2017). Table 4 shows the R² regression scores and error gaps as ε changes in BGL. We see that as the trade-off parameter ε decreases, both R² and the error gaps decrease.
This is because when the upper bound ε of BGL is small, the accuracy disparity is also mitigated. When ε is above/below a certain threshold, the values of R² and the error gaps increase/decrease accordingly. It is also worth noting that the exponentiated-gradient approach for solving BGL does not introduce randomness during optimization.

Table 4: R² regression scores and error gaps when ε changes in BGL.

Adult   ε:      0.07            0.1             0.2             0.5
  R²            0.3508±0.0000   0.3508±0.0000   0.3696±0.0000   0.3696±0.0000
  ∆_Err         0.0612±0.0000   0.0612±0.0000   0.0726±0.0000   0.0726±0.0000
COMPAS  ε:      0.1             0.2             0.3             0.5
  R²            0.1478±0.0000   0.1478±0.0000   0.1507±0.0000   0.1507±0.0000
  ∆_Err         0.0072±0.0000   0.0072±0.0000   0.0086±0.0000   0.0086±0.0000
Crime   ε:      0.01            0.02            0.03            0.05
  R²            0.3922±0.0000   0.3922±0.0000   0.5380±0.0000   0.5380±0.0000
  ∆_Err         0.0189±0.0000   0.0189±0.0000   0.0238±0.0000   0.0238±0.0000
Law     ε:      0.01            0.02            0.03            0.05
  R²            0.1407±0.0000   0.1407±0.0000   0.1407±0.0000   0.1412±0.0000
  ∆_Err         0.0094±0.0000   0.0094±0.0000   0.0094±0.0000   0.0101±0.0000

Table 5: R² regression scores and error gaps when ε changes in COD.

COMPAS  ε:      0.01            0.1             0.5             1.0
  R²            0.1033±0.0111   0.1144±0.0100   0.1146±0.0099   0.1146±0.0099
  ∆_Err         0.0064±0.0042   0.0083±0.0058   0.0085±0.0060   0.0085±0.0060
Crime   ε:      0.01            0.1             0.5             1.0
  R²            0.1262±0.0000   0.3284±0.0000   0.3603±0.0000   0.3603±0.0000
  ∆_Err         0.0312±0.0000   0.0307±0.0000   0.0343±0.0000   0.0343±0.0000
Law     ε:      0.01            0.1             0.5             1.0
  R²            0.1262±0.0000   0.3284±0.0000   0.3606±0.0000   0.3603±0.0000
  ∆_Err         0.0312±0.0000   0.0307±0.0000   0.0343±0.0000   0.0343±0.0000

Table 5 shows the R² regression scores and error gaps as ε changes in COD. We see that as the trade-off parameter ε decreases, both R² and the error gaps decrease. It is worth noting that the QCQP optimization used to solve COD does not introduce randomness; the only randomness on the COMPAS dataset comes from the random Fourier features, which achieve the best prediction performance on that dataset.
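For concreteness, the two quantities reported throughout these tables, the R² regression score and the error gap ∆_Err, can be computed from test-set predictions as follows (the arrays below are illustrative placeholders for model outputs, not data from the paper):

```python
import numpy as np

# Toy test-set labels, predictions, and binary sensitive attribute.
y = np.array([0.2, 0.8, 0.5, 0.9, 0.1, 0.6])
y_hat = np.array([0.3, 0.7, 0.5, 0.8, 0.2, 0.4])
a = np.array([0, 0, 0, 1, 1, 1])

# R^2 score on the whole test set.
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Error gap: absolute difference of the group-wise MSEs, |Err_D0 - Err_D1|.
def mse(mask):
    return ((y[mask] - y_hat[mask]) ** 2).mean()

err_gap = abs(mse(a == 0) - mse(a == 1))
```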

C.2 VISUALIZATION OF TRAINING PROCESSES

We visualize the training processes of our proposed methods CENET and WASSERSTEINNET on the Adult and COMPAS datasets in Figure 4 and Figure 5, respectively. We also compare their training dynamics with a model that solely minimizes the MSE loss (i.e., λ = 0), which we term NO DEBIAS.



COD cannot be run on the Adult dataset: the dataset is large, and the QCQP optimization algorithm used to solve COD requires memory quadratic in the dataset size. The source implementation of COD is available at https://github.com/jkomiyama/fairregresion. We use the edited public version of the Law School dataset, which can be downloaded at: https://github.com/algowatchpenn/GerryFair/blob/master/dataset/lawschool.csv



Game-theoretic illustration of our algorithms.

Figure 1: The left figure illustrates how accuracy disparity arises by minimizing the global squared loss. The right figure gives a schematic illustration of the proposed algorithmic framework.

Figure 3: Data distributions for different demographic subgroups in two datasets.

Figure 4: Training visualization of CENET, WASSERSTEINNET (λ = 50) and NO DEBIAS (λ = 0) in the Adult dataset.

Data distribution of Y and A in the Adult dataset.

Data distribution of Y and A in the COMPAS dataset.

Communities and Crime The Communities and Crime dataset contains 1,994 examples of socioeconomic, law-enforcement, and crime data about communities in the United States. The task is to predict the number of violent crimes per 100K population. All attributes in the dataset have been curated and normalized to [0, 1]. In our experiment, we use race (binary) as the sensitive attribute: 1 if the population percentage of white residents is greater than or equal to 80%, and 0 otherwise. After data pre-processing, the dataset contains 1,595/399 training/test instances, where the input dimension of each instance is 96. We visualize the data distributions for different demographic subgroups in Figure 3b.

Table 3: R² regression scores and error gaps when λ changes in CENET and WASSERSTEINNET.

Table 4: R² regression scores and error gaps when ε changes in BGL.

APPENDIX

In the appendix, we give the proofs of the theorems and claims in our paper, the experimental details, and additional experimental results.

A MISSING PROOFS

Lemma 3.1. Let $\hat{Y} = h(X) \in \mathbb{R}$. Then for a ∈ {0, 1}, $W_1(D_a(Y), D_a(\hat{Y})) \le \sqrt{\mathrm{Err}_{D_a}(h)}$.

Proof. The prediction error conditioned on a ∈ {0, 1} satisfies
$$\mathrm{Err}_{D_a}(h)=\mathbb{E}_{D_a}[(h(X)-Y)^2]\ge\big(\mathbb{E}_{D_a}[|h(X)-Y|]\big)^2\ge W_1(D_a(\hat{Y}),D_a(Y))^2,$$
where the first inequality follows from Jensen's inequality and the second holds since the joint law of $(\hat{Y}, Y)$ under $D_a$ is a coupling of $D_a(\hat{Y})$ and $D_a(Y)$. Taking the square root on both sides then completes the proof.

Proof. Since $W_1(\cdot,\cdot)$ is a distance metric, the result follows immediately from the triangle inequality and Lemma 3.1. Rearranging the inequality above and applying the AM-GM inequality, then taking squares on both sides, completes the proof.

Proof. The joint error is

Lemma A.1. If Assumption 2.1 holds, then the following inequality holds:
$$\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\le 8M^2 d_{\mathrm{TV}}(D_0(X),D_1(X)).$$

Proof. First, we know that $|h(X)-\mathbb{E}_{D_1}[Y|X]|\le 2M$ by Assumption 2.1, hence
$$\big|\mathbb{E}_{D_0}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]-\mathbb{E}_{D_1}[(h(X)-\mathbb{E}_{D_1}[Y|X])^2]\big|\le 4M^2\,\|\mathrm{d}D_0(X)-\mathrm{d}D_1(X)\|_1=8M^2 d_{\mathrm{TV}}(D_0(X),D_1(X)).$$
Note that the last equality follows from the definition of the total variation distance.

In Figure 4 and Figure 5, we can see that as training progresses, the MSE losses on both datasets decrease and finally converge. However, the training dynamics of the error gaps are much more complex, even in the NO DEBIAS case. Before convergence, the dynamics of the error gaps differ across datasets. Our methods push the models to converge to points where the error gaps are smaller while preserving the models' predictive performance. It is also worth noting that the minimax optimization makes the training processes somewhat unstable.
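Lemma 3.1 can also be checked numerically on empirical distributions, using the sorted-samples formula for the one-dimensional $W_1$ distance between two equal-size empirical measures (the random samples and the stand-in predictor below are illustrative choices):

```python
import numpy as np

# For two empirical 1-D distributions with the same number of samples,
# W1 equals the mean absolute difference of the sorted samples. The check
# below verifies W1(D(Y), D(Y_hat)) <= sqrt(MSE), as in Lemma 3.1.
rng = np.random.default_rng(3)
for _ in range(100):
    y = rng.normal(size=50)
    y_hat = y + rng.normal(scale=0.3, size=50)   # a stand-in predictor
    w1 = np.abs(np.sort(y) - np.sort(y_hat)).mean()
    rmse = np.sqrt(((y - y_hat) ** 2).mean())
    assert w1 <= rmse + 1e-12
```

The inequality holds because the identity coupling of $(Y, \hat{Y})$ upper-bounds $W_1$ by the mean absolute residual, which Jensen's inequality in turn bounds by the root mean squared error.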

