ATTAINABILITY AND OPTIMALITY: THE EQUALIZED-ODDS FAIRNESS REVISITED

Abstract

Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions of fairness, as well as approaches to impose it, have been proposed. However, whether the chosen notion of fairness can always be attained in different scenarios, even with an unlimited amount of data, is not well addressed. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion and, furthermore, if attainable, the optimality of the prediction performance under various settings. In particular, for classification with a deterministic prediction function of the input, we give the condition under which Equalized Odds can hold true; if randomized prediction is acceptable, we show that under mild assumptions, fair classifiers can always be derived. Moreover, we prove that, compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and obtain better prediction performance while remaining fair. However, for regression tasks, Equalized Odds is not always attainable if certain conditions on the joint distribution of the features and the target variable are not met. This indicates the inherent difficulty of achieving fairness in certain cases and suggests that a broader class of prediction methods may be needed to achieve fairness.

1. INTRODUCTION

As machine learning models become widespread in automated decision-making systems, apart from the efficiency and accuracy of the prediction, their potential social consequences also gain increasing attention. To date, there is ample evidence that machine learning models have resulted in discrimination against certain groups of individuals under many circumstances, for instance, discrimination in ad delivery when searching for names that can be predictive of the race of an individual (Sweeney, 2013); gender discrimination in the delivery of job-related ads (Datta et al., 2015); stereotypes associated with gender in word embeddings (Bolukbasi et al., 2016); and bias against certain ethnicities in the assessment of recidivism risk (Angwin et al., 2016). The call for accountability and fairness in machine learning has motivated various (statistical) notions of fairness. The Demographic Parity criterion (Calders et al., 2009) requires independence between the prediction (e.g., of a classifier) and the protected feature (sensitive attributes of an individual, e.g., gender, race). Equalized Odds (Hardt et al., 2016), also known as Error-rate Balance (Chouldechova, 2017), requires that the output of a model be conditionally independent of the protected feature(s) given the ground truth of the target. Predictive Rate Parity (Zafar et al., 2017a), on the other hand, requires that the actual proportion of positives (negatives) in the original data among positive (negative) predictions match across groups (i.e., the predictor is well-calibrated across groups). On the theoretical side, results have been reported regarding relationships among fairness notions. It has been independently shown that if base rates of positives differ among groups, then Equalized Odds and Predictive Rate Parity cannot be achieved simultaneously by non-perfect predictors (Kleinberg et al., 2016; Chouldechova, 2017).
Any two of the three criteria Demographic Parity, Equalized Odds, and Predictive Rate Parity are incompatible with each other (Barocas et al., 2017). At the interface of privacy and fairness, the impossibility of achieving both Differential Privacy (Dwork et al., 2006) and Equal Opportunity (Hardt et al., 2016) while maintaining non-trivial accuracy has also been established (Cummings et al., 2019). In practice, one can broadly categorize computational procedures for deriving a fair predictor into three types: pre-processing approaches (Calders et al., 2009; Dwork et al., 2012; Zemel et al., 2013; Zhang et al., 2018; Madras et al., 2018; Creager et al., 2019; Zhao et al., 2020), in-processing approaches (Kamishima et al., 2011; Pérez-Suay et al., 2017; Zafar et al., 2017a;b; Donini et al., 2018; Song et al., 2019; Mary et al., 2019; Baharlouei et al., 2020), and post-processing approaches (Hardt et al., 2016; Fish et al., 2016; Dwork et al., 2018). In accordance with the fairness notion of interest, a pre-processing approach first maps the training data to a transformed space to remove discriminatory information between the protected feature and the target, and then passes the transformed data on for prediction. In direct contrast, a post-processing approach treats the off-the-shelf predictor(s) as uninterpretable black box(es) and imposes fairness by outputting a function of the original prediction. For in-processing approaches, various kinds of regularization terms have been proposed so that one can optimize the utility function while suppressing discrimination at the same time. Approaches based on estimating or bounding the causal effect of the protected feature on the final target have also been proposed (Kusner et al., 2017; Russell et al., 2017; Zhang et al., 2017; Nabi & Shpitser, 2018; Zhang & Bareinboim, 2018; Chiappa, 2019; Wu et al., 2019).
Focusing on the Equalized-Odds criterion, although various approaches have been proposed to impose the fairness requirement, whether or not it is always attainable is not well addressed. The attainability of Equalized Odds, namely, the existence of a predictor that scores zero violation of fairness in the large-sample limit, is an asymptotic property of the fairness criterion. This characterizes a completely different kind of violation of fairness compared to the empirical error bound on discrimination in finite-sample cases. If a "fair" predictor that is actually biased is deployed, the discrimination becomes a snake in the grass, hard to detect and eliminate. In fact, as we illustrate in this paper, Equalized Odds is not always attainable for regression and even classification tasks if we use deterministic prediction functions. This calls for alternative definitions in the same spirit as Equalized Odds that can always be achieved under various circumstances. Our contributions are mainly:
• For regression and classification tasks with deterministic prediction functions, we show that Equalized Odds is not always attainable if certain (rather restrictive) conditions on the joint distribution of the features and the target variable are not met.
• Under mild assumptions, for binary classification we show that if randomized prediction is taken into consideration, one can always derive a non-trivial Equalized-Odds classifier.
• Considering the optimality of performance under fairness constraint(s), when exploiting all available features, we show that the predictor derived via an in-processing approach always outperforms the one derived via a post-processing approach (unconstrained optimization followed by a post-processing step).

2. PRELIMINARIES

In this section, we first illustrate the difference between prediction fairness and procedure fairness, and then, we present the formal definition of Equalized Odds (Hardt et al., 2016) .

2.1. HIERARCHY OF FAIRNESS

Before presenting the formulation of fairness, it is important to see the distinction between different levels of fairness when discussing fair predictors. When evaluating the performance of a proposed fair predictor, it is common practice to compare the loss (with respect to the utility function of choice, e.g., accuracy for binary classification) computed between the target variable and the predicted value. There is an implicit assumption beneath this practice: the generating process of the data, which simply describes a real-world procedure, is not biased in any sense (Danks & London, 2017). Only when we treat the target variable (recorded in the dataset) as unbiased can we justify the practice of loss evaluation and the conditioning on the target variable when imposing fairness (as we shall see in the definition of Equalized Odds in Equation 1). Consider a music school admission example. The music school committee decides whether to admit a student to the violin performance program based on the applicant's personal information, educational background, instrumental performance, and so on. When evaluating whether or not the admission is "fair", there are actually two levels of fairness. First, based on the information at hand, did the committee evaluate the qualifications of applicants without bias (how the committee evaluates the applicants)? And second, is the committee's procedure for evaluating applicants' qualifications reasonable (how others view the evaluation procedure used by the committee)? In this paper, we consider prediction fairness, namely, assuming the recorded data is unbiased, the prediction (made with respect to current reality) itself should not include any biased utilization of information. Fairness with respect to the data generating procedure, as well as the potential future influence of the prediction, is beyond the scope of this paper.

2.2. EQUALIZED-ODDS FAIRNESS

A predictor Ŷ satisfies Equalized Odds if it is conditionally independent of the protected feature A given the true value of the target Y:

Ŷ ⊥ A | Y. (1)

For classification tasks, one can conveniently use the probability distribution form:

∀a ∈ A, t, y ∈ Y : P(Ŷ = t | A = a, Y = y) = P(Ŷ = t | Y = y), (2)

or more concisely,

P_{Ŷ|AY}(t | a, y) = P_{Ŷ|Y}(t | y). (3)

For better readability, we also use the formulation in Equation 3 in cases without ambiguity. In the context of binary classification (Y = {0, 1}), Equalized Odds requires that the True Positive Rate (TPR) and False Positive Rate (FPR) match across groups. Throughout the paper, without loss of generality, we assume there is only one protected feature for the purpose of simplifying notation. However, considering the fact that the protected feature can be discrete (e.g., race, gender) or continuous (e.g., the ratio of an ethnic group in the population of a city district), we do not assume discreteness of the protected feature. Due to the space limit, we focus on the illustration and implication of our results and defer all proofs to the appendix.
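As a concrete illustration (not part of the original formulation), the violation of the probability-form criterion can be estimated from samples by comparing group-conditional positive rates. The sketch below, with a hypothetical helper `equalized_odds_violation`, measures the largest TPR/FPR gap across groups for a binary classifier:

```python
import numpy as np

def equalized_odds_violation(y_true, y_pred, group):
    """Largest gap in P(Yhat = 1 | A = a, Y = y) across groups,
    taken over y in {0, 1} (i.e., the worse of the TPR gap and FPR gap)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = []
    for y in (0, 1):
        rates = []
        for a in np.unique(group):
            mask = (group == a) & (y_true == y)
            if mask.any():
                # Empirical P(Yhat = 1 | A = a, Y = y)
                rates.append(y_pred[mask].mean())
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# Toy check: a predictor with identical group-conditional rates scores zero.
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 1, 1, 1])  # same TPR/FPR in both groups
print(equalized_odds_violation(y_true, y_pred, group))  # 0.0
```

A population-level Equalized-Odds predictor drives this quantity to zero as the sample size grows; in finite samples one can only certify δ-discrimination.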

3. FAIRNESS IN REGRESSION MAY NOT BE ATTAINED

In this section we consider the attainability of Equalized Odds for regression tasks, namely, whether or not it is possible to find a predictor that is conditionally independent of the protected feature given the true value of the target. For linear Gaussian cases, one can attain Equalized Odds by constraining the partial correlation between the prediction and the protected feature given the target variable to be zero (Woodworth et al., 2017). Various regularization terms have also been proposed to suppress discrimination when predicting a continuous target (Berk et al., 2017; Mary et al., 2019). However, whether or not one can always achieve 0-discrimination for regression, even with an unlimited amount of data, is not yet clear. If "fair" predictors are deployed without carefully checking the attainability of fairness, the discrimination becomes a hidden hazard, hard to detect and eliminate. In fact, as we show in this section, even in the simple setup of linearly correlated continuous data, Equalized Odds is not always attainable.

3.1. UNATTAINABILITY OF EQUALIZED ODDS IN LINEAR NON-GAUSSIAN REGRESSION

As stated in Section 2.1, in this paper we consider prediction fairness, and therefore any possible bias introduced by the data generating procedure itself is beyond the scope of the discussion. Consider the situation where the data are generated as follows (H is not measured in the dataset):

X = qA + E_X, H = bA + E_H, Y = cX + dH + E_Y, (4)

where (A, E_X, E_H, E_Y) are mutually independent. In fact, if at most one of E_X and E := E_Y + dE_H is Gaussian, then any linear combination of A and X with non-zero coefficients is not conditionally independent of A given Y, meaning that it is not possible to achieve Equalized-Odds fairness. Let Z be a linear combination of A and X, i.e., Z = αA + βX = (α + qβ)A + βE_X, with linear coefficients α and β, where β ≠ 0. In Theorem 3.1, we present the general result in linear non-Gaussian cases, where one cannot achieve conditional independence between Z and A given Y.

Theorem 3.1. (Unattainability of Equalized Odds in the Linear Non-Gaussian Case) Assume that feature X has a causal influence on Y, i.e., c ≠ 0 in Equation 4, and that the protected feature A and the target Y are not independent, i.e., qc + bd ≠ 0. Assume p_{E_X} and p_E are positive on ℝ. Let f_1 := log p_A, f_2 := log p_{E_X}, and f_3 := log p_E. Further assume that f_2 and f_3 are third-order differentiable. Then if at most one of E_X and E is Gaussian, Z is always conditionally dependent on A given Y.

From Theorem 3.1, we see that in linear non-Gaussian cases, any non-zero linear combination of the features (which is a deterministic function of the input) will not satisfy Equalized Odds. One may wonder whether Equalized Odds can be achieved by nonlinear regression instead of a linear model. Although a proof for general nonlinear models is rather involved, our simulation results in Section 5.1 strongly suggest that the unattainability of Equalized Odds persists in nonlinear regression cases.
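To make the setting concrete, the following sketch (an illustrative simulation, not the experiment of Section 5.1; the coefficients and uniform noise distributions are arbitrary choices) samples data from the structural equations X = qA + E_X, H = bA + E_H, Y = cX + dH + E_Y, fits an unconstrained least-squares prediction Z = αA + βX, and measures the partial correlation between Z and A given Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
q, b, c, d = 1.0, 1.0, 1.0, 1.0          # illustrative structural coefficients

# Non-Gaussian (uniform), mutually independent exogenous noises.
A   = rng.uniform(-1, 1, n)
E_X = rng.uniform(-1, 1, n)
E_H = rng.uniform(-1, 1, n)
E_Y = rng.uniform(-1, 1, n)

X = q * A + E_X                           # observed feature
H = b * A + E_H                           # H is unobserved
Y = c * X + d * H + E_Y                   # target

# Unconstrained OLS predictor Z = alpha*A + beta*X.
F = np.column_stack([A, X])
alpha, beta = np.linalg.lstsq(F, Y, rcond=None)[0]
Z = alpha * A + beta * X

def partial_corr(u, v, w):
    """Correlation of u and v after linearly regressing out w."""
    W = np.column_stack([np.ones_like(w), w])
    ru = u - W @ np.linalg.lstsq(W, u, rcond=None)[0]
    rv = v - W @ np.linalg.lstsq(W, v, rcond=None)[0]
    return np.corrcoef(ru, rv)[0, 1]

print(partial_corr(Z, A, Y))  # clearly nonzero: Z violates Equalized Odds
```

Here the partial correlation is clearly nonzero, so this Z is conditionally dependent on A given Y; Theorem 3.1 states the stronger fact that, in the non-Gaussian case, no choice of (α, β) with β ≠ 0 makes Z conditionally independent of A given Y, even one that drives the partial correlation to zero.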
In light of the unattainability of Equalized Odds for prediction with deterministic functions of A and X, it is desirable to develop general, nonlinear prediction algorithms to produce a probabilistic prediction (i.e., with a certain type of randomness in the prediction). One possible way follows the framework of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) : we use random standard Gaussian noise, in addition to A and X, as input, such that the output will have a specific type of randomness. The parameters involved are learned by minimizing prediction error and enforcing Equalized Odds on the "randomized" output at the same time. Given that this approach is not essential to illustrate the claims made in this paper and that theoretical properties of such nonlinear regression algorithms with randomized output are not straightforward to establish, this is left as future work.
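A minimal sketch of such a randomized predictor follows (purely illustrative: the weights are fixed by hand here, whereas in the scheme described above they would be learned by jointly minimizing prediction error and an Equalized-Odds penalty, e.g., adversarially):

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_predict(a, x, weights, noise_scale=1.0, rng=rng):
    """Sketch of a randomized predictor: a deterministic score in (a, x)
    plus injected standard-Gaussian noise, so each call returns a draw
    from a distribution rather than a fixed function of the input."""
    w_a, w_x, bias = weights
    eps = rng.standard_normal(np.shape(x))  # the extra noise input
    return w_a * a + w_x * x + bias + noise_scale * eps

a = np.zeros(5)
x = np.ones(5)
print(randomized_predict(a, x, weights=(0.0, 1.0, 0.0)))  # a fresh random draw
```

The point of the construction is only that the output carries its own randomness; making that randomness satisfy Equalized Odds is the (deferred) learning problem.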

4. FAIRNESS IN CLASSIFICATION

In this section, we consider the attainability of Equalized Odds for binary classifiers (with a deterministic or randomized prediction function), and furthermore, when it is attainable, the optimality of performance under the fairness criterion. Admittedly, as already pointed out by Woodworth et al. (2017), we generally cannot obtain 0-discriminatory predictors from a finite number of samples; instead, one should consider imposing δ-discrimination in practice (where δ is the violation of Equalized Odds). However, this neither guarantees nor rules out the possibility of attaining 0-discrimination on the population when the sample size goes to infinity.

4.1. CLASSIFICATION WITH DETERMINISTIC PREDICTION

We begin by considering cases where classification is performed by a deterministic function of the input. In particular, we derive the condition under which Equalized Odds can possibly hold true: the set S^{(t)}_{X|a} and the conditional distribution P_{X|AY}(x|a, y) must be coupled in some specific way so that they happen to satisfy the specified equality. In the special case where X ⊥ A | Y, if f is a function of only X, condition (ii) always holds. In general situations, if there do not exist subsets K_a, K_{a'} ⊆ X for different values a, a' ∈ A such that Σ_{x∈K_a} P_{X|AY}(x|a, y) = Σ_{x∈K_{a'}} P_{X|AY}(x|a', y), then condition (ii) can never hold (i.e., we cannot find a deterministic function f(A, X) that satisfies Equalized Odds). Generally speaking, in order to attain better classification accuracy, one would like to make P_{Ŷ|A,X}(t|a, x) as close as possible to P_{Y|A,X}(t|a, x), and if the set S^{(t)}_{X|a} and P_{X|AY}(x|a, y) are not strictly coupled, condition (ii) will be violated.
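For a finite feature space, the necessary condition above (the existence of subsets K_a, K_{a'} ⊆ X with matching conditional mass) can be checked by brute force. The sketch below, with hypothetical helpers `subset_masses` and `condition_possible`, enumerates all subset sums of P_{X|AY}(x|a, y) for two groups:

```python
import itertools

def subset_masses(p):
    """All achievable values of sum_{x in K} p[x] over subsets K of the support."""
    vals = set()
    for r in range(len(p) + 1):
        for K in itertools.combinations(range(len(p)), r):
            vals.add(round(sum(p[i] for i in K), 10))  # round to dedupe float sums
    return vals

def condition_possible(p_a, p_a2, tol=1e-9):
    """True iff some pair of subsets has matching conditional mass strictly
    between 0 and 1 -- a necessary condition for a non-trivial deterministic
    Equalized-Odds classifier (constant classifiers give the trivial 0 and 1)."""
    m1, m2 = subset_masses(p_a), subset_masses(p_a2)
    common = {v for v in m1 if any(abs(v - w) < tol for w in m2)}
    return any(0 < v < 1 for v in common)

# P(X = x | A = a, Y = y) for two groups over a 3-point feature space.
print(condition_possible([0.5, 0.3, 0.2], [0.5, 0.25, 0.25]))   # True (0.5 matches)
print(condition_possible([0.5, 0.3, 0.2], [0.55, 0.35, 0.1]))   # False (no match)
```

The enumeration is exponential in |X|, so this is only a didactic check of how restrictive the coupling requirement is, not a practical algorithm.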

4.2. CLASSIFICATION WITH RANDOMIZED PREDICTION

In this section, we consider cases where randomized prediction is acceptable, namely, the classifier outputs class labels with certain probabilities. We first derive the relation between the positive rates (TPR and FPR) of binary classifiers before and after the post-processing step, i.e., Ŷ_opt (the classifier optimized without fairness constraints) and Ỹ_post (the fair classifier derived by post-processing Ŷ_opt), and show that under mild assumptions one can always derive a non-trivial Equalized-Odds (at the population level) Ỹ_post via a post-processing step. Then, from the ROC feasible-area perspective, we prove that post-processing approaches are actually equivalent to in-processing approaches with additional "pseudo" constraints enforced. Therefore, using the same loss function, post-processing approaches can perform no better than in-processing approaches.

4.2.1. THE POST-PROCESSING STEP

The post-processing step of a predictor Ŷ (here we drop the subscript when there is no ambiguity) only utilizes the information in the joint distribution of (A, Y, Ŷ). A fair predictor Ỹ_post derived via a post-processing step, for instance, the shifted decision boundary (Fish et al., 2016), the derived predictor (Hardt et al., 2016), or the (monotonic) joint loss optimization over decoupled classifiers (Dwork et al., 2018), is then fully specified by a (possibly randomized) function of (A, Ŷ). This implies the conditional independence Ỹ_post ⊥ Y | A, Ŷ. Since we can denote the positive rates of Ŷ as P_{Ŷ|AY}(1|a, y)¹, and the positive rates of Ỹ (here we drop the subscript for readability) as P_{Ỹ|AY}(1|a, y), the relation between the positive rates of binary classifiers before and after a post-processing step satisfies (for every a ∈ A, u, y ∈ Y):

P_{Ỹ|AY}(1|a, y) = Σ_{u∈Y} β_a^{(u)} P_{Ŷ|AY}(u|a, y), (5)

where β_a^{(u)} := P(Ỹ = 1 | A = a, Ŷ = u). Notice that Equation 5 is just a factorization of probability under the conditional independence (between Ỹ_post and Y given A and Ŷ). Therefore, post-processing an existing predictor boils down to optimizing the parameters (for discrete A) or functions (for continuous A) β_a^{(u)}.
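For binary Y, Equation 5 can be inverted per group: given a target (TPR, FPR) point, the mixing coefficients β_a^{(1)}, β_a^{(0)} solve a 2×2 linear system, and the target is achievable for that group exactly when both solutions lie in [0, 1]. A minimal sketch (illustrative helper name; the target point is chosen by hand here, whereas in practice it would be optimized over the intersection of the groups' feasible regions):

```python
import numpy as np

def mixing_coefficients(tpr, fpr, target_tpr, target_fpr):
    """Solve the per-group instance of Equation 5:
        target_tpr = beta1 * tpr + beta0 * (1 - tpr)
        target_fpr = beta1 * fpr + beta0 * (1 - fpr)
    The target is achievable for this group only if both returned
    coefficients lie in [0, 1]."""
    M = np.array([[tpr, 1.0 - tpr],
                  [fpr, 1.0 - fpr]])
    beta1, beta0 = np.linalg.solve(M, np.array([target_tpr, target_fpr]))
    return beta1, beta0

# Two groups with different ROC points; equalize both at (TPR, FPR) = (0.6, 0.2).
for tpr_a, fpr_a in [(0.9, 0.3), (0.7, 0.1)]:
    beta1, beta0 = mixing_coefficients(tpr_a, fpr_a, 0.6, 0.2)
    assert 0.0 <= beta0 <= 1.0 + 1e-12 and 0.0 <= beta1 <= 1.0
    print(beta1, beta0)
```

With the resulting β_a^{(u)}, the post-processed classifier outputs 1 with probability β_a^{(1)} when Ŷ = 1 and β_a^{(0)} when Ŷ = 0, so both groups land on the same ROC point and Equalized Odds holds.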

4.2.2. ROC FEASIBLE AREA

On the Receiver Operating Characteristic (ROC) plane, a two-dimensional plane with the horizontal axis denoting FPR and the vertical axis denoting TPR, the performance of any binary predictor Ŷ corresponds to a point.

¹Recall that P_{Ŷ|AY}(u|a, y) = P(Ŷ = u | A = a, Y = y). When u = 1, P_{Ŷ|AY}(1|a, y) = P(Ŷ = 1 | A = a, Y = y) represents the positive rates of Ŷ; when u = 0, P_{Ŷ|AY}(0|a, y) = P(Ŷ = 0 | A = a, Y = y) represents the positive rates of 1 − Ŷ (the classifier that flips the prediction of Ŷ).

[Figure: feasible areas on the plane with axes P(Ỹ = 1 | A, Y = 0) and P(Ỹ = 1 | A, Y = 1). Panels: (a) discrete protected feature; (b) continuous protected feature; (c) feasible areas for Ŷ_opt, Ỹ_post, and Ỹ_in; (d) feasible area Ω(Ỹ*_in) with "pseudo" constraints.]
A x s q R A N k i J 2 G S P V M k R q Z E 6 Y γ a ( Y ) = γ a0 ( Y ), γ a1 ( Y ) := P Y |AY (1|a, 0), P Y |AY (1|a, 1) . ( ) Further denote the corresponding convex hull of Y on the ROC plane as C a ( Y ) using vertices: C a ( Y ) := convhull (0, 0), γ a ( Y ), γ a (1 -Y ), (1, 1) , and then, as already stated in Hardt et al. (2016) , the (FPR, TPR) pair corresponding to a postprocessing predictor falls within (including the boundary of) C a ( Y ). Definition 4.1. (ROC feasible area) The feasible area of a predictor Ω( Y ), specified by the hypothesis space of available predictors Y , is the set containing all attainable (FPR, TPR) pairs by the predictor on the ROC plane satisfying Equalized Odds. In Hardt et al. (2016) it is proposed that the post-processing fair predictor can be derived by solving a linear programming problem on the ROC plane. However, it is not clearly stated whether or not such problem always has a non-trivial solution. Following Hardt et al. (2016) , we analyze the relation between the (FPR, TPR) pair of predictors on the ROC plane and formally establish the existence of the non-trivial Equalized-Odds predictor.  Ω( Y post ) = ∅. Here Y post is a possibly randomized function of only A and Y , trading off TPR with FPR across groups with different value of protected feature. From the panels (a) and (b) of Figure 1 we can also see that Ω( Y post ), the ROC feasible area of Y post , is the intersection of Ω a ( Y ), indicating that although Equalized Odds is attained, the performance of Y post is always worse than the weakest performance across different groups, which is obviously suboptimal.

4.2.3. OPTIMALITY OF PERFORMANCE AMONG FAIR CLASSIFIERS

In this subsection we discuss the optimality of performance of fair classifiers derived via different approaches. Since recent efforts to impose Equalized Odds in the pre-processing manner (Madras et al., 2018; Zhao et al., 2020) approach the problem from a representation-learning perspective, where the main focus is to learn fair representations that at the same time preserve sufficient information from the original data, we omit pre-processing approaches from the discussion and compare the performance of post-processing and in-processing fair classifiers. Admittedly, when only information about the joint distribution of (A, Y, Ŷ) is available, post-processing is the best we can do. However, this is not the case when we have access to additional features during training. For any predictor specified by parameters θ ∈ Θ, the derivations of the in-processing fair predictor Ỹ_in and the unconstrained statistically optimal predictor Ŷ_opt take the following forms, respectively:

min_{θ∈Θ} E[ℓ(f(A, X; θ), Y)]  s.t.  P_{Ỹin|AY}(t | a, y) = P_{Ỹin|Y}(t | y), where Ỹ_in = f(A, X; θ), (8)

min_{θ∈Θ} E[ℓ(f(A, X; θ), Y)], where Ŷ_opt = f(A, X; θ). (9)

It is natural to wonder, now that one can always directly solve for Ỹ_in from Equation 8, how it is related to Ỹ_post, which is derived by post-processing the Ŷ_opt solved from Equation 9.
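A minimal in-processing sketch of Equation 8 (our illustration, not the paper's implementation): logistic regression over (X, A) trained by gradient descent on the log loss plus a soft Equalized-Odds penalty, here the squared max-min gap of group-wise mean scores given each value of Y. The synthetic data, penalty weight, and learning rate are all assumptions for the demo.

```python
# Soft-constrained in-processing sketch: minimize logloss + lam * EO penalty.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def eo_penalty(p, A, Y):
    """Squared gap between group-wise mean scores, summed over Y = 0, 1."""
    pen = 0.0
    for y in (0, 1):
        rates = [p[(A == a) & (Y == y)].mean() for a in np.unique(A)]
        pen += (max(rates) - min(rates)) ** 2
    return pen

def objective(w, Xb, A, Y, lam):
    p = sigmoid(Xb @ w)
    logloss = -np.mean(Y * np.log(p + 1e-12) + (1 - Y) * np.log(1 - p + 1e-12))
    return logloss + lam * eo_penalty(p, A, Y)

def train(Xb, A, Y, lam=2.0, lr=0.2, iters=300, eps=1e-5):
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - Y) / len(Y)      # gradient of the log loss
        for j in range(len(w)):             # numerical gradient of the penalty
            w2 = w.copy(); w2[j] += eps
            grad[j] += lam * (eo_penalty(sigmoid(Xb @ w2), A, Y)
                              - eo_penalty(p, A, Y)) / eps
        w -= lr * grad
    return w

# Illustrative synthetic data: Y depends on both X and A.
rng = np.random.default_rng(0)
n = 500
A = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2)) + 0.5 * A[:, None]
Y = (X[:, 0] + 0.5 * A + rng.normal(size=n) > 0.5).astype(int)
Xb = np.hstack([X, A[:, None], np.ones((n, 1))])  # features include A, per Eq. 8
w = train(Xb, A, Y)
```

The numerical penalty gradient is used purely for brevity; an automatic-differentiation framework would be the natural choice in practice.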
Interestingly, although Ỹ_in and Ỹ_post are solved via different constrained optimization schemes, one can draw a connection between them by using Ŷ_opt as a bridge and reasoning about the relation between their ROC feasible areas Ω(Ỹ_in) and Ω(Ỹ_post), as we summarize in the following theorem.

Theorem 4.3. (Equivalence between ROC feasible areas) Let Ω(Ỹ_post) denote the ROC feasible area specified by the constraints enforced on Ỹ_post. Then Ω(Ỹ_post) is identical to the ROC feasible area Ω(Ỹ*_in) that is specified by the following set of constraints: (i) the constraints enforced on Ỹ_in; (ii) additional "pseudo" constraints: ∀a ∈ A, β^(0)_{a0} = β^(0)_{a1} and β^(1)_{a0} = β^(1)_{a1}, where β^(u)_{ay} = Σ_{x∈X} P(Ỹ_in = 1 | A = a, X = x) P(X = x | A = a, Y = y, Ŷ_opt = u).

As we can see from panels (c) and (d) of Figure 1, if the additional "pseudo" constraints are introduced when optimizing Ỹ*_in, we have Ω(Ỹ_in) ⊇ Ω(Ỹ_post) = Ω(Ỹ*_in). The ROC feasible area is fully specified by the hypothesis class and the fairness constraint. Therefore, with the same objective function and fairness constraint, the fair classifier derived from an in-processing approach is never outperformed by the one derived from a post-processing approach. When we have access to additional features but choose a post-processing approach, we lose performance (compared to Ỹ_in) by unintentionally introducing "pseudo" constraints during optimization; these "pseudo" constraints offset the benefit of utilizing additional features (in the hope of scoring a better performance while remaining fair).

5. EXPERIMENTS

To illustrate the claims intuitively, we provide numerical results for various settings. We first present results for (linear non-Gaussian and nonlinear) regression tasks in which Equalized Odds is not attained, demonstrating the dependence between the prediction and the protected feature given the true value of the target variable. Then, for classification tasks, we compare the performance of several existing methods in the literature on the Adult, Bank, COMPAS, and German Credit data sets. A detailed description of the data sets is available in the appendix.

5.1. REGRESSION WITH LINEAR NON-GAUSSIAN AND NONLINEAR DATA

In Section 3.1 we showed the unattainability of Equalized Odds for regression with linear non-Gaussian data. Although a proof of similar results for nonlinear cases does not seem straightforward, our numerical illustrations strongly suggest that the unattainability of Equalized Odds persists in nonlinear regression. In Figure 2 we present scatter plots of Ỹ versus A for Y in a small (compared to its support) interval, for linear non-Gaussian as well as nonlinear regression cases. For the linear cases, the data is generated as stated in Equation 4, with non-Gaussian distributed exogenous terms (E_X, E_H, and E_Y). We use linear regression with the Equalized Correlations constraint (Woodworth et al., 2017), a weaker notion of Equalized Odds for linearly correlated data, as the predictor. For the nonlinear cases, the data is generated using a similar scheme but with nonlinear transformations (e.g., combinations of sin(·), log(·), and polynomials) and Gaussian distributed exogenous terms. We use a neural-net regressor with an Equalized-Odds regularization term (Mary et al., 2019) to perform nonlinear fair regression. As Figure 2 shows, for nonlinear regression tasks Equalized Odds may not be attained even if every exogenous term is Gaussian distributed.
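The linear non-Gaussian setup can be sketched as follows (our reading of Equation 4: X = qA + E_X, H = dA + E_H, Y = cX + bH + E_Y; the coefficient values and the Laplace noise choice are illustrative). The snippet fits an ordinary least-squares predictor Z of Y and inspects group-wise means of Z inside a thin slice of Y, mimicking the scatter-plot diagnostic behind Figure 2.

```python
# Generate linear non-Gaussian data and examine Z within a thin slice of Y.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
q, b, c, d = 1.0, 1.0, 1.5, 0.8          # illustrative coefficients
A = rng.integers(0, 2, n).astype(float)
X = q * A + rng.laplace(size=n)
H = d * A + rng.laplace(size=n)
Y = c * X + b * H + rng.laplace(size=n)

# OLS of Y on (A, X, 1): Z = alpha*A + beta*X + intercept
F = np.column_stack([A, X, np.ones(n)])
coef, *_ = np.linalg.lstsq(F, Y, rcond=None)
Z = F @ coef

# Thin slice of the target, as in the Figure 2 scatter plots.
mask = np.abs(Y - np.median(Y)) < 0.1
gap = abs(Z[mask & (A == 1)].mean() - Z[mask & (A == 0)].mean())
```

A nonzero `gap` within the slice is the kind of conditional dependence between Z and A given Y that the figure visualizes; we make no quantitative claim here beyond the qualitative pattern.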

5.2. FAIR CLASSIFICATION

In Figure 3, we compare the performance under Equalized Odds of multiple methods proposed in the literature. Hardt et al. (2016) propose a post-processing approach where the prediction is randomized to minimize the violation of fairness; Zafar et al. (2017a) use a covariance proxy measure as the regularization term when optimizing classification accuracy; Agarwal et al. (2018) take the reductions approach and reduce fair classification to solving a sequence of cost-sensitive classification problems; Rezaei et al. (2020) minimize the worst-case log loss using an approximated regularization term; Baharlouei et al. (2020) propose to use Rényi correlation as the regularization term to account for nonlinear dependence between variables. To measure the violation of the fairness criterion, we use the Equalized Odds (EOdds) violation, defined as max_{y∈Y} max_{a,a′∈A} |P_{Ỹ|AY}(1 | a, y) − P_{Ỹ|AY}(1 | a′, y)|. Following Agarwal et al. (2018), we pick 0.01 as the default bound that the EOdds violation should not exceed (if practically achievable for the method) during training. For each method we plot the test accuracy against the violation of Equalized Odds. Although a probabilistic classification model (here, logistic regression) is used in every method, if an algorithm outputs the class label that maximizes the prediction likelihood, the prediction is in essence performed by a deterministic function of the input features (e.g., Rezaei et al. (2020); Baharlouei et al. (2020)). As we have shown in Section 4.1, for classification with a deterministic function, the conditions specified in Theorem 4.1 are easily violated in general cases, i.e., Equalized Odds may not be attained even with an unlimited amount of data. Therefore, although here we consider finite-data cases, we can still anticipate a lower level of fairness violation with randomized prediction. This is validated by the numerical experiment: while the approach of Hardt et al. (2016) does not score the lowest test error, its violation of Equalized Odds is the lowest among the compared approaches. The benefit of introducing randomization can also be witnessed in the Pareto frontier presented by Agarwal et al. (2018), whose approach can potentially achieve any desired fairness-accuracy trade-off between that of the post-processing approach and that of the unconstrained optimized classifier. In some scenarios one only cares about an equal TPR (e.g., the rate of acceptance/admission) across groups, i.e., the Equal Opportunity (Hardt et al., 2016) notion of fairness; the related numerical results on real-world data sets are also presented.
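The EOdds violation used above can be written out directly (a sketch; `yhat` holds hard predictions, `a` group labels, `y` ground truth, all 1-D arrays):

```python
# Empirical Equalized-Odds violation:
# max over y of the max-min gap of P(yhat = 1 | a, y) across groups.
import numpy as np

def eodds_violation(yhat, a, y):
    worst = 0.0
    for yv in np.unique(y):
        rates = [yhat[(a == g) & (y == yv)].mean() for g in np.unique(a)]
        worst = max(worst, max(rates) - min(rates))
    return worst
```

For Equal Opportunity one would restrict the outer loop to y = 1 (the TPR gap only).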

6. CONCLUSION AND FUTURE WORK

In this paper, we focus on the Equalized-Odds criterion and consider the attainability of fairness and, when attainable, the optimality of prediction performance under various settings. We first show that, for fair regression, one can achieve Equalized Odds only when certain conditions on the joint distribution of the features and the target variable are met. Then, for classification tasks with deterministic classifiers, we give the condition under which Equalized Odds can hold true; we also show that, under mild assumptions, one can always find a non-trivial Equalized-Odds (randomized) predictor, even with a continuous protected feature. In terms of the optimality of performance, one can always (if conditions permit) benefit from exploiting all available features during training. Future work would naturally consider nonlinear regression algorithms with randomized output and fairness guarantees, as well as the attainability of more fine-grained (compared to group fairness) criteria of fairness, e.g., individual fairness and procedural fairness in the fairness hierarchy.

A APPENDIX

A.1 PROOF FOR THEOREM 3.1

To prove the unattainability of Equalized Odds in regression, we need the following lemma, which characterizes conditional independence in terms of a factorization of the conditional density.

Lemma A.1. Variables V_1 and V_2 are conditionally independent given variable V_3 if and only if there exist functions h(v_1, v_3) and g(v_2, v_3) such that

p_{V1,V2|V3}(v_1, v_2 | v_3) = h(v_1, v_3) · g(v_2, v_3). (10)

Proof. First, if V_1 and V_2 are conditionally independent given V_3, then p_{V1,V2|V3}(v_1, v_2 | v_3) = p_{V1|V3}(v_1 | v_3) · p_{V2|V3}(v_2 | v_3), which is of the form of Equation 10. Conversely, suppose Equation 10 holds. Let h̄(v_3) := ∫ h(v_1, v_3) dv_1 and ḡ(v_3) := ∫ g(v_2, v_3) dv_2. Integrating Equation 10 w.r.t. v_1 and v_2 respectively gives p_{V2|V3}(v_2 | v_3) = h̄(v_3) · g(v_2, v_3) and p_{V1|V3}(v_1 | v_3) = ḡ(v_3) · h(v_1, v_3). Bearing in mind Equation 10, the product of these two equations is p_{V2|V3}(v_2 | v_3) · p_{V1|V3}(v_1 | v_3) = h̄(v_3) · ḡ(v_3) · g(v_2, v_3) · h(v_1, v_3) = h̄(v_3) · ḡ(v_3) · p_{V1,V2|V3}(v_1, v_2 | v_3). Integrating this equation w.r.t. v_1 and v_2 gives h̄(v_3) · ḡ(v_3) ≡ 1, so the equation reduces to p_{V2|V3}(v_2 | v_3) · p_{V1|V3}(v_1 | v_3) = p_{V1,V2|V3}(v_1, v_2 | v_3). That is, V_1 and V_2 are conditionally independent given V_3.

Now we are ready to prove the unattainability of Equalized Odds in linear non-Gaussian regression:

Theorem. (Unattainability of Equalized Odds in the Linear Non-Gaussian Case) Assume that feature X has a causal influence on Y, i.e., c ≠ 0 in Equation 4, and that the protected feature A and Y are not independent, i.e., qc + bd ≠ 0. Assume p_{E_X} and p_E are positive on R. Let f_1 := log p_A, f_2 := log p_{E_X}, and f_3 := log p_E. Further assume that f_2 and f_3 are third-order differentiable. Then if at most one of E_X and E is Gaussian, Z is always conditionally dependent on A given Y.
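Before turning to the proof, the complementary Gaussian case can be sanity-checked numerically. Writing Z = (α + qβ)A + βE_X and Y = (qc + bd)A + cE_X + E (our reading of Equation 4, with all coefficient values below illustrative), the partial covariance Cov(A, Z | Y), which captures the Gaussian-case conditional (un)correlatedness, vanishes exactly when α/β takes the closed-form value stated in the corollary below.

```python
# Check that Cov(A, Z | Y) = Cov(A,Z) - Cov(A,Y) Cov(Y,Z) / Var(Y) is zero
# precisely at the corollary's value of alpha/beta (illustrative numbers).

def partial_cov_AZ_given_Y(alpha, beta, q, c, bd, var_A, var_EX, var_E):
    m, k = alpha + q * beta, q * c + bd      # coefficients of A in Z and in Y
    cov_AZ = m * var_A
    cov_AY = k * var_A
    cov_YZ = m * k * var_A + beta * c * var_EX
    var_Y = k ** 2 * var_A + c ** 2 * var_EX + var_E
    return cov_AZ - cov_AY * cov_YZ / var_Y

q, c, bd, beta = 0.7, 1.3, 0.4, 2.0
var_A, var_EX, var_E = 1.0, 0.5, 0.8
# closed-form ratio alpha/beta from the corollary
ratio = (bd * c * var_EX - q * var_E) / (c ** 2 * var_EX + var_E)
alpha = ratio * beta
```

Perturbing α away from this value makes the partial covariance nonzero, consistent with the "only if" direction.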

Proof. According to Equation 4, we have

(A, Z, Y)ᵀ = M · (A, E_X, E)ᵀ, where M = [1, 0, 0; α + qβ, β, 0; qc + bd, c, 1]. (11)

The determinant of this linear transformation is β, which relates the probability density of the variables on the LHS to that of the variables on the RHS. Therefore, according to Equation 11, we can rewrite the joint density using the Jacobian determinant and factor it into marginal densities (A, E_X, and E are mutually independent according to the data-generating process). Let ᾱ := (α + qβ)/β, r := bd − cα/β, and c̄ := c/β. Then E_X = (1/β)Z − ᾱA, E = Y − rA − c̄Z, and

p_{A,Z,Y}(a, z, y) = p_{A,E_X,E}(a, e_x, e)/|β| = (1/|β|) p_A(a) p_{E_X}((1/β)z − ᾱa) p_E(y − ra − c̄z). (12)

On its support, the log-density can be written as

J := log p_{A,Z,Y}(a, z, y) = f_1(a) + f_2((1/β)z − ᾱa) + f_3(y − ra − c̄z) − log|β|. (13)

According to Equation 13, we have

∂J/∂z = (1/β) · f_2′((1/β)z − ᾱa) − c̄ · f_3′(y − ra − c̄z), (14)

∂²J/∂a∂z = −(ᾱ/β) · f_2″((1/β)z − ᾱa) + rc̄ · f_3″(y − ra − c̄z). (15)

Combining Equations 14 and 15 with Lemma A.1, A ⊥⊥ Z | Y requires ∂²J/∂a∂z ≡ 0, i.e.,

rc̄ · f_3″(y − ra − c̄z) = (ᾱ/β) · f_2″((1/β)z − ᾱa). (16)

Further taking the partial derivative of both sides of the above equation w.r.t. y yields rc̄ · f_3‴(y − ra − c̄z) ≡ 0. There are three possible situations in which this can hold: (i) c̄ = 0, which is equivalent to c = 0 and contradicts the theorem assumption. (ii) r = 0. Then according to Equation 16 we have (ᾱ/β) · f_2″((1/β)z − ᾱa) ≡ 0, which implies either ᾱ = 0 or f_2″ ≡ 0. If the latter is the case, then f_2 is a linear function and, accordingly, exp(f_2) is not integrable and does not correspond to any valid density function. If the former is true, i.e., ᾱ = 0, then α = −qβ, which further implies r = bd − cα/β = bd + qc; since r = 0 in this situation, bd + qc = 0, which again contradicts the theorem assumption. (iii) f_3‴(y − ra − c̄z) ≡ 0. That is, f_3 is a quadratic function with a nonzero coefficient for the quadratic term (otherwise f_3 does not correspond to the logarithm of any valid density function), so E follows a Gaussian distribution. Only situation (iii) is possible, i.e., rc̄ ≠ 0 and E follows a Gaussian distribution. The LHS of Equation 16 is then a nonzero constant, hence f_2″ is also a nonzero constant; f_2 is quadratic and E_X also follows a Gaussian distribution. Therefore, if A ⊥⊥ Z | Y were to hold, then E_X and E would both be Gaussian. The contrapositive gives the conclusion of this theorem.

Corollary. Suppose that both E_X and E are Gaussian, with variances σ²_{E_X} and σ²_E, respectively. (The protected feature A is not necessarily Gaussian.) Then Z ⊥⊥ A | Y if and only if

α/β = (bdc · σ²_{E_X} − q · σ²_E) / (c² · σ²_{E_X} + σ²_E).

Proof. Under the condition that E_X and E are Gaussian, their log-density functions are third-order differentiable, with f_2″ ≡ −1/σ²_{E_X} and f_3″ ≡ −1/σ²_E. Then according to the proof of Theorem 3.1, Z ⊥⊥ A | Y holds if and only if ∂²J/∂a∂z ≡ 0, i.e., (ᾱ/β)/σ²_{E_X} = rc̄/σ²_E. Substituting ᾱ = α/β + q, r = bd − cα/β, and c̄ = c/β and solving for α/β gives the stated condition.

A.2 PROOF FOR THEOREM 4.1

Theorem (conditions restated). Equalized Odds holds true if and only if the following two conditions are satisfied: (i) ∀t ∈ Y: S^(t)_A = A; (ii) ∀t ∈ Y, ∀a, a′ ∈ A, a ≠ a′: Σ_{x∈S^(t)_{X|a}} P(X = x | A = a, Y = y) = Σ_{x∈S^(t)_{X|a′}} P(X = x | A = a′, Y = y).

Proof. We begin by considering the case where A and X are discrete (for readability). The Equalized Odds criterion can be written in terms of conditional probabilities as in Equation 19. Expanding the LHS of Equation 19, P(Ỹ = t | A = a, Y = y) = Σ_{x∈X} P(Ỹ = t | A = a, X = x, Y = y) P(X = x | A = a, Y = y), and bearing in mind that Ỹ := f(A, X) is a deterministic function of (A, X), we have P(f(A, X) = t | A = a, X = x, Y = y) = P(f(A, X) = t | A = a, X = x) ∈ {0, 1}. (20) From Equation 20 we can see that the conditional probability P(X = x | A = a, Y = y) contributes to the summation only when f(a, x) = t, so the LHS of Equation 19 can be rewritten as Q^(t)(a, y) := Σ_{x∈S^(t)_{X|a}} P(X = x | A = a, Y = y). Since Equalized Odds holds true if and only if Equation 19 holds true, the LHS of the equation must not involve a (as is the case for the RHS), i.e., Q^(t)(a, y) does not change with a.
Then Equation 19 becomes P(Ỹ = t | A = a, Y = y) = Q^(t)(a, y) = Σ_{a′∈S^(t)_A} Q^(t)(a′, y) P(A = a′ | Y = y) = Q^(t)(a, y) · Σ_{a′∈S^(t)_A} P(A = a′ | Y = y), where the last step uses the fact that Q^(t)(a, y) does not change with a. Since P(A, Y) is positive everywhere, the equality requires S^(t)_A = A, which gives condition (i); the fact that Q^(t)(a, y) does not change with a gives condition (ii). Therefore, Equalized Odds implies conditions (i) and (ii). On the other hand, it is easy to see that when conditions (i) and (ii) are satisfied, Equation 19 holds true, i.e., Equalized Odds holds true. When A and X are continuous, one can replace the summations with integrals accordingly.

A.3 PROOF FOR THEOREM 4.2

Theorem. (Attainability of Equalized Odds) Assume that the feature X is not independent of Y, and that Ŷ is a function of A and X. Then for binary classification, if Ŷ is a non-trivial predictor for Y, there is always at least one non-trivial predictor Ỹ_post, derived by post-processing Ŷ, that attains Equalized Odds, i.e., Ω(Ỹ_post) ≠ ∅.

Proof. Since Ŷ is a function of (A, X) and X is not independent of Y, Ŷ is not conditionally independent of Y given the protected feature A. Furthermore, since Ŷ is a non-trivial estimator of the binary target Y, there exists a constant ε > 0 such that

P(Ŷ = 1 | A = a, Y = 1) − P(Ŷ = 1 | A = a, Y = 0) ≥ ε, ∀a ∈ A. (21)

Equation 21 implies that for each value of A, the true positive rate of the non-trivial predictor is always strictly larger than its false positive rate. As illustrated in panels (a) and (b) of Figure 1, the (FPR, TPR) pair of the predictor Ŷ when A = a, i.e., the point γ_a(Ŷ) on the ROC plane, never falls in the gray shaded area, and its coordinates are bounded away from the diagonal by at least ε. Therefore, the intersection of all C_a(Ŷ) always forms a parallelogram with non-empty area, which corresponds to attainable non-trivial post-processing fair predictors Ỹ_post.

A.4 PROOF FOR THEOREM 4.3

Theorem. (Equivalence between ROC feasible areas)

Let Ω(Ỹ_post) denote the ROC feasible area specified by the constraints enforced on Ỹ_post. Then Ω(Ỹ_post) is identical to the ROC feasible area Ω(Ỹ*_in) that is specified by the following set of constraints: (i) the constraints enforced on Ỹ_in; (ii) additional "pseudo" constraints: ∀a ∈ A, β^(0)_{a0} = β^(0)_{a1} and β^(1)_{a0} = β^(1)_{a1}, with β^(u)_{ay} as defined in Equation 24.



Footnotes. In the approach proposed by Agarwal et al. (2018), the randomization is twofold: the first kind of randomization comes from picking a classifier from the distribution over multiple available classifiers; the second comes from the probabilistic prediction itself (if the hypothesis class contains probabilistic prediction models). In the proof of Theorem 4.2, if the TPR of the predictor is always smaller than its FPR, one can simply flip the prediction (since the target is binary) so that Equation 21 holds true.

Data set repositories:
http://archive.ics.uci.edu/ml/datasets/Adult
https://archive.ics.uci.edu/ml/datasets/bank+marketing
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)



(1) for any fixed value of a ∈ A, if t ≠ t′, then S^(t)_{X|a} ∩ S^(t′)_{X|a} = ∅; (2) for any fixed value of a ∈ A, ∪_{t∈Y} S^(t)_{X|a} = X.


Figure 1: ROC feasible area illustrations. Panels (a)-(b): attainability of Equalized Odds for binary classifiers with a discrete or continuous protected feature. Panels (c)-(d): comparison of the ROC feasible areas Ω(Ỹ_in), Ω(Ỹ_post), Ω(Ŷ_opt), and Ω(Ỹ*_in).

Figure 2: Illustration of unattainable Equalized Odds for regression tasks. Panels (a)-(b): Linear regression on data generated with linear transformations and non-Gaussian distributed exogenous terms (following Laplace and uniform distributions, respectively). Panels (c)-(d): Nonlinear regression with a neural-net regressor (Mary et al., 2019) on data generated with nonlinear transformations and Gaussian exogenous terms. We can observe obvious dependence between Ỹ and A on a small interval of Y. This indicates conditional dependence between Ỹ and A given Y, i.e., Equalized Odds is not achieved.

Figure 3: Results for classification with Equalized Odds/Equal Opportunity criterion.

∀a ∈ A, ∀t, y ∈ Y: P(Ỹ = t | A = a, Y = y) = P(Ỹ = t | Y = y). (19)

Σ_{x∈S^(t)_{X|a}} P(X = x | A = a, Y = y) := Q^(t)(a, y). Similarly, for the RHS of Equation 19, we have: P(Ỹ = t | Y = y) = Σ_{a∈A} Σ_{x∈X} P(Ỹ = t | A = a, X = x, Y = y) P(X = x, A = a | Y = y) = Σ_{a∈A} Σ_{x∈S^(t)_{X|a}} P(X = x | A = a, Y = y) P(A = a | Y = y) = Σ_{a∈S^(t)_A} Q^(t)(a, y) P(A = a | Y = y).

contains all possible values of A, i.e., S^(t)_A = A (otherwise, since P(A, Y) is positive everywhere, Σ_{a∈S^(t)_A} P(A = a | Y = y) < 1). Since Q^(t)(a, y) does not change with a, we have: ∀a, a′ ∈ A, a ≠ a′: Σ_{x∈S^(t)_{X|a}} P(X = x | A = a, Y = y) = Σ_{x∈S^(t)_{X|a′}} P(X = x | A = a′, Y = y),

β^(u)_{ay} = Σ_{x∈X} P(Ỹ_in = 1 | A = a, X = x) P(X = x | A = a, Y = y, Ŷ_opt = u).

Proof. The post-processing predictor Ỹ_post is derived by optimizing over parameters (or functions of A) β^(u)_a. Considering the fact that P_{Ỹpost|AY}(1|a, y) = γ_{ay}(Ỹ_post) and P_{Ŷopt|AY}(1|a, y) = γ_{ay}(Ŷ_opt), we have the relation between γ_{ay}(Ỹ_post) and γ_{ay}(Ŷ_opt):

γ_{ay}(Ỹ_post) = β^(0)_a · γ_{ay}(1 − Ŷ_opt) + β^(1)_a · γ_{ay}(Ŷ_opt), where β^(0)_a = P(Ỹ_post = 1 | A = a, Ŷ_opt = 0), β^(1)_a = P(Ỹ_post = 1 | A = a, Ŷ_opt = 1). (22)

Similarly, consider the relation between the positive rates of Ỹ_in and those of Ŷ_opt, i.e., P_{Ỹin|AY}(1|a, y) and P_{Ŷopt|AY}(1|a, y), by factorizing P_{Ỹin|AY}(1|a, y) over X and Ŷ_opt:

P_{Ỹin|AY}(1|a, y) = Σ_{u∈Y} [ Σ_{x∈X} P_{Ỹin|AX}(1|a, x) P_{X|AY Ŷopt}(x|a, y, u) ] P_{Ŷopt|AY}(u|a, y). (23)

Therefore, we have the relation between γ_{ay}(Ỹ_in) and γ_{ay}(Ŷ_opt):

γ_{ay}(Ỹ_in) = β^(0)_{ay} · γ_{ay}(1 − Ŷ_opt) + β^(1)_{ay} · γ_{ay}(Ŷ_opt), where β^(0)_{ay} = Σ_{x∈X} P(Ỹ_in = 1 | A = a, X = x) P(X = x | A = a, Y = y, Ŷ_opt = 0), β^(1)_{ay} = Σ_{x∈X} P(Ỹ_in = 1 | A = a, X = x) P(X = x | A = a, Y = y, Ŷ_opt = 1). (24)

Hardt et al. (2016) proposed Equalized Odds, which requires conditional independence between the prediction and the protected feature(s) given the ground truth of the target. Let us denote the protected feature by A, with domain of value A; additional (observable) feature(s) by X, with domain of value X; the target variable by Y, with domain Y; (not necessarily fair) predictors by Ŷ; and fair predictors by Ỹ. Equalized-Odds fairness requires

Let us take a look at the two conditions. Condition (i) says that within each class determined by the classification function f, A should be able to take all possible values in A. While condition (i) is already quite restrictive, condition (ii) specifies an even stronger constraint on the relation between the conditional probability P_{X|AY}(x|a, y) (or the conditional probability density p_{X|AY}(x|a, y) for continuous X) and the sets S^(t)_{X|a}, which have the following properties: (1) for any fixed value of a ∈ A, if t ≠ t′, then S^(t)_{X|a} ∩ S^(t′)_{X|a} = ∅; (2) for any fixed value of a ∈ A, ∪_{t∈Y} S^(t)_{X|a} = X.

According to Lemma A.1, A ⊥⊥ Z | Y if and only if p_{A,Z|Y}(a, z | y) is a product of a function of (a, y) and a function of (z, y); p_{A,Z,Y}(a, z, y) is then such a product times a function of y only. This property, under the conditions in Theorem 3.1, is equivalent to the constraint

Theorem. Assume that the protected feature A and Y are dependent, and that their joint probability P(A, Y) (for discrete A) or joint probability density p(A, Y) (for continuous A) is positive for every combination of possible values of A and Y. Further assume that Y is not fully determined by A, and that there are additional features X that are not independent of Y. Let the output of the classifier Ỹ be a deterministic function f: A × X → Y. Let S^(t)_{X|a} := {x | f(a, x) = t} and S^(t)_A := {a | S^(t)_{X|a} ≠ ∅}. Equalized Odds holds true if and only if the following two conditions are satisfied:


If there is more than one variable in X in Equation 24, one can expand the summation as needed; if some variables are continuous, one may also substitute the summation with integration accordingly. From Equation 24, β^(0)_{ay} and β^(1)_{ay} depend on the value of Y. Apart from the Equalized Odds constraints (which are shared by Ỹ_in and Ỹ_post), when the additional "pseudo" constraints β^(0)_{a0} = β^(0)_{a1} and β^(1)_{a0} = β^(1)_{a1} are enforced, β^(0)_{ay} and β^(1)_{ay} no longer depend on Y. This is exactly the inherent constraint that Ỹ_post satisfies. Therefore the stated equivalence between the ROC feasible areas Ω(Ỹ_post) (specified by the constraints enforced on Ỹ_post) and Ω(Ỹ*_in) (specified by the constraints enforced on Ỹ_in together with the additional "pseudo" constraints) holds true.

A.5 DESCRIPTION OF THE DATA SETS

(1) Adult: The UCI Adult data set contains 14 features for 45,222 individuals (32,561 samples for training and 12,661 samples for testing). The census information includes gender, marital status, education, capital gain, etc. The classification task is to predict whether a person's annual income exceeds 50,000 USD. We use the provided testing set for evaluations and present results with gender and race (considering white and black people only) set as the protected feature, respectively.

(2) Bank: The UCI Bank Marketing data set relates to marketing campaigns of a banking institution, containing 16 features of 45,211 individuals. The classification task is to predict whether a client will subscribe (yes/no) to a term deposit. The original data set is very unbalanced, with only 4,667 positives out of 45,211 samples. Therefore, we combine the "yes" points with randomly subsampled "no" points and perform experiments on the downsampled data set of 10,000 data points.
The protected feature is the marital status of the client.

(3) COMPAS (Angwin et al., 2016): The COMPAS data set contains records of over 11,000 defendants from Broward County, Florida, whose risk of recidivism was assessed using the COMPAS tool. Each record contains multiple features of the defendant, including demographic information, prior convictions, degree of charge, and the ground truth for recidivism within two years. Following Zafar et al. (2017a) and Nabi & Shpitser (2018), we limit our attention to the subset consisting of African-American and Caucasian defendants. The features we use include age, gender, race, number of priors, and degree of charges. The task is to predict the recidivism of the defendant, and we choose race as the protected feature.

(4) German Credit: The UCI German Credit data set contains 20 features (7 numerical, 13 categorical) describing the social and economic status of 1,000 customers. The prediction task is to classify people as good or bad credit risks. We use the provided numerical version of the data and choose gender as the protected feature.

A.6 ADDITIONAL DISCUSSION

For classification, while randomization can ensure group-level fairness, the criterion still has inherent shortcomings that deserve attention. For example, in the FICO case study in Hardt et al. (2016), for a specific client from a certain demographic group, the approve/deny decision on a loan is twofold: if the credit score is above (below) the upper (lower) threshold, the bank approves (denies) the application for sure; if the score falls in the interval between the two thresholds, the bank flips a coin to make the decision. Now imagine a client whose credit score falls within the interval between the upper and lower thresholds applying for a loan. He/she can ask (if conditions permit) the bank to run the model multiple times until the decision is an approval. This would render the randomization that was built into the system for the sake of fairness ineffective, and the system in essence has only one fixed threshold (i.e., the original lower threshold).
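The retry loophole above can be quantified with a one-line calculation (illustrative numbers): if a score in the randomized band is approved with probability p per query, then after k independent re-runs the chance of at least one approval is 1 − (1 − p)^k, which approaches 1 as k grows, so the effective decision boundary collapses to the lower threshold.

```python
# Probability of at least one approval after k independent re-runs of a
# coin-flip decision with per-query approval probability p.
def approval_prob_with_retries(p, k):
    return 1.0 - (1.0 - p) ** k
```

For instance, a fair coin (p = 0.5) already yields better than 99.9% approval odds after ten retries.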

