FAIREE: FAIR CLASSIFICATION WITH FINITE-SAMPLE AND DISTRIBUTION-FREE GUARANTEE

Abstract

Algorithmic fairness plays an increasingly critical role in machine learning research, and several group fairness notions and algorithms have been proposed. However, the fairness guarantees of existing fair classification methods mainly depend on specific data distributional assumptions and often require large sample sizes; fairness can be violated when the number of samples is modest, which is frequently the case in practice. In this paper, we propose FaiREE, a fair classification algorithm that satisfies group fairness constraints with finite-sample and distribution-free theoretical guarantees. FaiREE can be adapted to satisfy various group fairness notions (e.g., Equality of Opportunity, Equalized Odds, and Demographic Parity) and achieve the optimal accuracy. These theoretical guarantees are further supported by experiments on both synthetic and real data, where FaiREE is shown to have favorable performance over state-of-the-art algorithms.

1. INTRODUCTION

As machine learning algorithms are increasingly used in consequential domains such as college admission (Chouldechova & Roth, 2018), loan application (Ma et al., 2018), and disease diagnosis (Fatima et al., 2017), concerns about algorithmic fairness have emerged in recent years. When standard machine learning algorithms are applied directly to biased data provided by humans, their outputs are sometimes found to be biased with respect to sensitive attributes that we want to protect (race, gender, etc.). To quantify fairness in machine learning algorithms, many fairness notions have been proposed, including individual fairness (Biega et al., 2018), group fairness notions such as Demographic Parity, Equality of Opportunity, Predictive Parity, and Equalized Odds (Dieterich et al., 2016; Hardt et al., 2016; Gajane & Pechenizkiy, 2017; Verma & Rubin, 2018), and multi-group fairness notions including multi-calibration (Hébert-Johnson et al., 2018) and multi-accuracy (Kim et al., 2019). Based on these fairness notions or constraints, corresponding algorithms have been designed to help satisfy the fairness constraints (Hardt et al., 2016; Pleiss et al., 2017; Zafar et al., 2017b; Krishnaswamy et al., 2021; Valera et al., 2018; Chzhen et al., 2019; Zeng et al., 2022; Thomas et al., 2019).

[Figure 1: see Table 2 for detailed numerical results. Left: DEOO vs. α. Right: DEOO vs. test accuracy. Here, DEOO is the degree of violation of the fairness constraint Equality of Opportunity, and α is the prespecified desired level that upper bounds DEOO for both methods; see Eq. (1) in Section 2 for the precise definition.]

Among these fairness algorithms, post-processing is a popular type that modifies the output of a trained model to satisfy fairness constraints. However, recent post-processing algorithms are found to lack the ability to realize an accuracy-fairness trade-off and to perform poorly when the sample size is limited (Hardt et al., 2016; Pleiss et al., 2017).
In addition, since most fairness constraints are non-convex, some papers propose convex relaxation-based methods (Zafar et al., 2017b; Krishnaswamy et al., 2021). Algorithms of this type generally do not come with a theoretical guarantee on how well the output satisfies the exact original fairness constraint.

Definition 2 (Equality of Opportunity (Hardt et al., 2016)). A classifier satisfies Equality of Opportunity if it attains the same true positive rate across protected groups: P_{X|A=1,Y=1}(Ŷ = 1) = P_{X|A=0,Y=1}(Ŷ = 1).

Equalized Odds is an extension of Equality of Opportunity, requiring that both the false positive rate and the true positive rate be equal across attributes.

Definition 3 (Equalized Odds (Hardt et al., 2016)). A classifier satisfies Equalized Odds if it satisfies the following equalities: P_{X|A=1,Y=1}(Ŷ = 1) = P_{X|A=0,Y=1}(Ŷ = 1) and P_{X|A=1,Y=0}(Ŷ = 0) = P_{X|A=0,Y=0}(Ŷ = 0).

Sometimes it is too strict to require the classifier to satisfy Equality of Opportunity or Equalized Odds exactly, which may sacrifice substantial accuracy (a trivial example is the constant classifier f(x, a) ≡ 1). In practice, to strike a balance between fairness and accuracy, it makes sense to relax the equality above to an inequality with a small error bound. We use the difference with respect to Equality of Opportunity, denoted by DEOO, to measure the disparate impact:

DEOO = P_{X|A=1,Y=1}(Ŷ = 1) − P_{X|A=0,Y=1}(Ŷ = 1).   (1)

For a classifier ϕ, following Zeng et al. (2022) and Cho et al. (2020), |DEOO(ϕ)| ≤ α denotes an α-tolerance fairness constraint that keeps the difference between the true positive rates below α. Similarly, we define the following difference with respect to Equalized Odds; since Equalized Odds involves two constraints, the difference is a two-dimensional vector:

DEO = (P_{X|A=1,Y=1}(Ŷ = 1) − P_{X|A=0,Y=1}(Ŷ = 1), P_{X|A=1,Y=0}(Ŷ = 1) − P_{X|A=0,Y=0}(Ŷ = 1)).
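The population quantities DEOO and DEO can be estimated on a labeled sample by plugging in empirical group-conditional rates. A minimal sketch with our own helper names (not code from the paper):

```python
def rate(yhat, y, a, y_val, a_val):
    """Empirical P(Yhat = 1 | Y = y_val, A = a_val)."""
    idx = [i for i in range(len(y)) if y[i] == y_val and a[i] == a_val]
    return sum(yhat[i] for i in idx) / len(idx)

def deoo(yhat, y, a):
    # Difference of true positive rates, as in Eq. (1).
    return rate(yhat, y, a, 1, 1) - rate(yhat, y, a, 1, 0)

def deo(yhat, y, a):
    # Two-dimensional difference: (TPR gap, FPR gap).
    return (rate(yhat, y, a, 1, 1) - rate(yhat, y, a, 1, 0),
            rate(yhat, y, a, 0, 1) - rate(yhat, y, a, 0, 0))
```

The α-tolerance constraint then checks |deoo(...)| ≤ α, or element-wise |deo(...)| ⪯ (α₁, α₂).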
For notational simplicity, we use ⪯ for element-wise comparison between vectors: DEO ⪯ (α_1, α_2) if and only if P_{X|A=1,Y=1}(Ŷ = 1) − P_{X|A=0,Y=1}(Ŷ = 1) ≤ α_1 and P_{X|A=1,Y=0}(Ŷ = 1) − P_{X|A=0,Y=0}(Ŷ = 1) ≤ α_2.

Additional Notation. We denote the proportion of group a by p_a := P(A = a) for a ∈ {0, 1}; the proportion of the group Y = 1 conditioned on A by p_{Y,a} := P(Y = 1 | A = a); and the proportion of the group Y = 1 conditioned on A and X by η_a(x) := P(Y = 1 | A = a, X = x). We denote by P_X(x) and P_{X|A=a,Y=y}(x) the distribution function of X and the distribution function of X conditioned on A and Y, respectively. The score function of the standard Bayes-optimal classifier without fairness constraints is defined by ϕ*(x, a) = 1{f*(x, a) > 1/2}, where f* ∈ arg min_f P(Y ≠ 1{f(x, a) > 1/2}). We denote by v_(k) the k-th smallest value of a sequence v (i.e., its k-th value in non-decreasing order). For a set T, sort(T) returns the elements of T in non-decreasing order. For a number a ∈ R, ⌈a⌉ denotes the ceiling function, mapping a to the least integer greater than or equal to a. For a positive integer n, [n] denotes the set {1, 2, ..., n}.

3. FAIREE: A FINITE SAMPLE BASED ALGORITHM

In this section, we propose FaiREE, a general post-processing algorithm that produces a Fair classifier in a finite-sample and distribution-fREE manner and can be applied to a wide range of group fairness notions. We illustrate its use with Equality of Opportunity as an example in this section and discuss further applications in later sections.

3.1. THE GENERAL PIPELINE OF FAIREE

Suppose we have a dataset S = S_{0,0} ∪ S_{0,1} ∪ S_{1,0} ∪ S_{1,1}, where S_{y,a} = {x^{y,a}_1, ..., x^{y,a}_{n_{y,a}}} is the set of features associated with label Y = y ∈ {0, 1} and protected attribute A = a ∈ {0, 1}. We denote the size of S_{y,a} by n_{y,a}. Throughout the paper, we assume that the x^{y,a}_i, i ∈ {1, ..., n_{y,a}}, are independently and identically distributed given Y = y, A = a. We define n = n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1} to be the total number of samples. Our goal is to post-process any given classifier to make it satisfy certain group fairness constraints. FaiREE is a post-processing algorithm that can transform any pre-trained classification score function f so as to satisfy fairness constraints. In particular, FaiREE consists of three main steps: scoring, candidate set construction, and candidate selection. See Figure 2 for an illustration. We note that the procedure of first choosing a candidate set of tuning parameters and then selecting the best one is commonly used in machine learning, for example in the Seldonian algorithm framework to control safety and fairness (Thomas et al., 2019).

Step 1: Scoring. FaiREE takes as input (1) a given fairness guarantee G, such as Equality of Opportunity or Equalized Odds; (2) an error bound α, which controls the violation with respect to the given fairness notion; (3) a small tolerance level δ, which ensures that the final classifier satisfies the requirement with probability at least 1 − δ; and (4) a dataset S. For scoring, we first apply the given classifier f to S_{y,a} and record the outcomes t^{y,a}_i := f(x^{y,a}_i) as scores for each sample. These scores are then sorted within each subset in non-decreasing order to obtain T_{y,a} = {t^{y,a}_{(1)}, ..., t^{y,a}_{(n_{y,a})}}.

Step 2: Candidate Set Construction.
We first present a key observation for this step, which holds for many group fairness notions, including Equality of Opportunity and Equalized Odds (see Section 3.2 for more fairness notions): any classifier can satisfy the fairness constraint with high probability, regardless of the data distribution, by setting the decision threshold appropriately. The insight behind this observation comes from the recent literature on post-processing algorithms and the Neyman-Pearson classification algorithm (Fish et al., 2016; Corbett-Davies et al., 2017; Valera et al., 2018; Menon & Williamson, 2018; Tong et al., 2018; Chzhen et al., 2019). Under Equality of Opportunity, this observation is formalized in Proposition 1; we also establish similar results under other fairness notions in Section 3.2. From this observation, we can compute the probability that a classifier f with a certain threshold satisfies the fairness constraint Diff_G(f) ≤ α, where Diff_G(f) is a generic notation for the violation of f under some fairness notion G. We then take as our candidate set C the classifiers for which P(Diff_G(f) > α) ≤ δ. This candidate set consists of a set of threshold values, with potentially different thresholds for different subpopulations.

Step 3: Candidate Selection. As there might be multiple classifiers satisfying the given fairness constraints, we aim to choose a classifier with a small mis-classification error. To do so, FaiREE estimates the mis-classification error err(f) of each candidate classifier f and chooses the one with the smallest estimated error among the candidate set constructed in the second step. In the rest of this section, as an example, we consider Equality of Opportunity as our target group fairness constraint and present our algorithm in detail.
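The three steps can be sketched as a skeleton, with the notion-specific pieces left as stand-ins (build_candidates and estimate_error are placeholder interfaces we introduce for illustration, not the paper's):

```python
def fairee(f, S, build_candidates, estimate_error):
    """Skeleton of the three FaiREE steps. S maps (y, a) to a list of
    feature vectors; f is a pre-trained score function."""
    # Step 1: score each sample and sort within each (y, a) subset.
    T = {(y, a): sorted(f(x) for x in xs) for (y, a), xs in S.items()}
    # Step 2: keep threshold-index candidates whose fairness-violation
    # bound is at most delta (delegated to the notion-specific routine).
    K = build_candidates(T)
    # Step 3: pick the candidate with the smallest estimated error.
    return min(K, key=lambda cand: estimate_error(cand, T))
```

The later subsections instantiate build_candidates (via Proposition 1) and estimate_error (via Proposition 2) for Equality of Opportunity.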

3.2. APPLICATION TO EQUALITY OF OPPORTUNITY

In this section, we apply FaiREE to the fairness notion Equality of Opportunity. The following two subsections explain the steps Candidate Set Construction and Candidate Selection in detail.

3.2.1. CANDIDATE SET CONSTRUCTION

We first formalize our observation in the following proposition. Using properties of order statistics, it states that choosing the threshold of the score-based classifier from the sorted scores suffices to control the fairness violation in a distribution-free and finite-sample manner. Here, k_{1,a} is the index at which we select the threshold in T_{1,a}.

Proposition 1. Consider k_{1,a} ∈ {1, ..., n_{1,a}} for a ∈ {0, 1} and the score-based classifier ϕ(x, a) = 1{f(x, a) > t^{1,a}_{(k_{1,a})}}. Let

g_1(k, a) = E[ Σ_{j=k}^{n_{1,a}} C(n_{1,a}, j) (Q_{1,1−a} − α)^j (1 − (Q_{1,1−a} − α))^{n_{1,a}−j} ],

where Q_{1,a} ∼ Beta(k_{1,a}, n_{1,a} − k_{1,a} + 1). Then we have

P(|DEOO(ϕ)| > α) ≤ g_1(k_{1,1}, 1) + g_1(k_{1,0}, 0).

Additionally, if t^{1,a}_{(k_{1,a})} is a continuous random variable, the inequality above becomes a tight equality.

Here, g_1 is a function constructed using properties of order statistics so that g_1(k_{1,1}, 1) and g_1(k_{1,0}, 0) upper bound P(DEOO(ϕ) > α) and P(DEOO(ϕ) < −α), respectively. We note that g_1 can be computed efficiently via Monte Carlo simulation; in our experiments, we approximate g_1 with 1000 random draws from the Beta distribution and achieve a satisfactory approximation. This proposition ensures that the DEOO of a given classifier can be controlled with high probability by choosing an appropriate threshold value when post-processing. Based on the above proposition, we build our classifiers for an arbitrary given score function f as follows. We define L(k_{1,0}, k_{1,1}) = g_1(k_{1,1}, 1) + g_1(k_{1,0}, 0). Recall that α is the error tolerance and δ is the tolerance level. Our candidate set is then constructed as K = {(k_{1,0}, k_{1,1}) | L(k_{1,0}, k_{1,1}) ≤ δ}. Before we proceed to the theoretical guarantee for this candidate set, we introduce a bit more notation: we denote the size of the candidate set K by M and its elements by (k^{1,0}_1, k^{1,1}_1), ..., (k^{1,0}_M, k^{1,1}_M).
Additionally, for i = 1, ..., M, we let φ̂_i(x, a) = 1{f(x, a) > t^{1,a}_{(k^{1,a}_i)}}. To ensure that there exists at least one valid classifier (i.e., M ≥ 1), we need E[(Q_{1,0} − α)^{n_{1,0}}] + E[(Q_{1,1} − α)^{n_{1,1}}] ≤ δ, which amounts to a necessary and sufficient lower bound on the sample size, as formalized in the following theorem.

Theorem 1. If min{n_{1,0}, n_{1,1}} ≥ ⌈log δ / (2 log(1 − α))⌉, then for each i ∈ {1, ..., M} in the candidate set, we have |DEOO(φ̂_i)| < α with probability 1 − δ.

As there are at most n_{1,0} n_{1,1} elements in the candidate set, its size M can be as large as O(n²). To further reduce the computational complexity, we now provide a method to shrink the candidate set. Our construction is inspired by the following lemma, which gives the analytical form of the fair Bayes-optimal classifier under the Equality of Opportunity constraint, defined as ϕ*_α = arg min_{|DEOO(ϕ)| ≤ α} P(ϕ(x, a) ≠ Y).

Lemma 1 (Adapted from Theorem E.4 in Zeng et al. (2022)). The fair Bayes-optimal classifier under Equality of Opportunity can be written explicitly as ϕ*_α(x, a) = 1{f*(x, a) > t*_a}, where t*_1 = p_1 p_{Y,1} / (2 p_1 p_{Y,1} − (1/t*_0 − 2) p_0 p_{Y,0}).

Note that in practice the input classifier f is typically trained by a classification algorithm on the training set, which means it is close to f*. From this observation, we can adopt a new way of building a much smaller candidate set. Recall that our original candidate set is K = {(k_{1,0}, k_{1,1}) | L(k_{1,0}, k_{1,1}) ≤ δ} = {(k^{1,0}_1, k^{1,1}_1), ..., (k^{1,0}_M, k^{1,1}_M)}. Now, for every 1 ≤ k ≤ n_{1,0}, motivated by Lemma 1, we define

u_1(k) = arg min_u | t^{1,1}_{(u)} − p̂_1 p̂_{Y,1} / (2 p̂_1 p̂_{Y,1} − (1/t^{1,0}_{(k)} − 2) p̂_0 p̂_{Y,0}) |,

where p̂_a = (n_{1,a} + n_{0,a}) / (n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1}) and p̂_{Y,a} = n_{1,a} / (n_{0,a} + n_{1,a}). We then build the candidate set

K′ = {(k_{1,0}, u_1(k_{1,0})) | L(k_{1,0}, u_1(k_{1,0})) ≤ δ}.   (2)

This candidate set K′ has cardinality at most n. Since our next step, Candidate Selection, has computational complexity linear in the size of the candidate set, using the new set K′ reduces the computational complexity from O(n²) to O(n).
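Under our reading of Proposition 1 (the Beta parameters come from the opposite group's order-statistic index), g_1 can be approximated by Monte Carlo and L is the sum of the two g_1 terms. A hedged sketch with our own function names:

```python
import math
import random

def binom_tail(n, p, k):
    """P(Bin(n, p) >= k), with p clipped to [0, 1]."""
    p = min(max(p, 0.0), 1.0)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def g1_mc(k_own, n_own, k_other, n_other, alpha, n_draws=1000, seed=0):
    """Monte Carlo estimate of g_1: average the Binomial tail with success
    probability Q - alpha, where Q ~ Beta(k_other, n_other - k_other + 1)
    is the distribution of the other group's order statistic (Lemma 4)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        q = rng.betavariate(k_other, n_other - k_other + 1)
        total += binom_tail(n_own, q - alpha, k_own)
    return total / n_draws

def bound_L(k10, k11, n10, n11, alpha):
    # L(k_{1,0}, k_{1,1}) = g_1(k_{1,1}, 1) + g_1(k_{1,0}, 0)
    return (g1_mc(k11, n11, k10, n10, alpha)
            + g1_mc(k10, n10, k11, n11, alpha))
```

A pair (k_{1,0}, k_{1,1}) enters the candidate set when bound_L(...) ≤ δ; K′ restricts k_{1,1} to the matched index u_1(k_{1,0}).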

3.2.2. CANDIDATE SELECTION

In this subsection, we explain in detail how to choose the classifier with the smallest mis-classification error from the candidate set constructed in the previous step. For a given pair (k^{1,0}_i, k^{1,1}_i) in the candidate set (i ∈ [M]), we need the rank of t^{1,0}_{(k^{1,0}_i)} and t^{1,1}_{(k^{1,1}_i)} in the sorted sets T_{0,0} and T_{0,1}, respectively, in order to estimate the test error, which involves both y = 0 and y = 1. Specifically, we find k^{0,a}_i such that t^{0,a}_{(k^{0,a}_i)} ≤ t^{1,a}_{(k^{1,a}_i)} < t^{0,a}_{(k^{0,a}_i + 1)} for a ∈ {0, 1}. To estimate the test mis-classification error of φ̂_i(x, a) = 1{f(x, a) > t^{1,a}_{(k^{1,a}_i)}}, we decompose the error into four terms according to the values of y and a, estimate each term using properties of order statistics, and obtain the following proposition.

Proposition 2. Suppose the density functions of f under A = a, Y = 1 are continuous. Let

ê_i = (k^{1,0}_i / (n_{1,0} + 1)) (n_{1,0}/n) + (k^{1,1}_i / (n_{1,1} + 1)) (n_{1,1}/n) + ((n_{0,0} + 1/2 − k^{0,0}_i) / (n_{0,0} + 1)) (n_{0,0}/n) + ((n_{0,1} + 1/2 − k^{0,1}_i) / (n_{0,1} + 1)) (n_{0,1}/n),

for i = 1, 2, ..., M. Then there exist two constants c_1, c_2 > 0 such that |P(φ̂_i(x, a) ≠ Y) − ê_i| ≤ c_1 / min(n_{0,0}, n_{0,1}) with probability larger than 1 − c_2 exp(−min(n_{0,0}, n_{0,1})).

The above proposition enables us to efficiently estimate the test errors of the M classifiers φ̂_i defined above, from which we choose a classifier φ̂ with the lowest estimated test error. The procedure is summarized in Algorithm 1.

Algorithm 1: FaiREE for Equality of Opportunity
Input: Training data S = S_{0,0} ∪ S_{0,1} ∪ S_{1,0} ∪ S_{1,1}; the error bound α; the tolerance level δ; a given pre-trained classifier f.
1. T_{y,a} = {f(x^{y,a}_1), ..., f(x^{y,a}_{n_{y,a}})}; {t^{y,a}_{(1)}, ..., t^{y,a}_{(n_{y,a})}} = sort(T_{y,a}).
2. Define g_1(k, a) as in Proposition 1 and let L(k_{1,0}, k_{1,1}) = g_1(k_{1,1}, 1) + g_1(k_{1,0}, 0).
3. Build the candidate set K′ as in Eq. (2) and write K′ = {(k^{1,0}_1, k^{1,1}_1), ..., (k^{1,0}_{M′}, k^{1,1}_{M′})}.
4. Find k^{0,0}_i, k^{0,1}_i such that t^{0,0}_{(k^{0,0}_i)} ≤ t^{1,0}_{(k^{1,0}_i)} < t^{0,0}_{(k^{0,0}_i + 1)} and t^{0,1}_{(k^{0,1}_i)} ≤ t^{1,1}_{(k^{1,1}_i)} < t^{0,1}_{(k^{0,1}_i + 1)}.
5. i* ← arg min_{i ∈ [M′]} ê_i (ê_i is defined in Proposition 2).
Output: φ̂(x, a) = 1{f(x, a) > t^{1,a}_{(k^{1,a}_{i*})}}.

In the following, we provide the theory showing that the output of Algorithm 1 approaches the optimal mis-classification error under Equality of Opportunity. The next theorem states that the final output of FaiREE has controlled DEOO and achieves almost the minimum mis-classification error when the input classifier is properly chosen.

Theorem 2. Fix any α′ < α. Set δ = c_0/M for some c_0 > 0, where M is the candidate set size, and suppose min{n_{1,0}, n_{1,1}} ≥ ⌈log δ / (2 log(1 − α))⌉. Let φ̂ be the output of FaiREE. Then: (1) |DEOO(φ̂)| < α with probability 1 − c_0. (2) Suppose the density functions of f and f* under A = a, Y = 1 are continuous. For any δ′, ϵ_0 > 0, there exist 0 < c < 1 and c_1 > 0 such that when the input classifier f satisfies ∥f − f*∥_∞ ≤ ϵ_0 and the constructed candidate set is K′, we have P(φ̂(x, a) ≠ Y) − P(ϕ*_{α′}(x, a) ≠ Y) ≤ 2F*_{(+)}(2ϵ_0) + δ′ with probability larger than 1 − c_1 c^{min{n_{1,0}, n_{0,0}, n_{0,1}}}. (F*_{(+)}(x) is defined in Lemma 6 in the appendix.)

Theorem 2 ensures that our classifier approximates the fair Bayes-optimal classifier when the input classifier is close to f*. Here, α′ < α is any positive constant, adopted to guarantee that our candidate set is non-empty.

Remark: FaiREE requires no assumption on the data distribution except for a minimum sample size. This is an advantage over the existing literature, which generally imposes different distributional assumptions. For example, Chzhen et al. (2019) assume that η(x, a) must surpass the level 1/2 on a set of non-zero measure, and Valera et al. (2018) assume that the shifting threshold of the classifier follows a Beta distribution. Also, the result of Zeng et al. (2022) holds only at the population level; the finite-sample version is not studied.
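Proposition 2 suggests a simple selection routine: locate each threshold's rank among the Y = 0 scores, plug the four indices into ê_i, and take the argmin. A sketch under our own naming (rank_in, e_hat, and select are our helpers, not the paper's code):

```python
import bisect

def rank_in(T0_sorted, t):
    """Largest k with T0_sorted[k-1] <= t (0 if t is below all scores)."""
    return bisect.bisect_right(T0_sorted, t)

def e_hat(k10, k11, k00, k01, n10, n11, n00, n01):
    """Order-statistics estimate of the mis-classification error,
    following our transcription of the formula in Proposition 2."""
    n = n10 + n11 + n00 + n01
    return (k10 / (n10 + 1) * n10 / n
            + k11 / (n11 + 1) * n11 / n
            + (n00 + 0.5 - k00) / (n00 + 1) * n00 / n
            + (n01 + 0.5 - k01) / (n01 + 1) * n01 / n)

def select(candidates, T):
    """candidates: list of (k_{1,0}, k_{1,1}); T[(y, a)]: sorted scores.
    Returns the candidate pair minimizing the estimated error."""
    n10, n11 = len(T[(1, 0)]), len(T[(1, 1)])
    n00, n01 = len(T[(0, 0)]), len(T[(0, 1)])
    return min(candidates, key=lambda ks: e_hat(
        ks[0], ks[1],
        rank_in(T[(0, 0)], T[(1, 0)][ks[0] - 1]),
        rank_in(T[(0, 1)], T[(1, 1)][ks[1] - 1]),
        n10, n11, n00, n01))
```

The selection cost is linear in the candidate set size, which is why shrinking K to K′ reduces the overall complexity from O(n²) to O(n).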

4.1. EQUALIZED ODDS

In this section, we apply our algorithm to the fairness notion Equalized Odds, which involves two fairness constraints simultaneously. To handle both constraints, the algorithm for Equalized Odds differs from Algorithm 1: we must consider all S_{y,a}, instead of only S_{1,a}, when estimating the violation of the fairness constraint in the Candidate Set Construction step. We therefore introduce a function g_0 that handles the data with label Y = 0, complementing the function g_1 defined in the previous section. Analogous to Proposition 1, the following proposition shows that choosing an appropriate threshold during post-processing enables high-probability control of a given classifier's DEO.

Proposition 3. Given k_{1,0}, k_{1,1} with k_{1,a} ∈ {1, ..., n_{1,a}} (a = 0, 1), define ϕ(x, a) = 1{f(x, a) > t^{1,a}_{(k_{1,a})}} and

g_y(k, a) = E[ Σ_{j=k}^{n_{y,a}} C(n_{y,a}, j) (Q_{y,1−a} − α)^j (1 − (Q_{y,1−a} − α))^{n_{y,a}−j} ],

with Q_{y,a} ∼ Beta(k + 1 − y, n_{y,a} − k + y). Then we have

P(|DEO(ϕ)| ⪯ (α, α)) ≥ 1 − g_1(k_{1,1}, 1) − g_1(k_{1,0}, 0) − g_0(k_{0,1}, 1) − g_0(k_{0,0}, 0).

As in Proposition 1, g_0 and g_1 jointly control the probability that ϕ violates the DEO constraint.

Algorithm 2: FaiREE for Equalized Odds

Input:

Training data S = S_{0,0} ∪ S_{0,1} ∪ S_{1,0} ∪ S_{1,1}; the error bound α; the tolerance level δ; a given pre-trained classifier f.
1. T_{y,a} = {f(x^{y,a}_1), ..., f(x^{y,a}_{n_{y,a}})}; {t^{y,a}_{(1)}, ..., t^{y,a}_{(n_{y,a})}} = sort(T_{y,a}).
2. Define g_0(k, a) and g_1(k, a) as in Proposition 3, and let L_1(k_{1,0}, k_{1,1}) = g_1(k_{1,1}, 1) + g_1(k_{1,0}, 0) and L_0(k_{0,0}, k_{0,1}) = g_0(k_{0,1}, 1) + g_0(k_{0,0}, 0).
3. For every (k_{1,0}, k_{1,1}), find k_{0,0}, k_{0,1} such that t^{0,0}_{(k_{0,0})} ≤ t^{1,0}_{(k_{1,0})} < t^{0,0}_{(k_{0,0}+1)} and t^{0,1}_{(k_{0,1})} ≤ t^{1,1}_{(k_{1,1})} < t^{0,1}_{(k_{0,1}+1)}.
4. Build the candidate set K = {(k_{1,0}, k_{1,1}) | L_1(k_{1,0}, k_{1,1}) + L_0(k_{0,0}, k_{0,1}) ≤ δ} = {(k^{1,0}_1, k^{1,1}_1), ..., (k^{1,0}_M, k^{1,1}_M)}.
5. Compute ê_i as in Proposition 2 and let i* = arg min_{i ∈ [M]} ê_i.
Output: φ̂(x, a) = 1{f(x, a) > t^{1,a}_{(k^{1,a}_{i*})}}.

Proposition 3 yields the following guarantee on the DEO of classifiers in the candidate set.

Theorem 3. If min{n_{0,0}, n_{0,1}, n_{1,0}, n_{1,1}} ≥ ⌈log δ / (4 log(1 − α))⌉, then for each i ∈ {1, ..., M}, we have |DEO(φ̂_i)| ⪯ (α, α) with probability 1 − δ.

The theoretical analysis of the test error parallels that for Equality of Opportunity.

Theorem 4. Fix any α′ < α. Set δ = c_0/M for some c_0 > 0, where M is the candidate set size, and suppose min{n_{0,0}, n_{0,1}, n_{1,0}, n_{1,1}} ≥ ⌈log δ / (4 log(1 − α))⌉. Let φ̂ be the final output of FaiREE. Then: (1) |DEO(φ̂)| ⪯ (α, α) with probability 1 − c_0. (2) Suppose the density functions of f* under A = a, Y = 1 are continuous, and denote ϕ*_{α′,α′} = arg min_{|DEO(ϕ)| ⪯ (α′,α′)} P(ϕ(x, a) ≠ Y). For any δ′, ϵ_0 > 0, there exist 0 < c < 1 and c_1 > 0 such that when the input classifier f satisfies |f(x, a) − f*(x, a)| ≤ ϵ_0, we have P(φ̂(x, a) ≠ Y) − P(ϕ*_{α′,α′}(x, a) ≠ Y) ≤ 2F*_{(+)}(2ϵ_0) + δ′ with probability larger than 1 − c_1 c^{min{n_{1,0}, n_{1,1}, n_{0,0}, n_{0,1}}}. (F*_{(+)}(x) is defined in Lemma 6 in the appendix.)
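By analogy with the earlier Monte Carlo sketch for g_1, the g_y of Proposition 3 only changes the Beta parameters with y; the pairing of indices to parameters below follows our reading of the proposition, so treat it as an assumption rather than the paper's exact recipe:

```python
import math
import random

def binom_tail(n, p, k):
    """P(Bin(n, p) >= k), with p clipped to [0, 1]."""
    p = min(max(p, 0.0), 1.0)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def g_y_mc(y, k_own, n_own, k_other, n_other, alpha, n_draws=1000, seed=0):
    """Monte Carlo sketch of g_y: Q ~ Beta(k + 1 - y, n - k + y) for the
    opposite group (for y = 0 this requires k_other < n_other)."""
    rng = random.Random(seed)
    a_param = k_other + 1 - y
    b_param = n_other - k_other + y
    total = 0.0
    for _ in range(n_draws):
        q = rng.betavariate(a_param, b_param)
        total += binom_tail(n_own, q - alpha, k_own)
    return total / n_draws
```

Step 4 of Algorithm 2 then keeps the pairs whose combined bound L_1 + L_0, assembled from four such terms, is at most δ.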

4.2. ON COMPARING DIFFERENT FAIRNESS CONSTRAINTS

In this subsection, we further extend our algorithms to more fairness notions; the detailed technical results and derivations are deferred to Section A.8 in the appendix. Specifically, we compare the sample sizes required for any given score function f to achieve each fairness constraint. Recall that our algorithm is almost assumption-free, requiring only the i.i.d. assumption and a necessary and sufficient condition on the sample size. We therefore summarize the sample size requirements below as a guide to choosing a fairness notion in practice when samples are limited:

Demographic Parity: n_a ≥ ⌈log δ / (2 log(1 − α))⌉
Equality of Opportunity: n_{1,a} ≥ ⌈log δ / (2 log(1 − α))⌉
Predictive Equality: n_{0,a} ≥ ⌈log δ / (2 log(1 − α))⌉
Equalized Odds: n_{y,a} ≥ ⌈log δ / (4 log(1 − α))⌉
Equalized Accuracy: n_{y,a} ≥ ⌈log δ / (4 log((1 − y + (2y − 1) p_{Y,|y−a|} − α) / (y(2p_{Y,a} − 1) + 1 − p_{Y,a})))⌉

From this table, we find that Demographic Parity requires the smallest sample size, Equality of Opportunity and Predictive Equality need a slightly larger sample size, and Equalized Odds requires the largest sample size among the first four fairness notions. The sample size requirement for Equalized Accuracy is similar to that of Equalized Odds, but neither strictly dominates the other.
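Apart from Equalized Accuracy, the bounds above all take the form ⌈log δ / (c log(1 − α))⌉ with c ∈ {2, 4}, so the required per-group sample size is easy to evaluate. A small sketch:

```python
import math

def min_sample_size(alpha, delta, c):
    """Evaluate the per-group bound ceil(log(delta) / (c * log(1 - alpha)));
    c = 2 covers Demographic Parity / Equality of Opportunity / Predictive
    Equality, and c = 4 covers Equalized Odds, per the table above."""
    return math.ceil(math.log(delta) / (c * math.log(1 - alpha)))
```

For example, with α = 0.1 and δ = 0.05, the c = 2 bound asks for at least 15 samples in each relevant subgroup.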

5. EXPERIMENTS

In this section, we conduct experiments to test and understand the effectiveness of FaiREE. For both the synthetic and real data analyses, we compare FaiREE with the following representative methods for fair classification: the Reject-Option-Classification (ROC) method of Kamiran et al. (2012), the Eqodds-Postprocessing (Eq) method of Hardt et al. (2016), the Calibrated Eqodds-Postprocessing (C-Eq) method of Pleiss et al. (2017), and the FairBayes method of Zeng et al. (2022). The first three baselines are designed to cope with Equalized Odds and the last one is for Equality of Opportunity.

5.1. SYNTHETIC DATA

To demonstrate the distribution-free and finite-sample guarantees of FaiREE, we generate synthetic data from mixtures of distributions. Real-world data are generally heavy-tailed (Resnick, 1997); thus, we consider the following models with various heavy-tailed distributions for generating synthetic data.

Model 1. We generate the protected attribute A and label Y with probabilities p_1 = P(A = 1) = 0.7, p_0 = P(A = 0) = 0.3, p_{Y,1} = P(Y = 1 | A = 1) = 0.7, and p_{Y,0} = P(Y = 1 | A = 0) = 0.4. The feature dimension is set to 60, and we generate features with x^{0,0}_{i,j} i.i.d. ∼ t(3), where t(k) denotes the t-distribution with k degrees of freedom; x^{0,1}_{i,j} i.i.d. ∼ χ²_1; x^{1,0}_{i,j} i.i.d. ∼ χ²_3; and x^{1,1}_{i,j} i.i.d. ∼ N(µ, 1), where µ ∼ U(0, 1) and the scale parameter is fixed to 1, for j = 1, 2, ..., 60.

Model 2. We generate the protected attribute A and label Y with the same probabilities, location parameter, and scale parameter as in Model 1. The feature dimension is set to 80, and we generate features with x^{0,0}_{i,j} i.i.d. ∼ t(4), x^{0,1}_{i,j} i.i.d. ∼ χ²_2, x^{1,0}_{i,j} i.i.d. ∼ χ²_4, and x^{1,1}_{i,j} i.i.d. ∼ Laplace(µ, 1), for j = 1, 2, ..., 80.

For each model, we generate 1000 i.i.d. samples; the experimental results are summarized in Tables 2 and 3. From these two tables, we find that our proposed FaiREE, when applied to the fairness notions Equality of Opportunity and Equalized Odds, is able to control the required fairness violation with high probability, while none of the other methods can. In addition, despite satisfying stronger constraints, the mis-classification error of FaiREE is comparable to, and sometimes better than, that of the state-of-the-art methods.
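Model 1 can be generated along the following lines (a sketch with numpy; for simplicity we fix the per-group sample count directly rather than first drawing A and Y from their marginal probabilities):

```python
import numpy as np

def gen_model1(n_per_group, d=60, seed=0):
    """Sketch of Model 1's heavy-tailed features: one (n, d) matrix per
    (y, a) cell. n_per_group is illustrative, not the paper's exact split."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0, 1)  # shared location parameter, mu ~ U(0, 1)
    return {
        (0, 0): rng.standard_t(df=3, size=(n_per_group, d)),   # t(3)
        (0, 1): rng.chisquare(df=1, size=(n_per_group, d)),    # chi^2_1
        (1, 0): rng.chisquare(df=3, size=(n_per_group, d)),    # chi^2_3
        (1, 1): rng.normal(loc=mu, scale=1.0, size=(n_per_group, d)),
    }
```

Model 2 is analogous with d = 80, shifted degrees of freedom, and a Laplace(µ, 1) cell in place of the Gaussian.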

5.2. REAL DATA ANALYSIS

In this section, we apply FaiREE to a real dataset, the Adult Census dataset (Dua et al., 2017), where the task is to predict whether a person's income exceeds $50,000. The protected attribute is gender, and the sample size is 45,222, including 32,561 training samples and 12,661 test samples. To facilitate the numerical study, we randomly split the data into training, calibration, and test sets at each repetition and repeat 500 times. FaiREE is compared with the existing methods described in the previous subsection. Again, as shown in Table 4, the proposed FaiREE method controls the fairness constraints at the desired level and achieves a small mis-classification error. More implementation details and experiments on other benchmark datasets are presented in Section A.9.

A APPENDIX

A.1 PROOF OF PROPOSITION 1

Proof. The classifier is

ϕ(x, a) = 1{f(x, 0) > t^{1,0}_{(k_{1,0})}} if a = 0;  ϕ(x, a) = 1{f(x, 1) > t^{1,1}_{(k_{1,1})}} if a = 1.

We have

|DEOO(ϕ)| = |P(Ŷ = 1 | A = 0, Y = 1) − P(Ŷ = 1 | A = 1, Y = 1)|
= |P(f(x, 0) > t^{1,0}_{(k_{1,0})} | A = 0, Y = 1) − P(f(x, 1) > t^{1,1}_{(k_{1,1})} | A = 1, Y = 1)|
= |1 − F_{1,0}(t^{1,0}_{(k_{1,0})}) − [1 − F_{1,1}(t^{1,1}_{(k_{1,1})})]|
= |F_{1,1}(t^{1,1}_{(k_{1,1})}) − F_{1,0}(t^{1,0}_{(k_{1,0})})|.

Hence,

P(|DEOO(ϕ)| > α) = P(F_{1,1}(t^{1,1}_{(k_{1,1})}) − F_{1,0}(t^{1,0}_{(k_{1,0})}) > α) + P(F_{1,1}(t^{1,1}_{(k_{1,1})}) − F_{1,0}(t^{1,0}_{(k_{1,0})}) < −α) =: A + B.

For the first term, writing 1{·} for the indicator of the event {F_{1,1}(t^{1,1}_{(k_{1,1})}) − α > 0},

A = P(F_{1,0}(t^{1,0}_{(k_{1,0})}) < F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)
≤ E[ P(t^{1,0}_{(k_{1,0})} < F_{1,0}^{-1}(F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)) 1{·} | t^{1,1}_{(k_{1,1})} ]
= E[ P(at least k_{1,0} of the t^{1,0}'s are less than F_{1,0}^{-1}(F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)) 1{·} | t^{1,1}_{(k_{1,1})} ].

It follows that

A ≤ E[ Σ_{j=k_{1,0}}^{n_{1,0}} P(exactly j of the t^{1,0}'s are less than F_{1,0}^{-1}(F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)) 1{·} | t^{1,1}_{(k_{1,1})} ]
= E[ Σ_{j=k_{1,0}}^{n_{1,0}} C(n_{1,0}, j) P(t^{1,0} < F_{1,0}^{-1}(F_{1,1}(t^{1,1}_{(k_{1,1})}) − α))^j (1 − P(t^{1,0} < F_{1,0}^{-1}(F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)))^{n_{1,0}−j} 1{·} | t^{1,1}_{(k_{1,1})} ]
≤ E[ Σ_{j=k_{1,0}}^{n_{1,0}} C(n_{1,0}, j) (F_{1,1}(t^{1,1}_{(k_{1,1})}) − α)^j (1 − (F_{1,1}(t^{1,1}_{(k_{1,1})}) − α))^{n_{1,0}−j} | t^{1,1}_{(k_{1,1})} ].

Similarly, we have

B ≤ E[ Σ_{j=k_{1,1}}^{n_{1,1}} C(n_{1,1}, j) (F_{1,0}(t^{1,0}_{(k_{1,0})}) − α)^j (1 − (F_{1,0}(t^{1,0}_{(k_{1,0})}) − α))^{n_{1,1}−j} | t^{1,0}_{(k_{1,0})} ].

Published as a conference paper at ICLR 2023

Hence,

A + B ≤ E[ Σ_{j=k_{1,0}}^{n_{1,0}} C(n_{1,0}, j) (Q_{1,1} − α)^j (1 − (Q_{1,1} − α))^{n_{1,0}−j} ] + E[ Σ_{j=k_{1,1}}^{n_{1,1}} C(n_{1,1}, j) (Q_{1,0} − α)^j (1 − (Q_{1,0} − α))^{n_{1,1}−j} ].

The last inequality holds because F_{1,a}(t^{1,a}_{(k_{1,a})}) is stochastically dominated by Beta(k_{1,a}, n_{1,a} − k_{1,a} + 1). If t^{1,a} is a continuous random variable, the inequalities hold with equality. This completes the proof.

A.2 PROOF OF LEMMA 1

We first restate the relevant result (Theorem E.4 in Zeng et al. (2022)):

Lemma 2 (Fair Bayes-optimal classifiers under Equality of Opportunity). Let E* = DEOO(f*). For any α > 0, all fair Bayes-optimal classifiers f*_{E,α} under the fairness constraint |DEOO(f)| ≤ α are given as follows:
- When |E*| ≤ α, f*_{E,α} = f*.
- When |E*| > α, suppose P_{X|A=1,Y=1}(η_1(X) = p_1 p_{Y,1} / (2 p_1 p_{Y,1} − t*_{E,α})) = 0; then for all x ∈ X and a ∈ A,

f*_{E,α}(x, a) = 1{ η_a(x) > p_a p_{Y,a} / (2 p_a p_{Y,a} + (1 − 2a) t*_{E,α}) },

where t*_{E,α} = sup{ t : P_{X|A=1,Y=1}(η_1(X) > p_1 p_{Y,1} / (2 p_1 p_{Y,1} − t)) > P_{X|A=0,Y=1}(η_0(X) > p_0 p_{Y,0} / (2 p_0 p_{Y,0} + t)) + (E*/|E*|) α }.

We now return to the proof of Lemma 1.

Proof. From Lemma 2, we have

t*_0 = p_0 p_{Y,0} / (2 p_0 p_{Y,0} + t*_{E,α}),   (3)
t*_1 = p_1 p_{Y,1} / (2 p_1 p_{Y,1} − t*_{E,α}).   (4)

Solving Eq. (3) for t*_{E,α} gives t*_{E,α} = (1/t*_0 − 2) p_0 p_{Y,0}; substituting this into Eq. (4) completes the proof.

A.3 PROOF OF PROPOSITION 2

We first provide a lemma for the mis-classification error of the classifiers in the candidate set.

Lemma 3. |P(φ̂_i(x, a) ≠ Y) − [ (k^{1,0}_i/(n_{1,0}+1)) p_0 p_{Y,0} + (k^{1,1}_i/(n_{1,1}+1)) p_1 p_{Y,1} + ((n_{0,0} + 1/2 − E(k^{0,0}_i))/(n_{0,0}+1)) p_0 (1 − p_{Y,0}) + ((n_{0,1} + 1/2 − E(k^{0,1}_i))/(n_{0,1}+1)) p_1 (1 − p_{Y,1}) ]| ≤ p_0(1 − p_{Y,0})/(2(n_{0,0}+1)) + p_1(1 − p_{Y,1})/(2(n_{0,1}+1)).

We also have the following lemma.

Lemma 4. F_{0,0}(t^{0,0}_{(k_{0,0})}) ∼ Beta(k_{0,0}, n_{0,0} − k_{0,0} + 1) and F_{0,1}(t^{0,1}_{(k_{0,1})}) ∼ Beta(k_{0,1}, n_{0,1} − k_{0,1} + 1).

Proof of Lemma 4. Since F_{0,0} and F_{0,1} are the continuous cumulative distribution functions of the t^{0,0}'s and t^{0,1}'s, we have F_{0,0}(t^{0,0}), F_{0,1}(t^{0,1}) ∼ U(0, 1). Thus F_{0,0}(t^{0,0}_{(k_{0,0})}) is the k_{0,0}-th order statistic of n_{0,0} i.i.d. samples from U(0, 1), and F_{0,1}(t^{0,1}_{(k_{0,1})}) is the k_{0,1}-th order statistic of n_{0,1} i.i.d. samples from U(0, 1). Hence, by the well-known distribution of uniform order statistics, F_{0,0}(t^{0,0}_{(k_{0,0})}) ∼ Beta(k_{0,0}, n_{0,0} − k_{0,0} + 1) and F_{0,1}(t^{0,1}_{(k_{0,1})}) ∼ Beta(k_{0,1}, n_{0,1} − k_{0,1} + 1).
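Lemma 4 is easy to sanity-check numerically: the k-th order statistic of n uniforms is Beta(k, n − k + 1), whose mean is k/(n + 1). A small simulation sketch (our own code, for illustration only):

```python
import random

def mean_kth_order_stat(n, k, trials=20000, seed=1):
    """Empirical mean of the k-th order statistic of n U(0,1) draws;
    by the Beta(k, n - k + 1) law its mean should be k / (n + 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sorted(rng.random() for _ in range(n))[k - 1]
    return total / trials
```

For n = 9 and k = 5 the Beta(5, 5) mean is 5/10 = 0.5, which the simulation recovers up to Monte Carlo error.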

Now we come back to the proof of Lemma 3:

Proof of Lemma 3. The classifier is φ̂(x, a) = 1{f(x, 0) > t^{1,0}_{(k_{1,0})}} for A = 0 and φ̂(x, a) = 1{f(x, 1) > t^{1,1}_{(k_{1,1})}} for A = 1. The mis-classification error is

P(Y ≠ Ŷ) = P(Y = 1, Ŷ = 0) + P(Y = 0, Ŷ = 1)
= P(Y = 1, Ŷ = 0, A = 0) + P(Y = 1, Ŷ = 0, A = 1) + P(Y = 0, Ŷ = 1, A = 0) + P(Y = 0, Ŷ = 1, A = 1)
= E[P(f(x, 0) ≤ t^{1,0}_{(k_{1,0})} | Y = 1, A = 0) | t^{1,0}_{(k_{1,0})}] p_0 p_{Y,0} + E[P(f(x, 1) ≤ t^{1,1}_{(k_{1,1})} | Y = 1, A = 1) | t^{1,1}_{(k_{1,1})}] p_1 p_{Y,1}
+ E[P(f(x, 0) > t^{1,0}_{(k_{1,0})} | Y = 0, A = 0) | t^{1,0}_{(k_{1,0})}] p_0 (1 − p_{Y,0}) + E[P(f(x, 1) > t^{1,1}_{(k_{1,1})} | Y = 0, A = 1) | t^{1,1}_{(k_{1,1})}] p_1 (1 − p_{Y,1})
≤ E[F_{1,0}(t^{1,0}_{(k_{1,0})})] p_0 p_{Y,0} + E[F_{1,1}(t^{1,1}_{(k_{1,1})})] p_1 p_{Y,1} + E[1 − F_{0,0}(t^{0,0}_{(k_{0,0})})] p_0 (1 − p_{Y,0}) + E[1 − F_{0,1}(t^{0,1}_{(k_{0,1})})] p_1 (1 − p_{Y,1})
= (k^{1,0}_i/(n_{1,0} + 1)) p_0 p_{Y,0} + (k^{1,1}_i/(n_{1,1} + 1)) p_1 p_{Y,1} + ((n_{0,0} + 1 − E(k^{0,0}_i))/(n_{0,0} + 1)) p_0 (1 − p_{Y,0}) + ((n_{0,1} + 1 − E(k^{0,1}_i))/(n_{0,1} + 1)) p_1 (1 − p_{Y,1}),

where the inequality replaces the thresholds t^{1,a}_{(k_{1,a})} in the y = 0 terms by the adjacent order statistics t^{0,a}_{(k_{0,a})}, and the last equality follows from Lemma 4 together with the fact that E[Beta(α, β)] = α/(α + β). Similarly, we have

P(Y ≠ Ŷ) ≥ (k^{1,0}_i/(n_{1,0} + 1)) p_0 p_{Y,0} + (k^{1,1}_i/(n_{1,1} + 1)) p_1 p_{Y,1} + ((n_{0,0} − E(k^{0,0}_i))/(n_{0,0} + 1)) p_0 (1 − p_{Y,0}) + ((n_{0,1} − E(k^{0,1}_i))/(n_{0,1} + 1)) p_1 (1 − p_{Y,1}).

Combining the two bounds yields |P(φ̂_i(x, a) ≠ Y) − [ (k^{1,0}_i/(n_{1,0}+1)) p_0 p_{Y,0} + (k^{1,1}_i/(n_{1,1}+1)) p_1 p_{Y,1} + ((n_{0,0} + 1/2 − E(k^{0,0}_i))/(n_{0,0}+1)) p_0 (1 − p_{Y,0}) + ((n_{0,1} + 1/2 − E(k^{0,1}_i))/(n_{0,1}+1)) p_1 (1 − p_{Y,1}) ]| ≤ p_0(1 − p_{Y,0})/(2(n_{0,0}+1)) + p_1(1 − p_{Y,1})/(2(n_{0,1}+1)), which completes the proof of Lemma 3.

Next, we prove Proposition 2, using the following standard lemma.

Lemma 5 (Hoeffding's inequality). Let X_1, ..., X_n be independent random variables with X_i ∈ [m_i, M_i] for every i. Then, for any t > 0,

P( Σ_{i=1}^n (X_i − E X_i) ≥ t ) ≤ exp( −2t² / Σ_{i=1}^n (M_i − m_i)² ).

Proof of Proposition 2. First, note that k^{0,a}_i is the number of t^{0,a}'s below t^{1,a}_{(k^{1,a}_i)}, i.e., k^{0,a}_i = Σ_{j=1}^{n_{0,a}} 1{t^{0,a}_j < t^{1,a}_{(k^{1,a}_i)}}. Thus, for a given ϵ > 0, Hoeffding's inequality gives, with probability 1 − e^{−2 n_{0,a} ϵ²},

(k^{0,a}_i − E(k^{0,a}_i))/n_{0,a} ≤ ϵ,

and likewise (k^{0,a}_i − E(k^{0,a}_i))/n_{0,a} ≥ −ϵ with probability 1 − e^{−2 n_{0,a} ϵ²}. Hence |(k^{0,a}_i − E(k^{0,a}_i))/n_{0,a}| ≤ ϵ with probability 1 − 2e^{−2 n_{0,a} ϵ²}. We then estimate p_a and p_{Y,a} by p̂_a = (n_{1,a} + n_{0,a})/n and p̂_{Y,a} = n_{Y,a}/n, where n is the total number of samples. Here, (n_{1,a} + n_{0,a})/n = (Σ_{i=1}^n 1{Z^a_i = 1})/n and n_{Y,a}/n = (Σ_{i=1}^n 1{Z^{Y,a}_i = 1})/n, where Z^a_i ∼ Bernoulli(p_a) and Z^{Y,a}_i ∼ Bernoulli(p_{Y,a}).
From Hoeffding's inequality (Lemma 5), we have: P(| pa -p a |≥ n 0,a n ϵ) ≤ 2e -2n 0,a ϵ 2 , P(| pY,a -p Y,a |≥ n 0,a n ϵ) ≤ 2e -2n 0,a ϵ 2 Thus, with probability 1 -6e -2n 0,a ϵ 2 , we have:                  | pa -p a | ≤ n 0,a n ϵ | pY,a -p Y,a | ≤ n 0,a n ϵ | k 0,a i -E(k 0,a i ) n 0,a | ≤ ϵ Hence, we have with probability 1 -6(e -2n 0,0 ϵ 2 + e -2n 0,1 ϵ 2 ), |P( φi (x, a) ̸ = Y ) -P( φi (x, a) ̸ = Y )| ≤| k 1,0 i n 1,0 + 1 p 0 p Y,0 + k 1,1 i n 1,1 + 1 p 1 p Y,1 + n 0,0 + 0.5 -E(k 0,0 i ) n 0,0 + 1 p 0 (1 -p Y,0 ) + n 0,1 + 0.5 -E(k 0,1 i ) n 0,1 + 1 p 1 (1 -p Y,1 ) -[ k 1,0 i n 1,0 + 1 p0 pY,0 + k 1,1 i n 1,1 + 1 p1 pY,1 + n 0,0 + 0.5 -k 0,0 i n 0,0 + 1 p0 (1 -pY,0 ) + n 0,1 + 0.5 -k 0,1 i n 0,1 + 1 p1 (1 -pY,1 )]| + p 0 (1 -p Y,0 ) 2(n 0,0 + 1) + p 1 (1 -p Y,1 ) 2(n 0,1 + 1) ≤ϵ[ n 0,0 n k 1,0 i n 1,0 + 1 (p 0 + p Y,0 ) + n 0,1 n k 1,1 i n 1,1 + 1 (p 1 + p Y,1 )] + ϵ 2 ( n 0,0 n k 1,0 i n 1,0 + 1 + n 0,1 n k 1,1 i n 1,1 + 1 ) + ϵ[ n 0,0 n 0,0 + 1 [p 0 + p 0 p Y,0 + n 0,0 n ϵ( n 0,0 n ϵ + p 0 + p Y,0 + 1)] + n 0,1 n 0,1 + 1 [p 1 + p 1 p Y,1 + n 0,1 n ϵ( n 0,1 n ϵ + p 1 + p Y,1 + 1)]] + n 0,0 + 0.5 -E(k 0,0 i ) n 0,0 + 1 n 0,0 n ϵ[ n 0,0 n ϵ + p 0 + p Y,0 + 1] + n 0,1 + 0.5 -E(k 0,1 i ) n 0,1 + 1 n 0,1 n ϵ[ n 0,1 n ϵ + p 1 + p Y,1 + 1] + p 0 (1 -p Y,0 ) 2(n 0,0 + 1) + p 1 (1 -p Y,1 ) 2(n 0,1 + 1) ≤ϵ[ n 0,0 n (p 0 + p Y,0 ) + n 0,1 n (p 1 + p Y,1 )] + ϵ 2 ( n 0,0 n + n 0,1 n ) + ϵ[ n 0,0 n 0,0 + 1 [p 0 + p 0 p Y,0 + n 0,0 n ϵ( n 0,0 n ϵ + p 0 + p Y,0 + 1)] + n 0,1 n 0,1 + 1 [p 1 + p 1 p Y,1 + n 0,1 n ϵ( n 0,1 n ϵ + p 1 + p Y,1 + 1)]] + n 0,0 n ϵ[ n 0,0 n ϵ + p 0 + p Y,0 + 1] + n 0,1 n ϵ[ n 0,1 n ϵ + p 1 + p Y,1 + 1] + p 0 (1 -p Y,0 ) 2(n 0,0 + 1) + p 1 (1 -p Y,1 ) 2(n 0,1 + 1) ≤2ϵ + ϵ 2 + ϵ[ϵ 2 + 4ϵ + 2] + ϵ 2 + 4ϵ + p 0 (1 -p Y,0 ) 2(n 0,0 + 1) + p 1 (1 -p Y,1 ) 2(n 0,1 + 1) =ϵ 3 + 6ϵ 2 + 8ϵ + p 0 (1 -p Y,0 ) 2(n 0,0 + 1) + p 1 (1 -p Y,1 ) 2(n 0,1 + 1) Thus we complete the proof.
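The last equality in the proof of Lemma 3 rests on the classical order-statistic fact (Lemma 4 in the paper): $F(t_{(k)})$ of $n$ i.i.d. draws follows $\mathrm{Beta}(k,n-k+1)$, so its mean is $k/(n+1)$. A minimal numerical sanity check of that fact; the simulation setup and variable names here are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, reps = 50, 10, 20000

# For F = Uniform(0, 1), F(t_(k)) is just the k-th order statistic itself,
# which should follow Beta(k, n - k + 1) with mean k / (n + 1).
samples = np.sort(rng.uniform(size=(reps, n)), axis=1)[:, k - 1]

empirical_mean = samples.mean()
theoretical_mean = k / (n + 1)  # 10 / 51 ~ 0.196
```

The empirical and theoretical means agree up to Monte Carlo error, which is what the $\frac{k^{1,a}_i}{n_{1,a}+1}$ terms in the error expansion rely on.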

A.4 THEOREM FOR THE ORIGINAL CANDIDATE SET K

Sometimes we use the original candidate set K instead of the smaller set K′ in order to approach the optimal accuracy more precisely. We now provide our results for the candidate set K. To facilitate the theoretical analysis, we first introduce the following lemma, which shows that the difference between quantiles can be controlled by the difference between their inputs (i.e., the quantile function cannot increase drastically).

Lemma 6. For a distribution $F$ with a continuous density function, let $q(x)$ denote the quantile of $x$ under $F$. Then for $x>y$ we have $F^{(-)}(x-y)\le q(x)-q(y)\le F^{(+)}(x-y)$, where $F^{(-)}$ and $F^{(+)}$ are monotonically increasing functions with $F^{(-)}(\epsilon)>0$, $F^{(+)}(\epsilon)>0$ for any $\epsilon>0$ and $\lim_{\epsilon\to0}F^{(-)}(\epsilon)=\lim_{\epsilon\to0}F^{(+)}(\epsilon)=0$.

Proof of Lemma 6. Since the domain of $q$ is a closed set and $q$ is continuous, $q$ is uniformly continuous, so we can readily find $F^{(+)}$ satisfying the right-hand side. For $F^{(-)}$, define $F^{(-)}(t)=\inf_x\{q(x+t)-q(x)\}$. Since $q(x+t)-q(x)>0$ for $t>0$ and the domain of $x$ is a closed set, we have $F^{(-)}(\epsilon)>0$ for $\epsilon>0$ and $\lim_{\epsilon\to0}F^{(-)}(\epsilon)=0$. This completes the proof.

Now we provide the following theorem.

Theorem 5. Given $\alpha'<\alpha$, suppose $\min\{n_{1,0},n_{1,1}\}\ge\big\lceil\frac{\log\delta}{2\log(1-\alpha)}\big\rceil$ and let $\hat\varphi$ be the final output of FaiREE. Then:
(1) $|\mathrm{DEOO}(\hat\varphi)|<\alpha$ with probability $(1-\delta)^M$, where $M$ is the size of the candidate set.
(2) Suppose the density functions of $f^*$ conditional on $A=a$, $Y=1$ are continuous. When the input classifier $\hat f$ satisfies $|\hat f(x,a)-f^*(x,a)|\le\epsilon_0$, then for any $\epsilon>0$ such that $F^{*(+)}(\epsilon)\le\frac{\alpha-\alpha'}{2}-F^{*(+)}(2\epsilon_0)$, we have
$$\mathbb{P}(\hat\varphi(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\le 2F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon)+2\epsilon^3+12\epsilon^2+16\epsilon+\frac{p_0(1-p_{Y,0})}{n_{0,0}+1}+\frac{p_1(1-p_{Y,1})}{n_{0,1}+1}$$
with probability $1-(2M+4)(e^{-2n_{0,0}\epsilon^2}+e^{-2n_{0,1}\epsilon^2})-(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}-(1-F^{(-)}_{1,1}(2\epsilon))^{n_{1,1}}$.

Proof of Theorem 5.
Part (1) of the theorem is a direct corollary of Theorem 1; we now prove part (2). The proof has two parts: first, we show that the candidate set contains classifiers close to the fair Bayes-optimal classifier; second, we show that the algorithm selects one of these classifiers with high probability. Suppose the fair Bayes-optimal classifier has the form $\phi^*_{\alpha'}(x,a)=\mathbb{1}\{f^*(x,a)>\lambda^*_a\}$.

For the first part, for any $\epsilon>0$, Lemma 6 implies that each $t^{1,a}$ falls in the interval $[\lambda^*_a-\epsilon,\lambda^*_a+\epsilon]$ with probability at least $F^{(-)}_{1,a}(2\epsilon)>0$, so the probability that there exists $a\in\{0,1\}$ such that all $t^{1,a}$'s fall outside $[\lambda^*_a-\epsilon,\lambda^*_a+\epsilon]$ is at most $(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}+(1-F^{(-)}_{1,1}(2\epsilon))^{n_{1,1}}$. Hence, with probability $1-(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}-(1-F^{(-)}_{1,1}(2\epsilon))^{n_{1,1}}$, for each $a$ there exists some $t^{1,a}_*\in[\lambda^*_a-\epsilon,\lambda^*_a+\epsilon]$; denote the corresponding classifier by $\phi_0(x,a)=\mathbb{1}\{\hat f(x,a)>t^{1,a}_*\}$ and set $\phi^*_0(x,a)=\mathbb{1}\{f^*(x,a)>t^{1,a}_*\}$. The gap between $\phi_0$ and the Bayes-optimal classifier is then small. In detail,
$$\begin{aligned}
\big|\mathbb{P}(\phi_0(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\big|
&\le \big|\mathbb{P}(\phi_0(x,a)\neq Y)-\mathbb{P}(\phi^*_0(x,a)\neq Y)\big|+\big|\mathbb{P}(\phi^*_0(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\big|\\
&\le \mathbb{P}\big(t^{1,a}_*-\epsilon_0\le f^*(x,a)\le t^{1,a}_*+\epsilon_0\big)+\mathbb{P}\big(\min\{t^{1,a}_*,\lambda^*_a\}\le f^*(x,a)\le\max\{t^{1,a}_*,\lambda^*_a\}\big)\\
&\le F^{*(+)}(2\epsilon_0)+F^{*(+)}\big(\max\{t^{1,a}_*,\lambda^*_a\}-\min\{t^{1,a}_*,\lambda^*_a\}\big)\qquad\text{(Lemma 6)}\\
&\le F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon),
\end{aligned}$$
which completes the first part of the proof.

Now we come to the second part. First, note that $\mathrm{DEOO}(\phi_0)$ and $\mathrm{DEOO}(\phi^*_{\alpha'})$ are close to each other:
$$\begin{aligned}
\big||\mathrm{DEOO}(\phi_0)|-|\mathrm{DEOO}(\phi^*_{\alpha'})|\big|
&\le \big||\mathrm{DEOO}(\phi_0)|-|\mathrm{DEOO}(\phi^*_0)|\big|+\big||\mathrm{DEOO}(\phi^*_0)|-|\mathrm{DEOO}(\phi^*_{\alpha'})|\big|\\
&\le 2F^{*(+)}(2\epsilon_0)+\mathbb{P}\big(\min\{t^{1,a}_*,\lambda^*_a\}\le f^*(x,a)\le\max\{t^{1,a}_*,\lambda^*_a\}\big)\qquad\text{(Lemma 6)}\\
&\le 2F^{*(+)}(2\epsilon_0)+F^{*(+)}\big(\max\{t^{1,a}_*,\lambda^*_a\}-\min\{t^{1,a}_*,\lambda^*_a\}\big)\\
&\le 2F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon).
\end{aligned}$$
Thus $|\mathrm{DEOO}(\phi_0)|\le|\mathrm{DEOO}(\phi^*_{\alpha'})|+2F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon)=\alpha'+2F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon)$. If $F^{*(+)}(\epsilon)\le\frac{\alpha-\alpha'}{2}-F^{*(+)}(2\epsilon_0)$, then $|\mathrm{DEOO}(\phi_0)|\le\alpha$, so there exists at least one feasible classifier in the candidate set.

From Lemma 3, the mis-classification error satisfies
$$\bigg|\mathbb{P}(\hat\varphi_i(x,a)\neq Y)-\Big[\frac{k^{1,0}_i}{n_{1,0}+1}p_0p_{Y,0}+\frac{k^{1,1}_i}{n_{1,1}+1}p_1p_{Y,1}+\frac{n_{0,0}+\frac12-\mathbb{E}(k^{0,0}_i)}{n_{0,0}+1}p_0(1-p_{Y,0})+\frac{n_{0,1}+\frac12-\mathbb{E}(k^{0,1}_i)}{n_{0,1}+1}p_1(1-p_{Y,1})\Big]\bigg|\le\frac{p_0(1-p_{Y,0})}{2(n_{0,0}+1)}+\frac{p_1(1-p_{Y,1})}{2(n_{0,1}+1)}.$$
If we can accurately estimate this mis-classification error, then the second part is almost done. To estimate $\mathbb{E}(k^{0,0}_i)$ we simply use $k^{0,0}_i$. As in the proof of Proposition 2, $k^{0,a}_i=\sum_{j=1}^{n_{0,a}}\mathbb{1}\{t^{0,a}_j<t^{1,a}_{(k^{1,a}_i)}\}$, so Hoeffding's inequality gives $\big|\frac{k^{0,a}_i-\mathbb{E}(k^{0,a}_i)}{n_{0,a}}\big|\le\epsilon$ with probability $1-2e^{-2n_{0,a}\epsilon^2}$. We again estimate $p_a$ and $p_{Y,a}$ by $\hat p_a=\frac{n_{1,a}+n_{0,a}}{n}$ and $\hat p_{Y,a}=\frac{n_{Y,a}}{n}$ ($n$ being the total sample size), which are empirical means of $Z^a_i\sim B(1,p_a)$ and $Z^{Y,a}_i\sim B(1,p_{Y,a})$.
From Hoeffding's inequality, we have
$$\mathbb{P}\Big(|\hat p_a-p_a|\ge\frac{n_{0,a}}{n}\epsilon\Big)\le 2e^{-2n_{0,a}\epsilon^2},\qquad \mathbb{P}\Big(|\hat p_{Y,a}-p_{Y,a}|\ge\frac{n_{0,a}}{n}\epsilon\Big)\le 2e^{-2n_{0,a}\epsilon^2}.$$
Thus, with probability $1-6e^{-2n_{0,a}\epsilon^2}$, the events $|\hat p_a-p_a|\le\frac{n_{0,a}}{n}\epsilon$, $|\hat p_{Y,a}-p_{Y,a}|\le\frac{n_{0,a}}{n}\epsilon$ and $\big|\frac{k^{0,a}_i-\mathbb{E}(k^{0,a}_i)}{n_{0,a}}\big|\le\epsilon$ hold simultaneously. Hence, with probability $1-(2M+4)(e^{-2n_{0,0}\epsilon^2}+e^{-2n_{0,1}\epsilon^2})$, for each $i\in\{1,\dots,M\}$, the difference $|\hat{\mathbb{P}}(\hat\varphi_i(x,a)\neq Y)-\mathbb{P}(\hat\varphi_i(x,a)\neq Y)|$ can be bounded term by term exactly as in the proof of Proposition 2.

We also have the following equalities:
$$\frac{1}{\lambda^*_1}=2-\frac{\big(\frac{1}{\lambda^*_0}-2\big)p_0p_{Y,0}}{p_1p_{Y,1}},\qquad \frac{1}{\hat\lambda_1}=2-\frac{\big(\frac{1}{\hat\lambda_0}-2\big)\hat p_0\hat p_{Y,0}}{\hat p_1\hat p_{Y,1}}.$$
Hence,
$$\frac{p_0p_{Y,0}}{\lambda^*_0}+\frac{p_1p_{Y,1}}{\lambda^*_1}=2(p_0p_{Y,0}+p_1p_{Y,1}),\qquad \frac{\hat p_0\hat p_{Y,0}}{\hat\lambda_0}+\frac{\hat p_1\hat p_{Y,1}}{\hat\lambda_1}=2(\hat p_0\hat p_{Y,0}+\hat p_1\hat p_{Y,1}).$$
By subtracting, we obtain
$$\Big(2-\frac{1}{\hat\lambda_1}\Big)\hat p_1\hat p_{Y,1}-\Big(2-\frac{1}{\lambda^*_1}\Big)p_1p_{Y,1}=\Big(\frac{1}{\hat\lambda_0}-2\Big)\hat p_0\hat p_{Y,0}-\Big(\frac{1}{\lambda^*_0}-2\Big)p_0p_{Y,0}.$$
We have
$$\Big(\frac{1}{\hat\lambda_0}-2\Big)\hat p_0\hat p_{Y,0}-\Big(\frac{1}{\lambda^*_0}-2\Big)p_0p_{Y,0}
\le\Big(\frac{1}{\hat\lambda_0}-2\Big)\Big(p_0+\frac{n_{0,0}}{n}\epsilon\Big)\Big(p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big)-\Big(\frac{1}{\lambda^*_0}-2\Big)p_0p_{Y,0}
\le\frac{\epsilon}{\lambda^*_0(\lambda^*_0-\epsilon)}p_0p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big(\frac{1}{\lambda^*_0-\epsilon}-2\Big)\Big(p_0+p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big).$$
Similarly,
$$\Big(\frac{1}{\hat\lambda_0}-2\Big)\hat p_0\hat p_{Y,0}-\Big(\frac{1}{\lambda^*_0}-2\Big)p_0p_{Y,0}\ge-\frac{\epsilon}{\lambda^*_0(\lambda^*_0+\epsilon)}p_0p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big(\frac{1}{\lambda^*_0+\epsilon}-2\Big)\Big(-p_0-p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big);$$
$$\Big(2-\frac{1}{\hat\lambda_1}\Big)\hat p_1\hat p_{Y,1}-\Big(2-\frac{1}{\lambda^*_1}\Big)p_1p_{Y,1}\le\Big(\frac{1}{\lambda^*_1}-\frac{1}{\hat\lambda_1}\Big)p_1p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big(2-\frac{1}{\hat\lambda_1}\Big)\Big(p_1+p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big);$$
$$\Big(2-\frac{1}{\hat\lambda_1}\Big)\hat p_1\hat p_{Y,1}-\Big(2-\frac{1}{\lambda^*_1}\Big)p_1p_{Y,1}\ge\Big(\frac{1}{\lambda^*_1}-\frac{1}{\hat\lambda_1}\Big)p_1p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big(2-\frac{1}{\hat\lambda_1}\Big)\Big(-p_1-p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big).$$
Combining these inequalities yields
$$\hat\lambda_1-\lambda^*_1\le\frac{\hat\lambda_1\lambda^*_1}{p_1p_{Y,1}}\Big[\frac{\epsilon}{\lambda^*_0(\lambda^*_0-\epsilon)}p_0p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big(\frac{1}{\lambda^*_0-\epsilon}-2\Big)\Big(p_0+p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big)-2\frac{n_{0,1}}{n}\epsilon\Big(-p_1-p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big)\Big]$$
and
$$\hat\lambda_1-\lambda^*_1\ge\frac{\hat\lambda_1\lambda^*_1}{p_1p_{Y,1}}\Big[-\frac{\epsilon}{\lambda^*_0(\lambda^*_0+\epsilon)}p_0p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big(\frac{1}{\lambda^*_0+\epsilon}-2\Big)\Big(-p_0-p_{Y,0}+\frac{n_{0,0}}{n}\epsilon\Big)-2\frac{n_{0,1}}{n}\epsilon\Big(p_1+p_{Y,1}+\frac{n_{0,1}}{n}\epsilon\Big)\Big].$$
Thus,
$$|\hat\lambda_1-\lambda^*_1|\le\frac{\frac{\epsilon}{\lambda^*_0(\lambda^*_0-\epsilon)}p_0p_{Y,0}+\epsilon\big(\frac{1}{\lambda^*_0-\epsilon}-2\big)(2+\epsilon)+4\epsilon}{p_1p_{Y,1}}.$$
Combining the above, with probability $1-4e^{-2n_{0,a}\epsilon^2}-(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}$ we have $|\hat p_a-p_a|\le\frac{n_{0,a}}{n}\epsilon$, $|\hat p_{Y,a}-p_{Y,a}|\le\frac{n_{0,a}}{n}\epsilon$, $|\hat\lambda_0-\lambda^*_0|\le\epsilon$, and the bound on $|\hat\lambda_1-\lambda^*_1|$ displayed above. From the proof of Theorem 5: if $F^{*(+)}(\epsilon)\le\frac{\alpha-\alpha'}{2}-F^{*(+)}(2\epsilon_0)$, then with probability $1-(2M+4)(e^{-2n_{0,0}\epsilon^2}+e^{-2n_{0,1}\epsilon^2})-(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}$,
$$\mathbb{P}(\hat\varphi(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\le 2F^{*(+)}(2\epsilon_0)+F^{*(+)}(\epsilon)+F^{*(+)}\Big(\frac{\frac{\epsilon}{\lambda^*_0(\lambda^*_0-\epsilon)}p_0p_{Y,0}+\epsilon\big(\frac{1}{\lambda^*_0-\epsilon}-2\big)(2+\epsilon)+4\epsilon}{p_1p_{Y,1}}\Big)+2\epsilon^3+12\epsilon^2+16\epsilon.$$
Now we complete the proof.

A.6 FAIR BAYES-OPTIMAL CLASSIFIERS UNDER EQUALIZED ODDS

Theorem 6 (Fair Bayes-optimal classifiers under Equalized Odds). Let $EO^\star=\mathrm{DEO}(f^\star)=(E^\star,P^\star)$. For any $\alpha>0$, there exist $0<\alpha_1\le\alpha$ and $0<\alpha_2\le\alpha$ such that all fair Bayes-optimal classifiers $f^\star_{EO,\alpha}$ under the fairness constraint $|\mathrm{DEO}(f)|\preceq(\alpha_1,\alpha_2)$ are given as follows:
• When $|EO^\star|\preceq(\alpha_1,\alpha_2)$, $f^\star_{EO,\alpha}=f^\star$.
• When $E^\star>\alpha_1$ or $P^\star>\alpha_2$, for all $x\in\mathcal X$ and $a\in\mathcal A$, there exist $t^\star_{1,EO,\alpha}$ and $t^\star_{2,EO,\alpha}$ such that
$$f^\star_{EO,\alpha}(x,a)=\mathbb{1}\{\eta_a(x)>T_a\}+a\,\tau^\star_{EO,\alpha}\,\mathbb{1}\{\eta_a(x)=T_a\},\qquad T_a=\frac{p_ap_{Y,a}+(2a-1)\frac{p_{Y,a}}{1-p_{Y,a}}t^\star_{2,EO,\alpha}}{2p_ap_{Y,a}+(2a-1)\big(\frac{p_{Y,a}}{1-p_{Y,a}}t^\star_{2,EO,\alpha}-t^\star_{1,EO,\alpha}\big)}.$$
Here, we assume $\mathbb{P}_{X|A=1,Y=1}(\eta_1(X)=T_1)=\mathbb{P}_{X|A=0,Y=0}(\eta_1(X)=T_1)=0$, and thus $\tau^\star_{EO,\alpha}\in[0,1]$ can be an arbitrary constant.

To prove Theorem 6, we first introduce the Neyman-Pearson lemma.

Lemma 7 (Generalized Neyman-Pearson lemma). Let $f_0,f_1,\dots,f_m$ be $m+1$ real-valued functions defined on a Euclidean space $\mathcal X$, and assume they are $\nu$-integrable for a $\sigma$-finite measure $\nu$. Let $\phi_0$ be any function of the form
$$\phi_0(x)=\begin{cases}1, & f_0(x)>\sum_{i=1}^m c_if_i(x),\\ \gamma(x), & f_0(x)=\sum_{i=1}^m c_if_i(x),\\ 0, & f_0(x)<\sum_{i=1}^m c_if_i(x),\end{cases}$$
where $0\le\gamma(x)\le1$ for all $x\in\mathcal X$. For given constants $t_1,\dots,t_m\in\mathbb{R}$, let $\mathcal T$ be the class of Borel functions $\phi:\mathcal X\to\mathbb{R}$ satisfying
$$\int_{\mathcal X}\phi f_i\,d\nu\le t_i,\quad i=1,2,\dots,m,\qquad(5)$$
and let $\mathcal T_0$ be the set of $\phi$'s in $\mathcal T$ satisfying (5) with all inequalities replaced by equalities. If $\phi_0\in\mathcal T_0$, then $\phi_0\in\operatorname{argmax}_{\phi\in\mathcal T_0}\int_{\mathcal X}\phi f_0\,d\nu$. Moreover, if $c_i\ge0$ for all $i=1,\dots,m$, then $\phi_0\in\operatorname{argmax}_{\phi\in\mathcal T}\int_{\mathcal X}\phi f_0\,d\nu$.

Then we come to prove the theorem.

Proof.
If $|EO^\star|\preceq(\alpha,\alpha)$, we are done, since $f^\star$ is exactly our target classifier. Now assume $|EO^\star|\preceq(\alpha,\alpha)$ does not hold. Let $f$ be a randomized classifier that outputs $\hat Y=1$ with probability $f(x,a)$ given $X=x$ and $A=a$. The mis-classification error of $f$ is
$$R(f)=\mathbb{P}(\hat Y\neq Y)=1-\mathbb{P}(\hat Y=1,Y=1)-\mathbb{P}(\hat Y=0,Y=0)=\mathbb{P}(\hat Y=1,Y=0)-\mathbb{P}(\hat Y=1,Y=1)+\mathbb{P}(Y=1).$$
Thus, minimizing the mis-classification error is equivalent to maximizing $\mathbb{P}(\hat Y=1,Y=1)-\mathbb{P}(\hat Y=1,Y=0)$, which can be expressed as
$$\begin{aligned}
\mathbb{P}(\hat Y=1,Y=1)-\mathbb{P}(\hat Y=1,Y=0)
&=\mathbb{P}_{X|A=1,Y=1}(\hat Y=1)\,p_1p_{Y,1}+\mathbb{P}_{X|A=0,Y=1}(\hat Y=1)\,(1-p_1)p_{Y,0}\\
&\quad-\mathbb{P}_{X|A=1,Y=0}(\hat Y=1)\,p_1(1-p_{Y,1})-\mathbb{P}_{X|A=0,Y=0}(\hat Y=1)\,(1-p_1)(1-p_{Y,0})\\
&=p_1\Big[p_{Y,1}\int_{\mathcal X}f(x,1)\,d\mathbb{P}_{X|A=1,Y=1}(x)-(1-p_{Y,1})\int_{\mathcal X}f(x,1)\,d\mathbb{P}_{X|A=1,Y=0}(x)\Big]\\
&\quad+(1-p_1)\Big[p_{Y,0}\int_{\mathcal X}f(x,0)\,d\mathbb{P}_{X|A=0,Y=1}(x)-(1-p_{Y,0})\int_{\mathcal X}f(x,0)\,d\mathbb{P}_{X|A=0,Y=0}(x)\Big]\\
&=\int_{\mathcal A}\int_{\mathcal X}f(x,a)\,M(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a),
\end{aligned}$$
with
$$M(x,a)=a\,p_1\Big[p_{Y,1}\frac{d\mathbb{P}_{X|A=1,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,1})\frac{d\mathbb{P}_{X|A=1,Y=0}(x)}{d\mathbb{P}_X(x)}\Big]+(1-a)\,p_0\Big[p_{Y,0}\frac{d\mathbb{P}_{X|A=0,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,0})\frac{d\mathbb{P}_{X|A=0,Y=0}(x)}{d\mathbb{P}_X(x)}\Big].\qquad(6)$$
Next, for any classifier $f$, we have
$$\begin{aligned}
\mathrm{DEO}(f)&=\big(\mathbb{P}_{X|A=1,Y=1}(\hat Y=1)-\mathbb{P}_{X|A=0,Y=1}(\hat Y=1),\ \mathbb{P}_{X|A=1,Y=0}(\hat Y=1)-\mathbb{P}_{X|A=0,Y=0}(\hat Y=1)\big)\\
&=\Big(\int_{\mathcal X}f(x,1)\,d\mathbb{P}_{X|A=1,Y=1}(x)-\int_{\mathcal X}f(x,0)\,d\mathbb{P}_{X|A=0,Y=1}(x),\ \int_{\mathcal X}f(x,1)\,d\mathbb{P}_{X|A=1,Y=0}(x)-\int_{\mathcal X}f(x,0)\,d\mathbb{P}_{X|A=0,Y=0}(x)\Big)\\
&=\Big(\int_{\mathcal A}\int_{\mathcal X}f(x,a)H_E(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a),\ \int_{\mathcal A}\int_{\mathcal X}f(x,a)H_P(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a)\Big),
\end{aligned}$$
with
$$H_E(x,a)=\frac{a\,d\mathbb{P}_{X|A=1,Y=1}(x)}{p_1\,d\mathbb{P}_X(x)}-\frac{(1-a)\,d\mathbb{P}_{X|A=0,Y=1}(x)}{p_0\,d\mathbb{P}_X(x)},\qquad H_P(x,a)=\frac{a\,d\mathbb{P}_{X|A=1,Y=0}(x)}{p_1\,d\mathbb{P}_X(x)}-\frac{(1-a)\,d\mathbb{P}_{X|A=0,Y=0}(x)}{p_0\,d\mathbb{P}_X(x)}.\qquad(7)$$
Since
$$\lim_{t_2\to\infty}\frac{p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2}{2p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2-t_1}=\lim_{t_2\to\infty}\frac{p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2}{2p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2+t_1}=1,$$
we have
$$\lim_{t_2\to\infty}\mathbb{P}_{X|A=1,Y=1}\Big(\eta_1(x)>\frac{p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2}{2p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2-t_1}\Big)=\lim_{t_2\to\infty}\mathbb{P}_{X|A=0,Y=1}\Big(\eta_0(x)>\frac{p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2}{2p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2+t_1}\Big)=0,$$
$$\lim_{t_2\to\infty}\mathbb{P}_{X|A=1,Y=0}\Big(\eta_1(x)>\frac{p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2}{2p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2-t_1}\Big)=\lim_{t_2\to\infty}\mathbb{P}_{X|A=0,Y=0}\Big(\eta_0(x)>\frac{p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2}{2p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2+t_1}\Big)=0.$$
So there exist $t^\star_{1,EO,\alpha}$ and $t^\star_{2,EO,\alpha}$ such that $t^\star_{1,EO,\alpha}\frac{E^\star}{|E^\star|}>0$, $t^\star_{2,EO,\alpha}\frac{P^\star}{|P^\star|}>0$, and, writing $T^\star_1=\frac{p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t^\star_{2,EO,\alpha}}{2p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t^\star_{2,EO,\alpha}-t^\star_{1,EO,\alpha}}$ and $T^\star_0=\frac{p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t^\star_{2,EO,\alpha}}{2p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t^\star_{2,EO,\alpha}+t^\star_{1,EO,\alpha}}$,
$$\frac{E^\star}{|E^\star|}\Big[\mathbb{P}_{X|A=1,Y=1}\big(\eta_1(x)>T^\star_1\big)+\tau\,\mathbb{P}_{X|A=1,Y=1}\big(\eta_1(x)=T^\star_1\big)-\mathbb{P}_{X|A=0,Y=1}\big(\eta_0(x)>T^\star_0\big)\Big]=\alpha_1<\alpha,$$
$$\frac{P^\star}{|P^\star|}\Big[\mathbb{P}_{X|A=1,Y=0}\big(\eta_1(x)>T^\star_1\big)+\tau\,\mathbb{P}_{X|A=1,Y=0}\big(\eta_1(x)=T^\star_1\big)-\mathbb{P}_{X|A=0,Y=0}\big(\eta_0(x)>T^\star_0\big)\Big]=\alpha_2<\alpha.$$
We consider the constraint
$$\frac{E^\star}{|E^\star|}\int_{\mathcal A}\int_{\mathcal X}f(x,a)H_E(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a)\le\alpha_1,\qquad \frac{P^\star}{|P^\star|}\int_{\mathcal A}\int_{\mathcal X}f(x,a)H_P(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a)\le\alpha_2.\qquad(8)$$
Let $f$ be the classifier $f_{s_1,s_2,\tau}(x,a)$ of the form given in (9). From (6) and (7), the condition $M(x,a)>s_1\frac{E^\star}{|E^\star|}H_E(x,a)+s_2\frac{P^\star}{|P^\star|}H_P(x,a)$ reads
$$a\,p_1\Big[p_{Y,1}\frac{d\mathbb{P}_{X|A=1,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,1})\frac{d\mathbb{P}_{X|A=1,Y=0}(x)}{d\mathbb{P}_X(x)}\Big]+(1-a)\,p_0\Big[p_{Y,0}\frac{d\mathbb{P}_{X|A=0,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,0})\frac{d\mathbb{P}_{X|A=0,Y=0}(x)}{d\mathbb{P}_X(x)}\Big]$$
$$>\frac{E^\star}{|E^\star|}s_1\Big(\frac{a\,d\mathbb{P}_{X|A=1,Y=1}(x)}{p_1\,d\mathbb{P}_X(x)}-\frac{(1-a)\,d\mathbb{P}_{X|A=0,Y=1}(x)}{p_0\,d\mathbb{P}_X(x)}\Big)+\frac{P^\star}{|P^\star|}s_2\Big(\frac{a\,d\mathbb{P}_{X|A=1,Y=0}(x)}{p_1\,d\mathbb{P}_X(x)}-\frac{(1-a)\,d\mathbb{P}_{X|A=0,Y=0}(x)}{p_0\,d\mathbb{P}_X(x)}\Big),$$
which is equivalent to
$$\begin{cases}p_1\Big[p_{Y,1}\frac{d\mathbb{P}_{X|A=1,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,1})\frac{d\mathbb{P}_{X|A=1,Y=0}(x)}{d\mathbb{P}_X(x)}\Big]>\dfrac{\frac{E^\star}{|E^\star|}s_1\,d\mathbb{P}_{X|A=1,Y=1}(x)+\frac{P^\star}{|P^\star|}s_2\,d\mathbb{P}_{X|A=1,Y=0}(x)}{p_1\,d\mathbb{P}_X(x)}, & a=1,\\[8pt]
p_0\Big[p_{Y,0}\frac{d\mathbb{P}_{X|A=0,Y=1}(x)}{d\mathbb{P}_X(x)}-(1-p_{Y,0})\frac{d\mathbb{P}_{X|A=0,Y=0}(x)}{d\mathbb{P}_X(x)}\Big]>-\dfrac{\frac{E^\star}{|E^\star|}s_1\,d\mathbb{P}_{X|A=0,Y=1}(x)+\frac{P^\star}{|P^\star|}s_2\,d\mathbb{P}_{X|A=0,Y=0}(x)}{p_0\,d\mathbb{P}_X(x)}, & a=0.\end{cases}$$
Thus, with $t_1=2\frac{E^\star}{|E^\star|}s_1$ and $t_2=2\frac{P^\star}{|P^\star|}s_2$,
$$M(x,a)>s_1\frac{E^\star}{|E^\star|}H_E(x,a)+s_2\frac{P^\star}{|P^\star|}H_P(x,a)
\iff\frac{p_{Y,a}\,d\mathbb{P}_{X|A=a,Y=1}(x)}{p_{Y,a}\,d\mathbb{P}_{X|A=a,Y=1}(x)+(1-p_{Y,a})\,d\mathbb{P}_{X|A=a,Y=0}(x)}>\frac{p_a^2p_{Y,a}+(2a-1)\frac{p_{Y,a}}{1-p_{Y,a}}t_2}{2p_a^2p_{Y,a}+(2a-1)\big(\frac{p_{Y,a}}{1-p_{Y,a}}t_2-t_1\big)}
\iff\eta_a(x)>\frac{p_a^2p_{Y,a}+(2a-1)\frac{p_{Y,a}}{1-p_{Y,a}}t_2}{2p_a^2p_{Y,a}+(2a-1)\big(\frac{p_{Y,a}}{1-p_{Y,a}}t_2-t_1\big)}.$$
As a result, $f_{s_1,s_2,\tau}(x,a)$ in (9) can be written as
$$f_{t_1,t_2,\tau}(x,a)=\mathbb{1}\Big\{\eta_a(x)>\frac{p_a^2p_{Y,a}+(2a-1)\frac{p_{Y,a}}{1-p_{Y,a}}t_2}{2p_a^2p_{Y,a}+(2a-1)\big(\frac{p_{Y,a}}{1-p_{Y,a}}t_2-t_1\big)}\Big\}+a\tau\,\mathbb{1}\Big\{\eta_a(x)=\frac{p_a^2p_{Y,a}+(2a-1)\frac{p_{Y,a}}{1-p_{Y,a}}t_2}{2p_a^2p_{Y,a}+(2a-1)\big(\frac{p_{Y,a}}{1-p_{Y,a}}t_2-t_1\big)}\Big\}.\qquad(10)$$
Further, the constraint (8) for $f$ in (10) is equivalent to
$$\frac{E^\star}{|E^\star|}\Big[\mathbb{P}_{X|A=1,Y=1}\big(\eta_1(x)>T_1(t_1,t_2)\big)+\tau\,\mathbb{P}_{X|A=1,Y=1}\big(\eta_1(x)=T_1(t_1,t_2)\big)-\mathbb{P}_{X|A=0,Y=1}\big(\eta_0(x)>T_0(t_1,t_2)\big)\Big]\le\alpha_1,$$
$$\frac{P^\star}{|P^\star|}\Big[\mathbb{P}_{X|A=1,Y=0}\big(\eta_1(x)>T_1(t_1,t_2)\big)+\tau\,\mathbb{P}_{X|A=1,Y=0}\big(\eta_1(x)=T_1(t_1,t_2)\big)-\mathbb{P}_{X|A=0,Y=0}\big(\eta_0(x)>T_0(t_1,t_2)\big)\Big]\le\alpha_2,$$
where $T_1(t_1,t_2)=\frac{p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2}{2p_1^2p_{Y,1}+\frac{p_{Y,1}}{1-p_{Y,1}}t_2-t_1}$ and $T_0(t_1,t_2)=\frac{p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2}{2p_0^2p_{Y,0}-\frac{p_{Y,0}}{1-p_{Y,0}}t_2+t_1}$. Now let $\mathcal T_{\alpha_1,\alpha_2}$ be the class of Borel functions $f$ that satisfy (8), and let $\mathcal T_{\alpha_1,\alpha_2,0}$ be the set of $f$'s in $\mathcal T_{\alpha_1,\alpha_2}$ that satisfy (8) with all inequalities replaced by equalities. From the definition of $t^\star_{1,EO,\alpha}$ and $t^\star_{2,EO,\alpha}$, we have $f_{t^\star_{1,EO,\alpha},t^\star_{2,EO,\alpha},\tau}\in\mathcal T_{\alpha_1,\alpha_2,0}$. Since $F_{0,a}\big(t^{0,a}_{(k^{0,a}+1)}\big)$ is stochastically dominated by $\mathrm{Beta}(k^{0,a}+1,n_{0,a}-k^{0,a})$, we complete the proof.
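Theorem 6 characterizes the Equalized-Odds-optimal classifier through the pair $\mathrm{DEO}(f)=(E,P)$: the TPR gap and the FPR gap between the two groups. A small sketch of how this pair is computed empirically for group-specific thresholds; the function name and the toy data are ours, not from the paper:

```python
import numpy as np

def deo(scores, y, a, t0, t1):
    """Empirical Equalized Odds gap DEO(f) = (E, P): the TPR gap and the
    FPR gap between groups A=1 and A=0, for group thresholds t1 and t0."""
    yhat = np.where(a == 1, scores > t1, scores > t0)
    tpr = lambda g: yhat[(a == g) & (y == 1)].mean()
    fpr = lambda g: yhat[(a == g) & (y == 0)].mean()
    return tpr(1) - tpr(0), fpr(1) - fpr(0)

# Tiny worked example: group 0 has TPR 0.5 / FPR 0.5; group 1 has TPR 1.0 / FPR 0.0.
scores = np.array([0.9, 0.2, 0.1, 0.8, 0.7, 0.6, 0.3, 0.4])
y      = np.array([1,   1,   0,   0,   1,   1,   0,   0])
a      = np.array([0,   0,   0,   0,   1,   1,   1,   1])
e_gap, p_gap = deo(scores, y, a, t0=0.5, t1=0.5)  # (0.5, -0.5)
```

The constraint $|\mathrm{DEO}(f)|\preceq(\alpha_1,\alpha_2)$ then simply requires $|E|\le\alpha_1$ and $|P|\le\alpha_2$ componentwise.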

A.8 ALGORITHMS FOR OTHER GROUP FAIRNESS CONSTRAINTS

In addition to Equality of Opportunity and Equalized Odds, there are other common group fairness constraints, and FaiREE can be extended to them. The input consists of the training data $S=S_{0,0}\cup S_{0,1}\cup S_{1,0}\cup S_{1,1}$, an error bound $\alpha$, a small tolerance level $\delta$, and a classifier $\hat f$. Let $T^{y,a}=\{\hat f(x^{y,a}_1),\dots,\hat f(x^{y,a}_{n_{y,a}})\}$ with sorted values $\{t^{y,a}_{(1)},\dots,t^{y,a}_{(n_{y,a})}\}=\mathrm{sort}(T^{y,a})$, and let $T^y=T^{y,0}\cup T^{y,1}$ with $\{t^y_{(1)},\dots,t^y_{(n_y)}\}=\mathrm{sort}(T^y)$. Define
$$g(k,a)=\mathbb{E}\Big[\sum_{j=k}^{n_a}\binom{n_a}{j}(Q_{1-a}-\alpha)^j\big(1-(Q_{1-a}-\alpha)\big)^{n_a-j}\Big]\quad\text{with } Q_a\sim\mathrm{Beta}(k,n_a-k+1),$$
and $L(k^0,k^1)=g(k^0,0)+g(k^1,1)$. Build the candidate set $K=\{(k^0,k^1)\mid L(k^0,k^1)\le\delta\}=\{(k^0_1,k^1_1),\dots,(k^0_M,k^1_M)\}$, and find the corresponding $k^{y,0}_i$ from the order statistics $t^{y,0}$. For a classifier constructed this way, we have $\mathbb{P}(|DDP(\phi)|>\alpha)\le g(k^0,0)+g(k^1,1)$; if $t^a$ is a continuous random variable, the equality holds. The accuracy guarantee holds with probability $1-(2M+4)(e^{-2n_{1,0}\epsilon^2}+e^{-2n_{1,1}\epsilon^2}+e^{-2n_{0,0}\epsilon^2}+e^{-2n_{0,1}\epsilon^2})$.
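The function $g(k,a)$ above is an expectation of a binomial upper tail with a Beta-distributed success probability, so it can be approximated by Monte Carlo. A hedged sketch of the candidate-set construction; the group indexing and the clipping of $Q-\alpha$ into $[0,1]$ are our reading of the extracted pseudo-code, and the function/variable names are ours:

```python
import numpy as np
from scipy import stats

def g(k, a, n, alpha, n_mc=20000, seed=0):
    """Monte Carlo sketch of g(k, a): E[ sum_{j>=k} C(n_a, j) q^j (1-q)^{n_a-j} ]
    with q = Q_{1-a} - alpha and Q_{1-a} ~ Beta(k, n_{1-a} - k + 1)."""
    rng = np.random.default_rng(seed)
    q = np.clip(rng.beta(k, n[1 - a] - k + 1, size=n_mc) - alpha, 0.0, 1.0)
    # P(Bin(n_a, q) >= k) is the survival function evaluated at k - 1.
    return stats.binom.sf(k - 1, n[a], q).mean()

# Candidate set K = {(k0, k1) : g(k0, 0) + g(k1, 1) <= delta}.
n = (20, 20)            # per-group sample sizes (toy values)
alpha, delta = 0.2, 0.05
K = [(k0, k1)
     for k0 in range(1, n[0] + 1)
     for k1 in range(1, n[1] + 1)
     if g(k0, 0, n, alpha, n_mc=2000) + g(k1, 1, n, alpha, n_mc=2000) <= delta]
```

This is only an illustration of the search over threshold indices; the paper's exact definition of $g$ should be taken from the algorithm statement.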

A.8.2 FAIREE FOR PREDICTIVE EQUALITY

Algorithm 4: FaiREE for Predictive Equality
Input: training data $S=S_{0,0}\cup S_{0,1}\cup S_{1,0}\cup S_{1,1}$; error bound $\alpha$; small tolerance level $\delta$; a classifier $\hat f$.
$T^{y,a}=\{\hat f(x^{y,a}_1),\dots,\hat f(x^{y,a}_{n_{y,a}})\}$; $\{t^{y,a}_{(1)},\dots,t^{y,a}_{(n_{y,a})}\}=\mathrm{sort}(T^{y,a})$.
Define $g_0(k,a)=\mathbb{E}\big[\sum_{j=k}^{n_{0,a}}\binom{n_{0,a}}{j}(Q_{0,1-a}-\alpha)^j(1-(Q_{0,1-a}-\alpha))^{n_{0,a}-j}\big]$ with $Q_{0,a}\sim\mathrm{Beta}(k,n_{0,a}-k+1)$, and $L(k^{0,0},k^{0,1})=g_0(k^{0,0},0)+g_0(k^{0,1},1)$.
Build the candidate set $K=\{(k^{0,0},k^{0,1})\mid L(k^{0,0},k^{0,1})\le\delta\}=\{(k^{0,0}_1,k^{0,1}_1),\dots,(k^{0,0}_M,k^{0,1}_M)\}$.
Find $k^{1,0}_i,k^{1,1}_i$ such that $t^{1,0}_{(k^{1,0}_i)}\le t^{0,0}_{(k^{0,0}_i)}<t^{1,0}_{(k^{1,0}_i+1)}$ and $t^{1,1}_{(k^{1,1}_i)}\le t^{0,1}_{(k^{0,1}_i)}<t^{1,1}_{(k^{1,1}_i+1)}$.
$i^*\leftarrow\operatorname{argmin}_{i\in[M]}\{\hat e_i\}$ ($\hat e_i$ is defined in Proposition 2).
Output: $\hat\varphi(x,a)=\mathbb{1}\{\hat f(x,a)>t^{0,a}_{(k^{0,a}_{i^*})}\}$.

Similar to the algorithm for Equality of Opportunity, we have the following propositions and assumption.

Proposition 5. Given $k^{0,0},k^{0,1}$ satisfying $k^{0,a}\in\{1,\dots,n_{0,a}\}$ ($a=0,1$), define $\phi(x,a)=\mathbb{1}\{\hat f(x,a)>t^{0,a}_{(k^{0,a})}\}$ and $g_0(k,a)=\mathbb{E}\big[\sum_{j=k}^{n_{0,a}}\binom{n_{0,a}}{j}(Q_{0,1-a}-\alpha)^j(1-(Q_{0,1-a}-\alpha))^{n_{0,a}-j}\big]$ with $Q_{0,a}\sim\mathrm{Beta}(k,n_{0,a}-k+1)$. Then $\mathbb{P}(|DPE(\phi)|>\alpha)\le g_0(k^{0,0},0)+g_0(k^{0,1},1)$; if $t^{0,a}$ is a continuous random variable, the equality holds.

Theorem 9. If $\min\{n_{0,0},n_{0,1}\}\ge\big\lceil\frac{\log\delta}{2\log(1-\alpha)}\big\rceil$,



Code is available at https://github.com/lphLeo/FaiREE

$$f_{s_1,s_2,\tau}(x,a)=\begin{cases}1, & M(x,a)>s_1\frac{E^\star}{|E^\star|}H_E(x,a)+s_2\frac{P^\star}{|P^\star|}H_P(x,a);\\ a\tau, & M(x,a)=s_1\frac{E^\star}{|E^\star|}H_E(x,a)+s_2\frac{P^\star}{|P^\star|}H_P(x,a);\\ 0, & M(x,a)<s_1\frac{E^\star}{|E^\star|}H_E(x,a)+s_2\frac{P^\star}{|P^\star|}H_P(x,a).\end{cases}\qquad(9)$$



Figure 1: Comparison of FairBayes and FaiREE on the synthetic data with sample size 1000. See Table 2 for detailed numerical results. Left: DEOO vs. α. Right: DEOO vs. test accuracy. Here, DEOO is the degree of violation of the fairness constraint Equality of Opportunity, and α is the prespecified desired level upper-bounding DEOO for both methods. See Eq. (1) in Section 2 for a more detailed definition.

Related distribution-free techniques have been studied in Giguere et al. (2022); Weber et al. (2022), in the Learn then Test framework for risk control (Angelopoulos et al., 2021), and in high-dimensional statistics (Wang et al., 2022).

Figure 2: A concrete pipeline of FaiREE for Equality of Opportunity. Edges in Step 2 represent the selected candidate pair and the red edge in Step 3 represents the final optimal candidate selected from all the edges. Each pair represents two different thresholds of a single classifier.

$$\begin{aligned}
\mathbb{P}(Y\neq\hat Y)&=\mathbb{P}(Y=1,\hat Y=0,A=0)+\mathbb{P}(Y=1,\hat Y=0,A=1)+\mathbb{P}(Y=0,\hat Y=1,A=0)+\mathbb{P}(Y=0,\hat Y=1,A=1)\\
&=\mathbb{P}(\hat Y=0\mid Y=1,A=0)\,\mathbb{P}(Y=1,A=0)+\mathbb{P}(\hat Y=0\mid Y=1,A=1)\,\mathbb{P}(Y=1,A=1)\\
&\quad+\mathbb{P}(\hat Y=1\mid Y=0,A=0)\,\mathbb{P}(Y=0,A=0)+\mathbb{P}(\hat Y=1\mid Y=0,A=1)\,\mathbb{P}(Y=0,A=1).
\end{aligned}$$

Here $\phi^*_{\alpha'}(x,a)=\mathbb{1}\{f^*(x,a)>\lambda^*_a\}$, and the output classifier of our algorithm is of the form $\hat\varphi(x,a)=\mathbb{1}\{\hat f(x,a)>\hat\lambda_a\}$.

$$f_{t^\star_{1,EO,\alpha},\,t^\star_{2,EO,\alpha},\,\tau}(x,a)\in\operatorname*{argmax}_{f\in\mathcal T_{\alpha_1,\alpha_2}}\int_{\mathcal A}\int_{\mathcal X}f_{t_1,t_2,\tau}(x,a)\,M(x,a)\,d\mathbb{P}_X(x)\,d\mathbb{P}(a).$$

Definition 4 (Demographic Parity). A classifier satisfies Demographic Parity if its prediction $\hat Y$ is statistically independent of the sensitive attribute $A$: $\mathbb{P}(\hat Y=1\mid A=1)=\mathbb{P}(\hat Y=1\mid A=0)$.

Definition 5 (Predictive Equality). A classifier satisfies Predictive Equality if it achieves the same TNR (equivalently, the same FPR) across protected groups: $\mathbb{P}_{X|A=1,Y=0}(\hat Y=1)=\mathbb{P}_{X|A=0,Y=0}(\hat Y=1)$.

Definition 6 (Equalized Accuracy). A classifier satisfies Equalized Accuracy if its mis-classification error is statistically independent of the sensitive attribute $A$: $\mathbb{P}(\hat Y\neq Y\mid A=1)=\mathbb{P}(\hat Y\neq Y\mid A=0)$.

Similar to DEOO, we can define the following measures:
$$DDP=\mathbb{P}_{X|A=1}(\hat Y=1)-\mathbb{P}_{X|A=0}(\hat Y=1),\qquad(12)$$
$$DPE=\mathbb{P}_{X|A=1,Y=0}(\hat Y=1)-\mathbb{P}_{X|A=0,Y=0}(\hat Y=1),\qquad(13)$$
$$DEA=\mathbb{P}(\hat Y\neq Y\mid A=1)-\mathbb{P}(\hat Y\neq Y\mid A=0).\qquad(14)$$

A.8.1 FAIREE FOR DEMOGRAPHIC PARITY

Algorithm 3: FaiREE for Demographic Parity
Input:

for $y\in\{0,1\}$. $i^*\leftarrow\operatorname{argmin}_{i\in[M]}\{\hat e_i\}$ ($\hat e_i$ is defined in Proposition 2).
Output: $\hat\varphi(x,a)=\mathbb{1}\{\hat f(x,a)>t^a_{(k^a_{i^*})}\}$.

Similar to the algorithm for Equality of Opportunity, we have the following propositions and assumption:

Published as a conference paper at ICLR 2023

Proposition 4. Given $k^0,k^1$ satisfying $k^a\in\{1,\dots,n_a\}$ ($a=0,1$), define $\phi(x,a)=\mathbb{1}\{\hat f(x,a)>t^a_{(k^a)}\}$ and $g(k,a)=\mathbb{E}\big[\sum_{j=k}^{n_a}\binom{n_a}{j}(Q_{1-a}-\alpha)^j(1-(Q_{1-a}-\alpha))^{n_a-j}\big]$ with $Q_a\sim\mathrm{Beta}(k,n_a-k+1)$.

If $\min\{n_0,n_1\}\ge\big\lceil\frac{\log\delta}{2\log(1-\alpha)}\big\rceil$, we have $|DDP(\hat\varphi_i)|<\alpha$ with probability $1-\delta$, for each $i\in\{1,\dots,M\}$.

Theorem 8. Given $\alpha'<\alpha$, suppose $\min\{n_0,n_1\}\ge\big\lceil\frac{\log\delta}{2\log(1-\alpha)}\big\rceil$ and let $\hat\varphi$ be the final output of FaiREE. Then:
(1) $|DDP(\hat\varphi)|\le\alpha$ with probability $(1-\delta)^M$, where $M$ is the size of the candidate set.
(2) Suppose the density functions of $f^*$ conditional on $A=a$, $Y=1$ are continuous, and let $\phi^*_{DDP,\alpha}=\operatorname{argmin}_{|DDP(\phi)|\le\alpha}\mathbb{P}(\phi(x,a)\neq Y)$. When the input classifier $\hat f$ satisfies $\|\hat f(x,a)-f^*(x,a)\|_\infty\le\epsilon_0$, then for any $\epsilon>0$ such that $F^{*(+)}(\epsilon)\le\frac{\alpha-\alpha'}{2}-F^{*(+)}(2\epsilon_0)$, we have
$$\big|\mathbb{P}(\hat\varphi(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\big|\le 2F^{*(+)}(2\epsilon_0)+2F^{*(+)}(\epsilon)+2\epsilon^3+12\epsilon^2+16\epsilon.$$

then we have $|DPE(\hat\varphi_i)|<\alpha$ with probability $1-\delta$, for each $i\in\{1,\dots,M\}$.
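The sample-size condition $\min\{n\}\ge\lceil\log\delta/(2\log(1-\alpha))\rceil$ recurs throughout these guarantees and is trivial to evaluate. A one-function sketch (the function name is ours):

```python
import math

def min_group_size(alpha, delta):
    """Smallest per-group sample count satisfying the recurring condition
    min n >= ceil(log(delta) / (2 * log(1 - alpha)))."""
    return math.ceil(math.log(delta) / (2 * math.log(1 - alpha)))

# For example, alpha = 0.1 and delta = 0.05 require only 15 samples per group.
required = min_group_size(0.1, 0.05)  # 15
```

Note that the requirement grows only logarithmically in $1/\delta$, which is why FaiREE remains usable at modest sample sizes.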

Figure 5: DEOO vs. Accuracy and DPE vs. Accuracy for Model 2

Sample complexity requirements for FaiREE to achieve different fairness constraints. We consider the following fairness notions: DP (Demographic Parity), EOO (Equality of Opportunity), EO (Equalized Odds), PE (Predictive Equality), EA (Equalized Accuracy), and n a = n 0,a + n 1,a .

Experimental studies under Model 1. Here $|DEOO|$ denotes the sample average of the absolute value of DEOO defined in Eq. (1), and $|DEOO|_{95}$ denotes the upper 95% sample quantile. $|DPE|$ and $|DPE|_{95}$ are defined similarly for DPE defined in Eq. (13). ACC is the sample average of accuracy. We use "/" in the DPE line because FairBayes and FaiREE-EOO are not designed to control DPE.

Experimental studies under Model 2, with the same notation as Table 2.

Result of different methods on the Adult Census dataset.

Zeng, Edgar Dobriban, and Guang Cheng. Bayes-optimal classifiers under group fairness. arXiv preprint arXiv:2202.09724, 2022.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335-340, 2018.

Hence, from the generalized Neyman-Pearson lemma, we have:

ACKNOWLEDGEMENTS

The research of Linjun Zhang is partially supported by NSF DMS-2015378. The research of James Zou is partially supported by funding from NSF CAREER and the Sloan Fellowship.


Further, substituting the concentration bounds term by term (the intermediate expressions are the same as in the proof of Proposition 2) and combining the two parts together, we have: with probability $1-(2M+4)(e^{-2n_{0,0}\epsilon^2}+e^{-2n_{0,1}\epsilon^2})-(1-F^{(-)}_{1,0}(2\epsilon))^{n_{1,0}}-(1-F^{(-)}_{1,1}(2\epsilon))^{n_{1,1}}$,
$$\mathbb{P}(\hat\varphi(x,a)\neq Y)-\mathbb{P}(\phi^*_{\alpha'}(x,a)\neq Y)\le 2F^{*(+)}(\epsilon)+2F^{*(+)}(2\epsilon_0)+2\epsilon^3+12\epsilon^2+16\epsilon+\frac{p_0(1-p_{Y,0})}{n_{0,0}+1}+\frac{p_1(1-p_{Y,1})}{n_{0,1}+1}.$$
Now we complete the proof.

A.5 PROOF OF THEOREM 2

Proof. Part (1) of the theorem is a direct corollary of Theorem 1; we now prove part (2). It suffices to modify the first part of the proof of Theorem 5; the second part follows the proof of Theorem 5 directly. For the first part, for any $\epsilon>0$, by Lemma 6, each $t^{1,0}$ has a positive probability of falling in $[\lambda^*_0-\epsilon,\lambda^*_0+\epsilon]$, which implies that the probability that all $t^{1,0}$'s fall outside this interval is exponentially small; we denote the selected threshold by $\hat\lambda_0$ and the corresponding classifier by $\mathbb{1}\{\hat f(x,a)>\hat\lambda_a\}$. From the proof of Theorem 5, the stated bound then holds with the probability given above. Now we complete our proof.

A.7 PROOF OF PROPOSITION 3

Proof. The classifier is
From Proposition 1, we have
Also,
And we have
$$\sum_{j=k^{0,0}}\mathbb{P}\big[\text{exactly } j \text{ of the } t^{0,0}\text{'s are less than } F_{0,0}^{-1}\big(F_{0,1}(t^{0,1})\big)\big].$$
Similarly, we have
Hence, we have
, where $M$ is the size of the candidate set.
(2) Suppose the density distribution functions of $f$

A.8.3 FAIREE FOR EQUALIZED ACCURACY

Algorithm 5: FaiREE for Equalized Accuracy
Input: training data $S=S_{0,0}\cup S_{0,1}\cup S_{1,0}\cup S_{1,1}$; error bound $\alpha$; small tolerance level $\delta$; a classifier $\hat f$.
$\{t^{y,a}_{(1)},\dots,t^{y,a}_{(n_{y,a})}\}=\mathrm{sort}(T^{y,a})$.
Define $g_0(k,a)$ with $Q_{0,a}\sim\mathrm{Beta}(k+1,n_{0,a}-k)$, and $L_0(k^{0,0},k^{0,1})=g_0(k^{0,1},1)+g_0(k^{0,0},0)$.
Similar to the algorithm for Equalized Odds, we have the following propositions and assumption, where $\lceil\cdot\rceil$ denotes the ceiling function.
Theorem 11. Under Assumption 1, we have $|DEA(\hat\varphi_i)|\le\alpha$ with probability $1-\delta$, for each $i\in\{1,\dots,M\}$.
Corollary 1. Under Assumption 1, we have $|DEA(\hat\varphi)|\le\alpha$ with probability $(1-\delta)^M$, where $M$ is the size of the candidate set.

A.9 IMPLEMENTATION DETAILS AND ADDITIONAL EXPERIMENTS

From Lemma 2, we adopt a new way of building a much smaller candidate set; the shrunk candidate set for Equality of Opportunity is the set $K'$. Since the Equalized Odds constraint is an extension of Equality of Opportunity, our target classifier should be in $K'$. To select our target classifier, it suffices to additionally require similar false positive rates between the privileged and unprivileged groups; specifically, we choose our final candidate set by imposing this additional condition.

We also ran experiments on other benchmark datasets. First, we apply FaiREE to the German Credit dataset Kamiran & Calders (2009), where the task is to predict whether a bank account holder's credit is good or bad. The protected attribute is gender, and the sample size is 1000, with 800 training samples and 200 test samples. To facilitate the numerical study, we randomly split the data into a training set, a calibration set, and a test set at each repetition and repeat this 500 times. We further generate a synthetic model where the trained classifiers are more informative.

Model 3. We generate the protected attribute A and the label Y with the same probability, location parameter, and scale parameter as in Model 1. The dimension of the features is set to 60, and we generate features with $x^{0,0}_{i,j}\overset{i.i.d.}{\sim}t(1)$, $x^{0,1}_{i,j}\overset{i.i.d.}{\sim}t(4)$, $x^{1,0}_{i,j}\overset{i.i.d.}{\sim}\chi^2_1$ and $x^{1,1}_{i,j}\overset{i.i.d.}{\sim}\chi^2_4$, for $j=1,2,\dots,60$.

Table 7: Experimental studies under Model 3. Here $|DEOO|$ denotes the sample average of the absolute value of DEOO defined in Eq. (1), and $|DEOO|_{95}$ denotes the upper 95% sample quantile. $|DPE|$ and $|DPE|_{95}$ are defined similarly for DPE defined in Eq. (13). ACC is the sample average of accuracy. We use "/" in the DPE line because FairBayes and FaiREE-EOO are not designed to control DPE. The corresponding results are presented in Tables 8, 9, and 10.
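The Model 3 feature distributions described above are straightforward to simulate. A minimal sketch of the group-conditional feature sampling only; the sampling of A and Y (which follows Model 1) is not reproduced here, and the function name and signature are ours:

```python
import numpy as np

def sample_model3_features(n, y, a, d=60, seed=0):
    """Draw n feature vectors for group (Y=y, A=a) under Model 3:
    each of the d = 60 coordinates is i.i.d. t(1), t(4), chi2_1, or chi2_4."""
    rng = np.random.default_rng(seed)
    dists = {(0, 0): ("t", 1), (0, 1): ("t", 4),
             (1, 0): ("chi2", 1), (1, 1): ("chi2", 4)}
    kind, df = dists[(y, a)]
    if kind == "t":
        return rng.standard_t(df, size=(n, d))
    return rng.chisquare(df, size=(n, d))
```

The heavy-tailed t(1) and skewed chi-square coordinates make the trained classifiers more informative than in Model 1, as noted above.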

Eq

From the experimental results, we find that FaiREE compares favorably with these baseline methods in terms of both fairness and accuracy. In particular, the results indicate that while the baseline methods are designed to minimize the fairness violation as much as possible (i.e., they effectively set α = 0), they are unable to control the fairness violation exactly at a desired level α. For example, on the Adult Census dataset, the 95% quantile of the DEOO violations of Fairdecision is 0.078, and those of LAFTR, Meta-cl, and Adv-debias are all above 0.2. Moreover, if we allow FaiREE the same fairness violation in DEOO and DPE, it attains a much higher accuracy (0.845) than these four baseline methods.
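The table statistics quoted here are simple summaries over repetitions. A short sketch of how $|DEOO|$ and the 95% quantile $|DEOO|_{95}$ would be computed from per-repetition DEOO values (the function name and toy numbers are ours):

```python
import numpy as np

def summarize_deoo(deoo_values):
    """Table-style summaries: |DEOO| is the sample average of the absolute
    DEOO across repetitions; |DEOO|_95 is the upper 95% sample quantile."""
    abs_v = np.abs(np.asarray(deoo_values, dtype=float))
    return abs_v.mean(), np.quantile(abs_v, 0.95)

# Toy example with four repetitions.
mean_abs, q95 = summarize_deoo([-0.1, 0.1, 0.2, -0.2])
```

Reporting the upper quantile alongside the mean is what reveals whether a method merely has small violations on average or actually keeps them below α with high probability.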

