A NEAR-OPTIMAL ALGORITHM FOR DEBIASING TRAINED MACHINE LEARNING MODELS

Anonymous

Abstract

We present an efficient and scalable algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. Unlike previous methods based on black-box reductions to cost-sensitive classification, the proposed algorithm operates on models that have already been trained and does not require retraining them. Furthermore, as the algorithm is based on projected stochastic gradient descent (SGD), it is particularly attractive for deep learning applications. We empirically validate the proposed algorithm on standard benchmark datasets, across both classical algorithms and modern DNN architectures, and demonstrate that it outperforms previous post-processing approaches for unbiased classification.

1. INTRODUCTION

Machine learning is increasingly applied to critical decisions that can have a lasting impact on individual lives, such as credit lending (Bruckner, 2018), medical applications (Deo, 2015), and criminal justice (Brennan et al., 2009). Consequently, it is imperative to understand and improve the degree of bias of such automated decision-making. Unfortunately, despite the fact that bias (or "fairness") is a central concept in our society today, it is difficult to define in precise terms. In fact, as people perceive ethical matters differently depending on a plethora of factors, including geographical location and culture (Awad et al., 2018), no universally agreed-upon definition of bias exists. Moreover, the definition of bias may depend on the application and might even be ignored in favor of accuracy when the stakes are high, such as in medical diagnosis (Kleinberg et al., 2017; Ingold and Soper, 2016). As such, it is not surprising that several definitions of "unbiased classification" have been introduced. These include statistical parity (Dwork et al., 2012; Zafar et al., 2017a), equality of opportunity (Hardt et al., 2016), and equalized odds (Hardt et al., 2016; Kleinberg et al., 2017). Unfortunately, such definitions are not generally compatible (Chouldechova, 2017) and some might even be in conflict with calibration (Kleinberg et al., 2017). In addition, because fairness is a societal concept, it does not necessarily translate into a statistical criterion (Chouldechova, 2017; Dixon et al., 2018).

Statistical parity Let X be an instance space and let Y = {0, 1} be the target set in a standard binary classification problem. In the fair classification setting, we further assume the existence of a (possibly randomized) sensitive attribute s : X → [K] = {1, 2, . . . , K}, where s(x) = k if and only if x ∈ X_k for some total partition X = ∪_k X_k. For example, X might correspond to the set of job applicants while s indicates their gender.
Here, the sensitive attribute can be randomized if, for instance, the gender of an applicant is not a deterministic function of the full instance x ∈ X (e.g. number of publications, years of experience, etc.). A commonly used criterion for fairness is then to require similar mean outcomes across the sensitive attribute. This property is well captured by the notion of statistical parity (a.k.a. demographic parity) (Corbett-Davies et al., 2017; Dwork et al., 2012; Zafar et al., 2017a; Mehrabi et al., 2019):

Definition 1 (Statistical Parity). Let X be an instance space and X = ∪_k X_k be a total partition of X. A classifier f : X → {0, 1} satisfies ε-statistical parity across the groups X_1, . . . , X_K if:

max_{k ∈ {1,...,K}} E_x[f(x) | x ∈ X_k] − min_{k ∈ {1,...,K}} E_x[f(x) | x ∈ X_k] ≤ ε

Figure 1: The bias, defined as the absolute difference in mean outcome between genders within different demographic groups, before and after applying the proposed algorithm. Blue bars show the results of the unmodified classifier; orange bars show the results of optimizing for statistical parity with no regard to demographic information; green bars show the results of applying statistical parity on the cross product of gender and ethnicity.

To motivate and further clarify the definition, we showcase empirical results on the Adult benchmark dataset (Blake and Merz, 1998) in Figure 1. When tasked with predicting whether the income of individuals is above $50K per year, all considered classifiers exhibit gender-related bias. One way of removing such bias is to enforce statistical parity across genders. Crucially, however, without taking ethnicity into account, different demographic groups may experience different outcomes. In fact, gender bias can actually increase in some minority groups after enforcing statistical parity. This can be fixed by redefining the sensitive attribute to be the cross product of both gender and ethnicity (green bars).
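To make Definition 1 concrete, the parity gap (the left-hand side of the inequality in the definition) can be estimated from a sample of predictions. The following is a minimal sketch; the helper name `statistical_parity_gap` and the toy data are ours, not from the paper:

```python
import numpy as np

def statistical_parity_gap(predictions, groups):
    """Largest difference in mean positive-prediction rate across groups:
    the left-hand side of the inequality in Definition 1."""
    predictions = np.asarray(predictions, dtype=float)
    groups = np.asarray(groups)
    rates = [predictions[groups == k].mean() for k in np.unique(groups)]
    return max(rates) - min(rates)

# Toy check: a classifier that favors group 0 over group 1.
preds  = np.array([1, 1, 1, 0, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = statistical_parity_gap(preds, groups)  # 0.75 - 0.25 = 0.5
```

A gap of 0 corresponds to exact statistical parity; Definition 1 requires the gap to be at most ε.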
Our main contribution is to present a near-optimal recipe for debiasing models, including deep neural networks, according to Definition 1. Specifically, we formulate the task of debiasing learned models as a regularized optimization problem that is solved efficiently using the projected SGD method. We show how the algorithm produces thresholding rules with randomization near the thresholds, where the width of randomization is controlled by the regularization parameter. We also show that randomization near the threshold is necessary for Bayes risk consistency. While we focus on binary sensitive attributes in our experiments in Section 5, our algorithm and its theoretical guarantees continue to hold for non-binary sensitive attributes as well.

Statement of Contribution

1. We derive a near-optimal post-processing algorithm for debiasing learned models (Section 3).
2. We prove theoretical guarantees for the proposed algorithm, including a proof of correctness and an explicit bound on the excess Bayes risk (Section 4).
3. We empirically validate the proposed algorithm on benchmark datasets across both classical algorithms and modern DNN architectures. Our experiments demonstrate that the proposed algorithm significantly outperforms previous post-processing methods (Section 5).

In Appendix E, we also show how the proposed algorithm can be modified to handle other criteria of bias.

2. RELATED WORK

Algorithms for fair machine learning can be broadly classified into three groups: (1) pre-processing methods, (2) in-processing methods, and (3) post-processing methods (Zafar et al., 2019). Pre-processing algorithms transform the data into a different representation such that any classifier trained on it will not exhibit bias. This includes methods for learning a fair representation (Zemel et al., 2013; Lum and Johndrow, 2016; Bolukbasi et al., 2016; Calmon et al., 2017; Madras et al., 2018; Kamiran and Calders, 2012), label manipulation (Kamiran and Calders, 2009), data augmentation (Dixon et al., 2018), or disentanglement (Locatello et al., 2019). In-processing methods, on the other hand, constrain the behavior of learning algorithms in order to control bias. This includes methods based on adversarial learning (Zhang et al., 2018) and constraint-based classification, such as incorporating constraints on the decision margin (Zafar et al., 2019) or on the features (Grgić-Hlača et al., 2018). Agarwal et al. (2018) showed that the task of learning an unbiased classifier can be reduced to a sequence of cost-sensitive classification problems, a reduction that can be applied to any black-box classifier. One caveat of this approach is that it requires solving a linear program (LP) and retraining classifiers, such as neural networks, many times before convergence. The algorithm we propose in this paper is a post-processing method, an approach that can be justified theoretically (Corbett-Davies et al., 2017; Hardt et al., 2016; Menon and Williamson, 2018; Celis et al., 2019). Fish et al. (2016) and Woodworth et al. (2017) also fall under this category. However, the former only provides generalization guarantees without consistency results, while the latter proposes a two-stage approach that requires changes to the original training algorithm. Kamiran et al. (2012) also propose a post-processing algorithm, called the Reject Option Classifier (ROC), without providing any theoretical guarantees.
In contrast, our algorithm is Bayes consistent and does not alter the original classification method. In Celis et al. (2019) and Menon and Williamson (2018), instance-dependent thresholding rules are also learned. However, our algorithm additionally learns to randomize around the threshold (Figure 2(a)), and this randomization is key to our algorithm both theoretically and experimentally (Appendix C and Section 5). Hardt et al. (2016) learn a randomized post-processing rule, but our proposed algorithm outperforms it in all of our experiments (Section 5). Woodworth et al. (2017) showed that the post-processing approach can sometimes be highly suboptimal. Nevertheless, this result does not contradict the near-optimality of our post-processing rule, because we assume that the original classifier outputs a monotone transformation of some approximation to the posterior probability p(y = 1 | x) (e.g. a margin or softmax output), whereas Woodworth et al. (2017) assumed in their construction that the post-processing rule had access to the binary predictions only. We argue that the proposed algorithm has distinct advantages, particularly for deep neural networks (DNNs). First, stochastic convex optimization methods are well understood and scale well to massive amounts of data (Bottou, 2010), which is often the case in deep learning today. Second, the guarantees provided by our algorithm hold w.r.t. the binary predictions instead of a proxy, such as the margin used in some previous works (Zafar et al., 2017b; 2019). Third, unlike previous reduction methods that require retraining a deep neural network several times until convergence (Agarwal et al., 2018), which can be prohibitively expensive, our algorithm operates on learned models that are trained once and does not require retraining.
Besides developing algorithms for fair classification, several recent works have focused on other related aspects, such as proposing new definitions of fairness, e.g. demographic parity (Dwork et al., 2012; Mehrabi et al., 2019), equalized odds (Hardt et al., 2016), equality of opportunity/disparate mistreatment (Zafar et al., 2017a; Hardt et al., 2016), and individual fairness (Dwork et al., 2012). Recent works have also established several impossibility results related to fair classification, such as those of Kleinberg et al. (2017) and Chouldechova (2017). In our case, we derive a new impossibility result that holds for any deterministic binary classifier and relate it to the task of controlling the covariance between the classifier's predictions and the sensitive attribute (Appendix E).

3. NEAR-OPTIMAL ALGORITHM FOR STATISTICAL PARITY

Notation We reserve boldface letters for random variables (e.g. x), small letters for instances (e.g. x), capital letters for sets (e.g. X), and calligraphic typeface for universal sets (e.g. the instance space X). Given a set S, 1_S(x) ∈ {0, 1} is the characteristic function indicating whether x ∈ S. We denote by [n] the set of integers {1, . . . , n} and write [x]_+ = max{0, x}.

Algorithm Given a classifier f : X → [−1, +1], our goal is to post-process the predictions made by f in order to control the bias with respect to a sensitive attribute s : X → [K] as in Definition 1. To this end, instead of learning a deterministic classifier, we consider randomized prediction rules of the form h : X × [K] × [−1, 1] → [0, 1], where h(x) represents the probability of predicting the positive class given (i) the instance x ∈ X, (ii) the sensitive attribute s(x), and (iii) the classifier's output f(x). As discussed in Appendix B, for a post-processing rule h(x) and for each group X_k ⊆ X, the fairness constraint in Definition 1 can be written as |E_x[h(x) | x ∈ X_k] − ρ| ≤ ε/2, where ρ ∈ [0, 1] is a hyperparameter tuned on a validation dataset. On the other hand, minimizing the probability of altering the predictions of the original classifier can be achieved by maximizing the inner product E_x[h(x) · f(x)]. Instead of optimizing this quantity directly, which would lead to a pure thresholding rule, we minimize the regularized objective (γ/2) E_x[h(x)²] − E_x[h(x) · f(x)] for some regularization parameter γ > 0. This regularization leads to randomization around the threshold, which we show to be critical both theoretically (Section 4 and Appendix C) and experimentally (Section 5).
Using Lagrange duality, we show that the solution reduces to the update rules in Equation 2 with optimization variables {λ_k, µ_k}_{k∈[K]}. The corresponding predictor, which outputs +1 for an instance x in group X_k with probability ĥ_γ(x), is given by:

ĥ_γ(x) = 0,                        if f(x) ≤ λ_k − µ_k
ĥ_γ(x) = (f(x) − λ_k + µ_k)/γ,     if λ_k − µ_k ≤ f(x) ≤ λ_k − µ_k + γ        (1)
ĥ_γ(x) = 1,                        if f(x) ≥ λ_k − µ_k + γ

Update rules To learn these parameters, one can apply the following update rules (Appendix B):

λ_{s(x)} ← max{0, λ_{s(x)} − η (ε/2 + ρ + ∂/∂λ_{s(x)} ξ_γ(f(x) − (λ_{s(x)} − µ_{s(x)})))}
µ_{s(x)} ← max{0, µ_{s(x)} − η (ε/2 − ρ + ∂/∂µ_{s(x)} ξ_γ(f(x) − (λ_{s(x)} − µ_{s(x)})))}        (2)

where, again, ρ ∈ [0, 1] is a hyperparameter tuned on a validation dataset, s : X → [K] is the sensitive attribute, and γ > 0 is a regularization parameter that controls the level of randomization. In addition, the function ξ_γ : R → R_+ is given by:

ξ_γ(w) = (w²/(2γ)) · I{0 ≤ w ≤ γ} + (w − γ/2) · I{w > γ}        (3)

Note that ξ_γ is convex and its derivative ξ'_γ is (1/γ)-Lipschitz continuous; it can be interpreted as a differentiable approximation to the ReLU unit (Nair and Hinton, 2010). A full pseudocode of the proposed algorithm is presented in Appendix A.
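As a concrete illustration, the update rules in Equation 2 and the ramp predictor in Equation 1 can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions, not the authors' reference implementation: the helper names, the synthetic scores, and the hyperparameter values are ours, and we use the fact that the derivative of ξ_γ is ξ'_γ(w) = clip(w/γ, 0, 1):

```python
import numpy as np

def xi_prime(w, gamma):
    # Derivative of xi_gamma in Eq. (3): 0 for w <= 0, w/gamma on [0, gamma], 1 above.
    return np.clip(w / gamma, 0.0, 1.0)

def fit_thresholds(scores, groups, K, gamma, rho, eps=0.0, eta=0.01,
                   epochs=300, seed=0):
    """Projected SGD on the dual variables (lambda_k, mu_k), as in Eq. (2)."""
    rng = np.random.default_rng(seed)
    lam, mu = np.zeros(K), np.zeros(K)
    for _ in range(epochs):
        for i in rng.permutation(len(scores)):
            k = groups[i]
            g = xi_prime(scores[i] - (lam[k] - mu[k]), gamma)
            # d/d(lam) xi_gamma(f - (lam - mu)) = -g and d/d(mu) ... = +g.
            lam[k] = max(0.0, lam[k] - eta * (eps / 2 + rho - g))
            mu[k] = max(0.0, mu[k] - eta * (eps / 2 - rho + g))
    return lam, mu

def predict_proba(scores, groups, lam, mu, gamma):
    # Randomized ramp rule of Eq. (1): probability of predicting +1.
    t = lam[np.asarray(groups)] - mu[np.asarray(groups)]
    return np.clip((np.asarray(scores) - t) / gamma, 0.0, 1.0)

# Synthetic scores: group 0 skews positive, group 1 skews negative.
scores = np.concatenate([np.linspace(-0.5, 1.0, 50), np.linspace(-1.0, 0.5, 50)])
groups = np.array([0] * 50 + [1] * 50)
lam, mu = fit_thresholds(scores, groups, K=2, gamma=0.2, rho=0.5)
probs = predict_proba(scores, groups, lam, mu, gamma=0.2)
```

With ε = 0, the stationarity condition of the updates drives the mean outcome E[ĥ_γ(x) | x ∈ X_k] toward ρ in every group, which closes the parity gap between the two groups.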

4. THEORETICAL ANALYSIS

Next, we analyze the algorithm. Our first theoretical result shows that the prediction rule in Equation 1, learned through the update rules presented in Section 3, satisfies the desired fairness guarantee on the training sample.

Theorem 1 (Correctness). Let ĥ_γ : X → [0, 1] be the randomized predictor in Equation 1 learned by applying the update rules in Equation 2, starting with µ_k = 0, λ_k = 0 for all k ∈ [K], until convergence. Then, ĥ_γ satisfies statistical parity w.r.t. {X_k}_{k∈[K]} on the training sample.

The proof of Theorem 1 is presented in Appendix B. The following guarantee, which holds w.r.t. the underlying data distribution, shows that the randomized prediction rule converges to the Bayes-optimal unbiased classifier if the original classifier f is Bayes consistent.

Figure 2: (a) The learned post-processing decision rule h(x) in Equation 1 as a function of the classifier's score f(x). Randomization is applied when h(x) ∈ (0, 1), which can be controlled using the regularization parameter γ > 0. (b) The value of λ_0 − µ_0 plotted against the number of epochs of the projected SGD method, applied to the output of the random forests classifier trained on the Adult dataset to implement statistical parity with respect to the gender attribute (cf. Section 5 and Figure 1). We observe fast convergence, in agreement with Proposition 1.

The proof of the following theorem (Appendix C) is based on the Lipschitz continuity of the decision rule when γ > 0 and the robustness framework of Xu and Mannor (2010).

Theorem 2. Let h* = arg min_{h∈H} E[I{h(x) ≠ y}], where H is the set of binary predictors on X that satisfy fairness according to Definition 1 for ε > 0. Let ĥ_γ : X → [0, 1] be the randomized learning rule in Equation 1.
If ĥ_γ is trained on a freshly sampled dataset of size N, then there exists a value of ρ ∈ [0, 1] such that the following holds with probability at least 1 − δ:

E[I{ĥ_γ(x) ≠ y}] ≤ E[I{h*(x) ≠ y}] + E|2η(x) − 1 − f(x)| + 2γ + 8(2 + 1/γ)/N^{1/3} + 4 √((2K + 2 log(2/δ))/N),

where η(x) = p(y = 1 | x = x) is the Bayes regressor and K is the number of groups X_k. Consequently, if the original classifier is Bayes consistent and N → ∞, γ → 0⁺, and γN^{1/3} → ∞, then E[I{ĥ_γ(x) ≠ y}] converges in probability to E[I{h*(x) ≠ y}]. Hence, the updates converge to the optimal prediction rule subject to the chosen fairness constraint.

Running time As shown in Appendix B, the update rules in Equation 2 perform projected stochastic gradient descent on the following optimization problem:

min_{(µ_1,λ_1),...,(µ_K,λ_K) ≥ 0} F = E_x[ (ε/2)(λ_{s(x)} + µ_{s(x)}) + ρ (λ_{s(x)} − µ_{s(x)}) + ξ_γ(f(x) − (λ_{s(x)} − µ_{s(x)})) ]        (4)

We assume without loss of generality that f(x) ∈ [−1, 1], since f(x) is assumed to be an estimate of 2η(x) − 1 (see Section 3 and Appendix B), and any thresholding rule over f(x) can be transformed into an equivalent rule over a monotone increasing function of f (e.g. using the hyperbolic tangent).

Proposition 1. Let µ^(0) = λ^(0) = 0 and write µ^(t), λ^(t) ∈ R^K for the values of the optimization variables after t updates of Equation 2 with a fixed learning rate α_t = α. Let µ̄ = (1/T) Σ_{t=1}^T µ^(t) and λ̄ = (1/T) Σ_{t=1}^T λ^(t). Then,

E[F̄] − F* ≤ (1 + ρ + ε)² α/2 + (||µ*||²_2 + ||λ*||²_2)/(2Tα),

where F̄ is the objective in (4) evaluated at the averaged solution (µ̄, λ̄), and F* is its optimal value, attained at (µ*, λ*). In particular, E[F̄] − F* = O(√(K/T)) when α = O(√(K/T)).

The proof is in Appendix D. Hence, the post-processing rule can be computed efficiently. In practice, we observe fast convergence, as shown in Figure 2(b). As shown in Figure 2(a), the hyperparameter γ controls the width of the randomization region around the thresholds. A large value of γ may reduce the accuracy of the classifier.
On the other hand, γ cannot be zero, because randomization around the threshold is, in general, necessary for Bayes risk consistency, as illustrated in the following example:

Example 1 (Randomization is necessary). Suppose that X = {−1, 0, 1} with p(x = −1) = 1/2, p(x = 0) = 1/3, and p(x = 1) = 1/6. Let η(−1) = 0, η(0) = 1/2, and η(1) = 1. In addition, let s ∈ {0, 1} be a sensitive attribute with p(s = 1 | x = −1) = 1/2, p(s = 1 | x = 0) = 1, and p(s = 1 | x = 1) = 0. Then, the Bayes-optimal prediction rule f* subject to statistical parity (ε = 0) satisfies: p(f*(x) = 1 | x = −1) = 0, p(f*(x) = 1 | x = 0) = 7/10, and p(f*(x) = 1 | x = 1) = 1.

Note that the excess Bayes risk bound in Theorem 2 is vacuous when γ = 0. Therefore, γ controls a trade-off that depends on how crucial randomization around the thresholds is (e.g. in k-NN, where the classifier's scores come from a finite set, or in deep neural networks, which tend to produce scores concentrated around {−1, +1}). In our experiments, γ is always chosen using a validation set.
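Example 1 can be verified numerically. The sketch below (exact rational arithmetic; the variable names are ours) checks that the stated randomized rule achieves identical mean outcomes in both sensitive groups, and that it is strictly more accurate than every deterministic rule satisfying exact parity:

```python
from fractions import Fraction as F
from itertools import product

# Example 1: X = {-1, 0, 1} with the stated marginal and conditionals.
p_x  = {-1: F(1, 2), 0: F(1, 3), 1: F(1, 6)}
eta  = {-1: F(0), 0: F(1, 2), 1: F(1)}          # eta(x) = p(y=1 | x)
p_s1 = {-1: F(1, 2), 0: F(1), 1: F(0)}          # p(s=1 | x)
q_star = {-1: F(0), 0: F(7, 10), 1: F(1)}       # claimed optimal rule p(f*=1 | x)

def mean_outcome(q, s_val):
    # E[q(x) | s = s_val] via Bayes rule.
    w = {x: p_x[x] * (p_s1[x] if s_val else 1 - p_s1[x]) for x in p_x}
    return sum(w[x] * q[x] for x in p_x) / sum(w.values())

def accuracy(q):
    # P(prediction agrees with y) under the randomized rule q.
    return sum(p_x[x] * (q[x] * eta[x] + (1 - q[x]) * (1 - eta[x])) for x in p_x)

# The randomized rule satisfies exact statistical parity (epsilon = 0) ...
assert mean_outcome(q_star, 0) == mean_outcome(q_star, 1) == F(2, 5)

# ... and beats every deterministic rule that satisfies exact parity
# (only the two constant rules are fair here).
det = [dict(zip([-1, 0, 1], q)) for q in product([F(0), F(1)], repeat=3)]
det_fair = [q for q in det if mean_outcome(q, 0) == mean_outcome(q, 1)]
best_det = max(accuracy(q) for q in det_fair)
assert accuracy(q_star) == F(5, 6) > best_det == F(2, 3)
```

The only deterministic rules achieving exact parity in this example are the two constant rules, whose best accuracy is 2/3, whereas the randomized rule achieves 5/6.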

5. EMPIRICAL EVALUATION

Experiment Setup We compare against three post-processing methods: (1) the post-processing algorithm of Hardt et al. (2016), (2) the shift inference method, first introduced in (Saerens et al., 2002) and used more recently in (Wang et al., 2020), and (3) the Reject Option Classifier (ROC) (Kamiran et al., 2012). We use the implementation of the algorithm of Hardt et al. (2016) in the Fairlearn software package (Dudik et al., 2020). The training data used for the post-processing methods is always a fresh sample, i.e. different from the data used to train the original classifiers. The value of the hyperparameter θ of the ROC algorithm is chosen in the grid {0.01, 0.02, . . . , 1.0}. When ROC fails, its solution with the minimum bias is reported. In the proposed algorithm, the parameter γ is chosen in the grid {0.01, 0.02, 0.05, 0.1, 0.2, . . . , 1.0}, while ρ is chosen in the grid E[y] ± {0, 0.05, 0.1}. All hyperparameters are selected on a separate validation dataset.

Tabular Data We empirically evaluate the performance of the proposed algorithm and the baselines on two real-world datasets, namely the Adult income dataset and the Default of Credit Card Clients (DCCC) dataset, both taken from the UCI Machine Learning Repository (Blake and Merz, 1998). The Adult dataset contains 48,842 records with 14 attributes each, and the goal is to predict if the income of an individual exceeds $50K per year. The DCCC dataset contains 30,000 records with 24 attributes, and the goal is to predict if a client will default on their credit card payment. Both datasets include sensitive attributes, such as sex and age. In Figure 1 we showcased why, in some cases, the sensitive attribute can be the cross product of multiple features (e.g. religion, gender, and race). In our experiments in this section, we define the sensitive class to be the class of females.
In the DCCC dataset, we additionally introduce bias into the training set for the purpose of the experiment: if s(x) = y(x) we keep the instance, and otherwise we drop it with probability 0.5. We train four classifiers on each dataset: (1) random forests with maximum depth 10, (2) k-NN with k = 10, (3) a two-layer fully connected neural network with 128 hidden nodes, and (4) logistic regression. For the latter, we tune the regularization parameter C over a logarithmically spaced grid between 10^{−4} and 10^{4} using 10-fold cross-validation. The learning rate in our algorithm is fixed to 10^{−1} (K/T)^{1/2}, where T is the number of steps, and we set ε = 0. Table 1 shows the bias and accuracy on test data after applying each post-processing method. The column marked "original" corresponds to the original classifier without any alteration. As shown in the table, both our proposed algorithm and the algorithm of Hardt et al. (2016) eliminate bias in all classifiers. By contrast, the shift inference method does not succeed at controlling statistical parity, while the ROC method can fail when the output of the original classifier is discrete, such as in k-NN, because it does not learn to randomize. Moreover, the proposed algorithm has a much lower impact on test accuracy compared to Hardt et al. (2016) and can even improve it in certain cases. The fact that fairness can sometimes improve accuracy was recently noted by Blum and Stangl (2020). The full trade-off curves between bias and performance are provided in Figure 3.

CelebA Dataset Our second set of experiments builds on the task of predicting the "attractiveness" attribute in the CelebA dataset (Liu et al., 2015). CelebA contains 202,599 images of celebrities annotated with 40 binary attributes, including gender. We use two standard deep neural network architectures, ResNet50 (He et al., 2016) and MobileNet (Howard et al., 2017), trained either from scratch or pretrained on ImageNet. We present the results in Table 2.
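The DCCC bias-injection protocol described above (keep an instance when s(x) = y(x), otherwise drop it with probability 0.5) can be sketched as follows; the function name, seed, and toy data are ours:

```python
import numpy as np

def inject_bias(X, y, s, drop_prob=0.5, seed=0):
    # Keep every instance with s == y; drop each disagreeing instance
    # independently with probability drop_prob.
    rng = np.random.default_rng(seed)
    keep = (s == y) | (rng.random(len(y)) >= drop_prob)
    return X[keep], y[keep], s[keep]

# Toy usage: 1000 instances, half of which have s != y.
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 1], 500)
s = np.zeros(1000, dtype=int)   # s != y exactly when y == 1
Xb, yb, sb = inject_bias(X, y, s)
```

All agreeing instances survive, while roughly half of the disagreeing ones are dropped, so the correlation between the sensitive attribute and the label is strengthened in the resulting training set.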
We observe that the proposed algorithm significantly outperforms the post-processing algorithm of Hardt et al. (2016) and performs at least as well as the ROC algorithm whenever the latter succeeds. Often, however, ROC fails at debiasing the deep neural networks because it does not learn to randomize, while most scores produced by the networks are concentrated around the set {−1, +1}. We investigated the strong performance relative to Hardt et al. (2016) and found that it is due to the specific form of randomization used by the proposed algorithm. As shown in Figure 4, the post-processing algorithm of Hardt et al. (2016) uses a fixed probability when randomizing between two thresholds. For CelebA trained from scratch, for example, the post-processing rule of Hardt et al. (2016) predicts nearly uniformly at random when ResNet50 predicts the negative class for males. In contrast, our algorithm uses a ramp function that takes the confidence of the scores into account. In Figure 4, in particular, the male instances with scores close to −1 are flipped with probability ≈ 0.15, as opposed to ≈ 0.5 in Hardt et al. (2016), and this difference is compensated for by flipping all examples with scores larger than ≈ −0.9 and all female instances with scores less than ≈ 0.9. Hence, less randomization is applied when the original classifier is more confident. Lastly, one important observation in Table 2 is the impact of pretraining: pretraining in our experiments helps achieve a lower test error rate even after eliminating bias. In other words, pretraining seems to reduce the cost of debiasing trained models.

6. CONCLUDING REMARKS

In this paper, we propose a near-optimal post-processing algorithm for debiasing trained machine learning models. The proposed algorithm is scalable, does not require retraining the classifiers, and has a limited impact on the test accuracy. In addition to providing strong theoretical guarantees, we show that it outperforms previous post-processing methods for unbiased classification on standard benchmarks across classical and modern machine learning models.

A FULL ALGORITHM

Algorithm 1: Pseudocode of the Proposed Algorithm for Conditional Statistical Parity.

Data: γ > 0; ρ ∈ [0, 1]; ε ≥ 0; f : X → [−1, +1]; s : X → [K]
Result: Optimal values of the thresholds: (λ_1, µ_1), . . . , (λ_K, µ_K).

Training: Initialize (λ_1, µ_1), . . . , (λ_K, µ_K) to zeros. Then, repeat until convergence:
1. Sample an instance x ∼ p(x).
2. Perform the updates:
λ_{s(x)} ← max{0, λ_{s(x)} − η (ε/2 + ρ + ∂/∂λ_{s(x)} ξ_γ(f(x) − (λ_{s(x)} − µ_{s(x)})))}
µ_{s(x)} ← max{0, µ_{s(x)} − η (ε/2 − ρ + ∂/∂µ_{s(x)} ξ_γ(f(x) − (λ_{s(x)} − µ_{s(x)})))},
where ξ_γ is given by Eq. (3).

Prediction: Given an instance x in the group X_k, predict the label +1 with probability:
h(x) = 0,                           if f(x) ≤ λ_k − µ_k
h(x) = (f(x) − (λ_k − µ_k))/γ,      if λ_k − µ_k ≤ f(x) ≤ λ_k − µ_k + γ
h(x) = 1,                           if f(x) ≥ λ_k − µ_k + γ

B PROOF OF THEOREM 1

B.1 CONSTRAINED CONVEX FORMULATION

Suppose we have a binary classifier on the instance space X. We would like to construct an algorithm for post-processing the predictions made by that classifier such that we control the bias with respect to a set of pairwise disjoint groups X_1, . . . , X_K ⊆ X according to Definition 1. We assume that the output of the classifier f : X → [−1, +1] is an estimate of 2η(x) − 1, where η(x) = p(y = 1 | x = x) is the Bayes regressor. This is not a strong assumption because many algorithms can be calibrated to provide probability scores (Platt et al., 1999; Guo et al., 2017). We consider randomized rules of the form h : X × [K] × [−1, 1] → [0, 1], whose arguments are: (1) the instance x ∈ X, (2) the sensitive attribute s(x) ∈ [K], and (3) the original classifier's score f(x). Because randomization is sometimes necessary, as shown in Section 4, h(x) is the probability of predicting the positive class when the instance is x ∈ X. Suppose we have a training sample of size N, which we denote by S. Let q_i = h(x_i) ∈ [0, 1] for the i-th instance in S.
For each group X_k ⊆ S, the fairness constraint in Definition 1 over the training sample can be written as:

| (1/|X_k|) Σ_{i∈X_k} q_i − ρ | ≤ ε/2,

for some hyperparameter ρ > 0; by the triangle inequality, this implies the constraint of Definition 1. To learn h, we propose solving the following regularized optimization problem:

min_{0 ≤ q_i ≤ 1} Σ_{i=1}^N [ (γ/2) q_i² − f(x_i) q_i ]  s.t.  ∀X_k ∈ G : | Σ_{i∈X_k} q_i − ρ|X_k| | ≤ ε_k        (6)

where γ > 0 is a regularization parameter and ε_k = ε|X_k|/2.

B.2 REDUCTION TO UNCONSTRAINED OPTIMIZATION

Because the groups X_k are pairwise disjoint, the optimization problem in (6) decomposes into K separate sub-problems, one per group X_k. Each sub-problem can be written in the following general form:

min_{0 ≤ q_i ≤ 1} Σ_{i=1}^M [ (γ/2) q_i² − f(x_i) q_i ]  s.t.  Σ_{i=1}^M (z_i q_i − b) ≤ ε,  −Σ_{i=1}^M (z_i q_i − b) ≤ ε

To recall, the tolerance here is ε_k = Mε/2; we drop the subscript k and write it simply as ε below. The Lagrangian is:

L(q, α, β, λ, µ) = Σ_i [ (γ/2) q_i² − f(x_i) q_i ] + λ ( Σ_i (z_i q_i − b) − ε ) − µ ( Σ_i (z_i q_i − b) + ε ) + Σ_i α_i (q_i − 1) − Σ_i β_i q_i

Taking the derivative w.r.t. q_i and setting it to zero gives:

q_i = (1/γ) [ f(x_i) − (λ − µ) z_i − α_i + β_i ]

Plugging this back, the dual problem becomes:

min_{q, λ, µ, α, β} Σ_i (γ/2) q_i² + b(λ − µ) + ε(λ + µ) + Σ_i α_i
s.t. q_i = (1/γ) [ f(x_i) − (λ − µ) z_i − α_i + β_i ],  λ, µ, α_i, β_i ≥ 0

Next, we eliminate variables. Eliminating β_i yields:

min_{q, λ, µ, α} Σ_i (γ/2) q_i² + b(λ − µ) + ε(λ + µ) + Σ_i α_i
s.t. q_i − (1/γ) [ f(x_i) − (λ − µ) z_i − α_i ] ≥ 0,  λ, µ, α_i ≥ 0

Equivalently:

min_{q, λ, µ, α} Σ_i (γ/2) q_i² + b(λ − µ) + ε(λ + µ) + Σ_i α_i
s.t. α_i ≥ f(x_i) − γ q_i − (λ − µ) z_i,  λ, µ, α_i ≥ 0

Next, we eliminate α_i to obtain:

min_{q, λ, µ} Σ_i (γ/2) q_i² + b(λ − µ) + ε(λ + µ) + Σ_i [ f(x_i) − γ q_i − (λ − µ) z_i ]_+,  λ, µ ≥ 0

Finally, we eliminate the variables q_i. For fixed optimal µ and λ, it is straightforward to observe that the minimizer q of (γ/2) q² + [w − γq]_+ must lie in the set {0, w/γ, 1}. In particular, if w/γ ≤ 0, then q = 0, and if w/γ ≥ 1, then q = 1 (here we make use of the fact that γ > 0). So, the optimal value of (γ/2) q² + [w − γq]_+ is:

ξ_γ(w) = 0,           if w/γ ≤ 0
ξ_γ(w) = w²/(2γ),     if 0 ≤ w/γ ≤ 1
ξ_γ(w) = w − γ/2,     if w/γ ≥ 1

From this, the optimization problem reduces to:

min_{λ, µ ≥ 0} M b (λ − µ) + ε (λ + µ) + Σ_{i=1}^M ξ_γ( f(x_i) − (λ − µ) z_i )

This is a differentiable objective that can be minimized quickly using the projected gradient descent method (Boyd and Mutapcic, 2008); the projection step consists of taking the positive parts of λ and µ. With b = ρ, z_i = 1, and ε = Mε/2 (now with ε the tolerance of Definition 1), dividing the objective by M yields per-sample terms ρ(λ − µ) + (ε/2)(λ + µ) + ξ_γ(f(x_i) − (λ − µ)), whose stochastic gradients give the update rules in Algorithm 1. It remains to recover the solution q.
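The elimination of the q_i above claims that min_q (γ/2)q² + [w − γq]_+ has the closed form ξ_γ(w). This can be checked numerically; the sketch below (grid search over q, our variable names) compares the closed form against a brute-force minimum:

```python
import numpy as np

def xi_gamma(w, gamma):
    # Closed form derived above: 0, w^2/(2*gamma), or w - gamma/2.
    if w <= 0:
        return 0.0
    if w <= gamma:
        return w * w / (2 * gamma)
    return w - gamma / 2

gamma = 0.3
q = np.linspace(-1.0, 2.0, 60001)          # dense grid over q
for w in [-0.7, 0.0, 0.1, 0.25, 0.3, 0.9]:
    obj = (gamma / 2) * q**2 + np.maximum(w - gamma * q, 0.0)
    # Brute-force minimum agrees with the closed form up to grid resolution.
    assert abs(obj.min() - xi_gamma(w, gamma)) < 1e-3
```

The minimizer sits at q = 0, q = w/γ, or q = 1 depending on the sign and magnitude of w/γ, exactly as used in the derivation.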
Given λ and µ, the solution q_i is a minimizer of:

(γ/2) q_i² + [ f(x_i) − γ q_i − (λ − µ) z_i ]_+

As stated earlier, the solution is:

q_i = 0,                                  if f(x_i) ≤ (λ − µ) z_i
q_i = (1/γ)( f(x_i) − (λ − µ) z_i ),      if (λ − µ) z_i ≤ f(x_i) ≤ (λ − µ) z_i + γ
q_i = 1,                                  if f(x_i) ≥ (λ − µ) z_i + γ

So, we obtain a ramp function. In the proposed algorithm, we have z_i = 1 and b = ρ for all examples. This proves Theorem 1.

C PROOF OF THEOREM 2 C.1 OPTIMAL UNBIASED PREDICTORS

We begin by proving the following result, which may be of independent interest.

Theorem 3. Let f* = arg min_{f : X → {0,1}} E[I{f(x) ≠ y}] be the Bayes-optimal decision rule subject to group-wise affine constraints of the form E[w_k(x) · f(x) | x ∈ X_k] = b_k for some fixed partition X = ∪_k X_k. If w_k : X → R and b_k ∈ R are such that there exists a constant c ∈ (0, 1) for which p(f(x) = 1) = c satisfies all the affine constraints, then f* satisfies:

p(f*(x) = 1) = I{η(x) > t_k} + τ_k I{η(x) = t_k},

where η(x) = p(y = 1 | x = x) is the Bayes regressor, t_k ∈ [0, 1] is a threshold specific to the group X_k ⊆ X, and τ_k ∈ [0, 1].

Proof. Minimizing the expected misclassification error rate of a classifier f is equivalent to maximizing:

E[f(x) · y + (1 − f(x)) · (1 − y)] = E[ E[f(x) · y + (1 − f(x)) · (1 − y) | x] ] = E[f(x) · (2η(x) − 1)] + E[1 − η(x)]

Hence, selecting f to minimize the misclassification error rate is equivalent to maximizing E[f(x) · (2η(x) − 1)]. Instead of maximizing this directly, we consider the regularized form first. Writing g(x) = 2η(x) − 1, the optimization problem is:

min_{0 ≤ f(x) ≤ 1} (γ/2) E[f(x)²] − E[f(x) · g(x)]  s.t.  E[w(x) · f(x)] = b

Here, we focus on a single subset X_k because the optimization problem decomposes into K separate problems, one for each X_k. If there exists a constant c ∈ (0, 1) such that f(x) = c satisfies all the equality constraints, then Slater's condition is met, so strong duality holds (Boyd and Vandenberghe, 2004). The Lagrangian is:

(γ/2) E[f(x)²] − E[f(x) · g(x)] + µ ( E[w(x) · f(x)] − b ) + E[α(x)(f(x) − 1)] − E[β(x) f(x)],

where α(x), β(x) ≥ 0 and µ ∈ R are the dual variables. Taking the derivative w.r.t. the optimization variable f(x) yields:

γ f(x) = g(x) − µ w(x) − α(x) + β(x)        (10)

Therefore, the dual problem becomes:

max_{α(x), β(x) ≥ 0} −(2γ)^{−1} E[(g(x) − µ w(x) − α(x) + β(x))²] − bµ − E[α(x)]

Using the substitution in Eq. (10), we rewrite it as:

min_{α(x), β(x) ≥ 0} (γ/2) E[f(x)²] + bµ + E[α(x)]  s.t.  ∀x ∈ X : γ f(x) = g(x) − µ w(x) − α(x) + β(x)

Next, we eliminate the multiplier β(x) by replacing the equality constraint with an inequality:

min_{α(x) ≥ 0} (γ/2) E[f(x)²] + bµ + E[α(x)]  s.t.  ∀x ∈ X : g(x) − γ f(x) − µ w(x) − α(x) ≤ 0

Finally, since α(x) ≥ 0 and α(x) ≥ g(x) − γ f(x) − µ w(x), the optimal solution is the minimizer of:

min_{f : X → R} (γ/2) E[f(x)²] + bµ + E[ max{0, g(x) − γ f(x) − µ w(x)} ]

Next, let µ* be the optimal value of the dual variable µ. The optimization problem over f then decomposes into separate problems, one for each x ∈ X:

f(x) = arg min_{τ ∈ R} (γ/2) τ² + [ g(x) − γτ − µ* w(x) ]_+

Using the same argument as in Appendix B, we deduce that f(x) is of the form:

f(x) = 0,                              if g(x) − µ* w(x) ≤ 0
f(x) = 1,                              if g(x) − µ* w(x) ≥ γ
f(x) = (1/γ)( g(x) − µ* w(x) ),        otherwise

Finally, the statement of the theorem follows by taking the limit as γ → 0⁺.

C.2 EXCESS RISK BOUND

In this section, we write D to denote the underlying probability distribution and S to denote the uniform distribution over the training sample (i.e., the empirical distribution). The parameter ρ stated in the theorem is given by:

ρ = (1/2) [ max_{k ∈ {1,...,K}} E_x[h*(x) | x ∈ X_k] + min_{k ∈ {1,...,K}} E_x[h*(x) | x ∈ X_k] ]

Note that, by definition, the optimal classifier h* that satisfies statistical parity also satisfies the constraint in (6) with this choice of ρ. Hence, with this choice of ρ, h* remains optimal among all possible classifiers.

Observe that the decision rule depends on x only via f(x) ∈ [-1, +1]. Hence, we write z = f(x). Since the thresholds are learned on a fresh sample of data, the random variables z_i are i.i.d. In light of Eq. (9), we would like to minimize the expectation of the loss l(ĥ_γ, x) = -f(x)·ĥ_γ(x) = -z·q(z) =: ζ(z) for some function q : [-1, +1] → [0, 1] of the form shown in Figure 2(a). Note that ζ is 2(1 + 1/γ)-Lipschitz continuous within the same group and sensitive class. This is because the thresholds can be restricted to the interval [-1 - γ, 1 + γ]; moving a threshold beyond this interval would not change the decision rule. Let ĥ_γ be the decision rule learned by the algorithm. Using Corollary 5 in (Xu and Mannor, 2010), we conclude that with probability at least 1 - δ:

E_D[l(ĥ_γ, x)] - E_S[l(ĥ_γ, x)] ≤ inf_{R ≥ 1} { (4/R)(1 + 1/γ) + 2 √( (2(R + K) log 2 + 2 log(1/δ)) / N ) }

Here, we used the fact that the observations f(x) are bounded in the domain [-1, 1] and that we can first partition the domain into the K groups X_k, in addition to partitioning the interval [-1, 1] into R smaller sub-intervals and using the Lipschitz constant. Choosing R = N^{1/3} and simplifying gives, with probability at least 1 - δ:

E_D[l(ĥ_γ, x)] - E_S[l(ĥ_γ, x)] ≤ 4(2 + 1/γ) / N^{1/3} + 2 √( (2K + 2 log(1/δ)) / N )

The same bound also applies to the decision rule h*_γ that results from applying the optimal thresholds with width γ > 0 (here, "optimal" is with respect to the underlying distribution) because the ε-cover (Definition 1 in (Xu and Mannor, 2010)) is independent of the choice of the thresholds. By the union bound, with probability at least 1 - δ both of the following inequalities hold:

E_D[l(ĥ_γ, x)] - E_S[l(ĥ_γ, x)] ≤ 4(2 + 1/γ) / N^{1/3} + 2 √( (2K + 2 log(2/δ)) / N )   (12)

E_D[l(h*_γ, x)] - E_S[l(h*_γ, x)] ≤ 4(2 + 1/γ) / N^{1/3} + 2 √( (2K + 2 log(2/δ)) / N )   (13)

In particular:

E_D[l(ĥ_γ, x)] ≤ E_S[l(ĥ_γ, x)] + 4(2 + 1/γ) / N^{1/3} + 2 √( (2K + 2 log(2/δ)) / N )
≤ E_S[l(h*_γ, x)] + γ + 4(2 + 1/γ) / N^{1/3} + 2 √( (2K + 2 log(2/δ)) / N )
≤ E_D[l(h*_γ, x)] + γ + 8(2 + 1/γ) / N^{1/3} + 4 √( (2K + 2 log(2/δ)) / N )

The first inequality follows from Eq. (12). The second inequality follows from the fact that ĥ_γ is an empirical risk minimizer of the regularized loss, where E[ĥ_γ(x)²] ≤ 1 since ĥ_γ(x) ∈ [0, 1]. The last inequality follows from Eq. (13). Finally, we know that the thresholding rule h*_γ with width γ > 0 is, by definition, a minimizer of:

(γ/2) E[h(x)²] - E[h(x)·f(x)]

among all possible bounded functions h : X → [0, 1] subject to the desired fairness constraints. Therefore, we have:

(γ/2) E[h*_γ(x)²] - E[h*_γ(x)·f(x)] ≤ (γ/2) E[h*(x)²] - E[h*(x)·f(x)]

Hence:

E[l(h*_γ, x)] = -E[h*_γ(x)·f(x)] ≤ γ + E[l(h*, x)]

This implies the desired bound:

E_D[l(ĥ_γ, x)] ≤ E_D[l(h*, x)] + 2γ + 8(2 + 1/γ) / N^{1/3} + 4 √( (2K + 2 log(2/δ)) / N )

Therefore, we have consistency if N → ∞, γ → 0⁺, and γ N^{1/3} → ∞. For example, this holds if γ = O(N^{-1/6}). So far, we have assumed that the output of the original classifier coincides with the Bayes regressor. If the original classifier is Bayes consistent, i.e. E[|2η(x) - 1 - f(x)|] → 0 as N → ∞, then Bayes consistency of the post-processing rule follows by the triangle inequality.
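To see why γ = O(N^{-1/6}) suffices, the two γ-dependent terms of the bound can be balanced directly (a quick check of the rate, using the bound exactly as stated above):

```latex
\gamma = N^{-1/6}
\;\Longrightarrow\;
2\gamma = 2N^{-1/6},
\qquad
\frac{8\,(2 + 1/\gamma)}{N^{1/3}} = 16\,N^{-1/3} + 8\,N^{-1/6},
```

so both terms vanish at rate O(N^{-1/6}), which also dominates the O(√((K + log(1/δ))/N)) deviation term.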

D PROOF OF PROPOSITION 1

Proof. Since |ξ_γ(w)| ≤ 1, the gradient at any point during SGD has a squared ℓ2-norm bounded by (1 + ρ + ε)² at all rounds. Following the proof steps of (Boyd and Mutapcic, 2008) and using the fact that projections are contraction mappings, one obtains:

(1/T) Σ_{t=1}^T E[F^(t)] - F(µ*) ≤ ( ‖µ*‖₂² + ‖γ*‖₂² + (1 + ρ + ε)² T α² ) / (2 T α) = (1 + ρ + ε)² α / 2 + ( ‖µ*‖₂² + ‖γ*‖₂² ) / (2 T α)

By Jensen's inequality, we have E[F(µ̄)] ≤ (1/T) Σ_{t=1}^T E[F^(t)]. Plugging this into the earlier result yields:

E[F̄] - F(µ*) ≤ (1 + ρ + ε)² α / 2 + ( ‖µ*‖₂² + ‖γ*‖₂² ) / (2 T α)

E EXTENSION TO OTHER CRITERIA

E.1 CONTROLLING THE COVARIANCE

The proposed algorithm can sometimes be adjusted to control bias according to criteria other than statistical parity. For example, we demonstrate in this section how the proposed post-processing algorithm can be adjusted to control the covariance between the classifier's prediction and the sensitive attribute when both are binary random variables. Let a, b, c ∈ {0, 1} be random variables. Let C(a, b) := E[a·b] - E[a]·E[b] be their covariance, and C(a, b | c) their covariance conditioned on c:

C(a, b | c = c) = E[a·b | c = c] - E[a | c = c]·E[b | c = c]   (14)

Then, one possible criterion for measuring bias is the conditional/unconditional covariance between the classifier's predictions and the sensitive attribute. Because the random variables are binary, it is straightforward to show that achieving zero covariance implies independence. Suppose we have a binary classifier on the instance space X. We would like to construct an algorithm for post-processing the predictions made by that classifier such that we guarantee |C(h(x), 1_S(x) | x ∈ X_k)| ≤ ε, where X = ∪_k X_k is a total partition of the instance space. Informally, this states that the fairness guarantee with respect to the sensitive attribute 1_S : X → {0, 1} holds within each subgroup X_k.
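The claim that zero covariance implies independence for binary variables is easy to verify directly: C(a, b) = p(a=1, b=1) - p(a=1)p(b=1), and the remaining three cells of the 2×2 joint are determined by the marginals, so all four independence equations reduce to this single one. A small numeric check (helper names are ours):

```python
import numpy as np

def covariance(joint):
    """C(a, b) = E[a*b] - E[a]*E[b] for binary a, b given a 2x2 joint pmf
    with joint[a, b] = p(a, b). Since a*b is 1 only at (1, 1),
    E[a*b] = joint[1, 1]."""
    ea = joint.sum(axis=1)[1]   # p(a = 1)
    eb = joint.sum(axis=0)[1]   # p(b = 1)
    return joint[1, 1] - ea * eb

def is_independent(joint, tol=1e-12):
    """Checks p(a, b) = p(a) * p(b) cell by cell."""
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    return np.allclose(joint, np.outer(pa, pb), atol=tol)
```

For any 2×2 joint, `covariance(joint) == 0` holds exactly when `is_independent(joint)` does, which is the fact used above.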
We assume, again, that the output of the classifier f : X → [-1, +1] is an estimate of 2η(x) - 1, where η(x) = p(y = 1 | x = x) is the Bayes regressor, and consider randomized rules of the form h : X × {0, 1} × {1, 2, . . . , K} × [-1, 1] → [0, 1], whose arguments are: (i) the instance x ∈ X, (ii) the sensitive attribute 1_S(x) ∈ {0, 1}, (iii) the sub-group membership k(x) ∈ [K], and (iv) the original classifier's score f(x). Because randomization is sometimes necessary, as proved in Section 4, h(x) is the probability of predicting the positive class when the instance is x ∈ X. Suppose we have a training sample of size N, which we denote by S. Let q_i = h(x_i) ∈ [0, 1] for the i-th instance in S. For each group X_k ⊆ S, the desired fairness constraint on the covariance can be written as:

| (1/|X_k|) Σ_{i ∈ X_k} (1_S(i) - ρ_k) q_i | ≤ ε,  where ρ_k = E_x[1_S(x) | x ∈ X_k].

This is because:

(1/|X_k|) Σ_{i ∈ X_k} (1_S(i) - ρ_k) q_i = (1/|X_k|) Σ_{i ∈ X_k} 1_S(i) q_i - (ρ_k / |X_k|) Σ_{i ∈ X_k} q_i = E[1_S(x)·h(x) | x ∈ X_k] - E[1_S(x) | x ∈ X_k]·E[h(x) | x ∈ X_k] = C(h(x), 1_S(x) | x ∈ X_k),

where the expectation is over the training sample. Therefore, in order to learn h, we solve the regularized optimization problem:

min_{0 ≤ q_i ≤ 1} Σ_{i=1}^N [ (γ/2) q_i² - f(x_i) q_i ]   s.t. ∀X_k ∈ G : | Σ_{i ∈ X_k} (1_S(i) - ρ_k) q_i | ≤ ε_k   (15)

where γ > 0 is a regularization parameter and ε_k = ε |X_k|. This is of the same general form analyzed in Section B.2. Hence, the same algorithm can be applied with b = 0 and z_i = 1_S(i) - ρ_k.
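The centered-sum form of the constraint can be checked numerically; the helper below (our naming) confirms that it coincides with the empirical group-conditional covariance:

```python
import numpy as np

def centered_constraint(q, s):
    """Left-hand side of the covariance constraint for one group X_k:
    (1/|X_k|) * sum_i (1_S(i) - rho_k) * q_i, with rho_k estimated from s."""
    rho_k = s.mean()
    return np.mean((s - rho_k) * q)

rng = np.random.default_rng(1)
q = rng.uniform(size=100)           # randomized predictions h(x_i) in [0, 1]
s = rng.integers(0, 2, size=100)    # binary sensitive attribute 1_S
cov = np.mean(s * q) - s.mean() * q.mean()   # E[s*q] - E[s]*E[q]
```

As in the derivation above, the two quantities are identical, so bounding the centered sum bounds the covariance between the post-processed predictions and the sensitive attribute within the group.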

E.2 IMPOSSIBILITY RESULT

The previous algorithm for controlling covariance requires that the subgroups X_k be known in advance. Indeed, our next impossibility result shows that this is, in general, necessary. In other words, a deterministic classifier f : X → {0, 1} cannot be universally unbiased with respect to a sensitive class S across all possible known and unknown groups unless the representation x has zero mutual information with the sensitive attribute or f is constant almost everywhere. As a corollary, the groups X_k have to be known in advance.

Proposition 2 (Impossibility result). Let X be the instance space and Y = {0, 1} be a target set. Let 1_S : X → {0, 1} be an arbitrary (possibly randomized) binary-valued function on X and define γ : X → [0, 1] by γ(x) = p(1_S(x) = 1 | x = x), where the probability is evaluated over the randomness of 1_S. Write γ̄ = E_x[γ(x)]. Then, for any binary f : X → {0, 1} it holds that:

sup_{π : X → {0,1}} E_{π(x)} |C(f(x), γ(x) | π(x))| ≥ (1/2) E_x|γ(x) - γ̄| · min{E f, 1 - E f},

where C(f(x), γ(x) | π(x)) is defined in Equation (14).

Proof. Fix 0 < β < 1 and consider the subset:

W = { x ∈ X : (γ(x) - γ̄)·(f(x) - β) > 0 },

and its complement W̄ = X \ W. Since f(x) ∈ {0, 1}, the sets W and W̄ are independent of β as long as it remains in the open interval (0, 1). More precisely:

W = { γ(x) - γ̄ > 0 ∧ f(x) = 1 } ∪ { γ(x) - γ̄ ≤ 0 ∧ f(x) = 0 }

Now, for any set X ⊆ X, let p_X be the projection of the probability measure p(x) onto the set X (i.e., p_X(x) = p(x)/p(X)). Then, with a simple algebraic manipulation, one has the identity:

E_{x∼p_X}[(γ(x) - γ̄)(f(x) - β)] = C(γ(x), f(x); x ∈ X) + (E_{x∼p_X}[γ] - γ̄)·(E_{x∼p_X}[f] - β)   (17)

Set β = f̄ := (1/2)(E_{x∼p_W}[f(x)] + E_{x∼p_W̄}[f(x)]). Substituting this choice of β into Eq. (18) gives the lower bound:

C(γ(x), f(x); x ∈ W) ≥ min{f̄, 1 - f̄} · E_{x∼p_W}|γ(x) - γ̄| + (1/2)(E_{x∼p_W}[γ] - γ̄)·(E_{x∼p_W̄}[f(x)] - E_{x∼p_W}[f(x)])

Repeating the same analysis for the subset W̄, we arrive at the inequality:

C(γ(x), f(x); x ∈ W̄) ≤ -min{f̄, 1 - f̄} · E_{x∼p_W̄}|γ(x) - γ̄| + (1/2)(E_{x∼p_W̄}[γ] - γ̄)·(E_{x∼p_W̄}[f(x)] - E_{x∼p_W}[f(x)])

Writing π(x) = 1_W(x), we have by the reverse triangle inequality:

E_{π(x)} |C(f(x), γ(x); π(x))| ≥ min{f̄, 1 - f̄} · E_x|γ(x) - γ̄|   (22)

Finally:

2 f̄ ≥ p(x ∈ W)·E_{x∼p_W}[f(x)] + p(x ∈ W̄)·E_{x∼p_W̄}[f(x)] = E[f]

Similarly, we have 2(1 - f̄) ≥ 1 - E[f]. Therefore:

min{f̄, 1 - f̄} ≥ (1/2) min{E f, 1 - E f}

Combining this with Eq. (22) establishes the statement of the proposition.



Ideally, an estimate of some monotone transformation of 2η(x) - 1, where η(x) = p(y = 1 | x = x) is the Bayes regressor. This is not a strong assumption, because many algorithms can be calibrated to provide probability scores (Platt et al., 1999; Guo et al., 2017).
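As a concrete illustration of such calibration, here is a minimal Platt-scaling sketch written from scratch (function name and hyperparameters are ours; Platt et al. (1999) describe the full method, which additionally smooths the targets): fit a sigmoid σ(a·z + b) mapping raw scores z to probabilities by gradient descent on the logistic loss.

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit sigmoid(a*z + b) to (score, label) pairs via gradient descent
    on the mean logistic loss; returns the fitted (a, b)."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        g = p - labels                      # d(log-loss)/d(logit)
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return a, b

a, b = platt_scale(np.array([-2.0, -1.0, 1.0, 2.0]),
                   np.array([0.0, 0.0, 1.0, 1.0]))
```

The calibrated score 1/(1 + exp(-(a·z + b))) can then be used as the probability estimate that the post-processing rule takes as input.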



Figure 1: Top: Histogram of classifiers' predictions on both subpopulations, demonstrating a clear gender bias in all cases. Bottom: The bias, defined as the absolute difference in mean outcome between genders within different demographic groups, before and after applying the proposed algorithm. Blue bars show the results of the unmodified classifier; orange bars show the results of optimizing for statistical parity with no regard for demographic information; green bars show the results of applying statistical parity on the cross product of gender and ethnicity.

Figure 3: The tradeoff curves are displayed for each classification problem. The x-axis corresponds to bias (Definition 1) while the y-axis is the test accuracy. In general, debiasing CDDD improves test accuracy because bias was introduced into the training data only. In addition, ROC fails at debiasing four classifiers (see also Tables 1 and 2) due to the absence of randomization.

Figure 4: The distribution of the scores produced by ResNet50 trained from scratch is shown for both subpopulations. The curves correspond to the randomized post-processing rules, i.e. p(y = 1|x), of Hardt et al. (2016) and of the proposed algorithm with γ = 0.1 and ρ = E[y].

By definition of W, we have:

E_{x∼p_W}[(γ(x) - γ̄)(f(x) - β)] = E_{x∼p_W}[|γ(x) - γ̄| · |f(x) - β|] ≥ min{β, 1 - β} · E_{x∼p_W}|γ(x) - γ̄|

Combining this with Eq. (17), we have:

C(γ(x), f(x); x ∈ W) ≥ min{β, 1 - β} · E_{x∼p_W}|γ(x) - γ̄| + (E_{x∼p_W}[γ] - γ̄)·(β - E_{x∼p_W}[f(x)])   (18)

Since the set W does not change when β is varied in the open interval (0, 1), the lower bound holds for any value of β ∈ (0, 1).

A comparison of the four post-processing methods on CelebA (predicting attractiveness) applied to the output of ResNet50 and MobileNet, each trained either from scratch or on ImageNet. The proposed algorithm performs much better than ROC in terms of bias and much better than Hardt et al. (2016) in terms of accuracy. Shift inference performs poorly on both objectives.

