A NEAR-OPTIMAL ALGORITHM FOR DEBIASING TRAINED MACHINE LEARNING MODELS

Anonymous

Abstract

We present an efficient and scalable algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. Unlike previous black-box reductions to cost-sensitive classification, the proposed algorithm operates on already-trained models and does not require retraining. Furthermore, because the algorithm is based on projected stochastic gradient descent (SGD), it is particularly attractive for deep learning applications. We empirically validate the proposed algorithm on standard benchmark datasets across both classical algorithms and modern DNN architectures, and demonstrate that it outperforms previous post-processing approaches for unbiased classification.

1. INTRODUCTION

Machine learning is increasingly applied to critical decisions which can have a lasting impact on individual lives, such as credit lending (Bruckner, 2018), medical applications (Deo, 2015), and criminal justice (Brennan et al., 2009). Consequently, it is imperative to understand and improve the degree of bias of such automated decision-making. Unfortunately, despite the fact that bias (or "fairness") is a central concept in our society today, it is difficult to define in precise terms. In fact, as people perceive ethical matters differently depending on a plethora of factors, including geographical location or culture (Awad et al., 2018), no universally agreed-upon definition of bias exists. Moreover, the definition of bias may depend on the application and might even be ignored in favor of accuracy when the stakes are high, such as in medical diagnosis (Kleinberg et al., 2017; Ingold and Soper, 2016). As such, it is not surprising that several definitions of "unbiased classification" have been introduced. These include statistical parity (Dwork et al., 2012; Zafar et al., 2017a), equality of opportunity (Hardt et al., 2016), and equalized odds (Hardt et al., 2016; Kleinberg et al., 2017). Unfortunately, such definitions are not generally compatible (Chouldechova, 2017) and some might even be in conflict with calibration (Kleinberg et al., 2017). In addition, because fairness is a societal concept, it does not necessarily translate into a statistical criterion (Chouldechova, 2017; Dixon et al., 2018).

Statistical Parity

Let X be an instance space and let Y = {0, 1} be the target set in a standard binary classification problem. In the fair classification setting, we may further assume the existence of a (possibly randomized) sensitive attribute s : X → {1, . . . , K}, where s(x) = k if and only if x ∈ X_k for some total partition X = ∪_k X_k. For example, X might correspond to the set of job applicants while s indicates their gender.
Here, the sensitive attribute can be randomized if, for instance, the gender of an applicant is not a deterministic function of the full instance x ∈ X (e.g. number of publications, years of experience, etc.). A commonly used criterion for fairness is then to require similar mean outcomes across the sensitive attribute. This property is well captured through the notion of statistical parity (a.k.a. demographic parity) (Corbett-Davies et al., 2017; Dwork et al., 2012; Zafar et al., 2017a; Mehrabi et al., 2019):

Definition 1 (Statistical Parity). Let X be an instance space and X = ∪_k X_k be a total partition of X. A classifier f : X → {0, 1} satisfies ε-statistical parity across the groups X_1, . . . , X_K if:

max_{k ∈ {1,...,K}} E_x[f(x) | x ∈ X_k] − min_{k ∈ {1,...,K}} E_x[f(x) | x ∈ X_k] ≤ ε

To motivate and further clarify the definition, we showcase empirical results on the Adult benchmark dataset (Blake and Merz, 1998) in Figure 1. When tasked with predicting whether the income of individuals exceeds $50K per year, all considered classifiers exhibit gender-related bias. One way of removing such bias is to enforce statistical parity across genders. Crucially, however, without taking ethnicity into account, different demographic groups may experience different outcomes. In fact, gender bias can actually increase in some minority groups after enforcing statistical parity. This can be fixed by redefining the sensitive attribute to be the cross product of gender and ethnicity (green bars in Figure 1).

Our main contribution is a near-optimal recipe for debiasing trained models, including deep neural networks, according to Definition 1. Specifically, we formulate the task of debiasing learned models as a regularized optimization problem that is solved efficiently using the projected SGD method.
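The quantity bounded in Definition 1 is easy to estimate empirically: it is the largest gap in mean predicted outcome across the sensitive groups. A minimal sketch (the function name `statistical_parity_gap` is ours, not from the paper):

```python
import numpy as np

def statistical_parity_gap(predictions, groups):
    """Largest difference in mean outcome E[f(x) | x in X_k] across groups.

    predictions: array of binary (or probabilistic) classifier outputs f(x).
    groups: array of group labels s(x), one per instance.
    """
    means = [predictions[groups == k].mean() for k in np.unique(groups)]
    return max(means) - min(means)

# Toy example: the classifier favors group 1 (mean 0.75) over group 2 (mean 0.25).
preds = np.array([1, 1, 1, 0, 0, 0, 0, 1])
grps = np.array([1, 1, 1, 1, 2, 2, 2, 2])
gap = statistical_parity_gap(preds, grps)  # 0.75 - 0.25 = 0.5
```

A classifier satisfies ε-statistical parity on a dataset precisely when this empirical gap is at most ε.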
We show that the algorithm produces thresholding rules with randomization near the thresholds, where the width of the randomization band is controlled by the regularization parameter. We also show that randomization near the threshold is necessary for Bayes risk consistency. While we focus on binary sensitive attributes in our experiments in Section 5, our algorithm and its theoretical guarantees continue to hold for non-binary sensitive attributes as well.

Statement of Contribution

1. We derive a near-optimal post-processing algorithm for debiasing learned models (Section 3).

2. We prove theoretical guarantees for the proposed algorithm, including a proof of correctness and an explicit bound on the excess Bayes risk (Section 4).

3. We empirically validate the proposed algorithm on benchmark datasets across both classical algorithms and modern DNN architectures. Our experiments demonstrate that the proposed algorithm significantly outperforms previous post-processing methods (Section 5). In Appendix E, we also show how the proposed algorithm can be modified to handle other criteria of bias as well.



Figure 1: Top: Histogram of classifiers' predictions on both subpopulations, demonstrating a clear gender bias in all cases. Bottom: The bias, defined as the absolute difference in mean outcome between genders within different demographic groups, before and after applying the proposed algorithm. Blue bars show the results of the unmodified classifier; orange bars show the results of optimizing for statistical parity with no regard for demographic information. Finally, green bars are the results of applying statistical parity on the cross product of gender and ethnicity.

