A NEAR-OPTIMAL ALGORITHM FOR DEBIASING TRAINED MACHINE LEARNING MODELS

Anonymous

Abstract

We present an efficient and scalable algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. Unlike previous black-box reductions to cost-sensitive classification, the proposed algorithm operates on already-trained models and does not require retraining. Furthermore, because the algorithm is based on projected stochastic gradient descent (SGD), it is particularly attractive for deep learning applications. We empirically validate the proposed algorithm on standard benchmark datasets across both classical algorithms and modern DNN architectures, and demonstrate that it outperforms previous postprocessing approaches for unbiased classification.
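As a rough, hypothetical sketch of the projected-SGD postprocessing idea mentioned above: one can freeze a trained model's scores and learn a small set of per-group adjustments by gradient steps followed by a projection onto a feasible set. The objective below (squared error plus a quadratic parity penalty) and the per-group additive offsets are illustrative placeholders, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical sketch: postprocess frozen model scores by learning
# per-group additive offsets with projected SGD, trading accuracy
# against a statistical-parity penalty. NOT the paper's algorithm.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=200)        # frozen model outputs
groups = rng.integers(0, 2, size=200)       # binary sensitive attribute
labels = (scores > 0.5).astype(float)       # synthetic targets

theta = np.zeros(2)                         # per-group offsets (the learned postprocessor)
lr, lam = 0.1, 1.0                          # step size, parity-penalty weight
for _ in range(500):
    adj = np.clip(scores + theta[groups], 0.0, 1.0)        # adjusted scores
    gap = adj[groups == 0].mean() - adj[groups == 1].mean()  # parity gap
    for k in (0, 1):
        mask = groups == k
        g_err = 2 * (adj[mask] - labels[mask]).mean()      # accuracy term gradient
        g_par = lam * gap * (1 if k == 0 else -1)          # parity penalty gradient
        theta[k] -= lr * (g_err + g_par)
    theta = np.clip(theta, -1.0, 1.0)       # projection onto the feasible set
```

The projection step is what makes this "projected" SGD: after each gradient update the parameters are mapped back into a constraint set (here, a simple box).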

1. INTRODUCTION

Machine learning is increasingly applied to critical decisions which can have a lasting impact on individual lives, such as credit lending (Bruckner, 2018), medical applications (Deo, 2015), and criminal justice (Brennan et al., 2009). Consequently, it is imperative to understand and improve the degree of bias of such automated decision-making. Unfortunately, despite the fact that bias (or "fairness") is a central concept in our society today, it is difficult to define it in precise terms. In fact, as people perceive ethical matters differently depending on a plethora of factors including geographical location or culture (Awad et al., 2018), no universally agreed-upon definition of bias exists. Moreover, the definition of bias may depend on the application and might even be ignored in favor of accuracy when the stakes are high, such as in medical diagnosis (Kleinberg et al., 2017; Ingold and Soper, 2016). As such, it is not surprising that several definitions of "unbiased classification" have been introduced. These include statistical parity (Dwork et al., 2012; Zafar et al., 2017a), equality of opportunity (Hardt et al., 2016), and equalized odds (Hardt et al., 2016; Kleinberg et al., 2017). Unfortunately, such definitions are not generally compatible (Chouldechova, 2017) and some might even be in conflict with calibration (Kleinberg et al., 2017). In addition, because fairness is a societal concept, it does not necessarily translate into a statistical criterion (Chouldechova, 2017; Dixon et al., 2018).

Statistical parity

Let X be an instance space and let Y = {0, 1} be the target set in a standard binary classification problem. In the fair classification setting, we may further assume the existence of a (possibly randomized) sensitive attribute s : X → {1, . . . , K}, where s(x) = k if and only if x ∈ X k for some total partition X = ∪ k X k . For example, X might correspond to the set of job applicants while s indicates their gender.
Here, the sensitive attribute can be randomized if, for instance, the gender of an applicant is not a deterministic function of the full instance x ∈ X (e.g., number of publications, years of experience, etc.). A commonly used criterion for fairness is to require similar mean outcomes across the sensitive attribute. This property is well captured by the notion of statistical parity (a.k.a. demographic parity) (Corbett-Davies et al., 2017; Dwork et al., 2012; Zafar et al., 2017a; Mehrabi et al., 2019):

Definition 1 (ε-Statistical Parity). Let X be an instance space and X = ∪ k X k be a total partition of X . A classifier f : X → {0, 1} satisfies ε-statistical parity across the groups X 1 , . . . , X K if:

max k∈{1,...,K} E x [f (x) | x ∈ X k ] − min k∈{1,...,K} E x [f (x) | x ∈ X k ] ≤ ε
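The quantity bounded in Definition 1, the gap between the largest and smallest group-conditional mean prediction, is straightforward to estimate from samples. The helper below is written for illustration and is not from the paper.

```python
import numpy as np

def statistical_parity_gap(predictions, groups):
    """Empirical statistical parity gap: the difference between the
    largest and smallest group-conditional mean prediction.
    (Illustrative helper, not from the paper.)
    """
    predictions = np.asarray(predictions, dtype=float)
    groups = np.asarray(groups)
    # Group-conditional means E[f(x) | x in X_k] for each observed group k
    means = [predictions[groups == k].mean() for k in np.unique(groups)]
    return max(means) - min(means)

# Example: group 0 has a 0.75 positive rate, group 1 has 0.25
preds  = np.array([1, 1, 0, 1, 0, 0, 0, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = statistical_parity_gap(preds, groups)  # 0.75 - 0.25 = 0.5
```

A classifier satisfying ε-statistical parity keeps this gap at most ε; exact parity corresponds to ε = 0.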

