

Abstract

With the recent expansion of machine learning researchers' and practitioners' attention to fairness, there is no common framework to analyze and compare the capabilities of proposed models in deep representation learning. In this paper, we evaluate different fairness methods trained with deep neural networks on a common synthetic dataset to obtain better insight into how these methods work. In particular, we train about 2000 different models in various setups, including unbalanced and correlated data configurations, to verify the limits of the current models and better understand in which setups they are prone to failure. In doing so, we present a dataset, a large subset of the fairness metrics proposed in the literature, and a rigorous evaluation of recent promising debiasing algorithms in a common framework, hoping the research community will take this benchmark as a common entry point for fair deep learning.

1. Introduction

The emergence of deep learning models has raised questions about the fairness of these models with respect to sensitive attributes. There have been studies on the reliability and bias of deep learning approaches, such as the bias of learning algorithms due to matching a data distribution in adversarial models (Cohen et al., 2018), visual question answering (Agrawal et al., 2018), image search setups (Kay et al., 2015), and gender classification (Buolamwini & Gebru, 2018). Recent approaches (Madras et al., 2018; Zhao et al., 2020; Creager et al., 2019; Zhang et al., 2018) address the bias in deep learning models and propose fair models that better guarantee fairness criteria while maintaining accuracy. The general idea in these models is to apply a fairness learning technique either to the learned latent representation (Madras et al., 2018; Zhao et al., 2020; Creager et al., 2019) or to the output eligibility criterion (Zhang et al., 2018), given an input. Adversarial learning techniques (Goodfellow et al., 2014; Ganin & Lempitsky, 2015) have been used extensively to learn a data distribution of interest, such as learning a disentangled latent space (Kim & Mnih, 2018; Chen et al., 2018; 2016) or subtracting a feature from the latent space (Lample et al., 2017; Denton et al., 2017). Adversarial learning has also been used in fairness models either to remove sensitive information from the latent space (Madras et al., 2018; Zhao et al., 2020; Zhang et al., 2018) or to disentangle latent features into sensitive and non-sensitive attributes (Creager et al., 2019). The goal in these approaches is to remove sensitive information, such as gender, from a latent space that is later used for other tasks, such as eligibility classification. In this paper, we aim to verify whether these well-performing and ubiquitous models can effectively reduce bias when evaluated using well-established fairness metrics.
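The adversarial removal of sensitive information described above commonly relies on a gradient-reversal mechanism (Ganin & Lempitsky, 2015): an adversary tries to predict the sensitive attribute from the representation, while the encoder receives the adversary's gradient with its sign flipped, so it learns to make that prediction harder. A minimal sketch of the mechanism follows; the class name and its manual forward/backward interface are our own illustration, not code from any of the cited papers:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass. Placed between the encoder and the adversary head,
    it makes the encoder ascend the adversary's loss, pushing sensitive
    information out of the learned representation."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between task accuracy and debiasing

    def forward(self, z):
        # The adversary sees the representation unchanged.
        return z

    def backward(self, grad_from_adversary):
        # The encoder receives the reversed, scaled gradient.
        return -self.lam * grad_from_adversary
```

In an automatic-differentiation framework this is typically implemented as a custom autograd function; larger values of `lam` trade task accuracy for stronger removal of the sensitive attribute.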
Here we focus on deep learning models for classification and their adversarial bias mitigation counterparts. The difficulty of a task is determined by the distribution of the dataset, which in turn is a function of, among other factors, label imbalance, feature imbalance, the dependency of sensitive attributes on the rest of the features, and distribution shift between the training and development phases. In addition, problems with reproducibility and inconsistency in experimentation and dataset setups in the current fairness community make comparison difficult. Given the importance and possibly tangible impact of the proposed algorithms when in production, we advocate a rigorous and unified evaluation of the capabilities of debiasing models. We argue for a systematic analysis that assesses different bias mitigation approaches from the perspective of different fairness metrics; this would help elucidate the most promising research contributions. To analyze debiasing models, we propose a dataset that facilitates the creation of different biased setups, such as data imbalance or correlation among eligibility and sensitive or non-sensitive attributes. In particular, contrary to real datasets, where the intensity of each feature cannot be controlled, this dataset allows changing a single component while keeping all others unchanged, which enables the study of many variations of biased setups. To evaluate models under different biased setups, we provide an in-depth analysis of baselines and recently proposed bias mitigation models from the literature. In particular, we evaluate three promising debiasing models (six variants) together with a baseline model using a unified set of fairness metrics, and report results after carrying out an extensive hyper-parameter search in all cases, ensuring that the drawn conclusions can be attributed to modeling or loss choices.
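As a concrete illustration of the kind of metrics involved, two of the most common group fairness criteria can be computed directly from binary predictions. The sketch below (our own toy example, not the paper's actual metric suite) measures the demographic parity gap, the difference in positive-prediction rates between sensitive groups, and the equal opportunity gap, the difference in true-positive rates:

```python
def demographic_parity_gap(y_pred, s):
    """|P(yhat=1 | s=0) - P(yhat=1 | s=1)|: gap in positive rates."""
    rate = lambda g: sum(p for p, a in zip(y_pred, s) if a == g) / sum(1 for a in s if a == g)
    return abs(rate(0) - rate(1))

def equal_opportunity_gap(y_true, y_pred, s):
    """Gap in true-positive rates between the two sensitive groups."""
    def tpr(g):
        pos = [p for p, t, a in zip(y_pred, y_true, s) if a == g and t == 1]
        return sum(pos) / len(pos)
    return abs(tpr(0) - tpr(1))

# A maximally biased predictor that simply outputs the sensitive attribute:
s      = [0, 0, 1, 1]
y_true = [0, 1, 0, 1]
y_pred = list(s)
print(demographic_parity_gap(y_pred, s))         # → 1.0
print(equal_opportunity_gap(y_true, y_pred, s))  # → 1.0
```

A perfectly fair predictor drives both gaps to 0.0; a predictor that leaks the sensitive attribute, as above, drives them toward 1.0.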
In doing so, we evaluate these models under different setups by training about 2000 models, in which we transition from balanced setups toward challenging unbalanced and correlated setups, where the eligibility criterion is correlated with sensitive or non-sensitive attributes. Given the importance of fairness and the serious implications of its misuse, we intentionally try to push these models to their breaking point using extreme settings from our dataset. We consider different ratios for each sensitive group to create "unfair datasets". It is noteworthy that our analysis is not meant to undermine the effectiveness of any of these methods, but rather to set expectations and boundaries for different use cases and to encourage the community to follow. To summarize, our contributions are:
• We show that when there is a correlation between eligibility and the sensitive attribute, models exploit it for prediction and rely on it even at test time, when such correlation does not exist, which leads to unfair predictions.
• We show that even when non-sensitive attributes in the dataset, or small visual features in the input image, correlate with eligibility, models can still exploit them and hence be biased.
• We demonstrate that the choice of seed can affect results significantly, and different models show considerable variation in performance across seeds.
• We provide a deep learning codebase composed of six debiasing models and a baseline to the fairness community for research and evaluation of fairness models.
• We provide a dataset with different controllable sets of features and correlations among them, to facilitate research on fairness models in different unfair setups.

2. Related Work

There is a growing literature studying how biased datasets can lead learning algorithms to discriminate (Bolukbasi et al., 2016; Cohen et al., 2018; Agrawal et al., 2018). These studies have led researchers to propose new evaluation tools and datasets (Buolamwini & Gebru, 2018), with the goal of identifying potential machine learning error-rate gaps among different groups. To remedy this issue, numerous contributions have emerged that probe machine learning systems at different levels and reduce discrimination from the modeling perspective. Friedler et al. (2019) evaluate the fairness of pre-processing, in-processing, and post-processing machine learning approaches; however, no deep learning model is evaluated in their study. Moreover, the models are not evaluated while changing the level of dataset bias or the correlation among features. Kamishima et al. (2012) propose a regularization approach for models of a probabilistic discriminative nature and evaluate it on logistic regression models; Zafar et al. (2017a) propose a flexible mechanism to optimize accuracy in the presence of non-convex fairness constraints (or, vice versa, to optimize fairness under accuracy constraints) and evaluate it on logistic regression and support vector machines. In addition, several contributions have attempted to leverage adversarial learning to mitigate biases in learned representations or in classifiers. Adversarial learning was originally introduced within the framework of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and has been leveraged in bias mitigation frameworks to make different groups indistinguishable from one another with respect to a sensitive attribute. In particular, Edwards & Storkey (2016) adversarially learn debiased encoded features to be used in classification tasks while being fair with respect to demographic parity. Madras et al. (2018) adopt a group-normalized ℓ1 loss and adapt it for each fairness metric of interest. Beutel et al. (2017) debias the encoded features using an adversarial training procedure for classification tasks to achieve equality of opportunity. Zhang et al. (2018)

