

Abstract

With the recent expansion of machine learning researchers' and practitioners' attention to fairness, there is a lack of a common framework for analyzing and comparing the capabilities of proposed models in deep representation learning. In this paper, we evaluate different fairness methods trained with deep neural networks on a common synthetic dataset to obtain better insight into how these methods work. In particular, we train about 2000 different models in various setups, including unbalanced and correlated data configurations, to probe the limits of current models and better understand the setups in which they are subject to failure. In doing so, we present a dataset and a large subset of the fairness metrics proposed in the literature, and we rigorously evaluate recent promising debiasing algorithms in a common framework, hoping that the research community will adopt this benchmark as a common entry point for fair deep learning.

1 Introduction

The emergence of deep learning models has raised questions about the fairness of these models with respect to sensitive attributes. There have been studies on the reliability and bias of deep learning approaches, such as the bias of learning algorithms due to matching a data distribution in adversarial models (Cohen et al., 2018), visual question answering (Agrawal et al., 2018), image search setups (Kay et al., 2015), and gender classification (Buolamwini & Gebru, 2018). Recent approaches (Madras et al., 2018; Zhao et al., 2020; Creager et al., 2019; Zhang et al., 2018) address the bias in deep learning models and propose fair models that better guarantee fairness criteria while maintaining accuracy. The general idea in these models is to apply a fairness learning technique either to the learned latent representation (Madras et al., 2018; Zhao et al., 2020; Creager et al., 2019) or to the output eligibility criterion (Zhang et al., 2018), given an input. Adversarial learning techniques (Goodfellow et al., 2014; Ganin & Lempitsky, 2015) have been used extensively to learn a data distribution of interest, for example to learn a disentangled latent space (Kim & Mnih, 2018; Chen et al., 2018; 2016) or to subtract a feature from the latent space (Lample et al., 2017; Denton et al., 2017). Adversarial learning has also been used in fairness models, either to remove sensitive information from the latent space (Madras et al., 2018; Zhao et al., 2020; Zhang et al., 2018) or to disentangle latent features into sensitive and non-sensitive attributes (Creager et al., 2019). The goal in these approaches is to remove sensitive information, such as gender, from a latent space that is later used for other tasks, such as eligibility classification. In this paper, we aim to verify whether these well-performing and ubiquitous models can effectively reduce bias when evaluated using well-established fairness metrics.
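The adversarial removal idea described above can be sketched on a toy problem. The data generator, the linear encoder, and the update rule below are our own illustrative assumptions, not the setup of any cited work: an encoder is trained to keep task information while ascending the loss of an adversary that tries to predict the sensitive attribute from the latent code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30, 30)))

# Toy data (illustrative assumption): feature 0 carries the task label y,
# feature 1 carries the sensitive attribute s.
n = 2000
y = rng.integers(0, 2, n).astype(float)
s = rng.integers(0, 2, n).astype(float)
x = np.stack([y + 0.3 * rng.standard_normal(n),
              s + 0.3 * rng.standard_normal(n)], axis=1)

W = 0.1 * rng.standard_normal((2, 2))  # linear encoder: z = x @ W
c = 0.1 * rng.standard_normal(2)       # task head on z, predicts y
a = 0.1 * rng.standard_normal(2)       # adversary head on z, predicts s

lr, lam = 0.3, 1.0                     # lam scales the reversed adversary gradient
for _ in range(1000):
    z = x @ W
    py, ps = sigmoid(z @ c), sigmoid(z @ a)
    # heads: gradient descent on their own cross-entropy losses
    gc = z.T @ (py - y) / n
    ga = z.T @ (ps - s) / n
    # encoder: descend the task loss, ASCEND the adversary loss (min-max)
    gW = x.T @ (np.outer(py - y, c) - lam * np.outer(ps - s, a)) / n
    c -= lr * gc
    a -= lr * ga
    W -= lr * gW

z = x @ W
task_acc = float(np.mean((sigmoid(z @ c) > 0.5) == (y > 0.5)))
adv_acc = float(np.mean((sigmoid(z @ a) > 0.5) == (s > 0.5)))
print(f"task accuracy {task_acc:.2f}, adversary accuracy {adv_acc:.2f}")
```

If the encoder succeeds, the task head remains accurate while the adversary's accuracy on the sensitive attribute falls toward chance; the same min-max structure underlies the adversarial debiasing methods discussed above, with deep networks in place of these linear maps.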
Here we focus on deep learning models for classification and their adversarial bias mitigation counterparts. The difficulty of a task is determined by the distribution of the dataset, which, in turn, is a function of label imbalance, feature imbalance, the dependence of sensitive attributes on the remaining features, and distribution shift between the training and development phases, to name a few. In addition, in the current state of the fairness community, problems with reproducibility and inconsistencies in experimental and dataset setups make comparisons difficult. Given the importance and potentially tangible impact of the proposed algorithms when deployed in production, we advocate a rigorous and unified evaluation of the capabilities of debiasing models. We argue for a systematic analysis that assesses different bias mitigation approaches through the lens of different fairness metrics. This would help elucidate the most promising research contributions.
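As an example of the kind of controllable data configuration and fairness metric meant here, the sketch below (the generator, its parameters, and the naive classifier are hypothetical choices of ours) produces binary data with a tunable correlation between the label and a sensitive attribute, and scores a classifier with the demographic parity gap, one widely used fairness metric:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_dataset(n=10000, pos_rate=0.5, corr=0.0):
    """Toy generator (illustrative, not the paper's dataset): a binary label y
    with a controllable positive rate, and a binary sensitive attribute s whose
    agreement with y is controlled by `corr`."""
    y = (rng.random(n) < pos_rate).astype(int)
    # with probability corr, s copies y; otherwise s is an independent coin flip
    copy = rng.random(n) < corr
    s = np.where(copy, y, rng.integers(0, 2, n))
    x = y[:, None] + 0.5 * rng.standard_normal((n, 2))  # features driven by y
    return x, y, s

def demographic_parity_gap(pred, s):
    """|P(pred = 1 | s = 1) - P(pred = 1 | s = 0)|."""
    return abs(pred[s == 1].mean() - pred[s == 0].mean())

x, y, s = make_dataset(corr=0.8)
pred = (x[:, 0] > 0.5).astype(int)  # naive threshold classifier on feature 0
gap = demographic_parity_gap(pred, s)
print(f"demographic parity gap: {gap:.2f}")
```

With `corr=0.8` the naive classifier inherits a large parity gap even though it never reads `s`, whereas with `corr=0.0` the gap shrinks to sampling noise; sweeping such knobs is exactly the kind of systematic stress test advocated above.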

