A TECHNICAL AND NORMATIVE INVESTIGATION OF SOCIAL BIAS AMPLIFICATION

Anonymous authors
Paper under double-blind review

Abstract

The conversation around the fairness of machine learning models is growing and evolving. In this work, we focus on the issue of bias amplification: the tendency of models trained on data containing social biases to further amplify these biases. This problem is brought about by the algorithm, on top of the level of bias already present in the data. We make two main contributions regarding its measurement. First, building off of Zhao et al. (2017), we introduce and analyze a new, decoupled metric for measuring bias amplification, BiasAmp→, which possesses a number of attractive properties, including the ability to pinpoint the cause of bias amplification. Second, we thoroughly analyze and discuss the normative implications of this metric. We provide suggestions about its measurement by cautioning against predicting sensitive attributes, encouraging the use of confidence intervals due to fluctuations in the fairness of models across runs, and discussing what bias amplification means in domains where labels either don't exist at test time or correspond to uncertain future events. Throughout this paper, we work to provide a deeply interrogative look at the technical measurement of bias amplification, guided by our normative ideas of what we want it to encompass.

1. INTRODUCTION

The machine learning community is becoming increasingly cognizant of problems surrounding fairness and bias, and correspondingly a plethora of new algorithms and metrics are being proposed (see e.g., Mehrabi et al. (2019) for a review). The gatekeepers checking the systems to be deployed often take the form of fairness evaluation metrics, and it is vital that these be deeply investigated both technically and normatively. In this paper, we endeavor to do this for bias amplification. Bias amplification happens when a model exacerbates biases from the training data at test time. It is the result of the algorithm (Foulds et al., 2018) and, unlike other forms of bias, cannot be solely attributed to the dataset. To this end, we propose a new way of measuring bias amplification, BiasAmp→,¹ that builds off a prior metric from Men Also Like Shopping (Zhao et al., 2017), which we call BiasAmp_MALS. Our metric's technical composition aligns with the real-world qualities we want it to encompass, addressing a number of the previous metric's shortcomings by being able to: 1) generalize beyond binary attributes, 2) take into account the base rates at which people with each attribute appear, and 3) disentangle the directions of amplification.

Concretely, consider a visual dataset (Fig. 1) where each image has a label for the task, T, which is painting or not painting, and further is associated with a protected attribute, A, which is woman or man. If the gender of the person biases the prediction of the task, we consider this A → T bias amplification; if the reverse happens, then T → A.

In our normative discussion, we cover several topics. We consider whether predicting protected attributes is necessary in the first place; by not doing so, we can trivially remove T → A amplification.
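To make the two directions concrete, the following is a minimal sketch, not the exact BiasAmp→ formula (which is defined later in the paper); the helper names `directional_amp` and `bias_amp` and the sign convention are our own, and we assume a single binary attribute and binary task:

```python
import numpy as np

def directional_amp(group_true, target_true, target_pred):
    """Signed shift in P(target = 1 | group) from ground truth to predictions,
    oriented so that widening an existing correlation counts as positive."""
    group_true = np.asarray(group_true)
    target_true = np.asarray(target_true, dtype=float)
    target_pred = np.asarray(target_pred, dtype=float)
    base = target_true.mean()
    groups = np.unique(group_true)
    amp = 0.0
    for g in groups:
        mask = group_true == g
        train_rate = target_true[mask].mean()  # correlation in the data
        pred_rate = target_pred[mask].mean()   # correlation in the predictions
        # over-predicting the target for the already-correlated group
        # widens the existing gap, so it counts as positive amplification
        sign = 1.0 if train_rate > base else -1.0
        amp += sign * (pred_rate - train_rate)
    return amp / len(groups)

def bias_amp(attr_true, task_true, attr_pred, task_pred):
    # A -> T: task predictions conditioned on the true protected attribute
    a_to_t = directional_amp(attr_true, task_true, task_pred)
    # T -> A: attribute predictions conditioned on the true task label
    t_to_a = directional_amp(task_true, attr_true, attr_pred)
    return a_to_t, t_to_a
```

In the Fig. 1 setup, a model that over-predicts painting for women while classifying gender perfectly would yield a positive A → T score and a zero T → A score, which is exactly the decoupling the metric is after.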
We also encourage the use of confidence intervals when reporting our metric because BiasAmp→, like other fairness metrics, suffers from the Rashomon Effect (Breiman, 2001), or the multiplicity of good models. In deep neural networks, random seeds have relatively little impact on accuracy; that is not the case for fairness, which is far more brittle to randomness.

Figure 1: Consider an image recognition dataset where the goal is to classify the task, T, as painting or not painting, and the attribute, A, as woman or man. Note that in this dataset women are correlated with painting, and men with not painting. In this work we are particularly concerned with errors that contribute to the amplification of bias (red and yellow in the figure), i.e., those that amplify the training correlation. We further disentangle these errors into those that amplify the attribute-to-task correlation (i.e., incorrectly predict the task based on the attribute of the person; shown in yellow) and those that amplify the task-to-attribute correlation (shown in red).

Notably, bias amplification is not at odds with accuracy, unlike many other fairness metrics, because the goal of not amplifying biases and matching task-attribute correlations is aligned with that of accurate prediction. For example, imagine a dataset where the positive outcome is associated at a higher rate with group A than with group B. A classifier that achieves 100% accuracy at predicting the positive outcome is not amplifying bias; however, according to metrics like demographic parity, this perfect classifier is still perpetuating bias because it predicts the positive label at different rates for the two groups.
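The perfect-classifier argument can be checked numerically; a small sketch with made-up group sizes (the 80%/20% rates and variable names are illustrative, not from the paper):

```python
import numpy as np

# Group A (0) receives the positive outcome at a higher rate than group B (1).
group = np.array([0] * 100 + [1] * 100)
label = np.array([1] * 80 + [0] * 20 + [1] * 20 + [0] * 80)
pred = label.copy()  # a 100%-accurate classifier

# Bias amplification compares per-group positive rates in the predictions
# against the data: a perfect classifier reproduces them exactly.
amp_gaps = [abs(pred[group == g].mean() - label[group == g].mean())
            for g in (0, 1)]

# Demographic parity instead asks the predicted positive rates to be equal
# across groups, which this perfect classifier violates.
dp_gap = abs(pred[group == 0].mean() - pred[group == 1].mean())
```

Here `amp_gaps` is `[0.0, 0.0]` while `dp_gap` is 0.6: the perfect classifier passes a bias-amplification check but fails demographic parity.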
While matching training correlations is desirable in object detection, where systems should perfectly predict the labels, we will explore the nuances of what this means in settings where the validity of the labels, and thus the task-attribute correlations themselves, are up for debate. For example, in recidivism risk prediction, the label records whether someone with a given set of input features went on to recidivate, but it is not a steadfast indicator of what another person with the same input features will do. Here, we would not want to replicate the task-attribute correlations at test time, and it is important to keep this in mind when deciding which fairness metrics to apply. The notion of amplification also allows us to encapsulate the idea that errors carrying a history of systemic harm and bias can be more damaging than errors made without such a history (Bearman et al., 2009); for example, in images, overclassifying women as cooking carries a more negative connotation than overclassifying men as cooking.² Distinguishing which errors are more harmful than others is a pattern that can often be lifted from the training data.

To ground our work, we first distinguish what bias amplification captures that standard fairness metrics cannot, and then distinguish BiasAmp→ from BiasAmp_MALS. Our key contributions are: 1) proposing a new way to measure bias amplification that addresses multiple shortcomings of prior work and allows us to better diagnose where a model goes wrong, and 2) providing a technical analysis and normative discussion around the use of this measure in diverse settings, encouraging thoughtfulness with each application.
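The confidence intervals we encourage can be computed over runs that differ only in random seed; a minimal sketch using a normal approximation (the per-seed scores below are illustrative placeholders, not results from this paper):

```python
import numpy as np

def metric_ci(per_seed_scores):
    """Mean and a normal-approximation 95% confidence interval over runs."""
    vals = np.asarray(per_seed_scores, dtype=float)
    mean = vals.mean()
    # standard error of the mean; 1.96 is the z-value for a 95% interval
    half = 1.96 * vals.std(ddof=1) / np.sqrt(len(vals))
    return mean, (mean - half, mean + half)

# e.g., bias-amplification scores from five runs differing only in random seed
scores = [0.031, 0.058, 0.012, 0.047, 0.039]
mean, (lo, hi) = metric_ci(scores)
```

Reporting the interval rather than a single number guards against the run-to-run fluctuations that the Rashomon Effect induces in fairness metrics.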

2. RELATED WORK

Fairness Measurements. Fairness is nebulous and context-dependent, and approaches to quantifying it (Verma & Rubin, 2018; Buolamwini & Gebru, 2018) include equalized odds (Hardt et al., 2016), equal opportunity (Hardt et al., 2016), demographic parity (Dwork et al., 2012; Kusner et al., 2017), fairness through awareness (Dwork et al., 2012; Kusner et al., 2017), fairness through unawareness (Grgic-Hlaca et al., 2016; Kusner et al., 2017), and treatment equality (Berk et al., 2017). We examine bias amplification, a type of group fairness violation in which correlations are amplified.

Bias Amplification. Bias amplification has been measured in binary classification (Leino et al., 2019), in GANs (Jain et al., 2020; Choi et al., 2020), and through correlations (Zhao et al., 2017). Wang et al. (2019) measure it using dataset leakage and model leakage. The difference between these values is the level of bias amplification, but this is not a fair comparison because the

¹ The arrow in BiasAmp→ is meant to signify the direction in which bias amplification flows, and is not intended as a claim about causality.
² We use the terms man and woman to refer to binarized socially-perceived gender expression, recognizing that these labels are not inclusive and that, in vision datasets, they are often assigned by annotators rather than self-disclosed.

