COMPARING HUMAN AND MACHINE BIAS IN FACE RECOGNITION

Abstract

Much recent research has uncovered and discussed serious concerns about bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias, but face two major challenges: they (1) use facial recognition datasets, like LFW and CelebA, which lack quality metadata, and (2) do not compare the observed algorithmic bias to the biases of the human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g., identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and to a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task than at the identification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match those of the subject. Academic models exhibit comparable levels of gender bias to humans, but are significantly more biased against darker skin types than humans.

1. INTRODUCTION

Facial analysis systems have been the topic of intense research for decades, and their deployments have been criticized in recent years for intrusive privacy practices and differential treatment of demographic groups. Companies and governments have deployed facial recognition systems (Derringer, 2019; Hartzog, 2020; Weise & Singer, 2020) with a wide variety of applications, from the relatively mundane, e.g., improved search through personal photos (Google, 2021), to the rather controversial, e.g., target identification in warzones (Marson & Forrest, 2021). A flashpoint issue for facial analysis systems is their potential for biased results across demographics (Garvie, 2016; Lohr, 2018; Buolamwini & Gebru, 2018; Grother et al., 2019; Dooley et al., 2021), which makes facial recognition controversial for socially important applications, such as use in law enforcement or the criminal justice system. To make matters worse, many studies of machine bias in face recognition use datasets which are themselves imbalanced or riddled with errors, resulting in inaccurate measurements of machine bias. It is now widely accepted that computers perform as well as or better than humans on a variety of facial recognition tasks in terms of accuracy (Lu & Tang, 2015; Grother et al., 2019), but what about bias? Algorithms' superior overall performance, as well as their speed at inference, makes facial recognition technologies widely appealing in many domains, at heightened cost to those surveilled, monitored, or targeted by their use (Lewis, 2019; Kostka et al., 2021). Many previous studies that examine and critique these technologies through algorithmic audits do so only up to the point of the algorithm's biases. They stop short of comparing these biases to those of the human alternatives.
In this study, we ask how the bias of algorithms compares to human bias, in order to fill one of the largest omissions in the facial recognition bias literature. We investigate these questions by creating, through extensive hand curation, a dataset that improves upon previous facial recognition bias auditing datasets, using images from two common facial recognition datasets (Huang et al., 2008; Liu et al., 2015) and fixing many of their imbalances and erroneous labels. Common academic datasets contain many flaws that make them unacceptable for this purpose. For example, they contain many duplicate image pairs that differ only in their compression scheme or cropping. As a result, it is quite common for an image to appear in both the gallery and the test set, which distorts accuracy statistics whether evaluating humans or machines. Standard datasets also contain many incorrect labels and low-quality images, the prevalence of which may be unequal across demographic groups. We also create a survey instrument that we administer to a sample of non-expert human participants (n = 545), and we ask machine models (both academically trained models and commercial APIs) the same survey questions. In comparing the results of these two modalities, we conclude that, first, humans and academic models both perform better on questions with male subjects. Second, humans and academic models both perform better on questions with light-skinned subjects. Third, humans perform better on questions where the subject looks like they do. Fourth, commercial APIs are phenomenally accurate at facial recognition, and we could not detect any major disparities in their performance across racial or gender lines. Finally, overall we found that academic models exhibit comparable levels of gender bias to humans, but are significantly more biased against darker skin types than humans.
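Duplicate pairs that differ only in recompression or cropping can be surfaced automatically with perceptual hashing before images are assigned to gallery or test sets. The following is a minimal sketch of that idea, assuming grayscale images stored as NumPy arrays; the hash size and Hamming-distance cutoff are illustrative choices, not the exact curation procedure used in this paper.

```python
import numpy as np

def average_hash(img, hash_size=8):
    # Perceptual hash of a 2D grayscale image: block-average down to a
    # hash_size x hash_size grid, then threshold each cell at the grid's mean.
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    blocks = img[: bh * hash_size, : bw * hash_size].astype(float)
    small = blocks.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return small > small.mean()  # boolean fingerprint (hash_size**2 bits)

def near_duplicate(img_a, img_b, max_hamming=5):
    # Images differing only in mild compression noise or re-encoding yield
    # nearly identical hashes; unrelated faces almost never do.
    distance = np.count_nonzero(average_hash(img_a) != average_hash(img_b))
    return distance <= max_hamming
```

Running every cross-set image pair through such a check (or bucketing by hash) flags candidates for manual review, preventing the gallery/test leakage described above.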

2. BACKGROUND AND PRIOR WORK

We provide a brief overview of facial recognition and additional related work, and further detail similar comparative studies that contrast the performance of humans and machines. Much of the discussion of bias overlaps with the sub-field of machine learning that focuses on social and societal harms. We refer the reader to Chouldechova & Roth (2018) and Barocas et al. (2019) for additional background on that broader ecosystem and the discussion around bias in machine learning.

Facial Recognition

In this overview, we focus on a review of the types of facial recognition technology rather than contrasting different implementations thereof. Within facial recognition, there are two broad categories of tasks: verification and identification. Verification asks a 1-to-1 question: is the person in the source image the same person as in the target image? Identification asks a 1-to-many question: given the person in a source image, where, if at all, does that person appear within a gallery composed of many target identities and their associated images? Modern facial recognition algorithms, such as those of He et al. (2016), Chen et al. (2018), Wang et al. (2018), and Deng et al. (2019), use deep neural networks to extract feature representations of faces and then compare those representations to match individuals. An overview of recent research on these topics can be found in Wang & Deng (2018). Other types of facial analysis technology include face detection, gender or age estimation, and facial expression recognition.

Bias in Facial Recognition

Bias has been studied in facial recognition for the past decade. Early work, like that of Klare et al. (2012) and O'Toole et al. (2012), focused on single-demographic effects (specifically, race and gender), whereas the more recent work of Buolamwini & Gebru (2018) uncovers unequal performance from an intersectional perspective, specifically between gender and skin tone. The latter work has been, and continues to be, hugely impactful both within academia and at the industry level. For example, the 2019 update to the NIST FRVT specifically focused on demographic mistreatment by commercial platforms (Grother et al., 2019). While our work focuses on the identification and comparison of bias, existing work on remedying the ills of socially impactful technology and unfair systems can be split into three (or, arguably, four (Savani et al., 2020)) focus areas: pre-, in-, and post-processing. Pre-processing work largely focuses on dataset curation and preprocessing (e.g., Feldman et al., 2015; Ryu et al., 2018; Quadrianto et al., 2019; Wang & Deng, 2020). In-processing often constrains the ML training method or optimization algorithm itself (e.g., Zafar et al., 2017a;b; Agarwal et al., 2018; Donini et al., 2018; Goel et al., 2018; Zafar et al., 2019; Diana et al., 2020; Lahoti et al., 2020; Martinez et al., 2020; Padala & Gujar, 2020; Wang & Deng, 2020), or focuses explicitly on so-called fair representation learning (e.g., Dwork et al., 2012; Zemel et al., 2013; Edwards & Storkey, 2016; Beutel et al., 2017; Madras et al., 2018; Wang et al., 2019; Adeli et al., 2021). Post-processing techniques adjust decisioning at inference time to align with fairness definitions (e.g., Hardt et al., 2016; Wang et al., 2020).

Human Performance Comparisons

No work in the past, to our knowledge, has specifically focused on the question of comparing bias or disparity between humans and machines. Some prior work
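The verification and identification tasks distinguished above can be sketched in a few lines once faces have been mapped to embedding vectors by a trained network. This is a minimal illustration under assumed ingredients (cosine similarity, a dictionary gallery, and an arbitrary threshold of 0.5); it is not the specific models or decision rules evaluated in this paper.

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity between two face embeddings (hypothetical vectors
    # produced by some trained network).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(source_emb, target_emb, threshold=0.5):
    # 1-to-1 verification: same identity iff similarity clears a threshold.
    return cosine_similarity(source_emb, target_emb) >= threshold

def identify(source_emb, gallery, threshold=0.5):
    # 1-to-many identification: return the best-matching gallery identity,
    # or None if no entry is similar enough ("if at all").
    best_id = max(gallery, key=lambda i: cosine_similarity(source_emb, gallery[i]))
    if cosine_similarity(source_emb, gallery[best_id]) < threshold:
        return None
    return best_id
```

The open-set case, where the probe may match no one in the gallery, is what the threshold in `identify` handles, and it is the reason identification is generally harder than verification.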

