COMPARING HUMAN AND MACHINE BIAS IN FACE RECOGNITION

Abstract

Much recent research has uncovered and discussed serious concerns about bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias, but they face two major challenges: the audits (1) rely on facial recognition datasets, such as LFW and CelebA, that lack quality metadata, and (2) do not compare the observed algorithmic bias to the biases of the human alternatives. In this paper, we release improvements to the LFW and CelebA datasets that will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g., identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions, which we administered to various algorithms and to a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better on the verification task than on the identification task, generally obtain lower accuracy on dark-skinned or female subjects in both tasks, and obtain higher accuracy when their own demographics match those of the subjects in a question. Academic models exhibit levels of gender bias comparable to humans but are significantly more biased against darker skin types than humans are.
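One dataset flaw named above, identical images leaking from the gallery into the test set, is straightforward to screen for. Below is a minimal sketch (not the authors' tooling) that flags exact duplicates by comparing content hashes; the directory layout and file extension are hypothetical placeholders, and near-duplicates (re-crops, re-encodes) would additionally require perceptual hashing.

```python
# Minimal sketch: flag images that appear in both the gallery and the test
# set by hashing raw file bytes. Paths below are hypothetical placeholders.
import hashlib
from pathlib import Path

def content_hashes(image_dir: str) -> dict[str, str]:
    """Map each image path to the SHA-256 hash of its raw bytes."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(image_dir).glob("*.jpg")
    }

gallery = content_hashes("lfw/gallery")  # hypothetical directory layout
test = content_hashes("lfw/test")

leaked = set(gallery.values()) & set(test.values())
for path, digest in test.items():
    if digest in leaked:
        print(f"duplicate of a gallery image found in test set: {path}")
```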

1. INTRODUCTION

Facial analysis systems have been the topic of intense research for decades, and their deployments have been criticized in recent years for intruding on privacy and for treating demographic groups differently. Companies and governments have deployed facial recognition systems (Derringer, 2019; Hartzog, 2020; Weise & Singer, 2020) with a wide variety of applications, from the relatively mundane, e.g., improved search through personal photos (Google, 2021), to the rather controversial, e.g., target identification in warzones (Marson & Forrest, 2021). A flashpoint issue for facial analysis systems is their potential for results that are biased across demographic groups (Garvie, 2016; Lohr, 2018; Buolamwini & Gebru, 2018; Grother et al., 2019; Dooley et al., 2021), which makes facial recognition controversial for socially important applications, such as use in law enforcement or the criminal justice system. To make matters worse, many studies of machine bias in face recognition use datasets that are themselves imbalanced or riddled with errors, resulting in inaccurate measurements of machine bias.

It is now widely accepted that computers perform as well as or better than humans, in terms of accuracy, on a variety of facial recognition tasks (Lu & Tang, 2015; Grother et al., 2019), but what about bias? Algorithms' superior overall performance and speed of inference make facial recognition technologies widely appealing in many domains, and their use comes at heightened cost to those surveilled, monitored, or targeted by it (Lewis, 2019; Kostka et al., 2021). Many previous studies that examine and critique these technologies through algorithmic audits go only as far as the algorithms' biases; they stop short of comparing those biases to the biases of the human alternatives. In this study, we ask how algorithmic bias compares to human bias, filling one of the largest omissions in the facial recognition bias literature.

We investigate these questions by creating a dataset through extensive hand curation that improves upon previous facial recognition bias auditing datasets, using images from two common facial recognition datasets (Huang et al., 2008; Liu et al., 2015) and fixing many of the imbalances and errors present in those data.
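To make concrete the two tasks on which we compare humans and machines, the following is a minimal sketch of a common embedding-based setup, assuming precomputed unit-norm face embeddings and cosine similarity; it is illustrative of the task definitions, not necessarily the pipeline used by any model audited in this paper, and the threshold value is arbitrary.

```python
# Verification is a 1:1 decision ("same person?") against a threshold;
# identification is a 1:N search for the best-matching gallery identity.
# Assumes unit-norm embeddings, so the dot product is cosine similarity.
import numpy as np

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """1:1 verification: do two face images depict the same person?"""
    return float(emb_a @ emb_b) >= threshold

def identify(probe: np.ndarray, gallery: np.ndarray, labels: list[str]) -> str:
    """1:N identification: return the gallery identity closest to the probe."""
    scores = gallery @ probe  # one similarity score per gallery row
    return labels[int(np.argmax(scores))]
```

Framed this way, identification is generally the harder task, since a probe must beat every distractor in the gallery rather than a single fixed threshold, which is consistent with the accuracy gap between the two tasks reported in the abstract.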

