A DEEP DIVE INTO DATASET IMBALANCE AND BIAS IN FACE IDENTIFICATION

Abstract

As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center imbalance as the main source of bias, i.e., that FR models perform worse on images of non-white people or women because these demographic groups are underrepresented in training data. Recent academic research paints a more nuanced picture of this relationship. However, previous studies of data imbalance in FR have focused exclusively on the face verification setting, while the face identification setting has been largely ignored, despite being deployed in sensitive applications such as law enforcement. This is an unfortunate omission, as 'imbalance' is a more complex matter in identification; imbalance may arise not only in the training data, but also in the testing data, and furthermore may affect either the proportion of identities belonging to each demographic group or the number of images belonging to each identity. In this work, we address this gap in the research by thoroughly exploring the effects of each kind of imbalance possible in face identification, and discussing other factors which may impact bias in this setting.

1. INTRODUCTION

Automated face recognition is becoming increasingly prevalent in modern life, with applications ranging from improving user experience (such as automatic face-tagging of photos) to security (e.g., phone unlocking or crime suspect identification). While these advances are impressive achievements, decades of research have demonstrated disparate performance in FR systems depending on a subject's race (Phillips et al., 2011; Cavazos et al., 2020), gender presentation (Alvi et al., 2018; Albiero et al., 2020), age (Klare et al., 2012), and other factors. This is especially concerning for FR systems deployed in sensitive applications like law enforcement; incorrectly tagging a personal photo may be a mild inconvenience, but incorrectly identifying the subject of a surveillance image could have life-changing consequences. Accordingly, media and public scrutiny of bias in these systems has increased, in some cases resulting in policy changes.

One major source of model bias is dataset imbalance: disparities in rates of representation of different groups in the dataset. Modern FR systems employ neural networks trained on large datasets, so naturally much contemporary work focuses on which aspects of the training data may contribute to unequal performance across demographic groups. Some potential sources that have been studied include imbalance in the proportion of data belonging to each group (Wang & Deng, 2020; Gwilliam et al., 2021), low-quality or poorly annotated images (Dooley et al., 2021), and confounding variables entangled with group membership (Klare et al., 2012; Kortylewski et al., 2018; Albiero et al., 2020).

Dataset imbalance is a much more complex and nuanced issue than it may seem at first blush. While a naive conception of 'dataset imbalance' is simply a disparity in the number of images per group, this disparity can manifest itself as either a gap in the number of identities per group, or in the number of images per identity.
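To make this distinction concrete, the following sketch uses a small hypothetical list of (identity, group) image labels; the dataset, group names, and counts are illustrative assumptions, not taken from any benchmark. It shows how the same image-level disparity can decompose into identities-per-group or images-per-identity imbalance.

```python
from collections import Counter

# Hypothetical image labels: one (identity, demographic group) pair per image.
# Group A: 2 identities with 4 images each; group B: 4 identities with 1 image each.
images = [("id0", "A")] * 4 + [("id1", "A")] * 4 + \
         [(f"id{i}", "B") for i in range(2, 6)]

# Naive view: images per group (A is "overrepresented" 8 vs. 4).
images_per_group = Counter(group for _, group in images)

# Decomposition 1: identities per group (B actually has MORE identities).
ids_per_group = {g: len({i for i, grp in images if grp == g})
                 for g in images_per_group}

# Decomposition 2: average images per identity, per group.
imgs_per_id = {g: images_per_group[g] / ids_per_group[g]
               for g in images_per_group}

print(images_per_group)  # A: 8 images, B: 4 images
print(ids_per_group)     # A: 2 identities, B: 4 identities
print(imgs_per_id)       # A: 4.0 images/identity, B: 1.0 images/identity
```

The point of the sketch is that the headline image-count gap alone does not determine which of the two underlying imbalances is present, and the two may plausibly affect a trained model differently.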
Furthermore, dataset imbalance can be present in different ways in both the training and testing data, and these two sources of imbalance can have radically different (and often opposite) effects on downstream model bias. Past work has only considered the verification setting of FR, where testing consists of determining whether a pair of images belongs to the same identity. As such, 'imbalance' between demographic groups is not a meaningful concept in the test data. Furthermore, the distinction between imbalance
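The protocol difference between the two settings can be sketched as follows. This is a minimal illustration assuming unit-normalized embeddings compared by cosine similarity (a common FR convention); the embeddings here are random stand-ins rather than outputs of any real model, and the threshold value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Project embeddings onto the unit sphere so dot product = cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in gallery: one enrolled 128-d embedding per identity (5 identities).
gallery = normalize(rng.normal(size=(5, 128)))

# Probe: a slightly perturbed copy of identity 2's embedding (a "new photo").
probe = normalize(gallery[2] + 0.05 * rng.normal(size=128))

# Verification: a single pairwise decision against a fixed threshold.
is_same = float(probe @ gallery[2]) > 0.5

# Identification: rank the probe against the ENTIRE gallery and return the top match.
predicted_identity = int(np.argmax(gallery @ probe))

print(is_same, predicted_identity)
```

Because identification scores the probe against every enrolled identity at once, the demographic composition of the test-time gallery itself becomes a source of imbalance, which has no analogue in the pairwise verification protocol.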

