A DEEP DIVE INTO DATASET IMBALANCE AND BIAS IN FACE IDENTIFICATION

Abstract

As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center imbalance as the main source of bias, i.e., the claim that FR models perform worse on images of non-white people or women because these demographic groups are underrepresented in training data. Recent academic research paints a more nuanced picture of this relationship. However, previous studies of data imbalance in FR have focused exclusively on the face verification setting, while the face identification setting has been largely ignored, despite being deployed in sensitive applications such as law enforcement. This is an unfortunate omission, as 'imbalance' is a more complex matter in identification: imbalance may arise not only in the training data but also in the testing data, and furthermore may affect either the proportion of identities belonging to each demographic group or the number of images belonging to each identity. In this work, we address this gap in the research by thoroughly exploring the effects of each kind of imbalance possible in face identification, and by discussing other factors which may impact bias in this setting.

1. INTRODUCTION

Automated face recognition is becoming increasingly prevalent in modern life, with applications ranging from improving user experience (such as automatic face-tagging of photos) to security (e.g., phone unlocking or crime suspect identification). While these advances are impressive achievements, decades of research have demonstrated disparate performance in FR systems depending on a subject's race (Phillips et al., 2011; Cavazos et al., 2020), gender presentation (Alvi et al., 2018; Albiero et al., 2020), age (Klare et al., 2012), and other factors. This is especially concerning for FR systems deployed in sensitive applications like law enforcement; incorrectly tagging a personal photo may be a mild inconvenience, but incorrectly identifying the subject of a surveillance image could have life-changing consequences. Accordingly, media and public scrutiny of bias in these systems has increased, in some cases resulting in policy changes.

One major source of model bias is dataset imbalance: disparities in the rates at which different groups are represented in the dataset. Modern FR systems employ neural networks trained on large datasets, so naturally much contemporary work focuses on which aspects of the training data may contribute to unequal performance across demographic groups. Potential sources that have been studied include imbalance in the proportion of data belonging to each group (Wang & Deng, 2020; Gwilliam et al., 2021), low-quality or poorly annotated images (Dooley et al., 2021), and confounding variables entangled with group membership (Klare et al., 2012; Kortylewski et al., 2018; Albiero et al., 2020).

Dataset imbalance is a much more complex and nuanced issue than it may seem at first blush. While a naive conception of 'dataset imbalance' is simply a disparity in the number of images per group, this disparity can manifest either as a gap in the number of identities per group or as a gap in the number of images per identity.
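The distinction between the two kinds of imbalance can be made concrete with a small counting sketch. The helper below is illustrative (the function name and data layout are our own, not from any particular FR library): it takes one `(identity, group)` tuple per image and reports, per group, the number of identities and the mean number of images per identity. The two toy datasets have the same 6:2 ratio of male to female images, yet one is imbalanced in identities and the other in images per identity.

```python
from collections import Counter

def imbalance_profile(labels):
    """Summarize two distinct kinds of dataset imbalance.

    `labels` is a list of (identity_id, group) tuples, one per image.
    Returns, for each group, the number of identities and the mean
    number of images per identity.
    """
    images_per_identity = Counter()  # identity -> image count
    group_of = {}                    # identity -> demographic group
    for identity, group in labels:
        images_per_identity[identity] += 1
        group_of[identity] = group

    totals = {}  # group -> (num identities, total images)
    for identity, count in images_per_identity.items():
        g = group_of[identity]
        ids, imgs = totals.get(g, (0, 0))
        totals[g] = (ids + 1, imgs + count)
    return {g: {"identities": ids, "images_per_identity": imgs / ids}
            for g, (ids, imgs) in totals.items()}

# Same 6:2 image ratio, two different structures of imbalance:
more_ids = [("m1", "M"), ("m2", "M"), ("m3", "M"),
            ("m4", "M"), ("m5", "M"), ("m6", "M"),
            ("f1", "F"), ("f2", "F")]                     # more male identities
more_imgs = ([("m1", "M")] * 3 + [("m2", "M")] * 3
             + [("f1", "F"), ("f2", "F")])                # more images per male identity

print(imbalance_profile(more_ids))
print(imbalance_profile(more_imgs))
```

Both datasets would look identical to a per-group image count, which is exactly why the two axes of imbalance must be controlled separately.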
Furthermore, dataset imbalance can be present in different ways in both the training and testing data, and these two sources of imbalance can have radically different (and often opposite) effects on downstream model bias. Past work has considered only the verification setting of FR, where testing consists of determining whether a pair of images belongs to the same identity; as such, 'imbalance' between demographic groups is not a meaningful concept in the test data. Moreover, the distinction between imbalance of identities belonging to a certain demographic group and imbalance of images per identity in each demographic group has not been carefully studied in either the testing or the training data. All of these facets of imbalance are present in the face identification setting, where testing involves matching a probe image to a gallery of many identities, each of which contains multiple images. We illustrate this in Figure 1.

In this work, we unravel the complex effects that dataset imbalance can have on model bias for face identification systems. We separately consider imbalance (both in terms of identities and of images per identity) in the train set and in the test set. We also consider the realistic use case in which a large dataset is collected from an imbalanced population and then split at random, resulting in similar imbalance in both the train and test sets. We specifically focus on imbalance with respect to gender presentation for two reasons: when restricting to male- and female-identified individuals, the proportion of data in each group can be tuned as a single parameter, and an ethically obtained identification dataset with gender presentation metadata is available at sufficient size to permit subsampling without significantly degrading overall performance. Our findings show that each type of imbalance has a distinct effect on a model's performance for each gender presentation. Furthermore, in the realistic scenario where the train and test sets are similarly imbalanced, the two sources of imbalance can interact in a way that leads to systematic underestimation of a model's true bias during an audit. Any audit of model bias in face identification must therefore carefully control for these effects.
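The identification protocol described above, and the per-group accuracy a bias audit would compare, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the paper's experimental code: we assume a feature extractor has already mapped every image to an embedding vector, and the helper names (`rank1_identify`, `per_group_accuracy`) are hypothetical.

```python
import numpy as np

def rank1_identify(probe, gallery_feats, gallery_ids):
    """Rank-1 identification: return the identity of the gallery image
    whose embedding has the highest cosine similarity with the probe."""
    probe = probe / np.linalg.norm(probe)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return gallery_ids[int(np.argmax(g @ probe))]

def per_group_accuracy(probes, true_ids, groups, gallery_feats, gallery_ids):
    """Rank-1 accuracy computed separately per demographic group --
    the basic quantity an identification bias audit compares."""
    hits, totals = {}, {}
    for feat, ident, grp in zip(probes, true_ids, groups):
        totals[grp] = totals.get(grp, 0) + 1
        if rank1_identify(feat, gallery_feats, gallery_ids) == ident:
            hits[grp] = hits.get(grp, 0) + 1
    return {g: hits.get(g, 0) / n for g, n in totals.items()}

# Toy gallery, imbalanced in images per identity: two images of "a", one of "b".
gallery_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
gallery_ids = ["a", "a", "b"]
probes = [np.array([0.95, 0.05]), np.array([0.1, 0.9])]
print(per_group_accuracy(probes, ["a", "b"], ["M", "F"], gallery_feats, gallery_ids))
```

Note that the gallery composition directly enters the metric: an identity with more gallery images offers more chances for a nearest-neighbor match, which is one mechanism by which test-set imbalance alone can shift per-group accuracy.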



Figure 1: Examples of imbalance in face identification. Top left: data containing more female identities than male identities. Top right: data containing the same number of male and female identities, but more images per male identity. Bottom: two possible test (gallery) sets showing how the effects of different kinds of imbalance may interact.

The remainder of this paper is structured as follows: Section 2 discusses related work, and Section 3 introduces the problem and experimental setup. Sections 4 and 5 give experimental results related to imbalance in the training set and test set, respectively, and Section 6 gives results for experiments where the imbalance in the training set and test set are identical. In Section 7.1, we evaluate randomly initialized feature extractors on test sets with various levels of imbalance to further isolate the effects of this imbalance from the effects of training. In Section 7.2, we investigate the correlation between the performance of models trained with various levels of imbalance and human performance.

2. RELATED WORK

Even before the advent of neural network-based face recognition systems, researchers studied how the composition of training data affects verification performance. Phillips et al. (2011) compared algorithms from the Face Recognition Vendor Test (Phillips et al., 2009) and found that those developed in East Asia performed better on East Asian faces, and those developed in Western

