ON THE IMPORTANCE OF ARCHITECTURES AND HYPER-PARAMETERS FOR FAIRNESS IN FACE RECOGNITION

Anonymous

Abstract

Face recognition systems are deployed across the world by government agencies and contractors for sensitive and impactful tasks, such as surveillance and database matching. Despite their widespread use, these systems are known to exhibit bias across a range of sociodemographic dimensions, such as gender and race. Nonetheless, an array of works proposing pre-processing, training, and post-processing methods has failed to close these gaps. Here, we take a very different approach to this problem, identifying that both architectures and hyperparameters of neural networks are instrumental in reducing bias. We first run a large-scale analysis of the impact of architectures and training hyperparameters on several common fairness metrics and show that the implicit convention of choosing high-accuracy architectures may be suboptimal for fairness. Motivated by our findings, we run the first neural architecture search for fairness, jointly with a search for hyperparameters. We output a suite of models which Pareto-dominate all other competitive architectures in terms of accuracy and fairness. Furthermore, we show that these models transfer well to other face recognition datasets with similar and distinct protected attributes. We release our code and raw result files so that researchers and practitioners can replace our fairness metrics with a bias measure of their choice.

1. INTRODUCTION

Face recognition is regularly deployed across the world by government agencies for tasks including surveillance, employment, and housing decisions. However, recent studies have shown that face recognition systems exhibit disparities in accuracy based on race and gender (Grother et al., 2019; Raji et al., 2020; Raji & Fried, 2021; Learned-Miller et al., 2020). For example, some face recognition models were 10 or 100 times more likely to give false positives for Black or Asian people than for white people (Allyn, 2020). This bias has already led to multiple false arrests and jail time for innocent Black men in the USA (Hill, 2020a).

Motivated by the discovery of bias in face recognition and other models deployed in real-world applications, dozens of definitions of fairness have been proposed (Verma & Rubin, 2018), and many pre-processing, training, and post-processing techniques have been developed to mitigate model bias. However, these techniques have fallen short of de-biasing face recognition systems, and training fair models in this setting demands addressing several technical challenges (Cherepanova et al., 2021b).

While existing methods for de-biasing face recognition systems use a fixed neural network architecture and training hyperparameter setting, we instead ask a fundamental question which has received little attention: does model bias stem from the architecture and hyperparameters? We further ask whether we can exploit the extensive research in the fields of neural architecture search (NAS) (Elsken et al., 2019) and hyperparameter optimization (HPO) (Feurer & Hutter, 2019) to search for models that achieve a desired trade-off between model bias and accuracy. In this work, we take the first step towards answering these questions. To this end, we conduct the first large-scale analysis of the relationship between hyperparameters, architectures, and bias.
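The fairness metrics we analyze are defined precisely later in the paper; purely to illustrate the kind of quantity involved, one simple disparity measure is the gap between the best and worst per-group accuracy. A minimal sketch (the function names and the max-min gap are illustrative choices, not the paper's exact metrics):

```python
from collections import defaultdict

def group_accuracies(predictions, labels, groups):
    """Accuracy computed separately per protected group (e.g., perceived gender)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

def accuracy_disparity(predictions, labels, groups):
    """Gap between the best- and worst-served groups; 0.0 means equal accuracy."""
    accs = group_accuracies(predictions, labels, groups)
    return max(accs.values()) - min(accs.values())
```

A model can then be scored by the pair (overall error, disparity), which is the shape of the accuracy-fairness trade-off studied throughout this paper.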
We train a diverse set of 29 architectures, ranging from ResNets (He et al., 2016b) to vision transformers (Dosovitskiy et al., 2020; Liu et al., 2021) to Gluon Inception V3 (Szegedy et al., 2016) to MobileNetV3 (Howard et al., 2019), on CelebA (Liu et al., 2015), for a total of 88,493 GPU hours. We train each of these architectures across different head, optimizer, and learning rate combinations. Our results show that different architectures learn different inductive biases from the same dataset. We conclude that the implicit convention of choosing the highest-accuracy architecture can be detrimental to fairness, and that architecture and hyperparameters play a significant role in determining the fairness-accuracy trade-off.

Next, we exploit this observation in order to design architectures with a better fairness-accuracy trade-off. We initiate the study of NAS for fairness; specifically, we run NAS+HPO to jointly optimize fairness and accuracy. To tackle this problem, we construct a search space based on the highest-performing architecture from our analysis and use the Sequential Model-based Algorithm Configuration method (SMAC) (Lindauer et al., 2022) for multi-objective architecture and hyperparameter search. We discover a Pareto frontier of face recognition models that outperform existing state-of-the-art models on both accuracy and multiple fairness metrics. An overview of our methodology can be found in Figure 1. We release all of our code and raw results at https://anonymous.4open.science/r/FR-NAS-92EC so that users can adapt our work to any bias measure of their choice.

Our contributions. We summarize our main contributions below:

• We provide a new bias mitigation strategy which identifies that architectures have a profound influence on fairness, and then exploits that insight in order to design fairer architectures via neural architecture search and hyperparameter optimization.

• We conduct a large-scale study of 29 architectures, each trained across a variety of hyperparameters, totalling 88,493 GPU hours, showing that architectures and hyperparameters have a substantial impact on fairness.

• We conduct the first neural architecture search for fairness, jointly with hyperparameter optimization and optimizing for accuracy, culminating in a set of architectures which Pareto-dominate all models in a large set of modern architectures.

• Our new architectures outperform the current state-of-the-art architecture, ArcFace (Deng et al., 2019), when training and testing on CelebA and VGGFace2, and when training on CelebA and testing on other face recognition datasets (LFW, CFP-FP, CPLFW, AgeDB, and CALFW). Furthermore, our architectures transfer well across different protected attributes (Section 4.3.1).
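Pareto dominance, which underlies these claims, has a compact definition: one model dominates another if it is no worse on every objective and strictly better on at least one. A minimal sketch, assuming each model is summarized by an (error, bias) pair where lower is better for both (the tuple representation is illustrative, not our evaluation code):

```python
def dominates(a, b):
    """True if point a Pareto-dominates point b.
    Each point is a tuple of objectives where lower is better,
    e.g., (identification error, accuracy disparity)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the subset of points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

A multi-objective search such as SMAC returns an approximation of this front rather than a single best configuration, leaving the final accuracy-fairness trade-off to the practitioner.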

2. BACKGROUND AND RELATED WORK

While our work is the first to leverage neural architecture search (NAS) to build fair models, a body of prior work exists in the fields of NAS and face recognition, which we discuss here.

Face Recognition. Face recognition tasks fall into two categories: verification and identification. Verification asks whether the person in a source image is the same person as in a target image; this is a one-to-one comparison. Identification instead asks whether a given person in a source image appears within a gallery composed of many target identities and their associated images; this is a one-to-many comparison. Novel techniques in face recognition, such as ArcFace (Deng et al., 2019), CosFace (Wang et al., 2018), and MagFace (Meng et al., 2021), use deep networks (often called the backbone) to extract feature representations of faces and then compare those representations to match individuals (with mechanisms called the head). Generally, backbones take the form of image feature extractors, and heads resemble MLPs with specialized loss functions. Often, the term "head" refers to both the last layer of the network and the loss function. We focus our analysis on identification,
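The one-to-one versus one-to-many distinction can be made concrete in terms of embedding comparisons. A schematic sketch (cosine similarity and the 0.5 threshold are illustrative choices, not the matching rule of any particular head):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verify(emb_src, emb_tgt, threshold=0.5):
    """Verification: one-to-one -- do two embeddings depict the same identity?"""
    return cosine_similarity(emb_src, emb_tgt) >= threshold

def identify(emb_src, gallery):
    """Identification: one-to-many -- which gallery identity matches best?
    `gallery` maps identity name -> embedding produced by the backbone."""
    return max(gallery, key=lambda name: cosine_similarity(emb_src, gallery[name]))
```

In both settings the backbone produces the embeddings, and the head's loss function shapes the embedding space so that same-identity pairs score higher than different-identity pairs.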



Figure 1: Overview of our methodology.


