A VIEW FROM SOMEWHERE: HUMAN-CENTRIC FACE REPRESENTATIONS

Abstract

Few datasets contain self-identified sensitive attributes, inferring attributes risks introducing additional biases, and collecting attributes can carry legal risks. Moreover, categorical labels can fail to reflect the continuous nature of human phenotypic diversity, making it difficult to compare the similarity between same-labeled faces. To address these issues, we present A View From Somewhere (AVFS), a dataset of 638,180 human judgments of face similarity. We demonstrate the utility of AVFS for learning a continuous, low-dimensional embedding space aligned with human perception. Our embedding space, induced under a novel conditional framework, not only enables the accurate prediction of face similarity, but also provides a human-interpretable decomposition of the dimensions used in the human decision-making process, as well as the importance distinct annotators place on each dimension. We additionally show the practicality of the dimensions for collecting continuous attributes, performing classification, and comparing dataset attribute disparities.

1. INTRODUCTION

The canonical approach to evaluating human-centric image dataset diversity is based on demographic attribute labels. Many equate diversity with parity across the subgroup distributions (Kay et al., 2015; Schwemmer et al., 2020), presupposing access to demographically labeled samples. However, most datasets are web scraped, lacking ground-truth information about image subjects (Andrews et al., 2023). Moreover, data protection legislation considers demographic attributes to be personal information and limits their collection and use (Andrus et al., 2021; 2020). Even when demographic labels are known, evaluating diversity based on subgroup counts fails to reflect the continuous nature of human phenotypic diversity (e.g., skin tone is often reduced to "light" vs. "dark"). Further, even within the same subpopulation, image subjects exhibit certain traits to a greater or lesser extent than others (Becerra-Riera et al., 2019; Carcagnì et al., 2015; Feliciano, 2016).
When labels are unknown, researchers typically choose certain attributes they consider to be relevant for human diversity and use human annotators to infer them (Karkkainen & Joo, 2021; Wang et al., 2019). Inferring labels, however, is difficult, especially for nebulous social constructs, e.g., race and gender (Hanna et al., 2020; Keyes, 2018), and can introduce additional biases (Freeman et al., 2011). Beyond the inclusion of derogatory categories (Koch et al., 2021; Birhane & Prabhu, 2021; Crawford & Paglen, 2019), label taxonomies often do not permit multi-group membership, resulting in the erasure of, e.g., multi-ethnic individuals (Robinson et al., 2020; Karkkainen & Joo, 2021). Significantly, discrepancies between inferred and self-identified attributes can induce psychological distress by invalidating an individual's self-image (Campbell & Troyer, 2007; Roth, 2016).
In this work, we avoid problematic semantic labels altogether and propose to learn a face embedding space aligned with human perception. To do so, we introduce A View From Somewhere (AVFS), a dataset of 638,180 face similarity judgments over 4,921 faces. Each judgment corresponds to the odd-one-out (i.e., least similar) face in a triplet of faces and is accompanied by both the identifier and demographic attributes of the annotator who made the judgment. Our embedding space, induced under a novel conditional framework, not only enables the accurate prediction of face similarity, but also

