A VIEW FROM SOMEWHERE: HUMAN-CENTRIC FACE REPRESENTATIONS

Abstract

Few datasets contain self-identified sensitive attributes, inferring attributes risks introducing additional biases, and collecting attributes can carry legal risks. Moreover, categorical labels can fail to reflect the continuous nature of human phenotypic diversity, making it difficult to compare the similarity between same-labeled faces. To address these issues, we present A View From Somewhere (AVFS), a dataset of 638,180 human judgments of face similarity. We demonstrate the utility of AVFS for learning a continuous, low-dimensional embedding space aligned with human perception. Our embedding space, induced under a novel conditional framework, not only enables the accurate prediction of face similarity, but also provides a human-interpretable decomposition of the dimensions used in the human decision-making process, and the importance distinct annotators place on each dimension. We additionally show the practicality of the dimensions for collecting continuous attributes, performing classification, and comparing dataset attribute disparities.

1. INTRODUCTION

The canonical approach to evaluating human-centric image dataset diversity is based on demographic attribute labels. Many equate diversity with parity across the subgroup distributions (Kay et al., 2015; Schwemmer et al., 2020), presupposing access to demographically labeled samples. However, most datasets are web scraped, lacking ground-truth information about image subjects (Andrews et al., 2023). Moreover, data protection legislation considers demographic attributes to be personal information and limits their collection and use (Andrus et al., 2021; 2020). Even when demographic labels are known, evaluating diversity based on subgroup counts fails to reflect the continuous nature of human phenotypic diversity (e.g., skin tone is often reduced to "light" vs. "dark"). Further, even within the same subpopulation, image subjects exhibit certain traits to a greater or lesser extent than others (Becerra-Riera et al., 2019; Carcagnì et al., 2015; Feliciano, 2016). When labels are unknown, researchers typically choose certain attributes they consider to be relevant for human diversity and use human annotators to infer them (Karkkainen & Joo, 2021; Wang et al., 2019). Inferring labels, however, is difficult, especially for nebulous social constructs, e.g., race and gender (Hanna et al., 2020; Keyes, 2018), and can introduce additional biases (Freeman et al., 2011). Beyond the inclusion of derogatory categories (Koch et al., 2021; Birhane & Prabhu, 2021; Crawford & Paglen, 2019), label taxonomies often do not permit multi-group membership, resulting in the erasure of, e.g., multi-ethnic individuals (Robinson et al., 2020; Karkkainen & Joo, 2021). Significantly, discrepancies between inferred and self-identified attributes can induce psychological distress by invalidating an individual's self-image (Campbell & Troyer, 2007; Roth, 2016).
In this work, we avoid problematic semantic labels altogether and propose to learn a face embedding space aligned with human perception. To do so, we introduce A View From Somewhere (AVFS), a dataset of 638,180 face similarity judgments over 4,921 faces. Each judgment corresponds to the odd-one-out (i.e., least similar) face in a triplet of faces and is accompanied by both the identifier and demographic attributes of the annotator who made the judgment. Our embedding space, induced under a novel conditional framework, not only enables the accurate prediction of face similarity, but also provides a human-interpretable decomposition of the dimensions used in the human decision-making process, as well as the importance distinct annotators place on each dimension. We demonstrate that the individual embedding dimensions (1) are related to concepts of gender, ethnicity, age, as well as face and hair morphology; and (2) can be used to collect continuous attributes, perform classification, and compare dataset attribute disparities. We further show that annotators are influenced by their sociocultural backgrounds, underscoring the need for diverse annotator groups to mitigate bias.

2. RELATED WORK

Similarity. The human mind is conjectured to have "a considerable investment in similarity" (Medin et al., 1993). When two objects are compared, they mutually constrain the set of features that are activated in the human mind (Markman, 1996); i.e., features are dynamically discovered and aligned based on what is being compared. Alignment is contended to be central to similarity comparisons. Shanon (1988) goes as far as arguing that similarity is not a function of features, but that the features themselves are a function of the similarity comparison.

Contextual similarity. Human perception of similarity can vary with respect to context (Roth & Shoben, 1983). Context makes salient context-related properties and the extent to which the objects being compared share these properties (Medin et al., 1993; Markman & Gentner, 2005; Goodman, 1972). For example, Barsalou (1987) found that "snakes" and "raccoons" were judged to be more similar when placed in the context of pets than when no context was provided. The odd-one-out similarity task (Zheng et al., 2019) used in this work also provides context. By varying the context (i.e., the third object) in which two objects are experienced, it is possible to uncover different features that contribute to their pairwise similarity. This is important as there are an uncountable number of ways in which two objects may be similar (Love & Roads, 2021).

Psychological embeddings. Multidimensional scaling (MDS) is often used to learn psychological embeddings from human similarity judgments (Zheng et al., 2019; Roads & Love, 2021; Dima et al., 2022; Josephs et al., 2021). As MDS approaches cannot embed images outside of the training set, researchers have used pretrained models as feature extractors (Sanders & Nosofsky, 2020; Peterson et al., 2018; Attarian et al., 2020), which can introduce unwanted implicit biases (Krishnakumar et al., 2021; Steed & Caliskan, 2021). Moreover, previous approaches ignore inter- and intra-annotator variability. By contrast, our conditional model is trained end-to-end and can embed any arbitrary face from the perspective of a specific annotator.

Face datasets. Most face datasets are semantically labeled images, created for the purposes of identity and attribute recognition (Karkkainen & Joo, 2021; Liu et al., 2015; Huang et al., 2008; Cao et al., 2018). An implicit assumption is that semantic similarity is equivalent to visual similarity (Deselaers & Ferrari, 2011). However, many semantic categories are functional (Rosch, 1975; Rothbart & Taylor, 1992), i.e., unconstrained by visual features such as shape, color, and material. Moreover, semantic labels often only indicate the presence or absence of an attribute, as opposed to its magnitude, making it difficult to compare the similarity between same-labeled samples. While face similarity datasets exist, the judgments narrowly pertain to identity (McCauley et al., 2021; Sadovnik et al., 2018; Somai & Hancock, 2021) and expression similarity (Vemulapalli & Agarwala, 2019).

Annotator positionality. Semantic categorization by annotators depends not only on the image subject, but also on extrinsic contextual cues (Freeman et al., 2011) and the annotators' sociocultural backgrounds (Segall et al., 1966). Despite this, annotator positionality is rarely discussed in computer vision (Chen & Joo, 2021; Zhao et al., 2021; Denton et al., 2021); "only five publications [from 113] provided any [annotator] demographic information" (Scheuerman et al., 2021). To our knowledge, AVFS represents the first human-centric vision dataset where each annotation is associated with the annotator who created it and their demographics, permitting the study of annotator bias.
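As a concrete illustration, the standard odd-one-out choice model (Zheng et al., 2019) treats picking one face as the odd-one-out as equivalent to judging the remaining pair most similar, and places a softmax over the three pairwise embedding similarities. The sketch below is a minimal NumPy version of this idea; the function and variable names are ours, and the optional weight vector only gestures at how a conditional model could encode annotator-specific dimension importance, rather than reproducing our full framework.

```python
import numpy as np

def odd_one_out_probs(z_i, z_j, z_k, w=None):
    """Softmax choice model for a 3AFC odd-one-out triplet.

    Choosing a face as odd-one-out is equivalent to judging the remaining
    pair most similar, so the choice probabilities are a softmax over the
    three pairwise similarities (weighted dot products of the embeddings).
    The optional non-negative vector `w` reweights embedding dimensions,
    sketching annotator-specific dimension importance.
    """
    if w is None:
        w = np.ones_like(z_i)
    s_ij = np.sum(w * z_i * z_j)  # pair remaining if k is the odd-one-out
    s_ik = np.sum(w * z_i * z_k)  # pair remaining if j is the odd-one-out
    s_jk = np.sum(w * z_j * z_k)  # pair remaining if i is the odd-one-out
    logits = np.array([s_jk, s_ik, s_ij])  # entry 0: i odd, 1: j odd, 2: k odd
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Two near-identical faces and one distinct face: the distinct face (index 2)
# should be the most probable odd-one-out.
z_i, z_j, z_k = np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.0, 1.0])
probs = odd_one_out_probs(z_i, z_j, z_k)
```

Because the third face changes which pairwise similarity is "left over", varying it across triplets probes different features of the same pair, which is what makes the 3AFC task context-sensitive.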

3. A VIEW FROM SOMEWHERE DATASET

To learn a face embedding space aligned with human perception, we collect AVFS, a dataset of human odd-one-out similarity judgments. An odd-one-out judgment corresponds to the least similar face in a triplet of faces and represents a three-alternative forced choice (3AFC) task. AVFS dataset documentation can be found in Appendix C.
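Conceptually, each AVFS judgment pairs a face triplet and the chosen odd-one-out with the annotator's identifier and self-reported demographics. The dataclass below is a hypothetical schema for one such record, useful only to fix ideas; field names and values are illustrative, not the released file format (see Appendix C for the actual documentation).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass(frozen=True)
class TripletJudgment:
    """One 3AFC odd-one-out judgment (illustrative schema only)."""
    face_ids: Tuple[int, int, int]   # the three faces shown in the triplet
    odd_one_out: int                 # index (0-2) of the least similar face
    annotator_id: str                # stable identifier of the annotator
    annotator_demographics: Dict[str, str] = field(default_factory=dict)

judgment = TripletJudgment(
    face_ids=(12, 345, 678),
    odd_one_out=2,                           # the annotator chose the third face
    annotator_id="annotator-0007",           # hypothetical identifier
    annotator_demographics={"age": "25-34"}, # hypothetical self-reported field
)
chosen_face = judgment.face_ids[judgment.odd_one_out]
```

Keeping the annotator identifier and demographics on every record, rather than aggregating votes per triplet, is what permits the per-annotator conditioning and annotator-bias analyses described above.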

