EVERYBODY NEEDS GOOD NEIGHBOURS: AN UNSUPERVISED LOCALITY-BASED METHOD FOR BIAS MITIGATION

Abstract

Learning models from human behavioural data often leads to outputs that are biased with respect to user demographics, such as gender or race. This effect can be controlled by explicit mitigation methods, but this typically presupposes access to demographically-labelled training data. Such data is often not available, motivating the need for unsupervised debiasing methods. To this end, we propose a new meta-algorithm for debiasing representation learning models, which combines the notions of data locality and accuracy of model fit, such that a supervised debiasing method can optimise fairness between neighbourhoods of poorly vs. well modelled instances as identified by our method. Results over five datasets, spanning natural language processing and structured data classification tasks, show that our technique recovers proxy labels that correlate with unknown demographic data, and that our method outperforms all unsupervised baselines, while also achieving competitive performance with state-of-the-art supervised methods which are given access to demographic labels.

1. INTRODUCTION

It is well known that naively-trained models potentially make biased predictions even if demographic information (such as gender, age, or race) is not explicitly observed in training, leading to discrimination such as opportunity inequality (Hovy & Søgaard, 2015; Hardt et al., 2016). Although a range of fairness metrics (Hardt et al., 2016; Blodgett et al., 2016) and debiasing methods (Elazar & Goldberg, 2018; Wang et al., 2019; Ravfogel et al., 2020) have been proposed to measure and improve fairness in model predictions, they generally require access to protected attributes during training. However, protected labels are often not available (e.g., due to privacy or security concerns), motivating the need for unsupervised debiasing methods, i.e., debiasing without access to demographic variables. Previous unsupervised debiasing work has mainly focused on improving the worst-performing groups, which does not generalize well to ensuring performance parity across all protected groups (Hashimoto et al., 2018; Lahoti et al., 2020).

In Section 3, we propose a new meta-algorithm for debiasing representation learning models, named Unsupervised Locality-based Proxy Label assignment (ULPL). As shown in Figure 1, to minimize performance disparities, ULPL derives binary proxy labels based on model predictions, indicating poorly- vs. well-modelled instances. These proxy labels can then be combined with any supervised debiasing method to optimize fairness without access to actual protected labels. The method is based on the key observation that hidden representations are correlated with protected groups even if protected labels are not observed in model training, enabling the modelling of unobserved protected labels from hidden representations. We additionally introduce the notion of data locality to proxy label assignment, representing neighbourhoods of poorly- vs. well-modelled instances in a nearest-neighbour framework.
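As a concrete illustration of the proxy-label assignment idea, the sketch below flags, within each target class, instances whose training loss exceeds the class median as poorly modelled. The median threshold and the function name are illustrative assumptions, not necessarily the exact rule used by ULPL:

```python
import numpy as np

def assign_proxy_labels(losses, y):
    """Per-class thresholding: within each target class, instances whose
    training loss exceeds the class median are flagged as poorly modelled
    (proxy label 1); the rest are well modelled (proxy label 0)."""
    z = np.zeros(len(losses), dtype=int)
    for c in np.unique(y):
        mask = y == c
        z[mask] = (losses[mask] > np.median(losses[mask])).astype(int)
    return z

# Example: two classes, one high-loss instance in each.
losses = np.array([0.1, 0.9, 0.2, 0.8])
y = np.array([0, 0, 1, 1])
print(assign_proxy_labels(losses, y))  # → [0 1 0 1]
```

Thresholding per class (rather than globally) keeps instances of hard target classes from being uniformly labelled as under-performing.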
In Section 4, we compare the combination of ULPL with state-of-the-art supervised debiasing methods on five benchmark datasets, spanning natural language processing and structured data classification. Experimental results show that ULPL outperforms unsupervised and semi-supervised baselines, while also achieving performance competitive with state-of-the-art supervised techniques which have access to protected attributes at training time. In Section 5, we show that the proxy labels inferred by our method correlate with known demographic data, and that it is effective over multi-class intersectional groups and different notions of group-wise fairness. Moreover, we test our hypothesis of locality smoothing by studying the predictability of protected attributes and robustness to hyperparameters in finding neighbours.

Figure 1: An overview of ULPL. Given a model trained to predict label y from x by optimizing a particular loss, we derive binary proxy labels z′ indicating over- vs. under-performing instances within each target class based on training losses. These proxy labels are then smoothed according to the neighbourhood in latent space. Finally, the group-unlabelled data is augmented with z′, enabling the application of supervised bias mitigation methods.
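The neighbourhood-smoothing step of the pipeline can be sketched as a k-nearest-neighbour majority vote over proxy labels in the hidden-representation space. Euclidean distance and a simple majority vote are assumptions made here for illustration:

```python
import numpy as np

def smooth_proxy_labels(hidden, z, k=5):
    """Replace each instance's proxy label with the majority vote of its
    k nearest neighbours (Euclidean) in hidden-representation space."""
    # Pairwise distances between all hidden representations.
    d = np.linalg.norm(hidden[:, None, :] - hidden[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self from neighbourhood
    nn = np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbours
    return (z[nn].mean(axis=1) >= 0.5).astype(int)

# Example: two tight clusters; an isolated label disagreement is
# smoothed towards its neighbourhood.
hidden = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
z = np.array([1, 1, 0, 0, 0, 0])
print(smooth_proxy_labels(hidden, z, k=2))  # → [1 1 1 0 0 0]
```

The O(n²) distance matrix is fine for a sketch; a real implementation would use an approximate nearest-neighbour index for large training sets.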

2. RELATED WORK

Representational fairness One line of work in the fairness literature is on protected information leakage, i.e., bias in the hidden representations. For example, it has been shown that protected information influences the geometry of the embedding space learned by models (Caliskan et al., 2017; May et al., 2019). Previous work has also shown that downstream models learn protected information such as authorship that is unintentionally encoded in hidden representations, even if the model does not have access to protected information during training (Li et al., 2018; Wang et al., 2019; Zhao et al., 2019; Han et al., 2021b). Rather than reducing leakage, in this work we exploit leakage as a robust and reliable signal of unobserved protected labels, deriving proxy information from biased hidden representations for bias mitigation.
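Leakage of this kind is typically quantified by fitting a probe on frozen hidden representations to predict the protected attribute; accuracy above chance indicates leakage. The minimal numpy logistic-regression probe below is an illustrative sketch (the probe architecture, hyperparameters, and function name are assumptions):

```python
import numpy as np

def leakage_probe_accuracy(H, g, lr=0.1, epochs=200):
    """Fit a logistic-regression probe predicting a binary protected
    attribute g from frozen hidden representations H, and return its
    training accuracy as a rough measure of leakage."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=H.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid probabilities
        w -= lr * (H.T @ (p - g)) / len(g)       # gradient step on weights
        b -= lr * (p - g).mean()                 # gradient step on bias
    pred = (1.0 / (1.0 + np.exp(-(H @ w + b))) >= 0.5).astype(int)
    return (pred == g).mean()

# Example: representations that separate cleanly by group leak strongly.
H = np.array([[-1.0], [-1.2], [-0.8], [1.0], [1.2], [0.8]])
g = np.array([0, 0, 0, 1, 1, 1])
print(leakage_probe_accuracy(H, g))  # → 1.0
```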

Empirical fairness Another line of work focuses on empirical fairness by measuring model performance disparities across protected groups, e.g., via demographic parity (Dwork et al., 2012), equalized odds and equal opportunity (Hardt et al., 2016), or predictive parity (Chouldechova, 2017). Based on aggregation across groups, empirical fairness notions can be further broken down into group-wise fairness, which measures relative dispersion across protected groups (Li et al., 2018; Ravfogel et al., 2020; Han et al., 2022a; Lum et al., 2022), and per-group fairness, which reflects extremum values of bias (Zafar et al., 2017; Feldman et al., 2015; Lahoti et al., 2020). We follow previous work (Ravfogel et al., 2020; Han et al., 2021b; Shen et al., 2022) in focusing primarily on improving group-wise equal opportunity fairness.

Unsupervised bias mitigation Recent work has considered semi-supervised bias mitigation, such as debiasing with partially-labelled protected attributes (Han et al., 2021a), noisy protected labels (Awasthi et al., 2020; Wang et al., 2021; Awasthi et al., 2021), or domain adaptation of protected attributes (Coston et al., 2019; Han et al., 2021a). However, these approaches are semi-supervised, as true protected labels are still required for optimizing fairness objectives. Although Gupta et al. (2018) proposed to use observed features as proxies for unobserved protected labels, the selection of proxy features is handcrafted and does not generalize to unstructured
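As a concrete instance of group-wise equal opportunity, the sketch below reports the gap between per-group true positive rates on the positive class; taking the max-min gap as the dispersion measure is one common choice, not necessarily the aggregation used in this paper:

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, groups):
    """Group-wise equal opportunity: the gap between per-group true
    positive rates on the positive class (0 = perfectly fair)."""
    tprs = []
    for a in np.unique(groups):
        mask = (groups == a) & (y_true == 1)      # positives in group a
        tprs.append((y_pred[mask] == 1).mean())   # group-a TPR
    return max(tprs) - min(tprs)

# Example: group 0 recovers half its positives, group 1 all of them.
y_true = np.array([1, 1, 1, 1])
y_pred = np.array([1, 0, 1, 1])
groups = np.array([0, 0, 1, 1])
print(equal_opportunity_gap(y_true, y_pred, groups))  # → 0.5
```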

