EVERYBODY NEEDS GOOD NEIGHBOURS: AN UNSUPERVISED LOCALITY-BASED METHOD FOR BIAS MITIGATION

Abstract

Learning models from human behavioural data often leads to outputs that are biased with respect to user demographics, such as gender or race. This effect can be controlled by explicit mitigation methods, but this typically presupposes access to demographically-labelled training data. Such data is often not available, motivating the need for unsupervised debiasing methods. To this end, we propose a new meta-algorithm for debiasing representation learning models, which combines the notions of data locality and accuracy of model fit, such that a supervised debiasing method can optimise fairness between neighbourhoods of poorly vs. well modelled instances as identified by our method. Results over five datasets, spanning natural language processing and structured data classification tasks, show that our technique recovers proxy labels that correlate with unknown demographic data, and that our method outperforms all unsupervised baselines, while also achieving competitive performance with state-of-the-art supervised methods which are given access to demographic labels.

1. INTRODUCTION

It is well known that naively-trained models potentially make biased predictions even if demographic information (such as gender, age, or race) is not explicitly observed in training, leading to discrimination such as opportunity inequality (Hovy & Søgaard, 2015; Hardt et al., 2016). Although a range of fairness metrics (Hardt et al., 2016; Blodgett et al., 2016) and debiasing methods (Elazar & Goldberg, 2018; Wang et al., 2019; Ravfogel et al., 2020) have been proposed to measure and improve fairness in model predictions, they generally require access to protected attributes during training. However, protected labels are often not available (e.g., due to privacy or security concerns), motivating the need for unsupervised debiasing methods, i.e., debiasing without access to demographic variables. Previous unsupervised debiasing work has mainly focused on improving the worst-performing groups, which does not generalize well to ensuring performance parity across all protected groups (Hashimoto et al., 2018; Lahoti et al., 2020).

In Section 3, we propose a new meta-algorithm for debiasing representation learning models, named Unsupervised Locality-based Proxy Label assignment (ULPL). As shown in Figure 1, to minimize performance disparities, ULPL derives binary proxy labels based on model predictions, indicating poorly- vs. well-modelled instances. These proxy labels can then be combined with any supervised debiasing method to optimize fairness without access to actual protected labels. The method is based on the key observation that hidden representations are correlated with protected groups even if protected labels are not observed in model training, enabling the modelling of unobserved protected labels from hidden representations. We additionally introduce the notion of data locality to proxy label assignment, representing neighbourhoods of poorly- vs. well-modelled instances in a nearest-neighbour framework.
In Section 4, we compare the combination of ULPL with state-of-the-art supervised debiasing methods on five benchmark datasets, spanning natural language processing and structured data classification. Experimental results show that ULPL outperforms unsupervised and semi-supervised baselines, while also achieving performance competitive with state-of-the-art supervised techniques which have access to protected attributes at training time. In Section 5, we show that the proxy labels inferred by our method correlate with known demographic data, and that the method is effective over multi-class intersectional groups and different notions of group-wise fairness. Moreover, we test our hypothesis of locality smoothing by studying the predictability of protected attributes and robustness to hyperparameters in finding neighbours.

Figure 1: An overview of ULPL. Given a model trained to predict label $y$ from $x$ by optimizing a particular loss, we derive binary proxy labels (over- vs. under-performing) within each target class based on training losses. These proxy labels are then smoothed according to the neighbourhood in latent space. Finally, the group-unlabeled data is augmented with $z'$, enabling the application of supervised bias mitigation methods.

2. RELATED WORK

Representational fairness

One line of work in the fairness literature is on protected information leakage, i.e., bias in the hidden representations. For example, it has been shown that protected information influences the geometry of the embedding space learned by models (Caliskan et al., 2017; May et al., 2019). Previous work has also shown that downstream models learn protected information such as authorship that is unintentionally encoded in hidden representations, even if the model does not have access to protected information during training (Li et al., 2018; Wang et al., 2019; Zhao et al., 2019; Han et al., 2021b). Rather than reduce leakage, in this work, we make use of leakage as a robust and reliable signal of unobserved protected labels and derive proxy information from biased hidden representations for bias mitigation.

Empirical fairness

Another line of work focuses on empirical fairness by measuring model performance disparities across protected groups, e.g., via demographic parity (Dwork et al., 2012), equalized odds and equal opportunity (Hardt et al., 2016), or predictive parity (Chouldechova, 2017). Based on aggregation across groups, empirical fairness notions can be further broken down into group-wise fairness, which measures relative dispersion across protected groups (Li et al., 2018; Ravfogel et al., 2020; Han et al., 2022a; Lum et al., 2022), and per-group fairness, which reflects extremum values of bias (Zafar et al., 2017; Feldman et al., 2015; Lahoti et al., 2020). We follow previous work (Ravfogel et al., 2020; Han et al., 2021b; Shen et al., 2022) in focusing primarily on improving group-wise equal opportunity fairness.
Unsupervised bias mitigation

Recent work has considered semi-supervised bias mitigation, such as debiasing with partially-labelled protected attributes (Han et al., 2021a), noised protected labels (Awasthi et al., 2020; Wang et al., 2021; Awasthi et al., 2021), or domain adaptation of protected attributes (Coston et al., 2019; Han et al., 2021a). However, these approaches are semi-supervised, as true protected labels are still required for optimizing fairness objectives. Although Gupta et al. (2018) have proposed to use observed features as proxies for unobserved protected labels, the selection of proxy features is handcrafted and does not generalize to unstructured inputs (e.g., text or images). Therefore, there is no guarantee of correlation between proxy labels and unobserved protected labels (Chen et al., 2019).

The most relevant line of work focuses on the notion of Max-Min fairness (Rawls, 2001), which aims to maximize the minimum performance across protected groups. Hashimoto et al. (2018) optimize worst-performing distributions without access to actual protected labels, but suffer from the risk of focusing on outliers, reducing the effectiveness of bias mitigation. Adversarially reweighted learning (ARL) (Lahoti et al., 2020) extends the idea by employing an additional adversary in training to prevent the optimization from focusing on noisy outliers, based on the notion of computational identifiability (Hébert-Johnson et al., 2017). However, adversarial training is notoriously non-convex, and there is no guarantee that the adversary will learn contiguous regions rather than identifying outliers. In contrast, our proposed neighbourhood smoothing method is memory-based, does not require adversarial training, and allows the smoothness of the neighbourhood search to be adjusted explicitly.

Unsupervised fairness evaluation

To assess fairness without demographics, recent work (Kallus et al., 2020) has proposed to measure fairness w.r.t. auxiliary variables such as surname and geolocation in different datasets, which is a different research topic and beyond the scope of this paper. In this paper, we use protected labels for tuning and evaluation, and in practice, one can employ our unsupervised debiasing methods together with unsupervised fairness evaluation approaches to perform hyperparameter tuning for better fairness.

Dataset cartography

Training instances are also grouped based on predictability in the literature on dataset cartography, which is similar to the assignment of proxy labels in this paper. Swayamdipta et al. (2020) propose to visualize training instances according to variability and confidence, where a higher confidence indicates the instance label can be predicted more easily. Le Bras et al. (2020) also group training instances by their predictability, measured by training simple linear discriminators. Such methods focus on improving in- and out-of-distribution performance without taking fairness into consideration. In comparison, our proposed method aims to mitigate bias by assigning proxy protected group labels to training instances based on their losses within a particular class.

3.1. PROBLEM FORMULATION

Consider a dataset consisting of $n$ instances $D = \{(x_i, y_i, z_i)\}_{i=1}^{n}$, where $x_i$ is an input vector to the classifier, $y_i \in [1, \dots, C]$ represents the target class label, and $z_i \in [1, \dots, G]$ is the group label, such as gender. For unsupervised bias mitigation, protected labels are assumed to be unobserved at training and inference time. $n_{c,g}$ denotes the number of instances in the subset with target label $c$ and protected label $g$, i.e., $D_{c,g} = \{(x_i, y_i, z_i) \mid y_i = c, z_i = g\}_{i=1}^{n}$. A vanilla model ($m = f \circ e$) consists of two connected parts: the encoder $e$ is trained to compute the hidden representation from an input, $h = e(x)$, and the classifier makes the prediction, $\hat{y} = f(h)$. Let $L_{c,g} = \frac{1}{n_{c,g}} \sum_{(x_i, y_i, z_i) \in D_{c,g}} \ell(m(x_i), y_i)$ be the average empirical risk for $D_{c,g}$, where $\ell$ is a loss function such as cross-entropy. Similarly, let $L_c$ denote the average for instances with target label $c$ ($D_c$), and $L$ denote the overall empirical risk.
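The three levels of empirical risk defined above can be illustrated with a minimal sketch. It assumes per-instance losses have already been computed by some model; the function name `average_risks` and the toy values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def average_risks(losses, ys, zs):
    """Return (L, L_c, L_cg): overall, per-class, and per-(class, group) mean loss."""
    by_class = defaultdict(list)
    by_class_group = defaultdict(list)
    for loss, y, z in zip(losses, ys, zs):
        by_class[y].append(loss)
        by_class_group[(y, z)].append(loss)
    overall = mean(list(losses))
    per_class = {c: mean(v) for c, v in by_class.items()}
    per_class_group = {cg: mean(v) for cg, v in by_class_group.items()}
    return overall, per_class, per_class_group


# Toy per-instance losses with target labels y and group labels z (illustrative only).
losses = [0.2, 0.4, 1.0, 0.6, 0.1, 0.3]
ys = [1, 1, 1, 1, 2, 2]
zs = [1, 1, 2, 2, 1, 2]
L, L_c, L_cg = average_risks(losses, ys, zs)
```

In this toy data, group 2 has a higher average loss than group 1 within class 1, which is the kind of disparity the group-wise objective targets. Note that `zs` is only available here for illustration; in the unsupervised setting it is unobserved.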

Fairness measurement

We follow previous work in measuring group-wise performance disparities (Ravfogel et al., 2020; Roh et al., 2021; Shen et al., 2022). Specifically, for a particular utility metric $U$, e.g., the true positive rate, the results for each protected group are $C$-dimensional vectors, one dimension for each class. For the subset of instances $D_{c,g}$, we denote the corresponding evaluation results as $U_{c,g}$. Let $U_c$ denote the overall utilities of class $c$; then group-wise fairness is achieved if the utilities of all groups are identical, $U_{c,g} = U_c, \forall c, g \Leftrightarrow |U_{c,g} - U_c| = 0, \forall c, g$. In addition to the overall performance metric $U$, we denote the fairness metric $F$, as a measurement of group-wise utility disparities.

3.2. UNSUPERVISED LOCALITY-BASED PROXY LABEL ASSIGNMENT

Proxy label assignment

To mitigate bias, the first question is how to minimize disparities of non-differentiable metrics in model training. Previous work has shown that empirical risk-based objectives can form a practical approximation of expected fairness, as measured by various metrics including AUC-ROC (Lahoti et al., 2020), positive rate (Roh et al., 2021), and true positive rate (Shen et al., 2022). Without loss of generality, we illustrate with equal opportunity fairness. Note that our method generalizes to other fairness criteria; see Appendix D for details. By replacing the utility metrics $U$ with empirical risks w.r.t. an appropriate loss function $L$, the group-wise fairness metrics are reformulated as $|L_{c,g} - L_c| = 0, \forall c, g$. As the groups $g$ are unobserved, each instance is instead compared against the average risk of its own class: $z'_i = \mathbb{1}_{L_i > L_{y_i}}$, where $L_i = \ell(m(x_i), y_i)$ is the loss of the $i$-th instance and $z'_i$ denotes the augmented proxy group label. The two values of the proxy label indicate that the loss of an instance is either greater than the class mean (an 'under-represented' group) or at most the class mean (an 'over-represented' group). Each instance can now be assigned a binary proxy label and used with existing debiasing methods, resulting in the augmented dataset $D' = \{(x_i, y_i, z'_i)\}_{i=1}^{n}$.

Neighbourhood smoothing

Simply focusing on worse-performing instances can force the classifier to memorize noisy outliers (Arpit et al., 2017), reducing the effectiveness of bias mitigation. To address this problem, we find the neighbourhood that is likely to be from the same demographic based on the observation of protected information leakage (introduced in Section 2, and justified in Section 5.2), and smooth the proxy label of each instance based on its neighbours. Specifically, we use the notion of data locality and adopt a k-Nearest-Neighbour classifier (k-NN) to smooth the proxy label. Given hidden representations $\{h_1, h_2, \dots, h_n\}$, where $h_i = e(x_i)$, and a query point $h_j$, k-NN searches for the $k$ points $\{h_{j_1}, h_{j_2}, \dots, h_{j_k}\}$ with the same target label as the query ($y_{j_k} = y_j$) that are closest in distance to $h_j$, and then makes predictions through majority vote among the proxy labels $\{z'_{j_1}, z'_{j_2}, \dots, z'_{j_k}\}$. Unlike the standard setting of k-NN, where the query point is excluded from consideration, we include the query instance. As a result, the smoothing process degrades to using naive proxy labels when $k = 1$, where the discriminator prediction is the original proxy label without smoothing. For $k > 1$, on the other hand, neighbourhood smoothing comes into effect. Proxy label assignment and neighbour smoothing can be applied at different granularities, such as calculating the loss at steps vs. iterations; see Appendix C.1 for details.
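The two steps described above (thresholding each loss at its class mean, then majority voting over same-class neighbours with the query included) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the Euclidean distance, the brute-force neighbour search, and the toy data are all assumptions, and an odd $k$ is assumed so that the vote cannot tie.

```python
import math
from collections import defaultdict


def assign_proxy_labels(losses, ys):
    """z'_i = 1 if the loss of instance i exceeds the mean loss of its class, else 0."""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, y in zip(losses, ys):
        sums[y] += loss
        counts[y] += 1
    return [int(loss > sums[y] / counts[y]) for loss, y in zip(losses, ys)]


def smooth_proxy_labels(hidden, ys, proxy, k):
    """Majority vote over the k nearest same-class points, query included (odd k assumed)."""
    smoothed = []
    for j, h_j in enumerate(hidden):
        same_class = [i for i in range(len(hidden)) if ys[i] == ys[j]]
        # Sort by distance; the secondary key puts the query first among zero-distance ties.
        nearest = sorted(
            same_class, key=lambda i: (math.dist(hidden[i], h_j), i != j)
        )[:k]
        votes = sum(proxy[i] for i in nearest)
        smoothed.append(int(2 * votes > len(nearest)))
    return smoothed


# Toy example: two clusters in latent space, all from the same target class.
hidden = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
ys = [1, 1, 1, 1, 1, 1]
losses = [0.9, 0.8, 0.1, 0.2, 0.3, 0.95]

proxy = assign_proxy_labels(losses, ys)               # raw labels, one outlier per cluster
unsmoothed = smooth_proxy_labels(hidden, ys, proxy, k=1)
smoothed = smooth_proxy_labels(hidden, ys, proxy, k=3)
```

With $k = 1$ the vote contains only the query, so the raw proxy labels are returned unchanged. With $k = 3$ the single low-loss point in the first cluster and the single high-loss point in the second cluster are both overruled by their neighbours.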

3.3. THEORETICAL JUSTIFICATION

Approximating Fairness Criteria

In multi-class classification settings, equal opportunity fairness is achieved if $\hat{y} \perp z \mid y, \forall y$, i.e., the true positive rates (TPR) of each target class are equal for all partitions of the dataset, where partitioning is based on $z$. Using the definition of cross-entropy of the $i$-th instance, $-\sum_{c=1}^{C} \mathbb{1}_{\{y_i\}}(c) \log(p(\hat{y}_i = c))$, where $\hat{y}_i$ is a function of $x_i$, the loss for the subset of instances with target label $c$ can be simplified as:

$$L_c = \frac{1}{n_c} \sum_{(x_i, y_i, z_i) \in D_c} -\log(p(\hat{y}_i = c)) = \frac{1}{n_c} \sum_{(x_i, y_i, z_i) \in D} -\log(p(\hat{y} = c \mid y_i = c)) \tag{1}$$

Notice that $L_c$ is calculated on the subset $D_c = \{(x_i, y_i, z_i) \mid y_i = c\}_{i=1}^{n}$, making $L_c$ an unbiased estimator of $-\log p(\hat{y} = c \mid y_i = c)$, which approximates $-\log \mathrm{TPR}$ of the $c$-th class. As such, it can be seen that TPR can be empirically replaced by cross-entropy loss when measuring EO fairness.

Fairness Lower Bound

Consider the worst case in fairness, e.g., $p(\hat{y} = 1, y = 1 \mid z = 1) \approx 0$ and $p(\hat{y} = 1, y = 1 \mid z = 2) \approx 1$, where the TPR gap between the two groups is 1. Such unfairness in training is shown as the minimum training loss of instances in group 1 being larger than the maximum loss of instances in group 2. Taking the proxy label assignment into consideration, this example results in a strong correlation between the gold-class group labels and the proxy group labels, i.e., $p(z' = 1 \mid z = 1) = p(z' = 0 \mid z = 2) \approx 1$. Therefore, the correlation between proxy labels and gold-class group labels is positively correlated with unfairness, and the optimization w.r.t. proxy labels increases the lower bound of fairness.
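The worst-case argument can be checked on a toy example. The sketch below assumes a single target class with two equally sized groups whose losses do not overlap; the loss values are invented for illustration, and the perfect agreement it shows relies on the class mean falling between the two groups' losses, which need not hold for heavily skewed data.

```python
def proxy_labels(losses):
    """z' = 1 if a loss exceeds the mean loss of the (single) class, else 0."""
    mean_loss = sum(losses) / len(losses)
    return [int(loss > mean_loss) for loss in losses]


# Toy class c: group 1 is badly modelled, group 2 is well modelled (illustrative values).
groups = [1, 1, 1, 2, 2, 2]
losses = [2.0, 2.5, 3.0, 0.1, 0.2, 0.3]
z_prime = proxy_labels(losses)


def conditional_rate(proxy_value, group):
    """Empirical p(z' = proxy_value | z = group)."""
    members = [zp for zp, g in zip(z_prime, groups) if g == group]
    return sum(zp == proxy_value for zp in members) / len(members)


p_high_given_g1 = conditional_rate(1, 1)  # p(z' = 1 | z = 1)
p_low_given_g2 = conditional_rate(0, 2)   # p(z' = 0 | z = 2)
```

Here the proxy labels coincide with the hidden group labels, so a supervised debiasing method trained on $z'$ would be optimizing against the true disparity.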

4. EXPERIMENTAL RESULTS

This section demonstrates the effectiveness of our proposed method through experiments against various competitive baselines and across five widely-used datasets. We report evaluation results of all models based on average values over five runs with different random seeds for each dataset.

4.1. EXPERIMENT SETUP

Datasets

We consider the following benchmark datasets from the fairness literature: (1) Moji (Blodgett et al., 2016; Elazar & Goldberg, 2018), sentiment analysis with protected attribute race; (2) Bios (De-Arteaga et al., 2019; Subramanian et al., 2021), biography classification with protected attributes gender and economy; (3) TrustPilot (Hovy et al., 2015), product rating prediction with protected attributes age, gender, and country; (4) COMPAS (Flores et al., 2016), recidivism prediction with protected attributes gender and race; and (5) Adult (Kohavi, 1996), income prediction with protected attributes gender and race. To enable thorough comparison and explore correlation with unobserved protected attributes, for datasets with more than one protected attribute, we treat each protected attribute as a distinct task, e.g., Bios-gender and Bios-economy are treated as two different tasks. As a result, there are ten different tasks in total.

Baselines

We employ the following baselines: (1) Vanilla, which trains the classifier without explicit bias mitigation; (2) FairBatch (Roh et al., 2021), which adjusts the resampling probabilities of each protected group for each minibatch to minimize loss disparities; (3) GD_CLA (Shen et al., 2022), which adjusts the weights of each protected group to minimize loss disparities; (4) GD_GLB (Shen et al., 2022), which is a variant of GD_CLA that additionally minimizes loss differences across target classes; (5) Adv (Li et al., 2018), which trains the adversary to identify protected information from hidden representations, and removes protected information through unlearning adversaries; (6) SemiAdv (Han et al., 2021a), which trains the adversary with partially-observed protected labels; and (7) ARL (Lahoti et al., 2020), which employs an adversary to assign larger weights to computationally-identifiable underrepresented instances.
Besides the Vanilla model, methods (2)-(5) are supervised debiasing baselines, SemiAdv is a semi-supervised debiasing baseline method, and ARL is the baseline for unsupervised bias mitigation. In terms of our proposed method, we examine its effectiveness in combination with several supervised debiasing methods, GD_CLA, GD_GLB, and Adv, denoted ULPL+GD_CLA, ULPL+GD_GLB, and ULPL+Adv, respectively. To be clear, the supervision in each case is based on the proxy labels $z'_i$ learned in an unsupervised manner by ULPL.

Evaluation Metrics

Our approach generalizes to different metrics by varying the objectives of the debiasing methods. For illustration purposes, we follow Ravfogel et al. (2020), Shen et al. (2022), and Han et al. (2021a) in measuring the overall accuracy and equal opportunity fairness, which measures true positive rate (TPR) disparities across groups. Consistent with Section 3.1, we measure the sum of TPR gaps across subgroups to capture absolute disparities. We focus on less fair classes by using root mean square aggregation for class-wise aggregation. Overall, the fairness metric is $F = 1 - \sqrt{\frac{1}{C} \sum_{c=1}^{C} \left( \frac{1}{G} \sum_{g=1}^{G} |\mathrm{TPR}_{c,g} - \mathrm{TPR}_c| \right)^2}$. For both metrics, larger is better.

Model comparison

Previous work has shown that debiasing methods suffer from performance-fairness trade-offs in bias mitigation (Shen et al., 2022). Most debiasing methods involve a trade-off hyperparameter to control the extent to which the model sacrifices performance for fairness, such as $\lambda_{\mathrm{GD_{CLA}}}$, the strength of the additional regularization objectives of GD_CLA and our proposed method ULPL+GD_CLA. As shown in Figure 2a, given the performance-fairness trade-offs, selecting the best performance degrades to vanilla training and choosing the best fairness results in random predictions. As such, performance and fairness must be considered simultaneously in model comparisons, such as early stopping and hyperparameter tuning. We use protected attributes for early stopping over a validation set and report results on the test set. In practice, model selection should be made in a domain-specific manner, where the best method varies.

To make quantitative comparisons based on the performance-fairness trade-offs, we follow Han et al. (2023) in reporting the area under the performance-fairness trade-off curves (AUC-PFC) of each method. As shown in Figure 2b, the performance-fairness trade-off curve (PFC) of a particular method consists of a Pareto frontier, which represents the best results that can be achieved in different scenarios, and the area under the curve based on PFC (AUC-PFC) reflects the overall goodness of a method. In particular, the AUC-PFC scores of GD_CLA (red) and ULPL+GD_CLA (blue) are 0.498 and 0.485, respectively, and their difference (i.e., the area between the two curves) is 0.013. See Appendix B for further details.

Figure 2: ULPL+GD_CLA degrades to Vanilla performance (dotted lines) when setting $\lambda = 0$. As we increase the weight of the fairness objective, fairness (orange dashed line) improves at the cost of performance (blue solid line). Figure 2b focuses on the Pareto frontier, and presents AUC-PFC as the shaded area over the Bios-gender dataset.

[Table 1: AUC-PFC scores of each method per dataset and protected attribute (Moji: R; Bios: G, E; TrustPilot: G, A, C; Adult: G, R; COMPAS: G, R); values not preserved.]
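The equal opportunity fairness score from the Evaluation Metrics paragraph can be sketched in a few lines. The sketch assumes $F$ is one minus the root mean square, over classes, of the mean absolute TPR gap over groups, and that every (class, group) cell contains at least one instance. The function names and the toy predictions are hypothetical.

```python
import math


def tpr_for_class(y_true, y_pred, c, keep=None):
    """True positive rate of class c over the instances where keep[i] is True."""
    idx = [i for i, y in enumerate(y_true) if y == c and (keep is None or keep[i])]
    return sum(y_pred[i] == c for i in idx) / len(idx)


def eo_fairness(y_true, y_pred, groups):
    """F = 1 - RMS over classes of the mean absolute TPR gap over groups."""
    classes, group_ids = sorted(set(y_true)), sorted(set(groups))
    squared_class_gaps = []
    for c in classes:
        tpr_c = tpr_for_class(y_true, y_pred, c)
        gaps = [
            abs(tpr_for_class(y_true, y_pred, c, [z == g for z in groups]) - tpr_c)
            for g in group_ids
        ]
        squared_class_gaps.append((sum(gaps) / len(gaps)) ** 2)
    return 1 - math.sqrt(sum(squared_class_gaps) / len(squared_class_gaps))


# Toy binary task with two groups; only class 1 shows a TPR gap between groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
groups = [1, 1, 2, 2, 1, 1, 2, 2]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

F_biased = eo_fairness(y_true, y_pred, groups)
F_perfect = eo_fairness(y_true, y_true, groups)
```

In the toy data, class 1 has a TPR of 1.0 for group 1 and 0.5 for group 2, while class 0 is predicted perfectly for both groups, so only one class contributes to the penalty. A classifier with identical TPRs across groups, such as the perfect predictor, scores exactly 1.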

4.2. CAN WE MITIGATE BIAS WITHOUT ACCESS TO PROTECTED ATTRIBUTES?

Table 1 compares our proposed ULPL-based methods against baselines.

Vanilla and supervised debiasing baselines: Compared with the Vanilla model, supervised debiasing baselines (GD_CLA, GD_GLB, FairBatch, and Adv) substantially improve fairness with relatively little performance cost, resulting in larger AUC-PFC scores. Among the four supervised de[...]

Unsupervised debiasing baselines: ARL (Lahoti et al., 2020) is also an unsupervised debiasing method that trains an adversary to predict the weights of each instance such that the weighted empirical risk is maximized. The training objective of ARL does not match the definition of group-wise fairness, and as such, it results in lower AUC-PFC scores than our proposed methods, which explicitly optimize for performance parity across protected groups. In terms of excluding outliers, the adversary of ARL is intended to predict smooth weights of instances from $(x, y)$, such that the trained model focuses more on worse-performing instances and is discouraged from memorizing noisy outliers. However, our results show that our ULPL method is more robust and effective in implementing these notions.

Different ULPL methods: Among the proxy label-based methods, ULPL+GD_CLA consistently outperforms other methods. ULPL+GD_CLA calculates the loss differences separately before aggregation, which eliminates the influence of group size in debiasing and treats each group and class equally in optimization, which is better aligned with the evaluation metric. Adv is a popular method for achieving representational fairness, and differs from ULPL+GD_CLA and ULPL+GD_GLB, which directly optimize for empirical fairness. Removing protected information from hidden representations requires accurate perceptions of the global geometry of particular protected groups. However, proxy labels are based on local loss differences within each class, meaning that the same proxy label in different classes may conflate different protected groups. As such, the combination of ULPL with Adv is less effective than the two other combinations.

5.1. PROXY LABEL ASSIGNMENT

We first investigate if the ULPL labels are meaningful through the lens of Pearson's correlation ($r_{z'}$) between proxy labels and signed performance gaps. Given that $F$ is optimized if $\sum_{c,g} |\mathrm{TPR}_{c,g} - \mathrm{TPR}_c| = 0$, an instance $(x_i, y_i)$ should be assigned $z'_i = 0$ if $\mathrm{TPR}_{y_i, z_i} - \mathrm{TPR}_{z_i} < 0$, i.e., the unobserved group $z$ is under-performing in class $y_i$, and $z'_i = 1$ otherwise. We calculate $r_{z'}$ between $P(z' = 0 \mid y, z)$ and $\mathrm{TPR}_{y,z} - \mathrm{TPR}_z$, and present the results of Vanilla over each dataset in Table 2. It can be seen that there exists a strong correlation $r_{z'}$ for all datasets other than TrustPilot, indicating that the unsupervised proxy label recovers demographic data and provides a strong signal for bias mitigation. We also observe that better fairness results in smaller $r_{z'}$, for example, for TrustPilot and Bios-E, which is not surprising as the gaps ($|\mathrm{TPR}_{c,g} - \mathrm{TPR}_c|$) are close to 0.

Table 3: (a) Leakage (%) of protected attributes. (b) Best k assignments of each method.

(b)
| Dataset     | Moji | Bios | Bios | TrustPilot | TrustPilot | TrustPilot | Adult | Adult | COMPAS | COMPAS |
| Attribute   | R    | G    | E    | G          | A          | C          | G     | R     | G      | R      |
| ULPL+GD_CLA | 9    | 1    | 1    | 1          | 1          | 1          | 13    | 13    | 3      | 13     |
| ULPL+GD_GLB | 7    | 1    | 1    | 1          | 1          | 1          | 13    | 7     | 9      | 13     |
| ULPL+Adv    | 3    | 1    | 1    | 1          | 1          | 1          | 9     | 13    | 3      | 5      |

5.2. EFFECTIVENESS OF THE NEIGHBOURHOOD-SMOOTHING

In smoothing ULPL labels, we hypothesise that an instance's neighbours are likely to be from the same protected group. Apart from the instance itself, the remaining nearest neighbours used in smoothing are essentially the results of a standard k-nearest-neighbour (k-NN) model. Therefore, we perform our analysis based on standard k-NN models and investigate whether the remaining nearest neighbours are helpful for label smoothing, i.e., whether they are from the same protected group as the target instance.

Protected information predictability:

Proxy label smoothing is based on the hypothesis that there is a strong correlation between the hidden representations and the protected labels, even if protected labels are not observed during training. To test this hypothesis, we employ 1-NN for protected label prediction based on Vanilla hidden representations within each target class, using leave-one-out cross-validation over each batch to evaluate the predictability of protected attributes. Table 3a presents the results (macro F1 score) for each protected attribute, from which we can see a strong correlation between unobserved protected labels and hidden representations. Furthermore, in Appendix E.2, we show that although debiasing methods successfully reduce performance disparities in downstream tasks, leakage of protected attributes in debiased hidden representations is still high, consistent with previous work (Han et al., 2021b).

First, we can see that k-NN is highly robust to varying values of $p$. In terms of whether label smoothing should be class-specific or inspecific, there is a slight empirical advantage to performing it in a class-specific manner. In terms of $k$, for Moji, higher values result in better estimations of protected labels, although there is a clear plateau. If we explore this effect over the other datasets in terms of the $k$ value that achieves the highest AUC-PFC score, as can be seen in Table 3b, there is no clear trend, with neighbourhood smoothing ($k > 1$) improving results for Moji, Adult, and COMPAS and the best results being achieved for values from 3 to 13, whereas for Bios and TrustPilot, no neighbourhood smoothing ($k = 1$) performs best. Although the optimal value of $k$ varies greatly across datasets and debiasing methods, it is possible to perform fine-grained tuning to reduce computational cost. In Appendix C.4, we discuss situations where label smoothing succeeds or fails, and an effective tuning strategy for the value of $k$.
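The leave-one-out 1-NN leakage test described in this section can be sketched as below. It is an illustration under assumptions rather than the evaluation code used for Table 3a: it runs over one small set of points instead of per batch, uses Euclidean distance, assumes every target class has at least two instances, and the names `loo_1nn_predictions` and `macro_f1` and all data values are hypothetical.

```python
import math


def loo_1nn_predictions(hidden, ys, zs):
    """Predict each protected label z_j from the nearest *other* point of the same class."""
    preds = []
    for j in range(len(hidden)):
        candidates = [i for i in range(len(hidden)) if i != j and ys[i] == ys[j]]
        nearest = min(candidates, key=lambda i: math.dist(hidden[i], hidden[j]))
        preds.append(zs[nearest])
    return preds


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores over the labels present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Toy latent space: group 1 near the origin, group 2 near (3, 3),
# plus one group-1 point that sits inside the group-2 cluster.
hidden = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (3.0, 3.0), (3.1, 3.0), (3.0, 3.1), (2.9, 2.9)]
ys = [1] * 7
zs = [1, 1, 1, 2, 2, 2, 1]

z_pred = loo_1nn_predictions(hidden, ys, zs)
leakage = macro_f1(zs, z_pred)
```

A high macro F1 here means that the protected label of an instance can usually be read off its nearest neighbour, which is the property that neighbourhood smoothing relies on. The misplaced point is the only one whose group is predicted incorrectly.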
We also investigate the robustness of binary proxy labels to non-binary intersectional groups (i.e. the cross product of values across different protected attributes).

5.3. DEBIASING FOR INTERSECTIONAL GROUPS

Table 4 presents debiasing results for intersectional groups over those datasets that are labelled with more than one protected attribute. Compared to the single protected attribute results, the AUC-PFC scores of Vanilla are consistently smaller, indicating greater bias across intersectional groups, consistent with the findings of Subramanian et al. (2021). For ULPL models, on the other hand, the results are competitive with supervised debiasing methods, consistent with the single attribute setting.

5.4. OTHER FAIRNESS METRICS: DEMOGRAPHIC PARITY

Finally, we investigate the robustness of ULPL to other notions of fairness. For illustration purposes, we focus on demographic parity fairness (DP) (Blodgett et al., 2016), which requires model predictions to be independent of protected attributes. Again, we aggregate accuracy and DP fairness trade-offs as AUC-PFC scores. Since DP is sensitive to class imbalance and there is no standard way of generalizing DP to multi-class classification tasks, we only conduct experiments over binary classification tasks, namely Moji, Adult, and COMPAS. Table 5 shows the results w.r.t. demographic parity fairness. The overall trend is similar to our original results for equal opportunity fairness, indicating that ULPL is robust to different fairness metrics when combined with a range of debiasing methods.

6. CONCLUSION

Much of previous work in the fairness literature has the critical limitation that it assumes access to training instances labelled with protected attributes. To remove this restriction, we present a novel way of deriving proxy labels, enabling the adaptation of existing methods to unsupervised bias mitigation. We conducted experiments over five widely-used NLP and ML benchmark datasets, and showed that, when combined with different debiasing strategies, our proposed method consistently outperforms naively-trained models and unsupervised debiasing baselines, achieving results which are competitive with supervised debiasing methods. Furthermore, we showed our proposed method to be generalizable to multi-class intersectional groups and different notions of fairness.

A DATASETS AND PRE-PROCESSING

A.1 MOJI

Following previous studies (Ravfogel et al., 2020; Han et al., 2021b) [...] (Lui & Baldwin, 2012). Despite this, there are some non-English reviews in the filtered dataset, and therefore we further drop instances from Germany, Denmark and France, resulting in a dataset with 54k instances in total.

A.4 ADULT AND COMPAS

Except for race features, we use the same pre-processing as in Lahoti et al. (2020) for the COMPAS (Flores et al., 2016) and Adult (Kohavi, 1996) datasets, with 5,278 and 43,131 examples, respectively. Lahoti et al. (2020) consider binary race groups (white vs. black). However, there are more than two protected groups in the original datasets. Specifically, there are 3 race groups in COMPAS: African-American, Caucasian, and Other; and 5 race groups in Adult: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, and Black.

B EVALUATION METRICS

Besides the absolute gap metric ($|U_{c,g} - U_c| = 0$), a broad range of metric formulations have been introduced in previous studies to capture different assumptions about the nature of fairness. For example, Lum et al. (2022) propose to measure the variability of performance across demographic groups ($\frac{1}{G-1} \sum_g |U_{c,g} - U_c|^2$), Yang et al. (2020) only focus on the largest gap ($\max_g |U_{c,g} - U_c|$), and Feldman et al. (2015) measure performance ratio rather than gap in measuring fairness ($\frac{\max_g U_{c,g}}{\min_g U_{c,g}}$). Although different aggregation methods have been applied to measure group-wise fairness, the optimization of any one of them is a sufficient condition for the optimization of the others, as the optimization conditions of these metrics are identical: $U_{c,g} = U_c, \forall c, g$.

For fair comparison across different debiasing approaches, we should select the hyperparameters consistently. Previous work has used different criteria for model selection, including: (1) minimum loss (Hashimoto et al., 2018; Li et al., 2018); (2) maximum utility (Lahoti et al., 2020), e.g., based on accuracy or F-measure; (3) manual selection based on visual inspection of the trade-off curve (Elazar & Goldberg, 2018; Ravfogel et al., 2020); and (4) constrained selection (Han et al., 2021b; Subramanian et al., 2021), by selecting the best fairness constrained to a particular level of performance, and vice versa. Each selection criterion reflects the performance in a particular situation, making it very hard to rigorously compare methods.

Instead, the AUC-PFC score is the integral of performance-fairness curves with respect to performance on the interval [0, 1]. For a particular dataset, by the definition of fairness metrics, a random classifier achieves the best fairness. Therefore, the integration from 0 to the random prediction accuracy is dataset-specific and is identical across methods. In this paper, we normalize AUC-PFC scores for each dataset by ignoring performance worse than random guess. [...] example, the performance and fairness are 0.7109 ± 0.0110 and 0.6358 ± 0.1331, respectively. The random model corresponds to 0.5 accuracy and 1.0 fairness. Given these two points, the PFC is the line from (0.7109, 0.6358) to (0.5, 1), and the AUC-PFC score is (0.7109 - 0.5) × 0.6358 + 0.5 × (0.7109 - 0.5) × (1 - 0.6358) = 0.172, which is consistent with Table 1.

However, we still need to select a model for early stopping before model selection. Instead of considering performance and fairness metrics separately, we use the distance to the optimal point ("DTO"), which quantifies the accuracy-fairness trade-off (Marler & Arora, 2004; Han et al., 2022a). DTO measures the normalized Euclidean distance from a given combination of accuracy and fairness to the optimal point, which denotes the ideal result, e.g., accuracy and fairness of 1.0. The optimal point is typically unachievable in practice.
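The worked AUC-PFC example and the DTO criterion above can be reproduced with a short sketch. It assumes that the curve is integrated with the trapezoid rule starting from the random baseline (performance 0.5, fairness 1.0 in the example), and that DTO is the plain Euclidean distance to (1, 1) with both metrics already in [0, 1]; any further normalization is not specified here. The function names are hypothetical.

```python
import math


def auc_pfc(points, random_performance):
    """Area under a performance-fairness curve, measured from the random baseline.

    points: (performance, fairness) pairs on the Pareto frontier.
    The random model is assumed to reach fairness 1.0 at random_performance.
    """
    kept = [p for p in points if p[0] > random_performance]
    curve = sorted([(random_performance, 1.0)] + kept)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area


def dto(performance, fairness, optimum=(1.0, 1.0)):
    """Euclidean distance from a (performance, fairness) pair to the optimal point."""
    return math.hypot(optimum[0] - performance, optimum[1] - fairness)


# Single-model example from the text: one point plus the random baseline.
score = auc_pfc([(0.7109, 0.6358)], random_performance=0.5)
distance = dto(0.7109, 0.6358)
```

For a two-point curve, the trapezoid rule is the same computation as the rectangle-plus-triangle decomposition in the text, so `round(score, 3)` gives 0.172. When several checkpoints are available, the one with the smallest `dto` value would be selected for early stopping.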

B.1 INTERPRETATION OF AUC-PFC RESULTS

The main motivation for using AUC-PFC is ease of comparison between approaches: it aggregates the performance-fairness trade-off curve (PFC) of each model into a single number, enabling systematic comparison across different tasks. Two points commonly arise when interpreting AUC-PFC:

• The magnitude of AUC-PFC differs from that of a single metric: a 0.0001 improvement in the AUC-PFC score is equivalent to a 1 percentage point (pp) improvement in both performance and fairness (0.01 × 0.01). In the paper, numbers are rounded to 3 decimals, and the minimum reportable difference in AUC-PFC (0.001) is roughly equivalent to a 3 pp improvement in both performance and fairness in a PFC plot.
• The calculation of AUC-PFC scores is normalized by the worst performance, which is the majority label proportion when using the accuracy metric. Therefore, AUC-PFC scores represent the extent to which a model improves performance or fairness over the random model.

Using AUC-PFC comes with certain limitations. To address the major concerns related to AUC-PFC scores, we present additional results in Appendix E, including disaggregated results for each dataset. In particular, we provide the PFC of each method (e.g., Figure 6 in Appendix F), representing the best fairness that can be achieved at different performance levels, and vice versa. One limitation of a PFC plot is that it is hard to draw quantitative conclusions from the plot itself, and we cannot conclude that one method is better than another if their PFCs intersect. To address this problem, we additionally conduct quantitative comparisons across different debiasing methods by performing model selection w.r.t. two different criteria, and then comparing both the performance and fairness of the selected models (e.g., Table 14 in Appendix E). For each method, we report the evaluation results averaged over 5 random runs, with standard deviation, for both the development set and the test set.
As stated in Appendix F, we present disaggregated results (including a PFC plot and a table) for all 15 settings on GitHub.

C EXPERIMENTAL DETAILS

We conduct our experiments on an HPC cluster instance with 4 CPU cores, 32GB RAM, and one NVIDIA V100 GPU.

C.1 ASSIGNING AND SMOOTHING PROXY LABELS

Assigning proxy labels In the current experiments, proxy labels are assigned based on the per-instance losses within each minibatch, i.e., the loss of each instance is taken from a particular training iteration. We acknowledge that there are other ways of extracting training losses, e.g., taking losses from the final model or averaging them over multiple iterations, as suggested by a reviewer, and we leave this to future work.

Smoothing proxy labels Proxy label assignment and smoothing happen simultaneously at each iteration. As a result, our method can be incorporated into existing systems with only a few lines of changes that replace the actual protected labels with our proxy labels: at each minibatch, the actual protected labels are replaced with smoothed proxy labels, and all debiasing methods then operate on the proxy labels. During label smoothing, unsmoothed labels are used for voting, to avoid inconsistency in smoothing decisions for other examples. That is, we first collect the nearest neighbours of each instance, and then perform the voting for all instances.
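The per-minibatch procedure described above can be sketched in plain Python. This is a minimal illustration rather than the released implementation. It assumes that an instance receives proxy label 1 when its loss exceeds the mean loss of its target class within the minibatch, that neighbours are restricted to instances of the same target class, that distances are Minkowski p-norms over hidden representations, that the instance itself counts as one of its k neighbours, and that ties in the vote resolve to 0. All function names are hypothetical.

```python
from collections import defaultdict


def assign_proxy_labels(losses, targets):
    """z'_i = 1 if loss_i is above the minibatch mean loss of its class, else 0."""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, y in zip(losses, targets):
        sums[y] += loss
        counts[y] += 1
    class_mean = {y: sums[y] / counts[y] for y in sums}
    return [1 if loss > class_mean[y] else 0 for loss, y in zip(losses, targets)]


def minkowski(u, v, p=2):
    """Minkowski distance of order p between two vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def smooth_proxy_labels(hidden, targets, proxy, k=3, p=2):
    """Majority vote over the k nearest same-class neighbours.

    Votes always use the unsmoothed labels in `proxy`, so the result does not
    depend on the order in which instances are processed.
    """
    smoothed = []
    for h_i, y_i in zip(hidden, targets):
        same_class = [j for j, y in enumerate(targets) if y == y_i]
        nearest = sorted(same_class, key=lambda j: minkowski(h_i, hidden[j], p))[:k]
        votes = sum(proxy[j] for j in nearest)
        smoothed.append(1 if 2 * votes > len(nearest) else 0)
    return smoothed


# Toy minibatch: two clusters in a one-dimensional hidden space, one class.
hidden = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
targets = [0, 0, 0, 0, 0, 0]
noisy_proxy = [0, 1, 0, 1, 1, 0]
smoothed = smooth_proxy_labels(hidden, targets, noisy_proxy, k=3, p=2)
```

In this toy example each cluster adopts its majority label, so the two isolated disagreements in `noisy_proxy` are voted away. When a class has a single instance in the minibatch, the vote contains only that instance and smoothing leaves its label unchanged.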

C.2 MODELS AND PARAMETER TUNING

All approaches presented in this paper share the same dataset-specific hyperparameters as the standard model. Hyperparameters are tuned using grid search, in order to minimize the distance to the optimal point (DTO).

Table 9: Search space of dataset-specific hyperparameters.

None of the debiasing methods in this paper introduces extra parameters to the main task model, so they need not be considered at inference time. As such, we provide method-specific hyperparameters separately, and the search space for method-specific hyperparameters is shared across different datasets.

• GD CLA tunes the strength of the additional loss for minimizing absolute loss difference within each class. loguniform-float[$10^{-6}$, $10^{-1}$], 40 times.
• GD GLB tunes the strength of the additional loss for minimizing absolute loss difference. loguniform-float[$10^{-5}$, $10^{0}$], 40 times.
• FairBatch tunes the adjustment rate for resampling probabilities. loguniform-float[$10^{-4}$, $10^{0}$], 40 times.
• Adv tunes the weights of unlearning adversaries in training. loguniform-float[$10^{-2}$, $10^{2}$], 40 times.
• SemiAdv tunes the weights of unlearning adversaries in training. loguniform-float[$10^{-2}$, $10^{2}$], 40 times.
• ARL tunes the learning rate of learning adversaries in training. loguniform-float[$10^{-4}$, $10^{2}$], 40 times.
• ULPL methods tune k from 1 to 15, and the p-norm from 2 to 6.

Table 10: Average computational budget, measured in seconds. Columns: Dataset (Moji, Bios, TrustPilot, Adult, COMPAS); Attribute (R, G, E, G×E, G, A, C, G×A×C, G, R, G×R, G, R).

Note that this paper reports AUC-PFC, which eliminates the requirement for model selection, i.e., there is no single best-found trade-off hyperparameter w.r.t. bias mitigation.
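As an illustration of the method-specific search spaces listed above, the following sketch draws trade-off hyperparameters from a log-uniform distribution. It reads "loguniform-float[a, b], 40 times" as 40 independent draws whose base-10 logarithm is uniform on [log10 a, log10 b]; that reading, the dictionary keys, and the function name are assumptions for illustration and are not taken from the released code.

```python
import math
import random

# Ranges copied from the list above; the keys are illustrative names only.
SEARCH_SPACE = {
    "GD_CLA": (1e-6, 1e-1),
    "GD_GLB": (1e-5, 1e0),
    "FairBatch": (1e-4, 1e0),
    "Adv": (1e-2, 1e2),
    "SemiAdv": (1e-2, 1e2),
    "ARL": (1e-4, 1e2),
}


def sample_loguniform(low, high, n_trials=40, seed=0):
    """Draw n_trials values whose log10 is uniform on [log10(low), log10(high)]."""
    rng = random.Random(seed)
    log_low, log_high = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(log_low, log_high) for _ in range(n_trials)]


trials = {name: sample_loguniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}

# ULPL adds a small grid over the neighbourhood size k and the p-norm.
ulpl_grid = [(k, p) for k in range(1, 16) for p in range(2, 7)]
```

Sampling in log space spreads the 40 trials evenly across orders of magnitude, which is the usual motivation for log-uniform search over loss weights and learning rates.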

C.3 COMPUTATIONAL BUDGET

Table 10 shows the average GPU time for model training. Note that debiasing components are not used at inference time, i.e., all methods have identical inference cost.

C.4 PARAMETER TUNING FOR LABEL SMOOTHING

For Bios, class-specific neighbourhood smoothing degenerates to naive proxy labels when there is only a small number of instances in a particular class. For example, there are 28 distinct target classes in the Bios dataset, with a highly skewed distribution. As such, a minibatch may contain only one instance of a given target class, in which case neighbourhood smoothing has no effect. For TrustPilot, we hypothesise that the limited effect of smoothing is due to the leakage of protected information being very low, and accordingly the neighbourhoods of instances being noisy.

The selection of k for label smoothing As observed in Section 5.2, the optimal value of k varies greatly across datasets and debiasing methods. In this paper, we deal with this through a simple grid search over different values of k, which is computationally expensive. Although we do not currently have an algorithm for efficiently optimizing k, we have observed that the value of k is positively correlated with model leakage and unfairness. Therefore, tuning could start from a large value of k if the model is significantly biased, as instances from the same protected group are then likely to be close to each other. Otherwise, the proxy labels can be used without smoothing if the results are reasonably fair.

D THEORETICAL JUSTIFICATION

D.1 FROM EMPIRICAL LOSSES TO UTILITY METRICS

For illustration purposes, we assume binary settings for both target class and protected attribute labels. In Section 3.3, we have shown that the proposed method can be used to improve equal opportunity fairness.

Demographic parity (DP) For DP fairness, the predictions are expected to be independent of the protected attributes ($\hat{y} \perp z$), and fairness is satisfied if the difference in positive prediction rate between demographic groups is zero:

$$p(\hat{y}=1 \mid z=0) = p(\hat{y}=1 \mid z=1).$$

Thus,

$$p(\hat{y}=1, y=0 \mid z=0) + p(\hat{y}=1, y=1 \mid z=0) = p(\hat{y}=1, y=0 \mid z=1) + p(\hat{y}=1, y=1 \mid z=1).$$

Since $p(\hat{y}=0, y \mid z) + p(\hat{y}=1, y \mid z) = p(y \mid z)$, $\forall y, z$, replacing $p(\hat{y}=1, y=0 \mid z)$ with $p(y=0 \mid z) - p(\hat{y}=0, y=0 \mid z)$ gives

$$\big(p(y=0 \mid z=0) - p(\hat{y}=0, y=0 \mid z=0)\big) + p(\hat{y}=1, y=1 \mid z=0) = \big(p(y=0 \mid z=1) - p(\hat{y}=0, y=0 \mid z=1)\big) + p(\hat{y}=1, y=1 \mid z=1).$$

When the class priors agree across groups, $p(y=0 \mid z=0) = p(y=0 \mid z=1)$ (the assumption discussed at the end of this section), the prior terms cancel, and an equivalent condition for DP fairness is

$$p(\hat{y}=1, y=1 \mid z=0) - p(\hat{y}=0, y=0 \mid z=0) = p(\hat{y}=1, y=1 \mid z=1) - p(\hat{y}=0, y=0 \mid z=1).$$

A sufficient condition for DP is then that both $p(\hat{y}=1, y=1 \mid z=0) = p(\hat{y}=1, y=1 \mid z=1)$ and $p(\hat{y}=0, y=0 \mid z=0) = p(\hat{y}=0, y=0 \mid z=1)$ are satisfied.

Next, we show how to map the conditional joint probability to training losses. For $y = 1$, recall that $\mathcal{L}_1$ is an unbiased estimator of $-\log p(\hat{y}=1 \mid y_i=1)$ (Equation (1)):

$$\begin{aligned}
\mathcal{L}_1 &= -\log p(\hat{y}=1 \mid y_i=1) \\
&= -\log p(\hat{y}=1 \mid y_i=1) - \log p(y_i=1) + \log p(y_i=1) \\
&= -\log\big(p(\hat{y}=1 \mid y_i=1)\, p(y_i=1)\big) + \log p(y_i=1) \\
&= -\log p(\hat{y}=1, y_i=1) + \log p(y_i=1).
\end{aligned}$$

Substituting the joint probability with losses,

$$\begin{aligned}
p(\hat{y}=1, y=1 \mid z=0) &= p(\hat{y}=1, y=1 \mid z=1) \\
-\log p(\hat{y}=1, y=1 \mid z=0) &= -\log p(\hat{y}=1, y=1 \mid z=1) \\
\mathcal{L}_{1,0} - \log p(y_i=1 \mid z_i=0) &= \mathcal{L}_{1,1} - \log p(y_i=1 \mid z_i=1) \\
\mathcal{L}_{1,0} - \mathcal{L}_{1,1} &= \log \frac{p(y_i=1 \mid z_i=0)}{p(y_i=1 \mid z_i=1)}.
\end{aligned}$$

Similarly, $\mathcal{L}_0$ is an unbiased estimator of $-\log p(\hat{y}=0, y_i=0) + \log p(y_i=0)$, and the DP condition for $y = 0$, $p(\hat{y}=0, y=0 \mid z=0) = p(\hat{y}=0, y=0 \mid z=1)$, is equivalent to $\mathcal{L}_{0,0} - \mathcal{L}_{0,1} = \log \frac{p(y_i=0 \mid z_i=0)}{p(y_i=0 \mid z_i=1)}$.
Notice that $p(y \mid z)$, $\forall y, z$, are dataset-specific constants, and if $p(y \mid z=0) = p(y \mid z=1)$, $\forall y$, the DP conditions are identical to Equalized Odds fairness (Hardt et al., 2016), and can be approximated by $\mathcal{L}_{1,0} = \mathcal{L}_{1,1}$ and $\mathcal{L}_{0,0} = \mathcal{L}_{0,1}$. Last but not least, recall that DP by nature assumes that y and z are independent; therefore, $p(y \mid z=0) - p(y \mid z=1) \approx 0$, $\forall y$, generally holds when DP fairness is desired.

Confusion-matrix based metrics So far, we have shown that minimizing loss differences can approximate the optimization of the two most widely used notions of fairness: EO and DP fairness. Since model predictions and target labels are observed during training, such approximation can also be applied to other confusion-matrix-based metrics. For example, the cross-entropy losses of instances predicted as 1 and as 0 are approximations of the positive predictive value ($p(y=1 \mid \hat{y}=1)$) and negative predictive value ($p(y=0 \mid \hat{y}=0)$), respectively.
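The relation between class-wise loss gaps and group-specific class priors derived in this section can be checked numerically. The sketch below uses population quantities rather than sampled losses, i.e., it takes the class-wise loss of a group to be exactly the negative log of the corresponding conditional probability. The two groups and all numbers are made up for illustration, and the helper name is hypothetical.

```python
import math


def group_summary(p_y1, tpr, tnr):
    """Joint probabilities, positive prediction rate and class-wise losses.

    p_y1: class prior p(y=1 | z) of the group.
    tpr:  p(y_hat=1 | y=1, z);  tnr: p(y_hat=0 | y=0, z).
    """
    joint_tp = p_y1 * tpr                      # p(y_hat=1, y=1 | z)
    joint_tn = (1.0 - p_y1) * tnr              # p(y_hat=0, y=0 | z)
    false_pos = (1.0 - p_y1) * (1.0 - tnr)     # p(y_hat=1, y=0 | z)
    return {
        "joint_tp": joint_tp,
        "joint_tn": joint_tn,
        "ppr": joint_tp + false_pos,           # p(y_hat=1 | z)
        "loss_pos": -math.log(tpr),            # L_{1,z}
        "loss_neg": -math.log(tnr),            # L_{0,z}
    }


# Two hypothetical groups with matched joint probabilities but different priors.
g0 = group_summary(p_y1=0.5, tpr=0.6, tnr=0.4)   # joint_tp = 0.3, joint_tn = 0.2
g1 = group_summary(p_y1=0.6, tpr=0.5, tnr=0.5)   # joint_tp = 0.3, joint_tn = 0.2

loss_gap_pos = g0["loss_pos"] - g1["loss_pos"]   # L_{1,0} - L_{1,1}
loss_gap_neg = g0["loss_neg"] - g1["loss_neg"]   # L_{0,0} - L_{0,1}
ppr_gap = g0["ppr"] - g1["ppr"]
```

Here `loss_gap_pos` equals log(0.5 / 0.6) and `loss_gap_neg` equals log(0.5 / 0.4), matching the log-ratio of class priors. With the joint probabilities matched, the remaining gap in positive prediction rate equals the difference in p(y=0 | z) between the groups, which is the quantity assumed to be approximately zero when DP fairness is desired.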

D.2 OPTIMIZING FAIRNESS WITH PROXY LABELS

Bias mitigation for different fairness criteria Overall, ULPL only assigns proxy labels to training instances; the optimization for fairness is achieved using existing supervised debiasing methods, by learning uniform representations across proxy groups (such as Adv) or by minimizing loss disparities in training (such as GD CLA and GD GLB). Based on ULPL, different fairness criteria can be optimized by employing different variants of a particular debiasing method. Taking Adv as an example, a discriminator is trained to recover the protected information from hidden representations, and the main task model is optimized to remove protected information from hidden representations by unlearning the discriminator. By doing so, the hidden representations and corresponding predictions are expected to be independent of the protected attribute, ensuring DP fairness. To adopt Adv for EO fairness, the discriminator takes target labels into consideration, e.g., by training a specific discriminator for instances with the positive target class only; the removal of protected information is then class-dependent, aligning with the definition of EO fairness.

Table 11: Proxy label assignment without smoothing and evaluations for the Vanilla model over Moji. Evaluation results ± standard deviation (%) are averaged over 5 runs with different random seeds. P(z′ = 1) refers to the proportion of instances assigned the proxy label 1, indicating worse-performing groups. Evaluation metrics include: (1) positive prediction rate (PPR), corresponding to demographic parity fairness (Blodgett et al., 2016); (2) true positive rate (TPR) and false positive rate (FPR), corresponding to equalized odds and equal opportunity fairness (Hardt et al., 2016); and (3) positive predictive value (PPV) and negative predictive value (NPV), corresponding to test fairness (Chouldechova, 2017).
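For reference, the five confusion-matrix metrics named in the caption of Table 11 can be computed per group as in the following sketch. It assumes binary labels and predictions encoded as 0/1, and returns NaN when a denominator is empty; the function names are hypothetical, and this is not the evaluation code used to produce the table.

```python
def confusion_rates(y_true, y_pred):
    """PPR, TPR, FPR, PPV and NPV for binary labels and predictions (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "PPR": ratio(tp + fp, tp + fp + fn + tn),  # demographic parity
        "TPR": ratio(tp, tp + fn),                 # equal opportunity
        "FPR": ratio(fp, fp + tn),                 # equalized odds (with TPR)
        "PPV": ratio(tp, tp + fp),                 # test fairness
        "NPV": ratio(tn, tn + fn),                 # test fairness
    }


def rates_by_group(y_true, y_pred, groups):
    """Apply confusion_rates separately to each protected (or proxy) group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, z in enumerate(groups) if z == g]
        result[g] = confusion_rates([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx])
    return result


example = rates_by_group(
    y_true=[1, 1, 1, 0, 1, 0, 0, 0],
    y_pred=[1, 1, 0, 0, 1, 1, 0, 0],
    groups=[0, 0, 0, 0, 1, 1, 1, 1],
)
```

Group-wise gaps in any of these rates can then be compared with the proportion of instances assigned z′ = 1 in each group.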

E.1 PROXY LABEL ASSIGNMENT -MOJI

We first investigate whether the ULPL labels are meaningful through the lens of training examples in the Moji dataset. Table 11 presents the results of the Vanilla model. It can be seen that AAE tweets are more likely to be classified as happy, while SAE tweets are more likely to be classified as sad, resulting in a consistent trend in the gaps with respect to PPR, TPR, FPR, PPV, and NPV. Based on loss disparities, AAE instances with sad target labels are more likely to be assigned z′ = 1 (68.5% vs. 17% for AAE and SAE, respectively), encouraging debiasing methods to focus more on sad-AAE instances in training. Similarly, happy-SAE instances are more likely to be assigned z′ = 1, indicating that happy-SAE instances are upweighted in training. From the perspective of the dataset distribution, as introduced in Appendix A.1, Moji is balanced with respect to both sentiment and ethnicity but skewed in terms of sentiment-ethnicity combinations (40% happy-AAE, 10% happy-SAE, 10% sad-AAE, and 40% sad-SAE, respectively), which is closely related to the proxy label assignments, in that minority groups within each target class are assigned z′ = 1. That is, our proposed proxy labels differentiate minority groups from majority groups within each target class without observing demographic labels.

E.2 PROTECTED LABEL PREDICTABILITY AFTER DEBIASING

Neighbour smoothing relies on protected information remaining encoded in the hidden representations throughout training. Han et al. (2021b) show that although supervised debiasing methods have shown success in reducing performance disparities in downstream tasks, the predictability of protected attributes in debiased hidden representations is still well above the ideal value. We take ULPL+GD CLA as an example and explore the protected label predictability across different debiased models (i.e., different trade-offs). As seen from Figure 4, fairness scores (the blue line) improve at the cost of performance. However, the predictability of protected labels remains stable at a high level, indicating that protected information is still encoded in debiased models, and that our proposed neighbourhood smoothing method is robust to bias mitigation.

E.3 EFFECTIVENESS OF THE NEIGHBOUR SMOOTHING

In smoothing z′ labels, we hypothesise that the nearest neighbours of an instance are likely to be from the same protected group. Apart from the instance itself, the remaining nearest neighbours are essentially the output of a standard k-nearest-neighbour (k-NN) model. Therefore, we perform our analysis based on standard k-NN models, and investigate whether the remaining nearest neighbours are helpful for label smoothing, i.e., whether they come from the same protected group as the target instance.
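One simple way to quantify this hypothesis is to measure, for each instance, the fraction of its k nearest neighbours (excluding the instance itself) that share its protected group. The sketch below is an illustrative stand-in for the k-NN analysis, assuming Minkowski distances over hidden representations; the function names and the toy data are hypothetical.

```python
def minkowski(u, v, p=2):
    """Minkowski distance of order p between two vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def neighbour_group_agreement(hidden, groups, k=3, p=2):
    """Mean fraction of each instance's k nearest neighbours in its own group.

    The instance itself is excluded, so the score reflects what a standard
    k-NN model would see. Requires k <= len(hidden) - 1.
    """
    total = 0.0
    for i, h_i in enumerate(hidden):
        others = [j for j in range(len(hidden)) if j != i]
        nearest = sorted(others, key=lambda j: minkowski(h_i, hidden[j], p))[:k]
        total += sum(groups[j] == groups[i] for j in nearest) / len(nearest)
    return total / len(hidden)


# Toy one-dimensional hidden space with two well-separated clusters.
hidden = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
clustered = neighbour_group_agreement(hidden, [0, 0, 0, 1, 1, 1], k=2)
interleaved = neighbour_group_agreement(hidden, [0, 1, 0, 1, 0, 1], k=2)
```

A score close to 1, as for `clustered`, indicates that neighbours mostly come from the same protected group and are therefore informative for smoothing; a lower score, as for `interleaved`, corresponds to noisy neighbourhoods.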

E.4 OTHER FAIRNESS METRICS, DP

We investigate the robustness of ULPL methods to other notions of fairness. For illustration purposes, we focus on demographic parity (DP) fairness (Blodgett et al., 2016) in this experiment, which requires model predictions to be independent of protected attributes. Again, we aggregate accuracy and DP fairness trade-offs as AUC-PFC scores. Since DP is sensitive to class imbalance and there is no standard way of generalizing DP to multi-class classification tasks, we only conduct experiments over binary classification tasks, namely Moji, Adult, and COMPAS.

Proxy group labels (z′) are dynamically adjusted at each minibatch during training, which differs from the fixed protected labels used in supervised debiasing. As a result, supervised debiasing methods based on the observed protected attributes z mitigate biases for particular protected groups, while the proposed proxy label approaches focus on the group of instances that are underrepresented during training, which is expected to be more general than debiasing with respect to a particular protected attribute. Figure 5 demonstrates the difference in AUC-PFC scores between in-domain debiasing and cross-domain debiasing. For each debiasing method, we train the debiased model and conduct model selection based on the source protected attribute (Economy). The trained models are then evaluated w.r.t. the target unobserved protected attribute (Gender). The ability to generalize to unobserved protected attributes is measured as the difference between in-domain and cross-domain AUC-PFC.

GD CLA is representative of debiasing methods that directly optimize loss parity. In particular, the training objective of GD CLA is

$$\mathcal{L}_{GD_{CLA}} = \mathcal{L} + \lambda_{GD_{CLA}} \sum_{c} \sum_{g} |\mathcal{L}_{c,g} - \mathcal{L}_{c}|,$$

where $|\mathcal{L}_{c,g} - \mathcal{L}_{c}|$ is minimized to achieve better fairness.
Since $\mathcal{L}_{c,g}$, $\mathcal{L}_{c}$, and $\mathcal{L}$ are average losses, their magnitudes do not depend on the subset sizes ($n_{c,g}$, $n_{c}$, and $n$, respectively), which in turn applies the same strength of fairness regularization to every subset of instances $D_{c,g}$, $\forall c, g$. In other words, GD CLA ignores the influence of group size in bias mitigation, resulting in robustness to imbalanced class distributions. There is a perfect alignment between ULPL and GD CLA, in the sense that the proxy group z′ = 1 will always be upweighted during the optimization of GD CLA. That is, ULPL+GD CLA reduces the loss disparities across instances within each target class, which in turn improves the lower bound of group-wise fairness, especially for EO fairness. As a result of this consistency, we observe that ULPL+GD CLA outperforms other unsupervised methods.

GD GLB is a variant of GD CLA, differing only in how the fairness regularization is incorporated:

$$\mathcal{L}_{GD_{GLB}} = \mathcal{L} + \lambda_{GD_{GLB}} \sum_{c} \sum_{g} |\mathcal{L}_{c,g} - \mathcal{L}|,$$

where the average loss in the regularization term is based on all instances ($\mathcal{L}$), rather than on the instances within each target class ($\mathcal{L}_{c}$) as in GD CLA. As a result, GD GLB additionally encourages performance parity across target classes, which is typically known as long-tail learning. However, ULPL+GD GLB could potentially lead to worse results for better-performing target classes. For example, for the i-th target class, assuming that loss differences have been minimized (i.e., $\mathcal{L}_{i,z'=0} \approx \mathcal{L}_{i,z'=1}$), it is possible that all instances with target label $y = i$ will be under-fitted if $\mathcal{L}_{i} > \mathcal{L}$.

Adv represents a different family of debiasing methods, which aims at learning fair hidden representations. The training objective of Adv adds a mutual information (MI) term to the standard loss: $\mathcal{L}_{Adv} = \mathcal{L} + MI(z, h)$, where $h = e(x)$ is the hidden representation of input x extracted from the encoder e.
By minimizing the MI objective, we expect the learned hidden representations h to be orthogonal to the protected attributes z. In practice, $MI(z, h)$ is expressed as the combination of the marginal entropy $H(z)$ and the conditional entropy $H(z \mid h)$: $MI(z, h) = H(z) - H(z \mid h)$, where $H(z)$ is a constant and can be ignored in the optimization, and $H(z \mid h)$ is estimated by an adversary (d) that is trained to identify the protected attributes ($\hat{z} = d(h)$). The key step of Adv is the training of d, i.e., whether d can effectively recover z from h. One problem associated with ULPL is that the mapping from proxy labels to ground-truth labels is class-specific, which makes recovery harder for the adversary. Therefore, although the adversaries are non-linear classifiers, ULPL+Adv is less effective than the other ULPL+* methods, as shown in Table 1.
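To make the difference between the two loss-parity regularizers in this section concrete, the following sketch evaluates the GD CLA and GD GLB objectives on a toy minibatch of per-instance losses. It works on plain floats for illustration only; in training, the same quantities would be computed on differentiable tensors, and under ULPL the `groups` argument would hold proxy labels z′. The function name and the default weight are hypothetical.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def gd_objectives(losses, classes, groups, lam=1.0):
    """Return (GD_CLA objective, GD_GLB objective) for one minibatch.

    GD_CLA penalizes |L_{c,g} - L_c| (gap to the class average);
    GD_GLB penalizes |L_{c,g} - L|   (gap to the overall average).
    """
    overall = mean(losses)
    by_class = defaultdict(list)
    by_class_group = defaultdict(list)
    for loss, c, g in zip(losses, classes, groups):
        by_class[c].append(loss)
        by_class_group[(c, g)].append(loss)

    reg_cla = sum(abs(mean(v) - mean(by_class[c]))
                  for (c, _g), v in by_class_group.items())
    reg_glb = sum(abs(mean(v) - overall) for v in by_class_group.values())
    return overall + lam * reg_cla, overall + lam * reg_glb


# Two classes, two groups, one instance per (class, group) cell.
obj_cla, obj_glb = gd_objectives(
    losses=[0.2, 0.4, 0.6, 1.0],
    classes=[0, 0, 1, 1],
    groups=[0, 1, 0, 1],
)
```

In this example the GD GLB penalty is larger because class 1 has a higher average loss than class 0, a disparity that GD CLA deliberately leaves out of its regularizer.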

F FULL RESULTS

In addition to AUC-PFC scores, we present the PFC and evaluation results for each dataset. We investigate two different selection criteria and report the evaluation results over the development and test sets. Specifically, we conduct model selection over the development set based on: (1) maximum fairness within a performance trade-off threshold of 5% (F@P-5%); and (2) maximum fairness within a performance trade-off threshold of 10% (F@P-10%). The area under each PFC in Figure 6 corresponds to a number in Table 1. Consistent with Table 1, it can be seen from Figure 6 that GD CLA results in the best PFC. In addition, the PFCs of Adv and SemiAdv largely overlap, and their AUC-PFC scores are also identical in Table 1. Last but not least, ULPL+GD CLA is better than ARL most of the time in Figure 6, which is summarized as a 0.016 improvement in AUC-PFC score in Table 1.
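The two selection criteria can be sketched as a constrained arg-max over candidate models evaluated on the development set. The sketch assumes that the threshold is an absolute drop in performance relative to the best-performing candidate; the text does not pin down the reference point, so this is one possible reading, and the function name and candidate values are made up for illustration.

```python
def select_fairest_within(candidates, max_drop):
    """Pick the fairest candidate whose performance is within max_drop of the best.

    candidates: list of dicts with 'performance' and 'fairness' in [0, 1],
    both measured on the development set.
    """
    best_performance = max(c["performance"] for c in candidates)
    eligible = [c for c in candidates
                if c["performance"] >= best_performance - max_drop]
    return max(eligible, key=lambda c: c["fairness"])


# Hypothetical trade-off points of one debiasing method.
candidates = [
    {"name": "m0", "performance": 0.72, "fairness": 0.61},
    {"name": "m1", "performance": 0.70, "fairness": 0.80},
    {"name": "m2", "performance": 0.64, "fairness": 0.90},
    {"name": "m3", "performance": 0.55, "fairness": 0.99},
]
f_at_p5 = select_fairest_within(candidates, max_drop=0.05)
f_at_p10 = select_fairest_within(candidates, max_drop=0.10)
```

Loosening the threshold from 5% to 10% admits models further along the trade-off curve, so the selected model can only become fairer, never less fair.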



Key characteristics of the datasets, including dataset statistics, are provided in Appendix A.

See Appendix E.6 for further discussion.

There are slight discrepancies in the dataset composition due to data attrition: the original dataset (De-Arteaga et al., 2019) had 399k instances, while 393k were collected by Ravfogel et al. (2020).



c,g -L c | = 0, which is an approximation of the desired fairness measurement $\sum_{c=1}^{C} \sum_{g=1}^{G} |U_{c,g} - U_{c}| = 0$. However, protected labels (z) are not observed in unsupervised debiasing settings, which raises the question: how can we optimize fairness objectives with unobserved protected labels? Based on the fairness objective $\sum_{c=1}^{C} \sum_{g=1}^{G} |\mathcal{L}_{c,g} - \mathcal{L}_{c}| = 0$, we propose to focus on groups within each target class that are systematically poorly modelled. To this end, we binarize the numerous unobserved group labels into two types based on their training losses: z

Tuning ULPL+GDCLA trade-off hyperparameter. Shaded areas = 95% CI estimated over 5 runs. Performance-fairness curve of GDCLA (red dashed line), and ULPL+GDCLA (black solid line).

Figure 3: Hyperparameter sensitivity analysis over the Moji dataset.

Figure 4: Predictability after debiasing.

Figure 6: PFC over the Moji dataset.

636 0.837 0.915 0.963 0.971 0.960 0.951 0.751 0.894 0.672
$r_{z'}$ 0.996 0.656 0.214 0.056 0.045 0.034 0.600 0.691 0.983 0.855

Fairness (F) and Pearson's $r_{z'}$ between proxy labels and signed gaps for Vanilla.

biasing methods, GD CLA outperforms other methods, which is consistent with previous work (Shen et al., 2022). Similar to the baseline debiasing methods, debiasing w.r.t. proxy labels (ULPL+*) also improves fairness over Vanilla, and achieves results that are competitive with supervised debiasing methods.

Semi-supervised debiasing baselines: We examine the effectiveness of SemiAdv by removing 50% of the protected labels, i.e., the adversary is trained over a subset of training instances. Observe that SemiAdv achieves almost identical results to Adv, consistent with Han et al. (2021a). Although SemiAdv uses protected labels in training, it is substantially outperformed by the proposed unsupervised method ULPL+GD CLA.

AUC-PFC based on demographic parity fairness.

AUC-PFC w.r.t. intersectional groups.

, the original training dataset is balanced with respect to both sentiment and ethnicity but skewed in terms of sentiment-ethnicity combinations (40% happy-AAE, 10% happy-SAE, 10% sad-AAE, and 40% sad-SAE, respectively). The dev and test sets are balanced in terms of sentiment-ethnicity combinations. The dataset contains 100K/8K/8K train/dev/test instances. When varying training set distributions, we keep the 8k test instances unchanged. We use DeepMoji (Caliskan et al., 2017) to obtain Twitter representations, where DeepMoji is a model pretrained over 1.2 billion English tweets and is kept fixed during model training.

For each score in TrustPilot, the table shows the number of instances and the breakdown across demographics as a percentage.

notated with the target rating variable and associated with three protected labels: gender (male vs. female), age (under-35-year-old vs. over-45-year-old), and location (UK vs. the US). The original dataset contains 5 different countries (US, UK, Germany, Denmark, and France), and Li et al. (2018) discard non-English reviews after automatic language classification

Table 8 summarises the lowest accuracy scores w.r.t. each dataset. When calculating AUC-PFC scores for methods that are not flexible enough to reach the best fairness, we manually add the random model to the calculation. Taking the Vanilla model on Moji as an

Majority label proportion, i.e., lowest accuracy of each dataset.


Table 12 shows results w.r.t. demographic parity fairness. Trends for the different methods are similar to the results for equal opportunity fairness, indicating that debiasing methods are robust to different fairness metrics.


AUC-PFC score differences between debiasing w.r.t. target protected attributes (T), and source protected attributes (S). Larger numbers indicate better generalizations to unobserved protected groups.

AUC-PFC differences across different datasets w.r.t. a subset of debiasing methods.

Overall, Vanilla shows the best generalization to unobserved protected attributes, and unsupervised debiasing methods are better than supervised and semi-supervised debiasing methods. As shown in Figure 5, the selected model of ULPL+GD CLA based on economy labels at accuracy around 0.6 is not a Pareto point for gender, confirming that decreases in AUC-PFC of Vanilla and unsupervised debiasing methods are caused by cross-domain model selection.

E.6 THE APPLICATION OF ULPL TO DIFFERENT DEBIASING APPROACHES

Evaluation results ± standard deviation (%) of selected models over the Moji dataset.

For demonstration purposes, here we present the results for Moji. The full disaggregated results of the 15 settings can also be seen at https://github.com/HanXudong/An_Unsupervised_Locality-based_Method_for_Bias_Mitigation/blob/main/unsupervised_bias_mitigation/NB_Appendix_indomain_tradeoffs_dispaly.ipynb.

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their helpful feedback and suggestions. This work was funded by the Australian Research Council, Discovery grant DP200102519. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.

ETHICS STATEMENT

This work focuses on learning fair models without observation of protected labels at training time. Demographics are assumed to be available only for evaluation purposes, and not used for model training or inference. We only use attributes that the user has self-identified in our experiments. All data and models in this study are publicly available and used under strict ethical guidelines.

REPRODUCIBILITY STATEMENT

All baseline experiments are conducted with the FairLib library (Han et al., 2022b). Source code is available at https://github.com/HanXudong/An_Unsupervised_Locality-based_Method_for_Bias_Mitigation. Appendix A reports relevant statistics, details of train/test/dev splits, etc. for the five benchmark datasets that are used in this paper. Appendix B reports implementation details of evaluation metrics and corresponding aggregation approaches. Appendix C presents details of the computing infrastructure used in our experiments, computational budget, hyperparameter search, etc. Appendix F reports PFC and full disaggregated results, i.e., mean ± std over 5 random runs with different random seeds.

