COMPUTATIONAL-UNIDENTIFIABILITY IN REPRESENTATION FOR FAIR DOWNSTREAM TASKS

Abstract

Deep representation learning methods have drawn attention because they outperform classical algorithms in various downstream tasks, such as classification, clustering, and generative modeling. Given their success and real-world impact, fairness concerns are receiving growing attention. However, the study of fairness has been largely limited to specific downstream tasks, mostly classification. We claim that fairness problems in various downstream tasks originate from the input feature space, i.e., the learned representation space. While several studies have explored fair representations for classification, fair representation learning for unsupervised learning has not yet been actively discussed. To fill this gap, we define a new notion of fairness, computational-unidentifiability, which characterizes the fairness of a representation as the distributional independence of the sensitive groups. We present motivating problems showing that achieving a computationally-unidentifiable representation is critical for fair downstream tasks. Moreover, we propose a novel fairness metric, Fair Fréchet distance (FFD), to quantify computational-unidentifiability and to address the limitation of a well-known fairness metric for unsupervised learning, i.e., balance. The proposed metric is efficient to compute and preserves theoretical properties. We empirically validate the effectiveness of computationally-unidentifiable representations in various downstream tasks.

1. INTRODUCTION

Thanks to its outstanding performance and rapid development, deep learning has been widely applied to various domains, including natural language processing (NLP) (Devlin et al., 2018), computer vision (Karras et al., 2019), and generative models (Goodfellow et al., 2014). At the same time, reliability and fairness concerns (Lee & Floridi, 2020; Angwin et al., 2016; Dastin, 2018) have grown because of its impact on real-world applications. Such concerns arise in credit limit estimation (Vigdor, 2019), job application filtering (Dastin, 2018), and crime prevention (Dressel & Farid, 2018). Accordingly, algorithmic fairness is receiving growing attention as a way to prevent biased predictions. Following the mainstream fairness literature, we focus here on group fairness (Dua & Graff, 2019; Zafar et al., 2015; Hardt et al., 2016), which requires the equality of certain statistical measures (e.g., true positive rate, positive prediction rate) between subgroups with different protected attributes (e.g., gender, race, religion). Group fairness has been widely studied as a way to mitigate fairness violations in downstream tasks. Numerous studies (Hardt et al., 2016; Choi et al., 2020; Pleiss et al., 2017; Madras et al., 2018) explore how to attain group fairness in classification tasks. The primary objective of this family of works is to make predictions independent of a protected property. Hardt et al. (2016) propose equal opportunity, which requires the same true positive rate across subgroups. Calibration among the subgroups (Kleinberg et al., 2016) matches the predicted probability to the actual distribution of the favorable class. Moreover, some works (Kim et al., 2020; Jang et al., 2021) study efficient multi-constraint optimization to satisfy multiple fairness notions. However, most of these works focus on the supervised setting.
Even though deep learning has had significant success in various unsupervised learning tasks, such as clustering (Xie et al., 2016; Guo et al., 2017), generative modeling (Karras et al., 2019; Radford et al., 2019), and NLP (Hadifar et al., 2019), the fairness of unsupervised learning has received comparatively little study (Buet-Golfouse & Utyagulov, 2022), and how to quantify the fairness of unsupervised learning methods has not yet been well established. A widely used metric for fair clustering is balance (Chierichetti et al., 2017), which is analogous to demographic parity (Barocas & Selbst, 2016) in classification. However, balance has limitations because it quantifies fairness only through the ratio of samples from different protected groups within a cluster. For instance, even under ideal balance (the ratio of samples from different groups matches the ground truth), the samples of the sensitive groups can be distributed separately within clusters. In this case, it is easy to determine which sensitive group a sample belongs to, which might lead to biased decisions in downstream tasks. In generative models in particular, e.g., VAE (Kingma & Welling, 2013), the generated samples can be imbalanced if the latent space depends on the sensitive attributes. This can cause a critical problem, as generative models are widely applied to mitigate dataset imbalance (Guo et al., 2019; Fajardo et al., 2021; Mirza et al., 2021). Instead, we propose computational-unidentifiability as a fairness notion for unsupervised learning. Analogous to the fact that biased data is responsible for biased decision-making (Buolamwini & Gebru, 2018; Mehrabi et al., 2021), we claim that the learned representation itself plays a critical role in the fairness of downstream tasks that use DNNs.
Even though deep representations have been appreciated for their superb performance (Eldan & Shamir, 2016; Kozma et al., 2018), fairness concerns in the representation space have been overlooked. Thus, we explore fairness in the representation space, which bridges DNNs and the downstream tasks that raise fairness concerns. We validate our claim on downstream tasks by comparing the performance and fairness of two distributions: a fair and an unfair representation. To measure fairness in the representation space, we propose a novel metric called FFD (Fair Fréchet distance), inspired by the Fréchet distance (Dowson & Landau, 1982), which efficiently quantifies fairness by measuring the distributional independence of the sensitive groups in terms of computational identifiability (Hébert-Johnson et al.; Lahoti et al., 2020). Unlike balance, it considers not only statistical independence but also distributional independence between the sensitive groups. It can serve as a reference for future work that evaluates fairness, or the distributional independence of any attribute of interest, in a representation space. Moreover, we propose a deep fair clustering framework that learns a fair representation and achieves performance comparable to other clustering methods while ensuring fairness. The contributions of the paper can be summarized as follows:
1. We study the motivating problem of why a fair representation is important for achieving fair downstream tasks.
2. We propose a novel metric that quantifies fairness in the representation space, and we provide a rigorous analysis of its theoretical properties and complexity.
3. We propose a framework for fair representation learning for downstream tasks.
4. We validate our method on various benchmark datasets, comparing with state-of-the-art fair methods in the literature.

2. RELATED WORKS

GROUP FAIRNESS

As a class of definitions, group fairness measures the disparity of predicted outcomes among subgroups defined by certain sensitive attributes. A number of works introduce fairness notions that mitigate bias by requiring performance measures to be independent of the subgroup. Demographic parity (Barocas & Selbst, 2016) requires that the positive prediction rate be equal across groups and independent of the sensitive attribute. Equal opportunity (Hardt et al., 2016) requires that true positive rates match. Likewise, predictive equality (Chouldechova, 2017) requires equal false positive rates. Group-wise calibration (Kleinberg et al., 2016; Pleiss et al., 2017) matches the probability estimate to the actual ratio of positives within each group. In the unsupervised setting, balance (Chierichetti et al., 2018) was introduced to define fair clustering as having an equal number of samples from the different protected groups within each cluster. However, balance considers only statistical parity, which limits its utility as a metric, since perfect balance (i.e., 1) does not guarantee fairness (as the base rate differs). Moreover, none of these works explores the fairness of the representation itself.

FAIR SUPERVISED LEARNING

To ensure group fairness, recent works in supervised learning follow one of three approaches: 1) pre-processing; 2) in-processing; 3) post-processing. A pre-processing method (Chen et al., 2018) addresses the skewed sample size problem. Adversarial learning methods (Madras et al., 2018; Zhao et al., 2019) learn a fair representation that is independent of sensitive attributes. Post-processing methods that adjust the output of a biased model under multiple fairness objectives (Hardt et al., 2016; Jang et al., 2021) have been introduced because they are more efficient than training a model from scratch. However, the aforementioned approaches optimize group fairness constraints that are specific to the classification task.

FAIR UNSUPERVISED LEARNING

Fairness in the unsupervised setting has recently gained attention (Buet-Golfouse & Utyagulov, 2022; Ghadiri et al., 2021). A pioneering work on fair clustering (Chierichetti et al., 2018) proposed fairlet decomposition, which pre-processes the data before applying classic clustering methods in order to address disparate impact. The scalable fair clustering algorithm (Backurs et al., 2019) follows up on fairlet decomposition and improves its efficiency through approximation. A variational framework (Ziko et al., 2019) was introduced to satisfy a KL fairness objective. Wang & Davidson (2019) propose a new concept called fairoid, which enforces that the centroid of each sensitive group in feature space is equidistant from every cluster centroid. An adversarial objective (Li et al., 2020) has been employed to learn a representation that is statistically independent of the sensitive attribute while remaining clustering-favorable, using individual clustering modules. However, to the best of our knowledge, previous works mostly focus on making the predicted outcome independent of the sensitive attribute, i.e., statistical parity. Here we study the independence of sensitive attributes in the learned representation in unsupervised learning.

3. MOTIVATING PROBLEMS

In this work, we define a novel fairness notion called computational unidentifiability that is more general than the existing task-specific notions. Inspired by fair classification works (Hébert-Johnson et al.; Lahoti et al., 2020), we define computational identifiability as the maximum ability of an external classifier to distinguish which sensitive group a data point belongs to. Two distributions are computationally unidentifiable if and only if they are identical, i.e., no external classifier can distinguish which sensitive group a sample is drawn from. We present motivating problems that show how such distributional independence affects fairness in downstream tasks.
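The definition above can be made concrete with a small probe. The following is a minimal sketch, not the procedure used in the paper: the function name `probe_identifiability`, the nearest-centroid probe, and the mean shift of 5.0 are all illustrative assumptions. Because computational identifiability is defined through the best possible external classifier, a weak probe such as this one only gives a lower bound on it.

```python
import numpy as np

def probe_identifiability(z, a, seed=0):
    """Held-out accuracy of a nearest-centroid probe that predicts group a from z.

    z: (n, d) representations; a: (n,) binary sensitive-group labels.
    Accuracy near chance suggests this particular probe cannot tell the groups apart.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(z))
    half = len(z) // 2
    train, test = idx[:half], idx[half:]
    # Fit: one centroid per sensitive group, estimated on the training half.
    centroids = np.stack([z[train][a[train] == g].mean(axis=0) for g in (0, 1)])
    # Predict: assign each held-out sample to its nearest group centroid.
    dists = np.linalg.norm(z[test][:, None, :] - centroids[None, :, :], axis=2)
    pred = dists.argmin(axis=1)
    return float((pred == a[test]).mean())

rng = np.random.default_rng(0)
n = 2000
a = rng.integers(0, 2, size=n)
z_cu = rng.normal(size=(n, 2))     # both groups drawn from the same distribution
z_ci = z_cu + 5.0 * a[:, None]     # group 1 shifted away from group 0 (assumed shift)
acc_cu = probe_identifiability(z_cu, a)
acc_ci = probe_identifiability(z_ci, a)
```

On the shifted data the probe recovers the group almost perfectly, while on the shared distribution it stays close to chance; a stronger probe could only raise these accuracies.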

3.1. CLASSIFICATION AND CLUSTERING

Consider a data distribution with a binary sensitive attribute, $A = \{0, 1\}$, and a binary label, $Y = \{0, 1\}$. In Fig. 1, we illustrate synthetic data distributions, similar to previous works (Zafar et al., 2015; Kim et al., 2020), under two scenarios that both satisfy perfect balance, i.e., the base rate for each protected group is identical. Perfect balance can also be referred to as statistical independence. We denote by $X_{ya}$ the set of instances with $y \in Y$ and $a \in A$. The details of the synthetic data sampling process are in the appendix. Comparing the two distributions in Fig. 1b and Fig. 1a, the distribution in Fig. 1b explicitly exposes which sensitive group a sample belongs to, i.e., it is computationally identifiable (CI). This carries a potential risk of discrimination in downstream tasks. Specifically, it makes good clustering performance unstable, since such a representation can be clustered by the sensitive group structure rather than by the expected intrinsic features (Lee et al., 2021). In contrast, in Fig. 1a, the representation satisfies not only statistical independence but also distributional independence w.r.t. the sensitive attribute, i.e., it is computationally unidentifiable (CU). Therefore, models cannot easily identify which group a sample belongs to and thus cannot discriminate against groups in downstream tasks. To validate our claim, we evaluate the two representations with classification and clustering, which are the most popular tasks in supervised and unsupervised learning. For the classification task, we test logistic regression. To measure fairness in classification, we adopt demographic parity (DP) (Barocas & Selbst, 2016) and EOD, the latter measured as $|P(\hat{Y} = y \mid A = 0, Y = y) - P(\hat{Y} = y \mid A = 1, Y = y)|$, where $\hat{Y}$ is the predicted label. For both DP and EOD, lower is better.
For fair clustering, we measure balance following previous works (Xie et al., 2016; Li et al., 2020; Bera et al., 2019), which requires $\mathbb{E}_{X \sim D}[A = a \mid C(X) = k] = \mathbb{E}_{X \sim D}[A = a]$, where $C(X) = k$ indicates that the data point $X$ is assigned to the $k$-th cluster by model $C$. A balanced clustering satisfies statistical independence w.r.t. $A$, and balance $= 1$ is perfect balance. However, we claim that statistical independence alone cannot fully assess fair clustering. To address this limitation of the previous metric for fair unsupervised learning, we propose a novel fairness metric for representations called Fair Fréchet Distance (FFD), which is discussed in the following section. Table 1 summarizes the evaluation of downstream tasks on the two distributions. Even though both the CU and CI data are sampled from perfectly balanced distributions, the fairness violations on CI are significantly worse than those on CU in both tasks. Notably, fairness is sensitive to distributional independence, whereas performance is not affected. This validates that the representation itself has a substantial impact on fairness in downstream tasks while preserving utility. Moreover, FFD is a good proxy for computational identifiability, since a smaller FFD indicates that sensitive information is harder to identify from the representation.
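The two classification fairness gaps discussed above can be sketched as follows. This is an illustrative sketch under assumptions: the demographic parity gap uses the common form $|P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1)|$, which is assumed here rather than taken from the text, and the function names `dp_gap` and `eod_gap` as well as the toy labels are hypothetical.

```python
import numpy as np

def dp_gap(y_pred, a):
    """|P(Yhat=1 | A=0) - P(Yhat=1 | A=1)| for binary predictions (assumed DP form)."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return abs(y_pred[a == 0].mean() - y_pred[a == 1].mean())

def eod_gap(y_pred, y_true, a, y=1):
    """|P(Yhat=y | A=0, Y=y) - P(Yhat=y | A=1, Y=y)| for a single label value y."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))

    def rate(g):
        # Fraction of group-g samples with true label y that are predicted as y.
        return (y_pred[(a == g) & (y_true == y)] == y).mean()

    return abs(rate(0) - rate(1))

# Toy predictions: both groups receive positives at the same rate,
# yet group 1 is classified less accurately on the positive class.
a      = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
dp = dp_gap(y_pred, a)
eod = eod_gap(y_pred, y_true, a)
```

In this toy case the parity gap is zero while the label-conditioned gap is not, which illustrates why matching rates alone can hide group-dependent behavior.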

4. FAIR FRÉCHET DISTANCE

To quantify the proposed fairness notion in terms of computational identifiability, in this section we introduce a novel metric named Fair Fréchet Distance (FFD) that measures the distance between the distributions of different sensitive groups. Consider two sets of samples $U \in \mathbb{R}^{d \times n_0}$ and $V \in \mathbb{R}^{d \times n_1}$. Suppose the samples in $U$ and $V$ are each drawn from a multivariate Gaussian distribution. Define the centering matrix $H_n \in \mathbb{R}^{n \times n}$ as $H_n = I - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, where $\mathbf{1}_n \in \mathbb{R}^n$ is the vector with all elements equal to 1 and $I$ is the identity matrix. We first introduce two metrics, in Definitions 4.1 and 4.2, that measure the distance between the two distributions $U$ and $V$.

Definition 4.1. The Fréchet distance (FD) (Dowson & Landau, 1982) between $U$ and $V$ is defined as

$$\mathrm{FD}^2(U, V) = \|\mu_U - \mu_V\|_2^2 + \mathrm{Tr}\left(\Sigma_U + \Sigma_V - 2\left(\Sigma_U^{1/2} \Sigma_V \Sigma_U^{1/2}\right)^{1/2}\right),$$

where $\mu_U, \mu_V$ and $\Sigma_U, \Sigma_V$ are the means and covariance matrices of $U$ and $V$, respectively.

Definition 4.2. We define the Fair Fréchet Distance within Cluster (FFDC) between $U$ and $V$ as

$$\mathrm{FFDC}^2(U, V) = \left\|\frac{U \mathbf{1}_{n_0}}{n_0} - \frac{V \mathbf{1}_{n_1}}{n_1}\right\|_2^2 + \left(\frac{\|U H_{n_0}\|_F}{\sqrt{n_0 - 1}} - \frac{\|V H_{n_1}\|_F}{\sqrt{n_1 - 1}}\right)^2 + \frac{\mathrm{Tr}(U U^\top) + \mathrm{Tr}(V V^\top)}{\sqrt{n_0 - 1}\,\sqrt{n_1 - 1}}$$

when $n_0, n_1 > 1$; otherwise $\mathrm{FFDC}^2(U, V) = \infty$.

Next, we define the Fair Fréchet Distance (FFD) in Definition 4.3. For simplicity, we define FFD for the case of a binary sensitive feature $a \in \{0, 1\}$. We describe how to extend the measure to multi-valued sensitive features at the end of this section. Consider a clustering assignment of $m$ samples into $c$ clusters $\{X_1, X_2, \ldots, X_c\}$, where $X_k \in \mathbb{R}^{d \times n_k}$, $k = 1, 2, \ldots, c$, contains the $n_k$ samples in the $k$-th cluster and $\sum_{k=1}^{c} n_k = m$. Within each cluster $X_k$, define $U_k \in \mathbb{R}^{d \times n_k}$ and $V_k \in \mathbb{R}^{d \times n_k}$ column-wise as

$$u_k^i = \begin{cases} x_k^i, & \text{if } a_k^i = 0 \\ \mathbf{0}, & \text{else} \end{cases}, \qquad v_k^i = \begin{cases} x_k^i, & \text{if } a_k^i = 1 \\ \mathbf{0}, & \text{else} \end{cases} \qquad (1)$$

where $x_k^i$ is the $i$-th sample in $X_k$; $a_k^i \in \{0, 1\}$ is the sensitive feature of the $i$-th sample in $X_k$; and $\mathbf{0}$ is the zero vector.
Thus we have $U_k + V_k = X_k$, $k = 1, 2, \ldots, c$.

Definition 4.3. With the definition of $U_k|_{k=1}^{c}$ and $V_k|_{k=1}^{c}$ in equation 1, we define FFD for the $m$ samples with the clustering assignment $\{X_1, X_2, \ldots, X_c\}$ as

$$\mathrm{FFD}(\{X_1, X_2, \ldots, X_c\}) = \max_k \mathrm{FFDC}(U_k, V_k).$$

Theorem 4.4. With $U_k|_{k=1}^{c}$ and $V_k|_{k=1}^{c}$ as defined in equation 1, the following inequality holds:

$$\mathrm{FFD}^2(\{X_1, X_2, \ldots, X_c\}) - \max_k \frac{1}{n_k - 1} \mathrm{Tr}(X_k X_k^\top) \le \max_k \mathrm{FD}^2(U_k, V_k) \le \mathrm{FFD}^2(\{X_1, X_2, \ldots, X_c\}).$$

The proof of Theorem 4.4 is in the appendix. For a multi-valued sensitive feature, we extend the definition of FFDC in Definition 4.2 to the maximum FFDC value among all pairs of sensitive groups in a cluster, and thereby extend the definition of FFD in Definition 4.3. It is easy to verify that Theorem 4.4 still holds in the multi-valued case. Theorem 4.4 indicates that the FD metric is upper bounded by our proposed FFD metric, so minimizing FFD minimizes an upper bound of FD. Further, the gap between FD and FFD is bounded by $\max_k \frac{1}{n_k - 1} \mathrm{Tr}(X_k X_k^\top)$. Note that FFD is minimized if and only if the following two conditions are met in each cluster $X_k$, $k = 1, 2, \ldots, c$:

$$U_k \mathbf{1}_{n_k} = V_k \mathbf{1}_{n_k}, \qquad \|U_k H_{n_k}\|_F^2 = \|V_k H_{n_k}\|_F^2,$$

in which case we have $\mathrm{FD}^2 = 0$. Thus the FD value is minimized if and only if our proposed FFD metric is minimized.
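Definitions 4.2 and 4.3 translate almost directly into array operations. The sketch below is one possible reading of those definitions, with samples stored as columns; the function names `ffdc_sq` and `ffd_sq` and the toy inputs are illustrative assumptions, and the trace term uses the identity $\mathrm{Tr}(UU^\top) = \|U\|_F^2$.

```python
import numpy as np

def ffdc_sq(U, V):
    """Squared FFDC (Definition 4.2) for U of shape (d, n0) and V of shape (d, n1)."""
    n0, n1 = U.shape[1], V.shape[1]
    if not (n0 > 1 and n1 > 1):
        return np.inf
    # First term: squared distance between the column means U 1/n0 and V 1/n1.
    mean_term = np.sum((U.mean(axis=1) - V.mean(axis=1)) ** 2)
    # ||U H_n||_F is the Frobenius norm of the column-centered matrix.
    u_c = np.linalg.norm(U - U.mean(axis=1, keepdims=True)) / np.sqrt(n0 - 1)
    v_c = np.linalg.norm(V - V.mean(axis=1, keepdims=True)) / np.sqrt(n1 - 1)
    # Tr(U U^T) equals the squared Frobenius norm, so no d x d matrix is formed.
    trace_term = (np.sum(U ** 2) + np.sum(V ** 2)) / (np.sqrt(n0 - 1) * np.sqrt(n1 - 1))
    return mean_term + (u_c - v_c) ** 2 + trace_term

def ffd_sq(X, a, clusters):
    """Squared FFD (Definition 4.3): max over clusters of FFDC^2(U_k, V_k).

    X: (d, m) samples as columns; a: (m,) binary sensitive feature;
    clusters: (m,) cluster index of each sample.
    """
    worst = 0.0
    for k in np.unique(clusters):
        Xk, ak = X[:, clusters == k], a[clusters == k]
        Uk = Xk * (ak == 0)   # equation 1: zero out the columns of group 1
        Vk = Xk * (ak == 1)   # equation 1: zero out the columns of group 0
        worst = max(worst, ffdc_sq(Uk, Vk))
    return worst

a = np.array([0, 0, 1, 1])
clusters = np.zeros(4, dtype=int)
X_same = np.array([[1.0, -1.0, 1.0, -1.0]])    # both groups share mean and spread
X_shift = np.array([[1.0, -1.0, 11.0, 9.0]])   # group 1 shifted away from group 0
ffd_same = ffd_sq(X_same, a, clusters)
ffd_shift = ffd_sq(X_shift, a, clusters)
```

For `X_same` only the trace term survives, which is the gap term in Theorem 4.4; for `X_shift` the mean and spread terms dominate.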

5. FAIR CLUSTERING FRAMEWORK

In this section, we present our deep fair clustering framework and its objective functions. For simplicity, we treat the sensitive attribute as a binary feature; the framework can be easily extended to problems with multiple sensitive attributes. Consider the $c$-clustering problem over $m$ i.i.d. data samples $X \in \mathbb{R}^{d \times m}$, where each sample is a $d$-dimensional vector. An encoder $E$ learns a representation $Z \in \mathbb{R}^{l \times m}$, and a clustering module $C$ takes $Z$ as input and outputs the probabilities $P \in \mathbb{R}^{c \times m}$ of the predicted clusters as soft labels. The goal of $E$ and $C$ is to achieve computationally unidentifiable fairness and high clustering performance. Given a matrix $X \in \mathbb{R}^{d \times m}$ with $m$ samples, we denote the $i$-th data point of $X$ by the bold lowercase letter with the index in the superscript, e.g., $\mathbf{x}^i$, and the $k$-th entry of a vector by a lowercase letter, e.g., $x^i_k$.

5.1. CLUSTERING LOSS

Inspired by previous works (Xie et al., 2016; Li et al., 2020), we employ a clustering loss to learn a representation that is concentrated around the cluster centroids. The clustering module $C$ assigns the probability that a sample $z^i = E(x^i)$ belongs to each cluster $k$ by comparing it with trainable cluster centroids $c_k$ under a Student t-distribution:

$$p_k^i = \frac{\left(1 + \|z^i - c_k\|^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{k'} \left(1 + \|z^i - c_{k'}\|^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}},$$

where $p_k^i$ is the probability that $x^i$ belongs to the $k$-th cluster and $\alpha$ is the degree of freedom of the Student t-distribution. We then compute the target assignment $q_k^i$ by sharpening the soft assignment $p_k^i$ within a sensitive group $a$:

$$q_k^i = \frac{(p_k^i)^2 / \sum_{x^j \in X_a} p_k^j}{\sum_{k'} (p_{k'}^i)^2 / \sum_{x^j \in X_a} p_{k'}^j},$$

which reinforces the confidence of the predicted cluster and acts as a regularizer that prevents large clusters. We set the empirical clustering loss $L_{cls}$ as the KL divergence between $p_k$ and $q_k$:

$$L_{cls} = \mathrm{KL}(P \| Q) = \sum_{x \in X} \sum_k p_k \log \frac{p_k}{q_k}.$$
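The three quantities in this subsection can be sketched in a few lines. This is an illustrative sketch rather than the training code: it stores samples as rows (the transpose of the $l \times m$ layout in the text), the default $\alpha = 1$ and the small constant `eps` are assumptions, and the function names are hypothetical.

```python
import numpy as np

def soft_assign(Z, C, alpha=1.0):
    """Student-t soft assignment. Z: (m, l) latents, C: (c, l) centroids -> P: (m, c)."""
    sq_dist = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    kernel = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return kernel / kernel.sum(axis=1, keepdims=True)

def target_distribution(P, a):
    """Sharpen P within each sensitive group: q proportional to p^2 / group-wise cluster mass."""
    Q = np.empty_like(P)
    for g in np.unique(a):
        mask = a == g
        # Divide squared assignments by the soft cluster frequency inside group g.
        weight = P[mask] ** 2 / P[mask].sum(axis=0, keepdims=True)
        Q[mask] = weight / weight.sum(axis=1, keepdims=True)
    return Q

def clustering_loss(P, Q, eps=1e-12):
    """Empirical KL(P || Q) summed over samples and clusters, as written in the text."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))                  # six latent samples in two dimensions
C = np.array([[1.0, 0.0], [-1.0, 0.0]])      # two cluster centroids
a = np.array([0, 0, 0, 1, 1, 1])             # sensitive group of each sample
P = soft_assign(Z, C)
Q = target_distribution(P, a)
loss = clustering_loss(P, Q)
```

Each row of `P` and `Q` sums to one, so the loss is a sum of per-sample KL divergences and is non-negative up to the effect of `eps`.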

5.2. FAIRNESS LOSS

Our goal is to further improve fairness in the clustering task so that the sensitive group cannot be identified from the samples in a cluster. Recent work proposed the fairoid (fair-centroid) (Wang & Davidson, 2019), which requires the centroid of each sensitive group to be equidistant from all cluster centroids. We claim that the fairoid cannot guarantee a fair representation, since equidistant centroids can still be perfectly separated by the cluster centroids. To achieve computational-unidentifiability, we employ a variational autoencoder (VAE) structure (Kingma & Welling, 2013) for the encoder to leverage the reparameterization trick. We can then formulate the latent feature of an instance $x^i$ as $z^i = E(x^i) = \mu^i + \sigma^i \epsilon$, where $\epsilon \sim N(0, I)$ and $\mu^i$ and $\sigma^i$ are the mean and standard deviation, respectively. To make the learned representation independent of the sensitive attribute, we minimize the distance between the distributions of different protected groups within a cluster, i.e., $\mathrm{KL}(p_{(a,k)} \| p_{(a',k)})$, where $p_{(a,k)}$ is the probability distribution of the samples in sensitive group $a$ with predicted cluster $k$. Assume each distribution is Gaussian, $p_{(a,k)} = N(\mu_{(a,k)}, \mathrm{Diag}(\sigma_{(a,k)}^2))$. Then our fairness objective of minimizing the KL divergence can be written as

$$L_{fair} = -\frac{1}{2}\left(2 \log\left(\frac{\sigma_{(a,k)}}{\sigma_{(a',k)}}\right) - \frac{\sigma_{(a,k)}^2 + (\mu_{(a,k)} - \mu_{(a',k)})^2}{\sigma_{(a',k)}^2} + 1\right).$$

For the empirical loss $\hat{L}_{fair}$, we use $\hat{\mu}_{(a,k)} = \frac{1}{|X_{a,k}|} \sum_{i \in X_{a,k}} \mu^i$ and $\hat{\sigma}_{(a,k)} = \frac{1}{|X_{a,k}|} \sum_{i \in X_{a,k}} \sigma^i$ as the empirical mean and standard deviation, where $X_{a,k}$ denotes the set of instances in group $a$ predicted as cluster $k$, since we assume all samples are i.i.d. To sum up, our final objective is to minimize the loss

$$\min_{E, C} \; L_{cls} + L_{fair}.$$
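The fairness objective is the closed-form KL divergence between two Gaussians with diagonal covariance. The sketch below is written under stated assumptions: the encoder's scale output is treated as a standard deviation, the per-dimension terms are summed, the loss is accumulated over clusters in the direction group 0 to group 1, and clusters containing a single group are skipped. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2))), summed over dimensions."""
    per_dim = -0.5 * (
        2.0 * np.log(sigma_p / sigma_q)
        - (sigma_p ** 2 + (mu_p - mu_q) ** 2) / sigma_q ** 2
        + 1.0
    )
    return float(np.sum(per_dim))

def fairness_loss(mu, sigma, a, clusters):
    """Sum over clusters of KL(p_(0,k) || p_(1,k)) using averaged encoder outputs.

    mu, sigma: (m, l) per-sample encoder means and standard deviations;
    a: (m,) binary sensitive attribute; clusters: (m,) predicted cluster index.
    """
    total = 0.0
    for k in np.unique(clusters):
        g0 = (clusters == k) & (a == 0)
        g1 = (clusters == k) & (a == 1)
        if g0.any() and g1.any():  # assumption: skip clusters with only one group
            total += gaussian_kl(mu[g0].mean(axis=0), sigma[g0].mean(axis=0),
                                 mu[g1].mean(axis=0), sigma[g1].mean(axis=0))
    return total

# Two groups in one cluster whose means differ by 1 in a single dimension.
mu = np.array([[0.0], [0.0], [1.0], [1.0]])
sigma = np.ones((4, 1))
a = np.array([0, 0, 1, 1])
clusters = np.zeros(4, dtype=int)
loss = fairness_loss(mu, sigma, a, clusters)
```

With unit scales and a mean gap of one, the divergence reduces to half the squared mean difference, and it vanishes when both groups share the same mean and scale.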

6. EXPERIMENTS

In this section, we compare the fairness and performance of the proposed method with state-of-the-art methods.

6.1. EXPERIMENTAL SETUP

Benchmark Dataset. We use two image datasets and two tabular datasets to evaluate the methods. The MNIST-USPS dataset consists of 60,000 MNIST and 7,291 USPS handwritten grayscale digits. We consider the source of the image, i.e., MNIST or USPS, as the sensitive attribute in a $c = 10$ clustering problem. MTFL (Zhang et al., 2014) consists of 12,995 facial images and their landmark information. It also provides attributes such as gender and wearing glasses. Following Li et al. (2020), we use wearing glasses or not as the sensitive attribute in a $c = 2$ clustering problem whose desired clustering attribute is gender. We pre-process the image datasets by normalizing the pixel values. The normalization parameters are mean = 0.1307 and std = 0.3081 for MNIST-USPS, and mean = (0.3527, 0.3902, 0.4697) and std = (1, 1, 1) for MTFL.

Comparing Methods. To evaluate our method, we compare with the following related methods in the experiments: ScFC (Backurs et al., 2019), ALG (Bera et al., 2019), DFC (Li et al., 2020), and VFC (Ziko et al., 2019). As a baseline and reference, we use k-means++ and perfect clustering. We use the same backbone structure for the deep fair clustering methods for a fair evaluation. For MNIST-USPS, we pretrain the encoder as a VAE to reconstruct the original image, following DFC (Li et al., 2020). For MTFL, we adopt ResNet50 (He et al., 2016) pretrained on ImageNet as the encoder. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $10^{-5}$. We run all experiments on an Nvidia Quadro RTX 6000 and an Intel i9-9960X with 128GB RAM.

Evaluation Metric. For the evaluation, we measure performance with accuracy and NMI (Strehl & Ghosh, 2002), and fairness with the accuracy difference between sensitive groups, balance, and FFD.
The four metrics are computed as:

$$\mathrm{Accuracy} = \frac{\sum_{x^i \sim X} \mathbb{1}\left[\arg\max_k p_k^i = y^i\right]}{n},$$

$$\mathrm{NMI} = \frac{\sum_{k,j} n^+_{kj} \log \frac{n \, n^+_{kj}}{n_k \, n^+_j}}{\sqrt{\left(\sum_k n_k \log \frac{n_k}{n}\right)\left(\sum_j n^+_j \log \frac{n^+_j}{n}\right)}},$$

$$\mathrm{Balance} = \min_k \left(\min\left(\frac{n_{uk}}{n_{vk}}, \frac{n_{vk}}{n_{uk}}\right)\right),$$

$$\mathrm{FFD}^2 = \max_k \left(\left\|\frac{U_k \mathbf{1}_{n_k}}{n_k} - \frac{V_k \mathbf{1}_{n_k}}{n_k}\right\|^2 + \left(\frac{\|U_k H_{n_k}\|_F}{\sqrt{n_k - 1}} - \frac{\|V_k H_{n_k}\|_F}{\sqrt{n_k - 1}}\right)^2 + \frac{\mathrm{Tr}(U_k U_k^\top) + \mathrm{Tr}(V_k V_k^\top)}{n_k - 1}\right).$$

We denote by $n$ the total number of samples, by $n_k$ the number of samples predicted as cluster $k$, by $n_{uk}$ (resp. $n_{vk}$) the number of samples in cluster $k$ that belong to group $u$ (resp. $v$), by $n^+_j$ the number of samples with ground-truth label $j$, and by $n^+_{kj}$ the number of samples in cluster $k$ with ground-truth label $j$. Also, $y^i$ is the true label of $x^i$, which is matched to the clusters by solving a linear sum assignment problem that finds the best pairing between predicted clusters and true labels before accuracy is calculated. The lower bound of the Fréchet Distance (FD) can be calculated by simply omitting the last term in the FFD equation above.
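Balance and the matched accuracy can be sketched as follows. This is an illustrative sketch with assumptions: the function names are hypothetical, a cluster missing one group is assigned balance 0, and the cluster-to-label matching is done by brute force over permutations, which is only practical for a small number of clusters. For larger $c$, a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import numpy as np
from itertools import permutations

def balance(clusters, a):
    """min over clusters of min(n_uk / n_vk, n_vk / n_uk) for a binary attribute a."""
    clusters, a = np.asarray(clusters), np.asarray(a)
    worst = 1.0
    for k in np.unique(clusters):
        n_u = np.sum((clusters == k) & (a == 0))
        n_v = np.sum((clusters == k) & (a == 1))
        if n_u == 0 or n_v == 0:
            return 0.0  # assumption: a single-group cluster is maximally unbalanced
        worst = min(worst, n_u / n_v, n_v / n_u)
    return float(worst)

def matched_accuracy(clusters, y, c):
    """Accuracy under the best one-to-one cluster-to-label mapping (brute force)."""
    counts = np.zeros((c, c), dtype=int)
    for k, j in zip(clusters, y):
        counts[k, j] += 1  # samples in cluster k with ground-truth label j
    best = max(sum(counts[k, perm[k]] for k in range(c))
               for perm in permutations(range(c)))
    return best / len(y)

clusters = [0, 0, 0, 1, 1, 1]   # predicted cluster of each sample
a        = [0, 1, 1, 0, 0, 1]   # sensitive group of each sample
y        = [1, 1, 1, 0, 0, 1]   # ground-truth label of each sample
bal = balance(clusters, a)
acc = matched_accuracy(clusters, y, c=2)
```

In this toy assignment each cluster holds the two groups in a 1:2 ratio, and the best matching maps cluster 0 to label 1 and cluster 1 to label 0.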

6.2. QUANTITATIVE EVALUATION

In Table 2, we report the quantitative evaluation on the two image datasets. For accuracy and NMI, higher is better, and balance is better the closer it is to that of perfect clustering. For FFD, lower is better. To calculate FFD, we set all compared deep models (DFC and ours) to have the same dimension in the representation space. In addition, we preprocess the latent features from each model by normalizing the maximum magnitude to 1 for a fair comparison. For non-deep models, we measure FFD in the original input space, and these values are underlined. Note that we do not directly compare FFD between deep and non-deep models, since the values are calculated in different spaces. In the table, we observe that some non-deep fairness methods achieve lower accuracy than classical k-means++, sacrificing accuracy for better balance. The proposed method achieves comparable or better accuracy and balance than the baselines. Moreover, it achieves a significantly lower FFD than the other deep fair method, DFC (Li et al., 2020). As an ablation study, we evaluate our framework with the same structure but without the fairness loss term. We empirically found that integrating $L_{fair}$ in training sometimes improves not only fairness but also performance. Interestingly, ScFC (Backurs et al., 2019) obtains a lower FFD than perfect clustering on MNIST-USPS. Thus FFD can also be a good measure of how biased the dataset itself is against some demographic groups, e.g., for analyzing imbalanced data or under-representation.

6.3. QUALITATIVE ANALYSIS

In this subsection, we qualitatively evaluate the fairness of the representation learned by the proposed method in comparison with other deep fair clustering methods. At the start of training (Fig. 2a), the representation is clustered by the sensitive attribute. This shows that sensitive information plays an important role in the pretraining of the encoder or in some downstream tasks, which is not desirable. Finally, as shown in Fig. 2b, we achieve similar distributions for the different sensitive groups within a cluster. This can be explained by the proposed objective: $L_{fair}$ drives the representation to follow a multivariate normal distribution for all sensitive groups while aligning the centroids of the different sensitive groups within each cluster. The representation from DFC is noticeably more identifiable than ours. This could result in bias in downstream tasks or in clusters that consist of a single sensitive group.

Table 3: Evaluation of the learned representations from deep networks on the MNIST-USPS dataset. We compare the end-to-end deep model with a variant that applies the k-means++ clustering method to the learned representation. The representation with lower FFD achieves more stable and fair results.

6.4. JUSTIFICATION OF FAIR FRÉCHET DISTANCE AS A FAIRNESS METRIC

Representation learning for clustering with deep networks benefits from their ability to discover intrinsic features that are difficult to observe in raw data. However, as mentioned in the motivation, if samples are computationally-identifiable (unfair), they are more vulnerable to being clustered by extrinsic features, i.e., the sensitive attribute. To validate this claim, we run the classical k-means++ algorithm on the representations learned by our method and by DFC. Table 3 summarizes the results. As expected, DFC loses more accuracy and NMI than ours when the learned features are clustered with k-means++, because its FFD is higher than ours. In contrast, k-means++ produces almost identical results on our representation. We also achieve better balance and NMI than the DFC variant. This confirms that FFD is a good metric for fair clustering, as a representation with lower FFD consistently yields fair clusters. This is also shown qualitatively by the t-SNE visualization. When the representation is computationally-identifiable and easily separable by the sensitive attribute, subsequent clustering can be unstable and unfair.

7. CONCLUSION AND DISCUSSION

In this paper, we define computationally unidentifiable fairness as a novel notion of fairness to measure distributional independence of sensitive attributes by leveraging Fréchet distance. Furthermore, we elaborate on the theoretical analysis of the proposed metric and find some interesting properties. We integrate contrastive learning and distributional constraint to achieve state-of-the-art performance while maintaining computational-unidentifiability. We report experimental results comparing with other fair clustering methods on various benchmark datasets to validate our claim.



MNIST: http://yann.lecun.com/exdb/mnist/
USPS: https://www.kaggle.com/bistaumanga/usps-dataset



Figure 1: Illustration of two synthetic data distributions, one computationally unidentifiable (fair) and one computationally identifiable (unfair), with a binary sensitive group and class. Different colors (resp. shapes) indicate different classes $y$ (resp. sensitive groups $a$). We denote by $X_{ya}$ the set of instances with $y, a \in \{0, 1\}$.

With the definition of $U_k|_{k=1}^{c}$ and $V_k|_{k=1}^{c}$ in equation 1, the following inequality holds:

Fig. 2 illustrates t-SNE (Van der Maaten & Hinton, 2008) visualization of the original data, the learned representation of our model, and DFC on MNIST-USPS dataset. The colors in top and bottom rows indicate different ground truth labels and sensitive attributes, respectively. The first two columns show the progress of our model in the training process. The last column in the figure is the visualization of DFC after it converges. At the starting phase, as in Fig. 2a, we observe that representation is clustered based on the sensitive attribute.

Figure 2: t-SNE visualization of FFC and DFC from 1024 randomly selected samples on the MNIST-USPS dataset. Samples with different colors indicate different ground truth labels (10 digits) in the top row and different sensitive groups (MNIST, USPS) in the bottom row.

Table 3:
Method | Acc (Diff) | NMI | Balance | FFD^2
Perfect | 1.0 (0.0) | 1.0 | 0.12 | -
DFC | 0.824 (0.160) | 0.828 | 0.053 | 14.13
DFC (k-means++) | 0.812 (0.115) | 0.754 | 0.044 | 7.61
Ours | 0.831 (0.015) | 0.837 | 0.091 | 1.82
Ours (k-means++) | 0.831 (0.016) | 0.834 | 0.090 | 1.82

Table 1: Evaluation of downstream tasks on different distributions. The Computationally-Unidentifiable (CU) distribution achieves significantly fairer results with similar performance in both supervised and unsupervised learning tasks, while the Computationally-Identifiable (CI) distribution suffers a large loss of fairness. FFD is the proposed metric to measure the fairness of a representation.

Consider a clustering problem that partitions $m$ samples into $c$ clusters, where each data sample is a $d$-dimensional vector. The FFD metric proposed in Definition 4.3 is efficient to compute: it requires time linear in the number of features and in the number of samples. The calculation of FFD in Definition 4.3 has a time complexity of $O(ndc)$ (since we only need the traces of the matrices $UU^\top$ and $VV^\top$ in Definition 4.2, it requires linear instead of quadratic time w.r.t. $d$). In contrast, the traditional FD metric in Definition 4.1 has cubic time complexity w.r.t. the number of features. The time complexity of calculating FD is $O(c(nd^2 + d^3))$ (since it requires computing the exact covariance matrices $\Sigma_U$ and $\Sigma_V$ in Definition 4.1 and the corresponding matrix square root).
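The linear-time claim rests on the identity $\mathrm{Tr}(UU^\top) = \|U\|_F^2$, which avoids forming any $d \times d$ matrix. A small numerical check, with array sizes chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(64, 500))        # d = 64 features, n = 500 samples as columns

trace_via_gram = np.trace(U @ U.T)    # forms a d x d matrix first: O(n d^2) work
trace_via_norm = np.sum(U ** 2)       # one elementwise pass over U: O(n d) work
```

Both expressions return the same value up to floating-point error; only the second scales linearly in $d$.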

ScFC (Backurs et al., 2019) is a non-deep fair clustering method that approximates the fairlet decomposition algorithm in linear run time. ALG (Bera et al., 2019) is a non-deep fair clustering method based on the k-median approach. DFC (Li et al., 2020) is a deep fair clustering method that learns a fair and clustering-favorable representation using an adversarial loss and a clustering module for each individual group. VFC (Ziko et al., 2019) is a variational framework for fair clustering with KL fairness as the clustering objective.

Table 2: Evaluation of clustering methods on two datasets: MNIST-USPS and MTFL. For accuracy and NMI, higher is better. Balance is better the closer it is to that of perfect clustering, i.e., the original data statistic. FFD^2 measures the distributional independence of the sensitive attribute, i.e., lower is better. FFD is measured in the learned representation space (resp. input space) for the deep (resp. non-deep) models. FFD measurements in the input space are underlined.

