LEARNING FAIR GRAPH REPRESENTATIONS VIA AUTOMATED DATA AUGMENTATIONS

Abstract

We consider fair graph representation learning via data augmentations. While this direction has been explored previously, existing methods invariably rely on certain assumptions on the properties of fair graph data in order to design fixed strategies on data augmentations. Nevertheless, the exact properties of fair graph data may vary significantly in different scenarios. Hence, heuristically designed augmentations may not always generate fair graph data in different application scenarios. In this work, we propose a method, known as Graphair, to learn fair representations based on automated graph data augmentations. Such fairness-aware augmentations are themselves learned from data. Our Graphair is designed to automatically discover fairness-aware augmentations from input graphs in order to circumvent sensitive information while preserving other useful information. Experimental results demonstrate that our Graphair consistently outperforms many baselines on multiple node classification datasets in terms of fairness-accuracy trade-off performance. In addition, results indicate that Graphair can automatically learn to generate fair graph data without prior knowledge on fairness-relevant graph properties.

1. INTRODUCTION

Recently, graph neural networks (GNNs) have attracted increasing attention due to their remarkable performance (Gao et al., 2021; Gao & Ji, 2019; Liu et al., 2021a;b; Yuan et al., 2021) in many applications, such as knowledge graphs (Hamaguchi et al., 2017), molecular property prediction (Liu et al., 2022; 2020; Han et al., 2022a), and social media mining (Hamilton et al., 2017). Despite recent advances in graph representation learning (Grover & Leskovec, 2016; Kipf & Welling, 2017; 2016; Gilmer et al., 2017; Han et al., 2022b), these GNN models may inherit or even amplify bias from training data (Dai & Wang, 2021), thereby introducing prediction discrimination against certain groups defined by sensitive attributes, such as race and gender. Such discriminatory behavior may lead to serious ethical and societal concerns, thus limiting the applications of GNNs to many real-world high-stakes tasks, such as criminal justice (Suresh & Guttag, 2019), job hunting (Mehrabi et al., 2021), healthcare (Rajkomar et al., 2018), and credit scoring (Feldman et al., 2015; Petrasic et al., 2017). Hence, it is highly desirable to learn fair graph representations without discriminatory biases (Dong et al., 2022; Zhang et al., 2022; Kang et al., 2022; Dai et al., 2022). A primary issue (Mehrabi et al., 2021; Olteanu et al., 2019) in fairness is that training data usually contain biases, which are the source of discriminatory model behavior. Therefore, many existing works (Agarwal et al., 2021; Kose & Shen, 2022; Spinelli et al., 2021) propose to learn fair graph representations by modifying training data with fairness-aware graph data augmentations. These methods propose some graph data properties that are beneficial to fair representation learning, and then adopt heuristic graph data augmentation operations, including node feature masking and edge perturbation, to refine graph data.
However, the proposed graph properties (Spinelli et al., 2021; Kose & Shen, 2022) may not be appropriate for all graph datasets due to the diverse nature of graph data. For example, balanced inter/intra edges (Kose & Shen, 2022) may destroy topology structures of social networks, leading to the loss of important information. Even if the proposed graph properties are effective, the best graph properties may vary significantly in different scenarios. Hence, it is highly desirable to automatically discover dataset-specific fairness-aware augmentation strategies among different datasets with a single framework. To this end, a natural question is raised: Can we achieve fair graph representation learning via automated data augmentations? In this work, we attempt to address this question via proposing Graphair, a novel automated graph augmentation method for fair graph representation learning. A primary challenge is how to achieve fairness and informativeness simultaneously in the augmented data. As we intentionally avoid assuming prior knowledge on what types of graphs are considered fair, we propose to employ an adversary model to predict sensitive attributes from augmented graph data. A fair augmented graph should prevent the adversary model from identifying the sensitive attributes. In addition, we propose to retain useful information from original graphs by using contrastive learning to maximize the agreement between original and augmented graphs. Experimental results demonstrate that Graphair consistently outperforms many baselines on multiple node classification datasets in terms of fairness-accuracy trade-off performance.

2.1. FAIR GRAPH REPRESENTATION LEARNING

In this work, we study the problem of fair graph representation learning. Let $G = \{A, X, S\}$ be a graph with $n$ nodes. Here, $A \in \{0, 1\}^{n \times n}$ is the adjacency matrix, and $A_{ij} = 1$ if and only if there exists an edge between nodes $i$ and $j$. $X = [x_1, \cdots, x_n]^T \in \mathbb{R}^{n \times d}$ is the node feature matrix, where each $x_i \in \mathbb{R}^d$ is the $d$-dimensional feature vector of node $i$. $S \in \{0, 1\}^n$ is the vector containing sensitive attributes (e.g., gender or race) of nodes that should not be captured by machine learning models to make decisions. Our target is to learn a fair graph representation model $f : (A, X) \to H \in \mathbb{R}^{n \times d'}$, and the learned representation $H = f(A, X)$ is fed into a classification model $\theta : H \to \hat{Y} \in \{0, 1\}^n$ to predict the binary label of nodes in $G$. Particularly, for an ideal fair model $f$, the output representation $H$ should result in a prediction $\hat{Y}$ that satisfies the fairness criteria. In general, there exist several different definitions of fairness criteria, including group fairness (Dwork et al., 2012; Rahmattalabi et al., 2019; Jiang et al., 2022b), individual fairness (Kang et al., 2020; Dong et al., 2021; Petersen et al., 2021), and counterfactual fairness (Agarwal et al., 2021; Ma et al., 2022). In this work, we focus on group fairness, which is defined as

$$P(\hat{Y}_i \mid S_i = 0) = P(\hat{Y}_i \mid S_i = 1), \quad i = 1, \ldots, n, \tag{1}$$

where $\hat{Y}_i$ is the prediction for node $i$, and $S_i$ is the sensitive attribute of node $i$. Note that even though the sets of node attributes or features in $X$ and $S$ are disjoint, correlations may exist between $(A, X)$ and $S$. Hence, even if $S$ is not explicitly exposed to $f$, $f$ may implicitly infer parts of $S$ from $(A, X)$ and produce biased representation $H$, thereby making the prediction $\hat{Y}$ unfair. How to prevent models from intentionally fitting these correlations is the central problem to be solved in achieving fair graph representation learning. Currently, several studies have proposed different strategies to achieve fair graph representation learning.
An early study (Rahman et al., 2019) proposes to train the model through fair random walks. Some recent studies (Li et al., 2020; Laclau et al., 2021) propose to reduce prediction discrimination through optimizing adjacency matrices, which can improve fairness for link prediction tasks. In addition, adversarial learning is another popular strategy to achieve fairness on node representation learning tasks. Many studies (Fisher et al., 2020; Dai & Wang, 2021; Bose & Hamilton, 2019) adopt adversarial learning to filter out sensitive attribute information from the learned node representations. Overall, most existing methods learn fair representations via altering model training strategy with fairness regularization. However, a primary issue in fairness learning lies in the fact that training data usually possess bias. Hence, an alternative and highly desirable solution is to modify data through data augmentations, thus enabling models to learn fair representations easily. In this work, we design a learnable graph augmentation method to reduce bias in graph data, leading to more effective fairness-aware representation learning on graphs.

2.2. GRAPH DATA AUGMENTATIONS

Inspired by the success of data augmentations in computer vision and natural language processing, graph data augmentation (Zhao et al., 2022) attracts increasing attention in academia. Most studies (You et al., 2020; Zhu et al., 2020; Wang et al., 2021; Veličković et al., 2019; You et al., 2021; Rong et al., 2020) are based on uniformly random modifications of graph adjacency matrices or node features, such as masking node features, dropping edges, or cropping subgraphs. In addition, recent studies (Luo et al., 2023; Zheng et al., 2020; Luo et al., 2021; Zhao et al., 2021; Chen et al., 2020) design learnable data augmentation methods to enhance task-relevant information in augmented graphs. Note that none of the above methods are fairness-aware and only a few studies have investigated fairness-aware graph augmentations. Spinelli et al. (2021)

3. FAIRNESS VIA AUTOMATED DATA AUGMENTATIONS

While previous fairness-aware graph data augmentations all rely on manually defined and fixed fairness-relevant augmentation strategies, we explore a more adaptive and effective method to discover fairness-aware graph augmentations by automated augmentation models. Note that though automated graph augmentations have been applied to some graph representation tasks (Luo et al., 2023; 2021; Zhao et al., 2021), they have not been studied in fair graph representation learning. In this work, we propose Graphair, an automated graph augmentation method for fair graph representation learning. Graphair uses an automated augmentation model to generate new graphs with fair topology structures and node features while preserving the most informative components from input graphs. The augmentation model is trained end-to-end with multiple optimization objectives in order to circumvent sensitive information while retaining other useful information simultaneously.
To the best of our knowledge, Graphair is the first automated graph augmentation method addressing group fairness with a theoretical guarantee of fairness and informativeness.

3.1. AUTOMATED GRAPH AUGMENTATIONS

We first present the details of the augmentation process. Given an input graph $G = \{A, X, S\}$, we use the automated augmentation model $g$ to generate a new graph $G' = \{A', X', S\}$ as

$$T_A, T_X = g(A, X), \quad A' = T_A(A), \quad X' = T_X(X). \tag{2}$$

Here, $T_A$ is the edge perturbation transformation, which maps $A$ to the new adjacency matrix $A'$ by removing existing edges and adding new edges. $T_X$ is the node feature masking transformation, which produces the new node feature matrix $X'$ by setting some values of $X$ to zero. $T_A$ and $T_X$ contain the exact transformations for each edge and node feature in $G$. In other words, the augmentation model $g$ decides whether there is an edge connecting any two nodes in $G$ and whether each value in $X$ should be set to zero or not. In the augmentation model $g$, a GNN-based augmentation encoder $g_{enc} : (A, X) \to Z \in \mathbb{R}^{n \times d_r}$ is first used to extract $d_r$-dimensional embeddings $Z$ for nodes in $G$. We adopt a graph convolutional network (GCN) (Kipf & Welling, 2017) as the GNN encoder here. Afterward, the exact transformations for each edge and node feature are performed as described below.

Edge perturbation. Given the embedding $Z$, a multi-layer perceptron (MLP) model $\mathrm{MLP}_A$ first computes the hidden embeddings $Z_A \in \mathbb{R}^{n \times d_r'}$ from $Z$, then an inner-product decoder computes the edge probability matrix $\tilde{A}' \in \mathbb{R}^{n \times n}$, where the value $\tilde{A}'_{ij}$ at the $i$-th row, $j$-th column of the matrix $\tilde{A}'$ denotes the predicted probability that an edge exists between the nodes $i$ and $j$ in $G'$. Finally, the output adjacency matrix $A'$ is obtained by sampling from the Bernoulli distribution parameterized with the probabilities in $\tilde{A}'$. Formally, this process can be described as

$$Z_A = \mathrm{MLP}_A(Z), \quad \tilde{A}' = \sigma(Z_A Z_A^T), \quad A'_{ij} \sim \mathrm{Bernoulli}(\tilde{A}'_{ij}) \ \text{for } i, j = 1, \cdots, n, \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function.

Node feature masking. Given the embedding $Z$, an MLP model $\mathrm{MLP}_X$ first computes the mask probability matrix $\tilde{M} \in \mathbb{R}^{n \times d}$, where the value $\tilde{M}_{ij}$ at the $i$-th row, $j$-th column of the matrix $\tilde{M}$ denotes the predicted probability that the $j$-th feature of node $i$ is not set to zero. Afterward, the mask matrix $M$ is sampled from the Bernoulli distribution parameterized with the probabilities in $\tilde{M}$, and the new feature matrix $X'$ is obtained by multiplying $X$ element-wise by $M$. This process can be formally described as

$$Z_X = \mathrm{MLP}_X(Z), \quad \tilde{M} = \sigma(Z_X), \quad M_{ij} \sim \mathrm{Bernoulli}(\tilde{M}_{ij}) \ \text{for } i = 1, \cdots, n, \ j = 1, \cdots, d, \quad X' = M \odot X, \tag{4}$$

where $\odot$ is the Hadamard product, and $\sigma(\cdot)$ is the sigmoid function.

Note that the Bernoulli sampling steps for the adjacency matrix $A'$ and the mask matrix $M$ are non-differentiable. To make the augmentation model $g$ end-to-end trainable, we adopt a commonly used trick to approximate the Bernoulli sampling in Eq. (3) and (4). Specifically, we relax the Bernoulli sampling procedure by the Gumbel-Softmax reparameterization trick (Jang et al., 2017; Maddison et al., 2017; 2014). Given a probability $P$ computed from a parameterized model $\varphi$, the relaxed Bernoulli sampling calculates a continuous approximation $\hat{P} = \frac{1}{1 + \exp(-(\log P + G)/\tau)}$, where $\tau$ is a temperature hyperparameter and $G \sim \mathrm{Gumbel}(0, 1)$ is a random variable sampled from the standard Gumbel distribution. For the forward propagation, the discrete value $\tilde{P} = \lfloor \hat{P} + \frac{1}{2} \rfloor$ is used as the result sampled from the Bernoulli distribution with the probability $P$. For the backward propagation, a straight-through gradient estimator (Bengio et al., 2013) is used, which approximates the gradient as $\nabla_\varphi \tilde{P} \approx \nabla_\varphi \hat{P}$.
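The forward pass of the edge perturbation step can be illustrated with a minimal sketch, assuming that the hidden embeddings are already available as a plain list of rows. The function names and the toy embeddings below are hypothetical and are not taken from the Graphair implementation. The straight-through gradient is not shown, since it requires an automatic differentiation framework.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def edge_probabilities(z_a):
    """Inner-product decoder: sigmoid(Z_A Z_A^T) for a list-of-rows Z_A."""
    n = len(z_a)
    return [[sigmoid(sum(a * b for a, b in zip(z_a[i], z_a[j])))
             for j in range(n)] for i in range(n)]


def relaxed_bernoulli(p, tau, rng):
    """Continuous relaxation sigmoid((log p + G) / tau) with G ~ Gumbel(0, 1)."""
    p = max(p, 1e-12)                            # avoid log(0)
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    g = -math.log(-math.log(u))                  # standard Gumbel sample
    return sigmoid((math.log(p) + g) / tau)


def hard_sample(p_relaxed):
    """Forward-pass discretization floor(p_relaxed + 1/2), which is 0 or 1."""
    return math.floor(p_relaxed + 0.5)


# Toy hidden embeddings Z_A for three nodes (illustrative values only).
rng = random.Random(0)
z_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = edge_probabilities(z_a)
relaxed = [[relaxed_bernoulli(p, 0.5, rng) for p in row] for row in probs]
adj_aug = [[hard_sample(v) for v in row] for row in relaxed]
```

In this sketch every entry is sampled independently, so the sampled matrix is not forced to be symmetric. The same two sampling functions apply unchanged to the mask probabilities used for node feature masking.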

3.2. ADVERSARIAL TRAINING

As our objective is to generate fair augmentations to reduce bias, the ideal augmentation model $g$ should satisfy the fairness property. In other words, it should assign low probabilities to graph elements (edges, node features) that cause prediction bias. However, we cannot achieve it via supervised training because there is no ground truth indicating which graph elements lead to prediction bias and should be modified. To tackle this issue, we propose to use an adversarial learning based method to implicitly optimize the model to learn to mitigate bias in the input graph. Specifically, we use an adversary model $k : (A', X') \to \hat{S} \in [0, 1]^n$ to predict the sensitive attribute $S$ from the new adjacency matrix $A'$ and new node feature matrix $X'$ generated by the augmentation model $g$. The adversary model $k$ and the augmentation model $g$ are jointly trained in an adversarial fashion. In this process, $k$ is optimized to maximize the prediction accuracy of the sensitive attribute, while $g$ is optimized to mitigate bias in $A'$ and $X'$ so that it is difficult for the adversary model $k$ to identify sensitive attribute information from $A'$ and $X'$. Formally, this adversarial training process can be described as the following optimization problem:

$$\min_g \max_k L_{adv} = \min_g \max_k \frac{1}{n} \sum_{i=1}^{n} \left[ S_i \log \hat{S}_i + (1 - S_i) \log\left(1 - \hat{S}_i\right) \right], \tag{5}$$

where $\hat{S}_i$ is the prediction of the sensitive attribute of node $i$ by the adversary model $k$.
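To make the sign convention of the adversarial objective concrete, the following minimal sketch evaluates it for given adversary predictions. The function name and the clipping constant `eps` are assumptions introduced for illustration. The objective is an average log-likelihood, so it is at most zero: the adversary pushes it toward zero, while the augmentation model pushes it down.

```python
import math


def adversarial_objective(s, s_hat, eps=1e-12):
    """Average log-likelihood of sensitive attributes s under predictions s_hat.

    Larger values mean the adversary recovers the sensitive attribute better.
    """
    total = 0.0
    for s_i, p_i in zip(s, s_hat):
        p_i = min(max(p_i, eps), 1.0 - eps)   # keep the logarithms finite
        total += s_i * math.log(p_i) + (1 - s_i) * math.log(1.0 - p_i)
    return total / len(s)


# An adversary that cannot do better than guessing 0.5 attains log(1/2).
uninformed = adversarial_objective([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
# A confident and correct adversary attains a value close to zero.
confident = adversarial_objective([0, 1, 0, 1], [0.01, 0.99, 0.01, 0.99])
```

When the two sensitive groups are balanced, an augmented graph that reveals nothing about the sensitive attribute leaves even the best adversary at the `uninformed` value.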

3.3. CONTRASTIVE TRAINING

We note that only using the adversarial training may cause the augmentation model $g$ to collapse into trivial solutions. For instance, $g$ may learn to always generate a complete graph and set all node features to zero, which contains no bias, since all nodes are equivalent. Such augmented graphs are not informative at all because they lose all the information from the input graphs. To make the augmentation model $g$ satisfy the informativeness property, i.e., preserving the most informative components of the input graph in the generated graphs, we additionally use a contrastive learning objective during training. Given the input graph $G = \{A, X, S\}$ and the augmented graph $G' = \{A', X', S\}$, we first use a GNN-based representation encoder $f$ to extract node representations $H = f(A, X)$ and $H' = f(A', X')$ from $G$ and $G'$, respectively. Afterward, we optimize the augmentation model $g$ and the representation encoder $f$ jointly by minimizing a contrastive objective, which maximizes the similarity between the representations of the same node in $H$ and $H'$. Specifically, let $h_i$ and $h'_i$ denote the representation of node $i$ in $H$ and $H'$, respectively. For node $i$, we consider $(h_i, h'_i)$ as a positive pair, and $(h_i, h_j)$ and $(h_i, h'_j)$ for any node $j$ other than $i$ as negative pairs. We define the representation similarity as $\mathrm{sim}(h_i, h'_j) = c(t(h_i), t(h'_j))$, where $c$ is the cosine similarity and $t$ is a non-linear projection implemented with a two-layer MLP model. We follow Zhu et al. (2020) to define the contrastive objective for any positive pair $(h_i, h'_i)$ as

$$l(h_i, h'_i) = -\log \frac{\exp\left(\mathrm{sim}(h_i, h'_i)/\tau\right)}{\sum_{j=1}^{n} \exp\left(\mathrm{sim}(h_i, h'_j)/\tau\right) + \sum_{j=1}^{n} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(h_i, h_j)/\tau\right)},$$

where $\tau$ denotes the temperature parameter, and $\mathbb{1}_{[j \neq i]} \in \{0, 1\}$ is the indicator function whose value is 1 if and only if $j \neq i$.
The overall contrastive objective is computed over the positive pairs $(h_i, h'_i)$ and $(h'_i, h_i)$ for all nodes as

$$L_{con} = \frac{1}{2n} \sum_{i=1}^{n} \left[ l(h_i, h'_i) + l(h'_i, h_i) \right].$$

To prevent the augmentation model $g$ from generating graphs that deviate too much from input graphs, we add a reconstruction-based regularization term to the overall training objective. Specifically, let $L_{BCE}$ and $L_{MSE}$ denote the binary cross-entropy loss and the mean squared error loss, respectively. The regularization term is defined as

$$L_{reconst} = L_{BCE}(A, \tilde{A}') + \lambda L_{MSE}(X, X') = -\sum_{i=1}^{n} \sum_{j=1}^{n} \left[ A_{ij} \log \tilde{A}'_{ij} + (1 - A_{ij}) \log\left(1 - \tilde{A}'_{ij}\right) \right] + \lambda \|X - X'\|_F^2,$$

where $\tilde{A}'$ is the edge probability matrix predicted by the augmentation model, $\lambda$ is a hyperparameter, and $\| \cdot \|_F$ denotes the Frobenius norm of a matrix (Golub & Van Loan, 1996). To sum up, the overall training process can be described as the following min-max optimization procedure,

$$\min_{f,g} \max_k L = \min_{f,g} \max_k \ \alpha L_{adv} + \beta L_{con} + \gamma L_{reconst}, \tag{9}$$

where $\alpha$, $\beta$, $\gamma$ are hyperparameters. The parameters of the augmentation model $g$, the adversary model $k$, and the representation encoder $f$ are jointly optimized with this min-max optimization procedure. In each training step, we first update the parameters of $f$ and $g$ to minimize $L$ while keeping $k$ fixed, then update the parameters of $k$ to maximize $L_{adv}$ while keeping $f$ and $g$ fixed. See Figure 1 for an overview of our proposed Graphair method. The training algorithm is summarized in Appendix B.
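As an illustration of the regularizer and of how the three terms are combined, consider the following sketch. It assumes dense matrices stored as lists of rows and an edge probability matrix as the second argument of the cross-entropy term. All function names and default weights are hypothetical.

```python
import math


def bce_sum(adj, adj_prob, eps=1e-12):
    """Binary cross-entropy between A and edge probabilities, summed over entries."""
    total = 0.0
    for row_a, row_p in zip(adj, adj_prob):
        for a, p in zip(row_a, row_p):
            p = min(max(p, eps), 1.0 - eps)   # keep the logarithms finite
            total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total


def squared_frobenius(x, x_aug):
    """Squared Frobenius norm of X - X'."""
    return sum((a - b) ** 2
               for row_x, row_y in zip(x, x_aug)
               for a, b in zip(row_x, row_y))


def reconstruction_loss(adj, adj_prob, x, x_aug, lam=1.0):
    """Reconstruction regularizer: BCE term plus lam times the feature term."""
    return bce_sum(adj, adj_prob) + lam * squared_frobenius(x, x_aug)


def total_objective(l_adv, l_con, l_reconst, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum alpha * L_adv + beta * L_con + gamma * L_reconst."""
    return alpha * l_adv + beta * l_con + gamma * l_reconst


adj = [[0, 1], [1, 0]]
close_prob = [[0.1, 0.9], [0.9, 0.1]]   # probabilities that agree with adj
far_prob = [[0.9, 0.1], [0.1, 0.9]]     # probabilities that contradict adj
x = [[1.0, 2.0], [3.0, 4.0]]
x_masked = [[1.0, 0.0], [3.0, 4.0]]     # one feature value set to zero
loss_close = reconstruction_loss(adj, close_prob, x, x)
loss_far = reconstruction_loss(adj, far_prob, x, x)
feature_term = squared_frobenius(x, x_masked)
```

The regularizer grows both when predicted edge probabilities contradict the input adjacency matrix and when feature values are masked out, which discourages augmentations that drift far from the input graph.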

3.4. DISCUSSIONS

Graphair learns different fairness-aware augmentation strategies for different graph datasets by the automated augmentation model, thereby eliminating the negative effect of fixed fairness-relevant augmentation strategies (Spinelli et al., 2021; Agarwal et al., 2021; Kose & Shen, 2022) . In addition, Graphair mitigates bias by modifying both graph topology structures and node features, while some existing studies (Spinelli et al., 2021) only consider one of them. We demonstrate these advantages through extensive empirical studies in Section 4.2 and 4.3. Furthermore, we show in Section 3.5 and 3.6 that the used training objectives can be theoretically proven to help the augmentation model generate new graphs with fair topology structures and node features, and preserve the most informative components from the input graph simultaneously. Specifically, we use adversarial and contrastive learning to optimize the augmentation model to satisfy the fairness and informativeness properties, respectively.

3.5. THEORETICAL ANALYSIS OF FAIRNESS

Following Madras et al. (2018), we quantify the unfairness of a classifier $d : (A', X') \to [0, 1]^n$ using demographic parity distance. Given a graph $G' = \{A', X', S\}$, let $\hat{Y} = d(A', X') \in [0, 1]^n$ denote the prediction of the classifier $d$, where $\hat{Y}_i$ is the prediction for node $i$. The demographic parity distance is defined as $\Delta_{DP}(d) \triangleq |\mathbb{E}_{i \sim S_0}(\hat{Y}_i) - \mathbb{E}_{j \sim S_1}(\hat{Y}_j)|$, where $S_0$ and $S_1$ denote the set of nodes whose sensitive attributes are 0 and 1, respectively. Note that $\Delta_{DP}(d) = 0$ if $\hat{Y} \perp S$, i.e., the group fairness discussed in Section 2.1 is satisfied. The following theorem shows that minimizing the optimal adversarial loss is equivalent to minimizing the unfairness of the classifier $d$, so minimizing the performance of the adversary model can indeed encourage the augmentation model to generate fair graphs.

Theorem 1. Let $G'$, $k$, $S$ be defined as above. For any downstream task, we consider a classifier $d : (A', X') \to \hat{Y} \in [0, 1]^n$ predicting label $Y \in \{0, 1\}^n$ using $G'$ as input. Assume the adversarial loss for each sample is bounded, i.e., there exists a constant $M$ such that $|S_i \log \hat{S}_i + (1 - S_i) \log(1 - \hat{S}_i)| \le M$ holds for each sample. Then the demographic parity distance $\Delta_{DP}(d)$ is bounded by the optimal adversarial objective value $L_{adv}^*$, i.e.,

$$L_{adv}^* \ge \frac{n' M}{n(1 - e^{M})} \Delta_{DP}(d) - \frac{n' M}{n(1 - e^{-M})},$$

where $n'$ represents the maximal number of samples with the same sensitive attributes. Detailed proof of this theorem is given in Appendix A.1.
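The demographic parity distance defined above can be computed directly from predictions and sensitive attributes. The following is a minimal sketch with a hypothetical function name, assuming that both sensitive groups are non-empty.

```python
def demographic_parity_distance(y_hat, s):
    """Absolute gap between the mean predictions of the two sensitive groups."""
    group0 = [y for y, a in zip(y_hat, s) if a == 0]
    group1 = [y for y, a in zip(y_hat, s) if a == 1]
    if not group0 or not group1:
        raise ValueError("both sensitive groups must be non-empty")
    return abs(sum(group0) / len(group0) - sum(group1) / len(group1))


# Predictions fully determined by the sensitive attribute: maximal gap.
dp_biased = demographic_parity_distance([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1])
# Predictions with the same mean in both groups: no gap.
dp_fair = demographic_parity_distance([1.0, 0.0, 1.0, 0.0], [0, 0, 1, 1])
```

Because the predictions lie in $[0, 1]$, the distance also lies in $[0, 1]$, with 0 attained when both groups receive the same average prediction.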

3.6. THEORETICAL ANALYSIS OF INFORMATIVENESS

We use mutual information to quantify the amount of information obtained about one random variable by observing another random variable. We show in the following theorem that minimizing the contrastive loss $L_{con}$ is equivalent to maximizing a lower bound of the mutual information $I(G; G')$ between the original graph $G$ and the augmented graph $G'$, thus achieving informativeness.

Theorem 2. Let $G$, $G'$, $H$ and $H'$ be defined as above. The negative of our contrastive objective is a lower bound of the mutual information between the input graph $G$ and the augmented graph $G'$. Formally,

$$-L_{con} \le I(G; G').$$

Detailed proof of this theorem is given in Appendix A.2.

3.7. COMPLEXITY ANALYSIS

Graphair shares the same time and space complexity as the GNN architecture of the representation encoder f during inference because only f is used to compute node representations. During training, 

4. EXPERIMENTS

In this section, we evaluate Graphair on three real-world datasets: NBA, Pokec-z, and Pokec-n. More details on datasets are given in Appendix F.1. Experimental results show that Graphair outperforms many baselines on node classification tasks in terms of both fairness and accuracy. To gain insights from learned fair graph data, we provide a comprehensive analysis of the learned fair graph topology structures and fair node features. Our analysis results are consistent with previous studies (Spinelli et al., 2021; Kose & Shen, 2022; Jiang et al., 2022a; Dai & Wang, 2021). We also provide runtime experiments and hyperparameter studies in Appendix D.

4.1. EXPERIMENTAL SETTINGS

Evaluation metrics. We use accuracy to evaluate prediction performance of node classification tasks. To quantify group fairness, we follow studies (Louizos et al., 2016; Beutel et al., 2017) and use demographic parity (DP) and equal opportunity (EO) as fairness metrics.

Baselines. We compare our method with the following baseline methods: (1) Fairwalk (Rahman et al., 2019), a fairness-aware random walk (Grover & Leskovec, 2016) for unsupervised node representation learning; (2) GRACE (Zhu et al., 2020), deep graph contrastive representation learning with uniform random graph augmentations; (3) GCA (Zhu et al., 2021), graph contrastive learning with adaptive augmentations; (4) NIFTY (Agarwal et al., 2021), the first graph contrastive learning method with fairness-aware graph augmentations, of which we only use the unsupervised component to learn node representations; (5) FairDrop (Spinelli et al., 2021), a heuristic edge dropping method to enhance fairness in graph representation learning; (6) FairAug (Kose & Shen, 2022), an adaptive data augmentation method for fair node representation learning. We adopt FairDrop and FairAug in the contrastive learning framework of GRACE to learn node representations, since they are both graph augmentation methods. The optimal hyperparameters for all methods are obtained by grid search.

Evaluation protocol. We use an evaluation protocol following Veličković et al. (2019)

4.2. EXPERIMENTAL RESULTS

Fairness and accuracy performance. Table 1 shows accuracy, demographic parity, and equal opportunity metrics of our proposed Graphair, compared with baselines in Section 4.1 on the three real-world datasets. From the results, we have the following observations:

• Our proposed Graphair consistently achieves the best fairness performance in terms of demographic parity and equal opportunity on evaluated datasets. For example, compared with GRACE, our method reduces demographic parity by 65.8%, 67.2% and 81.2% on NBA, Pokec-z, and Pokec-n datasets, respectively, with comparable accuracy performance.

• Fairness-aware augmentation methods (e.g., FairDrop, NIFTY, and FairAug) have lower prediction bias compared to GRACE and GCA. It is worth noting that these heuristic augmentation methods targeting manually designed fair graph properties may not consistently achieve state-of-the-art performance for all datasets due to diverse graph data. Specifically, FairDrop outperforms NIFTY on Pokec-n dataset, while NIFTY outperforms FairDrop on NBA and Pokec-z datasets. To this end, Graphair can automatically learn to discover fairness-aware augmentations on different graph datasets and thus outperforms all these fairness-aware methods in terms of demographic parity and equal opportunity on all three datasets.

Trade-off between accuracy and fairness. We further compare the accuracy-fairness trade-off performance of Graphair with several baselines. We choose demographic parity as the fairness metric. Figure 2 shows the Pareto front curves generated by a grid search of hyperparameters for each method. The upper-left corner point represents the ideal performance, i.e., highest accuracy and lowest prediction bias. The results show that Graphair achieves the best ACC-DP trade-off compared with all fairness-aware baselines on three datasets.

4.3. ABLATION STUDIES

Graphair considers two graph transformations to mitigate bias in node features and graph topology structures. In this subsection, we conduct ablation studies to investigate the contributions of the two graph transformations and demonstrate the advantages of Graphair. In other words, we investigate whether both fair node features and fair graph topology structures enhance prediction fairness (i.e., lower DP and EO). Specifically, we remove node feature masking, denoted as "Graphair w/o FM", and remove edge perturbation, denoted as "Graphair w/o EP". Table 2 shows that Graphair outperforms both "Graphair w/o FM" and "Graphair w/o EP" in terms of demographic parity and equal opportunity on all three datasets. Experimental results demonstrate that both fair node features and fair graph topology structures are beneficial to mitigating prediction bias. Methods only considering either node features or graph topology (e.g., FairDrop) are not promising due to the limited graph transformation space.

4.4. ANALYSIS OF FAIR VIEW

In this subsection, we study the properties of fair graph data generated by Graphair from graph topology and node features perspectives. Firstly, we introduce the node-wise sensitive homophily coefficient to characterize the distribution of sensitive attributes from the neighborhood. Given a graph $G = \{A, X, S\}$, the node-wise sensitive homophily coefficient for node $i$, denoted as $\epsilon_i$, represents the proportion of neighbors with the same sensitive attributes, i.e.,

$$\epsilon_i = \frac{\sum_{j=1}^{n} A_{ij} \mathbb{1}_{[s_i = s_j]}}{\sum_{j=1}^{n} A_{ij}},$$

where $n$ is the number of nodes, and $\mathbb{1}_{[s_i = s_j]}$ is the indicator function evaluating to 1 if and only if $s_i = s_j$. Subsequently, we analyze the learned fair graph topology via node sensitive homophily distribution compared with the original graph topology. Figure 3 shows that the learned fair graph topology reduces average node sensitive homophily compared to the original graph topology. Such observation is consistent with several previous studies (Spinelli et al., 2021; Kose & Shen, 2022; Jiang et al., 2022a; Dai & Wang, 2021) that high node sensitive homophily values lead to prediction bias. Additionally, we analyze the learned fair node features via Spearman correlation (Zwillinger & Kokoska, 1999) between the sensitive attribute and non-sensitive features. Note that fair node features should have low Spearman correlation values. Figure 4 shows the top-10 Spearman correlation values in the original graph data. We can see that the learned fair node features reduce Spearman correlation values compared to the original node features, thus preventing models from fitting the correlations as discussed in Section 2.1. These analysis results demonstrate that our method Graphair can automatically learn to generate fair graph data without prior knowledge of fairness-relevant graph properties.
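The node-wise sensitive homophily coefficient can be computed as in the sketch below, which assumes a dense 0/1 adjacency matrix stored as a list of rows. The function name is hypothetical, and returning `None` for isolated nodes is an assumption, since the coefficient is undefined when a node has no neighbors.

```python
def sensitive_homophily(adj, s):
    """Fraction of each node's neighbors that share its sensitive attribute."""
    coefficients = []
    for i, row in enumerate(adj):
        degree = sum(row)
        if degree == 0:
            coefficients.append(None)   # undefined for isolated nodes
            continue
        same = sum(a for j, a in enumerate(row) if s[j] == s[i])
        coefficients.append(same / degree)
    return coefficients


# Toy graph with edges 0-1, 0-2, 2-3 and an isolated node 4.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
s = [0, 0, 1, 1, 0]
eps = sensitive_homophily(adj, s)
```

Nodes 0 and 2 each have one neighbor from their own group and one from the other group, so their coefficients are 0.5, while nodes 1 and 3 only connect within their own group.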

5. CONCLUSIONS

In this work, we propose Graphair, an automated graph augmentation method for fair representation learning. Graphair uses an automated augmentation model to generate new graphs with fair topology structures and node features, while preserving the most informative components from input graphs. We adopt adversarial learning and contrastive learning to achieve fairness and informativeness simultaneously in the augmented data. Experimental results demonstrate that Graphair consistently outperforms state-of-the-art baselines on node classification tasks for real-world graph datasets in terms of fairness-accuracy trade-off performance. In the future, we would like to improve the efficiency of Graphair and extend Graphair to the case where only limited sensitive attribute information is available. 

B TRAINING ALGORITHM FOR GRAPHAIR

We summarize the training algorithm for Graphair and provide the pseudocode in Algorithm 1.

Algorithm 1 Training algorithm

Require: adjacency matrix A, feature matrix X, sensitive attribute S
while not converged do
    Generate a fair view G′ using the augmentation model g
    Obtain node representations H of G using the representation encoder f
    Obtain node representations H′ of G′ using the representation encoder f
    Compute L by Eq. (9)
    Update f and g by applying stochastic gradient descent to minimize L
    Update the adversary k by applying stochastic gradient ascent to maximize L
end while

C BATCH TRAINING FOR LARGE GRAPHS

Because Graphair has a space complexity of $O(n^2)$ in the full-batch setting, it is expensive to train Graphair on large graph datasets. To reduce space complexity, we adopt the graph sampling-based batch training method proposed by Zeng et al. (2020) to perform mini-batch training. Specifically, we construct a subgraph via a random walk sampler for each batch. Then the subgraph is used as the input of Graphair, and the augmentation model $g$ generates a fair view of the subgraph. Both adversarial training and contrastive training are performed on the subgraph. The normalization techniques of Zeng et al. (2020) are also used to eliminate biases caused by subgraph sampling. Note that the generated fair view for each batch might be subtly different from the one in the full-batch setting, because the augmentation model can only modify edges inside subgraphs in the mini-batch setting. Nevertheless, this difference is expected to be small when the batch size is large enough.
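The subgraph construction step can be illustrated by the following simplified sketch of a random walk sampler. It is not the implementation of Zeng et al. (2020) and omits their normalization techniques. The function name, the adjacency-list input format, and the toy path graph are assumptions made for illustration.

```python
import random


def random_walk_subgraph(neighbors, num_roots, walk_length, rng):
    """Sample root nodes, walk from each root, and return the induced subgraph.

    neighbors maps every node id to the list of its neighbor ids.
    Returns the sorted sampled node ids and their dense 0/1 adjacency matrix.
    """
    roots = rng.sample(sorted(neighbors), num_roots)
    visited = set(roots)
    for root in roots:
        current = root
        for _ in range(walk_length):
            if not neighbors[current]:
                break                       # dead end: stop this walk
            current = rng.choice(neighbors[current])
            visited.add(current)
    nodes = sorted(visited)
    index = {node: pos for pos, node in enumerate(nodes)}
    sub_adj = [[0] * len(nodes) for _ in nodes]
    for u in nodes:
        for v in neighbors[u]:
            if v in index:                  # keep only edges inside the sample
                sub_adj[index[u]][index[v]] = 1
    return nodes, sub_adj


# Toy path graph 0-1-2-3-4-5 stored as adjacency lists.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
nodes, sub_adj = random_walk_subgraph(graph, num_roots=2, walk_length=3,
                                      rng=random.Random(0))
```

Each batch holds at most `num_roots * (walk_length + 1)` nodes, so the quadratic cost of the edge probability matrix applies to the subgraph only.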

D.1 RUNNING TIME COMPARISON

We provide the running time comparison in Figure 5 for Graphair and the baselines. We do not include FairWalk because the implementation we used does not use a GPU to accelerate the training process. To achieve a fair comparison, we train all models for 500 epochs and report the average running time over 5 runs. When performing batch training on the Pokec-z and Pokec-n datasets, we use a random walk sampler with 1000 root nodes and walk length 3. Note that FairAug proposes a fairness-aware graph sampling operation, so we use it to sample subgraphs with 3000 nodes instead of using a random walk sampler. Figure 5 shows that Graphair has a longer running time than the baselines. This is not surprising because all the baselines rely on fixed augmentation strategies and do not need a learnable augmentation model.

F.2 IMPLEMENTATION DETAILS

For Graphair, we adopt two-layer GCN models as the adversary model k and the augmentation encoder g_enc, and a three-layer GCN model as the representation encoder f. We use 64 as the hidden dimension in all three models. For the augmentation model, we use MLP models with 2 layers, a hidden size of 64, and ReLU as the non-linear activation function for MLP_A and MLP_X. The hyperparameter β is set to 1, and the hyperparameters α, γ, and λ are determined with a grid search among {0.1, 1, 10}. For a fair comparison, we use three-layer GCN models for all baselines except FairWalk. The dimension of the node representations is set to 64 for all datasets. We run the experiments 5 times and report the average performance for each method. We train the models for 500 epochs using the Adam optimizer with a learning rate of 1 × 10^{-4} and a weight decay of 1 × 10^{-5}. For the results in Table 1, we select the hyperparameters with the highest accuracy. For the classifier used for evaluation, we use an MLP model with 2 layers, a hidden size of 128, and ReLU as the non-linear activation function. The classifier is trained for 500 epochs using the Adam optimizer with a learning rate of 1 × 10^{-3} and a weight decay of 1 × 10^{-5}.
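The grid search over α, γ, and λ described above covers 3 × 3 × 3 = 27 configurations. A minimal sketch is given below; `dummy_evaluate` is a hypothetical stand-in for training Graphair and returning an accuracy score, and its scoring rule is invented purely so that the example runs.

```python
from itertools import product

GRID = (0.1, 1, 10)   # candidate values for alpha, gamma, and lambda
BETA = 1              # beta is fixed to 1


def grid_search(evaluate):
    """Return the (alpha, gamma, lambda) triple with the highest score."""
    best_score, best_config = float("-inf"), None
    for alpha, gamma, lam in product(GRID, repeat=3):
        score = evaluate(alpha=alpha, beta=BETA, gamma=gamma, lam=lam)
        if score > best_score:
            best_score, best_config = score, (alpha, gamma, lam)
    return best_config, best_score


def dummy_evaluate(alpha, beta, gamma, lam):
    """Hypothetical placeholder for 'train the model, return accuracy'."""
    return -abs(alpha - 1) - abs(gamma - 10) - abs(lam - 0.1)


best_config, best_score = grid_search(dummy_evaluate)
```

In practice `evaluate` would wrap a full training and evaluation run, so the 27 configurations dominate the tuning cost.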

G EXPERIMENTS ON A SYNTHETIC GRAPH DATASET

To further validate the scalability of Graphair on larger graphs, we conduct experiments on a large synthetic graph dataset with 1,000,000 nodes. The synthetic dataset is generated as follows. We assume that the sensitive attribute is binary and randomly assign 0 or 1 to each node with equal probability. For node features, we use a Gaussian mixture model to generate biased two-dimensional node features, so that the feature distributions of the two sensitive groups differ. Specifically, we use Gaussian distributions N(µ_1, Σ) and N(µ_2, Σ) to generate node features for nodes with sensitive attributes 0 and 1, respectively, where µ_1 = [0, 1], µ_2 = [1, 0], and Σ = [[1, 0], [0, 2]]. For the adjacency matrix, we randomly generate edges via a stochastic block model. Since nodes with the same sensitive attribute are more likely to be connected in social networks, we generate edges with a higher intra-group and a lower inter-group connection probability. Specifically, we set the probability of connecting two nodes with the same and different sensitive attributes to 1 × 10^{-3} and 1 × 10^{-4}, respectively. For label generation, we intentionally make the labels correlated with the sensitive attributes. Specifically, for the label of each node, we set 0 as the threshold value and create a binary label based on the second dimension of the node features. Then we add noise to the labels by randomly flipping 20% of the labels to the other class. Since the node representations are learned in an unsupervised manner, we use a small portion of labeled data to train the classifier. In other words, we randomly split the nodes 10%/10%/80% for training, validating, and testing the classifier.

Table 4:

Method      Accuracy        ∆DP              ∆EO
GRACE       60.41 ± 0.04    47.06 ± 4.20     47.01 ± 4.25
FairDrop    59.05 ± 9.47    29.62 ± 12.88    29.66 ± 11.88
NIFTY       56.38 ± 8.92    29.24 ± 12.16    29.16 ± 12.16
FairAug     60.42 ± 0.02    43.40 ± 1.01     43.33 ± 1.00
Graphair    60.36 ± 0.02    15.56 ± 9.01     15.57 ± 8.99
We use the same hyperparameters for model training and the same architecture as discussed in Appendix F.2. We run the experiments 3 times and report the average performance for each method. We use the mini-batch training discussed in Appendix C on this large synthetic graph dataset. Table 4 shows the accuracy, demographic parity, and equal opportunity metrics of our proposed Graphair, compared with GRACE, FairDrop, NIFTY, and FairAug. The results show that fairness-aware augmentation methods have lower prediction bias than the uniform random augmentation method (i.e., GRACE). In addition, Graphair achieves the best fairness performance in terms of demographic parity and equal opportunity on this large synthetic dataset. These results demonstrate the scalability of our Graphair on large graph datasets.
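The synthetic data generation procedure described in this appendix can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' generator: the function name is hypothetical, the quadratic edge loop is only practical for small n, the label is assumed to be 1 when the second feature exceeds 0, and the 20% noise is implemented by flipping exactly 20% of the nodes chosen at random.

```python
import math
import random


def make_synthetic_graph(n, p_intra=1e-3, p_inter=1e-4, flip_rate=0.2, seed=0):
    """Generate sensitive attributes, features, edges, and noisy labels."""
    rng = random.Random(seed)
    means = {0: (0.0, 1.0), 1: (1.0, 0.0)}   # mu_1 for s = 0, mu_2 for s = 1
    stds = (1.0, math.sqrt(2.0))              # Sigma = diag(1, 2), so std = sqrt(var)

    sens = [rng.randint(0, 1) for _ in range(n)]
    feats = [tuple(rng.gauss(means[s][d], stds[d]) for d in range(2)) for s in sens]

    # Stochastic block model: O(n^2) loop, adequate only for small n.
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_intra if sens[i] == sens[j] else p_inter
            if rng.random() < p:
                edges.append((i, j))

    # Threshold the second feature dimension at 0, then flip a share of labels.
    labels = [1 if x[1] > 0 else 0 for x in feats]
    for i in rng.sample(range(n), int(flip_rate * n)):
        labels[i] = 1 - labels[i]
    return sens, feats, edges, labels


# Small example; probabilities are scaled up so that a 500-node graph has edges.
sens, feats, edges, labels = make_synthetic_graph(n=500, p_intra=0.05, p_inter=0.005)
```

Because µ_1 and µ_2 differ in the second dimension, thresholding that dimension makes the labels correlate with the sensitive attribute, which is the bias the experiment is designed to contain.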



Here we use negative binary cross-entropy loss, so the adversary model k aims to maximize L_adv.

We adopt the graph mini-batch training method proposed by Zeng et al. (2020) on the Pokec-z and Pokec-n datasets.



Figure 1: An overview of our framework.

to adopt demographic parity ∆DP = |P(Ŷ = 1 | S = 0) − P(Ŷ = 1 | S = 1)| and equal opportunity ∆EO = |P(Ŷ = 1 | S = 0, Y = 1) − P(Ŷ = 1 | S = 1, Y = 1)|, where Y and Ŷ denote ground-truth labels and predictions, respectively. Note that lower ∆DP and ∆EO indicate better fairness performance.
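The two fairness metrics can be computed directly from binary predictions, labels, and sensitive attributes. The sketch below follows the definitions above; the function names and the toy data are illustrative assumptions, and the code assumes every conditioning group is non-empty.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among entries selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)   # raises ZeroDivisionError on an empty group


def demographic_parity(preds, sens):
    """|P(Yhat=1 | S=0) - P(Yhat=1 | S=1)| for binary predictions."""
    rate0 = positive_rate(preds, [s == 0 for s in sens])
    rate1 = positive_rate(preds, [s == 1 for s in sens])
    return abs(rate0 - rate1)


def equal_opportunity(preds, labels, sens):
    """|P(Yhat=1 | S=0, Y=1) - P(Yhat=1 | S=1, Y=1)|."""
    rate0 = positive_rate(preds, [s == 0 and y == 1 for s, y in zip(sens, labels)])
    rate1 = positive_rate(preds, [s == 1 and y == 1 for s, y in zip(sens, labels)])
    return abs(rate0 - rate1)


preds = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
sens = [0, 0, 0, 0, 1, 1, 1, 1]
dp = demographic_parity(preds, sens)          # 0.5 vs 0.25 positive rate -> 0.25
eo = equal_opportunity(preds, labels, sens)   # 0.5 vs 0.5 true positive rate -> 0.0
```

In this toy example the classifier satisfies equal opportunity exactly while still violating demographic parity, which shows that the two metrics measure different things.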

Figure 3: Node sensitive homophily distributions in the original and the fair graph data. Panels: NBA, Pokec_z, Pokec_n.

Figure 5: The running time comparison.

Figure 8: Comparison of different fairness-aware augmentation methods on the NBA dataset.

Comparisons between our method and baselines on node classification tasks in terms of accuracy and fairness. The best results are shown in bold.

Graphair computes the adjacency matrix A′ and the pairwise similarity in the contrastive loss, thus having a space complexity of O(n^2), where n is the number of nodes. Fortunately, we can easily adopt the graph sampling-based batch training method proposed by Zeng et al. (2020) to perform mini-batch training and reduce the space complexity to O(m^2), where m is the batch size. More details on mini-batch training are given in Appendix C.

Comparisons among different components in the augmentation model.

± 0.45 2.56 ± 0.41 4.64 ± 0.17 68.17 ± 0.08 2.10 ± 0.17 2.76 ± 0.19 67.43 ± 0.25 2.02 ± 0.40 1.62 ± 0.47

classifier. The test accuracy and fairness of this classifier are used as the proxy for the quality of the learned representation H.

The statistics of datasets.

The statistics of the datasets are given in Table 3. Note that we use the mini-batch training discussed in Appendix C on the Pokec-z and Pokec-n datasets to reduce computational complexity.

Comparisons between our method and baselines on the synthetic dataset in terms of accuracy and fairness. The best results are shown in bold.

ACKNOWLEDGMENTS

The work was supported, in part, by NSF (IIS-1939716, IIS-1900990, and IIS-2006861), and Cisco Research. The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies. We would also like to thank the anonymous reviewers for their helpful feedback.

A PROOFS OF THEOREMS

A.1 PROOF OF THEOREM 1

Proof. Let Ŷ = d(G′) ∈ [0, 1]^n denote the prediction of the classifier d, and let Ŷ_i be the prediction for node i. Suppose without loss of generality that E_{i∼S_1}[Ŷ_i] ≤ E_{j∼S_0}[Ŷ_j]. Then, we have

[...]

We assume that the adversarial loss of each sample is bounded, i.e., |log(Ŝ_i)| ≤ M for a sample with sensitive attribute S = 1, and |log(1 − Ŝ_i)| ≤ M for a sample with sensitive attribute S = 0. Based on the concavity of the log(·) function and Jensen's inequality, for any

[...]

Note that a higher prediction indicates a lower sensitive attribute value since [...] We consider an adversary model k that outputs the opposite of the classifier, and let [...] denote the prediction of the adversary model k. Then, we have

[...]

where inequality (a) holds since n′ = max(|S_0|, |S_1|), and inequality (b) holds due to Eq. (12). An optimal adversary model k* should do at least as well as any arbitrary choice of k, and therefore we have

[...]

A.2 PROOF OF THEOREM 2

Proof. According to Zhu et al. (2020), the contrastive objective −L_con is a lower bound of the true mutual information between H and H′, i.e.,

−L_con ≤ I(H; H′). (14)

According to the data processing inequality, we have I(U; V) ≥ I(U; W) for a Markov chain U → V → W, where U, V, W are random variables. The representations H and H′ are extracted from the original graph G and the fair view G′, respectively, and they form a Markov chain H → G → G′ → H′, since H and (A′, X′) are conditionally independent given (A, X). The Markov chain leads to the following inequality,

I(H; H′) ≤ I(G; G′). (15)

Combining Eq. (14) and Eq. (15), we have

−L_con ≤ I(G; G′), (16)

which completes our proof.
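The final step of the proof of Theorem 2 can be spelled out by applying the data processing inequality twice. The intermediate quantity I(G; H′) is not stated in the proof; it is introduced here only as an illustrative reasoning step, assuming that H depends only on G and that H′ depends only on G′.

```latex
\begin{align}
-\mathcal{L}_{\mathrm{con}}
  &\le I(H; H')
  && \text{(contrastive lower bound)} \\
  &\le I(G; H')
  && \text{(data processing on } H' \to G \to H\text{)} \\
  &\le I(G; G')
  && \text{(data processing on } G \to G' \to H'\text{)}
\end{align}
```

Each line uses I(U; V) ≥ I(U; W) for a Markov chain U → V → W: first with (U, V, W) = (H′, G, H), then with (U, V, W) = (G, G′, H′).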

Figure 6: Ablation study on hyperparameter β.

Figure 7: Ablation study on hyperparameter γ.

(Panel titles: NBA, Pokec_z, Pokec_n.)

D.2 HYPERPARAMETER STUDIES

In this subsection, we conduct hyperparameter studies to further investigate the contribution of different components in Graphair. First, we tune hyperparameter β among {0.1, 0.5, 1, 5, 10}. Note that the contrastive loss is indispensable for training the representation encoder f, so we do not consider the case where β = 0. Results in Figure 6 show that changing β leads to a trade-off between fairness and prediction performance. Additionally, we tune hyperparameter γ among {0, 0.1, 0.5, 1, 5, 10}. Results in Figure 7 show that Graphair performs better (higher accuracy and lower demographic parity) with the reconstruction-based regularization term (γ ≠ 0) than without it (γ = 0). This is because the reconstruction loss can prevent the augmentation model g from generating graphs that deviate too much from the input graph.

E VISUALIZATION OF DIFFERENT AUGMENTATION METHODS

In this section, we provide a case study comparing different fairness-aware augmentation methods. In Figure 8, we show the change of a 1-hop ego graph in the NBA dataset. Node 1 is the ego node of this ego graph and the other nodes are the 1-hop neighbors of Node 1. Results in Figure 8 show that FairDrop drops most edges in the original ego graph due to the high sensitive homophily of the ego node. In contrast, Graphair only drops one edge and preserves the original graph topology. Besides, Graphair reduces the sensitive homophily by connecting the ego node to a node with a different sensitive attribute (node 9).
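The effect described in this case study can be made concrete with a small computation. The sketch below assumes that node sensitive homophily is the fraction of a node's 1-hop neighbors that share its sensitive attribute; the ego graph, the attribute values, and the function name are hypothetical and do not reproduce the actual NBA example in Figure 8.

```python
def node_sensitive_homophily(adj, sens, node):
    """Fraction of a node's 1-hop neighbors that share its sensitive attribute."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0   # convention chosen here for isolated nodes
    same = sum(1 for v in neighbors if sens[v] == sens[node])
    return same / len(neighbors)


# Hypothetical ego graph around node 1 (not the actual NBA example).
sens = {1: 0, 2: 0, 3: 0, 4: 0, 9: 1}
before = {1: [2, 3, 4]}   # all neighbors share the ego node's attribute
after = {1: [2, 3, 9]}    # one edge dropped, one edge added to the other group

h_before = node_sensitive_homophily(before, sens, 1)
h_after = node_sensitive_homophily(after, sens, 1)
```

Here a single dropped edge and a single added cross-group edge lower the ego node's homophily from 1 to 2/3 while keeping its degree unchanged.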

F MORE DETAILS ON EXPERIMENTAL SETTINGS

F.1 DATASETS

We use three real-world social network datasets, including NBA, Pokec-z, and Pokec-n (Dai & Wang, 2021), to evaluate Graphair on node classification tasks. Pokec-z and Pokec-n are sampled from a larger social network, Pokec, which is the most popular social network in Slovakia. User features contain gender, age, hobbies, interests, education, working field, etc. Among these features, region is treated as the sensitive attribute and working field is used as the predicted label. Since the node representations are learned in an unsupervised manner, we use a small portion of labeled data to train the classifier. In other words, we randomly split the nodes 10%/10%/80% for training, validating, and testing the classifier. NBA is extended from a Kaggle dataset with more than 400 NBA basketball players. The player information contains nationality, age, salary, performance statistics in the 2016-2017 season, etc. Nationality is treated as the sensitive attribute and the task is to predict whether the salary of the player is above the median. We randomly split 20%/35%/45% for training, validating and

