TOPOLOGY MATTERS IN FAIR GRAPH LEARNING: A THEORETICAL PILOT STUDY

Abstract

Recent advances in fair graph learning observe that graph neural networks (GNNs) further amplify prediction bias compared with the multilayer perceptron (MLP), while the reason behind this is unknown. In this paper, we conduct a theoretical analysis of the bias amplification mechanism in GNNs. This is a challenging task since GNNs are difficult to interpret, and real-world networks are complex. To bridge the gap, we theoretically and experimentally demonstrate that the aggregation operation in representative GNNs accumulates bias in node representations due to the topology bias induced by graph topology. We provide a sufficient condition, stated in terms of the statistical information of graph data, under which graph aggregation enhances prediction bias in GNNs. Motivated by this data-centric finding, we propose a fair graph refinement algorithm, named FairGR, that rewires graph topology to reduce the sensitive homophily coefficient while preserving useful graph topology. Experiments on node classification tasks demonstrate that FairGR can mitigate the prediction bias with comparable performance on three real-world datasets. Additionally, because it only refines graph topology, FairGR is compatible with many state-of-the-art methods, such as adding regularization, adversarial debiasing, and Fair mixup. Therefore, FairGR is a plug-in fairness method and can be adapted to improve existing fair graph learning strategies.

1. INTRODUCTION

Graph neural networks (GNNs) (Kipf & Welling, 2017; Veličković et al., 2018; Wu et al., 2019) are widely adopted in various domains, such as social media mining (Hamilton et al., 2017), knowledge graphs (Hamaguchi et al., 2017), and recommender systems (Ying et al., 2018), due to their remarkable performance in learning representations. Graph learning, a topic of growing popularity, aims to learn node representations containing both topological and attribute information in a given graph. Despite their outstanding performance in various tasks, GNNs still inherit or even amplify societal bias from input graph data (Dai & Wang, 2021). The biased node representation largely limits the application of GNNs in many high-stake tasks, such as job hunting (Mehrabi et al., 2021) and crime ratio prediction (Suresh & Guttag, 2019). Hence, bias mitigation that facilitates the research on fair GNNs is urgently needed.

In many real-world graphs, nodes with the same sensitive attribute (e.g., age) are more likely to connect. For example, young people mainly make friends with people of similar ages (Dong et al., 2016). We call this phenomenon "topology bias". Even worse, in GNNs, the representation of each node is learned by aggregating the representations of its neighbors. Thus, nodes with the same sensitive attribute will be more similar after the aggregation. To illustrate, we visualize the topology bias for three real-world datasets (Pokec-n, Pokec-z, and NBA) in Figure 1, where different edge types are highlighted with different colors for the top-3 largest connected components in the original graphs. Such topology bias leads to more similar node representations for nodes with the same sensitive attribute, which is a major source of the graph representation bias. Existing bias mitigation work for GNNs is empirical, relying on adding regularization, adversarial debiasing, or contrastive learning.
These works are motivated by the fact that graph neural networks trained on graphs may inherit the societal bias in data, and the topology of graphs and the message passing in GNNs could even magnify the bias compared with the multilayer perceptron (MLP) (Dai & Wang, 2021). However, even though fair prediction in GNNs can be achieved via a fair training strategy, a fundamental understanding of why large topology bias and the message passing algorithm amplify the bias is still missing. A natural question is raised: Can we theoretically understand why large topology bias and the message passing algorithm amplify the bias from a graph data perspective?

Figure 1: Visualization of topology bias in three real-world datasets (Pokec-n, Pokec-z, and NBA), where black edges and red edges represent edges whose connected node pair has the same or different sensitive attributes, respectively. We visualize the largest three connected components for each dataset. The sensitive homophily coefficients (the ratio of homo edges) are high in practice, i.e., 95.30%, 95.06%, and 72.37% for the Pokec-n, Pokec-z, and NBA datasets, respectively.

In this work, we take a first step toward understanding why large topology bias and the message passing algorithm amplify the bias. Specifically, we first define the sensitive homophily coefficient to describe how likely connected nodes are to share the same sensitive attribute. Subsequently, we theoretically prove that the GCN-like aggregation in message passing inevitably accumulates representation bias for graphs with large sensitive homophily coefficients. Second, motivated by our theoretical analysis, we develop a fair graph refinement algorithm, named FairGR, to achieve fair GNN prediction via revising graph topology. More importantly, FairGR is a plug-in data refinement method and is compatible with many fair training strategies, such as regularization, adversarial debiasing, and Fair mixup. In short, the contributions can be summarized as follows:

• To the best of our knowledge, this is the first paper to theoretically investigate why the GCN-like message passing scheme amplifies representation bias for large topology bias. Specifically, we provide a sufficient condition on graph data under which GCN-like message passing amplifies representation bias.
• Motivated by our theoretical analysis, we propose a graph topology refinement method, named FairGR, to achieve fair prediction.
• We empirically show that the prediction bias of GNNs is larger than that of MLP on real-world datasets. Additionally, the effectiveness of FairGR is experimentally evaluated on three real-world datasets. The results show that, compared to the state of the art, FairGR exhibits a superior trade-off between prediction performance and fairness, and is compatible with many fair training strategies, such as regularization, adversarial debiasing, and Fair mixup.

2.1. NOTATIONS

We adopt bold upper-case letters to denote matrices such as $\mathbf{X}$, bold lower-case letters such as $\mathbf{x}$ to denote vectors or random variables, and calligraphic font such as $\mathcal{X}$ to denote sets. Given a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the $i$-th row and $j$-th column are denoted as $\mathbf{X}_i$ and $\mathbf{X}_{\cdot,j}$, and the element in the $i$-th row and $j$-th column is $\mathbf{X}_{i,j}$. We use the entry-wise $l_1$ norm of a matrix, $\|\mathbf{X}\|_1 = \sum_{ij} |\mathbf{X}_{ij}|$. Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ be a graph with the node set $\mathcal{V} = \{v_1, \cdots, v_n\}$ and the undirected edge set $\mathcal{E} = \{e_1, \cdots, e_m\}$, where $n$ and $m$ represent the number of nodes and edges, respectively. The graph structure $\mathcal{G}$ can be represented as an adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $\mathbf{A}_{ij} = 1$ if there is an edge between node $v_i$ and node $v_j$, and $\mathbf{A}_{ij} = 0$ otherwise. $\mathcal{N}(i)$ denotes the neighbors of node $v_i$ and $\tilde{\mathcal{N}}(i) = \mathcal{N}(i) \cup \{v_i\}$ denotes the self-inclusive neighbors. Suppose that each node is associated with a $d$-dimensional feature vector and a (binary) sensitive attribute; the features and sensitive attributes of all nodes are denoted as $\mathbf{X}_{ori} \in \mathbb{R}^{n \times d}$ and $\mathbf{s} \in \{-1, 1\}^n$. $I(\mathbf{s}, \mathbf{X})$ represents the mutual information between the sensitive attribute and node features. $\mathbf{A} \odot \mathbf{B}$ represents the Hadamard (element-wise) product of two matrices.

2.2. LABEL AND SENSITIVE HOMOPHILY COEFFICIENT IN GRAPHS

The behaviors of graph neural networks have been investigated in the context of label homophily for connected node pairs in graphs (Ma et al., 2021). Label homophily in graphs is typically defined to characterize the similarity of connected node labels in graphs. Here, a similar node pair means that the connected nodes share the same label. From the perspective of fairness, we also define the sensitive homophily coefficient to represent the sensitive attribute similarity among connected node pairs. Informally, the coefficients for label homophily and sensitive homophily are defined as the fraction of the edges connecting nodes of the same class label and the same sensitive attribute in a graph, respectively (Zhu et al., 2020; Ma et al., 2021). We also provide the formal definition as follows:

Definition 1 (Label and Sensitive Homophily Coefficient) Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with node label vector $\mathbf{y}$ and node sensitive attribute vector $\mathbf{s}$, the label and sensitive attribute homophily coefficients are defined as the fraction of edges that connect nodes with the same labels or sensitive attributes:

$$\epsilon_{label}(\mathcal{G}, \mathbf{y}) = \frac{1}{|\mathcal{E}|} \sum_{(i,j) \in \mathcal{E}} \mathbb{1}(y_i = y_j), \qquad \epsilon_{sens}(\mathcal{G}, \mathbf{s}) = \frac{1}{|\mathcal{E}|} \sum_{(i,j) \in \mathcal{E}} \mathbb{1}(s_i = s_j),$$

where $|\mathcal{E}|$ is the number of edges and $\mathbb{1}(\cdot)$ is the indicator function.

Recent works (Ma et al., 2021; Chien et al., 2021; Zhu et al., 2020) aim to understand the relation between the message passing in GNNs and label homophily from the interactions between GNNs (model) and graph topology (data). For graph data with a high label homophily coefficient, existing works (Ma et al., 2021; Tang et al., 2020) have demonstrated, either provably or empirically, that nodes with higher degree obtain more prediction benefits in GCN than peripheral nodes do.
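As a concrete reading of Definition 1, the following minimal sketch computes both coefficients from an undirected edge list. The function name `homophily_coefficient` and the toy graph are illustrative assumptions, not part of the original work.

```python
def homophily_coefficient(edges, attr):
    """Fraction of edges whose two endpoints share the same attribute value.

    edges: list of (i, j) node-index pairs, one entry per undirected edge.
    attr:  per-node attribute (labels y or sensitive attributes s).
    """
    if not edges:
        raise ValueError("the graph has no edges")
    same = sum(1 for i, j in edges if attr[i] == attr[j])
    return same / len(edges)


# Hypothetical toy graph: 4 nodes, 5 undirected edges.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
y = [1, 1, -1, -1]   # node labels
s = [1, 1, 1, -1]    # binary sensitive attributes

eps_label = homophily_coefficient(edges, y)  # 2 of 5 edges share a label
eps_sens = homophily_coefficient(edges, s)   # 3 of 5 edges share a sensitive attribute
```

The same function serves both coefficients because Definition 1 differs only in which node attribute is compared across each edge.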
As for graph data with a low label homophily coefficient, GNNs do not necessarily lead to better prediction performance compared with MLP since the node features of neighbors with different labels contaminate the node features during feature aggregation. However, although work (Dai & Wang, 2021) empirically points out that graph data with a large sensitive homophily coefficient may enhance bias in GNNs, the fundamental understanding of message passing and graph data properties, such as sensitive homophily coefficient, is still missing. We provide the theoretical analysis in Section 4.

3. RELATED WORK

We briefly review the existing work on graph neural networks and fairness-aware learning on graphs (please refer to Appendix D for a more comprehensive discussion). Existing GNNs can be roughly divided into spectral-based and spatial-based GNNs. Spectral-based GNNs provide a graph convolution definition based on graph theory (Bruna et al., 2013; Defferrard et al., 2016; Henaff et al., 2015). Spatial-based GNN variants are popular due to their explicit aggregation of neighbors' information, including graph convolutional networks (GCN) (Kipf & Welling, 2017) and graph attention networks (GAT) (Veličković et al., 2018). As for fairness in graph data, many works have been developed to achieve fairness in the machine learning community (Chuang & Mroueh, 2020), including fair walk (Rahman et al., 2019), adversarial debiasing (Dai & Wang, 2021), Bayesian approaches (Buyl & De Bie, 2020), and contrastive learning (Agarwal et al., 2021). Much literature empirically shows that GNNs or message passing may enhance prediction bias compared with MLP. However, the theoretical understanding of why such a phenomenon happens is still unclear. In this work, we take an initial step toward theoretically understanding why message passing enhances bias from a data perspective. Based on this understanding, we develop a simple yet effective fair graph refinement method to achieve better tradeoff performance. More importantly, the proposed FairGR is compatible with many fair training strategies.

For each node, GNNs aggregate the features of its neighbors to learn its representation. In real-world datasets, we observe a high sensitive homophily coefficient (which is even higher than the label homophily coefficient). As a result, existing message-passing schemes tend to aggregate node features from neighbors with the same sensitive attribute. Thus, the common belief is that message passing renders node representations with the same sensitive attribute more similar. However, this common belief is heuristic, and a quantifiable relationship between the topology bias and representation bias is still missing. In this section, (1) we rectify this common belief and quantitatively reveal that only a sufficiently high sensitive homophily coefficient leads to bias enhancement; (2) we analyze the influence of other graph statistics, such as the number of nodes $n$, the edge density $\rho_d$, and the sensitive group ratio $c$, in terms of bias enhancement.

4.1. SYNTHETIC GRAPH

We consider synthetic random graph generation using the contextual stochastic block model (CSBM) (Fortunato & Hric, 2016), which covers both graph topology and node feature generation with a Gaussian mixture distribution. As a pilot study, we choose the most commonly used graph topology model, the stochastic block model (SBM), and node features with Gaussian mixture distributions in the analysis. We make this choice because SBM is widely used to model and analyze most complex networks (e.g., social networks, the World Wide Web, and biological networks) (Van Der Hofstad, 2016).

Graph Topology. Throughout our analysis, we mainly focus on four characteristics of graph topology: the number of nodes $n$, the edge density $\rho_d$, the sensitive homophily coefficient $\epsilon_{sens}$, and the sensitive group ratio $c$. Specifically, we consider the synthetic random graph generation, including graph topology and node feature generation, as follows:

Definition 2 ($(n, \rho_d, \epsilon_{sens}, c)$-graph) The synthetic random graph $\mathcal{G}$ sampled from the $(n, \rho_d, \epsilon_{sens}, c)$-graph satisfies the following properties: 1) the number of graph nodes is $n$; 2) the adjacency matrix $\mathbf{A}$ satisfies $\mathbf{A}_{ij} \in \{0, 1\}$ and $\mathbb{E}_{ij}[P(\mathbf{A}_{ij} = 1)] = \rho_d$; 3) given a connected node pair $\mathbf{A}_{ij} = 1$, the probability that the connected nodes have the same sensitive attribute satisfies $P(s_i = s_j | \mathbf{A}_{ij} = 1) = \epsilon_{sens}$; 4) the binary sensitive attribute $s_i \in \{-1, 1\}$ satisfies $\mathbb{E}_i[P(s_i = 1)] = c$; 5) edges are generated independently.

Node Features. We assume that node attributes in the synthetic graph follow the Gaussian mixture model $GMM(c, \mu_1, \Sigma_1, \mu_2, \Sigma_2)$. For a node with sensitive attribute $s_i = -1$ ($s_i = 1$), the node attributes $\mathbf{X}_i$ follow the Gaussian distribution $P_1 = N(\mu_1, \Sigma_1)$ ($P_2 = N(\mu_2, \Sigma_2)$), where the node attributes with the same sensitive attribute are independent and identically distributed, and $\mu_i, \Sigma_i$ ($i = 1, 2$) represent the mean vector and covariance matrix.
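The generative model above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `sample_synthetic_graph` is hypothetical, the covariance is taken to be isotropic ($\sigma^2 I$) for simplicity, and the block connection probabilities are the ones implied by properties 2) to 4) of Definition 2 (they are stated later as Lemma 1).

```python
import random


def sample_synthetic_graph(n, rho_d, eps_sens, c, mu1, mu2, sigma=1.0, seed=0):
    """Sample sensitive attributes, edges, and features of an (n, rho_d, eps_sens, c)-graph.

    Assumption: both Gaussian components use the isotropic covariance sigma^2 * I.
    """
    rng = random.Random(seed)
    # Property 4: each node has s_i = 1 with probability c, otherwise -1.
    s = [1 if rng.random() < c else -1 for _ in range(n)]
    # Intra-/inter-group connection probabilities implied by properties 2 and 3.
    p_conn = rho_d * eps_sens / (c ** 2 + (1 - c) ** 2)
    q_conn = rho_d * (1 - eps_sens) / (2 * c * (1 - c))
    if not (0.0 <= p_conn <= 1.0 and 0.0 <= q_conn <= 1.0):
        raise ValueError("parameters do not yield valid connection probabilities")
    # Property 5: every node pair is connected independently.
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_conn if s[i] == s[j] else q_conn
            if rng.random() < prob:
                edges.append((i, j))
    # Node features: N(mu1, sigma^2 I) for s_i = -1 and N(mu2, sigma^2 I) for s_i = 1.
    X = [[rng.gauss(m, sigma) for m in (mu1 if s[i] == -1 else mu2)] for i in range(n)]
    return s, edges, X


s, edges, X = sample_synthetic_graph(
    n=400, rho_d=0.05, eps_sens=0.9, c=0.5, mu1=[0.0, 1.0], mu2=[1.0, 0.0]
)
# Empirical sensitive homophily of the sampled graph; it should be close to eps_sens.
empirical_eps_sens = sum(1 for i, j in edges if s[i] == s[j]) / len(edges)
```

Because group sizes and edges are random, the empirical homophily only approximates the target $\epsilon_{sens}$, and the approximation tightens as $n$ grows.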

4.2. REPRESENTATION BIAS MEASUREMENT

To measure the node representation bias, we adopt the mutual information between the sensitive attribute and node attributes, $I(\mathbf{s}, \mathbf{X})$. Since the exact mutual information $I(\mathbf{s}, \mathbf{X})$ is intractable to estimate, an upper bound on it is developed as a surrogate metric in the following Theorem 1:

Theorem 1 Suppose the synthetic graph node attribute $\mathbf{X}$ is generated based on the Gaussian mixture model $GMM(c, \mu_1, \Sigma_1, \mu_2, \Sigma_2)$, i.e., the probability density function of node attributes for nodes with sensitive attribute $s \in \{-1, 1\}$ follows $f_X(\mathbf{X}_i = x | s_i = -1) \sim N(\mu_1, \Sigma_1) \triangleq P_1$ and $f_X(\mathbf{X}_i = x | s_i = 1) \sim N(\mu_2, \Sigma_2) \triangleq P_2$, and the sensitive attribute ratio satisfies $\mathbb{E}_i[P(s_i = 1)] = c$. Then the mutual information between the sensitive attribute and node attributes $I(\mathbf{s}, \mathbf{X})$ satisfies

$$I(\mathbf{s}, \mathbf{X}) \le -(1-c) \ln\Big[(1-c) + c \exp\big(-D_{KL}(P_1 \| P_2)\big)\Big] - c \ln\Big[c + (1-c) \exp\big(-D_{KL}(P_2 \| P_1)\big)\Big] \triangleq Bias(\mathbf{s}, \mathbf{X}).$$

Based on Theorem 1, we can observe that a lower distribution distance $D_{KL}(P_1 \| P_2)$ or $D_{KL}(P_2 \| P_1)$ is beneficial for reducing $Bias(\mathbf{s}, \mathbf{X})$ and $I(\mathbf{s}, \mathbf{X})$, since the sensitive attribute is less distinguishable based on node representations.
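To make the surrogate metric tangible, the sketch below evaluates the bound of Theorem 1 numerically. It assumes diagonal covariance matrices so that the Gaussian KL divergence has a simple closed form in pure Python; the helper names `kl_diag_gaussians` and `bias_upper_bound` are illustrative and do not come from the original work.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariances (per-dimension variances)."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def bias_upper_bound(c, kl_12, kl_21):
    """Bias(s, X) from Theorem 1 given c = P(s = 1) and the two KL divergences."""
    term_neg = -(1 - c) * math.log((1 - c) + c * math.exp(-kl_12))
    term_pos = -c * math.log(c + (1 - c) * math.exp(-kl_21))
    return term_neg + term_pos


# Two unit-variance groups with different means.
mu1, mu2 = [0.0, 1.0], [1.0, 0.0]
kl = kl_diag_gaussians(mu1, [1.0, 1.0], mu2, [1.0, 1.0])
bias_spread = bias_upper_bound(0.5, kl, kl)

# Same means, but each group is more concentrated (smaller variance).
kl_tight = kl_diag_gaussians(mu1, [0.25, 0.25], mu2, [0.25, 0.25])
bias_tight = bias_upper_bound(0.5, kl_tight, kl_tight)
```

The bound is zero when the two groups are indistinguishable and approaches the entropy of the sensitive attribute ($\ln 2$ for $c = 0.5$) as the groups separate. Shrinking the within-group variance while keeping the means fixed increases the bound, which is the effect that aggregation is argued to have in the following subsection.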

4.3. WHEN DOES AGGREGATION ENHANCE THE BIAS?

We focus on the role of message passing in terms of fairness. Suppose the graph adjacency matrix $\mathbf{A}$ is sampled from the $(n, \rho_d, \epsilon_{sens}, c)$-graph and we adopt the GCN-like message passing $\tilde{\mathbf{X}} = \tilde{\mathbf{A}} \mathbf{X}$, where $\tilde{\mathbf{A}}$ is the normalized adjacency matrix with self-loops. We define the bias difference for such message passing as $\Delta Bias = Bias(\mathbf{s}, \tilde{\mathbf{X}}) - Bias(\mathbf{s}, \mathbf{X})$ to measure the role of graph topology. Subsequently, we derive the intra-connect and inter-connect probabilities in Lemma 1.

Lemma 1 Suppose the synthetic graph is generated from the $(n, \rho_d, \epsilon_{sens}, c)$-graph. Then the intra-connect and inter-connect probabilities are given by

$$p_{conn} = \frac{\rho_d \, \epsilon_{sens}}{c^2 + (1-c)^2}, \qquad q_{conn} = \frac{\rho_d (1 - \epsilon_{sens})}{2c(1-c)}.$$

Subsequently, we provide a sufficient condition to specify the case in which graph topology enhances bias in Theorem 2.

Theorem 2 Suppose the synthetic graph node attribute $\mathbf{X}$ is generated based on the Gaussian mixture model $GMM(c, \mu_1, \Sigma, \mu_2, \Sigma)$, and the graph adjacency matrix $\mathbf{A}$ is generated from the $(n, \rho_d, \epsilon_{sens}, c)$-graph. If adopting the GCN-like message passing $\tilde{\mathbf{X}} = \tilde{\mathbf{A}} \mathbf{X}$, bias will be enhanced, i.e., $\Delta Bias > 0$, if the bias-enhance condition holds:

$$(\nu_1 - \nu_2)^2 \min\{\zeta_1, \zeta_2\} > 1,$$

where $\nu_1 - \nu_2 < 1$ represents the reduction coefficient of the distance between the mean node attributes of the two sensitive attribute groups, and $\zeta_1, \zeta_2$ denote the connection degrees of the two sensitive groups. The mathematical formulation is given by

$$\nu_1 = \frac{(n_1 - 1) p_{conn} + 1}{\zeta_1}, \qquad \nu_2 = \frac{(n_1 - 1) q_{conn}}{\zeta_2},$$

where $\zeta_1 = n_{-1} q_{conn} + (n_1 - 1) p_{conn} + 1$, $\zeta_2 = n_{-1} p_{conn} + (n_1 - 1) q_{conn} + 1$, the numbers of nodes with the same sensitive attribute are $n_{-1} = n(1-c)$ and $n_1 = nc$, the intra-connect probability is $p_{conn} = \mathbb{E}_{ij}[P(\mathbf{A}_{ij} = 1 | s_i = s_j)]$, and the inter-connect probability is $q_{conn} = \mathbb{E}_{ij}[P(\mathbf{A}_{ij} = 1 | s_i s_j = -1)]$.
Based on Theorem 2, we can discuss the influence of four pieces of graph data information on bias enhancement: the number of nodes $n$, the edge density $\rho_d$, the sensitive homophily coefficient $\epsilon_{sens}$, and the sensitive group ratio $c$.

• Node representation bias is enhanced by message passing for a sufficiently large sensitive homophily $\epsilon_{sens}$. According to Lemma 1, for a large sensitive homophily coefficient $\epsilon_{sens} \to 1$, the inter-connect probability $q_{conn} \to 0$ and the intra-connect probability $p_{conn}$ attains its maximal value. In this case, based on Theorem 2, it is easy to obtain that $\nu_1 = 1$, $\nu_2 = 0$, and the distance between the mean aggregated node representations stays the same, i.e., $\tilde{\mu}_1 - \tilde{\mu}_2 = (\nu_1 - \nu_2)(\mu_1 - \mu_2) = \mu_1 - \mu_2$. Additionally, the covariance is diminished after aggregation since $\zeta_1$ and $\zeta_2$ are strictly larger than 1. Therefore, for a sufficiently large sensitive homophily coefficient $\epsilon_{sens}$, the bias-enhance condition $(\nu_1 - \nu_2)^2 \min\{\zeta_1, \zeta_2\} > 1$ holds.
• The bias enhancement implicitly depends on the geometric differentiation of node representations, including the distance between the mean node representations of the two sensitive attribute groups and the scale of the covariance matrix. Theorem 1 implies that a low mean representation distance and a dispersed representation (large covariance matrix) lead to fair representation. However, GCN-like message passing both reduces the mean node representation distance by the factor $\nu_1 - \nu_2$ and concentrates the representations within each sensitive attribute group. These two effects act in opposite directions for fairness, and both the mean distance reduction and the covariance reduction are controlled by the sensitive homophily coefficient.
• The bias is enlarged as the node number $n$ increases. For a large node number $n$, the mean distance stays almost constant since $\nu_1 \approx \frac{c \, p_{conn}}{(1-c) q_{conn} + c \, p_{conn}}$ and $\nu_2 \approx \frac{c \, q_{conn}}{(1-c) p_{conn} + c \, q_{conn}}$, while $\zeta_1, \zeta_2$ are almost proportional to the node number $n$. Therefore, the bias-enhance condition is more easily satisfied and $\Delta Bias$ is higher for large graph data. The intuition is that, for a given edge density, a larger graph has a higher average node degree. Hence, each aggregated node representation absorbs more information from neighbors with the same sensitive attribute, which leads to a lower covariance matrix value and higher representation bias.
• The bias is enlarged as the graph connection density $\rho_d$ increases. Based on Lemma 1, the intra-connection probability and inter-connection probability are both proportional to the graph connection density $\rho_d$. Therefore, $\nu_1$ and $\nu_2$ stay almost constant, and so does the distance between the mean node representations. As for the covariance matrix, message passing leads to more concentrated node representations since $\zeta_1$ and $\zeta_2$ are larger for a higher graph connection density $\rho_d$. The rationale is similar to that for the node number: for a given node number, a higher graph connection density $\rho_d$ means a higher average node degree, and each aggregated node representation absorbs more information from neighbors with the same sensitive attribute.
• The bias is enlarged when the sensitive attribute is more balanced (i.e., $c$ is closer to $\frac{1}{2}$). Based on Lemma 1, it is easy to obtain that, for a given graph connection density $\rho_d$ and node number $n$, the intra-connection probability $p_{conn}$ is high and the inter-connection probability $q_{conn}$ is low when the sensitive attribute is balanced. In other words, the intra-connection probability $p_{conn}$ (inter-connection probability $q_{conn}$) monotonically decreases (increases) with respect to $|c - \frac{1}{2}|$.
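The quantities in Lemma 1 and Theorem 2 are simple enough to evaluate directly, which makes the trends discussed in the bullets above easy to check numerically. The following sketch is one way to do so; the function name `bias_enhance_condition` and the chosen parameter values are illustrative assumptions.

```python
def bias_enhance_condition(n, rho_d, eps_sens, c):
    """Evaluate the left-hand side of the bias-enhance condition in Theorem 2.

    Returns (lhs, holds), where holds is True when lhs > 1.
    """
    # Lemma 1: intra-connect and inter-connect probabilities.
    p_conn = rho_d * eps_sens / (c ** 2 + (1 - c) ** 2)
    q_conn = rho_d * (1 - eps_sens) / (2 * c * (1 - c))
    # Group sizes as defined in Theorem 2.
    n_neg, n_pos = n * (1 - c), n * c
    # Connection degrees of the two sensitive groups.
    zeta1 = n_neg * q_conn + (n_pos - 1) * p_conn + 1
    zeta2 = n_neg * p_conn + (n_pos - 1) * q_conn + 1
    # Reduction coefficients of the mean distance.
    nu1 = ((n_pos - 1) * p_conn + 1) / zeta1
    nu2 = (n_pos - 1) * q_conn / zeta2
    lhs = (nu1 - nu2) ** 2 * min(zeta1, zeta2)
    return lhs, lhs > 1


# High sensitive homophily: the sufficient condition is satisfied.
lhs_high, holds_high = bias_enhance_condition(n=1000, rho_d=0.05, eps_sens=0.95, c=0.5)
# No sensitive homophily beyond chance (eps_sens = 0.5 with c = 0.5): it is not.
lhs_rand, holds_rand = bias_enhance_condition(n=1000, rho_d=0.05, eps_sens=0.5, c=0.5)
# Same homophily on a larger graph: the left-hand side grows with n.
lhs_large, _ = bias_enhance_condition(n=4000, rho_d=0.05, eps_sens=0.95, c=0.5)
```

Note that Theorem 2 is a sufficient condition only, so a value of `holds` equal to `False` does not by itself imply that aggregation leaves the bias unchanged.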

5. LEARNING FAIR GRAPH TOPOLOGY

The above section provides a data-centric perspective (graph topology) for understanding why message passing may enhance representation bias. In other words, graph topology refinement may assist GNN models in achieving fair and accurate prediction. However, searching for the optimal graph topology for GNNs is a non-trivial problem due to large-scale discrete optimization. In this section, motivated by the theoretical analysis, we propose a Fair Graph Refinement method, named FairGR, to achieve fair prediction via learning the optimal graph topology. Specifically, we formulate the objective function in three parts: a low sensitive homophily coefficient, a high label homophily coefficient, and a small topology perturbation. In this way, FairGR can explicitly reduce the topology bias while preserving useful topology information for prediction.

Problem Formulation. Considering a binary sensitive attribute $s_i \in \{-1, 1\}$ and a binary classification task with label $y_i \in \{-1, 1\}$ for the $i$-th node, we aim to modify the graph topology to achieve a low sensitive homophily coefficient $\epsilon_{sens}$, a high label homophily coefficient $\epsilon_{label}$, and a small topology perturbation. For the sensitive homophily coefficient with a binary sensitive attribute, whether any two nodes share the same sensitive attribute can be determined via $H(\mathbf{s}\mathbf{s}^T)$, where $H(\cdot)$ is the Heaviside (unit) step function and $\mathbf{s} \in \{-1, 1\}^{n \times 1}$ is the sensitive attribute vector. Based on Definition 1, it is easy to rewrite the sensitive homophily coefficient as $\epsilon_{sens} = \frac{\|H(\mathbf{s}\mathbf{s}^T) \odot \mathbf{A}\|_1}{\|\mathbf{A}\|_1}$, where $\odot$ represents the Hadamard product and $\|\cdot\|_1$ denotes the entry-wise $l_1$ norm (i.e., the sum of the absolute values of all elements). Similarly, the label homophily coefficient is $\epsilon_{label} = \frac{\|H(\mathbf{y}\mathbf{y}^T) \odot \mathbf{A}\|_1}{\|\mathbf{A}\|_1}$. As for graph topology perturbation, we can use the entry-wise $l_1$ norm of the difference as the measurement. In a nutshell, the objective function to reconstruct graph connections given the sensitive attribute $\mathbf{s}$, label $\mathbf{y}$, and graph topology $\mathbf{A}$ can be formulated as:

$$L(\hat{\mathbf{A}} | \mathbf{s}, \mathbf{y}, \mathbf{A}) = \frac{\|H(\mathbf{s}\mathbf{s}^T) \odot \hat{\mathbf{A}}\|_1}{\|\hat{\mathbf{A}}\|_1} - \alpha \frac{\|H(\mathbf{y}\mathbf{y}^T) \odot \hat{\mathbf{A}}\|_1}{\|\hat{\mathbf{A}}\|_1} + \beta \|\hat{\mathbf{A}} - \mathbf{A}\|_1.$$

Therefore, the rewired graph topology can be obtained via the constrained optimization problem

$$\min_{\hat{\mathbf{A}}} L(\hat{\mathbf{A}} | \mathbf{s}, \mathbf{y}, \mathbf{A}) \quad \text{s.t.} \quad \hat{\mathbf{A}}_{ij} \in \{0, 1\},$$

where $\alpha$ and $\beta$ are the hyperparameters for the label homophily coefficient and graph topology perturbation. Considering that the formulated problem is a large-scale discrete optimization problem, we employ a heuristic optimization method to obtain the modified graph topology.

Optimization Strategy. To optimize the formulated problem with constraint, we adopt Proximal Gradient Descent (PGD). Specifically, we first adopt gradient descent to update the graph topology using the gradient $\frac{\partial L(\hat{\mathbf{A}} | \mathbf{s}, \mathbf{y}, \mathbf{A})}{\partial \hat{\mathbf{A}}}$. Subsequently, we clip the graph topology $\hat{\mathbf{A}}$ within $\{0, 1\}$ in the projection operation of PGD. This operation is repeated multiple times to obtain the final graph topology. In practice, considering that only the sensitive attributes and labels of the training set are available, we only modify the graph topology within the training nodes. In other words, the three objectives, including the sensitive homophily coefficient, label homophily coefficient, and graph topology perturbation, are calculated for the subgraph of training nodes. In this way, we can avoid information leakage from the test set and reduce the complexity of the optimization problem.

Evaluation and Computation Complexity Analysis.
For the algorithmic evaluation of FairGR as a pre-processing method, we compare the prediction performance (including accuracy and fairness) using the original graph topology and the rewired graph topology across multiple GNN backbones. Denote the number of training nodes and update iterations as $n_{train}$ and $T$, respectively. Then the computation complexity of the gradient computation and of the projection in PGD are both $O(n_{train}^2)$. The total computation complexity to obtain the final rewired graph topology is given by $O(T n_{train}^2)$. The memory consumption is $O(n_{train}^2)$ due to the storage of the graph topology gradient.
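The optimization strategy described in this section can be sketched for a small dense graph as follows. This is a hedged illustration rather than the authors' implementation: it assumes the entries of $\hat{\mathbf{A}}$ are relaxed to the interval $[0, 1]$ during the iterations and rounded to $\{0, 1\}$ only at the end, it uses a subgradient for the $l_1$ perturbation term, and the function names, step size, and toy graph are all hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def homophily(A, attr):
    """||H(attr attr^T) ⊙ A||_1 / ||A||_1 over off-diagonal entries of a nonnegative A."""
    n = len(A)
    total = sum(A[i][j] for i in range(n) for j in range(n) if i != j)
    if total <= 0:
        raise ValueError("the graph has no edges")
    same = sum(A[i][j] for i in range(n) for j in range(n)
               if i != j and attr[i] == attr[j])
    return same / total


def refine_topology(A, s, y, alpha=1.0, beta=0.01, lr=1.0, steps=100):
    """Gradient step on L(A_hat | s, y, A), then projection (clipping) onto [0, 1]."""
    n = len(A)
    A_hat = [[float(v) for v in row] for row in A]
    for _ in range(steps):
        total = sum(A_hat[i][j] for i in range(n) for j in range(n) if i != j)
        if total <= 0:
            break  # every edge was removed; the homophily ratios are undefined
        eps_s, eps_y = homophily(A_hat, s), homophily(A_hat, y)
        updated = [row[:] for row in A_hat]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # no self-loops
                same_s = 1.0 if s[i] == s[j] else 0.0
                same_y = 1.0 if y[i] == y[j] else 0.0
                # d/dA_ij of each ratio sum(M ⊙ A)/sum(A) is (M_ij - ratio) / sum(A).
                grad = ((same_s - eps_s) - alpha * (same_y - eps_y)) / total
                # Subgradient of beta * ||A_hat - A||_1.
                grad += beta * sign(A_hat[i][j] - A[i][j])
                updated[i][j] = min(1.0, max(0.0, A_hat[i][j] - lr * grad))
        A_hat = updated
    # Final rounding to a binary adjacency matrix (an assumption of this sketch).
    return [[1 if i != j and A_hat[i][j] >= 0.5 else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical toy subgraph of 4 training nodes with edges (0,1), (2,3), (0,2).
s = [1, 1, -1, -1]
y = [1, -1, 1, -1]
A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]

A_new = refine_topology(A, s, y)
eps_before = homophily(A, s)
eps_after = homophily(A_new, s)
```

Each iteration touches every entry of the dense matrix once, matching the $O(n_{train}^2)$ per-iteration cost stated above. In the toy example the two edges that link same-sensitive, different-label nodes are removed, while the edge linking different-sensitive, same-label nodes is kept.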

6. EXPERIMENTS

In this section, we conduct experiments to validate the effectiveness of the proposed FairGR (see Appendix G for more details). We first validate that GCN-like aggregation enhances representation bias for graph data with large sensitive homophily via synthetic experiments. For real-world datasets, we conduct experiments to show that the prediction bias of GNNs is larger than that of MLP. Moreover, we introduce the experimental settings and then evaluate our proposed FairGR against several baselines in terms of prediction performance and fairness metrics on real-world datasets.

6.1. SYNTHETIC EXPERIMENTS

In the synthetic experiments, we demonstrate the relation between the DP difference across the GCN-like message passing operation and the sensitive homophily coefficient. Specifically, we investigate the influence of the graph node number $n$, graph connection density $\rho_d$, sensitive homophily $\epsilon_{sens}$, and sensitive attribute ratio $c$ on the bias enhancement of GCN-like message passing. As the evaluation metric, we adopt the demographic parity (DP) difference during message passing to measure the bias enhancement. For node attribute generation, we generate node attributes from the Gaussian distributions $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$ for nodes of the two binary sensitive attribute groups, respectively, where $\mu_1 = [0, 1]$, $\mu_2 = [1, 0]$, and $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$. For adjacency matrix generation, we randomly generate edges via a stochastic block model based on the intra-connection and inter-connection probabilities. Figure 2 shows the DP difference during message passing with respect to the sensitive homophily coefficient. We observe that a higher sensitive homophily coefficient generally leads to larger bias enhancement. Additionally, a higher graph connection density $\rho_d$, a larger node number $n$, and a more balanced sensitive attribute ratio $c$ correspond to higher bias enhancement, which is consistent with our theoretical analysis in Theorem 2.

6.2. EXPERIMENTS ON REAL-WORLD DATASETS

6.2.1. EXPERIMENTAL SETTINGS

Datasets. The experiments are conducted on three real-world datasets, including Pokec-z, Pokec-n, and NBA (Dai & Wang, 2021). Pokec-z and Pokec-n are sampled from a larger social network, Pokec (Takac & Zabovsky, 2012), based on the province in Slovakia. We choose the region information and the working field of the users as the sensitive attribute and the predicted label, respectively. The NBA dataset includes around 400 NBA basketball players and is collected from a Kaggle dataset and Twitter. The information of players includes age, nationality, and salary in the 2016-2017 season. We choose nationality (U.S. and overseas player) as the binary sensitive attribute, and the prediction label is whether the salary is higher than the median.

Evaluation Metrics. For the node classification task, we adopt accuracy to evaluate the classification performance. As for fairness metrics, we adopt the two most commonly used group fairness metrics, demographic parity and equal opportunity, to measure the prediction bias (Louizos et al., 2015; Beutel et al., 2017). Specifically, demographic parity is defined as the average prediction difference over different sensitive attribute groups, i.e., $\Delta_{DP} = |P(\hat{y} = 1 | s = -1) - P(\hat{y} = 1 | s = 1)|$. Similarly, equal opportunity is given by $\Delta_{EO} = |P(\hat{y} = 1 | s = -1, y = 1) - P(\hat{y} = 1 | s = 1, y = 1)|$, where $y$ and $\hat{y}$ represent the ground-truth label and predicted label, respectively.

Baselines. Considering that the proposed FairGR is a pre-processing method, we show that FairGR can improve many representative GNNs, such as GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), and SGC (Wu et al., 2019), under many fairness training strategies, such as adding regularization and adversarial debiasing. For all models, we train 2 layers of neural networks with 64 hidden units for 300 epochs.

Implementation Details. For each experiment, we run 5 times and report the average performance for each method. We adopt the Adam optimizer with a learning rate of 0.001 and a weight decay of 1e-5 for training all models. In the adversarial debiasing setting, we train the classifier and adversary head with 70 and 30 epochs, respectively. The hyperparameters for adversarial debiasing are tuned in {0.0, 0.5, 1.0, 2.0, 5.0, 8.0, 10.0, 50.0, 100.0}. For adding regularization, we adopt the hyperparameter set {0.0, 1.0, 1.5, 2.0, 5.0, 8.0, 10.0, 15.0, 25.0, 50.0, 80.0, 100.0}.
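The two group fairness metrics defined under Evaluation Metrics can be computed from hard predictions as in the sketch below. The function names and the toy predictions are illustrative assumptions; the value 1 is treated as the positive class and the sensitive attribute takes values in {-1, 1}.

```python
def positive_rate(y_pred, mask):
    """Empirical P(y_hat = 1) over the nodes selected by the boolean mask."""
    selected = [p for p, keep in zip(y_pred, mask) if keep]
    if not selected:
        raise ValueError("the selected group is empty")
    return sum(1 for p in selected if p == 1) / len(selected)


def demographic_parity_diff(y_pred, s):
    """Delta_DP = |P(y_hat = 1 | s = -1) - P(y_hat = 1 | s = 1)|."""
    rate_neg = positive_rate(y_pred, [si == -1 for si in s])
    rate_pos = positive_rate(y_pred, [si == 1 for si in s])
    return abs(rate_neg - rate_pos)


def equal_opportunity_diff(y_pred, y_true, s):
    """Delta_EO = |P(y_hat = 1 | s = -1, y = 1) - P(y_hat = 1 | s = 1, y = 1)|."""
    rate_neg = positive_rate(y_pred, [si == -1 and yi == 1 for si, yi in zip(s, y_true)])
    rate_pos = positive_rate(y_pred, [si == 1 and yi == 1 for si, yi in zip(s, y_true)])
    return abs(rate_neg - rate_pos)


# Hypothetical predictions for 8 test nodes.
s      = [-1, -1, -1, -1, 1, 1, 1, 1]
y_true = [1, 1, -1, -1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1]

delta_dp = demographic_parity_diff(y_pred, s)
delta_eo = equal_opportunity_diff(y_pred, y_true, s)
```

Both metrics are zero for a perfectly fair predictor under the respective criterion, so lower values indicate less prediction bias.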

6.2.2. DO GNNS HAVE A LARGER PREDICTION BIAS THAN MLP?

To validate the bias enhancement effect of GNNs, we compare the performance of many representative GNNs with that of MLP on various real-world datasets and summarize the results in Table 1. From these results, we make the following observations:

• Many representative GNNs have a higher prediction bias than the MLP model on all three datasets in terms of demographic parity and equal opportunity. For demographic parity, the prediction bias of MLP is lower than that of GAT, GCN, and SGC by 32.64%, 50.46%, 66.53% and 58.72% on Pokec-z dataset. The higher prediction bias comes from the aggregation among nodes with the same sensitive attribute and the topology bias in graph data.
• FairGR can mitigate bias for the GCN and SGC backbones via rewiring graph topology on these three datasets. For the GAT backbone, although the bias can be mitigated, the accuracy drop is significant because GAT is more sensitive to graph topology rewiring.

6.2.3. DOES FAIRGR ACHIEVE BETTER TRADEOFF PERFORMANCE IN VARIOUS SETTINGS?

Comparison with adversarial debiasing and regularization. To validate that the proposed FairGR is compatible with many fairness training strategies, we also show the trade-off between prediction performance and fairness metrics when combined with adversarial debiasing (Fisher et al., 2020) and with adding demographic parity regularization (Chuang & Mroueh, 2020). In adversarial debiasing (Louppe et al., 2017), the output of GNNs is the input of the adversary, and the goal of the adversary is to predict the node sensitive attribute. For these two fair training strategies, we adopt GCN, GAT, and SGC as backbones. We randomly split the data 50%/25%/25% into training, validation, and test sets. Figure 3 shows the Pareto optimality curve for all methods, where the bottom-right corner point represents the ideal performance (100% accuracy and 0% prediction bias). From the results, we make the following observations:

• For both the adversarial debiasing and adding regularization training strategies, our proposed FairGR can achieve a better DP-Acc trade-off than training without any graph data refinement for many GNNs. In other words, FairGR can effectively reduce training bias and is compatible with many existing fairness training strategies.
• Topology does matter in GNNs. For adding regularization or adversarial debiasing, FairGR yields different tradeoff performance gains on top of different GNNs. This observation implies that there is a complicated interaction between graph topology and message passing algorithms. Additionally, FairGR provides the largest tradeoff performance benefit for GAT compared with GCN and SGC. The high capacity of GAT may enable the message passing algorithm to learn more from data. Therefore, the tradeoff performance improvement is the highest in adding regularization and adversarial debiasing.

7. CONCLUSION

In this work, we theoretically demonstrate that message passing amplifies node representation bias on graph data with a large sensitive homophily coefficient, and we reveal the role of other graph statistics in bias amplification. Additionally, motivated by this theoretical understanding, we develop a simple yet effective graph refinement method, named FairGR, to reduce the sensitive homophily while preserving useful information. We conduct synthetic experiments to validate the theoretical findings. Experimental results on real-world datasets demonstrate the effectiveness of FairGR across many fair training strategies and GNN backbones in node classification tasks.

A PROOF OF THEOREM 1

We provide a more general proof for a categorical sensitive attribute $s \in \{1, 2, \cdots, K\}$ with prior probability $P(s = i) = c_i$. Suppose the conditional distribution of the node attribute $x$ given the sensitive attribute $s = i$ is Gaussian, $P_i(x) \triangleq \mathcal{N}(\mu_i, \Sigma_i)$; then the distribution of the node attribute is the Gaussian mixture $f(x) = \sum_{i=1}^{K} c_i P_i(x)$. Based on the definition of mutual information, we have

$$I(s, x) = H(x) - \sum_{i=1}^{K} c_i H(x \mid s = i), \tag{3}$$

where $H(\cdot)$ represents the Shannon entropy of a random variable. Subsequently, we focus on the entropy of the Gaussian mixture $H(x)$. We show that this entropy can be upper bounded via the pairwise Kullback-Leibler (KL) divergence as follows:

$$\begin{aligned}
I(s, x) &= -\sum_{i=1}^{K} c_i \, \mathbb{E}_{P_i}\Big[\ln \sum_{j=1}^{K} c_j P_j(x)\Big] - \sum_{i=1}^{K} c_i H(x \mid s = i) \\
&\overset{(a)}{\le} -\sum_{i=1}^{K} c_i \ln \sum_{j=1}^{K} c_j e^{\mathbb{E}_{P_i}[\ln P_j(x)]} - \sum_{i=1}^{K} c_i H(x \mid s = i) \\
&= -\sum_{i=1}^{K} c_i \ln \sum_{j=1}^{K} c_j e^{-H(P_i \| P_j)} - \sum_{i=1}^{K} c_i H(x \mid s = i) \\
&\overset{(b)}{=} -\sum_{i=1}^{K} c_i \ln \sum_{j=1}^{K} c_j e^{-H(P_i) - D_{KL}(P_i \| P_j)} - \sum_{i=1}^{K} c_i H(x \mid s = i) \\
&= -\sum_{i=1}^{K} c_i \ln \sum_{j=1}^{K} c_j e^{-D_{KL}(P_i \| P_j)} + \sum_{i=1}^{K} c_i H(P_i) - \sum_{i=1}^{K} c_i H(x \mid s = i) \\
&= -\sum_{i=1}^{K} c_i \ln \sum_{j=1}^{K} c_j e^{-D_{KL}(P_i \| P_j)},
\end{aligned}$$

where the KL divergence is $D_{KL}(P_i \| P_j) \triangleq \int P_i(x) \ln \frac{P_i(x)}{P_j(x)} \, dx$ and the cross entropy is $H(P_i \| P_j) = -\int P_i(x) \ln P_j(x) \, dx$. The last equality uses $H(P_i) = H(x \mid s = i)$. Inequality (a) holds by the variational lower bound on the expectation of a log-sum, $\mathbb{E}\big[\ln \sum_i Z_i\big] \ge \ln \sum_i e^{\mathbb{E}[\ln Z_i]}$ (Kullback, 1997), and equality (b) holds since $H(P_i \| P_j) = H(P_i) + D_{KL}(P_i \| P_j)$. As a special case for a binary sensitive attribute, it is easy to obtain the following result:

$$I(s, X) \le -(1 - c) \ln\Big[(1 - c) + c \exp\big(-D_{KL}(P_1 \| P_2)\big)\Big] - c \ln\Big[c + (1 - c) \exp\big(-D_{KL}(P_2 \| P_1)\big)\Big].$$
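The binary-case bound can be evaluated numerically to see how it behaves at the two extremes. The sketch below is illustrative only: the function names are hypothetical, and the chosen values of c and the KL divergences are arbitrary. When the two group distributions coincide (both KL divergences are zero) the bound is zero, and when they are far apart the bound approaches the entropy of the sensitive attribute.

```python
import math


def mi_upper_bound(c, kl_12, kl_21):
    """Binary-case upper bound on I(s, X) from the end of the proof of Theorem 1.

    c is the mixture weight multiplying exp(-D_KL(P1 || P2)) in the first term;
    kl_12 = D_KL(P1 || P2) and kl_21 = D_KL(P2 || P1).
    """
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie strictly between 0 and 1")
    term_1 = -(1 - c) * math.log((1 - c) + c * math.exp(-kl_12))
    term_2 = -c * math.log(c + (1 - c) * math.exp(-kl_21))
    return term_1 + term_2


def binary_entropy(c):
    """Shannon entropy (in nats) of a binary variable with P = c."""
    return -(1 - c) * math.log(1 - c) - c * math.log(c)


c = 0.3
identical = mi_upper_bound(c, 0.0, 0.0)      # groups indistinguishable
moderate = mi_upper_bound(c, 1.0, 1.0)       # partially separated groups
far_apart = mi_upper_bound(c, 50.0, 50.0)    # groups almost perfectly separated
print(identical, moderate, far_apart, binary_entropy(c))
```

The bound grows monotonically with the pairwise KL divergences, which is why the later analysis only tracks how aggregation changes these divergences.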

B PROOF OF THEOREM 2

Before presenting the proof, we first introduce two useful lemmas on the KL divergence and on the statistical information of the graph.

Lemma 2 For two d-dimensional Gaussian distributions $P = \mathcal{N}(\mu_p, \Sigma_p)$ and $Q = \mathcal{N}(\mu_q, \Sigma_q)$, the KL divergence $D_{KL}(P \| Q)$ is given by

$$D_{KL}(P \| Q) = \frac{1}{2}\Big[\ln \frac{|\Sigma_q|}{|\Sigma_p|} - d + (\mu_p - \mu_q)^\top \Sigma_q^{-1} (\mu_p - \mu_q) + Tr(\Sigma_q^{-1} \Sigma_p)\Big],$$

where $\top$ is the matrix transpose operation and $Tr(\cdot)$ is the trace of a square matrix.

Proof: Note that the probability density function of the multivariate normal distribution is given by

$$P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_p|^{1/2}} \exp\Big(-\frac{1}{2}(x - \mu_p)^\top \Sigma_p^{-1} (x - \mu_p)\Big),$$

so the KL divergence between distributions P and Q is

$$\begin{aligned}
D_{KL}(P \| Q) &= \mathbb{E}_P[\ln P - \ln Q] \\
&= \mathbb{E}_P\Big[\frac{1}{2} \ln \frac{|\Sigma_q|}{|\Sigma_p|} - \frac{1}{2}(x - \mu_p)^\top \Sigma_p^{-1} (x - \mu_p) + \frac{1}{2}(x - \mu_q)^\top \Sigma_q^{-1} (x - \mu_q)\Big] \\
&= \frac{1}{2} \ln \frac{|\Sigma_q|}{|\Sigma_p|} - \underbrace{\frac{1}{2}\mathbb{E}_P\big[(x - \mu_p)^\top \Sigma_p^{-1} (x - \mu_p)\big]}_{I_1} + \underbrace{\frac{1}{2}\mathbb{E}_P\big[(x - \mu_q)^\top \Sigma_q^{-1} (x - \mu_q)\big]}_{I_2}.
\end{aligned}$$

Using the cyclic property of the trace operation, we have

$$I_1 = \frac{1}{2}\mathbb{E}_P\big[(x - \mu_p)^\top \Sigma_p^{-1} (x - \mu_p)\big] = \frac{1}{2} Tr\Big(\mathbb{E}_P\big[(x - \mu_p)(x - \mu_p)^\top\big] \Sigma_p^{-1}\Big) = \frac{1}{2} Tr\big(\Sigma_p \Sigma_p^{-1}\big) = \frac{d}{2}.$$

As for the term $I_2$, note that $x - \mu_q = (x - \mathbb{E}_P[x]) + (\mathbb{E}_P[x] - \mu_q)$; since the cross terms vanish in expectation, we obtain

$$I_2 = \frac{1}{2}\mathbb{E}_P\big[(x - \mu_q)^\top \Sigma_q^{-1} (x - \mu_q)\big] = \frac{1}{2}\Big[(\mu_p - \mu_q)^\top \Sigma_q^{-1} (\mu_p - \mu_q) + Tr(\Sigma_q^{-1} \Sigma_p)\Big].$$

Therefore, the KL divergence $D_{KL}(P \| Q)$ is given by

$$D_{KL}(P \| Q) = \frac{1}{2}\Big[\ln \frac{|\Sigma_q|}{|\Sigma_p|} - d + (\mu_p - \mu_q)^\top \Sigma_q^{-1} (\mu_p - \mu_q) + Tr(\Sigma_q^{-1} \Sigma_p)\Big].$$

□
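As a sanity check on the closed form in Lemma 2, the following sketch compares it against a Monte Carlo estimate of $\mathbb{E}_P[\ln P - \ln Q]$. To stay within the Python standard library it assumes diagonal covariance matrices (passed as lists of variances), which is a restriction of the general lemma; the function names and the example parameters are hypothetical.

```python
import math
import random


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form D_KL(P || Q) from Lemma 2, specialised to diagonal covariances."""
    d = len(mu_p)
    log_det_ratio = sum(math.log(vq / vp) for vp, vq in zip(var_p, var_q))
    mahalanobis = sum((mp - mq) ** 2 / vq for mp, mq, vq in zip(mu_p, mu_q, var_q))
    trace_term = sum(vp / vq for vp, vq in zip(var_p, var_q))
    return 0.5 * (log_det_ratio - d + mahalanobis + trace_term)


def log_density(x, mu, var):
    """Log-density of a Gaussian with diagonal covariance at the point x."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mu, var)
    )


def kl_monte_carlo(mu_p, var_p, mu_q, var_q, num_samples=50_000, seed=0):
    """Estimate E_P[ln P(x) - ln Q(x)] by sampling x from P."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu_p, var_p)]
        total += log_density(x, mu_p, var_p) - log_density(x, mu_q, var_q)
    return total / num_samples


mu_p, var_p = [0.0, 1.0], [1.0, 2.0]
mu_q, var_q = [1.0, 0.0], [2.0, 1.0]
closed_form = kl_diag_gaussians(mu_p, var_p, mu_q, var_q)
estimate = kl_monte_carlo(mu_p, var_p, mu_q, var_q)
print(closed_form, estimate)
```

For these parameters the log-determinant term is 0, the Mahalanobis term is 1.5, and the trace term is 2.5, so the closed form evaluates to 0.5 × (0 − 2 + 1.5 + 2.5) = 1; the sampled estimate should land close to that value.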

Lemma 3 Suppose the synthetic graph is generated from an $(n, \rho_d, \epsilon_{sens}, c)$-graph; then the intra-connect and inter-connect probabilities are

$$p_{conn} = \frac{\rho_d \, \epsilon_{sens}}{c^2 + (1 - c)^2}, \qquad q_{conn} = \frac{\rho_d (1 - \epsilon_{sens})}{2c(1 - c)}.$$

Proof: Based on Bayes' rule, we have the intra-connect and inter-connect probabilities as follows:

$$\begin{aligned}
p_{conn} &= P(A_{ij} = 1 \mid s_i = s_j) = \frac{P(A_{ij} = 1) \, P(s_i = s_j \mid A_{ij} = 1)}{P(s_i = s_j)} = \frac{\rho_d \, \epsilon_{sens}}{c^2 + (1 - c)^2}, \\
q_{conn} &= P(A_{ij} = 1 \mid s_i s_j = -1) = \frac{P(A_{ij} = 1) \, P(s_i s_j = -1 \mid A_{ij} = 1)}{P(s_i s_j = -1)} = \frac{\rho_d (1 - \epsilon_{sens})}{2c(1 - c)}.
\end{aligned}$$

Note that the synthetic graph is generated from an $(n, \rho_d, \epsilon_{sens}, c)$-graph. The sensitive attribute s is generated with ratio c, i.e., the numbers of nodes with sensitive attribute $s = -1$ and $s = 1$ are $n_{-1} = n(1 - c)$ and $n_1 = nc$. Given the sensitive attribute s, we randomly generate the edges based on the parameters $\rho_d$ and $\epsilon_{sens}$ and Lemma 1, i.e., the edges within and across groups are randomly generated according to the intra-connect and inter-connect probabilities. Therefore, the adjacency matrix entry $A_{ij}$ is independent of the node attributes $X_i$ and $X_j$ given the sensitive attributes $s_i$ and $s_j$, i.e., $A_{ij} \perp\!\!\!\perp (X_i, X_j) \mid (s_i, s_j)$. Similarly, different edges and different node attributes are also independent of each other given the sensitive attributes, i.e., $A_{ij} \perp\!\!\!\perp A_{ik} \mid (s_i, s_j, s_k)$ and $X_i \perp\!\!\!\perp X_j \mid (s_i, s_j)$. Therefore, considering GCN-like message passing $\tilde{X}_i = \sum_{j=1}^{n} \tilde{A}_{ij} X_j$, the expectation of the aggregated node attributes given the sensitive attribute is

$$\begin{aligned}
\tilde{\mu}_1 &= \mathbb{E}_{\tilde{X}_i}[\tilde{X}_i \mid s_i = -1] = \sum_{j=1}^{n} \mathbb{E}_{\tilde{A}_{ij}, X_j}[\tilde{A}_{ij} X_j \mid s_i = -1] = \sum_{j=1}^{n} \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = -1] \, \mathbb{E}_{X_j}[X_j \mid s_i = -1] \\
&= (n_{-1} - 1) \, \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = -1, s_j = -1] \, \mathbb{E}_{X_j}[X_j \mid s_i = -1, s_j = -1] \\
&\quad + \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = -1, i = j] \, \mathbb{E}_{X_j}[X_j \mid s_i = -1] \\
&\quad + n_1 \, \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = -1, s_j = 1] \, \mathbb{E}_{X_j}[X_j \mid s_i = -1, s_j = 1] \\
&= \frac{[(n_{-1} - 1) p_{conn} + 1] \mu_1 + n_1 q_{conn} \mu_2}{(n_{-1} - 1) p_{conn} + 1 + n_1 q_{conn}} \triangleq \nu_1 \mu_1 + (1 - \nu_1) \mu_2,
\end{aligned}$$

where $\nu_1 = \frac{(n_{-1} - 1) p_{conn} + 1}{(n_{-1} - 1) p_{conn} + 1 + n_1 q_{conn}}$. Similarly, for a node with sensitive attribute 1, we have

$$\begin{aligned}
\tilde{\mu}_2 &= \mathbb{E}_{\tilde{X}_i}[\tilde{X}_i \mid s_i = 1] = \sum_{j=1}^{n} \mathbb{E}_{\tilde{A}_{ij}, X_j}[\tilde{A}_{ij} X_j \mid s_i = 1] = \sum_{j=1}^{n} \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = 1] \, \mathbb{E}_{X_j}[X_j \mid s_i = 1] \\
&= n_{-1} \, \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = 1, s_j = -1] \, \mathbb{E}_{X_j}[X_j \mid s_i = 1, s_j = -1] \\
&\quad + \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = 1, i = j] \, \mathbb{E}_{X_j}[X_j \mid s_i = 1] \\
&\quad + (n_1 - 1) \, \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij} \mid s_i = 1, s_j = 1] \, \mathbb{E}_{X_j}[X_j \mid s_i = 1, s_j = 1] \\
&= \frac{n_{-1} q_{conn} \mu_1 + [(n_1 - 1) p_{conn} + 1] \mu_2}{n_{-1} q_{conn} + (n_1 - 1) p_{conn} + 1} \triangleq \nu_2 \mu_1 + (1 - \nu_2) \mu_2,
\end{aligned}$$

where $\nu_2 = \frac{n_{-1} q_{conn}}{n_{-1} q_{conn} + 1 + (n_1 - 1) p_{conn}}$. As for the covariance matrix of the aggregated node attributes $\tilde{X}$ given the sensitive attribute $s = -1$, we obtain

$$\begin{aligned}
\tilde{\Sigma}_1 &= \mathbb{D}_{\tilde{X}_i}[\tilde{X}_i \mid s_i = -1] = \sum_{j=1}^{n} \mathbb{D}_{\tilde{A}_{ij}, X_j}[\tilde{A}_{ij} X_j \mid s_i = -1] = \sum_{j=1}^{n} \mathbb{E}_{\tilde{A}_{ij}}[\tilde{A}_{ij}^2 \mid s_i = -1] \, \mathbb{D}_{X_j}[X_j \mid s_i = -1] \\
&= \frac{(n_{-1} - 1) p_{conn} \Sigma + \Sigma + n_1 q_{conn} \Sigma}{[(n_{-1} - 1) p_{conn} + 1 + n_1 q_{conn}]^2} = \frac{\Sigma}{(n_{-1} - 1) p_{conn} + 1 + n_1 q_{conn}} \triangleq \zeta_1^{-1} \Sigma,
\end{aligned}$$

where $\zeta_1 = (n_{-1} - 1) p_{conn} + 1 + n_1 q_{conn}$. Similarly, given the sensitive attribute $s = 1$, we have

$$\tilde{\Sigma}_2 = \mathbb{D}_{\tilde{X}_i}[\tilde{X}_i \mid s_i = 1] = \frac{\Sigma}{n_{-1} q_{conn} + 1 + (n_1 - 1) p_{conn}} \triangleq \zeta_2^{-1} \Sigma,$$

where $\zeta_2 = n_{-1} q_{conn} + 1 + (n_1 - 1) p_{conn}$. In other words, the covariance matrix of the aggregated node attributes is smaller than the original one, since the "average" operation makes node representations more concentrated. Since a sum of Gaussian random variables is still Gaussian, we define the node attribute distributions for sensitive attribute $s = -1$ and $s = 1$ as $P_1 = \mathcal{N}(\mu_1, \Sigma)$ and $P_2 = \mathcal{N}(\mu_2, \Sigma)$, respectively. Similarly, the aggregated node representations for sensitive attribute $s = -1$ and $s = 1$ follow $\tilde{P}_1 = \mathcal{N}(\tilde{\mu}_1, \tilde{\Sigma}_1)$ and $\tilde{P}_2 = \mathcal{N}(\tilde{\mu}_2, \tilde{\Sigma}_2)$.
Since the sensitive attribute ratio stays the same after aggregation, and a larger KL divergence between the two sensitive group distributions implies larger bias, the bias is enhanced ($\Delta Bias > 0$) if $D_{KL}(\tilde{P}_1 \| \tilde{P}_2) > D_{KL}(P_1 \| P_2)$ and $D_{KL}(\tilde{P}_2 \| \tilde{P}_1) > D_{KL}(P_2 \| P_1)$. Therefore, we only focus on the KL divergence. According to Lemma 2, the KL divergence for the original distributions is

$$D_{KL}(P_1 \| P_2) = \frac{1}{2}\Big[\ln \frac{|\Sigma|}{|\Sigma|} - d + (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2) + Tr(\Sigma^{-1} \Sigma)\Big] = \frac{1}{2}(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2). \tag{12}$$

As for the KL divergence for the aggregated distributions, we similarly have

$$\begin{aligned}
D_{KL}(\tilde{P}_1 \| \tilde{P}_2) &= \frac{1}{2}\Big[\ln \frac{|\tilde{\Sigma}_2|}{|\tilde{\Sigma}_1|} - d + (\tilde{\mu}_1 - \tilde{\mu}_2)^\top \tilde{\Sigma}_2^{-1} (\tilde{\mu}_1 - \tilde{\mu}_2) + Tr(\tilde{\Sigma}_2^{-1} \tilde{\Sigma}_1)\Big] \\
&= \frac{1}{2}\Big[d \ln \frac{\zeta_1}{\zeta_2} - d + (\nu_1 - \nu_2)^2 \zeta_2 (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2) + \frac{\zeta_2}{\zeta_1} Tr(I_d)\Big] \\
&\overset{(c)}{\ge} \frac{1}{2}(\nu_1 - \nu_2)^2 \zeta_2 (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2),
\end{aligned} \tag{13}$$

where inequality (c) holds since $\ln x \le x - 1$ for any $x > 0$. Comparing equations (12) and (13), we see that $D_{KL}(\tilde{P}_1 \| \tilde{P}_2) > D_{KL}(P_1 \| P_2)$ if $(\nu_1 - \nu_2)^2 \zeta_2 > 1$. Similarly, $D_{KL}(\tilde{P}_2 \| \tilde{P}_1) > D_{KL}(P_2 \| P_1)$ if $(\nu_1 - \nu_2)^2 \zeta_1 > 1$. In a nutshell, the bias is enhanced ($\Delta Bias > 0$) after message passing if $(\nu_1 - \nu_2)^2 \min\{\zeta_1, \zeta_2\} > 1$.
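The sufficient condition at the end of the proof depends only on the graph statistics $(n, \rho_d, \epsilon_{sens}, c)$, so it can be evaluated directly. The sketch below is a hedged illustration: the function name is hypothetical, $\nu_2$ is taken as the coefficient of $\mu_1$ in the expression for $\tilde{\mu}_2$, and the parameter values are chosen to resemble the synthetic setting (n = 10^4, ρ_d = 10^-3, c = 0.5) rather than copied from any reported experiment.

```python
def bias_enhancement_score(n, rho_d, eps_sens, c):
    """Return (nu_1 - nu_2)^2 * min(zeta_1, zeta_2); a value above 1 satisfies the condition."""
    # Intra- and inter-group connection probabilities (Lemma 3).
    p_conn = rho_d * eps_sens / (c ** 2 + (1 - c) ** 2)
    q_conn = rho_d * (1 - eps_sens) / (2 * c * (1 - c))

    # Group sizes for s = -1 and s = 1.
    n_neg = n * (1 - c)
    n_pos = n * c

    # Normalisers of the aggregated covariance for the two groups.
    zeta_1 = (n_neg - 1) * p_conn + 1 + n_pos * q_conn
    zeta_2 = n_neg * q_conn + 1 + (n_pos - 1) * p_conn

    # Weight placed on mu_1 in the aggregated mean of each group.
    nu_1 = ((n_neg - 1) * p_conn + 1) / zeta_1
    nu_2 = n_neg * q_conn / zeta_2

    return (nu_1 - nu_2) ** 2 * min(zeta_1, zeta_2)


random_mixing = bias_enhancement_score(n=10_000, rho_d=1e-3, eps_sens=0.5, c=0.5)
high_homophily = bias_enhancement_score(n=10_000, rho_d=1e-3, eps_sens=0.9, c=0.5)
print(random_mixing, high_homophily)
```

With these illustrative parameters, the score stays well below 1 when same-group and cross-group edges are equally likely (ε_sens = 0.5) and exceeds 1 when ε_sens = 0.9, matching the claim that a large sensitive homophily coefficient triggers bias enhancement.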

C TOPOLOGY AMPLIFIES BIAS IN ONE-LAYER GCN

Section 4 demonstrates that GCN-like aggregation amplifies node representation bias for graph data with large topology bias. However, it is still unclear whether this observation holds for general GNNs. Understanding the role of topology in fair graph learning is a fundamental and challenging problem. In this section, we take a step toward this problem by considering a one-layer GCN. Prior to comparing the predictions of one-layer GCN and one-layer MLP, we first provide the connection between demographic parity and the mutual information between sensitive attributes and predictions. Then, we theoretically compare the prediction bias of one-layer GCN and one-layer MLP through the lens of mutual information.

C.1 PREDICTION BIAS AND MUTUAL INFORMATION

Here, we only consider binary sensitive attributes s ∈ {-1, 1} and binary labels y ∈ {-1, 1}. Similarly, we define ŷ ∈ {-1, 1} as the binary model predictions. In the fairness community, demographic parity, defined as the average prediction difference among different sensitive attribute groups, is the most commonly used fairness metric, i.e., ∆DP = |P(ŷ|s = 1) - P(ŷ|s = -1)|. From the mutual information perspective, the correlation between the sensitive attribute s and the prediction ŷ can be measured by I(s; ŷ). In this subsection, we provide an inherent connection between demographic parity ∆DP and mutual information I(s; ŷ) as follows:

Theorem 3 For binary sensitive attributes s ∈ {-1, 1} and binary predictions ŷ ∈ {-1, 1}, demographic parity ∆DP and mutual information I(s; ŷ) satisfy I(s; ŷ) ≤ 2∆DP.

Proof: For simplicity, we define the joint probability as $\alpha_{i \cup j} = P(\hat{y} = i, s = j)$ and the conditional probability as $\alpha_{i|j} = P(\hat{y} = i \mid s = j)$. Considering the log ratio between the joint distribution and the product of the marginals, we have

$$\begin{aligned}
\log \frac{P(\hat{y} = i, s = j)}{P(\hat{y} = i) P(s = j)} &= \log \frac{\alpha_{i|j}}{\sum_{j'} \alpha_{i|j'} P(s = j')} = \log\Big(1 + \frac{(\alpha_{i|j} - \alpha_{i|-j}) P(s = -j)}{\sum_{j'} \alpha_{i|j'} P(s = j')}\Big) \\
&\overset{(d)}{\le} \frac{(\alpha_{i|j} - \alpha_{i|-j}) P(s = -j)}{\sum_{j'} \alpha_{i|j'} P(s = j')} \le \Delta DP \, \frac{P(s = -j)}{\sum_{j'} \alpha_{i|j'} P(s = j')}.
\end{aligned}$$

• Comparison. We also provide the connection between the concentration property and bias enhancement in GNNs. The intuition for why a GNN enhances bias is that high sensitive homophily means nodes with the same sensitive attribute are connected with high probability. Because of the concentration property, node representations within the same sensitive group become more similar after aggregation, which leads to highly different representations across sensitive attribute groups. Notice that this behavior only happens for high sensitive homophily coefficients and shallow GNNs.
When nodes from different sensitive attribute groups are connected randomly, bias enhancement does not happen or is insignificant due to random concentration. When adopting deep GNNs, all node representations converge and there is no bias enhancement. Unfortunately, due to the concentration property, shallow GNNs are mainly used in practice. As for a high sensitive homophily coefficient, this condition is usually satisfied in practice due to the natural properties of graph data.

Additionally, there are several differences between our proposed optimization scheme and the work of (Carriere et al., 2021), including definition, dependence, and optimization:
• Definition. Persistent homology is a method for calculating the importance of topological features in a simplicial complex. For example, given a set of points in a point cloud corresponding to a chair, the task is to detect the object from the points. In this case, there is no connection between any pair of points. Persistent homology is a tool to identify topological features (or connection patterns) by gradually building up the connections between points. Graphs, however, are 1-simplices with explicit connection patterns defined by the set of edges, so many properties of persistent homology degenerate to graph theory. For example, applying persistent homology to a graph is equivalent to building maximum spanning trees (MSTs) using Kruskal's algorithm (Kleinberg & Tardos, 2006), which is irrelevant to our proposed optimization scheme.
• Dependence. Persistent homology is generally related to sample features, as shown in the point cloud optimization example of (Carriere et al., 2021). In other words, persistent homology represents the topological features of all samples. In contrast, the graph data we focus on contains ordinary node attributes, sensitive attributes, and an adjacency matrix (graph topology). By definition, the sensitive homophily coefficient is related to the sensitive attributes and the adjacency matrix. However, the optimized adjacency matrix is generally dependent on sensitive attributes.
• Optimization. The main challenge for persistent homology-based optimization is that the objective is generally non-differentiable except in some special cases. (Carriere et al., 2021) develops a general framework to study the differentiability of the persistence of parametrized families of filtrations. In this way, under mild assumptions, stochastic subgradient descent algorithms can be applied to such functions and converge almost surely to a critical point. For our problem (2), the loss is differentiable with respect to the topology in general. The challenge lies in the constraint on the element values. In our solution, we use a gradient-based method to update the adjacency matrix via variable relaxation and then apply a projection operation to satisfy this constraint.
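The relax-then-project idea described in the last bullet can be illustrated with a small projected gradient loop. This is only a sketch under stated assumptions: the surrogate loss below (a squared distance to the original adjacency plus a linear penalty on same-group edge weight) is a stand-in chosen for illustration and is not the objective of problem (2); the element constraint is assumed to be the interval [0, 1]; and all names (`refine_adjacency`, `penalty`, `lr`) are hypothetical.

```python
def project_to_unit_interval(value):
    """Projection step: clip a relaxed edge weight back into [0, 1]."""
    return min(1.0, max(0.0, value))


def refine_adjacency(adj, sens, penalty=0.5, lr=0.1, steps=200):
    """Projected gradient descent on a relaxed adjacency matrix.

    Assumed surrogate loss (for illustration only):
        sum_ij (A_hat[i][j] - A[i][j])^2
        + penalty * sum over i != j with sens[i] == sens[j] of A_hat[i][j]
    """
    n = len(adj)
    relaxed = [[float(v) for v in row] for row in adj]
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                same_group = 1.0 if (i != j and sens[i] == sens[j]) else 0.0
                grad = 2.0 * (relaxed[i][j] - adj[i][j]) + penalty * same_group
                relaxed[i][j] = project_to_unit_interval(relaxed[i][j] - lr * grad)
    return relaxed


# Toy graph: nodes 0, 1 share one sensitive value and nodes 2, 3 the other.
sens = [1, 1, -1, -1]
adj = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
refined = refine_adjacency(adj, sens)
print(refined[0][1], refined[0][2], refined[2][3])
```

Under this surrogate, same-group edges (0, 1) and (2, 3) shrink toward 0.75, the cross-group edge (0, 2) keeps weight 1, and absent edges stay at 0 because the projection clips negative updates. A real refinement objective would also need a step to convert the relaxed weights back into a discrete graph.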

D RELATED WORKS

Graph neural networks. GNNs achieve state-of-the-art performance in various real-world applications. GNN backbones fall into two categories: spectral-based and spatial-based GNNs. Spectral-based GNNs provide graph convolution operations together with feature transformation (Bruna et al., 2013; Defferrard et al., 2016; Henaff et al., 2015). Many spatial-based GNNs have also been proposed to aggregate neighbor information, including graph attention network (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), SGC (Wu et al., 2019), and APPNP (Klicpera et al., 2019), among others (Gao et al., 2018; Monti et al., 2017).

Fairness-aware learning on graphs. Fairness in machine learning has attracted many research efforts to mitigate prediction bias (Chuang & Mroueh, 2020; Zhang et al., 2018; Du et al., 2021; Yurochkin & Sun, 2020; Jiang et al., 2022; Creager et al., 2019; Feldman et al., 2015). Fair walk (Rahman et al., 2019) is a fair version of random walk that learns fair node representations by revising neighborhood sampling. From the bias mitigation perspective, adversarial debiasing and contrastive learning have also been developed for graph data. For example, the works (Dai & Wang, 2021; Bose & Hamilton, 2019; Fisher et al., 2020) adopt an adversary to predict the sensitive attribute given the node representation. Fairness-aware representation learning has also been developed via node feature masking and graph topology rewiring (Agarwal et al., 2021; Köse & Shen, 2021) for node classification or link prediction tasks (Laclau et al., 2021; Li et al., 2021). However, an explanation of why GNNs show higher prediction bias than MLP is still missing. In this work, we theoretically and experimentally reveal that many GNN aggregation schemes boost node representation bias under topology bias. Furthermore, we develop a simple yet effective graph refinement method, named FairGR, to achieve fair prediction.

E DATASET STATISTICS

The statistics of the three real-world datasets, Pokec-n, Pokec-z, and NBA, are provided in Table 2. The sensitive homophily coefficient is even higher than the label homophily coefficient on all three datasets, which validates that real-world datasets usually exhibit large topology bias.
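For intuition about the two homophily statistics compared above, the sketch below computes an edge-level homophily ratio: the fraction of edges whose endpoints share the same value of a node attribute. This definition is an assumption made for illustration (the formal definition of the sensitive homophily coefficient appears in the main text), and the toy graph is made up.

```python
def edge_homophily(edges, attribute):
    """Fraction of edges whose two endpoints share the same attribute value."""
    if not edges:
        raise ValueError("the graph has no edges")
    same = sum(1 for u, v in edges if attribute[u] == attribute[v])
    return same / len(edges)


# Toy undirected graph with six nodes; values are illustrative only.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
sens = [1, 1, 1, -1, -1, -1]
labels = [1, -1, 1, -1, 1, -1]

sens_homophily = edge_homophily(edges, sens)
label_homophily = edge_homophily(edges, labels)
print(sens_homophily, label_homophily)
```

In this toy graph five of the six edges connect nodes with the same sensitive attribute, while only one connects nodes with the same label, mirroring the pattern of sensitive homophily exceeding label homophily.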

G.1 MORE SYNTHETIC EXPERIMENTAL RESULTS

In this subsection, we provide more experimental results with a different covariance matrix. Although our theory is derived only for the case of the same covariance matrix, we still observe similar results in this setting. For node attribute generation, we generate node attributes from the Gaussian distributions N(µ_1, Σ) and N(µ_2, Σ) for nodes with each value of the binary sensitive attribute, respectively, where µ_1 = [0, 1], µ_2 = [1, 0], and Σ = [[1, 0], [0, 2]]. We adopt the same evaluation metric and adjacency matrix generation scheme as in Section 6.1. Figure 4 shows the DP difference during message passing with respect to the sensitive homophily coefficient for different initial covariance matrices. We observe that a higher sensitive homophily coefficient generally leads to larger bias enhancement. Additionally, higher graph connection density ρ_d, larger node number n, and a balanced sensitive attribute ratio c correspond to higher bias enhancement, which is consistent with our theoretical analysis in Theorem 2.
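A scaled-down version of this synthetic setup can be simulated with the standard library. The sketch below is a rough illustration, not the experimental protocol of Section 6.1: it uses a much smaller graph (n = 400, ρ_d = 0.05) so that it runs quickly, row-normalized mean aggregation with a self-loop as a stand-in for GCN-like message passing, and a simple group separation score (squared mean gap divided by pooled within-group variance, summed over coordinates) as a proxy for the DP difference. All function names are hypothetical.

```python
import math
import random


def simulate(n=400, rho_d=0.05, eps_sens=0.9, c=0.5, seed=0):
    """Sample a two-group random graph with Gaussian features and aggregate once."""
    rng = random.Random(seed)
    n_pos = int(n * c)
    sens = [1] * n_pos + [-1] * (n - n_pos)

    # Intra- and inter-group connection probabilities (Lemma 3).
    p_conn = rho_d * eps_sens / (c ** 2 + (1 - c) ** 2)
    q_conn = rho_d * (1 - eps_sens) / (2 * c * (1 - c))

    # Features: mean [0, 1] for s = -1 and [1, 0] for s = 1, covariance diag(1, 2).
    mean = {-1: (0.0, 1.0), 1: (1.0, 0.0)}
    std = (1.0, math.sqrt(2.0))
    feats = [[rng.gauss(mean[s][k], std[k]) for k in range(2)] for s in sens]

    # Each node starts with a self-loop; edges are sampled independently.
    neighbors = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_conn if sens[i] == sens[j] else q_conn
            if rng.random() < prob:
                neighbors[i].append(j)
                neighbors[j].append(i)

    # One round of mean aggregation over the neighborhood (including self).
    aggregated = [
        [sum(feats[j][k] for j in nb) / len(nb) for k in range(2)]
        for nb in neighbors
    ]
    return sens, feats, aggregated


def separation(features, sens):
    """Proxy for representation bias: squared mean gap over pooled variance, per coordinate."""
    score = 0.0
    for k in range(2):
        stats = {}
        for group in (-1, 1):
            vals = [f[k] for f, s in zip(features, sens) if s == group]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            stats[group] = (mu, var)
        pooled_var = 0.5 * (stats[-1][1] + stats[1][1])
        score += (stats[1][0] - stats[-1][0]) ** 2 / pooled_var
    return score


sens, feats, aggregated = simulate()
before = separation(feats, sens)
after = separation(aggregated, sens)
print(before, after)
```

With high sensitive homophily, aggregation shrinks the within-group variance much faster than it shrinks the gap between group means, so the separation score after aggregation is expected to be considerably larger than before.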

G.2 ABLATION STUDY FOR FAIRGR

To investigate the effect of hyperparameters α and β, we conduct experiments with values chosen from α = {0.0, 0.1, 0.5, 1.0, 5.0, 10.0} and β = {0.0, 0.1, 0.5, 1.0, 5.0, 10.0}, varying one while the other is kept at its default. The default values for hyperparameters α and β are 0.1 and 0.5, respectively. The results of the hyperparameter study with respect to α and β are shown in Figures 5 and 6, respectively. From these results, we make the following observations:
• Hyperparameters α and β have different influences on different GNN backbones. For example, for Pokec-n dataset, hyperparameter α only shows a negligible influence on the accuracy of GCN and GAT, while significant for GAT. As for DP, the bias metric is more sensitive to α than accuracy is.
• Hyperparameters α and β have different influences on different graph datasets. For example, GAT achieves the best accuracy and lowest DP with α = 0.1 on the NBA dataset, while achieving the lowest accuracy and highest DP on the Pokec-n dataset. This observation indicates the importance of hyperparameter tuning for different datasets.


GNNs and MLP models both converge in terms of accuracy for the high-label-homophily datasets Pokec-n and Pokec-z. For the high-sensitive-homophily datasets Pokec-n and Pokec-z, GNNs demonstrate higher prediction bias than MLP, while the prediction bias difference is relatively small for the low-sensitive-homophily dataset NBA.

G.4 FAIRGR RESULTS ON VANILLA TRAINING

For vanilla training, Figure 8 shows the test accuracy and demographic parity curves during training for different GNN backbones, with and without FairGR for graph topology rewiring. From these results, we make the following observations:
• Different GNNs demonstrate different accuracy and demographic parity performance. For example, on the Pokec-n dataset, GCN has the highest accuracy and the lowest demographic parity, which implies that message passing also matters even for the same graph topology.
• Our proposed FairGR consistently achieves lower demographic parity and comparable accuracy on all datasets and backbones.


G.5 TRADEOFF PERFORMANCE ON FAIR MIXUP

We also demonstrate that FairGR can achieve better tradeoff performance for different GNN backbones with Fair mixup (Chuang & Mroueh, 2021) in Figure 9 . Specifically, since input fair mixup requires calculating model prediction for mixed input batch, it is non-trivial to adopt input fair mixup in our experiments on the node classification task. This is because, for GNN aggregations of neighborhoods' information, the neighborhood information for the mixed input batch is missing. Instead, we adopt manifold fair mixup for the logit layer in our experiments. Experimental results show that FairGR can achieve better accuracy-fairness tradeoff performance for many GNNs backbones on three datasets.

H FUTURE WORK

There are two lines of follow-up research. First, the theoretical analysis of why aggregation enhances bias in GNNs can be generalized. As a pilot study, we theoretically investigate why this phenomenon happens for GCN-like aggregation under a random graph topology generated by the stochastic block model and a Gaussian mixture feature distribution. The analysis can be extended to other aggregation operations, random graph models, and feature distributions. The second line focuses on the algorithmic side of graph topology rewiring, including improving the efficiency of FairGR and its effectiveness by designing different objectives for graph topology rewiring.



We leave the analysis of other random graph models or feature distributions to future work.

Note that we only conduct a theoretical study of GCN-like message passing as a pilot study. The investigation of other GNN aggregation operations (such as GraphSAGE-like operations) and GNN models may require different techniques and is left for future work.

https://www.kaggle.com/noahgift/social-power-nba



Figure 2: The difference of demographic parity for message passing. Left: DP difference for different graph connection density ρ_d with sensitive attribute ratio c = 0.5 and number of nodes n = 10^4; Middle: DP difference for different number of nodes n with sensitive attribute ratio c = 0.5 and graph connection density ρ_d = 10^{-3}; Right: DP difference for different sensitive attribute ratio c with graph connection density ρ_d = 10^{-3} and number of nodes n = 10^4.

Figure 3: DP and Acc trade-off performance on three real-world datasets compared with adding regularization (Top) and adversarial debiasing (Bottom). The trade-off curve closer to the right bottom corner represents better trade-off performance.

Figure 4: The difference of demographic parity for message passing with different initial covariance matrices. Left: DP difference for different graph connection density ρ_d with sensitive attribute ratio c = 0.5 and number of nodes n = 10^4; Middle: DP difference for different number of nodes n with sensitive attribute ratio c = 0.5 and graph connection density ρ_d = 10^{-3}; Right: DP difference for different sensitive attribute ratio c with graph connection density ρ_d = 10^{-3} and number of nodes n = 10^4.

Figure 6: Ablation study on hyperparameter β in terms of DP and Acc across different GNN backbones on three real-world datasets.

Figure 7: Accuracy (top) and DP (bottom) training curve w.r.t. epochs for different backbones, including GAT, GCN, SGC, and MLP, on three real-world datasets.

Figure 8: Accuracy (top) and DP (bottom) training curve w.r.t. epochs for different backbones, including GAT, GCN, SGC, and MLP, on three real-world datasets.

Table 1: The performance on node classification (GR represents graph topology rewire). Each group of three columns (Acc, ∆DP, ∆EO) corresponds to one dataset.

Model  | Acc (%) ↑    | ∆DP (%) ↓   | ∆EO (%) ↓   | Acc (%) ↑    | ∆DP (%) ↓   | ∆EO (%) ↓   | Acc (%) ↑     | ∆DP (%) ↓     | ∆EO (%) ↓
MLP    | 70.48 ± 0.77 | 1.61 ± 1.29 | 2.22 ± 1.01 | 72.48 ± 0.26 | 1.53 ± 0.89 | 3.39 ± 2.37 | 65.56 ± 1.62  | 22.37 ± 1.87  | 18.00 ± 3.52
GAT    | 69.76 ± 1.30 | 2.39 ± 0.62 | 2.91 ± 0.97 | 71.00 ± 0.48 | 3.71 ± 2.15 | 7.50 ± 2.88 | 57.78 ± 10.65 | 20.12 ± 16.18 | 13.00 ± 13.37
GAT-GR | 56.75 ± 6.32 | 1.04 ± 0.80 | 1.14 ± 1.02 | 61.27 ± 9.34 | 0.54 ± 0.51 | 2.27 ± 1.55 | 53.65 ± 10.31 | 4.16 ± 5.13   | 3.67 ± 3.23
GCN    | 71.78 ± 0.37 | 3.25 ± 2.35 | 2.36 ± 2.09 | 73.09 ± 0.28 | 3.48 ± 0.47 | 5.16 ± 1.38 | 61.90 ± 1.00  | 23.70 ± 2.74  | 17.50 ± 2.63
GCN-GR | 71.68 ± 0.58 | 1.94 ± 1.59 | 1.27 ± 0.71 | 72.68 ± 0.44 | 0.47 ± 0.39 | 0.82 ± 0.78 | 61.59 ± 1.85  | 20.24 ± 4.41  | 9.50 ± 2.77
SGC    | 71.24 ± 0.46 | 4.81 ± 0.30 | 4.79 ± 2.27 | 71.46 ± 0.41 | 2.22 ± 0.29 | 3.85 ± 1.63 | 63.17 ± 0.63  | 22.56 ± 3.94  | 14.33 ± 2.16
SGC-GR | 70.95 ± 0.91 | 3.32 ± 1.31 | 3.20 ± 1.90 | 71.91 ± 0.52 | 0.71 ± 0.65 | 2.39 ± 0.69 | 62.54 ± 1.62  | 18.56 ± 2.81  | 2.50 ± 1.66

Table 2: Statistical information on datasets. Columns: Dataset | # Nodes | # Node Features | # Edges | # Training Labels | # Training Sens | Label Homop | Sens Homop

Calculate the gradient ∂L(Â|s,y,A)


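where inequality (d) holds due to $\log(1 + x) \le x$ for any $x > -1$. According to the definition of mutual information, we have

$$\begin{aligned}
I(s; \hat{y}) &= \sum_{i,j} P(\hat{y} = i, s = j) \log \frac{P(\hat{y} = i, s = j)}{P(\hat{y} = i) P(s = j)} = \sum_{i,j} \alpha_{i \cup j} \log \frac{\alpha_{i|j}}{\sum_{j'} \alpha_{i|j'} P(s = j')} \\
&\le \Delta DP \sum_{i,j} \frac{\alpha_{i \cup j} \, P(s = -j)}{\sum_{j'} \alpha_{i|j'} P(s = j')} \overset{(f)}{\le} 2 \Delta DP,
\end{aligned}$$

where the first inequality applies the bound on the log ratio, and inequality (f) holds due to $\sum_i a_i b_i \le \sum_i a_i \sum_i b_i$ for non-negative $a_i$ and $b_i$, applied for each prediction value $i$ to $a_j = \alpha_{i \cup j} / \sum_{j'} \alpha_{i|j'} P(s = j') = P(s = j \mid \hat{y} = i)$ and $b_j = P(s = -j)$, each of which sums to 1 over $j$. □

Theorem 3 shows that there is a strong connection between demographic parity and mutual information for binary sensitive attributes and binary predictions, i.e., the mutual information is upper bounded by twice the demographic parity.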
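The inequality I(s; ŷ) ≤ 2∆DP can be checked numerically on small joint distributions. The sketch below is illustrative: the joint tables are made up, mutual information is measured in nats, and ∆DP is computed on the positive prediction ŷ = 1; the function names are hypothetical.

```python
import math


def marginals(joint):
    """Marginal distributions of yhat and s from joint[(yhat, s)]."""
    p_y = {y: sum(p for (yy, _), p in joint.items() if yy == y) for y in (-1, 1)}
    p_s = {s: sum(p for (_, ss), p in joint.items() if ss == s) for s in (-1, 1)}
    return p_y, p_s


def mutual_information(joint):
    """I(s; yhat) in nats for a joint table over {-1, 1} x {-1, 1}."""
    p_y, p_s = marginals(joint)
    return sum(
        p * math.log(p / (p_y[y] * p_s[s]))
        for (y, s), p in joint.items()
        if p > 0
    )


def demographic_parity_gap(joint):
    """|P(yhat = 1 | s = 1) - P(yhat = 1 | s = -1)|."""
    _, p_s = marginals(joint)
    rate = {s: joint[(1, s)] / p_s[s] for s in (-1, 1)}
    return abs(rate[1] - rate[-1])


# Keys are (yhat, s); the probabilities are made up for illustration.
biased = {(1, 1): 0.3, (-1, 1): 0.2, (1, -1): 0.1, (-1, -1): 0.4}
independent = {(1, 1): 0.2, (-1, 1): 0.3, (1, -1): 0.2, (-1, -1): 0.3}
deterministic = {(1, 1): 0.5, (-1, 1): 0.0, (1, -1): 0.0, (-1, -1): 0.5}

for table in (biased, independent, deterministic):
    print(mutual_information(table), 2 * demographic_parity_gap(table))
```

In the independent table both quantities vanish, and in the deterministic table (ŷ = s) the mutual information equals ln 2 while 2∆DP equals 2, so the bound holds with room to spare in each case.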

C.2 PREDICTION BIAS COMPARISON BETWEEN ONE-LAYER GCN AND ONE-LAYER MLP

Considering the strong connection between mutual information and demographic parity, we compare the prediction bias of one-layer GCN and one-layer MLP through the lens of mutual information. For the one-layer MLP model, the prediction is given by $\hat{y}_{MLP} = \sigma(X W_{MLP})$, where $W_{MLP}$ is the trainable parameter of the MLP. Similarly, for the one-layer GCN, the prediction is given by $\hat{y}_{GCN} = \sigma(\tilde{A} X W_{GCN})$, where $W_{GCN}$ is the trainable parameter of the GCN. Defining $\tilde{X} = \tilde{A} X$, it is easy to see that the one-layer MLP and the one-layer GCN are almost the same model applied to different node features. Based on Theorem 2, the aggregated node features $\tilde{X}$ carry higher representation bias than $X$. In other words, the bias of the input data for the one-layer GCN is higher than that for the one-layer MLP.

For the prediction bias, note that the sensitive attribute s, node features X, and prediction ŷ form a Markov chain $s \to X \to \hat{y}$ since $P(\hat{y} \mid s, X) = P(\hat{y} \mid X)$ for a model with vanilla training. Based on the data processing inequality, it is easy to obtain

$$I(s; \hat{y}_{MLP}) = I(s; X) - I(s; X \mid \hat{y}_{MLP}), \qquad I(s; \hat{y}_{GCN}) = I(s; \tilde{X}) - I(s; \tilde{X} \mid \hat{y}_{GCN}).$$

In other words, when $I(s; X \mid \hat{y}_{MLP}) = I(s; \tilde{X} \mid \hat{y}_{GCN})$, higher input data bias leads to higher prediction bias. For one-layer MLP and one-layer GCN, if the condition in Theorem 2 is satisfied, the prediction bias of the one-layer GCN is also larger than that of the MLP.

C.3 COMPARISON WITH CONCENTRATION PROPERTY IN GNN AND PERSISTENT HOMOLOGY

The concentration property of GNNs refers to the convergence of all node representations after stacking aggregations (Nt & Maehara, 2019; Ma et al., 2022; Baranwal et al., 2021). There are differences between bias enhancement and the concentration property in GNNs:
• Definition. The concentration property means that the representations of all nodes converge after GNN aggregation. Bias enhancement means that the node representations of different sensitive groups become more distinguishable. In fact, these two properties are somewhat contrary, since perfect concentration leads to zero bias.
• Aggregation. For the concentration property, only the ordinary features and the topology are involved in the analysis. The high-level interpretation of concentration is that aggregation acts like a low-frequency filter, and this "average" effect leads to the convergence of node representations. Regarding the sensitive homophily coefficient, we clarify that sensitive attributes are not included in node feature aggregation due to legal restrictions. Even if the sensitive attribute were included in GNN aggregation and all node representations became identical, this would not constitute bias enhancement (the bias would in fact be zero). The bias in GNNs refers to highly different node representations or predictions among different sensitive groups (defined by sensitive attributes).

