FAIR GRAPH MESSAGE PASSING WITH TRANSPARENCY Anonymous

Abstract

Recent works achieve fair representations and predictions through regularization, adversarial debiasing, and contrastive learning in graph neural networks (GNNs). These methods implicitly encode sensitive attribute information into the well-trained model weights via backward propagation. In practice, we not only pursue a fair machine learning model but also need to lend such fairness perception to the public. For current fairness methods, how the sensitive attribute information is used to achieve fair prediction remains a black box. In this work, we first propose the concept of transparency to describe whether a model embraces the ability of lending fairness perception to the public. Motivated by the fact that current fairness models lack transparency, we pursue a fair machine learning model with transparency by explicitly rendering the sensitive attribute usage in forward propagation. Specifically, we develop an effective and transparent Fair Message Passing (FMP) scheme that adopts sensitive attribute information in forward propagation. In this way, FMP explicitly uncovers how sensitive attributes influence the final prediction. Additionally, the FMP scheme aggregates useful information from neighbors and mitigates bias in a unified framework, simultaneously achieving graph smoothness and fairness objectives. An acceleration approach is also adopted to improve the efficiency of FMP. Experiments on node classification tasks demonstrate that the proposed FMP outperforms state-of-the-art baselines in terms of fairness and accuracy on three real-world datasets.

1. INTRODUCTION

Graph neural networks (GNNs) (Kipf & Welling, 2017; Veličković et al., 2018; Wu et al., 2019; Han et al., 2022a;b) are widely adopted in various domains, such as social media mining (Hamilton et al., 2017), knowledge graphs (Hamaguchi et al., 2017), and recommender systems (Ying et al., 2018), due to their remarkable performance in learning representations. Graph learning, a topic of growing popularity, aims to learn node representations containing both topological and attribute information in a given graph. Despite the outstanding performance in various tasks, GNNs often inherit or even amplify societal bias from input graph data (Dai & Wang, 2021). Biased node representations largely limit the application of GNNs in many high-stake tasks, such as job hunting (Mehrabi et al., 2021) and crime rate prediction (Suresh & Guttag, 2019). Hence, bias mitigation that facilitates the research on fair GNNs is urgently needed. Many existing works achieve fair prediction on graphs via regularization (Jiang et al., 2022), adversarial debiasing (Dai & Wang, 2021), or contrastive learning (Zhu et al., 2020; 2021b; Agarwal et al., 2021; Kose & Shen, 2022). These methods adopt sensitive attribute information in training loss refinement. In this way, the sensitive attributes are implicitly encoded in the well-trained model weights through backward propagation. However, achieving a fair model alone is insufficient in practice, since the fairness should also lend perception to the public (e.g., the auditors or the maintainers of machine learning systems). In other words, the influence of sensitive attributes should be easily probed by the public. We name such a property for public probing transparency. Specifically, we provide the following formal statement on transparency in fairness:

Transparency in fairness: Onlookers can verify the released fair model with
• Transparent influence: how, and whether, the sensitive attribute information influences the fair model's prediction.
• Less is more: the required resources to obtain the influence of sensitive attributes include only the well-trained model and test data samples.

From the auditors' perspective, even if the fairness metric of a machine learning model is low, such a fair model is still untrustworthy if the auditors cannot understand how the sensitive attributes are adopted to achieve fair prediction given the well-trained model. From the maintainers' perspective, it is important to understand how the model provides fair prediction. Such understanding could help maintainers improve models and further convince auditors in terms of fairness. In summary, transparency aims to make the fairness implementation understandable, i.e., to make the process of achieving a fair model via sensitive attribute information white-box. Therefore, both maintainers and auditors benefit from model transparency. More importantly, similar to the intrinsic explainability of a model, the fairness-with-transparency property is binary: a prediction model either embraces fairness with transparency or it does not. Based on the formal statement on transparency in fairness, the key rule to determine whether a fair model is transparent is that the model prediction difference, with and without sensitive attribute information, can be identified given the well-trained fair model and test data samples. Unfortunately, many existing fairness works, including regularization, adversarial debiasing, and contrastive learning, do not satisfy the transparency requirements in practice. For example, fair models trained based on existing works do not offer transparent influence, because the sensitive attributes are implicitly encoded in the well-trained model weights. Therefore, it is intractable to infer how sensitive attributes influence the well-trained model weights without access to the dynamic model training process.
Additionally, for fair models obtained from existing works, the resources required to probe the influence of sensitive attributes include the training data and the training strategy, so that the influence can only be probed by detecting differences in the well-trained model weights. In a nutshell, current fair models based on loss refinement lack transparency. A natural question is thus raised: Can we find a fair prediction model with transparency? In this work, we provide a positive answer by chasing transparency and fairness in the message passing of GNNs. The key idea for achieving transparency is to explicitly adopt the sensitive attributes in message passing (forward propagation). Specifically, we design a fair and transparent message passing scheme for GNNs, called fair message passing (FMP). First, we formulate an optimization problem that integrates fairness and prediction performance objectives. Then, we solve the formulated problem via the Fenchel conjugate and gradient descent to generate fair-and-predictive representations. We also interpret the gradient descent as aggregation first and then debiasing. Finally, we integrate FMP in graph neural networks to achieve fair and accurate prediction for node classification tasks. Further, we demonstrate the superiority of FMP by examining its effectiveness and efficiency, where we adopt a property of the softmax function to accelerate the gradient calculation over primal variables. In short, the contributions can be summarized as follows:
• We consider the fairness problem from a new perspective, named transparency, i.e., the influence of sensitive attributes should be easily probed by the public. We point out that many existing fairness methods cannot achieve transparency.
• We propose FMP to achieve fairness with transparency by using sensitive attribute information in message passing. Specifically, we use gradient descent to chase graph smoothness and fairness in a unified optimization framework.
An acceleration method is proposed to reduce the gradient computational complexity, with theoretical and empirical validation.
• The effectiveness and efficiency of FMP are experimentally evaluated on three real-world datasets. The results show that, compared to the state-of-the-art, FMP exhibits a superior trade-off between prediction performance and fairness with negligible computational overhead.

2.1. NOTATIONS

We adopt bold upper-case letters (e.g., X) to denote matrices, bold lower-case letters (e.g., x) to denote vectors, and calligraphic font (e.g., X) to denote sets. Given a matrix X ∈ R^{n×d}, the i-th row and j-th column are denoted as X_i and X_{•,j}, and the element in the i-th row and j-th column is X_{i,j}. The Frobenius norm and l_1 norm of a matrix X are ||X||_F = sqrt(Σ_{i,j} X_{i,j}^2) and ||X||_1 = Σ_{i,j} |X_{i,j}|, respectively. Given two matrices X, Y ∈ R^{n×d}, the inner product is defined as ⟨X, Y⟩ = tr(X^⊤ Y), where tr(•) is the trace of a square matrix. SF(X) represents the softmax function with a default normalized column dimension. Let G = {V, E} be a graph with node set V = {v_1, ..., v_n} and undirected edge set E = {e_1, ..., e_m}, where n and m represent the number of nodes and edges, respectively. The graph structure can be represented as an adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1 if there exists an edge between nodes v_i and v_j. N(i) denotes the neighbors of node v_i, and Ñ(i) = N(i) ∪ {v_i} denotes the self-inclusive neighbors. Suppose each node is associated with a d-dimensional feature vector and a (binary) sensitive attribute; the features for all nodes and the sensitive attributes are denoted as X_ori ∈ R^{n×d} and s ∈ {-1, 1}^n. Define the sensitive attribute incident vector as ∆_s = 1_{>0}(s)/||1_{>0}(s)||_1 - 1_{>0}(-s)/||1_{>0}(-s)||_1, which normalizes each sensitive attribute group, where 1_{>0}(•) is an element-wise indicator function.

2.2. GNNS AS GRAPH SIGNAL DENOISING

A GNN model is usually composed of several stacked GNN layers. Given a graph G with n nodes, a GNN layer typically contains a feature transformation X_trans = f_trans(X_ori) and an aggregation X_agg = f_agg(X_trans | G), where X_ori ∈ R^{n×d_in} is the input feature and X_trans, X_agg ∈ R^{n×d_out} are the transformed and aggregated features. The feature transformation operation transforms the node feature dimension, and feature aggregation updates node features based on neighbors' features and the graph topology. Recent works (Ma et al., 2021b; Zhu et al., 2021a) have established connections between the feature aggregation operations in many representative GNNs and a graph signal denoising problem with Laplacian regularization. Here, we only introduce GCN/SGC as an example to show the connection from the perspective of graph signal denoising; more discussion is elaborated in Appendix G. Feature aggregation in Graph Convolutional Network (GCN) or Simplifying Graph Convolutional Network (SGC) is given by X_agg = ÃX_trans, where Ã = D^{-1/2} Â D^{-1/2} is the normalized self-loop adjacency matrix, Â = A + I, and D is the degree matrix of Â. Recent works (Ma et al., 2021b; Zhu et al., 2021a) provably demonstrate that such feature aggregation is equivalent to one-step gradient descent for minimizing tr(F^⊤(I - Ã)F) with initialization F = X_trans.
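To make this equivalence concrete, the following minimal NumPy sketch (the toy graph, random features, and step size 1/2 are illustrative, not from the paper) checks that one gradient-descent step on tr(F^⊤(I - Ã)F), initialized at F = X_trans, reproduces the GCN/SGC aggregation ÃX_trans:

```python
import numpy as np

# Toy undirected graph on 4 nodes (illustrative only)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                          # adjacency with self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
A_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized self-loop adjacency
L = np.eye(4) - A_tilde                        # normalized Laplacian

X_trans = np.random.RandomState(0).randn(4, 2)

# GCN/SGC feature aggregation
X_agg = A_tilde @ X_trans

# One gradient step of size 1/2 on tr(F^T (I - A_tilde) F) at F = X_trans:
# the gradient is 2 L F, so F - (1/2) * 2 L F = (I - L) F = A_tilde F
X_gd = X_trans - 0.5 * (2 * L @ X_trans)

assert np.allclose(X_agg, X_gd)
```

The assertion holds exactly, since (I - L) = Ã by construction.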

3. FAIR MESSAGE PASSING

In this section, we propose a new message passing scheme that aggregates useful information from neighbors while debiasing the representations. Specifically, we formulate fair message passing as an optimization problem that pursues smooth and fair node representations simultaneously. Together with an effective and efficient optimization algorithm, we derive the closed-form fair message passing. Finally, the proposed FMP is integrated into fair GNNs in three stages, including transformation, aggregation, and debiasing steps, as shown in Figure 1.

3.1. THE OPTIMIZATION FRAMEWORK

To achieve the graph smoothness prior and fairness in the same process, a reasonable message passing scheme should be a good solution to the following optimization problem:

min_F  h_s(F) + h_f(∆_s SF(F)),  where
h_s(F) = (λ_s/2) tr(F^⊤ L F) + (1/2) ||F - X_trans||_F^2,
h_f(∆_s SF(F)) = λ_f ||∆_s SF(F)||_1,   (1)

where L represents the normalized Laplacian matrix, h_s(•) and h_f(•) denote the smoothness and fairness objectives, respectively, X_trans ∈ R^{n×d_out} are the transformed d_out-dimensional node features, and F ∈ R^{n×d_out} are the aggregated node features of the same matrix size. The first two terms preserve the similarity of connected node representations and thus enforce graph smoothness. The last term enforces fair node representations so that the average predicted probabilities of the groups with different sensitive attributes remain close. The regularization coefficients λ_s and λ_f adaptively control the trade-off between graph smoothness and fairness.

Smoothness objective h_s(•). The adjacency matrix in existing graph message passing schemes is normalized to improve numerical stability and achieve superior performance. Similarly, the graph smoothness term adopts the normalized Laplacian matrix, i.e., L = I - Ã, Ã = D^{-1/2} Â D^{-1/2}, and Â = A + I. From an edge-centric view, the smoothness objective enforces connected node representations to be similar, since tr(F^⊤ L F) = Σ_{(v_i,v_j)∈E} || F_i/sqrt(d_i+1) - F_j/sqrt(d_j+1) ||^2, where d_i = Σ_k A_{ik} represents the degree of node v_i.

Fairness objective h_f(•). The fairness objective measures the bias of node representations after aggregation. Recall that the sensitive attribute incident vector ∆_s = 1_{>0}(s)/||1_{>0}(s)||_1 - 1_{>0}(-s)/||1_{>0}(-s)||_1 indicates the sensitive attribute group and group size via the sign and the absolute value summation, and that SF(F) represents the predicted probabilities for the node classification task, where SF(F)_{ij} = P(y_i = j | X).
Furthermore, we can show that our fairness objective is actually equivalent to demographic parity, i.e., ∆_s SF(F)_j = P(y_i = j | s_i = 1, X) - P(y_i = j | s_i = -1, X); see the proof in Appendix B. In other words, our fairness objective, the l_1 norm of ∆_s SF(F), characterizes the predicted probability difference between the two groups with different sensitive attributes. Therefore, our proposed optimization framework can pursue graph smoothness and fairness simultaneously.
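The equivalence between the fairness term and the per-class demographic parity gap can be checked numerically. Below is a minimal NumPy sketch (the toy features and sensitive attributes are illustrative, not from the paper):

```python
import numpy as np

def softmax(F):
    # Row-wise softmax: SF(F)_ij = P(y_i = j | X)
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
n, d_out = 6, 3
F = rng.randn(n, d_out)
s = np.array([1, 1, -1, -1, -1, 1])          # toy binary sensitive attribute

# Sensitive attribute incident vector Delta_s (Section 2.1)
pos, neg = (s == 1).astype(float), (s == -1).astype(float)
delta_s = pos / pos.sum() - neg / neg.sum()  # shape (n,)

P = softmax(F)                               # predicted probabilities SF(F)
p = delta_s @ P                              # Delta_s SF(F), shape (d_out,)

# Equals the per-class demographic parity gap (Appendix B)
gap = P[s == 1].mean(axis=0) - P[s == -1].mean(axis=0)
assert np.allclose(p, gap)
```

Taking the l_1 norm of `p` recovers the fairness objective h_f up to the coefficient λ_f.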

3.2. ALGORITHM FOR FAIR MESSAGE PASSING

For the smoothness objective, many existing popular message passing schemes can be derived based on gradient descent with an appropriate step size choice (Ma et al., 2021b; Zhu et al., 2021a). However, directly computing the gradient of the fairness term yields a complicated closed form, since the gradient of the l_1 norm involves the sign of the elements in the vector.

3.2.1. BI-LEVEL OPTIMIZATION PROBLEM FORMULATION.

To solve this optimization problem more effectively and efficiently, the Fenchel conjugate (Rockafellar, 2015) (a.k.a. convex conjugate) is introduced as the key tool to transform the original problem into a bi-level optimization problem. For a general convex function h(•), its conjugate function is defined as h*(U) ≜ sup_X ⟨U, X⟩ - h(X). Based on the Fenchel conjugate, the fairness objective can be transformed into the variational representation h_f(p) = sup_u ⟨p, u⟩ - h*_f(u), where p = ∆_s SF(F) ∈ R^{1×d_out} is the predicted probability difference vector for classification. Furthermore, the original optimization problem is equivalent to

min_F max_u  h_s(F) + ⟨p, u⟩ - h*_f(u),   (2)

where u ∈ R^{1×d_out} and h*_f(•) is the conjugate function of the fairness objective h_f(•).

3.2.2. PROBLEM SOLUTION

Motivated by the Proximal Alternating Predictor-Corrector (PAPC) (Loris & Verhoeven, 2011; Chen et al., 2013), the min-max optimization problem (2) can be solved via the following fixed-point equations:

F = F - ∇h_s(F) - ∂⟨p,u⟩/∂F,
u = prox_{h*_f}(u + ∆_s SF(F)),   (3)

where prox_{h*_f}(u) = argmin_y (1/2)||y - u||^2 + h*_f(y). Similar to the "predictor-corrector" algorithm (Loris & Verhoeven, 2011), we adopt an iterative algorithm to find the saddle point of the min-max optimization problem. Specifically, starting from (F_k, u_k), we adopt a gradient descent step on the primal variable F to arrive at (F̃_{k+1}, u_k), followed by a proximal ascent step in the dual variable u. Finally, a gradient descent step on the primal variable at the point (F̃_{k+1}, u_{k+1}) arrives at (F_{k+1}, u_{k+1}). In short, the iteration can be summarized as

F̃_{k+1} = F_k - γ∇h_s(F_k) - γ [∂⟨p,u_k⟩/∂F]_{F_k},
u_{k+1} = prox_{βh*_f}(u_k + β∆_s SF(F̃_{k+1})),
F_{k+1} = F_k - γ∇h_s(F_k) - γ [∂⟨p,u_{k+1}⟩/∂F]_{F_k},   (4)

where γ and β are the step sizes for the primal and dual variables. Since the closed forms of ∂⟨p,u⟩/∂F ∈ R^{n×d_out} and prox_{βh*_f}(•) are still not clear, we provide the solutions one by one.

Proximal operator. As for the proximal operator, we provide the closed form in the following proposition:

Proposition 1 (Proximal Operator) The proximal operator prox_{βh*_f}(u) satisfies prox_{βh*_f}(u)_j = sign(u_j) min(|u_j|, λ_f), where sign(•) is the element-wise sign function and λ_f is the hyperparameter for the fairness objective. In other words, the proximal operator is an element-wise projection onto the l_∞ ball with radius λ_f.

FMP scheme. Similar to the works (Ma et al., 2021b; Liu et al., 2021), choosing γ = 1/(1+λ_s) and β = 1/(2γ), we have F_k - γ∇h_s(F_k) = ((1-γ)I - γλ_s L)F_k + γX_trans = γX_trans + (1-γ)ÃF_k. Therefore, we can summarize the proposed FMP in two phases: propagation with skip connection (Step ❶) and bias mitigation (Steps ❷-❺).
For bias mitigation, Step ❷ updates the aggregated node features for the fairness objective; Steps ❸ and ❹ learn and "reshape" the perturbation vector in probability space, respectively; Step ❺ explicitly mitigates the bias of the node features via gradient descent on the primal variable. The mathematical formulation is given as follows:

X^{k+1}_agg = γX_trans + (1-γ)ÃF_k,   (Step ❶)
F̃_{k+1} = X^{k+1}_agg - γ [∂⟨p,u_k⟩/∂F]_{F_k},   (Step ❷)
ū_{k+1} = u_k + β∆_s SF(F̃_{k+1}),   (Step ❸)
u_{k+1} = sign(ū_{k+1}) ⊙ min(|ū_{k+1}|, λ_f),   (Step ❹)
F_{k+1} = X^{k+1}_agg - γ [∂⟨p,u_{k+1}⟩/∂F]_{F_k},   (Step ❺)

where X^{k+1}_agg represents the node features after normal aggregation and a skip connection with the transformed input X_trans.
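The five steps above can be sketched as a forward pass. The following is a hypothetical NumPy implementation (the function names, toy setup, and iteration count K are ours, not the authors' code), using the gradient formula stated in Theorem 1:

```python
import numpy as np

def softmax(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_pu(F, u, delta_s):
    # Gradient of <p, u> w.r.t. F (Theorem 1):
    # U_s * SF(F) - Sum_1(U_s * SF(F)) * SF(F), with U_s = delta_s^T u
    P = softmax(F)
    G = np.outer(delta_s, u) * P
    return G - G.sum(axis=1, keepdims=True) * P

def fmp_layer(X_trans, A_tilde, s, lam_s=1.0, lam_f=1.0, K=10):
    """Hypothetical sketch of the FMP forward pass (Steps 1-5)."""
    pos, neg = (s == 1).astype(float), (s == -1).astype(float)
    delta_s = pos / pos.sum() - neg / neg.sum()   # sensitive incident vector
    gamma = 1.0 / (1.0 + lam_s)                   # primal step size
    beta = 1.0 / (2.0 * gamma)                    # dual step size
    F = X_trans.copy()
    u = np.zeros(X_trans.shape[1])                # dual (perturbation direction)
    for _ in range(K):
        X_agg = gamma * X_trans + (1 - gamma) * (A_tilde @ F)   # Step 1
        F_half = X_agg - gamma * grad_pu(F, u, delta_s)         # Step 2
        u_bar = u + beta * (delta_s @ softmax(F_half))          # Step 3
        u = np.sign(u_bar) * np.minimum(np.abs(u_bar), lam_f)   # Step 4
        F = X_agg - gamma * grad_pu(F, u, delta_s)              # Step 5
    return F
```

With λ_f = 0 the debiasing terms vanish and the layer reduces to plain propagation with a skip connection, which is a convenient sanity check.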

3.2.3. GRADIENT COMPUTATION ACCELERATION

A property of the softmax function is adopted to accelerate the gradient computation. Since p = ∆_s SF(F) and SF(•) represents softmax over the column dimension, directly computing the gradient ∂⟨p,u⟩/∂F via the chain rule involves the three-dimensional tensor ∂p/∂F, which incurs gigantic computational complexity. Instead, we simplify the gradient computation based on the property of the softmax function in the following theorem.

(Figure 1: the FMP pipeline, consisting of MLP transformation of the input, propagation, and debiasing, where the two sensitive groups receive opposite (male/female) perturbations in probability space followed by a Jacobian transformation.)

Theorem 1 (Gradient Computation) The gradient over the primal variable satisfies

∂⟨p,u⟩/∂F = U_s ⊙ SF(F) - Sum_1(U_s ⊙ SF(F)) ⊙ SF(F),

where U_s ≜ ∆_s^⊤ u, ⊙ represents the element-wise product, and Sum_1(•) represents summation over the column dimension with preserved matrix shape.
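Theorem 1 can be validated against a central finite-difference approximation of ∂⟨p,u⟩/∂F. A minimal NumPy sketch (the toy data is illustrative, not from the paper):

```python
import numpy as np

def softmax(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
n, d_out = 5, 3
F = rng.randn(n, d_out)
u = rng.randn(d_out)
s = np.array([1, -1, 1, -1, -1])
pos, neg = (s == 1).astype(float), (s == -1).astype(float)
delta_s = pos / pos.sum() - neg / neg.sum()

def objective(F):
    return float((delta_s @ softmax(F)) @ u)   # <p, u>

# Closed form from Theorem 1
P = softmax(F)
G = np.outer(delta_s, u) * P                   # U_s * SF(F), element-wise
grad = G - G.sum(axis=1, keepdims=True) * P

# Central finite-difference check of the closed form
eps = 1e-6
num = np.zeros_like(F)
for i in range(n):
    for j in range(d_out):
        Fp, Fm = F.copy(), F.copy()
        Fp[i, j] += eps
        Fm[i, j] -= eps
        num[i, j] = (objective(Fp) - objective(Fm)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```

The closed form avoids materializing the three-dimensional tensor ∂p/∂F entirely.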

3.3. DISCUSSION ON FMP

In this section, we provide an interpretation of the proposed FMP scheme and analyze its transparency, efficiency, and effectiveness. Specifically, we interpret FMP in two phases, conventional aggregation and bias mitigation, and show that the computational complexity of FMP is lower than that of backward gradient calculation.

Interpretation. Note that the gradient of the fairness objective over the node features F satisfies ∂⟨p,u⟩/∂F = [∂⟨p,u⟩/∂SF(F)][∂SF(F)/∂F] and ∂⟨p,u⟩/∂SF(F) = ∆_s^⊤ u. Such a gradient calculation can be interpreted as three steps: softmax transformation, perturbation in probability space, and debiasing in representation space. Specifically, we first map the node representations into probability space via the softmax transformation. Subsequently, we calculate the gradient of the fairness objective in probability space. It can be seen that the perturbation ∆_s^⊤ u actually poses a low-rank debiasing in probability space, where nodes with different sensitive attributes embrace opposite perturbations. In other words, the dual variable u represents the perturbation direction in probability space. Finally, the perturbation in probability space is transformed back into representation space via the Jacobian transformation ∂SF(F)/∂F.

Transparency. The proposed FMP explicitly uses the sensitive attribute information in Steps ❷-❺ during forward propagation. In other words, to identify the influence of the sensitive attributes in FMP, it is sufficient to check the difference between the input and output of the debiasing step. It is worth mentioning that the information required for identifying the influence of sensitive attributes comes naturally from forward propagation. In contrast, for fair models from existing works (e.g., adding regularization or adversarial debiasing), since the sensitive attribute information is implicitly encoded in the well-trained model weights, a sensitive attribute perturbation inevitably leads to variability of the well-trained model weights.
Therefore, retraining the model is required to probe the influence of a sensitive attribute perturbation. The key drawback of these methods is that they encode the sensitive attribute information into well-trained model weights; from the auditors' perspective, it is quite hard to identify the influence of the sensitive attributes given only the well-trained fair model.

Efficiency. The gradient over the primal variable can be computed in O(n d_out) time (Theorem 1); in other words, thanks to the softmax property, we achieve an efficient fair message passing scheme.

Effectiveness. The proposed FMP explicitly achieves the graph smoothness prior and fairness via alternating gradient descent. In other words, propagation and debiasing proceed forward in a white-box manner, and there is no trainable weight during the forward phase. The effectiveness of the proposed FMP is also validated by experiments on three real-world datasets.

4. EXPERIMENTS

In this section, we conduct experiments to validate the effectiveness and efficiency of the proposed FMP. We first validate that graph data with large sensitive homophily enhances bias in GNNs via synthetic experiments. Moreover, for the experiments on real-world datasets, we introduce the experimental settings and then evaluate our proposed FMP against several baselines in terms of prediction performance and fairness metrics.

4.1. EXPERIMENTAL SETTINGS

Datasets. We conduct experiments on the real-world datasets Pokec-z, Pokec-n, and NBA (Dai & Wang, 2021). Pokec-z and Pokec-n are sampled, based on province information, from a larger Facebook-like social network Pokec (Takac & Zabovsky, 2012) in Slovakia, where region information is treated as the sensitive attribute and the predicted label is the working field of the users. The NBA dataset is extended from a Kaggle dataset consisting of around 400 NBA basketball players. The information on players includes age, nationality, and salary in the 2016-2017 season. The players' link relationships come from Twitter via the official crawling API. The binary nationality (U.S. vs. overseas player) is adopted as the sensitive attribute, and the prediction label is whether the salary is higher than the median.

Evaluation metrics. We adopt accuracy to evaluate the performance of the node classification task. As for fairness, we adopt two quantitative group fairness metrics to measure prediction bias. Following (Louizos et al., 2015; Beutel et al., 2017), demographic parity and equal opportunity are measured as ∆_DP = |P(ŷ = 1 | s = 1) - P(ŷ = 1 | s = -1)| and ∆_EO = |P(ŷ = 1 | s = -1, y = 1) - P(ŷ = 1 | s = 1, y = 1)|, where y and ŷ represent the ground-truth label and the predicted label, respectively.

Baselines. We compare our proposed FMP with representative GNNs, including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), SGC (Wu et al., 2019), APPNP (Klicpera et al., 2019), and MLP. For all models, we train 2-layer neural networks with 64 hidden units for 300 epochs. Additionally, we compare against adversarial debiasing and demographic regularization methods to show the effectiveness of the proposed method.

Implementation details. We run each experiment 5 times and report the average performance for each method. We adopt the Adam optimizer with a 0.001 learning rate and 10^{-5} weight decay for all models. For adversarial debiasing, we train the classifier and the adversary for 70 and 30 epochs, respectively.
The hyperparameter for adversary loss is tuned in {0.0, 1.0, 2.0, 5.0, 8.0, 10.0, 20.0, 30.0}. For adding regularization, we adopt the hyperparameter set {0.0, 1.0, 2.0, 5.0, 8.0, 10.0, 20.0, 50.0, 80.0, 100.0}.
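For reference, the two fairness metrics can be computed as follows (a minimal NumPy sketch with an illustrative toy example; the function names are ours):

```python
import numpy as np

def demographic_parity(y_hat, s):
    """|P(y_hat=1 | s=1) - P(y_hat=1 | s=-1)| over binary predictions."""
    return abs(y_hat[s == 1].mean() - y_hat[s == -1].mean())

def equal_opportunity(y_hat, y, s):
    """|P(y_hat=1 | s=1, y=1) - P(y_hat=1 | s=-1, y=1)|."""
    m1, m2 = (s == 1) & (y == 1), (s == -1) & (y == 1)
    return abs(y_hat[m1].mean() - y_hat[m2].mean())

# Toy predictions, labels, and sensitive attributes (illustrative)
y_hat = np.array([1, 0, 1, 1, 0, 1])
y     = np.array([1, 1, 1, 0, 1, 1])
s     = np.array([1, 1, -1, -1, -1, 1])

dp = demographic_parity(y_hat, s)
eo = equal_opportunity(y_hat, y, s)
```

Lower values of both metrics indicate a fairer classifier; zero means the positive-prediction rates (conditioned on y = 1 for equal opportunity) are identical across the two sensitive groups.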

4.2. EXPERIMENTAL RESULTS

Comparison with existing GNNs. The accuracy, demographic parity, and equal opportunity metrics of the proposed FMP on the Pokec-z, Pokec-n, and NBA datasets are shown in Table 1, compared with MLP, GAT, GCN, SGC, and APPNP. The detailed statistical information for these three datasets is shown in Table 3. From these results, we make the following observations:
• Many existing GNNs underperform the MLP model on all three datasets in terms of fairness metrics. For instance, the demographic parity of MLP is lower than that of GAT, GCN, SGC, and APPNP by 32.64%, 50.46%, 66.53%, and 58.72% on the Pokec-z dataset. The higher prediction bias comes from aggregation among nodes sharing the same sensitive attribute and the topology bias in the graph data.
• Our proposed FMP consistently achieves the lowest prediction bias in terms of demographic parity and equal opportunity on all datasets. Specifically, FMP reduces demographic parity by 49.69%, 56.86%, and 5.97% compared with the lowest bias among all baselines on the Pokec-z, Pokec-n, and NBA datasets, respectively. Meanwhile, FMP achieves the best accuracy on the NBA dataset and comparable accuracy on the Pokec-z and Pokec-n datasets. In a nutshell, the proposed FMP can effectively mitigate prediction bias while preserving prediction performance.

Comparison with adversarial debiasing and regularization. To validate the effectiveness of the proposed FMP, we also show the trade-off between prediction performance and fairness compared with fairness-boosting methods, including adversarial debiasing (Fisher et al., 2020) and adding regularization (Chuang & Mroueh, 2020). Similar to (Louppe et al., 2017), the output of the GNN is the input of the adversary, and the goal of the adversary is to predict the node's sensitive attribute. We adopt several backbones for these two methods, including MLP, GCN, GAT, and SGC. We randomly split 50%/25%/25% for the training, validation, and test sets.
Figure 2 shows the Pareto optimality curves for all methods, where the right-bottom corner point represents the ideal performance (highest accuracy and lowest prediction bias). From the results, we make the following observations:
• Our proposed FMP achieves a better DP-Acc trade-off than adversarial debiasing and adding regularization for many GNNs and MLP. This observation validates the effectiveness of the key idea of FMP: aggregation first and then debiasing. Additionally, FMP can reduce demographic parity with negligible performance cost thanks to its transparent and efficient debiasing.
• Message passing in GNNs does matter. With regularization or adversarial debiasing, different GNNs exhibit huge distinctions, which implies that an appropriate message passing scheme potentially leads to a better trade-off. Additionally, many GNNs underperform MLP on datasets with a low label homophily coefficient, such as NBA. The rationale is that aggregation may not always bring accuracy benefits when the neighbors have a low probability of sharing the same label.

5. RELATED WORKS

Graph neural networks. GNNs, which generalize neural networks to graph data, have shown great success in various real-world applications. There are two streams of GNN model design: spectral-based and spatial-based. Spectral-based GNNs define graph convolution based on spectral graph theory, which is utilized in GNN layers together with feature transformation (Bruna et al., 2013; Defferrard et al., 2016; Henaff et al., 2015). Graph convolutional networks (GCN) (Kipf & Welling, 2017) simplify spectral-based GNN models into a spatial aggregation scheme. Since then, many spatial-based GNN variants have been developed to update node representations by aggregating neighbors' information, including graph attention network (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), SGC (Wu et al., 2019), and APPNP (Klicpera et al., 2019), among others (Gao et al., 2018; Monti et al., 2017). Graph signal denoising is another perspective for understanding GNNs. Recently, several works have shown that GCN is equivalent to a first-order approximation of graph denoising with Laplacian regularization (Henaff et al., 2015; Zhao & Akoglu, 2019), and unified optimization frameworks have been provided to unify many existing message passing schemes (Ma et al., 2021b; Zhu et al., 2021a).

Fairness-aware learning on graphs. Many works have been developed to achieve fairness in the machine learning community (Jiang et al., 2022; Chuang & Mroueh, 2020; Zhang et al., 2018; Du et al., 2021; Yurochkin & Sun, 2020; Creager et al., 2019; Feldman et al., 2015). A pilot study on fair node representation learning was developed based on random walks (Rahman et al., 2019). Additionally, adversarial debiasing has been adopted to learn fair predictions or node representations so that a well-trained adversary cannot predict the sensitive attribute from the node representations or predictions (Dai & Wang, 2021; Bose & Hamilton, 2019; Fisher et al., 2020).
A Bayesian approach learns fair node representations by encoding sensitive information in the prior distribution (Buyl & De Bie, 2020). Ma et al. (2021a) develop a PAC-Bayesian analysis to connect subgroup generalization with accuracy parity. (Laclau et al., 2021; Li et al., 2021) aim to mitigate prediction bias for link prediction. Fairness-aware graph contrastive learning is proposed in (Agarwal et al., 2021; Köse & Shen, 2021). However, the aforementioned works ignore the requirement of transparency in fairness. In this work, we develop an efficient and transparent fair message passing scheme that explicitly renders the usage of sensitive attributes.

6. CONCLUSION

In this work, we consider the fairness problem from a new perspective, named transparency, i.e., the influence of sensitive attributes should be easily probed by the public. We point out that existing fairness models on graphs lack transparency because they implicitly encode sensitive attribute information in the well-trained model weights. Additionally, within a unified optimization framework, we develop an efficient, effective, and transparent FMP to learn fair node representations while preserving prediction performance. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of FMP compared with state-of-the-art baselines on node classification tasks.

A NOTATIONS

m: The number of edges
n: The number of nodes
d: The number of node feature dimensions
d_out: The number of node classes
∆_s ∈ R^{1×n}: The sensitive attribute incident vector
ϵ_label: Label homophily coefficient
ϵ_sens: Sensitive homophily coefficient
X_ori ∈ R^{n×d}: The input node attribute matrix
A ∈ R^{n×n}: The adjacency matrix
Â ∈ R^{n×n}: The adjacency matrix with self-loops
Ã ∈ R^{n×n}: The normalized adjacency matrix with self-loops
L ∈ R^{n×n}: The Laplacian matrix
X_trans ∈ R^{n×d_out}: The node features after feature transformation
F_agg ∈ R^{n×d_out}: The aggregated node features after propagation
F ∈ R^{n×d_out}: The learned node features considering graph smoothness and fairness
u ∈ R^{1×d_out}: The perturbation direction in probability space
h*(•): The Fenchel conjugate function of h(•)
||X||_F, ||X||_1: The Frobenius norm and l_1 norm of matrix X
λ_f, λ_s: Hyperparameters for the fairness and graph smoothness objectives

B PROOF ON FAIRNESS OBJECTIVE

The fairness objective can be shown to equal the average prediction probability difference as follows:

∆_s SF(F)_j = (1_{>0}(s)/||1_{>0}(s)||_1 - 1_{>0}(-s)/||1_{>0}(-s)||_1)^⊤ SF(F)_{•,j}
            = Σ_{s_i=1} P(y_i = j | X) / ||1_{>0}(s)||_1 - Σ_{s_i=-1} P(y_i = j | X) / ||1_{>0}(-s)||_1
            = P(y_i = j | s_i = 1, X) - P(y_i = j | s_i = -1, X).

C PROOF OF THEOREM 1

Before providing an in-depth analysis of the gradient computation, we first introduce the derivative property of the softmax function in the following lemma:

Lemma 1. For the softmax function with $N$-dimensional vector input, $y = \mathrm{SF}(x): \mathbb{R}^{1\times N} \rightarrow \mathbb{R}^{1\times N}$, where $y_j = \frac{e^{x_j}}{\sum_{k=1}^{N} e^{x_k}}$ for all $j \in \{1, 2, \cdots, N\}$, the derivative is the $N \times N$ Jacobian matrix defined by $[\frac{\partial y}{\partial x}]_{ij} = \frac{\partial y_i}{\partial x_j}$. The Jacobian matrix satisfies $\frac{\partial y}{\partial x} = \mathrm{diag}(y) - y^{\top} y$, where $\mathrm{diag}(y)$ denotes the $N \times N$ diagonal matrix with $y$ on its diagonal and $\top$ denotes the transpose operation for a vector or matrix.

Proof: Consider the gradient $\frac{\partial y_i}{\partial x_j}$ for arbitrary $i, j$. For $i = j$, by the quotient and chain rules of derivatives, we have
$$\frac{\partial y_i}{\partial x_i} = \frac{e^{x_i} \sum_{k=1}^{N} e^{x_k} - e^{2 x_i}}{\big(\sum_{k=1}^{N} e^{x_k}\big)^2} = \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \cdot \frac{\sum_{k=1}^{N} e^{x_k} - e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} = y_i (1 - y_i).$$
Similarly, for arbitrary $i \neq j$, the gradient is given by
$$\frac{\partial y_i}{\partial x_j} = \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \cdot \frac{-e^{x_j}}{\sum_{k=1}^{N} e^{x_k}} = -y_i y_j.$$
Combining these two cases, it is easy to verify that the Jacobian matrix satisfies $\frac{\partial y}{\partial x} = \mathrm{diag}(y) - y^{\top} y$. □

Armed with the derivative property of the softmax function, we further investigate the gradient $\frac{\partial \langle p, u \rangle}{\partial F}$, where $p = \Delta_s \mathrm{SF}(F) \in \mathbb{R}^{1 \times d_{out}}$ and $u \in \mathbb{R}^{1 \times d_{out}}$ is independent of $F \in \mathbb{R}^{n \times d_{out}}$. Since the softmax function $\mathrm{SF}(\cdot)$ is applied row-wise to the node representation matrix, the gradient satisfies $\frac{\partial \mathrm{SF}(F)_i}{\partial F_j} = \mathbf{0}_{d_{out} \times d_{out}}$ for $i \neq j$. Noting that the inner product is $\langle p, u \rangle = \sum_{k=1}^{d_{out}} p_k u_k$, it is easy to obtain the gradient $[\frac{\partial \langle p, u \rangle}{\partial F}]_{ij} = \sum_{k=1}^{d_{out}} \frac{\partial p_k}{\partial F_{ij}} u_k$. To simplify the notation, we denote $\hat{F} \triangleq \mathrm{SF}(F)$. By the chain rule of derivatives, we have
$$\frac{\partial p_k}{\partial F_{ij}} = \sum_{t=1}^{n} \frac{\partial p_k}{\partial \hat{F}_{tk}} \frac{\partial \hat{F}_{tk}}{\partial F_{ij}} = \sum_{t=1}^{n} \Delta_{s,t} \frac{\partial \hat{F}_{tk}}{\partial F_{ij}} \overset{(a)}{=} \Delta_{s,i} \frac{\partial \hat{F}_{ik}}{\partial F_{ij}} \overset{(b)}{=} \Delta_{s,i} \hat{F}_{ik} [\delta_{kj} - \hat{F}_{ij}],$$
where $\delta_{kj}$ is the Kronecker delta (equal to 1 only if $k = j$, and 0 otherwise), equality (a) holds since the softmax function is a row-wise operation, and equality (b) is based on Lemma 1. Furthermore, we can obtain the gradient of the fairness objective w.r.t. the node representation as follows:
$$\Big[\frac{\partial \langle p, u \rangle}{\partial F}\Big]_{ij} = \sum_{k=1}^{d_{out}} \frac{\partial p_k}{\partial F_{ij}} u_k = \sum_{k=1}^{d_{out}} \Delta_{s,i} \hat{F}_{ik} [\delta_{kj} - \hat{F}_{ij}] u_k = \Delta_{s,i} \hat{F}_{ij} u_j - \Delta_{s,i} \hat{F}_{ij} \sum_{k=1}^{d_{out}} \hat{F}_{ik} u_k.$$
Therefore, the matrix formulation is given by
$$\frac{\partial \langle p, u \rangle}{\partial F} = U_s \odot \mathrm{SF}(F) - \mathrm{Sum}_1\big(U_s \odot \mathrm{SF}(F)\big) \odot \mathrm{SF}(F),$$
where $U_s \triangleq \Delta_s^{\top} u \in \mathbb{R}^{n \times d_{out}}$ and $\mathrm{Sum}_1(\cdot)$ represents the summation over the column dimension with preserved matrix shape. Therefore, the computation complexity for the gradient $\frac{\partial \langle p, u \rangle}{\partial F}$ is $O(n d_{out})$.
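As a sanity check (an illustrative numpy sketch, not the authors' implementation; variable names are assumptions), the closed-form $O(n d_{out})$ gradient of $\langle p, u \rangle$ w.r.t. $F$ can be verified against finite differences on random data:

```python
import numpy as np

def softmax_rows(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 3
F = rng.normal(size=(n, d))
u = rng.normal(size=d)
s = np.array([1, 1, -1, 1, -1, -1])

# Delta_s: group-mean difference vector over nodes
delta = (s == 1) / (s == 1).sum() - (s == -1) / (s == -1).sum()

def objective(F):
    p = delta @ softmax_rows(F)   # p = Delta_s SF(F)
    return p @ u                  # scalar <p, u>

# closed-form gradient: U_s ⊙ SF(F) - Sum_1(U_s ⊙ SF(F)) ⊙ SF(F)
P = softmax_rows(F)
U_s = np.outer(delta, u)
G = U_s * P - (U_s * P).sum(axis=1, keepdims=True) * P

# finite-difference check of every entry
eps = 1e-6
G_fd = np.zeros_like(F)
for i in range(n):
    for j in range(d):
        Fp = F.copy(); Fp[i, j] += eps
        G_fd[i, j] = (objective(Fp) - objective(F)) / eps

assert np.allclose(G, G_fd, atol=1e-4)
```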

D PROOF OF PROPOSITION 1

Based on the aforementioned two cases, it is easy to obtain the conjugate function for the $\ell_1$ norm (whose dual norm is the $\ell_\infty$ norm); i.e., the conjugate function of $h_f(x) = \lambda \|x\|_1$ is given by
$$h_f^*(y) = \begin{cases} 0, & \|y\|_\infty \le \lambda, \\ +\infty, & \|y\|_\infty > \lambda. \end{cases}$$

F DATASET STATISTICS

For a fair comparison with previous work, we perform the node classification task on three real-world datasets: Pokec-n, Pokec-z, and NBA. The statistics of the three datasets are provided in Table 3. It can be seen that the sensitive homophily coefficient is even higher than the label homophily coefficient on all three real-world datasets, which validates that real-world graphs usually carry a large topology bias.

GAT. Feature aggregation in GAT applies normalized attention coefficients to compute a linear combination of neighbors' features as $X_{agg,i} = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} X_{trans,j}$, where $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$, $e_{ij} = \mathrm{LeakyReLU}(X_{trans,i}^{\top} w_i + X_{trans,j}^{\top} w_j)$, and $w_i$ and $w_j$ are learnable column vectors. A prior study (Ma et al., 2021b) demonstrates that this aggregation corresponds to one gradient-descent step with adaptive step size $\frac{1}{\sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)}$ for a graph-smoothness objective.

PPNP / APPNP. Feature aggregation in PPNP and APPNP adopts the aggregation rules $X_{agg} = \alpha \big(I - (1-\alpha)\tilde{A}\big)^{-1} X_{trans}$ and $X^{k+1}_{agg} = (1-\alpha)\tilde{A} X^{k}_{agg} + \alpha X_{trans}$, respectively. It has been shown that these are equivalent to the exact solution and to one gradient-descent step with step size $\frac{\alpha}{2}$, respectively, for minimizing the following objective:
$$\min_F \|F - X_{trans}\|_F^2 + \Big(\frac{1}{\alpha} - 1\Big)\mathrm{tr}\big(F^{\top}(I - \tilde{A})F\big).$$
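The equivalence between the PPNP closed form and the APPNP fixed-point iteration can be illustrated on a toy graph (a numpy sketch under an assumed symmetric normalization with self-loops; the 3-node chain graph is purely illustrative):

```python
import numpy as np

# toy 3-node chain graph; symmetric normalized adjacency with self-loops
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)
d = A_hat.sum(axis=1)
A_tilde = A_hat / np.sqrt(np.outer(d, d))

alpha = 0.2
X = np.random.default_rng(1).normal(size=(3, 2))  # transformed features

# PPNP: closed-form personalized-PageRank aggregation
X_ppnp = alpha * np.linalg.solve(np.eye(3) - (1 - alpha) * A_tilde, X)

# APPNP: iterate X^{k+1} = (1 - alpha) * A_tilde @ X^k + alpha * X
X_appnp = X.copy()
for _ in range(200):
    X_appnp = (1 - alpha) * A_tilde @ X_appnp + alpha * X

# the iteration is a contraction, so it converges to the closed form
assert np.allclose(X_ppnp, X_appnp, atol=1e-8)
```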


H.1 MORE EXPERIMENTAL SETTING DETAILS

In the FMP implementation, we first use a 2-layer MLP with 64 hidden units, whose output dimension is 2. We also stack 2 layers for the propagation and debiasing steps, which contain no trainable model parameters. For model training, we adopt the cross-entropy loss with 300 epochs. We also adopt the Adam optimizer with a 0.001 learning rate and $1 \times 10^{-5}$ weight decay for all models. The hyperparameter grids for FMP are λ_f ∈ {0, 5, 10, 15, 20, 30, 100} and λ_s ∈ {0, 0.01, 0.1, 0.5, 1.0, 2.0, 3, 5, 10, 15, 20}.

H.2 COMPARISON WITH FAIR MIXUP

We also implement Fair Mixup (Chuang & Mroueh, 2021) as an additional baseline for different GNN backbones in Figure 3. Since input fair mixup requires calculating model predictions for a mixed input batch, it is non-trivial to adopt it in our experiments (node classification): forward propagation in GNNs aggregates information from neighbors, while the neighborhood information for the mixed input batch is missing. Therefore, we adopt manifold fair mixup on the logit layer (the previous layers contain aggregation steps) in our experiments. Experimental results show that our method still achieves a better accuracy-fairness trade-off on all three datasets.

H.3 SENSITIVE ATTRIBUTE INFLUENCE PROBE

As for lending fairness perception, it means that the influence of sensitive attributes can be identified. For example, our proposed FMP consists of three steps, i.e., transformation, aggregation, and debiasing, where the sensitive attribute is explicitly adopted in the debiasing step. To identify the influence of sensitive attributes in FMP, it is sufficient to check the difference between the input and the output of the debiasing step. It is worth mentioning that the information required to identify this influence comes naturally from forward propagation. In contrast, to identify the influence of sensitive attributes for existing methods (e.g., adding regularization or adversarial debiasing), the well-trained fair model is insufficient: we additionally need a vanilla (unfair) model trained without any sensitive attribute information. In other words, these methods require model retraining with the sensitive attribute removed, and thus much more resources for auditing the influence of sensitive attributes. The key drawback of these methods is that they encode the sensitive attribute information into the well-trained model weights; from an auditor's perspective, it is quite hard to identify the influence of sensitive attributes given only the well-trained fair model. Instead, our designed FMP explicitly adopts the sensitive attribute information in the forward propagation process, which naturally avoids the dilemma that sensitive attributes are encoded into well-trained model weights. Figure 4 shows the visualization results for training with/without (left/right) the sensitive attribute for FMP and several baselines (with GCN backbones) across three real-world datasets. From the visualization results, we observe that all methods using sensitive attribute information achieve better fairness, since the logit-layer representations for different sensitive attributes are mixed with each other. It is therefore hard to identify the sensitive attribute from the representations, which leads to better fairness results. The key difference is that for FMP the results for training with/without (left/right) the sensitive attribute can both be obtained through forward propagation, while the other baseline methods require model retraining to probe the influence of sensitive attributes.
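To make the probing procedure concrete, here is a minimal numpy sketch: the influence of the sensitive attribute is read off as the difference between the debiasing step's input and output within a single forward pass. All function names here are illustrative stand-ins, and the `debias` update is only a placeholder low-rank perturbation, not FMP's actual debiasing step:

```python
import numpy as np

def softmax_rows(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical stand-ins for FMP's three steps:
def transform(X, W):                     # MLP feature transformation (1 layer)
    return np.maximum(X @ W, 0)

def aggregate(A_tilde, H, alpha=0.2):    # propagation with skip connection
    return (1 - alpha) * A_tilde @ H + alpha * H

def debias(H, s, lam=1.0):               # placeholder debiasing perturbation
    delta = (s == 1) / (s == 1).sum() - (s == -1) / (s == -1).sum()
    return H - lam * np.outer(delta, delta @ softmax_rows(H))

# Influence probe: difference between debiasing-step input and output,
# available directly from one forward pass -- no retraining needed.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4)); W = rng.normal(size=(4, 2))
s = np.array([1, 1, 1, -1, -1, -1])
A_tilde = np.eye(6)                      # trivial graph for the sketch

H = aggregate(A_tilde, transform(X, W))
H_fair = debias(H, s)
influence = softmax_rows(H_fair) - softmax_rows(H)
print(influence.shape)  # (6, 2): per-node, per-class influence estimate
```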

H.4 RUNNING TIME COMPARISON

We provide a running time comparison in Figure 5 for our proposed FMP and other baselines, including vanilla training, regularization, and adversarial debiasing on several backbones (MLP, GCN, GAT, SGC, and APPNP). For a fair comparison, we adopt the same Adam optimizer with 200 epochs and 5 independent runs. We list several observations as follows:

• The proposed FMP is efficient on large-scale datasets. Specifically, compared with the vanilla version of the lightest backbone, MLP, the running time of FMP is only 46.97% and 15.03% higher on the Pokec-n and Pokec-z datasets, respectively. Compared with the most time-consuming backbone, APPNP, the running time of FMP is 64.07% and 41.45% lower on the Pokec-n and Pokec-z datasets, respectively.

• The regularization method achieves almost the same running time as the vanilla method on all backbones. For example, GCN with regularization incurs only a 6.41% time overhead compared with the vanilla method. Adversarial debiasing, in contrast, is extremely time-consuming: GCN with adversarial debiasing incurs an 88.58% time overhead compared with the vanilla method.


Figure 6: Hyperparameter study of the fairness and smoothness hyperparameters on demographic parity and accuracy.

H.5 HYPERPARAMETER STUDY

We provide a hyperparameter study to further investigate the effect of the fairness and smoothness hyperparameters on prediction and fairness performance on the three datasets. Specifically, we tune the hyperparameters over λ_f ∈ {0.0, 5.0, 10.0, 15.0, 20.0, 30.0, 100.0, 1000.0} and λ_s ∈ {0.0, 0.1, 0.5, 1.0, 3.0, 5.0, 10.0, 15.0, 20.0}. From the results in Figure 6, we make the following observations:

• Accuracy and demographic parity are extremely sensitive to the smoothness hyperparameter. For the Pokec-n and Pokec-z datasets (NBA), a larger smoothness hyperparameter usually leads to higher (lower) accuracy with higher prediction bias. The rationale is that GCN-like aggregation with skip connections is beneficial only for graph data with a high label homophily coefficient; otherwise, neighbors' representations with different labels mislead the representation update.

• An appropriate fairness hyperparameter leads to a better fairness-prediction trade-off. The reason is that the fairness hyperparameter determines the step size of the perturbation-vector update in probability space, and only an appropriate step size leads to a good perturbation-vector update.

I BROADER SOCIAL IMPACT AND LIMITATIONS

Transparency is an advanced property in the fairness domain and poses a significant challenge for both research and industry. Many existing works mainly rely on specific fairness metrics to evaluate prediction bias. Transparency may stimulate the maintainers and auditors of machine learning systems to rethink fairness evaluation and auditing: achieving a fair model with low bias under a specific fairness metric is insufficient, and maintainers should also consider how to expose the influence of the sensitive attribute to auditors. Transparency may lead maintainers to put more effort into improving the transparency of fair models, which could help convince auditors. A limitation of this work is that it requires sensitive attribute information at the inference stage.



A naive way for many existing methods (e.g., adding a fairness regularization, adversarial debiasing, etc.) to obtain the influence of the sensitive attribute is to train fair and unfair models with/without the sensitive attribute information, and then measure the prediction difference. Therefore, the required resources include the training data and an additional (unfair) model training run. Similar to the goal of model explainability, only achieving accurate predictions is insufficient; chasing explainability helps experts understand how the model makes predictions and convinces users. Decision trees are intrinsically explainable while deep neural networks are not. Such a smoothness objective is the most commonly used one in existing methods (Ma et al., 2021b; Belkin & Niyogi, 2001; Kalofolias, 2016). Various other smoothness objectives could be considered to improve the performance of FMP, and we leave them for future work. https://www.kaggle.com/noahgift/social-power-nba



Figure 1: The model pipeline consists of three steps: MLP (feature transformation), propagation with skip connection and debiasing via low-rank perturbation in probability space.

Figure 2: DP and Acc trade-off performance on three real-world datasets compared with adding regularization (top) and adversarial debiasing (bottom). A trade-off curve closer to the bottom-right corner indicates better trade-off performance.

We first show the conjugate function of the general norm function $f(x) = \lambda\|x\|$, where $x \in \mathbb{R}^{1 \times d_{out}}$. The conjugate function of $f(x)$ satisfies
$$f^*(y) = \begin{cases} 0, & \|y\|_* \le \lambda, \\ +\infty, & \|y\|_* > \lambda, \end{cases} \tag{13}$$
where $\|\cdot\|_*$ is the dual norm of the original norm $\|\cdot\|$, defined as $\|y\|_* = \max_{\|x\| \le 1} y^{\top} x$. Considering the conjugate function definition $f^*(y) = \max_x \, y^{\top} x - \lambda\|x\|$, the analysis can be divided into the following two cases:

❶ If $\|y\|_* \le \lambda$, according to the definition of the dual norm, we have $y^{\top} x \le \|x\|\,\|y\|_* \le \lambda\|x\|$ for all $x$, where equality holds for $x = 0$. Hence, it is easy to obtain $f^*(y) = \max_x \, y^{\top} x - \lambda\|x\| = 0$.

❷ If $\|y\|_* > \lambda$, note that the dual norm $\|y\|_* = \max_{\|x\| \le 1} y^{\top} x > \lambda$, so there exists $\hat{x}$ with $\|\hat{x}\| \le 1$ and $y^{\top}\hat{x} > \lambda \ge \lambda\|\hat{x}\|$. Therefore, for any constant $t > 0$, we have
$$f^*(y) \ge y^{\top}(t\hat{x}) - \lambda\|t\hat{x}\| = t\big(y^{\top}\hat{x} - \lambda\|\hat{x}\|\big) \xrightarrow{t \to \infty} \infty.$$

Given the conjugate function $h_f^*(\cdot)$, we further investigate the proximal operator $\mathrm{prox}_{h_f^*}$. Note that
$$\mathrm{prox}_{h_f^*}(u) = \arg\min_y \|y - u\|_F^2 + h_f^*(y) = \arg\min_{\|y\|_\infty \le \lambda_f} \|y - u\|_F^2 = \arg\min_{|y_j| \le \lambda_f, \, \forall j \in [d_{out}]} \sum_{j=1}^{d_{out}} |y_j - u_j|^2,$$
so the proximal operator problem can be decomposed into element-wise sub-problems, i.e.,
$$\mathrm{prox}_{h_f^*}(u)_j = \arg\min_{|y_j| \le \lambda_f} |y_j - u_j|^2 = \mathrm{sign}(u_j)\min(|u_j|, \lambda_f),$$
which completes the proof.

E TRAINING ALGORITHM

We summarize the training algorithm for FMP and provide the pseudo code in Algorithm 1.

Algorithm 1: FMP Training Algorithm
Input: Graph dataset (X, A, Y); the total number of epochs T; hyperparameters λ_s and λ_f.
Output: The well-trained FMP model.
1 for epoch from 1 to T do
2   Conduct feature transformation using MLP;
3   Conduct propagation and debiasing as steps ❶-❺;
4   Calculate the cross-entropy loss for the node classification task;
5   Conduct the back-propagation step to update model weights;
6 end
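The element-wise proximal solution derived in the proof of Proposition 1 is exactly the Euclidean projection onto the $\ell_\infty$ ball of radius $\lambda_f$, i.e., entry-wise clipping. A minimal numpy check (the function name is an illustrative assumption):

```python
import numpy as np

def prox_conjugate_l1(u, lam):
    """Proximal operator of h_f^*, the conjugate of h_f(x) = lam * ||x||_1.

    h_f^* is the indicator of the l_inf ball of radius lam, so its prox
    is the Euclidean projection onto that ball: element-wise clipping
    prox(u)_j = sign(u_j) * min(|u_j|, lam).
    """
    return np.sign(u) * np.minimum(np.abs(u), lam)

u = np.array([-3.0, 0.2, 1.5])
print(prox_conjugate_l1(u, 1.0))  # clips each entry into [-1, 1]
```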

Figure 3: DP and Acc trade-off performance on three real-world datasets compared with (manifold) Fair Mixup.

Figure 4: The visualization of node representations for training with/without (left/right) the sensitive attribute for FMP and several baselines across three real-world datasets. Data points with different colors represent different sensitive attributes.

Comparative results with baselines on node classification. Each group of three columns reports Acc (%) ↑, ∆DP (%) ↓, and ∆EO (%) ↓ on one of the three real-world datasets:

Method  Acc (%) ↑      ∆DP (%) ↓      ∆EO (%) ↓     | Acc (%) ↑      ∆DP (%) ↓      ∆EO (%) ↓     | Acc (%) ↑      ∆DP (%) ↓       ∆EO (%) ↓
MLP     70.48 ± 0.77   1.61 ± 1.29    2.22 ± 1.01   | 72.48 ± 0.26   1.53 ± 0.89    3.39 ± 2.37   | 65.56 ± 1.62   22.37 ± 1.87    18.00 ± 3.52
GAT     69.76 ± 1.30   2.39 ± 0.62    2.91 ± 0.97   | 71.00 ± 0.48   3.71 ± 2.15    7.50 ± 2.88   | 57.78 ± 10.65  20.12 ± 16.18   13.00 ± 13.37
GCN     71.78 ± 0.37   3.25 ± 2.35    2.36 ± 2.09   | 73.09 ± 0.28   3.48 ± 0.47    5.16 ± 1.38   | 61.90 ± 1.00   23.70 ± 2.74    17.50 ± 2.63
SGC     71.24 ± 0.46   4.81 ± 0.30    4.79 ± 2.27   | 71.46 ± 0.41   2.22 ± 0.29    3.85 ± 1.63   | 63.17 ± 0.63   22.56 ± 3.94    14.33 ± 2.16
APPNP   66.91 ± 1.46   3.90 ± 0.69    5.71 ± 1.29   | 69.80 ± 0.89   1.98 ± 1.30    4.01 ± 2.36   | 63.80 ± 1.19   26.51 ± 3.33    20.00 ± 4.56
FMP     70.50 ± 0.50   0.81 ± 0.40    1.73 ± 1.03   | 72.16 ± 0.33   0.66 ± 0.40    1.47 ± 0.87   | 73.33 ± 1.85   18.92 ± 2.28    13.33 ± 5.89

Our designed FMP explicitly adopts the sensitive attribute information in the forward propagation process, which naturally avoids the dilemma that sensitive attributes are encoded into well-trained model weights. In a nutshell, FMP provides higher transparency since (1) the sensitive attribute is explicitly adopted in forward propagation, and (2) it is not necessary to retrain the model to probe the influence of the sensitive attribute.

Efficiency. FMP is an efficient message passing scheme. The computation complexity for the aggregation (sparse matrix multiplications) is O(m d_out), where m is the number of edges in the graph. For FMP, the extra computation mainly comes from the perturbation calculation, which, as shown in Theorem 1, has computation complexity O(n d_out). This extra complexity is negligible since the number of nodes n is far less than the number of edges m in real-world graphs. Additionally, if we directly adopted autograd to calculate the gradient via back propagation, we would have to compute the three-dimensional tensor ∂p/∂F with computation complexity O(n² d_out).

Table of Notations

Statistical Information on Datasets. Columns: Dataset, # Nodes, # Node Features, # Edges, # Training Labels, # Training Sens.

