ON DYADIC FAIRNESS: EXPLORING AND MITIGATING BIAS IN GRAPH CONNECTIONS

Abstract

Disparate impact has raised serious concerns about machine learning applications and their societal impact. In response to the need to mitigate discrimination, fairness has come to be regarded as a crucial property in algorithm design. In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates the fairness notion that a predictive relationship between two instances should be independent of their sensitive attributes. On this basis, we theoretically relate graph connections to dyadic fairness of link predictive scores when learning graph neural networks, and reveal that regulating the weights on existing edges in a graph conditionally contributes to dyadic fairness. We then propose our algorithm, FairAdj, which empirically learns a fair adjacency matrix under proper graph structural constraints for fair link prediction, while preserving predictive accuracy as much as possible. Empirical validation demonstrates that our method delivers effective dyadic fairness in terms of various statistics and, at the same time, enjoys a favorable fairness-utility tradeoff.

1. INTRODUCTION

The scale of graph-structured data has grown explosively across disciplines (e.g., social networks, telecommunication networks, and citation networks), calling for robust computational techniques to model, discover, and extract the complex structural patterns hidden in big graph data. Research has been devoted to inferring potential connections (Liben-Nowell & Kleinberg, 2007), and the corresponding algorithms can be used for high-quality link prediction and recommendation (Adamic & Adar, 2003; Sarwar et al., 2001; Qi et al., 2006). In this work, we study the potential disparate impact in the prediction of dyadic relationships between two instances within a homogeneous graph. Despite the wide application of link prediction algorithms, the serious concerns raised by disparate impact (Angwin et al., 2016; Barocas & Selbst, 2016; Bose & Hamilton, 2019a; Liao et al., 2020) should also be reckoned with by algorithm designers. In an algorithmic context, disparate impact often describes a disparity in influential decisions that essentially derives from characteristics protected by anti-discrimination laws or social norms. Unfortunately, this negative impact, derived from biased data and conventional algorithms, occurs in many applications including link prediction. One example is a user recommender system that follows the proximity principle (individuals are more likely to interact with similar individuals) or relies on existing connections carrying intrinsic bias. Such an operating mode delivers biased recommendations dominated by sensitive attributes: for example, users of the same religion or ethnic group are more likely to be recommended to a user, consequently generating segregation in social relations through long-term accumulation (Hofstra et al., 2017). Another example can be observed in news streaming.
When a news app has collected a user's political profile, in pursuit of the user's preferences in news streaming, the system might deliver only political content that the user is predisposed to agree with, thereby skewing the user's scope and narrowing their view by selectively displaying reality (Pariser, 2011). To alleviate these concerns, an algorithm should perform link prediction without being biased by the sensitive attributes of the two instances, and should stream diverse yet preferred recommendations. Motivated by this potential bias in real cases, in this paper we propose dyadic fairness for the link prediction problem in homogeneous graphs, where the dyadic fairness criterion expects the predictions to be statistically independent of the sensitive attributes of the given two vertices. We focus our scope on Graph Neural Networks (GNNs), which have shown remarkable capacity in graph representation learning via message passing along the graph structure (Xu et al., 2018; 2020; Ying et al., 2018; Wang et al., 2019; Fan et al., 2019; Li et al., 2020). Within the pipeline of GNNs, given an arbitrary graph, we theoretically analyze the relationship between dyadic fairness and the graph connections. Our findings suggest that adapting the weights on existing edges in a graph can conditionally contribute to dyadic fairness. Building on these theoretical findings, we propose FairAdj, an algorithm that empirically learns a fair adjacency matrix by updating the normalized adjacency matrix while keeping the original graph structure unchanged. Integrated with a utility objective function, the proposed algorithm seeks dyadic fairness and link predictive utility simultaneously. Our definition of dyadic fairness in a graph context is inspired by the statistical metrics of group fairness (Dwork et al., 2012; Kusner et al., 2017). First, the vertices of a graph are categorized into several groups according to a protected attribute.
Then, the dyadic fairness criterion asks standard statistics, such as positive outcomes or the false positive rate on link scores, to be approximately equalized across intra and inter groups. Essentially, such a requirement asks for more diverse predictions between and within the different groups defined by the protected attribute, and hence also helps mitigate social segregation by calling for more interactions across protected groups in the graph. Empirically, we present studies on six real-world social and citation networks to demonstrate the effectiveness of the proposed method, evaluating against seven measurements of both utility and dyadic fairness. Compared with other baseline methods (Kipf & Welling, 2016b; Grover & Leskovec, 2016; Rahman et al., 2019; Bose & Hamilton, 2019b), we consistently observe improvements in two respects. First, the dyadic fairness metrics verify that our method can minimize the statistical gap between predictions on intra and inter links. Second, in terms of utility, our results are consistent with the existing literature (Zhao & Gordon, 2019; Fish et al., 2016; Calders et al., 2009) in that satisfying fairness can potentially decrease utility; however, our algorithm enjoys a more favorable fairness-utility tradeoff (the same fairness with less sacrifice in utility, and vice versa) compared to previous works. Additionally, to approximate real application scenarios, we showcase a direct by-product of dyadic fairness: our method can effectively stream more diverse recommendations containing instances holding different sensitive attributes.

2. RELATED WORK

In this section we mainly review closely related work on both fair machine learning and graph representation learning, and briefly discuss several existing works on learning fair node representations. Fair Machine Learning. Various types of fairness notions have been proposed and studied, including group fairness (Kusner et al., 2017; Kearns et al., 2018; 2019), individual fairness (Dwork et al., 2012), and preference-based notions (Zafar et al., 2017a; Ustun et al., 2019). Embracing these definitions, algorithms involving fairness constraints have been proposed. Zemel et al. (2013) propose a method to find a representation that maximizes utility while preserving both group and individual fairness. Subsequent works on fair representation learning use autoencoders (Madras et al., 2018) or adversarial training (Zhao & Gordon, 2019; Zhao et al., 2019; Edwards & Storkey, 2015; Louizos et al., 2016) to remove sensitive patterns while preserving enough information for prediction. Zafar et al. (2017b) optimize for decision-boundary fairness through regularization in logistic regression and support vector machines, and other works achieve fairness via optimal transport between sensitive groups (Gordaliza et al., 2019; Jiang et al., 2019) or fair kernel methods (Donini et al., 2018). However, most proposed fair learning algorithms are built on independent and identically distributed data and are not directly applicable to graph-structured data with dyadic fairness. Graph Representation Learning. Representation learning on graphs converts a structural graph into a low-dimensional space while preserving discriminative and structural information.
Efficient graph analytic methods (Von Luxburg, 2007; Tang et al., 2015; Perozzi et al., 2014; Grover & Leskovec, 2016; Xu et al., 2019) can benefit a series of downstream applications, including node classification (Wang et al., 2017), node clustering (Nie et al., 2017), link prediction (Zhang & Chen, 2018), and graph classification. Recently, Graph Neural Networks (GNNs) have shown remarkable capacity in graph representation learning, with emergent variants (Kipf & Welling, 2016a; Veličković et al., 2017; Hamilton et al., 2017) consistently delivering promising results. Our work uses GNNs for graph representation learning but targets improving dyadic fairness in link prediction. Fair Graph Embedding. As fairness in graph-structured data is a relatively new research topic, only a few studies have investigated fairness issues in graph representation learning. Rahman et al. (2019) first proposed Fairwalk, a random-walk-based graph embedding method that revises the transition probabilities according to a vertex's sensitive attributes. Following the idea of adversarially removing sensitive patterns (Madras et al., 2018), Liao et al. (2020) proposed adversarial training on vertex representations to minimize the marginal discrepancy; this line of work mainly focuses on learning node representations that are free of sensitive attributes, which differs from ours. Other works include fair collaborative filtering (Yao & Huang, 2017), item recommendation in bipartite graphs (Steck, 2018; Chakraborty et al., 2019), and the fair graph covering problem (Rahmattalabi et al., 2019).

3. PRELIMINARIES

Let G := (V, E) be a graph with a fixed set of vertices V and edges E, where vertex features with M dimensions are represented by X ∈ R^{N×M}. A nonnegative adjacency matrix A ∈ R^{N×N} describes the relations between every pair of vertices. The element a_{vu} of A represents the weight on the link bridging v and u, and is set to zero if no link exists. Every vertex holds a sensitive attribute, and we use S(v) to denote the sensitive attribute, as well as the sensitive group membership, of v. Let Γ(v) be the set of 1-hop neighbors of v including the self-loop. Edge (v, u) is called intra if S(v) = S(u), and inter if S(v) ≠ S(u). |S| denotes the cardinality of group S. For a binary sensitive attribute with two groups S_0 and S_1 separated from the graph, ∂S_0 := {v ∈ S_0 | Γ(v) ∩ S_1 ≠ ∅} denotes the set of vertices in S_0 which lie on the boundary and have connections to S_1, and similarly for ∂S_1. Let U be the discrete uniform distribution over the set of vertices V. Suppose a bivariate link prediction function g(·, ·) : R^D × R^D → R that, given the two embedded vertex representations, outputs a value expressing the model's belief that the two vertices are potentially linked. With these basic notations, we consider the disparity in link prediction bridging intra and inter sensitive groups. The general purpose of dyadic fairness is to predict links independently of whether the two vertices share the same sensitive attribute. We extend demographic parity (Edwards & Storkey, 2015; Kipf & Welling, 2016b; Madras et al., 2018; Zemel et al., 2013) to formulate a specific criterion for dyadic fairness. In a binary classification problem, demographic parity expects a classifier to give positive outcomes to two sensitive groups at the same rate. We turn the two groups in the context of demographic parity into the groups of intra and inter links.
Ideally, achieving dyadic fairness brings intra and inter link predictions to the same rate over a bag of candidate links. Given vertex representations v and u, dyadic fairness can be mathematically formulated as follows.

Definition 3.1. A link prediction algorithm satisfies dyadic fairness if the predictive score satisfies

    Pr(g(u, v) | S(u) = S(v)) = Pr(g(u, v) | S(u) ≠ S(v)).    (1)

To quantify fairness, we establish dyadic fairness on link prediction upon a fixed set of vertices, and model it via the expectation of the absolute difference in score outcomes across the groups of intra and inter links. Note that we also comprehensively evaluate the fairness of our model via four other statistical gaps extended from (Hardt et al., 2016) in Section 6.
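As a minimal sketch (toy scores and our own function names, not the paper's implementation), the demographic-parity gap behind Definition 3.1 can be estimated empirically as the absolute difference between mean scores on intra and inter candidate links:

```python
import numpy as np

def dyadic_dp_gap(scores, sensitive, pairs):
    """Empirical Delta_DP: |mean score on intra pairs - mean score on inter pairs|.

    scores[i]    -- predicted link score for pairs[i]
    sensitive[v] -- sensitive attribute of vertex v
    pairs        -- candidate links as (u, v) tuples
    """
    intra = [s for s, (u, v) in zip(scores, pairs) if sensitive[u] == sensitive[v]]
    inter = [s for s, (u, v) in zip(scores, pairs) if sensitive[u] != sensitive[v]]
    return abs(np.mean(intra) - np.mean(inter))

# toy example: two groups, scores biased toward intra links
sens = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]
scores = [0.9, 0.8, 0.3, 0.2]
gap = dyadic_dp_gap(scores, sens, pairs)   # mean 0.85 (intra) vs 0.25 (inter)
```

A perfectly dyadically fair scorer would drive this gap toward zero.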

4. HOW GRAPH CONNECTIONS AFFECT FAIRNESS

In this section, we propose a chain of theoretical analyses¹ established on a variant of demographic parity and graph neural networks to associate dyadic fairness with graph connections. We first demonstrate that demographic parity in link prediction outcomes can be sufficiently reduced to achieving fair vertex representations when an inner-product function is employed for link prediction. Guided by this sufficiency, we reveal how the pipeline of a one-layer graph neural network affects demographic parity, and conclude that for an arbitrary graph embedded with GNNs, properly regulating the weights on existing graph connections can conditionally contribute to fairness. These theoretical findings motivate the algorithmic design presented in the next section. Without loss of generality, in this section we consider the sensitive attribute to be binary, so that two sensitive groups S_0 and S_1 can be separated from the graph; cases with sensitive attributes taking multiple categorical values are shown in the experimental section.

Proposition 4.1. Let the link prediction function g(·, ·) be modeled as the inner product g(v, u) = v^⊤Σu, where Σ ∈ S^M_{++} is a positive-definite matrix, and let Q > 0 satisfy ‖v‖₂ ≤ Q for all v ∈ V. If ‖E_{v∼U}[v | v ∈ S_0] − E_{v∼U}[v | v ∈ S_1]‖₂ ≤ δ, then for dyadic fairness based on demographic parity,

    ∆_DP := |E_{(v,u)∼U×U}[g(v, u) | S(v) = S(u)] − E_{(v,u)∼U×U}[g(v, u) | S(v) ≠ S(u)]| ≤ Q‖Σ‖₂ · δ.    (2)

Remark 1. Proposition 4.1 applies to a general inner-product function in Euclidean space, where Σ directionally and differently scales the two input vectors. When Σ is set to the identity matrix, g(·, ·) reduces to the dot product widely used in research on link prediction (Kipf & Welling, 2016b; Trouillon et al., 2016; Yao & Huang, 2017). The proposition implies that fair vertex representations are a sufficient condition for demographic parity in link prediction.
Guided by this sufficiency, the approach to fairness can be reduced to achieving vertex representations with a small discrepancy between sensitive groups. With the proposition in hand, we proceed to understand fairness within graph neural networks and reveal how the structure or connections of a graph affect demographic parity. A single GNN layer can be generically written as GNN_θ(X, A) := ρ(ÃXW_θ), where ρ is a nonlinear activation function, Ã is the normalized adjacency matrix, and W_θ is the trainable weight matrix. One GNN layer can be decomposed into two disjoint phases: a vertex feature smoothing phase over the graph using Ã, and a feature embedding phase using W_θ and ρ. Concretely, we consider left normalization Ã = D^{-1}A (D is the degree matrix) for feature smoothing. Equivalently, at the individual level, each vertex undergoes a one-hop mean-aggregation Agg(v) := deg_w(v)^{-1} Σ_{u∈Γ(v)} a_{vu} u, where deg_w(v) := Σ_{u∈Γ(v)} a_{vu} is the weighted degree of vertex v. We abbreviate E_{v∼U}[v | v ∈ S_0] and E_{v∼U}[v | v ∈ S_1] as µ_0 and µ_1, respectively. Let σ denote the maximal deviation of the vertex representations, namely, ∀v ∈ S_0, ‖v − µ_0‖_∞ ≤ σ, and ∀v ∈ S_1, ‖v − µ_1‖_∞ ≤ σ. Let D_max := max_{v∈V} deg_w(v) be the maximal weighted degree in G, and m_w := Σ_{S(v)≠S(u)} a_{vu} the summation of weights on inter links. With these notations, we show how the discrepancy between µ_0 and µ_1 changes after one round of feature smoothing over the graph.

Theorem 4.1. For an arbitrary graph with nonnegative link weights, after one mean-aggregation over the graph, the resulting representation discrepancy between the two sensitive groups, ∆^Aggr_DP := ‖E_{v∼U}[Agg(v) | v ∈ S_0] − E_{v∼U}[Agg(v) | v ∈ S_1]‖₂, is bounded by

    max{α_min ‖µ_0 − µ_1‖_∞ − 2σ, 0} ≤ ∆^Aggr_DP ≤ α_max ‖µ_0 − µ_1‖₂ + 2√M σ,

where α_min = min{α_1, α_2}, α_max = max{α_1, α_2}, α_1 = |1 − (m_w/D_max)(1/|S_0| + 1/|S_1|)|, and α_2 = |1 − |∂S_0|/|S_0| − |∂S_1|/|S_1||.

Remark 2.
Theorem 4.1 gives lower and upper bounds, determined by the graph structure and the maximal deviation σ of vertex representations within each sensitive group, on demographic parity after one aggregation over the vertices. The contraction coefficient α_max is the maximum of two absolute terms α_1 and α_2, where α_2 is a constant predetermined by the graph connections. It is worth pointing out that although in the worst case α_max can be 1, e.g., in a complete bipartite graph, in most practical graphs it is strictly less than 1; hence the upper bound corresponds to a contraction lemma, up to an additional error introduced by the deviation σ. We provide several illustrative graph diagrams for the contraction of this theorem in Appendix B. To approach demographic parity after feature smoothing, Theorem 4.1 suggests a strategy of regulating the weights on graph connections to change α_1 in α_max, so as to minimize the upper bound for one-layer mean-aggregation.

Algorithm 1: Algorithmic routine for FairAdj
    Input: vertex features X, adjacency matrix A, GNN parameters θ, learning rates η_θ and η_Ã
    Normalize the adjacency matrix: Ã ← D^{-1}A
    Fix the zero elements of Ã and select the non-zero elements for optimization
    while θ or Ã has not converged do
        for t = 1 to T_1 do                          ▷ optimize for utility
            Compute L_util by Eq. (5); g_θ ← ∇_θ L_util; θ ← θ + η_θ · Adam(θ, g_θ)
        for t = 1 to T_2 do                          ▷ optimize for fairness
            Z ← GNN_θ(X, Ã); Â ← ZZ^⊤                ▷ reconstruct graph connections
            Compute L_fair by Eq. (6); g_Ã ← ∇_Ã L_fair
            for v = 1 to N do                        ▷ projected gradient descent
                Sort the n non-zero elements of [Ã − η_Ã g_Ã]_{v,*} in descending order: e_1 ≥ e_2 ≥ ... ≥ e_n
                γ ← Σ_{j=1}^{n} 1(e_j + (1/j)(1 − Σ_{i=1}^{j} e_i) ≥ 0)    ▷ 1(·): the indicator function
                β ← (1/γ)(1 − Σ_{i=1}^{γ} e_i)
                for u = 1 to n do                    ▷ update Ã
                    [Ã]_{v,u} ← max{[Ã − η_Ã g_Ã]_{v,u} + β, 0}
    Output: link predictive score between v and u ← sigmoid(GNN_θ(v, Ã)^⊤ GNN_θ(u, Ã))

For the term α_1, if the summation of inter weights is too small (m_w → 0) or too large (m_w → 2D_max(1/|S_0| + 1/|S_1|)^{-1}), then α_1 approaches 1. This indicates that increasing the weights on inter links cannot always guarantee better demographic parity, even though inter-group connections are always the minority among links; instead, this quantity should be regulated into a proper range depending on the sizes of the sensitive groups and the connectivity of the graph. We combine the upper bound with the second, feature embedding phase. Denote µ̃_i := E_{v∼U}[GNN_θ(v, Ã) | v ∈ S_i], i = 0, 1, and Q̃ := sup{‖GNN_θ(v, Ã)‖₂ | v ∈ V}.

Corollary 4.1. For ∆_DP on vertices after passing through one layer GNN_θ(X, A) = ρ(ÃXW_θ), we have

    ∆_DP ≤ Q̃‖Σ‖₂ · ‖µ̃_0 − µ̃_1‖₂ ≤ QL²‖Σ‖₂‖W_θ‖₂² · (α_max‖µ_0 − µ_1‖₂ + 2√M σ),

where L is the Lipschitz constant of ρ. The first inequality holds by Proposition 4.1; the second follows from Theorem 4.1 and the definitions of the spectral norm and the Lipschitz constant, together with the fact that Q̃ ≤ L‖W_θ‖₂ Q. Multiple GNN layers can be reasoned about similarly. From the corollary, we see that ∆_DP admits a tighter upper bound when the weights on edges are regulated, and the bound also depends on the properties of W_θ and ρ and on the error term O(σ). With W_θ fixed, our theoretical findings provide a feasible solution: regulating the weights on graph connections can achieve better demographic parity in link prediction. As a supplement, they also indicate where or when this solution cannot perform well with finitely many GNN layers: (1) the solution cannot guarantee arbitrary fairness if we want to preserve the graph structure, due to the resistance of α_1 in the lower bound of Theorem 4.1; (2) when the original graph data is already fair enough, i.e., ‖µ_0 − µ_1‖₂ is small and the additional error O(σ) is comparable to it, the upper bound cannot be reduced significantly and the solution may not further mitigate the bias.
In response to the above analysis, we include a dataset in Appendix D to empirically investigate these potential limitations. In the following sections, we implement the inspired algorithm and demonstrate that it achieves favorable results and a better fairness-utility tradeoff on multiple real-world networks.
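The contraction described by Theorem 4.1 can be illustrated numerically on a toy graph of our own construction (not one of the paper's datasets): two fully connected sensitive groups joined by a single inter link, with σ = 0 so that every vertex carries its group mean as its feature vector.

```python
import numpy as np

# Two dense sensitive groups of 10 vertices each, joined by one inter link.
N, M = 20, 4
group = np.array([0] * 10 + [1] * 10)
A = np.zeros((N, N))
for g in (0, 1):                      # dense intra blocks (self-loops included)
    idx = np.where(group == g)[0]
    A[np.ix_(idx, idx)] = 1.0
A[0, 10] = A[10, 0] = 1.0             # a single inter link

A_norm = A / A.sum(axis=1, keepdims=True)      # left normalization D^{-1} A
X = np.repeat(group[:, None], M, axis=1) * 2.0 # sigma = 0: features = group mean
X_agg = A_norm @ X                             # Agg(v): one mean-aggregation

def group_gap(Z):
    return np.linalg.norm(Z[group == 0].mean(axis=0) - Z[group == 1].mean(axis=0))

gap_before, gap_after = group_gap(X), group_gap(X_agg)

# With m_w = 1 (the inter edge counted once), D_max = 11, |S_0| = |S_1| = 10:
alpha1 = abs(1.0 - (1.0 / 11.0) * (1.0 / 10.0 + 1.0 / 10.0))
```

Here the group-mean gap shrinks from 4.0 to about 3.93 after one aggregation, and with σ = 0 it matches the theorem's upper bound α_1‖µ_0 − µ_1‖₂, which is tight in this example.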

5. LEARNING FAIR GRAPH CONNECTIONS

The above discussion indicates that, when employing GNNs for graph embedding, adjusting the adjacency matrix can conditionally help the model achieve fairness. However, searching for the optimal adjacency matrix within hierarchical graph neural networks is a non-trivial problem. In this section, continuing the preceding analysis, we develop the FairAdj algorithm to adjust the graph connections and learn a fair adjacency matrix by updating Ã while keeping the original graph structure unchanged. At a high level, the algorithm alternates between optimizing the GNN parameters W_θ toward utility, and adjusting Ã toward dyadic fairness by gradient descent and empirical risk minimization under structural and right-stochasticity constraints. FairAdj is therefore able to pursue dyadic fairness and link predictive utility simultaneously. We employ the variational graph autoencoder (Kipf & Welling, 2016b) for feature embedding. A two-layer graph neural network serves as the inference model GNN_θ(·, ·), Z denotes the embedded representations, and the dot product between embedded representations serves as the generative model p(·). The KL-divergence term KL[·‖·] penalizes the discrepancy between the latent distribution and a Gaussian prior. The objective for reconstructing graph connections from the latent variables can be written as

    max_θ L_util := E_{GNN_θ(Z|X,Ã)}[log p(A | Z)] − KL[GNN_θ(Z | X, Ã) ‖ N(0, I)].    (5)

For fairness, we minimize L_fair to empirically seek better graph connections, then update Ã under constraints. Specifically, we optimize the normalized adjacency matrix Ã as

    min_Ã L_fair := (E_{(v,u)∼U×U}[â_vu | S(v) = S(u)] − E_{(v,u)∼U×U}[â_vu | S(v) ≠ S(u)])²,    (6)
    s.t. (1) [Ã]_vu = 0 if [A]_vu = 0;  (2) Ã1 = 1, Ã ≥ 0,

where â_vu is the corresponding entry of Â = ZZ^⊤ and 1 is the all-one vector of size N. The two constraints are necessary for optimizing Ã: (1) elements with zero value must be maintained, meaning no new links can be established during optimization.
This restriction is imposed to preserve utility, since adding fictitious links might mislead the directions of message passing and further corrupt representation learning; therefore, we only adapt the weights on existing edges and preserve the original graph structure. (2) Consistent with the initial left normalization Ã = D^{-1}A, the optimized matrix should remain right stochastic. This keeps the largest eigenvalue of Ã at 1, and hence avoids the numerical instabilities and exploding gradients that would otherwise arise when training W_θ toward L_util; in practice, we observe explosions during training when no constraints are applied to Ã. For the first constraint in Eq. (6), we compute gradients and update only those elements of Ã with non-zero initialization. For the second, after selecting the variables to optimize, we employ the projected gradient descent of (Wang & Carreira-Perpinán, 2013) to satisfy the constraint while minimizing L_fair. Given the computed gradients ∇_Ã L_fair and the corresponding learning rate η_Ã, we update Ã by projecting Ã − η_Ã ∇_Ã L_fair onto the feasible region with minimum Euclidean distance; writing Ã' for the projected update, each row v solves

    min_{Ã'} ‖[Ã']_{v,*} − [Ã − η_Ã ∇_Ã L_fair]_{v,*}‖₂²  s.t.  Ã'1 = 1, Ã' ≥ 0,

where [Ã]_{v,*} denotes the row-wise elements of Ã, namely all the connections of v. Since Ã is row-wise independent and the objective of this quadratic program is strictly convex, each row has a unique solution. Solution details for the projected gradient descent are restated as part of our algorithmic pipeline. The algorithmic routine is elaborated in Algorithm 1: θ and Ã are optimized iteratively with T_1 and T_2 epochs for co-adaptation.
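The row-wise projection in Algorithm 1 is the standard Euclidean projection onto the probability simplex (Wang & Carreira-Perpinán, 2013). A minimal sketch with our own function names (in FairAdj the projection is applied only to the non-zero support of each row):

```python
import numpy as np

def project_row_to_simplex(e):
    """Euclidean projection of a row vector onto {x : x >= 0, sum(x) = 1}."""
    srt = np.sort(e)[::-1]                    # e_1 >= e_2 >= ... >= e_n
    css = np.cumsum(srt)                      # prefix sums of sorted entries
    j = np.arange(1, len(e) + 1)
    gamma = np.count_nonzero(srt + (1.0 - css) / j >= 0)   # active-set size
    beta = (1.0 - css[gamma - 1]) / gamma                  # shared shift
    return np.maximum(e + beta, 0.0)          # clip negatives to zero

row = np.array([0.6, 0.3, 0.2])               # a row of A~ - eta * grad
proj = project_row_to_simplex(row)            # all entries shift by -0.1/3
proj2 = project_row_to_simplex(np.array([1.5, -0.5, 0.2]))   # -> [1, 0, 0]
```

Each projected row sums to one and stays nonnegative, so the updated Ã remains right stochastic as required by constraint (2).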
Compared to the adversarial training method on graph embeddings (Bose & Hamilton, 2019b), which requires a hyperparameter to control the fairness-utility tradeoff, we find that regulating the convergence of Ã has a similar effect: the more Ã changes, the further it moves from the original graph connections, and consequently, the more fairness is promoted at the expense of utility.

6. EXPERIMENTS

We present an empirical analysis on six real-world datasets, comparing against baseline methods in terms of seven evaluative metrics covering both fairness and utility. Moving toward applications, we verify that our method can enhance the diversity of recommendations. Due to space limitations, we showcase only partial results and defer the rest to Appendix D. Moreover, as an intermediate result, we demonstrate in Appendix E that the vertex representations are embedded more fairly, as assessed by fair clustering.

6.1. SETTINGS

Datasets. We conduct experiments on real-world social and citation networks, including Oklahoma97, UNC28 (Traud et al., 2011), Facebook#1684, Cora, Citeseer, and Pubmed. Oklahoma97 and UNC28 are two school social networks: a link represents a friendship relation in social media, and every user has a profile providing vertex features, including student/faculty status, gender (the sensitive attribute), major, etc. Facebook#1684 is a social ego network from the Facebook app. In the three citation networks, each vertex represents an article with bag-of-words descriptions as features, a link stands for a citation regardless of direction, and we set the category of an article as the sensitive attribute. Statistics for the datasets are summarized in Table 1, where #Class is the number of sensitive groups, and the #Intra/Inter Ratio is the number of actual intra/inter links divided by the number of links the graph would have if fully connected; these two terms show the density of intra and inter links. Dis. Ratio (short for disparity ratio) is calculated by dividing the inter ratio by the intra ratio: a Dis. Ratio of one implies that intra and inter connections are perfectly balanced in the existing graph, and the degree of deviation from 1 indicates how skewed the link connections are. Baselines and Protocols. We involve four baseline methods. The variational graph autoencoder (VGAE) (Kipf & Welling, 2016b) inherits from the variational autoencoder, uses two GNN layers as the inference model, and leverages latent variables to reconstruct the graph connections. Node2vec (Grover & Leskovec, 2016) is a widely used graph embedding approach based on random walks. Fairwalk (Rahman et al., 2019) is built upon node2vec and designed specifically for fairness issues; it modifies a vertex's transition probabilities according to the sensitive attributes of its neighbors.
The last is adversarial training on vertex representations (Bose & Hamilton, 2019b), which aims to minimize the discrepancy between different sensitive groups by optimizing the GNN parameters. Besides the standard pipeline for utility, it additionally trains the network to confuse a discriminator, while training the discriminator to distinguish embedded features with different sensitive attributes. A hyperparameter λ in the overall objective balances the tradeoff between utility and fairness; we vary λ in the experiments and compare against the various results given by adversarial training. For all experiments, we randomly remove 10% of the links from the graph and reserve them for evaluation, and an equal number of false links is sampled in the evaluation phase. For each dataset, we repeat the experiments 20 times with different train/test splits. Full experimental configurations are available in Appendix C. Metrics. We evaluate the utility of link prediction using the Area Under the Curve (AUC) and Average Precision (AP). Fairness is evaluated via ∆_DP, as well as the disparities in expected score on all true samples (∆_true) and false samples (∆_false). In addition, following suggested fairness notions (Hardt et al., 2016), we compute the maximum gaps in true negative rate (TNR) and false negative rate (FNR). Illustratively, let the conditional cumulative distribution function of the score R at a threshold τ given a label be written as F^s_y(τ) := Pr(R ≤ τ | Y = y, S = s), y ∈ {0, 1}, s ∈ {intra, inter}. With these notations, the maximum gap in true negative rate is ∆_TNR := max_τ |F^intra_0(τ) − F^inter_0(τ)|, and similarly for the false negative rate, ∆_FNR := max_τ |F^intra_1(τ) − F^inter_1(τ)|. These two terms also reflect the disparities in true positive rate (TPR) and false positive rate (FPR), since TPR = 1 − FNR and FPR = 1 − TNR.
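The threshold-maximized gaps ∆_TNR and ∆_FNR are Kolmogorov-Smirnov-style statistics between the intra and inter score distributions on one label slice. A minimal sketch with toy scores and our own names:

```python
import numpy as np

def max_cdf_gap(scores_a, scores_b):
    """max_tau |F_a(tau) - F_b(tau)| between two empirical score CDFs."""
    taus = np.concatenate([scores_a, scores_b])   # candidate thresholds
    F_a = np.searchsorted(np.sort(scores_a), taus, side='right') / len(scores_a)
    F_b = np.searchsorted(np.sort(scores_b), taus, side='right') / len(scores_b)
    return np.max(np.abs(F_a - F_b))

# toy scores on true (label-1) links: intra vs inter distributions
intra = np.array([0.9, 0.8, 0.7, 0.6])
inter = np.array([0.5, 0.4, 0.3, 0.2])
delta_fnr = max_cdf_gap(intra, inter)   # fully separated distributions -> gap 1

# overlapping distributions yield a smaller gap
delta_mixed = max_cdf_gap(np.array([0.2, 0.6]), np.array([0.3, 0.7]))
```

Evaluating the CDFs only at the observed scores suffices, since the empirical CDF difference can change only at those points.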

6.2. RESULTS AND ANALYSIS

Tables 2 and 3 list quantitative results on UNC28 and Citeseer compared with VGAE, node2vec, and Fairwalk. Two choices of T_2 are presented: a smaller T_2 pursues utility comparable to the random-walk-based methods (slightly lower AUC but higher AP) while performing much better on fairness, and we also present T_2 = 20, at which we observe convergence of the adjacency matrix. The results indicate that FairAdj achieves the best performance on various statistics of dyadic fairness with only a small sacrifice in predictive utility. A Better Tradeoff. Figure 1 plots every experimental result for our method and for adversarial training under various fairness-utility hyperparameters. Two observations explain why our method surpasses the adversarial technique. (1) The blue dots lie closer to the top-left corner than the red ones on the whole, meaning the same level of fairness is achieved with less sacrifice in utility. One explanation for this favorable property is that adversarial training neglects the graph connections and only diminishes the group discrepancy, so two irrelevant instances with no connection but from different groups may be mapped close together, greatly damaging utility. The optimization of Ã does facilitate feature smoothing across groups beyond what the original adjacency matrix indicates, but it still respects the graph connections. (2) Additionally, the blue dots are more tightly clustered, suggesting that our method escapes the instability of min-max optimization and is more robust to different train/test splits. However, as shown, FairAdj cannot achieve arbitrarily small ∆_DP as the red dots do; this is the first potential limitation indicated in Section 4. Diversity in Recommendations. We examine the top-scored links in evaluation at a given proportion in terms of diversity and utility, as shown in Figure 2.
This exploration is useful when making recommendations by scores in descending order. For a fixed proportion, we report diversity as the number of inter links divided by the number of intra links, and utility as the recall rate among these recommendations. The figures show that, as a direct by-product of dyadic fairness, FairAdj enhances the diversity of recommendations, though at some sacrifice of utility.

7. CONCLUSION

We studied dyadic fairness in graph-structured data. We theoretically analyzed how graph connections affect dyadic fairness under demographic parity when employing graph neural networks for representation learning. On the basis of the foregoing analysis, we proposed FairAdj to learn a fair adjacency matrix, pursuing dyadic fairness and prediction utility simultaneously. Empirical validations demonstrated the achievement of fairness and a better fairness-utility tradeoff. In the Appendix, we present proofs in Section A, illustrative diagrams for Theorem 4.1 in Section B, experimental configurations in Section C, deferred results in Section D, and the demonstration of fair vertex representations in Section E.

A PROOF

Proposition 4.1. For a link prediction function $g(\cdot,\cdot)$ modeled as an inner product $g(v,u) = v^\top \Sigma u$, where $\Sigma \in \mathbb{S}^{M}_{++}$ is a positive-definite matrix, suppose $\exists Q > 0$ such that $\forall v \sim \mathcal{V}$, $\|v\|_2 \le Q$. For dyadic fairness based on demographic parity, if $\|\mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_1]\|_2 \le \delta$, then
$$\Delta_{DP} := \big|\mathbb{E}_{(v,u)\sim\mathcal{U}\times\mathcal{U}}[g(v,u) \mid S(v) = S(u)] - \mathbb{E}_{(v,u)\sim\mathcal{U}\times\mathcal{U}}[g(v,u) \mid S(v) \ne S(u)]\big| \le Q\,\|\Sigma\|_2\,\delta. \quad (2)$$

Proof. To simplify notation, let $p := \mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_0] \in \mathbb{R}^M$ and $q := \mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_1] \in \mathbb{R}^M$ denote the expected representations of $S_0$ and $S_1$ respectively, and let $\alpha := |S_0|^2/(|S_0|^2 + |S_1|^2)$ and $\beta := |S_1|^2/(|S_0|^2 + |S_1|^2)$. Then
$$|\mathbb{E}_{\text{intra}} - \mathbb{E}_{\text{inter}}| = \big|\mathbb{E}[v^\top\Sigma u \mid (v \in S_0, u \in S_0) \vee (v \in S_1, u \in S_1)] - \mathbb{E}[v^\top\Sigma u \mid v \in S_0, u \in S_1]\big|$$
$$= \big|\alpha\, p^\top\Sigma p + \beta\, q^\top\Sigma q - p^\top\Sigma q\big| = \big|(q-p)^\top(\alpha\Sigma p - \beta\Sigma q)\big|$$
$$\le \|q-p\|_2 \cdot \|\alpha\Sigma p - \beta\Sigma q\|_2 \le \delta\,\|\Sigma\|_2\,(\alpha\|p\|_2 + \beta\|q\|_2) \le Q\,\|\Sigma\|_2\,\delta,$$
which completes the proof. The first inequality above is due to Cauchy-Schwarz, and the second follows from the triangle inequality and the definition of the spectral norm.
The final bound uses $\alpha\|p\|_2 + \beta\|q\|_2 \le Q$, which holds by Jensen's inequality: if $\forall v \in \mathcal{V}$, $\|v\|_2 \le Q$, then $\|\mathbb{E}[v]\|_2 \le \mathbb{E}[\|v\|_2] \le Q$.

Theorem 4.1. For an arbitrary graph with nonnegative link weights, after conducting one mean-aggregation over the graph, the representation discrepancy between the two sensitive groups,
$$\Delta^{\mathrm{Aggr}}_{DP} := \big\|\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_1]\big\|_2,$$
is bounded by
$$\max\{\alpha_{\min}\|\mu_0 - \mu_1\|_\infty - 2\sigma,\ 0\} \;\le\; \Delta^{\mathrm{Aggr}}_{DP} \;\le\; \alpha_{\max}\|\mu_0 - \mu_1\|_2 + 2\sqrt{M}\sigma,$$
where $\alpha_{\min} = \min\{\alpha_1, \alpha_2\}$, $\alpha_{\max} = \max\{\alpha_1, \alpha_2\}$, $\alpha_1 = \big|1 - \frac{m_w}{D_{\max}}\big(\frac{1}{|S_0|} + \frac{1}{|S_1|}\big)\big|$, and $\alpha_2 = \big|1 - \frac{|\tilde{S}_0|}{|S_0|} - \frac{|\tilde{S}_1|}{|S_1|}\big|$.

Proof. The feature representation of $v$ after one mean-aggregation is
$$\mathrm{Agg}(v) = \frac{1}{\deg_w(v)} \sum_{u\in\Gamma(v)} a_{vu}\, u = \frac{1}{\deg_w(v)} \Big(\sum_{u\in\Gamma(v)\cap S_0} a_{vu}\, u + \sum_{u\in\Gamma(v)\cap S_1} a_{vu}\, u\Big),$$
where we separate the summation over neighbor features into two parts according to the sensitive attribute. We use bracket notation to abbreviate the range of a vector: if $\mu - \sigma\cdot\mathbf{1} \le u \le \mu + \sigma\cdot\mathbf{1}$ entrywise, we write $u \in [\mu \pm \sigma\cdot\mathbf{1}]$, where $\mathbf{1}$ is the all-one vector of proper size.

Published as a conference paper at ICLR 2021

Consider the unilateral case $v \in S_0$. We have
$$\mathrm{Agg}(v) \in \Big[\sum_{u\in\Gamma(v)\cap S_0} \frac{a_{vu}}{\deg_w(v)}\, \mu_0 + \sum_{u\in\Gamma(v)\cap S_1} \frac{a_{vu}}{\deg_w(v)}\, \mu_1 \pm \sigma\cdot\mathbf{1}\Big] \subseteq \Big[\mu_0 + \sum_{u\in\Gamma(v)\cap S_1} \frac{a_{vu}}{\deg_w(v)}\, (\mu_1 - \mu_0) \pm \sigma\cdot\mathbf{1}\Big].$$
The first step is due to the fact that each $u \in S_0$ lies in $[\mu_0 \pm \sigma\cdot\mathbf{1}]$ and each $u \in S_1$ lies in $[\mu_1 \pm \sigma\cdot\mathbf{1}]$; the second is by the definition of the weighted degree. Writing $\beta_v = \sum_{u\in\Gamma(v)\cap S_{\mathrm{opp}(v)}} a_{vu} / \deg_w(v)$, where $S_{\mathrm{opp}(v)}$ is the sensitive group opposite to the one $v$ belongs to, the expectation of $\mathrm{Agg}(v)$ over $S_0$ is
$$\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_0] \in \Big[\frac{1}{|S_0|}\sum_{v\in S_0}\big(\mu_0 + \beta_v(\mu_1 - \mu_0)\big) \pm \sigma\cdot\mathbf{1}\Big] = \Big[\mu_0 + \frac{1}{|S_0|}\sum_{v\in S_0}\beta_v(\mu_1 - \mu_0) \pm \sigma\cdot\mathbf{1}\Big],$$
and for $v \in S_1$,
$$\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_1] \in \Big[\mu_1 + \frac{1}{|S_1|}\sum_{v\in S_1}\beta_v(\mu_0 - \mu_1) \pm \sigma\cdot\mathbf{1}\Big].$$
Based on the above two ranges, the gap in expectation between the two groups after one mean-aggregation layer satisfies
$$\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_1] \in \Big[\Big(1 - \frac{1}{|S_0|}\sum_{v\in S_0}\beta_v - \frac{1}{|S_1|}\sum_{v\in S_1}\beta_v\Big)(\mu_0 - \mu_1) \pm 2\sigma\cdot\mathbf{1}\Big].$$
Next we study the range of $\alpha' := 1 - \big(|S_0|^{-1}\sum_{v\in S_0}\beta_v + |S_1|^{-1}\sum_{v\in S_1}\beta_v\big)$. First consider $|S_0|^{-1}\sum_{v\in S_0}\beta_v$. Since $\deg_w(v) \le D_{\max}$ for all $v \in \mathcal{V}$,
$$\sum_{v\in S_0}\beta_v = \sum_{v\in S_0}\sum_{u\in\Gamma(v)\cap S_1}\frac{a_{vu}}{\deg_w(v)} \ge \frac{1}{D_{\max}}\sum_{v\in S_0}\sum_{u\in\Gamma(v)\cap S_1} a_{vu} = \frac{m_w}{D_{\max}}.$$
For nonnegative weights, $D_{\max} \ge \deg_w(v) = \sum_{u\in\Gamma(v)\cap S_0} a_{vu} + \sum_{u\in\Gamma(v)\cap S_1} a_{vu} \ge \sum_{u\in\Gamma(v)\cap S_1} a_{vu}$, which means $\beta_v \le 1$ for $v \in S_0$; thus $\sum_{v\in S_0}\beta_v = \sum_{v\in \tilde{S}_0}\beta_v \le |\tilde{S}_0|$. The first equality holds because $\beta_v = 0$ for $v \in S_0 \setminus \tilde{S}_0$, i.e., when $v$ has no inter-edges. Since the analysis for $S_1$ is identical, we obtain, for $i = 0, 1$,
$$\frac{1}{|S_i|}\cdot\frac{m_w}{D_{\max}} \;\le\; \frac{1}{|S_i|}\sum_{v\in S_i}\beta_v \;\le\; \frac{|\tilde{S}_i|}{|S_i|}.$$
Based on these results, the range of $\alpha'$ is
$$\alpha' \in \Big[\,1 - \Big(\frac{|\tilde{S}_0|}{|S_0|} + \frac{|\tilde{S}_1|}{|S_1|}\Big),\ \ 1 - \frac{m_w}{D_{\max}}\Big(\frac{1}{|S_0|} + \frac{1}{|S_1|}\Big)\Big].$$
Let $\alpha_{\min}$ and $\alpha_{\max}$ be the smaller and larger of the two endpoint magnitudes:
$$\alpha_{\max} = \max\Big\{\Big|1 - \frac{|\tilde{S}_0|}{|S_0|} - \frac{|\tilde{S}_1|}{|S_1|}\Big|,\ \Big|1 - \frac{m_w}{D_{\max}}\Big(\frac{1}{|S_0|} + \frac{1}{|S_1|}\Big)\Big|\Big\}, \qquad \alpha_{\min} = \min\{\cdot,\cdot\}.$$
This yields the upper bound on $\Delta^{\mathrm{Aggr}}_{DP}$:
$$\Delta^{\mathrm{Aggr}}_{DP} \le \alpha_{\max}\|\mu_0 - \mu_1\|_2 + 2\sqrt{M}\sigma, \quad (8)$$
where the second term on the right-hand side is due to $\|2\sigma\cdot\mathbf{1}\|_2 = 2\sqrt{M}\sigma$. For the lower bound, consider the $i$-th entries of $\mu_0$ and $\mu_1$, denoted $\mu_0^i$ and $\mu_1^i$ respectively. The $i$-th entry of $\mathbb{E}[\mathrm{Agg}(v) \mid v \in S_0] - \mathbb{E}[\mathrm{Agg}(v) \mid v \in S_1]$ has magnitude at least $|\alpha'(\mu_0^i - \mu_1^i)| - 2\sigma$ whenever $|\alpha'(\mu_0^i - \mu_1^i)| \ge 2\sigma$. Thus we obtain the lower bound
$$\Delta^{\mathrm{Aggr}}_{DP} \ge \max\{\alpha_{\min}\|\mu_0 - \mu_1\|_\infty - 2\sigma,\ 0\},$$
which completes the proof.
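To make the bounds of Theorem 4.1 concrete, here is a small numerical example, a toy graph of our own with two intra-group cliques and a single inter-group edge, checking that the aggregated discrepancy indeed falls between the two bounds:

```python
import numpy as np

# Toy graph: S0 = {0,1,2}, S1 = {3,4,5}; self-loops, intra-group cliques,
# and a single inter-group edge (0,3); all weights equal 1.
groups = np.array([0, 0, 0, 1, 1, 1])
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 3)]:
    A[i, j] = A[j, i] = 1.0
np.fill_diagonal(A, 1.0)

# Features sit exactly at their group means (entrywise deviation <= sigma)
M, sigma = 2, 0.5
mu0, mu1 = np.zeros(M), np.full(M, 4.0)
X = np.where(groups[:, None] == 0, mu0, mu1)

deg = A.sum(1)
Agg = (A / deg[:, None]) @ X                     # one mean-aggregation
gap = np.linalg.norm(Agg[groups == 0].mean(0) - Agg[groups == 1].mean(0))

inter = groups[:, None] != groups[None, :]
m_w = (A * inter).sum() / 2                      # total weight on inter-edges
D_max = deg.max()
t0 = np.any((A * inter)[groups == 0] > 0, axis=1).mean()   # |S~_0| / |S_0|
t1 = np.any((A * inter)[groups == 1] > 0, axis=1).mean()   # |S~_1| / |S_1|
alpha1 = abs(1 - m_w / D_max * (1 / 3 + 1 / 3))  # = 5/6
alpha2 = abs(1 - t0 - t1)                        # = 1/3
upper = max(alpha1, alpha2) * np.linalg.norm(mu0 - mu1) + 2 * np.sqrt(M) * sigma
lower = max(min(alpha1, alpha2) * np.abs(mu0 - mu1).max() - 2 * sigma, 0.0)
assert lower <= gap <= upper                     # 0.33 <= 4.71 <= 6.13
```

With only one inter-edge, both interval endpoints of $\alpha'$ are positive, so the lower bound is informative; on denser cross-group graphs the interval can straddle zero, which is exactly the regime discussed in Section 4 and Figure 5.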
B COMPLEMENTARY DIAGRAMS TO THEOREM 4.1

We provide diagrams to help better understand the bounds in Theorem 4.1. Figure 3 shows a common case in which the gap in expectation between the two sensitive groups shrinks after mean-aggregation; there the maximal deviation term $\sigma$ can be neglected since it is much smaller than the expectation gap. Figure 4 shows a case in which $\sigma$ is not negligible relative to the expectation gap: with $\sigma = 100$ and an initial gap of 0, the gap after aggregation becomes 20, showing that the discrepancy in representations can increase. Figure 5 shows another case in which the contraction coefficient $\alpha'$ equals 1 in magnitude due to the resistance of $\alpha_2$: all vertices possess inter-links and the graph is a complete bipartite graph, so aggregation fully exchanges the sensitive information and the representation discrepancy remains unchanged. The cases in Figures 4 and 5 are also pointed out by the analysis in Section 4.

C EXPERIMENTAL CONFIGURATIONS

For all experiments, we set $T_1 = 50$, and the total number of epochs (each containing $T_1$ and $T_2$ inner iterations) is 4. The graph neural networks have two hidden layers, of sizes 32 and 16 respectively. $\eta_\theta$ is set to 0.01, and $\eta_A$ is set per dataset: Oklahoma97: 0.1; UNC28: 0.1; Cora: 0.2; Citeseer: 0.5. Experiments are conducted on an Nvidia Titan RTX graphics card.
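For concreteness, the role of the $T_2$ inner loop, projected-gradient updates on the existing edge weights, can be sketched as below. This is a simplified toy, not the paper's implementation: the fairness loss, the finite-difference gradients, and the box projection (keeping existing edges bounded away from zero so degrees stay positive) are all our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
groups = np.array([0] * 4 + [1] * 4)
X = rng.normal(size=(n, d))

# Random undirected graph with self-loops; only existing edges are reweighted
A = (rng.uniform(size=(n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)
mask = A > 0

def fairness_loss(A):
    """Squared group discrepancy of mean-aggregated features (a stand-in
    for the fairness objective on vertex representations)."""
    Z = (A / A.sum(1, keepdims=True)) @ X
    diff = Z[groups == 0].mean(0) - Z[groups == 1].mean(0)
    return float(diff @ diff)

def num_grad(f, A, eps=1e-5):
    """Finite-difference gradient of f w.r.t. the entries of A."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            old = A[i, j]
            A[i, j] = old + eps; hi = f(A)
            A[i, j] = old - eps; lo = f(A)
            A[i, j] = old
            G[i, j] = (hi - lo) / (2 * eps)
    return G

# T2 = 20 projected-gradient steps on A with step size eta_A = 0.1;
# non-edges stay zero, existing edges are floored at a small positive value
f0 = fairness_loss(A)
for _ in range(20):
    step = A - 0.1 * num_grad(fairness_loss, A)
    A = np.where(mask, np.clip(step, 0.05, None), 0.0)
assert fairness_loss(A) <= f0
```

In the full method these steps alternate with $T_1$ updates of the GNN parameters under the utility objective, which is what trades fairness off against predictive accuracy.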

D ADDITIONAL RESULTS

We present experimental results for Citeseer and UNC28 in this section. All results support the same conclusions stated in the main body of this paper. Additionally, we include another dataset, Facebook#1684, in response to the second limitation indicated in Section 4: here $\Delta_{DP}$, $\Delta_{true}$, and $\Delta_{false}$ are already small under VGAE, and FairAdj is not able to further reduce the gap.

To quantify fairness in vertex representations (Figure 8), we conduct K-means clustering on the vertex representations and evaluate, within each cluster, the ratio of samples from different sensitive groups; this ratio is called balance. We vary the number of clusters from 4 to 8 and report the average balance across all clusters. In general, the higher the balance, the fairer the vertex representations. Overall, the FairAdj series achieves a higher balance, which demonstrates invariant vertex representations across sensitive groups.
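The balance statistic can be computed as follows once cluster assignments are available (from K-means or otherwise). We use the common min/max-ratio definition per cluster; this exact formula is our assumption, since the text above does not restate it.

```python
import numpy as np

def cluster_balance(cluster_ids, groups):
    """Average balance over clusters, where a cluster's balance is
    min(#group0, #group1) / max(#group0, #group1); higher is fairer."""
    balances = []
    for c in np.unique(cluster_ids):
        g = groups[cluster_ids == c]
        n0, n1 = int(np.sum(g == 0)), int(np.sum(g == 1))
        balances.append(min(n0, n1) / max(max(n0, n1), 1))
    return float(np.mean(balances))
```

Under this definition, a perfectly mixed cluster scores 1 and a single-group cluster scores 0.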



Proofs for the proposition and theorem are in Appendix A.



Figure 1: Comparison with the adversarial training method (Bose & Hamilton, 2019b) in terms of the tradeoff between utility and fairness. Left: UNC28; Right: Citeseer. Blue points denote FairAdj with different $T_2$ values; red points represent (Bose & Hamilton, 2019b) with different $\lambda$ values.

Figure 2: Diversity and utility in recommendations. Left: UNC28; Right: Citeseer. The x-axis 'proportion' means we investigate the top x% highest-scored links; the y-axis shows the ratio between inter- and intra-group links.

Figure 3: An illustrative graph example with two protected groups $S_0$ and $S_1$. All vertices have self-loops. The expectation gap shrinks after mean-aggregation. Here $|\mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_1]| = 20$, $\sigma = 2$, and all link weights are equal. After aggregation, $|\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_1]| = |6.15 - (-6.15)| = 12.3 < 20$.

Figure 4: Case 1: the maximal deviation term $O(\sigma)$ is not negligible. Here $\sigma = 100$, all link weights are equal, and all vertices have self-loops. $|\mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[v \mid v \in S_1]| = 0$, but after mean-aggregation, $|\mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_0] - \mathbb{E}_{v\sim\mathcal{U}}[\mathrm{Agg}(v) \mid v \in S_1]| = |5 - 25| = 20 > 0$.

Figure 5: Case 2: the contraction coefficient $\alpha'$ equals 1 in magnitude. This happens when the graph is a complete bipartite graph: mean-aggregation fully exchanges the sensitive information, and the gap between the two groups remains unchanged.

Figure 6: Comparison with adversarial training on vertex representations. Left: Oklahoma97; Right: Cora.

Figure 7: Diversity and utility in recommendations. Left: Oklahoma97; Right: Cora.

Figure 8: Evaluations on balance of clusters. Left: Oklahoma97; Right: UNC28.

Statistics for the datasets used in experiments.

Experimental results on UNC28.
FairAdj (smaller $T_2$): ± 0.54 | 87.75 ± 0.65 | 1.53 ± 0.35 | 0.32 ± 0.29 | 0.41 ± 0.35 | 2.84 ± 0.74 | 2.22 ± 0.68
FairAdj $T_2$=20: 87.04 ± 0.55 | 87.80 ± 0.65 | 1.57 ± 0.36 | 0.34 ± 0.31 | 0.42 ± 0.35 | 2.76 ± 0.75 | 2.16 ± 0.73

A larger $T_2$ incurs more damage in utility but more enhancement in fairness. Thanks to this favorable property, in our experiments, compared with adversarially removing sensitive attributes, we present multiple options for $T_2$ to control convergence and observe a more favorable fairness-utility tradeoff with our method.

Experimental results on Citeseer.
VGAE: 81.77 ± 1.23 | 85.57 ± 1.39 | 11.24 ± 1.83 | 3.37 ± 2.33 | 2.14 ± 1.26 | 10.81 ± 3.61 | 11.31 ± 2.99
node2vec: 81.21 ± 1.35 | 84.69 ± 1.26 | 14.49 ± 3.38 | 4.02 ± 2.74 | 6.82 ± 4.62 | 7.37 ± 3.07 | 12.50 ± 4.21
Fairwalk: 81.69 ± 1.50 | 84.97 ± 1.23 | 13.50 ± 2.97 | 3.30 ± 2.49 | 5.33 ± 3.83 | 7.26 ± 3.30 | 11.34 ± 3.19
FairAdj $T_2$=2: 80.45 ± 1.34 | 84.47 ± 1.43 | 9.57 ± 1.84 | 2.55 ± 2.02 | 1.70 ± 1.47 | 9.87 ± 3.17 | 10.27 ± 3.23
FairAdj $T_2$=20: 78.84 ± 1.38 | 82.74 ± 1.46 | 7.81 ± 1.80 | 1.85 ± 1.66 | 1.30 ± 1.41 | 9.22 ± 2.95 | 10.01 ± 3.13

Experimental results on Oklahoma97.
VGAE: 90.13 ± 0.32 | 91.24 ± 0.37 | 8.73 ± 0.38 | 8.56 ± 0.44 | 0.40 ± 0.32 | 36.51 ± 1.41 | 2.26 ± 0.92
node2vec: 86.49 ± 0.35 | 84.09 ± 0.50 | 7.23 ± 0.64 | 3.35 ± 0.45 | 1.08 ± 0.97 | 32.55 ± 1.32 | 2.36 ± 0.69
Fairwalk: 86.56 ± 0.32 | 84.23 ± 0.44 | 7.31 ± 0.62 | 3.49 ± 0.47 | 1.13 ± 0.85 | 32.77 ± 1.20 | 2.18 ± 0.69
FairAdj $T_2$=5: 84.92 ± 0.81 | 85.07 ± 0.92 | 3.60 ± 0.35 | 0.40 ± 0.32 | 0.33 ± 0.28 | 4.00 ± 0.88 | 2.02 ± 0.76
FairAdj $T_2$=20: 81.01 ± 1.01 | 80.79 ± 0.93 | 2.96 ± 0.30 | 0.38 ± 0.31 | 0.32 ± 0.25 | 5.61 ± 1.06 | 2.03 ± 0.92

Experimental results on Cora.
VGAE: 88.48 ± 0.88 | 90.81 ± 0.78 | 26.74 ± 1.51 | 9.99 ± 2.32 | 10.26 ± 1.59 | 28.25 ± 4.46 | 26.71 ± 3.83
node2vec: 87.93 ± 0.75 | 87.82 ± 1.06 | 39.99 ± 2.75 | 6.63 ± 3.58 | 27.86 ± 4.94 | 23.66 ± 4.73 | 32.96 ± 5.24
Fairwalk: 88.04 ± 0.84 | 88.10 ± 1.20 | 40.49 ± 2.58 | 7.30 ± 3.28 | 29.43 ± 4.86 | 23.74 ± 4.19 | 33.79 ± 5.08
FairAdj $T_2$=5: 86.00 ± 1.12 | 88.32 ± 0.86 | 21.05 ± 1.26 | 6.99 ± 2.24 | 6.14 ± 1.59 | 20.72 ± 3.62 | 19.46 ± 3.62
FairAdj $T_2$=20: 83.85 ± 1.07 | 86.08 ± 0.93 | 17.87 ± 1.18 | 5.40 ± 2.23 | 3.74 ± 1.46 | 16.75 ± 4.87 | 15.37 ± 3.84

Experimental results on Pubmed.
VGAE: 91.20 ± 0.85 | 91.26 ± 0.80 | 20.88 ± 1.48 | 4.19 ± 0.93 | 8.04 ± 1.83 | 12.01 ± 2.92 | 19.18 ± 4.16
node2vec: 74.27 ± 1.23 | 79.24 ± 1.29 | 19.14 ± 0.93 | 3.38 ± 2.57 | 8.90 ± 2.56 | 6.65 ± 2.21 | 10.91 ± 1.88
Fairwalk: 73.43 ± 1.11 | 78.96 ± 1.24 | 18.42 ± 1.65 | 3.11 ± 1.84 | 7.79 ± 3.49 | 6.61 ± 2.28 | 10.93 ± 2.54
FairAdj $T_2$=5: 88.64 ± 1.09 | 88.21 ± 1.22 | 16.06 ± 0.98 | 1.96 ± 0.82 | 4.40 ± 1.28 | 8.93 ± 2.90 | 12.75 ± 1.56
FairAdj $T_2$=20: 87.53 ± 1.03 | 87.10 ± 1.17 | 14.73 ± 0.98 | 1.39 ± 0.92 | 3.17 ± 1.10 | 9.09 ± 2.10 | 10.46 ± 1.73

Experimental results on Facebook#1684.
VGAE: ± .55 | 93.91 ± .68 | 2.03 ± .81 | 0.59 ± .49 | 0.90 ± .57 | 4.48 ± 1.57 | 4.94 ± 1.32
node2vec: 90.57 ± .74 | 85.61 ± 1.09 | 1.70 ± 1.43 | 0.52 ± .49 | 2.47 ± 1.52 | 6.51 ± 2.04 | 5.06 ± 1.36
Fairwalk: 90.56 ± .63 | 85.58 ± .87 | 1.97 ± 1.51 | 0.62 ± .47 | 2.14 ± 1.77 | 6.92 ± 2.19 | 5.03 ± 1.46
FairAdj $T_2$=1: 94.68 ± .48 | 93.94 ± .62 | 2.02 ± .82 | 0.60 ± .50 | 0.93 ± .60 | 4.42 ± 1.57 | 4.82 ± 1.54
FairAdj $T_2$=20: 94.63 ± .49 | 93.84 ± .64 | 1.77 ± .81 | 0.53 ± .41 | 0.92 ± .49 | 5.00 ± 1.52 | 4.86 ± 1.41

E FAIR VERTEX REPRESENTATION

As an intermediate result, we inspect the fairness of the vertex representations in Figure 8.

ACKNOWLEDGEMENT

We would like to thank Zizhang Chen and Wei Lu for the helpful discussions, and Lizi Liao for providing the Oklahoma97/UNC28 datasets. This work is partially supported by NSF OAC 1920147. 

