DEBIASED GRAPH NEURAL NETWORKS WITH AGNOSTIC LABEL SELECTION BIAS

Abstract

Most existing Graph Neural Networks (GNNs) are proposed without considering the selection bias in data, i.e., the inconsistent distribution between the training set and the test set. In reality, the test data are not even available during training, making the selection bias agnostic. Training GNNs on biasedly selected nodes leads to significant parameter estimation bias and greatly impacts the generalization ability on test nodes. In this paper, we first present an experimental investigation, which clearly shows that selection bias drastically hinders the generalization ability of GNNs, and theoretically prove that selection bias causes biased estimation of GNN parameters. Then, to remove the bias in GNN estimation, we propose a novel Debiased Graph Neural Network (DGNN) with a differentiated decorrelation regularizer. The differentiated decorrelation regularizer estimates a sample weight for each labeled node such that the spurious correlation among learned embeddings can be eliminated. We analyze the regularizer from a causal view, which motivates us to differentiate the weights of the variables based on their contribution to the confounding bias. These sample weights are then used to reweight GNNs to eliminate the estimation bias, thus helping to improve the stability of prediction on unknown test nodes. Comprehensive experiments are conducted on several challenging graph datasets with two kinds of label selection bias. The results verify that our proposed model outperforms state-of-the-art methods, and that DGNN is a flexible framework for enhancing existing GNNs.

1. INTRODUCTION

Graph Neural Networks (GNNs) are powerful deep learning algorithms on graphs with various applications (Scarselli et al., 2008; Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017). Existing GNNs mainly learn a node embedding by aggregating the features of its neighbors, and such a message-passing framework is supervised by node labels in an end-to-end manner. During this training procedure, GNNs effectively learn the correlation of the structure pattern and node features with the node label, so that they are capable of learning the embeddings of new nodes and inferring their labels. One basic requirement for GNNs to make precise predictions on unseen test nodes is that the distributions of labeled training nodes and test nodes are the same, i.e., the structure and features of labeled training and test nodes follow a similar pattern, so that the learned correlation between the current graph and the labels can generalize well to the new nodes. However, in reality, there are two inevitable issues. (1) Because it is difficult to collect graphs in an unbiased environment, the relationship between a collected real-world graph and its labeled nodes is inevitably biased, and training on such a graph will cause biased correlations with node labels. Taking a scientist collaboration network as an example, if most scientists with the "machine learning" (ML) label collaborate with those with the "computer vision" (CV) label, existing GNNs may learn the spurious correlation that scientists who cooperate with CV scientists are ML scientists. If a new ML scientist connects only with ML scientists, or only with scientists in other areas, he or she will probably be misclassified. (2) The test nodes in real scenarios are usually not available, implying that the distribution of new nodes is agnostic. Once that distribution is inconsistent with the distribution of the training nodes, the performance of all current GNNs will be hindered.
Although transfer learning is able to address the distribution shift problem, it still needs a prior on the test distribution, which cannot be obtained beforehand. Therefore, agnostic label selection bias greatly affects the generalization ability of GNNs on unknown test data. To observe selection bias in real graph data, we conduct an experimental investigation to validate its effect on GNNs (details in Section 2.1). We select training nodes with different degrees of bias for each dataset, making the distributions of training and test nodes inconsistent. The results clearly show that selection bias drastically hinders the performance of GNNs on unseen test nodes; moreover, the heavier the bias, the larger the performance drop. Further, we theoretically analyze how data selection bias results in estimation bias in GNN parameters (details in Section 2.2). Based on the stable learning technique (Kuang et al., 2020), we assume that the learned embeddings consist of two parts: stable variables and unstable variables. Data selection bias causes a spurious correlation between these two kinds of variables, and we prove that, combined with inevitable model misspecification, this spurious correlation further causes parameter estimation bias. Once this weakness of current GNNs under selection bias is identified, one natural question arises: how can we remove the estimation bias in GNNs? In this paper, we propose a novel Debiased Graph Neural Network (DGNN) framework for stable graph learning, which jointly optimizes a differentiated decorrelation regularizer and a weighted GNN model. Specifically, the differentiated decorrelation regularizer learns a set of sample weights under differentiated variable weights, so that the spurious correlation between stable and unstable variables can be greatly eliminated.
Based on a causal-view analysis of the decorrelation regularizer, we theoretically prove that the weights of variables can be differentiated by the regression weights. Moreover, to better combine the decorrelation regularizer with GNNs, we prove that adding the regularizer to the embedding learned by the second-to-last layer is both theoretically sound and flexible. The sample weights learned by the decorrelation regularizer are then used to reweight the GNN loss so that the parameter estimation becomes unbiased. In summary, the contributions of this paper are three-fold: i) We investigate a new problem of learning GNNs with agnostic label selection bias; the problem setting is general and practical for real applications. ii) We bring the idea of variable decorrelation into GNNs to relieve the influence of bias on model learning and propose a general framework, DGNN, which can be adopted by various GNNs. iii) We conduct experiments on real-world graph benchmarks with two kinds of agnostic label selection bias, and the experimental results demonstrate the effectiveness and flexibility of our model.

2. EFFECT OF LABEL SELECTION BIAS ON GNNS

In this section, we first formulate our target problem, and then study the effect of label selection bias on GNNs empirically (Section 2.1) and theoretically (Section 2.2).

2.1. EXPERIMENTAL INVESTIGATION

We conduct an experimental investigation to examine whether state-of-the-art GNNs are sensitive to selection bias. The main idea is to run two representative GNNs, GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2017), on three widely used graph datasets, Cora, Citeseer and Pubmed (Sen et al., 2008), with different degrees of bias. If the performance drops sharply in comparison with the scenario without selection bias, this demonstrates that GNNs cannot generalize well under selection bias. To simulate the agnostic selection bias scenario, we first follow the inductive setting in Wu et al. (2019): the validation and test nodes are masked out to form the training graph G_train in the training phase, and the labels of validation and test nodes are then inferred on the whole graph G_test. In this way, the distribution of the test nodes can be considered agnostic. Following Zadrozny (2004), we design a biased label selection method on the training graph G_train. A selection variable e is introduced to control whether a node is selected as a labeled node, where e = 1 means selected and 0 otherwise. For node i, we compute its neighbor distribution ratio r_i = |{j | j ∈ N_i, y_j ≠ y_i}| / |N_i|, where N_i is the neighborhood of node i in G_train and y_j ≠ y_i means the label of the central node i differs from the label of its neighbor j. Thus r_i measures the difference between the label of the central node i and the labels of its neighborhood. We then average r over all nodes to obtain a threshold t. For each node, the probability of being selected is:

P(e_i = 1 | r_i) = ε if r_i ≥ t, and 1 − ε if r_i < t,

where ε ∈ (0.5, 1) controls the degree of selection bias and a larger ε means heavier bias. We set ε to {0.7, 0.8, 0.9} to obtain three bias degrees for each dataset, termed Light, Medium and Heavy, respectively. We select 20 nodes per class for training, and the validation and test nodes are the same as in Yang et al. (2016).
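The selection procedure above can be sketched in a few lines. The helper below is our own illustrative implementation, not the authors' code; the function name, the fixed number of passes, and the tiny input format are assumptions for the sketch.

```python
import numpy as np

def biased_label_selection(labels, neighbors, eps=0.8, per_class=20, seed=0):
    """Sketch of the biased selection of Sec. 2.1 (illustrative, not the
    authors' code). labels[i] is node i's class, neighbors[i] is the list of
    node i's neighbors in G_train, and eps in (0.5, 1) controls the bias
    degree (a larger eps means heavier bias)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    # r_i: fraction of node i's neighbors whose label differs from y_i
    r = np.array([np.mean([labels[j] != labels[i] for j in neighbors[i]])
                  if neighbors[i] else 0.0 for i in range(n)])
    t = r.mean()                                 # threshold: average ratio
    p_select = np.where(r >= t, eps, 1.0 - eps)  # P(e_i = 1 | r_i)
    quota = {c: per_class for c in set(labels)}
    chosen = set()
    for _ in range(100):                         # repeat passes until quotas fill
        for i in rng.permutation(n):
            c = labels[i]
            if i not in chosen and quota[c] > 0 and rng.random() < p_select[i]:
                chosen.add(i)
                quota[c] -= 1
        if all(q == 0 for q in quota.values()):
            break
    return sorted(chosen)
```

Nodes whose neighborhood labels differ from their own (r_i ≥ t) are kept with probability ε, so the labeled set is skewed toward one neighborhood pattern while the unbiased test distribution is left untouched.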
Furthermore, we take the unbiased datasets, where the labeled nodes are selected randomly, as baselines. Figure 1 shows the results of GCN and GAT on the biased datasets. The dashed lines denote the performance of GCN/GAT on the unbiased datasets and the solid lines the results on the biased datasets. We find that: i) the dashed lines all lie above the corresponding coloured solid lines, indicating that selection bias greatly affects the GNNs' performance; ii) all solid lines decrease monotonically as the bias degree increases, demonstrating that heavier bias causes a larger performance drop.

2.2. THEORETICAL ANALYSIS

The above experiment empirically verifies the effect of selection bias on GNNs. Here we theoretically analyze its effect on estimating the parameters in GNNs. First, because biased labeled nodes have a biased neighborhood structure, GNNs will encode this biased information into the node embeddings. Based on the stable learning technique (Kuang et al., 2020), we make the following assumption:

Assumption 1. All the variables of the embeddings learned by GNNs for each node can be decomposed as H = {S, V}, where S represents the stable variables and V the unstable variables. Specifically, for both the training and the test environment, E(Y|S = s, V = v) = E(Y|S = s).

Under Assumption 1, the distribution shift between training set and test set is mainly induced by the variation in the joint distribution over (S, V), i.e., P(S_train, V_train) ≠ P(S_test, V_test). However, there is an invariant relationship between the stable variables S and the outcome Y in both environments, which can be expressed as P(Y_train|S_train) = P(Y_test|S_test). Assumption 1 can be guaranteed by Y ⊥ V | S. Thus, one could solve the stable prediction problem by developing a function f(⋅) based on S. However, one can hardly identify such variables in GNNs. Without loss of generality, we take Y as a continuous variable for the analysis and make the following assumption:

Assumption 2. The true generation process of the target variable Y contains not only a linear combination of the stable variables S, but also a nonlinear transformation of the stable variables.
Based on the above assumptions, we formalize the label generation process as follows:

Y = f(X, A) + ε = G(X, A; θ_g)_S β_S + G(X, A; θ_g)_V β_V + g(G(X, A; θ_g)_S) + ε,    (1)

where G(X, A; θ_g) ∈ R^{N×p} denotes an unknown function of X and A that learns the node embeddings and can be instantiated by a GNN such as GCN or GAT; the output variables of G(X, A; θ_g) can be decomposed into stable variables G(X, A; θ_g)_S ∈ R^{N×m} and unstable variables G(X, A; θ_g)_V ∈ R^{N×q} (m + q = p); β_S ∈ R^{m×1} and β_V ∈ R^{q×1} are linear coefficients that can be learned by the last layer of a GNN; ε is independent random noise; and g(⋅) is a nonlinear transformation of the stable variables. According to Assumption 1, the coefficients of the unstable variables G(X, A; θ_g)_V are actually 0 (i.e., β_V = 0). For a classical GNN model with a linear regressor, the prediction function can be formulated as:

Ŷ = Ĝ(X, A; θ_g)_S β̂_S + Ĝ(X, A; θ_g)_V β̂_V + ε.    (2)

Compared with Eq. (1), the parameters of the GNN could be unbiasedly estimated if the nonlinear term g(G(X, A; θ_g)_S) = 0, because the GNN model would then have the same label generation mechanism as Eq. (1). However, limited by the nonlinear power of GNNs (Xu et al., 2019), it is reasonable to assume there is a nonlinear term g(G(X, A; θ_g)_S) ≠ 0 that cannot be fitted by the GNN. Under this assumption, we next take a vanilla GCN (Kipf & Welling, 2016) as an example to illustrate how the distribution shift induces parameter estimation bias. A two-layer GCN can be formulated as Âσ(ÂXW^(0))W^(1), where Â is the normalized adjacency matrix, W^(l) is the transformation matrix of each layer and σ(⋅) is the ReLU activation function. We decompose the GCN into two parts: one is the embedding learning part Âσ(ÂXW^(0)), whose output can be decomposed as [S, V], corresponding to Ĝ(X, A; θ_g)_S and Ĝ(X, A; θ_g)_V in Eq. (2); the other is the last-layer transformation W^(1), whose learned parameters can be decomposed as [β̂_S, β̂_V], corresponding to [β̂_S, β̂_V] in Eq. (2). We aim at minimizing the squared loss L_GCN = Σ_{i=1}^n (S_i^T β̂_S + V_i^T β̂_V − Y_i)². According to the derivation rule of the partitioned regression model, we have:

β̂_V − β_V = (1/n Σ_{i=1}^n V_i^T V_i)^{−1} (1/n Σ_{i=1}^n V_i^T g(S_i)) + (1/n Σ_{i=1}^n V_i^T V_i)^{−1} (1/n Σ_{i=1}^n V_i^T S_i)(β_S − β̂_S),    (3)

β̂_S − β_S = (1/n Σ_{i=1}^n S_i^T S_i)^{−1} (1/n Σ_{i=1}^n S_i^T g(S_i)) + (1/n Σ_{i=1}^n S_i^T S_i)^{−1} (1/n Σ_{i=1}^n S_i^T V_i)(β_V − β̂_V),    (4)

where n is the number of labeled nodes, S_i is the i-th sample of S, 1/n Σ_{i=1}^n V_i^T g(S_i) = E(V^T g(S)) + o_p(1), 1/n Σ_{i=1}^n V_i^T S_i = E(V^T S) + o_p(1), and o_p(1) is a negligible error. Ideally, β̂_V − β_V = 0 indicates that there is no bias between the estimated and the real parameters. However, if E(V^T S) ≠ 0 or E(V^T g(S)) ≠ 0 in Eq. (3), β̂_V will be biased, which in turn leads to a biased estimation of β̂_S in Eq. (4). Since the correlation between V and S (or g(S)) might shift in the test phase, the biased parameters learned on the training set are not optimal for predicting test nodes. Therefore, to increase the stability of prediction, we need to estimate β̂_V unbiasedly by removing the correlation between V and S (or g(S)) on the training graph, making E(V^T S) = 0 or E(V^T g(S)) = 0. Note that the term 1/n Σ_{i=1}^n S_i^T g(S_i) in Eq. (4) can also cause estimation bias, but the relation between S and g(S) is stable across environments and so does not influence the stability to some extent.
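The estimation bias above can be reproduced in a few lines. The toy simulation below is our own synthetic setup (not the paper's data): the true coefficient of the unstable variable V is 0, but because V is correlated with the nonlinear part g(S), i.e., E(V^T g(S)) ≠ 0, least squares assigns V a clearly non-zero coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy generation following Eq. (1): Y depends on the stable variable S
# linearly and through a nonlinear term g(S); the true coefficient of the
# unstable variable V is 0.  Selection bias is mimicked (for simplicity)
# by making V directly correlated with g(S).
S = rng.normal(size=n)
gS = 0.5 * (S**2 - 1)                                   # nonlinear part, mean 0
V = 0.8 * (S**2 - 1) / np.sqrt(2) + 0.6 * rng.normal(size=n)  # E[V * gS] != 0
Y = 1.0 * S + 0.0 * V + gS + 0.1 * rng.normal(size=n)   # true beta_V = 0

X = np.column_stack([S, V])
beta_S_hat, beta_V_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
# beta_V_hat is pulled far away from its true value 0 (around 0.57 here),
# exactly the bias predicted by the E(V^T g(S)) term in Eq. (3).
print(beta_S_hat, beta_V_hat)
```

Removing the correlation between V and g(S), e.g., by sample reweighting as in Section 3, would drive the estimated coefficient of V back toward 0.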

3.1. REVISITING VARIABLE DECORRELATION IN A CAUSAL VIEW

To decorrelate V and S (or g(S)), we should decorrelate the output variables of Ĝ(X, A; θ_g) (Kuang et al., 2020). Kuang et al. (2020) propose a Variable Decorrelation (VD) term with a sample reweighting technique to eliminate the correlation between each pair of variables, in which the sample weights are learned by jointly minimizing the moment discrepancy between each variable pair:

L_VD(H) = Σ_{j=1}^p ||H_{.j}^T Λ_w H_{.−j}/n − H_{.j}^T w/n ⋅ H_{.−j}^T w/n||_2^2,

where H ∈ R^{n×p} denotes the variables to be decorrelated, H_{.j} is the j-th variable of H, H_{.−j} denotes the remaining variables (H with its j-th column set to zero), w ∈ R^{n×1} is the sample weight vector and Λ_w = diag(w_1, …, w_n). When L_VD(H) = 0, we have E(H_{.j}^T H_{.k}) = E(H_{.j}^T) E(H_{.k}) for each variable pair j and k. L_VD(H) decorrelates all variable pairs equally. However, decorrelating all the variables requires sufficient samples (Kuang et al., 2020), i.e., n → ∞, which is hard to satisfy, especially in the semi-supervised setting. In this scenario, we cannot guarantee L_VD(H) = 0. Therefore, the key challenge is how to remove the correlations that influence the unbiased estimation most when L_VD(H) ≠ 0. Inspired by the confounder balancing technique in observational studies (Hainmueller, 2012), we revisit the variable decorrelation regularizer in a causal view and show how to differentiate each variable pair. Confounder balancing techniques are often used for estimating the causal effect of a treatment T when the distributions of the confounders X differ between the treated (T = 1) and control (T = 0) groups because of non-random treatment assignment. One can balance the distribution of confounders between the treatment and control groups to unbiasedly estimate causal treatment effects (Yao et al., 2020). Most balancing approaches exploit moments to characterize distributions, and balance them by adjusting sample weights w as follows:

w = arg min_w || Σ_{i: T_i=1} X_i − Σ_{i: T_i=0} w_i ⋅ X_i ||_2^2.

After balancing, the treatment T and the confounders X tend to be independent.
Given one target variable j, under the assumption that the variables only have linear relations, its decorrelation term, L_VD_j = ||H_{.j}^T Λ_w H_{.−j}/n − H_{.j}^T w/n ⋅ H_{.−j}^T w/n||_2^2, makes H_{.j} independent of H_{.−j}, which is the same as the confounder balancing term making the treatment and confounders independent. Thereby, L_VD_j can also be viewed as a confounder balancing term, where H_{.j} is the treatment and H_{.−j} are the confounders, as illustrated in Fig. 2(a). Hence, our target can be restated as unbiasedly estimating the causal effect of each variable, which is invariant across training and test sets. As different variables may contribute unequally to the confounding bias, it is necessary to differentiate the confounders. The target of differentiating confounders exactly matches our target of removing the correlations of the variables that influence the unbiased estimation most.
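The VD term above is easy to state concretely. The following minimal numpy sketch is ours (function name and toy check are assumptions, not the authors' code); it computes L_VD(H) for a given set of sample weights.

```python
import numpy as np

def vd_loss(H, w):
    """Variable Decorrelation loss of Kuang et al. (2020), as in Sec. 3.1.
    H: (n, p) embeddings; w: (n,) non-negative sample weights."""
    n, p = H.shape
    Lw = np.diag(w)                       # Lambda_w = diag(w_1, ..., w_n)
    loss = 0.0
    for j in range(p):
        H_mj = H.copy()
        H_mj[:, j] = 0.0                  # H_{.-j}: j-th variable set to zero
        # weighted cross moment minus product of weighted first moments
        term = H[:, j] @ Lw @ H_mj / n - (H[:, j] @ w / n) * (H_mj.T @ w / n)
        loss += np.sum(term ** 2)
    return loss
```

With uniform weights, the loss is zero for embeddings whose columns are already uncorrelated and grows with the cross-moments otherwise, so minimizing it over w reweights samples toward a decorrelated design.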

3.2. DIFFERENTIATED VARIABLE DECORRELATION

Considering a continuous treatment, the causal effect of the treatment can be measured by the Marginal Treatment Effect Function (MTEF) (Kreif et al., 2015), defined as:

MTEF = (E[Y_i(t)] − E[Y_i(t − Δt)]) / Δt,

where Y_i(t) represents the potential outcome of sample i under treatment status T = t, E(⋅) refers to the expectation function, and Δt denotes the increasing level of the treatment. With the sample weights w decorrelating treatment and confounders, we can estimate the MTEF by:

M̂TEF = (Σ_{i: T_i=t} w_i ⋅ Y_i(t) − Σ_{j: T_j=t−Δt} w_j ⋅ Y_j(t − Δt)) / Δt.

Next we theoretically analyze how to differentiate the confounders' weights with the following theorem.

Theorem 1. In observational studies, different confounders contribute unequal confounding bias to the Marginal Treatment Effect Function (MTEF) with their own weights, and the weights can be learned by regressing the outcome Y on the confounders X and the treatment variable T.

We prove Theorem 1 under the following assumption:

Assumption 3 (Linearity). The regression of the outcome Y on the confounders X and the treatment variable T is linear, that is, Y = Σ_{k≠t} α_k X_{.k} + α_t T + c + ε, where α_k ∈ α is the linear coefficient.

Under Assumption 3, we can write the estimator M̂TEF as:

M̂TEF = (Σ_{i: T_i=t} w_i ⋅ Y_i(t) − Σ_{j: T_j=t−Δt} w_j ⋅ Y_j(t − Δt)) / Δt
     = MTEF + Σ_{k≠t} α_k ((Σ_{i: T_i=t} w_i ⋅ X_ik − Σ_{j: T_j=t−Δt} w_j ⋅ X_jk) / Δt) + φ(ε),

where MTEF is the ground truth, φ(ε) is the noise term, and φ(ε) ≃ 0 with Gaussian noise. The detailed derivation can be found in Appendix A. To reduce the bias of M̂TEF, we need to regulate the term Σ_{k≠t} α_k ((Σ_{i: T_i=t} w_i ⋅ X_ik − Σ_{j: T_j=t−Δt} w_j ⋅ X_jk) / Δt), where (Σ_{i: T_i=t} w_i ⋅ X_ik − Σ_{j: T_j=t−Δt} w_j ⋅ X_jk) / Δt is the difference of the k-th confounder between treated and control samples. The parameter α_k represents the confounding bias weight of the k-th confounder, and it is the coefficient of X_{.k}. Moreover, because our target is to learn the weight of each variable pair, i.e., between the treatment and each confounder, we also need to learn the weight α_t of the treatment, which is the coefficient of T. Hence, the confounder weights and the treatment weight can be learned from the regression of the observed outcome Y on the confounders X and the treatment T under the Linearity assumption. (In the graph setting of Fig. 2(b), the node embedding H^(K−1) is the set of variables to be decorrelated: the treatment T corresponds to one target variable in H^(K−1), the confounders X correspond to the remaining variables, and Y is the outcome.)

Due to the connection between treatment effect estimation and variable decorrelation analyzed in Section 3.1, we utilize Theorem 1 to reweight the variables in the variable decorrelation term. When applying Theorem 1 to GNNs, the confounders X are H_{.−j} and the treatment is H_{.j}, where the embedding H is learned by Ĝ(X, A; θ_g) in Eq. (2). The variable weights α can be computed from the regression coefficients for H, hence α is equal to β̂ in Eq. (2). The Differentiated Variable Decorrelation (DVD) term can then be formulated as follows:

min_w L_DVD(H) = Σ_{j=1}^p (α^T ⋅ abs(H_{.j}^T Λ_w H_{.−j}/n − H_{.j}^T w/n ⋅ H_{.−j}^T w/n))^2 + λ_1 ⋅ 1/n Σ_{i=1}^n w_i^2 + λ_2 (1/n Σ_{i=1}^n w_i − 1)^2,  s.t. w ⪰ 0,    (8)

where abs(⋅) is the element-wise absolute value operation, preventing positive and negative values from cancelling. The term λ_1 ⋅ 1/n Σ_{i=1}^n w_i^2 reduces the variance of the sample weights to achieve stability, the term λ_2 (1/n Σ_{i=1}^n w_i − 1)^2 prevents all the sample weights from being 0, and the constraint w ⪰ 0 keeps each sample weight non-negative. After variable reweighting, the weighted decorrelation term in Eq. (8) can be rewritten as Σ_{j≠k} α_j^2 α_k^2 ||H_{.j}^T Λ_w H_{.k}/n − H_{.j}^T w/n ⋅ H_{.k}^T w/n||_2^2, so the weight for variable pair j and k is α_j^2 α_k^2; hence it considers the weights of both the treatment and the confounder. We prove the uniqueness property of w in Appendix B, as follows:

Theorem 2 (Uniqueness). If λ_1 n ≫ p^2 + λ_2, p^2 ≫ max(λ_1, λ_2), |H_{i,j}| ≤ c and |α_i| ≤ c for some constant c, the solution ŵ ∈ {w : |w_i| ≤ c} that minimizes Eq. (8) is unique.
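A minimal numpy sketch of the DVD objective in Eq. (8) may make the differentiation concrete; it is our own illustration (names and toy inputs are assumptions), computing the loss for given sample weights w and variable weights α without the optimizer.

```python
import numpy as np

def dvd_loss(H, w, alpha, lam1=1.0, lam2=1.0):
    """Differentiated Variable Decorrelation objective of Eq. (8) (sketch).
    H: (n, p) embeddings; w: (n,) sample weights; alpha: (p,) variable weights."""
    n, p = H.shape
    loss = 0.0
    for j in range(p):
        H_mj = H.copy()
        H_mj[:, j] = 0.0                       # H_{.-j}: j-th variable set to zero
        corr = H[:, j] @ (w[:, None] * H_mj) / n \
               - (H[:, j] @ w / n) * (H_mj.T @ w / n)
        loss += (alpha @ np.abs(corr)) ** 2    # alpha differentiates variable pairs
    loss += lam1 * np.mean(w ** 2)             # variance control on sample weights
    loss += lam2 * (np.mean(w) - 1.0) ** 2     # keeps weights away from all-zero
    return loss
```

Setting a variable's α entry to zero removes its pairs from the penalty, which is exactly how the regularizer focuses the limited sample budget on the correlations that bias the estimation most.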

3.3. DEBIASED GNN FRAMEWORK

In this section, we describe the framework of Debiased GNN, which incorporates the DVD/VD term with GNNs in a seamless way. As analyzed in Section 2.2, decorrelating Âσ(ÂXW^(0)) could make GCN stable. However, most GNNs follow a layer-by-layer stacking structure, and the output embedding of each layer is easier to obtain in implementation. Since Âσ(ÂXW^(0)) is the aggregation of the first-layer embedding σ(ÂXW^(0)), decorrelating these variables may lack the flexibility to incorporate the DVD/VD term with other GNN structures. Fortunately, the following theorem identifies a more flexible way to combine variable decorrelation with GNNs.

Theorem 3. Given p pairwise uncorrelated variables Z = (Z_1, Z_2, ⋯, Z_p), with a linear aggregation operator Â, the variables of Y = ÂZ are still pairwise uncorrelated.

The proof can be found in Appendix C. The theorem indicates that if the variables of the embeddings Z are uncorrelated, then after any form of linear neighborhood aggregation Â, e.g., average, attention or sum, the variables of the transformed embeddings Y are also uncorrelated. Therefore, decorrelating σ(ÂXW^(0)) can also reduce the estimation bias. For a K-layer GNN, we can directly decorrelate the output of the (K−1)-th layer, i.e., σ(Â⋯σ(ÂXW^(0))⋯W^(K−2)) for a K-layer GCN. The previous analysis finds a flexible way to incorporate the DVD/VD term with GNNs; however, recall that we analyzed GNNs based on the least-squares loss, while most existing GNNs are designed for classification. Therefore, in the following, we show that the previous conclusions still apply to classification. We consider the case where a softmax layer is used as the output layer of the GNN and the loss is the cross-entropy error function, and we use the Newton-Raphson update rule (Bishop, 2006) to bridge the gap between linear regression and multi-class classification.
According to the Newton-Raphson update rule, the update formula for the transformation matrix W^(K−1) of the last layer of GCN can be derived as:

W^(new)_.j = W^(old)_.j − (H^T R H)^{−1} H^T (H W^(old)_.j − Y_.j)
          = (H^T R H)^{−1} {H^T R H W^(old)_.j − H^T (H W^(old)_.j − Y_.j)}
          = (H^T R H)^{−1} H^T R z,    (9)

where R_kj = −Σ_{n=1}^N H_n W^(old)_.k (I_kj − H_n W_.j) is a weighing matrix, I_kj is the element of the identity matrix, and z = H W^(old)_.j − R^{−1} (H W^(old)_.j − Y_.j) is an effective target value. Eq. (9) takes the form of a set of normal equations for a weighted least-squares problem. As the weighing matrix R is not constant but depends on the parameter vector W^(old)_.j, we must apply the normal equations iteratively: each iteration uses the weight vector W^(old)_.j from the last iteration to compute a revised weighing matrix R and regresses the target value z with H W^(new)_.j. Therefore, variable decorrelation can also be applied to GNNs with a softmax classifier to reduce the estimation bias in each iteration. Accordingly, we feed the embedding Ĥ^(K−1) into the regularizer L_DVD(Ĥ^(K−1)). As GCN has the form softmax(ÂH^(K−1)W^(K−1)), the variable weights of Ĥ^(K−1) used for differentiating L_DVD(Ĥ^(K−1)) can be computed as α = Var(W^(K−1), axis = 1), where Var(⋅, axis = 1) calculates the variance of each row of a matrix; it reflects each variable's importance for classification, which is similar to the regression coefficients. Note that when incorporating the VD term with GNNs, we do not need to compute the variable weights. The sample weights w learned by the DVD term can then remove the correlations in Ĥ^(K−1), and we use them to reweight the softmax loss:

min_θ L_G = −Σ_{l∈Y_L} w_l ⋅ ln(q(Ĥ^(K)_l) ⋅ Y_l),

where q(⋅) is the softmax function, Y_L is the set of labeled node indices, and θ is the set of parameters of GCN. The complexity analysis as well as the optimization of the whole algorithm are summarized in Appendix D.
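The two ingredients above, variable weights from the last-layer matrix and the sample-reweighted softmax loss, can be sketched in numpy. This is our own illustration (function names are assumptions), not the authors' implementation.

```python
import numpy as np

def variable_weights(W_last):
    """alpha = Var(W^(K-1), axis=1): per-row variance of the last-layer
    transformation matrix, used to differentiate variables (Sec. 3.3)."""
    return W_last.var(axis=1)

def weighted_softmax_loss(logits, labels_onehot, w):
    """Sample-reweighted cross-entropy over the labeled nodes (sketch).
    logits: (n, C); labels_onehot: (n, C); w: (n,) sample weights."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    q = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # -sum_l w_l * ln(q(H_l) . Y_l)
    return -np.sum(w * np.log(np.sum(q * labels_onehot, axis=1)))
```

Rows of W^(K−1) with near-constant entries contribute roughly equally to every class logit and so receive a small α, while rows that discriminate between classes receive a large one.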

4. EXPERIMENTS

Datasets Here, we validate the effectiveness of our method on node classification with two kinds of selection-biased data, i.e., label selection bias and small-sample selection bias. For label selection bias, we employ three widely used graph datasets: Cora, Citeseer and Pubmed (Sen et al., 2008). As in Section 2.1, we adopt the inductive setting for each graph and obtain three bias degrees per graph. For small-sample selection bias, we conduct the experiments on the NELL dataset (Carlson et al., 2010).

Baselines Under our proposed framework, we incorporate the VD/DVD term with GCN and GAT, called GCN-VD/DVD and GAT-VD/DVD (details in Appendix F); thus GCN and GAT are two basic baselines. We compare with GNM-GCN/GAT (Zhou et al., 2019), which considers label selection bias in the transductive setting. Moreover, several state-of-the-art GNNs are included: Chebyshev filter (Kipf & Welling, 2016), SGC (Wu et al., 2019) and APPNP (Klicpera et al., 2019). Additionally, we compare with Planetoid (Yang et al., 2016) and an MLP trained on the labeled nodes.

Results on Label Selection Bias Dataset

The results are given in Table 1, and we have the following observations. First, the proposed models (i.e., GCN/GAT with VD/DVD terms) achieve the best performance in most cases, which well demonstrates the effectiveness of our proposed debiased GNN framework. Second, compared with the base models, our proposed models achieve up to 17.0% performance improvement, and gain larger improvements under heavier bias. Since the major difference between our models and the base models is the VD/DVD regularizer, we can safely attribute the significant improvements to the effective decorrelation term and its seamless integration with GNN models. Moreover, GCN/GAT-DVD achieve better results than GCN/GAT-VD in most cases, which validates the importance and effectiveness of differentiating the variables' weights in the semi-supervised setting. Additional experimental results on sample weight analysis and parameter sensitivity can be found in Appendix G.

Results on Small Sample Selection Bias Dataset

As NELL is a large-scale graph, we cannot run GAT on a single GPU with 16GB memory. We therefore only perform GCN-VD/DVD and compare with representative methods that can run on this dataset. The results are shown in Table 2. First, GCN-VD/DVD achieve significant improvements over GCN, indicating that selection bias can be induced by a small number of labeled nodes and that our proposed method relieves the estimation bias. Moreover, GCN-DVD further improves over GCN-VD by a large margin, which further validates that decorrelating all variable pairs equally is suboptimal and that our differentiated strategy is effective when labeled nodes are scarce. The reason GNM-GCN fails is that GNM relies on the accuracy of the IPW estimator, which predicts the probability of a node being selected; in this dataset, however, the ratio of positive and negative samples is extremely unbalanced, hurting the performance of IPW.

5. RELATED WORKS

In the past few years, Graph Neural Networks (GNNs) (Scarselli et al., 2008; Kipf & Welling, 2016; Veličković et al., 2017; Xu et al., 2019; Klicpera et al., 2019) have become the major technology for capturing patterns encoded in graphs due to their powerful representation capacity. Although current GNNs have achieved great success, when applied in the inductive setting they all assume that training nodes and test nodes follow the same distribution. However, this assumption does not always hold in real applications. GNM (Zhou et al., 2019) first pays attention to the label selection problem in graph learning: it learns an IPW estimator to estimate the probability of each node being selected and uses this probability to reweight the labeled nodes. However, it heavily relies on the accuracy of the IPW estimator, which depends on the label assignment distribution of the whole graph; hence it is more suitable for the transductive setting. To enhance stability under unseen, varied distributions, some works (Shen et al., 2020b; Kuang et al., 2020) have revealed the connection between correlation and prediction stability under model misspecification. However, these methods are built on simple regressions, while GNNs have a more complex structure and properties that need to be considered. We also note that Shen et al. (2020a) propose a differentiated variable decorrelation term for linear regression; however, that term requires multiple training environments with different correlations between stable and unstable variables, while our method does not.

6. CONCLUSION

In this paper, we investigate a general and practical problem: learning GNNs with agnostic selection bias. Such selection bias inevitably causes GNNs to learn a biased correlation between the aggregation mode and the class label, making predictions unstable. We then propose a novel differentiated decorrelated GNN, which combines the debiasing technique with GNNs in a unified framework. Extensive experiments well demonstrate the effectiveness and flexibility of GNN-DVD.

A DERIVATION OF M̂TEF

M̂TEF = (Σ_{i:T_i=t} w_i ⋅ Y_i(t) − Σ_{j:T_j=t−Δt} w_j ⋅ Y_j(t − Δt)) / Δt
     = (Σ_{i:T_i=t} w_i ⋅ (Σ_{k≠t} α_k X_ik + α_t t + c + ε) − Σ_{j:T_j=t−Δt} w_j ⋅ (Σ_{k≠t} α_k X_jk + α_t (t − Δt) + c + ε)) / Δt
     = (Σ_{i:T_i=t} w_i α_t t − Σ_{j:T_j=t−Δt} w_j α_t (t − Δt)) / Δt + (Σ_{i:T_i=t} w_i Σ_{k≠t} α_k X_ik − Σ_{j:T_j=t−Δt} w_j Σ_{k≠t} α_k X_jk) / Δt + φ(ε)
     = MTEF + Σ_{k≠t} α_k ((Σ_{i:T_i=t} w_i ⋅ X_ik − Σ_{j:T_j=t−Δt} w_j ⋅ X_jk) / Δt) + φ(ε),

where (Σ_{i:T_i=t} w_i α_t t − Σ_{j:T_j=t−Δt} w_j α_t (t − Δt)) / Δt is the ground truth MTEF, φ(ε) is the noise term, and φ(ε) ≃ 0 with Gaussian noise.

B PROOF OF THEOREM 2

ŵ = arg min_w Σ_{j=1}^p (α^T ⋅ abs(H_{.j}^T Λ_w H_{.−j}/n − H_{.j}^T w/n ⋅ H_{.−j}^T w/n))^2 + λ_1 ⋅ 1/n Σ_{i=1}^n w_i^2 + λ_2 (1/n Σ_{i=1}^n w_i − 1)^2.

Proof. For simplicity, we denote L_1 = Σ_{j=1}^p (α^T ⋅ abs(H_{.j}^T Λ_w H_{.−j}/n − H_{.j}^T w/n ⋅ H_{.−j}^T w/n))^2, L_2 = 1/n Σ_{i=1}^n w_i^2, L_3 = (1/n Σ_{i=1}^n w_i − 1)^2 and F(w) = L_1 + λ_1 L_2 + λ_2 L_3.
We first calculate the Hessian matrix of F(w), denoted H_e, to prove the uniqueness of the optimal solution ŵ:

H_e = ∂²L_1/∂w² + λ_1 ∂²L_2/∂w² + λ_2 ∂²L_3/∂w².

For the term L_1, we can rewrite it as:

L_1 = Σ_{j≠k} α_j² α_k² (1/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i − (1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i))²
   = Σ_{j≠k} α_j² α_k² ((1/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i)² − (2/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i)(1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i) + ((1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i))²).

When |H_{i,j}| ≤ c for any variables j and k, and |w_i| ≤ c, we have ∂²/∂w² (1/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i)² = O(1/n²), ∂²/∂w² ((1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i))² = O(1/n²) and ∂²/∂w² ((2/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i)(1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i)) = O(1/n²). Then, with |α_i| ≤ c, we have α_j² α_k² ∂²/∂w² (1/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i − (1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i))² = O(1/n²). L_1 is the sum of p(p − 1) such terms, so ∂²L_1/∂w² = O(p²/n²). With some algebra, we also have ∂²L_2/∂w² = (1/n) I and ∂²L_3/∂w² = (1/n²) 11^T; thus

H_e = O(p²/n²) + (λ_1/n) I + (λ_2/n²) 11^T = (λ_1/n) I + O((p² + λ_2)/n²).

Therefore, if λ_1/n ≫ (p² + λ_2)/n², equivalently λ_1 n ≫ p² + λ_2, H_e is an almost diagonal matrix and hence positive definite (Nakatsukasa, 2010). The function F(w) is then convex on C = {w : |w_i| ≤ c} and has a unique optimal solution ŵ. Moreover, because L_1 is our major decorrelation term, we want L_1 to dominate the terms λ_1 L_2 and λ_2 L_3. On C, we have L_2 = O(1), L_3 = O(1), and α_j² α_k² (1/n Σ_{i=1}^n H_{i,j} H_{i,k} w_i − (1/n Σ_{i=1}^n H_{i,j} w_i)(1/n Σ_{i=1}^n H_{i,k} w_i))² = O(1); thus L_1 = O(p²). When p² ≫ max(λ_1, λ_2), L_1 dominates the regularization terms L_2 and L_3.

C PROOF OF THEOREM 3

Let Z = {Z_1, Z_2, ⋯, Z_p} be p pairwise uncorrelated variables.
For any $Z_i, Z_j \in Z$, let $(Z_i^{(1)}, Z_i^{(2)}, \dots, Z_i^{(n)})$ and $(Z_j^{(1)}, Z_j^{(2)}, \dots, Z_j^{(n)})$ be $n$ simple random samples drawn from $Z_i$ and $Z_j$ respectively, with the same distributions as $Z_i$ and $Z_j$. Given a linear aggregation matrix $\hat{A} = (a_{ij})$, for all $s, v \in \{1, 2, \dots, n\}$, let $Y_i^{(s)} = \sum_{k=1}^{n} a_{sk} Z_i^{(k)}$ and $Y_j^{(v)} = \sum_{l=1}^{n} a_{vl} Z_j^{(l)}$. We have the following derivation:

$$\mathrm{Cov}\big(Y_i^{(s)}, Y_j^{(v)}\big) = \mathrm{Cov}\Big(\sum_{k=1}^{n} a_{sk} Z_i^{(k)}, \sum_{l=1}^{n} a_{vl} Z_j^{(l)}\Big) = \sum_{k=1}^{n}\sum_{l=1}^{n} a_{sk} a_{vl}\, \mathrm{Cov}\big(Z_i^{(k)}, Z_j^{(l)}\big) = \sum_{k=1}^{n}\sum_{l=1}^{n} a_{sk} a_{vl}\, \delta_{ij},$$

where $\delta_{ij} = 0$ when $i \neq j$, and $\delta_{ij} = 1$ otherwise. Therefore, when $i \neq j$, we have $\mathrm{Cov}(Y_i^{(s)}, Y_j^{(v)}) = 0$.

To optimize our GNN-DVD algorithm, we adopt an iterative method. First, we let $w = \omega \odot \omega$ to ensure the non-negativity of $w$, initialize the sample weight $\omega_i = 1$ for each sample $i$, and initialize the GNN parameters $\theta$ from a random uniform distribution. Given these initial values, in each iteration we fix the sample weights $\omega$ and update the GNN parameters $\theta$ by minimizing $L_G$ with gradient descent, and then compute the confounder weights $\alpha$ from the linear transformation matrix $W^{(K-1)}$. With $\alpha$ computed and the GNN parameters $\theta$ fixed, we update the sample weights $\omega$ with gradient descent to minimize $L_{DVD}(H^{(K-1)})$. We iteratively update the sample weights $\omega$ and the GNN parameters $\theta$ until $L_G$ converges.

D PSEUDOCODE OF GNN-DVD

Complexity Analysis. Compared with the base model (e.g., GCN or GAT), the main additional time cost comes from the DVD term, whose complexity is O(np²), where n is the number of labeled nodes and p is the embedding dimension. This is typically much smaller than the cost of the base model (e.g., the complexity of GCN is linear in the number of edges).

The statistics of the datasets used in our paper are presented in Table 3, including the number of nodes, edges, classes, and features, as well as the bias degree and bias type.

E DATASET DESCRIPTION AND EXPERIMENTAL SETUP E.1 DATASET DESCRIPTION

For the three citation networks, we conduct a biased labeled-node selection process to obtain datasets with three bias degrees for each network, in order to validate the effect of label selection bias; each class in each dataset contains 20 labeled nodes in the training set, and the validation and test sets are the same as in Yang et al. (2016). For NELL, because it has only a single labeled node per class in the training set, the training nodes can hardly cover all the neighborhood distributions appearing in the test set. Hence, we use this dataset to validate the effectiveness of our method under the bias induced by an extremely small labeled set. The data splits are also the same as in Yang et al. (2016). A description of each dataset is given as follows: • Cora (Sen et al., 2008) is a citation network of machine learning papers from 7 classes: {Theory, Case Based, Reinforcement Learning, Genetic Algorithms, Neural Networks, Probabilistic Methods, Rule Learning}. Nodes represent papers, edges are citation relationships, and features are bag-of-words vectors for each paper. • Citeseer (Sen et al., 2008) is a citation network of machine learning papers from 6 classes: {Agents, Artificial Intelligence, Database, Information Retrieval, Machine Learning, Human Computer Interaction}. Nodes represent papers, edges are citation relationships, and features are bag-of-words vectors for each paper. • Pubmed (Sen et al., 2008) is a citation network from the PubMed database, containing a set of articles (nodes) related to diabetes and the citation relationships among them. The node features are bag-of-words vectors, and the node labels are the diabetes types studied in the articles. • NELL (Carlson et al., 2010) is a dataset extracted from a knowledge graph, i.e., a set of entities connected by directed, labeled edges (relations). Our pre-processing scheme is the same as in Yang et al. (2016), where each entity pair (e1, r, e2) is assigned separate relation nodes r1 and r2, yielding (e1, r1) and (e2, r2). We use a text bag-of-words representation as the feature vector of the entities.

E.2 EXPERIMENTAL SETUP

As described in Section 2.1, for all datasets, to simulate the agnostic selection bias scenario, we follow the inductive setting of Wu et al. (2019): the validation and test nodes are masked during the training phase, and validation and testing are performed on the whole graph, so that the test nodes remain agnostic. For GCN and GAT, we use the same two-layer architectures as in their original papers (Kipf & Welling, 2016; Veličković et al., 2017). We use the following hyperparameter settings for GCN on Cora,



Nonlinear relations between variables can be incorporated by considering higher-order moments in Eq. (5).



Figure 1: Effect of selection bias on GCN and GAT.

Figure 2: (a) Diagram of decorrelating node embedding with confounding balance. H^(K−1) is the […] the outcome, corresponding to labels. (b) The framework of GNN-DVD. The same color in the two figures represents the same kind of variable.

Figure 2(b) shows the framework of GNN-DVD: the labeled nodes' embeddings H^(K−1) are used as the input of the DVD term.

Extending the conclusion to multiple variables, Y = (Y_1, Y_2, …, Y_p) are pairwise uncorrelated. This completes the proof.
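The covariance derivation above can also be checked numerically. Below is a small Monte Carlo sketch (the sizes, random seed, and the particular aggregation matrix are illustrative assumptions): it draws independent samples of two uncorrelated variables, applies a fixed linear aggregation matrix, and confirms that the sample cross-covariance between the aggregated versions of the two variables stays near zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 20000                  # samples per variable, Monte Carlo draws
A_hat = rng.random((n, n))
A_hat /= A_hat.sum(1, keepdims=True)  # a fixed linear aggregation matrix

# Z_i and Z_j: two uncorrelated variables; each trial draws n i.i.d. samples of each
Zi = rng.normal(size=(trials, n))
Zj = rng.normal(size=(trials, n))

s, v = 1, 3                           # arbitrary aggregated-sample indices
Ys_i = Zi @ A_hat[s]                  # Y_i^(s) = sum_k a_sk Z_i^(k)
Yv_j = Zj @ A_hat[v]                  # Y_j^(v) = sum_l a_vl Z_j^(l)
cov = np.cov(Ys_i, Yv_j)[0, 1]        # near zero, since i != j
```

With i = j instead, the same estimate would concentrate near Σ_k a_sk a_vk, matching the δ_ij term in the derivation.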

Figure 4: Accuracy of GCN-DVD with different λ 1 and λ 2 on different biased Cora datasets.

Figure 5: Accuracy of GCN-DVD with different λ 1 and λ 2 on different biased Citeseer datasets.

Figure 6: Accuracy of GCN-DVD with different λ 1 and λ 2 on different biased Pubmed datasets.

Problem 1 (Semi-supervised Learning on Graph with Agnostic Label Selection Bias). Given a training graph G train = {A train , X train , Y train }, where A train ∈ R

Performance on the three citation networks. The '*' indicates the best results among the baselines. The best results of all methods are indicated in bold. '% gain over GCN/GAT' denotes the improvement percentage of GCN/GAT-DVD over GCN/GAT, respectively.

For NELL (Carlson et al., 2010), each class has only one labeled node for training. Due to the large scale of this dataset, the test nodes easily exhibit a distribution shift from the training nodes. The details of the datasets and the experimental setup are given in Appendix E. The code and datasets for all experiments can be downloaded from the supplementary material.

Performance of NELL

Algorithm 1: GNN-DVD Algorithm
Input: training graph G_train = {A, X, Y} and indices of labeled nodes Y_L; maximum number of iterations maxIter.
Output: GNN parameters θ and sample weights w.
Initialization: let w = ω ⊙ ω; initialize the sample weights ω with 1, initialize the GNN parameters θ from a random uniform distribution, and set the iteration counter t ← 0.
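The alternating optimization behind Algorithm 1 can be sketched as follows. This is a minimal numpy toy, not the paper's implementation: the one-layer tanh "GNN", the learning rates, the finite-difference gradient for ω, and the choice α = row-wise absolute sum of the transform matrix are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph and a one-layer linear "GNN" (all sizes/rates are illustrative).
n, d, p, c = 30, 8, 4, 3                      # nodes, features, embed dim, classes
A = ((rng.random((n, n)) < 0.2) | np.eye(n, dtype=bool)).astype(float)
A = np.maximum(A, A.T)
A_hat = A / A.sum(1, keepdims=True)           # row-normalized aggregation
X = rng.normal(size=(n, d))
y = rng.integers(0, c, size=n)
Y = np.eye(c)[y]

W1 = rng.normal(scale=0.1, size=(d, p))       # embedding transform (part of theta)
W2 = rng.normal(scale=0.1, size=(p, c))       # classifier head (part of theta)
omega = np.ones(n)                            # sample weights: w = omega * omega

def forward(W1, W2):
    H = np.tanh(A_hat @ X @ W1)               # node embeddings H^(K-1)
    Z = H @ W2
    E = np.exp(Z - Z.max(1, keepdims=True))
    return H, E / E.sum(1, keepdims=True)

def dvd_loss(H, omega, alpha, lam1=1.0, lam2=1.0):
    # decorrelation term of Theorem 2 plus the two regularizers on w
    w = omega * omega
    L1 = 0.0
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        cross = (H[:, j] * w) @ H[:, rest] / n \
            - (H[:, j] @ w / n) * (H[:, rest].T @ w / n)
        L1 += float(alpha[rest] @ np.abs(cross)) ** 2
    return L1 + lam1 * (w ** 2).mean() + lam2 * (w.mean() - 1.0) ** 2

for it in range(30):
    # (1) fix omega, update theta on the omega-weighted softmax loss L_G
    H, P = forward(W1, W2)
    G = (P - Y) * (omega ** 2)[:, None] / n   # gradient w.r.t. the logits
    gW2 = H.T @ G
    gW1 = (A_hat @ X).T @ ((G @ W2.T) * (1 - H ** 2))
    W1, W2 = W1 - 0.2 * gW1, W2 - 0.2 * gW2
    # (2) confounder weights alpha from the transform matrix (assumed form)
    alpha = np.abs(W2).sum(1)
    # (3) fix theta, update omega on L_DVD via finite-difference gradients
    H, _ = forward(W1, W2)
    base, eps, g = dvd_loss(H, omega, alpha), 1e-4, np.zeros(n)
    for i in range(n):
        omega[i] += eps
        g[i] = (dvd_loss(H, omega, alpha) - base) / eps
        omega[i] -= eps
    omega = omega - 0.01 * g
```

The reparameterization w = ω ⊙ ω keeps the sample weights non-negative without a projection step, matching the initialization in Algorithm 1.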

Dataset statistics


Citeseer, and Pubmed: 0.5 (dropout rate), 5·10⁻⁴ (L2 regularization), and 32 (number of hidden units); and for NELL: 0.1 (dropout rate), 1·10⁻⁵ (L2 regularization), and 64 (number of hidden units). For GAT on Cora and Citeseer, we use: 8 (first-layer attention heads), 8 (features per head), 1 (second-layer attention head), 0.6 (dropout rate), and 0.0005 (L2 regularization); for Pubmed: 8 (second-layer attention heads) and 0.001 (L2 regularization), with the other parameters the same as for Cora and Citeseer. For a fair comparison, the GNN part of our model uses the same architecture and hyperparameters as the base model, and we grid-search λ1 and λ2 over {0.01, 0.1, 1, 10, 100}. For the other baselines, we use the optimal hyperparameters reported in the literature for each dataset. For all experiments, we run 10 times with different random seeds and report the average accuracy.
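The λ1/λ2 grid search described above amounts to evaluating 25 combinations per dataset; a minimal sketch (the selection step is only indicated in a comment, since it requires a full train/validate run, and `val_acc` is a hypothetical callable):

```python
from itertools import product

# the grid used for both regularization weights
grid = [0.01, 0.1, 1, 10, 100]
configs = [{"lam1": l1, "lam2": l2} for l1, l2 in product(grid, grid)]
# best = max(configs, key=lambda cfg: val_acc(cfg))  # val_acc: hypothetical
```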

F EXTEND TO GAT

We can easily incorporate the VD/DVD term into other GNNs. Here we combine it with GAT and leave further extensions as future work. GAT uses an attention mechanism to aggregate neighbor information and also follows the linear aggregation and transformation steps. As with GCN, the hidden embedding H^(K−1) is the input of the VD/DVD term, the variable weights α are computed from the transformation matrix W^(K−1), and the sample weights w are used to reweight the softmax loss. Note that the original paper uses the same transformation matrix W^(K−1) both for transforming the embedding and for learning the attention values. Because α represents the importance of each variable for classification and should be computed from the matrix that transforms the embedding, we use separate matrices for transforming the embedding and for learning the attention values. This modification does not change the performance of GAT in our experiments.
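A minimal single-head numpy sketch of this modification (function and parameter names are illustrative assumptions): the attention coefficients are computed from their own transform `W_att`, while the aggregated embedding, which feeds the VD/DVD term, uses a separate `W_emb`.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head_separate(X, A, W_emb, W_att, a_src, a_dst):
    """One attention head with separate matrices for attention and embedding."""
    Za = X @ W_att                            # used only for attention scores
    Ze = X @ W_emb                            # used for the aggregated embedding
    e = leaky_relu((Za @ a_src)[:, None] + (Za @ a_dst)[None, :])
    e = np.where(A > 0, e, -1e9)              # attend only over graph neighbors
    att = np.exp(e - e.max(1, keepdims=True))
    att = att / att.sum(1, keepdims=True)     # softmax over each node's neighbors
    return att @ Ze                           # H^(K-1): input to the VD/DVD term

rng = np.random.default_rng(0)
n, d, p = 6, 5, 3
X = rng.normal(size=(n, d))
A = np.maximum((rng.random((n, n)) < 0.5).astype(float), np.eye(n))  # self-loops
H = gat_head_separate(X, A, rng.normal(size=(d, p)), rng.normal(size=(d, p)),
                      rng.normal(size=p), rng.normal(size=p))
```

Keeping `W_emb` and `W_att` separate lets α be read off the matrix that actually produces the embedding, as described above.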

G ADDITIONAL EXPERIMENTS G.1 SAMPLE WEIGHT ANALYSIS

Here we analyze the effect of the sample weights w in our model. We compute the amount of correlation in the labeled nodes' embeddings H^(K−1) learned by standard GCN, and in the weighted embeddings of the same layer learned by GCN-DVD. Note that the weights used are the sample weights from the last iteration of GCN-DVD. Following Cogswell et al. (2016), the amount of correlation for GCN and GCN-DVD is measured by the Frobenius norm of the cross-covariance matrix computed from the vectors of H^(K−1) and the weighted H^(K−1), respectively. Figure 3 shows the amount of correlation in the unweighted and weighted embeddings. We observe that the embedding correlation is reduced on all datasets, demonstrating that the weights learned by GCN-DVD can reduce the correlations between embedded variables. Moreover, one can observe that it is hard to reduce the correlation to zero, which further validates the necessity of differentiating the variables' weights.
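The correlation measure used here can be sketched as follows, assuming it is the Frobenius norm of the off-diagonal part of the (optionally sample-weighted) cross-covariance over embedding dimensions, in the spirit of Cogswell et al. (2016); the function name and the toy data are illustrative.

```python
import numpy as np

def corr_amount(H, w=None):
    """Frobenius norm of the off-diagonal (weighted) cross-covariance of H's columns."""
    n, p = H.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    w = w / w.sum()                       # normalize the sample weights
    mu = w @ H                            # weighted mean of each embedding dimension
    Hc = H - mu
    C = (Hc * w[:, None]).T @ Hc          # weighted covariance matrix (p x p)
    return np.linalg.norm(C - np.diag(np.diag(C)))

rng = np.random.default_rng(0)
H_ind = rng.normal(size=(500, 8))                 # nearly uncorrelated dimensions
H_dup = np.hstack([H_ind[:, :4], H_ind[:, :4]])   # strongly correlated dimensions
low, high = corr_amount(H_ind), corr_amount(H_dup)
```

Comparing `corr_amount(H)` with `corr_amount(H, w)` for learned weights w corresponds to the unweighted-versus-weighted comparison in Figure 3.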

G.2 PARAMETER SENSITIVITY

We study the sensitivity of the hyperparameters and report the results of GCN-DVD on the three citation networks in Fig. 4-6. The experimental results show that GCN-DVD is relatively stable over wide ranges of λ1 and λ2 in most cases, indicating the robustness of our model.

