UNIFYING GRAPH CONVOLUTIONAL NEURAL NETWORKS AND LABEL PROPAGATION

Abstract

Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message passing algorithms on graphs. Both solve the task of node classification but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. However, while conceptually similar, it is unclear how LPA and GCN can be combined under a unified framework to improve node classification. Here we study the relationship between LPA and GCN in terms of feature/label influence, in which we characterize how much the initial feature/label of one node influences the final feature/label of another node in GCN/LPA. Based on our theoretical analysis, we propose an end-to-end model that combines GCN and LPA. In our unified model, edge weights are learnable, and the LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning the weights for edges based on node labels, which is more task-oriented than existing feature-based attention models and topology-based diffusion models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art graph neural networks in terms of node classification accuracy.

1. INTRODUCTION

Consider the problem of node classification in a graph, where the goal is to learn a mapping M : V → L from node set V to label set L. A solution to this problem is widely applicable, e.g., to inferring the income of users in a social network or classifying scientific articles in a citation network. Unlike a generic machine learning problem in which samples are independent of each other, nodes are connected by edges in the graph, which provide additional information and require more delicate modeling. To capture the graph information, researchers have mainly designed models on the assumption that labels/features are correlated over the edges of the graph. In particular, on the label side L, node labels are propagated and aggregated along edges in the graph, which is known as the Label Propagation Algorithm (LPA) (Zhu et al., 2005; Zhou et al., 2004; Zhang & Lee, 2007; Wang & Zhang, 2008; Karasuyama & Mamitsuka, 2013; Gong et al., 2017; Liu et al., 2019a); on the node side V, node features are propagated along edges and transformed through neural network layers, which is known as Graph Convolutional Neural Networks (GCN) (Kipf & Welling, 2017; Hamilton et al., 2017; Li et al., 2018; Xu et al., 2018; Liao et al., 2019; Xu et al., 2019b; Qu et al., 2019). GCN and LPA are related in that they propagate features and labels on the two sides of the mapping M, respectively. Prior work (Li et al., 2019) has shown the relationship between GCN and LPA in terms of low-pass graph filtering. However, it is unclear how the discovered relationship benefits node classification. Specifically, can GCN and LPA be combined to develop a more accurate model for node classification in graphs?
Here we study the theoretical relationship between GCN and LPA from the viewpoint of feature/label influence, where we quantify how much the initial feature/label of node v b influences the output feature/label of node v a in GCN/LPA by studying the Jacobian/gradient of node v b with respect to node v a . We also prove the quantitative relationship between feature influence and label influence, i.e., the label influence of v b on v a equals the cumulative discounted feature influence of v b on v a in expectation (Theorem 1). Based on the theoretical analysis, we propose a unified model GCN-LPA for node classification. We show that the key to improving the performance of GCN is to enable nodes of the same class to connect more strongly with each other by making edge weights/strengths trainable. Then we prove that increasing the strength of edges between the nodes of the same class is equivalent to increasing the accuracy of LPA's predictions (Theorem 2). Therefore, we can first learn the optimal edge weights by minimizing the loss of predictions in LPA, then plug the optimal edge weights into a GCN to learn node representations. In GCN-LPA, we further combine the above two steps together and train the whole model in an end-to-end fashion, where the LPA part serves as regularization to assist the GCN part in learning proper edge weights that benefit the separation of different node classes. It is worth noticing that GCN-LPA can also be seen as learning the weights for edges based on node label information, which requires less handcrafting and is more task-oriented than existing attention models that learn edge weights based on node feature similarity (Veličković et al., 2018; Thekumparampil et al., 2018; Zhang et al., 2018; Liu et al., 2019b) or diffusion models that learn adjacency matrix based on graph topology (Klicpera et al., 2019a; Xu et al., 2019a; Abu-El-Haija et al., 2019; Klicpera et al., 2019b) . 
We conduct extensive experiments on five datasets, and the results indicate that our model outperforms state-of-the-art graph neural networks in terms of classification accuracy. The experimental results also show that combining GCN and LPA together is able to learn more informative edge weights thereby leading to better performance.

2. OUR APPROACH

In this section, we first formulate the node classification problem and briefly introduce LPA and GCN. We then prove their relationship from the viewpoints of feature influence and label influence. Based on the theoretical finding, we propose a unified model GCN-LPA, and analyze why our model is theoretically superior to vanilla GCN.

2.1. PROBLEM FORMULATION AND PRELIMINARIES

Consider a graph $G = (V, A, X, Y)$, where $V = \{v_1, \cdots, v_n\}$ is the set of nodes, $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $X$ is the feature matrix of nodes, and $Y$ is the labels of nodes. $a_{ij}$ (the $ij$-th entry of $A$) is the weight of the edge connecting $v_i$ and $v_j$. $N(v)$ denotes the set of first-order neighbors of node $v$ in graph $G$. Each node $v_i$ has a feature vector $x_i$, which is the $i$-th row of $X$, while only the first $m$ nodes ($m \ll n$) have labels $y_1, \cdots, y_m$ from a label set $L = \{1, \cdots, c\}$. The goal is to learn a mapping $M : V \to L$ and predict labels of unlabeled nodes.

Label Propagation Algorithm. LPA (Zhu et al., 2005) assumes that two connected nodes are likely to have the same label, and thus it propagates labels iteratively along the edges. Let $Y^{(k)} = [y_1^{(k)}, \cdots, y_n^{(k)}]^\top \in \mathbb{R}^{n \times c}$ be the soft label matrix in iteration $k > 0$, in which the $i$-th row $y_i^{(k)}$ denotes the predicted label distribution for node $v_i$ in iteration $k$. When $k = 0$, the initial label matrix $Y^{(0)} = [y_1^{(0)}, \cdots, y_n^{(0)}]^\top$ consists of one-hot label indicator vectors $y_i^{(0)}$ for $i = 1, \cdots, m$ (i.e., labeled nodes) or zero vectors otherwise (i.e., unlabeled nodes). Then LPA in iteration $k$ is formulated as the following two steps:

$$Y^{(k+1)} = \tilde{A} Y^{(k)}, \tag{1}$$

$$y_i^{(k+1)} = y_i^{(0)}, \quad \forall\, i \le m. \tag{2}$$

In the above equations, $\tilde{A}$ is the normalized adjacency matrix, which can be the random walk transition matrix $\tilde{A}_{rw} = D^{-1} A$ or the symmetric transition matrix $\tilde{A}_{sym} = D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix for $A$ with entries $d_{ii} = \sum_j a_{ij}$. Without loss of generality, we use $\tilde{A} = \tilde{A}_{rw}$ in this work. In Eq. (1), all nodes propagate labels to their neighbors according to normalized edge weights. Then in Eq. (2), labels of all labeled nodes are reset to their initial values: LPA keeps the labels of labeled nodes fixed so that unlabeled nodes do not overpower the labeled ones, since the initial labels would otherwise fade away.

Graph Convolutional Neural Networks.
GCN (Kipf & Welling, 2017) is a multi-layer feedforward neural network that propagates and transforms node features across the graph. The feature propagation scheme of GCN in layer $k$ is

$$X^{(k+1)} = \sigma\left(\tilde{A} X^{(k)} W^{(k)}\right),$$

where $W^{(k)}$ is the trainable weight matrix in the $k$-th layer, $\sigma(\cdot)$ is an activation function, and $X^{(k)} = [x_1^{(k)}, \cdots, x_n^{(k)}]^\top$ are the $k$-th layer node representations with $X^{(0)} = X$. By setting the dimension of the last layer to the number of classes $c$, the last layer can be seen as the (unnormalized) label distribution predicted for a given node. The whole model can thus be optimized by minimizing the discrepancy between predicted node label distributions and ground-truth labels $Y$.
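The two propagation rules above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the helper names (`row_normalize`, `lpa`, `gcn_layer`), the toy path graph, and the random features and weights are all assumptions made for the example, and ReLU is used as the activation. Labeled nodes are given by a boolean mask instead of being the first $m$ nodes.

```python
import numpy as np


def row_normalize(A):
    """Random-walk normalization D^{-1} A, so that every row sums to one."""
    deg = A.sum(axis=1, keepdims=True)
    return A / np.maximum(deg, 1e-12)  # guard against isolated nodes


def lpa(A_norm, Y0, labeled, num_iters):
    """Label propagation: propagate (Eq. 1), then reset labeled rows (Eq. 2)."""
    Y = Y0.copy()
    for _ in range(num_iters):
        Y = A_norm @ Y
        Y[labeled] = Y0[labeled]
    return Y


def gcn_layer(A_norm, X, W):
    """One GCN layer with ReLU as the activation: ReLU(A_norm X W)."""
    return np.maximum(A_norm @ X @ W, 0.0)


# Toy path graph 0 - 1 - 2 - 3; nodes 0 and 3 are labeled with classes 0 and 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = row_normalize(A)
labeled = np.array([True, False, False, True])
Y0 = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)

Y = lpa(A_norm, Y0, labeled, num_iters=2)
print(Y.argmax(axis=1))  # predicted class per node

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # hypothetical 3-dimensional node features
W = rng.normal(size=(3, 2))  # hypothetical weight matrix with 2 output units
out = gcn_layer(A_norm, X, W)
print(out.shape)
```

On this graph, each unlabeled node ends up closer to the label of its nearest labeled neighbor, which is the behavior LPA's smoothness assumption predicts.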

2.2. FEATURE INFLUENCE AND LABEL INFLUENCE

Consider two nodes $v_a$ and $v_b$ in a graph. Inspired by Koh & Liang (2017) and Xu et al. (2018), we study the relationship between GCN and LPA in terms of influence, i.e., how the output feature/label of $v_a$ will change if the initial feature/label of $v_b$ is varied slightly. Technically, the feature/label influence is measured by the Jacobian/gradient of the output feature/label of $v_a$ with respect to the initial feature/label of $v_b$. Denote $x_a^{(k)}$ as the $k$-th layer representation vector of $v_a$ in GCN, and $x_b$ as the initial feature vector of $v_b$. We quantify the feature influence of $v_b$ on $v_a$ as follows:

Definition 1 (Feature influence) The feature influence of node $v_b$ on node $v_a$ after $k$ layers of GCN is the L1-norm of the expected Jacobian matrix $\partial x_a^{(k)} / \partial x_b$:

$$I_f(v_a, v_b; k) = \left\| \mathbb{E}\left[ \partial x_a^{(k)} / \partial x_b \right] \right\|_1.$$

The normalized feature influence is then defined as

$$\tilde{I}_f(v_a, v_b; k) = \frac{I_f(v_a, v_b; k)}{\sum_{v_i \in V} I_f(v_a, v_i; k)}.$$

We also consider the label influence of node $v_b$ on node $v_a$ in LPA (this implies that $v_a$ is unlabeled and $v_b$ is labeled). Since different label dimensions of $y_i^{(\cdot)}$ do not interact with each other in LPA, we assume for simplicity that all $y_i$ and $y_i^{(\cdot)}$ are scalars within $[0, 1]$ (i.e., this is a binary classification task). Label influence is defined as follows:

Definition 2 (Label influence) The label influence of labeled node $v_b$ on unlabeled node $v_a$ after $k$ iterations of LPA is the gradient of $y_a^{(k)}$ with respect to $y_b$:

$$I_l(v_a, v_b; k) = \partial y_a^{(k)} / \partial y_b.$$

The following theorem shows the relationship between feature influence and label influence:

Theorem 1 (Relationship between feature influence and label influence) Assume the activation function used in GCN is ReLU. Denote $v_a$ as an unlabeled node, $v_b$ as a labeled node, and $\beta$ as the fraction of unlabeled nodes. Then the label influence of $v_b$ on $v_a$ after $k$ iterations of LPA equals, in expectation, the cumulative normalized feature influence of $v_b$ on $v_a$ after $k$ layers of GCN:

$$\mathbb{E}\left[ I_l(v_a, v_b; k) \right] = \sum_{j=1}^{k} \beta^j \, \tilde{I}_f(v_a, v_b; j).$$
Proof of Theorem 1 is in Appendix A. Intuitively, Theorem 1 shows that if v b has high label influence on v a , then the initial feature vector of v b will also affect the output feature vector of v a greatly. Theorem 1 provides the theoretical guideline for designing our unified model in the next subsection.
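As a rough illustration of the right-hand side of Theorem 1, the sketch below evaluates the cumulative discounted sum on a toy graph. It relies on the random-walk form of the normalized feature influence stated in Lemma 1 (Appendix A), namely that the normalized influence after $j$ layers is the $j$-step random-walk probability from $v_a$ to $v_b$. The function name, the path graph, and the value of $\beta$ are assumptions chosen for the example.

```python
import numpy as np


def cumulative_discounted_influence(A_norm, a, b, k, beta):
    """Sum over j = 1..k of beta^j times the j-step walk probability a -> b.

    Assumption: the normalized feature influence after j layers is taken to
    be (A_norm^j)[a, b], the random-walk form given in Lemma 1.
    """
    total = 0.0
    walk = np.eye(A_norm.shape[0])
    for j in range(1, k + 1):
        walk = walk @ A_norm  # walk now holds A_norm^j
        total += beta ** j * walk[a, b]
    return total


# Toy path graph 0 - 1 - 2 - 3 with random-walk normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)

value = cumulative_discounted_influence(A_norm, a=1, b=0, k=3, beta=0.5)
print(value)
```

Because $\beta < 1$, longer walks are discounted more heavily, so nearby labeled nodes dominate the sum.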

2.3. THE UNIFIED MODEL

Before introducing the proposed model, we rethink the GCN method and consider what an ideal set of node representations should look like. Since we aim to classify nodes, the perfect node representation would be such that nodes with the same label are embedded closely together, which would give a large separation between different classes. Intuitively, the key to achieving this goal is to enable nodes within the same class to connect more strongly with each other, so that they are pushed together by GCN (more discussion is presented in Section 2.4). We can therefore make edge strengths/weights trainable, then learn to increase the intra-class feature influence

$$\sum_{i \in L} \; \sum_{v_a, v_b : \, y_a = i, \, y_b = i} \tilde{I}_f(v_a, v_b)$$

($L$ is the label set) by adjusting edge weights. However, this requires operating on Jacobian matrices of size $d^{(0)} \times d^{(K)}$ ($d^{(0)}$ and $d^{(K)}$ are the dimensions of input and output in GCN, respectively), which is impractical if initial node features are high-dimensional. Fortunately, according to Theorem 1, we can turn to optimizing the intra-class label influence instead, i.e., $\sum_{i \in L} \sum_{v_a, v_b : \, y_a = i, \, y_b = i} I_l(v_a, v_b)$. Note that

$$\sum_{i \in L} \; \sum_{v_a, v_b : \, y_a = i, \, y_b = i} I_l(v_a, v_b) = \sum_{v_a} \; \sum_{v_b : \, y_b = y_a} I_l(v_a, v_b).$$

We further show, by the following theorem, that the term $\sum_{v_b : \, y_b = y_a} I_l(v_a, v_b)$ (the total intra-class label influence on a given node $v_a$) is proportional to the probability that $v_a$ is classified correctly by LPA:

Theorem 2 (Relationship between label influence and LPA's prediction) Consider a given node $v_a$ and its label $y_a$. If we treat node $v_a$ as unlabeled, then the total label influence of nodes with label $y_a$ on node $v_a$ is proportional to the probability that node $v_a$ is classified as $y_a$ by LPA:

$$\sum_{v_b : \, y_b = y_a} I_l(v_a, v_b; k) \propto \Pr\left( \hat{y}_a^{lpa} = y_a \right),$$

where $\hat{y}_a^{lpa}$ is the predicted label of $v_a$ using a $k$-iteration LPA.

Proof of Theorem 2 is in Appendix B.
Theorem 2 indicates that, if edge weights $\{a_{ij}\}$ maximize the probability that $v_a$ is correctly classified by LPA, then they also maximize the intra-class label influence on node $v_a$. We can therefore first learn the optimal edge weights $A^*$ by minimizing the loss of predicted labels by LPA:

$$A^* = \arg\min_A L_{lpa}(A) = \arg\min_A \frac{1}{m} \sum_{v_a : \, a \le m} J\left( \hat{y}_a^{lpa}, y_a \right),$$

where $J$ is the cross-entropy loss, and $\hat{y}_a^{lpa}$ and $y_a$ are the predicted label distribution of $v_a$ using LPA and the true one-hot label of $v_a$, respectively. $a \le m$ means $v_a$ is labeled. The optimal $A^*$ maximizes the probability that each node is correctly labeled by LPA, and thus also maximizes the intra-class label influence (according to Theorem 2) and the intra-class feature influence (according to Theorem 1). Since $A^*$ increases the connection strength within each class, it is expected to improve the performance of GCN compared with the original adjacency matrix $A$. Therefore, we can plug $A^*$ into GCN to predict labels:

$$X^{(k+1)} = \sigma\left( A^* X^{(k)} W^{(k)} \right), \quad k = 0, 1, \cdots, K - 1. \tag{7}$$

We use $\hat{y}_a^{gcn}$, the $a$-th row of $X^{(K)}$, to denote the predicted label distribution of $v_a$ using the GCN specified in Eq. (7). Then the optimal transformation matrices in the GCN can be learned by minimizing the loss of predicted labels by GCN:

$$W^* = \arg\min_W L_{gcn}(W, A^*) = \arg\min_W \frac{1}{m} \sum_{v_a : \, a \le m} J\left( \hat{y}_a^{gcn}, y_a \right).$$

It is more elegant (and empirically better) to combine the above two steps into a multi-objective optimization problem, and train the whole model in an end-to-end fashion:

$$W^*, A^* = \arg\min_{W, A} \; L_{gcn}(W, A) + \lambda L_{lpa}(A),$$

where $\lambda$ is the balancing hyper-parameter. In this way, $L_{lpa}(A)$ serves as a regularization term that assists the learning of edge weights $A$, since it is hard for GCN to learn both $W$ and $A$ simultaneously due to overfitting.
The proposed GCN-LPA approach can also be seen as learning the importance of edges that can be used to reconstruct node labels accurately by LPA, then transferring this knowledge from label space to feature space for GCN. It is also worth noting how the optimal $A^*$ is configured. The principle here is that we do not modify the basic structure of the original graph (i.e., we do not add or remove edges) but only adjust the weights of existing edges. This is equivalent to learning a positive mask matrix $M$ for the adjacency matrix $A$ and taking the Hadamard product $M \circ A = A^*$. Each element $M_{ij}$ can be set as either a free variable or a function of the two nodes, for example, $M_{ij} = \log\left( \exp(x_i^\top H x_j) + 1 \right)$, where $H$ is a learnable kernel matrix for measuring feature similarity.
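The joint objective of this subsection can be sketched as a forward computation in NumPy. This is a hedged illustration under several assumptions that the text does not fix: the mask uses the softplus kernel form mentioned above, $A^*$ is row-normalized before propagation, the last GCN layer is left without activation and fed to a softmax, and the LPA loss scores each labeled node with its own label hidden (following the "treat $v_a$ as unlabeled" condition of Theorem 2). All function names and the toy graph are hypothetical, and no gradient step is shown.

```python
import numpy as np


def softplus(z):
    """log(exp(z) + 1), computed in a numerically stable way."""
    return np.logaddexp(0.0, z)


def row_normalize(A, eps=1e-12):
    return A / np.maximum(A.sum(axis=1, keepdims=True), eps)


def masked_adjacency(A, X, H):
    """A* = M o A with M_ij = log(exp(x_i^T H x_j) + 1) > 0."""
    return softplus(X @ H @ X.T) * A


def lpa_loss(A_star, labels, labeled_idx, num_classes, num_iters, eps=1e-12):
    """Mean cross-entropy of LPA predictions on labeled nodes.

    Assumption: each labeled node is scored with its own label hidden.
    """
    A_norm = row_normalize(A_star)
    Y0 = np.zeros((A_star.shape[0], num_classes))
    Y0[labeled_idx, labels[labeled_idx]] = 1.0
    losses = []
    for a in labeled_idx:
        Y_init = Y0.copy()
        Y_init[a] = 0.0                       # treat v_a as unlabeled
        keep = labeled_idx[labeled_idx != a]  # the remaining labeled nodes
        Y = Y_init.copy()
        for _ in range(num_iters):
            Y = A_norm @ Y
            Y[keep] = Y_init[keep]
        p = (Y[a] + eps) / (Y[a] + eps).sum()  # normalize to a distribution
        losses.append(-np.log(p[labels[a]]))
    return float(np.mean(losses))


def gcn_loss(A_star, X, weights, labels, labeled_idx):
    """Softmax cross-entropy of a GCN that propagates with normalized A*."""
    A_norm = row_normalize(A_star)
    rep = X
    for W in weights[:-1]:
        rep = np.maximum(A_norm @ rep @ W, 0.0)  # hidden layers use ReLU
    logits = A_norm @ rep @ weights[-1]
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[labeled_idx, labels[labeled_idx]].mean())


def gcn_lpa_objective(A, X, H, weights, labels, labeled_idx, lam, num_iters):
    """L_gcn(W, A) + lambda * L_lpa(A), with A parameterized by the mask."""
    A_star = masked_adjacency(A, X, H)
    num_classes = weights[-1].shape[1]
    return (gcn_loss(A_star, X, weights, labels, labeled_idx)
            + lam * lpa_loss(A_star, labels, labeled_idx, num_classes, num_iters))


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(6, 4))
H = 0.1 * rng.normal(size=(4, 4))
weights = [0.1 * rng.normal(size=(4, 5)), 0.1 * rng.normal(size=(5, 2))]
labels = np.array([0, 0, 0, 1, 1, 1])
labeled_idx = np.array([0, 1, 4, 5])

loss = gcn_lpa_objective(A, X, H, weights, labels, labeled_idx,
                         lam=1.0, num_iters=2)
print(loss)
```

Because the mask is strictly positive and is multiplied entrywise with $A$, absent edges stay at zero and existing edges are only reweighted, which matches the principle stated above. In an actual training loop, `H` and `weights` would be updated by an automatic differentiation framework.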

2.4. ANALYSIS OF GCN-LPA MODEL BEHAVIOR

In this subsection, we show benefits of our unified model compared with GCN by analyzing properties of embeddings produced by the two models. We first analyze the update rule of GCN for node $v_i$:

$$x_i^{(k+1)} = \sigma\left( \sum_{v_j \in N(v_i)} \tilde{a}_{ij} \, x_j^{(k)} W^{(k)} \right),$$

where $\tilde{a}_{ij} = a_{ij} / d_{ii}$ is the normalized weight of edge $(j, i)$. This formula can be decomposed into the following two steps:

(1) In the aggregation step, we calculate the aggregated representation $h_i^{(k)}$ of all neighbors in $N(v_i)$:

$$h_i^{(k)} = \sum_{v_j \in N(v_i)} \tilde{a}_{ij} \, x_j^{(k)}.$$

(2) In the transformation step, the aggregated representation $h_i^{(k)}$ is mapped to a new space by a transformation matrix and a nonlinear function:

$$x_i^{(k+1)} = \sigma\left( h_i^{(k)} W^{(k)} \right).$$

Figure 1: A graph with two classes of nodes, while white nodes are unlabeled (Figure 1a). To classify nodes, our model will increase the connecting strength among nodes within the same class, thereby increasing their feature/label influence on each other. In this way, our model is able to identify potential intra-class edges (bold links in Figure 1b) and strengthen their weights.

We show by the following theorem that the aggregation step reduces the overall distance in the embedding space between the nodes that are connected in the graph:

Theorem 3 (Shrinking property in GCN) Let $D(x) = \frac{1}{2} \sum_{v_i, v_j} \tilde{a}_{ij} \left\| x_i - x_j \right\|_2^2$ be a distance metric over node embeddings $x$. Then we have $D(h^{(k)}) \le D(x^{(k)})$.

Proof of Theorem 3 is in Appendix C. Theorem 3 indicates that the overall distance among connected nodes is reduced after taking one aggregation step, which implies that connected components in the graph "shrink" and nodes within each connected component get closer to each other in the embedding space. In an ideal case where edges only connect nodes with the same label, the aggregation step will push nodes within the same class together, which greatly benefits the transformation step that acts like using a hyperplane $W^{(k)}$ for classification. However, two connected nodes may have different labels.
These "noisy" edges will impede the formation of clusters and make the interclass boundary less clear. Fortunately, in GCN-LPA, edge weights are learned by minimizing the difference between ground-truth labels and labels reconstructed from local neighbors. This will force the model to increase the weight/bandwidth of possible paths that connect nodes with the same label, so that labels can "flow" easily along these paths for the purpose of label reconstruction. In this way, GCN-LPA is able to identify potential intra-class edges and increase their weights to assist learning clustering structures ( see Figure 1 for an illustrating example). To empirically justify our claim, we apply a two-layer untrained GCN with randomly initialized transformation matrices to the well-known Zachary's karate club network (Zachary, 1977) as shown in Figure 2a , which contains 34 nodes of 2 classes and 78 unweighted edges (grey solid lines). We then increase the weights of intra-class edges by ten times to simulate GCN-LPA. We find that GCN works well on this network (Figure 2b ), but GCN-LPA performs even better than GCN because the node embeddings are completely linearly separable as shown in Figure 2c . To further justify our claim, we randomly add 20 "noisy" inter-class edges (grey dotted lines) to the original network, from which we observe that GCN is misled by noise and mixes nodes of two classes together (Figure 2d ), but GCN-LPA still distinguishes the two clusters (Figure 2e ) because it is better at "denoising" undesirable edges based on the supervised signal of labels. Notice that GCN does not produce linearly separable embeddings (Figure 2b vs. Figure 2c ), while GCN-LPA performs much better even in the presence of noisy edges (Figure 2d vs. Figure 2e ). Additional visualizations are included in Appendix D. Locally Linear Embedding. 
Locally linear embedding (LLE) (Roweis & Saul, 2000) and its variants (Zhang & Wang, 2007; Kong et al., 2012) learn edge weights by constructing a linear dependency between a node and its neighbors, then use the learned edge weights to embed highdimensional nodes into a low-dimensional space. Our work is similar to LLE in the aspect of transferring the knowledge of edge importance from one space to another, but the difference is that LLE is an unsupervised dimension reduction method that learns the graph structure based on local proximity only, while our work is semi-supervised and explores high-order relationship among nodes. Label Propagation Algorithm. Classical LPA (Zhu et al., 2005; Zhou et al., 2004) can only make use of node labels rather than node features. In contrast, adaptive LPA considers node features by making edge weights learnable. Typical techniques of learning edge weights include adopting kernel functions (Zhu et al., 2003; Liu et al., 2019a ) (e.g., a ij = exp(-d (x id -x jd ) 2 /σ 2 d ) where d is dimensionality of features), minimizing neighborhood reconstruction error (Wang & Zhang, 2008; Karasuyama & Mamitsuka, 2013) , using leave-one-out loss (Zhang & Lee, 2007) , or imposing sparseness on edge weights (Hong et al., 2009) . However, in these LPA variants, node features are only used to assist learning the graph structure rather than explicitly mapped to node labels, which limits their capability in node classification. Another notable difference is that adaptive LPA learns edge weights by introducing the regularizations above, while our work takes LPA itself as regularization to learn edge weights. Attention and Diffusion on Graphs. Our method is also conceptually connected to attention mechanism on graphs, in which an attention weight α ij is learned between node v i and v j . 
For example, $\alpha_{ij} = \mathrm{LeakyReLU}\left( a^\top [W x_i \,\|\, W x_j] \right)$ in GAT (Veličković et al., 2018), $\alpha_{ij} = a \cdot \cos(W x_i, W x_j)$ in AGNN (Thekumparampil et al., 2018), $\alpha_{ij} = (W_1 x_i)^\top W_2 x_j$ in GaAN (Zhang et al., 2018), and $\alpha_{ij} = a^\top \tanh(W_1 x_i + W_2 x_j)$ in GeniePath (Liu et al., 2019b), where $a$ and $W$ are trainable variables. Our method is also similar to diffusion-based methods (Klicpera et al., 2019a; Xu et al., 2019a; Abu-El-Haija et al., 2019; Klicpera et al., 2019b; Jiang et al., 2019; Yang et al., 2019). Graph diffusion uses extended neighborhoods for aggregation in GNNs, which can be seen as learning a new adjacency matrix for a given graph. A significant difference between attention/diffusion mechanisms and our work is that attention/diffusion is learned based on feature similarity/graph topology, while we propose that edge weights should be consistent with the distribution of labels on the graph, which requires less handcrafting of the attention/diffusion function and is more task-oriented.

4.1. EXPERIMENT SETUP

Datasets. We use the following five datasets in our experiments. Cora, Citeseer, and Pubmed (Sen et al., 2008) are citation networks, where nodes correspond to documents, edges correspond to citation links, and each node has a sparse bag-of-words feature vector as well as a class label. We also use two co-authorship networks (Shchur et al., 2018) [...] study for each author. Statistics of the five datasets are shown in Appendix E. We also calculate the intra-class edge rate (the fraction of edges that connect two nodes within the same class), which is significantly higher than the inter-class edge rate in all networks. The finding supports our claim in Section 2.4 that node classification benefits from intra-class edges in a graph.

83.0 ± 1.4 | 72.6 ± 0.9 | 78.4 ± 1.5 | 91.9 ± 0.9 | 93.4 ± 1.6
Table 1: Mean and the 95% confidence intervals of test set accuracy for all methods and datasets.

Baselines. We compare against the following baselines in our experiments. Logistic Regression (LR) is a feature-based method that does not consider the graph structure. Label Propagation (LPA) (Zhu et al., 2005), on the other hand, considers only the graph structure and ignores node features. We also compare with several GNNs: Graph Convolutional Network (GCN) (Kipf & Welling, 2017), Graph Attention Network (GAT), Jumping Knowledge Network (JK-Net) (Xu et al., 2018), Graph Isomorphism Network (GIN) (Xu et al., 2019b), and Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b) (with GCN as the base model). In addition, we propose another baseline, GCN+LPA, which simply adds the predictions of GCN and LPA together.

Experimental Setup. Our experiments focus on the transductive setting, where we only know the labels of a subset of nodes but have access to the entire graph as well as the features of all nodes. We randomly sample 20 nodes per class as the training set, 50 nodes per class as the validation set, and the remaining nodes as the test set. The weight of each edge is treated as a free variable during training.
We train our model for 200 epochs using Adam (Kingma & Ba, 2015) and report the test set accuracy when validation set accuracy is maximized. Each experiment is repeated five times and we report the mean and the 95% confidence interval. We initialize weights according to Glorot & Bengio (2010) and row-normalize input features. During training, we apply L2 regularization to the transformation matrices and use the dropout technique (Srivastava et al., 2014) . The settings of all other hyperparameters can be found in Appendix F.
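The per-class sampling described above (20 training nodes and 50 validation nodes per class, with the remainder used for testing) can be sketched as follows. The function name `split_per_class`, the seed handling, and the synthetic label vector are assumptions for illustration; the paper does not specify how the random sampling is implemented.

```python
import numpy as np


def split_per_class(labels, n_train=20, n_val=50, seed=0):
    """Sample n_train and n_val node indices per class; the rest is the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
    train = np.sort(np.concatenate(train))
    val = np.sort(np.concatenate(val))
    test = np.setdiff1d(np.arange(labels.size), np.concatenate([train, val]))
    return train, val, test


# Hypothetical label vector: 3 classes with 100 nodes each.
labels = np.repeat(np.arange(3), 100)
train_idx, val_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Using a different `seed` for each of the five repetitions would give five independent random splits, which is one plausible reading of the protocol.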

4.2. RESULTS

Comparison with Baselines. The results of node classification are summarized in Table 1. [...] does not perform consistently well on other datasets. In addition, GCN+LPA does not perform well, since it utilizes the prediction of LPA directly, making its performance limited by LPA.

Efficacy of LPA Regularization. We investigate the influence of the number of LPA iterations and the training weight of the LPA loss term λ on classification performance. The results on the Citeseer dataset are plotted in Figures 3 and 4, respectively, where each line corresponds to a given number of GCN layers in GCN-LPA. From Figure 3 we observe that the performance is boosted at first when the number of LPA iterations increases, then the accuracy stops increasing and decreases, since a large number of LPA iterations will include more noisy nodes. Figure 4 shows that training without the LPA loss term (i.e., λ = 0) is more difficult than the case where λ is between 1 and 5, which justifies our aforementioned claim that it is hard for the GCN part to learn both transformation matrices W and edge weights A simultaneously without the assistance of LPA regularization.

Influence of Labeled Node Rate. To study the influence of the labeled node rate on the performance of our model, we vary the labeled node rate on Citeseer from 5% to 80% while keeping the validation and test sets fixed, and report the result in Table 2. From Table 2 we observe that GCN-LPA outperforms GCN and LPA consistently, and the improvement achieved by GCN-LPA increases when the labeled node rate is larger (from 0.6% to 2.1% compared with GCN). This is because GCN-LPA requires node labels to calculate edge weights. Therefore, a larger labeled node rate will provide more information for identifying noisy edges.

5. CONCLUSION

We study the theoretical relationship between two types of well-known graph-based algorithms for node classification, the label propagation algorithm and graph convolutional neural networks, from the perspective of feature/label influence. We then propose a unified model, GCN-LPA, which learns transformation matrices and edge weights simultaneously in GCN with the assistance of an LPA regularizer. We also analyze why our unified model performs better than traditional GCN in terms of node classification. Experiments on five datasets demonstrate that our model outperforms state-of-the-art baselines, and it is also highly time-efficient with respect to the size of a graph.

APPENDIX

A PROOF OF THEOREM 1

Before proving Theorem 1, we first give two lemmas that demonstrate the exact form of feature influence and label influence defined in this paper. The relationship between feature influence and label influence can then be deduced from their exact forms.

Lemma 1 Assume that the nonlinear activation function in GCN is ReLU. Let $P_k^{a \to b}$ be a path $[v^{(k)}, v^{(k-1)}, \cdots, v^{(0)}]$ of length $k$ from node $v_a$ to node $v_b$, where $v^{(k)} = v_a$, $v^{(0)} = v_b$, and $v^{(i-1)} \in N(v^{(i)})$ for $i = k, \cdots, 1$. Then we have

$$\tilde{I}_f(v_a, v_b; k) = \sum_{P_k^{a \to b}} \prod_{i=k}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}}, \tag{10}$$

where $\tilde{a}_{v^{(i-1)}, v^{(i)}}$ is the normalized weight of edge $(v^{(i)}, v^{(i-1)})$.

Proof. See Xu et al. (2018) for the detailed proof.

The product term in Eq. (10) is the probability of a given path $P_k^{a \to b}$. Therefore, the right-hand side of Eq. (10) is the sum over probabilities of all possible paths of length $k$ from $v_a$ to $v_b$, which is the probability that a random walk starting at $v_a$ ends at $v_b$ after taking $k$ steps.

Lemma 2 Let $U_j^{a \to b}$ be a path $[v^{(j)}, v^{(j-1)}, \cdots, v^{(0)}]$ of length $j$ from node $v_a$ to node $v_b$, where $v^{(j)} = v_a$, $v^{(0)} = v_b$, $v^{(i-1)} \in N(v^{(i)})$ for $i = j, \cdots, 1$, and all nodes along the path are unlabeled except $v^{(0)}$. Then we have

$$I_l(v_a, v_b; k) = \sum_{j=1}^{k} \sum_{U_j^{a \to b}} \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}}, \tag{11}$$

where $\tilde{a}_{v^{(i-1)}, v^{(i)}}$ is the normalized weight of edge $(v^{(i)}, v^{(i-1)})$.

To intuitively understand this lemma, note that there are two differences between Lemma 1 and Lemma 2: (1) In Lemma 1, $\tilde{I}_f(v_a, v_b; k)$ sums over all paths from $v_a$ to $v_b$ of length $k$, but in Lemma 2, $I_l(v_a, v_b; k)$ sums over all paths from $v_a$ to $v_b$ of length no more than $k$. This is because in LPA, $v_b$'s label is reset to its initial value after each iteration, which means that the label of $v_b$ serves as a constant signal that begins propagating in the graph again and again after each iteration.
(2) In Lemma 1 we consider all possible paths from v a to v b , but in Lemma 2, the paths are restricted to contain unlabeled nodes only. The reason here is the same as above: Since the labels of labeled nodes are reset to their initial values after each iteration in LPA, the influence of v b 's label will be absorbed in labeled nodes, and the propagation of v b 's label will be cut off at these nodes. Therefore, v b 's label can only flow to v a along the paths with unlabeled nodes only. See Figure 7 for an illustrating example showing the label propagation in LPA.
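The path-sum form in Lemma 2 can be checked numerically on a small graph. The sketch below computes the label influence in two ways: directly, by perturbing $y_b$ and rerunning LPA (LPA is linear in the initial labels, so a unit perturbation gives the exact gradient), and via walks that stay on unlabeled nodes until a final hop into $v_b$. The function names and the toy path graph are assumptions made for the example.

```python
import numpy as np


def lpa_scalar(A_norm, y0, labeled, num_iters):
    """Scalar-label LPA: propagate, then reset the labeled nodes."""
    y = y0.copy()
    for _ in range(num_iters):
        y = A_norm @ y
        y[labeled] = y0[labeled]
    return y


def label_influence_direct(A_norm, y0, labeled, a, b, k):
    """d y_a^(k) / d y_b, obtained from a unit perturbation of y_b."""
    bumped = y0.copy()
    bumped[b] += 1.0
    return (lpa_scalar(A_norm, bumped, labeled, k)[a]
            - lpa_scalar(A_norm, y0, labeled, k)[a])


def label_influence_paths(A_norm, labeled, a, b, k):
    """Sum over j = 1..k of walk probabilities a -> b through unlabeled nodes."""
    unl = np.flatnonzero(~labeled)
    A_uu = A_norm[np.ix_(unl, unl)]         # transitions among unlabeled nodes
    walk = A_norm[unl, b].copy()            # final hop of each path, into v_b
    pos = int(np.flatnonzero(unl == a)[0])  # position of v_a among unlabeled
    total = 0.0
    for _ in range(k):
        total += walk[pos]
        walk = A_uu @ walk                  # prepend one more unlabeled hop
    return total


# Toy path graph 0 - 1 - 2 - 3; nodes 0 and 3 are labeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)
labeled = np.array([True, False, False, True])
y0 = np.array([1.0, 0.0, 0.0, 0.0])

direct = label_influence_direct(A_norm, y0, labeled, a=1, b=0, k=3)
paths = label_influence_paths(A_norm, labeled, a=1, b=0, k=3)
print(direct, paths)  # both evaluate to 0.625 on this graph
```

Here the contributing paths are 1 → 0 (weight 0.5) and 1 → 2 → 1 → 0 (weight 0.125); no path of length 2 reaches node 0 through unlabeled nodes only.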

Proof.

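As mentioned above, a significant difference between LPA and GCN is that all labeled nodes are reset to their original labels after each iteration in LPA. This implies that the initial label $y_b$ of node $v_b$ appears not only as $y_b^{(0)}$, but also as every $y_b^{(j)}$ for $j = 1, \cdots, k-1$. The label influence of $v_b$ on $v_a$ is therefore the cumulative influence of all $y_b^{(j)}$ on $y_a^{(k)}$:

$$I_l(v_a, v_b; k) = \frac{\partial y_a^{(k)}}{\partial y_b} = \sum_{j=0}^{k-1} \frac{\partial y_a^{(k)}}{\partial y_b^{(j)}}. \tag{12}$$

According to the updating rule of LPA, we have

$$\frac{\partial y_a^{(k)}}{\partial y_b^{(j)}} = \frac{\partial \sum_{v_z \in N(v_a)} \tilde{a}_{az} \, y_z^{(k-1)}}{\partial y_b^{(j)}} = \sum_{v_z \in N(v_a)} \tilde{a}_{az} \frac{\partial y_z^{(k-1)}}{\partial y_b^{(j)}}, \tag{13}$$

where $v_z$ traverses all neighbors of $v_a$. For those $v_z$'s that are initially labeled, $y_z^{(k-1)}$ is reset to their initial labels in each iteration. Therefore, they are always constant and independent of $y_b^{(j)}$, and only the unlabeled neighbors remain:

$$\frac{\partial y_a^{(k)}}{\partial y_b^{(j)}} = \sum_{v_z \in N(v_a), \, z > m} \tilde{a}_{az} \frac{\partial y_z^{(k-1)}}{\partial y_b^{(j)}}, \tag{14}$$

where $z > m$ means $v_z$ is unlabeled. To intuitively understand Eq. (14), one can imagine that we perform a random walk starting from node $v_a$ for one step, where the "transition probability" is the edge weights $\tilde{a}$, and all nodes in this random walk are restricted to unlabeled nodes only. Note that we can further decompose every $y_z^{(k-1)}$ in Eq. (14) in a way similar to what we do for $y_a^{(k)}$ in Eq. (13). So the expansion in Eq. (14) can be performed iteratively until the index $k$ decreases to $j$. This is equivalent to performing all possible random walks for $k - j$ steps starting from $v_a$, where all nodes but the last in the random walk are restricted to be unlabeled nodes:

$$\frac{\partial y_a^{(k)}}{\partial y_b^{(j)}} = \sum_{v_z \in V} \sum_{U_{k-j}^{a \to z}} \left( \prod_{i=k-j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} \right) \frac{\partial y_z^{(j)}}{\partial y_b^{(j)}}, \tag{15}$$

where $v_z$ in the first summation term is the end node of a random walk, $U_{k-j}^{a \to z}$ in the second summation term is an unlabeled-nodes-only path from $v_a$ to $v_z$ of length $k - j$, and the product term is the probability of a given path $U_{k-j}^{a \to z}$. Consider the last term $\partial y_z^{(j)} / \partial y_b^{(j)}$ in Eq. (15). We know that $\partial y_z^{(j)} / \partial y_b^{(j)} = 0$ for all $z \ne b$ and $\partial y_z^{(j)} / \partial y_b^{(j)} = 1$ for $z = b$, which means that only those random-walk paths that end exactly at $v_b$ (i.e., the end node $v_z$ is exactly $v_b$) count for the computation in Eq. (15). Therefore, we have

$$\frac{\partial y_a^{(k)}}{\partial y_b^{(j)}} = \sum_{U_{k-j}^{a \to b}} \prod_{i=k-j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}}, \tag{16}$$

where $U_{k-j}^{a \to b}$ is a path from $v_a$ to $v_b$ of length $k - j$ containing only unlabeled nodes except $v_b$. Substituting Eq. (16) into the right-hand side of Eq. (12), we obtain

$$I_l(v_a, v_b; k) = \sum_{j=0}^{k-1} \sum_{U_{k-j}^{a \to b}} \prod_{i=k-j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} = \sum_{j=1}^{k} \sum_{U_j^{a \to b}} \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}}.$$

Now Theorem 1 can be proved by combining Lemma 1 and Lemma 2:

Proof. Suppose that whether a node is labeled or not is independent of each other for the given graph. Then we have

$$\begin{aligned}
\mathbb{E}\left[ I_l(v_a, v_b; k) \right] &= \mathbb{E}\left[ \sum_{j=1}^{k} \sum_{U_j^{a \to b}} \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} \right] = \sum_{j=1}^{k} \mathbb{E}\left[ \sum_{U_j^{a \to b}} \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} \right] \\
&= \sum_{j=1}^{k} \sum_{P_j^{a \to b}} \Pr\left( P_j^{a \to b} \text{ is an unlabeled-nodes-only path} \right) \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} \\
&= \sum_{j=1}^{k} \sum_{P_j^{a \to b}} \beta^j \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}} = \sum_{j=1}^{k} \beta^j \, \tilde{I}_f(v_a, v_b; j).
\end{aligned}$$

B PROOF OF THEOREM 2

[...]

$$y_a^{(k)}[y_a] = \sum_{v_b : \, y_b = y_a} \sum_{j=1}^{k} \sum_{U_j^{a \to b}} \prod_{i=j}^{1} \tilde{a}_{v^{(i-1)}, v^{(i)}},$$

which equals $\sum_{v_b : \, y_b = y_a} I_l(v_a, v_b; k)$ according to Lemma 2. Therefore, we have

$$\Pr(\hat{y}_a = y_a) = \frac{y_a^{(k)}[y_a]}{\sum_{i \in L} y_a^{(k)}[i]} \propto y_a^{(k)}[y_a] = \sum_{v_b : \, y_b = y_a} I_l(v_a, v_b; k).$$

C PROOF OF THEOREM 3

In this proof we assume that the dimension of node representations is one, but note that the conclusion can be easily generalized to the case of multi-dimensional representations, since the function $D(x)$ can be decomposed into the sum of one-dimensional cases. In the rest of this proof, we still use the bold notations $x_i^{(k)}$ and $h_i^{(k)}$ to denote node representations, but keep in mind that they are scalars rather than vectors. We give two lemmas before proving Theorem 3. The first one is about the gradient of $D(x)$:

Lemma 3 $h_i^{(k)} = x_i^{(k)} - \dfrac{\partial D(x^{(k)})}{\partial x_i^{(k)}}$.

Proof.

$$x_i^{(k)} - \frac{\partial D(x^{(k)})}{\partial x_i^{(k)}} = x_i^{(k)} - \sum_{v_j \in N(v_i)} \tilde{a}_{ij} \left( x_i^{(k)} - x_j^{(k)} \right) = \sum_{v_j \in N(v_i)} \tilde{a}_{ij} \, x_j^{(k)} = h_i^{(k)}.$$

It is interesting to see from Lemma 3 that the aggregation step in GCN is equivalent to running gradient descent for one step with a step size of one. However, this alone does not guarantee that $D(h^{(k)}) \le D(x^{(k)})$, because the step size may be too large to reduce the value of $D$. The second lemma is about the Hessian of $D(x)$:

Lemma 4 $\nabla^2 D(x) \preceq 2I$, or equivalently, $2I - \nabla^2 D(x)$ is a positive semidefinite matrix.

Proof.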
We first calculate the Hessian of $D(\mathbf{x}) = \frac{1}{2} \sum_{v_i, v_j} \tilde{a}_{ij} \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2^2$:

$$\nabla^2 D(\mathbf{x}) = \begin{bmatrix} 1 - \tilde{a}_{11} & -\tilde{a}_{12} & \cdots & -\tilde{a}_{1n} \\ -\tilde{a}_{21} & 1 - \tilde{a}_{22} & \cdots & -\tilde{a}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -\tilde{a}_{n1} & -\tilde{a}_{n2} & \cdots & 1 - \tilde{a}_{nn} \end{bmatrix} = I - D^{-1}A.$$

Therefore, $2I - \nabla^2 D(\mathbf{x}) = I + D^{-1}A$. Since $D^{-1}A$ is a Markov matrix (i.e., each entry is nonnegative and the sum of each row is one), its eigenvalues are within the range $[-1, 1]$, so the eigenvalues of $I + D^{-1}A$ are within the range $[0, 2]$. Therefore, $I + D^{-1}A$ is a positive semidefinite matrix, and we have $\nabla^2 D(\mathbf{x}) \preceq 2I$.

We can now prove Theorem 3:

Proof. Since $D$ is a quadratic function, we perform a second-order Taylor expansion of $D$ around $\mathbf{x}^{(k)}$ and obtain the following inequality:

$$\begin{aligned}
D(\mathbf{h}^{(k)}) &= D(\mathbf{x}^{(k)}) + \nabla D(\mathbf{x}^{(k)})^\top (\mathbf{h}^{(k)} - \mathbf{x}^{(k)}) + \tfrac{1}{2} (\mathbf{h}^{(k)} - \mathbf{x}^{(k)})^\top \nabla^2 D(\mathbf{x}) (\mathbf{h}^{(k)} - \mathbf{x}^{(k)}) \\
&= D(\mathbf{x}^{(k)}) - \nabla D(\mathbf{x}^{(k)})^\top \nabla D(\mathbf{x}^{(k)}) + \tfrac{1}{2} \nabla D(\mathbf{x}^{(k)})^\top \nabla^2 D(\mathbf{x}) \nabla D(\mathbf{x}^{(k)}) \\
&\le D(\mathbf{x}^{(k)}) - \nabla D(\mathbf{x}^{(k)})^\top \nabla D(\mathbf{x}^{(k)}) + \nabla D(\mathbf{x}^{(k)})^\top \nabla D(\mathbf{x}^{(k)}) \\
&= D(\mathbf{x}^{(k)}).
\end{aligned}$$
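The inequality $D(\mathbf{h}^{(k)}) \le D(\mathbf{x}^{(k)})$ can be checked numerically on a small example. The following is a minimal sketch rather than the authors' code: it assumes scalar node features (as in the proof) and a ring graph with self-loops, chosen because every node then has the same degree and the row-normalized weights $\tilde{a}_{ij}$ are symmetric. The function names are hypothetical.

```python
import numpy as np


def ring_with_self_loops(n):
    """Row-normalized weights of an n-node ring in which every node also has a self-loop.

    Assumption for this sketch: a regular graph, so the normalized weights are symmetric.
    """
    adj = np.eye(n)
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[idx, (idx - 1) % n] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)


def smoothness(weights, x):
    """D(x) = 1/2 * sum_{i,j} a_ij * (x_i - x_j)^2 for scalar node features x."""
    diff = x[:, None] - x[None, :]
    return 0.5 * float(np.sum(weights * diff ** 2))


def aggregate(weights, x):
    """One aggregation step: h_i = sum_j a_ij * x_j."""
    return weights @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = ring_with_self_loops(8)
    x = rng.normal(size=8)
    for layer in range(5):
        h = aggregate(weights, x)
        # D should not increase from x to h at any layer.
        print(layer, smoothness(weights, x), smoothness(weights, h))
        x = h
```

Repeating the aggregation step, as in the loop above, shows the value of $D$ shrinking layer by layer on this toy graph.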

E DATASETS DETAILS

The statistics of all datasets are shown in Table 3.

F HYPER-PARAMETER SETTINGS

The detailed hyper-parameter settings for all datasets are listed in Table 4. In GCN-LPA, we use the same dimension for all hidden layers. Note that the number of GCN layers and the number of LPA iterations can differ, since GCN and LPA are implemented as two independent modules. We use grid search to determine hyper-parameters on Cora, and perform fine-tuning on other datasets, i.e., varying one hyper-parameter at a time to see if the performance can be further improved. The search spaces for hyper-parameters are as follows:

• Dimension of hidden layers: {8, 16, 32};
• # GCN layers: {1, 2, 3, 4, 5, 6};
• # LPA iterations: {1, 2, 3, 4, 5, 6, 7, 8, 9};
• L2 weight: {$10^{-7}$, $2 \times 10^{-7}$, $5 \times 10^{-7}$, $10^{-6}$, $2 \times 10^{-6}$, $5 \times 10^{-6}$, $10^{-5}$, $2 \times 10^{-5}$, $5 \times 10^{-5}$, $10^{-4}$, $2 \times 10^{-4}$, $5 \times 10^{-4}$, $10^{-3}$};
• LPA weight (λ): {0, 1, 2, 5, 10, 15, 20};



There are also methods in statistical relational learning (Rossi et al., 2012) that use feature propagation/diffusion techniques. In this work, we focus on GCN, but the analysis and the proposed model can be easily generalized to other feature diffusion methods.

Here the optimal edge weights $A^*$ share the same topology as the original graph $G$, i.e., we do not add or remove edges from $G$ but only learn the weights of existing edges. See the end of this subsection for more discussion.

CONNECTION TO EXISTING WORK

Edge weights play a key role in graph-based machine learning algorithms. In this section, we discuss three lines of related work that learn edge weights adaptively.

Our method can be easily generalized to the inductive setting if implemented using minibatch training like GraphSAGE (Hamilton et al., 2017).



with respect to $y_b$: $I_l(v_a, v_b; k) = \partial y_a^{(k)} / \partial y_b$.

(a) A graph with two classes of nodes (b) Potential intra-class edges (bold links)

Figure 2: Node embeddings of Zachary's karate club network trained on a node classification task (red vs. blue). Figure 2a visualizes the graph. Node coordinates in Figure 2b-2e are the embedding coordinates. Notice that GCN does not produce linearly separable embeddings (Figure 2b vs. Figure 2c), while GCN-LPA performs much better even in the presence of noisy edges (Figure 2d vs. Figure 2e). Additional visualizations are included in Appendix D.

Figure 3: Sensitivity to the number of LPA iterations on Citeseer dataset.

Figure 4: Sensitivity to λ (weight of LPA loss) on Citeseer dataset.

Figure 5: Training time per epoch on random graphs.

Figure 6: Visualization of learned edge weights in Coauthor-CS dataset.

Visualization of Learned Edge Weights. To intuitively understand what our model learns about edge weights, we split nodes in the Coauthor-CS dataset into 15 groups according to their labels, and calculate the average weights of edges connecting every pair of node groups as well as the average weights of edges within every group. The results are shown in Figure 6, where darker color indicates higher average edge weights. Values along the diagonal (intra-class edge weights) are in general significantly larger than off-diagonal values (inter-class edge weights), which demonstrates that GCN-LPA is able to identify the importance of edges and distinguish inter-class and intra-class edges. The visualization results are similar for other datasets.

Time Complexity. We study the training time of GCN-LPA on random graphs. We use the one-hot identity vector as feature and 0 as label for each node. The sizes of the training set and validation set are 100 and 200, respectively, while the rest is the test set. The average number of neighbors for each node is set to 5, and the number of nodes is varied from one thousand to one million. We run GCN-LPA and GCN for 100 epochs on a Microsoft Azure virtual machine with 1 NVIDIA Tesla M60 GPU, 12 Intel Xeon CPUs (E5-2690 v3 @2.60GHz), and 128GB of RAM, using the same hyper-parameter setting as in Cora. The training time per epoch of GCN-LPA and GCN is presented in Figure 5. Our result shows that GCN-LPA requires only 9.2% extra training time on average compared to GCN.
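The group-wise averaging behind Figure 6 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the edge-list layout (one row per directed edge), and the toy inputs are assumptions made for the example.

```python
import numpy as np


def mean_weight_by_class_pair(edges, weights, labels, num_classes):
    """Average edge weight for every (source class, target class) pair.

    edges:   integer array of shape (num_edges, 2), one row per directed edge
    weights: float array of shape (num_edges,), the learned weight of each edge
    labels:  integer array of shape (num_nodes,), the class of each node
    Class pairs without any edge are reported as NaN.
    """
    sums = np.zeros((num_classes, num_classes))
    counts = np.zeros((num_classes, num_classes))
    src = labels[edges[:, 0]]
    dst = labels[edges[:, 1]]
    # np.add.at accumulates correctly even when a class pair occurs many times.
    np.add.at(sums, (src, dst), weights)
    np.add.at(counts, (src, dst), 1.0)
    out = np.full((num_classes, num_classes), np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1])
    edges = np.array([[0, 1], [2, 3], [1, 2]])
    weights = np.array([0.9, 0.7, 0.1])
    print(mean_weight_by_class_pair(edges, weights, labels, 2))
```

In the resulting matrix, diagonal entries correspond to intra-class edges and off-diagonal entries to inter-class edges, which is the comparison the heat map is meant to convey.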

Figure 7: An illustrative example of label propagation in LPA. Suppose labels are propagated for three iterations, and no self-loop exists. Blue nodes are labeled while white nodes are unlabeled. (a) $v_a$'s label propagates to $v_1$ (yellow arrows). Note that the propagation of $v_a$'s label to $v_3$ is cut off since $v_3$ is labeled and thus absorbs $v_a$'s label. (b) $v_a$'s label that propagated to $v_1$ further propagates to $v_2$ and $v_b$ (yellow arrows). Meanwhile, $v_a$'s label is reset to its initial value and then propagates from $v_a$ again (green arrows). (c) Label propagation in iteration 3. Purple arrows denote the propagation of $v_a$'s label starting from $v_a$ for the third time. (d) All possible paths of length no more than three from $v_a$ to $v_b$ containing unlabeled nodes only. Note that there is no path of length one from $v_a$ to $v_b$.
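The random-walk picture in Figure 7 can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a single scalar label dimension, a small 3-regular graph (so the row-normalized weights are symmetric), and hypothetical function names. Here `a` is the unlabeled node whose value is read after $k$ iterations and `b` is the labeled node whose initial label is the source of influence. It compares the label influence obtained by running LPA directly with the sum over walks of length at most $k$ whose nodes, apart from the labeled end node, are all unlabeled.

```python
import numpy as np


def lpa_influence(weights, labeled, a, b, k):
    """Value at node a after k LPA iterations when only node b starts with a unit label.

    LPA is linear in the initial labels, so this value equals the derivative of
    y_a^(k) with respect to y_b. Assumes b is labeled and a is unlabeled.
    """
    n = weights.shape[0]
    y0 = np.zeros(n)
    y0[b] = 1.0
    y = y0.copy()
    for _ in range(k):
        y = weights @ y            # propagate along edges
        y[labeled] = y0[labeled]   # reset labeled nodes to their initial values
    return float(y[a])


def walk_sum(weights, labeled, a, b, k):
    """Sum of walk probabilities over walks from a to b of length 1..k
    in which every node except the end node b is unlabeled."""
    labeled_set = set(labeled)
    total = 0.0

    def extend(node, prob, steps):
        nonlocal total
        if steps == k:
            return
        for nxt in np.flatnonzero(weights[node]):
            nxt = int(nxt)
            p = prob * weights[node, nxt]
            if nxt == b:
                total += p                     # the walk ends at the labeled node b
            elif nxt not in labeled_set:
                extend(nxt, p, steps + 1)      # continue through unlabeled nodes only

    extend(a, 1.0, 0)
    return total


def example_graph():
    """6-node ring with chords (i, i+3); every node has degree 3."""
    n = 6
    adj = np.zeros((n, n))
    for i in range(n):
        for j in ((i + 1) % n, (i + 3) % n):
            adj[i, j] = adj[j, i] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    w = example_graph()
    for k in range(1, 5):
        print(k, lpa_influence(w, [0, 3], 2, 0, k), walk_sum(w, [0, 3], 2, 0, k))
```

The two columns printed by the loop agree for every $k$ on this graph, which is the content of the path-sum expression for $I_l(v_a, v_b; k)$.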

derivatives w.r.t. $y_b^{(j)}$ are zero. So we only need to consider the terms where $v_z$ is an unlabeled node: ∂y

PROOF OF THEOREM 2

Proof. Denote the set of labels as $L$. Since different label dimensions in $y_a^{(\cdot)}$ do not interact with each other when running LPA, the value of the $y_a$-th dimension in $y_a^{(\cdot)}$ (denoted by $y_a^{(\cdot)}[y_a]$) comes only from the nodes with initial label $y_a$. It is clear that

Figure 8 provides additional visualizations of GCN and GCN-LPA on the karate club network. In each subfigure, we vary the number of layers from 1 to 4 to examine how the learned representations evolve. The initial node features are one-hot identity vectors, and the dimension of the hidden layers and the output layer is 2. The transformation matrices are uniformly initialized within the range [-1, 1]. We use the sigmoid function as the nonlinear activation function. Comparing the four figures in each row, we conclude that the aggregation step and transformation step in GCN and GCN-LPA do benefit the separation of different classes. Comparing Figure 8a and 8c (or Figure 8b and 8d), we conclude that more inter-class edges make the separation harder for GCN (or GCN-LPA). Comparing Figure 8a and 8b (or Figure 8c and 8d), we conclude that GCN-LPA is more noise-resistant than GCN; therefore, GCN-LPA can better differentiate classes and identify clustering substructures.

Figure 8: Visualization of GCN and GCN-LPA with 1 ∼ 4 layers on karate club network.

Coauthor-CS and Coauthor-Phy, where nodes are authors and an edge indicates that two authors co-authored a paper. Node features represent paper keywords for each author's papers, and class labels indicate most active fields of



Accuracy of LPA, GCN, and GCN-LPA on Citeseer with different labeled node rate.

Statistics for all datasets.

Hyper-parameter settings for all datasets.


• Dropout rate: {0, 0.1, 0.2, 0.3, 0.4, 0.5};
• Learning rate: {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.
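The search spaces listed in the hyper-parameter settings can be written down as a small configuration, together with the two search strategies mentioned there (a full grid, and varying one hyper-parameter at a time). This is a sketch only: the dictionary keys and function names are hypothetical, and the base configuration in the usage example is a placeholder rather than the setting reported in Table 4.

```python
from itertools import product
from math import prod

# Search spaces as listed in the text; the key names are illustrative.
SEARCH_SPACE = {
    "hidden_dim": [8, 16, 32],
    "gcn_layers": [1, 2, 3, 4, 5, 6],
    "lpa_iterations": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "l2_weight": [1e-7, 2e-7, 5e-7, 1e-6, 2e-6, 5e-6,
                  1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3],
    "lpa_weight": [0, 1, 2, 5, 10, 15, 20],
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [0.01, 0.02, 0.05, 0.1, 0.2, 0.5],
}


def grid_size(space):
    """Number of configurations in the full grid."""
    return prod(len(values) for values in space.values())


def full_grid(space):
    """Yield every configuration in the full grid, lazily."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def one_at_a_time(base, space):
    """Yield configurations that differ from `base` in exactly one hyper-parameter."""
    for key, values in space.items():
        for value in values:
            if value != base[key]:
                yield {**base, key: value}


if __name__ == "__main__":
    # Placeholder base configuration (first value of each list), not the paper's setting.
    base = {key: values[0] for key, values in SEARCH_SPACE.items()}
    print(grid_size(SEARCH_SPACE))
    print(sum(1 for _ in one_at_a_time(base, SEARCH_SPACE)))
```

The one-at-a-time variant visits far fewer configurations than the full grid, which is consistent with using it only for fine-tuning on the remaining datasets.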

