UNIFYING GRAPH CONVOLUTIONAL NEURAL NETWORKS AND LABEL PROPAGATION

Abstract

Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message-passing algorithms on graphs. Both solve the task of node classification, but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. Yet, while conceptually similar, it is unclear how LPA and GCN can be combined under a unified framework to improve node classification. Here we study the relationship between LPA and GCN in terms of feature/label influence, in which we characterize how much the initial feature/label of one node influences the final feature/label of another node in GCN/LPA. Based on our theoretical analysis, we propose an end-to-end model that combines GCN and LPA. In our unified model, edge weights are learnable, and LPA serves as regularization that assists the GCN in learning proper edge weights, leading to improved classification performance. Our model can also be seen as learning edge weights based on node labels, which is more task-oriented than existing feature-based attention models and topology-based diffusion models. In a number of experiments on real-world graphs, our model outperforms state-of-the-art graph neural networks in terms of node classification accuracy.

1. INTRODUCTION

Consider the problem of node classification in a graph, where the goal is to learn a mapping M : V → L from the node set V to the label set L. A solution to this problem is widely applicable to various scenarios, e.g., inferring the income of users in a social network or classifying scientific articles in a citation network. Different from a generic machine learning problem where samples are independent of each other, nodes in a graph are connected by edges, which provide additional information and require more delicate modeling. To capture this graph information, researchers have mainly designed models based on the assumption that labels/features are correlated over the edges of the graph. In particular, on the label side L, node labels are propagated and aggregated along edges in the graph, which is known as the Label Propagation Algorithm (LPA) (Zhu et al., 2005; Zhou et al., 2004; Zhang & Lee, 2007; Wang & Zhang, 2008; Karasuyama & Mamitsuka, 2013; Gong et al., 2017; Liu et al., 2019a); on the node side V, node features are propagated along edges and transformed through neural network layers, which is known as Graph Convolutional Neural Networks (GCN)¹ (Kipf & Welling, 2017; Hamilton et al., 2017; Li et al., 2018; Xu et al., 2018; Liao et al., 2019; Xu et al., 2019b; Qu et al., 2019). GCN and LPA are related in that they propagate features and labels on the two sides of the mapping M, respectively. Prior work (Li et al., 2019) has shown a relationship between GCN and LPA in terms of low-pass graph filtering. However, it is unclear how the discovered relationship benefits node classification. Specifically, can GCN and LPA be combined to develop a more accurate model for node classification in graphs?
Here we study the theoretical relationship between GCN and LPA from the viewpoint of feature/label influence, where we quantify how much the initial feature/label of node v_b influences the output feature/label of node v_a in GCN/LPA by studying the Jacobian/gradient of node v_b with respect to node v_a. We also prove a quantitative relationship between feature influence and label influence, i.e., the label influence of v_b on v_a equals the cumulative discounted feature influence of v_b on v_a in expectation (Theorem 1). Based on this theoretical analysis, we propose a unified model, GCN-LPA, for node classification. We show that the key to improving the performance of GCN is to enable nodes of the same class to connect more strongly with each other by making edge weights/strengths trainable. We then prove that increasing the strength of edges between nodes of the same class is equivalent to increasing the accuracy of LPA's predictions (Theorem 2). Therefore, we can first learn the optimal edge weights by minimizing the loss of LPA's predictions, and then plug the optimal edge weights into a GCN to learn node representations. In GCN-LPA, we further combine these two steps and train the whole model in an end-to-end fashion, where the LPA part serves as regularization that assists the GCN part in learning edge weights that benefit the separation of different node classes. It is worth noting that GCN-LPA can also be seen as learning edge weights based on node label information, which requires less handcrafting and is more task-oriented than existing attention models that learn edge weights based on node feature similarity (Veličković et al., 2018; Thekumparampil et al., 2018; Zhang et al., 2018; Liu et al., 2019b) or diffusion models that learn the adjacency matrix based on graph topology (Klicpera et al., 2019a; Xu et al., 2019a; Abu-El-Haija et al., 2019; Klicpera et al., 2019b).
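To make the combined objective concrete, the following is a minimal NumPy forward-pass sketch of a loss of the form L_gcn + λ·L_lpa. All function and variable names are ours, the single linear GCN layer and the unclamped LPA regularizer are simplifications, and in the full model the edge weights in `A_w` would be trained by backpropagation rather than held fixed:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred, labels):
    # mean negative log-likelihood of the true labels
    return -np.mean(np.log(pred[np.arange(len(labels)), labels] + 1e-12))

def gcn_lpa_loss(A_w, X, W, labels, labeled_idx, lam=1.0, lpa_iters=10):
    """Forward pass of a unified objective  L = L_gcn + lam * L_lpa.

    A_w: weighted adjacency matrix built from the (trainable) edge weights.
    W:   weight matrix of a single linear GCN layer (a simplification; the
         real model stacks layers with nonlinearities and self-loops).
    """
    d = A_w.sum(axis=1)
    d[d == 0] = 1.0                    # guard isolated nodes
    A_rw = A_w / d[:, None]            # D^{-1} A with the learned weights

    # GCN branch: propagate + transform node features, then classify
    gcn_pred = softmax(A_rw @ X @ W)
    loss_gcn = cross_entropy(gcn_pred[labeled_idx], labels)

    # LPA branch: propagate the training labels with the same edge weights
    # and score how well propagation recovers them (the regularizer)
    n, c = A_w.shape[0], W.shape[1]
    Y = np.zeros((n, c))
    Y[labeled_idx, labels] = 1.0
    for _ in range(lpa_iters):
        Y = A_rw @ Y
    Y = Y / (Y.sum(axis=1, keepdims=True) + 1e-12)
    loss_lpa = cross_entropy(Y[labeled_idx], labels)

    return loss_gcn + lam * loss_lpa
```

Because both branches share `A_w`, gradients of the LPA term push edge weights toward values that keep same-class nodes strongly connected, which is exactly the regularization effect described above.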
We conduct extensive experiments on five datasets, and the results indicate that our model outperforms state-of-the-art graph neural networks in terms of classification accuracy. The experimental results also show that combining GCN and LPA learns more informative edge weights, thereby leading to better performance.

2. OUR APPROACH

In this section, we first formulate the node classification problem and briefly introduce LPA and GCN. We then prove their relationship from the viewpoints of feature influence and label influence. Based on these theoretical findings, we propose the unified model GCN-LPA and analyze why it is theoretically superior to vanilla GCN.

2.1. PROBLEM FORMULATION AND PRELIMINARIES

Consider a graph G = (V, A, X, Y), where V = {v_1, ..., v_n} is the set of nodes, A ∈ R^{n×n} is the adjacency matrix, X is the node feature matrix, and Y contains the node labels. a_ij (the ij-th entry of A) is the weight of the edge connecting v_i and v_j, and N(v) denotes the set of first-order neighbors of node v in G. Each node v_i has a feature vector x_i, which is the i-th row of X, while only the first m nodes (m ≪ n) have labels y_1, ..., y_m from a label set L = {1, ..., c}. The goal is to learn a mapping M : V → L and predict the labels of unlabeled nodes.

Label Propagation Algorithm. LPA (Zhu et al., 2005) assumes that two connected nodes are likely to have the same label, and thus propagates labels iteratively along the edges. Let Y^{(k)} = [y_1^{(k)}, ..., y_n^{(k)}] ∈ R^{n×c} be the soft label matrix in iteration k > 0, in which the i-th row y_i^{(k)} denotes the predicted label distribution of node v_i in iteration k. When k = 0, the initial label matrix Y^{(0)} = [y_1^{(0)}, ..., y_n^{(0)}] consists of one-hot label indicator vectors y_i^{(0)} for i = 1, ..., m (i.e., labeled nodes) or zero vectors otherwise (i.e., unlabeled nodes). LPA in iteration k is then formulated as the following two steps:

Y^{(k+1)} = Ã Y^{(k)},    (1)
y_i^{(k+1)} = y_i^{(0)},  ∀ i ≤ m.    (2)

Here Ã is the normalized adjacency matrix, which can be the random-walk transition matrix Ã_rw = D^{-1} A or the symmetric transition matrix Ã_sym = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A with entries d_ii = Σ_j a_ij. Without loss of generality, we use Ã = Ã_rw in this work. In Eq. (1), all nodes propagate labels to their neighbors according to the normalized edge weights; then in Eq. (2), the labels of all labeled nodes are reset to their initial values, so that the ground-truth labels persist as the source of propagation.
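The two steps above can be sketched in a few lines of NumPy (a minimal illustration with our own function names, not the paper's implementation):

```python
import numpy as np

def label_propagation(A, Y0, labeled_idx, num_iters=50):
    """Iterative two-step label propagation (Zhu et al., 2005).

    A:           (n, n) adjacency matrix with non-negative edge weights.
    Y0:          (n, c) initial label matrix -- one-hot rows for the m
                 labeled nodes, zero rows for unlabeled nodes.
    labeled_idx: indices of the labeled nodes, which are clamped to
                 their initial labels after every iteration.
    """
    # Random-walk normalization: A_rw = D^{-1} A
    d = A.sum(axis=1)
    d[d == 0] = 1.0                       # guard isolated nodes
    A_rw = A / d[:, None]

    Y = Y0.copy()
    for _ in range(num_iters):
        Y = A_rw @ Y                      # Eq. (1): propagate along edges
        Y[labeled_idx] = Y0[labeled_idx]  # Eq. (2): reset labeled nodes
    return Y
```

For example, on a 4-node path graph whose two endpoints carry different labels, the two interior nodes converge to the distributions [2/3, 1/3] and [1/3, 2/3], matching their proximity to each labeled endpoint.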



¹ There are also methods in statistical relational learning (Rossi et al., 2012) that use feature propagation/diffusion techniques. In this work we focus on GCN, but the analysis and the proposed model can be easily generalized to other feature-diffusion methods.

