MASKED LABEL PREDICTION: UNIFIED MESSAGE PASSING MODEL FOR SEMI-SUPERVISED CLASSIFICATION

Abstract

Graph neural networks (GNNs) and label propagation algorithms (LPAs) are both message passing algorithms, and both have achieved superior performance in semi-supervised classification. GNNs perform feature propagation through a neural network to make predictions, while LPAs propagate labels across the graph adjacency matrix to obtain results. However, there is still no effective way to combine these two kinds of algorithms. In this paper, we propose a new Unified Message Passing Model (UniMP) that incorporates feature propagation and label propagation within a shared message passing network, providing better performance in semi-supervised classification. First, we adopt a Graph Transformer together with label embedding to propagate both feature and label information. Second, to train UniMP without overfitting to self-loop label information, we propose a masked label prediction strategy, in which some percentage of training labels are simply masked at random and then predicted. UniMP conceptually unifies feature propagation and label propagation and is empirically powerful. It obtains new state-of-the-art semi-supervised classification results on the Open Graph Benchmark (OGB).

1. INTRODUCTION

There are various scenarios in the world, e.g., recommending related news and products, discovering new drugs, or predicting social relations, that can be described as graph structures. Many methods have been proposed to optimize these graph-based problems and have achieved significant success in related domains such as predicting the properties of nodes (Yang et al., 2016; Kipf & Welling, 2016), links (Grover & Leskovec, 2016; Battaglia et al., 2018), and graphs (Duvenaud et al., 2015; Niepert et al., 2016; Bojchevski et al., 2018). In the task of semi-supervised node classification, we are required to learn from labeled examples and then make predictions for unlabeled ones. To better classify the labels of nodes in a graph, message passing models were proposed, based on the Laplacian smoothing assumption (Li et al., 2018; Xu et al., 2018b): each node aggregates information from its connected neighbors, acquiring enough evidence to produce a more robust prediction for unlabeled nodes. Generally, there are two practical families of message passing models: Graph Neural Networks (GNNs) (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2018b; Liao et al., 2019; Xu et al., 2018a; Qu et al., 2019) and Label Propagation Algorithms (LPAs) (Zhu, 2005; Zhu et al., 2003; Zhang & Lee, 2007; Wang & Zhang, 2007; Karasuyama & Mamitsuka, 2013; Gong et al., 2016; Liu et al., 2019). GNNs exploit the graph structure by propagating and aggregating node features through several neural layers, obtaining predictions from feature propagation, while LPAs make predictions for unlabeled instances by propagating labels iteratively. Since GNNs and LPAs are based on the same assumption, making semi-supervised classifications by information propagation, it is intuitive to combine them to boost performance, and several studies have proposed graph models based on this idea.
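To make the LPA side of this picture concrete, the iterative clamped label propagation described above can be sketched as follows. This is a minimal illustration, not this paper's model: the toy chain graph, the damping factor alpha, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def label_propagation(A, Y, train_mask, alpha=0.9, iters=50):
    """Propagate one-hot labels Y along the row-normalized adjacency D^{-1}A.

    A          : (n, n) adjacency matrix
    Y          : (n, c) one-hot labels; rows of unlabeled nodes are all zero
    train_mask : (n,) boolean mask of labeled nodes
    """
    d = A.sum(axis=1, keepdims=True)
    A_norm = A / np.clip(d, 1e-12, None)     # row-normalized adjacency D^{-1}A
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (A_norm @ F) + (1 - alpha) * Y
        F[train_mask] = Y[train_mask]        # clamp labeled nodes to ground truth
    return F.argmax(axis=1)

# Toy chain graph 0-1-2-3: node 0 is labeled class 0, node 3 is labeled class 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1.0
Y[3, 1] = 1.0
train_mask = np.array([True, False, False, True])
preds = label_propagation(A, Y, train_mask)  # nodes near 0 -> class 0, near 3 -> class 1
```

On this toy chain, the unlabeled middle nodes inherit the class of whichever labeled endpoint is closer, which is exactly the transductive behavior LPA relies on.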
For example, APPNP (Klicpera et al., 2019) and TPN (Liu et al., 2019) integrate GNN and LPA in their model designs. To unify feature propagation and label propagation, two main issues need to be addressed. Aggregating feature and label information: node features are represented by embeddings, while node labels are one-hot vectors, so the two do not lie in the same vector space. Moreover, their message passing mechanisms differ: GNNs can propagate information through diverse neural structures such as GraphSAGE (Hamilton et al., 2017), GCN (Kipf & Welling, 2016), and GAT (Veličković et al., 2017), whereas LPAs can only pass label messages through the graph adjacency matrix. Supervised training: training a model with both feature and label propagation inevitably overfits to self-loop label information, which leaks labels at training time and causes poor performance at prediction time. In this work, inspired by several advances (Vaswani et al., 2017; Wang et al., 2018; Devlin et al., 2018) in Natural Language Processing (NLP), we propose a new Unified Message Passing model (UniMP) with masked label prediction that settles the aforementioned issues. UniMP is a multi-layer Graph Transformer that uses label embedding to transform node labels into the same vector space as node features. It propagates node features like previous attention-based GNNs (Veličković et al., 2017; Zhang et al., 2018), while its multi-head attentions serve as the transition matrices for propagating label vectors. Therefore, each node can aggregate both feature and label information from its neighbors. To train UniMP without overfitting to self-loop label information, we draw lessons from masked word prediction in BERT (Devlin et al., 2018) and propose a masked label prediction strategy, which randomly masks some training instances' label embedding vectors and then predicts them.
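The masking step of this strategy can be sketched in isolation. This is a hedged illustration of the idea only: the 30% mask rate, the helper name `mask_labels`, and the toy label matrix are assumptions for demonstration, and the Graph Transformer that would consume `Y_input` is omitted.

```python
import numpy as np

def mask_labels(Y, train_mask, mask_rate=0.3, seed=0):
    """Randomly hide a fraction of the training labels.

    Returns (Y_input, pred_mask): Y_input is the label matrix fed to the
    model, with masked rows zeroed out; pred_mask marks the nodes whose
    hidden labels the model is trained to predict.
    """
    rng = np.random.default_rng(seed)
    train_idx = np.flatnonzero(train_mask)
    n_masked = int(round(len(train_idx) * mask_rate))
    masked_idx = rng.choice(train_idx, size=n_masked, replace=False)
    Y_input = Y.astype(float).copy()
    Y_input[masked_idx] = 0.0                # hide these labels from the model input
    pred_mask = np.zeros(len(Y), dtype=bool)
    pred_mask[masked_idx] = True
    return Y_input, pred_mask

# 10 training nodes among 12; 30% of their labels are hidden each training step.
Y = np.eye(3)[np.arange(12) % 3]             # one-hot labels, 3 classes
train_mask = np.array([True] * 10 + [False] * 2)
Y_input, pred_mask = mask_labels(Y, train_mask)
```

The loss would then be computed only on `pred_mask` nodes, so the model must predict labels it cannot see in its own input, mirroring the masked word prediction setup in BERT.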
This training method closely simulates the procedure of transducing label information from labeled to unlabeled examples in the graph. We conduct experiments on three semi-supervised classification datasets from the Open Graph Benchmark (OGB), where our method achieves new state-of-the-art results on all tasks: 82.56% ACC on ogbn-products, 86.42% ROC-AUC on ogbn-proteins, and 73.11% ACC on ogbn-arxiv. We also conduct ablation studies with different model inputs to demonstrate the effectiveness of our unified method. In addition, we provide a thorough analysis of how label propagation boosts our model's performance.

2. METHOD

We first introduce our graph notation. We denote a graph as G = (V, E), where V denotes the nodes with |V| = n and E denotes the edges with |E| = m. The nodes are described by a feature matrix X ∈ R^{n×f}, whose rows are usually dense f-dimensional vectors, and by a target class matrix Y ∈ R^{n×c}, where c is the number of classes. The adjacency matrix A = [a_{i,j}] ∈ R^{n×n} describes the graph G, and the diagonal degree matrix is denoted by D = diag(d_1, d_2, ..., d_n), where d_i = Σ_j a_{i,j} is the degree of node i. A normalized adjacency matrix is defined as either D^{-1}A or D^{-1/2}AD^{-1/2}; we adopt the first definition in this paper.
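The two normalizations above can be computed directly; the toy adjacency matrix here is chosen only for illustration.

```python
import numpy as np

# Toy undirected graph: node 0 connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)

d = A.sum(axis=1)                        # node degrees d_i = sum_j a_{i,j}
A_rw = np.diag(1.0 / d) @ A              # D^{-1} A: row-stochastic (used in this paper)
A_sym = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)  # D^{-1/2} A D^{-1/2}: symmetric
```

Each row of `A_rw` sums to 1, so propagating along it averages over neighbors, while `A_sym` preserves symmetry, which is the property some spectral GNN variants rely on.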

2.1. FEATURE PROPAGATION AND LABEL PROPAGATION

In semi-supervised node classification, based on the Laplacian smoothing assumption, a GNN transforms and propagates the node features X across the graph through several layers, including linear layers and nonlinear activations, to build an approximation of the mapping X → Y.
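A single feature-propagation layer of this kind can be sketched as a generic GCN-style layer using the paper's D^{-1}A normalization. The dimensions, random weights, and ReLU choice are illustrative assumptions, not the specific architecture proposed here.

```python
import numpy as np

def feature_propagation_layer(A, H, W):
    """One GCN-style layer: aggregate neighbor features along D^{-1}A,
    apply a linear transform W, then a ReLU nonlinearity."""
    d = A.sum(axis=1, keepdims=True)
    A_norm = A / np.clip(d, 1e-12, None)   # row-normalized adjacency D^{-1}A
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)     # toy triangle graph
H = rng.standard_normal((3, 4))            # n=3 nodes, f=4 input features
W = rng.standard_normal((4, 2))            # project to 2 hidden dimensions
H_next = feature_propagation_layer(A, H, W)
```

Stacking several such layers and ending with a softmax classifier yields the usual GNN approximation of X → Y; UniMP replaces the fixed aggregation with learned multi-head attention.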



These approaches integrate GNN and LPA by concatenating them together, and GCN-LPA (Wang & Leskovec, 2019) uses LPA to regularize its GCN model. However, as shown in Table 1, the aforementioned methods still cannot incorporate GNN and LPA within a single message passing model that propagates both features and labels during both training and prediction.

Table 1: Comparison between message passing models.

