MASKED LABEL PREDICTION: UNIFIED MESSAGE PASSING MODEL FOR SEMI-SUPERVISED CLASSIFICATION

Abstract

Graph neural networks (GNNs) and label propagation algorithms (LPAs) are both message passing algorithms that have achieved superior performance in semi-supervised classification. GNNs perform feature propagation through a neural network to make predictions, while LPAs propagate labels across the graph adjacency matrix to obtain results. However, there is still no effective way to combine these two kinds of algorithms. In this paper, we propose a new Unified Message Passing Model (UniMP) that incorporates feature propagation and label propagation with a shared message passing network, providing better performance in semi-supervised classification. First, we adopt a Graph Transformer with joint label embedding to propagate both feature and label information. Second, to train UniMP without overfitting on self-loop label information, we propose a masked label prediction strategy, in which some percentage of training labels is masked at random and then predicted. UniMP conceptually unifies feature propagation and label propagation and is empirically powerful. It obtains new state-of-the-art semi-supervised classification results on the Open Graph Benchmark (OGB).

1. INTRODUCTION

There are various scenarios in the world, e.g., recommending related news and products, discovering new drugs, or predicting social relations, that can be described as graph structures. Many methods have been proposed to optimize these graph-based problems and have achieved significant success in related domains such as predicting the properties of nodes (Yang et al., 2016; Kipf & Welling, 2016), links (Grover & Leskovec, 2016; Battaglia et al., 2018), and graphs (Duvenaud et al., 2015; Niepert et al., 2016; Bojchevski et al., 2018). In the task of semi-supervised node classification, we are required to learn from labeled examples and then make predictions for the unlabeled ones. To better classify node labels in the graph, message passing models based on the Laplacian smoothing assumption (Li et al., 2018; Xu et al., 2018b) were proposed to aggregate information from a node's connected neighbors, acquiring enough evidence to produce a more robust prediction for unlabeled nodes. Generally, there are two practical families of message passing models: Graph Neural Networks (GNNs) (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2018b; Liao et al., 2019; Xu et al., 2018a; Qu et al., 2019) and Label Propagation Algorithms (LPAs) (Zhu, 2005; Zhu et al., 2003; Zhang & Lee, 2007; Wang & Zhang, 2007; Karasuyama & Mamitsuka, 2013; Gong et al., 2016; Liu et al., 2019). GNNs exploit graph structure by propagating and aggregating node features through several neural layers, obtaining predictions from feature propagation, while LPAs make predictions for unlabeled instances by iterative label propagation. Since GNN and LPA are based on the same assumption, making semi-supervised classifications by information propagation, it is intuitive to incorporate them together to boost performance, and several strong studies have proposed graph models based on this idea.
For example, APPNP (Klicpera et al., 2019) and TPN (Liu et al., 2019) integrate GNN and LPA by concatenating them together, and GCN-LPA (Wang & Leskovec, 2019) uses LPA to regularize its GCN model. However, as shown in Table 1, the aforementioned methods still cannot incorporate GNN and LPA within a single message passing model that propagates both features and labels during training and prediction. To unify feature and label propagation, two main issues need to be addressed. Aggregating feature and label information: node features are represented by embeddings, while node labels are one-hot vectors, so they do not lie in the same vector space. In addition, their message passing schemes differ: GNNs can propagate information through diverse neural structures like GraphSAGE (Hamilton et al., 2017), GCN (Kipf & Welling, 2016), and GAT (Veličković et al., 2017), but LPAs can only pass label messages through the graph adjacency matrix. Supervised training: training a model with both feature and label propagation will inevitably overfit on self-loop label information, causing label leakage at training time and poor performance at prediction time. In this work, inspired by recent advances (Vaswani et al., 2017; Wang et al., 2018; Devlin et al., 2018) in Natural Language Processing (NLP), we propose a new Unified Message Passing model (UniMP) with masked label prediction that settles the aforementioned issues. UniMP is a multi-layer Graph Transformer that uses label embedding to transform node labels into the same vector space as node features. It propagates node features like previous attention-based GNNs (Veličković et al., 2017; Zhang et al., 2018), while its multi-head attentions serve as the transition matrices for propagating label vectors. Therefore, each node can aggregate both feature and label information from its neighbors.
To train UniMP with supervision without overfitting on self-loop label information, we draw lessons from masked word prediction in BERT (Devlin et al., 2018) and propose a masked label prediction strategy, which randomly masks some training instances' label embedding vectors and then predicts them. This training method closely simulates the procedure of transducing label information from labeled to unlabeled examples in the graph. We conduct experiments on three semi-supervised classification datasets from the Open Graph Benchmark (OGB), where our method achieves new state-of-the-art results in all tasks: 82.56% ACC on ogbn-products, 86.42% ROC-AUC on ogbn-proteins, and 73.11% ACC on ogbn-arxiv. We also conduct ablation studies on models with different inputs to demonstrate the effectiveness of our unified method, and we provide a thorough analysis of how label propagation boosts our model's performance.

2. METHOD

We first introduce our graph notation. We denote a graph as G = (V, E), where V denotes the nodes in the graph with |V| = n and E denotes the edges with |E| = m. The nodes are described by a feature matrix X ∈ R^{n×f}, usually dense vectors of dimension f, and a target class matrix Y ∈ R^{n×c}, where c is the number of classes. The adjacency matrix A = [a_{i,j}] ∈ R^{n×n} describes the graph G, and the diagonal degree matrix is denoted by D = diag(d_1, d_2, ..., d_n), where d_i = Σ_j a_{ij} is the degree of node i. A normalized adjacency matrix is defined as D^{-1}A or D^{-1/2}AD^{-1/2}; we adopt the first definition in this paper.
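To make the notation concrete, here is a minimal NumPy sketch that builds the row-normalized adjacency matrix D^{-1}A for a small illustrative graph (the edge list and node count are made up for demonstration, not taken from the paper):

```python
import numpy as np

# Toy 4-node undirected graph given as an edge list (illustrative only).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Degree vector d_i = sum_j a_ij and the row-normalized adjacency D^{-1} A.
d = A.sum(axis=1)
A_norm = np.diag(1.0 / d) @ A

# Each row of D^{-1} A sums to 1, so it acts as an averaging/transition matrix.
assert np.allclose(A_norm.sum(axis=1), 1.0)
```

Row-stochasticity is the property both propagation schemes below rely on: multiplying by D^{-1}A replaces each node's row with the average of its neighbors' rows.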

2.1. FEATURE PROPAGATION AND LABEL PROPAGATION

In semi-supervised node classification, based on the Laplacian smoothing assumption, the GNN transforms and propagates node features X across the graph through several layers, including linear layers and nonlinear activations, to approximate the mapping X → Y. The feature propagation scheme of a GNN in layer l is:

H^{(l+1)} = σ(D^{-1} A H^{(l)} W^{(l)}),    Y = f_out(H^{(L)})    (1)

where σ is an activation function, W^{(l)} is the trainable weight of the l-th layer, and H^{(l)} is the l-th layer representation of the nodes, with H^{(0)} equal to the node input features X. Finally, an output layer f_out is applied on the final representation to make predictions for Y.

[Figure 1: The architecture of our UniMP, which takes X, Ŷ, and A as inputs and predicts P(Y_{V_U} | X, Ŷ, A).]

As for LPA, it also assumes that the labels of connected nodes are smooth and propagates them iteratively across the graph. Given an initial label matrix Ŷ^{(0)}, which consists of one-hot label indicator vectors ŷ_i^{(0)} for the labeled nodes and zero vectors for the unlabeled ones.
A simple iteration equation of LPA is formulated as follows:

Ŷ^{(l+1)} = D^{-1} A Ŷ^{(l)}    (2)

Labels are propagated between neighboring nodes through the normalized adjacency matrix D^{-1}A.
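The LPA iteration of Equation 2 can be sketched in a few lines of NumPy. The toy path graph and label choices are illustrative; the clamping of labeled nodes back to their known labels after each step is a standard LPA detail (Zhu et al., 2003) rather than something spelled out in Equation 2:

```python
import numpy as np

# Toy 4-node path graph 0-1-2-3 with two labeled endpoints (illustrative).
edges = [(0, 1), (1, 2), (2, 3)]
n, c = 4, 2
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_norm = np.diag(1.0 / A.sum(axis=1)) @ A

# One-hot rows for the labeled nodes (0 and 3), zero rows for the unlabeled.
Y = np.zeros((n, c))
Y[0, 0] = 1.0  # node 0 labeled class 0
Y[3, 1] = 1.0  # node 3 labeled class 1

for _ in range(10):
    Y = A_norm @ Y           # Equation 2: Y^{(l+1)} = D^{-1} A Y^{(l)}
    Y[0] = [1.0, 0.0]        # clamp labeled nodes to their known labels
    Y[3] = [0.0, 1.0]

# Node 1 (closer to node 0) leans toward class 0; node 2 toward class 1.
assert Y[1, 0] > Y[1, 1] and Y[2, 1] > Y[2, 0]
```

After a few iterations the unlabeled nodes acquire soft label distributions weighted by their graph distance to each labeled node.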

2.2. UNIFIED MESSAGE PASSING MODEL

As shown in Figure 1, we employ a Graph Transformer, jointly using label embedding, to construct our unified message passing model that combines the aforementioned feature and label propagation.

Graph Transformer. Since the Transformer (Vaswani et al., 2017) has proven powerful in NLP, we employ its vanilla multi-head attention in graph learning, taking the case of edge features into account. Specifically, given node features H^{(l)} = {h_1^{(l)}, h_2^{(l)}, ..., h_n^{(l)}}, we calculate the multi-head attention for each edge from j to i as follows:

q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}
k_{c,j}^{(l)} = W_{c,k}^{(l)} h_j^{(l)} + b_{c,k}^{(l)}
e_{c,ij} = W_{c,e} e_{ij} + b_{c,e}
α_{c,ij}^{(l)} = ⟨q_{c,i}^{(l)}, k_{c,j}^{(l)} + e_{c,ij}⟩ / Σ_{u∈N(i)} ⟨q_{c,i}^{(l)}, k_{c,u}^{(l)} + e_{c,iu}⟩    (3)

where ⟨q, k⟩ = exp(q^T k / √d) is the exponential scaled dot-product function and d is the hidden size of each head. For the c-th head attention, we first transform the source feature h_i^{(l)} and distant feature h_j^{(l)} into a query vector q_{c,i}^{(l)} ∈ R^d and a key vector k_{c,j}^{(l)} ∈ R^d respectively, using distinct trainable parameters W_{c,q}^{(l)}, W_{c,k}^{(l)}, b_{c,q}^{(l)}, b_{c,k}^{(l)}. The provided edge features e_{ij} are encoded and added to the key vector as additional information for each layer. After getting the graph multi-head attention, we make a message aggregation from the distant node j to the source node i:

v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}
ĥ_i^{(l)} = ∥_{c=1}^{C} [ Σ_{j∈N(i)} α_{c,ij}^{(l)} (v_{c,j}^{(l)} + e_{c,ij}) ]
r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}
β_i^{(l)} = sigmoid(W_g^{(l)} [ĥ_i^{(l)}; r_i^{(l)}; ĥ_i^{(l)} − r_i^{(l)}])
h_i^{(l+1)} = ReLU(LayerNorm((1 − β_i^{(l)}) ĥ_i^{(l)} + β_i^{(l)} r_i^{(l)}))    (4)

where ∥ is the concatenation operation over the C attention heads. Compared with Equation 1, the multi-head attention matrix replaces the original normalized adjacency matrix as the transition matrix for message passing. The distant feature h_j^{(l)} is transformed into v_{c,j}^{(l)} ∈ R^d for the weighted sum.
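The attention and aggregation steps above can be sketched as a single-head NumPy layer. This is a simplified illustration, not the authors' implementation: biases, LayerNorm, and the multi-head concatenation are omitted, and all weight shapes are assumptions made for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_transformer_layer(H, edges, E, Wq, Wk, Wv, We, Wr, Wg):
    """Single-head sketch of Equations 3-4 (biases and LayerNorm omitted).

    H: (n, d) node features; edges: list of (src, dst); E: (m, d) edge feats.
    Weight matrices are (d, d), except Wg which maps the (3d,) gate input
    to a scalar. Shapes and the single-head restriction are simplifications.
    """
    n, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    Ee = E @ We

    H_hat = np.zeros_like(H)
    for i in range(n):
        nbr = [(j, eid) for eid, (j, dst) in enumerate(edges) if dst == i]
        if not nbr:
            continue
        # <q, k> = exp(q^T k / sqrt(d)); edge features are added to the keys.
        scores = np.array([np.exp(Q[i] @ (K[j] + Ee[eid]) / np.sqrt(d))
                           for j, eid in nbr])
        alpha = scores / scores.sum()       # Equation 3
        # Weighted sum of value vectors, edge features added as well.
        H_hat[i] = sum(a * (V[j] + Ee[eid]) for a, (j, eid) in zip(alpha, nbr))

    # Gated residual connection (Equation 4), sigmoid gate per node.
    R = H @ Wr
    gate_in = np.concatenate([H_hat, R, H_hat - R], axis=1)
    beta = 1.0 / (1.0 + np.exp(-(gate_in @ Wg)))            # (n, 1)
    return np.maximum(0.0, (1.0 - beta) * H_hat + beta * R)  # ReLU

n, d = 5, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
H = rng.normal(size=(n, d))
E = rng.normal(size=(len(edges), d))
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
Wg = rng.normal(size=(3 * d, 1)) * 0.1
out = graph_transformer_layer(H, edges, E, *Ws, Wg)
assert out.shape == (n, d) and (out >= 0).all()
```

The key structural point is visible in the loop: the learned attention weights α play exactly the role the fixed row-normalized adjacency D^{-1}A plays in Equation 1.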
In addition, inspired by (Li et al., 2019; Chen et al., 2020), to prevent oversmoothing we introduce gated residual connections between layers through r_i^{(l)} ∈ R^d and β_i^{(l)} ∈ R^1. Specially, similar to GAT, if we apply the Graph Transformer in the output layer, we employ averaging over the multi-head outputs as follows:

ĥ_i^{(l)} = (1/C) Σ_{c=1}^{C} Σ_{j∈N(i)} α_{c,ij}^{(l)} (v_{c,j}^{(l)} + e_{c,ij})
h_i^{(l+1)} = (1 − β_i^{(l)}) ĥ_i^{(l)} + β_i^{(l)} r_i^{(l)}    (5)

Label Embedding and Propagation. We propose to embed the partially observed label information into the same space as the node features: Ŷ ∈ R^{n×c} → Ŷ_e ∈ R^{n×f}, which consists of label embedding vectors for the labeled nodes and zero vectors for the unlabeled ones. We then combine label propagation with the Graph Transformer by simply adding the node features and label features together as the propagated features: H^{(0)} = X + Ŷ_e ∈ R^{n×f}. We can prove that, by mapping the partially-labeled Ŷ and the node features X into the same space and adding them up, our model unifies both label propagation and feature propagation within a shared message passing framework. Let Ŷ_e = Ŷ W_e and let A* be either the normalized adjacency matrix D^{-1}A or the attention matrix from our Graph Transformer as in Equation 3. Then:

H^{(0)} = X + Ŷ W_e
H^{(l+1)} = σ(((1 − β)A* + βI) H^{(l)} W^{(l)})    (6)

where β can be the gating function of Equation 4 or a pre-defined hyper-parameter as in APPNP (Klicpera et al., 2019). For simplicity, let σ be the identity function; then:

H^{(l)} = ((1 − β)A* + βI)^l (X + Ŷ W_e) W^{(1)} W^{(2)} ··· W^{(l)}
       = ((1 − β)A* + βI)^l X W + ((1 − β)A* + βI)^l Ŷ W_e W    (7)

where W = W^{(1)} W^{(2)} ··· W^{(l)}. Thus our model can be approximately decomposed into feature propagation ((1 − β)A* + βI)^l X W and label propagation ((1 − β)A* + βI)^l Ŷ W_e W.
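The decomposition in Equation 7 can be checked numerically: propagating the summed input X + Ŷ W_e layer by layer (with identity activation) equals the sum of the separate feature-propagation and label-propagation terms. The sizes, the random row-stochastic stand-in for A*, and the labeled-node choices below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, f, c, L = 6, 4, 3, 3
beta = 0.2

# Random row-stochastic A* standing in for the attention/transition matrix.
A_star = rng.random((n, n))
A_star /= A_star.sum(axis=1, keepdims=True)
P = (1 - beta) * A_star + beta * np.eye(n)

X = rng.normal(size=(n, f))
Y_hat = np.zeros((n, c))
Y_hat[[0, 2], [1, 0]] = 1.0          # two labeled nodes (illustrative)
W_e = rng.normal(size=(c, f))        # label embedding matrix
Ws = [rng.normal(size=(f, f)) for _ in range(L)]

# Left-hand side of Equation 7: propagate the summed input through L layers.
H = X + Y_hat @ W_e                  # H^{(0)} = X + Y_hat W_e
for W in Ws:
    H = P @ H @ W                    # identity activation, per the derivation

# Right-hand side: separate feature- and label-propagation terms.
W_prod = np.linalg.multi_dot(Ws)
P_L = np.linalg.matrix_power(P, L)
rhs = P_L @ X @ W_prod + P_L @ (Y_hat @ W_e) @ W_prod

assert np.allclose(H, rhs)
```

The equality holds exactly here because P multiplies on the left and the W's on the right, so they factor independently across layers; the nonlinear activation σ in the real model makes the decomposition only approximate.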

3. MASKED LABEL PREDICTION

Previous works on GNNs seldom consider using the partially observed labels Ŷ in both the training and inference stages. They only take the label information as ground-truth targets to train their model parameters θ in a supervised way, given X and A:

arg max_θ log p_θ(Ŷ | X, A) = Σ_{i ∈ V_L} log p_θ(ŷ_i | X, A)    (8)

where V_L denotes the subset of nodes with labels. However, our UniMP model propagates both node features and labels to make predictions: p(y | X, Ŷ, A). Simply using the above objective for our model would leak labels at training time, causing poor performance at inference. Learning from BERT, which masks input words and predicts them to pretrain the model (masked word prediction), we propose a masked label prediction strategy to train our model. During training, at each step we corrupt Ŷ into Ỹ by randomly masking a portion of the node labels to zeros while keeping the others unchanged, controlled by a hyper-parameter called the label rate. Let the masked labels be Ȳ; our objective is to predict Ȳ given X, Ỹ, and A:

arg max_θ log p_θ(Ȳ | X, Ỹ, A) = Σ_{i ∈ V_M} log p_θ(ȳ_i | X, Ỹ, A)    (9)

where V_M denotes the nodes with masked labels. In this way, we can train our model without leaking self-loop label information, and during inference we employ all of Ŷ as input labels to predict the remaining unlabeled nodes.
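The corruption step Ŷ → Ỹ can be sketched as follows. The function name, node counts, and label assignments are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)

def mask_labels(Y_hat, train_idx, label_rate, rng):
    """Corrupt Y_hat into (Y_tilde, masked_idx): keep a `label_rate` fraction
    of the training labels visible as input, zero out (mask) the rest, and
    return the masked node indices as prediction targets."""
    train_idx = np.asarray(train_idx)
    keep = rng.random(len(train_idx)) < label_rate
    masked_idx = train_idx[~keep]
    Y_tilde = Y_hat.copy()
    Y_tilde[masked_idx] = 0.0       # masked nodes look unlabeled to the model
    return Y_tilde, masked_idx

n, c = 10, 4
Y_hat = np.zeros((n, c))
train_idx = [0, 1, 2, 3, 4, 5]
Y_hat[train_idx, rng.integers(0, c, size=len(train_idx))] = 1.0

Y_tilde, masked_idx = mask_labels(Y_hat, train_idx, label_rate=0.625, rng=rng)
# Masked rows carry no label signal; unmasked training rows are unchanged.
assert (Y_tilde[masked_idx] == 0).all()
kept = [i for i in train_idx if i not in masked_idx]
assert (Y_tilde[kept] == Y_hat[kept]).all()
```

Training then computes the loss only on `masked_idx`, mirroring BERT's masked word prediction: the model never sees the label it is asked to predict, which removes the self-loop leakage.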

4. EXPERIMENTS

We propose a Unified Message Passing Model (UniMP) for semi-supervised node classification, which incorporates feature and label propagation jointly through a Graph Transformer and employs a masked label prediction strategy for optimization. We conduct experiments on the Node Property Prediction track of the Open Graph Benchmark (OGBN), which includes several challenging, large-scale datasets for semi-supervised classification, split in a procedure that closely matches real-world applications (Hu et al., 2020). To verify our model's effectiveness, we compare it with other state-of-the-art models (SOTAs) on the three OGBN datasets ogbn-products, ogbn-proteins, and ogbn-arxiv. We also provide further experiments and comprehensive studies to illustrate our motivation and to show how LPA improves our model's results. Datasets. Most of the frequently-used graph datasets are extremely small compared to graphs found in real applications, and the performance of GNNs on these datasets is often unstable due to several issues, including their small-scale nature, non-negligible duplication or leakage rates, and unrealistic data splits (Dwivedi et al., 2020; Hu et al., 2020). Consequently, we conduct our experiments on the recently released Open Graph Benchmark (OGB) datasets (Hu et al., 2020), which overcome the main drawbacks of the commonly used datasets and are thus much more realistic and challenging. The OGB datasets cover a variety of real-world applications and span several important domains ranging from social and information networks to biological networks, molecular graphs, and knowledge graphs. They also span a variety of prediction tasks at the level of nodes, graphs, and links/edges.
As shown in Table 2, we performed our experiments on three OGBN datasets with different sizes and tasks to obtain credible results: ogbn-products, a 47-category product classification task with 100-dimensional node features; ogbn-proteins, a 112-way protein function prediction task with 8-dimensional edge features; and ogbn-arxiv, a 40-class topic classification task with 128-dimensional node features. More details about these datasets are provided in Appendix A.

4.1. DATASETS AND EXPERIMENTAL SETTINGS

Implementation Details. As mentioned above, these datasets differ from each other in size and task, so we evaluate our model on them with different sampling methods, as in previous studies (Li et al., 2020), to obtain credible comparison results. On the ogbn-products dataset, we use NeighborSampling with size = 10 for each layer to sample subgraphs during training, and use full-batch inference. On the ogbn-proteins dataset, we use Random Partition to split the dense graph into subgraphs for training and testing; the number of partitions is 9 for training and 5 for testing. For the small ogbn-arxiv dataset, we simply use full-batch training and testing. We set the hyper-parameters of our model for each dataset as in Table 3, where the label rate denotes the percentage of labels we preserve when applying the masked label prediction strategy. We use the Adam optimizer with lr = 0.001 to train our model. In particular, we set weight decay to 0.0005 on the small ogbn-arxiv dataset to prevent overfitting. More details about the tuned hyper-parameters are provided in Appendix B. Following the OGB protocol, we run the experiments on each dataset 10 times and report the mean and standard deviation.

As shown in Table 4, Table 5, and Table 6, our unified model outperforms all other comparative models on the three OGBN datasets. Since most of the compared models only optimize feature propagation, these results demonstrate that incorporating label propagation into GNN models can bring significant improvements. Specifically, we obtain 82.56% ACC on ogbn-products and 86.42% ROC-AUC on ogbn-proteins, about 0.6-1.6% absolute improvement over recent SOTA methods such as DeeperGCN (Li et al., 2020). On ogbn-arxiv, our method obtains 73.11% ACC, a 0.37% absolute improvement over GCNII (Chen et al., 2020), whose parameter count is four times larger than ours.
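The masked label prediction strategy with a given label rate can be sketched as follows. This is an illustrative Python sketch of the procedure described above, not the paper's released code; the function and variable names are ours.

```python
import numpy as np

def mask_labels(y, train_idx, num_classes, label_rate=0.625, rng=None):
    """Masked label prediction (sketch): keep `label_rate` of the training
    labels as observed inputs and mask the rest, which become the targets
    the model must recover. Illustrative naming, not the official code."""
    rng = np.random.default_rng(0) if rng is None else rng
    perm = rng.permutation(train_idx)
    n_keep = int(label_rate * len(train_idx))
    obs_idx, masked_idx = perm[:n_keep], perm[n_keep:]
    # One-hot "observed label" input channel; masked and unlabeled
    # (validation/test) nodes stay all-zero.
    y_input = np.zeros((len(y), num_classes))
    y_input[obs_idx, y[obs_idx]] = 1.0
    return y_input, masked_idx  # the loss is computed only on masked_idx
```

During training a fresh mask would be drawn each epoch, so the model learns to recover hidden labels from features and propagated labels; at inference, all training labels are supplied (label rate 1.0).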
Model                        Test ACC           Valid ACC          Params
(Li et al., 2020)            0.7192 ± 0.0016    0.7262 ± 0.0014    1,471,506
GaAN (Zhang et al., 2018)    0.7197 ± 0.0024    -                  1,471,506
DAGNN (Liu et al., 2020a)    0.7209 ± 0.0025    -                  1,751,574
JKNet (Xu et al., 2018b)     0.7219 ± 0.0021    0.7335 ± 0.0007    331,661
GCNII (Chen et al., 2020)    ...

Inputs    Model          ogbn-products      ogbn-proteins      ogbn-arxiv
Ŷ         Transformer    0.8269 ± 0.0009    0.8560 ± 0.0003    0.7332 ± 0.0014

† In ogbn-proteins, node features are not provided initially. We average the edge features as node features and drop the edge features for a fair comparison in this experiment, which is slightly different from Table 5. X denotes the node features, A the graph adjacency matrix, and Ŷ the observed labels. We run these models three times and report their means and standard deviations.

One of our motivations for using label propagation is that labels carry additional informative features that cannot be replaced by the model's approximation. However, the relation between the coverage of labeled data and the impact of label propagation on our model remains uncertain. Therefore, we conduct further experiments on ogbn-arxiv to investigate their relationship in several different scenarios:

• In Figure 2a, we train UniMP using X, Ŷ, and A as inputs. We tune the training label rate, which is the hyper-parameter of the masked label prediction task, and display the validation and test accuracy. Our model achieves its best performance when the label rate is about 0.625.

Wang & Leskovec (2019) theoretically proved that using LPA for GCN during training enables nodes within the same class/label to connect more strongly, increasing the accuracy (ACC) of the model's predictions. Our model can be seen as an upgraded version of theirs, using LPA at both training and testing time. Therefore, we try to verify the above idea experimentally with our model. We use the Margin Similarity Function to reflect the connection tightness of nodes with the same class (the higher the score, the stronger the connection; more details in Appendix C).
We conduct this experiment on ogbn-arxiv. As shown in Figure 3, the ACC of a model's predictions is proportional to its Margin Similarity. Unifying feature and label propagation further strengthens these connections, improving ACC. Moreover, our Graph Transformer outperforms GAT in both connection tightness and ACC across different inputs.

5. CONCLUSION

We first propose a unified message passing model, UniMP, which jointly performs feature propagation and label propagation within a Graph Transformer to carry out semi-supervised classification. Furthermore, we propose a masked label prediction method to supervise the training of our model, preventing it from overfitting to self-loop label information. Experimental results show that UniMP outperforms the previous state-of-the-art models on three main OGBN datasets, ogbn-products, ogbn-proteins, and ogbn-arxiv, by a large margin, and ablation studies demonstrate the effectiveness of unifying feature propagation and label propagation.



EXPLORING HOW LABEL PROPAGATION AFFECTS UNIMP

(a) Training with different label rates (x-axis: Label Rate (%) in Training Phase).

(b) Training with different proportions of data (legend: X, A Test; X, A, Ŷ Valid; X, A, Ŷ Test).

(c) Testing with different proportions of labels (x-axis: Label Rate (%) in Testing Phase).

Figure 2: Exploration of how label coverage affects label propagation.

• Figure 2b describes the correlation between the proportion of training data and the effectiveness of label propagation. We fix the label rate at 0.625 and change only the proportion of training data. As expected, performance gradually improves with more training data, and the model with label propagation benefits more as the labeled data proportion increases.

• In the training stage, our model always masks part of the training labels and tries to recover them, but in the inference stage it utilizes all training labels for prediction, which is slightly inconsistent with training. In Figure 2c, we fix our trained models and perform label propagation with different label rates at inference. We find that when the label rate at prediction time is lowered, UniMP can perform worse (less than 0.70) than the baseline (about 0.72); however, as the label rate climbs, performance rises to 0.73.

• In Figure 2d, we calculate accuracy for unlabeled nodes with different numbers of neighbors. The results show that nodes with more neighbors have higher accuracy, and the model with label propagation consistently improves over the baseline across different numbers of training neighbors.

Figure 3: Correlation between accuracy and margin similarity between neighbors.

Comparison between message passing models

Dataset statistics of OGB node property prediction

The hyper-parameter settings of our model

Baselines and other comparative SOTAs are provided by the OGB leaderboard. Some of the included results were produced officially by the authors of the original papers, while others were re-implemented by the community. All of these results are guaranteed to be reproducible with open-source code.

Results for ogbn-products

Results for ogbn-proteins

Results for ogbn-arxiv

ABLATION STUDIES ON GRAPH TRANSFORMER AND MASKED LABEL PREDICTION

In this section, we conduct extensive studies to identify the improvements contributed by the different components of our unified model. For a fair comparison, we re-implement classical GNN methods such as GCN and GAT, following the same sampling methods and model settings in Table 3. The hidden size of GCN is head num * hidden size, since it has no attention heads. We also vary the inputs to our models to study the effectiveness of feature and label propagation. As shown in Table 7, it is surprising that given only Ŷ and A, GNNs still work well on all three datasets, outperforming the MLP model given only X. This implies that a node's label relies heavily on its neighborhood rather than on its own features. Models that take only X and A as inputs, as most GNNs do, are more likely to memorize the training labels through approximation, which is inaccurate. Predicting without incorporating the annotated label information Ŷ from the training set, which is more precise than the model's approximation of the training data, wastes information in semi-supervised classification. In addition, across different input settings, our improved Graph Transformer outperforms GAT and GCN in most cases.

Ablation studies on models with different inputs.
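The X / A / Ŷ input settings in this ablation rely on feeding observed labels into the network alongside node features. A minimal sketch of such an input layer, under the assumption (ours, not the paper's released code) that a label embedding is summed with a linear feature projection, with a reserved index for masked or unlabeled nodes:

```python
import numpy as np

class FeatureLabelInput:
    """Sketch: combine node features X with observed labels Ŷ by summing a
    linear feature projection and a learned label embedding (illustrative)."""
    def __init__(self, in_dim, num_classes, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (in_dim, hidden))
        # num_classes + 1 embedding rows: the last row is the "unknown label"
        # id used for masked and unlabeled nodes.
        self.E = rng.normal(0.0, 0.1, (num_classes + 1, hidden))

    def __call__(self, x, y_obs):
        # y_obs holds class ids for observed nodes, num_classes otherwise.
        return x @ self.W + self.E[y_obs]
```

Dropping the label term recovers the X, A setting, and zeroing the feature projection approximates the Ŷ, A setting, which is how the different input rows of Table 7 can be produced from one architecture.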


A DATASETS DETAILS

ogbn-products. As shown in Table 2, ogbn-products is an undirected and unweighted graph representing an Amazon product co-purchasing network. The goal of this task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used as target labels. To match the real-world application, the dataset is split based on sales ranking: the top 10% of products for training, the next top 2% for validation, and the rest for testing.

ogbn-proteins. As shown in Table 2, ogbn-proteins is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression, or homology. The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. Performance is measured by the average of the ROC-AUC scores across the 112 tasks. The split is by species.

ogbn-arxiv. As shown in Table 2, ogbn-arxiv is a directed graph representing the citation network of all Computer Science (CS) arXiv papers indexed by MAG (Wang et al., 2020). The task is to predict the 40 subject areas of arXiv CS papers, e.g., cs.AI, cs.LG, and cs.OS, which are manually assigned by the papers' authors and arXiv moderators. This dataset is split by time.
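The sales-ranking split for ogbn-products described above can be sketched as a simple index computation. This is a hypothetical helper illustrating the procedure; in practice the official split indices ship with the OGB package:

```python
def sales_rank_split(num_nodes):
    """ogbn-products-style split: with nodes assumed sorted by sales rank,
    the top 10% go to training, the next 2% to validation, and the
    remaining 88% to testing (sketch of the stated procedure)."""
    n_train = int(0.10 * num_nodes)
    n_valid = int(0.02 * num_nodes)
    idx = list(range(num_nodes))
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]
```

This split deliberately trains on the best-selling (most connected) products and tests on the long tail, which is what makes the benchmark closer to deployment conditions than a uniform random split.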

B HYPER-PARAMETERS TUNED ON UNIMP MODEL

These are the hyper-parameters we tuned on our unified model for comparison with other SOTA results, where the asterisks denote the hyper-parameters we eventually selected.

C MARGIN SIMILARITY FUNCTION

Given an attention weight α_{i,j} from GAT or Graph Transformer, which can represent the connection tightness between a source node i and a neighboring node j, we employ the Circle Loss (Sun et al., 2020) with a slight modification to build our Margin Similarity Function (MSF), measuring the connection tightness between neighboring nodes with the same label. For each center node i and its neighbors j, k ∈ N(i), we treat the measurement as a pair-similarity problem, in which the center node's neighbors with the same label are positive samples and the others are negative samples, and calculate their connection tightness as follows:

\mathrm{MSF} = \frac{1}{|V|} \sum_{i \in V} \log \frac{\sum_{j \in N(i),\, \hat{y}_j = \hat{y}_i} e^{\alpha_{i,j}}}{\sum_{k \in N(i)} e^{\alpha_{i,k}}} \quad (10)
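Under our reading of Eq. 10 as a log-ratio of the attention mass on same-label neighbors to the total neighbor attention mass (an assumption about the exact form, not the released implementation), the MSF can be computed as:

```python
import math

def margin_similarity(alpha, labels, neighbors):
    """Margin Similarity sketch (our reading of the MSF, hypothetical
    naming): per center node, the log of attention mass on same-label
    neighbors over total neighbor attention mass, averaged over nodes."""
    scores = []
    for i, nbrs in neighbors.items():
        pos = sum(math.exp(alpha[(i, j)]) for j in nbrs if labels[j] == labels[i])
        tot = sum(math.exp(alpha[(i, k)]) for k in nbrs)
        if pos > 0:  # skip nodes with no same-label neighbor
            scores.append(math.log(pos / tot))
    return sum(scores) / len(scores)
```

Scores are at most 0, and a score closer to 0 means the attention concentrates on same-class neighbors, i.e., tighter intra-class connections.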

