TOWARDS ROBUST GRAPH NEURAL NETWORKS AGAINST LABEL NOISE

Abstract

Massive amounts of labeled data are used to train deep neural networks, so label noise has become an important issue. Although learning with noisy labels has made great progress on image datasets in recent years, it has not yet been studied in the context of using GNNs to classify graph nodes. In this paper, we propose a method, named LPM, that addresses the problem with Label Propagation (LP) and Meta learning. Different from previous methods designed for image datasets, our method exploits a special attribute (label smoothness) of graph-structured data: neighboring nodes in a graph tend to have the same label. A pseudo label is computed from the neighboring labels of each node in the training set using LP; meta learning is then used to learn a proper aggregation of the original and pseudo labels as the final label. Experimental results demonstrate that LPM outperforms state-of-the-art methods on the graph node classification task with both synthetic and real-world label noise. Source code to reproduce all results will be released.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved great success in various domains, but collecting large amounts of samples with high-quality labels is both expensive and time-consuming. To address this problem, cheaper alternatives have emerged. For example, the onerous labeling process can be completed on a crowdsourced system such as Amazon Mechanical Turk [1]. Besides, labeled samples can be collected from the web with search engines and social media. However, all these methods are prone to produce noisy labels of low quality. As shown in recent research (Zhang et al., 2016b), an intractable problem is that DNNs can easily overfit noisy labels, which dramatically degrades generalization performance. Therefore, it is necessary and urgent to design valid methods to solve this problem.

Graph Neural Networks (GNNs) have aroused keen research interest in recent years, resulting in rapid progress in graph-structured data analysis (Kipf & Welling, 2016; Velickovic et al., 2017; Xu et al., 2018; Hou et al., 2019; Wang & Leskovec, 2020). Graph node classification is the most common task for GNNs. However, almost all previous work on label noise focuses on image classification, and handling noisy labels in graph node classification with GNNs has not been studied yet. Fortunately, most edges in graph-structured datasets are intra-class edges (Wang & Leskovec, 2020), indicating that a node's label can be estimated from its neighbors' labels. In this paper, we exploit this special attribute of graph data to alleviate the damage caused by noisy labels. Moreover, the meta-learning paradigm serves as a useful tool to learn a proper aggregation of original labels and pseudo labels as the final labels.
The key contributions of this paper are as follows:

- To the best of our knowledge, we are the first to focus on label noise in classifying graph nodes with GNNs, which may serve as a starting point for future research towards robust GNNs against label noise.
- We utilize meta-learning to learn how to aggregate original labels and pseudo labels properly, obtaining more credible supervision instead of learning to re-weight individual samples.
- We experimentally show that our LPM outperforms state-of-the-art algorithms in classifying graph nodes with GNNs under both synthetic and real-world label noise.

2. RELATED WORK

2.1. GRAPH NEURAL NETWORKS

To start, we use G = (V, E, X) to denote a graph with node set V and edge set E, where X ∈ R^{n×d} is the input feature matrix, n denotes the number of nodes in the graph, and d is the dimension of each node's input feature vector. We use e_{u,v} ∈ E to denote the edge connecting nodes u and v. For each node v ∈ V, its neighbor set is denoted N_v = {u : e_{u,v} ∈ E}. For the node classification task, the goal of GNNs is to learn an optimal mapping function f(·) to predict the class label y_v for node v. Generally speaking, GNNs follow a framework of aggregation and combination in each layer, and different GNNs propose different instantiations of these two steps. In general, the k-th layer of a GNN reads

a_v^{(k)} = Aggregate^{(k)}({h_u^{(k-1)} : u ∈ N_v}),
h_v^{(k)} = Combine^{(k)}(h_v^{(k-1)}, a_v^{(k)}),

where h_v^{(k)} is the output of the k-th layer for node v and h_v^{(0)} is the input feature vector of node v.
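The Aggregate/Combine framework above can be sketched concretely. The following is a minimal illustration, assuming mean aggregation and a sum-then-ReLU Combine step; both choices are ours for illustration only (GCN, GAT, etc. instantiate the two operators differently).

```python
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One generic GNN layer: a_v = mean of neighbour features (Aggregate),
    h_v' = ReLU(h_v W_self + a_v W_neigh) (Combine)."""
    n = H.shape[0]
    A = np.zeros_like(H)
    for v in range(n):
        neigh = np.nonzero(adj[v])[0]
        if len(neigh) > 0:
            A[v] = H[neigh].mean(axis=0)      # Aggregate over N_v
    H_new = H @ W_self + A @ W_neigh          # Combine (sum variant)
    return np.maximum(H_new, 0.0)             # ReLU nonlinearity

# Toy graph with 3 nodes and edges 0-1, 1-2; one-hot input features.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3)
rng = np.random.default_rng(0)
H1 = gnn_layer(H, adj, rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
print(H1.shape)  # (3, 4)
```

Stacking such layers lets each node's representation absorb information from progressively larger neighborhoods.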

2.2. LABEL PROPAGATION

In Label Propagation (LP), node labels are propagated and aggregated along the edges of the graph (Zhou et al., 2004; Zhu et al., 2005; Wang & Zhang, 2007; Karasuyama & Mamitsuka, 2013). Several works have been designed to improve the performance of label propagation. For example, Gong et al. (2016) proposed a novel iterative label propagation algorithm that explicitly optimizes propagation quality by manipulating the propagation sequence to move from simple to difficult examples; Zhang et al. (2020) introduced a triple matrix recovery mechanism to remove noise from the estimated soft labels during propagation. Label propagation has also been applied to semi-supervised image classification. For example, Gong et al. (2017) used a weighted K-nearest-neighbor graph to bridge the data points so that label information can be propagated from the scarce labeled examples to unlabeled examples along the graph edges. Park et al. (2020) proposed a novel framework to propagate the label information of the (reliable) sampled data to adjacent data along a similarity-based graph. Compared to these methods, we utilize the intrinsic graph structure instead of a handcrafted graph to propagate clean label information, which is more reliable for graph-structured data. Besides, we use GNNs to extract features and classify nodes for graph-structured data.

2.3. META-LEARNING BASED METHODS AGAINST NOISY LABELS

Meta-learning aims to learn not only neural networks' weights, but also aspects of the learning process itself, such as hand-designed parameters and optimizers (Andrychowicz et al., 2016; Finn et al., 2017). Several works have utilized the meta-learning paradigm to deal with label noise. For example, Li et al. (2019) proposed to find noise-tolerant model parameters by keeping the consistency between the outputs of teacher and student networks, and Li et al. (2017b) trains teacher networks on samples with clean labels and then transfers the knowledge to student networks so that the students can learn correctly even in the presence of mislabeled data. Besides, Ren et al. (2018); Jenni & Favaro (2018); Shu et al. (2019) utilize the meta-learning paradigm to re-weight samples, i.e., to weight samples with clean labels more and mislabeled samples less. The weighting factors are optimized by gradient descent or generated by a network so as to minimize the loss on a small amount of samples with correct labels. In contrast, the meta-learning paradigm is used in this paper to learn how to aggregate original labels and pseudo labels properly; by combining the original label information with the label information provided by LP, we obtain more credible supervision.

3. METHOD

3.1. PRELIMINARIES

Given a graph with n nodes and their labels D = {(x_0, y_0), (x_1, y_1), ..., (x_{n-1}, y_{n-1})}, where x_j is the j-th node and y_j ∈ {0, 1}^c is its label over c classes, let D_train = {(x_0, y_0), (x_1, y_1), ..., (x_{s-1}, y_{s-1})} be the training nodes with noisy labels. Our goal is to train GNNs f(x_j; w) on the noisy set D_train such that they still generalize well on the test nodes, where w denotes the learnable parameters of the GNNs. In our method, m nodes with true labels, D_clean = {(x_s, y_s), (x_{s+1}, y_{s+1}), ..., (x_{s+m-1}, y_{s+m-1})}, are provided as the initial clean set (m ≪ s). GCN (Kipf & Welling, 2016) and GAT (Velickovic et al., 2017) are utilized in our experiments to extract features and classify nodes. Our method comprises two main parts, label propagation and label aggregation, which we detail in Sections 3.2 and 3.3.

3.2. LABEL PROPAGATION

Label Propagation is based on label smoothness: two connected nodes tend to have the same label. Therefore, the weighted average of a node's neighbor labels is close to that node's true label. An illustration of the LP part of our method can be found in Figure 1. The first step of LP is to construct an appropriate neighborhood graph. A common choice is a k-nearest-neighbor graph (Iscen et al., 2019; Liu et al., 2018), but graph data come with an intrinsic graph structure (the adjacency matrix A), so our similarity matrix W with zero diagonal can be constructed from A, with elements W_{i,j} giving the pairwise similarity between nodes i and j:

W_{i,j} = A_{i,j} / (d(h_i, h_j) + ε),

where h_i, h_j are the feature vectors extracted by the GNNs for nodes i and j, d(·, ·) is a distance measure (e.g., Euclidean distance), and ε is a small positive constant. Note that W can be computed in O(|E|) time instead of O(n²) because A is a sparse matrix whose edge lists are given. We then normalize the similarity matrix W:

S = D^{-1/2} W D^{-1/2},

where D is a diagonal matrix whose (i, i)-th entry is the sum of the i-th row of W. Let Y^{(k)} = [y_1^{(k)}, ..., y_n^{(k)}]^T ∈ R^{n×c} be the soft label matrix at LP iteration k, whose i-th row y_i^{(k)} is the predicted label distribution for node i. When k = 0, the initial label matrix Y^{(0)} = [y_1^{(0)}, ..., y_n^{(0)}]^T consists of one-hot label vectors for i = s, s+1, ..., s+m-1 (i.e., the initial clean set) and zero vectors otherwise. LP (Zhu et al., 2005) at iteration k can be formulated as:

Y^{(k+1)} = S Y^{(k)},    (4)
y_i^{(k+1)} = y_i^{(0)},  ∀i ∈ [s, s+m-1].    (5)

In Eq. (4), every node's label in the (k+1)-th iteration equals the weighted average of its neighbors' labels in the k-th iteration; in this way, the clean set propagates labels to the noisy training nodes according to the normalized edge weights. Then, in Eq. (5), the labels of the clean-set nodes are reset to their initial values.
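The LP step can be sketched end-to-end as follows. This is a minimal runnable illustration with toy features and a toy clean set of our own choosing; it builds W from the adjacency matrix and Euclidean feature distances, normalizes it symmetrically, and alternates propagation with clean-label resets as in Eqs. (4) and (5).

```python
import numpy as np

def label_propagation(A, H, Y0, clean_idx, iters=20, eps=1e-12):
    """Propagate soft labels Y0 along the intrinsic graph A using
    feature-distance-weighted edges (W_ij = A_ij / (d(h_i,h_j) + eps))."""
    dist = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    W = A / (dist + eps)                        # similarity only on existing edges
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, eps)))
    S = D_inv_sqrt @ W @ D_inv_sqrt             # S = D^{-1/2} W D^{-1/2}
    Y = Y0.copy()
    for _ in range(iters):
        Y = S @ Y                               # propagate (Eq. 4)
        Y[clean_idx] = Y0[clean_idx]            # reset clean labels (Eq. 5)
    return Y

# Chain graph 0-1-2-3; nodes 0 and 3 are the clean set with classes 0 and 1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = np.array([[0.0], [0.1], [0.9], [1.0]])      # 1-d node features
Y0 = np.zeros((4, 2)); Y0[0, 0] = 1.0; Y0[3, 1] = 1.0
Y = label_propagation(A, H, Y0, clean_idx=[0, 3])
print(Y.argmax(axis=1))  # node 1 follows clean node 0, node 2 follows clean node 3
```

Because node 1 sits much closer in feature space to clean node 0 than to node 2, its soft label converges to class 0, and symmetrically for node 2.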
The reason is that we want to take full advantage of the small minority of clean nodes and prevent the effect of the clean set from fading away. Co-teaching (Han et al., 2018) and Co-teaching+ (Yu et al., 2019) have been proposed to train DNNs robustly against label noise: two DNNs select small-loss samples from the noisy training set to train each other. Our method is similar to theirs to some extent, because LP is used to select true-labeled samples from D_train for training. However, instead of taking the small-loss nodes as true-labeled, we select the nodes D_select whose original labels agree with their pseudo labels. The original labels of D_select are credible, and we also inject these nodes into the initial clean set D_clean for better LP in the next epoch. This is why our method can achieve good performance even when few true-labeled nodes are provided. For every (x_j, y_j) ∈ D_left, we compute two loss values:

l_1 = loss(ŷ_j, y_j),    (6)
l_2 = loss(ŷ_j, ỹ_j),    (7)

where ŷ_j is the label predicted by the GNNs for training node j and ỹ_j is the pseudo label predicted by LP for node j. We obtain the final label ȳ_j for node j by aggregating the original label y_j and the pseudo label ỹ_j:
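The agreement-based split above reduces to a few lines. The sketch below uses toy soft pseudo labels and one-hot original labels (both our own stand-ins): a training node joins D_select when the arg-max of its LP pseudo label matches its original label, and joins D_left otherwise.

```python
import numpy as np

def split_by_agreement(pseudo_soft, original_onehot, train_idx):
    """D_select: nodes whose LP pseudo label agrees with the original label.
    D_left: nodes with disagreement, to be resolved by label aggregation."""
    select, left = [], []
    for i in train_idx:
        if pseudo_soft[i].argmax() == original_onehot[i].argmax():
            select.append(i)   # original label corroborated by LP
        else:
            left.append(i)     # possibly mislabelled; aggregate later
    return select, left

pseudo = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])  # LP output
orig   = np.array([[1, 0],     [1, 0],     [0, 1]])      # given (noisy) labels
select, left = split_by_agreement(pseudo, orig, train_idx=[0, 1, 2])
print(select, left)  # [0] [1, 2]
```

In LPM the selected indices are both used for an immediate supervised update and appended to the clean set for the next round of propagation.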

3.3. META-LEARNING BASED LABEL AGGREGATION

ȳ_j = λ_j y_j + (1 - λ_j) ỹ_j,  λ_j ∈ [0, 1],

where λ_j is the aggregation coefficient. Some previous methods designed a weighting function mapping training loss to sample weights for noisy-label problems (Kumar et al., 2010; Ren et al., 2018; Shu et al., 2019). Instead, we utilize a 3-layer multi-layer perceptron (MLP) as the aggregation network g(·; ·) to map the loss values to the aggregation coefficient λ_j:

λ_j = g([l_1, l_2]; θ) = λ_j(θ; w),

where [l_1, l_2] is the 2-dimensional concatenation of l_1 and l_2, and θ denotes the weights of the aggregation network g. The rationale lies in the consensus that a sample's loss values reflect the credibility of its original label (Kumar et al., 2010; Shu et al., 2019; Yu et al., 2019). The aggregation network has an input layer of 2 neurons and an output layer of one neuron, and such an MLP can approximate almost any continuous function. The activation of the last layer is the sigmoid function, ensuring the output λ_j ∈ [0, 1]. The training loss L^tr_j for node j is

L^tr_j(w, θ) = loss(ŷ_j(w), ȳ_j(θ)).

We then take a gradient step on the GNNs:

ŵ_t(θ_t) = w_t - (α / |D_left|) Σ_{(x_j, y_j) ∈ D_left} ∇_w L^tr_j(w, θ_t)|_{w_t},

where α is the learning rate of the GNNs. Next we compute the loss L^c on the clean set D_clean:

L^c(ŵ_t(θ_t)) = (1 / |D_clean|) Σ_{(x_i, y_i) ∈ D_clean} loss(f(x_i; ŵ_t(θ_t)), y_i),

where f(x_i; ŵ_t(θ_t)) is the output of the GNNs. L^c is then used to update the weights of the aggregation network:

θ_{t+1} = θ_t - β ∇_θ L^c(ŵ(θ))|_{θ_t},

where β is the learning rate of the aggregation network. Finally, the GNNs' weights are updated:

w_{t+1} = w_t - (α / |D_left|) Σ_{(x_j, y_j) ∈ D_left} ∇_w L^tr_j(w, θ_{t+1})|_{w_t}.

To some extent, this part is similar to re-weighting based methods (Ren et al., 2018; Shu et al., 2019). However, LPM has two significant advantages.
Firstly, re-weighting based methods cannot remove the damage caused by incorrect labels because they assign every noisy training sample a positive weight, while LPM can potentially exploit noisy samples positively. Secondly, LPM generates comparatively credible labels that can be reused for other purposes, while re-weighting and similar methods cannot. Algorithm 1 shows all the steps of our algorithm.
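The forward pass of the aggregation step can be sketched as follows. The MLP weights here are random stand-ins (in LPM they are meta-learned against the clean set), and the hidden width and tanh activations are our own illustrative choices; only the 2-neuron input, 1-neuron sigmoid output, and the convex combination ȳ = λy + (1-λ)ỹ follow the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregation_net(l1, l2, params):
    """3-layer MLP g: (l1, l2) -> lambda in (0, 1) via a sigmoid output."""
    W1, b1, W2, b2, W3, b3 = params
    h = np.tanh(np.array([l1, l2]) @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return sigmoid(h @ W3 + b3).item()          # scalar aggregation coefficient

rng = np.random.default_rng(0)
hdim = 8                                        # assumed hidden width
params = (rng.normal(size=(2, hdim)), np.zeros(hdim),
          rng.normal(size=(hdim, hdim)), np.zeros(hdim),
          rng.normal(size=(hdim, 1)), np.zeros(1))

y, y_tilde = np.array([1.0, 0.0]), np.array([0.3, 0.7])  # original vs LP pseudo label
lam = aggregation_net(l1=0.4, l2=1.2, params=params)
y_final = lam * y + (1 - lam) * y_tilde          # aggregated supervision for L^tr
print(round(lam, 3), y_final)
```

Since both y and ỹ are valid distributions, their convex combination remains one, so ȳ can be used directly as a soft training target.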

3.4. CONVERGENCE OF LPM

Here we show theoretically that the loss functions converge to critical points under some mild conditions. Detailed proofs of the following theorems are provided in Appendix C.

Theorem 1. Suppose the loss function loss is L-Lipschitz smooth, and λ(·) is differentiable with a δ-bounded gradient and twice differentiable with its Hessian bounded by B with respect to θ. Let the learning rate α_t = min{1, k/T} for some k > 0 such that k/T < 1, and let the learning rate β_t be a monotonically decreasing sequence, β_t = min{1/L, c/√T} for some c > 0 such that √T/c ≥ L, with Σ_{t=1}^∞ β_t ≤ ∞ and Σ_{t=1}^∞ β_t² ≤ ∞. Then the clean loss of the Aggregation Net achieves ‖∇_θ L^c(ŵ(θ_t))‖₂² ≤ ε in O(1/ε²) steps. More specifically,

min_{0≤t≤T} ‖∇_θ L^c(ŵ(θ_t))‖₂² ≤ O(C/√T).

Theorem 2. Under the conditions of Theorem 1, with the gradient of loss bounded by ρ,

lim_{t→∞} ‖∇_{w_t} L^tr(w_t, θ_{t+1})‖₂² = 0.

D_c = D_clean
for t = 0, 1, 2, ..., T-1 do
    for each v ∈ D do h_v = f(x_v; w_t)
    for (i, j) ∈ {1, 2, ..., n}² do W_{i,j} = A_{i,j} / (d(h_i, h_j) + ε)
    for k = 0, 1, 2, ..., K-1 do
        Y^{(k+1)} = D^{-1/2} W D^{-1/2} Y^{(k)}
        y_j^{(k+1)} = y_j^{(0)} for every node j ∈ D_c
    end
    D_select = D_left = ∅
    for each node i ∈ D_train do
        if onehot(y_i^{(K)}) = y_i then D_select = {i} ∪ D_select
        else D_left = {i} ∪ D_left
    end
    D_c = D_c ∪ D_select
    w_t ← one-step optimization of w_t with the selected nodes D_select
    for each node j ∈ D_left do
        ŷ_j = f(x_j; w_t); l_1 = loss(ŷ_j, y_j); l_2 = loss(ŷ_j, ỹ_j)
        λ_j = g([l_1, l_2]; θ_t)
        ȳ_j = λ_j y_j + (1 - λ_j) ỹ_j, λ_j ∈ [0, 1]
        L^tr_j(w, θ) = loss(ŷ_j(w), ȳ_j(θ))
    end
    ŵ_t(θ_t) = w_t - (α/|D_left|) Σ_{(x_j,y_j)∈D_left} ∇_w L^tr_j(w, θ_t)|_{w_t}
    L^c(ŵ_t(θ_t)) = (1/|D_clean|) Σ_{(x_i,y_i)∈D_clean} loss(f(x_i; ŵ_t(θ_t)), y_i)
    θ_{t+1} = θ_t - β ∇_θ L^c(ŵ(θ))|_{θ_t}
    w_{t+1} = w_t - (α/|D_left|) Σ_{(x_j,y_j)∈D_left} ∇_w L^tr_j(w; θ_{t+1})|_{w_t}
end

4. EXPERIMENTS

4.1. DATASETS AND IMPLEMENTATION DETAILS

We validate our method on six benchmark datasets. Three are citation networks (Sen et al., 2008): Cora, Citeseer and Pubmed. The Coauthor-Phy dataset (Shchur et al., 2018) is also used in our experiments, with results reported in Appendix A due to limited space. A summary of these graph datasets is given in Table 1. The Clothing1M (Xiao et al., 2015) and Webvision (Li et al., 2017a) datasets are used to validate the effectiveness of our method under real-world label noise. Following previous work (Franceschi et al., 2019), we take a kNN graph (k = 5) as the graph structure so that GNNs can be applied to these two datasets. More details on our preprocessing of Clothing1M and Webvision are given in Appendix B. The experiments are conducted with two types of label noise, uniform noise and flip noise, following previous works (Zhang et al., 2016a; Shu et al., 2019). The former means the label of each sample is independently changed to a random class with probability p; the latter means the label is independently flipped to a similar class with total probability p. The ratio of training, validation, and test nodes is set to 4:4:2. Only about 25 nodes with clean labels from the validation set are provided as the clean set in each dataset, and we ensure that each class has the same number of samples; for example, we use 8 clean samples per class for Pubmed. GCN (Kipf & Welling, 2016) serves as the base classification model in our experiments and is trained with Adam (Kingma & Ba, 2014) with an initial learning rate of 0.01 and a weight decay of 5 × 10^-4, except that the weight decay is set to 0 on the Clothing1M and Coauthor-Phy datasets. We compare LPM with multiple baselines using the same network architecture. These baselines are representative, and some of them achieve state-of-the-art performance on image datasets.
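The two synthetic corruptions can be sketched as follows, under our own reading of the protocol: uniform noise re-draws a label uniformly over all c classes with probability p (so it may land on the true class), while flip noise moves it to one designated "similar" class with probability p; the class-pair mapping below is an arbitrary stand-in.

```python
import numpy as np

def uniform_noise(labels, p, num_classes, rng):
    """With probability p, replace the label with a uniformly random class."""
    noisy = labels.copy()
    for i in range(len(noisy)):
        if rng.random() < p:
            noisy[i] = rng.integers(num_classes)  # possibly the original class
    return noisy

def flip_noise(labels, p, flip_to, rng):
    """With probability p, flip the label to its designated 'similar' class."""
    noisy = labels.copy()
    for i in range(len(noisy)):
        if rng.random() < p:
            noisy[i] = flip_to[int(noisy[i])]
    return noisy

rng = np.random.default_rng(0)
y = np.array([0, 1, 2] * 100)                      # 300 balanced toy labels
y_uni = uniform_noise(y, p=0.4, num_classes=3, rng=rng)
y_flip = flip_noise(y, p=0.4, flip_to={0: 1, 1: 2, 2: 0}, rng=rng)
print((y_uni != y).mean(), (y_flip != y).mean())   # empirical corruption rates
```

Note the effective corruption rate of uniform noise is p(c-1)/c rather than p, since a re-draw can return the true label; flip noise corrupts at rate p exactly (in expectation).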

4.2. RESULTS

Table 2 shows the results on Cora and Citeseer with different levels of uniform noise ranging from 0% to 80%. Each experiment is repeated 5 times with different random seeds, and we report the best test accuracy across all epochs averaged over the 5 repetitions. As can be seen in Table 2, our method achieves the best performance across all datasets and noise rates, except for second place at a 0% uniform noise rate. Our method performs even better when the labels are corrupted at a high rate. Table 3 shows the performance on Cora, Citeseer and Pubmed with different levels of flip noise ranging from 0% to 40%. Our method also outperforms state-of-the-art methods under flip noise across noise rates, again except for second place at a 0% flip noise rate, and it outperforms the corresponding second-best method by a large margin when the noise rate is 0.4. As can be seen in Table 4, our method also performs better than the other baselines on datasets with real-world label noise. We additionally experiment with Graph Attention Networks (Velickovic et al., 2017) as the feature extractor and classifier; the results shown in Appendix A demonstrate that our method also performs well with other GNNs.

4.4. IMPACT OF FINETUNING AND NOISE RATE

We would like to investigate how our baselines perform without finetuning. As can be seen in Figure 5, the performance of the baselines degrades significantly without finetuning across noise rates. This illustrates that some baselines (without finetuning) designed for image datasets may perform relatively poorly on graph-structured data, which motivates our work of training GNNs robustly by exploiting the structure of graph data. Besides, we observe that our method drops only about 9% when the flip noise rate increases from 0% to 40%, whereas the baselines drop about 20%-30%, which shows that our method is more robust, especially at high noise rates. At 0% noise, our method only slightly underperforms re-weighting based methods. This is reasonable: when the original labels are all correct, our method will inevitably perturb a few clean labels, while re-weighting based methods will not.

4.5. SIZE OF THE CLEAN SET

We try to strike a balance and understand when finetuning is effective. As can be seen in Figure 6, our method performs well even when the clean set is extremely small, and the overall test accuracy does not grow much once the clean set is large enough. Besides, the test accuracy of baselines with finetuning increases significantly as the clean set grows larger. This suggests that finetuning becomes valid with larger clean sets, because GNNs can achieve good performance with relatively few samples (Kipf & Welling, 2016; Veličković et al., 2017). From this perspective, our method can also serve as a complement to finetuning based methods when the clean set is large enough.

5. CONCLUSION AND FUTURE WORK

In this work, we proposed a robust framework for GNNs against label noise. It is the first method specially designed for the label noise problem in classifying graph nodes with GNNs, and it outperforms state-of-the-art methods on graph-structured data, which may serve as a starting point for future research towards robust GNNs against label noise. As future work, we may design an inductive robust method; better methods that do not need a clean set are also among our goals.

Baselines are evaluated with finetuning, and the total number of epochs in all experiments is 300. In the Co-teaching+ experiments, the initial epoch is 270, the forget rate is 0.1 with 5 epochs of linear drop rate, and the exponent of the forget rate is 1. For MW-Net, the dimension of the meta net's middle layer is 100 and the learning rate is 5 × 10^-3. The q for the GCE loss is 0.1. The combination of Normalized Focal Loss and Mean Absolute Error is used in the APL experiments, with weights 0.1 and 10, respectively. For the JoCoR experiments, the number of epochs for the linear drop rate is 5 and the exponent of the forget rate is 2; the balance coefficient between the conventional supervised loss and the contrastive loss is 0.01. The learning rate and weight decay of the Graph Attention Network are 0.01 and 5 × 10^-4; the hidden dimension of GAT is 16 and the number of attention heads is 8; the alpha of the leaky ReLU is 0.2 and the dropout rate is 0.5. Throughout this work we implemented gradient-based meta-learning algorithms in PyTorch using the Higher library (Grefenstette et al., 2019).

C APPENDIX : CONVERGENCE OF LPM

Our proof of the convergence of LPM mainly follows previous works (Ren et al., 2018; Shu et al., 2019) that utilize meta-learning to re-weight noisy training samples. As illustrated in previous works (Zhou et al., 2004; Zhu et al., 2005), LP converges to a fixed point; hence D_select and D_left converge to fixed sets. In this proof, the final |D_left| and |D_clean| are denoted by n and m for easier exposition, and the loss function loss is denoted by l. We first rewrite the forward and backward equations as follows:

ŷ_j = f(x_j; w_t) = y_j(w)|_{w_t}    (17)
λ_j = g([l(y_j, ŷ_j), l(ỹ_j, ŷ_j)]; θ_t) = λ_j(θ; w_t)|_{θ_t}    (18)
L^tr(w_t; θ_t) = (1/n) Σ_{j=1}^n l(λ_j y_j + (1 - λ_j) ỹ_j, ŷ_j)    (19)
ŵ_t(θ_t) = w_t - α ∇_w L^tr(w; θ_t)|_{w_t}    (20)
ŷ_i = f(x_i; ŵ_t) = y_i(ŵ; x_i)|_{ŵ_t}    (21)
L^c(ŵ)|_{ŵ_t} = (1/m) Σ_{i=1}^m L^c_i(ŵ)|_{ŵ_t} = (1/m) Σ_{i=1}^m L^c_i(ŵ_t(θ))|_{θ_t} = (1/m) Σ_{i=1}^m l(y_i, ŷ_i)    (22)
θ_{t+1} = θ_t - β ∇_θ L^c(ŵ(θ))|_{θ_t}    (23)
w_{t+1} = w_t - α ∇_w L^tr(w; θ_{t+1})|_{w_t}    (24)

Here (x_j, y_j) is a node from the final left training set D_left and (x_i, y_i) is a node from the final clean set D_clean; f is the GCN classifier with weights w; g is the Aggregation Net with weights θ, whose inputs are the loss values; L^c is the loss on the clean set; L^tr is the final training loss; and l(y, ŷ) is a loss (such as cross entropy) satisfying the linearity l(λy_1 + (1-λ)y_2, ŷ) = λ l(y_1, ŷ) + (1-λ) l(y_2, ŷ).

Derivation of the equation updating the weights of the Aggregation Net:

(1/m) Σ_{i=1}^m ∇_θ L^c_i(ŵ(θ))|_{θ_t} = (1/m) Σ_{i=1}^m (∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t} Σ_{j=1}^n (∂ŵ_t(θ)/∂λ_j)|_{θ_t} (∂λ_j(θ; w_t)/∂θ)|_{θ_t}.    (25)
According to Equation (20),

ŵ_t(θ)|_{θ_t} = w_t - α ∇_{w_t} (1/n) Σ_{j=1}^n l(λ_j y_j + (1 - λ_j) ỹ_j, ŷ_j),

so that

(∂ŵ_t(θ)/∂λ_j)|_{θ_t} = -(α/n) ∇_{w_t} (∂ l(λ_j y_j + (1-λ_j) ỹ_j, ŷ_j)/∂λ_j)|_{θ_t}
= -(α/n) ∇_{w_t} (∂[λ_j l(y_j, ŷ_j) + (1-λ_j) l(ỹ_j, ŷ_j)]/∂λ_j)|_{θ_t}
= -(α/n) ∇_{w_t} (l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))|_{θ_t}
= -(α/n) (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}.

Therefore, Equation (25) can be written as

(1/m) Σ_{i=1}^m ∇_θ L^c_i(ŵ(θ))|_{θ_t}
= -(α/(mn)) Σ_{i=1}^m (∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t} Σ_{j=1}^n (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t} (∂λ_j(θ; w_t)/∂θ)|_{θ_t}
= -(α/n) Σ_{j=1}^n ((1/m) Σ_{i=1}^m (∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t}^T (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}) (∂λ_j(θ; w_t)/∂θ)|_{θ_t}
= -(α/n) Σ_{j=1}^n ((1/m) Σ_{i=1}^m G_{ij}) (∂λ_j(θ; w_t)/∂θ)|_{θ_t},

where G_{ij} = (∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t}^T (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}.

Lemma 1. Suppose the loss function l is L-Lipschitz smooth, λ(·) is differentiable with a δ-bounded gradient and twice differentiable with its Hessian bounded by B with respect to θ, and the loss function l(·, ·) has ρ-bounded gradients with respect to the parameter w. Then the gradient of θ with respect to L^c_i(ŵ) is Lipschitz continuous.

Proof. The supposition is equivalent to the following inequalities:

‖∇_ŵ L^c(ŵ)|_{w_1} - ∇_ŵ L^c(ŵ)|_{w_2}‖ ≤ L ‖w_1 - w_2‖, for any w_1, w_2;    (26)
‖∇_θ λ(θ; w_t)‖ ≤ δ;    (27)
‖∇²_θ λ(θ; w_t)‖ ≤ B;    (28)
‖∇_w l(y_i, ŷ_i(ŵ_t(w); x_i))‖ ≤ ρ.    (29)

The gradient of θ with respect to the loss on the clean set reads

∇_θ L^c_i(ŵ(θ))|_{θ_t} = -(α/n) Σ_{j=1}^n G_{ij} (∂λ_j(θ; w_t)/∂θ)|_{θ_t}.

Taking the gradient with respect to θ on both sides of the equation, we have

∇²_θ L^c_i(ŵ(θ))|_{θ_t} = -(α/n) Σ_{j=1}^n ((∂G_{ij}/∂θ)|_{θ_t} (∂λ_j(θ; w_t)/∂θ)|_{θ_t} + G_{ij} (∂²λ_j(θ; w_t)/∂θ²)|_{θ_t}).
For the first term in the summation, using (27),

‖(∂G_{ij}/∂θ)|_{θ_t} (∂λ_j(θ; w_t)/∂θ)|_{θ_t}‖
≤ δ ‖(∂/∂ŵ)((∂L^c_i(ŵ)/∂θ)|_{θ_t})^T|_{ŵ_t} (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}‖
= δ ‖(∂/∂ŵ)(-(α/n) Σ_{k=1}^n (∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t}^T (∂(l(y_k, ŷ_k) - l(ỹ_k, ŷ_k))/∂w_t)|_{w_t} (∂λ_k(θ; w_t)/∂θ)|_{θ_t})^T (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}‖
= δ ‖(-(α/n) Σ_{k=1}^n (∂²L^c_i(ŵ)/∂ŵ²)|_{ŵ_t}^T (∂(l(y_k, ŷ_k) - l(ỹ_k, ŷ_k))/∂w_t)|_{w_t} (∂λ_k(θ; w_t)/∂θ)|_{θ_t})^T (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}‖
≤ δα ‖(∂²L^c_i(ŵ)/∂ŵ²)|_{ŵ_t}‖ ‖(∂(l(y_k, ŷ_k) - l(ỹ_k, ŷ_k))/∂w_t)|_{w_t}‖ ‖(∂λ_k(θ; w_t)/∂θ)|_{θ_t}‖ ‖(∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t}‖
≤ 4αLρ²δ².

And for the second term,

‖G_{ij} (∂²λ_j(θ; w_t)/∂θ²)|_{θ_t}‖ = ‖(∂L^c_i(ŵ)/∂ŵ)|_{ŵ_t}^T (∂(l(y_j, ŷ_j) - l(ỹ_j, ŷ_j))/∂w_t)|_{w_t} (∂²λ_j(θ; w_t)/∂θ²)|_{θ_t}‖ ≤ 2Bρ².

Therefore,

‖∇²_θ L^c_i(ŵ(θ))|_{θ_t}‖ ≤ 4α²Lρ²δ² + 2αρ²B.

Let L_v = 4α²Lρ²δ² + 2αρ²B. Based on the Lagrange mean value theorem, we have

‖∇_θ L^c(ŵ_t(θ_1)) - ∇_θ L^c(ŵ_t(θ_2))‖ ≤ L_v ‖θ_1 - θ_2‖, for all θ_1, θ_2.

Theorem 1. Suppose the loss function l is L-Lipschitz smooth, and λ(·) is differentiable with a δ-bounded gradient and twice differentiable with its Hessian bounded by B with respect to θ. Let the learning rate α_t = min{1, k/T} for some k > 0 such that k/T < 1, and let β_t be a monotonically decreasing sequence, β_t = min{1/L, c/√T} for some c > 0 such that √T/c ≥ L, with Σ_{t=1}^∞ β_t ≤ ∞ and Σ_{t=1}^∞ β_t² ≤ ∞. Then the loss of the Aggregation Net achieves ‖∇_θ L^c(ŵ(θ_t))‖₂² ≤ ε in O(1/ε²) steps. More specifically,

min_{0≤t≤T} ‖∇_θ L^c(ŵ(θ_t))‖₂² ≤ O(C/√T).

Proof. The iteration for updating the parameter θ reads θ_{t+1} = θ_t - β ∇_θ L^c(ŵ_t(θ))|_{θ_t}. For two successive iterations, observe that

L^c(ŵ_{t+1}(θ_{t+1})) - L^c(ŵ_t(θ_t)) = [L^c(ŵ_{t+1}(θ_{t+1})) - L^c(ŵ_t(θ_{t+1}))] + [L^c(ŵ_t(θ_{t+1})) - L^c(ŵ_t(θ_t))].    (32)
For the first term, given that the loss function on the clean set is Lipschitz smooth, we have

L^c(ŵ_{t+1}(θ_{t+1})) - L^c(ŵ_t(θ_{t+1})) ≤ ⟨∇L^c(ŵ_t(θ_{t+1})), ŵ_{t+1}(θ_{t+1}) - ŵ_t(θ_{t+1})⟩ + (L/2) ‖ŵ_{t+1}(θ_{t+1}) - ŵ_t(θ_{t+1})‖₂².

According to Equations (20) and (23),

ŵ_{t+1}(θ_{t+1}) - ŵ_t(θ_{t+1}) = -(α_t/n) Σ_{j=1}^n [λ_j ∇_w l(y_j, ŷ_j) + (1 - λ_j) ∇_w l(ỹ_j, ŷ_j)]|_{w_{t+1}},

and thus, since the first gradient of the loss function is bounded by ρ,

L^c(ŵ_{t+1}(θ_{t+1})) - L^c(ŵ_t(θ_{t+1})) ≤ α_t ρ² + (L/2) α_t² ρ².

By the Lipschitz continuity of L^c(ŵ_t(θ)) from Lemma 1, it can be obtained that

L^c(ŵ_t(θ_{t+1})) - L^c(ŵ_t(θ_t)) ≤ ⟨∇_{θ_t} L^c(ŵ_t(θ_t)), θ_{t+1} - θ_t⟩ + (L/2) ‖θ_{t+1} - θ_t‖₂²
= ⟨∇_{θ_t} L^c(ŵ_t(θ_t)), -β_t ∇_{θ_t} L^c(ŵ_t(θ_t))⟩ + (L β_t²/2) ‖∇_{θ_t} L^c(ŵ_t(θ_t))‖₂²
= -(β_t - L β_t²/2) ‖∇_{θ_t} L^c(ŵ_t(θ_t))‖₂².

Therefore, Equation (32) satisfies

L^c(ŵ_{t+1}(θ_{t+1})) - L^c(ŵ_t(θ_t)) ≤ α_t ρ² + (L/2) α_t² ρ² - (β_t - L β_t²/2) ‖∇_{θ_t} L^c(ŵ_t(θ_t))‖₂²,

that is,

(β_t - L β_t²/2) ‖∇_{θ_t} L^c(ŵ_t(θ_t))‖₂² ≤ α_t ρ² + (L/2) α_t² ρ² - L^c(ŵ_{t+1}(θ_{t+1})) + L^c(ŵ_t(θ_t)).

Summing the above inequalities from 1 to T yields the claim.

Proof (of Theorem 2). It is obvious that α_t satisfies Σ_{t=0}^∞ α_t = ∞ and Σ_{t=0}^∞ α_t² ≤ ∞. From Eqs. (18), (19), (20) and the linearity of l, we rewrite the update of w as

w_{t+1} = w_t - α_t ∇L^tr(w_t; θ_{t+1}) = w_t - (α_t/n) Σ_{j=1}^n [λ_j(θ_{t+1}; w_t) ∇_{w_t} l(y_j, ŷ_j(w_t)) + (1 - λ_j(θ_{t+1}; w_t)) ∇_{w_t} l(ỹ_j, ŷ_j(w_t))].

First, consider the difference of the training loss between two iterations:

L^tr(w_{t+1}; θ_{t+2}) - L^tr(w_t; θ_{t+1}) = [L^tr(w_{t+1}; θ_{t+2}) - L^tr(w_{t+1}; θ_{t+1})] + [L^tr(w_{t+1}; θ_{t+1}) - L^tr(w_t; θ_{t+1})].    (33)



[1] https://www.mturk.com/



Figure 1: Illustration of label propagation in our method. The two classes of nodes are distinguished by two colours (blue and green). Nodes surrounded by a dotted line are training nodes D_train, whose labels may be incorrect; those surrounded by a solid line form the clean set D_clean. In Figure 1(b), one half of each training node shows the pseudo label predicted by LP and the other half shows the original label. Some nodes' (nodes 5 and 7) pseudo labels are the same as their original labels; we select them as D_select to train the GNNs and inject them into the clean set for better label propagation. Proper labels for the remaining nodes D_left (nodes 6, 8, 9, 10) are obtained via meta learning.

Figure 2: Computation graph of meta-learning based label aggregation.

Figure 3: Comparison of the rate of true-labeled samples in D_train and D_select on various datasets.

Figure 5: Test accuracy on Cora and Citeseer across various flip noise rates.

Figure 6: Test accuracy on Cora and Citeseer across various sizes of the clean set.

Figure A.4: Confusion matrices of the Basemodel and LPM on various datasets under 40% flip noise. Figures A.4(a)-(c) are the results of the Basemodel; Figures A.4(d)-(f) are the results of LPM.

Lemma 2. Let (a_n)_{n≥1}, (b_n)_{n≥1} be two non-negative real sequences such that the series Σ_{n=1}^∞ a_n diverges, the series Σ_{n=1}^∞ a_n b_n converges, and there exists K > 0 such that |b_{n+1} - b_n| ≤ K a_n. Then the sequence (b_n)_{n≥1} converges to 0.

Proof. See the proof of Lemma A.5 in [Stochastic majorization-minimization algorithms for ].

Theorem 2. Suppose the loss function l is L-Lipschitz smooth with ρ-bounded gradients with respect to the training data and the clean set, and λ(·) is differentiable with a δ-bounded gradient and twice differentiable with its Hessian bounded by B with respect to θ. Let the learning rate α_t = min{1, k/T} for some k > 0 such that k/T < 1, and let β_t be a monotonically decreasing sequence, β_t = min{1/L, c/√T} for some c > 0 such that √T/c ≥ L. Then

lim_{t→∞} ‖∇_{w_t} L^tr(w_t; θ_{t+1})‖₂² = 0.

Algorithm 1: LPM. Lines 2-12: label propagation; lines 13-22: label aggregation. Data: D, D_train, D_clean, max epochs T, LP iterations K per epoch, adjacency matrix A, feature matrix X, GNN feature extractor f, Aggregation Network g, expanding clean set for LP D_c. Result: robust GNN parameters w_T.

Table 1: Dataset statistics after removing self-loops and duplicate edges (Wang & Leskovec, 2020).

Table 2: Comparison with baselines in test accuracy (%) on Cora and Citeseer with uniform noise ranging from 0% to 80%. Mean accuracy (std) over 5 repetitions is reported. The best and second-best results are highlighted in bold and italic bold, respectively.

Table 3: Comparison with baselines in test accuracy (%) on Cora, Citeseer and Pubmed with flip noise ranging from 0% to 40%. Mean accuracy (std) over 5 repetitions is reported. The best and second-best results are highlighted in bold and italic bold, respectively.

Table 4: Comparison with baselines in test accuracy (%) on Clothing1M and Webvision. Mean accuracy (± std) over 5 repetitions is reported. The best is highlighted in bold.
Clothing1M: 35.83±0.03, 38.05±0.13, 53.5±0.08, 54.15±0.23, 56.9±0.08, 56.3±0.12, 57.35±0.11
Webvision: 32.43±0.05, 34.58±0.08, 50.12±0.16, 52.42±0.25, 53.45±0.13, 54.12±0.22, 55.43±0.17

The performance of LPM without label aggregation and LPM with random λ in Citeseer.

In conclusion, this proves that the algorithm can always achieve min_{0≤t≤T} ‖∇_θ L^c(ŵ(θ_t))‖₂² ≤ O(1/√T).

A APPENDIX : ADDITIONAL EXPERIMENT RESULTS

We also take Graph Attention Networks (GAT) as the feature extractor and classifier; the results shown in Table A.6 validate that our method also performs well with various GNNs. Besides, LPM also outperforms the other baselines on the larger graph dataset Coauthor-Phy, as shown in Table A.7. We additionally show confusion matrices of the Basemodel and LPM in Figure A.4, which visually demonstrate that our method improves the robustness of GNNs against label noise by a large margin.

B APPENDIX : ADDITIONAL DETAILS OF OUR EXPERIMENTS

The original Clothing1M and Webvision datasets are large-scale datasets with real-world label noise. We randomly choose 5000 images in 10 classes from each original dataset; every image serves as a node in the graph, and a kNN graph (k = 5) is treated as the graph structure so that GNNs can be applied. This setting is similar to previous works that also aim to apply GNNs to datasets without a graph structure. A ResNet-50 with ImageNet-pretrained weights is used to extract feature vectors for all images.

(Continuation of the proof of Theorem 2.) For the first term in Eq. (33), by the L-Lipschitz smoothness of l and the ρ-bounded and δ-bounded gradients of l and λ with respect to the training and clean sets,

L^tr(w_{t+1}; θ_{t+2}) - L^tr(w_{t+1}; θ_{t+1}) = (1/n) Σ_{j=1}^n [(λ_j(θ_{t+2}; w_{t+1}) - λ_j(θ_{t+1}; w_{t+1})) l(y_j, ŷ_j(w_{t+1})) + (λ_j(θ_{t+1}; w_{t+1}) - λ_j(θ_{t+2}; w_{t+1})) l(ỹ_j, ŷ_j(w_{t+1}))],

which is bounded by a quantity proportional to β_{t+1}. A similar bound holds for the second term in Eq. (33). Summing the resulting inequalities on both sides from t = 1 to ∞ and rearranging terms yields that the series Σ_{t=1}^∞ α_t ‖∇_{w_t} L^tr(w_t; θ_{t+1})‖₂² converges (Eq. (34)); the inequality next to last holds since our loss function is bounded by M, and the last one holds since Σ_{t=1}^∞ α_t² and Σ_{t=1}^∞ β_t² are finite. Since Σ_{t=0}^∞ α_t = ∞ and there exists K = C > 0 such that |‖∇L^tr(w_{t+1}; θ_{t+2})‖₂² - ‖∇L^tr(w_t; θ_{t+1})‖₂²| ≤ C α_t, by Lemma 2 we can conclude that lim_{t→∞} ‖∇_{w_t} L^tr(w_t; θ_{t+1})‖₂² = 0, which indicates that the gradient of the training loss of our algorithm finally reaches zero, and thus the iteration of w makes the training loss converge.

