HOW POWERFUL IS IMPLICIT DENOISING IN GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs), which aggregate features from neighbors, are widely used for processing graph-structured data due to their powerful representation learning capabilities. It is generally believed that GNNs can implicitly remove non-predictive noise. However, the analysis of the implicit denoising effect in graph neural networks remains open. In this work, we conduct a comprehensive theoretical study and analyze when and why implicit denoising happens in GNNs. Specifically, we study the convergence properties of the noise matrix. Our theoretical analysis suggests that implicit denoising largely depends on the connectivity and size of the graph, as well as the GNN architecture. Moreover, we formally define and propose the adversarial graph signal denoising (AGSD) problem by extending the graph signal denoising problem. By solving such a problem, we derive a robust graph convolution, where the smoothness of the node representations and the implicit denoising effect can be enhanced. Extensive empirical evaluations verify our theoretical analyses and the effectiveness of our proposed model.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017) have been widely used in graph learning and have achieved remarkable performance on graph-based tasks, such as traffic prediction (Guo et al., 2019), drug discovery (Dai et al., 2019), and recommendation systems (Ying et al., 2018). A general principle behind GNNs is to perform a message passing operation that aggregates node features over neighborhoods, such that the smoothness of the learned node representations on the graph is enhanced. By promoting graph smoothness, the message passing and aggregation mechanism naturally leads to GNN models whose predictions depend not only on the feature of one specific node, but also on the features of a set of neighboring nodes. Therefore, this mechanism can, to a certain extent, protect GNN models from noise: real-world graphs are usually noisy, e.g., Gaussian white noise exists on node features (Zhou et al., 2021); however, the influence of feature noise on the model's output can be counteracted by the feature aggregation operation in GNNs. We term this effect implicit denoising. While many works have empirically explored GNNs, relatively few advances have been made in theoretically studying this denoising effect. Early GNN models, such as the vanilla GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), and GraphSAGE (Hamilton et al., 2017), propose different designs of aggregation functions, but the denoising effect is not discussed in these works.
Some recent attempts (Ma et al., 2021b) have been made to mathematically establish the connection between a variety of GNNs and the graph signal denoising (GSD) problem (Chen et al., 2014):

q(F) = min_F ∥F − X∥²_F + λ tr(F⊤ L̃ F),   (1)

where X = X* + η is the observed noisy feature matrix, η ∈ R^{n×d} is the noise matrix, X* is the clean feature matrix, and L̃ is the graph Laplacian. The second term encourages the smoothness of the filtered feature matrix F over the graph, i.e., nearby vertices should have similar features. By regarding the feature aggregation process in GNNs as solving a GSD problem, more advanced GNNs have been proposed, such as GLP (Li et al., 2019), S²GC (Zhu & Koniusz, 2021), and IRLS (Yang et al., 2021). Despite these prior attempts, little effort has been made to rigorously study the denoising effect of the message passing and aggregation operation. This urges us to think about a fundamental but not clearly answered question: why and when does implicit denoising happen in GNNs? In this work, we focus on the non-predictive stochasticity of noise in GNNs' aggregated features and analyze its properties. We prove that as the graph size and the graph connectivity factor increase, this stochasticity tends to diminish, which we call the "denoising effect". We address this question using tools from concentration inequalities and matrix theory, which concern the convergence of the noise matrix. This offers a new framework to study the properties of graphs and GNNs in terms of the denoising effect. To facilitate our theoretical analysis, we derive Neumann Graph Convolution (NGC) from GSD.
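As a concrete illustration of the GSD objective above, here is a minimal numpy sketch (our own illustration, not the paper's code; the path graph, λ, and noise scale are arbitrary choices). The closed-form minimizer is F = (I + λL)⁻¹X, which trades a small reconstruction error for a smoother signal:

```python
import numpy as np

def gsd_closed_form(X, L, lam):
    # Minimizer of ||F - X||_F^2 + lam * tr(F^T L F) is F = (I + lam*L)^{-1} X.
    return np.linalg.solve(np.eye(L.shape[0]) + lam * L, X)

# Toy 4-node path graph with an unnormalized Laplacian L = D - A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
X_clean = np.ones((4, 1))                        # a perfectly smooth signal
X = X_clean + 0.3 * rng.standard_normal((4, 1))  # observed noisy signal
F = gsd_closed_form(X, L, lam=1.0)

smoothness = lambda M: float(np.trace(M.T @ L @ M))
s_noisy, s_filtered = smoothness(X), smoothness(F)
```

Since the minimizer's objective value is at most the value attained at F = X, the filtered signal is never less smooth than the observation.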
Specifically, to study the convergence rate, we introduce an insightful measurement on the convolution operator, termed the high-order graph connectivity factor, which reveals how uniformly the nodes are distributed in the neighborhood and reflects how much the information on a single neighboring node is diluted during the feature aggregation step. Intuitively, as the general Hoeffding inequality (Hoeffding, 1994) (Lemma D.1) suggests, higher high-order graph connectivity, i.e., nodes more uniformly distributed in the neighborhood, accelerates the convergence of the noise matrix, and a larger graph size leads to faster convergence. Besides, GNN architectures also affect the convergence rate: deeper GNNs can have a faster convergence rate. To further strengthen the denoising effect, inspired by adversarial training (Madry et al., 2018), we propose the adversarial graph signal denoising (AGSD) problem. By solving such a problem, we derive a robust graph convolution model based on the correlation between node features and graph structure to increase the high-order graph connectivity, which helps improve the denoising performance. Extensive experimental results on standard graph learning tasks verify our theoretical analyses and the effectiveness of our derived robust graph convolution model.

Notations. Let G = (V, E) represent an undirected graph, where V is the set of vertices {v₁, …, v_n} with |V| = n and E is the set of edges. The adjacency matrix is defined as A ∈ {0, 1}^{n×n}, and A_{i,j} = 1 if and only if (v_i, v_j) ∈ E. Let N_i = {v_j | A_{i,j} = 1} denote the neighborhood of node v_i and D denote the diagonal degree matrix, where D_{i,i} = Σ_{j=1}^n A_{i,j}. The feature matrix is denoted as X ∈ R^{n×d}, where each node v_i is associated with a d-dimensional feature vector X_i. Y ∈ {0, 1}^{n×c} denotes the label matrix, where Y_i ∈ {0, 1}^c is a one-hot vector and Σ_{j=1}^c Y_{i,j} = 1 for any v_i ∈ V.

2. A SIMPLE UNIFYING FRAMEWORK: NEUMANN GRAPH CONVOLUTION

A General Framework. In this section, we discuss a simple yet general framework for solving the graph signal denoising problem, namely Neumann Graph Convolution (NGC). Note that NGC is not a new GNN architecture; similar GNN architectures already exist, such as GLP (Li et al., 2019), S²GC (Zhu & Koniusz, 2021), and GaussianMRF (Jia & Benson, 2022). In this work we focus on the theoretical analysis of the denoising effect in GNNs, and NGC facilitates this analysis. Setting the derivative ∇q(F) = 2λL̃F + 2(F − X) to zero, we obtain the solution of the GSD optimization problem:

F = (I + λL̃)^{-1} X.   (2)

To avoid the expensive computation of the matrix inverse, we can use a Neumann series (Stewart, 1998) expansion to approximate Eq. (2) up to S-th order:

(I + λL̃)^{-1} = (1/(λ+1)) (I − (λ/(λ+1)) Ã)^{-1} ≈ (1/(λ+1)) Σ_{s=0}^{S} (λ/(λ+1))^s Ã^s,   (3)

where L̃ = I − Ã, Ã can take the form Ã = D^{-1/2} A D^{-1/2} or Ã = D^{-1} A, and the proof can be found in Appendix B. Based on the Neumann series expansion of the solution of GSD, we introduce a general graph convolution model, Neumann Graph Convolution, defined as:

H = Ã_S X W = (1/(λ+1)) Σ_{s=0}^{S} (λ/(λ+1))^s Ã^s X W,   (4)

where Ã_S = (1/(λ+1)) Σ_{s=0}^{S} (λ/(λ+1))^s Ã^s and W is the weight matrix. Our spectral convolution Ã_S X on graphs is a multi-scale graph convolution (Abu-El-Haija et al., 2020; Liao et al., 2019), which covers single-scale graph convolution models such as SGC (Wu et al., 2019a), since the graph convolution of SGC is Ã²X and Ã² is the third term in the expansion of Ã_S. Specifically, N-GCN (Abu-El-Haija et al., 2020) extends the feature aggregation module in GCN by concatenating neighboring feature information using Ã^k at each layer, while NGC incorporates such multi-scale information by summing up the (λ/(λ+1))^s Ã^s terms, as can be seen from Eq. (4). Besides, if we remove the non-linear functions in GCN (Kipf & Welling, 2017), it is also covered by our model.
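The truncated Neumann expansion in Eq. (3) can be checked numerically. Below is a hedged sketch (our own illustration, with an arbitrary toy graph and λ; we use Ã = D⁻¹A so that L̃ = I − Ã): the truncated sum approaches the exact solution (I + λL̃)⁻¹X as S grows.

```python
import numpy as np

def ngc_filter(X, A_hat, lam, S):
    # (1/(lam+1)) * sum_{s=0}^{S} (lam/(lam+1))^s * A_hat^s X, computed iteratively.
    c = lam / (lam + 1.0)
    term = X.copy()   # s = 0 term: A_hat^0 X
    acc = X.copy()
    for _ in range(S):
        term = c * (A_hat @ term)   # fold the factor c into the power iteration
        acc += term
    return acc / (lam + 1.0)

# Toy 4-node path graph; row-normalized adjacency A_hat = D^{-1} A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)
L_tilde = np.eye(4) - A_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
lam = 1.0
F_exact = np.linalg.solve(np.eye(4) + lam * L_tilde, X)

err_s2 = np.linalg.norm(ngc_filter(X, A_hat, lam, S=2) - F_exact)
err_s50 = np.linalg.norm(ngc_filter(X, A_hat, lam, S=50) - F_exact)
```

Because the series ratio is λ/(λ+1) < 1, the truncation error decays geometrically in S.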
Therefore, we can conclude that our proposed NGC is a general framework.

High-order Graph Connectivity Factor. Based on NGC, we obtain the filtered graph signal via F = Ã_S X. Intuitively, Ã_S captures not only the connectivity of the graph structure (represented by Ã), but also the higher-order connectivity (represented by Ã², Ã³, …, Ã^S). As will be discussed in Sec. 3, larger high-order graph connectivity can accelerate the convergence of the noise feature matrix. To formally quantify the high-order graph connectivity, we give the following definition:

Definition 1 (High-order Graph Connectivity Factor). We define the high-order graph connectivity factor τ as τ = max_i τ_i, where

τ_i = n Σ_{j=1}^{n} ([Ã_S]_{ij})² / (1 − (λ/(λ+1))^{S+1})².   (5)

Remark 1. Here we give some intuition about why Eq. (5) represents high-order graph connectivity. Note that each element of Ã_S is non-negative and each row sum satisfies

Σ_{j=1}^{n} [Ã_S]_{ij} = 1 − (λ/(λ+1))^{S+1}.   (6)

Based on Eq. (6), the sum of squares of the elements in each row satisfies:

(1 − (λ/(λ+1))^{S+1})² / n ≤ Σ_{j=1}^{n} ([Ã_S]_{ij})² ≤ (1 − (λ/(λ+1))^{S+1})².   (7)

When the high-order graph has high connectivity, i.e., the elements in row i of Ã_S are more uniformly distributed, Eq. (7) reaches its lower bound. Meanwhile, if the graph is not connected and there is only one element larger than 0 in row i, Eq. (7) reaches its upper bound. Therefore, the value of τ ∈ [1, n] is determined as follows: when the high-order graph connectivity is high, τ → 1, and when the graph is less connected, τ → n.
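To make Definition 1 concrete, here is a small sketch (our own illustration; the graphs, λ = 1, and S = 4 are arbitrary choices) computing τ for a complete graph and a star graph of the same size. The complete graph, whose rows of Ã_S are nearly uniform, should give the smaller τ:

```python
import numpy as np

def A_S_matrix(A_hat, lam, S):
    # A_S = (1/(lam+1)) * sum_{s=0}^{S} (lam/(lam+1))^s A_hat^s
    c = lam / (lam + 1.0)
    P = np.eye(A_hat.shape[0])
    acc = P.copy()
    for _ in range(S):
        P = c * (A_hat @ P)
        acc += P
    return acc / (lam + 1.0)

def connectivity_factor(A, lam=1.0, S=4):
    # tau = max_i n * sum_j (A_S)_{ij}^2 / (1 - (lam/(lam+1))^{S+1})^2  (Definition 1)
    n = A.shape[0]
    A_hat = A / A.sum(axis=1, keepdims=True)   # row-normalized D^{-1} A
    A_S = A_S_matrix(A_hat, lam, S)
    c = lam / (lam + 1.0)
    tau_i = n * (A_S ** 2).sum(axis=1) / (1.0 - c ** (S + 1)) ** 2
    return float(tau_i.max())

n = 6
complete = np.ones((n, n)) - np.eye(n)         # complete graph (G3-style)
star = np.zeros((n, n))                        # star graph with center node 0 (G2-style)
star[0, 1:] = star[1:, 0] = 1.0

tau_complete = connectivity_factor(complete)
tau_star = connectivity_factor(star)
```

The star's center node concentrates the row mass of Ã_S, so its τ exceeds that of the complete graph, in line with Remark 1.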

3. MAIN THEORY

In this section, we analyze the denoising effect of NGC. Before we present our main theory, we first describe our aggregation on the noisy feature matrix and formulate four assumptions, which are necessary to construct our theory. For the convenience of theoretical analysis, we adopt the MSE loss, since it gives an easier form of gradient; the analysis can be extended to other losses satisfying certain conditions. Considering Ã_S as our aggregation scheme, NGC training based on Eq. (4) can be formulated as

min_W f(W) = ∥Ã_S X W − Y∥²_F = ∥Ã_S (X* + η) W − Y∥²_F,   (8)

where X* is the clean feature matrix, η denotes the noise added to X*, and X = X* + η is the observed data matrix. Intuitively, if Ã_S η is small enough, the added noise will not change the optimization direction along which the parameters would be updated under the clean feature matrix X*. Before we present our main theory, we give four assumptions about the noise η, the operator Ã_S, and the parameters W. (Note that the row-sum result used below is obtained with Ã = D^{-1}A for ease of theoretical analysis, while in experiments we adopt the more commonly used Ã = D^{-1/2} A D^{-1/2}; the proof can be found in Appendix C.)

Assumption 1. Each entry of the noise matrix η, i.e., [η]_{ij}, is an i.i.d. sub-Gaussian random variable with parameter σ and mean μ = 0, i.e.,

E[e^{λ([η]_{ij} − μ)}] ≤ e^{σ²λ²/2} for all λ ∈ R.   (9)

Note that it is common to assume that the noise follows a Gaussian distribution (Zhou et al., 2021; Chen et al., 2021; Zhang et al., 2022), which is covered by our sub-Gaussian assumption.

Assumption 2. The high-order graph connectivity factor τ is o(n), i.e., lim_{n→∞} τ/n = 0.

As we have discussed in Sec. 2, τ depends on the graph structure. In a well-connected graph, τ is usually relatively small compared with n. Only if all nodes of a graph are isolated does τ reach its upper bound n.

Assumption 3. The Frobenius norm of the parameter matrix W is bounded by a constant.
There exists C > 0 such that ∥W∥_F ≤ C, where C is unrelated to n. This assumption is reasonable, since recent advances on the Neural Tangent Kernel (Jacot et al., 2018) indicate that over-parameterized network weights lie in the neighborhood of the small random initialization, which justifies Assumption 3.

Assumption 4. The loss function in Eq. (8) is L-smooth:

∥∇f(W₁) − ∇f(W₂)∥₂ ≤ L ∥W₁ − W₂∥₂ for all W₁, W₂ ∈ R^{d×c}.   (10)

The L-smoothness of f depends on the largest singular value of Ã_S X. For conciseness, we start with the smooth case. As the core part of our proof, we first derive an upper bound on the Frobenius norm of Ã_S η.

Lemma 1. Suppose we choose t = 2τ (1 − (λ/(λ+1))^{S+1})² σ² (4 log n + log 2d)/n. Then under Assumptions 1 and 2, with high probability 1 − 1/d, we have

∥Ã_S η∥²_F ≤ 2τ (1 − (λ/(λ+1))^{S+1})² σ² (4 log n + log 2d) / n,   (11)

where the proof can be found in Appendix D. Lemma 1 implies that the norm of the aggregated noise matrix Ã_S η is controlled by three quantities: the number of nodes n, the expansion order S, and the high-order graph connectivity factor τ. Intuitively, as concentration bounds suggest, if we draw enough samples of the same sub-Gaussian variable, their average converges to zero with high probability. This requires the graph to be large enough and the sum of squares of the elements in each row of Ã_S to be small enough, which depends on the graph structure. Now we present our main theorem for graph denoising. To demonstrate the effect of graph denoising, we further consider another loss function g(·) with the clean feature matrix:

g(W) = ∥Ã_S X* W − Y∥²_F.   (12)

Let W*_g = arg min_W g(W) be the minimizer of the clean loss g. We aim to demonstrate that the model learned by gradient descent on the noisy data X has essentially the same performance as W*_g, the optimal solution for the clean loss g.

Theorem 1.
Under Assumptions 1, 2, 3, 4 and Lemma 1, let W_f^{(k)} denote the k-th step gradient descent solution of min_W f(W) with step size α ≤ 1/L. Then with probability 1 − 1/d, we have

g(W_f^{(k)}) − g(W*_g) ≤ O(1/(2kα)) + O(τ log n / n),   (13)

where W*_g = arg min_W g(W) is the optimal solution of the clean loss function g(W), τ is the high-order graph connectivity factor, and n is the number of nodes of the graph. The proof of Theorem 1 can be found in Appendix E.

Remark 2 (Denoising Effect). Theorem 1 suggests that the k-th step gradient descent solution W_f^{(k)}, which is trained on the noisy feature matrix X, enjoys performance similar to the actual clean loss minimizer W*_g for large enough k and n. This implies the denoising effect of our proposed solution in Eq. (4).

Remark 3 (Effect of graph structure on denoising). Note that the second term in Eq. (13) suggests that the denoising bound is linear with respect to τ, which is directly related to the graph structure of the dataset. Specifically, as shown in the case study below, the graph structure affects the value of τ, and thus the denoising effect. A large, well-connected graph tends to have better denoising performance, since n is large and τ is close to 1.

Case Study: the Influence of Graph Structure on Implicit Denoising. In this case study, we give four illustrative examples in Figure 1. G₁, G₂, and G₃ have the same number of nodes, but the nodes of G₁ are isolated. G₂ has only one connected component and has a center node v₁. G₃ is a complete graph, so there is an edge between any two nodes. In addition, we give a larger illustrative graph G₄ to understand the influence of graph size. From Figure 1, we can extract the following insights: 1) There is no denoising effect on G₁ (the value of τ log n / n is quite large), since its nodes are isolated.
2) The complete graph G₃ has the best denoising effect among the graphs of the same size, since the values of the elements in each row are distributed uniformly, leading to the lower bound of τ. 3) Although G₂ has only one connected component, there is a center node v₁ on the graph. The existence of the center node makes the values of the elements in each row imbalanced, which means that τ tends to be larger compared with G₃. 4) A decentralized graph like G₄ can also achieve a smaller τ. 5) In terms of graph size, a larger graph has a better denoising effect. Combining Theorem 1 and this case study, we conclude that the denoising effect is influenced by the graph size n and the high-order graph connectivity factor τ, which reflects the graph structure.
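The size effect in the case study can be probed numerically. The sketch below (our own illustration; complete graphs, λ = 9, S = 32, and the dimensions are arbitrary choices) measures how much of the noise energy survives the filter Ã_S on complete graphs of two sizes; the larger, better-connected graph should retain less:

```python
import numpy as np

def A_S_matrix(A_hat, lam, S):
    # A_S = (1/(lam+1)) * sum_{s=0}^{S} (lam/(lam+1))^s A_hat^s
    c = lam / (lam + 1.0)
    P = np.eye(A_hat.shape[0])
    acc = P.copy()
    for _ in range(S):
        P = c * (A_hat @ P)
        acc += P
    return acc / (lam + 1.0)

def surviving_noise_ratio(n, lam=9.0, S=32, d=32, seed=0):
    # Fraction of Gaussian noise energy left after filtering with A_S on K_n.
    A = np.ones((n, n)) - np.eye(n)            # complete graph K_n
    A_hat = A / A.sum(axis=1, keepdims=True)
    A_S = A_S_matrix(A_hat, lam, S)
    eta = np.random.default_rng(seed).standard_normal((n, d))
    return np.linalg.norm(A_S @ eta) ** 2 / np.linalg.norm(eta) ** 2

r_small = surviving_noise_ratio(20)
r_large = surviving_noise_ratio(200)
```

On complete graphs the rows of Ã_S are nearly uniform, so the surviving noise energy per row scales roughly like τ/n and shrinks as the graph grows.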

4. ROBUST NEUMANN GRAPH CONVOLUTION

In this section, we propose a new graph signal denoising problem, the adversarial graph signal denoising (AGSD) problem, and improve the denoising performance by deriving a robust graph convolution model from it.

4.1. ADVERSARIAL GRAPH SIGNAL DENOISING PROBLEM

Note that the second term in the GSD problem (Eq. (1)), which controls the smoothness of the feature matrix over the graph, is related to both the graph Laplacian and the node features. Therefore, slight changes in the graph Laplacian could lead to an unstable denoising effect. Inspired by recent studies in adversarial training (Madry et al., 2018), we formulate the adversarial graph signal denoising problem as a min-max optimization problem:

min_F ∥F − X∥²_F + λ · max_{L′} tr(F⊤ L′ F)   s.t. ∥L′ − L̃∥_F ≤ ε.   (14)

Intuitively, the inner maximization over the Laplacian L′ generates perturbations on the graph structure and enlarges the distance between the node representations of connected neighbors. This maximization finds the worst-case perturbations of the graph Laplacian that hinder the global smoothness of F over the graph. Therefore, by training on those worst-case Laplacian perturbations, one can obtain a robust graph signal denoising solution. Ideally, by solving Eq. (14), the smoothness of the node representations as well as the implicit denoising effect can be enhanced.
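As a sanity check on the inner problem (a hedged numerical sketch, not the paper's code; the sizes and ε are arbitrary), the maximizer of ⟨δ, FF⊤⟩ over ∥δ∥_F ≤ ε is δ = εFF⊤/∥FF⊤∥_F by Cauchy-Schwarz, and no random feasible δ should beat it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 8, 3, 0.5
F = rng.standard_normal((n, d))
G = F @ F.T                                   # gradient of <delta, FF^T> w.r.t. delta
delta_star = eps * G / np.linalg.norm(G)      # closed-form inner maximizer
best = float(np.sum(delta_star * G))          # objective value <delta*, FF^T>

none_better = True
for _ in range(500):
    delta = rng.standard_normal((n, n))
    delta = eps * delta / np.linalg.norm(delta)   # random point on the feasibility boundary
    none_better &= float(np.sum(delta * G)) <= best + 1e-9
```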

4.2. MINIMIZATION OF THE OPTIMIZATION PROBLEM

The min-max formulation in Eq. (14) also makes the adversarial graph signal denoising problem harder to solve. Fortunately, unlike adversarial training (Madry et al., 2017), where we first need PGD to solve the inner maximization problem before solving the outer minimization problem, here the inner maximization problem is simple and has a closed-form solution. In other words, we do not need to add random perturbations to the graph structure at each training epoch, and we can directly find the perturbation that maximizes the inner adversarial loss. Denote the perturbation as δ, with L′ = L̃ + δ. Directly solving the inner maximization problem (see Appendix A), we get

δ = ε ∇h(δ) / ∥∇h(δ)∥_F = ε FF⊤ / ∥FF⊤∥_F.

Plugging this solution into Eq. (14), we can rewrite the outer optimization problem as follows:

ρ(F) = min_F ∥F − X∥²_F + λ tr(F⊤ L̃ F) + λε tr(F⊤ (FF⊤/∥FF⊤∥_F) F).   (15)

Setting the gradient of ρ(F) to zero, we get the solution of the outer optimization problem:

F = (I + λ L̃ + λε FF⊤/∥FF⊤∥_F)^{-1} X.   (16)

Since both sides of Eq. (16) contain F, directly computing the solution is difficult. Noting that Eq. (14) also requires F to be close to X, we can approximate Eq. (16) by replacing F with X in the inverse matrix on the right-hand side. With the Neumann series expansion of the inverse matrix, we get the final approximate solution:

H ≈ (1/(λ+1)) Σ_{s=0}^{S} (λ/(λ+1))^s (Ã − ε XX⊤/∥XX⊤∥_F)^s X W.   (17)

The difference between Eq. (17) and Eq. (4) is the extra term in Eq. (17) derived from solving the inner optimization problem of Eq. (14). Based on this, we propose our robust Neumann graph convolution (RNGC).

Scalability. Although RNGC introduces an extra computational burden on large graphs due to the XX⊤ term, if the feature matrix is sparse, the extra computational effort is minimal, as the XX⊤ term can also be sparse.
For scalability of RNGC on large graphs with dense feature matrices, we only compute the inner products of feature vectors (X_i, X_j) for j ∈ N_i between adjacent neighbors, like the masked attention in GAT. Compared with NGC, the additional computation cost is O(|E|).
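A minimal sketch of the RNGC propagation in Eq. (17) (our own illustration; the toy graph, λ, ε, and S are arbitrary, and the weight matrix W is omitted): the only change from NGC is that the propagation operator becomes Ã − εXX⊤/∥XX⊤∥_F.

```python
import numpy as np

def rngc_propagate(X, A_hat, lam, eps, S):
    # H ~= (1/(lam+1)) * sum_{s=0}^{S} (lam/(lam+1))^s (A_hat - eps*XX^T/||XX^T||_F)^s X
    M = X @ X.T
    B = A_hat - eps * M / np.linalg.norm(M)    # perturbed propagation operator
    c = lam / (lam + 1.0)
    term = X.copy()
    acc = X.copy()
    for _ in range(S):
        term = c * (B @ term)
        acc += term
    return acc / (lam + 1.0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))

H_ngc = rngc_propagate(X, A_hat, lam=1.0, eps=0.0, S=8)   # eps = 0 recovers plain NGC
H_rngc = rngc_propagate(X, A_hat, lam=1.0, eps=0.1, S=8)
```

Setting ε = 0 recovers the NGC filter of Eq. (4), which makes the extra term easy to ablate.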

5. EXPERIMENTS

In this section, we conduct a comprehensive empirical study to understand the influence of different factors on the denoising effect of various models. To quantify the denoising effect, we test model accuracy on noisy data for various GNN architectures and an MLP on standard node classification tasks, where the noisy data is synthesized by mixing Gaussian noise into the original feature matrix. We also synthesize noisy data by flipping individual features with a small Bernoulli probability on three citation datasets with binary features.
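The two corruption schemes can be sketched as follows (our own illustration of the described setup, not the paper's experiment code; the ξ and p values are placeholders):

```python
import numpy as np

def add_gaussian_noise(X, xi, rng):
    # Mix standard Gaussian noise of level xi into the feature matrix.
    return X + xi * rng.standard_normal(X.shape)

def flip_binary_features(X, p, rng):
    # Flip each binary feature independently with Bernoulli probability p.
    mask = rng.random(X.shape) < p
    return np.where(mask, 1.0 - X, X)

rng = np.random.default_rng(0)
X_bin = (rng.random((1000, 50)) < 0.3).astype(float)   # synthetic binary features
X_noisy = add_gaussian_noise(X_bin, xi=0.5, rng=rng)
X_flip = flip_binary_features(X_bin, p=0.4, rng=rng)
flip_rate = float((X_flip != X_bin).mean())
```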

5.1. DENOISING EFFECTIVENESS COMPARISON OF VARIOUS GNN MODELS

In this section, we compare the denoising effectiveness of different GNN models through their test accuracy when trained on a feature matrix corrupted with Gaussian noise.

Datasets. In our experiments, we utilize three public citation network datasets, Cora, Citeseer, and Pubmed (Sen et al., 2008), which are homophilous graphs, for semi-supervised node classification. For the semi-supervised setup, we follow the standard fixed splits employed in (Yang et al., 2016), with 20 nodes per class for training, 500 nodes for validation, and 1,000 nodes for testing. We also use four datasets, Cornell, Texas, Wisconsin, and Actor, which are heterophilous graphs, for fully supervised node classification. For each of these datasets, we randomly split nodes into 60%, 20%, and 20% for training, validation, and testing, as suggested in (Pei et al., 2020). Moreover, we utilize three large-scale graph datasets for evaluation: Coauthor-CS, Coauthor-Phy (Shchur et al., 2018), and ogbn-products (Hu et al., 2020). For the Coauthor datasets, we split nodes into 60%, 20%, and 20% for training, validation, and testing. For the ogbn-products dataset, we follow the dataset split in OGB (Hu et al., 2020).

Baselines. For the baselines, we consider graph neural networks derived from graph signal denoising, including GLP (Li et al., 2019), S²GC (Zhu & Koniusz, 2021), and IRLS (Yang et al., 2021); popular GNN architectures, such as GCN (Kipf & Welling, 2017) and GAT (Veličković et al., 2018); and an MLP, which has no aggregation operation.

Experimental Setup and Implementations. We assume that the original feature matrix is clean and does not contain noise; we synthesize noise from the standard Gaussian distribution and add it to the original feature matrix. By default, we apply row normalization to the data after adding the Gaussian noise, and train all models on these noisy feature matrices.
For the hyper-parameters of each model, we follow the settings reported in their original papers. To eliminate the effect of randomness, we repeat each experiment 100 (or 10) times and report the mean accuracy. Note that in each repeated run we add a different Gaussian noise sample, while within the same run we apply the same noisy feature matrix for training all models. For our NGC and RNGC models, the hyper-parameter details can be found in Appendix H.2.

Results on Supervised Node Classification. Figure 2 compares classification accuracy across noise levels ξ for the semi-supervised node classification tasks. We can observe that the test accuracy of MLP is close to random guessing (RG) when the noise level is relatively large. This implies the weak denoising effect of MLP models. For shallow GNN models, such as GCN and GAT (which usually contain 2 layers), the denoising performance is limited, especially on Pubmed, since they do not aggregate information (features and noise) from higher-order neighbors. For models with deep layers, such as IRLS (≥ 8 layers), the denoising performance is much better compared to shallow models. Lastly, our NGC and RNGC models with 16 layers (S = 16) achieve significantly better denoising performance compared with the other baseline methods, which backs up our theoretical analyses. In most cases, NGC and RNGC achieve very similar denoising performance, but in general RNGC still slightly outperforms NGC, suggesting that we indeed gain more benefits by solving the adversarial graph signal denoising problem. On heterophilous graphs, MLPs achieve better performance in Table 1 than GNNs such as GCN, since feature aggregation may introduce inaccurate information.
For our RNGC, note that we keep the zeroth-order term (no aggregation, like an MLP) with the largest weight, and thus preserve the original feature information while also considering high-order terms for better denoising performance. Thus, in the context of the denoising problem, the design of RNGC takes advantage of both zeroth-order and high-order information to achieve better denoising performance. For ogbn-products, we only choose MLP, GCN, and S²GC as baselines, since the results are sensitive to model size and to the various tricks from the OGB leaderboard. For a fair comparison, the parameter counts of these baselines and RNGC are the same, and we use full-batch training for both the baselines and our model. Tables 2 and 3 report classification accuracy across noise levels for fully supervised node classification on large-scale graphs. The first- and second-highest accuracies are highlighted in bold. For these datasets, we test ξ ∈ {0.1, 1}. Compared with the small datasets above, the node degrees in these three datasets are larger, which means they have better connectivity. From Tables 2 and 3, we can observe that the test accuracy of MLP is far lower than that of GCN and RNGC, implying the weak denoising effect of MLP. The test accuracy of GCN is only slightly lower than RNGC on these datasets: since they are well-connected and have a large graph size, a good denoising performance can already be achieved with shallow GNN models. For the scalability of RNGC on large graphs such as ogbn-products, we use the acceleration method mentioned in Sec. 4.2.

5.2. DENOISING PERFORMANCE ON FEATURE FLIPPING PERTURBATION

In this section, we compare the denoising effectiveness of different models through their test accuracy when trained on a noisy feature matrix obtained by flipping individual features with a small Bernoulli probability on three citation datasets.

Setting and Results. We flip individual features on three citation datasets, Cora, Citeseer, and Pubmed, as the noise, and compare the denoising performance of RNGC with MLP and GCN. From Table 4 (the standard deviations can be found in Appendix I.2), we can observe that the denoising performance of RNGC is much better than the baselines when the flip probability is 0.4. In fact, the perturbations added by flipping individual features approximately follow a Bernoulli distribution, which is also sub-Gaussian. These results further verify our theoretical analysis.

5.3. DEFENSE PERFORMANCE OF RNGC AGAINST GRAPH STRUCTURE ATTACK

Note that we do not perform actual graph structure perturbations as in graph adversarial attacks (Zügner et al., 2018; Zügner & Günnemann, 2019a), but rather a virtual perturbation of the Laplacian. Therefore, it is not clear how perturbations of the Laplacian correspond to actual perturbations of the graph structure. Nevertheless, we still evaluate RNGC against the graph structure meta-attack with a perturbation rate of 25%. As shown in Table 5 (the standard deviations can be found in Appendix I.2), our RNGC model still outperforms GCN, GAT, RobustGCN (Zügner & Günnemann, 2019b), GCN-Jaccard (Wu et al., 2019b), GCN-SVD (Entezari et al., 2020), and S²GC on Cora, Citeseer, and Pubmed.

6. RELATED WORK

Graph Signal Denoising. Existing graph denoising works are mainly based on graph smoothing techniques (Chen et al., 2014; Zhou et al., 2021). It is well known that GNNs can increase the smoothness of node features by aggregating information from neighbors, so the influence of noisy features can be counteracted in a GNN's output. Some GNN models are derived from the perspective of signal denoising, such as S²GC (Zhu & Koniusz, 2021), GLP (Li et al., 2019), and IRLS (Yang et al., 2021). Moreover, Ma et al. (2021b) build the connection between signal denoising and existing GNNs by formulating message passing as a process of solving the GSD problem. This suggests a way to understand the behavior of GNNs through the lens of signal denoising. Besides, several works (Liu et al., 2021a; Fan et al., 2022; Ma et al., 2021a; Jin et al., 2021; Liu et al., 2021b; Jin et al., 2020; Zhang et al., 2022) conduct statistical analyses of graph noise from an empirical perspective. In this work, we perform an extensive analysis to understand the denoising effect of GNNs from both theoretical and experimental perspectives.

Smoothing and Over-smoothing. One key principle of GNNs is to improve the smoothness of node representations, but stacking graph layers can lead to over-smoothing (Li et al., 2018), where node representations become indistinguishable. Some recent works address over-smoothing, such as JKNet (Xu et al., 2018), GCNII (Chen et al., 2020), and RevGNN-Deep (Li et al., 2021); they add the output of shallow layers to the final layers with a residual-style design. In this work, we show that smoothing can help the denoising effect of GNNs.

7. CONCLUSION

Our work conducts a comprehensive study of the implicit denoising effect of graph neural networks. We theoretically show that the denoising effect of GNNs is largely influenced by the connectivity and size of the graph, as well as the GNN architecture. Motivated by our analysis, we also propose a robust graph convolution model by solving the adversarial graph signal denoising problem, which enhances the smoothness of node representations and the implicit denoising effect.

A THE DETAILS ON HOW TO SOLVE THE INNER MAXIMIZATION PROBLEM IN SEC. 4.2

Different from the non-concave inner maximization problem in adversarial attacks, our inner maximization problem is a convex optimization problem. Hence, we do not need to add random perturbations to the graph structure at each training epoch and can directly find the perturbation that maximizes the inner adversarial loss. Denote the perturbation as δ, with L′ = L̃ + δ. We can rewrite the inner maximization problem as

max_{L′} tr(F⊤ L′ F) = ⟨L̃, FF⊤⟩ + max_δ ⟨δ, FF⊤⟩   s.t. ∥δ∥_F ≤ ε.

We denote h(δ) = ⟨δ, FF⊤⟩. Clearly, h(δ) is maximized when δ is aligned with the gradient ∇h(δ) = FF⊤, i.e., δ = ε ∇h(δ)/∥∇h(δ)∥_F = ε FF⊤/∥FF⊤∥_F, which is illustrated in Fig. 3.

B PROOF OF THE NEUMANN SERIES EXPANSION

Lemma B.1 (Gelfand's formula). For a square matrix A and any matrix norm ∥·∥, the spectral radius satisfies ρ(A) = lim_{k→∞} ∥A^k∥^{1/k} = inf_{k≥1} ∥A^k∥^{1/k} ≤ ∥A∥.

Lemma B.1 describes the relationship between the spectral radius of a matrix and its matrix norm, i.e., ρ(A) = lim_{k→∞} ∥A^k∥^{1/k}.

Lemma B.2. Let A ∈ C^{n×n} with spectral radius ρ(A) = max(abs(spec(A))). If ρ(A) < 1, then Σ_{k=0}^{∞} A^k converges to (I − A)^{-1}.

Proof. We first show that (I − A)^{-1} exists. By definition, the solutions of det(λI − A) = 0 are the eigenvalues of A. Since ρ(A) < 1, for any |λ| ≥ 1 we have det(λI − A) ≠ 0; in particular, det(I − A) ≠ 0, so (I − A)^{-1} exists. Since ρ(A) < 1, by Lemma B.1 we have lim_{k→∞} ∥A^k∥ = 0. Let S_k = A⁰ + A¹ + ⋯ + A^k. Then (I − A)S_k = I − A^{k+1} → I as k → ∞, so S_k converges to (I − A)^{-1}.

Lemma B.3 (Gershgorin circle theorem). For any M ∈ C^{n×n} and any eigenvalue λ of M, there exists an index i such that |λ − M_{ii}| ≤ Σ_{j≠i} |M_{ij}|.

Lemma B.4. Let A ∈ {0, 1}^{n×n} be the adjacency matrix of a graph and Ã = D^{-1/2} A D^{-1/2} or Ã = D^{-1} A. Then

(I − (λ/(λ+1)) Ã)^{-1} = Σ_{k=0}^{∞} (λ/(λ+1))^k Ã^k.

Proof. We first prove that ρ(Ã) ≤ 1 for Ã = D^{-1/2} A D^{-1/2}. Let λ be an eigenvalue of Ã with eigenvector v. Then

D^{-1/2} A D^{-1/2} v = λ v  ⟹  D^{-1} A (D^{-1/2} v) = λ (D^{-1/2} v),

which means (λ, D^{-1/2} v) is an eigen-pair of D^{-1} A.
By Lemma B.3, there exists $i$ such that
$$\big|\lambda - (\tilde{D}^{-1}\tilde{A})_{ii}\big| \le \sum_{j\ne i}(\tilde{D}^{-1}\tilde{A})_{ij} \;\Longrightarrow\; (\tilde{D}^{-1}\tilde{A})_{ii} - \sum_{j\ne i}(\tilde{D}^{-1}\tilde{A})_{ij} \le \lambda \le (\tilde{D}^{-1}\tilde{A})_{ii} + \sum_{j\ne i}(\tilde{D}^{-1}\tilde{A})_{ij}.$$
Since $(\tilde{D}^{-1}\tilde{A})_{ij} \ge 0$ and $\sum_j(\tilde{D}^{-1}\tilde{A})_{ij} = 1$, we obviously have
$$-1 < (\tilde{D}^{-1}\tilde{A})_{ii} - \sum_{j\ne i}(\tilde{D}^{-1}\tilde{A})_{ij} \le \lambda \le (\tilde{D}^{-1}\tilde{A})_{ii} + \sum_{j\ne i}(\tilde{D}^{-1}\tilde{A})_{ij} = 1.$$
So if $\hat{A} = \tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12}$, we have $\rho(\hat{A}) \le 1$. When $\hat{A} = \tilde{D}^{-1}\tilde{A}$, denote $(\lambda, v)$ as an eigen-pair of $\tilde{D}^{-1}\tilde{A}$; similarly, by Lemma B.3, there exists $i$ such that the same bounds on $\lambda$ hold.

Obviously, we can reach the same conclusion for $\hat{A} = \tilde{D}^{-1}\tilde{A}$. So in both cases
$$\rho\Big(\frac{\lambda}{\lambda+1}\hat{A}\Big) \le \frac{\lambda}{\lambda+1} < 1.$$
By Lemma B.2, we get $(I - \frac{\lambda}{\lambda+1}\hat{A})^{-1} = \sum_{k=0}^{\infty}(\frac{\lambda}{\lambda+1})^k\hat{A}^k$, which finishes the proof.

By Lemma B.4, we approximate the inverse matrix $(I + \lambda\hat{L})^{-1}$, with $\hat{L} = I - \hat{A}$, up to the $S$-th order:
$$\big(I + \lambda\hat{L}\big)^{-1} = \frac{1}{\lambda+1}\Big(I - \frac{\lambda}{\lambda+1}\hat{A}\Big)^{-1} \approx \frac{1}{\lambda+1}\sum_{s=0}^{S}\Big(\frac{\lambda}{\lambda+1}\Big)^s\hat{A}^s.$$
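As a quick sanity check (not part of the original proof), the truncated Neumann series above can be compared against the exact inverse on a small synthetic graph; the graph, $\lambda = 1$, and $S = 64$ below are illustrative choices:

```python
# Numerical check of Lemma B.4: (I + lam*L)^(-1) equals
# (1/(lam+1)) * sum_{s>=0} (lam/(lam+1))^s * A_hat^s, truncated at order S.
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T          # symmetric adjacency, no self-loops
A_sl = A + np.eye(n)                    # \tilde{A} = A + I (add self-loops)
A_hat = A_sl / A_sl.sum(1)[:, None]     # row-normalized D^{-1} \tilde{A}
L_hat = np.eye(n) - A_hat

lam, S = 1.0, 64
approx = np.zeros((n, n))
term = np.eye(n)                        # holds (lam/(lam+1))^s * A_hat^s
for s in range(S + 1):
    approx += term
    term = (lam / (lam + 1)) * (term @ A_hat)
approx /= (lam + 1)

exact = np.linalg.inv(np.eye(n) + lam * L_hat)
err = np.abs(approx - exact).max()
print(err)   # truncation error decays like (lam/(lam+1))^(S+1)
```

With $\lambda = 1$ the series coefficients halve at each order, so at $S = 64$ the truncation error is far below floating-point precision.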

C THE ROW SUMMATION OF THE NEUMANN SERIES

We provide the derivation of the row sum of $\hat{A}_S = \frac{1}{\lambda+1}\sum_{s=0}^{S}(\frac{\lambda}{\lambda+1})^s\hat{A}^s$ in this section. Before deriving the row sum of $\hat{A}_S$, we first derive the row sum of $\hat{A}^k$.

Lemma C.1. Consider a probability matrix $P \in \mathbb{R}^{n\times n}$ with $P_{ij} \ge 0$ and $\sum_{j=1}^n P_{ij} = 1$ for all $i$. Then for any $s \in \mathbb{Z}^+$, we have $\sum_{j=1}^n (P^s)_{ij} = 1$.

Proof. We give a proof by induction on $k$. Base case: when $k = 1$, the claim holds trivially. Inductive step: assume the induction hypothesis that for a particular $k$ we have $\sum_{j=1}^n (P^k)_{ij} = 1$ for all $i$. Since $P^{k+1} = P^k P$, we have
$$\sum_{j=1}^n (P^{k+1})_{ij} = \sum_{j=1}^n\sum_{l=1}^n (P^k)_{il}P_{lj} = \sum_{l=1}^n (P^k)_{il}\sum_{j=1}^n P_{lj} = \sum_{l=1}^n (P^k)_{il} = 1,$$

Then for any $i$, we have
$$\sum_{j=1}^n (\hat{A}_S)_{ij} = \frac{1}{\lambda+1}\sum_{s=0}^{S}\Big(\frac{\lambda}{\lambda+1}\Big)^s\sum_{j=1}^n(\hat{A}^s)_{ij} = \frac{1}{\lambda+1}\sum_{s=0}^{S}\Big(\frac{\lambda}{\lambda+1}\Big)^s = 1 - \Big(\frac{\lambda}{\lambda+1}\Big)^{S+1}.$$
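The closed-form row sum can be checked numerically; the graph, $\lambda = 2$, and $S = 10$ below are arbitrary small-scale choices:

```python
# Check the row-sum formula of Appendix C: every row of A_S sums to
# 1 - (lam/(lam+1))^(S+1) when A_hat is row-stochastic.
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T
A_sl = A + np.eye(n)                    # add self-loops
A_hat = A_sl / A_sl.sum(1)[:, None]     # row-stochastic D^{-1} \tilde{A}

lam, S = 2.0, 10
r = lam / (lam + 1)
A_S = sum(r**s * np.linalg.matrix_power(A_hat, s)
          for s in range(S + 1)) / (lam + 1)

row_sums = A_S.sum(1)
target = 1 - r ** (S + 1)
print(row_sums[:3], target)             # all rows match the closed form
```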

D PROOF OF LEMMA 1

We provide the details of the proof of Lemma 1. We first introduce the general Hoeffding inequality (Hoeffding, 1994), which is essential for bounding $\|\hat{A}_S\eta\|_F^2$.

Lemma D.1 (General Hoeffding Inequality (Hoeffding, 1994)). Suppose that the variables $X_1, \cdots, X_n$ are independent, and $X_i$ has mean $\mu_i$ and sub-Gaussian parameter $\sigma_i$. Then for all $t \ge 0$, we have
$$\mathbb{P}\Big[\sum_{i=1}^n (X_i - \mu_i) \ge t\Big] \le \exp\Big(-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\Big).$$

Now let us prove Lemma 1.

Proof of Lemma 1. Any entry of $\hat{A}_S\eta$ can be written as $(\hat{A}_S\eta)_{ij} = \sum_{p=1}^n(\hat{A}_S)_{ip}\eta_{pj}$, where $\eta_{pj}$ is a sub-Gaussian variable with parameter $\sigma^2$. By the general Hoeffding inequality (Lemma D.1), we have
$$\mathbb{P}\Big[\big|(\hat{A}_S\eta)_{ij}\big| \ge t\Big] \le 2\exp\Bigg(-\frac{nt^2}{2\tau\big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2\sigma^2}\Bigg),$$
where $\tau = \max_i\tau_i$ and $\tau_i = n\sum_{j=1}^n(\hat{A}_S)_{ij}^2 \big/ \big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2$. Applying the union bound (Vershynin, 2010) to all possible pairs $i \in [n]$, $j \in [n]$, we get
$$\mathbb{P}\Big[\|\hat{A}_S\eta\|_{\infty,\infty} \ge t\Big] \le \sum_{i,j}\mathbb{P}\Big[\big|(\hat{A}_S\eta)_{ij}\big| \ge t\Big] \le 2n^2\exp\Bigg(-\frac{nt^2}{2\tau\big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2\sigma^2}\Bigg). \quad (22)$$
Applying the union bound again, we have
$$\mathbb{P}\Big[\|\hat{A}_S\eta\|_F^2 \ge t\Big] \le \sum_{i,j}\mathbb{P}\Big[\|\hat{A}_S\eta\|_{\infty,\infty} \ge \sqrt{t}\Big] \le 2n^4\exp\Bigg(-\frac{nt}{2\tau\big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2\sigma^2}\Bigg).$$
Choosing $t = 2\tau\big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2\sigma^2(4\log n + \log 2d)/n$, with probability $1 - 1/d$ we have
$$\|\hat{A}_S\eta\|_F^2 \le \frac{2\tau\big(1 - (\frac{\lambda}{\lambda+1})^{S+1}\big)^2\sigma^2(4\log n + \log 2d)}{n},$$
which finishes the proof.
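A Monte-Carlo illustration (not from the paper) of the effect that Lemma 1 quantifies: aggregation by $\hat{A}_S$ averages out most of the i.i.d. feature noise. The graph size, density, $\lambda$, $S$, and feature dimension below are illustrative:

```python
# Compare ||A_S @ eta||_F^2 against ||eta||_F^2 for i.i.d. Gaussian eta:
# the ratio is well below 1 because aggregation averages the noise out.
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 16
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
A_sl = A + np.eye(n)
A_hat = A_sl / A_sl.sum(1)[:, None]

lam, S = 4.0, 8
r = lam / (lam + 1)
A_S = sum(r**s * np.linalg.matrix_power(A_hat, s)
          for s in range(S + 1)) / (lam + 1)

ratios = []
for _ in range(20):
    eta = rng.normal(size=(n, d))        # sub-Gaussian (Gaussian) noise
    ratios.append(np.linalg.norm(A_S @ eta)**2 / np.linalg.norm(eta)**2)
mean_ratio = float(np.mean(ratios))
print(mean_ratio)                        # substantially smaller than 1
```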

E PROOF OF THE MAIN THEOREM 1

We provide the details of the proof of the main Theorem 1.

[Restatement of Theorem 1] Under Assumptions 1, 2, 3, 4, let $W_f^{(k)}$ denote the $k$-th step gradient descent solution for $\min_W f(W)$ with step size $\alpha \le 1/L$. With probability $1 - 1/d$ we have
$$g\big(W_f^{(k)}\big) - g\big(W_g^*\big) \le O\Big(\frac{1}{2k\alpha}\Big) + O\Big(\frac{\tau\log n}{n}\Big),$$
where $W_g^* = \arg\min_W g(W)$ is the optimal solution of the clean loss function $g(W)$, $\tau$ is the high-order graph connectivity factor, and $n$ is the number of nodes of the graph.

Proof. By the definition of $L$-smoothness, we can obtain the following inequality:
$$f(W_f') \le f(W_f) + \langle\nabla f(W_f), W_f' - W_f\rangle + \frac{1}{2}L\|W_f' - W_f\|_F^2.$$
Applying the gradient descent update $W_f' = W_f^+ = W_f - \alpha\nabla f(W_f)$, we get
$$\begin{aligned}
f(W_f^+) &\le f(W_f) + \langle\nabla f(W_f), W_f - \alpha\nabla f(W_f) - W_f\rangle + \frac{1}{2}L\|W_f - \alpha\nabla f(W_f) - W_f\|_F^2\\
&= f(W_f) - \alpha\|\nabla f(W_f)\|_F^2 + \frac{1}{2}L\alpha^2\|\nabla f(W_f)\|_F^2\\
&= f(W_f) - \Big(1 - \frac{1}{2}L\alpha\Big)\alpha\|\nabla f(W_f)\|_F^2.
\end{aligned} \quad (27)$$
With the fixed step size $\alpha \le 1/L$, we know that $-(1 - \frac{1}{2}L\alpha) = \frac{1}{2}L\alpha - 1 \le \frac{1}{2}L(1/L) - 1 = -\frac{1}{2}$. Plugging this into Eq. (27), we have
$$f(W_f^+) \le f(W_f) - \frac{1}{2}\alpha\|\nabla f(W_f)\|_F^2. \quad (28)$$
With $\alpha \le 1/L$, this inequality implies that the loss strictly decreases at each iteration of gradient descent, since $\|\nabla f(W_f)\|$ is positive unless $\nabla f(W_f) = 0$, i.e., $W_f = W_f^*$.

Now let us bound $f(W_f^+)$. Since $f$ is convex, we can write
$$f(W_f) \le f(W_g^*) + \langle\nabla f(W_f), W_f - W_g^*\rangle. \quad (29)$$
Introducing this inequality into Eq. (28), we obtain
$$\begin{aligned}
f(W_f^+) - f(W_g^*) &\le \langle\nabla f(W_f), W_f - W_g^*\rangle - \frac{\alpha}{2}\|\nabla f(W_f)\|_F^2\\
&= \frac{1}{2\alpha}\Big(2\alpha\langle\nabla f(W_f), W_f - W_g^*\rangle - \alpha^2\|\nabla f(W_f)\|_F^2\Big)\\
&= \frac{1}{2\alpha}\Big(2\alpha\langle\nabla f(W_f), W_f - W_g^*\rangle - \alpha^2\|\nabla f(W_f)\|_F^2 - \|W_f - W_g^*\|_F^2\Big) + \frac{1}{2\alpha}\|W_f - W_g^*\|_F^2\\
&= \frac{1}{2\alpha}\Big(\|W_f - W_g^*\|_F^2 - \|W_f - \alpha\nabla f(W_f) - W_g^*\|_F^2\Big).
\end{aligned} \quad (30)$$
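The telescoping argument above yields the $O(1/(2k\alpha))$ optimization term in the theorem statement. A quick numerical sketch (entirely synthetic, not the paper's GNN setting) on a smooth convex least-squares objective confirms the bound $f(W^{(k)}) - f(W^*) \le \|W^{(0)} - W^*\|_F^2/(2k\alpha)$:

```python
# Gradient descent on f(W) = ||X W - Y||_F^2 with step alpha <= 1/L,
# checked against the O(1/k) suboptimality bound from the proof.
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 50, 10, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, d))

def f(W):
    return np.linalg.norm(X @ W - Y) ** 2

def grad(W):
    return 2 * X.T @ (X @ W - Y)

L = 2 * np.linalg.norm(X.T @ X, 2)       # smoothness constant of f
alpha = 1.0 / L
W_star = np.linalg.lstsq(X, Y, rcond=None)[0]   # exact minimizer
W0 = np.zeros((p, d))

W, k = W0.copy(), 200
for _ in range(k):
    W = W - alpha * grad(W)

gap = f(W) - f(W_star)
bound = np.linalg.norm(W0 - W_star) ** 2 / (2 * k * alpha)
print(gap, bound)                        # gap stays below the bound
```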
Notice that by the definition of the gradient descent update, we have $W_f^+ = W_f - \alpha\nabla f(W_f)$. Plugging this into the final inequality of Eq. (30), we get
$$f(W_f^+) - f(W_g^*) \le \frac{1}{2\alpha}\Big(\|W_f - W_g^*\|_F^2 - \|W_f^+ - W_g^*\|_F^2\Big). \quad (31)$$
This inequality holds for $W_f^+$ on every iteration of gradient descent. Summing over iterations, we get
$$\sum_{i=1}^k\Big(f\big(W_f^{(i)}\big) - f\big(W_g^*\big)\Big) \le \sum_{i=1}^k\frac{1}{2\alpha}\Big(\|W_f^{(i-1)} - W_g^*\|_F^2 - \|W_f^{(i)} - W_g^*\|_F^2\Big) = \frac{1}{2\alpha}\Big(\|W_f^{(0)} - W_g^*\|_F^2 - \|W_f^{(k)} - W_g^*\|_F^2\Big) \le \frac{1}{2\alpha}\|W_f^{(0)} - W_g^*\|_F^2. \quad (32)$$
By the inequality of Eq. (28), we know that $f(W_f)$ strictly decreases over the iterations. So we have
$$f\big(W_f^{(k)}\big) - f\big(W_g^*\big) \le \frac{1}{k}\sum_{i=1}^k\Big(f\big(W_f^{(i)}\big) - f\big(W_g^*\big)\Big) \le \frac{1}{2k\alpha}\|W_f^{(0)} - W_g^*\|_F^2. \quad (33)$$
Equivalently, for the clean loss function $g(W)$ we have
$$\begin{aligned}
g\big(W_f^{(k)}\big) - g\big(W_g^*\big) &= f\big(W_f^{(k)}\big) - f\big(W_g^*\big) + 2\big\langle\hat{A}_S\eta W_g^*,\ \hat{A}_S X^* W_g^* - Y\big\rangle + \big\langle\hat{A}_S\eta W_g^*,\ \hat{A}_S\eta W_g^*\big\rangle\\
&\quad - 2\big\langle\hat{A}_S\eta W_f^{(k)},\ \hat{A}_S X^* W_f^{(k)} - Y\big\rangle - \big\langle\hat{A}_S\eta W_f^{(k)},\ \hat{A}_S\eta W_f^{(k)}\big\rangle\\
&\le \frac{1}{2k\alpha}\|W_f^{(0)} - W_g^*\|_F^2 + 2\|\hat{A}_S\eta\|_F\|W_g^*\|_F\|\hat{A}_S X^* W_g^* - Y\|_F + \|\hat{A}_S\eta\|_F^2\|W_g^*\|_F^2\\
&\quad + 2\|\hat{A}_S\eta\|_F\|W_f^{(k)}\|_F\|\hat{A}_S X W_f^{(k)} - Y\|_F + \|\hat{A}_S\eta\|_F^2\|W_f^{(k)}\|_F^2\\
&\le O\Big(\frac{1}{2k\alpha}\Big) + O\Big(\frac{\tau\log n}{n}\Big),
\end{aligned}$$
where the last step applies the bound on $\|\hat{A}_S\eta\|_F^2$ from Lemma 1, which finishes the proof.

F MORE DETAILS ON EQUATION (1)

We provide more details on how to obtain Equation (1). Note that if we set $\hat{L} = I - D^{-\frac12}AD^{-\frac12}$, we have
$$\operatorname{tr}\big(F^\top\hat{L}F\big) = \operatorname{tr}\big(F^\top F\big) - \operatorname{tr}\big(F^\top D^{-\frac12}AD^{-\frac12}F\big) = \operatorname{tr}\big(FF^\top\big) - \operatorname{tr}\big(D^{-\frac12}AD^{-\frac12}FF^\top\big).$$
On the other hand, if we set $\hat{L} = I - D^{-1}A$, we have
$$\operatorname{tr}\big(F^\top\hat{L}F\big) = \operatorname{tr}\big(FF^\top\big) - \operatorname{tr}\big(D^{-1}AFF^\top\big).$$
We denote the rows of $F$ by $F_i = [F_{i1}\cdots F_{id}]$, so that $\operatorname{tr}(FF^\top) = \sum_{i=1}^n F_iF_i^\top$. Expanding the matrix products entrywise (the $(i,j)$ entry of $D^{-\frac12}AD^{-\frac12}$ is $A_{ij}/\sqrt{(d_i+1)(d_j+1)}$, and that of $D^{-1}A$ is $A_{ij}/(d_i+1)$), we obtain
$$\operatorname{tr}\big(D^{-\frac12}AD^{-\frac12}FF^\top\big) = \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{\sqrt{(d_i+1)(d_j+1)}}F_jF_i^\top, \qquad \operatorname{tr}\big(D^{-1}AFF^\top\big) = \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{d_i+1}F_jF_i^\top.$$
So when $\hat{L} = I - D^{-\frac12}AD^{-\frac12}$, using $\sum_j A_{ij} = d_i + 1$ (self-loops included) and the symmetry $A_{ij} = A_{ji}$ of the undirected graph, we have
$$\begin{aligned}
\operatorname{tr}\big(F^\top\hat{L}F\big) &= \sum_{i=1}^n F_iF_i^\top - \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{\sqrt{(d_i+1)(d_j+1)}}F_jF_i^\top\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\frac{F_iF_i^\top}{d_i+1} + \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\frac{F_jF_j^\top}{d_j+1} - \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{\sqrt{(d_i+1)(d_j+1)}}F_jF_i^\top\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\Big(\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_j+1}}\Big)\Big(\frac{F_i^\top}{\sqrt{d_i+1}} - \frac{F_j^\top}{\sqrt{d_j+1}}\Big)\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\Big\|\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_j+1}}\Big\|_2^2 = \sum_{(i,j)\in E}A_{ij}\Big\|\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_j+1}}\Big\|_2^2.
\end{aligned}$$
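The symmetric-normalization identity just derived can be verified numerically. The sketch below (not from the paper) follows the self-loop convention of the derivation, i.e., $\tilde{A} = A + I$ with degrees $d_i + 1$:

```python
# Check: tr(F^T L F) with L = I - D^{-1/2} A D^{-1/2} equals the edge-wise
# Dirichlet energy (1/2) sum_{i,j} A_ij ||F_i/sqrt(d_i) - F_j/sqrt(d_j)||^2.
import numpy as np

rng = np.random.default_rng(4)
n, dim = 15, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T
A_sl = A + np.eye(n)                     # \tilde{A} = A + I
deg = A_sl.sum(1)                        # degrees d_i + 1
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ A_sl @ D_inv_sqrt
F = rng.normal(size=(n, dim))

lhs = np.trace(F.T @ L @ F)
rhs = 0.5 * sum(
    A_sl[i, j] * np.sum((F[i] / np.sqrt(deg[i]) - F[j] / np.sqrt(deg[j])) ** 2)
    for i in range(n) for j in range(n)
)
print(lhs, rhs)                          # the two quantities agree
```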
On the other hand, when $\hat{L} = I - D^{-1}A$, we have
$$\begin{aligned}
\operatorname{tr}\big(F^\top\hat{L}F\big) &= \sum_{i=1}^n F_iF_i^\top - \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{d_i+1}F_jF_i^\top\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\frac{F_iF_i^\top}{d_i+1} + \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\frac{F_jF_j^\top}{d_i+1} - \sum_{i=1}^n\sum_{j=1}^n\frac{A_{ij}}{\sqrt{d_i+1}\sqrt{d_i+1}}F_jF_i^\top \qquad \text{(undirected graph)}\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\Big(\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_i+1}}\Big)\Big(\frac{F_i^\top}{\sqrt{d_i+1}} - \frac{F_j^\top}{\sqrt{d_i+1}}\Big)\\
&= \frac12\sum_{i=1}^n\sum_{j=1}^n A_{ij}\Big\|\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_i+1}}\Big\|_2^2 = \sum_{(i,j)\in E}A_{ij}\Big\|\frac{F_i}{\sqrt{d_i+1}} - \frac{F_j}{\sqrt{d_i+1}}\Big\|_2^2.
\end{aligned}$$

G DATASETS DETAILS

Cora, Citeseer, and Pubmed are standard citation network benchmark datasets (Sen et al., 2008). Coauthor-CS and Coauthor-Phy are extracted from the Microsoft Academic Graph (Shchur et al., 2018). Cornell, Texas, Wisconsin, and Actor are constructed by Pei et al. (2020). ogbn-products is a large-scale product co-purchasing network constructed by Hu et al. (2020).

In this section, we analyze the influence of the depth of the NGC and RNGC models on denoising performance by testing classification accuracy on semi-supervised node classification tasks. We conduct two sets of experiments: with and without noise in the feature matrix. For the experiments with feature noise, we simply fix the noise level ξ = 1. In each set of experiments, we evaluate the test accuracy with respect to the NGC and RNGC model depth, which corresponds to the value of S in Â_S. From Figures 4 and 5, we observe that the test accuracy barely changes with depth when the model is trained on clean features on Cora and Pubmed, but changes greatly when trained on clean features on Citeseer. In this regard, the over-smoothing issue exists in the RNGC model on Citeseer. However, the denoising performance of shallow RNGC is not as good as that of deeper RNGC models, especially on large graphs such as Pubmed. This suggests that we do need to increase the depth of the GNN model to include higher-order neighbors for better denoising performance.



Here we do not need exact graph structure perturbations as in graph adversarial attacks (Zügner et al., 2018; Zügner & Günnemann, 2019a), but rather a virtual perturbation that leads to small changes in the Laplacian. More details on how to solve the inner maximization problem can be found in Appendix A. We also analyze the effect of row normalization on the noisy feature matrix in Appendix I.1 and the denoising effect of depth in NGC and RNGC in Appendix I.2.



Figure 3: Illustration of the inner maximization problem. The adversarial loss function reaches its largest value when the direction of δ is the same as that of ∇h(δ).
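The closed-form maximizer illustrated in Figure 3 can also be checked numerically against random feasible perturbations; by Cauchy–Schwarz, $h(\delta) = \langle\delta, FF^\top\rangle$ over $\|\delta\|_F \le \varepsilon$ is maximized at $\delta^* = \varepsilon FF^\top/\|FF^\top\|_F$. The sizes and $\varepsilon$ below are arbitrary:

```python
# Verify that delta* = eps * F F^T / ||F F^T||_F dominates random feasible
# perturbations for h(delta) = <delta, F F^T>, ||delta||_F <= eps.
import numpy as np

rng = np.random.default_rng(5)
n, dim, eps = 12, 4, 0.5
F = rng.normal(size=(n, dim))
G = F @ F.T                               # gradient of h(delta)

delta_star = eps * G / np.linalg.norm(G)  # closed-form maximizer
h_star = np.sum(delta_star * G)           # equals eps * ||F F^T||_F

h_best_random = -np.inf
for _ in range(1000):
    delta = rng.normal(size=(n, n))
    delta = eps * delta / np.linalg.norm(delta)   # project onto the eps-ball
    h_best_random = max(h_best_random, np.sum(delta * G))
print(h_star, h_best_random)              # h_star beats every random candidate
```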

then we have
$$\lim_{k\to\infty}\big(S_k - AS_k\big) = \lim_{k\to\infty}(I - A)S_k = \lim_{k\to\infty}\big(I - A^{k+1}\big) = I.$$
Since $(I - A)^{-1}$ exists, we have $(I - A)\lim_{k\to\infty}S_k = I$, and thus $\lim_{k\to\infty}S_k = (I - A)^{-1}$, which finishes the proof.

Lemma B.2 describes the convergence of the Neumann series and the condition under which it converges.

Lemma B.3 (Gerschgorin Disc (Bhatia, 2013)). Let $A \in \mathbb{C}^{n\times n}$ with entries $a_{ij}$. For any eigenvalue $\lambda$, there exists $i$ and a corresponding Gerschgorin disc $D(a_{ii}, R_i) \subseteq \mathbb{C}$ such that $\lambda$ lies in this disc, i.e., $|\lambda - a_{ii}| \le \sum_{j\ne i}|a_{ij}|$.

Lemma B.3 describes the estimated range of the eigenvalues. Now we derive the Neumann series expansion of the solution of GSD as follows.
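A quick numerical check (not from the paper) of the Gerschgorin disc theorem on an arbitrary matrix:

```python
# Every eigenvalue of A lies in some disc centered at a_ii with radius
# R_i = sum_{j != i} |a_ij| (Lemma B.3).
import numpy as np

rng = np.random.default_rng(6)
n = 8
A = rng.normal(size=(n, n))
eigs = np.linalg.eigvals(A)

centers = np.diag(A)
radii = np.abs(A).sum(1) - np.abs(centers)   # off-diagonal absolute row sums
ok = all(np.any(np.abs(lam - centers) <= radii + 1e-9) for lam in eigs)
print(ok)
```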

the row summation of $\hat{A}^k$ is 1. Now we can obtain the row summation of $\hat{A}_S$.

$G_3$: a complete graph with 4 nodes; $G_4$: a ring graph with 8 nodes. For computing $\tau$, both $\lambda$ and $S$ are set to 64.

Comparison of classification accuracy vs. noise level for semi-supervised node classification tasks. The noise level ξ controls the magnitude of the Gaussian noise added to the feature matrix: X + ξη, where η is sampled from a standard i.i.d. Gaussian distribution.

Summary of results (10 runs) on heterophily graphs in terms of classification accuracy (%)

Summary of results (10 runs) on Coauthor-CS and Coauthor-Phy in terms of accuracy (%)

The noise level ξ controls the magnitude of the Gaussian noise added to the feature matrix: X + ξη, where η is sampled from a standard i.i.d. Gaussian distribution. For Cora and Citeseer, we test ξ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, and for Pubmed, we test ξ ∈ {0.01, 0.02, 0.03, 0.04, 0.05}.

Denoising performance over 100 runs against flipping perturbation

Datasets statistics

In this section, we analyze the influence of row normalization on denoising performance. The noise level ξ controls the magnitude of the Gaussian noise added to the feature matrix: X + ξη, where η is sampled from a standard i.i.d. Gaussian distribution. For Cora, Citeseer, and Pubmed, we test ξ ∈ {1, 10, 100}. From Table 12, we can observe that the denoising performance with row normalization is better than without row normalization, since row normalization can shrink the value

The hyper-parameters for NGC and RNGC on three citation datasets.

The hyper-parameters for NGC and RNGC on four heterophily graphs.

thus reducing the variance σ. In other words, row normalization makes $\hat{A}_S\eta$

The hyper-parameters for NGC and RNGC on two co-author datasets.

The hyper-parameters for NGC and RNGC on ogbn-products dataset.

The hyper-parameters for NGC and RNGC on three citation datasets of the flipping experiments.

Summary of results of NGC w/o row normalization on three datasets in terms of classification accuracy (%)

