A UNIFIED VIEW ON GRAPH NEURAL NETWORKS AS GRAPH SIGNAL DENOISING

Anonymous authors

Paper under double-blind review

Abstract

Graph Neural Networks (GNNs) have risen to prominence in learning representations for graph structured data. A single GNN layer typically consists of a feature transformation and a feature aggregation operation. The former normally uses feed-forward networks to transform features, while the latter aggregates the transformed features over the graph. Numerous recent works have proposed GNN models with different designs in the aggregation operation. In this work, we establish mathematically that the aggregation processes in a group of representative GNN models including GCN, GAT, PPNP, and APPNP can be regarded as (approximately) solving a graph denoising problem with a smoothness assumption. Such a unified view across GNNs not only provides a new perspective to understand a variety of aggregation operations but also enables us to develop a unified graph neural network framework UGNN. To demonstrate its promising potential, we instantiate a novel GNN model, ADA-UGNN, derived from UGNN, to handle graphs with adaptive smoothness across nodes. Comprehensive experiments show the effectiveness of ADA-UGNN.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown great capacity in learning representations for graph-structured data and have thus facilitated many downstream tasks such as node classification (Kipf & Welling, 2016; Veličković et al., 2017; Ying et al., 2018a; Klicpera et al., 2018) and graph classification (Defferrard et al., 2016; Ying et al., 2018b). Like traditional deep learning models, a GNN model is usually composed of several stacked GNN layers. Given a graph $\mathcal{G}$ with $N$ nodes, a GNN layer typically contains a feature transformation and a feature aggregation operation:

$$\text{Feature Transformation: } \mathbf{X}'_{in} = f_{trans}(\mathbf{X}_{in}); \quad \text{Feature Aggregation: } \mathbf{X}_{out} = f_{agg}(\mathbf{X}'_{in}; \mathcal{G}), \tag{1}$$

where $\mathbf{X}_{in} \in \mathbb{R}^{N \times d_{in}}$ and $\mathbf{X}_{out} \in \mathbb{R}^{N \times d_{out}}$ denote the input and output features of the GNN layer, with $d_{in}$ and $d_{out}$ the corresponding dimensions. Note that the non-linear activation is omitted from Eq. (1) to ease the discussion. The feature transformation operation $f_{trans}(\cdot)$ maps the input $\mathbf{X}_{in}$ to $\mathbf{X}'_{in} \in \mathbb{R}^{N \times d_{out}}$, and the feature aggregation operation $f_{agg}(\cdot; \mathcal{G})$ updates the node features by aggregating the transformed node features over the graph $\mathcal{G}$. In general, different GNN models share similar feature transformations (often a single feed-forward layer) while adopting different designs for the aggregation operation. We raise a natural question: is there an intrinsic connection among these feature aggregation operations and their assumptions? The significance of a positive answer to this question is two-fold. First, it offers a new perspective for a uniform understanding of representative aggregation operations. Second, it enables us to develop a general GNN framework that not only provides a unified view of multiple existing representative GNN models, but also has the potential to inspire new ones.
In this paper, we aim to build the connection among feature aggregation operations of representative GNN models including GCN (Kipf & Welling, 2016) , GAT (Veličković et al., 2017) , PPNP and APPNP (Klicpera et al., 2018) . In particular, we mathematically establish that the aggregation operations in these models can be unified as the process of exactly, and sometimes approximately, addressing a graph signal denoising problem with Laplacian regularization (Shuman et al., 2013) . This connection suggests that these aggregation operations share a unified goal: to ensure feature smoothness of connected nodes. With this understanding, we propose a general GNN framework, UGNN, which not only provides a straightforward, unified view for many existing aggregation operations, but also suggests various promising directions to build new aggregation operations suitable for distinct applications. To demonstrate its potential, we build an instance of UGNN called ADA-UGNN, which is suited for handling varying smoothness properties across nodes, and conduct experiments to show its effectiveness.

2. REPRESENTATIVE GRAPH NEURAL NETWORKS

In this section, we introduce notation for graphs and briefly summarize several representative GNN models. A graph can be denoted as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ are its node and edge sets. The connections in $\mathcal{G}$ can be represented as an adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, with $N$ the number of nodes in the graph. The Laplacian matrix of the graph $\mathcal{G}$ is denoted as $\mathbf{L}$. It is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix corresponding to $\mathbf{A}$. There are also normalized versions of the Laplacian matrix such as $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ or $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1}\mathbf{A}$. In this work, we adopt different Laplacians to establish connections between different GNNs and the graph denoising problem, and we state in the text which one is used. In this section, we generally use $\mathbf{X}_{in} \in \mathbb{R}^{N \times d_{in}}$ and $\mathbf{X}_{out} \in \mathbb{R}^{N \times d_{out}}$ to denote the input and output features of GNN layers. Next, we describe a few representative GNN models.

2.1. GRAPH CONVOLUTIONAL NETWORKS (GCN)

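Following Eq. (1), a single layer in GCN (Kipf & Welling, 2016) can be written as follows:

$$\text{Feature Transformation: } \mathbf{X}'_{in} = \mathbf{X}_{in}\mathbf{W}; \quad \text{Feature Aggregation: } \mathbf{X}_{out} = \tilde{\mathbf{A}}\mathbf{X}'_{in}, \tag{2}$$

where $\mathbf{W} \in \mathbb{R}^{d_{in} \times d_{out}}$ is a feature transformation matrix, and $\tilde{\mathbf{A}}$ is a normalized adjacency matrix which includes self-loops, defined as follows:

$$\tilde{\mathbf{A}} = \hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}, \quad \text{with } \hat{\mathbf{A}} = \mathbf{A} + \mathbf{I} \text{ and } \hat{\mathbf{D}} = \mathrm{diag}\Big(\sum_j \hat{\mathbf{A}}_{1,j}, \ldots, \sum_j \hat{\mathbf{A}}_{N,j}\Big). \tag{3}$$

In practice, multiple GCN layers can be stacked, where each layer takes the output of its previous layer as input. Non-linear activation functions are included between consecutive layers.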
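As a minimal sketch of Eqs. (2) and (3), the following NumPy code builds the self-loop-normalized adjacency matrix and applies one GCN layer without the non-linearity. The toy path graph, the random features and weights, and the function names `normalized_adjacency` and `gcn_layer` are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2} for a dense adjacency matrix A (Eq. 3)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_layer(A, X_in, W):
    """One GCN layer without activation: transform, then aggregate (Eq. 2)."""
    X_trans = X_in @ W                        # feature transformation
    return normalized_adjacency(A) @ X_trans  # feature aggregation

# Toy 4-node path graph 0-1-2-3 (illustrative data only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X_in = rng.normal(size=(4, 3))                # N = 4 nodes, d_in = 3
W = rng.normal(size=(3, 2))                   # d_out = 2
X_out = gcn_layer(A, X_in, W)
print(X_out.shape)
```

A dense matrix is used only for readability; a sparse representation would be the natural choice for real graphs.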

2.2. GRAPH ATTENTION NETWORKS (GAT)

Graph Attention Networks (GAT) adopt the same feature transformation operation as GCN in Eq. (2). The feature aggregation operation (written node-wise) for a node $i$ is:

$$\mathbf{X}_{out}[i,:] = \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij}\,\mathbf{X}'_{in}[j,:], \quad \text{with } \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \tilde{\mathcal{N}}(i)} \exp(e_{ik})}, \tag{4}$$

where $\tilde{\mathcal{N}}(i) = \mathcal{N}(i) \cup \{i\}$ denotes the neighbors (self-inclusive) of node $i$, and $\mathbf{X}_{out}[i,:]$ is the $i$-th row of the matrix $\mathbf{X}_{out}$, i.e., the output features of node $i$. In this aggregation operation, $\alpha_{ij}$ is a learnable attention score that differentiates the importance of distinct nodes in the neighborhood. Specifically, $\alpha_{ij}$ is a normalized form of $e_{ij}$, which is modeled as:

$$e_{ij} = \mathrm{LeakyReLU}\big(\big[\mathbf{X}'_{in}[i,:] \,\|\, \mathbf{X}'_{in}[j,:]\big]\,\mathbf{a}\big), \tag{5}$$

where $[\cdot \| \cdot]$ denotes the concatenation operation and $\mathbf{a} \in \mathbb{R}^{2d}$ is a learnable vector. Similar to GCN, a GAT model usually consists of multiple stacked GAT layers.
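The attention-based aggregation of Eqs. (4) and (5) can be sketched as below for a single attention head on a dense graph. The code uses the fact that the product of a concatenation with $\mathbf{a}$ splits into two dot products with the two halves of $\mathbf{a}$. The negative slope of 0.2, the toy graph, and the helper names are assumptions made for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU; the negative slope 0.2 is an assumed default."""
    return np.where(x > 0, x, slope * x)

def gat_aggregate(A, X_trans, a):
    """Attention-weighted aggregation over self-inclusive neighborhoods (Eqs. 4-5)."""
    N, d = X_trans.shape
    mask = (A + np.eye(N)) > 0                     # True where j is in N~(i)
    a1, a2 = a[:d], a[d:]                          # [x_i || x_j] a = x_i a1 + x_j a2
    e = leaky_relu((X_trans @ a1)[:, None] + (X_trans @ a2)[None, :])
    e = np.where(mask, e, -np.inf)                 # non-neighbors get zero weight
    e = e - e.max(axis=1, keepdims=True)           # stabilize the exponentials
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)       # row-wise softmax over N~(i)
    return alpha @ X_trans, alpha

# Toy 4-node path graph and random inputs (illustrative data only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X_trans = rng.normal(size=(4, 3))                  # transformed features X'_in
a = rng.normal(size=6)                             # attention vector of size 2d
X_out, alpha = gat_aggregate(A, X_trans, a)
print(alpha.round(3))
```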

2.3. PERSONALIZED PROPAGATION OF NEURAL PREDICTIONS (PPNP)

Personalized Propagation of Neural Predictions (PPNP) (Klicpera et al., 2018) introduces an aggregation operation based on Personalized PageRank (PPR). Specifically, the PPR matrix is defined as $\alpha(\mathbf{I} - (1-\alpha)\tilde{\mathbf{A}})^{-1}$, where $\alpha \in (0, 1)$ is a hyper-parameter. The $ij$-th element of the PPR matrix specifies the influence of node $i$ on node $j$. The feature transformation operation is modeled as a Multi-Layer Perceptron (MLP). The PPNP model can be written in the form of Eq. (1) as follows:

$$\text{Feature Transformation: } \mathbf{X}'_{in} = \mathrm{MLP}(\mathbf{X}_{in}); \quad \text{Feature Aggregation: } \mathbf{X}_{out} = \alpha(\mathbf{I} - (1-\alpha)\tilde{\mathbf{A}})^{-1}\mathbf{X}'_{in}. \tag{6}$$

Unlike GCN and GAT, PPNP consists of only a single feature aggregation layer, but with a potentially deep feature transformation. Since the matrix inverse in Eq. (6) is costly, Klicpera et al. (2018) also introduce a practical, approximate version of PPNP, called APPNP, where the aggregation operation is performed iteratively:

$$\mathbf{X}^{(k)}_{out} = (1-\alpha)\tilde{\mathbf{A}}\mathbf{X}^{(k-1)}_{out} + \alpha\mathbf{X}'_{in}, \quad k = 1, \ldots, K, \tag{7}$$

where $\mathbf{X}^{(0)}_{out} = \mathbf{X}'_{in}$ and $\mathbf{X}^{(K)}_{out}$ is the output of the feature aggregation operation. As proved in Klicpera et al. (2018), $\mathbf{X}^{(K)}_{out}$ converges, as $K$ grows, to the solution obtained by PPNP, i.e., $\mathbf{X}_{out}$ in Eq. (6).
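The relation between Eqs. (6) and (7) can be illustrated numerically. The sketch below, on an assumed toy graph with random transformed features, computes the PPNP output with a linear solve (rather than an explicit inverse) and reports how far the APPNP iterate is from it for a few values of $K$. Function names and data are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """D_hat^{-1/2} (A + I) D_hat^{-1/2}, as in Eq. (3)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def ppnp_aggregate(A_tilde, X_trans, alpha):
    """Closed-form PPNP aggregation (Eq. 6), using a linear solve."""
    N = A_tilde.shape[0]
    return alpha * np.linalg.solve(np.eye(N) - (1 - alpha) * A_tilde, X_trans)

def appnp_aggregate(A_tilde, X_trans, alpha, K):
    """K iterations of the APPNP update (Eq. 7), starting from X'_in."""
    X_out = X_trans
    for _ in range(K):
        X_out = (1 - alpha) * (A_tilde @ X_out) + alpha * X_trans
    return X_out

# Toy 4-node path graph and random transformed features (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = normalized_adjacency(A)
rng = np.random.default_rng(0)
X_trans = rng.normal(size=(4, 2))
alpha = 0.1

exact = ppnp_aggregate(A_tilde, X_trans, alpha)
for K in (1, 10, 100):
    gap = np.abs(appnp_aggregate(A_tilde, X_trans, alpha, K) - exact).max()
    print(f"K={K:3d}  max |APPNP - PPNP| = {gap:.2e}")
```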

3. GNNS AS GRAPH SIGNAL DENOISING

In this section, we aim to establish the connections between the introduced GNN models and a graph signal denoising problem with Laplacian regularization. We first introduce the problem.

Problem 1 (Graph Signal Denoising with Laplacian Regularization). Suppose that we are given a noisy signal $\mathbf{X} \in \mathbb{R}^{N \times d}$ on a graph $\mathcal{G}$. The goal of the problem is to recover a clean signal $\mathbf{F} \in \mathbb{R}^{N \times d}$, assumed to be smooth over $\mathcal{G}$, by solving the following optimization problem:

$$\arg\min_{\mathbf{F}} \mathcal{L} = \|\mathbf{F} - \mathbf{X}\|_F^2 + c \cdot \mathrm{tr}(\mathbf{F}^\top\mathbf{L}\mathbf{F}). \tag{8}$$

Note that the first term guides $\mathbf{F}$ to be close to $\mathbf{X}$, while the second term $\mathrm{tr}(\mathbf{F}^\top\mathbf{L}\mathbf{F})$ is the Laplacian regularization that guides the smoothness of $\mathbf{F}$ over the graph; $c > 0$ is a balancing constant. Assuming we adopt the unnormalized Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$ (the adjacency matrix $\mathbf{A}$ is assumed to be binary), the second term in Eq. (8) can be written in an edge-centric way or a node-centric way as:

$$\text{edge-centric: } c\sum_{(i,j) \in \mathcal{E}} \|\mathbf{F}[i,:] - \mathbf{F}[j,:]\|_2^2; \quad \text{node-centric: } \frac{1}{2}c\sum_{i \in \mathcal{V}}\sum_{j \in \tilde{\mathcal{N}}(i)} \|\mathbf{F}[i,:] - \mathbf{F}[j,:]\|_2^2. \tag{9}$$

From the edge-centric view, the regularization term measures the global smoothness of $\mathbf{F}$, which is small when connected nodes share similar features. On the other hand, we can view the term $\sum_{j \in \tilde{\mathcal{N}}(i)} \|\mathbf{F}[i,:] - \mathbf{F}[j,:]\|_2^2$ as a local smoothness measure for node $i$, as it measures the difference between node $i$ and all its neighbors. The regularization term can then be regarded as a summation of local smoothness over all nodes. Similar formulations can also be derived for other types of Laplacian matrices. In the following subsections, we demonstrate the connections between aggregation operations in various GNN models and the graph signal denoising problem.
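The equivalence between the trace form in Eq. (8) and the two sums in Eq. (9) can be checked numerically for the unnormalized Laplacian of a binary graph. The small graph and random signal below are assumptions chosen only for illustration; the constant $c$ is left out since it scales all three quantities equally.

```python
import numpy as np

# Toy binary, undirected graph: a triangle 0-1-2 plus the edge 2-3 (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = A.shape[0]
L = np.diag(A.sum(axis=1)) - A                 # unnormalized Laplacian L = D - A
rng = np.random.default_rng(0)
F = rng.normal(size=(N, 2))                    # an arbitrary graph signal

def sq_dist(i, j):
    """Squared Euclidean distance between the feature rows of nodes i and j."""
    return float(np.sum((F[i] - F[j]) ** 2))

# Trace form of the Laplacian regularizer.
quad_form = float(np.trace(F.T @ L @ F))

# Edge-centric view: every undirected edge is counted once.
edge_centric = sum(sq_dist(i, j)
                   for i in range(N) for j in range(i + 1, N) if A[i, j] > 0)

# Node-centric view: every node sums over its self-inclusive neighborhood,
# so each edge is seen twice (hence the factor 1/2); the self term is zero.
node_centric = 0.5 * sum(sq_dist(i, j)
                         for i in range(N) for j in range(N)
                         if A[i, j] > 0 or i == j)

print(quad_form, edge_centric, node_centric)
```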

3.1. CONNECTION TO PPNP AND APPNP

In this subsection, we establish the connection between the graph signal denoising problem (8) and the aggregation operations in PPNP and APPNP in Theorem 1 and Theorem 2, respectively.

Theorem 1. When we adopt the normalized Laplacian matrix $\mathbf{L} = \mathbf{I} - \tilde{\mathbf{A}}$, with $\tilde{\mathbf{A}}$ defined in Eq. (3), the feature aggregation operation in PPNP (Eq. (6)) can be regarded as exactly solving the graph signal denoising problem (8) with $\mathbf{X}'_{in}$ as the input noisy signal and $c = \frac{1}{\alpha} - 1$.

Proof. Note that the objective in Eq. (8) is convex. Hence, the closed-form solution $\mathbf{F}^*$ that exactly solves the graph signal denoising problem can be obtained by setting its derivative to $0$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{F}} = 2(\mathbf{F} - \mathbf{X}) + 2c\mathbf{L}\mathbf{F} = 0 \;\Rightarrow\; \mathbf{F}^* = (\mathbf{I} + c\mathbf{L})^{-1}\mathbf{X}. \tag{10}$$

Given $\mathbf{L} = \mathbf{I} - \tilde{\mathbf{A}}$, $\mathbf{F}^*$ can be reformulated as:

$$\mathbf{F}^* = (\mathbf{I} + c\mathbf{L})^{-1}\mathbf{X} = \big(\mathbf{I} + c(\mathbf{I} - \tilde{\mathbf{A}})\big)^{-1}\mathbf{X} = \frac{1}{1+c}\Big(\mathbf{I} - \frac{c}{1+c}\tilde{\mathbf{A}}\Big)^{-1}\mathbf{X}. \tag{11}$$

The feature aggregation operation in Eq. (6) is equivalent to the closed-form solution in Eq. (11) when we set $\alpha = 1/(1+c)$ and $\mathbf{X} = \mathbf{X}'_{in}$. This completes the proof.

Theorem 2. When we adopt the normalized Laplacian matrix $\mathbf{L} = \mathbf{I} - \tilde{\mathbf{A}}$, the feature aggregation operation in APPNP (Eq. (7)) approximately solves the graph signal denoising problem (8) by iterative gradient descent with $\mathbf{X}'_{in}$ as the input noisy signal, $c = \frac{1}{\alpha} - 1$ and stepsize $b = \frac{1}{2+2c}$.

Proof. To solve the graph signal denoising problem (8), we use iterative gradient descent with stepsize $b$. Specifically, the $k$-th gradient descent step on problem (8) is:

$$\mathbf{F}^{(k)} \leftarrow \mathbf{F}^{(k-1)} - b \cdot \left.\frac{\partial \mathcal{L}}{\partial \mathbf{F}}\right|_{\mathbf{F} = \mathbf{F}^{(k-1)}} = (1 - 2b - 2bc)\mathbf{F}^{(k-1)} + 2b\mathbf{X} + 2bc\tilde{\mathbf{A}}\mathbf{F}^{(k-1)}, \tag{12}$$

where $\mathbf{F}^{(0)} = \mathbf{X}$. When we set the stepsize $b$ to $\frac{1}{2+2c}$, the coefficient $1 - 2b - 2bc$ vanishes and we obtain the following iterative steps:

$$\mathbf{F}^{(k)} \leftarrow \frac{1}{1+c}\mathbf{X} + \frac{c}{1+c}\tilde{\mathbf{A}}\mathbf{F}^{(k-1)}, \quad k = 1, \ldots, K, \tag{13}$$

which is equivalent to the iterative aggregation operation of the APPNP model in Eq. (7) with $\mathbf{X} = \mathbf{X}'_{in}$ and $\alpha = \frac{1}{1+c}$. This completes the proof.

These two connections provide a new explanation of the hyper-parameter $\alpha$ in PPNP and APPNP from the graph signal denoising perspective. Specifically, a smaller $\alpha$ indicates a larger $c$, which means the obtained $\mathbf{X}_{out}$ is enforced to be smoother over the graph.
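Both theorems can be sanity-checked from the denoising side. The sketch below, on assumed toy data, evaluates the gradient of the objective in Eq. (8) at the PPNP output (it should vanish, as in Theorem 1) and compares plain gradient descent with stepsize $b = 1/(2+2c)$ against the APPNP update step by step (Theorem 2). The helper names and the choice $\alpha = 0.1$ are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """D_hat^{-1/2} (A + I) D_hat^{-1/2}, as in Eq. (3)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def denoising_grad(F, X, L, c):
    """Gradient of ||F - X||_F^2 + c * tr(F^T L F) for a symmetric L."""
    return 2.0 * (F - X) + 2.0 * c * (L @ F)

# Toy 4-node path graph and a random noisy signal (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = normalized_adjacency(A)
L = np.eye(4) - A_tilde                        # normalized Laplacian used in the theorems
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))

alpha = 0.1
c = 1.0 / alpha - 1.0                          # mapping from Theorem 1
b = 1.0 / (2.0 + 2.0 * c)                      # stepsize from Theorem 2

# Theorem 1: the PPNP output is a stationary point of the denoising objective.
F_star = alpha * np.linalg.solve(np.eye(4) - (1 - alpha) * A_tilde, X)
grad_at_F_star = denoising_grad(F_star, X, L, c)

# Theorem 2: gradient descent and APPNP produce the same iterates.
F_gd, F_appnp = X.copy(), X.copy()
max_gap = 0.0
for _ in range(10):
    F_gd = F_gd - b * denoising_grad(F_gd, X, L, c)
    F_appnp = (1 - alpha) * (A_tilde @ F_appnp) + alpha * X
    max_gap = max(max_gap, float(np.abs(F_gd - F_appnp).max()))

print(np.abs(grad_at_F_star).max(), max_gap)
```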

3.2. CONNECTION TO GCN

We draw the connection between the GCN model (Kipf & Welling, 2016) and the graph signal denoising problem in Theorem 3.

Theorem 3. When we adopt the normalized Laplacian matrix $\mathbf{L} = \mathbf{I} - \tilde{\mathbf{A}}$, the feature aggregation operation in GCN (Eq. (2)) can be regarded as solving the graph signal denoising problem (8) using one-step gradient descent with $\mathbf{X}'_{in}$ as the input noisy signal and stepsize $b = \frac{1}{2c}$.

Proof. The gradient with respect to $\mathbf{F}$ at $\mathbf{X}$ is $\left.\frac{\partial \mathcal{L}}{\partial \mathbf{F}}\right|_{\mathbf{F}=\mathbf{X}} = 2c\mathbf{L}\mathbf{X}$. Hence, one-step gradient descent for the graph signal denoising problem (8) can be described as:

$$\mathbf{F} \leftarrow \mathbf{X} - b \cdot \left.\frac{\partial \mathcal{L}}{\partial \mathbf{F}}\right|_{\mathbf{F}=\mathbf{X}} = \mathbf{X} - 2bc\mathbf{L}\mathbf{X} = (1 - 2bc)\mathbf{X} + 2bc\tilde{\mathbf{A}}\mathbf{X}. \tag{14}$$

When the stepsize is $b = \frac{1}{2c}$ and $\mathbf{X} = \mathbf{X}'_{in}$, we have $\mathbf{F} \leftarrow \tilde{\mathbf{A}}\mathbf{X}'_{in}$, which is the same as the aggregation operation of GCN.

With this connection, it is easy to verify that a GCN model with multiple GCN layers can be regarded as solving the graph signal denoising problem multiple times with different noisy signals. Specifically, each layer of a GCN model corresponds to a graph signal denoising problem, where the input noisy signal is the output from the previous layer after the feature transformation of the current layer. Note that there are earlier works (NT & Maehara, 2019; Zhao & Akoglu, 2019) drawing a connection between GCN and the optimization problem in Eq. (8), where the aggregation operation in GCN is shown to be the first-order approximation of the exact solution.
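A short numerical check of Theorem 3 is sketched below on assumed toy data. Because the stepsize $b = 1/(2c)$ cancels the factor $2c$ in the gradient, the one-step update coincides with the GCN aggregation for any $c > 0$; the loop tries a few arbitrary values of $c$ to illustrate this.

```python
import numpy as np

def normalized_adjacency(A):
    """D_hat^{-1/2} (A + I) D_hat^{-1/2}, as in Eq. (3)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

# Toy 4-node path graph and random transformed features (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = normalized_adjacency(A)
L = np.eye(4) - A_tilde
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

gcn_out = A_tilde @ X                          # GCN aggregation, Eq. (2)

gaps = {}
for c in (0.5, 1.0, 4.0):                      # arbitrary positive balancing constants
    b = 1.0 / (2.0 * c)                        # stepsize from Theorem 3
    grad_at_X = 2.0 * c * (L @ X)              # the data term 2(F - X) vanishes at F = X
    one_step = X - b * grad_at_X               # Eq. (14)
    gaps[c] = float(np.abs(one_step - gcn_out).max())

print(gaps)
```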

3.3. CONNECTION TO GAT

To establish the connection between graph signal denoising and GAT (Veličković et al., 2017), in this subsection we adopt an unnormalized version of the Laplacian. It is defined based on the adjacency matrix with self-loops $\hat{\mathbf{A}}$, i.e., $\mathbf{L} = \hat{\mathbf{D}} - \hat{\mathbf{A}}$ with $\hat{\mathbf{D}}$ denoting the diagonal degree matrix of $\hat{\mathbf{A}}$. Then, the denoising problem in Eq. (8) can be rewritten from a node-centric view as:

$$\arg\min_{\mathbf{F}} \mathcal{L} = \sum_{i \in \mathcal{V}} \|\mathbf{F}[i,:] - \mathbf{X}[i,:]\|_2^2 + \frac{1}{2}\sum_{i \in \mathcal{V}} c \cdot \sum_{j \in \tilde{\mathcal{N}}(i)} \|\mathbf{F}[i,:] - \mathbf{F}[j,:]\|_2^2, \tag{15}$$

where $\tilde{\mathcal{N}}(i) = \mathcal{N}(i) \cup \{i\}$ denotes the neighbors (self-inclusive) of node $i$. In Eq. (15), the constant $c$ is shared by all nodes, which indicates that the same level of local smoothness is enforced on all nodes. However, nodes in a real-world graph can have varied local smoothness. For nodes with low local smoothness, we should impose a relatively smaller $c$, while for nodes with higher local smoothness, we need a larger $c$. Hence, instead of a unified $c$ as in Eq. (15), we could consider a node-dependent $c_i$ for each node $i$. Then, the optimization problem in Eq. (15) can be adjusted as:

$$\arg\min_{\mathbf{F}} \mathcal{L} = \sum_{i \in \mathcal{V}} \|\mathbf{F}[i,:] - \mathbf{X}[i,:]\|_2^2 + \frac{1}{2}\sum_{i \in \mathcal{V}} c_i \cdot \sum_{j \in \tilde{\mathcal{N}}(i)} \|\mathbf{F}[i,:] - \mathbf{F}[j,:]\|_2^2. \tag{16}$$

We next show that the aggregation operation in GAT is closely connected to an approximate solution of problem (16) with the help of the following theorem.

Theorem 4. With adaptive stepsize $b_i = 1/\sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)$ for each node $i$, the process of taking one step of gradient descent from $\mathbf{X}$ to solve problem (16) can be described as follows:

$$\mathbf{F}[i,:] \leftarrow \sum_{j \in \tilde{\mathcal{N}}(i)} b_i(c_i + c_j)\mathbf{X}[j,:]. \tag{17}$$

Proof. The gradient of the optimization problem in Eq. (16) with respect to $\mathbf{F}$, focusing on a node $i$, can be formulated as:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{F}[i,:]} = 2(\mathbf{F}[i,:] - \mathbf{X}[i,:]) + \sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)(\mathbf{F}[i,:] - \mathbf{F}[j,:]), \tag{18}$$

where $c_j$ in the second term appears since $i$ is also in the neighborhood of $j$. Then, the gradient at $\mathbf{X}$ is $\left.\frac{\partial \mathcal{L}}{\partial \mathbf{F}[i,:]}\right|_{\mathbf{F}=\mathbf{X}} = \sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)(\mathbf{X}[i,:] - \mathbf{X}[j,:])$. Thus, taking a step of gradient descent starting from $\mathbf{X}$ with stepsize $b_i$ can be described as follows:

$$\mathbf{F}[i,:] \leftarrow \mathbf{X}[i,:] - b_i \cdot \left.\frac{\partial \mathcal{L}}{\partial \mathbf{F}[i,:]}\right|_{\mathbf{F}=\mathbf{X}} = \Big(1 - b_i\sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)\Big)\mathbf{X}[i,:] + \sum_{j \in \tilde{\mathcal{N}}(i)} b_i(c_i + c_j)\mathbf{X}[j,:]. \tag{19}$$

Given $b_i = 1/\sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j)$, the first term vanishes and Eq. (19) can be rewritten as $\mathbf{F}[i,:] \leftarrow \sum_{j \in \tilde{\mathcal{N}}(i)} b_i(c_i + c_j)\mathbf{X}[j,:]$, which completes the proof.

Eq. (17) resembles the aggregation operation of GAT in Eq. (4) if we treat $b_i(c_i + c_j)$ as the attention score $\alpha_{ij}$. Note that we have $\sum_{j \in \tilde{\mathcal{N}}(i)} (c_i + c_j) = 1/b_i$ for all $i \in \mathcal{V}$. So, $(c_i + c_j)$ can be regarded as the pre-normalized attention score and $1/b_i$ can be regarded as the normalization constant. We further compare $b_i(c_i + c_j)$ with $\alpha_{ij}$ by investigating the formulation of $e_{ij}$ in Eq. (5). Eq. (5) can be rewritten as:

$$e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{X}'_{in}[i,:]\mathbf{a}_1 + \mathbf{X}'_{in}[j,:]\mathbf{a}_2\big), \tag{20}$$

where $\mathbf{a}_1 \in \mathbb{R}^d$ and $\mathbf{a}_2 \in \mathbb{R}^d$ are learnable column vectors, which can be concatenated to form $\mathbf{a}$ in Eq. (5). Comparing $e_{ij}$ with $(c_i + c_j)$, we find that they take a similar form. Specifically, $\mathbf{X}'_{in}[i,:]\mathbf{a}_1$ and $\mathbf{X}'_{in}[j,:]\mathbf{a}_2$ can be regarded as approximations of $c_i$ and $c_j$, respectively. The difference between $b_i(c_i + c_j)$ and $\alpha_{ij}$ is that the normalization in Eq. (17) for $b_i(c_i + c_j)$ is achieved via summation rather than a softmax as in Eq. (4) for $\alpha_{ij}$. Note that since GAT makes $c_i$ and $c_j$ learnable, it also includes a non-linear activation in calculating $e_{ij}$. By viewing the attention mechanism in GAT from the perspective of Eq. (17), namely that $c_i$ indicates a notion of local smoothness for node $i$, we can develop other ways to parameterize $c_i$. For example, instead of directly using the features of node $i$ as an indicator of local smoothness as GAT does, we can consider the neighborhood information. In fact, we adopt this idea to design a new aggregation operation in Section 5.
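Theorem 4 can be checked without relying on the hand-derived gradient in Eq. (18). The sketch below differentiates the objective of problem (16) by central finite differences, takes one step from $\mathbf{X}$ with the adaptive stepsizes $b_i$, and compares the result with the weighted average in Eq. (17). The node-wise factors `c`, the toy graph, and the helper names are assumptions for illustration.

```python
import numpy as np

def objective(F, X, M, c):
    """Problem (16): data term plus node-weighted local smoothness.

    M is the binary matrix A + I, so row i of M marks the neighborhood N~(i).
    """
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)   # ||F[i] - F[j]||^2
    return ((F - X) ** 2).sum() + 0.5 * (c[:, None] * M * sq).sum()

def numeric_grad(F, X, M, c, eps=1e-6):
    """Central finite differences of the objective, entry by entry."""
    G = np.zeros_like(F)
    for idx in np.ndindex(*F.shape):
        Fp, Fm = F.copy(), F.copy()
        Fp[idx] += eps
        Fm[idx] -= eps
        G[idx] = (objective(Fp, X, M, c) - objective(Fm, X, M, c)) / (2 * eps)
    return G

# Toy 4-node path graph, random signal, hypothetical smoothness factors.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = A + np.eye(4)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
c = rng.uniform(0.5, 2.0, size=4)

S = M * (c[:, None] + c[None, :])      # pre-normalized scores c_i + c_j on N~(i)
b = 1.0 / S.sum(axis=1)                # adaptive stepsizes b_i from Theorem 4
weights = b[:, None] * S               # b_i (c_i + c_j), the analogue of alpha_ij

step_numeric = X - b[:, None] * numeric_grad(X, X, M, c)   # one step from F = X
step_closed = weights @ X                                   # right-hand side of Eq. (17)
print(np.abs(step_numeric - step_closed).max())
```

Each row of `weights` sums to one, which mirrors the role of the softmax normalization in GAT.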

4. UGNN: A UNIFIED GNN FRAMEWORK VIA GRAPH SIGNAL DENOISING

In the previous section, we established that the aggregation operations in PPNP, APPNP, GCN and GAT are intimately connected to the graph signal denoising problem with (generalized) Laplacian regularization. In particular, from this perspective, all their aggregation operations aim to ensure feature smoothness: either a global smoothness over the graph as in PPNP, APPNP and GCN, or a local smoothness for each node as in GAT. This understanding allows us to develop a unified feature aggregation operation by posing the following, more general graph signal denoising problem:

Problem 2 (Generalized UGNN Graph Signal Denoising Problem).

$$\arg\min_{\mathbf{F}} \mathcal{L} = \|\mathbf{F} - \mathbf{X}\|_F^2 + r(\mathcal{C}, \mathbf{F}, \mathcal{G}), \tag{21}$$

where $r(\mathcal{C}, \mathbf{F}, \mathcal{G})$ denotes a flexible regularization term that enforces some prior over $\mathbf{F}$. Note that we overload the notation $\mathcal{C}$ here: it can function as a scalar (like the global constant in GCN), a vector (like the node-wise constants in GAT) or even a matrix (edge-wise constants) if we want to give flexibility to each node pair. Different choices of $r(\cdot)$ imply different feature aggregation operations. Besides PPNP, APPNP, GCN and GAT, aggregation operations in further GNN models such as PairNorm (Zhao & Akoglu, 2019) and DropEdge (Rong et al., 2019) can be associated with Problem 2 under different regularization terms (more details can be found in Appendix B). The above-mentioned regularization terms are all related to the Laplacian regularization. Other regularization terms can also be adopted, which may lead to novel designs of GNN layers. For example, if we aim to enforce that the clean signal is piece-wise linear, we can adopt $r(\mathcal{C}, \mathbf{F}, \mathcal{G}) = \mathcal{C} \cdot \|\mathbf{L}\mathbf{F}\|_1$, designed for trend filtering (Tibshirani et al., 2014; Wang et al., 2016).

With these discussions, we propose a unified framework (UGNN) to design GNN layers from the graph signal processing perspective as: (1) Design a graph regularization term $r(\mathcal{C}, \mathbf{F}, \mathcal{G})$ in Problem 2 according to the specific application; (2) Feature Transformation: $\mathbf{X}'_{in} = f_{trans}(\mathbf{X}_{in})$; and (3) Feature Aggregation: solve Problem 2 with $\mathbf{X} = \mathbf{X}'_{in}$ and the designed $r(\mathcal{C}, \mathbf{F}, \mathcal{G})$. To demonstrate the potential of UGNN, we next introduce a new GNN model, ADA-UGNN, by instantiating UGNN with an $r(\mathcal{C}, \mathbf{F}, \mathcal{G})$ that enforces adaptive local smoothness across nodes. Note that we introduce ADA-UGNN with node classification as the downstream task.
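The three steps of the framework can be sketched as a layer that takes the regularizer's gradient as a plug-in and solves Problem 2 approximately by gradient descent. This is only one possible reading, assuming a differentiable regularizer and a fixed stepsize; the function `ugnn_layer`, its arguments, and the toy data are hypothetical. As a usage example, plugging in the Laplacian regularizer with $\mathbf{L} = \mathbf{I} - \tilde{\mathbf{A}}$ and stepsize $1/(2+2c)$ reproduces the APPNP iterates.

```python
import numpy as np

def ugnn_layer(X_in, f_trans, reg_grad, step, K):
    """Hypothetical UGNN layer: transform, then K gradient steps on Problem 2.

    reg_grad(F) must return the gradient of the chosen regularizer r at F.
    """
    X_trans = f_trans(X_in)                            # step (2): feature transformation
    F = X_trans
    for _ in range(K):                                 # step (3): approximate aggregation
        F = F - step * (2.0 * (F - X_trans) + reg_grad(F))
    return F

# Toy 4-node path graph with self-loops and symmetric normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_tilde = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
L = np.eye(4) - A_tilde

rng = np.random.default_rng(0)
X_in = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))                            # a linear stand-in for f_trans

# Step (1): choose the regularizer, here r(F) = c * tr(F^T L F).
c = 9.0
out = ugnn_layer(X_in,
                 f_trans=lambda X: X @ W,
                 reg_grad=lambda F: 2.0 * c * (L @ F),
                 step=1.0 / (2.0 + 2.0 * c),
                 K=10)

# Reference: APPNP with alpha = 1 / (1 + c) on the same transformed features.
alpha = 1.0 / (1.0 + c)
ref = X_in @ W
for _ in range(10):
    ref = (1 - alpha) * (A_tilde @ ref) + alpha * (X_in @ W)
print(np.abs(out - ref).max())
```

Passing `step` as an array of shape `(N, 1)` would give node-wise stepsizes through broadcasting.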

5. ADA-UGNN: ADAPTIVE LOCAL SMOOTHING WITH UGNN

From the graph signal denoising perspective, PPNP, APPNP, and GCN enforce global smoothness by penalizing the difference with a constant $\mathcal{C}$ for all nodes. However, real-world graphs may consist of multiple groups of nodes which behave differently in how they connect to similar neighbors. For example, Section 6.1 shows several graphs with varying distributions of local smoothness (as measured by label homophily): in summary, not all nodes are highly label-homophilic, and some nodes have considerably "noisier" neighborhoods than others. Moreover, as suggested by Wu et al. (2019) and Jin et al. (2020), adversarial attacks on graphs tend to promote such label noise by connecting nodes from different classes and disconnecting nodes from the same class, producing graphs with varying local smoothness across nodes. Under these scenarios, a constant $\mathcal{C}$ might not be optimal, and adaptive (i.e., non-constant) smoothness for different nodes is desired. As shown in Section 3.3 by viewing GAT's aggregation as a solution to regularized graph signal denoising, GAT can be regarded as adopting an adaptive $\mathcal{C}$ for different nodes, which facilitates adaptive local smoothness. However, in GAT, the graph denoising problem is solved by a single step of gradient descent, which might still be suboptimal. Furthermore, when modeling the local smoothness factor $c_i$ in Eq. (17), GAT only uses the features of node $i$ as input, which may not be optimal: if $c_i$ is understood as local smoothness, it should be intrinsically related to the neighborhood of node $i$. In this section, we adapt this notion directly into the UGNN framework by introducing a new regularization term, and develop a resulting GNN model (ADA-UGNN) which aims to enforce adaptive local smoothness on nodes in a different manner from GAT. We then utilize an iterative gradient descent method to approximate the optimal solution of Problem 2 with the following regularization term:

$$r(\mathcal{C}, \mathbf{F}, \mathcal{G}) = \frac{1}{2}\sum_{i \in \mathcal{V}} \mathcal{C}_i \sum_{j \in \tilde{\mathcal{N}}(i)} \left\|\frac{\mathbf{F}[i,:]}{\sqrt{d_i}} - \frac{\mathbf{F}[j,:]}{\sqrt{d_j}}\right\|_2^2, \tag{22}$$

where $d_i$ and $d_j$ denote the degrees of nodes $i$ and $j$, respectively, and $\mathcal{C}_i$ indicates the smoothness factor of node $i$, which is assumed to be a fixed scalar. Note that the above regularization term can be regarded as a generalized version of the regularization term used in PPNP, APPNP, and GCN. Similar to PPNP and APPNP, ADA-UGNN consists of only a single GNN layer. However, ADA-UGNN assumes adaptive local smoothness. We next describe the feature transformation and aggregation operations of ADA-UGNN, and show how to derive the model via UGNN.

5.1. FEATURE TRANSFORMATION

Similar to PPNP and APPNP, we adopt an MLP for the feature transformation. Specifically, for a node classification task, the dimension of the output of the feature transformation $\mathbf{X}'_{in}$ is the number of classes in the graph.

5.2. FEATURE AGGREGATION

We use iterative gradient descent to solve Problem 2 with the regularization term in Eq. (22). The iterative gradient descent steps are stated in the following theorem, whose proof can be found in Appendix A.1.

Theorem 5. With adaptive stepsize $b_i = 1/\big(2 + \sum_{j \in \tilde{\mathcal{N}}(i)} (\mathcal{C}_i + \mathcal{C}_j)/d_i\big)$ for each node $i$, the iterative gradient descent steps to solve Problem 2 with the regularization term in Eq. (22) are as follows:

$$\mathbf{F}^{(k)}[i,:] \leftarrow 2b_i\mathbf{X}[i,:] + b_i\sum_{j \in \tilde{\mathcal{N}}(i)} (\mathcal{C}_i + \mathcal{C}_j)\frac{\mathbf{F}^{(k-1)}[j,:]}{\sqrt{d_i d_j}}, \quad k = 1, \ldots, \tag{23}$$

where $\mathbf{F}^{(0)}[i,:] = \mathbf{X}[i,:]$.

The iterative steps in Eq. (23) are guaranteed to converge, as stated in the following theorem, whose proof can be found in Appendix A.2.

Theorem 6. The iterative steps in Eq. (23) are guaranteed to converge to the optimal solution of Problem 2 with Eq. (22) as the regularization term.

Following the iterative solution in Eq. (23), we model the aggregation operation (for node $i$) of ADA-UGNN as follows:

$$\mathbf{X}^{(k)}_{out}[i,:] \leftarrow 2b_i\mathbf{X}'_{in}[i,:] + b_i\sum_{j \in \tilde{\mathcal{N}}(i)} (\mathcal{C}_i + \mathcal{C}_j)\frac{\mathbf{X}^{(k-1)}_{out}[j,:]}{\sqrt{d_i d_j}}, \quad k = 1, \ldots, K, \tag{24}$$

where $K$ is the number of gradient descent iterations, $\mathcal{C}_i$ can be considered as a positive scalar that controls the level of "local smoothness" for node $i$, and $b_i$ can be calculated from $\{\mathcal{C}_j \mid j \in \tilde{\mathcal{N}}(i)\}$ as $b_i = 1/\big(2 + \sum_{j \in \tilde{\mathcal{N}}(i)} (\mathcal{C}_i + \mathcal{C}_j)/d_i\big)$. However, in practice, $\mathcal{C}_i$ is usually unknown. One possible solution is to treat each $\mathcal{C}_i$ as a hyper-parameter. This is impractical for all nodes, since there are $N$ of them in total and we have no prior knowledge about them. Thus, we model $\mathcal{C}_i$ as a function of the neighborhood information of node $i$ as follows:

$$\mathcal{C}_i = s \cdot \sigma\Big(h_1\big(h_2\big(\{\mathbf{X}'_{in}[j,:] \mid j \in \tilde{\mathcal{N}}(i)\}\big)\big)\Big), \tag{25}$$

where $h_2(\cdot)$ is a function that transforms the neighborhood information of node $i$ into a vector, while $h_1(\cdot)$ further transforms it into a scalar. $\sigma(\cdot)$ denotes the sigmoid function, which maps the output scalar of $h_1(\cdot)$ to $(0, 1)$, and $s$ can be treated as a hyper-parameter controlling the upper bound of $\mathcal{C}_i$. $h_1(\cdot)$ can be modeled as a single-layer fully-connected neural network. There are different designs for $h_2(\cdot)$, such as channel-wise variance or mean (Corso et al., 2020). In this paper, we adopt channel-wise variance as the $h_2(\cdot)$ function. In this case, the calculation of $\mathcal{C}_i$ in Eq. (25) only involves $H$ parameters, with $H$ denoting the number of classes in the dataset. APPNP can be regarded as a special case of ADA-UGNN, where $h_2(\cdot)$ is modeled as a constant function producing 1 as the output for all nodes. For the node classification task, the representation $\mathbf{X}^{(K)}_{out}$, which is obtained after $K$ iterations as in Eq. (24), is directly softmax-normalized row-wise, and its $i$-th row indicates the discrete class distribution of node $i$.
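The aggregation in Eq. (24) with smoothness factors from Eq. (25) can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: $d_i$ is taken as the degree including the self-loop, $h_1$ is a bias-free linear map with weight vector `w` (matching the stated $H$ parameters), the variance is computed over the self-inclusive neighborhood, and all names and data are illustrative. The last lines run the same routine with a constant factor $\mathcal{C}_i = 9$ alongside APPNP with $\alpha = 0.1$ for comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothness_factors(M, X_trans, w, s):
    """C_i = s * sigmoid(h1(h2(.))): h2 = channel-wise neighborhood variance, h1 = linear map w."""
    cnt = M.sum(axis=1, keepdims=True)                  # |N~(i)|
    mean = (M @ X_trans) / cnt
    var = (M @ (X_trans ** 2)) / cnt - mean ** 2        # per-channel variance over N~(i)
    return s * sigmoid(var @ w)

def ada_ugnn_aggregate(M, X_trans, C, K):
    """K iterations of Eq. (24); M = A + I, degrees include the self-loop (assumption)."""
    d = M.sum(axis=1)
    S = M * (C[:, None] + C[None, :])                   # C_i + C_j on N~(i)
    b = 1.0 / (2.0 + S.sum(axis=1) / d)                 # adaptive stepsizes b_i
    P = S / np.sqrt(d[:, None] * d[None, :])            # (C_i + C_j) / sqrt(d_i d_j)
    X_out = X_trans
    for _ in range(K):
        X_out = b[:, None] * (2.0 * X_trans + P @ X_out)
    return X_out

# Toy 4-node path graph, H = 3 classes, random MLP outputs (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = A + np.eye(4)
rng = np.random.default_rng(0)
X_trans = rng.normal(size=(4, 3))
w = rng.normal(size=3)                                  # the H parameters of h1
s = 9.0                                                 # upper bound of C_i

C = smoothness_factors(M, X_trans, w, s)
X_out = ada_ugnn_aggregate(M, X_trans, C, K=10)
logits = X_out - X_out.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # row-wise softmax

# Constant factors C_i = 9, compared against APPNP with alpha = 1 / (1 + 9).
X_const = ada_ugnn_aggregate(M, X_trans, np.full(4, 9.0), K=10)
d = M.sum(axis=1)
A_tilde = M / np.sqrt(d[:, None] * d[None, :])
X_appnp = X_trans
for _ in range(10):
    X_appnp = 0.9 * (A_tilde @ X_appnp) + 0.1 * X_trans
print(np.abs(X_const - X_appnp).max())
```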

6. EXPERIMENT

In this section, we evaluate how the proposed ADA-UGNN handles graphs with varying local smoothness. We conduct node classification experiments on natural graphs, and also evaluate the model's robustness under adversarial attacks. We note that our main goal in proposing/evaluating ADA-UGNN is to demonstrate the promise of deriving new aggregations as solutions of denoising problems, rather than state-of-the-art performance.

6.1. NODE CLASSIFICATION

In this section, we conduct the node classification task. We first introduce the datasets and the experimental settings in Section 6.1.1 and then present the results in Section 6.1.2.

6.1.1. DATASETS AND EXPERIMENTAL SETTINGS

We conduct the node classification task on 8 datasets from various domains including citation, social, co-authorship and co-purchase networks. Specifically, we use three citation networks, CORA, CITESEER, and PUBMED (Sen et al., 2008); one social network, BLOGCATALOG (Huang et al., 2017); two co-authorship networks, COAUTHOR-CS and COAUTHOR-PH (Shchur et al., 2018); and two co-purchase networks, AMAZON-COMP and Amazon Photos (Shchur et al., 2018). Descriptions and detailed statistics of these datasets can be found in Appendix C.1. To provide a sense of the local smoothness properties of these datasets, in addition to the summary statistics, we also illustrate the local label smoothness distributions in Appendix C.1.1: here, we define the local label smoothness of a node as the ratio of nodes in its neighborhood that share the same label (see the formal definition in Eq. (34) in Appendix C.1.1). Notably, the variety in local label smoothness within several real-world datasets, also observed in Shah (2020), clearly motivates the adaptive smoothness assumption in ADA-UGNN. For the citation networks, we use the standard split as provided in Kipf & Welling (2016); Yang et al. (2016). For BLOGCATALOG, we adopt the split provided in Zhao et al. (2020). For both the citation networks and BLOGCATALOG, the experiments are run with 30 random seeds and the average results are reported. For co-authorship and co-purchase networks, we utilize 20 labels per class for training, 30 nodes per class for validation and the remaining nodes for testing. This process is repeated 20 times, which results in 20 different training/validation/test splits. For each split, the experiment is repeated 20 times with different initializations. The average results over the 20 × 20 experiments are reported. We compare our method with the methods introduced in Section 2, including GCN, GAT and APPNP. Note that we do not include PPNP, as it is difficult to scale to most of the datasets due to the calculation of the inverse in Eq. (6). For all methods, we tune the hyper-parameters from the following options: 1) learning rate: {0.005, 0.01, 0.05}; 2) weight decay: {5e-04, 5e-05, 5e-06, 5e-07, 5e-08}; and 3) dropout rate: {0.2, 0.5, 0.8}. For APPNP and our method, we further tune the number of iterations $K$ and the upper bound $s$ for $\mathcal{C}_i$ in Eq. (25) from the following ranges: 1) $K$: {5, 10}; and 2) $s$: {1, 9, 19}. Note that we treat APPNP as a special case of our proposed method with $h_2(\cdot) = 1$.

6.1.2. PERFORMANCE COMPARISON

The performance comparison is shown in Table 1, where a t-test is used to test significance. First, GAT outperforms GCN on most datasets, which indicates that modeling adaptive local smoothness is helpful. Second, APPNP/ADA-UGNN outperform GCN/GAT in most settings, suggesting that iterative gradient descent may offer advantages over a single gradient step, since it can reach a solution closer to the optimum. Third, and most notably, the proposed ADA-UGNN achieves consistently better performance than GCN/GAT, and outperforms or matches the state-of-the-art APPNP across datasets. Notice that on some datasets such as CORA, CITESEER, and COAUTHOR-PH, the improvements of the proposed model over APPNP are not very significant. Figure 3 in Appendix C.1.1 shows that these datasets have extremely skewed local label smoothness distributions, with the majority of nodes having perfect (1.0) label homophily (they are only connected to other nodes of the same label). APPNP shines in such cases, since its assumption of $h_2(\cdot) = 1$ is ideal for these nodes (designating maximal local smoothness). Conversely, our model has to learn $h_2(\cdot)$, which in such skewed cases may be quite challenging and unfruitful. On the other hand, for datasets with higher diversity in local label smoothness across nodes, such as BLOGCATALOG and AMAZON-COMP, the proposed ADA-UGNN achieves more significant improvements. To validate this further, we partition the nodes in the test set of each dataset into two groups: (1) high smoothness: those with local label smoothness >0.5, and (2) low smoothness: those with ≤0.5, and evaluate the accuracy of APPNP and the proposed ADA-UGNN for each group. The results for CORA, BLOGCATALOG, AMAZON-COMP and COAUTHOR-CS are presented in Figure 1, while the results for the remaining datasets can be found in Figure 4 in Appendix C.2.
Figure 1 clearly shows that ADA-UGNN consistently improves performance for low-smoothness nodes in most datasets, while keeping comparable (or marginally worse) performance for high-smoothness nodes. In cases where many nodes have low-level smoothness (like BLOGCATALOG or AMAZON-COMP), our method can notably improve overall performance.
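The grouping used in this analysis can be sketched as follows: compute, for each node, the fraction of its neighbors that share its label, then split nodes at the 0.5 threshold. The formal definition is given in the appendix and is not reproduced here, so the sketch assumes that the neighborhood excludes the node itself and that isolated nodes receive a smoothness of 0; the toy graph and labels are illustrative.

```python
import numpy as np

def local_label_smoothness(A, labels):
    """Fraction of each node's neighbors (self excluded, by assumption) sharing its label."""
    same = (labels[:, None] == labels[None, :]).astype(float)
    deg = A.sum(axis=1)
    return (A * same).sum(axis=1) / np.maximum(deg, 1)   # isolated nodes map to 0

# Toy 4-node path graph 0-1-2-3 with two classes (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])

smoothness = local_label_smoothness(A, labels)
high = smoothness > 0.5                                  # high-smoothness group
low = ~high                                              # low-smoothness group (<= 0.5)
print(smoothness, high)
```

Per-group accuracy then follows by averaging a correctness indicator over `high` and `low` restricted to test nodes.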

6.2. ROBUSTNESS UNDER ADVERSARIAL ATTACKS

Adversarial attacks on graphs tend to connect nodes from different classes and remove edges between nodes from the same class (Wu et al., 2019; Jin et al., 2020) , producing graphs with varying local label smoothness after attack (we demonstrate this in Appendix C.3). To further demonstrate that ADA-UGNN can handle graphs with varying local label smoothness better than alternatives, we conduct experiments to show its robustness under adversarial attacks. Specifically, we adopt Mettack (Zügner & Günnemann, 2019) to perform the attacks. Mettack produces non-targeted attacks which aim to impair test set node classification performance by strategically adding or removing edges from the victim graph. We utilize the attacked graphs (5%-25% perturb rate) from Jin et al. (2020) and follow the same setting, i.e., each method is run with 10 random seeds and the average performance is reported. These attacked graphs are generated from CORA, CITESEER and PUBMED, respectively and only the largest connected component is retained in each graph. Furthermore, the training, validation and test split ratio is 10/10/80%, which is different from the standard splits we use in Section 6.1. Thus, the performances reported in this section is not directly comparable with those in the previous section. We compare our method both with standard GNNs discussed in Section 2 (GCN, GAT, APPNP), but also with recent state-of-the-art defense techniques against adversarial attacks including GCN-Jaccard (Wu et al., 2019) , GCN-SVD (Entezari et al., 2020) , Pro-GNN-fs and Pro-GNN (Jin et al., 2020) . The detailed description of these methods can be found at 2 . Again, we observe that GAT outperforms GCN, suggesting the appeal of an adaptive local smoothness assumption. Here, our method (orange) substantially outperforms GCN, GAT and APPNP by a large margin, especially in scenarios with high perturbation rate. 
Moreover, thanks to its adaptive smoothness assumption, the proposed ADA-UGNN is even more robust than several specially designed adversarial defense methods, such as GCN-Jaccard and GCN-SVD, which pre-process the attacked graphs to obtain cleaner ones. Compared with Pro-GNN-fs, our method performs comparably or even better in a few settings, especially when the perturbation rate is high. Furthermore, in these settings, the performance of our method comes even closer to that of Pro-GNN, which is the current state-of-the-art adversarial defense technique. Note that Pro-GNN-fs and Pro-GNN involve learning cleaner adjacency matrices of the attacked graphs and thus have $O(M)$ parameters ($M$ denotes the number of edges in a graph), while our proposed model has far fewer parameters. Specifically, we have $O(d_{in} \cdot d_{out})$ parameters for feature transformation and $H$ parameters for modelling $h_1(\cdot)$, with $H$ denoting the number of labels.

7. RELATED WORKS

There are mainly two streams of work in developing GNN models, i.e., spectral-based and spatial-based. When designing spectral-based GNNs, graph convolution (Shuman et al., 2013), defined based on spectral theory, is utilized to design graph neural network layers together with the feature transformation and non-linearity (Bruna et al., 2013; Henaff et al., 2015; Defferrard et al., 2016). These designs of the spectral-based graph convolution are closely related to graph signal processing, and they can be regarded as graph filters. Low-pass graph filters can usually be adopted to denoise graph signals (Chen et al., 2014). In fact, most algorithms discussed in our work can be regarded as low-pass graph filters. With the emergence of GCN (Kipf & Welling, 2016), which can be regarded as both a simplified spectral-based and a spatial-based graph convolution operator, numerous spatial-based GNN models have since been developed (Hamilton et al., 2017; Veličković et al., 2017; Monti et al., 2017; Gao et al., 2018; Gilmer et al., 2017).

Graph signal denoising aims to infer a cleaner graph signal from a noisy one, and can usually be formulated as a graph-regularized optimization problem (Chen et al., 2014). Recently, several works have connected GCN to graph signal denoising with Laplacian regularization (NT & Maehara, 2019; Zhao & Akoglu, 2019), finding that the aggregation process in GCN models can be regarded as a first-order approximation of the optimal solution of the denoising problem. On the other hand, GNNs have also been utilized to develop novel algorithms for graph denoising (Chen et al., 2020). Unlike these works, our paper details how a family of GNN models can be unified under a graph signal denoising perspective, and demonstrates its promise for new architecture design.
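
As a rough illustration of the denoising view of GCN mentioned above, the following NumPy sketch assumes the objective $\|F - X\|_F^2 + c \cdot \mathrm{tr}(F^\top L F)$ with $L = I - \tilde{A}$, where $\tilde{A}$ is the symmetrically normalized adjacency matrix with self-loops. The function names, the dense-matrix representation, and the weight `c` are illustrative assumptions rather than the exact formulation used in this paper. Under these assumptions, one gradient step started at $F = X$ with stepsize $1/(2c)$ yields $\tilde{A}X$, i.e., a GCN-style aggregation.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense symmetric adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def denoise_exact(x, lap, c):
    """Closed-form minimizer of ||F - X||_F^2 + c * tr(F^T L F): solve (I + cL) F = X."""
    n = lap.shape[0]
    return np.linalg.solve(np.eye(n) + c * lap, x)

def denoise_one_step(x, lap, c, step):
    """One gradient step on the same objective, started at F = X."""
    # Gradient is 2(F - X) + 2cLF, which reduces to 2cLX at F = X.
    grad = 2.0 * c * (lap @ x)
    return x - step * grad
```

With `lap = np.eye(n) - normalized_adjacency(adj)` and `step = 1 / (2 * c)`, `denoise_one_step` returns `normalized_adjacency(adj) @ x`, whereas `denoise_exact` gives the fully smoothed signal that repeated gradient steps would approach.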

8. CONCLUSION

In this paper, we show how various representative GNN models, including GCN, PPNP, APPNP and GAT, can be unified mathematically as natural instances of graph denoising problems. Specifically, the aggregation operations in these models can be regarded as exactly or approximately solving such denoising problems subject to Laplacian regularization. With these observations, we propose a general framework, UGNN, which enables the design of new GNN models from the denoising perspective via regularizer design. As an example demonstrating the promise of this paradigm, we instantiate the UGNN framework with a regularizer addressing adaptive local smoothness across nodes, a property prevalent in several real-world graphs, and propose and evaluate a suitable new GNN model, ADA-UGNN.

B CONNECTIONS TO PAIRNORM AND DROPEDGE

PairNorm and DropEdge, two recently proposed GNN enhancements for developing deeper GNN models, correspond to the following regularization terms:

PairNorm: $\sum_{(i,j) \in E} C_p \cdot \|F[i,:] - F[j,:]\|_2^2 - \sum_{(i,j) \notin E} C_n \cdot \|F[i,:] - F[j,:]\|_2^2$,

DropEdge: $\sum_{(i,j) \in E} C_{ij} \cdot \|F[i,:] - F[j,:]\|_2^2$,

where $C_{ij} \in \{0, 1\}$. For PairNorm, $C$ consists of $C_p, C_n > 0$, and the regularization term encourages connected nodes to be similar and disconnected nodes to be dissimilar. For DropEdge, $C$ is a sparse matrix with the same shape as the adjacency matrix. For each edge $(i, j)$, the corresponding $C_{ij}$ is sampled from a Bernoulli distribution with mean $1 - q$, where $q$ is a pre-defined dropout rate.

C EXPERIMENTS

In this section, we provide information on the datasets used in the experiments:

• Citation Networks: CORA, CITESEER and PUBMED are widely adopted benchmarks for GNN models. In these graphs, nodes represent documents and edges denote the citation links between them. Each node is associated with bag-of-words features of its corresponding document and a label indicating the research field of the document.
• Blogcatalog: BLOGCATALOG is an online blogging community where bloggers can follow each other. The BLOGCATALOG graph consists of bloggers as nodes and their social relations as edges. Each blogger is associated with features generated from keywords of his/her blogs. The bloggers are labeled according to their interests.
• Co-purchase Graphs: AMAZON-COMP and AMAZON-PHOTO are co-purchase graphs, where nodes represent items and edges indicate that two items are frequently bought together. Each item is associated with bag-of-words features extracted from its corresponding reviews. The label of an item is given by its category.
• Co-authorship Graphs: COAUTHOR-CS and COAUTHOR-PH are co-authorship graphs, where nodes are authors and edges indicate co-authorship between authors. Each author is associated with features representing the keywords of his/her papers. The label of an author indicates his/her most active research field.

Some statistics of these graphs are shown in Table 2.

C.3 LOCAL SMOOTHNESS DISTRIBUTION OF ATTACKED GRAPH

Graph adversarial attacks tend to connect nodes from different classes while disconnecting nodes from the same class, which typically leads to more diverse distributions of local smoothness levels. We present the local smoothness distributions of the graphs generated by Mettack (Zügner & Günnemann, 2019) with different perturbation rates for CORA, CITESEER and PUBMED in Figure 5, Figure 6 and Figure 7, respectively.
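
To make the notion of a local smoothness distribution concrete, the sketch below computes, for every node, the fraction of its neighbors that share its label. This is one plausible reading of local label smoothness (a value of 1.0 means a node is only connected to nodes of the same label); the function name, the dense adjacency representation, and the NaN convention for isolated nodes are assumptions made for illustration.

```python
import numpy as np

def local_label_smoothness(adj, labels):
    """Fraction of each node's neighbors that share its label.

    adj:    dense (N, N) symmetric 0/1 adjacency matrix without self-loops.
    labels: length-N sequence of class labels.
    Isolated nodes (no neighbors) are assigned NaN.
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)  # 1 where labels agree
    deg = adj.sum(axis=1)                                      # number of neighbors
    agree = (adj * same).sum(axis=1)                           # same-label neighbors
    out = np.full(adj.shape[0], np.nan)
    mask = deg > 0
    out[mask] = agree[mask] / deg[mask]
    return out
```

A histogram of the returned values, computed on a clean graph and on its attacked counterparts, would give a picture comparable in spirit to the distributions referred to above.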

C.4 BASELINES FOR ADVERSARIAL DEFENSE

In this section, we describe the defense algorithms adopted in Section 6.2:
• GCN-Jaccard (Wu et al., 2019): GCN-Jaccard pre-processes a given attacked graph by removing edges added by the attackers. Specifically, Jaccard similarity is utilized



$C_i = s \cdot \sigma\left(h_1\left(h_2\left(\{X_{in}[j,:] \mid j \in \tilde{N}(i)\}\right)\right)\right), \quad (25)$



Figure 1: Accuracy for nodes with low and high local label smoothness.

COAUTHOR-PH, the improvements of the proposed model compared with APPNP are not very significant. Figure 3 in Appendix C.1.1 shows that these datasets have extremely skewed local label smoothness distributions, with the majority of nodes having perfect (1.0) label homophily, i.e., they are only connected to other nodes of the same label. APPNP shines in such cases, since its assumption of $h_2(\cdot) = 1$ is ideal for these nodes (designating maximal local smoothness). Conversely, our model has to learn $h_2(\cdot)$, which can be difficult and unfruitful in such skewed cases. On the other hand, for datasets with higher diversity in local label smoothness across nodes, such as BLOGCATALOG and AMAZON-COMP, the proposed ADA-UGNN achieves more significant improvements.

Figure 3: Distribution of local label smoothness (homophily) on different graph datasets: note the non-homogeneity of smoothness values.

Figure 8: ADA-UGNN performance (test accuracy) under different numbers of gradient steps (K).

Node Classification Accuracy on Various Datasets

Table 2: Dataset summary statistics.

A PROOFS

A.1 PROOF OF THEOREM 5

Theorem 5. With adaptive stepsize $b_i = 1/\left(2 + \sum_{v_j \in \tilde{N}(v_i)} (C_i + C_j)/d_i\right)$ for each node $v_i$, the iterative gradient descent steps to solve Problem 2 with the regularization term in Eq. (22) are as follows:

where

Proof. The gradient of the optimization problem 2 with the regularization term in Eq. (22) with respect to $F$ (focusing on node $i$) is as follows:

where $C_j$ in the second term appears since node $i$ is also in the neighborhood of node $j$. The iterative gradient descent steps with adaptive stepsize $b_i$ can be formulated as follows:

With the gradient in Eq. (27), the iterative steps in Eq. (28) can be rewritten as:

With $b_i = 1/\left(2 + \sum_{v_j \in \tilde{N}(v_i)} (C_i + C_j)/d_i\right)$, the iterative steps in Eq. (29) can be rewritten as follows:

with $F^{(0)}[i,:] = X[i,:]$, which completes the proof.

A.2 PROOF OF THEOREM 6

Theorem 6. The iterative steps in Eq. (23) are guaranteed to converge to the optimal solution of Problem 2 with Eq. (22) as the regularization term.

Proof. By taking the second derivative with respect to $F[i,:]$, we obtain the Hessian matrix as:

which implies the Lipschitz constant of the gradient in Eq. (27) is

To guarantee convergence, the stepsize $b_i$ for node $i$ should be smaller than $2/\left(2 + \sum_{v_j \in \tilde{N}(i)} (C_i + C_j)/d_i\right)$ (Nesterov, 2013). The stepsize we adopt in Theorem 5 is $b_i = 1/\left(2 + \sum_{j \in \tilde{N}(i)} (C_i + C_j)/d_i\right)$, which satisfies this condition.

to measure the feature similarity between connected pairs of nodes. The edges between node pairs with low similarity are removed by the algorithm. This pre-processed graph is then utilized for the node classification task.
• GCN-SVD (Entezari et al., 2020): GCN-SVD is also a pre-processing method. It uses SVD to decompose the adjacency matrix of a given perturbed graph and then obtains its low-rank approximation. The low-rank approximation is believed to be cleaner, as graph adversarial attacks are observed to be high-rank by Entezari et al. (2020).
• Pro-GNN (Jin et al., 2020): Pro-GNN tries to learn a cleaner graph while training the node classification model at the same time. Specifically, it treats the adjacency matrix as parameters, which are optimized during the training stage. Several constraints are imposed on this learnable adjacency matrix: 1) the learned adjacency matrix should be close to the original adjacency matrix; 2) the learned adjacency matrix should be low-rank; and 3) the learned adjacency matrix should ensure feature smoothness. Pro-GNN-fs is a variant of Pro-GNN where the third constraint, i.e., feature smoothness, is not enforced.
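
The two pre-processing defenses described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of either method: the function names, the `threshold` and `rank` parameters, the assumption of binary node features, and the dense-matrix representation are all hypothetical choices made for the example.

```python
import numpy as np

def jaccard_prune(adj, features, threshold):
    """Remove edges whose endpoints have Jaccard feature similarity below `threshold`.

    adj:      dense (N, N) symmetric 0/1 adjacency matrix.
    features: (N, d) binary feature matrix (non-zero entries count as present).
    Returns a pruned copy; the input adjacency matrix is left unchanged.
    """
    pruned = np.array(adj, dtype=float)          # np.array makes a copy
    feats = np.asarray(features) > 0
    rows, cols = np.nonzero(np.triu(pruned, k=1))  # visit each undirected edge once
    for i, j in zip(rows, cols):
        union = np.logical_or(feats[i], feats[j]).sum()
        inter = np.logical_and(feats[i], feats[j]).sum()
        sim = inter / union if union > 0 else 0.0
        if sim < threshold:
            pruned[i, j] = pruned[j, i] = 0.0
    return pruned

def low_rank_adjacency(adj, rank):
    """Rank-`rank` approximation of the adjacency matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(np.asarray(adj, dtype=float))
    # Keep only the leading `rank` singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]
```

In both cases the resulting matrix would then be passed to a standard GCN in place of the attacked adjacency matrix. Note that the low-rank approximation is generally dense and real-valued, so a practical pipeline has to decide how to use or sparsify it.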

C.5 INVESTIGATION ON NUMBER OF GRADIENT DESCENT STEPS IN ADA-UGNN

In this section, we conduct experiments to check how the performance of ADA-UGNN is affected by K. For each K, we run the experiments on the standard splits of CORA, CITESEER and PUBMED with 30 random seeds (i.e., the same setting as in Section 6.1), and the average performance is reported. As shown in Figure 8, the performance increases quickly with K when K is relatively small. Once K becomes large, the performance either grows slowly or fluctuates slightly as K increases further.

