A UNIFIED FRAMEWORK FOR CONVOLUTION-BASED GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Graph Convolutional Networks (GCNs) have attracted substantial research interest in the machine learning community in recent years. Although many variants have been proposed, we still lack a systematic view of the different GCN models and a deep understanding of the relations among them. In this paper, we take a step toward establishing a unified framework for convolution-based graph neural networks by formulating the basic graph convolution operation as an optimization problem in the graph Fourier space. Under this framework, a variety of popular GCN models, including vanilla GCNs, attention-based GCNs and topology-based GCNs, can be interpreted as the same optimization problem with different carefully designed regularizers. This novel perspective enables a better understanding of the similarities and differences among widely used GCNs, and may inspire new approaches for designing better models. As a showcase, we also present a novel regularization technique under the proposed framework to tackle the oversmoothing problem in graph convolution. The effectiveness of the newly designed model is validated empirically.

1. INTRODUCTION

Recent years have witnessed fast progress in graph processing by generalizing the convolution operation to graph-structured data, an approach known as Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017). Due to this success, numerous variants of GCNs have been developed and extensively adopted in social network analysis (Hamilton et al., 2017; Wu et al., 2019a; Veličković et al., 2018), biology (Zitnik et al., 2018), transportation forecasting (Li et al., 2017) and natural language processing (Wu et al., 2019b; Yao et al., 2019). Inspired by GCN, a wide variety of convolution-based graph learning approaches have been proposed to enhance the generalization performance of graph neural networks. Several lines of research aim at higher expressiveness by exploring higher-order information or introducing additional learning mechanisms such as attention modules. Although proposed from different perspectives, there exist connections between these approaches. For example, attention-based GCNs such as GAT (Veličković et al., 2018) and AGNN (Thekumparampil et al., 2018) share a similar intention of adjusting the adjacency matrix with a function of edge and node features. Similarly, TAGCN (Du et al., 2017) and MixHop (Kapoor et al., 2019) can be viewed as particular instances of PPNP (Klicpera et al., 2018) under certain approximations. However, the relations among these graph learning models are rarely studied, and comparisons are still limited to analyzing generalization performance on public datasets. As a consequence, we still lack a systematic view of different GCN models and a deep understanding of the relations among them. In this paper, we resort to techniques from graph signal processing and attempt to understand GCN-based approaches from a general perspective. Specifically, we present a unified graph convolution framework by identifying graph convolution operations with optimization problems in the graph Fourier domain.
We consider a Laplacian regularized least squares optimization problem and show that most convolution-based approaches can be interpreted in this framework by adding carefully designed regularizers. Besides vanilla GCNs, we also extend our framework to non-convolutional operations (Xu et al., 2018a; Hamilton et al., 2017), attention-based GCNs (Veličković et al., 2018; Thekumparampil et al., 2018) and topology-based GCNs (Klicpera et al., 2018; Kapoor et al., 2019), which cover a large fraction of state-of-the-art graph learning approaches. This novel perspective provides a re-interpretation of graph convolution operations, enables a better understanding of the similarities and differences among many widely used GCNs, and may inspire new approaches for designing better models. We summarize our contributions as follows:
1. We introduce a unified framework for convolution-based graph neural networks and interpret various convolution filters as carefully designed regularizers in the graph Fourier domain, which provides a general methodology for evaluating and relating different graph learning modules.
2. Based on the proposed framework, we provide new insights into the limitations of GCNs and point out new directions for tackling common problems and improving the generalization performance of current graph neural networks in the graph Fourier domain. Additionally, the unified framework can serve as a once-for-all platform for expert-designed modules on convolution-based approaches, where a newly designed module can be implemented on other networks as a plug-in with trivial adaptations. We believe that our framework eases the design of new graph learning modules and the search for better combinations.
3. As a showcase, we present a novel regularization technique under the proposed framework to alleviate the oversmoothing problem in graph representation learning.
As shown in Section 4, the newly designed regularizer can be implemented on several convolution-based networks and effectively improve the generalization performance of graph learning models.

2. PRELIMINARY

We start with an overview of the basic concepts of graph signal processing. Let $G = (V, A)$ denote a graph with node feature vectors, where $V$ is the vertex set $\{v_1, v_2, \ldots, v_N\}$ and $A = (a_{ij}) \in \mathbb{R}^{N\times N}$ is the adjacency matrix encoding the connectivity between nodes. Let $D = \mathrm{diag}(d(1), \ldots, d(N)) \in \mathbb{R}^{N\times N}$ be the degree matrix of $A$, where $d(i) = \sum_{j\in V} a_{ij}$ is the degree of vertex $i$. Then $L = D - A$ is the combinatorial Laplacian and $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ is the normalized Laplacian of $G$. Additionally, let $\tilde{A} = A + I$ and $\tilde{D} = D + I$ denote the augmented adjacency and degree matrices with added self-loops. Then $\tilde{L}_{sym} = I - \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ (with $\tilde{A}_{sym} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$) and $\tilde{L}_{rw} = I - \tilde{D}^{-1}\tilde{A}$ (with $\tilde{A}_{rw} = \tilde{D}^{-1}\tilde{A}$) are the augmented symmetric-normalized and random-walk-normalized Laplacians (augmented adjacency matrices) of $G$, respectively. Let $x \in \mathbb{R}^N$ be a signal on the vertices of the graph. The spectral convolution is defined via a filter $g_\theta$ parameterized in the Fourier domain (Kipf & Welling, 2017): $g_\theta \star x = U g_\theta(\Lambda) U^T x$, where $U$ and $\Lambda$ are the eigenvectors and eigenvalues of the normalized Laplacian $\mathcal{L}$. Following Hoang & Maehara (2019), we define the variation $\Delta$ and the $\tilde{D}$-inner product as
$$\Delta(x) = \sum_{i,j\in V} a_{ij}\left(x(i) - x(j)\right)^2 = x^T L x, \qquad (x, y)_{\tilde{D}} = \sum_{i\in V}\left(d(i)+1\right)x(i)y(i) = x^T \tilde{D} y,$$
which quantify the smoothness and the importance of the signal, respectively.
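For concreteness, the operators above can be assembled in a few lines of NumPy. This is an illustrative sketch of our own (function and variable names are not from the paper):

```python
import numpy as np

def augmented_operators(A):
    """Build the augmented (self-loop) normalized adjacency matrices.

    A_rw  = D~^{-1} A~             (random-walk normalization)
    A_sym = D~^{-1/2} A~ D~^{-1/2} (symmetric normalization)
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                  # A~ = A + I
    d_tilde = A_tilde.sum(axis=1)            # augmented degrees D~_ii
    A_rw = A_tilde / d_tilde[:, None]        # row-normalize: rows sum to 1
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    A_sym = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return A_rw, A_sym

# Toy path graph on 3 nodes
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_rw, A_sym = augmented_operators(A)
```

Since $\tilde{A}_{rw}$ is row-stochastic and $\tilde{A}_{sym}$ is similar to it, both have eigenvalues bounded by 1 in absolute value, which is the property the low-pass-filter arguments below rely on.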

3. UNIFIED GRAPH CONVOLUTION FRAMEWORK

With the success of GCNs, a wide variety of convolution-based approaches have been proposed, progressively enhancing the expressive power and generalization performance of graph neural networks. Despite the effectiveness of GCN and its derivatives on specific tasks, there is still no comprehensive understanding of the relations and differences among various graph learning modules. Graph signal processing is a powerful technique that has been adopted in several graph learning studies (Kipf & Welling, 2017; Hoang & Maehara, 2019; Zhao & Akoglu, 2019). However, existing work mainly focuses on analyzing the properties of GCNs while ignoring the connections between different graph learning modules. In this work, we instead interpret convolution-based approaches from a general perspective using graph signal processing techniques. Specifically, we establish connections between graph convolution operations and optimization problems in the graph Fourier space, showing the effect of each module explicitly through specific regularizers. This novel perspective provides a systematic view of different GCN models and a deep understanding of the relations among them.

3.1. UNIFIED GRAPH CONVOLUTION FRAMEWORK

Several studies have shown that, in graph signal processing, representative features are mostly carried by the low-frequency components of a signal, while noise is mostly contained in the high-frequency components (Hoang & Maehara, 2019). Based on this observation, numerous graph representation learning methods are designed to attenuate the high-frequency components, and can thus be viewed as low-pass filters in the graph Fourier space. With a similar inspiration, we consider a Laplacian regularized least squares optimization problem with graph signal regularizers and build connections to these filters.

Definition 1 (Unified Graph Convolution Framework). Graph convolution filters can be obtained by solving the following Laplacian regularized least squares optimization:
$$\min_{\hat{X}} \sum_{i\in V} \left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \lambda \mathcal{L}_{reg},$$
where $\|x\|_{\tilde{D}} = \sqrt{(x, x)_{\tilde{D}}}$ denotes the norm induced by $\tilde{D}$.

In the following sections, we show that a wide range of convolution-based graph neural networks can be derived from Definition 1 with different carefully designed regularizers, and we provide new insights for understanding different graph learning modules from the graph signal perspective.

3.1.1. GRAPH CONVOLUTIONAL NETWORKS

Graph convolutional networks (GCNs) (Kipf & Welling, 2017) are the foundation of numerous graph learning models and have received widespread attention. Several studies have demonstrated that the vanilla GCN is essentially a type of Laplacian smoothing over the whole graph, which makes the features of connected nodes similar. Therefore, to reformulate GCNs in the graph Fourier space, we use the variation $\Delta(\hat{x})$ as the regularizer.

Definition 2 (Vanilla GCNs). Let $\{\hat{x}(i)\}_{i\in V}$ be the estimation of the input observation $\{x(i)\}_{i\in V}$. The low-pass filter
$$\hat{X} = \tilde{A}_{rw} X \quad (4)$$
is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V} \left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2. \quad (5)$$
Derivations of the definitions are presented in Appendix A. As the eigenvalues of the approximated filter $\tilde{A}_{rw}$ are bounded by 1 in absolute value, it resembles a low-pass filter that removes high-frequency signals. By exchanging $\tilde{A}_{rw}$ with $\tilde{A}_{sym}$ (which has the same eigenvalues as $\tilde{A}_{rw}$), we obtain the formulation adopted in GCNs. The second term $\Delta(\hat{x})$ in Eq. (5) measures the variation of the estimation $\hat{x}$ over the graph structure. By adding this regularizer to the objective function, the obtained filter emphasizes the low-frequency signals through minimizing the variation over the local graph structure, while keeping the estimation close to the input in the graph Fourier space.
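As a quick numerical illustration of Definition 2 (our own sketch, not code from the paper), a single application of the filter $\tilde{A}_{rw}$ reduces the variation $\Delta$ of a signal on a toy graph:

```python
import numpy as np

# One vanilla graph-convolution step (Definition 2): X_hat = A~_rw @ X.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
n = A.shape[0]
A_tilde = A + np.eye(n)
A_rw = A_tilde / A_tilde.sum(axis=1, keepdims=True)
L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian L = D - A

x = np.array([1.0, -2.0, 3.0, 0.5])      # a rough graph signal
x_hat = A_rw @ x                          # filtered (smoothed) signal

variation = lambda v: v @ L @ v           # Delta(v), up to a constant factor
# variation(x_hat) is much smaller than variation(x): the filter is low-pass
```

Running this on the example above, the variation drops by more than an order of magnitude after one filtering step, while each $\hat{x}(i)$ remains a local average around $x(i)$.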

3.1.2. NON-CONVOLUTIONAL OPERATIONS

Residual Connection. The residual connection was first proposed by He et al. (2016) and has been widely adopted in graph representation learning. In vanilla GCNs, the eigenvalues of the filter $\tilde{A}_{rw}$ (or $\tilde{A}_{sym}$) are bounded by 1 in norm, which ensures numerical stability during training. On the other hand, however, signals in all frequency bands shrink as convolution layers stack, leading to consistent information loss. Adding a residual connection therefore helps preserve the strength of the input signal.

Definition 3 (Residual Connection). A graph convolution filter with residual connection,
$$\hat{X} = \tilde{A}_{rw} X + \epsilon X,$$
where $\epsilon > 0$ controls the strength of the residual connection, is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V} \left(\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 - \epsilon\left\|\hat{x}(i)\right\|_{\tilde{D}}^2\right) + \sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2.$$
By adding the negative regularizer, which penalizes estimations with small norms, we recover the vanilla graph convolution with a residual connection.

Concatenation. Concatenation is practically a residual connection with different learning weights.

Definition 3' (Concatenation). A graph convolution filter concatenated with the input signal,
$$\hat{X} = \tilde{A}_{rw} X + \epsilon X\Theta\Theta^T,$$
is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V} \left(\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 - \epsilon\left\|\hat{x}(i)\Theta\right\|_{\tilde{D}}^2\right) + \sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2,$$
where $\epsilon > 0$ controls the strength of the concatenation and $\Theta$ is the learning coefficient. Although the learning weight $\Theta\Theta^T$ has constrained expressive capability, this can be compensated by the subsequent feature learning modules.
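The shrinkage argument above can be checked numerically. The sketch below (our own; the residual strength `eps` is an arbitrary choice) stacks ten plain filters versus ten residual filters from Definition 3 and compares signal norms:

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
n = A.shape[0]
A_tilde = A + np.eye(n)
A_rw = A_tilde / A_tilde.sum(axis=1, keepdims=True)

eps = 0.5                                  # residual strength (our choice)
x = np.array([2.0, -1.0, 0.5])

plain, res = x.copy(), x.copy()
for _ in range(10):                        # stack 10 convolution layers
    plain = A_rw @ plain                   # vanilla filter: all bands shrink
    res = A_rw @ res + eps * res           # Definition 3: residual connection
# norm(plain) has decayed below norm(x); the residual path keeps the signal strong
```

After ten layers the plain signal collapses toward a constant vector of small norm, while the residual variant retains (here, amplifies) the signal strength, matching the information-loss discussion above.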

3.1.3. ATTENTION-BASED CONVOLUTIONAL NETWORKS

Since the convolution filters in GCNs depend only on the graph structure, GCNs are known to have restricted expressive power and may suffer from the oversmoothing problem. Several studies introduce the attention mechanism into the convolution filter, learning to assign different edge weights at each layer based on nodes and edges. GAT (Veličković et al., 2018) and AGNN (Thekumparampil et al., 2018) compute the attention coefficients as a function of the features of the connected nodes, while ECC (Simonovsky & Komodakis, 2017) and GatedGCN (Bresson & Laurent, 2017) consider activations for each connected edge. Although these approaches have different insights, they can all be formulated as (see details in Appendix A):
$$p_{ij} = a_{ij}\, f_\theta\left(x(i), x(j), e_{ij}\right), \quad i, j \in V,$$
where $e_{ij}$ denotes the edge representation if applicable. We therefore replace $a_{ij}$ in Definition 2 with learned coefficients to enforce different regularization strengths on the connected edges.

Definition 4 (Attention-based GCNs). An attention-based graph convolution filter,
$$\hat{X} = \tilde{P} X,$$
is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V} \left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \sum_{i,j\in V} p_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2, \quad \text{s.t. } \sum_{j\in V} p_{ij} = \tilde{D}_{ii}, \; \forall i \in V.$$
Notice that we use a normalization trick to constrain the row sums of the attention matrix to match the original degree matrix $\tilde{D}$, since we want to preserve the regularization strength for each node. The resulting filter $\tilde{P}$ corresponds to the matrix $\tilde{D}^{-1}P$, whose rows sum to 1, which is also consistent with most attention-based approaches after normalization. By adjusting the regularization strength on edges, nodes connected with high attention coefficients are driven to have similar features, while the distance between nodes connected with low attention coefficients remains larger.
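The degree-preserving normalization of Definition 4 is easy to sketch in NumPy. Below, `score` is a hypothetical stand-in for the learned $f_\theta$ (any positive pairwise function works for the illustration); the names are ours:

```python
import numpy as np

def attention_filter(A, X, score):
    """Normalized attention propagation matrix, cf. Definition 4.

    Raw coefficients a_ij * score(x_i, x_j) are masked by the (self-loop
    augmented) graph and rescaled so that row i sums to D~_ii; the filter
    P~ = D~^{-1} P is then row-stochastic.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    d_tilde = A_tilde.sum(axis=1)
    raw = np.array([[A_tilde[i, j] * score(X[i], X[j]) for j in range(n)]
                    for i in range(n)])
    P = raw * (d_tilde / raw.sum(axis=1))[:, None]   # rows of P sum to D~_ii
    P_norm = P / d_tilde[:, None]                    # P~ = D~^{-1} P
    return P_norm @ X, P

A = np.array([[0., 1.], [1., 0.]])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
score = lambda xi, xj: np.exp(xi @ xj)               # toy similarity score
X_hat, P = attention_filter(A, X, score)
```

Because $\tilde{P}$ is row-stochastic, the same low-pass argument as for $\tilde{A}_{rw}$ applies, with edge-dependent smoothing strength.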

3.1.4. TOPOLOGY-BASED CONVOLUTIONAL NETWORKS

Attention-based approaches are mostly designed around the local structure. Beyond the first-order adjacency matrix, several approaches (Klicpera et al., 2018; 2019; Kapoor et al., 2019; Du et al., 2017) propose to exploit the structural information in the multi-hop neighborhood; we refer to these as topology-based convolutional networks. We start with an analysis of PPNP (Klicpera et al., 2018) and then derive a general formulation for topology-based approaches.

PPNP. PPNP designs its propagation scheme by combining the message-passing function with personalized PageRank. As proved in (Xu et al., 2018b), the influence of node i on node j is proportional to a k-step random walk, which converges to the limit distribution as convolution layers stack. By introducing a restart probability, PPNP preserves the starting node's information. Similarly, in Definition 2, the first term can be viewed as a regularizer that preserves the original signal information; we can therefore achieve the same purpose by adjusting the regularization strength.

Definition 5 (PPNP). A graph convolution filter with personalized propagation (PPNP),
$$\hat{X} = \alpha\left(I_n - (1-\alpha)\tilde{A}_{rw}\right)^{-1} X,$$
is equivalent to the optimal solution of the following optimization:
$$\min_{\hat{X}} \alpha\sum_{i\in V}\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + (1-\alpha)\sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2, \quad (14)$$
where $\alpha \in (0, 1]$ is the restart probability. A higher $\alpha$ means a higher probability of teleporting back to the starting node, which is consistent with a stronger regularization toward the original signal in (14).

Multi-hop PPNP. One possible weakness of the original PPNP is that personalized PageRank only regularizes over the local structure. We can therefore improve the expressive capability by involving multi-hop information, which corresponds to adding regularizers on higher-order variations.

Definition 6 (Multi-hop PPNP). Let t be the highest order adopted in the algorithm.
A graph convolution filter with multi-hop personalized propagation (Multi-hop PPNP),
$$\hat{X} = \alpha_0\left(I_n - \sum_{k=1}^{t}\alpha_k \tilde{A}_{rw}^k\right)^{-1} X, \quad (15)$$
where $\sum_{k=0}^{t}\alpha_k = 1$, $\alpha_0 > 0$ and $\alpha_k \ge 0$ for $k = 1, 2, \ldots, t$, is equivalent to the optimal solution of the following optimization:
$$\min_{\hat{X}} \alpha_0\sum_{i\in V}\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \sum_{k=1}^{t}\alpha_k\sum_{i,j\in V} a_{ij}^{(k)}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2, \quad (16)$$
where $a_{ij}^{(k)}$ is proportional to the transition probability of the k-step random walk, and the same normalization trick as in Section 3.1.3 is applied to $\{a_{ij}^{(k)}\}$. Solving Eq. (15) directly is computationally expensive. We therefore derive a first-order approximation by Taylor expansion, which results in:
$$\hat{X} = \left(\sum_{k=0}^{t}\alpha_k \tilde{A}_{rw}^k\right) X + O\left(\tilde{A}_{rw}^{t} X\right). \quad (17)$$
As the norms of the eigenvalues of $\tilde{A}_{rw}$ are bounded by 1, we can keep the first term of Eq. (17) as a close approximation. Comparing this approximated solution with topology-based graph convolutional networks, we find that most such approaches can be reformulated as particular instances of Definition 6. For example, the formulation of MixHop (Kapoor et al., 2019) can be derived as an approximation of Eq. (17) by letting $t = 2$ and $\alpha_0 = \alpha_1 = \alpha_2 = 1/3$. Different learning weights can be applied to each hop, as in Section 3.1.2, to concatenate multi-hop signals. See more examples in Appendix B.
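The relation between the exact PPNP filter of Definition 5 and its truncated power-series (multi-hop) approximation can be verified directly; the sketch below is our own illustration on a random signal:

```python
import numpy as np

A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
n = A.shape[0]
A_tilde = A + np.eye(n)
A_rw = A_tilde / A_tilde.sum(axis=1, keepdims=True)

alpha = 0.2                                        # restart probability
X = np.random.default_rng(0).normal(size=(n, 2))

# Exact PPNP filter (Definition 5): alpha * (I - (1-alpha) A~_rw)^{-1} X
exact = alpha * np.linalg.solve(np.eye(n) - (1 - alpha) * A_rw, X)

# Truncated power series: alpha * sum_k (1-alpha)^k A~_rw^k X,
# i.e. a multi-hop expansion in the spirit of Eq. (17)
approx, term = np.zeros_like(X), X.copy()
for k in range(50):
    approx += alpha * (1 - alpha) ** k * term      # term holds A~_rw^k X
    term = A_rw @ term
```

The series converges because the spectral radius of $(1-\alpha)\tilde{A}_{rw}$ is at most $1-\alpha < 1$; truncating it after a few hops recovers topology-based filters such as MixHop up to the choice of coefficients.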

3.2. REMARKS

In this section, we have built a bridge between graph convolution operations and optimization problems in the graph Fourier space, interpreting graph convolution operations through their regularizers. To conclude, we rewrite the general form of the unified framework as follows.

Definition 1' (Unified Graph Convolution Framework). Convolution-based graph neural networks can be reformulated (after approximation) as particular instances of the optimal solution of the following optimization problem:
$$\min_{\hat{X}} \alpha_0\sum_{i\in V}\Big(\underbrace{\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 - \epsilon\left\|\hat{x}(i)\Theta\right\|_{\tilde{D}}^2}_{\text{Non-Conv}}\Big) + \sum_{k=1}^{t}\alpha_k\sum_{i,j\in V}\underbrace{p_{ij}^{(k)}}_{\text{Attention-based}}\underbrace{\left\|\hat{x}(i)\Theta^{(k)} - \hat{x}(j)\Theta^{(k)}\right\|_2^2}_{\text{Topology-based}} + \lambda\mathcal{L}_{reg},$$
where $\sum_{k=0}^{t}\alpha_k = 1$, $\alpha_k \ge 0$ and $\sum_{j\in V} p_{ij}^{(k)} = \tilde{D}_{ii}, \forall i \in V$. If we let d be the feature dimension of X, then $\Theta, \Theta^{(k)} \in \mathbb{R}^{d\times d}$ are the corresponding learning weights. $\mathcal{L}_{reg}$ denotes a personalized regularizer built on the framework, which can be effective if carefully designed, as we show in Section 4.

By establishing the unified framework, we interpret various convolution filters as carefully designed regularizers in the graph Fourier domain, which provides new insights for understanding graph learning modules from the graph signal perspective. Several graph learning modules are reformulated as smoothing regularizers over the graph structure with different intentions. While vanilla GCNs focus on minimizing the variation over the local graph structure, attention-based and topology-based GCNs take a step further, concentrating on the differences between connected edges and on graph structure with a larger receptive field, respectively. This novel perspective enables a better understanding of the similarities and differences among many widely used GCNs, and may inspire new approaches for designing better models.

4. TACKLING OVERSMOOTHING UNDER THE UNIFIED FRAMEWORK

Based on the proposed framework, we provide new insights into the limitations of GCNs and suggest a new line of work toward designing better graph learning models. As a showcase, we present a novel regularization technique under the framework to tackle the oversmoothing problem. We show that the newly designed regularizer can be implemented on other convolution-based networks with trivial adaptations and effectively improves the generalization performance of graph learning approaches.

4.1. REGULARIZATION ON FEATURE VARIANCE

Here we adopt the definition of feature-wise oversmoothing in (Zhao & Akoglu, 2019): after multiple layers of Laplacian smoothing, all features fall into the same subspace spanned by the dominant eigenvectors of the normalized adjacency matrix, which also corresponds to the situation described in (Klicpera et al., 2018). To tackle this problem, we propose to penalize features when they are close to each other. Specifically, we consider the pairwise distance between normalized features:
$$\delta(X) = \frac{1}{d^2}\sum_{i,j\in[d]}\left\|x_{\cdot i}/\|x_{\cdot i}\| - x_{\cdot j}/\|x_{\cdot j}\|\right\|_2^2, \quad (19)$$
where d is the feature dimension and $x_{\cdot i} \in \mathbb{R}^n$ is the i-th feature dimension over all nodes. Eq. (19) can thus be interpreted as a feature variance regularizer, measuring the distance between features after normalization. Added to the unified framework, this regularizer drives different features apart.

Definition 7 (Regularized Feature Variance). Let $\otimes$ be the Kronecker product operator and $\mathrm{vec}(X) \in \mathbb{R}^{nd}$ the vectorized signal X. Let $D_X$ be the diagonal matrix with $D_X(i,i) = \|x_{\cdot i}\|_2$. A graph convolution filter with regularized feature variance,
$$\mathrm{vec}(\hat{X}) = \left(I_n \otimes \left[(\alpha_1+\alpha_2)I - \alpha_2\tilde{A}_{rw}\right] - \alpha_3\left[D_X^{-1}\left(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T\right)D_X^{-1}\right] \otimes \tilde{D}^{-1}\right)^{-1}\mathrm{vec}(X), \quad (20)$$
is equivalent to the optimal solution of the following optimization:
$$\min_{\hat{X}} \alpha_1\sum_{i\in V}\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \alpha_2\sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2 - \alpha_3\frac{1}{d}\sum_{i,j\in[d]}\left\|\hat{x}_{\cdot i}/\|x_{\cdot i}\| - \hat{x}_{\cdot j}/\|x_{\cdot j}\|\right\|_2^2,$$
where $\alpha_1 > 0$ and $\alpha_2, \alpha_3 \ge 0$. For computational efficiency, we approximate $\|\hat{x}_{\cdot i}\|$ with $\|x_{\cdot i}\|$, assuming that a single convolution filter has little effect on the feature norms. Computing the Kronecker products and the inverse directly is expensive; nevertheless, we can approximate Eq. (20) via Taylor expansion with an iterative algorithm. Let
$$A = (\alpha_1+\alpha_2)I - \alpha_2\tilde{A}_{rw}, \quad B = I_n, \quad C = -\alpha_3\tilde{D}^{-1}, \quad D = D_X^{-1}\left(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T\right)D_X^{-1}.$$
Then a t-order approximation is given by:
$$\hat{X}^{(0)} = X, \quad (24)$$
$$\hat{X}^{(k+1)} = X + \hat{X}^{(k)} - A\hat{X}^{(k)}B - C\hat{X}^{(k)}D, \quad k = 0, 1, \ldots, t-1. \quad (25)$$
This approximation greatly reduces the computation overhead; see details in Appendix A. The advantages of feature variance regularization are threefold. First, the regularizer measures the difference between features, explicitly preventing all features from falling into the same subspace. Second, the modified convolution filter requires no additional training parameters, avoiding the risk of overfitting. Third, the regularizer is designed within the proposed unified framework, so it can be implemented on other convolution-based networks as a plug-in module.
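The iteration in Eqs. (24)-(25) avoids forming the Kronecker product or the inverse explicitly. Below is a minimal sketch of our own, with arbitrary small coefficients; since the matrices must be conformable with an $n \times d$ signal, we take B as the identity on the feature side:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
upper = np.triu(rng.random((n, n)) < 0.5, 1).astype(float)
Adj = upper + upper.T                    # random undirected toy graph
A_tilde = Adj + np.eye(n)
d_tilde = A_tilde.sum(axis=1)
A_rw = A_tilde / d_tilde[:, None]

X = rng.normal(size=(n, d))
a1, a2, a3 = 1.0, 0.5, 0.05              # alpha_1 > 0, small alpha_3 (our choice)

# Matrices of the approximation (Section 4.1); D_X holds input feature norms.
A_mat = (a1 + a2) * np.eye(n) - a2 * A_rw
B_mat = np.eye(d)                         # identity on the feature side
C_mat = -a3 * np.diag(1.0 / d_tilde)      # -alpha_3 * D~^{-1}
Dx_inv = np.diag(1.0 / np.linalg.norm(X, axis=0))
D_mat = Dx_inv @ (np.eye(d) - np.ones((d, d)) / d) @ Dx_inv

X_hat = X.copy()                          # X_hat^{(0)} = X, Eq. (24)
for _ in range(200):                      # Eq. (25), iterated to convergence
    X_hat = X + X_hat - A_mat @ X_hat @ B_mat - C_mat @ X_hat @ D_mat
```

At the fixed point, $A\hat{X}B + C\hat{X}D = X$, i.e. the iteration solves the vectorized linear system of Eq. (20) without ever materializing the $nd \times nd$ matrix.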

4.2. DISCUSSION

Several other works have shared insights on understanding and tackling oversmoothing. It is shown in (Li et al., 2018) that the graph convolution of GCN is a special form of Laplacian smoothing, and the authors compensate for long-range dependencies by co-training GCN with a random walk model. JKNet (Xu et al., 2018b) proved that the influence score between nodes converges to a fixed distribution as layers stack, thereby losing local information; as a remedy, they proposed concatenating layer-wise representations to mix structural information from different ranges. More recently, Oono & Suzuki (2020) theoretically demonstrated that graph neural networks lose expressive power exponentially due to oversmoothing. Compared with the aforementioned works, our proposed method acts explicitly on the graph signals and can be easily implemented on other convolution-based networks as a plug-in module with trivial adaptations.

4.3. EXPERIMENT

To verify the effectiveness of the regularizer, we empirically validate the proposed method on several widely used semi-supervised node classification benchmarks, covering both transductive and inductive settings. As stated in Section 4.1, our regularizer can be implemented on various convolution-based approaches under the unified graph convolution framework. We therefore consider three versions, implementing the regularizer on vanilla GCNs, attention-based GCNs and topology-based GCNs, respectively. We achieve state-of-the-art results on almost all settings, demonstrating the effectiveness of tackling oversmoothing on graph-structured data.

Dataset and Experimental Setup. We conduct experiments on four real-world graph datasets. For transductive learning, we evaluate our method on the Cora, Citeseer and Pubmed datasets, following the experimental setup in (Sen et al., 2008). PPI (Zitnik & Leskovec, 2017) is adopted for inductive learning. Dataset statistics and further experimental details are presented in Appendix C. For comparison, we categorize state-of-the-art convolution-based graph neural networks into three classes, corresponding to the three versions of our proposed method. The first category is based on the vanilla GCN proposed by Kipf & Welling (2017), including GCN, FastGCN (Chen et al., 2018), SGC (Wu et al., 2019a), GIN (Xu et al., 2018a) and DGI (Veličković et al., 2019). Since GIN was not originally evaluated on citation networks, we implement GIN following the setting in (Xu et al., 2018a). The second category covers the attention-based approaches, including GAT (Veličković et al., 2018), AGNN (Thekumparampil et al., 2018), MoNet (Monti et al., 2017) and GatedGCN (Bresson & Laurent, 2017). The last category is topology-based GCNs, which utilize the structural information in the multi-hop neighborhood.
We consider APPNP (Klicpera et al., 2018), TAGCN (Du et al., 2017) and MixHop (Kapoor et al., 2019) as the baselines.

Table 2: Test Micro-F1 score on the inductive learning dataset (PPI). We report mean values and standard deviations over 5 independent runs.

Method                             | PPI
GCN (Kipf & Welling, 2017)         | 92.4
GAT (Veličković et al., 2018)      | 97.3
SGC (Wu et al., 2019a)             | 66.4
JKNet (Xu et al., 2018b)           | 97.6
GraphSAGE (Hamilton et al., 2017)  | 61.2
DGI (Veličković et al., 2019)      | 63.8
GCN+reg (ours)                     | 97.69±0.32
GAT+reg (ours)                     | 98.23±0.08

Transductive Learning. Table 1 presents the performance of our method and several state-of-the-art graph neural networks on the transductive learning datasets. For the three classes of convolution-based approaches, we implement our regularizer on GCN, GAT and APPNP, respectively, for comparison with the other baselines. For a fair comparison, we adopt the same network structure, hyperparameters and training configurations as the baseline models. The proposed model achieves state-of-the-art results on all three settings. On all datasets, we observe 0.5∼1.0% higher performance after adopting the proposed regularizer. Notably, the improvement is largest on the vanilla GCN, as this simplest version suffers most from the oversmoothing problem. Meanwhile, when combined with GAT, the model achieves the highest results compared with almost all baselines. Considering that the attention mechanism and the regularization against oversmoothing focus on local and global properties respectively, this can be an ideal combination for graph representation learning. We also conduct experiments on the three citation networks with random splits and present the results in Appendix D.

Inductive Learning. For the inductive learning task, we implement our method on the vanilla GCN and GAT, and adopt the same experimental setup. Table 2 reports the results on the inductive learning dataset. Our model compares favorably with all the competitive baselines.
On the PPI dataset, our model achieves a 0.5∼1% higher test Micro-F1 score, showing the effectiveness of our method under inductive settings.

Comparison with Other Related Works. To further validate our model, we compare the proposed regularizer with two state-of-the-art approaches for tackling oversmoothing, DropEdge (Rong et al., 2019) and PairNorm (Zhao & Akoglu, 2019). For a fair comparison, all approaches are applied to the vanilla GCN with 2∼8 layers, and we report the best performance on the three transductive datasets. As shown in Table 3, our regularizer achieves the best performance on all three settings. As PairNorm is more suitable when a subset of the nodes lacks feature vectors, it is less competitive in the general setting.

Analysis. As stated above, the regularizer can be interpreted as the mean feature variance, which prevents different features from falling into the same subspace. To verify the effect of our method, we compute the mean pairwise distance (Eq. (19)) of the last hidden layer of GCN and GAT, with and without the regularizer, on the Cora dataset. Figure 1 shows the results for models with 2-8 layers. We observe that the feature variances and the accuracies of the regularized models are consistently higher than those of the vanilla models, with obvious gaps. After applying the regularizer, features are thus more separated from each other and the oversmoothing problem is alleviated.

5. CONCLUSION

In this paper, we develop a unified graph convolution framework by identifying graph convolution filters with optimization problems in the graph Fourier space. We show that most convolution-based graph learning models are equivalent to adding carefully designed regularizers. Besides the vanilla GCN, our framework extends to non-convolutional operations, attention-based GCNs and topology-based GCNs, which cover a large fraction of state-of-the-art graph learning models. On this basis, we propose a novel regularization for tackling the oversmoothing problem as a showcase, demonstrating the effectiveness of designing new modules based on the framework. Through the unified framework, we provide a general methodology for understanding and relating different graph learning modules, with new insights on tackling common problems and improving the generalization performance of current graph neural networks in the graph Fourier domain. Meanwhile, the unified framework can also serve as a once-for-all platform for expert-designed modules on convolution-based approaches. We hope our work can promote the understanding of graph convolutional networks and inspire more insights in this field.

A. DERIVATIONS

Definition 3' (Concatenation). A graph convolution filter concatenated with the input signal,
$$\hat{X} = \tilde{A}_{rw} X + \epsilon X\Theta\Theta^T,$$
is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V}\left(\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 - \epsilon\left\|\hat{x}(i)\Theta\right\|_{\tilde{D}}^2\right) + \sum_{i,j\in V} a_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2,$$
where $\epsilon > 0$ controls the strength of the concatenation and $\Theta$ is the learning coefficient for the concatenated signal.

Proof. Let $l$ denote the objective function. We have
$$l = \mathrm{tr}\left[(\hat{X}-X)^T\tilde{D}(\hat{X}-X)\right] - \epsilon\,\mathrm{tr}\left((\hat{X}\Theta)^T\tilde{D}(\hat{X}\Theta)\right) + \mathrm{tr}\left(\hat{X}^T L\hat{X}\right).$$
Then
$$\frac{\partial l}{\partial \hat{X}} = 2\tilde{D}(\hat{X}-X) + 2L\hat{X} - 2\epsilon\tilde{D}\hat{X}\Theta\Theta^T.$$
Setting $\frac{\partial l}{\partial \hat{X}} = 0$:
$$(\tilde{D}+L)\hat{X} - \epsilon\tilde{D}\hat{X}\Theta\Theta^T = \tilde{D}X, \qquad (I + \tilde{L}_{rw})\hat{X} - \epsilon\hat{X}\Theta\Theta^T = X.$$
With the help of the Kronecker product operator $\otimes$ and a first-order Taylor expansion, we have
$$\mathrm{vec}(\hat{X}) = \left[I \otimes (I + \tilde{L}_{rw}) - \epsilon\,(\Theta\Theta^T) \otimes I\right]^{-1}\mathrm{vec}(X) \approx \left[2I - I \otimes (I + \tilde{L}_{rw}) + \epsilon\,(\Theta\Theta^T) \otimes I\right]\mathrm{vec}(X) = \mathrm{vec}\left(2X - (I + \tilde{L}_{rw})X + \epsilon X\Theta\Theta^T\right) = \mathrm{vec}\left(\tilde{A}_{rw}X + \epsilon X\Theta\Theta^T\right).$$

Definition 4 (Attention-based GCNs). An attention-based graph convolution filter,
$$\hat{X} = \tilde{P}X,$$
is the first-order approximation of the optimal solution of the following optimization:
$$\min_{\hat{X}} \sum_{i\in V}\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \sum_{i,j\in V} p_{ij}\left\|\hat{x}(i) - \hat{x}(j)\right\|_2^2, \quad \text{s.t. } \sum_{j\in V} p_{ij} = \tilde{D}_{ii}, \; \forall i\in V.$$

Proof. Let $l$ denote the objective function. We have
$$l = \mathrm{tr}\left[(\hat{X}-X)^T\tilde{D}(\hat{X}-X)\right] + \mathrm{tr}\left(\hat{X}^T(\tilde{D}-P)\hat{X}\right).$$
Then
$$\frac{\partial l}{\partial \hat{X}} = 2\tilde{D}(\hat{X}-X) + 2(\tilde{D}-P)\hat{X}.$$
Setting $\frac{\partial l}{\partial \hat{X}} = 0$:
$$(2\tilde{D}-P)\hat{X} = \tilde{D}X, \qquad (2I - \tilde{P})\hat{X} = X,$$
with $\tilde{P} = \tilde{D}^{-1}P$. Similarly, we can show that $(2I - \tilde{P})$ is positive definite with eigenvalues in the range $[1, 3]$. Therefore, $\hat{X} = (2I - \tilde{P})^{-1}X \approx \tilde{P}X$.

Definitions 5 & 6 (Topology-based GCNs). Since most topology-based models adopt non-convolutional operations such as concatenation, we derive a more general objective function by combining them with the non-convolutional operations:
$$\min_{\hat{X}} \alpha_0\sum_{i\in V}\left\|\hat{x}(i) - x(i)\right\|_{\tilde{D}}^2 + \sum_{k=1}^{t}\alpha_k\sum_{i,j\in V} a_{ij}^{(k)}\left\|\hat{x}(i)\Theta^{(k)} - \hat{x}(j)\Theta^{(k)}\right\|_2^2,$$
where $\sum_{k=0}^{t}\alpha_k = 1$, $\alpha_0 > 0$ and $\alpha_k \ge 0$, $k = 1, 2, \ldots, t$. If we let d be the feature dimension of X, then $\Theta^{(k)}\in\mathbb{R}^{d\times d}$ are the learning weights for the k-th hop neighborhood. Let $l$ denote the objective function; we have
$$\frac{\partial l}{\partial \hat{X}} = 2\alpha_0\tilde{D}(\hat{X}-X) + 2\sum_{k=1}^{t}\alpha_k\left(\tilde{D} - \tilde{D}\tilde{A}_{rw}^k\right)\hat{X}\Theta^{(k)}(\Theta^{(k)})^T.$$
Setting $\frac{\partial l}{\partial \hat{X}} = 0$, we have
$$\alpha_0\hat{X} + \sum_{k=1}^{t}\alpha_k\left(I_n - \tilde{A}_{rw}^k\right)\hat{X}\Theta^{(k)}(\Theta^{(k)})^T = \alpha_0 X.$$
With the Kronecker product operator $\otimes$, this becomes
$$\left[\alpha_0 I + \sum_{k=1}^{t}\left(\alpha_k\Theta^{(k)}(\Theta^{(k)})^T\right)\otimes\left(I_n - \tilde{A}_{rw}^k\right)\right]\mathrm{vec}(\hat{X}) = \alpha_0\,\mathrm{vec}(X).$$
We can observe that $\sum_{k=1}^{t}\left(\alpha_k\Theta^{(k)}(\Theta^{(k)})^T\right)$ and $\left(I_n - \tilde{A}_{rw}^k\right)$ have non-negative eigenvalues.
Because the eigenvalues of a Kronecker product $(A \otimes B)$ are the products of the eigenvalues of $A$ and $B$, the filter $\alpha_0 I + \sum_{k=1}^{t}(\alpha_k\Theta^{(k)}(\Theta^{(k)})^T) \otimes (I_n - \tilde{A}_{rw}^k)$ is positive definite. Therefore, with a first-order Taylor expansion,
$$\mathrm{vec}(\hat{X}) = \alpha_0\Big[\alpha_0 I + \sum_{k=1}^{t}(\alpha_k\Theta^{(k)}(\Theta^{(k)})^T) \otimes (I_n - \tilde{A}_{rw}^k)\Big]^{-1}\mathrm{vec}(X) \approx \alpha_0\Big[(2 - \alpha_0)I - \sum_{k=1}^{t}(\alpha_k\Theta^{(k)}(\Theta^{(k)})^T) \otimes (I_n - \tilde{A}_{rw}^k)\Big]\mathrm{vec}(X) = \alpha_0\,\mathrm{vec}\Big[(2 - \alpha_0)X - \sum_{k=1}^{t}\alpha_k(I_n - \tilde{A}_{rw}^k)X\Theta^{(k)}(\Theta^{(k)})^T\Big].$$
If we let
$$W^{(0)} = \frac{2 - \alpha_0}{\alpha_0}I - \sum_{k=1}^{t}\frac{\alpha_k}{\alpha_0}\Theta^{(k)}(\Theta^{(k)})^T, \qquad W^{(k)} = \Theta^{(k)}(\Theta^{(k)})^T, \quad k = 1, 2, \ldots, t,$$
we can write the convolution filter as
$$\hat{X} = \sum_{k=0}^{t}\alpha_k\tilde{A}_{rw}^k X W^{(k)}. \tag{35}$$
As stated in Section 2.2.2, although the learning weights have a constrained expressive capability, this can be compensated by the subsequent feature learning module. We omit the proofs of Definitions 5 and 6, as they can be viewed as particular instances of (35).

Definition 7 (Regularized Feature Variance). Let $\otimes$ be the Kronecker product operator and $\mathrm{vec}(X) \in \mathbb{R}^{nd}$ be the vectorized signal $X$. Let $D_X$ be the diagonal matrix defined by $D_X(i, i) = \|x_{\cdot i}\|_2$. A graph convolution filter with regularized feature variance,
$$\mathrm{vec}(\hat{X}) = \alpha_1\big(I \otimes [(\alpha_1 + \alpha_2)I - \alpha_2\tilde{A}_{rw}] - \alpha_3[D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1}] \otimes \tilde{D}^{-1}\big)^{-1}\mathrm{vec}(X), \tag{40}$$
is equivalent to the optimal solution of the following optimization problem:
$$\min_{\hat{X}} \alpha_1\sum_{i \in V}\|\hat{x}(i) - x(i)\|_{\tilde{D}}^2 + \alpha_2\sum_{i,j \in V}a_{ij}\|\hat{x}(i) - \hat{x}(j)\|_2^2 - \alpha_3\frac{1}{d}\sum_{i,j=1}^{d}\Big\|\frac{\hat{x}_{\cdot i}}{\|\hat{x}_{\cdot i}\|_2} - \frac{\hat{x}_{\cdot j}}{\|\hat{x}_{\cdot j}\|_2}\Big\|_2^2, \tag{41}$$
where $\alpha_1 > 0$ and $\alpha_2, \alpha_3 \geq 0$. For computational efficiency, we approximate $D_{\hat{X}}$ with $D_X$, assuming that a single convolution filter has little effect on the norms of the features.

Proof. Let $l$ denote the objective function. We have
$$l = \alpha_1\mathrm{tr}[(\hat{X} - X)^T\tilde{D}(\hat{X} - X)] + \alpha_2\mathrm{tr}(\hat{X}^T\tilde{L}\hat{X}) - \alpha_3\mathrm{tr}[\hat{X}D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1}\hat{X}^T].$$
Then,
$$\frac{\partial l}{\partial \hat{X}} = 2\alpha_1\tilde{D}(\hat{X} - X) + 2\alpha_2\tilde{L}\hat{X} - 2\alpha_3\hat{X}D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1}.$$
Setting $\frac{\partial l}{\partial \hat{X}} = 0$:
$$[(\alpha_1 + \alpha_2)I - \alpha_2\tilde{A}_{rw}]\hat{X} - \alpha_3\tilde{D}^{-1}\hat{X}D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1} = \alpha_1 X.$$
With the help of the Kronecker product operator $\otimes$, we have
$$\big(I \otimes [(\alpha_1 + \alpha_2)I - \alpha_2\tilde{A}_{rw}] - \alpha_3[D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1}] \otimes \tilde{D}^{-1}\big)\mathrm{vec}(\hat{X}) = \alpha_1\mathrm{vec}(X). \tag{42}$$
By setting $\alpha_3$ to a small positive value, the filter in Eq. (42) remains positive definite, which completes the proof.

Similarly, we can derive a simpler form via Taylor approximation. If we let
$$A = (\alpha_1 + \alpha_2)I - \alpha_2\tilde{A}_{rw}, \quad B = I, \quad C = -\alpha_3\tilde{D}^{-1}, \quad D = D_X^{-1}(I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)D_X^{-1},$$
then the first-order approximation of Eq. (40) is
$$\mathrm{vec}(\hat{X}) = \alpha_1(B^T \otimes A + D^T \otimes C)^{-1}\mathrm{vec}(X) \approx \alpha_1(2I - B^T \otimes A - D^T \otimes C)\mathrm{vec}(X) = \alpha_1\mathrm{vec}(2X - AXB - CXD).$$
Additionally, we can derive a $t$-order approximation:
$$\mathrm{vec}(\hat{X}^{(t)}) = \alpha_1\Big(I + \sum_{i=1}^{t}[I - (B^T \otimes A + D^T \otimes C)]^i\Big)\mathrm{vec}(X).$$
However, computing the Kronecker products explicitly is expensive. Therefore, we use an iterative algorithm instead. For any $0 \leq k < t$,
$$\mathrm{vec}(\hat{X}^{(k+1)}) = \alpha_1\Big(I + \sum_{i=1}^{k+1}[I - (B^T \otimes A + D^T \otimes C)]^i\Big)\mathrm{vec}(X) = [I - (B^T \otimes A + D^T \otimes C)]\,\mathrm{vec}(\hat{X}^{(k)}) + \alpha_1\mathrm{vec}(X) = \mathrm{vec}(\alpha_1 X + \hat{X}^{(k)} - A\hat{X}^{(k)}B - C\hat{X}^{(k)}D).$$
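For concreteness, the iterative update above can be sketched in NumPy. This is a minimal illustration rather than the paper's implementation: the coefficient values and iteration count are placeholder assumptions, the dense matrices here would be sparse in practice, and we assume all feature columns of X have nonzero norm.

```python
import numpy as np

def regularized_filter(X, A, t=4, a1=0.2, a2=0.8, a3=0.05):
    """Approximate the regularized convolution filter (Definition 7) with the
    iterative update X^(k+1) = a1*X + X^(k) - A X^(k) B - C X^(k) D,
    which avoids forming any Kronecker product explicitly."""
    n, d = X.shape
    A_tilde = A + np.eye(n)                              # adjacency with self-loops
    deg = A_tilde.sum(axis=1)
    A_rw = A_tilde / deg[:, None]                        # random-walk normalization
    Dx_inv = np.diag(1.0 / np.linalg.norm(X, axis=0))    # inverse column norms of X
    A_mat = (a1 + a2) * np.eye(n) - a2 * A_rw
    C_mat = -a3 * np.diag(1.0 / deg)                     # C = -a3 * D^{-1}
    D_mat = Dx_inv @ (np.eye(d) - np.ones((d, d)) / d) @ Dx_inv
    Xk = X.copy()                                        # B is the identity here
    for _ in range(t):
        Xk = a1 * X + Xk - A_mat @ Xk - C_mat @ Xk @ D_mat
    return Xk
```

Note that setting $\alpha_2 = \alpha_3 = 0$ and $\alpha_1 = 1$ reduces the update to a fixed point at the input signal, which is a quick sanity check on the recursion.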

B. REFORMULATION EXAMPLES

The reformulation examples of GCN derivatives are presented in Table 4. The timing measurements reported in Table 7 (see Appendix E) are:

                                       1.8    3.8    1.9
GAT (Veličković et al., 2018)          5.4    8.5    3.3
AGNN (Thekumparampil et al., 2018)     5.3    7.9    3.2
APPNP (Klicpera et al., 2018)          9.8   14.2   13.6
GCN + regs (ours)                      5.7    8.0    3.2

We set α₁ = 0.2, α₂ = 0.8 and α₃ = 0.05 for all four datasets. We apply L₂ regularization with λ = 0.0005 and use dropout on both layers. For the training strategy, we initialize weights using the scheme described in (Glorot & Bengio, 2010) and follow the method proposed in GCN, adopting early stopping if the validation loss does not decrease for a certain number of consecutive epochs. The implementations of the baseline models are based on the PyTorch-Geometric library (Fey & Lenssen, 2019) in all experiments.
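The early-stopping rule described above can be sketched as follows. This is a hypothetical minimal implementation; the patience value and the toy loss sequence are placeholders, not the paper's exact settings.

```python
class EarlyStopping:
    """Stop training when the validation loss has not decreased
    for `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# usage inside a training loop (losses are illustrative):
stopper = EarlyStopping(patience=10)
for val_loss in [0.9, 0.8, 0.85, 0.84, 0.83]:
    if stopper.step(val_loss):
        break
```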

D. RANDOM SPLITS

As illustrated in (Shchur et al., 2018), using the same train/validation/test splits of the same datasets precludes a fair comparison of different architectures. Therefore, we follow the setup in (Shchur et al., 2018) and evaluate the performance of our model on three citation networks with random splits. For each dataset, we use 20 labeled nodes per class as the training set, 30 nodes per class as the validation set, and the rest as the test set. For every model, we choose the hyperparameters that achieve the best average accuracy on the Cora and CiteSeer datasets and apply them to the Pubmed dataset. Table 6 shows the results on the three citation networks under the random-split setting. As we can observe, our model consistently achieves higher performance on all datasets. On CiteSeer, our model achieves higher accuracy than on the original split. On Cora and Pubmed, the test accuracies of our model are comparable to those on the original split, while most of the baselines suffer a serious decline.
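The per-class split procedure above can be sketched in a few lines of NumPy. This is a minimal sketch under the stated 20-train / 30-validation per-class protocol; the toy label array is a placeholder, not a real dataset.

```python
import numpy as np

def random_split(labels, train_per_class=20, val_per_class=30, seed=0):
    """Split node indices into train/val/test sets, sampling a fixed
    number of nodes per class for the train and validation sets."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])   # shuffle this class
        train.extend(idx[:train_per_class])
        val.extend(idx[train_per_class:train_per_class + val_per_class])
        test.extend(idx[train_per_class + val_per_class:])
    return np.array(train), np.array(val), np.array(test)

# toy example: 3 classes with 200 nodes each
labels = np.repeat([0, 1, 2], 200)
train, val, test = random_split(labels)
print(len(train), len(val), len(test))  # 60 90 450
```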

E. TIME CONSUMPTION

As shown in Eq. (20), the computational cost of the graph filter with the regularizer is greatly increased by the Kronecker product and matrix inversion operations. Nevertheless, we approximate the filter with the iterative algorithm stated in Eq. (25) and obtain an efficient implementation.

F. ABLATION STUDY

To analyze the effect of the regularization strength, we conduct experiments on three transductive datasets and present the results in Table 8. As we can observe, with a reasonable choice of regularization strength, our approach achieves consistent improvements under all settings. However, when the regularization strength is too large, the training procedure becomes unstable and model performance decreases severely.






Figure 1: Accuracy and mean feature variance on Cora. GCN and GAT are used for comparison.

Test accuracy (%) on transductive learning datasets. We report mean values and standard deviations in 30 independent experiments. The best results are highlighted with boldface.

Table 2 presents the comparison results on the transductive learning datasets.

Comparison results on transductive learning datasets. We report mean values and standard deviations in 30 independent experiments. The best results are highlighted with boldface. The number in the brackets represent the number of GCN layers when achieving the best performance.

Test accuracy (%) on transductive learning datasets with random splits. We report mean values and standard deviations of the test accuracies over 100 random train/validation/test splits.

Training and test time on Cora. We report mean values in 5 independent experiments. The best results are highlighted with boldface.

To empirically verify the computational efficiency, we conduct experiments on Cora and report the training and test time of several GCN models on a single RTX 2080 Ti GPU. Due to the early-stopping rule (see details in Appendix C), the number of training epochs differs across models. The results are shown in Table 7. As we can observe, when combined with vanilla GCNs, the training and test time of our model is similar to GAT and AGNN, and faster than APPNP.

Ablation study on the regularization strength. We report mean values and standard deviations in 30 independent experiments. The best results are highlighted with boldface.

APPENDIX A. PROOFS OF THE DEFINITIONS

Definition 2 (Vanilla GCNs). Let $\hat{x}(i), i \in V$, be the estimate of the input observation $x(i), i \in V$. A low-pass filter,
$$\hat{X} = \tilde{A}_{rw}X, \tag{26}$$
is the first-order approximation of the optimal solution of the following optimization problem:
$$\min_{\hat{X}} \sum_{i \in V}\|\hat{x}(i) - x(i)\|_{\tilde{D}}^2 + \sum_{i,j \in V}a_{ij}\|\hat{x}(i) - \hat{x}(j)\|_2^2.$$

Proof. Let $l$ denote the objective function. We have
$$l = \mathrm{tr}[(\hat{X} - X)^T\tilde{D}(\hat{X} - X)] + \mathrm{tr}(\hat{X}^T\tilde{L}\hat{X}), \qquad \frac{\partial l}{\partial \hat{X}} = 2\tilde{D}(\hat{X} - X) + 2\tilde{L}\hat{X}.$$
Setting $\frac{\partial l}{\partial \hat{X}} = 0$ gives $(I + \tilde{L}_{rw})\hat{X} = X$. As the norms of the eigenvalues of $\tilde{A}_{rw} = I - \tilde{L}_{rw}$ are bounded by 1, $I + \tilde{L}_{rw}$ has eigenvalues in the range $[1, 3]$, which proves that $I + \tilde{L}_{rw}$ is a positive definite matrix. Therefore,
$$\hat{X} = (I + \tilde{L}_{rw})^{-1}X. \tag{28}$$
Unfortunately, computing the closed-form solution of Eq. (28) is expensive. Nevertheless, we can derive a simpler form, $\hat{X} \approx (I - \tilde{L}_{rw})X = \tilde{A}_{rw}X$, via a first-order Taylor approximation, which establishes the definition.

Definition 3 (Residual Connection). A graph convolution filter with a residual connection,
$$\hat{X} = \tilde{A}_{rw}X + \epsilon X, \tag{29}$$
where $\epsilon > 0$ controls the strength of the residual connection, is the first-order approximation of the optimal solution of the following optimization problem:
$$\min_{\hat{X}} \sum_{i \in V}\big(\|\hat{x}(i) - x(i)\|_{\tilde{D}}^2 - \epsilon\|\hat{x}(i)\|_{\tilde{D}}^2\big) + \sum_{i,j \in V}a_{ij}\|\hat{x}(i) - \hat{x}(j)\|_2^2.$$

Proof. Let $l$ denote the objective function. We have
$$l = \mathrm{tr}[(\hat{X} - X)^T\tilde{D}(\hat{X} - X)] - \epsilon\,\mathrm{tr}(\hat{X}^T\tilde{D}\hat{X}) + \mathrm{tr}(\hat{X}^T\tilde{L}\hat{X}),$$
and setting $\frac{\partial l}{\partial \hat{X}} = 0$ gives $((1 - \epsilon)I + \tilde{L}_{rw})\hat{X} = X$. Therefore, the first-order approximation of the optimal solution is
$$\hat{X} \approx (2I - (1 - \epsilon)I - \tilde{L}_{rw})X = \tilde{A}_{rw}X + \epsilon X.$$
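As a toy illustration of Definition 2, the following NumPy sketch builds $\tilde{A}_{rw}$ for a placeholder 4-node cycle graph and compares the exact optimum $(I + \tilde{L}_{rw})^{-1}X$ with its first-order approximation $\tilde{A}_{rw}X$. All values here are assumptions for illustration, not taken from the paper's experiments.

```python
import numpy as np

# toy graph: a 4-node cycle
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
n = A.shape[0]
A_tilde = A + np.eye(n)                                # add self-loops
A_rw = A_tilde / A_tilde.sum(axis=1, keepdims=True)    # random-walk normalization
L_rw = np.eye(n) - A_rw                                # random-walk Laplacian

X = np.random.randn(n, 2)
exact = np.linalg.solve(np.eye(n) + L_rw, X)           # (I + L_rw)^{-1} X, exact optimum
approx = A_rw @ X                                      # first-order approximation, Eq. (26)

# both act as low-pass filters: repeated application smooths features
print(np.abs(exact - approx).max())
```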

C. DATA STATISTICS AND EXPERIMENTAL SETUPS

We conduct experiments on four real-world graph datasets, whose statistics are listed in Table 5. For transductive learning, we evaluate our method on the Cora, Citeseer and Pubmed datasets, following the experimental setup in (Sen et al., 2008). There are 20 labeled nodes per class for training, and all nodes' features are available. 500 nodes are used for validation, and generalization performance is tested on 1000 nodes with unseen labels. PPI (Zitnik & Leskovec, 2017) is adopted for inductive learning; it is a protein-protein interaction dataset containing 20 graphs for training, 2 for validation and 2 for testing, where the testing graphs remain unobserved during training. To ensure a fair comparison with other methods, we implement our module without interfering with the original network structure. In all three settings, we use two convolution layers with hidden dimension h = 64.

