WIDE GRAPH NEURAL NETWORK

Abstract

Graph Neural Networks (GNNs) from both the spatial and the spectral domains often suffer from three problems: over-smoothing, poor flexibility, and low performance on heterophilic graphs. In this paper, we provide a unified view of GNNs from the perspective of matrix space analysis to identify potential reasons for these problems, and we propose a new GNN framework to address them, called Wide Graph Neural Network (WGNN). We formulate GNNs as two components: one constructs a non-parametric feature space, and the other learns the parameters that re-weight the feature space. For instance, spatial GNNs encode the adjacency matrix multiplication as the feature space and stack layers to re-weight it, whereas spectral GNNs sum the polynomials to build the feature space and learn shared model weights. Instead, WGNN constructs the space by concatenating all polynomials and re-weights them individually. The concatenation removes unnecessary constraints on the feature space, which avoids over-smoothing, and the independent parameters give better flexibility. Beyond parameter independence, WGNN is also free to add matrices with an arbitrary number of columns. For instance, by taking the principal components of the adjacency matrix, we can significantly improve the representation of heterophilic graphs. We provide a detailed theoretical analysis and conduct extensive experiments on eight datasets to show the superiority of the proposed WGNN.

1. INTRODUCTION

Graph neural networks (GNNs) have demonstrated great potential in representation learning for graph-structured data, such as social networks, transportation networks, protein interaction networks, and chemical structures (Fan et al., 2019; Wu et al., 2020; Zheng et al., 2022). Despite this success, existing GNNs still suffer from the following issues. Firstly, spatial GNNs aggregate information from connected nodes, resulting in the well-known over-smoothing problem (Cai & Wang, 2020). Secondly, spatial models assume that the features of connected nodes are similar; however, this assumption does not hold in heterophilic graphs (Zheng et al., 2022). Thirdly, spectral GNNs use polynomials to approximate arbitrary graph filters (He et al., 2021; Klicpera et al., 2019; Defferrard et al., 2016). In the absence of layer stacking, spectral GNNs are exempt from over-smoothing. However, they still perform poorly on heterophilic graphs, since each polynomial term shares the same assumption of similarity among neighbors. In addition, spectral methods share the parameters across polynomial terms, leading to a less flexible architecture. To better understand the problems in both domains, some efforts unify spatial and spectral GNNs, e.g., from the perspective of optimization objectives (Ma et al., 2021; Zhu et al., 2021). However, they focus on summarizing general formulas and lack a clear explanation of the problems.

In this paper, we propose a unified view of both spectral and spatial GNNs from the perspective of matrix space analysis, to investigate possible reasons for these problems and to contribute a new way of addressing them. Specifically, for the sake of theoretical investigation, we first abstract a linear approximation of GNNs following Wu et al. (2019a) and Xu et al. (2018a). Then, as shown in the mathematical formulation and implementation structure of Figure 1, we decompose the linear approximation into components with and without parameters: the latter is regarded as a feature space built from node attributes and graph structure (e.g., adjacency or Laplacian matrices), and the former denotes the learnable parameters that re-weight the features. Spatial GNNs 1) build the feature space by taking powers of the adjacency matrix, and 2) form the parameter space by taking the product of the weight matrices. Spectral GNNs sum the polynomials to compose the feature space and share the parameters across polynomial terms. Based on this view, we can identify the reasons for the issues in GNNs. When the feature space is formed by powers of adjacency matrices, we find that over-smoothing is due to feature space compression. The parameter sharing of spectral GNNs limits the flexibility of their architectures. Besides, the common issue of poor performance on heterophilic graphs is caused by the construction of each feature sub-space, which in both kinds of methods embodies the similarity of neighboring nodes.

The primary contribution of this work is a wide architecture of GNNs named Wide Graph Neural Network (WGNN), whose basic architecture is shown in Figure 1. In particular, it constructs the feature space by concatenating the polynomial terms of the adjacency matrix. This concatenation avoids the space compression caused by powers in the spatial domain and alleviates the over-smoothing problem. To account for a feature space with multiple polynomial terms, WGNN re-weights each term with an independent parameter matrix. Unlike spectral GNNs, which use a single parameter matrix for all polynomial terms, WGNN gains flexibility by allowing different parameters for each. WGNN architectures can also augment the feature space with matrices of arbitrary width. With this characteristic, we can improve the performance on heterophilic graphs by adding principal components of the adjacency matrix. This augmentation reduces the dependency of the feature space on the similarity of adjacent nodes, since the principal components only extract the graph structure. Comprehensive experiments on both homophilic and heterophilic datasets demonstrate the superiority of WGNN.

Contributions. (1) We provide a unified view of both spatial and spectral GNNs, which formulates GNNs as jointly constructing the feature space and learning the parameters that re-weight it. (2) We propose a new architecture, WGNN, which avoids over-smoothing, enjoys flexibility, and alleviates heterophily problems, and we provide a detailed theoretical analysis. (3) We conduct experiments on homophilic and heterophilic datasets and achieve significant improvements, e.g., an average accuracy increase of 32% on heterophilic graphs.

2. PRELIMINARIES

In this paper, we focus on the undirected graph $G = (V, E)$, with node attributes $X \in \mathbb{R}^{n \times d}$ for $V$ and adjacency matrix $A \in \mathbb{R}^{n \times n}$ representing $E$. GNNs take the node attributes and the adjacency matrix as input and output the hidden node representations, $H = \mathrm{GNN}(X, A) \in \mathbb{R}^{n \times d}$. By default, we employ the cross-entropy loss in the node classification task to minimize the difference between the node labels $Y$ and the obtained representations, $\mathcal{L}(H, Y) = -\sum_i Y_i \log \mathrm{softmax}(H_i)$.

2.1. SPATIAL AND SPECTRAL GNNS

Spatial GNNs mostly fall into the message-passing paradigm. For any given node, a layer aggregates features from its neighbors and updates the aggregated feature,

$$H^{(k+1)}_i = \sigma\Big(\mathrm{upd}\big(H^{(k)}_i,\ \mathrm{agg}(\hat{A}_{ij}, H^{(k)}_j;\ j \in \mathcal{N}(i))\big)\Big), \tag{1}$$

where $\sigma(\cdot)$ is a non-linear activation function, $H^{(k)}$ is the hidden representation in the $k$-th layer, agg and upd are the aggregation and updating functions (Balcilar et al., 2021), $\hat{A} = (D + I)^{-1/2}(A + I)(D + I)^{-1/2}$ is the re-normalized adjacency matrix using the degree matrix $D$, and $\mathcal{N}(\cdot)$ denotes the 1-hop neighbors. Here, we provide two examples to specify this general expression. One is the vanilla GCN (Kipf & Welling, 2017), which adopts the mean-aggregation and the average-update, as shown in the left part of Figure 1. Its formulation is:

$$H^{(k+1)} = \sigma\big(\hat{A} H^{(k)} W^{(k)}\big). \tag{2}$$

The second example uses a different update scheme with skip-connections (Xu et al., 2018a; Li et al., 2019; Chen et al., 2020b), defined as

$$H^{(k+1)} = \sigma\big(\alpha^{(k)} H^{(0)} W^{(k)}_0 + \hat{A} H^{(k)} W^{(k)}_1\big), \tag{3}$$

where $\alpha^{(k)}$ controls the weight of each layer's skip-connection, and $W^{(k)}_0$, $W^{(k)}_1$ are the transformation weights for the initial layer and the previous one, respectively.

Spectral GNNs originally employ graph Fourier transforms to obtain filters (Chung & Graham, 1997), for example through the eigendecomposition of the Laplacian matrix, $L = I - \hat{A} = U \Lambda U^T$. In recent years, methods of this type have focused more on approximating arbitrary global filters using polynomials (Wang & Zhang, 2022; Zhu & Koniusz, 2020; He et al., 2021), which has shown superior performance and is written as

$$H = \sum_{k=0}^{K} \gamma^{(k)} P_k(L)\, \sigma(X W_1) W_2, \tag{4}$$

where $P_k(\cdot)$ denotes the $k$-th order term of a polynomial, $\gamma^{(k)}$ are the adaptive coefficients, and $W_1$, $W_2$ are learnable parameters. In Figure 1, we replace $W_1 W_2$ with $W^{(0)}$. Note that some instances of spectral filters are not covered in this paper, such as Levie et al. (2018) and Thanou et al. (2014).
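The quantities defined above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the toy path graph, the random weights, the coefficient values, and the function names are all assumptions made for the example, and the spectral filter uses monomials $P_k(L) = L^k$ as one possible choice of polynomial.

```python
import numpy as np

def renormalized_adjacency(A):
    """Return (D + I)^{-1/2} (A + I) (D + I)^{-1/2} for a symmetric adjacency A."""
    A_loop = A + np.eye(A.shape[0])      # add self-loops
    deg = A_loop.sum(axis=1)             # diagonal of D + I
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_loop * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One vanilla GCN layer (equation 2) with ReLU as the activation."""
    return np.maximum(A_hat @ H @ W, 0.0)

def polynomial_filter(L, H0, gammas):
    """Monomial spectral filter sum_k gamma_k L^k H0 (equation 4 with P_k(L) = L^k)."""
    out = np.zeros_like(H0)
    term = H0.copy()
    for g in gammas:
        out += g * term
        term = L @ term
    return out

# Toy path graph 0-1-2-3 (hypothetical data)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

A_hat = renormalized_adjacency(A)
L = np.eye(4) - A_hat                    # Laplacian L = I - A_hat
H1 = gcn_layer(A_hat, X, rng.normal(size=(3, 2)))
H_spec = polynomial_filter(L, X, gammas=[0.5, 0.3, 0.2])
```

The eigenvalues of `L` built this way lie in $[0, 2]$, which is the property the later over-smoothing analysis relies on.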

2.2. CHALLENGING ISSUES

Over-smoothing: In spatial GNNs, when enough layers are stacked, the representations of connected nodes tend to become the same. As a result, unlike deep models with tens of layers, GNNs often have only a few layers. Remedies such as DropEdge (Rong et al., 2020) and the skip-connection scheme (Li et al., 2019; Xu et al., 2018a;b; Chen et al., 2020b) exist, but they have limited effect. Recent research shows that, from a spectral perspective, over-smoothing is the result of low-pass filtering (Wu et al., 2019b; He et al., 2021).

Homophily and heterophily: Homophily and heterophily describe whether connected nodes share the same labels. Following the definition $h = |\{(i, j) \in E : Y_i = Y_j\}| / |E|$ (Zhu et al., 2020), we consider a graph with a larger $h$ as more likely to be homophilic, and otherwise heterophilic. GNNs were designed for homophilic graphs, which makes them ill-suited to heterophilic ones, where they are sometimes even worse than MLPs (Zheng et al., 2022).

Poor flexibility: In spectral GNNs, all polynomial terms share the same parameter matrix, because learning concentrates on updating the coefficients only, as shown in Figure 1. Unlike spatial GNNs, which can form a different weight matrix for each subsequent feature subspace through layer-wise multiplications, this learning mechanism makes the re-weighted feature matrices of spectral methods linearly correlated.
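The homophily ratio $h$ is simple to compute from an edge list. The sketch below is illustrative only; the function name and the four-node toy graph are assumptions made for the example.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label, h in [0, 1]."""
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return float(same.mean())

# Toy graph (hypothetical): two classes, four undirected edges
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
h = edge_homophily(edges, labels)   # edges (0,1) and (2,3) connect equal labels
```

Here two of the four edges join nodes of the same class, so `h` is 0.5.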

3. METHODS

We propose a unified view that decomposes GNNs into the feature space and the parameters. The primary motivation is our conjecture that the issues above are connected through how GNNs use the graph data, i.e., how they construct the feature space. To conduct theoretical investigations of the feature space, we abstract a linear approximation of GNNs, building on the successful linearization attempts of Wu et al. (2019a) and Xu et al. (2018a). Specifically, we offer an overall formulation of the linear approximation of an arbitrary graph neural network $\mathrm{GNN}(X, \hat{A})$ as:

$$H = \mathrm{GNN}(X, \hat{A}) = \sum_{t=0}^{T-1} \Phi_t(X, \hat{A})\, \Theta_t, \tag{5}$$

where $\Phi_t(X, \hat{A}) \in \mathbb{R}^{n \times d_t}$ is the non-parametric feature space constructing function that takes the graph data (e.g., node attributes and graph structure) as input and outputs a feature subspace, $\Theta_t \in \mathbb{R}^{d_t \times c}$ is the parameter matrix that re-weights the corresponding feature subspace for each of the $c$ classes, and $T$ is a hyper-parameter giving the number of feature sub-spaces that the GNN contains. In general, in this linear approximation, a GNN model forms $T$ feature sub-spaces, i.e., $\Phi_t$, and outputs the sum of all sub-spaces re-weighted by the respective parameters $\Theta_t$. Note that the (total) feature space is the union of the sub-spaces, $\Phi = \{\Phi_t\}_{t=0,1,\cdots,T-1}$. Similarly, the (total) parameters are $\Theta = \{\Theta_t\}_{t=0,1,\cdots,T-1}$. Besides, the number of sub-spaces $T$ of a GNN model need not equal its number of layers or its polynomial order; we provide examples in Section 3.1.

In what follows, we first identify the feature space $\Phi$ and the parameters $\Theta$ of existing GNNs. Then, leveraging the linear approximation, we introduce our proposed wide-form GNN architecture, WGNN. Lastly, we theoretically analyze the reasons behind the failures of existing GNNs, e.g., over-smoothing and poor performance under heterophily, and conclude the superiority of WGNN.

3.1. REVISITING SPATIAL AND SPECTRAL GNNS

Spatial GNNs. We first transform the recursive formula of spatial GNNs, e.g., equation 1, into an explicit formula by iterating from the initial node attributes $H^{(0)} = X$ and ignoring the activation function. Following Section 2, we consider two examples of spatial GNNs: vanilla GCN (Kipf & Welling, 2017) and the one with skip-connections (Xu et al., 2018a). The linearly approximated explicit formula of a $K$-layer GCN is written as:

$$H^{(K)} = \hat{A}^K X \prod_{i=0}^{K-1} W^{(i)}, \tag{6}$$

which forms a single feature space $\Phi_0 = \hat{A}^K X$ with parameters $\Theta_0 = \prod_{i=0}^{K-1} W^{(i)}$ and $T = 1$. Equation 3 additionally considers skip-connections, and its $K$-layer linearly approximated explicit formula is:

$$H^{(K)} = \left[\sum_{i=0}^{K-1} \hat{A}^i X\, \alpha^{(K-1-i)} W^{(K-1-i)}_0 \prod_{j=K-i}^{K-1} W^{(j)}_1\right] + \hat{A}^K X \prod_{h=0}^{K-1} W^{(h)}_1. \tag{7}$$

By this decomposition, the GCN with skip-connections consists of $T = K + 1$ feature sub-spaces, each formed as $\Phi_t = \hat{A}^t X$. For the first $T - 1$ sub-spaces, the respective parameters are $\Theta_{t;\, t < T-1} = \alpha^{(K-1-t)} W^{(K-1-t)}_0 \prod_{j=K-t}^{K-1} W^{(j)}_1$, and for the last sub-space $\Phi_{T-1} = \hat{A}^K X$, the parameter is $\Theta_{T-1} = \prod_{h=0}^{K-1} W^{(h)}_1$. Please refer to Appendix A.1 for the derivation.

Spectral GNNs. Spectral GNNs are specified by the explicit formula in equation 4. We remove the activation function and obtain the linear approximation of a $K$-order spectral GNN as:

$$H^{(K)} = \sum_{k=0}^{K} P_k(L)\, X\, \gamma^{(k)} W^{(0)}. \tag{8}$$

We group the learnable polynomial coefficient $\gamma^{(k)}$ with the parameter matrices. Also, we combine the shared parameter matrices in equation 4 as $W^{(0)} = W_1 W_2$. In this way, equation 8 forms $T = K + 1$ feature sub-spaces, where each sub-space is $\Phi_t = P_t(L) X$, and the parameters that re-weight the respective sub-spaces are $\Theta_t = \gamma^{(t)} W^{(0)}$.

Primary analysis. In Table 1, we summarize more instances of spatial and spectral methods, with different colors to distinguish the feature space $\Phi$ (orange) and parameters $\Theta$ (blue).
It demonstrates that the proposed unified view covers most of the methods in both the spatial and the spectral domains. Due to the page limit, we put the examples of GCNII (Chen et al., 2020b) and ARMA (Bianchi et al., 2021) in Appendix C.8. Compared to the general formulation of re-weighting feature sub-spaces, e.g., equation 5, existing GNNs impose constraints on both the feature space and the parameter space. We can observe that the feature space $\Phi$ of spatial GNNs is always constrained to powers of the adjacency matrix, which is potentially related to the over-smoothing problem. Most of the parameters $\Theta$ of spectral GNNs are shared across sub-spaces, which limits the flexibility of adequately re-weighting each sub-space. Besides, the feature space $\Phi$ in both spatial and spectral GNNs is formed by multiplying a function of the structural matrices with the node attributes (e.g., $\Phi_k = P_k(L) X$). This multiplication with the node attributes $X \in \mathbb{R}^{n \times d}$ forces every feature sub-space to have $d$ columns, which rules out feature matrices of other shapes.

Table 1: The feature space and parameters of the linear approximation for GNN models.

| Model | Original formula* | Linear approximation |
|---|---|---|
| GCN (Kipf & Welling, 2017) | $H^{(k+1)} = \sigma(\hat{A} H^{(k)} W^{(k)})$ | $H^{(K)} = \hat{A}^K X \prod_{i=0}^{K-1} W^{(i)}$ |
| GIN (Xu et al., 2018a) | $H^{(k+1)} = \sigma\big((\epsilon^{(k)} I + \hat{A}) H^{(k)} W^{(k)}_0 W^{(k)}_1\big)$ | $H^{(K)} = \sum_{t=0}^{K} \hat{A}^t X \Big(\sum_{\{q_0,\cdots,q_{K-t-1}\} \subseteq \{\epsilon^{(0)},\cdots,\epsilon^{(K-1)}\}} \prod_i q_i\Big) \cdot \prod_{j=0}^{K-1} W^{(j)}_0 W^{(j)}_1$ |
| APPNP (Klicpera et al., 2019) | $H^{(k+1)} = (1-\alpha)\hat{A} H^{(k)} + \alpha H^{(0)}$; $H^{(0)} = \sigma(X W_1) W_2$ | $H^{(K)} = \Big[(1-\alpha)^K \hat{A}^K X + \sum_{i=0}^{K-1} \alpha(1-\alpha)^i \hat{A}^i X\Big] W_1 W_2$ |
| ChebyNet (Defferrard et al., 2016)** | $H = \sum_{k=0}^{K} P_k(L) X W^{(k)}$ | $H^{(K)} = \sum_{t=0}^{K} P_t(L) X W^{(t)}$ |
| GPRGNN (Chien et al., 2021) | $H = \sum_{k=0}^{K} \gamma^{(k)} L^k \sigma(X W_1) W_2$ | $H^{(K)} = \sum_{t=0}^{K} L^t X \gamma^{(t)} W_1 W_2$ |
| BernNet (He et al., 2021) | $H = \sum_{k=0}^{K} \frac{1}{2^K}\binom{K}{k} \gamma^{(k)} (2I - L)^{K-k} L^k \sigma(X W_1) W_2$ | $H^{(K)} = \sum_{t=0}^{K} (2I - L)^{K-t} L^t X\, \frac{1}{2^K}\binom{K}{t} \gamma^{(t)} W_1 W_2$ |
| WGNN (Ours) | $H = \sum_{k=0}^{K-1} P_k(L) X W^{(k)} + S W^{(s)}$ | $H = \sum_{k=0}^{K-1} P_k(L) X W^{(k)} + \sum_{j=0}^{J-1} S_j W^{(j)}$ |

* Unless otherwise specified, $H^{(0)} = X$. ** $P_k(x)$ denotes the Chebyshev polynomial: $P_0(x) = 1$, $P_1(x) = x$, $P_k(x) = 2x P_{k-1}(x) - P_{k-2}(x)$.
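As a sanity check on the decomposition $H = \sum_t \Phi_t \Theta_t$, the sketch below compares a layer-by-layer linear GCN with its single-subspace form (equation 6) and a monomial spectral model with its $K + 1$ sub-spaces (equation 8). It is an illustration under stated assumptions: a random symmetric matrix stands in for $\hat{A}$, all weights and coefficients are random, and $P_t(L) = L^t$ is chosen for the spectral case.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
n, d, c, K = 6, 4, 3, 3

# Random symmetric stand-in for the re-normalized adjacency (assumption)
M = rng.random((n, n))
A_hat = (M + M.T) / 2
L = np.eye(n) - A_hat
X = rng.normal(size=(n, d))

# Linear K-layer GCN, evaluated layer by layer (equation 2 without sigma)
Ws = [rng.normal(size=(d, d)) for _ in range(K - 1)] + [rng.normal(size=(d, c))]
H = X
for W in Ws:
    H = A_hat @ H @ W

# Same model as one feature space and one parameter matrix (equation 6, T = 1)
Phi0 = np.linalg.matrix_power(A_hat, K) @ X
Theta0 = reduce(np.matmul, Ws)
H_gcn = Phi0 @ Theta0

# Linear spectral GNN (equation 8): T = K + 1 sub-spaces with shared W0
gammas = rng.normal(size=K + 1)
W0 = rng.normal(size=(d, c))
Phis = [np.linalg.matrix_power(L, t) @ X for t in range(K + 1)]
Thetas = [g * W0 for g in gammas]          # every Theta_t is a multiple of W0
H_spec = sum(P @ T for P, T in zip(Phis, Thetas))
H_direct = sum(g * np.linalg.matrix_power(L, t) for t, g in enumerate(gammas)) @ X @ W0
```

Because every `Thetas[t]` is a scalar multiple of the same `W0`, the spectral parameters are tied together, which is the sharing constraint discussed in the primary analysis.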

3.2. OUR PROPOSAL: WIDE GRAPH NEURAL NETWORK

Given the observations in the last part, we propose the Wide Graph Neural Network, a generalized framework of GNNs that relaxes these constraints and is formulated as

$$H = \sum_{k=0}^{K-1} P_k(L) X W^{(k)} + \sum_{j=0}^{J-1} S_j W^{(j)}. \tag{9}$$

It constructs the feature space in two parts. The first part is inherited from previous GNNs: feature sub-spaces of the same size are formed by multiplying the polynomials of the structural matrix $P_k(L)$ with the node attributes $X$. In the second part, we allow feature sub-spaces $S_j$ with an arbitrary number of columns, rather than the same number of columns as $X$. Beyond these, WGNN uses independent parameter matrices $W^{(k)}$ and $W^{(j)}$ to re-weight each feature sub-space, which provides flexible re-weighting. To sum up, WGNN forms a feature space of $T = K + J$ sub-spaces, denoted as $\Phi_k = P_k(L) X \in \mathbb{R}^{n \times d}$, similar to those of existing GNNs, and $\Phi_j = S_j \in \mathbb{R}^{n \times d_j}$, the additional part with an arbitrary number of columns, which together compose the total feature space $\Phi = \{\Phi_k\}_{k=0,1,\cdots,K-1} \cup \{\Phi_j\}_{j=0,1,\cdots,J-1}$. Respectively, $\Theta_k \in \mathbb{R}^{d \times c}$ and $\Theta_j \in \mathbb{R}^{d_j \times c}$ re-weight them with respect to the objective.

In general, $S_j$ could be any transformation of the node features $X$, the graph structure $\hat{A}$, or both. The feature spaces $\Phi_k = P_k(L) X \in \mathbb{R}^{n \times d}$ use either the node features $X$ only ($k = 0$) or both the node attributes and the graph structure ($k > 0$). Using node features makes the feature space depend on the similarity of adjacent nodes, which is tied to the heterophily problem. In WGNN, we break this dependency and form $S_j$ from the graph structure only. To extract low-dimensional information from the graph structure, we use truncated SVD to get its principal components:

$$S = \tilde{Q}\tilde{V}; \quad \hat{A} = Q V R^T, \tag{10}$$

where $\tilde{Q}$ and $\tilde{V}$ keep only the leading singular vectors and singular values, and $S$ indicates that we use a single $S_j$, i.e., $J = 1$. Throughout the remainder of the paper, we use truncated SVD as the running case for WGNN. In addition, the empirical results of other transformation functions are given in Appendix C.4. Notably, the feature spaces $\Phi_k$ and $S$ may have imbalanced scales, which causes poor re-weighting. We therefore add a column-wise normalization to ensure that each column contributes equally to the whole feature space.
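The construction in this section can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the authors' code: the function names, the random toy graph, the Chebyshev recurrence applied directly to $L$ (as in Table 1), the rank `r`, and the unit L2 column normalization are choices made for the example, and the weights are random rather than trained.

```python
import numpy as np

def chebyshev_terms(L, X, K):
    """Return [P_0(L)X, ..., P_{K-1}(L)X] with P_k = 2 L P_{k-1} - P_{k-2}."""
    terms = [X]
    if K > 1:
        terms.append(L @ X)
    for _ in range(2, K):
        terms.append(2 * L @ terms[-1] - terms[-2])
    return terms[:K]

def truncated_svd_features(A_hat, r):
    """Structure-only block S built from the top-r singular vectors and values."""
    Q, s, _ = np.linalg.svd(A_hat)
    return Q[:, :r] * s[:r]

def column_normalize(F, eps=1e-12):
    """Scale every column to (approximately) unit L2 norm."""
    return F / (np.linalg.norm(F, axis=0, keepdims=True) + eps)

def wgnn_forward(blocks, weights):
    """Sum of independently re-weighted, column-normalized feature sub-spaces."""
    return sum(column_normalize(B) @ W for B, W in zip(blocks, weights))

# Toy data (hypothetical): random undirected graph with self-loops added
rng = np.random.default_rng(0)
n, d, c, K, r = 6, 4, 3, 3, 2
A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
A = A + A.T
A_loop = A + np.eye(n)
deg = A_loop.sum(axis=1)
A_hat = A_loop / np.sqrt(np.outer(deg, deg))
L = np.eye(n) - A_hat
X = rng.normal(size=(n, d))

blocks = chebyshev_terms(L, X, K) + [truncated_svd_features(A_hat, r)]
weights = [rng.normal(size=(B.shape[1], c)) for B in blocks]   # one matrix per block
H = wgnn_forward(blocks, weights)
```

Summing the re-weighted blocks is the same as multiplying the horizontally concatenated ("wide") feature matrix by the vertically stacked weights, which is why the structure-only block can have a different number of columns from $X$.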

3.3. THEORETICAL ANALYSIS

In this part, we analyze the feature spaces formed by different GNNs to explain the challenges of over-smoothing and poor performance on heterophilic graphs.

Over-smoothing. The over-smoothing problem occurs when deep GNN layers are stacked. We describe this phenomenon by the compression of the column span of the feature space. The column span is defined as the set of all possible linear combinations of the matrix's columns and is denoted as $\mathrm{Span}(\Phi) = \{\sum_i a_i \Phi_{\cdot i};\ a_i \in \mathbb{R}\}$. We provide Theorem 3.1 to interpret the cause of this issue.

Theorem 3.1. The span of the feature space $\Phi_k = \hat{A}^k X$ shrinks gradually as $k$ increases, which leads to the over-smoothing problem.

Let us first look at the feature space of a $K$-layer vanilla GCN,

$$\hat{A}^K X = (I - L)^K X = U (I - \Lambda)^K U^T X; \quad \Lambda \in [0, 2]^n.$$

It is a linear combination of the columns of $U (I - \Lambda)^K$ with coefficients $U^T X$, where $(I - \Lambda)^K$ re-weights $U$ column-wise. Since $\Lambda \in [0, 2]^n$, we have $(I - \Lambda) \in [-1, 1]^n$. As $K$ increases, the weights of some columns of $U$ approach zero, which shrinks $\mathrm{Span}(U (I - \Lambda)^K)$, because $\mathrm{Span}(\hat{A}^K X) = \mathrm{Span}(U (I - \Lambda)^K U^T X) \subset \mathrm{Span}(U (I - \Lambda)^K)$. Figure 2(a) demonstrates this phenomenon: the distribution of the eigenvalues shifts towards zero with increasing layers, while the similarity within the combination $\hat{A}^K X$ increases greatly. In this way, over-smoothing occurs because the limited column span of the feature space compresses the representations. To alleviate this issue, some modified GNNs use skip-connections, which form the feature space as $\{\hat{A}^k X\}_{k=0,1,\cdots,K}$ by joining the spaces of successive layers, so that $\mathrm{Span}(\{\hat{A}^k X\}_{k=0,1,\cdots,K}) = \mathrm{Span}(\hat{A}^0 X) \cup \mathrm{Span}(\hat{A}^1 X) \cup \cdots \cup \mathrm{Span}(\hat{A}^K X)$. This space is not affected by the compression of later components, e.g., $\mathrm{Span}(\hat{A}^K X)$, and therefore avoids over-smoothing. A similar conclusion can be derived for the feature space of the spectral type, which forms $\{L^k X\}_{k=0,1,\cdots,K}$.

Following this, we can understand the performance bottleneck with increasing layers through the similarity between the feature space introduced by later spatial layers or spectral orders and the previous ones. Here, we provide a quantitative analysis to better describe this property. For this purpose, we take the feature space of a spectral GNN (Chien et al., 2021) as an example, i.e., $\{L^k X\}_{k=0,1,\cdots,4}$, and measure the linear correlation of the appended $k$-th feature space with the previous ones by calculating the mutual-correlation values:

$$E^k_i = \max_{j=0,\cdots,k-1} \mu(L^j X,\ L^k X_{\cdot i}),$$

where $i$ is the index of the column in $L^k X$, and $\mu(M_0, M_1) = \max_{d_u \in M_0,\, d_v \in M_1} \cos(d_u, d_v)$ is the mutual-coherence of two matrices, based on the cosine similarity $\cos$ between their columns. In Figure 2(b), we visualize the distribution of $\{E^k_i\}$ over all columns for $k = 1, 2, 3, 4$. It confirms that the linear correlation increases substantially, so each appended term expands the feature space only slightly. Therefore, increasing spectral orders or spatial layers can hardly enhance performance. Some studies also explain the over-smoothing problem, e.g., Huang et al. (2020), Oono & Suzuki (2019), and Cai & Wang (2020); our perspective differs from theirs in the underlying view, and we provide a comparison in Appendix B.3.

Poor performance in heterophily. The performance of the majority of GNNs on heterophilic graphs is much worse than on homophilic graphs. To understand this issue from the perspective of matrix space analysis, we study the linear correlation of the feature space with the label space. We provide an empirical analysis of the distribution of the following mutual-coherence values in Figure 2(c),

$$E^k = \frac{1}{C} \sum_c E^k_c; \quad E^k_c = \mu(L^k X_{\mathcal{T}\cdot},\ Y'_{\mathcal{T}c}),$$

where we randomly sample 60% of the rows to mimic the training set, denoted as $\mathcal{T}$, and report the mean with variance. $C$ is the number of classes, i.e., the number of columns of the one-hot node label matrix $Y' \in \mathbb{R}^{n \times C}$. The figure shows that the coherence between the feature space and the label space is much higher on homophilic graphs than on heterophilic ones, which leads to poorer performance in the heterophily case. Besides, Theorem A.1 adds a theoretical explanation from the feature space alone: the mutual-coherence of the heterophilic feature space, i.e., $\mu(LX)$, is higher than that of the homophilic one.

WGNN compared to spatial and spectral GNNs. Our WGNN provides a new way of dealing with graph data, in which both the graph structure and the node attributes are regarded as input cues for constructing feature spaces. In this way, the complex multiplicative design between the graph structure and the node attributes can be avoided and the constraints on the feature space are relaxed, which contributes to better model flexibility and generalizability on heterophilic graphs. From the perspective of feature space construction, WGNN uses all the sub-spaces $P_k(L) X$. Compared with spatial GNNs, the feature space of WGNN is more flexible, as $\mathrm{Span}(\{\hat{A}^k X\}_{k=0,1,\cdots,K}) \subset \mathrm{Span}(\{P_k(L) X\}_{k=0,1,\cdots,K})$, since it builds the space using all the polynomial terms, while spatial GNNs only take the highest-order term. This property helps WGNN avoid the over-smoothing problem. Besides, we append a sub-space $S$ to the feature space of WGNN, which is built from the graph structure only. Without using the node attributes, this sub-space is closer to the label space than the others, which depend heavily on the similarity of nodes, and it achieves better performance on heterophilic graphs. As shown in Figure 2(c), the sub-space $S$ helps the feature space approach the labels, especially for heterophilic graphs. From the view of parameters, compared with spectral GNNs, WGNN relaxes all the constraints on the parameters and allows the feature space to be re-weighted independently. In Appendix A.4, we demonstrate the parameter constraints within previous GNNs. It shows that the constraints on the parameters ($W^{(k)}$) limit the span of the weighted feature space (see Theorem A.2).
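The mutual-correlation values $E^k_i$ used in the analysis above can be computed as in the following sketch. It follows the definition in the text (maximum cosine similarity, without taking absolute values); the ring graph, the random features, and the function name are assumptions made for illustration, so the printed values would not reproduce Figure 2.

```python
import numpy as np

def mutual_coherence(M0, M1):
    """Largest cosine similarity between a column of M0 and a column of M1."""
    U = M0 / np.linalg.norm(M0, axis=0, keepdims=True)
    V = M1 / np.linalg.norm(M1, axis=0, keepdims=True)
    return float(np.max(U.T @ V))

# Toy ring graph with random features (hypothetical data)
rng = np.random.default_rng(0)
n, d = 8, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A_hat = (A + np.eye(n)) / 3.0        # every node has degree 2, so D + I = 3I
L = np.eye(n) - A_hat
X = rng.normal(size=(n, d))

# Feature sub-spaces L^k X for k = 0, ..., 4
powers = [np.linalg.matrix_power(L, k) @ X for k in range(5)]

# E[k][i] = max over j < k of mu(L^j X, i-th column of L^k X)
E = {
    k: [max(mutual_coherence(powers[j], powers[k][:, [i]]) for j in range(k))
        for i in range(d)]
    for k in range(1, 5)
}
```

Values of `E[k][i]` close to 1 indicate that the $i$-th column of the newly appended sub-space is nearly parallel to a column that is already present.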

4. EXPERIMENTS

We evaluate the proposed WGNN on the following aspects: (1) node classification results, (2) robustness on the challenging issues, and (3) ablation studies.

Dataset. We run our experiments on the homophilic datasets Cora, CiteSeer, PubMed, Computers, and Photo (Yang et al., 2016; Shchur et al., 2018), and on the heterophilic datasets Chameleon, Squirrel, and Actor (Rozemberczki et al., 2021; Pei et al., 2020). More details are provided in Appendix C.

Baselines. We compare against a list of state-of-the-art GNN methods. For spatial GNNs, we have GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), GraphSAGE (Hamilton et al., 2017), GCNII (Chen et al., 2020b), and APPNP (Klicpera et al., 2019), where MLP is included as a particular case. For spectral GNNs, we take ChebyNet (Defferrard et al., 2016), GPRGNN (Chien et al., 2021), and BernNet (He et al., 2021). Besides, we cover the recent unified models ADA-UGNN (Ma et al., 2021) and GNN-LF/HF (Zhu et al., 2021). WGNN employs either Chebyshev or monomial polynomials to construct the feature space, and we name the corresponding versions WGNN-C and WGNN-M, respectively. Please refer to Appendix C.3 for more details about the implementation.

We test on the transductive node classification task with random 60%/20%/20% splits and summarize the results of 100 runs in Table 2, reporting the average accuracy with a 95% confidence interval. We observe that WGNN achieves the best or nearly the best performance on homophilic graphs. In particular, compared with the current SoTA method ADA-UGNN (Ma et al., 2021), which unifies the objectives in both the spatial and the spectral domains, WGNN achieves a 1.1% accuracy improvement on average over the five homophilic datasets. Besides, WGNN obtains a 32.0% improvement on average over the GCN baseline on the three heterophilic datasets. The excellent performance on both homophilic and heterophilic graphs indicates the potential of WGNN.
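The evaluation protocol (random 60%/20%/20% splits, mean accuracy with a 95% confidence interval over repeated runs) might be set up along the following lines. The text does not say how the interval is computed, so the normal-approximation formula below is an assumption, as are the function names; the accuracy values are made-up placeholders, not results from the paper.

```python
import numpy as np

def random_split(n, rng, train=0.6, val=0.2):
    """Randomly partition node indices 0..n-1 into train/validation/test sets."""
    perm = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

def mean_with_ci95(values):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    values = np.asarray(values, dtype=float)
    half_width = 1.96 * values.std(ddof=1) / np.sqrt(len(values))
    return float(values.mean()), float(half_width)

rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = random_split(100, rng)

# Placeholder accuracies from hypothetical repeated runs (not real results)
toy_acc = [0.80, 0.82, 0.81, 0.79]
mean_acc, ci_half = mean_with_ci95(toy_acc)
```

A bootstrap interval is a common alternative when the per-run accuracies are far from normally distributed.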

4.2. ROBUSTNESS ON THE CHALLENGING ISSUES

Over-smoothing. We evaluate WGNN with different numbers of feature sub-spaces $K$ on the Cora (homophilic) and Chameleon (heterophilic) datasets. Figure 5 indicates that the performance of WGNN does not drop as the feature space is extended. This is because concatenating the polynomials of the adjacency matrix avoids the compression of the feature space.

Heterophilic graphs. As shown in Table 2 and Section 4.1, WGNN greatly alleviates the problem of heterophily by supplementing the feature space with a sub-space built from the graph structure only, which reduces the dependency on the similarity of nodes.
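The compression argument can be illustrated numerically: on a small connected graph, the single deep feature space $\hat{A}^K X$ collapses towards rank one, while the concatenated space keeps at least the rank of $X$. The sketch below is an illustration under assumptions: the ring graph, the deliberately large $K = 200$, and the relative tolerance of the rank estimate are choices made for the example.

```python
import numpy as np

def numerical_rank(M, rel_tol=1e-8):
    """Number of singular values above rel_tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
n, d, K = 8, 3, 200                      # K is deliberately large

# Ring graph: every node has degree 2, so the re-normalized adjacency is (A + I) / 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A_hat = (A + np.eye(n)) / 3.0
X = rng.normal(size=(n, d))

# Deep spatial feature space: only the highest power is kept
deep = np.linalg.matrix_power(A_hat, K) @ X

# Wide feature space: all powers concatenated column-wise
wide = np.hstack([np.linalg.matrix_power(A_hat, k) @ X for k in range(K + 1)])

rank_deep = numerical_rank(deep)
rank_wide = numerical_rank(wide)
```

On this graph all eigenvalues of $\hat{A}$ except the leading one have magnitude below 1, so their contribution to $\hat{A}^K X$ vanishes as $K$ grows and only the leading eigenvector survives.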

4.3. ABLATION STUDIES

In this subsection, we study the contribution of the different components of WGNN by answering the following questions.

How does each sub-space, e.g., $P_k(L)X$ and $S$, affect the results? $P_k(L)X$ and $S$ matter on homophilic and heterophilic graphs, respectively. We evaluate WGNN on 5 datasets covering both homophilic and heterophilic graphs under 3 different feature space constructions: w/o $S$, w/o $P_k(L)X_{k=0}$, and w/o $P_k(L)X_{k>0}$, which respectively denote building the feature space without the graph structure, without the node attributes, and without the combination of them. In the ablation results of Table 3, we find that w/o $S$ works well on homophilic graphs but fails on heterophilic ones, while the other two behave in the opposite way.

Does the column-wise normalization matter? It matters when the number of nodes is not huge. As shown in Table 3, column-wise normalization works well in most cases, except for PubMed. This may be because the large number of nodes in PubMed leads to tiny values in the normalized feature space.

What ratio of the truncated SVD is adequate? 94%. We conduct an experiment using different ratios of singular vectors and values to construct $S$, i.e., the top $j$ singular values $V_{jj}$ that account for a given ratio of the components. In Figure 4, we report the test accuracy with respect to increasing ratios of singular values on CiteSeer and Chameleon: CiteSeer is robust to the variation of the ratio, while Chameleon shows the best performance at ratio = 94%. Thus, we use 94% for the other experiments. We offer more interpretation of the SVD results in Appendix C.7.

What polynomial order is sufficient? Three. We test increasing orders $K$ of the polynomials on Cora and Chameleon, as shown in Figure 5. The performance on Cora rises from order 1 to 3 and then decreases slightly, while Chameleon shows only minor changes. This suggests that order 3 is good enough to achieve nearly optimal performance. Please refer to Appendix C.5 for more results.

Do dropout (Agarap, 2018) and DropEdge (Rong et al., 2020) help? No. We respectively integrate "dropout+ReLU" and DropEdge into WGNN and show the corresponding performance with different drop ratios on four datasets in Figure 6. Unfortunately, both of them perform worse as the drop rate increases. This may be because these regularization tricks break the graph structure. Besides, the results also call for more attention to feature space construction, rather than over-applying deep-learning tricks.

Is WGNN easy to train? Comparably, yes, and WGNN converges fast. In Table 2, we collect the training time per epoch (ms) for each method, which shows that WGNN has a time cost comparable to other baselines, such as GCN (Kipf & Welling, 2017). Note that the time we report includes the graph propagation for a fair comparison, though WGNN can further reduce it by constructing the feature space in a pre-processing step. This advantage comes from the lower-order polynomial feature space and the much simpler feed-forward computation. Please refer to Appendix C.3 for more details on the optimal architectures for the baselines. In Figure 3, we compare the convergence time of all methods and observe that WGNN needs the fewest training epochs while achieving the highest accuracy.

Is the SVD applicable in practice? Yes. In Table 4, we show the training time, the SVD time (as pre-processing), and their ratio for WGNN. We find that the share of the SVD time in the whole training time is lower than 10%, which confirms the applicability of WGNN.
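One way to turn the singular-value ratio from the truncated SVD ablation into a rank for $S$ is sketched below. The text does not spell out the exact rule, so this interpretation (keep the smallest number of leading singular values whose cumulative sum reaches the given share of the total) is an assumption, as are the function names.

```python
import numpy as np

def rank_for_ratio(singular_values, ratio):
    """Smallest j such that the top-j singular values reach `ratio` of their total sum."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    cumulative = np.cumsum(s) / s.sum()
    j = int(np.searchsorted(cumulative, ratio)) + 1
    return min(j, len(s))

def structure_features(A_hat, ratio=0.94):
    """Structure-only block S from the leading singular vectors and values of A_hat."""
    Q, s, _ = np.linalg.svd(A_hat)       # singular values are returned in descending order
    j = rank_for_ratio(s, ratio)
    return Q[:, :j] * s[:j]

# Toy matrix (hypothetical) with known singular values 4, 3, 2, 1
A_toy = np.diag([4.0, 3.0, 2.0, 1.0])
S_94 = structure_features(A_toy, ratio=0.94)
S_50 = structure_features(A_toy, ratio=0.50)
```

Because the SVD depends only on the graph, `structure_features` can be run once as pre-processing, which matches the timing discussion above.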

5. CONCLUSIONS, LIMITATIONS, AND FUTURE RESEARCH

In this paper, we provide a unified view for analyzing GNNs, which separates the feature space and the parameters using a linear approximation. Based on this view, we provide a theoretical analysis of existing challenges in terms of the feature space or the parameter space. To address these challenges, we propose a flexible architecture that relaxes all constraints, called Wide Graph Neural Network (WGNN). Comprehensive experiments are conducted to verify its superiority.

Limitations and future research. More general nonlinear cases, such as GAT (Velickovic et al., 2018) and GateGNN (Bresson & Laurent, 2017), are not included in our work and will be considered in future work. The mechanism between the feature space and the respective parameters is worth more effort to optimize; for instance, the number of parameters in WGNN could be further reduced by introducing reasonable constraints. Finally, since WGNN treats both the graph structure and the node attributes as input graph data, more feature space construction methods remain to be discovered.

A PROOFS

A.1 DERIVATION OF EQUATION 7

Iterating equation 3 from $H^{(0)} = X$, we have

$$H^{(0)} = X \tag{11}$$
$$H^{(1)} = \alpha^{(0)} X W^{(0)}_0 + \hat{A} X W^{(0)}_1 \tag{12}$$
$$H^{(2)} = \alpha^{(1)} X W^{(1)}_0 + \hat{A} \alpha^{(0)} X W^{(0)}_0 W^{(1)}_1 + \hat{A}^2 X W^{(0)}_1 W^{(1)}_1 \tag{13}$$
$$H^{(3)} = \alpha^{(2)} X W^{(2)}_0 + \hat{A} \alpha^{(1)} X W^{(1)}_0 W^{(2)}_1 \tag{14}$$
$$\qquad + \hat{A}^2 \alpha^{(0)} X W^{(0)}_0 W^{(1)}_1 W^{(2)}_1 + \hat{A}^3 X W^{(0)}_1 W^{(1)}_1 W^{(2)}_1 \tag{15}$$
$$\cdots \tag{16}$$

Identifying the rule of the iteration, we obtain

$$H^{(k)} = \sum_{i=0}^{k-1} \delta^{(k)}_i + \hat{A}^k X \prod_{h=0}^{k-1} W^{(h)}_1, \tag{17}$$

where the terms $\delta^{(k)}_i$ are calculated by:

$$\delta^{(k)}_i = \alpha^{(k-1-i)} \hat{A}^i X W^{(k-1-i)}_0 \prod_{j=k-i}^{k-1} W^{(j)}_1. \tag{18}$$

We substitute equation 18 into equation 17 and absorb $\alpha^{(k-1-i)}$ into the learnable parameters $W^{(k-1-i)}_0$, which yields equation 7.
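The closed form in equations 17 and 18 can be checked numerically against the recursion of equation 3 with the activation removed. The sketch below uses random matrices and a random symmetric stand-in for $\hat{A}$, all of which are assumptions made for the check; the helper name `chain` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5, 3, 4

# Random symmetric stand-in for the re-normalized adjacency (assumption)
M = rng.random((n, n))
A_hat = (M + M.T) / 2
X = rng.normal(size=(n, d))
alpha = rng.random(K)
W0 = [rng.normal(size=(d, d)) for _ in range(K)]
W1 = [rng.normal(size=(d, d)) for _ in range(K)]

def chain(mats, lo, hi):
    """Ordered product mats[lo] @ ... @ mats[hi]; the identity when lo > hi."""
    out = np.eye(d)
    for j in range(lo, hi + 1):
        out = out @ mats[j]
    return out

# Recursion: H^(k+1) = alpha^(k) X W0^(k) + A_hat H^(k) W1^(k), with H^(0) = X
H = X
for k in range(K):
    H = alpha[k] * X @ W0[k] + A_hat @ H @ W1[k]

# Closed form: last term of equation 17 plus the delta terms of equation 18
power = np.linalg.matrix_power
closed = power(A_hat, K) @ X @ chain(W1, 0, K - 1)
for i in range(K):
    closed += alpha[K - 1 - i] * power(A_hat, i) @ X @ W0[K - 1 - i] @ chain(W1, K - i, K - 1)
```

The two results agree up to floating-point error, which also confirms the index ranges of the products in equation 18.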

A.2 THE COLUMN-WISE NORMALIZATION IN CURRENT GNNS

Here, we include some 10th-order polynomial functions to examine how differently these models respond to column-wise normalization. Column-wise normalization is defined as enforcing $\|F_{\cdot i}\|_2 = 1$, where $F$ is the concatenation of the feature space. We extend this to an arbitrary multiple $k$, i.e., $\|F_{\cdot i}\|_2 = k$, which amounts to measuring how consistent the norms $\|F_{\cdot i}\|_2$ are across columns. Therefore, we report the standard variance of $\{\|F_{\cdot i}\|_2;\ i = 1, 2, \dots\}$, where a smaller value suggests a greater response to column-wise normalization. Chebyshev, Bernstein, and monomial polynomials are compared in Table 5. The Bernstein polynomial produces the least variance, suggesting that it encourages the most atomicity compared to the other polynomials. This observation aligns with the narrative of the original BernNet paper He et al. (2021), where the authors claim that the Bernstein polynomial is more numerically stable than other polynomial functions.
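The measurement described above can be sketched as follows. This is an illustrative reconstruction, not the script behind Table 5: it uses a toy ring graph instead of Cora, covers only the monomial and Chebyshev bases, takes the shifted operator $L - I = -\hat{A}$ as the polynomial argument, and reports the standard deviation of the column norms (swap in the variance if that is the intended statistic). All function names are hypothetical.

```python
import numpy as np

def monomial_blocks(L, X, K):
    """Return [X, LX, L^2 X, ..., L^K X]."""
    blocks = [X]
    for _ in range(K):
        blocks.append(L @ blocks[-1])
    return blocks

def chebyshev_blocks(L, X, K):
    """Return [P_0(L)X, ..., P_K(L)X] using P_k = 2 L P_{k-1} - P_{k-2}."""
    blocks = [X, L @ X]
    for _ in range(2, K + 1):
        blocks.append(2 * (L @ blocks[-1]) - blocks[-2])
    return blocks[: K + 1]

def column_norm_std(blocks):
    """Spread of the column norms of the concatenated feature space F."""
    F = np.concatenate(blocks, axis=1)
    return np.linalg.norm(F, axis=0).std()

rng = np.random.default_rng(0)
n, d, K = 30, 4, 10
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1.0   # toy ring graph
deg = A.sum(axis=1)
L_shift = -A / np.sqrt(np.outer(deg, deg))            # L - I, spectrum in [-1, 1]
X = rng.standard_normal((n, d))

spread = {
    "monomial": column_norm_std(monomial_blocks(L_shift, X, K)),
    "chebyshev": column_norm_std(chebyshev_blocks(L_shift, X, K)),
}
```

A value of zero would mean that every column of $F$ already has the same norm, so explicit column-wise normalization would change nothing.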

A.3 EXPLAINING HETEROPHILY FROM THE PERSPECTIVE OF FEATURE SPACE

Theorem A.1. The mutual coherence of the heterophilic feature space, i.e., $\mu(LX)$, is higher than that of the homophilic one.

Proof. In a binary classification task, we assume that the node features are drawn from two separate $p$-dimensional multivariate Gaussian distributions $D_0 = N(\vec{\mu}_0, \Sigma_0)$ and $D_1 = N(\vec{\mu}_1, \Sigma_1)$, corresponding to classes $c_0$ and $c_1$. $\Sigma_0$ and $\Sigma_1$ are both diagonal, i.e., any two dimensions $i \neq j$ are independent. The node features are sampled equally from the two distributions, $X = \{x_u;\ x_u \sim D_0\} \cup \{x_v;\ x_v \sim D_1\}$, where each set contains $n$ samples. Without loss of generality, take two columns of $X$ such that $d_i \perp d_j$. Equivalently, this leads to:

\begin{align}
d_i^T d_j &= \sum_{u \sim D_0} x_{ui} x_{uj} + \sum_{v \sim D_1} x_{vi} x_{vj} \tag{19}\\
&\sim n E(D_{0i} D_{0j}) + n E(D_{1i} D_{1j}) \tag{20}\\
&= n E(D_{0i}) E(D_{0j}) + n E(D_{1i}) E(D_{1j}) \tag{21}\\
&= n(\mu_{0i}\mu_{0j} + \mu_{1i}\mu_{1j}) \tag{22}\\
&= 0. \tag{23}
\end{align}

Note that to obtain (20), we use the law of large numbers, e.g., $\sum_{i=0}^{n} A_i \approx n\bar{A}$. So far, the orthogonality gives the equation $\mu_{0i}\mu_{0j} + \mu_{1i}\mu_{1j} = 0$, which is the only equation that our assumptions provide.

Next, we examine the effect of $L$ on $d_i^T d_j$. $L$ is applied by left multiplication, which amounts to a row-wise transformation of $d_i$ and $d_j$. We consider the two extreme cases of homophily and heterophily, respectively. Firstly, under homophily, $L$ only combines nodes from the same class. For simplicity, if each transformation is an average, i.e., $L_{ij}$ is row-wise normalized, $D_0$ and $D_1$ remain the same. Therefore, the orthogonality relation remains. Secondly, under heterophily, $L$ combines nodes from different classes. In this situation, $D_0$ and $D_1$ shift to $D'_0 = N(\vec{\mu}_0 - \vec{\mu}_1, \Sigma_0 + \Sigma_1)$ and $D'_1 = N(\vec{\mu}_1 - \vec{\mu}_0, \Sigma_0 + \Sigma_1)$. Here, we rewrite the derivation starting from (21):

\begin{align}
d_i^T d_j &= n E(D'_{0i}) E(D'_{0j}) + n E(D'_{1i}) E(D'_{1j}) \tag{24}\\
&= n(\mu_{0i} - \mu_{1i})(\mu_{0j} - \mu_{1j}) + n(\mu_{1i} - \mu_{0i})(\mu_{1j} - \mu_{0j}) \tag{25}\\
&= 2n(\mu_{0i} - \mu_{1i})(\mu_{0j} - \mu_{1j}) \tag{26}\\
&= 2n(\mu_{0i}\mu_{0j} + \mu_{1i}\mu_{1j}) - 2n(\mu_{0i}\mu_{1j} + \mu_{1i}\mu_{0j}) \tag{27}\\
&= -2n(\mu_{0i}\mu_{1j} + \mu_{1i}\mu_{0j}). \tag{28}
\end{align}

Based on this, the orthogonality condition no longer holds, because $\mu_{0i}\mu_{1j} + \mu_{1i}\mu_{0j}$ is not equal to zero in general. To sum up, heterophily breaks the limiting orthogonality condition, while homophily does not. As a result, the columns of $LX$ are more easily mixed row-wise and lose their mutual distinctiveness. In other words, heterophily increases the mutual coherence of $LX$, defined as follows.

Definition A.1. The mutual coherence of a matrix $A \in R^{n \times n}$ is the maximal absolute inner product between two distinct columns, $\mu(A) = \max_{1 \le i, j \le n,\ i \neq j} |a_i^T a_j|$, where each column is normalized as $\|a_i\|_2 = 1$, $1 \le i \le n$.

Based on this analysis and definition, we find that the mutual coherence of the feature space in a heterophilic graph, e.g., $LX$, is likely to be greater than that of a homophilic one. As a consequence, this phenomenon is superimposed in the overall feature space $F = \Vert\{P_k(L)X;\ k = 0, 1, \dots, K\}$ and undermines the power of the feature space.
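Definition A.1 translates directly into a few lines of code. The helper below is a minimal sketch (the name `mutual_coherence` is ours); it normalizes the columns, forms the Gram matrix, and ignores the diagonal so that only distinct columns are compared.

```python
import numpy as np

def mutual_coherence(A):
    """Largest |<a_i, a_j>| over distinct, unit-normalized columns of A."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    A = A / np.where(norms > 0, norms, 1.0)   # leave all-zero columns untouched
    G = np.abs(A.T @ A)                       # Gram matrix of the columns
    np.fill_diagonal(G, 0.0)                  # exclude i == j
    return G.max()

orthogonal = mutual_coherence(np.eye(3))               # orthonormal columns
correlated = mutual_coherence([[1.0, 1.0],
                               [0.0, 1.0]])            # columns at 45 degrees
```

Applied to $X$ and $LX$ on a given graph, the same function gives an empirical counterpart of the comparison made in Theorem A.1.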

A.4 PARAMETERS CONSTRAINTS IN PREVIOUS GNNS

Theorem A.2. The span of the feature space $\Phi_k = \hat{A}^k X$ shrinks gradually as $k$ increases, which leads to the over-smoothing problem.

Proof. We summarize the constraints on $W$ in current GNNs as follows: i) in the MLP-based implementation He et al. (2020), all layers share the same $W$, which forces the layer-wise representation parameters into a single matrix; and ii) in the layer-wise implementation Kipf & Welling (2017); Li et al. (2019), each $W_{k+1}$ is built upon the previous ones, i.e., $W_{k+1} = \prod_{i=0}^{k+1} W_i$. We distill these constraints into the following example. Suppose a linearly correlated feature space $U' = (d_0, d_1, \lambda_0 d_0, \lambda_1 d_1)$, where $d_0 \perp d_1$ and $d_k \in R^2$, and let $x \in \mathrm{Span}\{d_0, d_1\}$ be recovered from the elements of $U'$. We impose the two aforementioned types of constraint on the undetermined variables $b_0$, $b_1$, $b_2$, and $b_3$: i) $b_2 = b_0$, $b_3 = b_1$, and ii) $b_2 = \mu b_0$, $b_3 = \mu b_1$, where $\mu$ is a trainable scalar. They correspond to the two GNN settings above. We discuss the two cases in turn. Representing $x$ in the first case yields:

\begin{align}
x &= b_0 d_0 + b_1 d_1 + b_0 \lambda_0 d_0 + b_1 \lambda_1 d_1 \tag{30}\\
&= (1 + \lambda_0) b_0 d_0 + (1 + \lambda_1) b_1 d_1. \tag{31}
\end{align}

Using the unique representation theorem Hoffman & Kunze (2004), with $a_0, a_1$ the coordinates of $x$ in the basis $\{d_0, d_1\}$, we have $(1 + \lambda_0) b_0 = a_0$ and $(1 + \lambda_1) b_1 = a_1$. In matrix form:

\begin{equation}
\begin{pmatrix} 1 & 0 & \lambda_0 & 0 \\ 0 & 1 & 0 & \lambda_1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_0 \\ b_1 \end{pmatrix}
= \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}, \tag{32}
\end{equation}

which produces:

\begin{equation}
\begin{pmatrix} 1 + \lambda_0 & 0 \\ 0 & 1 + \lambda_1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
= \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}. \tag{33}
\end{equation}

It has the closed-form solution $b_0 = \frac{a_0}{1 + \lambda_0}$, $b_1 = \frac{a_1}{1 + \lambda_1}$. Then, we represent $x$ in the second case:

\begin{align}
x &= b_0 d_0 + b_1 d_1 + \mu b_0 \lambda_0 d_0 + \mu b_1 \lambda_1 d_1 \tag{34}\\
&= (1 + \mu\lambda_0) b_0 d_0 + (1 + \mu\lambda_1) b_1 d_1, \tag{35}
\end{align}

which produces $(1 + \mu\lambda_0) b_0 = a_0$ and $(1 + \mu\lambda_1) b_1 = a_1$. In matrix form:

\begin{equation}
\begin{pmatrix} 1 & 0 & \lambda_0 & 0 \\ 0 & 1 & 0 & \lambda_1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ \mu b_0 \\ \mu b_1 \end{pmatrix}
= \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}. \tag{36}
\end{equation}

This is an under-determined system and gives $b_0 = \frac{a_0}{1 + \mu\lambda_0}$, $b_1 = \frac{a_1}{1 + \mu\lambda_1}$, $b_2 = \frac{\mu a_0}{1 + \mu\lambda_0}$, and $b_3 = \frac{\mu a_1}{1 + \mu\lambda_1}$.
We look into the values of $b_k$ to assess the expressivity of the bases. In the extreme case where $\lambda_0 \to 0$, the appended $\lambda_0 d_0$ is constrained while the original one keeps expressing. On the contrary, when $\lambda_0 \to \infty$, the original base $d_0$ is constrained by $\frac{a_0}{1 + \lambda_0}$ or $\frac{a_0}{1 + \mu\lambda_0}$ while the appended one expresses. Besides, for the second case, when $\mu \to 0$, the corresponding bases are limited by $b_2, b_3 \to 0$. Consequently, both cases lead to partial expression of the feature spaces. Finally, comparing the two constrained systems with the unconstrained one, we find that they merely restrict the parameter space of $(b_0, b_1, b_2, b_3)^T$ by either sharing values or enforcing linear dependence. Therefore, restricting the parameter space in these two cases leads to partial expression of the feature spaces. This completes the proof.

In the following, we provide the norm of the learned parameter matrix with respect to each column of the feature space, as shown in Figure 7. We find that the parameters of existing GNNs are limited, leaving a significant part of the feature space unexplored. In contrast, WGNN abandons all constraints on the parameters and allows all columns to be re-weighted.

2021) first bridge the spatial methods to the spectral ones by assigning most spatial GNNs their corresponding graph filters. More specifically, they begin with the convolution matrix of each GNN model and then summarize its frequency response. For example, GCN (Kipf & Welling, 2017) uses the convolution matrix $D^{-1/2} \tilde{A} D^{-1/2}$, which leads to the filter $\Phi_{GCN}(\lambda) \approx 1 - \lambda p/(p - 1)$. This work draws attention to the unified perspective on GNNs, though it fails to explain the existing progress and issues from the spectral view. Ma et al. (2021) regard the aggregation process of GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), and APPNP (Klicpera et al., 2019) as a graph signal denoising problem, which aims to recover a clean signal $H$ from $\min_H \|H - X\|_F^2 + c \cdot \mathrm{tr}(H^T L H)$. Given this, the authors generalize the smoothing regularization term to $\sum_{i \in V} \frac{C_i}{2} \sum_{j \in N(i)} \|H_i/\sqrt{d_i} - H_j/\sqrt{d_j}\|_2^2$ and propose ADA-UGNN. However, it also lacks an understanding of over-smoothing and heterophily. Zhu et al. (2021) give a more comprehensive summary of GNNs from an optimization view, which partly overlaps with the graph signal denoising view of Ma et al. (2021). Based on their conclusion, they propose GNN-LF/HF with parameters adjusting the corresponding objective, e.g., GNN-LF approaches $\min_H \|I + \beta L^{1/2}(H - X)\|_F^2 + (1/\alpha - 1)\,\mathrm{tr}(H^T L H)$ and behaves as a low-pass filter. They attribute over-smoothing to the absence of the original features and overcome this issue in their proposal; however, heterophily is left untouched as well. In general, these integrated perspectives focus on general formulas but lack an explanation of the issues.
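The denoising objective $\min_H \|H - X\|_F^2 + c \cdot \mathrm{tr}(H^T L H)$ has a closed-form minimizer for symmetric positive semi-definite $L$: setting the gradient $2(H - X) + 2cLH$ to zero gives $H = (I + cL)^{-1} X$. The sketch below verifies this on a toy path graph; the combinatorial Laplacian, the value of $c$, and the function names are illustrative assumptions.

```python
import numpy as np

def denoise(X, L, c):
    """Minimizer of ||H - X||_F^2 + c * tr(H^T L H) for symmetric PSD L."""
    n = L.shape[0]
    return np.linalg.solve(np.eye(n) + c * L, X)

def objective(H, X, L, c):
    return np.sum((H - X) ** 2) + c * np.trace(H.T @ L @ H)

rng = np.random.default_rng(0)
n, d, c = 8, 3, 0.7
A = np.zeros((n, n))
i = np.arange(n - 1)
A[i, i + 1] = A[i + 1, i] = 1.0          # toy path graph
L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian (illustrative choice)
X = rng.standard_normal((n, d))

H_star = denoise(X, L, c)
grad = 2 * (H_star - X) + 2 * c * (L @ H_star)   # should vanish at the minimizer
```

Larger values of `c` pull neighboring rows of `H_star` closer together, which is the low-pass behavior that these optimization-based views attribute to aggregation.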

B.2 SHALLOW MODELS

SIGN (Rossi et al., 2020) performs the aggregation as a pre-processing step, which is similar to our idea of feature space construction. However, it focuses on large-scale scenarios by removing the sampling and aggregation operations from training. FSGNN (Maurya et al., 2021) implements feature selection on GNNs, similar to feature sub-space selection. Nevertheless, it combines all the sub-spaces with one linear transformation, which makes it harder to understand the contribution of each part. These shallow models empirically approach the idea of the feature space, since they do not perform better as deeper models. Besides, concatenation, which is adopted by many spatial GNNs, is also an instance of the feature space concept. In our view, these observations and modifications can be framed as methods to expand the feature space.
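The pre-processing idea shared by these shallow models can be sketched in a few lines. This is a minimal illustration under our own assumptions, not code from SIGN, FSGNN, or WGNN: the propagated blocks $\hat{A}^k X$ are computed once with the self-loop renormalization commonly used by GCN, and each block then receives its own weight matrix. The names `precompute_feature_space` and `forward` are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def precompute_feature_space(A, X, K):
    """Blocks [X, A_hat X, ..., A_hat^K X], computed once before training."""
    A_hat = normalized_adjacency(A)
    blocks = [X]
    for _ in range(K):
        blocks.append(A_hat @ blocks[-1])
    return blocks

def forward(blocks, weights):
    """Re-weight every block with its own matrix and sum the results."""
    return sum(B @ W for B, W in zip(blocks, weights))

rng = np.random.default_rng(0)
n, d, c, K = 6, 4, 3, 2
A = np.zeros((n, n))
i = np.arange(n - 1)
A[i, i + 1] = A[i + 1, i] = 1.0                      # toy path graph
X = rng.standard_normal((n, d))
blocks = precompute_feature_space(A, X, K)           # no graph operation after this line
weights = [rng.standard_normal((d, c)) for _ in blocks]
logits = forward(blocks, weights)
```

Summing per-block products is equivalent to multiplying the concatenation of the blocks by the vertically stacked weights, so the training loop only touches dense matrices.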

B.3 OTHER ANALYTICAL VIEWS OF OVER-SMOOTHING

As mentioned in Section 3, several works exist that explain the over-smoothing problem. However, our perspective, i.e., matrix space analysis, differs from the existing ones. Oono & Suzuki (2019) propose one of the most accepted views, which is followed by later literature Huang et al. (2020); Shan et al. (2021). They assume a stable point of the node features determined by the node degrees and show that, whatever the initial node features are, some nodes converge to this stable degree-based feature as the number of layers goes to infinity, at a rate that depends on the node's corresponding eigenvalue. In detail, they prove that $d_M(X^{(l)}) \le (s\lambda)^l d_M(X^{(0)})$, where $d_M(X^{(l)})$ is the distance to a given feature built upon the node degrees and converges to 0 when $s\lambda < 1$. For conciseness, we call it the degree view. It analyzes the feature space from a row-wise perspective. Some studies also take a row-wise perspective by measuring the global Dirichlet energy of the node features Cai & Wang (2020).

In contrast, our view starts from the columns of the feature space by comparing the (column) span of the extended feature space. In particular, we factorize the feature space, for example, $L^k X$, into the (column-wise) bases given by the eigenvector matrix $U$ of $L$. Each column is re-weighted by the corresponding eigenvalue $\Lambda_{ii}^k$ and then expressed through the static weight matrix $U^T X$. Therefore, we consider the feature space spanned by the columns of $U \Lambda^k$ using the weight matrix $U^T X$, which differs from the node-wise (row-wise) perspective of the degree view. Moreover, our view can potentially explain the inferior performance on heterophilic datasets compared with homophilic ones. By Theorem A.1, the mutual coherence of the feature matrix (e.g., $LX$) is higher in heterophilic settings, leading to a shrunk feature space.
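The column-wise factorization used in this argument, $L^k X = (U \Lambda^k)(U^T X)$, can be verified directly. The snippet is a small numerical illustration on a toy ring graph; the graph, sizes, and variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 3, 4
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1.0    # toy ring graph
deg = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))        # normalized Laplacian
X = rng.standard_normal((n, d))

lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T, U orthogonal
bases = U * lam ** k                # columns of U scaled by lam_i^k, i.e. U Lambda^k
static_weights = U.T @ X            # independent of k

direct = np.linalg.matrix_power(L, k) @ X
factored = bases @ static_weights
```

Only `bases` depends on `k`: raising the eigenvalues to higher powers rescales the columns of $U$, while `static_weights` stays fixed, which is the sense in which the span of the feature space is compared across orders.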

C EXPERIMENTAL SETTINGS

C.1 DATASET DETAILS

The datasets are summarized in Table 6, with licenses. Cora, CiteSeer, and PubMed are commonly used homophilic citation networks Yang et al. (2016). Computers and Photo are homophilic co-purchase networks from Amazon Shchur et al. (2018). For heterophilic datasets, we utilize the hyperlinked networks Squirrel and Chameleon from Pei et al. (2020), and Actor, a subgraph of the film-director-actor network Rozemberczki et al. (2021). PyG is used to obtain these data. Each dataset is randomly split into three parts: 60% as the training set, 20% as the validation set, and 20% as the test set. We set these datasets to undirected graphs, as assumed in the Preliminaries. We report the average accuracy (micro F1 score) in the classification task with a 95% confidence interval in all the tables and figures. For each result, we run 100 times on 10 random seeds. Besides, we present the standard variance of the results of Table 2 in Table 7.

Model: original formula; linear approximation formula*

GCN: $H^{(k+1)} = \sigma(\hat{A} H^{(k)} W^{(k)})$; $H^{(K)} = \hat{A}^K X \prod_{i=0}^{K-1} W^{(i)}$

GIN (Xu et al., 2018a): $H^{(k+1)} = \sigma\big((\epsilon^{(k)} I + \hat{A}) H^{(k)} W_0^{(k)} W_1^{(k)}\big)$; $H^{(K)} = \sum_{t=0}^{K} \hat{A}^t X \Big(\sum_{\{q_0, \dots, q_{K-t-1}\} \subseteq \{\epsilon^{(0)}, \dots, \epsilon^{(K-1)}\}} \prod_i q_i\Big) \cdot \prod_{j=0}^{K-1} W_0^{(j)} W_1^{(j)}$

GCNII (Chen et al., 2020b): $H^{(l+1)} = \sigma\big(((1 - \alpha^{(l)}) \hat{A} H^{(l)} + \alpha^{(l)} H^{(0)})((1 - \beta^{(l)}) I + \beta^{(l)} W^{(l)})\big)$; $H^{(K)} = \sum_{l=0}^{K-1} \hat{A}^l X \Big\{\prod_{i=L-l}^{L-1} (1 - \alpha^{(i)})\, \alpha^{(L-l-1)} \prod_{j=L-l-1}^{L-1} W^{(j)}\Big\} + \hat{A}^K X \prod_{h=0}^{K-1} (1 - \alpha^{(h)}) W^{(h)}$

ARMA (Bianchi et al., 2021): $H^{(K)} = \sigma(L H^{(K-1)} W_1 + X W_2)$; $H^{(K)} = \sum_{t=0}^{K} L X W_2^{t} W_1^{K-t}$

APPNP (Klicpera et al., 2019): $H^{(k+1)} = (1 - \alpha) \hat{A} H^{(k)} + \alpha H^{(0)}$, $H^{(0)} = \sigma(X W_1) W_2$; $H^{(K)} = \sum_{t=0}^{K} (1 - \alpha)^t \hat{A}^l H^{(0)} + \sum_{i=0}^{t-1} \alpha (1 - \alpha)^i \hat{A}^i H^{(0)} W_1 W_2$

ChebyNet (Defferrard et al., 2016)**: $H = \sum_{k=0}^{K} P_k(L) X W^{(k)}$; $H^{(K)} = \sum_{t=0}^{K} P_t(L) X W^{(t)}$

GPRGNN (Chien et al., 2021): $H = \sum_{k=0}^{K} \gamma^{(k)} L^k \sigma(X W_1) W_2$; $H^{(K)} = \sum_{t=0}^{K} L^t X \gamma^{(t)} W_1 W_2$

BernNet (He et al., 2021): $H = \sum_{k=0}^{K} \frac{1}{2^K} \binom{K}{k} \gamma^{(k)} (2I - L)^{K-k} L^k \sigma(X W_1) W_2$; $H^{(K)} = \sum_{t=0}^{K} (2I - L)^{K-t} L^t X \gamma^{(t)} W_1 W_2$

WGNN (Ours): $H = \sum_{k=0}^{K} P_k(L) X W^{(k)} + S W^{(s)}$; $H = \sum_{k=0}^{K-1} P_k(L) X W^{(k)} + \sum_{j=0}^{J-1} S_j W^{(j)}$

* Without specification, $H^{(0)} = X$.
** $P_k(x)$ denotes the Chebyshev polynomial: $P_0(x) = 1$, $P_1(x) = x$, $P_k(x) = 2x P_{k-1}(x) - P_{k-2}(x)$.
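The linear-approximation column can be checked numerically row by row. As an example, the sketch below treats the GIN row under the assumption that the activation is replaced by the identity: stacking $K$ linear layers $(\epsilon^{(k)} I + \hat{A}) H^{(k)} W_0^{(k)} W_1^{(k)}$ should equal a sum over powers $\hat{A}^t X$ whose coefficients are sums of products of $K - t$ distinct $\epsilon$ values. All sizes and random matrices are illustrative stand-ins.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, K = 5, 3, 3
A = rng.random((n, n))
A = (A + A.T) / (2 * n)                  # symmetric stand-in for the normalized adjacency
X = rng.standard_normal((n, d))
eps = rng.random(K)
W0 = rng.standard_normal((K, d, d))
W1 = rng.standard_normal((K, d, d))

# Layer-by-layer evaluation with the nonlinearity removed.
H = X.copy()
for k in range(K):
    H = (eps[k] * np.eye(n) + A) @ H @ W0[k] @ W1[k]

# Ordered product of all weight matrices, shared by every polynomial term.
W_all = np.eye(d)
for j in range(K):
    W_all = W_all @ W0[j] @ W1[j]

# Unrolled form: coefficient of A^t is the sum over all subsets of K - t epsilons.
closed = np.zeros_like(X)
for t in range(K + 1):
    coeff = sum(np.prod(q) for q in combinations(eps, K - t))
    closed += coeff * np.linalg.matrix_power(A, t) @ X @ W_all

gin_matches_unrolled_form = np.allclose(H, closed)
```

The check highlights the point of the table: the feature space consists of the blocks $\hat{A}^t X$, and the model-specific part is only how the coefficients and weights attached to each block are tied together.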



The implementation of WGNN is available at https://drive.google.com/drive/folders/1A6VWiPmKRhCNfdcuFJvnxTiTgzgbJIZ6?usp=sharing
Chameleon, Squirrel: https://github.com/benedekrozemberczki/MUSAE/blob/master/LICENSE
Cora, CiteSeer, PubMed, Actor: https://networkrepository.com/policy.php
Computers, Photo: https://github.com/shchur/gnn-benchmark/blob/master/LICENSE
https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html



Figure 1: WGNN compared with current GNNs

Figure 2: Visualization of the analysis. (a) collects the distribution of the eigenvalues with increasing powers of the adjacency matrix. It shifts to zero gradually, which compresses the feature space and decreases the diversity of all the nodes' features, measured by MAD global (Chen et al., 2020a). (b) shows that the similarity between each later feature space and the accumulated previous ones increases greatly, leading to only a marginal extension of the space. (c) compares the distances of different feature spaces to the labels, where the heterophilic case has a greater distance and our proposed S can reduce the distance.

Figure 3: Convergence curve

Figure 7: We present the expression of the feature space. The y-axis marks the respective norm of the parameters, i.e., $\|W_i\|_2$ for each $F_{\cdot i}$, and the x-axis shows the index of the feature spaces from different spatial layers or spectral orders. WGNN relaxes the constraints on the parameters, leading to a full expression of the feature space.

Overall Performance of Wide Graph Neural Networks (WGNN)

47±0.09 83.56±0.10 87.83±0.10 86.94±0.06 93.89±0.10 39.01±0.51 63.90±0.11 42.47±0.07
ADA-UGNN 14.36±0.21 88.92±0.11 79.34±0.09 90.08±0.05 89.56±0.09 94.66±0.07 44.58±0.16 59.25±0.16 41.38±0.12

Ablation study of the components in WGNN

Variant                Cora          CiteSeer      PubMed        Squirrel      Chameleon
                       ±0.22         81.96 ±0.23   89.87 ±0.49   67.82 ±0.26   73.33 ±0.35
WGNN-M                 89.09 ±0.22   81.76 ±0.23   89.93 ±0.23   67.90 ±0.23   73.26 ±0.38
w/o norm               86.23 ±1.43   79.32 ±0.59   90.27 ±0.49   64.70 ±1.10   68.25 ±1.64
w/o S                  89.20 ±0.93   81.95 ±0.87   89.76 ±0.46   43.21 ±0.99   61.54 ±1.52
w/o P_k(L)X, k>0       71.10 ±1.72   74.38 ±1.01   86.61 ±0.54   67.90 ±0.96   73.35 ±1.21
w/o P_k(L)X, k=0       84.70 ±1.05   58.60 ±2.19   85.84 ±0.45   65.75 ±0.63   72.61 ±1.60

Time consumption of SVD

The column-wise normalization response for different polynomials on Cora

Statistics of Datasets

Overall performance of WGNN compared to the baselines, reporting the standard variance (Table 7, the counterpart of Table 2)

Method      Cora          CiteSeer      PubMed        Computers     Photo         Squirrel      Chameleon     Actor
            ±0.59         76.67 ±0.90   85.11 ±0.91   82.62 ±0.73   84.16 ±0.46   37.86 ±1.38   57.83 ±1.09   38.99 ±0.60
GCN         87.69 ±1.39   79.31 ±1.63   86.71 ±0.63   83.24 ±0.39   88.61 ±1.28   47.21 ±2.06   61.85 ±1.32   28.61 ±1.36
GAT         88.07 ±1.45   80.80 ±0.93   86.69 ±0.48   82.86 ±1.23   90.84 ±1.12   33.40 ±4.99   51.82 ±4.68   33.48 ±1.25
GraphSAGE   87.74 ±1.46   79.20 ±1.48   87.65 ±0.51   87.38 ±0.52   93.59 ±0.47   48.15 ±1.59   62.45 ±1.70   36.39 ±0.88
GCNII       87.46 ±0.18   80.76 ±1.05   88.82 ±0.75   84.75 ±0.78   93.21 ±0.89   43.28 ±1.21   61.80 ±1.54   38.61 ±0.90
APPNP       87.92 ±0.72   81.42 ±0.95   88.16 ±0.49   85.88 ±0.46   90.40 ±1.20   39.63 ±1.00   59.01 ±1.68   39.90 ±0.88
ChebNet     87.17 ±0.96   77.97 ±1.84   89.04 ±0.42   87.92 ±0.65   94.58 ±0.55   44.55 ±1.39   64.06 ±2.39   25.55 ±8.43
GPRGNN      87.97 ±1.23   78.57 ±1.56   89.11 ±0.44   86.07 ±0.71   93.99 ±0.54   43.66 ±1.12   63.67 ±1.72   36.93 ±1.30
BernNet     87.66 ±1.33   79.34 ±1.63   89.33 ±0.40   88.66 ±0.44   94.03 ±0.40   44.57 ±1.68   63.07 ±2.15   36.89 ±1.43
GNN-LF      88.12 ±0.06   83.66 ±0.06   87.79 ±0.05   87.63 ±0.05   93.79 ±0.06   39.03 ±0.08   59.84 ±0.09   41.97 ±0.06
GNN-HF      88.47 ±0.09   83.56 ±0.10   87.83 ±0.10   86.94 ±0.06   93.89 ±0.10   39.01 ±0.51   63.90 ±0.11   42.47 ±0.07
ADA-UGNN    88.92 ±0.11   79.34 ±0.09   90.08 ±0.05   89.56 ±0.09   94.66 ±0.07   44.58 ±0.16   59.25 ±0.16   41.38 ±0.12
WGNN-C      89.45 ±1.10   81.96 ±1.18   89.87 ±0.74   90.79 ±0.40   95.36 ±0.70   67.82 ±1.31   73.33 ±1.78   40.54 ±0.79
WGNN-M      89.09 ±1.22   81.76 ±1.17   89.93 ±0.50   90.60 ±0.53   95.45 ±0.73   67.90 ±1.18   73.26 ±1.95   40.91 ±0.61
w/o norm    86.23 ±1.99   79.32 ±0.78   90.27 ±0.69   89.43 ±0.55   94.94 ±0.78   64.70 ±1.53   68.25 ±2.30   37.46 ±0.86
w/o S       89.20 ±1.30   81.95 ±1.16   89.76 ±0.65   89.10 ±0.60   94.56 ±0.77   43.21 ±1.39   61.54 ±2.12   40.89 ±0.60
w/o P_k     71.10 ±2.41   74.38 ±20.86  86.61 ±0.75   89.58 ±0.56   94.90 ±0.44   67.90 ±1.34   73.35 ±1.70   38.44 ±1.01
w/o P_0     84.70 ±1.47   58.60 ±3.06   85.84 ±0.64   90.02 ±0.32   92.92 ±0.72   65.75 ±0.88   72.61 ±2.23   25.89 ±4.80

Studying Over-smoothing on WGNN, DropEdge, SkipConnection, and GCNII

                 ± 0.88         88.57 ± 1.75   88.60 ± 1.07   88.03 ± 0.69   87.98 ± 0.74   88.25 ± 0.58   87.93 ± 1.20   88.23 ± 0.97
DropEdge         83.44 ± 1.83   77.38 ± 1.68   60.98 ± 2.18   55.93 ± 1.51   51.05 ± 0.58   50.91 ± 0.76   40.29 ± 2.40   36.45 ± 8.6
SkipConnection   87.54 ± 0.56   87.03 ± 0.82   86.92 ± 0.47   86.89 ± 0.56   87.46 ± 0.66   87.00 ± 0.82   86.87 ± 0.62   86.89 ± 1.1
GCNII            87.31 ± 1.44   87.80 ± 1.68   87.57 ± 2.18   88.09 ± 1.51   88.20 ± 1.44   88.52 ± 1.33   87.64 ± 1.87   88.16 ± 0.18

Chameleon
WGNN             74.02 ± 1.02   74.20 ± 1.46   74.01 ± 1.46   74.18 ± 1.22   74.18 ± 1.52   73.92 ± 1.39   73.84 ± 1.27   73.97 ± 1.18
DropEdge         30.67 ± 2.03   22.71 ± 1.49   20.87 ± 1.18   21.87 ± 1.74   22.62 ± 1.35   21.72 ± 0.50   21.31 ± 3.99   21.51 ± 2.07
SkipConnection   60.50 ± 0.45   59.71 ± 0.40   58.86 ± 0.53   58.16 ± 0.35   57.96 ± 0.40   58.89 ± 0.45   58.26 ± 0.52   58.53 ± 0.51
GCNII            60.50 ± 0.45   59.71 ± 0.40   58.86 ± 0.53   58.16 ± 0.35   57.96 ± 0.40   58.89 ± 0.45   58.26 ± 0.52   58.53 ± 0.51

Squirrel
WGNN             68.44 ± 0.59   68.87 ± 0.82   68.70 ± 0.71   68.23 ± 1.08   68.13 ± 0.78   68.36 ± 0.85   67.62 ± 1.00   66.40 ± 1.20
DropEdge         27.95 ± 0.98   28.42 ± 0.60   27.88 ± 0.81   26.56 ± 1.49   26.32 ± 1.26   23.92 ± 1.40   23.24 ± 1.60   22.40 ± 0.98
SkipConnection   42.27 ± 0.50   41.33 ± 0.38   41.39 ± 0.39   40.11 ± 0.57   39.70 ± 0.39   40.25 ± 0.63   39.50 ± 0.54   40.06 ± 0.4
GCNII            42.27 ± 0.50   41.33 ± 0.38   41.39 ± 0.39   40.11 ± 0.57   39.70 ± 0.39   40.25 ± 0.63   39.50 ± 0.54   40.06 ± 0.46

C.8 FEATURE SPACE AND PARAMETERS FOR MORE GNN MODELS

Here, we present Table 1 in a more readable form (at a larger scale and rotated 90 degrees), with GCNII Chen et al. (2020b) added.

Feature Space and Parameters for More GNN Models

REPRODUCIBILITY STATEMENT

The code used in our experiments is provided in the supplementary material. For the datasets used in the experiments, a comprehensive description is given in Appendix C. We employ Adam for optimization and set the early stopping criteria as a warmup of 50 plus a patience of 200 for a maximum of 100 epochs. We conduct all the experiments on a machine with an NVIDIA 3090 GPU (24G) and an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz.
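One way to read the early stopping criteria is sketched below. The interpretation is an assumption on our part: "warmup" is taken as the number of initial epochs during which stopping is disabled, and "patience" as the number of epochs without a new best validation accuracy that is tolerated afterwards. The helper `should_stop` is hypothetical and not taken from the released code.

```python
def should_stop(val_history, warmup=50, patience=200):
    """Decide whether to stop after the epochs recorded in val_history.

    val_history: validation accuracies, one entry per finished epoch.
    """
    epoch = len(val_history)
    if epoch <= warmup:
        return False                      # never stop during warmup
    # Index of the first epoch that reached the best validation accuracy.
    best_epoch = max(range(epoch), key=lambda i: val_history[i])
    return (epoch - 1 - best_epoch) >= patience

# Usage inside a training loop (train_one_epoch and validate are placeholders):
# history = []
# for epoch in range(max_epochs):
#     train_one_epoch()
#     history.append(validate())
#     if should_stop(history):
#         break
```

With this reading, training always runs for at least `warmup` epochs, and the epoch budget acts as an upper bound regardless of the patience counter.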

C.3 SEARCHING SPACE FOR BASELINES HYPER-PARAMETERS

For WGNN, we tune the following hyper-parameters by grid search.
• Learning rate: {0.01, 0.05, 0.1}
• Weight decay: {0.0005, 0.001, 0.005, 0.01, 0.02, 0.05}
• |S| for homophilic graphs: {0, 10, 50, 100, 200, 500, 1000, 2000}
• |S| for heterophilic graphs: {500, 600, 700, 800, 900, 1000, 1500, 2000}
• Suggested |S|: the whole hundred from the 94% singular values
• Hidden size: 64

Table 8 presents the hyper-parameters searched for the baselines used in our experiments. We prioritize their originally released code repositories, and the ranges of the tuned parameters follow their papers.
• MLP, GCN, GAT, GraphSAGE, APPNP, and GCNII are implemented with PyG.
• ChebNet is implemented according to the code style of BernNet/GPRGNN.
• GPRGNN is implemented according to its original code repository.
• BernNet is implemented according to its original code repository.
• ADA-UGNN is implemented according to its original code repository.
• GNN-HF/LF are implemented according to their original code repository.
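The grid above can be enumerated with the standard library. The sketch below uses the heterophilic range for |S| and a caller-supplied `evaluate` function that is assumed to train one configuration and return its validation accuracy; the key names and helper functions are illustrative, not those of the released code.

```python
from itertools import product

# Search space copied from the list above (heterophilic |S| range).
search_space = {
    "lr": [0.01, 0.05, 0.1],
    "weight_decay": [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05],
    "num_components": [500, 600, 700, 800, 900, 1000, 1500, 2000],  # |S|
    "hidden": [64],
}

def grid(space):
    """Yield every configuration in the Cartesian product of the value lists."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(space, evaluate):
    """Return the configuration with the highest score under `evaluate`."""
    return max(grid(space), key=evaluate)

# Toy stand-in for a training run, used only to show the call pattern.
best = grid_search(search_space, lambda cfg: -abs(cfg["lr"] - 0.05) - cfg["weight_decay"])
```

The heterophilic grid contains 3 × 6 × 8 × 1 = 144 configurations, so an exhaustive search stays cheap when the feature space is pre-computed once per value of |S|.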

C.4 OTHER TRANSFORMATIONS FOR COMPACTING THE GRAPH STRUCTURE INFORMATION

We append other possible transformations to extract compact information from the normalized adjacency matrix Â. In detail, we compare:
• KernelPCA: a PCA method using a non-linear kernel, where the radial basis function (RBF) is used.
• FastICA: a fast version of independent component analysis, which is a linear method.
• IsoMap: a nonlinear dimensionality reduction method based on spectral theory.
All of them can be easily implemented with the sklearn package. As shown in 

C.5 RESULTS OF WGNN USING DIFFERENT POLYNOMIAL ORDERS

In the main text, we restrict the polynomial order K to at most three based on empirical observations, e.g., Figure 5. Here, we provide more comprehensive results for different choices of K. Table 11 indicates that a better K may be found in a wider range, while the improvement is possibly marginal.

, 20, 30, 40, 50, 60, 70, 80} and compare with a representative method, GCNII, that overcomes over-smoothing. The results in Table 12 verify the effectiveness of WGNN in avoiding over-smoothing: it outperforms residual connections and DropEdge and achieves results comparable with GCNII. Note that all the hyper-parameters, including Dropout, DropEdge, α, and θ for SkipConnection and GCNII, are searched within {0.2, 0.5, 0.8}.

C.7 MORE COMPREHENSIVE STUDY OF SVD

Complementing the main text, we append the results on CiteSeer and Squirrel to better verify the importance of principal components for extracting information from the adjacency matrix into $S_j$. As shown in Figure 8, the results are consistent with those shared in Section 4. For Cora and CiteSeer, 1) the distribution of the singular values is smoother, and 2) the information from the graph structure is less important than the node features and their interaction; therefore, the performance changes more stably as more principal components are introduced. On the other hand, for Chameleon and Squirrel, 1) the distribution of the singular values is more concentrated, and 2) the graph structure is more important information, resulting in a performance trend that first increases and then decreases. In general, we can achieve satisfactory results on both kinds of datasets when 94% of the principal components are included.

Here, we offer more intuition about using principal components. The principal components project and summarize a larger set of correlated variables into a smaller number of more easily interpretable axes of variation. This is ideal for $S_j$ to embody the graph structure information from the adjacency matrix, because the adjacency matrix is sparse and high-dimensional while the nodes are topologically correlated. However, the different components need to be distinct from each other to be interpretable; otherwise, they only represent random directions, which leads to noise.
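A possible way to extract the principal components and pick their number is sketched below. It rests on two assumptions that the text does not pin down: the 94% threshold is read as the share of the cumulative sum of singular values, and the components are taken as the left singular vectors scaled by their singular values. A dense SVD and a toy ring graph are used for brevity; the function name `principal_components` is ours.

```python
import numpy as np

def principal_components(A_hat, energy=0.94):
    """Return (S, r): the leading r scaled singular directions of A_hat.

    r is the smallest number of components whose singular values sum to at
    least `energy` of the total (one possible reading of the 94% rule).
    """
    U, s, _ = np.linalg.svd(A_hat)
    share = np.cumsum(s) / s.sum()
    r = min(int(np.searchsorted(share, energy)) + 1, len(s))
    return U[:, :r] * s[:r], r

n = 20
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1.0   # toy ring graph
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))

S, r = principal_components(A_hat)    # S has shape (n, r) and is appended to the feature space
```

Since `S` depends only on the graph, it can be computed once in pre-processing, which matches the separate SVD-time column reported for WGNN.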

