SIMPLE SPECTRAL GRAPH CONVOLUTION FROM AN OPTIMIZATION PERSPECTIVE

Abstract

Recent studies on SGC, PageRank and S²GC have demonstrated that several graph diffusion techniques are straightforward, fast, and effective for tasks in the graph domain such as node classification. Even though these techniques do not need labels, they can nevertheless produce more discriminative features than the raw attributes for downstream tasks with different classifiers. These methods are data-independent and thus rely primarily on empirical parameters over polynomial bases (e.g., Monomial and Chebyshev), which ignore the homophily of graphs and the attribute distribution. Due to their low-pass filtering, they are ineffective on heterophilous graphs. Although there are many approaches focusing on GNNs for heterophilous graphs, these approaches depend on label information to learn model parameters. In this paper, we study the question: are labels a necessity for GNNs on heterophilous graphs? Based on this question, we propose a framework of self-representation on graphs related to the least squares problem. Specifically, we use the Generalized Minimal RESidual (GMRES) method, which finds the least squares solution over Krylov subspaces. Our theoretical analysis shows that graph convolution yields better features even without label information. The proposed method, like previous data-independent methods, is not a deep model and is therefore fast, scalable, and simple. We also show performance guarantees for models on real and synthetic data. Empirically, on a benchmark of real-world datasets, our method is competitive with existing deep models for node classification.

1. INTRODUCTION

With the development of deep learning, CNNs have been widely used in many applications. A convolutional neural network (CNN) exploits the shift-invariance, local connectivity, and compositionality of image data. As a result, CNNs extract meaningful local features for various image-related problems. Although CNNs effectively capture hidden patterns on the Euclidean grid, there is an increasing number of applications where data is represented on a non-Euclidean grid, e.g., in the graph domain. GNNs redefine convolution on the graph in two different ways: spatial and spectral. Spatial-based methods decompose the convolution operation into an aggregation function and a transformation function. The aggregation function aggregates neighbourhood node information, e.g., by the mean function, which is somewhat similar to the box filter in traditional image processing. Some representative methods in this category are Message Passing Neural Networks (MPNN (Gilmer et al., 2017)), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2017), etc. Spectral methods are based on the Graph Fourier Transform (GFT). They try to learn a filtering function on the eigenvalues (or a graph kernel, heat kernel, etc.). These methods usually use approximations to reduce the amount of computation; e.g., Chebyshev and Monomial polynomials are used by ChebNet (Defferrard et al., 2016), GDC (Klicpera et al., 2019), SGC (Wu et al., 2019), and S²GC (Zhu & Koniusz, 2021). Although spatial and spectral methods effectively extend the convolution operator to the graph domain, they usually suffer from oversmoothing on heterophilous graphs because they follow the homophily assumption, which severely affects the node classification task, as shown in Figure 1.

Figure 1: Results on the contextual SBM using SGC, S²GC and PR (PageRank) with number of hops K = 2, 8. 'Raw' shows the error when no filtering method is applied.
All these methods only work well in homophilous networks. However, graphs are not always homophilic: they show the opposite property in some connected node groups. This makes it harder for existing homophilic GNNs to learn from general graph-structured data, which leads to a significant drop in performance on heterophilous graphs. There are many GNNs for graphs with heterophily. Their motivation mainly focuses on improving feature propagation and feature transformation. Non-local neighbor extension is usually used to incorporate high-order neighbor information (Abu-El-Haija et al., 2019; Zhu et al., 2020; Jin et al., 2021) or to discover potential neighbours (Liu et al., 2021; Yang et al., 2021; Zheng et al., 2022). Adaptive message aggregation is a good way to reduce the effect of heterophilous edges (Veličković et al., 2017; Suresh et al., 2021). Inter-layer combination provides a more flexible way to learn graph convolutions (Xu et al., 2018; Zhu et al., 2020; Chien et al., 2021). However, all of these approaches are designed for semi-supervised node classification, which is usually transductive (labels are available for training). In this paper, we first review the connection between GNNs and Label Propagation (LP) with Laplacian regularization (Zhou et al., 2003). The closed-form solution depends only on a parameter balancing smoothing and fitting error. This results in low-pass filter methods for homophilous graphs, such as PageRank and S²GC, which cannot work well on heterophilous graphs. Based on the Taylor expansion of the closed-form solution, we reformulate label propagation with Laplacian regularization as residual minimization in a Krylov subspace. We further generalize the residual minimization in the Krylov subspace into a more general polynomial approximation, and then discuss other possible bases such as Chebyshev polynomials.
In theoretical analysis, we explore whether high-order (second-order in this paper) or multi-scale graph convolutions can improve performance given raw attributes without labels. In experiments with synthetic data, we show performance in line with our theoretical expectations. On real-world benchmarks, our method is competitive with other graph convolution techniques on homophilous graphs and outperforms them (even some GNN methods with transductive learning) on heterophilous graphs. Our contributions are: 1.) We reveal that labels are not necessary for graph neural networks on heterophilous graphs. The linear graph convolution is powerful on both heterophilous and homophilous graphs, and outperforms GNNs designed for heterophilous graphs on semi-supervised node classification. 2.) We propose a framework of Feature (or Label) Propagation that parameterizes spectral graph convolution as residual minimization in a Krylov subspace. We further reformulate the residual minimization problem as polynomial approximation, which can yield Chebyshev and Bernstein bases to overcome the Runge phenomenon. 3.) In theory, we prove that second-order graph convolution is better than first-order graph convolution on heterophilous graphs, and that multi-scale (first- and second-order) convolution can provide better results for some combinations of parameters. 4.) Compared with label-dependent GNNs for heterophily, our method is competitive on real-world benchmarks, and it outperforms other low-pass graph convolutions that do not require learning.

2. RELATED WORK

Data-independent Spectral Graph Convolution. Hammond et al. (2011) introduced Chebyshev polynomials to estimate wavelets in graph signal processing. Based on this polynomial approximation, ChebNet was proposed to combine neural networks with the graph convolution operator.
Unlike ChebNet, which uses Chebyshev polynomials, the Diffusion Convolutional Neural Network (DCNN (Atwood & Towsley, 2016)) uses powers of the normalized adjacency matrix as polynomial bases to approximate graph filters. Simplifying Graph Convolution (SGC (Wu et al., 2019)) is a special case of DCNN: only the k-th power of the normalized adjacency matrix is kept. Graph Diffusion Convolution (GDC (Klicpera et al., 2019)) presented two other special cases based on the normalized adjacency matrix: the heat kernel and the PageRank kernel. It should be noted that GDC re-normalizes the given kernel like a normalized adjacency matrix; thus, in this paper, we use the PageRank kernel as in APPNP (Klicpera et al., 2018). Simple Spectral Graph Convolution (S²GC (Zhu & Koniusz, 2021)) is based on a modified Markov diffusion kernel (Fouss et al., 2012). Although these methods are effective for node classification, their fixed parameters ignore the homophily/heterophily of the graph and the distribution of node attributes across dimensions. These drawbacks limit such methods on heterophilous graphs.


Learnable graph convolutions. Chebyshev polynomials are used by ChebNet (Defferrard et al., 2016) to approximate graph convolutions. In theory, one can learn any kind of filter (Balcilar et al., 2021). With Cayley polynomials, CayleyNet (Levie et al., 2018) learns graph convolutions and produces a variety of graph filters. Low-pass or high-pass filters can be derived from graph convolutions; GPR-GNN (Chien et al., 2021) employs the Monomial basis to approximate these filters. Through the family of Auto-Regressive Moving Average filters (Narang et al., 2013), ARMA (Bianchi et al., 2021) learns rational graph convolutions. BernNet (He et al., 2021) approximates graph convolutions and learns graph filters using the Bernstein basis. Although these methods achieve good performance on different datasets, the learnable parameters of the graph convolution kernel depend only on the label information, which leads to overfitting when labels are too few or unbalanced.

Graph Neural Networks for Heterophily. Graphs are not always homophilic: the opposite holds in some connected node groups. This makes it harder for existing homophilic GNNs to learn from general graph-structured data, which leads to a significant drop in performance on heterophilous graphs. Increasing Homophilic Edges (HoE) and decreasing Heterophilic Edges (HeE) are the two main ways to improve feature propagation. HoE refers to edges connecting two nodes of the same class, while HeE refers to edges connecting two nodes of different classes. Strategies for increasing HoE include using two-hop (or higher) neighbours and discovering new neighbours by feature similarity. Decreasing HeE assigns weights to edges to reduce the impact of potential heterophilous edges. At each message-passing step, H²GCN (Zhu et al., 2020) aggregates data from higher-order neighbors.
To offer a theoretical guarantee, H²GCN shows that when one-hop neighbors' labels are conditionally independent, two-hop neighbors tend to include more nodes of the same class. Generalised PageRank is combined with graph convolutions in GPR-GNN (Chien et al., 2021) to jointly maximise the extraction of node features and topological information for both homophilous and heterophilous graphs. These methods are based on transductive learning: without label information they cannot learn a useful model. In contrast, graph convolution based methods such as SGC, S²GC and PageRank do not need labels at all.

3. METHODS

In this section, we review the classical Label Propagation with Laplacian Regularization (Zhou et al., 2003) and show the relationship between its iterative solution and existing GNNs. Then, by analyzing the closed form of LP, we formulate label propagation as residual minimization in a Krylov subspace to learn the parameters for graph convolution. To overcome the Runge phenomenon, we reformulate the residual minimization problem in the Krylov subspace as a more general polynomial approximation problem, which allows other kinds of bases such as Chebyshev and Bernstein polynomials.

3.1. PRELIMINARIES

Let $G = (V, E)$ be a simple, connected, undirected graph with $n$ nodes and $m$ edges. We use $\{1, \cdots, n\}$ to denote the node indices of $G$, whereas $d_j$ denotes the degree of node $j$ in $G$. Let $A$ be the adjacency matrix and $D$ the diagonal degree matrix. Let $\bar{A} = A + I_n$ denote the adjacency matrix with added self-loops and $\bar{D}$ the corresponding diagonal degree matrix, where $I_n \in \mathbb{R}^{n \times n}$ is an identity matrix. Finally, let $X \in \mathbb{R}^{n \times d}$ denote the node feature matrix, where each node $v$ is associated with a $d$-dimensional feature vector $x_v$. To facilitate the definition of dimension-independent objective functions, we use $y \in \mathbb{R}^{n \times 1}$ to denote 1-D node features.

Label Propagation with Laplacian Regularization. A classical regularization framework for label (or feature) propagation (Zhou et al., 2003) includes two components: a least-squares fitting term and a Laplacian-regularization smoothing term. The fitting term keeps the target close to the original signal; the smoothing term encourages connected elements to take similar values. The loss function associated with $f \in \mathbb{R}^{n \times 1}$ is defined as

$$E(f) = \frac{1}{2}\Bigg(\sum_{i,j=1}^{n} A_{ij}\Bigg\|\frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}}\Bigg\|^2 + \mu\sum_{i=1}^{n}\|f_i - y_i\|^2\Bigg), \qquad (1)$$

where $\mu > 0$ is the regularization parameter. Differentiating $E(f)$ with respect to $f$ and writing $\hat{A} = D^{-1/2} A D^{-1/2}$, we have

$$\frac{\partial E}{\partial f}\Bigg|_{f=f^*} = f^* - \hat{A}f^* + \mu\,(f^* - y) = 0, \qquad (2)$$

whose closed-form solution is $f^* = (1-\alpha)(I - \alpha\hat{A})^{-1}y$ with $\alpha = \frac{1}{1+\mu}$. Although this closed form based on Eq. 2 exists, for large graphs the inverse of $I - \alpha\hat{A}$ is not practically feasible to compute, and iterative approximations are preferable instead. To this end, we may set $f^{(0)} = y$, and then proceed to iteratively descend in the direction of the negative gradient:

$$f^{(t+1)} = \alpha\hat{A}f^{(t)} + (1-\alpha)f^{(0)}, \qquad (3)$$

where $\alpha = \frac{1}{1+\mu}$. If we define $y = f(X;\theta)$ and replace $\hat{A}$ with $\bar{D}^{-1/2}\bar{A}\bar{D}^{-1/2}$, Eq. 3 equates to principled GNN layers, such as those used by GCN (Kipf & Welling, 2016) and APPNP (Klicpera et al., 2018).
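As an illustration, the iteration of Eq. 3 and its agreement with the closed-form solution can be checked numerically. The following is a minimal numpy sketch on a hypothetical 4-node path graph; the sizes, the signal $y$, and $\alpha = 0.9$ are illustrative choices, not values from the paper:

```python
import numpy as np

def propagate(A_hat, y, alpha=0.9, iters=300):
    """Gradient-descent form of Laplacian-regularized label propagation:
    f <- alpha * A_hat @ f + (1 - alpha) * y  (Eq. 3)."""
    f = y.copy()
    for _ in range(iters):
        f = alpha * (A_hat @ f) + (1 - alpha) * y
    return f

# Hypothetical 4-node path graph, self-loops added, symmetric normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float) + np.eye(4)
d = A.sum(1)
A_hat = A / np.sqrt(np.outer(d, d))     # D^{-1/2} (A + I) D^{-1/2}
y = np.array([1.0, 0.0, 0.0, -1.0])

f = propagate(A_hat, y)
# The iterates converge to the closed form (1 - alpha)(I - alpha A_hat)^{-1} y.
closed = 0.1 * np.linalg.solve(np.eye(4) - 0.9 * A_hat, y)
```

Since the spectral radius of the normalized matrix is at most 1, the iteration converges geometrically for $\alpha < 1$, so a few hundred steps match the closed form to high precision while avoiding the explicit inverse.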

3.2. LABEL PROPAGATION WITH RESIDUAL MINIMIZING OVER KRYLOV SUBSPACE

In this section, we reformulate the closed-form solution of label propagation with Laplacian regularization (Zhou et al., 2003) into a more general model based on residual minimization over a Krylov subspace, which solves for the parameters of the graph convolution. Recall the closed-form solution of Eq. 1: $f = (1-\alpha)(I - \alpha\hat{A})^{-1}y$. Substituting this closed form into the fitting term of Eq. 1 and applying the Taylor (Neumann) expansion of $(I - \alpha\hat{A})^{-1}$, we have

$$\min_f \|y - f\|^2 = \min_\alpha \big\|y - (1-\alpha)(I - \alpha\hat{A})^{-1}y\big\|^2 = \min_{w} \Bigg\|\alpha y - \hat{A}\sum_{i=0}^{r-1} w_i\hat{A}^i y\Bigg\|^2, \qquad (4)$$

where $w_i = (1-\alpha)\alpha^{i+1}$. We can rescale $y$ by $\alpha = 1-(1-\alpha)$ to eliminate the leading coefficient. Please note that $r < \mathrm{rank}(\hat{A})$. We then obtain a more compact form:

$$\min_{w\in\mathbb{R}^r} \Bigg\|y - \hat{A}\sum_{i=0}^{r-1} w_i\hat{A}^i y\Bigg\|^2 = \min_{x\in\mathcal{K}_r(\hat{A},y)} \|y - \hat{A}x\|^2, \qquad (5)$$

where the set of vectors $\{y, \hat{A}y, \hat{A}^2y, \ldots, \hat{A}^{r-1}y\}$ is called the order-$r$ Krylov matrix, and the subspace $\mathcal{K}_r(\hat{A},y)$ spanned by these vectors is called the order-$r$ Krylov subspace. Based on this, we obtain a denoised signal as $f = \hat{A}x$.

Relation to GPR-GNN (Chien et al., 2021). GPR-GNN first extracts hidden state features with an MLP for each node and then uses Generalized PageRank (GPR) to propagate them. The GPR-GNN process can be described mathematically as

$$P = \mathrm{softmax}(Z), \quad Z = \sum_{k=0}^{K}\gamma_k H^{(k)}, \quad H^{(k)} = \tilde{A}H^{(k-1)}, \quad H^{(0)} = f(X;\theta),$$

where $\mathrm{softmax}(Z_{i,:})_j = e^{Z_{ij}}/\sum_{c'=1}^{c} e^{Z_{ic'}}$. Although Generalized PageRank looks similar in purpose to our approach, we note three key differences: (1) GPR learns generalized graph convolutions on logits rather than features (or attributes). (2) The parameters in GPR depend only on labels rather than on internal information of the graph and the corresponding attributes. (3) There is no globally optimal solution for GPR-GNN because of the feature extraction with an MLP.
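A minimal sketch of the residual minimization of Eq. 5, using an explicit Krylov basis and ordinary least squares (GMRES solves the same problem more stably via an orthonormal basis); the toy graph and signal below are hypothetical:

```python
import numpy as np

def krylov_filter(A_hat, y, r):
    """Solve min_x ||y - A_hat @ x||_2 over K_r(A_hat, y)
    = span{y, A_hat y, ..., A_hat^{r-1} y} by least squares (Eq. 5)."""
    K = np.empty((len(y), r))
    K[:, 0] = y
    for i in range(1, r):
        K[:, i] = A_hat @ K[:, i - 1]
    AK = A_hat @ K                                # columns span A_hat * K_r
    w, *_ = np.linalg.lstsq(AK, y, rcond=None)    # coefficients w of Eq. 5
    return AK @ w                                 # denoised signal f = A_hat x

# Hypothetical toy graph: 4-node path with self-loops, symmetric normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float) + np.eye(4)
d = A.sum(1)
A_hat = A / np.sqrt(np.outer(d, d))
y = np.array([1.0, -0.5, 0.25, -1.0])

f1 = krylov_filter(A_hat, y, 1)
f3 = krylov_filter(A_hat, y, 3)
```

Because the Krylov subspaces are nested, enlarging $r$ can only reduce the residual $\|y - \hat{A}x\|_2$, which is the monotonicity that GMRES exploits.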

3.3. POLYNOMIAL APPROXIMATION WITH CONSTRAINTS

Eq. 5 also solves an approximation problem, the only difference being that the space of polynomials is now $P_r = \{\text{polynomials } p \text{ of degree} \le r \text{ with } p(0) = 1\}$. Expressed in terms of polynomial coefficients, this is the constraint $w_0 = 1$. Here is how Eq. 5 reduces to the polynomial approximation in $P_r$. The iterate $x$ can be written as $x = q_{r-1}(\hat{A})y$, where $q_{r-1}$ is a polynomial of degree $r-1$ whose coefficients are the entries of the vector $w$ of Eq. 5. The corresponding residual $r = y - \hat{A}x$ is $r = (I - \hat{A}q_{r-1}(\hat{A}))y$. If we define the polynomial $p_r(z) = 1 - zq_{r-1}(z)$, we have $r = p_r(\hat{A})y$ for some polynomial $p_r \in P_r$. Thus, we can reformulate Eq. 5 as

$$\min_{x\in\mathcal{K}_r(\hat{A},y)} \|y - \hat{A}x\|^2 = \min_{p_r\in P_r,\,p_r(0)=1} \|p_r(\hat{A})y\|^2, \qquad (6)$$

where $P_r$ is the set of all polynomials $p_r$ of degree at most $r$ such that $p_r(0) = 1$. Chebyshev polynomials are frequently employed in digital signal processing and graph signal filtering to approximate a variety of functions. Analytic functions can be approximated by a minimax polynomial using truncated Chebyshev expansions. Consequently, a truncated expansion in terms of Chebyshev polynomials can minimize the loss function as follows:

$$\min_{w\in\mathbb{R}^r} \Bigg\|\sum_{i=0}^{r-1} w_i T_i(\tilde{L})y\Bigg\|^2, \qquad (7)$$

where $\tilde{L} = 2L/\lambda_{\max} - I$ denotes the scaled Laplacian matrix, $\lambda_{\max}$ is the largest eigenvalue of $L$, and $w_i$ are the Chebyshev coefficients. The Chebyshev polynomials can be defined recursively as $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. Although Chebyshev polynomials have many useful properties, such as relieving the Runge phenomenon, they underperform in GNNs; how to solve this problem is beyond the scope of this paper.
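The Chebyshev basis vectors $T_i(\tilde{L})y$ in Eq. 7 can be built with the three-term recurrence. Below is a small numpy sketch; the example graph is hypothetical:

```python
import numpy as np

def cheb_basis(L_scaled, y, r):
    """Columns [T_0(L~)y, ..., T_{r-1}(L~)y] via the recurrence
    T_k = 2 L~ T_{k-1} - T_{k-2}, with T_0(x) = 1 and T_1(x) = x."""
    cols = [y, L_scaled @ y]
    for _ in range(2, r):
        cols.append(2 * (L_scaled @ cols[-1]) - cols[-2])
    return np.stack(cols[:r], axis=1)

# Hypothetical example: build L~ = 2L / lambda_max - I from a 3-node path
# graph, so the spectrum lies in [-1, 1] as the recurrence expects.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
lam_max = np.linalg.eigvalsh(L)[-1]
L_s = 2 * L / lam_max - np.eye(3)
B = cheb_basis(L_s, np.ones(3), 4)
```

The coefficients $w$ of Eq. 7 can then be fitted over the columns of this basis by the same least-squares machinery as in the Krylov formulation.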

3.4. THEORETICAL ANALYSIS

We study our method under the contextual stochastic block model (cSBM) (Deshpande et al., 2018), a generative model for random graphs. For the theoretical analysis, we consider a cSBM with two classes, $c_0$ and $c_1$. The generated graphs consist of two disjoint node sets, $C_0$ and $C_1$, representing the two classes. An intra-class probability $p$ and an inter-class probability $q$ are used to produce edges: an edge is constructed to connect any two nodes with probability $p$ if they belong to the same class, and with probability $q$ otherwise. For each node $i$, its initial associated features $x_i \in \mathbb{R}^l$ are sampled from a Gaussian distribution $x_i \sim N(\mu, \sigma I)$, where $\mu = \mu_k \in \mathbb{R}^l$ for $i \in C_k$ with $k \in \{0, 1\}$. Hence, we denote a graph generated from such a cSBM model by $G \sim \mathrm{cSBM}(\mu_0, \mu_1, p, q)$, and the features of node $i$ obtained after first-order and second-order graph convolution by $h_i^1$ and $h_i^2$, respectively. Ma et al. (2022) pose a very interesting question: 'Is homophily a necessity for graph neural networks?' They prove a very useful property: first-order graph convolution provides better features if $\deg(i) > (p+q)^2/(p-q)^2$ is met, which demonstrates that both the node degree $\deg(i)$ and the distinguishability (measured by the Euclidean distance) of the neighborhood distributions affect graph convolution performance. This condition often holds in practice. Thus, we are interested in whether higher-order graph convolution still enjoys such a property. As the proposed method can be regarded as a multi-scale graph convolution, it is also important to know whether there exist parameters that make the multi-scale graph convolution better than a single graph convolution.
To better evaluate the effectiveness of our method, we study the largest-margin linear classifiers based on $\{x_i, i \in V\}$, $\{h_i^1, i \in V\}$ and $\{h_i^2, i \in V\}$, and compare their performance. We define the relations among $x_i$, $h_i^1$ and $h_i^2$ as follows:

$$h_i^1 = \frac{1}{\deg(i)}\sum_{j\in N(i)} x_j, \qquad h_i^2 = \frac{1}{\deg(i)}\sum_{j\in N(i)} h_j^1,$$

where $N(i)$ denotes the neighbors of node $i$. For a graph $G \sim \mathrm{cSBM}(\mu_0, \mu_1, p, q)$, we can approximately regard each node $i$'s neighbor labels as independently sampled from a neighborhood distribution $D_{y_i}$, where $y_i$ denotes the label of node $i$. Specifically, the neighborhood distributions corresponding to $c_0$ and $c_1$ are $D_{c_0} = \big(\frac{p}{p+q}, \frac{q}{p+q}\big)$ and $D_{c_1} = \big(\frac{q}{p+q}, \frac{p}{p+q}\big)$, respectively. Based on the neighborhood distributions, the features obtained from graph convolution follow the Gaussian distributions:

$$h_i^1 \sim N\Big(\frac{p\mu_0 + q\mu_1}{p+q}, \frac{I}{\deg(i)}\Big), \quad h_i^2 \sim N\Big(\frac{(p^2+q^2)\mu_0 + 2pq\,\mu_1}{(p+q)^2}, \frac{I}{\deg(i)^2}\Big), \quad \text{for } i \in C_0,$$
$$h_i^1 \sim N\Big(\frac{q\mu_0 + p\mu_1}{p+q}, \frac{I}{\deg(i)}\Big), \quad h_i^2 \sim N\Big(\frac{2pq\,\mu_0 + (p^2+q^2)\mu_1}{(p+q)^2}, \frac{I}{\deg(i)^2}\Big), \quad \text{for } i \in C_1. \qquad (10)$$

Proposition 1 (Ma et al., 2022). $(E_{c_0}[x_i], E_{c_1}[x_i])$ and $(E_{c_0}[h_i], E_{c_1}[h_i])$ share the same middle point, and $E_{c_0}[x_i] - E_{c_1}[x_i]$ and $E_{c_0}[h_i] - E_{c_1}[h_i]$ share the same direction. Specifically, the middle point $m$ and the shared direction $w$ are $m = (\mu_0 + \mu_1)/2$ and $w = (\mu_0 - \mu_1)/\|\mu_0 - \mu_1\|_2$.

This proposition follows from direct calculations. Given that the feature distributions of the two classes are symmetric to each other (for both $x_i$ and $h_i$), the hyperplane that is orthogonal to $w$ and passes through $m$ defines the decision boundary of the optimal linear classifier for both types of features. We denote this decision boundary by $P = \{x \mid w^\top x = w^\top(\mu_0 + \mu_1)/2\}$.
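The class-conditional means in Eq. 10 can be checked empirically by sampling a cSBM graph. The sketch below uses hypothetical sizes and edge probabilities with 1-D features ($\mu_0 = 1$, $\mu_1 = -1$), so the first-order class-0 mean should approach $(p-q)/(p+q)$ and the second-order mean $(p-q)^2/(p+q)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2000, 0.005, 0.015                 # heterophilous: q > p (hypothetical)
labels = np.repeat([0, 1], n // 2)

# Sample the SBM adjacency: intra-class prob p, inter-class prob q.
same = labels[:, None] == labels[None, :]
A = np.triu(rng.random((n, n)) < np.where(same, p, q), 1).astype(float)
A = A + A.T                                  # symmetric, no self-loops

X = np.where(labels == 0, 1.0, -1.0) + rng.standard_normal(n)
deg = np.maximum(A.sum(1), 1)
H1 = A @ X / deg                             # first-order mean aggregation h^1
H2 = A @ H1 / deg                            # second-order aggregation h^2

m1 = H1[labels == 0].mean()                  # should approach (p-q)/(p+q) = -0.5
m2 = H2[labels == 0].mean()                  # should approach (p-q)^2/(p+q)^2 = 0.25
```

Note how first-order convolution flips the sign of the class-0 mean on this heterophilous graph, while second-order convolution restores it, in line with Eq. 10.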
Next, to evaluate how higher-order graph convolution affects the classification performance, we compare the probability that this linear classifier misclassifies a given node based on the features after first-order graph convolution and after second-order graph convolution. We summarize the result in the following theorem.

Theorem 3.1. Consider a graph $G \sim \mathrm{cSBM}(\mu_0, \mu_1, p, q)$. For any node $i$ in this graph, the linear classifier defined by the decision boundary $P$ has a lower or equal probability of misclassifying $h_i^2$ than $h_i^1$ when $\deg(i) > (p+q)^2/(p-q)^2$.

Proof. We prove this only for nodes from class $c_0$, since the case of nodes from class $c_1$ is symmetric and the proof then follows. For a node $i \in C_0$, we have

$$P\big(h_i^1 \text{ is misclassified}\big) = P\big(w^\top h_i^1 + b \le 0\big), \qquad P\big(h_i^2 \text{ is misclassified}\big) = P\big(w^\top h_i^2 + b \le 0\big), \qquad (11)$$

where $w$ and $b = -w^\top(\mu_0 + \mu_1)/2$ are the parameters of the decision boundary $P$. We have

$$P\big(w^\top h_i^1 + b \le 0\big) = P\big(w^\top \sqrt{\deg(i)}\,h_i^1 + \sqrt{\deg(i)}\,b \le 0\big), \qquad P\big(w^\top h_i^2 + b \le 0\big) = P\big(w^\top \deg(i)\,h_i^2 + \deg(i)\,b \le 0\big). \qquad (12)$$

We denote the scaled versions of $h_i^1$ and $h_i^2$ by $h_i' = \sqrt{\deg(i)}\,h_i^1$ and $h_i'' = \deg(i)\,h_i^2$, respectively. Then $h_i'$ and $h_i''$ follow

$$h_i' = \sqrt{\deg(i)}\,h_i^1 \sim N\Big(\frac{\sqrt{\deg(i)}\,(p\mu_0 + q\mu_1)}{p+q}, I\Big), \qquad h_i'' = \deg(i)\,h_i^2 \sim N\Big(\frac{\deg(i)\big((p^2+q^2)\mu_0 + 2pq\,\mu_1\big)}{(p+q)^2}, I\Big), \quad \text{for } i \in C_0. \qquad (13)$$

Now, since $h_i'$ and $h_i''$ share the same variance, to compare the misclassification probabilities we only need to compare the distances from their expected values to their corresponding decision boundaries. Specifically, the two distances are

$$\mathrm{dis}(h_i') = \frac{\sqrt{\deg(i)}\,(p-q)}{p+q}\cdot\frac{\|\mu_0 - \mu_1\|_2}{2}, \qquad \mathrm{dis}(h_i'') = \frac{\deg(i)\,(p-q)^2}{(p+q)^2}\cdot\frac{\|\mu_0 - \mu_1\|_2}{2}. \qquad (14)$$

The larger the distance, the smaller the misclassification probability. Hence, when $\mathrm{dis}(h_i') < \mathrm{dis}(h_i'')$, $h_i''$ has a lower probability of being misclassified than $h_i'$ and $x_i$.
Comparing the two distances, we conclude that when $\deg(i) > (p+q)^2/(p-q)^2$, $h_i''$ has a lower probability of being misclassified than $h_i'$.

Figure 3: Results on the contextual SBM using graph convolutions of first order (GC), second order (GC2), third order (GC3) and fourth order (GC4). Two different combinations of graph convolutions are considered: the difference between the first two orders (GCdiff1) and the difference between even orders and odd orders (GCdiff2).

Theorem 3.2. When $p < q$ (heterophilous graphs), there exist parameters $w_1 < 0$ and $w_2 > 0$ such that $w_1 h_i^1 + w_2 h_i^2$ has a lower probability of being misclassified than $h_i^2$ when $\frac{w_1^2}{(1-w_2)^2} > \frac{\deg(i)(p-q)^2}{(p+q)^2}$. Please refer to the appendix for the proof.
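The degree threshold of Theorem 3.1 can be sanity-checked by evaluating the two distances of Eq. 14 directly; the edge probabilities and class-mean gap below are hypothetical:

```python
import numpy as np

def dis1(deg, p, q, gap):
    """Signed distance for h' = sqrt(deg) * h^1 (first term of Eq. 14)."""
    return np.sqrt(deg) * (p - q) / (p + q) * gap / 2

def dis2(deg, p, q, gap):
    """Distance for h'' = deg * h^2 (second term of Eq. 14)."""
    return deg * (p - q) ** 2 / (p + q) ** 2 * gap / 2

# Hypothetical homophilous example so the signed distances are both positive.
p, q, gap = 0.015, 0.005, 2.0            # gap = ||mu_0 - mu_1||_2
threshold = (p + q) ** 2 / (p - q) ** 2  # here: (0.02/0.01)^2 = 4

hi = dis2(10, p, q, gap) > dis1(10, p, q, gap)  # deg = 10 > threshold
lo = dis2(2, p, q, gap) < dis1(2, p, q, gap)    # deg = 2 < threshold
```

For degrees above the threshold the second-order distance dominates; below it, first-order convolution keeps the larger margin, matching the theorem.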

4. EXPERIMENTS

4.1. RESULTS ON CSBM SYNTHETIC DATA

Synthetic data. To test the ability of graph convolution based methods under arbitrary levels of homophily and heterophily, we use cSBMs (Deshpande et al., 2018) to generate synthetic graphs. We consider the case of two equal-size classes and take a cSBM with n = 1000 nodes, two communities C_0 and C_1, feature means µ_0 = 1 and µ_1 = −1, and noise variance σ = 1. There are thus 500 nodes in each community, which we refer to as "positive" and "negative", respectively. Standard normal noise is applied to the feature means of the nodes: 1 for the "positive" community and −1 for the "negative" community. With the expected degree of all nodes set to 10 (i.e., n(p + q)/2 = 10), we create various graphs by varying the intra- and inter-community edge probabilities p and q from p > q (highly homophilous, in that "positive" nodes are much more likely to connect to other "positive" nodes than to "negative" nodes) to p < q (highly heterophilous, in that "positive" nodes are much more likely to connect to "negative" nodes). We compare our method with four baselines: Raw, SGC (Wu et al., 2019), S²GC (Zhu & Koniusz, 2021), and PageRank (Page et al., 1999; Klicpera et al., 2018). At the same time, we also evaluate first-order graph convolution (GC), second-order graph convolution (GC2), the sum of graph convolutions, and the difference of graph convolutions. As shown in Figure 1a, we found that some low-pass filters (SGC and APPNP) can have a positive effect on some heterophilous graphs ((p−q)/(p+q) ≈ −1) with a low number of convolutions (second-order). However, this phenomenon rapidly disappears as the number of convolutions increases, as shown in Figure 1b: when K = 8, all low-pass filters perform much worse than the original features on synthetic heterophilous graphs. In our theoretical analysis, we prove that second-order graph convolution can provide more discriminative features than first-order graph convolution and the raw features.
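A minimal version of this synthetic comparison (raw features vs. GC, GC2, and the difference of the first two orders, as in Figure 3) can be sketched as follows; the sizes and edge probabilities are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
labels = np.repeat([0, 1], n // 2)
p, q = 0.002, 0.018                      # strongly heterophilous (hypothetical)
same = labels[:, None] == labels[None, :]
A = np.triu(rng.random((n, n)) < np.where(same, p, q), 1).astype(float)
A = A + A.T
deg = np.maximum(A.sum(1), 1)

X = np.where(labels == 0, 1.0, -1.0) + rng.standard_normal(n)
H1 = A @ X / deg                         # GC
H2 = A @ H1 / deg                        # GC2
Hd = H2 - H1                             # GCdiff1: difference of first two orders

def err(f):
    """Error of the threshold-at-zero classifier, allowing a sign flip
    (the optimal 1-D boundary since mu_0 = 1 = -mu_1)."""
    e = ((f < 0).astype(int) != labels).mean()
    return min(e, 1 - e)

errors = {name: err(feat) for name, feat in
          {"raw": X, "GC": H1, "GC2": H2, "GCdiff": Hd}.items()}
```

On such a heterophilous graph, the second-order and difference filters should clearly improve on the raw features, while deep stacks of low-pass filters would not.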
As shown in Figure 2a, the distribution of the feature values qualitatively supports this view. We found that on heterophilous graphs, first-order graph convolution may change the sign of the features. Thus, for methods that use non-negative weights, such as PageRank and S²GC, this moves the class centres of the features closer to the global feature centre, as shown in Figure 2b. The proposed method, in contrast, is able to keep the distance between the two class centres while reducing the intra-class variance. As shown in Figure 3, we found that second-order graph convolution is better than first-order graph convolution on graphs with different heterophily scores. The difference between second-order and first-order graph convolutions provides an even better graph convolution on heterophilous graphs when (p−q)/(p+q) < −0.6.

4.2. REAL WORLD BENCHMARK

We use 5 homophilous benchmark datasets available from the PyTorch Geometric library, including the citation graphs Cora, CiteSeer and PubMed (Sen et al., 2008; Yang et al., 2016) and the Amazon co-purchase graphs Computers and Photo (McAuley et al., 2015; Shchur et al., 2018). We also use 5 heterophilous benchmark datasets tested in (Pei et al., 2020), including the Wikipedia graphs Chameleon and Squirrel, the Actor co-occurrence graph, and the webpage graphs Texas and Cornell from WebKB. We summarize the dataset statistics and results in Tables 1 and 2.

Results on real-world datasets. We use accuracy (the micro-F1 score) as the evaluation metric, reported with a 95% confidence interval. The relevant results are summarized in Table 2. For homophilous datasets, we report results for sparse splitting (2.5%/2.5%/95% as training/validation/test data), following the definition in Chien et al. (2021), which differs from the original setting used in (Kipf & Welling, 2016; Shchur et al., 2018). For the heterophilous datasets, we adopt the dense splitting (60%/20%/20% as training/validation/test data) used in (Pei et al., 2020). We apply our SGC, S²GC and PageRank implementations to these datasets and report the mean test accuracy over 10 random splits. We also provide a baseline: the accuracy of logistic regression on the raw attributes, without any graph convolution. Table 1 shows that, in general, our method cannot beat other convolution methods based on low-pass filtering designs, such as SGC, S²GC and PageRank, on homophilous datasets. However, our approach still outperforms some classical GNNs such as SAGE, JKNet, GCN-Cheby and Geom-GCN, while GPR-GNN achieves the state-of-the-art performance. On heterophilous datasets, our method significantly outperforms all the other graph convolution models; on Chameleon and Squirrel, we outperform all the other methods.
It is worth noting that our approach outperforms GPR-GNN, which uses the same Monomial basis as our method. This is a good case for the argument that labels are not necessary on heterophilous graphs. On Actor, most methods cannot outperform the corresponding baseline: except for ours, no graph convolution based method outperforms raw attributes with logistic regression, and similarly only APPNP and GPR-GNN outperform raw attributes with an MLP. On the Texas dataset, all methods behave similarly to those on Actor; the only difference is that APPNP cannot outperform the baseline, while ours and GPR-GNN do. Cornell is the most challenging dataset for all methods: no method outperforms the baselines (logistic regression and MLP), although ours and GPR-GNN match their performance.

5. CONCLUSION

From an optimization perspective, we propose a novel framework for label (or feature) propagation that is not based on Laplacian regularization. This framework extends label propagation from a least squares problem to polynomial approximation, and sheds light on graph convolution for heterophilous graphs. We show that we can learn, in an unsupervised setting, a graph convolution that obtains better features than the raw attributes. In synthetic data experiments, we show that our method has better properties on heterophilous graphs than existing fixed-parameter graph convolutions. On real-world benchmarks, our method even outperforms some methods that use label information.



Figure 2: Distribution of the feature values on a highly heterophilous synthetic graph before and after using different graph convolution based methods.

Table 1: Statistics and results on homophilous datasets: mean accuracy (%) ± 95% confidence interval. As expected by design, on homophilous datasets our method is only comparable to other graph convolution based methods, because low-pass filtering is all that is needed in this situation.

Table 2: Statistics and results on heterophilous benchmark datasets: mean accuracy (%) ± 95% confidence interval. As expected by design, our methods all meet or exceed the performance of raw features and are not affected by the heterophilous property like other graph convolution methods.

A APPENDIX

A.1 PROOF OF THEOREM 3.2

Proof. Adding $w_1$ and $w_2$ into Eq. 14, the distance of the scaled combination $w_1 h_i^1 + w_2 h_i^2$ to the decision boundary is

$$\mathrm{dis}_{\mathrm{comb}} = \Bigg(w_1\frac{\sqrt{\deg(i)}\,(p-q)}{p+q} + w_2\frac{\deg(i)\,(p-q)^2}{(p+q)^2}\Bigg)\cdot\frac{\|\mu_0 - \mu_1\|_2}{2}. \qquad (15)$$

We want $\mathrm{dis}_{\mathrm{comb}}$ to be larger than $\mathrm{dis}(h_i'')$, so we need the following inequality:

$$w_1\frac{\sqrt{\deg(i)}\,(p-q)}{p+q} > (1-w_2)\frac{\deg(i)\,(p-q)^2}{(p+q)^2}. \qquad (16)$$

Since $p < q$ and $w_1 < 0$, the left-hand side is positive. Assuming $\sqrt{\deg(i)}\,|p-q|/(p+q) > 1$ and $0 < w_2 < 1$, dividing both sides by $\sqrt{\deg(i)}\,|p-q|/(p+q)$ and squaring yields

$$\frac{w_1^2}{(1-w_2)^2} > \frac{\deg(i)\,(p-q)^2}{(p+q)^2}. \qquad (17)$$

