SIMPLE SPECTRAL GRAPH CONVOLUTION

Abstract

Graph Convolutional Networks (GCNs) are leading methods for learning graph representations. However, without specially designed architectures, the performance of GCNs degrades quickly with increased depth. As the aggregated neighborhood size and neural network depth are two completely orthogonal aspects of graph representation, several methods focus on summarizing the neighborhood by aggregating K-hop neighborhoods of nodes while using shallow neural networks. However, these methods still encounter oversmoothing and suffer from high computation and storage costs. In this paper, we use a modified Markov Diffusion Kernel to derive a variant of GCN called Simple Spectral Graph Convolution (S²GC). Our spectral analysis shows that the simple spectral graph convolution used in S²GC is a trade-off between low- and high-pass filter bands which capture the global and local contexts of each node. We provide two theoretical claims demonstrating that we can aggregate over a sequence of increasingly larger neighborhoods compared to competitors while limiting severe oversmoothing. Our experimental evaluations show that S²GC with a linear learner is competitive in text and node classification tasks. Moreover, S²GC is comparable to other state-of-the-art methods for node clustering and community prediction tasks.

1. INTRODUCTION

In the past decade, deep learning has become mainstream in computer vision and machine learning. Although deep learning has been applied with great success to feature extraction on Euclidean grid-structured data, the data in many practical scenarios lies on non-Euclidean structures, whose processing poses a challenge for deep learning. By defining a convolution operator between a graph and a signal, Graph Convolutional Networks (GCNs) generalize Convolutional Neural Networks (CNNs) to graph-structured inputs with attributes. Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017) unify graph convolution as two functions, a transformation function and an aggregation function, and iteratively propagate node features over the adjacency structure of the graph in a number of rounds. Despite their enormous success in applications such as social media, traffic analysis, biology, recommendation systems and even computer vision, many current GCN models use fairly shallow settings, as many recent models such as GCN (Kipf & Welling, 2016) achieve their best performance with 2 layers. In other words, 2-layer GCN models aggregate nodes in the two-hop neighborhood and thus have no ability to extract information from K-hop neighborhoods for K > 2. Moreover, stacking more layers and adding non-linearities tends to degrade the performance of these models. This phenomenon, called oversmoothing (Li et al., 2018a), is characterized by the effect that, as the number of layers increases, the representations of the nodes in GCNs converge to similar values that are no longer distinctive from one another. Even adding residual connections, an effective trick for training very deep CNNs, merely slows down the oversmoothing issue (Kipf & Welling, 2016) in GCNs. It appears that deep GCN models gain nothing but performance degradation from the deep architecture.
One solution is to widen the receptive field of the aggregation function while limiting the depth of the network, because the required neighborhood size and the neural network depth can be regarded as two separate aspects of design. To this end, SGC (Wu et al., 2019) captures the context of K-hop neighborhoods in the graph by applying the K-th power of the normalized adjacency matrix in a single layer of a neural network; this scheme is also used for attributed graph clustering (Zhang et al., 2019). However, SGC also suffers from oversmoothing as $K \to \infty$, as shown in Theorem 1. PPNP and APPNP (Klicpera et al., 2019a) replace the power of the normalized adjacency matrix with the Personalized PageRank matrix to solve the oversmoothing problem. Although APPNP relieves the oversmoothing problem, it employs a non-linear operation which requires a costly computation of the derivative of the filter due to the non-linearity over the multiplication of the feature matrix with learnable weights. In contrast, we show that our approach enjoys a free derivative computed in the feed-forward step due to the use of a linear model. Furthermore, APPNP aggregates over multiple k-hop neighborhoods ($k = 0, \ldots, K$), but its weighting scheme favors either the global or the local context, making it difficult if not impossible to find a good value of the balancing parameter. In contrast, our approach aggregates over k-hop neighborhoods in a well-balanced manner. GDC (Klicpera et al., 2019b) further extends APPNP by generalizing Personalized PageRank (Page et al., 1999) to an arbitrary graph diffusion process. GDC has more expressive power than SGC (Wu et al., 2019), PPNP and APPNP (Klicpera et al., 2019a), but it leads to a dense transition matrix which makes computation and storage intractable for large graphs, although the authors suggest that a shrinkage method can be used to sparsify the generated transition matrix. Noteworthy are also the orthogonal research directions of Sun et al. (2019), Koniusz & Zhang (2020), and Elinas et al. (2020), which improve the performance of GCNs by the perturbation of the graph, the high-order aggregation of features, and variational inference, respectively.

To tackle the above issues, we propose a Simple Spectral Graph Convolution (S²GC) network for node clustering and node classification in semi-supervised and unsupervised settings. By analyzing the Markov Diffusion Kernel (Fouss et al., 2012), we obtain a very simple and effective spectral filter: we aggregate k-step diffusion matrices over $k = 0, \ldots, K$ steps, which is equivalent to aggregating over neighborhoods of gradually increasing sizes. Moreover, we show that our design incorporates larger neighborhoods compared to SGC and copes better with oversmoothing: limiting the dominance of the largest neighborhoods in the aggregation step limits oversmoothing while preserving the large context of each node. We also show via spectral analysis that S²GC is a trade-off between low- and high-pass filter bands which leads to capturing the global and local contexts of each node. Moreover, we show how S²GC and APPNP (Klicpera et al., 2019a) are related and explain why S²GC captures a range of neighborhoods better than APPNP. Our experimental results include node clustering, unsupervised and semi-supervised node classification, node property prediction and supervised text classification. We show that S²GC is highly competitive, often significantly outperforming state-of-the-art methods.

2. PRELIMINARIES

Notations. Let $G = (V, E)$ be a simple, connected, undirected graph with $n$ nodes and $m$ edges. We use $\{1, \ldots, n\}$ to denote the node indices of $G$, and $d_j$ denotes the degree of node $j$ in $G$. Let $A$ be the adjacency matrix and $D$ the diagonal degree matrix. Let $\tilde{A} = A + I_n$ denote the adjacency matrix with added self-loops and $\tilde{D}$ the corresponding diagonal degree matrix, where $I_n \in S^n_{++}$ is the identity matrix. Finally, let $X \in R^{n \times d}$ denote the node feature matrix, where each node $v$ is associated with a $d$-dimensional feature vector $X_v$. The normalized graph Laplacian is defined as $L = I_n - D^{-1/2} A D^{-1/2} \in S^n_+$, that is, a symmetric positive semidefinite matrix with eigendecomposition $U \Lambda U^\top$, where $\Lambda$ is a diagonal matrix of the eigenvalues of $L$, and $U \in R^{n \times n}$ is a unitary matrix whose columns are the eigenvectors of $L$.

Spectral Graph Convolution (Defferrard et al., 2016). We consider spectral convolutions on graphs defined as the multiplication of a signal $x \in R^n$ with a filter $g_\theta$ in the Fourier domain: $g_\theta(L) * x = U g^*_\theta(\Lambda) U^\top x$, where the parameter $\theta \in R^n$ is a vector of spectral filter coefficients. One can understand $g_\theta$ as a function operating on the eigenvalues of $L$, that is, $g^*_\theta(\Lambda)$. To avoid the eigendecomposition, $g^*_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(\Lambda)$ up to the $K$-th order (Defferrard et al., 2016): $g^*_\theta(\Lambda) \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, with a rescaled $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_n$, where $\lambda_{max}$ denotes the largest eigenvalue of $L$ and $\theta \in R^K$ is now a vector of Chebyshev coefficients.

Vanilla Graph Convolutional Network (GCN) (Kipf & Welling, 2016). The vanilla GCN is a first-order approximation of spectral graph convolutions. Setting $K = 1$, $\theta_0 = 2$, and $\theta_1 = -1$ in the Chebyshev expansion above yields the convolution operation $g_\theta(L) * x = (I + D^{-1/2} A D^{-1/2}) x$. Finally, the renormalization trick replaces $I + D^{-1/2} A D^{-1/2}$ by a normalized version $\tilde{T} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = (D + I_n)^{-1/2} (A + I_n) (D + I_n)^{-1/2}$, which leads to the GCN layer with a non-linear function $\sigma$: $H^{(l+1)} = \sigma(\tilde{T} H^{(l)} W^{(l)})$.

Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b). A generalized graph diffusion is given by the diffusion matrix $S = \sum_{k=0}^{\infty} \theta_k T^k$, with weighting coefficients $\theta_k$ and a generalized transition matrix $T$. This series can be regarded as related to the Taylor expansion of matrix-valued functions; thus, $\theta_k$ and $T$ must at least be chosen so that the series converges. Klicpera et al. (2019b) provide two special cases of low-pass filters, i.e., the heat kernel and the kernel based on random walks with restarts. If $S$ denotes the resulting adjacency matrix and $D_S$ its diagonal degree matrix, the corresponding graph diffusion convolution is defined as $D_S^{-1/2} S D_S^{-1/2} x$. Note that $\theta_k$ can be a learnable parameter, or it can be chosen in some other way. Many works use this expansion, but different choices of $\theta_k$ realise very different filters, making each method unique.

Simple Graph Convolution (SGC) (Wu et al., 2019).
A classical MPNN (Gilmer et al., 2017) averages, in each layer, the hidden representations among 1-hop neighbors. This implies that each node in the $K$-th layer obtains feature information from all nodes that are $K$ hops away in the graph. Hypothesizing that the non-linearity between GCN layers is not critical, SGC captures information from the $K$-hop neighborhood by applying the $K$-th power of the transition matrix in a single neural network layer. SGC can be regarded as a special case of GDC without non-linearity and without the normalization by $D_S^{-1/2}$: setting $\theta_K = 1$, $\theta_{k<K} = 0$, and $T = \tilde{T}$ in the generalized diffusion above yields
$\hat{Y} = \mathrm{softmax}(\tilde{T}^K X W)$.
Although SGC is an efficient and effective method, increasing $K$ leads to oversmoothing; thus, SGC uses a small number of layers $K$. Theorem 1 shows that oversmoothing results from convergence to the stationary distribution of the graph diffusion process as time $t \to \infty$.

Theorem 1. (Chung & Graham, 1997) Let $\lambda_2$ denote the second largest eigenvalue of the transition matrix $T = D^{-1} A$ of a non-bipartite graph, $p(t)$ the probability distribution vector, and $\pi$ the stationary distribution. If the walk starts from vertex $i$, i.e., $p_i(0) = 1$, then after $t$ steps, for every vertex $j$ we have:
$|p_j(t) - \pi_j| \leq \sqrt{d_j / d_i}\, \lambda_2^t$.

APPNP. Klicpera et al. (2019a) proposed to use Personalized PageRank to derive a fixed filter of order $K$. Let $f_\theta(X)$ denote the output of a two-layer fully connected neural network on the feature matrix $X$; then the PPNP model is defined as $H = \alpha (I_n - (1 - \alpha) \tilde{T})^{-1} f_\theta(X)$. To avoid computing the matrix inverse, Klicpera et al. (2019a) also propose Approximate PPNP (APPNP), which replaces the costly inverse with a truncated power iteration: $H^{(l+1)} = (1 - \alpha) \tilde{T} H^{(l)} + \alpha H^{(0)}$, where $H^{(0)} = f_\theta(X) = \mathrm{ReLU}(X W)$ or $H^{(0)} = f_\theta(X) = \mathrm{MLP}(X)$. By decoupling the feature transformation and propagation steps, PPNP and APPNP aggregate information from multi-hop neighbors.
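To make the propagation rules above concrete, here is a minimal NumPy/SciPy sketch of the renormalization trick, SGC propagation, and the APPNP power iteration (the function names and toy usage are ours, not code from the cited papers):

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A):
    """Renormalization trick: T~ = D~^(-1/2) (A + I) D~^(-1/2)."""
    n = A.shape[0]
    A_tilde = A + sp.eye(n)
    d = np.asarray(A_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

def sgc_features(A, X, K):
    """SGC propagation: T~^K X, computed as K sparse-dense products."""
    T = normalized_adjacency(A)
    H = X
    for _ in range(K):
        H = T @ H
    return H

def appnp_propagate(A, H0, K, alpha):
    """APPNP truncated power iteration: H^(l+1) = (1-alpha) T~ H^(l) + alpha H^(0)."""
    T = normalized_adjacency(A)
    H = H0
    for _ in range(K):
        H = (1.0 - alpha) * (T @ H) + alpha * H0
    return H
```

In SGC the propagated features are then passed to a single linear softmax layer, whereas in APPNP the propagation is applied to the output of a non-linear network $f_\theta(X)$.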

3. METHODOLOGY

Below, we first outline two claims which underlie the design of our network, whose goal is to mitigate oversmoothing. Moreover, we analyze the Markov Diffusion Kernel (Fouss et al., 2012) and note that it acts as a low-pass spectral filter of varying degree. Based on the feature mapping function underlying this kernel, we present our Simple Spectral Graph Convolution network and discuss its relation to other models. Finally, we compare the computational and storage complexity requirements.

3.1. MOTIVATION

Our design follows Claims I and II, whose detailed proofs are given in the Appendix.

Claim I. By design, our filter gives the highest weight to the closest neighborhood of a node, as the neighborhoods $N$ of diffusion steps $k = 0, \ldots, K$ obey $N(\tilde{T}^0) \subseteq N(\tilde{T}^1) \subseteq \cdots \subseteq N(\tilde{T}^K) \subseteq N(\tilde{T}^\infty)$. That is, smaller neighborhoods belong to larger neighborhoods too.

Claim II. As $K \to \infty$, the ratio of energies contributed by S²GC to SGC is 0. Thus, the energy of the infinite-dimensional receptive field (largest $k$) will not dominate the sum energy of our filter, and S²GC can incorporate larger receptive fields without undermining the contributions of smaller receptive fields. This is substantiated by Table 8, where S²GC achieves its best results at $K = 16$, whereas SGC achieves poorer results by comparison, peaking at $K = 4$ (being able to benefit from larger $K$ is desirable).
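The nesting of neighborhoods in Claim I can be sanity-checked numerically: with self-loops, the support of $\tilde{T}^k$ (the set of node pairs with a nonzero $k$-step weight) can only grow with $k$. A small sketch on a toy path graph, under our own naming:

```python
import numpy as np

def neighborhood(T_tilde, k):
    """N(T~^k): set of pairs (i, j) with a nonzero k-step diffusion weight."""
    M = np.linalg.matrix_power(T_tilde, k)
    return set(zip(*np.nonzero(M > 1e-12)))

# Toy path graph 0-1-2-3 with self-loops, symmetrically normalized.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
A_tilde = A + np.eye(4)
d = A_tilde.sum(1)
T_tilde = A_tilde / np.sqrt(np.outer(d, d))   # D~^(-1/2) A~ D~^(-1/2)
```

On this graph, `neighborhood(T_tilde, k)` forms a nested chain in `k`, mirroring $N(\tilde{T}^0) \subseteq N(\tilde{T}^1) \subseteq \cdots$.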

3.2. MARKOV DIFFUSION KERNEL

Two nodes are considered similar when they diffuse through the graph in a similar way, as they then influence the other nodes in a similar manner (Fouss et al., 2012). Moreover, such nodes are close neighbors belonging to the same distinct cluster. The Markov Diffusion distance between nodes $i$ and $j$ at time $K$ is defined as $d_{ij}(K) = \|x_i(K) - x_j(K)\|_2^2$, where the average visiting rate $x_i(K)$ after $K$ steps of a process that started at time $k = 0$ is computed as $x_i(K) = \frac{1}{K} \sum_{k=1}^{K} T^k x_i(0)$. By defining $Z(K) = \frac{1}{K} \sum_{k=1}^{K} T^k$, we reformulate this distance as the metric
$d_{ij}(K) = \|Z(K)(x_i(0) - x_j(0))\|_2^2$.
The underlying feature map of the Markov Diffusion Kernel (MDK) is thus $Z(K) x_i(0)$ for node $i$. The effect of the linear projection $Z(K)$ (filter) acting on the spectrum as $f(\lambda) = \frac{1}{K} \sum_{k=0}^{K} \lambda^k$ (we sum from 0 to include self-loops) is plotted in Figure 1, from which we observe the following properties: (i) $Z(K)$ preserves the leading (large) eigenvalues of $T$, and (ii) the higher $K$ is, the stricter the low-pass filter becomes, yet the filter also preserves the high frequencies. In other words, as $K$ grows, this filter includes a larger and larger neighborhood while maintaining the closest locality of each node. Note that $L = I - T$, where $L$ is the normalized Laplacian matrix and $T$ is the normalized adjacency matrix; thus, keeping the large positive eigenvalues of $T$ equals keeping the small eigenvalues of $L$.
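The spectral response $f(\lambda) = \frac{1}{K}\sum_{k=0}^{K} \lambda^k$ discussed above can be evaluated directly to inspect properties (i) and (ii) (a sketch; the function name is ours):

```python
import numpy as np

def s2gc_response(lam, K):
    """f(lambda) = (1/K) * sum_{k=0}^{K} lambda^k, the spectral response of
    the self-loop augmented Markov diffusion filter Z(K)."""
    lam = np.asarray(lam, dtype=float)
    return sum(lam ** k for k in range(K + 1)) / K

# Eigenvalues of the normalized adjacency T lie in [-1, 1]; sweeping this
# range for several K reproduces the qualitative shape of Figure 1.
lam = np.linspace(-1.0, 1.0, 201)
responses = {K: s2gc_response(lam, K) for K in (2, 8, 32)}
```

Larger K suppresses the mid-band more strongly while keeping the response at the leading eigenvalue close to 1 and leaving a nonzero response at the high-frequency end.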

3.3. SIMPLE SPECTRAL GRAPH CONVOLUTION

Based on the aforementioned Markov Diffusion Kernel, we include self-loops and propose the Simple Spectral Graph Convolution (S²GC) network with the softmax classifier after the linear layer:
$\hat{Y} = \mathrm{softmax}\big(\frac{1}{K} \sum_{k=0}^{K} \tilde{T}^k X W\big)$.
Let $\|x_i\|_2 = 1, \forall i$ (each $x_i$ is a row of $X$). If $K \to \infty$, then $H = \sum_{k=0}^{\infty} \tilde{T}^k X$ is the optimal diffused representation of the normalized Laplacian regularization problem $\arg\min_{H \,\text{s.t.}\, \|h_i\|_2 = 1, \forall i} q(H)$, where
$q(H) = \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} \Big\| \frac{h_i}{\sqrt{d_i}} - \frac{h_j}{\sqrt{d_j}} \Big\|_2^2 + \frac{1}{2} \sum_{i=1}^{n} \|h_i - x_i\|_2^2$,
and each vector $h_i$ denotes the $i$-th row of $H$. Compared with the more common form in (Zhou et al., 2004), we impose $\|h_i\|_2^2 = \|x_i\|_2^2 = 1$ to minimize the difference between $h_i$ and $x_i$ via the cosine distance rather than the Euclidean distance. Differentiating $q(H)$ with respect to $H$, we have $LH - X = 0$; thus, the optimal representation is $H = (I - \tilde{T})^{-1} X$, where $(I - \tilde{T})^{-1} = \sum_{k=0}^{\infty} \tilde{T}^k$. However, this infinite expansion is in fact suboptimal due to oversmoothing. Thus, we include a self-loop $\tilde{T}^0 = I$ and a parameter $\alpha \in [0, 1]$ (Table 9 evaluates its impact) to balance the self-information of each node against its consecutive neighborhoods, and we consider finite $K$, generalizing the model to:
$\hat{Y} = \mathrm{softmax}\Big(\frac{1}{K} \sum_{k=1}^{K} \big((1 - \alpha) \tilde{T}^k X + \alpha X\big) W\Big)$.

Relation of S²GC to GDC. GDC uses the entire filter matrix $S$ of size $n \times n$, as $S$ is re-normalized numerous times by its degree. Klicpera et al. (2019b) explain that 'most graph diffusions result in a dense matrix S'. In contrast, our approach is simply computed as $(\sum_{k=1}^{K} \tilde{T}^k X) W$ (plus the self-loop), where $X$ is of size $n \times d$ and $d \ll n$, with $n$ and $d$ being the number of nodes and features, respectively. The $\tilde{T}^k X$ step is computed as $\tilde{T} \cdot (\tilde{T} \cdot (\cdots (\tilde{T} X) \cdots))$, which requires $K$ sparse matrix-matrix multiplications between a sparse matrix of size $n \times n$ and a dense matrix of size $n \times d$. Thus, S²GC can handle extremely large graphs, as S²GC does not need to sparsify dense filter matrices (in contrast to GDC).

Relation of S²GC to APPNP. Let $H_0 = XW$, as we use the linear step in our S²GC. Then and only then, for $l = 0$ and $H_0 = XW$, the APPNP expansion yields $H_1 = (1 - \alpha) \tilde{T} X W + \alpha X W = ((1 - \alpha) \tilde{T} + \alpha I) X W$, which is equal to our $Z(1) X W = (\sum_{k=0}^{1} \tilde{T}^k) X W = (\tilde{T} + I) X W$ if $\alpha = 0.5$ and $K = 1$, up to a constant scaling of $H_1$. In contrast, for $l = 1$ and the general case $H_0 = f(X; W)$, APPNP yields $H_2 = (1 - \alpha)^2 \tilde{T}^2 f(X; W) + (1 - \alpha)\alpha \tilde{T} f(X; W) + \alpha f(X; W)$, from which it is easy to note the specific weight coefficients $(1 - \alpha)^2$, $(1 - \alpha)\alpha$ and $\alpha$ associated with the 2-, 1-, and 0-hop neighborhoods. This shows that the APPNP expansion is very different from the S²GC expansion above; in fact, S²GC and APPNP are only equivalent if $\alpha = 0.5$, $K = 1$ and $f$ is the linear transformation. Moreover, backpropagation in APPNP requires differentiating $(1 - \alpha)^2 \tilde{T}^2 f(X; W) + (1 - \alpha)\alpha \tilde{T} f(X; W) + \alpha f(X; W)$ with respect to $W$ through the non-linear $f$. In contrast, we use the linear function $XW$, so $\frac{\partial XW}{\partial W}$ yields $X$; the multiplication of our expansion with $X$ needed for backpropagation is thus obtained already in the forward pass, which makes our approach very fast for large graphs.

Relation of S²GC to AR. The AR filter (Li et al., 2019) uses the regularized Laplacian kernel (Smola & Kondor, 2003), which differs from the (modified) Markov Diffusion Kernel used by us. Specifically, the regularized Laplacian kernel uses the negated Laplacian matrix $-L$, yielding $K_L = \sum_{k=0}^{\infty} \alpha^k (-L)^k = (I + \alpha L)^{-1}$, where $L = I - T$, which is related to the von Neumann diffusion kernel $K_{vN} = \sum_{k=0}^{\infty} \alpha^k A^k$. In contrast, the Markov Diffusion Kernel is defined as $K_{MD}(K) = Z(K) Z^\top(K)$, where $Z(K) = \frac{1}{K} \sum_{k=1}^{K} T^k$ and $T = D^{-1/2} A D^{-1/2}$.

Relation of S²GC to the Jumping Knowledge Network (JKN). Xu et al. (2018b) combine intermediate node representations from each layer by concatenating them in the final layer. However, Xu et al. (2018b) use non-linear layers, which results in a completely different network architecture and the usual slower processing due to the complex backpropagation chain.
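The S²GC propagation above (the features fed to the linear softmax classifier) can be sketched with sparse matrix products, assuming our own function naming:

```python
import numpy as np
import scipy.sparse as sp

def s2gc_features(A, X, K, alpha=0.05):
    """(1/K) * sum_{k=1}^{K} ((1 - alpha) T~^k X + alpha X): the S2GC
    features, which are then passed to a linear softmax classifier."""
    n = A.shape[0]
    A_tilde = A + sp.eye(n)
    d = np.asarray(A_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    T = d_inv_sqrt @ A_tilde @ d_inv_sqrt      # T~ = D~^(-1/2) A~ D~^(-1/2)
    H, TkX = np.zeros_like(X, dtype=float), X
    for _ in range(K):
        TkX = T @ TkX                          # one sparse-dense product per hop
        H += (1.0 - alpha) * TkX + alpha * X
    return H / K
```

Note that only K sparse-dense products over an n × d matrix are needed; no dense n × n filter matrix is ever formed, which is the point of the GDC comparison above.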

3.4. COMPLEXITY ANALYSIS

For S²GC, the storage cost is $O(|E| + nd)$, where $|E|$ is the total edge count and $nd$ relates to storing $\tilde{T}^k X$ during the intermediate multiplications $\tilde{T} \cdot (\tilde{T} \cdot (\cdots (\tilde{T} X) \cdots))$. The computational cost is $O(K|E|d + Knd)$: each sparse matrix-matrix multiplication $\tilde{T} X$ costs $|E|d$, we need $K$ such multiplications, and $Knd$ and $nd$ are the costs of summing over the filters and adding the features $X$. In contrast, the storage cost of GDC is approximately $O(n^2)$ and its computational cost approximately $O(K|E|n)$, where $n$ is the number of nodes, $K$ is the order of terms and $|E|$ is the number of graph edges. APPNP, SGC and S²GC have much lower costs than GDC; note that $K|E|d \gg Knd$ and $n \gg d$. We found that APPNP, SGC and S²GC have similar computational and storage costs in the forward stage. We note that the symbol $d$ in APPNP is not the dimension of the features $X$ but the dimension of $f(X)$, which is the number of categories. For the backward stage, which includes computing the gradient of the classification step, the computational costs of GDC, SGC and S²GC are independent of $K$ and $|E|$ because the graph convolution in these methods does not require backpropagation (the gradient is computed in the forward step). In contrast, APPNP requires backpropagation, as explained earlier. Table 1 summarizes the computational and storage costs of several methods. Table 2 demonstrates that APPNP is over 66× slower than S²GC on the large-scale Products dataset (OGB benchmark), even though, for fairness, we use the same basic PyTorch building blocks in the compared methods.

4. EXPERIMENTS

In this section, we evaluate the proposed method on four different tasks: node clustering, community prediction, semi-supervised node classification and text classification.

4.1. NODE CLUSTERING

We compare against baselines including ARGE and ARVGE (Pan et al., 2018) and AGC (Zhang et al., 2019). To evaluate the clustering performance, three performance measures are adopted: clustering Accuracy (Acc), Normalized Mutual Information (NMI) and macro F1-score (F1). We run each method 10 times on four datasets: Cora, CiteSeer, PubMed, and Wiki, and we report the average clustering results in Table 3, where top-1 results are highlighted in bold. To adaptively select the order K, we use an internal clustering criterion based on the information intrinsic to the data alone (Zhang et al., 2019).

4.2. COMMUNITY PREDICTION

We supplement our social network analysis by using S²GC to inductively predict the community structure on Reddit, a large-scale dataset (see Table 10), which cannot be processed by the vanilla GCN (Kipf & Welling, 2016) or GDC (Klicpera et al., 2019b) due to memory issues. On the Reddit dataset, we train S²GC with L-BFGS using no regularization, and we set K = 5 and α = 0.05. We evaluate S²GC inductively according to the protocol of Chen et al. (2018): we train S²GC on a subgraph comprising only the training nodes and test on the original graph. On all datasets, we tune the number of epochs based on both the convergence behavior and the obtained validation accuracy. For Reddit, we compare S²GC to the reported performance of supervised and unsupervised variants of GraphSAGE (Hamilton et al., 2017), FastGCN (Chen et al., 2018), SGC (Wu et al., 2019) and DGI (Velickovic et al., 2019). Table 4 also highlights the setting of the feature extraction step for each method. Note that S²GC and SGC involve no learning in this step because they do not learn any parameters to extract features. Logistic regression is used as the classifier, trained with labels afterward, for both the unsupervised and no-learning approaches.

4.3. NODE CLASSIFICATION

For the semi-supervised node classification task, we apply the standard fixed training, validation and testing splits (Yang et al., 2016) on the Cora, Citeseer, and Pubmed datasets, with 20 nodes per class for training, 500 nodes for validation and 1,000 nodes for testing. As baselines, we include GAT (Veličković et al., 2017), FastGCN (Chen et al., 2018), APPNP (Klicpera et al., 2019a), MixHop (Abu-El-Haija et al., 2019), SGC (Wu et al., 2019), DGI (Velickovic et al., 2019) and GIN (Xu et al., 2018a). We use the Adam SGD optimizer (Kingma & Ba, 2014) with a learning rate of 0.02 to train S²GC, and we set α = 0.05 and K = 16 on all datasets. To determine K and α, we used the MetaOpt package (Bergstra et al., 2015) with 20 steps to meta-optimize the hyperparameters on the validation set of Cora. Following that, we fixed K = 16 and α = 0.05 across all datasets, so K and α are not tuned to individual datasets at all. We discuss the influence of α and K later. To evaluate the proposed method on large-scale benchmarks (see Table 6), we use the Arxiv, Mag and Products datasets to compare the proposed method with SGC, GraphSage, GCN, MLP and Softmax (multinomial logistic regression). On these three datasets, our method consistently outperforms SGC. On Arxiv and Products, our method cannot outperform GCN and GraphSage, while MLP outperforms the softmax classifier significantly; thus, we argue that the MLP plays a more important role here than the graph convolution. To verify this point, we also conduct an experiment (S²GC+MLP) in which we use an MLP in place of the linear classifier, obtaining a more powerful variant of S²GC. On Mag, S²GC+MLP outperforms S²GC by a tiny margin because the performance of MLP is close to that of softmax. On the other two datasets, S²GC+MLP is a very strong performer; overall, S²GC+MLP is the best performer on Mag and Arxiv.

4.4. TEXT CLASSIFICATION

Text classification predicts the labels of documents. Yao et al. (2019) use a 2-layer GCN to achieve state-of-the-art results by creating a corpus-level graph which treats both documents and words as nodes: word-to-word edge weights are given by Point-wise Mutual Information (PMI) and word-document edge weights by normalized TF-IDF scores. We ran our experiments on five widely used benchmark corpora: Movie Review (MR), 20-Newsgroups (20NG), Ohsumed, and R52 and R8 of Reuters 21578. We first preprocessed all datasets by cleaning and tokenizing text as in Kim (2014). We then removed the stop words defined in NLTK and low-frequency words appearing fewer than 5 times for 20NG, R8, R52 and Ohsumed. We compare our method with GCN (Kipf & Welling, 2016) and SGC (Wu et al., 2019). The statistics of the preprocessed datasets are summarized in Table 11. Table 7 shows that S²GC rivals these models on the 5 benchmark datasets. We provide the parameter settings in the supplementary material.

4.5. A DETAILED COMPARISON WITH VARIOUS NUMBERS OF LAYERS AND α

Table 8 summarizes the results for models with various numbers of layers (K is the number of layers, and it coincides with the number of aggregated filters in S²GC). We observe that on Cora, Citeseer and Pubmed, our method consistently obtains the best performance with K = 16, the equivalent of 16 layers. Overall, the results suggest that S²GC can aggregate over larger neighborhoods better than SGC while suffering less from oversmoothing. In contrast to S²GC, the performance of GCN and SGC drops rapidly as the number of layers exceeds 32 due to oversmoothing. Table 9 summarizes the results of the proposed method for various α ranging from 0 to 0.15. The table shows that α slightly improves the performance of S²GC. Thus, balancing the impact of the self-loop by α w.r.t. the other filters of consecutively larger receptive fields is useful, but the self-loop is not mandatory.

5. CONCLUSIONS

We have proposed Simple Spectral Graph Convolution (S²GC), a method extending the Markov Diffusion Kernel (Section 3.2), whose feature maps emerge from the normalized Laplacian regularization problem (Section 3.3) as K → ∞. Our theoretical analysis shows that S²GC obtains the right level of balance when aggregating consecutively larger receptive fields. We have shown the connections between S²GC and SGC, APPNP and JKN by analyzing the spectral properties and implementation of each model; as Claims I and II show, however, we have designed a filter with unique properties that captures a cascade of gradually increasing contexts while limiting oversmoothing by giving proportionally larger weights to the closest neighborhoods of each node. We have conducted extensive and rigorous experiments which show that S²GC is competitive, frequently outperforming many state-of-the-art methods on unsupervised, semi-supervised and supervised tasks over several popular dataset benchmarks.

A SUPPLEMENTARY MATERIAL

A.1 NODE CLUSTERING

For S²GC and AGC, we set the maximum number of iterations to 60. For the other baselines, we follow the parameter settings in the original papers. In particular, for DeepWalk, the number of random walks is 10, the number of latent dimensions for each node is 128, and the path length of each random walk is 80. For DNGR, the autoencoder has three layers, with 512 and 256 neurons in the hidden layers, respectively. For GAE and VGAE, we construct encoders with a 32-neuron hidden layer and a 16-neuron embedding layer, and train the encoders for 200 iterations using the Adam optimizer with a learning rate of 0.01. For ARGE and ARVGE, we construct encoders with a 32-neuron hidden layer and a 16-neuron embedding layer; the discriminators are built from two hidden layers with 16 and 64 neurons, respectively. On Cora, Citeseer and Wiki, we train the autoencoder-related models of ARGE and ARVGE for 200 iterations with the Adam optimizer, with the encoder and discriminator learning rates both set to 0.001; on Pubmed, we train them for 2,000 iterations with an encoder learning rate of 0.001 and a discriminator learning rate of 0.008.

A.2 TEXT CLASSIFICATION

The 20NG dataset (bydate version) contains 18,846 documents evenly categorized into 20 different categories; 11,314 documents are in the training set and 7,532 in the test set. The Ohsumed corpus comes from the MEDLINE database, a bibliographic database of important medical literature maintained by the National Library of Medicine. In this work, we used the 13,929 unique cardiovascular disease abstracts among the first 20,000 abstracts of the year 1991. Each document in the set has one or more associated categories from the 23 disease categories. As we focus on single-label text classification, the documents belonging to multiple categories are excluded, so that 7,400 documents belonging to only one category remain. MR is a movie review dataset for binary sentiment classification in which each review contains only one sentence (Pang & Lee, 2005). The corpus has 5,331 positive and 5,331 negative reviews. We used the training/test split of Tang et al. (2015).

A.2.1 TEXT CLASSIFICATION

Parameters. We follow the setting of Text GCN (Yao et al., 2019), which includes experiments on four widely used benchmark corpora: 20-Newsgroups (20NG), Ohsumed, and R52 and R8 of Reuters 21578. For Text GCN, SGC, and our approach, the embedding size of the first convolution layer is kept the same. To convert text classification into node classification on a graph, two relationships are considered when forming the graph: (i) the relation between documents and words and (ii) the connections between words. For the first type of relation, we build edges between word nodes and document nodes based on word occurrence in documents: the weight of the edge between a document node and a word node is the Term Frequency-Inverse Document Frequency (TF-IDF) (Rajaraman & Ullman, 2011) of the word in the document, which builds the docs-words graph. For the second type of relation, we build edges between words based on word co-occurrence across the whole corpus. To utilize the global word co-occurrence information, we use a fixed-size sliding window over all documents in the corpus to gather co-occurrence statistics. Point-wise Mutual Information (PMI) (Church & Hanks, 1990), a popular measure of word association, is used to calculate the weight between two word nodes according to the following definition:
$\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}$, where $p(i, j) = \frac{\#W(i, j)}{\#W}$ and $p(i) = \frac{\#W(i)}{\#W}$.
Here, $\#W(i)$ is the number of sliding windows in the corpus that contain word $i$, $\#W(i, j)$ is the number of sliding windows that contain both words $i$ and $j$, and $\#W$ is the total number of sliding windows in the corpus. A positive PMI value implies a high semantic correlation of words in the corpus, while a negative PMI value indicates little or no semantic correlation. Therefore, we only add edges between word pairs with positive PMI values:
$A_{ij} = \begin{cases} \mathrm{PMI}(i, j) & \text{if } i, j \text{ are words and } \mathrm{PMI}(i, j) > 0, \\ \text{TF-IDF}_{ij} & \text{if } i \text{ is a document and } j \text{ is a word}, \\ 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$
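The PMI edge construction above can be sketched as follows (a simplified version with our own function name; Text GCN's actual preprocessing differs in details such as window handling):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=20):
    """Positive-PMI word-word edge weights from fixed-size sliding windows.
    docs: list of tokenized documents (lists of words)."""
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(doc)
        else:
            windows.extend(doc[i:i + window] for i in range(len(doc) - window + 1))
    n_w = len(windows)                       # #W: total number of windows
    word_count = Counter()                   # #W(i): windows containing word i
    pair_count = Counter()                   # #W(i, j): windows containing i and j
    for win in windows:
        words = sorted(set(win))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    edges = {}
    for (i, j), n_ij in pair_count.items():
        pmi = math.log(n_ij * n_w / (word_count[i] * word_count[j]))
        if pmi > 0:                          # keep only positive PMI edges
            edges[(i, j)] = pmi
    return edges
```

The resulting word-word weights, together with the TF-IDF document-word weights and unit self-loops, populate the adjacency matrix defined above.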
A.3 GRAPH CLASSIFICATION

We report the average accuracy of 10-fold cross-validation on a number of common benchmark datasets, shown in Table 12, where we randomly sample a training fold to serve as a validation set. We only make use of discrete node features; in case they are not given, we use one-hot encodings of node degrees as the feature input. We note that graph classification is a task highly dependent on the global pooling strategy, and there exist methods that apply sophisticated mechanisms for this step. However, with a readout function and a highly scalable S²GC model, we comfortably outperform all methods on MUTAG, Proteins and IMDB-Binary, even DiffPool, which has a differentiable graph pooling module to gather information across different scales. A stronger performer (Koniusz & Zhang, 2020) uses the GIN-0 backbone and second-order pooling with the so-called spectral power normalization, referred to as MaxExp(F). In contrast, we use a simple readout feature aggregation.

A.4 PROOFS OF CLAIMS I AND II

Claim I. By design, our filter gives the highest weight to the closest neighborhood of a node, as the neighborhoods $N$ of diffusion steps $k = 0, \ldots, K$ obey $N(\tilde{T}^0) \subseteq N(\tilde{T}^1) \subseteq \cdots \subseteq N(\tilde{T}^K) \subseteq N(\tilde{T}^\infty)$. That is, smaller neighborhoods belong to larger neighborhoods too. To see this more clearly, for the $q$-dimensional Euclidean lattice graph with an infinite number of nodes, after $t$ steps of a random walk, the estimate of the absolute distance the walk moves from the source to its current position is given as
$r(t, q) = \sqrt{\frac{2t}{q}} \cdot \frac{\Gamma\!\big(\frac{q+1}{2}\big)}{\Gamma\!\big(\frac{q}{2}\big)}$,
where $r(t, q)$ is the absolute distance walked from the source to the current point and $\Gamma(\cdot)$ is the Gamma function. Moreover, as the number of dimensions $q \to \infty$, we have $r(t, q) \leq \sqrt{t}$. It is clear then that the receptive field associated with the random walk (and thus with diffusion at time $t$) obeys a monotonically increasing radius $r$, that is, $r(0) \leq r(1) \leq \cdots \leq r(K) \leq \cdots \leq r(\infty)$; to see that, simply plot $\sqrt{t}$ (and/or the more complicated expression that includes the Gamma function). This proves Claim I for the Euclidean lattice graph.
That is, for consecutive diffusion steps T^k, k = 0, ..., K, our receptive field grows. Moreover, note that our filter is realized as the sum of consecutive diffusion steps, that is, (1/t) Σ_{τ=0}^{t} diff(s, τ), where s is the source of the walk. It is easy to see then that even if each walked distance were to contribute energy proportional to r(t) to the summation term, we have:

lim_{t→∞} √t / Σ_{t'=0}^{t} √t' = 0,

where the numerator models the energy of the largest receptive field, i.e., the SGC filter T^K given by diff(s, t), while the denominator models the total energy when aggregating over receptive fields of sizes 0 to t in S²GC. The above shows that this ratio of energies is 0, which means that: Claim II. As the ratio of energies of the two models is 0, the energy of the largest receptive field (when t → ∞) is not going to dominate the sum energy of our filter. Thus, S²GC can incorporate larger receptive fields than SGC without eclipsing the contributions from smaller receptive fields as t → ∞ on the Euclidean lattice graph. However, in practice, we work with finite non-Euclidean graphs, and obtaining the absolute distance r(t) walked from the source is a difficult problem; as an example, see Eq. 184 in Masuda et al. (2017). For this reason, below we use a simple approximation: we use Theorem 1 as a proxy for the walked radius, that is, the error of convergence to the stationary distribution is indicative of the absolute distance walked from the source node. Specifically, recall Theorem 1: let λ_2 denote the second largest eigenvalue of the transition matrix T = D^{-1}A, let p(t) be the probability distribution vector, and let π be the stationary distribution. If the walk starts from vertex i, i.e., p_i(0) = 1, then after t steps, for every vertex j:

|p_j(t) − π_j| ≤ √(d_j/d_i) · λ_2^t.
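Theorem 1 can be verified numerically on a toy graph. The NumPy sketch below uses an assumed triangle-plus-pendant graph (connected and non-bipartite), and conservatively takes the largest non-trivial eigenvalue in absolute value so the bound also covers the negative end of the spectrum:

```python
import numpy as np

# A small connected, non-bipartite graph: a triangle with a pendant node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(1)
T = A / d[:, None]            # random-walk transition matrix D^{-1} A
pi = d / d.sum()              # stationary distribution of the walk

# Spectrum of T via the similar symmetric matrix D^{-1/2} A D^{-1/2}.
S = A / np.sqrt(np.outer(d, d))
lams = np.sort(np.linalg.eigvalsh(S))     # ascending; lams[-1] == 1
lam = max(abs(lams[0]), abs(lams[-2]))    # dominant non-trivial eigenvalue

i = 0                          # start the walk at node i
p = np.eye(4)[i]
for t in range(1, 20):
    p = p @ T                  # p(t) = p(0) T^t
    bound = np.sqrt(d / d[i]) * lam ** t
    assert np.all(np.abs(p - pi) <= bound + 1e-12)
```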
Then the average walked distance from node i over t steps, in a graph with n nodes and connectivity given by the second largest eigenvalue λ_2, is approximated and lower-bounded as follows:

r(i, t, n) ≈ 1 / ( (1/(n−1)) Σ_{j≠i} |p_j(t) − π_j| ) ≥ (n−1) / ( λ_2^t Σ_{j≠i} √d_j/√d_i ) = (n−1)√d_i / ( λ_2^t (E − √d_i) ) = ρ / λ_2^t =: r_lb(i, t, n),

where n is the number of nodes, t is the number of diffusion steps (think T^k), d_i and d_j are the degrees of nodes i and j, λ_2 intuitively captures graph connectivity (λ_2 close to 1 indicates low connectivity while small λ_2 indicates high connectivity), E = Σ_j √d_j is the sum of square roots of node degrees, and ρ = (n−1)√d_i / (E − √d_i) is in fact a constant for a given graph and node i. While the above approximations may be loose for very small or large t, the important property to note is that r_lb(i, 0, n) ≤ r_lb(i, 1, n) ≤ ··· ≤ r_lb(i, t, n), which indicates that our filter indeed realizes a sum over increasingly larger receptive fields. As smaller receptive fields are subsets of larger receptive fields given node i, that is N(T^0) ⊆ N(T^1) ⊆ ··· ⊆ N(T^K) ⊆ N(T^∞), this proves our Claim I. To prove Claim II for a general connected non-bipartite graph, we have:

lim_{t→∞} (1/t) Σ_{t'=0}^{t} r_lb(i, t', n) / r_lb(i, t, n) = 0.

Similar findings can be obtained by carefully considering the meaning of the so-called Cheeger constant introduced in Section A.5. More details on the spectral analysis of filters in GCNs can be found in the studies of Li et al. (2018a) and Li et al. (2018b).

A.5 GRAPH PARTITIONING

Below we introduce the definitions of expansion and the k-way Cheeger constant. The expansion in Def. A.1 describes the effect of partitioning the graph according to a subset S, while the k-way Cheeger constant reflects the quality of partitioning the graph into k parts: the smaller the value, the better the partitioning.
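Both properties of the lower bound ρ/λ_2^t can be checked directly; the sketch below uses assumed toy values for the degree vector and λ_2 (any 0 < λ_2 < 1 behaves the same way):

```python
import numpy as np

def r_lb(i, t, d, lam2):
    """Lower bound rho / lam2**t on the average walked distance from node i
    after t diffusion steps, with rho = (n-1) sqrt(d_i) / (E - sqrt(d_i))."""
    n, E = len(d), np.sum(np.sqrt(d))
    rho = (n - 1) * np.sqrt(d[i]) / (E - np.sqrt(d[i]))
    return rho / lam2 ** t

d = np.array([2.0, 2.0, 3.0, 1.0])   # degrees of a toy 4-node graph
lam2 = 0.7                            # assumed second largest eigenvalue

# Claim I: the receptive-field radius grows monotonically with t.
radii = [r_lb(0, t, d, lam2) for t in range(30)]
assert all(a < b for a, b in zip(radii, radii[1:]))

# Claim II: the averaged ratio of energies shrinks as t grows
# ((t + 1) is used as the divisor to avoid division by zero at t = 0).
def ratio(t):
    total = sum(r_lb(0, tp, d, lam2) for tp in range(t + 1))
    return total / ((t + 1) * r_lb(0, t, d, lam2))

assert ratio(200) < ratio(20) < ratio(2)
```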
Higher-order Cheeger's inequality (Bandeira et al., 2013; Lee et al., 2014) bridges the gap between network spectral analysis and graph partitioning by bounding the k-way Cheeger constant:

λ_k / 2 ≤ ρ_G(k) ≤ O(k²) √λ_k,

where λ_k is the k-th smallest eigenvalue of the normalized Laplacian matrix and 0 = λ_1 ≤ λ_2 ≤ ··· ≤ λ_n. From this inequality, we can conclude that small eigenvalues control the global clustering effect of partitioning the graph into a few large parts, while large eigenvalues control the local smoothing effect of partitioning into many small parts. Thus, the specific combination of low- and high-pass filtering in our design (see Figure 1a) indicates a weight trade-off between the large and small partitions containing each node.



Figure 1: (a) Function f(λ) = (1/K) …

Definition A.1. For a node subset S ⊆ V, the expansion is φ(S) = |E(S)| / min{vol(S), vol(V \ S)}, where E(S) is the set of edges with exactly one endpoint in S and vol(S) is the sum of the degrees of the nodes in S.

Definition A.2. The k-way Cheeger constant is ρ_G(k) = min_{S_1, S_2, ..., S_k} max{φ(S_i) : i ∈ {1, ..., k}}, where the minimum is over all collections of k non-empty disjoint subsets S_1, S_2, ..., S_k ⊆ V.
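For k = 2, Def. A.2 essentially reduces to the classical Cheeger constant, whose inequality λ_2/2 ≤ h ≤ √(2λ_2) can be checked by brute force on a small assumed graph (two triangles joined by a bridge, chosen because it has an obvious two-way partition):

```python
import numpy as np
from itertools import combinations

def expansion(A, S):
    """phi(S) = |E(S)| / min(vol(S), vol(V \\ S)) from Def. A.1."""
    S = sorted(set(S))
    comp = [v for v in range(len(A)) if v not in S]
    cut = A[np.ix_(S, comp)].sum()          # edges with one endpoint in S
    vol = A.sum(1)
    return cut / min(vol[S].sum(), vol[comp].sum())

# Toy graph: two triangles joined by a single bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Brute-force Cheeger constant: best expansion over all non-trivial subsets.
h = min(expansion(A, S) for r in range(1, 6) for S in combinations(range(6), r))

# Cheeger's inequality with lambda_2 of the normalized Laplacian.
d = A.sum(1)
L = np.eye(6) - A / np.sqrt(np.outer(d, d))
lam2 = np.sort(np.linalg.eigvalsh(L))[1]
assert lam2 / 2 <= h + 1e-12
assert h <= np.sqrt(2 * lam2) + 1e-12
```

Here the best cut separates the two triangles (one crossing edge, equal volumes), so h = 1/7.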

Computational and storage complexities O(·).

Moreover, as APPNP assumes H_0 = f(X; W) = ReLU(XW) (or an MLP in place of ReLU), its optimizer has to backpropagate through f(X; W) to obtain ∂f/∂W and multiply this result with the above expansion, e.g., ∂H_2

Timing (seconds) on Cora, Citeseer, Pubmed and the large scale Open Graph Benchmark (OGB) which includes Products.

Clustering performance with three different metrics on four datasets.

Test Micro F1 Score (%) averaged over 10 runs on Reddit. Results of other models are taken from their papers.

Test accuracy on the document classification task.

Summary of classification accuracy (%) w.r.t. various depths. In the linear model, the filter parameter K is equivalent to the number of layers.

The statistics of datasets used for node classification and clustering.

3,357 documents are in the training set and 4,043 documents are in the test set.

The statistics of datasets for text classification.

and the window size is 20. We set the learning rate to 0.02, the dropout rate to 0.5, and the decay rate to 0. 10% of the training set is randomly selected for validation. Following Kipf & Welling (2016), we trained our method and Text GCN for a maximum of 200 epochs using the Adam optimizer (Kingma & Ba, 2014), and we stop training if the validation loss does not decrease for 10 consecutive epochs. The text graph was built according to the steps detailed in the supplementary material.

Graph classification.

Our design contains a sum of consecutive diffusion matrices T^k, k = 0, ..., K. As k increases, so does the neighborhood of each node visited during diffusion T^k (by analogy to random walks). This means that: Claim I. Our filter, by design, gives the highest weight to the closest neighborhood of a node, as the neighborhoods N of diffusion steps

ACKNOWLEDGMENTS

This research is supported by an Australian Government Research Training Program (RTP) Scholarship.

