SIMPLE SPECTRAL GRAPH CONVOLUTION

Abstract

Graph Convolutional Networks (GCNs) are leading methods for learning graph representations. However, without specially designed architectures, the performance of GCNs degrades quickly with increased depth. As the aggregated neighborhood size and neural network depth are two completely orthogonal aspects of graph representation, several methods focus on summarizing the neighborhood by aggregating K-hop neighborhoods of nodes while using shallow neural networks. However, these methods still encounter oversmoothing, and suffer from high computation and storage costs. In this paper, we use a modified Markov Diffusion Kernel to derive a variant of GCN called Simple Spectral Graph Convolution (S²GC). Our spectral analysis shows that the simple spectral graph convolution used in S²GC is a trade-off between low- and high-pass filter bands, which capture the global and local contexts of each node. We provide two theoretical claims which demonstrate that we can aggregate over a sequence of increasingly larger neighborhoods compared to competitors while limiting severe oversmoothing. Our experimental evaluations show that S²GC with a linear learner is competitive in text and node classification tasks. Moreover, S²GC is comparable to other state-of-the-art methods for node clustering and community prediction tasks.

1. INTRODUCTION

In the past decade, deep learning has become mainstream in computer vision and machine learning. Although deep learning has been applied with great success to feature extraction on the Euclidean lattice (Euclidean grid-structured data), the data in many practical scenarios lie on non-Euclidean structures, whose processing poses a challenge for deep learning. By defining a convolution operator between the graph and the signal, Graph Convolutional Networks (GCNs) generalize Convolutional Neural Networks (CNNs) to graph-structured inputs which contain attributes. Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017) unify the graph convolution as two functions: the transformation function and the aggregation function. An MPNN iteratively propagates node features based on the adjacency of the graph over a number of rounds. Despite their enormous success in many applications such as social media, traffic analysis, biology, recommendation systems and even computer vision, many current GCN models use fairly shallow settings: recent models such as GCN (Kipf & Welling, 2016) achieve their best performance with 2 layers. In other words, 2-layer GCN models aggregate nodes in the two-hop neighborhood and thus cannot extract information from K-hop neighborhoods for K > 2. Moreover, stacking more layers and adding non-linearities tends to degrade the performance of these models. This phenomenon is called oversmoothing (Li et al., 2018a): as the number of layers increases, the representations of the nodes in GCNs tend to converge to similar values that are indistinguishable from one another. Even adding residual connections, an effective trick for training very deep CNNs, merely slows down oversmoothing (Kipf & Welling, 2016) in GCNs. It appears that deep GCN models gain nothing but performance degradation from the deep architecture.
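The convergence of node representations under repeated propagation can be illustrated with a small numerical sketch. It is not a GCN: it assumes a row-normalized propagation matrix with self-loops, no learned weights and no non-linearities, and it uses a hypothetical 6-node path graph with random features. The names `row_normalized_adjacency` and `feature_range` are illustrative only.

```python
import numpy as np


def row_normalized_adjacency(adj):
    """Return D^{-1}(A + I): adjacency with self-loops, each row summing to 1."""
    a_tilde = adj + np.eye(adj.shape[0])
    return a_tilde / a_tilde.sum(axis=1, keepdims=True)


def feature_range(features):
    """Largest per-feature gap (max minus min) across all nodes."""
    return float((features.max(axis=0) - features.min(axis=0)).max())


# Hypothetical 6-node path graph 0-1-2-3-4-5 (an assumption for illustration).
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(n, 3))  # random 3-dimensional node features
p = row_normalized_adjacency(adj)

# Propagate repeatedly and record how far apart the node features remain.
spreads = {}
h = x.copy()
for step in range(1, 201):
    h = p @ h
    if step in (2, 20, 200):
        spreads[step] = feature_range(h)

for step, value in spreads.items():
    print(f"after {step:3d} propagation steps: feature range = {value:.3e}")
```

Because every row of the propagation matrix is a convex combination, the per-feature range across nodes can never grow from one step to the next, and on a connected graph with self-loops it shrinks towards zero. In this simplified setting, that shrinkage is the loss of node distinctiveness described above.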
One remedy for this degradation is to widen the receptive field of the aggregation function while limiting the depth of the network, because the required neighborhood size and the neural network depth can be regarded as two separate aspects of design. To this end, SGC (Wu et al., 2019) captures the context from K-hop neighbors in the graph by applying the K-th power of the normalized adjacency matrix in a single neural network layer. This scheme is also used for attributed graph clustering (Zhang et al., 2019). However, SGC also suffers from oversmoothing as $K \to \infty$, as shown in Theorem 1. PPNP and APPNP (Klicpera et al., 2019a) replace the power of the normalized adjacency matrix with the Personalized PageRank matrix to solve the oversmoothing problem. Although APPNP relieves the oversmoothing problem, it employs a non-linear operation, which requires costly computation of the derivative of the filter because the non-linearity is applied to the product of the feature matrix and the learnable weights. In contrast, we show that our approach enjoys a free derivative computed in the feed-forward step due to the use of a linear model. Furthermore, APPNP aggregates over multiple k-hop neighborhoods ($k = 0, \cdots, K$), but its weighting scheme favors either the global or the local context, making it difficult, if not impossible, to find a good value of the balancing parameter. In contrast, our approach aggregates over k-hop neighborhoods in a well-balanced manner. GDC (Klicpera et al., 2019b) further extends APPNP by generalizing Personalized PageRank (Page et al., 1999) to an arbitrary graph diffusion process. GDC has more expressive power than SGC (Wu et al., 2019), PPNP and APPNP (Klicpera et al., 2019a), but it leads to a dense transition matrix, which makes the computation and storage intractable for large graphs, although the authors suggest that a shrinkage method can be used to sparsify the generated transition matrix. Also noteworthy are the orthogonal research directions of Sun et al. (2019), Koniusz & Zhang (2020) and Elinas et al. (2020), which improve the performance of GCNs by perturbation of the graph, high-order aggregation of features, and variational inference, respectively.
To tackle the above issues, we propose a Simple Spectral Graph Convolution (S²GC) network for node clustering and node classification in semi-supervised and unsupervised settings. By analyzing the Markov Diffusion Kernel (Fouss et al., 2012), we obtain a very simple and effective spectral filter: we aggregate k-step diffusion matrices over $k = 0, \cdots, K$ steps, which is equivalent to aggregating over neighborhoods of gradually increasing sizes. Moreover, we show that our design incorporates larger neighborhoods compared to SGC and copes better with oversmoothing. We explain that limiting the overdominance of the largest neighborhoods in the aggregation step limits oversmoothing while preserving the large context of each node. We also show via spectral analysis that S²GC is a trade-off between the low- and high-pass filter bands, which leads to capturing the global and local contexts of each node. Moreover, we show how S²GC and APPNP (Klicpera et al., 2019a) are related and explain why S²GC captures a range of neighborhoods better than APPNP. Our experimental results include node clustering, unsupervised and semi-supervised node classification, node property prediction and supervised text classification. We show that S²GC is highly competitive, often significantly outperforming state-of-the-art methods.
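The difference between applying a single K-th power of the normalized adjacency matrix and aggregating over all k-step diffusion matrices can be sketched as follows. This is only an illustration of the idea under stated assumptions: it uses the symmetrically normalized adjacency matrix with self-loops, and it weights the steps $k = 0, \cdots, K$ uniformly, which is an assumption of this sketch rather than the exact filter derived later in the paper. The function names are hypothetical.

```python
import numpy as np


def sym_normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]


def kth_power_features(t, x, k_max):
    """Single K-th power propagation: T^K X."""
    h = np.asarray(x, dtype=float)
    for _ in range(k_max):
        h = t @ h
    return h


def averaged_multi_hop_features(t, x, k_max):
    """Uniform average of T^k X over k = 0..K (assumed weighting)."""
    h = np.asarray(x, dtype=float)
    total = h.copy()
    for _ in range(k_max):
        h = t @ h
        total += h
    return total / (k_max + 1)


# Hypothetical 3-node path graph with one-hot node features.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
x = np.eye(3)
t = sym_normalized_adjacency(adj)

print(kth_power_features(t, x, 2))           # only the 2-step diffusion
print(averaged_multi_hop_features(t, x, 2))  # mixes the 0-, 1- and 2-step diffusions
```

Both variants are linear in the features, so they can be precomputed once and fed to a linear classifier. The averaged variant keeps the $k = 0$ term, so each node's own features still contribute however large $K$ is chosen.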

2. PRELIMINARIES

Notations. Let $G = (V, E)$ be a simple and connected undirected graph with $n$ nodes and $m$ edges. We use $\{1, \cdots, n\}$ to denote the node indices of $G$, whereas $d_j$ denotes the degree of node $j$ in $G$. Let $A$ be the adjacency matrix and $D$ be the diagonal degree matrix. Let $\tilde{A} = A + I_n$ denote the adjacency matrix with added self-loops and $\tilde{D}$ the corresponding diagonal degree matrix, where $I_n \in \mathbb{S}^n_{++}$ is an identity matrix. Finally, let $X \in \mathbb{R}^{n \times d}$ denote the node feature matrix, where each node $v$ is associated with a $d$-dimensional feature vector $X_v$. The normalized graph Laplacian matrix is defined as $L = I_n - D^{-1/2} A D^{-1/2} \in \mathbb{S}^n_{+}$, that is, a symmetric positive semidefinite matrix with eigendecomposition $U \Lambda U^\top$, where $\Lambda$ is a diagonal matrix with the eigenvalues of $L$, and $U \in \mathbb{R}^{n \times n}$ is a unitary matrix that consists of the eigenvectors of $L$.
Spectral Graph Convolution (Defferrard et al., 2016). We consider spectral convolutions on graphs defined as the multiplication of a signal $x \in \mathbb{R}^n$ with a filter $g_\theta$ parameterized by $\theta \in \mathbb{R}^n$ in the Fourier domain: $g_\theta(L) * x = U g_\theta^*(\Lambda) U^\top x$, where the parameter $\theta \in \mathbb{R}^n$ is a vector of spectral filter coefficients. One can understand $g_\theta$ as a function operating on the eigenvalues of $L$, that is, $g_\theta^*(\Lambda)$. To avoid eigendecomposition, $g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(\Lambda)$ up to the K-th

