GENERALIZING GRAPH CONVOLUTIONAL NETWORKS VIA HEAT KERNEL

Abstract

Graph convolutional networks (GCNs) have emerged as a powerful framework for mining and learning with graphs. A recent study shows that GCNs can be simplified into a linear model by removing the nonlinearities and weight matrices across all consecutive layers, resulting in the simple graph convolution (SGC) model. In this paper, we aim to understand GCNs and generalize SGC into a linear model via the heat kernel (HKGCN), which acts as a low-pass filter on graphs and enables the aggregation of information from extremely large receptive fields. We theoretically show that HKGCN is in essence a continuous propagation model and that GCNs without nonlinearities (i.e., SGC) are discrete versions of it. Its low-pass-filter and continuity properties facilitate the fast and smooth convergence of feature propagation. Experiments on million-scale networks show that the linear HKGCN model not only achieves consistently better results than SGC but can also match or even beat advanced GCN models, while maintaining SGC's superior efficiency.

1. INTRODUCTION

Graph neural networks (GNNs) have emerged as a powerful framework for modeling structured and relational data (Gori et al., 2005; Scarselli et al., 2008; Gilmer et al., 2017; Kipf & Welling, 2017). A wide range of graph mining tasks and applications have benefited from their recent development, such as node classification (Kipf & Welling, 2017; Veličković et al., 2018), link inference (Zhang & Chen, 2018; Ying et al., 2018), and graph classification (Xu et al., 2019b). The core procedure of GNNs is the (discrete) feature propagation operation, which propagates information between nodes layer by layer based on rules derived from the graph structure. Take the graph convolutional network (GCN) (Kipf & Welling, 2017) for example: its propagation is performed through the normalized Laplacian of the input graph. Such a procedure usually involves 1) a non-linear feature transformation, commonly operated by an activation function such as ReLU, and 2) discrete propagation layer by layer. Over the course of its development, various efforts have been devoted to advancing this propagation-based architecture, such as incorporating self-attention in GAT (Veličković et al., 2018), mixing high-order neighborhoods in MixHop (Abu-El-Haija et al., 2019), and leveraging graphical models in GMNN (Qu et al., 2019). Recently, Wu et al. (2019) observed that the non-linear part of GCNs' feature propagation is actually associated with excess complexity and redundant operations. To that end, they simplify GCNs into a linear model, SGC, by removing all non-linearities between consecutive GCN layers. Surprisingly, SGC offers comparable or even better performance than advanced GCN models, based on which they argue that the repeated graph propagation, rather than the non-linear feature transformation, may contribute the most to the expressive power of GCNs.
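As a minimal sketch of this linearization, the K-layer GCN without non-linearities collapses into a single precomputed propagation S^K X followed by one linear classifier, which is the essence of SGC. The snippet below assumes the standard GCN normalization with self-loops, S = D^{-1/2}(A + I)D^{-1/2}; the function name and toy graph are ours for illustration only.

```python
import numpy as np

def sgc_features(A, X, K):
    """Precompute S^K X, the K-step linear propagation used by SGC.

    A: (n, n) adjacency matrix, X: (n, d) node features, K: number of hops.
    Uses the GCN-style normalization S = D^{-1/2} (A + I) D^{-1/2}.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                  # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    H = X
    for _ in range(K):                       # repeated propagation, no ReLU
        H = S @ H
    return H

# toy 3-node path graph with 1-d features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1.], [0.], [2.]])
H = sgc_features(A, X, K=2)
```

Because S^K X does not depend on any trainable parameters, it can be computed once as a preprocessing step, which is where SGC's efficiency advantage comes from.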
Though it generates interesting results, SGC still inherits the discrete nature of GCNs' propagation, which can lead to strong oscillations during the procedure. Take, for example, a simple graph of two nodes v_1 and v_2 with one-dimensional input features x_1 = 1 and x_2 = 2 and one weighted edge between them. The feature updates of x_1 and x_2 during GCN propagation are shown in Figure 1 (a), from which we can clearly observe the step-by-step oscillations of x_1 and x_2. This indicates that although features from multiple hops away may seem to be taken into consideration during GCN propagation, the model is still far from learning patterns from them. In this work, we aim to generalize GCNs into a continuous and linear propagation model, which is referred to as HKGCN. We draw inspiration from Newton's law of cooling by assuming that graph feature propagation follows a similar process. Straightforwardly, this leads us to leverage the heat kernel for feature propagation in HKGCN. Theoretically, we show that the propagation matrix of GCNs is equivalent to the finite-difference version of the heat kernel. In other words, using the heat kernel as the propagation matrix leads to smooth feature convergence. For the same example above, we show that the heat kernel based propagation in HKGCN prevents oscillations, as illustrated in Figure 1 (b). Finally, from the graph spectral perspective, the heat kernel acts as a low-pass filter whose cutoff frequency can be adjusted by changing the propagation time. Empirically, we demonstrate the performance of HKGCN for both transductive and inductive semi-supervised node classification tasks. The experiments are conducted on both traditional GNN datasets, such as Cora, CiteSeer, Pubmed, and Reddit, and the latest graph benchmarks indexed by the Open Graph Benchmark (Hu et al., 2020).
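The two-node example can be reproduced numerically. The sketch below is illustrative only: it assumes discrete propagation through the normalized adjacency without self-loops (so the oscillation is visible), and computes the continuous heat-kernel propagation x(t) = e^{-tL} x(0) by eigendecomposition of the normalized Laplacian L = I - S.

```python
import numpy as np

# Two-node graph with one edge; 1-d features x1 = 1, x2 = 2.
S = np.array([[0., 1.],
              [1., 0.]])           # D^{-1/2} A D^{-1/2} for this graph
L = np.eye(2) - S                  # normalized Laplacian
x0 = np.array([1., 2.])

# Discrete, layer-by-layer propagation x_{k+1} = S x_k: the two features
# swap at every step, i.e., they oscillate and never converge.
discrete = [x0]
for _ in range(4):
    discrete.append(S @ discrete[-1])

# Continuous heat-kernel propagation x(t) = e^{-tL} x0: the features decay
# smoothly and monotonically toward the common average 1.5.
w, V = np.linalg.eigh(L)           # L = V diag(w) V^T, symmetric
def heat(t):
    return V @ np.diag(np.exp(-t * w)) @ V.T @ x0

continuous = [heat(t) for t in (0.0, 0.5, 1.0, 2.0)]
```

For this graph L has eigenvalues 0 and 2, so x(t) = (1.5 - 0.5 e^{-2t}, 1.5 + 0.5 e^{-2t}): the gap between the two features shrinks exponentially instead of flipping sign at each step, matching the contrast between Figure 1 (a) and (b).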
The results suggest that the simple, linear HKGCN model consistently outperforms SGC on all six datasets and matches or even beats the performance of advanced graph neural networks on both tasks, while maintaining the order-of-magnitude efficiency advantage inherited from SGC.

2. RELATED WORK

Graph Neural Networks. Graph neural networks (GNNs) have emerged as a new paradigm for graph mining and learning, and significant progress has been made in recent years. Notably, the spectral graph convolutional network (Bruna et al., 2013) is among the first to directly use back-propagation to learn the filter kernel, but it suffers from high time complexity. Another line of work shows how to use Chebyshev polynomial approximation to quickly compute the filter kernel (Hammond et al., 2011). Attempts to further this direction leverage Chebyshev expansion to achieve the same linear computational complexity as classical CNNs (Defferrard et al., 2016). Later, the graph convolutional network (GCN) (Kipf & Welling, 2017) simplifies the filter kernel to a first-order Chebyshev approximation, inspiring various advancements in GNNs. GAT brings attention mechanisms into graph neural networks (Veličković et al., 2018). GMNN combines the benefits of statistical relational learning and GNNs into a unified framework (Qu et al., 2019). To enable fast and scalable GNN training, FastGCN interprets graph convolutions as integral transforms of features and thus uses Monte Carlo methods to simulate the feature propagation step (Chen et al., 2018). GraphSage treats feature propagation as aggregation from (sampled) neighborhoods (Hamilton et al., 2017). LADIES (Zou et al., 2019) further introduces a layer-dependent importance sampling technique for efficient training. Recently, there have also been research efforts devoted to the theoretical or deeper understanding of GCNs (Xu et al., 2019b; Battaglia et al., 2018). For example, the feature propagation in GNNs can also be explained as neural message passing (Gilmer et al., 2017). In addition, studies find that the performance of GNNs decreases as more layers are added, known as the over-smoothing issue (Li et al., 2018; Zhao & Akoglu, 2020).
To reduce GCNs' complexity, SGC turns the GCN model into a linear model by removing the non-linear activation operations between consecutive GCN layers (Wu et al., 2019), producing promising results in terms of both efficacy and efficiency.

Heat Kernel. The properties of the heat kernel for graphs are reviewed in detail by Chung (Chung & Graham, 1997). Recently, the heat kernel has frequently been used as a feature propagation modulator. In (Kondor & Lafferty, 2002), the authors show that the heat kernel can be regarded as the discretization of the familiar Gaussian kernel of Euclidean space. Additionally, the heat kernel is often used as the window function for the windowed graph Fourier transform (Shuman et al., 2016). In (Zhang et al., 2019), a second-order heat kernel is used as the band-pass filter kernel to amplify local and global structural information for network representation learning.
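The filtering behavior discussed above can be made concrete from the spectral perspective. A purely illustrative sketch: on the normalized Laplacian's eigenvalues (graph frequencies) lambda in [0, 2], the heat kernel applies the frequency response h_t(lambda) = exp(-t * lambda), so increasing the propagation time t suppresses high frequencies more aggressively, i.e., lowers the effective cutoff.

```python
import numpy as np

# Sample graph frequencies on the spectrum [0, 2] of a normalized Laplacian.
lams = np.linspace(0.0, 2.0, 5)

def response(t):
    """Heat-kernel frequency response h_t(lambda) = exp(-t * lambda)."""
    return np.exp(-t * lams)

r_small_t = response(0.5)   # mild low-pass: high frequencies partly kept
r_large_t = response(2.0)   # aggressive low-pass: high frequencies crushed
```

The response always equals 1 at lambda = 0 (the constant signal passes through untouched) and decays monotonically in lambda, which is what makes the heat kernel a low-pass rather than band-pass filter.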



Figure 1: Feature propagation under GCN and HKGCN.

