GENERALIZING GRAPH CONVOLUTIONAL NETWORKS VIA HEAT KERNEL

Abstract

Graph convolutional networks (GCNs) have emerged as a powerful framework for mining and learning with graphs. A recent study shows that GCNs can be simplified into a linear model by removing the nonlinearities and weight matrices across all consecutive layers, resulting in the simple graph convolution (SGC) model. In this paper, we aim to understand GCNs and generalize SGC as a linear model via the heat kernel (HKGCN), which acts as a low-pass filter on graphs and enables the aggregation of information from extremely large receptive fields. We theoretically show that HKGCN is in nature a continuous propagation model and that GCNs without nonlinearities (i.e., SGC) are discrete versions of it. Its low-pass filtering and continuity properties facilitate the fast and smooth convergence of feature propagation. Experiments on million-scale networks show that the linear HKGCN model not only achieves consistently better results than SGC but also can match or even beat advanced GCN models, while maintaining SGC's superiority in efficiency.



In this work, we aim to generalize GCNs into a continuous and linear propagation model, which we refer to as HKGCN. We draw inspiration from Newton's law of cooling by assuming that graph feature propagation follows a similar process. Straightforwardly, this leads us to leverage the heat kernel for feature propagation in HKGCN. Theoretically, we show that the propagation matrix of GCNs is equivalent to the finite-difference version of the heat kernel. In other words, using the heat kernel as the propagation matrix leads to smooth feature convergence. In the same example above, we show that the heat kernel based propagation in HKGCN can prevent oscillations, as illustrated in Figure 1 (b). Finally, from the graph spectral perspective, the heat kernel acts as a low-pass filter, and its cutoff frequency can be adjusted by changing the propagation time. Empirically, we demonstrate the performance of HKGCN for both transductive and inductive semi-supervised node classification tasks. The experiments are conducted on both traditional GNN datasets, such as Cora, CiteSeer, Pubmed, and Reddit, and the latest graph benchmarks indexed by the Open Graph Benchmark (Hu et al., 2020). The results suggest that the simple and linear HKGCN model consistently outperforms SGC on all six datasets and matches or even beats the performance of advanced graph neural networks on both tasks, while maintaining the order-of-magnitude efficiency superiority inherited from SGC.

2. RELATED WORK

Graph Neural Networks. Graph neural networks (GNNs) have emerged as a new paradigm for graph mining and learning, with significant progress made in recent years. Notably, the spectral graph convolutional network (Bruna et al., 2013) is among the first to directly use back propagation to learn the kernel filter, but it has the shortcoming of high time complexity. Another work shows how to use Chebyshev polynomial approximation to quickly compute the filter kernel (Hammond et al., 2011). Attempts to further this direction leverage the Chebyshev expansion to achieve the same linear computational complexity as classical CNNs (Defferrard et al., 2016). Later, the graph convolutional network (GCN) (Kipf & Welling, 2017) simplifies the filter kernel to the second order of the Chebyshev expansion, inspiring various advancements in GNNs. GAT brings attention mechanisms into graph neural networks (Veličković et al., 2018). GMNN combines the benefits of statistical relational learning and GNNs in a unified framework (Qu et al., 2019). To enable fast and scalable GNN training, FastGCN interprets graph convolutions as integral transforms of features and thus uses a Monte Carlo method to simulate the feature propagation step (Chen et al., 2018). GraphSage treats feature propagation as aggregation from (sampled) neighborhoods (Hamilton et al., 2017). LADIES (Zou et al., 2019) further introduces a layer-dependent importance sampling technique for efficient training. Recently, there have also been research efforts devoted to the theoretical and deep understanding of GCNs (Xu et al., 2019b; Battaglia et al., 2018). For example, the feature propagation in GNNs can also be explained as neural message passing (Gilmer et al., 2017). In addition, studies find that the performance of GNNs decreases as more and more layers are added, known as the over-smoothing issue (Li et al., 2018; Zhao & Akoglu, 2020).
To reduce GCNs' complexity, SGC turns the GCN model into a linear model by removing the nonlinear activation operations between consecutive GCN layers (Wu et al., 2019), producing promising results in terms of both efficacy and efficiency.

Heat Kernel. The properties of the heat kernel for graphs are reviewed in detail by Chung in (Chung & Graham, 1997). Recently, the heat kernel has frequently been used as a feature propagation modulator. In (Kondor & Lafferty, 2002), the authors show that the heat kernel can be regarded as the discretization of the familiar Gaussian kernel of Euclidean space. Additionally, the heat kernel is often used as the window function for the windowed graph Fourier transform (Shuman et al., 2016). In (Zhang et al., 2019), the second-order heat kernel is used as a band-pass filter kernel to amplify local and global structural information for network representation learning.

Concurrent work. Several recent works have developed similar ideas. (Poli et al., 2020; Zhuang et al., 2020) use the Neural ODE framework and parametrize the derivative function directly with a two- or three-layer GNN. (Xhonneux et al., 2020) improve on this by developing a continuous message-passing layer. All of these ODE models make features converge to a stable point by adding residual connections. In contrast, our model outputs an intermediate feature state, which balances local and global features. Some recent works (Xu et al., 2019a; Klicpera et al., 2019) propose to leverage the heat kernel to enhance low-frequency filters and enforce smooth feature propagation. However, they do not identify the relationship between the feature propagation of GCNs and the heat kernel.

3. GENERALIZING (SIMPLE) GRAPH CONVOLUTION VIA HEAT KERNEL

3.1. PROBLEM AND BACKGROUND

We focus on the problem of semi-supervised node classification on graphs, the same setting as GCN (Kipf & Welling, 2017). Without loss of generality, the input to this problem is an undirected network $G = (V, E)$, where $V$ denotes the node set of $n$ nodes $\{v_1, \dots, v_n\}$ and $E$ represents the edge set. The symmetric adjacency matrix of $G$ is defined as $A$ and its diagonal degree matrix as $D$ with $D_{ii} = \sum_j A_{ij}$. Each node $v_i \in V$ is associated with a feature vector $x_i$ (the $i$-th row of $X \in \mathbb{R}^{n \times d}$) and a one-hot label vector $y_i$ (the $i$-th row of $Y \in \{0, 1\}^{n \times C}$), where $C$ is the number of classes. The problem of semi-supervised graph learning is, given the labels $Y_L$ of a subset of nodes $V_L$, to infer the labels $Y_U$ of the remaining nodes $V_U = V \setminus V_L$.

Graph Convolutional Networks. Given the input graph $G = (V, E)$ with $A$, $D$, $X$, and $Y_L$, GCN can be understood as feature propagation over the graph structure. Specifically, it follows the propagation rule

$H^{(l+1)} = \sigma(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)})$, (1)

where $\tilde{A} = A + I_N$ is the adjacency matrix with added self-connections ($I_N$ is the identity matrix), $W^{(l)}$ is a trainable weight matrix in the $l$-th layer, $\sigma(\cdot)$ is a nonlinear function such as ReLU, and $H^{(l)}$ denotes the hidden node representations in the $l$-th layer, with $H^{(0)} = X$. The essence of GCN is that each GCN layer is equivalent to the first-order Chebyshev expansion of the spectral convolution (Kipf & Welling, 2017). It also assumes that the first-order coefficient $a_1$ equals the 0-th order coefficient $a_0$ multiplied by $-1$, i.e., $a_1 = -a_0$. We will later prove that this is simply a discrete solution of the heat equation.

Simple Graph Convolution. Since its inception, GCNs have drawn tremendous attention from researchers (Chen et al., 2018; Veličković et al., 2018; Qu et al., 2019).
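As an illustration, a single step of the normalized propagation in Eq. 1 can be sketched in NumPy (a toy graph and toy features of our own choosing, not the paper's code; the weight matrix and nonlinearity are omitted to isolate the propagation):

```python
import numpy as np

# Toy sketch of one linear GCN propagation step
# H' = D̃^{-1/2} Ã D̃^{-1/2} H (weights and nonlinearity omitted).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # a 3-node path graph
A_tilde = A + np.eye(3)               # add self-connections
d = A_tilde.sum(axis=1)
S = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)
H = np.array([[1.], [0.], [2.]])      # one-dimensional node features
H_next = S @ H                        # one propagation step
```

Note that $S$ is symmetric with largest eigenvalue exactly 1, which is why repeated propagation mixes features without blowing up.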
A recent study shows that GCNs can be simplified into the Simple Graph Convolution (SGC) model by simply removing the nonlinearities between GCN layers (Wu et al., 2019). Specifically, SGC is a linear model with the propagation rule

$\hat{Y} = \mathrm{softmax}((\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}})^K X W)$. (2)

Surprisingly, the linear SGC model yields comparable prediction accuracy to sophisticated GCN models in various downstream tasks, with significant advantages in efficiency and scalability due to its simplicity.

Heat Equation and Heat Kernel. The heat equation, as a special case of the diffusion equation, describes how heat distributes and flows over time (Widder & Vernon, 1976). Imagine a graph in which each node has a temperature, heat energy can only transfer along the edges between connected nodes, and the heat propagation on this graph follows Newton's law of cooling. The heat flow between nodes $v_i$ and $v_j$ is then proportional to 1) the edge weight and 2) the temperature difference between $v_i$ and $v_j$. Let $x_i^{(t)}$ denote the temperature of $v_i$ at time $t$; the heat diffusion on graph $G$ can be described by the heat equation

$\frac{dx_i^{(t)}}{dt} = -k \sum_j A_{ij} (x_i^{(t)} - x_j^{(t)}) = -k \big[ D_{ii} x_i^{(t)} - \sum_j A_{ij} x_j^{(t)} \big]$. (3)

In matrix form, this is $\frac{dX^{(t)}}{dt} = -k L X^{(t)}$, where $L = D - A$ is the graph Laplacian matrix. By reparameterizing $t$ and $k$ into a single term $t' = kt$, the equation can be rewritten as

$\frac{dX^{(t')}}{dt'} = -L X^{(t')}$. (4)

A heat kernel is the fundamental solution of the heat equation (Chung & Graham, 1997). The heat kernel $H_t$ is defined as the $n \times n$ matrix

$H_t = e^{-Lt}$. (5)

Given the initial state $X^{(0)} = X$, the solution to the heat equation in Eq. 4 can be written as

$X^{(t)} = H_t X$. (6)

Naturally, the heat kernel can be used as the feature propagation matrix in GCNs.
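The closed-form solution in Eq. 6 can be checked directly on a toy graph (a hedged sketch assuming NumPy/SciPy; `scipy.linalg.expm` computes the matrix exponential). Two properties of heat diffusion under the unnormalized Laplacian are visible: the total "heat" is conserved, and features converge to the mean as $t$ grows:

```python
import numpy as np
from scipy.linalg import expm

# Sketch: solve dX/dt = -L X in closed form as X^(t) = e^{-Lt} X (Eqs. 4-6),
# with the unnormalized Laplacian L = D - A on a toy 3-node star graph.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
L = np.diag(A.sum(axis=1)) - A

def heat_propagate(L, X, t):
    """X^(t) = H_t X with heat kernel H_t = e^{-Lt}."""
    return expm(-L * t) @ X

X0 = np.array([[0.], [3.], [6.]])
X_t = heat_propagate(L, X0, t=5.0)
# Total heat is conserved; X^(t) approaches the mean of X0 as t grows.
```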

3.2. CONNECTING GCN AND SGC TO HEAT KERNEL

GCN's feature propagation follows $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, through which node features diffuse over the graph. Note that the feature propagation in GCN takes only one step per layer, hindering individual nodes from learning global information. By analogy with the heat diffusion process on graphs, the heat kernel solution can be naturally generalized to the feature propagation in graph convolution. Instead of using $L$, we follow GCN and use the symmetric normalized Laplacian $\tilde{L} = I - \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. According to SGC (Wu et al., 2019), this conversion also serves as a low-pass-type filter in the graph spectral domain. Eq. 4 then becomes $\frac{dX^{(t)}}{dt} = -\tilde{L} X^{(t)}$ and the heat kernel in Eq. 5 becomes $H_t = e^{-\tilde{L}t}$. The finite difference of this heat equation can be written as

$\frac{X^{(t+\Delta t)} - X^{(t)}}{\Delta t} = -\tilde{L} X^{(t)}$. (7)

If we set $\Delta t = 1$, we have $X^{(t+1)} = X^{(t)} - \tilde{L} X^{(t)} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(t)}$. This is the same feature propagation rule as in Eq. 1. In other words, each layer of GCN's feature propagation on graphs is equal to one finite-difference step of the heat kernel. A multilayer GCN without activation functions can be written as SGC (Wu et al., 2019): $\hat{Y} = \mathrm{softmax}((\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}})^K X W)$, where $\hat{Y}$ is the classification result and $W$ is the merged weight matrix. Using the finite difference of the heat equation above, it can be rewritten as $\hat{Y} = \mathrm{softmax}(X^{(K)} W)$, which is still a multistep finite-difference approximation of the heat kernel.

Reduce $\Delta t$. Assuming $t$ is divisible by $\Delta t$, the number of iterations is $n_i = \frac{t}{\Delta t}$, making Eq. 7 become $X^{(t)} = (I - \Delta t \tilde{L})^{n_i} X$. By fixing $t = 1$, we have

$X^{(1)} = \Big(I - \frac{1}{n_i} \tilde{L}\Big)^{n_i} X = \Big(I - \frac{n_i}{n_i} \tilde{L} + \frac{n_i(n_i - 1)}{n_i^2} \frac{\tilde{L}^2}{2!} + \cdots + (-1)^{n_i} \frac{n_i!}{n_i^{n_i}} \frac{\tilde{L}^{n_i}}{n_i!}\Big) X. (9)

Therefore, $n_i$ iterations with $t = 1$ approximate the Taylor expansion of the heat kernel up to order $n_i$. GCN can thus also be seen as the first-order Taylor expansion of the heat kernel.
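To make the correspondence concrete, the following NumPy/SciPy sketch (toy graph of our own construction) shows that one Euler step with $\Delta t = 1$ in Eq. 7 is exactly one GCN propagation layer, while shrinking $\Delta t$ approaches the heat kernel as in Eq. 9:

```python
import numpy as np
from scipy.linalg import expm

# Sketch: Euler/finite-difference steps X <- (I - Δt·L̃)X versus the
# exact heat kernel e^{-L̃t}X, on a toy 2-node graph.
A = np.array([[0., 1.], [1., 0.]])
A_tilde = A + np.eye(2)
d = A_tilde.sum(axis=1)
S = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)
L_tilde = np.eye(2) - S               # symmetric normalized Laplacian
X = np.array([[1.], [2.]])

def finite_diff(L, X, t, n_steps):
    dt = t / n_steps
    for _ in range(n_steps):
        X = X - dt * (L @ X)          # Eq. 7; dt = 1 is one GCN layer
    return X

exact = expm(-L_tilde) @ X                              # heat kernel, t = 1
coarse = finite_diff(L_tilde, X, t=1.0, n_steps=1)      # Δt = 1
fine = finite_diff(L_tilde, X, t=1.0, n_steps=1000)     # Δt = 0.001
```

With one step the result equals $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X$ exactly; with many small steps the error against the matrix exponential shrinks.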

3.3. HEAT KERNEL AS FEATURE PROPAGATION

We have shown that the feature propagation in GCN is a multistep finite-difference approximation of the heat kernel. Next, we briefly illustrate the advantage of the continuous formulation.

A case study. We illustrate how features are updated during GCN and heat kernel propagation. Consider a graph of two nodes $v_1$ and $v_2$ with one-dimensional input features $x_1 = 1$ and $x_2 = 2$ and one weighted edge $A_{12} = A_{21} = 5$ between them. Recall that with $\Delta t = 1$, GCN is equivalent to the finite difference of the heat kernel; thus, we set $\Delta t = 1$ for the heat kernel as well. The updates of $x_1$ and $x_2$ as the propagation step $t$ increases are shown in Figure 1. We can observe that the heat kernel yields much smoother and faster convergence than GCN. In GCN, the discrete layer-by-layer propagation $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ causes node features to keep oscillating around the convergence point. The reason lies in GCN's requirement that $\Delta t = 1$, which is too large for smooth convergence. Straightforwardly, the oscillating nature of GCN's feature propagation makes it sensitive to hyper-parameters and yields weak performance on large graphs. A theoretical analysis is given in Section 3.5.
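The two-node case study can be reproduced numerically (a sketch assuming NumPy/SciPy; Figure 1 plots the same quantities). The GCN trajectory overshoots the fixed point 1.5 and oscillates around it, while the heat-kernel trajectory rises monotonically toward it:

```python
import numpy as np
from scipy.linalg import expm

# Sketch of the case study: x1 = 1, x2 = 2, one weighted edge A12 = A21 = 5.
A = np.array([[0., 5.], [5., 0.]])
A_tilde = A + np.eye(2)
d = A_tilde.sum(axis=1)
S = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)  # GCN propagation
L_tilde = np.eye(2) - S
x = np.array([1., 2.])

gcn_x1, hk_x1 = [x[0]], [x[0]]
x_gcn = x
for step in range(1, 6):
    x_gcn = S @ x_gcn                                  # discrete GCN step (Δt = 1)
    gcn_x1.append(x_gcn[0])
    hk_x1.append((expm(-L_tilde * step) @ x)[0])       # heat kernel at time t = step
# gcn_x1 overshoots 1.5 and oscillates; hk_x1 rises monotonically toward 1.5.
```

The oscillation comes from the negative eigenvalue $-2/3$ of $S$ on this graph, whereas the heat-kernel filter $e^{-\lambda t}$ is always positive.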

3.4. GENERALIZING GRAPH CONVOLUTION

We have shown that the feature propagation in GCN is merely the solution of the finite-difference version of the heat equation with $\Delta t = 1$. As a result, using the heat kernel for feature propagation leads to smooth convergence. Since $t$ ranges over the real numbers, the propagation time $t$ in the heat kernel can be seen as a generalized parameter of the number of layers in GCN (Kipf & Welling, 2017) and SGC (Wu et al., 2019); unlike the discrete layer count, $t$ can change smoothly. In light of this, we propose to generalize graph convolutional networks via the heat kernel and present the HKGCN model. Specifically, we use the one-layer linear model

$\hat{Y} = \mathrm{softmax}(X^{(t)} W) = \mathrm{softmax}(e^{-\tilde{L}t} X W)$, (10)

where $W$ is the $d \times C$ feature transformation weight and $t$ can be a learnable scalar or a preset hyper-parameter. Using a preset $t$ converts HKGCN into 1) a parameter-free pre-processing step $\hat{X} = e^{-\tilde{L}t} X$ and 2) a linear logistic regression classifier $\hat{Y} = \mathrm{softmax}(\hat{X} W)$. This makes training much faster than GCN. The algorithm for presetting $t$ is given in the Appendix (Algorithm 1).

To avoid eigendecomposition, we use the Chebyshev expansion to calculate $e^{-\tilde{L}t}$ (Hammond et al., 2011; Zhang et al., 2019). The Chebyshev polynomials of the first kind are defined by $T_{i+1}(x) = 2x T_i(x) - T_{i-1}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. The Chebyshev polynomials require $x$ to lie in $[-1, 1]$; however, the eigenvalues of $\tilde{L}$ satisfy $0 = \lambda_0 \le \dots \le \lambda_{n-1} \le 2$. To make the eigenvalues fall in $[-1, 1]$, we rescale $\tilde{L}$ to $\hat{L} = \tilde{L}/2$. In addition, we reparameterize $t$ to $t' = 2t$ so that the heat kernel keeps its original form.
In doing so, we have

$e^{-\tilde{L}t} = e^{-\hat{L}t'} \approx \sum_{i=0}^{k-1} c_i(t') T_i(\hat{L})$. (11)

The coefficients $c_i$ of the Chebyshev expansion can be obtained by

$c_i(t') = \frac{\beta}{\pi} \int_{-1}^{1} \frac{T_i(x) e^{-x t'}}{\sqrt{1 - x^2}} dx = \beta (-1)^i B_i(t')$, (12)

where $\beta = 1$ when $i = 0$ and $\beta = 2$ otherwise, and $B_i(t')$ is the modified Bessel function of the first kind (Andrews & Society of Photo-optical Instrumentation Engineers, 1998). By combining Eqs. 11 and 12, we can approximate $e^{-\hat{L}t'}$ as

$e^{-\hat{L}t'} \approx B_0(t') T_0(\hat{L}) + 2 \sum_{i=1}^{k-1} (-1)^i B_i(t') T_i(\hat{L})$. (13)
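A sketch of the full approximation (assuming NumPy and SciPy, where `scipy.special.iv` provides the modified Bessel function $B_i$): it applies Eq. 13 to the features via the Chebyshev recurrence and compares against the exact matrix exponential on a toy graph:

```python
import numpy as np
from scipy.linalg import expm
from scipy.special import iv          # modified Bessel function of the first kind

def heat_features(L_tilde, X, t, k=20):
    """Approximate e^{-L̃t} X via Eq. 13, with L̂ = L̃/2 and t' = 2t."""
    L_hat, t_prime = L_tilde / 2, 2 * t
    T_prev = X                                   # T_0(L̂) X
    out = iv(0, t_prime) * T_prev
    T_curr = L_hat @ X                           # T_1(L̂) X
    out = out - 2 * iv(1, t_prime) * T_curr
    for i in range(2, k):
        T_prev, T_curr = T_curr, 2 * (L_hat @ T_curr) - T_prev  # recurrence
        out = out + ((-1) ** i) * 2 * iv(i, t_prime) * T_curr
    return out

# Toy check against the exact heat kernel.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_tilde = A + np.eye(3)
d = A_tilde.sum(axis=1)
L_tilde = np.eye(3) - np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)
X = np.array([[1., 0.], [0., 1.], [2., 2.]])
approx = heat_features(L_tilde, X, t=2.0)
exact = expm(-L_tilde * 2.0) @ X
```

With $k = 20$ terms, the truncation error is far below numerical precision for moderate $t'$, since the Bessel coefficients $B_i(t')$ decay factorially in $i$.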

3.5. SPECTRAL ANALYSIS

From the graph spectral perspective, the heat kernel acts as a low-pass filter (Xu et al., 2019a). In addition, as the propagation time $t$ increases, the cutoff frequency decreases, smoothing the feature propagation.

Graph Spectral Review. We define $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ as the diagonal matrix of eigenvalues of $\tilde{L}$ and $U = (u_1, \dots, u_n)$ as the corresponding eigenvectors, that is, $\tilde{L} = U \Lambda U^T$. The graph Fourier transform defines $\hat{x} = U^T x$ as the frequency spectrum of $x$, with the inverse operation $x = U \hat{x}$. The graph convolution between the kernel filter $g(\cdot)$ and $x$ is $g * x = U g(\Lambda) U^T x$ with $g(\Lambda) = \mathrm{diag}(g(\lambda_1), \dots, g(\lambda_n))$. For a polynomial $g(\tilde{L})$, we have $g(\tilde{L}) = U g(\Lambda) U^T$. This can be verified by setting $g(\lambda_i) = \sum_{j=0}^{K} a_j \lambda_i^j$, that is, $U g(\Lambda) U^T = U \sum_{j=0}^{K} a_j \Lambda^j U^T = \sum_{j=0}^{K} a_j U \Lambda^j U^T = \sum_{j=0}^{K} a_j \tilde{L}^j = g(\tilde{L})$.

Heat Kernel. As $g(\tilde{L}) = U g(\Lambda) U^T$, the heat kernel $H_t = e^{-\tilde{L}t} = \sum_{i=0}^{\infty} \frac{(-t)^i}{i!} \tilde{L}^i$ can also be seen as a polynomial of $\tilde{L}$. Thus its kernel filter is $g(\lambda_i) = e^{-\lambda_i t}$. Note that the eigenvalues of $\tilde{L}$ lie in the range $[0, 2]$. For any $i, j$ with $\lambda_i < \lambda_j$, we have $\frac{g(\lambda_i)}{g(\lambda_j)} = e^{(\lambda_j - \lambda_i) t} > 1$, and thus $g(\lambda_i) > g(\lambda_j)$: the heat kernel acts as a low-pass filter. As $t$ increases, the ratio $e^{(\lambda_j - \lambda_i) t}$ also increases, discounting more and more of the high frequencies. That is, $t$ acts as a modulator between low and high frequencies.

Table 1: Kernel filters.

Model          | Kernel filter g(λ)   | Propagation matrix
GCN            | g(λ) = 1 - λ         | D̃^{-1/2} Ã D̃^{-1/2}
SGC (K = 2)    | g(λ) = (1 - λ)^2     | (D̃^{-1/2} Ã D̃^{-1/2})^2
HKGCN (t = 1)  | g(λ) = e^{-λ}        | e^{-L̃}
HKGCN (t = 2)  | g(λ) = e^{-2λ}       | e^{-2L̃}

Figure 2: Effects of different kernels.

Different Kernels. We summarize the kernel filters of GCN, SGC, and HKGCN in Table 1. In Figure 2, we use the eigenvalues of $\tilde{L}$ to illustrate the effects of different kernel filters. We can observe that on Cora the absolute values of the eigenvalues filtered by the GCN and SGC kernels do not decrease monotonically. However, the filtered eigenvalues of the heat kernels in HKGCN do decrease monotonically.
This is because for $g(\lambda) = (1 - \lambda)^k$, $|g(\lambda)|$ increases monotonically for $\lambda \in [1, 2]$. In other words, the kernel filter of GCN acts as a band-stop filter, attenuating only the eigenvalues near 1.

The Influence of the High-Frequency Spectrum. The eigenvalues of $\tilde{L}$ and the associated eigenvectors satisfy the following relation (Shuman et al., 2016) (cf. the Appendix for the proof):

$\lambda_k = \sum_{(v_i, v_j) \in E} \tilde{A}_{ij} \Big[ \frac{u_k(i)}{\sqrt{\tilde{D}_{ii}}} - \frac{u_k(j)}{\sqrt{\tilde{D}_{jj}}} \Big]^2$. (14)

This means that, similar to the classical Fourier transform, eigenvectors associated with large $\lambda$ oscillate more rapidly than those associated with small $\lambda$. We also know that $\tilde{L} = \sum_{i=1}^{n} \lambda_i u_i u_i^T$, where $u_i u_i^T$ is the projection matrix onto $u_i$. Since the eigenvector $u_i$ oscillates more rapidly as $i$ increases, the projection matrix $u_i u_i^T$ also oscillates more rapidly as $i$ increases. Because the filter kernel follows $g(\tilde{L}) = \sum_{i=1}^{n} g(\lambda_i) u_i u_i^T$, the larger $g(\lambda_i)$ is, the greater the influence of $u_i u_i^T$ on $g(\tilde{L})$. Since $\hat{Y} = g(\tilde{L}) X$, an oscillating $g(\tilde{L})$ will cause an oscillating output $\hat{Y}$. Therefore, we want the influence of $u_i u_i^T$ for large $i$, i.e., $g(\lambda_i)$ for large $i$, to be as small as possible. This explains why we need a low-pass filter.
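These filter shapes are easy to verify numerically (a sketch with NumPy over a synthetic spectrum $\lambda \in [0, 2]$, not the actual Cora eigenvalues):

```python
import numpy as np

# Kernel filters from Table 1 evaluated on the spectrum λ ∈ [0, 2].
lam = np.linspace(0.0, 2.0, 201)
g_gcn = 1.0 - lam                 # GCN
g_sgc = (1.0 - lam) ** 2          # SGC (K = 2)
g_hk1 = np.exp(-lam)              # HKGCN, t = 1
g_hk2 = np.exp(-2.0 * lam)        # HKGCN, t = 2
# |g| for GCN/SGC dips to 0 at λ = 1 and rises again toward λ = 2
# (band-stop behaviour), while the heat kernels decay monotonically
# (low-pass), with t = 2 attenuating high frequencies more than t = 1.
```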

3.6. COMPLEXITY ANALYSIS

For the pre-processing step, the time complexity of the matrix multiplication between the sparse Laplacian matrix and the feature matrix in the Chebyshev expansion is $O(d|E|)$, where $|E|$ is the number of edges and $d$ is the dimension of the input features. Thus, the time complexity of the pre-processing step is $O(kd|E|)$, where $k$ is the Chebyshev expansion step. For prediction, the time complexity of logistic regression is $O(|V_U| d C)$, where $|V_U|$ is the number of unlabeled nodes and $C$ is the number of label categories. The space complexity is $O(|E| + nd)$, where $n$ is the number of nodes.
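The $O(kd|E|)$ cost corresponds to $k$ sparse-times-dense products; a sketch with SciPy sparse matrices (a random symmetric matrix stands in for $\hat{L}$, purely our own toy setup):

```python
import numpy as np
import scipy.sparse as sp

# Sketch: each Chebyshev term T_i(L̂)X is one sparse-dense product,
# costing O(d|E|); k terms give the O(k·d·|E|) pre-processing cost.
rng = np.random.default_rng(0)
n, d, k = 1000, 16, 20
M = sp.random(n, n, density=0.005, random_state=0, format="csr")
L_hat = (M + M.T) / 2                  # symmetric stand-in for the scaled Laplacian
X = rng.standard_normal((n, d))

T_prev, T_curr = X, L_hat @ X          # T_0(L̂)X and T_1(L̂)X
for _ in range(2, k):
    T_prev, T_curr = T_curr, 2 * (L_hat @ T_curr) - T_prev
# Only T_prev, T_curr, and the sparse L_hat are held in memory at any time,
# which matches the O(|E| + nd) space bound above.
```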

4.1. EXPERIMENTAL SETUP

We follow the standard GNN experimental settings to evaluate HKGCN on benchmark datasets for both transductive and inductive tasks. Reproducibility information is detailed in the Appendix.

Datasets. For transductive learning, we use Cora, Citeseer, and Pubmed (Kipf & Welling, 2017; Veličković et al., 2018). For inductive tasks, we use Reddit (Hamilton et al., 2017), ogbn-arxiv (Hu et al., 2020), and a new arXiv dataset collected by ourselves, which contains over one million nodes and 10 million edges. In all inductive tasks, we train models on subgraphs that contain only training nodes and test them on the original graphs. We adopt exactly the same data splits as existing work for Cora, Citeseer, Pubmed, and Reddit. The ogbn-arxiv dataset is accessed from OGB (https://ogb.stanford.edu/docs/nodeprop/). It is the citation network between computer science arXiv papers, and the task is to infer each arXiv paper's category, such as cs.LG, cs.SI, and cs.DB. Each paper is given a 128-dimensional feature vector based on word embeddings. The graph is split by time: papers published until 2017 for training, papers published in 2018 for validation, and papers published in 2019 for testing. Inspired by ogbn-arxiv, we also constructed a full arXiv paper citation graph from the public MAG (https://docs.microsoft.com/en-us/academic-services/graph/). We follow the same feature extraction and data splitting procedures as ogbn-arxiv to generate the arXiv dataset, which will be made publicly available upon publication. The statistics and split information of the six datasets are listed in Table 2.

Baselines. For the transductive tasks on Cora, Citeseer, and Pubmed, we use the same baselines as SGC (Wu et al., 2019), including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), FastGCN (Chen et al., 2018), LanczosNet, AdaLanczosNet (Liao et al., 2019), DGI (Veličković et al., 2019), GIN (Xu et al., 2019b), and SGC (Wu et al., 2019).
For the inductive tasks, we use supervised GraphSage (mean) without sampling (Hamilton et al., 2017), GCN (Kipf & Welling, 2017), ClusterGCN (Chiang et al., 2019), GraphSaint (Zeng et al., 2019), MLP, and SGC (Wu et al., 2019) as baselines. For the proposed HKGCN, the propagation time $t$ is preset based on validation-set performance. On Reddit, we follow SGC (Wu et al., 2019) and train HKGCN and SGC with the L-BFGS optimizer without regularization (Liu & Nocedal, 1989), due to its rapid convergence and good performance. However, this advantage of L-BFGS is not observed on the other datasets, for which we use the Adam optimizer (Kingma & Ba, 2014), same as the other baselines.

4.2. RESULTS

We report the performance of HKGCN and the baselines in terms of both effectiveness and efficiency.

Transductive. Table 3 reports the results for the transductive node classification tasks on Cora, Citeseer, and Pubmed, as well as the relative running time on Pubmed. As the reference point, training HKGCN on Pubmed takes 0.81s. Per community convention (Kipf & Welling, 2017; Veličković et al., 2018), we take the baseline results from existing publications (Wu et al., 2019) and report the results of our HKGCN model averaged over 100 runs. Efficiency is measured by the training time on an NVIDIA TITAN Xp GPU. We can observe that 1) HKGCN outperforms SGC on all three datasets with similar training time, and 2) HKGCN achieves the best prediction results on Citeseer and Pubmed among all methods and comparable results on Cora, while using 2-3 orders of magnitude less time than all baselines except FastGCN and SGC. In addition, compared to complex GNNs, the performance of HKGCN (and SGC) is quite stable across the 100 runs, as it benefits from the simple and deterministic propagation.

Inductive. Table 4 reports the results for the inductive tasks, with the baselines set up as in SGC (Wu et al., 2019). The results suggest that the performance of HKGCN is consistently better than SGC and comparable to advanced GNN models, such as supervised GraphSage, ClusterGCN, and GraphSaint. Additionally, we notice that HKGCN, GraphSage, and GCN yield the best results on Reddit, arXiv, and ogbn-arxiv, respectively, indicating the lack of a universally best GNN model. Efficiency-wise, HKGCN costs 2.2s, 48.5s, and 5.2s on the three datasets, respectively, similar to SGC; both are clearly more efficient than the other GNN models as well as MLP.

Analysis of t. In HKGCN, $t$ determines how long the feature propagation lasts. A larger $t$ amplifies global features, while a smaller one emphasizes local information.
From Figure 3 , we can observe that t has a similar impact on different datasets, and as the propagation time increases, the model at first benefits from information from its local neighbors. However, as the feature propagation continues (e.g., t > 9 or 12), the performance of the model starts to decrease, likely suffering from the over-smoothing issue (Li et al., 2018) .

5. CONCLUSION AND DISCUSSION

In this work, we propose to generalize graph convolutional networks (GCNs) and simple graph convolution (SGC) via the heat kernel, based on which we present the HKGCN model. The core idea of HKGCN is to use the heat kernel as the feature propagation matrix, rather than the discrete and nonlinear feature propagation procedure of GCNs. We theoretically show that the feature propagation of GCNs is equivalent to the finite-difference version of the heat equation, which leads it to overshoot the convergence point and causes oscillating propagation. Furthermore, we show that the heat kernel in HKGCN acts as a low-pass filter. The filter kernel of GCNs, on the contrary, fails to attenuate high-frequency signals, another factor leading to slow convergence and feature oscillation. For the heat kernel, the cutoff frequency decreases as the propagation time $t$ increases. Consequently, the HKGCN model avoids these oscillation issues and propagates features smoothly. Empirical experiments on six datasets suggest that the proposed HKGCN model generates promising results. Effectiveness-wise, the linear HKGCN model consistently beats SGC and achieves performance better than or comparable to advanced GCN baselines. Efficiency-wise, inherited from SGC, HKGCN offers order-of-magnitude faster training than GCNs. Notwithstanding the interesting results of the present work, there is still much room left for future work. One interesting direction is to learn the propagation time $t$ from data in HKGCN, automatically balancing information between local and global features.



Figure 1: Feature propagation under GCN and HKGCN.

Figure 3: Performances with t varying from 0 to 30.

Table 1: Kernel Filters.

Table 2: Dataset Statistics.

Table 3: Transductive results in terms of test accuracy.

Table 4: Inductive results in terms of test accuracy (left) and running time (right), averaged over 10 runs.



A REPRODUCIBILITY INFORMATION

A.1 HYPERPARAMETERS

We introduce the hyperparameters used in the experiments, including weight decay, number of epochs, learning rate, $t$, and optimizer, which are summarized in Table 5. The Chebyshev expansion step $k$ is set to 20.

Citation Networks. We train HKGCN for 100 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.2, the same settings used in SGC (Wu et al., 2019). First, we fix the weight decay to $5 \times 10^{-6}$ and search $t$ over $\{0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30\}$ based on validation performance. Then we fix the best-performing $t$ and search the weight decay over $\{10^{-6}, 10^{-5}, 10^{-4}\}$, again based on validation performance.

Reddit dataset. We follow SGC (Wu et al., 2019) and train HKGCN and SGC with the L-BFGS optimizer without regularization (Liu & Nocedal, 1989), due to its rapid convergence (2 epochs) and good performance. However, this advantage of L-BFGS is not observed on the other datasets, for which we use the Adam optimizer (Kingma & Ba, 2014), same as the other baselines. We search $t$ over all non-negative integers up to 15, based on validation performance.

arXiv and ogbn-arxiv datasets. We use the Adam optimizer with no weight decay to train HKGCN and SGC, same as the other baselines. Following the same treatment as the citation networks, we fix the learning rate to 0.2. We first fix 500 epochs and search $t$ over all non-negative integers up to 15 using the validation set. On the ogbn-arxiv dataset, the model does not converge after 500 epochs, so we extend training to 1000 epochs.

Situation for Large t.

If $t'$ is greater than 10, floating-point precision errors become an issue when calculating the Bessel function. Therefore, in practice, we compute $e^{-\hat{L}t'}$ as $e^{-\hat{L}\frac{t'}{3}} e^{-\hat{L}\frac{t'}{3}} e^{-\hat{L}\frac{t'}{3}}$.

Situation for Very Large Graphs. Although the Chebyshev expansion is very close to the minimax polynomial on $[-1, 1]$ and thus converges very fast, the recurrence relation in Eq. 13 is second-order. In other words, calculating $T_{i+1}(x)$ requires storing $T_i(x)$, $T_{i-1}(x)$, and $\hat{L}$, which requires more GPU memory than a first-order recurrence such as the Taylor expansion. Therefore, instead of using the Chebyshev expansion, we leverage Eq. 7 as $X^{(t+\Delta t)} = (I - \Delta t \tilde{L}) X^{(t)}$ for the calculation. In this way, we only need to store $X^{(t)}$ and $\tilde{L}$, which requires much less memory.

Training Time. We ran all experiments on an NVIDIA TITAN Xp GPU with 12 GB memory. The training time of HKGCN is the sum of 1) the feature pre-processing time and 2) the logistic regression training time. The largest training time difference between SGC and HKGCN thus occurs on the Reddit dataset: with L-BFGS it takes only two epochs to converge on Reddit, making the logistic regression step very fast, so the pre-processing step dominates the training time of HKGCN and SGC on Reddit.

Baseline Settings. Per community convention (Kipf & Welling, 2017; Veličković et al., 2018), the results of the baselines on the three citation networks (Cora, Citeseer, and Pubmed) are taken from existing publications (Wu et al., 2019). The hyperparameter settings of the inductive tasks on the three large datasets (Reddit, arXiv, and ogbn-arxiv) are introduced as follows.

SGC. To make a fair comparison, we select $K$ in SGC based on validation performance; $K$ has a similar meaning to $t$ in HKGCN.

GCN. The learning rate is set to 0.01 on all datasets and the weight decay is 0. The number of layers is 3 on Reddit and ogbn-arxiv, and 2 on arXiv because of the GPU memory limit.
The hidden layer size is 256 on Reddit and ogbn-arxiv, and 128 on arXiv, also due to the memory limit. The number of epochs is 500.

GraphSage. Results on the Reddit dataset are taken directly from the SGC paper (Wu et al., 2019). The other parameter settings are the same as for GCN above. We use the mean-based aggregator as it provides the best performance.

MLP. The number of layers is 3, the hidden layer size is 256, and Batch Norm is not used.

ClusterGCN. On the Reddit dataset, the number of partitions is 1500, the batch size is 20, and the hidden layer size is 128, the same parameter settings as in the original paper (Chiang et al., 2019). On the arXiv dataset, the number of partitions is 15000, the batch size is 32, and the hidden layer size is 256. The number of epochs on these two datasets is 50. On the ogbn-arxiv dataset, the number of partitions is 15 (a larger number would trigger an unknown bug in the PyTorch-Geometric library), the batch size is 32, and the hidden layer size is 256. We set the number of epochs to 200 on ogbn-arxiv because the model does not converge after 50 epochs.

GraphSaint. On the Reddit dataset, the number of layers is 4, the hidden layer size is 128, and the dropout rate is 0.2, the same parameter settings as in the original paper (Zeng et al., 2019). On the arXiv dataset, the number of layers is 3, the hidden layer size is 256, and the dropout rate is 0.5. On the ogbn-arxiv dataset, we use the same parameter settings as on Reddit, because both graphs are of similar scale.
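The large-$t'$ splitting described above relies on the semigroup property of the matrix exponential; a quick NumPy/SciPy check on a toy graph (our own example, with `scipy.linalg.expm` standing in for the Chebyshev approximation):

```python
import numpy as np
from scipy.linalg import expm

# Sketch: e^{-L̂t'} = (e^{-L̂ t'/3})^3, since all powers of L̂ commute.
# Splitting keeps the Bessel-function arguments small enough to avoid
# floating-point issues when t' is large.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
A_tilde = A + np.eye(3)
d = A_tilde.sum(axis=1)
L_hat = (np.eye(3) - np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)) / 2
t_prime = 30.0                          # a "large" propagation time
H_third = expm(-L_hat * t_prime / 3)
H_split = H_third @ H_third @ H_third   # three small-time kernels composed
H_direct = expm(-L_hat * t_prime)
```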

B ALGORITHM OF PRESETTING t

Algorithm 1 Preset t in HKGCN

Input: input features $X$; feature propagation time $t'$; Chebyshev expansion step $k$; scaled normalized Laplacian $\hat{L} = \tilde{L}/2$; modified Bessel function $B_i$.
Output: pre-processed features $X^{(t)}$.

  $T_0 \leftarrow X$; $X^{(t)} \leftarrow B_0(t') T_0$
  $T_1 \leftarrow \hat{L} X$; $X^{(t)} \leftarrow X^{(t)} - 2 B_1(t') T_1$
  for $i = 2, \dots, k$ do
    $T_i \leftarrow 2 \hat{L} T_{i-1} - T_{i-2}$
    if $i$ is odd then
      $X^{(t)} \leftarrow X^{(t)} - 2 B_i(t') T_i$
    else
      $X^{(t)} \leftarrow X^{(t)} + 2 B_i(t') T_i$
    end if
  end for

C PROOF OF EQ. 14

