REPRESENTATION POWER OF GRAPH CONVOLUTIONS : NEURAL TANGENT KERNEL ANALYSIS

Abstract

The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a 'graph convolution'. Therefore, understanding its influence on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with the symmetric normalization of the adjacency matrix A, defined as D^{-1/2} A D^{-1/2}, being the most widely adopted one, where D is the degree matrix. However, some empirical studies show that the row normalization D^{-1} A outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study on the representation power of these convolution operators that could explain this behavior. In this work, we theoretically analyze the influence of graph convolutions using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under a Degree Corrected Stochastic Block Model, we analyze graphs that have homophilic, heterophilic and core-periphery structures, and prove that: (i) row normalization preserves the underlying class structure better than other convolutions; (ii) performance degrades with network depth due to over-smoothing, but the loss in class information is slowest under row normalization; (iii) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing. We finally validate our theoretical findings numerically and on real datasets.

Under review as a conference paper at ICLR 2023

1. INTRODUCTION

With the advent of Graph Neural Networks (GNNs), there has been tremendous progress in the development of computationally efficient state-of-the-art methods for various graph based tasks, including drug discovery, community detection and recommendation systems (Wieder et al., 2020; Fortunato & Hric, 2016; van den Berg et al., 2017). Many of these problems depend on the structural information of the entities along with the features for effective learning. Because GNNs exploit this topological information encoded in the graph, they can learn better representations of the nodes or the entire graph than traditional deep learning techniques, thereby achieving state-of-the-art performance. To accomplish this, GNNs apply an aggregation function to each node in a graph that combines the features of the neighboring nodes, and GNN variants differ principally in the method of aggregation. For instance, graph convolution networks use mean neighborhood aggregation through spectral approaches (Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017) or spatial approaches (Hamilton et al., 2017; Duvenaud et al., 2015; Xu et al., 2019), graph attention networks apply multi-head attention based aggregation (Velickovic et al., 2018), and graph recurrent networks employ complex computational modules (Scarselli et al., 2008; Li et al., 2016). Of all the aggregation policies, the spectral approach based on the graph Laplacian is most widely used in practice, specifically the one proposed by Kipf & Welling (2017), owing to its simplicity and empirical success. In this work, we focus on such graph Laplacian based aggregations in Graph Convolution Networks (GCNs), which we refer to as graph convolutions or diffusion operators. Kipf & Welling (2017) propose a GCN for node classification, a semi-supervised task, where the goal is to predict the label of a node using its feature and neighboring node information.
This work suggests the symmetric normalization S_sym = D^{-1/2} A D^{-1/2} as the graph convolution. Ever since its introduction, S_sym has remained the popular choice. However, subsequent works (Wang et al., 2018; Wang & Leskovec, 2020; Ragesh et al., 2021) explore the row normalization S_row = D^{-1} A, and in particular, Wang et al. (2018) observes that S_row outperforms S_sym for a two-layered GCN empirically. Intrigued by this observation, and as both S_sym and S_row are simply degree normalized adjacency matrices, we study the behavior over depth and observe that S_row performs better than S_sym in this case as well, as illustrated in Figure 1 (details of the experiment in Appendix B.1).

Figure 1: Performance of GCN over depth with and without skip connections using S_sym and S_row, evaluated on the Cora dataset.

Furthermore, another striking observation from Figure 1 is that the performance of GCN without skip connections decreases considerably with depth for both S_sym and S_row. This contradicts the conventional wisdom about standard neural networks, which exhibit improved performance as depth increases. Several works (Kipf & Welling, 2017; Chen et al., 2018; Wu et al., 2019) observe this behavior empirically and attribute it to the over-smoothing effect of the repeated application of the diffusion operator, which averages out the feature information to a degree where it becomes uninformative (Li et al., 2018; Oono & Suzuki, 2019; Esser et al., 2021). As a solution to this problem, Chen et al. (2020) and Kipf & Welling (2017) propose different forms of skip connections that overcome the smoothing effect and thus outperform the vanilla GCN. Extending the comparison of graph convolutions, our experiment shows that S_row is preferable to S_sym over depth even in GCNs with skip connections (Figure 1). Naturally, we ask: what characteristics of S_row enable better representation learning than S_sym in GCNs? Rigorous theoretical analysis is particularly challenging for GCNs compared to standard neural networks because of the graph convolution. Adding skip connections further increases the complexity of the analysis. To overcome these difficulties, we consider the GCN in the infinite width limit, wherein the Neural Tangent Kernel (NTK) captures the network characteristics very well (Jacot et al., 2018).
The infinite width assumption is not restrictive for the analysis of graph convolutions, as the convolution operates on the graph and not directly on the network, and the NTK exhibits the same trends as the trained GCN (Figure 5). Moreover, the NTK makes the analysis parameter-free, eliminating additional complexity induced, for example, by optimization. Through the lens of the NTK, we study the impact of different graph convolutions under a specific data distributional assumption, the Degree Corrected Stochastic Block Model (DC-SBM) (Karrer & Newman, 2011), a sparse random graph model. The node degree heterogeneity induced in the DC-SBM allows us to analyze the effect of different types of normalization of the adjacency matrix, thus revealing the characteristic difference between S_sym and S_row. Additionally, this model enables the analysis of graphs that have homophilic, heterophilic and core-periphery structures. In this paper, we present a formal approach to analyzing GCNs and, specifically, the representation power of different graph convolutions, the influence of depth and the role of skip connections. This is a significant step toward understanding GCNs, as it facilitates more informed network design choices, such as the convolution and depth, as well as the development of more competitive methods based on grounded theoretical reasoning rather than heuristics. Contributions. This paper provides a rigorous theoretical analysis of the discussed empirical observations in GCNs under the DC-SBM distribution using the graph NTK, leading to the following contributions. (i) In Section 2, we derive the NTK for the GCN in the infinite width limit in the node classification setting. Using the NTK for linear GCNs under the DC-SBM distribution, we show in Section 3 that S_row preserves class information by computing the population NTK for different graph convolutions. We also present numerical validation of the result on homophilic and heterophilic graphs.
(ii) We prove the convolution operator specific over-smoothing effect in the vanilla GCN by showing the degradation in class separability with depth in Section 3.1, and also illustrate it experimentally. (iii) In Section 4, we leverage the power of the NTK to analyze two different skip connections (Kipf & Welling, 2017; Chen et al., 2020). We derive the corresponding NTKs and show that skip connections retain class information even at infinite depth, along with numerical validation. (iv) We show that S_sym may be preferred over S_row in the absence of class structure in Section 5, and validate the theoretical results on the real datasets Cora, in Section 6, and Citeseer, in Appendix B.5. We conclude in Section 7 with a discussion of the impact of the results and further possibilities, and provide all the proofs, experimental details and additional experiments in the appendix. Related Work. While GNNs are extensively used in practice, their understanding is limited, and the analysis is mostly restricted to empirical approaches (Bojchevski et al., 2018; Zhang et al., 2018; Ying et al., 2018; Wu et al., 2020). Beyond empirical methods, rigorous theoretical analyses using learning theoretic bounds, such as the VC dimension (Scarselli et al., 2018) or PAC-Bayes (Liao et al., 2021), have been proposed. Rademacher complexity bounds (Garg et al., 2020; Esser et al., 2021) show that normalized graph convolution is beneficial, but those works do not provide insight into the different normalizations and their influence on GCN performance. Another possible tool is the NTK, through which interesting theoretical insights into deep neural networks have been derived (e.g., Du et al. (2019a)). In the context of GNNs, Du et al. (2019b) derive the NTK in the supervised setting (each graph is a data instance to be classified) and empirically study its performance, but do not extend this to a theoretical analysis.
In contrast, we derive the NTK in the semi-supervised setting for GCNs with and without skip connections, and use it to theoretically analyze the influence of different convolutions with respect to over-smoothing. Theoretical studies (Oono & Suzuki, 2019; Cai & Wang, 2020) show that over-smoothing causes the expressive power of GNNs to decrease exponentially with depth, while Keriven (2022) proves that in linear GNNs a finite number of convolutions improves learning before over-smoothing kicks in. While over-smoothing and the role of skip connections in GNNs are theoretically analyzed in some works (Esser et al., 2021), the influence of the different convolutions that cause over-smoothing, and their interplay with skip connections, has not been studied. For a comprehensive survey of the theory, see Jegelka (2022). Notations. We represent matrices and vectors by bold faced uppercase and lowercase letters, respectively, the matrix Hadamard (entry-wise) product by ⊙, and the scalar product by ⟨·,·⟩. We use M^{⊙k} to denote the Hadamard product of the matrix M with itself repeated k times. Let N(µ, Σ) be the Gaussian distribution with mean µ and co-variance Σ. We use σ̇(·) to represent the derivative of the function σ(·), 1_{n×n} for the n × n matrix of ones, I_n for the identity matrix of size n × n, and 1[·] for the indicator function.

2. NEURAL TANGENT KERNEL FOR GRAPH CONVOLUTIONAL NETWORK

Before going into a detailed analysis of graph convolutions, we provide a brief background on the Neural Tangent Kernel (NTK) and derive its formulation for node level prediction using infinitely-wide GCNs. Jacot et al. (2018); Arora et al. (2019); Yang (2019) show that the behavior and generalization properties of randomly initialized wide neural networks trained by gradient descent with an infinitesimally small learning rate are equivalent to those of a kernel machine. Furthermore, Jacot et al. (2018) also show that the change in the kernel during training decreases as the network width increases; hence, asymptotically, one can represent an infinitely wide neural network by a deterministic NTK, defined through the gradient of the network with respect to its parameters as

$\Theta(x, x') := \mathbb{E}_{W \sim \mathcal{N}(0, I)} \left[ \left\langle \frac{\partial F(W, x)}{\partial W}, \frac{\partial F(W, x')}{\partial W} \right\rangle \right].$   (1)

Here F(W, x) represents the output of the network at data point x parameterized by W, and the expectation is with respect to W, where all the parameters of the network are randomly sampled from the Gaussian distribution. Although the 'infinite width' assumption is too strong to model real (finite width) neural networks, and the absolute performance may not exactly match, the empirical trends of the NTK match those of the corresponding network, allowing us to draw insightful conclusions. This trade-off is worth considering, as it allows the analysis of over-parameterized neural networks without having to consider hyper-parameter tuning and training. Formal GCN Setup and Graph NTK. We present the formal setup of the GCN and derive the corresponding NTK, with which we analyze different graph convolutions. Given a graph with n nodes and a set of node features {x_i}_{i=1}^n ⊂ R^f, we may assume without loss of generality that the set of observed labels {y_i}_{i=1}^m corresponds to the first m nodes. We consider K classes, thus y_i ∈ {0,1}^K, and the goal is to predict the n - m unknown labels {y_i}_{i=m+1}^n.
We represent the observed labels of the m nodes as Y ∈ {0,1}^{m×K} and the node features as X ∈ R^{n×f}, with the assumption that the entire X is available during training. We define S ∈ R^{n×n} to be the graph convolution operator, an expression of the adjacency matrix A and the degree matrix D. The GCN of depth d is given by

$F_W(X, S) := \sqrt{\tfrac{c_\sigma}{h_d}}\, S\, \sigma\!\left( \ldots \sqrt{\tfrac{c_\sigma}{h_1}}\, S\, \sigma\!\left(S X W_1\right) W_2 \ldots \right) W_{d+1},$   (2)

where W := {W_i ∈ R^{h_{i-1} × h_i}}_{i=1}^{d+1} is the set of learnable weight matrices with h_0 = f and h_{d+1} = K, h_i is the size of layer i ∈ [d], and σ : R → R is the point-wise activation function. We initialize all the weights as i.i.d. N(0,1) and optimize them using gradient descent. We derive the NTK for this GCN in the infinite width setting, that is, h_1, ..., h_d → ∞. While this setup is similar to Kipf & Welling (2017), it is important to note that we consider a linear output layer so that the NTK remains constant during training (Liu et al., 2020), and we additionally add a normalization $\sqrt{c_\sigma / h_i}$ for layer i to ensure that the input norm is approximately preserved, with $c_\sigma^{-1} = \mathbb{E}_{u \sim \mathcal{N}(0,1)}\left[(\sigma(u))^2\right]$ (similar to Du et al. (2019a)).
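To make the parameterization concrete, the forward pass of the GCN above can be sketched at finite width in numpy. This is a minimal illustration, not the paper's code: the ReLU choice for σ, the value c_σ = 2, and all sizes are our own assumptions.

```python
import numpy as np

def gcn_forward(X, S, weights, c_sigma=2.0, sigma=lambda z: np.maximum(z, 0.0)):
    """Finite-width sketch of the depth-d GCN of eq. (2).

    weights = [W_1, ..., W_{d+1}]. The first pre-activation is S X W_1; every
    subsequent layer computes sqrt(c_sigma / h_{i-1}) * S @ sigma(previous) @ W_i,
    and the output layer is linear.
    """
    H = S @ X @ weights[0]                    # F_1 = S X W_1
    for W in weights[1:]:                     # layers 2 .. d+1
        h_prev = H.shape[1]
        H = np.sqrt(c_sigma / h_prev) * (S @ sigma(H)) @ W
    return H
```

With row-normalized S and weights W_i of compatible shapes, this returns an n × K matrix of node-level outputs.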

(similar to Du et al. (2019a) ). The following theorem states the NTK between every pair of nodes, as a n × n matrix that can be computed at once, as shown below. Theorem 1 (NTK for Vanilla GCN) For the vanilla GCN defined in (2), the NTK Θ at depth d is Θ (d) = d+1 k=1 Σ k ⊙ SS T ⊙(d+1-k) ⊙ d+1-k k ′ =k Ėk ′ . ( ) Here Σ k ∈ R n×n is the co-variance between nodes of layer k, and is given by Σ 1 = SXX T S T , Σ k = SE k-1 S T with E k = c σ E F∼N (0,Σ k ) σ(F)σ(F) T and Ėk = c σ E F∼N (0,Σ k ) σ(F) σ(F) T . Comparison to Du et al. (2019b) . While the NTK in (3) is similar to the graph NTK in Du et al. (2019b) , the main difference is that NTK in our case is computed for all pairs of nodes in a graph as we focus on semi-supervised node classification, whereas Du et al. (2019b) considers supervised graph classification where input is many graphs and so the NTK is evaluated for all pairs of graphs.

3. CONVOLUTION OPERATOR S row PRESERVES CLASS INFORMATION

We use the NTK derived in Theorem 1 to analyze the different graph convolutions S of Definition 1, under the following assumption on the network.

Assumption 1 (Linear GCN with orthonormal features) The GCN in (2) is said to be linear with orthonormal features if the activation function is σ(x) = x and XX^T = I_n.

Remark on Assumption 1. The linear activation does not significantly impact the performance of a GCN, as Wu et al. (2019) empirically demonstrate that the linearized GCN performs on par with non-linear models at much reduced complexity. The additional orthonormal features assumption eliminates the influence of the features and facilitates identification of the influence of the different convolution operators. Besides, an evaluation of our theoretical results without this assumption on real datasets is presented in Section 6 and Appendix B.5, substantiating our findings. The NTK for a linear GCN with orthonormal features of depth d is therefore

$\Theta^{(d)} = \sum_{k=1}^{d+1} \Sigma_k \odot \left(S S^T\right)^{\odot (d+1-k)}$ with $\Sigma_k = S^k \left(S^k\right)^T.$   (4)

Definition 1 Symmetric degree normalized $S_{\mathrm{sym}} = D^{-1/2} A D^{-1/2}$, row normalized $S_{\mathrm{row}} = D^{-1} A$, column normalized $S_{\mathrm{col}} = A D^{-1}$ and unnormalized $S_{\mathrm{adj}} = \frac{1}{n} A$ convolutions.

While the NTK in (4) gives a precise characterization of the infinitely wide GCN, we cannot directly draw conclusions about the convolution operators without further assumptions on the input graph. Therefore, we consider a planted graph model, described below, that helps in establishing the exact representation power of each operator. Random Graph Model. We assume that the underlying graph is from the Degree Corrected Stochastic Block Model (DC-SBM) (Karrer & Newman, 2011), since it enables us to distinguish between S_sym, S_row, S_col and S_adj by allowing a non-uniform degree distribution on the nodes. The model is defined as follows: consider a set of n nodes divided into K latent classes (or communities), C_i ∈ [1, K].
The DC-SBM is characterized by the parameters p, q ∈ [0,1], governing the edge probabilities inside and outside the classes, and the degree correction vector π = (π_1, ..., π_n) ∈ [0,1]^n with Σ_i π_i = 1. A random graph on n nodes generated from the DC-SBM has mutually independent edges, with edge probabilities specified by the population adjacency matrix M = E[A] ∈ R^{n×n}, where

$M_{ij} = \begin{cases} p\, \pi_i \pi_j & \text{if } C_i = C_j \\ q\, \pi_i \pi_j & \text{if } C_i \neq C_j. \end{cases}$

This allows us to model different graph types: homophilic graphs (0 ≤ q < p ≤ 1), heterophilic graphs (0 ≤ p < q ≤ 1) and core-periphery graphs (p = q, no assumption on class structure, with π encoding the core and periphery). It is evident that the NTK is a complex quantity, and computing its expectation is challenging given the dependency of terms arising from the degree normalization in S, its powers S^i and SS^T. To simplify our analysis, we make the following assumption on the DC-SBM.

Assumption 2 (Population DC-SBM) The graph has the weighted adjacency A = M.

Remark on Assumption 2. Assuming A = M is equivalent to analyzing the DC-SBM in the expected setting, and it enables the computation of an analytic expression for the population NTK instead of the expected NTK. Moreover, we observe empirically that the analysis holds in the random DC-SBM setting as well. In addition, this consideration implies the addition of a self-loop with probability p. In the following theorem, we state the population NTK for the graph convolutions S_sym, S_row, S_col and S_adj for K = 2 under Assumptions 1 and 2. The result extends to K > 2, as discussed in the appendix.

Theorem 2 (Population NTKs Θ for the four graph convolutions S) Let Assumptions 1 and 2 hold, K = 2, $r = \frac{p-q}{p+q}$ and $\delta_{ij} = (-1)^{\mathbb{1}[C_i \neq C_j]}$. Furthermore, π is chosen such that $\sum_{i=1}^n \pi_i \mathbb{1}[C_i = k] = \frac{1}{K}$ and $\sum_{i=1}^n \pi_i^2 \mathbb{1}[C_i = k] = \gamma\; \forall k$, where γ is a constant.
Then, ∀ i, j, the population NTKs Θ_sym, Θ_row, Θ_col and Θ_adj of depth d for S = S_sym, S_row, S_col and S_adj, respectively, are

$\Theta^{(d)}_{\mathrm{sym},ij} = \sqrt{\pi_i \pi_j} \left[ \frac{1 - \left(\sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)\right)^{d+1}}{1 - \sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)} + \delta_{ij}\, r^{2(d+1)}\, \frac{1 - \left(\sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)\, r^{-2}\right)^{d+1}}{1 - \sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)\, r^{-2}} \right],$

$\Theta^{(d)}_{\mathrm{row},ij} = 2\gamma \left[ \frac{1 - \left(2\gamma\,(1 + \delta_{ij} r^2)\right)^{d+1}}{1 - 2\gamma\,(1 + \delta_{ij} r^2)} + \delta_{ij}\, r^{2(d+1)}\, \frac{1 - \left(2\gamma\,(1 + \delta_{ij} r^2)\, r^{-2}\right)^{d+1}}{1 - 2\gamma\,(1 + \delta_{ij} r^2)\, r^{-2}} \right],$

$\Theta^{(d)}_{\mathrm{col},ij} = n \pi_i \pi_j \left[ \frac{1 - \left(n \pi_i \pi_j\,(1 + \delta_{ij} r^2)\right)^{d+1}}{1 - n \pi_i \pi_j\,(1 + \delta_{ij} r^2)} + \delta_{ij}\, r^{2(d+1)}\, \frac{1 - \left(n \pi_i \pi_j\,(1 + \delta_{ij} r^2)\, r^{-2}\right)^{d+1}}{1 - n \pi_i \pi_j\,(1 + \delta_{ij} r^2)\, r^{-2}} \right],$

$\Theta^{(d)}_{\mathrm{adj},ij} = \pi_i \pi_j \sum_{k=1}^{d+1} \frac{\gamma^{k+d}\,(\pi_i \pi_j)^{d+1-k}}{n^{2(d+1)}} \left( \mathbb{1}[\delta_{ij} = 1]\,(p^2 + q^2) + \mathbb{1}[\delta_{ij} = -1]\,(2pq) \right)^{d+1-k} \times \left( \mathbb{1}[\delta_{ij} = 1] \sum_{l=0}^{k} \binom{2k}{2l} p^{2k-2l} q^{2l} + \mathbb{1}[\delta_{ij} = -1] \sum_{l=0}^{k-1} \binom{2k}{2l+1} p^{2k-2l-1} q^{2l+1} \right).$

Note that the two assumptions on π serve only to express the kernel in a simplified, easy to comprehend format. The kernel is derived without these assumptions on π in Appendix A.2.2. Furthermore, the numerical validation of our result is performed without both of these assumptions (Section 3.2). Comparison of graph convolutions. The population NTK Θ^{(d)} of depth d in Theorem 2 describes the information that the kernel has after d convolutions with S. To classify the nodes perfectly, the kernel should ideally have a block structure that aligns with the DC-SBM (p and q blocks), unaffected by the degree correction π, showing class separability, that is, a gap between the in-class and out-of-class blocks proportional to p - q. On this basis, only Θ_row exhibits a block structure unaffected by the degree correction π, with the gap determined by r^2 and d, making S_row preferable over S_sym, S_adj and S_col. On the other hand, Θ_sym, Θ_col and Θ_adj are influenced by the degree correction, which obscures the class information, especially with depth. Although Θ_sym and Θ_col seem similar, Θ_col is additionally influenced by the number of nodes n in the graph, making it undesirable compared to S_sym.
As a result, the preference order from the theory is Θ_row ≻ Θ_sym ≻ Θ_col ≻ Θ_adj.

Figure 2: Numerical validation of Theorem 2 using homophilic (q < p) and heterophilic (p < q) DC-SBM (Column 1). Columns 2 and 3 illustrate the exact NTKs of depth 2 for S_sym and S_row, respectively. Column 4 shows the average gap between the in-class and out-of-class blocks from theory.

3.1. IMPACT OF DEPTH IN VANILLA GCN

Given that $r = \frac{p-q}{p+q} < 1$, Theorem 2 shows that the difference between the in-class and out-of-class blocks decreases monotonically with depth, which in turn leads to a decrease in performance with depth, explaining the observation in Figure 1. The kernel in the limit is stated below.

Corollary 1 (Population NTK Θ^{(∞)} as d → ∞) From Theorem 2, Θ^{(∞)}_{adj,ij} = 0, and ∀ i, j and conv ∈ {sym, row, col},

$\Theta^{(\infty)}_{\mathrm{conv},ij} = \frac{\nu_{ij}}{1 - \nu_{ij}\,(1 + \delta_{ij} r^2)},$ where $\nu_{ij} = \sqrt{\pi_i \pi_j}$ for sym, $\nu_{ij} = 2\gamma$ for row, and $\nu_{ij} = n \pi_i \pi_j$ for col.

From the corollary, we infer that the class separability at infinite depth is 0 for S_adj and O(r^2) for S_sym, S_row and S_col, showing that a large depth GCN has very little to zero class information. To further illustrate this, we plot the average difference between the in-class and out-of-class blocks for homophilic and heterophilic graphs using the theoretically derived population NTK Θ^{(d)} for depths 1 to 10 in a well separated DC-SBM (Column 4 of Figure 2). It clearly shows the rapid degradation of class separability with depth, and the gap goes to 0 at large depths for all four convolutions. Additionally, the gap for Θ^{(d)}_row is the largest, showing that the class information is better preserved, illustrating the strong representation power of S_row. Consequently, large depth is undesirable for all four graph convolutions in the vanilla GCN, and the theory suggests S_row as the best choice for a shallow GCN.
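The shrinking gap is easy to reproduce by evaluating the linear NTK of eq. (4) on a population DC-SBM. The sketch below uses our own illustrative instance (sizes, p, q and the degree corrections are not the paper's exact settings); block_gap measures the average in-class minus out-of-class kernel value.

```python
import numpy as np

def linear_ntk(S, depth):
    """Eq. (4): Theta = sum_{k=1}^{d+1} S^k (S^k)^T ⊙ (S S^T)^{⊙(d+1-k)}."""
    SSt = S @ S.T
    Sk = np.eye(S.shape[0])
    Theta = np.zeros_like(SSt)
    for k in range(1, depth + 2):
        Sk = S @ Sk                             # S^k
        Theta += (Sk @ Sk.T) * SSt ** (depth + 1 - k)
    return Theta

def block_gap(Theta, labels):
    """Average in-class minus out-of-class kernel value (class separability)."""
    same = labels[:, None] == labels[None, :]
    return Theta[same].mean() - Theta[~same].mean()
```

Plotting block_gap against depth for S_row and S_sym reproduces the qualitative behavior of Column 4 of Figure 2: the gap is positive at small depth and decays toward 0.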

3.2. NUMERICAL VALIDATION FOR RANDOM GRAPHS

Theorem 2 and Corollary 1 show that S_row has better representation power under Assumptions 1 and 2, that is, for the linear GCN with orthonormal features and the population DC-SBM. We validate this on homophilous and heterophilous random graphs generated from the DC-SBM, shown in Column 1 of Figure 2. A graph of n = 1000 nodes with equal sized classes is sampled from each DC-SBM. The heatmaps for depth 2, in both the homophily and heterophily cases, show that the class information for all the nodes is well preserved by S_row, as there is a clearer block structure than for S_sym, in which each node is diffused unequally due to the degree correction. This validates the results derived from the population NTK. Appendix B.3 presents the results for S_adj and S_col, where both are uninformative and behave as derived theoretically.
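Random graphs of the kind used here can be generated as follows. This is a minimal sketch of the DC-SBM of Section 3; the function and parameter names are our own, and the values in the test are illustrative rather than the paper's experimental settings.

```python
import numpy as np

def dcsbm_population(labels, pi, p, q):
    """Population adjacency M = E[A]: M_ij = pi_i * pi_j * (p if same class else q)."""
    same = labels[:, None] == labels[None, :]
    return np.outer(pi, pi) * np.where(same, p, q)

def dcsbm_sample(labels, pi, p, q, rng):
    """One random graph: independent edges with P(A_ij = 1) = M_ij (symmetric, no self-loops)."""
    M = dcsbm_population(labels, pi, p, q)
    U = rng.random(M.shape)
    A = (np.triu(U, 1) < np.triu(M, 1)).astype(float)
    return A + A.T
```

Setting q < p gives a homophilic graph, p < q a heterophilic one, and p = q with a heterogeneous π a core-periphery structure.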

4. SKIP CONNECTIONS RETAIN INFORMATION EVEN AT INFINITE DEPTH

Skip connections are the most common way to overcome the performance degradation with depth in GCNs, but little is known about the effectiveness of different skip connections and their interplay with the convolutions. While our focus is to understand the interplay with convolutions, we also include the impact of convolving with and without the feature information. Hence, we consider the following two variants: Skip-PC (pre-convolution), where the skip is added to the features before applying the convolution (Kipf & Welling, 2017); and Skip-α, which gives importance to the features by adding them to each layer without convolving with S (Chen et al., 2020). To facilitate skip connections, we need to enforce a constant layer size, that is, h_i = h_{i-1}. Therefore, we transform the input layer using a random matrix W to H_0 = XW of size n × h, where W_ij ∼ N(0,1) and h is the hidden layer size. Let H_i be the output of layer i.

Definition 2 (Skip-PC) In a Skip-PC (pre-convolution) network, the transformed input H_0 is added to the hidden layers before applying the graph convolution S, that is, $H_i := \sqrt{\tfrac{c_\sigma}{h}}\, S \left(H_{i-1} + \sigma_s(H_0)\right) W_i\; \forall i \in [d]$, where σ_s(·) can be linear or ReLU.

The above definition deviates from Kipf & Welling (2017) in that we skip to the input layer instead of the previous layer. The following defines a skip connection similar to Chen et al. (2020).

Definition 3 (Skip-α) Given an interpolation coefficient α ∈ (0,1), a Skip-α network is defined such that the transformed input H_0 and the hidden layer are interpolated linearly, that is, $H_i := \sqrt{\tfrac{c_\sigma}{h}} \left((1-\alpha)\, S H_{i-1} + \alpha\, \sigma_s(H_0)\right) W_i\; \forall i \in [d]$, where σ_s(·) can be linear or ReLU.
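The two definitions can be sketched at finite width as follows. This is our own reading of the recursions: we apply the point-wise activation σ to H_{i-1} inside the update, in line with the vanilla GCN of eq. (2), and the ReLU/identity choices and parameter values are illustrative assumptions.

```python
import numpy as np

def skip_gcn_forward(X, S, W0, weights, variant="pc", alpha=0.2, c_sigma=2.0,
                     sigma=lambda z: np.maximum(z, 0.0), sigma_s=lambda z: z):
    """Finite-width sketch of Skip-PC (Definition 2) and Skip-alpha (Definition 3).

    H_0 = X W0 maps the features to the constant hidden width h. Then, per layer,
    Skip-PC:    H_i = sqrt(c/h) * S (sigma(H_{i-1}) + sigma_s(H_0)) W_i
    Skip-alpha: H_i = sqrt(c/h) * ((1-alpha) S sigma(H_{i-1}) + alpha sigma_s(H_0)) W_i
    """
    H0 = X @ W0
    h = H0.shape[1]
    H = H0
    for W in weights:
        if variant == "pc":
            H = np.sqrt(c_sigma / h) * (S @ (sigma(H) + sigma_s(H0))) @ W
        elif variant == "alpha":
            H = np.sqrt(c_sigma / h) * ((1 - alpha) * (S @ sigma(H)) + alpha * sigma_s(H0)) @ W
        else:
            raise ValueError("variant must be 'pc' or 'alpha'")
    return H
```

Note that every layer re-injects H_0, so even at large depth the output never depends on the diffused features alone.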

4.1. NTK FOR GCN WITH SKIP CONNECTIONS

We derive the NTKs for the two skip connections, Skip-PC and Skip-α, by letting the hidden layer width h → ∞. Both NTKs maintain the form presented in Theorem 1, with the following changes to the co-variance matrices. Let $\tilde{E}_0 = \mathbb{E}_{F \sim \mathcal{N}(0, \Sigma_0)}\left[\sigma_s(F)\, \sigma_s(F)^T\right]$.

Corollary 2 (NTK for Skip-PC) The NTK for an infinitely wide Skip-PC network is as presented in Theorem 1, where E_k is defined as in the theorem, but Σ_k is defined as Σ_0 = XX^T, $\Sigma_1 = S \tilde{E}_0 S^T$ and $\Sigma_k = S E_{k-1} S^T + \Sigma_1$.

Corollary 3 (NTK for Skip-α) The NTK for an infinitely wide Skip-α network is as presented in Theorem 1, where E_k is defined as in the theorem, but Σ_k is defined with Σ_0 = XX^T, $\Sigma_1 = (1-\alpha)^2\, S E_0 S^T + \alpha(1-\alpha)\left(S E_0 + E_0 S^T\right) + \alpha^2 E_0$ and $\Sigma_k = (1-\alpha)^2\, S E_{k-1} S^T + \alpha^2 \tilde{E}_0$.

4.2. IMPACT OF DEPTH IN GCNS WITH SKIP CONNECTION

Similar to the previous section, we use the NTKs for Skip-PC and Skip-α (Corollaries 2 and 3) and analyze the graph convolutions S_sym and S_row under the same considerations detailed in Section 3. Since S_adj and S_col are theoretically worse and not popular in practice, we do not consider them in the skip connection analysis. The linear orthonormal feature NTK Θ^{(d)} for depth d is the same as (4), with Σ_k changed as follows:

Skip-PC: $\Sigma_k = S^k (S^k)^T + S S^T$,

Skip-α: $\Sigma_k = (1-\alpha)^{2k}\, S^k (S^k)^T + \alpha (1-\alpha)^{2k-1} \left( S^k (S^{k-1})^T + S^{k-1} (S^k)^T \right) + \alpha^2 \sum_{l=0}^{k-1} (1-\alpha)^{2l}\, S^l (S^l)^T.$

We derive the population NTK Θ^{(d)} and, for convenience, only state the result as d → ∞ in the following theorems.

Theorem 3 (Population NTK for Skip-PC Θ^{(∞)}_PC) Under the assumptions of Theorem 2,

$\Theta^{(\infty)}_{PC,\mathrm{sym},ij} = \frac{\sqrt{\pi_i \pi_j}\,(2 + \delta_{ij} r^2)}{1 - \sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)}, \qquad \Theta^{(\infty)}_{PC,\mathrm{row},ij} = \frac{2\gamma\,(2 + \delta_{ij} r^2)}{1 - 2\gamma\,(1 + \delta_{ij} r^2)}.$   (5)

Theorem 4 (Population NTK for Skip-α Θ^{(∞)}_α) Under the assumptions of Theorem 2,

$\Theta^{(\infty)}_{\alpha,\mathrm{sym},ij} = \frac{\alpha^2 \sqrt{\pi_i \pi_j}}{1 - \sqrt{\pi_i \pi_j}\,(1 + \delta_{ij} r^2)} \left( \frac{1}{1 - (1-\alpha)^2} + \frac{\delta_{ij}\, r^2}{1 - (1-\alpha)^2 r^2} \right), \qquad \Theta^{(\infty)}_{\alpha,\mathrm{row},ij} = \frac{2\gamma\, \alpha^2}{1 - 2\gamma\,(1 + \delta_{ij} r^2)} \left( \frac{1}{1 - (1-\alpha)^2} + \frac{\delta_{ij}\, r^2}{1 - (1-\alpha)^2 r^2} \right).$   (6)

Similar to Theorem 2, the assumptions on π in the above theorems serve only to simplify the results. To understand the role of skip connections, we plot the gap between the in-class and out-of-class blocks at infinite depth for different values of the true class separability r, for the vanilla GCN, Skip-PC and Skip-α, using Corollary 1 and Theorems 3-4, respectively (Figure 3). The plot clearly shows that, given a reasonable true separation, the gap stays away from 0 for both skip connections, unlike for the vanilla GCN. This implies that skip connections retain the class information even at infinite depth.
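Under the linear orthonormal-feature simplification, the Skip-PC kernel is again a finite sum and is easy to evaluate. The sketch below plugs the Skip-PC choice of Σ_k into eq. (4); the DC-SBM instance in the usage is our own illustrative choice.

```python
import numpy as np

def skip_pc_linear_ntk(S, depth):
    """Linear, orthonormal-feature NTK with Skip-PC (Section 4.2):
    Theta = sum_{k=1}^{d+1} Sigma_k ⊙ (S S^T)^{⊙(d+1-k)},
    with Sigma_k = S^k (S^k)^T + S S^T."""
    SSt = S @ S.T
    Sk = np.eye(S.shape[0])
    Theta = np.zeros_like(SSt)
    for k in range(1, depth + 2):
        Sk = S @ Sk                                   # S^k
        Theta += (Sk @ Sk.T + SSt) * SSt ** (depth + 1 - k)
    return Theta
```

In contrast to the vanilla kernel, the additive SS^T term keeps an in-class/out-of-class gap in every summand, so the gap does not vanish as the depth grows.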

4.3. NUMERICAL VALIDATION FOR RANDOM GRAPHS

We validate our theoretical results using the same setup detailed in Section 3.2, without the assumptions, and compute the exact NTKs for Skip-PC and Skip-α for both S_sym and S_row. We show the results on homophilic graphs, but they extend equally to the heterophilic case. While S_sym has no class information at depth 8 in the vanilla GCN (Figure 3, middle), it is retained well with Skip-PC (right plot). In the case of S_row, we clearly observe the blocks in both cases, with a more pronounced gap for Skip-PC, illustrating our theoretical results. A similar observation holds for Skip-α, despite considering XX^T = I_n while the model interpolates with the features; this is discussed in Appendix B.3. While both S_sym and S_row retain the class information at larger depths, we observe that the degree correction plays a significant role in S_sym, as elucidated in our theoretical analysis.

5. S sym MAY BE PREFERRED OVER S row IN ABSENCE OF CLASS STRUCTURE

While we showed that the graph convolution S_row preserves the underlying class structure, it is natural to wonder about random graphs that have no communities (p = q). One such case is graphs with core-periphery structure, where the graph has core nodes that are highly interconnected and periphery nodes that are sparsely connected to the core and to other periphery nodes. Such a graph can be modeled using only the degree correction π, with π_j ≪ π_i ∀ j ∈ periphery, i ∈ core (similar to Jia & Benson (2019)). Extending Theorem 2, we derive the following Corollary 4 and show that the convolution S_sym retains the graph information while S_row yields a constant kernel.

Corollary 4 (Population NTKs Θ for p = q) Let Assumptions 1 and 2 hold, K = 2 and p = q. Furthermore, π is chosen such that $\sum_{i \in \text{core}} \pi_i^2 = \lambda$ and $\sum_{i \in \text{periphery}} \pi_i^2 = \mu$. Then, ∀ i, j, the population NTKs Θ_sym and Θ_row of depth d for S = S_sym and S_row, respectively, are

$\Theta^{(d)}_{\mathrm{sym},ij} = \sqrt{\pi_i \pi_j}\; \frac{1 - \left(\sqrt{\pi_i \pi_j}\right)^{d+1}}{1 - \sqrt{\pi_i \pi_j}} \quad \text{and} \quad \Theta^{(d)}_{\mathrm{row},ij} = (\lambda + \mu)\; \frac{1 - (\lambda + \mu)^{d+1}}{1 - (\lambda + \mu)}.$

From Corollary 4, it is evident that S_sym retains the graph information and hence could be preferred when there is no community structure. Furthermore, and interestingly, even skip connections prove to be of no use for S_row: it remains a constant kernel in this case as well. We validate this experimentally and discuss the results in Appendix B.3 (Figure 10). While S_row results in a constant kernel for core-periphery graphs without community structure, it is important to note that when there is a community structure and each community has core and periphery nodes, S_row is still preferable over S_sym, as this is simply a special case of homophilic networks. This is demonstrated in Appendix B.3 (Figure 11).
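Corollary 4 can be checked directly by evaluating eq. (4) on a population core-periphery model M = p ππ^T. The sketch below reuses the linear NTK evaluation; the core/periphery sizes and the π values in the usage are our own illustrative choices.

```python
import numpy as np

def linear_ntk(S, depth):
    """Eq. (4): Theta = sum_{k=1}^{d+1} S^k (S^k)^T ⊙ (S S^T)^{⊙(d+1-k)}."""
    SSt = S @ S.T
    Sk = np.eye(S.shape[0])
    Theta = np.zeros_like(SSt)
    for k in range(1, depth + 2):
        Sk = S @ Sk
        Theta += (Sk @ Sk.T) * SSt ** (depth + 1 - k)
    return Theta

def core_periphery_kernels(pi, p, depth=2):
    """Population kernels when p = q (no class structure): M = p * pi pi^T.
    Returns (Theta_row, Theta_sym) for S_row = D^{-1} M and S_sym = D^{-1/2} M D^{-1/2}."""
    M = p * np.outer(pi, pi)
    d = M.sum(axis=1)
    S_row = M / d[:, None]
    S_sym = M / np.sqrt(np.outer(d, d))
    return linear_ntk(S_row, depth), linear_ntk(S_sym, depth)
```

Since every row of S_row equals π^T here, all powers of S_row coincide and the kernel collapses to a single constant, while S_sym,ij = sqrt(π_i π_j) keeps the degree (core/periphery) information.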

6. EMPIRICAL ANALYSIS ON REAL DATA

In this section, we explore how well the theoretical results translate to the real dataset Cora with features, that is, XX^T ≠ I_n and A ≠ M. We also provide experimental details and additional experiments on Citeseer in Appendix B.5. We consider multi-class node classification on Cora (K = 7) using the GCN with linear activations, and provide the results for ReLU activations in Appendix B.4. The NTKs for the vanilla GCN and the GCNs with Skip-PC and Skip-α are illustrated in Figure 4. We make the following observations from the experiments, which validate the theory even in this much relaxed setting: (i) clear block structures show up in GCNs both with and without skip connections for S_row, illustrating that the class information is better retained by S_row than by S_sym; (ii) while we cannot compare the skip connections against each other, it is still evident that S_row is better than S_sym for both Skip-PC and Skip-α, as block structures emerge even at large depth. Thus, although the theoretical results are based on the DC-SBM with mild assumptions, the conclusions hold well in real settings on real datasets.

7. CONCLUSION

Graph convolution operators significantly influence the performance of GCNs, but existing learning theoretic bounds for GCNs do not provide insight into the representation power of the operators. We present an NTK based analysis that characterizes different convolutions, thereby proving the strong representation power of S_row in community detection and explaining why S_row, and to some extent S_sym, are preferred in practice (Theorem 2). In contrast to applying spectral analysis of the convolutions to explain over-smoothing, our explicit characterization of the network provides a more exact quantification of the impact of over-smoothing in deep GCNs (Corollary 1, see Figure 2). In addition, the NTKs for GCNs with skip connections enable a precise understanding of the role of skip connections in countering the over-smoothing effect (Theorems 3-4). While the DC-SBM assumption may seem restrictive, experiments on Cora and Citeseer show that our theoretical results hold beyond the DC-SBM, although formally characterizing such behavior could be difficult without model assumptions. We note that our analysis could be extended by considering feature information (XX^T ≠ I_n) or random samples from the DC-SBM, which would require a more involved analysis but could provide further insights into GCNs, such as the interplay between graph and feature information. The present NTK based setup allows for the analysis of different graphs having homophilic, heterophilic and core-periphery structures, and can be extended to other graph generating processes. Furthermore, the general formulation of the NTK for vanilla GCNs (Theorem 1) and for GCNs with skip connections (Corollaries 2-3) can be used for analyzing new convolutions, such as topological structure preserving convolutions, for obtaining a rigorous understanding of GCNs by deriving statistical consistency results or information theoretic limits, as well as for the theoretical analysis of other graph learning problems, such as link prediction.

8. ETHICS STATEMENT

Our work focuses on theoretically understanding some characteristics of graph neural networks and hence does not have direct ethical or fairness implications.

9. REPRODUCIBILITY STATEMENT

The assumptions for the theory are stated clearly in Assumptions 1-2, and all the theoretical results, Theorems 1-4 and Corollaries 1-3, are proved in detail in Appendix A. The implementations of GCN and of the NTK for GCNs with and without skip connections are provided in ntk_gcn_conv.zip as supplementary material. The datasets used in the experiments are publicly available and are also provided in the data folder of the zip. The experimental results can be reproduced by following the instructions in readme.md.

A MATHEMATICAL DERIVATIONS AND PROOFS

We first derive the NTK (Theorem 1) for the GCN defined in (2), and prove Theorem 2 by considering a linear GCN and computing the population NTK $\bar\Theta^{(d)}$ for different graph convolutions. We use $\mathbf{1}_n$ to denote the $n$-dimensional all-ones vector and $\bar{\mathbf{1}}_n$ for the $n$-dimensional vector whose first $\frac n2$ entries are $-1$ and whose remaining $\frac n2$ entries are $+1$.

A.1 THEOREM 1: NTK FOR VANILLA GCN

We rewrite the GCN $F_W(X,S)$ defined in (2) using the following recursive definitions:
$$G_1 = SX,\qquad G_i = \sqrt{\tfrac{c_\sigma}{h_{i-1}}}\,S\,\sigma(F_{i-1})\ \ \forall i\in\{2,\dots,d+1\},\qquad F_i = G_i W_i\ \ \forall i\in[d+1]. \tag{7}$$
Thus $F_W(X,S) = F_{d+1}$, and using the definitions in (7) the gradient with respect to $W_i$ is
$$\frac{\partial F_W(X,S)}{\partial W_i} = G_i^T B_i \quad\text{with}\quad B_{d+1} = \mathbf{1}_n,\qquad B_i = \sqrt{\tfrac{c_\sigma}{h_i}}\left(S^T B_{i+1} W_{i+1}^T\right)\odot\dot\sigma(F_i). \tag{8}$$
We derive the NTK, as defined in (1), using the recursive definition of $F_W(X,S)$ in (7) and its derivative in (8).

Co-variance between nodes. We first derive, for each layer, the $n\times n$ co-variance matrix comprising the co-variances between any two nodes $u$ and $v$. Throughout our proofs we denote the $u$-th row of a matrix $Z$ by $Z_{u.}$. For $F_1$,
$$\mathbb{E}\left[(F_1)_{uk}(F_1)_{vk'}\right] = \mathbb{E}\left[\sum_{r=1}^{h_0}(G_1)_{ur}(W_1)_{rk}\sum_{s=1}^{h_0}(G_1)_{vs}(W_1)_{sk'}\right] = 0 \ \text{ if } r\neq s \text{ or } k\neq k',\ \text{ since } (W_1)_{xy}\sim\mathcal{N}(0,1),$$
and for $r=s$, $k=k'$,
$$\mathbb{E}\left[(F_1)_{uk}(F_1)_{vk}\right] = \sum_{r=1}^{h_0}(G_1)_{ur}(G_1)_{vr} = \left\langle (G_1)_{u.}, (G_1)_{v.}\right\rangle. \tag{9}$$
Similarly,
$$\mathbb{E}\left[(F_i)_{uk}(F_i)_{vk}\right] = \sum_{r=1}^{h_{i-1}}(G_i)_{ur}(G_i)_{vr} = \left\langle (G_i)_{u.}, (G_i)_{v.}\right\rangle. \tag{10}$$
Evaluating (9) and (10) in terms of the graph:
$$(9):\quad \left\langle (G_1)_{u.}, (G_1)_{v.}\right\rangle = \left\langle (SX)_{u.}, (SX)_{v.}\right\rangle = S_{u.}XX^T S_{.v}^T = (\Sigma_1)_{uv}, \tag{11}$$
$$(10):\quad \left\langle (G_i)_{u.}, (G_i)_{v.}\right\rangle = \frac{c_\sigma}{h_{i-1}}\sum_{k=1}^{h_{i-1}}(S\sigma(F_{i-1}))_{uk}(S\sigma(F_{i-1}))_{vk} \overset{h_{i-1}\to\infty}{=} c_\sigma\,\mathbb{E}\left[(S\sigma(F_{i-1}))_{uk}(S\sigma(F_{i-1}))_{vk}\right] \quad\text{(law of large numbers)}$$
$$= c_\sigma\,\mathbb{E}\left[\sum_{r=1}^{n}\sum_{s=1}^{n}S_{ur}S_{vs}\,\sigma(F_{i-1})_{rk}\,\sigma(F_{i-1})_{sk}\right] \overset{(a)}{=} \sum_{r=1}^{n}\sum_{s=1}^{n}S_{ur}(E_{i-1})_{rs}S^T_{sv} = S_{u.}E_{i-1}S_{.v}^T = (\Sigma_i)_{uv}, \tag{12}$$
(a): using $\mathbb{E}\left[(F_{i-1})_{rk}(F_{i-1})_{sk}\right] = (\Sigma_{i-1})_{rs}$ and the definition of $E_{i-1}$ in Theorem 1.

NTK for vanilla GCN. Let us first evaluate the tangent kernel component from $W_i$ with respect to nodes $u$ and $v$. The following two results are needed.

Result 1 (inner product of matrices). Let $a$ and $b$ be vectors of size $d_1\times 1$ and $d_2\times 1$. Then
$$\left\langle ab^T, ab^T\right\rangle = \mathrm{tr}\left(ab^T\left(ab^T\right)^T\right) = \mathrm{tr}\left(ab^Tba^T\right) = \left(a^Ta\right)\left(b^Tb\right) = \langle a,a\rangle\,\langle b,b\rangle.$$

Result 2 ($\left\langle (B_r)_{u.}, (B_r)_{v.}\right\rangle$). We evaluate $\left\langle (B_r)_{u.}, (B_r)_{v.}\right\rangle = \left(B_rB_r^T\right)_{uv}$ appearing in the gradient:
$$\left(B_rB_r^T\right)_{uv} = \frac{c_\sigma}{h_r}\sum_{k=1}^{h_r}\left(S^TB_{r+1}W_{r+1}^T\right)_{uk}\dot\sigma(F_r)_{uk}\left(S^TB_{r+1}W_{r+1}^T\right)_{vk}\dot\sigma(F_r)_{vk}.$$
Expanding the two factors over indices $(i,j)$ and $(i',j')$ and letting $h_r\to\infty$, the cross terms with $j\neq j'$ vanish, since the entries $(W_{r+1})_{kj}$ are independent with $\mathbb{E}\left[(W_{r+1})^2_{kj}\right]=1$ $(b)$, leaving
$$\left(B_rB_r^T\right)_{uv} \overset{(b)}{=} \left\langle \left(S^TB_{r+1}\right)_{u.}, \left(S^TB_{r+1}\right)_{v.}\right\rangle\, c_\sigma\,\mathbb{E}\left[\dot\sigma(F_r)_{uk}\,\dot\sigma(F_r)_{vk}\right] \tag{13}$$
$$= \left(SS^T\right)_{uv}\left(B_{r+1}B_{r+1}^T\right)_{uv}\left(\dot E_r\right)_{uv}. \tag{14}$$
Now we derive $\left\langle \left(\frac{\partial F}{\partial W_k}\right)_u, \left(\frac{\partial F}{\partial W_k}\right)_v\right\rangle$ using the above results:
$$\left\langle \left(\frac{\partial F}{\partial W_k}\right)_u, \left(\frac{\partial F}{\partial W_k}\right)_v\right\rangle = \left\langle (G_k)^T_{u.}(B_k)_{u.}, (G_k)^T_{v.}(B_k)_{v.}\right\rangle \overset{\text{Result 1}}{=} \left\langle (G_k)_{u.},(G_k)_{v.}\right\rangle\left\langle (B_k)_{u.},(B_k)_{v.}\right\rangle$$
$$\overset{(12),(14)}{\underset{(c),(d)}{=}} (\Sigma_k)_{uv}\left(SS^T\right)^{d+1-k}_{uv}\prod_{k'=k}^{d}\left(\dot E_{k'}\right)_{uv}, \tag{15}$$
(c): repeated application of (14); (d): definition of $B_{d+1}$, so $\left(B_{d+1}B_{d+1}^T\right)_{uv} = 1$. Extending (15) to all $n$ nodes, which results in an $n\times n$ matrix,
$$\mathbb{E}_{W_k}\left[\left\langle \frac{\partial F}{\partial W_k}, \frac{\partial F}{\partial W_k}\right\rangle\right] = \Sigma_k \odot \left(SS^T\right)^{\odot(d+1-k)}\odot \bigodot_{k'=k}^{d}\dot E_{k'}. \tag{16}$$
Finally, the NTK $\Theta$ is
$$\Theta = \sum_{k=1}^{d+1}\mathbb{E}_{W_k}\left[\left\langle \frac{\partial F}{\partial W_k}, \frac{\partial F}{\partial W_k}\right\rangle\right] = \sum_{k=1}^{d+1}\Sigma_k \odot \left(SS^T\right)^{\odot(d+1-k)}\odot \bigodot_{k'=k}^{d}\dot E_{k'}, \tag{17}$$
with the definitions of $\Sigma_k$ and $\dot E_k$ as mentioned in the theorem. □

A.2 THEOREM 2 AND COROLLARY 1: POPULATION NTK $\bar\Theta$ FOR DIFFERENT $S$

We consider Assumption 1, that is, a linear GCN with orthonormal features, and Assumption 2 without the assumption on $\gamma$. We first prove the result for $K=2$ and then extend it to $K$ classes. For ease of analysis we consider all nodes sorted by class, which implies $A$ is an $n\times n$ matrix with entries $p\pi_i\pi_j$ in the $\left[1,\frac n2\right]\times\left[1,\frac n2\right]$ and $\left[\frac n2{+}1,n\right]\times\left[\frac n2{+}1,n\right]$ blocks and $q\pi_i\pi_j$ in the $\left[1,\frac n2\right]\times\left[\frac n2{+}1,n\right]$ and $\left[\frac n2{+}1,n\right]\times\left[1,\frac n2\right]$ blocks. Therefore,
$$A = \pi\pi^T\odot\left(\frac{p+q}{2}\,\mathbf{1}\mathbf{1}^T + \frac{p-q}{2}\,\bar{\mathbf{1}}\bar{\mathbf{1}}^T\right) = \frac{p+q}{2}\,\pi\pi^T + \frac{p-q}{2}\,\bar\pi\bar\pi^T, \tag{18}$$
where $\bar\pi$ has entries $-\pi_i$ for $i\in\left[1,\frac n2\right]$ and $+\pi_i$ for $i\in\left[\frac n2+1,n\right]$. Let $D$ be the degree matrix of $A$; then $D = \frac{p+q}{2}\,\mathrm{diag}(\pi)$.
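Under Assumption 1 (linear activation, $XX^T = I_n$), the factors $\dot E_k$ are all-ones matrices and $\Sigma_k = S^kS^{kT}$, so the NTK reduces to $\bar\Theta^{(d)} = \sum_{k=1}^{d+1} S^kS^{kT}\odot\left(SS^T\right)^{\odot(d+1-k)}$, the form used throughout Section A.2. The following is a minimal NumPy sketch of this reduction; the helper name `population_ntk` is ours, not from the paper's released code.

```python
import numpy as np

def population_ntk(S, d):
    """Population NTK of a depth-d linear GCN (Assumption 1, XX^T = I_n):
    Theta(d) = sum_{k=1}^{d+1} (S^k S^kT) (Hadamard) (S S^T)^{(Hadamard) (d+1-k)}."""
    SST = S @ S.T
    Theta = np.zeros(S.shape)
    Sk = np.eye(S.shape[0])
    for k in range(1, d + 2):
        Sk = Sk @ S                                # S^k
        Theta += (Sk @ Sk.T) * SST ** (d + 1 - k)  # elementwise product and power
    return Theta
```

For $d=0$ this collapses to $SS^T$, which gives a quick sanity check of the implementation.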

A.2.1 SYMMETRIC DEGREE NORMALIZED ADJACENCY S sym

Now let us compute $S_{sym}$ using $A$ from (18) and its degree matrix $D$:
$$S_{sym} = D^{-\frac12}AD^{-\frac12} = \frac{2}{p+q}\,\mathrm{diag}(\pi)^{-\frac12}\left(\frac{p+q}{2}\,\pi\pi^T + \frac{p-q}{2}\,\bar\pi\bar\pi^T\right)\mathrm{diag}(\pi)^{-\frac12} = \pi^{\frac12}\pi^{\frac12 T} + \frac{p-q}{p+q}\,\bar\pi^{\frac12}\bar\pi^{\frac12 T} = U\Lambda U^T, \tag{19}$$
where $U = \left[\pi^{\frac12}\ \ \bar\pi^{\frac12}\right]\in\mathbb{R}^{n\times 2}$ has rows $\left(\sqrt{\pi_i},\,\pm\sqrt{\pi_i}\right)$ and $\Lambda = \mathrm{diag}(1, r)$ with $r = \frac{p-q}{p+q}$. Note that $\pi^{\frac12 T}\pi^{\frac12} = \bar\pi^{\frac12 T}\bar\pi^{\frac12} = 1$ and $\pi^{\frac12 T}\bar\pi^{\frac12} = 0$, thus $U^TU = I_2$ and (19) is the singular value decomposition of $S_{sym}$. To compute the population NTK $\bar\Theta^{(d)}_{sym}$ in (4), we need $S_{sym}^k S_{sym}^{kT}$. Using (19),
$$S_{sym}^k S_{sym}^{kT} = U\Lambda^{2k}U^T,\qquad \left(S_{sym}^k S_{sym}^{kT}\right)_{ij} = \left(1+\delta_{ij}r^{2k}\right)\sqrt{\pi_i\pi_j},\qquad \delta_{ij} = (-1)^{\mathbb{1}[C_i\neq C_j]}, \tag{20}$$
that is, the in-class blocks have entries $\left(1+r^{2k}\right)\sqrt{\pi_i\pi_j}$ and the out-of-class blocks $\left(1-r^{2k}\right)\sqrt{\pi_i\pi_j}$. Consequently, the population NTK $\bar\Theta^{(d)}_{sym}$ for nodes $i$ and $j$ using (20) is
$$\bar\Theta^{(d)}_{sym,ij} = \sum_{k=1}^{d+1}\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^{2k}\right)\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+1-k} = \sum_{k=1}^{d+1}\left(\sqrt{\pi_i\pi_j}\right)^{d+2-k}\left(1+\delta_{ij}r^2\right)^{d+1-k} + \delta_{ij}\sum_{k=1}^{d+1}\left(\sqrt{\pi_i\pi_j}\right)^{d+2-k}r^{2k}\left(1+\delta_{ij}r^2\right)^{d+1-k}$$
$$= \sqrt{\pi_i\pi_j}\,\frac{1-\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+1}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)} + \delta_{ij}\sqrt{\pi_i\pi_j}\,r^{2(d+1)}\,\frac{1-\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)r^{-2}\right)^{d+1}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)r^{-2}}. \tag{21}$$
Since we consider $\sum_{i\in C_k}\pi_i = 1/K$, the maximum of $\sqrt{\pi_i\pi_j}$ is smaller than $1/4$ for $K=2$. This implies $\sqrt{\pi_i\pi_j}\left(1+r^2\right) < 1$. Therefore, the NTK at $d\to\infty$ is
$$\bar\Theta^{(\infty)}_{sym,ij} = \frac{\sqrt{\pi_i\pi_j}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}. \tag{22}$$
Equations (21) and (22) prove the forms of the population NTK $\bar\Theta^{(d)}_{sym}$ and $\bar\Theta^{(\infty)}_{sym}$ in Theorem 2 and Corollary 1, respectively. □
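The derivation above can be checked numerically: for the expected two-class DC-SBM adjacency (18), $S_{sym}$ has exactly two nonzero singular values, $1$ and $r$, and the truncated sum (21) converges entrywise to the closed form (22). A small sketch under the stated setup (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 0.8, 0.2
r = (p - q) / (p + q)

# heterogeneous degree corrections, each class summing to 1/2 (K = 2)
pi = rng.uniform(0.5, 1.0, n)
pi[: n // 2] /= 2 * pi[: n // 2].sum()
pi[n // 2 :] /= 2 * pi[n // 2 :].sum()

# expected DC-SBM adjacency (18) and S_sym (19)
sign = np.where(np.arange(n) < n // 2, -1.0, 1.0)
A = 0.5 * (p + q) * np.outer(pi, pi) + 0.5 * (p - q) * np.outer(sign * pi, sign * pi)
d_inv_sqrt = np.diag(A.sum(1) ** -0.5)
S = d_inv_sqrt @ A @ d_inv_sqrt

svals = np.linalg.svd(S, compute_uv=False)  # rank two: singular values 1 and |r|

# truncated population NTK at large depth vs. the closed form (22)
d = 60
SST = S @ S.T
Theta_d = sum(np.linalg.matrix_power(S, k) @ np.linalg.matrix_power(S, k).T
              * SST ** (d + 1 - k) for k in range(1, d + 2))
delta = np.outer(sign, sign)                # +1 same class, -1 otherwise
root = np.sqrt(np.outer(pi, pi))
Theta_inf = root / (1 - root * (1 + delta * r ** 2))
```

At this depth the two agree to machine precision, since both geometric ratios in (21) are far below one.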

A.2.2 ROW DEGREE NORMALIZED ADJACENCY S row

The assumption on $\gamma$ in Assumption 2 serves only to simplify the expression of the population NTK for $S_{row}$; we derive it without this assumption in the following. We first derive $S_{row}^k S_{row}^{kT}$. Since
$$S_{row} = D^{-1}A = D^{-\frac12}\left(D^{-\frac12}AD^{-\frac12}\right)D^{\frac12} = D^{-\frac12}U\Lambda U^TD^{\frac12},$$
we have $S_{row}^k = D^{-\frac12}U\Lambda^k U^TD^{\frac12}$ and
$$S_{row}^k S_{row}^{kT} = D^{-\frac12}U\Lambda^k U^T D\, U\Lambda^k U^T D^{-\frac12} = \hat U\Lambda^k\left(U^TDU\right)\Lambda^k\hat U^T,\qquad \hat U = D^{-\frac12}U = \sqrt{\tfrac{2}{p+q}}\begin{bmatrix}\mathbf{1}_n & \bar{\mathbf{1}}_n\end{bmatrix}.$$
Writing $\lambda = \sum_{s=1}^{n/2}\pi_s^2$ and $\mu = \sum_{s=n/2+1}^{n}\pi_s^2$, a direct computation gives
$$\left(S_{row}^k S_{row}^{kT}\right)_{ij} = \begin{cases}\left(1+r^k\right)^2\lambda + \left(1-r^k\right)^2\mu & \text{if } i \text{ and } j \in \text{class } 1,\\[2pt] \left(1+r^k\right)\left(1-r^k\right)(\lambda+\mu) & \text{if } i \text{ and } j \text{ are not in the same class},\\[2pt] \left(1-r^k\right)^2\lambda + \left(1+r^k\right)^2\mu & \text{if } i \text{ and } j \in \text{class } 2.\end{cases} \tag{23}$$
Note that each block is a constant, independent of the individual $\pi_i$. Using (23), the NTK in (4) for $i$ and $j$ belonging to class 1 is, with $c_1 := (1+r)^2\lambda + (1-r)^2\mu$,
$$\bar\Theta^{(d)}_{row,ij} = \sum_{k=1}^{d+1}\left(\left(1+r^k\right)^2\lambda + \left(1-r^k\right)^2\mu\right)c_1^{\,d+1-k} = \sum_{k=1}^{d+1}(\lambda+\mu)\,c_1^{\,d+1-k} + \sum_{k=1}^{d+1}2(\lambda-\mu)\,r^k c_1^{\,d+1-k} + \sum_{k=1}^{d+1}(\lambda+\mu)\,r^{2k}c_1^{\,d+1-k}$$
$$= (\lambda+\mu)\,\frac{1-c_1^{\,d+1}}{1-c_1} + 2(\lambda-\mu)\,r^{d+1}\,\frac{1-c_1^{\,d+1}r^{-(d+1)}}{1-c_1 r^{-1}} + (\lambda+\mu)\,r^{2(d+1)}\,\frac{1-c_1^{\,d+1}r^{-2(d+1)}}{1-c_1 r^{-2}}. \tag{24}$$
Similarly, for $i$ and $j$ in class 2, with $c_2 := (1-r)^2\lambda + (1+r)^2\mu$,
$$\bar\Theta^{(d)}_{row,ij} = (\lambda+\mu)\,\frac{1-c_2^{\,d+1}}{1-c_2} - 2(\lambda-\mu)\,r^{d+1}\,\frac{1-c_2^{\,d+1}r^{-(d+1)}}{1-c_2 r^{-1}} + (\lambda+\mu)\,r^{2(d+1)}\,\frac{1-c_2^{\,d+1}r^{-2(d+1)}}{1-c_2 r^{-2}}, \tag{25}$$
and for $i$ and $j$ in different classes, with $c_3 := \left(1-r^2\right)(\lambda+\mu)$,
$$\bar\Theta^{(d)}_{row,ij} = (\lambda+\mu)\,\frac{1-c_3^{\,d+1}}{1-c_3} - (\lambda+\mu)\,r^{2(d+1)}\,\frac{1-c_3^{\,d+1}r^{-2(d+1)}}{1-c_3 r^{-2}}. \tag{26}$$
Since $\lambda, \mu < \frac14$, we have $c_1, c_2, c_3 < 1$, so as $d\to\infty$ only the first term of each expression survives:
$$\bar\Theta^{(\infty)}_{row,ij} = \frac{\lambda+\mu}{1-c_1},\qquad \frac{\lambda+\mu}{1-c_2},\qquad \frac{\lambda+\mu}{1-c_3} \tag{27}$$
for the three cases, respectively. When the assumption on $\gamma$ is introduced, there exists $\gamma$ such that $\lambda = \mu = \gamma$, hence $\lambda - \mu = 0$. Substituting $\lambda+\mu = 2\gamma$ and $\lambda-\mu = 0$, equations (24), (25) and (26) of the population NTK $\bar\Theta^{(d)}_{row}$ and (27) of $\bar\Theta^{(\infty)}_{row}$ reduce to the expressions in Theorem 2 and Corollary 1, respectively. □
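The contrast between (23) and (20) can also be seen numerically: the blocks of the $S_{row}$ kernel are exactly constant, while those of the $S_{sym}$ kernel are modulated by $\sqrt{\pi_i\pi_j}$. A sketch comparing the two on the expected DC-SBM; the `ntk` helper implements the linear-GCN population NTK, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, d = 100, 0.8, 0.2, 8

pi = rng.uniform(0.5, 1.0, n)
pi[: n // 2] /= 2 * pi[: n // 2].sum()
pi[n // 2 :] /= 2 * pi[n // 2 :].sum()

sign = np.where(np.arange(n) < n // 2, -1.0, 1.0)
A = 0.5 * (p + q) * np.outer(pi, pi) + 0.5 * (p - q) * np.outer(sign * pi, sign * pi)

def ntk(S, d):
    # population NTK of a depth-d linear GCN: sum_k S^k S^kT (Hadamard) (SS^T)^{(d+1-k)}
    SST = S @ S.T
    Sk, Theta = np.eye(S.shape[0]), np.zeros(S.shape)
    for k in range(1, d + 2):
        Sk = Sk @ S
        Theta += (Sk @ Sk.T) * SST ** (d + 1 - k)
    return Theta

deg = A.sum(1)
T_sym = ntk(np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5), d)
T_row = ntk(np.diag(1.0 / deg) @ A, d)

# coefficient of variation of the in-class block: ~0 for S_row (eq. 23), clearly positive for S_sym
blk = lambda T: T[: n // 2, : n // 2]
cv = lambda T: blk(T).std() / blk(T).mean()
```

The vanishing variation for `T_row` reflects that each block of (23) is independent of the individual degree corrections.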

A.2.3 COLUMN NORMALIZED ADJACENCY S col

In this section we derive the population NTK $\bar\Theta^{(d)}_{col}$. Since
$$S_{col} = AD^{-1} = D^{\frac12}\left(D^{-\frac12}AD^{-\frac12}\right)D^{-\frac12} = D^{\frac12}U\Lambda U^TD^{-\frac12},$$
we have $S_{col}^k = D^{\frac12}U\Lambda^k U^TD^{-\frac12}$ and
$$S_{col}^k S_{col}^{kT} = D^{\frac12}U\Lambda^k U^T D^{-1} U\Lambda^k U^T D^{\frac12} = \tilde U\Lambda^k\left(U^TD^{-1}U\right)\Lambda^k\tilde U^T,\qquad \tilde U = D^{\frac12}U = \sqrt{\tfrac{p+q}{2}}\begin{bmatrix}\pi & \bar\pi\end{bmatrix}.$$
Since $U^TD^{-1}U = \frac{2n}{p+q}\,I_2$, this yields
$$\left(S_{col}^k S_{col}^{kT}\right)_{ij} = n\pi_i\pi_j\left(1+\delta_{ij}r^{2k}\right), \tag{28}$$
that is, in-class blocks with entries $n\pi_i\pi_j\left(1+r^{2k}\right)$ and out-of-class blocks with entries $n\pi_i\pi_j\left(1-r^{2k}\right)$. Therefore, following the same steps as in (21), $\bar\Theta^{(d)}_{col}$ is
$$\bar\Theta^{(d)}_{col,ij} = \sum_{k=1}^{d+1}n\pi_i\pi_j\left(1+\delta_{ij}r^{2k}\right)\left(n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)\right)^{d+1-k} = n\pi_i\pi_j\,\frac{1-\left(n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)\right)^{d+1}}{1-n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)} + \delta_{ij}\,n\pi_i\pi_j\,r^{2(d+1)}\,\frac{1-\left(n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)r^{-2}\right)^{d+1}}{1-n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)r^{-2}}. \tag{29}$$
Since $\sum_i\pi_i = 1$, we have $\pi_i = O\!\left(\frac1n\right)$, so $n\pi_i\pi_j\left(1+r^2\right) < 1$. Therefore, using (29),
$$\bar\Theta^{(\infty)}_{col,ij} = \frac{n\pi_i\pi_j}{1-n\pi_i\pi_j\left(1+\delta_{ij}r^2\right)}. \tag{30}$$
Hence, equations (29) and (30) prove the forms of the population NTK $\bar\Theta^{(d)}_{col}$ and $\bar\Theta^{(\infty)}_{col}$ in Theorem 2 and Corollary 1, respectively. □
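Equation (28) is easy to verify numerically for the expected adjacency; note the explicit degree-correction factor $n\pi_i\pi_j$ that persists in the kernel, unlike the constant blocks of $S_{row}$. A sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, k = 100, 0.8, 0.2, 3
r = (p - q) / (p + q)

pi = rng.uniform(0.5, 1.0, n)
pi[: n // 2] /= 2 * pi[: n // 2].sum()
pi[n // 2 :] /= 2 * pi[n // 2 :].sum()

sign = np.where(np.arange(n) < n // 2, -1.0, 1.0)
A = 0.5 * (p + q) * np.outer(pi, pi) + 0.5 * (p - q) * np.outer(sign * pi, sign * pi)
S_col = A @ np.diag(1.0 / A.sum(1))         # column normalization A D^{-1}

Sk = np.linalg.matrix_power(S_col, k)
lhs = Sk @ Sk.T
delta = np.outer(sign, sign)                # +1 same class, -1 otherwise
rhs = n * np.outer(pi, pi) * (1 + delta * r ** (2 * k))  # eq. (28)
```

Here `lhs` and `rhs` agree entrywise, confirming the rank-two structure of the column-normalized kernel.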

A.2.4 UNNORMALIZED ADJACENCY S adj

We can rewrite $A$ as
$$A = \pi\pi^T\odot\begin{pmatrix}p\,\mathbf{1}\mathbf{1}^T & q\,\mathbf{1}\mathbf{1}^T\\ q\,\mathbf{1}\mathbf{1}^T & p\,\mathbf{1}\mathbf{1}^T\end{pmatrix} = \mathrm{diag}(\pi)\begin{pmatrix}p\,\mathbf{1}\mathbf{1}^T & q\,\mathbf{1}\mathbf{1}^T\\ q\,\mathbf{1}\mathbf{1}^T & p\,\mathbf{1}\mathbf{1}^T\end{pmatrix}\mathrm{diag}(\pi), \tag{31}$$
where each block is of size $\frac n2\times\frac n2$. We invoke the $\gamma$ assumption ($\lambda=\mu=\gamma$) for the analysis of the unnormalized adjacency to simplify the computation, but the result holds without this assumption. From (31),
$$A^2 = \gamma\,\mathrm{diag}(\pi)\begin{pmatrix}p^2+q^2 & 2pq\\ 2pq & p^2+q^2\end{pmatrix}\mathrm{diag}(\pi),\qquad A^4 = \gamma^3\,\mathrm{diag}(\pi)\begin{pmatrix}p^4+q^4+6p^2q^2 & 4p^3q+4pq^3\\ 4p^3q+4pq^3 & p^4+q^4+6p^2q^2\end{pmatrix}\mathrm{diag}(\pi),$$
where the $2\times 2$ matrices denote constant blocks. Note that in $A^{2k}$ the in-class blocks collect the even powers of the binomial expansion of $(p+q)^{2k}$, whereas the out-of-class blocks collect the odd powers. We compute the filter $S_{adj} = \frac1n A$ using this fact, so $S_{adj}^k = \frac{1}{n^k}A^k$ and
$$S_{adj}^k S_{adj}^{kT} = \frac{1}{n^{2k}}A^{2k},\qquad \left(S_{adj}^k S_{adj}^{kT}\right)_{ij} = \begin{cases}\dfrac{\pi_i\pi_j\,\gamma^{2k-1}}{n^{2k}}\displaystyle\sum_{l=0}^{k}\binom{2k}{2l}p^{2k-2l}q^{2l} & \text{if } i \text{ and } j \in \text{same class},\\[10pt] \dfrac{\pi_i\pi_j\,\gamma^{2k-1}}{n^{2k}}\displaystyle\sum_{l=0}^{k-1}\binom{2k}{2l+1}p^{2k-2l-1}q^{2l+1} & \text{if } i \text{ and } j \in \text{different classes}.\end{cases}$$
Substituting into (4),
$$\bar\Theta^{(d)}_{adj,ij} = \begin{cases}\displaystyle\sum_{k=1}^{d+1}\left(\pi_i\pi_j\right)^{d+2-k}\frac{\gamma^{k+d}}{n^{2(d+1)}}\left(p^2+q^2\right)^{d+1-k}\sum_{l=0}^{k}\binom{2k}{2l}p^{2k-2l}q^{2l} & \text{if } i \text{ and } j \in \text{same class},\\[12pt] \displaystyle\sum_{k=1}^{d+1}\left(\pi_i\pi_j\right)^{d+2-k}\frac{\gamma^{k+d}}{n^{2(d+1)}}\left(2pq\right)^{d+1-k}\sum_{l=0}^{k-1}\binom{2k}{2l+1}p^{2k-2l-1}q^{2l+1} & \text{if } i \text{ and } j \in \text{different classes}.\end{cases}$$
The above form is not simplified further, as it is not an interesting case: the gap between the two blocks disappears rapidly and $\bar\Theta^{(\infty)}_{adj,ij} = 0$. There is no information in the kernel, proving both Theorem 2 and Corollary 1 for $S_{adj}$. □

A.2.5 NUMBER OF CLASSES K > 2

From the above derivation for $K=2$, it can be seen that once $S_{sym}^k S_{sym}^{kT}$ is computed, the population NTK for all the graph convolutions can be derived from it. Therefore, we derive it for $K>2$, which suffices to show the conclusions of Theorem 2 and Corollary 1. We denote by $\bar\pi_{1k}$ the vector with entries $-\pi_i$ for $i\in\left[1,\frac nK\right]$, $+\pi_i$ for $i\in\left[\frac{n(k-1)}{K},\frac{nk}{K}\right]$, and $0$ for the rest.
With this definition, $A$ for $K$ classes is
$$A = \frac{p+(K-1)q}{K}\,\pi\pi^T + \frac{p-q}{K}\sum_{l=2}^{K}\bar\pi_{1l}\bar\pi_{1l}^T, \tag{32}$$
and the degree matrix for $K$ classes is $D = \frac{p+(K-1)q}{K}\,\mathrm{diag}(\pi)$ from (32). We can compute $S_{sym}$ using $A$ and $D$ as
$$S_{sym} = D^{-\frac12}AD^{-\frac12} = \pi^{\frac12}\pi^{\frac12 T} + \frac{p-q}{p+(K-1)q}\sum_{l=2}^{K}\bar\pi_{1l}^{\frac12}\bar\pi_{1l}^{\frac12 T},$$
so that $\left(S_{sym}^k S_{sym}^{kT}\right)_{ij}$ takes the same form as in the $K=2$ case, with the ratio $r = \frac{p-q}{p+q}$ replaced by $\frac{p-q}{p+(K-1)q}$ (equation (33)). It is noted that (33) is very similar to (20) for $K=2$. The further derivations of the population NTKs $\bar\Theta$ for all the convolutions are analogous, and the theoretical results extend without any issues.

A.3 NTK FOR GCN WITH SKIP CONNECTIONS (COROLLARIES 2 AND 3)

We observe that the definitions of $G_i$, $i\in[1,d+1]$, differ for GCNs with skip connections from the vanilla GCN. Despite the difference, the definition of the gradient with respect to $W_i$ in (8) does not change, as the $G_i$ in the gradient accounts for the change; moreover, there is no new learnable parameter, since the input transformation $H_0 = XW_0$, where $(W_0)_{ij}$ is sampled from $\mathcal N(0,1)$, is not learnable in our setting. Given that the gradient definition holds for GCNs with skip connections, the NTK retains the form from the vanilla GCN, as evident from the derivation in Section A.1. The change in $G_i$ only affects the co-variance between nodes. Hence, we derive the co-variance matrices for Skip-PC and Skip-α in the following.

Skip-PC: co-variance between nodes. The co-variance between nodes $u$ and $v$ in $F_1$ and $F_i$ is derived below. By the law of large numbers,
$$\mathbb E\left[(F_1)_{uk}(F_1)_{vk}\right] = \left\langle (G_1)_{u.}, (G_1)_{v.}\right\rangle = \frac{c_\sigma}{h}\sum_{k=1}^{h}(S\sigma_s(H_0))_{uk}(S\sigma_s(H_0))_{vk} \overset{h\to\infty}{=} c_\sigma\,\mathbb E\left[(S\sigma_s(H_0))_{uk}(S\sigma_s(H_0))_{vk}\right] = S_{u.}\tilde E_0 S_{.v}^T = (\Sigma_1)_{uv}, \tag{34}$$
where $\tilde E_0 = c_\sigma\,\mathbb E_{F\sim\mathcal N(0,\,XX^T)}\left[\sigma_s(F)\sigma_s(F)^T\right]$. For the deeper layers,
$$\mathbb E\left[(F_i)_{uk}(F_i)_{vk}\right] = \left\langle (G_i)_{u.}, (G_i)_{v.}\right\rangle \overset{h\to\infty}{=} c_\sigma\,\mathbb E\left[\left(S\sigma(F_{i-1})+S\sigma_s(H_0)\right)_{uk}\left(S\sigma(F_{i-1})+S\sigma_s(H_0)\right)_{vk}\right]$$
$$= c_\sigma\,\mathbb E\left[(S\sigma(F_{i-1}))_{uk}(S\sigma(F_{i-1}))_{vk}\right] + c_\sigma\,\mathbb E\left[(S\sigma(F_{i-1}))_{uk}(S\sigma_s(H_0))_{vk}\right] + c_\sigma\,\mathbb E\left[(S\sigma_s(H_0))_{uk}(S\sigma(F_{i-1}))_{vk}\right] + c_\sigma\,\mathbb E\left[(S\sigma_s(H_0))_{uk}(S\sigma_s(H_0))_{vk}\right]$$
$$\overset{(f)}{=} S_{u.}E_{i-1}S_{.v}^T + S_{u.}\tilde E_0 S_{.v}^T = S_{u.}E_{i-1}S_{.v}^T + (\Sigma_1)_{uv} = (\Sigma_i)_{uv}. \tag{35}$$
$(f)$: $\mathbb E\left[(S\sigma(F_{i-1}))_{uk}(S\sigma_s(XW_0))_{vk}\right]$ and $\mathbb E\left[(S\sigma_s(XW_0))_{uk}(S\sigma(F_{i-1}))_{vk}\right]$ evaluate to $0$ by conditioning on $W_0$ first and rewriting the expectation based on this conditioning. The terms within the expectation are independent when conditioned on $W_0$, and hence the expectation equals $\mathbb E_{W_0}\left[\mathbb E_{\Sigma_{i-1}|W_0}\left[(S\sigma(F_{i-1}))_{uk}\mid W_0\right]\mathbb E_{\Sigma_{i-1}|W_0}\left[(S\sigma_s(XW_0))_{vk}\mid W_0\right]\right]$, taking the width $h$ in $W_0$ to infinity first; here $\mathbb E_{\Sigma_{i-1}|W_0}\left[(S\sigma_s(XW_0))_{vk}\mid W_0\right] = 0$. We thus get the co-variance matrices for all pairs of nodes, $\Sigma_1 = S\tilde E_0 S^T$ and $\Sigma_i = SE_{i-1}S^T + \Sigma_1$, from (34) and (35).

Skip-α: co-variance between nodes. Let $u$ and $v$ be two nodes; the co-variance between $u$ and $v$ in $F_1$ and $F_i$ is derived below.
$$\mathbb E\left[(F_1)_{uk}(F_1)_{vk}\right] \overset{h\to\infty}{=} c_\sigma\,\mathbb E\left[\left((1-\alpha)S\sigma_s(H_0)+\alpha\sigma_s(H_0)\right)_{uk}\left((1-\alpha)S\sigma_s(H_0)+\alpha\sigma_s(H_0)\right)_{vk}\right]$$
$$= (1-\alpha)^2\,S_{u.}\tilde E_0 S_{.v}^T + (1-\alpha)\alpha\left(S_{u.}\left(\tilde E_0\right)_{.v} + \left(\tilde E_0\right)_{u.}S_{.v}^T\right) + \alpha^2\left(\tilde E_0\right)_{uv} = (\Sigma_1)_{uv}, \tag{36}$$
$$\mathbb E\left[(F_i)_{uk}(F_i)_{vk}\right] \overset{h\to\infty}{=} c_\sigma\,\mathbb E\left[\left((1-\alpha)S\sigma(F_{i-1})+\alpha\sigma_s(H_0)\right)_{uk}\left((1-\alpha)S\sigma(F_{i-1})+\alpha\sigma_s(H_0)\right)_{vk}\right] \overset{(g)}{=} (1-\alpha)^2\,S_{u.}E_{i-1}S_{.v}^T + \alpha^2\left(\tilde E_0\right)_{uv} = (\Sigma_i)_{uv}, \tag{37}$$
$(g)$: the cross terms vanish by the same argument as $(f)$ in the derivation for Skip-PC. Hence, from (36) and (37), the co-variance matrices for all pairs of nodes are $\Sigma_1 = (1-\alpha)^2 S\tilde E_0 S^T + \alpha(1-\alpha)\left(S\tilde E_0 + \tilde E_0 S^T\right) + \alpha^2\tilde E_0$ and $\Sigma_i = (1-\alpha)^2 SE_{i-1}S^T + \alpha^2\tilde E_0$.

A.4 THEOREM 3: POPULATION NTK $\bar\Theta$ FOR SKIP-PC

The NTK at depth $d$ for Skip-PC with linear activations is
$$\Theta^{(d)}_{PC} = \sum_{k=1}^{d+1}\left(S^kS^{kT} + SS^T\right)\odot\left(SS^T\right)^{\odot(d+1-k)} = \sum_{k=1}^{d+1}\underbrace{S^kS^{kT}\odot\left(SS^T\right)^{\odot(d+1-k)}}_{\text{I}} + \underbrace{\left(SS^T\right)^{\odot(d+2-k)}}_{\text{II}}. \tag{38}$$
In (38), term I is the NTK without skip connections, and term II is computed for $S_{sym}$ and $S_{row}$ as follows. Computing II for the population NTK with $S_{sym}$, for nodes $i$ and $j$,
$$\sum_{k=1}^{d+1}\left(\left(S_{sym}S_{sym}^T\right)^{\odot(d+2-k)}\right)_{ij} = \sum_{k=1}^{d+1}\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+2-k} = \sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\frac{1-\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+1}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)} \overset{d\to\infty}{=} \frac{\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}, \tag{39}$$
which converges as $d\to\infty$ since $\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right) < 1$ according to our setup. Therefore, using (39) and (22), we get the population NTK $\bar\Theta^{(\infty)}_{PC,sym}$ for Skip-PC at $d\to\infty$,
$$\bar\Theta^{(\infty)}_{PC,sym,ij} = \frac{\sqrt{\pi_i\pi_j}\left(2+\delta_{ij}r^2\right)}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}, \tag{40}$$
hence deriving Theorem 3 for $S_{sym}$. □ Similarly, computing II for $S_{row}$ without the assumption on $\gamma$, for $i$ and $j$ in class 1,
$$\sum_{k=1}^{d+1}\left(\left(S_{row}S_{row}^T\right)^{\odot(d+2-k)}\right)_{ij} = \sum_{k=1}^{d+1}c_1^{\,d+2-k} = c_1\,\frac{1-c_1^{\,d+1}}{1-c_1},\qquad c_1 = (1+r)^2\lambda + (1-r)^2\mu,$$
with the analogous expressions for class 2 (with $c_2$) and for different classes (with $c_3$). Combined with (27) and the $\gamma$ assumption, these yield the Skip-PC expression for $S_{row}$ in Theorem 3. □

A.5 THEOREM 4: POPULATION NTK $\bar\Theta$ FOR SKIP-α

We first expand $\Sigma_1$ and $\Sigma_k$ of Skip-α to derive the population NTK; the exact NTK of depth $d$ for Skip-α then decomposes into three groups of terms, denoted I, II and III (equation (44)). We compute I, II and III of (44) for the population NTK $\bar\Theta^{(\infty)}_{\alpha}$ using $S_{sym}$, focusing on $d\to\infty$.
$$\text{I}_{ij} = (1-\alpha)^{2(d+1)}\sqrt{\pi_i\pi_j}\left[\frac{1-\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)(1-\alpha)^{-2}\right)^{d+1}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)(1-\alpha)^{-2}} + \delta_{ij}\,r^{2(d+1)}\,\frac{1-\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)r^{-2}(1-\alpha)^{-2}\right)^{d+1}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)r^{-2}(1-\alpha)^{-2}}\right] \overset{d\to\infty}{=} 0, \tag{45}$$
$$\text{II} = 2\alpha\sum_{k=1}^{d+1}(1-\alpha)^{2k-1}\,S_{sym}^{2k-1}\odot\left(S_{sym}S_{sym}^T\right)^{\odot(d+1-k)}\quad\left(S_{sym} = S_{sym}^T\right),$$
$$\text{II}_{ij} = 2\alpha\sum_{k=1}^{d+1}(1-\alpha)^{2k-1}\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^{2k-1}\right)\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+1-k} \overset{d\to\infty}{=} 0, \tag{46}$$
since both $(1-\alpha)^2 < 1$ and $\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right) < 1$. For the third group of terms,
$$\text{III} = \alpha^2\sum_{k=1}^{d+1}\sum_{l=0}^{k-1}(1-\alpha)^{2l}\,S_{sym}^{l}S_{sym}^{lT}\odot\left(S_{sym}S_{sym}^T\right)^{\odot(d+1-k)},$$
$$\text{III}_{ij} = \alpha^2\sum_{k=1}^{d+1}\left(\sum_{l=0}^{k-1}(1-\alpha)^{2l}\left(1+\delta_{ij}r^{2l}\right)\right)\sqrt{\pi_i\pi_j}\left(\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)\right)^{d+1-k} \overset{d\to\infty}{=} \frac{\alpha^2\sqrt{\pi_i\pi_j}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}\left(\frac{1}{1-(1-\alpha)^2} + \frac{\delta_{ij}}{1-(1-\alpha)^2r^2}\right). \tag{47}$$
Therefore, the population NTK as $d\to\infty$ is obtained by combining (45), (46) and (47):
$$\bar\Theta^{(\infty)}_{\alpha,sym,ij} = \frac{\alpha^2\sqrt{\pi_i\pi_j}}{1-\sqrt{\pi_i\pi_j}\left(1+\delta_{ij}r^2\right)}\left(\frac{1}{1-(1-\alpha)^2} + \frac{\delta_{ij}}{1-(1-\alpha)^2r^2}\right), \tag{48}$$
proving Theorem 4 for $S_{sym}$. □ We now compute I, II and III for the population NTK $\bar\Theta^{(\infty)}_{\alpha}$ using $S_{row}$. Terms I and II vanish as $d\to\infty$ for any $i$ and $j$, similarly to $S_{sym}$. For $i$ and $j$ in class 1,
$$\text{III}_{ij} \overset{d\to\infty}{=} \alpha^2\left(\frac{\lambda+\mu}{1-(1-\alpha)^2} + \frac{2(\lambda-\mu)}{1-(1-\alpha)^2r} + \frac{\lambda+\mu}{1-(1-\alpha)^2r^2}\right)\frac{1}{1-\left((1+r)^2\lambda+(1-r)^2\mu\right)}. \tag{49}$$
Similarly, for $i$ and $j$ in class 2,
$$\text{III}_{ij} \overset{d\to\infty}{=} \alpha^2\left(\frac{\lambda+\mu}{1-(1-\alpha)^2} - \frac{2(\lambda-\mu)}{1-(1-\alpha)^2r} + \frac{\lambda+\mu}{1-(1-\alpha)^2r^2}\right)\frac{1}{1-\left((1-r)^2\lambda+(1+r)^2\mu\right)}, \tag{50}$$
and for $i$ and $j$ in different classes,
$$\text{III}_{ij} \overset{d\to\infty}{=} \alpha^2\left(\frac{\lambda+\mu}{1-(1-\alpha)^2} - \frac{\lambda+\mu}{1-(1-\alpha)^2r^2}\right)\frac{1}{1-\left(1-r^2\right)(\lambda+\mu)}. \tag{51}$$
Thus, applying the $\gamma$ assumption ($\lambda = \mu = \gamma$) to (49), (50) and (51), the population NTK $\bar\Theta^{(\infty)}_{\alpha,row}$ as $d\to\infty$ is
$$\bar\Theta^{(\infty)}_{\alpha,row,ij} = \frac{2\gamma\alpha^2}{1-2\gamma\left(1+\delta_{ij}r^2\right)}\left(\frac{1}{1-(1-\alpha)^2} + \frac{\delta_{ij}}{1-(1-\alpha)^2r^2}\right), \tag{52}$$
hence proving Theorem 4. □

B EMPIRICAL ANALYSIS

B.1 EXPERIMENTAL DETAILS OF FIGURE 1

We use the code for GCN without skip connections from github1 (Kipf & Welling, 2017) and for skip connections from github2 (Chen et al., 2020). The following hyperparameters are used for GCN without skip connections: learning rate 0.01, weight decay 5e-4, hidden layer width 64, and 500, 1500 and 2000 epochs for depths 2, 4 and 8, respectively. For the skip connections, we used the GCNII model with the same parameters as the vanilla GCN and α = 0.1. The performance is averaged over 5 runs.
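For reference, the two convolutions compared throughout can be built as follows. This is a sketch of the standard preprocessing, not the authors' released code, and the helper name `graph_conv` is ours; adding self loops follows the usual GCN implementation:

```python
import numpy as np

def graph_conv(A, mode="sym", self_loops=True):
    """Build a graph convolution operator from an adjacency matrix.

    Adding self loops (A + I) before normalizing follows the standard GCN
    preprocessing; set self_loops=False for the plain operators analyzed above."""
    A = A + np.eye(A.shape[0]) if self_loops else A.astype(float)
    deg = A.sum(axis=1)
    if mode == "sym":                      # S_sym = D^{-1/2} A D^{-1/2}
        d = deg ** -0.5
        return A * np.outer(d, d)
    if mode == "row":                      # S_row = D^{-1} A
        return A / deg[:, None]
    raise ValueError(f"unknown mode: {mode}")
```

A quick property check: the row-normalized operator has rows summing to 1, while the symmetric one stays symmetric for symmetric A.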

B.2 COMPARISON OF GCN AND NTK

Although it is theoretically clear that the infinite width assumption should not affect the observations made on the performance of GCN with S_sym and S_row in Figure 1, we illustrate the same using the graph NTK. Figure 5 shows that the observation holds for the graph NTK as well, thus supporting our theoretical argument. For the experiments, we fix the size of the sampled graphs to n = 1000, with p = 0.8 and q = 0.1 for the homophilic DC-SBM, p = 0.1 and q = 0.8 for the heterophilic DC-SBM, and p = q = 1 for the core-periphery DC-SBM. π is sampled uniformly in [0, 1] for homophily and heterophily, and π_i ∼ Unif(0.5, 1) ∀i ∈ core and π_i ∼ Unif(0, 0.5) ∀i ∈ periphery for the core-periphery DC-SBM.

Illustration of the impact of depth in vanilla GCN using homophilic DC-SBM. We show the impact of depth in vanilla GCN using the homophilic DC-SBM in Figure 6. The DC-SBM is shown in the first column, and columns 2 and 3 show the exact NTK at depths 1 and 8 for symmetric and row normalization, respectively. The plots clearly illustrate the complete loss of class information in symmetric normalization with depth (column 2). While the prevalence of the block difference has decreased in row normalization over depth (column 3), the block/community structure is still retained, showing the strong representation power of S_row.

Illustration of S_col and S_adj in vanilla GCN using homophilic DC-SBM. We extend the numerical validation on random graphs using vanilla GCN described in Section 3.2 to the column normalized adjacency S_col and the unnormalized adjacency S_adj here. We use the same setup described in Section 3.2, and Figure 7 illustrates the results. We observe that even at depth 1 both convolutions are influenced by the degree correction, and there is no class information in the kernels at higher depth. This validates the theoretical result in Theorem 2.

Illustration of the impact of depth in Skip-PC and Skip-α using homophilic DC-SBM. We present a complementary result to Section 4.3 here.
We use the same setting as described in Section 4.3 and plot the exact NTKs at depths 1 and 8 for symmetric and row normalization. Figure 8 shows the results for Skip-PC: we observe that the gap between in-class and out-of-class blocks decreases for both S_row and S_sym with depth, but the class information is still retained at larger depth and the gap does not vanish. Between S_row and S_sym, the heatmaps show that S_row retains the block structure better than S_sym and is devoid of the influence of the degree corrections. In the case of Skip-α, we use α = 0.1 to obtain the result illustrated in Figure 9, from which similar conclusions are derived. Although we consider XX^T = I_n for Skip-α, which fundamentally relies on the feature information to interpolate, the results are still meaningful and demonstrate the theoretical findings.

In the second core-periphery setting, we consider two communities of equal size n/2 with a core-periphery structure in each, where the link probabilities between the cores of the communities are higher than those between core and periphery or between the peripheries of the two communities, as shown in the first heatmap of Figure 11. The exact NTKs for symmetric and row normalization are illustrated in the second and third heatmaps of Figure 11, where we again see that row normalization retains the community structure.

Additional experiments on Cora. In this section, we present additional experiments on Cora. Since our theory assumed orthonormal features XX^T = I_n, we validate it experimentally in a setup similar to that described in Section 6. Figure 12 shows the result for S_sym and S_row at depths 1 and 8. The conclusions derived in the real setting hold here as well, showing that S_row preserves the class information better than S_sym.

ReLU GCN. We present the result for ReLU GCN in this section. Figure 13 shows the result, where the conclusions derived in Section 6 hold very well. Additionally, we plot the average in-class and out-of-class block difference in the case of vanilla GCN (line plots in the first row of Figure 13), and observe that it degrades with depth for each class in Cora, showing the negative impact of depth, which aligns well with the theoretical result.
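The DC-SBM sampling used in these experiments can be sketched as follows; this is a minimal generator for the homophilic setting, and the function and variable names are ours:

```python
import numpy as np

def sample_dcsbm(n, p, q, pi, seed=0):
    """Sample a two-class DC-SBM adjacency (nodes sorted by class, classes of size n/2).

    Edge probability: p * pi_i * pi_j within a class, q * pi_i * pi_j across classes."""
    rng = np.random.default_rng(seed)
    c = np.arange(n) < n // 2                        # class membership
    B = np.where(c[:, None] == c[None, :], p, q)     # block probabilities
    P = np.clip(np.outer(pi, pi) * B, 0.0, 1.0)
    upper = np.triu(rng.random((n, n)) < P, k=1)     # Bernoulli draws on the upper triangle
    A = upper.astype(int)
    return A + A.T                                   # symmetric, zero diagonal

# homophilic setting from the experiments: p = 0.8, q = 0.1, pi ~ Unif[0, 1]
rng = np.random.default_rng(1)
A = sample_dcsbm(1000, 0.8, 0.1, rng.uniform(0, 1, 1000))
```

Sampling only the upper triangle and mirroring it keeps the graph undirected without double-drawing edges.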



We use 1[·] for the indicator function, E[·] for expectation, and [d] = {1, 2, . . . , d}.

Figure 3: Left: average in-class and out-of-class block difference at d = ∞ (in log scale) for different true class separability. Heatmaps: Θ (8) for S sym and S row for vanilla GCN and Skip-PC.

Figure 4: Evaluation on Cora dataset. Heatmaps show results of vanilla GCN, Skip-PC and Skip-α where a min and max threshold of 10 and 90 percentile is set for better visualization.

Figure 5: Comparison of the accuracy of a trained finite width GCN and the corresponding NTK.

Figure 6: Numerical validation of Theorem 2 using DC-SBM shown in the first plot of column 1. Columns 2 and 3 illustrate the exact NTKs of depth=1 and 8 for S sym and S row , respectively. Second plot in column 1 shows the average gap between in-class and out-of-class blocks from theory.

Figure 7: Numerical validation of DC-SBM for Vanilla GCN. The first two heatmaps show the exact NTK Θ (d) for column normalized adjacency convolution S col and the other two for unnormalized adjacency S adj for depths d = 1 and 8.

Figure 8: Numerical validation of DC-SBM for Skip-PC showing the exact NTKs Θ (d) for S sym and S row for depths 1 and 8.

Figure 9: Numerical validation of DC-SBM for Skip-α showing the exact NTKs Θ (d) for S sym and S row for depths 1 and 8.

Figure 10: Numerical validation of Core-Periphery DC-SBM showing the exact NTKs Θ (d) for S sym and S row for depth 2.

Figure 11: Numerical validation of Core-Periphery DC-SBM with community structure showing the exact NTKs Θ (d) for S sym and S row for depth 2.

Figure 12: Evaluation on Cora with XX T = I n for S sym and S row for depths 1 and 8.

Figure 13: Evaluation on Cora dataset. Heatmaps show results of vanilla GCN and the decrease in class separability with depth for S sym and S row . Last two show NTKs of Skip-PC where a min and max threshold of 30 and 70 percentile is set for better visualization. Skip-α results in Appendix.


Classes in Cora. Another experimental study is to understand how easy it is to learn the classes that showed good in-class and out-of-class gap preservation in the above experiment. The line plot in Figure 13 shows that classes C2 and C5 are well represented by both S_sym and S_row. To study how well this holds in the trained GCN, we considered a depth-4 vanilla GCN with ReLU activations and used the same hyperparameters mentioned in Section B.1. The results are shown in Figure 14, where we observe that C2 and C5 are learned well. On the other hand, other classes that showed a small gap are also learned well by the trained GCN. This needs further investigation, as it has to do with the data split: some classes are poorly represented in the training data, for instance C6. Thus, we leave it for further analysis.

Linear GCN. We present the result for linear GCN with the same setup as described in Section 6 to check the goodness of our theory. The results are illustrated in Figure 15, where we observe that the theory holds even better for linear GCN than for ReLU GCN. The class information is better preserved in S_row than in S_sym, especially at higher depth, for GCNs both with and without skip connections.
All the conclusions derived in the main section hold here as well.

Evaluation on Citeseer. In this section, we validate our theoretical findings on Citeseer with most of the assumptions relaxed. We consider multi-class node classification (K = 6) using a GCN with linear activations and relax the orthonormal feature condition, so XX^T ≠ I_n. The NTKs for vanilla GCN and for GCN with Skip-PC and Skip-α at depths d = 1, 2, 4, 8, 16 are computed, and Figure 16 illustrates the results. All the observations made in Section 6 hold here as well, and clear blocks emerge for S_row, making it the preferable choice as suggested by the theory.

