REPRESENTATION POWER OF GRAPH CONVOLUTIONS: NEURAL TANGENT KERNEL ANALYSIS

Abstract

The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a 'graph convolution'. Therefore, understanding its influence on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with the symmetric normalization of the adjacency matrix A, defined as D^{-1/2} A D^{-1/2}, being the most widely adopted one, where D is the degree matrix. However, some empirical studies show that row normalization D^{-1} A outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study on the representation power of these convolution operators that could explain this behavior. In this work, we theoretically analyze the influence of graph convolutions using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under a Degree Corrected Stochastic Block Model, we analyze different graphs that have homophilic, heterophilic and core-periphery structures, and prove that: (i) row normalization preserves the underlying class structure better than other convolutions; (ii) performance degrades with network depth due to over-smoothing, but the loss of class information is slowest under row normalization; (iii) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing. We finally validate our theoretical findings numerically and on real datasets.

1. INTRODUCTION

With the advent of Graph Neural Networks (GNNs), there has been tremendous progress in the development of computationally efficient, state-of-the-art methods for various graph-based tasks, including drug discovery, community detection and recommendation systems (Wieder et al., 2020; Fortunato & Hric, 2016; van den Berg et al., 2017). Many of these problems depend on the structural information of the entities along with the features for effective learning. Because GNNs exploit this topological information encoded in the graph, they can learn better representations of the nodes or the entire graph than traditional deep learning techniques, thereby achieving state-of-the-art performance. To accomplish this, GNNs apply an aggregation function to each node in a graph that combines the features of the neighboring nodes, and their variants differ principally in the method of aggregation. For instance, graph convolution networks use mean neighborhood aggregation through spectral approaches (Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017) or spatial approaches (Hamilton et al., 2017; Duvenaud et al., 2015; Xu et al., 2019), graph attention networks apply multi-head attention based aggregation (Velickovic et al., 2018), and graph recurrent networks employ complex computational modules (Scarselli et al., 2008; Li et al., 2016). Of all the aggregation policies, the spectral approach based on the graph Laplacian is the most widely used in practice, specifically the one proposed by Kipf & Welling (2017), owing to its simplicity and empirical success. In this work, we focus on such graph Laplacian based aggregations in Graph Convolution Networks (GCNs), which we refer to as graph convolutions or diffusion operators. Kipf & Welling (2017) propose a GCN for node classification, a semi-supervised task where the goal is to predict the label of a node using its feature and neighboring node information.
This work suggests the symmetric normalization S_sym = D^{-1/2} A D^{-1/2} as the graph convolution. Ever since its introduction, S_sym has remained the popular choice. However, subsequent works (Wang et al., 2018; Wang & Leskovec, 2020; Ragesh et al., 2021) explore the row normalization S_row = D^{-1} A and, particularly, Wang et al. (2018) observe that S_row outperforms S_sym for a two-layered GCN empirically. Intrigued by this observation, and as both S_sym and S_row are simply degree normalized adjacency matrices, we study the behavior over depth and observe that S_row performs better than S_sym in this case as well, as illustrated in Figure 1. Rigorous theoretical analysis is particularly challenging in GCNs compared to standard neural networks because of the graph convolution. Adding skip connections further increases the complexity of the analysis. To overcome these difficulties, we consider the GCN in the infinite width limit, wherein the Neural Tangent Kernel (NTK) captures the network characteristics very well (Jacot et al., 2018). The infinite width assumption is not restrictive for graph convolution analysis, as the convolution operates on the graph and not on the network directly, thus showing the same observations as a trained GCN (Figure 5). Moreover, the NTK renders the analysis parameter-free, eliminating additional complexity induced, for example, by optimization. Through the lens of the NTK, we study the impact of different graph convolutions under a specific data distributional assumption, the Degree Corrected Stochastic Block Model (DC-SBM) (Karrer & Newman, 2011), a sparse random graph model. The node degree heterogeneity induced in the DC-SBM allows us to analyze the effect of different types of normalization of the adjacency matrix, thus revealing the characteristic difference between S_sym and S_row. Additionally, this model enables the analysis of graphs that have homophilic, heterophilic and core-periphery structures.
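To make the two operators concrete, the following is a minimal numpy sketch of both normalizations for an unweighted adjacency matrix. The helper names are our own; this is an illustration of the definitions, not the implementation used in the paper.

```python
import numpy as np

def sym_norm(A):
    """S_sym = D^{-1/2} A D^{-1/2} (symmetric normalization)."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros(d.shape)
    mask = d > 0
    d_inv_sqrt[mask] = d[mask] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def row_norm(A):
    """S_row = D^{-1} A (mean over neighbors; each row sums to 1)."""
    d = A.sum(axis=1)
    d_inv = np.zeros(d.shape)
    mask = d > 0
    d_inv[mask] = 1.0 / d[mask]
    return d_inv[:, None] * A

# Path graph on three nodes: node 1 is connected to nodes 0 and 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
S_sym, S_row = sym_norm(A), row_norm(A)
```

Note the characteristic difference: S_row is row-stochastic (each node averages its neighbors uniformly), while S_sym is symmetric but reweights each edge by the degrees of both endpoints.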
In this paper, we present a formal approach to analyze GCNs and, specifically, the representation power of different graph convolutions, the influence of depth and the role of skip connections. This is a significant step toward understanding GCNs, as it facilitates more informed network design choices, such as the convolution and the depth, as well as the development of more competitive methods based on grounded theoretical reasoning rather than heuristics. Contributions. This paper provides a rigorous theoretical analysis of the discussed empirical observations in GCNs under the DC-SBM distribution using the graph NTK, leading to the following contributions. (i) In Section 2, we derive the NTK for a GCN in the infinite width limit in the node classification setting. Using the NTK for the linear GCN and under the DC-SBM data distribution, we show in Section 3 that S_row preserves class information, by computing the population NTK for different graph convolutions. We also present numerical validation of this result on homophilic and heterophilic graphs. (ii) We prove the convolution-specific over-smoothing effect in the vanilla GCN by showing the degradation of class separability with depth in Section 3.1, and also illustrate it experimentally. (iii) In Section 4, we leverage the power of the NTK to analyze two different skip connections (Kipf & Welling, 2017; Chen et al., 2020). We derive the corresponding NTKs and show that skip connections retain class information even at infinite depth, along with numerical validation. (iv) We show that S_sym may be preferred over S_row in the absence of class structure in Section 5, and validate the theoretical results on the real datasets Cora in Section 6 and Citeseer in Appendix B.5. We finally conclude in Section 7 with a discussion of the impact of the results and further possibilities, and provide all proofs, experimental details and additional experiments in the appendix.
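Since the analysis throughout is under the DC-SBM, a minimal sampler may help fix notation. The function name and parameter choices below are our own illustration of the model of Karrer & Newman (2011): theta_i are per-node degree-correction weights, and the block matrix omega controls whether the structure is homophilic (diagonal dominates) or heterophilic (off-diagonal dominates).

```python
import numpy as np

def sample_dcsbm(theta, labels, omega, rng):
    """Sample an undirected DC-SBM adjacency matrix.

    Edge probability: P(A_ij = 1) = min(1, theta_i * theta_j * omega[g_i, g_j]),
    where g_i is the community label of node i.
    """
    n = len(theta)
    # Pairwise edge probabilities from degree weights and block connectivity.
    P = np.minimum(1.0, np.outer(theta, theta) * omega[np.ix_(labels, labels)])
    U = rng.random((n, n))
    A = np.triu((U < P).astype(float), 1)  # sample upper triangle, no self-loops
    return A + A.T                         # symmetrize (undirected graph)

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)         # two planted communities
theta = rng.uniform(0.5, 1.5, size=100)        # heterogeneous degrees
omega = np.array([[0.2, 0.02],
                  [0.02, 0.2]])                # homophilic block structure
A = sample_dcsbm(theta, labels, omega, rng)
```

Swapping the diagonal and off-diagonal entries of omega yields a heterophilic graph under the same sampler.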



Performance degrades with depth for both S_sym and S_row (details of the experiment in Appendix B.1). This contradicts the conventional wisdom about standard neural networks, which exhibit improved performance as depth increases. Several works (Kipf & Welling, 2017; Chen et al., 2018; Wu et al., 2019) observe this behavior empirically and attribute it to the over-smoothing effect of the repeated application of the diffusion operator, which averages out the feature information to a degree where it becomes uninformative (Li et al., 2018; Oono & Suzuki, 2019; Esser et al., 2021). As a solution to this problem, Chen et al. (2020) and Kipf & Welling (2017) propose different forms of skip connections that overcome the smoothing effect and thus outperform the vanilla GCN. Extending this to the comparison of graph convolutions, our experiment shows that S_row is preferable to S_sym over depth even in GCNs with skip connections (Figure 1). Naturally, we ask: what characteristics of S_row enable better representation learning than S_sym in GCNs?
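The over-smoothing effect described above can be reproduced in a few lines. The toy graph below (two 4-cliques joined by a single edge, with a class-indicating feature) is our own illustrative choice, not the experimental setup of the paper: repeatedly applying the diffusion operator averages the feature toward a constant, so the spread of node features, and with it the class signal, collapses with depth.

```python
import numpy as np

# Toy homophilic graph: two 4-cliques (the "classes") joined by one edge.
n = 8
A = np.zeros((n, n))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0

S_row = A / A.sum(axis=1, keepdims=True)       # S_row = D^{-1} A
X = np.array([1.0] * 4 + [-1.0] * 4)[:, None]  # class-indicating feature

# Repeated diffusion: the standard deviation of the node features shrinks
# with depth, i.e. the class signal is progressively averaged out.
spreads = [float((np.linalg.matrix_power(S_row, k) @ X).std())
           for k in (1, 4, 16, 64)]
print(spreads)  # monotonically decreasing toward 0
```

On this graph the feature spread decays geometrically in the depth, governed by the second eigenvalue of S_row, which is the convolution-specific rate that Section 3.1 quantifies in general.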

