EFFECTS OF GRAPH CONVOLUTIONS IN MULTI-LAYER NETWORKS

Abstract

Graph Convolutional Networks (GCNs) are among the most popular architectures used to solve classification problems accompanied by graphical information. We present a rigorous theoretical understanding of the effects of graph convolutions in multi-layer networks. We study these effects through the node classification problem of a non-linearly separable Gaussian mixture model coupled with a stochastic block model. First, we show that a single graph convolution expands the regime of the distance between the means where multi-layer networks can classify the data by a factor of at least 1/deg^{1/4}, where deg denotes the expected degree of a node. Second, we show that with a slightly stronger graph density, two graph convolutions improve this factor to at least 1/n^{1/4}, where n is the number of nodes in the graph. Finally, we provide both theoretical and empirical insights into the performance of graph convolutions placed in different combinations among the layers of a neural network, concluding that the performance is similar for all such placements. We present extensive experiments on both synthetic and real-world data that illustrate our results.

1. INTRODUCTION

A large amount of interesting data, and the practical challenges associated with them, arise in settings where entities have individual attributes as well as information about mutual relationships. Traditional classification models have been extended to capture such relational information through graphs (Hamilton, 2020), where each node has individual attributes and the edges of the graph capture the relationships among the nodes. Applications characterized by this type of graph-structured data include social analysis (Backstrom & Leskovec, 2011), recommendation systems (Ying et al., 2018), computer vision (Monti et al., 2017), the study of properties of chemical compounds (Gilmer et al., 2017; Scarselli et al., 2009), statistical physics (Bapst et al., 2020; Battaglia et al., 2016), and financial forensics (Zhang et al., 2017; Weber et al., 2019). The most popular learning models for relational data use graph convolutions (Kipf & Welling, 2017), where the idea is to aggregate the attributes of the neighbours of a node instead of utilizing only its own attributes. Despite several empirical studies of various GCN-type models (Chen et al., 2019; Ma et al., 2022) that demonstrate an improvement over traditional classification methods such as MLPs, there has been limited progress in the theoretical understanding of how graph convolutions in multi-layer networks improve node classification.

Related work. The capacity of a graph convolution for one-layer networks is studied in Baranwal et al. (2021), along with its out-of-distribution (OoD) generalization potential. A more recent work (Wu et al., 2022) formulates the node-level OoD problem and develops a learning method that enables GNNs to leverage invariance principles for prediction. In Gasteiger et al. (2019), the authors utilize a propagation scheme based on personalized PageRank to construct a model that outperforms several GCN-like methods for semi-supervised classification. Through their algorithm, APPNP, they show that placing power iterations at the last layer of an MLP achieves state-of-the-art performance. Our results align with this observation. There is a large body of theoretical work on unsupervised learning for random graph models where node features are absent and only relational information is available (Decelle et al., 2011; Massoulié, 2014; Mossel et al., 2018; 2015; Abbe & Sandon, 2015; Abbe et al., 2015; Bordenave et al., 2015; Deshpande et al., 2015; Montanari & Sen, 2016; Banks et al., 2016; Abbe & Sandon, 2018; Li et al., 2019; Kloumann et al., 2017; Gaudio et al., 2022); for a comprehensive survey, see Abbe (2018); Moore (2017). For data models in which node features are coupled with relational information, several works have studied the semi-supervised node classification problem, see, for example, Scarselli et al. (2009); Cheng et al. (2011); Gilbert et al. (2012); Dang & Viennet (2012); Günnemann et al. (2013); Yang et al. (2013); Hamilton et al. (2017); Jin et al. (2019); Mehta et al. (2019); Chien et al. (2022); Yan et al. (2021). These papers provide good empirical insights into the merits of graph structure in the data. We complement these studies with theoretical results that explain the effects of graph convolutions in a multi-layer network. In Deshpande et al. (2018); Lu & Sen (2020), the authors explore the fundamental thresholds for classifying a substantial fraction of the nodes with linear sample complexity and large but finite degree. Another relatively recent work (Hou et al., 2020) proposes two graph-smoothness metrics for measuring the benefits of graphical information, along with a new attention-based framework. In Fountoulakis et al. (2022), the authors provide a theoretical study of the graph attention mechanism (GAT) and identify the regimes where the attention mechanism is (or is not) beneficial to node-classification tasks. Our study focuses on convolutions instead of attention-based mechanisms.
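As a concrete illustration of the aggregation idea above, a single degree-normalized graph convolution can be sketched as follows. This is a simplified D^{-1}(A + I)X averaging in the spirit of Kipf & Welling (2017); the exact convolution operator analyzed in this paper is defined later.

```python
import numpy as np

def graph_convolution(A, X):
    """One graph convolution: replace each node's features with the
    average over its neighbourhood. Self-loops are added so a node
    also retains its own signal (a common simplification)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # degrees including self-loop
    return (A_hat @ X) / deg                # D^{-1} (A + I) X

# toy example: a path graph on 3 nodes with scalar features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [1.0]])
print(graph_convolution(A, X))  # each node averaged with its neighbours
```

In this toy example, the middle node averages all three feature values while the endpoints average two, so the convolved features are [0.5, 2/3, 0.5].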
Several works have studied the expressive power and extrapolation of GNNs, along with the oversmoothing phenomenon (see, e.g., Balcilar et al. (2021); Xu et al. (2021); Oono & Suzuki (2020); Li et al. (2018)). Other works have studied the homophily and heterophily problem in GNNs (Luan et al., 2021; Ma et al., 2022). However, our focus is to compare the benefits and limitations of graph convolutions with those of a traditional MLP that does not utilize relational information. Similar to Wei et al. (2022), our setting is immune to the heterophily problem, and the focus of our study is on regimes where oversmoothing does not occur.

To the best of our knowledge, this area of research still lacks theoretical guarantees that explain when and why graphical data, and in particular graph convolutions, can boost traditional multi-layer networks on node-classification tasks. To this end, we study the effects of graph convolutions in deeper layers of a multi-layer network. For node classification tasks, we also study whether one can avoid using additional layers in the network design for the sole purpose of gathering information from neighbours that are farther away, by comparing the benefits of placing all convolutions in a single layer versus placing them in different layers.

Our contributions. We study the performance of multi-layer networks for the task of binary node classification on a data model where node features are sampled from a Gaussian mixture and relational information is sampled from a symmetric two-block stochastic block model¹ (see Section 2.1 for details). The node features are modelled after XOR data with two classes, and therefore comprise four distinct components, two for each class. Our choice of the data model is motivated by the fact that it is non-linearly separable; hence, a single-layer network fails to classify the data from this model.
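For intuition, a data model of this kind can be sketched as a minimal sampler. The parameter names and the exact placement and scaling of the means below are illustrative; the precise model is given in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xor_csbm(n, mu, nu, sigma, p, q):
    """Illustrative XOR-style Gaussian mixture coupled with a symmetric
    two-block SBM. Class 0 features mix N(+mu, sigma^2 I) and
    N(-mu, sigma^2 I); class 1 mixes N(+nu, sigma^2 I) and
    N(-nu, sigma^2 I) -- four components, two per class, so no single
    hyperplane separates the classes. Edges are intra-class with
    probability p and inter-class with probability q."""
    y = rng.integers(0, 2, size=n)                 # class labels
    signs = rng.choice([-1.0, 1.0], size=(n, 1))   # mixture component
    means = np.where(y[:, None] == 0, mu, nu)      # per-class mean
    X = signs * means + sigma * rng.standard_normal((n, len(mu)))
    probs = np.where(y[:, None] == y[None, :], p, q)
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                    # undirected, no self-loops
    return X, A, y

X, A, y = sample_xor_csbm(200, mu=np.array([1.0, 0.0]),
                          nu=np.array([0.0, 1.0]), sigma=0.2, p=0.5, q=0.1)
```

With orthogonal means as above, averaging within each class gives a mean of zero, which is why linear (single-layer) classifiers fail on this model.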
Similar data models based on the contextual stochastic block model (CSBM) have been used extensively in the literature; see, for example, Deshpande et al. (2018); Binkiewicz et al. (2017); Chien et al. (2021; 2022); Baranwal et al. (2021). We now summarize our contributions below.

1. We show that when node features are accompanied by a graph, a single graph convolution enables a multi-layer network to classify the nodes in a wider regime than methods that do not utilize the graph, improving the threshold for the distance between the means of the features by a factor of at least 1/deg^{1/4}, where deg denotes the expected degree of a node. Furthermore, assuming a slightly denser graph, we show that with two graph convolutions, a multi-layer network can classify the data in an even wider regime, improving the threshold by a factor of at least 1/n^{1/4}, where n is the number of nodes in the graph.

2. We show that for multi-layer networks equipped with graph convolutions, the classification capacity is determined by the number of graph convolutions rather than the number of layers in the network. In particular, we study the gains obtained by placing graph convolutions in a layer, and compare the benefits of placing all convolutions in a single layer versus placing them in different layers.
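The placement comparison in item 2 can be made concrete with a hedged sketch: a two-layer network whose per-layer convolution counts are configurable. The degree-normalized averaging, ReLU activation, and the `conv_counts` interface are illustrative simplifications, not the exact architecture analyzed in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_with_convolutions(X, A, weights, conv_counts):
    """Forward pass of a multi-layer network in which conv_counts[l]
    graph convolutions (degree-normalized averaging with self-loops,
    a simplification) are applied at layer l. For two convolutions,
    conv_counts = [2, 0] places both in the first layer, while
    [1, 1] places one in each layer."""
    A_hat = A + np.eye(A.shape[0])          # self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H = X
    for W, k in zip(weights, conv_counts):
        H = H @ W
        for _ in range(k):                  # k convolutions at this layer
            H = (A_hat @ H) / deg
        H = relu(H)
    return H

# tiny example: two nodes joined by an edge, identity weights
A = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
W = [np.eye(2), np.eye(2)]
out_together = mlp_with_convolutions(X, A, W, [2, 0])  # both convs in layer 1
out_spread = mlp_with_convolutions(X, A, W, [1, 1])    # one conv per layer
```

On this toy input the two placements happen to coincide exactly; the paper's result is the more general statement that classification capacity is governed by the total number of convolutions rather than where they are placed.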



¹ Our analyses generalize to non-symmetric SBMs with more than two blocks. However, we focus on the binary symmetric case for the sake of simplicity in the presentation of our ideas.






