EFFECTS OF GRAPH CONVOLUTIONS IN MULTI-LAYER NETWORKS

Abstract

Graph Convolutional Networks (GCNs) are one of the most popular architectures used to solve classification problems accompanied by graphical information. We present a rigorous theoretical understanding of the effects of graph convolutions in multi-layer networks. We study these effects through the node classification problem of a non-linearly separable Gaussian mixture model coupled with a stochastic block model. First, we show that a single graph convolution expands the regime of the distance between the means where multi-layer networks can classify the data by a factor of at least $1/\sqrt[4]{\mathrm{deg}}$, where $\mathrm{deg}$ denotes the expected degree of a node. Second, we show that with a slightly stronger graph density, two graph convolutions improve this factor to at least $1/\sqrt[4]{n}$, where $n$ is the number of nodes in the graph. Finally, we provide both theoretical and empirical insights into the performance of graph convolutions placed in different combinations among the layers of a neural network, concluding that the performance is mutually similar for all combinations of the placement. We present extensive experiments on both synthetic and real-world data that illustrate our results.

1. INTRODUCTION

A large amount of interesting data and the practical challenges associated with them are defined in the setting where entities have attributes as well as information about mutual relationships. Traditional classification models have been extended to capture such relational information through graphs (Hamilton, 2020), where each node has individual attributes and the edges of the graph capture the relationships among the nodes. A variety of applications characterized by this type of graph-structured data include works in the areas of social analysis (Backstrom & Leskovec, 2011), recommendation systems (Ying et al., 2018), computer vision (Monti et al., 2017), study of the properties of chemical compounds (Gilmer et al., 2017; Scarselli et al., 2009), statistical physics (Bapst et al., 2020; Battaglia et al., 2016), and financial forensics (Zhang et al., 2017; Weber et al., 2019). The most popular learning models for relational data use graph convolutions (Kipf & Welling, 2017), where the idea is to aggregate the attributes of the set of neighbours of a node instead of only utilizing its own attributes. Despite several empirical studies of various GCN-type models (Chen et al., 2019; Ma et al., 2022) that demonstrate an improvement over the performance of traditional classification methods such as MLPs, there has been limited progress in the theoretical understanding of the benefits of graph convolutions in multi-layer networks in terms of improving node classification tasks.

Related work. The capacity of a graph convolution for one-layer networks is studied in Baranwal et al. (2021), along with its out-of-distribution (OoD) generalization potential. A more recent work (Wu et al., 2022) formulates the node-level OoD problem, and develops a learning method that enables GNNs to leverage invariance principles for prediction. In Gasteiger et al. (2019), the authors utilize a propagation scheme based on personalized PageRank to construct a model that outperforms several GCN-like methods for semi-supervised classification. Through their algorithm, APPNP, they show that placing power iterations at the last layer of an MLP achieves state-of-the-art performance. Our results align with this observation. There exists a large amount of theoretical work on unsupervised learning for random graph models where node features are absent and only relational information is available (Decelle et al., 2011; Massoulié, 2014; Mossel et al., 2018; 2015; Abbe & Sandon, 2015; Abbe et al., 2015; Bordenave et al., 2015; Deshpande et al., 2015; Montanari & Sen, 2016; Banks et al., 2016; Abbe & Sandon, 2018; Li et al., 2019; Kloumann et al., 2017; Gaudio et al., 2022). For a comprehensive survey, see Abbe (2018); Moore (2017). For data models which have node features coupled with relational information, several works have studied the semi-supervised node classification problem, see, for example, Scarselli et al. (2009); Cheng et al. (2011); Gilbert et al. (2012); Dang & Viennet (2012); Günnemann et al. (2013); Yang et al. (2013); Hamilton et al. (2017); Jin et al. (2019); Mehta et al. (2019); Chien et al. (2022); Yan et al. (2021). These papers provide good empirical insights into the merits of graph structure in the data. We complement these studies with theoretical results that explain the effects of graph convolutions in a multi-layer network. In Deshpande et al. (2018); Lu & Sen (2020), the authors explore the fundamental thresholds for the classification of a substantial fraction of the nodes with linear sample complexity and large but finite degree. Another relatively recent work (Hou et al., 2020) proposes two graph smoothness metrics for measuring the benefits of graphical information, along with a new attention-based framework. In Fountoulakis et al.
(2022), the authors provide a theoretical study of the graph attention mechanism (GAT) and identify the regimes where the attention mechanism is (or is not) beneficial to node-classification tasks. Our study focuses on convolutions instead of attention-based mechanisms. Several works have studied the expressive power and extrapolation of GNNs, along with the oversmoothing phenomenon (see, e.g., Balcilar et al. (2021); Xu et al. (2021); Oono & Suzuki (2020); Li et al. (2018)). Some other works have also studied the homophily and heterophily problem in GNNs (Luan et al., 2021; Ma et al., 2022). However, our focus is to draw a comparison of the benefits and limitations of graph convolutions with those of a traditional MLP that does not utilize relational information. Similar to Wei et al. (2022), our setting is immune to the heterophily problem, and the focus of our study is on regimes where oversmoothing does not occur. To the best of our knowledge, this area of research still lacks theoretical guarantees that explain when and why graphical data, and in particular, graph convolutions, can boost traditional multi-layer networks to perform better on node-classification tasks. To this end, we study the effects of graph convolutions in deeper layers of a multi-layer network. For node classification tasks, we also study whether one can avoid using additional layers in the network design for the sole purpose of gathering information from neighbours that are farther away, by comparing the benefits of placing all convolutions in a single layer versus placing them in different layers.

Our contributions. We study the performance of multi-layer networks for the task of binary node classification on a data model where node features are sampled from a Gaussian mixture, and relational information is sampled from a symmetric two-block stochastic block model (see Section 2.1 for details).
The node features are modelled after XOR data with two classes, and therefore have four distinct components, two for each class. Our choice of the data model is inspired by the fact that it is non-linearly separable. Hence, a single-layer network fails to classify the data from this model. Similar data models based on the contextual stochastic block model (CSBM) have been used extensively in the literature, see, for example, Deshpande et al. (2018); Binkiewicz et al. (2017); Chien et al. (2021; 2022); Baranwal et al. (2021). We summarize our contributions below.

1. We show that when node features are accompanied by a graph, a single graph convolution enables a multi-layer network to classify the nodes in a wider regime as compared to methods that do not utilize the graph, improving the threshold for the distance between the means of the features by a factor of at least $1/\sqrt[4]{\mathbb{E}\deg}$. Furthermore, assuming a slightly denser graph, we show that with two graph convolutions, a multi-layer network can classify the data in an even wider regime, improving the threshold by a factor of at least $1/\sqrt[4]{n}$, where $n$ is the number of nodes in the graph.
2. We show that for multi-layer networks equipped with graph convolutions, the classification capacity is determined by the number of graph convolutions rather than the number of layers in the network. In particular, we study the gains obtained by placing graph convolutions in a layer, and compare the benefits of placing all convolutions in a single layer versus placing them in different combinations across different layers. We find that the performance is mutually similar for all combinations with the same number of graph convolutions.
3. We verify our theoretical results through extensive experiments on both synthetic and real-world data, showing trends about the performance of graph convolutions in various combinations across multiple layers of a network, and in different regimes of interest.

The rest of our paper is organized as follows: In Section 2, we provide a detailed description of the data model and the network architecture that is central to our study, followed by our analytical results in Section 3. Finally, Section 4 presents extensive experiments that illustrate our theoretical findings. 

2. PRELIMINARIES

$C_b = \{i \in [n] \mid \varepsilon_i = b\}$ for $b \in \{0, 1\}$. Let $\mu$ and $\nu$ be fixed vectors in $\mathbb{R}^d$, such that $\|\mu\|_2 = \|\nu\|_2$ and $\langle \mu, \nu \rangle = 0$. Denote by $X \in \mathbb{R}^{n \times d}$ the data matrix where each row-vector $X_i \in \mathbb{R}^d$ is an independent Gaussian random vector distributed as $X_i \sim \mathcal{N}((2\eta_i - 1)((1 - \varepsilon_i)\mu + \varepsilon_i \nu), \sigma^2)$. We use the notation $X \sim \text{XOR-GMM}(n, d, \mu, \nu, \sigma^2)$ to refer to data sampled from this model.

Let us now define the model with graphical information. In this case, in addition to the features $X$ described above, we have a graph with the adjacency matrix $A = (a_{ij})_{i,j \in [n]}$, which corresponds to an undirected graph including self-loops, and is sampled from a standard symmetric two-block stochastic block model with parameters $p$ and $q$, where $p$ is the intra-block and $q$ is the inter-block edge probability. The SBM$(n, p, q)$ is then coupled with the XOR-GMM$(n, d, \mu, \nu, \sigma^2)$ in such a way that $a_{ij} \sim \mathrm{Ber}(p)$ if $\varepsilon_i = \varepsilon_j$ and $a_{ij} \sim \mathrm{Ber}(q)$ if $\varepsilon_i \neq \varepsilon_j$. For data $(A, X) = (\{a_{ij}\}_{i,j \in [n]}, \{X_i\}_{i \in [n]})$ sampled from this model, we say $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^2, p, q)$. We will denote by $D$ the diagonal degree matrix of the graph with adjacency matrix $A$, and thus, $\deg(i) = D_{ii} = \sum_{j=1}^{n} a_{ij}$ denotes the degree of node $i$. We will use $N_i = \{j \in [n] \mid a_{ij} = 1\}$ to denote the set of neighbours of a node $i$. We will also use the notation $i \sim j$ or $i \nsim j$ throughout the paper to signify, respectively, that $i$ and $j$ are in the same class, or in different classes.
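The sampling procedure above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code. It assumes that the class labels $\varepsilon_i$ and the component indicators $\eta_i$ are independent Ber(1/2) draws (their definition is not part of this excerpt), that the Gaussian noise is isotropic with standard deviation $\sigma$, and that every self-loop is present. The function name `sample_xor_csbm` is hypothetical.

```python
import random


def sample_xor_csbm(n, mu, nu, sigma, p, q, seed=0):
    """Draw (A, X, eps, eta) from a sketch of XOR-CSBM(n, d, mu, nu, sigma^2, p, q)."""
    rng = random.Random(seed)
    d = len(mu)
    # Assumption: class label eps_i and component indicator eta_i are Ber(1/2).
    eps = [rng.randint(0, 1) for _ in range(n)]
    eta = [rng.randint(0, 1) for _ in range(n)]

    X = []
    for i in range(n):
        base = nu if eps[i] == 1 else mu   # (1 - eps_i) * mu + eps_i * nu
        sign = 2 * eta[i] - 1              # +1 or -1, i.e. (2 * eta_i - 1)
        # Assumption: isotropic noise with standard deviation sigma per coordinate.
        X.append([sign * base[k] + rng.gauss(0.0, sigma) for k in range(d)])

    # Symmetric adjacency matrix; assumption: self-loops are always present.
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        for j in range(i + 1, n):
            prob = p if eps[i] == eps[j] else q   # intra-block p, inter-block q
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    return A, X, eps, eta


if __name__ == "__main__":
    A, X, eps, eta = sample_xor_csbm(8, [1.0, 0.0], [0.0, 1.0], 0.1, 0.8, 0.1)
    print("labels  :", eps)
    print("degrees :", [sum(row) for row in A])
```

With orthogonal `mu` and `nu` of equal norm, the four component means are `mu`, `-mu`, `nu`, and `-nu`, which is the XOR layout that makes the classes non-linearly separable.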

2.2. NETWORK ARCHITECTURE

Our analysis focuses on MLP architectures with ReLU activations. In particular, for a network with $L$ layers, we define the following:

$$H^{(0)} = X, \qquad f^{(l)}(X) = (D^{-1}A)^{k_l} H^{(l-1)} W^{(l)} + b^{(l)}, \qquad H^{(l)} = \mathrm{ReLU}(f^{(l)}(X)) \ \text{ for } l \in [L], \qquad \hat{y} = \varphi(f^{(L)}(X)).$$

Here, $X \in \mathbb{R}^{n \times d}$ is the given data, which is an input for the first layer, and $\varphi(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$ is applied element-wise. The final output of the network is represented by $\hat{y} = \{\hat{y}_i\}_{i \in [n]}$. Note that $D^{-1}A$ is the normalized adjacency matrix and $k_l$ denotes the number of graph convolutions placed in layer $l$. In particular, for a simple MLP with no graphical information, we have $A = I_n$.

We will denote by $\theta$ the set of all weights and biases, $(W^{(l)}, b^{(l)})_{l \in [L]}$, which are the learnable parameters of the network. For a dataset $(X, y)$, we denote the binary cross-entropy loss obtained by a multi-layer network with parameters $\theta$ by

$$\ell_\theta(A, X) = -\frac{1}{n} \sum_{i \in [n]} \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big],$$

and the optimization problem is formulated as

$$\mathrm{OPT}(A, X) = \min_{\theta \in \mathcal{C}} \ell_\theta(A, X),$$

where $\mathcal{C}$ denotes a suitable constraint set for $\theta$. For our analyses, we take the constraint set $\mathcal{C}$ to impose the condition $\|W^{(1)}\|_2 \le R$ and $\|W^{(l)}\|_2 \le 1$ for all $1 < l \le L$, i.e., the weight parameters of all layers $l > 1$ are normalized, while for $l = 1$, the norm is bounded by some fixed value $R$. This is necessary because without the constraint, the value of the loss function can go arbitrarily close to 0. Furthermore, the parameter $R$ helps us concisely provide bounds for the loss in our theorems for various regimes by bounding the Lipschitz constant of the learned function. In the rest of our paper, we use $\ell_\theta(X)$ to denote $\ell_\theta(I_n, X)$, which is the loss in the absence of graphical information.
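To make the layer definition concrete, the following is a small dependency-free Python sketch of the forward pass and the binary cross-entropy loss. It is illustrative only: the helper names (`matmul`, `normalize`, `forward`, `bce_loss`) are hypothetical, the last layer is assumed to have output width one, and every node is assumed to have a positive degree (which self-loops guarantee).

```python
import math


def matmul(P, Q):
    """Plain nested-list matrix product P @ Q."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def normalize(A):
    """Row-normalized adjacency D^{-1} A (assumes every row sum is positive)."""
    return [[a / sum(row) for a in row] for row in A]


def forward(A, X, weights, biases, ks):
    """f^(l) = (D^-1 A)^{k_l} H^(l-1) W^(l) + b^(l),  H^(l) = ReLU(f^(l)),
    y_hat = sigmoid(f^(L)); ks[l] is the number of convolutions in layer l."""
    A_norm = normalize(A)
    H = X
    for W, b, k in zip(weights, biases, ks):
        for _ in range(k):                       # apply k_l graph convolutions
            H = matmul(A_norm, H)
        F = [[v + b[j] for j, v in enumerate(row)] for row in matmul(H, W)]
        H = [[max(0.0, v) for v in row] for row in F]   # ReLU
    # Assumption: the last layer has a single output unit per node.
    return [1.0 / (1.0 + math.exp(-row[0])) for row in F]


def bce_loss(y, y_hat):
    """Binary cross-entropy averaged over the n nodes."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, y_hat)) / len(y)


if __name__ == "__main__":
    A = [[1, 1], [1, 1]]                 # two connected nodes with self-loops
    X = [[1.0], [-1.0]]
    W, b = [[[2.0]]], [[0.0]]
    print(forward(A, X, W, b, ks=[0]))   # plain MLP behaviour: A is ignored
    print(forward(A, X, W, b, ks=[1]))   # one convolution averages the two rows
```

Passing `ks=[0, 0, ...]` recovers the MLP case, which is equivalent to setting $A = I_n$ in the definition above.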

3. RESULTS

We now describe our theoretical contributions, followed by a discussion and a proof sketch.

3.1. SETTING UP THE BASELINE

Before stating our main result about the benefits and performance of graph convolutions, we set up a comparative baseline in the setting where graphical information is absent. In the following theorem, we completely characterize the classification threshold for the XOR-GMM data model in terms of the distance between the means of the mixture model and the number of data points $n$. Let $\Phi(\cdot)$ denote the cumulative distribution function of a standard Gaussian, and $\Phi_c(\cdot) = 1 - \Phi(\cdot)$.

Theorem 1. Let $X \in \mathbb{R}^{n \times d} \sim \text{XOR-GMM}(n, d, \mu, \nu, \sigma^2)$ and define $\gamma = \|\mu - \nu\|_2$ to be the distance between the means. Then we have the following:

1. Assume that $\gamma \le K\sigma$ and let $h(x) : \mathbb{R}^d \to \{0, 1\}$ be any binary classifier. Then for any $K > 0$ and any $\epsilon \in (0, 1)$, at least a fraction $2\Phi_c(K/2)^2 - O(n^{-\epsilon/2})$ of all data points are misclassified by $h$ with probability at least $1 - \exp(-2n^{1-\epsilon})$.
2. For any $\epsilon > 0$, if the distance between the means is $\gamma = \Omega(\sigma (\log n)^{\frac{1}{2} + \epsilon})$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, there exist a two-layer and a three-layer network that perfectly classify the data, and obtain a cross-entropy loss given by
$$\ell_\theta(X) = C \exp\left( -\frac{R}{\sqrt{2}}\, \gamma \left( 1 \pm \sqrt{c}/(\log n)^{\epsilon} \right) \right),$$
where $C \in [1/2, 1]$ is an absolute constant and $R$ is the optimality constraint from Eq. (1).

Part one of Theorem 1 shows that if the means of the features of the two classes are at most $O(\sigma)$ apart then with overwhelming probability, there is a constant fraction of points that are misclassified. Note that the fraction of misclassified points is $2\Phi_c(K/2)^2$, which approaches 0 as $K \to \infty$ and approaches $1/2$ as $K \to 0$, signifying that if the means are very far apart then we successfully classify all data points, while if they coincide then we always misclassify roughly half of all data points. Furthermore, note that if $K = c\sqrt{\log n}$ for some constant $c \in [0, 1)$, then the total number of points misclassified is
$$2n\Phi_c(K)^2 \asymp \frac{n}{K^2} e^{-K^2} \asymp \frac{n^{1 - c^2}}{\log n} = \Omega(1).$$
Thus, intuitively, $K \asymp \sqrt{\log n}$ is the threshold beyond which learning methods are expected to perfectly classify the data. This is formalized in part two of the theorem, which supplements the misclassification result by showing that if the means are roughly $\omega(\sigma\sqrt{\log n})$ apart then the data is classifiable with overwhelming probability.
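The leading term $2\Phi_c(K/2)^2$ of the lower bound in part one is easy to tabulate. The short Python sketch below does so using the standard identity $\Phi_c(t) = \tfrac{1}{2}\,\mathrm{erfc}(t/\sqrt{2})$. The function names are hypothetical, and the $O(n^{-\epsilon/2})$ correction is ignored.

```python
import math


def phi_c(t):
    """Complementary CDF of a standard Gaussian, 1 - Phi(t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def misclassified_fraction(K):
    """Leading term 2 * Phi_c(K/2)^2 of the lower bound in part one of Theorem 1."""
    return 2.0 * phi_c(K / 2.0) ** 2


if __name__ == "__main__":
    for K in (0.0, 1.0, 2.0, 4.0, 8.0):
        print(f"K = {K:4.1f}  ->  fraction >= {misclassified_fraction(K):.6f}")
```

At $K = 0$ the value is exactly $1/2$, and it decreases monotonically towards 0 as $K$ grows, matching the two limiting cases discussed above.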

3.2. IMPROVEMENT THROUGH GRAPH CONVOLUTIONS

We now state the results that explain the effects of graph convolutions in multi-layer networks with the architecture described in Section 2.2. We characterize the improvement in the classification threshold in terms of the distance between the means of the node features. Let $\mathrm{erf}(t) = 2\Phi(t\sqrt{2}) - 1$ be the Gauss error function and $\zeta(t) = t\,\mathrm{erf}(t) - (1 - \exp(-t^2))/\sqrt{\pi}$.

Theorem 2. Let $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^2, p, q)$, $\gamma = \|\mu - \nu\|_2$, and $\Gamma(p, q) = |p - q|/(p + q)$. There exist a two-layer network and a three-layer network with the following properties:

• If the intra-class and inter-class edge probabilities are $p, q = \Omega\left(\frac{\log^2 n}{n}\right)$, and it holds that $\Gamma(p, q)\,\zeta(\gamma/2\sigma) = \omega\left(\sqrt{\frac{\log n}{n(p+q)}}\right)$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks equipped with a graph convolution in the second or the third layer perfectly classify the data, and obtain the following loss:
$$\ell_\theta(A, X) = C' \exp\left( -C\sigma R\, \Gamma(p, q)\, \zeta(\gamma/2\sigma) \left( 1 \pm \sqrt{c/\log n} \right) \right),$$
where $C > 0$ and $C' \in [1/2, 1]$ are constants and $R$ is the constraint from Eq. (1).
• If $p, q = \Omega\left(\frac{\log n}{\sqrt{n}}\right)$ and $\Gamma(p, q)^2\, \zeta(\gamma/2\sigma) = \omega\left(\sqrt{\frac{\log n}{n}}\right)$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks with any combination of two graph convolutions in the second and/or the third layers perfectly classify the data, and obtain the following loss:
$$\ell_\theta(A, X) = C' \exp\left( -C\sigma R\, \Gamma(p, q)^2\, \zeta(\gamma/\sigma) \left( 1 \pm \sqrt{c/\log n} \right) \right),$$
where $C > 0$ and $C' \in [1/2, 1]$ are constants and $R$ is the constraint from Eq. (1).

To understand Theorem 2, it helps to consider the regime where $\Gamma(p, q) = \Omega(1)$. Part one of the theorem shows that under the assumption that $p, q = \Omega(\log^2 n / n)$, a single graph convolution improves the classification threshold for $\gamma$, the distance between the means, by a factor of at least $1/\sqrt[4]{n(p+q)}$ as compared to the case without the graph. Part two then shows that with a slightly stronger assumption on the graph density, we observe further improvement in the threshold up to a factor of at least $1/\sqrt[4]{n}$. We refer to Appendix A.8 for a comprehensive explanation of this simpler case. Note that although the regime of graph density is different for part two of the theorem, the result itself is an improvement. In particular, if $p, q = \Omega(\log n / \sqrt{n})$ then part one of the theorem states that one graph convolution achieves an improvement of at least $1/\sqrt[8]{n}$, while part two states that two convolutions improve it to at least $1/\sqrt[4]{n}$. However, we also emphasize that in the regime where the graph is dense, i.e., when $p, q = \Omega_n(1)$, two graph convolutions do not have a significant advantage over one graph convolution. Our experiments in Section 4.1 demonstrate this effect.

The XOR-CSBM data model also demonstrates why graph convolutions in the first layer can severely hurt the classification accuracy. Hence, for Theorem 2, our analysis only considers networks with no graph convolution in the first layer, i.e., $k_1 = 0$. This effect is visualized in Fig. 1, and is attributed to the averaging of data points in the same class but different components of the mixture that have means with opposite signs. We refer the reader to Appendix A.5 for a more formal argument, and to Appendix B.1 for experiments that demonstrate this phenomenon. As $n$ (the sample size) grows, the difference between the averages of node features over the two classes diminishes (see Figs. 1a and 1b). In other words, the means of the two classes collapse to the same point for large $n$. However, in the last layer, since the input consists of transformed features that are linearly separable, a graph convolution helps with the classification task (see Figs. 1c and 1d).
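The collapse caused by a first-layer convolution can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the experiment from Appendix B.1: it works in two dimensions with $\mu = (m, 0)$ and $\nu = (0, m)$, draws labels and components as Ber(1/2), always includes self-loops, and reports the average norm of a node's feature vector after one application of $D^{-1}A$ to the raw features. The function name is hypothetical.

```python
import math
import random


def mean_norm_after_first_layer_conv(n, mu_norm, sigma, p, q, seed=0):
    """Average Euclidean norm of the rows of D^{-1} A X for 2-D XOR-CSBM data."""
    rng = random.Random(seed)
    eps = [rng.randint(0, 1) for _ in range(n)]   # assumed Ber(1/2) class labels
    eta = [rng.randint(0, 1) for _ in range(n)]   # assumed Ber(1/2) components
    X = []
    for i in range(n):
        s = (2 * eta[i] - 1) * mu_norm
        mean = (s, 0.0) if eps[i] == 0 else (0.0, s)   # +-mu or +-nu
        X.append((mean[0] + rng.gauss(0.0, sigma),
                  mean[1] + rng.gauss(0.0, sigma)))

    # Neighbour lists with a self-loop at every node.
    nbrs = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < (p if eps[i] == eps[j] else q):
                nbrs[i].append(j)
                nbrs[j].append(i)

    total = 0.0
    for i in range(n):
        ax = sum(X[j][0] for j in nbrs[i]) / len(nbrs[i])
        ay = sum(X[j][1] for j in nbrs[i]) / len(nbrs[i])
        total += math.hypot(ax, ay)
    return total / n


if __name__ == "__main__":
    for n in (50, 400):
        avg = mean_norm_after_first_layer_conv(n, 1.0, 0.1, 0.5, 0.1)
        print(f"n = {n:4d}: average norm after convolution = {avg:.3f} (was about 1.0)")
```

Before the convolution every node sits near a point of norm `mu_norm`. After it, the $\pm\mu$ and $\pm\nu$ contributions of the neighbours largely cancel, so all four components are pulled towards the origin and the XOR structure is lost.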

3.3. PLACEMENT OF GRAPH CONVOLUTIONS

We observe that the improvements in the classification capability of a multi-layer network depend on the number of convolutions, and do not depend on where the convolutions are placed. In particular, for the XOR-CSBM data model, putting the same number of convolutions among the second and/or the third layer in any combination achieves mutually similar improvements in the classification task.

Corollary 2.1. Consider the data model XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$ and the network architecture from Section 2.2.

• Assume that $p, q = \Omega(\log^2 n / n)$, and consider the three-layer network characterized by part one of Theorem 2, with one graph convolution. For this network, placing the graph convolution in the second layer ($k_2 = 1$, $k_3 = 0$) obtains the same results as placing it in the third layer ($k_2 = 0$, $k_3 = 1$).
• Assume that $p, q = \Omega(\log n / \sqrt{n})$, and consider the three-layer network characterized by part two of Theorem 2, with two graph convolutions. For this network, placing both convolutions in the second layer ($k_2 = 2$, $k_3 = 0$) or both of them in the third layer ($k_2 = 0$, $k_3 = 2$) obtains the same results as placing one convolution in the second layer and one in the third layer ($k_2 = 1$, $k_3 = 1$).

Corollary 2.1 is immediate from the proof of Theorem 2 (see Appendices A.6 and A.7). In Section 4, we also show extensive experiments on both synthetic and real-world data that demonstrate this result.

3.4. PROOF SKETCH

In this section, we provide an overview of the key ideas and intuition behind our proof technique for the results. For comprehensive proofs, see Appendix A.

For part one of Theorem 1, we utilize the assumption on the distribution of the data. Since the underlying distribution of the mixture model is known, we can find the (Bayes) optimal classifier, $h^*(x)$, for the XOR-GMM, which takes the form $h^*(x) = \mathbb{1}(|\langle x, \nu \rangle| - |\langle x, \mu \rangle|)$, where $\mathbb{1}(\cdot)$ is the indicator function. We then compute a lower bound on the probability that $h^*$ fails to classify one data point from this model, followed by a concentration argument that computes a lower bound on the fraction of points that $h^*$ fails to classify with overwhelming probability. Consequently, a negative result for the Bayes optimal classifier implies a negative result for all classifiers.

For part two of Theorem 1, we design a two-layer and a three-layer network that realize the (Bayes) optimal classifier. We then use a concentration argument to show that in the regime where the distance between the means is large enough, the function representing our two-layer or three-layer network roughly evaluates to a quantity that has a positive sign for one class and a negative sign for the other class. Furthermore, the output of the function scales with the distance between the means. Thus, with a suitable assumption on the magnitude of the distance between the means, the output of the networks has the correct signs with overwhelming probability. Following this argument, we show that the cross-entropy loss obtained by the networks can be made arbitrarily small by controlling the optimization constraint $R$ (see Eq. (1)), implying perfect classification.

For Theorem 2, we observe that for the (Bayes) optimal networks designed for Theorem 1, placing graph convolutions in the second or the third layer reduces the effective variance of the functions representing the network. This stems from the fact that for the data model we consider, multi-layer networks with ReLU activations are Lipschitz functions of Gaussian random variables. First, we compute the precise reduction in the variance of the data characterized by $K > 0$ graph convolutions (see Lemma A.3). Then for part one of the theorem, where we analyze one graph convolution, we use the assumption on the graph density to conclude that the degree of each node concentrates around the expected degree. This helps us characterize the variance reduction, which further allows the distance between the means to be smaller than in the case of a standard MLP, hence obtaining an improvement in the threshold for perfect classification. Part two of the theorem studies the placement of two graph convolutions using a very similar argument. In this case, the variance reduction is characterized by the number of common neighbours of a pair of nodes rather than the degree of a node, and is stronger than the variance reduction offered by a single graph convolution.
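The classifier $h^*$ used in the sketch above compares two absolute values, and an absolute value is a sum of two ReLU units, $|t| = \mathrm{ReLU}(t) + \mathrm{ReLU}(-t)$. The Python sketch below shows one such ReLU realization of $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$ with unit vectors $\hat{\mu}, \hat{\nu}$; it is one natural construction and not necessarily the exact network built in Appendix A. It also checks by Monte Carlo the identity $\mathbb{E}h(X_i) = \sqrt{2}\sigma\zeta(\gamma/2\sigma)$ for $i \in C_1$ stated in Lemma A.4. The two-dimensional setup, the Ber(1/2) component sign, and all function names are assumptions made for the illustration.

```python
import math
import random


def relu(t):
    return max(0.0, t)


def h_relu(x, mu_hat, nu_hat):
    """|<x, nu_hat>| - |<x, mu_hat>| written with four ReLU units."""
    a = sum(xi * vi for xi, vi in zip(x, nu_hat))
    b = sum(xi * vi for xi, vi in zip(x, mu_hat))
    return relu(a) + relu(-a) - relu(b) - relu(-b)


def zeta(t):
    """zeta(t) = t * erf(t) - (1 - exp(-t^2)) / sqrt(pi)."""
    return t * math.erf(t) - (1.0 - math.exp(-t * t)) / math.sqrt(math.pi)


def monte_carlo_mean_h(gamma, sigma, trials=200_000, seed=0):
    """Estimate E h(X_i) for a class-1 point, whose mean is +nu or -nu."""
    rng = random.Random(seed)
    m = gamma / math.sqrt(2.0)            # ||mu|| = ||nu|| = gamma / sqrt(2)
    mu_hat, nu_hat = (1.0, 0.0), (0.0, 1.0)
    total = 0.0
    for _ in range(trials):
        sign = rng.choice((-1.0, 1.0))    # assumed Ber(1/2) component
        x = (rng.gauss(0.0, sigma), sign * m + rng.gauss(0.0, sigma))
        total += h_relu(x, mu_hat, nu_hat)
    return total / trials


if __name__ == "__main__":
    gamma, sigma = 2.0, 1.0
    print("Monte Carlo estimate :", monte_carlo_mean_h(gamma, sigma))
    print("sqrt(2)*sigma*zeta   :", math.sqrt(2.0) * sigma * zeta(gamma / (2.0 * sigma)))
```

By symmetry the class-0 expectation has the opposite sign, so the sign of $h$ separates the two classes once the fluctuations around these means are small, which is the role played by the variance reduction described above.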

4. EXPERIMENTS

In this section, we provide empirical evidence that supports our claims in Section 3. We begin by analyzing the synthetic data models XOR-GMM and XOR-CSBM that are crucial to our theoretical results, followed by a similar analysis on multiple real-world datasets tailored for node classification tasks. We show a comparison of the test accuracy obtained by various learning methods in different regimes, along with a display of how the performance changes with the properties of the underlying graph, i.e., with the intra-class and inter-class edge probabilities $p$ and $q$. For both synthetic and real-world data, the performance of the networks does not change significantly with the choice of the placement of graph convolutions. In particular, placing all convolutions in the last layer achieves performance similar to any other placement for the same number of convolutions. This observation aligns with the results in Gasteiger et al. (2019).

4.1. SYNTHETIC DATA

In this section, we empirically show the landscape of the accuracy achieved for various multi-layer networks with up to three layers and up to two graph convolutions. In Fig. 2, we show that as claimed in Theorem 2, a single graph convolution reduces the classification threshold by a factor of $1/\sqrt[4]{\mathbb{E}\deg}$ and two graph convolutions reduce the threshold by a factor of $1/\sqrt[4]{n}$, where $\mathbb{E}\deg = \frac{n}{2}(p + q)$. We observe that the placement of graph convolutions does not matter as long as it is not in the first layer. Figs. 2a and 2b show that the performance is mutually similar for all networks that have one graph convolution placed in the second or the third layer, and for all networks that have two graph convolutions placed in any combination among the second and the third layers. In Figs. 2c and 2d, we observe that two graph convolutions do not obtain a significant advantage over one graph convolution in the setting where $p$ and $q$ are large, i.e., when the graph is dense. We observed similar results for various other values of $p$ and $q$ (see Appendix B.1 for some more plots). Furthermore, in Appendix B.1 we verify that if a graph convolution is placed in the first layer of a network, then it is difficult to learn a classifier for the XOR-CSBM data model. In this case, test accuracy is low even for the regime where the distance between the means is quite large.
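The quantity $\mathbb{E}\deg = \frac{n}{2}(p + q)$ and the variance reduction it induces can be checked numerically. The Python sketch below samples a two-block SBM, compares the empirical mean degree with $\frac{n}{2}(p+q)$, and measures the variance of i.i.d. Gaussian noise after one application of $D^{-1}A$, which should be close to $\sigma^2/\mathbb{E}\deg$. It is a hedged illustration: the values of $n$, $p$, $q$ and the number of noise draws are arbitrary choices, classes are made exactly balanced, self-loops are always included, and the function name `simulate` is hypothetical.

```python
import random


def simulate(n, p, q, sigma, draws=50, seed=0):
    """Return (mean degree, variance of N(0, sigma^2) noise after one convolution)."""
    rng = random.Random(seed)
    eps = [i % 2 for i in range(n)]          # assumption: exactly balanced classes
    nbrs = [[i] for i in range(n)]           # assumption: self-loop at every node
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < (p if eps[i] == eps[j] else q):
                nbrs[i].append(j)
                nbrs[j].append(i)
    mean_deg = sum(len(v) for v in nbrs) / n

    # The convolved noise has mean zero, so its variance is its second moment.
    second_moment = 0.0
    for _ in range(draws):
        z = [rng.gauss(0.0, sigma) for _ in range(n)]
        for i in range(n):
            conv = sum(z[j] for j in nbrs[i]) / len(nbrs[i])
            second_moment += conv * conv
    return mean_deg, second_moment / (draws * n)


if __name__ == "__main__":
    n, p, q, sigma = 400, 0.2, 0.02, 1.0
    mean_deg, var_conv = simulate(n, p, q, sigma)
    print("n/2 * (p + q)           :", n / 2 * (p + q))
    print("empirical mean degree   :", mean_deg)
    print("sigma^2 / mean degree   :", sigma ** 2 / mean_deg)
    print("variance after one conv :", var_conv)
```

Since one convolution shrinks the noise variance by roughly $1/\mathbb{E}\deg$ while the signal enters through $\zeta$, which is quadratic in $\gamma/\sigma$ for small arguments (Lemma A.4), the resulting gain in the threshold for $\gamma$ appears as a fourth root.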

4.2. REAL-WORLD DATA

For real-world data, we test our results on three graph benchmarks: CORA, CiteSeer, and Pubmed citation network datasets (Sen et al., 2008). Results for larger datasets are presented in Appendix B.2. We observe the following trends: First, as claimed in Theorem 2, networks that utilize the graph perform remarkably better than a traditional MLP that does not use relational information. Second, all networks with one graph convolution in any layer achieve a mutually similar performance, and all networks with two graph convolutions in any combination of placement achieve a mutually similar performance. This demonstrates a result similar to Corollary 2.1 for real-world data. Finally, networks with two graph convolutions perform better than networks with one graph convolution.

[Figure panels: (a) Two-layer networks with (p, q) = (0.2, 0.02). (b) Three-layer networks with (p, q) = (0.2, 0.02). (c) Two-layer networks with (p, q) = (0.5, 0.1). (d) Three-layer networks with (p, q) = (0.5, 0.1).]

In Fig. 3, we present for all networks the maximum accuracy over 50 trials, where each trial corresponds to a random initialization of the networks. For 2-layer networks, the hidden layer has width 16, and for 3-layer networks, both hidden layers have width 16. We use a dropout probability of 0.5 and a weight decay of $10^{-5}$ while training. For this study, we attribute minor changes in the accuracy to hyperparameters involving dropout and weight decay. This helps us clearly observe the important difference in the accuracy of networks with one graph convolution versus two graph convolutions. For example, in Fig. 3a, we note that there are differences in the accuracy of the networks with one graph convolution (red and blue). However, these differences are minor compared to the difference between networks with one convolution (red and blue) and networks with two convolutions (green and yellow). We also show the averaged accuracy in Appendix B.2.

Note that the accuracy slightly differs from well-known results in the literature due to implementation differences. In particular, the GCN implementation in Kipf & Welling (2017) uses $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ as the normalized adjacency matrix, whereas we use $\tilde{A} = D^{-1} A$. In Appendix B.2, we also show empirical results for the normalization $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.
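The two normalizations differ only in how the degrees are applied, as the following minimal Python sketch shows on a three-node path graph with self-loops. The helper names are hypothetical, and the adjacency matrix is assumed to already contain the self-loops.

```python
def row_normalized(A):
    """D^{-1} A: every row sums to one, i.e. a plain neighbourhood average."""
    return [[a / sum(row) for a in row] for row in A]


def sym_normalized(A):
    """D^{-1/2} A D^{-1/2}: the symmetric normalization."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with self-loops (degrees 2, 3, 2).
    A = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
    for name, M in (("D^-1 A", row_normalized(A)),
                    ("D^-1/2 A D^-1/2", sym_normalized(A))):
        print(name)
        for row in M:
            print("  ", [round(v, 3) for v in row])
```

The row-normalized matrix is generally not symmetric but has unit row sums, while the symmetric version keeps $\tilde{A} = \tilde{A}^\top$ at the cost of row sums that depend on the neighbours' degrees. On graphs where all degrees are nearly equal, the two matrices nearly coincide.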

5. CONCLUSION AND FUTURE WORK

We study the fundamental limits of the capacity of graph convolutions when placed beyond the first layer of a multi-layer network for the XOR-CSBM data model, and provide theoretical guarantees for their performance in different regimes of the signal in the data. Through our experiments on both synthetic and real-world data, we show that the number of convolutions is a more significant factor for determining the performance of a network, rather than the number of layers in the network. Furthermore, we show that placing graph convolutions in any combination achieves mutually similar performance enhancements for the same number of them. We observe that multiple graph convolutions are advantageous when the underlying graph is relatively sparse. Intuitively, this is because in a dense graph, a single convolution can gather information from a large number of nodes, while in a sparser graph, more convolutions are needed to gather information from a larger number of nodes. Our analysis is limited to a positive result and we only provide a minimum guarantee for improvement in the classification threshold. To fully understand the limitations of graph convolutions, a complementary negative result (similar to part one of Theorem 1) for data models with relational information is required, showing the maximum improvement that graph convolutions can realize in a multi-layer network. This problem is hard because a graph convolution transforms an iid set of features into a highly correlated set of features, making it difficult to apply classical high-dimensional concentration arguments. Another potential line of work is to generalize our results for arbitrary data models. However, since our arguments rely heavily on the concentration of measure, it is hard to extend the analysis to arbitrary distributions. Therefore, we require distribution-agnostic tools.

A PROOFS

A.1 ASSUMPTIONS AND NOTATION

Assumption 1. For the XOR-GMM data model, the means of the Gaussian mixture are such that $\langle \mu, \nu \rangle = 0$ and $\|\mu\|_2 = \|\nu\|_2$.

We denote $[x]_+ = \mathrm{ReLU}(x)$ and $\varphi(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$, applied element-wise on the inputs. For any vector $v$, $\hat{v} = \frac{v}{\|v\|_2}$ denotes the normalized $v$. We use $\gamma = \|\mu - \nu\|_2$ to denote the distance between the means of the inter-class components of the mixture model, and $\gamma'$ to denote the norm of the means, $\gamma' = \gamma/\sqrt{2} = \|\mu\|_2 = \|\nu\|_2$. Given intra-class and inter-class edge probabilities $p$ and $q$, we define $\Gamma(p, q) = \frac{|p - q|}{p + q}$. We denote the probability density function of a standard Gaussian by $\phi(x)$, and the cumulative distribution function by $\Phi(x)$. The complementary distribution function is denoted by $\Phi_c(x) = 1 - \Phi(x)$.

A.2 ELEMENTARY RESULTS

In this section, we state preliminary results about the concentration of the degrees of all nodes and the number of common neighbours for all pairs of nodes, along with the effects of a graph convolution on the mean and the variance of some data. Our results regarding the merits of graph convolutions rely heavily on these arguments.

Proposition A.1 (Concentration of degrees). Assume that the graph density is $p, q = \Omega\left(\frac{\log^2 n}{n}\right)$. Then for any constant $c > 0$, with probability at least $1 - 2n^{-c}$, we have for all $i \in [n]$ that
$$\deg(i) = \frac{n}{2}(p + q)(1 \pm o_n(1)),$$
$$\frac{1}{\deg(i)} = \frac{2}{n(p + q)}(1 \pm o_n(1)),$$
$$\frac{1}{\deg(i)} \Big( \sum_{j \in C_1} a_{ij} - \sum_{j \in C_0} a_{ij} \Big) = (2\varepsilon_i - 1)\, \frac{p - q}{p + q}\, (1 + o_n(1)),$$
where the error term $o_n(1) = O\left(\sqrt{\frac{c}{\log n}}\right)$.

Proof. Note that $\deg(i)$ is a sum of $n$ Bernoulli random variables, hence, we have by the Chernoff bound (Vershynin, 2018, Section 2) that
$$\Pr\left[ \deg(i) \in \left[ \tfrac{n}{2}(p + q)(1 - \delta),\ \tfrac{n}{2}(p + q)(1 + \delta) \right]^c \right] \le 2\exp(-Cn(p + q)\delta^2),$$
for some $C > 0$. We now choose $\delta = \sqrt{\frac{(c+1)\log n}{Cn(p+q)}}$ for a large constant $c > 0$. Note that since $p, q = \Omega(\log^2 n / n)$, we have that $\delta = O\left(\sqrt{\frac{c}{\log n}}\right) = o_n(1)$. Then following a union bound over $i \in [n]$, we obtain that with probability at least $1 - 2n^{-c}$,
$$\deg(i) = \frac{n}{2}(p + q)\left( 1 \pm O\left(\sqrt{\tfrac{c}{\log n}}\right) \right) \ \text{ for all } i \in [n],$$
$$\frac{1}{\deg(i)} = \frac{2}{n(p + q)}\left( 1 \pm O\left(\sqrt{\tfrac{c}{\log n}}\right) \right) \ \text{ for all } i \in [n].$$
Note that $\frac{1}{\deg(i)} \sum_{j \in C_b} a_{ij}$ for any $b \in \{0, 1\}$ is a sum of independent Bernoulli random variables. Hence, by a similar argument, we have that with probability at least $1 - 2n^{-c}$,
$$\frac{1}{\deg(i)} \Big( \sum_{j \in C_1} a_{ij} - \sum_{j \in C_0} a_{ij} \Big) = (2\varepsilon_i - 1)\, \frac{p - q}{p + q}\, (1 + o_n(1)) \ \text{ for all } i \in [n].$$

Proposition A.2 (Concentration of the number of common neighbours). Assume that the graph density is $p, q = \Omega\left(\frac{\log n}{\sqrt{n}}\right)$. Then for any constant $c > 0$, with probability at least $1 - 2n^{-c}$,
$$|N_i \cap N_j| = \frac{n}{2}(p^2 + q^2)(1 \pm o_n(1)) \ \text{ for all } i \sim j,$$
$$|N_i \cap N_j| = npq\,(1 \pm o_n(1)) \ \text{ for all } i \nsim j,$$
where the error term $o_n(1) = O\left(\sqrt{\frac{c}{\log n}}\right)$.

Proof.
For any two distinct nodes $i, j \in [n]$ we have that the number of common neighbours of $i$ and $j$ is $|N_i \cap N_j| = \sum_{k \in [n]} a_{ik}a_{jk}$. This is a sum of independent Bernoulli random variables, with mean $\mathbf{E}|N_i \cap N_j| = \frac{n}{2}(p^2 + q^2)$ for $i \sim j$ and $\mathbf{E}|N_i \cap N_j| = npq$ for $i \nsim j$. Denote $\mu_{ij} = \mathbf{E}|N_i \cap N_j|$. Therefore, by the Chernoff bound (Vershynin, 2018, Section 2), we have for a fixed pair of nodes $(i, j)$ that
$$\Pr\big[|N_i \cap N_j| \in [\mu_{ij}(1 - \delta_{ij}),\ \mu_{ij}(1 + \delta_{ij})]^c\big] \le 2\exp(-C\mu_{ij}\delta_{ij}^2)$$
for some constant $C > 0$. We now choose $\delta_{ij} = \sqrt{\frac{(c+2)\log n}{C\mu_{ij}}}$ for any large $c > 0$. Note that since $p, q = \Omega(\log n / \sqrt{n})$, we have that $\delta_{ij} = O\big(\sqrt{c/\log n}\big) = o_n(1)$. Then following a union bound over all pairs $(i, j) \in [n] \times [n]$, we obtain that with probability at least $1 - 2n^{-c}$, for all pairs of nodes $(i, j)$ we have
$$|N_i \cap N_j| = \frac{n}{2}(p^2 + q^2)(1 \pm o_n(1)) \text{ for all } i \sim j,$$
$$|N_i \cap N_j| = npq(1 \pm o_n(1)) \text{ for all } i \nsim j.$$

Lemma A.3 (Variance reduction). Denote the event from Proposition A.1 to be $B$. Let $\{X_i\}_{i \in [n]} \in \mathbb{R}^{n \times d}$ be an iid sample of data. For a graph with adjacency matrix $A$ (including self-loops) and a fixed integer $K > 0$, define a $K$-convolution to be $\tilde{X} = (D^{-1}A)^K X$. Then we have
$$\mathrm{Cov}(\tilde{X}_i \mid B) = \rho(K)\,\mathrm{Cov}(X_i), \quad \text{where } \rho(K) = \frac{1 + o_n(1)}{\Delta^{2K}} \sum_{j \in [n]} A^K(i, j)^2.$$
Here, $A^K(i, j)$ is the entry in the $i$th row and $j$th column of the exponentiated matrix $A^K$ and $\Delta = \mathbf{E}\deg = \frac{n}{2}(p + q)$.

Proof. For a matrix $M$, the $i$th convolved data point is $\tilde{X}_i = M_i X$, where $M_i$ denotes the $i$th row of $M$. Since $X_i$ are iid, we have
$$\mathrm{Cov}(\tilde{X}_i) = \sum_{j \in [n]} (M_{ij})^2\,\mathrm{Cov}(X_j).$$
It remains to compute the entries of the matrix $M = (D^{-1}A)^K$. Note that we have $D^{-1}A(i, j) = a_{ij}/\deg(i)$, so we obtain that
$$M_{ij} = (D^{-1}A)^K(i, j) = \sum_{j_1=1}^{n}\sum_{j_2=1}^{n}\cdots\sum_{j_{K-1}=1}^{n} \frac{a_{ij_1}a_{j_1j_2}\cdots a_{j_{K-2}j_{K-1}}a_{j_{K-1}j}}{\deg(i)\deg(j_1)\cdots\deg(j_{K-1})}.$$
Recall that on the event $B$, the degrees of all nodes are $\Delta(1 \pm o_n(1))$, and hence, we have that
$$M_{ij} = \frac{(1 \pm o_n(1))^K}{\Delta^K} \sum_{j_1=1}^{n}\sum_{j_2=1}^{n}\cdots\sum_{j_{K-1}=1}^{n} a_{ij_1}\cdots a_{j_{K-2}j_{K-1}}a_{j_{K-1}j},$$
where the error $o_n(1) = O(1/\sqrt{\log n})$. The sum of these products of the entries of $A$ is simply the number of length-$K$ paths from node $i$ to $j$, i.e., $A^K(i, j)$. Thus, we have
$$\mathrm{Cov}(\tilde{X}_i \mid B) = \sum_{j \in [n]} (M_{ij})^2\,\mathrm{Cov}(X_j) = \frac{1 + o_n(1)}{\Delta^{2K}} \sum_{j \in [n]} A^K(i, j)^2\,\mathrm{Cov}(X_j).$$
Since $X_j$ are iid, we obtain that $\rho(K) = \frac{1 + o_n(1)}{\Delta^{2K}} \sum_{j \in [n]} A^K(i, j)^2$.

Let us briefly discuss the implications of Lemma A.3. Consider a sample $(A, X)$ drawn from XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$ for the symmetric case where exactly $n/2$ nodes are in each of the two classes. We have that
$$\mathbf{E}A = \begin{pmatrix} p\,\mathbb{1}_{n/2} & q\,\mathbb{1}_{n/2} \\ q\,\mathbb{1}_{n/2} & p\,\mathbb{1}_{n/2} \end{pmatrix},$$
where $\mathbb{1}_{n/2}$ is the $n/2 \times n/2$ all-ones matrix. This gives us $\mathbf{E}\rho(K) \approx \frac{1}{n}(1 + \Gamma(p, q)^{2K})$ for any $K \ge 2$. Recall that a single graph convolution reduces the distance between the means by a factor of $\Gamma(p, q)$. Hence, to comment on the performance of an arbitrary number of convolutions, $K$, we might hope to compare the reduction in this distance, $\Gamma(p, q)^K$, with the reduction in the variance, $\rho(K)$, to obtain a condition on $K$ in terms of $n$, $p$, and $q$. The challenge, however, lies in the fact that in deeper layers, computing $\rho(K)$ is non-trivial due to node features being highly correlated. Moreover, an argument to claim that $\rho(K) \approx \mathbf{E}\rho(K)$ is needed for this approach, which seems to require strong density assumptions on the graph.

We now state a result about the output of the (Bayes) optimal classifier for the XOR-GMM data model that is used in several of our proofs.

Lemma A.4. Let $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$ for all $x \in \mathbb{R}^d$ and define $\zeta(t) = t\,\mathrm{erf}(t) - \frac{1}{\sqrt{\pi}}\big(1 - e^{-t^2}\big)$. Then we have

1. The expectation $\mathbf{E}h(X_i) = -\sqrt{2}\sigma\zeta(\gamma/2\sigma)$ if $i \in C_0$, and $\mathbf{E}h(X_i) = \sqrt{2}\sigma\zeta(\gamma/2\sigma)$ if $i \in C_1$.
2. For any $\gamma, \sigma > 0$ such that $\gamma = \Omega_n(\sigma)$, we have that $\zeta(\gamma/\sigma) = \Omega(\gamma/\sigma)$.
3. For any $\gamma, \sigma > 0$ such that $\gamma = o_n(\sigma)$, we have that $\zeta(\gamma/\sigma) = \Omega(\gamma^2/\sigma^2)$.

Proof.
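For part one, observe that $\langle X_i, \hat{\mu} \rangle$ and $\langle X_i, \hat{\nu} \rangle$ are Gaussian random variables with variance $\sigma^2$ and means of magnitude $\gamma/\sqrt{2}, 0$ if $\varepsilon_i = 0$ and $0, \gamma/\sqrt{2}$ if $\varepsilon_i = 1$, respectively. Thus, $|\langle X_i, \hat{\mu} \rangle|$ and $|\langle X_i, \hat{\nu} \rangle|$ are folded-Gaussian random variables and we have $\mathbf{E}h(X_i) = -\sqrt{2}\sigma\zeta(\gamma'/\sqrt{2}\sigma) = -\sqrt{2}\sigma\zeta(\gamma/2\sigma)$ if $i \in C_0$ and $\mathbf{E}h(X_i) = \sqrt{2}\sigma\zeta(\gamma/2\sigma)$ otherwise.

We now write $\zeta(t) = t\big(\mathrm{erf}(t) - \frac{1}{t\sqrt{\pi}}(1 - e^{-t^2})\big) = tH(t)$, where $H(t) = \mathrm{erf}(t) - \frac{1}{t\sqrt{\pi}}(1 - e^{-t^2})$. For part two, note that $H(t)$ is an increasing function in the range $[-1, 1]$ and $H(t) > 0$ for $t > 0$. Hence, for $t \ge C$ for some positive constant $C$, $H(t) \ge H(C) = C'$, therefore, $\zeta(t) = tH(t) \ge C't$.

For part three, when $t = o_n(1)$, we use the series expansion of $H(t)$ about $t = 0$ to obtain that
$$H(t) = \frac{t}{\sqrt{\pi}} - \frac{t^3}{6\sqrt{\pi}} + O(t^5) \ge \frac{t}{\sqrt{\pi}} - \frac{t^3}{6\sqrt{\pi}} = \Omega(t).$$
Hence, $\zeta(t) = tH(t) = \Omega(t^2)$. Putting $t = \gamma/\sigma$ completes the proof.

Fact A.5. For any $x \in [0, 1]$, $\frac{x}{2} \le \log(1 + x) \le x$.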
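Part one of Lemma A.4 can be checked numerically. The sketch below compares the closed form $\sqrt{2}\sigma\zeta(\gamma/2\sigma)$ with a Monte Carlo estimate of $\mathbf{E}h(X_i)$ for a node in class $C_1$, working directly with the two one-dimensional projections $\langle X_i, \hat{\nu} \rangle$ and $\langle X_i, \hat{\mu} \rangle$. The function names, sample size, seed, and the values $\gamma = 2$, $\sigma = 1$ are assumptions chosen for illustration.

```python
import math
import random


def zeta(t):
    """zeta(t) = t * erf(t) - (1 - exp(-t^2)) / sqrt(pi)."""
    return t * math.erf(t) - (1.0 - math.exp(-t * t)) / math.sqrt(math.pi)


def H(t):
    """H(t) = erf(t) - (1 - exp(-t^2)) / (t * sqrt(pi)), so zeta(t) = t * H(t)."""
    return math.erf(t) - (1.0 - math.exp(-t * t)) / (t * math.sqrt(math.pi))


def expected_h_closed_form(gamma, sigma):
    """E h(X_i) for i in C_1 according to Lemma A.4, part one."""
    return math.sqrt(2.0) * sigma * zeta(gamma / (2.0 * sigma))


def expected_h_monte_carlo(gamma, sigma, num_samples=200_000, seed=0):
    """Estimate E[|<X, nu_hat>| - |<X, mu_hat>|] for X drawn from class C_1."""
    rng = random.Random(seed)
    gamma_prime = gamma / math.sqrt(2.0)  # norm of the means
    total = 0.0
    for _ in range(num_samples):
        sign = 1.0 if rng.random() < 0.5 else -1.0  # component +nu or -nu
        proj_nu = sign * gamma_prime + sigma * rng.gauss(0.0, 1.0)
        proj_mu = sigma * rng.gauss(0.0, 1.0)       # mean zero by orthogonality
        total += abs(proj_nu) - abs(proj_mu)
    return total / num_samples


closed_form = expected_h_closed_form(gamma=2.0, sigma=1.0)
estimate = expected_h_monte_carlo(gamma=2.0, sigma=1.0)
print(f"closed form: {closed_form:.4f}, Monte Carlo: {estimate:.4f}")
```

For small $t$ the same `zeta` function behaves like $t^2/\sqrt{\pi}$, which is the quadratic regime used in part three of the lemma.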

A.3 PROOF OF THEOREM 1 PART ONE

In this section we prove our first result about the fraction of misclassified points in the absence of graphical information. We begin by computing the Bayes optimal classifier for the data model XOR-GMM (see Section 2.1). A Bayes classifier, denoted by $h^*(x)$, maximizes the posterior probability of observing a label given the input data $x$. More precisely, $h^*(x) = \mathrm{argmax}_{b \in \{0,1\}} \Pr[y = b \mid \mathbf{x} = x]$, where $x \in \mathbb{R}^d$ represents a single data point.

Lemma A.6. For some fixed $\mu, \nu \in \mathbb{R}^d$ and $\sigma^2 > 0$, the Bayes optimal classifier, $h^*(x) : \mathbb{R}^d \to \{0, 1\}$, for the data model XOR-GMM$(n, d, \mu, \nu, \sigma^2)$ is given by
$$h^*(x) = \mathbb{1}(|\langle x, \mu \rangle| < |\langle x, \nu \rangle|) = \begin{cases} 0 & |\langle x, \mu \rangle| \ge |\langle x, \nu \rangle| \\ 1 & |\langle x, \mu \rangle| < |\langle x, \nu \rangle| \end{cases},$$
where $\mathbb{1}$ is the indicator function.

Proof. Note that $\Pr[y = 0] = \Pr[y = 1] = \frac{1}{2}$. Let $f_{\mathbf{x}}(x)$ denote the density function of a continuous random vector $\mathbf{x}$. Therefore, for any $b \in \{0, 1\}$,
$$\Pr[y = b \mid \mathbf{x} = x] = \frac{\Pr[y = b]\, f_{\mathbf{x}|y}(x \mid y = b)}{\sum_{c \in \{0,1\}} \Pr[y = c]\, f_{\mathbf{x}|y}(x \mid y = c)} = \frac{1}{1 + \frac{f_{\mathbf{x}|y}(x \mid y = 1-b)}{f_{\mathbf{x}|y}(x \mid y = b)}}.$$
We compute this for $b = 0$. We have
$$\frac{f_{\mathbf{x}|y}(x \mid y = 1)}{f_{\mathbf{x}|y}(x \mid y = 0)} = \frac{\cosh(\langle x, \nu \rangle/\sigma^2)}{\cosh(\langle x, \mu \rangle/\sigma^2)} \exp\Big(\frac{\|\mu\|_2^2 - \|\nu\|_2^2}{2\sigma^2}\Big) = \frac{\cosh(\langle x, \nu \rangle/\sigma^2)}{\cosh(\langle x, \mu \rangle/\sigma^2)},$$
where in the last equation we used the assumption that $\|\mu\|_2 = \|\nu\|_2$. The decision regions are then identified by: $\Pr[y = 0 \mid x] \ge 1/2$ for label 0 and $\Pr[y = 0 \mid x] < 1/2$ for label 1. Thus, for label 0, we need $\frac{f_{\mathbf{x}|y}(x \mid y = 1)}{f_{\mathbf{x}|y}(x \mid y = 0)} \le 1$, which implies that $\frac{\cosh(\langle x, \nu \rangle/\sigma^2)}{\cosh(\langle x, \mu \rangle/\sigma^2)} \le 1$. Now we note that $\cosh(x) \le \cosh(y) \implies |x| \le |y|$ for all $x, y \in \mathbb{R}$, hence, we have $|\langle x, \mu \rangle| \ge |\langle x, \nu \rangle|$. Similarly, we have the complementary condition for label 1.

Next, we design a two-layer and a three-layer network and show that for a particular choice of parameters $\theta = (W^{(l)}, b^{(l)})$ for $l \in \{1, 2\}$ for the two-layer case and $l \in \{1, 2, 3\}$ for the three-layer case, the networks realize the optimal classifier described in Lemma A.6.

Proposition A.7.
Consider two-layer and three-layer networks of the form described in Section 2.2, without biases (i.e., $b^{(l)} = 0$ for all layers $l$), for parameters $W^{(l)}$ and some $R \in \mathbb{R}_+$ as follows.

1. For the two-layer network, $W^{(1)} = R\,(\hat{\mu}\ \ {-\hat{\mu}}\ \ \hat{\nu}\ \ {-\hat{\nu}})$, $W^{(2)} = (-1\ \ {-1}\ \ 1\ \ 1)^\top$.
2. For the three-layer network, $W^{(1)} = R\,(\hat{\mu}\ \ {-\hat{\mu}}\ \ \hat{\nu}\ \ {-\hat{\nu}})$, $W^{(2)} = \begin{pmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{pmatrix}$, $W^{(3)} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$.

Then for any $\sigma > 0$, the defined networks realize the Bayes optimal classifier for the data model XOR-GMM$(n, d, \mu, \nu, \sigma^2)$.

Proof. Note that the output of the two-layer network is $\phi([XW^{(1)}]_+ W^{(2)})$, which is interpreted as the probability with which the network believes that the input is in the class with label 1. The final prediction for the class label is thus assigned to be 1 if the output is $\ge 0.5$, and 0 otherwise. For each $i \in [n]$, we have that the output of the network on data point $i$ is
$$\hat{y}_i = \phi\big(R(-[\langle X_i, \hat{\mu} \rangle]_+ - [-\langle X_i, \hat{\mu} \rangle]_+ + [\langle X_i, \hat{\nu} \rangle]_+ + [-\langle X_i, \hat{\nu} \rangle]_+)\big) = \phi\big(R(|\langle X_i, \hat{\nu} \rangle| - |\langle X_i, \hat{\mu} \rangle|)\big),$$
where we used the fact that $[t]_+ + [-t]_+ = |t|$ for all $t \in \mathbb{R}$. Similarly, for the three-layer network, the output is $\phi([[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)})$. So we have for each $i \in [n]$ that
$$\hat{y}_i = \phi\Big(R\big[-[\langle X_i, \hat{\mu} \rangle]_+ - [-\langle X_i, \hat{\mu} \rangle]_+ + [\langle X_i, \hat{\nu} \rangle]_+ + [-\langle X_i, \hat{\nu} \rangle]_+\big]_+ - R\big[[\langle X_i, \hat{\mu} \rangle]_+ + [-\langle X_i, \hat{\mu} \rangle]_+ - [\langle X_i, \hat{\nu} \rangle]_+ - [-\langle X_i, \hat{\nu} \rangle]_+\big]_+\Big)$$
$$= \phi\big(R([|\langle X_i, \hat{\nu} \rangle| - |\langle X_i, \hat{\mu} \rangle|]_+ - [|\langle X_i, \hat{\mu} \rangle| - |\langle X_i, \hat{\nu} \rangle|]_+)\big) = \phi\big(R(|\langle X_i, \hat{\nu} \rangle| - |\langle X_i, \hat{\mu} \rangle|)\big),$$
where in the last equation we used the fact that $[t]_+ - [-t]_+ = t$ for all $t \in \mathbb{R}$. The final prediction is then obtained by considering the maximum posterior probability among the class labels 0 and 1, and thus,
$$\mathrm{pred}(X_i) = \mathbb{1}(R|\langle X_i, \hat{\mu} \rangle| < R|\langle X_i, \hat{\nu} \rangle|) = \mathbb{1}(|\langle X_i, \mu \rangle| < |\langle X_i, \nu \rangle|),$$
which matches the Bayes classifier in Lemma A.6.

We now restate the relevant theorem below for convenience.

Theorem (Restatement of part one of Theorem 1). Let $X \in \mathbb{R}^{n \times d} \sim$ XOR-GMM$(n, d, \mu, \nu, \sigma^2)$. Assume that $\|\mu - \nu\|_2 \le K\sigma$ and let $h(x) : \mathbb{R}^d \to \{0, 1\}$ be any binary classifier.
Then for any $K > 0$ and any $\epsilon \in (0, 1)$, at least a fraction $2\Phi^c(K/2)^2 - O(n^{-\epsilon/2})$ of all data points are misclassified by $h$ with probability at least $1 - \exp(-2n^{1-\epsilon})$.

Proof. Recall from Lemma A.6 that for successful classification, we require for every $i \in [n]$,
$$|\langle X_i, \mu \rangle| \ge |\langle X_i, \nu \rangle| \text{ for } i \in C_0, \qquad |\langle X_i, \mu \rangle| < |\langle X_i, \nu \rangle| \text{ for } i \in C_1.$$
We now upper bound the probability of the above event, i.e., the probability that the data is classifiable. We consider only class $C_0$, since the analysis for $C_1$ is symmetric and similar. For $i \in C_0$, we can write $X_i = \mu + \sigma g_i$, where $g_i \sim N(0, I)$. Recall that $\gamma = \|\mu - \nu\|_2$ and $\gamma' = \gamma/\sqrt{2} = \|\mu\|_2 = \|\nu\|_2$. Then we have for any fixed $i \in C_0$ that
$$\Pr[|\langle X_i, \mu \rangle| \ge |\langle X_i, \nu \rangle|] = \Pr[|\gamma' + \sigma\langle g_i, \hat{\mu} \rangle| \ge |\sigma\langle g_i, \hat{\nu} \rangle|]$$
$$\le \Pr[\gamma' + \sigma|\langle g_i, \hat{\mu} \rangle| \ge \sigma|\langle g_i, \hat{\nu} \rangle|] \quad \text{(by triangle inequality)}$$
$$\le \Pr[|\langle g_i, \hat{\nu} \rangle| - |\langle g_i, \hat{\mu} \rangle| \le K/\sqrt{2}] \quad \text{(using } \gamma \le K\sigma\text{)}.$$
We now define random variables $Z_1 = \langle g_i, \hat{\nu} \rangle$ and $Z_2 = \langle g_i, \hat{\mu} \rangle$ and note that $Z_1, Z_2 \sim N(0, 1)$ and $\mathbf{E}[Z_1Z_2] = 0$. Let $K' = K/\sqrt{2}$. We now have
$$\Pr[|Z_1| - |Z_2| \le K'] = 4\Pr[Z_1 - Z_2 \le K',\ Z_1, Z_2 \ge 0] = 4\int_0^\infty \Pr[0 \le Z_1 \le z + K']\,\varphi(z)\,dz$$
$$= 4\int_0^\infty \Big(\Phi(z + K') - \frac{1}{2}\Big)\varphi(z)\,dz = 4\int_0^\infty \Phi(z + K')\,\varphi(z)\,dz - 1$$
$$= 2\Phi(K/2) + 2\Phi(K/2)\Phi^c(K/2) - 1 = 1 - 2\Phi^c(K/2)^2.$$
To evaluate the integral above, we used (Owen, 1980, Table 1:10,010.6 and Table 2:2.3). Thus, the probability that a point $i \in C_0$ is misclassified is lower bounded as follows:
$$\Pr[X_i \text{ is misclassified}] \ge 2\Phi^c(K/2)^2 = \tau_K.$$
Note that this is a decreasing function of $K$, implying that the probability of misclassification decreases as we increase the distance between the means, and is maximum for $K = 0$.

Define $M(n)$ for a fixed $K$ to be the fraction of misclassified nodes in $C_0$. Define $x_i$ to be the indicator random variable $\mathbb{1}(X_i \text{ is misclassified})$. Then $x_i$ are Bernoulli random variables with mean at least $\tau_K$, and $\mathbf{E}M(n) = \frac{2}{n}\sum_{i \in C_0} \mathbf{E}x_i \ge \tau_K$.
Using Hoeffding's inequality (Vershynin, 2018, Theorem 2.2.6), we have that for any $t > 0$,
$$\Pr[M(n) \ge \tau_K - t] \ge \Pr[M(n) \ge \mathbf{E}M(n) - t] \ge 1 - \exp(-nt^2).$$
Choosing $t = n^{-\epsilon/2}$ for any $\epsilon \in (0, 1)$ yields
$$\Pr\big[M(n) \ge \tau_K - n^{-\epsilon/2}\big] \ge 1 - \exp(-n^{1-\epsilon}).$$
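The Gaussian integral in the proof above can be sanity-checked by simulation. The following sketch estimates $\Pr[|Z_1| - |Z_2| > K/\sqrt{2}]$ for independent standard Gaussians and compares it with $\tau_K = 2\Phi^c(K/2)^2$. The function names, the sample size, and the seed are illustrative assumptions.

```python
import math
import random


def std_normal_ccdf(x):
    """Phi^c(x) = 1 - Phi(x) for a standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def misclassification_lower_bound(K):
    """tau_K = 2 * Phi^c(K / 2)^2 from the proof of Theorem 1, part one."""
    return 2.0 * std_normal_ccdf(K / 2.0) ** 2


def simulate_tail_probability(K, num_samples=200_000, seed=0):
    """Monte Carlo estimate of Pr[|Z1| - |Z2| > K / sqrt(2)]."""
    rng = random.Random(seed)
    threshold = K / math.sqrt(2.0)
    hits = 0
    for _ in range(num_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        if abs(z1) - abs(z2) > threshold:
            hits += 1
    return hits / num_samples


for K in (0.0, 1.0, 2.0):
    bound = misclassification_lower_bound(K)
    simulated = simulate_tail_probability(K)
    print(f"K={K}: tau_K={bound:.4f}, simulated={simulated:.4f}")
```

The bound equals $1/2$ at $K = 0$ and decreases as $K$ grows, matching the remark that the misclassification probability is largest when the means coincide.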

A.4 PROOF OF THEOREM 1 PART TWO

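In this section, we show that in the positive regime (sufficiently large distance between the means), there exists a two-layer MLP that obtains an arbitrarily small loss, and hence, successfully classifies a sample drawn from the XOR-GMM model with overwhelming probability.

Theorem (Restatement of part two of Theorem 1). Let $X \in \mathbb{R}^{n \times d} \sim$ XOR-GMM$(n, d, \mu, \nu, \sigma^2)$. For any $\epsilon > 0$, if the distance between the means is $\|\mu - \nu\|_2 = \Omega(\sigma(\log n)^{\frac{1}{2}+\epsilon})$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the two-layer and three-layer networks described in Proposition A.7 classify all data points, and obtain a cross-entropy loss given by
$$\ell_\theta(X) = C\exp\Big(-\frac{R}{\sqrt{2}}\|\mu - \nu\|_2\big(1 \pm \sqrt{c}/(\log n)^{\epsilon}\big)\Big),$$
where $C \in [1/2, 1]$ is an absolute constant.

Proof. Consider the two-layer and three-layer MLPs described in Proposition A.7, for which we have $\hat{y}_i = \phi(R(|\langle X_i, \hat{\nu} \rangle| - |\langle X_i, \hat{\mu} \rangle|))$. We now look at the loss for a single data point $X_i$,
$$\ell_i(X, \theta) = -y_i\log(\hat{y}_i) - (1 - y_i)\log(1 - \hat{y}_i) = \log\big(1 + \exp\big((1 - 2y_i)R(|\langle X_i, \hat{\nu} \rangle| - |\langle X_i, \hat{\mu} \rangle|)\big)\big).$$
Note that $\langle X_i - \mathbf{E}X_i, \hat{\mu} \rangle$ and $\langle X_i - \mathbf{E}X_i, \hat{\nu} \rangle$ are mean 0 Gaussian random variables with variance $\sigma^2$. So for any fixed $i \in [n]$ and $m_c \in \{\mu, \nu\}$, we use (Vershynin, 2018, Proposition 2.1.2) to obtain
$$\Pr[|\langle X_i - \mathbf{E}X_i, \hat{m}_c \rangle| > t] \le \frac{\sigma}{t\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2\sigma^2}\Big).$$
Then by a union bound over all $i \in [n]$ and $m_c \in \{\mu, \nu\}$, we have that
$$\Pr[|\langle X_i - \mathbf{E}X_i, \hat{m}_c \rangle| \le t\ \ \forall i \in [n],\ m_c \in \{\mu, \nu\}] \ge 1 - \frac{n\sigma}{t}\sqrt{\frac{2}{\pi}}\exp\Big(-\frac{t^2}{2\sigma^2}\Big).$$
We now set $t = \sigma\sqrt{2(c+1)\log n}$ for any large constant $c > 0$. We now have with probability at least $1 - \frac{n^{-c}}{\sqrt{\pi(c+1)\log n}}$ that
$$\langle X_i, \hat{\mu} \rangle = \langle \mathbf{E}X_i, \hat{\mu} \rangle \pm O(\sigma\sqrt{c\log n}), \quad \langle X_i, \hat{\nu} \rangle = \langle \mathbf{E}X_i, \hat{\nu} \rangle \pm O(\sigma\sqrt{c\log n}) \quad \forall i \in [n].$$
Thus, we can write
$$|\langle X_i, \hat{\mu} \rangle| = \gamma'\big(1 \pm O(\sqrt{c}/(\log n)^{\epsilon})\big), \quad |\langle X_i, \hat{\nu} \rangle| = \gamma' \cdot O(\sqrt{c}/(\log n)^{\epsilon}) \quad \forall i \in C_0, \qquad (2)$$
$$|\langle X_i, \hat{\mu} \rangle| = \gamma' \cdot O(\sqrt{c}/(\log n)^{\epsilon}), \quad |\langle X_i, \hat{\nu} \rangle| = \gamma'\big(1 \pm O(\sqrt{c}/(\log n)^{\epsilon})\big) \quad \forall i \in C_1. \qquad (3)$$
Using Eqs. (2) and (3) in the expression for the loss, we obtain for all $i \in [n]$,
$$\ell_i(X, \theta) = \log(1 + \exp(-R\gamma'(1 \pm o_n(1)))),$$
where the error term $o_n(1) = O(\sqrt{c}/(\log n)^{\epsilon})$.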
The total loss is then given by
$$\ell_\theta(X) = \frac{1}{n}\sum_{i \in [n]} \ell_i(X, \theta) = \log(1 + \exp(-R\gamma'(1 + o_n(1)))).$$
Next, Fact A.5 implies that for $t < 0$, $e^t/2 \le \log(1 + e^t) \le e^t$, hence, we have that there exists a constant $C \in [1/2, 1]$ such that
$$\ell_\theta(X) = C\exp(-R\gamma'(1 + o_n(1))).$$
Note that by scaling the optimality constraint $R$, the loss can go arbitrarily close to 0.
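The networks of Proposition A.7 that are used in the proof above can be written out explicitly. The sketch below evaluates both architectures on single data points in plain Python and compares their predictions with the Bayes classifier of Lemma A.6. The example means, the test points, and all function names are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relu(x):
    return max(x, 0.0)


def first_layer(x, mu_hat, nu_hat, R):
    """[x W1]_+ with W1 = R (mu_hat, -mu_hat, nu_hat, -nu_hat)."""
    a, b = R * dot(x, mu_hat), R * dot(x, nu_hat)
    return [relu(a), relu(-a), relu(b), relu(-b)]


def two_layer_logit(x, mu_hat, nu_hat, R):
    """Pre-sigmoid output with W2 = (-1, -1, 1, 1)^T."""
    return dot(first_layer(x, mu_hat, nu_hat, R), [-1.0, -1.0, 1.0, 1.0])


def three_layer_logit(x, mu_hat, nu_hat, R):
    """Pre-sigmoid output with the 4x2 matrix W2 and W3 = (1, -1)^T."""
    hidden = first_layer(x, mu_hat, nu_hat, R)
    w2_columns = [[-1.0, -1.0, 1.0, 1.0], [1.0, 1.0, -1.0, -1.0]]
    second = [relu(dot(hidden, column)) for column in w2_columns]
    return second[0] - second[1]


def bayes_label(x, mu, nu):
    """Bayes classifier of Lemma A.6."""
    return 1 if abs(dot(x, mu)) < abs(dot(x, nu)) else 0


def cross_entropy(label, logit):
    """log(1 + exp((1 - 2 * label) * logit)), evaluated stably."""
    z = (1 - 2 * label) * logit
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


# Hypothetical orthogonal means of equal norm and a few test points.
mu, nu = [2.0, 0.0], [0.0, 2.0]
mu_hat, nu_hat = [1.0, 0.0], [0.0, 1.0]
points = [[2.1, 0.3], [-1.8, -0.4], [0.2, 2.5], [-0.1, -1.7]]

for x in points:
    logit = two_layer_logit(x, mu_hat, nu_hat, R=3.0)
    predicted = 1 if logit >= 0 else 0  # sigmoid(logit) >= 0.5
    print(x, predicted, bayes_label(x, mu, nu))
```

Increasing `R` scales the logit without changing its sign, so the prediction is unchanged while the cross-entropy loss of a correctly classified point shrinks, which is the scaling remark at the end of the proof.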

A.5 GRAPH CONVOLUTION IN THE FIRST LAYER

In this section, we show precisely why a graph convolution operation in the first layer is detrimental to the classification task.

Proposition A.8. Fix a positive integer $d > 0$, $\sigma \in \mathbb{R}_+$ and $\mu, \nu \in \mathbb{R}^d$. Let $(A, X) \sim$ XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$. Define $\tilde{X}$ to be the transformed data after applying a graph convolution on $X$, i.e., $\tilde{X} = D^{-1}AX$. Then in the regime where $p, q = \Omega(\log^2 n / n)$, with probability at least $1 - 1/\mathrm{poly}(n)$ we have that
$$\mathbf{E}\tilde{X}_i = \begin{cases} \dfrac{p\mu + q\nu}{2(p+q)} \cdot o_n(1) & i \in C_0 \\[2mm] \dfrac{p\nu + q\mu}{2(p+q)} \cdot o_n(1) & i \in C_1 \end{cases}.$$
Hence, the distance between the means of the convolved data, given by $\frac{p-q}{2(p+q)}\|\mu - \nu\|_2 \cdot o_n(1)$, diminishes to 0 for $n \to \infty$.

Proof. Fix $\mu, \nu \in \mathbb{R}^d$ and define the following sets:
$$C_{-\mu} = \{i \mid \varepsilon_i = 0, \eta_i = 0\}, \quad C_{\mu} = \{i \mid \varepsilon_i = 0, \eta_i = 1\}, \quad C_{-\nu} = \{i \mid \varepsilon_i = 1, \eta_i = 0\}, \quad C_{\nu} = \{i \mid \varepsilon_i = 1, \eta_i = 1\}.$$
Denote $\tilde{X} = D^{-1}AX$ and note that for any $i \in [n]$, the row vector
$$\tilde{X}_i = \frac{1}{\deg(i)}\sum_{j \in [n]} a_{ij}X_j = \frac{1}{\deg(i)}\sum_{j \in [n]} a_{ij}(\mathbf{E}X_j + \sigma g_j)$$
$$= \frac{1}{\deg(i)}\Big[\mu\Big(\sum_{j \in C_{\mu}} a_{ij} - \sum_{j \in C_{-\mu}} a_{ij}\Big) + \nu\Big(\sum_{j \in C_{\nu}} a_{ij} - \sum_{j \in C_{-\nu}} a_{ij}\Big) + \sigma\sum_{j \in [n]} a_{ij}g_j\Big],$$
where we used the fact that $X_j = (2\eta_j - 1)((1 - \varepsilon_j)\mu + \varepsilon_j\nu) + \sigma g_j$ for a set of iid Gaussian random vectors $g_j \sim N(0, I_d)$. Note that since $\varepsilon_i, \eta_i$ are Bernoulli random variables, using the Chernoff bound (Vershynin, 2018, Section 2), we have that with probability at least $1 - 1/\mathrm{poly}(n)$,
$$|C_{-\mu}| = |C_{\mu}| = |C_{-\nu}| = |C_{\nu}| = \frac{n}{4}(1 \pm o_n(1)).$$
We now use an argument similar to Proposition A.1 to obtain that for any $c > 0$, with probability at least $1 - O(n^{-c})$, the following holds for all $i \in [n]$:
$$\frac{1}{\deg(i)}\Big(\sum_{j \in C_{\mu}} a_{ij} - \sum_{j \in C_{-\mu}} a_{ij}\Big) = O\Big(\frac{(1 - \varepsilon_i)p + \varepsilon_i q}{2(p+q)}\sqrt{\frac{c}{\log n}}\Big),$$
$$\frac{1}{\deg(i)}\Big(\sum_{j \in C_{\nu}} a_{ij} - \sum_{j \in C_{-\nu}} a_{ij}\Big) = O\Big(\frac{\varepsilon_i p + (1 - \varepsilon_i)q}{2(p+q)}\sqrt{\frac{c}{\log n}}\Big).$$
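Hence, we have that for all $i \in [n]$,
$$\mathbf{E}\tilde{X}_i = \Big(\frac{(1 - \varepsilon_i)p + \varepsilon_i q}{2(p+q)}\mu + \frac{\varepsilon_i p + (1 - \varepsilon_i)q}{2(p+q)}\nu\Big) \cdot O\Big(\sqrt{\frac{c}{\log n}}\Big) = \begin{cases} \dfrac{p\mu + q\nu}{2(p+q)} \cdot o_n(1) & i \in C_0 \\[2mm] \dfrac{p\nu + q\mu}{2(p+q)} \cdot o_n(1) & i \in C_1 \end{cases}.$$
Using the above result, we obtain the distance between the means, which is of the order $o_n(\gamma)$ and thus, diminishes to 0 as $n \to \infty$.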
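The collapse described in Proposition A.8 is easy to observe on a small synthetic sample. The sketch below draws a graph and features in the style of XOR-CSBM, applies the row-normalized convolution $D^{-1}AX$ directly to the raw features, and compares the average feature norm before and after. The sampler is a simplified stand-in for the data model: the class labels and signs are assigned deterministically so that the four mixture components are exactly balanced, and the parameter values and function names are assumptions for illustration.

```python
import math
import random


def sample_xor_csbm(n, mu, nu, sigma, p, q, seed=0):
    """Sample (adjacency with self-loops, features, class labels)."""
    rng = random.Random(seed)
    eps = [i % 2 for i in range(n)]         # class label of node i
    eta = [(i // 2) % 2 for i in range(n)]  # selects the +/- component
    features = []
    for i in range(n):
        mean = nu if eps[i] else mu
        sign = 2 * eta[i] - 1
        features.append([sign * m + sigma * rng.gauss(0.0, 1.0) for m in mean])
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        adjacency[i][i] = 1  # self-loop
        for j in range(i + 1, n):
            prob = p if eps[i] == eps[j] else q
            if rng.random() < prob:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency, features, eps


def graph_convolution(adjacency, features):
    """Row-normalized convolution D^{-1} A X."""
    d = len(features[0])
    output = []
    for row in adjacency:
        degree = sum(row)
        aggregate = [0.0] * d
        for j, a in enumerate(row):
            if a:
                for c in range(d):
                    aggregate[c] += features[j][c]
        output.append([value / degree for value in aggregate])
    return output


def mean_norm(rows):
    return sum(math.sqrt(sum(v * v for v in r)) for r in rows) / len(rows)


A, X, labels = sample_xor_csbm(n=400, mu=[5.0, 0.0], nu=[0.0, 5.0],
                               sigma=1.0, p=0.5, q=0.1)
X_conv = graph_convolution(A, X)
collapse_ratio = mean_norm(X_conv) / mean_norm(X)
print(f"average norm after / before convolution: {collapse_ratio:.3f}")
```

Because each class contains both $+\mu$ and $-\mu$ (respectively $\pm\nu$) components in equal proportion, averaging over neighbours cancels the signal, so the ratio is far below 1 even though the original means are well separated.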

A.6 PROOF OF THEOREM 2 PART ONE

We begin by computing the output of the network when one graph convolution is applied at any layer other than the first.

Lemma A.9. Let $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$ for any $x \in \mathbb{R}^d$. Consider the two-layer and three-layer networks in Proposition A.7 where the weight parameter of the last layer, $W^{(L)}$, is scaled by a factor of $\xi = \mathrm{sgn}(p - q)$. If a graph convolution is added to these networks in either the second or the third layer, then for a sample $(A, X) \sim$ XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$, the output of the networks for a point $i \in [n]$ is
$$\hat{y}_i = \phi(f_i^{(L)}(X)) = \phi\Big(\frac{R\,\mathrm{sgn}(p - q)}{\deg(i)}\sum_{j \in [n]} a_{ij}h(X_j)\Big).$$

Proof. The networks with scaled parameters are given as follows.

1. For the two-layer network, $W^{(1)} = R\,(\hat{\mu}\ \ {-\hat{\mu}}\ \ \hat{\nu}\ \ {-\hat{\nu}})$, $W^{(2)} = \xi\,(-1\ \ {-1}\ \ 1\ \ 1)^\top$.
2. For the three-layer network, $W^{(1)} = R\,(\hat{\mu}\ \ {-\hat{\mu}}\ \ \hat{\nu}\ \ {-\hat{\nu}})$, $W^{(2)} = \begin{pmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{pmatrix}$, $W^{(3)} = \xi\begin{pmatrix} 1 \\ -1 \end{pmatrix}$.

When a graph convolution is applied at the second layer of this two-layer MLP, the output of the last layer for data $(A, X)$ is $f^{(2)}(X) = D^{-1}A[XW^{(1)}]_+ W^{(2)}$. Then we have
$$f_i^{(2)}(X) = \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}(|\langle X_j, \hat{\nu} \rangle| - |\langle X_j, \hat{\mu} \rangle|) = \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}h(X_j).$$
Similarly, when the graph convolution is applied at the second layer of the three-layer MLP, the output is $f^{(3)}(X) = [D^{-1}A[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)}$, and we have
$$f_i^{(3)}(X) = \frac{R\xi}{\deg(i)}\Big(\Big[\sum_{j \in [n]} a_{ij}h(X_j)\Big]_+ - \Big[-\sum_{j \in [n]} a_{ij}h(X_j)\Big]_+\Big) = \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}h(X_j).$$
Finally, when the graph convolution is applied at the third layer of the three-layer MLP, the output is $f^{(3)}(X) = D^{-1}A[[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)}$, and we have
$$f_i^{(3)}(X) = \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}\big([|\langle X_j, \hat{\nu} \rangle| - |\langle X_j, \hat{\mu} \rangle|]_+ - [|\langle X_j, \hat{\mu} \rangle| - |\langle X_j, \hat{\nu} \rangle|]_+\big)$$
$$= \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}(|\langle X_j, \hat{\nu} \rangle| - |\langle X_j, \hat{\mu} \rangle|) = \frac{R\xi}{\deg(i)}\sum_{j \in [n]} a_{ij}h(X_j).$$
Therefore, in all cases where we have a single graph convolution, the output of the last layer is
$$f_i^{(L)}(X) = \frac{R\,\mathrm{sgn}(p - q)}{\deg(i)}\sum_{j \in [n]} a_{ij}h(X_j),$$
where $L \in \{2, 3\}$ is the number of layers.
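The placement invariance established in Lemma A.9 can be verified numerically on a toy graph. The sketch below builds the three single-convolution networks with plain-Python matrix products and checks that each matches $\frac{R\xi}{\deg(i)}\sum_j a_{ij}h(X_j)$. The graph size, the features, the choice $p < q$ (so that $\xi = -1$), and the helper names are assumptions for illustration.

```python
import random


def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(M):
    return [[max(v, 0.0) for v in row] for row in M]


def row_normalize(adjacency):
    """D^{-1} A for an adjacency matrix that includes self-loops."""
    return [[a / sum(row) for a in row] for row in adjacency]


rng = random.Random(0)
n, R, p, q = 8, 2.0, 0.2, 0.7
xi = 1.0 if p > q else -1.0  # sgn(p - q)
labels = [i % 2 for i in range(n)]

A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 1.0
    for j in range(i + 1, n):
        if rng.random() < (p if labels[i] == labels[j] else q):
            A[i][j] = A[j][i] = 1.0
P = row_normalize(A)

X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
mu_hat, nu_hat = [1.0, 0.0], [0.0, 1.0]

# Weights from Proposition A.7 with the last layer scaled by xi.
W1 = [[R * mu_hat[k], -R * mu_hat[k], R * nu_hat[k], -R * nu_hat[k]] for k in range(2)]
W2_two = [[-xi], [-xi], [xi], [xi]]
W2_three = [[-1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, -1.0]]
W3 = [[xi], [-xi]]

hidden = relu(matmul(X, W1))
two_layer_gc2 = matmul(matmul(P, hidden), W2_two)
three_layer_gc2 = matmul(relu(matmul(matmul(P, hidden), W2_three)), W3)
three_layer_gc3 = matmul(matmul(P, relu(matmul(hidden, W2_three))), W3)

# Closed form from Lemma A.9.
h = [abs(sum(a * b for a, b in zip(x, nu_hat))) - abs(sum(a * b for a, b in zip(x, mu_hat)))
     for x in X]
reference = [[R * xi * sum(P[i][j] * h[j] for j in range(n))] for i in range(n)]
```

All three outputs agree with `reference` up to floating-point error, which is why the analysis that follows treats the single-convolution networks jointly.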
Theorem (Restatement of part one of Theorem 2). Let $(A, X) \sim$ XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$. Assume that $p, q = \Omega(\log^2 n / n)$, and that $\Gamma(p, q)\zeta(\gamma/2\sigma) = \omega\Big(\sqrt{\frac{\log n}{n(p+q)}}\Big)$. Then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks equipped with a graph convolution in the second or the third layer perfectly classify the data, and obtain the following loss:
$$\ell_\theta(A, X) = C'\exp\Big(-C\sigma R\,\Gamma(p, q)\zeta(\gamma/2\sigma)\big(1 \pm \sqrt{c/\log n}\big)\Big),$$
where $C > 0$ and $C' \in [1/2, 1]$ are constants and $R$ is the constraint from Eq. (1).

Proof. First, we analyze the output conditioned on the adjacency matrix $A$. Note that $\frac{1}{R}f_i^{(L)}(X)$ in Lemma A.9 is Lipschitz with constant $\sqrt{2/\deg(i)}$, and $h(X_j)$ are mutually independent for $j \in [n]$. Therefore, by Gaussian concentration (Vershynin, 2018, Theorem 5.2.2) we have that for a fixed $i \in [n]$,
$$\Pr\Big[\frac{1}{R}|f_i^{(L)}(X) - \mathbf{E}[f_i^{(L)}(X)]| > \delta \ \Big|\ A\Big] \le 2\exp\Big(-\frac{\delta^2\deg(i)}{4\sigma^2}\Big).$$
We refer to the event from Proposition A.1 as $B$ and define $Q(t)$ to be the event that $|f_i^{(L)}(X) - \mathbf{E}[f_i^{(L)}(X)]| \le t$ for all $i \in [n]$. Then we can write
$$\Pr[Q(t)^c] = \Pr[Q(t)^c \cap B] + \Pr[Q(t)^c \cap B^c] \le 2n\exp\Big(-\frac{t^2n(p+q)}{8\sigma^2}\Big) + \Pr[B^c] \le 2n\exp\Big(-\frac{t^2n(p+q)}{8\sigma^2}\Big) + 2n^{-c}.$$
Let $\xi = \mathrm{sgn}(p - q)$ and note that $\frac{\xi(p-q)}{p+q} = \frac{|p-q|}{p+q} = \Gamma(p, q)$. We now choose $t = 2\sigma\sqrt{\frac{2(c+1)\log n}{n(p+q)}}$ to obtain that with probability at least $1 - 4n^{-c}$, the following holds for all $i \in [n]$:
$$\frac{1}{\sigma}f_i^{(L)}(X) = \mathbf{E}[f_i^{(L)}(X)]/\sigma \pm O\Big(R\sqrt{\frac{c\log n}{n(p+q)}}\Big)$$
$$= \frac{R\xi}{\sigma\deg(i)}\sum_{j \in [n]} a_{ij}\mathbf{E}h(X_j) \pm O\Big(R\sqrt{\frac{c\log n}{n(p+q)}}\Big)$$
$$= \frac{\sqrt{2}R\xi\zeta(\gamma/2\sigma)}{\deg(i)}\Big(\sum_{j \in C_1} a_{ij} - \sum_{j \in C_0} a_{ij}\Big) \pm O\Big(R\sqrt{\frac{c\log n}{n(p+q)}}\Big) \quad \text{(Lemma A.4)}$$
$$= \sqrt{2}(2\varepsilon_i - 1)R\,\Gamma(p, q)\zeta(\gamma/2\sigma)(1 \pm o_n(1)) \pm O\Big(R\sqrt{\frac{c\log n}{n(p+q)}}\Big) \quad \text{(Proposition A.1)}$$
$$= \sqrt{2}(2\varepsilon_i - 1)R\,\Gamma(p, q)\zeta(\gamma/2\sigma)(1 \pm o_n(1)),$$
where in the last equation we used the assumption that $\Gamma(p, q)\zeta(\gamma/2\sigma) = \omega\Big(\sqrt{\frac{\log n}{n(p+q)}}\Big)$. Overall, we obtain that with probability at least $1 - 4n^{-c}$,
$$f_i^{(L)}(X) = (2\varepsilon_i - 1)C\sigma R\,\Gamma(p, q)\zeta(\gamma/2\sigma)(1 \pm o_n(1)) \quad \text{for all } i \in [n].$$
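Recall that the loss for node $i$ is given by
$$\ell_\theta^{(i)}(A, X) = \log\big(1 + e^{(1 - 2\varepsilon_i)f_i^{(L)}(X)}\big) = \log\big(1 + \exp(-C\sigma R\,\Gamma(p, q)\zeta(\gamma/2\sigma)(1 \pm o_n(1)))\big).$$
The total loss is given by $\frac{1}{n}\sum_{i \in [n]} \ell_\theta^{(i)}(A, X)$. Next, Fact A.5 implies that for any $t < 0$, $e^t/2 \le \log(1 + e^t) \le e^t$, hence, we have for some $C' \in [1/2, 1]$ that
$$\ell_\theta(A, X) = C'\exp(-C\sigma R\,\Gamma(p, q)\zeta(\gamma/2\sigma)(1 \pm o_n(1))).$$

A.7 PROOF OF THEOREM 2 PART TWO

We begin by computing the output of the networks constructed in Proposition A.7 when two graph convolutions are placed among any layers in the networks other than the first.

Lemma A.10. Let $h(x) : \mathbb{R}^d \to \mathbb{R} = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$. Consider the networks constructed in Proposition A.7 equipped with two graph convolutions in the following combinations:

1. Both convolutions in the second layer of the two-layer network.
2. Both convolutions in the second layer of the three-layer network.
3. One convolution in the second layer and one in the third layer of the three-layer network.
4. Both convolutions in the third layer of the three-layer network.

Then for a sample $(A, X) \sim$ XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$, the output of the networks in all the above described combinations for a point $i \in [n]$ is
$$\hat{y}_i = \phi(f_i^{(L)}(X)) = \phi\Big(\frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}h(X_j)\Big), \quad \text{where } \tau_{ij} = \sum_{k \in [n]} \frac{a_{ik}a_{jk}}{\deg(k)}.$$

Proof. For the two-layer network, the output of the last layer when both convolutions are at the second layer is given by $f^{(2)}(X) = (D^{-1}A)^2[XW^{(1)}]_+ W^{(2)}$. Then we have
$$f_i^{(2)}(X) = \frac{R}{\deg(i)}\sum_{j \in [n]}\sum_{k \in [n]} \frac{a_{ij}a_{jk}}{\deg(j)}h(X_k) = \frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}h(X_j).$$
Next, for the three-layer network, the output of the last layer when both convolutions are at the second layer is given by $f^{(3)}(X) = [(D^{-1}A)^2[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)}$, hence, we have
$$f_i^{(3)}(X) = \frac{R}{\deg(i)}\Big(\Big[\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}h(X_k)\Big]_+ - \Big[-\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}h(X_k)\Big]_+\Big)$$
$$= \frac{R}{\deg(i)}\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}h(X_k) \quad \text{(using } [t]_+ - [-t]_+ = t \text{ for any } t \in \mathbb{R}\text{)}$$
$$= \frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}h(X_j).$$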
Similarly, the output of the last layer when one convolution is at the second layer and the other one is at the third layer is given by $f^{(3)}(X) = D^{-1}A[D^{-1}A[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)}$, hence, we have
$$f_i^{(3)}(X) = \frac{R}{\deg(i)}\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\Big(\Big[\sum_{k \in [n]} a_{jk}h(X_k)\Big]_+ - \Big[-\sum_{k \in [n]} a_{jk}h(X_k)\Big]_+\Big)$$
$$= \frac{R}{\deg(i)}\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}h(X_k) \quad \text{(using } [t]_+ - [-t]_+ = t \text{ for any } t \in \mathbb{R}\text{)}$$
$$= \frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}h(X_j).$$
Finally, the output of the last layer when both convolutions are at the third layer is given by $f^{(3)}(X) = (D^{-1}A)^2[[XW^{(1)}]_+ W^{(2)}]_+ W^{(3)}$, hence, we have
$$f_i^{(3)}(X) = \frac{R}{\deg(i)}\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}\big([h(X_k)]_+ - [-h(X_k)]_+\big) = \frac{R}{\deg(i)}\sum_{j \in [n]} \frac{a_{ij}}{\deg(j)}\sum_{k \in [n]} a_{jk}h(X_k) = \frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}h(X_j).$$
Hence, the output for two graph convolutions is the same for any combination of the placement of convolutions, as long as no convolution is placed at the first layer.

We are now ready to prove the positive result for two convolutions.

Theorem (Restatement of part two of Theorem 2). Let $(A, X) \sim$ XOR-CSBM$(n, d, \mu, \nu, \sigma^2, p, q)$. Assume that $p, q = \Omega(\log n / \sqrt{n})$ and $\Gamma(p, q)^2\zeta(\gamma/2\sigma) = \omega\Big(\sqrt{\frac{\log n}{n}}\Big)$. Then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks with any combination of two graph convolutions in the second and/or the third layers perfectly classify the data, and obtain the following loss:
$$\ell_\theta(A, X) = C'\exp\Big(-C\sigma R\,\Gamma(p, q)^2\zeta(\gamma/2\sigma)\big(1 \pm \sqrt{c/\log n}\big)\Big),$$
where $C > 0$ and $C' \in [1/2, 1]$ are constants and $R$ is the constraint from Eq. (1).

Proof. The proof strategy is similar to that of part one of the theorem. Note that $\frac{1}{R}f_i^{(L)}(X)$ in Lemma A.10 is Lipschitz with constant
$$\Big\|\frac{1}{R}f_i^{(L)}(X)\Big\|_{\mathrm{Lip}} \le \sqrt{\frac{2}{\deg(i)^2}\sum_{j \in [n]} \tau_{ij}^2}.$$
Since $h(X_j)$ are mutually independent for $j \in [n]$, by Gaussian concentration (Vershynin, 2018, Theorem 5.2.2) we have that for a fixed $i \in [n]$,
$$\Pr\Big[\frac{1}{R}|f_i^{(L)}(X) - \mathbf{E}[f_i^{(L)}(X)]| > \delta \ \Big|\ A\Big] \le 2\exp\Big(-\frac{\delta^2\deg(i)^2}{4\sigma^2\sum_{j \in [n]} \tau_{ij}^2}\Big).$$
We refer to the event from Proposition A.2 as $B$.
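Note that since the graph density assumption is stronger than $\Omega(\log^2 n / n)$, Proposition A.1 trivially holds in this regime, hence, the degrees also concentrate strongly around $\Delta = \frac{n}{2}(p + q)$. On event $B$, we have that
$$\sum_{j \in [n]} \tau_{ij}^2 = \sum_{j \in [n]}\Big(\sum_{k \in [n]} \frac{a_{ik}a_{jk}}{\deg(k)}\Big)^2 = \frac{1}{\Delta^2}\sum_{j \in [n]}\Big(\sum_{k \in [n]} a_{ik}a_{jk}\Big)^2(1 \pm o_n(1))$$
$$= \frac{1}{\Delta^2}\Big(\sum_{j \sim i} |N_i \cap N_j|^2 + \sum_{j \nsim i} |N_i \cap N_j|^2\Big)(1 \pm o_n(1))$$
$$= \frac{1}{\Delta^2}\Big(\sum_{j \sim i} \Big(\frac{n}{2}(p^2 + q^2)\Big)^2 + \sum_{j \nsim i} (npq)^2\Big)(1 \pm o_n(1)) \quad \text{(using Proposition A.2)}$$
$$= \frac{n}{2\Delta^2}\Big(\frac{n^2}{4}(p^2 + q^2)^2 + n^2p^2q^2\Big)(1 \pm o_n(1)) = \frac{n^3}{8\Delta^2}\big(p^4 + q^4 + 6p^2q^2\big)(1 \pm o_n(1)).$$
Therefore, under this event we have that
$$\Big\|\frac{1}{R}f_i^{(L)}(X)\Big\|_{\mathrm{Lip}} \le \sqrt{\frac{2}{\deg(i)^2}\sum_{j \in [n]} \tau_{ij}^2} = \sqrt{\frac{4(p^4 + q^4 + 6p^2q^2)}{n(p+q)^4}}\,(1 \pm o_n(1)).$$
Note that $K = K(p, q) = \frac{4(p^4 + q^4 + 6p^2q^2)}{(p+q)^4} \le 4$. We now define $Q(t)$ to be the event that $|f_i^{(L)}(X) - \mathbf{E}[f_i^{(L)}(X)]| \le t$ for all $i \in [n]$. Then we have
$$\Pr[Q(t)^c] = \Pr[Q(t)^c \cap B] + \Pr[Q(t)^c \cap B^c] \le 2n\exp\Big(-\frac{nt^2}{2K\sigma^2}\Big) + 2n^{-c}.$$
We now choose $t = \sigma\sqrt{\frac{2K(c+1)\log n}{n}}$ to obtain that with probability at least $1 - 4n^{-c}$, the following holds for all $i \in [n]$:
$$f_i^{(L)}(X) = \mathbf{E}[f_i^{(L)}(X)] \pm O\Big(R\sigma\sqrt{\frac{\log n}{n}}\Big) = \frac{R}{\deg(i)}\sum_{j \in [n]} \tau_{ij}\mathbf{E}h(X_j) \pm O\Big(R\sigma\sqrt{\frac{\log n}{n}}\Big).$$
Note that we have
$$\frac{1}{\deg(i)}\sum_{j \in [n]} \tau_{ij}\mathbf{E}h(X_j) = \frac{\sqrt{2}\sigma\zeta(\gamma/2\sigma)}{\deg(i)}\Big(\sum_{j \in C_1} \tau_{ij} - \sum_{j \in C_0} \tau_{ij}\Big) \quad \text{(using Lemma A.4)}$$
$$= \frac{\sqrt{2}\sigma\zeta(\gamma/2\sigma)}{\deg(i)}\Big(\sum_{j \in C_1}\sum_{k \in [n]} \frac{a_{ik}a_{jk}}{\deg(k)} - \sum_{j \in C_0}\sum_{k \in [n]} \frac{a_{ik}a_{jk}}{\deg(k)}\Big)$$
$$= \frac{\sqrt{2}\sigma\zeta(\gamma/2\sigma)}{\deg(i)}\sum_{k \in [n]} \frac{a_{ik}}{\deg(k)}\Big(\sum_{j \in C_1} a_{jk} - \sum_{j \in C_0} a_{jk}\Big)$$
$$= \frac{\sqrt{2}\sigma\zeta(\gamma/2\sigma)}{\deg(i)} \cdot \frac{p-q}{p+q}\Big(\sum_{k \in C_1} a_{ik} - \sum_{k \in C_0} a_{ik}\Big)(1 + o_n(1))$$
$$= (2\varepsilon_i - 1)\sqrt{2}\sigma\zeta(\gamma/2\sigma)\,\Gamma(p, q)^2(1 + o_n(1)).$$
In the last two equations above, we used Proposition A.1 to replace, respectively,
$$\frac{1}{\deg(k)}\Big(\sum_{j \in C_1} a_{kj} - \sum_{j \in C_0} a_{kj}\Big) = (2\varepsilon_k - 1)\frac{p-q}{p+q}(1 + o_n(1)),$$
$$\frac{1}{\deg(i)}\Big(\sum_{k \in C_1} a_{ik} - \sum_{k \in C_0} a_{ik}\Big) = (2\varepsilon_i - 1)\frac{p-q}{p+q}(1 + o_n(1)),$$
together with $\big(\frac{p-q}{p+q}\big)^2 = \Gamma(p, q)^2$.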
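Therefore, we obtain that
$$f_i^{(L)}(X) = (2\varepsilon_i - 1)C\sigma R\,\zeta(\gamma/2\sigma)\Gamma(p, q)^2(1 + o_n(1)) \pm O\Big(R\sigma\sqrt{\frac{\log n}{n}}\Big) = (2\varepsilon_i - 1)C\sigma R\,\zeta(\gamma/2\sigma)\Gamma(p, q)^2(1 + o_n(1)),$$
where in the last equation we used $\Gamma(p, q)^2\zeta(\gamma/2\sigma) = \omega\Big(\sqrt{\frac{\log n}{n}}\Big)$. Recall that the loss for node $i$ is given by
$$\ell_\theta^{(i)}(A, X) = \log\big(1 + \exp((1 - 2\varepsilon_i)f_i^{(L)}(X))\big) = \log\big(1 + \exp(-C\sigma R\,\zeta(\gamma/2\sigma)\Gamma(p, q)^2(1 \pm o_n(1)))\big).$$
The total loss is $\frac{1}{n}\sum_{i \in [n]} \ell_\theta^{(i)}(A, X)$. Now, using Fact A.5 we have for some $C' \in [1/2, 1]$ that
$$\ell_\theta(A, X) = C'\exp\big(-C\sigma R\,\zeta(\gamma/2\sigma)\Gamma(p, q)^2(1 \pm o_n(1))\big).$$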
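The Lipschitz estimate used in the proof above can be compared against a sampled graph. The sketch below draws a two-block stochastic block model with self-loops, computes $\frac{2}{\deg(i)^2}\sum_j \tau_{ij}^2$ for a handful of nodes, and divides by the predicted value $K(p, q)/n$. At the moderate graph size used here the agreement is only approximate, since the statement is asymptotic. The graph size, edge probabilities, seed, and function names are assumptions for illustration.

```python
import random


def lipschitz_constant_K(p, q):
    """K(p, q) = 4 (p^4 + q^4 + 6 p^2 q^2) / (p + q)^4."""
    return 4.0 * (p ** 4 + q ** 4 + 6.0 * p * p * q * q) / (p + q) ** 4


def empirical_to_predicted_ratio(n=300, p=0.5, q=0.2, num_nodes=10, seed=0):
    """Average of (2 / deg(i)^2) * sum_j tau_ij^2 divided by K(p, q) / n."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        adjacency[i][i] = 1  # self-loop
        for j in range(i + 1, n):
            if rng.random() < (p if labels[i] == labels[j] else q):
                adjacency[i][j] = adjacency[j][i] = 1
    degree = [sum(row) for row in adjacency]
    predicted = lipschitz_constant_K(p, q) / n

    ratios = []
    for i in range(num_nodes):
        neighbours = [k for k in range(n) if adjacency[i][k]]
        tau_squared_sum = 0.0
        for j in range(n):
            # tau_ij = sum_k a_ik a_jk / deg(k), restricted to neighbours of i
            tau_ij = sum(adjacency[j][k] / degree[k] for k in neighbours)
            tau_squared_sum += tau_ij ** 2
        empirical = 2.0 * tau_squared_sum / degree[i] ** 2
        ratios.append(empirical / predicted)
    return sum(ratios) / len(ratios)


average_ratio = empirical_to_predicted_ratio()
print(f"K(0.5, 0.2) = {lipschitz_constant_K(0.5, 0.2):.4f}")
print(f"empirical / predicted squared Lipschitz constant: {average_ratio:.3f}")
```

A ratio near 1 indicates that the squared Lipschitz constant scales like $K(p, q)/n$, in contrast with the $2/\deg(i)$ scaling obtained for a single convolution.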

A.8 ANALYSIS FOR A SIMPLER CASE

Although Theorem 2 encapsulates the general condition for networks with up to two graph convolutions to achieve perfect classification, let us discuss the meaning of the theorem in a simplified setting where $\Gamma(p, q) = \Omega(1)$. In this regime, one can analyze two cases for both parts of the theorem, namely $\gamma = \Omega(\sigma)$ and $\gamma = o(\sigma)$, using parts two and three of Lemma A.4, respectively. Combining both cases, we find that the theorems imply perfect classification if the following holds:
$$\gamma = \|\mu - \nu\|_2 = \begin{cases} \Omega\Big(\dfrac{\sigma\sqrt{\log n}}{\sqrt[4]{n(p+q)}}\Big) & \text{for networks with one graph convolution,} \\[3mm] \Omega\Big(\dfrac{\sigma\sqrt{\log n}}{\sqrt[4]{n}}\Big) & \text{for networks with two graph convolutions.} \end{cases}$$
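To give a sense of scale, the thresholds above can be evaluated for concrete parameters. The sketch below drops all hidden constants, so the numbers indicate relative order only, not exact phase-transition points. The values $n = 400$ and $\sigma^2 = 1/d$ with $d = 4$ follow the synthetic setup of Figure 2, the MLP threshold $\sigma\sqrt{\log n}$ follows the figure legends, and the pair $(p, q) = (0.8, 0.1)$ is borrowed from Figure 6 as an assumed example; the function names are hypothetical.

```python
import math


def mlp_threshold(n, sigma):
    """sigma * sqrt(log n): scale of the threshold without graph convolutions."""
    return sigma * math.sqrt(math.log(n))


def one_gc_threshold(n, sigma, p, q):
    """sigma * sqrt(log n) / (n (p + q))^(1/4) for one graph convolution."""
    return sigma * math.sqrt(math.log(n)) / (n * (p + q)) ** 0.25


def two_gc_threshold(n, sigma):
    """sigma * sqrt(log n) / n^(1/4) for two graph convolutions."""
    return sigma * math.sqrt(math.log(n)) / n ** 0.25


n, d = 400, 4
sigma = math.sqrt(1.0 / d)
p, q = 0.8, 0.1

print(f"MLP:    {mlp_threshold(n, sigma):.3f}")
print(f"one GC: {one_gc_threshold(n, sigma, p, q):.3f}")
print(f"two GC: {two_gc_threshold(n, sigma):.3f}")
```

Since $p + q \le 2$ and the expected degree $\frac{n}{2}(p + q)$ is at most $n$, the two-convolution threshold is never much larger than the one-convolution threshold, and it is strictly smaller whenever $p + q < 1$.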

B ADDITIONAL EXPERIMENTS

For all synthetic and real-data experiments, we used PyTorch Geometric (Fey & Lenssen, 2019), using public splits for the real datasets. The models were trained on an Nvidia Titan Xp GPU, using the Adam optimizer with learning rate $10^{-3}$, weight decay $10^{-5}$, and 50 to 500 epochs varying among the datasets.

B.1 SYNTHETIC DATA

In this section we show additional results on the synthetic data. First, we show that placing a graph convolution in the first layer makes the classification task difficult since the means of the convolved data collapse towards 0. This is shown in Fig. 4. Next, we show experiments for two sets of values of p < q to demonstrate that graph convolutions also work in this setting. In Figs. 5a and 5b we have Γ(p, q) ≈ 0.82, while in Figs. 5c and 5d we have a lower signal in the graph, Γ(p, q) ≈ 0.66. We observe that in the latter case there is a smaller gap between the performance of networks with one graph convolution and those with two graph convolutions. In comparison to Fig. 2, we observe similar trends about the performance of all the networks in different regimes of interest. In particular, networks with one graph convolution perform mutually similarly, and networks with two graph convolutions perform mutually similarly, as claimed in Theorem 2.

Finally, in Fig. 6, we show the trends for the accuracy of various networks with and without graph convolutions, for different values of Γ(p, q). For cases where Γ(p, q) is relatively larger, networks with graph convolutions perform much better than a standard MLP (see Figs. 6a to 6d), while for the cases where Γ(p, q) is much smaller, the networks with graph convolutions degrade in performance (see Figs. 6e to 6h). The intuition behind this behaviour is that a smaller value of Γ(p, q) represents more noise in the data, thus, networks with graph convolutions gather roughly an equivalent amount of information from nodes in both the classes, making the feature representations noisy.

(a) Two layers, (p, q) = (0.1, 0.01), Γ(p, q) ≈ 0.82. (b) Three layers, Γ(p, q) ≈ 0.82. (c) Two layers, (p, q) = (0.8, 0.1), Γ(p, q) ≈ 0.78. (d) Three layers, Γ(p, q) ≈ 0.78. (e) Two layers, (p, q) = (0.2, 0.1), Γ(p, q) ≈ 0.33. (f) Three layers, Γ(p, q) ≈ 0.33. (g) Two layers, (p, q) = (0.5, 0.4), Γ(p, q) ≈ 0.11. (h) Three layers, Γ(p, q) ≈ 0.11.

Figure 6: Test accuracy of various networks with and without graph convolutions (GCs) for various values of p and q, on the XOR-CSBM data model. Note that networks with graph convolutions degrade in performance as Γ(p, q) (attributed to the signal in the graph) decreases.

B.2 REAL-WORLD DATA

This section contains additional experiments on real-world data. In Fig. 7, we plot the accuracy of the networks measured on the three benchmark datasets, averaged across 50 different trials (random initialization of the network parameters). This corresponds to the plots in Fig. 3 that show the maximum accuracy across all trials. Next, we evaluate the performance of the original GCN normalization (Kipf & Welling, 2017), $D^{-1/2}AD^{-1/2}$ instead of $D^{-1}A$, and show that we observe the same trends about the number of convolutions and their placement. These results are shown in Figs. 8 and 9. Note the two general trends that are consistent: first, networks with two graph convolutions perform better than those with one graph convolution, and second, placing all graph convolutions in the first layer yields worse accuracy as compared to networks where the convolutions are placed in deeper layers. Similar to the results in the main paper, we observe that there are differences within the group of networks with the same number of convolutions, however, these differences are smaller in magnitude as compared to the difference between the two groups of networks, one with one graph convolution and the other with two graph convolutions. We note that in some cases, three-layer networks obtain a [...] $D^{-1/2}AD^{-1/2}$. A network with k layers and j 1 , . . . , j k convolutions in each of the layers is represented by the label kL-j 1 . . . j k .



Our analyses generalize to non-symmetric SBMs with more than two blocks. However, we focus on the binary symmetric case for the sake of simplicity in the presentation of our ideas.

We take µ and ν to be orthogonal and of the same magnitude to keep the calculations relatively simple, while clearly depicting the main ideas behind our results.

Our results rely on degree concentration for each node, hence, they readily generalize to other normalization methods like $D^{-1/2}AD^{-1/2}$ (see Appendix A for proofs).

A Bayes classifier makes the most probable prediction for a data point. Formally, such a classifier is of the form h * (x) = argmax b∈{0,1} Pr [y = b | x].

We note that although scaling the parameters of a network scales the output of the network, it does not affect the accuracy, which is determined by the sign of the outputs.

We defer to Appendix B.1 for experiments with networks having graph convolutions in the first layer.

Our proofs rely on degree concentration, and thus, generalize to the other type of normalization as well.



(a) Original node features at the first layer. (b) Feature representation after GC at the first layer. (c) Feature representation at the last layer. (d) Feature representation after GC at the last layer.

Figure 1: Placement of a graph convolution (GC) in the first layer versus the last layer for data sampled from the XOR-CSBM. For this figure we used 1000 nodes in each class and a randomly sampled stochastic block-model graph with p = 0.8 and q = 0.2.

Figure 2: Averaged test accuracy (over 50 trials) for various networks with and without graph convolutions on the XOR-CSBM data model with n = 400, d = 4 and $\sigma^2 = 1/d$. The x-axis denotes the ratio $K = \|\mu - \nu\|_2/\sigma$ on a logarithmic scale. The vertical lines indicate the classification thresholds mentioned in part two of Theorem 1 (red), and in Theorem 2 (violet and pink).

(a) Accuracy of various learning models on the CORA dataset. (b) Accuracy of various learning models on the Pubmed dataset. (c) Accuracy of various learning models on the CiteSeer dataset.

Figure 3: Maximum accuracy (percentage) over 50 trials for various networks. A network with k layers and j 1 , . . . , j k convolutions in each of the layers is represented by the label kL-j 1 . . . j k .

PROOF OF THEOREM 2 PART TWO

We begin by computing the output of the networks constructed in Proposition A.7 when two graph convolutions are placed among any layers of the networks other than the first. Lemma A.10. Let $h : \mathbb{R}^d \to \mathbb{R}$ be given by $h(x) = |\langle x, \nu \rangle| - |\langle x, \mu \rangle|$. Consider the networks constructed in Proposition A.7 equipped with two graph convolutions in the following combinations:

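1. Case $\gamma = \Omega(\sigma)$: Part two of Lemma A.4 implies that $\zeta(\gamma/\sigma) = \Omega(\gamma/\sigma)$. Hence, for one graph convolution, the condition $\Gamma(p, q)\,\zeta(\gamma/2\sigma) = \omega\big(\sqrt{\log n / (n(p+q))}\big)$ is satisfied when $\gamma/\sigma = \omega\big(\sqrt{\log n / (n(p+q))}\big)$, implying that $\gamma = \omega\big(\sigma \sqrt{\log n / (n(p+q))}\big)$. Similarly, for two graph convolutions, the condition $\Gamma(p, q)^2\,\zeta(\gamma/2\sigma) = \omega\big(\sqrt{\log n / n}\big)$ is satisfied when $\gamma = \omega\big(\sigma \sqrt{\log n / n}\big)$.
2. Case $\gamma = o(\sigma)$: Part three of Lemma A.4 implies that $\zeta(\gamma/\sigma) = \Omega(\gamma^2/\sigma^2)$. Hence, for one graph convolution, the condition $\Gamma(p, q)\,\zeta(\gamma/2\sigma) = \omega\big(\sqrt{\log n / (n(p+q))}\big)$ is satisfied when $\gamma^2/\sigma^2 = \omega\big(\sqrt{\log n / (n(p+q))}\big)$, implying that $\gamma = \omega\big(\sigma \sqrt[4]{\log n / (n(p+q))}\big)$. Similarly, for two graph convolutions, the condition $\Gamma(p, q)^2\,\zeta(\gamma/2\sigma) = \omega\big(\sqrt{\log n / n}\big)$ is satisfied when $\gamma = \omega\big(\sigma \sqrt[4]{\log n / n}\big)$.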

Figure 4: Comparing the accuracy and loss for various networks with and without graph convolutions, averaged over 50 trials. Networks with a graph convolution in the first layer (red and orange) fail to generalize even for a large distance between the means of the data. For this experiment, we set $n = 400$ and $d = 4$, with $\sigma^2 = 1/d$.

(a) Two-layer networks with (p, q) = (0.02, 0.2). (b) Three-layer networks with (p, q) = (0.02, 0.2). (c) Two-layer networks with (p, q) = (0.1, 0.5). (d) Three-layer networks with (p, q) = (0.1, 0.5).

Legend (all panels): $O(\sqrt{\log n})$: threshold for MLP; $O(\sqrt{\log n}/\deg^{1/4})$: threshold for 1 GC; $O(\sqrt{\log n}/n^{1/4})$: threshold for 2 GCs.

Figure 5: Averaged accuracy (over 50 trials) for various networks with and without graph convolutions on the XOR-CSBM data model with $n = 400$, $d = 4$ and $\sigma^2 = 1/d$ for $p < q$. The x-axis denotes the ratio $K = \|\mu - \nu\|_2/\sigma$ on a logarithmic scale. The vertical lines indicate the classification thresholds mentioned in part two of Theorem 1 (red), and in Theorem 2 (violet and pink).
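To get a feel for where the vertical threshold lines sit relative to each other, the sketch below evaluates the three scales from the figure legend for the settings used in this figure. It is a rough illustration under stated assumptions: all hidden constants are set to 1, the MLP scale is taken to be √(log n), and the expected degree of the balanced two-block SBM is approximated by n(p + q)/2. The function name `threshold_scales` is hypothetical.

```python
import math


def threshold_scales(n, p, q):
    """Orders of the thresholds on K, with all constants set to 1."""
    deg = n * (p + q) / 2.0           # approximate expected degree of a node
    base = math.sqrt(math.log(n))     # assumed sqrt(log n) scale for an MLP
    return {
        "mlp": base,
        "one_gc": base / deg ** 0.25,  # improvement by a factor deg^(1/4)
        "two_gc": base / n ** 0.25,    # improvement by a factor n^(1/4)
    }


for p, q in [(0.02, 0.2), (0.1, 0.5)]:
    scales = threshold_scales(400, p, q)
    print((p, q), {k: round(v, 3) for k, v in scales.items()})
```

Because the expected degree never exceeds n, the one-convolution scale always lies between the MLP scale and the two-convolution scale, and the gap between the last two narrows as the graph becomes denser.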

Figure 7: Averaged accuracy (percentage) over 50 trials for various networks. A network with $k$ layers and $j_1, \ldots, j_k$ convolutions in each of the layers is represented by the label $k$L-$j_1 \ldots j_k$.

Figure 9: Averaged accuracy (percentage) over 50 trials for various networks with the original GCN normalization $D^{-1/2} A D^{-1/2}$. A network with $k$ layers and $j_1, \ldots, j_k$ convolutions in each of the layers is represented by the label $k$L-$j_1 \ldots j_k$.

2.1 DESCRIPTION OF THE DATA MODEL

Let $n, d$ be positive integers, where $n$ denotes the number of data points (sample size) and $d$ denotes the dimension of the features. Define the Bernoulli random variables $\varepsilon_1, \ldots, \varepsilon_n \sim \mathrm{Ber}(1/2)$ and η


6 ACKNOWLEDGEMENTS

K.F. would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RGPIN-2019-04067, DGECR-2019-00147]. A.J. acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RGPIN-2020-04597, DGECR-2020-00199].

annex

worse accuracy, which we attribute to the fact that networks with three layers have many more parameters, and thus may either be overfitting or may not have converged within the number of epochs used.

Furthermore, we perform the same experiments on relatively larger datasets, OGBN-arXiv and OGBN-products (Hu et al., 2020), and observe similar trends. First, networks with a graph convolution perform better than a simple MLP, and two convolutions perform better than a single convolution. Second, three graph convolutions do not have a significant advantage over two graph convolutions. This observation agrees with Lemma A.3: computing ρ(2) and ρ(3) shows that they are of the same order in n, i.e., the variance reduction offered by two graph convolutions is of the same order as that offered by three graph convolutions for sufficiently dense graphs. We present the results of these experiments in Fig. 10.

(d) OGBN-products with three-layer networks.

Figure 10: Averaged accuracy (percentage) for OGB datasets arXiv and products, over 10 trials for various networks. A network with $k$ layers and $j_1, \ldots, j_k$ convolutions in each of the layers is represented by the label $k$L-$j_1 \ldots j_k$, while MLP3 denotes a three-layer MLP. Note that all models with one GC (in red) perform mutually similarly, while models with two GCs (in blue) and three GCs (in green) perform mutually similarly and better than models with one GC.

