IN-DISTRIBUTION AND OUT-OF-DISTRIBUTION GENERALIZATION FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) are models that allow learning with structured data of varying size. Despite their popularity, theoretical understanding of the generalization of GNNs is an under-explored topic. In this work, we expand the theoretical understanding of both in-distribution and out-of-distribution generalization of GNNs. Firstly, we improve upon the state-of-the-art PAC-Bayes (in-distribution) generalization bound primarily by reducing an exponential dependency on the node degree to a linear dependency. Secondly, utilizing tools from spectral graph theory, we prove some rigorous guarantees about the out-of-distribution (OOD) size generalization of GNNs, where graphs in the training set have different numbers of nodes and edges from those in the test set. To empirically verify our theoretical findings, we conduct experiments on both synthetic and real-world graph datasets. Our computed generalization gaps for the in-distribution case significantly improve the state-of-the-art PAC-Bayes results. For the OOD case, experiments on community classification tasks in large social networks show that GNNs achieve strong size generalization performance in cases guaranteed by our theory.

1. INTRODUCTION

Graph neural networks (GNNs), first proposed in Scarselli et al. (2008), generalize artificial neural networks from processing fixed-size data to processing arbitrary graph-structured or relational data, which can vary in terms of the number of nodes, the number of edges, and so on. GNNs and their modern variants (Bronstein et al., 2017; Battaglia et al., 2018) have achieved state-of-the-art results in a wide range of application domains, including social networks (Hamilton et al., 2017), material sciences (Xie & Grossman, 2018), drug discovery (Wieder et al., 2020), autonomous driving (Liang et al., 2020), quantum chemistry (Gilmer et al., 2020), and particle physics (Shlomi et al., 2020). Despite these empirical successes, the theoretical understanding of GNNs is somewhat limited. Existing works largely focus on analyzing the expressiveness of GNNs. In particular, Xu et al. (2018) show that GNNs are as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) in distinguishing graphs. Chen et al. (2019) further demonstrate an equivalence between graph isomorphism testing and universal approximation of permutation-invariant functions. Loukas (2019) shows that GNNs satisfying certain conditions (e.g., on depth and width) are Turing universal. Chen et al. (2020) and Xu et al. (2020a) respectively examine whether GNNs can count substructures and perform algorithmic reasoning. In the vein of statistical learning theory, generalization analyses for GNNs have been developed to bound the gap between training and testing errors using VC-dimension (Vapnik & Chervonenkis, 1971), Rademacher complexity (Bartlett & Mendelson, 2002), algorithmic stability (Bousquet & Elisseeff, 2002), and PAC-Bayes (McAllester, 2003) (a Bayesian extension of PAC learning (Valiant, 1984)).
Depending on whether the problem setup is in-distribution (ID) or out-of-distribution (OOD), i.e., whether the test data comes from the same distribution as the training data, we categorize the literature into two groups. ID Generalization Bounds. Scarselli et al. (2018) provide a VC-dimension-based generalization bound for GNNs, whereas Verma & Zhang (2019) present a stability-based generalization analysis for single-layer graph convolutional networks (GCNs) (Kipf & Welling, 2016). Both consider node classification and assume the node features are independent and identically distributed (IID), which conflicts with the common relational learning setup (e.g., semi-supervised node classification) at which GNNs excel. Relying on the neural tangent kernel (NTK) approach (Jacot et al., 2018), Du et al. (2019) characterize the generalization bound of infinite-width GNNs on graph classification. Garg et al. (2020) derive a Rademacher-complexity-based bound for message passing GNNs on graph classification. Lv (2021) establishes results for GCNs on node classification using Rademacher complexity as well. Based on PAC-Bayes, Liao et al. (2020) obtain a tighter bound for both GCNs and message passing GNNs on graph classification compared to (Garg et al., 2020; Scarselli et al., 2018). Subsequently, Ma et al. (2021) also leverage PAC-Bayes and show generalization guarantees of GNNs on subgroups of nodes for node classification. More recently, Li et al. (2022) study the effect of graph subsampling on the generalization of GCNs. OOD Generalization. Yehudai et al. (2021) study size generalization for GNNs - this is a specific OOD setting where training and testing graphs differ in the number of nodes and edges. They show a negative result: specific GNNs can perfectly fit the training graphs but fail on OOD testing ones. Baranwal et al.
(2021) consider specific graph generative models, i.e., the contextual stochastic block model (CSBM) (Deshpande et al., 2018), where the CSBMs during training and testing have the same means but different numbers of nodes and different intra- and inter-class edge probabilities. They present generalization guarantees for single-layer GCNs on binary node classification tasks. Later, Maskey et al. (2022) assume yet another class of graph generative models, i.e., graphons, where the kernel is shared across training and testing but the number of nodes and edges may vary. They obtain generalization bounds for message passing GNNs on graph classification and regression that depend on the Minkowski dimension of the node feature space. Relying on a connection between over-parameterized networks and the neural tangent kernel, Xu et al. (2020b) find that task-specific architecture/feature designs help GNNs extrapolate to OOD algorithmic tasks. Wu et al. (2022a) propose an explore-to-extrapolate risk minimization framework, for which the solution is proven to provide an optimal OOD model under the invariance and heterogeneity assumptions. Yang et al. (2022) propose a two-stage model that both infers the latent environment and makes predictions to generalize to OOD data. Empirical studies suggest it works well on real-world molecule datasets. Wu et al. (2022b) study a new objective that can learn invariant and causal graph features that generalize well to OOD data empirically. All of the above works follow the spirit of invariant risk minimization (Arjovsky et al., 2019) and focus on designing new learning objectives. Instead, we provide generalization bound analysis from the traditional statistical learning theory perspective. Our Contributions. In this paper, we study both in-distribution and out-of-distribution generalization for GNNs.
For in-distribution graph classification tasks, we significantly improve the previous state-of-the-art PAC-Bayes results of (Liao et al., 2020) by decreasing an exponential dependency on the maximum node degree to a linear dependency. For OOD node classification tasks, we do not assume any known graph generative model, in sharp contrast to existing work. We instead assume GNNs are trained and tested on subgraphs that are sampled via random walks from a single large underlying graph, as an efficient means to generate connected subgraphs. We identify interesting cases where a graph classification task is theoretically guaranteed to perform well at size generalization, and we derive generalization bounds. We validate our theoretical results by conducting experiments on synthetic graphs, and we also explore size generalization on a collection of real-world social network datasets. In the in-distribution case, we observe an improvement of several orders of magnitude in numerical calculations of the generalization bound. In the out-of-distribution case, we validate that, in cases where the theory guarantees that size generalization works well, the prediction accuracy on large subgraphs is always comparable to the accuracy on small subgraphs, and in many cases is actually better.

2. BACKGROUND INFORMATION

A graph G is an abstract mathematical model for pairwise relationships, with a set of vertices V and a set of edges E ⊆ V × V. Two vertices v_1, v_2 are said to be connected if (v_1, v_2) ∈ E. For a given graph G ∈ G we also denote its vertices by V(G) and its edges by E(G). Unless otherwise specified, we assume graphs are undirected and without multi-edges. In machine learning, a graph (or graph-structured data) typically comes with a set of node features. Common graph-based machine learning tasks include node classification (or regression) and graph classification (or regression). We use the following notation.
• Graph data {G_i = (V_i, E_i)}_{i=1}^m ⊆ G, where G is the set of all graphs. The neighborhood of a vertex v is denoted N(v) = {u ∈ V(G_i) : (v, u) ∈ E(G_i)}.
• Node features x_v : V → X, with X being the feature space, e.g., X = R^{d_v}.
• Node labels y : V → Y, with Y being the set of labels, e.g., Y = [n].
Graph neural networks (GNNs). GNNs generalize regular neural networks to process data with varying structures and dependencies. GNNs achieve this flexibility via a message passing computational process. In particular, at the k-th step (or layer) of message passing, we update the representation h_u^{(k+1)} of node u as follows,

h_u^{(k+1)} = UPDATE(h_u^{(k)}, AGGREGATE({h_v^{(k)} | v ∈ N(u)})).  (1)

This update happens for all nodes in parallel within each message passing step. Moreover, the UPDATE and AGGREGATE operators are shared by all nodes, which enables the same GNN to process varying-sized graphs. Once the finite-step message passing process is finished, we can use the output node representations to make predictions on nodes, edges, and the graph via additionally parameterized readout functions. This message passing framework is quite general since one can instantiate the UPDATE and AGGREGATE operators with different neural networks.
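As a concrete illustration, one step of the update in Equation 1 can be sketched in a few lines of NumPy. The sum-AGGREGATE, the linear-plus-ReLU UPDATE, and the weight matrices W_self and W_agg below are illustrative choices for exposition, not the paper's specific model:

```python
import numpy as np

def message_passing_step(H, adj, W_self, W_agg):
    """One generic message-passing step: sum-AGGREGATE over neighbours,
    then a linear UPDATE followed by a ReLU. `adj` maps each node to its
    neighbour list; `H` holds one row of features per node."""
    H_next = np.zeros_like(H)
    for u, neighbours in adj.items():
        agg = H[neighbours].sum(axis=0) if neighbours else np.zeros(H.shape[1])
        H_next[u] = np.maximum(0.0, H[u] @ W_self + agg @ W_agg)  # UPDATE
    return H_next

# Toy example: a path graph 0 - 1 - 2 with 2-dimensional node features.
adj = {0: [1], 1: [0, 2], 2: [1]}
H = np.eye(3, 2)
rng = np.random.default_rng(0)
W_self, W_agg = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
H1 = message_passing_step(H, adj, W_self, W_agg)
```

Because the same W_self and W_agg are applied at every node, the identical function can be run on a graph with any number of nodes, which is the flexibility described above.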
For example, the widely used Graph Convolutional Networks (GCNs) (Kipf & Welling, 2016), which are the main interest of our work, have the form

h_u^{(k+1)} = σ( W_k Σ_{v ∈ N(u) ∪ {u}} h_v^{(k)} / √(|N(u)| |N(v)|) ),  (2)

where one applies a linear transformation (W_k) to all node representations, a weighted sum over the neighborhood, and an element-wise nonlinearity (e.g., ReLU activation). Note that the learnable weights W_k differ from layer to layer. Homophily. A concept studied in network science, homophily (McPherson et al., 2001) is the property that similar nodes group together. For node classification (or node labelling), this means that neighbouring nodes tend to have the same label. Size generalization is plausible when the labelling of the nodes exhibits homophily. The presence of a homophilic graph labelling implies that the labels of the nodes are unlikely to change during the course of a long random walk on the graph. It is important to note that homophily also relates to the graph topology, as not every possible graph structure can be given a labelling that exhibits homophilic properties. An example of a topology where homophily is impossible is an expander graph (Hoory et al., 2006), as shown in Figure 1a, where each node is connected by random or random-like edges to a constant number of other nodes across the entire graph. In this case, any labelling of the nodes is far from homophilic, as can be shown using the expansion property. A setting with more homophily is akin to a barbell graph, as shown in Figure 1b, where there are two densely connected components and comparatively few edges connecting the two dense regions. If the graph labelling of interest lines up with these divisions inherent in the topology, then it is natural to say that it exhibits a homophilic property. Cheeger's Inequality. A mathematical description of homophily can be given using concepts from spectral graph theory.
Cheeger's inequality (Hoory et al., 2006) is a theorem that pertains to partitions of graphs, or equivalently binary-valued labellings on graphs (one side of the partition is labelled 0, the other 1). A crucial definition is the conductance, defined by

ϕ(S) = |E(S, S̄)| / |S| for S ⊆ V, and ϕ(G) = min_{|S| ≤ |V|/2} ϕ(S).

Here E(S, S̄) is the set of edges connecting a node in S to a node outside of S. Cheeger's inequality states that λ_2/2 ≤ ϕ(G) ≤ √(2λ_2), where λ_2 is the second-smallest eigenvalue of the normalized Laplacian L. This inequality links the real-valued quantity λ_2 to the concept of homophily. If λ_2 is small, then the conductance of G must also be low, by Cheeger's inequality. If a labelling of the graph nodes f : V(G) → {0, 1} roughly agrees with a low-conductance partition (i.e., one side of the partition S is generally labelled 0 and its complement S̄ is generally labelled 1), then the labelling f exhibits homophily.
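For a small barbell-like graph, both sides of Cheeger's inequality can be checked numerically. The sketch below is an assumption-laden illustration: it uses the volume-normalized variant of conductance (cut size divided by the sum of degrees in S), the form for which λ_2/2 ≤ ϕ(G) ≤ √(2λ_2) holds exactly for general graphs, and a brute-force sweep over all small cuts:

```python
import itertools
import numpy as np

def normalized_laplacian(A):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def conductance(A, S):
    """Volume-normalized conductance: cut edges divided by vol(S)."""
    S = set(S)
    cut = sum(A[u, v] for u in S for v in range(len(A)) if v not in S)
    return cut / A[list(S)].sum()

# Barbell-like graph: two triangles joined by a single bridge edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

lam2 = np.sort(np.linalg.eigvalsh(normalized_laplacian(A)))[1]
# Sweep all cuts with |S| <= |V|/2; the bridge cut {0,1,2} is the sparsest.
phi = min(conductance(A, S)
          for k in range(1, 4)
          for S in itertools.combinations(range(6), k))
assert lam2 / 2 <= phi <= np.sqrt(2 * lam2)  # Cheeger's inequality holds
```

Here the sparsest cut is the one separating the two triangles (conductance 1/7), and λ_2 sandwiches it exactly as the inequality predicts.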

3. IMPROVEMENT OF IN-DISTRIBUTION PAC-BAYES BOUND

The state-of-the-art generalization bounds for GNNs in the in-distribution case were formulated by Liao et al. (2020) using PAC-Bayes theory. Specifically, they build upon the PAC-Bayes theorem in (Neyshabur et al., 2018) that pertains to homogeneous feedforward neural networks. We denote one sample as z = (X, A, y), where X ∈ X, A ∈ G, and y ∈ Y are the node features, the adjacency matrix, and the graph label respectively. Each sample is drawn from some unknown data distribution D (with support X × G × Y) in an i.i.d. fashion. Since both training and testing samples are drawn from the same distribution, this is the in-distribution setup. Following (Liao et al., 2020), we consider a margin loss for multi-class graph classification as below,

L_{D,γ} = L_{D,γ}(f_w) = P_{z∼D}( f_w(X, A)[y] ≤ γ + max_{j≠y} f_w(X, A)[j] ),

where γ > 0 is the margin parameter and f_w is the model (hypothesis) parameterized by weights w. Since D is unknown, we cannot compute this true loss (risk). We instead minimize the empirical loss (risk) defined on the sampled training set S as below,

L_{S,γ} = L_{S,γ}(f_w) = (1/m) Σ_{z_i ∈ S} 1[ f_w(X_i, A_i)[y_i] ≤ γ + max_{j≠y_i} f_w(X_i, A_i)[j] ],

where m is the number of training samples. For simplicity, we abbreviate L_{D,γ}(f_w) and L_{S,γ}(f_w) as L_{D,γ} and L_{S,γ} respectively from now on. Our main in-distribution result bounds the gap between the true and empirical risks for GCNs, as shown in the following theorem. The proof is in Appendix A.1.

Theorem 3.1. For any B > 0, l > 1, let f_w ∈ H : X × G → R^k be an l-layer GCN. Then with probability ≥ 1 - δ over the choice of an i.i.d. size-m training set S from the data distribution D, we have for any w:

L_{D,0} ≤ L_{S,γ} + O( √( (B² d l² (h + ln l) Π_{i=1}^l ||W_i||_2² Σ_{i=1}^l (||W_i||_F² / ||W_i||_2²) + ln(m/δ)) / (γ² m) ) ).

Here d equals one plus the maximum node degree achievable under the data distribution, and l is the depth, i.e., the number of layers, of the GCN.
W_i is the weight matrix of the GCN in the i-th layer. B is the radius of the minimal ℓ_2 ball that contains all node features, i.e., ∀v, ||x_v||_2 ≤ B. This improves the bound in (Liao et al., 2020), which is reproduced below for comparison,

L_{D,0} ≤ L_{S,γ} + O( √( (B² d^{l-1} l² h log(lh) Π_{i=1}^l ||W_i||_2² Σ_{i=1}^l (||W_i||_F² / ||W_i||_2²) + log(ml/δ)) / (γ² m) ) ).

The proof of the theorem in (Liao et al., 2020) is an induction over the l layers, in which the spectral norm of the weights and a maximum-degree term are multiplied in at each step. We observe that it is possible to avoid accumulating the maximum-degree term via a refined argument. This tightens one of the main inequalities used in the induction proof, in turn resulting in substantial improvements to the overall bound. As can be seen above, we reduce the exponential term d^{l-1} to a linear term d, which is a significant improvement even for graphs with small node degrees.
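To make the improvement concrete, the following back-of-the-envelope calculation compares only the degree factors of the two bounds, for hypothetical values d = 10 and l = 6 (all other factors are shared between the bounds):

```python
import math

# Hypothetical setting: d = 10 (one plus the max node degree) and an
# l = 6 layer GCN. The old bound carries d^(l-1) inside the square root,
# the new bound carries d.
d, l = 10, 6
old_factor = d ** (l - 1)   # degree factor in the bound of Liao et al.
new_factor = d              # degree factor in the improved bound
# Both bounds scale with the square root of their degree factor, so the
# gap in the final bound value is sqrt(old/new) = d^((l-2)/2).
improvement = math.sqrt(old_factor / new_factor)
assert improvement == d ** ((l - 2) / 2) == 100.0
```

Even at this modest degree and depth, the degree factor alone accounts for a two-orders-of-magnitude tightening, consistent with the numerical results reported in Section 5.1.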

4. TOWARDS DEVELOPING A THEORY FOR SIZE GENERALIZATION

In this section, we develop an out-of-distribution (OOD) generalization theory for GNNs. Since we adopt a statistical learning viewpoint, there must necessarily be some assumptions relating the training and testing graphs (otherwise the no-free-lunch theorem applies). There is a tradeoff between assumptions that are practically relevant and those for which rigorous guarantees are provable. We have chosen assumptions that we believe strike a balance between these objectives, at least for applications like social networks. Size Generalization Assumptions. We consider the following setup. First, we assume that there exists an extremely large graph G, like the user network of Twitter, so that one needs to sample subgraphs (e.g., via random walks) for training and testing machine learning models. This is akin to the practical setups of (Grover & Leskovec, 2016; Hamilton et al., 2017). To generate training and testing subgraphs, we run random walks of length N and M respectively on this single large graph, where M ≫ N, and collect the subgraphs induced by these walks. GNNs are then trained on the subgraphs induced by the shorter (length-N) walks. In testing, we assume a procedure where a subgraph induced by a length-M random walk is sampled from the large underlying graph. Random walks are initiated by choosing an initial node uniformly at random from all nodes in the graph, and at each step there is an equal probability of selecting any of the current node's neighbors. This is an interesting OOD problem where training and testing graphs come from different distributions determined by the underlying large graph and random walk sampling with a specific length. We consider the graph classification problem and assume that the graph label is determined by the majority of node labels within the graph, which is reasonable for many applications that involve homophilic graphs. For the node labeling, we assume it is binary but make no assumptions on how the labels are generated.
Crucially, we assume nothing about the underlying large graph. Therefore, our setup has advantages over some OOD setups in the literature where a generative model of graphs and labels is explicitly assumed. Relation to the In-Distribution Result. We know the relationship between the true error defined on the unknown data distribution D and the empirical error defined on the size-m training set S. Specifically, for any GCN f, with probability at least 1 - δ, we have a general bound as follows,

L_{D,0} ≤ L_{S,γ} + A(f, δ, m),

where we abbreviate the bound as A(f, δ, m) and omit specific parameters like the maximum node degree d. In the size generalization problem, we use random walks with lengths N and M for collecting training and testing subgraphs (data) respectively. We are interested in proving a statement of the following form: for any GCN f, with probability at least 1 - δ,

L_{D_M,0} ≤ L_{S_N,γ} + B(f, δ, m, M, N).

The key detail is that D_M is the distribution of subgraphs induced by random walks of length M and S_N is the training set of subgraphs induced by random walks of length N. Comparing these two losses is the essence of our OOD result. The final term B(f, δ, m, M, N) is a general bound involving these parameters. Based on an in-distribution result like Theorem 3.1, we can similarly obtain,

L_{D_N,0} ≤ L_{S_N,γ} + A_N(f, δ, m),

where D_N is the distribution of subgraphs induced by random walks of length N and A_N is the corresponding general bound. The key question boils down to: what is the relationship between L_{D_N,0} and L_{D_M,0}? This question will be answered in the following sections.
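The sampling procedure described above (train on subgraphs induced by short length-N walks, test on subgraphs induced by long length-M walks) can be sketched as follows; the underlying graph and the walk lengths here are toy choices for illustration:

```python
import random

def random_walk_subgraph(adj, length, rng):
    """Sample the subgraph induced by a simple random walk: start at a
    uniformly random node, then repeatedly move to a uniformly random
    neighbour of the current node."""
    walk = [rng.choice(list(adj))]
    for _ in range(length - 1):
        walk.append(rng.choice(adj[walk[-1]]))
    nodes = set(walk)
    # Induced subgraph: keep every edge whose endpoints were both visited.
    edges = {(u, v) for u in nodes for v in adj[u] if v in nodes}
    return nodes, edges

# Toy underlying graph: a 6-cycle. Training uses short walks, testing long.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
rng = random.Random(0)
train_nodes, _ = random_walk_subgraph(adj, length=3, rng=rng)   # N = 3
test_nodes, _ = random_walk_subgraph(adj, length=12, rng=rng)   # M = 12
```

Because both distributions are induced by the same underlying graph, only the walk length differs between training (D_N) and testing (D_M).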

4.1. A PROBABILITY BOUND FOR PARTITION CROSSES

The above size generalization problem involves the distributions of random-walk-induced subgraphs from a large graph G with two lengths: N for training and M for testing, where M is much larger than N. Before we state our results, we would like to explain the simple intuition that motivates our theory: if the random walk always stays within the same partition, then the graph label of the random-walk-induced subgraph can be well predicted, no matter how long the random walk is. Here a partition means a subset of nodes with the same node label. The goal of this section is to find bounds on M for which we can provide OOD guarantees. We begin by considering a special labelling. Special Node Labeling: Sparsest Cut. A set S that minimizes ϕ(S) (and has |S| ≤ |V|/2) is called a sparsest cut. For simplicity, assume that S is unique. Using Cheeger's inequality, we first prove the following probability bounds related to this sampling procedure, thereby identifying the lengths M for which a random walk is likely to stay within the sparsest cut for d-regular graphs. The theorems are as follows.

Theorem 4.1. Let U_M = [u_1, u_2, ..., u_M] be a length-M random walk over a connected, d-regular graph G, with u_1 chosen from the stationary distribution of the nodes of G. If M ≤ d/(2^{5/2} √λ_2), then the probability that U_M crosses the sparsest-cut partition at least once is under 1/2. Here, crossing the sparsest-cut partition S means that there exists an edge (u, v) of the random walk satisfying u ∈ S and v ∈ S̄. λ_2 is the second-smallest eigenvalue of the normalized Laplacian. We can easily generalize the previous theorem to an arbitrary probability δ > 0, as below.

Corollary 4.1.1. If M ≤ (δd)/(2^{3/2} √λ_2), the probability of the above random walk U_M crossing the sparsest-cut partition at least once is at most δ.

General Node Labeling. Theorem 4.1 is restrictive in that it requires the partition S to be the sparsest cut.
We now modify the proof to yield a quantity that can work for any node labelling. Specifically, let φ be any boolean (i.e., {0, 1}-valued) labelling on the vertices of the graph. Let the positive node labelling of φ be S = {v ∈ V(G) : φ(v) = 1}. We are interested in bounding the probability that a random walk of length M includes an edge that crosses the positive node labelling S, i.e., an edge (u, v) satisfying u ∈ S and v ∈ S̄.

Theorem 4.2. Let φ be a boolean labelling on the nodes of a connected, d-regular graph G with positive node labelling S (viewing φ as a 0-1 valued vector with φ[i] = 1 if v_i ∈ S). Let U_M = [u_1, u_2, ..., u_M] be a length-M random walk over G, with u_1 chosen from the stationary distribution of the nodes of G. Let X_i be the indicator variable of the event that the i-th edge of U_M crosses S, i.e., X_i = 1[u_i ∈ S, u_{i+1} ∈ S̄], and let Y_k = Σ_{i=1}^k X_i be the number of times that U_M crosses S in the first k steps. Let φ' = φ - 1·(|S|/|V|) and α = φ'ᵀ L φ' / ||φ'||_2². The conclusion is that if M ≤ d/(2^{5/2} √α), then Pr[Y_M ≥ 1] ≤ 1/2.

Corollary 4.2.1. If M ≤ (δd)/(2^{3/2} √α), the probability that the above random walk U_M crosses the positive node labelling of φ at least once is at most δ, i.e., Pr[Y_M ≥ 1] ≤ δ.

The formula for α arises from an alternative formulation of Cheeger's inequality which expresses λ_2 using a Rayleigh quotient (Spielman, 2015), in which y may be viewed as a real-valued labelling on the vertices:

λ_2 = min_{y ⊥ d} (yᵀ L y)/(yᵀ D y).
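The quantity α and the resulting walk-length bound from Corollary 4.2.1 are straightforward to compute for a given d-regular graph and labelling. The sketch below uses L = I - A/d, which coincides with the normalized Laplacian in the d-regular case; the 8-cycle, its half/half labelling, and δ = 0.5 are illustrative choices:

```python
import numpy as np

def alpha_and_max_walk_length(A, phi, delta):
    """Compute alpha = phi'^T L phi' / ||phi'||^2 (Theorem 4.2) for a
    d-regular graph with adjacency A and 0/1 labelling phi, then the
    walk-length bound M <= delta * d / (2^{3/2} * sqrt(alpha)) from
    Corollary 4.2.1."""
    n = len(A)
    d = int(A[0].sum())                 # d-regularity assumption
    L = np.eye(n) - A / d               # normalized Laplacian (d-regular case)
    phi_c = phi - phi.mean()            # phi' = phi - 1 * (|S| / |V|)
    alpha = phi_c @ L @ phi_c / (phi_c @ phi_c)
    return alpha, delta * d / (2 ** 1.5 * np.sqrt(alpha))

# Toy example: an 8-cycle (2-regular) labelled by halves.
A = np.zeros((8, 8))
for i in range(8):
    A[i, (i + 1) % 8] = A[i, (i - 1) % 8] = 1.0
phi = np.array([1.0] * 4 + [0.0] * 4)
alpha, M_max = alpha_and_max_walk_length(A, phi, delta=0.5)
```

For this toy graph the computed α is 0.5 and the resulting bound on M is below one step, which reflects how conservative the guarantee is at such small scales; it becomes meaningful on large, high-degree, low-α graphs like the social networks targeted in Section 5.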

4.2. SIZE GENERALIZATION ERROR

Recall that, in the size generalization setup, we first train a GNN model f on subgraphs induced by many length-N random walks on G. Then, during testing, given a large testing subgraph G_M induced by a length-M random walk on G, we sample a subgraph G_N via a length-N random walk on G_M and feed it to f to compute the empirical (classification) error for G_M. If all nodes of G_M are within a single positive node labelling, then all of their labels are the same. Therefore, no matter which subgraph G_N is sampled, the generalization error (i.e., the probability of making a wrong prediction) for G_M should be the same as the one for G_N. Based on this reasoning, we have the following result.

Theorem 4.3 (Size Generalization Error). For any δ ∈ [0, 1), if we restrict M, the size of the large random-walk-induced subgraph, such that M ≤ (δd)/(2^{3/2} √α), then the generalization error L_{D_M,0}, i.e., the probability of a wrong prediction on length-M-random-walk-induced subgraphs, satisfies

L_{D_M,0} ≤ δ + L_{D_N,0},

where L_{D_N,0} is the in-distribution generalization error of f on length-N-random-walk-induced subgraphs. Note that this theorem explicitly constrains M, whereas the only condition on N is that L_{D_N,0} is small.

Proof. Observe that, for any events F and E, we have Pr[F] ≤ Pr[E] + Pr[F | Ē]. Let E be the event that a length-M random walk crosses the positive node labelling of the ground truth labels, and let F be the event that we make a wrong prediction on the induced subgraph G_M. Theorem 3.1 bounds the second term, Pr[F | Ē], because the generalization error on G_M is the same as the one on G_N (subgraphs induced by length-N random walks) when G_M does not cross the positive node labelling. Corollary 4.2.1 bounds the first term. Substituting the values from the previous two theorems yields the claimed inequality.
We already know a bound on the in-distribution generalization error L_{D_N,0} due to Theorem 3.1; let us call this quantity δ. Using it, we obtain the final result for GCNs under our OOD setup. Theorem 4.3 simply states that, if the length M ≤ (δd)/(2^{3/2} √α), then with probability at least 1 - δ, the OOD generalization error on large subgraphs (induced by length-M random walks) is at most the sum of the error δ and the in-distribution generalization bound on small subgraphs (induced by length-N random walks).

5.1. IN-DISTRIBUTION: NUMERICAL PAC-BAYES BOUND COMPUTATION

We conduct multi-class graph classification experiments to compare our improved bound to the original PAC-Bayes bound in (Liao et al., 2020). We use the same GCN model, adopt the same datasets, i.e., 6 synthetic datasets obtained from random graph models and 3 real-world graph datasets used in (Yanardag & Vishwanathan, 2015), and follow the same experimental protocol. After training a GCN on each dataset, we compute the theoretical bounds using the final model. The numerical comparisons of log bound values are shown in Figure 2. It is clear that our new bounds are significantly tighter, reducing the bound values by several orders of magnitude. The gap increases further as the depth increases. The tables of bound values and the specific equations used to compute them are provided in Appendix B.1.

5.2. OUT-OF-DISTRIBUTION: EFFICACY OF SIZE GENERALIZATION

We performed OOD experiments on synthetic graphs to validate the upper bound on the size M of large subgraphs set by Theorem 4.1 and its related theorems. We also performed experiments on non-homophilic synthetic graphs with the same values of M and N, to examine size generalization in that case, and we examined the general feasibility of size generalization on real-world social network data. For synthetic graphs, we calculated the theoretical value of the upper bound and selected the large subgraph size M and the small subgraph size N ≪ M accordingly. For the real-world case, we chose constant values of N = 10 and M = 50. For each subgraph, we assign as its graph label the label observed most often among its nodes. After sampling datasets of subgraphs of sizes M and N, we train GCN models on the dataset of length-N random walks and measure their performance on the training set, the validation set (a smaller dataset generated the same way as the training set), and the testing set (a set of subgraphs induced by length-M random walks). On the test set we record both the performance when inputting the whole large subgraph (Test error), as well as when performing the sampling procedure used for Theorem 4.3, in which we sample an induced subgraph from a length-N random walk for each data item (Sampling-test error). Synthetic Graphs. We adopt CSBMs (Deshpande et al., 2018) to generate graphs that exhibit the homophily property. We use two blocks with a much higher probability of connections inside the same block than between blocks, which leads to barbell-like graphs. In the non-homophilic case, we set these probabilities to be equal. We generate binary node labellings via the sparsest cut. CSBMs generate node features via a Gaussian mixture where the individual choice of mixture component is determined by the node label. Real-world Graphs. We used social network data for Twitch streamers from (Rozemberczki et al., 2019).
Each node is a streamer (Twitch user), and nodes are connected by mutual friendships. Node features are 3,169 different binary indicators of a wide array of attributes, including games liked, location, etc. Each node is labelled with a boolean value indicating whether the streamer uses explicit language. In all cases, the GCN model achieves OOD test accuracy on large subgraphs that is comparable to the ID accuracy on small subgraphs, if not outright better. This is the case even when some of the constraints are violated: no d-regularity constraint was imposed for any of the datasets, and performance was still good for the test error, which did not involve further subgraph sampling. This indicates that the theory is promising in practice for more general forms of size generalization. The accuracies on the training set, the test set with subgraph sampling, and the unaltered test set are shown in Figure 2, and the numerical values are in Appendix B.2. In many cases, including all real-world cases, the test accuracy was actually higher than the training accuracy. This could indicate that, in the cases where size generalization can be guaranteed to work well, the GCN model benefits significantly from extra node information. It is also possible that, because of the sampling procedure, there is overlap in nodes between the training and test sets, since both come from random-walk sampling procedures that naively select a uniformly random node as the initial node.
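The CSBM-style generation used for the synthetic experiments can be sketched as follows. The parameter values, the feature dimension, and the function signature below are illustrative, not the exact settings used in the experiments:

```python
import numpy as np

def sample_csbm(n_per_block, p, q, mu, sigma, rng):
    """Sample a two-block CSBM-style graph: intra-block edge probability p,
    inter-block probability q (p >> q gives barbell-like graphs), and
    Gaussian node features whose mean (+mu or -mu) is set by the block."""
    n = 2 * n_per_block
    labels = np.array([0] * n_per_block + [1] * n_per_block)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    # Sample an undirected simple graph: draw the upper triangle, mirror it.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(float)
    X = rng.normal(0.0, sigma, size=(n, 2)) + np.where(labels[:, None] == 0, mu, -mu)
    return A, X, labels

rng = np.random.default_rng(0)
A, X, y = sample_csbm(n_per_block=20, p=0.5, q=0.02, mu=1.0, sigma=0.5, rng=rng)
```

Setting p = q recovers the non-homophilic control condition described above, since the two blocks then become statistically indistinguishable in the graph structure.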

6. DISCUSSION

In this work we have expanded the theoretical understanding of the generalization of GNNs in both in-distribution and out-of-distribution settings, deriving new theoretical guarantees in each setting. The results for in-distribution learning improve upon the state-of-the-art PAC-Bayes bounds in (Liao et al., 2020), and the results for out-of-distribution learning provide insight into a practical learning setting under which GNNs are guaranteed to perform effective size generalization. Future directions for the in-distribution understanding involve lowering the dependencies on other quantities, such as the spectral norm of the weights. Generalizing the results to other problems, like node classification, would also be interesting. In the out-of-distribution case, a number of observations from our experiments indicate that the theory can still be much expanded. We have identified cases in real-world datasets that lie well beyond the size bounds set forth in the theory, and in all experiments the d-regularity assumption is violated, yet GCN size generalization remains effective in these cases. Expansions of the theory, including generalizations to non-d-regular graphs, could be explored to explain such cases.

A MATHEMATICAL PROOFS

A.1 PROOF OF THEOREM 3.1

The proof is as follows, and makes up the remainder of this appendix.

A.1.1 IMPROVEMENT ON DEGREE DEPENDENCY

In (Liao et al., 2020), a generalization bound is attained for graph convolutional networks; this bound depends on a bound on the maximum perturbation of the function value when a perturbation U is applied to the weights W, presented in that paper's Lemma 3.1. The bound is as follows:

|f_{w+u}(X, A) - f_w(X, A)|_2 ≤ e B d^{(l-1)/2} (Π_{i=1}^l ||W_i||_2) Σ_{k=1}^l (||U_k||_2 / ||W_k||_2).

The primary goal of this set of improvements is to reduce the factor of d^{(l-1)/2}. For each layer, let H_j ∈ R^{|V| × h} be the matrix containing the hidden embeddings of all of the nodes in its rows, with h being the hidden dimension. In the process of the proof of Theorem 3.1, we are able to show the following:

Φ_j = max_i |H_j[i, :]|_2 ≤ d^{j/2} B Π_{i=1}^j ||W_i||_2,

Ψ_j = max_i |H'_j[i, :] - H_j[i, :]|_2 ≤ B d^{j/2} (Π_{i=1}^j ||W_i||_2) Σ_{k=1}^j (||U_k||_2 / ||W_k||_2)(1 + 1/l)^{j-k},

|∆_l|_2 = |(1/n) 1_n^T H'_{l-1}(W_l + U_l) - (1/n) 1_n^T H_{l-1} W_l|_2 ≤ e B d^{(l-1)/2} (Π_{i=1}^l ||W_i||_2) Σ_{k=1}^l (||U_k||_2 / ||W_k||_2).

We simplify these bounds by removing the dependency on d^{j/2}, replacing it instead with a fixed power of d^{1/2} that remains constant for every layer, and thus in the final result of Equation 11 as well.

Theorem A.1. For all 1 ≤ j ≤ l - 1, we have:

Φ_j ≤ √d B Π_{i=1}^j ||W_i||_2,   (14)

Ψ_j ≤ (1 + 1/l)^j B √d (Π_{i=1}^j ||W_i||_2) Σ_{i=1}^j (||U_i||_2 / ||W_i||_2).   (15)

Finally,

|f_{w+u}(X, A) - f_w(X, A)|_2 = |∆_l|_2 ≤ e B √d (Π_{i=1}^l ||W_i||_2) Σ_{i=1}^l (||U_i||_2 / ||W_i||_2).

The proof follows from a lemma about the 2-norm of any node representation at any layer:

Lemma A.1.1. We have, for all u ∈ [n] and for all j ∈ [l]:

|H_j[u, :]|_2 ≤ B √(deg(u)) Π_{i=1}^j ||W_i||_2.

Proof. We prove this by induction. By definition, |H_0[u, :]|_2 ≤ B and thus |H_0[u, :]|_2 ≤ √(deg(u)) B Π_{k=1}^0 ||W_k||_2 (the empty product being 1). Assume that, for all u, we have |H_{j-1}[u, :]|_2 ≤ √(deg(u)) B Π_{k=1}^{j-1} ||W_k||_2.
From these statements we are able to deduce

$$|H_j[u,:]|_2 \le \sum_{v \in N_u} L[u,v]\, |H_{j-1}[v,:]|_2\, \|W_j\|_2 \le \sum_{v \in N_u} \frac{1}{\sqrt{\deg(u)\deg(v)}} \sqrt{\deg(v)}\, B \Big(\prod_{k=1}^{j-1} \|W_k\|_2\Big) \|W_j\|_2$$
$$= \sum_{v \in N_u} \frac{1}{\sqrt{\deg(u)}}\, B \Big(\prod_{k=1}^{j-1} \|W_k\|_2\Big) \|W_j\|_2 = \frac{\deg(u)}{\sqrt{\deg(u)}}\, B \prod_{k=1}^{j} \|W_k\|_2 = \sqrt{\deg(u)}\, B \prod_{k=1}^{j} \|W_k\|_2.$$

In these inequalities we use the fact that $L[i,j] = (A+I)_{ij}/\sqrt{\deg(i)\deg(j)}$, and we assume the simple case of unweighted edges, so that $(A+I)_{ij}$ is 1 if and only if nodes $i$ and $j$ are connected (or $i = j$) and 0 otherwise.

By Lemma A.1.1, we have $\Phi_j = \max_i |H_j[i,:]|_2 \le \sqrt{d}\, B \prod_{i=1}^{j} \|W_i\|_2$, which is exactly the result of Equation (14).

Claim A.1. For all $v \in [n]$,

$$|\Delta_j[v,:]|_2 \le B \sqrt{\deg(v)} \Big(1 + \frac{1}{l}\Big)^{j} \Big(\prod_{i=1}^{j} \|W_i\|_2\Big) \sum_{i=1}^{j} \frac{\|U_i\|_2}{\|W_i\|_2}.$$

Proof. We use induction, assuming the claim holds for $\Delta_{j-1}$. We then have

$$|\Delta_j[v,:]|_2 \le \sum_{u \in N(v)} L[v,u]\, |H'_{j-1}[u,:] - H_{j-1}[u,:]|_2\, \|W_j + U_j\|_2 + \sum_{u \in N(v)} L[v,u]\, |H_{j-1}[u,:]|_2\, \|U_j\|_2$$
$$\le \Bigg[ B \Big(1+\frac{1}{l}\Big)^{j-1} \Big(\prod_{i=1}^{j-1}\|W_i\|_2\Big) \Big(\sum_{i=1}^{j-1}\frac{\|U_i\|_2}{\|W_i\|_2}\Big) \|W_j+U_j\|_2 + B \|U_j\|_2 \prod_{i=1}^{j-1}\|W_i\|_2 \Bigg] \Bigg(\sum_{u \in N(v)} L[v,u]\sqrt{\deg(u)}\Bigg) \qquad (19)$$
$$= B\sqrt{\deg(v)} \Big(\prod_{i=1}^{j-1}\|W_i\|_2\Big) \Bigg[ \|W_j+U_j\|_2 \Big(1+\frac{1}{l}\Big)^{j-1} \sum_{i=1}^{j-1}\frac{\|U_i\|_2}{\|W_i\|_2} + \|U_j\|_2 \Bigg]$$
$$= B\sqrt{\deg(v)} \Big(\prod_{i=1}^{j}\|W_i\|_2\Big) \Bigg[ \frac{\|W_j+U_j\|_2}{\|W_j\|_2} \Big(1+\frac{1}{l}\Big)^{j-1} \sum_{i=1}^{j-1}\frac{\|U_i\|_2}{\|W_i\|_2} + \frac{\|U_j\|_2}{\|W_j\|_2} \Bigg]$$
$$\le B\sqrt{\deg(v)} \Big(\prod_{i=1}^{j}\|W_i\|_2\Big) \Big(1+\frac{1}{l}\Big)^{j} \Bigg[ \sum_{i=1}^{j-1}\frac{\|U_i\|_2}{\|W_i\|_2} + \frac{\|U_j\|_2}{\|W_j\|_2} \Bigg] = B\sqrt{\deg(v)} \Big(1+\frac{1}{l}\Big)^{j} \Big(\prod_{i=1}^{j}\|W_i\|_2\Big) \sum_{i=1}^{j}\frac{\|U_i\|_2}{\|W_i\|_2},$$

where we use $\sum_{u\in N(v)} L[v,u]\sqrt{\deg(u)} = \sqrt{\deg(v)}$ and $\|W_j+U_j\|_2 \le \|W_j\|_2 (1+1/l)$, which holds since $\|U_j\|_2 \le \|W_j\|_2/l$.

$\Delta_l$ has a slightly different formulation, but a very similar bound:

$$|\Delta_l|_2 = \Big|\frac{1}{n}\mathbf{1}_n^\top L H'_{l-1}(W_l+U_l) - \frac{1}{n}\mathbf{1}_n^\top L H_{l-1} W_l\Big|_2 = \Big|\frac{1}{n}\mathbf{1}_n^\top L (H'_{l-1}-H_{l-1})(W_l+U_l) + \frac{1}{n}\mathbf{1}_n^\top L H_{l-1} U_l\Big|_2$$
$$\le \frac{1}{n}\sum_{i=1}^{n} |\Delta_{l-1}[i,:]|_2\, \|W_l+U_l\|_2 + \frac{1}{n}\sum_{i=1}^{n} |H_{l-1}[i,:]|_2\, \|U_l\|_2$$
$$\le B\sqrt{d} \Big(\prod_{i=1}^{l-1}\|W_i\|_2\Big) \Big(1+\frac{1}{l}\Big)^{l-1} \Big(\sum_{i=1}^{l-1}\frac{\|U_i\|_2}{\|W_i\|_2}\Big) \|W_l+U_l\|_2 + B\sqrt{d}\,\|U_l\|_2 \prod_{i=1}^{l-1}\|W_i\|_2$$
$$\le B\sqrt{d} \Big(\prod_{i=1}^{l}\|W_i\|_2\Big) \Big(1+\frac{1}{l}\Big)^{l} \Bigg[\sum_{i=1}^{l-1}\frac{\|U_i\|_2}{\|W_i\|_2} + \frac{\|U_l\|_2}{\|W_l\|_2}\Bigg] \le e B\sqrt{d} \Big(\prod_{i=1}^{l}\|W_i\|_2\Big) \sum_{i=1}^{l}\frac{\|U_i\|_2}{\|W_i\|_2},$$

using $(1+1/l)^l \le e$. From this we have proven a tighter bound on the final output of the GNN under perturbation,
which we will use to calculate probabilistic and generalization bounds.

A.1.2 IMPROVEMENT ON PROBABILISTIC BOUNDS USING RANDOM MATRIX THEORY

In (Liao et al., 2020), for all $i \in [l]$, with $l$ being the number of layers, the prior and the distribution of the perturbations $U_i \in \mathbb{R}^{d_{i+1} \times d_i}$, where all hidden dimensions $d_i$ are upper-bounded by a value $h$, are generated from a normal distribution $\mathcal{N}(0, \sigma^2 I)$, and the operator norms are bounded probabilistically as $P(\forall i,\ \|U_i\|_2 \le t) \ge 1 - 2lh \exp(-t^2/2h\sigma^2)$. We improve these bounds using theorems on random matrices from work on high-dimensional probability, namely (Vershynin, 2018).

Theorem A.2 (Theorem 4.4.5 in (Vershynin, 2018)). Let $A$ be a matrix in $\mathbb{R}^{m \times n}$ whose entries $A_{ij}$ are independent, mean-zero, sub-Gaussian random variables. Then, for all $t > 0$, we have $\|A\| \le CK(\sqrt{m} + \sqrt{n} + t)$ with probability at least $1 - 2\exp(-t^2)$, where $K = \max_{i,j} \|A_{ij}\|_{\psi_2}$ and $C$ is a universal constant.

In the above theorem, the norm $\|X\|_{\psi_2}$ is defined as $\inf\{t > 0 : \mathbb{E}[\exp(X^2/t^2)] \le 2\}$. In Example 2.5.8 of (Vershynin, 2018), it is shown that if $X \sim \mathcal{N}(0, \sigma^2)$, then $\|X\|_{\psi_2} \le C\sigma$.

Corollary A.2.1. If $U \in \mathbb{R}^{m \times n}$ is a random matrix generated with the distribution $\mathcal{N}(0, \sigma^2 I)$ (i.e. all entries are independent and identically distributed Gaussian random variables), then we have $\|U\| \le C\sigma(\sqrt{m} + \sqrt{n} + t)$ with probability at least $1 - 2\exp(-t^2)$.
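The perturbation bound of Section A.1.1 can be spot-checked numerically. The following sketch, under our own choices of toy graph, ReLU activations, and mean readout (all variable names are ours), builds a small random GCN and verifies that the observed output change stays below the proven bound when $\|U_i\|_2 \le \|W_i\|_2/l$:

```python
import numpy as np

# Numerical spot check (not a proof) of the perturbation bound from A.1.1:
# |f_{w+u}(X, A) - f_w(X, A)|_2 <= e * B * sqrt(d) * prod_i ||W_i||_2
#                                  * sum_i ||U_i||_2 / ||W_i||_2,
# under the proof's assumption ||U_i||_2 <= ||W_i||_2 / l.

rng = np.random.default_rng(7)
n, f, h, l = 10, 3, 5, 4

A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
deg = (A + np.eye(n)).sum(axis=1)                      # degrees with self-loop
Lmat = (A + np.eye(n)) / np.sqrt(np.outer(deg, deg))   # GCN normalization
X = rng.normal(size=(n, f))
B = np.linalg.norm(X, axis=1).max()                    # max feature norm
d = deg.max()

Ws = [rng.normal(size=(f, h))] + [rng.normal(size=(h, h)) for _ in range(l - 1)]
Us = []
for W in Ws:                                           # scale ||U_i|| to ||W_i|| / l
    U = rng.normal(size=W.shape)
    Us.append(U * np.linalg.norm(W, 2) / (l * np.linalg.norm(U, 2)))

def forward(weights):
    H = X
    for W in weights[:-1]:
        H = np.maximum(Lmat @ H @ W, 0.0)              # ReLU GCN layer
    return (np.ones(n) / n) @ (Lmat @ H @ weights[-1]) # mean readout

gap = np.linalg.norm(forward([W + U for W, U in zip(Ws, Us)]) - forward(Ws))
bound = (np.e * B * np.sqrt(d)
         * np.prod([np.linalg.norm(W, 2) for W in Ws])
         * sum(np.linalg.norm(U, 2) / np.linalg.norm(W, 2)
               for W, U in zip(Ws, Us)))
print(gap <= bound)
```

The gap is typically far below the bound, which is expected: the proof bounds a worst case over all perturbation directions.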
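Corollary A.2.1 can also be illustrated empirically. The sketch below draws i.i.d. Gaussian matrices and compares their spectral norms to $\sigma(\sqrt{m} + \sqrt{n})$; the absolute constant $C$ is informally taken as 1, which is close to correct for Gaussian entries:

```python
import numpy as np

# Empirical illustration of Corollary A.2.1: the spectral norm of an
# i.i.d. N(0, sigma^2) matrix concentrates around sigma * (sqrt(m) + sqrt(n)).
# The constant C is dropped (taken as 1); this illustrates, not proves.

rng = np.random.default_rng(1)
m, n, sigma, trials = 200, 120, 0.5, 20
norms = [np.linalg.norm(rng.normal(0.0, sigma, size=(m, n)), 2)
         for _ in range(trials)]
reference = sigma * (np.sqrt(m) + np.sqrt(n))
print(max(norms) / reference)   # typically close to (and below) 1
```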
With a change of variables, we are able to calculate the following:

$$P(\forall i,\ \|U_i\|_2 \le t) \ge 1 - P(\exists i,\ \|U_i\|_2 > t) \ge 1 - \sum_{i=1}^{l} P(\|U_i\|_2 > t) \ge 1 - 2l \exp\Big(-\Big(\frac{t}{C\sigma} - 2\sqrt{h}\Big)^2\Big).$$

And by setting the right-hand side to $1/2$, we obtain $t = C\sigma(2\sqrt{h} + \sqrt{\ln(4l)})$. Using the above equation combined with our bound, we are able to get

$$|f_{w+u}(X,A) - f_w(X,A)|_2 \le e B\sqrt{d} \Big(\prod_{i=1}^{l}\|W_i\|_2\Big) \sum_{k=1}^{l}\frac{\|U_k\|_2}{\|W_k\|_2} = e B\sqrt{d}\, \beta^{l-1} \sum_{k=1}^{l} \|U_k\|_2$$
$$\le e B\sqrt{d}\, \beta^{l-1}\, l\, C\sigma\big(2\sqrt{h} + \sqrt{\ln(4l)}\big) \le e^2 B\sqrt{d}\, \tilde{\beta}^{l-1}\, l\, C\sigma\big(2\sqrt{h} + \sqrt{\ln(4l)}\big) \le \frac{\gamma}{4}.$$

Here $\tilde{\beta}$ is an estimate of $\beta$ such that $|\beta - \tilde{\beta}| \le \beta/l$, which can be generated a priori; we discuss this in a later subsection. We can set

$$\sigma = \frac{\gamma}{4 e^2 B \sqrt{d}\, l\, \tilde{\beta}^{l-1}\, C \big(2\sqrt{h} + \sqrt{\ln(4l)}\big)}$$

to satisfy the final inequality. From this we can calculate the KL divergence between the posterior and the prior:

$$KL(Q\|P) = \frac{|w|_2^2}{2\sigma^2} = \frac{16 e^4 B^2 d\, l^2\, \tilde{\beta}^{2(l-1)} C^2 \big(2\sqrt{h} + \sqrt{\ln(4l)}\big)^2}{2\gamma^2} \sum_{i=1}^{l} \|W_i\|_F^2$$
$$\le O\Bigg(\frac{B^2 d\, \beta^{2l}\, l^2 (h + \ln l)}{\gamma^2} \sum_{i=1}^{l} \frac{\|W_i\|_F^2}{\beta^2}\Bigg) \le O\Bigg(\frac{B^2 d\, l^2 (h + \ln l) \prod_{i=1}^{l}\|W_i\|_2^2}{\gamma^2} \sum_{i=1}^{l} \frac{\|W_i\|_F^2}{\|W_i\|_2^2}\Bigg).$$

From this we are able to calculate the generalization bound and thus prove the theorem:

$$L_{D,0} \le L_{S,\gamma} + O\Bigg(\sqrt{\frac{B^2 d\, l^2 (h + \ln l) \prod_{i=1}^{l}\|W_i\|_2^2 \sum_{i=1}^{l}\frac{\|W_i\|_F^2}{\|W_i\|_2^2} + \ln\frac{m}{\delta}}{\gamma^2 m}}\Bigg).$$

A.1.3 SELECTING PARAMETER $\tilde{\beta}$

The prior normal distribution's variance parameter $\sigma^2$ depends on $\beta$, but $\beta$ cannot be used in its calculation because that information is only known after model training. Instead, we can select a parameter $\tilde{\beta}$ such that $|\beta - \tilde{\beta}| \le \frac{1}{l}\beta$ and thus $\frac{1}{e}\beta^{l-1} \le \tilde{\beta}^{l-1} \le e\beta^{l-1}$ (as per Equation 33 in (Liao et al., 2020)). As in (Liao et al., 2020), we only have to consider values of $\beta$ in the range $\big(\frac{\gamma}{2B\sqrt{d}}\big)^{1/l} \le \beta \le \big(\frac{\gamma\sqrt{m}}{2B\sqrt{d}}\big)^{1/l}$, as otherwise the generalization bound holds trivially because $L_{D,0} \le 1$ by definition. If we consider values of $\tilde{\beta}$ that cover this interval, then by a union bound we are still able to get a high-probability statement; the covering $C$ needs to have $|C| = \frac{l}{2}\big(m^{\frac{1}{2l}} - 1\big)$.

A.2 PROOFS OF OUT-OF-DISTRIBUTION PROBABILITY BOUNDS

A.2.1 PROOF OF THEOREM 4.1

Proof.
Because $u_1$ is chosen from the stationary distribution (uniform over vertices, because $G$ is connected and $d$-regular), for all $i \ge 1$ the pair $(u_i, u_{i+1})$ follows the distribution $\mathrm{Unif}[E]$, where $E$ is the edge set of the graph. Let $S$ be the sparsest-cut partition of $G$. Let $X_i = \mathbf{1}\{(u_i, u_{i+1}) \in E(S, \bar{S})\}$ be the indicator of the event that the $i$-th step crosses the partition. By linearity of expectation, this means that $\mathbb{E}[X_i] = |E(S,\bar{S})|/|E|$. Furthermore, let $Y_k = \sum_{i=1}^{k} X_i$ be the cumulative number of edges crossing the partition along the first $k$ steps of the random walk; thus $\mathbb{E}[Y_k] = k\,|E(S,\bar{S})|/|E|$. By Markov's inequality, $\Pr[Y_k \ge t\,\mathbb{E}[Y_k]] \le 1/t$; this is expressed nicely by setting $k = M$ and $t = 2$. We then use the following basic fact: if we have an inequality of the form $\Pr[Z \ge z] \le \frac{1}{2}$, then $\Pr[Z \ge z'] \le \frac{1}{2}$ for any $z' \ge z$. Let $E(S)$

A.2.2 PROOF OF THEOREM 4.2

Proof. The quantity $\varphi'$ is a transformation of $\varphi$ that retains all the information contained in $\varphi$ while still being orthogonal to the all-ones vector $\mathbf{1}$, so that we can apply Cheeger's inequality. This orthogonalization is rather standard and can be found in (Spielman, 2015). Let $s = |S|/|V(G)|$. Note that $s \in [0,1]$, and without loss of generality we can assume that $s \le 1/2$. We observe that the $v$-th coordinate of the vector $\varphi'$ corresponds to the mapping

$$\varphi'(v) = \begin{cases} 1 - s & v \in S, \\ -s & v \notin S. \end{cases}$$

This ensures that $\varphi'$ is orthogonal to $\mathbf{1}$, as

$$\varphi'^\top \mathbf{1} = \sum_{i=1}^{n} \varphi'(v_i) = |S|\Big(1 - \frac{|S|}{|V|}\Big) + (|V| - |S|)\Big(-\frac{|S|}{|V|}\Big) = |S| - |V|\frac{|S|}{|V|} = 0.$$

B.1 IN-DISTRIBUTION EXPERIMENTS

The datasets used are a combination of synthetic graphs (Erdos-Renyi and stochastic block model), real-world graphs (IMDBBINARY and IMDBMULTI, drawn from the Internet Movie Database, and COLLAB, a dataset of academic collaborations), and a bioinformatics dataset, PROTEINS, from (Yanardag & Vishwanathan, 2015). Two different GCN network depths of $l = 4$ and $l = 6$ were used. We use the following formulae for the generalization bound from (Liao et al., 2020) and for our new bound, using an explicit constant factor of 42 from (Liao et al., 2020). We remove an additive $O(\log m)$ term in the numerator within the square root after validating that it is numerically negligible. The tables below report the calculated bounds for the cases of 4 layers (Table 1) and 6 layers (Table 2).

B.2 OUT-OF-DISTRIBUTION EXPERIMENTS

Experiments were performed to measure the effectiveness of size generalization of GCN models in the size-generalization learning setting described in Section 4, where the learning task is classifying the most common node label in sub-communities of a large underlying network. For each of the synthetic graphs, we calculate an upper bound for $M$ as set in the out-of-distribution inequalities we have derived. Since the graphs examined are all not $d$-regular, we calculate a value of $\alpha$ as $\frac{\varphi^\top L \varphi}{\varphi^\top D \varphi}$, where $L$ is the graph Laplacian matrix and $D$ is the diagonal degree matrix, to apply to the formula set in Theorem 4.2. Furthermore, we use a more permissive value of $\delta = 0.75$.
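The $\alpha$ computation described above can be sketched as follows, reading $\varphi$ as the 0/1 indicator of the partition (the function name is ours); under that reading, $\varphi^\top L \varphi$ counts the cut edges and $\varphi^\top D \varphi$ is the volume of $S$:

```python
import numpy as np

# Sketch of alpha = (phi^T L phi) / (phi^T D phi), with L = D - A the
# combinatorial Laplacian, D the degree matrix, and phi the 0/1 indicator
# of the partition, so phi^T L phi = |E(S, S-bar)| and phi^T D phi = vol(S).

def rayleigh_alpha(A, S):
    """A: symmetric 0/1 adjacency matrix; S: boolean partition mask."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    phi = S.astype(float)
    return (phi @ L @ phi) / (phi @ D @ phi)

rng = np.random.default_rng(4)
n = 50
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T
S = np.arange(n) < n // 2
alpha = rayleigh_alpha(A, S)
print(alpha)   # lies in [0, 1] for an indicator phi
```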



Here L = D -1/2 (D -A)D -1/2 , where D is the diagonal matrix of vertex degrees and A is the adjacency matrix. It is important to note that this specific dependency of |E(S)| on d requires G to be a d-regular graph. If the theorem is to be expanded to more general cases, one may use the simple inequality |E(S)| ≥ |S|.



(a) An example of a small expander graph; no labelling of its nodes can exhibit homophily. (b) An example of a small barbell graph; a labelling that exactly separates the two groups exhibits homophily.

Figure 2: Log-generalization gap values for in-distribution experiments: (a) l = 4, Synthetic (b) l = 6, Synthetic (d) l = 4, Real-world (e) l = 6, Real-world. Accuracies for OOD experiments: (c) Synthetic (including both homophilic (Homo) and non-homophilic (NonHomo) graphs) (f) Real-world (Twitch data).

denote the set of edges connected to any vertex in $S$. Because $|E(S)| \le |E|$, we have $|E(S,\bar{S})|/|E| \le |E(S,\bar{S})|/|E(S)|$. Furthermore, since we assume a connected graph, $|E(S)| \ge (d/2)|S|$, and thus $|E(S,\bar{S})|/|E(S)| \le |E(S,\bar{S})|/[(d/2)|S|]$.² Thus, using the fact above, we can deduce

$$\Pr\Big[Y_M \ge 2M \frac{|E(S,\bar{S})|}{(d/2)|S|}\Big] \le \frac{1}{2}.$$

Note that $|E(S,\bar{S})|/|S|$ is the conductance $\phi(G)$ of the graph, because $S$ was defined to be the sparsest-cut partition of $G$. Thus we can apply the fact again, together with Cheeger's inequality $\phi(G) \le \sqrt{2\lambda_2}$, to get

$$\Pr\Big[Y_M \ge 2M \frac{2}{d}\sqrt{2\lambda_2}\Big] \le \frac{1}{2}.$$

And since we are interested in $\Pr[Y_M \ge 1]$, we can set $2M\frac{2}{d}\sqrt{2\lambda_2} \le 1$ to get a sufficient condition on $M$, from which we achieve $M \le \frac{d}{4\sqrt{2\lambda_2}}$.
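The conclusion of Theorem 4.1 can be sanity-checked by simulation. The sketch below uses a $d$-regular circulant graph and a half/half split as the cut; this split is a natural candidate but not a certified sparsest cut, so the experiment illustrates, rather than proves, the bound:

```python
import numpy as np

# Simulation check of Theorem 4.1 on a d-regular circulant graph: if
# M <= d / (4 * sqrt(2 * lambda_2)), a random walk of length M started from
# the stationary distribution crosses the cut with probability <= 1/2.

rng = np.random.default_rng(2)
n, shifts = 40, (1, 2)                    # circulant: each v ~ v +/- 1, v +/- 2
d = 2 * len(shifts)
nbrs = [[(v + s) % n for s in shifts] + [(v - s) % n for s in shifts]
        for v in range(n)]

A = np.zeros((n, n))
for v, ns in enumerate(nbrs):
    A[v, ns] = 1.0
lam = np.sort(np.linalg.eigvalsh(np.eye(n) - A / d))   # normalized Laplacian
lam2 = lam[1]
M = int(d / (4 * np.sqrt(2 * lam2)))                   # walk length from the bound

S = set(range(n // 2))                                 # candidate cut: half the cycle
crossed = 0
walks = 2000
for _ in range(walks):
    v = rng.integers(n)                                # stationary = uniform (d-regular)
    for _ in range(M):
        u = nbrs[v][rng.integers(d)]
        if (v in S) != (u in S):
            crossed += 1
            break
        v = u
print(M, crossed / walks)                              # crossing rate <= 1/2
```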

We then note that the squared norm $\|\varphi'\|_2^2 = \sum_v \varphi'(v)^2$ is equal to $s(1-s)|V|$, and we can infer $|S|/2 \le \|\varphi'\|_2^2 \le |S|$; the first inequality holds since $s \le 1/2$. The number of edges $|E(S,\bar{S})|$ crossing the labelling-partition is equal to $\varphi'^\top L \varphi'$, as

$$\varphi'^\top L \varphi' = \sum_{(u,v)\in E} \big((\varphi(u) - s) - (\varphi(v) - s)\big)^2 = |E(S,\bar{S})|,$$

where $L$ is the Laplacian matrix of $G$. Thus the quantity

$$2M \frac{|E(S,\bar{S})|}{|E(S)|} \le 2M \frac{\varphi'^\top L \varphi'}{|E(S)|} \le 2M \frac{\varphi'^\top L \varphi'}{(d/2)|S|}.$$

We are able to get the second inequality because we know $|E(S)| \ge (d/2)|S|$. Because we know that $|S| \ge \|\varphi'\|_2^2$, we can then upper-bound this further by

$$2M \frac{\varphi'^\top L \varphi'}{(d/2)\|\varphi'\|_2^2}.$$

Substituting this quantity into the proof of Theorem 4.1, we achieve the desired bound for $M$.
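The two identities used in this proof are easy to verify numerically on a random graph (the setup below is ours; $L = D - A$ is the combinatorial Laplacian):

```python
import numpy as np

# Check on a random graph: with phi'(v) = 1 - s for v in S and -s otherwise
# (s = |S|/|V|), phi' is orthogonal to the all-ones vector, and
# phi'^T L phi' counts the edges crossing the partition.

rng = np.random.default_rng(3)
n = 30
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T
L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian D - A

S = rng.random(n) < 0.4                  # arbitrary vertex subset
s = S.mean()
phi = np.where(S, 1.0 - s, -s)

crossing = A[np.ix_(S, ~S)].sum()        # |E(S, S-bar)|, each edge counted once
assert abs(phi @ np.ones(n)) < 1e-9      # orthogonal to the all-ones vector
assert abs(phi @ L @ phi - crossing) < 1e-9
print(int(crossing))
```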

The formula used for the generalization bound of (Liao et al., 2020) is

$$\mathrm{GenGap}(B, d, l, \{W_i\}_{i=1}^{l}) = \sqrt{\frac{42 \cdot B^2 d^{l-1} l^2 \ln(4lh) \prod_{i=1}^{l}\|W_i\|_2^2 \sum_{i=1}^{l}\frac{\|W_i\|_F^2}{\|W_i\|_2^2}}{\gamma^2 m}},$$

and the formula used for the new PAC-Bayes generalization bound is

$$\mathrm{GenGap}(B, d, l, \{W_i\}_{i=1}^{l}) = \sqrt{\frac{42 \cdot B^2 d\, l^2 (h + \ln l) \prod_{i=1}^{l}\|W_i\|_2^2 \sum_{i=1}^{l}\frac{\|W_i\|_F^2}{\|W_i\|_2^2}}{\gamma^2 m}}.$$
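The two formulas can be transcribed directly as code. The helper names and argument layout below are ours; `spec2[i]` and `frob2[i]` stand for $\|W_i\|_2^2$ and $\|W_i\|_F^2$, $\gamma$ is the margin, and $m$ the number of training graphs:

```python
import numpy as np

# The two generalization-gap formulas as code (variable names are ours).

def gengap_liao(B, d, l, h, spec2, frob2, gamma, m):
    # Bound of (Liao et al., 2020): degree factor d^(l-1) * ln(4*l*h).
    term = 42.0 * B**2 * d**(l - 1) * l**2 * np.log(4 * l * h)
    term *= np.prod(spec2) * np.sum(frob2 / spec2)
    return np.sqrt(term / (gamma**2 * m))

def gengap_ours(B, d, l, h, spec2, frob2, gamma, m):
    # Our bound: the degree enters linearly, as d * (h + ln(l)).
    term = 42.0 * B**2 * d * l**2 * (h + np.log(l))
    term *= np.prod(spec2) * np.sum(frob2 / spec2)
    return np.sqrt(term / (gamma**2 * m))

# Illustrative inputs (not taken from the experiments).
spec2 = np.array([2.0, 3.0, 2.5, 1.5])   # ||W_i||_2^2 per layer
frob2 = 4.0 * spec2                      # ||W_i||_F^2 per layer
print(gengap_liao(1.0, 8, 4, 32, spec2, frob2, 0.1, 1000),
      gengap_ours(1.0, 8, 4, 32, spec2, frob2, 0.1, 1000))
```

For any moderately large degree $d$, the second value is smaller, reflecting the exponential-to-linear improvement in the degree dependency.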

Suppose we wish to examine under what conditions we can ensure that we do not cross over the partition at all in $M$ steps, i.e. $\Pr[Y_M \ge 1] \le 1/2$. From the inequality above, we are able to get the following.

Dataset     | Bound of (Liao et al., 2020) | Our bound
            | 15.503044 ± 0.066            | 11.892126 ± 0.066
IMDBBINARY  | 17.370839 ± 0.079            | 12.458184 ± 0.079
IMDBMULTI   | 16.466189 ± 0.038            | 11.977553 ± 0.038
COLLAB      | 19.773157 ± 0.009            | 13.574678 ± 0.009
PROTEINS    | 14.011104 ± 0.079            | 10.753008 ± 0.079

Table of generalization bounds, 4 layers (log values)

Table of generalization bounds, 6 layers (log values)


Similar upper bounds for $M$ were computed for the real-world cases, but the values were too small for experimental use. In this case, we simply set $N = 10$ and $M = 50$ to gain insight into the general feasibility of the size generalization task in real-world cases. All experiments used the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.01. Models were trained for 10 epochs, with randomly selected batches of size 32. The models used are different parameterizations of the graph convolutional network as implemented by the library pytorch-geometric (Fey & Lenssen, 2019). For the synthetic experiments, which used smaller graphs with generally smaller degree, the parameterization was 3 layers with a hidden dimension of 5; for the real-world data, the parameterization was 10 layers with a hidden dimension of 32. For each underlying graph, we generate three train/validation sets (of $N$ random walks each) and test sets (of $M$ random walks each), and we record the loss and accuracy as the average over the three runs.

B.2.2 SYNTHETIC GRAPH EXPERIMENTS

A large underlying synthetic graph was generated using the stochastic block model, with some adjustment to ensure that the randomly generated graph had a single connected component. By controlling the intra- and inter-block connection probabilities, we are able to control the homophily of the generated graph, which we validate by measuring the value of $\lambda_2$, as well as by calculating the sparsest cut via "Cheeger rounding" (Spielman, 2015) and subsequently the conductance of the graph with respect to this partition. In the experiments, we generated a graph with approximately 2000 nodes, with in-block connection probability set to $8/1000$ and inter-block connection probability set to $6/10^5$. Node features are generated from a mixture of multivariate Gaussian distributions of dimension 3, with mean $(-0.5, -0.5, -0.5)$ for one block and mean $(0.5, 0.5, 0.5)$ for the other; the covariance matrix is diagonal (each coordinate is independent), with variance either 2, 4, or 8. Experiments were also performed on non-homophilic synthetic graphs. Like the homophilic synthetic graphs, they are generated with the stochastic block model, with about 2000 nodes, about 1000 of each label, and the same mixture-of-Gaussians node features. However, the connection parameters are crucially different: the probabilities of connection between nodes of the same block and nodes of different blocks are set to be equal, both being $8/1000$. These settings ensure that a node's label is independent of the labels of its neighbors, so the homophily property is not exhibited. Contrasting with the results shown for the homophilic synthetic graphs, the non-homophilic graph results show that the out-of-distribution test accuracy is less than the training accuracy. This further illustrates the association between homophily and size generalization.
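The two-block generator described above can be sketched as follows (function name and structure are ours; connectivity adjustments for a single connected component are omitted). Setting `p_in` well above `p_out` gives a homophilic graph, while `p_in == p_out` removes the label/neighbor dependence entirely:

```python
import numpy as np

# Minimal sketch of a two-block stochastic block model. p_in and p_out are
# the intra- and inter-block connection probabilities that control homophily.

def sbm_two_blocks(n, p_in, p_out, rng):
    labels = np.arange(n) < n // 2                   # block membership
    same = labels[:, None] == labels[None, :]
    P = np.where(same, p_in, p_out)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1); A = A + A.T                   # symmetrize, no self-loops
    return A, labels

rng = np.random.default_rng(5)
A, labels = sbm_two_blocks(2000, 8 / 1000, 6 / 10**5, rng)
same = labels[:, None] == labels[None, :]
intra = A[same].sum() / 2                            # intra-block edge count
inter = A[~same].sum() / 2                           # inter-block edge count
print(intra, inter)                                  # homophilic: intra >> inter
```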

B.2.3 REAL-WORLD GRAPH EXPERIMENTS

Since the node features are integer indicators, we encoded the node feature information using the positional encoding mechanism introduced in the Transformer model (Vaswani et al., 2017). For each node, each of its integer indicators was encoded via the positional embedding, and the resulting embeddings were aggregated via summation.
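The encoding step can be sketched as follows, using the standard sinusoidal formula of Vaswani et al. (2017); the embedding dimension and helper names are our own illustrative choices:

```python
import numpy as np

# Sketch of the node-feature encoding described above: each integer indicator
# is mapped through the sinusoidal positional encoding and the embeddings are
# summed per node.

def positional_encoding(pos, dim):
    """Sinusoidal encoding of a single integer position."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / dim))
    enc = np.empty(dim)
    enc[0::2] = np.sin(pos * freqs)   # even coordinates: sine
    enc[1::2] = np.cos(pos * freqs)   # odd coordinates: cosine
    return enc

def encode_node(indicators, dim=32):
    """Sum of positional encodings of a node's integer indicators."""
    return np.sum([positional_encoding(p, dim) for p in indicators], axis=0)

x = encode_node([3, 17, 42])
print(x.shape)   # (32,)
```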

