ON SIZE GENERALIZATION IN GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) can process graphs of different sizes, but their capacity to generalize across sizes is still not well understood. Size generalization is key to numerous GNN applications, from solving combinatorial optimization problems to learning in molecular biology. In such problems, obtaining labels and training on large graphs can be prohibitively expensive, but training on smaller graphs is possible. This paper puts forward the size-generalization question and characterizes important aspects of that problem theoretically and empirically. We prove that even for very simple tasks, such as counting the number of nodes or edges in a graph, GNNs do not naturally generalize to graphs of larger size. Instead, their generalization performance is closely related to the distribution of local patterns of connectivity and features, and to how that distribution changes from small to large graphs. Specifically, we prove that for many tasks, there are weight assignments for GNNs that perfectly solve the task on small graphs but fail on large graphs, if there is a discrepancy between their local patterns. We further demonstrate, on several tasks, that training GNNs on small graphs results in solutions that do not generalize to larger graphs. We then formalize size generalization as a domain adaptation problem and describe two learning setups where size generalization can be improved: first, as a self-supervised learning (SSL) problem over the target domain of large graphs; second, as a semi-supervised learning problem when a few samples are available in the target domain. We demonstrate the efficacy of these solutions on a diverse set of benchmark graph datasets.

1. INTRODUCTION

Graphs are a flexible representation, widely used for representing diverse data and phenomena. Graph neural networks (GNNs), deep models that operate over graphs, have emerged as a prominent learning model (Bruna et al., 2013; Kipf and Welling, 2016; Veličković et al., 2017). They are used in the natural sciences (Gilmer et al., 2017), in social network analysis (Fan et al., 2019), for solving difficult mathematical problems (Luz et al., 2020), and for approximating solutions to combinatorial optimization problems (Li et al., 2018). In many domains, graph data vary significantly in size. This is the case in molecular biology, where molecules, represented as graphs with atoms as nodes, span from small compounds to proteins with many thousands of nodes. It is even more severe in social networks, which can reach billions of nodes. The success of GNNs for such data stems from the fact that the same GNN model can process input graphs regardless of their size. Indeed, it has been proposed that GNNs can generalize to graphs whose size is different from what they were trained on, but it is largely unknown in which problems such generalization occurs. Empirically, several papers report good generalization performance on specific tasks (Li et al., 2018; Luz et al., 2020). Other papers, like Veličković et al. (2019), show that size generalization can fail on several simple graph algorithms and can be improved by using task-specific training procedures and architectures. Given their flexibility to operate on variable-sized graphs, a fundamental question arises about generalization in GNNs: "When do GNNs trained on small graphs generalize to large graphs?" Aside from being an intriguing theoretical question, this problem has important practical implications. In many domains, it is hard to label large graphs. For instance, in combinatorial optimization problems, labeling a large graph boils down to solving a large and hard optimization problem.
In other domains, it is often very hard for human raters to correctly label complex networks. One approach to this problem could have been to resize graphs to a homogeneous size. This is the strategy taken in computer vision, where it is well understood how to resize an image while keeping its content. Unfortunately, there are no effective resizing procedures for graphs. It would therefore be extremely valuable to develop techniques that can generalize from training on small graphs. As we discuss below, a theoretical analysis of size generalization is very challenging because it depends on several different factors, including the task, the architecture, and the data. Regarding tasks, we argue that it is important to distinguish two types: local and global. Local tasks can be solved by GNNs whose depth does not depend on the size of the input graph, for example, finding a constant-size pattern. Global tasks require that the depth of the GNN grow with the size of the input graph, for example, calculating the diameter of a graph. While a few previous works explore depth-dependent GNNs (e.g., Tang et al. (2020)), constant-depth GNNs are by far the most widely used GNN models today and are therefore the focus of this paper. We study the ability of constant-depth instances of the most expressive message-passing neural networks (Xu et al., 2018; Morris et al., 2019) to generalize to unseen sizes. Our key observation is that generalization to graphs of different sizes is strongly related to the distribution of patterns around nodes in the graphs of interest. These patterns, dubbed d-patterns (where d is the radius of the local neighborhood), describe the local feature-connectivity structure around each node, as seen by message-passing neural networks, and are defined in Section 3. We study the role of d-patterns both empirically and theoretically.
First, we theoretically show that when there is a significant discrepancy between the d-pattern distributions, GNNs have multiple global minima for graphs of a specific size range, out of which only a subset of models generalizes well to larger graphs. We complement our theoretical analysis with an experimental study and show that GNNs tend to converge to non-generalizing global minima when d-patterns from the large-graph distribution are not well represented in the small-graph distribution. Furthermore, we demonstrate that the size generalization problem is accentuated in deeper GNNs. Following these observations, in the final part of this paper, we discuss two learning setups that help improve size generalization by formulating the learning problem as a domain adaptation problem: (1) training the GNN on self-supervised tasks aimed at learning the d-pattern distribution of both the target (large graphs) and source (small graphs) domains; we also propose a novel SSL task that addresses overfitting of d-patterns. (2) A semi-supervised learning setup with a limited number of labeled examples from the target domain. The idea behind both setups is to promote convergence of GNNs to local or global minima with good size generalization properties. We show that both setups are useful in a series of experiments on synthetic and real data. To summarize, this paper makes the following contributions: (1) we identify a size generalization problem when learning local tasks with GNNs and analyze it empirically and theoretically; (2) we link the size-generalization problem with the distribution of d-patterns and suggest approaching it as a domain adaptation problem; (3) we empirically show how several learning setups help improve size generalization.

2. RELATED WORK

Size generalization in set and graph learning. Several papers observed successful generalization across graph sizes, but the underlying reasons were not investigated (Li et al., 2018; Maron et al., 2018; Luz et al., 2020). More recently, Veličković et al. (2019) showed that training GNNs to perform simple graph algorithms step by step helps them generalize better to graphs of different sizes. Unfortunately, such training procedures cannot be easily applied to general tasks. Knyazev et al. (2019) studied the relationship between generalization and attention mechanisms. Tang et al. (2020) observed two issues that can harm generalization: (1) there are tasks for which a constant number of layers is not sufficient; (2) some graph learning tasks are homogeneous functions. They then suggest a new GNN architecture to deal with these issues. Our work is complementary to these works, as it explores another fundamental size generalization problem, focusing on constant-depth GNNs. For more details on the distinction between constant-depth and variable-depth tasks see Appendix A. Several works also studied size generalization and expressivity when learning over set-structured inputs (Zweig and Bruna, 2020; Bueno and Hylton, 2020). On the more practical side, Joshi et al. (2019; 2020) study the combinatorial traveling salesman problem and whether it is possible to generalize to larger sizes. Corso et al. (2020) study several multitask learning problems on graphs and evaluate how performance changes as the size of the graphs changes. Expressivity and generalization in graph neural networks. Xu et al. (2018) and Morris et al. (2019) established a fundamental connection between message-passing neural networks and the Weisfeiler-Leman (WL) graph-isomorphism test. We use similar arguments to show that GNNs have enough expressive power to solve a task on a set of small graphs while failing on a set of large graphs.
Several works studied generalization bounds for certain classes of GNNs (Garg et al., 2020; Puny et al., 2020; Verma and Zhang, 2019), but did not discuss size generalization. Sinha et al. (2020) proposed a benchmark for assessing the logical generalization abilities of GNNs.

3. THE SIZE GENERALIZATION PROBLEM

We now present the main problem discussed in the paper: what determines whether a GNN generalizes well to graph sizes not seen during training. We start with a simple motivating example showing the problem on single-layer GNNs. We then show that the question of size generalization actually depends on d-patterns, the local patterns of connectivity and features of the graphs, and not only on their actual size. Setup. We are given two distributions over graphs, $P_1$ and $P_2$, that contain small and large graphs respectively, and a task that can be solved with 0 error for all graph sizes using a constant-depth GNN. We train a GNN on a training set $S$ sampled i.i.d. from $P_1$ and study its performance on $P_2$. GNN model. We focus on the first-order GNN (1-GNN) architecture from Morris et al. (2019), defined in the following way: $h_v^{(t)} = \sigma\big(W_2^{(t)} h_v^{(t-1)} + \sum_{u \in N(v)} W_1^{(t)} h_u^{(t-1)} + b^{(t)}\big)$. Here, $h_v^{(t)}$ is the feature vector of node $v$ after $t$ layers, $W_1^{(t)}, W_2^{(t)} \in \mathbb{R}^{d_{t-1} \times d_t}$ and $b^{(t)} \in \mathbb{R}^{d_t}$ denote the parameters of the $t$-th layer of the GNN, and $\sigma$ is some non-linear activation (e.g., ReLU). It was shown in Morris et al. (2019) that GNNs composed of these layers have maximal expressive power with respect to all message-passing neural networks. In the experimental section we also experiment with the Graph Isomorphism Network (GIN) (Xu et al., 2018). For further details on GNNs see Appendix A. In this work we use the most expressive GNN variants, which use the "sum" aggregation function. Using "max" or "mean" reduces the expressive power of the network, making it not powerful enough to solve simple counting problems (e.g., counting edges or computing node degrees). On the other hand, these networks give rise to slightly different definitions of patterns and can generalize better in some cases, as shown in Veličković et al. (2019), yet they still suffer from size overfitting. Exploring these networks is beyond the scope of this work.
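As a concrete illustration of the 1-GNN layer above, here is a minimal NumPy sketch (not the paper's implementation; the function name and the dense-adjacency representation are our own choices):

```python
import numpy as np

def gnn_layer(H, A, W1, W2, b):
    """One 1-GNN layer with "sum" aggregation, in the style of Morris et al. (2019):
    each node combines its own features (via W2) with the sum of its neighbors'
    features (via W1), then applies a ReLU non-linearity.
    H: (n, d_in) node features; A: (n, n) symmetric 0/1 adjacency matrix."""
    return np.maximum(0.0, H @ W2 + (A @ H) @ W1 + b)
```

With all-ones scalar features, setting W1 = 1, W2 = 0, b = 0 makes a single layer output each node's degree, the kind of simple counting task that "mean" or "max" aggregation cannot express.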

3.1. SIZE GENERALIZATION IN SINGLE-LAYER GNNS

We start our discussion of size generalization with a theoretical analysis of a simple setup. We consider a single-layer GNN and an easy task and show that: (1) the training objective has many different solutions, but only a small subset of these solutions generalizes to larger graphs; (2) simple regularization techniques cannot mitigate the problem. This subsection serves as a warm-up for the next subsections, which contain our main results. Assume we train on a distribution of graphs with a fixed number of nodes $n$ and a fixed number of edges $m$. Our goal is to predict the number of edges in the graph using a 1-GNN with a single linear layer and an additive readout function; for simplicity we also consider the squared loss. The objective boils down to the following function for any graph $G$ in the training set: $L(w_1, w_2, b; G) = \big(\sum_{u \in V(G)} \big(w_1 \cdot x_u + \sum_{v \in N(u)} w_2 \cdot x_v + b\big) - y\big)^2$. Here, $G$ is an input graph, $V(G)$ are the nodes of $G$, $N(u)$ are the neighbors of node $u$, $w_1$, $w_2$ and $b$ are the trainable parameters, $y$ is the target ($m$ in this case), and $x_v$ is the node feature of node $v$. Further, assume that we have no additional information on the nodes, so we embed each node as a one-dimensional feature vector with a fixed value of 1. In this simple case, the trainable parameters are also one-dimensional. We note that the training objective can also be written in the form $L(w_1, w_2, b; G) = (nw_1 + 2mw_2 + nb - m)^2$, and one can easily find its solution space, which is an affine subspace defined by $w_2 = \frac{m - n(w_1 + b)}{2m}$. In particular, the solutions with $b + w_1 = 0$, $w_2 = 1/2$ are the only ones that do not depend on the specific training-set graph size $n$, and they generalize to graphs of any size. It can be readily seen that when training the model on graphs of fixed size (fixed $m$, $n$), gradient descent has no reason to favor one solution over another, and we will not be able to generalize. We also note that the generalizing solution is not always the least-norm solution (with respect to both $L_1$ and $L_2$ norms), so simple regularization will not help here. On the other hand, it is easy to show that training on graphs with different numbers of edges will favor the generalizing solution. As we will see next, the problem gets worse when considering GNNs with multiple non-linear layers, where this simple remedy will not help: we can train deeper GNNs on a wide variety of sizes and the solution will still not generalize to other sizes.
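The warm-up analysis above can be verified numerically. The sketch below (our own illustration, not the paper's code) computes the single-linear-layer prediction $n w_1 + 2m w_2 + nb$ with sum readout, and contrasts a generalizing solution with one fitted only to the training size:

```python
import numpy as np

def predict_edges(A, w1, w2, b):
    """Single linear 1-GNN layer + sum readout on all-ones node features.
    Equals n*w1 + 2*m*w2 + n*b for a graph with n nodes and m edges."""
    x = np.ones(A.shape[0])
    return float(np.sum(w1 * x + w2 * (A @ x) + b))

triangle = np.ones((3, 3)) - np.eye(3)                     # n = 3, m = 3
path5 = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)   # n = 5, m = 4

# Generalizing solution: w1 + b = 0, w2 = 1/2 -> predicts m for any graph.
# Overfit solution fitted to (n=3, m=3): w1 = 1, w2 = 0, b = 0 -> predicts n.
```

On the triangle both solutions output 3, but on the 5-node path the overfit solution outputs 5 instead of 4.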

3.2. d-PATTERNS

We wish to understand theoretically when a GNN that was trained on graphs with a small number of nodes can generalize to graphs with a large number of nodes. To answer this question, we first analyze what information is received by each node in the graph from its neighboring nodes after the graph is processed by a GNN with $T$ layers. It is easy to see that every node can receive information about neighbors that are at most $T$ hops away. We also know that nodes do not have full information about their order-$T$ environment. For example, GNNs cannot determine whether a triangle is present in the neighborhood of a given node (Chen et al., 2020). In order to characterize the exact information that can be found in each node after a $T$-layer GNN, we use the iteration structure of the WL test, which has the same representational power as GNNs (see Xu et al. (2018)). Let $G = (V, E)$ be a graph with maximal degree $N$ and a node feature $c_v \in C$ for every node $v \in V$. We define the d-pattern of a node $v \in V$ for $d \geq 0$ recursively: for $d = 0$, its 0-pattern is $c_v$. For $d > 0$ we say that $v$ has d-pattern $p = (p_v, \{(p_{i_1}, m_{p_{i_1}}), \ldots, (p_{i_\ell}, m_{p_{i_\ell}})\})$ iff node $v$ has $(d-1)$-pattern $p_v$ and for every $j \in \{1, \ldots, \ell\}$ the number of neighbors of $v$ with $(d-1)$-pattern $p_{i_j}$ is exactly $m_{p_{i_j}}$. The d-pattern of a node is thus an encoding of the $(d-1)$-patterns of itself and of its neighbors. For example, assume a graph has a maximal degree of $N$ and all nodes start with the same node feature. The 1-pattern of each node is its degree. The 2-pattern of each node is, for each possible degree $i \in \{1, \ldots, N\}$, the number of neighbors with degree $i$. In the same manner, the 3-pattern of a node is, for each possible 2-pattern, the number of its neighbors with that exact 2-pattern. The definition of d-patterns can naturally be extended to the case of unbounded degrees.
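The recursive definition above is exactly a color-refinement procedure and can be sketched in a few lines of Python (our own illustrative encoding of patterns as nested tuples):

```python
from collections import Counter

def d_patterns(adj, colors, d):
    """Compute the d-pattern of every node by WL-style refinement.
    adj: dict node -> list of neighbors; colors: dict node -> 0-pattern.
    The d-pattern of v pairs its (d-1)-pattern with the multiset, encoded as
    sorted (pattern, count) tuples, of its neighbors' (d-1)-patterns."""
    pats = dict(colors)  # d = 0: the node features themselves
    for _ in range(d):
        pats = {v: (pats[v],
                    tuple(sorted(Counter(pats[u] for u in adj[v]).items(),
                                 key=repr)))
                for v in adj}
    return pats
```

For a path a-b-c with identical node features, the two endpoints share the same 1-pattern (degree 1) while the middle node does not, matching the degree example above.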
We have the following theorem, which connects d-patterns with the expressive power of GNNs: Theorem 3.2. Any function that can be represented by a d-layer GNN is constant on d-patterns. In particular, the theorem shows that for any two graphs (of any size) and two nodes, one in each graph, if the nodes have the exact same d-pattern, then any d-layer GNN will output the same result for the two nodes. The full proof can be found in Appendix B, and follows directly from the analogy between the WL algorithm (see Appendix A) and d-patterns. Thm. 3.2 implies that d-patterns do not represent more expressive power than GNNs. In the next subsection, we prove that GNNs can exactly compute d-patterns, and show that this capacity is tightly related to size generalization. It is also easy to see from the definition of d-patterns and the proof of Theorem 2 from Morris et al. (2019) that d-patterns exactly represent the expressive power of GNNs (with additive aggregation); thus this definition is a natural tool for studying properties of GNNs such as size generalization.

3.3. GNNS MAY OVERFIT d-PATTERNS

We can now connect the size generalization problem to the concept of d-patterns. We start with an example: consider a node prediction task, in which an output is specified for each node in an input graph, that is solvable by a d-layer GNN. To perfectly solve this task, the model should produce the correct output for the d-pattern of every node in the training set. Testing this GNN on a different set of graphs will succeed if the test set has graphs with d-patterns similar to those in the training set. Note that this requirement is not related to the size of the graphs but to the distribution of the d-patterns of the nodes in the test set. In the following theorem we show rigorously that, given a set of d-patterns and an output for each such pattern, there is an assignment of weights to a GNN with $O(d)$ layers that perfectly fits the output for each pattern. We will then use this theorem to show that, under certain assumptions on the distribution of d-patterns of the large graphs, GNNs can perfectly solve a task on a set of small graphs and completely fail on a set of large graphs. In other words, we show that there are multiple global minima of the training objective that do not generalize to larger graphs. Theorem 3.3. Let $C$ be a finite set of node features, let $P$ be a finite set of d-patterns on graphs with maximal degree $N \in \mathbb{N}$, and for each pattern $p \in P$ let $y_p \in [-1, 1]$ be some target label. Then there exists a 1-GNN $F$ with $d + 2$ layers, width bounded by $\max\big((N + 1)^d \cdot |C|,\ 2|P|\big)$, and ReLU activation, such that for every graph $G$ with nodes $v_1, \ldots, v_n$, corresponding d-patterns $\{p_1, \ldots, p_n\} \subseteq P$, and node features from $C$, the output of $F$ on node $v_i$ is exactly $y_{p_i}$. The full proof is in Appendix B. Note that the width required by the theorem is not very large if $d$, the depth of the 1-GNN, is small.
In practice, shallow GNNs are very commonly used and have proven empirically successful, while training deep GNNs was shown to be hard due to problems like over-smoothing (Zhao and Akoglu, 2019). Using the above theorem we can claim that there are weight assignments for GNNs that cannot "size-generalize": given a specific task, the GNN succeeds on the task for small graphs (up to some bound) and fails on larger graphs, as long as there is a notable discrepancy between their d-pattern distributions: Corollary 3.4. Let $P_1$ and $P_2$ be distributions of small and large graphs, respectively, with finite support, and let $P_1^{d\text{-pat}}$ be the distribution of d-patterns over small graphs and similarly $P_2^{d\text{-pat}}$ for large graphs. For any node prediction task which is solvable by a 1-GNN with depth $d$, and any $\epsilon > 0$, there exists a 1-GNN with depth at most $d + 2$ that has 0-1 loss smaller than $\epsilon$ on $P_1$ and 0-1 loss $\Delta(\epsilon)$ on $P_2$, where $\Delta(\epsilon) = \max_{A : P_1^{d\text{-pat}}(A) < \epsilon} P_2^{d\text{-pat}}(A)$. Here, $A$ is a set of d-patterns and $P(A)$ is the total probability mass of that set under $P$. Intuitively, a large $\Delta$ means that there exists a set of d-patterns that has low probability under the small-graph distribution and high probability under the large-graph distribution. Corollary 3.4 implies that the major factor in the success of a GNN in generalizing to larger graphs is not the graph size, but the distribution of d-patterns. Different distributions of d-patterns lead to a large $\Delta$ and thus to bad generalization to larger graphs. On the other hand, from Thm. 3.2 we immediately get that similar distributions of d-patterns imply that every GNN model that succeeds on small graphs will also succeed on large graphs, since GNNs are constant on d-patterns: Corollary 3.5. In the setting of Corollary 3.4, assume additionally that all patterns with positive probability under $P_2^{d\text{-pat}}$ also have positive probability under $P_1^{d\text{-pat}}$. Then, for any node prediction task solvable by a depth-$d$ GNN, any 1-GNN that has 0 loss (w.r.t. the 0-1 loss) on $P_1$ will also have 0 loss on $P_2$. Examples. Corollary 3.4 shows that GNNs may fail even on simple tasks; here are two examples. (i) Consider the task of calculating the node degree. By Corollary 3.4, there is a GNN that successfully outputs the degree of nodes with degree up to $N$ and fails on nodes with larger degrees. Note that this problem can easily be solved for any node degree by a 1-layer GNN. (ii) Consider some node regression task where the training set consists of graphs sampled i.i.d. from an Erdős-Rényi model $G(n, p)$ and the test set contains graphs sampled i.i.d. from $G(2n, p)$. In this case, a GNN will be trained on graphs with average degree $np$, while the test set contains graphs with average degree $2np$. This means that the d-patterns in the train and test sets are very different, and by Corollary 3.4 the GNN may overfit. Graph prediction tasks. Our theoretical results discuss node prediction tasks. We note that they are also relevant for graph prediction tasks, where there is a single output for each input graph. The reason is that to solve a graph prediction task, a GNN first calculates node features and then pools them into a single global graph feature. Our analysis shows that the first part of the GNN, which is responsible for calculating the node features, might not generalize to large graphs. As a result, the GNN will generate an uninformative global graph feature and fail on the original graph prediction task. In the experimental sections, we show that the size generalization problem is indeed relevant for both node and graph prediction tasks. Here is a formal statement regarding graph prediction tasks; the full proof can be found in Appendix B. Corollary 3.6.
Let $P_1$ and $P_2$ be distributions of small and large graphs, respectively, with finite support. Let $P_1^{d\text{-pat}}$ be the distribution of d-patterns over small graphs and similarly $P_2^{d\text{-pat}}$ for large graphs, and assume that the supports of $P_1^{d\text{-pat}}$ and $P_2^{d\text{-pat}}$ are disjoint (see Appendix B for the full statement and proof). GNNs can be as powerful as the WL test. Here, we show that this expressive power can cause negative effects when there is a discrepancy between the training and test sets.
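Example (ii) can be checked numerically: sampling Erdős-Rényi graphs shows that the 1-patterns (degrees) shift when $n$ doubles at fixed $p$, but stay put when $p$ is rescaled so that $np$ is constant (a small self-contained sketch, not from the paper):

```python
import random

def er_degrees(n, p, seed=0):
    """Sample one G(n, p) graph and return its degree sequence
    (the 1-patterns when all node features are identical)."""
    rng = random.Random(seed)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each edge appears independently with prob. p
                deg[i] += 1
                deg[j] += 1
    return deg

mean = lambda xs: sum(xs) / len(xs)
d_train = er_degrees(50, 0.3, seed=1)    # average degree ~ 49 * 0.3
d_shift = er_degrees(100, 0.3, seed=2)   # average degree roughly doubles
d_match = er_degrees(100, 0.15, seed=3)  # rescaled p: similar average degree
```

The shifted test distribution has roughly twice the average degree of the training one, while the rescaled distribution matches it closely.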

3.4. EMPIRICAL VALIDATION

In the previous subsection we showed that for any node task, and any two datasets of graphs of different sizes that differ significantly in their d-pattern distributions, there is a 1-GNN that successfully solves the task on one dataset but fails on the other. In this subsection, we show empirically that reaching these "overfitting" GNNs is actually very common. Specifically, the size-overfit phenomenon is prevalent when the d-patterns in the large-graph distribution are not found in the small-graph distribution. We also show that GNNs can generalize to larger graphs if the distribution of d-patterns remains similar to the distribution of patterns in the small graphs. To show this, we use a controlled regression task in a student-teacher setting. In this setting, we sample a "teacher" GNN with random weights, freeze the network, and label each graph in the dataset using the output of the "teacher" network. Our goal is to train a "student" network, which has the same architecture as the "teacher" network, to fit the labels of the teacher network. The advantages of this setting are two-fold: (1) a solution is guaranteed to exist: we know that there is a weight assignment of the student network that perfectly solves the task for graphs of any size; (2) generality: it includes all tasks solvable by constant-depth GNNs. We discuss more settings below. Architecture and training protocol. We use the 1-GNN as defined in Morris et al. (2019). The number of GNN layers is either 1, 2, or 3; the width of the teacher network is 32, and that of the student network 64, providing more expressive power to the student network. We obtained similar results when testing with a width of 32, the same as the teacher network. We use a summation readout function followed by a two-layer fully connected suffix. We use ADAM with learning rate $10^{-3}$.
We performed a hyper-parameter search on the learning rate and weight decay, and used validation-based early stopping on the source domain (small graphs). The results are averaged over 10 random seeds. All runs used PyTorch Geometric (Fey and Lenssen, 2019) on an NVIDIA DGX-1. Results. Fig. 1 compares the loss of GNNs as the distribution of d-patterns changes, for the task of teacher-student graph-level regression. The model was trained on graphs generated from the $G(n, p)$ model. We show the normalized $L_2$ loss on the test set, where the output is normalized by the average test-set (target) output. The left panel shows the test loss when training on $n \in [40, 50]$ and $p = 0.3$ and testing on $G(n, p)$ graphs with $n = 100$ and $p$ varying from 0.05 to 0.5. In this experiment, the expected node degree is $np$; hence the distribution of d-patterns is most similar to the one observed in the training set when $p = 0.15$. Indeed, this is the value of $p$ at which the test loss is minimized. The right panel is discussed in the caption. These results are consistent with Corollary 3.4: when the distributions of d-patterns are far apart the model is not able to generalize well, and it does generalize well when these distributions are similar. To further confirm the effect of the local distributions on the generalization capabilities of GNNs, we conducted the following two experiments: (1) we used the teacher-student setup with a 3-layer GNN on graphs of sizes drawn uniformly from $n = 40, \ldots, 50$ and sampled from $G(n, 0.3)$. We tested on graphs sampled from $G(N, 0.3)$ with $N$ varying from 50 up to 150. It is evident that as the graph size in the test set increases, the model performs worse. (2) We ran the same test as in (1), but this time normalized $p$ on the test set so that $p \cdot N = 15$, which is the approximate ratio in the training set. Here we went even further, up to sizes $N = 250$.
In this experiment, the GNN successfully generalized to larger graphs, since the local distributions of the train and test sets are indeed very similar; see Fig. 2. We also tested the tasks of finding the max clique in a graph and calculating the number of edges, the node prediction task of calculating the node degree, and the student-teacher task at the node level. In addition, we tested the popular GIN architecture (Xu et al., 2018) and show that the size generalization problem occurs there as well. We also tested ReLU, Tanh, and sigmoid activations. See additional experiments in Appendix C.
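The teacher-labeling procedure of the student-teacher setup can be sketched as follows (a simplified NumPy illustration with our own function names; the actual experiments use PyTorch Geometric):

```python
import numpy as np

def random_teacher(widths, seed=0):
    """A frozen random 'teacher' 1-GNN: one (W1, W2, b) triple per layer."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((a, b)), rng.standard_normal((a, b)),
             rng.standard_normal(b))
            for a, b in zip(widths[:-1], widths[1:])]

def teacher_output(A, H, params):
    """Label a graph with the frozen teacher: 1-GNN layers with 'sum'
    aggregation followed by a summation readout over all nodes."""
    for W1, W2, b in params:
        H = np.maximum(0.0, H @ W2 + (A @ H) @ W1 + b)
    return H.sum(axis=0)
```

Because the layers aggregate over neighbors and the readout sums over nodes, the teacher's label is invariant to node relabeling, so the student's regression target is well defined on graphs of any size.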

4. TOWARDS IMPROVING SIZE-GENERALIZATION

The results from the previous section show that the problem of size generalization is related not only to the size of the graph in terms of the number of nodes or edges, but to the distribution of d-patterns induced by the distributions from which the graphs are sampled. Based on this observation, we now formulate the size generalization problem as a domain adaptation (DA) problem. We then build on techniques from domain adaptation and suggest two approaches to improve size generalization: (1) self-supervised learning on the target domain (large graphs) and (2) semi-supervised learning with a few labeled target samples. We consider the DA setting where we are given two distributions over graphs: a source distribution $D_S$ (say, of small graphs) and a target distribution $D_T$ (say, of large graphs). We consider two settings. First, the unlabeled DA setting, where we have access to labeled samples from the source $D_S$ but the target data from $D_T$ is unlabeled. Our goal is to infer labels on a test dataset sampled from the target $D_T$. Second, we consider a semi-supervised setup, where we also have access to a small number of labeled examples from the target $D_T$. Size generalization with self-supervised learning. In self-supervised learning (SSL) for DA, a model is trained on unlabeled data to learn a pretext task, which is different from the main task at hand. Pattern-tree pretext task. We propose a novel pretext task motivated by the definition of d-patterns. We construct a tree that fully represents the d-pattern of each node (see, e.g., Xu et al. (2018)). We then calculate a descriptor of the tree: a vector containing the counts of nodes from each class in each layer of the tree. We treat this descriptor as the target label to be reconstructed by the SSL task. For more details see Figure 3.
Intuitively, in order to succeed at a task on graphs, the GNN needs to correctly represent the pattern trees of the graphs' nodes. This means that, to generalize to the target domain $D_T$, the GNN must be forced to represent pattern trees from both the source and the target distributions. For more details about the construction of the pattern tree see Appendix D. In short, each tree corresponds to a d-pattern in the following way: the d-pattern tree of a node can be seen as a multiset of the children of the root, where each child is in turn a multiset of its children, and so on. The pattern tree is thus a different description of the d-pattern of a node. This means that a GNN that successfully represents a pattern tree also represents its corresponding d-pattern, which connects this SSL task to the theory from Sec. 3. Semi-supervised setup. We also consider a case where a small number of labeled samples is available from the target domain. A natural approach is to train an SSL pretext task on samples from both the source and target domains, and to train the main task on all available labeled samples. We tested this setup with 1, 5, or 10 labeled examples from the target domain.
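A minimal sketch of the pattern-tree descriptor follows (our own simplified reading of the construction, assuming level k of the unrolled tree holds all length-k walks from the root; the paper's exact construction is in Appendix D):

```python
from collections import Counter

def pattern_tree_descriptor(adj, feat, root, depth):
    """Per-level counts of node classes in the pattern tree rooted at `root`.
    adj: dict node -> neighbor list; feat: dict node -> class label.
    Walks are counted with multiplicity, as in the unrolled computation tree."""
    frontier = Counter({root: 1})   # tree nodes at the current level
    levels = []
    for _ in range(depth + 1):
        level = Counter()
        for v, mult in frontier.items():
            level[feat[v]] += mult  # count classes at this level
        levels.append(dict(level))
        nxt = Counter()
        for v, mult in frontier.items():
            for u in adj[v]:        # children: all neighbors, with multiplicity
                nxt[u] += mult
        frontier = nxt
    return levels
```

Concatenating these per-level count vectors gives the regression target reconstructed by the SSL pretext task.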

4.1. EXPERIMENTS

Architecture and training protocol. The setup is the same as in Subsection 3.4, with the following changes. We use a three-layer GNN in all experiments. Multi-task learning is used with equal weight on the main and SSL tasks. In the semi-supervised setup, we used equal weight for the main task and the labeled examples from the target domain. Baselines. We compare our new pretext task to several baselines, including a vanilla model, HOMO-GNN, and existing pretext tasks. Datasets. We use datasets from (2020) (Twitch egos and Deezer egos). We selected datasets that have a sufficient number of graphs (more than 1,000) and a non-trivial split into small and large graphs, as detailed in Appendix F.1. In total we used 7 datasets: 4 from molecular biology (NCI1, NCI109, D&D, Proteins) and 3 from social networks (Twitch ego nets, Deezer ego nets, IMDB-Binary). In all datasets, the 50% smallest graphs were assigned to the training set, and the largest 10% of graphs were assigned to the test set. We further split off a random 10% of the small graphs as a validation set. Results. Table 1 compares the effect of using the pattern-tree pretext task to the baselines described above. The "small graphs" row presents vanilla results on a validation set of small graphs. On 5 out of 7 datasets, the small-graph accuracy is larger by 7.3%-15.5% than the large-graph accuracy, indicating that the size-generalization problem is indeed prevalent in real datasets. Pretraining with the pattern-tree pretext task outperforms the other baselines on 5 out of 7 datasets, with a 4% average accuracy improvement over all datasets. HOMO-GNN slightly improves over the vanilla model, while the other pretext tasks do not improve average accuracy. Naturally, the accuracy here is much lower than SOTA on these datasets, because the domain shift makes the problem much harder. In Appendix F.2 we show the 1-pattern distribution discrepancy between large and small graphs in two real datasets: IMDB (large discrepancy) and D&D (small discrepancy).
Correspondingly, the pattern-tree SSL task improved performance on the IMDB dataset, while not improving performance on the D&D dataset. This gives further evidence that a discrepancy between the d-patterns leads to bad generalization, and that correctly representing the patterns of the test set can improve performance. We additionally tested on the synthetic tasks discussed in Sec. 3, and show that the pattern-tree pretext task improves performance in the student-teacher setting, while it does not solve the edge count or degree prediction tasks. On the other hand, adding even a single labeled sample from the target distribution significantly improves performance on the synthetic tasks we tested on. For more details see Sec. F.

5. CONCLUSION AND DISCUSSION

This paper is a first step towards understanding the size-generalization problem in graph neural networks. We showed that GNNs do not naturally generalize to larger graphs even on simple tasks, characterized how this failure depends on d-patterns, and suggested two approaches that can improve generalization. Our characterization of d-patterns is likely to have implications for other problems where generalization is harmed by distribution shifts, and offers a way to mitigate those problems.

[40] C. Yun, S. Sra, and A. Jadbabaie. Small ReLU networks are powerful memorizers: A tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, pages 15558-15569, 2019.
[41] L. Zhao and L. Akoglu. PairNorm: Tackling oversmoothing in GNNs. arXiv preprint arXiv:1909.12223, 2019.
[42] A. Zweig and J. Bruna. A functional perspective on learning symmetric functions with neural networks. arXiv preprint arXiv:2008.06952, 2020.

A PRELIMINARIES

This section discusses two key concepts that we use throughout the paper: (1) graph neural networks; (2) the Weisfeiler-Lehman graph isomorphism test. At the end of the section, we also discuss the constant-GNN-depth setup used in this paper.

Notation. We denote by {(a_1, m_{a_1}), ..., (a_n, m_{a_n})} a multiset, that is, a set in which we allow multiple instances of the same element. Here a_1, ..., a_n are distinct elements, and m_{a_i} is the number of times a_i appears in the multiset. Bold-face letters represent vectors.

A.1 GRAPH NEURAL NETWORKS

We consider the common message-passing GNN architecture (38) defined as follows. Let G = (V, E) be a graph, and for each node v ∈ V let h_v ∈ R^{d_0} be a node feature vector. Then for every t > 0 we define:

h_v^{(t)} = UPDATE( h_v^{(t-1)}, AGG({ h_u^{(t-1)} : u ∈ N(v) }) ).

Here, AGG is a permutation-invariant aggregation function such as summation, averaging, or taking the max; UPDATE is an update function such as an MLP; and N(v) denotes the set of neighbors of node v. In case the graph is directed, N(v) denotes the set of nodes with edges incoming to v. For node prediction tasks, the output of a T-layer GNN for node v is h_v^{(T)}, while for graph prediction tasks an additional readout layer is used: g^{(T)} = READOUT({ h_v^{(T)} : v ∈ V }), where READOUT is some invariant function such as summation, averaging, or taking the max, possibly followed by a fully connected network. In our theoretical results, we focus on the first-order GNN architecture from (25), defined in the following way:

h_v^{(t)} = σ( W_2^{(t)} h_v^{(t-1)} + Σ_{u ∈ N(v)} W_1^{(t)} h_u^{(t-1)} + b^{(t)} ).

Here, W_1^{(t)}, W_2^{(t)} ∈ R^{d_t × d_{t-1}} and b^{(t)} ∈ R^{d_t} denote the parameters of the t-th layer of the GNN, and σ is some non-linear activation (e.g. ReLU). It was shown in (25) that GNNs composed of these layers have maximal expressive power with respect to all message-passing neural networks. In general, we focus on GNNs that use the popular sum aggregation, as it has maximal expressive power with respect to the other aggregations used in message-passing GNNs (38). We discuss other aggregation functions and the trade-off they present in Section 5.
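The first-order layer above is simple enough to write out directly. Below is a minimal numpy sketch of one such layer with sum aggregation, a ReLU nonlinearity, and a sum readout; the function names are ours, not the paper's.

```python
import numpy as np

def first_order_gnn_layer(H, adj, W1, W2, b):
    """One first-order GNN layer: h_v' = relu(W2 h_v + sum_{u in N(v)} W1 h_u + b).
    H: (n, d_in) node features; adj: (n, n) 0/1 adjacency matrix."""
    neighbor_sum = adj @ H  # row v holds the sum of v's neighbors' features
    return np.maximum(H @ W2.T + neighbor_sum @ W1.T + b, 0.0)

def sum_readout(H):
    """Graph-level readout: sum the node embeddings."""
    return H.sum(axis=0)
```

On a triangle with all-ones scalar features and identity weights, each node receives its own feature plus two neighbor features, so every embedding becomes 3 and the readout is 9.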

A.2 THE WEISFEILER-LEHMAN TEST

The definition of d-patterns, used frequently in this paper, is closely related to the Weisfeiler-Lehman (WL) test (37; 13). The WL test is an algorithm for testing whether two graphs are isomorphic, and was recently linked to the expressive power of GNNs (38; 25). It is easiest to describe WL through an equivalent algorithm called the color-refinement algorithm (13). The color-refinement algorithm is executed sequentially: at each stage, the algorithm generates a descriptor for each node according to the descriptors of its direct neighbors. The descriptors at each stage can be used to define equivalence classes of nodes, and the process continues until these equivalence classes cannot be refined anymore. The final graph descriptor is a histogram of the node descriptors, and is closely related to the definition of d-patterns.
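The refinement loop described above can be sketched in a few lines. This is an illustrative implementation under our own naming, not the paper's code: it refines node colors until the partition stabilizes and returns the final colors together with the histogram descriptor.

```python
from collections import Counter

def color_refinement(adj_list, init_colors):
    """Run color refinement until the node partition stops refining.
    Each round, a node's new color is its old color together with the
    multiset of its neighbors' colors, compressed to a small integer id."""
    colors = dict(init_colors)
    while True:
        sig = {v: (colors[v],
                   tuple(sorted(Counter(colors[u] for u in adj_list[v]).items())))
               for v in adj_list}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {v: palette[sig[v]] for v in adj_list}
        if len(set(new.values())) == len(set(colors.values())):
            return new, Counter(new.values())  # stable partition + histogram
        colors = new
```

On a path of three nodes with identical features, refinement separates the two endpoints from the middle node and then stabilizes.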

A.3 CONSTANT DEPTH VS. ADAPTIVE DEPTH TASKS

As stated earlier, multiple factors control size generalization. One important factor is the type of task; we make a distinction between two types of graph tasks. (1) Tasks that can be solved by a constant-depth GNN. A good example of such a task is determining whether a certain constant-size connectivity pattern appears in the graph. (2) Tasks that require the depth of the GNN to depend on different parameters of the problem, such as the diameter of the graph. To exemplify, the task of calculating the diameter of a graph can be solved by a GNN only if its depth depends on the size of the graph (for more details see (32; 21)). On the other hand, tasks that require only local information about each node can be solved with a constant-depth GNN. We note that although there are several recent works on adaptive-depth GNNs, e.g. (32; 27), most currently used GNNs have constant and relatively small depth (7). In this paper, we focus on the first kind of task, requiring a constant number of GNN layers. We note that the size-generalization problem we discuss in this paper is also relevant for the second type of task, but we chose to discuss the problem in the simpler setup for clarity.

B PROOFS FROM SEC. 3

Proof of Thm. 3.2. We show, from the definition of d-patterns and the 1-WL algorithm (see (37)), that the color given by the WL algorithm to two nodes is equal iff their d-patterns are equal. For d = 0 this is clear. Suppose it holds for d - 1. At iteration d, the WL algorithm gives node v a new color based on the colors given at iteration d - 1 to the neighbors of v. This means that the color of v at iteration d depends on the multiset of colors of its neighbors at iteration d - 1, which by induction is the multiset of (d-1)-patterns of the neighbors of v. To conclude, we use Theorem 1 from (25), which shows that GNNs are constant on the colors of WL, hence also constant on d-patterns.

To prove Thm. 3.3 we will first need the following claim from (40) about the memorization power of ReLU networks:

Theorem B.1. Let {(x_i, y_i)}_{i=1}^N ⊂ R^d × R such that all the x_i are distinct and y_i ∈ [-1, 1] for all i. Then there exists a 3-layer fully connected ReLU neural network f : R^d → R with width 2√N such that f(x_i) = y_i for every i.

We will also need the following lemma, which is used in the construction of each layer of the 1-GNN:

Lemma B.2. Let N ∈ N and let f : N → R be defined by f(n) = w_2^T σ(w_1 n - b), where w_1, w_2, b ∈ R^N and σ is the ReLU function. Then for every y_1, ..., y_N ∈ R there exist w_1, w_2, b such that f(n) = y_n for n ≤ N and f(n) = (n - N + 1) y_N - (n - N) y_{N-1} for n > N.

Proof. Define w_1 = (1, ..., 1)^T and b = (0, 1, ..., N-1)^T. Let a_i be the i-th coordinate of w_2; we define the a_i recursively in the following way: let a_1 = y_1, and given a_1, ..., a_{i-1}, define a_i = y_i - 2a_{i-1} - 3a_{i-2} - ... - i a_1. Now, for every n ≤ N:

f(n) = w_2^T σ(w_1 n - b) = n a_1 + (n-1) a_2 + ... + a_n = y_n.
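The construction in Lemma B.2 can be checked numerically. The sketch below (with `lemma_b2_weights` being our own name for the helper) builds w_1, w_2, b as in the proof and evaluates f; both the interpolation claim for n ≤ N and the linear-extrapolation claim for n > N can then be verified on concrete targets.

```python
import numpy as np

def lemma_b2_weights(y):
    """Weights from the proof of Lemma B.2: w1 all ones, b = (0, 1, ..., N-1),
    and w2 = (a_1, ..., a_N) built recursively so that
    f(n) = w2 @ relu(w1 * n - b) equals y_n at n = 1..N."""
    N = len(y)
    w1 = np.ones(N)
    b = np.arange(N, dtype=float)
    a = np.zeros(N)
    for i in range(N):  # a_i = y_i - 2 a_{i-1} - 3 a_{i-2} - ... - i a_1
        a[i] = y[i] - sum((i - j + 1) * a[j] for j in range(i))
    return w1, a, b

def f(n, w1, w2, b):
    """Evaluate the one-hidden-layer ReLU function of the lemma."""
    return float(w2 @ np.maximum(w1 * n - b, 0.0))
```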
For n > N write n = N + k for k ≥ 1. Then:

f(n) = w_2^T σ(w_1 (N + k) - b) = (N + k) a_1 + (N + k - 1) a_2 + ... + (k + 1) a_N = y_N + k(a_1 + a_2 + ... + a_N) = y_N + k(f(N) - f(N-1)) = y_N + k(y_N - y_{N-1}) = (k + 1) y_N - k y_{N-1}.

Now we are ready to prove the main theorem.

Proof of Thm. 3.3. We assume w.l.o.g. that at the first iteration each node i is represented by a one-hot vector h_i^{(0)} of dimension |C|, corresponding to its node feature. Otherwise, since there are |C| node features, we can use one GNN layer that ignores all neighbors and only maps each node feature to a one-hot vector. We construct the first d layers of the 1-GNN F by induction on d. Denote by a_i = |C| · N^i + N^{i-1} the dimension of the i-th layer of the GNN for 1 ≤ i ≤ d, and a_0 = |C|. The mapping has two parts: one takes the neighbors' information and maps it into a feature vector representing the multiset of (d-1)-patterns; the other is simply the identity, preserving the information about the d-pattern of the node itself. The d-th layer has the structure

h_v^{(d)} = U^{(d+1)} σ( W_2^{(d)} h_v^{(d-1)} + Σ_{u∈N(v)} W_1^{(d)} h_u^{(d-1)} - b^{(d)} ).

We set W_2^{(d)} = [0, I]^T, W_1^{(d)} = [W̃_1^{(d)T}, 0]^T and U^{(d+1)} = [Ũ^{(d+1)T}, 0]^T, with W̃_1^{(d)} ∈ R^{N a_{d-1} × a_{d-1}} and Ũ^{(d+1)} ∈ R^{N a_{d-1} × N a_{d-1}}. For W̃_1^{(d)} we set w_i, its i-th row, to be equal to e_n with n = ⌈i/N⌉. The effect of Σ_{u∈N(v)} W_1^{(d)} h_u^{(d-1)} is that each dimension i holds the number of neighbors with a specific (d-1)-pattern. We then replicate this vector N times, and from each replica we subtract a different bias integer, ranging from 0 to N - 1. To that output we concatenate the original h_v^{(d-1)}. Next we construct Ũ^{(d+1)} ∈ R^{N a_{d-1} × N a_{d-1}} in the following way: let u_i^{(d+1)} be its i-th row, and u_{i,j}^{(d+1)} its j-th coordinate. We set u_{i,j}^{(d+1)} = 0 for every j outside the i-th block of N coordinates, and set the remaining N coordinates equal to the vector w_2 from Lemma B.2 with labels y_ℓ = 0 for ℓ ∈ {1, ..., N} \ {(i mod N) + 1} and y_ℓ = 1 for ℓ = (i mod N) + 1.
Using the above construction, we encoded the output of node v after the first d layers of F as a vector: the i-th coordinate of h_v^{(d)}, for 1 ≤ i ≤ N · a_{d-1}, is equal to 1 iff node v has (i mod N) + 1 neighbors with (d-1)-pattern ⌈i/N⌉ ∈ {1, ..., |C|}. The last a_{d-1} coordinates are a copy of h_v^{(d-1)}.

Construction of the suffix. Next, we construct the last two layers. First we note that for a node v with d-pattern p there is a unique vector z_p such that the output of the 1-GNN on node v after d layers is z_p; b^{(d+2)} and W^{(d+3)} are the matrices produced by Thm. B.1. Note that W^{(d+3)} is the linear output layer constructed from the theorem.

In the edge-count and degree tasks, the loss is the mean difference from the ground truth, divided by the average number of edges or the average degree, respectively. In the student-teacher tasks, the loss is the mean L2 loss between the teacher's value and the student's prediction, divided by the student's average prediction. Both the student and the teacher share the same 3-layer architecture. Next, we tested the student-teacher task at both the graph and node levels, the degree-prediction task for each node, and the task of predicting the number of edges in the graph. The goal of these experiments is to show that the size-generalization problem persists across different tasks and architectures, and that its intensity increases for deeper GNNs. In all the experiments we draw graphs from a G(n, p) distribution, where in the training set n is drawn uniformly between 40 and 50 and p = 0.3, and in the test set n = 100 and p is either 0.3 or 0.15. We note that when p = 0.15, the average degree of the test graphs is (approximately) equal to the average degree of the train graphs, while when p = 0.3 it is twice as large. We would expect the model to generalize better when the average degrees in the train and test sets are similar, because then their d-patterns will also be more similar.
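The degree-matching claim is simple expected-value arithmetic: in G(n, p) the expected degree is (n - 1)p. The sketch below (variable names ours) works out the numbers for the train and test settings described above.

```python
def expected_degree(n, p):
    """Expected degree in G(n, p): each of the other n - 1 nodes is a
    neighbor independently with probability p."""
    return (n - 1) * p

# Train graphs: n uniform over {40, ..., 50}, p = 0.3.
train_avg = sum(expected_degree(n, 0.3) for n in range(40, 51)) / 11
deg_p015 = expected_degree(100, 0.15)  # test set with halved p
deg_p030 = expected_degree(100, 0.30)  # test set with the train p
```

With these settings the train average degree is 13.2, the p = 0.15 test graphs have expected degree 14.85 (roughly matched), and the p = 0.3 test graphs have 29.7 (roughly doubled).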
Table 3 compares the performance on these tasks when changing the graph size of the test data. We measured the performance with a normalized test loss, where the loss is normalized by the average output over the test set. This metric estimates the relative size of the error with respect to the average output.

D SSL TASK ON WL TREE

First, we will need the following definition, which constructs a tree out of the d-pattern introduced in the previous section. This tree enables us to extract, for each node, certain properties which can later be learned using a GNN. This definition is similar to the definition of the "unrolled tree" from (24).

Definition D.1 (d-pattern tree). Let G = (V, E) be a graph, C a finite set of node features where each v ∈ V has a corresponding feature c_v, and d ≥ 0. For a node v ∈ V, its d-pattern tree T_v^{(d)} = (V_v^{(d)}, E_v^{(d)}) is a directed tree in which each node corresponds to some node of G. It is defined recursively in the following way. For d = 0, V_v^{(0)} = {u^{(0,v)}} and E_v^{(0)} = ∅. Suppose we have defined T_v^{(d-1)}, and let Ṽ_v^{(d-1)} be the set of all leaf nodes in V_v^{(d-1)} (i.e. nodes without incoming edges). We define:

V_v^{(d)} = V_v^{(d-1)} ∪ { u^{(d,v')} : v' ∈ N(v''), u^{(d-1,v'')} ∈ Ṽ_v^{(d-1)} }

E_v^{(d)} = E_v^{(d-1)} ∪ { (u^{(d,v')}, u^{(d-1,v'')}) : v' ∈ N(v''), u^{(d-1,v'')} ∈ Ṽ_v^{(d-1)} }

and for every node u^{(d,v')} ∈ V_v^{(d)}, its node feature is c_{v'}, the node feature of v'.

The main advantage of pattern trees is that they encode all the information that a GNN can produce for a given node, by running the same GNN on the pattern tree. The tree corresponds to the local patterns in the following way: the d-pattern tree of a node can be seen as a multiset of the children of the root, where each child is in turn a multiset of its own children, and so on. This completely describes the d-pattern of a node; in other words, there is a one-to-one correspondence between d-patterns and pattern trees of depth d. Thus, a GNN that successfully represents the pattern trees of the target distribution will also successfully represent the d-patterns of the target distribution. Using the d-pattern tree, we construct a simple regression SSL task whose goal is, for each node, to count the number of nodes of each feature in every layer of its d-pattern tree. This is a simple descriptor of the tree which, although it loses some information about connectivity, does retain information about the structure of the layers. For example, in Fig.
3 the descriptor for the tree would be that the root (zero) layer has a single black node, the first layer has two yellow nodes, the second layer has two yellow, two gray, and two black nodes, and the third layer has ten yellow, two black and two gray nodes.
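These per-layer, per-feature counts can be computed without materializing the tree: layer L of the unrolled tree contains one node per walk of length L starting at the root, so the counts follow from powers of the adjacency matrix. The sketch below is our own illustration of the SSL targets, not the paper's implementation.

```python
import numpy as np

def pattern_tree_descriptor(adj, feats, depth):
    """For every node v and every layer L of its depth-d pattern tree,
    count the tree nodes carrying each feature. walks[v, u] holds the
    number of length-L walks from v to u, i.e. the layer-L tree nodes
    that correspond to graph node u."""
    onehot = {c: np.array([1.0 if f == c else 0.0 for f in feats])
              for c in set(feats)}
    walks = np.eye(len(feats))
    desc = []
    for _ in range(depth + 1):
        desc.append({c: walks @ onehot[c] for c in onehot})
        walks = walks @ adj
    return desc
```

On a triangle with features ('a', 'a', 'b'), node 0's layer-2 counts are three 'a' nodes (two walks back to itself plus one to node 1) and one 'b' node.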

E TRAINING PROCEDURE

In this section we explain in detail the training procedure used in the experiments of Sec. 4. Let X_Main and X_SSL be two labeled datasets: the first contains the labeled examples for the main task from the source distribution, and the second contains examples labeled by the SSL task from both the source and target distributions. Let ℓ : R × R → R be a loss function; in all our experiments we use the cross-entropy loss for classification tasks and the squared loss for regression tasks. We construct the following models: (1) f_GNN is a GNN feature extractor; its input is a graph and its output is a feature vector for each node in the graph. (2) h_Main is a head (a small neural network) for the main task; its inputs are the node features and it outputs a prediction (for graph prediction tasks this head contains a global pooling layer). (3) h_SSL is the head for the SSL task; its inputs are node features, and it outputs a prediction for each node of the graph, depending on the specific SSL task used.

Pretraining. Here, there are two phases in the learning procedure. In the first phase, at each iteration we sample a batch (x_1, y_1) from X_SSL and train by minimizing the objective ℓ(h_SSL ∘ f_GNN(x_1), y_1). In this phase both h_SSL and f_GNN are trained. In the second phase, at each iteration we sample (x_2, y_2) from X_Main and train on the loss ℓ(h_Main ∘ f_GNN(x_2), y_2), where we only train the head h_Main while the weights of f_GNN are fixed.

Multitask training.

Here we train all the functions at the same time. At each iteration we sample a batch (x_1, y_1) from X_SSL and a batch (x_2, y_2) from X_Main and train by minimizing the objective α ℓ(h_SSL ∘ f_GNN(x_1), y_1) + (1 - α) ℓ(h_Main ∘ f_GNN(x_2), y_2). Here α ∈ [0, 1] is the weight of the SSL task; in all our experiments we used α = 1/2. For an illustration of the training procedures see Fig. 5. These procedures are common practice for training with SSL tasks (see e.g. (39)). We additionally use a semi-supervised setup in which we are given a dataset X_FS of few-shot examples from the target distribution with their correct labels. In both training procedures, at each iteration we sample a batch (x_3, y_3) from X_FS and add a loss term β ℓ(h_Main ∘ f_GNN(x_3), y_3), where β ∈ [0, 1] is the weight of the few-shot loss. In pretraining, this term is only added to the second phase, with weight 1/2, and we adjust the weight of the main task to 1/2 as well (equal weight to the main task). In the multitask setup, we add this term with weight 1/3 and adjust the weights of the two other losses to 1/3 as well, so that all the losses have the same weight.

F MORE EXPERIMENTS FROM SEC. 4

For the synthetic datasets, we used the setting of Section 3.4. Source graphs were generated from G(n, p) with n sampled uniformly in [40, 50] and p = 0.3. Target graphs were sampled from G(n, p) with n = 100 and p = 0.3. Table 4 depicts the results of using the d-patterns SSL task, in addition to using the semi-supervised setting.
It can be seen that adding the d-patterns SSL task significantly improves the performance on the teacher-student task, although it does not completely solve it. We also observe that adding labeled examples from the target domain significantly improves the performance on all tasks; note that adding even a single example already improves performance significantly. In all the experiments, the network successfully learned the task on the source domain with an average error below 0.15, and in most cases much lower.

Appendix F.1 shows the statistics of the datasets used in the paper. In particular, it presents the split used in the experiments: we trained on graphs with sizes smaller than or equal to the 50th percentile and tested on graphs with sizes larger than or equal to the 90th percentile.



G(n, p): graphs with n nodes, where each edge exists independently with probability p.



(see Xu et al. (2018); Morris et al. (2019)). For more details on the WL test see Appendix A.

Definition 3.1 (d-patterns). Let C be a finite set of node features and N ∈ N. For d ≥ 0 we define the set of d-patterns P_d on graphs with maximal degree N and node features from C. The definition is recursive: for d = 0, P_0 = C. We define P_d to be the set of all tuples (a, b), where a ∈ P_{d-1} and b is a multiset of size at most N consisting of elements from P_{d-1}.
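Definition 3.1 translates directly into code: the 0-pattern is the node feature, and each subsequent pattern pairs a node's previous pattern with the multiset of its neighbors' previous patterns. The sketch below (our own naming) encodes patterns as hashable nested tuples so equality of patterns is plain tuple equality.

```python
from collections import Counter

def d_patterns(adj_list, feats, d):
    """Compute the d-pattern of every node per Definition 3.1: the
    d-pattern is (own (d-1)-pattern, multiset of neighbors' (d-1)-patterns),
    with multisets encoded as sorted (pattern, multiplicity) tuples."""
    pat = dict(feats)  # 0-patterns = node features
    for _ in range(d):
        pat = {v: (pat[v],
                   tuple(sorted(Counter(pat[u] for u in adj_list[v]).items())))
               for v in adj_list}
    return pat
```

On a path of three identically-featured nodes, the two endpoints share a 2-pattern while the middle node gets a different one, mirroring the color-refinement behavior.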

P_2^{d-pat} are disjoint. For any graph prediction task solvable by a 1-GNN with depth d and a summation readout function, there exists a 1-GNN with depth at most d + 3 that perfectly solves the task on P_1 and fails on all graphs from P_2.

Relation to Morris et al. (2019); Xu et al. (2018). We note that Theorem 3.3 and Corollary 3.4 are somewhat related to the expressivity results in (Xu et al., 2018; Morris et al., 2019), which show that

Figure 1: The effect of graph size and d-pattern distribution on generalization in G(n, p) graphs. (Left) The effect of the distribution of d-patterns: train on n drawn uniformly from [40, 50] and p = 0.3; test on n = 100 and varying p. (Right) The effect of train-graph size: train on n drawn uniformly from [40, x] where x varies, and p = 0.3; test on n = 150, p = 0.3.

If the pretext task is chosen wisely, the model learns useful representations (Doersch et al., 2015; Gidaris et al., 2018) that can help with the main task. Here, we train the pretext task on both the source and target domains, as was done for images and point clouds (Sun et al., 2019; Achituve et al., 2020). The idea is that the pretext task aligns the representations of the source and target domains leading to better predictions of the main task for target graphs.

Figure 3: Left: a graph with node features represented by colors. Right: a tree that represents the d-patterns of the black node. The tree descriptor is the number of nodes from each class in each layer of the tree.

For a detailed review of the training procedures and the losses see Appendix E. Given a pretext task, we consider two different training procedures: (1) Multi-task learning (MTL): parallel training of the main task on the source domain and the pretext task on both the source and target domains (You et al., 2020). In this case, the architecture consists of a main GNN that acts as a feature extractor and two secondary networks (heads) that operate on the extracted features and predict the main task and the pretext task, respectively. (2) Pretraining (PT): in this procedure (Hu et al., 2019), the GNN feature extractor is trained until convergence on the pretext task, on both the source and target examples. Then the GNN part is frozen, and only the head of the model is trained on labeled examples from the source.
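The loss weightings behind these procedures (detailed in Appendix E: equal SSL/main weights, and equal thirds when a few-shot target-domain loss is added) can be summarized in a small helper; `combined_loss` is our name for it, not the paper's.

```python
def combined_loss(loss_main, loss_ssl, loss_few_shot=None):
    """Combine per-batch losses as in Appendix E. Without target-domain
    labels, the SSL and main objectives get equal weight (alpha = 1/2).
    With a few-shot loss on labeled target examples, all three terms
    are weighted 1/3."""
    if loss_few_shot is None:
        return 0.5 * loss_ssl + 0.5 * loss_main
    return (loss_ssl + loss_main + loss_few_shot) / 3.0
```

In MTL this combined value is minimized jointly over the shared GNN and both heads; in PT the SSL term is minimized first and the main term afterwards with the GNN frozen.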

Figure 4: Mean accuracy over all datasets in Tab. 1 for d-pattern pretraining and no SSL (vanilla).

Figure 4 compares the performance of vanilla training versus pretraining with the pattern-tree pretext task in the semi-supervised setup. The accuracy increases monotonically with the number of labeled examples in both cases. Moreover, pretraining with the pretext task yields better results with 0, 1, or 5 labeled examples, and comparable results with 10 labeled examples.

The i-th coordinate of b^{(d)} is set equal to (i - 1) (mod N) for i ≤ N · a_{d-1}, and to zero otherwise.

z_p. From our previous construction one can reconstruct exactly the (d-1)-pattern of each node, and the exact number of neighbors with each (d-1)-pattern, and can therefore recover the d-pattern correctly from the embedding h_v^{(d)}. Finally, we use Thm. B.1 to construct the 3-layer fully connected neural network with width at most 2√|P| such that for every pattern p_i ∈ P, with corresponding unique representation z_{p_i} and label y_i, the output of the network on z_{p_i} is equal to y_i. We construct the last two layers of the 1-GNN from the matrices W^{(d+1)}, b^{(d+1)}, W_2^{(d+2)}

V_v^{(0)} = {u^{(0,v)}} and E_v^{(0)} = ∅. Suppose we have defined T_v^{(d-1)}, and let Ṽ_v^{(d-1)} be the set of all leaf nodes in V_v^{(d-1)}.

Figure 5: Two training procedures for learning with SSL tasks. Left: learning with pretraining: here, a GNN is trained on the SSL task with a specific SSL head. After training, the weights of the GNN are fixed, and only the main head is trained on the main task. Right: multitask learning: here, there is a shared GNN and two separate heads, one for the SSL task and one for the main task. The GNN and both heads are trained simultaneously.

Figure 6: Teacher-student setup with a 3-layer GNN. Training is on graphs drawn i.i.d. from G(n, p) with n ∈ {40, ..., 50} uniformly and p = 0.3. Testing is done on graphs with n = 100 and varying p (x-axis). The "Pattern tree" plot represents training with our pattern-tree SSL task, using the pretraining setup.

Fig. 6 depicts a side-by-side plot of the 3-layer case of Fig. 2 (left), where training is done on graphs sampled from G(n, p) with 40 to 50 nodes and p = 0.3, and testing is done on graphs with 100 nodes and varying p. We compare vanilla training with our pattern-tree SSL task with pretraining. It is clear that for all values of p our SSL task improves over vanilla training.

Figure 7: Degree histograms (in percentages). We used the 10% smallest and the 10% largest graphs in each dataset.

Table 1: Test accuracy of the compared methods on binary classification tasks. The pattern-tree task with pretraining achieves the highest accuracy in most tasks and has a 4% higher average accuracy than the second-best method. The high variance is due to the domain shift between the source and target domains.

The difference in the predicted max-clique size under the size-generalization domain shift. The train-domain graphs were constructed by drawing n ∈ [40, 50] points uniformly in the unit square and connecting two points if their distance is less than ρ_train = 0.3. The test-domain graphs contain n = 100 nodes, effectively increasing their density by 2. We tested two different values of ρ_train/ρ_test, the ratio between the train and test connectivity radii. A proper scaling that keeps the expected degree of each node is ρ_train/ρ_test = √2. Here, although proper scaling does not completely solve the problem, it does improve performance.
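The √2 scaling follows from the expected degree of a random geometric graph: ignoring boundary effects, a node has about (n - 1)πρ² neighbors, so shrinking ρ by √2 halves the expected degree and compensates for doubling the density. A quick sketch of the arithmetic (function name ours):

```python
import math

def expected_geo_degree(n, rho):
    """Approximate expected degree in a random geometric graph on the unit
    square, ignoring boundary effects: each of the other n - 1 points falls
    within distance rho with probability ~ pi * rho**2."""
    return (n - 1) * math.pi * rho ** 2

train = expected_geo_degree(45, 0.3)                        # typical train graph
test_same_rho = expected_geo_degree(100, 0.3)               # density roughly doubles
test_scaled = expected_geo_degree(100, 0.3 / math.sqrt(2))  # rho_train/rho_test = sqrt(2)
```

Scaling the radius halves the expected degree at n = 100, bringing it close to (though, with n not exactly doubled, not identical to) the train value.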

Comparing performance on different local distributions: (a) a student-teacher graph regression task; (b) a graph regression task, where the graph label is the number of edges; (c) a student-teacher node regression task; (d) a node regression task, where the node label is its degree.


final output of the 1-GNN for node v is W^{(d+3)} h_v^{(d+2)}, where h_v^{(d+2)} is the output after d + 2 layers.

Proof of Corollary 3.4. By the assumption, the output of the task is determined by the d-patterns of the nodes. For each node with pattern p_i, let y_i be the output of the node prediction task. Define A = arg max. By Thm. 3.3 there exists a 1-GNN that gives a wrong label for any d-pattern p_i ∈ A and gives the correct label for every pattern outside A, p_j ∈ A^c. Note that we can use Thm. 3.3 so that the output of the GNN on all patterns from P_2^{d-pat} is zero; since zero is then the value after the summation readout for graphs from P_2 but not for a graph from P_1, we can add an MLP after the summation readout function that outputs the identity on every input except zero, for which it outputs a value which is not a solution to the task for any graph in P_2. Note that this is possible since P_2 has finite support. In the case that zero is also a solution of the task on a graph from P_1, let n_1, ..., n_ℓ be the sizes of graphs from P_2. We pick some y ∈ R such that y · n_i is not an output of any graph from P_1 or P_2; such a y exists since the distributions have finite support. Now, we can change the proof so that the output of the GNN on all patterns from P_2^{d-pat} is equal to y. Note that now the output of any graph from P_2 after the summation readout function is not the correct output. In case there is an MLP after the readout function, the proof can be readily changed to account for that.

Proof of Corollary 3.5. Let h be a 1-GNN that has 0 loss on P_1. In particular, for any pattern in P_1^{d-pat}, h successfully predicts the correct label. By the assumption, the patterns that appear in P_2^{d-pat} are contained in the patterns that appear in P_1^{d-pat}. Using Thm. 3.2, i.e. that any 1-GNN is constant on d-patterns, we get that h will also succeed on the patterns from P_2^{d-pat}, and hence will have 0 loss on P_2.

C ADDITIONAL EXPERIMENTS FROM SEC. 3.4

First, we consider the max-clique problem: given a graph, the goal is to output the size of its maximal clique. This is in general an NP-hard problem, hence a constant-depth GNN will not be able to solve it for graphs of all sizes. For this task we sampled both the train and test graphs from a geometric distribution defined as follows: given a number of nodes n and a radius ρ, we draw n points uniformly in [0, 1]^2; each point corresponds to a node in the graph, and two nodes are connected if their corresponding points are at distance less than ρ. We further analyzed how the network depth and architecture affect size generalization. Table 2 presents the test loss on the max-clique problem. Deeper networks are substantially more affected by the domain shift. If the test domain has a similar pattern distribution, increasing the network depth from one layer to three layers results in a small decrease of at most 20% in the loss. However, if the pattern distribution differs from the train pattern distribution, the same change may increase the loss by more than 2.5×. We also show that the problem is consistent across both the 1-GNN and GIN architectures.

In all the datasets there is a significant difference between the graph sizes in the train and test sets, and in some datasets there is also a difference between the distribution of the output class in the small and large graphs.

F.2 DEGREE DISCREPANCY CORRESPONDS TO SIZE DISCREPANCY

In this subsection we calculated the degree histograms of two of the real datasets that we tested on: IMDB-Binary and D&D; for the histograms see Fig. 7. Recall that the degree of a node is its 1-pattern. It is clear from the plots that for the IMDB dataset there is a difference in the distribution of degrees between small and large graphs, while in D&D the distributions look almost the same. Correspondingly, in our experiments the pattern-tree SSL task significantly improved performance on the IMDB dataset, while not improving performance on the D&D dataset. This gives further evidence that a discrepancy between the d-patterns leads to bad generalization, and that correctly representing the patterns of the test set can improve performance. While in this simple experiment we only considered 1-patterns, it is worth mentioning that the discrepancy can only be accentuated when considering deeper patterns.
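A degree-histogram comparison like the one in Fig. 7 can be quantified with a simple discrepancy measure. The sketch below (our own helper, using total-variation distance as one reasonable choice) computes normalized degree histograms and their distance.

```python
from collections import Counter

def degree_histogram(adj_list):
    """Normalized histogram of node degrees (i.e. of the 1-patterns)."""
    degs = [len(nbrs) for nbrs in adj_list.values()]
    return {d: c / len(degs) for d, c in Counter(degs).items()}

def total_variation(h1, h2):
    """Total-variation distance between two degree histograms: 0 means
    identical 1-pattern distributions; values near 1 mean large discrepancy."""
    keys = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)
```

For instance, a 4-node star (degrees 3, 1, 1, 1) and a 4-node path (degrees 1, 2, 2, 1) have total-variation distance 0.5 between their degree histograms.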

