IN-DISTRIBUTION AND OUT-OF-DISTRIBUTION GENERALIZATION FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) are models that allow learning with structured data of varying size. Despite their popularity, the theoretical understanding of the generalization of GNNs is an under-explored topic. In this work, we expand the theoretical understanding of both in-distribution and out-of-distribution generalization of GNNs. First, we improve upon the state-of-the-art PAC-Bayes (in-distribution) generalization bound, primarily by reducing an exponential dependency on the node degree to a linear dependency. Second, utilizing tools from spectral graph theory, we prove rigorous guarantees about the out-of-distribution (OOD) size generalization of GNNs, where graphs in the training set have different numbers of nodes and edges from those in the test set. To empirically verify our theoretical findings, we conduct experiments on both synthetic and real-world graph datasets. Our computed generalization gaps for the in-distribution case significantly improve upon the state-of-the-art PAC-Bayes results. For the OOD case, experiments on community classification tasks in large social networks show that GNNs achieve strong size generalization performance in cases guaranteed by our theory.

1. INTRODUCTION

Graph neural networks (GNNs), first proposed in Scarselli et al. (2008), generalize artificial neural networks from processing fixed-size data to processing arbitrary graph-structured or relational data, which can vary in terms of the number of nodes, the number of edges, and so on. GNNs and their modern variants (Bronstein et al., 2017; Battaglia et al., 2018) have achieved state-of-the-art results in a wide range of application domains, including social networks (Hamilton et al., 2017), material sciences (Xie & Grossman, 2018), drug discovery (Wieder et al., 2020), autonomous driving (Liang et al., 2020), quantum chemistry (Gilmer et al., 2020), and particle physics (Shlomi et al., 2020). Despite their empirical successes, the theoretical understanding of GNNs is somewhat limited. Existing works largely focus on analyzing the expressiveness of GNNs. In particular, Xu et al. (2018) show that GNNs are as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) in distinguishing graphs. Chen et al. (2019) further demonstrate an equivalence between graph isomorphism testing and universal approximation of permutation-invariant functions. Loukas (2019) shows that GNNs under certain conditions (e.g., on depth and width) are Turing universal. Chen et al. (2020) and Xu et al. (2020a) respectively examine whether GNNs can count substructures and perform algorithmic reasoning. In the vein of statistical learning theory, generalization analyses for GNNs have been developed to bound the gap between training and testing errors using VC-dimension (Vapnik & Chervonenkis, 1971), Rademacher complexity (Bartlett & Mendelson, 2002), algorithmic stability (Bousquet & Elisseeff, 2002), and PAC-Bayes (McAllester, 2003) (a Bayesian extension of PAC learning (Valiant, 1984)).
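For concreteness, the message passing computation shared by most GNN variants discussed above can be sketched as follows. This is a generic mean-aggregation layer in NumPy, intended only as an illustration of why a single GNN can process graphs of varying size; it is not the specific architecture analyzed in any of the cited works, and all names are illustrative:

```python
import numpy as np

def message_passing_layer(A, X, W):
    """One generic message passing layer: each node averages its
    neighbors' features (including its own via a self-loop), then
    applies a linear map and ReLU shared across all nodes.

    A: (n, n) adjacency matrix, X: (n, d) node features,
    W: (d, d_out) weights. Note W is independent of n, so the same
    layer applies to graphs with any number of nodes.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # degree of each node (+1)
    H = (A_hat @ X) / deg                   # mean aggregation over neighbors
    return np.maximum(H @ W, 0.0)           # shared linear map + ReLU

# A 4-node path graph; the parameter count does not depend on n.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 8)
W = np.random.randn(8, 16)
H = message_passing_layer(A, X, W)
print(H.shape)  # (4, 16)
```

Stacking such layers yields a multi-layer GNN; graph-level tasks typically add a permutation-invariant readout (e.g., a mean over node embeddings).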
Depending on whether the problem setup is in-distribution (ID) or out-of-distribution (OOD), i.e., whether the test data come from the same distribution as the training data, we categorize the literature into two groups.

ID Generalization Bounds. Scarselli et al. (2018) provide a VC-dimension based generalization bound for GNNs, whereas Verma & Zhang (2019) present a stability-based generalization analysis for single-layer graph convolutional networks (GCNs) (Kipf & Welling, 2016). Both consider node classification and assume the node features are independent and identically distributed (IID), which conflicts with the common relational learning setup (e.g., semi-supervised node classification) at which GNNs excel. Relying on the neural tangent kernel (NTK) approach (Jacot et al., 2018), Du et al. (2019) characterize the generalization bound of infinite-width GNNs on graph classification. Garg et al. (2020) derive a Rademacher complexity based bound for message passing GNNs on graph classification. Lv (2021) establishes results for GCNs on node classification using Rademacher complexity as well. Based on PAC-Bayes, Liao et al. (2020) obtain a tighter bound for both GCNs and message passing GNNs on graph classification compared to (Garg et al., 2020; Scarselli et al., 2018). Subsequently, Ma et al. (2021) also leverage PAC-Bayes and show generalization guarantees of GNNs on subgroups of nodes for node classification. More recently, Li et al. (2022) study the effect of graph subsampling on the generalization of GCNs.

OOD Generalization. Yehudai et al. (2021) study size generalization for GNNs, a specific OOD setting where training and testing graphs differ in the number of nodes and edges. They show negative results: specific GNNs can perfectly fit training graphs but fail on OOD testing ones. Baranwal et al. (2021) consider specific graph generative models, i.e., the contextual stochastic block model (CSBM) (Deshpande et al., 2018), where the CSBMs during training and testing have the same means but differ in the number of nodes and the intra- and inter-class edge probabilities. They present generalization guarantees for single-layer GCNs on binary node classification tasks. Later, Maskey et al. (2022) assume yet another class of graph generative models, i.e., graphons, where the kernel is shared across training and testing but the number of nodes and edges may vary. They obtain generalization bounds for message passing GNNs on graph classification and regression that depend on the Minkowski dimension of the node feature space. Relying on a connection between over-parameterized networks and the neural tangent kernel, Xu et al. (2020b) find that task-specific architecture/feature designs help GNNs extrapolate to OOD algorithmic tasks. Wu et al. (2022a) propose an explore-to-extrapolate risk minimization framework, whose solution is proven to provide an optimal OOD model under the invariance and heterogeneity assumptions. Yang et al. (2022) propose a two-stage model that both infers the latent environment and makes predictions to generalize to OOD data; empirical studies suggest it works well on real-world molecule datasets. Wu et al. (2022b) study a new objective that can learn invariant and causal graph features that generalize well to OOD data empirically. All of the above works follow the spirit of invariant risk minimization (Arjovsky et al., 2019) and focus on designing new learning objectives. Instead, we provide generalization bound analysis from the traditional statistical learning theory perspective.

Our Contributions. In this paper, we study both in-distribution and out-of-distribution generalization for GNNs. For in-distribution graph classification tasks, we significantly improve the previous state-of-the-art PAC-Bayes results of Liao et al. (2020) by decreasing an exponential dependency on the maximum node degree to a linear dependency. For OOD node classification tasks, we do not assume any known graph generative model, in sharp contrast to existing work. We instead assume GNNs are trained and tested on subgraphs sampled via random walks from a single large underlying graph, an efficient means of generating connected subgraphs. We identify interesting cases where a graph classification task is theoretically guaranteed to perform well at size generalization, and derive generalization bounds. We validate our theoretical results by conducting experiments on synthetic graphs, and also explore size generalization on a collection of real-world social network datasets. In the in-distribution case, we observe an improvement of several orders of magnitude in numerical calculations of the generalization bound. In the out-of-distribution case, we validate that, in cases where the theory guarantees that size generalization works well, the prediction accuracy on large subgraphs is always comparable to the accuracy on small subgraphs, and in many cases is actually better.
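The random-walk subgraph sampling used in our OOD setup can be sketched as follows. This is a minimal illustration of how a simple random walk yields a connected subgraph of a large underlying graph; the exact sampling procedure analyzed in the paper may differ in details such as the walk-length distribution, and the helper name is illustrative:

```python
import random

def random_walk_subgraph(adj, walk_length, rng=None):
    """Sample the node set of a connected subgraph of a large graph
    by running a simple random walk of the given length.

    adj: dict mapping each node to a list of its neighbors.
    Returns the set of visited nodes; the subgraph they induce is
    connected, since consecutive walk steps traverse edges of the graph.
    """
    rng = rng or random.Random(0)
    node = rng.choice(list(adj))       # uniform random start node
    visited = {node}
    for _ in range(walk_length):
        if not adj[node]:
            break                      # dead end: stop the walk early
        node = rng.choice(adj[node])   # move to a uniform random neighbor
        visited.add(node)
    return visited

# 10-node cycle graph as the "large" underlying graph.
adj = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
nodes = random_walk_subgraph(adj, walk_length=5)
print(len(nodes))  # at most 6 distinct nodes (start + 5 steps)
```

Size generalization in this setup then amounts to training on subgraphs produced by short walks and testing on subgraphs produced by longer ones.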

