A PAC-BAYESIAN APPROACH TO GENERALIZATION BOUNDS FOR GRAPH NEURAL NETWORKS

Abstract

In this paper, we derive generalization bounds for two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our results reveal that the maximum node degree and the spectral norms of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in (Neyshabur et al., 2017) for fully-connected and convolutional neural networks. For MPGNNs, our PAC-Bayes bound improves over the Rademacher complexity based bound (Garg et al., 2020), showing a tighter dependency on the maximum node degree and the maximum hidden dimension. The key ingredients of our proofs are a perturbation analysis of GNNs and the generalization of PAC-Bayes analysis to non-homogeneous GNNs. We perform an empirical study on several synthetic and real-world graph datasets and verify that our PAC-Bayes bounds are tighter than existing ones.

1. INTRODUCTION

Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008; Bronstein et al., 2017; Battaglia et al., 2018) have become very popular recently due to their ability to learn powerful representations from graph-structured data, and have achieved state-of-the-art results in a variety of application domains such as social networks (Hamilton et al., 2017; Xu et al., 2018), quantum chemistry (Gilmer et al., 2017; Chen et al., 2019a), computer vision (Qi et al., 2017; Monti et al., 2017), reinforcement learning (Sanchez-Gonzalez et al., 2018; Wang et al., 2018), robotics (Casas et al., 2019; Liang et al., 2020), and physics (Henrion et al., 2017). Given a graph along with node/edge features, GNNs learn node/edge representations by propagating information on the graph via local computations shared across the nodes/edges. Based on the specific form of local computation employed, GNNs can be divided into two categories: graph convolution based GNNs (Bruna et al., 2013; Duvenaud et al., 2015; Kipf & Welling, 2016) and message passing based GNNs (Li et al., 2015; Dai et al., 2016; Gilmer et al., 2017). The former generalizes the convolution operator from regular graphs (e.g., grids) to those with arbitrary topology, whereas the latter mimics message passing algorithms and parameterizes the shared functions via neural networks.

Due to the tremendous empirical success of GNNs, there is increasing interest in understanding their theoretical properties. For example, some recent works study their expressiveness (Maron et al., 2018; Xu et al., 2018; Chen et al., 2019b), that is, what class of functions can be represented by GNNs. However, only a few works investigate why GNNs generalize so well to unseen graphs, and they are either restricted to a specific model variant (Verma & Zhang, 2019; Du et al., 2019; Garg et al., 2020) or have loose dependencies on graph statistics (Scarselli et al., 2018).
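To make the two model categories above concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the normalization choices follow common practice and are assumptions) contrasts one graph convolution layer with one message passing step:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, X, W):
    # Graph convolution (Kipf & Welling style): symmetrically
    # normalized adjacency with self-loops, then a shared linear
    # map and a nonlinearity.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return relu(A_norm @ X @ W)

def mpgnn_layer(A, X, W_msg, W_upd):
    # Message passing: each node aggregates messages from its
    # neighbors, then updates its own state with a shared network.
    messages = A @ relu(X @ W_msg)   # sum over neighbors
    return relu(X @ W_upd + messages)
```

In both cases the same weights are shared across all nodes; they differ in whether the local computation is a fixed (normalized) averaging or a learned message/update pair.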
On the other hand, GNNs have close ties to standard feedforward neural networks, e.g., multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). In particular, if each i.i.d. sample is viewed as a node, then the whole dataset becomes a graph without edges. GNNs can therefore be seen as generalizations of MLPs/CNNs, since they model not only the regularities within a sample but also the dependencies among samples as defined by the graph. It is thus natural to ask whether we can generalize the recent advances on generalization bounds for MLPs/CNNs (Harvey et al., 2017; Neyshabur et al., 2017; Bartlett et al., 2017; Dziugaite & Roy, 2017; Arora et al., 2018; 2019) to GNNs, and how graph structures would affect the resulting bounds.

In this paper, we answer the above questions by proving generalization bounds for the two primary classes of GNNs, i.e., graph convolutional networks (GCNs) (Kipf & Welling, 2016) and message passing GNNs (MPGNNs) (Dai et al., 2016; Jin et al., 2018). Our generalization bound for GCNs shows an intimate relationship with the bounds for MLPs/CNNs with ReLU activations (Neyshabur et al., 2017; Bartlett et al., 2017). In particular, they share the same term, i.e., the product of the spectral norms of the learned weights at each layer, multiplied by a factor that is additive across layers. The bound for GCNs has an additional multiplicative factor $d^{(l-1)/2}$, where $d-1$ is the maximum node degree and $l$ is the network depth. Since MLPs/CNNs are special GNNs operating on graphs without edges (i.e., $d-1=0$), the bound for GCNs coincides with the ones for MLPs/CNNs with ReLU activations (Neyshabur et al., 2017) on such degenerate graphs. Therefore, our result is a natural generalization of the existing results for MLPs/CNNs.
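The reduction described above can be checked numerically: on a graph with no edges, the adjacency matrix with self-loops is the identity, so a GCN layer degenerates to an ordinary fully-connected layer. A small sketch under these assumptions (our own illustrative code, not the paper's):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, X, W):
    # A standard GCN layer: normalized adjacency with self-loops,
    # shared linear map, ReLU nonlinearity.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return relu(A_norm @ X @ W)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 i.i.d. samples viewed as 5 nodes
W = rng.standard_normal((3, 4))
A = np.zeros((5, 5))              # edgeless graph: d - 1 = 0

# With no edges, A_hat is the identity and the GCN layer reduces
# to the MLP layer ReLU(XW).
assert np.allclose(gcn_layer(A, X, W), relu(X @ W))
```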
Our generalization bound for message passing GNNs reveals that the governing terms of the bound are similar to those for GCNs, i.e., a geometric series of the learned weights' norms and a multiplicative factor $d^{l-1}$. The geometric series appears due to weight sharing across message passing steps, and thus corresponds to the product term across layers in GCNs. The term $d^{l-1}$ encodes the key graph statistics. Our bound improves the dependency on the maximum node degree and the maximum hidden dimension compared to the recent Rademacher complexity based bound (Garg et al., 2020). Moreover, we compute the bound values on four real-world graph datasets (e.g., social networks and protein structures) and verify that our bounds are tighter.

In terms of proof techniques, our analysis follows the PAC-Bayes framework in the seminal work of (Neyshabur et al., 2017) for MLPs/CNNs with ReLU activations. However, we make two distinctive contributions customized for GNNs. First, a naive adaptation of the perturbation analysis in (Neyshabur et al., 2017) does not work for GNNs, since ReLU is not 1-Lipschitz under the spectral norm, i.e., $\|\mathrm{ReLU}(X)\|_2 \le \|X\|_2$ does not hold for some real matrices $X$. Instead, we construct the recursion on particular node representations of GNNs, such as the one with maximum $\ell_2$ norm, so that we can perform the perturbation analysis with the vector 2-norm. Second, in contrast to (Neyshabur et al., 2017), which only handles homogeneous networks, i.e., those satisfying $f(ax) = af(x)$ for $a \ge 0$, we construct a quantity of the learned weights which 1) provides a way to satisfy the constraints of the previous perturbation analysis and 2) induces a finite covering on the range of the quantity, so that the PAC-Bayes bound holds for all possible weights. This generalizes the analysis to non-homogeneous GNNs like typical MPGNNs.

The rest of the paper is organized as follows. In Section 2, we introduce the background material necessary for our analysis.
We then present our generalization bounds and compare them to existing results in Section 3. We also provide an empirical study supporting our theoretical arguments in Section 4. Finally, we discuss extensions, limitations, and some open problems in Section 5.
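To illustrate the first technical obstacle mentioned above, entrywise ReLU can indeed increase the spectral norm of a matrix. The following small counterexample is our own, for illustration:

```python
import numpy as np

# A matrix whose spectral norm GROWS under entrywise ReLU.
X = np.array([[1.0, -1.0],
              [1.0,  1.0]])   # sqrt(2) times a 45-degree rotation

relu_X = np.maximum(X, 0.0)   # [[1, 0], [1, 1]]

norm_X = np.linalg.norm(X, ord=2)            # sqrt(2), about 1.414
norm_relu_X = np.linalg.norm(relu_X, ord=2)  # (1+sqrt(5))/2, about 1.618

# So ||ReLU(X)||_2 <= ||X||_2 fails for this X: ReLU is not
# 1-Lipschitz under the spectral norm.
assert norm_relu_X > norm_X
```

Both singular values of $X$ equal $\sqrt{2}$, while the largest singular value of $\mathrm{ReLU}(X)$ is the golden ratio, which is strictly larger; this is why the perturbation analysis must instead track individual node representations under the vector 2-norm.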

2. BACKGROUND

In this section, we first explain our analysis setup, including notation and assumptions. We then describe the two representative GNN models in detail. Finally, we review PAC-Bayes analysis.

2.1. ANALYSIS SETUP

In the following analysis, we consider the $K$-class graph classification problem, which is common in the GNN literature: given a graph sample $z$, we would like to classify it into one of $K$ predefined classes. We will discuss extensions to other problems, like graph regression, in Section 5. Each graph sample $z$ is a triplet of an adjacency matrix $A$, node features $X \in \mathbb{R}^{n \times h_0}$, and an output label $y \in \mathbb{R}^{1 \times K}$, i.e., $z = (A, X, y)$, where $n$ is the number of nodes and $h_0$ is the input feature dimension.

We start our discussion by defining our notation. Let $\mathbb{N}_k^+$ denote the first $k$ positive integers, i.e., $\mathbb{N}_k^+ = \{1, 2, \dots, k\}$, $|\cdot|_p$ the vector $p$-norm, and $\|\cdot\|_p$ the operator norm induced by the vector $p$-norm. Further, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $e$ the base of the natural logarithm function $\log$, $A[i,j]$ the $(i,j)$-th element of matrix $A$, and $A[i,:]$ the $i$-th row. We use parentheses to avoid ambiguity, e.g., $(AB)[i,j]$ means the $(i,j)$-th element of the product matrix $AB$. We then introduce some terminology from statistical learning theory and define the sample space as $\mathcal{Z}$, with $z = (A, X, y) \in \mathcal{Z}$, where $X \in \mathcal{X}$ (node feature space) and $A \in \mathcal{G}$ (graph space), data distribution

