POLYNOMIAL GRAPH CONVOLUTIONAL NETWORKS

Abstract

Graph Convolutional Neural Networks (GCNs) exploit convolution operators, based on some neighborhood aggregation scheme, to compute representations of graphs. The most common convolution operators only exploit local topological information. To consider wider topological receptive fields, the mainstream approach is to non-linearly stack multiple Graph Convolutional (GC) layers. In this way, however, interactions among GC parameters at different levels pose a bias on the flow of topological information. In this paper, we propose a different strategy, considering a single graph convolution layer that independently exploits neighboring nodes at different topological distances, generating decoupled representations for each of them. These representations are then processed by subsequent readout layers. We implement this strategy by introducing the Polynomial Graph Convolution (PGC) layer, which we prove to be more expressive than the most common convolution operators and their linear stacking. Our contribution is not limited to the definition of a convolution operator with a larger receptive field: we prove, both theoretically and experimentally, that the common way multiple non-linear graph convolutions are stacked limits the expressiveness of the neural network. Specifically, we show that a Graph Neural Network architecture with a single PGC layer achieves state-of-the-art performance on many commonly adopted graph classification benchmarks.

1. INTRODUCTION

In the last few years, the definition of machine learning methods, particularly neural networks, for graph-structured input has been gaining increasing attention in the literature (Defferrard et al., 2016; Errica et al., 2020). In particular, Graph Convolutional Networks (GCNs), based on the definition of a convolution operator in the graph domain, are relatively fast to compute and have shown good predictive performance. Graph Convolutions (GCs) are generally based on a neighborhood aggregation scheme (Gilmer et al., 2017) considering, for each node, only its direct neighbors. By stacking multiple GC layers, the size of the receptive field of deeper filters increases (resembling standard convolutional networks). However, stacking too many GC layers may be detrimental to the network's ability to represent meaningful topological information (Li et al., 2018) due to excessive Laplacian smoothing. Moreover, in this way interactions among GC parameters at different layers pose a bias on the flow of topological information. For these reasons, several convolution operators have been defined in the literature, differing from one another in the considered aggregation scheme. We argue that the performance of GC networks could benefit from increasing the size of the receptive fields; but since with existing GC architectures this effect can only be obtained by stacking more GC layers, the increased difficulty in training and the limitation of expressiveness given by the stacking of many local layers end up hurting their predictive capabilities. Consequently, the performances of existing GCNs are strongly dependent on the specific architecture. Therefore, the performances of existing graph neural networks are limited by (i) the necessity to select an appropriate convolution operator, and (ii) the limitation of expressiveness caused by large receptive fields being achievable only by stacking many local layers. In this paper, we tackle both issues following a different strategy.
We propose the Polynomial Graph Convolution (PGC) layer, which independently considers neighboring nodes at different topological distances (i.e. arbitrarily large receptive fields). The PGC layer addresses the problem of selecting a suitable convolution operator, being able to represent many existing convolutions in the literature and being more expressive than most of them. As for the second issue, a PGC layer, directly considering larger receptive fields, can represent a richer set of functions compared to the linear stacking of two or more graph convolution layers, i.e. it is more expressive. Moreover, the linear PGC design allows to consider large receptive fields without incurring the typical issues related to training deep networks. We developed the Polynomial Graph Convolutional Network (PGCN), an architecture that exploits the PGC layer to perform graph classification tasks. We empirically evaluate the proposed PGCN on eight commonly adopted graph classification benchmarks, comparing it to several state-of-the-art GCNs and consistently achieving higher or comparable predictive performance. Differently from other works in the literature, the contribution of this paper is to show that the common approach of stacking multiple GC layers may not provide an optimal exploitation of topological information, because of the strong coupling of the depth of the network with the size of the topological receptive fields. In our proposal, the depth of the PGCN is decoupled from the receptive field size, allowing to build deep GNNs while avoiding the oversmoothing problem.

2. NOTATION

We use italic letters to refer to variables, bold lowercase letters to refer to vectors, and bold uppercase letters to refer to matrices. The elements of a matrix A are referred to as a_ij (and similarly for vectors). We use uppercase letters to refer to sets or tuples. Let G = (V, E, X) be a graph, where V = {v_0, ..., v_{n-1}} denotes the set of vertices (or nodes) of the graph, E ⊆ V × V is the set of edges, and X ∈ R^{n×s} is a multivariate signal on the graph nodes, with the i-th row representing the attributes of v_i. We define A ∈ R^{n×n} as the adjacency matrix of the graph, with elements a_ij = 1 ⇔ (v_i, v_j) ∈ E. With N(v) we denote the set of nodes adjacent to node v. Let also D ∈ R^{n×n} be the diagonal degree matrix with d_ii = Σ_j a_ij, and L the normalized graph Laplacian, defined as L = I − D^{−1/2} A D^{−1/2}, where I is the identity matrix. With GConv_θ(x_v, G) we denote a graph convolution with set of parameters θ. A GCN with k levels of convolutions is denoted as GConv_{θ_k}(... GConv_{θ_1}(x_v, G) ..., G). For a discussion of the most common GCNs we refer to Appendix A. We indicate with X̄ the input representation fed to a layer, where X̄ = X if we are considering the first layer of the graph convolutional network, or X̄ = H^{(i−1)} if considering the i-th graph convolution layer.
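As a concrete illustration of these definitions, the degree matrix and normalized Laplacian of a small path graph can be computed as follows (a minimal NumPy sketch; the toy graph and variable names are our own, not from the paper):

```python
import numpy as np

# Toy graph: 3 nodes, edges (v_0, v_1) and (v_1, v_2).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

# Diagonal degree matrix D with d_ii = sum_j a_ij.
D = np.diag(A.sum(axis=1))

# Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
```

For an undirected graph, L is symmetric with unit diagonal and eigenvalues in [0, 2], which is what makes the Chebyshev constructions used later well defined.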

3. POLYNOMIAL GRAPH CONVOLUTION (PGC)

In this section, we introduce the Polynomial Graph Convolution (PGC), able to simultaneously and directly consider all topological receptive fields up to k hops, just like the ones obtained by the graph convolutional layers in a stack of size k. PGC, however, does not incur the typical limitations related to the complex interaction among the parameters of the GC layers. Actually, we show that PGC is more expressive than the most common convolution operators. Moreover, we prove that a single PGC convolution of order k is capable of implementing k linearly stacked layers of convolutions proposed in the literature, while also providing additional functions that cannot be realized by the stack. Thus, the PGC layer extracts topological information from the input graph, decoupling in an effective way the depth of the network from the size of the receptive field. Its combination with deep MLPs allows to obtain deep graph neural networks that can overcome the common oversmoothing problem of current architectures. The basic idea underpinning the definition of PGC is to consider the case in which the graph convolution can be expressed as a polynomial of the powers of a transformation T of the adjacency matrix. This definition is very general, and thus it incorporates many existing graph convolutions as special cases. Given a graph G = (V, E, X) with adjacency matrix A, the Polynomial Graph Convolution (PGC) layer of degree k, transformation T of A, and size m, is defined as

PGConv_{k,T,m}(X, A) = R_{k,T} W,    (1)

where T(A) ∈ R^{n×n}, R_{k,T} = [X, T(A)X, T(A)^2 X, ..., T(A)^k X] ∈ R^{n×s(k+1)}, and W ∈ R^{s(k+1)×m} is a learnable weight matrix. For the sake of presentation, we will consider W as composed of blocks: W = [W_0; ...; W_k], with W_j ∈ R^{s×m}. In the following, we show that PGC is very expressive, being able to implement commonly used convolutions as special cases.
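The layer defined in equation 1 is straightforward to implement; the following is a minimal NumPy sketch (function and variable names are our own, not from the paper's code):

```python
import numpy as np

def pgc_layer(X, A, W, k, T=lambda A: A):
    """Polynomial Graph Convolution: H = R_{k,T} W, with
    R_{k,T} = [X, T(A)X, T(A)^2 X, ..., T(A)^k X]."""
    TA = T(A)
    blocks, P = [X], X
    for _ in range(k):
        P = TA @ P              # T(A)^i X, computed incrementally
        blocks.append(P)
    R = np.concatenate(blocks, axis=1)   # n x s(k+1)
    return R @ W                          # n x m

# Toy sizes: n=4 nodes, s=3 features, degree k=2, m=5 hidden units.
rng = np.random.default_rng(0)
n, s, k, m = 4, 3, 2, 5
X = rng.standard_normal((n, s))
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T            # symmetric adjacency, no self loops
W = rng.standard_normal((s * (k + 1), m))
H = pgc_layer(X, A, W, k)
print(H.shape)  # (4, 5)
```

Note that for k = 0 the layer reduces to a plain linear projection of the node features, consistent with the first block of R_{k,T} being X itself.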

3.1. GRAPH CONVOLUTIONS IN LITERATURE AS PGC INSTANTIATIONS

The PGC layer in equation 1 is designed to be a generalization of the linear stacking of some of the most common spatially localized graph convolutions. The idea is that spatially localized convolutions aggregate over neighbors (the message passing phase) using a transformation of the adjacency matrix (e.g. a normalized graph Laplacian). In this section we provide a formal proof, as a theoretical contribution of this paper, that linearly stacked convolutions can be rewritten as polynomials of powers of the transformed adjacency matrix. We start by showing how common graph convolution operators can be defined as particular instances of a single PGC layer (in most cases with k = 1). Then, we prove that linearly stacking any two PGC layers produces a convolution that can be written as a single PGC layer as well.

Spectral: The Spectral convolution (Defferrard et al., 2016) can be considered the application of the Fourier transform to graphs. It is obtained via Chebyshev polynomials of the Laplacian matrix. A layer of Spectral convolutions of order k' can be implemented by a single PGC layer instantiating T(A) to be the graph Laplacian matrix (or one of its normalized versions), setting the PGC k value to k', and setting the weight matrix to encode the constraints given by the Chebyshev polynomials. For instance, we can get the output H of a Spectral convolution layer with k' = 3 by the following PGC:

H = [X, LX, L^2 X, L^3 X] W, where W = [W_0 − W_2; W_1 − 3W_3; 2W_2; 4W_3], W_i ∈ R^{s×m}.    (2)

GCN: The Graph Convolution (Kipf & Welling, 2017) (GCN) is a simplification of the Spectral convolution. The authors propose to fix the order k = 1 of the Chebyshev spectral convolution to obtain a linear first-order filter for each graph convolutional layer in a neural network. By setting k = 1 and T(A) = D̃^{−1/2} Ã D̃^{−1/2} ∈ R^{n×n}, with Ã = A + I and d̃_ii = Σ_j ã_ij, we obtain the following equivalent equation:

H = [X, D̃^{−1/2} Ã D̃^{−1/2} X] W, where W = [0; W_1],    (3)

where 0 is an s × m matrix with all entries equal to zero and W_1 ∈ R^{s×m} is the weight matrix of GCN. Note that GCN does not treat a node differently from its neighbors, thus in this case there is no contribution from the first component of R_{k,T}.

GraphConv: In Morris et al. (2019) a powerful graph convolution has been proposed, inspired by the Weisfeiler-Lehman graph invariant. In this case, T(A) = A (the identity function), and k is again set to 1. A single GraphConv layer can be written as:

H = [X, AX] W, where W = [W_0; W_1], and W_0, W_1 ∈ R^{s×m}.    (4)

GIN: The Graph Isomorphism Network (GIN) convolution was defined in Xu et al. (2019) as H = MLP((1 + ε) X + A X). Technically, this is a composition of a convolution (that is a linear operator) with a multi-layer perceptron. Let us thus decompose the MLP as f ∘ g, where g is an affine projection via weight matrix W, and f incorporates the element-wise non-linearity and the other layers of the MLP. We can thus isolate the GIN graph convolution component and define it as a specific PGC instantiation. We let k = 1 and T be the identity function as before. A single GIN layer can then be obtained as:

H = [X, AX] W, where W = [(1 + ε)W_1; W_1].    (5)

Note that, differently from GraphConv, in this case the blocks of the matrix W are tied. Figure 1 in Appendix B depicts the expressivity of different graph convolution operators in terms of the respective constraints on the weight matrix W.
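The equivalences above are easy to check numerically. The sketch below (our own toy example, with T(A) = A) verifies that a GraphConv layer and the linear GIN convolution component coincide with their PGC forms:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, m = 5, 3, 4
X = rng.standard_normal((n, s))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T            # symmetric adjacency, no self loops
W0, W1 = rng.standard_normal((s, m)), rng.standard_normal((s, m))

# GraphConv in message passing form: H = X W0 + A X W1 ...
H_graphconv = X @ W0 + A @ X @ W1
# ... and as a PGC instantiation with k = 1, T(A) = A, W = [W0; W1].
H_pgc = np.concatenate([X, A @ X], axis=1) @ np.vstack([W0, W1])
assert np.allclose(H_graphconv, H_pgc)

# GIN convolution component: ((1 + eps) X + A X) W1,
# i.e. a PGC with tied blocks W = [(1 + eps) W1; W1].
eps = 0.3
H_gin = ((1 + eps) * X + A @ X) @ W1
H_gin_pgc = np.concatenate([X, A @ X], axis=1) @ np.vstack([(1 + eps) * W1, W1])
assert np.allclose(H_gin, H_gin_pgc)
```

The tied blocks in the GIN case make visible the constraint that distinguishes it from the unconstrained GraphConv weights.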
The comparison is made easy by the definition of the different graph convolution layers as instances of PGC layers. Indeed, it is easy to see from eqs. (3)-(5) that GraphConv is more expressive than GCN and GIN.

3.2. LINEARLY STACKED GRAPH CONVOLUTIONS AS PGC INSTANTIATIONS

In the previous section, we have shown that common graph convolutions can be expressed as particular instantiations of a PGC layer. In this section, we show that a single PGC layer can model the linear stacking of any number of PGC layers (using the same transformation T). Thus, a single PGC layer can model all the functions computed by arbitrarily many linearly stacked graph convolution layers defined in the previous section. We then show that a PGC layer also includes additional functions compared to the stacking of simpler PGC layers, which makes it more expressive.

Theorem 1. Let us consider two linearly stacked PGC layers using the same transformation T. The resulting linear Graph Convolutional network can be expressed by a single PGC layer.

Due to space limitations, the proof is reported in Appendix C. Here it is important to know that the proof of Theorem 1 tells us that a single PGC of order k can represent the linear stacking of any q (T-compatible) convolutions such that k = Σ_{i=1}^q d_i, where d_i is the degree of the convolution at level i. We will now show that a single PGC layer can also represent other functions, i.e. it is more general than the stacking of existing convolutions. Let us consider, for the sake of simplicity, the stacking of 2 PGC layers with k = 1 (that are equivalent to GraphConv layers, see equation 4), each with parameters W^{(i)} = [W^{(i)}_0; W^{(i)}_1], i = 1, 2, with W^{(1)}_0, W^{(1)}_1 ∈ R^{s×m_1} and W^{(2)}_0, W^{(2)}_1 ∈ R^{m_1×m_2}. The same reasoning can be applied to any other convolution among the ones presented in Section 3.1. We can explicitly write the equations computing the hidden representations:

H^{(1)} = X W^{(1)}_0 + A X W^{(1)}_1,

H^{(2)} = H^{(1)} W^{(2)}_0 + A H^{(1)} W^{(2)}_1    (7)
       = X W^{(1)}_0 W^{(2)}_0 + A X (W^{(1)}_1 W^{(2)}_0 + W^{(1)}_0 W^{(2)}_1) + A^2 X W^{(1)}_1 W^{(2)}_1.

A single PGC layer can implement this second-order convolution as:

H^{(2)} = [X, AX, A^2 X] [W^{(1)}_0 W^{(2)}_0; W^{(1)}_1 W^{(2)}_0 + W^{(1)}_0 W^{(2)}_1; W^{(1)}_1 W^{(2)}_1].    (8)
Let us compare it with a PGC layer that corresponds to the same 2-layer architecture but has no constraints on the weight matrix, i.e.:

H^{(2)} = [X, AX, A^2 X] [W_0; W_1; W_2], W_i ∈ R^{s×m_2}, i = 0, 1, 2.    (9)

Even though it is not obvious at a first glance, equation 8 is more constrained than equation 9, i.e. there are some values of W_0, W_1, W_2 in equation 9 that cannot be obtained for any W^{(1)} = [W^{(1)}_0; W^{(1)}_1] and W^{(2)} = [W^{(2)}_0; W^{(2)}_1] in equation 8, as proven by the following theorem.

Theorem 2. A PGC layer with k = 2 is more general than two stacked PGC layers with k = 1 with the same number of hidden units m.

We refer the reader to Appendix C for the proof. Notice that the GraphConv layer is equivalent to a PGC layer with k = 1 (if no constraints on W are considered, see later in this section). Since GraphConv is, in turn, more general than GCN and GIN, the above theorem also holds for those graph convolutions. Moreover, Theorem 2 trivially implies that a linear stack of q PGC layers with k = 1 is less expressive than a single PGC layer with k = q. If we now consider that in many GCN architectures it is typical, and useful, to concatenate the output of all convolution layers before aggregating the node representations, then it is not difficult to see that such a concatenation can be obtained by widening the weight matrix of PGC. Let us thus consider a network that generates a hidden representation that is the concatenation of the representations computed at each layer, i.e. H = [H^{(1)}, H^{(2)}] ∈ R^{n×m}, m = m_1 + m_2. We can represent a 2-layer GraphConv network as a single PGC layer as:

H = [X, AX, A^2 X] [ W^{(1)}_0, W^{(1)}_0 W^{(2)}_0 ;
                     W^{(1)}_1, W^{(1)}_1 W^{(2)}_0 + W^{(1)}_0 W^{(2)}_1 ;
                     0,         W^{(1)}_1 W^{(2)}_1 ].    (10)

More in general, if we consider k GraphConv convolutional layers (see equation 4), each with parameters W^{(i)} = [W^{(i)}_0; W^{(i)}_1], i = 1, ..., k, W^{(i)}_0, W^{(i)}_1 ∈ R^{m_{i-1}×m_i}, m_0 = s, m = Σ_{j=1}^k m_j, the weight matrix W ∈ R^{s·(k+1)×m} can be defined as follows:

W = [ F_{0,1}(W^{(1)}), F_{0,2}(W^{(1)}, W^{(2)}), F_{0,3}(W^{(1)}, W^{(2)}, W^{(3)}), ... ;
      F_{1,1}(W^{(1)}), F_{1,2}(W^{(1)}, W^{(2)}), F_{1,3}(W^{(1)}, W^{(2)}, W^{(3)}), ... ;
      0,                F_{2,2}(W^{(1)}, W^{(2)}), F_{2,3}(W^{(1)}, W^{(2)}, W^{(3)}), ... ;
      0,                0,                         F_{3,3}(W^{(1)}, W^{(2)}, W^{(3)}), ... ;
      ... ],    (11)

where the blocks F_{i,j}(), i, j ∈ {0, ..., k}, i ≤ j, are defined as

F_{i,j}(W^{(1)}, ..., W^{(j)}) = Σ_{(z_1,...,z_j) ∈ {0,1}^j s.t. Σ_{q=1}^j z_q = i} Π_{l=1}^j W^{(l)}_{z_l}.

We can now generalize this formulation by concatenating the output of k + 1 PGC convolutions of degree ranging from 0 up to k. This gives rise to the following definitions:

W = [ W_{0,0}, W_{0,1}, W_{0,2}, ..., W_{0,k} ;
      0,       W_{1,1}, W_{1,2}, ..., W_{1,k} ;
      0,       0,       W_{2,2}, ..., W_{2,k} ;
      ... ;
      0,       0,       0,       ..., W_{k,k} ],

H = [ X W_{0,0}, (X W_{0,1} + T(A) X W_{1,1}), ..., (X W_{0,k} + ... + T(A)^k X W_{k,k}) ],    (12)

where we put no constraints among the matrices W_{i,j} ∈ R^{s×m_j}, m = Σ_{j=0}^k m_j, which are considered mutually independent. Note that, as a consequence of Theorem 2, the network defined in equation 12 is more expressive than the one obtained by concatenating different GraphConv layers as defined in equation 11. It can also be noted that the same network can actually be seen as a single PGC layer of degree k with a constraint on the weight matrix (i.e., it must be an upper triangular block matrix). Of course, any weight sharing policy can be easily implemented, e.g. by imposing W_{i,j} = W_i for all j, which corresponds to the concatenation of the representations obtained at level i by a single stack of convolutions. In addition to reducing the number of free parameters, this weight sharing policy also reduces the computational burden, since the representation at level i is obtained by adding the contribution of matrix W_i, i.e. T(A)^i X W_i, to the representation at level i − 1.
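The identity in equation 8 can be verified numerically. The following sketch (toy sizes and variable names of our choosing) checks that two linearly stacked GraphConv layers coincide with a single degree-2 PGC layer using the structured weight matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, m1, m2 = 6, 3, 4, 5
X = rng.standard_normal((n, s))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T            # symmetric adjacency, no self loops
W10, W11 = rng.standard_normal((s, m1)), rng.standard_normal((s, m1))
W20, W21 = rng.standard_normal((m1, m2)), rng.standard_normal((m1, m2))

# Two linearly stacked GraphConv (PGC, k=1) layers.
H1 = X @ W10 + A @ X @ W11
H2_stacked = H1 @ W20 + A @ H1 @ W21

# The same function as one degree-2 PGC layer with the structured
# block weight matrix [W10 W20; W11 W20 + W10 W21; W11 W21].
W = np.vstack([W10 @ W20,
               W11 @ W20 + W10 @ W21,
               W11 @ W21])
H2_pgc = np.concatenate([X, A @ X, A @ A @ X], axis=1) @ W

assert np.allclose(H2_stacked, H2_pgc)
```

Dropping the multiplicative structure of W (i.e. using three free blocks, as in equation 9) yields the strictly larger function class that Theorem 2 refers to.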

3.3. COMPUTATIONAL COMPLEXITY

As detailed in the previous discussion, the degree k of a PGC layer controls the size of its receptive field. In terms of the number of parameters, fixing the node attribute size s and the size m of the hidden representation, the number of parameters of the PGC is O(s · k · m), i.e. it grows linearly in k. Thus, the number of parameters of a PGC layer is of the same order of magnitude as that of k stacked graph convolution layers based on message passing (Gilmer et al., 2017) (i.e. GraphConv, GIN and GCN, presented in Section 3.1). If we consider the number of required matrix multiplications, compared to message passing GC networks, in our case it is possible to pre-compute the terms T(A)^i X before training starts, making the computation of the convolution cheaper than message passing. In Appendix E, we report an example that makes evident the significant improvement in training time that can be gained with respect to message passing.
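A sketch of this pre-computation strategy (our own illustrative code, assuming a dense NumPy representation; in practice A would typically be sparse):

```python
import numpy as np

def precompute_pgc_features(X, A, k, T=lambda A: A):
    """Compute R_{k,T} = [X, T(A)X, ..., T(A)^k X] once, before training.
    Each power is obtained incrementally from the previous one, so only
    k graph-matrix products are needed in total, instead of k products
    per forward pass as in message passing."""
    TA = T(A)
    blocks, P = [X], X
    for _ in range(k):
        P = TA @ P
        blocks.append(P)
    return np.concatenate(blocks, axis=1)   # n x s(k+1)

# After pre-computation, every forward pass of the PGC layer is a single
# dense product R @ W, regardless of k.
rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2))
A = np.ones((6, 6)) - np.eye(6)             # toy complete graph
R = precompute_pgc_features(X, A, 3)
print(R.shape)  # (6, 8)
```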

4. POLYNOMIAL GRAPH CONVOLUTIONAL NETWORK (PGCN)

In this section, we present a neural architecture that exploits the PGC layer to perform graph classification tasks. Note that, differently from other GCN architectures, in our architecture (exploiting the PGC layer) the depth of the network is completely decoupled from the size k of the receptive field. The initial stage of the model consists of a first PGC layer with k = 1. The role of this first layer is to develop an initial node embedding that helps the subsequent PGC layer to fully exploit its power. In fact, in bioinformatics datasets where node labels X are one-hot encoded, all matrices X, AX, ..., A^k X are very sparse, which, as we observed in preliminary experiments, negatively influences learning. Table 4 in Appendix F compares the sparsity of the PGC representation using the original one-hot encoded labels against their embedding obtained with the first PGC layer. The analysis shows that using this first layer the network can work on significantly denser representations of the nodes. Note that this first stage of the model does not significantly bound the expressiveness of the PGC layer. A dense input for the PGC layer could have been obtained by using an embedding layer that is not a graph convolutional operator. However, this choice would have made it difficult to compare our results with other state-of-the-art models in Section 6, since the same input transformation could have been applied to other models as well, making the contribution of the PGC layer to the performance improvement unclear. This is why we decided to use a PGC layer with k = 1 (equivalent to a GraphConv) to compute the node embedding, making the results fully comparable, since we are using only graph convolutions in our PGCN. For the datasets that do not have node labels (like the social network datasets), using the PGC layer with k = 1 allows to create a label for each node that is then used by the subsequent larger PGC layer to compute richer node representations.
After this first PGC layer, a PGC layer of degree k as described in equation 12 is applied. In order to reduce the number of hyper-parameters, we adopted the same number m/(k+1) of columns (i.e., hidden units) for all matrices W_{i,j}, i.e. W_{i,j} ∈ R^{s×m/(k+1)}. A graph-level representation s ∈ R^{3m} based on the PGC layer output H is obtained by an aggregation layer that exploits three different aggregation strategies over the whole set of nodes V, for j = 1, ..., m:

s^avg_j = avg({h^{(j)}_v | v ∈ V}), s^max_j = max({h^{(j)}_v | v ∈ V}), s^sum_j = sum({h^{(j)}_v | v ∈ V}),

s = [s^avg_1, s^max_1, s^sum_1, ..., s^avg_m, s^max_m, s^sum_m].

The readout part of the model is composed of q dense feed-forward layers, where we consider q and the number of neurons per layer as hyper-parameters. Each of these layers uses the ReLU activation function and is defined as y_j = ReLU(W^readout_j y_{j-1} + b^readout_j), j = 1, ..., q, where y_0 = s. Finally, the output layer of the PGCN for a c-class classification task is defined as o = LogSoftmax(W^out y_q + b^out).
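The aggregation step can be sketched as follows (a minimal NumPy version with a function name of our own; the interleaved ordering follows the definition of s above):

```python
import numpy as np

def graph_readout(H):
    """Aggregate node representations H (n x m) into a graph-level
    vector s in R^{3m}, interleaving avg, max and sum per feature:
    s = [s1_avg, s1_max, s1_sum, ..., sm_avg, sm_max, sm_sum]."""
    stats = np.stack([H.mean(axis=0), H.max(axis=0), H.sum(axis=0)], axis=1)
    return stats.reshape(-1)   # row j of stats is (avg, max, sum) of column j

# Toy example: a 2-node graph with m = 2 features per node.
H = np.array([[1., 4.],
              [3., 0.]])
s = graph_readout(H)
print(s)  # [2. 3. 4. 2. 4. 4.]
```

Combining several permutation-invariant aggregators in this way keeps the graph-level vector invariant to node ordering while retaining more information than any single aggregator alone.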

5. COMPARISON VS MULTI-SCALE GCN ARCHITECTURES IN LITERATURE

Some recent works in the literature exploit the idea of extending graph convolution layers to increase the receptive field size. In general, the majority of these models, which concatenate polynomial powers of the adjacency matrix A, are designed to perform node classification, while the proposed PGCN is developed to perform graph classification. In this regard, we want to point out that the novelty introduced in this paper is not limited to a novel GC layer: the proposed PGCN is a complete architecture to perform graph classification. Atwood & Towsley (2016) proposed a method that exploits the power series of the probability transition matrix, which is multiplied (using the Hadamard product) by the inputs. The method differs from the PGCN in how the activation is computed, and also because the activations computed for each exponentiation are summed instead of being concatenated. Similarly, in Defferrard et al. (2016) the model exploits the Chebyshev polynomials and, differently from PGCN, sums them over k. This architectural choice makes the proposed method less general than the PGCN. Indeed, as shown in Section 3.1, the model proposed in (Defferrard et al., 2016) is an instance of the PGC. In (Xu et al., 2018) the authors proposed to modify the common aggregation layer in such a way that, for each node, the model aggregates all the intermediate representations computed in the previous GC layers. Differently from PGCN, this model exploits the message passing method, introducing a bias in the flow of the topological information. Note that, as proven in Theorem 2, a PGC layer of degree k is not equivalent to concatenating the output of k stacked GC layers, even though the PGC layer can also implement this particular architecture. Another interesting approach is proposed in (Tran et al., 2018), where the authors consider larger receptive fields compared to standard graph convolutions.
However, they focus on a single convolution definition (using just the adjacency matrix) and consider shortest paths (differently from PGCN, which exploits matrix exponentiations, i.e. random walks). In terms of expressiveness, it is complex to compare methods that exploit matrix exponentiations with methods based on shortest paths. However, it is interesting to notice that, thanks to the very general structure of the PGC layer, it is easy to modify the PGC definition to use shortest paths instead of exponentiations of the transformed adjacency matrix. We plan to explore this option as a future development of the PGCN. Wu et al. introduce a simplification of the graph convolution operator, dubbed Simple Graph Convolution (SGC) (Wu et al., 2019). The proposed model is based on the idea that perhaps the nonlinear operator introduced by GCNs is not essential; basically, the authors propose to stack several linear GC operators. In Theorem 2 we prove that stacking k GC layers is less expressive than using a single PGC layer of degree k. Therefore, we can conclude that the PGC layer is a generalization of the SGC. In (Liao et al., 2019) the authors construct a deep graph convolutional network exploiting particular localized polynomial filters based on the Lanczos algorithm, which leverages multi-scale information. This convolution can be easily implemented by a PGC layer. In (Chen et al., 2019) the authors propose to replace the neighbor aggregation function with graph augmented features, which combine node degree features and multi-scale graph propagated features. Basically, the proposed model concatenates the node degree with the power series of the normalized adjacency matrix. Note that the graph augmented features differ from R_{k,T}, used in the PGC layer. Another difference with respect to the PGCN resides in the subsequent part of the model.
Indeed, instead of projecting the multi-scale features using a structured weight matrix, the model proposed in (Chen et al., 2019) aggregates the graph augmented features of each vertex and projects each of these subsets using an MLP. The model readout then sums the obtained results over all vertices and projects them using another MLP. Luan et al. (2019) introduced two deep GCNs that rely on Krylov blocks. The first one exploits a GC layer, named snowball, that concatenates multi-scale features incrementally, resulting in a densely-connected graph network. The architecture stacks several layers and exploits nonlinear activation functions. Both these aspects make the gradient flow more complex compared to the PGCN. The second model, called Truncated Krylov, concatenates multi-scale features in each layer. In this model, differently from PGCN, the weight matrix of each layer has no structure, thus topological features from all levels are mixed together. A similar approach is proposed in (Rossi et al., 2020), where the authors propose an alternative method, named SIGN, to scale GNNs to very large graphs. This method uses as a building block the set of exponentiations of linear diffusion operators, where every exponentiation of the diffusion operator is linearly projected by a learnable matrix. Moreover, differently from the PGC layer, a nonlinear function is applied on the concatenation of the diffusion operators, making the gradient flow more complex compared to the PGCN. Very recently, Liu et al. (2020) proposed a model dubbed Deep Adaptive Graph Neural Network, which learns node representations by adaptively incorporating information from large receptive fields. Differently from PGCN, the model first exploits an MLP network for node feature transformation. Then it constructs a multi-scale representation leveraging the computed node feature transformations and the exponentiations of the adjacency matrix.
This representation is obtained by stacking the various adjacency matrix exponentiations (thus obtaining a 3-dimensional tensor). Similarly to (Luan et al., 2019), also in this case the model projects the obtained multi-scale representation using a weight matrix that has no structure, so the topological features from all levels are mixed together. Moreover, this projection also uses (trainable) retainment scores, which measure how much information from the representations derived by the different propagation layers should be retained to generate the final representation of each node, in order to adaptively balance the information from local and global neighborhoods. This makes the gradient flow more complex compared to the PGCN, and also impacts the computational complexity.

6. EXPERIMENTAL SETUP AND RESULTS

In this section, we introduce our model setup, the adopted datasets, the baseline models, and the hyper-parameter selection strategy. We then report and discuss the results obtained by the PGCN. For implementation details please refer to Appendix G.

Datasets. We empirically validated the proposed PGCN on five commonly adopted graph classification benchmarks modeling bioinformatics problems: PTC (Helma et al., 2001), NCI1 (Wale et al., 2008), PROTEINS (Borgwardt et al., 2005), D&D (Dobson & Doig, 2003), and ENZYMES (Borgwardt et al., 2005). Moreover, we also evaluated the PGCN on 3 large social graph datasets: COLLAB, IMDB-B, IMDB-M (Yanardag & Vishwanathan, 2015). We report more details in Appendix D.

Baselines and hyper-parameter selection. We compare PGCN against several GNN architectures that achieved state-of-the-art results on the adopted datasets. Specifically, we considered PSCN (Niepert et al., 2016), the Funnel GCNN (FGCNN) model (Navarin et al., 2020), DGCNN (Zhang et al., 2018), GIN (Xu et al., 2019), DIFFPOOL (Ying et al., 2018) and GraphSage (Hamilton et al., 2017). Note that these graph classification models exploit the convolutions presented in Section 3.1. From (Errica et al., 2020) we also report the results of a structure-agnostic baseline model. The results were obtained by performing 5 runs of 10-fold cross-validation. The hyper-parameters of the model (number of hidden units, learning rate, weight decay, k, q) were selected using a grid search, where the explored sets of values were changed based on the considered dataset. Other details about validation are reported in Appendix I. The results reported in Xu et al. (2019); Chen et al. (2019); Ying et al. (2018) are not considered in our comparison, since their model selection strategy differs from the one we adopted, which makes the results not comparable. The importance of the validation strategy is discussed in Errica et al. (2020), where the results of a fair comparison among the considered baseline models are reported. For the sake of completeness, we also report (and compare) in Appendix J the results obtained by evaluating the PGCN method with the validation policy used in Xu et al. (2019).

6.1. RESULTS AND DISCUSSION

The results reported in Table 1 show that the PGCN achieves higher results on all but one of the considered datasets compared to competing methods. In particular, on NCI1 and ENZYMES the proposed method outperforms state-of-the-art results: in both cases, the performances of PGCN and of the best competing method are more than one standard deviation apart. Also on the PTC, D&D, PROTEINS, IMDB-B and IMDB-M datasets PGCN shows a slight improvement over the results of the FGCNN and DGCNN models. Furthermore, on the bioinformatics datasets PGCN achieves a significantly lower standard deviation (evaluated over the 5 runs of 10-fold cross-validation). On the COLLAB dataset, PGCN obtained the second-highest result among the considered state-of-the-art methods; note, however, that the difference with respect to the first one (GIN) is within the standard deviation.

Impact of receptive field size on PGCN. Most GCN architectures proposed in the literature generally stack 4 or fewer GC layers. The proposed PGC layer allows us to represent a linear version of these architectures by using a single layer with an even higher depth (k), without incurring problems related to the flow of the topological information. Different values of k have been tested to study how much the capability of the model to represent increased topological information helps to obtain better results. The results of these experiments are reported in Table 2. The accuracy results in this table refer to the validation sets, since the choice of k is part of the model selection procedure. We decided to take into account a range of k values between 3 and 6 for the bioinformatics datasets, and between 3 and 9 for the social network datasets. The results show that it is crucial to select an appropriate value for k. Several factors influence how much depth is needed. It is important to take into account that the various datasets used for the experiments refer to different tasks.
The quantity and the type of topological information required (or useful) to solve the task highly influence the choice of k. Moreover, the input dimension and the number of graphs contained in a dataset also play an important role. In fact, using higher values of k increases the number of columns of the R_{k,T} matrix (and therefore the number of parameters embedded in W), making the training of the model more difficult. It is interesting to notice that in many cases our method exploits a larger receptive field (i.e. a higher degree) than the competing models. Note that the datasets where better results are obtained with k = 3 (PTC and PROTEINS) contain a limited number of training samples, thus deeper models tend to overfit, arguably due to the limited amount of training data.
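To illustrate this growth, the following sketch (our own illustrative code, assuming T equal to the identity and a toy 4-node graph) builds the stacked matrix [X, T(A)X, . . . , T(A)^k X] and shows that its number of columns, and hence the number of rows of W, grows linearly with k:

```python
import numpy as np

def pgc_design_matrix(A, X, k):
    """Stack [X, A @ X, ..., A^k @ X] column-wise (T taken as the identity)."""
    blocks, P = [X], X
    for _ in range(k):
        P = A @ P                 # next power of A applied to X
        blocks.append(P)
    return np.concatenate(blocks, axis=1)

# Toy graph: a path on 4 nodes, 3 input features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
for k in (3, 6):
    R = pgc_design_matrix(A, X, k)
    # Columns of R (and rows of W) grow as (k + 1) * s.
    print(k, R.shape[1])
```

With s = 3 input features, k = 3 yields 12 columns and k = 6 yields 21, matching the linear growth of the parameter count discussed above.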

7. CONCLUSIONS AND FUTURE WORKS

In this paper, we analyze some of the most common convolution operators, evaluating their expressiveness. Our study shows that their linear composition can be defined as an instance of a more general Polynomial Graph Convolution operator with higher expressiveness. We defined an architecture exploiting a single PGC layer to generate a decoupled representation for each neighboring node at a different topological distance. This strategy allows us to avoid the bias on the flow of topological information introduced by stacking multiple graph convolution layers. We empirically validated the proposed Polynomial Graph Convolutional Network on commonly adopted graph classification benchmarks. The results show that the proposed model outperforms competing methods on almost all the considered datasets, exhibiting also a more stable behavior. In the future, we plan to study the possibility of introducing an attention mechanism by learning a transformation T that can adapt to the input. Furthermore, we will explore whether adopting our PGC operator as a large random projection can enable the development of a novel model for learning on graph domains.

A GRAPH NEURAL NETWORKS

A Graph Neural Network (GNN) is a neural network model that exploits the structure of the graph and the information embedded in the feature vector of each node to learn a classifier or regressor on a graph domain. Due to their success in image processing, convolution-based neural networks have become one of the main architectures (ConvGNNs) applied to graph processing. The typical structure of a ConvGNN comprises a first processing stage where convolutional layers are used to learn a representation h_v ∈ R^m for each vertex v ∈ V. These representations are then combined to get a representation of the whole graph, so that a standard feed-forward (deep) neural network can be used to process it. Convolutional layers are important since they define how the (local) topological information is mixed with the information attached to the involved nodes, and what information is passed on to the subsequent computational layers. Because of that, several convolution operators for graphs have recently been proposed (Defferrard et al., 2016; Kipf & Welling, 2017; Morris et al., 2019; Xu et al., 2019). The first definition of a neural network for graphs was proposed by Sperduti & Starita (1997). More recently, Micheli (2009) proposed the Neural Network for Graphs (NN4G), exploiting an idea that has later been re-branded as graph convolution, and Scarselli et al. (2008) defined a recurrent neural network for graphs. In the last few years, several models inspired by the graph convolution have been proposed. Many recent works defining graph convolutional networks (GCNs) extend the NN4G formulation (Micheli, 2009); for instance, the Graph Convolutional Network (GCN) (Kipf & Welling, 2017) adopts, in each graph convolutional layer, a linear first-order filter based on the normalized graph Laplacian. SGC (Wu et al., 2019) proposes a fast way to compute the result of several linearly stacked GCN layers.
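As a reference for the discussion above, here is a minimal sketch of one GCN-style convolutional layer (our own illustrative code, using dense NumPy arrays rather than the sparse implementations of actual libraries):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style convolution (in the spirit of Kipf & Welling, 2017):
    add self-loops, symmetrically normalise the adjacency, then apply a
    linear map followed by a ReLU non-linearity."""
    A_hat = A + np.eye(A.shape[0])                    # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))     # D^{-1/2} diagonal
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)            # ReLU activation

# Toy graph: a path on 4 nodes, 3 input features, 2 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
H = gcn_layer(A, X, W)    # one representation h_v per vertex
```

Stacking such layers widens the receptive field, which is exactly the coupling between depth and topology that the PGC layer avoids.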
Note that SGC considers just the GCN convolution, while our proposed PGC is more expressive than any number of linearly stacked graph convolutions among the ones presented in Section 3.1 of the main paper. DGCNN (Zhang et al., 2018) adopts a graph convolution very similar to GCN (Kipf & Welling, 2017) (it defines a slightly different propagation scheme for the vertices' representations, based on the random-walk graph Laplacian). While GCN is focused on node classification, DGCNN is suited for graph classification since it incorporates the readout. A more straightforward approach to defining convolutions on graphs is PATCHY-SAN (PSCN) (Niepert et al., 2016). This approach is inspired by how convolutions are defined over images. It consists of selecting a fixed number of vertices from each graph and exploiting a canonical ordering on graph vertices. For each vertex, it defines a fixed-size neighborhood, exploiting the same vertex ordering. It requires the vertices of each input graph to be in a canonical ordering, which is as complex as the graph isomorphism problem (no polynomial-time algorithm is known). Another interesting proposal for the convolution over the node neighborhood is GraphSAGE (Hamilton et al., 2017), which performs an aggregation over the neighborhoods using sum, mean or max-pooling operators, and then applies a linear projection in order to update the node representation. In addition, it exploits a particular neighbor-sampling scheme. GIN (Xu et al., 2019) is an extension of GraphSAGE that avoids the limitations introduced by using sum, mean or max-pooling, adopting a more expressive aggregation function on multi-sets. DiffPool (Ying et al., 2018) is an end-to-end architecture that combines a differentiable graph encoder with its pooling mechanism. Indeed, the method learns an adaptive pooling strategy to collapse nodes on the basis of a supervised criterion.
The Funnel GCNN (FGCNN) model (Navarin et al., 2020) aims to enhance the gradient propagation using a simple aggregation function and LeakyReLU activation functions. Exploiting the similarity between the adopted graph convolution operator (GraphConv) and the way the Weisfeiler-Lehman (WL) Subtree Kernel (Shervashidze et al., 2011) generates its feature space representations, it introduces a loss term on the output of each convolutional layer to guide the network to reconstruct the corresponding explicit WL features. Moreover, the number of filters used at each convolutional layer is based on a measure of the WL kernel complexity.

B EXPRESSIVENESS OF COMMONLY USED GRAPH CONVOLUTIONS

Thanks to the possibility of expressing commonly used graph convolutions as instances of PGC, and from the discussion in Section 3 of the paper, it is easy to characterize their expressiveness. In Figure 1 we represent the inclusion relationships among the sets of functions that can be implemented by GCN, GIN, GraphConv and Spectral convolutions.

Proof. With no loss of generality, let the first PGC be of degree $k_1$ and the second stacked PGC be of degree $k_2$, i.e.
$$H^{(1)} = [X, \ldots, T(A)^{k_1} X]\, W^{(1)}, \qquad H^{(2)} = [H^{(1)}, \ldots, T(A)^{k_2} H^{(1)}]\, W^{(2)},$$
where
$$W^{(i)} = \begin{bmatrix} W^{(i)}_0 \\ W^{(i)}_1 \\ \vdots \\ W^{(i)}_{k_i} \end{bmatrix}, \quad i = 1, 2.$$
By expanding $H^{(1)}$ inside the $H^{(2)}$ equation, we get:
$$H^{(2)} = \big[\, X W^{(1)}_0 + \ldots + T(A)^{k_1} X W^{(1)}_{k_1},\; \ldots,\; T(A)^{k_2} \big( X W^{(1)}_0 + \ldots + T(A)^{k_1} X W^{(1)}_{k_1} \big) \big]\, W^{(2)}$$
$$= X W^{(1)}_0 W^{(2)}_0 + \ldots + T(A)^{k_1} X W^{(1)}_{k_1} W^{(2)}_0 + \ldots + T(A)^{k_2} X W^{(1)}_0 W^{(2)}_{k_2} + \ldots + T(A)^{k_1+k_2} X W^{(1)}_{k_1} W^{(2)}_{k_2}.$$
In this case, by defining $D_1 = \{0, \ldots, k_1\}$, $D_2 = \{0, \ldots, k_2\}$, and auxiliary functions
$$F_i(W^{(1)}, W^{(2)}) = \sum_{(z_1, z_2) \in D_1 \times D_2 \,:\, z_1 + z_2 = i} W^{(1)}_{z_1} W^{(2)}_{z_2}, \quad i = 0, \ldots, k_1 + k_2,$$
the matrix $W$ can be written as
$$W = \begin{bmatrix} F_0(W^{(1)}, W^{(2)}) \\ F_1(W^{(1)}, W^{(2)}) \\ \vdots \\ F_{k_1+k_2}(W^{(1)}, W^{(2)}) \end{bmatrix}.$$

PTC contains chemical compounds, and the task is to predict their carcinogenicity for male rats. In NCI1 the graphs represent anti-cancer screens for cell lung cancer. The last three datasets, PROTEINS, D&D and ENZYMES, contain graphs that represent proteins: each node corresponds to an amino acid, and an edge connects two of them if they are less than 6 Å apart. In particular ENZYMES, differently from the other considered datasets (which model binary classification problems), allows testing the model on multi-class classification over 6 classes. We additionally considered three large social graph datasets: COLLAB, IMDB-B and IMDB-M (Yanardag & Vishwanathan, 2015). In COLLAB each graph represents the collaboration network of a researcher with other researchers from three fields of physics, and the task consists in predicting the physics field the researcher belongs to. IMDB-B and IMDB-M are composed of graphs derived from actors/actresses who played in different movies on IMDB, together with the movie genre information: each graph has a target that represents the movie genre. IMDB-B models a binary classification task, while IMDB-M contains graphs belonging to three different classes. Differently from the bioinformatics datasets, the nodes contained in the social datasets do not have any associated label. Relevant statistics about the datasets are reported in Table 3.
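As a numeric sanity check of the construction in the proof above, the following sketch (our own, with T taken as the identity and random weights) verifies that two stacked linear PGC layers of degrees k1 and k2 coincide with a single PGC of degree k1 + k2 whose weight blocks are the F_i defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 6, 3, 4          # nodes, input features, hidden units
k1, k2 = 2, 3
A = rng.random((n, n))     # stands in for T(A); T = identity assumed
X = rng.random((n, s))
W1 = [rng.random((s, m)) for _ in range(k1 + 1)]   # blocks of W^(1)
W2 = [rng.random((m, m)) for _ in range(k2 + 1)]   # blocks of W^(2)

def powers(M, B, k):
    """[B, M @ B, ..., M^k @ B]."""
    out, P = [B], B
    for _ in range(k):
        P = M @ P
        out.append(P)
    return out

# Two stacked linear PGC layers.
H1 = sum(P @ W for P, W in zip(powers(A, X, k1), W1))
H2 = sum(P @ W for P, W in zip(powers(A, H1, k2), W2))

# A single PGC of degree k1 + k2 with blocks F_i = sum_{z1+z2=i} W1_z1 @ W2_z2.
F = [sum(W1[z1] @ W2[i - z1]
         for z1 in range(max(0, i - k2), min(k1, i) + 1))
     for i in range(k1 + k2 + 1)]
H_single = sum(P @ Wi for P, Wi in zip(powers(A, X, k1 + k2), F))
print(np.allclose(H2, H_single))
```

The two computations agree up to floating-point error, as Theorem 1 predicts.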

E PGCN COMPUTATION COMPLEXITY EXAMPLE

Consider a dataset with n_G graphs, and the 2-layer GraphConv defined with a message-passing formulation in Equations 6 and 7 (assuming m_1 = m_2 = m). Each GraphConv layer requires 3 matrix multiplications. The AX term in the first layer can be pre-computed, since it remains the same during all of training. Thus a 2-layer network performs 5 · n_G matrix multiplications in the forward pass for each epoch (in general the size of A is different for each graph, but for the sake of discussion we can assume their dimensions are comparable). Assuming 100 epochs of training, the total number of such multiplications is then 5 · 100 · n_G + 1. If we now consider the PGC formulation with k = 2 in Eq. 9 (which, we recall, is more expressive than 2 stacked GraphConv layers, as shown in Section 3.1), the number of matrix multiplications required for each graph is 6. However, the terms AX and A²X remain the same, for each graph, during all of training. They can thus be pre-computed and stored in memory. With this implementation, Eq. 9 requires just 3 matrix multiplications, for a total number of matrix multiplications over 100 training epochs of 3 · 100 · n_G + 3. While this does not modify the asymptotic complexity of PGC compared to message passing, it significantly improves the training times.
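The caching idea can be sketched as follows (our own illustrative code for a single graph, with T taken as the identity): the powers of A applied to X are computed once, so each forward pass reduces to a single multiplication by W.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, m = 8, 4, 5
A = rng.random((n, n))
X = rng.random((n, s))

# Pre-computed once, before training: these products never change,
# since A and X are fixed for each graph.
AX = A @ X
A2X = A @ AX
R = np.concatenate([X, AX, A2X], axis=1)   # R_{k,T} with k = 2, T = identity

def pgc_forward(W):
    # A single matrix multiplication per forward pass; a message-passing
    # implementation would recompute A @ H at every layer instead.
    return R @ W

W = rng.random((3 * s, m))
H = pgc_forward(W)
```

Only W changes across training steps, so the pre-computed R is reused at every epoch.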

F INITIAL NODE EMBEDDINGS

Some datasets used in the experiments encode node labels (i.e., X) using a one-hot encoding, which makes the node representations very sparse. In preliminary experiments, we observed that such sparse representations negatively influence learning. In Table 4, we show how using a sparse node representation as input leads to sparse input matrices X, AX, . . . , A^k X. Specifically, in order to estimate the difference in sparsity degree with or without an initial PGC layer with k = 1, we computed, over the whole dataset, the average ratio between the number of null entries (rounding all the embedding values to the 4th decimal digit) and the total number of entries of the input matrices, for all the used bioinformatics datasets. We evaluated the sparsity of each PGC-layer block, considering values of k in the interval [0, . . . , 5].

Table 4: Average ratio of the number of null entries over the total number of entries in the input components up to k = 5, without (top row) and with (bottom row) the PGC_{k=1} layer, for the used datasets. Note that the value 0 corresponds to a dense matrix, while the value 1 corresponds to a null matrix.

It is interesting to notice that in all datasets the use of the initial PGC leads to a sparsity ratio near 0 (therefore the subsequent PGC layer receives dense embeddings as input). That is very useful, in particular for datasets like NCI1, PTC, and D&D, where the percentage of zeros in the label representation is near 90%.
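The sparsity measure used in Table 4 can be sketched as follows (our own toy example with synthetic one-hot labels, not one of the paper's datasets):

```python
import numpy as np

def sparsity(M, decimals=4):
    """Fraction of entries that are zero after rounding to 4 decimal
    digits, i.e. the measure used for Table 4."""
    M = np.round(M, decimals)
    return (M == 0).sum() / M.size

rng = np.random.default_rng(0)
# Toy one-hot node labels: 30 nodes, 10 label types -> exactly 90% zeros.
X = np.eye(10)[rng.integers(0, 10, size=30)]
A = (rng.random((30, 30)) < 0.2).astype(float)   # random adjacency
print(sparsity(X))       # 0.9 for one-hot labels
print(sparsity(A @ X))   # propagating labels densifies the representation
```

On this toy graph the single propagation step already cuts the sparsity ratio well below that of the raw one-hot input, mirroring the effect of the initial PGC layer.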

G PGCN IMPLEMENTATION DETAILS

We implemented the PGCN in PyTorch-Geometric (Fey & Lenssen, 2019). To reduce the covariate shift during training and to attenuate overfitting, we applied batch normalization and dropout to the output of each y_j layer. We used the Negative Log-Likelihood loss, the Adam optimizer (Kingma & Ba, 2014), and the identity function for T. For more details, please check the publicly available code (link omitted for double-blind review). For our experiments, we adopted 2 types of machines, respectively equipped with:
• 2 x Intel(R) Xeon(R) CPU E5-2630L v3, 192 GB of RAM and an Nvidia Tesla V100;
• 2 x Intel(R) Xeon(R) CPU E5-2650 v3, 160 GB of RAM and an Nvidia T4.

H SPEED OF CONVERGENCE

Here, we compare the computational demand of the proposed PGCN and of FGCNN (Navarin et al., 2020). We decided to compare these two models since they present a similar readout layer; therefore, the comparison best highlights how the different methodologies manage the number of considered k-hops from the point of view of performance. In Table 5, we report the average time (over the ten folds) to perform a single training epoch and to perform the classification with both methods. In the evaluation we considered similar architectures, using 3 layers for FGCNN and k = 3 for PGCN. The other hyper-parameters were set with the aim of obtaining almost the same number of parameters in both models, to ensure a fair comparison. The batch sizes used for this evaluation are the same selected by the PGCN model selection. The results show a significant advantage in using a PGC layer instead of the message-passing based method exploited by FGCNN. Concerning the speed of convergence of the two models, in Figure 2 we report the training curves for two representative datasets (D&D and NCI1). The x-axis reports the computational time in seconds, while the y-axis reports the loss value. Both curves end after 200 training epochs. The curves show that PGCN converges faster than, or at a similar pace to, FGCNN.

I HYPER-PARAMETERS SELECTION

The hyper-parameters of the model (number of hidden units, learning rate, weight decay, k) were selected by a limited grid search, where the explored sets of values change based on the considered dataset. Due to the high computational time required to perform an extensive grid search, we decided to limit the number of values taken into account for each hyper-parameter, performing preliminary tests to identify useful ranges of values. Preliminary tests also showed that, for the social network datasets, it is more convenient to use the Laplacian L as T(A); this behavior could be due to the lack of labels associated with the nodes. In Table 6, we report the sets of hyper-parameter values used for model selection via grid search. As evaluation measure, we used the average accuracy computed over the 10-fold cross-validation on the validation sets, and we used the same set of selected hyper-parameters for each fold. The selection of the epoch was instead performed for each fold independently, based on the accuracy on the validation set.

J EXPERIMENTAL RESULTS OMITTED IN THE RESULTS COMPARISON

As validation methodology, we decided to follow the one proposed in Errica et al. (2020), which, in our opinion, turns out to be the fairest. For this reason, some results reported in the literature cannot be directly compared with the ones that we obtained. Specifically, the results reported in Xu et al. (2019); Chen et al. (2019); Ying et al. (2018) are not considered in our experimental comparison, since their model selection strategy is different from the one we adopted. Indeed, the results reported there cannot be compared with the other results reported in Table 1 of the paper, because the authors state: "The hyper-parameters we tune for each dataset are [...] the number of epochs, i.e., a single epoch with the best cross-validation accuracy averaged over the 10 folds was selected." Similarly, for the results reported in Chen et al. (2019) for the GCN and GFN models, the authors state: "We run the model for 100 epochs, and select the epoch in the same way as Xu et al. (2019), i.e., a single epoch with the best cross-validation accuracy averaged over the 10 folds is selected." In both cases, the model selection strategy is clearly biased and different from the one we adopted. This makes the results not comparable. Moreover, in Xu et al. (2019) the node descriptors are augmented with structural features: in the GIN experiments the authors add a one-hot representation of the node degree. We decided to use a common setting for the chemical domain, where the nodes are labeled with a one-hot encoding of their atom type. The only exception is ENZYMES, where it is common to use 18 additional available features. Also in Ying et al. (2018) there is a similar problem, since the authors add the degree and the clustering coefficient to each node feature vector. For the sake of completeness, in Table 7 we report the results obtained by the proposed method following the same validation policy used in Xu et al. (2019); Chen et al. (2019); Ying et al. (2018).
The table shows that the PGCN outperforms the methods proposed in the literature in almost all datasets.







Figure 1: Expressiveness of commonly used graph convolution operators. Each ellipse represents the set of functions that can be implemented by a single graph convolution operator.

Figure 2: PGCN and FGCNN training curves for D&D and NCI-1 datasets.

Dataset  | Batch size  | Learning rate               | Weight decay        | Dropout  | Hidden units (m) | k          | Readout layers [units]
PTC      | 60          | 10^-3, 5·10^-4, 10^-4       | 5·10^-3, 5·10^-4    | 0.4, 0.6 | 16, 32           | 3, 4, 5, 6 | 1 [m/2], 2 [m*2, m]
NCI1     | 50, 100     | 10^-3, 5·10^-4              | 5·10^-3, 5·10^-4    | 0.3, 0.5 | 16, 32           | 3, 4, 5, 6 | 1 [m/2], 2 [m*2, m]
PROTEINS | 25, 50      | 10^-3, 5·10^-4, 10^-4       | 5·10^-3, 5·10^-4    | 0.3, 0.5 | 16, 32           | 3, 4, 5, 6 | 1 [m/2], 2 [m*2, m]
D&D      | 50, 75      | 5·10^-4, 5·10^-5            | 5·10^-3, 5·10^-4    | 0.3, 0.5 | 16, 32           | 3, 4, 5, 6 | 1 [m/2], 2 [m*2, m]
ENZYMES  | 50, 100     | 10^-3, 10^-4                | 5·10^-3, 5·10^-4    | 0.3, 0.5 | 16, 32           | 3, 4, 5, 6 | 1 [m/2], 2 [m*2, m]
COLLAB   | 7, 15, 30   | 10^-3, 5·10^-4              | 5·10^-3, 5·10^-4    | 0, 0.5   | 16, 32           | 3, 5, 7, 9 | 1 [m/2], 2 [m*2, m]
IMDB-B   |             |                             |                     |          |                  |            | 1 [m/2], 2 [m*2, m]
IMDB-M   | 50, 75, 100 | 10^-4, 5·10^-5              | 5·10^-3, 5·10^-4    | 0, 0.5   | 16, 32           | 3, 5, 7, 9 | 1 [m/2], 2 [m*2, m]

Table 1: Accuracy comparison between PGCN and several state-of-the-art models on the graph classification task.

Table 2: PGCN accuracy comparison on the validation sets of the datasets for different values of k.

Significance of our results. To understand whether the improvements reported in Table 1 are significant or can be attributed to random chance, we conducted the two-tailed Wilcoxon Signed-Ranks test between the proposed PGCN and the competing methods. This test considers all the results for the different datasets at the same time. According to this test, PGCN performs significantly better (p-value < 0.05) than PSCN, DGCNN 3, GIN, DIFFPOOL and GraphSAGE. As for FGCNN and DGCNN 2, four datasets are not enough to conduct the test.

Table 3: Datasets statistics. Columns: Dataset, #Graphs, #Node, #Edge, Avg. #Nodes/Graph, Avg. #Edges/Graph.


Table 5: Time in seconds to perform a single training epoch (2nd and 3rd columns) and to perform classification (4th and 5th columns), using PGCN and FGCNN (Navarin et al., 2020), respectively.

Table 6: Sets of hyper-parameter values used for model selection via grid search.

Table 7: PGCN accuracy comparison using different values of k. The validation policy is the same as that used in Xu et al. (2019); Chen et al. (2019); Ying et al. (2018). In Ying et al. (2018) the variance is not reported.


and consequently the hidden representation becomes
$$H^{(2)} = [X, T(A)X, \ldots, T(A)^{k_1} X, \ldots, T(A)^{k_1+k_2} X]\, W,$$
which is the output of a PGC with $k = k_1 + k_2$.

Theorem 2. A PGC layer with k = 2 is more general than two stacked PGC layers with k = 1 with the same transformation T and the same number of hidden units m.

Proof. Since all the PGCs use the same transformation T, we can focus on the weights only. We prove our theorem by providing a counterexample, i.e. we fix m = 1, s = 1 (the input dimension) and show an instantiation of the weight matrix W = [W_0, W_1, W_2] of a PGC layer with k = 2 that cannot be expressed by the composition of two PGC layers with k = 1 (equivalent to GraphConv). Let us consider the simplest case in which W_0, W_1, W_2 ∈ R, i.e. they are 1 × 1 matrices, and let W_0 = 5, W_1 = 7, W_2 = 3. Let us also, for the sake of clarity, rename the 1 × 1 matrices of the two PGCs with k = 1 as a, b (first layer) and c, d (second layer), so that the stacked computation is $X(ac) + T(A)X(bc + ad) + T(A)^2 X(bd)$. We get the following system of equations:
$$ac = 5, \qquad bc + ad = 7, \qquad bd = 3,$$
where we assume b and c are different from zero (it is easy to see that either b = 0 or c = 0 would not lead to any solution). Substituting $a = 5/c$ and $d = 3/b$ into the second equation and multiplying by $cb$ yields $(cb)^2 - 7(cb) + 15 = 0$. If we compute the ∆ of this equation (solving for cb), we get $\Delta = 49 - 60 = -11 < 0$, i.e. $\sqrt{\Delta} = i\sqrt{11}$ is a complex number. Thus there is no real value for cb that satisfies our system of equations. We thus conclude that there are no values that we can assign to the parameters of the PGCs with k = 1 that would lead to the considered PGC weight matrix.

Moreover, Theorem 2 implies the following corollary.

Corollary 2.1. A linear stack of q T-compatible PGC layers with k = 1 is less expressive than a single T-PGC layer with k = q.

Proof. Since all the PGCs use the same transformation T, we can focus on the weights only. We prove the corollary by induction. The base case is provided by Theorem 2. Let us now prove the inductive case. Let us assume that a stack of i PGC layers with k = 1 (with parameter set $\theta^{(i)}_1 = \{W^{(j)} \in \mathbb{R}^{2s \times m}, j = 0, \ldots, i\}$) is less expressive than a single PGC layer with k = i (with parameter set $\theta^{(i)}_2 = W \in \mathbb{R}^{(i+1)s \times m}$), and let us prove the same result for i + 1. We can consider the set of functions that can be implemented by the stack of i + 1 PGC layers with k = 1 as the composition of functions coming from two different sets: the first one, $PGC^{(i)}_{k=1}(\theta^{(i)}_1)$, is the set of functions that can be computed stacking i PGC layers with k = 1, and the second one, $PGC^{(1)}_{k=1}(\theta^{(1)}_1)$, is the set of functions computed by a single PGC layer with k = 1. We can then characterize the set of functions $PGC^{(i+1)}_{k=1}$ as $\{g \circ f \mid f \in PGC^{(i)}_{k=1},\, g \in PGC^{(1)}_{k=1}\}$. From Theorem 1, we know that $\{g \circ f \mid f \in PGC^{(1)}_{k=i},\, g \in PGC^{(1)}_{k=1}\} \subseteq PGC^{(1)}_{k=i+1}$. Since we know that $PGC^{(i)}_{k=1}$ is less expressive than $PGC^{(1)}_{k=i}$ by the inductive hypothesis, the stack of i + 1 PGC layers with k = 1 is less expressive than a single PGC layer with k = i + 1, which proves the corollary.
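The counterexample used in the proof of Theorem 2 can be checked mechanically; this small sketch (our own) verifies that the system ac = 5, bc + ad = 7, bd = 3 admits no real solution:

```python
import math

# Two stacked k = 1 PGCs compute X(ac) + T(A)X(bc + ad) + T(A)^2 X(bd),
# so matching W0 = 5, W1 = 7, W2 = 3 requires ac = 5, bc + ad = 7, bd = 3.
# Substituting a = 5/c and d = 3/b into the middle equation and multiplying
# by cb yields (cb)^2 - 7*(cb) + 15 = 0.
disc = 7**2 - 4 * 15
assert disc == -11            # negative discriminant: cb cannot be real

# Equivalent view: u = bc and v = ad satisfy u * v = (ac)(bd) = 15 > 0, so
# u and v share the same sign; by AM-GM, |u + v| >= 2*sqrt(15) > 7, hence
# the constraint u + v = 7 can never be met by real parameters.
assert 2 * math.sqrt(15) > 7
```

Both checks confirm that no real assignment of a, b, c, d reproduces the target weight matrix, as the proof argues.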

D DATASETS

We empirically validated the proposed PGCN on five commonly adopted graph classification benchmarks modeling bioinformatics problems: PTC (Helma et al., 2001), NCI1 (Wale et al., 2008), PROTEINS (Borgwardt et al., 2005), D&D (Dobson & Doig, 2003) and ENZYMES (Borgwardt et al., 2005). The first two of them contain chemical compounds represented by their molecular graphs, where each node is labeled with an atom type and the edges represent bonds between them.

