TOWARDS EXPRESSIVE GRAPH REPRESENTATIONS FOR GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs) aggregate neighborhood information into node embeddings and have shown powerful capabilities for graph representation learning in various application areas. However, most existing GNN variants aggregate neighborhood information in a fixed, non-injective fashion, which may map different graphs or nodes to the same embedding and is detrimental to model expressiveness. In this paper, we present a theoretical framework for improving the expressive power of GNNs by taking both injectivity and continuity into account. Based on the framework, we develop the injective and continuous expressive Graph Neural Network (iceGNN), which learns graph and node representations in an injective and continuous fashion, so that similar nodes or graphs are mapped to similar embeddings, and non-equivalent nodes or non-isomorphic graphs are mapped to different embeddings. We validate the proposed iceGNN model on graph classification and node classification over multiple benchmark datasets; the experimental results demonstrate that our model achieves state-of-the-art performance on most of the benchmarks. Our main contributions are summarized as follows. (1) We present a theoretical framework to guide the design of expressive GNNs by ensuring injectivity and continuity in the neighborhood aggregation process. (2) We present a necessary condition on the representation dimension for a fully injective and continuous graph mapping. (3) Based on the framework, we implement two injective and continuous expressive GNN (iceGNN) models, with a fixed and a learnable aggregation function, respectively. (4) We validate our models on multiple benchmark datasets, including simple graphs and attributed graphs, for graph classification and node classification; the experimental results demonstrate that our models achieve state-of-the-art performance on most of the benchmarks. Our code is available in the Supplementary Material.
Common notations used throughout the paper are found in Appendix A.1 Table 5. Many GNN variants with different aggregation rules have been proposed in the literature to achieve good performance on different tasks. GIN, proposed by Xu et al. (2019), is expected to be highly expressive for simple graphs whose node attributes are one-hot encodings, on which sum aggregation is injective. However, GIN cannot be directly extended to attributed graphs with the same expressive power, because sum aggregation is no longer injective in the uncountable case. GCN is another GNN variant, with simple element-wise mean pooling over a node's neighborhood, including the node itself (Kipf & Welling, 2017). Hamilton et al. (2017) tested 3 aggregators in GraphSAGE: a mean aggregator, an LSTM aggregator and a max pooling aggregator. They found no significant performance difference between the LSTM and pooling aggregators, but GraphSAGE-LSTM is significantly slower than GraphSAGE-pool. Mean aggregation and max pooling are permutation invariant on sets, but these operations are not injective, which may result in the same embedding for different inputs. LSTM aggregation can have a large expressive capacity, but it is not permutation invariant, which may cause equivalent nodes or isomorphic graphs to have different embeddings. Corso et al. (2020) combined multiple aggregators with degree-scalers and proposed PNA to improve the expressive power of GNNs, but they did not provide theoretical guidance on how to improve expressive power. In this article, we present a theoretical framework to guide the design of expressive GNNs by ensuring injectivity and continuity in the neighborhood aggregation process; a simple comparative analysis shows that PNA falls into our framework.

1. INTRODUCTION

Graph representation learning, which maps graphs or their components to vector representations, has attracted growing attention for graph analysis. Recently, graph neural networks (GNNs), which can learn a distributed representation for a graph or a node in a graph, have been widely applied to a variety of areas, such as social network analysis (Hamilton et al., 2017; Ying et al., 2018a), molecular structure inference (Duvenaud et al., 2015; Gilmer et al., 2017), text mining (Yao et al., 2019; Peng et al., 2018), clinical decision-making (Mao et al., 2022b; Li et al., 2018) and image processing (Mao et al., 2022a; Garcia & Bruna, 2018). A GNN recursively updates the representation of a node in a graph by aggregating the feature vectors of its neighbors and itself (Hamilton et al., 2017; Morris et al., 2019; Xu et al., 2019). The graph-level representation can then be obtained by aggregating the final representations of all the nodes in the graph. The learned representations can be fed into a prediction model for different learning tasks, such as node classification and graph classification. In a GNN, the aggregation rule plays a vital role in learning expressive representations for the nodes and the entire graph. Many GNN variants with different aggregation rules have been proposed to achieve good performance on different tasks and problems (Kipf & Welling, 2017; Hamilton et al., 2017; Zhang et al., 2018; Xinyi & Chen, 2019; Wang et al., 2020). However, most existing GNN aggregation rules are designed based on a fixed non-injective pooling function (e.g., max pooling or mean pooling) or on non-continuous node types (e.g., the graph isomorphism test). Non-injective aggregation may map different (non-isomorphic) graphs or (non-equivalent) nodes to the same embedding, and non-continuous aggregation may map similar graphs or nodes to quite different embeddings; both are detrimental to the expressive power of a GNN.
For example, for the graph with attributed nodes in Figure 1(a), mean pooling or sum aggregation over the neighborhoods generates the same neighborhood representation for all the nodes (Figure 1(d)), and thus cannot capture any meaningful structural information. Xu et al. (2019) showed that a powerful GNN can at most achieve the discriminative power of the Weisfeiler-Lehman graph isomorphism test (WL test), which can discriminate a broad class of graphs (Weisfeiler & Lehman, 1968), and proposed the powerful graph isomorphism network (GIN). However, the theoretical framework of GIN assumes that the input feature space is countable, which makes GIN less expressive when applied to graphs with continuous attributes. We argue that the expressive power of a graph mapping involves two aspects, injectivity and continuity: injectivity ensures that different graphs are mapped to different representations, and continuity ensures that similar graphs are mapped to similar representations. Most previous works took only one of the two into account in GNN design; few considered both injectivity and continuity. Here, we present a theoretical framework that can guide us to design highly expressive GNNs with both injectivity and continuity for general graphs with continuous attributes. We also present a necessary condition, related to the representation dimension, for a fully injective and continuous graph mapping. The general idea of our framework is illustrated in Figure 1. Theoretical studies have linked the expressive power of GNNs to the WL test (Morris et al., 2019; Xu et al., 2019). Xu et al. (2019) showed that GNNs with 1-hop neighborhood aggregation can at most achieve the expressive power of the 1-WL test, and developed GIN, which achieves this expressive power in countable spaces. Based on k-WL tests, Morris et al.
(2019) proposed k-GNN, which can take higher-order interactions among nodes into account. Maron et al. (2019a) proved that order-k invariant graph networks are at least as powerful as the k-WL tests, and developed a GNN model that is more powerful than message passing GNNs, possessing the expressiveness of the 3-WL test; however, the higher expressive power comes with the computational cost of processing high-order tensors. A survey on the expressive power of GNNs can be found in Sato (2020).

3.1. GRAPH NEURAL NETWORKS

Most modern GNNs fall into the category of message passing neural networks (Gilmer et al., 2017), which follow a neighborhood aggregation strategy: a node representation is recursively updated by aggregating the representations of its neighbors and the node itself. The graph-level representation is obtained by aggregating the final representations of all the nodes in the graph. Formally, the propagation rule of a GNN layer can be written as

H^(k)_N(v) = f_A^(k)({H^(k)(w) | w ∈ N(v)})   (1)

H^(k+1)(v) = f_C^(k)(H^(k)(v), H^(k)_N(v))   (2)

where H^(k)(v) is the representation vector of node v in the kth layer, and H^(0)(v) is initialized with X(v), the original attributes of node v. N(v) is the neighborhood of v. f_A^(k)(·) aggregates over the neighborhood N(v) to generate a neighborhood representation H^(k)_N(v), and f_C^(k)(·) combines the node's current representation H^(k)(v) with its neighborhood representation H^(k)_N(v) in the kth layer. For node embedding, the node representation in the final layer H^(K)(v) (for a total of K layers) serves as an informative representation for downstream tasks, e.g., node classification. For graph or subgraph embedding, another aggregation function f_R(·) is employed to obtain the graph-level representation H_G by aggregating the final representations of all nodes in the graph or subgraph G, i.e.,

H_G = f_R({H^(K)(v) | v ∈ G})   (3)

f_A(·), f_C(·) and f_R(·) are all crucial for the expressive power of a GNN. f_A(·) and f_R(·) are set functions that map a set to a vector; they can be simple summations or sophisticated graph-level pooling functions (Ying et al., 2018b; Zhang et al., 2018; Wang et al., 2020). f_C(·) operates on two vectors; it can usually be modeled by a multi-layer perceptron (MLP) or a linear function on the concatenated vector.
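To make the propagation rule concrete, here is a minimal numpy sketch of Eqs. 1-3, assuming sum aggregation for f_A and f_R and a single linear-plus-ReLU layer standing in for the f_C MLP; all function names and shapes are our own illustration, not the released code.

```python
import numpy as np

def aggregate_sum(neighbor_states):
    # f_A: permutation-invariant sum over the multiset of neighbor embeddings
    return np.sum(neighbor_states, axis=0)

def combine_concat(h_v, h_nbr, W, b):
    # f_C: concatenate the node's state with its neighborhood summary,
    # then apply one linear layer with ReLU (a stand-in for an MLP)
    z = np.concatenate([h_v, h_nbr])
    return np.maximum(0.0, W @ z + b)

def gnn_layer(H, adj, W, b):
    # One propagation step: H is (n_nodes, d); adj[v] lists v's neighbor indices
    out = []
    for v, nbrs in enumerate(adj):
        h_nbr = aggregate_sum(H[nbrs]) if nbrs else np.zeros(H.shape[1])
        out.append(combine_concat(H[v], h_nbr, W, b))
    return np.stack(out)

def readout_sum(H):
    # f_R: graph-level representation as a sum over final node embeddings
    return np.sum(H, axis=0)
```

Because the aggregator is a sum, permuting the neighbor list leaves each node's update unchanged, matching the permutation-invariance requirement on f_A.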

3.2. THE EXPRESSIVE POWER OF GNN

Recently, theoretical analyses have shown that the expressive power of a GNN is associated with the WL test (Morris et al., 2019; Xu et al., 2019). Xu et al. (2019) proved that the expressive power of GNNs is bounded by the one-dimensional Weisfeiler-Lehman test. The following lemma and theorem from Xu et al. (2019) describe the relation between GNNs and the WL test in terms of expressive power for discriminating graphs; refer to Xu et al. (2019) for the proofs.

Lemma 1. If the WL test decides two graphs G_1 and G_2 are isomorphic, any GNN defined by Eqs. 2 and 3 will map G_1 and G_2 to the same embedding.

Theorem 1. If the WL test decides two graphs G_1 and G_2 are not isomorphic, a GNN with sufficiently many layers defined by Eqs. 2 and 3 can also map G_1 and G_2 to different embeddings if the functions f_A(·), f_C(·) and f_R(·) are all injective.

The above lemma and theorem can guide us in designing a GNN whose discriminative power equals that of the WL test. The key is to design injective functions for f_A(·), f_C(·) and f_R(·). An injective function for f_C(·), which operates on two vectors, can easily be obtained by concatenating the two vectors. But designing an injective function for f_A(·) or f_R(·), which operate on a set, is not trivial, because sets can have different numbers of elements and the operation on the set elements must be permutation-invariant. While Xu et al. (2019) considered injectivity in their framework, the continuity of these functions is also crucial to the model's expressive capability, which was not considered in Xu et al. (2019). Some popular pooling aggregations, e.g., mean pooling and max pooling, are continuous but not injective. Few works have taken both injectivity and continuity into account. In the following, we present how to design a continuous injective aggregation function on a set and show a necessary condition on the dimension for injective and continuous aggregation.

4.1. SET REPRESENTATION

A set function is a function whose domain is a collection of sets (specifically multisets, where the same object can appear multiple times; throughout the paper, a set means a finite multiset). In a finite graph, the neighborhood of each node is a finite set; thus, in this paper, we only consider set functions on finite sets. A continuous set function is of real importance in practice (Wagstaff et al., 2019). The continuity of a function ensures that the change in output is slight if the input is altered slightly for any reason, such as truncation to machine precision. In this paper, we consider ordinary continuity, where the continuity of a function f(x) at a point c is defined by the limit lim_{x→c} f(x) = f(c). For M ∈ N, a set function f(X) defined on the domain X_M = {X | X ⊂ R^d, |X| ≤ M} can be represented as a sequence of permutation-invariant functions f_i for different set sizes, i.e.,

f(X) = f_i(x_1, …, x_i) if |X| = i ≤ M,   (4)

where x_1, …, x_i ∈ X.

Definition 1 (Continuous set function). For M ∈ N and a set function f(X) defined on the domain X_M = {X | X ⊂ R^d, |X| ≤ M}, represented as in Eq. 4, if f_i(x_1, …, x_i) is continuous in the Euclidean space for every i ≤ M, we call f(X) a continuous set function.

Obviously, a continuous set function has the property that sufficiently small changes in the input (a sufficiently small change will not change the set size) result in arbitrarily small changes in the output. Thus, through a continuous set function, graphs with very similar structures and attributes are mapped to similar embeddings. An injective set function maps distinct sets to distinct values. Thus, an injective and continuous set representation function encodes a set such that different sets have different representations and similar sets have similar representations.
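The distinction between continuity and injectivity can be checked directly on small one-dimensional multisets. The sketch below is our own illustration, not the paper's code: mean and max pooling are continuous but collide on different multisets, while the sum-of-powers transformation used in the proof of Theorem 2 (for d = 1) separates them.

```python
import numpy as np

def mean_pool(X):
    # continuous but NOT injective: different multisets can share a mean
    return np.mean(X, axis=0)

def max_pool(X):
    # continuous but NOT injective either
    return np.max(X, axis=0)

def sum_of_powers(X, M):
    # candidate injective-and-continuous map for 1-d multisets of size <= M:
    # each x is transformed to (x, x^2, ..., x^M), then summed over the multiset
    X = np.asarray(X, dtype=float)
    return np.array([np.sum(X ** p) for p in range(1, M + 1)])
```

For instance, {1, 3} and {2, 2} share the mean 2, and {1, 3} and {3, 3} share the max 3, but the power sums (sum of x and sum of x^2) already distinguish {1, 3} from {2, 2}.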
The following theorem provides a way to construct continuous injective set functions on uncountable spaces by sum aggregation after a certain transformation.

Theorem 2. Assume X_M is a set of finite subsets of R^d with size at most M, i.e., for M ∈ N, X_M = {X | X ⊂ R^d, |X| ≤ M}. Then there exist infinitely many continuous functions Φ : R^d → R^D such that the set function f : X_M → R^D, f(X) = Σ_{x∈X} Φ(x), is continuous and injective.

Theorem 2 generalizes Lemma 6 in Zaheer et al. (2017) to the multidimensional case. We prove Theorem 2 in Appendix A.2. The proof contains three steps: 1. constructing a satisfying function in the one-dimensional case (d = 1); 2. constructing a satisfying function in the multidimensional case (d > 1) based on the result of step 1; 3. showing that a satisfying function can be used to generate infinitely many other satisfying functions. In our proof, we find a Φ_M(x), defined in Eq. 5 (where i, j = 1, …, d), that makes f(X) = Σ_{x∈X} Φ_M(x) continuous and injective whenever the first entries of all vectors in X are distinct. If the first entries of the vectors in X are not all distinct, we can append the P_i (i.e., use Φ̄_M(x)) to ensure full injectivity and continuity.

P_i = [1, x[i], x[i]^2, …, x[i]^M]
P_{i,j} = [x[j], x[i]x[j], x[i]^2 x[j], …, x[i]^{M-1} x[j]]
Φ_M(x) = [P_1, P_{1,2}, …, P_{1,d}]
Φ̄_M(x) = [Φ_M(x), P_2, …, P_d]   (5)

In the proof, we also provide a way to construct further such functions: given a Φ(x) satisfying the condition and a continuous injective function g : R^d → R^d, the composition Φ(g(x)) also satisfies the condition. We call Φ(x) the transformation function. The following theorem gives a necessary condition for constructing an injective and continuous set function by sum aggregation.

Theorem 3. Let M ∈ N and X_M = {X | X ⊂ R^d, |X| = M}. Then for any continuous function Φ : R^d → R^N with N < dM, the set function f : X_M → R^N, f(X) = Σ_{x∈X} Φ(x), is not injective.
We prove Theorem 3 in Appendix A.3. Theorem 3 shows that, to construct a continuous injective set function for sets of M d-dimensional vectors by sum aggregation with a continuous transformation Φ(x), Φ(x) must have at least dM dimensions. We restrict Φ to be a continuous function so that it can be modeled by a neural network: by the universal approximation theorem, a neural network can approximate any continuous function, but not an arbitrary function (Cybenko, 1989).
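Theorem 3's dimension bound can be witnessed concretely at d = 1, M = 2, N = 1 (so N < dM). The snippet below is our own illustration: it exhibits explicit collisions for two particular continuous choices of Φ; the theorem says more, namely that no continuous one-dimensional choice escapes such collisions.

```python
import numpy as np

def f(phi, X):
    # sum aggregation through a 1-d transformation phi (so N = 1 < d*M = 2)
    return sum(phi(x) for x in X)

# Two concrete continuous transformations, each with an explicit collision:
phi_id = lambda x: x   # {0, 2} and {1, 1} both map to 2
phi_exp = np.exp       # {0, ln 3} and {ln 2, ln 2} both map to 1+3 = 2+2 = 4
```

Appending a second output coordinate (e.g., x^2, as in the sum-of-powers construction) is exactly what removes these collisions, consistent with the N ≥ dM requirement.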

4.2. INJECTIVE AND CONTINUOUS EXPRESSIVE GRAPH NEURAL NETWORKS

Theorem 2 shows that a set can be uniquely represented by a sum aggregation of its elements through a continuous transformation function; we can use this unique set representation to model the neighborhood of each node in a graph and thus improve the expressiveness of the graph representation. From Theorem 1, to design an expressive GNN we need to design injective and continuous functions for f_A(·), f_C(·) and f_R(·). f_C(·) can easily be made continuous and injective, while f_A(·) and f_R(·) both operate on a set of vectors in R^d; thus we use Theorem 2 to guide the construction of continuous and injective set functions for f_A(·) and f_R(·) by sum aggregation after a certain continuous transformation.

COMBINE function.

According to Theorem 3, for a set of M d-dimensional embeddings, the transformation function must be at least dM-dimensional to yield a continuous injective set function with sum aggregation. Thus, without dimension reduction in f_C(·), we get a (dM + d)-dimensional embedding after one layer (dM for the set of neighbors and d for the node's current representation). After k layers, the output embeddings have d(M + 1)^k dimensions, which makes it impractical to implement a fully injective and continuous GNN for large graphs. Nevertheless, in a specific learning task, not all dimensions are relevant to the task; we can therefore use learnable neural networks (e.g., MLPs) to adaptively reduce the output dimension in each layer, as in Eq. 6:

f_C^(k)(x_1, x_2) = MLP^(k)([x_1, x_2])   (6)

Note that an MLP mapping high-dimensional vectors to low-dimensional vectors cannot be continuous and injective; here, the MLP is used for task-driven feature reduction.

AGGREGATE function. f_A(·) operates on the set of node embeddings in a node's neighborhood. We consider two choices of transformation function for f_A(·): a fixed transformation and a learnable transformation.

Fixed transformation. In the proof of Theorem 2, we find that the function Φ_M(x) defined in Eq. 5 can be used as a continuous transformation function that makes the sum aggregation continuous and injective in most cases. Let M_n be the maximum neighborhood size over all nodes in all the graphs. Usually, if M_n is not very large, we can set the transformation function to Φ_{M_n}(x) for each layer k to maintain the expressive power. Then f_A(·) for layer k can be written as

f_A^(k)({H^(k)(w) | w ∈ N(v)}) = Σ_{w∈N(v)} Φ_{M_n}(H^(k)(w))   (7)

Combining Eqs. 1, 2, 6 and 7, we get the propagation rule

H^(k+1)(v) = MLP^(k)([H^(k)(v), Σ_{w∈N(v)} Φ_{M_n}(H^(k)(w))])   (8)

Though the function Φ_M(x) defined in Eq. 5 makes the sum aggregation continuous and injective, it may cause numerical instability, since the term x[i]^M becomes very large or very close to 0 when M is large. To address this issue, we use a continuous and injective function g(x) to normalize the powers: from the proof of Theorem 2, Φ_M(g(x)) is also a qualified transformation function that makes the sum aggregation continuous and injective, provided g(x) is continuous and injective. In this paper, we set

g(x)[i] = x[i]^{1/M} if x[i] ≥ 0;  -(-x[i])^{1/M} if x[i] < 0   (9)

Learnable transformation. Owing to the continuity of the transformation function, we can also use a learnable MLP to approximate the transformation function for f_A(·), by the universal approximation theorem (Cybenko, 1989), which gives the propagation rule

H^(k+1)(v) = MLP_c^(k)([H^(k)(v), Σ_{w∈N(v)} MLP_t^(k)(H^(k)(w))])   (10)

where MLP_t^(k) and MLP_c^(k) serve as the transformation function and the combine function for the kth layer, respectively. In this way, we obtain the node embeddings for all graphs. For graph or subgraph embedding, we need another aggregation function f_R(·) to aggregate all the node embeddings in a graph.

READOUT function. f_R(·) operates on the set of all node embeddings in a graph. For large graphs with many nodes, a fully injective and continuous set function would generate very high-dimensional embeddings. We therefore also use a learnable MLP as the transformation function to reduce the output dimension:

H_G = f_R({H^(K)(v) | v ∈ G}) = Σ_{v∈G} MLP_G(H^(K)(v))   (11)

Since the final node embeddings are output from MLP_c^(K) and directly input to MLP_G, we merge the two MLPs into one in the implementation. For graph classification, the graph-level embedding is input to an MLP classifier with n_C outputs corresponding to the probabilities of the n_C classes.
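A minimal numpy sketch of the iceGNN-fixed aggregation (Eqs. 5, 7 and 9), implementing the augmented transformation Φ̄_M together with the signed M-th-root normalization g; the 0-based indexing, function names and shapes are our own simplifications, not the released implementation.

```python
import numpy as np

def g(x, M):
    # Eq. 9: signed M-th root, continuous and injective, tames x[i]**M growth
    return np.sign(x) * np.abs(x) ** (1.0 / M)

def P(x, i, M):
    # P_i = [1, x[i], x[i]^2, ..., x[i]^M]   (i is 0-based here)
    return np.array([x[i] ** p for p in range(M + 1)])

def P2(x, i, j, M):
    # P_{i,j} = [x[j], x[i] x[j], ..., x[i]^{M-1} x[j]]
    return np.array([x[i] ** p * x[j] for p in range(M)])

def phi_bar(x, M):
    # Eq. 5: Phi_M(x) = [P_1, P_{1,2}, ..., P_{1,d}], then the augmented
    # variant appends P_2, ..., P_d for full injectivity
    d = len(x)
    parts = [P(x, 0, M)] + [P2(x, 0, j, M) for j in range(1, d)]
    parts += [P(x, i, M) for i in range(1, d)]
    return np.concatenate(parts)

def aggregate_fixed(neighbors, M):
    # Eq. 7: sum aggregation after the fixed transformation, with
    # g-normalized inputs for numerical stability
    return np.sum([phi_bar(g(x, M), M) for x in neighbors], axis=0)
```

By construction the aggregate is permutation-invariant, and distinct small neighborhoods produce distinct sums, which is the behavior Theorem 2 guarantees.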
We only use the final GNN layer outputs for classification, rather than concatenating all layers' outputs into a longer vector representation as GIN does, because we believe the final layer outputs contain all the information from the middle layers and are expressive enough for graph classification. In addition, this reduces the input dimension of the final classifier, resulting in a simpler classifier than GIN's, especially in the case of many layers. Remark. Via the propagation rules in Eqs. 8 and 10, we implement two iceGNN variants, namely iceGNN-fixed and iceGNN-MLP, respectively. In practice, for a specific learning task, e.g., graph classification mapping a graph to a single label, the whole process cannot be injective, and a certain dimension reduction must be applied; the key is to ensure that the dimension reduction is guided by the target task. For example, sum aggregation is not injective; moreover, it is fixed and cannot be adjusted by the loss function, so the aggregation could lose important information related to the target task. In our framework, according to the theoretical result in Theorem 2, we achieve an injective and continuous aggregation by employing a transformation function Φ(x). For iceGNN-fixed, a fixed transformation function makes the aggregation injective and continuous, and a learnable MLP then reduces the dimension. For iceGNN-MLP, we combine the aggregation and the dimension reduction (a linear function L) in one learnable MLP. Thus, the learned low-dimensional features are related to the task.
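For comparison, here is a sketch of one iceGNN-MLP layer following the propagation rule of Eq. 10, with small dense networks standing in for MLP_t and MLP_c; the parameter layout is illustrative only and not taken from the released code.

```python
import numpy as np

def mlp(params, x):
    # a two-layer perceptron; params = (W1, b1, W2, b2)
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

def ice_gnn_mlp_layer(H, adj, t_params, c_params):
    # Eq. 10: H^(k+1)(v) = MLP_c([H^(k)(v), sum_{w in N(v)} MLP_t(H^(k)(w))])
    # MLP_t is the learnable transformation; MLP_c combines and reduces dimension
    out = []
    d_t = t_params[2].shape[0]  # output dimension of MLP_t
    for v, nbrs in enumerate(adj):
        s = sum((mlp(t_params, H[w]) for w in nbrs), np.zeros(d_t))
        out.append(mlp(c_params, np.concatenate([H[v], s])))
    return np.stack(out)
```

Unlike the fixed variant, both the transformation and the reduction here are trained end to end, so the retained dimensions are shaped by the task loss.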

5.1. GRAPH CLASSIFICATION

Datasets. We use 8 simple graph benchmarks and 5 attributed graph benchmarks for graph classification. The 8 simple graph datasets contain 4 bioinformatics datasets (MUTAG, PTC, NCI1, PROTEINS) and 4 social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI, and REDDIT-BINARY) (Yanardag & Vishwanathan, 2015). For the bioinformatics datasets, the categorical node labels are encoded as one-hot input features; for the social network datasets, because the nodes have no given features, we initialize all node features to 1. The 5 attributed graph datasets contain 3 bioinformatics datasets (ENZYMES, FRANKENSTEIN, PROTEINSatt) and 2 synthetic datasets (SYNTHETICnew, Synthie). More detailed information can be found in Appendix A.4 Table 6. Baselines. We compare our models with a number of state-of-the-art methods, listed in the first column of Tables 1 and 2 for simple graph classification and attributed graph classification, respectively, including GIN (Xu et al., 2019), GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), PSCN (Niepert et al., 2016), the WL subtree kernel (Shervashidze et al., 2011), GK (Shervashidze et al., 2009), DGK (Yanardag & Vishwanathan, 2015) and WWL (Togninalli et al., 2019); the per-dataset accuracies for these methods and for iceGNN-fixed, iceGNN-MLP and GIN-final are reported in Tables 1 and 2.
Besides GIN, which inspired this work, and the popular GCN (Kipf & Welling, 2017), we also include recent studies on the expressive power of GNNs, e.g., Ring-GNN (Chen et al., 2019) and GHC (Nguyen & Maehara, 2020), and state-of-the-art methods on neighborhood aggregation, e.g., GraphSAGE (Hamilton et al., 2017) and HaarPool (Wang et al., 2020). For attributed graphs, few results on these benchmarks with deep learning methods are available in the literature; we are only aware of graph kernel related baselines, listed in the first column of Table 2. We also compare iceGNN with our implemented GIN-final and GCN for attributed graph classification, both implemented by adjusting the corresponding official code. Results. From Table 1, iceGNN achieves top performance on most of the simple graph datasets. In particular, on the PTC dataset our two variants achieve the two best results, improving 2 points over the previous best. Although iceGNN only achieves top 3 on REDDIT-BINARY among the social network datasets, it provides acceptable results overall. From Table 2, for all the attributed graph datasets, iceGNN achieves top 3 among the 10 models; in particular, iceGNN-MLP places first on 3 datasets. Comparing iceGNN and GIN-final, iceGNN consistently outperforms GIN-final, except for iceGNN-fixed on the ENZYMES dataset. Performance on the training set. To evaluate expressive power, Figure 2(a-c) illustrates training-set accuracies during training on 3 datasets for graph classification; more results on other datasets can be found in Appendix A.6 Figure 4. We can see that iceGNN-MLP is able to fit the training sets perfectly on the different datasets and is better than GIN-final and iceGNN-fixed; specifically, in terms of expressive power, iceGNN-MLP > iceGNN-fixed > GIN-final > GCN. For the MUTAG and PTC datasets, GIN in Xu et al. (2019) fits the training set well while GIN-final does not; since GIN in Xu et al. (2019) concatenates all the middle layer outputs as the graph embedding, the final layer outputs of GIN may lose some information from the middle layers, which may explain the gap. We are aware that iceGNN has more parameters than GIN-final (Appendix A.7 Table 8). To check whether the expressive power is merely due to more parameters, we enlarge GIN with 5 hidden layers in each MLP, so that the number of parameters reaches the same scale as iceGNN, denoting this architecture GIN-mlp5. We find that GIN-mlp5 still cannot reach the training accuracy of iceGNN, and is even worse than GIN-final, as shown in Figure 2(a-c). Recent GNN benchmarks. The recent work by Dwivedi et al. (2020) proposed new GNN benchmarks; we also test our model on their ZINC and MNIST datasets, for graph regression and classification, respectively. To ensure a fair comparison, we followed their problem setting (data splits, optimizer, etc.) and GNN structure (number of layers, normalization). Our results, together with some state-of-the-art results from Dwivedi et al. (2020), are listed in Table 3. From the results, we can see that iceGNN-MLP performs best on both ZINC and MNIST among all the models, demonstrating its effectiveness.

5.2. NODE CLASSIFICATION

We use three popular citation network datasets, Cora, Citeseer, and Pubmed (Sen et al., 2008), for semi-supervised node classification. Detailed information on the datasets is summarized in Appendix A.4 Table 7. We compare our performance with a recent state-of-the-art method, GCNII (Chen et al., 2020). From the experimental results, we find that iceGNN-MLP often performs better than iceGNN-fixed. We identify two reasons that could make iceGNN-MLP outperform iceGNN-fixed: (1) by the discussion in Section 4.2, iceGNN-MLP can also retain the graph label information; (2) iceGNN-fixed usually has a much higher-dimensional input to each GNN layer, and thus more parameters to train, with a higher risk of encountering local optima and plateaus. This phenomenon also exists in general MLPs.

6. CONCLUSION

In this paper, we present a theoretical framework for designing highly expressive GNNs for general graphs. Based on the framework, we propose two iceGNN variants, with a fixed and a learnable transformation function, respectively. Moreover, the proposed iceGNN can naturally learn expressive representations for graphs with continuous node attributes. We validate the proposed GNN on graph classification and node classification over multiple benchmark datasets, including simple graphs and attributed graphs. The experimental results demonstrate that our model achieves state-of-the-art performance on most of the benchmarks. Future directions include extending the framework to graphs with continuous edge attributes.

A APPENDIX

A.1 COMMON NOTATIONS USED THROUGHOUT THE PAPER

[x_1, …, x_n]: the concatenation of vectors x_1, …, x_n
H^(k)(v): the vector representation of node v in the kth layer
N(v): the neighborhood of node v in a graph
x[i]: the ith entry of vector x
x[i:j]: the subvector of x between indices i and j (inclusive)

A.2 PROOF OF THEOREM 2

Theorem 2. Assume X_M is a set of finite subsets of R^d with size at most M, i.e., for M ∈ N, X_M = {X | X ⊂ R^d, |X| ≤ M}. Then there exist infinitely many continuous functions Φ : R^d → R^D such that the set function f : X_M → R^D, f(X) = Σ_{x∈X} Φ(x), is continuous and injective.

Proof. We prove the theorem in three steps: 1. constructing a satisfying function Φ(x) in the one-dimensional case (d = 1); 2. constructing a satisfying function Φ(x) in the multi-dimensional case (d > 1); 3. showing the number of satisfying functions is infinite.

1. One-dimensional case (d = 1): In the one-dimensional case, the theorem can be proved by extending the following lemma from Zaheer et al. (2017).

Lemma. Let X = {(x_1, …, x_M) ∈ [0, 1]^M : x_1 ≤ x_2 ≤ … ≤ x_M}. The sum-of-power mapping E : X → R^{M+1} defined by the coordinate functions

E(X) = [E_0(X), E_1(X), …, E_M(X)] = [Σ_{x∈X} 1, Σ_{x∈X} x, …, Σ_{x∈X} x^M]

is injective.

In Zaheer et al. (2017), this lemma is proved using the classical Newton-Girard formulae, and the domain can be extended to X_M = {X | X ⊂ R, |X| ≤ M, M ∈ N} by the same proof. Because E_0(X) = Σ_{x∈X} 1 = |X| is the number of elements in X, E_0(X_1) = E_0(X_2) implies the two sets have equal size, so the lemma easily extends to sets of size at most M. Since E(X) = [Σ_{x∈X} 1, Σ_{x∈X} x, …, Σ_{x∈X} x^M] = Σ_{x∈X} [1, x, …, x^M], let Φ(x) = [1, x, …, x^M]; obviously Φ(x) is continuous, so we obtain one Φ(x) such that f(X) = Σ_{x∈X} Φ(x) is injective and continuous.
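As a numerical sanity check of the lemma (our illustration, not part of the proof), the sum-of-power mapping E can be inverted in practice: Newton's identities recover the elementary symmetric polynomials from the power sums, and the multiset is then the root set of the corresponding polynomial. A numpy sketch with our own function names:

```python
import numpy as np

def power_sums(X):
    # E(X) = [|X|, sum x, sum x^2, ..., sum x^M] for a 1-d multiset X
    X = np.asarray(X, dtype=float)
    M = len(X)
    return np.array([M] + [np.sum(X ** p) for p in range(1, M + 1)])

def recover_multiset(E):
    # invert the sum-of-power mapping: Newton's identities give the elementary
    # symmetric polynomials e_k, whose polynomial's roots are the multiset
    M = int(round(E[0]))
    p = E[1:]
    e = np.zeros(M + 1)
    e[0] = 1.0
    for k in range(1, M + 1):
        # k * e_k = sum_{i=1..k} (-1)^{i-1} e_{k-i} p_i
        e[k] = sum((-1) ** (i - 1) * e[k - i] * p[i - 1]
                   for i in range(1, k + 1)) / k
    # prod (x - x_m) = x^M - e_1 x^{M-1} + e_2 x^{M-2} - ...
    coeffs = [(-1) ** k * e[k] for k in range(M + 1)]
    return np.sort(np.roots(coeffs).real)
```

The round trip returns the original multiset (up to root-finding precision), which is exactly the injectivity the lemma asserts.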

2. Multi-dimensional case (d > 1):

For a d-dimensional vector x and an integer M, we define

P_{i,j}(x; M) = [x[j], x[i]x[j], x[i]^2 x[j], …, x[i]^{M-1} x[j]]

where i, j ∈ [1, d] and x[i] is the ith entry of x. For a given i ∈ [1, d], define Φ_i : R^d → R^{dM+1}, Φ_i(x; M) = [1, P_{i,1}, …, P_{i,d}]. We will prove that Φ_i(x; M) meets the condition that f(X) = Σ_{x∈X} Φ_i(x; M) is injective and continuous if x[i] is distinct across x ∈ X. We only consider Φ_1(x; M); for i = 2, …, d, the situation is similar. For a clear understanding, we reshape Φ_1(x; M) into d rows, like a matrix:

Φ_1(x; M) =
[ 1, x[1], x[1]^2, …, x[1]^M ;
  x[2], x[1]x[2], …, x[1]^{M-1} x[2] ;
  … ;
  x[d], x[1]x[d], …, x[1]^{M-1} x[d] ]   (14)

For a (dM + 1)-dimensional vector V in the image of X_M under f(X) = Σ_{x∈X} Φ_1(x; M), we will identify the number of preimages of V. Let X be a preimage of V; then we have the following equation, which is a system of dM + 1 equations:

V =
[ Σ_{x∈X} 1, Σ_{x∈X} x[1], Σ_{x∈X} x[1]^2, …, Σ_{x∈X} x[1]^M ;
  Σ_{x∈X} x[2], Σ_{x∈X} x[1]x[2], …, Σ_{x∈X} x[1]^{M-1} x[2] ;
  … ;
  Σ_{x∈X} x[d], Σ_{x∈X} x[1]x[d], …, Σ_{x∈X} x[1]^{M-1} x[d] ]   (15)

Note that the first row of Eq. 15 is exactly the sum-of-power mapping considered in the one-dimensional case; thus the multiset of first entries of the elements of X is uniquely identified, and so is the number of elements in X. Let X have M elements, X = {x_1, x_2, …, x_M}; then the multiset {x_1[1], x_2[1], …, x_M[1]} is uniquely defined. Considering the second row of Eq. 15, we can rewrite its equations as the linear matrix equation

[ 1, 1, …, 1 ;
  x_1[1], x_2[1], …, x_M[1] ;
  x_1[1]^2, x_2[1]^2, …, x_M[1]^2 ;
  … ;
  x_1[1]^{M-1}, x_2[1]^{M-1}, …, x_M[1]^{M-1} ]
× [x_1[2], x_2[2], …, x_M[2]]^T = V[2, :]^T   (16)

Note that the coefficient matrix on the left side of Eq. 16 is a Vandermonde matrix. If x_1[1], …, x_M[1] are all distinct, the coefficient matrix is invertible (Macon & Spitzbart, 1958), and Eq. 16 has a unique solution for x_1[2], …, x_M[2] corresponding to x_1[1], …, x_M[1]. Similarly, from the ith (2 < i ≤ d) row of Eq. 15, x_1[i], …, x_M[i] can be uniquely identified. In the other case, if the x_1[1], …, x_M[1] solved from the first row of Eq. 15 are not all distinct, Eq. 16 has infinitely many solutions, and Φ(x) defined by Eq. 14 is not sufficient to make f(X) = Σ_{x∈X} Φ(x) injective; we need to append some more dimensions to Φ(x). Let x_1[1] = x_2[1] = … = x_k[1]; then, by combining terms, Eq. 16 shrinks to

[ 1, 1, …, 1 ;
  x_k[1], x_{k+1}[1], …, x_M[1] ;
  x_k[1]^2, x_{k+1}[1]^2, …, x_M[1]^2 ;
  … ;
  x_k[1]^{M-1}, x_{k+1}[1]^{M-1}, …, x_M[1]^{M-1} ]
× [Σ_{i=1…k} x_i[2], x_{k+1}[2], …, x_M[2]]^T = V[2, :]^T   (17)

By solving Eq. 17, we obtain a unique sum Σ_{i=1…k} x_i[2]. To identify a unique multiset {x_1[2], x_2[2], …, x_k[2]}, it suffices to determine the power sums Σ_{i=1…k} x_i[2]^2, Σ_{i=1…k} x_i[2]^3, …, Σ_{i=1…k} x_i[2]^k. We can add the items x[2]^2, x[1]x[2]^2, x[1]^2 x[2]^2, …, x[1]^{M-1} x[2]^2 to Φ(x) to uniquely identify Σ_{i=1…k} x_i[2]^2, and similarly add the items x[2]^k, x[1]x[2]^k, x[1]^2 x[2]^k, …, x[1]^{M-1} x[2]^k to Φ(x) to uniquely identify Σ_{i=1…k} x_i[2]^k. Thus all the x_i[2] are identified. After the multiset {x_1[2], x_2[2], …, x_M[2]} is uniquely defined, by adding x[2]^i x[j] (i = 0, …, M-1; j = 3, …, d) to Φ(x), we can use the x_i[2] to construct a Vandermonde matrix to solve for x_i[3], …, x_i[d] (i = 1, …, M). If the x_i[1:2] are not all distinct, we can identify the x_i[3] similarly by adding items x[1]^i x[3]^j to Φ(x). In this way, the set X can be uniquely identified.
In our construction of $\Phi(x)$, all the functions are continuous; thus there exists a continuous function $\Phi(x)$ such that $f(X) = \sum_{x\in X}\Phi(x)$ is injective and continuous.
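The Vandermonde step above can be checked numerically. The following NumPy sketch (illustrative code with hypothetical helper names, not the paper's implementation) builds $f(X) = \sum_{x\in X}\Phi_1(x; M)$ for a toy multiset of vectors in $\mathbb{R}^2$ with distinct first entries, then recovers the second entries by solving the system of Eq. (16):

```python
import numpy as np

def phi1(x, M):
    """Phi_1(x; M): [1, x[1], ..., x[1]^M, x[2], x[1]x[2], ..., x[1]^(M-1)x[2], ...]."""
    powers = x[0] ** np.arange(M)                      # 1, x[1], ..., x[1]^(M-1)
    rows = [np.concatenate(([1.0], x[0] * powers))]    # first row has M+1 entries
    for j in range(1, len(x)):
        rows.append(x[j] * powers)                     # each later row has M entries
    return np.concatenate(rows)

def recover_second_entries(first_entries, V_row2):
    """Solve the Vandermonde system of Eq. (16) for the second entries,
    assuming the first entries are all distinct (so the matrix is invertible)."""
    M = len(first_entries)
    A = np.vander(first_entries, M, increasing=True).T  # rows are powers 0..M-1
    return np.linalg.solve(A, V_row2)

# Toy multiset X of M = 3 vectors in R^2 with pairwise distinct first entries.
X = [np.array([1.0, 5.0]), np.array([2.0, -3.0]), np.array([4.0, 0.5])]
M = len(X)
f_X = sum(phi1(x, M) for x in X)            # the aggregated embedding f(X)

first = np.array([x[0] for x in X])         # known from the first row (sum-of-powers)
V_row2 = f_X[M + 1:]                        # sums: sum x[2], sum x[1]x[2], sum x[1]^2 x[2]
print(recover_second_entries(first, V_row2))  # recovers [5.0, -3.0, 0.5] up to float error
```

The recovered second entries match the original multiset, mirroring how the proof inverts Eq. (16) when the first entries are distinct.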

3. The number of satisfying functions is infinite:

To prove that the number of such functions is infinite, we use a continuous injective function $g : \mathbb{R}^d \to \mathbb{R}^d$. We will show that if $\phi(x)$ satisfies the condition that $f(X) = \sum_{x\in X}\phi(x)$ is continuous and injective, then $\phi(g(x))$ also satisfies the condition that $f(X) = \sum_{x\in X}\phi(g(x))$ is continuous and injective.

We define a function $h : \mathcal{X} \to \mathcal{X}$, $h(X) = \{g(x) \mid x \in X\}$; since $g(x)$ is injective, $h(X)$ is also injective. If we have a function $\phi(x)$ such that $f(X) = \sum_{x\in X}\phi(x)$ is injective, then $f(h(X))$ is injective:
$$f(h(X)) = \sum_{x\in h(X)} \phi(x) = \sum_{x\in\{g(x)\mid x\in X\}} \phi(x) = \sum_{x\in X} \phi(g(x)).$$
Because $\phi(x)$ and $g(x)$ are both continuous, $\phi(g(x))$ is continuous; thus we have found another function $\phi(g(x))$ such that $\sum_{x\in X}\phi(g(x))$ is injective. Because there are infinitely many such continuous injective functions $g : \mathbb{R}^d \to \mathbb{R}^d$ (e.g., $g(x) = kx$ with $k \in \mathbb{R}$, $k \neq 0$), there are infinitely many functions $\Phi(x) = \phi(g(x))$ such that $f(X) = \sum_{x\in X}\Phi(x)$ is injective.
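As a small numeric illustration of this composition argument (a sketch for the one-dimensional case, not part of the proof): take the sum-of-powers map $\phi$ and $g(x) = 2x$; both $\phi$ and $\phi \circ g$ separate two different multisets that happen to have the same element sum.

```python
import numpy as np

def f(phi, X):
    """Set function f(X) = sum over x in X of phi(x)."""
    return sum(phi(x) for x in X)

phi = lambda x: np.array([1.0, x, x**2, x**3])  # sum-of-powers map for 3-element sets
g = lambda x: 2.0 * x                           # one continuous injective g; any k != 0 works

A = [0.5, 1.0, 2.0]
B = [0.5, 1.5, 1.5]   # a different multiset with the same element sum (3.5)

# phi separates A and B, and so does the composed map phi(g(x)).
print(not np.allclose(f(phi, A), f(phi, B)))                                   # True
print(not np.allclose(f(lambda x: phi(g(x)), A), f(lambda x: phi(g(x)), B)))   # True
```

Each choice of $k$ yields a different composed map, which is the source of the infinite family.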

A.4 DATASET DETAILS

The dataset information is given in Tables 6 and 7. For GIN-mlp(n), the numbers in parentheses for different datasets are the numbers of layers in an MLP. GIN-mlp(n) is implemented to enlarge the GIN model to the same parameter scale as iceGNN-MLP, to make a fair comparison of expressive power. All the models have a hidden dimension of 16; GIN-mlp(n) is a GIN model where all MLPs are implemented with n layers. Since iceGNN-fixed cannot be implemented with a large MaxNb (MaxNb is defined in Tables 6 and 7), the transformation functions in iceGNN-fixed are set to Φ_4(x) (defined in Eq. (5) in the main paper). The results show that iceGNN is more expressive than GIN on most of the datasets, and that the expressive power of iceGNN comes from the injective and continuous aggregation scheme rather than from the number of parameters.

A.9 HYPER-PARAMETER SETTING

For reproducibility, Table 9 provides the hyper-parameters used to achieve the results of iceGNN in Tables 1 and 2 in the main paper, and the code is also attached with this file. For the social network datasets, since MaxNb is too large to implement iceGNN-fixed, we only implement iceGNN-MLP. Since all nodes are of the same type, an identical transformation function is applied in the first layer. For attributed graphs, to preserve injectivity and continuity, the transformation function in the first layer is not identical. The results of iceGNN, GIN and GCN in Table 4 in the main paper are achieved with the hyper-parameters in Table 10, and all the learning rates are 0.01. The hyper-parameters of GCNII and GCNII* are the same as in the official code (https://github.com/chennnM/GCNII), where the hidden dimensions are 64, 256, and 256 for Cora, Citeseer and Pubmed, respectively.



Figure 1: An overview of our framework on an exemplar attributed graph in one iteration. (a) Original graph with attributed nodes; (b) graph nodes are represented by the corresponding attribute and neighborhood set through the WL test; (c) the node vector representations after an injective set function on the neighborhood sets, where the set function is f(X) = Σ_{x∈X} (1, x, x², x³, x⁴); (d) a non-injective alternative set function in GNNs: after aggregation, nodes B and D still have the same representation despite their different neighborhoods.
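The contrast between panels (c) and (d) of Figure 1 can be reproduced in a few lines of NumPy (an illustrative sketch with hypothetical scalar node features, not the paper's code): a mean aggregator collapses two different neighborhoods to the same value, while the sum-of-powers set function from panel (c) separates them.

```python
import numpy as np

def injective_agg(neighbors):
    """f(X) = sum over neighbors of (1, x, x^2, x^3, x^4), as in Fig. 1(c)."""
    return sum(np.array([1.0, x, x**2, x**3, x**4]) for x in neighbors)

def mean_agg(neighbors):
    """A common non-injective alternative: the mean of neighbor features."""
    return np.mean(neighbors)

# Hypothetical scalar features of the neighborhoods of two nodes B and D.
nb_B = [1.0, 3.0]   # two neighbors
nb_D = [2.0]        # one neighbor with the same mean as B's neighborhood

print(mean_agg(nb_B) == mean_agg(nb_D))                       # True: mean cannot tell them apart
print(np.allclose(injective_agg(nb_B), injective_agg(nb_D)))  # False: sum-of-powers separates them
```

Note that the first coordinate of the sum-of-powers embedding already encodes the neighborhood size, which the mean discards.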

Figure 2: The accuracy curves on training set in the training process.

Figure 3: t-SNE visualization of the output embeddings on training data of NCI1 dataset.

Figure 4: The accuracy curves on training set in the training process. All the models have a hidden dimension 16.

Accuracy for simple graph classification on the test set (%). Top 3 performances on each dataset are bolded. The best performances are underlined. The first two rows are our results, the middle part corresponds to the deep learning methods, and the bottom part corresponds to the graph kernel methods.

Accuracy for attributed graph classification on the test set (%). Top 3 performances on each dataset are bolded. The best performances are underlined.

GNN performance on ZINC dataset for graph regression and on MNIST dataset for graph classification. Top 3 performances on each dataset are bolded. The best performances are underlined.

Node classification results. GCNII* is a variant of GCNII from Chen et al. (2020).

Since the expressive power describes the ability of a model to discriminate different nodes, a larger training set can reflect the expressive power better. Because the official splits have only a few nodes in training, we split the nodes into training, validation and test sets by 8:1:1. The node classification results on the test sets are listed in Table 4. We found that iceGNNs can outperform all other baselines on the training set; especially on the Cora and Citeseer datasets, iceGNNs achieve nearly 100% accuracy, suggesting that iceGNNs have strong expressive ability. On the test set, iceGNN-MLP performs better than all other models on the Pubmed dataset, but does not perform as well on the Cora and Citeseer datasets compared to GCN and GIN-final, suggesting that iceGNN-MLP generalizes well on Pubmed but less so on Cora and Citeseer.

5.3 EXPRESSIVE CAPABILITY ANALYSIS

The expressive power describes how well a model can distinguish different samples. Generally, a highly expressive model will map different samples to different embeddings and similar samples to similar embeddings. For a classification problem, an expressive model should make samples in the same class compact and samples in different classes highly discriminative. Here, we fetch the output embeddings of the GNN before they are fed to the classifier, and visualize them to see whether the GNN can discriminate samples from different classes. Figure 3 shows the t-SNE visualization (Maaten & Hinton, 2008) of the output graph representations of different GNN models on the training data of the NCI1 dataset. More visualization results can be found in Appendix A.8. We can see that the output embeddings of iceGNN-MLP and iceGNN-fixed are discriminative on both datasets, while the less expressive GIN-final shows somewhat more overlap between different classes, which validates the expressive capability of iceGNN.

Common notations used throughout the paper.

Dataset information for graph classification. All datasets are available fromKersting et al. (2016). #G=number of graphs. #C=number of classes. #NC=number of node types. AvgN=average number of nodes in one graph. AvgE=average number of edges in one graph. Dim=node attribute dimension. MaxNb is the max 1-hop neighbors in all the nodes.

Dataset information for node classification.

We implement two iceGNN variants: (1) iceGNN-fixed, where the fixed transformation functions in all layers are set as one of the two transformation functions Φ_M(x) defined in Eq. (4) in the main paper; (2) iceGNN-MLP, where the transformation functions in all layers are learnable MLPs. Since for simple graphs with one-hot node features the summation with an identical transformation is injective, we treat whether the transformation function in the first layer is identical or an MLP as an optional hyper-parameter for simple graph classification. For the social network datasets, the max neighborhood size is so large that a fixed transformation function would produce a very large hidden dimension, so we do not implement the fixed transformation function; also, the nodes have no initial features in the first layer. We also implement GIN with the output of the final layer as node embeddings, summed into the graph embedding, denoted as GIN-final. The two iceGNN variants and GIN-final are implemented with 5 layers, and all MLPs in iceGNN have 2 layers. Batch normalization (Ioffe & Szegedy, 2015) is applied in every hidden layer (including GNN layers and MLP layers), followed by a ReLU activation function. We use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate and decay the learning rate by 0.5 every 50 epochs. The batch size is 32, and no dropout layer is applied. The search space of hyper-parameters tuned for each dataset is: (1) the number of hidden units {16, 32, 64}; (2) the initial learning rate {0.01, 0.001}; (3) for the 4 bioinformatic simple graph datasets, whether the transformation function in the first layer is identical or an MLP; (4) for iceGNN-fixed, the choice between the two transformation functions in Eq. (4) in the main paper. For each dataset, we follow the standard 10-fold cross validation protocol and use the same splits as Xu et al. (2019). Following the previous work (Xu et al., 2019; Maron et al., 2019a; Mao et al., 2020), we report the best averaged validation accuracy across the 10 folds for a fair comparison. All models are trained for 300 epochs. To evaluate the expressive capability, we also record the average training accuracy across the 10 folds of iceGNNs and GIN-final in each epoch. All the experiments were run on 10 Tesla V100 GPUs with PyTorch.

A.6 ADDITIONAL RESULTS

The accuracy curves on the training set during training on more datasets are shown in Figure 4, where Figures 4a-4j are for graph classification and Figures 4k-4l are for node classification.
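The step-decay learning-rate schedule described above (decay by 0.5 every 50 epochs over 300 epochs) can be sketched in plain Python; the actual PyTorch training loop is in the attached code, and the initial learning rate of 0.01 here is just one value from the search space:

```python
def lr_at_epoch(epoch, initial_lr=0.01, decay=0.5, step=50):
    """Step-decay schedule: multiply the learning rate by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)

# Over the 300 training epochs the learning rate takes six values:
schedule = [lr_at_epoch(e) for e in range(0, 300, 50)]
print(schedule)  # [0.01, 0.005, 0.0025, 0.00125, 0.000625, 0.0003125]
```

In PyTorch this corresponds to pairing `torch.optim.Adam` with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)`.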

lists the comparison of the number of trainable parameters in different models on different datasets.

Number of trainable parameters of different models on different datasets. All the models have a hidden dimension 16.

The hyper-parameters corresponding to the results of iceGNN in Tables 1 and 2 in the main paper. h=hidden dimension; lr=learning rate; TF=transformation function; FI=whether to apply an identical transformation function for the first layer.

The hyper-parameters corresponding to the results of iceGNN in Table 4 in the main paper. h=hidden dimension; TF=transformation function; FI=whether to apply an identical transformation function for the first layer.

A.3 PROOF OF THEOREM 3

Theorem 3. Let $M \in \mathbb{N}$ and $\mathcal{X} = \{X \mid X \subset \mathbb{R}^d, |X| = M\}$. Then for any continuous function $\Phi : \mathbb{R}^d \to \mathbb{R}^N$ with $N < dM$, the set function $f : \mathcal{X} \to \mathbb{R}^N$, $f(X) = \sum_{x\in X}\Phi(x)$, is not injective.

Proof. Suppose $f(X)$ is injective. Because $\Phi(x)$ is continuous and $f(X)$ is a finite sum of continuous functions, $f(X)$ is also continuous; thus $f(X)$ is continuous and injective.

All sets in $\mathcal{X}$ have $M$ elements from $\mathbb{R}^d$. In the one-dimensional case ($d = 1$), $\mathcal{X}$ has a bijection to the set $S$ of strictly increasing $M$-tuples, obtained by sorting the elements of each $X$. In the multi-dimensional case ($d > 1$), we can construct a bijection from $\mathcal{X}$ to a subset $S \subset \mathbb{R}^{dM}$: we sort the elements of $X$ by the first entry; for elements whose first entries are equal, we sort them by the second entry, and so on. Repeating in this way, we obtain a final ordered sequence of the vectors, which is unique in $S$. Note that $S$ is a convex open subset of $\mathbb{R}^{dM}$, and is therefore homeomorphic to $\mathbb{R}^{dM}$. Since $N < dM$, no continuous injection exists from $\mathbb{R}^{dM}$ to $\mathbb{R}^N$. Thus no continuous injective function exists from $\mathcal{X}$ to $\mathbb{R}^N$, and we have reached a contradiction.
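The lexicographic sorting used to build the bijection from $\mathcal{X}$ to $S$ can be sketched as follows (illustrative Python, not part of the formal proof): every ordering of the same set maps to the same point in $\mathbb{R}^{dM}$, while different sets map to different points.

```python
def canonical_point(X):
    """Map a set X of M vectors in R^d to a point in R^(d*M) by the
    lexicographic sort used in the proof: sort by the first entry,
    break ties by the second entry, and so on, then concatenate."""
    ordered = sorted(X)                       # Python tuples sort lexicographically
    return tuple(v for x in ordered for v in x)

# Two orderings of the same 3-element set in R^2 map to the same point in R^6,
X1 = [(2.0, 0.0), (1.0, 5.0), (1.0, 3.0)]
X2 = [(1.0, 3.0), (2.0, 0.0), (1.0, 5.0)]
print(canonical_point(X1) == canonical_point(X2))   # True

# while a different set maps to a different point.
X3 = [(1.0, 3.0), (2.0, 0.0), (1.0, 4.0)]
print(canonical_point(X1) == canonical_point(X3))   # False
```

Note how the two elements sharing the first entry 1.0 are ordered by their second entries, exactly the tie-breaking rule in the proof.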

A.8 VISUALIZATION

The graph embeddings output from the final layer of the GNNs on some other datasets are visualized with t-SNE (Figures 5, 6, 7 and 8).

