YOUR NEIGHBORS ARE COMMUNICATING: TOWARDS POWERFUL AND SCALABLE GRAPH NEURAL NETWORKS

Abstract

Message passing graph neural networks (GNNs) are known to have their expressiveness upper-bounded by the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm. To achieve more powerful GNNs, existing attempts either require ad hoc features or involve operations that incur high time and space complexities. In this work, we propose a general and provably powerful GNN framework that preserves the scalability of the message passing scheme. In particular, we first propose to empower 1-WL for graph isomorphism testing by considering edges among neighbors, giving rise to NC-1-WL. The expressiveness of NC-1-WL is shown theoretically to be strictly above 1-WL and below 3-WL. Further, we propose the NC-GNN framework as a differentiable neural version of NC-1-WL. Our simple implementation of NC-GNN is provably as powerful as NC-1-WL. Experiments demonstrate that our NC-GNN achieves remarkable performance on various benchmarks.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008) have been demonstrated to be effective for various graph tasks. In general, modern GNNs employ a message passing mechanism where the representation of each node is recursively updated by aggregating representations from its neighbors (Atwood & Towsley, 2016; Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Xu et al., 2019; Gilmer et al., 2017) . Such message passing GNNs, however, have been shown to be at most as powerful as the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm (Weisfeiler & Lehman, 1968 ) in distinguishing non-isomorphic graphs (Xu et al., 2019; Morris et al., 2019) . Thus, message passing GNNs cannot distinguish some simple graphs and cannot detect certain important structural concepts (Chen et al., 2020; Arvind et al., 2020) . Recently, a lot of efforts have been made to improve the expressiveness of message passing GNNs by considering high-dimensional WL algorithms (e.g., Morris et al. (2019) ; Maron et al. ( 2019)), exploiting subgraph information (e.g., Bodnar et al. (2021a) ; Zhang & Li (2021) ), or adding more distinguishable features (e.g., Murphy et al. (2019) ; Bouritsas et al. (2022) ). As thoroughly discussed in Section 5, these existing methods either rely on handcrafted/predefined/domain-specific features, or require high computational cost and memory budget. In contrast, our goal in this work is to develop a general GNN framework with provably expressive power, while maintaining the scalability of the message passing scheme. Specifically, we first propose an extension of the 1-WL algorithm, namely NC-1-WL, by considering the edges among neighbors. In other words, we incorporate the information of which two neighbors are communicating (i.e., connected) into the graph isomorphism test algorithm. 
To achieve this, we mathematically model the edges among neighbors as a multiset of multisets, in which each edge is represented as a multiset of two elements. We theoretically show that the expressiveness of our NC-1-WL in distinguishing non-isomorphic graphs is strictly above 1-WL and below 3-WL. Further, based on NC-1-WL, we propose a general GNN framework, named NC-GNN, which can be considered as a differentiable neural version of NC-1-WL. We provide a simple implementation of NC-GNN that is proved to be as powerful as NC-1-WL. Compared to existing expressive GNNs, our NC-GNN is a general, provably powerful and, more importantly, scalable framework.

The main question addressed in our work is how to make the best use of information in the one-hop neighborhood to improve expressive power while preserving scalability. In the one-hop neighborhood of each node, the local patterns we can consider are (A) which nodes are the neighbors and (B) how the neighbors are connected to each other. Previous message passing GNNs only consider (A). We take a significant step forward to consider (B) by modeling edges among neighbors as a multiset of multisets, thereby obtaining provable expressive power while preserving scalability. From this perspective, our method is fundamentally different from existing methods that encode triangle features, such as MotifNet (Monti et al., 2018) and SIGN (Rossi et al., 2020). Specifically, these methods employ triangle-related motif-induced adjacency matrices in their convolution and diffusion operators, respectively. The edge weight in a motif-induced adjacency matrix is obtained by multiplying the original edge weight by the frequency with which the edge participates in triangle motifs. Compared to this hand-crafted approach, our method is a general framework to encode how the neighbors are connected to each other, and the expressiveness of our framework can be rigorously characterized.
We perform experiments on graph classification and node classification to evaluate NC-GNN comprehensively. Our NC-GNN consistently outperforms GIN, which is as powerful as 1-WL, by significant margins on various tasks. Remarkably, NC-GNN outperforms GIN by an absolute margin of over 12.0 on CLUSTER in terms of test accuracy. In addition, NC-GNN performs competitively with, and often better than, existing expressive GNNs, while being much more efficient.

2. PRELIMINARIES

We start by introducing notations. We represent an undirected graph as $G = (V, E, X)$, where $V$ is the set of nodes and $E \subseteq V \times V$ denotes the set of edges. We represent an edge $\{v, u\} \in E$ by $(v, u)$ or $(u, v)$ for simplicity. $X = [x_1, \cdots, x_n]^T \in \mathbb{R}^{n \times d}$ is the node feature matrix, where $n = |V|$ is the number of nodes and $x_v \in \mathbb{R}^d$ represents the $d$-dimensional feature of node $v$. $N_v = \{u \in V \mid (v, u) \in E\}$ is the set of neighboring nodes of node $v$. A multiset is denoted as $\{\!\{\cdots\}\!\}$ and formally defined as follows.

Definition 1 (Multiset). A multiset is a generalized concept of set allowing repeating elements. A multiset $X$ can be formally represented by a 2-tuple as $X = (S_X, m_X)$, where $S_X$ is the underlying set formed by the distinct elements in the multiset and $m_X : S_X \to \mathbb{Z}^+$ gives the multiplicity (i.e., the number of occurrences) of the elements. If the elements in the multiset are generally drawn from a set $\mathcal{X}$ (i.e., $S_X \subseteq \mathcal{X}$), then $\mathcal{X}$ is the universe of $X$ and we denote it as $X \subseteq \mathcal{X}$ for ease of notation.

Message passing GNNs. Modern GNNs usually follow a message passing scheme to learn node representations in graphs (Gilmer et al., 2017). To be specific, the representation of each node is updated iteratively by aggregating the multiset of representations formed by its neighbors. In general, the $\ell$-th layer of a message passing GNN can be expressed as

$$a_v^{(\ell)} = f_{\text{aggregate}}^{(\ell)}\big(\{\!\{h_u^{(\ell-1)} \mid u \in N_v\}\!\}\big), \quad h_v^{(\ell)} = f_{\text{update}}^{(\ell)}\big(h_v^{(\ell-1)}, a_v^{(\ell)}\big). \tag{1}$$

$f_{\text{aggregate}}^{(\ell)}$ and $f_{\text{update}}^{(\ell)}$ are the parameterized functions of the $\ell$-th layer. $h_v^{(\ell)}$ is the representation of node $v$ at the $\ell$-th layer and $h_v^{(0)}$ can be initialized as $x_v$. After employing $L$ such layers, the final representation $h_v^{(L)}$ can be used for prediction tasks on each node $v$. For graph-level problems, a graph representation $h_G$ can be obtained by applying a readout function as

$$h_G = f_{\text{readout}}\big(\{\!\{h_v^{(L)} \mid v \in V\}\!\}\big). \tag{2}$$

Definition 2 (Isomorphism).
Two graphs $G = (V, E, X)$ and $H = (P, F, Y)$ are isomorphic, denoted as $G \simeq H$, if there exists a bijective mapping $g : V \to P$ such that $x_v = y_{g(v)}, \forall v \in V$ and $(v, u) \in E$ iff $(g(v), g(u)) \in F$. Graph isomorphism is still an open problem without a known polynomial-time solution.

Weisfeiler-Lehman algorithm. The Weisfeiler-Lehman algorithm (Weisfeiler & Lehman, 1968) provides a hierarchy for the graph isomorphism testing problem. Its 1-dimensional form (a.k.a. 1-WL or color refinement) is a heuristic method that can efficiently distinguish a broad class of non-isomorphic graphs (Babai & Kucera, 1979). 1-WL assigns a color $c_v^{(0)}$ to each node $v$ according to its initial label (or feature) and then iteratively refines the colors until convergence. Convergence means that the subsets of nodes with the same colors cannot be further split into different color groups. In particular, at each iteration $\ell$, it aggregates the colors of nodes and their neighborhoods, which are represented as multisets, and hashes the aggregated results into unique new colors (i.e., injectively). Formally,

$$c_v^{(\ell)} \leftarrow \mathrm{HASH}\big(c_v^{(\ell-1)}, \{\!\{c_u^{(\ell-1)} \mid u \in N_v\}\!\}\big). \tag{3}$$

1-WL decides two graphs to be non-isomorphic once the colorings of these two graphs differ at some iteration. Instead of coloring each node, $k$-WL generalizes 1-WL by coloring each $k$-tuple of nodes and thus needs to refine the colors for $n^k$ tuples. The details of $k$-WL are provided in Algorithm 2, Appendix A.2. It is known that 1-WL is as powerful as 2-WL in terms of distinguishing non-isomorphic graphs (Cai et al., 1992; Grohe & Otto, 2015; Grohe, 2017). Moreover, for $k \geq 2$, $(k+1)$-WL is strictly more powerful than $k$-WL (Grohe & Otto, 2015). More details of the WL algorithms are given in Cai et al. (1992); Grohe (2017); Sato (2020); Morris et al. (2021).

Given the similarity between message passing GNNs and the 1-WL algorithm (i.e., Eq. (1) vs. Eq. (3)), message passing GNNs can be viewed as a differentiable neural version of 1-WL.
In fact, it has been shown that message passing GNNs are at most as powerful as 1-WL in distinguishing non-isomorphic graphs (Xu et al., 2019; Morris et al., 2019) . Further, Xu et al. (2019) proves that message passing GNNs can achieve the same expressiveness as 1-WL if the aggregate, update, and readout functions are injective, thereby developing the GIN model (Xu et al., 2019) . Thus, the expressive power of message passing GNNs is upper bounded by 1-WL. In other words, if two non-isomorphic graphs cannot be distinguished by 1-WL, then message passing GNNs must yield the same embedding for them. Importantly, such expressive power is not sufficient to distinguish some common graphs and cannot capture certain basic structural information such as triangles (Chen et al., 2020; Arvind et al., 2020) , which play significant roles in certain tasks, such as tasks over social networks. Several examples that cannot be distinguished by 1-WL or message passing GNNs are shown in Figure 1 (a) .
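To make the limitation of 1-WL concrete, the refinement rule in Eq. (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (`cycle`, `one_wl`), the shared `palette` dictionary that plays the role of the injective HASH, and the choice of test graphs (two disjoint triangles versus a 6-cycle, a standard pair that is not necessarily one of the pairs in Figure 1 (a)) are our assumptions rather than part of the original algorithm description.

```python
from collections import Counter


def cycle(n, offset=0):
    """Adjacency dict of an n-cycle on nodes offset, ..., offset + n - 1."""
    return {offset + i: {offset + (i - 1) % n, offset + (i + 1) % n}
            for i in range(n)}


def one_wl(adj, feats, palette, rounds):
    """Return the color histogram after each round of 1-WL refinement.

    `palette` maps every signature seen so far to a fresh integer, which acts
    as an injective HASH; sharing it across graphs keeps colors comparable.
    """
    colors = {v: palette.setdefault(("init", feats[v]), len(palette)) for v in adj}
    history = [Counter(colors.values())]
    for _ in range(rounds):
        refined = {}
        for v in adj:
            # Signature = (own color, sorted multiset of neighbor colors).
            signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            refined[v] = palette.setdefault(signature, len(palette))
        colors = refined
        history.append(Counter(colors.values()))
    return history


two_triangles = {**cycle(3), **cycle(3, offset=3)}
hexagon = cycle(6)
feats = {v: 0 for v in range(6)}  # identical (uninformative) node features

palette = {}
same = one_wl(two_triangles, feats, palette, 3) == one_wl(hexagon, feats, palette, 3)
print(same)  # True: 1-WL never separates these two non-isomorphic graphs
```

Both graphs are 2-regular with identical features, so every node receives the same signature at every round, and the color histograms coincide at all iterations.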

3. THE NC-1-WL ALGORITHM

In this section, we introduce the proposed NC-1-WL algorithm, which extends the 1-WL algorithm by taking the edges among neighbors into consideration. With this simple but non-trivial extension, NC-1-WL is proved to be strictly more powerful than 1-WL and less powerful than 3-WL, while preserving the efficiency of 1-WL.

Algorithm 1 NC-1-WL vs. 1-WL for graph isomorphism test
Input: Two graphs $G = (V, E, X)$ and $H = (P, F, Y)$
  $c_v^{(0)} \leftarrow \mathrm{HASH}(x_v), \forall v \in V$
  $d_p^{(0)} \leftarrow \mathrm{HASH}(y_p), \forall p \in P$
  repeat ($\ell = 1, 2, \cdots$)
    if $\{\!\{c_v^{(\ell-1)} \mid v \in V\}\!\} \neq \{\!\{d_p^{(\ell-1)} \mid p \in P\}\!\}$ then
      return $G \not\simeq H$
    end if
    for $v \in V$ do
      $c_v^{(\ell)} \leftarrow \mathrm{HASH}\big(c_v^{(\ell-1)}, \{\!\{c_u^{(\ell-1)} \mid u \in N_v\}\!\}, \underline{\{\!\{\{\!\{c_{u_1}^{(\ell-1)}, c_{u_2}^{(\ell-1)}\}\!\} \mid u_1, u_2 \in N_v, (u_1, u_2) \in E\}\!\}}\big)$
    end for
    for $p \in P$ do
      $d_p^{(\ell)} \leftarrow \mathrm{HASH}\big(d_p^{(\ell-1)}, \{\!\{d_q^{(\ell-1)} \mid q \in N_p\}\!\}, \underline{\{\!\{\{\!\{d_{q_1}^{(\ell-1)}, d_{q_2}^{(\ell-1)}\}\!\} \mid q_1, q_2 \in N_p, (q_1, q_2) \in F\}\!\}}\big)$
    end for
  until convergence
  return $G \simeq H$

As shown in Eq. (3) and Eq. (1), 1-WL and message passing GNNs consider the neighbors of each node as a multiset of representations. Here, we move one step forward by further treating edges among neighbors as a multiset, where each element is also a multiset corresponding to an edge. We formally define a multiset of multisets as follows.

Definition 3 (Multiset of multisets). A multiset of multisets, denoted by $W$, is a multiset where each element is also a multiset. In this work, we only need to consider the case where each element in $W$ is a multiset formed by 2 elements. Following our definition of multiset, if these 2 elements are generally drawn from a set $\mathcal{X}$, the universe of $W$ is the set $\mathcal{W} = \{\{\!\{w_1, w_2\}\!\} \mid w_1, w_2 \in \mathcal{X}\}$. We can formally represent $W = (S_W, m_W)$, where the underlying set $S_W \subseteq \mathcal{W}$ and $m_W : S_W \to \mathbb{Z}^+$ gives the multiplicity. Similarly, we have $W \subseteq \mathcal{W}$.

Particularly, our NC-1-WL models edges among neighbors as a multiset of multisets and extends 1-WL (i.e., Eq. (3)) to

$$c_v^{(\ell)} \leftarrow \mathrm{HASH}\Big(c_v^{(\ell-1)}, \{\!\{c_u^{(\ell-1)} \mid u \in N_v\}\!\}, \underbrace{\{\!\{\{\!\{c_{u_1}^{(\ell-1)}, c_{u_2}^{(\ell-1)}\}\!\} \mid u_1, u_2 \in N_v, (u_1, u_2) \in E\}\!\}}_{\text{A multiset of multisets}}\Big). \tag{4}$$

As in 1-WL, our NC-1-WL determines two graphs to be non-isomorphic as long as the colorings of these two graphs are different at some iteration. We summarize the overall process of NC-1-WL in Algorithm 1, where the difference with 1-WL is underlined. Importantly, our NC-1-WL is more powerful than 1-WL in distinguishing non-isomorphic graphs. Several examples that cannot be distinguished by 1-WL are shown in Figure 1 (a). Our NC-1-WL can distinguish them easily. An example of executions is demonstrated in Figure 1 (b). We rigorously characterize the expressiveness of NC-1-WL by the following theorems. The proofs are given in Appendix A.1 and A.2.

Theorem 1. NC-1-WL is strictly more powerful than 1-WL in distinguishing non-isomorphic graphs.

Theorem 2. NC-1-WL is strictly less powerful than 3-WL in distinguishing non-isomorphic graphs.

Although NC-1-WL is less powerful than 3-WL, it is much more efficient. 3-WL has to refine the color of each 3-tuple, resulting in $n^3$ refinement steps in each iteration. In contrast, like 1-WL, NC-1-WL only needs to color each node, which corresponds to $n$ refinement steps in each iteration. Thus, the advantage of our NC-1-WL lies in improving the expressiveness over 1-WL while being as efficient as 1-WL.

Note that our NC-1-WL differs from the concept of Subgraph-1-WL (Zhao et al., 2022), which ideally generalizes 1-WL from mapping the neighborhood to mapping the subgraph rooted at each node. Specifically, the refinement step in Subgraph-1-WL is $c_v^{(\ell)} \leftarrow \mathrm{HASH}\big(G[N_v^k]\big)$, where $G[N_v^k]$ is the subgraph induced by the $k$-hop neighbors of node $v$. However, it requires an injective hash function for subgraphs, which is essentially as hard as the graph isomorphism problem and cannot be achieved. In contrast, our NC-1-WL does not aim to injectively map the neighborhood subgraph.
Instead, we enhance 1-WL by mathematically modeling the edges among neighbors as a multiset of multisets. Then, injectively mapping such multiset of multisets in NC-1-WL is naturally satisfied.
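The refinement step of NC-1-WL in Eq. (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nc_1_wl`, the shared `palette` dictionary standing in for an injective HASH, and the test pair (two disjoint triangles versus a 6-cycle, which 1-WL cannot separate) are hypothetical choices made for the example. We also assume that each undirected edge among the neighbors of a node is counted once.

```python
from collections import Counter
from itertools import combinations


def cycle(n, offset=0):
    """Adjacency dict of an n-cycle on nodes offset, ..., offset + n - 1."""
    return {offset + i: {offset + (i - 1) % n, offset + (i + 1) % n}
            for i in range(n)}


def nc_1_wl(adj, feats, palette, rounds):
    """Return the color histogram after each round of NC-1-WL refinement."""
    colors = {v: palette.setdefault(("init", feats[v]), len(palette)) for v in adj}
    history = [Counter(colors.values())]
    for _ in range(rounds):
        refined = {}
        for v in adj:
            # Multiset of neighbor colors, as in 1-WL.
            neighbors = tuple(sorted(colors[u] for u in adj[v]))
            # Multiset of multisets: one 2-element multiset per edge among neighbors.
            communicating = tuple(sorted(
                tuple(sorted((colors[u1], colors[u2])))
                for u1, u2 in combinations(sorted(adj[v]), 2)
                if u2 in adj[u1]
            ))
            signature = (colors[v], neighbors, communicating)
            refined[v] = palette.setdefault(signature, len(palette))
        colors = refined
        history.append(Counter(colors.values()))
    return history


two_triangles = {**cycle(3), **cycle(3, offset=3)}
hexagon = cycle(6)
feats = {v: 0 for v in range(6)}

palette = {}
hist_tri = nc_1_wl(two_triangles, feats, palette, 2)
hist_hex = nc_1_wl(hexagon, feats, palette, 2)
print(hist_tri[1] != hist_hex[1])  # True: the graphs are separated after one round
```

In each triangle the two neighbors of every node are connected, so the third argument of the signature contains one element, whereas it is empty for every node of the 6-cycle. The colorings therefore differ after the first iteration.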

4. THE NC-GNN FRAMEWORK

In this section, we propose the NC-GNN framework as a differentiable neural version of NC-1-WL. Further, we establish an instance of NC-GNN that is provably as powerful as NC-1-WL in distinguishing non-isomorphic graphs. Differing from previous message passing GNNs as in Eq. (1), NC-GNN further considers the edges among neighbors, as NC-1-WL does. One layer of the NC-GNN framework can be formulated as

$$
\begin{aligned}
c_v^{(\ell)} &= f_{\text{communicate}}^{(\ell)}\Big(\{\!\{\{\!\{h_{u_1}^{(\ell-1)}, h_{u_2}^{(\ell-1)}\}\!\} \mid u_1, u_2 \in N_v, (u_1, u_2) \in E\}\!\}\Big), \\
a_v^{(\ell)} &= f_{\text{aggregate}}^{(\ell)}\Big(\{\!\{h_u^{(\ell-1)} \mid u \in N_v\}\!\}\Big), \\
h_v^{(\ell)} &= f_{\text{update}}^{(\ell)}\Big(h_v^{(\ell-1)}, a_v^{(\ell)}, c_v^{(\ell)}\Big).
\end{aligned} \tag{5}
$$

$f_{\text{communicate}}^{(\ell)}$ is the parameterized function operating on multisets of multisets. The following theorem establishes the conditions under which our NC-GNN can be as powerful as NC-1-WL.

Theorem 3. Let $M : \mathcal{G} \to \mathbb{R}^d$ be an NC-GNN model with a sufficient number of layers following Eq. (5). $M$ is as powerful as NC-1-WL in distinguishing non-isomorphic graphs if the following conditions hold: (1) At each layer $\ell$, $f_{\text{communicate}}^{(\ell)}$, $f_{\text{aggregate}}^{(\ell)}$, and $f_{\text{update}}^{(\ell)}$ are injective. (2) The final readout function $f_{\text{readout}}$ is injective.

The proof is provided in Appendix A.3. One may wonder what advantages NC-GNN has over NC-1-WL. Note that NC-1-WL only yields different colors to distinguish nodes according to their neighbors and edges among neighbors. These colors, however, do not represent any similarity information and are essentially one-hot encodings. In contrast, NC-GNN, a neural generalization of NC-1-WL, aims at representing nodes in the embedding space. Thus, an NC-GNN model satisfying Theorem 3 can not only distinguish nodes according to their neighbors and edges among neighbors, but also learn to map nodes with certain structural similarities to similar embeddings, based on the supervision from the task at hand. This follows the same philosophy as the relationship between message passing GNNs and 1-WL (Xu et al., 2019).
There could exist many ways to implement the communicate, aggregate, and update functions in the NC-GNN framework. Here, following the NC-GNN framework, we provide a simple architecture that provably satisfies Theorem 3 and thus has the same expressive power as NC-1-WL. To achieve this, we generalize the prior results of parameterizing universal functions over sets (Zaheer et al., 2017) and multisets (Xu et al., 2019) to consider both multisets and multisets of multisets. This nontrivial generalization is formalized in the following lemmas. The proofs are available in Appendix A.4 and A.5. As in Xu et al. (2019), we assume that the node feature space is countable.

Lemma 4. Assume $\mathcal{X}$ is countable. There exist two functions $f_1$ and $f_2$ so that

$$h(X, W) = \sum_{x \in X} f_1(x) + \sum_{\{\!\{w_1, w_2\}\!\} \in W} f_2\big(f_1(w_1) + f_1(w_2)\big)$$

is unique for any distinct pair of $(X, W)$, where $X \subseteq \mathcal{X}$ is a multiset with a bounded cardinality and $W \subseteq \mathcal{W} = \{\{\!\{w_1, w_2\}\!\} \mid w_1, w_2 \in \mathcal{X}\}$ is a multiset of multisets with a bounded cardinality. Moreover, any function $g$ on $(X, W)$ can be decomposed as $g(X, W) = \phi\big(\sum_{x \in X} f_1(x) + \sum_{\{\!\{w_1, w_2\}\!\} \in W} f_2(f_1(w_1) + f_1(w_2))\big)$ for some function $\phi$.

Lemma 5. Assume $\mathcal{X}$ is countable. There exist two functions $f_1$ and $f_2$ so that for infinitely many choices of $\epsilon$, including all irrational numbers,

$$h(c, X, W) = (1 + \epsilon) f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\!\{w_1, w_2\}\!\} \in W} f_2\big(f_1(w_1) + f_1(w_2)\big)$$

is unique for any distinct 3-tuple of $(c, X, W)$, where $c \in \mathcal{X}$, $X \subseteq \mathcal{X}$ is a multiset with a bounded cardinality, and $W \subseteq \mathcal{W} = \{\{\!\{w_1, w_2\}\!\} \mid w_1, w_2 \in \mathcal{X}\}$ is a multiset of multisets with a bounded cardinality. Moreover, any function $g$ on $(c, X, W)$ can be decomposed as $g(c, X, W) = \varphi\big((1 + \epsilon) f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\!\{w_1, w_2\}\!\} \in W} f_2(f_1(w_1) + f_1(w_2))\big)$ for some function $\varphi$.

Table 1. Comparison with existing expressive GNNs.

Model            Memory                 Time            Expressiveness     Scalability
GIN              O(n)                   O(nd)           1-WL               ✓
1-2-3-GNN        O(n^3)                 O(n^4)          1-WL ∼ 3-WL        -
PPGN             O(n^2)                 O(n^3)          3-WL               -
Nested GNN       O(ns)                  O(nds)          1-WL ∼ Unknown     -
NC-GNN (ours)    O(n + min(m, 3T))      O(n(d + t))     1-WL ∼ 3-WL        ✓

As in Xu et al. (2019), we can use multi-layer perceptrons (MLPs) to model and learn $f_1$, $f_2$, and $\varphi$ in Lemma 5, since MLPs are universal approximators (Hornik et al., 1989; Hornik, 1991). To be specific, we use one MLP to model the compositional function $f_1^{(\ell+1)} \circ \varphi^{(\ell)}$ and another MLP to model $f_2^{(\ell)}$ for $\ell = 1, 2, \cdots, L$. At the first layer, we do not need $f_1^{(1)}$ if the input features are one-hot encodings, since there exists a function $f_2^{(1)}$ that can preserve the injectivity (see Appendix A.5 for details). Overall, one layer of our architecture can be formulated as

$$h_v^{(\ell)} = \mathrm{MLP}_1^{(\ell)}\Big(\big(1 + \epsilon^{(\ell)}\big) h_v^{(\ell-1)} + \sum_{u \in N_v} h_u^{(\ell-1)} + \underbrace{\sum_{\substack{u_1, u_2 \in N_v \\ (u_1, u_2) \in E}} \mathrm{MLP}_2^{(\ell)}\big(h_{u_1}^{(\ell-1)} + h_{u_2}^{(\ell-1)}\big)}_{\text{The difference with GIN}}\Big), \tag{6}$$

where $\epsilon^{(\ell)}$ is a learnable scalar parameter. According to Lemma 5 and Theorem 3, this simple architecture, plus an injective readout function, has the same expressive power as NC-1-WL. Note that this architecture follows the GIN model (Xu et al., 2019) closely. The fundamental difference between our model and GIN is highlighted in Eq. (6), which is also our key contribution. Note that if no edges exist among the neighbors of any node in a graph, the third term in Eq. (6) will be zero for all nodes, and the model will reduce to the GIN model.

Complexity. Suppose a graph has $n$ nodes and $m$ edges. Message passing GNNs, such as GIN, require $O(n)$ memory and have $O(nd)$ time complexity, where $d$ is the maximum degree of nodes. For each node, we define #Message NC as the number of edges existing among neighbors of the node. An NC-GNN model as in Eq. (6) has $O(n(d + t))$ time complexity, where $t$ denotes the maximum #Message NC of nodes. Suppose there are $T$ triangles in total in the graph. In addition to the $n$ node representations, we need to further store $3T$ representations as the input of $\mathrm{MLP}_2$. If $3T > m$, we can alternatively store $(h_{u_1} + h_{u_2})$ for each edge $(u_1, u_2) \in E$. Thus, the memory complexity is $O(n + \min(m, 3T))$.
Hence, compared to message passing GNNs, our NC-GNN incurs a bounded increase in memory and preserves the linear time complexity up to a constant factor. We compare with several existing expressive GNNs in Table 1. Our NC-GNN has much better scalability. Further discussion of related work is given in Section 5.
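The layer in Eq. (6) and the message count behind the complexity analysis can be illustrated with a short NumPy sketch. It is a plain, unbatched forward pass written for readability, not the authors' implementation: the helper names (`init_mlp`, `mlp`, `neighbor_edges`, `nc_gnn_layer`), the two-layer ReLU MLPs, the random initialization, and the fixed (non-learned) `eps` are all assumptions made for illustration. Each undirected edge among the neighbors of a node is assumed to contribute one message.

```python
from itertools import combinations

import numpy as np


def init_mlp(rng, d_in, d_hidden, d_out):
    """Random parameters for a two-layer MLP (illustrative initialization)."""
    return (rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)), np.zeros(d_out))


def mlp(params, x):
    """Two-layer perceptron with a ReLU nonlinearity."""
    w1, b1, w2, b2 = params
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def neighbor_edges(adj, v):
    """Edges (u1, u2) whose endpoints are both neighbors of v, each listed once."""
    return [(u1, u2) for u1, u2 in combinations(sorted(adj[v]), 2) if u2 in adj[u1]]


def nc_gnn_layer(h, adj, mlp1, mlp2, eps):
    """One layer following Eq. (6); nodes are assumed to be labeled 0..n-1."""
    rows = []
    for v in range(len(adj)):
        # GIN part: (1 + eps) * h_v plus the sum over neighbors.
        z = (1.0 + eps) * h[v] + sum(h[u] for u in adj[v])
        # Extra term: one MLP_2 message per edge among the neighbors of v.
        for u1, u2 in neighbor_edges(adj, v):
            z = z + mlp(mlp2, h[u1] + h[u2])
        rows.append(mlp(mlp1, z))
    return np.stack(rows)


def cycle(n, offset=0):
    return {offset + i: {offset + (i - 1) % n, offset + (i + 1) % n}
            for i in range(n)}


two_triangles = {**cycle(3), **cycle(3, offset=3)}
hexagon = cycle(6)

rng = np.random.default_rng(0)
d = 4
mlp1, mlp2 = init_mlp(rng, d, 8, d), init_mlp(rng, d, 8, d)
h0 = np.ones((6, d))

out_tri = nc_gnn_layer(h0, two_triangles, mlp1, mlp2, eps=0.0)
out_hex = nc_gnn_layer(h0, hexagon, mlp1, mlp2, eps=0.0)

# Total number of extra messages: 3T for T triangles (here T = 2), 0 for the 6-cycle.
num_messages = sum(len(neighbor_edges(two_triangles, v)) for v in two_triangles)
print(out_tri.shape, num_messages)
```

For the 6-cycle no edges exist among neighbors, so the extra term vanishes and the layer coincides with a GIN-style update. A practical implementation would precompute the list of (node, edge) pairs once and use batched scatter operations instead of Python loops.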

5. RELATED WORK

The most straightforward idea to enhance the expressiveness of message passing GNNs is to mimic the k-WL (k ≥ 3) algorithms (Morris et al., 2019; 2020b; Maron et al., 2019; Chen et al., 2019). For example, Morris et al. (2019) propose 1-2-3-GNN according to the set-based 3-WL, which is more powerful than 1-WL and less powerful than the tuple-based 3-WL. [...] et al., 2018) and 2-FWL, which has the same power as 3-WL. Thereby, PPGN achieves the same power as 3-WL with O(n^2) memory and O(n^3) time complexity. Nonetheless, the computational and memory costs of these expressive models are still too high to scale to large graphs.

Another line of research for improving GNNs is to exploit subgraph information (Frasca et al., 2022). Bodnar et al. (2021b; a) perform message passing on high-order substructures of graphs, such as simplicial and cellular complexes. Their preprocessing and message passing steps are computationally expensive. Also, domain knowledge is usually required to predefine the substructure bank, while it is often unavailable in general tasks. GraphSNN (Wijesinghe & Wang, 2022) defines the overlaps between the subgraphs of each node and its neighbors, and then incorporates such overlap information into the message passing scheme by using handcrafted structural coefficients. ESAN (Bevilacqua et al., 2022) employs an equivariant framework to learn from a bag of subgraphs of the graph and further proposes a subgraph selection strategy to reduce the high computational cost. Feng et al. (2022) focus on formulating the K-hop message passing framework and analyzing its expressive power. In contrast, we focus on the edges among neighbors, leading to the powerful and efficient NC-1-WL and NC-GNN with a different proof of expressivity.
Due to the high memory and time complexity, most of the above methods are usually evaluated on graph-level tasks over small graphs, such as molecular graphs, and can be hardly applied to large graphs like social networks. Compared to these works, our approach differs fundamentally by proposing a general (i.e., without ad hoc features) and provably powerful GNN framework, while preserving the scalability in terms of computational time and memory requirement. We compare our NC-GNN with several existing expressive GNNs in Table 1 . There are several other heuristic methods proposed to strengthen GNNs by adding identity-aware information (Murphy et al., 2019; Vignac et al., 2020; You et al., 2021) , random features (Abboud et al., 2021; Dasoulas et al., 2021; Sato et al., 2021) , predefined structural features (Li et al., 2020; Bouritsas et al., 2022) to nodes, or randomly drop node (Papp et al., 2021) . Another direction is to improve GNNs in terms of the generalization ability (Puny et al., 2020) . These works improve GNNs from perspectives orthogonal to ours, and thus could be used as techniques to further augment our NC-GNN. In addition, PNA (Corso et al., 2020) applies multiple aggregators to enhance the GNN performance. Most recently, Morris et al. (2022) proposes a new hierarchy, which is more fine-grained than the WL hierarchy, for graph isomorphism problem. For deeper understanding of expressive GNNs, we recommend referring to the recent surveys (Sato, 2020; Morris et al., 2021; Jegelka, 2022) .

6. EXPERIMENTS

In this section, we perform extensive experiments to evaluate the effectiveness of the proposed NC-GNN model on real benchmarks. In particular, we consider widely used datasets from TUDatasets (Morris et al., 2020a), Open Graph Benchmark (OGB) (Hu et al., 2020), and GNN Benchmark (Dwivedi et al., 2020). These datasets are from various domains and cover different tasks over graphs, including graph classification and node classification. Thus, they can provide a comprehensive evaluation of our method. Note that certain datasets, such as REDDIT-BINARY and ogbg-molhiv, do not have many edges among neighbors (i.e., Avg. #Message NC < 0.2). In this case, our NC-GNN model will almost reduce to the GIN model and thus perform nearly the same as GIN. Hence, we omit such datasets in our results. All the datasets used and their statistics, including Avg. #Message NC, are summarized in Table 7, Appendix B.1.

Baselines. As shown in Eq. (6), the fundamental difference between our NC-GNN model and GIN is that we further model edges among neighbors. Hence, comparing to GIN can directly demonstrate the effectiveness of including such information in our NC-GNN, which is the core contribution of our theoretical result. Therefore, in the following experimental results, we highlight our results if they are better than GIN, and analyze the improvements over GIN. We also consider the WL subtree kernel (Shervashidze et al., 2011) [...] (Hamilton et al., 2017). In addition, we further include the following recent methods that improve GNN expressiveness as baselines. Specifically, SIN (Bodnar et al., 2021b), CIN (Bodnar et al., 2021a), GNN-AK (Zhao et al., 2022), and GraphSNN (Wijesinghe & Wang, 2022) improve the expressive power of GNNs by using subgraph information. RingGNN (Chen et al., 2019) and PPGN (Maron et al., 2019) are models based on 3-WL.

TUDatasets.
Following GIN (Xu et al., 2019) , we first conduct experiments on four graph classification datasets from TUDatasets (Morris et al., 2020a) ; those are IMDB-BINARY, IMDB-MULTI, COLLAB, and PROTEINS. Note that we omit other datasets used by GIN since they do not have many edges among neighbors. We employ the same number of layers as GIN. We report the 10-fold cross validation accuracy following the protocol as (Xu et al., 2019) for fair comparison. The results of baselines are directly obtained from the literature. As presented in Table 2 , we can observe that our NC-GNN outperforms GIN on all datasets consistently. Moreover, NC-GNN performs competitively with other methods aiming at improving the GNN expressiveness. The consistent improvements over GIN can show that modeling edges among neighbors in NC-GNN is practically effective. Notably, NC-GNN achieves an obvious improvement margin of 2.3 on COLLAB. This is intuitively reasonable since the Avg. #Message NC on COLLAB is much larger than other datasets, as provided in Table 7 , Appendix B.1. In this case, NC-GNN can use such informative edges existing among neighbors to boost the performance. Open Graph Benchmark. We also perform experiments on the large-scale dataset ogbg-ppa (Hu et al., 2020) , which has over 150K graphs and is known as a more convincing testbed. The graphs in ogbg-ppa are extracted from the protein-protein association networks of different species. It formulates a graph classification task and the data are split based on species. Differing from TUDatasets, the graphs in ogbg-ppa have edge features representing the type of protein-protein association. In order to apply NC-GNN to these graphs, we further define a variant of our NC-GNN by incorporating edge features into the NC-GNN framework, inspired by the GIN model with edge features introduced by Hu et al. (2019) . The details of the resulting model is given in Appendix B.3. 
Since we have one more MLP than GIN at each layer, one may wonder whether our improvements are brought by the larger number of learnable parameters instead of our claimed expressiveness. Thus, here we compare with GIN under the same parameter budget. Specifically, we use the same number of layers as GIN to ensure the same receptive field, and tune the hidden dimension to obtain an NC-GNN model that has a similar number of learnable parameters to GIN. Following Hu et al. (2020), we compare the best validation accuracy and the test accuracy at the best validation epoch. We also include the training accuracy at the best validation epoch for reference. Results over 10 random runs are reported. The results of GIN are obtained from the official benchmark leaderboard. As reported in Table 3, our NC-GNN consistently achieves better results in terms of validation accuracy and test accuracy. Specifically, our NC-GNN model outperforms GIN on the test set by an absolute margin of 3.02. Note that the only difference between NC-GNN and GIN is that edges among neighbors are modeled and considered in NC-GNN. Thus, the improvements over GIN demonstrate the practical effectiveness of incorporating such information. Therefore, combined with the previous experiments on TUDatasets, we can conclude that our NC-GNN not only has theoretically provable expressiveness, but also achieves good empirical performance on real-world tasks.

GNN Benchmark. In addition to graph classification tasks, we further experiment with node classification tasks on two datasets, PATTERN and CLUSTER, from GNN Benchmark (Dwivedi et al., 2020). PATTERN and CLUSTER respectively contain 14K and 12K graphs generated from Stochastic Block Models (Abbe, 2017), a widely used mathematical modeling method for studying communities in social networks. The task on these two datasets is to classify nodes in each graph.
To be specific, on PATTERN, the goal is to determine if a node belongs to specific predetermined subgraph patterns. On CLUSTER, we aim at categorizing each node to its belonging community. The details of the construction of these datasets are available in (Dwivedi et al., 2020) . We compare with typical message passing GNNs, including GIN, and two methods that mimic 3-WL; those are PPGN and RingGNN. To ensure fair comparison, we follow Dwivedi et al. ( 2020) to compare different methods under two budgets of learnable parameters, 100K and 500K, by tuning the number of layers and the hidden dimensions. Average results over 4 random runs are reported in Table 4 , where the results of baselines are obtained from (Dwivedi et al., 2020) . On each dataset, we also present the absolute improvement margin of our NC-GNN over GIN, denoted as ∆ ↑ , by comparing their corresponding best result. We observe that our NC-GNN obtains significant improvements over GIN. To be specific, NC-GNN remarkably outperforms GIN by an absolute margin of 1.142 and 12.002 on PATTERN and CLUSTER, respectively. This further strongly demonstrates the effectiveness of modeling the information of edges among neighbors, which aligns with our theoretical results. Notably, NC-GNN obtains outstanding performance on CLUSTER. Since the task on CLUSTER is to identify communities, we reasonably conjecture that considering which neighbors are connected is essential for inferring communities. Thus, we believe that our NC-GNN can be a strong basic method for tasks over social network graphs. Moreover, NC-GNN achieves much better empirical performance than RingGNN and PPGN, although they theoretically mimic 3-WL. It is observed that these 3-WL based methods are difficult to train and thus having fluctuating performance (Dwivedi et al., 2020) . In contrast, our NC-GNN is easier to train since it preserves the locality of message passing, thereby being more practically effective. Time analysis. 
In Table 5 , we compare the real training time of models with 100K learnable parameters on IMDB-B, PATTERN, and CLUSTER. We can observe that our NC-GNN is much more efficient than PPGN, since our NC-GNN preserves the linear time complexity w.r.t. number of nodes as GIN, according to the analysis in Table 1 . Compared to GIN, the increasement of the real running time of our NC-GNN depends on the number of edges among neighbors. For example, the time consumption of NC-GNN is similar to GIN on IMDB-B, since the Avg. #Message NC is considerably smaller than that in PATTERN and CLUSTER. Overall, our NC-GNN is shown to be more powerful than GIN theoretically and empirically, while maintaining the scalability with reasonable overhead. Thorough comparison with subgraph GNNs. We further perform a comprehensive empirical comparison with subgraph GNNs. Specifically, we compare to GIN-AK + (Zhao et al., 2022) , a representative method in the subgraph GNN family, on test accuracy, training time per epoch, total training time for convergence, GPU memory usage, MACS, and inference time. For each experiment, we run it four times and report the average results for test accuracy, training time per epoch, total time consumed to achieve the best epoch, and GPU memory consumption while keeping the same batch size. In order to compare the computational cost, FLOPS is commonly used as the number of floating operations for the model (Tan & Le, 2019) . Here we use a similar metric MACS to calculate the average multiply-accumulate operations for each graph in the test set. Note that each multiply-accumulate operation includes two float operations. The results are summarized in Table 6 . Our NC-GNN achieves competitive accuracies as GIN-AK + , while being more efficient in training, including training time per epoch and total training time. 
In addition, we use less GPU memory since, unlike GIN-AK+, we do not have to update node representations for all the nodes in the expanded subgraphs. More importantly, the MACS overhead of GIN-AK+ is 100x larger than that of our NC-GNN. Since our NC-GNN calculates each node representation from the original graph instead of the expanded subgraphs, it saves a large amount of MACS overhead during the forward pass. To further show the advantage of lower MACS overhead, we provide an inference time comparison, in which our NC-GNN takes less time. Overall, NC-GNN reaches a sweet spot between expressivity and scalability, from both theoretical and practical observations.

7. CONCLUSIONS AND OUTLOOKS

In this work, based on our proposed NC-1-WL, we present NC-GNN, a general, provably powerful, and scalable framework for graph representation learning. In addition to the theoretical expressiveness, we empirically demonstrate that NC-GNN achieves outstanding performance on various real benchmarks. Thus, we anticipate that NC-GNN will become an important base model for learning from graphs, especially social network graphs. To further improve the expressiveness of NC-GNN, in future work, we may consider how to effectively and efficiently model the interactions between two edges that exist among neighbors.

Algorithm 2 k-WL for graph isomorphism test
Input: Two graphs $G = (V, E, X)$ and $H = (P, F, Y)$
$c^{(0)}_{v} \leftarrow \mathrm{HASH}(G[v]), \ \forall v \in V^k$
$d^{(0)}_{p} \leftarrow \mathrm{HASH}(H[p]), \ \forall p \in P^k$
repeat ($\ell = 1, 2, \cdots$)
    if $\{\{c^{(\ell-1)}_{v} \mid v \in V^k\}\} \neq \{\{d^{(\ell-1)}_{p} \mid p \in P^k\}\}$ then
        return $G \not\simeq H$
    end if
    for $v \in V^k$ do
        $c^{(\ell)}_{v,i} = \{\{c^{(\ell-1)}_{u} \mid u \in N_{v,i}\}\}$, for $i = 1, 2, \cdots, k$
        $c^{(\ell)}_{v} \leftarrow \mathrm{HASH}\big(c^{(\ell-1)}_{v}, c^{(\ell)}_{v,1}, c^{(\ell)}_{v,2}, \cdots, c^{(\ell)}_{v,k}\big)$
    end for
    for $p \in P^k$ do
        $d^{(\ell)}_{p,i} = \{\{d^{(\ell-1)}_{q} \mid q \in N_{p,i}\}\}$, for $i = 1, 2, \cdots, k$
        $d^{(\ell)}_{p} \leftarrow \mathrm{HASH}\big(d^{(\ell-1)}_{p}, d^{(\ell)}_{p,1}, d^{(\ell)}_{p,2}, \cdots, d^{(\ell)}_{p,k}\big)$

Now, we consider performing NC-1-WL (Algorithm 1) on these two graphs $G = (V, E, X)$ and $H = (P, F, Y)$ to color each node $v \in V$ and $p \in P$. We have the same injective mapping $g : V \to P$ as above. Based on (a), we have $x_v = y_{g(v)}, \forall v \in V$, which indicates $c^{(0)}_v = d^{(0)}_{g(v)}, \forall v \in V$ in the NC-1-WL coloring process. Similarly, according to (b) and (c), we have $\{\{c^{(0)}_u \mid u \in N_v\}\} = \{\{d^{(0)}_q \mid q \in N_{g(v)}\}\}, \forall v \in V$ and $\{\{\{\{c^{(0)}_{u_1}, c^{(0)}_{u_2}\}\} \mid u_1, u_2 \in N_v, (u_1, u_2) \in E\}\} = \{\{\{\{d^{(0)}_{q_1}, d^{(0)}_{q_2}\}\} \mid q_1, q_2 \in N_{g(v)}, (q_1, q_2) \in F\}\}, \forall v \in V$, respectively. Therefore, with initial colors satisfying such conditions, NC-1-WL cannot distinguish $G$ and $H$ since $c^{(\ell-1)}_v = d^{(\ell-1)}_{g(v)}, \forall v \in V$ always holds for $\ell = 1, 2, \cdots$.
In other words, $\{\{c^{(\ell-1)}_v \mid v \in V\}\} = \{\{d^{(\ell-1)}_p \mid p \in P\}\}$ always holds no matter how many iterations (i.e., $\ell$) we apply. (2) In Figure 2, we provide two non-isomorphic graphs that can be distinguished by 3-WL but cannot be distinguished by NC-1-WL. For these two graphs, our NC-1-WL reduces to 1-WL since no two neighbors of any node are communicating.
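The relationship between 1-WL and NC-1-WL discussed in these proofs can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the update rule is assumed from the multisets used in the proof above (neighbor colors plus the colors of connected neighbor pairs), the helper names `refine`, `distinguishable`, and `cycle` are hypothetical, and the example graphs are standard textbook cases that are not necessarily the ones drawn in Figures 1 and 2.

```python
from collections import Counter
from itertools import combinations


def refine(adj, use_nc, palette, rounds):
    """Color refinement on one graph; returns the multiset of final colors.

    adj     : dict mapping each node to the set of its neighbors (undirected).
    use_nc  : if True, also hash the colors of connected neighbor pairs
              (assumed NC-1-WL update); if False, this is plain 1-WL.
    palette : dict shared between graphs, playing the role of an injective HASH.
    """
    colors = {v: 0 for v in adj}  # no initial features: every node starts equal
    for _ in range(rounds):
        updated = {}
        for v, nbrs in adj.items():
            neigh = tuple(sorted(colors[u] for u in nbrs))
            pairs = ()
            if use_nc:
                # one entry per unordered pair of neighbors that share an edge
                pairs = tuple(sorted(
                    tuple(sorted((colors[a], colors[b])))
                    for a, b in combinations(nbrs, 2) if b in adj[a]
                ))
            signature = (colors[v], neigh, pairs)
            updated[v] = palette.setdefault(signature, len(palette) + 1)
        colors = updated
    return Counter(colors.values())


def distinguishable(adj_g, adj_h, use_nc):
    """True if the two color multisets differ after refinement."""
    palette = {}
    rounds = max(len(adj_g), len(adj_h))
    return (refine(adj_g, use_nc, palette, rounds)
            != refine(adj_h, use_nc, palette, rounds))


def cycle(n, offset=0):
    """Adjacency dict of an n-cycle on nodes offset, ..., offset + n - 1."""
    return {offset + i: {offset + (i - 1) % n, offset + (i + 1) % n}
            for i in range(n)}


two_triangles = {**cycle(3), **cycle(3, offset=3)}
hexagon = cycle(6)
two_squares = {**cycle(4), **cycle(4, offset=4)}
octagon = cycle(8)

print(distinguishable(two_triangles, hexagon, use_nc=False))  # 1-WL
print(distinguishable(two_triangles, hexagon, use_nc=True))   # NC-1-WL
print(distinguishable(two_squares, octagon, use_nc=True))     # no triangles
```

In the two-triangles case every node has a connected pair of neighbors, so the extra multiset separates it from the hexagon. In the two-squares case no neighbors are connected, so the extra multiset is always empty and the refinement coincides with 1-WL.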

A.3 PROOF OF THEOREM 3

Theorem 3. Let $M : \mathcal{G} \to \mathbb{R}^d$ be an NC-GNN model with a sufficient number of layers following Eq. (5). $M$ is as powerful as NC-1-WL in distinguishing non-isomorphic graphs if the following conditions hold: (1) At each layer $\ell$, $f^{(\ell)}_{\text{communicate}}$, $f^{(\ell)}_{\text{aggregate}}$, and $f^{(\ell)}_{\text{update}}$ are injective. (2) The final readout function $f_{\text{readout}}$ is injective.

Fact 1. Assume $\mathcal{X}$ is countable. $h(X) = \sum_{x \in X} N^{-Z(x)}$ is unique for any multiset $X \subseteq \mathcal{X}$ of bounded cardinality, where the mapping $Z : \mathcal{X} \to \mathbb{N}$ is an injection from $x \in \mathcal{X}$ to natural numbers and $N \in \mathbb{N}$ satisfies $N > |X|$ for all $X$.

To prove the correctness of this fact, we show that $X$ can be uniquely obtained from the value of $h(X)$. Following the notations in our main text, we formally denote $X = (S_X, m_X)$, where $S_X$ is the underlying set of $X$ and $m_X : S_X \to \mathbb{Z}^+$ gives the multiplicity of the elements in $S_X$. Hence, we need to uniquely determine the elements in $S_X$ and their corresponding multiplicities, using the value of $h(X)$. Let $\{x_1, x_2, \cdots, x_n\}$ denote the countable set $\mathcal{X}$ ($n$ may be infinite). Without loss of generality, we assume $Z$ maps $x_1 \to 0$, $x_2 \to 1$, etc. Then we can compute $(q, r) = h(X) \operatorname{divmod} N^{-0}$, where $q$ is the quotient and $r$ is the remainder. If $q = 0$, we can conclude that $x_1$ is not in $S_X$. If $q > 0$, then $x_1$ is in $S_X$ and $q$ gives the multiplicity of $x_1$. Afterwards, we use the remainder $r$ to replace $h(X)$ and compute $(q, r) = h(X) \operatorname{divmod} N^{-1}$, and the results can be used to infer whether $x_2$ is in $S_X$ and, if so, its multiplicity. We can do this recursively until $r = 0$. Note that $X$ has a bounded cardinality and $N \in \mathbb{N}$ satisfies $N > |X|$ for all $X$. Otherwise, Fact 1 will not hold.

Here we provide an example to show the correctness of Fact 1. Let a multiset $X = \{\{x_1, x_3, x_3\}\}$ and let $Z$ injectively map the elements in $X$ to natural numbers, thus obtaining a multiset $\{\{0, 2, 2\}\}$. Let $N = 4$. We have $h(X) = \sum_{x \in X} N^{-Z(x)} = 4^{-0} + 4^{-2} + 4^{-2} = 9/8$.
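The encoding in Fact 1 and the recursive divmod decoding described above can be checked with exact rational arithmetic. The following is a small sketch under stated assumptions: the function names `encode` and `decode` are hypothetical, the countable set is truncated to three symbols, and `Z` is given explicitly as a Python dict.

```python
from collections import Counter
from fractions import Fraction


def encode(multiset, Z, N):
    """h(X) = sum over x in X of N^(-Z(x)), computed exactly."""
    return sum(Fraction(1, N ** Z[x]) for x in multiset)


def decode(value, Z, N):
    """Recover the multiset from h(X) by repeated divmod, as in the proof."""
    inverse = {index: x for x, index in Z.items()}
    counts = Counter()
    k = 0
    while value != 0:
        # quotient = multiplicity of the element mapped to k, remainder = rest
        q, value = divmod(value, Fraction(1, N ** k))
        if q:
            counts[inverse[k]] = int(q)
        k += 1
    return counts


Z = {"x1": 0, "x2": 1, "x3": 2}  # injection into the natural numbers
X = ["x1", "x3", "x3"]           # the multiset {{x1, x3, x3}}
N = 4                            # must exceed |X| = 3

h = encode(X, Z, N)
print(h)                # 9/8
print(decode(h, Z, N))  # one x1 and two x3
```

Using `Fraction` rather than floating point keeps the quotients and remainders exact, which matters because the argument relies on the remainder reaching exactly 0.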
Next, following our description above, we show how we can infer $X$ from the value of $h(X)$. First, according to $9/8 \operatorname{divmod} 4^{-0} = (1, 1/8)$, we can conclude that there is one $x_1$ in $X$. Then, with $1/8 \operatorname{divmod} 4^{-1} = (0, 1/8)$, we can infer that $x_2$ is not in $X$. Finally, we have $1/8 \operatorname{divmod} 4^{-2} = (2, 0)$, which indicates that there are two $x_3$ in $X$. We can stop the process since the remainder reaches 0.

Let us go back to the proof of Lemma 4. Since $\mathcal{X}$ is countable, $\mathcal{W} = \{\{\{w_1, w_2\}\} \mid w_1, w_2 \in \mathcal{X}\}$ is also countable. Because both $X$ and $W$ have bounded cardinalities, we can find a number $N \in \mathbb{N}$ such that $N > \max(|X| + |W|, 2)$ for all $(X, W)$ pairs. Let $Z_1 : \mathcal{X} \to \mathbb{N}_{\text{odd}}$ be an injection from $x \in \mathcal{X}$ to odd natural numbers. We consider $f_1(x) = N^{-Z_1(x)}$. For ease of notation, we let $\psi(\{\{w_1, w_2\}\}) = f_1(w_1) + f_1(w_2)$. According to Fact 1, $\psi(\{\{w_1, w_2\}\})$ is unique for any $\{\{w_1, w_2\}\} \in \mathcal{W}$. We define the set $\mathcal{Y} = \{\psi(\{\{w_1, w_2\}\}) \mid w_1, w_2 \in \mathcal{X}\}$. Thus, $\mathcal{Y}$ is also countable, as is $\mathcal{W}$. We let $Z_2 : \mathcal{Y} \to \mathbb{N}_{\text{even}}$ be an injection from $y \in \mathcal{Y}$ to even natural numbers and $f_2(y) = N^{-Z_2(y)}$. Then, the resulting $h(X, W) = \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))$ is an injective function on $(X, W)$. In other words, we can uniquely determine $(X, W)$ from the value of $h(X, W)$. To be specific, from the value of $h(X, W)$, we can infer the histograms of natural numbers as we show in Fact 1. Then, we can uniquely obtain $X$ (i.e., its underlying set $S_X$ and multiplicities) based on the odd natural numbers. According to the even natural numbers, we can infer $\{\{\psi(\{\{w_1, w_2\}\}) \mid \{\{w_1, w_2\}\} \in W\}\}$. Further, since $\psi(\{\{w_1, w_2\}\})$ is injective, we can uniquely obtain $W$.

For any function $g$ on $(X, W)$, we can construct a function $\phi$ such that $\phi(h(X, W)) = g(X, W)$. This is always achievable since $h(X, W)$ is injective.

A.5 PROOF OF LEMMA 5

Lemma 5. Assume $\mathcal{X}$ is countable.
There exist two functions $f_1$ and $f_2$ so that for infinitely many choices of $\epsilon$, including all irrational numbers, $h(c, X, W) = (1 + \epsilon) f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))$ is unique for any distinct 3-tuple of $(c, X, W)$, where $c \in \mathcal{X}$, $X \subseteq \mathcal{X}$ is a multiset with a bounded cardinality, and $W \subseteq \mathcal{W} = \{\{\{w_1, w_2\}\} \mid w_1, w_2 \in \mathcal{X}\}$ is a multiset of multisets with a bounded cardinality. Moreover, any function $g$ on $(c, X, W)$ can be decomposed as $g(c, X, W) = \varphi\big((1 + \epsilon) f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))\big)$ for some function $\varphi$.

Proof. We consider the same functions $f_1(x) = N^{-Z_1(x)}$ and $f_2(y) = N^{-Z_2(y)}$ as in our proof for Lemma 4. We prove this lemma by showing that, if $\epsilon$ is an irrational number, $h(c, X, W) \neq h(c', X', W')$ holds for any $(c, X, W) \neq (c', X', W')$. We need to consider the following two cases.

(1) If $c = c'$ but $(X, W) \neq (X', W')$, according to Lemma 4, we have $\sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2)) \neq \sum_{x \in X'} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W'} f_2(f_1(w_1) + f_1(w_2))$. Thus, we can obtain $h(c, X, W) \neq h(c', X', W')$.

(2) If $c \neq c'$, we show $h(c, X, W) \neq h(c', X', W')$ by contradiction. Assuming $h(c, X, W) = h(c', X', W')$, we have
$(1 + \epsilon) f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2)) = (1 + \epsilon) f_1(c') + \sum_{x \in X'} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W'} f_2(f_1(w_1) + f_1(w_2))$.
This can be rewritten as
$\epsilon \big(f_1(c) - f_1(c')\big) = \Big(f_1(c') + \sum_{x \in X'} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W'} f_2(f_1(w_1) + f_1(w_2))\Big) - \Big(f_1(c) + \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))\Big)$. (12)
Since $f_1(c) - f_1(c') \neq 0$ and it is rational, given that $\epsilon$ is irrational, we can conclude that the L.H.S. of Eq. (12) is irrational. The R.H.S. of Eq. (12), however, is rational. This is a contradiction. Thus, if $c \neq c'$, we have $h(c, X, W) \neq h(c', X', W')$.

For any function $g$ on $(c, X, W)$, we can construct a function $\varphi$ such that $\varphi(h(c, X, W)) = g(c, X, W)$.
This is always achievable since $h(c, X, W)$ is injective.

Further justification for the first layer. If the input features $x \in \mathcal{X}$ are one-hot encodings, $f_1$ is not necessary and thus can be removed. In other words, we can show as follows that there exists an $f_2$ such that $h'(c, X, W) = (1 + \epsilon) c + \sum_{x \in X} x + \sum_{\{\{w_1, w_2\}\} \in W} f_2(w_1 + w_2)$ is unique for any distinct 3-tuple of $(c, X, W)$. Note that $\sum_{x \in X} x$ is injective if the input features are one-hot encodings, and the value of $\sum_{x \in X} x$ must be composed of integers. In addition, $\psi'(\{\{w_1, w_2\}\}) = w_1 + w_2$ is also injective. Similarly, we define the set $\mathcal{Y}' = \{\psi'(\{\{w_1, w_2\}\}) \mid w_1, w_2 \in \mathcal{X}\}$. We let $Z_2 : \mathcal{Y}' \to \mathbb{N}$ be an injection from $y \in \mathcal{Y}'$ to natural numbers and $f_2(y) = N^{-Z_2(y)}$, where $N > |W|$ for all $W$. Then $h'(c, X, W) = (1 + \epsilon) c + \sum_{x \in X} x + \sum_{\{\{w_1, w_2\}\} \in W} f_2(w_1 + w_2)$ is injective, since $\sum_{\{\{w_1, w_2\}\} \in W} f_2(w_1 + w_2)$ is unique for any $W$ and is a number in $(0, 1)$, thus differing from the integer-valued $\sum_{x \in X} x$. This is why we do not need another MLP to model f

B.2 MODEL CONFIGURATIONS AND TRAINING HYPERPARAMETERS

For efficiency, we do not tune the model configurations and training hyperparameters for NC-GNN extensively. Since our NC-GNN model is a natural extension of GIN, we usually use the model configurations and tuned hyperparameters of GIN from the community as the starting point for NC-GNN and then adjust them slightly according to the validation results. For the model architecture, we tune the following configurations: (1) the number of layers, (2) the number of hidden dimensions, (3) whether to use the jumping knowledge (JK) technique, and (4) whether to use residual connections. To ensure a fair comparison, we only consider employing techniques (3) and (4) on the datasets where the baseline GIN model also uses them. In terms of training, we tune the following hyperparameters: (1) the initial learning rate, (2) the step size of learning rate decay, (3) the multiplicative factor of learning rate decay, (4) the batch size, (5) the dropout rate, and (6) the total number of epochs. The selected model configurations and training hyperparameters for all datasets are summarized in Table 8. For each dataset from GNN Benchmark, we have several NC-GNN models under different parameter budgets, as described in Section 6. Accordingly, we list the configurations and hyperparameters for all of these models for reproducibility.

The layer-wise formulation of our NC-GNN model with edge features is
$h^{(\ell)}_v = \mathrm{MLP}^{(\ell)}_1\Big(\big(1 + \epsilon^{(\ell)}\big) h^{(\ell-1)}_v + \sum_{u \in N_v} \mathrm{ReLU}\big(h^{(\ell-1)}_u + e_{uv}\big) + \sum_{u_1, u_2 \in N_v,\ (u_1, u_2) \in E} \mathrm{MLP}^{(\ell)}_2\big(h^{(\ell-1)}_{u_1} + h^{(\ell-1)}_{u_2} + e_{u_1 u_2}\big)\Big)$,
which is a natural extension of Eq. (6) by including edge features. Here, $e_{uv}$ is the edge feature associated with edge $(u, v)$. In practice, we usually apply an embedding layer to the input edge features such that they have the same dimension as the node representations. For reference, the GIN model with edge features (Hu et al., 2019) can be formulated as
$h^{(\ell)}_v = \mathrm{MLP}^{(\ell)}_1\Big(\big(1 + \epsilon^{(\ell)}\big) h^{(\ell-1)}_v + \sum_{u \in N_v} \mathrm{ReLU}\big(h^{(\ell-1)}_u + e_{uv}\big)\Big)$.
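To make the aggregation in the layer-wise formulation above concrete, here is a minimal pure-Python sketch of one NC-GNN layer with edge features. It is an illustration under stated assumptions, not the PyG-based implementation used in the experiments: the function name `nc_gnn_layer` is hypothetical, the two MLPs are passed in as plain callables (identity by default), edge features are looked up by unordered node pair, and each connected pair of neighbors is counted once.

```python
from itertools import combinations


def relu(x):
    return [max(0.0, a) for a in x]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def nc_gnn_layer(h, adj, e, eps=0.0, mlp1=lambda x: x, mlp2=lambda x: x):
    """One layer: h_v <- MLP1((1+eps) h_v + sum_u ReLU(h_u + e_uv)
                              + sum_{connected u1,u2 in N_v} MLP2(h_u1 + h_u2 + e_u1u2)).

    h   : dict node -> feature vector (list of floats)
    adj : dict node -> set of neighbors (undirected graph)
    e   : dict frozenset({u, v}) -> edge feature vector, same size as h
    """
    out = {}
    for v, nbrs in adj.items():
        total = [(1.0 + eps) * a for a in h[v]]
        # standard message passing term, as in GIN with edge features
        for u in nbrs:
            total = add(total, relu(add(h[u], e[frozenset((u, v))])))
        # neighbor-communication term: pairs of neighbors that share an edge
        for u1, u2 in combinations(sorted(nbrs), 2):
            if u2 in adj[u1]:
                pair = add(h[u1], h[u2], e[frozenset((u1, u2))])
                total = add(total, mlp2(pair))
        out[v] = mlp1(total)
    return out


# A triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
h = {v: [1.0] for v in adj}
e = {frozenset((u, v)): [0.0] for u in adj for v in adj[u]}

print(nc_gnn_layer(h, adj, e))
```

With identity MLPs, unit node features, and zero edge features, node 3 receives only the GIN-style term, while nodes 0, 1, and 2 each receive one additional contribution from the connected pair among their neighbors. Dropping the pair loop recovers the GIN update given for reference.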



If there are no initial features or labels, 1-WL assigns the same color to all the nodes in the graph. There are two families of WL algorithms: k-WL and k-FWL (Folklore WL). Both color k-tuples, and they differ in how colors are aggregated from neighboring k-tuples. It is known that (k-1)-FWL is as powerful as k-WL for k ≥ 3 (Grohe & Otto, 2015; Grohe, 2017; Maron et al., 2019). To avoid ambiguity, in this work, we only involve k-WL. Two graphs with different numbers of nodes can be easily distinguished by comparing the multisets of node colors, given that the cardinalities of the two multisets are different.



Figure 1: (a) Several example pairs of non-isomorphic graphs, partially adapted from Sato (2020), that cannot be distinguished by 1-WL. Colors represent initial node labels or features. Our NC-1-WL can distinguish them. (b) A comparison between the executions of 1-WL and NC-1-WL on two non-isomorphic graphs.

Figure 2: Two graphs, adapted from Sato (2020), that cannot be distinguished by NC-1-WL but can be distinguished by 3-WL.

Comparison of expressive GNNs. d is the maximum degree of nodes. T is the number of triangles in the graph. t is the maximum #Message NC of nodes. s is the maximum number of nodes in the neighborhood subgraphs, which grows exponentially with the subgraph depth. It is unknown how the expressiveness upper bound of Nested GNN compares to 3-WL

It requires $O(n^3)$ memory since representations corresponding to all sets of 3 nodes need to be stored. Moreover, without considering the sparsity of the graph, it has O(n

Zeng et al. (2021); Sandfelder et al. (2021); Zhang & Li (2021); Zhao et al. (2022) apply GNNs to the neighborhood subgraph of each node. For example, Nested GNN (Zhang & Li, 2021) first applies a base GNN to encode the neighborhood subgraph information of each node and then employs another GNN on the subgraph-encoded representations. Since message passing is performed on n neighborhood subgraphs, the memory complexity is O(ns) and the time complexity is O(nds), where s is the maximum number of nodes in a neighborhood subgraph. Note that s grows exponentially with the depth of the neighborhood subgraph, thus limiting the scalability. The recently proposed KP-GNN

Appendix B.1. Our implementation is based on the PyG library (Fey & Lenssen, 2019). The detailed model configurations and training hyperparameters of NC-GNN on each dataset are summarized in Table 8, Appendix B.2.

Results (%) on TUDatasets. The top three results on each dataset are highlighted as first, second, and third. We also highlight the cells of NC-GNN results if they are better than GIN.

Results (%) on ogbg-ppa. All models in this table consider edge features. We highlight the cells of NC-GNN results if they are better than GIN.

Results (%) on GNN Benchmark. The top three results on each dataset are highlighted as first, second, and third. We also highlight the cells of NC-GNN results if they are better than GIN.

Comparison of real training time.

The thorough comparison between NC-GNN and GIN-AK+ on PATTERN and CLUSTER. Columns: Time/epoch ↓, Total Time ↓, GPU Memory ↓, MACS ↓, Inference Time ↓.

Dataset Statistics. Avg. #Message NC denotes the average #Message NC per node.

The selected model configurations and training hyperparameters of NC-GNN on all datasets.

A PROOFS OF THEOREMS AND LEMMAS

A.1 PROOF OF THEOREM 1

Theorem 1. NC-1-WL is strictly more powerful than 1-WL in distinguishing non-isomorphic graphs.

Proof. To prove that NC-1-WL is strictly more powerful than 1-WL, we prove the correctness of the following two statements. (1) If two graphs are determined to be isomorphic by NC-1-WL, then they must be indistinguishable by 1-WL as well. (2) There exist at least two non-isomorphic graphs that cannot be distinguished by 1-WL but can be distinguished by NC-1-WL.

(1) Assume two graphs G = (V, E, X) and H = (P, F, Y) cannot be distinguished by NC-1-WL. Then, according to Algorithm 1, at any iteration ℓ = 1, 2, ···, we have c|q ∈ N p }} |p ∈ P . This indicates that 1-WL cannot distinguish graph G and graph H either.

(2) In Figure 1 (a), we provide several example pairs to show that there exist non-isomorphic graphs that can be distinguished by NC-1-WL but cannot be distinguished by 1-WL.

A.2 PROOF OF THEOREM 2

Theorem 2. NC-1-WL is strictly less powerful than 3-WL in distinguishing non-isomorphic graphs.

Proof. To prove that NC-1-WL is strictly less powerful than 3-WL, we prove the correctness of the following two statements. (1) If two graphs are determined to be isomorphic by 3-WL, then they must be indistinguishable by NC-1-WL as well. (2) There exist at least two non-isomorphic graphs that cannot be distinguished by NC-1-WL but can be distinguished by 3-WL.

We first describe the details of k-WL in Algorithm 2, following Sato (2020). k-WL aims at coloring each k-tuple of nodes, denoted as

The initial color of each tuple v is determined by the isomorphism type of the subgraph induced by the tuple, i.e., G[v] (see Sato (2020) for details). Note that the nodes in G[v] are ordered based on their orders in the tuple v. Thus, HASH(G

(1) Assume two graphs G = (V, E, X) and H = (P, F, Y) are determined to be isomorphic by 3-WL. G and H have the same number of nodes³, denoted as n. Then, according to Algorithm 2, we have {{cp |p ∈ P k }}. There always exists an injective mapping g :

Here, we directly apply g to a tuple v for ease of notation, which means g

Without loss of generality, we assume g maps $v_j$ to $p_j$ for $j = 1, 2, \cdots, n$. Then, we can obtain the following results. , we have $x_{v_r} = y_{p_r}$. We can also have

Proof. We prove the theorem by showing that an NC-GNN model that satisfies the conditions can yield different embeddings for any two graphs that are determined to be non-isomorphic by NC-1-WL. We denote such a model as $M$. Assume two graphs $G_1 = (V_1, E_1, X_1)$ and $G_2 = (V_2, E_2, X_2)$ are determined to be non-isomorphic by NC-1-WL at iteration $L$. Given that $f_{\text{readout}}$ of $M$ can injectively map different multisets of node features into different embeddings, we only need to demonstrate that $M$, with a sufficient number of layers, can map $G_1$ and $G_2$ to different multisets of node features.

To achieve this, following Xu et al. (2019), we show that, for any iteration $\ell$, there always exists an injective function $\varphi$ such that $h^{(\ell)}_v = \varphi(c^{(\ell)}_v)$, where $h^{(\ell)}_v$ is the node representation given by the model $M$ and $c^{(\ell)}_v$ is the color produced by NC-1-WL. We will show this by induction. Note that here $v$ represents a general node that can be a node in $G_1$ or $G_2$.

Let $\phi$ denote the injective hash function used in NC-1-WL. For $\ell = 0$, we have c

Thus, $\varphi$ could be $\phi^{-1}$ for $\ell = 0$. Suppose there exists an injective function $\varphi$ such that $h^{(\ell-1)}_v = \varphi(c^{(\ell-1)}_v)$; we show that there also exists such an injective function for iteration $\ell$. According to Eq. (5), we have

According to h), we further have

where ℓ) , and $\varphi$ are all injective functions. Since the composition of injective functions is also injective, there must exist some injective function $\psi$ such that

Then, we can obtain

Therefore, it is proved that for any iteration $\ell$, there always exists an injective function $\varphi$ such that $h^{(\ell)}_v = \varphi(c^{(\ell)}_v)$.

As proved above, we also have {{h

and $\varphi$ is injective. Hence, the multisets of node features produced by $M$ for $G_1$ and $G_2$ are also different, i.e., {{h

A.4 PROOF OF LEMMA 4

Lemma 4. Assume $\mathcal{X}$ is countable. There exist two functions $f_1$ and $f_2$ so that $h(X, W) = \sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))$ is unique for any distinct pair of $(X, W)$, where $X \subseteq \mathcal{X}$ is a multiset with a bounded cardinality and $W \subseteq \mathcal{W} = \{\{\{w_1, w_2\}\} \mid w_1, w_2 \in \mathcal{X}\}$ is a multiset of multisets with a bounded cardinality. Moreover, any function $g$ on $(X, W)$ can be decomposed as $g(X, W) = \phi\big(\sum_{x \in X} f_1(x) + \sum_{\{\{w_1, w_2\}\} \in W} f_2(f_1(w_1) + f_1(w_2))\big)$ for some function $\phi$.

Proof. To prove this lemma, we need the following fact, which is also used by Xu et al. (2019).

