N -WL: A NEW HIERARCHY OF EXPRESSIVITY FOR GRAPH NEURAL NETWORKS

Abstract

The expressive power of Graph Neural Networks (GNNs) is fundamental for understanding their capabilities and limitations, i.e., what graph properties can or cannot be learnt by a GNN. Since standard GNNs have been shown to be upper-bounded by the Weisfeiler-Lehman (1-WL) algorithm, recent attempts have concentrated on developing more expressive GNNs in terms of the k-WL hierarchy, a well-established framework for graph isomorphism tests. In this work we show that, contrary to the widely accepted view, the k-WL hierarchy is not well-suited for measuring expressive GNNs. This is due to limitations inherent to high-dimensional WL algorithms, such as the lack of a natural interpretation and high computational costs, which make it difficult to draw firm conclusions about the expressive power of GNNs beyond 1-WL. Thus, we propose a novel hierarchy of graph isomorphism tests, namely Neighbourhood WL (N-WL), and also establish a new theorem on the equivalence of expressivity between induced connected subgraphs and induced subgraphs within this hierarchy. Further, we design a GNN model upon N-WL, Graph Neighbourhood Neural Network (G3N), and empirically verify its expressive power on synthetic and real-world benchmarks.

1. INTRODUCTION

Graph-theoretic algorithms are a powerful source of inspiration for Graph Neural Networks (GNNs). The best-known result is that the expressive power of standard GNNs is upper-bounded by the Weisfeiler-Lehman (1-WL) algorithm (Weisfeiler & Leman, 1968; Xu et al., 2019; Morris et al., 2019). In pursuit of more expressive GNNs, various attempts have been made to leverage existing results in graph theory such as high-dimensional WL algorithms (Azizian & Lelarge, 2021; Maron et al., 2019a; Morris et al., 2020b), substructure counting (Bouritsas et al., 2022; Barceló et al., 2021), and individualisation (Dupty et al., 2022). The expressivity of these GNNs is measured in terms of the k-WL hierarchy, a well-established framework for graph isomorphism testing (Grohe, 2017). However, the k-WL hierarchy exhibits several theoretical and practical limitations as a measure of expressivity for GNNs. Theoretically, it is a highly non-trivial problem to tell if and when k-WL algorithms can distinguish two particular graphs (Kiefer, 2020). Deciding which graph properties are important for distinguishing graphs is even harder, if not impossible. A complete description of all subgraph patterns whose counts and occurrence are k-WL invariant is only available for k = 1 (Arvind et al., 2020). Even at high computational cost, the power of k-WL algorithms in recognising graph properties seems limited, and some negative results are known, e.g., 3-WL cannot identify any k-cliques with k > 3 (Fürer, 2017). These issues hamper the practical applicability of high-dimensional WL algorithms for solving real-world tasks on graph-structured data (Chen et al., 2020; Garg et al., 2020). A question that arises from this is: is the k-WL hierarchy a good yardstick for the expressivity of GNNs? In the search for an answer to this question, we observe several disparities between (standard) GNNs and the k-WL hierarchy.
First, GNNs encode structural information into nodes as an efficient and practical way of graph learning. This, however, runs against the spirit of the k-WL hierarchy, which increases expressive power by going up to higher-order objects, i.e., k-tuples, rather than just nodes (Cai et al., 1992; Grohe, 2017). Second, GNNs are built upon a natural notion of local neighbourhood, i.e., nodes within a certain distance of a node. In contrast, the k-WL hierarchy defines the neighbourhood of a k-tuple based on "adjacency". This notion of adjacency involves the enumeration of all nodes of a graph in each dimension, which is not local and raises concerns about computational efficiency (Morris et al., 2020b). Last but not least, GNNs learn node representations by aggregating the information from neighbouring nodes, following the real-world intuition that "birds of a feather flock together" (Zhu et al., 2020; McPherson et al., 2001). The k-WL hierarchy updates the representation of a k-tuple by aggregating the information from its adjacent neighbours, which does not have a natural interpretation and thus makes it difficult to understand its real-world implications. In light of these observations, we explore a hierarchy of expressivity that is grounded in a new class of graph isomorphism algorithms, called Neighbourhood WL (N-WL) algorithms. This hierarchy overcomes the aforementioned limits of the k-WL hierarchy. More importantly, it enables a new paradigm for designing expressive GNNs while remaining intuitive and computationally efficient. We integrate the following novel insights into the algorithmic design of this hierarchy: (1) Instead of imposing the rigid condition that objects and their neighbours use the same structure (e.g., k-tuples and their variants), why can we not separate them by colouring nodes in a lower-dimensional space based on information from induced subgraphs in a higher-dimensional space?
(2) Can we build a hierarchy of expressivity for GNNs upon a natural choice of neighbourhood, i.e., the d-hop neighbourhood? On one hand, this ensures the locality of the neighbourhood and thus brings computational efficiency; on the other hand, it allows high-dimensional neighbours to capture intricate graph properties into node representations for distinguishing graphs. (3) Unlike the k-WL hierarchy, which increases expressivity only through one dimension k, the hierarchy in our work enables two independent ways of controlling expressive power: the size t of induced subgraphs and the size d of neighbourhoods, i.e., enumerating all subgraphs of order t within a d-hop neighbourhood. This helps strike a balance between the computational complexity and expressivity of algorithms, which real-world applications often demand. Figure 1 shows pairs of simple graphs of eight vertices that are indistinguishable by 1-WL but can be distinguished by our proposed hierarchy N-WL under different t and d parameters. By the k-WL hierarchy, we only know that 312 pairs of simple graphs lie between 1-WL and 3-WL, as none of them can be distinguished by 1-WL but all of them can be distinguished by 3-WL. Rather than "none" or "all", our N-WL hierarchy can distinguish these graphs in a more refined way under varied t and d values, i.e., each point (t, d) in Figure 1 indicates the number of pairs of simple graphs that remain indistinguishable under these parameters, and all pairs are distinguishable when t ≥ 3 and d ≥ 2. Further details and example graphs are provided in Appendix A. With a hierarchy of expressivity, it is natural to ask whether the hierarchy is strict. We thus further explore whether the N-WL hierarchy is strictly more expressive when considering induced subgraphs or neighbourhoods of larger sizes.
To show the strictness, we construct counterexample graphs such that, for any fixed d ∈ N and every t ∈ N, there exist non-isomorphic graphs which N -WL with (t, d) fails to distinguish but can be distinguished by N -WL with (t+1, d); on the other hand, for any fixed t ∈ N and every d ∈ N, there also exist non-isomorphic graphs which N -WL with (t, d) fails to distinguish but can be distinguished by N -WL with (t, d+1). Not surprisingly, constructing such counterexample graphs turns out to be difficult, due to the intricate interaction between t and d as well as the combinatorial nature of graph structure. We present such a construction which can produce families of non-isomorphic graph pairs with O(t) or O(d) vertices. To understand how graph connectivity may affect the expressivity of N -WL, we go on to examine the relation between induced subgraphs and their connectivity. Inspired by the Algebra of Subgraphs (Kocay, 1982) , we discover a previously unknown connection between induced subgraphs of size t, for any t ∈ N, and induced connected subgraphs whose sizes are less than or equal to t. This surprisingly leads to the finding that these two families of subgraphs have equivalent expressive power for distinguishing graphs. Hence, when graphs are sparse, instead of considering all induced subgraphs, we may consider only induced connected subgraphs, improving efficiency considerably. 
Table 1: Comparison of k-WL and its variants δ-k-LWL (Morris et al., 2020b), (k, s)-LWL (Morris et al., 2022), and (k, c)(≤)-SETWL (Zhao et al., 2022a) with our algorithms N(t, d)-WL and N_c(t, d)-WL, where #Coloured objects and #Neighbour objects refer to the number of coloured objects in a graph and the number of neighbour objects for each coloured object, respectively; ∆Coloured objects and ∆Neighbour objects refer to the type of coloured objects and the type of neighbour objects, respectively; n is the number of nodes and a is the average node degree in a graph; a_d is the average number of nodes in the d-hop neighbourhood of a node. Note that a_d ≪ n for graphs whose diameters are considerably greater than d.

                     k-WL      δ-k-LWL   (k, s)-LWL      (k, c)(≤)-SETWL            N(t, d)-WL  N_c(t, d)-WL
#Coloured objects    n^k       n^k       subset(n^k, s)  subset(Σ_{q=1..k} n^q, c)  n           n
#Neighbour objects   n × k     a × k     a × k           n × q                      a_d^t       subset(Σ_{q=1..t} a_d^q, 1)
∆Coloured objects    k-tuples  k-tuples  k-tuples        ≤k-sets                    nodes       nodes
∆Neighbour objects   k-tuples  k-tuples  k-tuples        ≤k-sets                    t-sets      ≤t-sets
Sparsity-awareness   ✗         ✓         ✓               ✓                          ✗           ✓

Contributions. The main contributions of this work are summarised below:

• We introduce a new class of graph isomorphism algorithms, Neighbourhood WL (N-WL), which exhibits a natural and strict hierarchy of expressivity for measuring GNNs in their ability to distinguish graphs (Theorem 3.1, Theorem 3.2, and Theorem 3.3).

• We establish a new theorem on the equivalence of expressivity between induced connected subgraphs and induced subgraphs in general within the hierarchy of N-WL, which enables us to further leverage graph sparsity for improving efficiency (Theorem 3.7).

• We explore how the hierarchy of N-WL relates to the k-WL hierarchy and establish their connections (Theorem 3.8).

• We propose a new GNN model architecture, Graph Neighbourhood Neural Network (G3N), which instantiates the ideas of the N-WL algorithms for graph learning (Theorem 4.1).

2. RELATED WORK

Since the advent of Graph Neural Networks (GNNs), a central theme has been to understand the expressivity of GNNs. It has been revealed that standard GNNs are at most as expressive as the Weisfeiler-Leman (WL) algorithm (Weisfeiler & Leman, 1968; Xu et al., 2019; Morris et al., 2019). Since then, the WL algorithm has played a crucial role in theoretical studies of GNNs. The k-WL hierarchy generalises the WL algorithm (1-WL) to classify k-tuples of vertices (Grohe, 2017; Babai & Kucera, 1979). It is known that k-WL is strictly more powerful than (k-1)-WL for any k ≥ 3. The Cai-Fürer-Immerman (CFI) construction can produce a family of non-isomorphic graph pairs as counterexamples which are indistinguishable by (k-1)-WL but can be distinguished by k-WL (Cai et al., 1992). Since k-WL is computationally expensive for k ≥ 3, it is mostly used as a theoretical tool in graph isomorphism testing rather than as a practical algorithm. Various attempts have been made in recent years to develop more expressive GNNs; see the survey by Sato (2020). To date, the expressivity of GNNs has typically been measured in terms of 1-WL. However, the expressivity gap between k = 1 and k = 3 in the k-WL hierarchy is often too large for applications in practice. On one hand, 1-WL is not expressive enough since it cannot count some simple structures such as cycles or triangles; on the other hand, many applications may not require the full power of 3-WL. There is a lack of a measure that characterises, in a more refined and natural way, the expressivity of models lying between 1-WL and 3-WL, as well as beyond, towards higher-dimensional WL. In this work, we depart from the k-WL hierarchy by proposing a new hierarchy that builds on high-order subgraphs within a neighbourhood aggregation scheme to characterise the expressivity of GNNs. Table 1 shows a comparison between k-WL and its variants and our hierarchy. A detailed discussion of this comparison and other related works is provided in Appendix B.
Figure 2: Two graphs G1 and G2, shown with the d = 1, 2, 3 neighbourhoods of nodes v1 and v2, which are distinguishable by N-(2, 1)-WL and N-(2, 2)-WL but not N-(2, 3)-WL.

3. NEIGHBOURHOOD WL ALGORITHMS

Let G = (V, E) be a simple graph with a set V of nodes and a set E of edges, where |V| = n. Given a node u ∈ V, its d-hop neighbourhood is defined by N_d(u) = {v ∈ V | ρ(u, v) ≤ d}, where ρ(•, •) denotes the shortest-path distance between two nodes; note that u ∈ N_d(u). The order of a (sub)graph is the number of its nodes. We use G1 ⊆ G2 to denote that G1 is a subgraph of G2, and G1 ≃ G2 that G1 and G2 are isomorphic, i.e., there exists a bijection between their nodes which induces a correspondence between their edges. Each equivalence class under ≃ is called an isomorphism type. All the proofs for lemmas and theorems in this section are provided in Appendix C.
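To make these definitions concrete, here is a minimal sketch (our own illustration; the helper names are ours, not the paper's), assuming an adjacency-set representation of a simple graph:

```python
from collections import deque

def d_hop_neighbourhood(adj, u, d):
    """Return N_d(u) = {v : rho(u, v) <= d}, including u itself.

    `adj` maps each node to the set of its neighbours (a simple graph).
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue  # do not expand past distance d
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

def induced_subgraph(adj, nodes):
    """Edge set of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    return {frozenset((v, w)) for v in nodes for w in adj[v] if w in nodes}

# A 6-cycle: N_2(0) covers every node except the antipodal node 3.
cycle6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(sorted(d_hop_neighbourhood(cycle6, 0, 2)))  # [0, 1, 2, 4, 5]
```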

3.1. A HIERARCHY OF EXPRESSIVITY

We begin with a simple yet weak hierarchy. This hierarchy shows that expressivity increases with the order of induced subgraphs in a neighbourhood; but counterintuitively, expressivity does not necessarily increase with the size of a neighbourhood.

Weak hierarchy. A node colouring in G is a function ς : V → N which is refined in iterations, i.e., ς^l(u) for l = 0, 1, ..., m. Initially, each node u is assigned some colour ς^0(u). Then, the colour ς^{l+1}(u) is iteratively refined based on its own colour ς^l(u) and the colours of induced subgraphs in its neighbourhood at the l-th iteration. The colours of induced subgraphs are also defined iteratively, according to the colours of their nodes in the same iteration and the connectivity of those nodes. Let S_G denote the set of all induced subgraphs of G and f_iso : S_G → N be a permutation-invariant function that encodes the isomorphism types of induced subgraphs. A subgraph colouring in G at the l-th iteration is a function ζ^l : S_G → N such that ζ^l(S1) ≠ ζ^l(S2) iff f_iso(S1) ≠ f_iso(S2). Given the set S_(u,t,d) of all induced subgraphs of order t within the d-hop neighbourhood of a node u, let ξ^l_(u,i) = {{ζ^l(S) | S ∈ S_(u,t,d) and f_iso(S) = i}} and I_t = {f_iso(S) | S ∈ S_(u,t,d)}. Then we have

ς^{l+1}(u) = HASH(ς^l(u), {ξ^l_(u,i)}_{i ∈ I_t}).    (1)

Here, HASH(•) is an injective hash function. We use N-(t, d)-WL to denote the above algorithm with two parameters: t for the order of induced subgraphs and d for the hop of the neighbourhood. The following theorem can be proven.

Theorem 3.1. For any fixed d ∈ N, N-(t+1, d)-WL is strictly more expressive than N-(t, d)-WL in distinguishing non-isomorphic graphs, where t ≥ 1. However, N-(t, d+1)-WL is not necessarily more expressive than N-(t, d)-WL in distinguishing non-isomorphic graphs. Figure 2 illustrates the problem.
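The weak refinement above can be prototyped in a few lines. The sketch below is our own simplification (the function names and the palette-compression step are ours): it colours each order-t induced subgraph by a brute-force canonical form over its node colours, collapses the per-isomorphism-type multisets into one sorted multiset, and compares two graphs by refining on their disjoint union.

```python
from collections import deque
from itertools import combinations, permutations

def _ball(adj, u, d):
    """Nodes within distance d of u (BFS), i.e. N_d(u)."""
    dist, q = {u: 0}, deque([u])
    while q:
        v = q.popleft()
        if dist[v] < d:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
    return sorted(dist)

def _iso_sig(nodes, adj, colours):
    """Canonical form of the coloured induced subgraph on `nodes`
    (brute force over permutations; fine for the tiny orders t used here)."""
    best = None
    for perm in permutations(nodes):
        edges = tuple(sorted(tuple(sorted((perm.index(v), perm.index(w))))
                             for v, w in combinations(nodes, 2) if w in adj[v]))
        cand = (edges, tuple(colours[v] for v in perm))
        if best is None or cand < best:
            best = cand
    return best

def weak_n_td_wl(adj, t, d, iters=3):
    """Weak N-(t, d)-WL: refine each node's colour from the multiset of
    coloured order-t induced subgraphs in its d-hop neighbourhood."""
    colours = {u: 0 for u in adj}
    for _ in range(iters):
        raw = {}
        for u in adj:
            sig = tuple(sorted(_iso_sig(S, adj, colours)
                               for S in combinations(_ball(adj, u, d), t)))
            raw[u] = (colours[u], sig)
        # compress the raw signatures back into small integer colours
        palette = {c: i for i, c in enumerate(sorted(set(raw.values())))}
        colours = {u: palette[raw[u]] for u in adj}
    return colours

def distinguishes(adj1, adj2, t, d):
    """Compare two graphs (nodes 0..n-1) via colour histograms on their
    disjoint union."""
    n1 = len(adj1)
    union = {u: set(vs) for u, vs in adj1.items()}
    union.update({u + n1: {v + n1 for v in vs} for u, vs in adj2.items()})
    col = weak_n_td_wl(union, t, d)
    return sorted(col[u] for u in adj1) != sorted(col[u + n1] for u in adj2)

# C6 vs. two disjoint triangles: both are 2-regular, so refinement with
# t = 1 fails, but the order-3 subgraphs in 1-hop neighbourhoods differ
# (a path in C6, a triangle in 2xC3).
c6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
two_c3 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
          3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
```

This toy version is exponential in t because of the brute-force canonical form; it is meant only to make the colouring rules concrete on small graphs.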
We observe that the source of the problem is that subgraphs are treated as multisets in which their colours can be distinguished only by their isomorphism types. Thus, aggregating subgraphs from a larger neighbourhood may lose information about subgraphs in the previous, smaller neighbourhood.

Strong hierarchy. To alleviate the above problem, we propose to consider not only isomorphism types but also positional types of subgraphs. This enables us to capture the relative importance of structures, even of the same isomorphism type, in a neighbourhood as its receptive field increases. We define a permutation-invariant function f_pos : S_G → N that encodes the positional types of induced subgraphs in G and satisfies the following condition for any t ∈ N and any d ∈ N:

∀S_i ∈ S_(u,t,d) ∀S_j ∈ (S_(u,t,d+1) − S_(u,t,d)) (f_pos(S_i) ≠ f_pos(S_j)).    (2)

We also reformulate the subgraph colouring function as ζ^l : S_G → N such that ζ^l(S1) = ζ^l(S2) iff f_iso(S1) = f_iso(S2) and f_pos(S1) = f_pos(S2). Let ξ^l_(u,i,j) = {{ζ^l(S) | S ∈ S_(u,t,d), f_iso(S) = i, and f_pos(S) = j}} and J_d = {f_pos(S) | S ∈ S_(u,t,d)}. Then the node colouring in Equation 1 is redefined to integrate the positional types of induced subgraphs as

ς^{l+1}(u) = HASH(ς^l(u), {ξ^l_(u,i,j)}_{i ∈ I_t, j ∈ J_d}).    (3)

We denote this algorithm with the node colouring defined in Equation 3 as N(t, d)-WL, to distinguish it from N-(t, d)-WL. The following theorems state that the expressivity of N(t, d)-WL increases when considering higher-order subgraphs or larger neighbourhoods.

Theorem 3.2. For any fixed d ∈ N, N(t+1, d)-WL is strictly more expressive than N(t, d)-WL in distinguishing non-isomorphic graphs, where t ≥ 1.

Theorem 3.3. For any fixed t ∈ N, N(t, d+1)-WL is strictly more expressive than N(t, d)-WL in distinguishing non-isomorphic graphs, where d ≥ 1.

For sufficiently large t or d, the neighbourhood structures considered by N(t, d)-WL coincide with the whole graph (Grohe & Schweitzer, 2020). This is because, given a graph G, each d-hop neighbourhood subgraph is the same as the graph G when d ≥ dia(G); induced subgraphs of t vertices are also the same as the graph G when t ≥ dia(G), where dia(G) refers to the diameter of G.
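The condition on f_pos can be met by a simple choice (our own illustrative instantiation, not necessarily the paper's): encode each subgraph by the maximum shortest-path distance from u to its nodes. A subgraph entirely inside N_d(u) then receives a value of at most d, while one first appearing in N_{d+1}(u) receives exactly d+1, so the two can never share a positional type.

```python
from collections import deque

def bfs_distances(adj, u):
    """Shortest-path distances from u in an unweighted graph (BFS)."""
    dist, q = {u: 0}, deque([u])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def f_pos(u, subgraph_nodes, adj):
    """Positional type of an induced subgraph relative to u: the smallest d
    such that the subgraph lies entirely within N_d(u)."""
    dist = bfs_distances(adj, u)
    return max(dist[v] for v in subgraph_nodes)

# On the path 0-1-2-3: the order-2 subgraph {1, 2} first appears at d = 2,
# while {0, 1} is already inside N_1(0), so their positional types differ.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
assert f_pos(0, {0, 1}, path) == 1
assert f_pos(0, {1, 2}, path) == 2
```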
Note that as shown by Cai, Fürer, and Immerman in their seminal paper (Cai et al., 1992) , the k-WL algorithm is incomplete in the sense that, for any k ∈ N, there exists a pair of non-isomorphic graphs that cannot be distinguished by the k-WL algorithm. Similarly, in our work, the construction presented in Appendix C shows that, for any t ∈ N and d ∈ N, there exists a pair of non-isomorphic graphs that cannot be distinguished by N (t, d)-WL.

3.2. CONNECTED-HEREDITARY SUBGRAPHS

The locality of the neighbourhood restricts N(t, d)-WL to considering only induced subgraphs that are "close" to a node. This nice property helps reduce computational complexity compared with k-WL. Nonetheless, enumerating all induced subgraphs in a neighbourhood can still be inefficient if their order is high. Real-world applications often focus on analysing the connectivity of nodes in a graph. For example, highly interacting proteins are more likely to form functional modules (Alokshiya et al., 2019). Thus, one natural way to further reduce complexity is to consider only connected subgraphs. Following this direction, we identify connected-hereditary subgraphs. Surprisingly, it turns out that all connected induced subgraphs with orders no greater than t capture no less information than all induced subgraphs of order t. This finding leads us to improve the N-WL algorithm by taking graph connectivity into consideration. Let I^c_t ⊆ I_t be the set of isomorphism types in I_t for connected subgraphs. Instead of considering all subgraphs in the neighbourhood as in Equation 3, we now restrict node colouring to connected subgraphs but allow the orders of connected subgraphs to range over [1, t]:

ς^{l+1}(u) = HASH(ς^l(u), ∪_{k ∈ [1,t]} {ξ^l_(u,i,j)}_{i ∈ I^c_k, j ∈ J_d}).    (4)

We denote this algorithm, with node colouring based only on connected subgraphs, as N_c(t, d)-WL. Although Equation 4 might look rather restricted, we show that if two graphs G1 and G2 are distinguishable by N(t, d)-WL, they are also distinguishable by N_c(t, d)-WL. This is due to a property of subgraphs occurring in N_c(t, d)-WL, which we elaborate below.

Figure 3(a): Neighbourhoods of the Rook's graph and the Shrikhande graph.

Let S^{≤t}_c and S^{=t} be the sets of all subgraphs in the d-hop neighbourhood of a node u considered by N_c(t, d)-WL and N(t, d)-WL, respectively, and S^{=t}_c the set of all subgraphs of order t in S^{≤t}_c.

Definition 3.4. Let S ⊆ S_G.
Then S is said to be connected-hereditary if every S ∈ S is a connected induced subgraph and S ∈ S implies that every connected induced subgraph of S is also in S.

S^{≤t}_c contains all connected induced subgraphs of k vertices with 1 ≤ k ≤ t in the d-hop neighbourhood of a node. We have the following lemma.

Lemma 3.5. S^{≤t}_c is connected-hereditary.

Let S1 • S2 denote the union of two node-disjoint subgraphs, i.e., V(S1) ∩ V(S2) = ∅, and µ(S) the number of connected components in S. The lemma below states that any non-connected subgraph in (S^{=t} − S^{=t}_c) is a union of smaller connected subgraphs in S^{≤t}_c whose node sets are pairwise disjoint.

Lemma 3.6. For each induced subgraph S satisfying S ∈ S^{=t} but S ∉ S^{=t}_c, there exists a set {S1, S2, ..., Sq} ⊆ S^{≤t}_c such that S = S1 • S2 • ... • Sq, where µ(S) = q.

The intuition is that smaller connected induced subgraphs can be used as pieces to reconstruct all the disconnected order-t induced subgraphs, without losing expressivity. Hence, based on these lemmas, we can prove that the N_c(t, d)-WL algorithm is as powerful as the N(t, d)-WL algorithm; in fact, both algorithms have the same expressivity.

Theorem 3.7. For any t ∈ N and d ∈ N, N_c(t, d)-WL and N(t, d)-WL have the same expressivity in distinguishing non-isomorphic graphs.

It is worth noting that, by further restricting connected subgraphs to specific isomorphism types, we may obtain different subclasses of algorithms. For example, limiting I^c to cycles, cliques, or paths reduces N_c(t, d)-WL to the neighbourhood WL algorithms N_cycle(t, d)-WL, N_clique(t, d)-WL, and N_path(t, d)-WL, respectively, which encode structures of specific kinds from node neighbourhoods into node colours.
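Enumerating S^{≤t}_c is straightforward; the following sketch (our own naive enumeration, growing each connected node set by one boundary node at a time, not the paper's implementation) illustrates both Lemma 3.5's hereditary structure and Lemma 3.6's decomposition on a small example:

```python
def connected_induced_subgraphs(adj, nodes, t):
    """All connected node sets of size <= t inside `nodes`, grown by
    repeatedly adding one boundary node (naive; fine for small t)."""
    nodes = set(nodes)
    found = {frozenset([v]) for v in nodes}
    frontier = set(found)
    for _ in range(t - 1):
        grown = set()
        for S in frontier:
            boundary = {w for v in S for w in adj[v]
                        if w in nodes and w not in S}
            for w in boundary:
                grown.add(S | {w})
        found |= grown   # every prefix of the growth is kept: hereditary
        frontier = grown
    return found

def is_connected(adj, S):
    """Check connectivity of the induced subgraph on node set S."""
    S = set(S)
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in S and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == S

# On a 4-cycle, S_c^{<=3} has 4 singletons + 4 edges + 4 induced paths = 12
# members; the disconnected order-2 set {0, 2} is not among them but is the
# node-disjoint union {0} . {2} of two members, as in Lemma 3.6.
c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
subs = connected_induced_subgraphs(c4, set(c4), 3)
```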

3.3. CONNECTIONS TO k-WL HIERARCHY

A question naturally arises as to how N-WL relates to the k-WL hierarchy. Since 1-WL corresponds to node colouring based on the multiset of the colours of a node's neighbours within a 1-hop neighbourhood, it is straightforward to establish the following connection.

Theorem 3.8. N(1, 1)-WL is equivalent to 1-WL in distinguishing non-isomorphic graphs.

By Theorems 3.2 and 3.3, we know that the expressivity of N(t, d)-WL increases when increasing either the order of subgraphs or the hop of a neighbourhood. This leads to the proposition below.

Proposition 3.9. Both N(2, 1)-WL and N(1, 2)-WL are strictly more powerful than 1-WL in distinguishing non-isomorphic graphs.

However, in general, the N-WL hierarchy is incomparable with k-WL; notably, N(3, 1)-WL is incomparable with 3-WL. A canonical example of two non-isomorphic strongly regular graphs which 3-WL cannot distinguish is the pair of the 4×4 Rook's graph and the Shrikhande graph. Despite their strong regularity, N(3, 1)-WL can distinguish them by distinguishing order-3 subgraphs in 1-hop node neighbourhoods, as in Figure 3(a). Nevertheless, N(3, 1)-WL fails to detect 4-cycles as its receptive field is too small in contrast to 3-WL: it is unable to distinguish an 8-cycle from two disjoint 4-cycles in Figure 3(b), whereas 3-WL can distinguish them.

4. GRAPH NEIGHBOURHOOD NEURAL NETWORK

Motivated by the N-WL algorithms, we design the Graph Neighbourhood Neural Network (G3N), which is able to leverage and learn structural information from neighbourhoods.

Model design. Given a graph G = (V, E), each node u ∈ V is associated with an f-dimensional feature vector x_u ∈ R^f, and h^(0)_u = x_u. Let S^(l)_u(i, j) denote the set of all order-t subgraphs within the d-hop neighbourhood of a node u with isomorphism type i and positional type j at the l-th layer. Then, at the (l+1)-th layer, the node embedding h^(l+1)_u of a node u is defined by

h^(l+1)_u = COMBINE(h^(l)_u, AGG^N_{(i,j) ∈ I_t×J_d}(AGG^T_{S ∈ S^(l)_u(i,j)}(POOL(S)))).    (5)

POOL(•) extracts node representations within a subgraph S into a subgraph embedding and can be defined by any graph pooling method. Aggregation proceeds in two steps: AGG^T(•) combines subgraph embeddings of the same isomorphism and positional types, and AGG^N(•) combines the resulting embeddings from all subgraphs in the neighbourhood. COMBINE(•) combines the node embedding of u at the previous layer with the aggregated embedding of subgraphs. Further details of the G3N model architecture are described in Appendix D. One can compare Equation 5 to the node colouring of N-WL described in Equation 3, where subgraph pooling corresponds to subgraph colouring and the aggregation corresponds to the hashing of sets of multisets. We refer to G3N with given t and d as G3N-(t, d).

Expressivity analysis. G3N-(t, d) is at most as expressive as N(t, d)-WL. To match the expressiveness of N(t, d)-WL, one may insert Multi-Layer Perceptrons (MLPs) to approximate injective functions, as employed by Xu et al. (2019). However, this comes at higher parameter complexity, which may increase expressiveness but could decrease generalisability. The following theorem states this formally, with the proof available in Appendix C.

Theorem 4.1. G3N-(t, d) with injective COMBINE and AGG^N functions, an AGG^T function that is injective w.r.t. multisets of subgraphs with the same isomorphism and positional types, an injective graph readout function, and sufficiently many layers is as powerful as N(t, d)-WL.

Complexity analysis.
Usually, expressivity comes at a cost of computational complexity, and this is no exception for G3N. Let a denote the average node degree of a graph. Then, ignoring node feature embedding dimensions, standard GNNs have complexity O(n · a) per layer while G3N has O(n · a_d^t), where t is a very small value, usually less than 6. Note that a_d ≪ n is the average size of a local d-hop neighbourhood; this differs from k-WL, which considers all vertices in a graph, i.e., O(n^k). Further, we may consider only connected subgraphs when graphs are sparse. Subgraph aggregations can be conducted in parallel across all subgraphs according to their isomorphism and positional types. Computationally, G3N is thus much more efficient than expressive GNNs based on k-WL.

Table 2: Numbers of undistinguished graph pairs (EXP, SR25, graph8c, CSL) and substructure-counting errors.

Model      EXP  SR25  graph8c  CSL
GCN        600  105   4775     45   2.43E-1  1.42E-1  1.0E-4  1.14E-1
GAT        600  105   1828     45   2.47E-1  1.44E-1  1.0E-4  1.12E-1
GIN        600  105   386      45   2.06E-1  1.18E-1  1.0E-4  1.21E-1
PPGN       0    105   0        1    1.00E-4  2.61E-4  1.0E-4  3.30E-4
G3N-(2,1)  600  105   5        36   1.81E-4  3.58E-4  6.46E-4  1.03E-1
G3N-(3,1)  600  0     5        36   4.31E-4  4.30E-4  3.41E-4  1.14E-1
G3N-(2,2)  0    105   6        15   6.77E-4  7.06E-4  8.63E-4  1.61E-2
G3N-(3,2)  0    0     0        10   1.55E-3  1.69E-3  4.42E-3  7.38E-3
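To illustrate how Equation 5 unfolds into code, here is a framework-free sketch of one G3N layer (our own toy instantiation: the subgraph index S_u(i, j) is assumed precomputed, and the four learnable maps are stood in for by plain vector sums rather than MLPs):

```python
def g3n_layer(h, subgraph_index, combine, agg_t, agg_n, pool):
    """One G3N layer following Equation 5.

    h              : dict node -> feature vector (list of floats)
    subgraph_index : dict node -> dict (iso_type, pos_type) -> list of
                     node sets, i.e. the precomputed S_u(i, j)
    pool, agg_t, agg_n, combine : the four maps of Equation 5
                     (learnable MLPs in practice; plain functions here)
    """
    out = {}
    for u, groups in subgraph_index.items():
        # AGG^T over subgraphs of each (isomorphism, positional) type...
        per_type = [agg_t([pool([h[v] for v in S]) for S in subs])
                    for (_i, _j), subs in sorted(groups.items())]
        # ...then AGG^N over the types, and COMBINE with the old embedding.
        out[u] = combine(h[u], agg_n(per_type))
    return out

def vsum(vectors):
    """Elementwise sum of equal-length vectors."""
    return [sum(xs) for xs in zip(*vectors)]

# Toy run: 3 nodes, each seeing one shared subgraph {0, 1, 2} of one type.
h = {0: [1.0], 1: [2.0], 2: [3.0]}
index = {u: {(0, 0): [{0, 1, 2}]} for u in h}
new_h = g3n_layer(h, index,
                  combine=lambda a, b: vsum([a, b]),
                  agg_t=vsum, agg_n=vsum, pool=vsum)
print(new_h[0])  # [7.0] = own 1.0 + pooled 6.0
```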

5. EXPERIMENTS

We run experiments on synthetic and real-world datasets to empirically validate the theoretical properties of G3N. For synthetic datasets, we focus on two main tasks: the graph isomorphism problem and substructure counting. As for real-world datasets, we focus on graph classification and regression on molecular datasets and social networks. Details about datasets, baselines, and experimental setups, as well as additional experimental results, are available in Appendix E.

5.1. SYNTHETIC DATASETS

Setup. EXP (Abboud et al., 2021) consists of 600 pairs of non-isomorphic graphs (G_i, H_i), each encoding a propositional formula which is either satisfiable or not. The graphs are designed to be indistinguishable by 1-WL. SR25 (Balcilar et al., 2021) consists of 15 strongly regular graphs that 3-WL cannot distinguish; graph8c (all connected graphs of eight vertices) and CSL (circular skip link graphs) are further isomorphism-testing benchmarks, and the substructure counting tasks follow Chen et al. (2020). We follow the setup described by Balcilar et al. (2021).

Observations. From Table 2 we can see that G3N with t = 2 and d = 1 alone is already more expressive than 1-WL-bounded GNNs, being able to distinguish most graph8c graphs and some CSL graphs. By increasing t to 3, we are able to distinguish all SR25 graphs, which the 3-WL-bounded PPGN is unable to. Other configurations of d and t also support that N-WL is incomparable with k-WL. Looking at the substructure counting tasks, G3N with t = 2 is able to trivially detect triangles but not 4-cycles. Setting t = 3 and d = 2 allows us to detect 4-cycles. Note that G3N with t = 2 and d = 2 is unable to learn 4-cycles but is able to memorise some configurations, leading to the slightly lower error. We finally observe that, for higher d and t, convergence of G3N takes longer due to its higher expressivity and the low parameter budget in this setting, resulting in slightly poorer predictions.

5.2. REAL-WORLD DATASETS

Setup. ZINC (Irwin et al., 2012) contains 250K molecules, of which 12K are selected for the task of regressing constrained solubility. We follow the experimental setup described in (Dwivedi et al., 2020). MolHIV was introduced by Wu et al. (2018) as part of the MoleculeNet benchmark and was proposed as a graph classification task by Hu et al. (2020). We follow the train, validation, and test split from Hu et al. (2020) and evaluate on the test score corresponding to the best validation score. We further consider the bioinformatics and social network datasets from the TUDataset collection (Morris et al., 2020a), following the setup described in (Xu et al., 2019).

Observations. From the results collated in Tables 3 and 4, G3N performs well compared with the other baselines. On ZINC, G3N is better than PPGN at learning topological features. There exist stronger-performing models on molecular datasets, such as CIN (Bodnar et al., 2021a) and GSN (Bouritsas et al., 2022); however, these methods explicitly compute cycle information, which is known to be correlated with molecular classes and attributes. For example, cycle counts are used in the calculation of the regressed constrained solubility attribute of the ZINC dataset (Irwin et al., 2012). Among the baselines that learn topological features without pre-selecting application-dependent subgraph patterns, our G3N performs best on ZINC both with and without edge features.

6. CONCLUSIONS

We proposed a new class of graph isomorphism algorithms, the N-WL algorithms. This class exhibits a more refined hierarchy of expressivity for distinguishing non-isomorphic graphs by capturing graph neighbourhood structures as faithfully as possible. We further explored how to narrow our focus down to connected subgraphs only, improving efficiency while preserving expressiveness. Motivated by N-WL, we proposed the G3N architecture, which is provably more expressive than vanilla message-passing neural networks and is able to leverage graph structure more effectively for prediction tasks.


A INDISTINGUISHABLE PAIRS: k-WL VS N-WL

Figure 5 presents two pairs of non-isomorphic simple graphs of eight vertices from the 312 pairs of non-isomorphic simple graphs reported in Figure 1. In terms of the k-WL hierarchy, both pairs cannot be distinguished by 1-WL but can be distinguished by 3-WL. Beyond this, there is no further information about why these pairs of non-isomorphic graphs can or cannot be distinguished and what graph properties are important in distinguishing them. Nonetheless, if we look into these pairs in terms of the N-WL hierarchy, we have the following observations: (1) Although both pairs of graphs are among the 312 pairs of non-isomorphic simple graphs of eight vertices that cannot be distinguished by 1-WL, they look very different, exhibiting different properties. (2) The reason why G′1 and G′2 in Figure 5(a) cannot be distinguished by 1-WL is that these two graphs have the same number of vertices in their 1-hop neighbourhoods. Indeed, they also have the same number of edges in their 1-hop neighbourhoods. However, they do have different numbers of triangles in their 1-hop neighbourhoods, which explains why they can only be distinguished by N(t, d)-WL if t ≥ 3. (3) The reason why G1 and G2 in Figure 5(b) cannot be distinguished by 1-WL is different from the case of G′1 and G′2 in Figure 5(a). If d = 1, then no matter how much t is increased, it is always insufficient to distinguish these two graphs. On the other hand, even if t = 1, these two graphs can be distinguished as long as we consider their 2-hop neighbourhoods rather than 1-hop neighbourhoods, i.e., d ≥ 2. Table 5 presents that, among all simple graphs of eight vertices, N(t, d)-WL with varying t and d values has different expressive power in distinguishing pairs of these simple graphs.
For example, N(1, 1)-WL fails to distinguish 312 pairs among all pairs of simple graphs of eight vertices. In summary, for all simple graphs of eight vertices, N(t, d)-WL with t ≥ 3 and d ≥ 2 is sufficient to distinguish any pair of them.

[Figure 5: (a) the pair G′_1, G′_2; (b) the pair G_1, G_2.]

Table 5: Indistinguishable pairs by N(t, d)-WL on all simple graphs of eight vertices.

        d = 1   d = 2   d = 3   d = 4   d = 5   d = 6
t = 1     312     186     186     186     186     186
t = 2      20       6       6       6       6       6
t = 3       5       0       0       0       0       0
t = 4       5       0       0       0       0       0
t = 5       5       0       0       0       0       0
t = 6       5       0       0       0       0       0

B RELATED WORK

B.1 EXPRESSIVE GNNS BEYOND 1-WL TEST

Below, we summarise the main directions of research for designing expressive GNNs that go beyond the 1-WL test. High-order GNNs. Following k-WL, a natural way of increasing the power of GNNs is to go to higher dimensions (Azizian & Lelarge, 2021; Morris et al., 2021). Morris et al. (2019) introduced k-GNN based on a set variant of k-WL, which learns features over subgraphs on k nodes and is strictly weaker than k-WL. Maron et al. (2018; 2019b;a) developed k-order GNNs that are as expressive as k-WL and showed that a reduced 2-order GNN is as powerful as 3-WL. Morris et al. (2020b; 2022) and Zhao et al. (2022a) proposed high-order GNNs by only considering a subset of all k-tuples, i.e., k-tuples that are local or correspond to induced subgraphs with a certain number of connected components. These GNN models are all provably stronger than 1-WL; however, they still suffer from the shortcomings of k-WL (non-intuitive design and high computational costs) and are thus impractical for real-world tasks, particularly when graphs are large. Unlike these works, our GNN architecture is not built upon the classical k-WL algorithms.
At its core, the N-WL algorithm proposed in this work is a node (1-dimensional) refinement algorithm based on high-order induced subgraphs in the d-hop neighbourhood of each node, i.e., it incorporates subgraph colouring into node colouring instead of propagating k-tuple colouring based on k-tuple colouring of the same dimension. Further, the N-WL algorithm has the same locality as standard GNNs, i.e., it considers local structures in the neighbourhood within d hops of each node. Injecting structures. Substructure counting has been leveraged by several GNNs to inject structural information into node or edge features of a graph (Bouritsas et al., 2022; Barceló et al., 2021). Typically, isomorphism counts (Bouritsas et al., 2022) or homomorphism counts (Barceló et al., 2021) of small subgraph patterns (e.g., cycles and cliques) are precomputed as application-specific inductive biases. In a similar spirit, Thiede et al. (2021) applied convolutions on automorphism groups of subgraph patterns; Nguyen & Maehara (2020) used homomorphism counts of subgraph patterns as graph invariants. A recent work (Wijesinghe & Wang, 2022) introduced a way of injecting structural coefficients from local structures into neighbour aggregation. Despite being more expressive than 1-WL, all these models require preprocessing to extract structural information from subgraphs. Although also considering subgraphs, our work has some important distinctions from the above works. We consider subgraphs from a combinatorial perspective, which does not require hand-choosing subgraph patterns, whereas the above GNN models consider specific, pre-defined subgraphs. As noted by Barceló et al. (2021), knowing which subgraph patterns work well and which do not is critical for the power of GNN models that rely on specific pre-defined subgraph patterns. However, this question is not easy to answer because determining which subgraph patterns work well is often application-dependent. Subgraph-based GNNs.
Several recent studies proposed to apply a base GNN on subgraphs of a graph rather than on the whole graph directly (Zhao et al., 2022b; Zhang & Li, 2021; Bevilacqua et al., 2022; Frasca et al., 2022). The way of representing a graph in terms of subgraphs varies across models. Bevilacqua et al. (2022) represent each graph as a multiset of subgraphs chosen according to a predefined policy, e.g., removing one edge or one node. Zhao et al. (2022b) and Zhang & Li (2021) represent a graph with a multiset of induced subgraphs, each rooted at one node in the graph, which generalises rooted subtrees used in the traditional setting. It has been shown that, by applying a base GNN that is as expressive as 1-WL, such as GIN (Xu et al., 2019), on subgraphs, the expressive power of these models goes beyond 1-WL. Our work is different from these subgraph-based GNNs because we do not apply a GNN on subgraphs. Instead, our GNN architecture aims to incorporate structural information (i.e., isomorphism types) and positional information (i.e., positional types) of subgraphs in the neighbourhood of each node into the node's embedding. In this respect, our GNN architecture is similar to standard GNNs, except that the neighbourhood information for aggregation may include information from high-order subgraphs in a neighbourhood rather than merely neighbouring nodes. K-hop message-passing GNNs. Recently, some attempts have been made to extend GNNs from 1-hop to K-hop message-passing aggregation (Abu-El-Haija et al., 2019; Nikolentzos et al., 2020; Brossard et al., 2020; Wang et al., 2020; Feng et al., 2022). The key idea is to aggregate information not only from direct neighbours (1-hop neighbours) of a node but also from neighbours within K hops of the node. Feng et al.
(2022) showed that standard K-hop message-passing GNNs are strictly more powerful than the 1-WL test when K > 1, but that their expressive power is bounded by the 3-WL test; the proof is based on the work of Frasca et al. (2022). To further improve expressive power, Feng et al. (2022) proposed a new GNN model, called KP-GNN, which incorporates the information of peripheral subgraphs into K-hop message-passing GNNs. Our work provides a comprehensive theoretical framework for characterising the expressive power of different K-hop message-passing GNNs. For example, standard K-hop message-passing GNNs with the shortest-path distance kernel (Feng et al., 2022) can be characterised as corresponding to our N(1, d)-WL with d = K, where each node has its shortest-path distance to the target node as a positional type. Further, depending on the isomorphism types of peripheral subgraphs, KP-GNN can be characterised as an instance of N(t, d)-WL in our work with d = K and t referring to the size of peripheral subgraphs, where isomorphism types are restricted to those pre-selected by KP-GNN. Node identity and individualisation. Several works augment nodes with identifiers (Loukas, 2020; You et al., 2021) or random features (Vignac et al., 2020; Sato et al., 2021; Abboud et al., 2021). In our work, we do not assign identifiers or random features to nodes, nor do we individualise and refine the colours of nodes. Our GNN architecture thus preserves permutation invariance. Tensor languages. Several works have studied the expressive power of GNNs from the perspective of matrix query languages. Balcilar et al. (2021) obtained upper bounds on the expressive power of GNNs in terms of 1-WL and 2-WL based on the MATLANG matrix query language (Brijder et al., 2019). Geerts & Reutter (2022) explored a tensor language specifically designed for specifying GNNs and discovered connections between a guarded fragment of the MATLANG matrix query language and k-WL.
The key idea is to first reduce the expressivity problem of GNNs to the problem of specifying GNNs in a tensor language, and then analyse their expressive power in terms of the number of indices and the summation depth used in tensor-language expressions. In addition to the above, Bodnar et al. (2021a;b) proposed message-passing schemes operating on topological objects such as simplicial complexes or cell complexes. Horn et al. (2021) proposed a topological layer for GNNs that can leverage topological information of a graph using persistent homology, a technique for computing topological features such as connected components and cycles of a graph. These works also lead to GNNs that are more expressive than 1-WL.

B.2 CONNECTIONS TO k-WL AND ITS VARIANTS

Recently, several heuristic algorithms for the graph isomorphism problem have been developed based on the classical k-WL algorithms, taking into account the sparsity of graphs (Morris et al., 2020b; 2022; Zhao et al., 2022a). These works were motivated by the high computational cost of the classical k-WL algorithms. In the following, we compare k-WL and its variants with our N-WL algorithms. To better understand the underlying differences between these two families of algorithms (k-WL vs N-WL), we separate the comparison into two aspects: (1) objects to be coloured in a graph (#Coloured objects, the number of coloured objects, and ∆Coloured objects, the type of coloured objects); (2) neighbour objects used when colouring each object (#Neighbour objects, the number of neighbour objects, and ∆Neighbour objects, the type of neighbour objects). Let n denote the number of nodes in a graph, a the average number of nodes adjacent to a node (i.e., the average node degree), a_d the average number of nodes in the d-hop neighbourhood of a node, and dia(G) the diameter of a graph. We denote sets with exactly k nodes as k-sets and sets with at most k nodes as ≤k-sets, and write C(n, k) for the binomial coefficient. Usually, a_d ≪ n when d ≪ dia(G). Further, subset(n^k, s) refers to the number of k-tuples in the subset of all n^k k-tuples whose induced subgraphs have at most s connected components, while subset(Σ_{i=1}^{k} C(n, i), c) refers to the number of ≤k-sets in the subset of all Σ_{i=1}^{k} C(n, i) sets whose induced subgraphs have at most c connected components. Table 6 compares k-WL, N-WL, and their variants in terms of #Coloured objects, #Neighbour objects, ∆Coloured objects, ∆Neighbour objects, and sparsity-awareness.
We have the following observations:

• k-WL and its variants focus on colouring k-tuples, subsets of k-tuples, or their corresponding set forms, such as k-sets and ≤k-sets, by aggregating information from neighbours of the same types. N-WL aims to colour nodes by aggregating information from neighbouring subgraphs; these subgraphs may be viewed as corresponding to t-sets or ≤t-sets. Regarding connections to GNN architectures, both k-WL and N-WL can easily be implemented for graph-level learning tasks. However, for node-level learning tasks, as discussed in Morris et al. (2022), k-WL and its variants require an additional pooling operation to combine the features of the k-tuples, k-sets, or ≤k-sets containing a node into a representation of that node. In contrast, N-WL serves as a natural paradigm that computes node representations as first-class citizens.

• k-WL has a more global nature than N-WL, in the sense that its coloured objects range over all possible k-tuples of the entire graph. The variants of k-WL impose additional conditions on all k-tuples with the aim of reducing them to a subset of all k-tuples or ≤k-tuples. Two commonly used conditions are: (i) reducing k-tuples or ≤k-tuples to a set form such as k-sets and ≤k-sets; and (ii) restricting coloured objects or neighbour objects to satisfy a certain connectivity condition, e.g., containing no more than a specified number of connected components. In contrast, N-WL is designed to be local. This difference can be observed from Table 6.

Table 6: Comparison of k-WL, N-WL, and their variants.

                     k-WL      δ-k-LWL   (k,s)-LWL        (k,c)(≤)-SETWL                  N(t,d)-WL   N_c(t,d)-WL
#Coloured objects    n^k       n^k       subset(n^k, s)   subset(Σ_{q=1}^{k} C(n,q), c)   n           n
#Neighbour objects   n × k     a × k     a × k            n × q                           C(a_d, t)   subset(Σ_{q=1}^{t} C(a_d,q), 1)
∆Coloured objects    k-tuples  k-tuples  k-tuples         ≤k-sets                         nodes       nodes
∆Neighbour objects   k-tuples  k-tuples  k-tuples         ≤k-sets                         t-sets      ≤t-sets
Sparsity-awareness   ✗         ✓         ✓                ✓                               ✗           ✓
Specifically, for coloured objects, k-WL and its variants deal with the number n^k of all k-tuples, or a subset of all k-tuples, or ≤k-sets from an entire graph, whereas N-WL only deals with the number n of all nodes in a graph. For neighbour objects, k-WL considers n × k neighbour objects per coloured object; δ-k-LWL (Morris et al., 2020b) and (k, s)-LWL (Morris et al., 2022) consider a × k neighbour objects, where a is the average node degree, since only local neighbours of a k-tuple are taken into consideration; and (k, c)(≤)-SETWL (Zhao et al., 2022a) considers n × q neighbour objects per coloured object, where q ∈ [k]. N(t, d)-WL considers the C(a_d, t) subgraphs of order t in the d-hop neighbourhood of a node, while N_c(t, d)-WL considers only connected subgraphs with ≤ t nodes. When a_d ≪ n, N-WL performs much more efficiently than k-WL and its variants. • As depicted in Table 6, the variants δ-k-LWL, (k, s)-LWL, (k, c)(≤)-SETWL, and our N_c(t, d)-WL are sparsity-aware (i.e., they adapt to the sparsity of a graph), while the classical k-WL and our N(t, d)-WL algorithms are not. Nonetheless, unlike these k-WL variants, which trade off expressivity for computational efficiency, our N_c(t, d)-WL is provably as powerful as N(t, d)-WL, improving efficiency on sparse graphs without compromising expressive power (see Theorem 3.7). Remark B.1. It is worth noting that, when learning a node representation, our N-WL algorithms consider all (or all connected) subgraphs bounded within the d-hop neighbourhood of a node, but these subgraphs need not be incident to the node. This is different from the rooted subgraphs considered by the local variants of k-WL as well as by subgraph-based GNNs in the literature.
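To make the scale difference concrete, here is a back-of-the-envelope comparison of the object counts in Table 6 (the values of n, k, a_d, and t are illustrative assumptions, not measurements from the paper):

```python
from math import comb

n, k = 1000, 3      # number of nodes and k-WL dimension (illustrative)
a_d, t = 20, 3      # average d-hop neighbourhood size and subgraph order

coloured_k_wl = n ** k               # k-WL colours every k-tuple
coloured_n_wl = n                    # N(t, d)-WL colours only nodes
neighbours_per_node = comb(a_d, t)   # order-t subgraphs per d-hop neighbourhood

# k-WL maintains a billion coloured objects here, while N(t, d)-WL inspects
# roughly a thousand subgraphs per node over a thousand nodes.
assert coloured_k_wl == 1_000_000_000
assert coloured_n_wl * neighbours_per_node == 1_140_000
```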

C PROOFS OF THEOREMS

C.1 PROOFS FOR WEAK HIERARCHY (THEOREM 3.1)

To prove Theorem 3.1, we begin with the following lemma.

Lemma 1. For any fixed d ∈ N, N(t+1, d)-WL is at least as expressive as N(t, d)-WL in distinguishing non-isomorphic graphs, where t ≥ 1.

Proof. We prove this lemma by contradiction. Assume that there exist two non-isomorphic graphs G_1 and G_2 which can be distinguished by N(t, d)-WL but cannot be distinguished by N(t+1, d)-WL after k iterations. This implies that, for every l-th iteration with l = 0, 1, ..., k-1, N(t+1, d)-WL must produce the same multiset of node colours for G_1 and G_2. It thus suffices to show that, at any iteration l, if the colours of any two nodes in G_1 and G_2 are the same under N(t+1, d)-WL, then their colours under N(t, d)-WL must also be the same. We prove this by induction. When l = 0, the initial node colours coincide for N(t+1, d)-WL and N(t, d)-WL, so the statement holds. When l > 0, assume the statement holds for l-1. Then, if the colours of two nodes u_1 and u_2 in G_1 and G_2 are the same at the l-th iteration of N(t+1, d)-WL, i.e., ς^l(u_1) = ς^l(u_2), by Equation 1 we have (ς^{l-1}(u_1), {{ξ^{l-1}_{(u_1,i)}}}_{i∈I_{t+1}}) = (ς^{l-1}(u_2), {{ξ^{l-1}_{(u_2,i)}}}_{i∈I_{t+1}}). Since u_1 and u_2 have the same colour at the (l-1)-th iteration, the following must hold: {{ξ^{l-1}_{(u_1,i)}}}_{i∈I_{t+1}} = {{ξ^{l-1}_{(u_2,i)}}}_{i∈I_{t+1}}. By the definition of the N-WL algorithm, {{ξ^{l-1}_{(u,i)}}}_{i∈I_{t+1}} represents the multiset of colours for subgraphs of order t+1 from the d-hop neighbourhood of a node u at the (l-1)-th iteration of N(t+1, d)-WL. This corresponds to C(m, t+1) induced subgraphs of order t+1, where m is the number of vertices in the d-hop neighbourhood of u and, w.l.o.g., we assume t+1 ≤ m. We show that {{ξ^l_{(u,i)}}}_{i∈I_{t+1}} determines {{ξ^l_{(u,i)}}}_{i∈I_t}. Here, {{ξ^l_{(u,i)}}}_{i∈I_t} corresponds to the C(m, t) induced subgraphs of order t in the d-hop neighbourhood of u at the l-th iteration of N(t, d)-WL.
Since each ξ^l_{(u,i)} encodes the count of induced subgraphs with respect to an isomorphism type i, we only need to show that the isomorphism counts of induced subgraphs of order t are determined by the isomorphism counts of induced subgraphs of order t+1. Let S be an induced subgraph of order t+1. We denote by deck(S) the deck of S, i.e., the set of induced subgraphs of order t obtained by deleting one vertex of S in every possible way: deck(S) = {S′ | v ∈ V(S), V(S′) = V(S) \ {v}, S′ ⊆ S}. By the results in (Kocay, 1982), counting induced subgraphs and counting all subgraphs have the same power. Note that each induced subgraph on t vertices lies in exactly m-t induced subgraphs on t+1 vertices; that is, every order-t subgraph S′ sits in the decks of exactly m-t order-(t+1) subgraphs. Thus, to determine the number of times an order-t subgraph S′ occurs, we can first count the occurrences of S′ across all order-(t+1) subgraphs and then divide the count by m-t. Therefore, the counts of subgraphs on t+1 vertices determine the counts of subgraphs on t vertices, and we can infer the isomorphism counts of induced subgraphs of order t from those of order t+1. Thus, we have {{ξ^{l-1}_{(u_1,i)}}}_{i∈I_t} = {{ξ^{l-1}_{(u_2,i)}}}_{i∈I_t}. Accordingly, the following must hold: (ς^{l-1}(u_1), {{ξ^{l-1}_{(u_1,i)}}}_{i∈I_t}) = (ς^{l-1}(u_2), {{ξ^{l-1}_{(u_2,i)}}}_{i∈I_t}). Hence, by Equation 1 again, the node colours of u_1 and u_2 under N(t, d)-WL must also be the same at the l-th iteration. This implies that there exists an injective function between the node colours produced by N(t+1, d)-WL and by N(t, d)-WL at any iteration l. Since, for every l = 0, 1, ..., k-1, N(t+1, d)-WL has the same multiset of node colours for G_1 and G_2, N(t, d)-WL must also have the same multiset of node colours for G_1 and G_2, contradicting the assumption that N(t, d)-WL distinguishes them.

Let V(G) and E(G) denote the set of nodes and the set of edges in G, respectively.
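The division-by-(m-t) argument above can be verified directly on a small example. The sketch below uses an assumed example graph (a 5-cycle with one chord) and uses the edge count as a stand-in for the isomorphism type, which is valid for 3-vertex subgraphs; it counts induced 3-vertex paths once directly and once through the decks of all 4-vertex subsets:

```python
from itertools import combinations

def edges_within(adj, nodes):
    """Number of edges of the induced subgraph on `nodes`."""
    return sum(1 for a, b in combinations(nodes, 2) if b in adj[a])

def count_order3(adj, nodes, n_edges):
    """Induced 3-vertex subgraphs on `nodes` with `n_edges` edges; for
    3-vertex graphs the edge count determines the isomorphism type."""
    return sum(1 for trip in combinations(sorted(nodes), 3)
               if edges_within(adj, trip) == n_edges)

# Assumed example: a 5-cycle 0-1-2-3-4 with an extra chord 0-2.
adj = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {0, 3}}
m, t = len(adj), 3

direct = count_order3(adj, adj.keys(), 2)   # induced 3-vertex paths

# Deck-based count: each t-subset lies in exactly m - t many (t+1)-subsets,
# so summing over all decks overcounts each occurrence m - t times.
total = sum(count_order3(adj, quad, 2)
            for quad in combinations(sorted(adj), 4))
assert total == (m - t) * direct            # here: 12 == 2 * 6
```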
In general, there are three cases when constructing such a pair of non-isomorphic graphs (Ĝ_1, Ĝ_2):

• When t = 1, we construct Ĝ_1 to be two cycles of length 2d+1 and Ĝ_2 to be one cycle of length 4d+2.

• When t = 2, we first construct a pair (G_1, G_2) of graphs where G_1 consists of two cycles of length 4 and G_2 consists of one cycle of length 8. Let Ḡ denote the complement of a graph G. Then, we construct Ĝ_r = Ḡ_r to be the complement of G_r for r = 1, 2.

• When t ≥ 3, we first construct a pair (G_1, G_2) of graphs where G_1 consists of two cycles of length ℓ and G_2 consists of one cycle of length 2ℓ, where ℓ is defined as follows: ℓ = (3t-1)/2 or (3t+1)/2 if t is odd; ℓ = 3t/2 if t is even. (6) Then, for each of G_1 and G_2, we add a single vertex that connects to all vertices in the graph. More precisely, given two new vertices v_1 ∉ V(G_1) and v_2 ∉ V(G_2), we define Ĝ_1 and Ĝ_2 by V(Ĝ_r) = V(G_r) ∪ {v_r} and E(Ĝ_r) = E(G_r) ∪ {(v_r, u) | u ∈ V(G_r)} for r = 1, 2.

Figure 6 shows the pair of non-isomorphic graphs for the case of t = 2.

Proof. The proof can be carried out in the same way as the proof of Lemma 1. Specifically, it suffices to show that, for any two non-isomorphic graphs G_1 and G_2, if they can be distinguished by N(t, d)-WL at the l-th iteration, they can also be distinguished by N(t+1, d)-WL at the l-th iteration. By the definition of the N-WL algorithm, the multiset of colours for subgraphs of order t+1 from the d-hop neighbourhood of a node u at the l-th iteration of N(t+1, d)-WL is represented as {{ξ^l_{(u,i,j)}}}_{i∈I_{t+1}, j∈J_d}. This again corresponds to C(m, t+1) induced subgraphs of order t+1, where m is the number of vertices in the d-hop neighbourhood of u and, w.l.o.g., we assume t+1 ≤ m. Now, we need to show that {{ξ^l_{(u,i,j)}}}_{i∈I_{t+1}, j∈J_d} determines {{ξ^l_{(u,i,j)}}}_{i∈I_t, j∈J_d}, where the latter corresponds to the C(m, t) induced subgraphs of order t in the d-hop neighbourhood of u at the l-th iteration of N(t, d)-WL. Different from ξ^l_{(u,i)}, which encodes the count of induced subgraphs with respect to only an isomorphism type i, ξ^l_{(u,i,j)} encodes the count of induced subgraphs with respect to both an isomorphism type i and a positional type j. Accordingly, we need to show that the counts with respect to both types at order t+1 determine those at order t.

Proof. Since S ∈ S^{=t} but S ∉ S^{=t}_c, such an induced subgraph S must have more than one connected component. Each connected component of S corresponds to one connected induced subgraph in S^{≤t}_c. Thus, it is straightforward to show that S = S_1 • S_2 • ... • S_q, where {S_1, S_2, ..., S_q} is the set of connected components of S and {S_1, S_2, ..., S_q} ⊆ S^{≤t}_c is such a set of connected induced subgraphs.

[Figure 6: the construction for t = 2, built from two 4-cycles (G_1) and one 8-cycle (G_2).]

[Figures: the pairs (Ĝ_1, Ĝ_2) with apex vertices v_1 and v_2 for ℓ = 5, 4, 6 (panels (a)-(c)), and the cycle-pair constructions: (a) two (2d+1)-cycles vs one (4d+2)-cycle, (b) two (2d+3)-cycles vs one (4d+6)-cycle, (c) two (2d+2)-cycles vs one (4d+4)-cycle.]

The proof of Theorem 3.7 depends on two lemmas: Lemma 4 and Lemma 5. In the following, we prove these two lemmas first.

Lemma 4. For any t ∈ N and d ∈ N, N_c(t, d)-WL is at least as expressive as N(t, d)-WL in distinguishing non-isomorphic graphs.

Proof. For clarity, we use ς^l_c(u) and ς^l(u) to refer to the colour of a node u under N_c(t, d)-WL and N(t, d)-WL, respectively, throughout this proof. We prove this lemma by contradiction. Assume that there exist two non-isomorphic graphs G_1 and G_2 which can be distinguished by N(t, d)-WL but cannot be distinguished by N_c(t, d)-WL after k iterations. This implies that, for every l-th iteration with l = 0, 1, ..., k-1, N_c(t, d)-WL must have the same multiset of node colours for G_1 and G_2, i.e., {{ς^l_c(u_1)}}_{u_1∈V(G_1)} = {{ς^l_c(u_2)}}_{u_2∈V(G_2)}. Below, we need to prove the following statement:

(A1) For any iteration l, if the colours of any two nodes in G_1 and G_2 are the same under N_c(t, d)-WL, i.e., ς^l_c(u_1) = ς^l_c(u_2), then their node colours under N(t, d)-WL must also be the same, i.e., ς^l(u_1) = ς^l(u_2).

We prove Statement A1 by induction.

• For l = 0, Statement A1 holds since the initial node colours are the same for N_c(t, d)-WL and N(t, d)-WL.

• Assume that Statement A1 holds for l-1. Then, if ς^l_c(u_1) = ς^l_c(u_2), by Equation 4 and the injectivity of HASH(·), we have

(ς^{l-1}_c(u_1), ⋃_{k∈[1,t]} {{ξ^{l-1}_{(u_1,i,j)}}}_{i∈I^c_k, j∈J_d}) = (ς^{l-1}_c(u_2), ⋃_{k∈[1,t]} {{ξ^{l-1}_{(u_2,i,j)}}}_{i∈I^c_k, j∈J_d}).

This leads to the following equation:

⋃_{k∈[1,t]} {{ξ^{l-1}_{(u_1,i,j)}}}_{i∈I^c_k, j∈J_d} = ⋃_{k∈[1,t]} {{ξ^{l-1}_{(u_2,i,j)}}}_{i∈I^c_k, j∈J_d}.
Now, we need to show that, with respect to both isomorphism types and positional types, the counts of connected subgraphs on k vertices for 1 ≤ k ≤ t determine the counts of all subgraphs on t vertices. We first ignore positional types and focus on proving that the counts of connected subgraphs on k vertices for 1 ≤ k ≤ t determine the counts of all subgraphs on t vertices with respect to isomorphism types. The main theorem upon which our proof is based is the Vertex Theorem of Kocay (1982). Let G = (V, E) be a graph, let G[V′] for V′ ⊆ V be the induced subgraph of G whose vertex set is V′ and whose edge set consists of all edges in E with both endpoints in V′, and let c(S, G) = |{V′ ⊆ V(G) : G[V′] ≃ S}| denote the number of induced subgraphs of G that are isomorphic to S, i.e., the isomorphism count of S in G. In the following, we omit the symbol G in c(S, G) unless the graph needs to be specified. We start by introducing Kocay's Vertex Theorem.

Vertex Theorem. Let ⟨i_1, ..., i_b⟩ be a list of all isomorphism types of graphs, and let S_1 and S_2 be any two graphs. Then

c(S_1)c(S_2) = Σ_{1≤m≤b} a_m c(i_m),   (7)

where the coefficient a_m is the number of decompositions of V(i_m) into V_1 ∪ V_2 such that i_m[V_1] ≃ S_1 and i_m[V_2] ≃ S_2, and V_1 and V_2 may overlap.

Below, we show how to use Kocay's Vertex Theorem to prove that, given the counts of connected subgraphs on k vertices with 1 ≤ k ≤ t with respect to their isomorphism types, the counts of subgraphs on t vertices with respect to all isomorphism types can be determined. If an isomorphism type i corresponds to connected subgraphs, i.e., i ∈ I^c_t, the claim is trivial because the count of connected subgraphs of that isomorphism type is already available. Thus, we only need to discuss the case i ∈ (I_t - I^c_t). We prove this by induction:

- For k = 1, there is only one isomorphism type i_node, which is in I^c_k.
Thus, the count of nodes with respect to i_node is already available.

- For k = 2, there are two isomorphism types: one in I^c_k (i_edge) and the other in (I_k - I^c_k) (i_noedge). By the Vertex Theorem, we can determine the count of subgraphs for i_noedge through the following equation: c(i_node)c(i_node) = a_1 c(i_noedge) + a_2 c(i_edge) + a_3 c(i_node). Here, a_1, a_2, and a_3 are the coefficients; following the Vertex Theorem, each a_m with m = 1, 2, 3 represents the number of decompositions of V(i_m), where i_m = i_noedge, i_edge, i_node, respectively, into V_1 ∪ V_2 such that i_m[V_1] ≃ i_node and i_m[V_2] ≃ i_node. Note that V_1 ∩ V_2 ≠ ∅ is possible. Based on this, we have a_1 = 2, a_2 = 2, and a_3 = 1. That is, c(i_node)c(i_node) = 2c(i_noedge) + 2c(i_edge) + c(i_node). Since i_node ∈ I^c_k and i_edge ∈ I^c_k, the values of c(i_node) and c(i_edge) are already available. Hence, we can calculate the value of c(i_noedge) from the above equation.

- Assume that the claim holds for k = t-1; we now prove the case k = t. For each isomorphism type i ∈ (I_t - I^c_t), we apply Equation 7 of the Vertex Theorem as follows:

  * The LHS of Equation 7 corresponds to two subgraphs S_1 and S_2 of i which satisfy V(S_1) ≠ ∅, V(S_2) ≠ ∅, and S_1 • S_2 = i, i.e., i is the union of two node-disjoint subgraphs S_1 and S_2;

  * The RHS of Equation 7 corresponds to the set of all isomorphism types in ⋃_{1≤k≤t} I_k, including the isomorphism type i itself.

Note that the order in which the isomorphism types in (I_t - I^c_t) are considered is non-trivial. Specifically, given two isomorphism types i and i′ from I_t with V(i) = V(i′), we say that i subsumes i′, denoted i ⊒ i′, if and only if there exists at least one decomposition of V(i) into V′_1, ..., V′_q such that ⋃_{1≤b≤q} V′_b = V(i) and {i[V′_1], ..., i[V′_q]} ≃ {S′_1, S′_2, ...
, S′_q}, where {S′_1, S′_2, ..., S′_q} is the set of all connected components of i′. For any two isomorphism types from (I_t - I^c_t), one should be calculated before the other if the former subsumes the latter. Since all isomorphism types in I_t have the same number of vertices (t vertices) and are pairwise distinct, we have the following properties:

P1. (I_t, ⊒) is a poset, i.e., ⊒ is a partial order on the isomorphism types in I_t.

P2. The coefficients for isomorphism types in I_t - {i} are zero if they are subsumed by i.

According to these two properties, given any isomorphism type i ∈ (I_t - I^c_t), the values c(i′) for isomorphism types i′ ∈ (I_t - I^c_t) with i′ ⊒ i are already known, while the values c(i″) for isomorphism types i″ ∈ (I_t - I^c_t) with i ⊒ i″ are unknown but also not needed, since their corresponding coefficients are zero. Moreover, by the assumption that the case k = t-1 holds, the values c(i‴) for isomorphism types i‴ ∈ I_k with 1 ≤ k < t are known, including c(S_1) and c(S_2). Hence, the value of c(i) can be determined by applying Equation 7 of the Vertex Theorem as described above. Since positional types provide additional information to distinguish subgraphs, we may view them as an additional feature of subgraphs and handle them as an extension of isomorphism types, so as to prove that the counts of connected subgraphs on k vertices for 1 ≤ k ≤ t determine the counts of all subgraphs on t vertices with respect to both isomorphism types and positional types. This thus leads to the following:

{{ξ^{l-1}_{(u_1,i,j)}}}_{i∈I_t, j∈J_d} = {{ξ^{l-1}_{(u_2,i,j)}}}_{i∈I_t, j∈J_d}.

Since we assume that Statement A1 holds for the (l-1)-th iteration, i.e., if ς^{l-1}_c(u_1) = ς^{l-1}_c(u_2) then ς^{l-1}(u_1) = ς^{l-1}(u_2), we further have:

(ς^{l-1}(u_1), {{(ξ^{l-1}_{(u_1,i,j)}, i, j)}}_{i∈I_t, j∈J_d}) = (ς^{l-1}(u_2), {{(ξ^{l-1}_{(u_2,i,j)}, i, j)}}_{i∈I_t, j∈J_d}).
Hence, from the above equation and by the definition of node colouring in Equation 3, we know that ς^l(u_1) = ς^l(u_2). Statement A1 thus holds for the l-th iteration.

[Figure 9: Isomorphism types for graphs of t vertices, where 1 ≤ t ≤ 4; I^c_t refers to the set of isomorphism types of connected subgraphs of order less than or equal to t, and I_t - I^c_t refers to the set of isomorphism types of disconnected subgraphs (i.e., containing at least two connected components) of order less than or equal to t.]

According to Statement A1, there must exist an injective function f such that ς^l(u) = f(ς^l_c(u)) for any vertex in G_1 and G_2. Then, since for every l-th iteration with l = 0, 1, ..., k-1, N_c(t, d)-WL has the same multiset of node colours for G_1 and G_2, i.e., {{ς^l_c(u_1)}}_{u_1∈V(G_1)} = {{ς^l_c(u_2)}}_{u_2∈V(G_2)}, N(t, d)-WL must also have the same multiset of node colours for G_1 and G_2, i.e., {{f(ς^l_c(u_1))}}_{u_1∈V(G_1)} = {{f(ς^l_c(u_2))}}_{u_2∈V(G_2)}. This means that N(t, d)-WL cannot distinguish G_1 and G_2 after k iterations, which contradicts the assumption.

a_13 c( ) + a_14 c( )

Following the Vertex Theorem, we have the following coefficients for Equation 11:

a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9  a_10  a_11  a_12  a_13  a_14
 0    1    0    2    3    1    3    2    0    1     0     2     2     0

In the RHS of Equation 11, the isomorphism types in the first, third, and fifth lines belong to I^c_k for k = 4, 3, 2, respectively. Hence, the counts of subgraphs with respect to these isomorphism types are already available. The isomorphism types in the second line belong to I_k - I^c_k for k = 4. By property P1, we know that the counts of subgraphs with respect to these isomorphism types have been calculated before calculating the count with respect to the present type. The isomorphism types in the fourth line belong to I_k - I^c_k for k = 3, and by the induction, the count of subgraphs with respect to that isomorphism type is available.
Therefore, only the value of c( ) is unknown in the RHS of Equation 11. On the other hand, the values of c( ) and c( ) are both known because and belong to I_k for k = 2. Putting things together, we can determine the value of c( ) using Equation 11. In the following, we illustrate how to apply Equation 10 and Equation 11 in the above steps to calculate c( ) for two graphs: and . (1) Calculating c( ) for : We detail how c( ) can be derived for the graph . We know c( ) = 5 and c( ) = 6 for this graph. By applying Equation 10 in Step 1, we have 5 × 5 = 2c( ) + 2 × 6 + 5; hence, we obtain c( ) = 4. By applying Equation 11, we thus have

the following, which we now apply:

6 × 4 = 0 + 0 + 0 + 2 × 2 + 0 + 1 × 2 + 0 + 0 + 0 + c( ) + 0 + 2 × 6 + 2 × 3 + 0.

Solving this gives c( ) = 0 for the graph . (2) Calculating c( ) for : We show another example of calculating c( ) for the graph . By applying Equation 11, we have

4 × 6 = 0 + 0 + 0 + 0 + 3 × 1 + 1 × 2 + 0 + 2 × 1 + 0 + c( ) + 0 + 2 × 4 + 2 × 4 + 0.

Solving this gives c( ) = 1 for the graph .

Remark C.1. It is worth noting that, when calculating c(i) for an isomorphism type i ∈ (I_t - I^c_t), the way of applying Equation 7 of the Vertex Theorem may not be unique. Taking the isomorphism type c( ) as an example, we can calculate it in two different ways: (1) As illustrated in Example 1, we may first calculate c( ) by treating S_1 = and S_2 = , and then calculate c( ) by treating S_1 = and S_2 = . (2) Alternatively, we may first calculate c( ) by treating S_1 = and S_2 = , and then calculate c( ) by treating S_1 = and S_2 = .

Lemma 5. For any t ∈ N and d ∈ N, N(t, d)-WL is at least as expressive as N_c(t, d)-WL in distinguishing non-isomorphic graphs.

Proof. The proof follows a similar structure to the previous lemmas. Specifically, we assume that there exist two non-isomorphic graphs G_1 and G_2 which can be distinguished by N_c(t, d)-WL but cannot be distinguished by N(t, d)-WL after k iterations. This implies that, for every l-th iteration with l = 0, 1, ..., k-1, N(t, d)-WL must have the same multiset of node colours for G_1 and G_2, i.e., {{ς^l(u_1)}}_{u_1∈V(G_1)} = {{ς^l(u_2)}}_{u_2∈V(G_2)}. We need to prove the following statement:

(A2) For any iteration l, if the colours of any two nodes in G_1 and G_2 are the same under N(t, d)-WL, i.e., ς^l(u_1) = ς^l(u_2), then their node colours under N_c(t, d)-WL must also be the same, i.e., ς^l_c(u_1) = ς^l_c(u_2).

We prove Statement A2 by induction:

• For l = 0, Statement A2 holds since the initial node colours are the same for N_c(t, d)-WL and N(t, d)-WL.

• Assume that Statement A2 holds for l-1.
Then if ς^l(u_1) = ς^l(u_2), by Equation 3 we have (ς^{l-1}(u_1), {(ξ^{l-1}_{(u_1,i,j)}, i, j)}_{i∈I_t, j∈J_d}) = (ς^{l-1}(u_2), {(ξ^{l-1}_{(u_2,i,j)}, i, j)}_{i∈I_t, j∈J_d}). This gives us {ξ^{l-1}_{(u_1,i,j)}}_{i∈I_t, j∈J_d} = {ξ^{l-1}_{(u_2,i,j)}}_{i∈I_t, j∈J_d}. Then, we need to further prove that, with respect to both isomorphism types and positional types, the counts of all subgraphs of t vertices can determine the counts of connected subgraphs of k vertices for 1 ≤ k ≤ t. The proof may follow a counting argument similar to Lemma 1, by noticing that each subgraph of t vertices uniquely determines the set of all subgraphs of k vertices contained within it for 1 ≤ k ≤ t.

At the k-th iteration, if N(t, d)-WL distinguishes G_1 and G_2 as non-isomorphic, then G_1 and G_2 differ in {{ς^k(u) | u ∈ V(G_1)}} and {{ς^k(u) | u ∈ V(G_2)}}. Then, G_1 and G_2 must also have different multisets of G3N node embeddings at the k-th iteration, due to the injectivity of f. The proof is done.

D G3N MODEL ARCHITECTURE

Here we describe the specific G3N layer in more detail. The instance of Equation 5 we employ is as follows:

h_u^(l+1) = σ( h_u^(l) W^(l) + Σ_{(i,j) ∈ I_t × J_d} α^(l)_(i,j) Σ_{S ∈ S_u^(l)(i,j)} σ( Σ_{v ∈ S} h_v^(l) W^(l)_(i,j) ) ),   (14)

where h_u^(l) ∈ R^{d_1} and h_u^(l+1) ∈ R^{d_2} are the learned node embeddings at each layer and, omitting sub- and superscripts, W ∈ R^{d_1 × d_2} are learnable linear transformations and α are learnable scalars corresponding to the different isomorphism and positional types of subgraphs. Furthermore, σ is any nonlinear activation function such as sigmoid or ReLU.

Matching the notation of Equation 5, the pooling function for subgraphs can be any permutation-invariant pooling function. Here, given a subgraph S ∈ S_u^(l)(i, j), we have POOL(S) = σ( Σ_{v ∈ S} h_v^(l) W^(l)_(i,j) ). In practice, we may also consider a component-wise product over the node embeddings in a subgraph: POOL(S) = Π_{v ∈ S} σ( h_v^(l) W_(i,j) ). In this case, σ has to be a bounded activation function such as sigmoid in order to ensure numerical stability.

The inner aggregation step AGG_T simply sums up all the subgraph embeddings of the same isomorphism and positional type, while the outer aggregation step AGG_N applies a weighted sum with learnable weights over the aggregated groups of subgraph embeddings of the same isomorphism and positional types. To match the expressivity of N-WL, one may replace the outer activation function σ with an MLP and replace W^(l) with 1 + ε^(l), where ε^(l) is a learnable scalar parameter, as in GIN (Xu et al., 2019). However, we found that using different learnable linear transformations for all possible isomorphism and positional types leads to much slower training.
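As a concrete illustration of the layer update in Equation 14, the following is a minimal NumPy sketch. It is not the released G3N implementation: the names (`g3n_layer`, `subgraph_index`) are hypothetical, ReLU is fixed as the activation, and the subgraph index is assumed to be precomputed as a per-node dictionary keyed by (isomorphism type, positional type).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def g3n_layer(h, subgraph_index, W_self, W_type, alpha):
    """One hypothetical G3N layer in the spirit of Equation 14.

    h:              (n, d1) array of node embeddings
    subgraph_index: dict u -> dict (i, j) -> list of subgraphs (node-id lists)
    W_self:         (d1, d2) transform for the node's own embedding
    W_type:         dict (i, j) -> (d1, d2) transform per subgraph type
    alpha:          dict (i, j) -> scalar weight per subgraph type
    """
    n = h.shape[0]
    out = np.zeros((n, W_self.shape[1]))
    for u in range(n):
        acc = h[u] @ W_self  # self term h_u W
        for (i, j), subgraphs in subgraph_index.get(u, {}).items():
            # POOL(S) = sigma(sum_{v in S} h_v W_(i,j)); AGG_T sums the group
            pooled = sum(relu(h[list(S)].sum(axis=0) @ W_type[(i, j)])
                         for S in subgraphs)
            # AGG_N: weighted sum over (iso, pos) type groups
            acc = acc + alpha[(i, j)] * pooled
        out[u] = relu(acc)  # outer nonlinearity
    return out
```

Stacking such layers and finishing with a sum readout would mirror the overall architecture, with the weights learned end to end rather than supplied as here.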
For example, with |J_d| positional types and |I_t| isomorphism types, this leads to |I_t| × |J_d| possible combinations of linear transformations, resulting in longer training and a higher generalisation gap due to the large number of parameters. To remedy this problem, we instead use different weight matrices for positional types only, and inject the isomorphism type information as additional features into the node embeddings. That is, we change Equation 14 to Equation 16 below:

h_u^(l+1) = σ( h_u^(l) W^(l) + Σ_{(i,j) ∈ I_t × J_d} α^(l)_j Σ_{S ∈ S_u^(l)(i,j)} σ( Σ_{v ∈ S} (h_v^(l) || e_i) W^(l)_j ) ),   (16)

where e_i denotes a vector with all entries zero except a one at the i-th component. Thus, we have a different pooling function for subgraph embeddings but keep all other components the same: POOL(S) = σ( Σ_{v ∈ S} (h_v^(l) || e_i) W^(l)_j ).

In the implementation, we apply a pre-processing step that computes the indices of all induced neighbourhood subgraphs with respect to their isomorphism and positional types and stores them; these indices are then used by the G3N layers described in Equation 5. This pre-processing step only needs to be performed once per dataset. Isomorphism types can be computed for arbitrary t-order subgraphs using McKay's nauty algorithm (McKay, 1981; McKay & Piperno, 2014). When t ≤ 3, for efficiency, we determine isomorphism types simply by counting the edges and nodes of subgraphs. In order to preserve structural information, initial node colours are treated as identical in the computation of isomorphism types. For positional types, we implement f_pos as follows. Let S be an induced subgraph in the d-hop neighbourhood of a node u. Then f_pos(S) is the multiset of shortest-path distances of the nodes in S to the node u, i.e., f_pos(S) = {ρ(u, v) | v ∈ V(S)}, where ρ(u, v) is the shortest-path distance between nodes u and v.
For example, if S has three nodes V(S) = {v_1, v_2, v_3} with ρ(u, v_1) = 1, ρ(u, v_2) = 1 and ρ(u, v_3) = 2, then the positional type of S is f_pos(S) = {1, 1, 2}.

Remark D.1. In our work, the definition of positional types for induced subgraphs (via the function f_pos) is intended to characterise the two conditions that are minimally required to establish the proposed strong hierarchy: (1) permutation invariance; (2) the condition specified by Equation 2. Thus, in general, there are many different ways to instantiate f_pos in an implementation, as long as it is permutation invariant and satisfies the condition specified in Equation 2.

There are two model variants in our G3N implementation: one aggregates all connected-hereditary subgraphs of order up to t, corresponding to the node colouring function defined by Equation 4 as discussed in Section 3.2; the other aggregates all t-order subgraphs regardless of connectivity, corresponding to the node colouring function defined by Equation 3. In our experiments, we consider the former as the default model, unless otherwise stated.
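The pre-processing step described above can be sketched with the standard library. This is an illustrative sketch, not the released code: it uses the node/edge-count surrogate for isomorphism types that the text describes for t ≤ 3, takes f_pos(S) as the sorted multiset of distances to u, and assumes the d-hop neighbourhood includes u itself (a choice the implementation is free to make).

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, u, d):
    """Shortest-path distances from u to all nodes within d hops."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def neighbourhood_subgraph_index(adj, u, t, d):
    """Group the induced t-order subgraphs in the d-hop neighbourhood of u
    by (isomorphism type, positional type).

    For t <= 3 the isomorphism type of a subgraph is determined by its node
    and edge counts; the positional type is the sorted multiset of
    shortest-path distances of its nodes to u.
    """
    dist = bfs_distances(adj, u, d)
    index = {}
    for S in combinations(sorted(dist), t):
        edges = sum(1 for a, b in combinations(S, 2) if b in adj[a])
        iso = (t, edges)                          # surrogate f_iso, t <= 3
        pos = tuple(sorted(dist[v] for v in S))   # f_pos(S)
        index.setdefault((iso, pos), []).append(S)
    return index
```

Running this once per node and caching the resulting indices corresponds to the one-off pre-processing step per dataset.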

E EXPERIMENTAL DETAILS AND RESULTS

In the following, we provide further information about the datasets, baseline methods, and parameter selection considered in our experiments, as well as additional experimental results.

E.1 DATASETS

Table 7 summarises the dataset tasks and the statistics of the datasets used in our experiments. For runtime analysis, models use layers and a hidden dimension adhering to a 30k parameter budget; for memory analysis, a hidden dimension of 64 is used. Furthermore, we run experiments on two datasets with different settings and graph densities, ZINC and IMDB-B: ZINC consists of molecular graphs with average graph density 0.195 and average diameter 12.47, which are sparser than the IMDB-B social networks with average graph density 0.520, average diameter 1.86, and maximum diameter 2. We further compare against GNN models with 3-WL expressivity: GNNML3 (Balcilar et al., 2021) and PPGN (Maron et al., 2019a). The implementations are taken from Balcilar et al. (2021). The experiments for this section were run on an RTX 3090 GPU.

Observations. Viewing the results in Table 11, we notice that for sparse graphs the order of magnitude of the runtime stays the same even when d and t are increased up to 3. At the other extreme, when graphs are small and dense, runtime increases at a greater rate as d and t increase, given that the number of connected components in a neighbourhood grows large, as can be seen with d = 2 on the IMDB-B dataset, which results in global aggregation. For t = 3, this causes memory issues from computing all 3-order subgraph indices. Note that the runtime for (t, d) = (2, 1) is greater than for (t, d) = (3, 1). This is because all 3-order subgraphs in 1-hop neighbourhoods have the same positional type, whereas there are more positional types of 2-order subgraphs in the given datasets.
We further note that runtime performance is comparable with the higher-order GNN methods, with the exception of the IMDB-B dataset, where our model performs aggregation over whole graphs.

Observations. From these figures, we observe the following:

• When d = 1, for graphs with relatively large diameters such as ZINC, MolHIV, and NCI1, as shown in Figures 10, 11, and 12, our N-WL algorithm is several orders of magnitude faster than the k-WL algorithm. The performance gap between k-WL and N-WL becomes more significant as t increases and becomes stable after t = 4. This is because the number of 1-hop neighbours a_1 is much smaller than the total number of nodes n in a graph (i.e., a_1 ≪ n). For graphs with very small diameters such as IMDB-B, almost all nodes are within 2 hops of any node; therefore, the number of 1-hop neighbours a_1 is relatively close to the number of nodes n in such graphs. In this case, as depicted in Figure 13, the performance gap between k-WL and N-WL becomes small.

• When d = 2, for graphs with relatively large diameters such as ZINC, MolHIV, and NCI1, as shown in Figures 10, 11, and 12, N-WL still shows a significant performance improvement over k-WL and the performance gap starts to increase after t = 4. The increase in the complexity of N-WL for t ∈ {1, 2, 3} is due to the effect of including 2-hop neighbours along with 1-hop neighbours; after t = 4, the 2-hop neighbourhood size a_2 and the parameter t start to approach each other, which results in an increased performance gap. For the IMDB-B dataset, which has a very small diameter, as shown in Figure 13, almost all the nodes in the graphs are within 2 hops (i.e., a_2 ≈ n), and the time complexity of our algorithm increases for initial values of t and then stabilises or decreases relative to k-WL.

• When d = 3, the value of a_d gradually approaches the total number of nodes in a graph.
In such cases, N-WL remains significantly faster on graphs with relatively large diameters, as shown in Figures 10, 11, and 12, and also outperforms k-WL on graphs with small diameters, as shown in Figure 13. The performance gap still grows with increasing values of the parameter t. Overall, due to the local nature of the N-WL algorithm, in contrast to the global nature of the k-WL algorithm, the N-WL algorithm is much more efficient than the k-WL algorithm on graphs with reasonably large diameters, particularly if the diameters are greater than d. The performance gap between the N-WL algorithm and the k-WL algorithm is reduced when d increases or when graphs have very small diameters, e.g., IMDB-B has an average diameter of 1.9.

Remark E.1. The complexity of N-WL can be reduced when the gap between t and a_d gets smaller. This is because the number of induced subgraphs, C(a_d, t), may decrease in this case. For instance, we can see from Figure 10 that when d = 1 and t = 4, the complexity is lower than at d = 1 and t = 3; it then stabilises for d = 1 and t ≥ 5. Similar cases can be observed in Figure 11 and Figure 12.

F LIMITATIONS AND FUTURE WORK

In this work, the design of G3N still has some limitations. Firstly, as with most expressive GNN models going beyond 1-WL, G3N suffers from inevitably increased runtime per training step and parameter complexity. Although feasible for sparse molecular graphs, this may still pose an issue on much larger datasets. To solve this issue, one may have to consider sampling methods. Another issue is the lack of a clear explanation for the gap between expressivity and generalisability. Finally, although G3N is able to detect graph substructures, in contrast to methods which precompute and inject them as additional features for learning, the parameters d and t are still domain dependent and must be chosen by hand to balance computational resources and performance. Due to these limitations, one interesting future research direction is to explore the underlying design principles of expressive GNN models for a better understanding of the interaction between the expressivity, generalisability, explainability, and training efficiency of GNN models. From the algorithmic perspective, for both k-WL and N-WL, a possible future research direction is to analyse their connections to node-level, link-level, or, more generally, subgraph-level properties, and accordingly build connections between subgraph properties and the expressivity of GNN models. In the literature on the k-WL algorithms, despite their fruitful connections to logic and descriptive complexity theory, there is no clear understanding of characterisations of subgraph patterns whose counts and occurrence are higher-order k-WL invariant (Kiefer, 2020). In fact, traditionally, the k-WL algorithm is mainly used as a combinatorial tool in graph isomorphism tests, and little work has been done to study the applicability of k-WL to the recognition of graph properties rather than to testing isomorphism (Fuhlbrück et al., 2021).
Further, the formal properties of the N-WL algorithms should be studied by exploring connections between the N-WL hierarchy and logic.



available at http://users.cecs.anu.edu.au/~bdm/data/graphs.html



Figure 1: Indistinguishable pairs of simple graphs of eight vertices by 1-WL, which are distinguishable by N -WL under different d and t values.

Figure 3: Pairs of nonisomorphic (sub)graphs.

Figure 4: An overview of a G3N layer: (a) t-order subgraphs are extracted from a node's d-hop neighbourhood. (b) The subgraphs are grouped by their positional and isomorphism types. (c) The subgraphs are embedded by a pooling function POOL. (d) The subgraph embeddings are aggregated within their own topological groups by AGG T . (e) The resulting embedding vectors are further aggregated and combined with AGG N and COMBINE to form an updated node embedding.

Figure 5: Two pairs of non-isomorphic simple graphs of eight vertices: (a) G ′ 1 and G ′ 2 cannot be distinguished by N (t, d)-WL unless t ≥ 3, regardless of how large d is; (b) G 1 and G 2 cannot be distinguished by N (t, d)-WL unless d ≥ 2, regardless of how large t is.

applied individualisation and refinement (Dupty et al., 2022) to improve the expressive power of GNNs. Specifically, You et al. (2021) proposed ID-GNNs, which add identity information for each rooted node during message passing. Vignac et al. (2020) manipulated node identifiers in a permutation-invariant way to learn a local context around each node. Murphy et al. (2019) introduced RP-GNN, which runs a permutation-sensitive GNN by adding node identifiers and then sums over all permutations of node identifiers to obtain permutation-invariant representations. Dupty et al. (2022) individualised nodes using individualisation and refinement for canonical colouring. These GNNs are permutation invariant in expectation.

Figure 8(a) demonstrates the aforementioned pairs of non-isomorphic graphs for the case of t = 1. Figure 6 shows the pair of non-isomorphic graphs for the case of t = 2. Figures 7(a)-(c) show the pairs of non-isomorphic graphs for the cases of t = 3 and t = 4.

The proof is done.



Figure 6: (Left) Two graphs G 1 and G 2 where G 1 consists of two 4-cycles and G 2 is one 8-cycle; (Right) Two graphs Ĝ1 and Ĝ2 that can be distinguished by N -(t + 1, d)-WL but cannot be distinguished by N -(t, d)-WL for t = 2 and d ≥ 1, where Ĝ1 and Ĝ2 are the complements of G 1 and G 2 , respectively.
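The pair on the left of Figure 6 reflects a classic fact: 1-WL cannot distinguish a disjoint union of two cycles from a single cycle of twice the length, since both graphs are 2-regular. A small colour-refinement sketch (illustrative only, not from the paper's codebase) makes this easy to check:

```python
def wl1(adj, rounds=10):
    """1-WL colour refinement; returns the sorted multiset of node colours."""
    colours = {v: 0 for v in adj}
    for _ in range(rounds):
        # signature = own colour plus the multiset of neighbour colours
        sig = {v: (colours[v], tuple(sorted(colours[w] for w in adj[v])))
               for v in adj}
        palette = {s: k for k, s in enumerate(sorted(set(sig.values())))}
        colours = {v: palette[sig[v]] for v in adj}
    return sorted(colours.values())

def cycles(lengths):
    """Adjacency dict of a disjoint union of cycles of the given lengths."""
    adj, base = {}, 0
    for length in lengths:
        for k in range(length):
            adj[base + k] = {base + (k - 1) % length, base + (k + 1) % length}
        base += length
    return adj

# Two 4-cycles vs. one 8-cycle (Figure 6, left): identical colour multisets.
```

Since every node in a 2-regular graph keeps the same colour in every round, the final multisets coincide, which is why distinguishing such pairs requires a test beyond 1-WL.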



Figure 7: Three pairs of non-isomorphic graphs ( Ĝ1 , Ĝ2 ) that can be distinguished by N -(t + 1, d)-WL but cannot be distinguished by N -(t, d)-WL: (a) t = 3 and ℓ = 4, (b) t = 3 and ℓ = 5, and (c) t = 4 and ℓ = 6, where d ≥ 1.

Figure 8: Three families of pairs of graphs that can be distinguished by N -(t, d + 1)-WL but cannot be distinguished by N -(t, d)-WL: (a) t = 1 and d ≥ 1, (b) t = 2 and d ≥ 1, and (c) t ≥ 3 and d ≥ 1.

Figure 9 depicts all isomorphism types for graphs with up to 4 vertices. Below, we illustrate how to calculate c( ) on graphs by applying the Vertex Theorem, where is an isomorphism type in (I_t - I_t^c) for t = 4.

Step 1: We calculate c( ) using the following equation, where c( ) and c( ) are known because both and are isomorphism types in ∪_{1≤k≤2} I_k^c:

c( ) · c( ) ≡ 2c( ) + 2c( ) + c( ).   (10)

Step 2: We calculate c( ) by decomposing into two subgraphs S_1 = and S_2 = . Since the coefficients corresponding to the following isomorphism types are zero, we omit those isomorphism types from the equation. We then have the equation below:

c( ) · c( ) ≡ a_1 c( ) + a_2 c( ) + a_3 c( ) + a_4 c( ) + a_5 c( ) + a_6 c( ) + a_7 c( ) + a_8 c( ) + a_9 c( ) + a_10 c(

calculated c( ) = c( ) = 0 and c( ) = 2. Further, we know the counts of the isomorphism types that belong to I_k^c for k = 4, 3, 2, i.e., we have c( ) = c( ) = 0, c( ) = c( ) = 1, c( ) = c( ) = 2 and c( ) = c( ) = 6. By induction, we have also calculated c( ) = 3 at k = 3.

We know c( ) = 5 and c( ) = 4 for the graph . By applying Equation 10 in Step 1, we have

5 × 5 ≡ 2c( ) + 2 × 4 + 5.   (13)

Hence, we know c( ) = 6. Now we apply Step 2. Because , and all subsume , we have calculated c( ) = c( ) = 0 and c( ) = 1. Further, we know the counts of the isomorphism types that belong to I_k^c for k = 4, 3, 2, i.e., we have c( ) = c( ) = c( ) = c( ) = 0, c( ) = 1, c( ) = 2 and c( ) = c( ) = 4. By induction, we have also calculated c( ) = 4 at k = 3.
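Although the isomorphism-type glyphs above do not survive in text form, the quantities c(i) themselves are straightforward to compute by brute force for small t. The following sketch (illustrative only; it assumes subgraphs small enough that minimising over all vertex relabellings is feasible, as for t ≤ 4) counts induced subgraphs of order t per isomorphism type:

```python
from itertools import combinations, permutations

def canon(nodes, edges):
    """Canonical form of a small graph, obtained by minimising its edge list
    over all vertex relabellings (feasible for subgraphs of <= 4 vertices)."""
    nodes = sorted(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = {v: perm[k] for k, v in enumerate(nodes)}
        e = tuple(sorted(tuple(sorted((relabel[a], relabel[b])))
                         for a, b in edges))
        key = (len(nodes), e)
        if best is None or key < best:
            best = key
    return best

def induced_subgraph_counts(adj, t):
    """Count the induced subgraphs of order t per isomorphism type, i.e. the
    quantities c(i) appearing in the worked examples."""
    counts = {}
    for S in combinations(sorted(adj), t):
        edges = [(a, b) for a, b in combinations(S, 2) if b in adj[a]]
        key = canon(S, edges)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For larger t, the brute-force canonical form would be replaced by a proper canonical labelling tool such as nauty, as noted in Appendix D.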

Figure 10: ZINC

Figure 11: MolHIV

Figure 12: NCI1

Graph isomorphism tasks on EXP, SR25, graph8c and CSL are evaluated by counting the pairs of graphs which are indistinguishable. Substructure counting tasks are performed on the RandomGraph dataset and evaluated by MSE.
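As described in the parameter setup (Appendix E.3), "indistinguishable" is operationalised as an L1 distance below 0.001 between length-10 graph representations. A minimal sketch of that counting criterion (the function name is hypothetical):

```python
import numpy as np

def indistinguishable_pairs(reps, tol=1e-3):
    """Count pairs of graph representations whose L1 distance is below tol,
    mirroring the similarity criterion used on EXP, SR25, graph8c, and CSL."""
    m = len(reps)
    return sum(1 for a in range(m) for b in range(a + 1, m)
               if np.abs(reps[a] - reps[b]).sum() < tol)
```

In the benchmark protocol this count is taken over multiple randomly initialised runs, with a pair flagged as similar if any run fails to separate it.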

TU datasets evaluated by classification accuracy (%). The first 3 rows are kernel methods and the remaining are GNN models. The first and second best results are highlighted and underlined.

Experiments on ZINC and MolHIV, where the performance of the methods marked by * (i.e., DEEP LRP, GSN, and CIN) is based on pre-selected subgraph patterns.(a) ZINC evaluated by MAE.

Expressive GNNs Beyond 1-WL Test . . . . . . . . . . . . . . . . . . . . . . . . B.2 Connections to k-WL and its Variants . . . . . . . . . . . . . . . . . . . . . . . . Proofs for Weak Hierarchy (Theorem 3.1) . . . . . . . . . . . . . . . . . . . . . . C.2 Proofs for Strong Hierarchy (Theorem 3.2 and Theorem 3.3) . . . . . . . . . . . . C.3 Proofs for Connected-Hereditary Subgraphs (Theorem 3.7) . . . . . . . . . . . . . C.4 Proofs for Connections to k-WL Hierarchy (Theorem 3.8) . . . . . . . . . . . . . C.5 Proofs for Graph Neighbourhood Neural Network (Theorem 4.1) . . . . . . . . . .

Comparison of k-WL and its variants δ-k-LWL (Morris et al., 2020b), (k, s)-LWL (Morris et al., 2022), and (k, c)(≤)-SETWL (Zhao et al., 2022a) with our N(t, d)-WL and N_c(t, d)-WL, where #Coloured objects and #Neighbour objects refer to the number of coloured objects in a graph and the number of neighbour objects for each coloured object, respectively; ∆Coloured objects and ∆Neighbour objects refer to the type of coloured objects and the type of neighbour objects, respectively; n is the number of nodes and a is the average node degree in a graph; a_d is the average number of nodes in the d-hop neighbourhood of a node. Note that a_d ≪ n for graphs whose diameters are considerably greater than d.
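To make the scaling gap in this comparison concrete, the number of neighbour objects per coloured object for N(t, d)-WL is at most C(a_d, t), against the n^k tuples coloured by k-WL. A back-of-the-envelope sketch with hypothetical values (n = 1000, a_d = 20, t = k = 3; illustrative only):

```python
from math import comb

# Hypothetical sparse graph: n nodes, average d-hop neighbourhood size a_d.
# N(t, d)-WL colours n node objects, each aggregating at most C(a_d, t)
# neighbourhood subgraphs, whereas k-WL colours n**k tuples globally.
n, a_d = 1000, 20      # a_d << n, as for large-diameter graphs
t = k = 3
nwl_neighbour_objects = n * comb(a_d, t)   # 1000 * C(20, 3)
kwl_coloured_objects = n ** k              # 1000**3
```

Under these assumptions the local algorithm touches roughly three orders of magnitude fewer objects, consistent with the a_d ≪ n regime noted in the caption.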

Since for any l-th iteration with l = 0, 1, . . . , k-1, N-(t+1, d)-WL has the same multiset of node colours for G 1 and G 2 , N-(t, d)-WL must also have the same multiset of node colours for G 1 and G 2 . This means that N-(t, d)-WL cannot distinguish G 1 and G 2 after k iterations, which contradicts the assumption. The proof is done.

Theorem 3.1. For any fixed d ∈ N, N-(t+1, d)-WL is strictly more expressive than N-(t, d)-WL in distinguishing non-isomorphic graphs, where t ≥ 1.

Proof. By Lemma 1, we know that N-(t+1, d)-WL is at least as expressive as N-(t, d)-WL. Thus, we just need to show that, for any fixed but arbitrary d ∈ N, there exists at least one pair of non-isomorphic graphs ( Ĝ1 , Ĝ2 ) that cannot be distinguished by N-(t, d)-WL but can be distinguished by N-(t+1, d)-WL.

Dataset statistics.

Runtime measured by seconds per epoch on ZINC (molecular dataset) and IMDB-B (social network dataset) with a parameter budget of approximately 30k. A message passing neural network (MPNN) is modelled by G3N with d = t = 1. The IMDB-B graphs have maximum diameter 2, meaning that d = 2, 3 use the same resources. OOM represents out of memory.

Runtime and number of parameters for G3N on ZINC (molecular dataset) and IMDB-B (social network dataset) with 4 layers and a hidden dimension of 64. The IMDB-B graphs have maximum diameter 2, meaning that d = 2, 3 use the same resources. OOM represents out of memory.

ACKNOWLEDGMENTS

We are thankful to Professor Brendan McKay for helpful discussions and feedback. We gratefully acknowledge the support of the Australian Research Council under Discovery Project DP210102273. We would also like to thank the anonymous reviewers for their comments and suggestions, which helped improve the quality of the paper.

REPRODUCIBILITY STATEMENT

To ensure reproducibility of the theoretical and empirical results included in this work, we have made the following efforts: (1) For theoretical results, the complete proofs for lemmas and theorems are provided in Appendix C; (2) Details of the model architecture and the implementation are included in Appendix D; (3) Experimental details, including datasets, baseline methods, hyperparameter selection, and additional experimental results, are provided in Appendix E; (4) The code is available at https://github.com/seanli3/G3N.


The counts of induced subgraphs of order t with respect to both isomorphism types and positional types are determined by the counts of induced subgraphs of order t+1 with respect to both isomorphism types and positional types. The fact that each subgraph of t vertices lies in m-t subgraphs of t+1 vertices remains. To take both isomorphism types and positional types into account, when determining the number of times a subgraph S′ of t vertices occurs in a graph, we can first count the occurrences of S′ in all subgraphs of t+1 vertices with respect to both isomorphism and positional types, and then divide the count by (m-t). Thus, the counts of induced subgraphs of order t+1 imply the counts of induced subgraphs of order t with respect to both isomorphism types and positional types.

Theorem 3.2. For any fixed d ∈ N, N(t+1, d)-WL is strictly more expressive than N(t, d)-WL in distinguishing non-isomorphic graphs, where t ≥ 1.

Proof. By Lemma 2, we know that N(t+1, d)-WL is at least as expressive as N(t, d)-WL in distinguishing non-isomorphic graphs. For the strictness of expressivity between N(t+1, d)-WL and N(t, d)-WL, we can see that the pairs of non-isomorphic graphs ( Ĝ1 , Ĝ2 ) that cannot be distinguished by N-(t, d)-WL but can be distinguished by N-(t+1, d)-WL in the proof of Theorem 3.1 still hold when taking positional types of induced subgraphs into account. Thus, we can prove this strictness in a similar way to the case where N-(t+1, d)-WL is strictly more expressive than N-(t, d)-WL, i.e., based on the same pairs of non-isomorphic graphs ( Ĝ1 , Ĝ2 ) described in the proof of Theorem 3.1.

Before proving Theorem 3.3, we first prove the following lemma.

Lemma 3. For any fixed t ∈ N, N(t, d+1)-WL is at least as expressive as N(t, d)-WL in distinguishing non-isomorphic graphs, where d ≥ 1.

Proof. We can proceed with the proof as follows, in a way similar to Lemma 1.
Assume that there exist two non-isomorphic graphs G 1 and G 2 which can be distinguished by N(t, d)-WL but cannot be distinguished by N(t, d+1)-WL after k iterations. This implies that, for any l-th iteration where l = 0, 1, . . . , k-1, N(t, d+1)-WL must have the same multiset of node colours for G 1 and G 2 .

Then we show that, for any iteration l, if the colours of any two nodes in G 1 and G 2 are the same by N(t, d+1)-WL, then their node colours by N(t, d)-WL must also be the same. Again, we prove this by induction. When l = 0, the initial node colours are the same for N(t, d+1)-WL and N(t, d)-WL; hence, the above statement holds. When l > 0, we assume that the statement holds for the (l-1)-th iteration. Then, if the colours of any two nodes in G 1 and G 2 are the same at the l-th iteration by N(t, d+1)-WL, i.e., ς^l(u_1) = ς^l(u_2), by Equation 3 we obtain the corresponding equality of their hash inputs. Since u_1 and u_2 have the same colour in the (l-1)-th iteration, this gives us an equation that must hold between their multisets of subgraph colours.

By the definition of the permutation-invariant function f_pos : S_G → N, which encodes the positional types of induced subgraphs and satisfies the condition specified in Equation 2, we know that J_d ⊆ J_{d+1} holds for induced subgraphs in neighbourhoods and any fixed but arbitrary t ∈ N, where J_d = {f_pos(S) | S ∈ S_(u,t,d)} and J_{d+1} = {f_pos(S) | S ∈ S_(u,t,d+1)}. Accordingly, for any node u in G 1 and G 2 , considering ξ^{l-1}_{(u,i,j)} = {{ζ^{l-1}(S) | S ∈ S_(u,t,d), f_iso(S) = i, and f_pos(S) = j}}, by Equation 3 again and the fact that HASH(·) in Equation 3 is an injective function, we further obtain the corresponding equality restricted to J_d. We know ς^{l-1}(u_1) = ς^{l-1}(u_2) by the assumption for the (l-1)-th iteration. Thus, the equality of the hash inputs for N(t, d)-WL must hold. According to Equation 3, the node colours of u_1 and u_2 by N(t, d)-WL must then also be the same at the l-th iteration.

This implies that there exists an injective function between the colours of nodes by N(t, d+1)-WL and N(t, d)-WL at any l-th iteration. Since for any l-th iteration where l = 0, 1, . . . , k-1, N(t, d+1)-WL has the same multiset of node colours for G 1 and G 2 , N(t, d)-WL must also have the same multiset, which contradicts the assumption. The proof is done.

Theorem 3.3. For any fixed t ∈ N, N(t, d+1)-WL is strictly more expressive than N(t, d)-WL in distinguishing non-isomorphic graphs, where d ≥ 1.

Proof.
Firstly, we show that, for a fixed but arbitrary t ∈ N, there exists at least one pair of non-isomorphic graphs ( Ĝ1 , Ĝ2 ) that cannot be distinguished by N(t, d)-WL but can be distinguished by N(t, d+1)-WL. Generally speaking, there are three cases to consider when constructing such a pair of non-isomorphic graphs ( Ĝ1 , Ĝ2 ):

• When t = 1, we construct Ĝ1 to be two cycles of length 2d + 1 and Ĝ2 to be one cycle of length 4d + 2.
• When t = 2, we construct Ĝ1 to be two cycles of length 2d + 3 and Ĝ2 to be one cycle of length 4d + 6.
• When t ≥ 3, we construct Ĝ1 to be two cycles of length 2d + 2 and Ĝ2 to be one cycle of length 4d + 4.

Then, by Lemma 3, we also know that N(t, d+1)-WL is at least as expressive as N(t, d)-WL. Based on the above two aspects, it can thus be concluded that N(t, d+1)-WL is strictly more expressive than N(t, d)-WL in distinguishing non-isomorphic graphs.

C.3 PROOFS FOR CONNECTED-HEREDITARY SUBGRAPHS (THEOREM 3.7)

Lemma 3.5. S_c^{≤t} is connected-hereditary.

Proof. By the definition of S_c^{≤t}, every S ∈ S_c^{≤t} is a connected induced subgraph. Further, if S ∈ S_c^{≤t}, then every connected induced subgraph of S is also in S_c^{≤t}. By Theorem 3.4, S_c^{≤t} is thus connected-hereditary.

Lemma 3.6. For each induced subgraph S satisfying S ∈ S^{=t} but S ∉ S_c^{=t}, there exists a set

To determine the number of times a subgraph S of k vertices occurs in a graph, where 1 ≤ k ≤ t, we can check all subgraphs of t vertices which contain it and then divide the count by C(m-t, t-k), i.e., the number of subgraphs of t vertices which contain S, where m = |N_d(u)| is the size of the neighbourhood of a node u.
Thus, we have the required equality. Similar to the previous direction, by assumption Statement A2 holds for the (l-1)-th iteration, i.e., if ς^{l-1}(u_1) = ς^{l-1}(u_2), then ς_c^{l-1}(u_1) = ς_c^{l-1}(u_2). By the definition of node colouring in Equation 4 for N_c(t, d)-WL, the node colours ς_c^l(u_1) and ς_c^l(u_2) must then coincide. Hence, Statement A2 also holds for the l-th iteration.

According to Statement A2, there must exist an injective function between the node colours of N(t, d)-WL and those of N_c(t, d)-WL. This means that N_c(t, d)-WL cannot distinguish G 1 and G 2 after k iterations, which contradicts the assumption. The proof is done.

Proof. N(1, 1)-WL has t = 1 and d = 1. Accordingly, the node colouring function in Equation 3 for N(1, 1)-WL can be expressed in a reduced form: we may remove {{ς^l(u)}} since it occurs twice in the input of the hash function and does not add additional information. The resulting simplified expression is exactly the same as the node colouring function of 1-WL.

Proof. The proof proceeds similarly to the proof that GIN is as expressive as 1-WL (Xu et al., 2019). Let G 1 and G 2 be any two non-isomorphic graphs which can be distinguished by N(t, d)-WL at the k-th iteration. It suffices to show that G3N's neighbourhood aggregation scheme described in Equation 5, under the above assumptions, maps G 1 and G 2 to different multisets of node features. According to Equation 3, N(t, d)-WL applies an injective hash function HASH(·) to update the node colours at the l-th iteration:

ς^l(u) = HASH( ς^{l-1}(u), {(ξ^{l-1}_{(u,i,j)}, i, j)}_{i∈I_t, j∈J_d} ),

where ξ^{l-1}_{(u,i,j)} = {{ζ^{l-1}(S) | S ∈ S_(u,t,d), f_iso(S) = i, and f_pos(S) = j}}.

Below, we show that there always exists an injective function f such that h_u^(l) = f(ς^l(u)) for any iteration l. We prove this by induction.

• When l = 0, both h_u^(l) and ς^l(u) are the input node feature of a node u. Thus, h_u^(l) = ς^l(u) holds for any node u in G 1 and G 2 .
• Assume that h_u^(l-1) = f(ς^{l-1}(u)) holds; we now need to show that h_u^(l) = f(ς^l(u)) holds.
Firstly, by the definition of the G3N layer, h_u^(l) is computed from h_u^(l-1) and the multisets of subgraph embeddings grouped by isomorphism and positional types. Since we assume that both COMBINE(·) and AGG_N(·) are injective functions, and the composition of injective functions is injective, there exists some injective function g such that h_u^(l) = g( h_u^(l-1), {(AGG_T(S_u^(l-1)(i, j)), i, j)}_{i∈I_t, j∈J_d} ). Because S_u^(l-1)(i, j) denotes the set of t-order subgraphs within the d-hop neighbourhood of a node u with isomorphism type i and positional type j at the (l-1)-th layer, we have ξ^{l-1}_{(u,i,j)} = {{ζ^{l-1}(S) | S ∈ S_u^(l-1)(i, j)}}. By the definition of subgraph colouring, i.e., ζ^l : S_G → N such that ζ^l(S_1) = ζ^l(S_2) iff f_iso(S_1) = f_iso(S_2) and f_pos(S_1) = f_pos(S_2), and by the assumption that AGG_T(·) is an injective function with respect to multisets of subgraphs of the same isomorphism and positional types, we further have h_u^(l) = g′( ς^{l-1}(u), {(ξ^{l-1}_{(u,i,j)}, i, j)}_{i∈I_t, j∈J_d} ), where g′ is also an injective function. Thus, we have h_u^(l) = f(ς^l(u)) with f = g′ ∘ HASH^{-1}. Since g′ and HASH are injective, and the composition of injective functions is injective, f is injective.

E.2 BASELINE METHODS

In our experiments on synthetic datasets, we compared the performance of G3N against the following baseline methods: MLP, GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2017), GIN (Xu et al., 2019), and PPGN (Maron et al., 2019a).

In our experiments on real-world datasets, we considered the following baseline methods:
- For the TU datasets, we compared G3N against (1) three kernel methods: RWK (Gärtner et al., 2003), WL-kernel (Shervashidze et al., 2011), and P-WL (Rieck et al., 2019); and (2) five GNN models: PATCHY-SAN (Niepert et al., 2016), DCNN (Atwood & Towsley, 2016), DGCNN (Zhang et al., 2018), GIN (Xu et al., 2019), and PPGN (Maron et al., 2019a).
- For the ZINC and MolHIV datasets, we compared G3N against GCN (Kipf & Welling, 2017), PPGN (Maron et al., 2019a), GIN (Xu et al., 2019), PNA (Corso et al., 2020), DGN (Beaini et al., 2021), DEEP LRP (Chen et al., 2020), GSN (Bouritsas et al., 2022), and CIN (Bodnar et al., 2021a).

E.3 PARAMETER SELECTION

Synthetic datasets. We follow the experimental setup described by Balcilar et al. (2021).
- For the EXP, SR25, graph8c, and CSL datasets, we run the models with 100 randomly initialised weights and record a pair of graphs as similar if the L1 distance between their length-10 graph representations is less than 0.001 on any of the runs. All models are constrained to a 30K parameter budget, consisting of 4 convolutional layers, sum readout, and one final linear layer.
- For the RandomGraph dataset, all models are also restricted to a 30K parameter budget, consisting of 4 convolutional layers, sum readout, and a further 2 fully connected layers, trained for up to 200 iterations with a fixed learning rate of 0.001. Learning terminates when the error goes below 10^-4.

TU datasets. Following the setup from Xu et al. (2019), model evaluation and selection are done by collecting the accuracy from the single epoch with the best cross-validation accuracy averaged over the 10 folds. We fix t = 2, d = 2, 128 hidden units, and a learning rate of 0.001 which is halved every 50 steps. Table 8 summarises the hyperparameters used in our experiments for these datasets.

ZINC. We follow the setup described in Dwivedi et al. (2020) with batch size 128 and 1000 epochs, with an initial learning rate of 10^-3 which is halved when validation does not improve for 20 steps; training is halted when the learning rate goes under 10^-5. We fix t = 2, d = 3 and, adhering to the 100k parameter budget, use 4 message passing layers with hidden size 80, followed by a final MLP readout.

MolHIV. We follow the train, validation and test split from Hu et al. (2020) and evaluate on the test score corresponding to the best validation score. We fix t = 2, d = 3, and set the batch size to 128, hidden dimension 128, number of message passing layers 3, 100 epochs with a fixed learning rate of 10^-3, and dropout ratio 0.5.

MolTOX21. We follow the train, validation and test split from Hu et al.
(2020) and evaluate on the test score corresponding to the best validation score. We set the batch size to 64, the hidden dimension to 300, and the number of message passing layers to 4, training for 100 epochs with a fixed learning rate of 10^-3 and dropout ratio 0.5.

Setup. We analyse the effect of t and d on the expressivity and generalisability of G3N. We run all configurations of t ∈ {1, 2, 3}, d ∈ {1, 2, 3} and number of layers ∈ {3, 4, 5} on the ZINC and MolTOX21 datasets. The initial learning rate is set to 10^-3 and is halved when validation does not improve over 20 patience steps; training is halted when the learning rate goes under 10^-5. The results are reported in Table 9 and Table 10.

Observations. We observe that increasing the size of subgraphs t or the size of the receptive field d generally increases expressive power and reduces the generalisation gap. This supports the theoretical results on the higher expressive power of N -WL for higher t or higher d. One reason the generalisation gap is lower is that with G3N the graph structure becomes more important for learning than the node features, which prevents overfitting. On the other hand, in Table 12 we note that parameter usage increases more moderately regardless of dataset structure, as the parameters are bounded by up to 5 times the parameters used in the message passing version of G3N. This is because the number of parameters scales with the number of isomorphism types, which are modest for connected components and small d and t. We further note from both tables that the main computational effort arises from varying d and t and not, as expected, from the number of parameters of the model.

E.6 RUNTIME ANALYSIS FOR CONNECTED VARIANTS

Setup. We further present an empirical analysis of the runtime and performance of G3N on the ZINC dataset with various aggregation variants depending on which subgraph types they aggregate.
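The Xu et al. (2019)-style model selection used for the TU datasets can be made concrete: given per-fold validation accuracy curves, report the mean (and standard deviation) at the single epoch whose fold-averaged accuracy is best. This is a sketch of that protocol; the toy curves are illustrative, not results from the paper.

```python
import numpy as np

def select_epoch(fold_accuracies):
    """Pick the single epoch with the best accuracy averaged over folds.
    `fold_accuracies` is a (n_folds, n_epochs) array; returns the epoch
    index, its fold-averaged accuracy, and the std across folds there."""
    acc = np.asarray(fold_accuracies, dtype=float)
    mean_per_epoch = acc.mean(axis=0)    # average over the folds
    best = int(mean_per_epoch.argmax())  # one epoch shared by all folds
    return best, float(mean_per_epoch[best]), float(acc[:, best].std())

# Toy example: 2 folds x 3 epochs of validation accuracies.
curves = np.array([[0.60, 0.72, 0.70],
                   [0.58, 0.74, 0.69]])
print(select_epoch(curves))  # epoch 1 has the best fold-averaged accuracy
```

Note that a single epoch index is chosen for all folds jointly, rather than the per-fold best epoch, which would overestimate accuracy.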
Specifically, we consider the original G3N, which aggregates only connected t-order subgraphs; a G3N which aggregates the connected-hereditary subgraphs discussed in Section 3.2; and a G3N which aggregates all t-order subgraphs regardless of connectivity. The G3N layer equations given in Equation 14 and Equation 16 remain unchanged for all connectivity variants, with the exception of the definition of S_u(i, j), the set of induced subgraphs from the neighbourhood of a node u, which is restricted according to the connectivity variants discussed above.

The parameter setup for all three variants is the same as for the ZINC experiments above. Specifically, we use a batch size of 128 and an initial learning rate of 10^-3 which is halved when validation does not improve over 25 patience steps; training is halted when the learning rate goes under 10^-5. We fix t = 3, d = 3 with 2 message passing layers and fix the number of hidden units so as to adhere to a 100K parameter budget. We use an MLP readout. The experiments for this section were run on an RTX 3090 GPU.

Observations. From Table 13, we observe that the runtimes of all the G3N variants have the same order of magnitude. By considering connected-hereditary subgraphs, we achieve a lower runtime, as there are fewer isomorphism types to consider for the given dataset. Note, however, that this does not hold in general: there exist graphs, such as cliques, where considering connected-hereditary subgraphs is less efficient. We further note that the connected-hereditary variant provides better performance, which may be attributed to the fact that, by considering all possible isomorphism types regardless of connectivity, we may lose information in the training process as to which structures are actually helpful for prediction. This aligns with the intuition that connected components should contribute more to predictions.
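The three connectivity variants differ only in which node subsets of a neighbourhood are enumerated. The following sketch (illustrative helpers, not the paper's implementation; adjacency is assumed to be a dict of neighbour sets) makes the distinction concrete.

```python
from itertools import combinations

def connected(nodes, adj):
    """DFS connectivity check on the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack += [w for w in adj[v] if w in nodes and w not in seen]
    return seen == nodes

def subgraph_sets(neigh, adj, t, variant):
    """Candidate node sets for the three aggregation variants:
    'connected'  -- connected t-order induced subgraphs only,
    'hereditary' -- connected k-order subgraphs for all 1 <= k <= t,
    'all'        -- every t-order subset regardless of connectivity."""
    sizes = range(1, t + 1) if variant == "hereditary" else [t]
    for k in sizes:
        for s in combinations(sorted(neigh), k):
            if variant == "all" or connected(s, adj):
                yield s

# Toy neighbourhood: a path 1-2-3 plus an isolated node 4.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
print(len(list(subgraph_sets({1, 2, 3, 4}, adj, 2, "all"))))        # → 6 (all 2-subsets)
print(len(list(subgraph_sets({1, 2, 3, 4}, adj, 2, "connected"))))  # → 2 (the two edges)
```

The gap between 6 and 2 here illustrates why the connected variants usually see fewer isomorphism types, and why this reverses on dense graphs such as cliques, where nearly every subset is connected.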
Finally, we note that considering only connected t-order subgraphs, as opposed to connected-hereditary subgraphs (all connected k-order subgraphs for 1 ≤ k ≤ t), naturally provides faster runtime but, more significantly, provides similar performance. This may be attributed to the fact that the larger t-order subgraphs in the given dataset generally subsume smaller subgraphs in providing structural information for prediction.

E.7 COMPLEXITY ANALYSIS: k-WL VS N -WL

Setup. We conduct an experiment to empirically analyse the time complexity of our N -WL algorithms in comparison with the classical k-WL algorithms. In the experiment, four datasets, ZINC, MolHIV, NCI1, and IMDB-B, are selected, which have varying graph structural information, such as the sparsity of a graph, the diameter of a graph, etc. For each dataset, we randomly select 10 graphs. We compute the average complexity of k-WL and N -WL as follows: (1) for k-WL, we compute n^t × (t · n) where t = k. This is because k-WL considers the n^k k-tuples in a graph G with |V(G)| = n, and a further (k · n) neighbouring k-tuples are considered for colouring each such k-tuple. (2) for N -WL, we compute n × a_d^t, where n refers to the number of nodes to be

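The two cost estimates above can be compared numerically. This sketch reads the paper's expressions as n^k × (k · n) for k-WL and n × a_d^t for N -WL; the example sizes (a sparse, molecule-like graph) are illustrative assumptions, not measurements from the paper.

```python
def kwl_cost(n, k):
    """Per-iteration colouring cost proxy for k-WL: n^k k-tuples, each
    with k*n neighbouring k-tuples to inspect."""
    return n ** k * (k * n)

def nwl_cost(n, a_d, t):
    """Analogous proxy for N-WL: n nodes, each aggregating on the order
    of a_d^t t-order subgraphs from its d-hop neighbourhood of size a_d."""
    return n * a_d ** t

# A sparse, molecule-like graph: 30 nodes, small d-hop neighbourhoods.
print(kwl_cost(30, 3))     # 3-WL: 30^3 * (3*30) = 2430000
print(nwl_cost(30, 6, 3))  # N-WL with a_d = 6, t = 3: 30 * 6^3 = 6480
```

The contrast shows why k-WL scales with global graph size n while N -WL scales with local neighbourhood size a_d, which stays small on sparse graphs.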
