RETHINKING THE EXPRESSIVE POWER OF GNNS VIA GRAPH BICONNECTIVITY

Abstract

Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022) , for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.

1. INTRODUCTION

Graph neural networks (GNNs) have recently become the dominant approach for graph representation learning. Among numerous architectures, message-passing neural networks (MPNNs) are arguably the most popular design paradigm and have achieved great success in various fields (Gilmer et al., 2017; Hamilton et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018 ). However, one major drawback of MPNNs lies in the limited expressiveness: as pointed out by Xu et al. (2019) ; Morris et al. (2019) , they can never be more powerful than the classic 1-dimensional Weisfeiler-Lehman (1-WL) test in distinguishing non-isomorphic graphs (Weisfeiler & Leman, 1968 ). This inspired a variety of works to design provably more powerful GNNs that go beyond the 1-WL test. One line of subsequent works aimed to propose GNNs that match the higher-order WL variants (Morris et al., 2019; 2020; Maron et al., 2019c; a; Geerts & Reutter, 2022) . While being highly expressive, such an approach suffers from severe computation/memory costs. Moreover, there have been concerns about whether the achieved expressiveness is necessary for real-world tasks (Veličković, 2022) . In light of this, other recent works sought to develop new GNN architectures with improved expressiveness while still keeping the message-passing framework for efficiency (Bouritsas et al., 2022; Bodnar et al., 2021b; a; Bevilacqua et al., 2022; Wijesinghe & Wang, 2022 , and see Appendix A for more recent advances). However, most of these works mainly justify their expressiveness by giving toy examples where WL algorithms fail to distinguish, e.g., by focusing on regular graphs. On the theoretical side, it is quite unclear what additional power they can systematically and provably gain. 
More fundamentally, to the best of our knowledge (see Appendix D.1), there is still a lack of principled and convincing metrics beyond the WL hierarchy to formally measure the expressive power and to guide the design of provably better GNN architectures. In this paper, we systematically study the problem of designing expressive GNNs from a novel perspective of graph biconnectivity. Biconnectivity has long been a central topic in graph theory (Bollobás, 1998) . It comprises a series of important concepts such as cut vertex (articulation point), cut edge (bridge), biconnected component, and block cut tree (see Section 2 for formal definitions). Intuitively, biconnectivity provides a structural description of a graph by decomposing it into disjoint sub-components and linking them via cut vertices/edges to form a tree structure (cf. Figure 1(b, c )). As can be seen, biconnectivity purely captures the intrinsic structure of a graph. The significance of graph biconnectivity can be reflected in various aspects. Firstly, from a theoretical point of view, it is a basic graph property and is linked to many fundamental topics in graph theory, ranging from path-related problems to network flow (Granot & Veinott Jr, 1985) and spanning trees (Kapoor & Ramesh, 1995) , and is highly relevant to planar graph isomorphism (Hopcroft & Tarjan, 1972) . Secondly, from a practical point of view, cut vertices/edges have substantial values in many real applications. For example, chemical reactions are highly related to edge-biconnectivity of the molecule graph, where the breakage of molecular bonds usually occurs at the cut edges and each biconnected component often remains unchanged after the reaction. As another example, social networks are related to vertex-biconnectivity, where cut vertices play an important role in linking between different groups of people (biconnected components). 
Finally, from a computational point of view, the problems related to biconnectivity (e.g., finding cut vertices/edges or constructing block cut trees) can all be efficiently solved using classic algorithms (Tarjan, 1972) , with a computation complexity equal to graph size (which is the same as an MPNN). Therefore, one may naturally expect that popular GNNs should be able to learn all things related to biconnectivity without difficulty. Unfortunately, we show this is not the case. After a thorough analysis of four classes of representative GNN architectures in literature (see Section 3.1), we find that surprisingly, none of them could even solve the easiest biconnectivity problem: to distinguish whether a graph has cut vertices/edges or not (corresponding to a graph-level binary classification). As a result, they obviously failed in the following harder tasks: (i) identifying all cut vertices (a node-level task); (ii) identifying all cut edges (an edge-level task); (iii) the graph-level task for general biconnectivity problems, e.g., distinguishing a pair of graphs that have non-isomorphic block cut trees. This raises the following question: can we design GNNs with provable expressiveness for biconnectivity problems? We first give an affirmative answer to the above question. By conducting a deep analysis of the recently proposed Equivariant Subgraph Aggregation Network (ESAN) (Bevilacqua et al., 2022) , we prove that the DSS-WL algorithm with node marking policy can precisely identify both cut vertices and cut edges. This provides a new understanding as well as a strong theoretical justification for the expressive power of DSS-WL and its recent extensions (Frasca et al., 2022) . Furthermore, we give a fine-grained analysis of several key factors in the framework, such as the graph generation policy and the aggregation scheme, by showing that neither (i) the ego-network policy without marking nor (ii) a variant of the weaker DS-WL algorithm can identify cut vertices. 
However, GNNs designed based on DSS-WL are usually sophisticated and suffer from high computation/memory costs. The main contribution in this paper is then to give a principled and efficient way to design GNNs that are expressive for biconnectivity problems. Targeting this question, we restart from the classic 1-WL algorithm and figure out a major weakness in distinguishing biconnectivity: the lack of distance information between nodes. Indeed, the importance of distance information is theoretically justified in our proof for analyzing the expressive power of DSS-WL. To this end, we introduce a novel color refinement framework, formalized as Generalized Distance Weisfeiler-Lehman (GD-WL), by directly encoding a general distance metric into the WL aggregation procedure. We first prove that as a special case, the Shortest Path Distance WL (SPD-WL) is expressive for all edge-biconnectivity problems, thus providing a novel understanding of its empirical success. However, it still cannot identify cut vertices. We further suggest an alternative called the Resistance Distance WL (RD-WL) for vertex-biconnectivity. To sum up, all biconnectivity problems can be provably solved within our proposed GD-WL framework.

Table 1: Summary of theoretical results on the expressive power of different GNN models for various biconnectivity problems. We also list the time/space complexity (per WL iteration) for each WL algorithm, where n and m are the number of nodes and edges of a graph, respectively.

Practical Implementation. The main advantage of GD-WL lies in its simplicity, efficiency and parallelizability. We show it can be easily implemented using a Transformer-like architecture by injecting the distance into Multi-head Attention (Vaswani et al., 2017), similar to Ying et al. (2021a). Importantly, we prove that the resulting Graph Transformer (called Graphormer-GD) is as expressive as GD-WL. This offers strong theoretical insights into the power and limits of Graph Transformers.
Empirically, we show Graphormer-GD not only achieves perfect accuracy in detecting cut vertices and cut edges, but also outperforms prior GNN architectures on popular benchmark datasets.

2. PRELIMINARY

Notations. We use $\{\,\}$ to denote sets and $\{\!\{\,\}\!\}$ to denote multisets. The cardinality of a (multi)set $S$ is denoted as $|S|$. The index set is denoted as $[n] := \{1, \cdots, n\}$. Throughout this paper, we consider simple undirected graphs $G = (V, E)$ with no repeated edges or self-loops. Therefore, each edge $\{u, v\} \in E$ can be expressed as a set of two elements. For a node $u \in V$, denote its neighbors as $N_G(u) := \{v \in V : \{u, v\} \in E\}$ and denote its degree as $\deg_G(u) := |N_G(u)|$. A path $P = (u_0, \cdots, u_d)$ is a tuple of nodes satisfying $\{u_{i-1}, u_i\} \in E$ for all $i \in [d]$, and its length is denoted as $|P| := d$. A path $P$ is said to be simple if it does not go through a node more than once, i.e. $u_i \neq u_j$ for $i \neq j$. The shortest path distance between two nodes $u$ and $v$ is defined as $\mathrm{dis}_G(u, v) := \min\{|P| : P \text{ is a path from } u \text{ to } v\}$. The induced subgraph with vertex subset $S \subset V$ is defined as $G[S] = (S, E_S)$ where $E_S := \{\{u, v\} \in E : u, v \in S\}$.

We next introduce the concepts of connectivity, vertex-biconnectivity and edge-biconnectivity.

Definition 2.1. (Connectivity) A graph $G$ is connected if for any two nodes $u, v \in V$, there is a path from $u$ to $v$. A vertex set $S \subset V$ is a connected component of $G$ if $G[S]$ is connected and for any proper superset $T \supsetneq S$, $G[T]$ is not connected.

Definition 2.2. (Biconnectivity) A node $v \in V$ is a cut vertex (or articulation point) of $G$ if removing $v$ increases the number of connected components. A graph is vertex-biconnected if it is connected and has no cut vertex. A vertex set $S \subset V$ is a vertex-biconnected component of $G$ if $G[S]$ is vertex-biconnected and for any proper superset $T \supsetneq S$, $G[T]$ is not vertex-biconnected. We can similarly define the concepts of cut edge (or bridge) and edge-biconnected component (we omit them for brevity). Finally, denote $\mathrm{BCC}^V(G)$ (resp. $\mathrm{BCC}^E(G)$) as the set of all vertex-biconnected (resp. edge-biconnected) components.

Two non-adjacent nodes $u, v \in V$ are in the same vertex-biconnected component iff there are two paths from $u$ to $v$ that do not intersect (except at endpoints). Two nodes $u, v$ are in the same edge-biconnected component iff there are two paths from $u$ to $v$ that do not share an edge. On the other hand, if two nodes are in different vertex/edge-biconnected components, any path between them must go through some cut vertex/edge.
Therefore, cut vertices/edges can be regarded as "hubs" in a graph that link different subgraphs into a whole. Furthermore, the link between cut vertices/edges and biconnected components forms a tree structure, which is called the block cut tree (cf. Figure 1).

Definition 2.3. (Block cut-edge tree) The block cut-edge tree of graph $G = (V, E)$ is defined as follows: $\mathrm{BCETree}(G) := (\mathrm{BCC}^E(G), E^E)$, where
$$E^E := \{\{S_1, S_2\} : S_1, S_2 \in \mathrm{BCC}^E(G),\ \exists u \in S_1, v \in S_2 \text{ s.t. } \{u, v\} \in E\}.$$

Definition 2.4. (Block cut-vertex tree) The block cut-vertex tree of graph $G = (V, E)$ is defined as follows: $\mathrm{BCVTree}(G) := (\mathrm{BCC}^V(G) \cup V^{\mathrm{Cut}}, E^V)$, where $V^{\mathrm{Cut}} \subset V$ is the set containing all cut vertices of $G$ and
$$E^V := \{\{S, v\} : S \in \mathrm{BCC}^V(G),\ v \in V^{\mathrm{Cut}},\ v \in S\}.$$

The following theorem shows that all concepts related to biconnectivity can be efficiently computed.

Theorem 2.5. (Tarjan, 1972) The problems related to biconnectivity, including identifying all cut vertices/edges, finding all biconnected components ($\mathrm{BCC}^V(G)$ and $\mathrm{BCC}^E(G)$), and building block cut trees ($\mathrm{BCVTree}(G)$ and $\mathrm{BCETree}(G)$), can all be solved using the Depth-First Search algorithm, within a computation complexity linear in the graph size, i.e. $\Theta(|V| + |E|)$.

Isomorphism and color refinement algorithms. Two graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$ are isomorphic (denoted as $G \simeq H$) if there is an isomorphism (bijective mapping) $f : V_G \to V_H$ such that for any nodes $u, v \in V_G$, $\{u, v\} \in E_G$ iff $\{f(u), f(v)\} \in E_H$. A color refinement algorithm is an algorithm that outputs a color mapping $\chi_G : V_G \to C$ when taking graph $G$ as input, where $C$ is called the color set. A valid color refinement algorithm must preserve invariance under isomorphism, i.e., $\chi_G(u) = \chi_H(f(u))$ for isomorphism $f$ and node $u \in V_G$. As a result, it can be used as a necessary test for graph isomorphism by comparing the multisets $\{\!\{\chi_G(u) : u \in V_G\}\!\}$ and $\{\!\{\chi_H(u) : u \in V_H\}\!\}$, which we call the graph representations.
Similarly, $\chi_G(u)$ can be seen as the node feature of $u \in V_G$, and $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ corresponds to the edge feature of $\{u, v\} \in E_G$. All algorithms studied in this paper fit the color refinement framework; please refer to Appendix B for a precise description of several representatives (e.g., the classic 1-WL and k-FWL algorithms).

Problem setup. This paper focuses on the following three types of problems with increasing difficulties. Firstly, we say a color refinement algorithm can distinguish whether a graph is vertex/edge-biconnected, if for any graphs $G, H$ where $G$ is vertex/edge-biconnected but $H$ is not, their graph representations are different, i.e. $\{\!\{\chi_G(u) : u \in V_G\}\!\} \neq \{\!\{\chi_H(u) : u \in V_H\}\!\}$.
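As a concrete illustration of Theorem 2.5, the following is a minimal Python sketch of the classic DFS-based low-link procedure for finding cut vertices and cut edges in linear time. The function name, the 0..n-1 node labeling, and the edge-list input are illustrative assumptions rather than notation from this paper, and the recursive form is only meant for small graphs.

```python
from collections import defaultdict


def cut_vertices_and_bridges(n, edges):
    """Return (cut vertices, bridges) of a simple undirected graph.

    Nodes are assumed to be labeled 0..n-1; `edges` is an iterable of (u, v).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = [-1] * n   # DFS discovery time; -1 means unvisited
    low = [0] * n     # smallest discovery time reachable from the DFS subtree
    cuts, bridges = set(), set()
    timer = 0

    def dfs(u, parent):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue  # simple graph: exactly one edge leads back to the parent
            if disc[v] != -1:
                low[u] = min(low[u], disc[v])  # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    bridges.add((min(u, v), max(u, v)))  # subtree cannot bypass {u, v}
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)  # subtree of v cannot reach above u
        if parent is None and children > 1:
            cuts.add(u)  # a DFS root is a cut vertex iff it has several DFS children

    for s in range(n):
        if disc[s] == -1:
            dfs(s, None)
    return cuts, bridges
```

For two triangles joined by a single edge, the sketch reports the two endpoints of that edge as cut vertices and the edge itself as the only bridge; for a 6-cycle with one chord it reports neither.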

3. INVESTIGATING KNOWN GNN ARCHITECTURES VIA BICONNECTIVITY

In this section, we provide a comprehensive investigation of popular GNN variants in literature, including the classic MPNNs, Graph Substructure Networks (GSN) (Bouritsas et al., 2022) and its variant (Barceló et al., 2021) , GNN with lifting transformations (MPSN and CWN) (Bodnar et al., 2021b; a) , GraphSNN (Wijesinghe & Wang, 2022) , and Subgraph GNNs (e.g., Bevilacqua et al. (2022) ). Surprisingly, we find most of these works are not expressive for any biconnectivity problems listed above. The only exceptions are the ESAN (Bevilacqua et al., 2022) and several variants, where we give a rigorous justification of their expressive power for both vertex/edge-biconnectivity.

3.1. COUNTEREXAMPLES

1-WL/MPNNs. We first consider the classic 1-WL. We provide two principled classes of counterexamples, which are formally defined in Examples C.9 and C.10, with a few special cases illustrated in Figure 2. For each pair of graphs in Figure 2, the color of each node is drawn according to the 1-WL color mapping. It can be seen that the two graph representations are the same. Therefore, 1-WL cannot solve any biconnectivity problem listed in Section 2. We also point out that our negative result applies to the similar GNN variant in Barceló et al. (2021).

GNNs with lifting transformations (MPSN/CWN). Bodnar et al. (2021b; a) considered another approach to design powerful GNNs by using graph lifting transformations. In a nutshell, these approaches exploit higher-order graph structures such as cliques and cycles to design new WL aggregation procedures. Unfortunately, we show the resulting algorithms, called the SWL and CWL, still cannot solve any biconnectivity problem. Please see Appendix C.2 (Proposition C.12) for details.

Theorem 3.1. Let $H = \{H_1, \cdots, H_k\}$, $H_i = (V_i, E_i)$ be

Other GNN variants. In Appendix C.2, we discuss other recently proposed GNNs, such as GraphSNN (Wijesinghe & Wang, 2022), GNN-AK (Zhao et al., 2022), and NGNN (Zhang & Li, 2021). Due to space limits, we defer the corresponding negative results to Propositions C.13, C.15 and C.16.
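The failure of 1-WL described above can be reproduced on a small pair of graphs. The sketch below is a minimal Python illustration; the graph pair (a 6-cycle with one chord versus two triangles joined by an edge) is a standard example chosen here for illustration and is not claimed to be one of the graphs in Figure 2. The function name, the shared `palette` dictionary used as an injective hash, and the fixed iteration count are likewise assumptions of this sketch.

```python
from collections import Counter


def wl_histogram(n, edges, palette, iterations):
    """Run 1-WL color refinement and return the multiset of final node colors.

    `palette` maps each signature to an integer color. Sharing one palette
    across several graphs makes their colors directly comparable.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    colors = [0] * n  # uniform initial color (no node features)
    for _ in range(iterations):
        colors = [
            palette.setdefault(
                (colors[v], tuple(sorted(colors[u] for u in adj[v]))),
                len(palette) + 1,  # fresh id; 0 stays reserved for the initial color
            )
            for v in range(n)
        ]
    return Counter(colors)


# A 6-cycle with a chord: vertex-biconnected and edge-biconnected.
THETA = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
# Two triangles joined by the edge {2, 3}: nodes 2 and 3 are cut vertices, {2, 3} is a cut edge.
BRIDGED = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]

palette = {}
hist_theta = wl_histogram(6, THETA, palette, iterations=6)
hist_bridged = wl_histogram(6, BRIDGED, palette, iterations=6)
print(hist_theta == hist_bridged)
```

In both graphs every degree-3 node has one degree-3 neighbor and two degree-2 neighbors, and every degree-2 node has one neighbor of each kind, so the two graph representations coincide even though only the second graph has cut vertices and a cut edge.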

3.2. PROVABLE EXPRESSIVENESS OF ESAN AND DSS-WL

We next switch our attention to a new type of GNN framework proposed in Bevilacqua et al. (2022), called the Equivariant Subgraph Aggregation Networks (ESAN). The central algorithm in ESAN is called the DSS-WL. Given a graph $G$, DSS-WL first generates a bag of vertex-shared (sub)graphs $B^\pi_G = \{\!\{G_1, \cdots, G_m\}\!\}$ according to a graph generation policy $\pi$. Then in each iteration $t$, the algorithm refines the color of each node $v$ in each subgraph $G_i$ by jointly aggregating its neighboring colors in its own subgraph and across all subgraphs. The aggregation formula can be written as:

$$\chi^t_{G_i}(v) := \mathrm{hash}\Big(\chi^{t-1}_{G_i}(v),\ \{\!\{\chi^{t-1}_{G_i}(u) : u \in N_{G_i}(v)\}\!\},\ \chi^{t-1}_{G}(v),\ \{\!\{\chi^{t-1}_{G}(u) : u \in N_{G}(v)\}\!\}\Big), \tag{1}$$

$$\chi^t_{G}(v) := \mathrm{hash}\Big(\{\!\{\chi^t_{G_i}(v) : i \in [m]\}\!\}\Big), \tag{2}$$

where hash is a perfect hash function. DSS-WL terminates when $\chi^t_G$ induces a stable vertex partition. In this paper, we consider node-based graph generation policies, for which each subgraph is associated with a specific node, i.e. $B^\pi_G = \{\!\{G_v : v \in V\}\!\}$. Some popular choices are node deletion $\pi_{\mathrm{ND}}$, node marking $\pi_{\mathrm{NM}}$, k-ego-network $\pi_{\mathrm{EGO}(k)}$, and its node marking version $\pi_{\mathrm{EGOM}(k)}$. A full description of DSS-WL as well as different policies can be found in Appendix B.4 (Algorithm 3).

A fundamental question regarding DSS-WL is how expressive it is. While a straightforward analysis shows that DSS-WL is strictly more powerful than 1-WL, an in-depth understanding of what additional power DSS-WL gains over 1-WL is still limited. The only new result is the very recent work of Frasca et al. (2022), who showed a 3-WL upper bound for the expressivity of DSS-WL. Yet, such a result actually gives a limitation of DSS-WL rather than showing its power. Moreover, there is a large gap between the strong 3-WL and the weak 1-WL. In the following, we take a different perspective and prove that DSS-WL is expressive for both types of biconnectivity problems.

Theorem 3.2.
Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$ be two graphs, and let $\chi_G$ and $\chi_H$ be the corresponding DSS-WL color mapping with node marking policy. Then the following holds:
• For any two nodes $w \in V_G$ and $x \in V_H$, if $\chi_G(w) = \chi_H(x)$, then $w$ is a cut vertex if and only if $x$ is a cut vertex.
• For any two edges $\{w_1, w_2\} \in E_G$ and $\{x_1, x_2\} \in E_H$, if $\{\!\{\chi_G(w_1), \chi_G(w_2)\}\!\} = \{\!\{\chi_H(x_1), \chi_H(x_2)\}\!\}$, then $\{w_1, w_2\}$ is a cut edge if and only if $\{x_1, x_2\}$ is a cut edge.

The proof of Theorem 3.2 is highly technical and is deferred to Appendix C.3. By using the basic results derived in Appendix C.1, we conduct a careful analysis of the DSS-WL color mapping and discover several important properties. They give insights on why DSS-WL can succeed in distinguishing biconnectivity, as we will discuss below.

How can DSS-WL distinguish biconnectivity? We find that a crucial advantage of DSS-WL over the classic 1-WL is that DSS-WL color mapping implicitly encodes distance information (see Lemma C.19(e) and Corollary C.24). For example, two nodes $u \in V_G$, $v \in V_H$ will have different DSS-WL colors if the distance set $\{\!\{\mathrm{dis}_G(u, w) : w \in V_G\}\!\}$ differs from $\{\!\{\mathrm{dis}_H(v, w) : w \in V_H\}\!\}$. Our proof highlights that distance information plays a vital role in distinguishing edge-biconnectivity when combined with color refinement algorithms (detailed in Section 4), and it also helps distinguish vertex-biconnectivity (see the proof of Lemma C.22). Consequently, our analysis provides a novel understanding and a strong justification for the success of DSS-WL in two aspects: the graph representation computed by DSS-WL intrinsically encodes distance and biconnectivity information, both of which are fundamental structural properties of graphs but are lacking in 1-WL.

Discussions on graph generation policies. Note that Theorem 3.2 holds for node marking policy.
In fact, the ability of DSS-WL to encode distance information heavily relies on node marking as shown in the proof of Lemma C.19. In contrast, we prove that the ego-network policy π EGO(k) cannot distinguish cut vertices (Proposition C.14), using the counterexample given in Figure 2 (c). Therefore, our result shows an inherent advantage of node marking than the ego-network policy in distinguishing a class of non-isomorphic graphs, which is raised as an open question in Bevilacqua et al. (2022, Section 5) . It also highlights a theoretical limitation of π EGO(k) compared with its node marking version π EGOM(k) , a subtle difference that may not have received sufficient attention yet. For example, both the GNN-AK and GNN-AK-ctx architecture (Zhao et al., 2022) cannot solve vertex-biconnectivity problems since it is similar to π EGO(k) (see Proposition C.15). On the other hand, the GNN-AK+ does not suffer from such a drawback although it also uses π EGO(k) , because it further adds distance encoding in each subgraph (which is more expressive than node marking). Discussions on DS-WL. Bevilacqua et al. (2022) ; Cotta et al. (2021) also considered a weaker version of DSS-WL, called the DS-WL, which aggregates the node color in each subgraph without interaction across different subgraphs (see formula (10)). We show in Proposition C.16 that unfortunately, DS-WL with common node-based policies cannot identify cut vertices when the color of each node v is defined as its associated subgraph representation G v . This theoretically reveals the importance of cross-graph aggregation and justifies the design of DSS-WL. Finally, we point out that Qian et al. (2022) very recently proposed an extension of DS-WL that adds a final cross-graph aggregation procedure, for which our negative result may not hold. It may be an interesting direction to theoretically analyze the expressiveness of this type of DS-WL in future work.
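To make the DSS-WL update in formulas (1) and (2) concrete, the following is a minimal Python sketch restricted to the node marking policy, where each subgraph is a copy of the input graph with one marked node. Several details are assumptions of this sketch rather than the paper's Algorithm 3: the function name, the initialization of the colors, the shared `palette` dictionary standing in for the perfect hash, and the use of a fixed iteration count instead of running until the partition is stable.

```python
from collections import Counter


def dss_wl_histogram(n, edges, palette, iterations):
    """DSS-WL with node marking; returns the multiset of final node colors."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def h(signature):
        # Injective "hash": every new signature receives a fresh integer color.
        return palette.setdefault(signature, len(palette) + 1)

    def bag(sub, v):
        # Formula (2): aggregate the colors of node v across all subgraphs.
        return h(("bag", tuple(sorted(sub[s][v] for s in range(n)))))

    # sub[s][v] is the color of node v in subgraph G_s (node s is marked).
    sub = [[h(("init", int(v == s))) for v in range(n)] for s in range(n)]
    glob = [bag(sub, v) for v in range(n)]

    for _ in range(iterations):
        # Formula (1): with node marking, N_{G_s}(v) = N_G(v) for every s.
        sub = [
            [
                h((
                    "sub",
                    sub[s][v],
                    tuple(sorted(sub[s][u] for u in adj[v])),
                    glob[v],
                    tuple(sorted(glob[u] for u in adj[v])),
                ))
                for v in range(n)
            ]
            for s in range(n)
        ]
        glob = [bag(sub, v) for v in range(n)]
    return Counter(glob)
```

Running the sketch with a shared palette on a 6-cycle with one chord and on two triangles joined by an edge gives different graph representations once the iteration count reaches the diameter, since by then the color of a node in each marked copy encodes its distance to the marked node.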

4. GENERALIZED DISTANCE WEISFEILER-LEHMAN TEST

After an extensive review of prior GNN architectures, in this section we would like to formally study the following problem: can we design a principled and efficient GNN framework with provable expressiveness for biconnectivity? In fact, while in Section 3.2 we have proved that DSS-WL can solve biconnectivity problems, it is still far from enough. Firstly, the corresponding GNNs based on DSS-WL are usually sophisticated due to the complex aggregation formula (1), which inspires us to study whether simpler architectures exist. More importantly, DSS-WL suffers from high computational costs in both time and memory. Indeed, it requires $\Theta(n^2)$ space and $\Theta(nm)$ time per iteration (using policy $\pi_{\mathrm{NM}}$) to compute node colors for a graph with $n$ nodes and $m$ edges, which is $n$ times as costly as 1-WL. Given the theoretical linear lower bound in Theorem 2.5, one may naturally raise the question of how to close the gap by developing more efficient color refinement algorithms.

We approach the problem by rethinking the classic 1-WL test. We argue that a major weakness of 1-WL is that it is agnostic to distance information between nodes, partly because each node can only "see" its neighbors in aggregation. On the other hand, the DSS-WL color mapping implicitly encodes distance information as shown in Section 3.2, which inspires us to formally study whether incorporating distance in the aggregation procedure is crucial for solving biconnectivity problems. To this end, we introduce a novel color refinement framework which we call Generalized Distance Weisfeiler-Lehman (GD-WL). The update rule of GD-WL is very simple and can be written as:

$$\chi^t_G(v) := \mathrm{hash}\Big(\{\!\{(d_G(v, u), \chi^{t-1}_G(u)) : u \in V\}\!\}\Big), \tag{3}$$

where $d_G$ can be an arbitrary distance metric. The full algorithm is described in Algorithm 4.

SPD-WL for edge-biconnectivity. As a special case, when choosing the shortest path distance $d_G = \mathrm{dis}_G$, we obtain an algorithm which we call SPD-WL.
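The GD-WL update rule above is short enough to sketch directly. The following minimal Python example computes shortest path distances by breadth-first search and then runs the refinement with that distance, which corresponds to SPD-WL. The function names, the shared `palette` dictionary used as an injective hash, the uniform initial color, and the fixed iteration count are assumptions of this sketch and are not taken from Algorithm 4.

```python
from collections import Counter, deque


def shortest_path_distances(n, edges):
    """All-pairs shortest path distances by BFS; unreachable pairs get inf."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[float("inf")] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == float("inf"):
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def gd_wl_histogram(dist, palette, iterations):
    """GD-WL refinement for a given distance matrix; returns the color multiset."""
    n = len(dist)
    colors = [0] * n  # uniform initial color (no node features)
    for _ in range(iterations):
        colors = [
            palette.setdefault(
                # Multiset of (distance, previous color) pairs over all nodes u.
                tuple(sorted((dist[v][u], colors[u]) for u in range(n))),
                len(palette) + 1,
            )
            for v in range(n)
        ]
    return Counter(colors)
```

Any other distance matrix can be passed as `dist`; for real-valued distances one would need to round the entries before hashing, which this sketch does not do. Each iteration touches all $n^2$ node pairs while storing only one color per node, in line with the per-iteration cost of GD-WL.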
SPD-WL can be equivalently written as

$$\chi^t_G(v) := \mathrm{hash}\Big(\chi^{t-1}_G(v),\ \{\!\{\chi^{t-1}_G(u) : u \in N_G(v)\}\!\},\ \{\!\{\chi^{t-1}_G(u) : \mathrm{dis}_G(v, u) = 2\}\!\},\ \cdots,\ \{\!\{\chi^{t-1}_G(u) : \mathrm{dis}_G(v, u) = n-1\}\!\},\ \{\!\{\chi^{t-1}_G(u) : \mathrm{dis}_G(v, u) = \infty\}\!\}\Big). \tag{4}$$

From (4) it is clear that SPD-WL is strictly more powerful than 1-WL since it additionally aggregates the k-hop neighbors for all k > 1. There have been several prior works related to SPD-WL, including using distance encoding as node features (Li et al., 2020) or performing k-hop aggregation for some small k (see Appendix D.2 for more related works and discussions). Yet, these works are either purely empirical or provide limited theoretical analysis (e.g., by focusing only on regular graphs). Instead, we introduce the general and more expressive SPD-WL framework with a rather different motivation and perform a systematic study on its expressive power. Our key result confirms that SPD-WL is fully expressive for all edge-biconnectivity problems listed in Section 2.

Theorem 4.1. Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$ be two graphs, and let $\chi_G$ and $\chi_H$ be the corresponding SPD-WL color mapping. Then the following holds:
• For any two edges $\{w_1, w_2\} \in E_G$ and $\{x_1, x_2\} \in E_H$, if $\{\!\{\chi_G(w_1), \chi_G(w_2)\}\!\} = \{\!\{\chi_H(x_1), \chi_H(x_2)\}\!\}$, then $\{w_1, w_2\}$ is a cut edge if and only if $\{x_1, x_2\}$ is a cut edge.
• If $\{\!\{\chi_G(w) : w \in V_G\}\!\} = \{\!\{\chi_H(w) : w \in V_H\}\!\}$, then $\mathrm{BCETree}(G) \simeq \mathrm{BCETree}(H)$.

Theorem 4.1 is highly non-trivial and perhaps surprising at first sight, as it combines three seemingly unrelated concepts (i.e., SPD, biconnectivity, and the WL test) into a unified conclusion. We give a proof in Appendix C.4, which separately considers two cases: $\chi_G(w_1) \neq \chi_G(w_2)$ and $\chi_G(w_1) = \chi_G(w_2)$ (see Figure 2(b, d) for examples).
For each case, the key technique in the proof is to construct an auxiliary graph (Definitions C.26 and C.34) that precisely characterizes the structural relationship between nodes that have specific colors (see Corollaries C.31 and C.40). Finally, we highlight that the second item of Theorem 4.1 may be particularly interesting: while distinguishing general non-isomorphic graphs is known to be hard (Cai et al., 1992; Babai, 2016), we show that distinguishing non-isomorphic graphs with different block cut-edge trees can be solved much more easily by SPD-WL.

RD-WL for vertex-biconnectivity. Unfortunately, while SPD-WL is fully expressive for edge-biconnectivity, it is not expressive for vertex-biconnectivity. We give a simple counterexample in Figure 2(c), where SPD-WL cannot distinguish the two graphs. Nevertheless, we find that by using a different distance metric, problems related to vertex-biconnectivity can also be fully solved. We propose such a choice called the Resistance Distance (RD) (denoted as $\mathrm{dis}^R_G$). Like SPD, RD is also a basic metric in graph theory (Doyle & Snell, 1984; Klein & Randić, 1993) and has been widely used to characterize the relationship between nodes (Sanmartín et al., 2022; Velingker et al., 2022). Formally, the value of $\mathrm{dis}^R_G(u, v)$ is defined to be the effective resistance between nodes $u$ and $v$ when treating $G$ as an electrical network where each edge corresponds to a resistance of one ohm. RD has many elegant properties. First, it is a valid metric: indeed, RD is non-negative, semidefinite, symmetric, and satisfies the triangle inequality (see Appendix E.2). Moreover, similar to SPD, we also have $0 \le \mathrm{dis}^R_G(u, v) \le n - 1$, and $\mathrm{dis}^R_G(u, v) = \mathrm{dis}_G(u, v)$ if $G$ is a tree. In Appendix E.2, we further show that RD is highly related to the graph Laplacian and can be efficiently calculated.

Theorem 4.2. Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$ be two graphs, and let $\chi_G$ and $\chi_H$ be the corresponding RD-WL color mapping.
Then the following holds:
• For any two nodes $w \in V_G$ and $x \in V_H$, if $\chi_G(w) = \chi_H(x)$, then $w$ is a cut vertex if and only if $x$ is a cut vertex.
• If $\{\!\{\chi_G(w) : w \in V_G\}\!\} = \{\!\{\chi_H(w) : w \in V_H\}\!\}$, then $\mathrm{BCVTree}(G) \simeq \mathrm{BCVTree}(H)$.

The form of Theorem 4.2 exactly parallels Theorem 4.1, which shows that RD-WL is fully expressive for vertex-biconnectivity. We give a proof of Theorem 4.2 in Appendix C.5. In particular, the proof of the second item is highly technical due to the challenges in analyzing the (complex) structure of the block cut-vertex tree. It also highlights that distinguishing non-isomorphic graphs that have different BCVTrees is much easier than the general case. Combining Theorems 4.1 and 4.2 immediately yields the following corollary, showing that all biconnectivity problems can be solved within our proposed GD-WL framework.

Corollary 4.3. When using both SPD and RD (i.e., by setting $d_G(u, v) := (\mathrm{dis}_G(u, v), \mathrm{dis}^R_G(u, v))$), the corresponding GD-WL is fully expressive for both vertex-biconnectivity and edge-biconnectivity.

Computational cost. The GD-WL framework only needs a complexity of $\Theta(n)$ space and $\Theta(n^2)$ time per iteration for a graph of $n$ nodes and $m$ edges, both of which are strictly less than DSS-WL. In particular, GD-WL has the same space complexity as 1-WL, which can be crucial for large-scale tasks. On the other hand, one may ask how much computational overhead there is in preprocessing pairwise distances between nodes. We show in Appendix E that the computational cost can be trivially upper bounded by $O(nm)$ for SPD and $O(n^3)$ for RD. Note that the preprocessing step only needs to be executed once, and we find that the cost is negligible compared to the GNN architecture.

Practical implementation. One of the main advantages of GD-WL is its high degree of parallelizability.
In particular, we find GD-WL can be easily implemented using a Transformer-like architecture by injecting distance information into Multi-head Attention (Vaswani et al., 2017), similar to the structural encoding in Graphormer (Ying et al., 2021a). The attention layer can be written as:

$$Y^h = \Big[\phi^h_1(D) \odot \mathrm{softmax}\Big(X W^h_Q (X W^h_K)^\top + \phi^h_2(D)\Big)\Big] X W^h_V, \tag{5}$$

where $X \in \mathbb{R}^{n \times d}$ is the input node features of the previous layer, $D \in \mathbb{R}^{n \times n}$ is the distance matrix such that $D_{uv} = d_G(u, v)$, $W^h_Q, W^h_K, W^h_V \in \mathbb{R}^{d \times d_H}$ are learnable weight matrices of the $h$-th head, $\phi^h_1$ and $\phi^h_2$ are elementwise functions applied to $D$ (possibly parameterized), and $\odot$ denotes the elementwise multiplication. The results $Y^h \in \mathbb{R}^{n \times d_H}$ across all heads $h$ are then combined and projected to obtain the final output $Y = \sum_h Y^h W^h_O$ where $W^h_O \in \mathbb{R}^{d_H \times d}$. We call the resulting architecture Graphormer-GD, and the full structure of Graphormer-GD is provided in Appendix E.3.

It is easy to see that the mapping from $X$ to $Y$ in (5) is equivariant and simulates the GD-WL aggregation. Importantly, we have the following expressivity result, which precisely characterizes the power and limits of Graphormer-GD. We give a proof in Appendix E.3.

Theorem 4.4. Graphormer-GD is at most as powerful as GD-WL in distinguishing non-isomorphic graphs. Moreover, when choosing proper functions $\phi^h_1$ and $\phi^h_2$ and using a sufficiently large number of heads and layers, Graphormer-GD is as powerful as GD-WL.

On the expressivity upper bound of GD-WL. To complete the theoretical analysis, we finally provide an upper bound of the expressive power for our proposed SPD-WL and RD-WL, by studying the relationship with the standard 2-FWL (3-WL) algorithm.

Theorem 4.5. The 2-FWL algorithm is more powerful than both SPD-WL and RD-WL. Formally, the 2-FWL color mapping induces a finer vertex partition than that of both SPD-WL and RD-WL.

We give a proof in Appendix C.6.
Using Theorem 4.5, we arrive at the concluding corollary: Corollary 4.6. The 2-FWL is fully expressive for both vertex-biconnectivity and edge-biconnectivity.
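The two practical ingredients of this section, the resistance distance and the distance-aware attention head of Eq. (5), can be sketched in a few lines of NumPy. The sketch below makes several assumptions that are not taken from the paper: it computes RD through the pseudo-inverse of the graph Laplacian (a standard identity that is valid for connected graphs only), the function names are illustrative, and the example choices of the elementwise functions (an exponential decay for the first and a negated distance for the second) are hypothetical placeholders for the learnable functions in Graphormer-GD.

```python
import numpy as np


def resistance_distance(n, edges):
    """Resistance distance matrix of a connected graph with unit resistances.

    Uses R_uv = M_uu + M_vv - 2 M_uv, where M is the pseudo-inverse of the
    graph Laplacian. Assumes the graph is connected.
    """
    lap = np.zeros((n, n))
    for u, v in edges:
        lap[u, u] += 1.0
        lap[v, v] += 1.0
        lap[u, v] -= 1.0
        lap[v, u] -= 1.0
    m = np.linalg.pinv(lap)
    diag = np.diag(m)
    return diag[:, None] + diag[None, :] - 2.0 * m


def gd_attention_head(x, dist, w_q, w_k, w_v, phi1, phi2):
    """One attention head following Eq. (5).

    x: (n, d) node features, dist: (n, n) distance matrix,
    w_q, w_k, w_v: (d, d_H) weights, phi1, phi2: elementwise functions of dist.
    """
    scores = x @ w_q @ (x @ w_k).T + phi2(dist)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)  # row-wise softmax
    return (phi1(dist) * attn) @ (x @ w_v)


def example_phi1(dist):
    return np.exp(-dist)  # hypothetical multiplicative gate


def example_phi2(dist):
    return -dist  # hypothetical additive bias
```

On a tree the computed resistance distance matches the shortest path distance, and on a triangle two adjacent nodes are at resistance distance 2/3. Because the head only mixes rows of the feature matrix through weights that depend on node pairs, relabeling the nodes permutes the output rows in the same way.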

5. EXPERIMENTS

In this section, we perform empirical evaluations of our proposed Graphormer-GD. We mainly consider the following two sets of experiments. Firstly, we would like to verify whether Graphormer-GD can indeed learn biconnectivity-related metrics easily as our theory predicts. Secondly, we would like to investigate whether GNNs with sufficient expressiveness for biconnectivity can also help real-world tasks and benefit the generalization performance as well. The code and models will be made publicly available at https://github.com/lsj2408/Graphormer-GD.

Synthetic tasks. To test the expressive power of GNNs for biconnectivity metrics, we separately consider two tasks: (i) Cut Vertex Detection and (ii) Cut Edge Detection. Given a GNN model that outputs node features, we add a learnable prediction head that takes each node feature (or two node features corresponding to each edge) as input and predicts whether it is a cut vertex (cut edge) or not. The evaluation metric for both tasks is the graph-level accuracy, i.e., given a graph, the model prediction is considered correct only when all the cut vertices/edges are correctly identified. To make the results convincing, we construct a challenging dataset that comprises various types of hard graphs, including the regular graphs with cut vertices/edges and also Examples C.9 and C.10 mentioned in Section 3. We also choose several GNN baselines with different levels of expressive power: (i) classic MPNNs (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019); (ii) Graph Substructure Network (Bouritsas et al., 2022); (iii) Graphormer (Ying et al., 2021a). The details of model configurations, dataset, and training procedure are provided in Appendix F.1. The results are presented in Table 2.

Table 2
Model | Cut Vertex Detection | Cut Edge Detection
(Veličković et al., 2018) | 52.0%±1.3% | 62.8%±1.9%
GIN (Xu et al., 2019) | 53.9%±1.7% | 63.1%±2.2%
GSN (Bouritsas et al., 2022) | 60.1%±1.9% | 70.7%±2.1%
Graphormer (Ying et al., 2021a) | |
It can be seen that baseline GNNs cannot perfectly solve these synthetic tasks. In contrast, Graphormer-GD achieves 100% accuracy on both tasks, implying that it can easily learn biconnectivity metrics even on very difficult graphs. Moreover, while using only SPD suffices to identify cut edges, it is still necessary to further incorporate RD to identify cut vertices. This is consistent with our theoretical results in Theorems 4.1, 4.2 and 4.4.

Real-world tasks. We further study the empirical performance of our Graphormer-GD on a real-world benchmark: ZINC from Benchmarking-GNNs (Dwivedi et al., 2020). To show the scalability of Graphormer-GD, we train our models on both ZINC-Full (consisting of 250K molecular graphs) and ZINC-Subset (12K selected graphs). We comprehensively compare our model with prior expressive GNNs that have been publicly released. For a fair comparison, we ensure that the parameter budget of both Graphormer-GD and the other compared models is around 500K, following Dwivedi et al. (2020). Details of baselines and settings are presented in Appendix F.2.

The results are shown in Table 3, where our score is averaged over four experiments with different seeds. We also list the per-epoch training time of different models on ZINC-Subset as well as their numbers of model parameters. It can be seen that Graphormer-GD surpasses or matches all competitive baselines on the test set of both ZINC-Subset and ZINC-Full. Furthermore, we find that the empirical performance of the compared models aligns with their expressive power measured by graph biconnectivity. For example, Subgraph GNNs that are expressive for biconnectivity also consistently outperform classic MPNNs by a large margin. Compared with Subgraph GNNs, the main advantage of Graphormer-GD is that it is simpler to implement and more parallelizable, while still achieving better performance.
Therefore, we believe our proposed architecture is both effective and efficient and can be well extended to more practical scenarios like drug discovery.
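As a rough illustration of the synthetic evaluation protocol described in this section, the following sketch computes ground-truth cut vertices and cut edges with a standard low-point depth-first search and then scores predictions with the graph-level accuracy. It is not the released evaluation code: the function names `cut_vertices_and_edges` and `graph_level_accuracy`, the edge-list input format, and the two toy graphs are assumptions made for this example.

```python
from collections import defaultdict

def cut_vertices_and_edges(n, edges):
    """Cut vertices and cut edges of a simple undirected graph on 0..n-1.

    Uses the classic DFS with discovery times and low-points (linear time).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [-1] * n          # discovery time, -1 means unvisited
    low = [0] * n            # lowest discovery time reachable from the subtree
    cut_v, cut_e = set(), set()
    timer = 0

    def dfs(u, parent):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        children = 0
        for w in adj[u]:
            if w == parent:          # valid because the graph is simple
                continue
            if disc[w] != -1:        # back edge
                low[u] = min(low[u], disc[w])
            else:                    # tree edge
                children += 1
                dfs(w, u)
                low[u] = min(low[u], low[w])
                if low[w] > disc[u]:
                    cut_e.add(frozenset((u, w)))
                if parent is not None and low[w] >= disc[u]:
                    cut_v.add(u)
        if parent is None and children > 1:
            cut_v.add(u)

    for s in range(n):
        if disc[s] == -1:
            dfs(s, None)
    return cut_v, cut_e

def graph_level_accuracy(predictions, targets):
    """A graph counts as correct only if the whole predicted set matches."""
    correct = sum(1 for p, t in zip(predictions, targets) if set(p) == set(t))
    return correct / len(targets)

# Two triangles sharing vertex 2, and a path on three vertices.
bowtie = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
path = [(0, 1), (1, 2)]
bowtie_cv, bowtie_ce = cut_vertices_and_edges(5, bowtie)
path_cv, path_ce = cut_vertices_and_edges(3, path)
```

The recursive search is adequate for small graphs; very large graphs would need an iterative variant to stay within Python's recursion limit.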

6. CONCLUSION

In this paper, we systematically investigate the expressive power of GNNs via the perspective of graph biconnectivity. Through this novel lens, we gain strong theoretical insights into the power and limits of existing popular GNNs. We then introduce the principled GD-WL framework that is fully expressive for all biconnectivity metrics. We further design the Graphormer-GD architecture that is provably powerful while enjoying practical efficiency and parallelizability. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of Graphormer-GD.

There are still many promising directions that have not yet been explored. Firstly, it remains an important open problem whether biconnectivity can be solved more efficiently in $o(n^2)$ time using equivariant GNNs. Secondly, a deep understanding of GD-WL is generally lacking. For example, we conjecture that RD-WL can encode graph spectral information (Lim et al., 2022) and is strictly more powerful than SPD-WL in distinguishing general graphs. Thirdly, it may be interesting to further investigate more expressive distance (structural) encoding schemes beyond RD-WL and explore how to encode them in Graph Transformers. Finally, one can extend biconnectivity to a hierarchy of higher-order variants (e.g., tri-connectivity), which provides a completely different view parallel to the WL hierarchy to study the expressive power and guide the design of provably powerful GNN architectures.

Table 3 (partial). Results on ZINC.
Model | Time per epoch | Params | Test MAE (ZINC-Subset) | Test MAE (ZINC-Full)
(Veličković et al., 2018) | 8.28 | 531,345 | 0.384±0.007 | 0.111±0.002
GCN (Kipf & Welling, 2017) | 5.85 | 505,079 | 0.367±0.011 | 0.113±0.002
MoNet (Monti et al., 2017) | 7.19 | 504,013 | 0.292±0.006 | 0.090±0.002
GatedGCN-PE (Bresson & Laurent, 2017) | 10.74 | 505,011 | 0.214±0.006 | -
MPNN(sum) (Gilmer et al., 2017) | - | 480,805 | 0.145±0.007 | -
PNA (Corso et al., 2020

A RECENT ADVANCES IN EXPRESSIVE GNNS

Since the seminal works of Xu et al. (2019); Morris et al. (2019), extensive studies have been devoted to developing new GNN architectures with better expressiveness beyond the 1-WL test. These works can be broadly classified into the following categories.

Higher-order GNNs. One straightforward way to design provably more expressive GNNs is inspired by the higher-order WL tests (see Appendix B.2). Instead of performing node feature aggregation, these higher-order GNNs calculate a feature vector for each k-tuple of nodes (k ≥ 2) and perform aggregation between features of different tuples using tensor operations (Morris et al., 2019; Maron et al., 2019b; c; a; Keriven & Peyré, 2019; Azizian & Lelarge, 2021; Geerts & Reutter, 2022). In particular, Maron et al. (2019a) leveraged equivariant matrix multiplication to design network layers that mimic the 2-FWL aggregation. Due to the huge computational cost of higher-order GNNs, several recent works considered improving efficiency by leveraging the sparse and local nature of graphs and designing a "local" version of the k-WL aggregation, which comes at the cost of some expressiveness (Morris et al., 2020; 2022). The work of Vignac et al. (2020) can also be seen as a local 2-order GNN and its expressive power is bounded by 3-IGN (Maron et al., 2019c).

Substructure-based GNNs. Another way to design more expressive GNNs is inspired by studying the failure cases of the 1-WL test. In particular, Chen et al. (2020) pointed out that standard MPNNs cannot detect/count common substructures such as cycles, cliques, and paths. Based on this finding, Bouritsas et al. (2022) designed the Graph Substructure Network (GSN) by incorporating substructure counting into node features using a preprocessing step. Such an approach was later extended by Barceló et al. (2021).

Subgraph GNNs. In fact, the graphs indistinguishable by 1-WL tend to possess a high degree of symmetry (e.g., see Figure 2).
Based on this observation, a variety of recent approaches sought to break the symmetry by feeding subgraphs into an MPNN. To maintain equivariance, a set of subgraphs is generated symmetrically from the original graph using predefined policies, and the final output is aggregated across all subgraphs. There have been several subgraph generation policies in prior works, such as node deletion (Cotta et al., 2021), edge deletion (Bevilacqua et al., 2022), node marking (Papp & Wattenhofer, 2022), and ego-networks (Zhao et al., 2022; Zhang & Li, 2021; You et al., 2021). These works also differ slightly in their aggregation schemes. In particular, Bevilacqua et al. (2022) developed a unified framework, called ESAN, which includes per-layer aggregation across subgraphs and thus enjoys better expressiveness. Very recently, Frasca et al. (2022) further extended the framework based on a more relaxed symmetry analysis and proved that its expressiveness is upper bounded by 3-WL. Qian et al. (2022) provided a theoretical analysis of how subgraph GNNs relate to k-FWL and also designed an approach to learn policies.

Non-equivariant GNNs. Perhaps one of the simplest ways to break the intrinsic symmetry of 1-WL aggregation is to use non-equivariant GNNs. Indeed, Loukas (2020) proved that if each node in a GNN is equipped with a unique identifier, then standard MPNNs can already be Turing universal. There have been several works that exploit this idea to build powerful GNNs, such as using port numbering (Sato et al., 2019), relational pooling (Murphy et al., 2019), random features (Sato et al., 2021; Abboud et al., 2021), or dropout techniques (Papp et al., 2021). However, since the resulting architectures cannot fully preserve equivariance, the sample complexity required for training and generalization may not be guaranteed (Garg et al., 2020). Therefore, in this paper we only focus on analyzing and designing equivariant GNNs.
2022) utilized spectral information of graphs to achieve better expressiveness beyond 1-WL. Talak et al. (2021) proposed the Neural Tree Network that performs message passing between higher-order subgraphs instead of node-level aggregation. Finally, for a comprehensive survey on expressive GNNs, we refer readers to Sato (2020) and Morris et al. (2021) .

B THE WEISFEILER-LEHMAN ALGORITHMS AND RECENTLY PROPOSED VARIANTS

In this section, we give a precise description of the family of Weisfeiler-Lehman algorithms and several recently proposed variants that are studied in this paper. We first present the classic 1-WL algorithm (Weisfeiler & Leman, 1968) and the more advanced k-FWL (Cai et al., 1992; Morris et al., 2019). Then we present several recently proposed WL variants, including WL with Substructure Counting (SC-WL) (Bouritsas et al., 2022), Overlap Subgraph WL (OS-WL) (Wijesinghe & Wang, 2022), Equivariant Subgraph Aggregation WL (DSS-WL) (Bevilacqua et al., 2022), and Generalized Distance WL (GD-WL).

Throughout this section, we assume $\mathrm{hash} : \mathcal{X} \to \mathcal{C}$ is an injective hash function that can map "arbitrary objects" to a color in $\mathcal{C}$, where $\mathcal{C}$ is an abstract set called the color set. Formally, the domain $\mathcal{X}$ comprises all the objects we are interested in:
• $\mathbb{R} \subset \mathcal{X}$ and $\mathcal{C} \subset \mathcal{X}$;
• For any finite multiset $\mathcal{M}$ with elements in $\mathcal{X}$, $\mathcal{M} \in \mathcal{X}$;
• For any tuple $c \in \mathcal{X}^k$ of finite dimension $k \in \mathbb{N}_+$, $c \in \mathcal{X}$.

B.1 1-WL TEST

Given a graph $G = (V, E)$, the 1-dimensional Weisfeiler-Lehman algorithm (1-WL), also called the color refinement algorithm, iteratively calculates a color mapping $\chi_G$ from each vertex $v \in V$ to a color $\chi_G(v) \in \mathcal{C}$. The pseudo code of 1-WL is presented in Algorithm 1. Intuitively, at the beginning the color of each vertex is initialized to be the same. Then in each iteration, the 1-WL algorithm updates each vertex color by combining its own color with the multiset of its neighbors' colors using a hash function. This procedure is repeated for a sufficiently large number of iterations $T$, e.g., $T = |V|$.
Algorithm 1: The 1-dimensional Weisfeiler-Lehman Algorithm
Input: Graph $G = (V, E)$ and the number of iterations $T$
Output: Color mapping $\chi_G : V \to \mathcal{C}$
1  Initialize: Pick a fixed (arbitrary) element $c_0 \in \mathcal{C}$, and let $\chi_G^0(v) := c_0$ for all $v \in V$
2  for $t \leftarrow 1$ to $T$ do
3      for each $v \in V$ do
4          $\chi_G^t(v) := \mathrm{hash}\left(\chi_G^{t-1}(v), \{\!\{\chi_G^{t-1}(u) : u \in N_G(v)\}\!\}\right)$
5  Return: $\chi_G^T$

At each iteration, the color mapping $\chi_G^t$ induces a partition of the vertex set $V$ with an equivalence relation $\sim_{\chi_G^t}$ defined by $u \sim_{\chi_G^t} v \iff \chi_G^t(u) = \chi_G^t(v)$ for $u, v \in V$. We call each equivalence class a color class with an associated color $c \in \mathcal{C}$, denoted as $(\chi_G^t)^{-1}(c) := \{v \in V : \chi_G^t(v) = c\}$. The corresponding partition is then denoted as $P_G^t = \{(\chi_G^t)^{-1}(c) : c \in \mathcal{C}_G^t\}$, where $\mathcal{C}_G^t := \{\chi_G^t(v) : v \in V\}$ is the color set containing all the colors present on vertices of $G$.

An important observation is that each 1-WL iteration refines the partition $P_G^t$ to a finer partition $P_G^{t+1}$, because for any $u, v \in V$, $u \sim_{\chi_G^{t+1}} v$ implies $u \sim_{\chi_G^t} v$. Since the number of vertices $|V|$ is finite, there must exist an iteration $T_{\mathrm{stable}} < |V|$ such that $P_G^{T_{\mathrm{stable}}} = P_G^{T_{\mathrm{stable}}+1}$. It follows that $P_G^t = P_G^{T_{\mathrm{stable}}}$ for all $t \ge T_{\mathrm{stable}}$, i.e., the partition stabilizes. We thus denote $P_G := P_G^{T_{\mathrm{stable}}}$ as the stable partition induced by the 1-WL algorithm, and denote $\chi_G$ as any stable color mapping (i.e., by picking any $\chi_G^t$ with $t \ge T_{\mathrm{stable}}$). We can similarly define the inverse mapping $\chi_G^{-1}$.

The mapping $\chi_G$ serves as a node feature extractor so that $\chi_G(v)$ is the representation of node $v \in V$. Correspondingly, the multiset $\{\!\{\chi_G(v) : v \in V\}\!\}$ can serve as the representation of graph $G$. The 1-WL algorithm can be used to distinguish whether two graphs $G$ and $H$ are isomorphic, by comparing their graph representations $\{\!\{\chi_G(v) : v \in V\}\!\}$ and $\{\!\{\chi_H(v) : v \in V\}\!\}$. If the two multisets are not equal, then $G$ and $H$ are clearly non-isomorphic.
Thus passing the 1-WL test is a necessary condition for graph isomorphism. Nevertheless, the 1-WL test fails when $\{\!\{\chi_G(v) : v \in V\}\!\} = \{\!\{\chi_H(v) : v \in V\}\!\}$ but $G$ and $H$ are still non-isomorphic (see Figure 2 for a counterexample). This motivates the more powerful higher-order WL tests, which are described in the next subsection.
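The 1-WL procedure above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the injective hash is emulated by a dictionary that assigns a fresh integer to each new signature, the two graphs are refined jointly so that their colors are comparable, and the names `wl_1`, `wl_1_distinguishes`, and `cycle` are introduced here for the example only.

```python
from collections import Counter

def wl_1(graphs, num_iters):
    """Run 1-WL jointly on several graphs so that colors are comparable.

    graphs: list of dicts mapping each vertex to the set of its neighbors.
    Returns one dict (vertex -> integer color) per graph.
    """
    colorings = [{v: 0 for v in adj} for adj in graphs]   # same initial color
    for _ in range(num_iters):
        palette = {}   # injective "hash": signature -> fresh integer color
        refined = []
        for adj, col in zip(graphs, colorings):
            refined.append({
                v: palette.setdefault(
                    (col[v], tuple(sorted(col[u] for u in adj[v]))),
                    len(palette))
                for v in adj
            })
        colorings = refined
    return colorings

def wl_1_distinguishes(g, h):
    """True if the color multisets of the two graphs differ."""
    col_g, col_h = wl_1([g, h], max(len(g), len(h)))
    return Counter(col_g.values()) != Counter(col_h.values())

def cycle(nodes):
    """Adjacency dict of a cycle through the given nodes."""
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

hexagon = cycle(range(6))
two_triangles = {**cycle(range(3)), **cycle(range(3, 6))}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = cycle(range(3))
```

Both the hexagon and the disjoint union of two triangles are 2-regular on six vertices, so every vertex keeps the same color in every iteration and the test cannot separate them, whereas the three-vertex path and the triangle are separated after one iteration.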

B.2 k-FWL TEST

In this section, we present a family of algorithms called the k-dimensional Folklore Weisfeiler-Lehman algorithms (k-FWL). Instead of calculating a node color mapping, k-FWL computes a color mapping on each k-tuple of nodes. The pseudo code of k-FWL ($k \ge 2$) is presented in Algorithm 2.

Algorithm 2: The k-dimensional Folklore Weisfeiler-Lehman Algorithm
Input: Graph $G = (V, E)$ and the number of iterations $T$
Output: Color mapping $\chi_G : V^k \to \mathcal{C}$
1  Initialize: Pick three fixed different elements $c_0, c_1, c_{\mathrm{node}} \in \mathcal{C}$, and let $\chi_G^0(\mathbf{v}) := \mathrm{hash}(\mathrm{vec}(A^{\mathbf{v}}))$ for each $\mathbf{v} \in V^k$, where $A^{\mathbf{v}} \in \mathcal{C}^{k \times k}$ is a matrix with elements
   $$A^{\mathbf{v}}_{ij} = \begin{cases} c_{\mathrm{node}} & \text{if } v_i = v_j, \\ c_0 & \text{if } v_i \ne v_j \text{ and } \{v_i, v_j\} \notin E, \\ c_1 & \text{if } v_i \ne v_j \text{ and } \{v_i, v_j\} \in E \end{cases} \qquad (6)$$
2  for $t \leftarrow 1$ to $T$ do
3      for each $\mathbf{v} \in V^k$ do
4          $\chi_G^t(\mathbf{v}) := \mathrm{hash}\left(\chi_G^{t-1}(\mathbf{v}), \{\!\{(\chi_G^{t-1}(N_1(\mathbf{v}, u)), \cdots, \chi_G^{t-1}(N_k(\mathbf{v}, u))) : u \in V\}\!\}\right)$, where $N_i(\mathbf{v}, u) = (v_1, \cdots, v_{i-1}, u, v_{i+1}, \cdots, v_k)$
5  Return: $\chi_G^T$

Intuitively, at the beginning, the color of each vertex tuple $\mathbf{v}$ encodes the full structure (i.e., isomorphism type) of the subgraph induced by the ordered vertex set $\{v_i : i \in [k]\}$, by hashing the "adjacency" matrix $A^{\mathbf{v}}$ defined in (6). Then in each iteration, the k-FWL algorithm updates the color of each vertex tuple by combining its own color with the "neighborhood" colors using a hash function. Here, the neighborhood of a tuple $\mathbf{v}$ consists of all the tuples that differ from $\mathbf{v}$ in exactly one element. These $k \times |V|$ neighborhood colors are grouped into a multiset of size $|V|$ where each element is a k-tuple. Finally, the update procedure is repeated for a sufficiently large number of iterations $T$, e.g., $T = |V|^k$.

Similar to 1-WL, the k-FWL color mapping $\chi_G^t$ induces a partition of the set of vertex k-tuples $V^k$, and each k-FWL iteration refines the partition of the previous iteration. Since the number of vertex k-tuples $|V|^k$ is finite, there must exist an iteration $T_{\mathrm{stable}} < |V|^k$ such that the partition no longer changes for $t \ge T_{\mathrm{stable}}$.
We denote the stable color mapping as $\chi_G$ by picking any $\chi_G^t$ with $t \ge T_{\mathrm{stable}}$. The k-FWL algorithm can be used to distinguish whether two graphs $G$ and $H$ are isomorphic, by comparing their graph representations $\{\!\{\chi_G(\mathbf{v}) : \mathbf{v} \in V^k\}\!\}$ and $\{\!\{\chi_H(\mathbf{v}) : \mathbf{v} \in V^k\}\!\}$. It has been proved that k-FWL is strictly more powerful than 1-WL in distinguishing non-isomorphic graphs, and (k + 1)-FWL is strictly more powerful than k-FWL for all $k \ge 2$ (Cai et al., 1992).

Moreover, the k-FWL algorithm can also be used to extract node representations as with 1-WL. To do this, we can simply define $\chi_G(v) := \chi_G(v, \cdots, v)$ as the vertex color of the k-FWL algorithm (without abuse of notation), which induces a partition $P_G$ over the vertex set $V$. It has been shown that this partition is finer than the partition induced by 1-WL, and also that the vertex partition induced by (k + 1)-FWL is finer than that of k-FWL (Kiefer, 2020).
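For the smallest case $k = 2$, Algorithm 2 can be sketched as follows. This is a hedged illustration, not an optimized implementation: the three initial colors are encoded as the integers 0 ($c_{\mathrm{node}}$), 1 ($c_0$) and 2 ($c_1$), the hash is emulated by a shared dictionary, and the function names are chosen for this example.

```python
from collections import Counter
from itertools import product

def fwl2(graphs, num_iters):
    """Run 2-FWL jointly on several graphs (dict vertex -> set of neighbors).

    Returns, per graph, a dict mapping each ordered pair (u, v) to a color.
    """
    colorings = []
    for adj in graphs:
        col = {}
        for u, v in product(adj, repeat=2):
            # isomorphism type of the pair: equal / non-adjacent / adjacent
            col[(u, v)] = 0 if u == v else (2 if v in adj[u] else 1)
        colorings.append(col)
    for _ in range(num_iters):
        palette = {}   # shared across graphs so that colors stay comparable
        refined = []
        for adj, col in zip(graphs, colorings):
            nxt = {}
            for (u, v), c in col.items():
                # for each w: colors of N_1((u,v), w) = (w, v) and N_2 = (u, w)
                neigh = tuple(sorted((col[(w, v)], col[(u, w)]) for w in adj))
                nxt[(u, v)] = palette.setdefault((c, neigh), len(palette))
            refined.append(nxt)
        colorings = refined
    return colorings

def fwl2_distinguishes(g, h):
    col_g, col_h = fwl2([g, h], max(len(g), len(h)))
    return Counter(col_g.values()) != Counter(col_h.values())

def vertex_colors(adj, pair_colors):
    """Node representation chi(v) := chi(v, v)."""
    return {v: pair_colors[(v, v)] for v in adj}

def cycle(nodes):
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

hexagon = cycle(range(6))
two_triangles = {**cycle(range(3)), **cycle(range(3, 6))}
```

Unlike 1-WL, this sketch separates the hexagon from two disjoint triangles: an adjacent pair in a triangle has a third vertex adjacent to both endpoints, which shows up in the neighborhood multiset after a single iteration. The cost is cubic in $|V|$ per iteration, which is the computational burden mentioned for higher-order methods.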

B.3 WL WITH SUBSTRUCTURE COUNTING (SC-WL)

Recently, Bouritsas et al. (2022) proposed a variant of the 1-WL algorithm by incorporating so-called substructure counting into the WL aggregation procedure. This yields an algorithm that is provably more powerful than the original 1-WL test. To describe the algorithm, we first need the notion of the automorphism group. Given a graph $H = (V_H, E_H)$, an automorphism of $H$ is a bijective mapping $f : V_H \to V_H$ such that for any two vertices $u, v \in V_H$, $\{u, v\} \in E_H \iff \{f(u), f(v)\} \in E_H$. All automorphisms of $H$ form a group under function composition, which is called the automorphism group and denoted as $\mathrm{Aut}(H)$.

The automorphism group $\mathrm{Aut}(H)$ yields a partition of the vertex set $V_H$ into orbits. Formally, given a vertex $v \in V_H$, define its orbit $\mathrm{Orb}_H(v) = \{u \in V_H : \exists f \in \mathrm{Aut}(H), f(u) = v\}$. The set of all orbits $H \backslash \mathrm{Aut}(H) := \{\mathrm{Orb}_H(v) : v \in V_H\}$ is called the quotient of the automorphism. Denote $d_H = |H \backslash \mathrm{Aut}(H)|$ and denote the elements in $H \backslash \mathrm{Aut}(H)$ as $\{O^V_{H,i}\}_{i=1}^{d_H}$. We are now ready to describe the procedure of SC-WL.

Pre-processing. Depending on the task, one first specifies a set of (small) connected graphs $\mathcal{H} = \{H_1, \cdots, H_k\}$, which will be used for substructure counting in the input graph $G$. Popular choices of these small graphs are cycles of different lengths (e.g., triangle or square) and cliques. Given a graph $G = (V_G, E_G)$, for each vertex $v \in V_G$ and each graph $H \in \mathcal{H}$, the following quantities are calculated:
$$x^V_{H,i}(v) := \left|\left\{G[S] : S \subset V_G,\ G[S] \simeq H,\ v \in S,\ f_{G[S] \to V_H}(v) \in O^V_{H,i}\right\}\right|, \quad i \in [d_H], \qquad (7)$$
where $f_{G[S] \to V_H}$ is any isomorphism that maps the vertices of graph $G[S]$ to those of graph $H$. Intuitively, $x^V_{H,i}(v)$ counts the number of induced subgraphs of $G$ that are isomorphic to $H$ and contain node $v$, such that the orbit of $v$ corresponds to the orbit $O^V_{H,i}$.
The counts corresponding to different orbits $O^V_{H,i}$ and different graphs $H$ are finally combined and concatenated into a vector:
$$x^V(v) = \left[x^V_{H_1}(v)^\top, \cdots, x^V_{H_k}(v)^\top\right]^\top \in \mathbb{N}_+^D, \qquad (8)$$
where the dimension of $x^V(v)$ is $D = \sum_{i \in [k]} d_{H_i}$.

Message Passing. The message passing procedure is similar to Algorithm 1, except that the aggregation formula (Line 4) is replaced by the following update rule:
$$\chi_G^t(v) := \mathrm{hash}\left(\chi_G^{t-1}(v),\ x^V(v),\ \{\!\{(\chi_G^{t-1}(u), x^V(u)) : u \in N_G(v)\}\!\}\right), \qquad (9)$$
which incorporates the substructure counts (7, 8). Note that the update rule (9) is slightly simpler than that of the original paper (Bouritsas et al., 2022, Section 3.2), but the expressive power of the two formulations is the same.

Finally, we note that the above procedure counts substructures and calculates features $x^V$ for each vertex of $G$. One can similarly consider calculating substructure counts for each edge of $G$, and the conclusion in this paper (Theorem 3.1) still holds. Please refer to Bouritsas et al. (2022) for more details on how to calculate edge features.
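A minimal sketch of SC-WL for the single substructure $\mathcal{H} = \{\text{triangle}\}$ is given below. The triangle has one orbit, so the feature $x^V(v)$ reduces to the number of triangles containing $v$. This is an illustration under that assumption, not the implementation of Bouritsas et al. (2022); `triangle_counts`, `sc_wl` and the example graphs are names introduced here.

```python
from collections import Counter
from itertools import combinations

def triangle_counts(adj):
    """x^V(v): number of triangles of the graph that contain vertex v."""
    return {
        v: sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        for v in adj
    }

def sc_wl(graphs, num_iters):
    """1-WL whose update also hashes the substructure counts, as in (9)."""
    feats = [triangle_counts(adj) for adj in graphs]
    colorings = [{v: 0 for v in adj} for adj in graphs]
    for _ in range(num_iters):
        palette = {}   # shared injective "hash"
        refined = []
        for adj, x, col in zip(graphs, feats, colorings):
            refined.append({
                v: palette.setdefault(
                    (col[v], x[v],
                     tuple(sorted((col[u], x[u]) for u in adj[v]))),
                    len(palette))
                for v in adj
            })
        colorings = refined
    return colorings

def sc_wl_distinguishes(g, h):
    col_g, col_h = sc_wl([g, h], max(len(g), len(h)))
    return Counter(col_g.values()) != Counter(col_h.values())

def cycle(nodes):
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

hexagon = cycle(range(6))
two_triangles = {**cycle(range(3)), **cycle(range(3, 6))}
octagon = cycle(range(8))
two_squares = {**cycle(range(4)), **cycle(range(4, 8))}
```

With triangle counts the hexagon and two disjoint triangles become distinguishable, but the octagon and two disjoint squares do not, since neither contains a triangle; the chosen substructure set determines which symmetries are broken.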

B.4 EQUIVARIANT SUBGRAPH AGGREGATION WL (DSS-WL)

Recently, Bevilacqua et al. (2022) developed a new type of graph neural network, called Equivariant Subgraph Aggregation Networks, as well as a new WL variant named DSS-WL. Given a graph $G = (V, E)$, DSS-WL first generates a bag of graphs $B^\pi_G = \{\!\{G_1, \cdots, G_m\}\!\}$ which share the vertices, i.e., $G_i = (V, E_i)$, but differ in the edge sets $E_i$. Here $\pi$ denotes the graph generation policy, which determines the edge set $E_i$ for each graph $G_i$. The initial coloring $\chi^0_{G_i}(v)$ for each node $v \in V$ in graph $G_i$ is also determined by $\pi$ and can be different across different nodes and graphs. In each iteration, the algorithm refines the color of each node by jointly aggregating its neighboring colors in its own graph and across different graphs. This procedure is repeated for a sufficiently large number of iterations $T$ to obtain the stable color mappings $\chi_{G_i}$ and $\chi_G$. The pseudo code of DSS-WL is presented in Algorithm 3.

Algorithm 3: DSS Weisfeiler-Lehman Algorithm
Input: Graph $G = (V, E)$, the number of iterations $T$, and graph selection policy $\pi$
Output: Color mapping $\chi_G : V \to \mathcal{C}$
1  Initialize: Generate a bag of graphs $B^\pi_G = \{\!\{G_i\}\!\}_{i=1}^m$, $G_i = (V, E_i)$, and initial colorings $\chi^0_{G_i}$ for $i \in [m]$ according to policy $\pi$
2  Let $\chi^0_G(v) := \mathrm{hash}\left(\{\!\{\chi^0_{G_i}(v) : i \in [m]\}\!\}\right)$ for each $v \in V$
3  for $t \leftarrow 1$ to $T$ do
4      for each $v \in V$ do
5          for $i \leftarrow 1$ to $m$ do
6              $\chi^t_{G_i}(v) := \mathrm{hash}\left(\chi^{t-1}_{G_i}(v),\ \{\!\{\chi^{t-1}_{G_i}(u) : u \in N_{G_i}(v)\}\!\},\ \chi^{t-1}_G(v),\ \{\!\{\chi^{t-1}_G(u) : u \in N_G(v)\}\!\}\right)$
7          $\chi^t_G(v) := \mathrm{hash}\left(\{\!\{\chi^t_{G_i}(v) : i \in [m]\}\!\}\right)$
8  Return: $\chi^T_G$

The key component in the DSS-WL algorithm is the graph generation policy $\pi$, which must maintain symmetry, i.e., be equivariant under permutation of the vertex set. We list several common choices below:
• Node marking policy $\pi = \pi_{\mathrm{NM}}$. In this policy, we have $B^\pi_G = \{\!\{G_v : v \in V\}\!\}$ where $G_v = G$, i.e., there are $|V|$ graphs in $B^\pi_G$ whose structures are completely the same. The difference, however, lies in the initial coloring, which marks the special node $v$ in the following way: $\chi^0_{G_v}(v) = c_1$ and $\chi^0_{G_v}(u) = c_0$ for other nodes $u \ne v$, where $c_0, c_1 \in \mathcal{C}$ are two different colors.
• Node deletion policy $\pi = \pi_{\mathrm{ND}}$. The bag of graphs for this policy is also defined as $B^\pi_G = \{\!\{G_v : v \in V\}\!\}$, but each graph $G_v = (V, E_v)$ has a different edge set $E_v := E \backslash \{\{v, w\} : w \in N_G(v)\}$. Intuitively, it removes all edges that connect to node $v$ and thus makes $v$ an isolated node. The initial coloring is chosen as a constant $\chi^0_{G_i}(v) = c_0$ for all $v \in V$ and $G_i \in B^\pi_G$ for some fixed color $c_0 \in \mathcal{C}$.
• Ego network policy $\pi = \pi_{\mathrm{EGO}(k)}$. In this policy, we also have $B^\pi_G = \{\!\{G_v : v \in V\}\!\}$, $G_v = (V, E_v)$. The edge set $E_v$ is defined as $E_v := \{\{u, w\} \in E : \mathrm{dis}_G(u, v) \le k, \mathrm{dis}_G(w, v) \le k\}$, which corresponds to a subgraph containing all the k-hop neighbors of $v$ and isolating other nodes. The initial coloring is chosen as $\chi^0_{G_i}(v) = c_0$ for all $v \in V$ and $G_i \in B^\pi_G$, where $c_0 \in \mathcal{C}$ is a constant. One can also consider the ego network policy with marking $\pi = \pi_{\mathrm{EGOM}(k)}$, by marking the initial color of the special node $v$ for each $G_v$.

We note that for all the above policies, $|B^\pi_G| = |V|$. There are other choices such as the edge deletion policy (Bevilacqua et al., 2022), but we do not discuss them in this paper. A straightforward analysis yields that DSS-WL with any of the above policies is strictly more powerful than the classic 1-WL algorithm. Also, the node marking policy has been shown to be no less powerful than the node deletion policy (Papp & Wattenhofer, 2022).

Finally, we highlight that Bevilacqua et al. (2022); Cotta et al. (2021) also proposed a weaker version of DSS-WL, called the DS-WL algorithm. The difference is that for DS-WL, Lines 6 and 7 in Algorithm 3 are replaced by a simple 1-WL aggregation: $\chi^t_{G_i}(v) := \mathrm{hash}\left(\chi^{t-1}_{G_i}(v), \{\!\{\chi^{t-1}_{G_i}(u) : u \in N_{G_i}(v)\}\!\}\right)$.
However, the original formulation of DS-WL (Bevilacqua et al., 2022) only outputs a graph representation $\{\!\{\{\!\{\chi_{G_i}(v) : v \in V\}\!\} : G_i \in B^\pi_G\}\!\}$ rather than a color for each node, which does not suit node-level tasks (e.g., finding cut vertices). Nevertheless, there are simple adaptations that make DS-WL output a color mapping $\chi_G$. We will study these adaptations in Appendix C.2 (see the paragraph above Proposition C.16) and discuss their limitations compared with DSS-WL.
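The DSS-WL update with the node marking policy can be sketched as follows. This is an illustrative sketch of Algorithm 3 under the assumption $G_v = G$ for every marked node $v$ (so $N_{G_i}(v) = N_G(v)$), with the hash emulated by shared dictionaries; the function names and example graphs are hypothetical and not taken from the ESAN code base.

```python
from collections import Counter

def dss_wl_node_marking(graphs, num_iters):
    """DSS-WL with node marking, run jointly on several graphs.

    graphs: list of dicts vertex -> set of neighbors.
    Returns one node coloring chi_G (dict vertex -> int) per graph.
    """
    def aggregate(sub, adj, palette):
        # chi_G(v) := hash of the multiset of colors of v across the bag
        return {
            v: palette.setdefault(tuple(sorted(sub[s][v] for s in adj)),
                                  len(palette))
            for v in adj
        }

    # sub[s][v]: color of v in the copy of G whose marked node is s
    subs = [{s: {v: int(v == s) for v in adj} for s in adj} for adj in graphs]
    init_palette = {}
    chis = [aggregate(sub, adj, init_palette) for sub, adj in zip(subs, graphs)]

    for _ in range(num_iters):
        pal_sub, pal_node = {}, {}
        new_subs = []
        for adj, sub, chi in zip(graphs, subs, chis):
            new_sub = {}
            for s in adj:
                c = sub[s]
                new_sub[s] = {
                    v: pal_sub.setdefault(
                        (c[v], tuple(sorted(c[u] for u in adj[v])),
                         chi[v], tuple(sorted(chi[u] for u in adj[v]))),
                        len(pal_sub))
                    for v in adj
                }
            new_subs.append(new_sub)
        subs = new_subs
        chis = [aggregate(sub, adj, pal_node) for sub, adj in zip(subs, graphs)]
    return chis

def dss_wl_distinguishes(g, h):
    chi_g, chi_h = dss_wl_node_marking([g, h], max(len(g), len(h)))
    return Counter(chi_g.values()) != Counter(chi_h.values())

def cycle(nodes):
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

hexagon = cycle(range(6))
two_triangles = {**cycle(range(3)), **cycle(range(3, 6))}
```

Marking a node lets information about the distance to that node propagate, which is enough to separate the hexagon from two disjoint triangles in this sketch. The bag contains $|V|$ graphs, so each iteration costs roughly $|V|$ times a 1-WL iteration.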

B.5 GENERALIZED DISTANCE WL (GD-WL)

In this paper, we study a new variant of the color refinement algorithm, called the Generalized Distance WL (GD-WL). The complete algorithm is described below. As a special case, when choosing $d_G = \mathrm{dis}_G$, the resulting algorithm is called the Shortest Path Distance WL (SPD-WL), which is strictly more powerful than the classic 1-WL.

Algorithm 4: The Generalized Distance Weisfeiler-Lehman Algorithm
Input: Graph $G = (V, E)$, distance metric $d_G : V \times V \to \mathbb{R}_+$, and the number of iterations $T$
Output: Color mapping $\chi_G : V \to \mathcal{C}$
1  Initialize: Pick a fixed (arbitrary) element $c_0 \in \mathcal{C}$, and let $\chi_G^0(v) := c_0$ for all $v \in V$
2  for $t \leftarrow 1$ to $T$ do
3      for each $v \in V$ do
4          $\chi_G^t(v) := \mathrm{hash}\left(\{\!\{(d_G(v, u), \chi_G^{t-1}(u)) : u \in V\}\!\}\right)$
5  Return: $\chi_G^T$

C PROOF OF THEOREMS

This section provides all the missing proofs in this paper. For the convenience of reading, we will restate each theorem before giving a proof.
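For reference alongside the proofs, the GD-WL update of Algorithm 4 (Appendix B.5) can be sketched as below with the shortest path distance as the metric, i.e., SPD-WL. This is a sketch under stated assumptions: distances are computed by breadth-first search, unreachable pairs are assigned `float("inf")`, and the names `shortest_path_distances`, `gd_wl` and `spd_wl_distinguishes` are introduced here. Any other distance table of the same shape could be passed in place of the SPD table.

```python
from collections import Counter, deque

def shortest_path_distances(adj):
    """All-pairs shortest path distances via BFS from every vertex."""
    dist = {}
    for s in adj:
        d = {v: float("inf") for v in adj}   # inf marks unreachable vertices
        d[s] = 0
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if d[y] == float("inf"):
                    d[y] = d[x] + 1
                    queue.append(y)
        dist[s] = d
    return dist

def gd_wl(graphs, dists, num_iters):
    """GD-WL run jointly on several graphs with given distance tables."""
    colorings = [{v: 0 for v in adj} for adj in graphs]
    for _ in range(num_iters):
        palette = {}   # shared injective "hash"
        colorings = [
            {v: palette.setdefault(
                tuple(sorted((d[v][u], col[u]) for u in adj)), len(palette))
             for v in adj}
            for adj, d, col in zip(graphs, dists, colorings)
        ]
    return colorings

def spd_wl_distinguishes(g, h):
    dists = [shortest_path_distances(g), shortest_path_distances(h)]
    col_g, col_h = gd_wl([g, h], dists, max(len(g), len(h)))
    return Counter(col_g.values()) != Counter(col_h.values())

def cycle(nodes):
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

hexagon = cycle(range(6))
two_triangles = {**cycle(range(3)), **cycle(range(3, 6))}
```

Because every vertex aggregates over all of $V$ with the distance attached, each iteration costs $O(|V|^2)$ once the distance table is available.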

C.1 PROPERTIES OF COLOR REFINEMENT ALGORITHMS

In this subsection, we first derive several important properties that are shared by a general class of color refinement algorithms. They will serve as key lemmas in our subsequent proofs. Here, a general color refinement algorithm takes a graph $G = (V_G, E_G)$ as input and calculates a color mapping $\chi_G : V_G \to \mathcal{C}$. We first define a concept called the WL-condition.

Definition C.1. A color mapping $\chi_G : V_G \to \mathcal{C}$ is said to satisfy the WL-condition if for any two vertices $u, v$ with the same color (i.e., $\chi_G(u) = \chi_G(v)$) and any color $c \in \mathcal{C}$, $|N_G(u) \cap \chi_G^{-1}(c)| = |N_G(v) \cap \chi_G^{-1}(c)|$, where $\chi_G^{-1}$ is the inverse mapping of $\chi_G$.

Remark C.2. The WL-condition can be further generalized to handle two graphs. Let $\chi_G : V_G \to \mathcal{C}$ and $\chi_H : V_H \to \mathcal{C}$ be two color mappings obtained by applying the same color refinement algorithm to graphs $G$ and $H$, respectively. $\chi_G$ and $\chi_H$ are said to jointly satisfy the WL-condition if for any two vertices $u \in V_G$ and $v \in V_H$ with the same color ($\chi_G(u) = \chi_H(v)$) and any color $c \in \mathcal{C}$, $|N_G(u) \cap \chi_G^{-1}(c)| = |N_H(v) \cap \chi_H^{-1}(c)|$. It clearly implies Definition C.1 by choosing $G = H$.

It is easy to see that the classic 1-WL algorithm (Algorithm 1) satisfies the WL-condition. In fact, many of the algorithms presented in this paper satisfy such a condition, as we will show below, such as DSS-WL (Algorithm 3), SPD-WL (Algorithm 4 with $d_G = \mathrm{dis}_G$), and k-FWL (Algorithm 2).

Proof. We first prove the second bullet of Proposition C.3. By definition of the DSS-WL aggregation procedure (Line 6 in Algorithm 3), $\chi_{G_i}(u) = \chi_{H_j}(v)$ already implies $\{\!\{\chi_{G_i}(w) : w \in N_{G_i}(u)\}\!\} = \{\!\{\chi_{H_j}(w) : w \in N_{H_j}(v)\}\!\}$. Namely, $|N_{G_i}(u) \cap \chi_{G_i}^{-1}(c)| = |N_{H_j}(v) \cap \chi_{H_j}^{-1}(c)|$ holds for any $c \in \mathcal{C}$. We then turn to the first bullet. If $\chi_G(u) = \chi_H(v)$, then $\{\!\{\chi_{G_i}(u) : i \in [m_G]\}\!\} = \{\!\{\chi_{H_j}(v) : j \in [m_H]\}\!\}$ (Line 7 in Algorithm 3). Then there exists a pair of indices $i \in [m_G]$ and $j \in [m_H]$ such that $\chi_{G_i}(u) = \chi_{H_j}(v)$. By definition of the DSS-WL aggregation, it implies $\{\!\{\chi_G(w) : w \in N_G(u)\}\!\} = \{\!\{\chi_H(w) : w \in N_H(v)\}\!\}$.

Proof. If $\chi_G(u) = \chi_H(v)$ for some nodes $u, v$, then by the update rule (Line 4 in Algorithm 4) $\{\!\{(\mathrm{dis}_G(u, w), \chi_G(w)) : w \in V\}\!\} = \{\!\{(\mathrm{dis}_G(v, w), \chi_G(w)) : w \in V\}\!\}$. Since $w \in N_G(u)$ if and only if $\mathrm{dis}_G(u, w) = 1$, we have $\{\!\{\chi_G(w) : w \in N_G(u)\}\!\} = \{\!\{\chi_G(w) : w \in N_G(v)\}\!\}$. Therefore, for any $c \in \mathcal{C}$, $|N_G(u) \cap \chi_G^{-1}(c)| = |N_G(v) \cap \chi_G^{-1}(c)|$.

Proposition C.5. Let $\chi_G$ and $\chi_H$ be two vertex color mappings returned by the k-FWL algorithm ($k \ge 2$). Then $\chi_G$ and $\chi_H$ jointly satisfy the WL-condition.

Proof. Let $\chi_G(u) = \chi_H(v)$ for some $u \in V_G$ and $v \in V_H$. By the update formula (Line 4 in Algorithm 2), $\{\!\{\chi_G(u, \cdots, u, w) : w \in V_G\}\!\} = \{\!\{\chi_H(v, \cdots, v, w) : w \in V_H\}\!\}$. Note that for any nodes $w_1 \in V_G$, $w_2 \in V_H$ and any $x_1 \in N_G(w_1)$, $x_2 \notin N_H(w_2)$, one has $\chi_G(w_1, \cdots, w_1, x_1) \ne \chi_H(w_2, \cdots, w_2, x_2)$. This follows from the definition of the initialization mapping $\chi_G^0$ and the fact that $\chi_G$ refines $\chi_G^0$. Consequently, $\{\!\{\chi_G(u, \cdots, u, w) : w \in N_G(u)\}\!\} = \{\!\{\chi_H(v, \cdots, v, w) : w \in N_H(v)\}\!\}$. Next, we can use the fact that if $\chi_G(u, \cdots, u, w_1) = \chi_H(v, \cdots, v, w_2)$ for some $w_1 \in V_G$, $w_2 \in V_H$, then $\chi_G(w_1) = \chi_H(w_2)$ (see Lemma C.6). Therefore, $\{\!\{\chi_G(w) : w \in N_G(u)\}\!\} = \{\!\{\chi_H(w) : w \in N_H(v)\}\!\}$, which concludes the proof.

To complete the proof of Proposition C.5, it remains to prove the following lemma:

Lemma C.6. Let $\chi_G$ and $\chi_H$ be color mappings for graphs $G$ and $H$ in the k-FWL algorithm ($k \ge 2$). Denote $\mathrm{cat}_{i,j}(w, x) := (\underbrace{w, \cdots, w}_{i \text{ times}}, \underbrace{x, \cdots, x}_{j \text{ times}})$. Then for any $i \in [k-1]$ and any nodes $u, w \in V_G$, $v, x \in V_H$, if $\chi_G(\mathrm{cat}_{k-i,i}(u, w)) = \chi_H(\mathrm{cat}_{k-i,i}(v, x))$, then $\chi_G(\mathrm{cat}_{k-i-1,i+1}(u, w)) = \chi_H(\mathrm{cat}_{k-i-1,i+1}(v, x))$. Consequently, $\chi_G(w) = \chi_H(x)$.

Proof.
By the update formula (Line 4 in Algorithm 2), $\chi_G(\mathrm{cat}_{k-i,i}(u, w)) = \chi_H(\mathrm{cat}_{k-i,i}(v, x))$ implies that
$$\{\!\{\chi_G(\mathrm{cat}_{k-i-1,1,i}(u, y, w)) : y \in V_G\}\!\} = \{\!\{\chi_H(\mathrm{cat}_{k-i-1,1,i}(v, y, x)) : y \in V_H\}\!\}.$$
Note that for any $j \in [k-1]$ and any $\mathbf{z} \in V_G^k$, $\mathbf{z}' \in V_H^k$ with $z_j = z_{j+1}$ but $z'_j \ne z'_{j+1}$, one has $\chi_G(\mathbf{z}) \ne \chi_H(\mathbf{z}')$. This follows from the definition of the initialization mapping $\chi_G^0$ and the fact that $\chi_G$ refines $\chi_G^0$. Applying this with $j = k-i$, the element with $y = w$ on the left-hand side can only be matched with the element with $y = x$ on the right-hand side. Therefore, we have $\chi_G(\mathrm{cat}_{k-i-1,i+1}(u, w)) = \chi_H(\mathrm{cat}_{k-i-1,i+1}(v, x))$, as desired.

Equipped with the concept of the WL-condition, we now present several key results. In the following, let $\chi_G : V_G \to \mathcal{C}$ and $\chi_H : V_H \to \mathcal{C}$ be two color mappings jointly satisfying the WL-condition.

Lemma C.7. Let $(v_0, \cdots, v_d)$ be any path (not necessarily simple) of length $d$ in graph $G$. Then for any node $u_0 \in \chi_H^{-1}(\chi_G(v_0))$ in graph $H$, there exists a path $(u_0, \cdots, u_d)$ of the same length $d$ starting at $u_0$, such that $\chi_H(u_i) = \chi_G(v_i)$ holds for all $i \in [d]$.

Proof. The proof is based on induction over the path length $d$. For the base case of $d = 1$, if the conclusion does not hold, then there exist two vertices $u \in V_G$, $v \in V_H$ with the same color (i.e., $\chi_G(u) = \chi_H(v)$) and a color $c = \chi_G(v_1)$ such that $N_G(u) \cap \chi_G^{-1}(c) \ne \emptyset$ but $N_H(v) \cap \chi_H^{-1}(c) = \emptyset$. This obviously contradicts the WL-condition. For the induction step on the path length $d$, one can just split the path into two parts $(v_0, \cdots, v_{d-1})$ and $(v_{d-1}, v_d)$. Separately using induction yields two paths $(u_0, \cdots, u_{d-1})$ and $(u_{d-1}, u_d)$ such that $\chi_H(u_i) = \chi_G(v_i)$ for all $i \in [d]$. Linking the two paths completes the proof.

Finally, let us define the shortest path distance between node $u$ and vertex set $S$ as $\mathrm{dis}_G(u, S) := \min_{v \in S} \mathrm{dis}_G(u, v)$. The above lemma directly yields the following corollary:

Corollary C.8. For any color $c \in \{\chi_G(w) : w \in V_G\}$ and any two vertices $u \in V_G$, $v \in V_H$ with the same color (i.e., $\chi_G(u) = \chi_H(v)$), $\mathrm{dis}_G(u, \chi_G^{-1}(c)) = \mathrm{dis}_H(v, \chi_H^{-1}(c))$.
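The WL-condition of Definition C.1 is easy to check mechanically for a given coloring. The following sketch does so for a single graph; the function name `satisfies_wl_condition`, the adjacency-dict input, and the string colors are assumptions made for this illustration.

```python
from collections import Counter

def satisfies_wl_condition(adj, color):
    """Check Definition C.1 for a coloring of one graph.

    adj: dict vertex -> set of neighbors; color: dict vertex -> color.
    Two vertices of the same color must see every color the same number
    of times among their neighbors.
    """
    profile = {}   # color -> neighbor-color counts of the first vertex seen
    for v in adj:
        counts = frozenset(Counter(color[u] for u in adj[v]).items())
        if profile.setdefault(color[v], counts) != counts:
            return False
    return True

path3 = {0: {1}, 1: {0, 2}, 2: {1}}
stable = {0: "a", 1: "b", 2: "a"}      # endpoints vs. middle vertex
constant = {0: "a", 1: "a", 2: "a"}    # ignores the differing degrees
```

On the three-vertex path, the coloring that separates the endpoints from the middle vertex satisfies the condition, while the constant coloring does not, because the middle vertex has two neighbors of color "a" and each endpoint has only one.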

C.2 COUNTEREXAMPLES

We provide the following two families of counterexamples, which most prior works cannot distinguish.

Example C.9. Let $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ be a pair of graphs with $n = 2km + 1$ nodes, where $m, k$ are two positive integers satisfying $mk \ge 3$. Denote $V = [n]$ and define the edge sets as follows:
$$E_1 = \{\{i, (i \bmod 2km) + 1\} : i \in [2km]\} \cup \{\{n, i\} : i \in [2km], i \bmod m = 0\},$$
$$E_2 = \{\{i, (i \bmod km) + 1\} : i \in [km]\} \cup \{\{i + km, (i \bmod km) + km + 1\} : i \in [km]\} \cup \{\{n, i\} : i \in [2km], i \bmod m = 0\}.$$
See Figure 2(a-c) for an illustration of three cases: (i) $m = 2, k = 2$; (ii) $m = 4, k = 1$; (iii) $m = 1, k = 4$. It is easy to see that regardless of the choice of $m$ and $k$, $G_1$ never has a cut vertex, whereas $G_2$ always has a cut vertex, namely node $n$. The case of $k = 1$ is special: $G_2$ then has three cut vertices, namely nodes $m$, $2m$, and $n$, and it even has two cut edges $\{m, n\}$ and $\{2m, n\}$ (Figure 2(b)).

Example C.10. Let $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ be a pair of graphs with $n = 2m$ nodes, where $m \ge 3$ is an arbitrary integer. Denote $V = [n]$ and define the edge sets as follows:
$$E_1 = \{\{i, (i \bmod n) + 1\} : i \in [n]\} \cup \{\{m, 2m\}\},$$
$$E_2 = \{\{i, (i \bmod m) + 1\} : i \in [m]\} \cup \{\{i + m, (i \bmod m) + m + 1\} : i \in [m]\} \cup \{\{m, 2m\}\}.$$

Proof for Example C.9. Let $H_i$ be a tree with fewer than $m$ vertices, where $m$ is defined in Example C.9. By symmetry of the two graphs $G_1$ and $G_2$, it suffices to prove the following two types of equations: $x^V_{G_1}(n) = x^V_{G_2}(n)$ and $x^V_{G_1}(i) = x^V_{G_2}(i)$ for all $m < i \le 2m$, where $x^V$ is defined in (8).

We first aim to prove that $x^V_{G_1}(v) = x^V_{G_2}(v)$ for $v \in \{m + 1, \cdots, 2m\}$. Consider an induced subgraph $G_1[S]$ which is isomorphic to $H_i$ and contains node $v$. Define the set $T := \{jm : j \in [k]\} \cap S$. For ease of presentation, we define an operation $\mathrm{cir}(x, a, b)$ that outputs an integer $y$ in the range $(a, b]$ such that $y$ has the same remainder as $x$ (mod $b - a$). Formally, $\mathrm{cir}(x, a, b) = y$ if $a < y \le b$ and $x \equiv y \pmod{b - a}$.
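Before continuing with the case analysis, the objects used so far can be made concrete. The sketch below implements the operation cir and builds the two edge sets of Example C.9, then confirms the stated cut vertices by brute force (removing each vertex and testing connectivity). The helper names `example_c9` and `cut_vertices` are assumptions of this illustration, and the brute-force check is only meant for small $m$ and $k$.

```python
def cir(x, a, b):
    """The integer y with a < y <= b and y congruent to x modulo (b - a)."""
    return a + 1 + (x - a - 1) % (b - a)

def example_c9(m, k):
    """Vertex count n and edge sets E_1, E_2 of Example C.9 (vertices 1..n)."""
    assert m * k >= 3
    km = k * m
    n = 2 * km + 1
    spokes = {frozenset((n, i)) for i in range(1, 2 * km + 1) if i % m == 0}
    e1 = {frozenset((i, i % (2 * km) + 1)) for i in range(1, 2 * km + 1)} | spokes
    e2 = ({frozenset((i, i % km + 1)) for i in range(1, km + 1)}
          | {frozenset((i + km, i % km + km + 1)) for i in range(1, km + 1)}
          | spokes)
    return n, e1, e2

def cut_vertices(n, edges):
    """Brute force: v is a cut vertex if deleting it disconnects the rest."""
    result = set()
    for v in range(1, n + 1):
        rest = [u for u in range(1, n + 1) if u != v]
        adj = {u: set() for u in rest}
        for e in edges:
            if v not in e:
                a, b = tuple(e)
                adj[a].add(b)
                adj[b].add(a)
        seen, stack = {rest[0]}, [rest[0]]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if len(seen) < len(rest):
            result.add(v)
    return result

n22, e1_22, e2_22 = example_c9(2, 2)   # case (i):  m = 2, k = 2
n41, e1_41, e2_41 = example_c9(4, 1)   # case (ii): m = 4, k = 1
```

For both parameter choices, $G_1$ and $G_2$ have the same number of edges and the same degree sequence, yet only $G_2$ has cut vertices, in line with the description in Example C.9.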

• If $n \notin S$, then it is easy to see that $G_1[S]$ is a chain, i.e., no vertex has degree larger than 2. We define the following mapping $g_S : S \to [n]$, such that
$$g_S(u) = \begin{cases} \mathrm{cir}(u, m, 2m) & \text{if } k = 1, \\ \mathrm{cir}(u, 0, km) & \text{if } k \ge 2. \end{cases}$$
In this way, the chain $G_1[S]$ is mapped to a chain of $G_2$ that contains $v$. Concretely, denote $g_S(S) = \{g_S(u) : u \in S\}$; then $G_2[g_S(S)] \simeq G_1[S] \simeq H_i$, and obviously the orbit of $v$ in $G_2[g_S(S)]$ matches the orbit of $v$ in $G_1[S]$. See Figure 3(a, b) for an illustration of this case.
• If $n \in S$, then it is easy to see that the set $T \ne \emptyset$. We will similarly construct a mapping $g_S : S \to [n]$ that maps $S$ to $g_S(S)$ satisfying $g_S(v) = v$, which is defined as follows. For each $u \in S \backslash \{n\}$, we find the unique vertex $w_u$ in $T$ such that $\mathrm{dis}_{G_1[S]}(u, w_u)$ is the minimum. Note that the node $w_u$ is well-defined since $T \ne \emptyset$ and any path in $G_1[S]$ from $u$ to a node in $T$ goes through $w_u$. Define
$$g_S(u) = \begin{cases} \mathrm{cir}(u, m, 2m) & \text{if } k = 1 \text{ and } w_u = w_v, \\ \mathrm{cir}(u, 0, m) & \text{if } k = 1 \text{ and } w_u \ne w_v, \\ \mathrm{cir}(u, 0, km) & \text{if } k > 1 \text{ and } w_u \le km, \\ \mathrm{cir}(u, km, 2km) & \text{if } k > 1 \text{ and } w_u > km. \end{cases}$$
We also define $g_S(n) = n$. Such a definition guarantees that for any $x_1, x_2 \in S$, $\{x_1, x_2\} \in E_{G_1} \iff \{g_S(x_1), g_S(x_2)\} \in E_{G_2}$. Therefore, $G_2[g_S(S)] \simeq G_1[S] \simeq H_i$. Moreover, observe that $g_S(u) \equiv u \pmod m$ always holds, and thus it is easy to see that the orbit of $v$ in $G_2[g_S(S)]$ matches the orbit of $v$ in $G_1[S]$. See Figure 3(c, d) for an illustration of this case.

[Figure 3. Panels: (a) $n \notin S$, $k = 1$; (b) $n \notin S$, $k > 1$; (c) $n \in S$, $k = 1$; (d) $n \in S$, $k > 1$.]

Finally, note that for any two different sets $S_1$ and $S_2$ such that $G_1[S_1] \simeq G_1[S_2] \simeq H_i$, we have $g_{S_1}(S_1) \ne g_{S_2}(S_2)$, which guarantees that the mapping $g : \{S \subset [n] : G_1[S] \simeq H_i, v \in S\} \to \{S \subset [n] : G_2[S] \simeq H_i, v \in S\}$ defined by $g(S) = g_S(S)$ is injective. One can further check that the mapping $g$ is also surjective, and thus it is bijective. This means $x^V_{G_1}(v) = x^V_{G_2}(v)$ for $v \in \{m + 1, \cdots, 2m\}$.

The proof for $x^V_{G_1}(n) = x^V_{G_2}(n)$ is almost the same, so we omit it here. Note that under classic 1-WL, the colors $\chi_{G_1}(v) = \chi_{G_2}(v)$ are also the same. Therefore, adding the features $x^V(v)$ does not help distinguish the two graphs. We have finished the proof for Example C.9.

Using a similar cycle analysis as in the above proof, we have the following negative result for Simplicial WL (Bodnar et al., 2021b) and Cellular WL (Bodnar et al., 2021a):

Proposition C.12. Consider the SWL algorithm (Bodnar et al., 2021b), or more generally, the CWL algorithms with either k-CL, k-IC, or k-C as lifting maps ($k \ge 3$ is an integer) (Bodnar et al., 2021a, Definition 14). These algorithms can neither distinguish whether a given graph has cut vertices nor distinguish whether it has cut edges.

Proof. Observe that the counterexample graphs in both Examples C.9 and C.10 do not have cliques. Therefore, SWL (or CWL with k-CL) reduces to the classic 1-WL and thus fails to distinguish them.
Since the length of any cycle in these counterexample graphs is at least $m$ ($m$ is defined in Examples C.9 and C.10), CWL with $k$-IC or $k$-C also reduces to 1-WL when $m > k$. Therefore, there exist graphs whose size is $O(k)$ such that CWL can distinguish neither cut vertices nor cut edges. Finally, we point out that even if $k$ is not a constant (i.e., can scale with the graph size), CWL with $k$-IC still fails to distinguish whether a given graph has cut vertices. This is because for Example C.9 with $k \ge 2$ (e.g., Figure 2(b, c)), CWL with IC still outputs the same graph representation for both $G_1$ and $G_2$. This happens because all the 2-dimensional cells in these examples are cycles of equal length $m + 2$, and one can easily check that they have the same CWL color.

We finally turn to the case of subgraph-based WL variants.

Proposition C.13. The Overlap Subgraph WL (OS-WL) (Wijesinghe & Wang, 2022) using any subgraph mapping $\omega$ can neither distinguish whether a given graph has cut vertices nor distinguish whether it has cut edges.

Proof. An important limitation of OS-WL is that if a graph does not contain triangles, then any overlap subgraph $S_{uv}$ between two adjacent nodes $u, v$ will only have one edge $\{u, v\}$. Consequently, the subgraph mapping $\omega$ does not take effect and OS-WL reduces to the classic 1-WL. Therefore, Example C.9 with $m > 1$ and Example C.10 with $m > 3$ still apply here since the graphs $G_1$ and $G_2$ do not contain triangles (see Figure 2(a, b, d)). Moreover, Example C.9 with $m = 1$ (see Figure 2(c)) is also a counterexample, as discussed in Wijesinghe & Wang (2022, Figure 2(a)).

Proposition C.14. The DSS-WL with ego network policy without marking cannot distinguish the graphs in Example C.9 with $m = 1$ (Figure 2(c)).

Proof. First note that for any two vertices $u, v$ in either $G_1$ or $G_2$ defined in Example C.9, their shortest path distance does not exceed 2. Thus we only need to consider the ego network policies $\pi_{\mathrm{EGO}(1)}$ and $\pi_{\mathrm{EGO}(2)}$.
• For $\pi_{\mathrm{EGO}(2)}$, the ego graphs of all nodes are simply the original graph, and thus all graphs in the bag $B^\pi$ are equal. Thus DSS-WL reduces to the classic 1-WL and cannot distinguish $G_1$ and $G_2$.
• For $\pi_{\mathrm{EGO}(1)}$, the ego graph of each node $v \ne n$ is a graph with 5 edges, having the shape of two triangles sharing one edge. These ego graphs are clearly isomorphic. The ego graph of the special node $n$ is the original graph containing all edges. It is easy to see that the vertex partition of DSS-WL becomes stable after only one iteration, and the color mapping of $G_1$ and $G_2$ is the same. Therefore, DSS-WL cannot distinguish $G_1$ and $G_2$.

We thus conclude the proof.

Proposition C.15. The GNN-AK architecture proposed in Zhao et al. (2022) cannot distinguish whether a given graph has cut vertices.

Proof. The GNN-AK architecture is quite similar to DSS-WL using the ego network policy but is weaker. There is also a subtle difference: GNN-AK adds the so-called centroid encoding. However, unlike node marking, which is performed before the WL procedure, centroid encoding is performed after the WL procedure. This subtle difference causes GNN-AK to be unable to distinguish between the two graphs $G_1$ and $G_2$.

We finally consider the DS-WL algorithm proposed in Cotta et al. (2021); Bevilacqua et al. (2022). As discussed in Appendix B.4, the original DS-WL formulation only outputs a graph representation rather than node colors. There are two simple ways to define node colors for DS-WL:

• If the graph generation policy $\pi$ is node-based, then each subgraph in $B^\pi_G = \{\!\{G_i\}\!\}_{i=1}^{|V|}$ is uniquely associated to a specific node $v \in V$. We can thus use the graph representation of each subgraph $G_i$ as the color of the associated node. This strategy has appeared in prior works, e.g., Zhao et al. (2022).
• For a general graph generation policy $\pi$, there no longer exists an explicit bijective mapping between nodes and subgraphs.
In this case, another possible way is to define χ G (v) := {{χ Gi (v) : G i ∈ B π G }}, similar to DSS-WL. This approach is recently introduced by Qian et al. (2022) . However, such a strategy loses the memory advantage of DS-WL (i.e., needing Θ(|V||B π G |) memory complexity rather than Θ(|V|+|B π G |)), and is less expressive than DSS-WL. We thus do not study this variant in the present work. Proposition C.16. The DS-WL algorithm with node marking/deletion policy cannot distinguish cut vertices when each node's color is defined as its associated subgraph representation. Proof. One can similarly check that for Example C.9 with m = 1 (see Figure 2(c )), the color of node n will be the same for both graphs G 1 and G 2 . Therefore, DS-WL cannot identify cut vertices. Finally, using a similar proof technique, the NGNN architecture proposed in Zhang & Li (2021) (with shortest path distance encoding) cannot identify cut vertices.
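The counterexample family of Example C.9 can be checked numerically for small parameters. The sketch below builds both graphs from the edge sets of Example C.9, finds cut vertices by brute force, and runs plain 1-WL color refinement on the pair. All function names are hypothetical, and the 1-WL routine is a textbook refinement loop with a palette shared between the two graphs, not the implementation used in the paper.

```python
from collections import Counter, deque


def example_c9(m: int, k: int):
    """Adjacency dicts of G1 and G2 from Example C.9 (nodes 1..n, n = 2km + 1)."""
    assert m * k >= 3
    half, full = k * m, 2 * k * m
    n = full + 1
    spokes = {(n, i) for i in range(1, full + 1) if i % m == 0}
    e1 = {(i, i % full + 1) for i in range(1, full + 1)} | spokes
    e2 = ({(i, i % half + 1) for i in range(1, half + 1)}
          | {(i + half, i % half + half + 1) for i in range(1, half + 1)}
          | spokes)

    def to_adj(edges):
        adj = {v: set() for v in range(1, n + 1)}
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    return to_adj(e1), to_adj(e2)


def cut_vertices(adj):
    """Brute force for a connected graph: v is a cut vertex if removing it disconnects the rest."""
    result = []
    for v in adj:
        rest = set(adj) - {v}
        start = next(iter(rest))
        seen, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y != v and y not in seen:
                    seen.add(y)
                    queue.append(y)
        if seen != rest:
            result.append(v)
    return sorted(result)


def wl_histograms(adj1, adj2):
    """Run 1-WL on both graphs with a shared palette and return the two color histograms."""
    graphs = [adj1, adj2]
    colors = [{v: 0 for v in adj} for adj in graphs]
    for _ in range(len(adj1)):  # n rounds are enough for the partition to stabilize
        palette, refined = {}, []
        for adj, col in zip(graphs, colors):
            new = {}
            for v in adj:
                signature = (col[v], tuple(sorted(col[u] for u in adj[v])))
                new[v] = palette.setdefault(signature, len(palette))
            refined.append(new)
        colors = refined
    return [Counter(c.values()) for c in colors]


if __name__ == "__main__":
    for m, k in [(2, 2), (4, 1), (1, 4)]:
        g1, g2 = example_c9(m, k)
        h1, h2 = wl_histograms(g1, g2)
        print((m, k), "1-WL equal:", h1 == h2, cut_vertices(g1), cut_vertices(g2))
```

For the three parameter choices of Figure 2(a-c), the two color histograms coincide while only G_2 has cut vertices, which matches the claims made for Example C.9.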

C.3 PROOF OF THEOREM 3.2

Theorem C.17. Let G = (V, E G ) and H = (V, E H ) be two graphs, and let χ G and χ H be the corresponding DSS-WL color mapping with node marking policy. Then the following holds: • For any two nodes w ∈ V in G and x ∈ V in H, if χ G (w) = χ H (x), then w is a cut vertex in graph G if and only if x is a cut vertex in graph H. • For any two edges {w 1 , w 2 } ∈ E G and {x 1 , x 2 } ∈ E H , if {{χ G (w 1 ), χ G (w 2 )}} = {{χ H (x 1 ), χ H (x 2 )}}, then {w 1 , w 2 } is a cut edge if and only if {x 1 , x 2 } is a cut edge. Proof. We divide the proof into two parts in Appendices C.3.1 and C.3.2, separately focusing on proving each bullet of Theorem 3.2. Before going into the proof, we first define several notations. Denote χ u G (v) as the color of node v under the DSS-WL algorithm when marking u as a special node. By definition of DSS-WL (Line 7 in Algorithm 3), χ G (v) = hash ({{χ u G (v) : u ∈ V}}). We can similarly define the inverse mappings (χ u G ) -1 . We first present a lemma which can help us exclude the case of disconnected graphs. Lemma C.18. Given a node w, let S G (w) ⊂ V be the connected component in graph G that comprises node w. For any two nodes w ∈ V in G and x ∈ V in H, if χ G (w) = χ H (x), then χ G[S G (w)] (w) = χ H[S H (x)] (x). Proof. We first prove that if χ G (w) = χ H (x), then {{χ u G (w) : u ∈ S G (w)}} = {{χ u H (x) : u ∈ S H (x)}}. First note that for any nodes u, w in G and v, x in H, if u ∈ S G (w) but v / ∈ S H (x), then χ u G (w) ̸ = χ v H (x) . This is because DSS-WL only performs neighborhood aggregation, and the marking v cannot propagate to node x while the marking u can propagate to node w. By definition we have χ G (w) = hash ({{χ u G (w) : u ∈ S G (w)}} ∪ {{χ v G (w) : v / ∈ S G (w)}}) . Similarly, χ H (x) = hash ({{χ u H (x) : u ∈ S H (x)}} ∪ {{χ v H (x) : v / ∈ S H (x)}}) . Since χ G (w) = χ H (x), we have {{χ u G : u ∈ S G (w)}} = {{χ u H : u ∈ S H (x)}}. 
This clearly implies {{χ u G[S G (w)] : u ∈ S G (w)}} = {{χ u H[S H (x)] : u ∈ S H (x)}}, and thus χ G[S G (w)] (w) = χ H[S H (x)] (x). Note that w is a cut vertex in G implies w is a cut vertex in G[S G (w)]. Therefore, based on Lemma C.18, we can restrict our attention to subgraphs G[S G (w)] and H[S H (x)] instead of the original (potentially disconnected) graphs. In other words, in the subsequent proof we can simply assume that both graphs G and H are connected. We next present several simple but important properties regrading the DSS-WL color mapping as well as the subgraph color mappings. Lemma C.19. Let u, w be two nodes in connected graph G and v, x be two nodes in connected graph H. Then the following holds: (a) If w = u and x ̸ = v, then χ u G (w) ̸ = χ v H (x); (b) If χ u G (w) = χ v H (x), then χ G (w) = χ H (x); (c) If χ u G (w) = χ v H (x), then χ G (u) = χ H (v); (d) χ G (w) = χ H (x) if and only if χ w G (w) = χ x H (x); (e) If χ u G (w) = χ v H (x), then dis G (u, w) = dis H (v, x). Proof. Item (a) holds because in DSS-WL, the node with marking cannot have the same color as a node without marking. This can be rigorously proved by induction over the iteration t in the DSS-WL algorithm (Line 6 in Algorithm 3). Item (b) simply follows by definition of the DSS-WL aggregation procedure since the color χ u G (w) encodes the color of χ G (w). We next prove item (c), which follows by using the WL-condition of DSS-WL algorithm (Proposition C.3). Since G is connected, there is a path from w to u. Therefore, in graph H there is also a path from x to some node v ′ satisfying χ u G (u) = χ v H (v ′ ) (Lemma C.7). Now using item (a), it can only be the case v ′ = v and thus χ u G (u) = χ v H (v). Finally, by item (b) we obtain the desired result. We next prove item (d). On the one hand, item (b) already shows that χ w G (w) = χ x G (x) =⇒ χ G (w) = χ H (x). 
On the other hand, by definition of the DSS-WL algorithm, χ G (w) = hash ({{χ w G (w)}} ∪ {{χ u G (w) : u ∈ V\{w}}}) , χ H (x) = hash ({{χ x H (x)}} ∪ {{χ v H (x) : v ∈ V\{x}}}) . Since χ G (w) = χ H (x) and χ w G (w) ̸ = χ v H (x) holds for all v ∈ V\{x} (by item (a)), we obtain χ w G (w) = χ x G (x). We finally prove item (e), which again can be derived from the WL-condition of DSS-WL algorithm . If χ u G (w) = χ v H (x), then by Corollary C.8 we have dis G (w, (χ u G ) -1 (χ u G (u))) = dis H (x, (χ v H ) -1 (χ u G (u))). Using item (a), we have (χ u G ) -1 (χ u G (u)) = {u} and for any v ′ ̸ = v, χ v H (v ′ ) ̸ = χ v H (v) . Therefore, it can only be the case that (χ v H ) -1 (χ u G (u)) = {v} and χ v H (v) = χ u G (u) . This yields dis G (u, w) = dis G (v, x) and concludes the proof. C.3.1 PROOF FOR THE FIRST PART OF THEOREM 3.2 The following technical lemma is useful in the subsequent proof: Lemma C.20. Let u, v ∈ V be two nodes in connected graphs G and H, respectively. If  χ u G (u) = χ v H (v), then {{χ u G (w) : w ∈ V}} = {{χ v H (w) : w ∈ V}}. Proof. Let N d G (u) := {w ∈ V : dis G (u, w) = d} satisfying χ u G (x 1 ) = χ v H (x 2 ), {{χ u G (w) : w ∈ N G (x 1 )}} = {{χ v H (w) : w ∈ N H (x 2 )}}. Therefore, by the induction assumption C d G = C d H , x∈N d G (u) {{χ u G (w) : w ∈ N G (x)}} = x∈N d H (v) {{χ v H (w) : w ∈ N H (x)}}. We next claim that C d G ∩ C d ′ G = ∅ for any d ̸ = d ′ . This is because for any nodes w 1 and w 2 with the same color χ u G (w 1 ) = χ u G (w 2 ), by Lemma C.19(e) we have dis G (w 1 , u) = dis G (w 2 , u). Using this property, we obtain x∈N d G (u) {{χ u G (w) : w ∈ N G (x) ∩ N d+1 G (u)}} = x∈N d H (v) {{χ v H (w) : w ∈ N H (x) ∩ N d+1 H (v)}}. It is equivalent to the following equation: w∈N d+1 G (u) {{χ u G (w)}} × |N G (w) ∩ N d G (u)| = w∈N d+1 H (v) {{χ v H (w)}} × |N H (w) ∩ N d H (v)|. where {{c}} × m is a multiset containing m repeated elements c. 
Finally, observe that if $\chi_G^u(w_1) = \chi_H^v(w_2)$ for some nodes $w_1$ and $w_2$, then $|N_G(w_1) \cap N_G^d(u)| = |N_H(w_2) \cap N_H^d(v)|$ (because $C_G^d \cap C_G^{d'} = \emptyset$ for any $d \ne d'$). Consequently, $\{\!\{\chi_G^u(w) : w \in N_G^{d+1}(u)\}\!\} = \{\!\{\chi_H^v(w) : w \in N_H^{d+1}(v)\}\!\}$, namely $C_G^{d+1} = C_H^{d+1}$. We have thus completed the proof of the induction step.

We now present the following key result, which shows an important property of the color mapping for DSS-WL:

Corollary C.21. Let $u, v \in V$ be two nodes in connected graph $G$ with the same DSS-WL color, i.e., $\chi_G(u) = \chi_G(v)$. Then for any color $c \in C$, $\{\!\{\chi_G^u(w) : w \in \chi_G^{-1}(c)\}\!\} = \{\!\{\chi_G^v(w) : w \in \chi_G^{-1}(c)\}\!\}$.

Proof. First observe that if $\chi_G(u) = \chi_G(v)$, then $\chi_G^u(u) = \chi_G^v(v)$ (by Lemma C.19(d)). Consequently, $\{\!\{\chi_G^u(w) : w \in V\}\!\} = \{\!\{\chi_G^v(w) : w \in V\}\!\}$ holds by Lemma C.20. If $\{\!\{\chi_G^u(w) : w \in \chi_G^{-1}(c)\}\!\} \ne \{\!\{\chi_G^v(w) : w \in \chi_G^{-1}(c)\}\!\}$, then there must exist two nodes $w_1 \in \chi_G^{-1}(c)$ and $w_2 \notin \chi_G^{-1}(c)$ such that $\chi_G^u(w_1) = \chi_G^v(w_2)$. Therefore, by Lemma C.19(b) we have $\chi_G(w_1) = \chi_G(w_2)$, yielding a contradiction.

In the subsequent proof, we assume the connected graph $G$ is not vertex-biconnected and let $u \in V$ be a cut vertex in $G$. Let $\{S_i\}_{i=1}^m$ ($m \ge 2$) be the partition of the vertex set $V \setminus \{u\}$, representing each connected component after removing node $u$.

Lemma C.22. There is at most one set $S_i$ satisfying $S_i \cap \chi_G^{-1}(\chi_G(u)) \ne \emptyset$. In other words, if $S_i \cap \chi_G^{-1}(\chi_G(u)) \ne \emptyset$ for some $i \in [m]$, then for any $j \in [m]$ with $j \ne i$, $S_j \cap \chi_G^{-1}(\chi_G(u)) = \emptyset$.

Proof. When $|\chi_G^{-1}(\chi_G(u))| = 1$, the conclusion clearly holds. If $|\chi_G^{-1}(\chi_G(u))| > 1$, then we can pick a node $u_1 \in \chi_G^{-1}(\chi_G(u))$ that maximizes the shortest path distance $\mathrm{dis}_G(u_1, u)$. Let $u_1 \in S_i$ for some $i \in [m]$. If the lemma does not hold, then we can pick another node $u_2 \in \chi_G^{-1}(\chi_G(u))$ with $u_2 \notin S_i$. Since $u_1$ and $u_2$ are in different connected components after removing $u$, $\mathrm{dis}_G(u_1, u_2) = \mathrm{dis}_G(u_1, u) + \mathrm{dis}_G(u_2, u)$. See Figure 4(a) for an illustration of this paragraph.

By Corollary C.21, $\{\!\{\chi_G^{u_1}(w) : w \in \chi_G^{-1}(\chi_G(u))\}\!\} = \{\!\{\chi_G^u(w) : w \in \chi_G^{-1}(\chi_G(u))\}\!\}$. Therefore, there must exist a node $u_3 \in \chi_G^{-1}(\chi_G(u))$ satisfying $\chi_G^{u_1}(u_2) = \chi_G^u(u_3)$. We thus have $\mathrm{dis}_G(u_2, u_1) = \mathrm{dis}_G(u_3, u)$ by Lemma C.19(e). On the other hand, by definition of the node $u_1$, $\mathrm{dis}_G(u_1, u) \ge \mathrm{dis}_G(u_3, u)$. Therefore, $\mathrm{dis}_G(u_2, u_1) = \mathrm{dis}_G(u_1, u) + \mathrm{dis}_G(u_2, u) > \mathrm{dis}_G(u_3, u)$. This yields a contradiction and concludes the proof.

Lemma C.23. For all $u' \in \chi_G^{-1}(\chi_G(u))$, $u'$ is a cut vertex of $G$.

Proof. When $|\chi_G^{-1}(\chi_G(u))| = 1$, the conclusion clearly holds. Now assume $|\chi_G^{-1}(\chi_G(u))| > 1$. Since $u$ is a cut vertex in $G$, by Lemma C.22, there exists a set $S_j$ such that $S_j \cap \chi_G^{-1}(\chi_G(u)) = \emptyset$. Pick any node $w \in S_j$; then $\chi_G(w) \ne \chi_G(u)$. Let $u' \ne u$ be any node with color $\chi_G(u) = \chi_G(u')$. It follows that $\chi_G^u(u) = \chi_G^{u'}(u')$ by Lemma C.19(d). Based on the WL-condition of the mappings $\chi_G^u$ and $\chi_G^{u'}$, by Lemma C.7 there exists a node $w'$ with color $\chi_G^{u'}(w') = \chi_G^u(w)$ (because there is a path from node $u$ to $w$). See Figure 4(b) for an illustration of this paragraph.

Suppose $u'$ is not a cut vertex. Then there is a path $P$ from $w'$ to $u$ without going through node $u'$. Denote $P = (x_0, \cdots, x_d)$ where $x_0 = w'$ and $x_d = u$. It follows that $\chi_G^{u'}(x_i) \ne \chi_G^{u'}(u')$ for all $i \in [d]$ (by Lemma C.19(a)). Again by using the WL-condition, there exists a path $Q = (y_0, \cdots, y_d)$ satisfying $y_0 = w$ and $\chi_G^u(y_i) = \chi_G^{u'}(x_i)$ for all $i \in [d]$. In particular, $\chi_G^u(y_d) = \chi_G^{u'}(u)$, which implies $\chi_G(y_d) = \chi_G(u)$ by Lemma C.19(b). However, any path from $w$ to $y_d \in \chi_G^{-1}(\chi_G(u))$ must go through node $u$, implying that $\chi_G^u(y_i) = \chi_G^u(u)$ for some $i \in [d]$. However, we have proved that $\chi_G^u(y_i) = \chi_G^{u'}(x_i) \ne \chi_G^{u'}(u') = \chi_G^u(u)$, yielding a contradiction. Therefore, $u'$ is a cut vertex.

Using a similar proof technique as the one in Lemma C.23, we can prove the first part of Theorem 3.2. Suppose $u' \in \chi_H^{-1}(\chi_G(u))$ and we want to prove that $u'$ is a cut vertex of graph $H$. Observe that $|\chi_G^{-1}(\chi_G(u))| = |\chi_H^{-1}(\chi_H(u'))|$. (A simple proof is as follows: $\chi_G(u) = \chi_H(u')$ implies $\chi_G^u(u) = \chi_H^{u'}(u')$ by Lemma C.19(d), so the two multisets in Lemma C.20 coincide, and the claim follows from Lemma C.19(b).)

We first consider the case when $|\chi_G^{-1}(\chi_G(u))| = |\chi_H^{-1}(\chi_H(u'))| > 1$. Following the above proof, we can similarly pick $w \in S_j$ in $G$ and $w'$ in $H$ satisfying $\chi_G(w) \ne \chi_G(u)$ and $\chi_H^{u'}(w') = \chi_G^u(w)$. Since $|\chi_G^{-1}(\chi_G(u))| > 1$, we can pick a node $u_H \in \chi_H^{-1}(\chi_G(u))$ in $H$ such that $u_H \ne u'$. If $u'$ is not a cut vertex, then there is a path $P = (x_0, \cdots, x_d)$ in $H$ where $x_0 = w'$ and $x_d = u_H$, such that $\chi_H^{u'}(x_i) \ne \chi_H^{u'}(u')$ for all $i \in [d]$ (by Lemma C.19(a)). Using the WL-condition, there exists a path $Q = (y_0, \cdots, y_d)$ in $G$ satisfying $y_0 = w$ and $\chi_G^u(y_i) = \chi_H^{u'}(x_i)$ for all $i \in [d]$. In particular, $\chi_G^u(y_d) = \chi_H^{u'}(u_H)$, which implies $\chi_G(y_d) = \chi_H(u_H)$ by using Lemma C.19(b). However, any path from $w$ to $y_d \in \chi_G^{-1}(\chi_G(u))$ must go through node $u$, implying that $\chi_G^u(y_i) = \chi_G^u(u)$ for some $i \in [d]$. This yields a contradiction because $\chi_G^u(y_i) = \chi_H^{u'}(x_i) \ne \chi_H^{u'}(u') = \chi_G^u(u)$. See Figure 5(a) for an illustration of this paragraph.

We finally consider the case when $|\chi_G^{-1}(\chi_G(u))| = |\chi_H^{-1}(\chi_H(u'))| = 1$. Let $w \in S_1$ and $x \in S_2$ be two nodes in $G$ that belong to different connected components when removing node $u$; then $\chi_G(w) \ne \chi_G(u)$ and $\chi_G(x) \ne \chi_G(u)$. Since $\chi_G(u) = \chi_H(u')$, by the WL-condition (Lemma C.7) there is a node $w' \in \chi_H^{-1}(\chi_G(w))$ in $H$. Consequently, $\chi_G^w(w) = \chi_H^{w'}(w')$ (Lemma C.19(d)). Again by the WL-condition, there is a node $x' \in (\chi_H^{w'})^{-1}(\chi_G^w(x))$ in $H$. Clearly, $w' \ne u'$ and $x' \ne u'$ (because they have different colors). If $u'$ is not a cut vertex, then there is a path $P = (y_0, \cdots, y_d)$ in $H$ such that $y_0 = x'$, $y_d = w'$ and $y_i \ne u'$ for all $i \in [d]$. It follows that for all $i \in [d]$, $\chi_H(y_i) \ne \chi_H(u')$ by our assumption $|\chi_H^{-1}(\chi_H(u'))| = 1$, and thus $\chi_H^{w'}(y_i) \ne \chi_H^{w'}(u')$ (by Lemma C.19(b)).
Since $\chi_G^w(x) = \chi_H^{w'}(x')$, by the WL-condition (Lemma C.7), there is a path $Q = (z_0, \cdots, z_d)$ in $G$ satisfying $z_0 = x$ and $z_i \in (\chi_G^w)^{-1}(\chi_H^{w'}(y_i))$ for $i \in [d]$. Note that $\chi_G^w(z_i) = \chi_H^{w'}(y_i)$ implies $\chi_G(z_i) = \chi_H(y_i)$, and thus $\chi_G(z_i) \ne \chi_H(u') = \chi_G(u)$ holds for all $i \in [d]$, so $z_i \ne u$. In other words, we have found a path from $x$ to $w$ without going through node $u$, which yields a contradiction as $u$ is a cut vertex. We have thus finished the proof.

C.3.2 PROOF FOR THE SECOND PART OF THEOREM 3.2

The proof is based on the following key result: Corollary C.24. Let w and x be two nodes in connected graph G with the same DSS-WL color, i.e. χ G (w) = χ G (x). Then for any color c ∈ C, {{dis G (w, v) : v ∈ χ -1 G (c)}} = {{dis G (x, v) : v ∈ χ -1 G (c)}}. Proof. By Corollary C.21, we have {{χ w G (v) : v ∈ χ -1 G (c)}} = {{χ x G (v) : v ∈ χ -1 G (c)}}. Since for any nodes u, v, χ w G (u) = χ x G (v) implies dis G (u, w) = dis G (v, x) (by Lemma C. 19(e)), we have obtained the desired conclusion. Equivalently, the above corollary says that if χ G (w) = χ G (x), then the following two multisets are equivalent: {{(dis G (w, v), χ G (v)) : v ∈ V}} = {{(dis G (x, v), χ G (v)) : v ∈ V}}. Therefore, it guarantees that the vertex partition induced by the DSS-WL color mapping is finer than that of the SPD-WL (Algorithm 4 with d G = dis G ). We can thus invoke Theorem 4.1, which directly concludes the proof (due to Proposition C.56).
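The multiset form above suggests a direct way to experiment with SPD-WL on small graphs. The sketch below assumes the update rule χ(w) ← hash({{(dis(w, v), χ(v)) : v ∈ V}}), read off from that multiset; Algorithm 4 is not reproduced in this section, so the loop, the shared palette, and all function names are assumptions made for illustration. The test pair is Example C.9 with m = 4, k = 1, where G_2 has two cut edges and G_1 has none.

```python
from collections import Counter, deque


def adjacency(n, edges):
    """Adjacency dict on nodes 1..n from a list of undirected edges."""
    adj = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest path distances from source; unreachable nodes are absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def spd_wl_histograms(adj1, adj2):
    """Refine colors with (distance, color) multisets, sharing one palette across graphs."""
    graphs = [adj1, adj2]
    dists = [{v: bfs_distances(adj, v) for v in adj} for adj in graphs]
    colors = [{v: 0 for v in adj} for adj in graphs]
    for _ in range(len(adj1)):
        palette, refined = {}, []
        for adj, dis, col in zip(graphs, dists, colors):
            new = {}
            for w in adj:
                signature = tuple(sorted(
                    (dis[w].get(v, float("inf")), col[v]) for v in adj))
                new[w] = palette.setdefault(signature, len(palette))
            refined.append(new)
        colors = refined
    return [Counter(c.values()) for c in colors]


# Example C.9 with m = 4, k = 1 (n = 9): one 8-cycle versus two 4-cycles,
# with node 9 attached to nodes 4 and 8 in both graphs.
spokes = [(9, 4), (9, 8)]
G1 = adjacency(9, [(i, i % 8 + 1) for i in range(1, 9)] + spokes)
G2 = adjacency(9, [(i, i % 4 + 1) for i in range(1, 5)]
               + [(i + 4, i % 4 + 5) for i in range(1, 5)] + spokes)

if __name__ == "__main__":
    h1, h2 = spd_wl_histograms(G1, G2)
    print("SPD-WL histograms equal:", h1 == h2)
```

On this pair the two histograms differ after the first round, since G_2 contains nodes at distance 6 from each other while no two nodes of G_1 are more than 4 apart.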

C.4 PROOF OF THEOREM 4.1

Theorem C.25. Let G = (V, E G ) and H = (V, E H ) be two graphs, and let χ G and χ H be the corresponding SPD-WL color mapping. Then the following holds:  • For any two edges {w 1 , w 2 } ∈ E G and {x 1 , x 2 } ∈ E H , if {{χ G (w 1 ), χ G (w 2 )}} = {{χ H (x 1 ), χ H (x 2 )}}, χ G (u) ̸ = χ G (v) (Appendix C.4.1) and χ G (u) = χ G (v) (Appendix C.4. 2), and prove that any edge  {u ′ , v ′ } ∈ E H satisfying {{χ G (u), χ G (v)}} = {{χ H (u ′ ), χ H (v ′ )}} is ∩ S v | > 1. We first prove that χ -1 G (χ G (u)) ∩ S v ̸ = ∅ and χ -1 G (χ G (v)) ∩ S u ̸ = ∅. By symmetry, we only need to prove the former. Suppose χ -1 G (χ G (u)) ∩ S v = ∅, then (χ -1 G (χ G (v)) ∩ S v )\{v} ̸ = ∅ (because |S ∩ S v | > 1) , and thus there exists v ′ ∈ S v , v ′ ̸ = v such that χ G (v ′ ) = χ G (v). Note that v ′ must connect to a node u ′ with χ G (u ′ ) = χ G (u). Since {u, v} is a cut edge in G, u ′ ∈ S v . Therefore, χ -1 G (χ G (u)) ∩ S v ̸ = ∅ , yielding a contradiction. This paragraph is illustrated in Figure 6(a) . We next prove that at least one of the following two conditions holds (which are symmetric): (i) (χ -1 G (χ G (u)) ∩ S u )\{u} ̸ = ∅; (ii) (χ -1 G (χ G (v)) ∩ S v )\{v} ̸ = ∅. Based on the above paragraph, there exists v ′ ∈ S u satisfying χ G (v ′ ) = χ G (v). Note that v ′ must connect to a node with color χ G (u). If condition (i) does not hold, i.e. χ -1 G (χ G (u)) ∩ S u = {u}, then v ′ must connect to u. This means |N G (u) ∩ χ -1 G (χ G (v))| ≥ 2. Again using χ -1 G (χ G (u)) ∩ S v ̸ = ∅ (the above paragraph), we can pick such a node u ′ ∈ χ -1 G (χ G (u)) ∩ S v . By the WL-condition (Proposition C.4), |N G (u ′ ) ∩ χ -1 G (χ G (v))| ≥ 2, which implies |S v ∩ χ -1 G (χ G (v))| ≥ 2. Thus (χ -1 G (χ G (v)) ∩ S v )\{v} ̸ = ∅ holds, which is exactly the condition (ii). This paragraph is illustrated in Figure 6(b) . 
Based on the above two paragraphs, by symmetry we can without loss of generality assume $\chi_G^{-1}(\chi_G(u)) \cap S_v \ne \emptyset$ and $(\chi_G^{-1}(\chi_G(u)) \cap S_u) \setminus \{u\} \ne \emptyset$. We are now ready to derive a contradiction. To do this, pick $\tilde u = \arg\max_{w \in \chi_G^{-1}(\chi_G(u))} \mathrm{dis}_G(u, w)$ and separately consider the following two cases:

• $\tilde u \in S_v$. Then by picking a node $x \in (S_u \cap \chi_G^{-1}(\chi_G(u))) \setminus \{u\}$, it follows that $\mathrm{dis}_G(x, \tilde u) \ge \mathrm{dis}_G(x, u) + \mathrm{dis}_G(u, \tilde u) > \mathrm{dis}_G(u, \tilde u)$.
• $\tilde u \in S_u$. Then by picking a node $x \in S_v \cap \chi_G^{-1}(\chi_G(u))$, it follows that $\mathrm{dis}_G(x, \tilde u) = \mathrm{dis}_G(x, v) + \mathrm{dis}_G(u, \tilde u) + 1 > \mathrm{dis}_G(u, \tilde u)$.

In both cases, $x$ and $u$ cannot have the same color under SPD-WL because

$$\max_{w \in \chi_G^{-1}(\chi_G(u))} \mathrm{dis}_G(u, w) = \mathrm{dis}_G(u, \tilde u) < \mathrm{dis}_G(x, \tilde u) \le \max_{w \in \chi_G^{-1}(\chi_G(u))} \mathrm{dis}_G(x, w).$$

This yields a contradiction and concludes the proof.

Based on Lemma C.27, in the subsequent proof we can without loss of generality assume $\chi_G^{-1}(\chi_G(u)) \cap S_u = \{u\}$ and $\chi_G^{-1}(\chi_G(v)) \cap S_u = \emptyset$. This leads to the following lemma:

Lemma C.28. For any $u_1, u_2 \in \chi_G^{-1}(\chi_G(u))$ with $u_1 \ne u_2$, any path from $u_1$ to $u_2$ goes through a node $v' \in \chi_G^{-1}(\chi_G(v))$.

Proof. Note that $\chi_G^{-1}(\chi_G(u)) \cap S_u = \{u\}$. If $|\chi_G^{-1}(\chi_G(u)) \cap S_v| \le 1$, the conclusion is clear since any path from $u_1$ to $u_2$ goes through $v$. Now suppose $|\chi_G^{-1}(\chi_G(u)) \cap S_v| > 1$ and the lemma does not hold. Then there exist two different nodes $u_1, u_2 \in \chi_G^{-1}(\chi_G(u)) \cap S_v$ and a path $P$ from $u_1$ to $u_2$ without going through any node in the set $\chi_G^{-1}(\chi_G(v))$. Pick $u_1$, $u_2$ and $P$ such that the length $|P|$ is minimal. Split $P$ into two parts $P_1$ and $P_2$ with endpoints $\{u_1, w\}$ and $\{w, u_2\}$ such that $|P_1| \le |P_2| \le |P_1| + 1$ and $|P_1| + |P_2| = |P|$. Note that $|P| \ge 2$ since $\{u_1, u_2\} \notin E_G$ (otherwise $u$ cannot have the same color as $u_1$ because $\chi_G^{-1}(\chi_G(u)) \cap S_u = \{u\}$). Therefore, $w \ne u_1$ and $w \ne u_2$. Also note that $\chi_G(w) \ne \chi_G(u)$ since $|P|$ is minimal.

Since SPD-WL satisfies the WL-condition (Proposition C.4), there is a path (not necessarily simple) from $u$ to some $w' \in \chi_G^{-1}(\chi_G(w))$ of length $|P_1|$ without going through nodes in the set $\chi_G^{-1}(\chi_G(v))$ (according to Lemma C.7). Therefore, $w' \in S_u$. See Figure 6 for an illustration of this paragraph.

We next prove that $\mathrm{dis}_G(u, w') = |P_1|$. First, we obviously have $\mathrm{dis}_G(u, w') \le |P_1|$. Moreover, since $w', u \in S_u$ and $\chi_G^{-1}(\chi_G(v)) \cap S_u = \emptyset$ (Lemma C.27), any shortest path from $w'$ to $u$ does not go through nodes in the set $\chi_G^{-1}(\chi_G(v))$. Again using the WL-condition, there exists a path $P_3$ (not necessarily simple) from $w$ to some $u_3 \in \chi_G^{-1}(\chi_G(u))$ of length $|P_3| = \mathrm{dis}_G(u, w')$ without going through nodes in the set $\chi_G^{-1}(\chi_G(v))$ (according to Lemma C.7). It follows that $u_3 \in S_v$. Consider the following two cases:

• If $u_3 = u_1$, by the minimal length of $P$ we have $|P_1| \le |P_3| = \mathrm{dis}_G(u, w') \le |P_1|$ and thus $\mathrm{dis}_G(u, w') = |P_1|$.
• If $u_3 \ne u_1$, by linking the paths $P_1$ and $P_3$, there will be a path of length $|P_1| + |P_3|$ from $u_1$ to $u_3$ without going through nodes in $\chi_G^{-1}(\chi_G(v))$. Since $P$ has the minimal length, $|P_1| + |P_2| \le |P_1| + |P_3|$. Therefore, $|P_2| \le |P_3| = \mathrm{dis}_G(u, w')$ and thus by definition $|P_1| \le |P_2| \le \mathrm{dis}_G(u, w') \le |P_1|$. Therefore, $|P_1| = |P_2| = \mathrm{dis}_G(u, w')$.

Now define the set $D(x) := \{u' : u' \in \chi_G^{-1}(\chi_G(u)),\ \mathrm{dis}_G(x, u') \le |P_2|\}$. Let us focus on the cardinality of the sets $D(w)$ and $D(w')$. It follows that $D(w') = \{u\}$, because for any other node $u' \in \chi_G^{-1}(\chi_G(u))$, $u' \ne u$, we have $u' \in S_v$ and thus $\mathrm{dis}_G(w', u') > \mathrm{dis}_G(w', v) = \mathrm{dis}_G(w', u) + 1 = |P_1| + 1 \ge |P_2|$.

Proof. Suppose $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ is not a cut edge of $G^C$. Then there is a simple cycle $(c_1, \cdots, c_m)$ where $c_1 = \chi_G(u)$, $c_m = \chi_G(v)$ and $m > 2$. Namely, there exists a simple path from $c_1$ to $c_m$ with length $\ge 2$. By the definition of $G^C$ and the WL-condition, there exists a sequence of nodes $\{w_i\}_{i=1}^m$ of $G$ where $w_1 = u$ and $\chi_G(w_i) = c_i$ such that $\{w_i, w_{i+1}\} \in E_G$ for $i \in [m - 1]$. Note that $w_i \ne u$ for $i \in \{2, \cdots, m\}$ and $w_2 \ne v$ because $(c_1, \cdots, c_m)$ is a simple path. Therefore, $w_i \in S_u$ for all $i \in [m]$.
However, it contradicts |S ∩ S u | = 1 (Lemma C.27) since χ G (w m ) = χ G (v). Combining Lemmas C.27 to C.29, we arrived at the following corollary: Corollary C.30. For all u ′ ∈ χ -1 G (χ G (u)) and v ′ ∈ χ -1 G (χ G (v)), if {u ′ , v ′ } ∈ E G , then it is a cut edge of G. Proof. If {u ′ , v ′ } is not a cut edge, there is a simple cycle going through {u ′ , v ′ }. Denote it as (w 1 , • • • , w m ) where w 1 = u ′ , w m = v ′ , m > 2. By Lemma C.27, w 2 / ∈ χ G (v), otherwise u ′ will connect to at least two different nodes w 2 , w m ∈ χ -1 G (χ G (v) ) and thus u ′ and u cannot have the same color under SPD-WL. Let j be the index such that j = min{j ∈ [m] : χ G (w j ) = χ G (v)}, then j > 2. Consider the path (w 1 , • • • , w j ). It follows that χ G (w k ) ̸ = χ G (u) for all k ∈ {2, • • • , j} by Lemma C.28 (otherwise there is a path from node w 1 to some node Proof. By the definition of C u , any node c ∈ C u in the color graph can reach the node χ G (u) without going through χ G (v). Therefore, there exists some u ′ ∈ χ -1 G (χ G (u)) such that there exists a path P 1 from w to u ′ without going through nodes in the set χ -1 G (χ G (v)). Also, there exists a node w i ∈ χ -1 G (χ G (u)) (i ∈ {2, • • • , j}) that does not go through nodes in the set χ -1 G (χ G (v)), a contradiction). Therefore, (χ G (w 1 ), • • • , χ G (w j )) is a path of length ≥ 2 in G C from χ G (u) to χ G (v) (not G (w) ∈ C u , there exists a cut edge {u ′ , v ′ }, u ′ ∈ χ -1 G (χ G (u)), v ′ ∈ χ -1 G (χ G (v)), that partitions V into two classes S u ′ ∪ S v ′ , u ′ , w ∈ S u ′ , v ′ ∈ S v ′ , such that χ -1 G (χ G (u ′ )) ∪ S u ′ = {u ′ } and χ -1 G (χ G (v ′ )) ∪ S u ′ = ∅. Remark C.32. Corollary C.31 can be seen as a generalized version of Lemma C.27. Indeed, when w ∈ S u , one can pick u ′ = u and v ′ = v. Then χ -1 G (χ G (u ′ )) ∪ S u ′ = {u ′ } and χ -1 G (χ G (v ′ )) ∪ S u ′ = ∅ hold v ′ ∈ N G (u ′ ) with χ G (v ′ ) = χ G (v) due to the color of u ′ . 
By Corollary C.30, $\{u', v'\}$ is a cut edge of $G$. Clearly, $w \in S_{u'}$.

We next prove the following fact: for any $x \in S_{u'}$, $\chi_G(x) \in C_u$. Otherwise, one can pick a node $x \in S_{u'}$ with color $\chi_G(x) \in C_v$. Consider the shortest path between nodes $x$ and $u'$, denoted as $(y_1, \cdots, y_m)$ where $y_1 = x$ and $y_m = u'$. It follows that $y_i \in S_u$ for all $i \in [m]$. Denote $c_i = \chi_G(y_i)$, $i \in [m]$. Then $(c_1, \cdots, c_m)$ is a path (not necessarily simple) in the color graph $G^C$. Now pick the index $j = \max\{j \in [m] : c_j \in C_v\}$ (which is well-defined because $c_1 \in C_v$). It follows that $j < m$ (since $c_m \in C_u$), $c_j = \chi_G(v)$ and $c_{j+1} = \chi_G(u)$ (because $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ is a cut edge that partitions the color graph $G^C$ into $C_u$ and $C_v$). Consider the following two cases (see Figure 8(b) for an illustration):

• $j = m - 1$. Then $u'$ connects to both nodes $y_j$ and $v'$ with color $\chi_G(y_j) = \chi_G(v') = \chi_G(v)$. This contradicts Lemma C.27 since $u$ only connects to one node $v$ with color $\chi_G(v)$.
• $j < m - 1$. Then $y_{j+1} \ne u'$ because the path $(y_1, \cdots, y_m)$ is simple. However, one has $\chi_G(y_i) \ne \chi_G(v)$ for all $i \in \{j + 1, \cdots, m\}$ by definition of $j$. This contradicts Lemma C.28.

[Figure 8: Illustration of Corollary C.31 and its proof. Panels (a) and (b).]

This completes the proof that for any $x \in S_{u'}$, $\chi_G(x) \in C_u$. Therefore, $\chi_G^{-1}(\chi_G(v')) \cap S_{u'} = \emptyset$.

We finally prove that $\chi_G^{-1}(\chi_G(u)) \cap S_{u'} = \{u'\}$. If not, pick $u'' \in \chi_G^{-1}(\chi_G(u)) \cap S_{u'}$ with $u'' \ne u'$. By Lemma C.28, the shortest path between $u'$ and $u''$ goes through some node $v''$ with color $\chi_G(v)$. Clearly, $v'' \in S_u$, which contradicts the above paragraph and concludes the proof.

We have already fully characterized the properties of cut edges $\{u', v'\}$ with color $\{\chi_G(u), \chi_G(v)\}$. Now we switch our focus to the graph $H$.
We first prove a general result that holds for arbitrary H. Lemma C.33. Let {w 1 , w 2 } ∈ E H and P is a path with the minimum length from w 1 to w 2 without going through edge {w 1 , w 2 }. In other words, linking path P with the edge {w 1 , w 2 } forms a simple cycle Q. Then for any two nodes x 1 , x 2 in Q, dis H (x 1 , x 2 ) = dis Q (x 1 , x 2 ). Proof. Split the cycle Q into two paths Q 1 and Q 2 with endpoints {x 1 , x 2 } where Q 1 contains the edge {w 1 , w 2 } and Q 2 does not contain {w 1 , w 2 }. Assume the above lemma does not hold and dis H (w, x) < dis Q (w, x). It means that there exists a path R in H from x 1 to x 2 for which |R| < min(|Q 1 |, |Q 2 |). Note that the edge {u, v} occurs at most once in R. Separately consider two cases: • {w 1 , w 2 } occurs in R. Then linking R with Q 2 forms a cycle that contains {w 1 , w 2 } exactly once; • {w 1 , w 2 } does not occur in R. Then linking R with Q 1 forms a cycle that contains {w 1 , w 2 } exactly once. In both cases, the cycle has a length less than |Q|. This contradicts the condition that P is a path with minimum length from w 1 to w 2 without passing edge {w 1 , w 2 }. We can similarly consider the color graph  H C = (C, E H C ) {u, v} ∈ E H , χ H (u) = χ G (u), χ H (v) = χ G (v), and χ H (u) ∈ C u , χ H (v) ∈ C v . We similarly define χ -1 H (c) = {w ∈ V : χ H (w) = c}. Define a mapping h : C → {χ H (u), χ H (v)} where h(c) = χ H (u) if dis H C (c, χ H (u)) < dis H C (c, χ H (v)), χ H (v) if dis H C (c, χ H (u)) > dis H C (c, χ H (v)). Note that it never happens that dis H C (c, χ H (u)) = dis H C (c, χ H (v)) because {{χ H (u), χ H (v)}} is a cut edge of H C . Assume {u, v} is not a cut edge in H. Then there exists a path (w 1 , • • • , w m ) in H with w 1 = u and w m = v without going through {u, v}. We pick such a path with the minimum length, then the path is simple. 
Since h(χ H (u)) ∈ C u and h(χ H (v)) ∈ C v , there is a minimum index j ∈ [m -1] such that h(χ H (w j )) ∈ C u and h(χ H (w j+1 )) ∈ C v . By definition of C u , C v and the cut edge {{χ H (u), χ H (v)}}, it follows that χ H (w j ) = χ H (u) and χ H (w j+1 ) = χ H (v) . Denote u ′ := w j . Note that j ̸ = 1 and j ̸ = 2, otherwise u either connects to two nodes w 2 and w m with color χ H (w 2 ) = χ H (w m ) = χ H (v), or connects to the node u ′ with color χ H (u ′ ) = χ H (u), contradicting χ H (u) = χ G (u). Pick k = ⌈j/2⌉. By Lemma C.33, (w 1 , • • • , w k ) is the shortest path between u and w k , and (w k , • • • , w j ) is the shortest path between w k and u ′ . We give an illustration of the structure of H in Figure 9 (a) based on this paragraph. Since the graph representations of G and H are the same under SPD-WL, there exists a node w ′ with color χ G (w ′ ) = χ H (w k ) and two different nodes u ′ 1 , u ′ 2 with color χ G (u ′ 1 ) = χ G (u ′ 2 ) = χ G (u), such that dis G (w ′ , u ′ 1 ) = dis H (w k , u 1 ) and dis G (w ′ , u ′ 2 ) = dis H (w k , u 2 ). In particular, |dis G (w ′ , u ′ 1 ) -dis G (w ′ , u ′ 2 )| ≤ 1. Note that by the definition of indices j and k, in the color graph H C there is a path from χ H (w k ) to χ H (u) without going through nodes in the set χ -1 H (χ H (v)), so χ H (w k ) ∈ C u , namely χ G (w ′ ) ∈ C u . By Corollary C.31, there is a cut edge {u ′ w , v ′ w } that partitions G into two vertex sets S u ′ w , S v ′ w , with w ′ , u ′ w ∈ S u ′ w , v ′ w ∈ S v ′ w . Note that u ′ w ̸ = u ′ 1 and u ′ w ̸ = u ′ 2 (otherwise by Corollary C.31 any path from w ′ to a node u ′ ̸ = u ′ w with color χ G (u ′ ) = χ G (u) must first go through u ′ w and then go through v ′ w , implying that |dis G (w ′ , u ′ 1 ) - dis G (w ′ , u ′ 2 )| ≥ 2 and yielding a contradiction). Therefore, dis G (w ′ , u ′ 1 ) > dis G (w ′ , u ′ w ) and dis G (w ′ , u ′ 2 ) > dis G (w ′ , u ′ w ). 
We give an illustration of the structure of G in Figure 9 (b) based on this paragraph.

Pick any v w ∈ χ -1 H (χ H (v)) satisfying dis H (v w , w k ) = dis G (v′ w , w′). Denote the operation dropmin(S) := S\{{min S}} that takes a multiset S and removes one of the minimum elements in S. We have
dropmin({{dis G (w′, u G ) : u G ∈ χ -1 G (χ G (u))}})
= dropmin({{dis G (w′, v′ w ) + dis G (v′ w , u G ) : u G ∈ χ -1 G (χ G (u))}}) (by Corollary C.31)
= dropmin({{dis H (w k , v w ) + dis H (v w , u H ) : u H ∈ χ -1 H (χ H (u))}})
and also
dropmin({{dis G (w′, u G ) : u G ∈ χ -1 G (χ G (u))}}) = dropmin({{dis H (w k , u H ) : u H ∈ χ -1 H (χ H (u))}})
due to the same color χ G (w′) = χ H (w k ). Combining the above two equations and noting that dis H (w k , v w ) + dis H (v w , u H ) ≥ dis H (w k , u H ), we obtain the following result: for any u H ∈ χ -1 H (χ H (u)) for which dis H (w k , v w ) + dis H (v w , u H ) > dis G (w′, u′ w ), dis H (w k , v w ) + dis H (v w , u H ) = dis H (w k , u H ). In particular, dis H (w k , w 1 ) = dis H (w k , v w ) + dis H (v w , w 1 ), dis H (w k , w j ) = dis H (w k , v w ) + dis H (v w , w j ). Therefore, dis H (w 1 , w j ) = dis H (w 1 , w k ) + dis H (w k , w j ) = 2dis H (w k , v w ) + dis H (v w , w 1 ) + dis H (v w , w j ) ≥ 2dis H (w k , v w ) + dis H (w 1 , w j ), implying w k = v w . However, χ H (w k ) ∈ C u while χ H (v w ) ∈ C v , yielding a contradiction.

C.4.2 THE CASE OF χ G (u) = χ G (v) FOR CONNECTED GRAPHS

We first define several notations. Define the mapping f G : V → {u, v} × C as follows: f G (w) = (h G (w), χ G (w)) where h G (w) = u if dis G (w, v) = dis G (w, u) + 1, and h G (w) = v if dis G (w, u) = dis G (w, v) + 1. It is easy to see that h G is well-defined for all w ∈ V because {u, v} is a cut edge of G. We further define the following auxiliary graph:

Definition C.34. (Auxiliary graph) Define the auxiliary graph G A = (V G A , E G A ) where V G A := {u, v} × C and E G A := {{{f G (w 1 ), f G (w 2 )}} : {w 1 , w 2 } ∈ E G }.
Note that G A can have self loops, so each edge is denoted as a multiset with two elements. It is straightforward to see that there is only one edge in G A with the form {{(u, c 1 ), (v, c 2 )}} ∈ E G A for some c 1 , c 2 ∈ C since {u, v} is a cut edge of G. Therefore, the only edge is {{(u, χ G (u)), (v, χ G (v))}} and is a cut edge in G A . We also define f -1 G as the inverse mapping of f G , i.e. f -1 G (z, c) = {w ∈ V : f G (w) = (z, c)}. We first prove that f -1 G is well-defined on the domain V G A .

Lemma C.35. f G is a surjection.

Proof. Suppose that f G is not a surjection. Then there exists a color c ∈ C such that either f -1 G (u, c) or f -1 G (v, c) is an empty set. Without loss of generality, assume f -1 G (v, c) = ∅, then f -1 G (u, c) ≠ ∅. Pick any w ∈ f -1 G (u, c). Obviously, w ≠ u (otherwise f -1 G (v, χ G (v)) = ∅, a contradiction). Then we claim that for any x ∈ N G (w), f -1 G (v, χ G (x)) is empty. Note that x ∈ f -1 G (u, χ G (x)). If the claim does not hold, take x′ ∈ f -1 G (v, χ G (x)). Since x connects to a node with color c and χ G (x) = χ G (x′), x′ must also connect to a node with color c. Denote the node that connects to x′ with color c as w′. Then w′ ∈ f -1 G (v, c), yielding a contradiction. By induction, for any x such that there exists a path from x to w without going through the edge {u, v}, we have f -1 G (v, χ G (x)) = ∅. This finally implies f -1 G (v, χ G (v)) = ∅, leading to a contradiction. Therefore, f G is a surjection.

Lemma C.36. |f -1 G (u, χ G (u))| = |f -1 G (v, χ G (v))| = 1.

Proof. Pick u′ = arg max u′∈f -1 G (u,χ(u)) dis G (u, u′) and similarly pick v′. It follows that any path between u′ and v′ goes through edge {u, v}. Therefore, dis G (u′, v′) = dis G (u, u′) + dis G (v, v′) + 1. Since all nodes u, u′, v, v′ have the same color under SPD-WL, there exists a node w ∈ χ -1 G (χ G (u)) satisfying dis G (u, w) = dis G (u′, v′) and thus dis G (u, w) > dis G (u, u′).
By definition of the node u ′ , f G (w) ̸ = (u, χ(u)) and thus f G (w) = (v, χ(u)). Therefore, dis G (u, w) = dis G (v, w) + 1, which implies that dis G (v, w) = dis G (v, v ′ ) + dis G (u, u ′ ). Since dis G (v, w) ≤ dis G (v, v ′ ), we have dis G (v, w) = dis G (v, v ′ ) and u = u ′ . A similar argument yields v = v ′ , finishing the proof. We can now prove some useful properties of the auxiliary graph G A based on Lemmas C.35 and C.36. Corollary C.37. For any c 1 , c 2 ∈ C, {{(u, c 1 ), (u, c 2 )}} ∈ E G A if and only if {{(v, c 1 ), (v, c 2 )}} ∈ E G A . Proof. By definition of E A G , if {{(u, c 1 ), (u, c 2 )}} ∈ E G A , then there exists two vertices w 1 ∈ f -1 G (u, c 1 ) and w 2 ∈ f -1 G (u, c 2 ) such that {w 1 , w 2 } ∈ E G . By Lemma C.36, either χ G (w 1 ) ̸ = χ G (u) or χ G (w 2 ) ̸ = χ G (u). Without loss of generality, assume c 1 ̸ = χ G (u). By Lemma C.35, there exists x 1 ∈ f -1 G (v, c 1 ). Since χ G (x 1 ) = χ G (w 1 ), x 1 must also connect to a node x 2 with χ G (x 2 ) = c 2 . The edge {x 1 , x 2 } ̸ = {u, v} because χ G (x 1 ) = c 1 ̸ = χ G (u). Therefore, f (x 2 ) = (v, c 2 ), namely {{(v, c 1 ), (v, c 2 )}} ∈ E A G . The following lemma establishes the distance relationship between graphs G and G A . Lemma C.38. The following holds: • For any w, w ′ ∈ V, dis G (w, w ′ ) ≥ dis G A (f (w), f (w ′ )). • For any ξ, ξ ′ ∈ V A and any node w ∈ f -1 G (ξ), there exists a node w ′ ∈ f -1 G (ξ ′ ) such that dis G (w, w ′ ) = dis G A (ξ, ξ ′ ). Proof. The first bullet is trivial since for all {w, w ′ } ∈ E G , {{f (w), f (w ′ )}} ∈ E G A by Definition C.34. We prove the second bullet in the following. Note that G A can have self-loops, but for any ξ, ξ ′ ∈ V A , the shortest path between ξ and ξ ′ will not go through self-loops. We only need to prove that for all {{ξ, ξ ′ }} ∈ E A , ξ ̸ = ξ ′ and all w ∈ f -1 G (ξ), there exists w ′ ∈ f -1 G (ξ ′ ) such that {w, w ′ } ∈ E G . 
This will imply that dis G (w, w′) ≤ dis G A (ξ, ξ′) and completes the proof by combining the first bullet in Lemma C.38. The case of {{ξ, ξ′}} = {{(u, χ G (u)), (v, χ G (v))}} is trivial. Now assume that {{ξ, ξ′}} ≠ {{(u, χ G (u)), (v, χ G (v))}}. By Definition C.34, there exist x ∈ f -1 G (ξ) and x′ ∈ f -1 G (ξ′), such that {x, x′} ∈ E G . Note that h G (x) = h G (x′) because {x, x′} ≠ {u, v}. Since χ G (x) = χ G (w), there exists w′ ∈ χ -1 G (χ G (x′)) such that {w, w′} ∈ E G . It must hold that h G (w) = h G (w′) (otherwise {w, w′} = {u, v} and thus {{ξ, ξ′}} = {{(u, χ G (u)), (v, χ G (v))}}). Therefore, h G (w′) = h G (w) = h G (x) = h G (x′) and thus f G (w′) = f G (x′), namely w′ ∈ f -1 G (ξ′).

Lemma C.38 leads to the following corollary:

Corollary C.39. The following holds:
• For any w, w′ ∈ V satisfying χ G (w) = χ G (w′) and h G (w) = h G (w′) (i.e. f G (w) = f G (w′)), dis G (u, w) = dis G (u, w′) and dis G (v, w) = dis G (v, w′);
• For any w, w′ ∈ V satisfying χ G (w) = χ G (w′) and h G (w) ≠ h G (w′), dis G (u, w) = dis G (v, w′) and dis G (v, w) = dis G (u, w′).

Proof. Proof of the first bullet: by Lemma C.38, there exist two nodes u 1 , u 2 ∈ f -1 G (f G (u)) such that dis G (u 1 , w) = dis G A (f G (u), f G (w)) and dis G (u 2 , w′) = dis G A (f G (u), f G (w′)). Therefore, dis G (u 1 , w) = dis G (u 2 , w′). However, by Lemma C.36 and the condition h G (w) = h G (w′), it must be u 1 = u 2 = u, namely dis G (u, w) = dis G (u, w′). The proof of dis G (v, w) = dis G (v, w′) is similar.

Proof of the second bullet: Let χ G (w) = χ G (w′) = c. Without loss of generality, assume f G (w) = (u, c) and f G (w′) = (v, c). By Lemma C.38, it suffices to prove that dis G A ((u, χ G (u)), (u, c)) = dis G A ((v, χ G (v)), (v, c)). By the definition of G A and its cut edge {{(u, χ G (u)), (v, χ G (v))}}, the shortest path between (u, χ G (u)) and (u, c) must only go through nodes in the set {(u, c 1 ) : c 1 ∈ C}, and similarly the shortest path between (v, χ G (v)) and (v, c) must only go through nodes in {(v, c 2 ) : c 2 ∈ C}. Finally, Corollary C.37 says that for c 1 , c 2 ∈ C, {{(u, c 1 ), (u, c 2 )}} ∈ E G A if and only if {{(v, c 1 ), (v, c 2 )}} ∈ E G A . We thus conclude that dis G A ((u, χ G (u)), (u, c)) = dis G A ((v, χ G (v)), (v, c)) and dis G (u, w) = dis G (v, w′).

Finally, we can prove the following important corollary:

Corollary C.40. For any c ∈ C, |f -1 G (u, c)| = |f -1 G (v, c)|.

Proof. Pick any w ∈ f -1 G (u, c) and x ∈ f -1 G (v, c). By Corollary C.39, we have dis G (w, u) = dis G (x, v) := d, dis G (w, v) = dis G (x, u) = d + 1. The multiset {{dis G (u, w′) : χ G (w′) = c}} contains |f -1 G (u, c)| elements of value d and |f -1 G (v, c)| elements of value d + 1. The multiset {{dis G (v, w′) : χ G (w′) = c}} has |f -1 G (v, c)| elements of value d and |f -1 G (u, c)| elements of value d + 1. Since u and v have the same color under SPD-WL, the two multisets must be equal. Therefore, |f -1 G (u, c)| = |f -1 G (v, c)|.

Next, we switch our focus to the graph H.

|f -1 H (u, c)| = |f -1 H (v, c)| = |f -1 G (u, c)| = |f -1 G (v, c)|.

Proof. If |f -1 H (u, c)| ≠ |f -1 H (v, c)|, we have {{dis H (u, w) : χ H (w) = c}} ≠ {{dis H (v, w) : χ H (w) = c}}, implying that u and v cannot have the same color under SPD-WL. This already concludes the proof by using Corollary C.40 as |f -1 H (u, c)| + |f -1 H (v, c)| = |f -1 G (u, c)| + |f -1 G (v, c)|.

We finally present a technical lemma which will be used in the subsequent proof.

Proof. Without loss of generality, assume h G (w) = h H (w′) = u and let f G (w) = f H (w′) = (u, c w ). Pick x ∈ f -1 G (v, c) and x′ ∈ f -1 H (u, c), then dis H (x′, u) = min(dis G (x, u), dis G (x, v)) and dis H (w′, u) = min(dis G (w, u), dis G (w, v)). Thus
dis H (w′, x′) ≤ dis H (w′, u) + dis H (u, x′)
= min(dis G (w, u), dis G (w, v)) + min(dis G (x, u), dis G (x, v))
< min(dis G (w, u) + dis G (x, v), dis G (w, v) + dis G (x, u)) + 1
= dis G (w, x),
which concludes the proof.

In the following, we will prove that {u, v} is a cut edge in graph H. Consider an edge {{(u, c 1 ), (v, c 2 )}} ∈ E H A (such an edge exists because {{(u, χ H (u)), (v, χ H (v))}} ∈ E H A ). We will prove that this is the only case, i.e. it must be c 1 = χ H (u) = χ H (v) = c 2 . By Definition C.34, {{(u, c 1 ), (v, c 2 )}} ∈ E H A implies that there exist two nodes x′ ∈ f -1 H (u, c 1 ) and w′ ∈ f -1 H (v, c 2 ), such that {w′, x′} ∈ E H . Pick w ∈ χ -1 G (c 2 ). We claim that all elements in the set D G,≠ (w, c 1 ) are the same. This is because for any x ∈ χ -1 G (c 1 ), h G (x) ≠ h G (w), we have dis G (w, x) = dis G (w, h(w)) + 1 + dis G (h(x), x), and by Corollary C.39 dis G (w, h(w)) (or dis G (h(x), x)) has an equal value for different x. Since {w′, x′} ∈ E H , we have 1 ∈ D H,≠ (w′, c 1 ) and thus all elements in D G,≠ (w, c 1 ) equal 1. Therefore, c 1 = χ G (u). Analogously, c 2 = χ G (u). Therefore, c 1 = χ H (u) = χ H (v) = c 2 . Let S u = {w ∈ V : h H (w) = u} and S v = {w ∈ V : h H (w) = v}.

χ t+1 G (v) = hash {{hash(d G (v, u), χ t G (u)) : u ∈ V}} . Therefore, χ t+1 G (v) ≠ χ t+1 G (u) for all u, v ∈ V, namely {{χ t+1 G (w) : w ∈ V}} ∩ {{χ t+1 H (w) : w ∈ V}} = ∅. Finally, by the injective property of the hash function, for any t ≥ T + 1, the above equation always holds. Therefore, the stable color mappings χ G and χ H satisfy Lemma C.43.
The above lemma implies that if there exist edges {w 1 , w 2 } ∈ E G , {x 1 , x 2 } ∈ E H satisfying {{χ G (w 1 ), χ G (w 2 )}} = {{χ H (x 1 ), χ H (x 2 )}}, then {{χ G (w) : w ∈ V}} = {{χ H (w) : w ∈ V}}. Also, SPD-WL ensures that both graphs are either connected or disconnected. If they are both connected, the previous proof (Appendices C.4.1 and C.4.2) ensures that {w 1 , w 2 } is a cut edge of G if and only if {x 1 , x 2 } is a cut edge of H. For the disconnected case, let S G ⊂ V be the largest connected component containing nodes w 1 , w 2 , and similarly denote S H ⊂ V as the largest connected component containing nodes ∈ S H , χ G (u G ) = χ H (u H ) implies that χ G[S G ] (u G ) = χ H[S H ] (u H ). Therefore, {w 1 , w 2 } is a cut edge of G[S G ] if and only if {x 1 , x 2 } is a cut edge of H[S H ]. By the definition of S G and S H , {w 1 , w 2 } is a cut edge of G if and only if {x 1 , x 2 } is a cut edge of H.

It remains to prove that {{χ G (w) : w ∈ V}} = {{χ H (w) : w ∈ V}} implies BCETree(G) ≃ BCETree(H). By definition of the block cut-edge tree, each cut edge of G corresponds to a tree edge in BCETree(G) and each biconnected component of G corresponds to a node of BCETree(G). We still only focus on the case of connected graphs G, H, and it is straightforward to extend the proof to the general (disconnected) case using a similar technique as the previous paragraph.

Given a fixed SPD-WL graph representation R, consider any graph G = (V, E G ) satisfying {{χ G (w) : w ∈ V}} = R. Since we have proved that the SPD-WL node feature χ G (v), v ∈ V precisely locates all the cut edges, the multiset C E := {{{χ G (u), χ G (v)} : {u, v} ∈ E G is a cut edge}} is fixed (fully determined by R, not G). Denote C V := ⋃ {c 1 ,c 2 }∈C E {c 1 , c 2 } as the set that contains the colors of the endpoints of all cut edges.
For each cut edge {u, v} ∈ E G , denote S G,u and S G,v be • If v is not a cut vertex, given any nodes u, w, u ̸ = v, w ̸ = v, we can partition the set of all hitting paths P uw from u to w (not necessarily simple) into two sets P v uw and P v uw such that all paths P ∈ P v uw contain v and no path P ∈ P q(P 1 )q(P 2 )(|P 1 | + |P 2 |) + P ∈P v uw q(P ) • |P | = P1∈P w uv q(P 1 )|P 1 | P2∈Pvw q(P 2 ) + P2∈Pvw q(P 2 )|P 2 |   P1∈P w uv q(P 1 )   + P ∈P v uw q(P )|P | ≤ P ∈P w uv q(P )|P | + P ∈P v uw q(P )|P | + P ∈Pvw q(P )|P | < P ∈P w uv q(P )|P | + P ∈P w uv q(P )|P | + P ∈Pvw q(P )|P | = h G (u, v) + h G (v, w).

We can similarly prove that h G (w, u) < h G (w, v) + h G (v, u).
• If v is a cut vertex, then there exist two different nodes u, w ∈ V, u ≠ v, w ≠ v, such that any path from u to w goes through v. A similar analysis yields the conclusion that h G (u, w) = h G (u, v) + h G (v, w) and h G (w, u) = h G (w, v) + h G (v, u).
This completes the proof of Lemma C.45.

In the subsequent proof, assume u ∈ V is a cut vertex of G, and let {S i } m i=1 (m ≥ 2) be the partition of the vertex set V\{u}, representing each connected component after removing node u. We have the following lemma (which has a similar form as Lemma C.27):

Lemma C.46. There is at most one set S i satisfying S i ∩ χ -1 G (χ G (u)) ≠ ∅. In other words, if S i ∩ χ -1 G (χ G (u)) ≠ ∅ for some i ∈ [m], then for any j ∈ [m] and j ≠ i, S j ∩ χ -1 G (χ G (u)) = ∅.

Proof. Let u i = arg max u′∈χ -1 G (χ G (u)) dis R G (u, u′). If u i = u, then S i ∩ χ -1 G (χ G (u)) = ∅ for all i ∈ [m] and thus Lemma C.46 clearly holds. Otherwise, u i ∈ S i for some i. We will prove that for any j ≠ i, S j ∩ χ -1 G (χ G (u)) = ∅. If the above conclusion does not hold, then we can pick a set S j and a vertex u j ∈ S j ∩ χ -1 G (χ G (u)). Since u is a cut vertex and S i , S j are different connected components, by Lemma C.45 we have dis R G (u i , u j ) = dis R G (u i , u) + dis R G (u, u j ) > dis R G (u i , u). This yields a contradiction because max u′∈χ -1 G (χ G (u)) dis R G (u, u′) ≠ max u′∈χ -1 G (χ G (u i )) dis R G (u i , u′), which means that u and u i cannot have the same RD-WL color.

The next lemma presents a key result which is similar to Corollary C.30.

Lemma C.47. For all u′ ∈ χ -1 G (χ G (u)), u′ is a cut vertex of G.

Proof. If |χ -1 G (χ G (u))| = 1, the claim is trivial since u′ = u. Otherwise, by Lemma C.46 there exist indices i, j ∈ [m] such that S i ∩ χ -1 G (χ G (u)) ≠ ∅, S j ∩ χ -1 G (χ G (u)) = ∅. Since S j ≠ ∅, we can pick w ∈ S j with color χ G (w) ≠ χ G (u). Pick u′ ∈ S i ∩ χ -1 G (χ G (u)). Since χ G (u) = χ G (u′), there exists a node w′ ∈ χ -1 G (χ G (w)) such that dis R G (u, w) = dis R G (u′, w′). Then we have
{{dis R G (w, u″) : u″ ∈ χ -1 G (χ G (u))}} = {{dis R G (w, u) + dis R G (u, u″) : u″ ∈ χ -1 G (χ G (u))}} (14)
= {{dis R G (w′, u′) + dis R G (u′, u″) : u″ ∈ χ -1 G (χ G (u))}} (15)
where (14) holds because u is a cut vertex and all u″ ≠ u are in the set S i but w ∈ S j (Lemma C.46), and (15) holds because χ G (u) = χ G (u′). On the other hand, {{dis R G (w, u″) : u″ ∈ χ -1 G (χ G (u))}} = {{dis R G (w′, u″) : u″ ∈ χ -1 G (χ G (u))}}. Therefore, dis R G (w′, u″) = dis R G (w′, u′) + dis R G (u′, u″) for all u″ ∈ χ -1 G (χ G (u)). Pick u″ = u, then clearly u″ ≠ u′ and u″ ≠ w. Lemma C.45 shows that u′ is a cut vertex, which concludes the proof. See Figure 10 for an illustration of the above proof.

(χ G (u))| > 1. Pick w H ∈ χ -1 H (χ G (w)) where w is defined in the proof of Lemma C.47. Then, there exists u H ∈ χ -1 H (χ G (u)) satisfying dis R H (w H , u H ) = dis R G (w, u). Pick another node u′ H ∈ χ -1 H (χ G (u)), u′ H ≠ u H (this is feasible as |χ -1 H (χ G (u))| > 1). Following the procedure of the above proof, we can obtain that dis R H (w H , u″) = dis R H (w H , u H ) + dis R H (u H , u″) for all u″ ∈ χ -1 H (χ G (u)). Therefore, dis R H (w H , u′ H ) = dis R H (w H , u H ) + dis R H (u H , u′ H ), implying u H is a cut vertex of 2 ∈ S 2 , then dis R G (w 1 , u) + dis R G (u, w 2 ) = dis R G (w 1 , w 2 ) (Lemma C.45). Pick any w′ 1 ∈ χ -1 H (χ G (w 1 )) in H, then there exists a node w′ 2 ∈ χ -1 H (χ G (w 2 )) satisfying dis R G (w 1 , w 2 ) = dis R H (w′ 1 , w′ 2 ). We also have dis R H (w′ 1 , u) = dis R G (w 1 , u) and dis R H (w′ 2 , u) = dis R G (w 2 , u) because u is the unique node with color χ G (u) in H. Therefore, dis R H (w′ 1 , u) + dis R H (u, w′ 2 ) = dis R H (w′ 1 , w′ 2 ) and u is a cut vertex in H (Lemma C.45).
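The additivity property of Lemma C.45, used repeatedly above, can be checked numerically on a small graph. The following is a minimal sketch, not part of the original proof: the helper `resistance_distance` is a hypothetical name, it assumes a connected graph with unit edge weights, and it computes the Resistance Distance with the standard grounded-Laplacian identity (ground one endpoint, inject a unit current at the other, read off the potential) using exact rational arithmetic.

```python
from fractions import Fraction


def resistance_distance(n, edges, a, b):
    """Effective resistance between nodes a and b (hypothetical helper).

    Assumes a connected, undirected graph on nodes 0..n-1 with unit edge weights.
    """
    if a == b:
        return Fraction(0)
    # Build the graph Laplacian L = D - A with exact rationals.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for x, y in edges:
        lap[x][x] += 1
        lap[y][y] += 1
        lap[x][y] -= 1
        lap[y][x] -= 1
    # Ground node b (drop its row and column) and solve L_red * p = e_a.
    keep = [i for i in range(n) if i != b]
    m = len(keep)
    aug = [[lap[i][j] for j in keep] + [Fraction(int(i == a))] for i in keep]
    # Gauss-Jordan elimination; L_red is nonsingular for a connected graph.
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [value * inv for value in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    # The potential at a under a unit current from a to b is the resistance.
    return aug[keep.index(a)][m]


# Two triangles {0, 1, 2} and {2, 3, 4} sharing node 2, so node 2 is a cut vertex.
bowtie = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
through_cut = resistance_distance(5, bowtie, 0, 2) + resistance_distance(5, bowtie, 2, 3)
print(resistance_distance(5, bowtie, 0, 3) == through_cut)
```

On this example the triangle inequality is tight through the cut vertex 2 and strict through the non-cut vertex 1, which is the dichotomy that Lemma C.45 states.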

C.5.2 PROOF OF THE SECOND PART

We first introduce some notations. As before, we assume G and H are connected and {{χ G (w) : w ∈ V}} = {{χ H (w) : w ∈ V}}. As we will consider multiple cut vertices in the following proof, we adopt the notation {S G,i (u)} m G (u) i=1 , which represents the set of connected components of graph G after removing node u. Here, m G (u) is the number of connected components after removing node u, which is greater than 1 if u is a cut vertex. It follows that m G (u) i=1 S G,i (u) = V\{u}. We further define the index set M G (u) := {i ∈ [m G (u)] : S G,i (u) ∩ χ -1 G (χ G (u)) = ∅}. By Lemma C.46, either |M G (u)| = m G (u) -1 or |M G (u)| = m G (u). Lemma C.48. Let u ∈ V be a cut vertex of G. Let u ′ ∈ χ -1 H (χ G (u)), then u ′ is also a cut vertex of H. Let i ∈ [m G (u)] and j ∈ [m H (u ′ ) ] be two indices and pick nodes w ∈ S G,i (u) and w ′ ∈ S H,j (u ′ ). Assume w and w ′ have the same color, i.e. χ G (w) = χ H (w ′ ). Then the following holds: • If i ∈ M G (u) and j ∈ M H (u ′ ), then dis R G (w, u) = dis R H (w ′ , u ′ ). • If i ∈ M G (u) and j / ∈ M H (u ′ ), then dis R G (w, u) < dis R H (w ′ , u ′ ). Proof. Proof of the first bullet: since i ∈ M G (u), any path from w to a node u G ∈ χ -1 G (χ G (u)) goes through the cut vertex u, implying min u G ∈χ -1 G (χ G (u)) dis R G (w, u G ) = dis R G (w, u). Similarly, since j ∈ M H (u ′ ), min u H ∈χ -1 H (χ H (u ′ )) dis R H (w ′ , u H ) = dis R H (w ′ , u ′ ). Since the color of nodes w and w ′ are the same under RD-WL, we have min u H ∈χ -1 H (χ H (u ′ )) dis R H (w ′ , u H ) = min u G ∈χ -1 G (χ G (u)) dis R G (w, u G ) and thus dis R H (w ′ , u ′ ) = dis R G (w, u). Proof of the second bullet: first note that dis R H (w ′ , u ′ ) ≥ dis R G (w, u) because dis R H (w ′ , u ′ ) ≥ min u H ∈χ -1 H (χ H (u ′ )) dis R H (w ′ , u H ) = min u G ∈χ -1 G (χ G (u)) dis R G (w, u G ) = dis R G (w, u). If the lemma does not hold, then dis R H (w ′ , u ′ ) = dis R G (w, u). 
Consequently, {{dis R G (w, u G ) : u G ∈ χ -1 G (χ G (u))}} = {{dis R G (w, u) + dis R G (u, u G ) : u G ∈ χ -1 G (χ G (u))}} = {{dis R H (w ′ , u ′ ) + dis R H (u ′ , u H ) : u H ∈ χ -1 H (χ H (u ′ ))}}. On the other hands, {{dis R G (w, u G ) : u G ∈ χ -1 G (χ G (u))}} = {{dis R H (w ′ , u H ) : u H ∈ χ -1 H (χ H (u ′ ))}}. Therefore, dis R H (w ′ , u H ) = dis R H (w ′ , u ′ ) + dis R H (u ′ , u H ) for all u H ∈ χ -1 H (χ H (u ′ )). However, we can choose u ′′ ∈ χ -1 H (χ H (u ′ )) ∩ S H,j (u ′ ) by definition of j, and clearly dis R H (w ′ , u ′′ ) < dis R H (w ′ , u ′ )+dis R H (u ′ , u ′′ ) because w ′ and u ′′ are in the same connected component (Lemma C.45). This yields a contradiction and concludes the proof. Corollary C.49. Let u ∈ V be a cut vertex of G. Let u ′ ∈ χ -1 H (χ G (u)) , then u ′ is also a cut vertex of H. Pick any S G,i (u) and S H,j (u ′ ) with indices i ∈ M G (u) and j ∈ M H (u ′ ). Then either of the following holds: • {{χ G (w) : w ∈ S G,i (u)}} = {{χ H (w) : w ∈ S H,j (u ′ )}}. • {{χ G (w) : w ∈ S G,i (u)}} ∩ {{χ H (w) : w ∈ S H,j (u ′ )}} = ∅. Proof. Assume {{χ G (w) : w ∈ S G,i (u)}} ∩ {{χ H (w) : w ∈ S H,j (u ′ )}} ̸ = ∅. Then there exists nodes w ∈ S G,i (u) in G and w ′ ∈ S H,j (u ′ ) in H, satisfying χ G (w) = χ H (w ′ ). Our goal is to prove that {{χ G (w) : w ∈ S G,i (u)}} = {{χ H (w) : w ∈ S H,j (u ′ )}}. It thus suffices to prove that for any color  c ∈ C, |χ -1 G (c) ∩ S G,i (u)| = |χ -1 H (c) ∩ S H,j (u ′ )|. Define D G (w, c) = {{dis R G (w, x) : x ∈ χ -1 G (c)}} and define D G (w, c) + d := {{d + d ′ : d ′ ∈ D G (w, c)}}. We next claim that |χ -1 G (c) ∩ S G,i (u)| = |χ -1 G (c)| -|D G (w, c) ∩ (D G (u, c) + dis R G (w, u))|. This is simply because for any x ∈ χ -1 G (c), either x ∈ S G,i (u) or x / ∈ S G,i (u). If x / ∈ S G,i (u), then dis R G (w, x) = dis R G (w, u) + dis R G (u, x) (Lemma C.45); otherwise, dis R G (w, x) ̸ = dis R G (w, u) + dis R G (u, x). 
Similarly, |χ -1 H (c) ∩ S H,j (u ′ )| = |χ -1 H (c)| -|D H (w ′ , c) ∩ (D H (u ′ , c) + dis R H (w ′ , u ′ ))|. Noting that |χ -1 G (c)| = |χ -1 H (c)|, D G (w, c) = D H (w ′ , c), and dis R G (w, u) = dis R H (w ′ , u ′ ) (Lemma C.48), we obtain |χ -1 G (c) ∩ S G,i (u)| = |χ -1 H (c) ∩ S H,j (u ′ )| (u) such that S G,i (u) ∩ χ -1 G (χ G (u)) = S G,j (u) ∩ χ -1 G (χ G (u)) = ∅, ′ ∈ V be a vertex in H. If χ G (u) = χ H (u ′ ), then m G (u) = m H (u ′ ) and {{{{χ G (w) : w ∈ S G,i (u)}}}} m G (u) i=1 = {{{{χ H (w) : w ∈ S H,i (u ′ )}}}} m H (u ′ ) i=1 . Proof. If both u and u ′ are not cut vertices, Corollary C.51 trivially holds since m G (u) = m H (u ′ ) = 1 and S G,1 (u) = V\{u}, S H,1 (u ′ ) = V\{u ′ }. Now assume u and u ′ are both cut vertices. We first claim that {{χ G (w) : w ∈ i∈M G (u) S G,i (u)}} = {{χ H (w) : w ∈ i∈M H (u ′ ) S H,i (u ′ )}}. To prove the claim, it suffices to prove that for each color c ∈ C, i∈M G (u) S G,i (u) ∩ χ -1 G (c) = i∈M H (u ′ ) S H,i (u ′ ) ∩ χ -1 H (c) . ( ) Note that |χ -1 G (c)| = |χ -1 H (c)|. Also note that by Lemma C.48, for any two nodes w 1 ∈ i∈M G (u) S G,i (u) ∩ χ -1 G (c) and w 2 ∈ i / ∈M G (u) S G,i (u) ∩ χ -1 G (c), we have dis R G (u, w 1 ) < dis R G (u, w 2 ). In other words, the following two sets does not intersect: D G (u, c) := {{dis R G (w, u) : w ∈ i∈M G (u) S G,i (u) ∩ χ -1 G (c)}}, D G (u, c) := {{dis R G (w, u) : w ∈ i / ∈M G (u) S G,i (u) ∩ χ -1 G (c)}}. Since χ G (u) = χ H (u ′ ), we have D G (u, c) ∪ D G (u, c) = D H (u ′ , c) ∪ D H (u ′ , c). Then D G (u, c) ∩ D G (u, c) = D H (u ′ , c) ∩ D H (u ′ , c) = ∅ implies that D G (u, c) = D H (u ′ , c) and D G (u, c) = D H (u ′ , c ). This proves (17) and thus ( 16) holds. We next claim that {{{{χ G (w) : w ∈ S G,i (u)}} : i ∈ M G (u)}} = {{{{χ H (w) : w ∈ S H,i (u ′ )}} : i ∈ M H (u ′ )}}. This simply follows by using ( 16) and Corollary C.49. 
Finally, (18) already yields the desired conclusion because: • If |M G (u)| = m G (u), then (16) implies that i∈M H (u ′ ) S H,i (u ′ ) = i∈M G (u) S G,i (u) = |V| -1 and thus |M H (u ′ )| = m H (u ′ ). • If |M G (u)| = m G (u) -1, then analogously |M H (u ′ )| = m H (u) -1. Furthermore, {{{{χ G (w) : w ∈ S G,i (u)}} : i / ∈ M G (u)}} = {{{{χ G (w) : w ∈ S H,i (u ′ )}} : i / ∈ M H (u ′ )}} because {{χ G (w) : w ∈ V\{u}}} = {{χ H (w) : w ∈ V\{u ′ }}}. In both cases, Corollary C.51 holds. We are now ready to prove that {{χ G (w) : w ∈ V}} = {{χ H (w) : w ∈ V}} implies BCVTree(G) ≃ BCVTree(H). Recall that in a block cut-vertex tree BCVTree(G), there are two types of nodes: all cut vertices of G, and all biconnected components of G. Each edge in BCVTree(G) is connected between a cut vertex u ∈ V and a biconnected component B ⊂ V such that u ∈ B. Given a fixed RD-WL graph representation R, consider any graph G = (V, E G ) satisfying {{χ G (w) : w ∈ V}} = R. First, all cut vertices of G can be determined purely from R using the node colors. We denote the cut vertex color multiset as C V := {{χ G (u) : u is a cut vertex of G}}. Next, the number m G (u) for each cut vertex u can be determined only by its color χ G (u) (by Corollary C.51), which is equal to the degree of node u in BCVTree(G). We now give a procedure to construct BCVTree(G), which purely depends on R rather than the specific graph G. We examine the multisets T (u ) := {{{{χ G (w) : w ∈ S G,i (u)}}}} m G (u) i=1 for all cut vertices u, which only depends on R and χ G (u) rather than the specific graph G or node u by Corollary C.51. See Figure 11 (b) for an illustration of T u for four types of cut vertices u. In the first step, we find all cut vertices u such that S∈T (u) 1[C V ∩ S ̸ = ∅] ≤ 1 where 1[•] is the indicator function. In other words, we find cut vertices u such that there is at most one connected component S G,i (u) that contains cut vertices. 
These cut vertices u will serve as "leaf (cut vertex) nodes" in BCVTree(G), in the sense that each of them connects to at most one internal node in BCVTree(G). The number of BCVTree leaf

After finding all the "leaf (cut vertex) nodes", we can then find cut vertex nodes v such that when removing all "leaf (cut vertex) nodes" in the BCVTree, v will serve as a "leaf (cut vertex) node". To do this, we compute for each cut vertex v and each biconnected component B v associated with v, whether B v has no cut vertex or all cut vertices in B v correspond to the "leaf (cut vertex) nodes" in BCVTree(G). Then, we check whether a cut vertex v satisfies ∑ S∈T (v) 1[(C V ∩ S)\C V u ≠ ∅] ≤ 1, where the set C V u contains all colors corresponding to "leaf (cut vertex) nodes". These vertices v will serve as new "leaf (cut vertex) nodes" when removing all "leaf (cut vertex) nodes" in the BCVTree, and the connection between such vertices v and "leaf (cut vertex) nodes" can also be determined (see Figure 11(d) for an illustration). The procedure can be recursively executed until the full BCVTree is constructed (see Figure 11(f)), and the whole procedure does not depend on the specific graph G and only depends on R, which completes the proof.

C.6 PROOF OF THEOREM 4.5

Given a graph G = (V, E), let χ t G be the 2-FWL color mapping after the t-th iteration (see Algorithm 2 for details), and let χ G be the stable 2-FWL color mapping. The following result is useful for the subsequent proof:

Lemma C.52. Let u 1 , u 2 , v 1 , v 2 ∈ V be nodes in graph G and t be an integer. The following holds:
• If χ t G (u 1 , v 1 ) = χ t G (u 2 , v 2 ), then u 1 = v 1 if and only if u 2 = v 2 ;
• If χ t G (u 1 , v 1 ) = χ t G (u 2 , v 2 ), then {u 1 , v 1 } ∈ E if and only if {u 2 , v 2 } ∈ E;
• If χ t G (u 1 , v 1 ) = χ t G (u 2 , v 2 ) and t ≥ 1, then deg G (u 1 ) = deg G (u 2 ) and deg G (v 1 ) = deg G (v 2 ).

Let χ t G : V G × V G → C be the 2-FWL color mapping of graph G after t iterations. We aim to prove that for any nodes u, v ∈ V G and w, x ∈ V H , if dis G (u, v) = dis H (w, x), then χ t G (u, v) = χ t H (w, x) for any t ∈ N. We prove it by induction. The base case of t = 0 trivially holds. Now suppose the case of t holds and let us consider the color mapping after t + 1 iterations. This already yields (24) by the induction result of iteration t. We thus complete the proof.

D FURTHER DISCUSSIONS WITH PRIOR WORKS

D.1 KNOWN METRICS FOR MEASURING THE EXPRESSIVE POWER OF GNNS

In this subsection, we review existing metrics used in prior works to measure the expressiveness of GNNs. We discuss the limitations of these metrics and argue why biconnectivity may serve as a more reasonable and compelling criterion in designing powerful GNN architectures.

WL hierarchy. Since the discovery of the relationship between MPNNs and the 1-WL test (Xu et al., 2019; Morris et al., 2019), the WL hierarchy has been considered the most standard metric to guide the design of expressive GNNs. However, achieving an expressive power that matches the 2-FWL test is already highly difficult. Indeed, each iteration of the 2-FWL algorithm already requires Ω(n^3) time and Θ(n^2) space for a graph with n vertices (Immerman & Lander, 1990). Therefore, it is impossible to design expressive GNNs using this metric while maintaining their computational efficiency. Moreover, whether achieving higher-order WL expressiveness is necessary and helpful for real-world tasks has been questioned by recent works (Veličković, 2022).

Structural metrics. Another line of works thus sought different metrics to measure the expressive power of GNNs. Several popular choices are the ability to count substructures (Arvind et al., 2020; Chen et al., 2020; Bouritsas et al., 2022), detect cycles (Loukas, 2020; Vignac et al., 2020; Huang et al., 2023), calculate the graph diameter (Garg et al., 2020; Loukas, 2020), or solve other graph-related (combinatorial) problems (Sato et al., 2019). Yet, all these metrics have a common drawback: the corresponding problems may be too hard for GNNs to solve. Indeed, we show in Table 4 that solving any of the above tasks requires a computational complexity that grows super-linearly w.r.t. the graph size, even using advanced algorithms. Therefore, it is quite natural that standard MPNNs are not expressive for these metrics, since no GNN can solve these tasks while being efficient.
Consequently, instead of using GNNs to directly learn these metrics, these works had to use a precomputation step which can be costly in the worst case.

Due to the lack of proper metrics, most subsequent works mainly justify the expressive power of their proposed GNNs by focusing on regular graphs (Li et al., 2020; Bevilacqua et al., 2022; Bodnar et al., 2021b; Feng et al., 2022; Velingker et al., 2022, to list a few), which hardly appear in practice. In contrast, the biconnectivity metrics proposed in this paper differ from all prior metrics, in that (i) they are basic graph properties with significant value in both theory and applications; (ii) they can be efficiently calculated with a complexity linear in the graph size, and thus it is reasonable to expect that these metrics should be learned by expressive GNNs.
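To make the linear-time claim concrete, the sketch below finds all cut vertices and cut edges with a single depth-first search in the style of Tarjan (1972). It is an illustrative implementation under our own assumptions rather than code from this paper: the function name `cut_vertices_and_edges` is hypothetical, the graph is assumed simple and undirected with nodes 0..n-1, and the recursive form is only suitable for small graphs.

```python
def cut_vertices_and_edges(n, edges):
    """Return (cut vertices, cut edges) of an undirected simple graph.

    One DFS with discovery times and low-links, so the cost is O(|V| + |E|).
    """
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    disc = [-1] * n  # discovery time of each node, -1 if unvisited
    low = [0] * n    # earliest discovery time reachable from the DFS subtree
    cut_vertices, cut_edges = set(), set()
    timer = 0

    def dfs(u, parent_edge):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        children = 0
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue  # do not walk back along the tree edge we came from
            if disc[v] != -1:
                low[u] = min(low[u], disc[v])  # back edge
            else:
                children += 1
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    cut_edges.add(frozenset((u, v)))
                if parent_edge != -1 and low[v] >= disc[u]:
                    cut_vertices.add(u)
        # The DFS root is a cut vertex iff it has at least two tree children.
        if parent_edge == -1 and children >= 2:
            cut_vertices.add(u)

    for start in range(n):
        if disc[start] == -1:
            dfs(start, -1)
    return cut_vertices, cut_edges


# Two triangles sharing node 2, plus a pendant edge {4, 5}.
example = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
print(cut_vertices_and_edges(6, example))
```

Each node and edge is processed a constant number of times, which is the sense in which biconnectivity is "easy" compared with the super-linear structural metrics discussed above.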

D.2 GNNS WITH DISTANCE ENCODING

In this subsection, we review prior works that are related to our proposed GD-WL. In the research field of expressive GNNs, the idea of incorporating distance first appeared in Li et al. (2020), where the authors mainly considered using distance encoding as node features and showed that distance can help distinguish regular graphs. They also considered an approach similar to k-hop aggregation by incorporating distance into the message-passing procedure (but without a systematic study). Zhang & Li (2021) designed a subgraph GNN that also uses (generalized) distance encoding as node features in each subgraph. Ying et al. (2021a) designed a Transformer architecture that incorporates distance information and empirically showed excellent performance. Very recently, Feng et al. (2022) formally studied the expressive power of k-hop GNNs. Yet, they still restricted the analysis to regular graphs. The concurrent work of Abboud et al. (2022) designed the shortest path network which is highly similar to our proposed SPD-WL. They showed the resulting model can alleviate the bottlenecks and over-squashing problems for MPNNs (Alon & Yahav, 2021; Topping et al., 2022) due to the increased receptive field.

Compared with prior works, our contribution lies in the following three aspects:
• We formalize the principled and more expressive GD-WL framework, which comprises SPD-WL as a special case. Our framework is theoretically clean and generalizes all prior works in a unified manner.
• We systematically and theoretically analyze the expressive power of SPD-WL for general graphs and highlight a fundamental advantage in distinguishing edge-biconnectivity.
• We design a Transformer-based GNN that is provably as expressive as GD-WL. Thus, our framework is not only for theoretical analysis, but can also be easily implemented with good empirical performance on real-world tasks.

Discussions with the concurrent work of Velingker et al. (2022).
After the initial submission, we became aware of a concurrent work (Velingker et al., 2022) which also explored the use of Resistance Distance to enhance the expressiveness of standard MPNNs. Here, we provide a comprehensive comparison of these two works. Overall, the main difference is that their approach incorporates RD (and several related affinity measures) into node/edge features (like Zhang & Li (2021)), while we use RD to design a new WL aggregation procedure. As for the theoretical analysis, they only give a few toy examples of regular graphs to justify the expressive power beyond the 1-WL test, while we give a systematic analysis of the power of RD-WL for general graphs and point out that it is fully expressive for vertex-biconnectivity. In Velingker et al. (2022), the authors also made comparisons to SPD and conjectured that RD may have additional advantages over SPD in terms of expressiveness. In fact, this question is formally answered in our work, by proving that RD-WL is expressive for vertex-biconnectivity while SPD-WL is not. Another important contribution of our work is that we provide an upper bound of the expressive power of RD-WL to be 2-FWL (3-WL), which reveals the limit of incorporating RD information. We also provide a precise and complete characterization of the expressiveness of RD-WL in distinguishing distance-regular graphs, which reveals that RD-WL can match the power of 2-FWL in distinguishing these hard graphs.
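To illustrate what a distance-based WL aggregation procedure can look like, as opposed to feeding distances in as node features, here is a minimal sketch. It assumes the refinement rule recolors each node $v$ by the multiset of pairs (distance from $v$ to $u$, current color of $u$) over all nodes $u$; the function name `gd_wl_colors`, the dense distance-matrix input, and the stopping rule are illustrative assumptions rather than the paper's Algorithm 4 verbatim. Any distance (SPD, RD, or another generalized distance) can be plugged in through `dist`.

```python
def gd_wl_colors(dist):
    """Distance-based color refinement on a single graph (illustrative sketch).

    dist : n x n matrix (list of lists) with dist[v][u] the chosen distance
           between nodes v and u; float("inf") is allowed for unreachable pairs.
    Returns a list of integer colors, one per node.
    """
    n = len(dist)
    colors = [0] * n                      # all nodes start with the same color
    for _ in range(n):                    # a partition of n nodes can split at most n-1 times
        # Signature of v: sorted multiset of (distance to u, color of u) pairs.
        signatures = [
            tuple(sorted((dist[v][u], colors[u]) for u in range(n)))
            for v in range(n)
        ]
        # Map each distinct signature to a small integer (plays the role of a hash).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        refined = [palette[sig] for sig in signatures]
        stable = len(palette) == len(set(colors))   # no class was split this round
        colors = refined
        if stable:
            break
    return colors
```

On the 3-node path with shortest path distances, the two endpoints receive one color and the middle node another; on a 4-cycle all nodes keep a single color.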

E IMPLEMENTATION OF GENERALIZED DISTANCE WEISFEILER-LEHMAN

In this section, we give implementation details of GD-WL and our proposed GNN architecture. We also give a detailed analysis of its computational complexity. Below, assume the input graph $G = (V, E)$ has $n$ vertices and $m$ edges.

[...] theoretical results on Simplicial Complexes to regular Cell Complexes. Such generalization provides a powerful set of graph "lifting" transformations with a hierarchical message passing procedure.

Moreover, we compare several Subgraph GNNs. Nested Graph Neural Network (NGNN) (Zhang & Li, 2021) represents a graph with rooted subgraphs instead of rooted subtrees. It extracts a local subgraph around each node and applies a base GNN to each subgraph to learn a subgraph representation. The whole-graph representation is then obtained by pooling these subgraph representations. GNN-AK (Zhao et al., 2022) follows a similar manner to develop Subgraph GNNs with different generation policies. Equivariant Subgraph Aggregation Networks (ESAN) (Bevilacqua et al., 2022) develops a unified framework that includes per-layer aggregation across subgraphs, which are generated using pre-defined policies like edge deletion and ego-networks. Subgraph Union Network (SUN) (Frasca et al., 2022) is developed based on the symmetry analysis of a series of existing Subgraph GNNs and an upper bound on their expressive power, which theoretically unifies previous architectures and performs well across several graph representation learning benchmarks.

Last, we compare several Graph Transformer models. GraphTransformer (GT) (Dwivedi & Bresson, 2021) uses the Transformer model on graph tasks, which only aggregates the information from neighbor nodes to ensure graph sparsity, and proposes to use Laplacian eigenvectors as positional encoding. Spectral Attention Network (SAN) (Kreuzer et al., 2021) uses a learned positional encoding (LPE) that can take advantage of the full Laplacian spectrum to learn the position of each node in a given graph.
Graphormer (Ying et al., 2021a) develops the centrality encoding, spatial encoding, and edge encoding to incorporate the graph structure information into the Transformer model. Universal RPE (URPE) (Luo et al., 2022b) first shows that there exist continuous sequence-to-sequence functions which RPE-based Transformers cannot approximate, and develops a novel and universal attention module called Universal RPE-based Attention. The effectiveness of URPE has been verified across language and graph benchmarks (e.g., the ZINC dataset).

Settings. Our Graphormer-GD consists of 12 layers. The dimensions of the hidden layers and feed-forward layers are both set to 80. The number of Gaussian Basis kernels is set to 128. The number of attention heads is set to 8. The batch size is selected from [128, 256, 512]. We use AdamW (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-8 and (β1, β2) to (0.9, 0.999). The peak learning rate is selected from [4e-4, 5e-4]. The model is trained for 600k and 800k steps for ZINC-Subset and ZINC-Full, respectively, with a 60k-step warm-up stage in both cases. After the warm-up stage, the learning rate decays linearly to zero. The dropout ratio is selected from [0.0, 0.1]. The weight decay is selected from [0.0, 0.01]. All models are trained on 4 NVIDIA Tesla V100 GPUs.
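The learning-rate schedule described above can be sketched as follows. The text only states that a warm-up stage is followed by linear decay to zero; the linear shape of the warm-up itself, the function name `learning_rate`, and the default arguments (taken from the ZINC-Subset setting with the 4e-4 peak) are assumptions made for illustration.

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=60_000, total_steps=600_000):
    """Learning rate at a given training step (illustrative sketch).

    Assumes a linear ramp from 0 to peak_lr during warm-up (the ramp shape is
    an assumption), then the linear decay to zero stated in the settings.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 60k and reaches zero at step 600k; for ZINC-Full one would pass `total_steps=800_000`.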

F.3 MORE TASKS

Node-level Tasks. We further conduct experiments on real-world node-level tasks. Following Li et al. (2020), we benchmark our model on two real-world graphs: Brazil-Airports and Europe-Airports, both of which are air traffic networks collected by Ackland et al. (2005) from government websites. The nodes in each graph represent airports, and each edge indicates that there are commercial flights between the connected nodes. The Brazil-Airports graph has 131 nodes and 1038 edges in total, and its diameter is 5. The Europe-Airports graph has 399 nodes and 5995 edges in total, and its diameter is 5. The airport nodes are divided into 4 different levels according to the annual passenger flow distribution by 3 quantiles: 25%, 50%, and 75%. The task is to predict the level of each airport node. We follow Li et al. (2020) to split the nodes of each graph into train/validation/test subsets with a ratio of 0.8/0.1/0.1. We report the test accuracy of the checkpoint that performs best on the validation set. We repeat the experiments 20 times with different seeds and report the average accuracy. Following Li et al. (2020), we choose several competitive baselines including classical MPNNs (GCN, GraphSAGE, GIN), Struc2vec, and Distance-encoding based GNNs (DE-GNN-SPD, DE-GNN-LP, DEA-GNN-SPD). We refer interested readers to Li et al. (2020) for detailed descriptions of the baselines. For our Graphormer-GD, the dimensions of the hidden layers and feed-forward layers are both set to 80. The number of layers is selected from [3, 6]. The number of Gaussian Basis kernels is set to 128. The number of attention heads is set to 8. The batch size is selected from [4, 8, 16, 32]. We use AdamW (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-8 and (β1, β2) to (0.9, 0.999). The peak learning rate is selected from [2e-4, 7e-5, 4e-5]. The total number of training steps is selected from [500, 1000, 2000]. The ratio of the warm-up stage is set to 10%.
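As a small illustration of how the four airport levels could be derived from passenger flow, the sketch below bins values by the 25%, 50%, and 75% quantiles. The function name `quartile_levels`, the input format, and the particular quantile estimator (the default of Python's `statistics.quantiles`) are assumptions; the datasets ship with precomputed labels, so this only mirrors the described labelling rule.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_levels(passenger_flow):
    """Assign each airport a level in {0, 1, 2, 3} by passenger-flow quartile.

    passenger_flow : list of annual passenger counts, one per airport
                     (hypothetical input format).
    """
    cuts = quantiles(passenger_flow, n=4)   # three cut points: 25%, 50%, 75%
    # Level = number of cut points that the value is greater than or equal to.
    return [bisect_right(cuts, flow) for flow in passenger_flow]
```

Each level then contains roughly a quarter of the nodes, which yields a balanced 4-class node classification task.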



The space complexity of WL algorithms may differ from the corresponding GNN models in training, e.g., for DS-WL and GD-WL, due to the need to store intermediate results for back-propagation.



Figure 1: An illustration of edge-biconnectivity and vertex-biconnectivity. Cut vertices/edges are outlined in bold red. Gray nodes in (b)/(c) are edge/vertex-biconnected components, respectively.

is connected and for any proper superset $T \supsetneq S$, $G[T]$ is disconnected. Denote $\mathrm{CC}(G)$ as the set of all connected components, then $\mathrm{CC}(G)$ forms a partition of the vertex set $V$. Clearly, $G$ is connected iff $|\mathrm{CC}(G)| = 1$.

Definition 2.2. (Biconnectivity) A node $v \in V$ is a cut vertex (or articulation point) of $G$ if removing $v$ increases the number of connected components, i.e., $|\mathrm{CC}(G[V \setminus \{v\}])| > |\mathrm{CC}(G)|$. A graph is vertex-biconnected if it is connected and does not have any cut vertex. A vertex set $S \subset V$ is a vertex-biconnected component of $G$ if $G[S]$ is vertex-biconnected and for any proper superset $T \supsetneq S$, $G[T]$

Figure 2: Illustration of four representative counterexamples (see Examples C.9 and C.10 for general definitions). Graphs in the first row have cut vertices (outlined in bold red) and some also have cut edges (denoted as red lines), while graphs in the second row do not have any cut vertex or cut edge.

Substructure Counting WL/GSN. Bouritsas et al. (2022) developed a principled approach to boost the expressiveness of MPNNs by incorporating substructure counts into node features or the 1-WL aggregation procedure. The resulting algorithm, which we call the SC-WL, is detailed in Appendix B.3. However, we show no matter what sub-structures are used, the corresponding GSN still cannot solve any biconnectivity problem listed in Section 2. We give a proof in Appendix C.2 for the general case that allows arbitrary substructures, based on Examples C.9 and C.10. We also point out that our negative result applies to the similar GNN variant in Barceló et al. (2021).

Theorem 3.1. Let $H = \{H_1, \cdots, H_k\}$, $H_i = (V_i, E_i)$ be any set of connected graphs and denote $n = \max_{i \in [k]} |V_i|$. Then SC-WL (Appendix B.3) using the substructure set $H$ cannot solve any vertex/edge-biconnectivity problem listed in Section 2. Moreover, there exist counterexample graphs whose sizes (both in terms of vertices and edges) are $O(n)$.

based on homomorphism counting. Bodnar et al. (2021b;a); Thiede et al. (2021); Horn et al. (2022) further developed novel WL aggregation schemes that take into account these substructures (e.g., cycles or cliques). Toenshoff et al. (2021) considered using random walk techniques to generate small substructures.

Other approaches. Wijesinghe & Wang (2022); de Haan et al. (2020) designed novel variants of MPNNs based on more powerful neighborhood aggregation schemes that are aware of the local graph structure, rather than simply treating neighboring nodes as a set. Li et al. (2020); Velingker et al. (2022) incorporated distance encoding into node/edge features to enhance the expressive power of MPNNs. Balcilar et al. (2021); Feldman et al. (

and the number of iterations T Output: Color mapping χ G : V → C Initialize: Pick a fixed (arbitrary) element c 0 ∈ C, and let χ 0

Proposition C.3. Consider the DSS-WL algorithm (Algorithm 4) with arbitrary graph selection policy $\pi$. Let $\chi_G$ and $\chi_H$ be the color mappings for graphs $G$ and $H$, and let $\{\!\{\chi_{G_i} : i \in [m_G]\}\!\}$ and $\{\!\{\chi_{H_i} : i \in [m_H]\}\!\}$ be the color mappings for subgraphs generated by $\pi$. Then,
• $\chi_G$ and $\chi_H$ jointly satisfy the WL-condition;
• $\chi_{G_i}$ and $\chi_{H_j}$ jointly satisfy the WL-condition for any $i \in [m_G]$ and $j \in [m_H]$.

}} and concludes the proof.

Proposition C.4. Let $\chi_G$ and $\chi_H$ be two mappings returned by SPD-WL (Algorithm 4 with $d_G = \mathrm{dis}_G$) for graphs $G$ and $H$, respectively. Then $\chi_G$ and $\chi_H$ jointly satisfy the WL-condition.

Figure 3: Illustration of the proof of Theorem 3.1. The trees G 1 [S], G 2 [g(S)] are outlined by orange.

Figure 4: Several illustrations to help understand the lemmas.

by using Lemma C.19(b). By the definition of w and Lemma C.22, any path from w to y

by Lemma C.19(d), and thus using Lemma C.20 we have $\{\!\{\chi^u_G(w) : w \in V\}\!\} = \{\!\{\chi^{u'}_H(w) : w \in V\}\!\}$ and finally obtain $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$ by Lemma C.19(b).)

Figure 5: Several illustrations to help understand the main proof of Theorem 3.2.

Figure 5(b) for an illustration of this paragraph. Clearly, we have $z_d = w$ using $\chi^w_G(z_d) = \chi^{w'}_H(w')$ and Lemma C.19(a). On the other hand, by Lemma C.19(b),

also a cut edge of $H$. This basically finishes the proof of the first bullet in the theorem. Finally, we consider the general setting where graphs $G$, $H$ can be disconnected and their representation is not the same in Appendix C.4.3, and complete the proof of Theorem 4.1.

Without abuse of notation, throughout Appendices C.4.1 and C.4.2 we redefine the color set $C := \{\chi_G(w) : w \in V\} = \{\chi_H(w) : w \in V\}$ to focus only on colors that are present in $G$ (or $H$), rather than all (irrelevant) colors in the range of a hash function.

C.4.1 THE CASE OF $\chi_G(u) \neq \chi_G(v)$ FOR CONNECTED GRAPHS

We first define several notations. Throughout this case, denote $\{S_u, S_v\}$ as the partition of $V$, representing the two connected components after removing the edge $\{u, v\}$ such that $u \in S_u$, $v \in S_v$, $S_u \cap S_v = \emptyset$ and $S_u \cup S_v = V$. We then define an important concept called the color graph.

Definition C.26. (Color graph) Define the auxiliary color graph $G^C = (C, E_{G^C})$ where $E_{G^C} = \{\{\!\{\chi_G(w), \chi_G(x)\}\!\} : \{w, x\} \in E_G\}$. Note that $G^C$ can have self loops, so each edge is denoted as a multiset with two elements.

Lemma C.27. Let $S = \chi_G^{-1}(\chi_G(u)) \cup \chi_G^{-1}(\chi_G(v))$ be the set containing vertices with color $\chi_G(u)$ or $\chi_G(v)$. Then either $S \cap S_u = \{u\}$ or $S \cap S_v = \{v\}$.

Proof. Assume the lemma does not hold, i.e. $|S \cap S_u| > 1$ and $|S$

Figure 6: Illustration of the proof of Lemma C.27.

Therefore, $|D(w')| = 1$. On the other hand, we clearly have $|D(w)| \geq 2$ since both $u_1, u_2 \in D(w)$. Consequently, $w$ and $w'$ cannot have the same color under the SPD-WL algorithm because $|D(w)| \neq |D(w')|$. This yields a contradiction and completes the proof.

The next lemma presents an important property of the color graph $G^C$ (defined in Definition C.26).

Lemma C.29. $G^C$ has a cut edge $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$.

necessarily simple), without going through the edge $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$. This contradicts Lemma C.29, which says that $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ is a cut edge in $G^C$.

Based on Lemma C.29, the cut edge $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ partitions the vertices $C$ of the color graph $G^C$ into two classes. Denote them as $\{C_u, C_v\}$ where $\chi_G(u) \in C_u$ and $\chi_G(v) \in C_v$. The next corollary characterizes the structure of the node colors calculated in SPD-WL.

Corollary C.31. For any $w$ satisfying $\chi$

due to Lemma C.27. In general, Corollary C.31 says that all the cut edges with color $\{\chi_G(u), \chi_G(v)\}$ play an equal role: Lemma C.27 applies for any chosen cut edge $\{u', v'\}$. An illustration of Corollary C.31 is given in Figure 8(a).

Figure 9: Illustrations to help understand the proof of the main result.

Lemma C.42. Given node $w \in V$ and color $c \in C$, define multisets

$$D_{G,=}(w, c) := \{\!\{\mathrm{dis}_G(w, x) : x \in \chi_G^{-1}(c),\ h_G(x) = h_G(w)\}\!\},$$
$$D_{G,\neq}(w, c) := \{\!\{\mathrm{dis}_G(w, x) : x \in \chi_G^{-1}(c),\ h_G(x) \neq h_G(w)\}\!\}.$$

For any two nodes $w, w' \in V$ in graphs $G$ and $H$ satisfying $\chi_G(w) = \chi_H(w')$, pick any $d \in D_{G,\neq}(w, c)$ and $d' \in D_{H,=}(w', c)$. Then $d' < d$.

By Lemma C.42, $D_{H,=}(w', c_1) \cap D_{G,\neq}(w, c_1) = \emptyset$. Since $w'$ and $w$ have the same color under SPD-WL, $D_{H,=}(w', c_1) \cup D_{H,\neq}(w', c_1) = D_{G,=}(w, c_1) \cup D_{G,\neq}(w, c_1)$. By Lemma C.41, $|D_{H,=}(w', c_1)| = |D_{H,\neq}(w', c_1)| = |D_{G,=}(w, c_1)| = |D_{G,\neq}(w, c_1)|$. Therefore, $D_{G,\neq}(w, c_1) = D_{H,\neq}(w', c_1)$.

$x_1, x_2$. Obviously, $|S_G| = |S_H|$ due to the facts that $\mathrm{dis}_G(w_1, y) = \infty \neq \mathrm{dis}_G(w_1, y')$ for all $y \notin S_G$, $y' \in S_G$ and that the two edges $\{w_1, w_2\} \in E_G$, $\{x_1, x_2\} \in E_H$ have the same color under SPD-WL. Moreover, $\{\!\{\chi_G(w) : w \in S_G\}\!\} = \{\!\{\chi_H(w) : w \in S_H\}\!\}$. Now consider re-executing the SPD-WL algorithm on the subgraphs $G[S_G]$ and $H[S_H]$ induced by the vertices in the sets $S_G$ and $S_H$, respectively. It follows that for any $u_G \in S_G$ and $u_H$

uw contains v. Clearly, P v uw ̸ = ∅ and P v uw ̸ = ∅. Given a path $P = (x_0, \cdots, x_m)$, define the probability function $q(P) := 1 / \prod_{i=0}^{m-1} \deg_G(x_i)$. Then by definitions of the average hitting time $h$, $h_G(u, w) =$

H by Lemma C.45. Now consider the case when $|\chi_G^{-1}(\chi_G(u))| = 1$. Then $|\chi_H^{-1}(\chi_G(u))| = 1$ and we can denote the node in $\chi_H^{-1}(\chi_G(u))$ as $u$ without abuse of notation. Choose arbitrary two nodes $w_1 \in S_1$ and $w$

Figure 11: Illustrations for constructing the BCVTree given the graph representation R.


Accuracy on cut vertex (articulation point) and cut edge (bridge) detection tasks.

Mean Absolute Error (MAE) on ZINC test set. Following Dwivedi et al. (2020), the parameter budget of compared models is set to 500k. We use * to indicate the best performance.

Lehman Algorithms and Recently Proposed Variants
B.1 1-WL Test
B.2 k-FWL Test
B.3 WL with Substructure Counting (SC-WL)
B.4 Equivariant Subgraph Aggregation WL (DSS-WL)
B.5 Generalized Distance WL (GD-WL)

2m}}. See Figure 2(d) for an illustration of the case $n = 8$. It is easy to see that $G_1$ does not have any cut vertex or cut edge, but $G_2$ does have two cut vertices with node numbers $m$ and $2m$, and has a cut edge $\{m, 2m\}$.

Theorem C.11. Let $H = \{H_1, \cdots, H_k\}$, $H_i = (V_i, E_i)$ be any set of connected graphs and denote $n_V = \max_{i \in [k]} |V_i|$. Then SC-WL (Appendix B.3) using the substructure set $H$ can neither distinguish whether a given graph has cut vertices nor distinguish whether it has cut edges. Moreover, there exist counterexample graphs whose size (both in terms of vertices and edges) is $O(n_V)$.

Proof. We would like to prove that SC-WL cannot distinguish both Examples C.9 and C.10 when $n_V < m$ ($m$ is defined in these examples). First note that for both examples, any cycle in both $G_1$ and $G_2$ has a length of at least $m$. Since the number of nodes in $H_i$ is $O(n_V)$, if $H_i$ contains cycles, it will not occur in either $G_1$ or $G_2$, and thus has no effect in distinguishing the two graphs. As a result, we can simply assume all graphs in $H$ are trees (connected graphs with no cycles). Below, we provide a complete proof for Example C.9, which already yields the conclusion that SC-WL can neither distinguish cut vertices nor cut edges. We omit the proof for Example C.10 since the proof technique is similar.

be the d-hop neighbors of node u in graph G, and denote C d G := {{χ u G (w) : w ∈ N d G (u)}} as the multiset containing the color of all nodes w with distance d to node u. We can similarly denote N d It suffices to prove that for all d ∈ N + , C d G = C d H . We will prove the above result by induction. The case of d = 0 is trivial. Now suppose the case of d is true (i.e., C d G = C d H ) and we want to prove C d+1

then $\{w_1, w_2\}$ is a cut edge if and only if $\{x_1, x_2\}$ is a cut edge.
• If the graph representations of $G$ and $H$ are the same under SPD-WL, then their block cut-edge trees (Definition 2.3) are isomorphic. Mathematically, $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$ implies that $\mathrm{BCETree}(G) \simeq \mathrm{BCETree}(H)$.

The proof of Theorem 4.1 is highly non-trivial and is divided into three parts (presented in Appendices C.4.1 to C.4.3, respectively). We first consider the special setting when both $G$ and $H$ are connected and $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$. Assume $G$ is not edge-biconnected, and let $\{u, v\} \in E_G$ be a cut edge in $G$. We separately consider two cases:

Since we have assumed that the graph representations of $G$ and $H$ are the same, i.e. $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$, the size of the set $\{w \in V : \chi_H(w) = \chi_G(u)\}$ must be 2. We may denote the elements as $u$ and $v$ without abuse of notation and thus $\{u, v\} \in E_H$. Also for any $w \in V$, we have $\mathrm{dis}_H(w, u) \neq \mathrm{dis}_H(w, v)$. Therefore, we can similarly define the mapping $f_H : V \to \{u, v\} \times V$ and the mapping $h_H : V \to \{u, v\}$ as in (12). The auxiliary graph $H^A$ is defined analogously to Definition C.34.

Lemma C.41. For any $c \in C$, $|f_H^{-1}$

Then the above argument implies that if $w \in S_u$, $x \in S_v$ and $\{w, x\} \in E_G$, then $\{w, x\} = \{u, v\}$. Therefore $\{u, v\}$ is a cut edge of graph $H$.

C.4.3 THE GENERAL CASE

The above proof assumes that the graphs $G$ and $H$ are both connected, and their graph representations are equal, i.e. $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$. In the subsequent proof we remove these assumptions and prove the general setting.

Lemma C.43. Either of the following two properties holds:
• $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$;
• $\{\!\{\chi_G(w) : w \in V\}\!\} \cap \{\!\{\chi_H(w) : w \in V\}\!\} = \emptyset$.

Proof. Consider the GD-WL procedure defined in Algorithm 4 with arbitrary distance function $d_G$. Suppose at iteration $t \geq T$, $\{\!\{\chi^t_G(w) : w \in V\}\!\} \neq \{\!\{\chi^t_H(w) : w \in V\}\!\}$. Then at iteration $t + 1$, we have for each $v \in V$,

then Lemma C.47 clearly holds. Otherwise, by Lemma C.46 there exist two sets $S_i$ and $S_j$ satisfying $S$

and conclude the proof. Remark C.50. As a special case, Lemma C.48 and Corollary C.49 also hold when G = H. For example, Corollary C.49 implies that for any S G,i (u) and S G,j

either of the two items in Corollary C.49 holds.

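By the 2-FWL update rule (2),

$$\{\!\{(\chi^t_G(u, z), \chi^t_G(z, v)) : z \in V_G\}\!\} = \{\!\{(\chi^t_H(w, z), \chi^t_H(z, x)) : z \in V_H\}\!\}. \tag{24}$$

By Lemma C.60, we have

$$\{\!\{(\mathrm{dis}_G(u, z), \mathrm{dis}_G(z, v)) : z \in V_G\}\!\} = \{\!\{(\mathrm{dis}_H(w, z), \mathrm{dis}_H(z, x)) : z \in V_H\}\!\}.$$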

The best computational complexity of known algorithms for solving different graph problems. Here n and m are the number of nodes and edges of a given graph, respectively.

ACKNOWLEDGMENTS

Bohang Zhang is grateful to Ruichen Li for his great help in discussing and checking several of the main results in this paper, including Theorems 3.1, 3.2, 4.1 and C.58. In particular, after the initial submission, Ruichen Li discovered a simpler proof of Lemma C.28 and helped complete the proof of Theorem C.58. Bohang Zhang would also like to thank Yiheng Du, Kai Yang and Ruichen Li for correcting some small mistakes in the proofs of Lemmas C.20 and C.45. We also thank all the anonymous reviewers for the careful reviews and the valuable suggestions. Their help has further enhanced our work. This work is supported by the National Key R&D Program of China (2022ZD0114900) and the National Science Foundation of China (NSFC62276005).


the vertex partition corresponding to the two connected components after removing the edge $\{u, v\}$, satisfying $u \in S_{G,u}$, $v \in S_{G,v}$, $S_{G,u} \cap S_{G,v} = \emptyset$, $S_{G,u} \cup S_{G,v} = V$. It suffices to prove that given a cut edge $\{u, v\} \in E_G$ with color $\{\chi_G(u), \chi_G(v)\}$, the multiset $\{\!\{\chi_G(w) : w \in S_{G,u}, \chi_G(w) \in C_V\}\!\}$ can be determined purely based on $R$ and the edge color $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$, rather than the specific graph $G$ or edge $\{u, v\}$. This basically concludes the proof, since the BCETree can be uniquely constructed as follows: if $\{\!\{\chi_G(w) : w \in S_{G,u}, \chi_G(w) \in C_V\}\!\} = \{\!\{\chi_G(u)\}\!\}$ (i.e. with only one element), then $\{\!\{\chi_G(u), \chi_G(v)\}\!\}$ is a leaf edge of the BCETree such that $\chi_G(u)$ connects to a biconnected component that is a leaf of the BCETree. After finding all the leaf edges, we can then find the BCETree edges that connect to leaf edges and determine which leaf edges they connect. The procedure can be recursively executed until the full BCETree is constructed. The whole procedure does not depend on the specific graph $G$ and only depends on $R$.

We now show how to determine $\{\!\{\chi_G(w) : w \in S_{G,u}, \chi_G(w) \in C_V\}\!\}$ given a cut edge $\{u, v\} \in E_G$ with color $\{\chi_G(u), \chi_G(v)\}$. Define the multiset [...] ([...] can be picked arbitrarily). Note that $D(c_1, c_2)$ is well-defined (does not depend on $w$) by definition of the SPD-WL color. For any $c_u, c_v \in C_E$, pick arbitrary cut edge $\{u, v\}$ with color $\chi_G(u$ [...] where $\{\!\{c\}\!\} \times m$ denotes a multiset with $m$ repeated elements $c$, and $D(c_u, c) + 1 := \{\!\{d + 1 : d \in D(c_u, c)\}\!\}$. Intuitively speaking, $T(c_u, c_v)$ corresponds to the color of all nodes $w \in V$ such that $\mathrm{dis}_G(u, w) + 1 = \mathrm{dis}_G(v, w)$ and $\chi_G(w) \in C_V$. Therefore, $T(c_u, c_v)$ is exactly the multiset $\{\!\{\chi_G(w) : w \in S_{G,u}, \chi_G(w) \in C_V\}\!\}$ and we have completed the proof.

C.5 PROOF OF THEOREM 4.2

Theorem C.44. Let $G = (V, E_G)$ and $H = (V, E_H)$ be two graphs, and let $\chi_G$ and $\chi_H$ be the corresponding RD-WL color mapping.
Then the following holds:
• For any two nodes $w \in V$ in $G$ and $x \in V$ in $H$, if $\chi_G(w) = \chi_H(x)$, then $w$ is a cut vertex of $G$ if and only if $x$ is a cut vertex of $H$.
• If the graph representations of $G$ and $H$ are the same under RD-WL, then their block cut-vertex trees (Definition 2.4) are isomorphic. Mathematically, $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$ implies that $\mathrm{BCVTree}(G) \simeq \mathrm{BCVTree}(H)$.

Proof Sketch. First observe that Lemma C.43 holds for general distances and thus applies here. Therefore, if $\chi_G(w) = \chi_H(x)$, the graph representations will be the same, i.e. $\{\!\{\chi_G(w) : w \in V\}\!\} = \{\!\{\chi_H(w) : w \in V\}\!\}$. By a similar analysis as SPD-WL (Appendix C.4.3), we can only focus on the case that both graphs are connected. We prove the first bullet of Theorem 4.2 in Appendix C.5.1 and prove the second bullet in Appendix C.5.2, both assuming that $G$ and $H$ are connected and their graph representations are the same.

C.5.1 PROOF OF THE FIRST PART

We first present a key property of the Resistance Distance, which surprisingly relates to the cut vertices in a graph.

Lemma C.45. Let $G = (V, E)$ be a connected graph and $v \in V$. Then $v$ is a cut vertex of $G$ if and only if there exist two nodes $u, w \in V$, [...]

Proof. We use the key finding that the Resistance Distance is equivalent to the Commute Time Distance multiplied by a constant (Chandra et al., 1996, [...]). Here, the Commute Time Distance is defined as $\mathrm{dis}^C_G(u, w) := h_G(u, w) + h_G(w, u)$ where $h_G(u, w)$ is the average hitting time from $u$ to $w$ in a random walk (Appendix E.2).

[...]

Proof. By the initial coloring (6) of 2-FWL, $\chi^0_G(u, v)$ can have the following three types of values: [...] where $c_{\mathrm{same}}$, $c_{\mathrm{edge}}$, $c_{\mathrm{other}}$ are three different colors. Therefore, if [...] (19). This concludes the proof of the case $t \geq 1$ by a simple induction.

[...] which is a tuple of length $d - 1$. We have the following key lemma:

[...], then the following holds:
• Denote $P^d(u, v)$ be the set of all paths (not necessarily simple) from node $u$ to node $v$ of length $d$. [...]

Proof. We prove the lemma by induction over iteration $t$. We first prove the base case $t = 0$. [...] where both sets have a single element that is an empty tuple (0-dimension).

Now suppose that the conclusion of Lemma C.53 holds in iteration $t$, we will prove that it also holds in iteration $t + 1$. First note that for any two nodes $u, v$, [...], then by definition of the 2-FWL update formula (19) [...] $|P^{t+1}(u_2, w)|$ and thus we have [...] Further using the third bullet of Lemma C.52 and rearranging the two multisets yields [...] This concludes the proof of the induction step.

The above lemma directly yields the following corollary:

[...] is the set containing all hitting paths of length $i$ from $u$ to $v$, and [...]. By Lemma C.53, we have $\sum_{P \in Q_i(u_1, v_1)} q(P) = \sum_{P \in Q_i(u_2, v_2)} q(P)$ and $\sum_{P \in Q_i(v_1, u_1)} q(P) = \sum_{P \in Q_i(v_2, u_2)} q(P)$ for all $i \geq 0$ (the case $i = 0$ trivially holds) and thus [...]

We are now ready to prove Theorem 4.5. [...] $w_2)$ for some nodes $w_1$ and $w_2$, then $\chi_G(w_1) = \chi_G(w_2)$.
Therefore, by using Corollary C.54 we obtain that if $\chi$ [...]

The above equations show that the partition induced by $\chi^{\mathrm{2FWL}}_G$ is finer than both $\chi^{\mathrm{SPDWL}}_G$ and $\chi^{\mathrm{RDWL}}_G$ and conclude the proof.

Finally, the following proposition trivially holds and will be used to prove Corollary 4.6.

Proposition C.56. Given a graph $G = (V, E_G)$, let $\chi_G$ and $\tilde{\chi}_G$ be two color mappings induced by two different (general) color refinement algorithms, respectively. If the vertex partition induced by the mapping $\chi_G$ is finer than that of $\tilde{\chi}_G$, then:
• The mapping $\chi_G$ can distinguish cut vertices/edges if $\tilde{\chi}_G$ can distinguish cut vertices/edges;
• The mapping $\chi_G$ can distinguish the isomorphism type of BCVTree($G$)/BCETree($G$) if $\tilde{\chi}_G$ can distinguish the isomorphism type of BCVTree($G$)/BCETree($G$).

C.7 REGARDING DISTANCE-REGULAR GRAPHS

In this subsection, we give more fine-grained theoretical results on the expressiveness upper bound of GD-WL by considering the special problem of distinguishing distance-regular graphs, a class of hard graphs that are highly relevant to the GD-WL framework. We provide a full characterization of what types of distance-regular graphs different GD-WL algorithms can or cannot distinguish, with both proofs and counterexamples.

Given [...], the number of $i$-hop neighbors is the same for all nodes. We thus denote $\kappa$ [...]

We now present our main results.

Theorem C.58. Let $G$ and $H$ be two connected distance-regular graphs. Then the following holds:
• SPD-WL can distinguish the two graphs if and only if their k-hop-neighbor arrays differ, i.e. $\kappa(G) \neq \kappa(H)$.
• RD-WL can distinguish the two graphs if and only if their intersection arrays differ, i.e. $\iota(G) \neq \iota(H)$.
• 2-FWL can distinguish the two graphs if and only if their intersection arrays differ, i.e. $\iota(G) \neq \iota(H)$.

Theorem C.58 precisely characterizes the equivalence class of all distance-regular graphs for different types of algorithms. Combined with the fact that $\iota$ [...]

Corollary C.59. RD-WL is strictly more powerful than SPD-WL in distinguishing non-isomorphic distance-regular graphs. Moreover, RD-WL is as powerful as 2-FWL in distinguishing non-isomorphic distance-regular graphs.

Counterexamples. We provide representative counterexamples in Figure 12 for both SPD-WL and RD-WL. In Figure 12 [...]

Proof of the first item of Theorem C.58. This part is straightforward. Consider the SPD-WL color mapping $\chi^1_G$ of graph $G$ after the first iteration. Then for two graphs $G$, $H$ with $n$ nodes, [...] for all nodes $u$ in $G$ and $v$ in $H$, implying that SPD-WL can distinguish the two graphs. On the other hand, if $\kappa(G) = \kappa(H)$, then for any node $u$ in $G$ and $v$ in $H$ we have $\chi^1$ [...] for any iteration $t \in \mathbb{N}$, and thus SPD-WL cannot distinguish the two graphs.

Proof of the second item of Theorem C.58.
The key insight is that given a distance-regular graph, the resistance distance between a pair of nodes $(u, v)$ only depends on its SPD. Formally, [...] for any nodes $u, v, w, x$ in a distance-regular graph. Actually, we have the following stronger result:

[...] d=0 is recursively defined as follows: [...] where [...] is its k-hop-neighbor array.

Proof. Let $R \in \mathbb{R}^{n \times n}$ be the RD matrix. Based on Theorem E.1, $R$ can be expressed as $R = \mathrm{diag}(M)\mathbf{1}\mathbf{1}^\top + \mathbf{1}\mathbf{1}^\top \mathrm{diag}(M) - 2M$, where $M = \left(L + \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)^{-1}$ and $L$ is the graph Laplacian matrix. Now let $R = [r_{\mathrm{dis}_G(u,v)}]_{u,v \in V}$ be the matrix with elements defined in (20). The key step is to prove that $2M = c\mathbf{1}\mathbf{1}^\top - R$ for some $c \in \mathbb{R}$. This will yield [...] and finish the proof.

We now prove [...] Combined with the fact that $L\mathbf{1} = 0$, we have [...] We have [...] Now consider the following three cases:
• $u = v$. In this case, $r_{\mathrm{dis}_G(u,v)} = 0$ and we have [...]
• [...], and by definition of intersection array we have [...] where in the second last step we use the recursive relation of $r_j$, and in the last step we use the fact that $k_j =$ [...]
• $u \neq v$ and $\mathrm{dis}_G(u, v) = D(G)$. This case is similar to the previous one. Denote $j = \mathrm{dis}_G(u, v)$, and [...] where we again use $k_j = \frac{k_{j-1} b_{j-1}}{c_j}$.

Combining the above three cases, we conclude that $LR = \frac{2}{n}\mathbf{1}\mathbf{1}^\top - 2I$, which finishes the proof.

We are now ready to prove the main result. Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$ be two distance-regular graphs. We first prove that if $\iota(G) = \iota(H)$, then RD-WL cannot distinguish the two graphs. This is a simple consequence of Theorem C.61. Combined with the fact that $\kappa$ [...] Similarly, after the $t$-th iteration we still have $\chi^t_G(u) = \chi^t_H(v)$ for all $u \in V_G$ and $v \in V_H$ and thus RD-WL cannot distinguish the two graphs.

It remains to prove that if $\iota(G) \neq \iota(H)$, then RD-WL can distinguish the two graphs. First observe that in Theorem C.61, $r_i < r_j$ holds for any $i < j$. Therefore, for any nodes [...] where $\{\!\{r\}\!\} \times k$ is a multiset containing $k$ repeated elements of value $r$. If $\iota(G) \neq \iota(H)$, then there exists a minimal index [...].
Therefore, (22) does not hold and $\chi^1_G(u) \neq \chi^1_H(v)$ for any $u \in V_G$ and $v \in V_H$; namely, RD-WL can distinguish the two graphs.

Proof of the third item of Theorem C.58. First, if $\iota(G) \neq \iota(H)$, then 2-FWL can distinguish graphs $G$ and $H$. This is simply due to the fact that 2-FWL is more powerful than RD-WL (Theorem 4.5). It remains to prove that if $\iota(G) = \iota(H)$, then 2-FWL cannot distinguish graphs $G$ and $H$.

Published as a conference paper at ICLR 2023

E.1 PREPROCESSING SHORTEST PATH DISTANCE

Shortest Path Distance can be easily calculated using the Floyd-Warshall algorithm (Floyd, 1962), which has a complexity of $\Theta(n^3)$. For sparse graphs typically encountered in practice (i.e. $m = o(n^2)$), a more efficient way is to run a breadth-first search from each node, which computes the distance from that node to all other nodes in the graph. The time complexity can then be improved to $\Theta(nm)$.
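The breadth-first-search preprocessing described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `all_pairs_spd` and the edge-list input format are assumptions made for this example. Unreachable pairs are assigned an infinite distance.

```python
from collections import deque


def all_pairs_spd(n, edges):
    """All-pairs shortest path distances of an unweighted, undirected graph.

    Runs one BFS per source node. Each BFS costs O(n + m), so the total
    cost is O(n(n + m)), which is Theta(nm) for connected graphs.
    Nodes are assumed to be labeled 0..n-1 (an assumption of this sketch).
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[s][w] == inf:  # first visit gives the shortest distance
                    dist[s][w] = dist[s][u] + 1
                    queue.append(w)
    return dist
```

For a path graph `all_pairs_spd(4, [(0, 1), (1, 2), (2, 3)])`, the entry for the two endpoints is 3.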

E.2 PREPROCESSING RESISTANCE DISTANCE

In this subsection, we first describe several important properties of Resistance Distance. Based on these properties, we give a simple yet efficient algorithm to calculate Resistance Distance.

Equivalence between Resistance Distance (RD) and Commute Time Distance (CTD). Chandra et al. (1996) established an important relationship between RD and CTD, by proving that $\mathrm{dis}^C_G(u, v) = 2m \, \mathrm{dis}^R_G(u, v)$ holds for any graph $G$ and any nodes $u, v \in V$. Here, the Commute Time Distance is defined as $\mathrm{dis}^C_G(u, v) := h_G(u, v) + h_G(v, u)$, where $h_G(u, v)$ is the average hitting time from $u$ to $v$ in a random walk. Concretely, $h_G(u, v)$ is equal to the average number of edges passed in a random walk when starting from $u$ and reaching $v$ for the first time. Mathematically, it satisfies the following recursive relation:

$$h_G(u, v) = \begin{cases} 0 & \text{if } u = v, \\ 1 + \frac{1}{\deg_G(u)} \sum_{w \in N_G(u)} h_G(w, v) & \text{if } u \neq v. \end{cases} \tag{25}$$

The above equation can be used to calculate CTD and thus RD, as we will show later.

Resistance Distance is a graph metric. We say a function $d_G : V \times V \to \mathbb{R}$ is a graph metric if it is non-negative, positive semidefinite, symmetric, and satisfies the triangle inequality. Let $G$ be a connected graph. Then Resistance Distance $\mathrm{dis}^R_G$ is a valid graph metric because: […] $\mathrm{dis}^R_G(u, v) + \mathrm{dis}^R_G(v, w) \geq \mathrm{dis}^R_G(u, w)$. This can be seen from the definition of CTD, since $\mathrm{dis}^C_G(u, v) + \mathrm{dis}^C_G(v, w)$ is equal to the average hitting time from $u$ to $w$ under the condition of passing node $v$, which is obviously larger than $\mathrm{dis}^R_G(u, w)$.

Comparing RD with SPD. It is easy to see that RD is never larger than SPD, i.e. $\mathrm{dis}^R_G(u, v) \leq \mathrm{dis}_G(u, v)$. This is because for any subgraph $G'$ of $G$ we have $\mathrm{dis}^R_G(u, v) \leq \mathrm{dis}^R_{G'}(u, v)$, and when $G'$ is chosen to contain only the edges that belong to the shortest path between $u$ and $v$, we have $\mathrm{dis}^R_{G'}(u, v) = \mathrm{dis}_G(u, v)$. Therefore, the range of RD is the same as that of SPD, i.e. $0 \leq \mathrm{dis}^R_G(u, v) \leq n - 1$. However, unlike SPD, which is an integer, RD can be a general rational number. RD can thus be seen as a more fine-grained distance metric than SPD. Nevertheless, RD is still discrete, and there are only finitely many possible values of $\mathrm{dis}^R_G(u, v)$ when $n$ is fixed.

Relationship to graph Laplacian. We have the following theorem:

Theorem E.1.
Let $G = (V, E)$ be a connected graph, $V = [n]$, and let $L \in \mathbb{S}^n$ be the graph Laplacian. Then $\mathrm{dis}^R_G(i, j) = M_{i,i} + M_{j,j} - 2M_{i,j}$, where $M \in \mathbb{S}^n$ is a symmetric matrix defined as $M = \left(L + \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)^{-1}$.

Proof. Define the probability matrix $P$ such that $P_{ij} = 1/\deg_G(i)$ if $\{i, j\} \in E$ and $P_{ij} = 0$ otherwise. Then for any $i \neq j$, (25) can be equivalently written as $h(i, j) = 1 + \sum_{k=1}^n P_{ik} h(k, j)$. Now define a matrix $\tilde{H}$ such that $\tilde{H}_{ij} = 1 + \sum_{k=1}^n P_{ik} \tilde{H}_{kj} - P_{ij} \tilde{H}_{jj}$; then $\tilde{H}_{ij} = h(i, j)$ for all $i \neq j$ (although $\tilde{H}_{ii} \neq 0 = h(i, i)$). $\tilde{H}$ can be equivalently written as

$$\tilde{H} = \mathbf{1}\mathbf{1}^\top + P\tilde{H} - P \, \mathrm{diag}(\tilde{H}), \tag{27}$$

where $\mathrm{diag}(\tilde{H})$ is the diagonal matrix with elements $\tilde{H}_{ii}$ for $i \in [n]$.

We first calculate $\mathrm{diag}(\tilde{H})$. Noting that $d^\top P = d^\top$, we have […] and thus […] (28).

Now define $H = \tilde{H} - \mathrm{diag}(\tilde{H})$; then $H_{ij} = h(i, j)$ for all $i, j \in [n]$. We will calculate $H$ in the following proof. We first write (27) equivalently as $H + \mathrm{diag}(\tilde{H}) = \mathbf{1}\mathbf{1}^\top + PH$. Then by multiplying by $D$, we have […] Using the fact that $D(I - P) = L$ and (28), we obtain […] Next, noting that $L\mathbf{1} = 0$, we have […] One important property is that the matrix $L + \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is invertible (see Gutman & Xiao (2004, Theorem 4) for a proof). Combining (30) and (31) we have […] By taking diagonal elements and noting that $\mathrm{diag}(H) = O$, we obtain […] Namely, […] Substituting (34) into (32) yields […] Therefore, […] which concludes the proof.

Computational Complexity. The graph Laplacian can be calculated in $O(n^2)$ time, and $M$ can be calculated by matrix inversion, which requires $O(n^3)$ time. Therefore, the overall computational complexity is $O(n^3)$ (or $O(n^{2.376})$ using advanced matrix multiplication algorithms).

For sparse graphs typically encountered in practice (i.e. $m = o(n^2)$), one may similarly ask whether a complexity that depends on $m$ can be achieved. We conjecture that it should be possible. Below, we give another algorithm to calculate $\left(L + \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)^{-1}$.
Note that the graph Laplacian $L$ can be equivalently written as $L = EE^\top$, where $E \in \mathbb{R}^{n \times m}$ is defined as […] where we denote […] Note that each $e_i$ is highly sparse, with only two non-zero elements.

We suspect that one can obtain an $O(nm)$ complexity using techniques similar to the Sherman-Morrison-Woodbury update. We leave it as an open problem.
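The $O(n^3)$ procedure implied by Theorem E.1 can be sketched as follows: build $L + \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, invert it, and read off $M_{i,i} + M_{j,j} - 2M_{i,j}$. This is an illustrative sketch only, not the authors' code. The function name `resistance_distance`, the edge-list input, and the use of exact rational arithmetic with plain Gauss-Jordan elimination are choices made for this example; a practical implementation would more likely use a floating-point linear algebra library. The graph is assumed to be connected and simple, with nodes labeled 0..n-1.

```python
from fractions import Fraction


def resistance_distance(n, edges):
    """Exact resistance distances of a connected, simple, undirected graph.

    Follows Theorem E.1: dis^R(i, j) = M_ii + M_jj - 2 M_ij with
    M = (L + (1/n) 11^T)^{-1}. Fractions keep the results exact, which
    reflects the fact that RD values are rational numbers.
    """
    # B = L + (1/n) 11^T, where L = (degree matrix) - (adjacency matrix).
    B = [[Fraction(1, n) for _ in range(n)] for _ in range(n)]
    for u, v in edges:
        B[u][u] += 1
        B[v][v] += 1
        B[u][v] -= 1
        B[v][u] -= 1

    # Gauss-Jordan elimination on the augmented matrix [B | I]: O(n^3) operations.
    aug = [row + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(B)]
    for col in range(n):
        # B is invertible for connected graphs, so a non-zero pivot always exists.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv_pivot = 1 / aug[col][col]
        aug[col] = [x * inv_pivot for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]

    M = [row[n:] for row in aug]
    return [[M[i][i] + M[j][j] - 2 * M[i][j] for j in range(n)] for i in range(n)]
```

On a triangle, every pair of distinct nodes has resistance distance 2/3 (one unit resistor in parallel with two in series), which is strictly smaller than the SPD of 1; on a tree, RD coincides with SPD.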

E.3 TRANSFORMER-BASED IMPLEMENTATION

Graphormer-GD. The model is built on the Graphormer (Ying et al., 2021a) model, which uses the Transformer (Vaswani et al., 2017) as the backbone network. A Transformer block consists of two layers: a self-attention layer followed by a feed-forward layer, with both layers having normalization (e.g., LayerNorm (Ba et al., 2016)) and skip connections (He et al., 2016). Denote $X^{(l)} \in \mathbb{R}^{n \times d}$ as the input to the $(l+1)$-th block and define $X^{(0)} = X$, where $n$ is the number of nodes and $d$ is the feature dimension. For an input $X^{(l)}$, the $(l+1)$-th block works as follows:

$$A^h(X) = \mathrm{softmax}\left(X W_Q^h (X W_K^h)^\top\right), \tag{38}$$

[…] (39), […] (40), where […] $H$ is the number of attention heads, $d_H$ is the dimension of each head, and $r$ is the dimension of the hidden layer. $A^h(X)$ is usually referred to as the attention matrix.

Note that the self-attention layer and the feed-forward layer introduced in (39) and (40) do not encode any structural information of the input graph. As stated in Section 4, we incorporate the distance information into the attention layers of our Graphormer-GD model. The calculation of the attention matrix in (38) is modified as:

$$A^h(X) = \phi_1^h(D) \odot \mathrm{softmax}\left(X W_Q^h (X W_K^h)^\top + \phi_2^h(D)\right), \tag{41}$$

where $D \in \mathbb{R}^{n \times n}$ is the distance matrix such that $D_{uv} = d_G(u, v)$, $\phi_1^h$ and $\phi_2^h$ are element-wise functions applied to $D$, and $\odot$ denotes element-wise multiplication. In this way, the graph structural information can be captured by our Graphormer-GD model.

As stated in Section 4, we mainly consider two distance metrics: Shortest Path Distance $\mathrm{dis}_G$ and Resistance Distance $\mathrm{dis}^R_G$. For SPD, we follow Ying et al. (2021a) and use their shortest path distance encoding. Formally, let $D^{\mathrm{SPD}}$ be the SPD matrix such that $D^{\mathrm{SPD}}_{uv} = \mathrm{dis}_G(u, v)$. The functions $\phi_1$ and $\phi_2$ can simply be parameterized by two learnable vectors $v_1$ and $v_2$, so that $\phi_1(D^{\mathrm{SPD}}_{uv})$ is the learnable scalar given by the entry of $v_1$ indexed by $D^{\mathrm{SPD}}_{uv}$ (and similarly for $\phi_2$). If two nodes $u$ and $v$ are not in the same connected component, i.e., $D^{\mathrm{SPD}}_{uv} = \infty$, a special learnable scalar is assigned.
For RD, we use the Gaussian Basis kernels (Scholkopf et al., 1997) to encode the value, since it may not be an integer. The encoded values from different Gaussian Basis kernels are concatenated and further transformed by a two-layer MLP. We integrate both the SPD encoding and the RD encoding to obtain $\phi_1^{l,h}(D)$ and $\phi_2^{l,h}(D)$. Note that these two matrices are parameterized by different sets of parameters. Following Ying et al. (2021a), we also incorporate the degree of each node in the input layer using a degree embedding.

Relationship between Graphormer-GD and GD-WL. As stated in Section 4, Graphormer-GD is at most as powerful as GD-WL. We will prove that it is actually as powerful as GD-WL under mild assumptions. We first restate Lemma 5 from Xu et al. (2019), which shows that sum aggregators can represent injective functions over multisets.

Lemma E.2. (Xu et al., 2019, Lemma 5) Assume the set $\mathcal{X}$ is countable. Then there exists a function $f : \mathcal{X} \to \mathbb{R}^n$ so that the function $h(\hat{\mathcal{X}}) := \sum_{x \in \hat{\mathcal{X}}} f(x)$ is unique for each multiset $\hat{\mathcal{X}} \subset \mathcal{X}$ of bounded size. Moreover, any multiset function $g$ can be decomposed as $g(\hat{\mathcal{X}}) = \phi\left(\sum_{x \in \hat{\mathcal{X}}} f(x)\right)$ for some function $\phi$.

We are now ready to present the detailed proof of Theorem 4.4, which is restated as follows:

Theorem E.3. Graphormer-GD is at most as powerful as GD-WL. Moreover, when choosing proper functions $\phi_1^h$ and $\phi_2^h$ and using a sufficiently large number of heads and layers, Graphormer-GD is as powerful as GD-WL.

Proof. Consider all graphs with no more than $n$ nodes. The number of possible values of both SPD and RD is thus finite and depends on $n$ (see Appendix E.2).
Let […] Then the GD-WL aggregation in (3) can be reformulated as follows: […] (42)

Intuitively, this reformulation indicates that in each iteration, GD-WL updates the color of node $v$ by hashing a tuple of color multisets, where each multiset is obtained by injectively aggregating the colors of all nodes $u \in V$ with a certain distance configuration to node $v$. Therefore, to express GD-WL, it suffices for the model to update the representation of each node following the above procedure.

We show that Graphormer-GD can achieve this goal. Recall that for the $h$-th head, the attention matrix is defined as $A^h(X) = \phi_1^h(D) \odot \mathrm{softmax}\left(X W_Q^h (X W_K^h)^\top + \phi_2^h(D)\right)$. For the function $\phi_1^h$, we define it to be the indicator function $\phi_1^h(d) := \mathbb{I}(d = d_{G,h})$. For the function $\phi_2^h$, we set it to be constant irrespective of the matrix $D$. Let $W_Q^h, W_K^h$ be zero matrices. It can be seen that the term $\mathrm{softmax}\left(X W_Q^h (X W_K^h)^\top + \phi_2^h(D)\right)$ reduces to $\frac{1}{|V|}\mathbf{1}\mathbf{1}^\top$, and thus for each node $v$, the output in the $h$-th attention head is the sum aggregation of representations of nodes $u$ satisfying $d_G(u, v) = d_{G,h}$. Note that the constant $\frac{1}{|V|}$ can be extracted with an additional head and be concatenated to the node representations. Moreover, the node representation $X$ is processed via the feed-forward network in the previous layer (see (40)). Thus, we can invoke Lemma E.2 and prove that the $h$-th attention head in Graphormer-GD can implement an injective aggregating function for $\{\{\chi^{t-1}_G(u) : u \in V, d_G(u, v) = d_{G,h}\}\}$. Therefore, by using a sufficiently large number of attention heads, the multiset representations $\chi^{t,k}_G$, $k \in [|D_n|]$, can be injectively obtained.

Finally, the multi-head attention defined in (39) is equivalent to first concatenating the output of each attention head and then using a linear mapping to transform the results. Thus, the concatenation is clearly an injective mapping of the tuple of multisets $\left(\chi^{t,1}_G, \chi^{t,2}_G, \ldots, \chi^{t,|D_n|}_G\right)$. When the linear mapping has irrational weights, the projection will also be injective.
Therefore, one attention layer followed by the feed-forward network can implement the aggregation formula (42). Thus, our Graphormer-GD is able to simulate GD-WL when using a sufficient number of layers, which concludes the proof.
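The construction used in the proof can be checked numerically with a small sketch of a single distance-aware attention head. This is an illustration under simplifying assumptions, not the authors' implementation: the names `gd_attention_head` and `_matmul` are hypothetical, the value and output projections are omitted (treated as identity), and matrices are plain nested lists. With zero query and key weights, a constant $\phi_2$, and an indicator $\phi_1$, the head returns $\frac{1}{|V|}$ times the sum of the features of all nodes at the selected distance.

```python
import math


def _matmul(A, B):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gd_attention_head(X, WQ, WK, D, phi1, phi2):
    """One head computing (phi1(D) * softmax(X WQ (X WK)^T + phi2(D))) X.

    phi1 and phi2 are applied element-wise to the distance matrix D;
    the softmax is taken over each row. Value/output projections are
    omitted in this sketch.
    """
    n, d = len(X), len(X[0])
    Q, K = _matmul(X, WQ), _matmul(X, WK)
    out = []
    for v in range(n):
        logits = [
            sum(q * k for q, k in zip(Q[v], K[u])) + phi2(D[v][u]) for u in range(n)
        ]
        shift = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - shift) for z in logits]
        total = sum(exps)
        attn = [phi1(D[v][u]) * e / total for u, e in enumerate(exps)]
        out.append([sum(attn[u] * X[u][c] for u in range(n)) for c in range(d)])
    return out


# Path graph 0 - 1 - 2 with SPD as the distance and one-hot node colors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
zeros = [[0.0], [0.0]]  # zero W_Q and W_K, as in the proof
head_d1 = gd_attention_head(
    X, zeros, zeros, D,
    phi1=lambda dist: 1.0 if dist == 1 else 0.0,  # indicator of distance 1
    phi2=lambda dist: 0.0,                        # constant bias
)
```

Here `head_d1[1]` equals one third of the sum of the features of nodes 0 and 2, i.e. the (scaled) sum aggregation over the distance-1 neighbors of node 1.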

F EXPERIMENTAL DETAILS

F.1 SYNTHETIC TASKS

Data Generation and Evaluation Metrics. We carefully design several graph generators to examine the expressive power of compared models on graph biconnectivity tasks. First, we include the two families of graphs presented in Examples C.9 and C.10 (Appendix C.2). We further introduce a rich family of regular graphs with both cut vertices and cut edges. Each graph in this family is constructed by first randomly generating several connected components and then linking them via cut edges while simultaneously ensuring that each node has the same degree. Combining the above three families of hard graphs, we generate data instances online to train the compared models. For each data instance, the total number of nodes is upper bounded by 120. We use graph-level accuracy as the metric. That is, for each graph, the prediction of the model is considered correct only when all and only the cut vertices/edges are correctly identified. We repeat the experiments 5 times with different seeds and report the average accuracy.

Baselines. We choose several baselines whose expressive power is at different levels. First, we consider classic MPNNs including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). These GNNs are proven to be at most as powerful as the 1-WL test (Xu et al., 2019). We also compare the Graph Substructure Network (Bouritsas et al., 2022), which extracts graph substructures to improve the expressive power of MPNNs. The substructure counts are incorporated into node features or the aggregation procedure. Lastly, we also compare the Graphormer model (Ying et al., 2021a), which achieved impressive performance in several world competitions (Ying et al., 2021b; Shi et al., 2022; Luo et al., 2022a).

Settings. We employ a 6-layer Graphormer-GD model. The dimension of hidden layers and feed-forward layers is set to 768. The number of Gaussian Basis kernels is set to 128.
The number of attention heads is set to 64. The batch size is set to 32. We use AdamW (Kingma & Ba, 2014) as the optimizer and set its hyperparameter ϵ to 1e-8 and (β 1 , β 2 ) to (0.9, 0.999). The peak learning rate is set to 9e-5. The model is trained for 100k steps with a 6K-step warm-up stage. After the warm-up stage, the learning rate decays linearly to 0. All models are trained on 1 NVIDIA Tesla V100 GPU.
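The learning-rate schedule described in the settings above (peak 9e-5, 6K warm-up steps, linear decay to 0 over 100k total steps) can be sketched as a simple function. The function name `learning_rate` is hypothetical, and the shape of the warm-up is an assumption: the text only states that there is a warm-up stage, so a linear ramp from 0 is assumed here.

```python
def learning_rate(step, peak_lr=9e-5, warmup_steps=6_000, total_steps=100_000):
    """Learning rate at a given training step.

    Assumed linear warm-up from 0 to peak_lr over warmup_steps, followed
    by the linear decay to 0 at total_steps stated in the text.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 6,000 and reaches 0 at step 100,000.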

F.2 REAL-WORLD TASKS

We conduct experiments on the popular benchmark dataset ZINC from Benchmarking-GNNs (Dwivedi et al., 2020). It is a real-world dataset that consists of 250K molecular graphs. The task is to predict the constrained solubility of a molecule, which is an important chemical property for drug discovery. We train our models on both ZINC-Full and ZINC-Subset (12K selected graphs following Dwivedi et al. (2020)).

Baselines. For a fair comparison, we set the parameter budget of the model to be around 500K following Dwivedi et al. (2020). We compare our Graphormer-GD with several competitive baselines, which mainly fall into five categories: Message Passing Neural Networks (MPNNs), High-order GNNs, Substructure-based GNNs, Subgraph GNNs, and Graph Transformers.

First, we compare several classic MPNNs including Graph Convolution Network (GCN) (Kipf & Welling, 2017), Graph Isomorphism Network (GIN) (Xu et al., 2019), Graph Attention Network (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017) and MPNN(sum) (Gilmer et al., 2017). Besides, we also include several popularly used models. Mixture Model Network (MoNet) (Monti et al., 2017) introduces a general architecture to learn on graphs and manifolds using the Bayesian Gaussian Mixture Model. Gated Graph ConvNet (GatedGCN) considers residual connections, batch normalization, and edge gates to design an anisotropic variant of GCN. We compare the GatedGCN with positional encodings. Principal Neighborhood Aggregation (PNA) (Corso et al., 2020) combines multiple aggregators with degree-scalers.

Second, we compare two higher-order Graph Neural Networks: RingGNN (Chen et al., 2019) and 3WLGNN (Maron et al., 2019a) following Dwivedi et al. (2020). RingGNN extends the family of order-2 Graph G-invariant Networks without going into higher order tensors and is able to distinguish between non-isomorphic regular graphs where order-2 G-invariant networks provably fail.
3WLGNN uses rank-2 tensors to build the neural network and is proved to be equivalent to the 3-WL test on graph isomorphism problems.

Third, we compare two representative types of substructure-based GNNs. The Graph Substructure Network (Bouritsas et al., 2022) extracts graph substructures to improve the expressive power of MPNNs. The substructure counts are incorporated into node features or the aggregation procedure. We also compare the Cellular Isomorphism Network (Bodnar et al., 2021a), which extends theo- […]

After the warm-up stage, the learning rate decays linearly to zero. The dropout ratio is selected from [0.0, 0.1, 0.5]. All models are trained on 1 NVIDIA Tesla V100 GPU.

The results are presented in Table 5. We can see that our model outperforms these baselines on both datasets, with a slightly larger variance due to the small scale of the datasets.

[…] (Li et al., 2020)             73.28±2.47    56.98±2.79
DE-GNN-LP (Li et al., 2020)       75.10±3.80    58.41±3.20
DEA-GNN-SPD (Li et al., 2020)     75.37±3.25    57.99±2.39
Graphormer-GD (ours)              77.69±6.39*   59.23±4.05*

F.4 EFFICIENCY EVALUATION

We further conduct experiments to measure the efficiency of our approach by profiling the time cost per training epoch. We compare the efficiency of Graphormer-GD with other baselines, along with the number of model parameters, on ZINC-Subset from Dwivedi et al. (2020). The number of layers and the hidden dimension of our Graphormer-GD are set to 12 and 80 respectively. The number of attention heads is set to 8. The batch size is set to 128, which is the same as the settings of all baselines. We run profiling of all models on a 16GB NVIDIA Tesla V100 GPU. For all baselines, we evaluate the time costs based on the publicly available codes of Dwivedi et al. (2020) and Ying et al. (2021a). The results are presented in Table 6.

From Table 6, we can draw the following conclusions. Firstly, the efficiency of Graphormer-GD is of the same order of magnitude as classic MPNNs, despite the fact that the computation complexity of Graphormer-GD is higher than that of MPNNs (i.e., $\Theta(n^2)$ vs. $\Theta(n + m)$ for a graph with $n$ nodes and $m$ edges). This may be due to the high parallelizability of the Transformer layers. Secondly, Graphormer-GD is much more efficient than higher-order GNNs, as reflected by the computation complexity in Table 1. Finally, Graphormer-GD is almost as efficient as the original Graphormer, since the newly introduced module to encode the Resistance Distance takes negligible additional time compared to that of the whole architecture.

Model                                    # Params    Time per epoch
GraphSAGE (Hamilton et al., 2017)        505,341     6.02
MoNet (Monti et al., 2017)               504,013     7.19
GIN (Xu et al., 2019)                    509,549     8.05
GAT (Veličković et al., 2018)            531,345     8.28
GatedGCN-PE (Bresson & Laurent, 2017)    505,011     10.74
RingGNN (Chen et al., 2019)              527,283     178.03
3WLGNN (Maron et al., 2019a)             507,603     179.35
Graphormer (Ying et al., 2021a)          489,321     12.26
Graphormer-GD (ours)                     502,793     12.52

