CYCLE TO CLIQUE (CY2C) GRAPH NEURAL NETWORK: A SIGHT TO SEE BEYOND NEIGHBORHOOD AGGREGATION

Abstract

Graph neural networks have been successfully adapted for learning vector representations of graphs through various neighborhood aggregation schemes. Previous research suggests, however, that they are limited in their ability to incorporate key non-Euclidean topological properties of graphs. This paper mathematically characterizes the capability of graph neural networks to classify isomorphism classes of graphs with continuous node attributes up to their local topological properties. In light of these observations, we construct the Cycle to Clique graph neural network, a novel yet simple algorithm which topologically enriches the input data of conventional graph neural networks while preserving their architectural components. This method theoretically outperforms conventional graph neural networks in classifying isomorphism classes of graphs while ensuring comparable time complexity in representing random graphs. Empirical results further show that the algorithm produces comparable or improved results in classifying benchmark graph data sets compared to contemporary variants of graph neural networks.

1. INTRODUCTION

Graph neural networks (GNNs) are prominent deep learning methods for learning vector representations of graphs. Research on GNNs explores their empirical capabilities and effectiveness in classifying node labels, classifying graphs, and predicting links by modifying the message passing layers or pooling methods. These experiments show that GNNs can achieve state-of-the-art performance on these tasks and match the performance of the Weisfeiler-Lehman (WL) isomorphism test in representing graphs with discrete node labels Xu et al. (2018). However, they have limited capabilities in incorporating global topological properties of graphs, and thereby exhibit restricted discriminative power in distinguishing isomorphism classes of graphs Bouritsas et al. (2022); Rieck et al. (2019). To overcome these limitations, this paper presents a mathematical framework that examines which topological properties of graphs with continuous node attributes conventional GNNs can encapsulate. Inspired by the works of Krebs & Verbitsky (2015) and Xu et al. (2018), we use the theory of covering spaces to prove that, under some constraints, a pair of graphs with continuous node attributes is indistinguishable by GNNs if and only if there exist isomorphisms among the collection of their finite depth unfolding trees that induce equality of induced node attributes. This gives a universal formulation which pinpoints the discriminative power of a wide range of variants of GNNs and the topological enrichments these models endow over the graph data set. Such approaches include enriching node attributes, using persistent homological techniques, gluing high dimensional complexes, and keeping track of recurring subgraph structures. Among these candidates, we focus on the procedure of transforming the cycle bases of graphs to complete subgraphs or cliques.
This operation can be easily implemented by adding suitable edges to transform a cyclic subgraph into a clique and masking any other edges not included in the subgraph. The adjacency matrices obtained from the induced cliques, denoted as clique adjacency matrices, allow GNNs to effectively process the bases of cycles, which are topological properties equivalent to the first homological invariants of graphs Paton (1969). In particular, the operation can be thought of as a straightforward pre-processing procedure independent of training dynamical filtration functions or attaching higher dimensional cells Horn et al. (2021); Bouritsas et al. (2022); Bodnar et al. (2021b; a), which are previously studied methods for encapsulating the cyclic structures of graphs. We thus propose the Cycle-to-Clique Graph Neural Network (Cy2C-GNN), a graph neural network whose message passing layers compute two types of hidden node attributes, obtained respectively from the adjacency matrix and the induced clique adjacency matrix of a graph. We confirm that Cy2C-GNN effectively processes cycle bases of graphs, thus surpassing the strengths of conventional GNNs. Experimental results support that Cy2C-GNN ensures comparable performance to GNNs that utilize persistent homological techniques with both fixed and arbitrary filtration functions. Furthermore, the simplicity of the architecture guarantees equivalent computational complexity to conventional GNNs in representing random graphs and the effective utilization of trainable parameters. Our main contributions can therefore be summarized as follows:

1. Theoretical Foundation: We use the theory of covering spaces to prove that conventional GNNs fail to effectively represent cyclic structures of graphs with continuous node attributes. (Theorem 3.3, Section 3)

2. A Simple yet Novel Network: We propose a novel algorithm called "Cy2C-GNN" which overcomes the theoretical limitations by enriching the topological properties of the input data admitted by GNNs with clique adjacency matrices, and which does not require training filtration functions or attaching high-dimensional complexes. (Theorem 4.3, Section 4)

3. Efficient Enhancements: The proposed algorithm effectively incorporates cyclic structures of graph data sets while ensuring equivalent computational complexity to conventional GNNs for representing random graphs and adaptability to variants of GNNs. (Section 5)

2. RELATED WORKS

Graph Neural Networks (GNNs) We recall the construction of GNNs as suggested in Xu et al. (2018). Denote by $\mathrm{GNN}_l$ the conventional GNN composed of $l$ neighborhood aggregating layers. The $m$-th layer $H^{(m)}$ of the network constructs hidden node attributes of dimension $k_m$, denoted as $h_v^{(m)}$, using the following composition of functions:

$$h_v^{(m)} := \mathrm{COMBINE}^{(m)}\Big(h_v^{(m-1)},\ \mathrm{AGGREGATE}_v^{(m)}\big(\{\!\{ h_u^{(m-1)} \mid u \in N(v) \}\!\}\big)\Big), \qquad h_v^{(0)} := X_v \tag{1}$$

Here, $X_v$ is the initial node attribute at $v$, $N(v)$ is the set of nodes adjacent to $v \in V(G)$, $\mathrm{AGGREGATE}_v^{(m)}$ is a function which aggregates features of nodes adjacent to $v$, and $\mathrm{COMBINE}^{(m)}$ is a function which combines features of the node $v$ with those of nodes adjacent to $v$. Denote by $H^{(l)}$ the final layer of the network. The $K$-dimensional vector representation of $G$, denoted as $h_G$, is given by

$$h_G := \mathrm{READOUT}^{(l)}\big(\{\!\{ h_v^{(l)} \mid v \in V(G) \}\!\}\big)$$

where $\mathrm{READOUT}^{(l)}$ is the graph readout function of node features updated from $l$ hidden layers. We refer readers to Appendix A.2 for a rigorous definition of graph neural networks. A wide range of GNNs and graph representation techniques can be formulated in terms of the construction outlined above. For example, the WL test is a classical technique which consists of combining adjacent discrete node labels, substituting newly obtained labels, and constructing a complete histogram of updated labels. The test is equivalent to conventional GNNs whose aggregation and combination functions correspond to sums of multisets, and whose graph readout function corresponds to a hashing function of discrete node labels. Other well-known networks whose architecture can be formulated using conventional GNNs from Section 2 include graph convolutional networks (GCN), graph attention networks (GAT), and graph isomorphism networks (GIN) Kipf & Welling (2016); Veličković et al. (2017); Xu et al. (2018).
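The construction in equation (1) can be illustrated with a minimal sketch. It assumes that both AGGREGATE and COMBINE are plain coordinate-wise sums and that READOUT is a sum over nodes; these choices, and the names `aggregate`, `combine`, `gnn`, and `readout`, are illustrative assumptions with no trainable weights, and they are not injective on continuous attributes in general.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Graph = Dict[int, List[int]]  # node -> adjacent nodes N(v)


def aggregate(messages: List[Vector], dim: int) -> Vector:
    """AGGREGATE (assumed): coordinate-wise sum over the multiset of neighbor attributes."""
    return tuple(sum(m[i] for m in messages) for i in range(dim))


def combine(own: Vector, aggregated: Vector) -> Vector:
    """COMBINE (assumed): add the node's own attribute to the aggregated message."""
    return tuple(a + b for a, b in zip(own, aggregated))


def gnn(graph: Graph, x: Dict[int, Vector], num_layers: int) -> Dict[int, Vector]:
    """Apply num_layers rounds of equation (1), starting from h^(0) = X."""
    dim = len(next(iter(x.values())))
    h = dict(x)
    for _ in range(num_layers):
        h = {v: combine(h[v], aggregate([h[u] for u in graph[v]], dim)) for v in graph}
    return h


def readout(h: Dict[int, Vector]) -> Vector:
    """READOUT (assumed): coordinate-wise sum over all node attributes."""
    dim = len(next(iter(h.values())))
    return tuple(sum(vec[i] for vec in h.values()) for i in range(dim))


# Path graph 0 - 1 - 2 with constant one-dimensional attributes.
path = {0: [1], 1: [0, 2], 2: [1]}
x = {v: (1.0,) for v in path}
h_G = readout(gnn(path, x, num_layers=1))  # (7.0,): node values 2, 3, 2
```

Swapping the sums for learned functions (for example an MLP inside `combine`) gives layers in the style of GIN, while normalized or attention-weighted sums in `aggregate` give layers in the style of GCN or GAT.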
Covering spaces A number of studies carefully analyzed the strength of graph neural networks in distinguishing isomorphism classes of graphs with discrete node labels. The work by Krebs & Verbitsky (2015) shows that the Weisfeiler-Lehman test can distinguish a pair of graphs with constant node labels up to isomorphism of their universal covers, i.e. isomorphism of collections of finite depth unfolding trees. We refer to Appendix A.1 for a rigorous treatment of the definition of covering spaces of graphs Hatcher (2002); Krebs & Verbitsky (2015).

Definition 2.1. Let $G := (V, E)$ be a directed graph. For each node $v \in V(G)$, the depth-1 unfolding tree rooted at $v$, denoted as $T_v^1$, is a subtree of $G$ whose set of nodes consists of the distinguished node $v$ itself and the nodes $w$ such that there exists a directed edge from $v$ to $w$. The set of edges of $T_v^1$ consists of the directed edges from $v$. (See Definition A.11 for a rigorous definition.) For any positive integer $k$, the depth-$k$ unfolding tree rooted at $v$, denoted as $T_v^k$, is a subtree of $G$ whose set of nodes consists of the distinguished node $v$ itself and the nodes $w$ that can be reached from $v$ by at most $k$ consecutive directed edges. The set of edges of $T_v^k$ is the union of all sequences of at most $k$ consecutive directed edges from $v$. (For any undirected graph, the finite depth unfolding tree rooted at $v$ is defined in an analogous manner, though the set of edges consists of undirected edges instead of directed edges.)

The collection of all unfolding trees of $G$ of arbitrary depth can be represented as a single graph, called the universal cover of $G$.

Definition 2.2. Let $G := (V, E)$ be a connected graph. The universal cover of $G$ is a connected tree $\tilde{G}$ (possibly infinite) with the projection map $\pi_G : \tilde{G} \to G$ that maps any depth-1 unfolding tree in $\tilde{G}$ isomorphically to some depth-1 unfolding tree in $G$. (See Definition A.12 for more details.)
For graphs with several connected components, their universal covers are disjoint unions of connected trees, each component of which satisfies the local isomorphism condition from Definition 2.2. We recall that Xu et al. proved that GNNs with injective aggregation, combination, and graph readout functions share equivalent strengths with Weisfeiler-Lehman test in distinguishing non-isomorphism classes of graphs with discrete node labels Xu et al. (2018) . Combined with Krebs and Verbitsky's result, we obtain that such GNNs can distinguish pairs of graphs with constant node labels up to isomorphism classes of their universal covers, or finite depth unfolding trees. Improvements Recent studies focused on excavating novel techniques which may outperform or optimize GNNs in analyzing graph data sets, such as classifying node attributes, classifying graph datasets, and predicting links among the set of nodes. Graph kernels measure similarities among graphs by utilizing inner products between their representations, a form of an embedding to a reproducing kernel Hilbert space. Borgwardt et al. (2020) ; Kashima et al. (2003) ; Shervashidze et al. (2011) ; Vishwanathan et al. (2010) Persistent homological techniques and topological data analytic tools have garnered attention as key sources of encapsulating global topological invariants of graphs, such as counting the number of connected components or cycles present in given graph datasets. Several studies focused on incorporating these topological tools with graph kernels by using pre-defined filtration functions Rieck et al. (2019) , or with GNNs by constructing task-specific filtration functions Hofer et al. (2020) ; Horn et al. (2021) . Keeping track of topological substructures of graphs is also shown to be effective in enriching the quality of vector representations obtained from GNNs Rieck et al. (2017) ; Sizemore et al. (2017) . 
Other approaches focus on proposing novel message passing schemes as measures to address the limited performance of conventional GNNs Bodnar et al. (2021b) ; Bouritsas et al. (2022) .

3. GNNS AND UNIVERSAL COVERS

In this section, we utilize the theory of covering spaces to assess the quality of representations conventional GNNs construct from collections of graphs with continuous node attributes.

Definition 3.1. Let $G := (V, E)$ be a graph. Denote by $G' := (V, E')$ the graph where every node has a self-loop. Denote by $G'' := (V, E'')$ the directed graph with a projection map $p : G'' \to G'$ such that each undirected edge of $G'$ from $v_1$ to $v_2$ is constructed as follows: if $v_1 \neq v_2$, the undirected edge corresponds to two directed edges, one from $v_1$ to $v_2$ and the other from $v_2$ to $v_1$; otherwise, the edge corresponds to a single directed self-loop from $v_1$ to itself.

Figure 1 illustrates the construction of $G'$ and $G''$. Recall that the graph $G$ is endowed with continuous attributes $f_G : V(G) \to \mathbb{R}^k$. Respecting the configuration of these attributes, we can endow new node attributes to $G''$ and the universal cover $\widetilde{G''}$. We call these new assignments of node attributes to the universal cover $\widetilde{G''}$ the pullback of node attributes.

Definition 3.2. Let $f_G : V(G) \to \mathbb{R}^k$ be the function of node attributes over the graph $G$. The pullback of the node attributes to the associated universal cover $\widetilde{G''}$ is the composition of functions $f_{G''} \circ \pi_{G''} : V(\widetilde{G''}) \to \mathbb{R}^k$, where $f_{G''} : V(G'') \to \mathbb{R}^k$ is the node attribute function on $G''$ obtained by using the identical node attributes on $G$. (See Definition A.10 for a rigorous formulation.)

Inspired by the effectiveness of Weisfeiler-Lehman isomorphism tests Krebs & Verbitsky (2015) and GNNs Xu et al. (2018), we prove that GNNs with injective message passing layers represent a pair of graphs with continuous node attributes as identical vectors if and only if there exists an isomorphism between their associated universal covers under which the node attributes, endowed in a manner that respects the node attributes of the original graphs, are identical, i.e. there exist isomorphisms among the collection of their finite depth unfolding trees that induce equality of the pullback of node attributes. We defer the proof of the theorem to Appendices A.3 and A.4.

Theorem 3.3. Let $\mathcal{G}$ be a collection of finite undirected connected graphs with continuous node attributes. Let $G, H \in \mathcal{G}$ be any two connected graphs with the same number of nodes. Denote by $f_G : V(G) \to \mathbb{R}^k$ and $f_H : V(H) \to \mathbb{R}^k$ an arbitrary choice of continuous node attributes. Let $d_G, d_H$ be the diameters of the graphs $G$ and $H$, i.e. the maximum of the shortest length of paths between any two nodes (see Appendix A.4 for the rigorous definition). Suppose a graph neural network $\mathrm{GNN}_l$ with $l$ layers satisfies the following three constraints: (1) for every $m$ such that $1 \leq m \leq l$ and for each $v \in V(G)$, the functions $\mathrm{AGGREGATE}_v^{(m)}$ and $\mathrm{COMBINE}^{(m)}$ are injective; (2) the graph readout function READOUT is injective; and (3) $l \geq 2 \max(d_G, d_H)$. Then $\mathrm{GNN}_l$ maps the pair of graphs $G, H \in \mathcal{G}$ to identical vector representations if and only if there exists an isomorphism of universal covers $\varphi : \widetilde{G''} \to \widetilde{H''}$ under which the node attributes respecting the node attributes over $G$ and $H$ are identical, i.e.

$$h_G = h_H \iff \exists \text{ isomorphism } \varphi : \widetilde{G''} \to \widetilde{H''} \ \text{ s.t. } \ f_{G''} \circ \pi_{G''} = f_{H''} \circ \pi_{H''} \circ \varphi$$

Remark 3.4. The above theorem outlines the maximal extent of GNNs in detecting non-isomorphism classes of graphs, a generalization of results proved in Krebs & Verbitsky (2015), Xu et al. (2018), and a contemporary result from Bamberger (2022). It also shows that conventional GNNs whose number of layers is at least twice the maximum diameter of a graph data set are sufficient to fully distinguish isomorphism classes of graphs up to their universal covers, which to the best of our knowledge strengthens the results from previous research that focused on analyzing the performance of GNNs with sufficiently large numbers of layers.
Meanwhile, universal covers of graphs are infinite trees that do not contain any cycles, see Chapter 1.3 of Hatcher (2002) for example. Thus, conventional GNNs, regardless of the injectivities of the three functions, have limited capability in incorporating homological invariants of graphs, such as cyclic subgraph structures of a finite graph.
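The limitation described above can be illustrated with a small sketch of WL color refinement on graphs with constant node labels. The example compares a 6-cycle with a disjoint union of two 3-cycles: every node has degree 2, so every finite depth unfolding tree is a path in both graphs and the refinement cannot separate them. The function names `cycle_graph` and `wl_histogram` and the choice of nested tuples as colors are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List

Graph = Dict[int, List[int]]  # node -> adjacent nodes


def cycle_graph(n: int, offset: int = 0) -> Graph:
    """Cycle on n nodes labeled offset, ..., offset + n - 1."""
    return {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n] for i in range(n)}


def wl_histogram(graph: Graph, rounds: int) -> Counter:
    """Histogram of WL colors after the given number of refinement rounds.

    Colors are nested tuples (own color, sorted neighbor colors), so they are
    directly comparable across different graphs without a shared hash table.
    """
    colors = {v: () for v in graph}  # constant initial label
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in graph[v])))
            for v in graph
        }
    return Counter(colors.values())


hexagon = cycle_graph(6)
two_triangles = {**cycle_graph(3), **cycle_graph(3, offset=3)}
path6 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

same = wl_histogram(hexagon, 6) == wl_histogram(two_triangles, 6)  # True
differs = wl_histogram(hexagon, 6) != wl_histogram(path6, 6)       # True
```

The path on six nodes is separated from the hexagon because its end nodes have degree 1, whereas the two cyclic graphs differ only in the lengths of their cycles, which unfolding trees do not record.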

4. CYCLE-TO-CLIQUE GRAPH NEURAL NETWORKS (CY2C-GNN)

Motivation As shown in Theorem 3.3, GNNs can distinguish a collection of finite graphs up to isomorphism classes of their universal covers and equivalences of pullbacks of node attributes, but may fail to distinguish a collection of graphs with isomorphic universal covers whose subgraphs are comprised of cyclic graphs with different number of nodes, see Figure 1 for instance. We note that Theorem 3.3 proves a rigorous mathematical equivalence relation between the conditions conventional GNNs satisfy and the nature of isomorphism classes of graphs conventional GNNs can distinguish. Thus, any novel GNN model which twists one of the conditions from Theorem 3.3 can be considered as a candidate model for outperforming conventional GNNs. Using Theorem 3.3, we deduce four possible approaches which may enrich the input data structure of graphs the GNN may process to distinguish non-isomorphic classes of graphs (2019) . These diagrams allow one to keep track of variations in the number of components and cycles throughout the filtration of graph dataset induced from the choice of the height functions. (3) Topological Enrichments: One can also enrich topological structures of graphs by attaching simplicial and cellular complexes, or incorporating subgraph structures Bodnar et al. (2021b; a) ; Bouritsas et al. (2022) . This allows GNNs to capture topological substructures of graphs, such as cliques or cyclic subgraphs. (4) Non-isomorphic universal covers: We may construct a collection of new graphs with non-isomorphic universal covers induced from cyclic subgraphs. We can thus allow conventional GNNs to represent a collection of cyclic subgraphs as distinct vectors without significantly altering their architectural designs. The last approach hints a natural procedure orthogonal to the first three previously studied approaches: Substitute cyclic graphs with complete graphs of the same number of nodes. 
This can be rigorously formulated as in the following lemma, whose proof is in Lemma A.30. We thus devise a GNN which admits both the adjacency matrix of a graph $G$ and the adjacency matrix of unions of cliques induced from substituting cyclic subgraphs with complete subgraphs.

Definition 4.2 (Clique Adjacency Matrix). Let $G := (V, E)$ be an undirected graph. Fix the cycle basis $B_G$ of $G$, the set of cyclic subgraphs of $G$ which forms the basis of the cycle space (or the first homology group) of $G$. The clique adjacency matrix of $G$, denoted as $A^C$, is the adjacency matrix of the union of $\#B_G$ complete subgraphs, each obtained from adding all possible edges among the set of nodes of each basis element $B \in B_G$. Note that such a $B$ is a cyclic subgraph of $G$. Explicitly, the matrix $A^C := \{a^C_{u,v}\}_{u,v \in V(G)}$ is given by

$$a^C_{u,v} := \begin{cases} 1 & \text{if } \exists\, B \in B_G \text{ cyclic s.t. } u, v \in V(B) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Given $3 \leq c_1 \leq c_2 < \infty$, one may also define the bounded clique adjacency matrix $A^C|_{[c_1,c_2]} := \{a^C_{u,v}|_{[c_1,c_2]}\}_{u,v \in V(G)}$, which only substitutes cycles of size between $c_1$ and $c_2$ with cliques:

$$a^C_{u,v}|_{[c_1,c_2]} := \begin{cases} 1 & \text{if } \exists\, B \in B_G \text{ cyclic s.t. } u, v \in V(B),\ c_1 \leq |V(B)| \leq c_2 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

For each node $v \in V(G)$, we denote by $C(v)$ the set of nodes $w \in V(G)$ such that there exists a cyclic subgraph $B \in B_G$ containing both $v$ and $w$:

$$C(v) := \{w \in V(G) \mid \exists\, B \in B_G \text{ s.t. } v, w \in V(B)\} \tag{6}$$

Computing the set $C(v)$ for each node $v \in V(G)$ requires the basis of cycles of the graph $G$. Here, the cycle basis, denoted as $B_G$, refers to the minimal set of cyclic subgraphs of $G$ whose combinations generate all possible cyclic subgraphs.
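Definition 4.2 can be sketched in a few lines. The sketch below assumes a fundamental cycle basis read off a breadth-first spanning forest (one cycle per non-tree edge), which is a stand-in chosen for brevity and need not coincide with the basis produced by any particular published algorithm. The function names and the parameters `c_min` and `c_max` are hypothetical, and the diagonal is left at zero, following the description of $A^C$ as the adjacency matrix of a union of complete subgraphs.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional

Graph = Dict[int, List[int]]  # undirected: node -> adjacent nodes


def fundamental_cycle_basis(graph: Graph) -> List[List[int]]:
    """One cycle (as a node list) per non-tree edge of a BFS spanning forest."""
    parent: Dict[int, int] = {}
    depth: Dict[int, int] = {}
    tree_edges = set()
    for root in graph:
        if root in depth:
            continue
        parent[root], depth[root] = root, 0
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in graph[v]:
                if u not in depth:
                    parent[u], depth[u] = v, depth[v] + 1
                    tree_edges.add(frozenset((u, v)))
                    queue.append(u)
    cycles, seen = [], set()
    for v in graph:
        for u in graph[v]:
            edge = frozenset((u, v))
            if len(edge) < 2 or edge in tree_edges or edge in seen:
                continue  # skip self-loops, tree edges, and repeated edges
            seen.add(edge)
            # Walk both endpoints up the tree until they meet at their common ancestor.
            left, right, a, b = [u], [v], u, v
            while a != b:
                if depth[a] >= depth[b]:
                    a = parent[a]
                    left.append(a)
                else:
                    b = parent[b]
                    right.append(b)
            cycles.append(left + right[-2::-1])  # drop the duplicated ancestor
    return cycles


def clique_adjacency(graph: Graph, c_min: int = 3,
                     c_max: Optional[int] = None) -> List[List[int]]:
    """Clique adjacency matrix; c_min and c_max give the bounded variant."""
    nodes = sorted(graph)
    index = {v: i for i, v in enumerate(nodes)}
    a_c = [[0] * len(nodes) for _ in nodes]
    for cycle in fundamental_cycle_basis(graph):
        if len(cycle) < c_min or (c_max is not None and len(cycle) > c_max):
            continue
        for u, v in combinations(cycle, 2):
            a_c[index[u]][index[v]] = a_c[index[v]][index[u]] = 1
    return a_c


hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
a_hex = clique_adjacency(hexagon)     # adjacency matrix of the complete graph K6
a_bow = clique_adjacency(bowtie)      # two triangles sharing node 2
```

For the hexagon the single basis cycle becomes a 6-clique, while for the bowtie graph the two triangles are already cliques, so the clique adjacency matrix coincides with the ordinary adjacency matrix.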
To construct the basis, we use the algorithm proposed by Paton, whose time complexity for computing the basis for a graph with $n$ nodes and $m$ edges is of order $O(n^\gamma)$ for $2 \leq \gamma \leq 3$, and that for a random graph with $n$ nodes is of order $O(n^2)$ Paton (1969).

The Cy2C-GNN consists of two sets of neighborhood aggregating layers: a single layer utilizing the clique adjacency matrix $A^C$, and $l$ neighborhood aggregating layers identical to conventional GNN layers, as defined in Section 2, which utilize the usual adjacency matrix $A$. The model thus preserves various practical merits of GNNs, such as incorporation of edge attributes and efficient applicability to large graph datasets. The trainable weights are not shared among the hidden layers. The first layer of Cy2C-GNN is a single neighborhood aggregating layer utilizing the clique adjacency matrix $A^C \in \mathbb{R}^{n \times n}$. The output of the first hidden layer is the hidden attribute $c_v^{(1)}$ obtained from the usual aggregation and combination functions used to obtain hidden vectors of conventional GNNs:

$$c_v^{(1)} = \mathrm{COMBINE}^{(1)}\Big(X_v,\ \mathrm{AGGREGATE}_v^{(1)}\big(\{\!\{ X_u : u \in C(v) \}\!\}\big)\Big)$$

Recall that $C(v)$ is the set of nodes which are adjacent to $v$ in the clique adjacency matrix. Here, $\mathrm{AGGREGATE}_v^{(1)}(\cdot)$ and $\mathrm{COMBINE}^{(1)}(\cdot)$ are functions as defined in Section 2. To entwine the local topological properties with cyclic structures of a graph, we implement a conventional GNN composed of $l$ neighborhood aggregating layers, disjoint from the single layer utilizing clique adjacency matrices. The $m$-th layer outputs the hidden attribute $h_v^{(m)}$, which is inductively defined as follows:

$$h_v^{(m)} := \mathrm{COMBINE}^{(m)}\Big(h_v^{(m-1)},\ \mathrm{AGGREGATE}_v^{(m)}\big(\{\!\{ h_u^{(m-1)} \mid u \in N(v) \}\!\}\big)\Big), \qquad h_v^{(0)} := X_v, \quad 1 \leq m \leq l \tag{8}$$

The hidden node attributes $c_v^{(1)}$ and $h_v^{(l)}$ obtained from the two sets of layers are concatenated, followed by multi-layer perceptrons (MLPs) to obtain the final hidden node attribute $H_v$:

$$H_v = \mathrm{MLP}\big(\mathrm{CONCAT}(c_v^{(1)}, h_v^{(l)})\big)$$
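The two-branch computation described above can be sketched without any trainable weights. The sketch assumes scalar node attributes, plain sums for AGGREGATE and COMBINE, concatenation as a pair, and a sorted multiset as the graph readout; the MLPs are omitted, and the names `sum_layer`, `cy2c_node_features`, and `multiset_readout` are hypothetical. The clique neighborhoods $C(v)$ are written out by hand for a 6-cycle and for two disjoint 3-cycles.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # node -> neighbor list (N(v) or C(v))


def sum_layer(neigh: Graph, h: Dict[int, float]) -> Dict[int, float]:
    """One layer with COMBINE and AGGREGATE both taken to be sums (an assumption)."""
    return {v: h[v] + sum(h[u] for u in neigh[v]) for v in neigh}


def cy2c_node_features(adj: Graph, clique_adj: Graph, x: Dict[int, float],
                       num_layers: int) -> Dict[int, Tuple[float, float]]:
    """Return CONCAT(c_v^(1), h_v^(l)) for every node; the final MLP is omitted."""
    c1 = sum_layer(clique_adj, x)      # single layer over C(v)
    h = dict(x)
    for _ in range(num_layers):        # l layers over N(v)
        h = sum_layer(adj, h)
    return {v: (c1[v], h[v]) for v in adj}


def multiset_readout(values) -> tuple:
    """Sorted tuple of node features, standing in for an injective readout."""
    return tuple(sorted(values))


# 6-cycle: N(v) is the cycle, C(v) is the complete graph on the same nodes.
hex_adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
hex_clique = {i: [j for j in range(6) if j != i] for i in range(6)}
# Two disjoint 3-cycles: a triangle is already a clique, so C(v) = N(v).
tri_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
tri_clique = tri_adj

ones = {v: 1.0 for v in range(6)}
hex_feats = cy2c_node_features(hex_adj, hex_clique, ones, num_layers=2)
tri_feats = cy2c_node_features(tri_adj, tri_clique, ones, num_layers=2)

# The conventional branch h^(l) agrees on both graphs ...
conventional_equal = (multiset_readout(f[1] for f in hex_feats.values())
                      == multiset_readout(f[1] for f in tri_feats.values()))
# ... while the clique branch c^(1) separates them.
cy2c_differs = (multiset_readout(hex_feats.values())
                != multiset_readout(tri_feats.values()))
```

With constant attributes, every node of either graph receives the same conventional feature because all degrees equal 2, whereas the clique branch sums over five neighbors in the 6-cycle and over two neighbors in each triangle.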
To obtain the vector representation of a graph $G$, the Cy2C-GNN separately aggregates the hidden attributes $c_v^{(1)}$ and $h_v^{(l)}$ over all nodes $v$, followed by a composition with MLPs:

$$H_{h_G, c_G} = \mathrm{MLP}\big(\mathrm{CONCAT}(H_{h_G}, H_{c_G})\big), \qquad H_{h_G} = \mathrm{READOUT}^{(l)}\big(\{\!\{ h_v^{(l)} \mid v \in V(G) \}\!\}\big), \qquad H_{c_G} = \mathrm{READOUT}^{(1)}\big(\{\!\{ c_v^{(1)} \mid v \in V(G) \}\!\}\big)$$

Discriminative Power of Cy2C-GNN Cy2C-GNN can distinguish collections of unions of cyclic graphs with possibly different numbers of nodes, which are non-isomorphic graphs that conventional GNNs may fail to distinguish. We refer to Theorem 4.3 and Figure 1 for some pairs of graphs with node attributes that Cy2C-GNN can distinguish, whereas conventional GNNs cannot.

Theorem 4.3. Let $G$ and $H$ be graphs which have isomorphic universal covers, endowed with node features $f_G : V(G) \to \mathbb{R}^k$ and $f_H : V(H) \to \mathbb{R}^k$. Fix

Computational Complexity

Because the Cy2C-GNN algorithm preserves the conventional neighborhood aggregating layers of GNNs, the time complexity of representing a connected graph G with n nodes and m edges using the Cy2C-GNN algorithm with l + 1 layers is equal to O(m C + lm), where m C is the number of edges of the graph associated to the clique adjacency matrix A C of G. By the Euler characteristic formula, the number of elements in the cycle bases B G of a connected graph G is equal to m -n + 1. Hence, for connected graphs with bounded number of nodes constituting the subgraphs of their cycle bases, Cy2C-GNN is comparable to time complexity of conventional GNNs, and more efficient than spectral decomposition of adjacency matrices of finite graphs and constructing persistence diagrams using trainable or dynamic filtration functions Milosavljevic et al. (2011) ; Rieck et al. (2019) . In addition, the time complexity for preprocessing the graph G to obtain clique adjacency matrices is practically equivalent to O(m), without requiring any training of filtration functions for each graph G. We refer to Appendix A.6 for a detailed discussion on the computational complexity of these algorithms.
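The bound behind this comparison can be written out as a short worked estimate. It assumes that every basis cycle has at most $c$ nodes, where the symbol $c$ is introduced here only for illustration; each basis element then contributes at most the edges of a complete graph on $c$ nodes to the graph associated with $A^C$.

```latex
\begin{align*}
\# B_G &= m - n + 1, \\
m_C &\le \sum_{B \in B_G} \binom{|V(B)|}{2}
     \le (m - n + 1)\,\frac{c(c-1)}{2}
     \le \frac{c(c-1)}{2}\, m, \\
O(m_C + l m) &\subseteq O\big((c^2 + l)\, m\big).
\end{align*}
```

For a fixed bound $c$ this is linear in $m$, which matches the $O(lm)$ cost of the $l$ conventional layers up to a constant factor.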

5. EXPERIMENTS

Dataset To analyze the effectiveness of Cy2C-GNN in distinguishing graphs with varying cyclic subgraphs, we perform an ablation study by utilizing the "CYCLES" and "NECKLACES" synthetic datasets constructed from Horn et al. (2021) . These datasets are comprised of graphs containing cyclic subgraphs, which are designed to assess whether the given GNN can identify differences among such cyclic substructures. As for evaluating the effectiveness of the proposed models in classifying graph datasets, we use the 3 bioinformatics(DD, PROTEINS(FULL), ENZYMES), 3 social network datasets (COLLAB, IMDB-B, REDDIT-B ), and 3 small molecular datasets (MUTAG, NCI1, NCI109). To further verify the extendability of Cy2C-GNN models to graph datasets with additional attributes, we utilized 3 datasets with edge features (BZR-MD, COX2-MD, PTC-MR) as well. Lastly, we tested Cy2C-GNN models on 4 large datasets (REDDIT-M-5K, MOLHIV, MOLTOX21, MOLTOXCAST) from TU datasets Morris et al. (2020) and Open Graph Benchmark datasets Hu et al. (2021) to test their efficiency in processing large graph data sets in comparison to GNNs based on persistent homological techniques. The details of all the aforementioned datasets, obtained from the pytorch-geometric library, are summarized in Appendix B.

Models

To assess whether improvements in the discriminative power of Cy2C-GNN lead to enhancements in classifying benchmark graph datasets, we additionally implemented three baseline models comprised of Graph Convolutional Network(GCN), Graph Attention Networks(GAT), and Graph Isomorphism Network(GIN). All baseline models share the same structure with Cy2C-GNNs except for additional structures required for implementing clique adjacency matrices. We also compare the best classification results obtained from Cy2C-GNN with baseline GNNs, those of the kernel methods(WL (Borgwardt et al., 2020, Table 4.5) , WL-OA (Borgwardt et al., 2020, Table 4.5) , RetGK (Ye et al., 2020, Table 3) , HGK (Togninalli et al., 2019, Table 2) , GH (Togninalli et al., 2019, Table 2 )), kernel method with persistent homology(PWL) (Rieck et al., 2019, Table 1 ), DGCNN (Wijesinghe & Wang, 2022, Table 14) , PNA (Corso et al., 2020, Figure 6 ), ID-GNN (You et al., 2021, Table 3 ), GraphSNN (Wijesinghe & Wang, 2022, Table 3,4,14) ) and GNN models with persistent homology(PersLay (Carrière et al., 2020, Table 2 ), TOGL (Horn et al., 2021, Table 2, 3 )), which aligned to the experimental protocols tested for Cy2C-GNNs. For OGB datasets(MOLHIV, MOLTOX21, MOLTOXCAST), we also consider GNNs with virtual node(VN) methods Hu et al. (2021) . We omitted classification results obtained from other contemporary GNNs whose experimental procedures are different from those suggested in this paper to avoid biased comparisons of proposed GNNs, as carefully suggested in Errica et al. (2020) . Further elaborations on the aforementioned models are outlined in Appendix B.1. 2021), we imposed that the architectural components of the baseline GCN, GAT, and GIN models, such as the number of layers, hidden attribute dimensions, and the classes of aggregation and combination functions, be identical to those of Cy2C-GNN. 
Differences in the number of trainable parameters among these models mostly occur at the final layer, where Cy2C-GNN harbors additional MLPs utilized for representing graphs as real vectors. For additional test on dataset with edge features and large dataset, we only consider Cy2C-GCN. We performed additional hyperparameter optimizations for implementing Cy2C-GNNs. Further details on the differences among the implemented networks are explicated in Appendix B.1.

Ablation Studies Figure 2 illustrates the classification results obtained from baseline models and Cy2C-GCN on the synthetic CYCLES (a) and NECKLACES (b) datasets Horn et al. (2021). We denote by "GCN" graph convolutional networks, "TOGL" topological graph neural networks which model dynamic persistent homological techniques, "PH" static persistent homological techniques, and "WL" the 1-dimensional WL test, all results of which were obtained from (Horn et al., 2021, Figure 1). The Cy2C-GCN model detects cyclic structures of graphs as effectively as persistent homological techniques, which utilize dynamic choices of graph features. We note that Cy2C-GCN can effectively distinguish cyclic graphs with at least 4 nodes, because Lemma 4.1 implies that cyclic graphs and complete graphs of size 3 are isomorphic. Placing a single neighborhood aggregating layer utilizing clique adjacency matrices ahead of other conventional layers proves to be effective in detecting desired cyclic structures, as implied by Lemma 4.1.

Results

Comparisons among the proposed Cy2C-GNN and contemporary graph representation techniques on benchmark datasets are listed in Table 1. The Cy2C-GNN produces better classification results than baseline GNNs on all of the benchmark datasets. With the exception of the NCI datasets, we confirm that the Cy2C-GNNs exhibit better or similar performance to variants of WL tests and conventional GNNs among bioinformatics, social network, and small molecule datasets. Furthermore, we evaluate the Cy2C-GNN model on datasets with edge features and on large datasets to verify the robustness of Cy2C-GNN models in representing graph data sets with additional features. Cy2C-GNN shows equivalent or better performance compared to other GNNs and persistent homological techniques in classifying most graph datasets, except for the PTC-MR and MOLHIV datasets. These results demonstrate that Cy2C-GNN has the potential to efficiently incorporate cyclic structures of large graph datasets into message passing layers, even for datasets with edge features. We also verify that Cy2C-GNNs are sufficient to incorporate topological structures while reducing computational costs compared to dynamic persistent homological techniques, which require users to find suitable filtration functions for each given graph dataset.

6. CONCLUSION

In this paper, we utilized the theory of covering spaces to formulate a mathematical framework describing the strengths of conventional GNNs in detecting isomorphism classes of graphs with continuous node attributes. These mathematical observations led us to propose Cy2C-GNN, which enriches the topological characteristics of input data by utilizing clique adjacency matrices. We demonstrated both theoretically and experimentally that the proposed network can efficiently and reliably represent cyclic (or homological) structures of graph data sets without major alterations to the architecture of conventional message passing layers, such as training dynamic filtration functions or gluing high-dimensional cells or complexes. Nevertheless, Cy2C-GNN is not a panacea for the problem of distinguishing all non-isomorphic classes of graphs. For instance, Cy2C-GNN is not guaranteed to distinguish collections of graphs in which all cyclic subgraphs have the same number of nodes. Care is also required when stacking a large number of neighborhood aggregating layers utilizing clique adjacency matrices, as significantly decreasing the graph diameters may increase the risk of oversmoothing the node features. These risks must be taken into account when applying Cy2C-GNN layers for classifying node labels or performing graph regression tasks. We hence advise using a single layer of Cy2C-GNN preceding conventional GNNs, which is enough to discern graphs with markedly different cyclic subgraph structures. Future research may focus on identifying the extent to which Cy2C-GNN can incorporate homological structures. It would also be of great interest to analyze the inherent relations among Cy2C-GNN and other variants of GNNs, and to test whether combining Cy2C-GNNs with other state-of-the-art techniques may further enhance their performance.

A PROOFS

A.1 CELL COMPLEXES AND COVERING SPACES

In this section, we utilize the theory of covering spaces to provide a rigorous formulation of the statement that the Weisfeiler-Lehman test encapsulates local topological properties of graphs by representing finite depth unfolding trees using node attributes. Readers who are interested in a rigorous treatment of the theory of covering spaces may refer to Hatcher (2002) or Krebs & Verbitsky (2015). Krebs & Verbitsky (2015) proved that the Weisfeiler-Lehman test represents a pair of graphs with fixed constant node labels as identical vectors if and only if their universal covers are isomorphic. Combined with the results on the equivalence between graph neural networks and Weisfeiler-Lehman tests Xu et al. (2018), we prove that graph neural networks represent a pair of graphs with arbitrary node labels as identical vectors if and only if there exists a graph isomorphism between their universal covers that induces an equality of node labels on the universal covers. Throughout this paper, we consider a graph G := (V, E) as a cell complex, which is constructed as follows.

Definition A.1 (Graph as a cell complex). A graph G := (V, E) may be constructed by the following procedure.

1. The set of nodes V := V (G) corresponds to a discrete set of points {v} v∈V (G) (0-cells).

2. The set of edges E := E(G) corresponds to a discrete set of intervals {[0, 1]} e∈E(G) (1-cells).

3. The graph G is inductively constructed by attaching the endpoints of the edges to their corresponding nodes. That is, given an interval [0, 1] corresponding to an edge e := (v 1 , v 2 ), one glues the endpoint {0} to the node v 1 , and the endpoint {1} to the node v 2 .

4. One may iterate the inductive process finitely or indefinitely many times, depending on the cardinality of the set of edges.

We call a space constructed in this manner a 1-dimensional cell complex. Cell complexes allow one to endow graphs with topological structures induced from those over the set of discrete points and the unit interval.

In other words, a covering space Ĝ of G is a graph whose local topological properties are equivalent to those of G. In this paper, we will consider the universal cover of G, which is a canonical graph whose subgraphs correspond to paths in G starting at a node v ∈ V (G) up to homotopy, an equivalence relation which rigorously defines continuous transformation from one path to another path with the same end points.

Definition A.3 (Homotopy). Let f and g be two paths in the graph G which start from a node v 0 and end at a node v 1 . Note that these paths can be considered as functions from the unit interval [0, 1] to G, i.e. f, g : [0, 1] → G are paths such that f (0) = g(0) = v 0 and f (1) = g(1) = v 1 . We say that two paths f and g are homotopic if there exists a continuous function H : [0, 1] × [0, 1] → G such that

1. H(t, 0) = f (t)

2. H(t, 1) = g(t)

3. H(0, x) = v 0 and H(1, x) = v 1 for every x ∈ [0, 1].

Published as a conference paper at ICLR 2023

The homotopy relation on paths with the same end points in any space is an equivalence relation; see Chapter 1.1 of Hatcher (2002) for the proof of this nontrivial fact. Given a path f : [0, 1] → G over a graph G, we denote by [f ] the equivalence class of f under the homotopy equivalence relation.
Using this equivalence relation, we now construct the universal cover of a graph $G$.

Definition A.4 (Universal Covering Space). Let $G$ be a graph. Fix a point $v \in V(G)$. The universal cover of $G$, denoted as $\tilde{G}$, is the space of homotopy classes of paths in $G$ starting at $v$:

$$\tilde{G} := \{[f] \mid f : [0, 1] \to G \text{ such that } f(0) = v\}$$

We end this subsection with the theorem that the universal cover of a connected graph $G$ is a graph without any cycles, whose proof can be found in Chapter 1.3 of Hatcher (2002).

Theorem A.5. Let $G$ be a connected graph. Then its universal cover $\tilde{G}$ is simply connected. In particular, it is a connected graph without any cycles.
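As an illustration of Definition A.4 and Theorem A.5, the following is a minimal Python sketch that enumerates a finite truncation of the universal cover of a small graph. It relies on the standard fact that, for a graph, every homotopy class of paths from the base point to a node has a unique representative that never immediately backtracks along the edge it just traversed. The function name `universal_cover_truncation`, the adjacency-dictionary input format, and the restriction to simple graphs (no self-loops or multi-edges) are assumptions made for this example and are not part of the original construction.

```python
from collections import deque


def universal_cover_truncation(adj, root, depth):
    """Enumerate the universal cover of a simple graph up to a given depth.

    adj   : dict mapping each node to a list of its neighbours (assumed format)
    root  : base point v of the cover
    depth : maximal walk length to enumerate

    Each node of the cover is stored as a non-backtracking walk, i.e. a tuple
    of base-graph nodes starting at `root`.
    """
    nodes = [(root,)]
    edges = []
    queue = deque(nodes)
    while queue:
        walk = queue.popleft()
        if len(walk) - 1 == depth:
            continue
        for nxt in adj[walk[-1]]:
            # Stepping straight back is homotopic to a shorter walk, so skip it.
            if len(walk) >= 2 and nxt == walk[-2]:
                continue
            child = walk + (nxt,)
            nodes.append(child)
            edges.append((walk, child))
            queue.append(child)
    # The covering map sends a walk to its end point in the base graph.
    projection = {walk: walk[-1] for walk in nodes}
    return nodes, edges, projection


# Triangle C_3: its universal cover is an infinite path, truncated here at depth 3.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
nodes, edges, projection = universal_cover_truncation(triangle, 0, 3)
print(len(nodes), len(edges))
```

Every node other than the root is created together with exactly one edge to its parent, so the truncation always has one more node than edges and contains no cycles, in line with Theorem A.5.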

A.2 GRAPH NEURAL NETWORKS

Given a graph $G := (V, E)$ with $n$ nodes, denote by $A \in \mathbb{R}^{n \times n}$ the adjacency matrix of $G$, $D \in \mathbb{R}^{n \times n}$ the diagonal matrix of node degrees of $G$, and $X \in \mathbb{R}^{n \times k}$ the matrix of concatenated $k$-dimensional node attributes of $G$. Denote by $\tilde{A} \in \mathbb{R}^{n \times n}$ the normalized adjacency matrix of $G$. (One may take, for instance, $\tilde{A} := D^{-1/2} A D^{-1/2}$.)

Definition A.6. Throughout the appendix, we use the notation $\{\{\cdot\}\}$ to denote a multiset of real vectors, i.e. we allow multiple instances of its elements.

$$F : M^k_m \to \mathbb{R}^l, \qquad \{\{v_1, v_2, \cdots, v_m\}\} \mapsto F(v_1, v_2, \cdots, v_m)$$

We recall the definition of graph neural networks proposed by Xu et al. (2018).

Definition A.8. We denote by $GNN_l$ the graph neural network comprised of a composition of $l$ neighborhood aggregating layers.

1. Each $m$-th layer $H^{(m)}$ of the network constructs hidden node attributes of dimension $k_m$, denoted as $h^{(m)}_v$, using the following composition of functions:

$$h^{(m)}_v := \mathrm{COMBINE}^{(m)}\left(h^{(m-1)}_v,\ \mathrm{AGGREGATE}^{(m)}_v\left(\{\{h^{(m-1)}_u \mid u \in V(G),\ u \neq v,\ (u, v) \in E(G)\}\}\right)\right), \qquad h^{(0)}_v := X_v \tag{13}$$

In the equation above, $X_v$ is the initial node attribute at $v$, $M^{(m)}_v$ is the collection of all multisets of $k_{m-1}$-dimensional real vectors with $\deg v$ elements counting multiplicities, and the aggregation function is a map

$$\mathrm{AGGREGATE}^{(m)}_v : M^{(m)}_v \to \mathbb{R}^{k'_m} \tag{14}$$

2. Denote by $H^{(l)}$ the final layer of the network. Let $M^{(l)}$ be the collection of all multisets of $k_l$-dimensional vectors with $\#V(G)$ elements. Let

$$\mathrm{READOUT} : M^{(l)} \to \mathbb{R}^K \tag{16}$$

be the graph readout function of $K$-dimensional real vectors defined over the multiset $M^{(l)}$. Then the $K$-dimensional vector representation of $G$, denoted as $h_G$, is given by

$$h_G := \mathrm{READOUT}\left(\{\{h^{(l)}_v \mid v \in V(G)\}\}\right)$$

Observant readers may notice that the graph neural network is a generalization of the color refinement algorithm, which is designed for distinguishing non-isomorphic pairs of graphs with identical discrete node labels Krebs & Verbitsky (2015).
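To make the roles of AGGREGATE, COMBINE, and READOUT in Definition A.8 concrete, the following is a minimal Python sketch of a forward pass. The specific choices are assumptions made purely for illustration: AGGREGATE is a coordinate-wise sum over the multiset of neighbor attributes, COMBINE is $2h_v$ plus the aggregate, READOUT is a coordinate-wise sum over all nodes, and attributes are integer tuples so that the arithmetic is exact. These particular functions are not claimed to be injective, and the name `gnn_forward` is hypothetical.

```python
def gnn_forward(adj, features, num_layers):
    """Toy forward pass following the layer structure of Definition A.8.

    adj        : dict mapping each node to a list of its neighbours (assumed format)
    features   : dict mapping each node to a tuple of ints (initial attributes X_v)
    num_layers : number l of neighbourhood aggregating layers
    """
    h = dict(features)
    for _ in range(num_layers):
        new_h = {}
        for v, neighbours in adj.items():
            # AGGREGATE (illustrative choice): sum over the multiset {{h_u}}.
            agg = [0] * len(h[v])
            for u in neighbours:
                if u != v:
                    agg = [a + b for a, b in zip(agg, h[u])]
            # COMBINE (illustrative choice): 2 * h_v + aggregated attributes.
            new_h[v] = tuple(2 * x + a for x, a in zip(h[v], agg))
        h = new_h
    # READOUT (illustrative choice): sum over the multiset of final attributes.
    dim = len(next(iter(h.values())))
    return tuple(sum(h[v][i] for v in h) for i in range(dim))


# Path graph 0 - 1 - 2 with one-hot style attributes.
path_adj = {0: [1], 1: [0, 2], 2: [1]}
path_feat = {0: (1, 0), 1: (0, 1), 2: (1, 0)}
h_path = gnn_forward(path_adj, path_feat, 1)

# The same graph with its nodes relabelled (0 -> 2, 1 -> 0, 2 -> 1).
relabelled_adj = {2: [0], 0: [2, 1], 1: [0]}
relabelled_feat = {2: (1, 0), 0: (0, 1), 1: (1, 0)}
h_relabelled = gnn_forward(relabelled_adj, relabelled_feat, 1)

print(h_path, h_relabelled)
```

Because both the aggregation and the readout act on multisets, relabelling the nodes leaves the graph representation unchanged, which is the isomorphism invariance that the subsequent subsections refine.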

A.3 HIDDEN NODE ATTRIBUTES AND UNIVERSAL COVERS

We now provide a rigorous formulation of the statement that graph neural networks compute vector representations of finite depth unfolding trees of a graph.

Definition A.9. Let $G := (V, E)$ be a graph. Denote by $G' := (V, E')$ the graph where every node has a self-loop. Denote by $G'' := (V, E'')$ the directed graph with a projection map $p : G'' \to G'$ such that each undirected edge of $G'$ from $v_1$ to $v_2$ corresponds to the following edges:

1. If $v_1 \neq v_2$, the undirected edge corresponds to two directed edges, one edge from $v_1$ to $v_2$, and the other from $v_2$ to $v_1$.

2. If $v_1 = v_2$, the edge corresponds to a single directed self-loop from $v_1$ to itself.

By construction, $k$-dimensional continuous node attributes over $G$, denoted as the function $f_G : V(G) \to \mathbb{R}^k$, clearly extend to continuous node attributes over $G''$, denoted as $f_{G''} : V(G'') \to \mathbb{R}^k$. One can also induce attributes over the set of nodes of the universal cover of $G$. We achieve this by pre-composing the function $f_G$ with the covering map $\pi_G : \tilde{G} \to G$.

Definition A.10. Let $f_G : V(G) \to \mathbb{R}^k$ be the function of $k$-dimensional node labels over the graph $G$. Let $\pi_G : \tilde{G} \to G$ be the universal covering map. Note that the covering map restricts to a function between sets of nodes $\pi_G : V(\tilde{G}) \to V(G)$. The pullback of the node labels to the universal cover $\tilde{G}$ is the composition of functions

$$f_G \circ \pi_G : V(\tilde{G}) \to \mathbb{R}^k.$$

(See Figure 6 for an illustration of how the pullback of node features is defined over the universal cover.)

Definition A.11. Let $G := (V, E)$ be a directed graph. For each node $v \in V(G)$ and any positive number $k$, the depth-$k$ neighborhood rooted at $v$, denoted as $U^k_v$, is a subgraph of $G$ whose set of nodes consists of the distinguished node $v$ itself and the nodes $w$ such that there exist at most $k$ consecutive directed edges from $v$ to $w$. The set of edges of $U^k_v$ is comprised of unions of all $k$ consecutive directed edges from $v$. (For any undirected graph, the finite depth neighborhood rooted at $v$ is defined in an analogous manner, where the set of edges is comprised of undirected edges among pairs of nodes of $G$, instead of directed edges.)

Given a graph $G$, there exists an injective lift of the set of nodes from $G$ to $\tilde{G}''$, defined as

$$i_G : V(G) \xrightarrow{=} V(G') \xrightarrow{=} V(G'') \to V(\tilde{G}'')$$

Likewise, there also exists an injective lift of the set of nodes from $G$ to its universal cover $\tilde{G}$:

$$i^{\mathrm{un}}_G : V(G) \to V(\tilde{G}).$$

Using a predetermined injective lift of nodes of $G$ to $\tilde{G}$ and $\tilde{G}''$, we may define finite depth unfolding trees at a node $v \in V(G)$ as follows.

Definition A.12. Let $G := (V, E)$ be a graph. Fix an injective lift $i_G : V(G) \to V(\tilde{G}'')$. The directed depth $k$ unfolding tree at a node $v \in V(G)$, denoted as $T^k_v$, is the depth-$k$ neighborhood rooted at $i_G(v)$ as a subtree of the associated universal cover $\tilde{G}''$ of $G$. The undirected depth $k$ unfolding tree at a node $v \in V(G)$, denoted as $T^{k,\mathrm{un}}_v$, is the depth-$k$ neighborhood rooted at $i^{\mathrm{un}}_G(v)$ as a subtree of the universal cover $\tilde{G}$ of $G$.

Throughout this manuscript, we will suppress the choice of injective lift from the base space $G$ to its covering space $\tilde{G}''$. This is possible because if $\tilde{G}''$ is an infinite graph, then the construction of universal covers implies that the directed depth $k$ unfolding trees rooted at a fixed node $v \in V(G)$ obtained from any given injective lift $i_G : V(G) \to V(\tilde{G}'')$ are all isomorphic to each other. With abuse of notation, we will not specify the choice of injective lifts when defining depth $k$ unfolding trees. We note that the initial node attributes $X_v$ used for graph neural networks (Definition A.8) are given by

$$X_v = f_G(v) = (f_{G''} \circ \pi_{G''})(i_G(v))$$

Example A.13. Let $G$ be an undirected graph without self-loops given as in Figure 3. The associated graphs $G'$ and $G''$ are given as in Figure 3. Let $f_G : V(G) \to \mathbb{R}^3$ be a function of 3-dimensional node attributes over $G$.
In the exemplary figure, the given node attributes are coordinate-wise real vectors, each represented as discrete node labels, i.e. $(1, 0, 0) \mapsto A$, $(0, 1, 0) \mapsto B$, and $(0, 0, 1) \mapsto C$. Figures 1 and 5 demonstrate the directed depth 3 unfolding tree at a node, considered as a subspace of the universal cover $\tilde{G}''$ of $G''$.

Remark A.14. Given a node $v \in V(G)$ of an undirected graph $G := (V, E)$, the following three relations hold among directed depth $m$ unfolding subtrees of $T^{m+1}_v$.

1. For any two nodes $w_1, w_2$ adjacent to $v \in V(G)$ excluding $v$ itself, there exists a pair of disjoint subtrees $T_1, T_2$ of $T^{m+1}_v \subset \tilde{G}''$ which are isomorphic to directed depth $m$ unfolding trees rooted at $w_1$ and $w_2$, i.e.

$$T^m_{w_1} \cap T^m_{w_2} = \emptyset \tag{21}$$

2. The tree $T^{m+1}_v$ contains disjoint unions of directed depth $m$ unfolding trees rooted at all nodes adjacent to $v$ (including itself), i.e.

$$T^{m+1}_v \supset \bigsqcup_{\substack{w \in V(G) \\ (w, v) \in E(G)}} T^m_w \tag{22}$$

3. The set of nodes of the tree $T^{m+1}_v$ is the disjoint union of the singleton set $\{v\}$ and the set of nodes of depth $m$ unfolding trees at all nodes adjacent to $v$ (including itself), i.e.

$$V(T^{m+1}_v) = \{v\} \sqcup \bigsqcup_{\substack{w \in V(G) \\ (w, v) \in E(G)}} V(T^m_w)$$

Figure 5 illustrates how directed depth 2 unfolding subtrees of $T^3_v$ constructed from Figure 1 satisfy the three aforementioned relations. Note that the depth 2 unfolding subtrees rooted at each node are marked in different colors.

The goal of this section is to identify hidden node attributes obtained from graph neural networks as a function from a multiset to real vector spaces. To do so, we define what is called the multiset of nested multisets associated to finite depth unfolding trees. Before doing so, we introduce some notational abbreviations.

Definition A.15. Given a multiset $S$ and a number $d$, we use the abbreviation $\{\{S\}\}_d$ to denote the multiset whose elements are multisets of $d$ elements in $S$.
Given a function of multisets $f : A \to B$, we denote by $\{\{f\}\} : A \to \{\{B\}\}$ the function obtained from representing each image of $f$ as a singleton multiset, i.e.

$$\{\{f\}\}(a) := \{\{f(a)\}\} \text{ for any } a \in A \tag{24}$$

Given two multisets $A, B$, the sum of two multisets, denoted as $A + B$, is a concatenation of $A$ and $B$, i.e. it is a multiset whose elements are in either $A$ or $B$, and whose element-wise multiplicity is the sum of multiplicities of elements of $A$ and $B$. One may use the usual summation notation to denote a sum of several multisets.

For example, an element of $\{\{\mathbb{R}^k\}\}_3$ is a multiset of 3 real vectors of dimension $k$. Given a real valued function $f(x) = 2x$, the function $\{\{f\}\}$ sends $x$ to the singleton multiset $\{\{2x\}\}$. The sum of two multisets $\{\{2, 3, 5\}\}$ and $\{\{3, 6, 7\}\}$ is equal to $\{\{2, 3, 3, 5, 6, 7\}\}$.

Example A.18. Let $G$ be an undirected graph without self-loops given as in Figure 3. Let $f_G : V(G) \to \mathbb{R}^3$ be a function of 3-dimensional node attributes over $G$. Recall that the given node attributes are coordinate-wise real vectors, each represented as discrete node labels, i.e. $(1, 0, 0) \mapsto A$, $(0, 1, 0) \mapsto B$, and $(0, 0, 1) \mapsto C$. Let $v \in V(G)$ be a node whose attribute is represented as $A$. Denote by $u_1 \in V(G)$ the node whose attribute is represented as $B$, $u_2 \in V(G)$ the degree 3 node whose attribute is represented as $C$, and $u_3$ the remaining node. Figure 4 illustrates the directed depth 3 unfolding tree $T^3_v$ rooted at the node $v$. We explain how the maps $p_{T^0_v}$, $p_{T^1_v}$, and $p_{T^2_v}$ are defined. The map $p_{T^0_v} : \mathbb{R}^3 \to \mathbb{R}^3$, by definition, is an identity function.
$$p_{T^0_v}(A) = A$$

The map $p_{T^1_v} : \mathbb{R}^{3 \times \#(V(T^1_v) \setminus V(T^0_v))} \to \mathcal{T}^1_v$ sends a tuple of 3-dimensional real vectors of length $\deg v + 1 = 4$ to the following multiset:

$$p_{T^1_v} : \mathbb{R}^{3 \times \#(V(T^1_v) \setminus V(T^0_v))} \to \mathcal{T}^1_v = \sum_{u \in V(T^1_v) \setminus V(T^0_v)} \{\{\mathbb{R}^3\}\}_1 = \{\{\mathbb{R}^3\}\}_4$$

$$p_{T^1_v}((A, B, C, C)) = \{\{p_{T^0_v}(A)\}\} + \{\{p_{T^0_{u_1}}(B)\}\} + \{\{p_{T^0_{u_2}}(C)\}\} + \{\{p_{T^0_{u_3}}(C)\}\} = \{\{A\}\} + \{\{B\}\} + \{\{C\}\} + \{\{C\}\} = \{\{A, B, C, C\}\}$$

The map $p_{T^2_v} : \mathbb{R}^{3 \times \#(V(T^2_v) \setminus V(T^1_v))} \to \mathcal{T}^2_v$ sends a tuple of 3-dimensional real vectors of length $\#(V(T^2_v) \setminus V(T^1_v)) = 12$ to the following multiset of multisets:

$$p_{T^2_v} : \mathbb{R}^{3 \times \#(V(T^2_v) \setminus V(T^1_v))} \to \mathcal{T}^2_v = \sum_{u \in V(T^1_v) \setminus V(T^0_v)} \{\{\mathcal{T}^1_u\}\}$$

$$p_{T^2_v}((A, B, C, C, A, B, C, A, B, C, A, C)) = \{\{p_{T^1_v}\}\}(A, B, C, C) + \{\{p_{T^1_{u_1}}\}\}(A, B, C) + \{\{p_{T^1_{u_2}}\}\}(A, B, C) + \{\{p_{T^1_{u_3}}\}\}(A, C) = \{\{\{\{A, B, C, C\}\}, \{\{A, B, C\}\}, \{\{A, B, C\}\}, \{\{A, C\}\}\}\}$$

Using these definitions, we are now able to rigorously formulate the statement that graph neural networks capture local topological properties of $G$ by representing subtrees of the associated universal cover $\tilde{G}''$ as real vectors.

Theorem A.19. Let $G := (V, E)$ be a graph. Denote by $f_G : V(G) \to \mathbb{R}^k$ the function of $k$-dimensional node labels over $G$. The node labels over $G$ can be extended to those over the graph $G''$, denoted as $f_{G''}$. Denote by $\pi_{G''} : \tilde{G}'' \to G''$ the universal covering map of $G''$. Let $T^k_v$ be the directed depth $k$ unfolding tree of the graph $G''$ at the node $v \in V(G'')$. Let $GNN_l$ be a graph neural network comprised of $l$ layers. For each node $v \in V(G)$, let $\mathcal{T}^l_v$ be the multiset of nested multisets associated to the depth $l$ unfolding tree $T^l_v$, as constructed in Definition A.16. Let $p_{T^l_v} : \mathbb{R}^{k \times \#(V(T^l_v) \setminus V(T^{l-1}_v))} \to \mathcal{T}^l_v$ be the morphism defined as in Remark A.17. Then there exists a set theoretic function $F^l_v : \mathcal{T}^l_v \to \mathbb{R}^{k_l}$ such that the node label at $v$ obtained from graph neural networks with $l$ layers is given by

$$h^{(l)}_v = F^l_v\left(p_{T^l_v}\left(\left((f_{G''} \circ \pi_{G''})(u)\right)_{u \in V(T^l_v) \setminus V(T^{l-1}_v)}\right)\right)$$

In other words, the updated node attributes are obtained from a set theoretic function supported over the set of nodes in $V(T^l_v) \setminus V(T^{l-1}_v)$.

Proof. We prove the theorem by induction. For any graph $G := (V, E)$ with a predetermined normalized adjacency matrix $\tilde{A}$, denote by $\tilde{a}_{u,v}$ the entry of the normalized adjacency matrix $\tilde{A}$ at a pair of nodes $(u, v)$. Suppose $l = 1$. The node label at $v \in V(G)$ updated from a graph neural network with a single layer is obtained from taking a weighted sum of labels of nodes adjacent to $v$ and the node $v$ itself, followed by evaluating the newly obtained attributes using the activation function $\sigma_1$. By construction, the depth 0 unfolding tree of the graph $G''$ at $v$ is the node $v$ itself. The parent node of the depth 1

A.4 PROOF OF THEOREM 3.3

Xu et al. (2018) show that graph neural networks with injective node feature aggregating functions and injective graph-level readout functions are as powerful as Weisfeiler-Lehman isomorphism tests in distinguishing non-isomorphic classes of graphs. Using the theory of covering spaces, we prove, for graph neural networks, a refinement of the isomorphism-invariant properties of Weisfeiler-Lehman tests over graphs with identical node labels Krebs & Verbitsky (2015). Under certain conditions, the vector representations of graphs obtained from a graph neural network can distinguish isomorphism classes of universal covers of graphs as well as non-equivalent pullbacks of node labels.

Theorem A.21 (Theorem 3.3). Let $\mathcal{G}$ be a collection of finite connected graphs such that the least upper bound of their diameters is equal to $d$.
Suppose a graph neural network $GNN_l$ with $l$ layers satisfies the following three constraints:

• For every $m$ such that $1 \leq m \leq l$ and for each $v \in V(G)$, the functions $\mathrm{AGGREGATE}^{(m)}_v$ and $\mathrm{COMBINE}^{(m)}$ are injective.

• The graph read-out function READOUT is injective.

• $l \geq 2d$.

Then $GNN_l$ maps any two connected graphs $G, H \in \mathcal{G}$ with the same number of nodes to identical vector representations if and only if there exists an isomorphism of universal covers $\varphi : \tilde{G}'' \to \tilde{H}''$ that induces an equality of the pullback of node labels, i.e.

$$h_G = h_H \iff \exists \text{ isomorphism } \varphi : \tilde{G}'' \to \tilde{H}'' \text{ s.t. } f_{G''} \circ \pi_{G''} = f_{H''} \circ \pi_{H''} \circ \varphi \tag{43}$$

The theorem generalizes the result of Krebs & Verbitsky (2015) for pairs of graphs with fixed constant node labels. Before we prove Theorem 3.3, we recall the definition of graph diameters and note the following lemma.

Definition A.22 (Graph Diameter). Let $G := (V, E)$ be a finite undirected graph. Given a pair of nodes $v, w \in V(G)$, let $P_{v,w}$ be the set of sequences of edges $p = (e_i)_{i=1}^{r}$ such that one can move from node $v$ to node $w$ by traveling along the sequence of edges $(e_i)_{i=1}^{r}$. The length of the path $p$ is the size of the sequence, i.e. $l(p) = l((e_i)_{i=1}^{r}) = r$. Then the graph diameter $d_G$ of $G$ is the maximum, over all pairs of nodes, of the length of a shortest path between them. In other words,

$$d_G := \max_{(v, w) \in V(G) \times V(G)}\ \min_{p \in P_{v,w}} l(p)$$

Lemma A.23. Let $G_1$ and $G_2$ be two undirected graphs with a finite number of nodes. Let $f_{G_1} : V(G_1) \to \mathbb{R}^k$ and $f_{G_2} : V(G_2) \to \mathbb{R}^k$ be node attribute functions over $G_1$ and $G_2$. Denote by $\pi_{G_i} : \tilde{G}_i \to G_i$ and $\pi_{G''_i} : \tilde{G}_i'' \to G''_i$ the universal covering maps for $i = 1, 2$. Then an isomorphism $h : \tilde{G}_1 \to \tilde{G}_2$ induces an isomorphism $h'' : \tilde{G}_1'' \to \tilde{G}_2''$, and vice versa. If such isomorphisms exist, then $f_{G_1} \circ \pi_{G_1} = f_{G_2} \circ \pi_{G_2} \circ h$ if and only if $f_{G''_1} \circ \pi_{G''_1} = f_{G''_2} \circ \pi_{G''_2} \circ h''$.

Proof. Let $v \in V(G)$ be a node of an undirected graph $G$.
We observe that a collection of finite depth unfolding trees rooted at all nodes of a universal cover $\tilde{G}$ (or $\tilde{G}''$) defines an open cover. It hence suffices to show that for any $l$, two undirected depth $l$ unfolding trees $T^{l,\mathrm{un}}_v$ are isomorphic if and only if two directed depth $l$ unfolding trees $T^l_v$ are isomorphic. Note that by the construction of finite depth unfolding trees, the result on equality of node attributes follows immediately. As constructed from Definition A.12, we denote by $T^{1,\mathrm{un}}_v \subset \tilde{G}$ the undirected depth 1 unfolding tree rooted at $v$, and $T^1_v \subset \tilde{G}''$ the directed depth 1 unfolding tree rooted at $v$. Note that the directed tree $T^1_v$ can be constructed from the undirected tree $T^{1,\mathrm{un}}_v$ by substituting all undirected edges with directed edges from $v$ to its neighboring nodes, adding a new copy of the node $v$ itself, and adding a new directed edge from $v$ to its copy. This demonstrates that the statement holds for $l = 1$.

Suppose the isomorphism invariance of unfolding trees holds for $l = k$. As before, denote by $T^{k+1,\mathrm{un}}_v \subset \tilde{G}$ the undirected depth $k + 1$ unfolding tree rooted at $v$, and $T^{k+1}_v \subset \tilde{G}''$ the directed depth $k + 1$ unfolding tree rooted at $v$. The directed tree $T^{k+1}_v$ can be constructed from the undirected tree $T^{k+1,\mathrm{un}}_v$ using the following procedure. For every node $w \in V(T^{1,\mathrm{un}}_v) \setminus \{v\}$, substitute the undirected tree $T^{k,\mathrm{un}}_w$ rooted at $w$ with the directed tree $T^k_w$. We then add a disjoint copy of the directed tree $T^k_v$, and add a new directed edge from the root of the tree $T^{k+1}_v$ to the root of the newly added tree $T^k_v$. (This corresponds to adding a new directed edge from $v$ to its copy.) The procedure is guaranteed to map isomorphic undirected unfolding trees to isomorphic directed unfolding trees, and vice versa.

Figure 8: A comparison between the universal cover $\tilde{G}$ and the induced universal cover $\tilde{G}''$ of a graph $G$ constructed in Figure 3.
A visual illustration which compares the undirected and the directed rooted unfolding trees can be found in Figure 8.

We recall the following two lemmas on isomorphism classes of finite depth unfolding trees. Both lemmas are reformulations of Lemma 2.5 and Lemma 2.7 of Krebs & Verbitsky (2015), which prove the statements for undirected finite depth unfolding rooted trees.

Lemma A.24. Let $v, w \in V(G)$ be any pair of nodes such that for any $l \geq 1$, the following conditions hold.

1. The two directed unfolding trees $T^{l-1}_v$ and $T^{l-1}_w$ are isomorphic.

2. There exists a bijection of nodes $g_l : V(T^1_v) \to V(T^1_w)$ such that for all $u \in V(T^1_v)$,

Proof. Note that any depth-$d$ unfolding tree rooted at a node $v \in V(G)$ or $w \in V(H)$, as a subspace of the universal covers $\tilde{G}''$ and $\tilde{H}''$, contains all pre-images of the sets of nodes $V(G)$ and $V(H)$. This implies that any depth-$(2d - 1)$ unfolding tree rooted at any node $v \in V(G)$ or $w \in V(H)$ contains all possible rooted depth-$d$ unfolding trees. It remains to invoke the ideas of the proof of Lemma 2.7 of Krebs & Verbitsky (2015) to further reduce the lower bound of $l$ from $2n$ to $2d$.

Using the aforementioned lemmas, we are able to prove that the hidden node attributes obtained from GNNs with $l$ layers whose combination and neighborhood aggregation functions are injective indicate the isomorphism classes of depth $l$ unfolding rooted trees.

Lemma A.26. Let $G$ and $H$ be connected graphs with $n$ nodes. Denote by $f_G : V(G) \to \mathbb{R}^k$ and $f_H : V(H) \to \mathbb{R}^k$ the $k$-dimensional node attributes of $G$ and $H$. Denote by $\pi_{G''} : \tilde{G}'' \to G''$ and $\pi_{H''} : \tilde{H}'' \to H''$ the universal covering maps of the induced directed graphs $G''$ and $H''$. Suppose a graph neural network $GNN_l$ with $l$ layers satisfies the condition that for every $m$ such that $1 \leq m \leq l$ and for each node $v$ of any finite graph $\hat{G}$, the functions $\mathrm{AGGREGATE}^{(m)}_v$ and $\mathrm{COMBINE}^{(m)}$ are injective. For each $v \in V(G)$, pick a bijection of sets of nodes $\phi_v : V(G) \to V(H)$ which induces an equality (not an isomorphism) of depth-1 unfolding trees at $v$ and $\phi_v(v)$, i.e. $\phi_v(T^1_v) = \phi_v(T^1_{\phi_v(v)})$. Then for every $1 \leq m \leq l$, the bijection $\phi_v$ induces an equality of hidden node attributes $h^{(m)}_v$ and $h^{(m)}_{\phi_v(v)}$ obtained from $GNN_l$ for every node $v \in V(G)$ if and only if the bijection $\phi_v$ induces isomorphisms $\varphi_{v,m} : T^m_v \to T^m_{\phi_v(v)}$ that imply equality of node attributes over the trees $T^m_v$ and $T^m_{\phi_v(v)}$. In other words:

$$\exists\, \varphi_{v,m} : T^m_v \xrightarrow{\cong} T^m_{\phi_v(v)} \text{ such that } (f_{G''} \circ \pi_{G''})|_{T^m_v} = (f_{H''} \circ \pi_{H''})|_{T^m_{\phi_v(v)}} \circ \varphi_{v,m} \iff h^{(m)}_v = h^{(m)}_{\phi_v(v)} \quad \forall\, 1 \leq m \leq l.$$

Note that the bijections $\{\phi_v\}_{v \in V(G)}$ may not necessarily be a singleton set. We refer to Figure 9 for an example of a pair of graphs $G$ and $H$ where one needs to choose distinct bijections $\phi_v$ for each node $v$.

Proof. This lemma is a generalization of Lemma 2.6 of Krebs & Verbitsky (2015). We prove the statement of the lemma by induction on $l$. The base case for $l = 0$ is trivial, as the statement of the lemma simplifies to comparing the attributes of a given pair of nodes.

Suppose the equivalence relation holds up to $l = l_0$. For each $v \in V(G)$, let $\phi_v : V(G) \to V(H)$ be a bijection of sets of nodes which induces an isomorphism of depth-1 unfolding trees rooted at $v$ and $\phi_v(v)$. Suppose the hidden node attributes obtained from GNNs with at most $l_0 + 1$ layers are identical for every node, i.e. $h^{(m)}_v = h^{(m)}_{\phi_v(v)}$ for $1 \leq m \leq l_0 + 1$. By the induction hypothesis, for any $1 \leq m \leq l_0$, the following equivalence relation holds for each pair of nodes $\{(v, \phi_v(v))\}_{v \in V(G)}$:

$$\exists\, \varphi_{v,m} : T^m_v \xrightarrow{\cong} T^m_{\phi_v(v)} \text{ such that } (f_{G''} \circ \pi_{G''})|_{T^m_v} = (f_{H''} \circ \pi_{H''})|_{T^m_{\phi_v(v)}} \circ \varphi_{v,m} \iff h^{(m)}_v = h^{(m)}_{\phi_v(v)}$$

Observe that for each $1 \leq i \leq l_0$, the isomorphism $\varphi_{v,i} : T^i_v \to T^i_{\phi_v(v)}$ induces the following equivalence relations for all pairs of nodes $\{(u, \varphi_{v,i}(u))\}_{u \in V(T^1_v) \setminus V(T^0_v)}$ and for any $1 \leq m \leq l_0$:

$$\exists\, \varphi_{u,m} : T^m_u \xrightarrow{\cong} T^m_{\varphi_{v,i}(u)} \text{ such that } (f_{G''} \circ \pi_{G''})|_{T^m_u} = (f_{H''} \circ \pi_{H''})|_{T^m_{\varphi_{v,i}(u)}} \circ \varphi_{u,m} \iff h^{(m)}_{\pi_{G''}(u)} = h^{(m)}_{\pi_{H''}(\varphi_{v,i}(u))} \tag{47}$$

Note that $\varphi_{v,i}(u) = \phi_v(u)$ because $\phi_v$ induces an isomorphism of depth-1 unfolding trees at $v$ and $\phi_v(v)$. Consider the open cover of $T^{l_0+1}_v$ by the directed trees $\{T^{l_0}_u\}_{u \in V(T^1_v)}$. The intersection of any two trees satisfies

$$T^{l_0}_u \cap T^{l_0}_{u'} = \begin{cases} T^{l_0-1}_{u'} & \text{if } u \in V(T^0_v) \text{ and } u' \in V(T^1_v) \setminus V(T^0_v) \\ \emptyset & \text{otherwise} \end{cases}$$

By Lemma A.24, the collection of isomorphisms $\{\varphi_{u,l_0}\}_{u \in V(T^1_v)} \cup \{\varphi_{u,l_0-1}\}_{u \in V(T^1_v) \setminus V(T^0_v)}$ induces the isomorphism $\varphi_{v,l_0+1} : T^{l_0+1}_v \to T^{l_0+1}_{\phi_v(v)}$. Recall that the hidden node attributes $h^{(l_0+1)}_v$ and $h^{(l_0+1)}_{\phi_v(v)}$ are identical. Then the following collections of multisets of node attributes are identical.

$$p_{T^{l_0+1}_v}\left(\left((f_{G''} \circ \pi_{G''})(u)\right)_{u \in V(T^{l_0+1}_v) \setminus V(T^{l_0}_v)}\right) = p_{T^{l_0+1}_{\phi_v(v)}}\left(\left((f_{H''} \circ \pi_{H''})(u)\right)_{u \in V(T^{l_0+1}_{\phi_v(v)}) \setminus V(T^{l_0}_{\phi_v(v)})}\right)$$

A.5 PROOF OF THEOREM 4.3

We now prove that the Cycle-to-Clique graph neural networks are more powerful than graph neural networks in distinguishing non-isomorphic classes of graphs.

Remark A.28.
As stated in the main manuscript, there are four approaches that enrich the input data structure of graphs that the graph neural network may process to distinguish non-isomorphic classes of graphs.

1. Unique assignment of node labels: We may distinguish distinct classes of graphs by assigning unique (or randomized) choices of node labels. Implementations include assigning node degrees as labels, imposing positional encoding, and implementing labeling tricks.

2. Persistent homological methods: We may incorporate homological invariants of graphs into GNNs by utilizing or constructing height functions over nodes and edges and constructing the associated persistence diagrams. These diagrams allow one to keep track of variations in the number of components and cycles throughout the filtration of the graph dataset induced from the choice of the height functions.

3. Topological enrichment: We may enrich topological structures of graphs by attaching simplicial and cellular complexes, or incorporating subgraph structures. Such actions allow conventional GNNs to encapsulate particular topological substructures of graphs suitable for classifying graph datasets of interest, such as cliques within social media datasets or cyclic subgraphs within molecular datasets.

4. Construction of non-isomorphic universal covers: We may incorporate homological invariants of graphs, in particular cyclic subgraphs of graphs, into GNNs by constructing a new graph structure induced from subgraphs isomorphic to cyclic graphs. It is advised that the universal covers of newly induced subgraphs are not isomorphic to each other, thereby allowing the graph neural network to represent a collection of cyclic subgraphs as distinct vector representations.

Definition A.29. Denote by $C_n$ the cyclic undirected graph without self-loops consisting of $n$ nodes, and $K_n$ the undirected complete graph with $n$ nodes without self-loops.

We state a simple result that the induced universal covers of $C_n$'s are isomorphic, whereas those of $K_n$'s are not.

Proof. The proof follows from the observation that any induced universal cover $\tilde{G}''$ of a $k$-regular graph $G$ is isomorphic to a directed infinite $(2k + 2)$-regular tree, where each node $v$ is connected by $k + 1$ directed edges from source nodes to $v$, and by $k + 1$ directed edges from $v$ to target nodes.

Hence, the lemma suggests a natural procedure to allow graph neural networks to distinguish cyclic graphs with distinct numbers of nodes: substitute the cyclic graphs $C_n$ with complete graphs $K_n$. Using this observation, we may construct the Cy2C-GNN, as indicated in the main manuscript. Before we recall the definitions of the clique adjacency matrix and the architecture of the Cy2C-GNN, we first state the definition of a cycle basis $B_G$ of a graph $G$.

Definition A.31. Let $G := (V, E)$ be a finite graph. A cycle basis $B_G$ is a set of cyclic subgraphs of $G$ such that every cyclic subgraph of $G$ can be represented by unions and complements of elements in $B_G$.

Given that $G$ is connected, a canonical method to construct a cycle basis of a graph $G$ is to find a spanning tree $T \subset G$. It is a non-trivial result from algebraic topology that if $G$ has $n$ nodes and $m$ edges, then the cardinality of any cycle basis of $G$ is equal to $m - n + 1$; see Theorem 2.44 of Hatcher (2002) for instance.
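The spanning-tree construction of a cycle basis can be sketched in a few lines of Python. The sketch below builds a breadth-first spanning tree and closes one cycle per non-tree edge. The function name `fundamental_cycle_basis`, the adjacency-dictionary input format, and the restriction to connected simple graphs are assumptions made for this example. The resulting basis depends on the choice of spanning tree, so it should be read as one possible basis, not as the basis used in the experiments.

```python
from collections import deque


def fundamental_cycle_basis(adj, root):
    """Return one cycle (as a list of nodes) per non-tree edge of a BFS tree.

    adj  : dict mapping each node to a list of neighbours (connected simple graph)
    root : node at which the breadth-first spanning tree is rooted
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                queue.append(u)

    def path_to_root(v):
        path = [v]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    cycles = []
    seen = set()
    for v in adj:
        for u in adj[v]:
            edge = frozenset((u, v))
            is_tree_edge = parent.get(u) == v or parent.get(v) == u
            if edge in seen or is_tree_edge:
                continue
            seen.add(edge)
            pu, pv = path_to_root(u), path_to_root(v)
            # Drop shared ancestors so both paths end at the lowest common one.
            while len(pu) >= 2 and len(pv) >= 2 and pu[-2] == pv[-2]:
                pu.pop()
                pv.pop()
            # Walk up from u to the common ancestor, then down to v.
            cycles.append(pu + pv[-2::-1])
    return cycles


# Two 4-cycles glued along the edge (0, 1): n = 6 nodes and m = 7 edges.
glued = {0: [1, 3, 5], 1: [0, 2, 4], 2: [1, 3], 3: [2, 0], 4: [1, 5], 5: [4, 0]}
cycles = fundamental_cycle_basis(glued, 0)
print(cycles)
```

For this graph the number of returned cycles equals $m - n + 1 = 2$, matching the cardinality stated above.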
Because $T$ is a subgraph of $G$ consisting of $n$ nodes and $n - 1$ edges, we immediately obtain that the remaining $m - n + 1$ edges of $G$ not lying in $T$ generate the elements of a cycle basis. Indeed, one can construct a cyclic subgraph by adding one of the $m - n + 1$ edges to the spanning tree $T$. Figure 10 illustrates an example of how one can obtain a cycle basis.

The clique adjacency matrix $A_C := \{a^C_{u,v}\}_{u,v \in V(G)}$ is given by

$$a^C_{u,v} := \begin{cases} 1 & \text{if } \exists\, B \in B_G \text{ cyclic s.t. } u, v \in V(B) \\ 0 & \text{otherwise} \end{cases} \tag{60}$$

Given $3 \leq c_1 \leq c_2 < \infty$, one may also define the bounded clique adjacency matrix $A_C|_{[c_1,c_2]} := \{a^C_{u,v}|_{[c_1,c_2]}\}_{u,v \in V(G)}$, which only substitutes cycles of size between $c_1$ and $c_2$ with cliques:

$$a^C_{u,v}|_{[c_1,c_2]} := \begin{cases} 1 & \text{if } \exists\, B \in B_G \text{ cyclic s.t. } u, v \in V(B),\ c_1 \leq |V(B)| \leq c_2 \\ 0 & \text{otherwise} \end{cases}$$

For each node $v \in V(G)$, we denote by $C(v)$ the set of nodes $w \in V(G)$ such that there exists a cyclic subgraph $C \in B_G$ such that both $v$ and $w$ lie in $C$:

$$C(v) := \{w \in V(G) \mid \exists\, C \in B_G \text{ s.t. } v, w \in V(C)\}$$

Definition A.33 (Cy2C-GNN). Let $G := (V, E)$ be a graph with $n$ nodes and continuous node attributes $f_G : V(G) \to \mathbb{R}^k$. We denote by Cy2C-GNN$_l$ the cycle-to-clique graph neural network comprised of a disjoint pair of two types of layers: a single neighborhood aggregating layer $H^{(1)}_C$ utilizing the clique adjacency matrix $A_C$, and $l$ conventional neighborhood aggregating layers $H^{(m)}$ for $1 \leq m \leq l$ utilizing the adjacency matrix $A$. The Cy2C-GNN model admits the following three matrices as inputs:

Theorem A.35 (Theorem 4.3). Let $G$ and $H$ be graphs which have isomorphic universal covers, endowed with node features $f_G : V(G) \to \mathbb{R}^k$ and $f_H : V(H) \to \mathbb{R}^k$. Fix ... where $c_1 = c_2 = |V(C)|$. This results in transforming $G$ to be a non-trivial graph including the graph $C$, and transforming $H$ to be an empty graph. It is easy to see that the universal covers of such two graphs are not isomorphic to each other.

We note that one may still use the usual clique adjacency matrix to distinguish some of the pairs of graphs $G$ and $H$ satisfying the conditions of Theorem 4.3. One needs to verify, however, that the resulting induced graphs associated to the clique adjacency matrices of $G$ and $H$ have non-isomorphic universal covers. The upcoming example gives exemplary pairs of graphs for every even number of nodes $|V(G)| \geq 6$ which are discernible by using Cy2C-GNN equipped with the usual clique adjacency matrix.

Example A.36. In this example, we explicitly construct a collection of isomorphism classes of graphs with $n$ nodes that Cy2C-GNN can distinguish, whereas conventional GNNs cannot. Consider the two connected graphs $G$ and $H$ with 6 nodes as shown in Figure 11. Theorem 3.3 implies that no GNN (including the Weisfeiler-Lehman test) can distinguish the two graphs because the induced initial node attributes over the universal covers $\tilde{G}''$ and $\tilde{H}''$ are equal.

In fact, for any even number of nodes $2n \geq 6$, there exists a pair of labelled connected graphs with $2n$ nodes that no GNN can distinguish. Consider a graph $G_n$ constructed from gluing two $C_{n+1}$ cyclic graphs along a distinguished edge. Consider another graph $H_n$ constructed from connecting a pair of distinguished nodes from two disjoint $C_n$ cyclic graphs by an edge. Impose identical attributes on nodes based on their degrees. Note that there are 2 nodes of degree 3 and $2n - 2$ nodes of degree 2. Then the two universal covers $\tilde{G}_n''$ and $\tilde{H}_n''$ are isomorphic. Furthermore, the node attributes over the universal covers induced from $G_n$ and $H_n$ are identical. Theorem 3.3 hence implies that no GNN can distinguish $G_n$ and $H_n$. However, the two graphs are clearly not isomorphic. Let $G_{n,1}$ be the graph obtained by substituting the two cyclic subgraphs $C_{n+1}$ with two complete graphs $K_{n+1}$. Likewise, let $H_{n,1}$ be the graph obtained by substituting the two disjoint cyclic graphs $C_n$ with two complete graphs $K_n$.
Among the nodes of $G_{n,1}$, there are $2n - 2$ nodes of degree $n$ and 2 nodes of degree $n + 2$. As for the nodes of $H_{n,1}$, there are $2n - 2$ nodes of degree $n - 1$ and 2 nodes of degree $n$. Thus, $\tilde{G}_{n,1}'' \not\cong \tilde{H}_{n,1}''$. (In fact, $\tilde{H}_{n,1}''$ consists of two disjoint isomorphic copies of infinite trees, each of which is not isomorphic to $\tilde{G}_{n,1}''$.) Therefore, any GNN satisfying the conditions of Theorem 3.3 can distinguish the graphs $G_{n,1}$ and $H_{n,1}$. The Cy2C-GNN captures the non-isomorphism of universal covers by admitting the clique adjacency matrices of $G$ and $H$ as inputs. Note that the clique adjacency matrix of $G$ (and $H$) is equal to the sum of the adjacency matrix of $G_{n,1}$ (and $H_{n,1}$, respectively) and the identity matrix.

Likewise, there is a collection of finite graphs with many connected components that no graph neural network can distinguish. Consider a collection of graphs $\mathcal{G}$ with $n$ nodes consisting of isomorphism classes of disjoint unions of cyclic graphs. The directed unfolding tree $T^l_v$ of depth $l$ at each node $v$ of the graph is isomorphic to a directed 3-regular rooted tree of depth $l$. If the node attributes are constant, then Theorem 3.3 implies that no graph neural network can distinguish all such disjoint unions of cyclic graphs. By substituting all disjoint cyclic graphs with complete graphs, however, we obtain non-isomorphic classes of disjoint unions of universal covers. Therefore, even if the node attributes are constant, Theorem 3.3 implies that Cy2C-GNN can distinguish all such disjoint unions of cyclic graphs.

Figure 12: An illustration of Theorem 4.3. Each of the following pairs of graphs with non-isomorphic cyclic subgraphs has identical associated universal covers, which implies that graph neural networks cannot distinguish the two graphs. Nevertheless, Cy2C-GNN is able to distinguish them by adding complete graphs from cycles of these graphs, whose universal covers are not isomorphic.

Remark A.37. One immediate result from Theorem 4.3 is that Cy2C-GNN can distinguish non-isomorphic classes of graphs that the 1-, 2-, and 3-WL tests cannot distinguish. The proof of this claim originates from the adaptation of the well-known arguments presented in Bodnar et al. (2021b; a).
The Rook's 4 × 4 graph and the Shrikhande graph, which constitute the only two elements of the family SR(16, 6, 2, 2) of strongly regular graphs, are not isomorphic because the Shrikhande graph contains a chordless cyclic subgraph of size 5, whereas the Rook's 4 × 4 graph does not; see for instance Table 4 of Bodnar et al. (2021a). Theorem 4.3 hence implies that Cy2C-GNN can distinguish these two graphs by incorporating the cycle basis of the Shrikhande graph, which contains a chordless cyclic subgraph of size 5, and that of the Rook's 4 × 4 graph, which does not contain such subgraphs. We note that contemporary techniques which are known to classify classes of graphs that the 1-, 2-, and 3-WL tests cannot distinguish possess the capability of counting the number of nodes of given cyclic or clique subgraphs of G. Such results include Proposition 25 of Bodnar et al. (2021a), Proposition 2 of You et al. (2021), Theorem 8 of Bodnar et al. (2021b), and Proposition 3.4 of Bouritsas et al. (2022). While these previously studied GNNs and Cy2C-GNN share the common objective of distinguishing cyclic substructures of graphs, the underlying algorithms that realize this objective differ. To elaborate, Bodnar et al. (2021a) and Bouritsas et al. (2022) propose gluing higher dimensional cells or complexes while preserving the 1-skeleton structure of a given graph. You et al. (2021) propose perturbing the feature of a designated node, which allows GNNs with a sufficiently large number of layers to distinguish cyclic substructures of graphs. Cy2C-GNN, on the other hand, avoids higher dimensional cells and perturbations of the given node attributes by adding suitable edges, that is, by transforming the 1-skeleton structure of a graph. This procedure allows a wide range of classes of graphs with isomorphic universal covers, in particular those with varying cyclic structures, to attain additional non-isomorphic universal covers.
Because neighborhood aggregating layers are effective in distinguishing isomorphism classes of graphs up to their universal covers, Cy2C-GNN allows conventional GNNs to detect such cyclic structures with a single neighborhood aggregating layer utilizing clique adjacency matrices.

A.6 COMPUTATIONAL COMPLEXITY

The computational complexity of the Cy2C-GNN method consists of two parts: an extraction algorithm that constructs the clique adjacency matrix from a general description of a graph, and the GNN model that uses the clique adjacency matrix. The extraction algorithm is equivalent to constructing a basis of cycles of the first homology group of a given graph G. The classical algorithm proposed by Paton (1969) finds a cycle basis of a finite graph with n nodes in time $O(n^{\gamma})$ for $2 \le \gamma \le 3$. For random graphs with n nodes and density 0.5, the time complexity of Paton's algorithm is equal to $O(n^2)$. Indeed, as shown in Table 3 in Appendix B, the average sizes of cycle bases $B_G$ for benchmark graph datasets are bounded above by their average numbers of edges. There are algorithms more efficient than Paton's for obtaining a cycle basis of a finite graph with n nodes. One may resort to persistent homological techniques to construct persistence diagrams associated to filtrations of a finite graph G with m edges, whose time complexity is of order $O(m\alpha(m))$ Horn et al. (2021). The function $\alpha(m)$ is the inverse Ackermann function, which grows at an extremely slow rate in terms of m. In fact, it grows so slowly that we may assume without loss of generality that $\alpha(m)$ is constant for any practical choice of the number of edges m. The latter architectural component of the Cy2C-GNN algorithm has time complexity equivalent to that of conventional GNNs, namely $O(m_C + lm)$, where l is the number of layers, $m_C$ is the number of edges of the graph associated to the clique adjacency matrix $A_C$, and m is the number of edges of G. We recall that the Euler characteristic formula implies $\#B_G = \#E(G) - \#V(G) + 1$ for any connected graph G.
Hence, the time complexity of the Cy2C-GNN method is practically equivalent to $O(m)$, given that the number of layers is fixed and the number of nodes in each element of the cycle bases is bounded. We have thus established that, as long as one chooses a sufficiently efficient algorithm for computing cycle bases of graphs, the time complexity of representing graphs with n nodes and m edges using Cy2C-GNN is equivalent to $O(m)$. Such a complexity is equivalent to that of conventional GNNs, and more efficient than persistent homological techniques, while managing to capture cyclic substructures of graph data sets. The time complexity of constructing persistence diagrams given an arbitrary height function over the set of nodes and edges of a graph G is equal to $O(n^{2w})$, where w is a positive number such that $O(n^w)$ is the time complexity required for multiplying two n × n matrices Milosavljevic et al. (2011). We note that the time complexity of the conventional matrix multiplication algorithm is $O(n^3)$, with the best known asymptotic complexity being $O(n^w)$ with $w \sim 2.37$ Alman & Williams (2021). As for persistent homological techniques obtained from a predetermined height function over G, there are classes of height functions for which persistence diagrams can be constructed in time $O(n^2 \log n)$; see for instance the work by Rieck et al. (2019). Nevertheless, such persistent homological techniques may not necessarily distinguish all isomorphism classes of graphs, as carefully studied in the work by Horn et al. (2021). For example, consider the height function $h : G \to \mathbb{R}$ over a graph G with continuous node attributes $f_G : V(G) \to \mathbb{R}^k$ defined as

$$h(v) = 0 \ \text{ for } v \in V(G), \qquad h(e) = \|f_G(v) - f_G(w)\|_p + \tau \ \text{ for } e = (v, w) \in E(G),$$

where $\|\cdot\|_p$ is the $L^p$-distance defined over $\mathbb{R}^k$, and $\tau > 0$ is a positive bias term. Let G and H be two graphs as shown in Figure 4.
Then one may observe that there exists a bijection between the zeroth dimensional persistence diagrams (and first dimensional persistence diagrams, respectively) obtained from filtrations of G and H with respect to the height function h.
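The Euler characteristic identity and the near-linear $O(m\alpha(m))$ bound quoted in this section can be illustrated with a short sketch. The code below is not the authors' extraction algorithm; it is a minimal union-find pass (union by rank with path halving, the structure behind the inverse Ackermann bound) that only counts the edges closing a cycle, and the function name `cycle_basis_size` is hypothetical. It assumes a simple undirected graph with each edge listed once.

```python
def cycle_basis_size(n, edges):
    """Count cycle-closing edges of a simple undirected graph.

    Returns (number of non-tree edges, number of connected components).
    The first value equals the size of any cycle basis.
    """
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        # Path halving: point every other node on the path at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    non_tree = 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            # Both endpoints are already connected, so this edge closes a cycle.
            non_tree += 1
        else:
            # Union by rank: attach the shallower tree under the deeper one.
            if rank[ru] < rank[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            if rank[ru] == rank[rv]:
                rank[ru] += 1
    components = sum(1 for x in range(n) if find(x) == x)
    return non_tree, components


# Triangle {0,1,2}, a bridge 2-3, and a square {3,4,5,6}: 7 nodes, 8 edges.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 6)]
basis_size, num_components = cycle_basis_size(7, example_edges)
```

For a graph with c connected components the count equals #E − #V + c, which reduces to the formula above when c = 1.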

B IMPLEMENTATION

The experiments were run on a cloud provider instance comprising 16 physical cores (Intel Xeon Processor (Skylake, IBRS) CPU @ 2.10GHz) with 2 NVIDIA Quadro RTX 6000 GPUs. We use one-hot encoding for discrete node features and normalize the continuous node features before using GNNs. For the social network datasets, we construct initial node features from one-hot encodings of node degrees. The statistical properties of the benchmark data sets are summarized in Tables 3 and 4. For the ablation study, we use the "CYCLES" and "NECKLACES" synthetic datasets from Horn et al. (2021).

Models We compare the proposed methods against three baseline models: GCN, GAT, and GIN. The hidden dimension of each message passing layer is uniformly given by 136. The number of heads in GAT is equal to 8, and each hidden size is equal to 17 for every benchmark dataset. For implementing GCN and GAT, the element-wise mean pooling layer is used for constructing the outputs of the final message passing layer. We use a classifier that consists of three MLPs for classifying graph data sets. In the case of GIN, each layer passes the hidden attributes through the element-wise sum pooling. Furthermore, the summed graph representations of each layer are also summed after passing through MLPs to obtain the entire graph representation. Cy2C-GNNs share identical architectural structures with GCN, GAT, and GIN, except for the additional layer for the clique adjacency matrix and the dimension of the hidden state at the first MLP in the classifier. All baseline models and Cy2C-GNNs include batch normalization Ioffe & Szegedy (2015) and residual connections He et al. (2016).
The hyperparameters of Cy2C-GNNs are optimized with hidden dimensions from 32 to 256, weight decay from 0.0 to 0.01, the number of message passing layers from 1 to 5, initial dropout rate from 0 to 0.8, dropout rate of message passing layers from 0 to 0.8, and the number of layers in the classifier from 1 to 3. In the case of large data sets, we implement Cy2C-GNN models whose number of layers is between 1 and 7. We also consider two variants of the classifier input: the output of the last layer only, and the concatenated outputs of all layers. For classifying the BZR-MD and COX2-MD datasets, we require that Cy2C-GNNs incorporate the edge attributes of the respective data sets. For classifying the OGB and PTC-MR data sets, however, we implement both cases, in which Cy2C-GNNs do or do not incorporate the given edge features.

Graph classification For classifying graph data sets, we adopt stratified 10-fold cross-validation to evaluate the performance of baseline models and Cy2C-GNNs, while preserving the percentage of the train, validation, and test samples for each class. These models are optimized by the Adam optimizer Kingma & Ba (2014) with an initial learning rate from $1 \times 10^{-4}$ to $1 \times 10^{-3}$. We use ReduceLROnPlateau from the PyTorch library as the learning rate scheduler, which multiplies the learning rate by 0.8 when the validation loss does not decrease, with a patience of 25. Additionally, we apply early stopping when the validation loss does not decrease for 100 epochs. Training is stopped when the learning rate falls below the minimum learning rate of $1 \times 10^{-6}$. In the case of OGB datasets, we repeat the overall experiments 10 times with the original training/validation/test dataset splits Hu et al. (2020), and the batch size is set to 128. Cy2C-GNNs are also optimized by the Adam optimizer Kingma & Ba (2014). ReduceLROnPlateau is used, which multiplies the learning rate by 0.8 whenever the validation accuracy does not increase for 25 epochs, along with an early stopping criterion with a patience of 100.
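The interplay of the three stopping rules described above (learning rate decay by 0.8 with patience 25, early stopping with patience 100, and a learning rate floor of $1 \times 10^{-6}$) can be made concrete with a small simulation. This is a simplified stand-in written in plain Python and not PyTorch's scheduler: it uses strict improvement of the validation loss and ignores PyTorch's threshold and cooldown options. The function name `simulate_training` and its parameter names are hypothetical. In PyTorch the decay rule itself would typically be configured as `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=25)`.

```python
def simulate_training(val_losses, lr=1e-3, factor=0.8, lr_patience=25,
                      stop_patience=100, min_lr=1e-6):
    """Replay a sequence of validation losses through the stopping rules.

    Returns (last epoch index, final learning rate, reason for stopping).
    """
    best = float("inf")
    bad_lr = 0    # epochs without improvement since the last LR change
    bad_stop = 0  # epochs without improvement since the best loss so far
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:
            # Plateau detected: shrink the learning rate and restart the count.
            lr *= factor
            bad_lr = 0
        if lr < min_lr:
            return epoch, lr, "min_lr"
        if bad_stop >= stop_patience:
            return epoch, lr, "early_stop"
    return len(val_losses) - 1, lr, "finished"


# A validation loss that never improves after the first epoch.
flat_run = simulate_training([1.0] * 300)
```

Under these simplified rules, a run whose loss never improves is ended by the early stopping criterion long before the learning rate reaches the floor, since only a few decays fit into 100 stagnant epochs.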

B.2 DETAILS OF DATASET

Tables 3 and 4 summarize the statistical properties of the benchmark data sets used to demonstrate the effectiveness of the proposed model. The synthetic datasets "CYCLE" and "NECKLACE" are generated using the GitHub repository provided by Horn et al. (2021). Each dataset consists of 1000 synthetically generated graphs with two distinct cyclic substructures, and each class contains exactly half of all generated graphs (that is, 500 graphs out of 1000). The objective is to verify whether the proposed GNN can [...] (2008), which returns a list of cycle basis elements of a graph. The node sets of the cycle basis elements are used for making new edges that transform each cycle into a clique. Then, the clique adjacency matrix is obtained from the adjacency matrix by masking edges that are not contained in any cycle and adding the new edges. We refer to Figure 11 to recall how the clique adjacency matrix is constructed.
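The construction just described can be sketched in a few lines. The code below is an illustration under stated assumptions and not the authors' implementation: it computes a fundamental cycle basis from a BFS spanning forest in plain Python (the paper relies on a library routine for this step), it builds the unbounded variant of the clique adjacency matrix, and the function names `fundamental_cycle_basis` and `clique_adjacency` are hypothetical. The input is assumed to be a simple undirected graph with each edge listed once.

```python
from collections import deque
from itertools import combinations


def fundamental_cycle_basis(n, edges):
    """Return the node lists of the fundamental cycles of a BFS spanning forest."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    parent, depth, tree_edges = {}, {}, set()
    for root in range(n):
        if root in parent:
            continue
        parent[root], depth[root] = None, 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent:
                    parent[w], depth[w] = u, depth[u] + 1
                    tree_edges.add(frozenset((u, w)))
                    queue.append(w)

    cycles = []
    for u, v in edges:
        if frozenset((u, v)) in tree_edges:
            continue
        # Each non-tree edge closes one cycle: climb from both endpoints
        # to their lowest common ancestor in the spanning tree.
        path_u, path_v = [u], [v]
        a, b = u, v
        while a != b:
            if depth[a] >= depth[b]:
                a = parent[a]
                path_u.append(a)
            else:
                b = parent[b]
                path_v.append(b)
        cycles.append(path_u + path_v[-2::-1])
    return cycles


def clique_adjacency(n, cycles):
    """Connect every pair of nodes that lie on a common basis cycle."""
    a_c = [[0] * n for _ in range(n)]
    for cycle in cycles:
        for u, v in combinations(cycle, 2):
            a_c[u][v] = a_c[v][u] = 1
    return a_c


# Triangle {0,1,2}, a bridge 2-3, and a square {3,4,5,6}.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 6)]
example_cycles = fundamental_cycle_basis(7, example_edges)
example_a_c = clique_adjacency(7, example_cycles)
```

In this example the bridge between nodes 2 and 3 lies on no cycle and is therefore absent from the resulting matrix, while the square gains its two diagonals. Different spanning trees can yield different bases, so the resulting matrix depends on the basis chosen.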

Running time analysis

We performed additional experiments measuring the running time of the baseline GNNs and Cy2C-GNNs, including the pre-processing steps required for Cy2C-GNNs, such as detecting cycle bases and constructing clique adjacency matrices. Since the REDDIT-M-5K dataset has the highest average number of nodes and edges among our benchmark datasets, we selected it for these additional experiments. First, the preprocessing steps take on average 0.49 seconds of CPU time per graph and 2461.44 seconds for all graphs in the dataset. For a fair comparison, we evaluate the GPU running time of baseline GCNs and Cy2C-GCNs with 128 hidden dimensions during training. We perform five runs of GCNs and Cy2C-GCNs with identical hyperparameters, and measure the average time spent to reach the same number of epochs in order to analyze the additional computational cost incurred by the clique adjacency matrix. Table 5 shows the running time of baseline GCNs and Cy2C-GCNs with different numbers of layers. Cy2C-GNNs with a single message passing layer take approximately 1.5 times longer to represent graphs than baseline GCNs with the same number of layers. Nevertheless, we observe that increasing the number of layers significantly decreases the computational cost of Cy2C-GNNs relative to baseline GCNs. This is supported by the fact that Cy2C-GNNs with five message passing layers take approximately 1.2 times longer to represent graphs than baseline GCNs with identical architectural structures, a marked improvement in relative running time over the single-layer case. The running time of Cy2C-GNNs, including preprocessing steps, is significantly longer than that of the baseline GNNs.
However, since the preprocessing steps need to be performed only once and can be parallelized, we can claim that Cy2C-GNNs have computational complexity comparable to that of baseline GCNs.

Additional ablation studies We performed an additional ablation study to empirically show the expressive power of Cy2C-GNNs by comparing the results of conventional GNNs, Relational Pooling GIN (RP-GIN) Murphy et al. (2019), and our method on the Circular Skip Link (CSL) dataset Murphy et al. (2019). Figure 13 shows examples of CSL graphs, which have 41 nodes with identical node features and skip lengths in {2, 3, 4, 5, 6, 9, 11, 12, 13, 16}. We evaluate baseline GCNs and Cy2C-GCN-1 with 5-fold cross-validation while preserving the train and validation percentages used in Murphy et al. (2019). The hidden dimension of each message passing layer is fixed to 16, and the batch size is set to 16. Baseline GCNs and Cy2C-GCN-1 are optimized by the Adam optimizer Kingma & Ba (2014) with a learning rate from $1 \times 10^{-4}$. Since only the train dataset and validation dataset exist, we choose the best validation accuracy in terms of the accuracy of the train dataset. The results are listed in Table 6. The results of baseline GCNs are consistent with the fact that conventional GNNs cannot distinguish graphs in the CSL dataset Murphy et al. (2019). Cy2C-GCN-1 not only distinguishes graphs in the CSL dataset, but also shows much higher performance than RP-GIN.
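The reason conventional message passing fails on CSL graphs can be checked directly. The sketch below builds CSL graphs in plain Python and runs a few rounds of 1-WL colour refinement with constant initial colours. The triangle count is used here only as an independent witness that the two graphs are not isomorphic; it is not the mechanism used by Cy2C-GNN. All function names are hypothetical.

```python
from collections import Counter


def csl_graph(n, skip):
    """Adjacency sets of a cycle on n nodes with extra links i -- i + skip."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in (1, skip):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def wl_colors(adj, rounds=3):
    """1-WL refinement with nested tuples as colours, starting from a constant."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        # New colour = (own colour, sorted multiset of neighbour colours).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def count_triangles(adj):
    """Count each triangle once by enumerating ordered triples u < v < w."""
    return sum(
        1
        for u in adj
        for v in adj[u]
        for w in adj[v]
        if u < v < w and w in adj[u]
    )


csl_2 = csl_graph(41, 2)
csl_3 = csl_graph(41, 3)
same_wl_colouring = wl_colors(csl_2) == wl_colors(csl_3)
```

Both graphs are 4-regular, so every node receives the same colour in every round and the two colour histograms coincide, whereas the graph with skip length 2 contains triangles and the one with skip length 3 does not.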



Figure 1: (Upper Left) An exemplary graph G with the induced graphs G ′ and G ′′ . (Lower Left) An illustration of pullback of node attributes (Definition 3.2) defined over the base graph G ′′ to its universal cover G′′ . The corresponding node attributes and edges over the two graphs are marked in identical colors. (Right) An illustration of Theorem 3.3 and Theorem 4.3. The two graphs G and H have identical associated universal covers G′′ and H′′ and equality of pullback of node attributes.

Lemma 4.1. Denote by C n the cyclic undirected graph without self-loops consisting of n nodes, and K n the undirected complete graph with n nodes without self-loops. Then for any m 1 ̸ = m 2 , the universal covers of cyclic graphs C m1 ′′ and C m2 ′′ are isomorphic, whereas the universal covers of complete graphs K m1 ′′ and K m2 ′′ are not.
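The contrast stated in the lemma can be observed numerically. Nodes of the universal cover at distance l from a lift of a node v correspond to non-backtracking walks of length l starting at v, so counting such walks gives the level sizes of the cover. The sketch below works with the plain universal cover of the simple graph and sets aside the paper's passage from G to G′′; the function names are hypothetical.

```python
from collections import Counter


def cycle_graph(n):
    """Adjacency sets of the cycle C_n (n >= 3)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def complete_graph(n):
    """Adjacency sets of the complete graph K_n."""
    return {i: set(range(n)) - {i} for i in range(n)}


def cover_level_sizes(adj, root, depth):
    """Number of universal-cover nodes at distances 1..depth from a lift of root."""
    # State (previous node, current node) -> number of non-backtracking walks.
    frontier = Counter({(None, root): 1})
    sizes = []
    for _ in range(depth):
        nxt = Counter()
        for (prev, cur), count in frontier.items():
            for w in adj[cur]:
                if w != prev:  # forbid stepping straight back along the same edge
                    nxt[(cur, w)] += count
        frontier = nxt
        sizes.append(sum(frontier.values()))
    return sizes


levels_c5 = cover_level_sizes(cycle_graph(5), 0, 4)
levels_c7 = cover_level_sizes(cycle_graph(7), 0, 4)
levels_k4 = cover_level_sizes(complete_graph(4), 0, 4)
levels_k5 = cover_level_sizes(complete_graph(5), 0, 4)
```

Every cycle unfolds to the same two-way infinite path regardless of its length, while $K_n$ unfolds to an $(n-1)$-regular tree, whose level sizes already differ at distance 1 for different n.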

Figure 2: Comparisons of classification results obtained from baseline models and Cy2C-GNN from synthetic CYCLE (left) and NECKLACE (right) datasets Horn et al. (2021). All other four baseline models are obtained from (Horn et al., 2021, Figure 1).

In accordance with the experimental setup proposed by Dwivedi et al. Dwivedi et al. (2020) and Horn et al. Horn et al. (

[0, 1]. Let us recall that subsets of points characterize the topological structure defined over the discrete set of points, whereas the open intervals (a, b) with 0 ≤ a ≤ b ≤ 1 characterize the topological structure defined over [0, 1]. In an analogous manner, the open subsets characterizing the topological structure over graphs are the subsets of nodes, the open intervals defined over an edge, and the open subsets centered at a node v, obtained from gluing a set of open sub-intervals with common endpoint v. Taking countable unions and finite intersections of these three types of open subsets, one can construct any subset which represents local topological properties of G. Definition A.2 (Covering Space). Let G be a graph. Let x ∈ G be a point, which could be a node v ∈ V(G) or any point on an edge e ∈ E(G). A covering space of G is a graph Ĝ with a projection map $p_G : \hat{G} \to G$ such that for any point x ∈ G, there exists an open subset U ⊂ G whose pre-image $p_G^{-1}(U)$ is a disjoint union of open subsets $\{U_i\}$, each of which is homeomorphic to U.

Remark A.7. Let M k m be the collection of all multisets of k-dimensional vectors with m elements, counting multiplicities. Suppose F : R k×m → R l is a function which is invariant under the permutation action of the symmetric group with m elements S m . (The action corresponds to the permutation of rows of the k × m real matrix). Then the function F induces the function over M k m defined as

is a set theoretic function of $k'_m$-dimensional real vectors defined over $M_v$, and the combination function $\mathrm{COMBINE}^{(m)} : \mathbb{R}^{k_{m-1} + k'_m} \to \mathbb{R}^{k_m}$ (15) is a set theoretic function combining the attribute $h^{m-1}_v$ and the image of $\mathrm{AGGREGATE}^{(m)}_v$.

Figure 3: An exemplary graph G with the induced graphs G ′ and G ′′ .

Figure 4: An illustration of Theorem A.19. Each respective node labels obtained from summing the attributes of the node v itself and those of nodes adjacent to v correspond to the sum of attributes of nodes of G′′ in the respective colored regions.

Figure 5: An illustration of three relations among depth 2 unfolding subtrees of T 3 v from Remark A.14. Each respective node labels obtained from summing the attributes of the node v itself and those of nodes adjacent to v correspond to the sum of node labels of T m v in the respective colored regions.

B, C, C}}}} + {{{{A, B, C}}}} + {{{{A, B, C}}}} + {{{{A, C}}}} ={{{{A, B, C, C}}, {{A, B, C}}, {{A, B, C}}, {{A, C}}}}

25. Let G, H be two connected graphs with at most n nodes. Let d be the maximum of the diameters of two graphs G and H, i.e. d := max(d G , d H ). Then for any pair of nodes v ∈ V (G) and w ∈ V (H), the directed unfolding trees T l v and T l w are isomorphic for any l ≥ 2d if and only if T 2d

Lemma A.30 (Lemma 4.2). For any n ̸ = m, Cn ′′ ∼ = Cm ′′ , whereas Kn ′′ ̸ ∼ = Km ′′ .

Figure 10: An illustration which shows how a choice of a spanning tree of a graph gives rise to a cycle basis B G of a graph (Definition A.31). We note that the first, second, and the fifth cyclic subgraphs are chordless, whereas the third and the fourth cyclic subgraphs are not (Definition A.34). The non-chordless cyclic subgraphs can be further decomposed into chordless cyclic subgraphs, as shown in the new cycle basis B ′ G . (Theorem 4.3)

mi . See Figure 12 for instance.

Figure 11: An illustration of Definition 4.2 and Theorem 4.3. The two connected graphs have identical associated universal covers, which implies that graph neural networks cannot distinguish the two graphs. Nevertheless, Cy2C-GNN is able to distinguish them by adding complete graphs from cycles of these graphs, whose universal covers are not isomorphic. The incorporation of cycle bases of graphs into graph representations becomes feasible by admitting the clique adjacency matrices as inputs.

EXPERIMENTAL SETUP

Dataset We perform graph classification on 3 bioinformatics datasets (DD, PROTEINS(FULL), ENZYMES), 3 social network datasets (COLLAB, IMDB-B, REDDIT-B), 3 small molecules datasets (MUTAG, NCI1, NCI109), 3 datasets with edge features (BZR-MD, COX2-MD, PTC-MR) and 4 large datasets (REDDIT-M-5K, MOLHIV, MOLTOX21, MOLTOXCAST) from the TU datasets Morris et al. (2020), which are available in the pytorch-geometric library Fey & Lenssen (2019), and the Open Graph Benchmark datasets Hu et al. (

Figure 13: An example of CSL data Murphy et al. (2019) with skip length 2 and 3. WL test and Conventional GNNs cannot distinguish these graphs.

E) with continuous node attributes f G : V (G) → R k can be represented by three types of matrices: the adjacency matrix A ∈ R n×n , the node feature matrix X ∈ R n×k obtained from the function f G , and the clique adjacency matrix A C ∈ R n×n (possibly bounded) from Definition 4.2. The Cy2C-GNN admits the three types of matrices (A, A C , X) as inputs, whereas conventional GNNs only utilize two types of matrices (A, X).
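A toy computation suggests why the third input matters. The sketch below compares one sum-aggregation step over A and over $A_C$ for two graphs on 8 nodes with constant node features: a disjoint union of two 4-cycles and a single 8-cycle. It is an illustration only; it uses the unbounded clique adjacency matrix, relies on the fact that each cycle component is its own cycle basis element, and is not the layer definition used in the paper. The function names `disjoint_cycles` and `sum_layer` are hypothetical.

```python
def disjoint_cycles(sizes):
    """Return (A, A_C) for a disjoint union of cycles with the given sizes (each >= 3)."""
    n = sum(sizes)
    a = [[0] * n for _ in range(n)]
    a_c = [[0] * n for _ in range(n)]
    offset = 0
    for size in sizes:
        nodes = range(offset, offset + size)
        for i in nodes:
            # Cycle edge from i to its successor within the same component.
            j = offset + (i - offset + 1) % size
            a[i][j] = a[j][i] = 1
            # Clique edges: i is joined to every other node of its cycle.
            for k in nodes:
                if k != i:
                    a_c[i][k] = 1
        offset += size
    return a, a_c


def sum_layer(matrix, x):
    """One sum-aggregation step; returns the sorted multiset of updated features."""
    return sorted(
        x[i] + sum(m * xj for m, xj in zip(row, x))
        for i, row in enumerate(matrix)
    )


features = [1] * 8
a_g, ac_g = disjoint_cycles([4, 4])  # two squares
a_h, ac_h = disjoint_cycles([8])     # one octagon
out_a_g, out_a_h = sum_layer(a_g, features), sum_layer(a_h, features)
out_ac_g, out_ac_h = sum_layer(ac_g, features), sum_layer(ac_h, features)
```

With A alone both graphs produce the same multiset of node values, since every node has exactly two neighbours. With $A_C$ the values differ after a single step, because the cliques built from the 4-cycles and from the 8-cycle have different degrees.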

a cycle basis B G of G. Suppose that there exists a chordless cyclic subgraph C ∈ B G such that any cycle basis B H does not have any chordless cyclic subgraph of size equal to |V (C)|. Then Cy2C-GNN which utilizes bounded clique adjacency matrices can distinguish G and H, whereas conventional GNNs cannot. matrix is enough to distinguish such pairs of graphs, a marked improvement from other contemporary GNNs which assume to have sufficiently large numbers of hidden layers. We leave the proof of Theorem 4.3 to Theorem A.35 as well as comparison in discriminative power of Cy2C-GNN to other contemporary state-of-the-art GNNs to Example A.36 and Remark A.37.

Classification results obtained from bioinformatics, social network and small molecules datasets. Note that N/A indicates graph classification methods which do not report classification results on the given graph data set. Classification methods in grey text are cited from available results in pre-existing publications.

Classification results obtained from datasets with edge features and large datasets. Note that N/A indicates graph classification methods which do not report classification results on the given graph data set. Classification methods in grey text are cited from available results in pre-existing publications.

a cycle basis B G of G. Suppose that there exists a chordless cyclic subgraph C ∈ B G such that any cycle basis B H does not have any chordless cyclic subgraph of size equal to |V (C)|. Then Cy2C-GNN which utilizes bounded clique adjacency matrices can distinguish G and H, whereas conventional GNNs cannot.

Proof. Let G and H be graphs satisfying the conditions of the theorem. Let C ∈ B G be the chordless cyclic subgraph of our interest. If a cyclic subgraph C H ∈ B H has size equal to |V (C)|, then C H is not chordless, i.e. there exists a set of chordless subgraphs C H,i ⊂ H such that C H = ∪ i C H,i . Because C H ∈ B H , we may substitute the element C H with one of the C H,i 's such that |V (C H,i )| < |V (C H )| = |V (C)|. Therefore, we can assume that the cycle basis B H does not contain any elements whose size is equal to |V (C)|. A visual illustration of this argument can be found in Figure 10, where one can obtain a new cycle basis comprised of chordless cyclic subgraphs of strictly smaller sizes using the elements from a given cycle basis. We can hence apply a single layer of Cy2C-GNN equipped with the bounded clique adjacency matrix A

A summary of statistics of bioinformatics, social network and small molecules datasets. Cells marked "-" indicate graph data sets which do not have, or for which this paper does not use, the corresponding features. Numbers in parentheses are dimensions of features. The term "Average # H1 Cycles" indicates the average size of the cycle bases of graphs. The term "Average Magnitude # Cycles" denotes the average number of nodes present in a cyclic subgraph of a graph.

A summary of statistics for graph classification datasets with edge features and large datasets including OGB. Numbers in parentheses are dimension of features. The term "Average # H1 Cycles" indicates the average size of the cycle bases of graphs. The term "Average Magnitude # Cycles" denotes the average number of nodes present in a cyclic subgraph of a graph.

Comparison of running time (in seconds) at 50, 100 and 150 epoch obtained from REDDIT-M-5K.

Classification results obtained from the CSL dataset. Classification methods in grey text are cited from results in the pre-existing publication Murphy et al. (2019). The term "Baseline GCN-N" denotes conventional GCNs with N layers, where N takes any value in {1, 2, 3, 4, 5}.

ACKNOWLEDGEMENT

This research was supported by the National Institute for Mathematical Sciences (NIMS) grant funded by the Korean Government (MSIT) (No. B23910000).

Published as a conference paper at ICLR 2023

Figure 6: A visual demonstration of pullback of node labels from a given graph G ′′ to its universal cover G′′ , as defined from 3 and 4. The corresponding node attributes and edges are marked in identical colors.

Definition A.16. Let G := (V, E) be a finite undirected graph without self-loops, f G : V (G) → R k the function of k-dimensional node attributes over G, and T l v the depth l unfolding tree at v ∈ V (G). We inductively define the multiset of nested multisets T l v associated to the depth l unfolding tree T l v as follows. [...] where [...] is the multiset summation operator.

Remark A.17. We note that the multiset of nested multisets T l v is not identical to the multiset of k-dimensional real vectors with # (V (T l v ) \ V (T l-1 v )) elements. Nevertheless, recall from Remark A.14 that [...]. The above relation, along with the inductive construction of T l v , induces an inductive construction of a morphism [...] which sends a tuple of node attributes [...] to the corresponding multiset respecting the subgraph structure specified by depth l -1 unfolding subtrees. Here, for each u ∈ [...] the concatenation of all k-dimensional node attributes supported over the depth l -1 unfolding tree rooted at u. Note that p T 0 v is the identity function from R k to itself. As will be demonstrated in the upcoming example, the functions p T l v are generalizations of the Weisfeiler-Lehman iteration scheme for updating node attributes.

[...] unfolding tree T 1 v is the node v. The child nodes of T 1 v are the nodes of G adjacent to v, including v itself. Hence there exists a bijection [...] which induces an equality of respective restrictions of node attributes f G and f G ′′ • π G ′′ .
By definition and the existence of the bijection φ 1 , there exists a function [...]. Suppose the theorem holds for l = m. For any node u adjacent to v (i.e. u ∈ V (T 1 v ) \ {v} ), the three relations from Remark A.14 among directed finite depth unfolding trees establish a bijection of sets of nodes φ m u : A m u → B m v,u , where the sets A m u and B m v,u are given as [...]. The bijection φ m u induces an equality between respective restrictions of pullback of node attributes [...], both of which induce equalities between respective restrictions of f G ′′ • π G ′′ . We refer to Figure 7 for an illustration of the sets of nodes A m u and B m v,u obtained from the exemplary graph drawn in Figure 3.

Denote by T m+1 v the following multiset of nested multisets: [...]. In other words, T m+1 v is a multiset summation of nested multisets associated to depth m unfolding trees T m u rooted at all nodes u which do not map to the given node v under the universal covering map π [...] the function given by: [...]. Here, for each node u, the vectors L m u are elements of R k×T m u , i.e. the concatenation of all k-dimensional node attributes supported over the depth m unfolding tree rooted at u. Note that as a multi-set function, [...]. For some positive number k ′ m+1 there exists a function [...]. To see this, we observe that [...]. The last equation follows from the following observation. The domain of the aggregation function is supported over the set of nodes u∈A 1 v \{v} B m v,u . Applying the bijection φm+1 implies that the aggregation function is defined over the real vector space supported over the set of nodes A m+1 v \ A m v . By (39), we hence obtain that there exists a function [...].

Example A.20. We revisit the graph G as shown in Figure 3. Consider a graph neural network with m layers such that for all layers the aggregation function AGGREGATE v and the combination function COMBINE correspond to summation of respective node attributes.
One may consider the resulting GNN as a simplified generalization of the Weisfeiler-Lehman isomorphism test for continuous node attributes. Concatenations of adjacent discrete node labels correspond to summations of adjacent node attributes, whereas substitutions of newly obtained node labels are skipped. Then there exists a correspondence between the node attributes updated from the graph neural network with m layers and the attributes over the set of nodes in [...]. For example, as indicated in Figures 4 and 5, the updated node attributes can be obtained by counting the number of occurrences of attributes in the respective colored region.

Figure 9: A visual demonstration of the gluing procedure from the proof of Theorem A.26. One can construct an isomorphism between a pair of depth l + 1 unfolding trees rooted at v and ϕ v (v) by identifying the isomorphism classes of depth l unfolding trees rooted at u and ϕ v (u) for all u ∈ V (T 1 v ) and gluing the trees in accordance to their intersections, which are depth l -1 unfolding trees (48). Observe that the bijection of the set of nodes ϕ v1 does not induce an equality (not an isomorphism) of depth-1 trees at v i and ϕ v1 (v i ) for i = 2, 3, 5, 6. For example, the set of nodes of the undirected depth-1 unfolding tree at v 2 is given by {v 1 , v 2 , v 5 }, whereas the set of nodes of the undirected depth-1 unfolding tree at ϕ v1 (v 2 ) is the set {ϕ v1 (v 1 ), ϕ v1 (v 2 ), ϕ v1 (v 3 )}. For such nodes, different choices of the bijections of the set of nodes ϕ vi are required.

The above equation follows from the condition that the functions AGGREGATE (m) v and COMBINE (m) v are injective for all v ∈ V (G) and 1 ≤ m ≤ l 0 + 1, which implies that the function over the multiset of labels F l0+1 v as constructed from Theorem A.19 is an injective function.
The collection of isomorphisms from (49) further indicates that for any u ∈ V (T 1 v ) \ V (T 0 v ), the following equality of node attributes over T l0 v , T l0 , and their intersection [...] extends to the equality of node attributes over [...]. Iterating the gluing procedure for all depth l 0 trees {T l0 u } u∈V (T 1 v )\V (T 0 v ) results in the desired equality of node attributes over T l0+1 v . Now suppose that there exists an equality of node attributes between T l0+1 v and T l0+1 ϕv(v) induced from the unfolding tree isomorphism φ v,l0+1 : T l0+1 v → T l0+1 ϕv(v) . Theorem A.19 implies that there exists a set theoretic function F l0+1 v : T l0+1 v → R k l 0 +1 such that the node label at v obtained from graph neural networks with l layers is given by [...]. The conditions that AGGREGATE (l0+1) v and COMBINE (l0+1) v are injective imply that F l0+1 v is an injective function. Therefore, the equality of hidden node attributes hϕv(v) follows immediately from the fact that the equality of node attributes between depth l 0 + 1 unfolding trees ensures the equality of the collection of multisets of node attributes over [...].

Using Lemma A.25 and A.26, we now prove Theorem 3.3.

Theorem A.21 (Theorem 3.3). For each v ∈ V let ϕ v : V (G) → V (H) be a bijection of sets of nodes of G and H which induces an equality of depth-1 unfolding trees at v and ϕ v (v). Lemma A.26 implies that the following equivalence relation holds for every v ∈ V (G): [...]. Lemma A.25 implies that for any positive number l ≥ 2d, the following equivalence relation holds: [...]. Consider the open cover {T l w } w∈V ( G′′ ) consisting of directed depth l unfolding trees rooted at every node of G′′ . Note that for any w ∈ V ( G′′ ), there exists a node v ∈ V (G) such that T l w ∼ = T l v .
For any two nodes w 1 , w 2 ∈ V ( G), there exist injective lifts i G (w 1 ) and i ′ G (w 2 ) whose corresponding depth l unfolding trees rooted at w i 's satisfy the following equation: [...]. Hence there exists an isomorphism φ : G′′ → H′′ if and only if for some l ≥ 2d, there exists a bijection ϕ : V (G) → V (H) which induces an isomorphism of directed depth l unfolding trees φ v,l : T l v ∼ = T l ϕ(v) for every v ∈ V (G). Furthermore, the induced node attributes f G ′′ • π G ′′ and f H ′′ • π H ′′ are identical if and only if for some l ≥ 2d, there exists a bijection ϕ v : V (G) → V (H) which induces an equality of node attributes [...] imposed on directed depth l unfolding trees T l v ∼ = T l ϕv(v) for each v ∈ V (G). Therefore, we obtain that [...]. Because the graph readout function READOUT is injective, we obtain [...]. Combining the two equations above proves the theorem.

Remark A.27. One can in fact use Theorem 3.3 to establish a notion of weak isomorphism between a pair of graphs G 1 and G 2 . Let G be the collection of finite connected undirected graphs. One may define a weak equivalence relation among the elements of G by constructing a function F : G → X from G to the space of regular cell complexes X with the property that two graphs G 1 and G 2 are weakly isomorphic if and only if their images F (G 1 ) and F (G 2 ) are isomorphic as cell complexes. We refer to Definition 8 of Bodnar et al. (2021a) for a related concept named "cellular lifting map". Theorem 3.3 proves that conventional GNNs establish a weak isomorphism among graphs in G using the lifting function F : G → X univ from G to the collection of universal covers of 1-dimensional cell complexes X univ .

• The adjacency matrix of a graph A ∈ R n×n
• The node feature matrix X ∈ R n×k obtained from the function f G
• The clique adjacency matrix A C ∈ R n×n (possibly bounded) from Definition 4.2.

Each type of message passing layer is constructed in the following manner:

1. The first layer H(1)C of the network constructs the hidden node attribute c(1) v using the clique adjacency matrix A C and the following composition of functions: [...]. As for the other type of message passing layer, the m-th layer H (m) of the network for each 1 ≤ m ≤ l constructs the hidden node attribute of dimension k m , denoted as h (m) v , using the following composition of functions: [...] is the collection of all multisets of k m-1 -dimensional real vectors with deg v elements counting multiplicities, the aggregation function [...] is a set theoretic function combining the attribute h m-1 v and the image of AGGREGATE (m) v . For obtaining the hidden attributes, identical aggregation and combination functions are employed.

2. Denote by H (l) the final conventional neighborhood aggregating layer of the network.

• The final hidden attribute H v at node v ∈ V (G) obtained from Cy2C-GNN l is obtained by concatenating the hidden attributes c(1) v and h (l)v at each v ∈ V (G) and composing with multi-layer perceptrons (MLPs): [...]
• For 1 ≤ m ≤ l, let M (m) be the collection of all multisets of k m -dimensional vectors with #V (G) elements. Let READOUT (m) : M (m) → R K (68) be the graph readout function of K-dimensional real vectors defined over the multiset M (m) . Then the vector representation of G, denoted as H h G ,c G , is given by [...]

We end this section with the statement that Cy2C-GNN is more powerful than graph neural networks satisfying the conditions of Theorem 3.3 in distinguishing non-isomorphic classes of graphs.

Definition A.34 (Chordless Subgraphs). Let H ⊂ G be a subgraph. We say that H is chordless if there does not exist a cyclic subgraph C of G such that V (C) ⊊ V (H) (See Figure 10 for examples of cyclic subgraphs which are chordless and not).
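For a cyclic subgraph of a simple graph, the condition in Definition A.34 can be tested with the usual chord criterion: a smaller cycle on a proper subset of the nodes exists exactly when some edge of G joins two nodes of the cycle without belonging to it. The sketch below uses this reformulation, which is an assumption of the example and not a statement taken from the paper; the function name `is_chordless` is hypothetical, and the input nodes are assumed to form a cycle in G.

```python
def is_chordless(cycle_nodes, edges):
    """Check whether a cycle of a simple graph has no chord.

    cycle_nodes: the nodes of a cyclic subgraph of the graph.
    edges: iterable of undirected edges (u, v) of the whole graph.
    """
    nodes = set(cycle_nodes)
    # Edges of the subgraph induced by the cycle's nodes, stored without orientation.
    induced = {frozenset(e) for e in edges if e[0] in nodes and e[1] in nodes}
    # A cycle on k nodes has exactly k edges; any extra induced edge is a chord.
    return len(induced) == len(nodes)


# A square 0-1-2-3 with the diagonal 0-2.
square_with_diagonal = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
outer_is_chordless = is_chordless([0, 1, 2, 3], square_with_diagonal)
triangle_is_chordless = is_chordless([0, 1, 2], square_with_diagonal)
```

In this example the outer 4-cycle has a chord and can be replaced in a cycle basis by one of the two triangles it contains, which mirrors the substitution step used in the proof of Theorem 4.3.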

