GRAPH NEURAL NETWORKS ARE MORE POWERFUL THAN WE THINK

Abstract

Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of standard GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that standard GNNs, with anonymous inputs, produce more discriminative representations than the WL algorithm. To this end, we derive an alternative analysis that employs linear algebraic tools and characterizes the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We prove that GNNs are able to generate distinctive outputs from white uninformative inputs for, at least, all graphs that have different eigenvalues. We also show that simple convolutional architectures with white inputs produce features that count the closed paths in the graph and are provably more expressive than the WL representations. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed approach.

1. INTRODUCTION

Graph Neural Networks (GNNs) have emerged in the field of machine learning and artificial intelligence as powerful tools that process network structures and network data. Their convolutional architecture allows them to inherit all the favorable properties of convolutional neural networks (CNNs), while they also exploit the graph structure. Despite their remarkable performance, the success of GNNs is still to be demystified. A lot of research has been conducted to theoretically support the experimental developments, focusing on understanding the functionality of GNNs and analyzing their properties. In particular, permutation invariance-equivariance (Maron et al., 2018), stability to perturbations (Gama et al., 2020) and transferability (Ruiz et al., 2020a; Levie et al., 2021) are properties central to the success of GNNs. Lately, the research focus has shifted towards analyzing the expressive power of GNNs, since their universality depends on their ability to produce different outputs for different graphs. The common belief is that standard anonymous GNNs have limited expressive power (Xu et al., 2019) and that it is upper bounded by the expressive power of the Weisfeiler-Lehman (WL) algorithm (Weisfeiler & Leman, 1968). This has induced increased research activity towards building more expressive GNNs by either increasing their complexity or employing independent graph algorithms to design expressive inputs. In this work we argue the opposite. We prove that standard anonymous graph convolutional structures are able to generate more expressive representations than the WL algorithm. Therefore, resorting to handcrafted features or complex GNNs to break the WL limits is not necessary.

Our work is motivated by the following research problem:

Problem definition: Given a pair of different graphs G, Ĝ and anonymous inputs X, X̂, is there a GNN ϕ with parameter tensor H such that ϕ(X; G, H), ϕ(X̂; Ĝ, H) are nonisomorphic?

As anonymous inputs, we define inputs that are identity and structure agnostic, i.e., they cannot distinguish graphs or nodes of the graph before processing. Why anonymous? Because if the inputs are discriminative prior to processing, concrete conclusions on the discriminative power of GNNs cannot be derived. Analyzing GNNs with powerful input features only indicates whether GNNs will maintain or ignore valuable information, not whether they can produce this information. This study does not underestimate the importance of designing powerful input features, which is crucial for most tasks. However, it underscores the need for an alternative analysis.

This paper gives an affirmative answer to the above research question. Our analysis utilizes spectral decomposition tools to show that the source of the WL test as a limit for the expressive power of GNNs is the use of the all-one input. This is expected, since analyzing the representational capacity of ϕ(X; G, H) by studying ϕ(1; G, H) cannot lead to definitive conclusions. For this reason we study GNNs with white random inputs and show that they generate discriminative outputs for at least all graphs with different eigenvalues. In particular, we prove that ϕ(X; G, H), ϕ(X̂; Ĝ, H) belong to nonisomorphic distributions, even though the inputs X, X̂ are drawn from the same distribution. This implies that standard anonymous GNNs are provably more expressive than the WL algorithm, as they produce discriminative representations for graphs that fail the WL test, yet have different eigenvalues. In fact, having different eigenvalues is a very mild condition that is almost always met in practice. From a practical viewpoint, using white noise as an input to a GNN may be computationally intractable.
We show, however, that there are two alternative architectures that are equivalent to a GNN with white random inputs: (i) A GNN that operates on graph representations without requiring any input. (ii) A GNN in which the input features are the number of closed paths each node participates in. Note that these features can be viewed as the output of the first GNN layer, i.e., they can be generated from a GNN. These results also imply that ϕ(X; G, H) is more powerful than the WL algorithm even if we restrict our attention to countable inputs X. Our numerical results show that our proposed GNNs are better anonymous discriminators in some graph classification problems.

Our contribution is summarized as follows:
(C1) We provide a meaningful definition to characterize the representational power of GNNs and develop spectral decomposition tools to study their expressivity.
(C2) We explain that the WL algorithm is not the real limit on the expressive power of anonymous GNNs, but it is associated with the all-one vector as an input.
(C3) We study standard GNNs with white random inputs and show that they can produce discriminative representations for any pair of graphs with different eigenvalues. This implies that standard anonymous GNNs are provably more expressive than the WL algorithm.
(C4) We prove that standard GNNs with white random inputs can count the number of closed paths of each node, which enables the design of equivalent architectures that circumvent the use of random input features.
(C5) We demonstrate the effectiveness of using GNNs with white random inputs, or the proposed alternatives, vs all-one inputs in graph isomorphism and graph classification datasets.

Related work:

The first work to study the approximation properties of the GNNs was by (Scarselli et al., 2008a) . Along the same lines (Maron et al., 2019b; Keriven & Peyré, 2019) discuss the universality of GNNs for permutation invariant or equivariant functions. Then the scientific attention focused on the ability of GNNs to distinguish between nonisomorphic graphs. The works of (Morris et al., 2019; Xu et al., 2019) place the expressive power of GNNs with respect to that of the WL algorithm and prompted various follow-up works in the area. Specifically, (Abboud et al., 2021; Sato et al., 2021) use random features to increase the separation capabilities of GNNs, whereas (Tahmasebi et al., 2020; You et al., 2021; Bouritsas et al., 2022) compute features related to the subgraph information. (Ishiguro et al., 2020) uses label features in WL settings and (Corso et al., 2020; Beaini et al., 2021) use multiple and directional aggregators, respectively, to increase the GNN expressivity. GNNs that use k-tuple and k-subgraph information have been designed by (Maron et al., 2019a; Murphy et al., 2019; Azizian et al., 2020; Morris et al., 2020; Geerts & Reutter, 2021; Giusti et al., 2022) . These works use a tensor framework, and employ more expressive structures compared to simple GNNs. However, they are usually computationally heavier to implement and also prone to overfitting. Moreover, (Balcilar et al., 2021) design convolutions in the spectral domain to produce powerful GNNs, whereas (Loukas, 2019) studies the learning capabilities of a GNN with respect to its width and depth. Finally, (Chen et al., 2019) reveal a connection between the universal approximation and the capacity capabilities of GNNs.

2. ON THE EXPRESSIVE POWER OF GNNS

One of the most influential works in GNN expressivity, by (Xu et al., 2019), compares the representational capabilities of GNNs with those of the WL algorithm (color refinement algorithm). The claim is that GNNs are at most as powerful as the WL algorithm in distinguishing between different graphs. This is indeed true when the input to the GNN is the constant (all-one) vector. A question that naturally arises is 'Why limit attention to input features $x = \mathbf{1}$?'. The constant vector might be an obvious choice to study anonymous GNNs; however, it represents only a small subset of GNN inputs. As we show in the next section, it is also associated with certain spectral limitations that prohibit a rigorous examination of the GNN representational power. The need for further analysis with general input signals is therefore clear.

To this end, consider graphs $G, \hat{G}$ with graph operators $S, \hat{S} \in \{0, 1\}^{N \times N}$. In this paper we focus on graph adjacencies, but any graph operators can be used instead. We assume that $S, \hat{S}$ are both symmetric and thus admit eigenvalue decompositions $S = U \Lambda U^T$, $\hat{S} = \hat{U} \hat{\Lambda} \hat{U}^T$, where $U, \hat{U}$ are orthogonal matrices containing the eigenvectors, and $\Lambda, \hat{\Lambda}$ are the diagonal matrices of corresponding eigenvalues. $G, \hat{G}$ are nonisomorphic if and only if there is no permutation matrix $\Pi$ such that $S = \Pi \hat{S} \Pi^T$. A broad class of nonisomorphic graphs have different eigenvalues. To be more precise, let $\mathcal{S}, \hat{\mathcal{S}}$ be the sets containing the unique eigenvalues of $S$ and $\hat{S}$, with multiplicities denoted by $m_\lambda, \hat{m}_\lambda$ respectively. The following assumption is heavily used in the main part of this paper:

Assumption 2.1 $S, \hat{S}$ have different eigenvalues, i.e., there exists $\lambda \in \mathcal{S}$ such that $\lambda \notin \hat{\mathcal{S}}$ or $m_\lambda \neq \hat{m}_\lambda$.

When Assumption 2.1 holds, $G, \hat{G}$ are always nonisomorphic. Assumption 2.1 is not restrictive. Real nonisomorphic graphs have different eigenvalues with very high probability (Haemers & Spence, 2004). Corner cases where Assumption 2.1 does not hold are studied in Appendix H.
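
As an illustration of Assumption 2.1, the following minimal sketch compares the adjacency spectra of two small graphs. The graphs (a 6-cycle and two disjoint triangles), the helper names `cycle_adjacency` and `spectra_differ`, and the tolerance are illustrative assumptions, not taken from the paper. Both graphs are 2-regular on 6 nodes, so color refinement with a constant initialization cannot separate them, yet their eigenvalues differ.

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of the cycle graph on n >= 3 nodes."""
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        S[i, (i + 1) % n] = 1
        S[(i + 1) % n, i] = 1
    return S

def spectra_differ(S, S_hat, tol=1e-8):
    """Numerical check of Assumption 2.1: True if the sorted eigenvalues
    (with multiplicity) of the two symmetric operators differ."""
    lam = np.linalg.eigvalsh(S)          # ascending order
    lam_hat = np.linalg.eigvalsh(S_hat)
    return not np.allclose(lam, lam_hat, atol=tol)

# G: one 6-cycle; G_hat: two disjoint triangles (hypothetical example pair)
S = cycle_adjacency(6)
S_hat = np.kron(np.eye(2, dtype=int), cycle_adjacency(3))

print(spectra_differ(S, S_hat))
```

A relabeled copy of the same graph has the same spectrum, so the check only flags pairs that are guaranteed to be nonisomorphic; it says nothing about cospectral nonisomorphic graphs.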
First, we consider GNNs that are constructed by the following modules, corresponding to the neurons of a typical (non-graph) neural network:

$$Y = \sigma\left(\sum_{k=0}^{K-1} S^k X H_k\right). \tag{1}$$

The module in (1) is composed of a graph filter of length $K$ followed by a nonlinearity $\sigma(\cdot)$. $H_k$ represents the filter parameters and can be a matrix, a vector, or a scalar. In order to characterize the representational power of GNNs with general input $X \in \mathbb{R}^{N \times D}$, we provide the following theorem:

Theorem 2.2 Let $G, \hat{G}$ be nonisomorphic graphs with graph signals $X, \hat{X}$. Also let $V_\lambda, \hat{V}_\lambda$ be the eigenspaces corresponding to $\lambda$ in $S, \hat{S}$ respectively. There exists a GNN $\phi(X; G, H)$ that produces nonisomorphic representations for $G$ and $\hat{G}$ if:
1. There does not exist a permutation matrix $\Pi$ such that $X = \Pi \hat{X}$, or
2. There exists $\lambda \in \mathcal{S}$ such that $\lambda \notin \hat{\mathcal{S}}$ and $X^T V_\lambda \neq 0$, or
3. There exists $\lambda \in \mathcal{S}, \hat{\mathcal{S}}$ such that $m_\lambda \neq \hat{m}_\lambda$ and $X^T V_\lambda \oplus \hat{V}_\lambda \neq 0$.

Theorem 2.2 highlights the importance of the input $X$ in the representational capabilities of a GNN. For problems in which inputs are given, it states that a GNN can distinguish between nonisomorphic graphs if they have different graph signals or their signals are not orthogonal to the eigenspace associated with the eigenvalue that differentiates them. In problems where inputs are not available, Theorem 2.2 provides guidelines on how to design the input $X$ from the graph. Theorem 2.2 also indicates that the limitations of GNNs discussed in (Xu et al., 2019) are not due to the architecture; they are limitations associated with the input. In particular, $x = \mathbf{1}$ fails to satisfy condition 1, while it is also prone to fail conditions 2 and 3, since the majority of real graphs have eigenvectors that are orthogonal to $\mathbf{1}$. This observation is the impetus to study GNNs with white random inputs. White inputs are anonymous (they carry no information on the graphs), they model a large set of GNN inputs, and they always satisfy the conditions of Theorem 2.2.
Furthermore, as we show in section 5, GNNs with random inputs can generate deterministic, countable features:

$$X = \left[\mathrm{diag}(S^0), \mathrm{diag}(S^1), \mathrm{diag}(S^2), \dots, \mathrm{diag}(S^{D-1})\right] \in \mathbb{N}_0^{N \times D}, \tag{2}$$

that satisfy the conditions of Theorem 2.2. A nice interpretation of this result, given in section 5, connects $X$ in (2) with high-order subgraphs and shows that GNNs can count closed paths.

3. LIMITATIONS OF GNNS WITH x = 1 INPUT AND THE WL ALGORITHM

Using Theorem 2.2 we can explain why feeding a GNN with $x = \mathbf{1}$ is limiting. The limitations associated with input $x = \mathbf{1}$ are also highly related to the limitations of the WL algorithm. The problem appears in graphs that admit spectral decompositions with eigenvectors that are orthogonal to $\mathbf{1}$ (they sum up to zero). According to Theorem 2.2, if two graphs are identical except for eigenvalues whose eigenvectors sum up to zero, GNNs with constant inputs are likely to produce isomorphic representations for the two graphs. To see this, consider the graphs $G, \hat{G}$ with spectral decompositions:

$$S = U \Lambda U^T = \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \lambda_3 u_3 u_3^T, \tag{3}$$

$$\hat{S} = \hat{U} \hat{\Lambda} \hat{U}^T = \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \hat{\lambda}_3 u_3 u_3^T, \tag{4}$$

where $\lambda_3 \neq \hat{\lambda}_3$. If $u_3$ is orthogonal to $\mathbf{1}$ then:

$$S^k \mathbf{1} = U \Lambda^k U^T \mathbf{1} = \lambda_1^k u_1 u_1^T \mathbf{1} + \lambda_2^k u_2 u_2^T \mathbf{1} + \lambda_3^k u_3 u_3^T \mathbf{1} = \lambda_1^k \left(u_1^T \mathbf{1}\right) u_1 + \lambda_2^k \left(u_2^T \mathbf{1}\right) u_2, \tag{5}$$

$$\hat{S}^k \mathbf{1} = \hat{U} \hat{\Lambda}^k \hat{U}^T \mathbf{1} = \lambda_1^k u_1 u_1^T \mathbf{1} + \lambda_2^k u_2 u_2^T \mathbf{1} + \hat{\lambda}_3^k u_3 u_3^T \mathbf{1} = \lambda_1^k \left(u_1^T \mathbf{1}\right) u_1 + \lambda_2^k \left(u_2^T \mathbf{1}\right) u_2. \tag{6}$$

The diffused information in GNNs with this naive input is related to $S^k \mathbf{1}$, and therefore in the above example the decisive information that differentiates the two graphs is highly likely to be omitted. Graphs with eigenvectors orthogonal to $\mathbf{1}$ can also affect the performance of the WL algorithm. In the absence of features the WL algorithm is initialized with $x = S\mathbf{1}$, which is propagated through the nodes iteratively.
In graphs with eigenvectors orthogonal to $\mathbf{1}$, the propagated degrees have suffered critical information loss in the initialization, which in certain graph structures is impossible to recover as WL iterations progress. Further analysis on this subject can be found in Appendix C. Classic examples of graphs with different eigenvalues that the WL algorithm and GNNs with $x = \mathbf{1}$ input cannot tell apart are presented in Figs. 1, 2. In particular, these approaches decide that $G$ and $\hat{G}$ in Fig. 1 and $G$ and $\hat{G}$ in Fig. 2 are the same. This is due to the fact that these graphs contain eigenvectors that are orthogonal to $\mathbf{1}$. The case of Fig. 1 is straightforward. All the nodes of $G$ and $\hat{G}$ have the same degree, i.e., $x = \mathbf{1}$ is an eigenvector in both graphs and thus orthogonal to all the remaining eigenvectors. As a result, the node degrees (which are the same for both graphs) are the only information that the WL algorithm and GNNs with $\mathbf{1}$ input are able to process. The case of Fig. 2 is more complicated; $x = \mathbf{1}$ is not an eigenvector in either graph, but it is orthogonal to the eigenvectors corresponding to the eigenvalues that differentiate the two graphs. Consequently, the operation $S\mathbf{1}$ discards vital information and the two approaches fail. 8, 9 of Appendix K. This information corroborates the issues discussed in the previous paragraph. As noted earlier, and as will be explained in more detail in the upcoming sections, GNNs are discriminative enough to overcome these issues and provide nonisomorphic representations for $G$ and $\hat{G}$ in both Figs. 1, 2.

[Figure 3: (a) Stochastic GNN module: input $x$, filter $z = \sum_{k=0}^{K-1} h_k S^k x$, output $y = \mathbb{E}[z^2]$. (b) Equivalent model (diagonal GNN block): $Z = \sum_{k=0}^{2K-2} h_k S^k$, output $y = \mathrm{diag}(Z)$.]

In this section we study the representational power of GNNs by feeding them with random white inputs. Our analysis overcomes the GNN limitations associated with $x = \mathbf{1}$ and derives rigorous conclusions. Consider the GNN module in (1) where $H_k$ is a scalar, i.e., $y = \sigma\left(\sum_{k=0}^{K-1} h_k S^k x\right)$.
Before choosing an appropriate nonlinearity, let us focus on the linear convolutional graph filter of length $K$:

$$z = \sum_{k=0}^{K-1} h_k S^k x, \tag{7}$$

which we load with white random input $x \in \mathbb{R}^N$, i.e., $\mathbb{E}[x] = 0$, $\mathbb{E}[x x^T] = \sigma^2 I$. Since $x$ is a zero-mean random vector, $z$ is also a random vector with $\mathbb{E}[z] = 0$. Thus, the expected value provides no information about the network. Measuring the covariance, on the other hand, yields:

$$\begin{aligned}
\mathrm{cov}[z] = \mathbb{E}\left[z z^T\right] &= \mathbb{E}\left[\sum_{k=0}^{K-1} h_k S^k x x^T \left(\sum_{m=0}^{K-1} h_m S^m\right)^T\right] = \sum_{k=0}^{K-1} h_k S^k \, \mathbb{E}\left[x x^T\right] \sum_{m=0}^{K-1} h_m S^m \\
&= \sigma^2 \sum_{k=0}^{K-1} h_k S^k \sum_{m=0}^{K-1} h_m S^m = \sigma^2 \sum_{k=0}^{K-1} \sum_{m=0}^{K-1} h_k h_m S^k S^m = \sum_{k=0}^{2K-2} h'_k S^k,
\end{aligned} \tag{8}$$

where $h'_k = \sigma^2 \sum_{m,l} h_m h_l$, such that $m + l = k$. The results of equation (8) are noteworthy. We have shown that the covariance of a graph filter with random white input corresponds to a different graph filter with no input. Furthermore, the resulting filter has length $2K - 1$, whereas the original filter has length $K$. In other words, the nonlinearity introduced by the covariance computation enables the filter to gather information from a broader neighborhood compared to the initial filter. However, there is a caveat that the degrees of freedom for $h'$ are $K$ and not $2K - 1$. Further discussion on the subject can be found in Appendix D.

In practice we want to associate the output of a GNN with a feature for each node that is permutation equivariant. This is not the case with the rows or columns of the covariance matrix in (8). Therefore we choose $\sigma(\cdot)$ to be the variance of each node, i.e.,

$$y = \sigma(z) = \mathrm{var}[z] = \mathbb{E}\left[z^2\right] = \mathrm{diag}\left(\mathrm{cov}[z]\right) = \mathrm{diag}\left(\sum_{k=0}^{2K-2} h'_k S^k\right) = \sum_{k=0}^{2K-2} h'_k \, \mathrm{diag}\left(S^k\right). \tag{9}$$

The stochastic GNN module, defined by the linear filter in (7) and the variance operator, is illustrated in Fig. 3a. Regarding its expressive power, we present the following theorem:

Theorem 4.1 Let $G, \hat{G}$ be nonisomorphic graphs. If Assumption 2.1 holds, there exists a GNN with modules as in Fig. 3a that produces nonisomorphic representations for the two graphs.
The implications of Theorem 4.1 are noteworthy. A GNN ϕ(X; G, H) with white input produces outputs that are drawn from different distributions for all graphs with different eigenvalues. Furthermore, measuring the variance produces equivariant node representations that can separate all graphs with different eigenvalues.

Proposition 4.1 The GNN module in Fig. 3a with white random input is equivalent to the GNN module in Fig. 3b with no input, up to degrees of freedom (dependencies) in the filter parameters.

The proof of Proposition 4.1 is by the definition (equation (9)) of the GNN module in Fig. 3b. The claim is significant. It proves the equivalence of two GNN architectures: a standard graph filter with white input followed by a variance operator, and a deterministic graph filter followed by a diagonal operator. Depending on the problem and the variance of the system, one has the option to choose either of them. Further discussion on the stochastic approach can be found in Appendix D.
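
The equivalence stated in Proposition 4.1 can be checked numerically. The sketch below is illustrative only: the random graph, the filter taps `h`, the noise variance `sigma2`, and the sample count are assumptions chosen for the example. It compares the covariance of the filter output in (8) with the input-free filter of length 2K - 1 whose taps are the self-convolution of `h`, and then estimates the per-node variance in (9) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small symmetric 0/1 adjacency and filter taps h_0..h_{K-1}
N, K, sigma2 = 8, 3, 1.0
A = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
S = A + A.T
h = np.array([0.5, -1.0, 0.25])

# Filter matrix H(S) = sum_k h_k S^k, so that z = H(S) x as in (7)
H = sum(h[k] * np.linalg.matrix_power(S, k) for k in range(K))

# cov[z] = H(S) E[x x^T] H(S)^T with E[x x^T] = sigma2 * I, as in (8)
cov_z = sigma2 * H @ H.T

# Equivalent input-free filter: h'_k = sigma2 * sum_{m+l=k} h_m h_l
h_prime = sigma2 * np.convolve(h, h)          # length 2K - 1
Z = sum(h_prime[k] * np.linalg.matrix_power(S, k) for k in range(2 * K - 1))

# Per-node variance in closed form, as in (9)
var_closed_form = np.diag(Z)

# Monte Carlo estimate of the same quantity from white Gaussian inputs
x = rng.normal(0.0, np.sqrt(sigma2), size=(N, 200_000))
var_mc = np.mean((H @ x) ** 2, axis=1)

print(np.max(np.abs(cov_z - Z)))
print(np.max(np.abs(var_mc - var_closed_form) / var_closed_form))
```

The closed-form route needs no sampling, which is one reason to prefer the diagonal module when the variance of the stochastic estimate is a concern.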

5. THE DIAGONAL MODULE

Proposition 4.1 proved the equivalence of the two GNN modules in Fig. 3. In this section we focus on the module in Fig. 3b and analyze its unique properties. To be more precise, we study the following diagonal GNN module:

$$y = \sigma\left(\sum_{k=0}^{K-1} h_k \, \mathrm{diag}\left(S^k\right)\right). \tag{10}$$

Note that the module in (10) is not exactly the same as the one in Fig. 3b, since a nonlinearity is added and the filter is of length $K$. As an example, we test the proposed diagonal module on the graphs of Figs. 1, 2, and present the output $y$ of (10) with parameters $(h_0, h_1, h_2, h_3, h_4, h_5) = (10, 1, -\frac{1}{2}, \frac{1}{3}, -\frac{1}{4}, \frac{1}{5})$ and ReLU nonlinearity in Table 1. We observe that the proposed diagonal module (10) produces embeddings that are different for the nodes of $G$ and $\hat{G}$ in both Figs. 1, 2. Therefore, there does not exist a permutation matrix $\Pi$ such that $y = \Pi \hat{y}$, and the proposed architecture is able to tell $G$ and $\hat{G}$ apart in both Figs. 1, 2. This is in stark contrast to GNNs with $x = \mathbf{1}$ input and the WL algorithm, which fail to distinguish between these graphs (as discussed in section 3).

We now study the diagonal module in the frequency domain to analyze the representational capabilities of standard GNNs:

$$y = \sigma\left(\sum_{k=0}^{K-1} h_k \, \mathrm{diag}\left(\sum_{n=1}^{N} \lambda_n^k u_n u_n^T\right)\right) = \sigma\left(\sum_{k=0}^{K-1} \sum_{n=1}^{N} h_k \lambda_n^k |u_n|^2\right) = \sigma\left(\sum_{n=1}^{N} h(\lambda_n) |u_n|^2\right), \tag{11}$$

where $h(\lambda_n) = \sum_{k=0}^{K-1} h_k \lambda_n^k$ is the frequency response of the graph filter in (7) at $\lambda_n$. In simple words, the frequency representation of the proposed diagonal module, or of standard GNNs with white input, depends on the absolute values of the graph adjacency eigenvectors. On the contrary, standard GNNs with constant inputs admit a different frequency representation:

$$y_1 = \sigma\left(\sum_{k=0}^{K-1} h_k S^k \mathbf{1}\right) = \sigma\left(\sum_{k=0}^{K-1} \sum_{n=1}^{N} h_k \lambda_n^k u_n u_n^T \mathbf{1}\right) = \sigma\left(\sum_{n=1}^{N} h(\lambda_n) \left(u_n^T \mathbf{1}\right) u_n\right). \tag{12}$$

As we can see, both outputs $y, y_1$ are functions of the graph eigenvectors. The question that arises is which function, $|u_n|$ or $\left(u_n^T \mathbf{1}\right) u_n$, results in more expressive GNNs.
The naive answer is that, depending on the graph, there is a trade-off between the information loss caused by $|u_n|$ or $\left(u_n^T \mathbf{1}\right) u_n$. However, after adding a second layer, GNNs with white inputs are always more powerful than GNNs initialized by $\mathbf{1}$. This will be explained in more detail in the next section. A closer look at equations (10) and (11) reveals further insights regarding standard GNNs with anonymous inputs. In particular, [...]

[Figure 4: (a) Type-1 GNN module: $Z = \sum_{k=0}^{K-1} h_k S^k$, $y = \sigma\left(\mathrm{diag}(Z)\right)$. (b) Type-2 GNN module: input $X$, $Z = \sum_{k=0}^{K-1} S^k X H_k$, $Y = \sigma(Z)$.]

The proof is the combination of Theorem 4.1 and equation (10). In particular, a standard anonymous GNN can compute the following vector representations:

$$d_k = \mathrm{diag}\left(S^k\right) = \sum_{n=1}^{N} \lambda_n^k |u_n|^2, \tag{13}$$

that count the number of $k$-length closed paths of each node. For instance, when $k = 2$, $d_k$ indicates the degree of each node, whereas for $k = 3$ it counts the number of triangles each node is involved in, multiplied by a constant factor. For $k = 4$, $d_k$ holds information about the degrees of 1-hop and 2-hop neighbors as well as the 4-th order cycles. Similar observations are derived by considering larger values of $k$. Graph adjacency diagonals are not only associated with $k$-hop neighbor degrees but also with motifs that are present in the graph. This observation becomes even more valuable if we consider the significance of subgraph mining in graph theory (Kuramochi & Karypis, 2001; Danisch et al., 2018). Our final observation is that the $k$-th order closed paths are associated with the absolute values of the adjacency eigenvectors $|u_n|$, whereas degrees are connected with $\left(u_n^T \mathbf{1}\right) u_n$. The following theorem characterizes the expressive power of GNNs with modules as in (10):

Theorem 5.2 Let $G, \hat{G}$ be nonisomorphic graphs. If Assumption 2.1 holds, there exists a GNN with diagonal modules as in (10) that produces distinct representations for $G, \hat{G}$.
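
The closed-path interpretation of diag(S^k) can be made concrete on a toy pair of graphs. The following sketch is an illustration under assumptions: the pair (one 6-cycle versus two disjoint triangles) and the helper names are chosen for the example and do not come from the paper's figures. Both graphs are 2-regular, so the quantities S^k 1 seen by a filter with all-one input coincide, while the closed-walk counts differ at k = 3 because only one of the graphs contains triangles.

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of the cycle graph on n >= 3 nodes."""
    S = np.zeros((n, n), dtype=np.int64)
    idx = np.arange(n)
    S[idx, (idx + 1) % n] = 1
    S[(idx + 1) % n, idx] = 1
    return S

def closed_walk_counts(S, D):
    """Columns d_k = diag(S^k), k = 0..D-1: closed walks of length k per node."""
    return np.stack(
        [np.diag(np.linalg.matrix_power(S, k)) for k in range(D)], axis=1
    )

def constant_input_diffusion(S, D):
    """Columns S^k 1, k = 0..D-1: what a filter propagates from the all-one input."""
    ones = np.ones(S.shape[0], dtype=np.int64)
    return np.stack(
        [np.linalg.matrix_power(S, k) @ ones for k in range(D)], axis=1
    )

S = cycle_adjacency(6)                                           # one hexagon
S_hat = np.kron(np.eye(2, dtype=np.int64), cycle_adjacency(3))   # two triangles

# Row 0 of each feature matrix (all rows are equal in these symmetric graphs)
print(constant_input_diffusion(S, 5)[0], constant_input_diffusion(S_hat, 5)[0])
print(closed_walk_counts(S, 5)[0], closed_walk_counts(S_hat, 5)[0])
```

In the triangle graph the k = 3 entry equals 2 for every node, i.e., one triangle counted once per orientation, which matches the "multiplied by a constant factor" remark above.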

6. DESIGNING POWERFUL GNN ARCHITECTURES

After analyzing GNNs with white inputs and introducing the GNN module in (10), it is time to build practical powerful architectures. The modules we employ to build the proposed GNN architecture are presented in Fig. 4. Regarding their functionality we provide the following result:

Proposition 6.1 A GNN designed with the diagonal modules of Fig. 4a (eq. (10)) in the input layer is equivalent to a standard GNN designed with the modules of Fig. 4b in the input layer, if the input to the modules of Fig. 4b (eq. (1)) is designed according to:

$$X = \left[\mathrm{diag}(S^0), \mathrm{diag}(S^1), \mathrm{diag}(S^2), \dots, \mathrm{diag}(S^{D-1})\right]. \tag{14}$$

The claim of Proposition 6.1 is fundamental and relates a standard GNN with white input to a standard GNN with countable input defined by (14). Specifically, combining Propositions 4.1 and 6.1 yields a direct connection between the three considered architectures: standard GNNs with white input and variance nonlinearity, GNNs with no input and diagonal operator, and standard GNNs with input as in (14). Guided by these findings we design the GNN architectures presented in Fig. 5. The architecture on the left uses one type of GNN block (type-2) and the input is designed by equation (14). Furthermore, it is a symmetric architecture and admits all the favorable properties of symmetric designs. On the other hand, the architecture on the right uses a combination of type-1 and type-2 GNN blocks, and designing an input is not necessary. Although the design is not symmetric, it offers a reduced number of trainable parameters and reuse of first-layer features, which has been observed to benefit convolutional architectures. The expressive power of the proposed architectures is demonstrated in the following theorem:

Theorem 6.1 Let $G, \hat{G}$ be nonisomorphic graphs with graph signals $X, \hat{X}$ designed according to (14). If Assumption 2.1 holds, then the proposed GNNs in Fig. 5 can tell the two graphs apart.
Corollary 6.2 follows from Theorem 6.1 and the fact that both $\mathrm{diag}(S^0) = \mathbf{1}$ and $\mathrm{diag}(S^2) = S\mathbf{1}$ are included in the proposed input $X$, defined in (14). Overall, our proposed analysis proves that standard GNNs ϕ(X; G, H) are more powerful than the WL algorithm for both countable and continuous inputs.
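
One way to see the connection between the diagonal module (10) and a standard module (1) fed with the input of (14) is the minimal sketch below. The 5-node graph and the helper names `design_input` and `diagonal_module` are hypothetical; the filter taps are the ones quoted for Table 1. With K = 1 and H_0 = h, the standard module reduces to ReLU(X h), which coincides with the diagonal module. The sketch also checks the two facts used for Corollary 6.2: the first column of X is the all-one vector and the third column is the degree vector S1.

```python
import numpy as np

def design_input(S, D):
    """Input X of (14): column k holds diag(S^k), k = 0, ..., D-1."""
    cols = [np.diag(np.linalg.matrix_power(S, k)) for k in range(D)]
    return np.stack(cols, axis=1).astype(float)

def diagonal_module(S, h):
    """Diagonal module (10) with ReLU: y = ReLU(sum_k h_k diag(S^k))."""
    y = sum(hk * np.diag(np.linalg.matrix_power(S, k)) for k, hk in enumerate(h))
    return np.maximum(y, 0.0)

# Hypothetical 5-node graph: a triangle (0, 1, 2) with a tail 2-3-4
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
S = np.zeros((5, 5), dtype=int)
for i, j in edges:
    S[i, j] = S[j, i] = 1

h = np.array([10.0, 1.0, -1 / 2, 1 / 3, -1 / 4, 1 / 5])  # taps quoted for Table 1
X = design_input(S, D=len(h))

# Standard module (1) with K = 1 and H_0 = h: Y = ReLU(S^0 X H_0) = ReLU(X h)
y_type2 = np.maximum(X @ h, 0.0)
y_diag = diagonal_module(S, h)

print(X[:, 0], X[:, 2], S.sum(axis=1))
print(y_type2, y_diag)
```

This only illustrates a single input-layer module; deeper layers of the two architectures in Fig. 5 are not modeled here.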

7. SIMULATIONS

In this section we test the effect of using anonymous all-one inputs vs anonymous random inputs on the expressivity of GNNs. The task of interest is graph classification. In particular, we use graph isomorphism and graph classification datasets and train the standard convolutional GNN in (1) and GIN (Xu et al., 2019). GIN initialized with x = 1 is denoted as GIN 1 and GIN with random input is denoted as GIN plus. For the standard GNN model we only test random inputs. To avoid practical issues associated with random inputs we use the equivalent model of section 6 instead, i.e., we initialize both the standard convolutional GNN and GIN according to equation (14).

7.1. THE CSL DATASET

Our first experiment involves the Circular Skip Link (CSL) dataset, which was introduced in (Murphy et al., 2019) to test the expressiveness of GNNs; it is the gold standard when it comes to benchmarking GNNs for isomorphism (Dwivedi et al., 2020). CSL is a symmetric graph dataset. It contains 150 4-regular graphs, where the edges form a cycle and contain skip-links between nodes. A schematic representation of the CSL graphs can be found in Appendix K. Each graph consists of 41 nodes and 164 edges and belongs to one of 10 classes. All the nodes have degree 4 and thus $x = \mathbf{1}$ is an eigenvector of every graph and orthogonal to all the remaining eigenvectors. As a result the degree vector is uninformative and so is any message passing operation of the degree. GNNs initialized with $x = \mathbf{1}$ and the WL algorithm fail to provide any essential information for this set of graphs and the classification task is completely random, as shown in Table 4.

The proposed GNN architectures, on the other hand, have no issue in dealing with this dataset. In particular, a single diagonal GNN module with parameters $(h_0, h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9) = (0, 1, -\frac{1}{2}, \frac{1}{3}, -\frac{1}{4}, \frac{1}{5}, -\frac{1}{6}, \frac{1}{7}, -\frac{1}{8}, \frac{1}{9})$ and $\sigma(\cdot)$ being the linear function is able to classify these graphs with 100% accuracy. To see this, we present in Table 2 the output $\mathbf{1}^T y$ for every class, where $y$ is defined in (10) with the aforementioned parameters. The output is the same for each graph in the same class but different for graphs that belong to different classes. Therefore, perfect classification accuracy is achieved by passing the GNN output to a simple linear classifier or even a linear assignment algorithm.

Next, we test the performance of the proposed architecture with standard social, chemical and bioinformatics graph classification datasets (Errica et al., 2019). The details of each dataset can be found in Table 3.
To perform the graph classification task, we train a GNN with 4 layers, each layer consisting of the same number of neurons. The input to each GNN is designed by equation (14) with K = 10 and we also pass the k-th degree vector. Apart from feeding the output of each layer to the next layer, we also apply a readout function that performs graph pooling. The graph pooling layer generates a global graph embedding from the node representations and passes it to a linear classifier. The nonlinearity is chosen to be the ReLU. An illustration of the architecture used, as well as a detailed description of the experiments, is presented in Appendix K. To test the performance of the anonymous architectures we divide each dataset into 50-50 training-testing splits and perform 10-fold cross validation. We measure the micro F1 and macro F1 score for each epoch and present the epoch with the best average result among the 10 folds. The mean and standard deviation of the testing results over 10 shuffles are presented in Table 4.

In Table 4 we observe that the proposed architecture and GIN plus markedly outperform GIN 1 in the REDDITBINARY dataset, and also show notable improvement in the REDDITMULTI dataset. GIN 1, on the other hand, has a 3% advantage in the IMDBBINARY dataset, whereas in the remaining datasets the performances of the competing algorithms are statistically similar. The latter can be explained by the fact that the vital classification components of these datasets are not orthogonal to x = 1, so GIN 1 does not undergo critical information loss. Overall, we conclude that properly designed GNNs, such as the proposed architecture and GIN plus, can not only demonstrate remarkable performance in graph classification tasks, but can also handle pathological datasets such as the CSL. This is an indication of the importance of the representational properties. However, generalization capability, data handling and optimization are equally important, and we do not study them in this paper.

8. CONCLUSION

In this paper we studied the expressive power of GNNs with spectral decomposition tools. We showed that, contrary to common belief, the WL algorithm is not the real limit and proved that anonymous GNNs can distinguish between any graphs with different eigenvalues. Furthermore, we explained the limitations of GNNs with all-one inputs and designed GNN architectures that overcome these limitations. Experiments with graph isomorphism and graph classification datasets demonstrated the validity of the proposed approach. With this work we move one step closer to understanding the properties of GNNs and analyzing their functionality.

A.1 GRAPH NEURAL NETWORKS (GNNS)

A graph convolution is defined as:

$$z = \sum_{k=0}^{K-1} h_k S^k x, \tag{15}$$

where $H(S) = \sum_{k=0}^{K-1} h_k S^k$ is a linear filter of length $K$ and $x, z \in \mathbb{R}^N$ are the input and output of the filter respectively. Let $S = U \Lambda U^T$ be the eigenvalue decomposition of $S$. Then:

$$z = \sum_{k=0}^{K-1} h_k U \Lambda^k U^T x \tag{16}$$

$$U^T z = \sum_{k=0}^{K-1} h_k \Lambda^k U^T x \tag{17}$$

$$\tilde{z} = \sum_{k=0}^{K-1} h_k \Lambda^k \tilde{x}, \tag{18}$$

where $\tilde{x} = U^T x$, $\tilde{z} = U^T z$ are the frequency representations of $x, z$ respectively. The frequency representation of the graph filter is $H(\Lambda) = \sum_{k=0}^{K-1} h_k \Lambda^k$ and can also be written as:

$$H(\lambda_i) = \sum_{k=0}^{K-1} h_k \lambda_i^k. \tag{19}$$

$H(\lambda_i)$ is a polynomial in $\lambda_i$ and $\tilde{z}_i = H(\lambda_i) \tilde{x}_i$. The simplest form of a Graph Neural Network (GNN) is an array of graph filters followed by point-wise nonlinearities. The $l$-th layer of the GNN is a graph perceptron, which is described by:

$$X^{(l+1)} = \sigma\left(\sum_{k=0}^{K-1} h_k^{(l)} S^k X^{(l)}\right). \tag{20}$$

Note that here we are using a recursive equation, whereas in the main paper we used $X$ for input and $Y$ for output, to make things simple. Common choices of $\sigma(\cdot)$ are the Rectified Linear Unit (ReLU) activation function, the Leaky ReLU or the hyperbolic tangent function.
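
The agreement between the vertex-domain filter and its frequency-domain form can be verified numerically. The following is a minimal sketch in which the random graph, the taps `h`, and the signal `x` are arbitrary example values; it applies the polynomial filter directly and then through the eigendecomposition, where the filter acts as a point-wise multiplication by H(λ_i).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical symmetric adjacency, filter taps and input signal
N, K = 6, 4
A = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
S = (A + A.T).astype(float)
h = rng.normal(size=K)
x = rng.normal(size=N)

# Vertex domain: z = sum_k h_k S^k x
z = sum(h[k] * np.linalg.matrix_power(S, k) @ x for k in range(K))

# Frequency domain: S = U diag(lam) U^T
lam, U = np.linalg.eigh(S)
H_lam = np.polyval(h[::-1], lam)   # H(lam_i) = sum_k h_k lam_i^k
x_tilde = U.T @ x                  # frequency representation of x
z_tilde = H_lam * x_tilde          # point-wise multiplication per eigenvalue
z_freq = U @ z_tilde               # back to the vertex domain

print(np.max(np.abs(z - z_freq)))
```

`np.polyval` expects the highest-degree coefficient first, hence the reversed tap vector.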

A.2 MULTIPLE FEATURE GNNS

As mentioned earlier, the nodes of the graph are usually associated with a graph signal, which is multidimensional, i.e., $D > 1$ and $X^{(l)}$ is a matrix. Although the architecture in (20) can also handle multidimensional graph signals, multiple feature GNNs are commonly used, which are described by the following recursion formula:

$$X^{(l+1)} = \sigma\left(\sum_{k=0}^{K-1} S^k X^{(l)} H_k^{(l)}\right), \tag{21}$$

where $H_k^{(l)} \in \mathbb{R}^{F \times G}$ represents a set of $F \times G$ graph filters. Compared to the architecture in (20), the MIMO GNN employs multiple filters instead of one, and the outputs of the filters are combined to produce a layer output $X^{(l+1)}$ that has feature dimension equal to $G$.
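
A single layer of the multiple-feature recursion can be sketched in a few lines. The function name `mimo_gnn_layer`, the ReLU choice, the tensor layout of `H` (shape K × F × G), and the random test data are assumptions made for this illustration. The sketch also checks permutation equivariance: relabeling the nodes of the graph and of the input permutes the rows of the output in the same way.

```python
import numpy as np

def mimo_gnn_layer(S, X, H):
    """One layer X_next = ReLU(sum_k S^k X H[k]); H has shape (K, F, G)."""
    K, F, G = H.shape
    Z = np.zeros((X.shape[0], G))
    SkX = np.asarray(X, dtype=float)   # S^0 X
    for k in range(K):
        Z += SkX @ H[k]
        SkX = S @ SkX                  # S^(k+1) X, without forming S^k explicitly
    return np.maximum(Z, 0.0)

rng = np.random.default_rng(3)
N, F, G, K = 5, 3, 2, 3
A = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
S = A + A.T                            # hypothetical symmetric adjacency
X = rng.normal(size=(N, F))
H = rng.normal(size=(K, F, G))

Y = mimo_gnn_layer(S, X, H)

# Relabel the nodes and apply the same layer
perm = rng.permutation(N)
Y_perm = mimo_gnn_layer(S[np.ix_(perm, perm)], X[perm], H)

print(Y.shape, np.allclose(Y_perm, Y[perm]))
```

Accumulating S^k X iteratively keeps the cost at K sparse-friendly matrix products instead of computing matrix powers.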

A.3 NOTATION

Our notation is summarized in Table 5.

B RELATION TO OTHER ARCHITECTURES

Table 5:
[…] $Y = \sigma(Z)$
$y$ ≜ vector output of a GNN module; $y = \sigma(z)$
$a$ ≜ scalar
$\mathbf{a}$ ≜ vector
$A$ ≜ matrix
$A^T$ ≜ transpose of matrix $A$
$A_k$ ≜ $A[k, :]^T$, $k$-th row of matrix $A$
$a_k$ ≜ $A[:, k]$, $k$-th column of matrix $A$
$U$ ≜ eigenvector matrix
$U[k, :]$ ≜ $k$-th row of $U$ (row vector)
$U[:, k]$ ≜ $k$-th column of $U$
$u_k$ ≜ $k$-th eigenvector, $k$-th column of $U$
$I$ ≜ identity matrix
$\mathbf{1}$ ≜ vector of ones
$0$ ≜ vector or matrix of zeros
$|\cdot|$ ≜ point-wise absolute value
$\binom{m}{n}$ ≜ binomial coefficient

GNNs have attracted significant attention and numerous architectures have been proposed. The first GNNs of (Scarselli et al., 2008b; Kipf & Welling, 2016; Battaglia et al., 2016; Defferrard et al., 2016) used simple convolutions in static data and graphs, whereas more sophisticated architectures utilize a variety of attention mechanisms (Hamilton et al., 2017; Veličković et al., 2018; Liu et al., 2021). Graph convolutional architectures have also been designed for time-varying graphs and signals. Some of them exploit both the graph and the time structure (Hajiramezanali et al., 2019; Wang et al., 2021; Hadou et al., 2021), while others employ recurrent architectures (Li et al., 2016; Seo et al., 2018; Nicolicioiu et al., 2019; Ruiz et al., 2020b).

It is often the case that GNNs are presented in the literature using different definitions. The GNN by (Kipf & Welling, 2016) for example is written as:

$$X^{(l+1)} = \sigma\left(D^{-1/2}(S + I)D^{-1/2} X^{(l)} H^{(l)}\right) = \sigma\left(D^{-1/2} S D^{-1/2} X^{(l)} H^{(l)} + D^{-1} X^{(l)} H^{(l)}\right), \tag{22}$$

where $S \in \{0, 1\}^{N \times N}$ represents the graph adjacency and $D$ is a diagonal matrix, with $D[i, i]$ being the degree of node $i$. The matrix $D^{-1/2}(S + I)D^{-1/2}$ is also a GSO $S'$ and the formula in (22) can be written as:

$$X^{(l+1)} = \sigma\left(S' X^{(l)} H^{(l)}\right), \tag{23}$$

which is a special case of the MIMO GNN in (21), for $K = 2$.
Another way that GNNs are represented in the literature is via the following equations:

$$A_v^{(l)} = \text{AGGREGATE}\left(\left\{X_u^{(l)} : u \in \mathcal{N}(v)\right\}\right) \tag{24}$$

$$B_v^{(l)} = \text{COMBINE}\left(X_v^{(l)}, A_v^{(l)}\right) \tag{25}$$

$$X_v^{(l+1)} = \sigma\left(H^{(l)} B_v^{(l)}\right) \tag{26}$$

where $X_v^{(l)}$ is the signal of node $v$ in layer $l$ and the $v$-th row of the feature matrix $X^{(l)}$, i.e.,

$$X^{(l)} = \begin{bmatrix} X_1^{(l)\,T} \\ \vdots \\ X_N^{(l)\,T} \end{bmatrix}.$$

Equivalently, $A_v^{(l)}$, $B_v^{(l)}$ are rows of matrices $A^{(l)}$, $B^{(l)}$ respectively and represent signals associated with node $v$. The majority of the architectures based on the equations (24)-(26) can be written as combinations of the GNN modules in (21). Different architectures employ different functions for AGGREGATE and COMBINE. Popular choices of AGGREGATE functions include the mean, the sum, pooling functions or LSTM functions. The COMBINE routine, on the other hand, usually utilizes the concatenation or summation function. The settings that are mainly used are summation for AGGREGATE and concatenation for COMBINE. This is due to the fact that summation-concatenation preserves more information compared to other options (Xu et al., 2019). It is then easy to see that:

$$A^{(l)} = S X^{(l)}$$

$$B^{(l)} = \left[A^{(l)}, X^{(l)}\right]$$

$$X^{(l+1)} = \sigma\left(B^{(l)} H^{(l)}\right) = \sigma\left(S X^{(l)} H_1^{(l)} + X^{(l)} H_0^{(l)}\right) = \sigma\left(\sum_{k=0}^{1} S^k X^{(l)} H_k^{(l)}\right), \tag{29}$$

where $S$ is the graph adjacency and $H^{(l)} = \begin{bmatrix} H_1^{(l)} \\ H_0^{(l)} \end{bmatrix}$. Therefore, the GNN defined in (29) is a special case of the GNN in (21), for $K = 2$.

Now consider the GNN defined in (29) that consists of $K$ layers, where $\sigma(\cdot)$ is the linear function for the hidden layers and a nonlinear activation function in the output layer, i.e.,

$$X^{(l+1)} = S X^{(l)} H^{(l)} + X^{(l)} H^{(l)} = (S + I) X^{(l)} H^{(l)}, \quad \text{for } l = 0, \ldots, K-2 \tag{30}$$

$$X^{(l+1)} = \sigma\left(S X^{(l)} H^{(l)} + X^{(l)} H^{(l)}\right), \quad \text{for } l = K-1$$

Then it holds that:

$$X^{(l+1)} = (S + I)^{l+1} X^{(0)} H^{(0)} \cdots H^{(l)}, \quad \text{for } l = 0, \ldots, K-2 \tag{32}$$

$$X^{(l+1)} = \sigma\left(S X^{(l)} H^{(l)} + X^{(l)} H^{(l)}\right), \quad \text{for } l = K-1$$

As a result:

$$X^{(K)} = \sigma\left((S + I)^K X^{(0)} H^{(0)} \cdots H^{(K-1)}\right) = \sigma\left(\sum_{l=0}^{K} S^l X^{(0)} H'_l\right),$$

with $H'_l = \binom{K}{l} H^{(0)} \cdots H^{(K-1)}$, which again corresponds to the GNN in (21). The last equality holds since $(S + I)^K = \sum_{l=0}^{K} \binom{K}{l} S^{K-l}$, where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.

Overall there is a direct connection between the GNNs defined by the equations (24)-(26) and the GNNs defined by (21). Furthermore, appropriate selection of GSO and nonlinearities in (21) with respect to the AGGREGATE and COMBINE functions in (24)-(26) makes the described architectures equivalent.
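The two identities used in this appendix can be verified numerically: that sum-AGGREGATE with concatenation-COMBINE equals the $K = 2$ convolutional layer, and that $(S + I)^K$ expands into powers of $S$ with binomial weights. The sketch below uses a random graph and random weights, all of which are illustrative assumptions.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
N, F, G = 6, 3, 2                                  # illustrative sizes (assumption)
A = np.triu(rng.integers(0, 2, size=(N, N)), 1)
S = (A + A.T).astype(float)                        # symmetric adjacency
X = rng.standard_normal((N, F))
H1 = rng.standard_normal((F, G))                   # acts on aggregated neighbours
H0 = rng.standard_normal((F, G))                   # acts on the node's own signal
relu = lambda t: np.maximum(t, 0.0)

# Message-passing form: sum over neighbours, concatenate, linear map, sigma.
aggregated = np.stack([X[S[v] > 0].sum(axis=0) for v in range(N)])
combined = np.concatenate([aggregated, X], axis=1)
out_message_passing = relu(combined @ np.vstack([H1, H0]))

# Convolutional form with K = 2 taps.
out_convolution = relu(S @ X @ H1 + X @ H0)
print(np.allclose(out_message_passing, out_convolution))

# Binomial expansion of (S + I)^K, valid because S and I commute.
K = 4
lhs = np.linalg.matrix_power(S + np.eye(N), K)
rhs = sum(comb(K, l) * np.linalg.matrix_power(S, K - l) for l in range(K + 1))
print(np.allclose(lhs, rhs))
```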

C ASSOCIATING THE WL ALGORITHM WITH THE SPECTRAL DECOMPOSITION OF A GRAPH

In section 3 we observed a connection between the limitations of the WL algorithm and graphs with eigenvectors orthogonal to $\mathbf{1}$. The WL algorithm is initialized with either $x = \mathbf{1}$ or $x = S\mathbf{1}$ and in the remaining iterations this information is propagated (diffused) through the nodes. In particular, at iteration $k$ of the WL algorithm, node $i$ receives a multiset defined as:

$$T_i^k = \left\{ x_j \;\middle|\; x_j = \sum_n \lambda_n \left(u_n^T \mathbf{1}\right) u_n(j),\ j \in \mathcal{N}_i^k \right\},$$

where $\mathcal{N}_i^k$ denotes the $k$-th neighborhood of node $i$. If there is a one-to-one correspondence between $T_i^k$ and $S^k \mathbf{1}(i)$ for all nodes $i$, then the WL algorithm can be analyzed by building and comparing the following features for each node:

$$X = \left[S\mathbf{1}, S^2\mathbf{1}, \ldots, S^K\mathbf{1}\right] \tag{37}$$

In other words, if the summation operation is a proper hash function for a specific graph, the WL algorithm is equivalent to the feature generation of (37). In that case, we can use the spectral decomposition of $S$ and the analysis of section 3 to characterize the limitations of the WL algorithm. Then the WL algorithm admits the same limitations as the GNNs with $x = \mathbf{1}$ input and it omits the information associated with eigenvectors that are orthogonal to $\mathbf{1}$.
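A small numerical illustration of this limitation is sketched below. The pair of graphs (a 6-cycle and two disjoint triangles) is a standard example chosen here as an assumption; it is not taken from the paper's figures. Both graphs are 2-regular, so the features in (37) coincide, although the spectra differ and the differing eigenvalue of the cycle has eigenvectors orthogonal to the all-ones vector.

```python
import numpy as np

def cycle(n):
    """Adjacency matrix of the cycle graph on n nodes."""
    S = np.zeros((n, n))
    for i in range(n):
        S[i, (i + 1) % n] = S[(i + 1) % n, i] = 1.0
    return S

def wl_like_features(S, K):
    """Columns S^1 1, ..., S^K 1, as in (37)."""
    cols, v = [], np.ones(S.shape[0])
    for _ in range(K):
        v = S @ v
        cols.append(v)
    return np.stack(cols, axis=1)

S = cycle(6)                               # one 6-cycle
S_hat = np.kron(np.eye(2), cycle(3))       # two disjoint triangles

# Identical all-ones features for both graphs.
print(np.allclose(wl_like_features(S, 4), wl_like_features(S_hat, 4)))

# The spectra differ: eigenvalue 1 belongs to the 6-cycle only.
lam, U = np.linalg.eigh(S)
lam_hat = np.linalg.eigvalsh(S_hat)
print(np.round(lam, 6), np.round(lam_hat, 6))

# Eigenvectors of the 6-cycle for eigenvalue 1 are orthogonal to the ones vector.
V1 = U[:, np.isclose(lam, 1.0)]
print(np.allclose(V1.T @ np.ones(6), 0.0))
```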

D THE STOCHASTIC GNN MODULE

In this section we elaborate more on the proposed stochastic GNN module in Fig. 3a. In order to implement it, we can either use the equivalent model in Fig. 3b or we can design an empirical variance model. In practice, the input to the empirical model is a matrix $X \in \mathbb{R}^{N \times M}$ where each element is independently drawn from a Gaussian distribution with zero mean and unit variance and $M$ is the total number of samples. The output of the filter is $Z = \sum_k h_k S^k X \in \mathbb{R}^{N \times M}$ and the maximum likelihood estimate of the empirical covariance of $Z$ takes the form:

$$Q = \frac{1}{M} Z Z^T. \tag{38}$$

Then the GNN output can be written as:

$$y = \text{diag}(Q) = \frac{1}{M}\,\text{diag}\left(Z Z^T\right) = \frac{1}{M} Z^2 \mathbf{1}, \tag{39}$$

where $Z^2$ denotes the point-wise square of $Z$. The GNN module of the empirical variance model is illustrated in Fig. 6.

Now we discuss the effect of the square nonlinearity in the representation of nodes, introduced by the variance operator. As mentioned in section 4, the nonlinearity added by the variance computation allows the proposed GNN to gather information from farther neighborhoods compared to a linear filter or the WL algorithm. To make things more concrete, consider the graph in Fig. 6 and let $K - 1 = 2$, which corresponds to running the WL algorithm for 2 iterations and graph filters that process $S$ and $S^2$. In Table 7 we present the representations produced by the stochastic GNN and the WL algorithm for each node of the graph in Fig. 6. In particular, we present two iterations of the WL algorithm and the value that $y$ in (39) converges to, for filter values $(h_0, h_1, h_2) = (3, 5, 7)$. We observe that the WL algorithm represents nodes A and D with the same value, whereas the output of the stochastic GNN is capable of separating these two nodes. Overall, the nonlinearity in the variance operator allows acquiring global information, which can be vital in the resulting node representation of the graph.
In the main core of the paper we presented 3 almost equivalent GNN modules: the stochastic module with random Gaussian input, the diagonal module with no input and the standard GNN module with input designed by (14). On the basis of the requirements and constraints of each task, we can employ any of them in a GNN architecture. For instance, in applications where computing the adjacency power diagonals is computationally prohibitive, we can use the empirical module in Fig. 6. The drawback is that for systems with high variance, a significant number of samples will be required for the output to converge. This can be mitigated by computing the output in (39) recursively. To be more precise, let $z_m$ be the $m$-th column of the filter output $Z$ and $Q^{(M)}$, $y^{(M)}$ be the empirical covariance and output after obtaining $M$ samples. Then the recursive equations can be written as:

$$Q^{(M)} = \frac{1}{M}\sum_{m=1}^{M} z_m z_m^T = \frac{1}{M}\sum_{m=1}^{M-1} z_m z_m^T + \frac{1}{M} z_M z_M^T = \frac{M-1}{M} Q^{(M-1)} + \frac{1}{M} z_M z_M^T \tag{40}$$

$$y^{(M)} = \text{diag}\left(Q^{(M)}\right) = \frac{M-1}{M}\,\text{diag}\left(Q^{(M-1)}\right) + \frac{1}{M}\,\text{diag}\left(z_M z_M^T\right) = \frac{M-1}{M}\, y^{(M-1)} + \frac{1}{M} |z_M|^2 \tag{41}$$

Therefore, using $y^{(M)} = \frac{M-1}{M}\, y^{(M-1)} + \frac{1}{M} |z_M|^2$ allows for online computations and reduces the memory requirements.
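The recursion in (41) can be sketched as a running average of squared filter outputs. In the example below the 4-node path graph, the sample count `M`, and the random seed are illustrative assumptions; the filter values (3, 5, 7) are the ones quoted in the text. The samples are precomputed here only so that the recursive result can be compared against the batch estimate in (39); in a streaming setting each `z_m` would be generated on the fly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative graph (assumption): path graph on 4 nodes.
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h = (3.0, 5.0, 7.0)                                # filter values from the text
H = sum(hk * np.linalg.matrix_power(S, k) for k, hk in enumerate(h))

M = 5000                                           # number of Gaussian samples
X = rng.standard_normal((S.shape[0], M))           # white input, unit variance
Z = H @ X                                          # filter output, one column per sample

# Batch estimate, eq. (39): y = (1/M) * (Z squared point-wise) @ 1.
y_batch = (Z ** 2).mean(axis=1)

# Recursive estimate, eq. (41): only the current y and one sample are stored.
y_rec = np.zeros(S.shape[0])
for m in range(1, M + 1):
    z_m = Z[:, m - 1]
    y_rec = (m - 1) / m * y_rec + (z_m ** 2) / m

# Value the estimate converges to: diag(H H^T), the variance of each output entry.
y_limit = np.diag(H @ H.T)
print(np.allclose(y_rec, y_batch))
print(np.abs(y_rec - y_limit) / y_limit)
```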

E PROOF OF THEOREM 2.2

To prove Theorem 2.2, consider the GNN module in (1), where $H_k$ is a scalar, that is,

$$Y = \sigma\left(\sum_{k=0}^{K-1} h_k S^k X\right) \tag{42}$$

E.1 CASE 1: THERE DOES NOT EXIST PERMUTATION MATRIX $\Pi$ SUCH THAT $X = \Pi \hat{X}$.

Consider a 1-layer GNN with 2 neurons defined by $h_0 = 1$, $h_i = 0$, $i \neq 0$ and $h_0 = -1$, $h_i = 0$, $i \neq 0$, i.e.,

$$Y_1 = \sigma(X), \quad Y_2 = \sigma(-X) \tag{43}$$

Combining the outputs of the 2 neurons with weights $+1$ and $-1$ to produce the final GNN output yields $Y = Y_1 - Y_2 = X$ when $\sigma(\cdot) = \text{ReLU}(\cdot)$, since $\text{ReLU}(t) - \text{ReLU}(-t) = t$. As a result, the output of the GNN is the graph signal and since there does not exist a permutation matrix $\Pi$ such that $X = \Pi \hat{X}$, this GNN decides that $G$ and $\hat{G}$ are different.

E.2 CASE 2: THERE EXISTS $\lambda \in \mathcal{S}$, SUCH THAT $\lambda \notin \hat{\mathcal{S}}$ AND $X^T V_\lambda \neq 0$.

Let $\mathcal{S} = \{\lambda_1, \ldots, \lambda_p\}$ be the set containing the unique (non-repeated) eigenvalues of $S$ and $\hat{\mathcal{S}} = \{\hat{\lambda}_1, \ldots, \hat{\lambda}_r\}$ be the set containing the unique eigenvalues of $\hat{S}$. Note that the eigenvalues of $S$, $\hat{S}$ are not required to be distinct. Also, let $\{\mu_1, \ldots, \mu_q\}$ be the set of all distinct eigenvalues of $S$ and $\hat{S}$, i.e., $\mu_i \in \mathcal{S} \cup \hat{\mathcal{S}}$ and $\mu_i \neq \mu_j$, $\forall\, i \neq j$. Suppose that $S$, $\hat{S}$ have at least one different eigenvalue, i.e., there exists $\lambda \in \mathcal{S}$ such that $\lambda \notin \hat{\mathcal{S}}$. Recall from Appendix A that a graph filter can be represented in the frequency domain by:

$$H(\lambda_i) = \sum_{k=0}^{K-1} h_k \lambda_i^k.$$

Then:

$$\begin{bmatrix} H(\mu_1) \\ H(\mu_2) \\ \vdots \\ H(\mu_q) \end{bmatrix} = \begin{bmatrix} 1 & \mu_1 & \mu_1^2 & \cdots & \mu_1^{K-1} \\ 1 & \mu_2 & \mu_2^2 & \cdots & \mu_2^{K-1} \\ \vdots & & & & \vdots \\ 1 & \mu_q & \mu_q^2 & \cdots & \mu_q^{K-1} \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{K-1} \end{bmatrix} = W h \tag{45}$$

$W$ is a Vandermonde matrix and when $K = q$ the determinant of $W$ takes the form:

$$\det(W) = \prod_{1 \leq i < j \leq q} (\mu_j - \mu_i)$$

Since the values $\mu_i$ are distinct, $W$ has full column rank and there exists a graph filter $H(\cdot)$ with unique parameters $h$ that passes only the $\lambda$ eigenvalue, i.e.,

$$H(\mu_i) = \begin{cases} 1, & \text{if } \mu_i = \lambda \\ 0, & \text{if } \mu_i \neq \lambda \end{cases} \tag{47}$$

Under this parametrization, the filter $H(\cdot)$ takes the form $H(S) = V_\lambda V_\lambda^T$, where $V_\lambda$ is the eigenspace (orthogonal space of the eigenvectors) corresponding to $\lambda$, and $H(\hat{S}) = 0$. Then the output of the GNN, for the two graphs, takes the form:

$$Y = \sigma\left(H(S) X\right) = \sigma\left(V_\lambda V_\lambda^T X\right) \tag{48}$$

$$\hat{Y} = \sigma\left(H(\hat{S}) \hat{X}\right) = 0$$

Under the assumption that $X^T V_\lambda \neq 0$, we also have $V_\lambda V_\lambda^T X \neq 0$. As a result $\sigma\left(V_\lambda V_\lambda^T X\right) \neq 0$, there does not exist a permutation $\Pi$ such that $Y = \Pi \hat{Y}$ and the proposed GNN decides that the two graphs are different. Note that $\sigma\left(V_\lambda V_\lambda^T X\right) \neq 0$ always holds when, for example, leaky ReLU is used, which allows both positive and negative values to pass. In the case where $\sigma(\cdot) = \text{ReLU}(\cdot)$ the proof is still valid as long as there is at least one positive value in $V_\lambda V_\lambda^T X$.
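The filter design step of this proof can be sketched numerically by solving the Vandermonde system in (45) for the target response in (47). The graphs below (a 6-cycle and two disjoint triangles), the choice $\lambda = 1$, and the input signal are assumptions made for illustration; the helper names `cycle` and `poly_filter` are hypothetical.

```python
import numpy as np

def cycle(n):
    """Adjacency matrix of the cycle graph on n nodes."""
    S = np.zeros((n, n))
    for i in range(n):
        S[i, (i + 1) % n] = S[(i + 1) % n, i] = 1.0
    return S

def poly_filter(S, h):
    """H(S) = sum_k h[k] * S^k."""
    return sum(hk * np.linalg.matrix_power(S, k) for k, hk in enumerate(h))

S = cycle(6)                               # eigenvalues {-2, -1, 1, 2}
S_hat = np.kron(np.eye(2), cycle(3))       # eigenvalues {-1, 2}
lam_target = 1.0                           # eigenvalue of S that S_hat lacks

# Distinct eigenvalues mu_1..mu_q of both graphs (rounded to merge numerical copies).
all_eigs = np.concatenate([np.linalg.eigvalsh(S), np.linalg.eigvalsh(S_hat)])
mu = np.unique(np.round(all_eigs, 8))

# Vandermonde system W h = target with K = q taps, as in (45) and (47).
W = np.vander(mu, N=len(mu), increasing=True)
target = np.isclose(mu, lam_target).astype(float)
h = np.linalg.solve(W, target)

H_S, H_S_hat = poly_filter(S, h), poly_filter(S_hat, h)

# H(S) equals the projector V V^T onto the eigenspace of lam_target; H(S_hat) vanishes.
lam, U = np.linalg.eigh(S)
V = U[:, np.isclose(lam, lam_target)]
print(np.allclose(H_S, V @ V.T), np.allclose(H_S_hat, 0.0))

# With an input that is not orthogonal to V, the two GNN outputs differ.
x = np.eye(6)[:, 0]
Y = np.maximum(H_S @ x, 0.0)
Y_hat = np.maximum(H_S_hat @ x, 0.0)
print(Y.max() > 0, np.allclose(Y_hat, 0.0))
```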
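In case $V_\lambda V_\lambda^T X \leq 0$ we can without loss of generality consider the filter:

$$H(\mu_i) = \begin{cases} -1, & \text{if } \mu_i = \lambda \\ 0, & \text{if } \mu_i \neq \lambda \end{cases} \tag{50}$$

that results in $\sigma\left(-V_\lambda V_\lambda^T X\right) \neq 0$.

E.3 CASE 3: THERE EXISTS $\lambda \in \mathcal{S}, \hat{\mathcal{S}}$, SUCH THAT $m_\lambda \neq \hat{m}_\lambda$ AND $X^T \left(V_\lambda \oplus \hat{V}_\lambda\right) \neq 0$.

Let $\mathcal{S} = \{\lambda_1, \ldots, \lambda_p\}$ be the set containing the unique (non-repeated) eigenvalues of $S$ with multiplicities $\{m_{\lambda_1}, \ldots, m_{\lambda_p}\}$ and $\hat{\mathcal{S}} = \{\hat{\lambda}_1, \ldots, \hat{\lambda}_r\}$ be the set containing the unique eigenvalues of $\hat{S}$ with multiplicities $\{\hat{m}_{\hat{\lambda}_1}, \ldots, \hat{m}_{\hat{\lambda}_r}\}$. Note that the eigenvalues of $S$, $\hat{S}$ are not required to be distinct. Also, let $\{\mu_1, \ldots, \mu_q\}$ be the set of all distinct eigenvalues of $S$ and $\hat{S}$, i.e., $\mu_i \in \mathcal{S} \cup \hat{\mathcal{S}}$ and $\mu_i \neq \mu_j$, $\forall\, i \neq j$. Suppose that $S$, $\hat{S}$ have at least one common eigenvalue but with different multiplicity, i.e., there exists $\lambda \in \mathcal{S}, \hat{\mathcal{S}}$, such that $m_\lambda \neq \hat{m}_\lambda$. Under the parametrization of (47), $H(S) = V_\lambda V_\lambda^T$, where $V_\lambda \in \mathbb{R}^{N \times m_\lambda}$ is the eigenspace of $S$ corresponding to $\lambda$, and $H(\hat{S}) = \hat{V}_\lambda \hat{V}_\lambda^T$, where $\hat{V}_\lambda \in \mathbb{R}^{N \times \hat{m}_\lambda}$ is the eigenspace of $\hat{S}$ […]

[…] where $V_\lambda[:, i]$ is the $i$-th column of $V_\lambda$. Since $V_\lambda \neq 0$ by definition, there does not exist a permutation $\Pi$ such that $y = \Pi \hat{y}$ and the proposed GNN can tell the two graphs apart.

Next we study the case where $\lambda \in \mathcal{S}, \hat{\mathcal{S}}$, but $m_\lambda \neq \hat{m}_\lambda$. Then $H(S) = V_\lambda V_\lambda^T$ and $H(\hat{S}) = \hat{V}_\lambda \hat{V}_\lambda^T$, where $\hat{V}_\lambda \in \mathbb{R}^{N \times \hat{m}_\lambda}$ is the eigenspace of $\hat{S}$ corresponding to $\lambda$. The output $y$ of (57), for the two graphs, takes the form:

$$y = \text{diag}\left(H(S)\right) = \sum_{i=1}^{m_\lambda} |V_\lambda[:, i]|^2 \tag{61}$$

$$\hat{y} = \text{diag}\left(H(\hat{S})\right) = \sum_{i=1}^{\hat{m}_\lambda} |\hat{V}_\lambda[:, i]|^2 \tag{62}$$

We observe that:

$$\mathbf{1}^T y = \mathbf{1}^T \text{diag}\left(H(S)\right) = \text{Trace}\left(V_\lambda V_\lambda^T\right) = m_\lambda \tag{63}$$

$$\mathbf{1}^T \hat{y} = \mathbf{1}^T \text{diag}\left(H(\hat{S})\right) = \text{Trace}\left(\hat{V}_\lambda \hat{V}_\lambda^T\right) = \hat{m}_\lambda \tag{64}$$

Since $m_\lambda \neq \hat{m}_\lambda$, there is no permutation $\Pi$ such that $y = \Pi \hat{y}$ and the proposed GNN can tell the two graphs apart. This concludes the proof for Theorem 5.2. Using Proposition 6.1 we prove the equivalence of Theorems 5.2 and 6.1 and therefore the proof is the same. To prove Theorem 4.1 we need one more extra step.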
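In particular, we plug the filter, with parametrization as in (58), in equation (8), for $S$, $\hat{S}$, i.e.,

$$\text{cov}[z; S] = \sum_{k=0}^{K-1} h_k S^k \sum_{m=0}^{K-1} h_m S^m = V_\lambda V_\lambda^T V_\lambda V_\lambda^T = V_\lambda V_\lambda^T \tag{65}$$

$$\text{cov}[z; \hat{S}] = \sum_{k=0}^{K-1} h_k \hat{S}^k \sum_{m=0}^{K-1} h_m \hat{S}^m = \hat{V}_\lambda \hat{V}_\lambda^T \hat{V}_\lambda \hat{V}_\lambda^T = \hat{V}_\lambda \hat{V}_\lambda^T, \tag{66}$$

where the last equality in (65) holds, since $V_\lambda$, $\hat{V}_\lambda$ are orthogonal. Then the output $y$ of (9), for the two graphs, can be written as:

$$y = \text{var}[z; S] = \text{diag}\left(\text{cov}[z; S]\right) = |V_\lambda[:, 1]|^2 + \cdots + |V_\lambda[:, m_\lambda]|^2 = \sum_{i=1}^{m_\lambda} |V_\lambda[:, i]|^2 \tag{67}$$

$$\hat{y} = \text{var}[z; \hat{S}] = \text{diag}\left(\text{cov}[z; \hat{S}]\right) = |\hat{V}_\lambda[:, 1]|^2 + \cdots + |\hat{V}_\lambda[:, \hat{m}_\lambda]|^2 = \sum_{i=1}^{\hat{m}_\lambda} |\hat{V}_\lambda[:, i]|^2 \tag{68}$$

If $\lambda \in \mathcal{S}$ but $\lambda \notin \hat{\mathcal{S}}$, $\hat{y} = 0$. If $\lambda \in \mathcal{S}, \hat{\mathcal{S}}$, but $m_\lambda \neq \hat{m}_\lambda$, $\mathbf{1}^T y \neq \mathbf{1}^T \hat{y}$. In any case, there does not exist a permutation $\Pi$ such that $y = \Pi \hat{y}$ and the proposed stochastic GNN can separate the two graphs.

G PROOF OF PROPOSITION 6.1

The output of the type-1 module can be cast as:

$$y = \sigma\left(\sum_{k=0}^{K-1} h_k\, \text{diag}\left(S^k\right)\right) = \sigma(X h),$$

when $X$ is designed as in (14) and $h = \begin{bmatrix} h_0 \\ \vdots \\ h_{K-1} \end{bmatrix}$ is the vector of filter parameters. The same output can be produced by the type-2 module when $H_k$ is a vector and $K = 1$. On the other hand, a set of $K$ type-1 modules in the input layer can produce the $X$ in (14). To see this, consider the following type-1 GNN modules:

$$y_i = \sigma\left(\sum_{k=0}^{K-1} h_k^{(i)}\, \text{diag}\left(S^k\right)\right), \quad i = 0, \ldots, K-1,$$

where

$$h_k^{(i)} = \begin{cases} 1, & \text{if } i = k \\ 0, & \text{if } i \neq k \end{cases}$$

Concatenating the outputs $y_i$ into $W = [y_0, \ldots, y_{K-1}]$ results in the $X$ in (14), which we can apply to a type-2 module and produce the same output as a type-2 GNN module with input as in (14). □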
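The diagonal (type-1) module can be illustrated with a short sketch. It assumes, based on the expression $y = \sigma(Xh)$ above, that the input in (14) stacks the columns $\text{diag}(S^k)$ for $k = 0, \ldots, K-1$; entry $i$ of $\text{diag}(S^k)$ counts the closed walks of length $k$ that start and end at node $i$. The graph pair and the filter taps are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

def cycle(n):
    """Adjacency matrix of the cycle graph on n nodes."""
    S = np.zeros((n, n))
    for i in range(n):
        S[i, (i + 1) % n] = S[(i + 1) % n, i] = 1.0
    return S

def closed_walk_features(S, K):
    """Columns diag(S^0), ..., diag(S^(K-1)); assumed form of the input in (14)."""
    cols, P = [], np.eye(S.shape[0])
    for _ in range(K):
        cols.append(np.diag(P).copy())
        P = P @ S
    return np.stack(cols, axis=1)

S = cycle(6)                               # one 6-cycle: no triangles
S_hat = np.kron(np.eye(2), cycle(3))       # two disjoint triangles

K = 4
X = closed_walk_features(S, K)
X_hat = closed_walk_features(S_hat, K)

h = np.ones(K)                             # arbitrary filter taps (assumption)
y = np.maximum(X @ h, 0.0)                 # type-1 module output, sigma = ReLU
y_hat = np.maximum(X_hat @ h, 0.0)

print(X[0], X_hat[0])                      # closed walks of length 3 differ
print(np.allclose(np.sort(y), np.sort(y_hat)))   # False: no permutation matches them
```

Both graphs are 2-regular, so an all-ones input produces identical features, while the closed-walk counts of length 3 already separate them.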

H NONISOMORPHIC GRAPHS WITH THE SAME SET OF EIGENVALUES

In the core of this paper, we discuss the ability of GNNs to distinguish between nonisomorphic graphs that have different eigenvalues. This analysis covers the majority of real graphs, since real graphs almost never share the same eigenvalues. However, there exist interesting cases of graphs with the same set of eigenvalues that GNNs can also distinguish. In this section, we study these cases and provide interesting results.

H.1 GRAPHS WITH THE SAME DISTINCT EIGENVALUES

We consider the case where $S$, $\hat{S}$ have distinct eigenvalues which are the same, i.e., $\Lambda = \hat{\Lambda}$. Formally:

Assumption H.1 $S$, $\hat{S}$ have the same distinct eigenvalues, i.e., $\mathcal{S} \subseteq \hat{\mathcal{S}}$ and $\hat{\mathcal{S}} \subseteq \mathcal{S}$, with $\lambda_i \neq \lambda_j$ for all $i \neq j$.

Lemma H.2 characterizes nonisomorphic graphs with distinct eigenvalues.

Lemma H.2 When $S$, $\hat{S}$ have the same distinct eigenvalues, $G$, $\hat{G}$ are nonisomorphic if and only if there is no permutation matrix $\Pi$ and diagonal $\pm 1$ matrix $D$ such that:

$$U = \Pi \hat{U} D$$

Proof: Let $S = U \Lambda U^T$, $\hat{S} = \hat{U} \hat{\Lambda} \hat{U}^T$. Since $S$, $\hat{S}$ have the same distinct eigenvalues, we have $\Lambda = \hat{\Lambda}$. To prove the 'forward' statement assume that $G$, $\hat{G}$ are nonisomorphic, i.e., there does not exist a permutation matrix $\Pi$ such that $S = \Pi \hat{S} \Pi^T$. If there exist a permutation matrix $\Pi$ and a $\pm 1$ diagonal matrix $D$ such that $U = \Pi \hat{U} D$, then:

$$S = U \Lambda U^T = \Pi \hat{U} D \Lambda D \hat{U}^T \Pi^T = \Pi \hat{U} \Lambda \hat{U}^T \Pi^T = \Pi \hat{S} \Pi^T,$$

where we used $D \Lambda D = \Lambda D^2 = \Lambda$ for diagonal matrices. By contradiction, when $S$, $\hat{S}$ have the same distinct eigenvalues, $G$, $\hat{G}$ are nonisomorphic only if there do not exist a permutation matrix $\Pi$ and a diagonal $\pm 1$ matrix $D$ such that $U = \Pi \hat{U} D$. To prove the 'backward' statement assume that there do not exist a permutation matrix $\Pi$ and a diagonal $\pm 1$ matrix $D$ such that $U = \Pi \hat{U} D$. If $G$, $\hat{G}$ are isomorphic, i.e., there exists a permutation matrix $\Pi$ such that $S = \Pi \hat{S} \Pi^T$, then:

$$U \Lambda U^T = \Pi \hat{U} \Lambda \hat{U}^T \Pi^T,$$

which implies that $u_n = \pm \Pi \hat{u}_n$ for all $n$, where $u_n$, $\hat{u}_n$ refer to the columns of $U$, $\hat{U}$ respectively. As a result, $U = \Pi \hat{U} D$ and by contradiction we prove the 'backward' statement, which concludes the proof. □

In a nutshell, Lemma H.2 states that in order for $G$, $\hat{G}$ to be nonisomorphic, while Assumption H.1 holds, the two graphs need to admit different eigenvectors that correspond to the same eigenvalues. As a side note, we mention that $S$, $\hat{S}$ can still span the same columnspace, under row permutation. However, the power on each eigendirection has to be different for them to be nonisomorphic. We can now extend the results of Theorem 2.2 to the following:

Theorem H.3 Let $G$, $\hat{G}$ be nonisomorphic graphs with graph signals $X$, $\hat{X}$.
There exists a GNN that tells $G$ and $\hat{G}$ apart if:
1. There does not exist a permutation matrix $\Pi$ such that $X = \Pi \hat{X}$, or
2. Assumption 2.1 holds and $V_\lambda^T X \neq 0$, or
3. Assumption H.1 holds and $X^T u_n \neq 0$ for all eigenvectors $u_n$ or $\hat{X}^T \hat{u}_n \neq 0$ for all eigenvectors $\hat{u}_n$.

Proof: The proof for cases 1 and 2 can be found in Appendix E. Case 3 includes Assumption H.1, i.e., both $S$, $\hat{S}$ have $N$ distinct eigenvalues, where $N$ is the number of nodes in each graph, and also $S$, $\hat{S}$ share the same eigenvalues. To prove this last part of Theorem 2.2 we consider a 1-layer GNN with $N$ neurons. Each neuron consists of a graph filter that isolates one eigenvalue and sets it to one, as in Appendix E. Then, each neuron is described by the following set of equations:

$$Y_n = \sigma\left(H_n(S) X\right), \quad n = 1, \ldots, N$$

$$H_n(\lambda_i) = \begin{cases} 1, & \text{if } i = n \\ 0, & \text{if } i \neq n \end{cases}, \quad n = 1, \ldots, N$$

For the rest of the proof, we will assume that $\sigma(\cdot)$ is the linear function. This is without loss of generality since if we double the number of neurons in the layer and set $\sigma(\cdot) = \text{ReLU}(\cdot)$ we can produce the same output as the linear function by using the same trick as in Appendix E.1. In particular, $N$ of the graph filters will follow the equations in (73) and the remaining $N$ filters will follow the same equation with $-1$ instead, as in (50). Then for each eigenvalue we have a pair of filters, one with $+1$ and one with $-1$ in the filter equations. Subtracting the output of the $-1$ neuron from the output of the $+1$ neuron in each pair will produce an output that is the same as if $\sigma(\cdot)$ was the linear function. The output of the GNN for the two graphs takes the form

$$Y_n = H_n(S) X = u_n u_n^T X, \quad n = 1, \ldots, N$$

$$\hat{Y}_n = H_n(\hat{S}) \hat{X} = \hat{u}_n \hat{u}_n^T \hat{X}, \quad n = 1, \ldots, N$$

$$Y_n = u_n \left[u_n^T x_1, \ldots, u_n^T x_D\right], \quad n = 1, \ldots, N$$

$$\hat{Y}_n = \hat{u}_n \left[\hat{u}_n^T \hat{x}_1, \ldots, \hat{u}_n^T \hat{x}_D\right], \quad n = 1, \ldots, N$$

Now we assume that $X^T u_n \neq 0$ for all eigenvectors $u_n$, $n = 1, \ldots, N$. As a result, there exists at least one column in each $Y_n$ that is not equal to the zero column.
We can then collect one nonzero column from each $Y_n$ and form a matrix $M$ as:

$$M = \left[u_1 u_1^T x_i, \ldots, u_N u_N^T x_j\right] = \left[u_1 \alpha_1, \ldots, u_N \alpha_N\right] = U \begin{bmatrix} \alpha_1 & 0 & \cdots & 0 \\ 0 & \alpha_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \alpha_N \end{bmatrix} = U A,$$

where $x_i$, $x_j$ are columns of $X$ such that $u_1^T x_i \neq 0$, $u_N^T x_j \neq 0$, $A$ is a diagonal matrix and $\alpha_n \neq 0$ for all $n$. If we also collect the corresponding columns for each $\hat{Y}_n$ we can form: […] This implies that there does not exist a permutation matrix $\Pi$ such that $M = \Pi \hat{M}$. To complete the proof, we consider the output of the considered GNN to be the concatenation of $Y_n$, $n = 1, \ldots, N$. In particular, the outputs for $G$, $\hat{G}$ are:

$$Y = \left[Y_1, Y_2, \ldots, Y_N\right], \quad \hat{Y} = \left[\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_N\right].$$

The columns of $M$, $\hat{M}$ are also columns of $Y$, $\hat{Y}$. Since there does not exist a permutation matrix $\Pi$ such that $M = \Pi \hat{M}$, there does not exist $\Pi$ such that $Y = \Pi \hat{Y}$ and the GNN decides that $G$, $\hat{G}$ are nonisomorphic. Note that the same analysis is applicable if we assume that $\hat{X}^T \hat{u}_n \neq 0$ for all eigenvectors $\hat{u}_n$, $n = 1, \ldots, N$ and is therefore omitted. Now our proof is complete. □

We also extend the results of Theorems 4.1, 5.2, 6.1 to incorporate the cases where the eigenvalues of the two graphs are the same.

Theorem H.4 Let $G$, $\hat{G}$ be nonisomorphic graphs. Then there exists a GNN with modules as in Fig. 3 or as in Fig. 4 […]

Proof: The proof of Case 1 can be found in Appendix F. In order to prove case 2 of Theorem H.4 we use the architecture illustrated in Fig. 7. This GNN is designed with 2 layers, each of them consisting of $N$ neurons. Recall from the previous proofs that there exists a graph filter $H(S)$ with unique parameters $h$ that isolates one eigenvalue (the $n$-th eigenvalue) and sets it to one, i.e.,

$$H(\lambda_i) = \begin{cases} 1, & \text{if } i = n \\ 0, & \text{if } i \neq n \end{cases} \tag{82}$$

Since the considered graphs have $N$ distinct eigenvalues, we can build the first layer of Fig. 7 with $N$ neurons described by the following set of equations:

$$y_n = \sigma\left(\text{diag}\left(H_n^{(1)}(S)\right)\right), \quad n = 1, \ldots, N$$

$$H_n^{(1)}(\lambda_i) = \begin{cases} 1, & \text{if } i = n \\ 0, & \text{if } i \neq n \end{cases}, \quad n = 1, \ldots, N$$

Proposition I.1. Let $G$, $\hat{G}$ be two isomorphic graphs, i.e., $S = \Pi \hat{S} \Pi^T$. Also let $X$, $\hat{X}$ be the graph signals associated with $G$, $\hat{G}$ that satisfy $X = \Pi \hat{X}$. Then any GNN with modules as in (1) decides the two graphs are the same.

Proof: To prove this proposition, it suffices to show that the output $Y$ in (1) is permutation equivariant. To see this, consider the graph adjacencies $S$ and $\hat{S}$ such that $\hat{S} = \Pi S \Pi^T$ and the signals $\hat{X} = \Pi X$, where $\Pi$ is a permutation matrix. Then equation (1) gives:

$$\hat{Y} = \sigma\left(\sum_{k=0}^{K-1} \hat{S}^k \hat{X} H_k\right) \overset{(1)}{=} \sigma\left(\sum_{k=0}^{K-1} \Pi S^k \Pi^T \Pi X H_k\right)$$

$$\overset{(2)}{=} \sigma\left(\sum_{k=0}^{K-1} \Pi S^k X H_k\right) \tag{94}$$

$$= \sigma\left(\Pi \sum_{k=0}^{K-1} S^k X H_k\right) = \Pi Y,$$

where equality (1) holds because $\left(\Pi S \Pi^T\right)^k = \Pi S^k \Pi^T$ and equality (2) comes from the fact that $\Pi^T \Pi = I$. Therefore, $Y$ is permutation equivariant. Overall, GNNs with modules as in (1) produce permutation equivariant outputs for isomorphic graphs.

Proposition I.2. Let $G$, $\hat{G}$ be two isomorphic graphs. Then any GNN with modules as in Fig. 3 or Fig. 4 decides that the two graphs are the same.

Proof: To prove this proposition, it suffices to show that the output in (10) is permutation equivariant. To see this, consider two graph adjacencies $S$ and $\hat{S}$ such that $\hat{S} = \Pi S \Pi^T$, where $\Pi$ is a permutation matrix. Then equation (10) gives:

$$\hat{y} = \sigma\left(\sum_{k=0}^{K-1} h_k\, \text{diag}\left(\hat{S}^k\right)\right) \overset{(1)}{=} \sigma\left(\sum_{k=0}^{K-1} h_k\, \text{diag}\left(\Pi S^k \Pi^T\right)\right)$$

$$\overset{(2)}{=} \sigma\left(\sum_{k=0}^{K-1} h_k\, \Pi\, \text{diag}\left(S^k\right)\right) \tag{96}$$

$$= \sigma\left(\Pi \sum_{k=0}^{K-1} h_k\, \text{diag}\left(S^k\right)\right) = \Pi y,$$

where equality (1) holds because $\left(\Pi S \Pi^T\right)^k = \Pi S^k \Pi^T$ and equality (2) comes from the fact that $\text{diag}\left(\Pi S \Pi^T\right) = \Pi\, \text{diag}(S)$. The output $y$ is permutation equivariant and we can conclude that the proposed architectures produce permutation equivariant outputs for isomorphic graphs.
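Both equivariance statements can be checked numerically on a random graph and a random relabeling of its nodes. In the sketch below the graph, weights, and permutation are random illustrative choices, and `gnn_module` and `diag_module` are hypothetical helper names standing in for the modules in (1) and (10) with ReLU nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, G, K = 6, 3, 2, 3                            # illustrative sizes (assumption)
A = np.triu(rng.integers(0, 2, size=(N, N)), 1)
S = (A + A.T).astype(float)                        # symmetric adjacency
X = rng.standard_normal((N, F))
H = rng.standard_normal((K, F, G))                 # taps of the module in (1)
h = rng.standard_normal(K)                         # taps of the diagonal module

# Random permutation matrix and the relabeled graph and signal.
Pi = np.eye(N)[rng.permutation(N)]
S_hat = Pi @ S @ Pi.T
X_hat = Pi @ X

def gnn_module(S, X, H):
    """sigma(sum_k S^k X H_k) with sigma = ReLU."""
    lin = sum(np.linalg.matrix_power(S, k) @ X @ Hk for k, Hk in enumerate(H))
    return np.maximum(lin, 0.0)

def diag_module(S, h):
    """sigma(sum_k h_k diag(S^k)) with sigma = ReLU."""
    lin = sum(hk * np.diag(np.linalg.matrix_power(S, k)) for k, hk in enumerate(h))
    return np.maximum(lin, 0.0)

# Outputs of the relabeled graph are the permuted outputs of the original graph.
print(np.allclose(gnn_module(S_hat, X_hat, H), Pi @ gnn_module(S, X, H)))
print(np.allclose(diag_module(S_hat, h), Pi @ diag_module(S, h)))
```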

J GNNS VS SPECTRAL DECOMPOSITION

In this paper, we discuss the ability of GNNs to distinguish between different graphs. Our analysis uses spectral decomposition tools and provides conditions under which a GNN can tell two graphs apart. These conditions are related to the eigenvalues and eigenvectors of the graph operators. Therefore, it is natural to study the similarities and differences of GNNs and spectral decomposition algorithms.

J.1 THE TWO GRAPHS HAVE DIFFERENT EIGENVALUES

As explained in the main part of the paper, there always exists a GNN that can distinguish between a pair of graphs with different eigenvalues. Furthermore, computing the eigenvalues of the two graphs can also attest that the two graphs are nonisomorphic. Therefore, the two approaches are equally powerful. The difference lies in the fact that a GNN needs to be trained to perform the isomorphism test, whereas the spectral decomposition is unsupervised. On the other hand, computing the spectral decomposition for real graphs can be computationally very challenging.

J.2 THE TWO GRAPHS HAVE THE SAME SET OF EIGENVALUES THAT ARE DISTINCT

This case is a bit more complicated. Since the eigenvalues are the same, one must resort to the eigenvectors to distinguish between the graphs. When the eigenvalues are distinct, the eigenvectors of the graph are unique up to a sign for each eigenvector. To be more precise, let G, Ĝ be isomorphic graphs with eigenvectors U , Û respectively. Then we have the following: U = Π Û D, where Π is a permutation matrix and D is a diagonal matrix with elements ±1. We observe the following: Remark J.1 When Assumption H.1 holds, the eigenvectors of isomorphic graphs are not permutation equivariant, since there exists a sign ambiguity for each eigenvector. On the contrary, the produced GNN node embeddings are always permutation equivariant, according to Propositions I.2 and I.1. In other words, GNNs always produce equivariant node embeddings for isomorphic graphs, which is not the case for the spectral decomposition. We see that these conditions involve the eigenvectors of the graphs and therefore we can construct an eigen-based algorithm with the same guarantees. Note that these guarantees are only sufficient and there might be cases where the GNNs can distinguish between nonisomorphic graphs, whereas an algorithm based on the above conditions might fail. Furthermore, calculating the complete set of eigenvectors of a real graph might be computationally prohibitive.

J.3 THE TWO GRAPHS HAVE THE SAME MULTISET OF EIGENVALUES THAT ARE NOT DISTINCT

Scenario 1: The graphs are isomorphic. GNNs will always produce equivariant embeddings for isomorphic graphs. On the contrary, eigenvectors are not unique and they will not provide equivariant representations (up to scaling) for isomorphic graphs. Scenario 2: The graphs are nonisomorphic. The GNN analysis for this case is relegated for future work. Regarding the spectral decomposition, we need to resort to eigenvectors, which are not unique. Therefore, detecting nonisomorphic graphs is challenging.

J.4 STABILITY AND DISCRIMINABILITY OF GNNS

From our discussion so far, we have observed similarities and differences between the functionality of GNNs and the spectral decomposition of the graph. There is one more fundamental difference that has not yet been discussed and involves the stability and discriminability properties of GNNs (Gama et al., 2020). In particular, a GNN is stable under small perturbations of the graph operator, i.e., the output of a GNN is similar for 'similar' graphs. On the other hand, small perturbations of the graph can result in essential changes in the eigenvalues and eigenvectors of the graph operator, which makes the spectral decomposition more unstable. Therefore, there seems to be a stability vs. discriminability trade-off between GNNs and spectral decomposition. However, the architectural nonlinearities allow GNNs to be both stable and discriminative.

To recap, the conditions of this paper involve the eigenvalues and eigenvectors of the graph operator. Compared to eigen-based algorithms, there is an advantage of GNNs when the eigenvalues are exactly the same with the same multiplicities. This is due to the fact that the eigenvectors of a graph operator are not unique and therefore isomorphic graphs do not admit permutation equivariant eigenvectors, whereas GNNs always produce permutation equivariant node embeddings for isomorphic graphs. On the other hand, when the eigenvalues are different, GNNs and spectral decomposition are equally powerful. Furthermore, GNNs are robust to small changes of the graph, which is not the case for spectral decomposition. Finally, the spectral decomposition is computationally heavy and unsupervised, but GNNs are lighter to execute and require training.

We observe that G and Ĝ in both figures admit a different set of eigenvalues. However, the eigenvectors that correspond to the eigenvalues that differentiate them are orthogonal to the vector of all-ones (they sum up to zero).
Therefore, the WL algorithm and GNNs with x = 1 input fail to tell them apart.

K.2 DETAILS ON THE EXPERIMENTS OF SECTION 7

In Fig. 8 we present an example of two graphs in the CSL dataset that belong to different classes. It is clear from the figure that all nodes of the two graphs have degree equal to 4. Therefore, x = 1 is an eigenvector of both graphs and is orthogonal to the remaining eigenvectors. Any valuable information that separates the two graphs is lost when we run the WL algorithm or feed a GNN with x = 1.

Next, we present the details on the experiments of section 7.2. For the most part, we use the specifications suggested in (Xu et al., 2019). In particular, we train a 4-layer graph neural network where the output of each layer and the input are passed through a graph pooling layer and then a […] The nonlinearity used in our experiments is the ReLU and the readout function that performs graph pooling is $\mathbf{1}^T X^{(l)}$ for $l = 0, \ldots, 5$. $X^{(l)}$ represents the output of layer $l$ with $X^{(0)} = X$. For each type-2 GNN block, we only use 1 tap for $k = 1$. This is due to the fact that we pass the output of every layer to the final classifier, so additional taps might be redundant. To train the proposed architecture, we use the Adam optimizer with a learning rate equal to $10^{-2}$, batch size equal to 128 and a dropout ratio equal to 0.5. Training is carried out over 200 epochs with 50 iterations per epoch. To assess the performance of the proposed architecture, we divide each dataset into 50-50 training-testing splits and apply 10-fold cross-validation. The only parameter we tune is the hidden dimension for each layer. In particular, the number of modules for each layer is the same and we tune over {8, 16, 32, 64, 128, 256} modules.

We also compare our proposed architecture with GIN (Xu et al., 2019) initialized with x = 1 and GIN initialized according to equation (14). We use the publicly available code (footnote 2) provided by the authors. We use the exact same specification for fair comparisons and tune the hidden layer over {8, 16, 32, 64, 128, 256} dimensions. All experiments are conducted on a Linux server with an NVIDIA RTX 3080 GPU. The data (footnote 3) are publicly available, and the code of the proposed architectures with all the experiments can be found in this repository (footnote 4).



1. We define $V_\lambda \oplus \hat{V}_\lambda := \{u + w \mid u \in V_\lambda;\ w \in \hat{V}_\lambda;\ u, w \notin V_\lambda \cap \hat{V}_\lambda\}$ as the exclusive sum of subspaces.
2. https://github.com/weihua916/powerful-gnns
3. https://pytorch-geometric.readthedocs.io/en/latest/
4. https://github.com/tempcode100/gnns-are-powerful



Figure 1: WL indistinguishable graphs.

Figure 2: WL indistinguishable graphs

Figure 3: GNN with random Gaussian input

Figure 4: Proposed GNN modules

Figure 5: Proposed GNN architectures

Figure 6: Empirical variance GNN module

Figure 7: GNN architecture

that tells the two graphs apart if:
1. Assumption 2.1 holds, or
2. Assumption H.1 holds and
(a) There is no permutation matrix $\Pi$ such that $|U| = \Pi |\hat{U}|$, or
(b) $|U^T| u_n \neq 0$ for all eigenvectors $u_n$ or $|\hat{U}^T| \hat{u}_n \neq 0$ for all eigenvectors $\hat{u}_n$.

If $G$, $\hat{G}$ are nonisomorphic the story is different. According to Lemma H.2, there does not exist a permutation matrix $\Pi$ such that $U = \Pi \hat{U} D$ and the GNNs detect nonisomorphic graphs under Theorem H.3 or Theorem H.4. Let us focus on the conditions of Theorem H.4, i.e.,
(a) There does not exist a permutation matrix $\Pi$ such that $|U| = \Pi |\hat{U}|$,
(b) $|U^T| u_n \neq 0$ for all eigenvectors $u_n$ or $|\hat{U}^T| \hat{u}_n \neq 0$ for all eigenvectors $\hat{u}_n$.

Figure 8: CSL graphs

Figure 9: GNN architecture

Outputs y of G and ŷ of Ĝ of the proposed diagonal module for the graphs in Figs. 1, 2.

GNN output y for every class of the CSL graphs.



Average testing score and standard deviation over 10 shuffles

Overview of notation.

GNN vs WL algorithm on the graph in Fig. 6 for K = 3.

Eigenvalue and eigenvector information for the graphs in Fig. 2.

A PRELIMINARIES

Networks are naturally represented by graphs $\mathcal{G} := (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, N\}$ is the set of vertices (nodes) and $\mathcal{E} = \{(v, u)\}$ are the edges between pairs of nodes. The 1-hop neighborhood $\mathcal{N}(v)$ of node $v$ is the set of nodes $u \in \mathcal{V}$ that satisfy $(u, v) \in \mathcal{E}$. A graph can also be modeled by a Graph Shift Operator (GSO) $S \in \mathbb{R}^{N \times N}$, where $S(i, j)$ quantifies the relation between node $i$ and node $j$ and $N = |\mathcal{V}|$. Popular choices of the GSO are the graph adjacency, the graph Laplacian or weighted versions of them. The nodes of the graph are often associated with graph signals $X \in \mathbb{R}^{N \times D}$, also known as node attributes, where $D$ is the dimension of each graph signal (feature dimension).

[…] corresponding to $\lambda$. Then the output of the GNN, for the two graphs, takes the form: […] The subspaces $V_\lambda$, $\hat{V}_\lambda$ can be written as: […] where $Q_c \in \mathbb{R}^{N \times c}$ is the common subspace between $V_\lambda$ and $\hat{V}_\lambda$, and $+$ denotes here the sum between subspaces. Furthermore, $W \in \mathbb{R}^{m_\lambda \times m_\lambda}$, $\hat{W} \in \mathbb{R}^{\hat{m}_\lambda \times \hat{m}_\lambda}$ are square orthogonal matrices. As a result […] We define: […] as the exclusive sum of subspaces. If […] since there is no permutation that makes these vectors collinear. Our previous analysis shows that suitable nonlinearities will also guarantee that there is no permutation $\Pi$ such that $Y = \Pi \hat{Y}$.

F PROOF OF THEOREMS 4.1, 5.2, 6.1

The proof of Theorems 5.2, 6.1 is equivalent and very similar to the proof of Theorem 4.1. We begin by proving Theorem 5.2. To prove Theorem 5.2 let us consider again the GNN module in (10). For simplicity we assume that $\sigma(\cdot)$ is a linear function. In eq. (43) we show how to produce the linear function from ReLU. If Assumption 2.1 holds, there exists $\lambda \in \mathcal{S}$, such that $\lambda \notin \hat{\mathcal{S}}$ or $m_\lambda \neq \hat{m}_\lambda$. We use the proof of Theorem 2.2 and conclude that there exists a graph filter $H(\cdot)$ with unique parameters $h$ that passes only the $\lambda$ eigenvalue, i.e.,

H (µ

First we study the case where λ ∈ S, but λ ∉ Ŝ. Then H(S) = V_λ V_λ^T, where V_λ ∈ R^{N×m_λ} is the eigenspace of S corresponding to λ, and H(Ŝ) = 0. The output y of (57), for the two graphs, takes the form:

Under the above parametrization, the filter H_n^{(1)}(S) takes the form

H^{(1)}

where u_n is the eigenvector corresponding to the n-th eigenvalue of S. Then the output of the first layer for the two graphs takes the form:

Since both S and Ŝ have distinct eigenvalues, we can concatenate the outputs of the neurons and obtain the layer-1 outputs:

If there does not exist a permutation matrix Π such that |U| = Π|Û|, one layer is sufficient and the proposed GNN can tell the two graphs apart.

For the second layer of the GNN in Fig. 7 we consider the following parametrization:

where X = Y^{(1)} = |U| is the output of the first layer. Then the final output of the GNN for the two graphs can be written as:

If we assume that |U^T| u_n ≠ 0 for all eigenvectors u_n, or |Û^T| û_n ≠ 0 for all eigenvectors û_n, we can use the same steps as in the proof of Theorem H.3 and show that the proposed GNN decides that the two graphs are different. Note that in layer 1 we can use the stochastic modules in Fig. 3a and the proof still holds, since the filter with parameters as in (84) yields:

and the same output as in (86) can be produced. Also, by using Proposition 6.1 we can substitute the modules in the first layer with the modules in Fig. 4b and the proof still holds. □
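The eigenvalue-isolating filter used in the case λ ∈ S, λ ∉ Ŝ above can be checked numerically. The following is a minimal sketch, not the construction from the proof of Theorem 2.2: it assumes a polynomial graph filter H(S) = Σ_k h_k S^k and picks the coefficients by Lagrange interpolation so that H(λ) = 1 and H(µ) = 0 at every other eigenvalue of either graph. The two example graphs (a triangle for S and a 3-node path for Ŝ) and the helper names `distinct_eigs`, `isolating_filter_coeffs`, and `apply_filter` are illustrative assumptions.

```python
import numpy as np

def distinct_eigs(S, decimals=8):
    """Distinct eigenvalues of a symmetric GSO, merged up to rounding."""
    return np.unique(np.round(np.linalg.eigvalsh(S), decimals))

def isolating_filter_coeffs(lam, other_eigs):
    """Coefficients h_0..h_K (lowest degree first) of a polynomial H
    with H(lam) = 1 and H(mu) = 0 for every mu in other_eigs."""
    coeffs = np.poly(other_eigs)                # monic, roots at other_eigs
    coeffs = coeffs / np.polyval(coeffs, lam)   # normalize so H(lam) = 1
    return coeffs[::-1]

def apply_filter(h, S):
    """Evaluate the polynomial graph filter H(S) = sum_k h_k S^k."""
    out = np.zeros_like(S, dtype=float)
    power = np.eye(S.shape[0])
    for hk in h:
        out += hk * power
        power = power @ S
    return out

# Assumed example graphs: triangle (S) and 3-node path (S_hat).
S = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
S_hat = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

lam = 2.0  # eigenvalue of the triangle that the path does not have
all_eigs = np.unique(np.concatenate([distinct_eigs(S), distinct_eigs(S_hat)]))
others = [mu for mu in all_eigs if abs(mu - lam) > 1e-6]

h = isolating_filter_coeffs(lam, others)
H_S = apply_filter(h, S)          # projector onto the eigenspace of lam
H_S_hat = apply_filter(h, S_hat)  # numerically zero

# Projector built directly from the eigenvectors of S, for comparison.
w, U = np.linalg.eigh(S)
V_lam = U[:, np.abs(w - lam) < 1e-6]
projector = V_lam @ V_lam.T
```

In this sketch `H_S` agrees with `projector` (that is, V_λ V_λ^T) and `H_S_hat` vanishes up to floating-point error, so any nonzero input component along V_λ yields different filter outputs for the two graphs.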

H.2 GRAPHS WITH THE SAME EIGENVALUES WHICH ARE NOT DISTINCT.

The last case arises when the graph adjacencies have the same eigenvalues, which are not distinct and have the same multiplicities. This case is more complicated, since the two graphs can be nonisomorphic even if there exist a permutation matrix Π and a diagonal matrix D such that U = ΠÛD (the condition in Lemma H.2 does not hold). Analysis and results for this case are left for future work.

I GNNS AND ISOMORPHIC GRAPHS

The core of this paper studies the ability of GNNs to distinguish between nonisomorphic graphs. Another important question is whether a GNN can tell if two graphs are isomorphic. The answer is affirmative. GNNs are permutation equivariant architectures and can always detect isomorphic graphs. To make things concrete, we present the following proposition:

