ANALYZING THE EXPRESSIVE POWER OF GRAPH NEURAL NETWORKS IN A SPECTRAL PERSPECTIVE

Abstract

In the recent literature on Graph Neural Networks (GNNs), the expressive power of models has been studied through their capability to distinguish whether two given graphs are isomorphic or not. Since the graph isomorphism problem is believed to be NP-intermediate, and the Weisfeiler-Lehman (WL) test provides in polynomial time a necessary but not sufficient condition for isomorphism, the theoretical power of GNNs is usually evaluated by their equivalence to a WL-test order, followed by an empirical analysis of the models on some reference inductive and transductive datasets. However, such an analysis does not account for the signal processing pipeline, whose capability is generally evaluated in the spectral domain. In this paper, we argue that a spectral analysis of GNN behavior can provide a complementary point of view and go one step further in the understanding of GNNs. By bridging the gap between the spectral and spatial design of graph convolutions, we theoretically demonstrate an equivalence of the graph convolution process regardless of whether it is designed in the spatial or the spectral domain. Using this connection, we re-formulate most of the state-of-the-art graph neural networks in one common framework. This general framework allows us to conduct a spectral analysis of the most popular GNNs, explaining their performance and showing their limits from a spectral point of view. Our theoretical spectral analysis is confirmed by experiments on various graph databases. Furthermore, we demonstrate the necessity of high-pass and/or band-pass filters on a graph dataset, on which the majority of GNNs, being limited to low-pass filtering, inevitably fail.

1. INTRODUCTION

Over the last five years, many Graph Neural Networks (GNNs) have been proposed in the literature of geometric deep learning (Veličković et al., 2018; Gilmer et al., 2017; Bronstein et al., 2017; Battaglia et al., 2018), in order to generalize the very efficient deep learning paradigm to the world of graphs. This large number of contributions explains a new challenge recently tackled by the community, which consists in assessing the expressive power of GNNs. In this area of research, there is a consensus to evaluate the theoretical expressive power of GNNs according to their equivalence to a Weisfeiler-Lehman (WL) test order (Morris et al., 2019; Xu et al., 2019; Maron et al., 2019b; a). Hence, GNN models are frequently classified as "as powerful as 1-WL", "as powerful as 2-WL", ..., "as powerful as k-WL". However, this perspective cannot differentiate between two methods that are as powerful as the same WL test order. Moreover, it does not always explain the success or failure of a GNN on common benchmark datasets. In this paper, we claim that analyzing GNNs theoretically and experimentally from a spectral point of view can bring a new perspective on their expressive power.

So far, GNNs have generally been studied separately as spectral-based or as spatial-based (Wu et al., 2019b; Chami et al., 2020). To the best of our knowledge, Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017) and GraphNets (Battaglia et al., 2018) are the only attempts to merge both approaches in the same framework. However, these models can neither generalize custom-designed spectral filters nor describe the effect of each convolution support in a multi-convolution case. The spatial-spectral connection is also mentioned indirectly in several cornerstone studies by Defferrard et al. (2016); Kipf & Welling (2017); Levie et al. (2019). Since the spectral-spatial interchangeability was missing, these studies did not show the spectral behavior of any graph convolution.
Recent studies have also attempted to show, for a limited number of spatial GNNs, that they act as low-pass filters (NT & Maehara, 2019; Wu et al., 2019a). NT & Maehara (2019) concluded that using the adjacency matrix induces low-pass effects, while Wu et al. (2019a) studied the spectral behavior of a single spatial GNN by assuming that adding self-connections changes the given topology of the graph.

In this paper, we bridge the gap between the spectral and spatial domains for GNNs. Our first contribution consists in demonstrating the equivalence of convolution processes regardless of whether they are defined as spatial or as spectral GNNs. Using this connection, we propose a new general framework and taxonomy for GNNs as the second contribution. Taking advantage of this equivalence, our third contribution is to provide a spectral analysis of any GNN model. This spectral analysis is another perspective for the analysis of the expressive power of GNNs. Our theoretical spectral analysis is confirmed by experiments on various well-known graph datasets. Furthermore, we show the necessity of high-pass and/or band-pass filters in our experiments, while the majority of GNNs are limited to only low-pass filters and thus inevitably fail when dealing with these problems. The code used in this paper is available at https://github.com/balcilar/gnn-spectral-expressive-power.

The remainder of this paper is organized as follows. Section 2 introduces convolutional GNNs and presents existing approaches. In Section 3 and Section 4, we describe the main contributions mentioned above. Section 5 presents a series of experiments and results which validate our propositions. Finally, Section 6 concludes this paper.

2. PROBLEM STATEMENT AND STATE OF THE ART

Let $G$ be a graph with $n$ nodes and an arbitrary number of edges. Connectivity is given by the adjacency matrix $A \in \{0, 1\}^{n \times n}$ and features are defined on nodes by $X \in \mathbb{R}^{n \times f_0}$, with $f_0$ the length of the feature vectors. For any matrix $X$, we use $X_i$, $X_{:j}$ and $X_{i,j}$ to refer to its $i$-th column vector, $j$-th row vector and the scalar value at location $(i, j)$, respectively. A graph Laplacian is $L = D - A$ (or $L = I - D^{-1/2} A D^{-1/2}$), where $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix and $I$ is the identity. Through eigendecomposition, $L$ can be written as $L = U\,\mathrm{diag}(\lambda)\,U^\top$, where each column of $U \in \mathbb{R}^{n \times n}$ is an eigenvector of $L$, $\lambda \in \mathbb{R}^n$ gathers the eigenvalues of $L$, and the $\mathrm{diag}(\cdot)$ function creates a diagonal matrix whose diagonal elements are taken from a given vector. We use superscripts to distinguish variables of the same kind. For instance, $H^{(l)} \in \mathbb{R}^{n \times f_l}$ refers to the node representation on layer $l$, whose feature dimension is $f_l$. A graph convolution layer takes the node representation of the previous layer $H^{(l-1)}$ as input and produces a new representation $H^{(l)}$, with $H^{(0)} = X$.
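The notation above can be made concrete with a minimal NumPy sketch. The 4-node path graph and the helper name `normalized_laplacian` are illustrative assumptions, not part of the paper; the sketch only builds the normalized Laplacian and checks its eigendecomposition.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5   # isolated nodes keep a zero scale
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Hypothetical toy graph: a path on 4 nodes (0-1-2-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = normalized_laplacian(A)
lam, U = np.linalg.eigh(L)            # eigenvalues (ascending) and eigenvectors (columns of U)
L_rebuilt = U @ np.diag(lam) @ U.T    # L = U diag(lambda) U^T
```

Because $L$ is symmetric, `np.linalg.eigh` returns an orthonormal $U$, so $U^{-1} = U^\top$, which is the property the later derivations rely on.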

2.1. SPECTRAL APPROACHES

Spectral GNNs rely on spectral graph theory (Chung, 1997). In this framework, signals on graphs are filtered using the eigendecomposition of the graph Laplacian (Shuman et al., 2013). By transposing the convolution theorem to graphs, spectral filtering in the frequency domain can be defined by $x_{flt} = U\,\mathrm{diag}(\Phi(\lambda))\,U^\top x$, where $\Phi(\cdot)$ is the desired filter function. As a consequence, a graph convolution layer in the spectral domain can be written as a sum of filtered signals followed by an activation function, as in (Bruna et al., 2013), namely
$$H^{(l+1)}_j = \sigma\Big(\sum_{i=1}^{f_l} U\,\mathrm{diag}\big(F^{(l,j)}_i\big)\,U^\top H^{(l)}_i\Big), \quad \text{for } j \in \{1, \dots, f_{l+1}\}. \tag{1}$$
Here, $\sigma$ is the activation function and $F^{(l,j)} \in \mathbb{R}^{n \times f_l}$ gathers the corresponding weight vectors to be tuned, as used in (Henaff et al., 2015) for the single-graph problem, known as the non-parametric spectral GNN. A first drawback is the necessity of the Fourier and inverse Fourier transforms, computed by matrix multiplication with $U$ and $U^\top$. Another drawback occurs when generalizing the approach to multi-graph learning problems. Indeed, the $k$-th element of the vector $F^{(l,j)}_i$ weights the contribution of the $k$-th eigenvector to the output. Those weights are not shareable between graphs of different sizes, which means a different length of $F^{(l,j)}_i$ is needed. Moreover, even when the graphs have the same number of nodes, their eigenvalues will be different if their structures differ.

To overcome these issues, a few spatially-localized filters have been defined, such as cubic B-spline (Bruna et al., 2013), polynomial and Chebyshev polynomial (Defferrard et al., 2016) and Cayley polynomial parameterization (Levie et al., 2019). With such approaches, trainable parameters are defined by
$$F^{(l,j)}_i = B\,\big[W^{(l,1)}_{i,j}, \dots, W^{(l,s_e)}_{i,j}\big]^\top,$$
where each column of $B \in \mathbb{R}^{n \times s_e}$ is designed as a function of the eigenvalues, namely $B_{k,s} = \Phi_s(\lambda_k)$, where $k = 1, \dots, n$ denotes the eigenvalue index, $s = 1, \dots, s_e$ denotes the filter index and $s_e$ is the number of desired filters. Here, $W^{(l,s)} \in \mathbb{R}^{f_l \times f_{l+1}}$ is the trainable matrix of the $s$-th filter in the $l$-th layer.
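As a minimal sketch of the filtering rule $x_{flt} = U\,\mathrm{diag}(\Phi(\lambda))\,U^\top x$, the following NumPy code filters two signals on a small ring graph. The 8-node ring, the helper names and the linear low-pass response $\Phi(\lambda) = 1 - \lambda/2$ are assumptions chosen for illustration only.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of an undirected cycle graph on n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[(idx + 1) % n, idx] = 1.0
    return A

def spectral_filter(L, x, phi):
    """Apply x_flt = U diag(phi(lambda)) U^T x using the eigendecomposition of L."""
    lam, U = np.linalg.eigh(L)
    x_hat = U.T @ x                  # graph Fourier transform
    return U @ (phi(lam) * x_hat)    # filter, then inverse transform

n = 8
A = ring_adjacency(n)
L = np.eye(n) - A / 2.0              # normalized Laplacian; every node has degree 2

def low_pass(lam):
    # Hypothetical response: 1 at lambda = 0, 0 at lambda = 2.
    return 1.0 - lam / 2.0

constant = np.ones(n)                                     # lowest frequency (lambda = 0)
alternating = np.array([(-1.0) ** k for k in range(n)])   # highest frequency (lambda = 2)

out_constant = spectral_filter(L, constant, low_pass)
out_alternating = spectral_filter(L, alternating, low_pass)
```

Under these assumptions the constant signal passes unchanged while the alternating signal is suppressed, which is the behaviour one expects from a low-pass response.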

2.2. SPATIAL APPROACHES

Spatial GNNs consider an $\mathrm{agg}$ operator, which aggregates the neighborhood nodes, and an $\mathrm{upd}$ operator, which updates the concerned node as follows:
$$H^{(l+1)}_{:v} = \mathrm{upd}\Big(g_0\big(H^{(l)}_{:v}\big),\ \mathrm{agg}\big(\{g_1(H^{(l)}_{:u}) : u \in \mathcal{N}(v)\}\big)\Big), \tag{2}$$
where $\mathcal{N}(v)$ is the set of neighborhood nodes and $g_0, g_1 : \mathbb{R}^{n \times f_l} \to \mathbb{R}^{n \times f_{l+1}}$ are trainable models. The choice of $\mathrm{agg}$, $\mathrm{upd}$, $g_0$, $g_1$, and even $\mathcal{N}(v)$, determines the capability of the model. The vanilla GNN (known as GIN-0 in (Xu et al., 2019)) uses the same weights in $g_0$ and $g_1$; $\mathcal{N}(v)$ is the set of nodes connected to $v$, $\mathrm{agg}$ is the sum of all connected node values, and $\mathrm{upd}(x, y) := \sigma(x + y)$, where $\sigma$ is an elementwise nonlinearity. GCN makes the same choices but normalizes the features as in (Kipf & Welling, 2017). Hamilton et al. (2017) used separate weights in $g_0$ and $g_1$, which means that two sets of trainable weights are applied to the node's own features and to its neighbors. Other approaches defined multiple neighborhoods and used a different $g_i$ for each kind of neighborhood. For instance, Duvenaud et al. (2015) defined the neighborhood according to node label and/or degree, while Niepert et al. (2016) reordered the neighbor nodes and applied the same model $g_i$ to neighbors according to their order. These spatial GNNs use a sum or normalized sum over $g_i$ in equation 2. Other methods weight this summation by another trainable parameter, where the weights can be written as a function of node and/or edge features in order to make the convolutions more productive, such as graph attention networks (Veličković et al., 2018), MoNet (Monti et al., 2017), GatedGCN (Bresson & Laurent, 2018) and SplineCNN (Fey et al., 2018).
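The aggregate-and-update scheme can be sketched for the vanilla GNN choices described above (sum aggregation, $\mathrm{upd}(x, y) = \sigma(x + y)$, shared weights in $g_0$ and $g_1$). The sketch assumes that $g_0$ and $g_1$ are a single linear map `W`, that $\sigma$ is a ReLU, and that nodes are stored as rows of `H`; the toy graph and all names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def vanilla_gnn_layer(A, H, W):
    """One spatial layer: agg = sum over neighbors, upd(x, y) = relu(x + y),
    with g0 = g1 = multiplication by the shared weight matrix W."""
    n = A.shape[0]
    H_next = np.zeros((n, W.shape[1]))
    for v in range(n):
        self_term = H[v] @ W                       # g0 applied to node v
        neighbors = np.nonzero(A[v])[0]            # N(v): nodes connected to v
        agg = sum((H[u] @ W for u in neighbors), np.zeros(W.shape[1]))
        H_next[v] = relu(self_term + agg)          # upd(x, y) = sigma(x + y)
    return H_next

# Hypothetical toy graph: a path on 4 nodes, with random features and weights.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

H_loop = vanilla_gnn_layer(A, H, W)
H_matrix = relu((A + np.eye(4)) @ H @ W)   # the same computation in matrix form
```

The node-wise loop and the matrix expression agree, which illustrates why sum-based spatial layers can be written with a propagation matrix applied to the node features.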

3. BRIDGING SPATIAL AND SPECTRAL GNNS

In this section, we define a general framework which includes most of the well-known GNN models, including Euclidean convolution and models which use an anisotropic update scheme such as in Veličković et al. (2018); Bresson & Laurent (2018). When $\mathrm{upd}(x, y) = \sigma(x + y)$, $\mathrm{agg}$ is a sum (or weighted sum) of the contributions of the defined neighborhood nodes, and $g_i$ applies a linear transformation, one can trivially show that the mentioned spatial GNNs can be generalized as a propagation of the node features to the neighboring nodes followed by a feature transformation and an activation function, of the form
$$H^{(l+1)} = \sigma\Big(\sum_{s} C^{(s)} H^{(l)} W^{(l,s)}\Big), \tag{3}$$
where $C^{(s)} \in \mathbb{R}^{n \times n}$ is the $s$-th convolution support that defines how the node features are propagated to the neighboring nodes. Within this generalization, GNNs differ from each other by the choice of convolution supports $C^{(s)}$. This formulation generalizes many different kinds of graph convolutions, as well as Euclidean domain convolutions, which can be seen in Appendix A with the detailed schema.

Definition 1. A trainable-support is a graph convolution support $C^{(s)}$ with at least one trainable parameter that can be tuned during training. If $C^{(s)}$ has no trainable parameters, i.e. when the supports are pre-designed, it is called a fixed-support graph convolution.

In the trainable-support case, supports can be different in each layer, which is denoted by $C^{(l,s)}$ for the $s$-th support in layer $l$. Formally, we can define a trainable support by
$$C^{(l,s)}_{v,u} = h_{(s,l)}\big(H^{(l)}_{:v}, H^{(l)}_{:u}, E^{(l)}_{v,u}, A\big), \tag{4}$$
where $E^{(l)}_{v,u}$ denotes the edge features on layer $l$ from node $v$ to node $u$, if available, and $h(\cdot)$ is any trainable model parametrized by $(s, l)$.

Theorem 1. A spectral GNN parameterized with $B$ of entries $B_{i,j} = \Phi_j(\lambda_i)$, defined as
$$H^{(l+1)}_j = \sigma\Big(\sum_{i=1}^{f_l} U\,\mathrm{diag}\Big(B\,\big[W^{(l,1)}_{i,j}, \dots, W^{(l,s_e)}_{i,j}\big]^\top\Big)\,U^\top H^{(l)}_i\Big), \tag{5}$$
is a particular case of the framework in equation 3 with the convolution kernel set to
$$C^{(s)} = U\,\mathrm{diag}(\Phi_s(\lambda))\,U^\top. \tag{6}$$
The proof can be found in Appendix B.
This theorem is general and covers many well-known spectral GNNs, such as non-parametric spectral graph convolution (Henaff et al., 2015), polynomial parameterization (Defferrard et al., 2016), cubic B-spline parameterization (Bruna et al., 2013), CayleyNet (Levie et al., 2019) and also any custom-designed graph convolution. From Theorem 1, one can see that spatial and spectral GNNs all work the same way. Therefore, Fourier calculations are not necessary when convolutions are parameterized by $B$. As a consequence of Theorem 1, one can see that the separation of spectral and spatial GNNs is just an interpretation. The only difference is the way convolution supports are designed: either in the spectral domain or in the spatial one.

Definition 2. A spectral-designed graph convolution refers to a convolution where the supports are written as a function of the eigenvalues ($\Phi_s(\lambda)$) and eigenvectors ($U$) of the corresponding graph Laplacian (equation 6). Thus, each convolution support $C^{(s)}$ has the same frequency response $\Phi_s(\lambda)$ over different graphs. A graph convolution outside this definition is called a spatial-designed graph convolution.

Corollary 1.1. The frequency profile of any given graph convolution support $C^{(s)}$ can be defined in the spectral domain by
$$\Phi_s(\lambda) = \mathrm{diag}^{-1}\big(U^\top C^{(s)} U\big), \tag{7}$$
where $\mathrm{diag}^{-1}(\cdot)$ returns the vector made of the diagonal elements of the given matrix.

The proof of this corollary is given in Appendix C. This corollary leads to the spectral analysis of any given graph convolution support, including spatial-designed convolutions. Since the spatial-designed convolutions do not fit into equation 6, $U^\top C^{(s)} U$ is not a diagonal matrix. Therefore, we also compute the full frequency profile by $\Phi_s = U^\top C^{(s)} U$, which includes all pairwise eigenvector contributions for spatial-designed convolutions.
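A minimal sketch of Corollary 1.1, under illustrative assumptions: a 4-node path graph, a hypothetical response $\Phi(\lambda) = e^{-\lambda}$ for the spectral-designed support, and the GCN support as the spatial-designed example. The function name `frequency_profile` is not taken from the paper.

```python
import numpy as np

def frequency_profile(C, U):
    """Return (diag^{-1}(U^T C U), U^T C U): the standard and the full frequency profile."""
    full = U.T @ C @ U
    return np.diag(full), full

# Hypothetical toy graph: a path on 4 nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.eye(4) - A / np.sqrt(np.outer(deg, deg))    # I - D^{-1/2} A D^{-1/2}
lam, U = np.linalg.eigh(L)

# Spectral-designed support: built from a chosen response (an assumed low-pass shape).
phi = np.exp(-lam)
C_spectral = U @ np.diag(phi) @ U.T
phi_recovered, full_spectral = frequency_profile(C_spectral, U)

# Spatial-designed support: GCN's normalized adjacency with self-loops.
A_tilde = A + np.eye(4)
deg_tilde = A_tilde.sum(axis=1)
C_gcn = A_tilde / np.sqrt(np.outer(deg_tilde, deg_tilde))
phi_gcn, full_gcn = frequency_profile(C_gcn, U)
off_diag_gcn = np.linalg.norm(full_gcn - np.diag(phi_gcn))
```

On this irregular graph, the spectral-designed support gives back its response exactly with a diagonal full profile, whereas the GCN support leaves non-zero off-diagonal terms, which is why the full profile is also reported for spatial-designed convolutions.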

4. THEORETICAL FREQUENCY RESPONSE OF GRAPH CONVOLUTIONS

This section aims at providing a theoretical understanding of the graph convolution process through an analysis, in the spectral domain, of existing GNNs. To the best of our knowledge, no such analysis of graph convolutions has been conducted in the literature. This analysis is based on a reformulation of existing graph convolutions in our general framework (equation 3), and on deriving analytical expressions of $\Phi_s(\lambda)$ (equation 7 in Corollary 1.1) for each convolution support of the graph convolution process concerned. All proofs are provided in the Appendices.

The theoretical frequency response of ChebNet (Defferrard et al., 2016) convolutions is given by the following theorem.

Theorem 2. The theoretical frequency response of each support of ChebNet can be defined as
$$\Phi_1(\lambda) = \mathbf{1}, \qquad \Phi_2(\lambda) = \frac{2\lambda}{\lambda_{max}} - \mathbf{1}, \qquad \Phi_k(\lambda) = 2\,\Phi_2(\lambda)\,\Phi_{k-1}(\lambda) - \Phi_{k-2}(\lambda), \tag{8}$$
where $\mathbf{1}$ is the vector of ones and $\lambda_{max}$ is the maximum eigenvalue.

The proof of Theorem 2 is given in Appendix D. Since ChebNet has no trainable parameter in its supports and all support frequency responses are independent of the graph, we can classify ChebNet as a spectral-designed fixed-support graph convolution.

The theoretical frequency response of the CayleyNet (Levie et al., 2019) convolution is given in the following theorem, and its proof is given in Appendix E.

Theorem 3. The theoretical frequency response of each support of CayleyNet can be defined as
$$\Phi_s(\lambda) = \begin{cases} \mathbf{1} & \text{if } s = 1 \\ \cos\big(\tfrac{s}{2}\,\theta(h\lambda)\big) & \text{if } s \in \{2, 4, \dots, 2r\} \\ -\sin\big(\tfrac{s-1}{2}\,\theta(h\lambda)\big) & \text{if } s \in \{3, 5, \dots, 2r+1\} \end{cases} \tag{9}$$
where $h$ is a trainable scalar and $\theta(x) = \mathrm{atan2}(-1, x) - \mathrm{atan2}(1, x)$.

Since CayleyNet has a trainable parameter $h$ in its supports and all support frequency responses are independent of the graph, we can classify CayleyNet as a spectral-designed trainable-support graph convolution.

GCN (Kipf & Welling, 2017) uses a single convolution support; its theoretical frequency response is given approximately in the following theorem, and its proof is given in Appendix F.
Theorem 4. The theoretical frequency response of the GCN support can be approximated as
$$\Phi(\lambda) \approx 1 - \frac{\lambda p}{p + 1}, \tag{10}$$
where $p$ is the average node degree in the graph.

Since its support has no trainable parameter but its frequency response is not independent of the graph, we can classify GCN as a spatial-designed fixed-support graph convolution.

The Graph Isomorphism Network (GIN) defined in Xu et al. (2019) has attracted a lot of interest from the community, mostly because of its simple convolution mechanism. It has a single convolution support and its theoretical frequency response is given in the following theorem.

Theorem 5. The theoretical frequency response of the GIN support can be approximated as
$$\Phi(\lambda) \approx p\left(\frac{1 + \epsilon}{p} + 1 - \lambda\right), \tag{11}$$
where $\epsilon$ is a trainable scalar.

The proof of this theorem is in Appendix G. Since its support has a trainable parameter and its frequency response depends on the graph structure, we classify GIN as a spatial-designed trainable-support graph convolution.

Graph attention networks (GATs) (Veličković et al., 2018) propose an application to the graph world of the attention mechanism from Vaswani et al. (2017). Because graphs are invariant to the node order, GAT cannot use positional encoding. In addition, instead of considering that all nodes are connected to each other, GAT only assigns attention weights to the node itself and to the nodes connected to it according to the adjacency (sparse attention). Thus, we can see its convolution support as a weighted adjacency with added self-loops. GAT can be represented in our framework in equation 3 by defining trainable convolution supports as follows:
$$C^{(l,s)}_{v,u} = \frac{e_{v,u}}{\sum_{k \in \tilde{\mathcal{N}}(v)} e_{v,k}},$$
where $e_{v,u} = \exp\Big(\sigma\big(a^{(l,s)}\big[H^{(l)}_{:v} W^{(l,s)} \,\|\, H^{(l)}_{:u} W^{(l,s)}\big]\big)\Big)$, and $a^{(l,s)}$ is another trainable weight. The convolution support is calculated from node $v$ to each element of $\tilde{\mathcal{N}}(v)$, which denotes the neighborhood with the self-connection added. Thus, we classify GAT as a spatial-designed trainable-support graph neural network in our framework. Since its convolution supports are a function of the connected node features, a theoretical frequency response cannot be formulated. All studied models are summarised in Table 1.
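The recurrence of Theorem 2 can be checked numerically with a short sketch. The 6-node graph (a ring with one chord), the number of supports and the function names are illustrative assumptions; the supports follow the ChebNet recursion listed in Table 1.

```python
import numpy as np

def chebnet_supports(L, lam_max, num_supports):
    """C1 = I, C2 = 2L/lam_max - I, Ck = 2 C2 C(k-1) - C(k-2)."""
    n = L.shape[0]
    supports = [np.eye(n), 2.0 * L / lam_max - np.eye(n)]
    for _ in range(2, num_supports):
        supports.append(2.0 * supports[1] @ supports[-1] - supports[-2])
    return supports[:num_supports]

def chebnet_responses(lam, lam_max, num_supports):
    """Phi1 = 1, Phi2 = 2 lam/lam_max - 1, Phik = 2 Phi2 Phi(k-1) - Phi(k-2), elementwise."""
    phis = [np.ones_like(lam), 2.0 * lam / lam_max - 1.0]
    for _ in range(2, num_supports):
        phis.append(2.0 * phis[1] * phis[-1] - phis[-2])
    return phis[:num_supports]

# Hypothetical toy graph: a 6-node ring with one extra chord (0, 3).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
deg = A.sum(axis=1)
L = np.eye(6) - A / np.sqrt(np.outer(deg, deg))
lam, U = np.linalg.eigh(L)
lam_max = lam.max()

supports = chebnet_supports(L, lam_max, 5)
theory = chebnet_responses(lam, lam_max, 5)
empirical = [np.diag(U.T @ C @ U) for C in supports]            # equation 7
max_gap = max(np.abs(e - t).max() for e, t in zip(empirical, theory))
```

Here `max_gap` stays at the level of floating-point error, consistent with the claim that the ChebNet responses depend only on the eigenvalues and not on the graph structure.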

5. EXPERIMENTAL RESULTS

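This section is dedicated to an empirical spectral analysis of existing GNNs on certain graphs to validate the theoretical results, and also to a performance analysis of these GNNs on a benchmark graph dataset to demonstrate the necessity of convolution supports with various frequency responses. The implementation and the introduced datasets are publicly available.

Table 1: Summary of the studied models.

| Model | Design | Support type | Convolution support | Frequency response |
|---|---|---|---|---|
| … | … | … | $C = I$ | $\Phi(\lambda) = 1$ |
| GCN | Spatial | Fixed | $C = \tilde{D}^{-0.5} \tilde{A} \tilde{D}^{-0.5}$ | $\Phi(\lambda) \approx 1 - \lambda p/(p+1)$ |
| GIN | Spatial | Trainable | $C = A + (1 + \epsilon) I$ | $\Phi(\lambda) \approx p\left(\frac{1+\epsilon}{p} + 1 - \lambda\right)$ |
| GAT | Spatial | Trainable | $C^{(s)}_{v,u} = e_{v,u} / \sum_{k \in \tilde{\mathcal{N}}(v)} e_{v,k}$ | NA |
| CayleyNet (a) | Spectral | Trainable | $C^{(1)} = I$; $C^{(2r)} = \mathrm{Re}(\rho(hL)^r)$; $C^{(2r+1)} = \mathrm{Re}(i\rho(hL)^r)$ | $\Phi_1(\lambda) = 1$; $\Phi_{2r}(\lambda) = \cos(r\theta(h\lambda))$; $\Phi_{2r+1}(\lambda) = -\sin(r\theta(h\lambda))$ |
| ChebNet | Spectral | Fixed | $C^{(1)} = I$; $C^{(2)} = 2L/\lambda_{max} - I$; $C^{(s)} = 2C^{(2)}C^{(s-1)} - C^{(s-2)}$ | $\Phi_1(\lambda) = 1$; $\Phi_2(\lambda) = 2\lambda/\lambda_{max} - 1$; $\Phi_s(\lambda) = 2\Phi_2(\lambda)\Phi_{s-1}(\lambda) - \Phi_{s-2}(\lambda)$ |

(a) $\rho(x) = (x - iI)/(x + iI)$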

5.1. SPECTRAL ANALYSIS RESULTS

All empirical analyses are based on obtaining the convolution support matrices of a given GNN model and then applying equation 7 to obtain the frequency response. In our analysis, we used three graphs independently: the first is a 1D signal encoded as a regular circular line graph with 1001 nodes; the others are the well-known Cora and CiteSeer graphs with 2708 and 3327 nodes respectively (Yang et al., 2016). Besides, we used 2 different collections of graph datasets, ENZYMES and PROTEINS, which have 600 and 1113 graphs respectively (Kersting et al., 2016). The details of the graphs can be found in Appendix L.

Since ChebNet and CayleyNet are spectral-designed, their frequency responses do not change for different graphs. They are presented in Figure 1 for the first 5 and 7 supports respectively. The results in Figure 1 confirm the theoretical analyses in Theorem 2 and Theorem 3. The full frequency profiles are not illustrated because they consist of zeros outside the diagonal. Analyzing the frequency profile of ChebNet, one can argue that the convolutions mostly cover the spectrum. However, none of the kernels focuses on a specific part of the spectrum. As an example, the second kernel is mostly a low-pass and high-pass filter and stops the middle band, while the third one passes the very high, very low and middle bands, but stops almost all of the first and third quarters of the spectrum. Therefore, if the relation between input-output pairs can be captured by just a low-pass, high-pass or some specific band-pass filter, a high number of convolution kernels is needed. However, in the literature, only 2 or 3 kernels are generally used in experiments (Defferrard et al., 2016; Kipf & Welling, 2017).

The scale parameter $h$ in CayleyNet affects the x-axis scaling, but does not change the global shape. When $h = 1$, frequency profiles are defined within the range $[0, 2]$ (because $\lambda_{max} = 2$ in all three test graphs). If $h = 1.5$, the frequency profiles in Figure 1 are defined up to $1.5\lambda_{max} = 3$, and the axis labels are rescaled from $[0, 3]$ back to the original range $[0, 2]$. Learning the scaling of eigenvalues may seem advantageous. However, it induces extra computational cost in order to calculate the new convolution supports in every learning epoch. In addition, similarly to ChebNet, CayleyNet does not have any band-specific convolutions, even when considering different scaling factors.

As Theorem 4 and Theorem 5 demonstrate, the frequency responses of GCN and GIN depend on the average node degree. GCN's cut-off frequency decreases as $p$ increases, while $p$ acts as a scaling factor on GIN's frequency response. This analysis leads us to understand that GCN works as a low-pass filter and does not cover the whole spectrum. This approach is not able to learn relations that can be represented by high-pass or band-pass filtering. Hence, even though it gives very good results on a single-graph node classification problem in Kipf & Welling (2017), it may fail for problems where discriminant information lies in particular frequency bands. Therefore, such an approach can be considered problem specific. In order to create some variation between low-pass and high-pass, having a trainable parameter in GIN's convolution support seems advantageous. But, since it is not spectral-designed, there is no guarantee that it has exactly the same spectral profile for different graphs. Besides, its low-pass shape (where $\epsilon$ is high) is a linearly decreasing function, thus it is not the strong low-pass filter that natural graph problems generally need. Using more stacked layers may be a solution. In addition, this convolution cannot focus on specific bands if the problem requires it.

Since GAT's convolution supports are a function of the connected node features, its frequency profiles cannot be directly computed in the same way as the previous ones.
Thus, we propose to obtain the frequency response in two ways: one is the expected frequency response over simulations, the other is the frequency response of a trained model for a specific graph learning problem. We calculated the expected frequency responses of GAT convolution supports on the Cora graph by simulation with 240 randomly created possible attention weights. The expected value of the simulated support's frequency response and its standard deviation are shown in Figure 3a. This result gives an idea of the capability of the model in the spectral domain, without being the true learned convolution support. In addition, the simulation covers only the first layer, because the first layer's input is known without learning. Besides, we also provide in Figure 3b-c the frequency responses of all learned GAT attention heads in all layers for all the graphs of the ENZYMES and PROTEINS datasets respectively (in our model, there are two GNN layers consisting of 25 attention heads). Since there is no significant ... As one can see, the mean standard frequency profile has a shape similar to those of GCN and GIN-0, which are methods that use the adjacency matrix with added self-loops (normalized or not) as convolution support. Variations in the frequency profile induce more variations in the output signal when compared to GCN and GIN-0. However, the variation in the frequency profile might not be sufficient in problems that need some specific band-pass filters.
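The empirical procedure of this section (build the support, then apply equation 7) can be sketched for GCN and GIN on a ring graph. The sketch assumes a smaller ring than the 1001-node graph used above (101 nodes, to keep it fast) and a hypothetical value $\epsilon = 0.5$; on a regular graph the approximations of Theorem 4 and Theorem 5 hold exactly, which makes the comparison easy to check.

```python
import numpy as np

# Ring graph: every node has degree p = 2 (assumed size: 101 nodes).
n = 101
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1.0
A[(idx + 1) % n, idx] = 1.0

deg = A.sum(axis=1)
p = deg.mean()                                          # average node degree
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))         # normalized Laplacian
lam, U = np.linalg.eigh(L)

# GCN support: normalized adjacency with self-loops.
A_tilde = A + np.eye(n)
deg_tilde = A_tilde.sum(axis=1)
C_gcn = A_tilde / np.sqrt(np.outer(deg_tilde, deg_tilde))
phi_gcn = np.diag(U.T @ C_gcn @ U)                      # empirical response (equation 7)
phi_gcn_theory = 1.0 - lam * p / (p + 1.0)              # Theorem 4

# GIN support with a hypothetical value of the trainable scalar epsilon.
eps = 0.5
C_gin = A + (1.0 + eps) * np.eye(n)
phi_gin = np.diag(U.T @ C_gin @ U)                      # empirical response (equation 7)
phi_gin_theory = p * ((1.0 + eps) / p + 1.0 - lam)      # Theorem 5
```

With $p = 2$, the GCN response decreases linearly from 1 at $\lambda = 0$ and the GIN response is a linear function of $\lambda$ scaled by $p$, matching the qualitative discussion above.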

5.2. PERFORMANCE ANALYSIS OF GNNS

First, our goal was to assess empirically the ability of GNN models to produce the desired frequency effects on given graph signals. From the experiments detailed in Appendix I, we can outline the empirical results as follows. GCN and GAT can produce low-pass effects but not band-pass or high-pass ones, while GAT has a better variation in its frequency profile; these empirical results corroborate the theoretical analysis of this paper, namely Theorem 4 and Section 5.1. Thanks to its trainable parameter $\epsilon$, GIN does better at producing low-pass and high-pass effects, but not band-pass ones, as demonstrated by Theorem 5. However, the spectral-designed ChebNet always outperformed the rest by a huge margin, which is not surprising.

Secondly, we measured the generalization capability of the GNN models on a graph classification task, as a toy example where the graph classes depend on the frequency of the signal on the graph. The experiments are described in Appendix J. Again, we saw that GCN and GAT performed worse than GIN, because of their inability to capture the necessary frequency components. Besides, thanks to its spectral-designed convolutions, ChebNet can capture the underlying patterns on the graphs and finally achieves better results.

Some graph problems naturally need only low-pass filtering, as we argue in Appendix K. Having spectral ability may increase the complexity of the model, which may have a negative effect. However, some other problems might need various kinds of filters, like image understanding problems. In our last experiment, we use the superpixel version of the MNIST dataset (MNIST-75) to show an example of graph problems that need various filtering. In MNIST-75, images are segmented into around 75 regions by the SLIC superpixel segmentation algorithm (Achanta et al., 2012). Regions constitute the nodes of the graph, and edges correspond to connections between regions in the image. The average pixel value of each region is assigned to its node, giving one continuous value. The dataset also includes the center position of each region, but we excluded that information to make the problem more realistic and harder in terms of graph research. The dataset consists of 55K graphs for training, 5K graphs for validation and 10K for testing. Details and some illustrations of the dataset can be found in Appendix L. We use 3 hidden graph convolution layers that have 64, 128, and 128 features respectively, followed by a global mean operator as graph readout layer, and ended by a fully connected layer with 10 outputs corresponding to the number of classes. To understand the effect of graph convolution, we apply the tests on 3 different inputs: the first one uses node degree as feature, the second one uses 

6. FINAL REMARKS

In this paper, we have shown that the most influential graph convolutions, such as (Kipf & Welling, 2017; Veličković et al., 2018), operate as low-pass filters, and that some, such as (Xu et al., 2019), have a very limited ability to produce a high-pass effect in addition to low-pass filtering. Interestingly, while being restricted to low-pass filters, they obtain state-of-the-art performance on reference node classification problems such as Cora, CiteSeer and Pubmed (Yang et al., 2016). These good results on these particular problems are induced by the nature of the graphs to be processed. Indeed, citation network problems, which are heavily assortative, are inherently low-pass filtering problems. It is worth noting that, if we use enough convolution kernels, the frequency response of ChebNet kernels (Defferrard et al., 2016; Levie et al., 2019) covers nearly all frequency profiles. However, these frequency responses are not specific to particular frequency bands. This means that they can act as high-pass filters, but not as Gabor-like special band-pass filters, if a low number of convolution supports is used (e.g. 3). Getting an arbitrary band-pass effect requires a large number of convolution kernels, which makes the convolution not spatially-localized and increases the computational complexity. In conclusion, we claim that graph convolutions are problem specific and not problem agnostic. To have problem-agnostic solutions, graph convolutions need to be able to produce the necessary frequencies, or at least plenty of different frequencies, in the output signal profile. The frequency profile of graph convolutions is not the only issue to be taken into account, but it is definitely one of the important perspectives to which we need to pay attention. We point out that using only low-pass GNNs may not be a good choice for many graph problems. Finally, the convolution design can be considered as the tuning of hyperparameters, or it can be automatically designed by a secondary unsupervised task with respect to the problem domain. Our future work will investigate this track. The experiments conducted in Section 5 provided empirical results that validate the theoretical analysis conducted in this paper.

A GENERALISATION OF FRAMEWORK

Our selection of the GNN generalization in equation 3 is shown in Figure 4a with a detailed schematic of a graph convolution layer on a sample graph signal. This framework can also generalize the Euclidean domain convolution layer. For a 2D signal convolved with a mask $W^{(l)} \in \mathbb{R}^{3 \times 3}$, we can define 9 different convolution supports, denoted by $C^{(1)}, \dots, C^{(9)} \in \{0, 1\}^{16 \times 16}$ in Figure 4, for a sample signal (e.g. an image) given by $H^{(l)} \in \mathbb{R}^{4 \times 4}$. When we stack all node values (e.g. pixels) into a column vector $H^{(l)} \in \mathbb{R}^{16 \times 1}$, and $W^{(l,s)}$ denotes the $s$-th scalar weight in the stacked $W^{(l)}$, we can write the equivalent of the Euclidean convolution as
$$H^{(l)} * W^{(l)} = \sum_{s=1}^{9} C^{(s)} H^{(l)} W^{(l,s)}.$$
One can see that in the Euclidean domain, the supports can be designed from the relative positions of the nodes, which is not the case in the graph world.
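The Euclidean case can be sketched for the 4×4 signal and 3×3 mask mentioned above. The sketch makes assumptions that the text does not state: zero padding at the image border, row-major stacking of pixels, and the deep-learning convention in which the mask is applied without flipping (cross-correlation). The helper name `euclidean_supports` is illustrative.

```python
import numpy as np

def euclidean_supports(height, width):
    """One binary support per relative offset (dy, dx) of a 3x3 mask, in row-major order."""
    n = height * width
    supports = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            C = np.zeros((n, n))
            for r in range(height):
                for c in range(width):
                    rr, cc = r + dy, c + dx
                    if 0 <= rr < height and 0 <= cc < width:
                        # pixel (r, c) receives the value of pixel (rr, cc)
                        C[r * width + c, rr * width + cc] = 1.0
            supports.append(C)
    return supports

rng = np.random.default_rng(0)
image = rng.normal(size=(4, 4))     # H^(l) as a 4x4 signal
mask = rng.normal(size=(3, 3))      # W^(l) as a 3x3 mask

supports = euclidean_supports(4, 4)             # 9 supports of size 16x16
h = image.reshape(-1, 1)                        # stacked signal, 16x1
out_framework = sum((C @ h) * w for C, w in zip(supports, mask.ravel()))

# Reference: slide the mask over the zero-padded image.
padded = np.pad(image, 1)
out_direct = np.zeros_like(image)
for r in range(4):
    for c in range(4):
        out_direct[r, c] = np.sum(padded[r:r + 3, c:c + 3] * mask)
```

Each support only encodes a relative position (for example "the pixel to the upper left"), which is the information that is unavailable on a general graph.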

B PROOF OF THEOREM 1

Proof. First, let us expand the $B$ matrix by introducing its columns, denoted $\Phi_1(\lambda), \dots, \Phi_S(\lambda) \in \mathbb{R}^n$:
$$H^{(l+1)}_j = \sigma\Big(\sum_{i=1}^{f_l} U\,\mathrm{diag}\Big(\sum_{s=1}^{S} W^{(l,s)}_{i,j}\,\Phi_s(\lambda)\Big)\,U^\top H^{(l)}_i\Big). \tag{14}$$
Now, we distribute $U$ and $U^\top$ over the inner summation:
$$H^{(l+1)}_j = \sigma\Big(\sum_{s=1}^{S} \sum_{i=1}^{f_l} U\,\mathrm{diag}\big(W^{(l,s)}_{i,j}\,\Phi_s(\lambda)\big)\,U^\top H^{(l)}_i\Big). \tag{15}$$
Then, we take the scalars $W^{(l,s)}_{i,j}$ out of the $\mathrm{diag}$ operator:
$$H^{(l+1)}_j = \sigma\Big(\sum_{s=1}^{S} \sum_{i=1}^{f_l} W^{(l,s)}_{i,j}\,U\,\mathrm{diag}(\Phi_s(\lambda))\,U^\top H^{(l)}_i\Big). \tag{16}$$
Let us define a convolution operator $C^{(s)} \in \mathbb{R}^{n \times n}$ as
$$C^{(s)} = U\,\mathrm{diag}(\Phi_s(\lambda))\,U^\top. \tag{17}$$
Using equation 16 and equation 17, we thus have
$$H^{(l+1)}_j = \sigma\Big(\sum_{i=1}^{f_l} \sum_{s=1}^{S} W^{(l,s)}_{i,j}\,C^{(s)} H^{(l)}_i\Big). \tag{18}$$
Then, each term of the sum over $s$ corresponds to a matrix $H^{(l+1)} \in \mathbb{R}^{n \times f_{l+1}}$ with
$$H^{(l+1)} = \sigma\big(C^{(1)} H^{(l)} W^{(l,1)} + \dots + C^{(S)} H^{(l)} W^{(l,S)}\big),$$
with $H^{(l)} = [H^{(l)}_1, \dots, H^{(l)}_{f_l}]$. By grouping the terms, we get
$$H^{(l+1)} = \sigma\Big(\sum_{s=1}^{S} C^{(s)} H^{(l)} W^{(l,s)}\Big),$$
which corresponds to equation 3. Therefore, equation 5 corresponds to equation 3 with $C^{(s)}$ defined as in equation 6.

C PROOF OF COROLLARY 1.1

Proof. By using equation 6 from Theorem 1, we can obtain a spatial convolution kernel $C^{(s)}$ whose frequency profile is $\Phi_s(\lambda)$. Since the eigenvector matrix is orthonormal (i.e., $U^{-1} = U^\top$), we can extract $\Phi_s(\lambda)$, which yields equation 7.
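The equivalence established in the proof of Theorem 1 can also be checked numerically. The following sketch evaluates one layer in the spectral form (equation 5) and in the framework form (equation 3) and compares the results. The toy graph, the monomial responses $\Phi_s(\lambda) = \lambda^{s}$, the tanh activation and the layer sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph: a 6-node ring with one extra chord (0, 3).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
n, f_in, f_out, S = 6, 3, 2, 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
deg = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))
lam, U = np.linalg.eigh(L)

# B[k, s] = Phi_s(lambda_k); monomial responses are an assumed choice.
B = np.stack([lam ** s for s in range(S)], axis=1)
W = rng.normal(size=(S, f_in, f_out))      # W[s] plays the role of W^(l,s)
H = rng.normal(size=(n, f_in))             # columns are the feature signals H_i

# Spectral form (equation 5): filter each input feature in the Fourier domain.
H_spectral = np.zeros((n, f_out))
for j in range(f_out):
    acc = np.zeros(n)
    for i in range(f_in):
        F_ij = B @ W[:, i, j]                      # filter weights for the pair (i, j)
        acc += U @ (F_ij * (U.T @ H[:, i]))        # U diag(F_ij) U^T H_i
    H_spectral[:, j] = np.tanh(acc)

# Framework form (equation 3): precomputed supports, no Fourier transform in the layer.
supports = [U @ np.diag(B[:, s]) @ U.T for s in range(S)]
H_spatial = np.tanh(sum(C @ H @ W[s] for s, C in enumerate(supports)))
```

Both forms give the same output up to floating-point error; in the second form the eigendecomposition is only needed once, to build the supports.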

D PROOF OF THEOREM 2

ChebNet relies on the approximation of a spectral graph analysis proposed in (Hammond et al., 2011), based on the Chebyshev polynomial expansion of the scaled graph Laplacian. Even though the frequency responses of its multiple supports are known, for the sake of simplicity, it was represented in the form of equation 3 in (Defferrard et al., 2016) as follows:
$$C^{(1)} = I, \qquad C^{(2)} = 2L/\lambda_{max} - I, \qquad C^{(k)} = 2\,C^{(2)} C^{(k-1)} - C^{(k-2)}.$$
Proof. When the identity matrix is used as convolution kernel, it directly transmits the inputs to the outputs without any modification. This process is called an all-pass filter. Mathematically, we can calculate the full frequency profile for the kernel $I$ by using Corollary 1.1, namely
$$\Phi_1 = U^\top I\,U = U^\top U = I,$$
since the eigenvectors are orthonormal. Therefore, we can parametrize the diagonal of the full frequency profile by $\lambda$ and reach the standard frequency profile for the first ChebNet support as follows:
$$\Phi_1(\lambda) = \mathrm{diag}^{-1}(I) = \mathbf{1}. \tag{23}$$
We can compute the full frequency profile of the $C^{(2)}$ kernel using Corollary 1.1:
$$\Phi_2 = U^\top \Big(\frac{2}{\lambda_{max}} L - I\Big) U. \tag{24}$$
Since $U^\top I\,U = I$, equation 24 can be rearranged as
$$\Phi_2 = \frac{2}{\lambda_{max}}\,U^\top L\,U - I. \tag{25}$$
Since $\lambda = [\lambda_1, \dots, \lambda_n]$ are the eigenvalues of the graph Laplacian $L$, they must satisfy the following conditions:
$$L U = U\,\mathrm{diag}(\lambda); \tag{26}$$
$$U^\top L\,U = \mathrm{diag}(\lambda). \tag{27}$$
Substituting equation 27 into equation 25, we get
$$\Phi_2 = \frac{2}{\lambda_{max}}\,\mathrm{diag}(\lambda) - I. \tag{28}$$
This full frequency profile consists of two parts, a diagonal matrix and the negative identity matrix. Therefore, we can parametrize the diagonal of the full frequency matrix to show the standard frequency profile as follows:
$$\Phi_2(\lambda) = \mathrm{diag}^{-1}(\Phi_2) = \frac{2\lambda}{\lambda_{max}} - \mathbf{1}. \tag{29}$$
For the third and following ChebNet supports, Corollary 1.1 gives the corresponding frequency profile
$$\Phi_k = U^\top \big(2\,C^{(2)} C^{(k-1)} - C^{(k-2)}\big) U. \tag{30}$$
By expanding equation 30, we get
$$\Phi_k = 2\,U^\top C^{(2)} C^{(k-1)} U - U^\top C^{(k-2)} U. \tag{31}$$
Since $U U^\top = I$, we can insert the product $U U^\top$ into equation 31.
Thus, we have

$$\Phi_k = 2U^\top C^{(2)} U U^\top C^{(k-1)} U - U^\top C^{(k-2)} U, \tag{32}$$

$$\Phi_k = 2\left(U^\top C^{(2)} U\right)\left(U^\top C^{(k-1)} U\right) - U^\top C^{(k-2)} U. \tag{33}$$

Since $\Phi_{k'} = U^\top C^{(k')} U$ for any $k'$, it yields:

$$\Phi_k = 2\Phi_2 \Phi_{k-1} - \Phi_{k-2}. \tag{34}$$

Since $\Phi_1$ and $\Phi_2$ are diagonal matrices, the frequency profiles of all remaining kernels obtained from equation 34 are diagonal matrices as well. Therefore, we can write the corresponding standard frequency profiles of the third and following ChebNet convolution supports as follows:

$$\Phi_k(\lambda) = 2\Phi_2(\lambda)\Phi_{k-1}(\lambda) - \Phi_{k-2}(\lambda). \tag{35}$$

E PROOF OF THEOREM 3

Proof. CayleyNet was originally defined through the weight vector parametrization $F^{(l,j)}_i = [g_{i,j,l}(\lambda_1, h), \dots, g_{i,j,l}(\lambda_n, h)]^\top$ in equation 1, where the function $g(\cdot,\cdot)$ is defined in (Levie et al., 2019) by

$$g(\lambda, h) = c_0 + 2\,\mathrm{Re}\left\{\sum_{k=1}^{r} c_k \left(\frac{h\lambda - i}{h\lambda + i}\right)^{k}\right\}, \tag{36}$$

where $i^2 = -1$, $\mathrm{Re}(\cdot)$ is the function that returns the real part of a given complex number, $c_0$ is a trainable real coefficient, and $c_1, \dots, c_r$ are complex trainable coefficients. We can write $h\lambda - i$ in Euler form as $\sqrt{h^2\lambda^2 + 1}\; e^{i\,\mathrm{atan2}(-1, h\lambda)}$ and $h\lambda + i$ as $\sqrt{h^2\lambda^2 + 1}\; e^{i\,\mathrm{atan2}(1, h\lambda)}$. By this substitution, the two moduli cancel and equation 36 becomes

$$g(\lambda, h) = c_0 + 2\,\mathrm{Re}\left\{\sum_{k=1}^{r} c_k\, e^{ik\left(\mathrm{atan2}(-1, h\lambda) - \mathrm{atan2}(1, h\lambda)\right)}\right\}, \tag{37}$$

where $\mathrm{atan2}(y, x)$ is the inverse tangent function, which finds the angle (in the range $[-\pi, \pi]$) of a point given its $y$ and $x$ coordinates. For further simplification, let us introduce the function $\theta(\cdot)$ defined by

$$\theta(x) = \mathrm{atan2}(-1, x) - \mathrm{atan2}(1, x). \tag{38}$$

Since the $c_k$ are complex numbers, we can write them as a sum of real and imaginary parts, $c_k = a_k/2 + i\,b_k/2$ (the scale factor 2 is added for convenience). Thus, equation 37 can be rewritten as follows:

$$g(\lambda, h) = c_0 + \mathrm{Re}\left\{\sum_{k=1}^{r} (a_k + i\,b_k)\, e^{ik\theta(h\lambda)}\right\}. \tag{39}$$

By Euler's formula, we can replace $e^{ik\theta(h\lambda)}$ with $\cos(k\theta(h\lambda)) + i\sin(k\theta(h\lambda))$. Once the imaginary components are removed by the $\mathrm{Re}(\cdot)$ function, equation 39 becomes

$$g(\lambda, h) = c_0 + \sum_{k=1}^{r} a_k \cos(k\theta(h\lambda)) - b_k \sin(k\theta(h\lambda)). \tag{40}$$
In this definition, there are no complex coefficients, only real coefficients ($c_0$, $a_k$ and $b_k$ for $k = 1, \dots, r$) to be tuned by training. Using the form in equation 40, we can parametrize CayleyNet by the parametrization matrix $B \in \mathbb{R}^{n\times(2r+1)}$ such that

$$[g(\lambda_1, h), \dots, g(\lambda_n, h)]^\top = B\,[c_0, a_1, b_1, \dots, a_r, b_r]^\top. \tag{41}$$

The $s$-th column vector of the matrix $B$, denoted $B_s$, must fulfill the following conditions:

$$B_s = \Phi_s(\lambda) = \begin{cases} \mathbf{1} & \text{if } s = 1 \\ \cos\left(\frac{s}{2}\,\theta(h\lambda)\right) & \text{if } s \in \{2, 4, \dots, 2r\} \\ -\sin\left(\frac{s-1}{2}\,\theta(h\lambda)\right) & \text{if } s \in \{3, 5, \dots, 2r+1\} \end{cases} \tag{42}$$

We can see CayleyNet as a spectral graph convolution that uses $2r + 1$ convolution kernels. The first kernel is an all-pass filter, and the frequency profiles of the remaining $2r$ kernels ($\Phi_s(\lambda)$) are created using sine and cosine functions, with a parameter $h$ used to scale the eigenvalues in equation 42. CayleyNet includes the tuning of this scaling parameter in the training pipeline. Considering equation 6 in Theorem 1, we can write CayleyNet's convolutions ($C^{(s)}$) in the spatial domain. Note that because of the function definition in equation 38, $\theta(h\lambda)$ is not linear in $\lambda$. Therefore, $\Phi_s$ cannot be a perfect sinusoid in $\lambda$.
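The chain of rewrites in this proof can be verified numerically. The sketch below is illustrative only: the function names (`theta`, `g_complex`, `cayley_basis`), the eigenvalue grid, the value of $h$ and the random coefficients are all assumptions made for the example. It evaluates the complex Cayley form of $g$ and compares it with the real parametrization $B\,[c_0, a_1, b_1, \dots, a_r, b_r]^\top$, using $c_k = a_k/2 + i\,b_k/2$.

```python
import numpy as np

def theta(x):
    """theta(x) = atan2(-1, x) - atan2(1, x), as in the proof above."""
    return np.arctan2(-1.0, x) - np.arctan2(1.0, x)

def g_complex(lam, h, c0, c):
    """Complex form: c0 + 2 Re sum_k c_k ((h*lam - i) / (h*lam + i))**k."""
    z = (h * lam - 1j) / (h * lam + 1j)
    k = np.arange(1, len(c) + 1)
    terms = c[:, None] * z[None, :] ** k[:, None]   # shape (r, len(lam))
    return c0 + 2.0 * np.real(terms.sum(axis=0))

def cayley_basis(lam, h, r):
    """Columns of B: 1, cos(t), -sin(t), cos(2t), -sin(2t), ... with t = theta(h*lam)."""
    t = theta(h * lam)
    cols = [np.ones_like(lam)]
    for k in range(1, r + 1):
        cols.append(np.cos(k * t))
        cols.append(-np.sin(k * t))
    return np.stack(cols, axis=1)

# Illustrative values (assumptions, not taken from the paper's experiments).
lam = np.linspace(0.0, 2.0, 50)
h, r = 1.5, 3
rng = np.random.default_rng(0)
c0 = rng.normal()
a, b = rng.normal(size=r), rng.normal(size=r)
c = a / 2.0 + 1j * b / 2.0

# Real parameter vector ordered as [c0, a1, b1, ..., ar, br].
params = np.concatenate([[c0], np.stack([a, b], axis=1).ravel()])
B = cayley_basis(lam, h, r)
print(np.allclose(B @ params, g_complex(lam, h, c0, c)))
```

The same `theta` helper also reproduces the special values $\theta(0) = -\pi$ and $\theta(1) = -\pi/2$, which follow directly from the definition of atan2.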

F PROOF OF THEOREM 4

One major simplification of ChebNet is the Graph Convolution Network (GCN) (Kipf & Welling, 2017), which has a single convolution support and was already presented in our framework in equation 3. The first proposal of that paper subtracts the second ChebNet support from the first one, under the assumption that $\lambda_{\max} = 2$ and $L$ is the normalized graph Laplacian. It is defined by

$$C_{GCN^*} = C^{(1)} - C^{(2)} = 2I - L.$$

Proposition 1. $C_{GCN^*} = 2I - L$ is a spectral-designed support and its frequency response is $\Phi_{GCN^*}(\lambda) = 2 - \lambda$.

Proof. If the proposition is true, the support should satisfy

$$2I - L = U\,\mathrm{diag}(2 - \lambda)\,U^\top,$$

which can also be written as

$$2I - L = 2\,U I U^\top - U\,\mathrm{diag}(\lambda)\,U^\top.$$

Since $U I U^\top = I$ and $U\,\mathrm{diag}(\lambda)\,U^\top = L$, the proposition is true.

One can see that the first proposal of GCN is spectral-designed and that it is a low-pass filter. That is why GCN is misclassified as a spectral approach in the literature (Wu et al., 2019b; Chami et al., 2020). However, instead of using this version, GCN used a re-normalization trick and defined its final single convolution support as:

$$C_{GCN} = (D + I)^{-1/2} (A + I) (D + I)^{-1/2}, \tag{45}$$

where $D$ is the diagonal degree matrix and $A$ is the adjacency matrix.

Proposition 2. The frequency response of $C_{GCN} = (D + I)^{-1/2} (A + I) (D + I)^{-1/2}$ is $\Phi_{GCN}(\lambda) = 1 - \frac{p}{p+1}\lambda$ for regular graphs whose node degrees are $p$.

Proof. When all node degrees are $p$, we can write the diagonal degree matrix as $D = pI$. It yields $L = I - A/p$, or $A = pI - pL$. When we substitute these expressions of $A$ and $D$ into the GCN support, we get

$$C_{GCN} = \frac{pI - pL + I}{p + 1} = I - \frac{p}{p+1} L. \tag{46}$$

If the given frequency response is true, the support should satisfy the following condition:

$$I - \frac{p}{p+1} L = U\,\mathrm{diag}\left(\mathbf{1} - \frac{p}{p+1}\lambda\right) U^\top. \tag{47}$$

Since $U\,\mathrm{diag}(\mathbf{1})\,U^\top = I$ and $U\,\mathrm{diag}(\lambda)\,U^\top = L$, the condition in equation 47 is satisfied.

This proposition shows that the GCN frequency profile acts as a low-pass filter. For instance, for a regular graph whose node degrees are all $p = 2$, the frequency profile is $1 - 2\lambda/3$.
Since the normalized graph Laplacian eigenvalues are in the range $[0, 2]$, the filter magnitude linearly decreases until the third quarter of the spectrum (the cut-off frequency), where it reaches zero. Then it linearly increases until the end of the spectrum. This explains the shape of the frequency profile of GCN convolutions for the 1D regular graph observed in Figure 2a (blue curve). However, this conclusion cannot explain the perturbations on the GCN frequency profile. To analyse this point, we relax the assumption $D = pI$ and rewrite equation 45 as follows before starting the proof:

$$C_{GCN} = (D + I)^{-1} + (D + I)^{-1/2} A (D + I)^{-1/2}. \tag{48}$$

Proof. We can see that the GCN kernel consists of two parts, $C_{GCN} = c_1 + c_2$, where the first part is given by $c_1 = (D + I)^{-1}$ and the second one is $c_2 = (D + I)^{-1/2} A (D + I)^{-1/2}$. The second part ($c_2$) can be written using the element-wise multiplication operator (Hadamard multiplication):

$$c_2 = A \odot \left(\frac{1}{\sqrt{d + 1}} \cdot \frac{1}{\sqrt{d + 1}}^{\top}\right), \tag{49}$$

where $d$ is the column degree vector $d = \mathrm{diag}(D)$, and the division and square root are also element-wise (Hadamard) operations. With the same notation, we can rewrite the second Chebyshev kernel, assuming that $\lambda_{\max} = 2$:

$$C^{(2)} = -A \odot \left(\frac{1}{\sqrt{d}} \cdot \frac{1}{\sqrt{d}}^{\top}\right). \tag{50}$$

The two expressions in equation 49 and equation 50 show that the negative of $c_2$ is an approximation of the second Chebyshev kernel if the vector $d$ consists of identical values, as was assumed in Proposition 2. When the vector $d$ is composed of different values, the two matrices $\frac{1}{\sqrt{d}} \cdot \frac{1}{\sqrt{d}}^{\top}$ and $\frac{1}{\sqrt{d+1}} \cdot \frac{1}{\sqrt{d+1}}^{\top}$ are not proportional for each coordinate (i.e., entry). To obtain $c_2$ from $C^{(2)}$, we need to use different coefficients for each coordinate of the kernel. If the differences between node degrees are large, these coefficients have a strong influence, and $c_2$ may be very different from $C^{(2)}$. Conversely, if the node degrees are quite uniform, these coefficients may be neglected. This phenomenon is the first cause of perturbation on the GCN frequency profile.
The first part ($c_1$) of the GCN kernel in equation 48 is more interesting. It is a diagonal matrix that shows the contribution of each node in the convolution process. Instead of looking for some approximations of known frequency profiles such as those of Chebyshev kernels, we can write [...]

If the given frequency response is true, the support should satisfy the following condition:

$$(p + 1 + \epsilon) I - pL = U\,\mathrm{diag}(p + \epsilon + 1 - p\lambda)\,U^\top.$$

Expanding the right-hand side, with $\mathrm{diag}(p + \epsilon + 1) = (p + 1 + \epsilon) I$, we obtain:

$$(p + 1 + \epsilon) I - pL = (p + 1 + \epsilon)\, U I U^\top - p\, U\,\mathrm{diag}(\lambda)\,U^\top.$$

Since $U I U^\top = I$ and $U\,\mathrm{diag}(\lambda)\,U^\top = L$, the condition in equation 58 is satisfied. By relying on Proposition 3, we establish Theorem 5 as follows.

Proof. Even for a regular graph, the theoretical frequency response of GIN is not the same for all graphs, since it depends on the node degree; thus GIN is not spectral-designed. In addition, we can see the GIN convolution support as the sum of two matrices, where the second one, $(1 + \epsilon) I$, is diagonalizable by the eigenvectors $U$ of the graph Laplacian with $\Phi = 1 + \epsilon$. Thus, the second part of the GIN support is spectral. However, the first part, which is the adjacency matrix $A$, cannot in general be diagonalized by $U$. Since the convolution support is not diagonalizable by $U$, we cannot write the exact frequency response of the GIN convolution, but only the approximation given by Proposition 3, assuming that the average node degree of the graph is $p$:

$$\Phi_{GIN}(\lambda) \approx p\left(\frac{1 + \epsilon}{p} + 1 - \lambda\right).$$
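For a regular graph, the closed-form responses discussed in Appendices F and G can be reproduced in a few lines. The following sketch assumes a ring graph with $p = 2$; the variable and function names (`gin_support`, `gin_cutoff`) are hypothetical. It builds the re-normalized GCN support and the GIN support $A + (1 + \epsilon) I$, extracts their frequency profiles via $\mathrm{diag}(U^\top C U)$, and evaluates the zero crossing of the approximate GIN response, $\lambda = 1 + (1 + \epsilon)/p$.

```python
import numpy as np

# Ring graph: every node has degree p = 2 (an assumption for this sketch).
n, p = 16, 2
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1.0
A[(idx + 1) % n, idx] = 1.0

L = np.eye(n) - A / p                 # normalized Laplacian when D = p I
lam, U = np.linalg.eigh(L)

# GCN support with the re-normalization trick: (D+I)^{-1/2} (A+I) (D+I)^{-1/2}.
d_hat = 1.0 / np.sqrt(A.sum(axis=1) + 1.0)
C_gcn = d_hat[:, None] * (A + np.eye(n)) * d_hat[None, :]
phi_gcn = np.diag(U.T @ C_gcn @ U)    # expected: 1 - p/(p+1) * lam

def gin_support(A, eps):
    """GIN convolution support A + (1 + eps) I."""
    return A + (1.0 + eps) * np.eye(A.shape[0])

# Expected profile on a regular graph: p + eps + 1 - p * lam.
phi_gin = {eps: np.diag(U.T @ gin_support(A, eps) @ U) for eps in (-2, -1, 0, 1)}

def gin_cutoff(p, eps):
    """Zero crossing of p * ((1 + eps) / p + 1 - lam)."""
    return 1.0 + (1.0 + eps) / p

print(np.allclose(phi_gcn, 1.0 - p / (p + 1) * lam))
print([gin_cutoff(p, eps) for eps in (1, 0, -1, -2)])
```

On irregular graphs the matrix $U^\top C U$ is no longer diagonal, so the same code would only give the diagonal part of the full frequency profile.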

H ADDITIONAL RESULTS ON SPECTRAL ANALYSIS

H.1 CHEBNET

To obtain the empirical frequency responses of the ChebNet supports, we used the regular 1D graph, Cora and CiteSeer. As predicted by the theoretical analysis, in all cases the frequency responses do not depend on the graph. The magnitudes (absolute values) of the frequency responses are shown in Figure 1a. As stated by Theorem 2, the first two kernel frequency profiles of ChebNet are $\Phi_1(\lambda) = \mathbf{1}$ and $\Phi_2(\lambda) = 2\lambda/\lambda_{\max} - \mathbf{1}$, where $\mathbf{1}$ is the vector of ones. Since $\lambda_{\max} = 2$ for all graphs that we used in the analysis, we get $\Phi_2(\lambda) = \lambda - \mathbf{1}$. The third and following kernel frequency profiles can also be computed using $\Phi_k(\lambda) = 2\Phi_2(\lambda)\Phi_{k-1}(\lambda) - \Phi_{k-2}(\lambda)$, which gives for example $\Phi_3(\lambda) = 2\lambda^2 - 4\lambda + \mathbf{1}$ for the third kernel. One can easily check these functions over the range $[0, 2]$ against the corresponding plots in Figure 1a. Thanks to the Chebyshev polynomial expansion, we do not need to calculate the supports by eigendecomposition, which makes the method computationally efficient. Besides, as it is spectral-designed, ChebNet covers the whole spectrum. Theoretically, it can create all necessary filters if we use many kernels and stack the layers back to back. However, the frequency responses of higher-order supports are less smooth than those of lower-order ones, so the transferability of the graph convolution is not guaranteed. For this reason, only a few supports (up to the first 3) are generally used in the literature (Defferrard et al., 2016; Kipf & Welling, 2017).
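The recurrence above is easy to confirm on a small graph. The sketch below is an illustration under stated assumptions: it uses a 12-node ring graph (for which $\lambda_{\max} = 2$), and the helper names `chebnet_supports` and `chebnet_profiles` are hypothetical. It builds the supports in the spatial domain without any eigendecomposition, then compares their empirical profiles $\mathrm{diag}(U^\top C^{(k)} U)$ with the scalar recurrence.

```python
import numpy as np

def chebnet_supports(L, lam_max, K):
    """C1 = I, C2 = 2L/lam_max - I, Ck = 2 C2 C(k-1) - C(k-2)."""
    n = L.shape[0]
    supports = [np.eye(n), 2.0 * L / lam_max - np.eye(n)]
    for _ in range(2, K):
        supports.append(2.0 * supports[1] @ supports[-1] - supports[-2])
    return supports[:K]

def chebnet_profiles(lam, lam_max, K):
    """Phi1 = 1, Phi2 = 2 lam/lam_max - 1, Phik = 2 Phi2 Phi(k-1) - Phi(k-2)."""
    phis = [np.ones_like(lam), 2.0 * lam / lam_max - 1.0]
    for _ in range(2, K):
        phis.append(2.0 * phis[1] * phis[-1] - phis[-2])
    return phis[:K]

# Ring graph with an even number of nodes, so that lam_max = 2.
n, K, lam_max = 12, 5, 2.0
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1.0
A[(idx + 1) % n, idx] = 1.0
L = np.eye(n) - A / 2.0
lam, U = np.linalg.eigh(L)

supports = chebnet_supports(L, lam_max, K)
profiles = chebnet_profiles(lam, lam_max, K)
empirical = [np.diag(U.T @ C @ U) for C in supports]

print(all(np.allclose(e, t) for e, t in zip(empirical, profiles)))
print(np.allclose(profiles[2], 2.0 * lam**2 - 4.0 * lam + 1.0))
```

The eigendecomposition is used here only to check the result; the supports themselves are computed from $L$ alone, which is the source of ChebNet's computational efficiency.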

H.2 CAYLEYNET

Since CayleyNet is spectral, the frequency responses of its supports are consistent and do not change with the graph structure. As a result, we obtain the same empirical results for the 1D graph, Cora and CiteSeer. The result of Theorem 3 can be compared with the corresponding supports in Figure 1b. For instance, the frequency response of the first support is $\Phi_1(\lambda) = \mathbf{1}$, as it is an all-pass filter (blue plot in Figure 1b). If we assume that the zoom parameter is $h = 1$, the frequency response of the second support becomes $\Phi_2(\lambda) = \cos(\theta(\lambda))$. To confirm that result, we can first check the case where $\lambda = 0$. Since $\theta(0) = -\pi$, we have $\Phi_2(0) = -1$, whose magnitude (absolute value) in the diagram is 1. Next, we can check the case where $\lambda = 1$. Since $\theta(1) = -\pi/2$, we have $\Phi_2(1) = 0$: as seen in the orange plot in Figure 1b, $\lambda = 1$ is the cut-off frequency of the second support of CayleyNet. Having multiple supports with different frequency responses lets the convolution produce a richer variety of output signal profiles. Moreover, by learning the zoom parameter, we can theoretically shrink (higher $h$ value) or expand (smaller $h$ value) the frequency responses as needed by the problem. However, this makes the supports non-static: they need to be recalculated in each learning epoch. To limit the induced computational cost, an approximation is computed using a fixed number of Jacobi iterations (Levie et al., 2019), but this still does not seem efficient on benchmark problems. Instead, a fixed value of $h$ might be used, and $h$ can be treated as a hyperparameter to be tuned on the validation set. Besides, CayleyNet has no band-specific supports, but band-pass filters might be obtained by using multiple supports and stacked layers.

H.3 GCN

By a clear margin, GCN (Kipf & Welling, 2017) is the most popular method in the GNN literature, thanks to its simplicity and relatively good results on some benchmark datasets. However, as we prove in Theorem 4 (Appendix F), it is a low-pass filter. Since it has a single support, and hence a single kind of filter which is low-pass, stacking that layer in a deep architecture will not work: it continuously smooths the signal on the graph, so that only a smoothed signal remains at the final layer. The three standard frequency responses in Figure 2a have almost the same low-pass filter shape. It corresponds to a function composed of a decreasing part over the first three quarters of the eigenvalue range, followed by an increasing part over the remaining range. This observation is coherent with the theoretical analysis. Hence, kernels used in GCN are transferable across the three graphs at hand. In Figure 2a, the cut-off frequency of the 1D linear circular graph is exactly 1.5, while it is about 1.35 for CiteSeer. This observation can be explained by the fact that in a 1D linear circular graph all nodes have the same degree ($p = 2$), hence $\lambda_{cut} = 1.5$. Since the average node degree in CiteSeer is 2.77, we get $\lambda_{cut} \approx 1.36$. Concerning the full frequency responses, there is no contribution outside the diagonal for the regular line graph (Figure 5a). Conversely, some off-diagonal values are not null for CiteSeer and Cora (Figure 5b-c). Again, this observation confirms the theoretical analysis in Appendix F. We also provide heat maps of the frequency responses of the GCN convolution on more realistic biological graph datasets, ENZYMES and PROTEINS, in Figure 6. The majority of the graph frequency responses have almost the same shape. However, some graph frequency responses are far from the expected frequency response, as illustrated by the lighter colors in the heat map.

H.4 GIN

The GIN model defined in Xu et al. (2019) is not just a layer but a mini multi-layer model.
The first layer is the main graph convolution layer, which is the one we analyzed, followed by at least one but preferably two MLP layers. According to Theorem 5, the frequency response of the main GIN convolution has a magnitude of $1 + \epsilon + p$ at the zero eigenvalue, and it decreases with respect to the eigenvalue. The outer $p$ is a scaling factor of the frequency response and does not affect the character of the filter. When $\epsilon$ increases, the cut-off frequency of the convolution increases, which produces a stronger low-pass effect. On the other hand, when $\epsilon$ decreases, the cut-off decreases, which produces a stronger high-pass effect, while the inner $p$ can be seen as a scaling of this effect. Theoretically, we can say that GIN's cut-off frequency is $\lambda_{cut} \approx 1 + (1 + \epsilon)/p$, which is the same as that of GCN if $\epsilon = 0$. One can easily calculate the frequency response of the adjacency matrix used as a convolution support, where $C = A$. It can be seen as a special GIN support where $\epsilon = -1$. Thus $\Phi_A(\lambda) \approx p(1 - \lambda)$. The formulation is almost the same as the one given by (NT & Maehara, 2019). It differs by the scaling factor and by being an approximation of the regular graph case. Since the eigenvalues of the normalized Laplacian lie in the interval $[0, 2]$, it works as a notch-like band-stop filter for the intermediate frequency ($\lambda = 1$). However, in most applications, eigenvalues greater than 1 are less likely, so there are fewer high-frequency components to pass. As a result, using the adjacency matrix most likely has a low-pass effect, as concluded in (NT & Maehara, 2019). The experimental analysis of the spectral behavior of GIN (Xu et al., 2019) first requires computing the convolution kernel as given in equation 56 for $\epsilon \in \{-2, -1, 0, 1\}$. Then, the spectral representation of the obtained convolution matrix can be calculated using Corollary 1.1. This result leads to the frequency profiles illustrated in Figure 2b-c (1D and CiteSeer graph) and Figure 7 (Cora graph). For the regular 1D circular graph where $p = 2$, the frequency responses are exactly what Proposition 3 indicates. As shown in Figure 2b, the cut-off frequencies are 2, 1.5, 1.0, and 0.5 for $\epsilon = \{1, 0, -1, -2\}$ respectively.
But for realistic graphs such as CiteSeer, since the average degree is $p \approx 2.77$, the cut-off frequencies are 1.72, 1.36, 1.0 and 0.63 for $\epsilon = \{1, 0, -1, -2\}$ respectively, as shown in Figure 2c. The results for the Cora graph in Figure 7b are slightly different from those for CiteSeer because the average node degree is different. Since GIN does not have a spectral-designed support, there are some non-zero components outside the diagonal of its full frequency profile, as shown in Figure 7a for the GIN-0 model on Cora. Figures 8 and 9 show the heat maps of the frequency responses of the GIN model for $\epsilon = \{1, 0, -1, -2\}$ on the ENZYMES and PROTEINS collections of graphs.

[...] $(W^{(l,s)}, a^{(l,s)})$ in GAT, frequency profiles cannot be computed directly as for the previous models. We therefore ran a series of simulations on the Cora graph. Following the configuration proposed for the Cora problem, we generated 8 different convolution supports corresponding to 8 pairs of trainable weights $W^{(l,s)} \in \mathbb{R}^{1433\times 8}$ (1433 features for each node) and $a^{(l,s)} \in \mathbb{R}^{16\times 1}$ for the first layer (Veličković et al., 2018). We produced 240 (30 for each support) random pairs of $W^{(l,s)}$ and $a^{(l,s)}$, where the activation function is LeakyReLU with a negative slope of 0.2, as in (Veličković et al., 2018). We then calculated the frequency responses of the generated supports using Corollary 1.1. The mean and standard deviation of the frequency profiles for these simulated GAT supports are shown in Figure 3a, and the mean and standard deviation of the full frequency response are shown in Figure 10. The full frequency profile is not symmetric, as seen in Figure 10a. According to Figure 10b, the variations are mostly on the right side of the diagonal of the full frequency profile. This is related to the fact that these convolution kernels are not symmetric. However, the variation in the frequency profile might not be sufficient for problems that need specific band-pass filters.
In order to obtain the frequency responses of trained attention heads on the ENZYMES and PROTEINS datasets, we randomly divided each dataset into 4 folds. We trained a 2-layer GAT model with 25 attention heads per layer using 3 folds. We then calculated the attention heads (50 $C$ matrices given in equation 12, 25 for each layer) for each graph in the test fold. The density heat maps in Figure 3b-c show the frequency responses of these attention heads for the ENZYMES and PROTEINS datasets respectively.

In this section, we seek to measure the ability of GNN models to learn some specific filtering process. This study is very important in order to understand the learning capability of existing GNN models. Since a problem may need various types of filtering, the best GNN model has to be able to learn any kind of filtering. For this purpose, we conduct an empirical analysis on a real image with a resolution of 100×100 and its corresponding 2D regular 4-neighborhood grid graph. The input of the GNN is the adjacency matrix of size 10000×10000 and the pixel intensities given as a 10000-length vector. We create three different spectral filters that correspond to low-pass, band-pass and high-pass effects and apply these filters to the given input image. The selected spectral filters are defined by $\Phi_1(\rho) = \exp(-100\rho^2)$, $\Phi_2(\rho) = \exp(-1000(\rho - 0.5)^2)$ and $\Phi_3(\rho) = 1 - \exp(-10\rho^2)$ for the low-pass, band-pass and high-pass filters respectively, where $\rho^2 = u^2 + v^2$ and $u$ and $v$ are the normalized frequencies in each direction for a given image resolution. The input image and its filtering results can be found in Figure 11. Since we do not use pixel positions, neither as node features nor as edge features, we create these spectral filters to be learned in a direction-agnostic way. Therefore, the problem can be viewed as a single-graph node regression problem, where we train the GNN models to minimize the squared error between their output and the targeted filtered image.
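The target signals for this regression task can be produced with a plain 2D FFT. The sketch below is an assumption-laden illustration rather than the authors' pipeline: it takes the normalized frequencies $u$ and $v$ to be the cycles-per-sample values returned by `np.fft.fftfreq` (the paper does not state its normalization), uses a random array in place of the real image, and the names `radial_frequency`, `FILTERS` and `apply_filter` are hypothetical.

```python
import numpy as np

def radial_frequency(height, width):
    """rho = sqrt(u^2 + v^2) on the FFT grid (u, v in cycles per sample: an assumption)."""
    u = np.fft.fftfreq(height)[:, None]
    v = np.fft.fftfreq(width)[None, :]
    return np.sqrt(u**2 + v**2)

# The three radial profiles given in the text.
FILTERS = {
    "low": lambda rho: np.exp(-100.0 * rho**2),
    "band": lambda rho: np.exp(-1000.0 * (rho - 0.5) ** 2),
    "high": lambda rho: 1.0 - np.exp(-10.0 * rho**2),
}

def apply_filter(image, profile):
    """Multiply the image spectrum by a radial profile and transform back."""
    rho = radial_frequency(*image.shape)
    filtered = np.fft.ifft2(np.fft.fft2(image) * profile(rho))
    # The profile depends only on rho, so the result is real up to rounding error.
    return np.real(filtered)

# Stand-in for the 100x100 input image (random values, not the image of Figure 11).
rng = np.random.default_rng(0)
image = rng.random((100, 100))
targets = {name: apply_filter(image, f) for name, f in FILTERS.items()}

# Flattened node features and regression targets for the 10000-node grid graph.
x = image.reshape(-1)
y = {name: t.reshape(-1) for name, t in targets.items()}
print(x.shape, {name: t.shape for name, t in y.items()})
```

Because each profile depends only on $\rho$, the resulting filters are isotropic, which matches the direction-agnostic setting described above.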
In order to assess ChebNet, GCN, GIN and GAT, we use a 3-layer GNN architecture whose input is a one-length feature (the intensity of the pixel) and whose hidden layers have 32, 64 and 64 neurons respectively; the output layer is an MLP that projects the final node representation onto the single output for each node. We used roughly 30k trainable parameters in ChebNet with 5 supports. For the other methods, we tuned the number of hidden neurons to make sure that they have a similar number of trainable parameters. Since the aim is not to assess generalization performance, we do not use any regularization or dropout to address overfitting, but simply force the GNN to learn the input-output relation. We keep iterating until there is no improvement for 100 consecutive iterations, up to a maximum of 3000 iterations. Table 3 gives the sum of squared errors between the target and the output of the trained model. One can see that ChebNet consistently outperformed GCN, GIN and GAT on all tasks. The other models did better at learning low-pass filtering than on the high-pass and band-pass tasks. This is because GCN, GIN and GAT have the ability to act as low-pass filters. In addition to doing better on the low-pass task, GIN also did relatively better on the high-pass task. Indeed, GIN can work as a high-pass filter if the parameter $\epsilon$ is negative (see Theorem 5). It turns out that the trained values of $\epsilon$ in GIN for each layer are -5.27, -2.21 and -0.47 for the high-pass task. Thanks to its spectral-designed convolution supports, ChebNet could learn the high-pass and low-pass tasks very well. However, for the band-pass task, even though it is the best in this category too, it still has large errors compared to the high-pass and low-pass tasks.
This is due to the fact that the selected band-pass filter is very narrow: the coefficient -1000 in the formulation of $\Phi_2$ makes the ChebNet used here (with 5 convolution supports and 3 layers) unable to adapt to this stiff (not smooth) filter function. Moreover, since ChebNet has no band-specific convolutions, band-specific outputs can only be produced if the number of kernels increases (going wider) and/or the model goes deeper. To clarify this point, we conducted another test on the band-pass task with ChebNet, to show the effect of going deeper and going wider (increasing the number of convolution supports) while keeping the number of trainable parameters fixed. These results are given in Table 4. According to Table 4, the ability of ChebNet to learn the given frequency response improves with the number of convolution supports and the number of layers. This result is not surprising, since it has been proved that any frequency response can be written as a weighted sum of a sufficient number of Chebyshev polynomials (Hammond et al., 2011). When we train ChebNet, it just finds the coefficients that create the target frequency response by minimizing the error. The interesting point, however, is the inability of GCN, GIN and GAT to create even reasonable approximations of these targeted filter effects. For instance, it can be seen in Figure 12 that ChebNet performed well at producing the desired band-pass output. In contrast, GAT and GCN produce just a different kind of low-pass filtering result instead of band-pass, while GIN can at least find edges (the high-pass component) thanks to its trainable parameter $\epsilon$. We also tested deeper networks for GCN, GAT and GIN and did not see any significant improvement.

J CAN GNN CLASSIFY GRAPHS ACCORDING TO FREQUENCY OF ITS SIGNAL?

In this section, we measure the generalization ability of GNNs on a graph classification problem where the graph classes depend on the signal that the graphs carry.
We generate 5000 images of 100×100 pixels composed of randomly generated frequency patterns obtained by a sinusoidal function with a frequency in the range [1, 5]. We label an image as negative if the pattern's frequency is in the ranges [2, 2.5] or [4, 4.5]. The rest of the frequency patterns are labeled as belonging to the positive class. Then, we randomly rotate and translate the image pattern, add white noise (with std=0.2) and normalize each image independently. From each image, we randomly sample 200 points in the image and apply the watershed algorithm (Meyer, 1994), where each sampled point is a marker. From this preprocessing, we generate 5000 graphs, each graph having 200 nodes. Each node corresponds to a watershed region in the image, and if two regions touch each other on the image plane, we assume that the two corresponding nodes are connected by an edge in the graph. We set the average intensity value in each region as a 1-length node feature. Even though we know the position of each region center, we do not use it in order to make the problem harder. A sample generated image, the randomly selected points and their watershed regions, and the resulting graph can be found in Figure 13 for a 30-node illustration. We divided the dataset into train/valid/test subsets, with respectively 3000, 1000 and 1000 graphs. We resampled the same number of positive and negative examples, such that the dataset is balanced. We used 3 GNN layers followed by a mean readout layer and finally two fully connected layers with 10 and 1 neurons respectively. Since the problem is a binary graph classification problem, we used the binary cross-entropy loss and no regularization. We use roughly 30k parameters in each model. Dropout is applied to the inputs of all GNN layers, and the dropout ratio is optimized with respect to the validation set performance. The results are reported in Table 5.
Since the node feature distributions are all the same across graphs (because each image was normalized independently), the MLP cannot do better than a random classifier. GCN does not perform well, probably because of its low-pass nature. Since GAT and GIN have better spectral ability than GCN, they obtain a better accuracy than GCN. Finally, ChebNet with 5 convolution supports clearly outperforms the others by a large margin. These results show that models able to capture a particular band of frequencies obtain the best results, whereas purely low-pass methods like GCN perform only slightly better than the MLP. Therefore, this toy example confirms our theoretical analysis. To conclude, we have shown that if a model is able to perform different filtering operations, it can classify graphs according to the frequency of their signal.

K WHY LOW-PASS GNNS GIVE REASONABLE RESULTS ON SEMI SUPERVISED TASKS?

In the recent literature, GNNs are generally evaluated on semi-supervised node classification problems. The most well-known datasets are the Cora, CiteSeer and PubMed paper citation graphs (Yang et al., 2016). In these graphs, each node corresponds to a paper. If one paper cites another one, there is an unlabeled and undirected edge between the corresponding nodes. Binary features on the nodes indicate the presence of specific keywords in the corresponding paper. The learning task is to assign a class to each node (i.e., paper) of the graph, using for training the graph itself and a very limited number of labeled nodes. The labeled data ratios are 5.1%, 3.6% and 0.3% for Cora, CiteSeer and PubMed respectively. Since the probability that two connected nodes are in the same class is high (0.83, 0.71 and 0.79 for Cora, CiteSeer and PubMed respectively in Liu et al. (2020)), these graphs are classified as assortative graphs. When connected nodes are highly likely to be in the same class, algorithms based on label propagation, which have a low-pass effect, can give reasonable results.
To provide empirical evidence that any ordinary low-pass filter can give results comparable to those of low-pass GNNs, we created a fixed, spectral-designed GNN with a single convolution support whose frequency response is manually designed in the spectral domain as $\Phi(\lambda) = (1 - \lambda/\lambda_{\max})^5$. This GNN model is denoted as LowPassConv in Table 6, where its average accuracy and standard deviation over 20 random runs are reported. We use the predefined train, validation and test sets defined by Yang et al. (2016) and follow the test procedure of Kipf & Welling (2017) and Veličković et al. (2018) for a fair comparison. According to Table 6, spectral-designed GNNs such as CayleyNet and ChebNet are slightly outperformed by the other methods, including our simple low-pass convolution GNN. On the other hand, GCN and GAT do not give significantly better results than the ordinary low-pass graph convolution. These results seem to conflict with the idea of having well-designed spectral graph convolutions. However, if the problem only needs a low-pass filtering effect and the model produces some unnecessary spectral components in the output, this may have a negative effect on the accuracy. Having many convolution supports with different spectral properties, instead of a single low-pass convolution support, needlessly increases the number of trainable parameters. Regularization may help to overcome this issue. This problem may also be solved by learning the convolution support in the frequency domain through a secondary unsupervised task, which is a priority for our future work.

L DATASETS AND IMPLEMENTATION DETAILS

In our experiments, we used 3 citation graph datasets, named Cora, CiteSeer and PubMed; an artificial regular graph where each node degree is 2, named 1D; 2 biological graph datasets, named PROTEINS and ENZYMES; the Band-Pass graph dataset, which we created in order to evaluate the models; and a large-scale MNIST-75 graph dataset. The details of these datasets can be found in Table 7. Two samples from the MNIST-75 dataset are shown in Figure 14.

In our tests on MNIST-75, all hyperparameters are tuned by a grid search as follows: $\ell_2$-norm regularization applied on the trainable weights in $\{0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$ and dropout in $\{0, 0.2, 0.4, 0.6\}$ for all models. In CayleyNet, we treated the zoom parameter as a hyperparameter and tuned it within the candidate values $\{0.5, 1, 1.5, 2\}$, together with $r \in \{1, 2, 3\}$, which leads to $\{3, 5, 7\}$ supports. The number of ChebNet supports is another hyperparameter, which we tuned within the set $\{3, 5, 7\}$. For GAT, we tuned the number of heads and the number of output features per head, with or without concatenation, so as to obtain the predefined number of layer features. For a 64-feature output, two architectures were studied: 8 heads with 8 outputs each, with concatenation, and 8 heads with 64 outputs each, without concatenation. The same was done for a 128-feature output layer, using the same number of heads. In each layer of the GIN model, there is one main GIN convolution layer followed by an MLP of the same size. We tested both fixed and trainable $\epsilon$ and finally chose the trainable one according to the validation set results. All activations are ReLU, except in GAT, which uses ELU (exponential linear unit). In the output layer, a linear activation is used in all models and the loss function is the cross entropy. We used the Adam optimizer with a learning rate of 0.01 without decay. We fixed the number of iterations to 100 with a batch size of 64. The test results were selected at the iteration where the validation set accuracy is maximum.



https://github.com/balcilar/gnn-spectral-expressive-power

https://graphics.cs.tu-dortmund.de/fileadmin/ls7-www/misc/cvpr/mnist-superpixels.tar.gz



Figure 1: Frequency profiles (Φ s (λ))

Figure 2: Frequency profiles of GCN on the 1D, Cora and CiteSeer graphs, and of GIN on the 1D and CiteSeer graphs with $\epsilon = 1, 0, -1, -2$

Figure 3: Frequency profiles of GAT

Figure 4: a) Schematic of the graph convolution layer defined in equation 3. The graph has 12 nodes and 12 edges. In the l-th layer, each node has a 2-length feature vector H (l) 1 and H (l) 2 represented by colors. The l + 1-th layer, it has a 3-length feature vector, denoted H (l+1) 1 , H (l+1) 2 and H (l+1) 3 . Two convolution supports C (1) and C (2) are used. This architecture has 12 trainable parameters, omitting biases if the convolution supports C (1) and C (2) are fixed. b) Graph convolution supports for 2D Euclidean domain signal to perform convolution process by 3 × 3 mask.

Figure 5: Full frequency response of GCN on 1D, Cora and CiteSeer graphs

Figure 6: Heat map of GCN's frequency profiles on ENZYMES and PROTEINS dataset graphs.

Figure 7: Full frequency profile of GIN-0 and frequency responses for different $\epsilon$ values on the Cora graph.

(1D and CiteSeer graph) and Figure 7 (Cora graph). For regular 1D circular graph where p = 2, the frequency responses are absolutely what Proposition 3 indicates. As shown in Figure

Figure 9: Heat map of GIN frequency profiles for different $\epsilon$ values on PROTEIN

Figure 10: Full frequency profile of GAT and its standard deviation.

Figure 11: Input image, and its filtering results by $\Phi_1$, $\Phi_2$ and $\Phi_3$ respectively

Figure 12: The output of GNNs trained on the band-pass task. Images are taken from ChebNet, GIN, GAT and GCN respectively.

Figure 13: Sample graph in the Band-Pass graph dataset: a randomly rotated and translated image pattern with a frequency of 1, the randomly sampled points with their watershed regions, and the graph representing the connected regions and the average region intensity values, respectively.

Figure 14: Two sample graphs in the MNIST-75 dataset (one from class 0 and one from class 1). The node locations are shown for illustration only; the models do not use the node positions.

Summary of the studied GNN models.

Test set accuracies on MNIST superpixel dataset

Table 2 gives the mean and standard deviation of the accuracy obtained on the test set over 10 runs with different random seeds. It is well known that the image version of the MNIST dataset can be processed by any ordinary CNN architecture, which is able to apply various filtering operations. Hence, we argue that the superpixel graph version of MNIST is a good candidate to show whether graph data needs various kinds of filtering. As seen in Table 2, MLP and GCN cannot do significantly better than a random classifier when using only the node degree or the pixel value as input. This means that the distribution of node degrees or pixel values alone carries no significant information for classification. When both node degrees and pixel values are given, the accuracy of GCN increases but remains behind the best results. GIN and GAT outperform GCN in each case, but their performance remains behind that of ChebNet and CayleyNet, which are spectral-designed with supports that cover the spectrum.

Sum of squared errors. All models have roughly 30k trainable parameters.

ChebNet's sum of squared errors on band-pass tasks with respect to S kernels and L stacked layers. All models have roughly 30k trainable parameters.

Test set accuracy and binary cross entropy loss.

Comparison of methods on the transductive learning problems using the publicly defined train, validation and test sets. Results are reported as accuracy.

LowPassConv 0.827 ± 0.006 0.717 ± 0.005 0.794 ± 0.005

Summary of the datasets used in our experiments.

ACKNOWLEDGMENTS

This work was partially supported by the French Agence Nationale de la Recherche (ANR), grant APi (ANR-18-CE23-0014), and the PAUSE Program (Collège de France).

annex

Published as a conference paper at ICLR 2021

its frequency profile directly. Using Corollary 1.1, we can express the frequency profile of $c_1$ in matrix form by

$$\Phi_{c_1} = U^\top c_1 U, \tag{51}$$

where $U$ is the eigenvector matrix. By taking advantage of having a diagonal kernel $c_1$, we can express each component of the full frequency profile as

$$\Phi_{c_1}(i, j) = \sum_{k=1}^{n} \frac{U_{i,k}\, U_{j,k}}{1 + d_k}, \tag{52}$$

where $n$ is the number of nodes in the graph, $d_k$ is the degree of the $k$-th node, and $U_{i,k}$ is the $k$-th element of the $i$-th eigenvector. As the eigenvectors $U_i$ and $U_j$ are orthogonal for $i \neq j$, their scalar product is null. However, in equation 52, the weighting coefficient $1/(1 + d_k)$ is not constant over all the dimensions of the eigenvectors. Therefore, there is no guarantee that $\Phi_{c_1}(i, j)$ is null. This is another reason why the GCN frequency profile has many non-zero elements outside of the diagonal.

In addition, it is also clear that the standard frequency profile of $c_1$ (the diagonal of $\Phi_{c_1}$, i.e., $(\Phi_{c_1})_{i,i}$ in equation 52) is not smooth. Indeed, the diagonal elements of $\Phi_{c_1}$ can be written as a sum of squared eigenvector elements, each weighted by $1/(1 + d_k)$. If the latter is constant for all $k$, the sum of squared eigenvector elements has to be 1 since the eigenvectors have unit L2 norm, so every diagonal element equals that constant. But in the general case where the $1/(1 + d_k)$ are not necessarily constant over all the dimensions of the eigenvectors, the diagonal of the matrix may have some perturbations.
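The behavior of the degree-weighted sum in equation 52 can be checked numerically on small graphs. The following is a minimal sketch, assuming NumPy and assuming that $U$ holds the eigenvectors of the symmetric normalized Laplacian (this excerpt does not state which Laplacian is used); the function names are illustrative and are not taken from the authors' code. It compares a regular graph (a cycle) with a strongly non-regular one (a star).

```python
import numpy as np


def c1_frequency_profile(A):
    """Full frequency profile U^T c1 U of the kernel c1 = diag(1 / (1 + d_k)).

    Assumes an undirected graph without isolated nodes, and uses the
    eigenvectors of the symmetric normalized Laplacian as the basis U.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # node degrees d_k
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, U = np.linalg.eigh(L)               # columns are orthonormal eigenvectors
    c1 = np.diag(1.0 / (1.0 + d))
    return U.T @ c1 @ U


def cycle_graph(n):
    """Adjacency matrix of a cycle: every node has degree 2."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A


def star_graph(n):
    """Adjacency matrix of a star: node 0 is connected to all other nodes."""
    A = np.zeros((n, n))
    A[0, 1:] = A[1:, 0] = 1.0
    return A


def max_off_diagonal(M):
    """Largest absolute value among the off-diagonal entries of M."""
    return np.abs(M - np.diag(np.diag(M))).max()


if __name__ == "__main__":
    for name, A in [("cycle", cycle_graph(6)), ("star", star_graph(5))]:
        Phi = c1_frequency_profile(A)
        print(name, "diagonal:", np.round(np.diag(Phi), 3),
              "max off-diagonal:", round(float(max_off_diagonal(Phi)), 3))
```

On the cycle, the weights $1/(1 + d_k)$ are all equal, so the profile is a constant multiple of the identity. On the star, the weights differ between the center and the leaves, so the diagonal is no longer constant and off-diagonal entries appear.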
The non-constant weighting $1/(1 + d_k)$ constitutes another explanation of the fact that the GCN standard frequency profile is not smooth.

On the other hand, under the assumption that the node degree distribution is uniform, we can derive the following approximation:

We can then write an approximation of the GCN frequency profile as a function of the average node degree by replacing $p$ with $\bar{p}$ and obtain the final approximation:

Therefore, we can theoretically show the cut-off frequency, namely where the GCN kernel frequency profile reaches 0, by

G PROOF OF THEOREM 5

Graph Isomorphism Network (GIN), defined in (Xu et al., 2019), has a single convolution support defined as follows:

$$C_{GIN} = A + (1 + \epsilon) I,$$

where $\epsilon$ is a trainable parameter that makes the support trainable (GIN-$\epsilon$), so that it is classified as a spatial-designed trainable-support graph convolution. Another version named GIN-0 is also defined in the same paper, where $\epsilon = 0$, which makes $C_{GIN} = A + I$; thus, the convolution becomes fixed-support and identical to the Vanilla GNN defined in Section 2.2.

The proof of Theorem 5 relies on the following proposition. (57)

