SPECFORMER: SPECTRAL GRAPH NEURAL NETWORKS MEET TRANSFORMERS

Abstract

Spectral graph neural networks (GNNs) learn graph representations via spectral-domain graph convolutions. However, most existing spectral graph filters are scalar-to-scalar functions, i.e., they map a single eigenvalue to a single filtered value, thus ignoring the global pattern of the spectrum. Furthermore, these filters are often constructed from fixed-order polynomials, which have limited expressiveness and flexibility. To tackle these issues, we introduce Specformer, which effectively encodes the set of all eigenvalues and performs self-attention in the spectral domain, leading to a learnable set-to-set spectral filter. We also design a decoder with learnable bases to enable non-local graph convolution. Importantly, Specformer is permutation equivariant. By stacking multiple Specformer layers, one can build a powerful spectral GNN. On synthetic datasets, we show that Specformer recovers ground-truth spectral filters better than other spectral GNNs. Extensive experiments on both node-level and graph-level tasks on real-world graph datasets show that Specformer outperforms state-of-the-art GNNs and learns meaningful spectrum patterns. Code and data are available at https://github.com/bdy9527/Specformer.

1. INTRODUCTION

Graph neural networks (GNNs), first proposed by Scarselli et al. (2008), have become increasingly popular in machine learning due to their empirical successes. Depending on how graph signals (or features) are leveraged, GNNs can be roughly categorized into two classes: spatial GNNs and spectral GNNs. Spatial GNNs often adopt a message passing framework (Gilmer et al., 2017; Battaglia et al., 2018), which learns useful graph representations by propagating local information on graphs. Spectral GNNs (Bruna et al., 2013; Defferrard et al., 2016) instead perform graph convolutions via spectral filters (i.e., filters applied to the spectrum of the graph Laplacian), which can learn to capture non-local dependencies in graph signals. Although spatial GNNs have achieved impressive performance in many domains, spectral GNNs remain somewhat under-explored. There are a few reasons why spectral GNNs have not been able to catch up. First, most existing spectral filters are essentially scalar-to-scalar functions: they take a single eigenvalue as input and apply the same filter to all eigenvalues. This filtering mechanism ignores the rich information embedded in the spectrum, i.e., the set of eigenvalues. For example, we know from spectral graph theory that the algebraic multiplicity of the eigenvalue 0 equals the number of connected components of the graph; such information cannot be captured by scalar-to-scalar filters. Second, spectral filters are often approximated via fixed-order (or truncated) orthonormal bases, e.g., Chebyshev polynomials (Defferrard et al., 2016; He et al., 2022) and graph wavelets (Hammond et al., 2011; Xu et al., 2019), in order to avoid the costly spectral decomposition of the graph Laplacian. Although orthonormality is a nice property, this truncated approximation is less expressive and may severely limit graph representation learning.
Therefore, in order to improve spectral GNNs, it is natural to ask: how can we build expressive spectral filters that effectively leverage the spectrum of the graph Laplacian? To answer this question, we first note that the eigenvalues of the graph Laplacian represent frequencies, i.e., the total variation of the corresponding eigenvectors. The magnitudes of frequencies thus convey rich information. Moreover, the relative difference between two eigenvalues also reflects important frequency information, e.g., the spectral gap. To capture both magnitudes of frequencies and relative frequencies, we propose a Transformer-based (Vaswani et al., 2017b) set-to-set spectral filter, termed Specformer. Specformer first encodes the range of eigenvalues via positional embedding and then exploits the self-attention mechanism to learn relative information from the set of eigenvalues. Relying on the learned representations of eigenvalues, we also design a decoder with a bank of learnable bases. Finally, by combining these bases, Specformer constructs a permutation-equivariant and non-local graph convolution. In summary, our contributions are as follows:

• We propose a novel Transformer-based set-to-set spectral filter along with learnable bases, called Specformer, which effectively captures both magnitudes and relative differences of all eigenvalues of the graph Laplacian.

• We show that Specformer is permutation equivariant and can perform non-local graph convolutions, which is non-trivial to achieve in many spatial GNNs.

• Experiments on synthetic datasets show that Specformer learns to better recover the given spectral filters than other spectral GNNs.

• Extensive experiments on various node-level and graph-level benchmarks demonstrate that Specformer outperforms state-of-the-art GNNs and learns meaningful spectrum patterns.

2. RELATED WORK

Existing GNNs can be roughly divided into two categories: spatial and spectral GNNs.

Spatial GNNs. Spatial GNNs like GAT (Velickovic et al., 2018) and MPNN (Gilmer et al., 2017) leverage message passing to aggregate local information from neighborhoods. By stacking multiple layers, spatial GNNs can possibly learn long-range dependencies but suffer from over-smoothing (Oono & Suzuki, 2020) and over-squashing (Topping et al., 2022). Therefore, how to balance local and global information is an important research topic for spatial GNNs. We refer readers to (Wu et al., 2021; Zhou et al., 2020; Liao, 2021) for a more detailed discussion of spatial GNNs.

Spectral GNNs. Spectral GNNs (Ortega et al., 2018; Dong et al., 2020; Wu et al., 2019; Zhu et al., 2021; Bo et al., 2021; Chang et al., 2021; Yang et al., 2022) leverage the spectrum of the graph Laplacian to perform convolutions in the spectral domain. A popular subclass leverages different kinds of orthogonal polynomials to approximate arbitrary filters, including Monomial (Chien et al., 2021), Chebyshev (Defferrard et al., 2016; Kipf & Welling, 2017; He et al., 2022), Bernstein (He et al., 2021), and Jacobi (Wang & Zhang, 2022) polynomials. Relying on the diagonalization of symmetric matrices, they avoid direct spectral decomposition and guarantee localization. However, all such polynomial filters are scalar-to-scalar functions, and the bases are pre-defined, which limits their expressiveness. Another subclass requires either full or partial spectral decomposition, such as SpectralCNN (Estrach et al., 2014) and LanczosNet (Liao et al., 2019). They parameterize the spectral filters by neural networks, thus being more expressive than truncated polynomials. However, such spectral filters are still limited, as they do not capture the dependencies among multiple eigenvalues.

Graph Transformers.
Transformers and GNNs are closely related, since the attention weights of a Transformer can be seen as the weighted adjacency matrix of a fully connected graph. Graph Transformers (Dwivedi & Bresson, 2020) combine both and have gained popularity recently. Graphormer (Ying et al., 2022), SAN (Kreuzer et al., 2021), and GPS (Rampásek et al., 2022) design powerful positional and structural embeddings to further improve their expressive power. Graph Transformers still belong to spatial GNNs, although the high-cost self-attention is non-local. The limitations of spatial attention compared to spectral attention are discussed in (Bastos et al., 2022).

3. PRELIMINARY

Assume that we have a graph G = (V, E), where V denotes the node set with |V| = n and E is the edge set. The corresponding adjacency matrix is A ∈ {0, 1}^{n×n}, where A_ij = 1 if there is an edge between nodes i and j, and A_ij = 0 otherwise. The normalized graph Laplacian matrix is defined as L = I_n − D^{−1/2}AD^{−1/2}, where I_n is the n × n identity matrix and D is the diagonal degree matrix with diagonal entries D_ii = Σ_j A_ij for all i ∈ V and off-diagonal entries D_ij = 0 for i ≠ j. We assume G is undirected. Hence, L is a real symmetric matrix, whose spectral decomposition can be written as L = UΛU^⊤, where the columns of U are the eigenvectors and Λ = diag([λ_1, λ_2, …, λ_n]) contains the corresponding eigenvalues, which lie in [0, 2].

Graph Signal Processing (GSP). Spectral GNNs rely on several important concepts from GSP, namely spectral filtering, the graph Fourier transform, and its inverse. The graph Fourier transform is written as x̂ = U^⊤x, where x ∈ R^{n×1} is a graph signal and x̂ ∈ R^{n×1} represents its Fourier coefficients. A spectral filter G_θ is then used to scale x̂. Finally, the inverse graph Fourier transform yields the filtered signal in the spatial domain, x' = U G_θ x̂. The key task in GSP is to design a powerful spectral filter G_θ that exploits the useful frequency information.
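The GSP pipeline above (graph Fourier transform, spectral filtering, inverse transform) can be sketched in a few lines of NumPy; the path graph and the exp(−λ) filter below are illustrative choices, not taken from the paper:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a binary adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# A 4-node path graph as a toy example.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)

# Spectral decomposition L = U diag(lam) U^T; eigenvalues lie in [0, 2].
lam, U = np.linalg.eigh(L)

x = np.array([1.0, -1.0, 1.0, -1.0])   # a graph signal
x_hat = U.T @ x                        # graph Fourier transform
g = np.exp(-lam)                       # an example low-pass filter g(lam)
x_filtered = U @ (g * x_hat)           # inverse transform of the scaled coefficients
```

Designing the function g(λ), here hard-coded as exp(−λ), is exactly the filter-design problem that Specformer addresses with a learnable set-to-set map.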
Transformer. The Transformer is a powerful deep learning model, widely used in natural language processing (Devlin et al., 2019), vision (Dosovitskiy et al., 2021), and graphs (Ying et al., 2022; Rampásek et al., 2022). Each Transformer layer consists of two components: a multi-head self-attention (MHA) module and a token-wise feed-forward network (FFN). Given input representations H = [h_1, …, h_n]^⊤ ∈ R^{n×d}, where d is the hidden dimension, MHA first projects H into queries, keys, and values through three matrices (W_Q, W_K, and W_V) to compute attention, and the FFN then applies a further transformation. Denoting the query dimension by d_q and omitting the bias and the multi-head details for simplicity, the model can be written as:

Attention(Q, K, V) = Softmax(QK^⊤/√d_q)V, Q = HW_Q, K = HW_K, V = HW_V. (1)

4. SPECFORMER

In this section, we introduce our Specformer model. Fig. 1 illustrates the model architecture. We first explain how we encode the eigenvalues and use a Transformer to capture their dependencies and yield useful representations. We then turn to the decoder, which learns new eigenvalues from the representations and reconstructs the graph Laplacian matrix for graph convolution. Finally, we discuss the relationship between Specformer and other methods, including MPNNs, polynomial GNNs, and graph Transformers.
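Single-head attention as in Eq. (1) can be sketched in NumPy as follows (bias and multi-head machinery omitted, matching the simplification in the text; the random inputs are placeholders):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product attention, as in Eq. (1)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_q = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_q))  # rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))          # 5 tokens, hidden dimension 8
W_q, W_k, W_v = rng.standard_normal((3, 8, 8))
out = attention(H, W_q, W_k, W_v)        # shape (5, 8)
```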

4.1. EIGENVALUE ENCODING

We design a powerful set-to-set spectral filter using Transformer to leverage both magnitudes and relative differences of all eigenvalues. However, the expressiveness of self-attention would be heavily restricted if we directly used the scalar eigenvalues to calculate the attention maps. Therefore, it is important to find a suitable function, ρ(λ): R → R^d, that maps each eigenvalue from a scalar to a meaningful vector. We use the following eigenvalue encoding:

ρ(λ, 2i) = sin(ϵλ/10000^{2i/d}), ρ(λ, 2i+1) = cos(ϵλ/10000^{2i/d}), (2)

where i is the dimension index of the representation and ϵ is a hyperparameter. The benefits of ρ(λ) are three-fold: (1) it captures the relative frequency shifts of eigenvalues and provides high-dimensional vector representations; (2) its wavelengths range from 2π to 10000 · 2π, forming a multi-scale representation of the eigenvalues; (3) the influence of λ can be controlled by adjusting ϵ. The choice of ϵ is crucial: we find that only the first few dimensions of ρ(λ) can distinguish different eigenvalues if we simply set ϵ = 1. The reason is that eigenvalues lie in the range [0, 2], so λ/10000^{2i/d} changes only slightly as i becomes larger. It is therefore important to use a large ϵ to enlarge the influence of λ. Experiments can be seen in Appendix C.1. Notably, although the eigenvalue encoding (EE) is similar to the positional encoding (PE) of Transformer, they act quite differently: PE describes discrete positions in the spatial domain, while EE represents continuous eigenvalues in the spectral domain. Applying PE to the spatial positions (i.e., indices) of eigenvalues would destroy the permutation-equivariance property, thereby impairing the learning ability. The initial representations of the eigenvalues are the concatenation of the eigenvalues and their encodings, Z = [λ_1∥ρ(λ_1), …, λ_n∥ρ(λ_n)]^⊤ ∈ R^{n×(d+1)}.

A standard Transformer block is then used to learn the dependencies between eigenvalues. We first apply layer normalization (LN) on the representations before feeding them into the other sub-layers, i.e., MHA and FFN. This pre-norm trick is widely used to improve the optimization of Transformers (Ying et al., 2022):

Z̃ = MHA(LN(Z)) + Z, Ẑ = FFN(LN(Z̃)) + Z̃.

After stacking multiple Transformer blocks, we obtain expressive representations of the spectrum.
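The eigenvalue encoding ρ(λ) and the concatenated input Z can be sketched as follows; the values ϵ = 100 and d = 64 are illustrative choices for the sketch, not hyperparameters reported in the paper:

```python
import numpy as np

def eigenvalue_encoding(lam, d=64, eps=100.0):
    """rho(lambda): map each eigenvalue (a scalar in [0, 2]) to a d-dim vector.
    eps scales lambda so that higher dimensions still vary; d is assumed even."""
    lam = np.asarray(lam, dtype=float)[:, None]   # (n, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2)
    angle = eps * lam / 10000 ** (2 * i / d)
    pe = np.empty((lam.shape[0], d))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

# Concatenate each eigenvalue with its encoding: Z has shape (n, d+1).
lam = np.linspace(0, 2, 5)
Z = np.concatenate([lam[:, None], eigenvalue_encoding(lam)], axis=1)
```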

4.2. EIGENVALUE DECODING

Based on the representations returned by the encoder, the decoder learns new eigenvalues for spectral filtering. Recent studies (Yang et al., 2022; Wang & Zhang, 2022) show that assigning each feature dimension a separate spectral filter improves the performance of GNNs. Motivated by this discovery, our decoder first decodes several bases; an FFN is then used to combine these bases into the final graph convolution.

Spectral filters. In general, the bases should learn to cover the graph signal space as fully as possible. For this purpose, we utilize the multi-head attention mechanism, because each head has its own self-attention module. Specifically, the representations learned by each head are fed into the decoder to perform spectral filtering and obtain new eigenvalues:

Z_m = Attention(QW^Q_m, KW^K_m, VW^V_m), λ_m = ϕ(Z_m W_λ), (3)

where Z_m denotes the representations learned by the m-th head, and ϕ is an optional activation, e.g., ReLU or Tanh. λ_m ∈ R^{n×1} contains the m-th set of eigenvalues after spectral filtering.

Learnable bases. After obtaining M sets of filtered eigenvalues, we use an FFN: R^{M+1} → R^d to construct the learnable bases. We first reconstruct the individual new bases, concatenate them along the channel dimension, and feed them to the FFN:

S_m = U diag(λ_m) U^⊤, Ŝ = FFN([I_n∥S_1∥…∥S_M]), (4)

where S_m ∈ R^{n×n} is the m-th new basis and Ŝ ∈ R^{n×n×d} is the combined version. Note that our bases serve a similar purpose as the polynomial bases in the literature, but the way they are combined is learned rather than fixed by recursions, as in Chebyshev polynomials. There are three optional ways to leverage this design of new bases. (1) Shared filters and shared FFN. This model has the fewest parameters; the basis Ŝ is shared across all graph convolutional layers.
(2) Shared filters and layer-specific FFN, which is a compromise between parameter count and performance, i.e., Ŝ^(l) = FFN^(l)([I_n∥S_1∥…∥S_M]), where the superscript l denotes the layer index. (3) Layer-specific filters and layer-specific FFN. This model has the most parameters, and each layer has its own encoder and decoder, i.e., Ŝ^(l) = FFN^(l)([I_n∥S^(l)_1∥…∥S^(l)_M]). We refer to these three models as Specformer-Small, Specformer-Medium, and Specformer-Large.
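The basis construction S_m = U diag(λ_m) U^⊤ and the channel-wise combination can be sketched as follows; the random "new eigenvalues" and the one-layer linear map standing in for the FFN are placeholders for the learned decoder, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d = 6, 2, 4

# Stand-ins for the encoder outputs: U from a random symmetric matrix, and
# M sets of "new eigenvalues" (random here; decoded per head in the model).
Lap = rng.standard_normal((n, n)); Lap = (Lap + Lap.T) / 2
lam, U = np.linalg.eigh(Lap)
new_lam = rng.standard_normal((M, n))

# S_m = U diag(lam_m) U^T, stacked with I_n along a channel axis.
bases = [np.eye(n)] + [U @ np.diag(new_lam[m]) @ U.T for m in range(M)]
stack = np.stack(bases, axis=-1)          # shape (n, n, M+1)

# A one-layer linear "FFN" R^{M+1} -> R^d applied entrywise to combine bases.
W = rng.standard_normal((M + 1, d))
S_hat = stack @ W                         # shape (n, n, d)
```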

4.3. GRAPH CONVOLUTION

Finally, we assign each feature dimension a separate graph Laplacian matrix based on the learned bases Ŝ, which can be written as follows:

X̃^(l−1)_{:,i} = Ŝ_{:,:,i} X^(l−1)_{:,i}, X^(l) = σ(X̃^(l−1) W^(l−1)_x) + X^(l−1),

where X^(l) denotes the node representations in the l-th layer, X^(l−1)_{:,i} is the i-th channel dimension, W^(l−1)_x is a linear transformation, and σ is the activation. The residual connection is optional. By stacking multiple graph convolutional layers, Specformer can effectively learn node representations.
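One such convolution layer can be sketched in NumPy as follows (ReLU is an assumed choice for σ, and the residual is kept):

```python
import numpy as np

def spec_conv(X, S_hat, W, residual=True):
    """One graph-convolution layer: channel i of X is filtered by its own
    n x n matrix S_hat[:, :, i], then mixed by W (sigma = ReLU here),
    with an optional residual connection."""
    n, d = X.shape
    X_filt = np.stack([S_hat[:, :, i] @ X[:, i] for i in range(d)], axis=1)
    out = np.maximum(X_filt @ W, 0.0)
    return out + X if residual else out
```

With identity bases and an identity mixing matrix, the layer reduces to ReLU(X) + X, which makes the residual path easy to verify.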

4.4. KEY PROPERTIES COMPARED

Specformer v.s. MPNNs. The graph convolution of Specformer can be expanded as U G_θ U^⊤ = θ_1 u_1 u_1^⊤ + … + θ_n u_n u_n^⊤. Because u_i is an eigenvector, u_i u_i^⊤ constructs a fully-connected graph. Therefore, Specformer can break the localization of MPNNs and leverage global information.

Specformer v.s. Graph Transformers. Graphormer (Ying et al., 2022) has shown that graph Transformers can perform well on graph-level tasks. However, existing graph Transformers are not competitive on node-level tasks, e.g., node classification. Recent studies (Bastos et al., 2022; Wang et al., 2022; Shi et al., 2022) provide some evidence for this phenomenon: they show that the Transformer is essentially a low-pass filter, so graph Transformers cannot handle complex node label distributions, e.g., homophilic and heterophilic ones. On the contrary, as we will see in the experiment section, Specformer can learn arbitrary bases for graph convolution and performs well on both node-level and graph-level tasks. Besides the above advantages, Specformer has the following theoretical properties.

Proposition 1. Specformer is permutation equivariant.

Proposition 2. Specformer can approximate any univariate and multivariate continuous functions.

Proposition 1 shows that Specformer learns permutation-equivariant node representations. Proposition 2 states that Specformer is more expressive than other graph filters. First, the ability to approximate any univariate function generalizes existing scalar-to-scalar filters. Moreover, Specformer can handle multiple eigenvalues and learn multivariate functions, so it can approximate a broader range of filter functions than scalar-to-scalar filters. All proofs are provided in Appendix D.

Complexity. Specformer has two parts of computation: spectral decomposition and the forward process. The spectral decomposition is pre-computed and has complexity O(n³). The forward complexity has three parts: the Transformer, the learnable bases, and the graph convolution.
Their corresponding complexities are O(n²d + nd²), O(Mn²), and O(Lnd), respectively, where n, M, and L denote the numbers of nodes, filters, and layers, and d is the hidden dimension. The overall forward complexity is O(n²(d + M) + nd(L + d)). Since the decomposition is computed once and reused, the overall complexity of Specformer is the forward complexity plus the decomposition complexity amortized over the number of uses in training and inference, rather than a simple summation of the two. See Appendix C.2 for more discussion.

Scalability. When applying Specformer to large graphs, one can use Sparse Generalized Eigenvalue (SGE) algorithms (Cai et al., 2021) to compute only q eigenvalues and eigenvectors, in which case the forward complexity reduces to O(q²(d + M) + nd(L + d)).
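Truncated spectral filtering can be sketched as follows. For clarity this sketch uses a full dense decomposition and then keeps the q smallest and q largest eigenpairs, whereas a real large-graph implementation would use a sparse SGE/Lanczos-type solver to obtain those q pairs directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 50, 5
L = rng.standard_normal((n, n)); L = (L + L.T) / 2   # symmetric stand-in Laplacian
lam, U = np.linalg.eigh(L)                           # eigenvalues in ascending order

# Keep the q smallest (low-frequency) and q largest (high-frequency) pairs.
idx = np.r_[np.arange(q), np.arange(n - q, n)]
lam_q, U_q = lam[idx], U[:, idx]                     # shapes (2q,), (n, 2q)

x = rng.standard_normal(n)
g = np.exp(-lam_q)                                   # any filter on the kept spectrum
x_approx = U_q @ (g * (U_q.T @ x))                   # filtering with only 2q eigenpairs
```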

5. EXPERIMENTS

In this section, we conduct experiments on a synthetic dataset and a wide range of real-world graph datasets to verify the effectiveness of our Specformer. 

5.1. LEARNING SPECTRAL FILTERS ON SYNTHETIC DATA

Dataset description. We take 50 images with a resolution of 100×100 from the Image Processing Toolbox. Each image is treated as a 2D regular 4-neighborhood grid graph, and the pixel values are the node features. These images therefore share the same adjacency matrix A ∈ R^{10000×10000}, and the m-th image has its own graph signal x_m ∈ R^{10000×1}. Five predefined graph filters are used to generate the ground-truth graph signals. For example, for the low-pass filter G_θ = exp(−10λ²), the filtered graph signal is x̃_m = U diag([exp(−10λ_1²), …, exp(−10λ_n²)]) U^⊤ x_m.

Setup. We choose six GNNs as baselines: GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), ChebyNet (Defferrard et al., 2016), GPR-GNN (Chien et al., 2021), BernNet (He et al., 2021), and JacobiConv (Wang & Zhang, 2022). Each method takes A and x_m as inputs and minimizes the sum of squared errors between its output and the pre-filtered graph signal x̃_m. We tune the number of hidden units so that each method has roughly 2K trainable parameters. The polynomial order is set to 10 for ChebyNet, GPR-GNN, and BernNet. For our model, we use Specformer-Small with 16 hidden units and 1 head. In training, the maximum number of epochs is set to 2000, and training stops early if the loss does not decrease for 200 consecutive epochs. All regularization tricks are removed. The learning rate is set to 0.01 for all models. We use two metrics to evaluate each method: the sum of squared errors and the R² score.

Results. The quantitative results are shown in Table 1, from which we can see that Specformer achieves the best performance on all synthetic graphs. In particular, it improves more on challenging filters, such as Band-rejection and Comb. This validates the effectiveness of Specformer in learning complex graph filters.
In addition, GCN and GAT perform better only on the homophilic graph, which shows that using only low-frequency information is not enough. Polynomial-based GNNs, i.e., ChebyNet, GPR-GNN, BernNet, and JacobiConv, perform more stably, but their expressiveness is still weaker than Specformer's. We visualize the graph filters learned by GPR-GNN, BernNet, and Specformer in Figure 2, which further validates our claims. The horizontal axis shows the original eigenvalues, and the vertical axis the corresponding new eigenvalues. For clarity, we uniformly downsample the eigenvalues at a ratio of 1:200 and visualize only three graphs, because the situations of Low-pass and High-pass are similar, as are those of Band-pass and Band-rejection. All methods can fit the easy filters well, e.g., High-pass. However, the polynomial-based GNNs cannot learn the narrow bands in Band-rejection and Comb, e.g., λ ∈ [0.75, 1.25], which harms their performance. On the contrary, Specformer fits the ground truth precisely, reflecting its superior learning ability over polynomials. The spatial results, i.e., the filtered images, can be seen in Appendix C.3.
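The synthetic filtering setup can be sketched as follows. Only the low-pass form exp(−10λ²) (and, in Appendix C.3, the comb |sin(πλ)|) is stated explicitly in the text; the other three forms below follow common synthetic-filter benchmarks and are assumptions of this sketch:

```python
import numpy as np

# Five candidate ground-truth filters g(lambda) on the spectrum [0, 2].
filters = {
    "low-pass":       lambda lam: np.exp(-10 * lam ** 2),
    "high-pass":      lambda lam: 1 - np.exp(-10 * lam ** 2),       # assumed form
    "band-pass":      lambda lam: np.exp(-10 * (lam - 1) ** 2),     # assumed form
    "band-rejection": lambda lam: 1 - np.exp(-10 * (lam - 1) ** 2), # assumed form
    "comb":           lambda lam: np.abs(np.sin(np.pi * lam)),
}

def filter_signal(U, lam, x, g):
    """Ground-truth generation: x_tilde = U diag(g(lam)) U^T x."""
    return U @ (g(lam) * (U.T @ x))
```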

5.2. NODE CLASSIFICATION

Datasets. For the node classification task, we perform experiments on four homophilic datasets, i.e., Cora, Citeseer, Amazon-Photo, and ogbn-arXiv, and four heterophilic datasets, i.e., Chameleon, Squirrel, Actor, and Penn94. Penn94 (Lim et al., 2021) and arXiv (Hu et al., 2020) are two large-scale datasets. The other datasets, provided by (Rozemberczki et al., 2021; Pei et al., 2020), are commonly used to evaluate the performance of GNNs on heterophilic and homophilic graphs.

Baselines and settings. We benchmark our model against a series of competitive baselines, including spatial GNNs, spectral GNNs, and graph Transformers. For all datasets, we use the fully-supervised split, i.e., 60% for training, 20% for validation, and 20% for testing, as suggested in (He et al., 2021). All methods are run 10 times, and we report the mean accuracy with a 95% confidence interval. For polynomial GNNs, we set the order of polynomials to K = 10. For the other methods, we use a 2-layer module. To ensure all models have similar numbers of parameters, on the six small datasets we set the hidden size to d = 64 for spatial and spectral GNNs and d = 32 for graph Transformers and Specformer. The total numbers of parameters on the Photo dataset are shown in Table 2. On the two large datasets, we use truncated spectral decomposition to improve scalability. Based on the filters learned on the small datasets, we find that band-rejection filters are important for heterophilic datasets and low-pass filters are suitable for homophilic datasets; see Figures 4(b) and 4(d). Therefore, we use the eigenvectors with the smallest 3000 (low-frequency) and largest 3000 (high-frequency) eigenvalues for Penn94, and the eigenvectors with the smallest 5000 eigenvalues (low-frequency) for arXiv. We use one layer for Specformer and set d = 64 on Penn94 and d = 512 on arXiv for all methods, as suggested by (Lim et al., 2021; He et al., 2022). More details, e.g., optimizers, can be found in Appendix A.

Results.
In Table 2, we find that Specformer outperforms state-of-the-art baselines on 7 out of 8 datasets and achieves a 12% relative improvement on the Squirrel dataset, which validates its superior learning ability. In addition, the improvement is more pronounced on heterophilic datasets than on homophilic datasets, probably because the low-pass filters needed on homophilic datasets are easier to fit; the same phenomenon is observed on the synthetic graphs. An interesting observation is that the improvement on larger graphs, e.g., Actor and Photo, is smaller than that on smaller graphs. One possible reason is that the role of the self-attention mechanism is weakened, i.e., the attention values become uniform due to the large number of tokens. We notice that Specformer has a slightly higher variance than the baselines; this is because we set a large dropout rate to prevent overfitting. On the two large graph datasets, graph Transformers are memory-consuming due to self-attention. On the contrary, Specformer reduces the time and space costs by using the truncated decomposition and shows better scalability than graph Transformers. The time and space overheads are listed in Appendix C.2.

5.3. GRAPH CLASSIFICATION AND REGRESSION

Datasets. We conduct experiments on three graph-level datasets of different scales. ZINC (Dwivedi et al., 2020) is a small subset of a large molecular dataset, containing 12K graphs in total. MolHIV and MolPCBA are taken from the Open Graph Benchmark (OGB) datasets (Hu et al., 2020). MolHIV is a medium-sized dataset with nearly 41K graphs; MolPCBA is the largest, containing 437K graphs. In all datasets, nodes represent atoms and edges indicate bonds.

Baselines and settings. We choose popular MPNNs (GCN, GIN, and GatedGNN), graph Transformers with positional or structural embeddings (SAN, Graphormer, and GPS), and other state-of-the-art GNNs (CIN, GIN-AK+, etc.) as the baselines for the graph-level tasks. For the ZINC dataset, we tune the hyperparameters of Specformer so that the total number of parameters is around 500K.

Results. We apply Specformer-Small, Medium, and Large to ZINC, MolHIV, and MolPCBA, respectively. The results are shown in Table 3. Specformer outperforms the state-of-the-art models on the ZINC and MolPCBA datasets without using any hand-crafted features or pre-defined polynomials. This shows that directly using neural networks to learn the graph spectrum is a promising way to construct powerful GNNs.

5.4. ABLATION STUDIES

We perform ablation studies on two node-level datasets and one graph-level dataset to evaluate the effectiveness of each component. The results are shown in Table 4. The top three lines show the effect of the encoder, i.e., eigenvalue encoding (EE) and self-attention. EE matters more on Squirrel than on Citeseer, because the spectral filter of Squirrel is harder to learn and the model needs the encoding to learn better representations. The attention module consistently improves performance by capturing the dependencies among eigenvalues. The bottom three lines examine the graph filters at different scales. On the easy task, e.g., Citeseer, the Small and Medium models have similar performance, but the Large model overfits severely. On the hard tasks, e.g., Squirrel and MolPCBA, the Large model is slightly better than the Medium model but outperforms the Small model by a large margin, implying that increasing the number of parameters can boost performance. In summary, it is important to consider the difficulty of the task when selecting models.

We further visualize what Specformer learns in Figures 3, 4, and 5, from which we make some interesting observations. (1) Similar dependency patterns can be learned on different graphs. In low-pass filtering, e.g., Citeseer and Low-pass, all frequency bands tend to use the low-frequency information, while in band-related filtering, e.g., Squirrel, Band-pass, and Band-rejection, the low and high frequencies depend heavily on the medium frequencies, and vice versa. (2) The more difficult the task, the less obvious the dependency: on Comb and ZINC, the dependency among eigenvalues is inconspicuous. (3) On graph-level datasets, the decoder can learn different filters. Figure 5 shows two basic filters.
It can be seen that Ŝ_1 and Ŝ_2 have different dependencies and patterns, unlike in node-level tasks, where only one filter is needed. This finding suggests that graph-level tasks are more difficult than node-level tasks and remain challenging for spectral GNNs.

6. CONCLUSION

In this paper, we propose Specformer, which leverages Transformer to build a set-to-set spectral filter along with learnable bases. Specformer effectively captures the magnitudes and relative dependencies of the eigenvalues in a permutation-equivariant fashion and can perform non-local graph convolution. Experiments on synthetic and real-world datasets demonstrate that Specformer outperforms various GNNs and learns meaningful spectrum patterns. A promising future direction is to improve the efficiency of Specformer by sparsifying the self-attention matrix of the Transformer.

A EXPERIMENTAL DETAILS

A.1 DATASETS

Splits. Lim et al. (2021) provide five official splits for the Penn94 dataset, and Hu et al. (2020) provide one time-based split for the arXiv dataset. Therefore, we run the Penn94 dataset five times, each with a different split, and we run the arXiv dataset ten times, each with the same split and a different initialization. For the other datasets, there is no official split, so we run the experiments ten times, each with a different split and initialization. For the graph-level tasks, we use the official splits provided by Hu et al. (2020) and run the experiments ten times, each with a different initialization.

Optimizer. For the node classification task, we use the Adam (Kingma & Ba, 2015) optimizer, as suggested by He et al. (2021) and Wang & Zhang (2022). For graph-level tasks, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with the default parameters ϵ = 1e-8 and (β_1, β_2) = (0.99, 0.999), as suggested by Ying et al. (2022) and Rampásek et al. (2022). For graph-level tasks we also use a learning rate scheduler: a linear warm-up stage followed by cosine decay.

Model selection. For node classification, we run the experiments for 2000 epochs and stop training early if the validation loss does not decrease for 200 consecutive epochs. For graph-level tasks, we run the experiments without early stopping. We then choose the model checkpoint with the lowest validation loss for evaluation.

Environment. The environment in which we run experiments is:

• Operating system: Linux version 3.10.0-693.el7.x86_64

• CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

• GPU: GeForce RTX 3090 (24GB)

Hyperparameters. The hyperparameters of Specformer can be seen in Tables 7 and 8.

B VISUALIZATION

Since the self-attention matrix B is row-normalized, we want the condensed matrix B̄ to remain approximately row-normalized. For this purpose, we first sum the columns of B within the pre-defined frequency bands, e.g., Low, Medium, and High, and then average over the rows.
B̄_{i,j} = (Σ_{λ_p ∈ f_i} Σ_{λ_q ∈ f_j} B_{p,q}) / |1_{λ ∈ f_i}|,

where f_i and f_j are frequency bands and |1_{λ ∈ f_i}| denotes the number of eigenvalues belonging to band f_i. Through this condensation strategy, we approximately preserve the information of the self-attention matrix and can identify the dependency patterns among the frequency bands.
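The condensation rule can be sketched as follows; the three band boundaries are illustrative assumptions, not values from the paper:

```python
import numpy as np

def condense_attention(B, lam, bands=((0.0, 0.5), (0.5, 1.5), (1.5, 2.0))):
    """Condense an n x n row-normalized attention matrix B into a bands x bands
    matrix: sum columns within each target band, then average over the rows of
    each source band, so the result stays approximately row-normalized."""
    member = [(lam >= lo) & (lam < hi) if hi < 2 else (lam >= lo) & (lam <= hi)
              for lo, hi in bands]
    k = len(bands)
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            out[i, j] = B[np.ix_(member[i], member[j])].sum() / max(member[i].sum(), 1)
    return out
```

Because each row of B sums to 1, every row of the condensed matrix also sums to 1 (summing the full row and dividing by the band size), which is the row-normalization property the text asks for.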

C MORE EXPERIMENTAL RESULTS

C.1 EIGENVALUE ENCODING

Here we visualize the outputs of the eigenvalue encoding for different values of ϵ. Specifically, we uniformly sample 50 eigenvalues from the interval [0, 2] and map them into representations with d = 64. The results are shown in Figure 6. When ϵ = 1, only the first 20 dimensions can distinguish different eigenvalues; as ϵ increases, the resolution of the eigenvalue encoding becomes higher.

C.2 TIME AND SPACE OVERHEAD

We test the time and space overheads of Specformer and two popular polynomial GNNs, GPR-GNN and BernNet. The polynomial GNNs are implemented with sparse matrices and sparse matrix multiplication. We choose three datasets, i.e., Squirrel, Penn94, and ZINC: two for node classification and one for graph regression. For the ZINC dataset, we sample 2,000 molecular graphs as inputs of the forward process and omit the edge features, which cannot be used by the polynomial GNNs.

Setup. In the complexity analysis, we mentioned that the spectral decomposition only needs to be computed once and can be reused in the forward process. For a fair comparison, we run each model for 1000 epochs and report the total time and space costs. For the polynomial GNNs, we set K = 10 as suggested by the original papers; for Specformer, we use the full eigenvectors for Squirrel and ZINC, and 6,000 eigenvectors for Penn94. The hidden dimension is d = 64 for all methods.

Results. From the time overheads in Table 9, we find that the spectral decomposition of small graphs, e.g., Squirrel and ZINC, does not add much computational cost, and the forward time of Specformer is close to GPR-GNN's and less than BernNet's. This is because polynomial GNNs need to compute AX or LX recurrently, whereas Specformer, thanks to its non-local capability, only needs to compute U diag(λ)U^⊤X once. Besides, BernNet needs to compute all combinations of L and 2I − L, i.e., Σ_{k=0}^{K} (K choose k) (2I − L)^{K−k} L^k, which requires many computations. On the Penn94 dataset, we use truncated spectral decomposition to reduce the forward complexity from O(n²(d + M) + nd(L + d)) to O(q²(d + M) + nd(L + d)), where q is the number of eigenvalues. Table 9 also shows the space overheads, where Specformer is higher than the polynomial GNNs because of the dense eigenvectors; one can use fewer eigenvectors to reduce the space cost.
C.3 SPATIAL VISUALIZATION

In addition to the visual comparisons of learned spectral filters, we also compare Specformer and polynomial GNNs from the spatial perspective. In Figure 7, we show the raw images, the ground-truth images filtered by the Comb filter, i.e., |sin(πλ)|, and the images filtered by Specformer and GPR-GNN. We can see that the output of Specformer is close to the ground truth, while the contrast of GPR-GNN is darker than the ground truth, implying that Specformer is better than polynomial GNNs at capturing global information.

C.4 RESULTS ON PCQM4MV2

PCQM4Mv2 is a large-scale graph regression dataset (Hu et al., 2021) with 3.7M graphs, where the goal is to regress the HOMO-LUMO gap. We follow the experimental setting of GPS (Rampásek et al., 2022). Because the original test set is unreachable, we use the original validation set as the test set and randomly sample 150K molecules for validation. Due to the time limitation, we only run Specformer-Medium on this largest molecular dataset. The results are shown in Table 11. We can see that, owing to the sharing of learnable bases, the number of parameters of Specformer-Medium is relatively small; that is, there is only one Transformer block in Specformer-Medium. Nevertheless, Specformer-Medium outperforms the baselines with similar parameter counts, e.g., GCN, GIN, and GPS-small.

PERMUTATION EQUIVARIANCE

Proposition 1. Specformer is permutation equivariant.

Proof. We show that Specformer is permutation equivariant by proving that all of its components are permutation equivariant. First, the element-wise functions, i.e., eigenvalue encoding, feed-forward networks, and layer normalization, are permutation equivariant because they are applied in a node-independent manner. Second, the self-attention mechanism is permutation equivariant because (P ZP⊤)(P ZP⊤)⊤ = P (ZZ⊤)P⊤, where Z is the data representation matrix and P is an arbitrary permutation matrix. Third, the construction of the learnable bases is permutation equivariant because (P U P⊤)(P ΛP⊤)(P U P⊤)⊤ = P (U ΛU⊤)P⊤.
Based on all the conclusions above, Specformer is permutation equivariant.

Proposition 2. Specformer can approximate any univariate and multivariate continuous function.

Proof. We first prove that Specformer can approximate any univariate continuous function. We set the self-attention matrix to be an identity matrix. Then Specformer becomes a scalar-to-scalar function, and the spectral filter is learned through the eigenvalue encoding ρ(λ) in Equation 2. Given a linear transformation w ∈ R^{d+1}, the eigenvalue encoding becomes a truncated Fourier series; to obtain orthogonal bases, one can set ρ(λ, 2i) = sin(iλ) and ρ(λ, 2i + 1) = cos(iλ).

We then prove that Specformer can approximate any multivariate continuous function. Theorem 2 states that any multivariate continuous function is a superposition of continuous functions of a single variable. Let ϕ be the eigenvalue encoding, λ_m the self-attention weight, and ρ the FFN decoder in Equation 4. Because the eigenvalue encoding can approximate any continuous univariate function, and Montanelli & Yang (2020) prove that deep ReLU networks can approximate the outer function ρ, Specformer can approximate any continuous multivariate function.
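The two permutation identities invoked in the equivariance argument can be checked numerically. A small NumPy sketch, stated with the row action P Z on an N × d representation, which yields the same identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
Z = rng.standard_normal((n, d))        # data representation matrix
P = np.eye(n)[rng.permutation(n)]      # random permutation matrix

# Self-attention scores: permuting rows of Z permutes both axes of
# Z Z^T, i.e., (P Z)(P Z)^T = P (Z Z^T) P^T.
lhs_attn = (P @ Z) @ (P @ Z).T
rhs_attn = P @ (Z @ Z.T) @ P.T

# Learnable-basis reconstruction: (P U) Lam (P U)^T = P (U Lam U^T) P^T.
U = rng.standard_normal((n, n))
Lam = np.diag(rng.standard_normal(n))
lhs_base = (P @ U) @ Lam @ (P @ U).T
rhs_base = P @ (U @ Lam @ U.T) @ P.T
```

Both pairs agree up to floating-point error, mirroring the second and third steps of the proof.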



https://ww2.mathworks.cn/products/image.html
We retrain Graphormer on the MolHIV and MolPCBA datasets without using pre-training and augmentation.



Figure 1: Illustration of the proposed Specformer.

Figure 2: Illustrations of filters learned by two polynomial GNNs and Specformer.

Figure 3: The dependency of eigenvalues on synthetic graphs.

Figure 6: Eigenvalue encoding with different values of ϵ. Best viewed in color.

Figure 7: Synthetic data filtered by GPR-GNN and Specformer. Best viewed in color.

APPROXIMATING UNIVARIATE AND MULTIVARIATE FUNCTIONS

Theorem 1 (Uniform convergence of Fourier series (Stein & Shakarchi, 2011)). Let f(x) be a continuous real-valued function on [a, b] such that f′(x) is piecewise continuous on [a, b]. Then, for any ϵ > 0, there exists a Fourier series P(x) that converges to f(x) uniformly, i.e.,

max_{a ≤ x ≤ b} |P(x) − f(x)| < ϵ.   (10)

Theorem 2 (Kolmogorov–Arnold representation theorem (Zaheer et al., 2017)). Let f : [0, 1]^M → R be an arbitrary multivariate continuous function. Then it has the representation

f(x_1, . . . , x_M) = ρ( Σ_{m=1}^{M} λ_m ϕ(x_m) ),

with continuous outer and inner functions ρ : R^{2M+1} → R and ϕ : R → R^{2M+1}, where the inner function ϕ is independent of the function f. Because the eigenvalues fall in the range [0, 2], Theorem 1 implies that Specformer can approximate any univariate continuous function on the interval [0, 2].
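Theorem 1 can be illustrated in the filter setting by fitting a truncated Fourier series to a target response on [0, 2] with least squares. A minimal sketch, using the Comb filter |sin(πλ)| as the target; the grid size, basis order K, and least-squares fit are illustrative choices:

```python
import numpy as np

lam = np.linspace(0.0, 2.0, 400)        # eigenvalue grid on [0, 2]
target = np.abs(np.sin(np.pi * lam))    # Comb filter as target response

# Trigonometric bases as in the proof of Proposition 2:
# 1, sin(i*lam), cos(i*lam) for i = 1..K.
K = 16
bases = [np.ones_like(lam)]
for i in range(1, K + 1):
    bases += [np.sin(i * lam), np.cos(i * lam)]
Phi = np.stack(bases, axis=1)           # (400, 2K + 1)

# Least-squares coefficients of the truncated series.
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
err = float(np.max(np.abs(Phi @ w - target)))  # shrinks as K grows
```

Increasing K drives the uniform error toward zero, in line with the uniform-convergence statement.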

COMPARISONS TO RELATED MODELS

Specformer vs. polynomial GNNs. Specformer replaces the fixed polynomial bases, e.g., λ, λ^2, . . . , λ^k, with learnable bases, which has two major advantages. (1) Universality. Polynomial GNNs are special cases of Specformer because the learnable bases can approximate any polynomial. (2) Flexibility. Polynomial GNNs learn a single filter function shared by all eigenvalues, whereas Specformer can learn eigenvalue-specific functions and is therefore more flexible.
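The universality point can be made concrete: a fixed-order polynomial filter applied spatially equals a diagonal spectral filter obtained by applying the same polynomial to each eigenvalue, which any learnable spectral filter subsumes. A small NumPy check, with a random symmetric matrix standing in for the Laplacian:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 6, 3, 3
M = rng.standard_normal((n, n))
L = (M + M.T) / 2                       # symmetric stand-in Laplacian
X = rng.standard_normal((n, d))         # node features
theta = rng.standard_normal(K + 1)      # polynomial coefficients

# Spatial form of the polynomial filter: sum_k theta_k L^k X.
out_poly = sum(theta[k] * np.linalg.matrix_power(L, k) @ X
               for k in range(K + 1))

# Spectral form: U diag(g(lam)) U^T X with g(lam) = sum_k theta_k lam^k,
# i.e., the same scalar function applied to every eigenvalue.
lam, U = np.linalg.eigh(L)
g = sum(theta[k] * lam ** k for k in range(K + 1))
out_spec = U @ np.diag(g) @ U.T @ X
```

The two forms coincide, and the spectral form makes visible what an eigenvalue-specific filter relaxes: g no longer needs to be one shared polynomial.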

Node regression results, mean of the sum of squared error & R 2 score, on synthetic data.

Results on real-world node classification tasks. Mean accuracy (%) ± 95% confidence interval. * means re-implemented baselines. "OOM" means out of GPU memory.

Results on graph-level datasets. ↓ means lower the better, and ↑ means higher the better.

Ablation studies on node-level and graph-level tasks.

Detailed information of node-level datasets.

Detailed information of graph-level datasets.

Time and space overheads of Specformer and polynomial GNNs.

Results on large-scale graph dataset PCQM4Mv2.


ACKNOWLEDGMENTS This work is supported in part by the National Natural Science Foundation of China (No. U20B2045, 62192784, 62172052, 62002029, U1936014), BUPT Excellent Ph.D. Students Foundation (No. CX2022310), the NSERC Discovery Grants (No. RGPIN-2019-05448, No. RGPIN-2022-04636), and the NSERC Collaborative Research and Development Grant (No. CRDPJ 543676-19). Resources used in preparing this research were provided, in part, by Advanced Research Computing at the University of British Columbia, the Oracle for Research program, and Compute Canada.


B.1 INCORPORATING EDGE FEATURES

Here we explain how to incorporate edge features into the Specformer layer. Specifically, we first broadcast the node features to the edges and filter the mixed edge features; the filtered edge features are then aggregated to yield new node features, where H ∈ R^{N×d} is the node feature matrix and E ∈ R^{N×N×d} is the edge feature matrix.
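A hedged sketch of this broadcast-filter-aggregate pattern; since the exact mixing equation is elided above, the additive mixing and sum aggregation below are illustrative choices, not the paper's definition:

```python
import numpy as np

def edge_aware_layer(H, E, S):
    """Illustrative only: broadcast node features H (N, d) to edges,
    mix them with edge features E (N, N, d), filter each feature map
    with a spectral filter matrix S = U diag(g(lam)) U^T, and
    aggregate back to nodes by summing over the second axis."""
    mixed = H[:, None, :] + H[None, :, :] + E        # (N, N, d)
    filtered = np.einsum('ij,jkd->ikd', S, mixed)    # filter feature maps
    return filtered.sum(axis=1)                      # (N, d) node features
```

With S set to the identity the layer reduces to plain neighborhood summation of the mixed features, which is a convenient sanity check.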

B.2 CONDENSATION OF SELF-ATTENTION

In this section, we explain the details of the condensation of self-attention. We use B to represent the self-attention matrix and B̂ its condensation.

