GLOBAL NODE ATTENTIONS VIA ADAPTIVE SPECTRAL FILTERS

Abstract

Graph neural networks (GNNs) have been extensively studied for prediction tasks on graphs. Most GNNs assume local homophily, i.e., strong similarities in local neighborhoods. This assumption limits the generalizability of GNNs, as demonstrated by recent work on disassortative graphs with weak local homophily. In this paper, we argue that GNNs' feature aggregation scheme can be made flexible and adaptive to data without the assumption of local homophily. To demonstrate this, we propose a GNN model with a global self-attention mechanism defined using learnable spectral filters, which can attend to any node, regardless of distance. We evaluate the proposed model on node classification tasks over seven benchmark datasets. The proposed model is shown to generalize well to both assortative and disassortative graphs. Further, it outperforms all state-of-the-art baselines on disassortative graphs and performs comparably with them on assortative graphs.

1. INTRODUCTION

Graph neural networks (GNNs) have recently demonstrated great power in graph-related learning tasks, such as node classification (Kipf & Welling, 2017), link prediction (Zhang & Chen, 2018) and graph classification (Lee et al., 2018). Most GNNs follow a message-passing architecture in which, at each layer, a node aggregates information from its direct neighbors indiscriminately. In this architecture, information from long-distance nodes is propagated and aggregated by stacking multiple GNN layers together (Kipf & Welling, 2017; Velickovic et al., 2018; Defferrard et al., 2016). However, this architecture rests on the assumption of local homophily, i.e., proximity of similar nodes. While this assumption seems reasonable and helps to achieve good prediction results on graphs with strong local homophily, such as citation networks and community networks (Pei et al., 2020), it limits GNNs' generalizability. In particular, determining whether a graph has strong local homophily is a challenge in itself. Furthermore, strong and weak local homophily can coexist in different parts of a graph, which makes a learning task more challenging. Pei et al. (2020) proposed a metric to measure local node homophily based on how many neighbors of a node are from the same class. Using this metric, they categorized graphs as assortative (strong local homophily) or disassortative (weak local homophily), and showed that classical GNNs such as GCN (Kipf & Welling, 2017) and GAT (Velickovic et al., 2018) perform poorly on disassortative graphs. Liu et al. (2020) further showed that GCN and GAT are outperformed by a simple multilayer perceptron (MLP) on node classification tasks on disassortative graphs, because the naive local aggregation of homophilic models brings in more noise than useful information for such graphs. These findings indicate that these GNN models perform sub-optimally when the fundamental assumption of local homophily does not hold.
Based on the above observation, we argue that a well-generalized GNN should perform well on graphs regardless of their local homophily. Furthermore, since a real-world graph can exhibit both strong and weak homophily in different node neighborhoods, a powerful GNN model should be able to aggregate node features using different strategies accordingly. For instance, on disassortative graphs where a node shares no similarity with any of its direct neighbors, such a GNN model should be able to ignore direct neighbors and reach farther to find similar nodes, or at least resort to the node's own attributes to make a prediction. Since the validity of the local homophily assumption is often unknown, such aggregation strategies should be learned from data rather than decided upfront. To this end, we propose a novel GNN model with a global self-attention mechanism, called GNAN. Most existing attention-based aggregation architectures restrict self-attention to the local neighborhood of a node (Velickovic et al., 2018), which may add local noise during aggregation. Unlike these works, we aim to design an aggregation method that can gather informative features from both close and far-distant nodes. To achieve this, we employ graph wavelets under a relaxed condition of localization, which enables us to learn attention weights for nodes in the spectral domain. In doing so, the model can effectively capture not only local information but also global structure in node representations. To further improve the generalizability of our model, instead of using predefined spectral kernels, we propose to use multi-layer perceptrons (MLPs) to learn the desired spectral filters without limiting their shapes. Existing works on graph wavelet transforms choose wavelet filters heuristically, such as the heat kernel, wave kernel and personalized PageRank kernel (Klicpera et al., 2019b; Xu et al., 2019; Klicpera et al., 2019a).
They are mostly low-pass filters, which means these models implicitly treat high-frequency components as "noise" and discard them (Shuman et al., 2013; Hammond et al., 2011; Chang et al., 2020). However, this may hinder the generalizability of models, since high-frequency components can carry meaningful information about local discontinuities, as analyzed in (Shuman et al., 2013). Our model overcomes these limitations by incorporating fully learnable spectral filters into the proposed global self-attention mechanism. From a computational perspective, learning global self-attention may impose high computational overhead, particularly when graphs are large. We alleviate this problem in two ways. First, we sparsify nodes according to their wavelet coefficients, which enables attention weights to be distributed sparsely across the graph. Second, we observed that spectral filters learned by different MLPs tend to converge to similar shapes; thus, we use a single MLP to reduce redundancy among filters, where each dimension of the output corresponds to one learnable spectral filter. In addition, following (Xu et al., 2019; Klicpera et al., 2019b), we use a fast algorithm to efficiently approximate the graph wavelet transform, which has computational complexity O(p × |E|), where p is the order of the Chebyshev polynomials and |E| is the number of edges in the graph. To summarize, the main contributions of this work are as follows: 1. We propose a generalized GNN model which performs well on both assortative and disassortative graphs, regardless of local node homophily. 2. We show that a GNN's aggregation strategy can be trained via fully learnable spectral filters, thereby enabling feature aggregation from both close and far nodes. 3. We show that, contrary to common understanding, high-frequency components of disassortative graphs provide meaningful information that helps improve prediction performance.
We conduct extensive experiments to compare GNAN with well-known baselines on node classification tasks. The experimental results show that GNAN significantly outperforms the state-of-the-art methods on disassortative graphs where local node homophily is weak, and performs comparably with the state-of-the-art methods on assortative graphs where local node homophily is strong. This empirically verifies that GNAN is a general model for learning on different types of graphs.

2. PRELIMINARIES

Let G = (V, E, A, x) be an undirected graph with N nodes, where V, E, and A are the node set, edge set, and adjacency matrix of G, respectively, and x : V → R^m is a graph signal function that associates each node with a feature vector. The normalized Laplacian matrix of G is defined as L = I − D^{−1/2} A D^{−1/2}, where D ∈ R^{N×N} is the diagonal degree matrix of G. In spectral graph theory, the eigenvalues Λ = diag(λ_1, ..., λ_N) and eigenvectors U of L = U Λ U^H are known as the graph's spectrum and spectral basis, respectively, where U^H is the Hermitian transpose of U. The graph Fourier transform of x is x̂ = U^H x and its inverse is x = U x̂. The spectrum and spectral basis carry important information on the connectivity of a graph (Shuman et al., 2013). Intuitively, lower frequencies correspond to global and smooth information on the graph, while higher frequencies correspond to local information, discontinuities and possible noise (Shuman et al., 2013). These properties have been exploited in various graph analysis tasks (Segarra, 2018), including anomaly detection (Miller et al., 2011) and clustering (Wai et al., 2018). Spectral convolution on graphs is defined as the multiplication of a signal x with a filter g(Λ) in the Fourier domain:

g(L)x = g(U Λ U^H)x = U g(Λ) U^H x = U g(Λ) x̂.    (1)

When a spectral filter is parameterized by a scale factor, which controls the radius of neighbourhood aggregation, Equation 1 is also known as the Spectral Graph Wavelet Transform (SGWT) (Hammond et al., 2011; Shuman et al., 2013). For example, Xu et al. (2019) use a small scale parameter s < 2 for a heat kernel, g(sλ) = e^{−sλ}, to localize the wavelet at a node.
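To make the preliminaries concrete, the filtering operation in Equation 1 can be sketched directly via eigen-decomposition. This is a minimal NumPy example on a toy 4-node path graph; the graph, signal, and heat-kernel scale are illustrative and not from the paper:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def spectral_filter(A, x, g):
    """Apply g(L)x = U g(Lambda) U^T x (U is real for undirected graphs)."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)        # spectrum and spectral basis
    return U @ (g(lam) * (U.T @ x))   # multiply in the Fourier domain

# Toy 4-node path graph with a heat-kernel filter g(lam) = exp(-s * lam).
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = spectral_filter(A, x, lambda lam: np.exp(-2.0 * lam))
```

An all-pass filter g(λ) = 1 recovers the input signal exactly, which is a quick sanity check on the transform pair x̂ = U^H x, x = U x̂.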

3. PROPOSED APPROACH

Graph neural networks (GNNs) learn lower-dimensional embeddings of nodes from graph-structured data. In general, given a node, GNNs iteratively aggregate information from its neighbor nodes, and then combine the aggregated information with its own information. The embedding of node v at the k-th GNN layer is typically formulated as

m_v^(k) = aggregate({h_u^(k−1) | u ∈ N_v}),
h_v^(k) = combine(h_v^(k−1), m_v^(k)),

where N_v is the set of neighbor nodes of node v, m_v^(k) is the information aggregated from the neighbors, and h_v^(k) is the embedding of node v at the k-th layer (h_v^(0) = x_v). The embedding h_v^(n) of node v at the final layer is then used for prediction tasks. In most GNNs, N_v is restricted to the one-hop neighborhood of node v. Therefore, one needs to stack multiple aggregation layers in order to collect information from beyond the one-hop neighborhood within this architecture.

Adaptive spectral filters. Instead of stacking multiple aggregation layers, we introduce a spectral attention layer that rewires a graph based on spectral graph wavelets. A spectral graph wavelet ψ_v at node v is a modulation in the spectral domain of signals centered around node v, given by the N-dimensional vector

ψ_v = U g(Λ) U^H δ_v,    (2)

where g(·) is a spectral filter and δ_v is a one-hot vector for node v. A common choice of spectral filter is the heat kernel. A wavelet coefficient ψ_vu computed from a heat kernel can be interpreted as the amount of energy that node v has received from node u in its local neighborhood. In this work, instead of using pre-defined localized kernels, we use a multi-layer perceptron (MLP) to learn spectral filters. With learnable spectral kernels, we obtain wavelet coefficients

ψ_v = U diag(MLP(Λ)) U^H δ_v.    (3)
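The wavelet construction above can be sketched as follows. This is a minimal NumPy example on a toy path graph; the two filter responses are illustrative stand-ins, with a plain function replacing the learnable MLP(Λ):

```python
import numpy as np

# Toy 4-node path graph and its normalized Laplacian spectrum.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
d = A.sum(1)
dis = d ** -0.5
L = np.eye(4) - dis[:, None] * A * dis[None, :]
lam, U = np.linalg.eigh(L)

def wavelet(filter_fn, v):
    """psi_v = U diag(g(Lambda)) U^T delta_v.
    filter_fn maps eigenvalues to responses, standing in for MLP(Lambda)."""
    delta = np.zeros(len(lam))
    delta[v] = 1.0
    return U @ (filter_fn(lam) * (U.T @ delta))

psi_heat = wavelet(lambda l: np.exp(-2.0 * l), 0)  # heat kernel: energy stays local
psi_high = wavelet(lambda l: l / 2.0, 0)           # high-pass: emphasizes discontinuities
```

With the heat kernel, the coefficient at the center node dominates the one at the far end of the path, matching the "energy received in the local neighborhood" reading; an identity filter reduces ψ_v to δ_v.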
Similar to that of a heat kernel, the wavelet coefficient ψ_vu with a learnable spectral filter can be understood as the amount of energy that is distributed from node v to node u, under the conditions regulated by the spectral filter. Note that we use the terms wavelet and spectral filter interchangeably, as we have relaxed the wavelet definition from (Hammond et al., 2011) so that the learnable spectral filters in our work are not necessarily localized in the spectral and spatial domains. Equation 3 requires the eigen-decomposition of the Laplacian matrix, which is expensive and infeasible for large graphs. We follow Xu et al. (2019) and Klicpera et al. (2019b) to approximate the graph wavelet transform using Chebyshev polynomials (Shuman et al., 2013) (see Appendix A for details).

Global self-attention. Unlike previous work (Xu et al., 2019) where wavelet coefficients are directly used to compute node embeddings, we normalize wavelet coefficients through a softmax layer, a_v = softmax(ψ_v), where a_v ∈ R^N is an attention weight vector. With attention weights, an update layer is then formalized as

h_v^(k) = σ(Σ_{u=1}^N a_vu h_u^(k−1) W^(k)),    (4)

where W^(k) is a weight matrix shared across all nodes in the k-th layer and σ is the ELU nonlinear activation. Unlike a heat kernel, a learnable spectral kernel yields wavelet coefficients that are not localized; hence, our model can actively aggregate information from far-distant nodes. Note that the update layer is not divided into aggregate and combine steps in our work. Instead, we compute the attention a_vu directly from a spectral filter.

Sparsified node attentions. With predefined localized spectral filters such as a heat kernel, most wavelet coefficients are zero due to their locality. In our work, spectral filters are fully learned from data, and consequently the attention weights obtained from learnable spectral filters do not impose any sparsity.
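The softmax normalization and update layer described above can be sketched as a NumPy toy, with random wavelet coefficients and weights standing in for learned quantities (shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def elu(z):
    return np.where(z > 0, z, np.exp(np.minimum(z, 0)) - 1.0)

def spectral_attention_layer(Psi, H, W):
    """One update: a_v = softmax(psi_v) over ALL nodes, then h'_v = ELU(sum_u a_vu h_u W).
    Psi: (N, N) wavelet coefficients, H: (N, F_in) embeddings, W: (F_in, F_out)."""
    Attn = softmax(Psi, axis=1)   # one global attention distribution per node
    return elu(Attn @ H @ W)

rng = np.random.default_rng(0)
Psi = rng.normal(size=(5, 5))     # stand-in for learned wavelet coefficients
H = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))
H_next = spectral_attention_layer(Psi, H, W)
```

Because the softmax runs over all N nodes rather than a one-hop neighborhood, any node can receive non-trivial weight, which is exactly what allows aggregation from far-distant nodes.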
This means that performing an aggregation operation requires retrieving all nodes in the graph, which is time-consuming for large graphs. From our experiments, we observe that most attention weights are negligible after softmax. Thus, we consider two sparsification techniques:

1. Discard the entries of wavelet bases that are below a threshold t, i.e.,

ψ̃_vu = ψ_vu if ψ_vu > t, and −∞ otherwise.    (5)

The threshold t can be easily applied to all entries of wavelet bases. However, it offers little guarantee on attention sparsity, since attention weights may vary depending on the learning process of the spectral filters and the characteristics of different datasets, as will be further discussed in Section 4.2.

2. Keep only the largest k entries of wavelet bases for each node, i.e.,

ψ̃_vu = ψ_vu if ψ_vu ∈ topK({ψ_v1, ..., ψ_vN}, k), and −∞ otherwise,    (6)

where topK is a partial sorting function that returns the largest k entries from the set of wavelet bases {ψ_v1, ..., ψ_vN}. This technique guarantees attention sparsity such that the embedding of each node is aggregated from at most k other nodes. However, it incurs more computational overhead, since topK has a time complexity of O(N + k log N).

The resulting ψ̃ from either technique is then fed into the softmax layer to compute attention weights. Experiments comparing these techniques are discussed in Section 4.2. We adopt multi-head attention to model multiple spectral filters. Each attention head aggregates node information with a different spectral filter, and the aggregated embeddings are concatenated before being sent to the next layer. We could allocate an independent MLP to each attention head; however, we found that independent MLPs tend to learn spectral filters of similar shapes. Hence, we adopt a single MLP: R^N → R^{N×M}, where M is the number of attention heads and each column of the output corresponds to one adaptive spectral filter.
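The two sparsification rules (Equations 5 and 6) can be sketched as follows; entries set to −∞ receive zero weight under the subsequent softmax. `np.argpartition` plays the role of the topK partial sort:

```python
import numpy as np

def sparsify_threshold(Psi, t):
    """Equation 5: keep psi_vu only if psi_vu > t; -inf entries vanish under softmax."""
    return np.where(Psi > t, Psi, -np.inf)

def sparsify_topk(Psi, k):
    """Equation 6: keep the k largest entries per row (partial selection, no full sort)."""
    out = np.full_like(Psi, -np.inf)
    idx = np.argpartition(Psi, -k, axis=1)[:, -k:]   # indices of the k largest per row
    rows = np.arange(Psi.shape[0])[:, None]
    out[rows, idx] = Psi[rows, idx]
    return out

rng = np.random.default_rng(1)
Psi = rng.normal(size=(4, 6))
topk2 = sparsify_topk(Psi, 2)        # at most 2 finite entries per row
thr = sparsify_threshold(Psi, 0.0)   # row sparsity depends on the values themselves
```

The contrast the section draws is visible here: top-k yields exactly k finite entries per row, while the threshold yields a data-dependent count.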
We name this multi-head spectral attention architecture the global node attention network (GNAN). The design of GNAN is easily generalizable, and many existing GNNs can be expressed as special cases of GNAN (see Appendix D). Figure 1 illustrates how GNAN works with two attention heads learned from the CITESEER dataset. As shown in the illustration, the MLP learns adaptive filters such as low-pass and high-pass filters. A low-pass filter assigns high attention weights within local neighborhoods, while a high-pass filter assigns high attention weights to far-distant nodes, which cannot be captured by a one-hop aggregation scheme.

4. EXPERIMENTS

To evaluate the performance of our proposed model, we conduct experiments on node classification tasks with assortative graph datasets, where the labels of nodes exhibit strong homophily, and disassortative graph datasets, where local homophily is weak and the labels of nodes represent their structural roles. To quantify the assortativeness of graphs, we use the metric β introduced by Pei et al. (2020):

β = (1/|V|) Σ_{v∈V} β_v,  with  β_v = |{u ∈ N_v : ℓ(u) = ℓ(v)}| / |N_v|,

where ℓ(v) refers to the label of node v. β measures the homophily of a graph, and β_v measures the homophily of node v in the graph. A graph has strong local homophily if β is large, and vice versa.
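The homophily metric β is a direct average of per-node neighbor label agreement and can be transcribed almost verbatim. A small sketch (the 4-cycle example is illustrative):

```python
def homophily(labels, neighbors):
    """beta = mean over nodes v of |{u in N_v : label(u) == label(v)}| / |N_v|.
    labels: list indexed by node id; neighbors: dict node -> list of neighbor ids.
    Nodes with no neighbors are skipped (beta_v is undefined for them)."""
    betas = []
    for v, nbrs in neighbors.items():
        if nbrs:
            betas.append(sum(labels[u] == labels[v] for u in nbrs) / len(nbrs))
    return sum(betas) / len(betas)

# Toy 4-cycle with alternating labels: every neighbor disagrees, so beta = 0
# (fully disassortative); uniform labels would give beta = 1 (fully assortative).
labels = [0, 1, 0, 1]
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
beta = homophily(labels, neighbors)
```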

4.1. EXPERIMENTAL SETUP

Baseline methods. We evaluate two variants of GNAN which differ only in the sparsification method: GNAN-T adopts Equation 5 and GNAN-K adopts Equation 6. We compare both variants against 11 baseline methods: vanilla GCN (Kipf & Welling, 2017) and its simplified version SGC (Wu et al., 2019); two spectral methods, one using Chebyshev polynomial spectral filters (Defferrard et al., 2016) and the other using auto-regressive moving average (ARMA) filters (Bianchi et al., 2019); the graph attention model GAT (Velickovic et al., 2018); APPNP, which allows adaptive neighbourhood aggregation using personalized PageRank (Klicpera et al., 2019a); three sampling-based approaches, GraphSage (Hamilton et al., 2017), FastGCN (Chen et al., 2018) and ASGCN (Huang et al., 2018); and Geom-GCN, which targets prediction on disassortative graphs (Pei et al., 2020). We also include MLP among the baselines, since it performs better than many GNN-based methods on disassortative graphs (Liu et al., 2020).

Datasets. We evaluate our model and the baseline methods on node classification tasks over three citation networks, CORA, CITESEER and PUBMED (Sen et al., 2008); three webgraphs from the WebKB dataset, WISCONSIN, TEXAS and CORNELL; and a webgraph from Wikipedia called CHAMELEON (Rozemberczki et al., 2019). We divide these datasets into two groups, assortative and disassortative, based on their β. The details of these datasets are summarized in Table 1.

Hyper-parameter settings. For the citation networks, we follow the experimental setup for node classification from (Hamilton et al., 2017; Huang et al., 2018; Chen et al., 2018) and report results averaged over 10 runs. For the webgraphs, we run each model on the 10 splits provided by Pei et al. (2020) and take the average, where each split uses 60%, 20%, and 20% of the nodes of each class for training, validation and testing, respectively.
The results we report for GCN and GAT are better than those in Pei et al. (2020) because we convert the graphs to undirected graphs before training. Geom-GCN uses node embeddings pre-trained with different methods, such as Isomap (Tenenbaum et al., 2000), Poincare (Nickel & Kiela, 2017) and struc2vec (Ribeiro et al., 2017); we report the best micro-F1 results among all Geom-GCN variants. We use the best-performing hyperparameters specified in the original papers of the baseline methods. For hyperparameters not specified in the original papers, we use the defaults from (Fey & Lenssen, 2019). We report the test accuracy from the epochs with the smallest validation loss and highest validation accuracy. Early termination is adopted for both validation loss and accuracy: training is stopped when neither validation loss nor validation accuracy improves for 100 consecutive epochs. We use a two-layer GNAN whose multi-head filters are learned using an MLP with 2 hidden layers and then approximated by Chebyshev polynomials. Each layer of the MLP consists of a linear function and a ReLU activation. To avoid overfitting, dropout is applied in each GNAN layer on both attention weights and inputs equally.
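The early-termination rule (stop when neither validation loss nor accuracy improves for 100 consecutive epochs) can be sketched as a small helper. This is a hypothetical implementation; the paper does not specify one, and the `should_stop` name and `(val_loss, val_acc)` history format are our own:

```python
def should_stop(history, patience=100):
    """Return True when neither validation loss nor validation accuracy has
    improved over the best values seen before the last `patience` epochs.
    history: list of (val_loss, val_acc) tuples, one per epoch."""
    if len(history) <= patience:
        return False
    best_loss = min(h[0] for h in history[:-patience])
    best_acc = max(h[1] for h in history[:-patience])
    recent = history[-patience:]
    improved = any(l < best_loss or a > best_acc for l, a in recent)
    return not improved

# A run that plateaus for longer than the patience window triggers a stop.
flat_history = [(1.0, 0.5)] * 101
stop = should_stop(flat_history, patience=100)
```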

4.2. RESULTS AND DISCUSSION

We use two metrics to evaluate performance on the node classification tasks: micro-F1 and macro-F1. The results with micro-F1 are summarized in Table 2, and the results with macro-F1 are provided in Table 3 in the appendix. Overall, on assortative citation networks, GNAN performs comparably with the state-of-the-art methods, ranking first on PUBMED and second on CORA and CITESEER in terms of micro-F1 scores. On disassortative graphs, GNAN outperforms all the state-of-the-art methods by a margin of at least 2.4% and MLP by a margin of at least 1.3%. These results indicate that GNAN can learn spectral filters adaptively based on the different characteristics of graphs. Although GNAN performs well on both assortative and disassortative graphs, it is unclear how it performs on disassortative nodes, whose neighbors are mostly of different classes, within an assortative graph. Thus, in Figure 2 we report the average classification accuracy on disassortative nodes at different levels of β_v for the assortative graph datasets CITESEER and PUBMED. The nodes are binned into five groups based on β_v; for example, all nodes with 0.3 < β_v ≤ 0.4 belong to the bin at 0.4. We exclude CORA from this report since it has very few nodes with low β_v. The results in Figure 2 show that all GNNs based on local aggregation schemes perform poorly when β_v is low. One may argue that performance on disassortative graphs could improve by stacking multiple GNN layers to obtain information from far-distant nodes; however, this approach introduces an oversmoothing problem in local aggregation-based GNNs (Li et al., 2018). GNAN, on the other hand, outperforms the other GNN-based methods on disassortative nodes, suggesting that adaptive spectral filters reduce local noise in aggregation while allowing far-distant nodes to be attended to. Attention sparsification.
The two variants of GNAN use slightly different sparsification techniques to speed up computation. For each node v, GNAN-T uses a threshold t to eliminate low ψ_vu (Equation 5), thereby sparsifying the resulting attention matrix. However, t cannot control the level of sparsification precisely. In comparison, GNAN-K keeps the k largest ψ_vu (Equation 6) and therefore guarantees a certain level of sparsification. Nonetheless, GNAN-K requires a partial sort, which adds an overhead of O(N + k log N). To further analyze the impact of attention sparsity on runtime, we plot the density of the attention matrix with respect to both k (Figures 3a and 3c) and t (Figures 3b and 3d). The results are drawn from two datasets: the disassortative dataset CHAMELEON and the assortative dataset CORA. As expected, GNAN-K shows a stable growth in attention density as the value of k increases. GNAN-T, on the other hand, shows fluctuating density with respect to t, reaching the lowest density at t = 1e-9 and t = 1e-6 for CORA and CHAMELEON, respectively. We observe that the attention weights tend to converge to similar small values on all nodes when t goes beyond 0.001 in both datasets. To study how efficiency is improved via sparsification, we also plot the training time averaged over 500 epochs in Figure 3. It shows that GNAN runs much faster when attention weights are well sparsified. In our experiments, we find the best results are achieved with k < 20 for GNAN-K and t < 1e-5 for GNAN-T. Thus, GNAN not only runs faster, but also performs better when attention weights are well sparsified. Frequency range ablation. To understand how adaptive spectral filters contribute to GNAN's performance on disassortative graphs, we conduct an ablation study on spectral frequency ranges.
We first divide the entire frequency range (0 ∼ 2) into a set of mutually exclusive predefined sub-ranges, and then manually set the filter frequency responses to zero for one sub-range at a time in order to check the impact of each sub-range on classification performance. By doing so, the frequencies within the selected sub-range contribute to neither node attention nor feature aggregation, which helps reveal the importance of that sub-range. We consider three different lengths of sub-ranges, i.e., step=1.0, step=0.5, and step=0.25. The results of frequency ablation on the three disassortative graphs are summarized in Figure 4. The results for step=1.0 reveal the importance of the high-frequency range (1 ∼ 2) for node classification on disassortative graphs: performance drops significantly on all datasets when the high-frequency range is ablated. Further investigation at the finer level (step=0.5) shows that the sub-range 0.5 ∼ 1.5 has the most negative impact on performance, whereas the most important sub-range varies across datasets at the finest level (step=0.25). This finding matches our intuition that low-pass filters used in GNNs underlie the local node homophily assumption in a similar way as naive local aggregation. We suspect the choice of low-pass filters also relates to oversmoothing issues in spectral methods (Li et al., 2018), but we leave this for future work.

Figure 3: The attention matrix is effectively sparsified by both k and t, which improves runtime efficiency. Note that the density is not monotonically increasing for GNAN-T, since the threshold is applied to the "learnable" attention weights: when all values of ψ are below t, the density becomes 100% as a result of the softmax normalization.

Attention head ablation. In GNAN, each head uses a spectral filter to produce attention weights. To gauge the importance of a spectral filter, we follow the ablation method used by (Michel et al., 2019).
Specifically, we ablate one or more filters by manually setting their attention weights to zero. We then measure the impact on performance using micro-F1. If the ablation results in a large decrease in performance, the ablated filters are considered important. We observe that all attention heads (spectral filters) in GNAN are of similar importance, and only all attention heads combined produce the best performance. See Appendix C for the detailed results.
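Both ablations above amount to zeroing part of the model: a frequency sub-range of the filter response, or an entire head's attention matrix. A minimal NumPy sketch (the toy spectrum, band boundaries, and head shapes are illustrative):

```python
import numpy as np

def ablate_frequency_range(response, lam, lo, hi):
    """Zero the filter response for eigenvalues in [lo, hi], so that band
    contributes to neither node attention nor feature aggregation."""
    keep = (lam < lo) | (lam > hi)
    return response * keep

def ablate_heads(head_attn, ablated):
    """Zero the attention matrices of the given heads; head_attn: (M, N, N)."""
    out = head_attn.copy()
    out[list(ablated)] = 0.0
    return out

lam = np.linspace(0.0, 2.0, 9)        # toy spectrum covering the full range [0, 2]
response = np.ones_like(lam)          # all-pass response before ablation
low_only = ablate_frequency_range(response, lam, 1.0, 2.0)  # ablate the high band
```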

5. RELATED WORK

Graph neural networks have been extensively studied recently. We categorize work relevant to ours into three perspectives and summarize the key ideas.

Attention on graphs. Graph attention networks (GAT) (Velickovic et al., 2018) were the first to introduce attention mechanisms on graphs. GAT assigns different importance scores to local neighbors via an attention mechanism. Similar to other GNN variants, long-distance information propagation in GAT is realized by stacking multiple layers together; therefore, GAT suffers from the oversmoothing issue (Zhao & Akoglu, 2020). Zhang et al. (2020) improve GAT by incorporating both structural and feature similarities when computing attention scores.

Spectral graph filters and wavelets. Some GNNs also use graph wavelets to extract information from graphs. Xu et al. (2019) applied the graph wavelet transform defined by Shuman et al. (2013) in GNNs. Klicpera et al. (2019b) proposed a general GNN augmentation using graph diffusion kernels to rewire the nodes. Donnat et al. (2018) used heat wavelets to learn node embeddings in an unsupervised way and showed that the learned embeddings closely capture structural similarities between nodes. Other spectral filters used in GNNs can also be viewed as special forms of graph wavelets (Kipf & Welling, 2017; Defferrard et al., 2016; Bianchi et al., 2019). Coincidentally, Chang et al. (2020) also noticed useful information carried by high-frequency components of a graph Laplacian, and similarly attempted to utilize such components using node attention. However, they resorted to the traditional choice of heat kernels and applied such kernels separately to low-frequency and high-frequency components divided by a hyperparameter. In addition, their work did not link high-frequency components to disassortative graphs.

Figure 4: Micro-F1 with respect to an ablated frequency range on disassortative graphs. We divide the frequency range into a set of sub-ranges with different lengths. The results (a) and (d) reveal the importance of the high-frequency range (1 ∼ 2). Further experiments show that there is a subtle difference in the most important range across datasets, but it lies between 0.75 ∼ 1.25.

Prediction on disassortative graphs. Pei et al. (2020) have recently drawn attention to GCN's and GAT's poor performance on disassortative graphs. They tried to address the issue by essentially pivoting feature aggregation to structural neighborhoods from a continuous latent space learned by unsupervised methods. Another attempt to address the issue was proposed by Liu et al. (2020), who sort locally aggregated node embeddings along a one-dimensional space and use a one-dimensional convolution layer to aggregate the embeddings a second time. By doing so, non-local but similar nodes can be attended to. Although our method shares some motivation with the aforementioned work, it is fundamentally different in several aspects. To the best of our knowledge, our method is the first to learn spectral filters as part of supervised training on graphs. It is also the first architecture we know of that computes node attention weights purely from learned spectral filters. As a result, in contrast to the commonly used heat kernel, our method utilizes high-frequency components of a graph, which helps prediction on disassortative graphs.

6. CONCLUSION

In this paper, we study node classification tasks on graphs where local node homophily is weak. We argue that the assumption of local homophily is the cause of poor performance on disassortative graphs, and that designing more generalizable GNNs requires a more flexible and adaptive feature aggregation scheme. To demonstrate this, we introduced the global node attention network (GNAN), which achieves flexible feature aggregation using learnable spectral graph filters. By utilizing the full graph spectrum adaptively via the learned filters, GNAN is able to aggregate features from both close and far-distant nodes. On node classification tasks, GNAN outperforms all benchmarks on disassortative graphs and performs comparably on assortative graphs; on assortative graphs, GNAN also performs better for nodes with weak local homophily. Through our analysis, we find the performance gain is closely linked to the higher end of the frequency spectrum.

APPENDIX A GRAPH SPECTRAL FILTERING WITHOUT EIGEN-DECOMPOSITION

Chebyshev polynomial approximation has been the de facto method for avoiding eigen-decomposition in spectral graph filters, and has been commonly used in previous works (Hammond et al., 2011; Sakiyama et al., 2016; Xu et al., 2019). We hereby use it to approximate Equation 3. Other approximation methods, such as Jackson-Chebyshev polynomials (Napoli et al., 2016), can also serve this purpose, but we leave them for future study. Briefly, in Chebyshev polynomial approximation, the signal filtered by g(L) is represented as a sum of recursive polynomials (Sakiyama et al., 2016):

g(L)x ≈ (1/2 c_0 + Σ_{i=1}^p c_i T_i(L)) x,

where T_0(L) = I, T_1(L) = (2L − λ_max I)/λ_max, T_i(L) = (2(2L − λ_max I)/λ_max) T_{i−1}(L) − T_{i−2}(L), and

c_i = (2/S) Σ_{m=1}^S cos(π i (m − 1/2)/S) × g((λ_max/2)(cos(π (m − 1/2)/S) + 1)),  for i = 0, ..., p,

where p is the approximation order and S is the number of sampling points, normally set to S = p + 1. In Equation 3, the MLP produces the filter responses, so we have g = MLP in the equations above. The approximation is differentiable, so the parameters of the MLP can be learned by gradient descent from the loss function. The approximation has a time complexity of O(p × |E|), so the complexity of Equation 3 is also O(p × |E|). Please note that while Chebyshev polynomials appear in both our method and ChebNet, they are used in fundamentally different ways: ChebNet uses truncated Chebyshev polynomials directly as the polynomial filter, while we use them to approximate the filtering operation of a learned filter. Naturally, the approximation error decreases as a larger p is used, which is why we use p > 12 in our model.
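The approximation above can be sketched end-to-end. This NumPy example uses dense matrices on a toy path graph for clarity (a sparse L gives the O(p × |E|) cost), and replaces the learned MLP with a fixed g(λ) = e^{−λ} so the result can be checked against exact eigen-decomposition:

```python
import numpy as np

def cheby_filter(L, x, g, p, lam_max=2.0):
    """Approximate g(L)x with order-p Chebyshev polynomials; the only matrix
    operations are p matrix-vector products with the (rescaled) Laplacian."""
    S = p + 1                                  # number of sampling points
    m = np.arange(1, S + 1)
    theta = np.pi * (m - 0.5) / S
    samples = g(lam_max / 2 * (np.cos(theta) + 1))
    # c_i = (2/S) * sum_m cos(i * theta_m) * g(lam_max/2 * (cos(theta_m) + 1))
    c = np.array([2.0 / S * np.sum(np.cos(i * theta) * samples) for i in range(p + 1)])

    N = L.shape[0]
    Lt = (2.0 * L - lam_max * np.eye(N)) / lam_max  # rescale spectrum to [-1, 1]
    T_prev, T_cur = x, Lt @ x                       # T_0 x and T_1 x
    out = 0.5 * c[0] * T_prev + c[1] * T_cur
    for i in range(2, p + 1):                       # T_i = 2 Lt T_{i-1} - T_{i-2}
        T_prev, T_cur = T_cur, 2.0 * Lt @ T_cur - T_prev
        out = out + c[i] * T_cur
    return out

# Toy path graph, normalized Laplacian, and a test signal.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
d = A.sum(1)
dis = d ** -0.5
L = np.eye(4) - dis[:, None] * A * dis[None, :]
x = np.array([1.0, -1.0, 2.0, 0.5])
approx = cheby_filter(L, x, lambda l: np.exp(-l), p=20)
```

For smooth filters like e^{−λ}, the order-20 approximation agrees with the exact spectral filtering to well within 1e-6, illustrating why a moderately large p suffices.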

B FURTHER EXPERIMENT RESULTS

We provide the macro-F1 scores for the classification task in Table 3. The proposed model outperforms the other models on disassortative graphs and performs comparably on assortative graphs.

C ABLATION STUDY ON FILTERS

We provide the detailed version of Figure 4 (c) and (f) in Figure 5. We further ablated attention heads to check the importance of each head in classification.

Ablating all but one spectral filter. In GNAN, each head uses a filter to produce spectral attention weights. To probe the importance of a filter, we follow the ablation method of Michel et al. (2019). Specifically, we ablate one or more filters by manually setting their attention scores to zero, and then measure the impact on performance using micro-F1. If the ablation results in a large decrease in performance, the ablated filter(s) are considered important. The results are summarized in Table 4a. All attention heads (filters) in GNAN are of similar importance, and only all heads combined produce the best performance.

Ablating only one spectral filter. We then examine performance differences when ablating a single filter while keeping all others (Table 4b). In contrast to the above, ablating just one filter decreases performance only by a small margin, and ablating some filters does not impact prediction performance at all. This indicates potential redundancies among the filters; we leave redundancy reduction in the model for future work. The micro-F1 change per ablated filter is:

Dataset  | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8     | 9      | 10     | 11     | 12     | 13    | 14
CORA     | 0.00%  | 0.00%  | -0.30% | 0.00%  | -0.40% | 0.00%  | 0.00%  | 0.00% | -0.40% | -0.40% | 0.00%  | -0.40% | --    | --
CITESEER | -0.20% | -0.30% | -0.70% | -0.70% | -0.40% | -0.60% | -0.80% | 0.00% | -0.60% | 0.00%  | -0.80% | -0.50% | 0.00% | 0.00%
PUBMED   | -0.80% | -0.80% | -0.70% | 0.00%  | -0.80% | 0.00%  | 0.00%  | 0.00% | -0.70% | -0.80% | -0.80% | 0.00%  | --    | --

D CONNECTIONS TO OTHER METHODS

In this section, we show that GNAN has strong connections to existing models, and that many GNNs can be expressed as special cases of GNAN under certain conditions.

D.1 CONNECTION TO GCN

A GCN (Kipf & Welling, 2017) layer can be expressed as

h_v^(k) = ReLU( Σ_{u=1}^{N} â_vu h_u^(k−1) W^(k) ),

where â_vu are the elements of the v-th row of the normalized adjacency matrix Â = D̃^{−1/2} Ã D̃^{−1/2}, with Ã = A + I_N and D̃_vv = Σ_{u=1}^{N} Ã_vu, so that

â_vu = 1/√(D̃_vv D̃_uu) if e_vu ∈ Ẽ, and 0 otherwise,

where Ẽ denotes the edge set augmented with self-loops. Therefore, GCN can be viewed as a case of Equation 4 with σ = ReLU and a_vu = â_vu.
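The renormalized propagation above is a one-liner in numpy. A minimal sketch of a single GCN layer, for illustration only:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(A_hat @ H @ W), where
    A_hat = D~^{-1/2} (A + I) D~^{-1/2} (Kipf & Welling, 2017)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)                     # D~_vv
    A_hat = A_tilde / np.sqrt(np.outer(d, d))   # symmetric normalization
    return np.maximum(0, A_hat @ H @ W)         # ReLU
```

Each output row is exactly the homophilic 1-hop aggregation that Section 1 argues is suboptimal on disassortative graphs.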

D.2 CONNECTION TO POLYNOMIAL FILTERS

Polynomial filters localize in a node's K-hop neighborhood using K-order polynomials (Defferrard et al., 2016); most take the following form:

g_θ(Λ) = Σ_{k=0}^{K−1} θ_k Λ^k,

where θ_k is a learnable polynomial coefficient for each order. A GNN layer using a polynomial filter thus becomes

h_v^(k) = ReLU( Σ_{u=1}^{N} (U g_θ(Λ) U^T)_vu h_u ),

which can be expressed using Equation 4 with W^(k) = I_N, σ = ReLU, and a_vu = (U g_θ(Λ) U^T)_vu. In comparison, our method uses an MLP to learn the spectral filters instead of a polynomial filter, and the coefficients, after sparsification and normalization, are used directly as attentions.
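Since U Λ^k Uᵀ = L^k, the implied attention matrix U g_θ(Λ) Uᵀ can be accumulated directly in the vertex domain without any eigendecomposition. A minimal sketch (illustrative, not the paper's code):

```python
import numpy as np

def poly_filter_attention(L, theta):
    """Attention matrix implied by a polynomial spectral filter:
    U g_theta(Lambda) U^T = sum_k theta_k L^k (Defferrard et al., 2016)."""
    n = L.shape[0]
    out, Lk = np.zeros((n, n)), np.eye(n)
    for t in theta:       # accumulate theta_k * L^k term by term
        out += t * Lk
        Lk = Lk @ L
    return out
```

Because L^k is zero beyond k-hop neighborhoods, a K-order polynomial can never place attention on nodes more than K hops away, which is the rigidity our learned filters avoid.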

D.3 CONNECTION TO GAT

Our method is inspired by and closely related to GAT (Velickovic et al., 2018). To demonstrate the connection, we first define a matrix Φ where each column φ_v stacks the transformed feature vector of node v concatenated with the transformed feature vector of every other node (including node v itself):

φ_v = ||_{u=1}^{N} [W h_v || W h_u].

GAT multiplies each column of Φ with a learnable weight vector α and masks the result with the adjacency matrix A before feeding it to the nonlinearity LeakyReLU and softmax to calculate attention scores. The masking can be expressed as a Hadamard product with A, which is congruent to a graph wavelet transform with the filter g(Λ) = I − Λ:

Ψ = A = D^{1/2} U (I − Λ) U^T D^{1/2},

and the GAT attention vector for node v becomes

a_v = softmax( LeakyReLU( (α^T φ_v) ⊙ ψ_v ) ),

where ψ_v is the v-th row of Ψ after applying Equation 5 with t = 0, and ⊙ denotes the Hadamard product, as in Velickovic et al. (2018). In comparison with our method, GAT incorporates node features in the attention score calculation, while node attentions in our method are computed purely from the graph wavelet transform. Also, attentions in GAT are masked by A, which restricts them to node v's 1-hop neighbors only.
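GAT's masked attention for one node can be sketched directly from the description above. This is a simplified illustration (single head, adjacency mask without self-loops, function name is ours, not from the GAT paper):

```python
import numpy as np

def gat_attention_row(h, W, alpha, A, v):
    """Attention scores of node v over its 1-hop neighbours, in the style of
    GAT (Velickovic et al., 2018). alpha splits into halves applied to
    [W h_v || W h_u]."""
    z = h @ W.T                          # transformed features W h_u
    F = z.shape[1]
    e = z @ alpha[F:] + z[v] @ alpha[:F] # raw scores alpha^T [W h_v || W h_u]
    e = np.maximum(0.2 * e, e)           # LeakyReLU (negative slope 0.2)
    e = np.where(A[v] > 0, e, -np.inf)   # adjacency mask: 1-hop only
    w = np.exp(e - e[A[v] > 0].max())    # numerically stable softmax
    return w / w.sum()
```

Note how the adjacency mask forces zero attention on every node outside the 1-hop neighborhood, which is precisely the restriction GNAN's spectral attention removes.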

D.4 CONNECTION TO SKIP-GRAM METHODS

Skip-gram models in natural language processing have been shown to be equivalent to a form of matrix factorization (Levy & Goldberg, 2014). Recently, Qiu et al. (2018) proved that many Skip-Gram Negative Sampling (SGNS) models used in node embedding, including DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015b), PTE (Tang et al., 2015a), and node2vec (Grover & Leskovec, 2016), essentially factorize implicit matrices closely related to the normalized graph Laplacian. These implicit matrices can be presented as graph wavelet transforms on the graph Laplacian. For simplicity, we use DeepWalk, a generalized form of LINE and PTE, as an example. Qiu et al. (2018) show that DeepWalk effectively factorizes the matrix

log( (vol(G)/T) ( Σ_{r=1}^{T} P^r ) D^{−1} ) − log(b),

where vol(G) = Σ_v D_vv is the sum of node degrees, P = D^{−1} A is the random walk matrix, T is the skip-gram window size, and b is the negative sampling parameter. We know that

P = I − D^{−1/2} L D^{1/2} = D^{−1/2} U (I − Λ) U^T D^{1/2},

so Equation 13 can be written using the graph Laplacian as

log( (vol(G)/T) D^{−1/2} ( Σ_{r=1}^{T} (I − L)^r ) D^{−1/2} ) − log(b),

or, after eigen-decomposition, as

M = log( (vol(G)/(T b)) D^{−1/2} U ( Σ_{r=1}^{T} (I − Λ)^r ) U^T D^{−1/2} ),

where U ( Σ_{r=1}^{T} (I − Λ)^r ) U^T, denoted ψ_sg, is a wavelet transform with the filter g_sg(λ) = Σ_{r=1}^{T} (1 − λ)^r. Therefore, DeepWalk can be seen as a special case of Equation 4 where a_v = (ψ_sg)_v, the v-th row of ψ_sg, assigning H = W = I, K = 1 and σ(X) = log( (vol(G)/(T b)) D^{−1/2} X D^{−1/2} ). We have

h_v = FACTORIZE(σ(a_v)),

where FACTORIZE is a matrix factorization operator of choice; Qiu et al. (2018) use SVD in a generalized SGNS model, where the decomposed matrices U_d and Σ_d from M = U_d Σ_d V_d^T are used to obtain the node embedding U_d √Σ_d.
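The implicit matrix and its SVD embedding can be sketched directly from the formulas above. A minimal dense-matrix illustration (following Qiu et al., 2018; the log is clipped elementwise for numerical safety, which is our simplification, not part of the derivation):

```python
import numpy as np

def deepwalk_matrix(A, T=2, b=1.0):
    """Implicit matrix factorized by DeepWalk (Qiu et al., 2018):
    M = log( vol(G)/(T b) * (sum_{r=1..T} P^r) D^{-1} ), P = D^{-1} A."""
    d = A.sum(axis=1)
    vol = d.sum()                        # vol(G) = sum of degrees
    P = A / d[:, None]                   # random walk matrix
    S, Pr = np.zeros_like(A), np.eye(A.shape[0])
    for _ in range(T):                   # accumulate P^1 + ... + P^T
        Pr = Pr @ P
        S += Pr
    M = vol / (T * b) * S / d[None, :]   # right-multiply by D^{-1}
    return np.log(np.maximum(M, 1e-12))  # elementwise log, clipped

def embed(M, dim):
    """Node embeddings U_d sqrt(Sigma_d) from truncated SVD of M."""
    U, s, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(s[:dim])
```

On any undirected graph the resulting M is symmetric, reflecting the wavelet form D^{−1/2} ψ_sg D^{−1/2} derived above.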



http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/




Figure 1: Illustration of a spectral node attention layer on a 3-hop ego network of the central node v from the CITESEER dataset. Node classes are indicated by shape and color. Passing the graph through two learned spectral filters (adaptive spectral filters) places attention scores on nodes, including node v itself; nodes with positive attention scores are shown in color. Node features are aggregated for node v according to the attention scores (aggregation via global attention). The low-pass filter attends to local neighbors (filter 1), while the high-pass filter skips the first hop and attends to nodes in the second hop (filter K). The resulting embeddings from multiple heads are concatenated before being sent to the next layer (multi-head concatenation). Note that the visualized filters are learned filters from our experiments.

Figure 2: Micro-F1 results for classification accuracy on disassortative nodes (β v ≤ 0.5). GNAN shows better accuracy on classifying disassortative nodes than the other methods.

Figure 3: Attention matrix density and training runtime with respect to k and t. The attention matrix is effectively sparsified by both k and t, which improves runtime efficiency. Note that the density is not monotonic in t for GNAN-T, since the threshold is applied to the learnable attention weights: when all values of ψ are below t, the density becomes 100% as a result of the softmax normalization.

Figure 5: Full details of the performances on frequency ablation at 0.25 level.

One can apply a spectral filter g as in Equation 1 and use the graph Fourier transform to manipulate signals on a graph in various ways, such as smoothing and denoising (Schaub &

Dataset statistics. We categorize the datasets into assortative and disassortative based on their homophily (β).

Micro-F1 results for node classification. The proposed model consistently outperforms the other GNN variants on disassortative graphs and performs comparably on assortative graphs.

Macro-F1 for node classification task.

Ablation study on attention head (filter). We use 12 attention heads for CORA and PUBMED, and 14 heads for CITESEER.


