GLOBAL NODE ATTENTIONS VIA ADAPTIVE SPECTRAL FILTERS

Abstract

Graph neural networks (GNNs) have been extensively studied for prediction tasks on graphs. Most GNNs assume local homophily, i.e., strong similarities in local neighborhoods. This assumption limits the generalizability of GNNs, as demonstrated by recent work on disassortative graphs with weak local homophily. In this paper, we argue that a GNN's feature aggregation scheme can be made flexible and adaptive to data without the assumption of local homophily. To demonstrate this, we propose a GNN model with a global self-attention mechanism defined using learnable spectral filters, which can attend to any node, regardless of distance. We evaluated the proposed model on node classification tasks over seven benchmark datasets. The proposed model generalizes well to both assortative and disassortative graphs. Further, it outperforms all state-of-the-art baselines on disassortative graphs and performs comparably with them on assortative graphs.

1. INTRODUCTION

Graph neural networks (GNNs) have recently demonstrated great power in graph-related learning tasks, such as node classification (Kipf & Welling, 2017), link prediction (Zhang & Chen, 2018) and graph classification (Lee et al., 2018). Most GNNs follow a message-passing architecture where, in each GNN layer, a node aggregates information from its direct neighbors indiscriminately. In this architecture, information from long-distance nodes is propagated and aggregated by stacking multiple GNN layers together (Kipf & Welling, 2017; Velickovic et al., 2018; Defferrard et al., 2016). However, this architecture rests on the assumption of local homophily, i.e., the proximity of similar nodes. While this assumption seems reasonable and helps to achieve good prediction results on graphs with strong local homophily, such as citation networks and community networks (Pei et al., 2020), it limits the generalizability of GNNs. In particular, determining whether a graph has strong local homophily is a challenge by itself. Furthermore, a graph can exhibit strong local homophily in some parts and weak local homophily in others, which makes a learning task even more challenging. Pei et al. (2020) proposed a metric that measures the local homophily of a node by the fraction of its neighbors that belong to the same class. Using this metric, they categorized graphs as assortative (strong local homophily) or disassortative (weak local homophily), and showed that classical GNNs such as GCN (Kipf & Welling, 2017) and GAT (Velickovic et al., 2018) perform poorly on disassortative graphs. Liu et al. (2020) further showed that GCN and GAT are outperformed by a simple multilayer perceptron (MLP) in node classification tasks on disassortative graphs, because the naive local aggregation of homophilic models brings in more noise than useful information on such graphs. These findings indicate that such GNN models perform sub-optimally when the fundamental assumption of local homophily does not hold.
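The homophily metric just mentioned can be made concrete. The snippet below is a minimal illustration, not the authors' code, of a per-node homophily score in the spirit of Pei et al. (2020): the fraction of a node's neighbors that share its class label, averaged over nodes for a graph-level value. The adjacency lists and labels are hypothetical toy data.

```python
def node_homophily(adj, labels):
    """Per-node homophily: fraction of neighbors with the same label.

    adj:    dict mapping node -> list of neighbor nodes
    labels: dict mapping node -> class label
    """
    scores = {}
    for v, nbrs in adj.items():
        if not nbrs:
            scores[v] = 0.0  # isolated node: no neighborhood evidence
            continue
        same = sum(1 for u in nbrs if labels[u] == labels[v])
        scores[v] = same / len(nbrs)
    return scores

# Hypothetical toy graph: node 3 is perfectly homophilic, node 2 is not.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}

h = node_homophily(adj, labels)
# Graph-level homophily: the average of the per-node scores.
h_graph = sum(h.values()) / len(h)
```

Under this metric, a graph with a low average score would be labeled disassortative; a high score, assortative.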
Based on the above observation, we argue that a well-generalized GNN should perform well on graphs regardless of their local homophily. Furthermore, since a real-world graph can exhibit both strong and weak homophily in different node neighborhoods, a powerful GNN model should be able to aggregate node features using different strategies accordingly. For instance, in disassortative graphs where a node shares no similarity with any of its direct neighbors, such a GNN model should be able to ignore direct neighbors and reach farther to find similar nodes, or at least resort to the node's own attributes to make a prediction. Since the validity of the local homophily assumption is often unknown, such aggregation strategies should be learned from data rather than decided upfront. To this end, we propose a novel GNN model with a global self-attention mechanism, called GNAN. Most existing attention-based aggregation architectures apply self-attention over the local neighborhood of a node (Velickovic et al., 2018), which may introduce local noise into aggregation. Unlike these works, we aim to design an aggregation method that can gather informative features from both nearby and distant nodes. To achieve this, we employ graph wavelets under a relaxed condition of localization, which enables us to learn attention weights for nodes in the spectral domain. In doing so, the model can effectively capture not only local information but also global structure in node representations. To further improve the generalizability of our model, instead of using predefined spectral kernels, we propose to use multi-layer perceptrons (MLPs) to learn the desired spectral filters without constraining their shapes. Existing works on graph wavelet transforms choose wavelet filters heuristically, such as the heat kernel, wave kernel and personalized PageRank kernel (Klicpera et al., 2019b; Xu et al., 2019; Klicpera et al., 2019a).
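To illustrate the idea of a freely shaped, learnable spectral filter, the sketch below parameterizes a filter response g_θ(λ) with a small MLP and applies it through the eigendecomposition of the normalized Laplacian. This is only a forward-pass illustration under assumed dimensions, with untrained random weights; it is not the GNAN architecture itself, and the dense eigendecomposition is used purely for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-layer MLP g_theta: eigenvalue -> filter response.
# In the paper's setting such weights would be trained end-to-end;
# here they are random, illustrating only the parameterization.
W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def g_theta(lam):
    """Learnable spectral filter: no predefined shape (heat, PPR, ...)."""
    h = np.maximum(lam[:, None] @ W1 + b1, 0.0)  # ReLU hidden layer
    return (h @ W2 + b2).ravel()                 # one response per eigenvalue

def spectral_filter_apply(L, x):
    """Filter a graph signal x as U diag(g_theta(lam)) U^T x."""
    lam, U = np.linalg.eigh(L)   # spectrum and spectral basis
    return U @ (g_theta(lam) * (U.T @ x))
```

Because the MLP is unconstrained, the learned response can pass, damp, or amplify any band of the spectrum, rather than being restricted to a low-pass shape.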
These are mostly low-pass filters, which means that such models implicitly treat high-frequency components as "noise" and discard them (Shuman et al., 2013; Hammond et al., 2011; Chang et al., 2020). However, this may hinder the generalizability of models, since high-frequency components can carry meaningful information about local discontinuities, as analyzed in (Shuman et al., 2013). Our model overcomes these limitations by incorporating fully learnable spectral filters into the proposed global self-attention mechanism. From a computational perspective, learning global self-attention may impose high computational overhead, particularly on large graphs. We alleviate this problem in two ways. First, we sparsify nodes according to their wavelet coefficients, which enables attention weights to be distributed sparsely across the graph. Second, we observed that spectral filters learned by different MLPs tend to converge to similar shapes; thus, we use a single MLP to reduce redundancy among filters, where each dimension of the output corresponds to one learnable spectral filter. In addition, following (Xu et al., 2019; Klicpera et al., 2019b), we use a fast algorithm to efficiently approximate the graph wavelet transform, which has computational complexity O(p × |E|), where p is the order of the Chebyshev polynomials and |E| is the number of edges in a graph. To summarize, the main contributions of this work are as follows:
1. We propose a generalized GNN model which performs well on both assortative and disassortative graphs, regardless of local node homophily.
2. We show that a GNN's aggregation strategy can be trained via fully learnable spectral filters, thereby enabling feature aggregation from both close and far nodes.
3. We show that, contrary to common belief, high-frequency components on disassortative graphs provide meaningful information that helps improve prediction performance.
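The fast approximation referenced above relies on the standard Chebyshev polynomial expansion of a spectral filter (Hammond et al., 2011), which replaces the O(N³) eigendecomposition with p sparse matrix-vector products, giving the stated O(p × |E|) cost. The following is a minimal numpy sketch of that technique, not the paper's implementation; the heat-kernel filter used in the test is just an example filter choice.

```python
import numpy as np

def chebyshev_filter(L, x, g, p=20):
    """Approximate g(L) x with an order-p Chebyshev expansion.

    Avoids diagonalizing L: only p matrix-vector products with L,
    i.e. O(p * |E|) when L is stored sparsely.
    """
    N = L.shape[0]
    lam_max = 2.0  # upper bound on the normalized Laplacian's spectrum
    # Rescale the spectrum from [0, lam_max] to [-1, 1].
    L_hat = (2.0 / lam_max) * L - np.eye(N)
    # Chebyshev coefficients of g over [0, lam_max] via the
    # standard cosine-quadrature formula.
    K = p + 1
    theta = (np.arange(K) + 0.5) * np.pi / K
    pts = lam_max / 2.0 * (np.cos(theta) + 1.0)
    c = np.array([2.0 / K * np.sum(g(pts) * np.cos(k * theta))
                  for k in range(K)])
    c[0] /= 2.0
    # Three-term recurrence: T_0(L_hat)x = x, T_1(L_hat)x = L_hat x.
    t_prev, t_cur = x, L_hat @ x
    out = c[0] * t_prev + c[1] * t_cur
    for k in range(2, K):
        t_prev, t_cur = t_cur, 2.0 * (L_hat @ t_cur) - t_prev
        out = out + c[k] * t_cur
    return out
```

For a smooth filter such as the heat kernel, a modest order p already matches the exact spectral computation to high precision, which is what makes the approximation practical on large graphs.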
We conduct extensive experiments to compare GNAN with well-known baselines on node classification tasks. The experimental results show that GNAN significantly outperforms state-of-the-art methods on disassortative graphs, where local node homophily is weak, and performs comparably with them on assortative graphs, where local node homophily is strong. This empirically verifies that GNAN is a general model for learning on different types of graphs.

2. PRELIMINARIES

Let G = (V, E, A, x) be an undirected graph with N nodes, where V, E, and A are the node set, edge set, and adjacency matrix of G, respectively, and x : V → R^m is a graph signal function that associates each node with a feature vector. The normalized Laplacian matrix of G is defined as L = I − D^(−1/2) A D^(−1/2), where D ∈ R^(N×N) is the diagonal degree matrix of G. In spectral graph theory, the eigenvalues Λ = diag(λ_1, ..., λ_N) and eigenvectors U of L = U Λ U^H are known as the graph's spectrum and spectral basis, respectively, where U^H is the Hermitian transpose of U. The graph Fourier transform of a signal x is x̂ = U^H x and its inverse is x = U x̂. The spectrum and spectral basis carry important information on the connectivity of a graph (Shuman et al., 2013). Intuitively, lower frequencies correspond to global and smooth information on the graph, while higher frequencies correspond to local information, discontinuities and possible noise (Shuman et al., 2013). One can apply a spectral filter g as in Equation 1 and use the graph Fourier transform to manipulate signals on a graph in various ways, such as smoothing and denoising (Schaub &
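These definitions can be checked numerically. The snippet below, a self-contained sketch on a hypothetical 4-node path graph, builds the normalized Laplacian, computes its spectrum and spectral basis, and verifies that the inverse graph Fourier transform recovers the original signal (for a real symmetric L, the Hermitian transpose U^H reduces to U^T).

```python
import numpy as np

# Hypothetical example graph: an undirected 4-node path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
# Normalized Laplacian: L = I - D^(-1/2) A D^(-1/2).
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

# Spectrum (lam) and spectral basis (columns of U): L = U diag(lam) U^T.
lam, U = np.linalg.eigh(L)

x = np.array([1.0, -2.0, 0.5, 3.0])  # a graph signal
x_hat = U.T @ x                      # graph Fourier transform
x_rec = U @ x_hat                    # inverse transform recovers x
```

The eigenvalues of the normalized Laplacian always lie in [0, 2], with the smallest equal to 0 for a connected graph; the corresponding eigenvector is the smoothest (lowest-frequency) basis signal.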

