REWIRING WITH POSITIONAL ENCODINGS FOR GNNS

Abstract

Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to r-hop neighborhoods. More specifically, our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. We thus modify graphs before inputting them to a downstream GNN model, instead of modifying the model itself. This makes our method model-agnostic, i.e., compatible with any existing GNN architecture. We also provide examples of positional encodings that are lossless, with a one-to-one map between the original and the modified graphs. We demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small r. We obtain improvements on a variety of models and datasets, and reach state-of-the-art performance using traditional GNNs or graph Transformers.

1. INTRODUCTION

GNN layers typically embed each node of a graph as a function of its neighbors' (1-ring's) embeddings from the previous layer; that is, the receptive field of each node is its 1-hop neighborhood. Hence, at least r stacked GNN layers are needed for nodes to receive information about their r-hop neighborhoods. Barceló et al. (2020) and Alon and Yahav (2021) identify two broad limitations associated with this structure: under-reaching occurs when the number of layers is insufficient to communicate information between distant vertices, while over-squashing occurs when certain edges act as bottlenecks for information flow.

Inspired by the success of the Transformer in natural language processing (Vaswani et al., 2017), recent methods expand node receptive fields to the whole graph (Dwivedi and Bresson, 2021; Ying et al., 2021). Since they effectively replace the topology of the graph with that of a complete graph, these works propose positional encodings that communicate the connectivity of the input graph as node or edge features. As these methods operate on fully-connected graphs, the computational cost of each layer is quadratic in the number of nodes, obliterating the sparsity afforded by conventional 1-ring architectures. Moreover, the success of 1-ring GNNs suggests that local feature aggregation is a useful inductive bias; when the receptive field is the whole graph, this bias has to be learned, leading to slow and sensitive training.

In this paper, we expand receptive fields from 1-ring neighborhoods to r-ring neighborhoods, where r ranges from 1 (typical GNNs) to R, the diameter of the graph (fully-connected). That is, we augment a graph with edges between each node and all others within distance r in the input topology. We show that performance is significantly improved using fairly small r and carefully-chosen positional encodings annotating this augmented graph. This simple but effective approach can be combined with any GNN.

Contributions.
We apply GNN architectures to augmented graphs connecting vertices to their peers of distance ≤ r. Our contributions are as follows: (i) We increase receptive fields using a modified graph with positional encodings as edge and node features. (ii) We compare r-hop positional encodings on the augmented graph, specifically lengths of shortest paths, spectral computations, and powers of the graph adjacency matrix. (iii) We demonstrate that relatively small r-hop neighborhoods suffice to increase performance across models and that performance degrades in the fully-connected setting.
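To make the rewiring step concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) using networkx: it connects every pair of nodes within shortest-path distance r and records the distance as an edge feature, one of the simplest positional encodings considered here.

```python
import networkx as nx

def rewire_r_hop(G, r):
    """Return a copy of G with an edge between every pair of nodes at
    shortest-path distance <= r, annotated with that distance as an
    edge feature. Assumes orderable (e.g., integer) node labels."""
    H = G.copy()
    # All-pairs shortest-path lengths, truncated at distance r.
    lengths = dict(nx.all_pairs_shortest_path_length(G, cutoff=r))
    for u, dists in lengths.items():
        for v, d in dists.items():
            if u < v and d >= 1:  # skip self-distances; add each pair once
                H.add_edge(u, v, distance=d)
    return H

# Example: the path graph 0-1-2-3 rewired with r = 2 gains edges (0, 2) and (1, 3).
G = nx.path_graph(4)
H = rewire_r_hop(G, r=2)
```

Because the original edges are kept (only new ones are added, each labeled with its distance), the map from the input graph to the rewired graph is injective, in line with the lossless encodings discussed above.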

2. RELATED WORK

The Transformer has permeated deep learning (Vaswani et al., 2017), with state-of-the-art performance in NLP (Devlin et al., 2018), vision (Parmar et al., 2018), and genomics (Zaheer et al., 2020). Its core components include multi-head attention, an expanded receptive field, positional encodings, and a CLS token (virtual global source and sink nodes). Several works adapt these constructions to GNNs. For example, the Graph Attention Network (GAT) performs attention over the neighborhood of each node, but does not generalize multi-head attention using positional encodings (Veličković et al., 2018). Recent works use Laplacian spectra, node degrees, and shortest-path lengths as positional encodings to expand attention to all nodes (Kreuzer et al., 2021; Dwivedi and Bresson, 2021; Rong et al., 2020; Ying et al., 2021). Several works also adapt attention mechanisms to GNNs (Yun et al., 2019; Cai and Lam, 2019; Hu et al., 2020; Baek et al., 2021; Veličković et al., 2018; Wang et al., 2021b; Zhang et al., 2020; Shi et al., 2021).

Path and distance information has been incorporated into GNNs more generally. Yang et al. (2019) introduce the Shortest Path Graph Attention Network (SPAGAN), whose layers incorporate path-based attention via shortest paths between a center node and distant neighbors, using an involved hierarchical path-aggregation method to aggregate a feature for each node. Like us, SPAGAN introduces the ≤ k-hop neighbors around the center node as a hyperparameter; their model, however, has hyperparameters controlling path sampling. Beyond SPAGAN, Chen et al. (2019) concatenate node features, edge features, distances, and ring flags to compute attention probabilities. Li et al. (2020) show that distance encodings (i.e., a one-hot feature of distance as an extra node attribute) obtain more expressive power than the 1-Weisfeiler-Lehman test. Graph-BERT introduces multiple positional encodings to apply Transformers to graphs and operates on sampled subgraphs to handle large graphs (Zhang et al., 2020). Yun et al. (2019) introduce the Graph Transformer Network (GTN) for learning a new graph structure, which identifies "meta-paths" and multi-hop connections to learn node representations. Wang et al. (2021a) introduce the Multi-hop Attention Graph Neural Network (MAGNA), which uses diffusion to extend attention to multi-hop connections. Frankel et al. (2021) extend GAT attention to a stochastically-sampled neighborhood of neighbors within 5 hops of the central node. Isufi et al. (2020) introduce EdgeNets, which enable flexible multi-hop diffusion. Luan et al. (2019) generalize spectral graph convolution and GCN in block Krylov subspace forms.

For scalability, Hamilton et al. (2017) sample from a node's local neighborhood to generate embeddings and aggregate features, while Zhang et al. (2018) sample to deal with topological noise. Rossi et al. (2020) introduce Scalable Inception Graph Neural Networks (SIGN), which avoid sampling by precomputing convolutional filters. Kipf and Welling (2017) preprocess diffusion on graphs for efficient training. Topping et al. (2021) use graph curvature to rewire graphs and combat over-squashing and bottlenecks.

Each layer of our GNN attends to the r-hop neighborhood around each node. Unlike SPAGAN and Graph-BERT, our method is model-agnostic and does not perform sampling, avoiding their sampling-ratio and number-of-iterations hyperparameters. Unlike GTN, we do not restrict to a particular graph structure. Broadly, our approach requires no architecture or optimization changes. Thus, our work also joins a trend of decoupling the input graph from the graph used for information propagation (Veličković, 2022). In contrast to the works above, we do not use diffusion, curvature, or sampling, but expand receptive fields via Transformer-inspired positional encodings. In this sense, we avoid the inductive biases of pre-defined notions of diffusion and curvature, and since we do not remove connectivity, injective (lossless) changes are easy to obtain.

3. PRELIMINARIES AND DESIGN

Let G = (V, E, f_v, f_e) denote a graph with nodes V ⊂ N_0 and edges E ⊆ V × V, and let 𝒢 be the set of such graphs. For each graph, functions f_v : V → R^{d_v} and f_e : E → R^{d_e} denote node and edge features, respectively. We consider learning on graphs, specifically node classification and graph classification. At inference, the input is a graph G. For node classification, the task is to predict a label for each node; for graph classification, it is to predict a single label for the entire graph.
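The three families of r-hop positional encodings compared in this work (shortest-path lengths, spectral quantities, and powers of the adjacency matrix) can each be computed from the graph alone. The sketch below is our own illustration with numpy/networkx; the function name, the eigenvector count k, and the lack of normalization are assumptions for exposition, not the paper's exact recipe.

```python
import numpy as np
import networkx as nx

def positional_encodings(G, r=2, k=2):
    """Compute three simple positional-encoding families for graph G."""
    A = nx.to_numpy_array(G)  # dense adjacency matrix

    # (1) Shortest-path lengths within r hops: a natural edge encoding
    #     for the r-hop rewired graph.
    spd = dict(nx.all_pairs_shortest_path_length(G, cutoff=r))

    # (2) Spectral encoding: the first k non-trivial eigenvectors of the
    #     combinatorial Laplacian L = D - A, used as node features
    #     (eigh returns eigenvalues in ascending order, so column 0 is
    #     the constant eigenvector on a connected graph).
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)
    lap_pe = eigvecs[:, 1:k + 1]

    # (3) Powers of the adjacency matrix: entry (i, j) of A^d counts
    #     walks of length d, so stacking A^1 ... A^r gives a multi-hop
    #     edge encoding.
    powers = np.stack([np.linalg.matrix_power(A, d) for d in range(1, r + 1)])

    return spd, lap_pe, powers
```

For example, on the 4-cycle, `spd[0][2]` is 2, `lap_pe` has one row per node, and `powers[1]` (i.e., A²) has diagonal entries equal to each node's degree.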

