DIFFUSING GRAPH ATTENTION

Abstract

The dominant paradigm for machine learning on graphs uses Message Passing Graph Neural Networks (MP-GNNs), in which node representations are updated by aggregating information from their local neighborhood. Recently, there have been a growing number of attempts to adapt the Transformer architecture to graphs in an effort to overcome some known limitations of MP-GNNs. A challenging aspect of designing Graph Transformers is integrating the arbitrary graph structure into the architecture. We propose Graph Diffuser (GD) to address this challenge. GD learns to extract structural and positional relationships between distant nodes in the graph, which it then uses to direct the Transformer's attention and node representations. We demonstrate that existing GNNs and Graph Transformers struggle to capture long-range interactions and how Graph Diffuser does so while admitting intuitive visualizations. Experiments on eight benchmarks show Graph Diffuser to be a highly competitive model, outperforming the state of the art in a diverse set of domains.

1. INTRODUCTION

Graph Neural Networks have seen increasing popularity as a versatile tool for graph representation learning, with applications in a wide variety of domains such as protein design (e.g., Ingraham et al. (2019)) and drug development (e.g., Gaudelet et al. (2020)). The majority of Graph Neural Networks (GNNs) operate by stacking multiple local message passing layers Gilmer et al. (2017), in which nodes update their representation by aggregating information from their immediate neighbors Li et al. (2016); Kipf & Welling (2017); Hamilton et al. (2017); Veličković et al. (2018); Wu et al. (2019a); Xu et al. (2019b). In recent years, several limitations of GNNs have been observed by the community. These include under-reaching Barceló et al. (2020), over-smoothing Wu et al. (2020), and over-squashing Alon & Yahav (2021); Topping et al. (2022). Over-smoothing manifests as the representations of well-connected nodes becoming indistinguishable after sufficiently many layers, while over-squashing occurs when distant nodes fail to communicate effectively because an exponentially growing number of messages must be compressed into a fixed-sized vector. Even prior to the formalization of these limitations, it was clear that going beyond local aggregation is essential for certain problems Atwood & Towsley (2016); Klicpera et al. (2019).

Since their first appearance in natural language processing, Transformers have been applied with great success to domains such as computer vision Han et al. (2022), robotic control Kurin et al. (2020), and biological sequence modeling Rives et al. (2021). They improve on previous models' expressivity and efficiency by replacing local inductive biases with the global communication of the attention mechanism. Following this trend, the Transformer has been studied extensively in recent years as a way to combat the GNN issues mentioned above.
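To make the local aggregation scheme (and the under-reaching it causes) concrete, the following is a minimal sketch of one mean-aggregation message-passing layer. The function name and the identity weights are illustrative, not the parameterization of any specific model from the literature.

```python
import numpy as np

def message_passing_layer(A: np.ndarray, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One MP-GNN layer: each node averages its neighbors' features
    (plus its own via a self-loop), then applies a shared linear map and ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)    # node degrees (including self-loop)
    H = (A_hat / deg) @ X                     # mean aggregation over the neighborhood
    return np.maximum(H @ W, 0.0)             # linear transform + ReLU

# Tiny example: a path graph 0-1-2 with one-hot node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)                                  # one-hot features
W = np.eye(3)                                  # identity weights, for illustration
H = message_passing_layer(A, X, W)
# After one layer, node 0 carries zero weight on node 2: information from a
# node at distance 2 simply has not arrived yet (under-reaching).
```

Stacking k such layers is required before a node can see information k hops away, which is precisely where the limitations discussed above arise.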
Graph Transformers (GTs) usually integrate the input graph into the architecture by encoding node structural and positional information as features or by modulating the attention between nodes based on their relationships within the graph. However, given the arbitrary structure of graphs, incorporating the input into the Transformer remains a challenging aspect of designing GTs, and so far there has been no universal solution. We propose a simple architecture for incorporating structural data into the Transformer, Graph Diffuser (GD). The intuition that guides us is that while the aggregation scheme of GNNs is limited, the information-propagation patterns it induces still capture valuable structural relationships between nodes. Figure 1 shows an overview of our approach. We start with the graph structure (shown on the left) and construct Virtual Edges (middle) that capture the propagation of information between nodes over multiple propagation steps. This allows our approach to zoom out of local message passing and relate distant nodes that have no direct connection in the original graph. We then use the virtual edges to direct the Transformer's attention (shown on the right) and node representations. In the following sections, we first show, using a seemingly trivial problem, that existing GNNs and Graph Transformers struggle to model long-range interactions. We then define Graph Diffuser and show how it solves the same problem. Finally, we demonstrate the effectiveness of our approach on eight benchmarks spanning multiple domains, where it outperforms the state of the art with no hyperparameter tuning.
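The virtual-edge idea described above can be sketched as follows: combine several propagation steps (here, powers of the row-normalized adjacency matrix) into a weighted "virtual edge" matrix, and use it to shift the attention logits. The mixing weights `theta` and the additive-bias form are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def virtual_edges(A: np.ndarray, theta: list) -> np.ndarray:
    """Weighted sum of propagation steps: sum_k theta[k] * T^k, where T is
    the row-normalized adjacency (random-walk transition matrix)."""
    T = A / np.clip(A.sum(axis=1, keepdims=True), 1, None)  # guard isolated nodes
    E, P = np.zeros_like(T), np.eye(A.shape[0])             # P tracks T^k
    for t in theta:
        E += t * P
        P = P @ T
    return E

def diffused_attention(X, A, theta, Wq, Wk, Wv):
    """Scaled dot-product attention whose logits are shifted by the virtual-edge
    weights, so structurally related distant nodes attend to each other even
    without a direct edge in the original graph."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1]) + virtual_edges(A, theta)
    return softmax(logits, axis=-1) @ V

# On a path graph 0-1-2, two propagation steps (theta = [0, 0, 1]) already
# create a virtual edge between nodes 0 and 2, which share no direct edge.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
E = virtual_edges(A, [0.0, 0.0, 1.0])
```

Making `theta` learnable (per layer, or per attention head) is what would let such a model decide how far to "zoom out" for each task.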




Figure 1: Illustration of our positional attention, focusing on node B. Information propagation from multiple propagation steps is combined to create Virtual Edges (colored) between distant nodes, which then direct the Transformer's attention in each layer.

2. RELATED WORK

Message Passing Graph Neural Networks (MP-GNNs). MP-GNNs Gori et al. (2005); Scarselli et al. (2008) have been the predominant method for graph representation learning in recent years and have been applied to a wide variety of domains (e.g., Kosaraju et al. (2019); Nathani et al. (2019); Wang et al. (2019); Huang & Carley (2019); Yang et al. (2020); Ma et al. (2020); Wu et al. (2020); Zhang et al. (2020)). MP-GNNs update node representations by stacking multiple layers in which each node aggregates information from its local neighborhood Li et al. (2016); Kipf & Welling (2017); Veličković et al. (2018); Wu et al. (2019a); Xu et al. (2019b); Hamilton et al. (2017); Xu et al. (2019a). However, as mentioned above, they suffer from under-reaching Barceló et al. (2020), over-smoothing Wu et al. (2020), and over-squashing Alon & Yahav (2021); Topping et al. (2022). Several works have addressed the problem of over-squashing: Gilmer et al. (2017) add "virtual edges" to shorten long distances, and Scarselli et al. (2008) add "supersource nodes". None of these works, however, integrate such virtual nodes or edges into the Transformer. Another line of work uses the attention mechanism Veličković et al. (2018); Brody et al. (2022) to dynamically propagate information in the graph rather than using the original adjacency matrix or Laplacian.

Graph Transformers (GTs). Considering their successes in natural language understanding Vaswani et al. (2017); Kalyan et al. (2021), computer vision d'Ascoli et al. (2021); Han et al. (2022); Guo et al. (2021), robotic control Kurin et al. (2020), and a variety of other domains, there have been numerous attempts to apply Transformers to graphs. These works add positional encodings as node features, similar to the encoding in Vaswani et al. (2017), or use relative positioning to bias the attention between nodes, similar to Shaw et al. (2018). Others combine the Transformer with

