DIFFUSING GRAPH ATTENTION

Abstract

The dominant paradigm for machine learning on graphs uses Message Passing Graph Neural Networks (MP-GNNs), in which node representations are updated by aggregating information from their local neighborhood. Recently, there have been increasingly many attempts to adapt the Transformer architecture to graphs in an effort to solve some known limitations of MP-GNNs. A challenging aspect of designing Graph Transformers is integrating the arbitrary graph structure into the architecture. We propose Graph Diffuser (GD) to address this challenge. GD learns to extract structural and positional relationships between distant nodes in the graph, which it then uses to direct the Transformer's attention and node representations. We demonstrate that existing GNNs and Graph Transformers struggle to capture long-range interactions, and show how Graph Diffuser does so while admitting intuitive visualizations. Experiments on eight benchmarks show Graph Diffuser to be a highly competitive model, outperforming the state-of-the-art in a diverse set of domains.

1. INTRODUCTION

Graph Neural Networks (GNNs) have seen increasing popularity as a versatile tool for graph representation learning, with applications in a wide variety of domains such as protein design (e.g., Ingraham et al., 2019) and drug development (e.g., Gaudelet et al., 2020). The majority of GNNs operate by stacking multiple local message-passing layers (Gilmer et al., 2017), in which nodes update their representations by aggregating information from their immediate neighbors (Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Wu et al., 2019a; Xu et al., 2019b). In recent years, several limitations of GNNs have been observed by the community, including under-reaching (Barceló et al., 2020), over-smoothing (Wu et al., 2020), and over-squashing (Alon & Yahav, 2021; Topping et al., 2022). Over-smoothing manifests as the representations of well-connected nodes becoming indistinguishable after sufficiently many layers, while over-squashing occurs when distant nodes fail to communicate effectively because an exponentially growing number of messages must be compressed into a fixed-size vector. Even prior to the formalization of these limitations, it was clear that going beyond local aggregation is essential for certain problems (Atwood & Towsley, 2016; Klicpera et al., 2019).

Since their first appearance in natural language processing, Transformers have been applied with great success to domains such as computer vision (Han et al., 2022), robotic control (Kurin et al., 2020), and biological sequence modeling (Rives et al., 2021). They improve previous models' expressivity and efficiency by replacing local inductive biases with the global communication of the attention mechanism. Following this trend, the Transformer has been studied extensively in recent years as a way to combat the GNN issues mentioned above. Graph Transformers (GTs) usually integrate the input graph into the architecture by encoding structural and positional node information as features or by modulating the attention between nodes based on their relationships within the graph. However, given the arbitrary structure of graphs, incorporating the input graph into the Transformer remains a challenging aspect of designing GTs, and so far there has been no universal solution. We propose Graph Diffuser (GD), a simple architecture for incorporating structural data into the Transformer. The intuition that guides us is that while the aggregation scheme of GNNs is limited, the
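The contrast between local message passing and global attention described above can be illustrated with a minimal toy sketch (not the paper's model, and not an implementation of Graph Diffuser): one mean-aggregation message-passing step versus one self-attention step on a small path graph. After a single local step, an endpoint node has only mixed with its immediate neighbor, whereas a single attention step mixes information from every node regardless of graph distance.

```python
import numpy as np

def message_passing_step(X, A, W):
    """Each node averages its neighbors' features (plus itself), then applies W."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # degree of each node (with self-loop)
    return (A_hat / deg) @ X @ W            # local, one-hop aggregation

def attention_step(X, Wq, Wk, Wv):
    """Every node attends to every other node, regardless of graph distance."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax over all nodes
    return weights @ V

n, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
# Path graph 0-1-2-3-4-5: adjacency has ones on the off-diagonals.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

# One message-passing step: node 0 has only seen nodes {0, 1}, so reaching
# node 5 would require 5 stacked layers (under-reaching). One attention
# step already mixes information from all 6 nodes.
H_local = message_passing_step(X, A, rng.normal(size=(d, d)))
H_global = attention_step(X, rng.normal(size=(d, d)),
                          rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(H_local.shape, H_global.shape)
```

The weight matrices and the path graph here are arbitrary illustrations; the point is only that the receptive field of the local update grows one hop per layer, while attention is global in a single layer.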

