DIFFUSING GRAPH ATTENTION

Abstract

The dominant paradigm for machine learning on graphs uses Message Passing Graph Neural Networks (MP-GNNs), in which node representations are updated by aggregating information from their local neighborhood. Recently, there have been increasingly many attempts to adapt the Transformer architecture to graphs in an effort to solve some known limitations of MP-GNNs. A challenging aspect of designing Graph Transformers is integrating the arbitrary graph structure into the architecture. We propose Graph Diffuser (GD) to address this challenge. GD learns to extract structural and positional relationships between distant nodes in the graph, which it then uses to direct the Transformer's attention and node representations. We demonstrate that existing GNNs and Graph Transformers struggle to capture long-range interactions and show how Graph Diffuser does so while admitting intuitive visualizations. Experiments on eight benchmarks show Graph Diffuser to be a highly competitive model, outperforming the state-of-the-art in a diverse set of domains.

1. INTRODUCTION

Graph Neural Networks have seen increasing popularity as a versatile tool for graph representation learning, with applications in a wide variety of domains such as protein design (e.g., Ingraham et al. (2019)) and drug development (e.g., Gaudelet et al. (2020)). The majority of Graph Neural Networks (GNNs) operate by stacking multiple local message passing layers Gilmer et al. (2017), in which nodes update their representation by aggregating information from their immediate neighbors Li et al. (2016); Kipf & Welling (2017); Hamilton et al. (2017); Veličković et al. (2018); Wu et al. (2019a); Xu et al. (2019b). In recent years, several limitations of GNNs have been observed by the community. These include under-reaching Barceló et al. (2020), over-smoothing Wu et al. (2020), and over-squashing Alon & Yahav (2021); Topping et al. (2022). Over-smoothing manifests as the representations of well-connected nodes becoming indistinguishable after sufficiently many layers, while over-squashing occurs when distant nodes do not communicate effectively because an exponentially growing number of messages must be compressed into a fixed-sized vector. Even prior to the formalization of these limitations, it was clear that going beyond local aggregation is essential for certain problems Atwood & Towsley (2016); Klicpera et al. (2019). Since their first appearance in natural language processing, Transformers have been applied to domains such as computer vision Han et al. (2022), robotic control Kurin et al. (2020), and biological sequence modeling Rives et al. (2021) with great success. They improve previous models' expressivity and efficiency by replacing local inductive biases with the global communication of the attention mechanism. Following this trend, the Transformer has been studied extensively in recent years as a way to combat the issues mentioned above with GNNs.
Graph Transformers (GTs) usually integrate the input graph into the architecture by encoding node structural and positional information as features or by modulating the attention between nodes based on their relationships within the graph. However, given the arbitrary structure of graphs, incorporating the input into the Transformer remains a challenging aspect of designing GTs, and so far, there has been no universal solution. We propose a simple architecture for incorporating structural data into the Transformer: Graph Diffuser (GD). The intuition that guides us is that while the aggregation scheme of GNNs is limited, the propagation of information along the graph structure provides a valuable inductive bias for learning on graphs. Figure 1 shows an overview of our approach. We start with the graph structure (as shown on the left) and construct Virtual Edges (middle) that capture the propagation of information between nodes at multiple propagation steps. This allows our approach to zoom out of the local message passing and relate distant nodes that do not have a direct connection in the original graph. We then use the virtual edges to direct the Transformer's attention (shown on the right) and node representations. In the following sections, we first show that existing GNNs and Graph Transformers struggle to model long-range interactions, using a seemingly trivial problem. We follow by defining Graph Diffuser and then show how it solves the same problem. Finally, we demonstrate the effectiveness of our approach on eight benchmarks spanning multiple domains by showing it outperforms the state-of-the-art with no hyperparameter tuning.

2. RELATED WORK

Several works have addressed the problem of over-squashing. None of these works, however, combines information from multiple different propagation steps. To the best of our knowledge, this work is the first Graph Transformer to: (1) learn to construct a new adjacency matrix using node and edge features to generate positional or relative encoding, and (2) learn to combine information propagation over multiple different propagation steps in an end-to-end manner.

3. WHY DO WE NEED ANOTHER GRAPH TRANSFORMER?

Given the successful application of Transformers to other domains and the flurry of recent Graph Transformers, it is natural to ask: why is there a need for another Graph Transformer?

4. GRAPH DIFFUSER

We now describe our approach for taking a graph G = (X, A), with a node embeddings matrix X and edges A, and processing it using the Transformer, which usually takes only a single matrix X as input. Our approach consists of two stages, illustrated in the left and right halves of Figure 3. First, GD embeds the structural relations between distant nodes in what we refer to as Virtual Edges. Then, the Transformer processes the nodes while using the virtual edges to direct the computation at different layers.

4.1. VIRTUAL EDGES

Virtual Edges are high-dimensional representations constructed between distant nodes in the graph. They contain rich information on the structural and positional relationships between those nodes.

4.1.1. POWERS OF THE ADJACENCY MATRIX

To consider relations between distant nodes, the first step is broadening the receptive field on which the architecture operates. In a row-normalized adjacency matrix A, A^k_ij corresponds to the probability of getting from node i to node j in a k-step random walk. We stack different powers of A into a 3-dimensional tensor E ∈ R^{n×n×k}:

E = [I | A | A^2 | ... | A^k]   (1)

where k is the number of stacks, | denotes stacking matrices along the third dimension, and I is the identity matrix. Multiplication by the adjacency matrix is closely related to the aggregation scheme of MP-GNNs Xu et al. (2018); however, by considering multiple powers of the matrix at once, we zoom out of the local Message Passing paradigm and discover information that GNNs may not detect. Virtual Edges contain structural information such as whether nodes are on an odd-length cycle, can distinguish many non-isomorphic graphs, and cover many distance measures such as shortest distance and generalized PageRank Li et al. (2020b).
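The stacking in equation 1 can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the helper name `virtual_edge_tensor` and the degree guard for isolated nodes are our own assumptions.

```python
import numpy as np

def virtual_edge_tensor(A, k):
    """Stack k random-walk powers [I, P, P^2, ...] along a third axis,
    where P is the row-normalized adjacency matrix. Shape: (n, n, k)."""
    n = A.shape[0]
    # Row-normalize: P[i, j] = probability of stepping from i to j.
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1)          # guard against isolated nodes (assumption)
    stacks = [np.eye(n)]
    for _ in range(k - 1):
        stacks.append(stacks[-1] @ P)   # next random-walk power
    return np.stack(stacks, axis=-1)

# 4-cycle: a 2-step walk from node 0 returns to node 0 with probability 0.5
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
E = virtual_edge_tensor(A, k=3)
print(E[0, 0])  # [1.  0.  0.5]
```

Each fiber E[i, j] is then a k-dimensional "virtual edge" summarizing how information flows from i to j across multiple propagation steps.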

4.1.2. EDGE-WISE FEED-FORWARD NETWORK

In order to mix information between different propagation steps and extract meaningful structural relations, each stack is processed by a fully connected edge-wise feed-forward network. Each layer consists of one hidden layer with batch norm, a ReLU activation, and a residual connection:

Edge-FFN(E_ij) = ReLU(BN(E_ij W_1)) W_2   (2)

E_ij = Edge-FFN(E_ij) + E_ij   (3)

The Edge-Wise FFN consists of two such layers, and we apply batch norm on the input, before the first layer. This network is applied once, before any of the Transformer's layers.
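A minimal sketch of one such layer, applied independently to every virtual edge. For brevity this omits batch norm and reuses the same weights across the two layers, both simplifications of the text above, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_ffn_layer(E, W1, W2):
    """One edge-wise FFN layer with a residual connection (eqs. 2-3),
    broadcast over all (i, j) pairs; batch norm omitted for brevity."""
    h = np.maximum(E @ W1, 0) @ W2      # ReLU(E_ij W1) W2
    return h + E                        # residual connection

n, k = 5, 4
E = rng.standard_normal((n, n, k))
W1 = rng.standard_normal((k, 2 * k)) * 0.1
W2 = rng.standard_normal((2 * k, k)) * 0.1
out = edge_ffn_layer(edge_ffn_layer(E, W1, W2), W1, W2)  # two stacked layers
print(out.shape)  # (5, 5, 4)
```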

4.1.3. WEIGHTED ADJACENCY

Rather than using the original adjacency matrix in equation 1, we found it beneficial to learn a new adjacency matrix using the node and edge features:

Â_ij = ReLU(BN([x_i ; x_j ; e_ij] W_1)) W_2   (4)

A_ij = normalize(σ(Â_ij))   (5)

where ; denotes concatenating vectors, σ is the sigmoid function, and normalize denotes l1 row normalization. x_i, x_j ∈ R^d, e_ij ∈ R^{d_edge}, W_1 ∈ R^{(2d+d_edge)×2d} and W_2 ∈ R^{2d×1}. If there are no edge features, we use only the nodes. This aligns with existing approaches such as GAT (Veličković et al. (2018); Brody et al. (2022)), which find it beneficial to dynamically learn a new adjacency matrix.
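Equations 4-5 can be sketched as below. This is an illustrative NumPy version with batch norm omitted; the function name and the edge-feature dictionary layout are our own assumptions, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_adjacency(X, Eft, edges, W1, W2):
    """Score each existing edge from [x_i ; x_j ; e_ij], squash with a
    sigmoid, then l1-normalize each row (eqs. 4-5, batch norm omitted).
    `edges` is a list of (i, j) pairs; `Eft` maps pairs to edge features."""
    n = X.shape[0]
    A = np.zeros((n, n))
    for i, j in edges:
        z = np.concatenate([X[i], X[j], Eft[(i, j)]])
        A[i, j] = sigmoid(np.maximum(z @ W1, 0) @ W2)
    row = A.sum(axis=1, keepdims=True)
    return A / np.maximum(row, 1e-12)   # l1 row normalization

d, d_edge = 4, 2
X = rng.standard_normal((3, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
Eft = {e: rng.standard_normal(d_edge) for e in edges}
W1 = rng.standard_normal((2 * d + d_edge, 2 * d))
W2 = rng.standard_normal(2 * d)
A_hat = weighted_adjacency(X, Eft, edges, W1, W2)
print(A_hat.sum(axis=1))  # each row sums to 1
```

Because the sigmoid keeps every scored edge strictly positive, row normalization always yields a valid random-walk matrix over the original edge set.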

4.2. INTEGRATING WITH THE TRANSFORMER

Most Graph Transformers use the graph structure either to alter the nodes' representations or to affect the attention mechanism itself. Graph Diffuser combines the two approaches. In the input layer, Self-Virtual Edges are added to the node representations as positional encoding, and at each attention layer, the Virtual Edges are reduced to an attention matrix that is combined with the standard dot-product attention.

4.2.1. ATTENTION

At each of the Transformer attention layers, we linearly project each virtual edge E_ij separately to get the positional attention score between i and j:

Att^Position_ij = E_ij W_p   (6)

We then combine the positional attention and the content (dot-product) attention scores and apply row-wise normalization:

Att^Content_ij = (QK^T / √d)_ij   (7)

Att = normalize(exp(Att^Content) ⊙ σ(Att^Position))   (8)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication. In multi-head attention with h heads, the projection in equation 6 is done h times. Equation 8 can be viewed as scaling the content attention coefficients based on the positional attention. As we will see in the next section, separating the attention into content and positions and then combining them seems to assist with learning meaningful connections in the data and provides a natural visualization mechanism.
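A single-head sketch of equations 6-8 in NumPy. The function name is ours; multi-head projection, biases, and value aggregation are omitted to focus on how the sigmoid-gated positional scores rescale the content softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def gd_attention(X, E, Wq, Wk, Wp):
    """Combine content and positional attention (eqs. 6-8):
    Att = normalize(exp(content) * sigmoid(position))."""
    d = Wq.shape[1]
    Q, K = X @ Wq, X @ Wk
    content = Q @ K.T / np.sqrt(d)                        # (n, n) dot-product scores
    position = E @ Wp                                     # project each virtual edge to a scalar
    scores = np.exp(content - content.max(axis=-1, keepdims=True))  # stable exp
    scores = scores * (1.0 / (1.0 + np.exp(-position)))   # sigmoid gate on positions
    return scores / scores.sum(axis=-1, keepdims=True)    # row-wise normalization

n, d, k = 6, 8, 4
X = rng.standard_normal((n, d))
E = rng.standard_normal((n, n, k))
Att = gd_attention(X, E, rng.standard_normal((d, d)),
                   rng.standard_normal((d, d)), rng.standard_normal(k))
print(np.allclose(Att.sum(axis=1), 1.0))  # True
```

Since the gate is a sigmoid rather than a softmax, positional scores can only dampen or pass content attention, never create probability mass on their own.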

4.3. POSITIONAL ENCODING

Self-Virtual-Edges, E_ii, are Virtual Edges between a node and itself. They contain important structural information, such as whether the node is part of an odd-length cycle and the degrees of adjacent nodes. We project the Self-Virtual-Edges to the node representation's dimensionality and add them as positional encoding:

X_i = X_i + ReLU(E_ii W_pe)

where W_pe ∈ R^{k×d}. Self-walks were shown (Dwivedi et al. (2022a)) to be an effective positional encoding. Adding such features to the node representations allows the Transformer to use structural information in all of its layers, including the feed-forward network.

The GNN baseline, which in theory can solve the task, achieves 0.483 accuracy. This may be due to the over-squashing phenomenon: in a 10 × 10 grid, passing information between opposing nodes in the same row requires 9 message passing layers to deliver a single message that has been "squashed" together with nearly 90k other messages. Next, we observe that using only positional attention (without dot-product attention) performs second best while having the fewest parameters. Finally, Graph Diffuser nearly solves the task entirely by combining both positional and content attention. Looking at Figure 2b, we can see the model learns to use the different attention mechanisms in an intuitive way, with content attention detecting nodes of the same color and position attention detecting nodes in the same row or column.
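The Self-Virtual-Edge positional encoding of Section 4.3 amounts to projecting the diagonal of the virtual-edge tensor and adding it to the node features. A minimal sketch, with a helper name of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_self_virtual_edges(X, E, Wpe):
    """Add projected Self-Virtual Edges E[i, i] (e.g. k-step random-walk
    return probabilities) to each node representation: X_i + ReLU(E_ii W_pe)."""
    n = X.shape[0]
    self_edges = E[np.arange(n), np.arange(n)]     # (n, k) diagonal fibers of E
    return X + np.maximum(self_edges @ Wpe, 0)     # ReLU projection, residual add

n, d, k = 5, 8, 3
X = rng.standard_normal((n, d))
E = rng.standard_normal((n, n, k))
Wpe = rng.standard_normal((k, d))
X_pe = add_self_virtual_edges(X, E, Wpe)
print(X_pe.shape)  # (5, 8)
```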

5.2. BENCHMARKING GRAPH DIFFUSER

Throughout our experiments, we take an existing Graph Transformer architecture and use it "as is" as our Transformer module, without doing any hyperparameter search. That is, we take a Graph Transformer with the same hyperparameters used by the original work and only add our positional attention and encoding. For 5 of the datasets, ZINC, OGBG-{molpcba, ppa, code2}, PCQM-Contact and PCQM4Mv2, we use the very competitive GPS Rampášek et al. (2022) as the Transformer model. For the LRGB benchmarks, Peptides-func and Peptides-struct, there are no official GPS hyperparameters, and we use a Transformer with the authors' baseline hyperparameters and a CLS token added to the input. For a detailed hyperparameter report, see Tables 8 and 9.

5.2.1. RESULTS

We evaluate our model on 8 datasets overall and compare our results with those of popular MP-GNNs, Graph Transformers, and other recent state-of-the-art models. Graph Diffuser exceeds SOTA on 6 of them and achieves very competitive results on the rest. All results for the comparison methods are taken from the original papers or their official leaderboards. ZINC Dwivedi et al. (2020a) is a molecular regression dataset, where the value to be predicted is the molecule's constrained solubility. Table 2 shows the results, where we exceed SOTA results by a substantial margin. Open Graph Benchmark Hu et al. (2020) is a highly competitive benchmark with a variety of datasets. We evaluate GD on three graph-level prediction tasks from different domains and report the results in Table 3. OGBG-MOLPCBA is a multi-task binary classification dataset containing 438k molecules, and the task is to predict the activity/inactivity of 128 properties. Here we rank second highest overall and first among all GNN or Transformer architectures. OGBG-PPA consists of protein-protein interaction networks of different species, where the task is predicting the category of species the network is from. The previously highest-ranking model is GPS, and adding our positional attention and encoding to it proves to be an effective strategy that reaches a new state-of-the-art. OGBG-CODE2 contains Abstract Syntax Trees (ASTs) of Python methods, and models are tasked with predicting the methods' names. The average distance between nodes in this dataset is larger than that of any other dataset in our experiments, which can explain why Graph Transformer variations dominate its leaderboard. Since positional encoding was not found to be significant by the previous three highest-ranking models on this dataset, and considering the large improvement of our model over GPS, it seems that positional attention brings significant benefits in this benchmark, allowing us to reach a new state-of-the-art.
OGB-LSC PCQM4Mv2 Hu et al. (2021) is a large-scale molecular dataset with over 3.7M graphs. Here, our results are similar to GPS, which ranks highest among Graph Transformer variations. Our results lag behind only GEM-2 Liu et al. (2022), which was designed specifically for modeling molecular interactions.

0.367 ± 0.011
GIN Xu et al. (2019a) 0.526 ± 0.051
GAT Veličković et al. (2018) 0.384 ± 0.007
GatedGCN Bresson & Laurent (2017); Dwivedi et al. (2020b) 0.282 ± 0.015
GatedGCN-LSPE Dwivedi et al. (2022a) 0.090 ± 0.001
CRaWl Toenshoff et al. (2021) 0.085 ± 0.004
GIN-AK+ Zhao et al. (2022) 0.080 ± 0.001
SAN Kreuzer et al. (2021) 0.139 ± 0.006
Graphormer Ying et al. (2021) 0.122 ± 0.006
K-Subgraph SAT Chen et al. (2022) 0.094 ± 0.008
GPS Rampášek et al. (2022) 0

6. CONCLUSION

In this work, we introduced a simple and effective architecture for machine learning on graphs, Graph Diffuser. Using a controlled example, we demonstrated its effectiveness in modeling long-range interactions while providing better interpretability. We then showed that this translates to real-world problems by evaluating GD on eight benchmarks from multiple domains. With minimal hyperparameter tuning, Graph Diffuser achieves a new state-of-the-art on most datasets and reaches very competitive results on the rest. In the future, we plan to integrate Graph Diffuser with other promising Graph Transformer compositions, such as Transformers stacked on top of GNNs. Another area of interest is amplifying virtual edges, for example, by considering paths between nodes rather than just information propagation.

Limitations. Our architecture integrates with the Transformer and consequently suffers from the quadratic memory complexity of the attention mechanism, which can restrict its applicability to larger graphs.

REPRODUCIBILITY STATEMENT

The authors support and advocate the principles of open science and reproducible research. We describe Graph Diffuser in detail in the text and figures and mention all relevant hyperparameters to reproduce our experiments in the appendix. Moreover, we will release our code as open-source with clear instructions on how to reproduce all of our results, as well as how to extend our model and apply it to new datasets. 



Omitting bias for clarity.
We remove any positional encoding used by the original work.
https://github.com/rampasek/GraphGPS



Figure 1: Illustration of our positional attention, focusing on node B. Information from multiple propagation steps is combined to create Virtual Edges (colored) between distant nodes, which then direct the Transformer's attention in each layer.

Message Passing Graph Neural Networks (MP-GNNs). MP-GNNs Gori et al. (2005); Scarselli et al. (2008) have been the predominant method for graph representation learning in recent years and have been applied to a wide variety of domains (e.g., Kosaraju et al. (2019); Nathani et al. (2019); Wang et al. (2019); Huang & Carley (2019); Yang et al. (2020); Ma et al. (2020); Wu et al. (2020); Zhang et al. (2020)). MP-GNNs update node representations by stacking multiple layers in which each node aggregates information from its local neighborhood Li et al. (2016); Kipf & Welling (2017); Veličković et al. (2018); Wu et al. (2019a); Xu et al. (2019b); Hamilton et al. (2017); Xu et al. (2019a).

An example of a 5 × 5 grid. Nodes are colored in one out of 20 colors. The goal is to predict, for each node, how many other nodes in the same row or column have the same color. For example, node 23, at the bottom row, has only a single node (node 20) that has the same color (green) and is in the same row or column as node 23. Therefore, its label is 1. The label for node 6 is 2. 2b shows the position and content attention our trained model gives to node 23 in the input graph shown in 2a. The model learns to pay attention to nodes in the same row and column using positional attention and to nodes of the same color using content attention. Their element-wise product selects only the relevant nodes for solving the task.

Figure 2: An example of 2D Grid Histogram Counting with a 5 × 5 grid. 2a shows the original graph. 2b shows the attention patterns our model learns for node 23 in the graph.

Figure 3: An illustration of Graph Diffuser with 2 Transformer layers. Virtual Edges are created by combining multiple powers of the weighted adjacency matrix (left). Virtual Edges modulate each of the Transformer's attention layers (right). Self-Virtual Edges are added to the Transformer input as positional encoding (bottom right).

Other works use the graph structure as a soft bias to the attention. However, all of the works above use the original adjacency matrix or Laplacian to learn positional or relative encodings, unlike this work, which learns the adjacency matrix using node and edge features.

To illustrate the difficulties of current GNNs and Graph Transformers in modeling interactions in a graph, we use a simple synthetic node classification task. Counting the frequencies of tokens in a sequence is an elementary task that Weiss et al. (2021) showed the original Transformer can easily solve. We propose a simple extension of this task to graphs: in Grid Histogram Counting (Figure 2), we generate N × M grids with randomly colored nodes and ask models to predict, for each node, how many other nodes in the same row or column have the same color. This is a contrived problem, but it illustrates the needs of many real-world problems. Solving it requires far-away nodes to communicate, and the communication should consider both the nodes' content (color) and their relation within the graph (being in the same row or column). It is a straightforward generalization of the 1D sequential problem, which the Transformer easily solves, to a 2D graph. However, as we show in Section 5.1.2, it defeats all existing GNN and Graph Transformer techniques, while Graph Diffuser succeeds.
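The labeling rule of Grid Histogram Counting can be made concrete with a short script. This is our own reference implementation of the task definition above, not the paper's data generator.

```python
import numpy as np

def grid_histogram_labels(colors):
    """For each cell of an N x M grid of color ids, count how many OTHER
    cells in the same row or column share its color."""
    n, m = colors.shape
    labels = np.zeros((n, m), dtype=int)
    for i in range(n):
        for j in range(m):
            same_row = np.sum(colors[i, :] == colors[i, j]) - 1  # exclude self
            same_col = np.sum(colors[:, j] == colors[i, j]) - 1  # exclude self
            labels[i, j] = same_row + same_col
    return labels

colors = np.array([[1, 2, 1],
                   [2, 2, 3],
                   [1, 3, 3]])
print(grid_histogram_labels(colors))
```

On this 3 × 3 example, the corner cells of color 1 each see exactly two matches (one in their row, one in their column), so their labels are 2.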

Table 1: Grid Histogram Counting average accuracy and s.d. over 10 different seeds. GNN → Transformer refers to Transformer blocks stacked on top of a GNN (Jain et al. (2021)), and GNN | Transformer refers to combining Transformer and GNN blocks in the same layer (Rampášek et al. (2022)).

Table 1 shows the results. First, we observe that the vanilla Transformer, which easily solves the histogram task over sequential input, fails here and performs the worst. This is not surprising, as the Transformer is oblivious to the graph structure. Next, we observe that RWPE Dwivedi et al. (2022a), a popular positional embedding technique, improves only slightly over the base Transformer. This may be because RWPE is unable to detect symmetries in the graph: nodes 0, 4, 20, and 24 in the example illustrated in Figure 2a will all get the same embedding from RWPE. Adding SignNet Lim et al. (2022), a theoretically expressive positional embedding technique based on the graph eigenvectors, does not yield much improvement either. This may indicate difficulties in learning useful embeddings with eigenvector-based techniques.

MAE for the ZINC dataset from Dwivedi et al. (2020b).

The Long-Range Graph Benchmark (LRGB) Dwivedi et al. (2022b) is a recently proposed benchmark specifically designed to evaluate models on their ability to capture long-range interactions. We are the first, other than the authors' baselines, to evaluate our model on these datasets, and we currently rank first on all of them. Peptides-func and Peptides-struct are multi-label graph classification and regression datasets containing 15.5k peptide molecular graphs. They are of particular interest since their molecules consist, on average, of many more nodes than the other molecular datasets in our experiments. As we can see in Table 5, adding our positional attention and encoding outperforms both the RWSE and LapPE encodings. PCQM-Contact is a molecular dataset with an edge prediction task of predicting whether two atoms interact. Our results are reported in Table 6.

Test results on OGB graph-level benchmarks Hu et al. (2020). Pre-trained or ensemble models are not included. Bold: first, Underlined: second.

Evaluation on PCQM4Mv2 Hu et al. (2021). Since the test set labels are private, we use 150k examples from the train set as our validation and the validation set is treated as the test set.

Evaluation on the recently suggested Peptides-func and Peptides-struct Dwivedi et al. (2022b).

Comparison with the baselines in the PCQM-Contact dataset.

Graph Diffuser hyperparameters for the OGB benchmarks.

Hyperparameters for ZINC and the LRGB datasets.

A. APPENDIX

A.1 EXPERIMENTAL DETAILS

We search over learning rates in {3e-4, 4e-4, 8e-4}, repeat each experiment with ten different random seeds, and report the average and standard deviation of the best configuration. All models have a hidden dimension of 32. We use GatedGCN as the GNN module and four attention heads in Transformer modules. Transformer models use six layers; GNN → Transformer uses 3 GNN layers and 3 Transformer layers. The GNN model has 12 layers to avoid under-reaching. For Graph Diffuser, we found 3 layers sufficient. We use 0.2 dropout for GNN modules and 0.5 dropout for attention. For the Diffuser model, we found no need for dropout.

A.2 THE EFFECTS OF GD ON GRAPH TRANSFORMER VARIATIONS

We conducted a simplified experiment to see the effects of GD when used as an out-of-the-box addition to different Transformer compositions, with no hyperparameter tuning. We evaluated 3 Transformer variations: Transformer, Transformer layers stacked on top of GNN layers (GNN → Transformer), and Transformer interleaved with GNN in the same layer (GNN | Transformer). For each variation, we evaluate the unmodified architecture with no positional embedding or positional attention, the architecture with our positional encoding and attention added but without learning the adjacency matrix (+ Graph Diffuser), and one where we learn a new weighted adjacency matrix (+ Weighted Adjacency), as described in Section 4.1.3. The results are in Table 7. Our implementation builds on GraphGym You et al. (2020). We use a machine with a 24GB NVIDIA A10G GPU, with 32GB or 64GB RAM for the larger datasets (ogbg-ppa, ogbg-code2, PCQM4Mv2), and a shared cluster equipped with NVIDIA TITAN Xp GPUs for the other datasets.

