MULTI-HOP ATTENTION GRAPH NEURAL NETWORKS

Abstract

The self-attention mechanism in graph neural networks (GNNs) has led to state-of-the-art performance on many graph representation learning tasks. Currently, at every layer, attention is computed between connected pairs of nodes and depends solely on the representations of the two nodes. However, such an attention mechanism does not account for nodes that are not directly connected but provide important network context, which could lead to improved predictive performance. Here we propose Multi-hop Attention Graph Neural Network (MAGNA), a principled way to incorporate multi-hop context information into attention computation, enabling long-range interactions at every layer of the GNN. To compute attention between nodes that are not directly connected, MAGNA diffuses the attention scores across the network, which increases the "receptive field" of every layer of the GNN. Unlike previous approaches, MAGNA uses a diffusion prior on attention values to efficiently account for all paths between a pair of disconnected nodes. This helps MAGNA capture large-scale structural information in every layer and learn more informative attention. Experimental results on node classification as well as knowledge graph completion benchmarks show that MAGNA achieves state-of-the-art results: MAGNA achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. MAGNA also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion, MAGNA advances the state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.

1. INTRODUCTION

The introduction of the self-attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017) has pushed the state-of-the-art in many domains, including graph representation learning (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019a; Lan et al., 2019). Graph Attention Network (GAT) (Veličković et al., 2018) and related models (Li et al., 2018; Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) developed attention mechanisms for Graph Neural Networks (GNNs), which compute attention scores between nodes connected by an edge, allowing the model to attend to messages of a node's direct neighbors according to their attention scores. However, computing attention only on pairs of nodes connected by edges implies that a node can attend only to its immediate neighbors when computing its (next layer) representation, so the receptive field of a single GNN layer is restricted to one-hop network neighborhoods. Although stacking multiple GAT layers could in principle enlarge the receptive field and learn non-neighboring interactions, such deep GAT architectures suffer from the oversmoothing problem (Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) and do not perform well in practice. Furthermore, edge attention weights in a single GAT layer are based solely on the representations of the two nodes at the edge endpoints and do not depend on their graph neighborhood context. In other words, the one-hop attention mechanism in GATs limits their ability to explore the relationship between the broader graph structure and the attention weights. While previous works (Xu et al., 2018; Klicpera et al., 2019b) have shown advantages in performing multi-hop message passing in a single layer, these approaches are not graph-attention based. Therefore, incorporating multi-hop neighborhood context into the attention computation of graph neural networks has not been explored.

Here we present Multi-hop Attention Graph Neural Network (MAGNA), an effective and efficient multi-hop self-attention mechanism for graph-structured data. MAGNA uses a novel graph attention diffusion layer (Figure 1), where we first compute attention weights on edges (represented by solid arrows), and then compute self-attention weights (dotted arrows) between disconnected pairs of nodes through an attention diffusion process using the attention weights on the edges. Our model has two main advantages: 1) MAGNA captures long-range interactions between nodes that are not directly connected but may be multiple hops away, enabling effective long-range message passing from important nodes multiple hops away. 2) The attention computation in MAGNA is context-dependent. The attention value in GATs (Veličković et al., 2018) depends only on the node representations of the previous layer and is zero between disconnected pairs of nodes. In contrast, for any pair of nodes within a chosen multi-hop neighborhood, MAGNA computes attention by aggregating the attention scores over all possible paths (of length ≥ 1) connecting the two nodes. Theoretically, we demonstrate that MAGNA places a Personalized PageRank (PPR) prior on the attention values. We further apply spectral graph analysis to show that MAGNA has the capability of emphasizing large-scale graph structure and lowering high-frequency noise in graphs.
Specifically, MAGNA enlarges the lower Laplacian eigenvalues, which correspond to the large-scale structure of the graph, and suppresses the higher Laplacian eigenvalues, which correspond to noisier and more fine-grained information in the graph. We experiment on standard datasets for semi-supervised node classification as well as knowledge graph completion. Experiments show that MAGNA achieves state-of-the-art results: MAGNA achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. MAGNA also obtains better performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion, MAGNA advances the state-of-the-art on WN18RR and FB15k-237 across four metrics, with the largest gain of 7.1% in Hits@1. Furthermore, we show that MAGNA with just 3 layers and 6-hop attention per layer significantly outperforms GAT with 18 layers, even though both architectures have the same receptive field. Moreover, our ablation study reveals the synergistic effect of the essential components of MAGNA, including layer normalization and multi-hop diffused attention. We further observe that, compared to GAT, the attention values learned by MAGNA have higher diversity, indicating an ability to better focus on important nodes.
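To make the attention diffusion idea concrete, the following is a minimal sketch (not the authors' reference implementation). It assumes a dense, row-stochastic one-hop attention matrix att with zeros for non-edges, and diffuses it with Personalized-PageRank-style hop weights alpha * (1 - alpha)^k, truncated after a fixed number of hops; the function name attention_diffusion and the defaults num_hops = 6 and alpha = 0.1 are illustrative choices, not values prescribed by the paper.

```python
import torch

def attention_diffusion(att: torch.Tensor, num_hops: int = 6, alpha: float = 0.1) -> torch.Tensor:
    """Diffuse one-hop attention scores into multi-hop attention (illustrative sketch).

    Computes the truncated series  sum_{k=0}^{num_hops} alpha * (1 - alpha)^k * att^k,
    a Personalized-PageRank-style weighting over path lengths k. att is assumed to be
    a dense [N, N] one-hop attention matrix with zeros for non-edges.
    """
    n = att.size(0)
    diffused = alpha * torch.eye(n, device=att.device, dtype=att.dtype)  # k = 0 (self) term
    power = att                                                          # att^1
    for k in range(1, num_hops + 1):
        diffused = diffused + alpha * (1.0 - alpha) ** k * power         # add k-hop paths
        power = power @ att                                              # att^(k+1)
    return diffused

# Usage: aggregate node features h with diffused (multi-hop) attention instead of
# the one-hop attention used by GAT:  h_next = attention_diffusion(att) @ h
```

Truncating the series keeps the computation tractable while still growing the per-layer receptive field to num_hops hops; dense matrix powers are used here only for clarity, whereas a practical implementation would work with sparse attention and an iterative update.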

2. MULTI-HOP ATTENTION GRAPH NEURAL NETWORK (MAGNA)

We first discuss the necessary background and then explain MAGNA's novel multi-hop attention diffusion module and its overall model architecture.

2.1. PRELIMINARIES

Let G = (V, E) be a given graph, where V is the set of N_n nodes and E ⊆ V × V is the set of N_e edges connecting pairs of nodes in V. Each node v ∈ V and each edge e ∈ E are associated with their type mapping functions φ : V → T and ψ : E → R, where T and R denote the sets of node types (labels) and edge/relation types, respectively. Our framework supports learning on heterogeneous graphs with multiple elements in R. A general Graph Neural Network (GNN) approach learns embeddings that map nodes and/or edge types into a continuous vector space. Let X ∈ R^{N_n × d_n} and R ∈ R^{N_r × d_r} be the node embedding and edge/relation-type embedding matrices, where N_n = |V|, N_r = |R|, and d_n and d_r denote the embedding dimensions of node and edge/relation types; each row x_i = X[i, :] is the embedding of node v_i (1 ≤ i ≤ N_n), and r_j = R[j, :] is the embedding of relation r_j (1 ≤ j ≤ N_r). MAGNA builds on GNNs while bringing together the benefits of graph attention and diffusion techniques. The core of MAGNA is multi-hop attention diffusion, a principled way to learn attention between any pair of nodes in a scalable way, taking into account multi-hop neighborhood context.
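As a concrete, hypothetical instantiation of the notation above, the sketch below sets up the node and relation-type embedding matrices X and R as learnable parameters in PyTorch; the sizes N_n, N_r, d_n, and d_r are arbitrary example values rather than values used in the paper.

```python
import torch
import torch.nn as nn

# Example sizes (assumptions for illustration only).
N_n, N_r = 2708, 4     # number of nodes and relation types
d_n, d_r = 128, 64     # node and relation-type embedding dimensions

X = nn.Parameter(torch.randn(N_n, d_n))  # node embedding matrix; row X[i] is x_{i+1}
R = nn.Parameter(torch.randn(N_r, d_r))  # relation-type embedding matrix; row R[j] is r_{j+1}

x_1 = X[0]  # embedding of node v_1
r_1 = R[0]  # embedding of relation r_1
```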



Figure 1: Multi-hop attention diffusion. Consider making a prediction at nodes A and D. Left: A single GAT layer only computes attention scores α between directly connected pairs of nodes (i.e., edges), and thus α_{D,C} = 0. Furthermore, the attention α_{A,B} between A and B depends only on their node representations. Right: A single MAGNA layer is able to (1) capture the information of the two-hop neighbor C of D via the diffused multi-hop attention α_{D,C}, and (2) enhance graph structure learning by considering all paths between nodes via diffused attention based on powers of the graph adjacency matrix.
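As a toy numeric illustration of the figure, consider four nodes A, B, C, D with a made-up one-hop attention matrix (the topology and values below are assumptions for illustration and need not match the exact example drawn in Figure 1): C is a two-hop neighbor of D via B, so the GAT-style attention from D to C is zero, while a two-hop truncated attention diffusion assigns it a nonzero weight.

```python
import torch

# Nodes [A, B, C, D] with undirected edges A-B, B-C, B-D (made-up example).
# One-hop attention: each node attends uniformly over its direct neighbors.
att = torch.tensor([
    [0.0, 1.0, 0.0, 0.0],   # A attends to B
    [1/3, 0.0, 1/3, 1/3],   # B attends to A, C, D
    [0.0, 1.0, 0.0, 0.0],   # C attends to B
    [0.0, 1.0, 0.0, 0.0],   # D attends to B
])

alpha = 0.1
# Two-hop truncated diffusion: alpha*I + alpha*(1-alpha)*att + alpha*(1-alpha)^2 * (att @ att)
diffused = alpha * torch.eye(4) + alpha * (1 - alpha) * att + alpha * (1 - alpha) ** 2 * (att @ att)

print(att[3, 2].item())       # 0.0     -> one-hop (GAT-style) attention D -> C is zero
print(diffused[3, 2].item())  # ~0.027  -> diffused attention D -> C is nonzero via the path D -> B -> C
```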

