MULTI-HOP ATTENTION GRAPH NEURAL NETWORKS

Abstract

The self-attention mechanism in graph neural networks (GNNs) has led to state-of-the-art performance on many graph representation learning tasks. Currently, at every layer, attention is computed between connected pairs of nodes and depends solely on the representations of the two nodes. However, such an attention mechanism does not account for nodes that are not directly connected but provide important network context, which could lead to improved predictive performance. Here we propose Multi-hop Attention Graph Neural Network (MAGNA), a principled way to incorporate multi-hop context information into attention computation, enabling long-range interactions at every layer of the GNN. To compute attention between nodes that are not directly connected, MAGNA diffuses the attention scores across the network, which increases the "receptive field" of every layer of the GNN. Unlike previous approaches, MAGNA uses a diffusion prior on attention values to efficiently account for all paths between a pair of disconnected nodes. This helps MAGNA capture large-scale structural information in every layer and learn more informative attention. Experimental results on node classification as well as knowledge graph completion benchmarks show that MAGNA achieves state-of-the-art results: MAGNA achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. MAGNA also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion, MAGNA advances the state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.

1. INTRODUCTION

The introduction of the self-attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017) has pushed the state-of-the-art in many domains, including graph representation learning (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019a; Lan et al., 2019). Graph Attention Network (GAT) (Veličković et al., 2018) and related models (Li et al., 2018; Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) developed an attention mechanism for Graph Neural Networks (GNNs), which computes attention scores between nodes connected by an edge, allowing the model to attend to messages of a node's direct neighbors according to their attention scores. However, computing attention only on pairs of nodes connected by edges implies that a node can attend only to its immediate neighbors when computing its (next-layer) representation. The receptive field of a single GNN layer is therefore restricted to one-hop network neighborhoods. Although stacking multiple GAT layers could in principle enlarge the receptive field and learn non-neighboring interactions, such deep GAT architectures suffer from the oversmoothing problem (Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) and do not perform well in practice. Furthermore, edge attention weights in a single GAT layer are based solely on the representations of the two nodes at the edge endpoints and do not depend on their graph neighborhood context. In other words, the one-hop attention mechanism in GATs limits their ability to explore the relationship between the broader graph structure and the attention weights. While previous works (Xu et al., 2018; Klicpera et al., 2019b) have shown advantages in performing multi-hop message passing in a single layer, these approaches are not graph-attention based. Therefore, incorporating multi-hop neighborhood context into the attention computation in graph neural networks has not been explored.
Here we present Multi-hop Attention Graph Neural Network (MAGNA), an effective and efficient multi-hop self-attention mechanism for graph-structured data. MAGNA uses a novel graph attention diffusion layer (Figure 1), where we first compute attention weights on edges (represented by solid arrows), and then compute self-attention weights (dotted arrows) between disconnected pairs of nodes through an attention diffusion process using the attention weights on the edges.
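To make the diffusion step concrete, the sketch below illustrates one way such an attention diffusion could be computed: starting from a row-stochastic one-hop attention matrix A, multi-hop attention is formed as a weighted sum of powers of A, here with geometrically decaying weights θ_k = α(1 − α)^k truncated at K hops. The function name, the specific weighting scheme, and the truncation parameter are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def attention_diffusion(A, alpha=0.15, K=6):
    """Diffuse a one-hop attention matrix A (n x n, row-stochastic)
    into a multi-hop attention matrix.

    Illustrative sketch: uses personalized-PageRank-style weights
    theta_k = alpha * (1 - alpha)**k over powers A^k, truncated at K hops,
    so nodes up to K hops apart receive nonzero attention.
    """
    n = A.shape[0]
    A_hop = np.eye(n)                 # A^0 (self-attention term)
    A_diff = np.zeros_like(A, dtype=float)
    for k in range(K + 1):
        A_diff += alpha * (1 - alpha) ** k * A_hop
        A_hop = A_hop @ A             # advance to the next power A^(k+1)
    return A_diff
```

For example, on a 3-node path graph 0–1–2, the one-hop attention between nodes 0 and 2 is zero, but the diffused attention is positive, since the 2-hop path through node 1 contributes at k = 2. This is the sense in which diffusion enlarges the receptive field of a single layer.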

