MEMORY-AUGMENTED DESIGN OF GRAPH NEURAL NETWORKS
Anonymous authors
Paper under double-blind review

Abstract

The expressive power of graph neural networks (GNNs) has drawn much interest recently. Most existing work has measured the expressiveness of GNNs through the task of distinguishing between graphs. In this paper, we inspect the representation limits of the locally unordered message passing (LUMP) GNN architecture through the lens of node classification. For GNNs based on permutation-invariant local aggregators, we characterize graph-theoretic conditions under which such GNNs fail to discriminate simple instances, regardless of the underlying architecture or network depth. To overcome this limitation, we propose a novel framework, called memory augmentation, that augments GNNs with global graph information. Specifically, we allow every node in the original graph to interact with a group of memory nodes; for each node, information from all other nodes in the graph can be gleaned through the relay of the memory nodes. For proper backbone architectures such as GAT and GCN, memory-augmented GNNs are theoretically shown to be more expressive than LUMP GNNs. Empirical evaluations demonstrate the significant improvement brought by memory augmentation. In particular, memory-augmented GAT and GCN either outperform or closely match state-of-the-art performance across various benchmark datasets.

1. INTRODUCTION

Graph neural networks (GNNs) are a powerful tool for learning with graph-structured data, and have achieved great success on problems such as node classification (Kipf & Welling, 2016), graph classification (Duvenaud et al., 2015), and link prediction (Grover & Leskovec, 2016). GNNs typically follow a recursive neighborhood aggregation (or message passing) scheme (Xu et al., 2019): within each aggregation step, each node collects information (usually feature vectors) from its neighborhood, then applies aggregation and combination mechanisms to compute its new feature vector. GNN architectures typically differ in their design of these aggregation and combination mechanisms; popular architectures such as GCN (Kipf & Welling, 2016), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2018) fall into this paradigm.

Despite their empirical success, GNNs that update node features based only on local information suffer from several limitations. One important issue is their limited expressive power. In the graph classification setting (Xu et al., 2019), it was shown that message passing neural networks are at most as powerful as the Weisfeiler-Lehman graph isomorphism test. A more recent line of work has suggested using variants of the message passing scheme that incorporate the layout of local neighborhoods (Sato et al., 2019; Klicpera et al., 2020) or spatial information of the graph (You et al., 2019). Another problem is the phenomenon that the performance of GNNs does not improve, or even degrades, as the number of layers increases (Kipf & Welling, 2016; Xu et al., 2018; Li et al., 2018; Oono & Suzuki, 2020); this problem, known as over-smoothing, makes extending the receptive field of message passing GNNs a difficult task.
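To make the neighborhood aggregation scheme concrete, one propagation step of a GCN-style layer can be sketched in plain Python. This is our illustrative sketch, not the original implementations: the learnable weight matrix and nonlinearity are omitted, and we assume the usual self-loop plus symmetric degree normalization.

```python
import math

def gcn_layer(adj, feats):
    """One GCN-style propagation step: each node averages its own and its
    neighbours' features with symmetric degree normalization.
    Weights and nonlinearity are omitted; this is a sketch only."""
    n = len(adj)
    deg = [sum(row) + 1 for row in adj]  # +1 accounts for the added self-loop
    out = []
    for v in range(n):
        # self-loop term: 1 / deg_v
        acc = [x / math.sqrt(deg[v] * deg[v]) for x in feats[v]]
        for u in range(n):
            if adj[v][u]:
                # neighbour term: 1 / sqrt(deg_v * deg_u)
                norm = math.sqrt(deg[v] * deg[u])
                acc = [a + x / norm for a, x in zip(acc, feats[u])]
        out.append(acc)
    return out

# 4-cycle with scalar node features
adj = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
feats = [[1.0], [0.0], [1.0], [0.0]]
print(gcn_layer(adj, feats))  # each node blends its own and neighbours' values
```

On this 4-cycle every node has normalized degree 3, so node 0's output is 1/3 (its own feature) while node 1's is 2/3 (two neighbours with feature 1.0), illustrating how one step already smooths features across the neighborhood.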
Many successful GNN architectures stack only a small number of layers, e.g., two or three (Kipf & Welling, 2016), which can be viewed as an implicit inductive bias that node labels are determined by neighborhoods at most a few hops away. However, this assumption may not hold for much real-world data; for example, structurally similar nodes may offer strong predictive power even for very distant node pairs (Donnat et al., 2018). Several techniques have been proposed for aggregating node information from a wider range (Xu et al., 2018; Klicpera et al., 2019a;b). In this paper, we investigate the expressive power of GNNs through the task of node classification. We characterize cases where GNNs built on the LUMP protocol fail, regardless of the underlying implementation or aggregation range. We then propose a novel architecture that aggregates information beyond the local neighborhood. By making use of global feature information, it can distinguish a wide range of cases on which LUMP-type GNNs inherently fail. Our main contributions are summarized as follows:
• We discuss the expressive power of GNNs for node classification tasks under the premise that node labels are not solely determined by first-order neighborhood information, and exhibit an indistinguishable scenario where LUMP algorithms fail to discriminate nodes in structurally different graphs even if infinitely many rounds of message passing are performed.
• We develop a novel framework, called memory augmentation, that extends GNNs with global graph information, motivated by memory networks (Graves et al., 2014; Weston et al., 2014). With a proper choice of backbone architecture such as GAT or GCN, the augmented architectures are provably more expressive than LUMP-type GNNs, in that they discriminate a wide range of cases on which LUMP-type GNNs fail, using a compact architecture of only two layers.
• We derive two representative memory-augmented architectures, MemGAT and MemGCN, and evaluate their performance on standard datasets.
Empirical results show that the memory-augmented architectures significantly improve upon their corresponding backbone architectures across all tasks, either outperforming or closely matching state-of-the-art performance.

2. REPRESENTATION LIMITS OF LOCALLY UNORDERED MESSAGE PASSING

In this paper we consider the task of node classification over an undirected graph G = (V, E) with node set V and edge set E. Let N = |V| be the number of nodes, and let A and D be the associated adjacency and degree matrices. For each node v ∈ V, let N_v = {u | (u, v) ∈ E} be its neighborhood set and X_v ∈ 𝒳 ⊂ R^d its associated feature vector. Each node v ∈ V is also associated with a label Y_v. Node classification algorithms predict Y_v based on the information given by G and the node feature matrix X. In this paper we are interested in situations where node labels are not determined solely by first-order neighborhood information, i.e.,

P(Y_v | G, X) ≠ P(Y_v | X_v, {X_u : u ∈ N_v}), ∀v ∈ V.

For a collection of elements C that are not necessarily distinct, we use {C} to denote its set representation and {{C}} to denote its multiset representation. For each c ∈ {C}, let r_C(c) be the multiplicity of c in {{C}}.

A popular tool for encoding higher-order graph information is the locally unordered message passing (LUMP) protocol (Garg et al., 2020) for building GNNs. For a node v, its (hidden) representation h_v is updated using an aggregate-and-combine strategy:

h_v^{(l)} = COMBINE( h_v^{(l-1)}, AGG({{ h_u^{(l-1)} : u ∈ N_v }}) ).   (1)

The protocol is unordered in the sense that no spatial information (such as the relative orientation of neighbors) is used throughout the message passing procedure, and the aggregator AGG is chosen to be a permutation-invariant function. After k rounds of message passing, each node obtains a feature vector that encodes the information of its rooted subtree of height k. Aggregation strategies that extend to arbitrary nodes were suggested in pioneering works on GNNs (Scarselli et al., 2009), which use a learnable, contractive aggregator and perform infinite rounds of message passing until convergence. Next we discuss the expressive power of the above mechanisms.
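One round of the update in Eq. (1) can be sketched in a few lines of Python. This is a toy sketch under our own simplifying assumptions: AGG is a plain sum (permutation-invariant, as the protocol requires) and COMBINE simply adds the aggregated message to the node's own state, whereas real GNNs use learned functions for both.

```python
def lump_round(adj_list, h, agg=sum):
    """One round of locally unordered message passing (Eq. 1).
    AGG is a permutation-invariant sum over the neighbour multiset;
    COMBINE adds the result to the node's own state. A toy sketch:
    real architectures replace both with learned functions."""
    new_h = {}
    for v, nbrs in adj_list.items():
        msg = agg(h[u] for u in nbrs)  # AGG({{h_u : u in N_v}})
        new_h[v] = h[v] + msg          # COMBINE(h_v, msg)
    return new_h

# path graph 0 - 1 - 2 with scalar node states
adj_list = {0: [1], 1: [0, 2], 2: [1]}
h = {0: 1.0, 1: 2.0, 2: 4.0}
print(lump_round(adj_list, h))  # {0: 3.0, 1: 7.0, 2: 6.0}
```

Because the sum aggregator sees only the multiset of neighbour states, permuting the neighbours of any node leaves the update unchanged, which is exactly the "unordered" property discussed above.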
Let G(v) be the subgraph of G that contains v, N_v, and their associated edges. We consider two graphs, G = (V, E) and G' = (V', E'), with corresponding feature matrices X and X'.

Definition 1. (Gross & Tucker, 2001) A graph map f : G → G' is called a local isomorphism if for every node v ∈ V, the restriction of f to G(v) is an isomorphism onto G'(f(v)).

Local isomorphism can be understood as a relaxed version of graph isomorphism: for two isomorphic graphs, the isomorphism map is also a local isomorphism, but the converse is not true (see Figure 1). Next we use the notion of local isomorphism to help characterize the expressive power of GNNs in the node classification context. We say a graph G is locally indistinguishable from a graph G' if there exists a surjective local isomorphism f from G to G' and, in addition, the feature matrices are related as X_v = X'_{f(v)}. The following theorem states a specific situation in which LUMP-type GNNs fail to distinguish between nodes in different graph contexts.
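Definition 1 can be checked mechanically on small graphs. The sketch below is our own illustration (the helper names and the cycle examples are not from the paper): it tests whether a map f restricts to an isomorphism on every G(v), and verifies that wrapping an 8-cycle twice around a 4-cycle via f(v) = v mod 4 is a local isomorphism, while wrapping a 6-cycle around a triangle is not, since the triangle's G(v) contains an edge between the two neighbours that the 6-cycle lacks.

```python
def cycle(n):
    """Adjacency lists of the n-cycle on nodes 0..n-1."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def is_local_isomorphism(G, Gp, f):
    """Check Definition 1: for every v, the restriction of f to G(v)
    (v, its neighbours, and the edges among them) must be an
    isomorphism onto G'(f(v)). Surjectivity of f is not checked here;
    it is a separate requirement for local indistinguishability."""
    for v in G:
        ball = [v] + G[v]
        image = [f[u] for u in ball]
        if len(set(image)) != len(ball):
            return False  # restriction of f is not injective on G(v)
        if sorted(image) != sorted([f[v]] + Gp[f[v]]):
            return False  # image is not exactly the node set of G'(f(v))
        for a in ball:  # edges must be preserved in both directions
            for b in ball:
                if (b in G[a]) != (f[b] in Gp[f[a]]):
                    return False
    return True

G8, G4 = cycle(8), cycle(4)
f = {v: v % 4 for v in range(8)}          # wrap C8 twice around C4
print(is_local_isomorphism(G8, G4, f))    # True

g = {v: v % 3 for v in range(6)}          # C6 onto the triangle C3
print(is_local_isomorphism(cycle(6), cycle(3), g))  # False
```

In the first case the map is also surjective, so if the node features match (X_v = X'_{f(v)}), C8 is locally indistinguishable from C4 in the sense above, even though the two graphs are clearly non-isomorphic.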

