GRAPH JOINT ATTENTION NETWORKS

Abstract

Graph attention networks (GATs) have been recognized as powerful tools for learning from graph-structured data. However, enabling the attention mechanisms in GATs to smoothly consider both structural and feature information remains challenging. In this paper, we propose Graph Joint Attention Networks (JATs) to address this challenge. Different from previous attention-based graph neural networks (GNNs), JATs adopt novel joint attention mechanisms which can automatically determine the relative significance between node features and structural coefficients learned from the graph subspace when computing the attention scores. JATs can therefore infer representations that capture more structural properties. Besides, we theoretically analyze the expressive power of JATs and further propose an improved strategy for the joint attention mechanisms that enables JATs to reach the upper bound of expressive power that any message-passing GNN can ultimately achieve, i.e., the 1-WL test. JATs can thereby be regarded as among the most powerful message-passing GNNs. The proposed neural architecture has been extensively tested on widely used benchmark datasets, including Cora, Citeseer, Pubmed, and OGBN-Arxiv, and has been compared with state-of-the-art GNNs on node classification tasks. Experimental results show that JATs achieve state-of-the-art performance on all the test datasets.

1. INTRODUCTION

Many real-world data can be modeled as graphs, where a set of nodes (vertices), edges, and bag-of-words features respectively represent data instances, instance-instance interrelationships, and contents characterizing the nodes. For example, scientific articles in a research domain can be modeled as a graph whose nodes, edges, and node features respectively represent published articles, citations, and index information of the articles. Social network users and interacting biological units can similarly be represented as graphs possessing different structural and descriptive information. As graph data are widely available and related to various analytical tasks, learning on graphs has become a hot spot in the machine learning community, and a number of approaches have been proposed to learn effectively from graph-structured data. Amongst them, graph convolutional networks (GCNs) have proven powerful in learning low-dimensional representations for various downstream analytical tasks. Conventional convolutional neural networks (CNNs), which have achieved great success in learning from image, vision, and natural language data (Krizhevsky et al., 2012; Xu et al., 2014), define convolution operators over grid-like data structures. GCNs, in contrast, attempt to formulate convolution operators that aggregate node features according to the observed graph structure and learn the information propagation through different neural architectures. GCNs can thereby learn meaningful representations that capture discriminative node features as well as intricate graph structure. Several sophisticated GCNs have been proposed recently. According to the way they make use of graph topology to define convolution operators for feature aggregation, GCNs can generally be categorized as spectral or spatial (Wu et al., 2020).
Spectral GCNs define the convolutional layer for aggregating neighbor features based on the spectral representation of the graph. For example, Spectral CNN (Bruna et al., 2013) constructs the convolution layer based on the eigendecomposition of the graph Laplacian in the Fourier domain. However, such a layer is computationally demanding. To reduce this computational burden, several approaches adopt convolution operators based on simplified or approximate spectral graph theory. First, parameterized filters with smooth coefficients were introduced for Spectral CNN to allow it to consider spatially localized nodes in the graph (Henaff et al., 2015). A Chebyshev expansion was then introduced in (Defferrard et al., 2016) to approximate the graph Laplacian rather than directly performing its eigendecomposition. Finally, the graph convolution filter was further simplified by considering only the connected neighbors of each node (Kipf & Welling, 2017), making spectral GCNs yet more computationally efficient. In contrast, spatial GCNs define the convolution operators for feature aggregation directly from the local structural properties of the central node. The key issue for spatial GCNs is consequently how to design an appropriate function for aggregating the effect of the features of candidate neighbors selected according to a proper sampling strategy. To achieve this, one may learn a weight matrix according to node degree (Duvenaud et al., 2015), utilize powers of the transition matrix to preserve neighbor importance (Atwood & Towsley, 2016; Busch et al., 2020; Klicpera et al., 2019), extract normalized neighborhoods (Niepert et al., 2016), or sample a fixed number of neighbors (Hamilton et al., 2017; Zhang et al., 2020). As a representative spatial GCN, the graph attention network (GAT) (Veličković et al., 2018; Zhang et al., 2020) has shown promising performance in various graph learning tasks.
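The simplified filter of Kipf & Welling (2017) reduces spectral convolution to a symmetrically normalized one-hop aggregation. A minimal NumPy sketch of that propagation rule (the toy graph, feature matrix, and identity weight below are illustrative, not from the paper):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One simplified graph-convolution step in the spirit of
    Kipf & Welling (2017): H' = D^{-1/2} (A + I) D^{-1/2} X W,
    i.e., each node aggregates only its connected neighbors
    (plus itself) instead of applying a full spectral filter."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)                      # degrees of A + I
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization
    return norm_adj @ feats @ weight             # aggregate, then transform

# Toy 3-node path graph, 2-d features, identity weight matrix.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
out = gcn_layer(adj, feats, np.eye(2))
```

Each row of `out` mixes a node's own features with those of its direct neighbors, weighted by the inverse square roots of their degrees.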
What makes GATs effective in learning graph representations is that they adopt the attention mechanism, which has been successfully used in machine reading and translation (Luong et al., 2015; Cheng et al., 2016) and video processing (Xu et al., 2015), to compute node-feature-based attention weights (attention scores) between a central node and its one-hop neighbors (including the central node itself). GATs then use the attention scores to obtain a weighted aggregation of node features, which is propagated to the next layer. As a result, neighbors possessing similar features may have a greater impact on the central node, and meaningful representations can be inferred by GATs. Although GATs have been experimentally verified as powerful tools for various graph learning tasks, they still confront several challenges. First, few attention mechanisms for attention-based GNNs can automatically identify the relative significance between the graph structure and node features. As a result, most current attention mechanisms for GATs cannot effectively capture the joint effect of the underlying graph structure and node features so as to seamlessly influence the message passing in the neural architecture. Second, whether the expressive power of GNNs whose attention mechanisms effectively acquire this joint effect can reach the upper bound of message-passing GNNs has not been theoretically investigated. To address these challenges, we propose novel attention-based GNNs, dubbed Graph Joint Attention Networks (JATs). Different from previous works, the attention mechanisms adopted by JATs are able to automatically capture the relative significance between the structural coefficients learned from the graph topology and the node features, so that higher attention scores can be assigned to those neighbors which are topologically and contextually correlated.
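The feature-based attention of a single GAT head (Veličković et al., 2018) can be sketched as follows; the toy features and the all-ones attention vector are illustrative choices, not values from the paper:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_attention(feats, W, a, i, neighbors):
    """One GAT attention head for central node i: W projects the
    features, the vector `a` scores each concatenated pair
    [Wh_i || Wh_j], and a softmax over the neighborhood (i included)
    yields the attention scores used to weight the aggregation."""
    h = feats @ W
    e = np.array([leaky_relu(a @ np.concatenate([h[i], h[j]]))
                  for j in neighbors])
    e = np.exp(e - e.max())
    alpha = e / e.sum()                          # attention scores
    h_new = (alpha[:, None] * h[neighbors]).sum(axis=0)
    return alpha, h_new

# Node 0 attends over itself and its two neighbors.
feats = np.array([[1., 0.], [0.5, 0.5], [0., 1.]])
W = np.eye(2)
a = np.ones(4)                                   # toy attention vector
alpha, h0 = gat_attention(feats, W, a, 0, [0, 1, 2])
```

With these toy values every pair receives the same raw score, so the softmax is uniform and `h0` is simply the neighborhood mean; learned `W` and `a` would instead emphasize neighbors with similar features.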
JATs are consequently able to smoothly adjust attention scores according to the current structure and node features, and truly capture joint attention over the structural and contextual information propagated through the network. Besides, we theoretically analyze the expressive power of JATs and further propose an improved strategy that enables JATs to distinguish all distinct graph structures, as the 1-dimensional Weisfeiler-Lehman test (1-WL test) does. This means JATs can reach the upper bound of expressive power that any message-passing GNN can ultimately achieve. JATs have been extensively tested on four widely used datasets, i.e., Cora, Citeseer, Pubmed, and OGBN-Arxiv, and have been compared with a number of strong baselines. The experimental results show that JATs achieve state-of-the-art performance. The rest of the paper is organized as follows. In Section 2, we elaborate on the proposed JATs and compare them with other GNNs. In Section 3, we prove the limitation on the expressive power of the joint attention mechanisms presented in Section 2, and then propose a strategy that improves JATs to reach the upper bound of expressive power that any message-passing GNN can achieve. Comprehensive experiments validating the effectiveness of JATs are presented in Section 4. Finally, we summarize the contributions of the paper and suggest future work for improving JATs.
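The exact joint attention mechanism is defined in Section 2. As a purely illustrative sketch of the idea, feature-based scores and structural coefficients could be blended before normalization; the convex combination and the trade-off parameter `lam` below are our assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def joint_attention_scores(feat_scores, struct_coefs, lam):
    """Hypothetical joint attention over one neighborhood: blend raw
    feature-based scores with structural coefficients using a
    trade-off weight lam in [0, 1] (lam = 1 ignores structure,
    lam = 0 ignores features), then normalize with a softmax.
    Illustrative only -- not the paper's actual mechanism."""
    mixed = lam * feat_scores + (1.0 - lam) * struct_coefs
    e = np.exp(mixed - mixed.max())
    return e / e.sum()

feat_scores = np.array([2.0, 1.0, 0.0])    # e.g., LeakyReLU outputs
struct_coefs = np.array([0.1, 0.6, 0.3])   # e.g., learned from topology
alpha = joint_attention_scores(feat_scores, struct_coefs, 0.5)
```

In JATs the relative significance between the two signals is determined automatically rather than by a fixed hyperparameter, but the sketch conveys how a neighbor can gain attention from either its features or its structural role.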

2. JOINT ATTENTION-BASED GRAPH NEURAL NETWORKS

In this section, we elaborate on the proposed JATs. Mathematical preliminaries and notations used in the paper are first illustrated. How JATs learn the structural coefficients which are used in the

