REACH THE REMOTE NEIGHBORS: DUAL-ENCODING TRANSFORMER FOR GRAPHS

Abstract

Despite recent successes in natural language processing and computer vision, Transformer suffers from a scalability problem when dealing with graphs. Computing full node-to-node attention is infeasible on large, complicated graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, loses the key merit of Transformer: attending to elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET), which has a structural encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. The two encoders are incorporated so that each boosts the other's performance. Our experiments demonstrate that DET achieves superior performance compared to the respective state-of-the-art attention-based methods in dealing with molecules, networks and knowledge graphs.

1. INTRODUCTION

Transformer has become one of the most prevalent neural models for natural language processing (NLP) (Vaswani et al., 2017; Devlin et al., 2019). The self-attention mechanism leveraged by Transformer has already been extended to graph neural networks (GNNs), e.g., GAT (Velickovic et al., 2018) and its variants (Wu et al., 2019; Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). Nevertheless, these models only consider the near (usually one-hop) neighbors, which deviates from the original intention of Transformer, namely attending to elements at distant positions. Recently, Graphormer (Ying et al., 2021) began to leverage the standard Transformer architecture for graph representation learning and has achieved superior performance on many benchmarks. However, in its graph property prediction scenarios, the datasets consist of small graphs (e.g., small molecules). The full node-to-node attention leveraged by Graphormer makes it inapplicable to large graphs with millions of nodes, such as knowledge graphs (KGs) or social networks. The same problem also appears in the computer vision (CV) area, yet has recently been tackled by grouping pixels into patches and then into windows in a hierarchical fashion (Dosovitskiy et al., 2021; Liu et al., 2021b). These works inspire us to explore the possibility of using one universal Transformer architecture as the general backbone to model graphs of different sizes. In addition to the many self-attention-based methods considering only one-hop neighbors (Schlichtkrull et al., 2018; Wu et al., 2019; Ye et al., 2019; Chen et al., 2021b; Kim & Oh, 2021), some existing works introduce multi-hop (usually 2- or 3-hop) neighbors (Abu-El-Haija et al., 2019; Sun et al., 2020). However, they still concentrate on local information and fail to obtain useful information from remote nodes.
Capturing remote correlations is one of the most important characteristics of Transformer, because the rich context not only boosts performance but also avoids over-fitting to local information. For example, attending to distant nodes may be helpful even on a highly homophilic graph, considering the existence of enormous numbers of missing links (Ciotti et al., 2016). In this paper, we propose a dual-encoding Transformer (DET). In DET, we consider two types of neighbors, i.e., structural neighbors and semantic neighbors. Structural neighbors are the near neighbors leveraged by most existing GNNs (Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). By contrast, semantic neighbors, i.e., the remote neighbors, are nodes that are semantically close to the node of interest but may be structurally remote from it.
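To make the distinction concrete, the following minimal sketch (not the paper's actual selection mechanism) contrasts the two neighbor types: structural neighbors are read off the adjacency matrix, while semantic neighbors are chosen as the nodes whose embeddings are most similar to the node of interest, regardless of graph distance. The embeddings and the top-k cosine-similarity rule here are illustrative assumptions.

```python
import numpy as np

def structural_neighbors(adj, node):
    # One-hop neighbors from the adjacency matrix: the "near"
    # neighbors used by most existing GNNs.
    return set(np.nonzero(adj[node])[0].tolist())

def semantic_neighbors(emb, node, k):
    # Hypothetical selection rule: the k nodes whose embeddings have
    # the highest cosine similarity to the node of interest, ignoring
    # graph structure entirely.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[node]
    sims[node] = -np.inf  # exclude the node itself
    return set(np.argsort(-sims)[:k].tolist())

# Toy path graph 0-1-2-3: node 3 is structurally remote from node 0,
# but its (assumed) embedding makes it semantically close to node 0.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 1.0],
                [0.9, 0.1]])

print(structural_neighbors(adj, 0))   # {1}
print(semantic_neighbors(emb, 0, 1))  # {3}
```

In this toy example, node 0's structural encoder would aggregate from node 1, while its semantic encoder would attend to the structurally remote node 3, which full node-to-node attention would reach only at quadratic cost.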

