REACH THE REMOTE NEIGHBORS: DUAL-ENCODING TRANSFORMER FOR GRAPHS

Abstract

Despite recent successes in natural language processing and computer vision, the Transformer suffers from a scalability problem when dealing with graphs. Computing full node-to-node attention is infeasible on large, complex graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, sacrifices the key merit of the Transformer: attending to elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET), which has a structural encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. The two encoders can be combined to boost each other's performance. Our experiments demonstrate that DET achieves superior performance compared with the respective state-of-the-art attention-based methods on molecules, networks, and knowledge graphs.

1. INTRODUCTION

The Transformer has become one of the most prevalent neural models for natural language processing (NLP) (Vaswani et al., 2017; Devlin et al., 2019). The self-attention mechanism leveraged by the Transformer has already been extended to graph neural networks (GNNs), e.g., GAT (Velickovic et al., 2018) and its variants (Wu et al., 2019; Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). Nevertheless, these models consider only the near (usually one-hop) neighbors, which runs against the original intention of the Transformer: attending to elements at distant positions. Recently, Graphormer (Ying et al., 2021) began to leverage the standard Transformer architecture for graph representation learning and has achieved superior performance on many benchmarks. However, in its graph property prediction scenarios, the datasets consist of small graphs (e.g., small molecules). The full node-to-node attention leveraged by Graphormer makes it inapplicable to large graphs with millions of nodes, such as knowledge graphs (KGs) or social networks. The same problem also appears in computer vision (CV), yet it has recently been tackled by grouping pixels into patches and then into windows in a hierarchical fashion (Dosovitskiy et al., 2021; Liu et al., 2021b). These works inspire us to explore the possibility of using one universal Transformer architecture as a general backbone to model graphs of different sizes.

In addition to the many self-attention-based methods considering only one-hop neighbors (Schlichtkrull et al., 2018; Wu et al., 2019; Ye et al., 2019; Chen et al., 2021b; Kim & Oh, 2021), some existing works introduce multi-hop (usually 2- or 3-hop) neighbors (Abu-El-Haija et al., 2019; Sun et al., 2020). However, they still concentrate on local information and fail to obtain useful information from remote nodes.
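To make the locality limitation concrete, the following NumPy sketch (not from the paper; an illustrative toy, with all function and variable names our own) implements self-attention masked to one-hop neighbors. On a simple path graph, a node receives exactly zero attention weight from any node outside its neighborhood, so remote information cannot flow in a single layer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(H, adj):
    """One attention head restricted to one-hop neighbors.

    H:   (N, d) node features (used as queries, keys, and values for brevity).
    adj: (N, N) 0/1 adjacency matrix with self-loops.
    Scores outside the neighborhood are set to -inf before the softmax,
    so non-neighbors receive exactly zero attention weight.
    """
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)               # (N, N) full score matrix
    scores = np.where(adj > 0, scores, -np.inf) # local mask
    A = softmax(scores, axis=-1)                # rows sum to 1 over neighbors
    return A @ H, A

# Tiny path graph 0-1-2-3: node 0 can never attend to node 3 in one layer.
adj = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1
H = np.random.default_rng(0).normal(size=(4, 8))
_, A = masked_self_attention(H, adj)
assert A[0, 3] == 0.0  # remote node is unreachable under local masking
```

Stacking k such layers extends the receptive field to k hops, but information from genuinely remote nodes is still diluted through intermediate aggregation, which is the gap the semantic encoder targets.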
Capturing remote correlations is one of the most important characteristics of the Transformer, because the rich context not only boosts performance but also prevents overfitting to local information. For example, attending to distant nodes may be helpful even on a highly homophilic graph, considering the existence of numerous missing links (Ciotti et al., 2016). In this paper, we propose a dual-encoding Transformer (DET). In DET, we consider two types of neighbors: structural neighbors and semantic neighbors. Structural neighbors are the near neighbors leveraged by most existing GNNs (Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). By contrast, semantic neighbors, i.e., the remote neighbors, are nodes that are semantically close but may be structurally remote from the node of interest. Figure 1 shows the basic idea of DET. For structural encoding, we use the standard self-attention layer to encode the structural neighbors. For semantic encoding, we use a modified linear attention layer to encode the semantic neighbors. The dual encoding ensures both local aggregation and global connection, and also enables the two encoders to benefit from each other through backpropagation.

2. RELATED WORKS

We split the related literature into three parts: non-local GNNs, self-attention, and position embeddings.

Non-local GNNs. Some methods also investigate how to capture the relationships among disconnected nodes (Pei et al., 2020; Yao et al., 2020; Liu et al., 2021a; Min et al., 2022). Specifically, Geom-GCN (Pei et al., 2020) learns the aggregation purely based on embedding distance, while Non-local-GNNs (Liu et al., 2021a) uses the attention scores from a virtual node to other nodes as a sorting metric to find non-local neighbors. However, these methods focus only on modeling networks and are evaluated on classification tasks with few classes. They also do not distinguish between remote nodes and direct neighbors. Yao et al. (2020) and Min et al. (2022) leverage hand-crafted features to find useful remote nodes, which makes them less relevant to our work.

Self-attention. Self-attention-based neural models, such as the Transformer, have recently become the de facto choice in NLP, ranging from language modeling and machine translation (Devlin et al., 2019; Vaswani et al., 2017) to question answering (Yang et al., 2019; Yavuz et al., 2022) and sentiment analysis (Cheng et al., 2021; Xu et al., 2019a). The Transformer has significant advantages over conventional sequential models such as recurrent neural networks (RNNs) (Williams & Zipser, 1989; Hochreiter & Schmidhuber, 1997) in both scalability and efficiency.



Figure 1: Overview of DET. Structural neighbors are local neighbors connected to the node of interest in the graph, while semantic neighbors are remote nodes with similar semantics to the node of interest. The two encoders focus on encoding different aspects of neighboring information, and are thus capable of complementing each other.

The idea of reaching remote neighbors is inspired by MSA Transformer (Rao et al., 2021) and AlphaFold 2 (Jumper et al., 2021), which query a genetic database to fetch similar sequences (i.e., proteins) as "family members". The difference is that the family members in DET (i.e., the semantic neighbors) are obtained by self-supervised learning rather than by querying external resources. Briefly, we convert this problem into a learning task that finds distant nodes that are as important as local neighbors. We then view local neighbors as positive examples and randomly sampled distant nodes as negative examples, forming a standard contrastive learning objective. Furthermore, we propose a semantic operator to estimate the score between the node of interest and other nodes. It is a learnable function that compels the encoder to value semantically close nodes.
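The contrastive recipe above can be sketched in a few lines of NumPy. This is a minimal toy, not the paper's implementation: the bilinear form s(u, v) = uᵀWv is our own stand-in for the semantic operator, and the loss, learning rate, and top-k selection are illustrative assumptions. Local neighbors are positives, random distant nodes are negatives, and after training the highest-scoring remote nodes are taken as semantic neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_scores(H, W):
    """Hypothetical bilinear semantic operator: s(u, v) = u^T W v."""
    return H @ W @ H.T

def contrastive_step(H, W, pos_pairs, neg_pairs, lr=0.01):
    """One SGD step of a logistic contrastive objective: local neighbors
    (positives) should score high, sampled distant nodes (negatives) low."""
    grad = np.zeros_like(W)
    loss = 0.0
    for pairs, label in [(pos_pairs, 1.0), (neg_pairs, 0.0)]:
        for i, j in pairs:
            p = sigmoid(H[i] @ W @ H[j])
            loss += -(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))
            grad += (p - label) * np.outer(H[i], H[j])  # d loss / d W
    return W - lr * grad, loss

# Toy setup: 20 nodes; node 0's local (positive) neighbors are 1 and 2,
# negatives are random distant nodes.
H = rng.normal(size=(20, 8))
W = np.zeros((8, 8))
pos = [(0, 1), (0, 2)]
neg = [(0, j) for j in rng.choice(np.arange(3, 20), size=4, replace=False)]
for _ in range(500):
    W, loss = contrastive_step(H, W, pos, neg)

# Rank remote nodes by the learned score; keep the top-k as node 0's
# semantic neighbors for the semantic encoder to attend over.
scores = semantic_scores(H, W)[0]
k = 3
semantic_neighbors = np.argsort(-scores[3:])[:k] + 3
```

In DET itself the scoring function is learned jointly with the two encoders through backpropagation; the sketch only shows the positive/negative construction and the score-then-select step.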

