REACH THE REMOTE NEIGHBORS: DUAL-ENCODING TRANSFORMER FOR GRAPHS

Abstract

Despite recent successes in natural language processing and computer vision, Transformer suffers from a scalability problem when dealing with graphs. Computing full node-to-node attention is infeasible on complicated graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, sacrifices the key merit of Transformer: attending to elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET), which has a structural encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. The two encoders are combined so that they boost each other's performance. Our experiments demonstrate that DET achieves superior performance compared to the respective state-of-the-art attention-based methods in dealing with molecules, networks, and knowledge graphs.

1. INTRODUCTION

Transformer has become one of the most prevalent neural models for natural language processing (NLP) (Vaswani et al., 2017; Devlin et al., 2019). The self-attention mechanism leveraged by Transformer has already been extended to graph neural networks (GNNs), e.g., GAT (Velickovic et al., 2018) and its variants (Wu et al., 2019; Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). Nevertheless, these models only consider the near (usually one-hop) neighbors, which may violate the original intention of Transformer: attending to elements at distant positions. Recently, Graphormer (Ying et al., 2021) started to leverage the standard Transformer architecture for graph representation learning and has achieved superior performance on many benchmarks. However, in its graph property prediction scenarios, the datasets consist of small graphs (e.g., small molecules). The full node-to-node attention leveraged by Graphormer makes it inapplicable to large graphs with millions of nodes, such as knowledge graphs (KGs) or social networks. The same problem also appears in the computer vision (CV) area, yet it has recently been tackled by grouping pixels into patches and then into windows in a hierarchical fashion (Dosovitskiy et al., 2021; Liu et al., 2021b). These works inspire us to explore the possibility of using one universal Transformer architecture as the general backbone to model graphs of different sizes.

In addition to the many self-attention-based methods considering only one-hop neighbors (Schlichtkrull et al., 2018; Wu et al., 2019; Ye et al., 2019; Chen et al., 2021b; Kim & Oh, 2021), some existing works introduce multi-hop (usually 2- or 3-hop) neighbors (Abu-El-Haija et al., 2019; Sun et al., 2020). However, they still concentrate on local information and fail to obtain useful information from remote nodes. Capturing remote correlations is one of the most important characteristics of Transformer, because the rich context not only boosts performance but also avoids over-fitting to local information. For example, attending to distant nodes may be helpful even on a highly homophilic graph, considering the existence of enormous numbers of missing links (Ciotti et al., 2016).

In this paper, we propose a dual-encoding Transformer (DET). In DET, we consider two types of neighbors, i.e., structural neighbors and semantic neighbors. Structural neighbors are the near neighbors leveraged by most existing GNNs (Vashishth et al., 2020; Chen et al., 2021b; Kim & Oh, 2021). By contrast, semantic neighbors, i.e., the remote neighbors, are the semantically close nodes that may be structurally remote from the node of interest. Figure 1 shows the basic idea of DET. For structural encoding, we use the standard self-attention layer to encode the structural neighbors. For semantic encoding, we use a modified linear attention layer to encode the semantic neighbors. The dual encoding ensures both local aggregation and global connection, and also enables the two encoders to benefit from each other through backpropagation. The idea of reaching the remote neighbors is inspired by MSA Transformer (Rao et al., 2021) and AlphaFold 2 (Jumper et al., 2021), which query genetic databases to fetch similar sequences (i.e., proteins) as "family members". The difference is that the family members in DET (i.e., the semantic neighbors) are obtained by self-supervised learning rather than by asking for external resources.
Briefly, we convert this problem into a learning task that finds the distant nodes that are as important as local neighbors. We then view local neighbors as positive examples and randomly sampled distant nodes as negative examples, constructing a standard contrastive learning objective. Furthermore, we propose a semantic operator to estimate the score between the node of interest and the others. It is a learnable function that compels the encoder to value the semantically close nodes. The proposed DET is capable of achieving superior performance on various graph learning tasks: (1) for graph property prediction, DET outperforms the best-performing methods on the PCQM4M-LSCv1 (Hu et al., 2021) and ZINC (Dwivedi et al., 2020) datasets; (2) for node classification, DET obtains competitive or better performance compared with the state-of-the-art attention-based methods on several prevalent benchmarks (e.g., Cora, PPI, and ogbn-arxiv) (Yang et al., 2016; Zitnik & Leskovec, 2017; Hu et al., 2020); (3) for KG completion (a.k.a. entity prediction), DET achieves the state-of-the-art performance on both FB15K-237 (Toutanova & Chen, 2015) and WN18RR (Dettmers et al., 2018).

2. RELATED WORKS

We split the related literature into three parts: non-local GNNs, self-attention, and position embedding.

Non-local GNNs Some methods also investigate how to capture the relationships among disconnected nodes (Pei et al., 2020; Yao et al., 2020; Liu et al., 2021a; Min et al., 2022). Specifically, Geom-GCN (Pei et al., 2020) learns the aggregation purely based on embedding distance, while Non-local-GNNs (Liu et al., 2021a) uses the attention scores from a virtual node to the other nodes as a sorting metric to find non-local neighbors. However, these methods only focus on modeling networks and are evaluated on classification tasks with a few classes. They also do not distinguish between remote nodes and direct neighbors. Yao et al. (2020) and Min et al. (2022) leverage hand-crafted features to find the useful remote nodes, which makes them less relevant to our work.

Self-attention Self-attention-based neural models, such as Transformer, have recently become the de facto choice in NLP, ranging from language modeling and machine translation (Devlin et al., 2019; Vaswani et al., 2017) to question answering (Yang et al., 2019; Yavuz et al., 2022) and sentiment analysis (Cheng et al., 2021; Xu et al., 2019a). Transformer has significant advantages over conventional sequential models like recurrent neural networks (RNNs) (Williams & Zipser, 1989; Hochreiter & Schmidhuber, 1997) in both scalability and efficiency.

Position Embedding Position embedding is one of the most important modules of Transformer. Transformer variants in different fields typically customize the designs of this module. For example, ViT and its followers (Dosovitskiy et al., 2021; Fan et al., 2021; Han et al., 2021) sequentially index the patches and encode the indices as 1D position embeddings. Swin Transformer (Liu et al., 2021b;c) proposes 2D-aware relative position biases, which employ a learnable matrix to record pairwise patch position information in the window. In addition to position information, other prior knowledge can also be injected into Transformer as attention biases or position embeddings, which becomes the key to applying Transformer on graphs (Ahn et al., 2021; Chen et al., 2021a;b; Dwivedi & Bresson, 2021; Kreuzer et al., 2021; Ying et al., 2021). For example, GT (Dwivedi & Bresson, 2021) replaces the sinusoidal position embeddings with Laplacian eigenvectors. Graphormer (Ying et al., 2021) encodes centrality and shortest path distance into embeddings, and then incorporates them as position embeddings into Transformer. HittER (Chen et al., 2021b) adds the edge type (i.e., relation) information of KGs when encoding entity embeddings.

3. METHODOLOGY

In this section, we present the details of DET. We start from the preliminaries and then introduce the dual-encoding process. Finally, we illustrate how to train a DET model.

3.1. PRELIMINARIES

We first introduce the terminologies and notations that will be used in the following sections.

Graph We define a graph as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set and $E = \{e_1, e_2, \ldots, e_m\}$ is the edge set. $n$ and $m$ denote the numbers of nodes and edges, respectively. In practice, different tasks often involve more complicated graph structures. For example, molecular graphs and KGs have edge types (i.e., chemical bonds and relations, respectively). We do not discuss the details and follow the general setting to process these specific features (Chen et al., 2021b; Ying et al., 2021).

GNN and Self-attention Without loss of generality, we define a GNN as a neural network that learns a group of weights to aggregate the embeddings of the one-hop or multi-hop neighbors of the node of interest. In this sense, self-attention can be naturally treated as a GNN model. Let $Q, K, V \in \mathbb{R}^{n \times h}$ denote the query, key, and value matrices, respectively. In this paper, they are the same node embedding matrix. $h$ denotes the hidden layer size. Self-attention calculates the attention scores as follows:

$$A = \mathrm{Softmax}\Big(\frac{QK^\top}{\sqrt{h}}\Big), \qquad (1)$$

where $A \in \mathbb{R}^{n \times n}$ records the node-to-node attention scores. We then aggregate the node embeddings with the following equation:

$$H = AV, \qquad (2)$$

where $H \in \mathbb{R}^{n \times h}$ is the output embedding matrix, with each row denoting the embedding of a node.

Linear Attention The computational complexity of the above dot-product implementation is $\Omega(n^2)$ (without considering the hidden layer size). As the number of nodes increases, the cost becomes unacceptable. GAT (Velickovic et al., 2018) proposes a linear self-attention implementation that mitigates this problem by only considering the one-hop neighbors:

$$B_{ij} = \sigma\big(b^\top [W v_i \,\|\, W v_j]\big), \qquad (3)$$

where $B_{ij}$ denotes the attention score from the node of interest $v_i$ to a neighbor $v_j$. $v_i, v_j \in \mathbb{R}^h$ are the embeddings of $v_i$ and $v_j$, respectively. $b \in \mathbb{R}^{2h}$ and $W \in \mathbb{R}^{h \times h}$ are the weight vector and matrix, respectively. $\sigma$ is the activation function and $\|$ denotes the concatenation operation. Linear attention does not consider the correlations within neighbors, and thus its computational complexity is cut down from $\Omega(n^2)$ to $\Omega(m)$. Note that it is not necessary to use one-hop neighbors as keys in linear attention; some existing works also consider multi-hop neighbors (Sun et al., 2020).
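To make the notation concrete, the following is a minimal PyTorch sketch of Equations (1) and (2); the single-head formulation and the function name are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Full node-to-node attention: A = Softmax(Q K^T / sqrt(h)) and H = A V.

    Q, K, V are [n, h] matrices; in this paper they are the same node
    embedding matrix. Both time and memory scale quadratically in n,
    which is why full attention is infeasible on large graphs.
    """
    h = Q.size(-1)
    A = F.softmax(Q @ K.transpose(-2, -1) / h ** 0.5, dim=-1)  # [n, n] attention scores
    return A @ V  # [n, h] aggregated embeddings
```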

3.2. STRUCTURAL ENCODING

The standard dot-product attention can be easily applied to small graphs. We add a virtual node $v_c$ (Devlin et al., 2019) as the context node connected with all nodes in $G$. Then, the output representation of $v_c$ can be regarded as an embedding of $G$. For large graphs like KGs or networks, we perform self-attention on the local subgraph $G_i$ given the node of interest $v_i$. Therefore, the output embedding for node $v_i$ is

$$h^{st}_i = \sum_{v_j \in \{v_i\} \cup N(v_i)} A_{cj} v_j, \qquad (4)$$

where $h^{st}_i$ denotes the output of the structural encoder for $v_i$, $A_{cj}$ denotes the attention score from the context node $v_c$ to the neighbor $v_j$, and $N(v_i)$ denotes the set of local neighbors (one-hop neighbors in our implementation) of $v_i$. Following the existing works (Chen et al., 2021b; Ying et al., 2021), we accordingly add the centrality, relation type, or shortest path distance information as special position embeddings to the encoder. The details can be found in Appendix A.
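As an illustration, a single-query sketch of Equation (4) might look as follows; using $v_i$ itself as the query is a simplification of the context-node formulation above, and the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def structural_output(v_i: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Single-query sketch of Equation (4): attend over {v_i} U N(v_i).

    v_i: [h] embedding of the node of interest; neighbors: [k, h] embeddings
    of its local (one-hop) neighbors. Position embeddings (Appendix A) are
    assumed to be added to the inputs beforehand.
    """
    keys = torch.cat([v_i.unsqueeze(0), neighbors], dim=0)  # [k + 1, h]
    scores = keys @ v_i / v_i.size(-1) ** 0.5               # [k + 1] dot-product scores
    A = F.softmax(scores, dim=0)                            # attention weights
    return A @ keys                                         # h^{st}_i: [h]
```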

3.3. SEMANTIC ENCODING

The structural features sometimes are unreliable for identifying a neighbor. For example, if the node of interest has a large number of neighbors, many of them inevitably have similar or identical structural features (e.g., the shortest path distance to $v_i$). This problem is even more serious when they are all one- or two-hop neighbors. However, if we directly aggregate nodes from more hops, the sheer quantity of available information will overwhelm the neural network. Table 1 summarizes the average frequency of entities appearing as others' neighbors in different hops. We can find that the two- or more-hop neighbors of a node are shared by many others, which is why current GNNs rarely consider multi-hop neighbors. This phenomenon also reveals the over-smoothing problem to some extent. To make the node of interest more distinguishable to the classifier, weighting its one-hop neighbors is usually reasonable due to their lower redundancy. Therefore, we make the following hypothesis:

Hypothesis 1 The one-hop neighbors are the most informative features to identify and represent the node of interest.

Recent successes (Rao et al., 2021; Jumper et al., 2021) in biological science demonstrate that using the information provided by family members can help protein structure prediction. Specifically, they leverage multiple sequence alignment (MSA) (the result of biological sequence alignment) to make use of the information within an evolutionary family. The protein sequences in a family are assumed to have a common ancestor or an evolutionary relationship. Therefore, they may also share some important sub-structures in protein folding. If we view the embedding of a node as a kind of sequence, then the remote neighbors used in semantic encoding should have embeddings similar to that of the node of interest. We illustrate this idea in Figure 2: our basic idea is to view node embeddings as protein amino acid sequences of a fixed length. Then, we can make use of the insight from the well-known MSA Transformer and AlphaFold 2. To this end, we need to find the family members (that share similar evolutionary characteristics and relationships) of our "protein". In the original implementations, this step is done by querying an external gene database. In our case, we propose the semantic encoder to find such family members. Therefore, we make the second hypothesis:

Hypothesis 2 The distant nodes that have a high embedding similarity with the node of interest are important features for this node.

In this paper, we estimate the similarity score by a learnable neural function $f_s: \mathbb{R}^h \times \mathbb{R}^h \to \mathbb{R}$:

$$f_s(v_i, v_j) = v_i \ominus v_j = 1 - \sigma\big((v_i - v_j)W_s + b_s\big), \qquad (5)$$

where $\ominus$ is the semantic difference operator. In fact, the choice of $\ominus$ is flexible as long as it reflects the similarity between $v_i$ and $v_j$. We use the weighted difference as $\ominus$, which can be easily extended to a matrix operation. $W_s \in \mathbb{R}^{h \times 1}$ and $b_s \in \mathbb{R}$ are the weight and bias, respectively. We use Sigmoid as the activation to normalize the difference to $(0, 1)$ and then convert it to a similarity score. Therefore, the output embedding for semantic encoding is written as follows:

$$h^{se}_i = \sum_{v_j \in N^{se}(v_i)} B_{ij} v_j = \sum_{v_j \in N^{se}(v_i)} f_s(v_i, v_j) v_j, \qquad (6)$$

where $B \in \mathbb{R}^{n \times n}$ records the node-to-node semantic attention scores estimated by $f_s$ in Equation (5), and $N^{se}(v_i)$ is the semantic neighbor set. In our implementation, it is sampled from the top candidates (top 10% in our setting) during training.
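A minimal PyTorch sketch of Equations (5) and (6) follows; the class and function names are hypothetical, and batched or multi-head variants are omitted for brevity.

```python
import torch
import torch.nn as nn

class SemanticScorer(nn.Module):
    """Sketch of Equation (5): f_s(v_i, v_j) = 1 - sigmoid((v_i - v_j) W_s + b_s)."""

    def __init__(self, h: int):
        super().__init__()
        self.linear = nn.Linear(h, 1)  # holds W_s in R^{h x 1} and the bias b_s

    def forward(self, v_i: torch.Tensor, v_j: torch.Tensor) -> torch.Tensor:
        # Sigmoid normalizes the weighted difference to (0, 1);
        # subtracting from 1 turns the difference into a similarity.
        return 1.0 - torch.sigmoid(self.linear(v_i - v_j)).squeeze(-1)

def semantic_output(scorer: SemanticScorer, v_i: torch.Tensor,
                    semantic_neighbors: torch.Tensor) -> torch.Tensor:
    """Sketch of Equation (6): weight each fetched semantic neighbor by f_s."""
    B = scorer(v_i.expand_as(semantic_neighbors), semantic_neighbors)  # [k] scores
    return B @ semantic_neighbors                                      # h^{se}_i: [h]
```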
We then define the semantic neighbor fetching loss to learn $f_s$:

$$\mathcal{L}_{sn}(v_i) = -\bigg(\frac{1}{|N(v_i)|}\sum_{v_j \in N(v_i)} \ln f_s(v_i, v_j) + \frac{1}{|N^-(v_i)|}\sum_{v_k \in N^-(v_i)} \ln\big(1 - f_s(v_i, v_k)\big)\bigg), \qquad (7)$$

where $N(v_i)$ is the positive example set that includes the local neighbors (Hypothesis 1) of $v_i$, and $N^-(v_i)$ is the negative example set in which the negative examples are randomly sampled distant nodes (Hypothesis 2) of $v_i$. Therefore, the learning of $f_s$ is self-supervised and can be jointly optimized with the main task loss.
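The loss can be sketched as follows, reusing the scorer above; the epsilon guard on the logarithms and the helper name are illustrative additions.

```python
import torch

def semantic_fetching_loss(scorer, v_i: torch.Tensor,
                           pos_neighbors: torch.Tensor,
                           neg_samples: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """Sketch of Equation (7). Local neighbors serve as positives (Hypothesis 1)
    and randomly sampled distant nodes as negatives (Hypothesis 2).
    """
    pos = scorer(v_i.expand_as(pos_neighbors), pos_neighbors)  # f_s for positives
    neg = scorer(v_i.expand_as(neg_samples), neg_samples)      # f_s for negatives
    return -(torch.log(pos + eps).mean() + torch.log(1.0 - neg + eps).mean())
```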

3.4. DUAL-ENCODING TRANSFORMER

Algorithm We illustrate the implementation of DET in Algorithm 1 and summarize the overall training process as follows (a code sketch is given after this paragraph). We first initialize the input embeddings and all parameters of DET. At each epoch (or every few epochs), we first draw semantic neighbors from the top 10% candidates (according to Equation (5)) for each node. In each batch, we feed the structural encoder with the structural neighbors of the input nodes, and the semantic encoder with the semantic neighbors. We combine the output embeddings of the two encoders by weighted addition and jointly minimize the main task loss and the semantic neighbor fetching loss.

Computational Cost We find that the total training time does not increase much compared with the baselines. The semantic operator in the semantic encoder is simpler in design than linear attention and only adds a small number of parameters. Although fetching the semantic neighbors requires iterating over all nodes (which yields a time complexity of $\Omega(n^2)$), we do not compute them on the fly. Instead, we update the semantic neighbors of each node every few epochs, improving both efficiency and robustness. Hence, the overall training time remains at the same level (see Appendix B for the detailed statistics).
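Below is a condensed sketch of this training process; all callables are hypothetical stand-ins for the two encoders, the task loss, the fetching loss, and the neighbor-refreshing routine, and the refresh interval and combination weight are illustrative values (the weighted combination is analyzed in Section 5.2).

```python
def train_det(model_st, model_se, loader, optimizer, main_loss, sn_loss,
              refresh_semantic_neighbors, num_epochs: int,
              refresh_interval: int = 5, tau: float = 0.5):
    """Condensed sketch of Algorithm 1 under the stated assumptions."""
    for epoch in range(num_epochs):
        if epoch % refresh_interval == 0:
            # Re-draw semantic neighbors from the top 10% candidates
            # (Equation (5)) instead of computing them on the fly.
            refresh_semantic_neighbors(top=0.10)
        for x_st, x_se, y in loader:
            h_st = model_st(x_st)                 # structural encoding, Eq. (4)
            h_se = model_se(x_se)                 # semantic encoding, Eq. (6)
            h = tau * h_st + (1.0 - tau) * h_se   # weighted addition
            loss = main_loss(h, y) + sn_loss(x_st, x_se)  # joint objective with Eq. (7)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```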

4. EXPERIMENT

We conducted experiments on a variety of benchmarks to verify the effectiveness of DET. We have uploaded the source code, and we report the dataset statistics and parameter settings in Appendix C.

4.1. GRAPH PROPERTY PREDICTION

Datasets We evaluated DET on the graph property prediction benchmarks PCQM4M-LSCv1 (Hu et al., 2021) and ZINC (Dwivedi et al., 2020). The former is used in the recent Open Graph Benchmark Large-Scale Challenge, while the latter is a popular dataset for evaluating molecular graph representation learning methods. Considering that the number of nodes in each graph is very small (usually less than 50), we directly perform attention operations on the whole graph. Therefore, we removed the semantic neighbor fetching loss in this experiment.

Baselines We compared DET with the state-of-the-art methods: the attention-based GAT (Velickovic et al., 2018), GT (Dwivedi & Bresson, 2021), and Graphormer (Ying et al., 2021); and other representative methods GCN (Kipf & Welling, 2017), GraphSage (Hamilton et al., 2017), GIN (Xu et al., 2019b), DeeperGCN (Li et al., 2020), and GatedGCN-LSPE (Dwivedi et al., 2022).

Results Table 2 and Table 3 summarize the experimental results measured by mean absolute error (MAE) on the two datasets. Due to the inaccessibility of the test data on PCQM4M-LSCv1, we alternatively report the MAE results on the training and validation sets. Overall, DET outperformed all the baseline methods on PCQM4M-LSCv1. Compared with Graphormer, which only considers encoding structural information with Transformer, DET significantly improved the performance, with 6.2% and 7.4% MAE reductions on PCQM4M-LSCv1 and ZINC, respectively. Furthermore, the number of model parameters remained at the same level as the baselines. We also observed that DET had more significant advantages over the other Transformer-based methods on ZINC. Although GatedGCN-LSPE had a better result, we argue that it is feasible to use it as our structural encoder to obtain better performance.

4.2. NODE CLASSIFICATION

Datasets We evaluated DET on five benchmarks that are generally used for node representation learning. Specifically, Cora, CiteSeer, and PubMed (Yang et al., 2016) are three citation network datasets commonly used in the transductive setting, while PPI (Zitnik & Leskovec, 2017) is a widely used protein-protein interaction dataset in the inductive setting. ogbn-arxiv (Hu et al., 2020) is a large citation network dataset. Most experiment settings follow Kim & Oh (2021); we repeated the experiments 100 times on Cora, CiteSeer, and PubMed with random seeds, and 30 times on PPI and ogbn-arxiv, to produce reliable results and ensure a fair comparison.

Baselines We selected the attention-based methods GAT (Velickovic et al., 2018), CGAT (Wang et al., 2019a), Graph-Bert (Zhang et al., 2020), and SuperGAT$_{SD}$ (Kim & Oh, 2021) as baselines. In addition, other GNN-based methods like GCN (Kipf & Welling, 2017), GraphSage (Hamilton et al., 2017), and GCN+NS (Zheng et al., 2020) were also added for comparison.

Results

The results are shown in Table 4, from which we can observe that DET outperformed the attention-based methods on most datasets except PPI. Although SuperGAT$_{SD}$ had better performance on this dataset, we argue that there is no contradiction in incorporating SuperGAT$_{SD}$ as the structural encoder of DET to obtain a stronger model. Interestingly, the attention-based methods unanimously performed worse than the GCN-based method GCN+NS on CiteSeer. SuperGAT$_{SD}$ and CGAT even had the same or worse results compared with the original GAT. Nevertheless, we observed an improvement from GAT to DET. This result empirically demonstrates the strength of leveraging semantic neighbors.

4.3. KG COMPLETION

Datasets

We conducted experiments on the KG completion (a.k.a. entity prediction) task. The main target is to predict the subject entity (or object entity) of an incomplete triple. We evaluated DET on two benchmark datasets, FB15K-237 (Toutanova & Chen, 2015) and WN18RR (Dettmers et al., 2018), which are sampled from the real-world KGs Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995), respectively.

Baselines We chose the best-performing entity prediction methods as our baselines: the translation- and factorization-based methods TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), and TuckER (Balazevic et al., 2019), and the attention-based methods CoKE (Wang et al., 2019b), CompGCN (Vashishth et al., 2020), and HittER (Chen et al., 2021b). Among them, CoKE and HittER also leverage Transformer to encode the structural information.

Results

We report the main results in Table 5. It is clear that DET surpassed all the baselines on both datasets across all metrics. The improvement on MR (mean rank) is the most significant, which implies that DET learned better embeddings for all entities, not only for the top-ranked ones favored by Hits@1. Overall, DET achieved competitive performance on all three types of tasks, which demonstrates its effectiveness and generality in modeling graphs.

5. FURTHER ANALYSIS

To better understand DET, we designed three experiments to explore and evaluate it in depth.

5.1. IS EVERY MODULE IN DET USEFUL?

We conducted ablation studies to verify the effectiveness of each module in DET. We used six datasets across different tasks and present the results in Table 6. We removed the modules from DET step by step while keeping identical hyper-parameter settings throughout the experiments.

Semantic Neighbor Fetching The semantic neighbor fetching loss is undoubtedly important to DET. Whether combining the two encoders or only using the semantic encoder, integrating the semantic fetching module led to better performance in most cases. The improvement was most notable on PubMed, where it yielded accuracy increases of 3.7% and 3.8%, respectively. The mean rank results on WN18RR also became worse without the fetching loss.

Semantic Encoder If we do not consider the semantic neighbor fetching loss, is the semantic encoder itself still useful to DET? Unfortunately, we find the answer ambiguous. For Cora, PubMed, and WN18RR, when we did not employ the fetching loss, DET with the semantic encoder performed much worse than DET without it. But we observe that the situation was reversed on CiteSeer and FB15K-237. In fact, when we only consider one-hop neighbors, the semantic encoder without the fetching loss is just a "minus" attention layer. It may have its pros and cons compared with the standard dot-product attention layer on different datasets. In this sense, it is the semantic fetching loss that endows the semantic encoder with its distinctive characteristic. On the ZINC dataset, where the model can apply attention operations on the whole graph, the semantic encoder was capable of estimating the semantic similarity of remote neighbors to the node of interest without the help of the fetching loss. Accordingly, we can see that the dual-encoding version of DET greatly outperformed the structural-encoder-only version. Overall, the effectiveness of the semantic encoder is conditional: it must get in touch with the remote nodes.

Structural Encoder

The structural encoder also has merits. From the results of the third and fifth rows in Table 6, we find that it had better performance than the semantic encoder on all datasets except CiteSeer. We also noticed that only using the semantic encoder had the worst MAE on the ZINC dataset, due to the absence of all structural information.

5.2. THE CORRELATION BETWEEN SEMANTIC ENCODING AND GRAPH HOMOPHILY

In the Introduction section, we mention that semantic encoding is helpful even on graphs with high homophily. Therefore, we conducted experiments to verify the correlation between the effectiveness of the semantic encoder and the homophily of the datasets. We set a hyper-parameter $\tau$ to control the combination of the structural encoder and the semantic encoder:

$$h = \tau h^{st} + (1 - \tau) h^{se},$$

where $h$, $h^{st}$, and $h^{se}$ denote the combined output, structural output, and semantic output, respectively. By assigning different $\tau$, we can control the importance of each encoder in the combination.

The experimental results are shown in Figure 3, from which we can see that the performance gap across different $\tau$ existed on all three datasets, but the trends and peaks differed. On the highly homophilic dataset Cora, the accuracy first increased from $\tau = 0$ to $\tau = 0.15$, peaked around $\tau = 0.2$, and then dropped until $\tau = 0.95$. On CoraFull and Chameleon, with medium and low homophily, we observe that the performance peaked around $[0, 0.05]$ and remained steady in the interval $[0.05, 0.2]$. When $\tau \geq 0.5$, the performance rapidly dropped to its minimum. Therefore, we may draw the following conclusions: (1) the semantic encoder is more effective when the datasets have low homophily: the larger the proportion the semantic encoding occupied, the better the performance the model achieved; (2) even on the highly homophilic dataset, engaging the semantic encoder with a proper $\tau$ has significant advantages over using only the structural encoder; (3) in all cases, properly combining the outputs of the two encoders (i.e., DET) is the best choice.

5.3. HOW DOES THE SEMANTIC ENCODER HELP THE STRUCTURAL ENCODER?

It is worth exploring how the semantic encoder affects the structural encoder. In Figure 4, we illustrate two examples on FB15K-237 and WN18RR, respectively. We find that the semantic scores for the structural neighbors are also in line with human intuition. In the left figure, the entity USA has a low score although it is directly connected to Nintendo by the relation service_location. The verbs precede and accompany obtain relatively low scores in the right figure. These neighbors are not very related to the entities of interest from the human perspective. By contrast, some one-hop neighbors get high semantic scores, e.g., the well-known director Shigeru Miyamoto of Nintendo in FB15K-237 and the verb walk in WN18RR. They are the more informative entities. As for the semantic neighbors, we can see that the exploited remote neighbors are closely related to the entity of interest, as are the structural neighbors with high semantic scores. For example, Atlus is an important game developer to Nintendo. Aggregating such information may be helpful when the model is asked to predict the games related to Nintendo. For the verb travel in WN18RR, move also shares many key features with it.

6. CONCLUSION AND FUTURE WORK

In this paper, we propose a new Transformer architecture, DET, to deal with graphs of different types and sizes. In DET, the structural encoder aggregates information from local nodes while the semantic encoder seeks remote nodes with useful semantics. The experimental results demonstrate the strong performance of DET on three prevalent GNN tasks across 9 benchmarks. We hope DET can bring more insights and inspiration for developing unified Transformer architectures. In the future, we plan to adapt DET to the NLP and CV areas.

A POSITION EMBEDDING

There are many important graph features that can be used to identify different nodes. Thanks to learnable position embeddings, these discrete features can now be encoded into embeddings and combined with the raw node embedding vectors. Specifically, for the graph property prediction task, we use the method proposed by Graphormer (Ying et al., 2021) to encode the degree centrality of an arbitrary node $v_i$ as follows:

$$c_i = f_c(\deg(v_i)),$$

where $\deg(v_i)$ denotes the degree of the node $v_i$, and $f_c: \mathbb{R} \to \mathbb{R}^h$ is the mapping function that converts the node degree to a learnable embedding. We also consider encoding the distances from the node of interest $v_i$ to different neighbors by the following equation:

$$d_{v_i, v_j} = f_d(\mathrm{spd}(v_i, v_j)),$$

where $\mathrm{spd}(v_i, v_j)$ denotes the shortest path distance from $v_i$ to $v_j$, and $f_d: \mathbb{R} \to \mathbb{R}^h$ is a similar function that converts the distance to a learnable embedding. For the KG representation learning task, we follow Chen et al. (2021b) to encode the edge types into node embeddings, which is implemented by an additional atom triple Transformer $M_A$. Specifically, consider a given triple $(v_i, r_{ij}, v_j)$ for the node of interest $v_i$, where $r_{ij}$ denotes the edge type (i.e., relationship) between $v_i$ and $v_j$. The edge type information can be encoded by the following equation:

$$e_{ij} = M_A([c_A, v_i, r_{ij}, v_j]),$$

where $[c_A, v_i, r_{ij}, v_j]$ is the input embedding sequence, and $c_A$ is the virtual node for the atom Transformer, whose output represents the edge encoding embedding.
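As a concrete illustration, the mapping functions $f_c$ and $f_d$ can be realized as learnable embedding lookup tables; the sketch below assumes hypothetical caps on the degree and distance vocabularies.

```python
import torch
import torch.nn as nn

class GraphPositionEmbeddings(nn.Module):
    """Sketch of the learnable encodings above: f_c maps node degrees and f_d
    maps shortest path distances to h-dimensional embeddings. The vocabulary
    caps (max_degree, max_spd) are hypothetical, dataset-dependent choices.
    """

    def __init__(self, h: int, max_degree: int = 512, max_spd: int = 64):
        super().__init__()
        self.f_c = nn.Embedding(max_degree + 1, h)  # degree centrality: c_i
        self.f_d = nn.Embedding(max_spd + 1, h)     # shortest path distance: d_{v_i, v_j}

    def centrality(self, degree: torch.Tensor) -> torch.Tensor:
        # Clamp out-of-range degrees to the last embedding index.
        return self.f_c(degree.clamp(max=self.f_c.num_embeddings - 1))

    def distance(self, spd: torch.Tensor) -> torch.Tensor:
        return self.f_d(spd.clamp(max=self.f_d.num_embeddings - 1))
```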

B COMPUTATIONAL COST

To estimate the computational cost, we used one 32GB V100 GPU to train all the methods and report the average training time in Table 7. Clearly, incorporating the semantic encoder did not significantly increase the computational cost, especially on the node classification datasets.

C EXPERIMENT DETAILS

C.1 DATASET SETTINGS

We present the overall dataset statistics in Table 8.

Graph Property Prediction For PCQM4M-LSCv1, the model is asked to predict the DFT (density functional theory)-calculated HOMO-LUMO energy gap of given molecules. It contains more than 3.8M 2D molecular graphs as input, making it especially appropriate for evaluating model performance in large-scale scenarios. On the other hand, ZINC is a relatively small dataset, where the main target is a graph property regression: predicting the constrained solubility. It is one of the most popular real-world molecular datasets for graph representation learning.

Node Classification Cora, CiteSeer, and PubMed are three citation network datasets proposed by Yang et al. (2016). They are typically used for the transductive node classification task. PPI, on the other hand, is used for inductive evaluation. It consists of 24 graphs, with 20 graphs for training, 2 for validation, and 2 for testing. All four datasets are prevalent benchmarks for node classification.

KG Completion FB15K-237 is the revised version of the original FB15K dataset (Bordes et al., 2013), which has been used as an entity prediction benchmark over the last ten years. However, recent studies (Dettmers et al., 2018; Toutanova & Chen, 2015) find that the original FB15K contains a large proportion of redundant data, some of which may incur test data leakage; the same holds for another widely used dataset, WN18. Therefore, most recent studies only use the revised datasets FB15K-237 and WN18RR for evaluation. FB15K-237 has more distinct relations, while WN18RR is sparser and has more distinct entities.

C.2 PARAMETER SETTINGS

We summarize the main hyper-parameter settings on different datasets in Table 9. More specific settings can be found in the source code. For the graph property prediction tasks, we directly let the semantic encoder access all neighbors, and thus the semantic neighbor fetching loss is not used. Alternatively, we adopt an increasing strategy that gradually raises the weight of the semantic encoder's output during training.

D ADDITIONAL EXPERIMENTS

D.1 CONVERGENCE ANALYSIS ON GRAPH PROPERTY PREDICTION

We conducted experiments to compare the performance of the structural encoder with and without the semantic encoder, in terms of training steps. We depict the validation MAE results on ZINC in Figure 5a. At the beginning of training, the two methods do not have a visible performance gap; the curves tightly overlap from step 0 to step 74,000. As the performance starts to converge, i.e., from step 74,000 to step 148,000, using only the structural encoder is better than combining the two encoders. However, as the validation MAE stabilizes, the dual-encoding DET gradually outperforms the single structural encoder. It is worth noting that the turning point appears when the performance starts to converge, which is also when the input embeddings approach their ideal positions. The semantic similarity among embeddings can then be estimated more precisely, contributing to a better semantic encoder; as a result, DET obtains a lower MAE than the single structural encoder. We present the result on PCQM4M-LSCv1 in Figure 5b, which supports the same conclusion.

D.2 RESULTS ON OTHER NODE CLASSIFICATION DATASETS

We also provide the results on 8 additional real-world datasets for node classification: CS and Physics (Shchur et al., 2018); Cora-ML, Cora-Full, and DBLP (Bojchevski & Günnemann, 2018); Chameleon (Rozemberczki et al., 2019); Four-Univ (Craven et al., 1998); and Wiki-CS (Mernyei & Cangea, 2020). We report the accuracy in Table 10. It is clear that DET consistently and significantly outperformed the attention-based methods on these datasets.

D.3 CHOICES OF $f_s$

In this section, we conducted experiments to verify the effectiveness of the proposed $f_s$, in comparison with a linear attention implementation. The results are shown in Table 11, from which we can observe that our proposed attention outperformed the linear attention, and the results of using linear attention were similar to those of SuperGAT and the original GAT. SuperGAT also leverages a self-supervised contrastive loss to predict the existence of edges in the original graph. If we use linear attention to replace our proposed attention, our method becomes more like a SuperGAT with two separate linear attentions that cope with node classification and edge prediction, respectively.



Figure 1: Overview of DET. Structural neighbors are local neighbors connected with the node of interest on the graph, while semantic neighbors are remote nodes with similar semantics to the node of interest. The two encoders focus on encoding different aspects of neighboring information, and thus are capable of complementing each other.

Figure 2: A comparison between MSA and semantic neighbors. The left figure is sliced from Jumper et al. (2021). The right figure is an example from WordNet (Miller, 1995).

Algorithm 1: The training process of DET.
1: Initialize the input embeddings and all parameters of DET;
2: repeat
3:   (every few epochs) fetch the semantic neighbors of each node from the top 10% candidates (Equation (5));
4:   for each batch data $(X^{st}, X^{se}, Y)$ do
5:     $H^{st} \leftarrow M^{st}(X^{st})$ (Equation (4));
6:     $H^{se} \leftarrow M^{se}(X^{se})$ (Equation (6));
7:     $H \leftarrow H^{st} \oplus H^{se}$;
8:     Compute $\mathcal{L}_{sn}$ (Equation (7));
9:     $\mathcal{L} \leftarrow \mathcal{L}_{main}(H, Y) + \mathcal{L}_{sn}$;
10:    Update the parameters according to $\mathcal{L}$;
11:   end for
12: until the performance on the validation set converges;

Figure 3: Accuracy on three datasets with different homophily (Cora (Yang et al., 2016): 0.83, CoraFull (Bojchevski & Günnemann, 2018): 0.59, Chameleon (Rozemberczki et al., 2019): 0.21), w.r.t. hyper-parameter τ (average of 7 runs). The homophily statistics are from Zhu et al. (2020).

Figure 4: Examples of the semantic attention scores to different types of neighbors.

Figure 5: The validation MAE on ZINC and PCQM4M-LSCv1, w.r.t. training steps.

Table 1: The occurrence frequency of entities in FB15K-237 and WN18RR, in terms of hops.

Table 2: Graph property prediction results on the PCQM4M-LSCv1 dataset.

Table 3: Graph property prediction results on the ZINC dataset.

Table 4: Node classification results on five benchmarks (accuracy for Cora, CiteSeer, PubMed, and ogbn-arxiv; F1-score for PPI). The results of Graph-Bert are from (Zhang et al., 2020).

Table 6: Ablation study results on different datasets (↑: higher is better; ↓: lower is better; ×: unavailable entry). St. and Se. are the abbreviations of Structural and Semantic.

Table 7: The average training time of DET in comparison with the respective baselines, on a 32GB V100 GPU.

Table 8: The dataset statistics. R and C denote the regression and classification tasks, respectively.

Table 9: The hyper-parameter settings on the datasets in the main experiments.

Table 10: Accuracy on 8 popular node classification datasets.

Table 11: A result comparison of different choices of $f_s$.

