CAN GNNS LEARN HEURISTIC INFORMATION FOR LINK PREDICTION?

Abstract

Graph Neural Networks (GNNs) have shown superior performance in Link Prediction (LP). In particular, SEAL and its successors address the LP problem by classifying the subgraphs extracted specifically for candidate links, achieving state-of-the-art results. Nevertheless, we question whether these methods can effectively learn the information equivalent to link heuristics such as Common Neighbors, the Katz index, etc. (we refer to such information as heuristic information in this work). We show that link heuristics and GNNs capture different information. Link heuristics usually collect pair-specific information by counting the involved neighbors or paths between the two nodes of a candidate link, while GNNs learn node-wise representations through a neighborhood aggregation algorithm in which the two nodes of the candidate link do not pay special attention to each other. Our further analysis shows that SEAL-type methods merely use a GNN to model the pair-specific subgraphs and thus also cannot effectively capture heuristic information. A straightforward way to verify our analysis is to compare the LP performance of existing methods against a model that learns heuristic information independently of the GNN. To this end, we present a simple yet lightweight framework, ComHG¹, which directly Combines the embeddings of link Heuristics with the representations produced by a GNN. Experiments on OGB LP benchmarks show that ComHG outperforms all top competitors by a large margin, empirically confirming our propositions. Our experimental study also indicates that the contributions of link heuristics and the GNN to LP are sensitive to the graph degree: the former is powerful on sparse graphs, while the latter becomes dominant on dense graphs.

1. INTRODUCTION

Link Prediction (LP), which aims to predict the likelihood that a link exists between a pair of nodes in a graph, is a prominent task in graph-based data mining (Kumar et al., 2020). It has a wide range of beneficial applications, such as recommender systems (Wu et al., 2021), molecular interaction prediction (Huang et al., 2020), and knowledge graph completion (Li et al., 2022). Throughout the history of LP research, a number of link heuristics have been defined, such as Common Neighbors (CN), the Katz index (Katz, 1953), etc. A link heuristic usually describes a specific fact or hypothesis that best explains a statistical pattern in link observations (Martínez et al., 2016). The effectiveness of many link heuristics has been confirmed in various real-world LP applications (Liben-Nowell & Kleinberg, 2007; Zhou et al., 2009; Martínez et al., 2016). Recently, graph representation learning has proven powerful for LP (Perozzi et al., 2014; Zhang & Chen, 2018; Yun et al., 2021). Among the approaches in this domain, Graph Neural Networks (GNNs) have demonstrated stronger LP performance than alternatives such as node embedding methods based on positional encoding (Perozzi et al., 2014; Galkin et al., 2021). Modern prevalent GNNs such as GCN (Kipf & Welling, 2017) and GAT (Veličković et al., 2018) follow a neighborhood aggregation scheme in which each node's representation is updated by aggregating the representations of that node and its neighbors. In this paper, we use the term GNNs to refer to such aggregation-based GNNs.

Figure 1: An illustration of the difference between link heuristics and aggregation-based GNNs. Link heuristics are defined by counting the involved neighbors or paths between two nodes; GNNs learn node-wise representations by aggregating neighborhood information.
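To make the pair-specific nature of link heuristics concrete, the two heuristics named above can be sketched in a few lines. The toy graph, the damping factor β, and the truncation length below are our own illustrative choices, not values from any particular LP benchmark:

```python
import numpy as np

def common_neighbors(adj_sets, u, v):
    """Common Neighbors heuristic: number of nodes adjacent to both u and v."""
    return len(adj_sets[u] & adj_sets[v])

def katz_index(A, beta=0.05, max_len=10):
    """Truncated Katz index: sum over path lengths l of beta^l * (A^l)[u, v].
    Counts all paths between node pairs, damped by length."""
    S = np.zeros_like(A, dtype=float)
    Al = np.eye(A.shape[0])
    for _ in range(max_len):
        Al = Al @ A
        S += (beta ** 1) * Al * (beta ** 0)  # accumulate beta^l * A^l
        beta_power = beta                    # (kept simple; see loop below)
    # Clearer equivalent:
    S = np.zeros_like(A, dtype=float)
    Al = np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        Al = Al @ A
        S += (beta ** l) * Al
    return S

# Toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
adj = {i: set() for i in range(n)}
for a, b in edges:
    A[a, b] = A[b, a] = 1
    adj[a].add(b); adj[b].add(a)

print(common_neighbors(adj, 0, 1))  # node 2 is shared -> 1
S = katz_index(A)
print(S[0, 1] > S[0, 3])  # the directly linked pair scores higher -> True
```

Both scores are functions of the specific node pair (u, v): they count shared structure between the two endpoints rather than describing either endpoint alone.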
In the literature, several LP-specific methods based on GNNs have been proposed, such as SEAL (Zhang & Chen, 2018), GraIL (Teru et al., 2020), and NBFNet (Zhu et al., 2021). In particular, SEAL and its follow-up works (Zhang & Chen, 2018; Li et al., 2020; Teru et al., 2020; Yin et al., 2022) predict the link likelihood between two nodes by classifying the subgraph extracted specifically for this target pair of nodes (we refer to such a subgraph as a pair-specific subgraph). SEAL-type methods also label every node in the pair-specific subgraph according to the node's relationship to the target pair of nodes (Zhang et al., 2021). The pair-specific subgraphs, together with the node labeling, help SEAL-type methods learn better link representations than other methods, attaining state-of-the-art LP performance.

Despite all the successes achieved by existing GNN-based LP methods, we are still curious about one question: can these methods effectively learn the information equivalent to link heuristics (i.e., heuristic information) for LP? Our analysis and experiments suggest a negative answer.

Contributions. In this work, we show that traditional link heuristics and GNNs capture different information. As illustrated in Figure 1, link heuristics are typically defined based on the number of involved nodes or paths between a pair of nodes; they are pair-specific. By comparison, a GNN updates the representation of a node by aggregating the representations of this node and its neighbors, where none of the neighbors is treated specially; the learned representations are node-level. GNNs pay more attention to what information every node has, while link heuristics focus more on how many shared neighbors or paths exist between a pair of nodes.
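The node-wise aggregation described above can be sketched as a single mean-aggregation layer, a simplified stand-in for GCN/GAT-style updates (the weight shapes, mean aggregator, and ReLU are illustrative assumptions, not any specific published architecture):

```python
import numpy as np

def aggregation_layer(A, H, W):
    """One mean-aggregation GNN layer: each node's new representation is a
    transform of the average over itself and its neighbors. The update is
    purely node-wise -- no node attends specially to any other node."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # neighborhood sizes (with self)
    H_new = (A_hat @ H) / deg               # mean over the neighborhood
    return np.maximum(H_new @ W, 0)         # linear transform + ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 8))  # initial node features
W = rng.normal(size=(8, 8))
Z = aggregation_layer(A, H, W)
print(Z.shape)  # (4, 8): one representation per node

# A classical link score then just combines two node-wise vectors,
# e.g. an inner product -- nothing in Z[0] or Z[1] was computed
# with the pair (0, 1) in mind.
score_01 = Z[0] @ Z[1]
```

The key point the sketch illustrates: `Z[0]` would be identical whether the candidate link is (0, 1) or (0, 3), which is exactly why such node-wise representations struggle to encode pair-specific counts like Common Neighbors.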
This difference between link heuristics and GNNs means that classical GNN-based LP methods, which simply combine the node-wise representations of the two nodes of a candidate link into a link representation, can hardly learn heuristic information. Moreover, we find that SEAL-type methods also cannot effectively capture heuristic information. Briefly, despite the help of the pair-specific subgraph and node labeling techniques, SEAL-type methods still use a GNN (e.g., DGCNN (Zhang et al., 2018) in SEAL (Zhang & Chen, 2018), R-GCN (Schlichtkrull et al., 2018) in GraIL (Teru et al., 2020)) to perform graph representation learning, and the GNN inherently lacks the capacity to learn heuristic information. Meanwhile, the labeling features of nodes are mixed together during the neighborhood aggregation process of the GNN, so the heuristic information embedded in the labeling features cannot be effectively preserved in the learned node representations.

A simple way to verify our propositions is to study the LP performance of a model that separates the heuristic information learning from the GNN-based representation learning. Therefore, we present a lightweight LP framework, ComHG, which Combines link Heuristics and the GNN. In ComHG, various link heuristics are encoded into trainable embeddings and combined with the representations produced by a GNN, followed by a predictor that takes the combinations as input to make the final prediction. We conduct experiments on four OGB LP benchmark datasets (Hu et al., 2020). ComHG significantly outperforms all previous methods on all datasets. The strong results confirm that link heuristics and the GNN capture different yet effective information for LP, and suggest that combining both can boost LP performance. The results also empirically verify our analysis of the limitations of existing GNN-based LP methods in learning heuristic information.
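To make the separation concrete, a ComHG-style combination might look like the following sketch. The Hadamard product of node representations, the bucketed embedding lookup for the heuristic value, and the two-layer MLP head are our own illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def link_features(z_u, z_v, heuristic_value, emb_table):
    """Combine GNN node representations with a trainable heuristic embedding:
    Hadamard product of the two node vectors, concatenated with an embedding
    looked up by the (bucketed) heuristic value, e.g. a common-neighbor count."""
    bucket = min(heuristic_value, len(emb_table) - 1)  # clip large counts
    return np.concatenate([z_u * z_v, emb_table[bucket]])

def mlp_predict(x, W1, W2):
    """Two-layer MLP scoring head returning a link probability."""
    h = np.maximum(x @ W1, 0)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))

d, d_emb, hidden = 8, 4, 16
emb_table = rng.normal(size=(32, d_emb))  # one trainable row per heuristic bucket
W1 = rng.normal(size=(d + d_emb, hidden))
W2 = rng.normal(size=(hidden,))

# z_u, z_v stand in for representations produced by any GNN.
z_u, z_v = rng.normal(size=d), rng.normal(size=d)
x = link_features(z_u, z_v, heuristic_value=3, emb_table=emb_table)
p = mlp_predict(x, W1, W2)
print(x.shape)  # (12,): GNN part (8) + heuristic embedding (4)
```

Because the heuristic value enters through its own embedding rather than through the aggregation process, the pair-specific signal reaches the predictor without being diluted by node-wise message passing.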
Furthermore, our experimental study shows that link heuristics could contribute more to LP performance on sparse graphs while GNN-based representation learning becomes dominant on dense graphs.

2. PRELIMINARIES

Without loss of generality, we demonstrate our work on homogeneous graphs. Let G = (V, E) denote a graph G with N nodes, where V is the set of nodes, |V| = N, and E is the set of edges between the nodes in V. The set of nodes directly connected to a node v ∈ V is the first-order neighborhood of v, denoted Γ_v. The degree of a node v is defined as the number of edges connected to this node, and the degree of a graph as the average degree over all its nodes.
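These definitions can be computed directly from an edge list; the small edge list below is a made-up example:

```python
from collections import defaultdict

def degrees(edges):
    """Node degree = number of edges incident to the node."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

# Toy undirected graph: triangle 0-1-2 plus a pendant node 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
deg = degrees(edges)
avg_degree = sum(deg.values()) / len(deg)  # graph degree = average node degree
print(deg[2])       # 3
print(avg_degree)   # 2*|E| / N = 8 / 4 = 2.0
```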



¹ Our code is available at https://github.com/astroming/ComHG

