LINKLESS LINK PREDICTION VIA RELATIONAL DISTILLATION

Abstract

Graph Neural Networks (GNNs) have been widely used on graph data and have shown exceptional performance in the task of link prediction. Despite their effectiveness, GNNs often suffer from high latency due to non-trivial neighborhood data dependency in practical deployments. To address this issue, researchers have proposed methods based on knowledge distillation (KD) to transfer the knowledge from teacher GNNs to student MLPs, which are known to be efficient even with industrial-scale data, and have shown promising results on node classification. Nonetheless, using KD to accelerate link prediction remains unexplored. In this work, we begin by exploring two direct analogs of traditional KD for link prediction, i.e., predicted logit-based matching and node representation-based matching. Upon observing that these direct KD analogs do not perform well for link prediction, we propose a relational KD framework, Linkless Link Prediction (LLP). Unlike simple KD methods that match independent link logits or node representations, LLP distills relational knowledge that is centered around each (anchor) node to the student MLP. Specifically, we propose two matching strategies that complement each other: rank-based matching and distribution-based matching. Extensive experiments demonstrate that LLP boosts the link prediction performance of MLPs by significant margins, and even outperforms the teacher GNNs on 6 out of 9 benchmarks. LLP also achieves a 776.37× speedup in link prediction inference compared to GNNs on the large-scale OGB-Citation2 dataset.

1. INTRODUCTION

Graph neural networks (GNNs) have been widely used for machine learning on graph-structured data (Kipf & Welling, 2016a; Hamilton et al., 2017). They have shown strong performance in various applications, such as node classification (Veličković et al., 2017; Chen et al., 2020), graph classification (Zhang et al., 2018; Ying et al., 2018b), graph generation (You et al., 2018; Shiao & Papalexakis, 2021), and link prediction (Zhang & Chen, 2018). Among these, link prediction, which aims to predict the likelihood of any two nodes forming a link, is an especially critical problem in the graph machine learning community. It has broad practical applications such as knowledge graph completion (Schlichtkrull et al., 2018; Nathani et al., 2019; Vashishth et al., 2020), friend recommendation on social platforms (Sankar et al., 2021; Tang et al., 2022; Fan et al., 2022), and item recommendation for users on service and commerce platforms (Koren et al., 2009; Ying et al., 2018a; He et al., 2020). With the rising popularity of GNNs, state-of-the-art link prediction methods adopt encoder-decoder style models, where encoders are GNNs and decoders are applied directly on pairs of node representations learned by the GNNs (Kipf & Welling, 2016b; Zhang & Chen, 2018; Cai & Ji, 2020; Zhao et al., 2022). The success of GNNs is typically attributed to the explicit use of contextual information from nodes' surrounding neighborhoods (Zhang et al., 2020e). However, this induces a heavy reliance on neighborhood fetching and aggregation schemes, which can lead to high time cost in training and inference compared to tabular models, such as multi-layer perceptrons (MLPs), especially owing to neighbor explosion (Zhang et al., 2020b; Jia et al., 2020; Zhang et al., 2021b; Zeng et al., 2019).
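As a back-of-envelope illustration of neighbor explosion, the receptive field an L-layer GNN must fetch per target node grows roughly geometrically with the average degree, whereas an MLP touches only the node's own feature vector. The sketch below is a simple counting argument under a uniform-degree assumption (the helper name is illustrative, not from the paper):

```python
def fetched_nodes(avg_degree: int, num_layers: int) -> int:
    """Rough count of neighbors an L-layer GNN must fetch per target
    node, assuming every node has the same degree and neighborhoods
    do not overlap: sum_{l=1}^{L} d^l."""
    return sum(avg_degree ** l for l in range(1, num_layers + 1))

# With average degree 10, a 3-layer GNN fetches on the order of
# 10 + 100 + 1000 = 1110 nodes per inference; an MLP fetches 1.
print(fetched_nodes(10, 3))
```

This overcounts when neighborhoods overlap, but it conveys why GNN inference latency grows sharply with depth while MLP inference stays constant.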
Compared to GNNs, MLPs do not require any graph topology information, making them more suitable for new or isolated nodes (e.g., for cold-start settings), but usually resulting in worse general task performance as encoders, which we also empirically validate in Section 4. Nonetheless, having no graph dependency makes the training and inference time for MLPs negligible compared with those of GNNs. Thus, in industrial-scale applications where fast real-time inference is required, MLPs are still a leading option (Zhang et al., 2021b; Covington et al., 2016; Gholami et al., 2021). Given these speed-performance tradeoffs, several recent works propose to transfer the learned knowledge from GNNs to MLPs using knowledge distillation (KD) techniques (Hinton et al., 2015; Zhang et al., 2021b; Zheng et al., 2021; Hu et al., 2021), to take advantage of both GNNs' performance benefits and MLPs' speed benefits. Specifically, in this way, the student MLP can potentially obtain the graph-context knowledge transferred from the GNN teacher via KD to not only perform better in practice, but also enjoy model latency benefits compared to GNNs, e.g., in production inference settings. However, these works focus on node- or graph-level tasks. Given that KD for link prediction has not been explored, and the massive scope of recommendation-system contexts posed as link prediction problems, our work aims to bridge a critical gap. Specifically, we ask: Can we effectively distill link prediction-relevant knowledge from GNNs to MLPs? In this work, we focus on exploring, building upon, and proposing cross-model (GNN to MLP) distillation techniques for link prediction settings. We start with exploring two direct KD methods of aligning student and teacher: (i) logit-based matching of predicted link existence probabilities (Hinton et al., 2015), and (ii) representation-based matching of the generated latent node representations (Gou et al., 2021).
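Concretely, these two direct analogs reduce to simple pointwise losses. The pure-Python sketch below (illustrative function names; a real implementation would use a tensor library) shows logit-based matching as binary cross-entropy against the teacher's soft link probabilities, and representation-based matching as mean-squared error between teacher and student node embeddings:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit_matching_loss(teacher_logits, student_logits):
    """Direct analog (i): treat the teacher's predicted link-existence
    probabilities as soft targets and match them with binary
    cross-entropy, one term per candidate link."""
    loss = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p_t, p_s = sigmoid(t), sigmoid(s)
        loss += -(p_t * math.log(p_s) + (1.0 - p_t) * math.log(1.0 - p_s))
    return loss / len(teacher_logits)

def representation_matching_loss(teacher_reps, student_reps):
    """Direct analog (ii): match the latent node representations of
    teacher and student with mean-squared error, one term per node."""
    loss = 0.0
    for h_t, h_s in zip(teacher_reps, student_reps):
        loss += sum((a - b) ** 2 for a, b in zip(h_t, h_s)) / len(h_t)
    return loss / len(teacher_reps)
```

Both losses treat each link (or node) independently, which, as discussed next, is exactly why they fail to transfer the relational structure that link prediction depends on.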
However, empirically we observe that neither logit-based matching nor representation-based matching is powerful enough to distill sufficient knowledge for the student model to perform well on link prediction tasks. We hypothesize that the reason these two KD approaches do not perform well is that link prediction, unlike node classification, heavily relies on relational graph topological information (Martínez et al., 2016; Zhang & Chen, 2018; Yun et al., 2021; Zhao et al., 2022), which is not well captured by direct methods. To address this issue, we propose a relational KD framework, namely LLP: our key intuition is that instead of focusing on matching individual node pairs or node representations, we focus on matching the relationships between each (anchor) node and other (context) nodes in the graph. Given the relational knowledge centered at the anchor node, i.e., the teacher model's predicted link existence probabilities between the anchor node and each context node, LLP distills it to the student model via two matching methods: (i) rank-based matching, and (ii) distribution-based matching. More specifically, rank-based matching equips the student model with a ranking loss over the relative ranks of all context nodes w.r.t. the anchor node, preserving crucial ranking information that is directly relevant to downstream link prediction use-cases, e.g., user-contextual friend recommendation (Sankar et al., 2021; Tang et al., 2022) or item recommendation (Ying et al., 2018a; He et al., 2020). On the other hand, distribution-based matching equips the student model with the link probability distribution over context nodes, conditioned on the anchor node. Importantly, distribution-based matching is complementary to rank-based matching, as it provides auxiliary information about the relative values of the probabilities and the magnitudes of their differences.
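To make the two strategies concrete, the sketch below gives minimal per-anchor losses: a pairwise margin ranking loss that pushes the student to preserve the teacher's ordering of context nodes, and a KL divergence between temperature-softened teacher and student distributions over the same context nodes. The function names, margin, and temperature here are illustrative choices for exposition, not necessarily the paper's exact formulation:

```python
import math

def softmax(scores, temperature=1.0):
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def rank_matching_loss(teacher_scores, student_scores, margin=0.1):
    """Rank-based matching for one anchor node: for every pair of
    context nodes (i, j) that the teacher ranks i above j, penalize
    the student unless it also scores i above j by at least `margin`."""
    loss, n_pairs = 0.0, 0
    for i in range(len(teacher_scores)):
        for j in range(len(teacher_scores)):
            if teacher_scores[i] > teacher_scores[j]:
                loss += max(0.0, margin - (student_scores[i] - student_scores[j]))
                n_pairs += 1
    return loss / max(n_pairs, 1)

def distribution_matching_loss(teacher_scores, student_scores, temperature=2.0):
    """Distribution-based matching for one anchor node: KL divergence
    between the teacher's and student's softmax distributions over
    the anchor's context nodes, softened by a temperature."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))
```

Note the complementarity visible even in this sketch: the ranking loss is invariant to the absolute scores (only orderings matter), while the KL term is sensitive to how far apart the probabilities are.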
To comprehensively evaluate the effectiveness of our proposed LLP, we conduct experiments on 9 public benchmarks. In addition to the standard transductive setting for graph tasks, we also design a more realistic setting that mimics online use-cases for link prediction, which we call the production setting. LLP consistently outperforms stand-alone MLPs, by 17.13 points on average under the transductive setting and 12.01 points under the production setting across all datasets, and matches or outperforms teacher GNNs on 6/9 datasets under the transductive setting. Promisingly, for cold-start nodes, LLP outperforms teacher GNNs and stand-alone MLPs by 25.29 and 9.42 Hits@20 points on average, respectively. Finally, LLP infers drastically faster than GNNs, e.g., 776.37× faster on the large-scale OGB-Citation2 dataset.

2. RELATED WORK AND PRELIMINARIES

We briefly discuss related work and preliminaries relevant to contextualizing our methods and contributions. Due to space limits, we defer additional related work to Appendix A.

Notation. Let G = (V, E) denote an undirected graph, where V denotes the set of N nodes and E ⊆ V × V denotes the set of observed links. A ∈ {0, 1}^{N×N} denotes the adjacency matrix, where A_{i,j} = 1 if there exists an edge e_{i,j} ∈ E and 0 otherwise. Let the matrix of node features be denoted by X ∈ R^{N×F}, where each row x_i is the F-dimensional raw feature vector of node i. Given both E and A have

