DOUBLE WINS: BOOSTING ACCURACY AND EFFICIENCY OF GRAPH NEURAL NETWORKS BY RELIABLE KNOWLEDGE DISTILLATION

Abstract

The recent breakthrough achieved by graph neural networks (GNNs) with few labeled data accelerates the pace of deploying GNNs in real-world applications. While several efforts have been made to scale GNN training to large-scale graphs, GNNs still suffer from a scalability challenge at model inference, due to the graph dependency incurred by the message passing mechanism, thereby hindering their deployment in resource-constrained applications. A recent study (Zhang et al., 2022b) revealed that GNNs can be compressed into inference-friendly multi-layer perceptrons (MLPs) by training MLPs on the teacher's soft labels for both labeled and unlabeled nodes. However, blindly leveraging the soft labels of all unlabeled nodes may be suboptimal, since the teacher model will inevitably make wrong predictions. This intriguing observation motivates us to ask: Is it possible to train a stronger MLP student by making better use of the unlabeled data? This paper studies cross-model knowledge distillation, from a GNN teacher to an MLP student, in a semi-supervised setting, showing its strong promise in achieving a "sweet spot" that co-optimizes model accuracy and efficiency. Our proposed solution, dubbed Reliable Knowledge Distillation for MLP optimization (RKD-MLP), is the first noise-aware knowledge distillation framework for GNN distillation. Its core idea is to use a meta-policy to filter out unreliable soft labels. To train the meta-policy, we design a reward-driven objective based on a meta-set and adopt policy gradient to optimize the expected reward. We then apply the meta-policy to the unlabeled nodes and select the most reliable soft labels for distillation. Extensive experiments across various GNN backbones, on 7 small graphs and 2 large-scale datasets from the challenging Open Graph Benchmark, demonstrate the superiority of our proposal. Moreover, our RKD-MLP model shows good robustness w.r.t. graph topology and node feature noises.

1. INTRODUCTION

Graph neural networks (GNNs), as the de facto neural architecture in graph representation learning (Zhou et al., 2020; Hamilton et al., 2017b), have achieved state-of-the-art results across a variety of applications, such as node classification (Kipf & Welling, 2016; Liu et al., 2020), graph classification (Ying et al., 2018; Gao & Ji, 2019), link prediction (Zhang & Chen, 2018; Zhang et al., 2021), and anomaly detection (Deng & Zhang, 2021; Chaudhary et al., 2019). Different from plain network embedding methods (Perozzi et al., 2014; Grover & Leskovec, 2016), GNNs rely on a convolution-like message propagation mechanism (Gilmer et al., 2017) to recursively aggregate messages from neighboring nodes, which is believed to improve model expressiveness and representation flexibility (Xu et al., 2018).

Despite these advances, GNNs still face several challenges at inference time, especially when going deeper (Chen et al., 2020; 2021) and when applied to large-scale graphs (Chiang et al., 2019; Zeng et al., 2019). The major reason (Abadal et al., 2021) is that message propagation among multi-hop neighbors incurs heavy data dependency, causing substantial computational costs and memory footprints. Some preliminary efforts attempt to fill the gap from different aspects. For example, (Zhou et al., 2021) proposes to accelerate inference via model pruning, and (Tailor et al., 2020) suggests reducing computational costs directly by weight quantization. Although these methods can speed up GNNs to some extent, the improvements are rather limited, since the data dependency issue remains unresolved. Recently, GLNN (Zhang et al., 2022b) tries to tackle this issue by compressing GNNs to inference-friendly multi-layer perceptrons (MLPs) via knowledge distillation (KD).
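The data dependency issue can be made concrete with a minimal sketch: a single message-passing layer must aggregate neighbor features through the (normalized) adjacency matrix before the linear transform, so inferring even one node's output requires fetching its multi-hop neighborhood, whereas an MLP layer touches only each node's own features. The toy graph, weights, and function names below are illustrative, not taken from any specific GNN implementation.

```python
import numpy as np

def gnn_layer(A_hat, X, W):
    """One message-passing layer: every node first aggregates its
    neighbors' features (A_hat @ X) before the linear transform, so
    single-node inference still needs the node's neighborhood."""
    return np.maximum(A_hat @ X @ W, 0.0)  # ReLU activation

def mlp_layer(X, W):
    """An MLP layer depends only on each node's own features:
    no graph dependency, so per-node inference cost is constant."""
    return np.maximum(X @ W, 0.0)

# Toy 3-node path graph with self-loops, row-normalized adjacency.
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)
X = np.eye(3)                                   # one-hot node features
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy weights

H_gnn = gnn_layer(A_hat, X, W)  # row i mixes node i's neighbors
H_mlp = mlp_layer(X, W)          # row i depends on node i alone
```

Stacking k such GNN layers expands the dependency to the k-hop neighborhood, which is exactly why distilling into an MLP removes the inference bottleneck.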
Similar to standard KD protocols (Hinton et al., 2015), GLNN trains the MLP student using the soft labels from the GNN teacher as guidance, and then deploys the distilled MLP student for latency-constrained inference. However, directly leveraging soft labels from the GNN teacher is suboptimal when labeled nodes are scarce, a common scenario in graph-structured data (Kipf & Welling, 2016; Garcia & Bruna, 2017; Feng et al., 2020). This is mainly because a large portion of unlabeled nodes will be incorrectly predicted by GNNs, owing to their limited generalization ability. For instance, many GNN variants (Kipf & Welling, 2016; Veličković et al., 2017; Klicpera et al., 2018) can achieve 100% accuracy on the training set, yet their test accuracy is merely around 80% on Planetoid benchmarks. As a result, the soft labels of those wrongly predicted unlabeled nodes introduce noise into the optimization landscape of the MLP student, leading to an obvious performance gap w.r.t. the GNN teacher (Zhang et al., 2022b).

To avoid the influence of mislabeled nodes, the common practice is to analyze their logit distributions from the teacher model (Kwon et al., 2020; Zhu et al., 2021a; Zhang et al., 2022a). For example, Zhang et al. (Zhang et al., 2022a) propose to assign larger weights to samples whose teacher predictions are close to one-hot labels. Zhu et al. (Zhu et al., 2021a) suggest filtering out data points whose teacher predictions mismatch the ground truth labels. Nevertheless, these methods cannot be applied to real-world graphs where node labels are expensive to access. Recently, Kwon et al. (Kwon et al., 2020) suggest discriminating samples based on entropy values, assuming that teacher predictions with lower entropy are more reliable. However, we found that entropy values are ineffective at distinguishing correct from incorrect GNN predictions, since their entropy distributions largely overlap, as we show in Figure 1 (right panel).
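The distillation objective referenced above can be sketched in a few lines: the student minimizes the KL divergence between its temperature-softened predictions and the teacher's, the usual Hinton-style soft-label term. This is a minimal NumPy sketch under those standard assumptions; the temperature value and function names are illustrative rather than GLNN's exact configuration.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation term: KL(teacher || student) on
    temperature-softened distributions, averaged over nodes."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))
```

The loss is zero exactly when the student reproduces the teacher's soft labels, which is also why wrong teacher predictions are faithfully copied: the objective has no notion of label reliability.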
Therefore, it remains an open challenge to effectively distill semi-supervised GNN models into light-weight MLP students.

Present Work. Motivated by this, we propose a novel KD framework, RKD-MLP, to boost the MLP student via noise-aware distillation. It is noteworthy that while we focus on the MLP student for efficiency purposes, our solution readily extends to other student types, such as GNNs (see Appendix F for more discussion). Specifically, RKD-MLP uses a meta-policy to filter out unreliable soft labels by deciding, given each node's representation, whether that node should be used in distillation. The student then distills only the soft labels of the nodes kept by the meta-policy. To train the meta-policy, we design a reward-driven objective based on a meta-set, where the meta-policy is rewarded for making correct filtering decisions. The meta-policy is optimized with policy gradient to maximize the expected reward and then applied to the unlabeled nodes. We iteratively update the meta-policy and the student model, achieving a win-win scenario: it substantially improves the performance of the vanilla MLP student by teaching it with reliable guidance, while maintaining the inference efficiency of MLPs without increasing the model size. Our main contributions are as follows:

• We provide the first comprehensive investigation of unlabeled nodes in GNN distillation, demonstrating both their value in boosting the MLP student by providing effective pseudo labels, and their harm in degrading model performance by adding incorrect soft labels.

• Motivated by our analysis, we propose to use a meta-policy to filter out unreliable nodes whose soft labels are wrongly predicted by the GNN teacher, and introduce a bi-level optimization strategy to jointly train the meta-policy and the student model.

• Extensive experiments over a variety of GNN backbones on 7 small datasets and 2 challenging OGB benchmarks demonstrate the superiority of our proposal.
Notably, our RKD-MLP outperforms the vanilla KD solution by up to 5.82% in accuracy, while its inference is at least 100 times faster than that of conventional GNNs.
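The reward-driven training loop described above can be illustrated with a REINFORCE-style sketch: a policy scores each node's representation, samples a keep/drop action, and receives a positive reward on the meta-set when the action agrees with whether the teacher's prediction is actually correct. The logistic policy, the synthetic meta-set, and all names below are our own illustrative assumptions; the paper's meta-policy need not take this exact form.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy_gradient_step(w, H_meta, teacher_correct, lr=0.1):
    """One REINFORCE update for a logistic keep/drop meta-policy.

    H_meta          : meta-set node representations, shape (n, d)
    teacher_correct : 1.0 if the teacher is right on that node, else 0.0
    Reward is +1 when the sampled action matches teacher_correct, -1
    otherwise, so the policy learns to keep reliable soft labels."""
    p_keep = sigmoid(H_meta @ w)                              # P(keep)
    action = (rng.random(len(p_keep)) < p_keep).astype(float)  # sample
    reward = np.where(action == teacher_correct, 1.0, -1.0)
    # grad of log pi(a|h) for a Bernoulli policy is (a - p) * h
    grad = ((action - p_keep) * reward) @ H_meta / len(p_keep)
    return w + lr * grad

# Toy meta-set: teacher reliability is linearly separable in the
# representation (teacher is right iff feature 0 is positive).
H = rng.normal(size=(200, 4))
correct = (H[:, 0] > 0).astype(float)
w = np.zeros(4)
for _ in range(300):
    w = policy_gradient_step(w, H, correct)

keep = sigmoid(H @ w) > 0.5          # final filtering decisions
accuracy = float(np.mean(keep == (correct == 1.0)))
```

After training, applying the learned policy to unlabeled nodes keeps mostly reliable soft labels, which is the filtered set the student then distills from.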

