DOUBLE WINS: BOOSTING ACCURACY AND EFFICIENCY OF GRAPH NEURAL NETWORKS BY RELIABLE KNOWLEDGE DISTILLATION

Abstract

The recent breakthrough achieved by graph neural networks (GNNs) with limited labeled data accelerates the pace of deploying GNNs in real-world applications. While several efforts have been made to scale GNN training to large graphs, GNNs still face a scalability challenge at inference time, due to the graph dependency incurred by the message passing mechanism, which hinders their deployment in resource-constrained applications. A recent study (Zhang et al., 2022b) revealed that GNNs can be compressed into inference-friendly multi-layer perceptrons (MLPs) by training the MLPs on the teacher's soft labels for both labeled and unlabeled nodes. However, blindly leveraging the soft labels of all unlabeled nodes may be suboptimal, since the teacher model inevitably makes wrong predictions. This intriguing observation motivates us to ask: Is it possible to train a stronger MLP student by making better use of the unlabeled data? This paper studies cross-model knowledge distillation from a GNN teacher to an MLP student in a semi-supervised setting, showing its strong promise for reaching a "sweet spot" that co-optimizes model accuracy and efficiency. Our proposed solution, dubbed Reliable Knowledge Distillation for MLP optimization (RKD-MLP), is the first noise-aware knowledge distillation framework for GNN distillation. Its core idea is to use a meta-policy to filter out unreliable soft labels. To train the meta-policy, we design a reward-driven objective based on a meta-set and adopt policy gradient to optimize the expected reward. We then apply the meta-policy to the unlabeled nodes and select the most reliable soft labels for distillation. Extensive experiments across various GNN backbones, on 7 small graphs and 2 large-scale datasets from the challenging Open Graph Benchmark, demonstrate the superiority of our proposal. Moreover, our RKD-MLP model shows good robustness w.r.t. noise in graph topology and node features.
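To make the filtering idea concrete, the following is a minimal conceptual sketch of selecting reliable soft labels for distillation. The `meta_policy` here is a hypothetical stand-in (a simple confidence threshold); the paper instead learns the policy with policy gradient against a meta-set reward, and the soft labels are illustrative random values rather than real teacher outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teacher soft labels for 6 unlabeled nodes over 3 classes
# (each row is a probability distribution produced by the GNN teacher).
soft_labels = rng.dirichlet(np.ones(3), size=6)

def meta_policy(probs: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Stand-in for the learned meta-policy: keep a node's soft label
    only if the teacher is sufficiently confident on it."""
    return probs.max(axis=1) >= threshold

keep_mask = meta_policy(soft_labels)
reliable_soft_labels = soft_labels[keep_mask]

# Only the selected soft labels would enter the MLP student's
# distillation loss; the rest are treated as unreliable and dropped.
print(f"kept {keep_mask.sum()} of {len(soft_labels)} unlabeled nodes")
```

In RKD-MLP the threshold rule above is replaced by a trained policy network whose expected reward on the meta-set is maximized via policy gradient; the filtering interface stays the same.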

1. INTRODUCTION

Graph neural networks (GNNs), as the de facto neural architecture in graph representation learning (Zhou et al., 2020; Hamilton et al., 2017b), have achieved state-of-the-art results across a variety of applications, such as node classification (Kipf & Welling, 2016; Liu et al., 2020), graph classification (Ying et al., 2018; Gao & Ji, 2019), link prediction (Zhang & Chen, 2018; Zhang et al., 2021), and anomaly detection (Deng & Zhang, 2021; Chaudhary et al., 2019). Different from plain network embedding methods (Perozzi et al., 2014; Grover & Leskovec, 2016), GNNs rely on a convolution-like message propagation mechanism (Gilmer et al., 2017) to recursively aggregate messages from neighboring nodes, which is believed to improve model expressiveness and representation flexibility (Xu et al., 2018). Despite these advances, GNNs still face several challenges during inference, especially when going deeper (Chen et al., 2020; 2021) and when applied to large-scale graphs (Chiang et al., 2019; Zeng et al., 2019). The major reason (Abadal et al., 2021) is that message propagation among multi-hop neighbors always incurs heavy data dependency, causing substantial computational
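The data-dependency issue can be illustrated with a toy two-layer propagation. This is a generic sketch of message passing on a small graph (identity weights, no real model): predicting a single node with the GNN requires features from its multi-hop neighborhood, whereas an MLP student needs only that node's own features.

```python
import numpy as np

# Toy graph: row-normalized adjacency (with self-loops) over 4 nodes.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)

X = np.arange(8, dtype=float).reshape(4, 2)  # 2-dim node features
W1 = np.eye(2)                               # placeholder layer weights
W2 = np.eye(2)

# GNN-style inference: each layer aggregates neighbor features, so the
# prediction for node 0 depends on its entire 2-hop neighborhood.
H = np.maximum(A_hat @ X @ W1, 0)            # layer 1: propagate + ReLU
gnn_out = A_hat @ H @ W2                     # layer 2: propagate again

# MLP-style inference: nodes are processed independently; node 0's
# prediction uses only X[0], with no graph access at inference time.
mlp_out_node0 = np.maximum(X[0] @ W1, 0) @ W2
```

With L layers, GNN inference for one node touches its full L-hop neighborhood (which can cover much of a dense graph), while the distilled MLP's per-node cost is constant in the graph size.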

