LEARNING MLPS ON GRAPHS: A UNIFIED VIEW OF EFFECTIVENESS, ROBUSTNESS, AND EFFICIENCY

Abstract

While Graph Neural Networks (GNNs) have demonstrated their efficacy in dealing with non-Euclidean structural data, they are difficult to deploy in real applications due to the scalability constraint imposed by multi-hop data dependency. Existing methods attempt to address this scalability issue by training student multi-layer perceptrons (MLPs) exclusively on node content features, using labels derived from teacher GNNs. However, the trained MLPs are neither effective nor robust. In this paper, we ascribe the lack of effectiveness and robustness to three significant challenges: 1) the misalignment between the content feature and label spaces, 2) the strict hard matching to the teacher's output, and 3) the sensitivity to node feature noises. To address these challenges, we propose NOSMOG, a novel method to learn NOise-robust Structure-aware MLPs On Graphs with remarkable effectiveness, robustness, and efficiency. Specifically, we first address the misalignment by complementing node content with position features that capture the graph structural information. We then design an innovative representational similarity distillation strategy to inject soft node similarities into MLPs. Finally, we introduce adversarial feature augmentation to ensure stable learning against feature noises. Extensive experiments and theoretical analyses demonstrate the superiority of NOSMOG by comparing it to GNNs and the state-of-the-art method in both transductive and inductive settings across seven datasets. Code is publicly available.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown exceptional effectiveness in handling non-Euclidean structural data and have achieved state-of-the-art performance across a broad range of graph mining tasks (Hamilton et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018). The success of modern GNNs relies on the message-passing architecture, which aggregates and learns node representations based on their (multi-hop) neighborhoods (Wu et al., 2020; Zhou et al., 2020). However, message passing is time-consuming and computation-intensive, making it challenging to apply GNNs to large-scale real-world applications, which are often constrained by latency and require the deployed model to perform fast inference (Zhang et al., 2020; 2022a). To meet the latency requirement, multi-layer perceptrons (MLPs) remain the first choice (Zhang et al., 2022b), despite the fact that they perform poorly on non-Euclidean data and focus exclusively on node content information. Inspired by the performance advantage of GNNs and the latency advantage of MLPs, researchers have explored combining the two to enjoy the advantages of both (Zhang et al., 2022b; Zheng et al., 2022; Chen et al., 2021). One effective approach is knowledge distillation (KD) (Hinton et al., 2015), in which the learned knowledge is transferred from GNNs to MLPs through soft labels (Phuong & Lampert, 2019). Only the MLPs are then deployed for inference, with node content features as input. In this way, MLPs can perform well by mimicking the output of GNNs without requiring explicit message passing, and thus obtain fast inference speed (Hu et al., 2021).
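The GNN-to-MLP distillation described above is commonly implemented as a weighted combination of hard-label cross-entropy on the labeled nodes and a temperature-scaled divergence to the teacher's soft labels on all nodes. A minimal NumPy sketch follows; the function name, the balancing weight `lam`, and the temperature `t` are illustrative conventions, not this paper's exact formulation:

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gnn_to_mlp_loss(mlp_logits, gnn_logits, labels, labeled_mask, t=2.0, lam=0.5):
    # Hard-label cross-entropy, computed on the labeled training nodes only.
    p = softmax(mlp_logits[labeled_mask])
    ce = -np.mean(np.log(p[np.arange(len(p)), labels[labeled_mask]] + 1e-12))
    # Soft-label term: KL divergence from the student's temperature-scaled
    # distribution to the teacher GNN's, computed on all nodes.
    q_t = softmax(gnn_logits, t)
    p_t = softmax(mlp_logits, t)
    kl = np.mean(np.sum(q_t * (np.log(q_t + 1e-12) - np.log(p_t + 1e-12)),
                        axis=1)) * t * t
    return lam * ce + (1.0 - lam) * kl
```

The `t * t` factor is the standard rescaling that keeps soft-label gradients comparable across temperatures; at inference time the teacher is discarded and only the MLP forward pass remains.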
Nevertheless, existing methods are neither effective nor robust, with three major drawbacks: (1) MLPs cannot fully align the input content features to the label space, especially when node labels are correlated with the graph structure; (2) MLPs rely on the teacher's output to learn a strict hard matching, jeopardizing the soft structural representational similarity among nodes; and (3) MLPs are sensitive to node feature noises, which can easily destroy performance. We thus ask: Can we learn MLPs that are graph-structure-aware in both the feature and representation spaces, insensitive to node feature noises, and achieve superior performance as well as fast inference speed? To address these issues, we propose to learn NOise-robust Structure-aware MLPs On Graphs (NOSMOG), a novel method with remarkable performance, outstanding robustness, and exceptional inference speed. Specifically, we first extract node position features from the graph and combine them with node content features as the input of MLPs, so that MLPs can fully capture the graph structure as well as node positional information. Then, we design a novel representational similarity distillation strategy to transfer node similarity information from GNNs to MLPs, so that MLPs can encode the structural node affinity and learn more effectively from GNNs through hidden-layer representations. After that, we introduce adversarial feature augmentation to make MLPs noise-resistant and further improve performance. To fully evaluate our model, we conduct extensive experiments on 7 public benchmark datasets in both transductive and inductive settings. Experiments show that NOSMOG outperforms both the state-of-the-art method and the teacher GNNs, with robustness to noises and fast inference speed. In particular, NOSMOG improves over GNNs by 2.05%, MLPs by 25.22%, and the existing state-of-the-art method by 6.63%, averaged across 7 datasets and 2 settings.
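The representational similarity distillation idea can be illustrated as follows: instead of hard-matching each node's output individually, the student is trained to reproduce the teacher's pairwise node-to-node similarity structure in the hidden space. A hedged NumPy sketch (the cosine similarity and mean-squared matching used here are one common choice; the exact similarity measure and loss in NOSMOG may differ):

```python
import numpy as np

def cosine_similarity_matrix(h):
    # Row-normalize the representations, then take all pairwise cosine similarities.
    h = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-12)
    return h @ h.T

def similarity_distillation_loss(h_student, h_teacher):
    # Match the student's node-to-node similarity structure to the teacher's.
    # Student and teacher hidden dimensions may differ: both matrices are N x N.
    s_student = cosine_similarity_matrix(h_student)
    s_teacher = cosine_similarity_matrix(h_teacher)
    return np.mean((s_student - s_teacher) ** 2)
```

Because only the N x N similarity matrices are compared, this term transfers the soft structural affinity among nodes even when the MLP's hidden width differs from the GNN's.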
In the meantime, NOSMOG achieves efficiency comparable to the state-of-the-art method and is 833× faster than GNNs with the same number of layers. In addition, we provide theoretical analyses based on information theory and conduct consistency measurements between graph topology and model predictions to facilitate a better understanding of the model. To summarize, the contributions of this paper are as follows:
• We point out that existing works on learning MLPs on graphs are neither effective nor robust. We identify three issues that undermine their capability: the misalignment between content feature and label spaces, the strict hard matching to the teacher's output, and the sensitivity to node feature noises.
• To address these issues, we propose to learn noise-robust and structure-aware MLPs on graphs, with remarkable effectiveness, robustness, and efficiency. The proposed model contains three key components: the incorporation of position features, representational similarity distillation, and adversarial feature augmentation.
• Extensive experiments demonstrate that NOSMOG can easily outperform GNNs and the state-of-the-art method. In addition, we present theoretical analyses, robustness investigations, efficiency comparisons, and ablation studies to validate the superiority of the proposed model.
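The adversarial feature augmentation component can be illustrated with an FGSM-style perturbation, a common way to craft worst-case feature noise: perturb the node features along the sign of the loss gradient and train on the perturbed copies. This NumPy sketch uses a linear softmax classifier purely for self-containedness; the paper's exact augmentation scheme and model may differ:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fgsm_feature_augment(x, w, y_onehot, epsilon=0.05):
    """Perturb node features along the sign of the loss gradient; training on
    the perturbed features encourages robustness to feature noise."""
    # For a linear classifier with softmax cross-entropy, the gradient of the
    # loss with respect to the input features is (p - y) @ w.T.
    p = softmax(x @ w)
    grad_x = (p - y_onehot) @ w.T
    return x + epsilon * np.sign(grad_x)
```

In practice the gradient would come from the MLP via automatic differentiation; the sign step bounds the perturbation in the infinity norm by `epsilon`.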

2. RELATED WORK

Graph Neural Networks. Many graph neural networks (Veličković et al., 2018; Li et al., 2019; Zhang et al., 2019; Chen et al., 2020) have been proposed to encode graph-structured data. They take advantage of the message-passing paradigm by aggregating neighborhood information to learn node embeddings. For example, GCN (Kipf & Welling, 2017) introduces a layer-wise propagation rule to learn node features. GAT (Veličković et al., 2018) incorporates an attention mechanism to aggregate features. DeepGCNs (Li et al., 2019) and GCNII (Chen et al., 2020) utilize residual connections to aggregate multi-hop neighbors and further address the over-smoothing problem. However, these message-passing GNNs only leverage local graph structure and have been shown to be no more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test (Xu et al., 2019; Morris et al., 2019). Recent works propose to empower graph learning with positional encoding techniques such as Laplacian Eigenmaps and DeepWalk (You et al., 2019; Wang et al., 2022; Tian et al., 2023a), so that a node's position within the broader context of the graph structure can be detected. Inspired by these studies, we incorporate position features to fully capture the graph structure and node positional information.

Knowledge Distillation on Graphs. Knowledge Distillation (KD) has been widely applied in graph-based research and GNNs (Yang et al., 2020; Yan et al., 2020; Guo et al., 2023; Tian et al., 2023b). Previous works apply KD primarily to learn student GNNs that have fewer parameters but perform as well as the teacher GNNs. However, time-consuming message passing is still required during the learning process. For example, LSP (Yang et al., 2020) and TinyGNN (Yan et al., 2020) introduce local structure-preserving and peer-aware modules that rely heavily on message passing.
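The positional-encoding idea mentioned above can be made concrete with Laplacian eigenmaps: take the smallest nontrivial eigenvectors of the normalized graph Laplacian as per-node position features and concatenate them with the content features. This is a generic NumPy sketch of that technique, not necessarily the specific position-feature construction used by NOSMOG (which may instead rely on, e.g., DeepWalk):

```python
import numpy as np

def laplacian_position_features(adj, k=2):
    """Position features from the k smallest nontrivial eigenvectors of the
    symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)  # eigenvalues returned in ascending order
    # Skip the trivial eigenvector associated with eigenvalue ~0.
    return vecs[:, 1:k + 1]
```

The resulting `N x k` matrix can be concatenated column-wise with the node content features (e.g., `np.concatenate([content, pos], axis=1)`) to form the structure-aware MLP input.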
To overcome the latency issues, recent works have started to focus on learning MLP-based student models that do not require message passing (Hu et al., 2021; Zhang et al., 2022b; Zheng et al., 2022). Specifically, MLP student

Code availability: https://github.com/meettyj

