LEARNING MLPS ON GRAPHS: A UNIFIED VIEW OF EFFECTIVENESS, ROBUSTNESS, AND EFFICIENCY

Abstract

While Graph Neural Networks (GNNs) have demonstrated their efficacy in dealing with non-Euclidean structural data, they are difficult to deploy in real applications due to the scalability constraint imposed by multi-hop data dependency. Existing methods attempt to address this scalability issue by training student multi-layer perceptrons (MLPs) exclusively on node content features, using labels derived from teacher GNNs. However, the trained MLPs are neither effective nor robust. In this paper, we ascribe the lack of effectiveness and robustness to three significant challenges: 1) the misalignment between the content feature and label spaces, 2) the strict hard matching to the teacher's output, and 3) the sensitivity to node feature noises. To address these challenges, we propose NOSMOG, a novel method to learn NOise-robust Structure-aware MLPs On Graphs, with remarkable effectiveness, robustness, and efficiency. Specifically, we first address the misalignment by complementing node content with position features to capture the graph structural information. We then design an innovative representational similarity distillation strategy to inject soft node similarities into MLPs. Finally, we introduce adversarial feature augmentation to ensure stable learning against feature noises. Extensive experiments and theoretical analyses demonstrate the superiority of NOSMOG by comparing it to GNNs and the state-of-the-art method in both transductive and inductive settings across seven datasets. Code is available at https://github.com/meettyj.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown exceptional effectiveness in handling non-Euclidean structural data and have achieved state-of-the-art performance across a broad range of graph mining tasks (Hamilton et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018). The success of modern GNNs relies on the message passing architecture, which aggregates and learns node representations based on their (multi-hop) neighborhood (Wu et al., 2020; Zhou et al., 2020). However, message passing is time-consuming and computation-intensive, making it challenging to apply GNNs to large-scale real-world applications that are constrained by latency and require the deployed model to perform inference quickly (Zhang et al., 2020; 2022a). To meet the latency requirement, multi-layer perceptrons (MLPs) continue to be the first choice (Zhang et al., 2022b), despite the fact that they perform poorly on non-Euclidean data and focus exclusively on node content information. Inspired by the performance advantage of GNNs and the latency advantage of MLPs, researchers have explored combining GNNs and MLPs to enjoy the advantages of both (Zhang et al., 2022b; Zheng et al., 2022; Chen et al., 2021). One effective approach is knowledge distillation (KD) (Hinton et al., 2015), where the learned knowledge is transferred from GNNs to MLPs through soft labels (Phuong & Lampert, 2019). Then only the MLPs are deployed for inference, with node content features as input. In this way, MLPs can perform well by mimicking the output of GNNs without requiring explicit message passing, and thus obtain fast inference speed (Hu et al., 2021).
Nevertheless, existing methods are neither effective nor robust, with three major drawbacks: (1) MLPs cannot fully align the input content features to the label space, especially when node labels are correlated with the graph structure; (2) MLPs rely on the teacher's output to learn a strict hard matching, jeopardizing the soft structural representational similarity among nodes; and (3) MLPs are sensitive to node feature noise, which can easily degrade performance. We thus ask: Can we learn MLPs that are graph structure-aware in both the feature and representation spaces, insensitive to node feature noise, and that achieve superior performance as well as fast inference speed? To address these issues and answer the question, we propose to learn NOise-robust Structure-aware MLPs On Graphs (NOSMOG), a novel method with remarkable performance, outstanding robustness, and exceptional inference speed. Specifically, we first extract node position features from the graph and combine them with node content features as the input of the MLP. The MLP can thus fully capture the graph structure as well as node positional information. Then, we design a novel representational similarity distillation strategy to transfer node similarity information from GNNs to MLPs, so that MLPs can encode the structural node affinity and learn more effectively from GNNs through hidden layer representations. After that, we introduce adversarial feature augmentation to make MLPs noise-resistant and further improve performance. To fully evaluate our model, we conduct extensive experiments on 7 public benchmark datasets in both transductive and inductive settings. Experiments show that NOSMOG can outperform the state-of-the-art method and also the teacher GNNs, with robustness to noise and fast inference speed. In particular, NOSMOG improves GNNs by 2.05%, MLPs by 25.22%, and the existing state-of-the-art method by 6.63%, averaged across 7 datasets and 2 settings.
In the meantime, NOSMOG achieves efficiency comparable to the state-of-the-art method and is 833× faster than GNNs with the same number of layers. In addition, we provide theoretical analyses based on information theory and conduct consistency measurements between graph topology and model predictions to facilitate a better understanding of the model. To summarize, the contributions of this paper are as follows:
• We point out that existing works on learning MLPs on graphs are neither effective nor robust. We identify three issues that undermine their capability: the misalignment between the content feature and label spaces, the strict hard matching to the teacher's output, and the sensitivity to node feature noises.
• To address these issues, we propose to learn noise-robust and structure-aware MLPs on graphs, with remarkable effectiveness, robustness, and efficiency. The proposed model contains three key components: the incorporation of position features, representational similarity distillation, and adversarial feature augmentation.
• Extensive experiments demonstrate that NOSMOG can easily outperform GNNs and the state-of-the-art method. In addition, we present theoretical analyses, robustness investigations, efficiency comparisons, and ablation studies to validate the superiority of the proposed model.

2. RELATED WORK

Graph Neural Networks. Many graph neural networks (Veličković et al., 2018; Li et al., 2019; Zhang et al., 2019; Chen et al., 2020) have been proposed to encode graph-structured data. They take advantage of the message passing paradigm, aggregating neighborhood information to learn node embeddings. For example, GCN (Kipf & Welling, 2017) introduces a layer-wise propagation rule to learn node features. GAT (Veličković et al., 2018) incorporates an attention mechanism to aggregate features. DeepGCNs (Li et al., 2019) and GCNII (Chen et al., 2020) utilize residual connections to aggregate multi-hop neighbors and further address the over-smoothing problem. However, these message-passing GNNs only leverage local graph structure and have been demonstrated to be no more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test (Xu et al., 2019; Morris et al., 2019). Recent works propose to empower graph learning with positional encoding techniques such as Laplacian eigenmaps and DeepWalk (You et al., 2019; Wang et al., 2022; Tian et al., 2023a), so that a node's position within the broader context of the graph structure can be detected. Inspired by these studies, we incorporate position features to fully capture the graph structure and node positional information. Knowledge Distillation on Graphs. Knowledge distillation (KD) has been applied widely in graph-based research and GNNs (Yang et al., 2020; Yan et al., 2020; Guo et al., 2023; Tian et al., 2023b). Previous works apply KD primarily to learn student GNNs that have fewer parameters but perform as well as the teacher GNNs. However, time-consuming message passing is still required during the learning process. For example, LSP (Yang et al., 2020) and TinyGNN (Yan et al., 2020) introduce local structure-preserving and peer-aware modules that rely heavily on message passing.
To overcome the latency issue, recent works have started to focus on learning MLP-based student models that do not require message passing (Hu et al., 2021; Zhang et al., 2022b; Zheng et al., 2022). Specifically, the MLP student is trained with node content features as input and soft labels from the GNN teacher as targets. Although the MLP can mimic the GNN's predictions, graph structural information is entirely absent from the input, resulting in incomplete learning, and the student is highly susceptible to noise present in the features. We address these concerns in our work. In addition, since adversarial learning has shown great performance in handling feature noise and enhancing model learning capability (Jiang et al., 2020; Xie et al., 2020; Tian et al., 2022; Kong et al., 2022), we introduce adversarial feature augmentation to ensure stable learning against noise.

3. PRELIMINARY

Notations. A graph is usually denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, C)$, where $\mathcal{V}$ represents the node set, $\mathcal{E}$ represents the edge set, $C \in \mathbb{R}^{N \times d_c}$ stands for the $d_c$-dimensional node content attributes, and $N$ is the total number of nodes. In the node classification task, the model predicts the category probability for each node $v \in \mathcal{V}$, supervised by the ground truth node category $Y \in \mathbb{R}^{K}$, where $K$ is the number of categories. We use superscript $L$ to mark the properties of labeled nodes (i.e., $\mathcal{V}^L$, $C^L$, and $Y^L$), and superscript $U$ to mark the properties of unlabeled nodes (i.e., $\mathcal{V}^U$, $C^U$, and $Y^U$). Graph Neural Networks. For a given node $v \in \mathcal{V}$, GNNs aggregate messages from the node's neighbors $\mathcal{N}(v)$ to learn a node embedding $h_v \in \mathbb{R}^{d_n}$ with dimension $d_n$. Specifically, the node embedding in the $l$-th layer, $h_v^{(l)}$, is learned by first aggregating (AGG) the neighbor embeddings and then combining (COM) the result with the embedding from the previous layer. The whole learning process can be denoted as: $h_v^{(l)} = \text{COM}(h_v^{(l-1)}, \text{AGG}(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\}))$.
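The AGG/COM update above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the mean aggregator, tanh combine, and weight matrices `W_self` and `W_agg` are illustrative choices (GraphSAGE-style GNNs use learned, layer-specific variants).

```python
import numpy as np

def message_passing_layer(H, neighbors, W_self, W_agg):
    """One GNN layer: h_v^(l) = COM(h_v^(l-1), AGG({h_u^(l-1) : u in N(v)})).

    AGG: mean over neighbor embeddings from the previous layer.
    COM: linear combination of self and aggregated embeddings, then tanh.
    """
    N, d_out = H.shape[0], W_self.shape[1]
    H_new = np.zeros((N, d_out))
    for v, nbrs in neighbors.items():
        # AGG: average the neighbors' previous-layer embeddings
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        # COM: combine the node's own embedding with the aggregated message
        H_new[v] = np.tanh(H[v] @ W_self + agg @ W_agg)
    return H_new
```

Stacking $l$ such layers gives each node a receptive field of its $l$-hop neighborhood, which is exactly the multi-hop data dependency that makes GNN inference slow.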

4. PROPOSED MODEL

In this section, we present the details of NOSMOG. An overview of the proposed model is shown in Figure 1. We develop NOSMOG by first introducing the background of training MLPs with GNN distillation, and then illustrating the three key components of NOSMOG, i.e., the incorporation of position features (Figure 1(b)), representational similarity distillation (Figure 1(c)), and adversarial feature augmentation (Figure 1(d)). Training MLPs with GNNs Distillation. The key idea of training MLPs with the knowledge distilled from GNNs is simple. Given a cumbersome pre-trained GNN, the ground truth label $y_v$ for any labeled node $v \in \mathcal{V}^L$, and the soft label $z_v$ learned by the teacher GNN for any node $v \in \mathcal{V}$, the goal is to train a lightweight MLP using both ground truth labels and soft labels. The objective function can be formulated as: $\mathcal{L} = \sum_{v \in \mathcal{V}^L} \mathcal{L}_{GT}(\hat{y}_v, y_v) + \lambda \sum_{v \in \mathcal{V}} \mathcal{L}_{SL}(\hat{y}_v, z_v)$ (1), where $\mathcal{L}_{GT}$ is the cross-entropy loss between the student prediction $\hat{y}_v$ and the ground truth label $y_v$, $\mathcal{L}_{SL}$ is the KL-divergence loss between the student prediction $\hat{y}_v$ and the soft label $z_v$, and $\lambda$ is a trade-off weight balancing the two losses. Incorporating Node Position Features. To address the misalignment between the content feature and label spaces, and to assist the MLP in capturing node positions on the graph, we propose to enrich node content with positional encoding techniques such as DeepWalk (Perozzi et al., 2014). By simply concatenating the node content features with the learned position features, the MLP is able to capture node positional information within a broad context of the graph structure and encode it in the feature space. The idea of incorporating position features is straightforward, yet, as we will see, extremely effective. Specifically, we first learn the position feature $P_v$ for node $v \in \mathcal{V}$ by running the DeepWalk algorithm on $\mathcal{G}$. Note that no node content features are involved in this step, so the position features are solely determined by the graph structure and the node positions in the graph.
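The distillation objective described above, combining a cross-entropy term on labeled nodes with a KL term on all nodes, can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code; `distillation_loss` and its arguments are hypothetical names, and the student predictions are assumed to already be probability vectors.

```python
import numpy as np

def cross_entropy(p_hat, y):
    """L_GT: cross-entropy between student prediction and integer label y."""
    return -np.log(p_hat[y] + 1e-12)

def kl_divergence(p_hat, z):
    """L_SL: KL(z || p_hat) between teacher soft label and student prediction."""
    return np.sum(z * (np.log(z + 1e-12) - np.log(p_hat + 1e-12)))

def distillation_loss(preds, labels, soft_labels, labeled, lam=1.0):
    """L = sum_{v in V^L} L_GT(y_hat_v, y_v) + lambda * sum_{v in V} L_SL(y_hat_v, z_v)."""
    l_gt = sum(cross_entropy(preds[v], labels[v]) for v in labeled)
    l_sl = sum(kl_divergence(preds[v], soft_labels[v]) for v in range(len(preds)))
    return l_gt + lam * l_sl
```

Note that the ground-truth term runs only over the labeled set, while the soft-label term runs over every node, which is how unlabeled nodes contribute supervision.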
Then, we concatenate (CONCAT) the content feature $C_v$ and position feature $P_v$ to form the final node feature $X_v$. After that, we feed the concatenated node feature into the MLP to obtain the category prediction $\hat{y}_v$. The entire process is formulated as: $X_v = \text{CONCAT}(C_v, P_v), \ \hat{y}_v = \text{MLP}(X_v)$ (2). Later, $\hat{y}_v$ is leveraged to calculate $\mathcal{L}_{GT}$ and $\mathcal{L}_{SL}$ (Equation 1). Representational Similarity Distillation. To transfer the structural node similarity from the teacher to the student, we compute the intra-model similarity matrices: $S_{GNN} = H_G \cdot (H_G)^T$ and $S_{MLP} = H'_M \cdot (H'_M)^T, \ H'_M = \sigma(W_M \cdot H_M)$ (3), where $H_G$ and $H_M$ denote the node representations learned by the GNN and the MLP, respectively, $W_M \in \mathbb{R}^{d_M \times d_M}$ is a transformation matrix, $\sigma$ is the activation function (we use ReLU in this work), and $H'_M$ is the transformed MLP representation. We then define the RSD loss $\mathcal{L}_{RSD}$ to minimize the discrepancy between the two similarity matrices using the Frobenius norm $\|\cdot\|_F$: $\mathcal{L}_{RSD}(S_{GNN}, S_{MLP}) = \|S_{GNN} - S_{MLP}\|_F^2$ (4). Adversarial Feature Augmentation. The MLP is highly susceptible to feature noise (Dey et al., 2017) if it only considers the explicit feature information associated with each node. To enhance the MLP's robustness to noise, we introduce adversarial feature augmentation to leverage the regularization power of adversarial features (Kong et al., 2022; Zhang et al., 2023). In other words, adversarial feature augmentation makes the MLP invariant to small feature fluctuations and able to generalize to out-of-distribution samples, further boosting performance (Wang et al., 2019). Compared to vanilla training, where the original node features are used to obtain the category prediction, adversarial training learns a perturbation $\delta$ and feeds the maliciously perturbed features $X + \delta$ to the MLP. This process can be formulated as the following min-max optimization problem: $\min_\theta \max_{\|\delta\|_p \leq \epsilon} \left(-Y \log(\text{MLP}(X + \delta))\right)$ (5), where $X$ represents the concatenation of node content and position features as shown in Eq. 2, $\theta$ denotes the model parameters, $\delta$ is the perturbation, $\|\cdot\|_p$ is the $\ell_p$-norm distance metric, and $\epsilon$ is the perturbation range.
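The representational similarity distillation loss defined earlier in this section (pairwise similarity matrices compared under a squared Frobenius norm) can be sketched as follows. This is an assumption-laden illustration, not the released implementation: `rsd_loss` is a hypothetical name, and the ReLU transform stands in for the learnable mapping $H'_M = \sigma(W_M H_M)$.

```python
import numpy as np

def rsd_loss(H_gnn, H_mlp, W_m):
    """L_RSD = ||S_GNN - S_MLP||_F^2 with
    S_GNN = H_G H_G^T and S_MLP = H'_M H'_M^T, H'_M = ReLU(H_M W_M)."""
    S_gnn = H_gnn @ H_gnn.T                 # teacher's node-similarity matrix
    H_m = np.maximum(H_mlp @ W_m, 0.0)      # sigma = ReLU transform of student reps
    S_mlp = H_m @ H_m.T                     # student's node-similarity matrix
    return float(np.sum((S_gnn - S_mlp) ** 2))  # squared Frobenius norm
```

Because the loss compares node-by-node similarity matrices rather than individual outputs, the student is pushed to reproduce the teacher's soft structural affinity among nodes, not just its per-node predictions.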
Specifically, we choose Projected Gradient Descent (PGD) (Madry et al., 2018) as the default attacker to generate the adversarial perturbation iteratively: $\delta^{t+1} = \Pi_{\|\delta\|_\infty \leq \epsilon} \left[\delta^t + s \cdot \text{sign}\left(\nabla_\delta (-Y \log(\text{MLP}(X + \delta^t)))\right)\right]$ (6), where $s$ is the perturbation step size and $\nabla_\delta$ is the gradient with respect to $\delta$. For maximum robustness, the final perturbation $\delta = \delta^T$ is learned by applying the update in Equation 6 for $T$ iterations to generate the worst-case noise. Finally, to accommodate KD, we perform adversarial training using both the ground truth labels $y_v$ for labeled nodes ($v \in \mathcal{V}^L$) and the soft labels $z_v$ for all nodes in the graph ($v \in \mathcal{V}$). Therefore, we can reformulate the objective in Equation 5 as follows: $X'_v = X_v + \delta, \ \hat{y}'_v = \text{MLP}(X'_v), \ \mathcal{L}_{ADV} = \max_{\|\delta\| \leq \epsilon} \left[-\sum_{v \in \mathcal{V}^L} y_v \log(\hat{y}'_v) - \sum_{v \in \mathcal{V}} z_v \log(\hat{y}'_v)\right]$ (7). The final objective function $\mathcal{L}$ is defined as the weighted combination of the ground truth cross-entropy loss $\mathcal{L}_{GT}$, the soft label distillation loss $\mathcal{L}_{SL}$, the representational similarity distillation loss $\mathcal{L}_{RSD}$, and the adversarial learning loss $\mathcal{L}_{ADV}$: $\mathcal{L} = \mathcal{L}_{GT} + \lambda \mathcal{L}_{SL} + \mu \mathcal{L}_{RSD} + \eta \mathcal{L}_{ADV}$ (8), where $\lambda$, $\mu$, and $\eta$ are trade-off weights balancing $\mathcal{L}_{SL}$, $\mathcal{L}_{RSD}$, and $\mathcal{L}_{ADV}$, respectively.
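The iterative PGD update above (signed gradient ascent on the loss, projected back onto the $\ell_\infty$ ball of radius $\epsilon$) can be sketched as follows. This is a generic PGD sketch under simplifying assumptions, not the paper's training loop: the loss gradient is passed in as a black-box `grad_fn` rather than computed by automatic differentiation, and the hyperparameter values are arbitrary.

```python
import numpy as np

def pgd_perturb(X, grad_fn, eps=0.1, step=0.02, T=5):
    """delta^{t+1} = Proj_{||delta||_inf <= eps}[delta^t + s * sign(grad)].

    grad_fn(X_adv) returns the gradient of the training loss w.r.t. the
    (perturbed) input features; T steps yield the worst-case perturbation.
    """
    delta = np.zeros_like(X)
    for _ in range(T):
        # ascend the loss in the direction of the signed gradient
        delta = delta + step * np.sign(grad_fn(X + delta))
        # projection onto the L_inf ball of radius eps
        delta = np.clip(delta, -eps, eps)
    return delta
```

In training, the model would then be optimized on `X + pgd_perturb(X, ...)`, alternating perturbation generation and parameter updates as in the min-max objective.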

5. EXPERIMENTS

In this section, we conduct extensive experiments to validate the effectiveness, robustness, and efficiency of the proposed model and answer the following questions: 1) Can NOSMOG outperform GNNs, MLPs, and other GNNs-MLPs methods? 2) Can NOSMOG work well under both inductive and transductive settings? 3) Can NOSMOG work well with noisy features? 4) How does NOSMOG perform in terms of inference time? 5) How does NOSMOG perform with different model components? 6) How does NOSMOG perform with different teacher GNNs? 7) How can we explain the superior performance of NOSMOG?

5.1. EXPERIMENT SETTINGS

Datasets. We use five widely used public benchmark datasets (i.e., Cora, Citeseer, Pubmed, A-computer, and A-photo) (Zhang et al., 2022b; Yang et al., 2021), and two large OGB datasets (i.e., Arxiv and Products) (Hu et al., 2020) to evaluate the proposed model. Model Architectures. For a fair comparison, we follow Zhang et al. (2022b) in using GraphSAGE (Hamilton et al., 2017) with GCN aggregation as the teacher model. We also show the impact of other teacher models, including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), and APPNP (Klicpera et al., 2019), in Section 5.7. Evaluation Protocol. We report the mean and standard deviation over ten separate runs with different random seeds. We adopt accuracy to measure model performance, use validation data to select the optimal model, and report results on test data. Two Settings: Transductive vs. Inductive. To fully evaluate the model, we conduct node classification in two settings: transductive (tran) and inductive (ind). For tran, we train models on $\mathcal{G}$, $X^L$, and $Y^L$, and evaluate them on $X^U$ and $Y^U$. We generate soft labels for every node in the graph (i.e., $z_v$ for $v \in \mathcal{V}$). For ind, we randomly hold out 20% of the test data for inductive evaluation. Specifically, we separate the unlabeled nodes $\mathcal{V}^U$ into two disjoint observed and inductive subsets (i.e., $\mathcal{V}^U = \mathcal{V}^U_{obs} \sqcup \mathcal{V}^U_{ind}$), which leads to three separate graphs $\mathcal{G} = \mathcal{G}^L \sqcup \mathcal{G}^U_{obs} \sqcup \mathcal{G}^U_{ind}$ with no shared nodes. The edges between $\mathcal{G}^L \sqcup \mathcal{G}^U_{obs}$ and $\mathcal{G}^U_{ind}$ are removed during training but are used during inference to transfer position features via an average operator (Hamilton et al., 2017). Node features and labels are partitioned into three disjoint sets, i.e., $X = X^L \sqcup X^U_{obs} \sqcup X^U_{ind}$ and $Y = Y^L \sqcup Y^U_{obs} \sqcup Y^U_{ind}$. We generate soft labels for nodes in the labeled and observed subsets (i.e., $z_v$ for $v \in \mathcal{V}^L \sqcup \mathcal{V}^U_{obs}$).
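The disjoint observed/inductive split of the unlabeled nodes described above can be sketched as follows. This is a schematic of the evaluation protocol, not the authors' code; the function name, the fixed seed, and the 20% ratio argument are illustrative.

```python
import random

def inductive_split(unlabeled_nodes, ind_ratio=0.2, seed=0):
    """Partition unlabeled nodes into disjoint observed / inductive subsets,
    V^U = V^U_obs  |_|  V^U_ind, with |V^U_ind| ~= ind_ratio * |V^U|."""
    rng = random.Random(seed)
    nodes = list(unlabeled_nodes)
    rng.shuffle(nodes)
    k = int(len(nodes) * ind_ratio)
    ind, obs = set(nodes[:k]), set(nodes[k:])
    return obs, ind
```

Edges incident to the inductive subset would then be dropped from the training graph, so inductive nodes are genuinely unseen during training.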

5.2. CAN NOSMOG OUTPERFORM GNNS, MLPS, AND OTHER GNNS-MLPS METHODS?

We compare NOSMOG to GNN, MLP, and the state-of-the-art method under the standard transductive setting and report the results in Table 1, which are directly comparable to those reported in previous literature (Zhang et al., 2022b; Hu et al., 2020; Yang et al., 2021). As shown in Table 1, NOSMOG outperforms GNN, MLP, and GLNN on all datasets.

5.3. CAN NOSMOG WORK WELL UNDER BOTH INDUCTIVE AND TRANSDUCTIVE SETTINGS?

To better understand the effectiveness of NOSMOG, we conduct experiments in a realistic production (prod) scenario that involves both inductive (ind) and transductive (tran) settings (see Table 2). We find that NOSMOG achieves superior or comparable performance to the teacher model and baseline methods across all datasets and settings. Specifically, compared to GNN, NOSMOG achieves better performance on all datasets and settings, except for ind on Arxiv and Products, where NOSMOG only achieves comparable performance. Considering that these two datasets have a significant distribution shift between training and test data (Zhang et al., 2022b), it is understandable that NOSMOG cannot outperform GNN without explicit graph structure input. However, compared to GLNN, which can barely learn on these two datasets, NOSMOG improves performance substantially, i.e., by 18.72% and 11.01%, respectively. This demonstrates the capability of NOSMOG in capturing graph structural information on large-scale datasets, despite the significant distribution shift. In addition, NOSMOG outperforms MLP and GLNN by large margins on all datasets and settings, with average improvements of 24.15% and 6.39%, respectively. Therefore, we conclude that NOSMOG achieves exceptional performance in the production environment with both inductive and transductive settings.

5.4. CAN NOSMOG WORK WELL WITH NOISY FEATURES?

Figure 2: Accuracy vs. Feature Noises.

Considering that MLP and GLNN are sensitive to feature noise and may not perform well when labels are uncorrelated with node content, we further evaluate the performance of NOSMOG with regard to noise level in Figure 2. Experimental results are averaged across various datasets. Specifically, we add different levels of Gaussian noise to the content features by replacing $C$ with $\tilde{C} = (1 - \alpha)C + \alpha n$, where $n$ represents Gaussian noise independent of $C$, and $\alpha \in [0, 1]$ indicates the noise level. We find that NOSMOG achieves better or comparable performance to GNNs across different $\alpha$, which demonstrates the superior efficacy of NOSMOG, especially given that GNNs can mitigate the impact of noise by leveraging information from neighbors and surrounding subgraphs, whereas NOSMOG relies only on content and position features. GLNN and MLP, however, drop in performance quickly as $\alpha$ increases. In the extreme case when $\alpha$ equals 1, the input features are pure noise and $\tilde{C}$ and $C$ are independent. We observe that NOSMOG can still perform as well as GNNs by exploiting the position features, while GLNN and MLP perform poorly.

5.5. HOW DOES NOSMOG PERFORM IN TERMS OF INFERENCE TIME?

To demonstrate the efficiency of NOSMOG, we analyze its capacity by visualizing the trade-off between prediction accuracy and model inference time on the Products dataset in Figure 3. We find that NOSMOG achieves high accuracy (78%) while maintaining a fast inference time (1.35 ms). Specifically, compared to other models with similar inference time, NOSMOG performs significantly better, while GLNN and MLPs only achieve 64% and 60% accuracy, respectively. Models with performance close to that of NOSMOG need a considerable amount of time for inference, e.g., 2-layer GraphSAGE (SAGE-L2) needs 144.47 ms and 3-layer GraphSAGE (SAGE-L3) needs 1125.43 ms, which is not applicable in real applications. This makes NOSMOG 107× faster than SAGE-L2 and 833× faster than SAGE-L3.
In addition, since increasing the hidden size of GLNN may improve the performance, we compare NOSMOG with GLNNw4 (4-times wider than GLNN) and GLNNw8 (8-times wider than GLNN). Results show that although GLNNw4 and GLNNw8 can improve GLNN, they still perform worse than NOSMOG and even require more time for inference. We thus conclude that NOSMOG is superior to existing methods and GNNs in terms of both accuracy and inference time.

5.6. HOW DOES NOSMOG PERFORM WITH DIFFERENT MODEL COMPONENTS?

Since NOSMOG contains several essential components (i.e., node position features (POS), representational similarity distillation (RSD), and adversarial feature augmentation (ADV)), we conduct ablation studies to analyze the contribution of each component by removing each of them independently (see Table 3). From the table, we find that performance drops whenever a component is removed, indicating the effectiveness of each component. In general, the incorporation of position features contributes the most, especially on the Arxiv and Products datasets. By integrating position features, NOSMOG can learn from node positions and achieve exceptional performance. RSD contributes relatively little to the overall performance across datasets. This is because the goal of RSD is to distill more information from the GNN to the MLP, while the MLP already learns well by mimicking the GNN through soft labels. ADV contributes moderately across datasets, given that it mitigates overfitting and improves generalization. Finally, NOSMOG achieves the best performance on all datasets, demonstrating the effectiveness of the proposed model.

5.7. HOW DOES NOSMOG PERFORM WITH DIFFERENT TEACHER GNNS?

We have used GraphSAGE as the teacher GNN so far. However, different GNN architectures may perform differently across datasets, so we study whether NOSMOG can perform well with other teachers. In Figure 4, we show the average performance with different teacher GNNs (i.e., GCN, GAT, and APPNP) across the five benchmark datasets. From the figure, we conclude that the performance of all four teachers is comparable, and NOSMOG can always learn from different teachers and outperform them, albeit with slightly diminished performance when distilled from APPNP, indicating that APPNP provides the least benefit to the student.
This is because APPNP uses node features for prediction prior to message passing on the graph, which is very similar to what the student MLP does, and therefore provides the MLP with little additional information compared to the other teachers. Nevertheless, NOSMOG consistently outperforms GLNN, which further demonstrates the effectiveness of the proposed model.

5.8. HOW CAN WE EXPLAIN THE SUPERIOR PERFORMANCE OF NOSMOG?

In this section, we analyze the superior performance and expressiveness of NOSMOG from several perspectives, including a comparison with GLNN and GNNs from an information-theoretic perspective, and a consistency measure between model predictions and graph topology based on the min-cut problem. The expressiveness of NOSMOG compared to GLNN and GNNs. The goal of the node classification task is to fit a function $f$ on the rooted graph $G[v]$ with label $y_v$ (a rooted graph $G[v]$ is the graph with one node $v$ designated as the root) (Chen et al., 2021). From the information-theoretic perspective, learning $f$ by minimizing the cross-entropy loss is equivalent to maximizing the mutual information (MI) (Qin et al., 2019), i.e., $I(G[v]; y_v)$. If we consider $G[v]$ as a joint distribution of two random variables $X[v]$ and $E[v]$, representing the node features and edges in $G[v]$ respectively, we have: $I(G[v]; y_v) = I(X[v], E[v]; y_v) = I(E[v]; y_v) + I(X[v]; y_v \mid E[v])$, where $I(E[v]; y_v)$ is the MI between edges and labels, which indicates the relevance between labels and graph structure, and $I(X[v]; y_v \mid E[v])$ is the MI between features and labels given the edges $E[v]$. To compare the effectiveness of NOSMOG, GLNN, and GNNs, we start by analyzing the objective of GNNs. For a given node $v$, GNNs aim to learn an embedding function $f_{GNN}$ that computes the node embedding $z[v]$, where the objective is to maximize the likelihood of the conditional distribution $P(y_v \mid z[v])$ to approximate $I(G[v]; y_v)$. Generally, the embedding function $f_{GNN}$ takes the node features $X[v]$ and the multi-hop neighborhood subgraph $S[v]$ as input, which can be written as $z[v] = f_{GNN}(X[v], S[v])$. Correspondingly, maximizing the likelihood $P(y_v \mid z[v])$ can be expressed as minimizing the objective function $\mathcal{L}_1(f_{GNN}(X[v], S[v]), y_v)$.
Since $S[v]$ contains the multi-hop neighbors, optimizing $\mathcal{L}_1$ captures both the node features and the surrounding structural information, approximating $I(X[v]; y_v \mid E[v])$ and $I(E[v]; y_v)$, respectively. GLNN leverages the objective function described in Equation 1, which approximates $I(G[v]; y_v)$ by only maximizing $I(X[v]; y_v \mid E[v])$ while ignoring $I(E[v]; y_v)$. However, there are situations where node labels are not strongly correlated with node features, or where labels are mainly determined by node positions or graph structure, e.g., node degrees (Lim et al., 2021; Zhu et al., 2021). In these cases, GLNN is unable to fit. Alternatively, NOSMOG models both $I(E[v]; y_v)$ and $I(X[v]; y_v \mid E[v])$ by jointly considering node position features and content features. In particular, $I(E[v]; y_v)$ is optimized through the objective functions $\mathcal{L}_2(f_{MLP}(X[v], P[v]), y_v)$ and $\mathcal{L}_3(f_{MLP}(X[v], P[v]), f_{GNN}(X[v], S[v]))$, which extend $\mathcal{L}_{GT}$ and $\mathcal{L}_{SL}$ in Equation 8 to incorporate position features. Here $f_{MLP}$ is the embedding function that NOSMOG learns given the content features $X[v]$ and position features $P[v]$. Essentially, optimizing $\mathcal{L}_3$ forces the MLP to learn from the GNN's output and eventually achieve performance comparable to GNNs. Meanwhile, optimizing $\mathcal{L}_2$ allows the MLP to capture node positions that may not be learned by GNNs, which is important when the label $y_v$ is correlated with node positional information. Therefore, NOSMOG can be expected to perform better, considering that $y_v$ is generally correlated with $E[v]$ in graph data. Even in the extreme case where $y_v$ is uncorrelated with $X[v]$, NOSMOG can still achieve performance superior or comparable to GNNs, as demonstrated in Section 5.4. The consistency measure of model predictions and graph topology.
To further validate that NOSMOG is superior to GNNs, MLPs, and GLNN in encoding graph structural information, we design a cut value $CV \in [0, 1]$ to measure the consistency between model predictions and graph topology (Zhang et al., 2022b), based on an approximation to the min-cut problem (Bianchi et al., 2019). The min-cut problem divides the nodes $\mathcal{V}$ into $K$ disjoint subsets by removing the minimum number of edges, and can be expressed as: $\max \frac{1}{K} \sum_{k=1}^{K} (C_k^T A C_k)/(C_k^T D C_k)$, where $C$ is the node class assignment, $A$ is the adjacency matrix, and $D$ is the degree matrix. Accordingly, we define the cut value as: $CV = \text{tr}(\hat{Y}^T A \hat{Y}) / \text{tr}(\hat{Y}^T D \hat{Y})$, where $\hat{Y}$ is the model prediction output. The larger the cut value, the more consistent the predictions are with the graph topology, and the more capable the model is of capturing graph structural information. The cut values for different models in the transductive setting are shown in Table 4. We find that the average $CV$ for NOSMOG is 0.9348, while the average $CV$ for SAGE, MLP, and GLNN is 0.9276, 0.7572, and 0.8725, respectively. We conclude that NOSMOG achieves the highest cut value, demonstrating its superior expressiveness in capturing graph topology compared to GNN, MLP, and GLNN.
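The cut value defined above is a direct trace computation; a minimal NumPy sketch follows. This is an illustration of the metric, not the authors' evaluation code; `cut_value` is a hypothetical name and `Y_hat` is assumed to be a one-hot (or soft) assignment matrix.

```python
import numpy as np

def cut_value(Y_hat, A):
    """CV = tr(Y^T A Y) / tr(Y^T D Y): consistency of predictions with
    graph topology (1.0 = every edge connects same-class predictions)."""
    D = np.diag(A.sum(axis=1))              # degree matrix
    num = np.trace(Y_hat.T @ A @ Y_hat)     # within-class edge mass
    den = np.trace(Y_hat.T @ D @ Y_hat)     # total edge mass
    return num / den
```

For intuition: on a graph of two disconnected same-class pairs, a prediction that groups each pair together attains $CV = 1$, while a prediction that splits every pair attains $CV = 0$.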

6. CONCLUSION

In this paper, we address three significant issues of existing GNN-to-MLP frameworks and present a unified view of learning MLPs with effectiveness, robustness, and efficiency. Specifically, we propose to learn Noise-robust Structure-aware MLPs on Graphs (NOSMOG), which incorporates position features, representational similarity distillation, and adversarial feature augmentation. Extensive experiments on seven datasets demonstrate that NOSMOG improves GNNs by 2.05%, MLPs by 25.22%, and the state-of-the-art method by 6.63%, while maintaining remarkable robustness and an inference speed 833× faster than GNNs. Furthermore, we present theoretical analyses as well as empirical investigations of robustness and efficiency to demonstrate the superiority of the proposed model.




Figure 1: (a) The overall framework of NOSMOG: A GNN teacher is trained on the graph to obtain the representational node similarity and soft labels. Then, an MLP student is trained on node content features and position features, guided by the learned representational node similarity and soft labels. We also introduce the adversarial feature augmentation to ensure stable learning against feature noises. (b) Acquisition of Node Position Features: capturing node positional information by positional encoding techniques. (c) Representational Similarity Distillation: enforcing MLP to learn node similarity from GNN's representation space. (d) Adversarial Feature Augmentation: learning adversarial features by generating adversarial perturbation for input features.

Figure 3: Accuracy vs. Inference Time.


Figure 4: Accuracy vs. Teacher GNN Architectures.

Table 1: NOSMOG outperforms GNN, MLP, and the state-of-the-art method GLNN on all datasets under the standard setting. ΔGNN, ΔMLP, and ΔGLNN represent the difference between NOSMOG and GNN, MLP, and GLNN, respectively. Results show accuracy (higher is better).

Compared to GNN, NOSMOG improves performance by 2.46% on average across datasets, which demonstrates that NOSMOG can capture better structural information than GNN without explicit graph structure input. Compared to MLP, NOSMOG improves performance by 26.29% on average across datasets, while the state-of-the-art method GLNN only improves MLP by 18.55%; this shows the efficacy of KD and demonstrates that NOSMOG captures additional information that GLNN cannot. Compared to GLNN, NOSMOG improves performance by 6.86% on average across datasets. Specifically, GLNN performs poorly on the large OGB datasets (the last two rows), while NOSMOG learns well and shows improvements of 12.39% and 23.14%, respectively. This further demonstrates the effectiveness of NOSMOG. Analyses of the capability of each model component and the expressiveness of NOSMOG are given in Sections 5.6 and 5.8, respectively.

Table 2: NOSMOG outperforms GNN, MLP, and the state-of-the-art method GLNN in a production scenario with both inductive and transductive settings. ind indicates results on $\mathcal{V}^U_{ind}$, tran indicates results on $\mathcal{V}^U_{obs}$, and prod indicates the interpolated production results of both ind and tran.

Table 3: Accuracy of different model variants. The decreasing performance of these variants demonstrates the effectiveness of each component in enhancing the model.

Table 4: The cut value. NOSMOG predictions are more consistent with the graph topology than those of GNN, MLP, and the state-of-the-art method GLNN.

Code availability: https://github.com/meettyj

