GRAPH NEURAL NETWORKS ARE INHERENTLY GOOD GENERALIZERS: INSIGHTS BY BRIDGING GNNS AND MLPS

Abstract

Graph neural networks (GNNs), the de-facto model class for representation learning on graphs, are built upon the multi-layer perceptron (MLP) architecture with additional message passing layers that allow features to flow across nodes. While conventional wisdom commonly attributes the success of GNNs to their advanced expressivity, we conjecture that this is not the main cause of GNNs' superiority in node-level prediction tasks. This paper pinpoints the major source of GNNs' performance gain to their intrinsic generalization capability, by introducing an intermediate model class dubbed P(ropagational)MLP, which is identical to a standard MLP in training but adopts GNN's architecture in testing. Intriguingly, we observe that PMLPs consistently perform on par with (or even exceed) their GNN counterparts, while being much more efficient in training. This finding provides a new perspective for understanding the learning behavior of GNNs, and can be used as an analytic tool for dissecting various GNN-related research problems including expressivity, generalization, over-smoothing and heterophily. As an initial step to analyze PMLP, we show that its essential difference from MLP at the infinite-width limit lies in the NTK feature map in the post-training stage. Moreover, through extrapolation analysis (i.e., generalization under distribution shifts), we find that although most GNNs and their PMLP counterparts cannot extrapolate non-linear functions for extreme out-of-distribution data, they have greater potential to generalize to testing data near the training data support, a natural advantage of the GNN architecture used for inference. Code is available at https://github.com/chr26195/PMLP.

1. INTRODUCTION

In the past decades, Neural Networks (NNs) have achieved great success in many areas. As a classic NN architecture, Multi-Layer Perceptrons (MLPs) (Rumelhart et al., 1986) stack multiple Feed-Forward (FF) layers with nonlinearities to universally approximate functions. Later, Graph Neural Networks (GNNs) (Scarselli et al., 2008b; Bruna et al., 2014; Gilmer et al., 2017; Kipf & Welling, 2017; Veličković et al., 2017; Hamilton et al., 2017; Klicpera et al., 2019; Wu et al., 2019) build upon the MLP architecture, e.g., by inserting additional Message Passing (MP) operations amid FF layers (Kipf & Welling, 2017) to accommodate the interdependence between instance pairs.

Two cornerstone concepts at the basis of deep learning research are a model's representation and generalization power. The former is concerned with what function class NNs can approximate and to what extent they can minimize the empirical risk R̂(f); the latter instead focuses on the inductive bias of the learning procedure, asking how well the learned function generalizes to unseen in- and out-of-distribution samples, as reflected by the generalization gap R(f) − R̂(f). A number of works try to dissect GNNs' representational power (e.g., Scarselli et al. (2008a); Xu et al. (2018a); Maron et al. (2019); Oono & Suzuki (2019)), while their generalizability and connections with MLP are far less well understood.

To bridge the two model families, we introduce an intermediate model class dubbed Propagational MLP (PMLP), which is identical to a standard MLP in training but adopts a GNN architecture in testing. In the testing phase, PMLPs additionally insert non-parametric MP layers amid FF layers, as shown in Fig. 1(a), to align with various GNN architectures including (but not limited to) GCN (Kipf & Welling, 2017), SGC (Wu et al., 2019) and APPNP (Klicpera et al., 2019).
(Empirical Results and Implications) Based on experiments across sixteen node classification benchmarks and additional discussions on different architectural choices (i.e., layer number, hidden size), model instantiations (i.e., FF/MP layer implementation) and data characteristics (i.e., data split, amount of structural information), we identify two intriguing empirical phenomena:

• Phenomenon 1: PMLP significantly outperforms MLP. Although PMLP shares the same weights (i.e., trained model parameters) with a vanilla MLP, it tends to yield a lower generalization gap and thereby outperforms MLP by a large margin in testing, as illustrated in Fig. 1(c) and (b) respectively. This observation suggests that the message passing / graph convolution modules in GNNs can inherently improve a model's generalization capability on unseen samples. The word "inherently" underlines that these particular generalization effects are implicit in the GNN architecture (with its message passing mechanism) used in inference, and isolated from factors in the training process, such as a larger hypothesis space for representing a rich set of "graph-aware" functions (Scarselli et al., 2008a; Xu et al., 2018a), or more suitable inductive biases in model selection that prioritize functions capable of relational reasoning (Battaglia et al., 2018).

• Phenomenon 2: PMLP performs on par with or even exceeds GNNs. PMLP achieves testing performance close to its GNN counterpart in inductive node classification tasks, and can even outperform the GNN by a large margin in some cases (e.g., after removing self-loops or adding noisy edges).
Given that the only difference between GNN and PMLP is the model architecture used in training, and that the representation power of PMLP before testing is exactly the same as MLP's, this observation suggests that the major (but not the only) source of GNNs' performance improvement over MLP in node classification stems from the aforementioned inherent generalization capability.

(Practical Significance) We also highlight that PMLP, as a novel class of models (using the MLP architecture in training and a GNN architecture in inference), can be used for broader analysis purposes or applied as a simple, flexible and very efficient graph encoder for scalable training.

⋄ PMLP as an analytic tool. PMLPs can be used for dissecting various GNN-related problems such as over-smoothing and heterophily (see Sec. 3.3 for preliminary explorations), and in a broader sense can potentially bridge theoretical research in the two areas by enabling us to conveniently leverage well-established theoretical frameworks for MLPs to enrich those for GNNs.

⋄ PMLP as an efficient graph encoder. While being as effective as GNNs in many cases, PMLPs are significantly more efficient in training (5∼17× faster on large datasets, and 65× faster for very deep GNNs with more than 100 MP layers). In fact, PMLPs are equivalent to GNNs with all edges dropped in training, which is itself a widely recognized way to accelerate GNN training (i.e., DropEdge (Rong et al., 2020)). Moreover, PMLPs are more robust against noisy edges, can be trivially combined with mini-batch training (and many other training tricks for general NNs), and help to quickly evaluate GNN architectures to facilitate model development.
Notably, PMLPs can further be extended to the transductive learning setting, and are compatible with many other GNN architectures with residual connections (e.g., GCNII (Chen et al., 2020b)) or parametric message passing layers (e.g., GAT (Veličković et al., 2017)) with slight modifications, as will be specified.

(Theoretical Results and Contributions) As mentioned above, our empirical finding narrows the many differences between MLPs and GNNs down to the key factor behind their performance gap, i.e., improved generalizability due to the change in network architecture. A natural question then arises: why is this the case, and how does the GNN architecture (in testing) help the model to generalize? We take an initial step towards answering this question:

• Comparison of three model classes in the NTK regime. We compare MLP, PMLP, and GNN in the Neural Tangent Kernel (NTK) regime (Jacot et al., 2018), where models are over-parameterized and gradient descent finds the global optimum. From this perspective, the distinction between PMLP and MLP is rooted in the change of the NTK feature map determined by the model architecture, while their minimum-RKHS-norm solutions are fixed (Proposition 1). For deeper investigation, we first extend the definition of the Graph Neural Tangent Kernel (GNTK) (Du et al., 2019) to the node regression setting (Lemma 2), and derive the explicit formula for computing the feature map of PMLP/GNN.

• OoD generalization / extrapolation analysis for PMLPs and GNNs. We consider an important (yet overlooked) aspect of generalization analysis, i.e., extrapolation to Out-of-Distribution (OoD) testing samples (Xu et al., 2021) whose node features move increasingly outside the training data support. In particular, we reveal that, like MLP, both PMLP and GNN eventually converge to linear functions when testing samples are infinitely far away from the training data support (Theorem 4).
Nevertheless, their convergence rates are smaller than that of MLP by a factor related to node degrees and features' cosine similarity (Theorem 5), which indicates that both PMLP and GNN are more tolerant to OoD samples and thus have greater potential to generalize near the training data support (which is often the real-world case). We provide an illustration in Fig. 1(d).

1.1 RELATED WORKS

Generalization, especially for feed-forward NNs (i.e., MLPs), has been extensively studied in the general ML field (Arora et al., 2019a; Allen-Zhu et al., 2019; Cao & Gu, 2019). For GNNs, however, the large body of existing theoretical work focuses on their representational power (e.g., Scarselli et al. (2008a); Xu et al. (2018a); Maron et al. (2019); Oono & Suzuki (2019)), while their generalization capability is less well understood. In the node-level prediction setting, works on generalization analysis (Scarselli et al., 2018; Verma & Zhang, 2019; Baranwal et al., 2021; Ma et al., 2021) mainly aim to derive generalization bounds, but do not establish connections with MLPs since they assume the same GNN architecture in training and testing. For theoretical analysis, the most relevant work is Xu et al. (2021), which studies the extrapolation behavior of MLPs; their results will be used later in this work. The authors also shed light on the extrapolation power of GNNs, but for graph-level prediction with max/min propagation from the perspective of algorithmic alignment, which does not apply to the more commonly used GNNs with average/sum propagation in node-level prediction. Regarding the relation between MLPs and GNNs, there are some recent attempts to boost the performance of MLPs to approach that of GNNs, by using label propagation (Huang et al., 2021), contrastive learning (Hu et al., 2021), knowledge distillation (Zhang et al., 2022) or additional regularization in training (Zhang et al., 2023).
However, it is unclear whether these graph-enhanced MLPs can explain the success of GNNs, since understanding these training techniques themselves is still an open research question. There are also a few works probing into model architectures similar to PMLP, e.g., Klicpera et al. (2019), which can generally be seen as special cases of PMLP. A concurrent work (Han et al., 2023) further finds that PMLP with an additional fine-tuning procedure can be used to significantly accelerate GNN training. Moreover, a recent work (Baranwal et al., 2023) also theoretically studies how message passing operations benefit multi-layer networks in node classification tasks, which complements our results.

2. BACKGROUND AND MODEL FORMULATION

Assume a graph dataset G = (V, E) where the node set V contains n node instances {(x_u, y_u)}_{u∈V}, with x_u ∈ R^d denoting node features and y_u the label. Without loss of generality, y_u can be a categorical variable or a continuous one depending on the specific prediction task (classification or regression). Instance relations are described by the edge set E and an associated adjacency matrix A ∈ {0, 1}^{n×n}. In general, the problem is to learn a predictor model ŷ = f(x; θ, G_x^k) for node-level prediction, where G_x^k denotes the k-hop ego-graph around x over G.

Graph Neural Networks and Multi-Layer Perceptrons. To probe into the connection between mainstream GNNs and MLPs from the architectural view, we re-write the GNN formulation in a general form that explicitly disentangles each layer into two operations, namely a Message-Passing (MP) operation and then a Feed-Forwarding (FF) operation:

(MP): h̃_u^{(l-1)} = ∑_{v ∈ N_u ∪ {u}} a_G(u, v) · h_v^{(l-1)},  (FF): h_u^{(l)} = ψ^{(l)}(h̃_u^{(l-1)}),  (1)

where N_u is the set of neighboring nodes centered at u, a_G(u, v) is the affinity function dependent on the graph structure G, ψ^{(l)} denotes a feature transformation mapping at the l-th layer, and h_u^{(0)} = x_u is the initial node feature. For example, in the Graph Convolution Network (GCN) (Kipf & Welling, 2017), a_G(u, v) = A_{uv} / √(d̃_u d̃_v), where d̃_u denotes the degree of node u (with self-loop), and ψ^{(l)} is a fully-connected layer with non-linearity. For an L-layer GNN, the prediction is given by ŷ_u = ψ^{(L)}(h_u^{(L-1)}), where ψ^{(L)} is often set as a linear transformation for regression tasks or combined with Softmax for classification tasks. Note that GNN models in the form of Eq. 1 degrade to an MLP with a series of FF layers after removing all the MP operations:

ŷ_u = ψ^{(L)}(⋯(ψ^{(1)}(x_u))⋯) = ψ(x_u).  (2)

Typical Types of GNN Architectures. Besides GCN, many other mainstream GNN models can be written in the architectural form defined by Eq. 1, whose layer-wise updating rule involves MP and FF operations, e.g., GAT (Veličković et al., 2017) and GraphSAINT (Zeng et al., 2019). Some recently proposed node-level Transformer models such as NodeFormer (Wu et al., 2022) and DIFFormer (Wu et al., 2023) also fall into this category. Furthermore, there are other types of GNN architecture, represented by SGC (Wu et al., 2019) and APPNP (Klicpera et al., 2019): the former applies multiple MP operations to the initial node features, while the latter stacks a series of MP operations at the end of the FF layers. These two classes of GNNs are also widely explored and studied; for example, SIGN (Rossi et al., 2020), S^2GC (Zhu & Koniusz, 2021) and GBP (Chen et al., 2020a) follow the SGC style, and DAGNN (Liu et al., 2020c), AP-GCN (Spinelli et al., 2020) and GPR-GNN (Chien et al., 2020) follow the APPNP style.

Bridging GNNs and MLPs: Propagational MLP. After decoupling the MP and FF operations in GNNs' layer-wise updating, we notice that the unique and critical difference between GNNs and MLPs lies in whether MP is adopted (somewhere between the input node features and the output prediction). To connect the two families, we introduce a new model class, dubbed Propagational MLP (PMLP), which in training has exactly the same architecture as a conventional MLP, namely the same feed-forward network. During the inference/testing stage, PMLP_GCN incorporates a message passing layer into each layer's feed-forwarding, PMLP_SGC adds multiple MP layers in the first layer, and PMLP_APP adds them in the last layer. For a clear head-to-head comparison, Table 6 in the appendix summarizes the architectures of these models in the training and testing stages.

Extensions of PMLP. The proposed PMLP is generic and compatible with many other GNN architectures with some slight modifications.
For example, we can extend the definition of PMLP to GNNs with residual connections such as JKNet (Xu et al., 2018b) and GCNII (Chen et al., 2020b) by removing their message passing modules in training; correspondingly, PMLP_JKNet and PMLP_GCNII become MLPs with different residual connections in training, as will be further discussed in the next section. For GNNs whose MP layers are parameterized, such as GAT (Veličković et al., 2017), one can additionally fine-tune the PMLP model using the corresponding GNN architecture on top of the pre-trained FF layers, or train the MP layers independently.
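The train-as-MLP, test-as-GNN decoupling above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: the helper names (`sym_norm_adj`, `mlp_forward`, `pmlp_gcn_forward`) are ours, and the MP step uses the GCN-style symmetric normalization a_G(u, v) = A_uv / √(d̃_u d̃_v) defined above.

```python
import numpy as np

def sym_norm_adj(A):
    """GCN-style propagation matrix with self-loops:
    P_sym = D̃^{-1/2} (A + I) D̃^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def mlp_forward(X, weights):
    """Plain MLP: feed-forward (FF) layers only; this is the model
    actually trained."""
    H = X
    for l, W in enumerate(weights):
        H = H @ W
        if l < len(weights) - 1:      # no activation on the output layer
            H = np.maximum(H, 0.0)    # ReLU
    return H

def pmlp_gcn_forward(X, A, weights):
    """PMLP_GCN inference: the SAME trained weights, but with a
    non-parametric message-passing (MP) step before each FF layer."""
    P = sym_norm_adj(A)
    H = X
    for l, W in enumerate(weights):
        H = P @ H                     # MP: aggregate over the neighborhood
        H = H @ W                     # FF: shared MLP weight
        if l < len(weights) - 1:
            H = np.maximum(H, 0.0)
    return H
```

Note that on a graph with no edges, P reduces to the identity and PMLP_GCN coincides with the MLP, matching the observation that PMLP and MLP differ only through message passing at inference time.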

3. EMPIRICAL EVALUATION

We conduct experiments on a variety of node-level prediction benchmarks. Section 3.1 shows that the proposed PMLPs can significantly outperform the original MLP though they share the same weights, and approach or even exceed their GNN counterparts. Section 3.2 shows that this phenomenon holds across different experimental settings. Section 3.3 sheds new light on some GNN-related research problems.

We consider sixteen node classification benchmarks involving different types of networks. For fair comparison, we set the layer number and hidden size to the same values for GNN, PMLP and MLP on the same dataset. We use the GCN convolution for MP layers and the ReLU activation for FF layers in all models unless otherwise stated. PMLPs use the MLP architecture for validation. For SGC, we use an MLP instead of one FF layer after linear message passing, and for APPNP, we remove the residual connection (i.e., α = 0) such that the MP layer is aligned with the other models. More details about implementation, datasets and hyperparameters are deferred to Appendix F.

We adopt the inductive learning setting as the evaluation protocol, a commonly used benchmark setting in the community, and guarantee a fair comparison between PMLP and its GNN counterpart by ensuring that the information of validation/testing nodes is not used in training for any model. Specifically, for node set V = V_tr ∪ V_te, where V_tr (resp. V_te) denotes training (resp. testing) nodes, the training process is only exposed to G_tr = {V_tr, E_tr}, where E_tr ⊂ V_tr × V_tr only contains edges between nodes in V_tr; the trained model is tested on the whole graph G for prediction on V_te.
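The inductive split described above (E_tr ⊂ V_tr × V_tr) can be constructed directly from the adjacency matrix. A minimal sketch, with a hypothetical helper name of our choosing:

```python
import numpy as np

def inductive_training_graph(A, train_mask):
    """Restrict the graph to training nodes: an edge (u, v) survives only
    if BOTH endpoints are training nodes, so no test-node information
    leaks into training (E_tr ⊂ V_tr × V_tr)."""
    keep = np.outer(train_mask, train_mask)  # True iff both endpoints are in V_tr
    return A * keep
```

At test time, the full adjacency matrix A (including edges incident to V_te) is used for inference on the whole graph G.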

3.1. MAIN RESULTS

How do PMLPs perform compared with GNNs and MLP on common benchmarks? The main results comparing the testing accuracy of MLP, PMLPs and GNNs on seven benchmark datasets are shown in Table 1. We find that, intriguingly, the three variants of PMLP consistently outperform MLP by a large margin on all the datasets, despite using the same model with the same set of trainable parameters. Moreover, PMLPs are as effective as their GNN counterparts and can even exceed GNNs in some cases. These results suggest two implications. First, the performance improvement brought by GNNs (or, more specifically, the MP operation) over MLP may not purely stem from more advanced representational power, but from generalization ability. Second, message passing can indeed contribute to better generalization of an MLP, though it currently remains unclear how it helps the MLP generalize to unseen testing data. We will later shed some light on this question via theoretical analysis in Section 4.

How do PMLPs perform on larger datasets? We next apply PMLPs to larger graphs, where extracting informative features from observed data can be harder. As the results show, PMLP still considerably outperforms MLP. Yet, differently, there is a certain gap between PMLP and GNN. We conjecture that this is because in such large graphs the relations between inputs and target labels can be more complex, which requires more expressive architectures for learning the desired node-level representations. This hypothesis is further validated by the training losses in Table 2, which show that GNNs indeed yield lower fitting error on the training data.

3.2. FURTHER DISCUSSIONS

We next conduct more experiments and comparisons to verify the consistency of the observed phenomenon across different settings of model implementation and graph properties. We also try to reveal how PMLPs work for representation learning through visualizations of the produced embeddings; the results are deferred to Appendix G.

Q1: What is the impact of model layers and hidden sizes? In Fig. 2(a) and (b), we plot the testing accuracy of GCN, PMLP_GCN and MLP w.r.t. different layer numbers and hidden sizes. The results show that the phenomenon observed in Section 3.1 consistently holds across different settings of model depth and width, which suggests that the generalization effect of the MP operation is insensitive to architectural hyperparameters. Increasing the layer number causes performance degradation for all three models, presumably because of over-fitting. We further discuss the impact of model depth (with layer numbers exceeding 100) and residual connections in the next subsection.

Q2: What is the impact of different activation functions in FF layers? As shown in Fig. 3(a), the relative performance rankings among GCN, PMLP and MLP stay consistent across four different activation functions (tanh, cos, ELU, ReLU); in some cases the performance gain of PMLP over GCN is further amplified (e.g., with the cos activation).

Q3: What is the impact of different propagation schemes in MP layers? We replace the original transition matrix P_sym = D̃^{-1/2} Ã D̃^{-1/2} used in the MP layer by other commonly used transition matrices: 1) P_no-loop = D^{-1/2} A D^{-1/2}, i.e., removing self-loops; 2) P_rw = D̃^{-1} Ã, i.e., the random walk matrix; 3) P_diff = ∑_{k=0}^{∞} (1 / (e·k!)) (D̃^{-1} Ã)^k, i.e., the heat kernel diffusion matrix. The results are presented in Fig. 3(b), where we find that the relative performance rankings of the three models remain nearly unchanged after replacing the original MP layer.
Intriguingly, the performance of GNNs degrades dramatically after removing the self-loop, while the accuracy of PMLPs stays at almost the same level. A possible reason is that the self-loop connection plays an important role in GCN's training stage for preserving enough of the centered nodes' information, but does not affect PMLP.

Q4: What is the impact of training proportion? We use random splits to control the amount of graph structure information used in training. As shown in Fig. 4 (left), the labeled portion of nodes and the amount of training edges have negligible impact on the relative performance of the three models.

Q5: What is the impact of graph sparsity? As suggested by Fig. 4 (middle), when the graph becomes sparser, the absolute performance of GCN and PMLP degrades yet their performance gap remains unchanged. This shows that the quality of the input graph indeed impacts testing performance and tends to control the performance upper bound of the models. Critically though, the generalization effect brought by PMLP is insensitive to graph completeness.

Q6: What is the impact of noisy structure? As shown in Fig. 4 (right), the performance of both PMLP and GCN tends to decrease as we gradually add random connections to the graph, whose amount is controlled by the noise ratio (defined as # noisy edges / |E|), while PMLP shows better robustness to such noise. This indicates that noisy structures negatively affect both the training and the generalization of GNNs, and that one may use PMLP to mitigate this issue.
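The four propagation schemes compared in Q3 can be computed as follows. This is our own numpy sketch (the function name is hypothetical), with the infinite heat-kernel series truncated at K terms as a numerical approximation:

```python
import math
import numpy as np

def transition_matrices(A, K=20):
    """Build P_sym, P_no-loop, P_rw and a K-term truncation of the
    heat-kernel diffusion P_diff = sum_k (1/(e*k!)) (D̃^{-1} Ã)^k."""
    n = A.shape[0]
    A_t = A + np.eye(n)                   # Ã: adjacency with self-loops
    d_t = A_t.sum(axis=1)                 # degrees with self-loops
    d = np.maximum(A.sum(axis=1), 1e-12)  # degrees w/o self-loops (guard isolated nodes)
    P_sym = np.diag(d_t ** -0.5) @ A_t @ np.diag(d_t ** -0.5)
    P_noloop = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
    P_rw = np.diag(1.0 / d_t) @ A_t       # random-walk matrix D̃^{-1} Ã
    P_diff = sum(np.linalg.matrix_power(P_rw, k) / (math.e * math.factorial(k))
                 for k in range(K))       # heat-kernel diffusion (truncated)
    return P_sym, P_noloop, P_rw, P_diff
```

Since the coefficients 1/(e·k!) sum to 1 and P_rw is row-stochastic, P_diff is itself (approximately) row-stochastic, i.e., it performs a degree-weighted average over multi-hop neighborhoods.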

3.3. OVER-SMOOTHING, MODEL DEPTH, AND HETEROPHILY

We conduct additional experiments to shed light on broader aspects of GNNs, including over-smoothing, model depth and graph heterophily. Detailed results and respective discussions are deferred to Appendix E. In a nutshell, we find the phenomenon still holds for GNNs with residual connections (i.e., JKNet and GCNII), very deep GNNs (with more than 100 layers) and heterophilic graphs. This indicates that both over-smoothing and heterophily are problems closely related to failure cases of GNN generalization, and can be mitigated by using MP layers that are more suitable for the data, or by backbone MLP models with better generalization capability, aligning with some previous studies (Battaglia et al., 2018; Cong et al., 2021).

3.4. EXTENSION TO TRANSDUCTIVE SETTINGS

Notably, the current training procedure of PMLP does not involve unlabeled nodes, but it can be extended to this scenario by combining with existing semi-supervised learning approaches (Van Engelen & Hoos, 2020), such as using label propagation (Zhu et al., 2003) to generate pseudo-labels for unlabeled nodes, or additionally using the GNN architecture for fine-tuning, which is still shown to be more efficient than training GNNs from scratch (Han et al., 2023).
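The label-propagation route mentioned above can be sketched as follows. This is an illustrative numpy sketch in the spirit of Zhu et al. (2003), not the paper's implementation; the function name and the specific update rule (damped propagation with clamping of labeled rows) are our assumptions:

```python
import numpy as np

def label_propagation(A, Y0, labeled_mask, alpha=0.9, iters=100):
    """Produce pseudo-labels for unlabeled nodes by iterating
    Y <- alpha * P Y + (1 - alpha) * Y0, clamping the labeled rows.
    Y0 is a one-hot label matrix with zero rows for unlabeled nodes."""
    n = A.shape[0]
    A_t = A + np.eye(n)                       # add self-loops
    P = A_t / A_t.sum(axis=1, keepdims=True)  # row-normalized propagation
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * (P @ Y) + (1 - alpha) * Y0
        Y[labeled_mask] = Y0[labeled_mask]    # clamp known labels
    return Y.argmax(axis=1)                   # pseudo-label = argmax class
```

The resulting pseudo-labels for unlabeled nodes could then be appended to the supervised set used to train the MLP backbone of PMLP.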

4. THEORETICAL INSIGHTS ON GNN GENERALIZATION

Towards theoretically answering why "GNNs are inherently good generalizers" and explaining the superior generalization performance of PMLP and GNN, we next compare MLP, PMLP and GNN from the Neural Tangent Kernel (NTK) perspective, derive the formula for computing Graph Neural Tangent Kernel (GNTK) in node regression, and then use the results to examine their extrapolation behaviors, i.e., generalization under distribution shifts. Note that our analysis focuses on the model architecture used in inference, and thus the results presented in Sec. 4.2 are applicable for both GNN and PMLP in both inductive and transductive settings.

4.1. NTK PERSPECTIVE ON MLP, PMLP AND GNN

Linearization of Neural Networks. For a neural network f(x; θ): X → R with initial parameters θ_0 and a fixed input sample x, performing a first-order Taylor expansion around θ_0 yields the linearized form of the NN (Lee et al., 2019):

f_lin(x; θ) = f(x; θ_0) + ∇_θ f(x; θ_0)^⊤ (θ − θ_0),  (3)

where the gradient ∇_θ f(x; θ_0) can be thought of as a feature map ϕ(x): X → R^{|θ|}, depending on the specific initialization. When θ_0 is initialized from a Gaussian distribution with certain scaling and the network width tends to infinity (i.e., m → ∞, where m denotes the layer width), the feature map becomes constant and is determined by the model architecture (e.g., MLP, GNN and CNN), inducing a kernel called the NTK (Jacot et al., 2018):

NTK(x_i, x_j) = ϕ_ntk(x_i)^⊤ ϕ_ntk(x_j) = ⟨∇_θ f(x_i; θ), ∇_θ f(x_j; θ)⟩.  (4)

Kernel Regression with NTK. Recent works (Liu et al., 2020b;a) show that the spectral norm of the Hessian matrix in the Taylor series tends to zero with increasing width at rate Θ(1/√m), and hence the linearization becomes almost exact. Therefore, training an over-parameterized NN using gradient descent with infinitesimal step size is equivalent to kernel regression with the NTK (Arora et al., 2019b):

f(x; w) = w^⊤ ϕ_ntk(x),  L(w) = (1/2) ∑_{i=1}^{n} (y_i − w^⊤ ϕ_ntk(x_i))^2.  (5)

We next show the equivalence of MLP and PMLP in training at the infinite-width limit (NTK regime).

Proposition 1. An MLP and its corresponding PMLP have the same minimum-RKHS-norm NTK kernel regression solution, which differs from that of the GNN, i.e., w*_mlp = w*_pmlp ≠ w*_gnn.

Implications. From the NTK perspective, stacking additional message passing layers in the testing phase transforms the fixed feature map from that of MLP, ϕ_mlp(x), to that of GNN, ϕ_gnn(x), while keeping w fixed.
Given w*_mlp = w*_pmlp, the superior generalization performance of PMLP (i.e., the key factor behind the performance surge from MLP to GNN) can be explained by this transformation of the feature map from ϕ_mlp(x) to ϕ_gnn(x) in testing:

f_mlp(x) = w*_mlp^⊤ ϕ_mlp(x),  f_pmlp(x) = w*_mlp^⊤ ϕ_gnn(x),  f_gnn(x) = w*_gnn^⊤ ϕ_gnn(x).  (6)

This perspective simplifies the subsequent theoretical analysis by focusing on the difference of feature maps, which is determined by the network architecture used in inference and is empirically suggested to be the key factor behind the superior generalization performance of GNN and PMLP. To take a step further, we next derive the formula for computing the NTK feature map of PMLP and GNN (i.e., ϕ_gnn(x)) in node regression tasks.

GNTK in Node Regression. Following previous works that analyze shallow and wide NNs (Arora et al., 2019a; Chizat & Bach, 2020; Xu et al., 2021), we focus on a two-layer GNN using average aggregation with self-connection. By extending the original definition of the GNTK (Du et al., 2019) from graph-level regression to node-level regression, as specified in Appendix C, we obtain the following explicit form of ϕ_gnn(x) for a two-layer GNN.

Lemma 2. The explicit form of the GNTK feature map for a two-layer GNN with average aggregation and ReLU activation in node regression is

ϕ_gnn(x_i) = c ∑_{j ∈ N_i ∪ {i}} ( X^⊤ a_j · I^+(w^{(k)⊤} X^⊤ a_j),  w^{(k)⊤} X^⊤ a_j · I^+(w^{(k)⊤} X^⊤ a_j),  … ),

where c = O(d̃^{-1}) is a constant proportional to the inverse of the node degree, X ∈ R^{n×d} is the node feature matrix, a_i ∈ R^n denotes the adjacency vector of node i, w^{(k)} ∼ N(0, I_d) is a random Gaussian vector in R^d, the two components in the brackets repeat infinitely many times with k ranging from 1 to ∞, and I^+ is an indicator function that outputs 1 if the input is positive and 0 otherwise.
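For intuition, the feature-map view of Eq. 4 can be made concrete at finite width. For a two-layer ReLU network f(x) = (1/√m) a^⊤ ReLU(Wx), the empirical NTK feature map is simply the parameter gradient, which has a closed form. The sketch below is our own (hypothetical helper names), not the paper's code; the infinite-width kernel analyzed above is the m → ∞ limit of this construction, and the GNTK of Lemma 2 applies the same kind of map to the aggregated input X^⊤ a_j instead of x:

```python
import numpy as np

def ntk_feature_map(x, W, a):
    """Empirical NTK feature map phi(x) = grad_theta f(x; theta_0) for a
    two-layer ReLU net f(x) = (1/sqrt(m)) * a^T relu(W x), in closed form
    (gradients w.r.t. W and a, flattened into one vector)."""
    m = W.shape[0]
    pre = W @ x                                  # pre-activations w_k . x
    act = np.maximum(pre, 0.0)                   # relu(W x)
    ind = (pre > 0).astype(float)                # indicator I^+(w_k . x)
    grad_W = (a * ind)[:, None] * x[None, :] / np.sqrt(m)  # df/dW
    grad_a = act / np.sqrt(m)                    # df/da
    return np.concatenate([grad_W.ravel(), grad_a])

def ntk(x1, x2, W, a):
    """Empirical NTK(x1, x2) = <phi(x1), phi(x2)>, cf. Eq. 4."""
    return ntk_feature_map(x1, W, a) @ ntk_feature_map(x2, W, a)
```

Being an inner product of feature maps, this kernel is symmetric and positive semi-definite by construction, which is what licenses the kernel-regression view in Eq. 5.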

4.2. MLP V.S. PMLP IN EXTRAPOLATION

As indicated by Proposition 1, the fundamental difference between MLP and PMLP at the infinite-width limit stems from the difference of feature maps in the testing phase. This reduces the problem of explaining the success of GNNs to the question of why this change matters for generalizability.

Extrapolation Behavior of MLP. One important aspect of generalization analysis concerns a model's behavior when confronted with OoD testing samples (i.e., testing nodes that lie considerably outside the training support), a.k.a. extrapolation analysis. A previous study in this direction (Xu et al., 2021) reveals that a standard MLP with ReLU activation quickly converges to a linear function as the testing sample escapes the training support, which is formalized by the following theorem.

Theorem 3 (Xu et al., 2021). Suppose f_mlp(x) is an infinitely-wide two-layer MLP with ReLU activation trained by squared loss. For any direction v ∈ R^d and step size ∆t > 0, let x_0 = tv; then

| (f_mlp(x_0 + ∆t·v) − f_mlp(x_0)) / (∆t · c_v) − 1 | = O(1/t),  (8)

where c_v is a constant linear coefficient. That is, as t → ∞, f_mlp converges to a linear function. The intuition behind this phenomenon is that ReLU MLPs learn piece-wise linear functions with finitely many linear regions, and thus eventually become linear outside the training data support. A naturally arising question is then: how does PMLP compare with MLP regarding extrapolation?

Extrapolation Behavior of PMLP. Based on the explicit formula for ϕ_gnn(x) in Lemma 2, we extend the extrapolation analysis from MLP to PMLP. Our first finding is that, as the testing node feature moves increasingly outside the range of the training data, PMLP (as well as GNN) with average aggregation, like MLP, also converges to a linear function, yet the corresponding linear coefficient reflects its ego-graph property rather than being a fixed constant.

Theorem 4.
Suppose f_pmlp(x) is an infinitely-wide two-layer MLP with ReLU activation trained using squared loss, which adds an average message passing layer before each feed-forward layer in the testing phase. For any direction v ∈ R^d and step size ∆t > 0, let x_0 = tv; then as t → ∞, we have

(f_pmlp(x_0 + ∆t·v) − f_pmlp(x_0)) / ∆t → c_v ∑_{i ∈ N_0 ∪ {0}} (d̃ · d̃_i)^{−1},  (9)

where c_v is the same constant as in Theorem 3, d̃_0 = d̃ is the node degree (with self-connection) of x_0, and d̃_i is the degree of its i-th neighbor.

Remark. This result also applies to an infinitely-wide two-layer GNN with ReLU activation and average aggregation in the node regression setting, except that the constant c_v differs from that of MLP.

Convergence Comparison. Though both MLP and PMLP tend to linearize on outlier testing samples, indicating a common difficulty in extrapolating non-linear functions, we find that PMLP in general has more freedom to deviate from the convergent linear coefficient, implying a smoother transition from the in-distribution (non-linear) to the out-of-distribution (linear) regime, and thus potentially better generalization to out-of-distribution samples near the range of the training data.

Theorem 5. Suppose all node features are normalized, and the cosine similarity of node x_i and the average of its neighbors is denoted by α_i ∈ [0, 1]. Then the convergence rate for f_pmlp(x) is

| (f_pmlp(x_0 + ∆t·v) − f_pmlp(x_0)) / (∆t · c_v ∑_{i ∈ N_0 ∪ {0}} (d̃ · d̃_i)^{−1}) − 1 | = O( (1 + (d̃_max − 1) √(1 − α_min²)) / t ),  (10)

where α_min = min{α_i}_{i ∈ N_0 ∪ {0}} ∈ [0, 1], and d̃_max ≥ 1 denotes the maximum node degree among the testing node x_0's neighbors (including itself). This result indicates that a larger node degree and feature dissimilarity imply a smoother transition and better compatibility with OoD samples. Specifically, when the testing node's degree is 1 (a connection to itself only), PMLP becomes equivalent to MLP: in Eq. 10, d̃_max = 1, and the bound degrades to that of MLP in Eq. 8.
Moreover, when all node features are equal, message passing will become meaningless. Correspondingly, αmin = 1, and the bound also degrades to that of MLP.
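As a concrete illustration (not from the paper), the convergent PMLP slope in Theorem 4 can be evaluated for a toy ego-graph; the helper `pmlp_linear_coeff` and the example degrees below are hypothetical:

```python
# Hypothetical illustration of Theorem 4: along a direction v, the PMLP output
# converges to a linear function with slope
#   c_v * sum_{i in N_0 ∪ {0}} 1 / (d * d_i),
# where all degrees include self-connections.

def pmlp_linear_coeff(c_v, neighbor_degrees):
    """neighbor_degrees: degrees (with self-loops) of the testing node's
    neighbors, excluding the testing node itself."""
    d = len(neighbor_degrees) + 1           # testing node degree with self-loop
    degrees = [d] + list(neighbor_degrees)  # include the testing node itself
    return c_v * sum(1.0 / (d * di) for di in degrees)

# Isolated testing node: PMLP reduces to MLP, and the slope is exactly c_v.
print(pmlp_linear_coeff(2.0, []))        # 2.0
# Testing node with two neighbors of degrees 3 and 4 (self-loops included):
# the slope shrinks, reflecting the ego-graph rather than a fixed constant.
print(pmlp_linear_coeff(2.0, [3, 4]))
```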

5. MORE DISCUSSIONS AND CONCLUSION

We defer more discussions on other sources of performance gap between MLP and GNN, our current limitations and outlooks to Appendix A.

Conclusion.

In this work, we bridge MLP and GNN by introducing an intermediate model class called PMLP, which is equivalent to MLP in training but shows significantly better generalization performance after adding unlearned message passing layers in testing, and can rival its GNN counterpart in most cases. This phenomenon is consistent across different datasets and experimental settings. To shed some light on this phenomenon, we show that although neither MLP nor PMLP can extrapolate non-linear functions, PMLP converges more slowly to its linear asymptote, indicating a smoother transition and better tolerance for out-of-distribution samples.

A MORE DISCUSSIONS

Other sources of performance gap between MLP and GNN / When does PMLP fail? Besides the intrinsic generalizability of GNNs revealed by the performance gain from MLP to PMLP in this work, we note that there are some other less significant but non-negligible factors that contribute to the performance gap between GNN and MLP in node prediction tasks:

• Expressiveness: While our experiments find that GNNs and PMLPs perform similarly in most cases, showing a great advantage over MLPs in generalization, in practice there still exists a certain gap in their expressiveness, which can be amplified on large datasets and cause certain degrees of performance difference. This is reflected by our experiments on three large-scale datasets. Despite that, we can see from Table 2 that the intrinsic generalizability of GNN (corresponding to ∆ mlp) is still the major source of the performance gain over MLP.

• Semi-Supervised / Transductive Learning: As our default experimental setting, inductive learning ensures that testing samples are unobserved during training and keeps the comparison among models fair. In practice, however, the ability to leverage the information of unlabeled nodes during training is a well-known advantage of GNNs (but not of PMLPs). Still, PMLP can be used in the transductive setting with the training techniques described in Sec. 3.4.

Current Limitations and Outlooks. The result in Theorem 5 provides a bound showing PMLP's better potential in OoD generalization, rather than guaranteeing its superior generalization capability. Explaining when and why PMLPs perform closely to their GNN counterparts also needs further investigation. Moreover, following most theoretical works on NTK, we consider the regression task with squared loss for analysis instead of classification. However, as evidence (Janocha & Czarnecki, 2017; Hui & Belkin, 2020) shows that squared loss can be as competitive as softmax cross-entropy loss, the insights obtained from regression tasks could also carry over to classification tasks.

B PROOF FOR PROPOSITION 1

To analyse the extrapolation behavior of PMLP and compare it with MLP, recall that, as mentioned in the main text, training an infinitely-wide neural network using gradient descent with infinitesimal step size is equivalent to solving kernel regression with the so-called NTK by minimizing the following squared loss:
$$f(x; w) = w^\top \phi_{ntk}(x), \qquad L(w) = \frac{1}{2}\sum_{i=1}^n \left(y_i - w^\top \phi_{ntk}(x_i)\right)^2. \quad (11)$$
Let us now consider an arbitrary minimizer $w^* \in \mathcal{H}$ in the NTK's reproducing kernel Hilbert space. The minimizer can be further decomposed as $w^* = \hat{w}^* + w^\perp$, where $\hat{w}^*$ lies in the linear span of the feature mappings of training data and $w^\perp$ is orthogonal to it, i.e.,
$$\hat{w}^* = \sum_{i=1}^n \lambda_i \cdot \phi_{ntk}(x_i), \qquad \langle \hat{w}^*, w^\perp \rangle_\mathcal{H} = 0. \quad (12)$$
One observation is that the loss function is unchanged after removing the orthogonal component $w^\perp$:
$$L(w^*) = \frac{1}{2}\sum_{i=1}^n \left(y_i - \Big\langle \sum_{j=1}^n \lambda_j \, \phi_{ntk}(x_j) + w^\perp, \ \phi_{ntk}(x_i) \Big\rangle_\mathcal{H} \right)^2 = \frac{1}{2}\sum_{i=1}^n \left(y_i - \Big\langle \sum_{j=1}^n \lambda_j \, \phi_{ntk}(x_j), \ \phi_{ntk}(x_i) \Big\rangle_\mathcal{H} \right)^2 = L(\hat{w}^*).$$
This indicates that $\hat{w}^*$ is also a minimizer, whose $\mathcal{H}$-norm is no larger than that of $w^*$, i.e., $\|\hat{w}^*\|_\mathcal{H} \leq \|w^*\|_\mathcal{H}$. It follows that the minimum $\mathcal{H}$-norm solution for Eq. 11 can be expressed as a linear combination of the feature mappings of training data. Therefore, solving Eq. 11 boils down to solving a linear system with coefficients $\lambda = [\lambda_i]_{i=1}^n$. As a result, the minimum $\mathcal{H}$-norm solution is $w^* = \sum_{i=1}^n (K^{-1}\mathbf{y})_i \, \phi_{ntk}(x_i)$, where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix of training data and $\mathbf{y} = [y_i]_{i=1}^n$. We see that the final solution depends only on the training data $\{x_i, y_i\}_{i=1}^n$ and the model architecture used in training (since it determines the form of $\phi_{ntk}$). It follows immediately that the min-norm NTK kernel regression solution is the same for MLP and PMLP (i.e., $w^*_{mlp} = w^*_{pmlp}$), given that they are the same model trained on the same set of data.
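The min-norm solution derived above can be sketched numerically: the coefficients solve $\lambda = K^{-1}\mathbf{y}$, and predictions use only kernel evaluations against the training set. A minimal sketch, with an RBF kernel standing in for the NTK (the kernel choice, toy data, and small ridge term are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2); a stand-in for the NTK
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_regression_fit(X, y, reg=1e-8):
    # Min-norm solution: lambda = K^{-1} y (tiny ridge for numerical stability)
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def kernel_regression_predict(X_train, lam, X_test):
    # f(x) = sum_i lambda_i * k(x, x_i): depends only on training data
    return rbf_kernel(X_test, X_train) @ lam

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0])
lam = kernel_regression_fit(X, y)
pred = kernel_regression_predict(X, lam, X)  # near-interpolation on train data
print(np.abs(pred - y).max())
```

Because the solution is a function of the training data and the training-time kernel alone, two models with identical training-time architecture (MLP and PMLP) recover identical coefficients.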
In contrast, the architecture of GNN differs from that of MLP, implying a different form of feature map; hence they have different solutions in their respective NTK kernel regression problems (i.e., $w^*_{mlp} \neq w^*_{gnn}$).

C PROOF FOR LEMMA 2

C.1 GNTK FOR GRAPH-LEVEL REGRESSION

The graph neural tangent kernel (GNTK) (Du et al., 2019) is originally defined for graph-level prediction over a pair of graphs:
$$GNTK(G_i, G_j) = \phi_{gnn}(G_i)^\top \phi_{gnn}(G_j) = \langle \nabla_\theta f(G_i; \theta), \nabla_\theta f(G_j; \theta) \rangle, \qquad L(\theta) = \frac{1}{2}\sum_{i=1}^n \left(y_i - f(G_i; \theta)\right)^2,$$
where $f(G_i; \theta)$ yields a prediction for a whole graph, such as the property of a molecule. The formula for calculating the GNTK is given as follows (where we modify the original notation for clarity and alignment with our definition of GNTK in the node-level regression setting):
$$GNTK^{(0)}(G_i, G_j)_{uu'} = \Sigma^{(0)}(G_i, G_j)_{uu'} = x_u^\top x_{u'}, \qquad x_u \in V_i, \ x_{u'} \in V_j.$$
The message passing operation in each layer corresponds to
$$\Sigma^{(\ell)}_{mp}(G_i, G_j)_{uu'} = c_u c_{u'} \sum_{v \in N_u \cup \{u\}} \sum_{v' \in N_{u'} \cup \{u'\}} \Sigma^{(\ell)}(G_i, G_j)_{vv'},$$
$$GNTK^{(\ell)}_{mp}(G_i, G_j)_{uu'} = c_u c_{u'} \sum_{v \in N_u \cup \{u\}} \sum_{v' \in N_{u'} \cup \{u'\}} GNTK^{(\ell)}(G_i, G_j)_{vv'},$$
where $c_u$ denotes a scaling factor. The feed-forward operation (from $GNTK^{(\ell-1)}_{mp}(G_i, G_j)$ to $GNTK^{(\ell)}(G_i, G_j)$) is calculated similarly to the NTK of an MLP (Jacot et al., 2018). The final output of the GNTK (without jumping knowledge) is calculated by
$$GNTK(G_i, G_j) = \sum_{u \in V_i, \, u' \in V_j} GNTK^{(L-1)}(G_i, G_j)_{uu'}.$$

C.2 GNTK FOR NODE-LEVEL REGRESSION

We next extend the above definition of GNTK to the node-level regression setting, where the model $f(x; \theta, G)$ outputs a prediction for a node, and the kernel function is defined over a pair of nodes in a single graph:
$$GNTK(x_i, x_j) = \phi_{gnn}(x_i)^\top \phi_{gnn}(x_j) = \langle \nabla_\theta f(x_i; \theta, G), \nabla_\theta f(x_j; \theta, G) \rangle, \qquad L(\theta) = \frac{1}{2}\sum_{i=1}^n \left(y_i - f(x_i; \theta, G)\right)^2.$$
Then, the explicit formula for the GNTK in node-level regression is as follows:
$$GNTK^{(0)}(x_i, x_j) = \Sigma^{(0)}(x_i, x_j) = x_i^\top x_j.$$
Without loss of generality, we consider using the random walk matrix as the implementation of message passing. Then, the message passing operation in each layer corresponds to
$$\Sigma^{(\ell)}_{mp}(x_i, x_j) = \frac{1}{(|N_i| + 1)(|N_j| + 1)} \sum_{i' \in N_i \cup \{i\}} \sum_{j' \in N_j \cup \{j\}} \Sigma^{(\ell)}(x_{i'}, x_{j'}),$$
$$GNTK^{(\ell)}_{mp}(x_i, x_j) = \frac{1}{(|N_i| + 1)(|N_j| + 1)} \sum_{i' \in N_i \cup \{i\}} \sum_{j' \in N_j \cup \{j\}} GNTK^{(\ell)}(x_{i'}, x_{j'}).$$
Moreover, the feed-forward operation in each layer corresponds to
$$\Sigma^{(\ell)}(x_i, x_j) = c \cdot \mathbb{E}_{u,v \sim \mathcal{N}(0, \Lambda^{(\ell)})}[\sigma(u)\sigma(v)], \qquad \dot\Sigma^{(\ell)}(x_i, x_j) = c \cdot \mathbb{E}_{u,v \sim \mathcal{N}(0, \Lambda^{(\ell)})}[\dot\sigma(u)\dot\sigma(v)],$$
$$GNTK^{(\ell)}(x_i, x_j) = GNTK^{(\ell-1)}_{mp}(x_i, x_j) \cdot \dot\Sigma^{(\ell)}(x_i, x_j) + \Sigma^{(\ell)}(x_i, x_j),$$
where
$$\Lambda^{(\ell)} = \begin{pmatrix} \Sigma^{(\ell-1)}_{mp}(x_i, x_i) & \Sigma^{(\ell-1)}_{mp}(x_i, x_j) \\ \Sigma^{(\ell-1)}_{mp}(x_j, x_i) & \Sigma^{(\ell-1)}_{mp}(x_j, x_j) \end{pmatrix}. \quad (24)$$
Suppose the GNN has $L$ layers and the last layer uses a linear transformation akin to MLP; the final GNTK in node-level regression is defined as
$$GNTK(x_i, x_j) = GNTK^{(L-1)}_{mp}(x_i, x_j).$$
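In matrix form, the node-level message passing step above amounts to a two-sided multiplication of the covariance matrix by the random-walk matrix. A minimal sketch on a hypothetical 3-node path graph (the helper names are our own):

```python
import numpy as np

def random_walk_matrix(A):
    # \bar{A} = D^{-1}(A + I): average over neighbors including a self-loop
    A_hat = A + np.eye(len(A))
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def mp_step(Sigma, A_bar):
    # Sigma_mp[i, j] = mean over i' in N_i∪{i}, j' in N_j∪{j} of Sigma[i', j'],
    # which is exactly (A_bar @ Sigma @ A_bar^T)[i, j]
    return A_bar @ Sigma @ A_bar.T

# Toy path graph 0-1-2 with illustrative 2-d features
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Sigma0 = X @ X.T                       # GNTK^{(0)} = Sigma^{(0)} = x_i^T x_j
A_bar = random_walk_matrix(A)
Sigma0_mp = mp_step(Sigma0, A_bar)
# Entry (0, 2): average of Sigma0 over {0,1} x {1,2}
print(Sigma0_mp[0, 2])                 # 0.75
```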

C.3 GNTK AND FEATURE MAP FOR A TWO-LAYER GNN

We next derive the explicit NTK formula for a two-layer graph neural network in the node-level regression setting. For notational convenience, we use $a_i \in \mathbb{R}^n$ to denote the (normalized) adjacency vector, i.e.,
$$(a_i)_j = \begin{cases} 1/(|N_i| + 1) & \text{if } (i, j) \in E, \\ 0 & \text{if } (i, j) \notin E, \end{cases}$$
and $G = XX^\top \in \mathbb{R}^{n \times n}$ to denote the Gram matrix of all nodes. Then we have

(First message passing layer)
$$GNTK^{(0)}(x_i, x_j) = \Sigma^{(0)}(x_i, x_j) = x_i^\top x_j, \qquad GNTK^{(0)}_{mp}(x_i, x_j) = \Sigma^{(0)}_{mp}(x_i, x_j) = a_i^\top G a_j.$$

(First feed-forward layer)
$$\Sigma^{(1)}(x_i, x_j) = c \cdot \mathbb{E}_{u,v \sim \mathcal{N}(0, \Lambda^{(1)})}[\sigma(u)\sigma(v)], \qquad \dot\Sigma^{(1)}(x_i, x_j) = c \cdot \mathbb{E}_{u,v \sim \mathcal{N}(0, \Lambda^{(1)})}[\dot\sigma(u)\dot\sigma(v)],$$
$$\Lambda^{(1)} = \begin{pmatrix} a_i^\top G a_i & a_i^\top G a_j \\ a_j^\top G a_i & a_j^\top G a_j \end{pmatrix} = \begin{pmatrix} (X^\top a_i)^\top \\ (X^\top a_j)^\top \end{pmatrix} \begin{pmatrix} X^\top a_i & X^\top a_j \end{pmatrix}. \quad (28)$$
By noting that $a_i^\top G a_j = (X^\top a_i)^\top X^\top a_j$ and substituting $\sigma(k) = k \cdot \mathbb{I}_+(k)$ and $\dot\sigma(k) = \mathbb{I}_+(k)$, where $\mathbb{I}_+(k)$ is an indicator function that outputs 1 if $k$ is positive and 0 otherwise, we have the following equivalent form for the covariance:
$$\Sigma^{(1)}(x_i, x_j) = c \cdot \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[ w^\top X^\top a_i \cdot \mathbb{I}_+(w^\top X^\top a_i) \cdot w^\top X^\top a_j \cdot \mathbb{I}_+(w^\top X^\top a_j) \right],$$
$$\dot\Sigma^{(1)}(x_i, x_j) = c \cdot \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[ \mathbb{I}_+(w^\top X^\top a_i) \cdot \mathbb{I}_+(w^\top X^\top a_j) \right]. \quad (29)$$
Hence, we have
$$GNTK^{(1)}(x_i, x_j) = GNTK^{(0)}_{mp}(x_i, x_j) \cdot \dot\Sigma^{(1)}(x_i, x_j) + \Sigma^{(1)}(x_i, x_j)$$
$$= c \cdot \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[ a_i^\top G a_j \cdot \mathbb{I}_+(w^\top X^\top a_i) \cdot \mathbb{I}_+(w^\top X^\top a_j) \right] + c \cdot \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[ w^\top X^\top a_i \cdot \mathbb{I}_+(w^\top X^\top a_i) \cdot w^\top X^\top a_j \cdot \mathbb{I}_+(w^\top X^\top a_j) \right]. \quad (30)$$

(Second message passing layer / the last layer) Since the GNN uses a linear transformation on top of the (second) message passing layer for output, the neural tangent kernel for a two-layer GNN is given by
$$GNTK(x_i, x_j) = GNTK^{(1)}_{mp}(x_i, x_j) = \frac{1}{(|N_i| + 1)(|N_j| + 1)} \sum_{i' \in N_i \cup \{i\}} \sum_{j' \in N_j \cup \{j\}} GNTK^{(1)}(x_{i'}, x_{j'})$$
$$= \left\langle \left[\phi^{(1)}(x_1), \cdots, \phi^{(1)}(x_n)\right]^\top a_i, \ \left[\phi^{(1)}(x_1), \cdots, \phi^{(1)}(x_n)\right]^\top a_j \right\rangle_\mathcal{H}, \quad (31)$$
where $K^{(1)} \in \mathbb{R}^{n \times n}$ and $\phi^{(1)}: \mathbb{R}^d \to \mathcal{H}$ are the kernel matrix and feature map induced by $GNTK^{(1)}$. By Eq. 31, the final feature map $\phi_{gnn}(x)$ is
$$\phi_{gnn}(x_i) = \left[\phi^{(1)}(x_1), \cdots, \phi^{(1)}(x_n)\right]^\top a_i. \quad (32)$$
Also, notice that in Eq. 30,
$$a_i^\top G a_j \cdot \mathbb{I}_+(w^\top X^\top a_i) \cdot \mathbb{I}_+(w^\top X^\top a_j) = \phi_i^\top \phi_j, \qquad w^\top X^\top a_i \cdot \mathbb{I}_+(w^\top X^\top a_i) \cdot w^\top X^\top a_j \cdot \mathbb{I}_+(w^\top X^\top a_j) = (w^\top \phi_i)^\top w^\top \phi_j,$$
where $\phi_i = X^\top a_i \cdot \mathbb{I}_+(w^\top X^\top a_i)$. (33)
Then, the feature map $\phi^{(1)}$ can be written as
$$\phi^{(1)}(x_i) = c' \cdot \Big[ X^\top a_i \cdot \mathbb{I}_+\big(w^{(1)\top} X^\top a_i\big), \ w^{(1)\top} X^\top a_i \cdot \mathbb{I}_+\big(w^{(1)\top} X^\top a_i\big), \ X^\top a_i \cdot \mathbb{I}_+\big(w^{(2)\top} X^\top a_i\big), \ w^{(2)\top} X^\top a_i \cdot \mathbb{I}_+\big(w^{(2)\top} X^\top a_i\big), \ \cdots \Big], \quad (34)$$
where $w^{(k)} \sim \mathcal{N}(0, I_d)$ is a random Gaussian vector in $\mathbb{R}^d$, with the superscript $(k)$ denoting the $k$-th sample among infinitely many i.i.d. sampled ones, and $c'$ is a constant. We write Eq. 34 in short as
$$\phi^{(1)}(x_i) = c' \cdot \Big[ \cdots, \ X^\top a_i \cdot \mathbb{I}_+\big(w^{(k)\top} X^\top a_i\big), \ w^{(k)\top} X^\top a_i \cdot \mathbb{I}_+\big(w^{(k)\top} X^\top a_i\big), \ \cdots \Big]. \quad (35)$$
Finally, substituting Eq. 35 into Eq. 32 completes the proof.
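The infinite feature map in Eq. 35 can be approximated with finitely many sampled Gaussian vectors, and the resulting inner product can be checked against the standard closed-form expectations for ReLU (the arc-cosine kernel formulas of Cho & Saul). A Monte Carlo sketch under illustrative assumptions (a toy graph, $1/\sqrt{m}$ scaling, and $c = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi1_features(X, a_i, W):
    # Finite-sample version of Eq. 35: for each sampled w^{(k)} (rows of W),
    # emit the pair (X^T a_i * 1[w^T X^T a_i > 0], w^T X^T a_i * 1[...]).
    h = X.T @ a_i                      # aggregated feature X^T a_i, shape (d,)
    z = W @ h                          # w^{(k)T} X^T a_i for all k, shape (m,)
    ind = (z > 0).astype(float)
    feats = np.concatenate([np.outer(ind, h).ravel(), z * ind])
    return feats / np.sqrt(len(W))     # c' ~ 1/sqrt(m) normalization

# Hypothetical 3-node path graph 0-1-2, random node features
X = rng.normal(size=(3, 4))
a0 = np.array([0.5, 0.5, 0.0])         # random-walk row for node 0
a2 = np.array([0.0, 0.5, 0.5])         # random-walk row for node 2
W = rng.normal(size=(100000, 4))
k_mc = phi1_features(X, a0, W) @ phi1_features(X, a2, W)

def exact_gntk1(h, g):
    # GNTK^{(1)} = (h.g) * Sigma_dot + Sigma, using the closed forms
    # E[1_+ 1_+] = (pi - theta)/(2 pi) and the order-1 arc-cosine kernel
    nh, ng = np.linalg.norm(h), np.linalg.norm(g)
    theta = np.arccos(np.clip(h @ g / (nh * ng), -1, 1))
    sigma = nh * ng * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    sigma_dot = (np.pi - theta) / (2 * np.pi)
    return (h @ g) * sigma_dot + sigma

k_exact = exact_gntk1(X.T @ a0, X.T @ a2)
print(abs(k_mc - k_exact))             # small Monte Carlo error
```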

D PROOF FOR THEOREM 4 AND THEOREM 5

To analyse the extrapolation behavior of PMLP along a certain direction $v$ in the testing phase and compare it to MLP, we consider a newly arrived testing node $x_0 = tv$, whose degree (with self-connection) is $\tilde d$ and whose corresponding adjacency vector is $a \in \mathbb{R}^{n+1}$, where $(a)_i = 1/\tilde d$ if $(i, 0) \in E$ and 0 otherwise. Following (Xu et al., 2021; Bietti & Mairal, 2019), we consider a constant bias term and denote the data $x$ augmented with this term as $\bar{x} = [x \,|\, 1]$. Then, the asymptotic behavior of $f(\cdot)$ at large distances from the training data range can be characterized by the change of the network output over a fixed-length step $\Delta t \cdot v$ along the direction $v$, which in the NTK regime is given by
$$\frac{1}{\Delta t}\left(f_{mlp}(\bar{x}) - f_{mlp}(\bar{x}_0)\right) = \frac{1}{\Delta t} w^{*\top}_{mlp} \left(\phi_{mlp}(\bar{x}) - \phi_{mlp}(\bar{x}_0)\right), \quad (36)$$
$$\frac{1}{\Delta t}\left(f_{pmlp}(\bar{x}) - f_{pmlp}(\bar{x}_0)\right) = \frac{1}{\Delta t} w^{*\top}_{mlp} \left(\phi_{gnn}(\bar{x}) - \phi_{gnn}(\bar{x}_0)\right), \quad (37)$$
where $x = x_0 + \Delta t \cdot v = (t + \Delta t)v$, $\bar{x} = [x \,|\, 1]$ and $\bar{x}_0 = [x_0 \,|\, 1]$. As our interest is in how PMLP extrapolates, we use $f(\cdot)$ to refer to $f_{pmlp}(\cdot)$ in the rest of the proof. By Lemma 2, the explicit formula for this node's feature map is
$$\phi_{gnn}(\bar{x}_0) = \left[\phi^{(1)}(\bar{x}_0), \phi^{(1)}(x_1; \bar{x}_0), \cdots, \phi^{(1)}(x_n; \bar{x}_0)\right]^\top a, \quad (38)$$
where
$$\phi^{(1)}(\bar{x}_0) = c' \cdot \left[\cdots, \ [\bar{x}_0, X]^\top a \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a\big), \ w^{(k)\top} [\bar{x}_0, X]^\top a \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a\big), \ \cdots\right], \quad (39)$$
and similarly
$$\phi^{(1)}(x_i; \bar{x}_0) = c' \cdot \left[\cdots, \ [\bar{x}_0, X]^\top a_i \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a_i\big), \ w^{(k)\top} [\bar{x}_0, X]^\top a_i \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a_i\big), \ \cdots\right], \quad (40)$$
with $w^{(k)} \sim \mathcal{N}(0, I_d)$ and $k$ going to infinity, $c'$ a constant, $\mathbb{I}_+(k)$ the indicator function that outputs 1 if $k$ is positive and 0 otherwise, and $(a_i)_1 = 1/(|N_i| + 1)$ if node $i$ is connected to the new testing node. It follows from Eq. 37 that
$$\frac{1}{\Delta t}\left(f(\bar{x}) - f(\bar{x}_0)\right) = \frac{1}{\Delta t} w^{*\top}_{mlp} \left[\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0), \ \cdots, \ \phi^{(1)}(x_n; \bar{x}) - \phi^{(1)}(x_n; \bar{x}_0)\right]^\top a$$
$$= \frac{1}{\tilde{d} \, \Delta t} w^{*\top}_{mlp} \left(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)\right) + \frac{1}{\tilde{d} \, \Delta t} \sum_{i \in N_0} w^{*\top}_{mlp} \left(\phi^{(1)}(x_i; \bar{x}) - \phi^{(1)}(x_i; \bar{x}_0)\right). \quad (41)$$
Now, let us consider $w^{*\top}_{mlp}\left(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)\right)$. Recall that $w^*_{mlp}$ lives in an infinite-dimensional Hilbert space, and $w^{(k)}$ is drawn from a Gaussian with $k$ going to infinity in Eq. 39 and Eq. 40, where each $w^{(k)}$ corresponds to certain dimensions of $w^*_{mlp}$. Let us denote the part of $w^*_{mlp}$ corresponding to the first type of component in Eq. 39 by $\beta_{w^{(k)}}$ and the part corresponding to the second type by $\gamma_{w^{(k)}}$. Consider the following rearrangement of (the first element in) $\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)$:
$$[\bar{x}, X]^\top a \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}, X]^\top a\big) - [\bar{x}_0, X]^\top a \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a\big)$$
$$= [\bar{x}, X]^\top a \left(\mathbb{I}_+\big(w^{(k)\top} [\bar{x}, X]^\top a\big) - \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a\big)\right) + \frac{1}{\tilde d}[\Delta t \, v \,|\, 0] \cdot \mathbb{I}_+\big(w^{(k)\top} [\bar{x}_0, X]^\top a\big), \quad (42)$$
where $\frac{1}{\tilde d}[\Delta t \, v \,|\, 0]$ is obtained by subtracting $[\bar{x}_0, X]^\top a$ from $[\bar{x}, X]^\top a$. Then, we can re-write $w^{*\top}_{mlp}(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0))$ in a more convenient form:
$$\frac{1}{\Delta t} w^{*\top}_{mlp}\left(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)\right) \quad (43)$$
$$= \frac{1}{\tilde d}\int \beta_w^\top [v \,|\, 0] \cdot \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) \, dP(w) \quad (44)$$
$$+ \int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big)\right) dP(w) \quad (45)$$
$$+ \frac{1}{\tilde d}\int \gamma_w \, w^\top [v \,|\, 0] \cdot \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) \, dP(w) \quad (46)$$
$$+ \int \gamma_w \, w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big)\right) dP(w). \quad (47)$$
Remark. For the other components in Eq. 41, i.e., $\frac{1}{\Delta t} w^{*\top}_{mlp}(\phi^{(1)}(x_i; \bar{x}) - \phi^{(1)}(x_i; \bar{x}_0))$ with $i \in N_0$, the corresponding expansion as in Eq. 43 is similar; it only differs by a scaling factor (since the first element in both $a$ and $a_i$, indicating whether the current node is connected to the new testing node, is non-zero) and by replacing $a$ with $a_i$. Therefore, in the following proof we focus on Eq. 43 and then generalize the result to the other components in Eq. 41.

D.1 PROOF FOR THEOREM 4 (CONVERGENCE TO A LINEAR FUNCTION)

We first analyse the convergence of Eq. 43. Specifically, for Eq. 44:
$$\frac{1}{\tilde d}\int \beta_w^\top [v \,|\, 0] \cdot \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) \, dP(w) = \frac{1}{\tilde d}\int \beta_w^\top [v \,|\, 0] \cdot \mathbb{I}_+\big(w^\top [\bar{x}_0/t, X/t]^\top a\big) \, dP(w) \to \frac{1}{\tilde d}\int \beta_w^\top [v \,|\, 0] \cdot \mathbb{I}_+\big(w^\top [v \,|\, 0]\big) \, dP(w) = \frac{c'_v}{\tilde d}, \quad \text{as } t \to \infty, \quad (48)$$
where the final result is a constant that depends on the training data, the direction $v$, and the node degree $\tilde d$. Moreover, the convergence of Eq. 45 is given by
$$\int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big)\right) dP(w)$$
$$= \int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\Big(w^\top \big[[v \,|\, \tfrac{1}{t + \Delta t}], \tfrac{X}{t + \Delta t}\big]^\top a\Big) - \mathbb{I}_+\Big(w^\top \big[[v \,|\, \tfrac{1}{t}], \tfrac{X}{t}\big]^\top a\Big)\right) dP(w)$$
$$\to \int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [v \,|\, 0]\big) - \mathbb{I}_+\big(w^\top [v \,|\, 0]\big)\right) dP(w) = 0, \quad \text{as } t \to \infty. \quad (49)$$
Similar results as in Eq. 48 and Eq. 49 also apply to the analysis of Eq. 46 and Eq. 47, respectively. Combining these results, we conclude that
$$\frac{1}{\Delta t} w^{*\top}_{mlp}\left(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)\right) \to \frac{c_v}{\tilde d}, \quad \text{as } t \to \infty. \quad (50)$$
It follows that
$$\frac{1}{\Delta t}\left(f(\bar{x}) - f(\bar{x}_0)\right) \to c_v \, \tilde{d}^{-1} \sum_{i \in N_0 \cup \{0\}} \tilde{d}_i^{-1}, \quad \text{as } t \to \infty. \quad (51)$$
In conclusion, both MLP and PMLP with ReLU activation eventually converge to a linear function along directions away from the training data. In fact, this result also holds for two-layer GNNs with weighted-sum style message passing layers, by simply replacing $w^*_{mlp}$ with $w^*_{gnn}$ in the proof. However, a remarkable difference between MLP and PMLP is that the linear coefficient for MLP is a constant $c_v$ that is fixed for a specific direction $v$ and not affected by the inter-connection between the testing node $x$ and the training data $\{(x_i, y_i)\}_{i=1}^n$. In contrast, the linear coefficient for PMLP (and GNN) also depends on the testing node's degree and the degrees of its adjacent nodes. Moreover, by Proposition 1, MLP and PMLP share the same $w^*_{mlp}$ (including $\beta_w$ and $\gamma_w$), and thus the constant $c_v$ in Eq. 51 is exactly the linear coefficient of MLP. This can also be verified by setting $x$ to be an isolated node, in which case $\tilde{d}^{-1}\sum_{i \in N_0 \cup \{0\}} \tilde{d}_i^{-1} = 1$ and PMLP is equivalent to MLP. Therefore, we can directly compare the linear coefficients of MLP and PMLP. As an immediate consequence, if the degrees of all adjacent nodes are larger than the degree of the testing node, the linear coefficient becomes smaller, and vice versa.

D.2 PROOF FOR THEOREM 5

We next analyse the convergence rates of Eq. 48 and Eq. 49 to see to what extent PMLP can deviate from the converged linear coefficient, as an indication of its tolerance to out-of-distribution samples. For Eq. 48, we have
$$\left| \frac{1}{\tilde d}\int \beta_w^\top [v \,|\, 0] \cdot \left(\mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) - \mathbb{I}_+\big(w^\top [v \,|\, 0]\big)\right) dP(w) \right| \leq \frac{c'}{\tilde d} \int \left|\mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) - \mathbb{I}_+\big(w^\top [v \,|\, 0]\big)\right| dP(w) = \frac{c'}{\tilde d} \int \left|\mathbb{I}_+\big(w^\top [[x_0 \,|\, 1], X]^\top a\big) - \mathbb{I}_+\big(w^\top [x_0 \,|\, 0]\big)\right| dP(w). \quad (52)$$
This relies on the observation that the integral of $|\mathbb{I}_+(w^\top v_1) - \mathbb{I}_+(w^\top v_2)|$ represents the volume of the non-overlapping parts of the two half-spaces orthogonal to $v_1$ and $v_2$, which grows linearly with the angle between $v_1$ and $v_2$, denoted by $\measuredangle(v_1, v_2)$. Therefore, we have
$$\frac{c'}{\tilde d} \int \left|\mathbb{I}_+\big(w^\top [[x_0 \,|\, 1], X]^\top a\big) - \mathbb{I}_+\big(w^\top [x_0 \,|\, 0]\big)\right| dP(w) = \frac{c}{\tilde d} \cdot \measuredangle\left([[x_0 \,|\, 1], X]^\top a, \ [x_0 \,|\, 0]\right). \quad (53)$$
Note that the first term in the angle can be decomposed as
$$\tilde d \cdot [[x_0 \,|\, 1], X]^\top a = [x_0 \,|\, 0] + [0 \,|\, 1] + \sum_{i \in N_0} x_i.$$
Suppose all node features are normalized; then we have
$$\measuredangle\left([[x_0 \,|\, 1], X]^\top a, \ [x_0 \,|\, 0]\right) \leq \measuredangle\left([x_0 \,|\, 0], \ [x_0 \,|\, 1]\right) + \measuredangle\Big(\bar{x}_0, \ \bar{x}_0 + \sum_{i \in N_0} x_i\Big) = \arctan\left(\frac{1}{t}\right) + \arctan\left(\frac{(\tilde d - 1)\sqrt{1 - \alpha^2}}{(\tilde d - 1)\alpha + \sqrt{t^2 + 1}}\right) = O\left(\frac{1 + (\tilde d - 1)\sqrt{1 - \alpha^2}}{t}\right), \quad (54)$$
where $\alpha$ denotes the cosine similarity between the testing node and the sum of its neighbors. The last step is obtained by noting $\arctan(x) < x$. Using the same reasoning, for Eq. 49, we have
$$\left|\int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big)\right) dP(w)\right|$$
$$\leq \int \left|\beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a\right| \cdot \left|\mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big)\right| dP(w) \quad (55)$$
$$= \int \left|\beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a\right| \cdot \left|\mathbb{I}_+\big(w^\top [[x_0 \,|\, 1], X]^\top a\big) - \mathbb{I}_+\Big(w^\top \big[[x_0 \,|\, \tfrac{t}{t + \Delta t}], \tfrac{t}{t + \Delta t} X\big]^\top a\Big)\right| dP(w)$$
$$= \left|\beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a\right| \cdot \measuredangle\left([[x_0 \,|\, 1], X]^\top a, \ \big[[x_0 \,|\, \tfrac{t}{t + \Delta t}], \tfrac{t}{t + \Delta t} X\big]^\top a\right). \quad (56)$$
Note that the second term in the angle can be re-written as
$$\big[[x_0 \,|\, \tfrac{t}{t + \Delta t}], \tfrac{t}{t + \Delta t} X\big]^\top a = \frac{t}{t + \Delta t} \cdot [[x_0 \,|\, 1], X]^\top a + \frac{\Delta t}{t + \Delta t} \cdot [[x_0 \,|\, 0], \mathbf{0}]^\top a, \quad (57)$$
and hence the angle is at most $\Delta t/(t + \Delta t)$ times that in Eq. 53. It follows that
$$\int \beta_w^\top [\bar{x}/\Delta t, X/\Delta t]^\top a \left(\mathbb{I}_+\big(w^\top [\bar{x}_0, X]^\top a\big) - \mathbb{I}_+\big(w^\top [\bar{x}, X]^\top a\big)\right) dP(w) = O\left(\frac{t + \Delta t}{\tilde d}\right) \cdot O\left(\frac{1 + (\tilde d - 1)\sqrt{1 - \alpha^2}}{t}\right) \cdot O\left(\frac{\Delta t}{t + \Delta t}\right) = O\left(\frac{1 + (\tilde d - 1)\sqrt{1 - \alpha^2}}{\tilde d \, t}\right). \quad (58)$$
For Eq. 46 and Eq. 47, similar results can be derived by bounding $w$ with standard concentration techniques. Substituting the above convergence rates into Eq. 43, and dividing by the linear coefficient in Eq. 50, the convergence rate for Eq. 43 is
$$\left| \frac{\frac{1}{\Delta t} w^{*\top}_{mlp}\left(\phi^{(1)}(\bar{x}) - \phi^{(1)}(\bar{x}_0)\right) - c_v/\tilde d}{c_v/\tilde d} \right| = O\left(\frac{1 + (\tilde d - 1)\sqrt{1 - \alpha^2}}{t}\right). \quad (59)$$
It follows that the convergence rate for Eq. 37 is $O\big(\frac{1 + (\tilde{d}_{max} - 1)\sqrt{1 - \alpha_{min}^2}}{t}\big)$, where $\tilde{d}_{max}$ denotes the maximum node degree among the testing node's neighbors (including itself) and $\alpha_{min} = \min\{\alpha_i\}_{i \in N_0 \cup \{0\}}$ (60) denotes the minimum cosine similarity over the testing node's neighbors (including itself). This completes the proof.
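As a purely illustrative aid (not from the paper), the numerator of the rate in Theorem 5, $1 + (\tilde{d}_{max} - 1)\sqrt{1 - \alpha_{min}^2}$, can be tabulated to check how the two edge cases recover the MLP rate:

```python
import math

def rate_numerator(d_max, alpha_min):
    # Numerator of the O(.) rate in Theorem 5:
    #   1 + (d_max - 1) * sqrt(1 - alpha_min^2)
    return 1.0 + (d_max - 1) * math.sqrt(1.0 - alpha_min ** 2)

# Isolated node (d_max = 1) or identical features (alpha_min = 1):
# the bound degrades to the MLP rate O(1/t), i.e., numerator 1.
print(rate_numerator(1, 0.3))   # 1.0
print(rate_numerator(5, 1.0))   # 1.0
# Larger degree and more dissimilar neighbors allow larger deviation
# from the asymptotic linear coefficient (smoother transition).
print(rate_numerator(5, 0.0))   # 5.0
```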

E OVER-SMOOTHING, DEEP GNNS, AND HETEROPHILY

Over-Smoothing and Deep GNNs. To gain more insight into the over-smoothing problem and the impact of model depth, we further investigate GNN architectures with residual connections (including GCN-style ones where residual connections are employed across FF layers, e.g., JKNet (Xu et al., 2018b) and GCNII (Chen et al., 2020b), and SGC/APPNP-style ones where the implementation of MP layers involves residual connections, e.g., APPNP with non-zero α). Table 3 and Table 4 respectively report the results for SGC/APPNP-style and GCN-style GNNs on the Cora dataset. Similar trends are also observed on other datasets, some of which are plotted in Fig. 5. As we can see, the performance of PMLPs without residual connections (i.e., PMLP GCN, PMLP SGC and PMLP APP) exhibits very similar downward trends to their GNN counterparts w.r.t. increasing layer number (from 2 to 128). This phenomenon in GNNs is commonly attributed to the over-smoothing issue, wherein node features become hard to distinguish after multiple steps of message passing. Since PMLP is immune to this problem in the training stage but still performs poorly in testing, we may conclude that over-smoothing is more of a failure mode of GNNs' generalization ability, rather than an impairment of their representational power. This is somewhat in alignment with (Cong et al., 2021), where the authors theoretically show that very deep GNNs vulnerable to over-smoothing can still achieve high training accuracy but perform poorly in terms of generalization. Our experimental results further suggest that the reason why additional residual connections (either in MP layers, e.g., ResAPPNP/ResSGC, or across FF layers, e.g., GCNII and JKNet) are empirically effective against over-smoothing is that they improve GNNs' generalization ability, according to the results of PMLP GCNII, PMLP JKNet, PMLP ResSGC and PMLP ResAPP, where model depth seems to have less impact on generalization performance.
Graph Heterophily. We further conduct experiments on six datasets with high heterophily levels (i.e., Chameleon, Squirrel, Film, Cornell, Texas, Wisconsin (Pei et al., 2020)), and the results are shown in Table 5. As we can see, PMLP achieves better performance than MLP when its GNN counterpart can outperform MLP, but is otherwise inferior to MLP. This indicates that the inherent generalization ability of a certain GNN architecture is also related to whether the message passing scheme suits the characteristics of the data, which also partially explains why training more dedicated GNN architectures such as H2GCN (Zhu et al., 2020) is helpful for improving the model's generalization performance on heterophilic graphs.

Table 3: For SGC and APPNP styles, layer number denotes the number of MP layers. The number of FF layers and hidden size are fixed as 2 and 64. 'Res' denotes using residual connections in the form $X^{(k)} = (1 - \alpha)\, \mathrm{MP}(X^{(k-1)}) + \alpha X^{(0)}$. (For GCNII in Table 4, $H^{(\ell+1)} = \sigma\big(((1 - \alpha_\ell)\, \mathrm{MP}(H^{(\ell)}) + \alpha_\ell H^{(0)})((1 - \beta_\ell) I_n + \beta_\ell W^{(\ell)})\big)$; we set $\alpha_\ell = 0.1$, $\beta_\ell = 0.5/\ell$.)
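The residual propagation rule $X^{(k)} = (1 - \alpha)\, \mathrm{MP}(X^{(k-1)}) + \alpha X^{(0)}$ can be sketched as follows (a minimal numpy sketch; the toy graph, α, and depth are illustrative assumptions). Without the residual anchor, repeated averaging drives all rows toward a common vector (over-smoothing); with it, node features stay distinguishable at large depth:

```python
import numpy as np

def residual_mp(X0, A_bar, alpha=0.1, num_steps=64):
    # X^{(k)} = (1 - alpha) * MP(X^{(k-1)}) + alpha * X^{(0)}
    # MP is one step of row-normalized neighbor averaging: A_bar @ X
    X = X0
    for _ in range(num_steps):
        X = (1 - alpha) * (A_bar @ X) + alpha * X0
    return X

# Toy 3-node path graph with self-loops, row-normalized (hypothetical example)
A_hat = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
A_bar = A_hat / A_hat.sum(axis=1, keepdims=True)
X0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

deep = residual_mp(X0, A_bar, alpha=0.1, num_steps=64)
plain = np.linalg.matrix_power(A_bar, 64) @ X0   # same depth, no residual
# Spread across rows: ~0 without residual (collapsed), non-trivial with it
print(np.ptp(plain, axis=0).max(), np.ptp(deep, axis=0).max())
```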

F IMPLEMENTATION DETAILS

We present implementation details of our experiments for reproducibility. We implement our model as well as the baselines with Python 3.7, PyTorch 1.9.0 and PyTorch Geometric 1.7.2. All parameters are initialized with the Xavier initialization procedure. We train the models with the Adam optimizer. Most experiments run on an NVIDIA 2080Ti GPU with 11GB memory, except that for large-scale datasets we use an NVIDIA 3090 with 24GB memory. Table 6 summarizes the architectures of PMLPs adopted in the training and testing stages for a clear head-to-head comparison.

F.1 DATASET DESCRIPTION

We use sixteen widely adopted node classification benchmarks involving different types of networks: three citation networks (Cora, Citeseer and Pubmed), two product co-occurrence networks (Amazon-Computer and Amazon-Photo), two co-authorship networks (Coauthor-CS and Coauthor-Computer), and three large-scale networks (OGBN-Arxiv, OGBN-Products and Flickr). For Cora, Citeseer and Pubmed, we use the splits provided in (Kipf & Welling, 2017). For Amazon-Computer, Amazon-Photo, Coauthor-CS and Coauthor-Computer, we randomly sample 20 nodes from each class as labeled nodes, 30 nodes for validation, and all other nodes for testing, following (Shchur et al., 2018). For the two OGBN datasets, we follow the original splits (Hu et al., 2020) for evaluation. For Flickr, we use a random split where the training and validation proportions are both 10%. The statistics of these datasets are summarized in Table 7.

F.2 HYPERPARAMETER SEARCH

We use the same MLP architecture (i.e., number of FF layers and size of hidden states) as the backbone for all models on the same dataset, the same GNN architecture (i.e., number of MP layers) for PMLP and its GNN counterpart, and fine-tune the hyperparameters of each model, including the dropout rate (from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}), weight decay factor (from {0, 0.0001, 0.001, 0.01, 0.1}), and learning rate (from {0.0001, 0.001, 0.01, 0.1}), using grid search. For the model architecture (i.e., layer number and size of hidden states), we fix them as reported in Table 8, instead of fine-tuning them in favor of GNN or PMLP on each dataset, which might introduce bias into their comparison. By default, we set the number of FF and MP layers to 2 and the hidden size to 64, but manually adjust them in cases where the performance of GNN is far from optimal.

G ADDITIONAL EXPERIMENTAL RESULTS

We supplement more experimental results in this section, including extensions of the results in the main text and visualizations of the internal representations of nodes learned by 2-layer MLP, GNN, and PMLP on the Cora and Citeseer datasets. As we can see from the visualizations,



Figure 1: (a) Model illustration for MLP, GNN (in GCN-style) and PMLP. (b) Learning curves for node classification on Cora that depict a typical empirical phenomenon. (c) Intrinsic generalizability of GNN, reflected by the close generalization performance of GNN and PMLP. (d) Extrapolation illustration: both MLP and PMLP linearize outside the training data support (dots denote train and test samples), while PMLP transitions more smoothly and exhibits larger tolerance for OoD testing samples. In this work, we bridge GNNs and MLPs by introducing an intermediate model class called Propagational MLPs (PMLPs). During training, PMLPs are exactly the same as a standard MLP (e.g., same architecture, training data, initialization, loss function, and optimization algorithm).
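The training/testing asymmetry of PMLP described above can be sketched as a single forward function with a switch for the non-parametric MP layers (a minimal numpy sketch with hypothetical shapes; the released code implements this in PyTorch/PyG):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pmlp_forward(X, A_bar, weights, use_mp):
    # Shared weights: trained exactly as an MLP (use_mp=False).
    # At test time (use_mp=True), a non-parametric MP layer (neighbor
    # averaging A_bar @ H) is inserted before each feed-forward layer.
    H = X
    for k, W in enumerate(weights):
        if use_mp:
            H = A_bar @ H                 # message passing, no parameters
        H = H @ W
        if k < len(weights) - 1:
            H = relu(H)
    return H

rng = np.random.default_rng(0)
n, d, h, c = 5, 8, 16, 3
X = rng.normal(size=(n, d))
A_hat = (rng.random((n, n)) < 0.4).astype(float)
A_hat = np.maximum(A_hat, A_hat.T) + np.eye(n)          # symmetric + self-loops
A_bar = A_hat / A_hat.sum(axis=1, keepdims=True)        # row-normalized
weights = [rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, c)) * 0.1]

out_mlp = pmlp_forward(X, A_bar, weights, use_mp=False)  # training-time view
out_pmlp = pmlp_forward(X, A_bar, weights, use_mp=True)  # testing-time view
print(out_mlp.shape, out_pmlp.shape)
```

Note the same `weights` serve both branches: the only difference between MLP-in-training and PMLP-in-testing is the unlearned `A_bar @ H` step.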

Figure 2: Performance variation with increasing layer number and size of hidden states. (See complete results in Appendix G).

both PMLPs and GNNs show better capability for separating nodes of different classes than MLP in the internal layer, despite the fact that PMLPs share the same set of weights as MLP. Such results might indicate that the superior classification performance of GNNs mainly stems from the effects of message passing in inference, rather than from GNNs' ability to learn better node representations.

Figure 6: Impact of graph structural information by changing data split, sparsifying the graph, adding random structural noise on Cora.

Figure 9: Performance variation with different activation functions in FF layer.

Figure 11: Visualization of node embeddings (2-D projection by t-SNE) in the internal layer for two-layer MLP, GCN and PMLP on Cora.

Figure 12: Visualization of node embeddings (2-D projection by t-SNE) in the internal layer for two-layer MLP, SGC and PMLP on Cora.

Figure 13: Visualization of node embeddings (2-D projection by t-SNE) in the internal layer for two-layer MLP, GCN and PMLP on Citeseer.

Figure 14: Visualization of node embeddings (2-D projection by t-SNE) in the internal layer for two-layer MLP, SGC and PMLP on Citeseer.

Mean and STD of testing accuracy on node-level prediction benchmark datasets.

Mean and STD of testing accuracy on three large-scale datasets.




Table 4: Layer number denotes the number of MP+FF layers, and the number of FF layers is fixed as 2. The hidden size is fixed as 64.

ResNet here denotes MLP with residual connections (or equivalently, GCNII without MP operations). For JKNet, we use concatenation for layer aggregation. MLP+JK denotes MLP with jumping knowledge (or equivalently, JKNet without MP operations).

Mean and STD of testing accuracy on datasets with high heterophily level.

Head-to-head comparison of three proposed PMLP models (PMLP GCN , PMLP SGC and PMLP AP P ) and the standard MLP.

Statistics of datasets.

Figure 7: Performance variation with increasing layer number (from 2 to 8). Layer number here denotes the number of FF and MP layers for GCN, and the number of MP layers for SGC and APPNP.

Figure 8: Performance variation with increasing size of hidden states.

funding

* The corresponding author is Junchi Yan, who is also affiliated with Shanghai AI Laboratory. The work was in part supported by the National Key Research and Development Program of China (2020AAA0107600), NSFC (62222607), and STCSM (22511105100).

annex

Furthermore, we have discussed different settings for these architecture hyperparameters and find that the performance of PMLP is consistently close to its GNN counterpart. For the other hyperparameters (i.e., learning rate, dropout rate, weight decay factor), we fine-tune them separately for each model on each dataset based on the performance on the validation set. For PMLP, we use the MLP architecture for validation rather than the GNN, since we find there is only a slight difference in performance. In that sense, all PMLPs share the same training and validation process as the vanilla MLP, making them exactly the same model before inference.

